Validation of the AI Factory with LLM Inference

As industries adopt Large Language Models (LLMs), optimizing and validating their inference performance is a critical part of deployment. Efficient inference is essential to guarantee that LLMs deliver high-quality results while remaining scalable, responsive, and cost-effective.

The LLM Inference Benchmarks focus on measuring performance metrics during the model serving process rather than the quality of the generated output. Metrics assessed by this type of benchmark include:

  • Latency: how fast the model responds to a request.
  • Throughput: the number of requests the model can handle per unit of time.
  • Stability: model serving consistency under varying loads.
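As an illustration, these metrics can be derived from per-request timing data. The sketch below is not part of GuideLLM or vLLM; the function name and data layout are illustrative, assuming only a list of (start, end) timestamps per request:

```python
# Sketch (illustrative, not from any benchmarking tool): deriving
# latency, a p99 tail value, and throughput from per-request
# (start_s, end_s) timestamps in seconds.

def summarize(requests):
    """requests: list of (start_s, end_s) tuples, one per request."""
    latencies = sorted(end - start for start, end in requests)
    n = len(latencies)
    # Total wall-clock window covered by the run
    window = max(e for _, e in requests) - min(s for s, _ in requests)
    return {
        "mean_latency_s": sum(latencies) / n,
        "p99_latency_s": latencies[min(n - 1, int(0.99 * n))],
        "throughput_rps": n / window,
    }

# Four requests completing within a one-second window
stats = summarize([(0.0, 0.5), (0.1, 0.7), (0.2, 0.6), (0.5, 1.0)])
```

Stability is then assessed by repeating such runs under varying load and comparing the resulting distributions.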

In this guide you will find the necessary steps and best practices to perform LLM inference benchmarking with OpenNebula.

The vLLM Inference Framework

The vLLM Inference Framework benchmark focuses on vLLM, a production-grade, high-performance inference engine designed for large-scale LLM serving.

The main characteristics of the vLLM Inference Framework are:

  • Supports single-node deployments with one or more GPUs.
  • Uses Python’s native multiprocessing for multi-GPU inference.
  • Does not require additional frameworks, such as Ray, unless deploying across multiple nodes, which is out of scope for this benchmarking task.

Benchmark Environments

To test the vLLM appliance, the benchmark uses two similar environments, but with different GPU models:

  • Benchmark environment 1: 1x NVIDIA L40S 48GB GPU card
  • Benchmark environment 2: 1x NVIDIA H100 NVL 94GB GPU card

Hardware Specification

Front-end Requirements

Front-end

| Parameter         | Value                                                  |
|-------------------|--------------------------------------------------------|
| Number of Zones   | 1                                                      |
| Cloud Manager     | OpenNebula 7.0                                         |
| Server Specs      | Supermicro Hyper A+ server, details in the table below |
| Operating System  | Ubuntu 24.04.2 LTS                                     |
| High Availability | No (1 Front-end)                                       |
| Authorization     | Built-in                                               |

Host Requirements

Virtualization Hosts

| Parameter        | Value                                                  |
|------------------|--------------------------------------------------------|
| Number of Nodes  | 1                                                      |
| Server Specs     | Supermicro Hyper A+ server, details in the table below |
| Operating System | Ubuntu 24.04.2 LTS                                     |
| Hypervisor       | KVM                                                    |
| Special Devices  | NVIDIA GPU cards, details in the table below           |

Storage Specification

Storage

| Parameter | Value       |
|-----------|-------------|
| Type      | Local disk  |
| Capacity  | 1 Datastore |

Network Requirements

Network

| Parameter          | Value              |
|--------------------|--------------------|
| Networking         | Bridge             |
| Number of Networks | 1 network: service |

Provisioning Model

Provisioning Model

| Model          | Description                                                            |
|----------------|------------------------------------------------------------------------|
| Manual on-prem | The two servers have been manually provisioned and configured on-prem. |

Server Specifications

The server specifications are based on a two-server setup for each environment: one server runs the OpenNebula Front-end, and the other acts as the virtualization host for the VMs, with the GPU cards attached. This setup is representative of any OpenNebula deployment that includes a host server with AI-ready NVIDIA GPUs.

| Parameter         | Environment 1                                  | Environment 2                                  |
|-------------------|------------------------------------------------|------------------------------------------------|
| GPU model         | NVIDIA L40S 48GB                               | NVIDIA H100 NVL 94GB                           |
| Server model      | Supermicro A+ Server AS-2025HS-TNR             | Supermicro A+ Server AS-2025HS-TNR             |
| Architecture      | amd64 (x86_64)                                 | amd64 (x86_64)                                 |
| CPU Model         | AMD EPYC 9334 @ 2.7 GHz                        | AMD EPYC 9335 @ 3.0 GHz                        |
| CPU Vendor        | AMD                                            | AMD                                            |
| CPU Cores         | 128 (2 sockets × 32 cores, 2 threads per core) | 128 (2 sockets × 32 cores, 2 threads per core) |
| CPU Frequency     | 2.7 GHz                                        | 3.0 GHz                                        |
| NUMA Nodes        | 2 (Node0: CPUs 0-63, Node1: CPUs 64-127)       | 2 (Node0: CPUs 0-63, Node1: CPUs 64-127)       |
| L1d Cache         | 2 MiB (64 instances)                           | 3 MiB (64 instances)                           |
| L1i Cache         | 2 MiB (64 instances)                           | 2 MiB (64 instances)                           |
| L2 Cache          | 64 MiB (64 instances)                          | 64 MiB (64 instances)                          |
| L3 Cache          | 256 MiB (8 instances)                          | 256 MiB (8 instances)                          |
| BIOS Vendor       | American Megatrends Inc. (AMI)                 | American Megatrends Inc. (AMI)                 |
| BIOS Release Date | 10/07/2024                                     | 03/31/2025                                     |
| BIOS Version      | 3.0                                            | 3.5                                            |
| BIOS Firmware Rev | 5.27                                           | 5.35                                           |
| ROM Size          | 32 MB                                          | 32 MB                                          |
| Boot Mode         | UEFI, ACPI supported                           | UEFI, ACPI supported                           |
| Disks             | 1 × NVMe (SAMSUNG MZQL215THBLA-00A07, 15 TB)   | 1 × NVMe (KIOXIA KCD6XLUL15T3, 15 TB)          |
| Partitions        | /boot/efi (1G), /boot (2G), LVM root (14T)     | /boot/efi (1G), /boot (2G), LVM root (14T)     |
| Network           | 1 × Intel X710 10 GbE                          | 1 × Intel X710 10 GbE                          |
| RAM               | 1152 GB (24 × 48 GB DDR5-4800)                 | 1536 GB (24 × 64 GB DDR5-6400)                 |

Benchmarks

The certification includes two LLM architectures, Qwen and Llama, each tested at two parameter sizes.

Qwen Models:

  • Qwen/Qwen2.5-3B-Instruct
  • Qwen/Qwen2.5-14B-Instruct

Llama Models:

  • meta-llama/Llama-3.2-3B-Instruct
  • meta-llama/Llama-3.2-7B-Instruct

The benchmark process is based on GuideLLM, the native benchmarking tool provided by vLLM for optimizing and testing deployed models.

Executing the Benchmarks

Deploying the vLLM Appliance

The vLLM appliance is available through the OpenNebula Marketplace, offering a streamlined setup process suitable for both novice and experienced users.

To deploy the vLLM appliance for benchmarking, follow these steps:

  1. Download the vLLM appliance from the marketplace:

    $ onemarketapp export 'service_Vllm' vllm --datastore default
    
  2. Configure the template for GPU PCI passthrough:

    $ onetemplate update vllm
    

    In this scenario, configure the template for GPU passthrough to the VM and a specific CPU-pinning topology:

    CPU_MODEL=[
        MODEL="host-passthrough" ]
    OS=[
        FIRMWARE="/usr/share/OVMF/OVMF_CODE_4M.fd",
        MACHINE="pc-q35-noble" ]
    PCI=[
        CLASS="0302",
        DEVICE="26b9",
        VENDOR="10de" ]
    TOPOLOGY=[
        CORES="8",
        PIN_POLICY="THREAD",
        SOCKETS="2",
        THREADS="2" ]
    VCPU="32"
    CPU="32"
    MEMORY="32768"
    
  3. Instantiate the template. Keep the default attributes, changing only the LLM model through the ONEAPP_VLLM_MODEL_ID user input. Since each benchmark uses a different model, instantiate a separate VM for each model to benchmark:

    $ onetemplate instantiate service_Vllm --name vllm
    
  4. Wait until the vLLM engine has loaded the model and the application is being served. To confirm progress, access the VM via SSH and check the log at /var/log/one-appliance/vllm.log. You should see output similar to this:

    [...]
    
    (APIServer pid=2480) INFO 11-26 11:00:33 [api_server.py:1971] Starting vLLM API server 0 on http://0.0.0.0:8000
    (APIServer pid=2480) INFO 11-26 11:00:33 [launcher.py:36] Available routes are:
    (APIServer pid=2480) INFO 11-26 11:00:33 [launcher.py:44] Route: /openapi.json, Methods: HEAD, GET
    (APIServer pid=2480) INFO 11-26 11:00:33 [launcher.py:44] Route: /docs, Methods: HEAD, GET
    (APIServer pid=2480) INFO 11-26 11:00:33 [launcher.py:44] Route: /docs/oauth2-redirect, Methods: HEAD, GET
    (APIServer pid=2480) INFO 11-26 11:00:33 [launcher.py:44] Route: /redoc, Methods: HEAD, GET
    (APIServer pid=2480) INFO 11-26 11:00:33 [launcher.py:44] Route: /health, Methods: GET
    (APIServer pid=2480) INFO 11-26 11:00:33 [launcher.py:44] Route: /load, Methods: GET
    (APIServer pid=2480) INFO 11-26 11:00:33 [launcher.py:44] Route: /ping, Methods: POST
    (APIServer pid=2480) INFO 11-26 11:00:33 [launcher.py:44] Route: /ping, Methods: GET
    (APIServer pid=2480) INFO 11-26 11:00:33 [launcher.py:44] Route: /tokenize, Methods: POST
    (APIServer pid=2480) INFO 11-26 11:00:33 [launcher.py:44] Route: /detokenize, Methods: POST
    (APIServer pid=2480) INFO 11-26 11:00:33 [launcher.py:44] Route: /v1/models, Methods: GET
    
    [...]
    

    At this point, the vLLM API is available. Test it with curl, pointing to port 8000 of the VM and querying the /v1/models path:

    $ curl http://localhost:8000/v1/models | jq .
    

    You will retrieve a JSON response with the loaded models:

    {
        "object": "list",
        "data": [
            {
            "id": "Qwen/Qwen2.5-1.5B-Instruct",
            "object": "model",
            "created": 1764154926,
            "owned_by": "vllm",
            "root": "Qwen/Qwen2.5-1.5B-Instruct",
            "parent": null,
            "max_model_len": 1024,
            "permission": [
                {
                "id": "modelperm-431ea3199da545b2a5cba62dc373ab53",
                "object": "model_permission",
                "created": 1764154926,
                "allow_create_engine": false,
                "allow_sampling": true,
                "allow_logprobs": true,
                "allow_search_indices": false,
                "allow_view": true,
                "allow_fine_tuning": false,
                "organization": "*",
                "group": null,
                "is_blocking": false
                }
            ]
            }
        ]
    }
    

    Additionally, the appliance includes a webchat app for interacting with the vLLM chat API. This web application is exposed on port 5000 of the VM:

    vLLM webchat

    If these verification steps are successful, the vLLM appliance is ready to run the benchmarks.
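As a complement to the curl check above, the /v1/models response can also be verified programmatically. A minimal sketch follows; the sample payload mirrors the structure shown above, and the helper name is illustrative (in practice you would fetch the JSON from port 8000 of the VM):

```python
import json

# Sketch: extract the served model IDs from a vLLM /v1/models response.
# The sample string mirrors the response shown above; in a real check you
# would fetch it from http://<vm-ip>:8000/v1/models instead.
sample = """{"object": "list",
             "data": [{"id": "Qwen/Qwen2.5-1.5B-Instruct",
                       "object": "model",
                       "max_model_len": 1024}]}"""

def served_models(payload):
    """Return (model_id, max_model_len) pairs from a /v1/models response."""
    body = json.loads(payload)
    return [(m["id"], m.get("max_model_len")) for m in body.get("data", [])]

models = served_models(sample)
```

A check like this can be scripted to confirm that the expected ONEAPP_VLLM_MODEL_ID is actually being served before launching a benchmark.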

Running the Benchmark Scripts

Within the /root directory of the vLLM appliance you will find benchmark.sh, which executes the GuideLLM CLI. This script automatically detects environment parameters, launches the benchmark with GuideLLM, and displays live progress and results on the CLI. Specifically, the benchmark follows these steps:

  • To test performance and stability, the script sends hundreds of requests in parallel.
  • To run the benchmark, the script uses automatically-generated synthetic data with these values:
    • Input prompt: average 511 tokens.
    • Output: average 255 tokens.
    • Total samples: 999.
  • GuideLLM identifies the throughput that the inference can handle.
  • Once the throughput is identified, 9 additional runs are performed at a fixed requests-per-second rate (below the identified throughput) to determine stability and final results.
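The final step above amounts to a simple calculation: pick a constant request rate at a safety margin below the discovered maximum throughput and repeat the run. The sketch below illustrates this; the 0.9 margin and the function name are illustrative, not GuideLLM settings:

```python
# Sketch of the stability phase: run at a fixed request rate below the
# maximum throughput found by the sweep. Margin and helper name are
# illustrative defaults, not GuideLLM parameters.

def stability_plan(max_throughput_rps, margin=0.9, runs=9):
    """Fixed request rate and run count for the stability phase."""
    return {"rate_rps": round(max_throughput_rps * margin, 2), "runs": runs}

# e.g. a sweep that found a peak of 6.4 req/s
plan = stability_plan(6.4)
```

Running below the saturation point keeps queueing effects out of the latency measurements, so the repeated runs reflect steady-state behavior.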

To run the benchmark, follow this procedure:

  1. Connect to the vLLM appliance through ssh:

    $ onevm ssh vllm
    
  2. Execute the benchmark script:

    root@vllm$ ./benchmark.sh
    

    Once the benchmark is running, you will see output similar to this: vLLM benchmark

    More parameters are available for the benchmark, such as warmups, number of steps, and seconds per step. These have fixed default values but can be adjusted manually if needed.

  3. Once finished, the process prints the results to the terminal and generates an HTML report with all the information in the /root/benchmark_results directory:

    vLLM benchmark results

Benchmark results

The following key performance metrics have been tested for each model:

| Metric                            | Description                                                                              | Unit  |
|-----------------------------------|------------------------------------------------------------------------------------------|-------|
| Request rate (throughput)         | Number of requests processed per second.                                                 | req/s |
| Time to first token (TTFT)        | Time elapsed before the first token is generated.                                        | ms    |
| Inter-token latency (ITL or TPOT) | Average time between consecutive tokens during generation.                               | ms    |
| Latency                           | Time to process individual requests. Low latency is essential for interactive use cases. | ms    |
| Throughput                        | Number of requests handled per second. High throughput indicates good scalability.       | req/s |

Different application types have distinct performance requirements. The following GuideLLM reference SLOs provide general benchmarks for evaluating inference quality (times for 99% of requests):

| Use Case                       | Req. Latency (ms) | TTFT (ms) | ITL (ms) |
|--------------------------------|-------------------|-----------|----------|
| Chat Applications              | -                 | ≤ 200     | ≤ 50     |
| Retrieval-Augmented Generation | -                 | ≤ 300     | ≤ 100    |
| Agentic AI                     | ≤ 5000            | -         | -        |
| Content Generation             | -                 | ≤ 600     | ≤ 200    |
| Code Generation                | -                 | ≤ 500     | ≤ 150    |
| Code Completion                | ≤ 2000            | -         | -        |
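Measured p99 values can be checked against these SLOs mechanically. The sketch below encodes the thresholds from the table; the dictionary and function names are illustrative, not part of GuideLLM:

```python
# Sketch: compare measured p99 metrics against the GuideLLM reference
# SLOs above. Thresholds in ms; None means the table sets no limit.
SLOS = {
    "chat":            {"latency": None, "ttft": 200, "itl": 50},
    "rag":             {"latency": None, "ttft": 300, "itl": 100},
    "agentic":         {"latency": 5000, "ttft": None, "itl": None},
    "content":         {"latency": None, "ttft": 600, "itl": 200},
    "code_generation": {"latency": None, "ttft": 500, "itl": 150},
    "code_completion": {"latency": 2000, "ttft": None, "itl": None},
}

def meets_slo(use_case, measured):
    """measured: p99 values in ms, keyed like the SLO entries."""
    limits = SLOS[use_case]
    return all(measured.get(k, float("inf")) <= v
               for k, v in limits.items() if v is not None)

# e.g. p99 TTFT 115 ms and p99 ITL 8.2 ms (Qwen2.5-3B on the H100 environment)
ok = meets_slo("chat", {"ttft": 115, "itl": 8.2})
```

A missing metric is treated as infinite, so an incomplete measurement never passes an SLO by accident.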

The following table contains the results of the benchmark for each model:

| Model                            | vCPUs | RAM   | GPU      | Throughput (req/s) | TTFT (ms) | ITL (ms) | TPOT (ms) | p99 TTFT (ms) | p99 ITL (ms) | p99 TPOT (ms) |
|----------------------------------|-------|-------|----------|--------------------|-----------|----------|-----------|---------------|--------------|---------------|
| Qwen/Qwen2.5-3B-Instruct         | 32    | 32 GB | L40S     | 6.4                | 61.8      | 15.2     | 15.2      | 170           | 15.4         | 15.3          |
| Qwen/Qwen2.5-3B-Instruct         | 32    | 32 GB | H100 NVL | 34                 | 34        | 8.3      | 8.3       | 115           | 8.2          | 8.2           |
| Qwen/Qwen2.5-14B-Instruct        | 32    | 32 GB | L40S     | -                  | -         | -        | -         | -             | -            | -             |
| Qwen/Qwen2.5-14B-Instruct        | 32    | 32 GB | H100 NVL | 20                 | 7000      | 0        | 0         | 7000          | 0            | 0             |
| meta-llama/Llama-3.2-3B-Instruct | 32    | 32 GB | L40S     | 3.3                | 59.7      | 13       | 12.9      | 87            | 13.1         | 13.1          |
| meta-llama/Llama-3.2-3B-Instruct | 32    | 32 GB | H100 NVL | 30                 | 56        | 12       | 11.9      | 331           | 12.4         | 12.3          |
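Request throughput can be converted to an approximate generated-token throughput using the synthetic workload's average output length (255 tokens). A sketch, assuming tokens/s ≈ req/s × average output tokens:

```python
# Sketch: approximate generated tokens/s from request throughput, using
# the average output length of the synthetic benchmark dataset.
AVG_OUTPUT_TOKENS = 255

def token_throughput(req_per_s):
    """Approximate generated tokens per second."""
    return req_per_s * AVG_OUTPUT_TOKENS

# e.g. the Qwen2.5-3B result on the L40S environment (6.4 req/s)
tok_s = token_throughput(6.4)
```

This gives a rough figure for comparing GPU generations on a tokens-per-second basis rather than a requests-per-second basis.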

OpenNebula publishes these results, obtained in controlled environments with specific hardware and models. They can later be used to compare future results, assess deployments, and evaluate performance against known baselines.