<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Direct AI Execution</title><link>https://docs.opennebula.io/7.2/solutions/ai_factory_blueprints/direct_ai_execution/</link><description>Recent content in Direct AI Execution</description><generator>Hugo</generator><language>en</language><lastBuildDate>Tue, 28 Oct 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://docs.opennebula.io/7.2/solutions/ai_factory_blueprints/direct_ai_execution/index.xml" rel="self" type="application/rss+xml"/><item><title>Inferencing with vLLM</title><link>https://docs.opennebula.io/7.2/solutions/ai_factory_blueprints/direct_ai_execution/llm_inference_certification/</link><pubDate>Tue, 28 Oct 2025 00:00:00 +0000</pubDate><guid>https://docs.opennebula.io/7.2/solutions/ai_factory_blueprints/direct_ai_execution/llm_inference_certification/</guid><description>&lt;p&gt;The &lt;a href="https://docs.vllm.ai/en/latest/"&gt;vLLM&lt;/a&gt; Inference Framework is a production-grade, high-performance inference engine designed for large-scale LLM serving.&lt;/p&gt;
&lt;p&gt;The main characteristics of the vLLM Inference Framework are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Supports single-node deployments with one or more GPUs.&lt;/li&gt;
&lt;li&gt;Uses Python’s native multiprocessing for multi-GPU inference (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;Does not require additional frameworks, such as Ray, unless deploying across multiple nodes, which is out of scope for this benchmarking task.&lt;/li&gt;
&lt;/ul&gt;
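&lt;p&gt;As a minimal sketch of the single-node, multi-GPU case, the vLLM Python API can be driven as shown below; the model name and the two-GPU tensor-parallel setting are illustrative assumptions, not values prescribed by the appliance.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Minimal sketch: offline inference with the vLLM Python API on one node.
# Assumptions: the model name and tensor_parallel_size=2 (two local GPUs)
# are illustrative; adjust them to your appliance configuration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model
    tensor_parallel_size=2,  # one worker process per GPU via multiprocessing
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain what an inference engine does."], params)
for request_output in outputs:
    print(request_output.outputs[0].text)
&lt;/code&gt;&lt;/pre&gt;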
&lt;p&gt;In this guide you will find the necessary steps and best practices to deploy the OpenNebula vLLM appliance and run inference benchmarking to verify its performance.&lt;/p&gt;</description></item><item><title>Fine-Tuning AI Models on NVIDIA Slurm</title><link>https://docs.opennebula.io/7.2/solutions/ai_factory_blueprints/direct_ai_execution/nvidia_slurm/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://docs.opennebula.io/7.2/solutions/ai_factory_blueprints/direct_ai_execution/nvidia_slurm/</guid><description>&lt;p&gt;&lt;a id="finetuning_on_slurm_worker"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In this tutorial, we will install and configure the OpenNebula &lt;strong&gt;Slurm&lt;/strong&gt; appliance and run an example fine-tuning script.&lt;/p&gt;
&lt;p&gt;We will complete the following high-level steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Install the Slurm appliances (controller and workers) from the OpenNebula marketplace.&lt;/li&gt;
&lt;li&gt;Configure the Slurm worker template with an example fine-tuning job script.&lt;/li&gt;
&lt;li&gt;Submit a fine-tuning job from the &lt;strong&gt;Slurm controller&lt;/strong&gt; with a single command (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
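&lt;p&gt;A hedged sketch of that single submission command follows; the partition name, GPU request, and fine-tuning script path are assumptions for illustration, since the actual job script is defined in the Slurm worker template.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hedged sketch: submit a fine-tuning batch job to Slurm from the controller.
# Assumptions: the partition name "gpu", the single-GPU request, and the
# script path /opt/examples/finetune.py are illustrative placeholders.
import subprocess

job_script = """#!/bin/bash
#SBATCH --job-name=finetune-example
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --output=finetune-%j.log
srun python3 /opt/examples/finetune.py
"""

# sbatch reads the job script from stdin when no file is given,
# so this is equivalent to a single `sbatch finetune.sh` command.
result = subprocess.run(
    ["sbatch"], input=job_script, text=True,
    capture_output=True, check=True,
)
print(result.stdout.strip())  # e.g. "Submitted batch job 42"
&lt;/code&gt;&lt;/pre&gt;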
&lt;h2 id="before-starting"&gt;Before Starting&lt;/h2&gt;
&lt;p&gt;Before starting this tutorial, you must complete the AI Factory deployment using either on-premises or cloud resources. Follow whichever of the following guides matches your available resources:&lt;/p&gt;</description></item></channel></rss>