Deployment of NVIDIA KAI Scheduler
Important
To deploy the NVIDIA KAI Scheduler, you must first follow the procedure outlined in Deployment of AI-Ready Kubernetes to create an AI-Ready Kubernetes cluster capable of running GPU workloads.

NVIDIA® KAI Scheduler is an open-source, Kubernetes-native scheduler designed to optimize GPU resource allocation for AI and machine learning workloads at scale. It can manage large GPU clusters and handle high-throughput, demanding workload environments. KAI Scheduler targets both interactive jobs and large-scale training or inference tasks within the same cluster, orchestrating the available resources across different users and teams. It also operates alongside other schedulers installed in the cluster.
Some of the key features are:
- Share single or multiple GPU resources among multiple workloads to improve resource allocation.
- Batch scheduling of different types of workloads, including gang scheduling (i.e. all pods are scheduled together or none are scheduled until all resources are available).
- Effective workload priority with hierarchical queues.
- Resource distribution with custom quotas, limits, priorities and fairness policies.
- Elastic workloads with dynamic workload scaling.
- Compatibility with other autoscalers like Karpenter.
As the KAI Scheduler operates on top of Kubernetes, users benefit from its robust orchestration, scalability, and resource management capabilities, further enhancing deployment flexibility and reliability. Running both KAI Scheduler and Kubernetes on OpenNebula, this solution extends resource management to hybrid and multi-cloud environments, enabling dynamic and cost-effective scaling of AI workloads while maintaining control over infrastructure. This synergy leverages the strengths of all three technologies to deliver efficient, fair, and portable GPU scheduling from on-premises data centers to the cloud.
In this guide you will learn how to perform a validation of NVIDIA KAI Scheduler in an AI-Ready Kubernetes cluster. You will find details about the installation and how to efficiently share GPU resources among different workloads.
NVIDIA KAI Scheduler Installation
To install the NVIDIA KAI Scheduler, you need to meet the following prerequisites:
- An AI-Ready Kubernetes Cluster with the NVIDIA GPU Operator installed.
- Helm CLI. For additional details, check the installation instructions in the official documentation. A quick way to verify both prerequisites is shown below.
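Before proceeding, you can quickly confirm both prerequisites from the command line. This is a minimal sketch; the `gpu-operator` namespace name is an assumption and may differ in your installation:

```
# Check that Helm is available
helm version

# Check that the NVIDIA GPU Operator pods are running
# (the namespace name is an assumption; adjust it to your installation)
kubectl get pods -n gpu-operator
```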
Create a dedicated namespace for the KAI Scheduler components in the Kubernetes cluster:
```
kubectl create namespace kai-scheduler
```

Install the KAI Scheduler through Helm. To do so, check the available release versions in the KAI Scheduler repository and install the latest version via Helm. If you want to use the GPU resource sharing feature, set the `global.gpuSharing=true` flag:

```
helm install kai-scheduler \
  https://github.com/NVIDIA/KAI-Scheduler/releases/download/v0.9.8/kai-scheduler-v0.9.8.tgz \
  -n kai-scheduler --create-namespace \
  --set "global.gpuSharing=true" \
  --set "global.resourceReservation.runtimeClassName=nvidia"
```

Verify that the Helm chart is successfully installed and that all the KAI Scheduler components are running:

```
❯ kubectl get pods -n kai-scheduler
NAME                                   READY   STATUS    RESTARTS   AGE
admission-55f6c958b6-sx8q6             1/1     Running   0          31s
binder-66757b79cf-hnckr                1/1     Running   0          33s
kai-operator-86579f5b96-nc8tl          1/1     Running   0          33s
pod-grouper-845d589495-qxmlk           1/1     Running   0          31s
podgroup-controller-75c6986688-7cfhs   1/1     Running   0          31s
queue-controller-5bf44f6c4d-gwptq      1/1     Running   0          31s
scheduler-685b9d6846-69vr4             1/1     Running   0          33s
```

A workload must belong to a queue in order to be scheduled. To manage workloads, create the corresponding queues, which are the essential scheduling primitives. These queues reflect different scheduling guarantees, such as resource quota and priority. You can assign the queues to different types of consumers in the cluster, such as users and groups.
Create two basic scheduling queue hierarchies:
- A `default` top-level queue.
- A `test` leaf queue that you can use for your workloads.
```
cat <<EOF | kubectl apply -f -
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: default
spec:
  resources:
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1
---
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: test
spec:
  parentQueue: default
  resources:
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1
EOF
```
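Optionally, verify that both queues were created. This assumes the Queue CRD registers the standard `queues` resource name under the `scheduling.run.ai` API group:

```
# List the KAI Scheduler queues (resource name is an assumption; if it
# differs in your version, check with: kubectl api-resources | grep -i queue)
kubectl get queues.scheduling.run.ai
```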
To schedule your workloads using those queues, you need to do the following (a minimal example follows the list):
- Specify the queue name using the `kai.scheduler/queue: test` label.
- Specify the KAI workload scheduler using the `spec.schedulerName: kai-scheduler` attribute.
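As a minimal sketch (the Pod name and container image are illustrative, not part of this guide), a Pod that applies both settings could look like this:

```
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: kai-queue-test            # illustrative name
  labels:
    kai.scheduler/queue: test     # assign the workload to the test queue
spec:
  schedulerName: kai-scheduler    # schedule the Pod with KAI Scheduler
  containers:
    - name: sleep
      image: registry.k8s.io/pause:3.9   # placeholder container image
EOF
```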
Optionally, you can test the scheduler by deploying this example from the KAI Scheduler quickstart documentation.
At this point, the KAI Scheduler is installed and ready to schedule your AI workloads.
GPU Sharing with KAI Scheduler
One of the key features of the KAI Scheduler is GPU resource sharing, which allows multiple pods or workloads to share the same GPU device efficiently, even if they reside in different namespaces.
Allocate a portion of the GPU by one of the following (an annotation example follows this list):

- Requesting a specific amount of GPU memory. Example: `3GiB`.
- Requesting a fraction of the GPU device memory. Example: `0.5`.
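These requests are expressed as annotations on the workload's pod template, as in the fragment below. The `gpu-fraction` form is the one used later in this guide; the `gpu-memory` form and its MiB unit are an assumption based on the upstream KAI Scheduler documentation, so verify them against the version you installed:

```
metadata:
  annotations:
    # Request 50% of a GPU device's memory
    gpu-fraction: "0.5"
    # Alternative (assumption, do not combine with gpu-fraction):
    # request a fixed amount of GPU memory in MiB
    # gpu-memory: "3000"
```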
KAI Scheduler does not enforce memory allocation limits or perform memory isolation between processes, so it is important that the running processes allocate GPU memory only up to the requested amount. For instance, vLLM workloads consume 90% of the GPU memory by default, so you should limit this consumption using the `--gpu-memory-utilization` parameter with the corresponding memory fraction, such as `--gpu-memory-utilization=0.5`.
To test the GPU sharing feature of KAI Scheduler, follow these steps:
Create a namespace for the scheduled workloads:

```
kubectl create ns ai-workloads
```

Deploy a sample workload that uses vLLM for inference serving:
```
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-test-07
  namespace: ai-workloads
  labels:
    app: vllm-test-07
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-test-07
  template:
    metadata:
      labels:
        app: vllm-test-07
        kai.scheduler/queue: test
      annotations:
        gpu-fraction: "0.7"
    spec:
      schedulerName: kai-scheduler
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          command: ["/bin/sh", "-c"]
          args: [
            "vllm serve Qwen/Qwen2.5-1.5B-Instruct --gpu-memory-utilization=0.7"
          ]
          ports:
            - containerPort: 8000
          resources:
            limits:
              cpu: "8"
              memory: 15G
            requests:
              cpu: "6"
              memory: 6G
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 5
EOF
```

Verify that the deployment is allocated with the specified resources:
```
❯ kubectl -n ai-workloads get pods -o custom-columns="NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName,GPU-FRACTION:.metadata.annotations.gpu-fraction,GPU-GROUP:.metadata.labels.runai-gpu-group"
NAME                            STATUS    NODE                       GPU-FRACTION   GPU-GROUP
vllm-test-07-5979b99584-llf45   Running   k8s-gpu-md-0-wr9k6-gbtvr   0.7            407623a2-216d-4c06-b5b8-f8345bf28b5a
```

Deploy a Kubernetes service to access the API:
```
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: vllm-test-07
  namespace: ai-workloads
spec:
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: 8000
  selector:
    app: vllm-test-07
  sessionAffinity: None
  type: ClusterIP
EOF
```

Check the service:
```
❯ kubectl -n ai-workloads get svc
NAME           TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)   AGE
vllm-test-07   ClusterIP   10.43.119.6   <none>        80/TCP    39s
```

Port forward the service through `kubectl`:

```
kubectl -n ai-workloads port-forward svc/vllm-test-07 9000:80 &
```

Test the deployment by sending a request to the vLLM service API:

```
curl http://localhost:9000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }' | jq .
```

You should receive a response like this one:
{ "id": "cmpl-173532a758894deeae63d0a073c53289", "object": "text_completion", "created": 1763393534, "model": "Qwen/Qwen2.5-1.5B-Instruct", "choices": [ { "index": 0, "text": " city in the state of California,", "logprobs": null, "finish_reason": "length", "stop_reason": null, "token_ids": null, "prompt_logprobs": null, "prompt_token_ids": null } ], "service_tier": null, "system_fingerprint": null, "usage": { "prompt_tokens": 4, "total_tokens": 11, "completion_tokens": 7, "prompt_tokens_details": null }, "kv_transfer_params": null }
Optionally, deploy another workload with a small GPU fraction on that node.
Create another workload with a GPU memory fraction of `0.2`:

```
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-test-02
  namespace: ai-workloads
  labels:
    app: vllm-test-02
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-test-02
  template:
    metadata:
      labels:
        app: vllm-test-02
        kai.scheduler/queue: test
      annotations:
        gpu-fraction: "0.2"
    spec:
      schedulerName: kai-scheduler
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          command: ["/bin/sh", "-c"]
          args: [
            "vllm serve Qwen/Qwen2.5-1.5B-Instruct --gpu-memory-utilization=0.2"
          ]
          ports:
            - containerPort: 8000
          resources:
            limits:
              cpu: "8"
              memory: 15G
            requests:
              cpu: "6"
              memory: 6G
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 5
EOF
```

Check that the fraction is successfully assigned and the pod is running:
```
❯ kubectl -n ai-workloads get pods -o custom-columns="NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName,GPU-FRACTION:.metadata.annotations.gpu-fraction,GPU-GROUP:.metadata.labels.runai-gpu-group"
NAME                            STATUS    NODE                       GPU-FRACTION   GPU-GROUP
vllm-test-02-79b48968bc-dtzxs   Running   k8s-gpu-md-0-wr9k6-gbtvr   0.2            407623a2-216d-4c06-b5b8-f8345bf28b5a
vllm-test-07-5979b99584-llf45   Running   k8s-gpu-md-0-wr9k6-gbtvr   0.7            407623a2-216d-4c06-b5b8-f8345bf28b5a
```
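As an optional sanity check, you can inspect GPU memory usage from inside one of the containers. This assumes `nvidia-smi` is available in the vLLM containers, which is typically the case when the NVIDIA container runtime is used:

```
# Reported memory usage is device-wide, so it should reflect both vLLM
# workloads sharing the same GPU (roughly the requested 0.7/0.2 split).
kubectl -n ai-workloads exec deploy/vllm-test-07 -- nvidia-smi
```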
With this validation, you have checked how to efficiently share fractional GPU resources between workloads in an AI-Ready Kubernetes cluster with KAI Scheduler.
Tip
After powering your AI Factory with NVIDIA KAI Scheduler on Kubernetes, you may continue with NVIDIA Dynamo on Kubernetes as an additional validation procedure built on top of K8s.