Deployment of NVIDIA KAI Scheduler

NVIDIA® KAI Scheduler is an open-source, Kubernetes-native scheduler designed to optimize GPU resource allocation for AI and machine learning workloads at scale. It is capable of managing large GPU clusters and demanding, high-throughput workload environments. KAI Scheduler targets both interactive jobs and large-scale training or inference tasks within the same cluster, orchestrating the available resources across different users and teams. It also operates alongside other schedulers installed in the cluster.

Some of the key features are:

  • Sharing of single or multiple GPU devices among multiple workloads to improve resource utilization.
  • Batch scheduling of different types of workloads, including gang scheduling (i.e. all pods of a group are scheduled together, or none are scheduled until enough resources are available for all of them).
  • Effective workload priority with hierarchical queues.
  • Resource distribution with custom quotas, limits, priorities and fairness policies.
  • Elastic workloads with dynamic scaling up and down.
  • Compatibility with other autoscalers like Karpenter.

As the KAI Scheduler operates on top of Kubernetes, users benefit from its robust orchestration, scalability, and resource management capabilities, further enhancing deployment flexibility and reliability. By running both KAI Scheduler and Kubernetes on OpenNebula, this solution extends resource management to hybrid and multi-cloud environments, enabling dynamic and cost-effective scaling of AI workloads while maintaining control over the infrastructure. This synergy leverages the strengths of all three technologies to deliver efficient, fair, and portable GPU scheduling from on-premises data centers to the cloud.

In this guide you will learn how to perform a validation using the NVIDIA KAI Scheduler in an AI-Ready Kubernetes cluster. You will find details about the installation and about how to efficiently share GPU resources among different workloads.

NVIDIA KAI Scheduler Installation

To install the NVIDIA KAI Scheduler, complete the following steps:

  1. Create a dedicated namespace for the KAI Scheduler components in the Kubernetes cluster:

    kubectl create namespace kai-scheduler
    
  2. Install the KAI Scheduler through Helm. To achieve this, check the available release versions in the KAI Scheduler repository and install the latest one via Helm. If you want to use the GPU resource sharing feature, set the "global.gpuSharing=true" flag:

    helm install kai-scheduler \
        https://github.com/NVIDIA/KAI-Scheduler/releases/download/v0.9.8/kai-scheduler-v0.9.8.tgz \
        -n kai-scheduler --create-namespace \
        --set "global.gpuSharing=true" \
        --set "global.resourceReservation.runtimeClassName=nvidia"
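
    Once the installation command completes, you can optionally confirm the release status with Helm before checking the pods:

    helm status kai-scheduler -n kai-scheduler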
    
  3. Verify that the Helm chart is successfully installed and that all the KAI Scheduler components are running:

    ❯ kubectl get pods -n kai-scheduler
    NAME                                   READY   STATUS    RESTARTS   AGE
    admission-55f6c958b6-sx8q6             1/1     Running   0          31s
    binder-66757b79cf-hnckr                1/1     Running   0          33s
    kai-operator-86579f5b96-nc8tl          1/1     Running   0          33s
    pod-grouper-845d589495-qxmlk           1/1     Running   0          31s
    podgroup-controller-75c6986688-7cfhs   1/1     Running   0          31s
    queue-controller-5bf44f6c4d-gwptq      1/1     Running   0          31s
    scheduler-685b9d6846-69vr4             1/1     Running   0          33s
    
  4. A workload must belong to a queue in order to be scheduled. To manage workloads, create the corresponding queues, which are the essential scheduling primitives. Queues reflect different scheduling guarantees, such as resource quotas and priorities, and can be assigned to different types of consumers in the cluster, such as users and groups.

    Create a basic two-level scheduling queue hierarchy:

    • A default top level queue.
    • A test leaf queue that you can use for your workloads.

    cat <<EOF | kubectl apply -f -
    apiVersion: scheduling.run.ai/v2
    kind: Queue
    metadata:
      name: default
    spec:
      resources:
        cpu:
          quota: -1
          limit: -1
          overQuotaWeight: 1
        gpu:
          quota: -1
          limit: -1
          overQuotaWeight: 1
        memory:
          quota: -1
          limit: -1
          overQuotaWeight: 1
    ---
    apiVersion: scheduling.run.ai/v2
    kind: Queue
    metadata:
      name: test
    spec:
      parentQueue: default
      resources:
        cpu:
          quota: -1
          limit: -1
          overQuotaWeight: 1
        gpu:
          quota: -1
          limit: -1
          overQuotaWeight: 1
        memory:
          quota: -1
          limit: -1
          overQuotaWeight: 1
    EOF
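
    You can optionally confirm that both queues were created. The command below uses the fully qualified resource name, assuming the Queue CRD's plural is queues as defined in the KAI Scheduler repository:

    kubectl get queues.scheduling.run.ai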
    

To schedule your workloads using these queues, you need to do the following (see the minimal example after this list):

  • Specify the queue name using the kai.scheduler/queue: test label.
  • Specify the KAI Scheduler using the spec.schedulerName: kai-scheduler field.
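
For reference, a minimal Pod manifest wired to the test queue could look like the following sketch (the image, command, and resource values are illustrative placeholders):

    apiVersion: v1
    kind: Pod
    metadata:
      name: queue-test-pod
      labels:
        # Queue created in the previous step
        kai.scheduler/queue: test
    spec:
      # Use the KAI Scheduler instead of the default Kubernetes scheduler
      schedulerName: kai-scheduler
      containers:
      - name: main
        image: ubuntu:22.04
        command: ["sleep", "60"]
        resources:
          requests:
            cpu: 100m
            memory: 128Mi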

Optionally, you can test the scheduler by deploying this example from the KAI Scheduler quickstart documentation.

At this point, the KAI Scheduler is installed and ready to schedule your AI workloads.

GPU Sharing with KAI Scheduler

One of the key features of the KAI Scheduler is GPU resource sharing, which allows multiple pods or workloads to share the same GPU device efficiently, even when they reside in different namespaces.

You can allocate a portion of a GPU by:

  • requesting a specific amount of GPU memory, for example 3 GiB, or
  • requesting a fraction of the GPU device memory, for example 0.5.

KAI Scheduler does not enforce memory allocation limits or perform memory isolation between processes, so it is important that the running processes only allocate GPU memory up to the requested amount. For instance, vLLM workloads by default consume 90% of the GPU memory, so you should limit this consumption using the --gpu-memory-utilization parameter with the corresponding memory fraction, such as --gpu-memory-utilization=0.5.
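
For reference, the fractional request is expressed through pod metadata rather than the standard resources section. A minimal pod-template fragment, assuming the gpu-fraction annotation used later in this guide (the alternative gpu-memory annotation, with its value in MiB, is taken from the KAI Scheduler documentation and should be double-checked there):

    metadata:
      labels:
        kai.scheduler/queue: test
      annotations:
        gpu-fraction: "0.5"    # request half of the GPU device memory
        # gpu-memory: "3072"   # alternative: request a fixed amount of GPU memory in MiB (assumed annotation)
    spec:
      schedulerName: kai-scheduler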

To test the GPU sharing feature of KAI Scheduler, follow these steps:

  1. Create a namespace for the scheduled workloads

    kubectl create ns ai-workloads
    
  2. Deploy a sample workload that uses vLLM for inference serving

    cat <<EOF | kubectl apply -f -
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm-test-07
      namespace: ai-workloads
      labels:
        app: vllm-test-07
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: vllm-test-07
      template:
        metadata:
          labels:
            app: vllm-test-07
            kai.scheduler/queue: test
          annotations:
            gpu-fraction: "0.7"
        spec:
          schedulerName: kai-scheduler
          containers:
          - name: vllm
            image: vllm/vllm-openai:latest
            command: ["/bin/sh", "-c"]
            args: [
              "vllm serve Qwen/Qwen2.5-1.5B-Instruct --gpu-memory-utilization=0.7"
            ]
            ports:
            - containerPort: 8000
            resources:
              limits:
                cpu: "8"
                memory: 15G
              requests:
                cpu: "6"
                memory: 6G
            livenessProbe:
              httpGet:
                path: /health
                port: 8000
              initialDelaySeconds: 60
              periodSeconds: 10
            readinessProbe:
              httpGet:
                path: /health
                port: 8000
              initialDelaySeconds: 60
              periodSeconds: 5
    EOF
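
    The first startup can take a few minutes while the container image and model weights are pulled; you can follow the progress in the container logs:

    kubectl -n ai-workloads logs deploy/vllm-test-07 -f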
    
  3. Verify that the pod is running and has been allocated the specified GPU fraction

    ❯ kubectl -n ai-workloads get pods -o custom-columns="NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName,GPU-FRACTION:.metadata.annotations.gpu-fraction,GPU-GROUP:.metadata.labels.runai-gpu-group"
    NAME                            STATUS    NODE                       GPU-FRACTION   GPU-GROUP
    vllm-test-07-5979b99584-llf45   Running   k8s-gpu-md-0-wr9k6-gbtvr   0.7            407623a2-216d-4c06-b5b8-f8345bf28b5a
    
  4. Deploy a Kubernetes service to access the API

    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Service
    metadata:
      name: vllm-test-07
      namespace: ai-workloads
    spec:
      ports:
      - name: http
        port: 80
        protocol: TCP
        targetPort: 8000
      selector:
        app: vllm-test-07
      sessionAffinity: None
      type: ClusterIP
    EOF
    

    Check the service:

    ❯ kubectl -n ai-workloads get svc
    NAME           TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)   AGE
    vllm-test-07   ClusterIP   10.43.119.6   <none>        80/TCP    39s
    
  5. Port forward the service through kubectl

    kubectl -n ai-workloads port-forward svc/vllm-test-07 9000:80 &
    
  6. Test the deployment by sending a request to the vLLM service API

    curl http://localhost:9000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
            "model": "Qwen/Qwen2.5-1.5B-Instruct",
            "prompt": "San Francisco is a",
            "max_tokens": 7,
            "temperature": 0
        }' | jq .
    

    You should receive a response like this one:

    {
        "id": "cmpl-173532a758894deeae63d0a073c53289",
        "object": "text_completion",
        "created": 1763393534,
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "choices": [
            {
            "index": 0,
            "text": " city in the state of California,",
            "logprobs": null,
            "finish_reason": "length",
            "stop_reason": null,
            "token_ids": null,
            "prompt_logprobs": null,
            "prompt_token_ids": null
            }
        ],
        "service_tier": null,
        "system_fingerprint": null,
        "usage": {
            "prompt_tokens": 4,
            "total_tokens": 11,
            "completion_tokens": 7,
            "prompt_tokens_details": null
        },
        "kv_transfer_params": null
    }
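
    You can also list the models served by this instance through vLLM's OpenAI-compatible models endpoint, reusing the same port-forward:

    curl http://localhost:9000/v1/models | jq .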
    

Optionally, deploy another workload with a smaller GPU fraction on the same node.

  1. Create another workload with a GPU memory fraction of 0.2

    cat <<EOF | kubectl apply -f -
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm-test-02
      namespace: ai-workloads
      labels:
        app: vllm-test-02
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: vllm-test-02
      template:
        metadata:
          labels:
            app: vllm-test-02
            kai.scheduler/queue: test
          annotations:
            gpu-fraction: "0.2"
        spec:
          schedulerName: kai-scheduler
          containers:
          - name: vllm
            image: vllm/vllm-openai:latest
            command: ["/bin/sh", "-c"]
            args: [
              "vllm serve Qwen/Qwen2.5-1.5B-Instruct --gpu-memory-utilization=0.2"
            ]
            ports:
            - containerPort: 8000
            resources:
              limits:
                cpu: "8"
                memory: 15G
              requests:
                cpu: "6"
                memory: 6G
            livenessProbe:
              httpGet:
                path: /health
                port: 8000
              initialDelaySeconds: 60
              periodSeconds: 10
            readinessProbe:
              httpGet:
                path: /health
                port: 8000
              initialDelaySeconds: 60
              periodSeconds: 5
    EOF
    
  2. Check that the fraction is successfully assigned and the pod is running

    ❯ kubectl -n ai-workloads get pods -o custom-columns="NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName,GPU-FRACTION:.metadata.annotations.gpu-fraction,GPU-GROUP:.metadata.labels.runai-gpu-group"
    NAME                            STATUS    NODE                       GPU-FRACTION   GPU-GROUP
    vllm-test-02-79b48968bc-dtzxs   Running   k8s-gpu-md-0-wr9k6-gbtvr   0.2            407623a2-216d-4c06-b5b8-f8345bf28b5a
    vllm-test-07-5979b99584-llf45   Running   k8s-gpu-md-0-wr9k6-gbtvr   0.7            407623a2-216d-4c06-b5b8-f8345bf28b5a
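
    Both pods run on the same node and share the same GPU group, meaning they are packed onto the same physical GPU. Optionally, you can inspect the actual GPU memory usage from inside one of the pods (a quick check assuming nvidia-smi is available in the vLLM container, which is typically the case with the NVIDIA container runtime; the per-process list may appear empty due to PID namespace isolation):

    kubectl -n ai-workloads exec deploy/vllm-test-07 -- nvidia-smi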
    

With this validation, you have verified how to efficiently share fractional GPU resources between workloads in an AI-Ready Kubernetes cluster with the KAI Scheduler.