Deployment of NVIDIA Dynamo
NVIDIA® Dynamo is a high-performance inference framework for serving AI models in an agnostic way, across any framework, architecture or deployment scale, including multi-node distributed environments. Being engine-agnostic, it supports different backends such as TRT-LLM, vLLM and SGLang. Dynamo also lets you declare inference graphs that deploy containerized components in a disaggregated way - an API frontend, a prefill worker, a decode worker, a K/V cache, and others - and have them interact to respond efficiently to user queries.
Encapsulating the different inference engines, AI models and dependencies into a single container improves workload portability and isolation. With this approach, each container is deployed consistently across different environments, together with all its dependencies, avoiding conflicts and reproducibility issues.
In this guide you will learn how to combine a GPU-powered Kubernetes cluster with the NVIDIA Dynamo Cloud Platform to provision a secure, robust and scalable solution for your AI workloads on top of the NVIDIA Dynamo framework, powered by the OpenNebula cloud platform.
Before Starting
Before starting this tutorial, you must complete the AI Factory deployment with either on-premises or cloud resources. Please complete one of the following guides, depending on your available resources:
You must then complete the AI-ready Kubernetes Deployment Guide. You must also undeploy any appliances, VMs or services you deployed in previous guides before continuing.
NVIDIA Dynamo Cloud Platform Installation
As a prerequisite, you need a storage provider installed to supply PersistentVolumes to the platform. For testing purposes, use the Rancher local-path-provisioner, which uses a local path on the Pod's host as storage and creates a default storage class for it.
For the following commands to work, you must use the kubeconfig_workload.yaml Kubeconfig. Either add --kubeconfig kubeconfig_workload.yaml to the commands or export the KUBECONFIG environment variable:
export KUBECONFIG="$PWD/kubeconfig_workload.yaml"
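As a quick sanity check, you can confirm that the kubeconfig points at the workload cluster by listing its nodes (node names will differ in your environment):
kubectl get nodes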
- To install the provisioner, deploy the manifest from the GitHub repository:
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.32/deploy/local-path-storage.yaml
- Check that the storage provisioner is up and running:
kubectl -n local-path-storage get deploy,pods
You should see an output like this:
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/local-path-provisioner 1/1 1 1 7d2h
NAME READY STATUS RESTARTS AGE
pod/local-path-provisioner-7f57b55d56-7qb42 1/1 Running 0 7d2h
- Create the following StorageClass and set it as the default.
If you want to modify the nodePath parameter, ensure that it is available in the nodePathMap field of the provisioner config, as indicated in the Customize the ConfigMap guide.
cat <<EOF > storageClass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: rancher.io/local-path
parameters:
  nodePath: /opt/local-path-provisioner
  pathPattern: "{{ .PVC.Namespace }}/{{ .PVC.Name }}"
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
EOF
kubectl replace --force -f storageClass.yaml
- Make sure this storage class is the default:
kubectl patch storageclass local-path -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
kubectl get storageClass
The local-path storage class should have the (default) suffix:
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION
local-path (default) rancher.io/local-path Delete WaitForFirstConsumer false
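Optionally, you can verify that dynamic provisioning works with a small test claim (the test-pvc name and 128Mi size are illustrative). Because the storage class uses WaitForFirstConsumer, the claim will stay in Pending state until a pod consumes it, which is expected:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 128Mi
EOF
kubectl get pvc test-pvc
kubectl delete pvc test-pvc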
At this point, the Dynamo Cloud platform is ready for installation. Configure your cluster declaratively, using the containers and Helm charts from the NVIDIA NGC catalog, and follow these steps:
- Install the CRDs:
helm install dynamo-crds https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-0.7.0.tgz \
--namespace dynamo-cloud --create-namespace \
--wait --atomic
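Before moving on, you can verify that the CRDs were registered (the grep pattern assumes the Dynamo CRDs belong to the nvidia.com API group, as used by the manifests later in this guide):
kubectl get crds | grep nvidia.com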
- Install the operator, using the latest version available in the catalog:
helm install dynamo-platform https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-0.7.0.tgz \
--namespace dynamo-cloud \
--create-namespace \
--set "dynamo-operator.controllerManager.manager.image.repository=nvcr.io/nvidia/ai-dynamo/kubernetes-operator" \
--set "dynamo-operator.controllerManager.manager.image.tag=0.7.0"
- Check that the operator is up and running:
kubectl -n dynamo-cloud get deploy,pod,svc
All the pods should be in Running or Completed state:
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/dynamo-platform-dynamo-operator-controller-manager 1/1 1 1 27h
deployment.apps/dynamo-platform-nats-box 1/1 1 1 27h
NAME READY STATUS RESTARTS AGE
pod/dynamo-platform-dynamo-operator-controller-manager-75fd6b7cdvlt 2/2 Running 0 26h
pod/dynamo-platform-etcd-0 1/1 Running 0 27h
pod/dynamo-platform-etcd-pre-upgrade-g5cjm 0/1 Completed 0 27h
pod/dynamo-platform-nats-0 2/2 Running 0 27h
pod/dynamo-platform-nats-box-57c9cf4c7b-vbgpg 1/1 Running 0 27h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/dynamo-platform-etcd ClusterIP 10.43.157.193 <none> 2379/TCP,2380/TCP 27h
service/dynamo-platform-etcd-headless ClusterIP None <none> 2379/TCP,2380/TCP 27h
service/dynamo-platform-nats ClusterIP 10.43.175.109 <none> 4222/TCP 27h
service/dynamo-platform-nats-headless ClusterIP None <none> 4222/TCP,8222/TCP 27h
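If you prefer a command that blocks until the operator is available, instead of polling, you can wait on the deployment rollout:
kubectl -n dynamo-cloud rollout status deployment/dynamo-platform-dynamo-operator-controller-manager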
If the dynamo-platform-dynamo-operator-controller-manager pod is stuck in the ImagePullBackOff state, see the Known Issues section for a solution.
- To use some LLM models in the platform, you need a HuggingFace token to authenticate against the API. Go to the tokens page of the HuggingFace website to create a new token if you don't already have one. Create a YAML file with your HF token (replace <token>):
cat <<EOF > hf-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
  namespace: dynamo-cloud
type: Opaque
stringData:
  token: "<token>"
EOF
Then apply the YAML file so that Dynamo can access the token:
kubectl apply -f hf-secret.yaml
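You can confirm that the Secret was created, without printing the token itself:
kubectl -n dynamo-cloud get secret hf-token-secret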
Deployment of Dynamo Inference Graphs
NVIDIA Dynamo orchestrates the deployment of inference graphs either through the Dynamo CLI or by applying manifests that follow the Dynamo-specific CRDs directly in the cluster, where they are recognized and managed by the Dynamo Kubernetes Operator.
The instructions in this guide do not expose the Dynamo API externally. Instead, you take advantage of the Dynamo Kubernetes Operator by deploying the inference graph manifests directly on the cluster.
To run your workloads as Dynamo Inference Graphs, check the following requirements:
- If the HuggingFace model that you are using requires authorization, configure a valid HF token stored as a Kubernetes Secret named hf-token-secret.
- Assign a GPU to the worker pods by setting each worker's extraPodSpec field (e.g. spec.services.VllmDecodeWorker.extraPodSpec) with runtimeClassName: nvidia.
Once you have access to the Kubernetes API, proceed to deploy the inference graphs defined in the corresponding manifests.
The latest vllm-runtime image is available at nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1, but you can also build your own runtime image by following the instructions in the Dynamo repository.
An example of a disaggregated deployment graph is available in the NVIDIA Dynamo GitHub Repository. For this guide, the example has been adapted to work with a validated container runtime:
cat << EOF > disagg_custom.yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-v1-disagg-router
spec:
  services:
    Frontend:
      dynamoNamespace: vllm-v1-disagg-router
      componentType: frontend
      replicas: 1
      livenessProbe:
        httpGet:
          path: /health
          port: 8000
        initialDelaySeconds: 20
        periodSeconds: 5
        timeoutSeconds: 5
        failureThreshold: 3
      readinessProbe:
        exec:
          command:
            - /bin/sh
            - -c
            - 'curl -s http://localhost:8000/health | jq -e ".status == \"healthy\""'
        initialDelaySeconds: 60
        periodSeconds: 60
        timeoutSeconds: 30
        failureThreshold: 10
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
          ephemeral-storage: "1Gi"
        limits:
          cpu: "1"
          memory: "2Gi"
          ephemeral-storage: "2Gi"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
          workingDir: /workspace/components/backends/vllm
          command:
            - /bin/sh
            - -c
          args:
            - "python3 -m dynamo.frontend --http-port 8000 --router-mode kv"
    VllmDecodeWorker:
      dynamoNamespace: vllm-v1-disagg-router
      envFromSecret: hf-token-secret
      componentType: worker
      replicas: 1
      livenessProbe:
        httpGet:
          path: /live
          port: 9090
        periodSeconds: 5
        timeoutSeconds: 30
        failureThreshold: 1
      readinessProbe:
        httpGet:
          path: /health
          port: 9090
        periodSeconds: 10
        timeoutSeconds: 30
        failureThreshold: 60
      resources:
        requests:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
          ephemeral-storage: "5Gi"
        limits:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
          ephemeral-storage: "10Gi"
      envs:
        - name: DYN_SYSTEM_ENABLED
          value: "true"
        - name: DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS
          value: "[\"generate\"]"
        # Pin the system status port targeted by the probes,
        # matching the VllmPrefillWorker below
        - name: DYN_SYSTEM_PORT
          value: "9090"
      extraPodSpec:
        runtimeClassName: nvidia
        mainContainer:
          startupProbe:
            httpGet:
              path: /health
              port: 9090
            periodSeconds: 10
            failureThreshold: 60
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
          workingDir: /workspace/components/backends/vllm
          command:
            - /bin/sh
            - -c
          args:
            - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B
    VllmPrefillWorker:
      dynamoNamespace: vllm-v1-disagg-router
      envFromSecret: hf-token-secret
      componentType: worker
      replicas: 1
      livenessProbe:
        httpGet:
          path: /live
          port: 9090
        periodSeconds: 5
        timeoutSeconds: 30
        failureThreshold: 1
      readinessProbe:
        httpGet:
          path: /health
          port: 9090
        periodSeconds: 10
        timeoutSeconds: 30
        failureThreshold: 60
      resources:
        requests:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
          ephemeral-storage: "5Gi"
        limits:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
          ephemeral-storage: "10Gi"
      envs:
        - name: DYN_SYSTEM_ENABLED
          value: "true"
        - name: DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS
          value: "[\"generate\"]"
        - name: DYN_SYSTEM_PORT
          value: "9090"
      extraPodSpec:
        runtimeClassName: nvidia
        mainContainer:
          startupProbe:
            httpGet:
              path: /health
              port: 9090
            periodSeconds: 10
            failureThreshold: 60
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
          workingDir: /workspace/components/backends/vllm
          command:
            - /bin/sh
            - -c
          args:
            - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --is-prefill-worker
EOF
Deploy the disaggregated deployment graph with kubectl:
kubectl -n dynamo-cloud apply -f disagg_custom.yaml
After a few minutes (pulling the vLLM runtime image takes some time), check that the pods are up and running:
kubectl -n dynamo-cloud get pods,svc
NAME READY STATUS RESTARTS AGE
pod/disagg-frontend-65646b6f7b-dwfr2 1/1 Running 0 27m
pod/disagg-prefillworker-5b784c677c-42pts 1/1 Running 0 27m
pod/disagg-vllmworker-d494976f6-78hr7 1/1 Running 0 33m
[...]
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/disagg-frontend ClusterIP 10.43.92.113 <none> 8000/TCP 33m
[...]
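You can also inspect the inference graph resource itself; kubectl describe shows the rollout events, which helps if a worker stays in Pending (for example, when no GPU is schedulable):
kubectl -n dynamo-cloud describe dynamographdeployment vllm-v1-disagg-router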
(Optional) Querying the API Locally
If you want to query the API locally, forward the vLLM frontend service through Kubernetes with this command:
kubectl -n dynamo-cloud port-forward svc/<frontend_service> <local_port>:8000 &
Example:
kubectl -n dynamo-cloud port-forward svc/vllm-v1-disagg-router-frontend 9000:8000 &
To test the loaded models, send requests to the frontend via curl:
curl localhost:9000/v1/models | jq .
{
"object": "list",
"data": [
{
"id": "Qwen/Qwen3-0.6B",
"object": "object",
"created": 1756908946,
"owned_by": "nvidia"
}
]
}
You can also submit inference requests:
curl localhost:9000/v1/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen3-0.6B",
"prompt": "What is opennebula?",
"stream": false,
"max_tokens": 300
}' | jq
You will receive a response like this:
{
"id": "cmpl-2c749514-9a81-4864-a119-d195d39a235b",
"choices": [
{
"text": "<think>\nOkay, the user is asking about OpenNebula. First, I need to make sure I understand what OpenNebula is. From what I remember, OpenNebula is an open-source orchestration system used for managing virtual machines (VMs) in cloud environments. It's often used in Linux-based cloud infrastructures like AWS or Azure. \n\nI should start by defining OpenNebula. It's a tool that helps manage and orchestrate virtual machines on a cloud platform. The key features include providing a way to manage VMs, resources, and services in a centralized manner. OpenNebula is designed to be flexible and scalable, allowing for easy integration with various cloud providers.\n\nWait, are there any specific use cases or industries where OpenNebula is commonly used? I think it's often used in enterprise environments for managing VMs, especially in environments where automation and resource management are critical. It's also used in hybrid cloud setups where VMs can be managed between on-premises and cloud environments.\n\nI should mention that OpenNebula is open-source, which is important to highlight. It's developed by a community and is licensed under a specific open-source license. Maybe include something about how it simplifies VM orchestration and management.\n\nAre there any common misconceptions about OpenNebula? Perhaps users might confuse it with other cloud management tools. I should clarify that it's specifically for orchestration rather than just managing VMs.",
"index": 0,
"logprobs": null,
"finish_reason": "stop"
}
],
"created": 1755169608,
"model": "Qwen/Qwen3-0.6B",
"system_fingerprint": null,
"object": "text_completion",
"usage": {
"prompt_tokens": 15,
"completion_tokens": 299,
"total_tokens": 314,
"prompt_tokens_details": null,
"completion_tokens_details": null
}
}
If you want to test the response in streaming mode, set the parameter stream: true and remove the jq pipe from the call:
curl localhost:9000/v1/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen3-0.6B",
"prompt": "What is opennebula?",
"stream": true,
"max_tokens": 300
}'
You will see this streamed output:
data: {"id":"cmpl-84041acf-79d1-4ec4-b913-c492fa4f3379","choices":[{"text":"<think>","index":0,"logprobs":null,"finish_reason":null}],"created":1756908478,"model":"Qwen/Qwen3-0.6B","system_fingerprint":null,"object":"text_completion","usage":{"prompt_tokens":15,"completion_tokens":1,"total_tokens":16,"prompt_tokens_details":null,"completion_tokens_details":null}}
data: {"id":"cmpl-84041acf-79d1-4ec4-b913-c492fa4f3379","choices":[{"text":"\n","index":0,"logprobs":null,"finish_reason":null}],"created":1756908478,"model":"Qwen/Qwen3-0.6B","system_fingerprint":null,"object":"text_completion","usage":{"prompt_tokens":15,"completion_tokens":2,"total_tokens":17,"prompt_tokens_details":null,"completion_tokens_details":null}}
data: {"id":"cmpl-84041acf-79d1-4ec4-b913-c492fa4f3379","choices":[{"text":"Okay","index":0,"logprobs":null,"finish_reason":null}],"created":1756908478,"model":"Qwen/Qwen3-0.6B","system_fingerprint":null,"object":"text_completion","usage":{"prompt_tokens":15,"completion_tokens":3,"total_tokens":18,"prompt_tokens_details":null,"completion_tokens_details":null}}
[...]
In the streamed output, you receive multiple JSON events, each carrying response tokens in the text field along with some metadata.
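If you want to reassemble the streamed tokens into plain text, you can strip the data: prefix from each event and extract the text fields with jq. This is a minimal sketch; it assumes the stream may end with a data: [DONE] sentinel, as is common in OpenAI-compatible APIs:
curl -sN localhost:9000/v1/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-0.6B",
  "prompt": "What is opennebula?",
  "stream": true,
  "max_tokens": 300
}' | sed -un 's/^data: //p' | grep --line-buffered -v '^\[DONE\]' | jq -rj '.choices[0].text // empty'; echo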
Undeployment
Before moving on to other AI Factory guides or deployments, you must undeploy NVIDIA Dynamo and the Disaggregated Deployment Graph.
Run the following command to undeploy the graph:
kubectl delete dynamographdeployment vllm-v1-disagg-router -n dynamo-cloud
Run the following command until you receive the response No resources found in dynamo-cloud namespace.:
kubectl get dynamographdeployment -n dynamo-cloud
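To remove the Dynamo platform components themselves, uninstall the Helm releases created during installation. This is a sketch assuming the release names dynamo-platform and dynamo-crds used earlier in this guide:
helm uninstall dynamo-platform -n dynamo-cloud
helm uninstall dynamo-crds -n dynamo-cloud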
Then run the following command until all remaining pods have terminated:
kubectl get all -n dynamo-cloud
Once NVIDIA Dynamo is successfully undeployed, you will receive the response: No resources found in dynamo-cloud namespace.
Finally, delete the dynamo-cloud namespace:
kubectl delete namespace dynamo-cloud
Next Steps
After powering your AI Factory with NVIDIA Dynamo on Kubernetes, you may continue with the NVIDIA KAI Scheduler as an additional validation procedure built on top of K8s.
Known Issues
Dynamo Operator Controller Manager stuck in ImagePullBackOff
If the dynamo-platform-dynamo-operator-controller-manager pod is stuck in the ImagePullBackOff state, this may be due to an outdated image registry path:
kubectl -n dynamo-cloud get deploy,pod,svc
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/dynamo-platform-dynamo-operator-controller-manager 1/1 1 1 42m
NAME READY STATUS RESTARTS AGE
pod/dynamo-platform-dynamo-operator-controller-manager-75847c7qj2kx 2/2 ImagePullBackOff 0 19m
pod/dynamo-platform-etcd-0 1/1 Running 0 42m
pod/dynamo-platform-nats-0 2/2 Running 0 42m
...
Google is migrating images away from the gcr.io domain to pkg.dev. Fix the problem by updating the image path in the deployment:
Open the deployment for editing:
kubectl -n dynamo-cloud edit deployment dynamo-platform-dynamo-operator-controller-manager
Look for the line:
image: gcr.io/kubebuilder/kube-rbac-proxy:v0.15.0
Replace it with:
image: quay.io/brancz/kube-rbac-proxy:v0.15.0
Save and exit the editor. The pod should automatically restart. Run the following command again until the pod reaches the Running status:
kubectl -n dynamo-cloud get deploy,pod,svc
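Alternatively, you can patch the image without opening an editor. This one-liner is a sketch that assumes the affected container in the deployment is named kube-rbac-proxy; check the container name in the pod spec first:
kubectl -n dynamo-cloud set image deployment/dynamo-platform-dynamo-operator-controller-manager kube-rbac-proxy=quay.io/brancz/kube-rbac-proxy:v0.15.0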