vLLM AI
Appliance Description
vLLM is a high-performance inference engine optimized for serving transformer LLMs with low latency, high throughput, token-level streaming, and efficient GPU memory usage. This appliance packages vLLM into a ready-to-run, configurable OpenNebula VM image, simplifying the deployment of inference workloads on your OpenNebula cloud.
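Once the VM is running, the server can be queried with any OpenAI-compatible client, since vLLM exposes an OpenAI-compatible API. The sketch below uses the `openai` Python package; the IP address, port (vLLM's default is 8000), and model ID are placeholder assumptions that depend on how the appliance was configured at deployment time.

```python
# A minimal sketch of querying the appliance's OpenAI-compatible endpoint.
# Assumptions: the VM is reachable at 203.0.113.10, vLLM listens on its
# default port 8000, and the model ID below matches the model configured
# at deployment time.
from openai import OpenAI

client = OpenAI(
    base_url="http://203.0.113.10:8000/v1",  # vLLM's OpenAI-compatible API
    api_key="EMPTY",  # vLLM needs no key unless one was configured
)

# Token-level streaming: chunks are printed as the model generates them.
stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # placeholder model ID
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```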
Main Features
The vLLM appliance includes the following components for running LLMs:
- vLLM, the open-source, high-performance engine for serving large language models with low latency and high throughput
- Hugging Face Transformers, one of the most widely adopted frameworks for deploying large language models
- Configurable deployment options and behavior, controlled through contextualization parameters (see the sketch after this list)
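As an illustration of contextualization-driven configuration, the fragment below shows how such parameters might be set in the CONTEXT section of an OpenNebula VM template. The ONEAPP_* variable names here are hypothetical placeholders; the actual parameter names and accepted values are listed in the full appliance documentation referenced below.

```
# Hypothetical appliance parameters; consult the vLLM appliance
# documentation for the real names and accepted values.
CONTEXT = [
  NETWORK = "YES",
  SSH_PUBLIC_KEY = "$USER[SSH_PUBLIC_KEY]",
  ONEAPP_VLLM_MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2",
  ONEAPP_VLLM_API_PORT = "8000" ]
```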
Main References
- vLLM in the OpenNebula one-apps project
- Full documentation for the vLLM appliance
- Download the vLLM appliance from the OpenNebula Marketplace