vLLM
High-throughput LLM inference engine with PagedAttention for efficient GPU memory usage
vLLM is an open-source inference and serving engine for large language models, originally developed at UC Berkeley. Its PagedAttention algorithm manages GPU memory for the attention key-value cache efficiently, achieving up to 24x higher throughput than Hugging Face Transformers. It supports most popular open-source models, including Llama, Mixtral, DeepSeek, and multimodal models such as LLaVA. vLLM provides both a fast inference engine and a production-ready, OpenAI-compatible API server, making it a popular choice for self-hosted LLM deployments.
Pricing: Free
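To make the two entry points concrete, here is a minimal sketch of offline batch inference through vLLM's Python API. The model name and sampling values are illustrative assumptions, not recommendations.

```python
from vllm import LLM, SamplingParams

# Load a model; the name below is an illustrative assumption.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Arbitrary sampling settings for this sketch.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# generate() batches prompts; PagedAttention manages the KV cache internally.
outputs = llm.generate(["Explain PagedAttention in one sentence."], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

The OpenAI-compatible server is started with the vllm serve command and can then be queried with the standard openai client. This sketch assumes the server's default port of 8000; vLLM does not require a real API key by default, so any placeholder works.

```python
# Start the server first, e.g.:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct
from openai import OpenAI

# Point the client at the local vLLM server (default port 8000 assumed).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "What is vLLM?"}],
)
print(response.choices[0].message.content)
```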
vLLM Alternatives
Explore 51 products in the Inference APIs category.
deepinfra
Run the top AI models using a simple API, pay per use. Low-cost, scalable, and production-ready infrastructure.
LLMWise
Multi-LLM API orchestration platform for comparing and blending AI models
novita.ai
APIs, Serverless and GPU Instance In One AI Cloud
Nebius
Full-stack AI cloud with GPU infrastructure for training and inference