AI Infrastructure Stack

Self-Hosted AI Stack

Run everything on your own infrastructure. For teams that need full control over data, want to avoid API dependencies, or have compliance requirements that rule out third-party services.

πŸ”’ Full data control πŸ–₯️ Your infrastructure πŸ’° No per-token costs

Things to keep in mind

  • Self-hosting trades per-token costs for infrastructure costs and operational work. It makes sense at scale (thousands of requests per day) or when data sovereignty requires it. For small workloads, managed APIs are usually cheaper and simpler.
  • vLLM + a GPU instance is the standard starting point. GPU requirements depend on model size: a 7B model fits on a single GPU, larger models (70B+) may need multiple GPUs or quantization. Benchmark your specific model before committing to hardware.
  • Open-source models have caught up significantly. The Llama, Mistral, Qwen, and DeepSeek families cover most production use cases. Check license terms; some are more permissive than others.
  • You can mix self-hosted and managed. Run your inference on your own GPUs but use Langfuse cloud for observability, or vice versa. Not everything needs to be self-hosted.
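The GPU sizing guidance above can be sketched as a back-of-envelope estimate. This is a rough heuristic, not vLLM's actual memory planner: it counts weight memory only, with an assumed ~20% headroom for activations and KV cache. The GPU sizes in the comments are illustrative; benchmark your specific model and workload before committing to hardware.

```python
def vram_estimate_gb(params_billion: float, bits_per_weight: int = 16,
                     overhead: float = 1.2) -> float:
    """Rough VRAM needed to hold model weights, with ~20% headroom
    for activations and KV cache. Real usage varies with context
    length, batch size, and serving engine."""
    bytes_per_weight = bits_per_weight / 8
    return params_billion * bytes_per_weight * overhead

# fp16: 7B params ~= 16.8 GB (fits one 24 GB GPU)
#       70B params ~= 168 GB (multiple GPUs)
# 4-bit quantized 70B ~= 42 GB (one 48 GB GPU, or two 24 GB GPUs)
```

This is why quantization is the usual escape hatch for 70B+ models: dropping from 16-bit to 4-bit weights cuts the footprint roughly 4x, at some quality cost.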

Frequently asked questions

What do I need to self-host AI inference?

An inference engine (vLLM is the standard), GPU compute (RunPod, Modal, or your own hardware), and an open-source model (Llama, Mistral, Qwen, DeepSeek). GPU requirements depend on model size: 7B models fit on one GPU, 70B+ may need multiple GPUs or quantization.
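One practical consequence of using vLLM: it exposes an OpenAI-compatible HTTP API (by default on port 8000 under `/v1`), so existing OpenAI client code mostly works unchanged. A minimal sketch, assuming a locally running server and the `meta-llama/Llama-3.1-8B-Instruct` model name as a placeholder:

```python
def chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat-completions payload; vLLM's server
    accepts the same schema."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

if __name__ == "__main__":
    # Requires a running vLLM server, e.g.:
    #   vllm serve meta-llama/Llama-3.1-8B-Instruct
    from openai import OpenAI  # pip install openai
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        **chat_request("meta-llama/Llama-3.1-8B-Instruct",
                       "Say hello in one sentence."))
    print(resp.choices[0].message.content)
```

Because the wire format matches, swapping between a managed API and your own GPUs is usually just a `base_url` change.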

Is self-hosting AI cheaper than using APIs?

At scale (thousands of requests per day), yes. You trade per-token costs for fixed infrastructure costs. For small workloads, managed APIs are usually cheaper and simpler. The crossover point depends on your volume and model choice.
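The crossover point is simple arithmetic. A minimal sketch, where the GPU rental price, tokens per request, and API price in the example comment are all illustrative assumptions, not quotes:

```python
def monthly_api_cost(requests_per_day: int, tokens_per_request: int,
                     usd_per_million_tokens: float) -> float:
    """Managed-API spend over a 30-day month (illustrative prices only)."""
    tokens = requests_per_day * tokens_per_request * 30
    return tokens / 1_000_000 * usd_per_million_tokens

def breakeven_requests_per_day(gpu_monthly_usd: float,
                               tokens_per_request: int,
                               usd_per_million_tokens: float) -> float:
    """Daily volume at which a fixed GPU bill equals API spend."""
    cost_per_request = tokens_per_request / 1_000_000 * usd_per_million_tokens
    return gpu_monthly_usd / (cost_per_request * 30)

# Example: $1,200/mo GPU, 2,000 tokens/request, $5 per 1M tokens
# -> break-even at 4,000 requests/day
```

Below the break-even volume the fixed GPU bill dominates and the managed API wins; above it, self-hosting wins, before counting the operational work of running it.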

Can I self-host observability for LLMs?

Yes. Langfuse is open source (MIT) and self-hostable via Docker or Kubernetes with no feature gates. Arize Phoenix is also open source and can pipe traces into your existing Grafana or Datadog setup.

Which open-source models are best for self-hosting?

Llama, Mistral, Qwen, and DeepSeek families cover most production use cases. Check license terms as some are more permissive than others. vLLM supports most popular models out of the box.

Last updated: April 2026
