AI Infrastructure Stack

RAG Chatbot Stack

Build a chatbot that answers questions from your own documents, knowledge base, or internal docs. One of the most common first AI projects developers take on.

💬 Document Q&A 🔍 Hybrid search 📊 Measurable quality

Hand-drawn illustration of a RAG pipeline: documents flowing through chunk and embed, to vector DB, through hybrid search to LLM, outputting a chat bubble

Hand-drawn illustration of a RAG pipeline

Inference & Embeddings

☁️

A RAG chatbot needs two types of model calls: an embedding model to turn documents and queries into vectors, and a chat model to generate answers from retrieved context. You can use different providers for each.

OpenAI Free Tier

text-embedding-3-small is the default starting point for embeddings (good cost/quality balance). GPT-4o or GPT-4.1 for generation. Most tutorials and frameworks assume OpenAI.

Cohere Free Tier

embed-v4 for multilingual embeddings (100+ languages). Includes a built-in Rerank API for improving retrieval precision. Good if your docs are in multiple languages.

DeepInfra Free Tier

Run open-source embedding models (BGE, E5) and chat models at low cost. Good for production workloads where per-token pricing matters.

Vector Database

🗄️

Stores your document embeddings and retrieves relevant chunks at query time. The biggest practical tip: use hybrid search (vector + keyword/BM25) from the start. Vector search catches semantics, keyword search catches exact terms, acronyms, and IDs. Combining them with Reciprocal Rank Fusion consistently outperforms either alone.

pgvector Open Source

If you run Postgres, add pgvector for dense search. Pair with PostgreSQL full-text search for basic hybrid retrieval. No extra service to manage. Good up to ~10M vectors.

Qdrant Open Source Free Tier

Built-in hybrid search combining dense vectors with sparse (keyword) retrieval. Fast filtered search for multi-tenant apps ("only search this user's documents").

Weaviate Open Source Free Tier

Native hybrid search with BM25 + vector in a single query. Built-in vectorization modules can embed at ingestion time. Good for teams that want fewer moving parts.

Framework

🔧

For a basic RAG pipeline, you can build it yourself with the provider SDK + a vector DB client. A framework helps when you need advanced chunking strategies, multiple data sources, or reranking. LlamaIndex has the edge here specifically because RAG is its core focus.

LlamaIndex Open Source

Purpose-built for RAG. 160+ data connectors, semantic chunking, parent-child retrieval, hybrid search orchestration. The standard choice when your core problem is search quality over documents.

Haystack Open Source

Modular pipeline architecture from deepset (Berlin). Strong document parsing and preprocessing. Used in production by Airbus and The Economist. More opinionated than LlamaIndex.

LangChain Open Source

Largest ecosystem and most tutorials. For RAG specifically, LlamaIndex and Haystack tend to be recommended over LangChain by teams focused on retrieval quality. Better suited when RAG is one part of a larger agent system.

Evaluation & Observability

🔍

RAG systems degrade silently. Documents change, embeddings go stale, retrieval quality drifts. Without evaluation, you are relying on user complaints to find problems. Start with a small set of question-answer pairs from real usage and measure from day one.

Langfuse Open Source Free Tier

Trace the full RAG pipeline: which chunks were retrieved, what prompt was sent, what the model returned. Framework-agnostic. Free tier with 50K events/month.

Braintrust Free Tier

Evaluation-focused. Build datasets, run your RAG pipeline against them, score with custom metrics. Catches quality regressions before users do.

DeepEval Open Source Free Tier

Open-source evaluation framework for LLMs. Includes RAG-specific metrics (faithfulness, answer relevancy, context precision). Integrates with pytest for CI/CD.

Things to keep in mind

Most RAG failures trace back to chunking, not the model or the vector database. Invest in smart chunking (semantic boundaries, parent-child) before tuning anything else.
Hybrid search (vector + keyword) is one of the most impactful retrieval improvements you can make. It takes a few hours to set up and the gains are usually visible immediately.
Changing your embedding model means re-indexing everything. Pick one early, benchmark on your actual documents, and commit. text-embedding-3-small is a safe starting point.
A reranker between retrieval and generation can improve answer quality. The pattern: retrieve broadly (top 20), rerank precisely (top 5), send only the best chunks to the LLM.

Frequently asked questions

What tools do I need to build a RAG chatbot?

An embedding model to vectorize documents, a vector database to store and search them, an LLM to generate answers from retrieved context, and optionally a framework like LlamaIndex to handle chunking and retrieval logic.

What is hybrid search and why does it matter for RAG?

Hybrid search combines vector similarity search with keyword search (BM25) and merges results using Reciprocal Rank Fusion. Vector search catches semantics while keyword search catches exact terms, acronyms, and IDs. Using both consistently outperforms either alone.

Which embedding model should I use for RAG?

OpenAI text-embedding-3-small is a safe starting point with good cost and quality balance. For multilingual documents, Cohere embed-v4 handles 100+ languages. Changing embedding models requires re-indexing all documents, so benchmark on your actual data before committing.

How do I evaluate RAG quality?

Build a small set of question-answer pairs from real usage and measure context precision (are the right chunks retrieved?), faithfulness (does the answer stick to the context?), and answer relevancy. Tools like DeepEval and Braintrust can automate this.

Last updated: April 2026

Compare vector databases → Browse all stacks

Localhero.ai