AI Infrastructure Stack

RAG Chatbot Stack

Build a chatbot that answers questions from your own documents or internal knowledge base. It is one of the most common first AI projects developers take on.

πŸ’¬ Document Q&A πŸ” Hybrid search πŸ“Š Measurable quality

Things to keep in mind

  • Most RAG failures trace back to chunking, not the model or the vector database. Invest in smart chunking (semantic boundaries, parent-child) before tuning anything else.
  • Hybrid search (vector + keyword) is one of the most impactful retrieval improvements you can make. It takes a few hours to set up and the gains are usually visible immediately.
  • Changing your embedding model means re-indexing everything. Pick one early, benchmark on your actual documents, and commit. text-embedding-3-small is a safe starting point.
  • A reranker between retrieval and generation can improve answer quality. The pattern: retrieve broadly (top 20), rerank precisely (top 5), send only the best chunks to the LLM.
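The retrieve-then-rerank pattern from the last bullet can be sketched as a single function. This is a toy illustration, not a real stack: `vector_search` and `rerank` here are word-overlap stand-ins for your actual vector database client and cross-encoder reranker, and only the shape of the pipeline (broad top-20 retrieval narrowed to a precise top-5) is the point.

```python
def vector_search(query, corpus, top_k):
    # Toy stand-in for embedding search: rank by shared words with the query.
    q_words = set(query.lower().split())
    def score(doc):
        return len(q_words & set(doc.lower().split()))
    return sorted(corpus, key=score, reverse=True)[:top_k]

def rerank(query, chunks, top_n):
    # Toy stand-in for a cross-encoder reranker: prefer chunks that
    # contain the full query phrase, then longer (more complete) chunks.
    def score(chunk):
        return (query.lower() in chunk.lower(), len(chunk))
    return sorted(chunks, key=score, reverse=True)[:top_n]

def retrieve_for_llm(query, corpus, broad_k=20, final_n=5):
    candidates = vector_search(query, corpus, top_k=broad_k)  # retrieve broadly
    return rerank(query, candidates, top_n=final_n)           # rerank precisely
```

Only the final `final_n` chunks are sent to the LLM, which keeps the prompt small while still giving the (cheaper, coarser) first stage a wide net.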

Frequently asked questions

What tools do I need to build a RAG chatbot?

An embedding model to vectorize documents, a vector database to store and search them, an LLM to generate answers from retrieved context, and optionally a framework like LlamaIndex to handle chunking and retrieval logic.
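Those components snap together in a short loop: embed and index at ingest time, then embed the question, find nearest chunks, and prompt the LLM with them. The sketch below assumes hypothetical `embed` and `llm_complete` callables standing in for whichever providers you choose, and uses a plain dict plus cosine similarity in place of a real vector database.

```python
def ingest(documents, embed, store):
    # Vectorize and index each document; `store` plays the vector DB role.
    for doc_id, text in documents.items():
        store[doc_id] = (embed(text), text)

def answer(question, embed, store, llm_complete, top_k=3):
    q_vec = embed(question)

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
        return dot / norm if norm else 0.0

    # Nearest-neighbour search over everything in the store.
    ranked = sorted(store.values(), key=lambda v: cosine(q_vec, v[0]), reverse=True)
    context = "\n".join(text for _, text in ranked[:top_k])
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm_complete(prompt)
```

A framework like LlamaIndex replaces most of this plumbing, but the data flow it manages is exactly this one.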

What is hybrid search and why does it matter for RAG?

Hybrid search combines vector similarity search with keyword search (BM25) and merges results using Reciprocal Rank Fusion. Vector search catches semantics while keyword search catches exact terms, acronyms, and IDs. Using both consistently outperforms either alone.
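Reciprocal Rank Fusion itself is only a few lines: each document scores `1 / (k + rank)` in every ranking it appears in, and the scores are summed. A minimal sketch, with `k=60` as the constant commonly used in the original RRF paper:

```python
def rrf_merge(rankings, k=60):
    # `rankings` is a list of ranked doc-id lists, e.g. one from vector
    # search and one from BM25. Each appearance contributes 1 / (k + rank).
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that both searches agree on float to the top, while a strong showing in either single ranking is still enough to surface a document the other one missed.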

Which embedding model should I use for RAG?

OpenAI text-embedding-3-small is a safe starting point with good cost and quality balance. For multilingual documents, Cohere embed-v4 handles 100+ languages. Changing embedding models requires re-indexing all documents, so benchmark on your actual data before committing.
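"Benchmark on your actual data" can be as simple as measuring recall@k: for each test question, does the known-relevant document land in the top-k results? A minimal harness, assuming `embed` is whatever client you are evaluating (OpenAI, Cohere, or a local model) wrapped as a text-to-vector callable:

```python
def recall_at_k(embed, docs, qa_pairs, k=3):
    # docs: {doc_id: text}; qa_pairs: [(question, relevant_doc_id), ...]
    vectors = {doc_id: embed(text) for doc_id, text in docs.items()}

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
        return dot / norm if norm else 0.0

    hits = 0
    for question, relevant_id in qa_pairs:
        q = embed(question)
        top = sorted(vectors, key=lambda d: cosine(q, vectors[d]), reverse=True)[:k]
        hits += relevant_id in top
    return hits / len(qa_pairs)
```

Run this once per candidate model on a few dozen real questions; the comparison is cheap now and very expensive after you have indexed a million chunks.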

How do I evaluate RAG quality?

Build a small set of question-answer pairs from real usage and measure context precision (are the right chunks retrieved?), faithfulness (does the answer stick to the context?), and answer relevancy. Tools like DeepEval and Braintrust can automate this.
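Before reaching for DeepEval or Braintrust, two of these metrics can be hand-rolled to get a feel for them. The sketch below computes context precision against hand-labeled gold chunk IDs, plus a crude word-overlap proxy for faithfulness; real faithfulness evaluation uses an LLM judge per claim, so treat the second function as illustrative only.

```python
def context_precision(retrieved_ids, gold_ids):
    # What fraction of the retrieved chunks are actually relevant?
    if not retrieved_ids:
        return 0.0
    return sum(cid in gold_ids for cid in retrieved_ids) / len(retrieved_ids)

def faithfulness_proxy(answer, context):
    # Crude proxy: share of the answer's words that appear in the context.
    answer_words = set(answer.lower().split())
    context_words = set(context.lower().split())
    if not answer_words:
        return 0.0
    return len(answer_words & context_words) / len(answer_words)
```

Even 30-50 labeled question-answer pairs are enough to catch regressions when you change chunking, embeddings, or prompts.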

Last updated: April 2026
