Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) is a system design pattern in which a retrieval step fetches relevant documents or passages from an external store, and those documents are concatenated with the user query before being passed to a generative language model. The model can then produce answers grounded in retrieved evidence rather than relying solely on its parametric training knowledge.
How it works
At query time, the user’s input is converted into an embedding and compared against pre-indexed document embeddings in a vector database, typically using approximate nearest-neighbor search. The top-k most semantically similar chunks are retrieved and inserted into the prompt as context, and the model generates a response conditioned on that evidence. Because the knowledge lives in the external store rather than in the model weights, the knowledge base can be updated independently of the model.
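The query-time flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: embed is a toy bag-of-words stand-in for a real embedding model, retrieve does an exact brute-force scan where a vector database would use approximate nearest-neighbor search, and every function name and sample chunk is hypothetical.

```python
from __future__ import annotations

import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Exact top-k search by cosine similarity (assumption: small corpus).

    A real system would embed the chunks once at indexing time and query
    an approximate nearest-neighbor index instead of scanning everything.
    """
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Insert the retrieved chunks into the prompt ahead of the question."""
    joined = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}"
    )


# Hypothetical knowledge-base chunks.
chunks = [
    "The refund window is 30 days from delivery.",
    "Support is available by email on weekdays.",
    "Refund requests need the original order number.",
]
question = "How long is the refund window?"
top = retrieve(question, chunks, k=2)
prompt = build_prompt(question, top)  # this string would be sent to the model
```

Swapping in a real embedding model and an approximate index changes only embed and retrieve; the prompt assembly step stays the same.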
Key facts
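- Components: Chunking, embedding, vector indexing, retrieval, and generation are the core pipeline stages; reranking is a common additional stage between retrieval and generation.
- Context window constraint: Retrieved chunks, together with the query and any instructions, must fit within the model’s context window, so both chunk size and the number of chunks retrieved (k) need tuning.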
- Hybrid search: Combining dense vector search with BM25 keyword search often outperforms either approach alone.
- Reranking: A cross-encoder reranker rescores retrieved candidates to improve relevance before insertion into the prompt.
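Two of the points above lend themselves to a short sketch. The first function fuses a dense ranking and a BM25 ranking with reciprocal rank fusion (RRF), which is one common way to combine them and not the only one; the constant k=60 is a conventional default, assumed here. The second trims a ranked list of chunks to a context budget, using whitespace word counts as a crude proxy for a real tokenizer. The document ids and function names are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


def fit_to_budget(chunks, max_tokens):
    """Keep chunks in ranked order until the token budget would be exceeded.

    Assumption: word count approximates token count. A real pipeline
    would count tokens with the model's own tokenizer.
    """
    kept, used = [], 0
    for chunk in chunks:
        cost = len(chunk.split())
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept


dense_ranking = ["doc_a", "doc_b", "doc_c"]  # hypothetical dense-vector results
bm25_ranking = ["doc_c", "doc_a", "doc_d"]   # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
```

Documents that rank well in both lists (doc_a and doc_c here) rise above those found by only one retriever, which is the intuition behind hybrid search.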
For builders
RAG is a standard approach for building knowledge-base assistants, internal search tools, and document Q&A systems. It avoids the cost and complexity of fine-tuning the model on new knowledge, and answers stay current as long as the knowledge base does. For most retrieval tasks, investing in chunking strategy, metadata filtering, and reranking often yields larger quality improvements per engineering hour than upgrading the base model.
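As a rough illustration of two of those levers, the sketch below shows overlapping word-window chunking and a metadata pre-filter applied before retrieval. The window sizes, the record layout (a dict with "text" and "meta" keys), and the function names are all assumptions made for the example; real systems often chunk by tokens, sentences, or document structure instead.

```python
def chunk_words(text, size=50, overlap=10):
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def filter_by_metadata(records, **required):
    """Keep only records whose metadata matches every required key/value."""
    return [
        r for r in records
        if all(r["meta"].get(key) == value for key, value in required.items())
    ]


# Hypothetical records: chunk text plus metadata attached at indexing time.
records = [
    {"text": "Refunds are issued within 30 days.", "meta": {"team": "billing", "year": 2024}},
    {"text": "VPN setup guide for laptops.", "meta": {"team": "it", "year": 2024}},
    {"text": "Old refund policy, superseded.", "meta": {"team": "billing", "year": 2021}},
]
candidates = filter_by_metadata(records, team="billing", year=2024)
windows = chunk_words(" ".join(str(i) for i in range(10)), size=4, overlap=1)
```

Filtering on metadata first narrows the candidate set before any similarity search runs, which tends to help both relevance and latency; the overlap keeps sentences that straddle a chunk boundary retrievable from either side.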
Sources
- Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401. arxiv.org
- Gao, Y., et al. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997. arxiv.org
- Reimers, N., Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv:1908.10084. arxiv.org
- Pinecone. What is a vector database? pinecone.io
- Johnson, J., Douze, M., Jegou, H. (2017). Billion-scale similarity search with GPUs (FAISS). Facebook Research. github.com