Retrieval-Augmented Generation — RAG — is a technique for improving language model outputs by giving the model access to relevant external information at query time. It addresses two of the most significant limitations of vanilla LLMs: outdated knowledge and hallucination.
The problem RAG solves
When a language model is trained, its knowledge is frozen at that point in time. Ask it about something that happened after its training cutoff, and it cannot know. Ask it about something obscure that was underrepresented in its training data, and it may confabulate a plausible-sounding but incorrect answer.
RAG addresses this by adding a retrieval step before generation. Instead of asking the model to answer from its internal weights alone, the system first retrieves relevant documents and includes them in the prompt. The model then generates its response grounded in those documents.
The architecture of a RAG system
A standard RAG pipeline has three main components:
1. A document store
Your knowledge base — whether that is company documentation, a collection of research papers, a database of product information, or any other corpus — is stored in a searchable format. For semantic search (searching by meaning rather than exact keywords), documents are typically stored in a vector database as numerical embeddings.
2. A retrieval system
When a query arrives, the system retrieves the most relevant documents. This may use:
- Dense retrieval (embedding the query and finding nearest neighbors in vector space)
- Sparse retrieval (BM25 or keyword-based approaches)
- Hybrid approaches combining both
The retrieved passages (in practice, chunks of the source documents, as described under chunking below) are ranked by relevance, and the top few results are selected.
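One way to combine dense and sparse results in a hybrid setup is reciprocal rank fusion (RRF). The text above does not prescribe a fusion method, so treat this as an illustrative sketch. The function name, the document ids, and the constant k=60 (a commonly used default) are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by both retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a dense retriever and a sparse (e.g. BM25) retriever.
dense_ranking = ["doc_a", "doc_b", "doc_c"]
sparse_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

RRF uses only rank positions, so it sidesteps the problem that dense similarity scores and BM25 scores live on different scales.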
3. A language model
The retrieved documents are inserted into the model's prompt, along with the original query. A typical prompt template might look like:
Use the following documents to answer the question.
If the answer is not in the documents, say so.
Documents:
[retrieved chunks]
Question: [user query]
The model generates its response based on both its training and the provided context.
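Filling in the template above is a matter of string formatting. The following is a minimal sketch: the `build_prompt` helper, the example chunks, and the decision to number each chunk (which can make it easier for the model to point at a source) are assumptions for illustration.

```python
PROMPT_TEMPLATE = (
    "Use the following documents to answer the question.\n"
    "If the answer is not in the documents, say so.\n"
    "Documents:\n"
    "{documents}\n"
    "Question: {question}"
)


def build_prompt(chunks, question):
    """Insert retrieved chunks and the user query into the prompt template."""
    # Number each chunk so the model can refer to it, e.g. "[2]".
    numbered = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return PROMPT_TEMPLATE.format(documents=numbered, question=question)


# Hypothetical retrieved chunks and query.
prompt = build_prompt(
    ["Refunds are accepted within 30 days.", "Shipping takes 3-5 business days."],
    "How long do I have to request a refund?",
)
print(prompt)
```

The resulting string would then be sent to whichever language model the system uses. That call is omitted here because it depends on the provider.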
Chunking strategy matters
Documents need to be broken into manageable pieces (chunks) before being embedded and stored. The chunking strategy significantly affects retrieval quality:
- Fixed-size chunks: Simple but may break sentences or paragraphs at awkward points
- Semantic chunks: Split at natural boundaries (paragraphs, sections)
- Sliding window: Overlapping chunks to avoid missing context at boundaries
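Typical chunk sizes range from 256 to 1024 tokens, with overlap of 10–20% between adjacent chunks. Treat these as starting points rather than rules: the best values depend on the corpus, the embedding model, and the kinds of questions being asked.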
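The fixed-size and sliding-window strategies can be sketched together, since a sliding window is just fixed-size chunking with a step smaller than the chunk size. This is a minimal sketch. The function name is hypothetical, and the example uses a plain list of words in place of real tokenizer output, which a production system would not do.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=32):
    """Split a token list into chunks of `chunk_size` that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a chunk has reached the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Toy input: ten "tokens" (an assumption standing in for tokenizer output).
words = [f"w{i}" for i in range(10)]
chunks = chunk_tokens(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

With `overlap=0` the function reduces to plain fixed-size chunking. Semantic chunking would instead split on paragraph or section boundaries before applying a size limit.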
Vector embeddings and search
The core of semantic retrieval is embedding: converting text into high-dimensional numerical vectors that encode meaning. Semantically similar texts tend to have vectors that are close together in this space, with closeness usually measured by cosine similarity or a related distance metric.
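Both the documents and incoming queries are embedded using the same model, so that their vectors are comparable. Retrieval becomes a nearest-neighbor search: find the document chunks whose embeddings are most similar to the query embedding. This is what vector databases such as Pinecone, Weaviate, and Chroma, and extensions such as pgvector for PostgreSQL, are designed to do efficiently at scale, typically by using approximate nearest-neighbor indexes rather than comparing the query against every stored vector.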
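The nearest-neighbor step can be illustrated with a brute-force search over a tiny in-memory index. This is only a sketch. The three-dimensional vectors are made up by hand rather than produced by an embedding model, the chunk ids and function names are hypothetical, and a real vector database would use an index rather than scanning every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=2):
    """Return the k (chunk_id, score) pairs most similar to the query vector."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in index.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(chunk_id, score) for score, chunk_id in scored[:k]]


# Hand-written toy "embeddings" (assumed values, not real model output).
index = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "api-auth": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.2, 0.0]  # stands in for the embedded user query

results = top_k(query_vec, index, k=2)
for chunk_id, score in results:
    print(f"{chunk_id}: {score:.3f}")
```

Brute-force search like this is linear in the number of stored vectors, which is why larger systems rely on approximate indexes.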
Limitations and trade-offs
RAG is powerful but not a complete solution:
Retrieval failures: If the relevant information is not retrieved — due to poor chunking, poor embeddings, or the information simply not being in the corpus — the model may still hallucinate or say it does not know.
Context window limits: The prompt, including every retrieved chunk, has to fit within the model's context window, so only a limited number of chunks can be included. With very large document sets, you are always retrieving a small fraction of the available information.
Faithfulness vs. creativity: RAG systems can be tuned to be more or less faithful to retrieved documents. Stricter prompting reduces hallucination but may also reduce the model's ability to synthesize across sources.
Latency: Each query requires a retrieval step, adding latency compared to pure generation.
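The context window limit described above is often handled by packing ranked chunks into a fixed token budget. The following is a minimal sketch under stated assumptions: the function name is hypothetical, and token counts are approximated by whitespace-separated words, whereas a real system would count with the model's own tokenizer.

```python
def select_within_budget(ranked_chunks, max_tokens, count_tokens=None):
    """Greedily keep the highest-ranked chunks that fit in `max_tokens`.

    `ranked_chunks` must already be sorted from most to least relevant.
    """
    if count_tokens is None:
        # Crude stand-in for a tokenizer: count whitespace-separated words.
        count_tokens = lambda text: len(text.split())

    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop at the first chunk that does not fit
        selected.append(chunk)
        used += cost
    return selected, used


# Hypothetical ranked chunks with 3, 4, and 2 "tokens" respectively.
ranked = ["alpha beta gamma", "delta epsilon zeta eta", "theta iota"]
selected, used = select_within_budget(ranked, max_tokens=8)
print(selected, used)
```

Stopping at the first chunk that does not fit keeps the selection strictly in rank order. Skipping that chunk and trying smaller ones further down is a reasonable alternative that uses the budget more fully.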
When to use RAG
RAG is appropriate when:
- You need the model to work with private or proprietary information not in its training data
- Your information changes frequently (product catalogs, documentation, news)
- You need to reduce hallucination on domain-specific questions
- You need to cite sources in the model's output
RAG is less appropriate when:
- The task is purely generative (creative writing, brainstorming)
- The required knowledge is well-represented in the base model's training
- Latency requirements are extremely tight
Summary
RAG extends language models by adding a retrieval step that fetches relevant documents and includes them in the prompt. This can reduce hallucination, enables access to up-to-date or private information, and can make model outputs more verifiable. The main components are a document store (often vector-based), a retrieval mechanism, and the language model itself. Because the model can only ground its answer in what is retrieved, the quality of the retrieval step is as important as the quality of the model.