Retrieval-Augmented Generation Explained

Retrieval-Augmented Generation — RAG — is a technique for improving language model outputs by giving the model access to relevant external information at query time. It addresses two of the most significant limitations of vanilla LLMs: outdated knowledge and hallucination.

The problem RAG solves

When a language model is trained, its knowledge is frozen at that point in time. Ask it about something that happened after its training cutoff, and it cannot know. Ask it about something obscure that was underrepresented in its training data, and it may confabulate a plausible-sounding but incorrect answer.

RAG addresses this by adding a retrieval step before generation. Instead of asking the model to answer from its internal weights alone, the system first retrieves relevant documents and includes them in the prompt. The model then generates its response grounded in those documents.

The architecture of a RAG system

A standard RAG pipeline has three main components:

1. A document store

Your knowledge base — whether that is company documentation, a collection of research papers, a database of product information, or any other corpus — is stored in a searchable format. For semantic search (searching by meaning rather than exact keywords), documents are typically stored in a vector database as numerical embeddings.

2. A retrieval system

When a query arrives, the system retrieves the most relevant documents. This may use:

Dense retrieval (embedding the query and finding nearest neighbors in vector space)
Sparse retrieval (BM25 or keyword-based approaches)
Hybrid approaches combining both

The retrieved results (in practice, chunks of documents rather than whole documents) are ranked by relevance, and the top few are selected.
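As an illustration of the hybrid option, here is a minimal sketch that merges a dense ranking and a sparse ranking with reciprocal rank fusion (RRF). RRF is one common fusion method, chosen here for illustration and not prescribed by this article. The chunk IDs, the function name, and the two input rankings are hypothetical; in a real system they would come from an embedding search and a BM25 search.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60, top_n=3):
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each chunk earns 1 / (k + rank) from every list it appears in,
    with rank starting at 1. k=60 is a commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk ID for determinism.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_n]

# Hypothetical results from the two retrievers, best match first.
dense_ranking = ["c2", "c7", "c1"]   # assumed output of an embedding search
sparse_ranking = ["c7", "c4", "c2"]  # assumed output of a BM25 search

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Chunks that appear near the top of both lists rise to the top of the fused list, which is the behavior a hybrid retriever is usually after.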

3. A language model

The retrieved documents are inserted into the model's prompt, along with the original query. A typical prompt template might look like:

Use the following documents to answer the question.
If the answer is not in the documents, say so.

Documents:
[retrieved chunks]

Question: [user query]

The model generates its response based on both its training and the provided context.
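The template above can be filled in with a few lines of code. This is a sketch, not a fixed recipe: the function name `build_prompt` is hypothetical, and numbering the chunks is an added assumption that makes it easier for the model to point back to a source.

```python
def build_prompt(chunks, question):
    """Fill the prompt template with retrieved chunks and the user query."""
    # Number each chunk (an assumption, not part of the template itself)
    # so that an answer can refer to a specific source.
    documents = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Use the following documents to answer the question.\n"
        "If the answer is not in the documents, say so.\n\n"
        f"Documents:\n{documents}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    [
        "RAG adds a retrieval step before generation.",
        "Chunks are ranked by relevance.",
    ],
    "What does RAG add before generation?",
)
print(prompt)
```

The resulting string is what gets sent to the language model; how it is sent depends on the model API in use.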

Chunking strategy matters

Documents need to be broken into manageable pieces (chunks) before being embedded and stored. The chunking strategy significantly affects retrieval quality:

Fixed-size chunks: Simple but may break sentences or paragraphs at awkward points
Semantic chunks: Split at natural boundaries (paragraphs, sections)
Sliding window: Overlapping chunks to avoid missing context at boundaries

Typical chunk sizes range from 256 to 1024 tokens, with overlap of 10–20%.
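A sliding-window chunker can be sketched as follows. It assumes the text has already been tokenized into a list; the function name, the default values, and the placeholder tokens are illustrative only.

```python
def sliding_window_chunks(tokens, chunk_size=256, overlap=0.15):
    """Split a token list into fixed-size chunks that overlap their neighbors."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    # The step is how far the window advances; at least 1 so the loop ends.
    step = max(1, int(chunk_size * (1 - overlap)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks

# Placeholder tokens stand in for real tokenizer output.
tokens = [f"t{i}" for i in range(10)]
chunks = sliding_window_chunks(tokens, chunk_size=4, overlap=0.25)
for chunk in chunks:
    print(chunk)
```

With a chunk size of 4 and 25% overlap, each chunk starts 3 tokens after the previous one, so neighboring chunks share one token.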

Vector embeddings and search

The core of semantic retrieval is embedding — converting text into high-dimensional numerical vectors that encode meaning. Semantically similar text will have vectors that are close together in this space.

Both the documents and incoming queries must be embedded using the same model, because vectors produced by different models are not comparable. Retrieval becomes a nearest-neighbor search: find the document chunks whose embeddings are most similar to the query embedding. Vector databases such as Pinecone, Weaviate, and Chroma, and extensions such as pgvector for PostgreSQL, are designed to do this efficiently at scale, usually with approximate nearest-neighbor indexes.
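To make the nearest-neighbor idea concrete, here is a brute-force sketch using cosine similarity. The three-dimensional vectors and chunk names are made up for illustration; real embeddings come from an embedding model and have hundreds or thousands of dimensions, and a vector database would replace the linear scan with an index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return the IDs of the k chunks most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    ]
    # Highest similarity first; ties broken by chunk ID.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [chunk_id for _, chunk_id in scored[:k]]

# Toy embeddings (hypothetical values, not produced by any real model).
chunk_vecs = {
    "pricing": [0.9, 0.1, 0.0],
    "refunds": [0.7, 0.6, 0.1],
    "careers": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

nearest = top_k(query_vec, chunk_vecs, k=2)
print(nearest)
```

The scan compares the query against every chunk, which is fine for a toy corpus but is exactly the cost that dedicated vector indexes are built to avoid.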

Limitations and trade-offs

RAG is powerful but not a complete solution:

Retrieval failures: If the relevant information is not retrieved (because of poor chunking, poor embeddings, or the information simply not being in the corpus), the model may hallucinate anyway or, at best, say it does not know.

Context window limits: A prompt can hold only as many retrieved chunks as fit in the model's context window. With very large document sets, you are always retrieving a small fraction of the available information.

Faithfulness vs. creativity: RAG systems can be tuned to be more or less faithful to retrieved documents. Stricter prompting reduces hallucination but may also reduce the model's ability to synthesize across sources.

Latency: Each query requires a retrieval step, adding latency compared to pure generation.
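The context window limit, in particular, is often handled by giving retrieved chunks a token budget. The sketch below shows one simple greedy policy under stated assumptions: the chunks arrive sorted best-first, the function name is hypothetical, and counting whitespace-separated words is only a rough stand-in for the model's real tokenizer.

```python
def select_within_budget(ranked_chunks, max_tokens, count_tokens=None):
    """Keep the highest-ranked chunks that fit within a token budget."""
    if count_tokens is None:
        # Crude approximation: one token per whitespace-separated word.
        count_tokens = lambda text: len(text.split())
    selected, used = [], 0
    for chunk in ranked_chunks:  # assumed sorted best-first
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            continue  # skip this chunk; a shorter one later may still fit
        selected.append(chunk)
        used += cost
    return selected

ranked = ["a b c d", "e f g h i j", "k l"]
kept = select_within_budget(ranked, max_tokens=7)
print(kept)
```

Skipping an oversized chunk instead of stopping at it is a design choice: it fills the budget more fully, at the cost of sometimes passing over a higher-ranked chunk.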

When to use RAG

RAG is appropriate when:

You need the model to work with private or proprietary information not in its training data
Your information changes frequently (product catalogs, documentation, news)
You need to reduce hallucination on domain-specific questions
You need to cite sources in the model's output

RAG is less appropriate when:

The task is purely generative (creative writing, brainstorming)
The required knowledge is well-represented in the base model's training
Latency requirements are extremely tight

Summary

RAG extends language models by adding a retrieval step that fetches relevant documents and includes them in the prompt. This reduces hallucination, enables access to up-to-date or private information, and can make model outputs more verifiable. The main components are a document store (often vector-based), a retrieval mechanism, and the language model itself. The quality of the retrieval step is as important as the quality of the model.

