Retrieval-Augmented Generation Explained

RAG combines a language model with a search system to reduce hallucinations and give AI access to up-to-date information. Here is how it works.

Explicor
4 min read

Retrieval-Augmented Generation — RAG — is a technique for improving language model outputs by giving the model access to relevant external information at query time. It addresses two of the most significant limitations of vanilla LLMs: outdated knowledge and hallucination.

The problem RAG solves

When a language model is trained, its knowledge is frozen at that point in time. Ask it about something that happened after its training cutoff, and it cannot know. Ask it about something obscure that was underrepresented in its training data, and it may confabulate a plausible-sounding but incorrect answer.

RAG addresses this by adding a retrieval step before generation. Instead of asking the model to answer from its internal weights alone, the system first retrieves relevant documents and includes them in the prompt. The model then generates its response grounded in those documents.
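The retrieve-then-generate flow described above can be sketched in a few lines. This is a minimal illustration, not a production design: the word-overlap scoring is a deliberately naive stand-in for a real retriever, and `generate` is a hypothetical callable representing whatever model API is in use.

```python
from typing import Callable, List


def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank documents by how many query words they share."""
    q_words = set(query.lower().split())
    scored = [(len(q_words & set(doc.lower().split())), doc) for doc in corpus]
    # Stable sort: documents with equal scores keep their corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop documents that share no words with the query.
    return [doc for score, doc in scored[:k] if score > 0]


def answer(query: str, corpus: List[str],
           generate: Callable[[str], str], k: int = 2) -> str:
    """Retrieve first, then pass the documents and the question to the model."""
    docs = retrieve(query, corpus, k)
    prompt = "Documents:\n" + "\n".join(docs) + "\n\nQuestion: " + query
    # `generate` is an assumption: any function mapping a prompt to text.
    return generate(prompt)
```

Passing `generate=lambda prompt: prompt` is a convenient way to inspect exactly what the model would receive.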

The architecture of a RAG system

A standard RAG pipeline has three main components:

1. A document store

Your knowledge base — whether that is company documentation, a collection of research papers, a database of product information, or any other corpus — is stored in a searchable format. For semantic search (searching by meaning rather than exact keywords), documents are typically stored in a vector database as numerical embeddings.

2. A retrieval system

When a query arrives, the system retrieves the most relevant documents. This may use:

  • Dense retrieval (embedding the query and finding nearest neighbors in vector space)
  • Sparse retrieval (BM25 or keyword-based approaches)
  • Hybrid approaches combining both

The retrieved chunks are ranked by relevance, and the top results are selected.
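One common way to combine a dense and a sparse ranking is reciprocal rank fusion (RRF). The article does not prescribe a fusion method, so treat this as one option among several. The sketch assumes each retriever returns a list of document IDs ordered best-first, and uses the conventional smoothing constant of 60.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several best-first rankings into one.

    Each document earns 1 / (k + rank) from every ranking it appears in,
    so documents ranked highly by multiple retrievers rise to the top.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

Because RRF uses only rank positions, it avoids having to normalize dense similarity scores and BM25 scores onto a shared scale.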

3. A language model

The retrieved documents are inserted into the model's prompt, along with the original query. A typical prompt template might look like:

Use the following documents to answer the question.
If the answer is not in the documents, say so.

Documents:
[retrieved chunks]

Question: [user query]

The model generates its response based on both its training and the provided context.
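Filling the template above programmatically might look like the following sketch. The `[1]`, `[2]` numbering of chunks is an addition not present in the template; it is included on the assumption that numbered chunks make it easier for the model to refer back to a specific source.

```python
from typing import List

PROMPT_TEMPLATE = (
    "Use the following documents to answer the question.\n"
    "If the answer is not in the documents, say so.\n"
    "\n"
    "Documents:\n"
    "{documents}\n"
    "\n"
    "Question: {question}"
)


def build_prompt(chunks: List[str], question: str) -> str:
    """Insert retrieved chunks and the user query into the template."""
    # Number each chunk (an assumption, not part of the original template).
    numbered = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return PROMPT_TEMPLATE.format(documents=numbered, question=question)
```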

Chunking strategy matters

Documents need to be broken into manageable pieces (chunks) before being embedded and stored. The chunking strategy significantly affects retrieval quality:

  • Fixed-size chunks: Simple but may break sentences or paragraphs at awkward points
  • Semantic chunks: Split at natural boundaries (paragraphs, sections)
  • Sliding window: Overlapping chunks to avoid missing context at boundaries

Typical chunk sizes range from 256 to 1024 tokens, with overlap of 10–20%.
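A sliding-window chunker can be sketched as follows. It assumes the text has already been split into tokens; in practice the tokens would come from the embedding model's own tokenizer, while splitting on whitespace is a rough approximation for experimentation. The defaults (256 tokens with a 32-token overlap, or 12.5%) are simply one point inside the ranges quoted above.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 256,
                 overlap: int = 32) -> List[List[str]]:
    """Split tokens into fixed-size chunks that overlap by `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a chunk reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With `overlap=0` this reduces to plain fixed-size chunking, so the same function covers the first and third strategies in the list.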

Embeddings and vector search

The core of semantic retrieval is embedding: converting text into high-dimensional numerical vectors that encode meaning. Semantically similar text will have vectors that are close together in this space.

Both the documents and incoming queries are embedded using the same model. Retrieval becomes a nearest-neighbor search: find the document chunks whose embeddings are most similar to the query embedding. This is what vector databases like Pinecone, Weaviate, Chroma, and pgvector are designed to do efficiently at scale.
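The nearest-neighbor step can be illustrated with a brute-force search using cosine similarity, one widely used similarity measure. The sketch assumes the embeddings have already been computed by some model and are available as plain lists of floats; it compares the query against every chunk, whereas vector databases typically rely on approximate indexes to avoid that full scan.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          chunk_vecs: List[Sequence[float]],
          k: int = 3) -> List[Tuple[int, float]]:
    """Return (chunk index, similarity) pairs for the k most similar chunks."""
    scored = [
        (i, cosine_similarity(query_vec, vec))
        for i, vec in enumerate(chunk_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The returned indices point back into the list of chunks, so the matching text can be looked up and placed in the prompt.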

Limitations and trade-offs

RAG is powerful but not a complete solution:

Retrieval failures: If the relevant information is not retrieved — due to poor chunking, poor embeddings, or the information simply not being in the corpus — the model may still hallucinate or say it does not know.

Context window limits: A prompt can hold only as many retrieved chunks as fit in the model's context window. With a very large document set, each query therefore sees only a small fraction of the available information.

Faithfulness vs. creativity: RAG systems can be tuned to be more or less faithful to retrieved documents. Stricter prompting reduces hallucination but may also reduce the model's ability to synthesize across sources.

Latency: Each query requires a retrieval step, adding latency compared to pure generation.
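The context window limit listed above is often handled by packing the highest-ranked chunks into a fixed token budget. The sketch below is one simple greedy policy, not the only one. It assumes the chunks arrive sorted best-first, and its default `count_tokens` merely counts whitespace-separated words; a real system would substitute the model's tokenizer.

```python
from typing import Callable, List


def pack_chunks(
    ranked_chunks: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> List[str]:
    """Keep the best-ranked chunks that fit within `max_tokens` in total."""
    selected: List[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            # Stop at the first chunk that does not fit, so the prompt
            # never skips a higher-ranked chunk in favor of a lower one.
            break
        selected.append(chunk)
        used += cost
    return selected
```

Replacing `break` with `continue` would instead fill leftover space with smaller, lower-ranked chunks, trading strict rank order for fuller use of the budget.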

When to use RAG

RAG is appropriate when:

  • You need the model to work with private or proprietary information not in its training data
  • Your information changes frequently (product catalogs, documentation, news)
  • You need to reduce hallucination on domain-specific questions
  • You need to cite sources in the model's output

RAG is less appropriate when:

  • The task is purely generative (creative writing, brainstorming)
  • The required knowledge is well-represented in the base model's training
  • Latency requirements are extremely tight

Summary

RAG extends language models by adding a retrieval step that fetches relevant documents and includes them in the prompt. This reduces hallucination, enables access to up-to-date or private information, and can make model outputs more verifiable. The main components are a document store (often vector-based), a retrieval mechanism, and the language model itself. The quality of the retrieval step is as important as the quality of the model.
