Every interaction with a large language model happens within a context window — a finite amount of text the model can process at once. Understanding this constraint is fundamental to designing AI applications that work reliably.
What a context window is
When you interact with a language model, the model does not have any persistent memory between conversations. What it can "see" is entirely determined by what is included in the current context window.
The context window contains everything: your system prompt, the conversation history, any retrieved documents you have included, and the current message. The model processes all of this together as one input sequence, and the tokens it generates in response typically count against the same window.
Context windows are measured in tokens — the sub-word units that models process. As a rough approximation, one token is about three-quarters of an English word. A 128,000-token context window can hold roughly 96,000 words — approximately the length of a novel.
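The arithmetic above can be sketched in a few lines. This is only the rule of thumb from the text (about 0.75 English words per token), not a real tokenizer; the function names are illustrative, and actual counts vary by model, language, and content.

```python
# Rule of thumb from the text: one token is roughly 0.75 English words.
# This is an approximation, not a tokenizer.
WORDS_PER_TOKEN = 0.75


def estimate_words(tokens: int) -> int:
    """Approximate how many English words fit in a given number of tokens."""
    return round(tokens * WORDS_PER_TOKEN)


def estimate_tokens(words: int) -> int:
    """Approximate how many tokens a given number of English words needs."""
    return round(words / WORDS_PER_TOKEN)


print(estimate_words(128_000))  # 96000
print(estimate_tokens(96_000))  # 128000
```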
Why context windows matter
Everything you need must fit: If relevant context exceeds the window size, some of it must be left out. This is a fundamental constraint for applications that need to reason over large bodies of text.
Cost scales with context size: Most LLM providers charge per token. A longer context means higher cost per call. For high-volume applications, this is significant.
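To make the scaling concrete, here is a minimal cost sketch. The prices are hypothetical placeholders, not any provider's actual rates, and the function name is made up for illustration; it assumes the common scheme of separate per-million-token prices for input and output.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Cost of one call, given per-million-token prices for input and output."""
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000


# Hypothetical prices (currency units per million tokens), for illustration only.
INPUT_PRICE = 3.0
OUTPUT_PRICE = 15.0

short_call = estimate_cost(2_000, 500, INPUT_PRICE, OUTPUT_PRICE)
long_call = estimate_cost(100_000, 500, INPUT_PRICE, OUTPUT_PRICE)

# Same response length, but the long-context call costs far more per request.
print(f"short context: {short_call:.4f} per call")
print(f"long context:  {long_call:.4f} per call")
print(f"at 10,000 calls per day: {short_call * 10_000:.2f} vs {long_call * 10_000:.2f}")
```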
Attention and the "lost in the middle" problem: Research has shown that language models attend unevenly to context — information at the beginning and end of the context window tends to be recalled better than information in the middle. For very long contexts, important information buried in the middle may be effectively ignored.
Latency scales with context: Larger contexts require more computation. Time-to-first-token increases with context length.
How context windows have changed
Context windows have grown dramatically over recent years:
- GPT-3 (2020): 2,048 tokens
- GPT-4 (2023): 8,192 tokens at launch, 128,000 tokens with GPT-4 Turbo
- Claude 3 (2024): 200,000 tokens
- Gemini 1.5 Pro (2024): 1,000,000 tokens
This growth enables new use cases — processing entire codebases, books, or document collections in a single context. But it does not eliminate the "lost in the middle" problem or the cost implications.
Strategies for working within context limits
Chunking and retrieval: Rather than including all available information, split the source material into chunks and use retrieval-augmented generation (RAG) to include only the most relevant pieces. This keeps the context focused and reduces cost.
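A toy version of the idea, as a sketch only: it scores chunks by word overlap with the query, which is a deliberately simple stand-in for the embedding-based similarity search a real retrieval system would more likely use. All function names here are illustrative.

```python
def chunk_text(text: str, chunk_words: int = 50) -> list[str]:
    """Split text into consecutive chunks of at most `chunk_words` words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]


def score(query: str, chunk: str) -> int:
    """Count distinct lowercase words shared by the query and the chunk.

    Stand-in for a real relevance measure such as embedding similarity.
    """
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the `top_k` chunks with the highest overlap score."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    return ranked[:top_k]


chunks = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping takes three to five business days.",
]

# Only the best-matching chunk goes into the prompt, not the whole corpus.
print(retrieve("how long do refunds take", chunks, top_k=1))
```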
Summarization: Summarize conversation history or long documents before including them in the context. This trades completeness for conciseness: any detail the summary drops is no longer available to the model.
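One way this could look for conversation history, sketched under assumptions: older messages are collapsed into a single summary line while the most recent ones are kept verbatim. The `summarize` argument is a hypothetical callable; in practice it would likely be a call to a language model, and here a trivial placeholder stands in for it.

```python
from typing import Callable


def compact_history(
    messages: list[str],
    keep_recent: int,
    summarize: Callable[[str], str],
) -> list[str]:
    """Replace all but the last `keep_recent` messages with one summary line."""
    if keep_recent < 0:
        raise ValueError("keep_recent must be non-negative")
    if len(messages) <= keep_recent:
        return list(messages)
    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]
    summary = summarize("\n".join(older))
    return [f"Summary of earlier conversation: {summary}"] + recent


def placeholder_summarize(text: str) -> str:
    """Placeholder only: reports how many messages were folded away."""
    return f"{len(text.splitlines())} earlier messages"


history = ["m1", "m2", "m3", "m4"]
print(compact_history(history, keep_recent=2, summarize=placeholder_summarize))
```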
Prioritized placement: Put the most important information at the beginning or end of the context, where models tend to attend to it more reliably.
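A small sketch of one possible placement scheme, assuming the items have already been ranked from most to least relevant. It alternates items between the front and the back of the context so that the least relevant ones end up in the middle. The function name is illustrative, and whether this ordering helps for a given model is something to measure rather than assume.

```python
def place_by_priority(ranked: list[str]) -> list[str]:
    """Reorder items ranked most-relevant-first so the top items sit at the edges.

    Even-indexed items fill the front in order; odd-indexed items fill the
    back in reverse, pushing the least relevant items toward the middle.
    """
    front: list[str] = []
    back: list[str] = []
    for index, item in enumerate(ranked):
        if index % 2 == 0:
            front.append(item)
        else:
            back.append(item)
    return front + back[::-1]


# "A" is the most relevant item, "E" the least.
print(place_by_priority(["A", "B", "C", "D", "E"]))  # ['A', 'C', 'E', 'D', 'B']
```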
Token budgeting: Track how many tokens different parts of your prompt use. Reserve tokens for the model's response. Use counting libraries (like tiktoken for OpenAI models) to avoid unexpected truncation.
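A minimal budgeting sketch follows. The window size and response reserve are hypothetical values, and `count_tokens` is a rough word-based stand-in built on the three-quarters-of-a-word rule of thumb; for real use it should be replaced with the tokenizer that matches your model (for example tiktoken for OpenAI models).

```python
import math

CONTEXT_WINDOW = 8_192    # hypothetical model limit, in tokens
RESPONSE_RESERVE = 1_024  # tokens held back for the model's reply


def count_tokens(text: str) -> int:
    """Rough estimate: about 0.75 words per token. Not a real tokenizer."""
    return math.ceil(len(text.split()) / 0.75)


def budget_report(parts: dict[str, str]) -> dict[str, int]:
    """Return per-part token estimates plus the total and the remaining budget."""
    usage = {name: count_tokens(text) for name, text in parts.items()}
    total_input = sum(usage.values())
    usage["total_input"] = total_input
    usage["remaining"] = CONTEXT_WINDOW - RESPONSE_RESERVE - total_input
    return usage


report = budget_report({
    "system": "You are a helpful assistant.",
    "history": "User asked about refunds. Assistant explained the policy.",
    "message": "Can I still return an item after three weeks?",
})
print(report)
```

A negative `remaining` value means the prompt has to shrink before the call is made.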
Sliding window: For very long documents, use a sliding window: process overlapping chunks and combine the results. The overlap preserves context that would otherwise be cut at chunk boundaries.
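The windowing step might be sketched as below. It operates on a list of tokens (any sequence of units would do), the function and parameter names are illustrative, and how the per-window results are combined is left out because it depends on the task.

```python
def sliding_windows(tokens: list[str], window: int, overlap: int) -> list[list[str]]:
    """Split `tokens` into windows of up to `window` items sharing `overlap` items."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap
    windows: list[list[str]] = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + window])
        # Stop once a window reaches the end, to avoid a redundant tail window.
        if start + window >= len(tokens):
            break
    return windows


for chunk in sliding_windows(list("abcdefghij"), window=4, overlap=1):
    print("".join(chunk))
# abcd
# defg
# ghij
```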
Truncation and its risks
When a context exceeds the window limit, one of two things happens, depending on the provider and on any library sitting between your code and the API:
- The API returns an error
- The context is silently truncated
Silent truncation is dangerous because you may not notice that critical information was dropped. Always count the tokens in your context before sending it, and fail loudly if the count exceeds your limit.
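A pre-send guard could look like the following sketch. The exception class and function name are hypothetical, and the token count passed in is assumed to come from a tokenizer that matches your model. The point is simply to raise an error in your own code rather than rely on how the API or a client library handles overflow.

```python
class ContextTooLongError(ValueError):
    """Raised when a prompt would not fit in the model's context window."""


def ensure_fits(prompt_tokens: int, context_window: int, reserved_for_output: int) -> int:
    """Return the number of spare tokens, or raise if the prompt does not fit."""
    available = context_window - reserved_for_output
    if prompt_tokens > available:
        raise ContextTooLongError(
            f"prompt uses {prompt_tokens} tokens but only {available} are available "
            f"({context_window} window minus {reserved_for_output} reserved for output)"
        )
    return available - prompt_tokens


print(ensure_fits(1_000, context_window=8_192, reserved_for_output=1_024))  # 6168

try:
    ensure_fits(8_000, context_window=8_192, reserved_for_output=1_024)
except ContextTooLongError as error:
    print(f"refusing to send: {error}")
```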
The future
Context windows continue to grow, but so do use cases. The fundamental trade-off — more context means more cost, more latency, and potentially more attention dilution — will remain relevant even as the absolute limits increase.
Retrieval approaches that intelligently select what to include will remain valuable even with much larger windows, because they focus the model's attention on what is relevant rather than flooding the context with everything available.
Summary
The context window is the total amount of text a language model can process in a single interaction. Everything the model can "see" must fit within it. Context windows are measured in tokens; as context length increases, cost and latency rise and attention quality can degrade. Key strategies for working within context limits include RAG-based retrieval, summarization, careful placement of important information, and explicit token budget management.