Architecture · 7 min read · 30 September 2025

LLM Context Window Management: How to Work Effectively Within Token Limits

Context window limits constrain what information you can give an LLM in a single call. Managing this constraint is an engineering discipline with significant quality and cost implications.


Ajay Prajapat

AI Systems Architect

Context window size — the amount of information an LLM can process in a single call — has expanded dramatically, with some models supporting millions of tokens. But larger context windows do not eliminate context management as an engineering concern. Long contexts increase cost, increase latency, and can degrade model performance as the model's attention is distributed across more content. Context management is the discipline of giving the model exactly the information it needs — not everything you have.

What Goes in the Context Window and Why It Matters

  • System prompt: instructions, persona, rules, output format specification — counted on every single API call
  • Retrieved context (RAG): document chunks retrieved to ground the response — 500-3,000 tokens typically
  • Conversation history: prior turns in a multi-turn conversation — grows without bound unless actively managed
  • User input: the current query or task — variable, user-controlled
  • Output: the model's response — counted toward total context on subsequent turns
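Because every component above competes for the same window, it helps to track them in a single budget. A minimal sketch, assuming a crude 4-characters-per-token heuristic in place of a real tokenizer (production systems would use the model's own tokenizer), and hypothetical names (`context_budget`, `reserve_for_output`) chosen for illustration:

```python
# Rough token-budget ledger for a single LLM call.
# The 4-chars-per-token heuristic is a stand-in for a real tokenizer.
def count_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def context_budget(system_prompt: str, retrieved_chunks: list[str],
                   history: list[str], user_input: str,
                   window: int = 8192, reserve_for_output: int = 1024) -> dict:
    # Sum the input-side components that occupy the context window.
    used = (count_tokens(system_prompt)
            + sum(count_tokens(c) for c in retrieved_chunks)
            + sum(count_tokens(t) for t in history)
            + count_tokens(user_input))
    return {
        "used": used,
        "available_for_output": window - used,
        # Flag when the inputs leave too little room for the response.
        "over_budget": used > window - reserve_for_output,
    }
```

Running a ledger like this on every call surfaces which component is consuming the budget before quality or cost problems appear.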

Context Management Techniques for Production Systems

Conversation summarisation

For multi-turn conversations, rather than retaining the full conversation history, periodically summarise the conversation and replace the raw history with the summary. This keeps the conversation context compact while preserving semantic continuity. Trigger summarisation when the conversation history exceeds a defined token threshold.
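The trigger-and-replace logic can be sketched as follows. `summarise` here is a placeholder for an LLM summarisation call, and the token counter is a rough 4-characters-per-token stand-in; the choice to keep the most recent turn verbatim is one reasonable policy, not the only one:

```python
def count_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude stand-in for a real tokenizer

def maybe_summarise(history: list[str], threshold: int, summarise) -> list[str]:
    """Replace raw history with a summary once it exceeds a token threshold."""
    total = sum(count_tokens(turn) for turn in history)
    if total <= threshold:
        return history  # under budget: keep the raw turns
    # Keep the most recent turn verbatim; compress everything before it
    # so the model still sees the immediate context exactly as written.
    summary = summarise(history[:-1])
    return [f"[Summary of earlier conversation] {summary}", history[-1]]
```

Calling `maybe_summarise` after each turn keeps the history bounded while the summary preserves semantic continuity.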

Selective retrieval

In RAG systems, do not retrieve a fixed number of chunks — retrieve the minimum number that provides sufficient context for the query. Use relevance thresholds (only retrieve chunks above a similarity score) rather than fixed-N retrieval. For simple queries, 2-3 relevant chunks may be sufficient; for complex synthesis tasks, 8-10 may be needed.
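Threshold-based selection is a small change from fixed-N retrieval. A sketch, assuming the vector store has already returned (chunk, similarity) pairs and that 0.75 is an illustrative threshold to be tuned per corpus:

```python
def select_chunks(scored_chunks: list[tuple[str, float]],
                  min_score: float = 0.75,
                  max_chunks: int = 10) -> list[str]:
    """Keep only chunks above a relevance threshold, up to a hard cap."""
    # Sort by similarity, drop anything below the threshold, then cap
    # the total so complex queries cannot blow the context budget.
    ranked = sorted(scored_chunks, key=lambda c: c[1], reverse=True)
    return [text for text, score in ranked if score >= min_score][:max_chunks]
```

With this shape, simple queries naturally return 2-3 chunks and complex ones more, without a separate code path for each case.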

System prompt compression

System prompts that grow organically over time accumulate redundancy. Audit system prompts periodically: remove instructions that are no longer relevant, consolidate overlapping rules, and test whether compressed prompts produce equivalent output. A 30% reduction in system prompt length with equivalent output quality is achievable in most production prompts.
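The reduction target is easy to verify mechanically as part of the audit. A minimal measurement sketch, again using a 4-characters-per-token heuristic in place of the model's tokenizer (output-equivalence testing itself still requires an eval set):

```python
def compression_report(original: str, compressed: str) -> dict:
    """Compare token counts before and after a system prompt audit."""
    def count_tokens(text: str) -> int:
        return max(1, len(text) // 4)  # stand-in for a real tokenizer
    before, after = count_tokens(original), count_tokens(compressed)
    return {
        "before": before,
        "after": after,
        "reduction_pct": round(100 * (before - after) / before, 1),
    }
```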

Long document chunking strategies

When processing long documents that exceed context window capacity, use a map-reduce approach: process chunks independently (map), then synthesise the chunk-level outputs (reduce). Alternatively, use extraction-then-synthesis: extract the relevant sections from the document first, then provide only those sections to the synthesis model.
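The map-reduce shape can be sketched in a few lines. `summarise` again stands in for an LLM call, and chunking here is by character count as a simple approximation of token-based chunking:

```python
def chunk(text: str, size: int) -> list[str]:
    """Split a document into fixed-size pieces (character-based sketch)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def map_reduce_summary(document: str, summarise, chunk_size: int = 4000) -> str:
    # Map: process each chunk independently, within the context window.
    partials = [summarise(piece) for piece in chunk(document, chunk_size)]
    # Reduce: synthesise the chunk-level outputs into one result.
    return summarise("\n".join(partials))
```

For very long documents the reduce step may itself exceed the window, in which case the same pattern is applied recursively to the partial summaries.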

Context Window Size and Output Quality

Longer contexts do not always produce better outputs. Research consistently shows that LLM performance degrades on information retrieval tasks as context length increases — a phenomenon called "lost in the middle," where information at the start and end of the context is retrieved more reliably than information in the middle. For tasks requiring precise recall of specific information from long context, consider chunked processing rather than very long single-context approaches.
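When long context is unavoidable, one mitigation sometimes used alongside chunked processing (an assumption here, not a claim from this article) is to reorder retrieved chunks so the most relevant ones sit at the start and end of the context, where "lost in the middle" findings suggest recall is strongest. A sketch, assuming chunks arrive ranked best-first:

```python
def edge_order(ranked_chunks: list[str]) -> list[str]:
    """Interleave ranked chunks so the best land at the context's edges."""
    front, back = [], []
    for i, c in enumerate(ranked_chunks):  # input is ranked best-first
        (front if i % 2 == 0 else back).append(c)
    # Reversing `back` places the second-best chunk at the very end.
    return front + back[::-1]
```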

Give the model exactly the information it needs — not everything you have. More context is not always better context.


Want to apply these ideas in your business?

A strategy call is where the thinking in these articles meets your specific systems, team, and goals.