Context window size — the amount of information an LLM can process in a single call — has expanded dramatically, with some models supporting millions of tokens. But larger context windows do not eliminate context management as an engineering concern. Long contexts increase cost, increase latency, and can degrade model performance as the model's attention is distributed across more content. Context management is the discipline of giving the model exactly the information it needs — not everything you have.
Context Management Techniques for Production Systems
Conversation summarisation
For multi-turn conversations, rather than retaining the full conversation history, periodically summarise the conversation and replace the raw history with the summary. This keeps the conversation context compact while preserving semantic continuity. Trigger summarisation when the conversation history exceeds a defined token threshold.
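A minimal sketch of this pattern, assuming a plain list of chat messages and an injected summarise callable for the model call; the token threshold, the number of verbatim recent turns, and the crude character-based token estimate are illustrative placeholders, not recommendations from the text.

```python
# Threshold-triggered conversation summarisation (sketch). The constants
# and the character-based token estimate are illustrative assumptions;
# use your model's real tokenizer and tune the budget in production.
from typing import Callable

SUMMARY_TOKEN_THRESHOLD = 4_000  # summarise once history exceeds this budget
KEEP_RECENT_TURNS = 4            # newest turns stay verbatim for continuity

def estimate_tokens(messages: list[dict]) -> int:
    # Rough heuristic (~4 characters per token); replace with a tokenizer.
    return sum(len(m["content"]) for m in messages) // 4

def compact_history(messages: list[dict],
                    summarise: Callable[[str], str]) -> list[dict]:
    """Replace older turns with a summary once the history is over budget."""
    if (len(messages) <= KEEP_RECENT_TURNS
            or estimate_tokens(messages) <= SUMMARY_TOKEN_THRESHOLD):
        return messages
    older, recent = messages[:-KEEP_RECENT_TURNS], messages[-KEEP_RECENT_TURNS:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    summary = summarise(
        "Summarise this conversation, preserving decisions, open questions, "
        f"and stated user preferences:\n\n{transcript}"
    )
    return [{"role": "system",
             "content": f"Summary of earlier conversation: {summary}"}] + recent
```

Keeping the most recent turns verbatim preserves the immediate flow of the dialogue while the summary carries the long-range state.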
Selective retrieval
In RAG systems, do not retrieve a fixed number of chunks: retrieve the minimum number that provides sufficient context for the query. Use relevance thresholds (include only chunks whose similarity score exceeds a cutoff) rather than fixed-N retrieval. For simple queries, 2-3 relevant chunks may be sufficient; for complex synthesis tasks, 8-10 may be needed.
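A sketch of threshold-gated selection, assuming the vector store returns (chunk, score) pairs; the min_score cutoff and max_chunks cap are illustrative, and similarity scales differ between embedding models, so the threshold should be tuned empirically.

```python
# Threshold-gated retrieval (sketch): keep only chunks above a similarity
# cutoff, up to a hard cap, instead of always taking a fixed top-N.
def retrieve_relevant(scored_chunks: list[tuple[str, float]],
                      min_score: float = 0.75,
                      max_chunks: int = 10) -> list[str]:
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, score in ranked if score >= min_score][:max_chunks]

# Usage: a simple query may clear the bar with only two chunks.
hits = [("chunk A", 0.91), ("chunk B", 0.78), ("chunk C", 0.52)]
context_chunks = retrieve_relevant(hits)  # -> ["chunk A", "chunk B"]
```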
System prompt compression
System prompts that grow organically over time accumulate redundancy. Audit them periodically: remove instructions that are no longer relevant, consolidate overlapping rules, and test whether the compressed prompt produces equivalent output. For most production prompts, a 30% reduction in length with no loss of output quality is achievable.
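One way to make the "test whether the compressed prompt produces equivalent output" step concrete is a small regression check over a fixed set of evaluation queries. In this sketch, run_model and outputs_equivalent are hypothetical callables: the first invokes your model with a given system prompt, and the second applies whatever equivalence check you trust (exact match, embedding similarity, or an LLM judge).

```python
# Prompt-compression regression check (sketch). Both callables are
# placeholders to be wired to your model client and evaluation method.
from typing import Callable

def audit_compression(original: str,
                      compressed: str,
                      eval_queries: list[str],
                      run_model: Callable[[str, str], str],
                      outputs_equivalent: Callable[[str, str], bool]) -> dict:
    """Report length savings and the share of equivalent outputs."""
    matches = sum(
        outputs_equivalent(run_model(original, q), run_model(compressed, q))
        for q in eval_queries
    )
    return {
        "length_saved_pct": 100 * (1 - len(compressed) / len(original)),
        "equivalence_rate": matches / len(eval_queries),
    }
```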
Long document chunking strategies
When processing long documents that exceed context window capacity, use a map-reduce approach: process chunks independently (map), then synthesise the chunk-level outputs (reduce). Alternatively, use extraction-then-synthesis: extract the relevant sections from the document first, then provide only those sections to the synthesis model.
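A minimal map-reduce sketch, with complete standing in for your model call. Chunking by character count is a deliberate simplification; token-based chunking with some overlap is more robust at chunk boundaries.

```python
# Map-reduce summarisation for documents that exceed the context window
# (sketch). `complete` is a placeholder for a single model call.
from typing import Callable

def map_reduce_summarise(document: str,
                         complete: Callable[[str], str],
                         chunk_chars: int = 12_000) -> str:
    # Map: process each chunk independently.
    chunks = [document[i:i + chunk_chars]
              for i in range(0, len(document), chunk_chars)]
    partials = [complete(f"Summarise the key points of this excerpt:\n\n{c}")
                for c in chunks]
    # Reduce: synthesise the chunk-level outputs into one answer.
    joined = "\n\n".join(f"Excerpt {i + 1} summary:\n{p}"
                         for i, p in enumerate(partials))
    return complete("Combine these excerpt summaries into a single "
                    f"coherent summary:\n\n{joined}")
```

If the joined partial summaries themselves exceed the window, the reduce step can be applied hierarchically, reducing groups of summaries before a final pass.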
Context Window Size and Output Quality
Longer contexts do not always produce better outputs. Research consistently shows that LLM performance on retrieval tasks degrades as context length grows, and that position matters: information at the start and end of the context is recalled more reliably than information in the middle, a failure mode known as "lost in the middle". For tasks requiring precise recall of specific information from a long context, consider chunked processing rather than a single very long context.
“Give the model exactly the information it needs — not everything you have. More context is not always better context.”