Architecture · 7 min read · 30 September 2025

LLM Context Window Management: How to Work Effectively Within Token Limits

Context window limits constrain what information you can give an LLM in a single call. Managing this constraint is an engineering discipline with significant quality and cost implications.


Ajay Prajapat

AI Systems Architect

Context window size — the amount of information an LLM can process in a single call — has expanded dramatically, with some models supporting millions of tokens. But larger context windows do not eliminate context management as an engineering concern. Long contexts increase cost, increase latency, and can degrade model performance as the model's attention is distributed across more content. Context management is the discipline of giving the model exactly the information it needs — not everything you have.

What Goes in the Context Window and Why It Matters

  • System prompt: instructions, persona, rules, output format specification — counted on every single API call
  • Retrieved context (RAG): document chunks retrieved to ground the response — 500-3,000 tokens typically
  • Conversation history: prior turns in a multi-turn conversation — grows without bound unless actively managed
  • User input: the current query or task — variable, user-controlled
  • Output: the model's response — counted toward total context on subsequent turns
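Because every component above competes for the same window, it helps to track them in a single budget. A minimal sketch, assuming a crude 4-characters-per-token heuristic in place of a real tokenizer (production systems would use the model's own tokenizer), and hypothetical names (`context_budget`, `reserve_for_output`) chosen for illustration:

```python
# Rough token-budget ledger for a single LLM call.
# The 4-chars-per-token heuristic is a stand-in for a real tokenizer.
def count_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def context_budget(system_prompt: str, retrieved_chunks: list[str],
                   history: list[str], user_input: str,
                   window: int = 8192, reserve_for_output: int = 1024) -> dict:
    # Sum the input-side components that occupy the context window.
    used = (count_tokens(system_prompt)
            + sum(count_tokens(c) for c in retrieved_chunks)
            + sum(count_tokens(t) for t in history)
            + count_tokens(user_input))
    return {
        "used": used,
        "available_for_output": window - used,
        # Flag when the inputs leave too little room for the response.
        "over_budget": used > window - reserve_for_output,
    }
```

Running a ledger like this on every call surfaces which component is consuming the budget before quality or cost problems appear.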

Context Management Techniques for Production Systems

Conversation summarisation

For multi-turn conversations, rather than retaining the full conversation history, periodically summarise the conversation and replace the raw history with the summary. This keeps the conversation context compact while preserving semantic continuity. Trigger summarisation when the conversation history exceeds a defined token threshold.
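The trigger-and-replace logic can be sketched as follows. `summarise` here is a placeholder for an LLM summarisation call, and the token counter is a rough 4-characters-per-token stand-in; the choice to keep the most recent turn verbatim is one reasonable policy, not the only one:

```python
def count_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude stand-in for a real tokenizer

def maybe_summarise(history: list[str], threshold: int, summarise) -> list[str]:
    """Replace raw history with a summary once it exceeds a token threshold."""
    total = sum(count_tokens(turn) for turn in history)
    if total <= threshold:
        return history  # under budget: keep the raw turns
    # Keep the most recent turn verbatim; compress everything before it
    # so the model still sees the immediate context exactly as written.
    summary = summarise(history[:-1])
    return [f"[Summary of earlier conversation] {summary}", history[-1]]
```

Calling `maybe_summarise` after each turn keeps the history bounded while the summary preserves semantic continuity.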

Selective retrieval

In RAG systems, do not retrieve a fixed number of chunks — retrieve the minimum number that provides sufficient context for the query. Use relevance thresholds (only retrieve chunks above a similarity score) rather than fixed-N retrieval. For simple queries, 2-3 relevant chunks may be sufficient; for complex synthesis tasks, 8-10 may be needed.
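Threshold-based selection is a small change from fixed-N retrieval. A sketch, assuming the vector store has already returned (chunk, similarity) pairs and that 0.75 is an illustrative threshold to be tuned per corpus:

```python
def select_chunks(scored_chunks: list[tuple[str, float]],
                  min_score: float = 0.75,
                  max_chunks: int = 10) -> list[str]:
    """Keep only chunks above a relevance threshold, up to a hard cap."""
    # Sort by similarity, drop anything below the threshold, then cap
    # the total so complex queries cannot blow the context budget.
    ranked = sorted(scored_chunks, key=lambda c: c[1], reverse=True)
    return [text for text, score in ranked if score >= min_score][:max_chunks]
```

With this shape, simple queries naturally return 2-3 chunks and complex ones more, without a separate code path for each case.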

System prompt compression

System prompts that grow organically over time accumulate redundancy. Audit system prompts periodically: remove instructions that are no longer relevant, consolidate overlapping rules, and test whether compressed prompts produce equivalent output. A 30% reduction in system prompt length with equivalent output quality is achievable in most production prompts.
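The reduction target is easy to verify mechanically as part of the audit. A minimal measurement sketch, again using a 4-characters-per-token heuristic in place of the model's tokenizer (output-equivalence testing itself still requires an eval set):

```python
def compression_report(original: str, compressed: str) -> dict:
    """Compare token counts before and after a system prompt audit."""
    def count_tokens(text: str) -> int:
        return max(1, len(text) // 4)  # stand-in for a real tokenizer
    before, after = count_tokens(original), count_tokens(compressed)
    return {
        "before": before,
        "after": after,
        "reduction_pct": round(100 * (before - after) / before, 1),
    }
```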

Long document chunking strategies

When processing long documents that exceed context window capacity, use a map-reduce approach: process chunks independently (map), then synthesise the chunk-level outputs (reduce). Alternatively, use extraction-then-synthesis: extract the relevant sections from the document first, then provide only those sections to the synthesis model.
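The map-reduce shape can be sketched in a few lines. `summarise` again stands in for an LLM call, and chunking here is by character count as a simple approximation of token-based chunking:

```python
def chunk(text: str, size: int) -> list[str]:
    """Split a document into fixed-size pieces (character-based sketch)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def map_reduce_summary(document: str, summarise, chunk_size: int = 4000) -> str:
    # Map: process each chunk independently, within the context window.
    partials = [summarise(piece) for piece in chunk(document, chunk_size)]
    # Reduce: synthesise the chunk-level outputs into one result.
    return summarise("\n".join(partials))
```

For very long documents the reduce step may itself exceed the window, in which case the same pattern is applied recursively to the partial summaries.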

Context Window Size and Output Quality

Longer contexts do not always produce better outputs. Research consistently shows that LLM performance degrades on information retrieval tasks as context length increases — a phenomenon called "lost in the middle," where information at the start and end of the context is retrieved more reliably than information in the middle. For tasks requiring precise recall of specific information from long context, consider chunked processing rather than very long single-context approaches.
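When long context is unavoidable, one mitigation sometimes used alongside chunked processing (an assumption here, not a claim from this article) is to reorder retrieved chunks so the most relevant ones sit at the start and end of the context, where "lost in the middle" findings suggest recall is strongest. A sketch, assuming chunks arrive ranked best-first:

```python
def edge_order(ranked_chunks: list[str]) -> list[str]:
    """Interleave ranked chunks so the best land at the context's edges."""
    front, back = [], []
    for i, c in enumerate(ranked_chunks):  # input is ranked best-first
        (front if i % 2 == 0 else back).append(c)
    # Reversing `back` places the second-best chunk at the very end.
    return front + back[::-1]
```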

Give the model exactly the information it needs — not everything you have. More context is not always better context.


Want to apply these ideas in your business?

A strategy call is where the thinking in these articles meets your specific systems, team, and goals.