Architecture · 7 min read · 26 November 2025

Semantic Caching for LLM APIs: How to Cut Costs Without Cutting Quality

Semantic caching returns cached LLM responses for semantically similar queries — without requiring identical inputs. At scale, it is one of the most effective cost reduction techniques available.


Ajay Prajapat

AI Systems Architect

Traditional caching returns identical responses for identical requests. Semantic caching returns appropriate responses for semantically similar requests — even when the wording is different. "What is your refund policy?" and "How do I get a refund?" are different strings but may have identical answers. For high-volume AI applications with repetitive query patterns, semantic caching is one of the most effective techniques for reducing both cost and latency.

How Semantic Caching Works

When a query arrives, the semantic cache embeds it and searches for stored responses whose query embeddings fall within a defined similarity threshold. If a match exceeds the threshold, the cached response is returned without a model call. Otherwise the query goes to the model, the new response is cached (query embedding + response) and returned to the caller.
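The flow above can be sketched in a few lines. This is a minimal in-memory version; `embed` and `call_model` are stand-ins for a real embedding model and LLM API, and the 0.94 threshold is an illustrative value from the range discussed below.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.94  # illustrative; calibrate against your query distribution

# In-memory cache: list of (query_embedding, response) pairs.
cache: list[tuple[np.ndarray, str]] = []

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model; returns a deterministic unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def call_model(query: str) -> str:
    """Stand-in for the actual LLM API call."""
    return f"fresh answer to: {query}"

def cached_query(query: str) -> str:
    q = embed(query)
    # On unit vectors, cosine similarity reduces to a dot product.
    for emb, response in cache:
        if float(np.dot(q, emb)) >= SIMILARITY_THRESHOLD:
            return response  # cache hit: no model call
    response = call_model(query)  # cache miss: call the model and store
    cache.append((q, response))
    return response
```

A production version would replace the linear scan with an approximate nearest-neighbour index, but the hit/miss logic is the same.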

The similarity threshold is the critical parameter. Set it too high (requiring a very close semantic match) and cache hit rates suffer; set it too low and the cache returns responses that do not adequately address the specific query. The right threshold must be calibrated empirically against your query distribution.
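Calibration is straightforward once you have labelled data: for a sample of queries, record the similarity of the best cache match and a human judgment of whether the cached answer was actually correct, then sweep candidate thresholds and read off the hit-rate/precision trade-off. The labelled records below are illustrative, not real data.

```python
# Each record: (similarity of best cache match, label: was the cached answer correct?)
labelled = [
    (0.99, True), (0.97, True), (0.95, True), (0.93, True),
    (0.91, False), (0.90, True), (0.88, False), (0.85, False),
]

def sweep(thresholds):
    """For each threshold, compute hit rate and precision of served cache hits."""
    rows = []
    for t in thresholds:
        hits = [ok for sim, ok in labelled if sim >= t]
        hit_rate = len(hits) / len(labelled)
        precision = sum(hits) / len(hits) if hits else 1.0
        rows.append((t, hit_rate, precision))
    return rows

for t, hr, prec in sweep([0.85, 0.90, 0.92, 0.96]):
    print(f"threshold={t:.2f}  hit_rate={hr:.2f}  precision={prec:.2f}")
```

Raising the threshold trades hit rate for precision; pick the highest hit rate whose precision meets your quality bar.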

Where Semantic Caching Works and Where It Does Not

  • Works well: customer support FAQ, product questions, documentation queries, repetitive classification tasks where the same input types recur frequently
  • Works moderately: general Q&A over a bounded knowledge base where query semantics cluster around common topics
  • Does not work: highly personalised responses (the cached response is wrong for the specific user context), time-sensitive queries (cached responses may be stale), queries where small wording differences require different answers (medical, legal)
  • Contraindicated: any use case where response correctness depends on the exact phrasing of the query

Implementation Considerations

  • Cache key: the query embedding, not the raw query string — enables semantic matching
  • Cache value: the response and metadata (model version, timestamp, original query)
  • Similarity metric: cosine similarity is standard; threshold of 0.92-0.96 works for most use cases
  • TTL: cache entries should expire based on how frequently the underlying content changes — product information might cache for 24h, policy documents for 7 days
  • Cache invalidation: when knowledge base documents are updated, invalidate related cache entries
  • Cache warm-up: pre-populate the cache with embeddings for your most frequent query patterns

Measuring Cache Impact

Track: cache hit rate (% of queries served from cache), cost per query with and without cache, cache quality rate (sampled quality evaluation of cached responses vs fresh model responses), and latency reduction. A well-tuned semantic cache for a customer support application typically achieves 30-60% hit rates, proportionally reducing both cost and p50 latency.
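A minimal metrics tracker for the hit-rate and cost figures might look like this. The per-query cost constants are illustrative assumptions, not real pricing.

```python
from dataclasses import dataclass

@dataclass
class CacheMetrics:
    hits: int = 0
    misses: int = 0
    model_cost_per_query: float = 0.002    # assumed $/query for a model call
    cache_cost_per_query: float = 0.0001   # assumed $/query for embed + lookup

    def record(self, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    @property
    def cost_saving_pct(self) -> float:
        """Fractional cost reduction vs calling the model for every query."""
        total = self.hits + self.misses
        if not total:
            return 0.0
        baseline = total * self.model_cost_per_query
        actual = (total * self.cache_cost_per_query
                  + self.misses * self.model_cost_per_query)
        return 1 - actual / baseline
```

Cache quality rate and latency need their own instrumentation (sampled side-by-side evaluation and request timing), but hit rate and cost follow directly from counting hits and misses.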

