Architecture · 7 min read · 26 November 2025

Semantic Caching for LLM APIs: How to Cut Costs Without Cutting Quality

Semantic caching returns cached LLM responses for semantically similar queries — without requiring identical inputs. At scale, it is one of the most effective cost reduction techniques available.


Ajay Prajapat

AI Systems Architect

Traditional caching returns identical responses for identical requests. Semantic caching returns appropriate responses for semantically similar requests — even when the wording is different. "What is your refund policy?" and "How do I get a refund?" are different strings but may have identical answers. For high-volume AI applications with repetitive query patterns, semantic caching is one of the most effective techniques for reducing both cost and latency.

How Semantic Caching Works

When a query arrives, the semantic cache embeds it and searches for stored responses whose query embeddings fall within a defined similarity threshold. If a match exceeds the threshold, the cached response is returned without a model call. Otherwise the query goes to the model, the new response is cached (query embedding + response) and returned to the caller.
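The flow above can be sketched in a few lines. This is a minimal in-memory version; `embed` and `call_model` are stand-ins for a real embedding model and LLM API, and the 0.94 threshold is an illustrative value from the range discussed below.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.94  # illustrative; calibrate against your query distribution

# In-memory cache: list of (query_embedding, response) pairs.
cache: list[tuple[np.ndarray, str]] = []

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model; returns a deterministic unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def call_model(query: str) -> str:
    """Stand-in for the actual LLM API call."""
    return f"fresh answer to: {query}"

def cached_query(query: str) -> str:
    q = embed(query)
    # On unit vectors, cosine similarity reduces to a dot product.
    for emb, response in cache:
        if float(np.dot(q, emb)) >= SIMILARITY_THRESHOLD:
            return response  # cache hit: no model call
    response = call_model(query)  # cache miss: call the model and store
    cache.append((q, response))
    return response
```

A production version would replace the linear scan with an approximate nearest-neighbour index, but the hit/miss logic is the same.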

The similarity threshold is the critical parameter. Set it too high (requiring a very close semantic match) and cache hit rates suffer; set it too low and the cache returns responses that do not adequately address the specific query. The right threshold must be calibrated empirically against your query distribution.
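Calibration is straightforward once you have labelled data: for a sample of queries, record the similarity of the best cache match and a human judgment of whether the cached answer was actually correct, then sweep candidate thresholds and read off the hit-rate/precision trade-off. The labelled records below are illustrative, not real data.

```python
# Each record: (similarity of best cache match, label: was the cached answer correct?)
labelled = [
    (0.99, True), (0.97, True), (0.95, True), (0.93, True),
    (0.91, False), (0.90, True), (0.88, False), (0.85, False),
]

def sweep(thresholds):
    """For each threshold, compute hit rate and precision of served cache hits."""
    rows = []
    for t in thresholds:
        hits = [ok for sim, ok in labelled if sim >= t]
        hit_rate = len(hits) / len(labelled)
        precision = sum(hits) / len(hits) if hits else 1.0
        rows.append((t, hit_rate, precision))
    return rows

for t, hr, prec in sweep([0.85, 0.90, 0.92, 0.96]):
    print(f"threshold={t:.2f}  hit_rate={hr:.2f}  precision={prec:.2f}")
```

Raising the threshold trades hit rate for precision; pick the highest hit rate whose precision meets your quality bar.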

Where Semantic Caching Works and Where It Does Not

  • Works well: customer support FAQ, product questions, documentation queries, repetitive classification tasks where the same input types recur frequently
  • Works moderately: general Q&A over a bounded knowledge base where query semantics cluster around common topics
  • Does not work: highly personalised responses (the cached response is wrong for the specific user context), time-sensitive queries (cached responses may be stale), queries where small wording differences require different answers (medical, legal)
  • Contraindicated: any use case where response correctness depends on the exact phrasing of the query

Implementation Considerations

  • Cache key: the query embedding, not the raw query string — enables semantic matching
  • Cache value: the response and metadata (model version, timestamp, original query)
  • Similarity metric: cosine similarity is standard; threshold of 0.92-0.96 works for most use cases
  • TTL: cache entries should expire based on how frequently the underlying content changes — product information might cache for 24h, policy documents for 7 days
  • Cache invalidation: when knowledge base documents are updated, invalidate related cache entries
  • Cache warm-up: pre-populate the cache with embeddings for your most frequent query patterns

Measuring Cache Impact

Track: cache hit rate (% of queries served from cache), cost per query with and without cache, cache quality rate (sampled quality evaluation of cached responses vs fresh model responses), and latency reduction. A well-tuned semantic cache for a customer support application typically achieves 30-60% hit rates, proportionally reducing both cost and p50 latency.
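A minimal metrics tracker for the hit-rate and cost figures might look like this. The per-query cost constants are illustrative assumptions, not real pricing.

```python
from dataclasses import dataclass

@dataclass
class CacheMetrics:
    hits: int = 0
    misses: int = 0
    model_cost_per_query: float = 0.002    # assumed $/query for a model call
    cache_cost_per_query: float = 0.0001   # assumed $/query for embed + lookup

    def record(self, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    @property
    def cost_saving_pct(self) -> float:
        """Fractional cost reduction vs calling the model for every query."""
        total = self.hits + self.misses
        if not total:
            return 0.0
        baseline = total * self.model_cost_per_query
        actual = (total * self.cache_cost_per_query
                  + self.misses * self.model_cost_per_query)
        return 1 - actual / baseline
```

Cache quality rate and latency need their own instrumentation (sampled side-by-side evaluation and request timing), but hit rate and cost follow directly from counting hits and misses.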

