How Semantic Caching Works
When a query arrives, the semantic cache embeds it and searches for stored entries whose query embeddings are similar to the new embedding. If the best match meets the similarity threshold, its cached response is returned without a model call. If no entry meets the threshold, the query is sent to the model, the new response is stored alongside the query embedding, and that response is returned.
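The lookup flow described above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the `SemanticCache` class, the `toy_embed` function (letter counts standing in for a real embedding model), the linear scan over entries, and the use of cosine similarity are all assumptions made for the example. A real deployment would likely use a proper embedding model and a vector index.

```python
import math
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for a real embedding model: counts of letters a-z."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


class SemanticCache:
    """Illustrative cache: stores (query embedding, response) pairs in a list."""

    def __init__(
        self,
        embed: Callable[[str], Vector],
        call_model: Callable[[str], str],
        threshold: float = 0.9,
    ) -> None:
        self.embed = embed
        self.call_model = call_model
        self.threshold = threshold
        self.entries: List[Tuple[Vector, str]] = []

    def lookup(self, query_vec: Vector) -> Optional[str]:
        """Return the best-matching cached response, or None below the threshold."""
        best_score, best_response = -1.0, None
        for vec, response in self.entries:
            score = cosine_similarity(query_vec, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def get(self, query: str) -> str:
        query_vec = self.embed(query)
        cached = self.lookup(query_vec)
        if cached is not None:
            return cached  # cache hit: no model call
        response = self.call_model(query)  # cache miss: compute the response
        self.entries.append((query_vec, response))  # store embedding + response
        return response


if __name__ == "__main__":
    cache = SemanticCache(toy_embed, lambda q: f"answer to: {q}", threshold=0.95)
    print(cache.get("how do I reset my password"))   # miss: calls the model stub
    print(cache.get("How do I reset my password?"))  # hit: same letter counts
```

The linear scan keeps the sketch short; with more than a few thousand entries, an approximate nearest-neighbor index would be the more usual choice.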
The similarity threshold is the critical parameter. Set it too high (requiring a very close semantic match) and cache hit rates are low. Set it too low and the cache returns responses that do not adequately address the specific query. The right threshold has to be calibrated empirically against your own query distribution.
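One way to approach that calibration, sketched here under assumptions, is to take a sample of real queries, record each one's best similarity score against the cache, and have a reviewer label whether the matched cached response would have been acceptable. Sweeping candidate thresholds over those labeled pairs then shows the trade-off between hit rate and the share of hits that were actually acceptable. The function names `sweep_thresholds` and `pick_threshold`, the labeling scheme, and the `min_precision` floor are hypothetical choices made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

# Each sample: (best similarity score, was the cached response acceptable?)
LabeledSample = Tuple[float, bool]


def sweep_thresholds(
    labeled: Sequence[LabeledSample], thresholds: Sequence[float]
) -> List[Tuple[float, float, Optional[float]]]:
    """For each threshold, return (threshold, hit_rate, precision).

    hit_rate  = fraction of queries that would be served from the cache.
    precision = fraction of those hits labeled acceptable (None if no hits).
    """
    rows = []
    for t in thresholds:
        hits = [ok for score, ok in labeled if score >= t]
        hit_rate = len(hits) / len(labeled) if labeled else 0.0
        precision = sum(hits) / len(hits) if hits else None
        rows.append((t, hit_rate, precision))
    return rows


def pick_threshold(
    labeled: Sequence[LabeledSample],
    thresholds: Sequence[float],
    min_precision: float = 0.95,
) -> Optional[float]:
    """Return the lowest candidate threshold whose precision meets the floor.

    Hit rate can only fall as the threshold rises, so the lowest qualifying
    threshold is also the one with the highest hit rate.
    """
    for t, _hit_rate, precision in sweep_thresholds(labeled, sorted(thresholds)):
        if precision is not None and precision >= min_precision:
            return t
    return None
```

The precision floor encodes how many inadequate cached answers you are willing to tolerate; the result is only as representative as the labeled sample, so it is worth re-running as the query distribution shifts.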