
Our LLM API bill was increasing by 30% month-on-month. Traffic was growing, but not that fast. When I analyzed our query logs, I found the real problem: users ask the same question in different ways.
"What is your return policy?", "How do I return something?", and "Can I get a refund?" were all hitting our LLM separately, generating nearly identical responses, each incurring the full API cost.
Exact-match caching, the obvious first solution, captured only 18% of these redundant calls. The same semantic question, phrased in different words, bypassed the cache entirely.
So I implemented semantic caching: caching based on the meaning of questions, not their wording. After rollout, our cache hit rate increased to 67%, cutting LLM API costs by 73%. But getting there required solving problems that naive implementations miss.
Why does exact-match caching fall short?
Traditional caching uses the query text as the cache key. This works only when queries are identical:
```python
# Exact-match caching
cache_key = hash(query_text)
if cache_key in cache:
    return cache[cache_key]
```
But users rarely phrase queries the same way. My analysis of 100,000 production queries found:
- Only 18% were exact duplicates of previous queries.
- 47% were semantically similar to previous queries (same intent, different wording).
- 35% were genuinely new questions.
That 47% represented a huge cost saving we were missing. Each semantically similar query triggered a full LLM call, producing almost the same response as one we had already computed.
Semantic Caching Architecture
Semantic caching replaces text-based keys with embedding-based similarity lookups:
```python
from datetime import datetime
from typing import Optional

class SemanticCache:
    def __init__(self, embedding_model, similarity_threshold=0.92):
        self.embedding_model = embedding_model
        self.threshold = similarity_threshold
        self.vector_store = VectorStore()      # FAISS, Pinecone, etc.
        self.response_store = ResponseStore()  # Redis, DynamoDB, etc.

    def get(self, query: str) -> Optional[str]:
        """Return a cached response if a semantically similar query exists."""
        query_embedding = self.embedding_model.encode(query)
        # Find the most similar cached query
        matches = self.vector_store.search(query_embedding, top_k=1)
        if matches and matches[0].similarity >= self.threshold:
            cache_id = matches[0].id
            return self.response_store.get(cache_id)
        return None

    def set(self, query: str, response: str):
        """Cache a query-response pair."""
        query_embedding = self.embedding_model.encode(query)
        cache_id = generate_id()
        self.vector_store.add(cache_id, query_embedding)
        self.response_store.set(cache_id, {
            'query': query,
            'response': response,
            'timestamp': datetime.utcnow(),
        })
```
Key insight: instead of hashing query text, I embed queries in a vector space and look up cached queries within a similarity threshold.
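As a minimal illustration of the lookup's core operation, here is cosine similarity over plain Python lists. The three-dimensional vectors are toy stand-ins for real embeddings (which have hundreds of dimensions); the idea is that two paraphrased queries land close together in the vector space while an unrelated query does not:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings of two paraphrased
# refund questions and one unrelated pricing question.
refund_v1 = [0.9, 0.1, 0.2]
refund_v2 = [0.85, 0.15, 0.25]
pricing_q = [0.1, 0.9, 0.3]

print(cosine_similarity(refund_v1, refund_v2))  # paraphrases score high
print(cosine_similarity(refund_v1, pricing_q))  # unrelated scores much lower
```

With real embeddings, the vector store performs exactly this comparison (or an approximate version of it) against every cached query at once.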
The threshold problem
The similarity threshold is an important parameter. Set it too high, and you’ll miss legitimate cache hits. Set it too low, and you return incorrect responses.
Our initial threshold of 0.85 seemed reasonable; surely 85% similarity means "the same question," right?
Wrong. At 0.85, we got cache hits like:
- Query: "How do I cancel my subscription?"
- Cached: "How do I cancel my order?"
- Similarity: 0.87
These are different questions with different answers. Returning the cached response would be wrong.
I found that the optimal threshold varies by query type:
| Query type | Optimal threshold | Rationale |
|---|---|---|
| FAQ-style questions | 0.94 | High precision required; wrong answers damage trust |
| Product search | 0.88 | Greater tolerance for close matches |
| Support questions | 0.92 | Balance between coverage and accuracy |
| Transaction-related questions | 0.97 | Very low tolerance for errors |
I implemented query-type-specific thresholds:
```python
class AdaptiveSemanticCache:
    def __init__(self):
        self.thresholds = {
            'faq': 0.94,
            'search': 0.88,
            'support': 0.92,
            'transactional': 0.97,
            'default': 0.92,
        }
        self.query_classifier = QueryClassifier()

    def get_threshold(self, query: str) -> float:
        query_type = self.query_classifier.classify(query)
        return self.thresholds.get(query_type, self.thresholds['default'])

    def get(self, query: str) -> Optional[str]:
        threshold = self.get_threshold(query)
        query_embedding = self.embedding_model.encode(query)
        matches = self.vector_store.search(query_embedding, top_k=1)
        if matches and matches[0].similarity >= threshold:
            return self.response_store.get(matches[0].id)
        return None
```
Threshold tuning methodology
I couldn't tune the threshold blindly. I needed ground truth on which queries actually count as "the same question."
Our methodology:
Step 1: Sample query pairs. I sampled 5,000 query pairs across different similarity levels (0.80-0.99).
Step 2: Human labeling. Annotators labeled each pair as "same intent" or "different intent." I used three annotators per pair and took the majority vote.
Step 3: Compute precision/recall curves. For each threshold, we calculated:
- Precision: of the cache hits, what fraction had the same intent?
- Recall: of the same-intent pairs, what fraction did we cache-hit?
```python
def compute_precision_recall(pairs, labels, threshold):
    """Compute precision and recall at a given similarity threshold."""
    predictions = [1 if pair.similarity >= threshold else 0 for pair in pairs]
    true_positives = sum(1 for p, l in zip(predictions, labels) if p == 1 and l == 1)
    false_positives = sum(1 for p, l in zip(predictions, labels) if p == 1 and l == 0)
    false_negatives = sum(1 for p, l in zip(predictions, labels) if p == 0 and l == 1)
    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
    return precision, recall
```
Step 4: Select a threshold based on the cost of errors. For FAQs, where wrong answers damage trust, I optimized for precision (a 0.94 threshold gave 98% precision). For search queries, where a missed cache hit only costs money, I optimized for recall (0.88 threshold).
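One way to sketch the selection in Step 4: pick the lowest threshold whose measured precision clears a per-category floor, since among eligible thresholds the lowest one maximizes recall. The curve values and precision targets below are illustrative, not our production numbers:

```python
def select_threshold(points, precision_target):
    """Pick the lowest threshold meeting the precision target.

    points: list of (threshold, precision, recall) tuples measured on
    the labeled evaluation set; lower thresholds give higher recall.
    """
    eligible = [p for p in points if p[1] >= precision_target]
    if not eligible:
        raise ValueError("No threshold meets the precision target")
    # Among eligible thresholds, the lowest one maximizes recall
    return min(eligible, key=lambda p: p[0])

# Illustrative curve: (threshold, precision, recall)
curve = [
    (0.88, 0.91, 0.80),
    (0.90, 0.95, 0.72),
    (0.94, 0.98, 0.61),
    (0.97, 0.99, 0.45),
]
print(select_threshold(curve, 0.98))  # strict floor (FAQ-like): picks 0.94
print(select_threshold(curve, 0.90))  # loose floor (search-like): picks 0.88
```

The precision floor encodes the cost of a wrong answer; everything below it is spent buying recall, i.e. cache hits.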
Latency overhead
Semantic caching adds latency: You have to embed the query and search in the vector store before knowing whether to call the LLM.
Our measurements:
| Operation | Latency (p50) | Latency (p99) |
|---|---|---|
| Query embedding | 12ms | 28ms |
| Vector search | 8ms | 19ms |
| Total cache lookup | 20ms | 47ms |
| LLM API call | 850ms | 2400ms |
The 20ms overhead is negligible compared to the 850ms LLM call we avoid on a cache hit. Even at p99, 47ms of overhead is acceptable.
However, cache misses now take 20ms longer than before (embedding + search + LLM call). At our 67% hit rate, the math still works out well:
- Before: 100% of queries × 850ms = 850ms average
- After: (33% × 870ms) + (67% × 20ms) = 287ms + 13ms = 300ms average
That is a 65% net latency improvement on top of the cost reduction.
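The arithmetic above generalizes: expected latency is a weighted average over hits and misses, with the lookup overhead paid on every query. A small helper, using the p50 numbers measured above, makes the break-even behavior easy to explore:

```python
def expected_latency_ms(hit_rate, lookup_ms=20, llm_ms=850):
    """Average latency with semantic caching.

    Hits cost only the lookup; misses pay the lookup plus the LLM call.
    """
    miss_rate = 1.0 - hit_rate
    return hit_rate * lookup_ms + miss_rate * (lookup_ms + llm_ms)

print(expected_latency_ms(0.0))   # no hits: 870ms, i.e. 20ms worse than no cache
print(expected_latency_ms(0.67))  # our hit rate: ~300ms
```

Note the zero-hit-rate case: a semantic cache that never hits makes every query slower, which is why the hit rate (and thus the threshold tuning) drives the whole economics.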
Cache invalidation
Cached responses become stale. Product information changes, policies update, and yesterday’s right answer becomes today’s wrong answer.
I implemented three invalidation strategies:
1. Time-based TTL

Simple expiration based on content type:

```python
from datetime import timedelta

TTL_BY_CONTENT_TYPE = {
    'pricing': timedelta(hours=4),       # changes frequently
    'policy': timedelta(days=7),         # changes rarely
    'product_info': timedelta(days=1),   # refresh daily
    'common_faq': timedelta(days=14),    # very stable
}
```
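At lookup time, the TTL map reduces to an age check against the entry's stored timestamp. A sketch, assuming each cache entry also records its content type (the one-day fallback TTL is my assumption, not from the original design):

```python
from datetime import datetime, timedelta

TTL_BY_CONTENT_TYPE = {
    'pricing': timedelta(hours=4),
    'policy': timedelta(days=7),
}

def is_expired(entry: dict, now: datetime) -> bool:
    """True if the entry has outlived the TTL for its content type."""
    # Fall back to one day for unknown content types (assumed default)
    ttl = TTL_BY_CONTENT_TYPE.get(entry['content_type'], timedelta(days=1))
    return now - entry['timestamp'] > ttl

now = datetime(2024, 1, 10, 12, 0)
entry = {'content_type': 'pricing', 'timestamp': datetime(2024, 1, 10, 6, 0)}
print(is_expired(entry, now))  # 6 hours old exceeds the 4-hour pricing TTL
```

Expired entries can be dropped lazily on lookup or swept by a background job; either way the check itself stays this simple.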
2. Event-based invalidation

When the underlying data changes, invalidate the corresponding cache entries:

```python
class CacheInvalidator:
    def on_content_update(self, content_id: str, content_type: str):
        """Invalidate cache entries related to updated content."""
        # Find cached queries that reference this content
        affected_queries = self.find_queries_referencing(content_id)
        for query_id in affected_queries:
            self.cache.invalidate(query_id)
        self.log_invalidation(content_id, len(affected_queries))
```
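Finding the affected queries implies a reverse index from content IDs to cache entries, maintained as entries are written. A minimal sketch of that bookkeeping; the in-memory dict stands in for whatever store you use, and `register` would be called from the cache's `set` path:

```python
from collections import defaultdict

class ContentIndex:
    """Reverse index: content_id -> cache entry IDs that cite it."""

    def __init__(self):
        self._index = defaultdict(set)

    def register(self, cache_id: str, content_ids: list):
        """Record which content items a cached response was built from."""
        for content_id in content_ids:
            self._index[content_id].add(cache_id)

    def affected_entries(self, content_id: str) -> set:
        """Cache entries to invalidate when this content changes."""
        return self._index.get(content_id, set())

index = ContentIndex()
index.register('cache-1', ['policy-42', 'faq-7'])
index.register('cache-2', ['policy-42'])
print(index.affected_entries('policy-42'))  # both entries cite policy-42
```

The catch is provenance: you only know which content a response was built from if the generation pipeline reports it, e.g. the retrieved document IDs in a RAG setup.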
3. Staleness detection

For responses that may go stale without an obvious triggering event, I implemented a periodic freshness check:

```python
def check_freshness(self, cached_response: dict) -> bool:
    """Verify that the cached response is still valid."""
    # Re-run the query against the current data
    fresh_response = self.generate_response(cached_response['query'])
    # Compare the semantic similarity of the two responses
    cached_embedding = self.embed(cached_response['response'])
    fresh_embedding = self.embed(fresh_response)
    similarity = cosine_similarity(cached_embedding, fresh_embedding)
    # If the responses differ significantly, invalidate
    if similarity < 0.90:
        self.cache.invalidate(cached_response['id'])
        return False
    return True
```

We run freshness checks on a daily sample of cached entries, catching staleness that TTL and event-based invalidation miss.
Production results
After three months in production:
| Metric | Before | After | Change |
|---|---|---|---|
| Cache hit rate | 18% | 67% | +272% |
| LLM API cost | $47K/month | $12.7K/month | -73% |
| Average latency | 850ms | 300ms | -65% |
| False-positive rate | N/A | 0.8% | N/A |
| Customer complaints (wrong answers) | baseline | +0.3% | minimal increase |
The 0.8% false-positive rate (queries where we returned a cached response that was semantically wrong) was within acceptable limits. These cases occurred mostly near the threshold boundary, where similarity was just above the cutoff but intent differed slightly.
Pitfalls to avoid
Don't use a single global threshold. Different query types have different tolerances for errors. Tune the threshold per category.
Don't try to skip the embedding step. You may be tempted to avoid the embedding overhead, but you cannot know a query is a cache hit until you have embedded it and searched the vector store. The lookup overhead is unavoidable.
Don’t forget invalidation. Semantic caching without an invalidation strategy results in stale responses that destroy user trust. Build in invalidation from day one.
Don't cache everything. Some queries should never be cached: personalized responses, time-sensitive information, transaction confirmations. Create exclusion rules.
```python
def should_cache(self, query: str, response: str) -> bool:
    """Determine whether the response should be cached."""
    # Do not cache personalized responses
    if self.contains_personal_info(response):
        return False
    # Do not cache time-sensitive information
    if self.is_time_sensitive(query):
        return False
    # Do not cache transaction confirmations
    if self.is_transactional(query):
        return False
    return True
```
Key takeaways
Semantic caching is a practical pattern for LLM cost control that captures the redundant calls exact-match caching misses. The key challenges are threshold tuning (use query-type-specific thresholds based on precision/recall analysis) and cache invalidation (TTL, event-based, and staleness detection).
At a 73% cost reduction, this was our highest-ROI optimization for a production LLM system. Implementation complexity is moderate, but requires careful attention to threshold tuning to avoid quality degradation.
Srinivas Reddy Hulebidu Reddy is a lead software engineer.