Why your LLM bill is exploding — and how semantic caching can cut it by 73%

Our LLM API bill was increasing by 30% month-on-month. Traffic was increasing, but not that fast. When I analyzed our query logs, I found the real problem: users ask the same question in different ways.

"What is your return policy?" "How do I return something?"And "Can I get a refund?" All were hitting our LLM separately, generating almost identical responses, each incurring the full API cost.

Exact-match caching, the obvious first solution, captured only 18% of these redundant calls. The same semantic question, phrased in different words, bypassed the cache entirely.

So I implemented semantic caching: caching keyed on the meaning of questions, not their wording. After rollout, our cache hit rate rose to 67% and our LLM API costs fell by 73%. But getting there required solving problems that naive implementations miss.

Why does exact-match caching fall short?

Traditional caching uses the query text as the cache key. This works when queries are identical:

```python
# Exact-match caching
cache_key = hash(query_text)
if cache_key in cache:
    return cache[cache_key]
```

But users do not phrase their queries identically. My analysis of 100,000 production queries found:

  • Only 18% were exact duplicates of previous queries

  • 47% were semantically similar to earlier queries (same intent, different wording)

  • 35% were genuinely new questions

That 47% represented a huge cost-saving opportunity we were missing. Each semantically similar query triggered a full LLM call, producing nearly the same response we had already computed.
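A rough version of this log analysis can be scripted: treat repeated strings as exact duplicates, and count a query as semantically similar when its embedding is close to one already seen. The sketch below is illustrative, not our production code; a toy bag-of-words embedding (and an arbitrary 0.6 toy threshold) stands in for a real embedding model.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; swap in a real model in practice
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def categorize_log(queries, threshold=0.6):
    """Split a query log into exact-duplicate, semantic-duplicate, and new."""
    seen_text, seen_vecs = set(), []
    counts = {'exact': 0, 'semantic': 0, 'new': 0}
    for q in queries:
        if q in seen_text:
            counts['exact'] += 1
            continue
        vec = embed(q)
        if any(cosine(vec, v) >= threshold for v in seen_vecs):
            counts['semantic'] += 1
        else:
            counts['new'] += 1
        seen_text.add(q)
        seen_vecs.append(vec)
    return counts
```

Running this over a production log gives the three buckets directly; the threshold only needs to be rough here, since we are sizing an opportunity, not serving responses.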

Semantic Caching Architecture

Semantic caching replaces text-based keys with embedding-based similarity lookups:

```python
from datetime import datetime
from typing import Optional

class SemanticCache:
    def __init__(self, embedding_model, similarity_threshold=0.92):
        self.embedding_model = embedding_model
        self.threshold = similarity_threshold
        self.vector_store = VectorStore()      # FAISS, Pinecone, etc.
        self.response_store = ResponseStore()  # Redis, DynamoDB, etc.

    def get(self, query: str) -> Optional[str]:
        """Return a cached response if a semantically similar query exists."""
        query_embedding = self.embedding_model.encode(query)

        # Find the most similar cached query
        matches = self.vector_store.search(query_embedding, top_k=1)

        if matches and matches[0].similarity >= self.threshold:
            cache_id = matches[0].id
            return self.response_store.get(cache_id)
        return None

    def set(self, query: str, response: str):
        """Cache a query-response pair."""
        query_embedding = self.embedding_model.encode(query)
        cache_id = generate_id()

        self.vector_store.add(cache_id, query_embedding)
        self.response_store.set(cache_id, {
            'query': query,
            'response': response,
            'timestamp': datetime.utcnow(),
        })
```

Key Insight: Instead of hashing query text, I embed queries in a vector space and find cached queries within a similarity threshold.
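To make the lookup flow concrete, here is a minimal, self-contained sketch of the same idea: a toy word-overlap embedding stands in for the real embedding model, a linear scan stands in for the vector index, and a Python list stands in for the response store. `MiniSemanticCache` and its helpers are illustrative names, not our production classes.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    # Toy word-count "embedding"; stand-in for a real embedding model
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class MiniSemanticCache:
    def __init__(self, threshold=0.75):
        self.threshold = threshold
        self.entries = []  # (embedding, response) pairs; a real system uses a vector index

    def get(self, query):
        q = toy_embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def set(self, query, response):
        self.entries.append((toy_embed(query), response))

cache = MiniSemanticCache()
cache.set("what is your return policy", "You can return items within 30 days.")
hit = cache.get("what is the return policy")    # similar wording: returns the cached answer
miss = cache.get("how do I reset my password")  # unrelated: returns None
```

The structure mirrors the production version exactly; only the embedding quality and the index change at scale.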

The threshold problem

The similarity threshold is the critical parameter. Set it too high, and you’ll miss legitimate cache hits. Set it too low, and you’ll return incorrect responses.

Our initial threshold of 0.85 seemed reasonable; 85% similarity should mean "the same question," right?

Wrong. At 0.85, we got cache hits like:

  • Question: "How do I cancel my subscription?"

  • Cached: "How do I cancel my order?"

  • Similarity: 0.87

These are different questions with different answers. It would be wrong to return a cached response.

I found that the optimal threshold varies by query type:

| Query type | Optimal threshold | Rationale |
| --- | --- | --- |
| FAQ-style questions | 0.94 | High precision required; wrong answers damage trust |
| Product search | 0.88 | Greater tolerance for near-matches |
| Support questions | 0.92 | Balance between coverage and accuracy |
| Transactional questions | 0.97 | Very low tolerance for errors |

I applied query-type-specific thresholds:

```python
from typing import Optional

class AdaptiveSemanticCache:
    def __init__(self):
        self.thresholds = {
            'faq': 0.94,
            'search': 0.88,
            'support': 0.92,
            'transactional': 0.97,
            'default': 0.92,
        }
        self.query_classifier = QueryClassifier()

    def get_threshold(self, query: str) -> float:
        query_type = self.query_classifier.classify(query)
        return self.thresholds.get(query_type, self.thresholds['default'])

    def get(self, query: str) -> Optional[str]:
        threshold = self.get_threshold(query)
        query_embedding = self.embedding_model.encode(query)

        matches = self.vector_store.search(query_embedding, top_k=1)
        if matches and matches[0].similarity >= threshold:
            return self.response_store.get(matches[0].id)
        return None
```

Threshold tuning methodology

I couldn’t tune the threshold blindly. I needed ground truth on which query pairs actually mean "the same thing."

Our Methodology:

Step 1: Sample query pairs. I sampled 5,000 query pairs across similarity levels (0.80-0.99).

Step 2: Human labeling. Annotators labeled each pair as "same intent" or "different intent." I used three annotators per pair and took a majority vote.

Step 3: Compute precision/recall curves. For each candidate threshold, we computed:

  • Precision: of the cache hits, what fraction had the same intent?

  • Recall: of the same-intent pairs, what fraction did we cache-hit?

```python
def compute_precision_recall(pairs, labels, threshold):
    """Compute precision and recall at a given similarity threshold."""
    predictions = [1 if pair.similarity >= threshold else 0 for pair in pairs]

    true_positives = sum(1 for p, l in zip(predictions, labels) if p == 1 and l == 1)
    false_positives = sum(1 for p, l in zip(predictions, labels) if p == 1 and l == 0)
    false_negatives = sum(1 for p, l in zip(predictions, labels) if p == 0 and l == 1)

    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
    return precision, recall
```

Step 4: Select a threshold based on the cost of errors. For FAQs, where wrong answers damage trust, I optimized for precision (a 0.94 threshold gave 98% precision). For search queries, where a missed cache hit only costs money, I optimized for recall (0.88 threshold).
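This selection step can be automated: sweep candidate thresholds and pick the lowest one (for maximum recall) whose precision meets the target. The `pick_threshold` helper and the sample data below are hypothetical, a sketch of the procedure rather than our tuning code.

```python
def pick_threshold(sim_labels, target_precision=0.98, candidates=None):
    """Pick the lowest similarity threshold whose precision meets the target.

    sim_labels: (similarity, label) pairs, label 1 = same intent.
    Lower thresholds yield more cache hits (higher recall), so we
    prefer the lowest threshold that is still precise enough.
    """
    if candidates is None:
        candidates = [t / 100 for t in range(80, 100)]  # 0.80 .. 0.99
    for t in sorted(candidates):
        hits = [(s, l) for s, l in sim_labels if s >= t]
        if not hits:
            continue
        precision = sum(l for _, l in hits) / len(hits)
        if precision >= target_precision:
            return t
    return max(candidates)  # fall back to the strictest threshold
```

With a hand-labeled sample and a strict precision target, this walks up the candidate list until no different-intent pair slips through.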

Latency overhead

Semantic caching adds latency: You have to embed the query and search in the vector store before knowing whether to call the LLM.

Our measurements:

Operation

Latency(p50)

Latency (p99)

query embedding

12ms

28ms

vector search

8ms

19ms

total cache lookup

20ms

47ms

llm api call

850ms

2400ms

The 20ms overhead is negligible compared to the 850ms LLM call that we avoid on a cache hit. Even on p99, 47ms overhead is acceptable.

However, cache misses now take 20ms longer than before (embedding + search + the LLM call). At our 67% hit rate, the math still works strongly in our favor:

  • Before: 100% of queries × 850 ms = 850 ms average

  • After: (33% × 870 ms) + (67% × 20 ms) ≈ 287 ms + 13 ms ≈ 300 ms average

A 65% net latency improvement, on top of the cost reduction.
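That averaging generalizes to a one-line expected-latency model (a sketch; the numbers are the p50 measurements from the table above):

```python
def expected_latency(hit_rate, lookup_ms, llm_ms):
    """Expected per-query latency with a semantic cache in front of the LLM.

    Cache hits cost only the lookup; misses pay lookup plus the LLM call.
    """
    return hit_rate * lookup_ms + (1 - hit_rate) * (lookup_ms + llm_ms)

baseline = expected_latency(0.0, 0, 850)      # no cache: 850 ms per query
with_cache = expected_latency(0.67, 20, 850)  # roughly 300 ms per query
```

The same model answers "what hit rate do we need to break even" questions: since a hit saves `llm_ms` and every query pays `lookup_ms`, the cache wins whenever `hit_rate > lookup_ms / llm_ms`, about 2.4% here.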

Cache invalidation

Cached responses become stale. Product information changes, policies update, and yesterday’s right answer becomes today’s wrong answer.

I implemented three invalidation strategies:

  1. Time-based TTL

Simple expiration based on content type:

```python
from datetime import timedelta

TTL_BY_CONTENT_TYPE = {
    'pricing': timedelta(hours=4),       # changes frequently
    'policy': timedelta(days=7),         # changes rarely
    'product_info': timedelta(days=1),   # refresh daily
    'common_faq': timedelta(days=14),    # very stable
}
```
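Applying these TTLs at read time is a simple comparison against each entry's timestamp. A sketch, assuming each cached entry stores its `content_type` and `timestamp` (the `TTLS` dict and `is_expired` helper are illustrative):

```python
from datetime import datetime, timedelta

TTLS = {
    'pricing': timedelta(hours=4),
    'policy': timedelta(days=7),
    'product_info': timedelta(days=1),
    'common_faq': timedelta(days=14),
}

def is_expired(entry, now=None):
    """Check whether a cached entry has outlived its content-type TTL."""
    now = now or datetime.utcnow()
    ttl = TTLS.get(entry['content_type'], timedelta(days=1))  # assumed default
    return now - entry['timestamp'] > ttl
```

An expired entry is treated as a miss and overwritten on the next write for that query.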

  2. Event-based invalidation

When the underlying data changes, invalidate the corresponding cache entries:

```python
class CacheInvalidator:
    def on_content_update(self, content_id: str, content_type: str):
        """Invalidate cache entries related to updated content."""
        # Find cached queries that reference this content
        affected_queries = self.find_queries_referencing(content_id)

        for query_id in affected_queries:
            self.cache.invalidate(query_id)

        self.log_invalidation(content_id, len(affected_queries))
```

  3. Staleness detection

For responses that can go stale without an obvious triggering event, I implemented a periodic freshness check:

```python
def check_freshness(self, cached_response: dict) -> bool:
    """Verify that the cached response is still valid."""
    # Re-run the query against current data
    fresh_response = self.generate_response(cached_response['query'])

    # Compare the semantic similarity of the two responses
    cached_embedding = self.embed(cached_response['response'])
    fresh_embedding = self.embed(fresh_response)
    similarity = cosine_similarity(cached_embedding, fresh_embedding)

    # If the responses differ significantly, invalidate
    if similarity < 0.90:
        self.cache.invalidate(cached_response['id'])
        return False
    return True
```

We run freshness checks on a random sample of cached entries daily, catching staleness that TTL and event-based invalidation miss.

Production results

After three months in production:

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| Cache hit rate | 18% | 67% | +272% |
| LLM API cost | $47K/month | $12.7K/month | -73% |
| Average latency | 850ms | 300ms | -65% |
| False-positive rate | N/A | 0.8% | n/a |
| Customer complaints (wrong answers) | baseline | +0.3% | minimal increase |

The 0.8% false-positive rate (queries where we returned a cached response that was semantically wrong) was within acceptable limits. These cases occurred mostly near the threshold boundary, where similarity was just above the cutoff but the intent differed slightly.

Pitfalls to avoid

Don’t use a single global threshold. Different query types have different tolerances for error. Tune thresholds per category.

Don’t skip the embedding step on a cache hit. You may be tempted to avoid the embedding overhead when returning cached responses, but you need the query’s embedding to find the cache hit in the first place. The overhead is unavoidable.

Don’t forget invalidation. Semantic caching without an invalidation strategy results in stale responses that destroy user trust. Build in invalidation from day one.

Don’t cache everything. Some queries should never be cached: personalized responses, time-sensitive information, transaction confirmations. Build exclusion rules.

```python
def should_cache(self, query: str, response: str) -> bool:
    """Determine whether a response should be cached."""
    # Don't cache personalized responses
    if self.contains_personal_info(response):
        return False

    # Don't cache time-sensitive information
    if self.is_time_sensitive(query):
        return False

    # Don't cache transaction confirmations
    if self.is_transactional(query):
        return False

    return True
```

Key takeaways

Semantic caching is a practical pattern for LLM cost control that captures the redundant calls exact-match caching misses. The key challenges are threshold tuning (use query-type-specific thresholds based on precision/recall analysis) and cache invalidation (TTL, event-based, and staleness detection).

At a 73% cost reduction, this was our highest-ROI optimization for a production LLM system. Implementation complexity is moderate, but requires careful attention to threshold tuning to avoid quality degradation.

Srinivas Reddy Hulebidu Reddy is a lead software engineer.



