Semantic Caching, or: Why I Spent Three Days Making a Cache Hit Work

A semantic cache is only smart if you’re clever about what “semantic” means. I was not clever, initially.

The idea behind Sentinel’s vector cache is this: compliance verdicts are expensive to generate (Gemini API, not free) and repeatable (two transactions with the same risk profile should return the same verdict). Rather than caching by transaction ID — which would only help on exact duplicates — you cache by meaning. Two different transactions that look the same to a compliance officer should hit the same cache entry.

To do this, you need to turn a transaction into a string that captures its compliance-relevant meaning, embed that string to a 1536-dimensional vector, and store the vector alongside the verdict. On the next similar transaction, you embed its fingerprint and search for vectors within 0.95 cosine similarity. If you find one, you return the stored verdict without calling the AI.

Simple in principle. The devil, as usual, is in the fingerprint.

The First Fingerprint (Wrong)

My first fingerprint looked like this:

merchant:Starbucks | category:food_and_beverage | amount:9.50 | currency:USD | time:14:37 | type:purchase

I tested it. Cache hit rate: zero. Every transaction was unique.

The problem was time:14:37. Every transaction had a different timestamp down to the minute. Two coffees at Starbucks at 2:37pm and 2:38pm produced completely different fingerprints, which embedded to completely different vectors, which never matched each other at 0.95 similarity. I had turned a semantic cache into a very expensive exact-match cache that never matched anything.

The Fix: Buckets

The solution is obvious in retrospect. Replace exact values with buckets that map to compliance-relevant categories.

flowchart TB
  subgraph Before["Before — exact values"]
    B1[merchant: Starbucks]
    B2[amount: 9.50]
    B3[time: 14:37]
    B4[type: purchase]
    B1 --- B2 --- B3 --- B4
  end
  subgraph After["After — semantic buckets"]
    A1[merchant: Starbucks]
    A2[amount_tier: micro]
    A3[time_of_day: afternoon]
    A4[type: purchase]
    A1 --- A2 --- A3 --- A4
  end
  Before -->|every txn unique| Miss[Cache hit rate ≈ 0%]
  After -->|similar txns cluster| Hit[Real cache hits]

Time buckets: night (0–6h), morning (6–12h), afternoon (12–18h), evening (18–24h). A transaction at 2:37pm and one at 2:38pm now both produce afternoon. They’ll embed identically on this field.

Amount buckets: micro (<$10), small (<$100), medium (<$500), large (<$2000), very_large (≥$2000). The boundaries aren’t arbitrary — they map to thresholds that actually matter in compliance contexts: AML reporting thresholds, card micro-transaction patterns, high-value notification requirements.

flowchart LR
  subgraph Time["Time-of-day buckets"]
    T0[0–6h<br/>night]
    T1[6–12h<br/>morning]
    T2[12–18h<br/>afternoon]
    T3[18–24h<br/>evening]
  end
  subgraph Amount["Amount tiers"]
    A1["&lt; $10<br/>micro"]
    A2["&lt; $100<br/>small"]
    A3["&lt; $500<br/>medium"]
    A4["&lt; $2000<br/>large"]
    A5["≥ $2000<br/>very_large"]
  end

The revised fingerprint:

merchant:Starbucks | category:food_and_beverage | amount_tier:micro | currency:USD | time_of_day:afternoon | type:purchase

Now two coffees bought in the same afternoon by different people produce the same fingerprint, embed to the same vector, and hit the same cache entry. Cache hit rate went from zero to something real on the first run with a populated store.

The Upstash API Shape Gotcha

While building this, I ran into a silent failure in the Upstash Vector REST API. The response schema wraps results like this:

{ "result": [ ... ] }

Not results. Not data. result. I had typed it as results in the client code. Wrong key returns null in PHP without raising an exception. Every vector search was succeeding (HTTP 200), returning null, being treated as a cache miss, and silently falling through to the AI.

I only caught it because I added logging on the raw HTTP response during a debugging session. The fix was a one-character change. The lesson was about an hour of confusion first.

The Dimension Mismatch

There’s a separate class of mistake here: creating an Upstash Vector namespace at 768 dimensions and then configuring the embedding service to produce 1536-dimensional vectors using Gemini’s output_dimensionality: 1536 parameter.

Inserting a 1536-dim vector into a 768-dim namespace throws a 400 error. The error message is not especially illuminating. The fix is to delete the namespace and recreate it at the correct dimension. Upstash namespaces have fixed dimensions at creation time — they are not resizable.

I’ve now internalized: set the namespace dimension first, verify it matches the embedding model output, then write any upsert code.

Why 0.90?

sequenceDiagram
  participant W as Worker
  participant V as Vector cache
  participant G as Gemini Flash

  W->>W: build fingerprint (bucketed)
  W->>W: embed → 1536-dim vector
  W->>V: similarity search
  alt similarity ≥ 0.90
    V-->>W: cached verdict
    Note over W: return early — no LLM call
  else similarity < 0.90
    V-->>W: no match
    W->>G: analyze (full pipeline)
    G-->>W: verdict
    W->>V: upsert verdict
  end

The similarity threshold is 0.90 — empirically validated through testing. ADR-0015 originally proposed 0.95 as deliberately strict (a false cache hit is worse than a miss), but production data showed that for general-category transactions (coffee, groceries, gas) the fingerprints cluster tightly enough that 0.90 is safe without compromising accuracy. The bucketized fingerprints—time-of-day and amount tiers—provide enough semantic specificity that a 0.90 threshold catches real similarities while still being conservative on edge cases near reporting thresholds.

What I’d Do Differently

I’d design the fingerprint first and add embedding second. The vector similarity math is the easy part. The hard part is deciding which features of a transaction carry compliance-relevant meaning and which are noise. Get that right and the cache mostly takes care of itself.

Also: log the raw API response body on the first run. Always. You’ll thank yourself.