RAG Is Easy. Getting It to Actually Retrieve Anything Is Hard.

It worked. It just didn’t do anything.

That’s the most demoralizing kind of bug. No exception, no failed assertion, no red text in the console. The pipeline ran. Audit narratives were generated. They just didn’t reference any policy documents, because the retrieval stage was returning nothing, silently, every single time.

This is the story of how I got RAG working in Sentinel — and why “working” is doing a lot of heavy lifting in that sentence.

What the Policy RAG Does

Sentinel’s compliance analysis is backed by a policy knowledge base: chunked excerpts from AML guidelines, HIPAA requirements, GDPR obligations, indexed as vectors in Upstash. When an anomaly event comes through with a high score, the system retrieves the most relevant policy context before calling Gemini for an audit narrative.

The idea is to ground the LLM’s response in actual regulatory text rather than letting it hallucinate policy citations. Instead of “this may violate AML regulations (general)” you get “per the FinCEN threshold reporting guidance, transactions above $10,000…”: specific, traceable, auditable.

The mechanism: embed the query, search the policies namespace (separate from the transaction cache namespace, more on that shortly), retrieve the top matches above a 0.70 similarity threshold, inject the text into the prompt.

flowchart TD
  Q[Compliance-language query] --> E[Embed via Gemini embedding-001]
  E --> S[Vector search<br/>ns: policies, top 3, ≥ 0.70]
  S --> C{Results?}
  C -->|yes| Ctx[Inject policy chunks into prompt]
  C -->|no| NoCtx[No-context prompt]
  Ctx --> LLM[Gemini Flash]
  NoCtx --> LLM
  LLM --> Out["Audit narrative<br/>+ policy_refs<br/>+ confidence"]
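As a rough illustration of the select-and-inject step, here is a minimal sketch in plain PHP. The function names, the shape of each match (`id`, `score`, `text`), and the sample IDs are assumptions made for the example, not Sentinel's actual code; only the 0.70 threshold, the top-3 limit, and the fallback sentence come from the pipeline described here.

```php
<?php
// Hypothetical helper: keep the best matches at or above the threshold.
// $matches is assumed to be a list of ['id' => string, 'score' => float, 'text' => string].
function selectPolicyContext(array $matches, float $threshold = 0.70, int $topK = 3): array
{
    $kept = array_filter($matches, fn (array $m): bool => $m['score'] >= $threshold);
    // Highest score first, then cap at $topK.
    usort($kept, fn (array $a, array $b): int => $b['score'] <=> $a['score']);
    return array_slice($kept, 0, $topK);
}

// Hypothetical prompt fragment builder: falls back to the no-context
// sentence when nothing was retrieved.
function buildPolicyBlock(array $chunks): string
{
    if ($chunks === []) {
        return 'No specific policy context retrieved.';
    }
    $lines = array_map(fn (array $c): string => "[{$c['id']}] {$c['text']}", $chunks);
    return implode("\n", $lines);
}

// Made-up matches for illustration only.
$matches = [
    ['id' => 'aml-003', 'score' => 0.81, 'text' => 'Example AML chunk.'],
    ['id' => 'gdpr-012', 'score' => 0.66, 'text' => 'Example GDPR chunk.'],
    ['id' => 'hipaa-007', 'score' => 0.74, 'text' => 'Example HIPAA chunk.'],
];
$chunks = selectPolicyContext($matches);
echo buildPolicyBlock($chunks), "\n";
```

Note that the empty case produces a perfectly valid prompt, which is exactly why a retrieval that finds nothing does not break anything downstream.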

The Two-Namespace Design

flowchart LR
  subgraph Vector["Upstash Vector"]
    NS1["ns: default<br/>transaction fingerprints<br/>threshold ≥ 0.90"]
    NS2["ns: policies<br/>AML / HIPAA / GDPR chunks<br/>threshold ≥ 0.70"]
  end
  Worker[Worker] -->|cache lookup| NS1
  Worker -->|policy RAG| NS2
  Ingest[sentinel:ingest<br/>markdown → ~500-word chunks] --> NS2

The vector store has two namespaces:

default — transaction fingerprints and cached verdicts. Similarity threshold: 0.90. Empirically validated to catch real similarities on bucketed fingerprints without false positives.
policies — regulatory document chunks. Similarity threshold: 0.70. Lenient. We’d rather retrieve an adjacent policy section than return nothing.

Keeping them separate is important. A cache search should never accidentally return policy text. A policy search should never return a transaction fingerprint. They live in different semantic spaces and serve different purposes; mixing them would mean tuning a single threshold for two contradictory requirements.
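One way to keep the two thresholds from bleeding into each other is to bind each threshold to its namespace in a single place. The sketch below assumes hypothetical names (`NAMESPACE_THRESHOLDS`, `thresholdFor`, `passesThreshold`); the namespace names and threshold values are the ones given above.

```php
<?php
// Thresholds taken from the text; the constant and function names are illustrative.
const NAMESPACE_THRESHOLDS = [
    'default'  => 0.90, // transaction fingerprints and cached verdicts
    'policies' => 0.70, // regulatory document chunks
];

function thresholdFor(string $namespace): float
{
    if (!array_key_exists($namespace, NAMESPACE_THRESHOLDS)) {
        throw new InvalidArgumentException("Unknown namespace: {$namespace}");
    }
    return NAMESPACE_THRESHOLDS[$namespace];
}

function passesThreshold(string $namespace, float $score): bool
{
    return $score >= thresholdFor($namespace);
}

// The same score is a hit for policy retrieval and a miss for the cache.
var_dump(passesThreshold('policies', 0.78));
var_dump(passesThreshold('default', 0.78));
```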

The sentinel:ingest command populates the policies namespace: it reads .md files from the policies/ directory, chunks them on paragraph boundaries at roughly 500 words, embeds each chunk with the Gemini embedding API, and upserts the resulting vectors. It’s a setup step that only needs re-running when policy documents change.
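The chunking step can be sketched as follows. This is an assumption about how paragraph-boundary chunking might work, not the actual sentinel:ingest implementation: paragraphs are split on blank lines and packed greedily until the next one would push the chunk past the word budget. The function name `chunkByParagraph` is hypothetical.

```php
<?php
// Sketch: split markdown on blank lines, then pack whole paragraphs into
// chunks of at most $maxWords words. A single paragraph longer than the
// budget becomes its own (oversized) chunk rather than being cut mid-sentence.
function chunkByParagraph(string $markdown, int $maxWords = 500): array
{
    $paragraphs = preg_split('/\R\s*\R/', trim($markdown)) ?: [];
    $chunks = [];
    $current = [];
    $currentWords = 0;

    foreach ($paragraphs as $paragraph) {
        $paragraph = trim($paragraph);
        if ($paragraph === '') {
            continue;
        }
        $words = count(preg_split('/\s+/', $paragraph));
        // Flush the running chunk if this paragraph would exceed the budget.
        if ($current !== [] && $currentWords + $words > $maxWords) {
            $chunks[] = implode("\n\n", $current);
            $current = [];
            $currentWords = 0;
        }
        $current[] = $paragraph;
        $currentWords += $words;
    }
    if ($current !== []) {
        $chunks[] = implode("\n\n", $current);
    }
    return $chunks;
}

// Tiny budget so the behaviour is visible on a short input.
$doc = "one two three\n\nfour five\n\nsix seven eight nine";
print_r(chunkByParagraph($doc, 5));
```

Packing whole paragraphs keeps each chunk readable on its own, which matters once the chunk text is pasted into a prompt.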

The Silent Failure

Here’s what was happening: the RAG retrieval was running, searching the policies namespace, finding zero results above the 0.70 threshold, returning an empty array, and the prompt builder was substituting “No specific policy context retrieved.” The AI still generated a narrative — just one with no policy grounding.

No log line for zero results. No warning. The compliance_events table was filling up with audit narratives that looked plausible but were fabricated from the model’s training data, not from my actual policy documents.

I only caught it by noticing that policy_refs in the JSON response was consistently [] across hundreds of events. That’s the tell: if the RAG is working, you should see policy IDs in the references field most of the time. An empty array on every single event means the retrieval is broken.
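That check is easy to automate. The sketch below computes the share of events with an empty policy_refs array; the function name and the event shape are assumptions, and where the events come from (for example rows decoded from compliance_events) is left out.

```php
<?php
// Fraction of events whose policy_refs array is empty.
// $events is assumed to be a list of decoded responses with a 'policy_refs' key;
// a missing key is treated as empty.
function emptyRefsRate(array $events): float
{
    if ($events === []) {
        return 0.0;
    }
    $empty = array_filter(
        $events,
        fn (array $e): bool => ($e['policy_refs'] ?? []) === []
    );
    return count($empty) / count($events);
}

// Made-up sample: three of four events have no references.
$events = [
    ['policy_refs' => []],
    ['policy_refs' => []],
    ['policy_refs' => ['aml-003']],
    ['policy_refs' => []],
];
$rate = emptyRefsRate($events);
printf("empty policy_refs rate: %.2f\n", $rate);
```

A rate that sits at or near 1.0 across a large sample is the signal described above; what rate counts as healthy depends on the event mix.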

The Query Formulation Problem

The root cause was how I was building the retrieval query. My first version embedded raw telemetry:

status=critical, metric_value=94.0, anomaly_score=0.95, source_id=axm_00123

This is measurement language. It lives in a completely different semantic space from policy text, which is compliance language: requirements, obligations, thresholds, notifications, reporting windows.

Embeddings of the two land far apart. The similarity score between a telemetry query and any policy chunk never reached 0.70, so the retrieval always came back empty.
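To make the geometry concrete, here is a small cosine-similarity sketch. The three-dimensional vectors are toy values chosen by hand, not real embeddings, and the score a vector store reports may be a normalised form of raw cosine similarity, so treat the comparison against 0.70 as illustrative only.

```php
<?php
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(array $a, array $b): float
{
    if ($a === [] || count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must be non-empty and of equal length.');
    }
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // a zero vector has no direction
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

// Toy vectors (NOT real embeddings): telemetry points along a different
// axis from the policy text, while the compliance question points the same way.
$telemetry = [0.9, 0.1, 0.0];
$policy    = [0.1, 0.2, 0.95];
$question  = [0.2, 0.3, 0.9];

printf("telemetry vs policy: %.3f\n", cosineSimilarity($telemetry, $policy));
printf("question  vs policy: %.3f\n", cosineSimilarity($question, $policy));
```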

flowchart LR
  Tel["Raw telemetry<br/>status=critical, metric_value=94.0"] -->|embed as-is| E1[Embedding ≈ measurement language]
  E1 --> S1[policies search → 0 results above 0.70]
  S1 -.->|silent fail| Empty[empty policy_refs]

  Refor["Compliance-language question<br/>What obligations apply to<br/>a critical anomaly event?"] -->|embed| E2[Embedding ≈ regulatory language]
  E2 --> S2[policies search → real matches]
  S2 --> Grounded[grounded narrative]

The fix was to reformulate the query as a natural-language compliance question:

What compliance obligations, reporting requirements, and regulatory thresholds apply 
to a critical anomaly event with an immediate escalation risk?

Embedding models generally place a question close to the text that answers it, so a question phrased in compliance language lands near compliance-language passages. The same 0.70 threshold that was never reached before now returns relevant policy chunks, up to the top-3 limit, on almost every query.

I also made the query score-aware: an anomaly_score of 0.95 produces a query mentioning “immediate escalation and mandatory notification”, while 0.80 produces one mentioning “review and possible notification.” Different phrasing, different semantic neighbourhood, different policy chunks retrieved. The severity of the anomaly shapes what the compliance system looks at.
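A score-aware query builder along these lines might look like the sketch below. The function name, the exact sentence template, and the 0.90 cut-off between the two phrasings are assumptions for illustration; only the two quoted phrases and the example scores come from the description above.

```php
<?php
// Hypothetical builder: turns telemetry fields into a compliance-language question.
// The 0.90 cut-off is an assumed value, chosen so that 0.95 and 0.80
// fall on different sides of it.
function buildPolicyQuery(string $status, float $anomalyScore): string
{
    $risk = $anomalyScore >= 0.90
        ? 'immediate escalation and mandatory notification'
        : 'review and possible notification';

    return 'What compliance obligations, reporting requirements, and regulatory thresholds '
        . "apply to a {$status} anomaly event that calls for {$risk}?";
}

echo buildPolicyQuery('critical', 0.95), "\n";
echo buildPolicyQuery('critical', 0.80), "\n";
```

The raw numbers never reach the embedding call; only the language they map to does.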

The Logging Fix

After this, I added explicit logging of the retrieval count:

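use Illuminate\Support\Facades\Log;

// Log every retrieval, including the zero-result case, so an empty search is visible.
Log::info('Policy RAG retrieval', [
    'query_severity' => $severity,
    'results_count' => count($results),           // 0 here used to pass silently
    'scores' => array_column($results, 'score'),  // similarity score of each match
]);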

flowchart TD
  Q[Query] -->|"Log: query text"| E[Embed]
  E -->|"Log: embed ok, dim"| S[Vector search]
  S -->|"Log: results_count, top score"| F{"count > 0?"}
  F -->|yes| LLM[Gemini Flash]
  F -->|no — ★ used to be silent| LLM
  LLM -->|"Log: tokens, finish_reason"| Out[Verdict]

This makes the silent-zero case visible. A zero-result retrieval is not an error — sometimes a low-severity event genuinely has no close policy match — but it should always be observable. The key insight: in a multi-stage pipeline, log the output of every stage, not just the final result. A successful pipeline that produced nothing useful at stage two will look like a successful pipeline all the way to the end.
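The same idea generalises to any stage. As a minimal sketch, assuming a hypothetical `runStage` wrapper and an in-memory array standing in for a real logger, each stage's output size is recorded as it passes through:

```php
<?php
// Runs one stage and records how much it produced.
// $log is a plain array standing in for a real logger.
function runStage(string $name, callable $stage, mixed $input, array &$log): mixed
{
    $output = $stage($input);
    $log[] = [
        'stage' => $name,
        'output_count' => is_countable($output) ? count($output) : 1,
    ];
    return $output;
}

$log = [];

// Stub stages: the search deliberately returns nothing.
$query = runStage(
    'build_query',
    fn (array $event): string => "What obligations apply to a {$event['status']} anomaly event?",
    ['status' => 'critical'],
    $log
);
$results = runStage('vector_search', fn (string $q): array => [], $query, $log);
$narrative = runStage(
    'generate',
    fn (array $chunks): string => 'narrative without policy grounding',
    $results,
    $log
);

print_r($log);
```

The final stage still reports success, but the log entry for vector_search shows an output count of zero, which is the signal the original pipeline never emitted.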

What Good Looks Like

When the RAG is working correctly, audit narratives look like this:

The anomaly event (score: 0.94, critical status) triggers mandatory reporting obligations under FinCEN SAR guidelines for transactions exhibiting unusual patterns above the $5,000 threshold. Immediate escalation to compliance review is required per GDPR Article 33 for data access anomalies. Confidence: 0.87.

That’s grounded output. It’s traceable. It’s the difference between a compliance engine and an expensive autocomplete.

The implementation detail that gets you there is a single sentence in the query string. That’s both the reward and the humbling thing about building with LLMs.