Four Signals for Quality Scoring in LLM Pipelines

Published: 2026-06-02
Reading time: ~7 minutes
Tags: sentinel-l7, LLM observability, RAG, compliance AI
Series: Part 3 of Sentinel-L7 Systems Patterns · Prev: Post #07: Graduated Backpressure · Next: Post #09: Triple-Defense Idempotency

An LLM pipeline that processes every message successfully and produces no exceptions is not necessarily a pipeline that works. A compliance audit narrative can be syntactically valid, parse cleanly into a JSON schema, and still be operationally worthless — no policy citations, a vague risk level, a two-sentence narrative that says nothing specific. The pipeline completes. The dashboard shows green. Nobody knows.

This post covers two pieces of the same problem: making sure the AI gets the right inputs (domain-scoped RAG), and making sure the AI produced a useful output (quality scoring).

The Input Problem: Cross-Domain Contamination

Sentinel-L7 retrieves policy context from a vector knowledge base before asking Gemini to generate an audit narrative. The idea is that the model reasons faithfully over whatever grounding it receives — feed it the right policy chunks and you get a grounded, regulation-specific response; feed it the wrong ones and you get a confident but irrelevant narrative.

When the policy corpus is small (two documents: AML/BSA and GDPR), a global top-3 similarity search mostly works. The two documents are different enough that an AML query scores higher against AML chunks. But as the corpus grows (add HIPAA, PCI-DSS, sanctions screening, drug interaction policy), the vector space gets crowded, and a GDPR chunk can outscore the relevant AML chunk for an AML event. Gemini has no way to know the wrong policy was retrieved; it reasons faithfully and produces a confident, well-structured narrative that cites the wrong regulations.

There’s no runtime signal for this failure. The query succeeds, chunks are returned, the model runs. Everything looks fine.

The fix is to answer a different question at retrieval time: not just “which chunks are most similar to this query?” but “which chunks are relevant to this domain?”

Domain Tagging at Ingest, Filtering at Retrieval

The approach is deliberately minimal: derive a domain label from the policy filename and tag every chunk with it at ingest time. aml-bsa-compliance.md becomes domain aml; gdpr-data-processing.md becomes gdpr. The rule is the first hyphen-delimited segment of the filename, lowercased. Adding a new policy domain requires no code change — drop a {domain}-*.md file in policies/ and re-run sentinel:ingest.
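The filename rule is small enough to sketch in a few lines. The helper below is hypothetical (the name domainFromFilename and its placement are assumptions, not the actual sentinel:ingest code); it only illustrates "first hyphen-delimited segment, lowercased":

```php
<?php

/**
 * Derive a domain label from a policy filename: the first
 * hyphen-delimited segment of the basename, lowercased.
 * Hypothetical helper; the real ingest command may differ.
 */
function domainFromFilename(string $path): string
{
    // Strip directories and the extension: "policies/aml-bsa-compliance.md" -> "aml-bsa-compliance"
    $basename = pathinfo($path, PATHINFO_FILENAME);

    // Keep everything before the first hyphen (or the whole name if there is none).
    $segment = explode('-', $basename, 2)[0];

    return strtolower($segment);
}

echo domainFromFilename('policies/aml-bsa-compliance.md'), PHP_EOL;   // aml
echo domainFromFilename('policies/gdpr-data-processing.md'), PHP_EOL; // gdpr
```

Because the label comes purely from the filename, a misnamed file silently creates a new domain, so a naming check at ingest time may be worth considering.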

At retrieval time, when the Axiom payload includes a domain field, the vector search query includes a filter:

$domain = isset($data['domain']) ? (string) $data['domain'] : null;
$filter = $domain !== null ? "domain = '{$domain}'" : null;

$chunks = $this->vectorCache->searchNamespace($vector, 'policies', 0.70, 3, $filter);

Upstash Vector applies the filter server-side as part of the query, so cross-domain chunks are never returned and never seen by the model. When domain is absent the filter is null and Upstash receives an unfiltered query, identical to the previous behaviour. This makes domain filtering strictly opt-in.
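Since the domain value arrives in a payload and is interpolated into the filter string, it may be worth validating it first. The following is a minimal sketch, not the driver's actual code: buildDomainFilter is a hypothetical helper, and the lowercase-alphanumeric allow-list is an assumption based on how domain labels are derived from filenames.

```php
<?php

/**
 * Build the metadata filter for a payload, or null when no usable domain is present.
 * Hypothetical helper; the allow-list pattern is an assumption.
 */
function buildDomainFilter(array $data): ?string
{
    $domain = isset($data['domain']) ? (string) $data['domain'] : null;

    // Accept only labels such as "aml" or "gdpr". Anything containing quotes,
    // spaces, or operators is rejected before it can reach the filter expression.
    if ($domain === null || preg_match('/\A[a-z0-9]+\z/', $domain) !== 1) {
        return null;
    }

    return "domain = '{$domain}'";
}

var_dump(buildDomainFilter(['domain' => 'aml'])); // string: domain = 'aml'
var_dump(buildDomainFilter([]));                  // NULL
```

In this sketch a malformed domain falls back to the unfiltered path. Rejecting the request or logging a warning instead would be an equally reasonable choice.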

Every retrieval call logs the outcome:

Log::info('GeminiDriver: policy RAG retrieval', [
    'domain'        => $domain,
    'filter_used'   => $filter !== null,
    'chunk_count'   => count($chunks),
    'mean_score'    => $meanScore,
    'under_indexed' => $underIndexed,
]);

chunk_count = 0 with filter_used = true is an explicit signal that the filter matched nothing — a domain was stamped, a filter was applied, and the knowledge base returned empty. That’s a silent partial failure made detectable. A domain with fewer than two chunks (under_indexed = true) additionally emits a warning log, since a single chunk is unlikely to provide adequate grounding.
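The log call uses $meanScore and $underIndexed without showing how they are computed. One plausible derivation is sketched below. It assumes each chunk is an array with a numeric 'score' key, which is a guess at the shape returned by searchNamespace(), and summariseRetrieval is a hypothetical name.

```php
<?php

/**
 * Summarise a retrieval result for logging.
 * Assumes each chunk is an array with a numeric 'score' key (hypothetical shape).
 */
function summariseRetrieval(array $chunks, bool $filterUsed): array
{
    $count  = count($chunks);
    $scores = array_map(
        static fn (array $chunk): float => (float) ($chunk['score'] ?? 0.0),
        $chunks
    );

    return [
        'filter_used'   => $filterUsed,
        'chunk_count'   => $count,
        // No chunks means there is no mean to report.
        'mean_score'    => $count > 0 ? array_sum($scores) / $count : null,
        // Under-indexed: a domain filter was applied and fewer than two chunks came back.
        'under_indexed' => $filterUsed && $count < 2,
    ];
}
```

With this shape, an empty filtered result reports chunk_count = 0, mean_score = null, and under_indexed = true in a single log record.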

The current gap: AxiomProcessorService does not yet stamp domain on Axiom payloads. Until the Synapse-L4 emitter adds that field, all Axiom analysis falls through to the unfiltered path. The filter is ready; the data just hasn’t caught up yet.

The Output Problem: Behavioral Degradation

Domain filtering controls the inputs. Quality scoring addresses what comes out.

The failure modes that motivated this are all behavioral, not operational:

False negatives — a high-risk event receives risk_level: 'low' because the model produced a low-confidence response.
Narrative drift — responses gradually shorten and lose specificity as the policy corpus ages.
Silent citation loss — the model stops referencing specific regulations, producing generic prose that satisfies the schema but carries no compliance value.

None of these throw exceptions. A degraded response is structurally identical to a good one. The only way to detect them is to inspect the content.

The Four-Signal Rubric

Every compliance driver response is scored against four signals before being returned:

| Signal | Check | What failure looks like |
|---|---|---|
| Policy citation | `policy_refs` is non-empty | Model answered without citing any regulation |
| Risk level resolved | `risk_level` ≠ `'unknown'` | Model couldn’t or wouldn’t commit to a risk assessment |
| Narrative substance | narrative length ≥ 150 chars | Response was too short to be meaningful |
| Driver confidence | `confidence` ≥ 0.6 | Model expressed low confidence in its own output |

Each passing signal is worth one point. quality_score is the sum — 0 to 4. The scoring runs in a private method called at the end of analyze(), between parseResponse() and return:

private function logResponseQuality(array $result, array $data): void
{
    $hasPolicyRefs   = ! empty($result['policy_refs']);
    $hasRiskLevel    = ($result['risk_level'] ?? 'unknown') !== 'unknown';
    // mb_strlen counts characters; strlen counts bytes and overstates multi-byte text.
    $narrativeLength = mb_strlen((string) ($result['narrative'] ?? ''));
    $aboveLengthMin  = $narrativeLength >= self::NARRATIVE_LENGTH_MIN;
    $confidence      = (float) ($result['confidence'] ?? 0.0);
    $aboveConfidence = $confidence >= 0.6;

    $qualityScore = (int) $hasPolicyRefs
                  + (int) $hasRiskLevel
                  + (int) $aboveLengthMin
                  + (int) $aboveConfidence;

    $context = [
        'source_id'        => $data['source_id'] ?? null,
        'domain'           => $data['domain'] ?? null,
        'has_policy_refs'  => $hasPolicyRefs,
        'has_risk_level'   => $hasRiskLevel,
        'narrative_length' => $narrativeLength,
        'above_length_min' => $aboveLengthMin,
        'confidence'       => $confidence,
        'quality_score'    => $qualityScore,
    ];

    Log::info('GeminiDriver: response quality', $context);

    if ($qualityScore <= self::QUALITY_WARNING_THRESHOLD) {
        Log::warning('GeminiDriver: low quality score', $context);
    }
}

Every call emits an info log — that’s the baseline. Scores at or below 1 additionally emit a warning log — that’s the alert hook.

The warning threshold is ≤ 1, not 0. A score of 0 is already caught by the existing 'unexpected response shape' warning that fires when parseResponse() returns the fallback shape, so a threshold of 0 would add no new information. A score of 1 catches something more subtle: a structurally valid response where only one signal passes. The most common pattern is has_risk_level = true with everything else failing: the model committed to a risk level but produced no policy grounding, no substantive narrative, and expressed low confidence. That’s early-stage degradation, not parse failure, and it was previously invisible.

Logs, Not Columns

Quality scores are not persisted to Postgres alongside the compliance_events row. They live in structured logs. This was a deliberate choice: adding a quality_score column before there’s evidence the scores are actionable adds schema overhead and couples a monitoring concern to the data model prematurely. The info log on every call builds the baseline; if trend analysis over time proves valuable, promoting scores to a stored column is a straightforward migration.

QUALITY_WARNING_THRESHOLD and NARRATIVE_LENGTH_MIN are private constants on the driver class rather than entries in config/sentinel.php. Thresholds that configure deployment behaviour (rate limits, lag thresholds, timeouts) belong in config so they can be tuned without a deploy. Thresholds that define a measurement rubric should be stable across environments — changing NARRATIVE_LENGTH_MIN in production but not staging would make the scores mean different things in different places. The constants are in code because the rubric is the same everywhere.

What This Gives You

Together, domain filtering and quality scoring turn the compliance AI pipeline from a black box into an observable one.

Domain filtering makes the retrieval step explicit: you can see which domain was applied, how many chunks came back, and what their mean similarity score was. A filter that returns nothing is visible. A domain that’s under-indexed is visible.

Quality scoring makes the generation step explicit: you can see whether the model cited regulations, resolved a risk level, wrote a substantive narrative, and expressed confidence. Sustained multi-signal failure indicates systemic degradation. Isolated single-signal failure points to a specific, diagnosable problem.
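Telling sustained multi-signal failure apart from an isolated single-signal failure means aggregating the logged contexts. The sketch below assumes the 'response quality' records have already been read back from the log store into arrays with the same keys the driver logs. That retrieval step and the name signalFailureRates are hypothetical.

```php
<?php

/**
 * Failure rate per rubric signal over a batch of logged quality contexts.
 * Input records mirror the $context array logged by the driver.
 */
function signalFailureRates(array $records): array
{
    $failures = ['policy_refs' => 0, 'risk_level' => 0, 'length' => 0, 'confidence' => 0];

    foreach ($records as $record) {
        $failures['policy_refs'] += empty($record['has_policy_refs']) ? 1 : 0;
        $failures['risk_level']  += empty($record['has_risk_level']) ? 1 : 0;
        $failures['length']      += empty($record['above_length_min']) ? 1 : 0;
        // The log carries the raw confidence, so the 0.6 cut-off is re-applied here.
        $failures['confidence']  += ((float) ($record['confidence'] ?? 0.0)) < 0.6 ? 1 : 0;
    }

    // Avoid division by zero on an empty batch; all rates are then 0.0.
    $total = max(count($records), 1);

    return array_map(static fn (int $n): float => $n / $total, $failures);
}
```

One signal with a high rate while the others stay near zero suggests a specific cause, such as missing policy chunks for a domain. Several signals rising together is the systemic pattern.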

Neither requires persisting anything new to the database or adding infrastructure. The signal is in the logs. That’s enough to start.

Q: Why not alert when chunk_count = 0 rather than waiting for the quality score to drop?
A: They measure different things. chunk_count = 0 means the knowledge base couldn’t ground the analysis: it’s a retrieval failure, not necessarily an output failure (the model may still produce a usable response from its training data). A low quality score means the output failed the rubric regardless of why. Both are worth logging; both are already logged. A future alerting layer could treat them as independent alert conditions with different severities.

Q: The four signals weight policy citations equally with narrative length. Isn’t a missing citation more serious?
A: Probably yes. A weighted score (e.g. policy citations worth 2 points) would be more precise. But calibrating weights correctly requires baseline data: you need to know how often each signal fails and in what combinations before you can decide which matters more. The equal-weight rubric is intentionally simple at this stage. Once the logs have accumulated enough data to reveal the actual failure distribution, adjusting the weights is a one-function change.
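For illustration only, a weighted variant might look like the sketch below. The weights are placeholders rather than calibrated values, and weightedQualityScore and the above_confidence key are hypothetical names, not part of the current driver.

```php
<?php

/**
 * Weighted variant of the rubric: each passing signal contributes its weight.
 * Weights here are illustrative placeholders, not calibrated values.
 */
function weightedQualityScore(array $signals, array $weights): int
{
    $score = 0;

    foreach ($weights as $name => $weight) {
        // A signal that is missing or false contributes nothing.
        if (! empty($signals[$name])) {
            $score += $weight;
        }
    }

    return $score;
}

// Policy citations worth 2 points, everything else worth 1 (maximum 5).
$weights = [
    'has_policy_refs'  => 2,
    'has_risk_level'   => 1,
    'above_length_min' => 1,
    'above_confidence' => 1,
];
```

Changing the weights also changes the score range, so the warning threshold would need to be revisited at the same time.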

Q: What triggers the under_indexed warning?
A: A domain filter was applied and fewer than 2 chunks were returned. Fewer than two suggests the domain is either not in the knowledge base or was ingested with too little material to provide meaningful grounding. Re-running sentinel:ingest after adding policy files should clear it, but the warning will keep firing until the domain has adequate coverage.