Triple-Defense Idempotency in a Crash-Prone Event Stream
Triple-Defense Idempotency in a Crash-Prone Event Stream
Published: 2026-06-03 Reading time: ~7 minutes Tags: sentinel-l7, idempotency, Redis Streams, PostgreSQL, distributed systems Series: Part 4 of Sentinel-L7 Systems Patterns · Prev: Post #08 — Quality Scoring
At-least-once delivery is a guarantee that every message will be processed — eventually. It is not a guarantee that every message will be processed exactly once. Those two things are in direct tension, and handling the gap is what idempotency is for.
This post traces the specific failure modes in the Sentinel-L7 Axiom pipeline and the three layers that defend against them. Post #10 mentioned these briefly in a broader context; this post goes into the mechanics.
The Failure Scenario
A worker reads an Axiom from the synapse:axioms Redis Stream via XREADGROUP. The message enters the Pending Entry List — Redis’s record that this message has been delivered but not yet acknowledged. The worker calls Gemini, receives an audit narrative, writes a ComplianceEvent row to Postgres, then crashes before reaching XACK.
On the next loop iteration — or when XAUTOCLAIM runs on a sibling worker — the message is re-delivered. The worker has no memory of the previous attempt. It runs the full pipeline again: calls Gemini, gets a narrative, tries to write to Postgres.
Without idempotency, that second write succeeds. The compliance_events table now has two rows for the same event, each potentially with a different AI-generated narrative (Gemini doesn’t produce identical output on repeated calls). A compliance audit trail with duplicate entries is not just noisy — it’s a correctness violation.
There’s also a second, subtler failure mode: two workers racing on the same message. This can happen when a slow worker is still processing a message while XAUTOCLAIM’s idle timer expires and a sibling claims the same message. Both workers reach the write site simultaneously. Neither has committed yet, so a simple SELECT-then-INSERT check passes for both. One write succeeds; the other throws an unhandled exception — and because XACK fires after process() returns normally, the unhandled exception leaves the message in the PEL indefinitely. The worker keeps re-delivering it, keeps throwing, and the loop never clears.
Three layers close both failure modes.
Layer 1: Early Exit Before the AI Call
The cheapest defense runs at the top of process(), before any AI work starts:
if ($sourceId !== 'unknown' && ComplianceEvent::where('source_id', $sourceId)->exists()) {
Log::info('AxiomProcessorService: duplicate source_id — skipping AI call', [
'source_id' => $sourceId,
]);
return [
'source_id' => $sourceId,
'routed_to_ai' => false,
'risk_level' => 'skipped',
'narrative' => null,
// ...
];
}
This handles the common re-delivery case: crash-after-write, before ACK. The row already exists; the existence check returns true; the Gemini call is skipped entirely; the worker returns and ACKs cleanly. No wasted API quota, no duplicate row attempted.
This check was added after ADR-0021 was written. The original decision record considered it and rejected it for the common path — adding a DB round-trip to every single Axiom to guard against an infrequent scenario is a bad trade. The right framing is that it only runs on re-deliveries, which are infrequent. Once the triple-defense was in place and it was clear the check would only fire on the retry path, adding it was the right call.
Layer 2: firstOrCreate at the Write Site
Even past the early-exit check, the actual write uses firstOrCreate rather than create:
ComplianceEvent::firstOrCreate(['source_id' => $sourceId], $fields);
firstOrCreate issues a SELECT for a row matching source_id, and only inserts if nothing was found. First write wins — on re-delivery, the existing row is returned without modification. This matters because updateOrCreate was the obvious alternative, and it would have been wrong: if the first attempt succeeded and wrote a valid AI narrative, a retry that fails the AI call would silently overwrite a good audit_narrative with null. First-write-wins preserves the original result.
firstOrCreate alone is not enough. It is a SELECT followed by an INSERT — two separate operations. In the concurrent-worker race, both workers pass the SELECT before either commits. Both then attempt the INSERT. firstOrCreate doesn’t prevent the race; it just makes the application code clean for the non-concurrent case.
Layer 3: PostgreSQL Partial Unique Index
The database enforces what the application can’t do atomically:
CREATE UNIQUE INDEX compliance_events_source_id_unique
ON compliance_events (source_id)
WHERE source_id != 'unknown';
This index exists independently of the application layer. Any code path that writes a ComplianceEvent — now or in the future — is covered. When the concurrent-worker race fires and Worker B loses, the INSERT fails with a UniqueConstraintViolationException. Without a handler, that exception propagates, skips ACK, and leaves the message in the PEL for infinite retry.
The handler in persist() catches it cleanly:
} catch (\Illuminate\Database\UniqueConstraintViolationException) {
Log::info('AxiomProcessorService: duplicate source_id suppressed by DB constraint', [
'source_id' => $sourceId,
]);
}
persist() returns normally. process() returns normally. XACK fires. Both workers clear the message and the PEL empties. The only record is a structured info log.
The 'unknown' Edge Case
The partial index has a WHERE source_id != 'unknown' clause. Some Axioms arrive without a stable identifier — a malformed payload from a misconfigured upstream emitter. These get source_id = 'unknown'. Requiring uniqueness on 'unknown' would mean only one malformed event could ever be stored; all subsequent ones would silently fail the constraint. That’s worse than accepting occasional duplicates for a broken edge case.
For 'unknown' source IDs, persist() uses create() instead of firstOrCreate():
if ($sourceId !== 'unknown') {
ComplianceEvent::firstOrCreate(['source_id' => $sourceId], $fields);
} else {
ComplianceEvent::create(['source_id' => $sourceId, ...$fields]);
}
The early-exit check (Layer 1) is also gated on $sourceId !== 'unknown' for the same reason. These events can’t be deduplicated by ID. A warning log fires on every 'unknown' Axiom so a volume spike is detectable:
if ($sourceId === 'unknown') {
Log::warning('AxiomProcessorService: Axiom received without source_id', [
'anomaly_score' => $score,
]);
}
If a source emitter starts sending Axioms without identifiers at volume, the warning log makes it visible immediately rather than accumulating silent duplicates.
Why All Three Layers
Each layer defends a specific scenario:
| Layer | Defends against | Why not sufficient alone |
|---|---|---|
| Early-exit EXISTS | Crash-before-ACK re-delivery | Race window: passes for two concurrent workers before either commits |
firstOrCreate | Sequential re-delivery after the EXISTS window | Not atomic: SELECT + INSERT, race window still exists |
| Partial unique index + catch | Concurrent-worker race | Requires the other two layers to handle the common case cleanly |
Remove the early-exit and you waste Gemini quota on every re-delivery — the other layers still prevent duplicates, but at API cost. Remove firstOrCreate and a sequential re-delivery that slips past the early-exit check (narrow window, but possible) calls create() and the index fires — not wrong, but noisier. Remove the index and the concurrent race produces an unhandled exception and an infinite retry loop.
Together they’re defense-in-depth with clear separation of concerns: speed (skip AI before it starts), correctness (first write wins at the application layer), safety net (DB constraint catches what the application can’t atomically prevent).
Q: Why is the partial unique index in a raw DB::statement rather than Laravel’s schema builder?
A: Laravel’s ->unique() index modifier creates a standard unique index — it doesn’t support WHERE clauses. PostgreSQL partial indexes require raw SQL. The migration uses DB::statement() with the explicit CREATE UNIQUE INDEX ... WHERE source_id != 'unknown' syntax. The down() method mirrors it with DROP INDEX IF EXISTS compliance_events_source_id_unique. This is a PostgreSQL-specific feature; the migration would need adjustment for MySQL or SQLite.
Q: Could you use a database transaction instead of the three-layer approach?
A: A transaction wrapping the SELECT + INSERT would make firstOrCreate atomic and eliminate the concurrent-race window — SERIALIZABLE isolation or a SELECT FOR UPDATE would do it. The tradeoff is locking overhead on every write and potential deadlocks under concurrent load. The three-layer approach (optimistic application code + constraint catch) avoids locking entirely and handles the rare race with a caught exception. For an event stream where re-deliveries are infrequent by design, the optimistic approach is the right trade.
Q: What happens to the message that triggered the UniqueConstraintViolationException?
A: persist() catches the exception and returns normally. The caller (routeToAi() or recordSubThreshold()) returns its result. process() returns to WatchAxioms, which calls $stream->ack($msgId). The message is removed from the PEL and will not be re-delivered. The info log records the suppression. Worker B processed the message, found it already written by Worker A, swallowed the duplicate, and cleaned up. Both workers end up in a consistent state.
// comments via github discussions