Self-Healing Worker Pools: Embedding XAUTOCLAIM in Every Worker

Published: 2026-05-19
Reading time: ~7 minutes
Tags: sentinel-l7, Redis Streams, distributed systems, fault tolerance
Series: Part 1 of Sentinel-L7 Systems Patterns · Next: Post #07, Graduated Backpressure

Sentinel-L7 runs two long-lived worker processes that consume from Redis Streams. When a worker crashes mid-message, that message stays stuck in the Pending Entry List indefinitely — unless something reclaims it. The original approach used a dedicated reclaimer daemon. This post is about why that was replaced, and what “embedded recovery” looks like in practice.

The Problem with a Dedicated Reclaimer

Redis Streams’ XREADGROUP model is built around at-least-once delivery. When a worker reads a message, it enters the PEL (Pending Entry List) — a per-consumer list of messages that have been delivered but not acknowledged. The message stays there until the worker calls XACK. If the worker crashes before ACKing, the message is stranded: delivered but unprocessed, invisible to normal XREADGROUP > reads.

The original recovery mechanism was a separate ReclaimAxioms process. It ran XCLAIM on any PEL entry idle longer than a threshold, transferring ownership to itself and reprocessing. This worked, but it had two structural problems.

Problem 1: The reclaimer is a SPOF. If the reclaimer crashes or fails to start, abandoned messages sit in the PEL forever. The system that was supposed to provide resilience had its own single point of failure. composer dev had to manage three processes — web, worker, reclaimer — and if only two of them came up, you might not notice for a while.

Problem 2: Recovery is serialised. A single reclaimer handles all abandoned messages. Under a cascade failure — say two workers die simultaneously under heavy load — the reclaimer processes the backlog one message at a time. The workers that are still running have spare capacity, but none of it goes toward recovery.

The fix removes the reclaimer entirely. Recovery is distributed across the worker pool.

The New Loop Structure

Redis 6.2 introduced XAUTOCLAIM, which combines the XPENDING + XCLAIM two-step into a single atomic command. Like XCLAIM, it lets any consumer claim idle messages from any other consumer in the same group, but it finds those messages itself instead of needing a separate XPENDING scan. That makes it cheap enough for each worker to heal the pool as a side effect of its normal read loop.
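To make the PEL mechanics concrete, here is a toy in-memory model of deliver, acknowledge, and idle-based claiming. It is a sketch for intuition only: ToyPel and its methods are hypothetical names, time is passed in explicitly, and real Redis tracks all of this server-side with more detail than shown here.

```php
<?php
declare(strict_types=1);

/** Toy model of a consumer group's pending list. Illustrative, not a Redis client. */
final class ToyPel
{
    /** @var array<string, array{consumer: string, deliveredAt: int, deliveries: int}> */
    private array $entries = [];

    /** Models XREADGROUP with ">": the message is delivered and becomes pending. */
    public function deliver(string $id, string $consumer, int $nowMs): void
    {
        $this->entries[$id] = ['consumer' => $consumer, 'deliveredAt' => $nowMs, 'deliveries' => 1];
    }

    /** Models XACK: the entry leaves the pending list. */
    public function ack(string $id): void
    {
        unset($this->entries[$id]);
    }

    /** Models XAUTOCLAIM: entries idle for at least $minIdleMs move to $newConsumer. */
    public function autoClaim(string $newConsumer, int $minIdleMs, int $nowMs): array
    {
        $claimed = [];
        foreach ($this->entries as $id => $entry) {
            if ($nowMs - $entry['deliveredAt'] >= $minIdleMs) {
                $this->entries[$id] = [
                    'consumer'    => $newConsumer,
                    'deliveredAt' => $nowMs,                    // claiming resets the idle clock
                    'deliveries'  => $entry['deliveries'] + 1,  // and bumps the delivery count
                ];
                $claimed[] = $id;
            }
        }
        return $claimed;
    }

    public function owner(string $id): ?string
    {
        return $this->entries[$id]['consumer'] ?? null;
    }

    public function deliveries(string $id): int
    {
        return $this->entries[$id]['deliveries'] ?? 0;
    }
}

// worker-a reads a message at t=0 and then crashes before acknowledging it.
$pel = new ToyPel();
$pel->deliver('1-0', 'worker-a', 0);

$claimedEarly = $pel->autoClaim('worker-b', 30000, 10000); // too fresh, nothing claimed
$claimedLate  = $pel->autoClaim('worker-b', 30000, 31000); // idle past 30s, worker-b takes it
```

The second call is the recovery path: the entry changes owner and its delivery count rises to 2, which is the number the poison-message check relies on.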

The loop now looks like this:

loop:
  1. XAUTOCLAIM synapse:axioms axiom-workers <consumer> 30000 0-0 COUNT 10
     → claim any messages idle > 30s; check delivery count; process; XACK each
  2. XREADGROUP GROUP axiom-workers <consumer> COUNT 10 BLOCK 2000 STREAMS synapse:axioms >
     → read new messages; process; XACK each

Step 1 runs on every iteration. If there are no idle messages, XAUTOCLAIM returns an empty message list and costs a single round-trip (~1ms overhead). Step 2 is the normal new-message read. The net effect: every running worker is also a reclaimer. Losing one worker reduces processing capacity but does not stop recovery. The surviving workers pick up its orphaned messages on the first iteration after the idle threshold has passed.

Here’s what this looks like in WatchAxioms:

while (true) {
    // Step 1: claim and process any orphaned messages
    foreach ($stream->autoClaim($consumer, $idleMs) as $streamMsg) {
        $msgId = $streamMsg[0];

        if ($stream->deliveryCount($msgId) >= $deliveryLimit) {
            Log::error('sentinel:watch-axioms dead-letter — delivery count exceeded', [
                'stream'         => 'synapse:axioms',
                'message_id'     => $msgId,
                'delivery_limit' => $deliveryLimit,
                'consumer'       => $consumer,
            ]);
            $stream->ack($msgId);
            continue;
        }

        $this->processMessage($streamMsg, $stream, $processor);
        $stream->ack($msgId);
    }

    // Step 2: read new messages
    foreach ($stream->readGroup($consumer)->messages as $streamMsg) {
        $this->processMessage($streamMsg, $stream, $processor);
        $stream->ack($streamMsg[0]);
    }
}

The idle threshold is 30 seconds, extracted to config (sentinel.reclaim.idle_ms). Gemini round-trips peak at ~8 seconds under load, so 30 seconds leaves a margin of more than 3x (30 / 8 = 3.75) before a slow-but-alive worker has its in-progress message stolen by a sibling. If processing time grows, this is the one number to raise.
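The margin arithmetic can be turned into a small sanity check, for example at worker boot. This is a sketch under assumptions: idleMarginFactor and isIdleThresholdSafe are hypothetical helpers, and the 3x minimum is simply the rule of thumb from the paragraph above rather than a measured requirement.

```php
<?php
declare(strict_types=1);

/** Ratio between the reclaim idle threshold and the slowest observed processing time. */
function idleMarginFactor(int $idleMs, int $peakProcessingMs): float
{
    if ($peakProcessingMs <= 0) {
        throw new InvalidArgumentException('peak processing time must be positive');
    }
    return $idleMs / $peakProcessingMs;
}

/** True when the threshold leaves at least $minFactor times the peak processing time. */
function isIdleThresholdSafe(int $idleMs, int $peakProcessingMs, float $minFactor = 3.0): bool
{
    return idleMarginFactor($idleMs, $peakProcessingMs) >= $minFactor;
}

// Values from the post: 30s threshold, ~8s peak round-trip.
$factor = idleMarginFactor(30000, 8000);
$safe   = isIdleThresholdSafe(30000, 8000);
```

If peak processing time crept up to 12 seconds, the same check would fail for a 30-second threshold (30 / 12 = 2.5), which is the signal to raise sentinel.reclaim.idle_ms.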

Poison Message Detection

XAUTOCLAIM doesn’t expose a retry count in its response. On Redis 7.0 and later the response shape is as follows (Redis 6.2 returns only the first two elements):

[next-cursor, [[id, [fields...]], ...], [deleted-ids]]
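A minimal sketch of unpacking that reply, assuming the raw shape a call like executeRaw would return, with each message's fields as a flat [name, value, name, value] list. parseAutoClaimReply is a hypothetical helper and the field names in the example are invented. Client libraries that pre-parse replies may hand back a different structure.

```php
<?php
declare(strict_types=1);

/**
 * Normalise a raw XAUTOCLAIM reply into named parts.
 *
 * @return array{cursor: string, messages: array<string, array<string, string>>, deleted: array<int, string>}
 */
function parseAutoClaimReply(array $reply): array
{
    $messages = [];
    foreach ($reply[1] ?? [] as $entry) {
        // Defensive: skip anything that is not an [id, fields] pair.
        if (!is_array($entry) || count($entry) < 2 || !is_array($entry[1])) {
            continue;
        }
        [$id, $flat] = $entry;

        // Fold the flat [name, value, name, value, ...] list into an associative array.
        $fields = [];
        for ($i = 0; $i + 1 < count($flat); $i += 2) {
            $fields[(string) $flat[$i]] = (string) $flat[$i + 1];
        }
        $messages[(string) $id] = $fields;
    }

    return [
        'cursor'   => (string) ($reply[0] ?? '0-0'),
        'messages' => $messages,
        // The third element is absent on Redis 6.2, so default to an empty list.
        'deleted'  => array_values(array_map('strval', $reply[2] ?? [])),
    ];
}

// Three-element reply (Redis 7.0+) and two-element reply (Redis 6.2).
$full  = parseAutoClaimReply(['0-0', [['1-0', ['type', 'axiom', 'body', 'x']]], ['2-0']]);
$short = parseAutoClaimReply(['0-0', []]);
```

Nothing in the parsed result carries a per-message attempt counter, which is why the next step needs a second command.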

There’s no delivery count in that structure. To check whether a claimed message has been attempted too many times, a separate XPENDING call is needed per claimed message:

public function deliveryCount(string $messageId): int
{
    $result = LRedis::executeRaw([
        'XPENDING', self::STREAM_KEY, self::GROUP,
        'IDLE', '0', $messageId, $messageId, '1',
    ]);

    return isset($result[0][3]) ? (int) $result[0][3] : 0;
}

XPENDING with an ID range of $messageId $messageId 1 returns a single entry of the form [id, consumer, idle-ms, delivery-count], so the delivery count sits at zero-based index 3. This adds one extra round-trip per autoclaimed message, not per loop iteration: if XAUTOCLAIM returns nothing (the common case), deliveryCount() is never called.

When a message hits delivery_count >= 3 (configurable via sentinel.reclaim.delivery_count_limit), it is logged as a structured error and ACKed without processing:

Log::error('sentinel:watch-axioms dead-letter — delivery count exceeded', [
    'stream'         => 'synapse:axioms',
    'message_id'     => $msgId,
    'delivery_limit' => $deliveryLimit,
    'consumer'       => $consumer,
]);
$stream->ack($msgId);

The ACK is deliberate. Without it, the message would cycle through the reclaim logic indefinitely — every worker would keep claiming it, checking the count, logging an error, and re-abandoning it, burning CPU and log space forever. The ACK is the remove-from-PEL operation; the error log is the record that it happened. This is a dead-letter queue by convention rather than infrastructure.

What Was Removed

The ReclaimAxioms command and its entry in composer dev-full were deleted entirely. The process table went from:

web | queue | vite | sentinel:watch-axioms | sentinel:reclaim-axioms

to:

web | queue | vite | sentinel:watch-axioms

One fewer process to manage, monitor, and worry about failing silently.

The Subtle Constraint: XAUTOCLAIM Requires Redis 6.2+

XAUTOCLAIM was introduced in Redis 6.2. Upstash supports it; if you’re running a self-hosted Redis older than 6.2, you’d need to stay with XPENDING + XCLAIM explicitly. Worth checking your version before assuming this pattern is available.
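One way to check the version from inside the application is to read redis_version out of the INFO output before starting the loop. The sketch below assumes you have the raw INFO text as a string. redisVersionFromInfo and supportsXAutoClaim are hypothetical helpers, and some clients return INFO already parsed into an array, in which case only the comparison step applies.

```php
<?php
declare(strict_types=1);

/** Extract the redis_version value from raw INFO text, or null if it is missing. */
function redisVersionFromInfo(string $info): ?string
{
    if (preg_match('/^redis_version:(\S+)/m', $info, $matches) === 1) {
        return $matches[1];
    }
    return null;
}

/** True when the reported server version is 6.2.0 or newer. */
function supportsXAutoClaim(string $info): bool
{
    $version = redisVersionFromInfo($info);
    return $version !== null && version_compare($version, '6.2.0', '>=');
}

// Example INFO fragments; real output contains many more lines.
$modern = supportsXAutoClaim("# Server\r\nredis_version:7.2.4\r\nredis_mode:standalone\r\n");
$legacy = supportsXAutoClaim("# Server\r\nredis_version:6.0.16\r\nredis_mode:standalone\r\n");
```

The same check covers deliveryCount() as written, since the IDLE filter on XPENDING also arrived in Redis 6.2.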

The Pattern in One Sentence

Each worker runs XAUTOCLAIM before XREADGROUP on every loop iteration — recovery is a side effect of normal operation, not a separate process.

Q: What’s the PEL, and why does a message stay there on a crash?
A: The Pending Entry List is Redis’s per-consumer record of messages that have been delivered but not acknowledged. XREADGROUP delivers a message and adds it to the PEL; XACK removes it. If the consumer crashes after delivery but before ACK, the message stays in the PEL and is invisible to XREADGROUP > (which only delivers new messages). It takes an explicit XCLAIM or XAUTOCLAIM to make it visible again.

Q: Why not just retry immediately? Why the 30-second idle threshold?
A: A slow-but-alive worker can take several seconds to process a message. If the threshold is too short, a sibling worker steals the message while the original worker is still processing it, and two workers try to write the same result. The 30-second threshold gives a realistic margin above worst-case processing time. The idempotency guards (post #06) mean a double-process doesn’t corrupt data, but it wastes an AI call, so the threshold should stay generous enough to avoid routine stealing.

Q: What happens to a dead-lettered message?
A: It’s ACKed and gone from Redis. The only record is the structured Log::error entry. For Sentinel-L7 at current scale, that’s sufficient: a log query surfaces any dead-lettered messages with their IDs and timestamps. If this system ever needed retry-with-backoff for poison messages, the natural extension is to write them to a sentinel:dlq stream before ACKing, and run a separate retry worker with exponential delay.
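As a rough sketch of that extension, the two pieces a sentinel:dlq design would need are the XADD command that copies the message across and a delay schedule for the retry worker. Everything here is an assumption for illustration: dlqAddCommand, retryDelayMs, the metadata field names, and the 1-second base with a 60-second cap are not part of Sentinel-L7.

```php
<?php
declare(strict_types=1);

/**
 * Build the raw XADD command that copies a poison message into the dead-letter stream.
 * The metadata field names are hypothetical and could collide with payload fields.
 */
function dlqAddCommand(string $sourceStream, string $messageId, int $deliveryCount, array $fields): array
{
    $command = [
        'XADD', 'sentinel:dlq', '*',
        'source_stream', $sourceStream,
        'original_id', $messageId,
        'delivery_count', (string) $deliveryCount,
    ];
    foreach ($fields as $name => $value) {
        $command[] = (string) $name;
        $command[] = (string) $value;
    }
    return $command;
}

/** Exponential delay: base, 2x base, 4x base, ... capped at $capMs. Attempts start at 1. */
function retryDelayMs(int $attempt, int $baseMs = 1000, int $capMs = 60000): int
{
    $exponent = min(max(1, $attempt) - 1, 30); // clamp so the power stays a sane integer
    return (int) min($capMs, $baseMs * (2 ** $exponent));
}

$command = dlqAddCommand('synapse:axioms', '1-0', 3, ['body' => 'x']);
$delays  = array_map('retryDelayMs', [1, 2, 3, 4, 10]);
```

In this design the XADD would run before the XACK, so a crash between the two leaves the message in the PEL to be dead-lettered again rather than lost. The cost is a possible duplicate entry in sentinel:dlq, which the retry worker would need to tolerate.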