
Stage 2: Growing Pains

Your application is growing. Read traffic is saturating one instance, you need uptime guarantees, or you're processing background jobs across multiple workers. It's time to add replication, a job queue, and maybe Sentinel — but each addition introduces failure modes that didn't exist before.

The Anti-Over-Engineering Rule

Replication and Sentinel solve real problems: read scaling, high availability, and job distribution. But they don't solve write scaling or cross-node atomicity. If you need those, you're at Stage 3. Don't deploy Redis Cluster because you "might need it later."


Replication

Redis uses a leader-follower (master-replica) model. The master handles all writes; replicas receive a continuous stream of write commands and serve read traffic.

Async Replication: The Data Loss Window

```mermaid
sequenceDiagram
    participant Client
    participant Master
    participant Replica

    Client->>Master: SET user:1 "alice"
    Master->>Client: OK ✓
    Note over Client: Client thinks write is safe
    Master-->>Replica: SET user:1 "alice"
    Note over Replica: Arrives later (async)
    Note over Master,Replica: If master crashes HERE,<br/>this write is lost
```

The master acknowledges a write to the client before replicas confirm receipt. This gap — typically milliseconds under normal conditions — is the data loss window. Every write acknowledged but not yet replicated is at risk if the master crashes.

PSYNC and the Replication Backlog

When a replica briefly disconnects (network blip, restart), it doesn't need a full resync. The master maintains a replication backlog buffer — a circular buffer of recent replication stream data.

The replica sends its [replication_id, offset] pair. If the offset is still within the backlog, the master replays only the missing commands (partial resync). If the offset has been overwritten, a full resync is required — the master creates an RDB snapshot and streams the entire dataset.

Tuning repl-backlog-size

The default is 1MB. If your replicas occasionally disconnect for a few seconds and your write throughput is 10MB/s, you need at least a 30-50MB backlog to avoid expensive full resyncs. Size it based on: write_throughput * max_expected_disconnect_duration * safety_margin.
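The sizing rule can be expressed directly as a sanity check (a sketch — tune the safety margin to your risk tolerance):

```typescript
// Backlog sizing sketch: bytes of replication stream produced during the
// longest disconnect you expect to survive, times a safety margin.
function backlogSizeBytes(
  writeBytesPerSec: number,
  maxDisconnectSec: number,
  safetyMargin = 2
): number {
  return Math.ceil(writeBytesPerSec * maxDisconnectSec * safetyMargin);
}

// 10 MB/s of writes, 2-second blips, 2x margin → 40 MB
const recommended = backlogSizeBytes(10 * 1024 * 1024, 2);
```

The result maps straight onto the config directive, e.g. repl-backlog-size 40mb.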

The WAIT Command: Looks Synchronous, Isn't Truly CP

WAIT <num_replicas> <timeout_ms> blocks the client until N replicas acknowledge the most recent write. This looks like synchronous replication, but:

  • It does not make Redis a CP system
  • Acknowledged writes can still be lost during failover if the master hasn't persisted them
  • It dramatically reduces but does not eliminate the probability of write loss
  • It adds latency to every write that uses it

Use WAIT when you need a stronger durability guarantee for specific critical writes (e.g., payment state changes), not as a blanket policy.
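A toy model of WAIT's semantics (illustrative names, not the wire protocol): the command returns how many replicas have acknowledged up to the master's current replication offset, and it is up to the caller to decide what to do when that count falls short — the write is not rolled back.

```typescript
// Toy model of WAIT: count replicas whose acknowledged replication offset
// has reached the master's offset at the time of the call.
function waitResult(masterOffset: number, replicaAckOffsets: number[]): number {
  return replicaAckOffsets.filter((acked) => acked >= masterOffset).length;
}

// A caller enforcing "at least one replica has this write" for a payment;
// on failure it must compensate itself (retry, alert, refuse), since Redis
// keeps the write regardless.
function paymentWriteIsDurable(ackedReplicas: number): boolean {
  return ackedReplicas >= 1;
}
```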

Dual Replication IDs (Redis 4.0+)

Each Redis instance maintains two replication IDs: the current one and the previous master's. After a failover, the promoted replica keeps the old master's ID as its secondary. This allows other replicas of the old master to partial-resync with the new master instead of requiring a full resync — a significant operational improvement.
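The master's accept/reject logic for a partial resync can be sketched as follows (field names are illustrative, not Redis internals — but the shape matches the documented behavior: history must match one of the two IDs, and the offset must still be inside the backlog):

```typescript
interface MasterState {
  replId: string;        // current replication ID
  replId2: string;       // previous master's ID, kept after failover
  secondOffset: number;  // offset up to which replId2's history is valid
  backlogStart: number;  // oldest offset still in the replication backlog
  backlogEnd: number;    // master's current replication offset
}

function canPartialResync(m: MasterState, replId: string, offset: number): boolean {
  // The replica's history must match either our current ID, or the old
  // master's ID up to the point where the histories diverged...
  const idMatches =
    replId === m.replId || (replId === m.replId2 && offset <= m.secondOffset);
  // ...and the missing commands must still be in the backlog buffer.
  const inBacklog = offset >= m.backlogStart && offset <= m.backlogEnd;
  return idMatches && inBacklog;
}
```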

The Most Dangerous Misconfiguration

Master without persistence + auto-restart: If the master has persistence disabled and auto-restarts after a crash, it comes up with an empty dataset. All replicas then sync from this empty master, destroying all data cluster-wide. This is not hypothetical — it's a documented production failure mode.

Never run a master without persistence if it can auto-restart. Either enable persistence or ensure crashed masters require manual intervention before rejoining.
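A minimal guardrail in redis.conf terms (values illustrative):

```
# redis.conf — ensure a crashed master cannot come back empty
appendonly yes        # persist every write to the AOF
appendfsync everysec  # at most ~1s of writes lost on a crash
```

If the process runs under a supervisor with an always-restart policy, this persistence config is mandatory; otherwise make the restart policy manual.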

Other Replication Features

| Feature | What it does | When to use |
| --- | --- | --- |
| Diskless replication | Master sends the RDB directly over the socket (no disk write) | When disk I/O is the bottleneck |
| min-replicas-to-write | Master refuses writes unless N replicas are connected with lag ≤ M seconds | Safety floor for write durability |
| replica-serve-stale-data | Replicas serve old data while a full resync is in progress | Avoiding downtime during resync |

min-replicas-to-write is a safety floor, not a consistency guarantee. It prevents the worst case (writing to an isolated master with zero replicas) but doesn't prevent the data loss window inherent in async replication.
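In configuration terms (values illustrative):

```
# redis.conf on the master — a safety floor, not synchronous replication
min-replicas-to-write 1   # refuse writes with zero connected replicas...
min-replicas-max-lag 10   # ...or when every replica lags > 10 seconds
```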


Background Jobs & Queues

This is where Redis stops being "just a cache" and becomes infrastructure. BullMQ is one of the most widely-used job queue libraries, and its internals are a masterclass in using Redis primitives correctly.

BullMQ's Queue Lifecycle

```mermaid
graph LR
    ADD["addJob()"] --> W["<b>wait</b><br/><i>List</i>"]
    ADD --> D["<b>delayed</b><br/><i>Sorted Set</i>"]
    ADD --> P["<b>prioritized</b><br/><i>Sorted Set</i>"]
    D -->|timestamp reached| W
    P -->|by priority score| W
    W -->|"BRPOPLPUSH<br/>(blocking pop)"| A["<b>active</b><br/><i>List</i>"]
    A -->|success| C["<b>completed</b><br/><i>Set</i>"]
    A -->|failure| F["<b>failed</b><br/><i>Set</i>"]
    A -->|"lock expired<br/>(stalled)"| W
    F -->|retry| W
    F -->|retry with delay| D
```

| State | Redis type | Key pattern | Why this type |
| --- | --- | --- | --- |
| wait | List | bull:{queue}:wait | FIFO order, O(1) push/pop, blocking pop |
| active | List | bull:{queue}:active | Tracks in-progress jobs for stall detection |
| delayed | Sorted Set | bull:{queue}:delayed | Score = Unix timestamp, efficient range queries for due jobs |
| prioritized | Sorted Set | bull:{queue}:prioritized | Score = priority value |
| completed/failed | Set | bull:{queue}:completed | Membership tracking, optional retention |
| Job data | Hash | bull:{queue}:<jobId> | Partial reads/updates without full deserialization |
| Events | Stream | bull:{queue}:events | Persistent, replayable event log |
| Job lock | String | bull:{queue}:<jobId>:lock | SET NX PX with unique token |
| Job ID counter | String | bull:{queue}:id | INCR for sequential IDs |

Hash Tags for Cluster Compatibility

Notice the {queue} in the key patterns — the curly braces are hash tags. Redis Cluster uses only the content within {} to compute the hash slot. This forces all keys for a given queue onto the same node, which is required for Lua scripts that touch multiple keys. The trade-off: one queue = one node. You can't shard a single queue across the cluster.
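The slot computation is simple enough to sketch end to end — CRC16 (XMODEM variant) of the effective key, mod 16384. This sketch assumes ASCII key names, since Redis hashes raw bytes:

```typescript
// CRC16/XMODEM, the variant Redis Cluster uses for hash slots.
function crc16(s: string): number {
  let crc = 0;
  for (let i = 0; i < s.length; i++) {
    crc ^= s.charCodeAt(i) << 8;
    for (let b = 0; b < 8; b++) {
      crc = crc & 0x8000 ? ((crc << 1) ^ 0x1021) & 0xffff : (crc << 1) & 0xffff;
    }
  }
  return crc;
}

function hashSlot(key: string): number {
  // If the key contains a non-empty {...}, only that content is hashed.
  const open = key.indexOf("{");
  if (open !== -1) {
    const close = key.indexOf("}", open + 1);
    if (close !== -1 && close > open + 1) {
      key = key.substring(open + 1, close);
    }
  }
  return crc16(key) % 16384;
}
```

This is why bull:{orders}:wait and bull:{orders}:active always land on the same node: both hash only the string "orders".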

Why Lua, Not MULTI/EXEC

MULTI/EXEC provides atomicity — all commands in the transaction execute without interleaving. But it cannot branch on intermediate reads:

```
MULTI
  RPOPLPUSH wait active     -- Pop a job ID
  -- Can't check: does this job still exist?
  -- Can't conditionally: set a lock only if the job wasn't deleted
  -- Can't branch: skip the rest if the job is gone
EXEC
```

BullMQ needs conditional atomicity: pop a job, verify it exists, set a lock, push to active, emit an event — but only if each step's precondition holds. This is impossible with MULTI/EXEC.

Lua scripts run atomically on the Redis thread with full conditional logic:

```lua
-- Simplified sketch of BullMQ's moveToActive (the real script does more)
-- KEYS[1] = wait list, KEYS[2] = active list, KEYS[3] = events stream
-- ARGV[1] = job key prefix, ARGV[2] = lock token, ARGV[3] = lock TTL (ms)
local jobId = redis.call('RPOPLPUSH', KEYS[1], KEYS[2])
if jobId then
    if redis.call('EXISTS', ARGV[1] .. jobId) == 1 then
        redis.call('SET', ARGV[1] .. jobId .. ':lock', ARGV[2], 'NX', 'PX', ARGV[3])
        redis.call('XADD', KEYS[3], '*', 'event', 'active', 'jobId', jobId)
        return jobId
    else
        redis.call('LREM', KEYS[2], 1, jobId)  -- job data is gone; drop the orphan
        return nil
    end
end
```

Every meaningful BullMQ operation — addJob, moveToActive, moveToCompleted, moveToFailed, moveStalledJobsToWait — is a Lua script.

The Lua Trade-off

Lua scripts monopolize the Redis thread during execution. A long-running script (e.g., BullMQ's obliterate deleting thousands of keys) blocks all other clients. After a default 5 seconds (lua-time-limit), Redis starts answering other clients with BUSY errors — but the script keeps running until it completes or is terminated with SCRIPT KILL, which itself only works if the script hasn't performed a write yet; after the first write, only SHUTDOWN NOSAVE stops it.

Job Locking and Stalled Job Detection

When a worker takes a job, it acquires a lock:

```
SET bull:{queue}:<jobId>:lock <unique-token> NX PX 30000
```
  • NX: only set if the key doesn't exist (only one worker gets the lock)
  • PX 30000: auto-expires in 30 seconds
  • <unique-token>: a UUID identifying this specific worker

The worker renews the lock periodically (default: every 15 seconds) via a Lua script that checks the token matches before extending the TTL. This prevents a stale worker from renewing a lock that was already reassigned.
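The renewal check can be modeled in a few lines (an in-memory model of the pattern; in BullMQ the equivalent logic lives in a Lua script so the token check and the TTL extension are atomic):

```typescript
// In-memory model of token-checked lock renewal.
type LockStore = Map<string, { token: string; expiresAt: number }>;

function renewLock(
  locks: LockStore,
  key: string,
  token: string,
  ttlMs: number,
  now: number
): boolean {
  const lock = locks.get(key);
  // Lock gone or expired: the job may already be reassigned — don't revive it.
  if (!lock || lock.expiresAt <= now) return false;
  // Another worker's token: we lost ownership — renewal must fail.
  if (lock.token !== token) return false;
  lock.expiresAt = now + ttlMs;
  return true;
}
```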

Stalled job detection: a periodic process scans the active list and checks each job's lock key. If the lock has expired (worker died or froze), the job is stalled. The stalled checker moves it back to wait for reprocessing — or to failed if maxStalledCount is exceeded.

BullMQ: At-Least-Once, Not Exactly-Once

There is a gap between a lock expiring and the stalled checker running. During this gap:

  1. Worker A's lock expires, but it's still processing (GC pause, slow I/O)
  2. Stalled checker moves the job back to wait
  3. Worker B picks up the job and processes it
  4. Worker A finishes and tries to mark the job complete — the Lua script rejects it (wrong lock token)

The job was processed twice. Worker A's side effects (API calls, database writes) already happened. This is why BullMQ provides at-least-once delivery. Your job handlers must be idempotent — processing the same job twice should produce the same result.
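A common defense is an idempotency check keyed by job ID — sketched in memory here; in production the "already processed" record and the side effect must be committed together (e.g., in the same database transaction), or the check itself has the same race:

```typescript
// Idempotent handler sketch: a duplicate delivery of the same job ID
// performs no side effects the second time around.
const processedJobs = new Set<string>();

function handleChargeJob(jobId: string, charge: () => void): "done" | "skipped" {
  if (processedJobs.has(jobId)) return "skipped"; // duplicate delivery
  charge();                                       // the real side effect
  processedJobs.add(jobId);
  return "done";
}
```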

Move and Fetch Optimization

When a worker completes a job, BullMQ's moveToCompleted Lua script can atomically also fetch the next job from wait. This eliminates a network round-trip — the worker doesn't need a separate call to get the next job. Under high load, this nearly doubles throughput.


Pub/Sub vs Streams

Redis offers two messaging primitives, and choosing the wrong one has consequences.

Pub/Sub: Fire and Forget

| Property | Pub/Sub |
| --- | --- |
| Persistence | None — messages exist only in transit |
| Missed messages | Permanently lost if no subscriber is listening |
| Replay | Impossible — no history |
| Delivery guarantee | At-most-once (best effort) |
| Backpressure | None — slow subscribers get disconnected |

Pub/Sub is appropriate for ephemeral notifications where losing a message is acceptable: cache invalidation signals, real-time typing indicators, debug logging.

Streams: Persistent, Replayable Message Log

| Property | Streams |
| --- | --- |
| Persistence | Yes — entries are stored on disk (via AOF/RDB) |
| Missed messages | Recoverable — read from any point in history |
| Replay | Yes — consumers track their last-read ID |
| Delivery guarantee | At-least-once (with consumer groups and acknowledgment) |
| Backpressure | Natural — consumers read at their own pace |

Streams were introduced in Redis 5.0 specifically to address Pub/Sub's limitations. They provide a log-structured data store with consumer groups (similar to Kafka's model).
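The operational difference shows up in a toy model: a stream keeps its entries, so a consumer that remembers its last-read ID can reconnect and catch up, while a pub/sub subscriber only ever sees what is published while it is connected.

```typescript
// Toy in-memory stream: XADD-like append, XREAD-like "entries after ID".
interface Entry { id: number; data: string }

class ToyStream {
  private log: Entry[] = [];
  private nextId = 1;

  add(data: string): number {              // ~ XADD
    const id = this.nextId++;
    this.log.push({ id, data });
    return id;
  }

  readAfter(lastSeenId: number): Entry[] { // ~ XREAD from a stored ID
    return this.log.filter((e) => e.id > lastSeenId);
  }
}
```

A consumer that disconnects after reading entry 1 and misses two publishes simply calls readAfter(1) on reconnect and receives both — the equivalent pub/sub subscriber would receive nothing.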

BullMQ: From Pub/Sub to Streams

The predecessor library (Bull) relied heavily on Pub/Sub for notifying workers of new jobs and completions. This caused problems: if a worker disconnected briefly, it missed events and fell out of sync.

BullMQ replaced this with Redis Streams for its event system. Every state transition (job added, activated, completed, failed, progress) appends an entry to bull:{queue}:events via XADD. The QueueEvents class reads events with XREAD BLOCK, and if it disconnects, it resumes from its last-read ID — no messages lost.

BullMQ still uses Pub/Sub in limited cases for lightweight signals (e.g., notifying a paused worker to resume), where loss is acceptable.


Sentinel for High Availability

Redis Sentinel provides automatic failover: if the master dies, Sentinel promotes a replica and publishes the new master's address — clients discover it by querying Sentinel (typically via a Sentinel-aware client library). Marking the master as down requires the configured quorum of Sentinels to agree; actually executing the failover additionally requires a majority of all Sentinels.
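A minimal three-Sentinel deployment is configured per master, with the same stanza on each Sentinel node (addresses and timeouts are illustrative):

```
# sentinel.conf — monitor one master under the name "mymaster"
sentinel monitor mymaster 10.0.0.5 6379 2     # quorum of 2 to flag it down
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
sentinel parallel-syncs mymaster 1            # re-sync replicas one at a time
```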

```mermaid
graph TD
    subgraph "Normal Operation"
        S1["Sentinel 1"] & S2["Sentinel 2"] & S3["Sentinel 3"] --> M["Master"]
        M --> R1["Replica 1"]
        M --> R2["Replica 2"]
    end
```

Split-Brain: The Unavoidable Window

```mermaid
graph TD
    subgraph "Partition A (majority)"
        S1["Sentinel 1"] & S2["Sentinel 2"] --> R1["Replica 1<br/><b>promoted to Master</b>"]
    end
    subgraph "Partition B (minority)"
        S3["Sentinel 3"] --> M["Old Master<br/><b>still accepting writes</b>"]
    end

    style M fill:#F44336,color:#fff
    style R1 fill:#4CAF50,color:#fff
```

A network partition isolates the old master and one Sentinel. The majority partition (2 Sentinels) promotes Replica 1 to master. But the old master, still accessible to some clients, continues accepting writes until it realizes it's been demoted.

When the partition heals:

  1. The old master is demoted to replica
  2. It syncs from the new master
  3. All writes made to the old master during the partition are permanently lost

This Window Is Fundamental

Even with mitigations (min-replicas-to-write, deploying 3+ Sentinels across availability zones), there is always a window where both the old and the new master accept writes simultaneously: the old master keeps taking writes until min-replicas-max-lag trips or the partition heals and Sentinel demotes it. This is a design choice: Redis prioritizes availability over strict consistency.

If losing even a single write is unacceptable, Redis Sentinel (or Cluster) is not the right tool. You need a consensus-based system like etcd or ZooKeeper.


Failure Modes That Appear at This Scale

Cache Stampede (Thundering Herd)

A popular cache key expires. Hundreds of concurrent requests all miss the cache and hit the database simultaneously, potentially causing a cascade of failures.

```mermaid
graph LR
    subgraph "Normal"
        C1["Request"] --> Cache["Redis<br/>HIT ✓"]
    end
    subgraph "Stampede"
        R1["Request 1"] & R2["Request 2"] & R3["Request 3"] & R4["...Request N"] --> Miss["Redis<br/>MISS ✗"]
        Miss --> DB["Database<br/><b>overwhelmed</b>"]
    end
```

Solutions (choose based on your tolerance for stale data):

| Strategy | How it works | Trade-off |
| --- | --- | --- |
| TTL jitter | Add random variance: TTL + random(0, 300) | Simple, prevents synchronized expiration |
| Probabilistic early refresh (XFetch) | Each request has a small probability of refreshing before expiry, increasing as TTL approaches 0 | Slightly stale data, but no stampede |
| Mutex-based rebuild | First request acquires a short-lived lock and rebuilds the cache; others wait or serve stale data | Adds lock complexity |
| "Never expire" + background refresh | Cache keys never expire; a background process refreshes values periodically | Stale data during refresh intervals |

Hot Key Problem

One key receives a disproportionate volume of requests — a trending topic, a flash sale item, a popular user profile. Every key lives on exactly one instance (one node in Cluster), and that instance processes commands on a single thread, so the hot key becomes a bottleneck.

Solutions:

  • Read replicas: distribute reads across replicas (works at this scale)
  • Local caching: application-level L1 cache in front of Redis
  • Key sharding: create N copies (hot_key:1, hot_key:2, ... hot_key:N), randomly distribute reads
  • Client-side caching (Redis 6.0+): server-assisted caching with invalidation messages
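Key sharding in code (a sketch): writes fan out to all N copies, reads pick one at random, spreading load across threads/nodes.

```typescript
// N read copies of one hot key; a random suffix picks which copy a reader hits.
function shardedReadKey(
  baseKey: string,
  shards: number,
  rand: () => number = Math.random
): string {
  const n = 1 + Math.floor(rand() * shards);
  return `${baseKey}:${n}`;
}

// Writers must keep every copy up to date (and expire them together).
function allShardKeys(baseKey: string, shards: number): string[] {
  return Array.from({ length: shards }, (_, i) => `${baseKey}:${i + 1}`);
}
```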

Big Key Problem

A hash with millions of fields, a list with millions of elements, or a string storing a 100MB JSON blob. Big keys cause:

| Problem | Why |
| --- | --- |
| Blocks event loop | O(n) commands on big keys take milliseconds to seconds |
| Network saturation | Transferring the value saturates bandwidth |
| Dangerous deletion | Freeing a big key can block Redis for seconds |
| Replication lag | Big values take longer to propagate |

UNLINK (Redis 4.0+) deletes keys asynchronously in a background thread, avoiding the blocking DEL on big keys. Always use UNLINK for keys that might be large.


When to Stay Here

You don't need Redis Cluster if:

  • [x] Read replicas handle your read throughput
  • [x] Sentinel gives you the HA you need
  • [x] Your write throughput fits on a single master
  • [x] Your dataset fits on a single machine
  • [x] BullMQ handles your job processing needs
  • [x] You're not doing multi-key atomic operations that need to span multiple masters

The Over-Engineering Trap

Redis Cluster introduces hash slots, MOVED/ASK redirections, cross-node coordination limits, and the inability to do multi-key operations across slots without hash tags. If you don't need write scaling or a dataset larger than one machine, Cluster adds complexity with no benefit.