
Stage 2: Growing Pains

Your application is growing. Read traffic is saturating one instance, you need uptime guarantees, or you're processing background jobs across multiple workers. It's time to add replication, a job queue, and maybe Sentinel — but each addition introduces failure modes that didn't exist before.

The Anti-Over-Engineering Rule

Replication and Sentinel solve real problems: read scaling, high availability, and job distribution. But they don't solve write scaling or cross-node atomicity. If you need those, you're at Stage 3. Don't deploy Redis Cluster because you "might need it later."


Replication

Redis uses a leader-follower (master-replica) model. The master handles all writes; replicas receive a continuous stream of write commands and serve read traffic.

Async Replication: The Data Loss Window

```mermaid
sequenceDiagram
    participant Client
    participant Master
    participant Replica

    Client->>Master: SET user:1 "alice"
    Master->>Client: OK ✓
    Note over Client: Client thinks write is safe
    Master-->>Replica: SET user:1 "alice"
    Note over Replica: Arrives later (async)
    Note over Master,Replica: If master crashes HERE,<br/>this write is lost
```

The master acknowledges a write to the client before replicas confirm receipt. This gap — typically milliseconds under normal conditions — is the data loss window. Every write acknowledged but not yet replicated is at risk if the master crashes.

PSYNC and the Replication Backlog

When a replica briefly disconnects (network blip, restart), it doesn't need a full resync. The master maintains a replication backlog buffer — a circular buffer of recent replication stream data.

The replica sends its [replication_id, offset] pair. If the offset is still within the backlog, the master replays only the missing commands (partial resync). If the offset has been overwritten, a full resync is required — the master creates an RDB snapshot and streams the entire dataset.

Tuning repl-backlog-size

The default is 1MB. If your replicas occasionally disconnect for a few seconds and your write throughput is 10MB/s, you need at least a 30-50MB backlog to avoid expensive full resyncs. Size it based on: write_throughput * max_expected_disconnect_duration * safety_margin.
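The sizing rule can be expressed directly as a sanity check (a sketch — tune the safety margin to your risk tolerance):

```typescript
// Backlog sizing sketch: bytes of replication stream produced during the
// longest disconnect you expect to survive, times a safety margin.
function backlogSizeBytes(
  writeBytesPerSec: number,
  maxDisconnectSec: number,
  safetyMargin = 2
): number {
  return Math.ceil(writeBytesPerSec * maxDisconnectSec * safetyMargin);
}

// 10 MB/s of writes, 2-second blips, 2x margin → 40 MB
const recommended = backlogSizeBytes(10 * 1024 * 1024, 2);
```

The result maps straight onto the config directive, e.g. repl-backlog-size 40mb.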

The WAIT Command: Looks Synchronous, Isn't Truly CP

WAIT <num_replicas> <timeout_ms> blocks the client until N replicas acknowledge the most recent write. This looks like synchronous replication, but:

  • It does not make Redis a CP system
  • Acknowledged writes can still be lost during failover if the master hasn't persisted them
  • It dramatically reduces but does not eliminate the probability of write loss
  • It adds latency to every write that uses it

Use WAIT when you need a stronger durability guarantee for specific critical writes (e.g., payment state changes), not as a blanket policy.
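A toy model of WAIT's semantics (illustrative names, not the wire protocol): the command returns how many replicas have acknowledged up to the master's current replication offset, and it is up to the caller to decide what to do when that count falls short — the write is not rolled back.

```typescript
// Toy model of WAIT: count replicas whose acknowledged replication offset
// has reached the master's offset at the time of the call.
function waitResult(masterOffset: number, replicaAckOffsets: number[]): number {
  return replicaAckOffsets.filter((acked) => acked >= masterOffset).length;
}

// A caller enforcing "at least one replica has this write" for a payment;
// on failure it must compensate itself (retry, alert, refuse), since Redis
// keeps the write regardless.
function paymentWriteIsDurable(ackedReplicas: number): boolean {
  return ackedReplicas >= 1;
}
```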

Dual Replication IDs (Redis 4.0+)

Each Redis instance maintains two replication IDs: the current one and the previous master's. After a failover, the promoted replica keeps the old master's ID as its secondary. This allows other replicas of the old master to partial-resync with the new master instead of requiring a full resync — a significant operational improvement.
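The master's accept/reject logic for a partial resync can be sketched as follows (field names are illustrative, not Redis internals — but the shape matches the documented behavior: history must match one of the two IDs, and the offset must still be inside the backlog):

```typescript
interface MasterState {
  replId: string;        // current replication ID
  replId2: string;       // previous master's ID, kept after failover
  secondOffset: number;  // offset up to which replId2's history is valid
  backlogStart: number;  // oldest offset still in the replication backlog
  backlogEnd: number;    // master's current replication offset
}

function canPartialResync(m: MasterState, replId: string, offset: number): boolean {
  // The replica's history must match either our current ID, or the old
  // master's ID up to the point where the histories diverged...
  const idMatches =
    replId === m.replId || (replId === m.replId2 && offset <= m.secondOffset);
  // ...and the missing commands must still be in the backlog buffer.
  const inBacklog = offset >= m.backlogStart && offset <= m.backlogEnd;
  return idMatches && inBacklog;
}
```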

The Most Dangerous Misconfiguration

Master without persistence + auto-restart: If the master has persistence disabled and auto-restarts after a crash, it comes up with an empty dataset. All replicas then sync from this empty master, destroying all data cluster-wide. This is not hypothetical — it's a documented production failure mode.

Never run a master without persistence if it can auto-restart. Either enable persistence or ensure crashed masters require manual intervention before rejoining.
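A minimal guardrail in redis.conf terms (values illustrative):

```
# redis.conf — ensure a crashed master cannot come back empty
appendonly yes        # persist every write to the AOF
appendfsync everysec  # at most ~1s of writes lost on a crash
```

If the process runs under a supervisor with an always-restart policy, this persistence config is mandatory; otherwise make the restart policy manual.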

Other Replication Features

| Feature | What it does | When to use |
| --- | --- | --- |
| Diskless replication | Master sends the RDB directly over the socket (no disk write) | When disk I/O is the bottleneck |
| min-replicas-to-write | Master refuses writes unless N replicas are connected with lag ≤ M seconds | Safety floor for write durability |
| replica-serve-stale-data | Replicas serve old data while a full resync is in progress | Avoiding downtime during resync |

min-replicas-to-write is a safety floor, not a consistency guarantee. It prevents the worst case (writing to an isolated master with zero replicas) but doesn't prevent the data loss window inherent in async replication.
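In configuration terms (values illustrative):

```
# redis.conf on the master — a safety floor, not synchronous replication
min-replicas-to-write 1   # refuse writes with zero connected replicas...
min-replicas-max-lag 10   # ...or when every replica lags > 10 seconds
```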


Background Jobs & Queues

This is where Redis stops being "just a cache" and becomes infrastructure. BullMQ is one of the most widely-used job queue libraries, and its internals are a masterclass in using Redis primitives correctly.

BullMQ's Queue Lifecycle

```mermaid
graph LR
    ADD["addJob()"] --> W["<b>wait</b><br/><i>List</i>"]
    ADD --> D["<b>delayed</b><br/><i>Sorted Set</i>"]
    ADD --> P["<b>prioritized</b><br/><i>Sorted Set</i>"]
    D -->|timestamp reached| W
    P -->|by priority score| W
    W -->|"BRPOPLPUSH<br/>(blocking pop)"| A["<b>active</b><br/><i>List</i>"]
    A -->|success| C["<b>completed</b><br/><i>Set</i>"]
    A -->|failure| F["<b>failed</b><br/><i>Set</i>"]
    A -->|"lock expired<br/>(stalled)"| W
    F -->|retry| W
    F -->|retry with delay| D
```

| State | Redis type | Key pattern | Why this type |
| --- | --- | --- | --- |
| wait | List | bull:{queue}:wait | FIFO order, O(1) push/pop, blocking pop |
| active | List | bull:{queue}:active | Tracks in-progress jobs for stall detection |
| delayed | Sorted Set | bull:{queue}:delayed | Score = Unix timestamp, efficient range queries for due jobs |
| prioritized | Sorted Set | bull:{queue}:prioritized | Score = priority value |
| completed/failed | Set | bull:{queue}:completed | Membership tracking, optional retention |
| Job data | Hash | bull:{queue}:<jobId> | Partial reads/updates without full deserialization |
| Events | Stream | bull:{queue}:events | Persistent, replayable event log |
| Job lock | String | bull:{queue}:<jobId>:lock | SET NX PX with unique token |
| Job ID counter | String | bull:{queue}:id | INCR for sequential IDs |

Hash Tags for Cluster Compatibility

Notice the {queue} in the key patterns — the curly braces are hash tags. Redis Cluster uses only the content within {} to compute the hash slot. This forces all keys for a given queue onto the same node, which is required for Lua scripts that touch multiple keys. The trade-off: one queue = one node. You can't shard a single queue across the cluster.
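The slot computation is simple enough to sketch end to end — CRC16 (XMODEM variant) of the effective key, mod 16384. This sketch assumes ASCII key names, since Redis hashes raw bytes:

```typescript
// CRC16/XMODEM, the variant Redis Cluster uses for hash slots.
function crc16(s: string): number {
  let crc = 0;
  for (let i = 0; i < s.length; i++) {
    crc ^= s.charCodeAt(i) << 8;
    for (let b = 0; b < 8; b++) {
      crc = crc & 0x8000 ? ((crc << 1) ^ 0x1021) & 0xffff : (crc << 1) & 0xffff;
    }
  }
  return crc;
}

function hashSlot(key: string): number {
  // If the key contains a non-empty {...}, only that content is hashed.
  const open = key.indexOf("{");
  if (open !== -1) {
    const close = key.indexOf("}", open + 1);
    if (close !== -1 && close > open + 1) {
      key = key.substring(open + 1, close);
    }
  }
  return crc16(key) % 16384;
}
```

This is why bull:{orders}:wait and bull:{orders}:active always land on the same node: both hash only the string "orders".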

Why Lua, Not MULTI/EXEC

MULTI/EXEC provides atomicity — all commands in the transaction execute without interleaving. But it cannot branch on intermediate reads:

```
MULTI
  RPOPLPUSH wait active     -- Pop a job ID
  -- Can't check: does this job still exist?
  -- Can't conditionally: set a lock only if the job wasn't deleted
  -- Can't branch: skip the rest if the job is gone
EXEC
```

BullMQ needs conditional atomicity: pop a job, verify it exists, set a lock, push to active, emit an event — but only if each step's precondition holds. This is impossible with MULTI/EXEC.

Lua scripts run atomically on the Redis thread with full conditional logic:

```lua
-- Simplified sketch of BullMQ's moveToActive (the real script does more)
-- KEYS[1] = wait list, KEYS[2] = active list, KEYS[3] = events stream
-- ARGV[1] = job key prefix, ARGV[2] = lock token, ARGV[3] = lock TTL (ms)
local jobId = redis.call('RPOPLPUSH', KEYS[1], KEYS[2])
if jobId then
    if redis.call('EXISTS', ARGV[1] .. jobId) == 1 then
        redis.call('SET', ARGV[1] .. jobId .. ':lock', ARGV[2], 'NX', 'PX', ARGV[3])
        redis.call('XADD', KEYS[3], '*', 'event', 'active', 'jobId', jobId)
        return jobId
    else
        redis.call('LREM', KEYS[2], 1, jobId)  -- job data is gone; drop the orphan
        return nil
    end
end
```

Every meaningful BullMQ operation — addJob, moveToActive, moveToCompleted, moveToFailed, moveStalledJobsToWait — is a Lua script.

The Lua Trade-off

Lua scripts monopolize the Redis thread during execution. A long-running script (e.g., BullMQ's obliterate deleting thousands of keys) blocks all other clients. After a default 5 seconds (lua-time-limit), Redis starts answering other clients with BUSY errors — but the script keeps running until it completes or is terminated with SCRIPT KILL, which itself only works if the script hasn't performed a write yet; after the first write, only SHUTDOWN NOSAVE stops it.

Job Locking and Stalled Job Detection

When a worker takes a job, it acquires a lock:

```
SET bull:{queue}:<jobId>:lock <unique-token> NX PX 30000
```
  • NX: only set if the key doesn't exist (only one worker gets the lock)
  • PX 30000: auto-expires in 30 seconds
  • <unique-token>: a UUID identifying this specific worker

The worker renews the lock periodically (default: every 15 seconds) via a Lua script that checks the token matches before extending the TTL. This prevents a stale worker from renewing a lock that was already reassigned.
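The renewal check can be modeled in a few lines (an in-memory model of the pattern; in BullMQ the equivalent logic lives in a Lua script so the token check and the TTL extension are atomic):

```typescript
// In-memory model of token-checked lock renewal.
type LockStore = Map<string, { token: string; expiresAt: number }>;

function renewLock(
  locks: LockStore,
  key: string,
  token: string,
  ttlMs: number,
  now: number
): boolean {
  const lock = locks.get(key);
  // Lock gone or expired: the job may already be reassigned — don't revive it.
  if (!lock || lock.expiresAt <= now) return false;
  // Another worker's token: we lost ownership — renewal must fail.
  if (lock.token !== token) return false;
  lock.expiresAt = now + ttlMs;
  return true;
}
```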

Stalled job detection: a periodic process scans the active list and checks each job's lock key. If the lock has expired (worker died or froze), the job is stalled. The stalled checker moves it back to wait for reprocessing — or to failed if maxStalledCount is exceeded.

BullMQ: At-Least-Once, Not Exactly-Once

There is a gap between a lock expiring and the stalled checker running. During this gap:

  1. Worker A's lock expires, but it's still processing (GC pause, slow I/O)
  2. Stalled checker moves the job back to wait
  3. Worker B picks up the job and processes it
  4. Worker A finishes and tries to mark the job complete — the Lua script rejects it (wrong lock token)

The job was processed twice. Worker A's side effects (API calls, database writes) already happened. This is why BullMQ provides at-least-once delivery. Your job handlers must be idempotent — processing the same job twice should produce the same result.
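A common defense is an idempotency check keyed by job ID — sketched in memory here; in production the "already processed" record and the side effect must be committed together (e.g., in the same database transaction), or the check itself has the same race:

```typescript
// Idempotent handler sketch: a duplicate delivery of the same job ID
// performs no side effects the second time around.
const processedJobs = new Set<string>();

function handleChargeJob(jobId: string, charge: () => void): "done" | "skipped" {
  if (processedJobs.has(jobId)) return "skipped"; // duplicate delivery
  charge();                                       // the real side effect
  processedJobs.add(jobId);
  return "done";
}
```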

Move and Fetch Optimization

When a worker completes a job, BullMQ's moveToCompleted Lua script can atomically also fetch the next job from wait. This eliminates a network round-trip — the worker doesn't need a separate call to get the next job. Under high load, this nearly doubles throughput.


Pub/Sub vs Streams

Redis offers two messaging primitives, and choosing the wrong one has consequences.

Pub/Sub: Fire and Forget

| Property | Pub/Sub |
| --- | --- |
| Persistence | None — messages exist only in transit |
| Missed messages | Permanently lost if no subscriber is listening |
| Replay | Impossible — no history |
| Delivery guarantee | At-most-once (best effort) |
| Backpressure | None — slow subscribers get disconnected |

Pub/Sub is appropriate for ephemeral notifications where losing a message is acceptable: cache invalidation signals, real-time typing indicators, debug logging.

Streams: Persistent, Replayable Message Log

| Property | Streams |
| --- | --- |
| Persistence | Yes — entries are stored on disk (via AOF/RDB) |
| Missed messages | Recoverable — read from any point in history |
| Replay | Yes — consumers track their last-read ID |
| Delivery guarantee | At-least-once (with consumer groups and acknowledgment) |
| Backpressure | Natural — consumers read at their own pace |

Streams were introduced in Redis 5.0 specifically to address Pub/Sub's limitations. They provide a log-structured data store with consumer groups (similar to Kafka's model).
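The operational difference shows up in a toy model: a stream keeps its entries, so a consumer that remembers its last-read ID can reconnect and catch up, while a pub/sub subscriber only ever sees what is published while it is connected.

```typescript
// Toy in-memory stream: XADD-like append, XREAD-like "entries after ID".
interface Entry { id: number; data: string }

class ToyStream {
  private log: Entry[] = [];
  private nextId = 1;

  add(data: string): number {              // ~ XADD
    const id = this.nextId++;
    this.log.push({ id, data });
    return id;
  }

  readAfter(lastSeenId: number): Entry[] { // ~ XREAD from a stored ID
    return this.log.filter((e) => e.id > lastSeenId);
  }
}
```

A consumer that disconnects after reading entry 1 and misses two publishes simply calls readAfter(1) on reconnect and receives both — the equivalent pub/sub subscriber would receive nothing.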

BullMQ: From Pub/Sub to Streams

The predecessor library (Bull) relied heavily on Pub/Sub for notifying workers of new jobs and completions. This caused problems: if a worker disconnected briefly, it missed events and fell out of sync.

BullMQ replaced this with Redis Streams for its event system. Every state transition (job added, activated, completed, failed, progress) appends an entry to bull:{queue}:events via XADD. The QueueEvents class reads events with XREAD BLOCK, and if it disconnects, it resumes from its last-read ID — no messages lost.

BullMQ still uses Pub/Sub in limited cases for lightweight signals (e.g., notifying a paused worker to resume), where loss is acceptable.


Sentinel for High Availability

Redis Sentinel provides automatic failover: if the master dies, Sentinel promotes a replica and publishes the new master's address — clients discover it by querying Sentinel (typically via a Sentinel-aware client library). Marking the master as down requires the configured quorum of Sentinels to agree; actually executing the failover additionally requires a majority of all Sentinels.
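A minimal three-Sentinel deployment is configured per master, with the same stanza on each Sentinel node (addresses and timeouts are illustrative):

```
# sentinel.conf — monitor one master under the name "mymaster"
sentinel monitor mymaster 10.0.0.5 6379 2     # quorum of 2 to flag it down
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
sentinel parallel-syncs mymaster 1            # re-sync replicas one at a time
```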

```mermaid
graph TD
    subgraph "Normal Operation"
        S1["Sentinel 1"] & S2["Sentinel 2"] & S3["Sentinel 3"] --> M["Master"]
        M --> R1["Replica 1"]
        M --> R2["Replica 2"]
    end
```

Split-Brain: The Unavoidable Window

```mermaid
graph TD
    subgraph "Partition A (majority)"
        S1["Sentinel 1"] & S2["Sentinel 2"] --> R1["Replica 1<br/><b>promoted to Master</b>"]
    end
    subgraph "Partition B (minority)"
        S3["Sentinel 3"] --> M["Old Master<br/><b>still accepting writes</b>"]
    end

    style M fill:#F44336,color:#fff
    style R1 fill:#4CAF50,color:#fff
```

A network partition isolates the old master and one Sentinel. The majority partition (2 Sentinels) promotes Replica 1 to master. But the old master, still accessible to some clients, continues accepting writes until it realizes it's been demoted.

When the partition heals:

  1. The old master is demoted to replica
  2. It syncs from the new master
  3. All writes made to the old master during the partition are permanently lost

This Window Is Fundamental

Even with mitigations (min-replicas-to-write, deploying 3+ Sentinels across availability zones), there is always a window where both the old and the new master accept writes simultaneously: the old master keeps taking writes until min-replicas-max-lag trips or the partition heals and Sentinel demotes it. This is a design choice: Redis prioritizes availability over strict consistency.

If losing even a single write is unacceptable, Redis Sentinel (or Cluster) is not the right tool. You need a consensus-based system like etcd or ZooKeeper.


Failure Modes That Appear at This Scale

Cache Stampede (Thundering Herd)

A popular cache key expires. Hundreds of concurrent requests all miss the cache and hit the database simultaneously, potentially causing a cascade of failures.

```mermaid
graph LR
    subgraph "Normal"
        C1["Request"] --> Cache["Redis<br/>HIT ✓"]
    end
    subgraph "Stampede"
        R1["Request 1"] & R2["Request 2"] & R3["Request 3"] & R4["...Request N"] --> Miss["Redis<br/>MISS ✗"]
        Miss --> DB["Database<br/><b>overwhelmed</b>"]
    end
```

Solutions (choose based on your tolerance for stale data):

| Strategy | How it works | Trade-off |
| --- | --- | --- |
| TTL jitter | Add random variance: TTL + random(0, 300) | Simple, prevents synchronized expiration |
| Probabilistic early refresh (XFetch) | Each request has a small probability of refreshing before expiry, increasing as TTL approaches 0 | Slightly stale data, but no stampede |
| Mutex-based rebuild | First request acquires a short-lived lock and rebuilds the cache; others wait or serve stale data | Adds lock complexity |
| "Never expire" + background refresh | Cache keys never expire; a background process refreshes values periodically | Stale data during refresh intervals |

Hot Key Problem

One key receives a disproportionate volume of requests — a trending topic, a flash sale item, a popular user profile. Every key lives on exactly one instance (one node in Cluster), and that instance processes commands on a single thread, so the hot key becomes a bottleneck.

Solutions:

  • Read replicas: distribute reads across replicas (works at this scale)
  • Local caching: application-level L1 cache in front of Redis
  • Key sharding: create N copies (hot_key:1, hot_key:2, ... hot_key:N), randomly distribute reads
  • Client-side caching (Redis 6.0+): server-assisted caching with invalidation messages
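Key sharding in code (a sketch): writes fan out to all N copies, reads pick one at random, spreading load across threads/nodes.

```typescript
// N read copies of one hot key; a random suffix picks which copy a reader hits.
function shardedReadKey(
  baseKey: string,
  shards: number,
  rand: () => number = Math.random
): string {
  const n = 1 + Math.floor(rand() * shards);
  return `${baseKey}:${n}`;
}

// Writers must keep every copy up to date (and expire them together).
function allShardKeys(baseKey: string, shards: number): string[] {
  return Array.from({ length: shards }, (_, i) => `${baseKey}:${i + 1}`);
}
```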

Big Key Problem

A hash with millions of fields, a list with millions of elements, or a string storing a 100MB JSON blob. Big keys cause:

| Problem | Why |
| --- | --- |
| Blocks event loop | O(n) commands on big keys take milliseconds to seconds |
| Network saturation | Transferring the value saturates bandwidth |
| Dangerous deletion | Freeing a big key can block Redis for seconds |
| Replication lag | Big values take longer to propagate |

UNLINK (Redis 4.0+) deletes keys asynchronously in a background thread, avoiding the blocking DEL on big keys. Always use UNLINK for keys that might be large.


When to Stay Here

You don't need Redis Cluster if:

  • [x] Read replicas handle your read throughput
  • [x] Sentinel gives you the HA you need
  • [x] Your write throughput fits on a single master
  • [x] Your dataset fits on a single machine
  • [x] BullMQ handles your job processing needs
  • [x] You're not doing multi-key atomic operations that need to span multiple masters

The Over-Engineering Trap

Redis Cluster introduces hash slots, MOVED/ASK redirections, cross-node coordination limits, and the inability to do multi-key operations across slots without hash tags. If you don't need write scaling or a dataset larger than one machine, Cluster adds complexity with no benefit.