Stage 2: Growing Pains¶
Your application is growing. Read traffic is saturating one instance, you need uptime guarantees, or you're processing background jobs across multiple workers. It's time to add replication, a job queue, and maybe Sentinel — but each addition introduces failure modes that didn't exist before.
The Anti-Over-Engineering Rule
Replication and Sentinel solve real problems: read scaling, high availability, and job distribution. But they don't solve write scaling or cross-node atomicity. If you need those, you're at Stage 3. Don't deploy Redis Cluster because you "might need it later."
Replication¶
Redis uses a leader-follower (master-replica) model. The master handles all writes; replicas receive a continuous stream of write commands and serve read traffic.
Async Replication: The Data Loss Window¶
sequenceDiagram
participant Client
participant Master
participant Replica
Client->>Master: SET user:1 "alice"
Master->>Client: OK ✓
Note over Client: Client thinks write is safe
Master-->>Replica: SET user:1 "alice"
Note over Replica: Arrives later (async)
Note over Master,Replica: If master crashes HERE,<br/>this write is lost
The master acknowledges a write to the client before replicas confirm receipt. This gap — typically milliseconds under normal conditions — is the data loss window. Every write acknowledged but not yet replicated is at risk if the master crashes.
PSYNC and the Replication Backlog¶
When a replica briefly disconnects (network blip, restart), it doesn't need a full resync. The master maintains a replication backlog buffer — a circular buffer of recent replication stream data.
The replica sends its [replication_id, offset] pair. If the offset is still within the backlog, the master replays only the missing commands (partial resync). If the offset has been overwritten, a full resync is required — the master creates an RDB snapshot and streams the entire dataset.
Tuning repl-backlog-size
The default is 1MB. If your replicas occasionally disconnect for a few seconds and your write throughput is 10MB/s, you need at least a 30-50MB backlog to avoid expensive full resyncs. Size it based on: write_throughput * max_expected_disconnect_duration * safety_margin.
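The sizing rule can be written as a one-line calculation. A sketch; the function name and default 1.5x margin are illustrative, not from any Redis tooling:

```python
def backlog_size_bytes(write_bytes_per_s: float,
                       max_disconnect_s: float,
                       safety_margin: float = 1.5) -> int:
    """Minimum repl-backlog-size needed to survive a replica disconnect
    of max_disconnect_s seconds without forcing a full resync."""
    return int(write_bytes_per_s * max_disconnect_s * safety_margin)

# 10 MB/s of writes, disconnects up to 3 s, 1.5x margin -> 45 MB backlog
needed = backlog_size_bytes(10 * 1024**2, 3)
```

Round up generously: the backlog is a single circular buffer shared by all replicas, and memory spent here is far cheaper than a full resync under load.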
The WAIT Command: Looks Synchronous, Isn't Truly CP¶
WAIT <num_replicas> <timeout_ms> blocks the client until at least N replicas have acknowledged all prior writes on the connection, or the timeout expires; it returns the number of replicas actually reached. This looks like synchronous replication, but:
- It does not make Redis a CP system
- Acknowledged writes can still be lost during failover if the master hasn't persisted them
- It dramatically reduces but does not eliminate the probability of write loss
- It adds latency to every write that uses it
Use WAIT when you need a stronger durability guarantee for specific critical writes (e.g., payment state changes), not as a blanket policy.
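A sketch of gating a critical write on replica acknowledgment. The client is anything redis-py-shaped (a `set()` method and a `wait()` method that issues WAIT); `critical_write` is a hypothetical helper, not a library API:

```python
def critical_write(client, key: str, value: str,
                   min_replicas: int = 1, timeout_ms: int = 500) -> bool:
    """SET, then WAIT: block until min_replicas acknowledge or the timeout
    fires. A False return means the durability bar wasn't met -- it does
    NOT mean the write failed; it may still exist on the master."""
    client.set(key, value)
    acked = client.wait(min_replicas, timeout_ms)  # WAIT <numreplicas> <timeout>
    return acked >= min_replicas
```

Reserve this for payment-state-style writes: putting WAIT on every write adds its round-trip latency everywhere.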
Dual Replication IDs (Redis 4.0+)¶
Each Redis instance maintains two replication IDs: the current one and the previous master's. After a failover, the promoted replica keeps the old master's ID as its secondary. This allows other replicas of the old master to partial-resync with the new master instead of requiring a full resync — a significant operational improvement.
The Most Dangerous Misconfiguration
Master without persistence + auto-restart: If the master has persistence disabled and auto-restarts after a crash, it comes up with an empty dataset. All replicas then sync from this empty master, destroying all data cluster-wide. This is not hypothetical — it's a documented production failure mode.
Never run a master without persistence if it can auto-restart. Either enable persistence or ensure crashed masters require manual intervention before rejoining.
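In redis.conf terms, the safe baseline for a master that can auto-restart looks like this (values illustrative; both directives are standard):

```conf
# A master that can auto-restart must persist its dataset,
# or a crash-restart cycle wipes every replica too.
appendonly yes            # enable AOF persistence
appendfsync everysec      # fsync at most once per second
```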
Other Replication Features¶
| Feature | What it does | When to use |
|---|---|---|
| Diskless replication | Master sends RDB directly over the socket (no disk write) | When disk I/O is the bottleneck |
| min-replicas-to-write | Master refuses writes unless N replicas are connected with lag ≤ M seconds | Safety floor for write durability |
| replica-serve-stale-data | Replicas serve old data during full resync (Redis 4.0+) | Avoiding downtime during resync |
min-replicas-to-write is a safety floor, not a consistency guarantee. It prevents the worst case (writing to an isolated master with zero replicas) but doesn't prevent the data loss window inherent in async replication.
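The safety floor is two redis.conf directives working together (the numbers here are illustrative starting points, not recommendations):

```conf
# Refuse writes unless at least 1 replica is connected
# and its replication lag is no more than 10 seconds.
min-replicas-to-write 1
min-replicas-max-lag 10
```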
Background Jobs & Queues¶
This is where Redis stops being "just a cache" and becomes infrastructure. BullMQ is one of the most widely used job queue libraries, and its internals are a masterclass in using Redis primitives correctly.
BullMQ's Queue Lifecycle¶
graph LR
ADD["addJob()"] --> W["<b>wait</b><br/><i>List</i>"]
ADD --> D["<b>delayed</b><br/><i>Sorted Set</i>"]
ADD --> P["<b>prioritized</b><br/><i>Sorted Set</i>"]
D -->|timestamp reached| W
P -->|by priority score| W
W -->|"BRPOPLPUSH<br/>(blocking pop)"| A["<b>active</b><br/><i>List</i>"]
A -->|success| C["<b>completed</b><br/><i>Set</i>"]
A -->|failure| F["<b>failed</b><br/><i>Set</i>"]
A -->|"lock expired<br/>(stalled)"| W
F -->|retry| W
F -->|retry with delay| D
| State | Redis type | Key pattern | Why this type |
|---|---|---|---|
| wait | List | bull:{queue}:wait | FIFO order, O(1) push/pop, blocking pop |
| active | List | bull:{queue}:active | Tracks in-progress jobs for stall detection |
| delayed | Sorted Set | bull:{queue}:delayed | Score = Unix timestamp, efficient range queries for due jobs |
| prioritized | Sorted Set | bull:{queue}:prioritized | Score = priority value |
| completed/failed | Set | bull:{queue}:completed | Membership tracking, optional retention |
| Job data | Hash | bull:{queue}:<jobId> | Partial reads/updates without full deserialization |
| Events | Stream | bull:{queue}:events | Persistent, replayable event log |
| Job lock | String | bull:{queue}:<jobId>:lock | SET NX PX with unique token |
| Job ID counter | String | bull:{queue}:id | INCR for sequential IDs |
Hash Tags for Cluster Compatibility
Notice the {queue} in the key patterns — the curly braces are hash tags. Redis Cluster uses only the content within {} to compute the hash slot. This forces all keys for a given queue onto the same node, which is required for Lua scripts that touch multiple keys. The trade-off: one queue = one node. You can't shard a single queue across the cluster.
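The tag-extraction rule is simple enough to sketch in a few lines (a Python stand-in for the first step of slot selection; the real client then runs CRC16 over the returned substring):

```python
def hash_tag(key: str) -> str:
    """Return the part of the key Redis Cluster actually hashes:
    the content of the first {...} if non-empty, else the whole key."""
    start = key.find('{')
    if start != -1:
        end = key.find('}', start + 1)
        if end > start + 1:  # a {} pair with nothing inside is ignored
            return key[start + 1:end]
    return key

# Every key of the "email" queue hashes to the same slot:
assert hash_tag("bull:{email}:wait") == hash_tag("bull:{email}:events") == "email"
```

Only the first `{...}` counts, and an empty `{}` is ignored; both details matter if you construct tagged keys by hand.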
Why Lua, Not MULTI/EXEC¶
MULTI/EXEC provides atomicity — all commands in the transaction execute without interleaving. But it cannot branch on intermediate reads:
MULTI
RPOPLPUSH wait active -- Pop a job ID
-- Can't check: does this job still exist?
-- Can't conditionally: set a lock only if the job wasn't deleted
-- Can't branch: skip the rest if the job is gone
EXEC
BullMQ needs conditional atomicity: pop a job, verify it exists, set a lock, push to active, emit an event — but only if each step's precondition holds. This is impossible with MULTI/EXEC.
Lua scripts run atomically on the Redis thread with full conditional logic:
-- Pseudocode of BullMQ's moveToActive
local jobId = redis.call('RPOPLPUSH', KEYS.wait, KEYS.active)
if jobId then
local jobExists = redis.call('EXISTS', prefix .. jobId)
if jobExists == 1 then
redis.call('SET', prefix .. jobId .. ':lock', token, 'NX', 'PX', lockDuration)
redis.call('XADD', KEYS.events, '*', 'event', 'active', 'jobId', jobId)
return jobId
else
redis.call('LREM', KEYS.active, 1, jobId) -- Clean up
return nil
end
end
Every meaningful BullMQ operation — addJob, moveToActive, moveToCompleted, moveToFailed, moveStalledJobsToWait — is a Lua script.
The Lua Trade-off
Lua scripts monopolize the Redis thread during execution. A long-running script (e.g., BullMQ's obliterate deleting thousands of keys) blocks all other clients. Redis enforces a default 5-second timeout (lua-time-limit), after which clients receive BUSY errors — but the script keeps running until completion or SCRIPT KILL.
Job Locking and Stalled Job Detection¶
When a worker takes a job, it acquires a lock:
SET bull:{queue}:<jobId>:lock <unique-token> NX PX 30000

- NX: only set if the key doesn't exist (only one worker gets the lock)
- PX 30000: auto-expires in 30 seconds
- <unique-token>: a UUID identifying this specific worker
The worker renews the lock periodically (default: every 15 seconds) via a Lua script that checks the token matches before extending the TTL. This prevents a stale worker from renewing a lock that was already reassigned.
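A sketch of the check-then-extend rule, with a plain dict standing in for Redis. In BullMQ this runs as a single atomic Lua script; the function name and `(token, expiry)` store shape here are illustrative:

```python
def renew_lock(store: dict, lock_key: str, token: str,
               ttl_ms: int, now_ms: int) -> bool:
    """Extend the lock TTL only if this worker's token still owns it."""
    entry = store.get(lock_key)            # (owner_token, expires_at_ms) or None
    if entry is None or entry[0] != token:
        return False                       # expired, deleted, or reassigned
    store[lock_key] = (token, now_ms + ttl_ms)
    return True
```

Without the token comparison, a stale worker waking up from a GC pause could extend a lock that now belongs to another worker, which is exactly the bug this guards against.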
Stalled job detection: a periodic process scans the active list and checks each job's lock key. If the lock has expired (worker died or froze), the job is stalled. The stalled checker moves it back to wait for reprocessing — or to failed if maxStalledCount is exceeded.
BullMQ: At-Least-Once, Not Exactly-Once
There is a gap between a lock expiring and the stalled checker running. During this gap:
- Worker A's lock expires, but it's still processing (GC pause, slow I/O)
- The stalled checker moves the job back to wait
- Worker B picks up the job and processes it
- Worker A finishes and tries to mark the job complete — the Lua script rejects it (wrong lock token)
The job was processed twice. Worker A's side effects (API calls, database writes) already happened. This is why BullMQ provides at-least-once delivery. Your job handlers must be idempotent — processing the same job twice should produce the same result.
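A minimal sketch of an idempotent handler. The in-memory dedup set, function name, and ledger are illustrative; in production the check would be a Redis SET or a database unique constraint, committed in the same transaction as the side effect:

```python
def handle_charge(job_id: str, amount_cents: int,
                  processed: set, ledger: list) -> bool:
    """At-least-once delivery means this can run twice for one job.
    Keying the side effect on job_id makes the second run a no-op."""
    if job_id in processed:
        return False          # duplicate delivery: skip the side effect
    ledger.append((job_id, amount_cents))
    processed.add(job_id)
    return True
```

The job ID is the natural idempotency key because BullMQ guarantees it is stable across retries and stalled-job redeliveries.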
Move and Fetch Optimization¶
When a worker completes a job, BullMQ's moveToCompleted Lua script can atomically fetch the next job from wait in the same call. This eliminates a network round-trip — the worker doesn't need a separate call to get the next job — and under high load it can nearly double throughput, since every completion also serves as the next fetch.
Pub/Sub vs Streams¶
Redis offers two messaging primitives, and choosing the wrong one has consequences.
Pub/Sub: Fire and Forget¶
| Property | Pub/Sub |
|---|---|
| Persistence | None — messages exist only in transit |
| Missed messages | Permanently lost if no subscriber is listening |
| Replay | Impossible — no history |
| Delivery guarantee | At-most-once (best effort) |
| Backpressure | None — slow subscribers get disconnected |
Pub/Sub is appropriate for ephemeral notifications where losing a message is acceptable: cache invalidation signals, real-time typing indicators, debug logging.
Streams: Persistent, Replayable Message Log¶
| Property | Streams |
|---|---|
| Persistence | Yes — entries are stored on disk (via AOF/RDB) |
| Missed messages | Recoverable — read from any point in history |
| Replay | Yes — consumers track their last-read ID |
| Delivery guarantee | At-least-once (with consumer groups and acknowledgment) |
| Backpressure | Natural — consumers read at their own pace |
Streams were introduced in Redis 5.0 specifically to address Pub/Sub's limitations. They provide a log-structured data store with consumer groups (similar to Kafka's model).
BullMQ: From Pub/Sub to Streams
The predecessor library (Bull) relied heavily on Pub/Sub for notifying workers of new jobs and completions. This caused problems: if a worker disconnected briefly, it missed events and fell out of sync.
BullMQ replaced this with Redis Streams for its event system. Every state transition (job added, activated, completed, failed, progress) appends an entry to bull:{queue}:events via XADD. The QueueEvents class reads events with XREAD BLOCK, and if it disconnects, it resumes from its last-read ID — no messages lost.
BullMQ still uses Pub/Sub in limited cases for lightweight signals (e.g., notifying a paused worker to resume), where loss is acceptable.
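The resume-from-last-ID behavior is the whole point. A toy model, where a list of (id, payload) pairs with ascending ids stands in for the Stream:

```python
def read_after(log, last_id):
    """Return entries newer than last_id plus the new cursor --
    the XREAD-style semantics that Pub/Sub cannot offer."""
    fresh = [(i, p) for i, p in log if i > last_id]
    return fresh, (fresh[-1][0] if fresh else last_id)

events = [(1, "added"), (2, "active"), (3, "completed")]
# Reconnect after a blip, knowing we last saw id 1:
fresh, cursor = read_after(events, 1)
# fresh == [(2, "active"), (3, "completed")], cursor == 3
```

A Pub/Sub subscriber that disconnected between events 1 and 3 would simply never see events 2 and 3; a Stream reader replays them.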
Sentinel for High Availability¶
Redis Sentinel provides automatic failover: if the master dies, Sentinel promotes a replica, reconfigures the remaining replicas to follow it, and publishes the new master's address so Sentinel-aware clients can reconnect. A configured quorum of Sentinels must agree the master is down, and a majority must authorize the failover, before it begins.
graph TD
subgraph "Normal Operation"
S1["Sentinel 1"] & S2["Sentinel 2"] & S3["Sentinel 3"] --> M["Master"]
M --> R1["Replica 1"]
M --> R2["Replica 2"]
end
Split-Brain: The Unavoidable Window¶
graph TD
subgraph "Partition A (majority)"
S1["Sentinel 1"] & S2["Sentinel 2"] --> R1["Replica 1<br/><b>promoted to Master</b>"]
end
subgraph "Partition B (minority)"
S3["Sentinel 3"] --> M["Old Master<br/><b>still accepting writes</b>"]
end
style M fill:#F44336,color:#fff
style R1 fill:#4CAF50,color:#fff
A network partition isolates the old master and one Sentinel. The majority partition (2 Sentinels) promotes Replica 1 to master. But the old master, still accessible to some clients, continues accepting writes until it realizes it's been demoted.
When the partition heals:
- The old master is demoted to replica
- It syncs from the new master
- All writes made to the old master during the partition are permanently lost
This Window Is Fundamental
Even with mitigations (min-replicas-to-write, deploying 3+ Sentinels across availability zones), there is always a window, lasting until the old master is demoted and reconfigured as a replica, where both the old and new master accept writes simultaneously. This is a design choice: Redis prioritizes availability over strict consistency.
If losing even a single write is unacceptable, Redis Sentinel (or Cluster) is not the right tool. You need a consensus-based system like etcd or ZooKeeper.
Failure Modes That Appear at This Scale¶
Cache Stampede (Thundering Herd)¶
A popular cache key expires. Hundreds of concurrent requests all miss the cache and hit the database simultaneously, potentially causing a cascade of failures.
graph LR
subgraph "Normal"
C1["Request"] --> Cache["Redis<br/>HIT ✓"]
end
subgraph "Stampede"
R1["Request 1"] & R2["Request 2"] & R3["Request 3"] & R4["...Request N"] --> Miss["Redis<br/>MISS ✗"]
Miss --> DB["Database<br/><b>overwhelmed</b>"]
end
Solutions (choose based on your tolerance for stale data):
| Strategy | How it works | Trade-off |
|---|---|---|
| TTL jitter | Add random variance: TTL + random(0, 300) | Simple, prevents synchronized expiration |
| Probabilistic early refresh (XFetch) | Each request has a small probability of refreshing before expiry, increasing as TTL approaches 0 | Slightly stale data, but no stampede |
| Mutex-based rebuild | First request acquires a short-lived lock and rebuilds the cache; others wait or serve stale data | Adds lock complexity |
| "Never expire" + background refresh | Cache keys never expire; a background process refreshes values periodically | Stale data during refresh intervals |
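TTL jitter is the cheapest of these strategies. A sketch; the function name and 300-second jitter window are illustrative:

```python
import random

def jittered_ttl(base_s: int, jitter_s: int = 300) -> int:
    """Spread expirations so keys warmed at the same time
    don't all expire in the same instant."""
    return base_s + random.randint(0, jitter_s)

# e.g. client.set(key, value, ex=jittered_ttl(3600))
```

A batch of keys cached together now expires spread over five minutes instead of hitting the database in one synchronized wave.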
Hot Key Problem¶
One key receives a disproportionate volume of requests — a trending topic, a flash sale item, a popular user profile. Since each key lives on exactly one single-threaded instance (one node in Cluster), all of that traffic lands in one place and the key becomes a bottleneck.
Solutions:
- Read replicas: distribute reads across replicas (works at this scale)
- Local caching: application-level L1 cache in front of Redis
- Key sharding: create N copies (hot_key:1, hot_key:2, ... hot_key:N), randomly distribute reads
- Client-side caching (Redis 6.0+): server-assisted caching with invalidation messages
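The key-sharding option can be sketched as follows (N_COPIES, function names, and the write-to-all-copies policy are illustrative choices, not from the source):

```python
import random

N_COPIES = 8

def write_keys(hot_key: str) -> list:
    """Writers update every copy, so any copy is safe to read."""
    return [f"{hot_key}:{i}" for i in range(1, N_COPIES + 1)]

def read_key(hot_key: str) -> str:
    """Readers pick one copy at random, spreading load across slots/nodes."""
    return f"{hot_key}:{random.randint(1, N_COPIES)}"
```

The cost is N writes per update and brief inconsistency between copies mid-update, which is why this is a hot-read technique, not a hot-write one.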
Big Key Problem¶
A hash with millions of fields, a list with millions of elements, or a string storing a 100MB JSON blob. Big keys cause:
| Problem | Why |
|---|---|
| Blocks event loop | O(n) commands on big keys take milliseconds to seconds |
| Network saturation | Transferring the value saturates bandwidth |
| Dangerous deletion | Freeing a big key can block Redis for seconds |
| Replication lag | Big values take longer to propagate |
UNLINK (Redis 4.0+) deletes keys asynchronously in a background thread, avoiding the blocking DEL on big keys. Always use UNLINK for keys that might be large.
When to Stay Here¶
You don't need Redis Cluster if:
- [x] Read replicas handle your read throughput
- [x] Sentinel gives you the HA you need
- [x] Your write throughput fits on a single master
- [x] Your dataset fits on a single machine
- [x] BullMQ handles your job processing needs
- [x] You're not doing multi-key atomic operations that need to span multiple masters
The Over-Engineering Trap
Redis Cluster introduces hash slots, MOVED/ASK redirections, cross-node coordination limits, and the inability to do multi-key operations across slots without hash tags. If you don't need write scaling or a dataset larger than one machine, Cluster adds complexity with no benefit.