Stage 3: Large Scale

You're here because a single master can't handle your write throughput, your dataset doesn't fit on one machine, or you need atomicity guarantees that expose fundamental distributed systems limitations. The complexity you're about to take on is justified — but only if you understand the hard truths about what Redis can and cannot guarantee at this scale.

There Is No Over-Engineering Warning Here

If you're at this stage, simplicity is no longer an option. But the opposite trap exists: assuming distributed Redis gives you guarantees it doesn't. Redis Cluster is not a consensus system. Distributed locks on Redis are not equivalent to ZooKeeper locks. Understanding the limitations is more important than understanding the features.


Redis Cluster

Redis Cluster distributes data across multiple master nodes, each responsible for a subset of the keyspace. It provides automatic sharding, failover, and resharding — but with constraints that shape how you design your application.

Hash Slots, Not Consistent Hashing

Redis Cluster uses 16,384 fixed hash slots. Every key maps to a slot via CRC16(key) mod 16384, and each slot is assigned to exactly one master node.
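
The mapping can be reproduced in a few lines. A minimal Python sketch (the CRC16 variant is XModem, per the cluster spec; function names are ours):

```python
def crc16(data: bytes) -> int:
    """CRC16-CCITT (XModem): the variant Redis Cluster uses for key hashing."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_hash_slot(key: str) -> int:
    """Map a key to one of the 16,384 cluster slots. If the key contains a
    non-empty {hash tag}, only the tag is hashed, per the cluster spec."""
    start = key.find('{')
    if start != -1:
        end = key.find('}', start + 1)
        if end > start + 1:              # tag must be non-empty
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384
```

The spec's reference vector is CRC16("123456789") = 0x31C3, which places that key in slot 12739. The hash-tag rule is what makes multi-key operations possible later on: `{user:1}:balance` and `{user:1}:order` hash to the same slot.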

graph TD
    subgraph "Redis Cluster (3 Masters)"
        M1["Master A<br/>Slots 0-5460"]
        M2["Master B<br/>Slots 5461-10922"]
        M3["Master C<br/>Slots 10923-16383"]
    end

    K1["key: user:1<br/>CRC16 → slot 3421"] --> M1
    K2["key: order:99<br/>CRC16 → slot 8820"] --> M2
    K3["key: session:xyz<br/>CRC16 → slot 14102"] --> M3

Why Not Consistent Hashing?

Consistent hashing (as used by Cassandra and DynamoDB) places keys at arbitrary positions on a ring, which makes explicit resharding (handing a well-defined range of data from one node to another) awkward, and offers no way to keep related keys together for multi-key operations. Fixed hash slots provide a clear, finite mapping that can be explicitly reassigned, slot by slot. This was a deliberate engineering trade-off.

Gossip Protocol

Cluster nodes communicate state via a gossip protocol: each node periodically pings random peers, exchanging node IDs, flags (online/PFAIL/FAIL), slot ownership bitmaps, and epoch numbers.

| Parameter | Behavior |
| --- | --- |
| Ping frequency | Each node pings a few random nodes every second |
| Full coverage | Nodes without recent info are pinged every NODE_TIMEOUT / 2 |
| Overhead | ~330 ping/pong packets/second total in a 100-node cluster with a 60s NODE_TIMEOUT |
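
The overhead figure is simple arithmetic, and it shows why gossip traffic grows quadratically with cluster size. A lower-bound estimate (our own helper, under the stated parameters):

```python
def gossip_ping_rate(n_nodes: int, node_timeout_s: float) -> float:
    """Lower-bound cluster-wide ping rate: each node must ping every peer it
    lacks fresh info about at least once per NODE_TIMEOUT / 2."""
    return n_nodes * (n_nodes - 1) / (node_timeout_s / 2)
```

`gossip_ping_rate(100, 60)` gives 330 pings (and as many pongs) per second, matching the ~330 packets/second figure above; doubling the node count roughly quadruples the traffic.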

Failure Detection: Two Phases

Failure detection is not instant and not consensus-based. It's a "weak agreement model":

graph LR
    N["Node appears<br/>unreachable"] -->|"After NODE_TIMEOUT"| PF["<b>PFAIL</b><br/>Local suspicion<br/><i>one node's opinion</i>"]
    PF -->|"Majority of masters<br/>flag PFAIL/FAIL within<br/>NODE_TIMEOUT × 2"| F["<b>FAIL</b><br/>Confirmed failure<br/><i>cluster-wide</i>"]
    F --> FO["Failover<br/>initiated"]

  • PFAIL (Possible Failure): a single node's local determination that another node has been unreachable for more than NODE_TIMEOUT
  • FAIL (Confirmed Failure): requires a majority of masters to have independently flagged the node as PFAIL/FAIL within NODE_TIMEOUT × 2

The views are collected over time via gossip, not gathered simultaneously. This means failure detection can take up to 2 * NODE_TIMEOUT in the worst case.

Resharding: MOVED vs ASK

When slots are reassigned between nodes, clients may send commands to the wrong node:

| Redirect | Meaning | Client action |
| --- | --- | --- |
| MOVED | Permanent reassignment — this slot now lives on a different node | Update the local slot mapping, then retry |
| ASK | Temporary — the slot is being migrated, and this specific key is already on the target node | Send ASKING followed by the command to the target, but don't update the routing table |

During migration, existing keys are served from the source node; keys that have already been migrated redirect to the destination. Keys are moved one by one via MIGRATE.
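
Client-side redirect handling can be sketched as follows. The error formats `MOVED <slot> <host>:<port>` and `ASK <slot> <host>:<port>` are from the cluster spec; the function and the slot-map shape are our own illustration:

```python
def handle_redirect(error: str, slot_map: dict):
    """Parse a cluster redirect error; return (target_address, needs_asking).
    MOVED updates the local slot map; ASK does not (the move is in flight)."""
    kind, slot, address = error.split()
    if kind == "MOVED":
        slot_map[int(slot)] = address    # permanent: remember the new owner
        return address, False
    if kind == "ASK":
        return address, True             # one-off: send ASKING first, don't cache
    raise ValueError(f"not a redirect: {error}")
```

Real cluster clients (ioredis, redis-py) do exactly this bookkeeping internally, which is why they need the full node list rather than a single address.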

BullMQ in Cluster

BullMQ uses hash tags in key names: bull:{queueName}:wait, bull:{queueName}:active, bull:{queueName}:delayed, etc. The {queueName} hash tag forces all keys for a queue onto the same hash slot (and therefore the same node).

This is required because BullMQ's Lua scripts touch multiple keys atomically — impossible if those keys are on different nodes. But it means one queue = one node. You cannot shard a single BullMQ queue across multiple masters. To scale, you create multiple queues and distribute work across them at the application level.
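
Application-level distribution can be as simple as hashing a stable job attribute to pick one of several queues. A sketch (BullMQ itself is JavaScript; this Python illustration uses hypothetical queue names, with the shard count chosen by you):

```python
import zlib

def pick_queue(job_key: str, base: str = "emails", n_shards: int = 4) -> str:
    """Spread jobs across n_shards queues. Each queue name becomes its own
    hash tag (bull:{emails-0}:wait, bull:{emails-1}:wait, ...), so different
    queues can land on different cluster nodes while each queue's keys
    stay together on one node."""
    shard = zlib.crc32(job_key.encode()) % n_shards
    return f"{base}-{shard}"
```

Routing by a stable key (user ID, tenant ID) keeps per-entity ordering within one queue while letting total throughput scale with the number of shards.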


The CAP Theorem and Redis

Redis does not cleanly fit the AP or CP categories. Antirez himself has stated that Redis Cluster "is not a design that tries to achieve AP or CP of the CAP theorem."

In Practice

| Scenario | Behavior |
| --- | --- |
| Normal operation | Behaves like CP: a single master per slot, strong consistency per key |
| Network partition (majority side) | Behaves like AP: continues serving reads and writes |
| Network partition (minority side) | Stops accepting writes after NODE_TIMEOUT |
| During failover | Up to NODE_TIMEOUT of writes on the old master are lost |
| Conflict resolution | None — "last failover wins"; the losing side's writes are silently discarded |

The maximum write loss window is bounded by NODE_TIMEOUT. This is not a bug; it is the trade-off Redis Cluster deliberately makes, favoring performance and availability over strict consistency.

sequenceDiagram
    participant Client A
    participant Old Master
    participant New Master
    participant Client B

    Note over Old Master,New Master: Network partition occurs
    Client A->>Old Master: SET key "value-A"
    Old Master->>Client A: OK ✓
    Note over New Master: Promoted after NODE_TIMEOUT
    Client B->>New Master: SET key "value-B"
    New Master->>Client B: OK ✓
    Note over Old Master,New Master: Partition heals
    Note over Old Master: Demoted to replica,<br/>syncs from New Master
    Note over Old Master: "value-A" is permanently LOST

Race Conditions at Scale

On a single instance, each command runs atomically (the event loop serializes everything), and multi-step sequences can be made atomic with MULTI or a Lua script. At distributed scale, races become unavoidable, and managing them requires understanding the tools and their limits.

WATCH/MULTI: Optimistic Locking

WATCH marks keys to monitor. If any watched key changes between WATCH and EXEC, the entire transaction aborts and the client must retry.

WATCH inventory:item:42
val = GET inventory:item:42
if val > 0:
    MULTI
    DECR inventory:item:42
    EXEC  -- Returns nil (aborts) if another client modified the watched key
else:
    UNWATCH

The problem at scale: with many clients watching the same key (flash sale, popular item), the success rate drops dramatically. If 100 clients all WATCH the same key, 99 of them will abort on each round. This creates a livelock-like pattern — everyone retries, everyone aborts, throughput collapses.
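
The abort-and-retry dynamic can be modeled without a Redis server. This toy in-memory stand-in (our own construction, not a Redis client) mimics WATCH's version check:

```python
class VersionedStore:
    """In-memory stand-in for a Redis key guarded by WATCH: a write succeeds
    only if the version observed at read time is still current."""
    def __init__(self, value):
        self.value, self.version = value, 0

    def read(self):
        return self.value, self.version          # WATCH + GET

    def cas(self, expected_version, new_value):
        if self.version != expected_version:
            return False                         # key changed: EXEC returns nil
        self.value, self.version = new_value, self.version + 1
        return True                              # transaction committed

def decrement_if_positive(store, max_retries=50):
    """Optimistic-locking retry loop; returns (success, abort_count)."""
    aborts = 0
    for _ in range(max_retries):
        val, ver = store.read()
        if val <= 0:
            return False, aborts                 # out of stock: UNWATCH
        if store.cas(ver, val - 1):
            return True, aborts
        aborts += 1                              # lost the race: try again
    return False, aborts
```

With one client this never aborts; with a hundred clients hammering the same key, each round commits one CAS and aborts the rest, which is exactly the collapse described above.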

Lua Scripting: True Atomicity (With Limits)

Lua scripts execute atomically on the Redis thread — no other command interleaves. Unlike WATCH/MULTI, they can read intermediate values and branch:

-- Atomic decrement-if-positive
local stock = tonumber(redis.call('GET', KEYS[1])) or 0  -- treat a missing key as 0
if stock > 0 then
    redis.call('DECR', KEYS[1])
    return 1  -- success
end
return 0  -- out of stock

No retries, no aborts, guaranteed consistency — but only for keys on the same node.

Cross-Node Atomicity Is Impossible

In Redis Cluster, a Lua script can only access keys that hash to the same slot. If your operation needs to atomically update user:balance and order:total and they're on different nodes, no single Redis command or script can help.

Hash tags ({user:1}:balance, {user:1}:order) can force related keys onto the same node, but this creates hot spots — all data for a popular user concentrates on one node.

For true cross-node atomicity, you need application-level coordination: saga patterns, two-phase commit, or eventual consistency with compensating actions. Redis is not a distributed transaction manager.
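
The compensating-action idea can be sketched in a few lines. This is a minimal saga runner of our own design (step and compensation names are up to the caller), not a library API:

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order. If any action raises, undo
    the completed steps in reverse order: eventual consistency, not atomicity."""
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            for undo in reversed(done):
                undo()                   # best-effort rollback
            return False
    return True
```

Note what this does not give you: compensations can themselves fail, and other clients can observe the intermediate state. That is the price of crossing node boundaries.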


The Distributed Lock Problem

Distributed locks are one of the most misunderstood topics in Redis. The Redlock debate between Antirez (Redis creator) and Martin Kleppmann (distributed systems researcher) exposed fundamental limitations that every engineer should understand.

Why You Need Distributed Locks

When multiple processes across multiple machines need exclusive access to a shared resource — rate limiting an external API, preventing duplicate payment processing, coordinating a one-time migration — you need a distributed lock.

A single Redis instance can provide this trivially:

SET resource:lock <unique-token> NX EX 30
  • NX: only set if not exists (mutual exclusion)
  • EX 30: auto-expire after 30 seconds (deadlock prevention)
  • <unique-token>: for safe release (only the holder can unlock)

This works well for a single instance. The problem arises when the Redis instance itself can fail.
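
The acquire/release semantics can be modeled in memory. This sketch (our own model, not a Redis client) captures the two rules: NX mutual exclusion and token-checked release — in real Redis, the release must be a short Lua script so the GET-compare and DEL happen in one atomic step:

```python
import time
import uuid

class TokenLock:
    """In-memory model of SET key <token> NX EX <ttl> plus safe release."""
    def __init__(self):
        self._token, self._expires = None, 0.0

    def acquire(self, ttl_s: float):
        now = time.monotonic()
        if self._token is not None and now < self._expires:
            return None                       # NX: someone else holds the lock
        token = uuid.uuid4().hex              # unique per holder
        self._token, self._expires = token, now + ttl_s
        return token

    def release(self, token: str) -> bool:
        if self._token == token:              # only the current holder may unlock
            self._token = None
            return True
        return False                          # lock expired / re-acquired: refuse
```

The token check is what prevents a slow client from deleting a lock that has already expired and been granted to someone else.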

Redlock: Antirez's Proposal

To tolerate Redis failures, Antirez proposed Redlock — acquire the lock on a majority (N/2 + 1) of N independent Redis instances:

graph LR
    C["Client"] --> R1["Redis 1 ✓"]
    C --> R2["Redis 2 ✓"]
    C --> R3["Redis 3 ✓"]
    C --> R4["Redis 4 ✗"]
    C --> R5["Redis 5 ✗"]

    style R1 fill:#4CAF50,color:#fff
    style R2 fill:#4CAF50,color:#fff
    style R3 fill:#4CAF50,color:#fff
    style R4 fill:#F44336,color:#fff
    style R5 fill:#F44336,color:#fff

  1. Record the current timestamp
  2. Try to acquire the lock on all N instances sequentially (short per-instance timeout)
  3. The lock is acquired if it is held on a majority AND the total elapsed time is less than the lock TTL
  4. If acquisition fails, release the locks on all instances
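
The steps above can be sketched as follows. The `try_lock` and `release` callbacks are assumed wrappers around `SET key token NX PX ttl` and the token-checked delete on one instance; the drift allowance comes from the published algorithm:

```python
import time

def redlock_acquire(nodes, try_lock, release, key, token, ttl_ms):
    """Redlock acquisition sketch: majority vote plus an elapsed-time check."""
    start = time.monotonic()
    acquired = [n for n in nodes if try_lock(n, key, token, ttl_ms)]
    elapsed_ms = (time.monotonic() - start) * 1000
    drift_ms = ttl_ms * 0.01 + 2                   # clock-drift safety margin
    validity_ms = ttl_ms - elapsed_ms - drift_ms
    if len(acquired) >= len(nodes) // 2 + 1 and validity_ms > 0:
        return validity_ms                         # assume the lock for this long at most
    for n in acquired:                             # step 4: failed, release everywhere
        release(n, key, token)
    return None
```

The `validity_ms` check is the part Antirez points to in his rebuttal: a long pause during acquisition shrinks the validity window, and a non-positive window means the client must not consider the lock held.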

Kleppmann's Criticism: "How to Do Distributed Locking"

Martin Kleppmann published a detailed analysis arguing that Redlock is fundamentally unsafe. His arguments are built on three pillars:

1. No Fencing Tokens

Redlock produces a random token (UUID), not a monotonically increasing one. Without monotonic fencing tokens, a client whose lock has expired (due to a GC pause, network delay, or any other process suspension) can still write to the protected resource:

sequenceDiagram
    participant Client 1
    participant Redis (Lock)
    participant Storage

    Client 1->>Redis (Lock): Acquire lock (token: abc)
    Redis (Lock)->>Client 1: OK ✓
    Note over Client 1: GC pause begins...<br/>lock expires during pause
    Note over Redis (Lock): Lock expired, released

    participant Client 2
    Client 2->>Redis (Lock): Acquire lock (token: xyz)
    Redis (Lock)->>Client 2: OK ✓
    Client 2->>Storage: Write (token: xyz)

    Note over Client 1: GC pause ends
    Client 1->>Storage: Write (token: abc)
    Note over Storage: CONFLICT — both clients<br/>wrote to the resource

A proper fencing token is a monotonically increasing number. The storage system can reject any write with a token lower than the highest it has seen. Random UUIDs cannot provide this ordering guarantee.
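
The check a fencing-aware storage layer performs is tiny. A sketch (class name ours; the token numbering follows Kleppmann's 33-vs-34 example):

```python
class FencedStorage:
    """Storage guarded by monotonic fencing tokens: any write carrying a token
    lower than the highest already seen is rejected, because it must come from
    a holder whose lock has since expired."""
    def __init__(self):
        self.highest_token = -1
        self.data = {}

    def write(self, token: int, key, value) -> bool:
        if token < self.highest_token:
            return False                 # stale writer (e.g. back from a GC pause)
        self.highest_token = token
        self.data[key] = value
        return True
```

This is why the lock service must issue the tokens: only a source of monotonically increasing numbers lets storage order the writers, and random UUIDs cannot.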

2. Timing Assumptions That Routinely Fail

Redlock assumes bounded network delay, bounded process pauses, and reasonably accurate clocks. Kleppmann argues these assumptions routinely fail in production:

| Assumption | Real-world failure |
| --- | --- |
| Bounded network delay | GitHub once experienced a 90-second packet delay |
| Bounded process pause | JVM garbage collection can pause for minutes |
| Accurate clocks | Redis historically used gettimeofday(), which can jump backward via NTP adjustments |

3. Concrete Failure Scenario: Clock Jump

sequenceDiagram
    participant Client 1
    participant Node A
    participant Node B
    participant Node C
    participant Node D
    participant Node E
    participant Client 2

    Client 1->>Node A: Lock ✓
    Client 1->>Node B: Lock ✓
    Client 1->>Node C: Lock ✓
    Note over Client 1: Holds majority (A, B, C)

    Note over Node C: Clock jumps forward<br/>Lock expires early!

    Client 2->>Node C: Lock ✓
    Client 2->>Node D: Lock ✓
    Client 2->>Node E: Lock ✓
    Note over Client 2: Holds majority (C, D, E)

    Note over Client 1,Client 2: BOTH believe they hold the lock!

Node C's clock jumps forward (NTP correction, VM migration), causing its lock to expire prematurely. Client 2 acquires locks on C, D, E — a valid majority. Now two clients simultaneously believe they hold the lock.

Antirez's Rebuttal

Antirez responded with several counter-arguments:

  • On fencing tokens: if you have a system that checks and rejects stale tokens, you already have a concurrency control mechanism and probably don't need a distributed lock
  • On timing: Redlock only needs processes to measure relative elapsed time with ~10-20% error. It does NOT require synchronized clocks across machines
  • On the GC pause scenario: Redlock checks elapsed time after receiving responses. If a GC pause occurred during acquisition, the elapsed time check would detect that the lock's validity window has passed, and the client would NOT consider the lock acquired

The Actual Consensus

Both sides largely agree on the practical outcome:

| Use case | Recommended approach |
| --- | --- |
| Efficiency (prevent duplicate work; occasional double-execution is tolerable) | Single instance: SET resource NX EX — simple, sufficient |
| Correctness (mutual exclusion where violations cause data corruption) | Consensus system: ZooKeeper or etcd with fencing tokens |
| Redlock | "Neither fish nor fowl" — too complex for efficiency, not safe enough for correctness |

Why Hardware Clocks Are the Root Issue

The Redlock debate ultimately reveals a deeper truth about distributed systems and time:

  • gettimeofday() returns wall-clock time, which can jump backward (NTP corrections) or jump forward (leap seconds, VM migration)
  • Monotonic clocks (clock_gettime(CLOCK_MONOTONIC)) never jump backward, but they only measure elapsed time on a single machine — they can't coordinate across machines
  • Distributed consensus protocols (Raft, Paxos, ZAB) maintain safety properties regardless of timing. They use logical clocks (sequence numbers, epochs) instead of physical clocks. This is why ZooKeeper locks are safe even when hardware clocks drift: the safety proof doesn't depend on any timing assumption

Redis lacks a consensus protocol. Redlock's safety depends on timing assumptions (even if relaxed), which is a weaker guarantee than consensus-based systems. This is not a fixable implementation detail — it's an architectural boundary.


System-Wide Failure Modes

Cache Avalanche

Distinct from cache stampede (which is per-key), avalanche is system-wide: mass cache expiration or a Redis failure causes a flood of requests to the backing datastore, which collapses under load.

graph TD
    subgraph "Normal"
        A1["App Servers"] --> R["Redis Cache"]
        R -->|"~5% miss rate"| DB["Database"]
    end

    subgraph "Avalanche"
        A2["App Servers"] -->|"Redis down or<br/>mass expiration"| R2["Redis ✗"]
        R2 -->|"100% miss"| DB2["Database<br/><b>CRUSHED</b>"]
        DB2 -->|"timeout/crash"| A2
    end

Solutions:

| Strategy | What it does |
| --- | --- |
| Staggered TTLs | Prevent synchronized mass expiration |
| Circuit breaker | Trip when the database error rate exceeds a threshold; serve degraded responses |
| Rate limiting to the DB | Cap the number of concurrent cache-miss queries hitting the database |
| Multi-layer caching | Local (in-process) cache → Redis → database |
| Graceful degradation | Serve stale cached data instead of failing entirely |
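
Staggering TTLs is the cheapest of these to implement. A sketch (helper name and the ±20% default are our choices):

```python
import random

def jittered_ttl(base_s: int, spread: float = 0.2) -> int:
    """Staggered TTLs: randomize each key's TTL within ±spread of the base so
    a bulk cache warm-up doesn't expire everything in the same instant."""
    return max(1, int(base_s * (1 + random.uniform(-spread, spread))))
```

With a 10-minute base, expirations land anywhere between 8 and 12 minutes, turning one expiry spike into a gentle slope of database reads.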

Cascading Failures

Redis down → all services hit DB → DB down → everything down. This is not a Redis problem per se — it's an architecture problem that Redis's speed normally masks.

The fix is defense in depth: circuit breakers at every tier, fallback responses, health checks that distinguish "Redis is slow" from "Redis is dead," and load shedding to protect the database.
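
A circuit breaker for the cache-miss path can be very small. This is a minimal sketch of the pattern (state machine and defaults are our own, not a specific library):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker for DB calls behind a cache: open after
    `threshold` consecutive failures, allow one probe after `reset_after` s."""
    def __init__(self, threshold: int = 5, reset_after: float = 30.0):
        self.threshold, self.reset_after = threshold, reset_after
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True                                   # closed: pass through
        if time.monotonic() - self.opened_at >= self.reset_after:
            return True                                   # half-open: one probe
        return False                                      # open: fail fast

    def record(self, success: bool) -> None:
        if success:
            self.failures, self.opened_at = 0, None       # close the circuit
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()         # trip
```

When `allow()` returns False, the service skips the database entirely and serves a fallback (stale cache, default response), which is the load-shedding step that stops the cascade.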

Memory Fragmentation at Scale

Write-heavy workloads with variable-size values cause jemalloc fragmentation to climb over time. At scale, Redis may use 2-3x more RSS than the actual dataset requires.

  • Monitor mem_fragmentation_ratio continuously
  • Active defrag (activedefrag yes) trades CPU for reduced memory waste
  • Plan for memory overhead in capacity planning (don't allocate 100% of available RAM to maxmemory)

BullMQ at Scale: Limitations

BullMQ inherits all of Redis's distributed limitations, plus some of its own:

| Limitation | Root cause |
| --- | --- |
| No exactly-once processing | At-least-once only. The gap between lock expiry and the stalled-job checker allows double processing. |
| Redis failover can lose jobs | Async replication means writes not yet replicated are lost on failover. |
| Lua script blocking | Complex operations (obliterate, bulk cleanup) block the entire Redis thread. BullMQ mitigates this by batching deletions. |
| Large payloads waste memory | Job data is stored in Redis hashes. Store references (S3 URLs, database IDs), not the data itself. |
| One queue = one node | Hash tags force all keys for a queue onto one node. A single queue can't be sharded across the cluster. |
| Completed/failed job accumulation | Without cleanup (removeOnComplete, removeOnFail), finished jobs accumulate indefinitely. |

For workloads requiring exactly-once semantics, cross-datacenter distribution, or extreme throughput on a single logical queue, consider purpose-built systems: Apache Kafka for event streaming, Amazon SQS for managed queues, or Temporal for workflow orchestration.


Decision Framework

| Need | Use | Why |
| --- | --- | --- |
| Cache, sessions, simple queues | Single Redis instance | 100K+ ops/sec, simplest possible setup |
| Read scaling, high availability | Redis + replicas + Sentinel | Read distribution, automatic failover |
| Write scaling, large dataset | Redis Cluster | Sharded writes across masters |
| Background jobs, task queues | BullMQ on Redis | Rich job lifecycle, Lua-based atomicity |
| Strong mutual exclusion | etcd or ZooKeeper | Consensus-based, fencing tokens, no timing assumptions |
| Event streaming, replay | Apache Kafka | Persistent log, exactly-once semantics, multi-consumer |
| Distributed transactions | Application-level sagas | Redis is not a transaction coordinator |

The Final Rule

Use the simplest tool that solves your actual problem — not your hypothetical future problem. Start at Stage 1. Move to Stage 2 when you have real symptoms. Arrive here only when the complexity is justified by real constraints. The most reliable system is the one with the fewest moving parts.