The Engineering Codex/Backend Engineering for the AI Era
DAY 4
06 / 09

Distributed Systems — Consensus, Sharding & CAP

schedule12 minsignal_cellular_altAdvanced2,622 words
Once your service crosses one machine, the laws of physics start asserting themselves. Master the eight fallacies, CAP and PACELC trade-offs, the Raft consensus algorithm, consistent hashing, distributed transactions, and the patterns that real systems (Spanner, DynamoDB, etcd) use to make a network of unreliable machines look like one.

What you will learn

01The Eight Fallacies — The Mental Model You Keep Violating
02CAP & PACELC — The Trade-Offs Honest Systems Acknowledge
03Consensus — How a Cluster Agrees on Something
04Consistent Hashing — Sharding Without Re-Shuffling
05Time, Clocks, and Lamport's Insight
06Distributed Transactions — Two Painful Choices

Distributed systems are what happens when one machine isn't enough. The benefits — scale, redundancy, geographic proximity — come bundled with a cost: every assumption you held about a single program (memory is consistent, the clock is monotonic, peers reply quickly, peers exist) becomes negotiable. This chapter is about the small set of impossibility results, algorithms, and patterns that turn an unreliable network into a usable system.

🔑
Today's primitives
1) The eight fallacies — what you keep accidentally assuming. 2) CAP as a triage tool, PACELC as the honest version. 3) Raft consensus — leader election, log replication, the algorithm behind etcd, Consul, MongoDB, CockroachDB. 4) Consistent hashing for sharding without re-shuffling the world. 5) Distributed transactions — 2PC, sagas, the Spanner approach. 6) Real systems mapped onto these primitives.

The Eight Fallacies — The Mental Model You Keep Violating

Peter Deutsch and James Gosling listed these in 1994. They're still the most quoted thing in distributed systems because every fresh bug confirms one.

1. The network is reliable 2. Latency is zero 3. Bandwidth is infinite 4. The network is secure 5. Topology doesn't change 6. There is one administrator 7. Transport cost is zero 8. The network is homogeneous Every distributed-system bug is the discovery of one of these. Pin them above your desk.
The eight fallacies of distributed computing. Re-read them after every outage retro.

Concretely, what each implies:

  • Networks fail. Code that calls another service must handle timeouts, retries, partial failures.
  • Latency is non-zero. Cross-AZ ≈ 1 ms; cross-region ≈ 30 ms; cross-continent ≈ 100 ms. Chatty protocols magnify these.
  • Bandwidth is finite. Don't ship the whole table when a delta will do.
  • The network is hostile. Authenticate every hop. Mutual TLS or signed messages.
  • Topology changes. Hosts come up and down — service discovery must be dynamic.
  • Many administrators. Other teams ship breaking changes in their services. Versioning matters.
  • Transport costs. Cross-AZ data transfer is real money. Egress to internet is more.
  • Heterogeneous network. Different protocols, MTUs, providers. Things behave differently in different regions.

CAP & PACELC — The Trade-Offs Honest Systems Acknowledge

CAP (Brewer 2000): when a network partition occurs, you must choose between Consistency (every read returns the latest write) and Availability (every request gets a non-error response). You cannot have both during the partition. Partition tolerance is non-negotiable in any system that can fail.

The honest extension is PACELC (Abadi 2010): during a partition, choose between A and C; else (in the normal case), choose between L (latency) and C (consistency). Synchronous replication for strong consistency costs latency even when nothing is broken.

Partition: choose A Partition: choose C No partition: choose L No partition: choose C PA / ELDynamoDB, Cassandra, Riakalways available, eventually consistent PC / ELMongoDB (default), CouchDBCP under partition, fast in the normal case PA / ECrare combinationstrict reads, AP under partition PC / ECSpanner, FoundationDB, etcdstrict consistency always; latency tax Pick by failure mode and budget. Most apps live in the upper-right (PC/EL) world.
PACELC quadrants. Vendors usually advertise the partition behaviour; PACELC forces them to admit the latency tax.

How to triage CAP/PACELC for a feature

Three questions to ask:

  1. What's the worst that happens if a read returns stale data? If "the user sees a 5-second-old comment count" — eventual consistency is fine. If "the user is charged twice" — you need strong consistency.
  2. What's the worst if a write blocks during a partition? Most user-facing writes (a new comment, a profile edit) are tolerable; a finance ledger write usually is not.
  3. How much latency can the request budget afford? Synchronous replication across regions is 30+ ms per write — non-trivial.
⚠️
CAP isn't a religious choice
CAP is per-operation, not per-system. The same database often runs the cart in AP mode and the payments ledger in CP mode. Pick the right knob for each operation; "we are an AP shop" loses information you'd want when the next feature lands.

Consensus — How a Cluster Agrees on Something

Whenever multiple machines must agree on a value (the leader, the next entry in a log, the current configuration), you're solving consensus. Two algorithms dominate: Paxos (Lamport 1990, hard to explain) and Raft (Ongaro & Ousterhout 2014, designed to be teachable). Real systems use Raft variants overwhelmingly: etcd, Consul, CockroachDB, TiKV, MongoDB.

Leader term: 7 Follower term: 7 Follower term: 7 Follower term: 7 AppendEntries heartbeat Committed when a majority (≥ 3 of 5) acknowledge the AppendEntries.
A Raft cluster in steady state. The leader replicates each log entry; commit requires a majority.

Raft in 60 seconds

  1. Roles. Each node is a follower, candidate, or leader. There is at most one leader per term.
  2. Heartbeats. The leader sends periodic AppendEntries (often empty — heartbeats) to all followers. As long as followers hear from the leader, they stay quiet.
  3. Election. If a follower hasn't heard from a leader in some randomized timeout, it bumps its term, becomes a candidate, and asks for votes. A candidate that receives a majority wins and becomes leader for that term.
  4. Log replication. The leader appends client commands to its log and sends AppendEntries to followers. An entry is committed when a majority have stored it. Committed entries are then applied to each node's state machine.
  5. Safety. Higher terms always win. A follower won't vote for a candidate with an out-of-date log (their last entry's term/index is older). This prevents a stale candidate from being elected leader.

Why a majority (quorum)?

If 5 nodes split 3:2 by a partition, only the 3-node side can elect a leader (a majority). The 2-node side cannot. This prevents split-brain — two simultaneous leaders accepting writes that diverge. Quorum-based consensus tolerates (N-1)/2 failures: 3 nodes tolerate 1, 5 tolerate 2, 7 tolerate 3.

💡
Why Raft clusters are odd-numbered
A 4-node cluster tolerates the same number of failures as a 3-node cluster (1), but with double the synchronous-write cost. Odd cluster sizes (3, 5, 7) maximize fault tolerance per node. If you ever see a 4-node Raft cluster, someone forgot the math.

Consistent Hashing — Sharding Without Re-Shuffling

The naive way to shard data across N nodes is hash(key) mod N. This works until N changes — then almost every key moves to a different node, and your cache hit rate drops to zero overnight. Consistent hashing (Karger et al. 1997) limits the disruption: when a node is added or removed, only ~1/N of keys move.

Node A Node B Node C Node D Node E Each key (small dot) maps to the next node clockwise. Adding a node only steals from one neighbour.
Consistent hashing ring. Keys live on the ring; each node owns the arc back to the previous node.

The trick

Hash both keys and nodes onto the same circular space (e.g., 0..2³²). A key is owned by the next node clockwise. When a node is added, it takes over the keys between it and its predecessor — only those keys move.

Virtual nodes

A naive ring puts all keys for a node in one contiguous arc. If one node is hot, you can't redistribute. Virtual nodes hash each physical node to multiple positions on the ring (typically 100–500 each), spreading its keys across many arcs. Adding capacity becomes uniform, and load is naturally balanced. DynamoDB, Cassandra, ScyllaDB, and Riak all use virtual nodes.

Where it shows up

Anywhere keys must map to N changing nodes: Memcached client routing, distributed caches (Redis Cluster slots are a 16,384-cell variant), CDN PoP routing, message-broker partition assignment.

Time, Clocks, and Lamport's Insight

You cannot trust the wall clock across machines. NTP keeps clocks within ~10 ms in a data center; ~100 ms across regions; clock skew can be much worse during outages. "Happened-before" needs a different definition.

Lamport timestamps

Every event gets a counter that increments on local activity and is updated to max(local, received) + 1 on every received message. The result: if event A happened before B, then L(A) < L(B). The reverse is not true (concurrent events can have any order), but it's enough to define a global ordering that respects causality.

Vector clocks

Lamport timestamps lose information about concurrency. Vector clocks keep an integer per node; each event increments only its own component. Two events are concurrent iff neither vector dominates the other. DynamoDB and Riak use vector clocks (or version vectors) to detect conflicting writes that need resolution.

Spanner's TrueTime — the GPS approach

Google Spanner achieves external consistency by giving every node a TrueTime API that returns an interval bounding the real time, computed from GPS clocks and atomic clocks per data center. Transactions wait long enough that their commit timestamp is guaranteed to be in the past — a few ms — but the result is a globally-consistent serializable database. Most other systems can't afford the hardware and use logical clocks instead.

⚠️
"Just use timestamps" is a bug
Race conditions in distributed systems are routinely caused by code that does if (event.timestamp > cached.timestamp) update(). Without a synchronized clock guarantee, this is unsafe — the older event from a different machine may have a higher timestamp due to skew. Use logical clocks, version numbers per record, or proper happens-before tracking.

Distributed Transactions — Two Painful Choices

You sometimes need to update state across multiple services or shards atomically. The two options:

Two-Phase Commit (2PC)

A coordinator asks each participant to prepare. If all say yes, it tells them to commit; if any say no or time out, it tells them to abort. Provides true atomicity. The cost: blocking. If the coordinator dies between phases, participants are stuck holding locks until it recovers. 2PC works fine within tightly-coupled systems (database internals, distributed file systems with stable membership) and is a poor fit for loosely-coupled microservices.

Sagas (Day 3 recap)

Forget atomicity; embrace compensation. Each step is a local transaction; failures run compensating transactions in reverse. Loose coupling, no blocking, semantically weaker. The default for cross-service workflows in modern stacks.

Spanner / CockroachDB / FoundationDB

The third path: a database that does the hard work itself, exposing a single SQL endpoint that does globally-consistent transactions across shards. Spanner uses Paxos + TrueTime; CockroachDB uses Raft + hybrid logical clocks. The pitch is "distributed Postgres semantics". The cost is operational complexity and per-write latency that can't beat single-region Postgres on simple workloads.

Service Discovery and Membership

In a network where instances come and go, callers need to find healthy targets. Three patterns:

  • DNS-based — service has a hostname that resolves to current IPs. Simple. TTL caching means failover is slow.
  • Centralized registry — Consul, etcd, Kubernetes Endpoints. Each instance registers; clients query (or watch). Day 6 covers this in the K8s context.
  • Mesh-side discovery — Istio/Linkerd sidecars take care of it transparently. The application code dials a logical name; the sidecar does the rest.

Health checks — the underrated detail

A common bug: a load balancer treats a process as healthy because its TCP port is open, but the process can't reach the database. Implement two health checks: liveness (is the process alive?) — pass it lightly, restart on failure; readiness (can the process serve traffic now?) — fail it when DB connections, dependencies, or warm-up are missing. Kubernetes has separate fields for both for exactly this reason.

Replication Strategies

StyleWritesReadsBest for
Single-leaderOnly on leaderFollowers (with lag)Most apps; Postgres / MySQL default
Multi-leaderAny leader, replicateLocal leaderMulti-region active-active; conflict resolution required
Leaderless (Dynamo-style)Quorum write (W of N)Quorum read (R of N)Always-available KV; tunable consistency

For leaderless: if R + W > N, you're guaranteed at least one replica with the latest write in any read quorum — strong consistency. The classic choices are R=W=2, N=3 (good balance) or R=N, W=1 (cheap writes, expensive reads).

How Real Systems Compose These

Let's map a few systems onto the primitives above so the picture concretizes.

SystemConsensusShardingConsistency
etcd / ConsulRaft (one cluster)None — small dataStrong (CP)
Postgres + replicasSingle-leaderVertical or sharded by appStrong on primary; eventual on replicas
DynamoDBPer-partition PaxosHash-based, virtual nodesEventual default; strong opt-in
CassandraNone — leaderlessConsistent hashingTunable (R, W, N)
Spanner / CockroachDBPaxos / Raft per rangeRange-basedExternal consistency
KafkaRaft (KRaft) or ZKPer-topic partitionsPer-partition order
Redis ClusterAsync replication16,384 slotsEventual; strong with WAIT
Quick check
Your team runs a 3-node etcd cluster as the source of truth for service discovery. One AZ goes offline, taking one node with it. Reads and writes still work. The next day, a network partition cuts the remaining two nodes from each other for 30 seconds. What happens during those 30 seconds?
Show answer
Both halves lose write availability. A 3-node Raft cluster needs 2 nodes (a majority) to commit. After losing one node to the AZ, you have 2 healthy nodes — still a majority of the cluster. When the network partitions those 2, neither side has 2 reachable nodes. Neither side can elect a leader; neither side can commit writes. Reads may still succeed if the existing leader's lease hasn't expired (depending on how the client is configured). For continued availability under double-failure, run a 5-node cluster — tolerates 2 failures including network partitions of arbitrary shape. The lesson: 3 is the minimum, 5 is the production default for the registry that the rest of your system depends on.
Mnemonic — distributed systems triage
"Network fails. Clocks lie. Quorum decides. Compensate, don't commit."
  • Network fails — code defensively at every hop.
  • Clocks lie — use logical clocks for ordering.
  • Quorum decides — majorities prevent split-brain.
  • Compensate, don't commit — sagas across services beat 2PC.
Flashcard
An LLM-powered SaaS launches in three regions: us-east, eu-west, ap-south. The product team wants "users see their changes immediately, regardless of region". Walk through the three architectural options ranked by complexity.
Click to flip ↻
Answer
1) Single primary, replicas in other regions. All writes route to (say) us-east; reads can hit local replicas; lag is 30–100 ms cross-region. Read-your-writes via sticky-to-primary on writes (Day 2 AM). Cheapest, but writes from far regions are slow. 2) Per-tenant home region. Each tenant is pinned to a region; reads and writes for that tenant are local. Cross-tenant operations are rare. Used by many B2B SaaS (Notion, Slack workspaces). Delivers "immediately" within a tenant's region with simple infrastructure. 3) Globally-distributed strongly-consistent DB (Spanner, CockroachDB). True multi-region writes with strong consistency. Per-write latency includes a synchronous quorum across regions (~100 ms). Operationally heavier, more expensive, but unavoidable when truly multi-region writes must be both global and consistent. Most teams stop at option 2.
🔑
Key takeaways
1) The eight fallacies are the canonical list of assumptions you keep accidentally violating. 2) CAP/PACELC are triage tools, picked per-operation. 3) Raft is the consensus algorithm you'll meet — odd-numbered cluster, majority commit, randomized election timeouts. 4) Consistent hashing with virtual nodes shards without re-shuffling on cluster changes. 5) Logical clocks beat wall clocks for ordering. 6) Sagas over 2PC for cross-service work; consider Spanner-style if you really need distributed SQL. 7) The systems you reach for (etcd, Postgres, Kafka, Redis) compose these primitives in different combinations — knowing which composition is which makes the docs land.

Finished reading?