Distributed Systems — Consensus, Sharding & CAP
Once your service crosses one machine, the laws of physics start asserting themselves. Master the eight fallacies, CAP and PACELC trade-offs, the Raft consensus algorithm, consistent hashing, distributed transactions, and the patterns that real systems (Spanner, DynamoDB, etcd) use to make a network of unreliable machines look like one.
What you will learn
Distributed systems are what happens when one machine isn't enough. The benefits — scale, redundancy, geographic proximity — come bundled with a cost: every assumption you held about a single program (memory is consistent, the clock is monotonic, peers reply quickly, peers exist) becomes negotiable. This chapter is about the small set of impossibility results, algorithms, and patterns that turn an unreliable network into a usable system.
The Eight Fallacies — The Mental Model You Keep Violating
Peter Deutsch and James Gosling listed these in 1994. They're still the most quoted thing in distributed systems because every fresh bug confirms one.
Concretely, what each implies:
- Networks fail. Code that calls another service must handle timeouts, retries, partial failures.
- Latency is non-zero. Cross-AZ ≈ 1 ms; cross-region ≈ 30 ms; cross-continent ≈ 100 ms. Chatty protocols magnify these.
- Bandwidth is finite. Don't ship the whole table when a delta will do.
- The network is hostile. Authenticate every hop. Mutual TLS or signed messages.
- Topology changes. Hosts come up and down — service discovery must be dynamic.
- Many administrators. Other teams ship breaking changes in their services. Versioning matters.
- Transport costs. Cross-AZ data transfer is real money. Egress to internet is more.
- Heterogeneous network. Different protocols, MTUs, providers. Things behave differently in different regions.
CAP & PACELC — The Trade-Offs Honest Systems Acknowledge
CAP (Brewer 2000): when a network partition occurs, you must choose between Consistency (every read returns the latest write) and Availability (every request gets a non-error response). You cannot have both during the partition. Partition tolerance is non-negotiable in any system that can fail.
The honest extension is PACELC (Abadi 2010): during a partition, choose between A and C; else (in the normal case), choose between L (latency) and C (consistency). Synchronous replication for strong consistency costs latency even when nothing is broken.
How to triage CAP/PACELC for a feature
Three questions to ask:
- What's the worst that happens if a read returns stale data? If "the user sees a 5-second-old comment count" — eventual consistency is fine. If "the user is charged twice" — you need strong consistency.
- What's the worst if a write blocks during a partition? Most user-facing writes (a new comment, a profile edit) are tolerable; a finance ledger write usually is not.
- How much latency can the request budget afford? Synchronous replication across regions is 30+ ms per write — non-trivial.
Consensus — How a Cluster Agrees on Something
Whenever multiple machines must agree on a value (the leader, the next entry in a log, the current configuration), you're solving consensus. Two algorithms dominate: Paxos (Lamport 1990, hard to explain) and Raft (Ongaro & Ousterhout 2014, designed to be teachable). Real systems use Raft variants overwhelmingly: etcd, Consul, CockroachDB, TiKV, MongoDB.
Raft in 60 seconds
- Roles. Each node is a follower, candidate, or leader. There is at most one leader per term.
- Heartbeats. The leader sends periodic AppendEntries (often empty — heartbeats) to all followers. As long as followers hear from the leader, they stay quiet.
- Election. If a follower hasn't heard from a leader in some randomized timeout, it bumps its term, becomes a candidate, and asks for votes. A candidate that receives a majority wins and becomes leader for that term.
- Log replication. The leader appends client commands to its log and sends AppendEntries to followers. An entry is committed when a majority have stored it. Committed entries are then applied to each node's state machine.
- Safety. Higher terms always win. A follower won't vote for a candidate with an out-of-date log (their last entry's term/index is older). This prevents a stale candidate from being elected leader.
Why a majority (quorum)?
If 5 nodes split 3:2 by a partition, only the 3-node side can elect a leader (a majority). The 2-node side cannot. This prevents split-brain — two simultaneous leaders accepting writes that diverge. Quorum-based consensus tolerates (N-1)/2 failures: 3 nodes tolerate 1, 5 tolerate 2, 7 tolerate 3.
Consistent Hashing — Sharding Without Re-Shuffling
The naive way to shard data across N nodes is hash(key) mod N. This works until N changes — then almost every key moves to a different node, and your cache hit rate drops to zero overnight. Consistent hashing (Karger et al. 1997) limits the disruption: when a node is added or removed, only ~1/N of keys move.
The trick
Hash both keys and nodes onto the same circular space (e.g., 0..2³²). A key is owned by the next node clockwise. When a node is added, it takes over the keys between it and its predecessor — only those keys move.
Virtual nodes
A naive ring puts all keys for a node in one contiguous arc. If one node is hot, you can't redistribute. Virtual nodes hash each physical node to multiple positions on the ring (typically 100–500 each), spreading its keys across many arcs. Adding capacity becomes uniform, and load is naturally balanced. DynamoDB, Cassandra, ScyllaDB, and Riak all use virtual nodes.
Where it shows up
Anywhere keys must map to N changing nodes: Memcached client routing, distributed caches (Redis Cluster slots are a 16,384-cell variant), CDN PoP routing, message-broker partition assignment.
Time, Clocks, and Lamport's Insight
You cannot trust the wall clock across machines. NTP keeps clocks within ~10 ms in a data center; ~100 ms across regions; clock skew can be much worse during outages. "Happened-before" needs a different definition.
Lamport timestamps
Every event gets a counter that increments on local activity and is updated to max(local, received) + 1 on every received message. The result: if event A happened before B, then L(A) < L(B). The reverse is not true (concurrent events can have any order), but it's enough to define a global ordering that respects causality.
Vector clocks
Lamport timestamps lose information about concurrency. Vector clocks keep an integer per node; each event increments only its own component. Two events are concurrent iff neither vector dominates the other. DynamoDB and Riak use vector clocks (or version vectors) to detect conflicting writes that need resolution.
Spanner's TrueTime — the GPS approach
Google Spanner achieves external consistency by giving every node a TrueTime API that returns an interval bounding the real time, computed from GPS clocks and atomic clocks per data center. Transactions wait long enough that their commit timestamp is guaranteed to be in the past — a few ms — but the result is a globally-consistent serializable database. Most other systems can't afford the hardware and use logical clocks instead.
if (event.timestamp > cached.timestamp) update(). Without a synchronized clock guarantee, this is unsafe — the older event from a different machine may have a higher timestamp due to skew. Use logical clocks, version numbers per record, or proper happens-before tracking.Distributed Transactions — Two Painful Choices
You sometimes need to update state across multiple services or shards atomically. The two options:
Two-Phase Commit (2PC)
A coordinator asks each participant to prepare. If all say yes, it tells them to commit; if any say no or time out, it tells them to abort. Provides true atomicity. The cost: blocking. If the coordinator dies between phases, participants are stuck holding locks until it recovers. 2PC works fine within tightly-coupled systems (database internals, distributed file systems with stable membership) and is a poor fit for loosely-coupled microservices.
Sagas (Day 3 recap)
Forget atomicity; embrace compensation. Each step is a local transaction; failures run compensating transactions in reverse. Loose coupling, no blocking, semantically weaker. The default for cross-service workflows in modern stacks.
Spanner / CockroachDB / FoundationDB
The third path: a database that does the hard work itself, exposing a single SQL endpoint that does globally-consistent transactions across shards. Spanner uses Paxos + TrueTime; CockroachDB uses Raft + hybrid logical clocks. The pitch is "distributed Postgres semantics". The cost is operational complexity and per-write latency that can't beat single-region Postgres on simple workloads.
Service Discovery and Membership
In a network where instances come and go, callers need to find healthy targets. Three patterns:
- DNS-based — service has a hostname that resolves to current IPs. Simple. TTL caching means failover is slow.
- Centralized registry — Consul, etcd, Kubernetes Endpoints. Each instance registers; clients query (or watch). Day 6 covers this in the K8s context.
- Mesh-side discovery — Istio/Linkerd sidecars take care of it transparently. The application code dials a logical name; the sidecar does the rest.
Health checks — the underrated detail
A common bug: a load balancer treats a process as healthy because its TCP port is open, but the process can't reach the database. Implement two health checks: liveness (is the process alive?) — pass it lightly, restart on failure; readiness (can the process serve traffic now?) — fail it when DB connections, dependencies, or warm-up are missing. Kubernetes has separate fields for both for exactly this reason.
Replication Strategies
| Style | Writes | Reads | Best for |
|---|---|---|---|
| Single-leader | Only on leader | Followers (with lag) | Most apps; Postgres / MySQL default |
| Multi-leader | Any leader, replicate | Local leader | Multi-region active-active; conflict resolution required |
| Leaderless (Dynamo-style) | Quorum write (W of N) | Quorum read (R of N) | Always-available KV; tunable consistency |
For leaderless: if R + W > N, you're guaranteed at least one replica with the latest write in any read quorum — strong consistency. The classic choices are R=W=2, N=3 (good balance) or R=N, W=1 (cheap writes, expensive reads).
How Real Systems Compose These
Let's map a few systems onto the primitives above so the picture concretizes.
| System | Consensus | Sharding | Consistency |
|---|---|---|---|
| etcd / Consul | Raft (one cluster) | None — small data | Strong (CP) |
| Postgres + replicas | Single-leader | Vertical or sharded by app | Strong on primary; eventual on replicas |
| DynamoDB | Per-partition Paxos | Hash-based, virtual nodes | Eventual default; strong opt-in |
| Cassandra | None — leaderless | Consistent hashing | Tunable (R, W, N) |
| Spanner / CockroachDB | Paxos / Raft per range | Range-based | External consistency |
| Kafka | Raft (KRaft) or ZK | Per-topic partitions | Per-partition order |
| Redis Cluster | Async replication | 16,384 slots | Eventual; strong with WAIT |
Show answer
- Network fails — code defensively at every hop.
- Clocks lie — use logical clocks for ordering.
- Quorum decides — majorities prevent split-brain.
- Compensate, don't commit — sagas across services beat 2PC.
- Ongaro & Ousterhout — In Search of an Understandable Consensus Algorithm (Raft)raft.github.io
- The Secret Lives of Data — Raft visualizedthesecretlivesofdata.com
- DeCandia et al. — Dynamo: Amazon's Highly Available KV Storeallthingsdistributed.com
- Corbett et al. — Spanner: Google's Globally Distributed DBresearch.google
- Brewer — CAP keynotecs.berkeley.edu
- Abadi — PACELC: Consistency Trade-offs in Modern Distributed Systemscs.umd.edu
Finished reading?