Storage Systems — Relational, Document, KV, Vector
Picking a database is a decision you live with for years. Master ACID and isolation levels, the B-tree and LSM-tree machinery underneath, when relational beats document beats key-value beats vector — and how vector indexes (HNSW, IVF) join the data layer for the AI era.
What you will learn
Every other backend layer can be replaced in a quarter. The data layer cannot. The choices made on day one — schema, partitioning, isolation, replication — propagate through every feature for years. This chapter is about making those choices with eyes open: what each store guarantees, what it costs, and how to extend the picture for vector retrieval that the AI era added without warning.
ACID — The Promise You're Buying
Every relational database advertises ACID. The acronym is so worn it stops meaning anything. Pull it apart:
- Atomicity — a transaction is all-or-nothing. Either every write commits or none of them do, even if the process crashes mid-way.
- Consistency — committed transactions leave the database in a state that satisfies all defined constraints (foreign keys, NOT NULL, CHECK, triggers). This is the C that's least about the database itself — it's about your constraints holding.
- Isolation — concurrent transactions appear to run as if serially. Appear to is doing heavy lifting; this is where isolation levels live.
- Durability — once a transaction commits, it stays committed even after a crash. Backed by the write-ahead log (WAL) being fsynced before COMMIT returns.
Isolation levels — the menu nobody reads
SQL standards define four levels. Real databases implement them differently. Knowing which one your transactions actually run under is the difference between a system that quietly loses orders and one that doesn't.
Three real-world traps that come from not understanding which level you're at:
- Lost update — two transactions read the same row, both compute new values, both write. The second overwrites the first. Solved by SELECT FOR UPDATE, optimistic concurrency (version columns), or SERIALIZABLE.
- Write skew — two transactions read overlapping data and each makes a decision based on the snapshot, but together they violate an invariant the snapshots respected individually. Classic example: "at least one doctor must be on call"; two off-call requests run concurrently, each reads "three on call," each succeeds, ending with one on call. SERIALIZABLE prevents this; SNAPSHOT does not.
- Phantom reads — a SELECT inside a transaction returns different rows when re-run because another transaction inserted into the range. PG's REPEATABLE READ uses snapshot isolation and avoids this in practice.
SHOW TRANSACTION ISOLATION LEVEL; in production once a quarter. ORMs sometimes set it; some application servers raise or lower it without your knowledge. Track the answer in a Confluence page: "the system runs at READ COMMITTED with SELECT FOR UPDATE on charge-card paths." That sentence is more useful than half the database arguments people have.Under the Hood — B-trees and LSM-trees
Every database picks one of two storage engines underneath. The choice colours the latency profile, the write throughput, the disk footprint, and the way you'd diagnose it under load.
B-tree
A balanced tree of pages on disk. Reads are O(log n) and pleasingly sequential within a leaf — range scans are cheap. Writes update pages in place, which means they can be slower under heavy random-write workloads (page splits, free-space management). Postgres, MySQL InnoDB, and SQLite are all B-tree based.
LSM-tree (Log-Structured Merge)
Writes go to an in-memory memtable and a write-ahead log. When the memtable fills, it's flushed as a sorted SSTable file. Background compaction merges levels of SSTables. Reads might check the memtable + several SSTable levels (with Bloom filters to skip levels quickly). Writes are sequential and very fast; reads pay an amplification cost. Cassandra, ScyllaDB, RocksDB, BigTable, and DynamoDB-internals all use LSM.
How to choose
- Read-dominated, complex queries, transactions → B-tree (Postgres).
- Write-dominated, time-series, simple access → LSM (Cassandra, ScyllaDB).
- Mostly reads, point lookups by primary key → either, but managed offerings (DynamoDB, Spanner) are usually LSM.
- You need a SQL planner and JOINs → relational (Postgres) by default; consider Spanner / CockroachDB if you genuinely need horizontal write scale-out.
The Five Storage Shapes
Every modern backend uses two to four of these. The shape comes from the data and the queries you'll run, not from a roadmap.
Relational (Postgres, MySQL, SQLite)
Tabular, schema-defined, SQL queries with JOINs, ACID transactions. The default for any system of record. Postgres in particular has accumulated extensions that absorb most of the niches: JSONB for document-shaped data, pgvector for similarity search, pg_partman for time-partitioned tables, TimescaleDB for time-series, PostGIS for geo. Reach for Postgres first; reach for a specialized system only when you've proved Postgres won't do.
Document (MongoDB, Couchbase, Firestore)
Schema-flexible nested documents (typically JSON/BSON). Best when records have heterogeneous shapes, deep nesting that flattens awkwardly into rows, or schema evolution faster than your DBA can keep up. The cost: weaker JOIN story ("do it in the application"), and consistency models that range from eventual to strong depending on configuration. Modern Postgres JSONB closes ~80% of the gap for ~80% of use cases.
Key-value (Redis, DynamoDB, etcd)
The simplest contract: get(k), set(k, v). Built for very high throughput and predictable latency. Used in three roles:
- Cache in front of a slower store (Day 2 PM).
- System of record for narrow access patterns (DynamoDB).
- Coordination primitive (etcd, Consul) for service discovery, leader election, distributed locks.
Time-series (TimescaleDB, InfluxDB, Prometheus)
Optimized for append-mostly, time-indexed data: metrics, IoT readings, financial ticks. Compact storage via columnar layouts and time-window compression; first-class retention, downsampling, and continuous aggregates. Use these when you'd otherwise be writing millions of rows per minute and querying "average value per minute over last 7 days" — a query a B-tree painfully scans through.
Vector (pgvector, Pinecone, Qdrant, Weaviate, Milvus)
Stores high-dimensional float vectors and finds nearest neighbours by cosine similarity (or L2, dot product). The new entry to the family — required for retrieval-augmented generation, semantic search, recommendation. We unpack this in the next section because it's both novel and surprisingly tractable.
Vector Stores and Approximate Nearest Neighbour
An embedding is a list of 768–4096 floats representing a piece of text (or image, audio, code). Two embeddings whose vectors are close — by cosine similarity — represent semantically similar content. This is the core trick behind RAG, semantic search, and the LLM's "have I seen this before" cache.
Why a special index?
For a million 1024-dimensional vectors, a naive linear scan computes a million 1024-element dot products per query — slow. Approximate Nearest Neighbour (ANN) indexes trade a tiny accuracy hit (say, 95% recall) for a massive speedup. The two dominant techniques:
HNSW (Hierarchical Navigable Small World)
Builds a graph where each node connects to its nearest neighbours, with extra long-range edges in higher layers. Search starts at the top, greedily jumps toward the query, descends layers. Excellent recall (>95%), excellent latency (<10 ms for millions of vectors), but holds the graph in RAM — memory bound. Default in pgvector, Qdrant, Weaviate.
IVF (Inverted File) + Product Quantization
K-means clusters all vectors into N partitions; index stores the centroid → vector list. Query computes distances to centroids, then linear scan within the closest few. Pair with Product Quantization to compress vectors 8–16× by quantizing sub-spaces. Excellent disk efficiency, scales to billions of vectors. Used by FAISS, Milvus, large-scale Pinecone deployments.
Picking a vector store
| System | Best for | Trade-off |
|---|---|---|
| pgvector | You already run Postgres; <10M vectors | Slower than dedicated; one fewer system |
| Qdrant / Weaviate | Self-hosted, mid-scale, hybrid metadata + vector filters | Operational surface |
| Pinecone / Turbopuffer | Managed, pay-per-use, scale to billions | Vendor dependency, network round trip |
| FAISS (in-process) | Embedded research, batch jobs | No persistence layer; you wrap it |
For most teams: start with pgvector, move to a dedicated store only when you've outgrown it. The migration is usually a 1-week sprint and the operational simplification of fewer-stores-while-small is worth it.
Sharding, Replication, and the Read-Your-Writes Trap
Replication
Most production databases run a primary plus one or more replicas. Two dimensions matter:
- Synchronous vs asynchronous. Sync: primary waits for at least one replica to acknowledge before returning COMMIT — writes are slower but durable across a primary failure. Async: primary acknowledges as soon as its own WAL is fsynced — fast writes, possible data loss on failover ("window of vulnerability").
- Single-leader vs multi-leader. Single-leader is the default and far simpler; multi-leader (or leaderless) raises hard consistency questions handled on Day 4.
Read-your-writes inconsistency
If you write to the primary and then immediately read from a replica, you might see stale data because replication hasn't caught up (typically 10–500 ms lag). Three remedies:
- Read from primary for own-user-just-wrote operations. Trade some primary load for correctness. The simplest fix.
- Sticky session — same user reads from the same replica that's caught up to their last write. AWS RDS Proxy, ProxySQL.
- Wait for LSN. The application records the WAL position after a write; subsequent reads include "don't return until you've caught up to LSN X".
Sharding
When the data is too big or too hot for one machine, split it across N shards by a key (user_id, tenant_id). Two patterns:
- Range-based: A–F on shard 1, G–M on shard 2... Easy to scan a range; risk of hot spots if traffic isn't uniform.
- Hash-based: hash(key) mod N. Uniform distribution but range scans become scatter-gathers across all shards.
Real systems use consistent hashing to limit re-shuffling when shard count changes. Day 4 covers consensus and consistent hashing in detail.
Postgres Performance — The Engineer's Toolkit
Even in the polyglot era, Postgres is the database most likely to be in your call path. Three skills pay back the most:
Read EXPLAIN ANALYZE
EXPLAIN (ANALYZE, BUFFERS) SELECT … shows the query plan and what actually happened. Look for Seq Scan on big tables (missing index), Nested Loop with high inner-loop cost (consider a hash/merge join), Filter: rows removed (predicate pushed too late). The first time you fix a query from 4 s to 4 ms by adding the right index, you'll never look at a slow query the same way.
Use partial and expression indexes
Indexes don't have to cover whole columns. CREATE INDEX … ON orders(user_id) WHERE status = 'open' is one-tenth the size of a full index and faster, when 'open' is a small slice. Expression indexes (CREATE INDEX … ON users(LOWER(email))) make case-insensitive lookups indexable. Use them.
Pool connections
Postgres opens a process per connection (~10 MB each). 1000 connections = 10 GB of RAM and per-connection lock-table overhead. Use PgBouncer (or RDS Proxy) in transaction-pool mode in front of the database; let the application use thousands of "virtual" connections that map to <50 real ones. Forgetting this is the textbook cause of "the database is slow when traffic spikes" — it's not slow, it's saturated by handshakes.
Show answer
The Honest Decision Tree
- Postgres for the system of record.
- Redis for cache, locks, rate-limit counters.
- pgvector or Qdrant for embeddings.
- S3 for blobs.
- TSDB for metrics.
- Don't shard until the box really runs out.
- PostgreSQL — Transaction Isolationpostgresql.org
- PostgreSQL — Using EXPLAINpostgresql.org
- Kleppmann — DDIA, chapters 7–9amazon.com
- pgvector — Postgres vector extensiongithub.com
- Pinecone — HNSW deep divepinecone.io
- Kleppmann — Hermitage: testing isolationmartin.kleppmann.com
Finished reading?