The Engineering Codex/Backend Engineering for the AI Era
DAY 2 · AM
03 / 09

Storage Systems — Relational, Document, KV, Vector

schedule12 minsignal_cellular_altAdvanced2,671 words
Picking a database is a decision you live with for years. Master ACID and isolation levels, the B-tree and LSM-tree machinery underneath, when relational beats document beats key-value beats vector — and how vector indexes (HNSW, IVF) join the data layer for the AI era.

What you will learn

01ACID — The Promise You're Buying
02Under the Hood — B-trees and LSM-trees
03The Five Storage Shapes
04Vector Stores and Approximate Nearest Neighbour
05Sharding, Replication, and the Read-Your-Writes Trap
06Postgres Performance — The Engineer's Toolkit

Every other backend layer can be replaced in a quarter. The data layer cannot. The choices made on day one — schema, partitioning, isolation, replication — propagate through every feature for years. This chapter is about making those choices with eyes open: what each store guarantees, what it costs, and how to extend the picture for vector retrieval that the AI era added without warning.

🔑
Today's storage map
1) ACID, isolation levels, MVCC — the contract a relational DB actually offers vs the one you imagine. 2) The B-tree vs LSM-tree trade-off that explains why Postgres and Cassandra feel so different. 3) Relational, document, key-value, time-series, vector — pick by access pattern, not by hype. 4) Sharding, replication, and the read-your-writes problem. 5) Vector indexes (HNSW, IVF) — what RAG depends on under the hood.

ACID — The Promise You're Buying

Every relational database advertises ACID. The acronym is so worn it stops meaning anything. Pull it apart:

  • Atomicity — a transaction is all-or-nothing. Either every write commits or none of them do, even if the process crashes mid-way.
  • Consistency — committed transactions leave the database in a state that satisfies all defined constraints (foreign keys, NOT NULL, CHECK, triggers). This is the C that's least about the database itself — it's about your constraints holding.
  • Isolation — concurrent transactions appear to run as if serially. Appear to is doing heavy lifting; this is where isolation levels live.
  • Durability — once a transaction commits, it stays committed even after a crash. Backed by the write-ahead log (WAL) being fsynced before COMMIT returns.

Isolation levels — the menu nobody reads

SQL standards define four levels. Real databases implement them differently. Knowing which one your transactions actually run under is the difference between a system that quietly loses orders and one that doesn't.

READ UNCOMMITTED · dirty reads possible · don't use READ COMMITTED · default in Postgres · non-repeatable reads possible REPEATABLE READ · default in MySQL · phantom reads possible (in spec) SNAPSHOT (PG-equivalent) / SERIALIZABLE · highest practical level SERIALIZABLE · transactions equivalent to some serial order Each step up trades concurrency for safety. Most apps run at READ COMMITTED and never know it.
Isolation levels, weakest at top. The default in your database is almost certainly not the strongest one.

Three real-world traps that come from not understanding which level you're at:

  1. Lost update — two transactions read the same row, both compute new values, both write. The second overwrites the first. Solved by SELECT FOR UPDATE, optimistic concurrency (version columns), or SERIALIZABLE.
  2. Write skew — two transactions read overlapping data and each makes a decision based on the snapshot, but together they violate an invariant the snapshots respected individually. Classic example: "at least one doctor must be on call"; two off-call requests run concurrently, each reads "three on call," each succeeds, ending with one on call. SERIALIZABLE prevents this; SNAPSHOT does not.
  3. Phantom reads — a SELECT inside a transaction returns different rows when re-run because another transaction inserted into the range. PG's REPEATABLE READ uses snapshot isolation and avoids this in practice.
⚠️
Find your level
Run SHOW TRANSACTION ISOLATION LEVEL; in production once a quarter. ORMs sometimes set it; some application servers raise or lower it without your knowledge. Track the answer in a Confluence page: "the system runs at READ COMMITTED with SELECT FOR UPDATE on charge-card paths." That sentence is more useful than half the database arguments people have.

Under the Hood — B-trees and LSM-trees

Every database picks one of two storage engines underneath. The choice colours the latency profile, the write throughput, the disk footprint, and the way you'd diagnose it under load.

B-tree Postgres, MySQL InnoDB, SQLite root [50 90] [20 35] [60 75] 5,10 22,30 36,42 66,78 in-place updates · O(log n) reads & writes writes do random I/O · range scans cheap good for: read-heavy, transactional LSM-tree Cassandra, RocksDB, ScyllaDB, BigTable memtable (in-RAM, sorted) L0 SSTable L1 SSTable (compacted) L2 SSTable (compacted) writes append to memtable background compaction merges reads check multiple levels good for: write-heavy, time-series
B-trees update in place; LSM-trees write everything sequentially and merge later. Picking the wrong engine for your access pattern wastes a lot of disk and CPU.

B-tree

A balanced tree of pages on disk. Reads are O(log n) and pleasingly sequential within a leaf — range scans are cheap. Writes update pages in place, which means they can be slower under heavy random-write workloads (page splits, free-space management). Postgres, MySQL InnoDB, and SQLite are all B-tree based.

LSM-tree (Log-Structured Merge)

Writes go to an in-memory memtable and a write-ahead log. When the memtable fills, it's flushed as a sorted SSTable file. Background compaction merges levels of SSTables. Reads might check the memtable + several SSTable levels (with Bloom filters to skip levels quickly). Writes are sequential and very fast; reads pay an amplification cost. Cassandra, ScyllaDB, RocksDB, BigTable, and DynamoDB-internals all use LSM.

How to choose

  • Read-dominated, complex queries, transactions → B-tree (Postgres).
  • Write-dominated, time-series, simple access → LSM (Cassandra, ScyllaDB).
  • Mostly reads, point lookups by primary key → either, but managed offerings (DynamoDB, Spanner) are usually LSM.
  • You need a SQL planner and JOINs → relational (Postgres) by default; consider Spanner / CockroachDB if you genuinely need horizontal write scale-out.

The Five Storage Shapes

Every modern backend uses two to four of these. The shape comes from the data and the queries you'll run, not from a roadmap.

Relational (Postgres, MySQL, SQLite)

Tabular, schema-defined, SQL queries with JOINs, ACID transactions. The default for any system of record. Postgres in particular has accumulated extensions that absorb most of the niches: JSONB for document-shaped data, pgvector for similarity search, pg_partman for time-partitioned tables, TimescaleDB for time-series, PostGIS for geo. Reach for Postgres first; reach for a specialized system only when you've proved Postgres won't do.

Document (MongoDB, Couchbase, Firestore)

Schema-flexible nested documents (typically JSON/BSON). Best when records have heterogeneous shapes, deep nesting that flattens awkwardly into rows, or schema evolution faster than your DBA can keep up. The cost: weaker JOIN story ("do it in the application"), and consistency models that range from eventual to strong depending on configuration. Modern Postgres JSONB closes ~80% of the gap for ~80% of use cases.

Key-value (Redis, DynamoDB, etcd)

The simplest contract: get(k), set(k, v). Built for very high throughput and predictable latency. Used in three roles:

  • Cache in front of a slower store (Day 2 PM).
  • System of record for narrow access patterns (DynamoDB).
  • Coordination primitive (etcd, Consul) for service discovery, leader election, distributed locks.

Time-series (TimescaleDB, InfluxDB, Prometheus)

Optimized for append-mostly, time-indexed data: metrics, IoT readings, financial ticks. Compact storage via columnar layouts and time-window compression; first-class retention, downsampling, and continuous aggregates. Use these when you'd otherwise be writing millions of rows per minute and querying "average value per minute over last 7 days" — a query a B-tree painfully scans through.

Vector (pgvector, Pinecone, Qdrant, Weaviate, Milvus)

Stores high-dimensional float vectors and finds nearest neighbours by cosine similarity (or L2, dot product). The new entry to the family — required for retrieval-augmented generation, semantic search, recommendation. We unpack this in the next section because it's both novel and surprisingly tractable.

💡
Polyglot persistence is normal — but expensive
A mature backend often runs Postgres + Redis + a vector store + S3 + a queue's storage. Each adds operational surface (backups, upgrades, monitoring, IAM). Resist a new database for any feature that fits in Postgres with extensions; the operational tax is real and durable.

Vector Stores and Approximate Nearest Neighbour

An embedding is a list of 768–4096 floats representing a piece of text (or image, audio, code). Two embeddings whose vectors are close — by cosine similarity — represent semantically similar content. This is the core trick behind RAG, semantic search, and the LLM's "have I seen this before" cache.

Why a special index?

For a million 1024-dimensional vectors, a naive linear scan computes a million 1024-element dot products per query — slow. Approximate Nearest Neighbour (ANN) indexes trade a tiny accuracy hit (say, 95% recall) for a massive speedup. The two dominant techniques:

HNSW Hierarchical Navigable Small World L3 L2 L1 L0 descend layers, greedily pick closer neighbour fast, high recall, RAM-hungry IVF + PQ Inverted File + Product Quantization centroid 1 centroid 2 centroid 3 centroid 4 cluster vectors, search top-k centroids disk-friendly, scales to billions
HNSW navigates a small-world graph; IVF clusters and searches the closest cells. Most modern stacks default to HNSW.

HNSW (Hierarchical Navigable Small World)

Builds a graph where each node connects to its nearest neighbours, with extra long-range edges in higher layers. Search starts at the top, greedily jumps toward the query, descends layers. Excellent recall (>95%), excellent latency (<10 ms for millions of vectors), but holds the graph in RAM — memory bound. Default in pgvector, Qdrant, Weaviate.

IVF (Inverted File) + Product Quantization

K-means clusters all vectors into N partitions; index stores the centroid → vector list. Query computes distances to centroids, then linear scan within the closest few. Pair with Product Quantization to compress vectors 8–16× by quantizing sub-spaces. Excellent disk efficiency, scales to billions of vectors. Used by FAISS, Milvus, large-scale Pinecone deployments.

Picking a vector store

SystemBest forTrade-off
pgvectorYou already run Postgres; <10M vectorsSlower than dedicated; one fewer system
Qdrant / WeaviateSelf-hosted, mid-scale, hybrid metadata + vector filtersOperational surface
Pinecone / TurbopufferManaged, pay-per-use, scale to billionsVendor dependency, network round trip
FAISS (in-process)Embedded research, batch jobsNo persistence layer; you wrap it

For most teams: start with pgvector, move to a dedicated store only when you've outgrown it. The migration is usually a 1-week sprint and the operational simplification of fewer-stores-while-small is worth it.

Sharding, Replication, and the Read-Your-Writes Trap

Replication

Most production databases run a primary plus one or more replicas. Two dimensions matter:

  • Synchronous vs asynchronous. Sync: primary waits for at least one replica to acknowledge before returning COMMIT — writes are slower but durable across a primary failure. Async: primary acknowledges as soon as its own WAL is fsynced — fast writes, possible data loss on failover ("window of vulnerability").
  • Single-leader vs multi-leader. Single-leader is the default and far simpler; multi-leader (or leaderless) raises hard consistency questions handled on Day 4.

Read-your-writes inconsistency

If you write to the primary and then immediately read from a replica, you might see stale data because replication hasn't caught up (typically 10–500 ms lag). Three remedies:

  1. Read from primary for own-user-just-wrote operations. Trade some primary load for correctness. The simplest fix.
  2. Sticky session — same user reads from the same replica that's caught up to their last write. AWS RDS Proxy, ProxySQL.
  3. Wait for LSN. The application records the WAL position after a write; subsequent reads include "don't return until you've caught up to LSN X".

Sharding

When the data is too big or too hot for one machine, split it across N shards by a key (user_id, tenant_id). Two patterns:

  • Range-based: A–F on shard 1, G–M on shard 2... Easy to scan a range; risk of hot spots if traffic isn't uniform.
  • Hash-based: hash(key) mod N. Uniform distribution but range scans become scatter-gathers across all shards.

Real systems use consistent hashing to limit re-shuffling when shard count changes. Day 4 covers consensus and consistent hashing in detail.

🚨
Don't shard prematurely
Postgres on a beefy single node can comfortably handle 10TB of data and tens of thousands of TPS with proper indexing, partitioning, and connection pooling. Most companies that sharded early did so at 100GB and now have a permanent operational headache. Vertical scale + read replicas + table partitioning + careful indexes will get you further than the internet suggests. The signal you're hitting the wall: write throughput on the primary saturates the WAL bandwidth.

Postgres Performance — The Engineer's Toolkit

Even in the polyglot era, Postgres is the database most likely to be in your call path. Three skills pay back the most:

Read EXPLAIN ANALYZE

EXPLAIN (ANALYZE, BUFFERS) SELECT … shows the query plan and what actually happened. Look for Seq Scan on big tables (missing index), Nested Loop with high inner-loop cost (consider a hash/merge join), Filter: rows removed (predicate pushed too late). The first time you fix a query from 4 s to 4 ms by adding the right index, you'll never look at a slow query the same way.

Use partial and expression indexes

Indexes don't have to cover whole columns. CREATE INDEX … ON orders(user_id) WHERE status = 'open' is one-tenth the size of a full index and faster, when 'open' is a small slice. Expression indexes (CREATE INDEX … ON users(LOWER(email))) make case-insensitive lookups indexable. Use them.

Pool connections

Postgres opens a process per connection (~10 MB each). 1000 connections = 10 GB of RAM and per-connection lock-table overhead. Use PgBouncer (or RDS Proxy) in transaction-pool mode in front of the database; let the application use thousands of "virtual" connections that map to <50 real ones. Forgetting this is the textbook cause of "the database is slow when traffic spikes" — it's not slow, it's saturated by handshakes.

Quick check
An app uses Postgres for users and orders, and adds a feature that does semantic search over support tickets. The team is already comfortable with PG. They are considering pgvector vs Pinecone. Which would you reach for first, and what would change your mind?
Show answer
pgvector first. No new system to operate, transactional consistency with the metadata (ticket fields, owners, statuses) for free, single backup story, and HNSW performance is excellent for the typical <5M-vector range. Switch to Pinecone (or Qdrant) when: the corpus exceeds tens of millions of vectors, you need per-tenant isolation with very different update patterns, you want managed sharding, or you're seeing measurable latency contention with OLTP queries. The migration is straightforward — embeddings are just floats, the only real coupling is the SQL filter you push down with the vector query.

The Honest Decision Tree

Need to store data? Postgres + JSONB Specialized store Redis (cache/locks) pgvector → Qdrant Time-series DB Object store (S3) structured + JSON large blobs hot K/V embeddings metrics, events
A pragmatic decision tree. Most workloads are well served by the Postgres-shaped path on the left.
Mnemonic — storage choices
"Relational by default, specialised when proved."
  • Postgres for the system of record.
  • Redis for cache, locks, rate-limit counters.
  • pgvector or Qdrant for embeddings.
  • S3 for blobs.
  • TSDB for metrics.
  • Don't shard until the box really runs out.
Flashcard
A teammate proposes "replicas for read scaling" by routing all reads to followers and only writes to the primary. The next day, users complain their saved profile photo doesn't appear after upload. Diagnose and propose three options ranked by cost.
Click to flip ↻
Answer
Diagnosis: classic read-your-writes inconsistency. The upload's INSERT goes to the primary; the next page-load reads from a follower that hasn't replicated yet (ms to seconds). Cheap: for own-user reads in the few seconds after a write, route to the primary ("sticky to primary on write"). Medium: after writing, store the WAL LSN in the user's session; reads include "min_lsn" and the proxy waits for that LSN on the chosen replica. Expensive: use a database with strong consistency reads (Spanner, CockroachDB, Aurora consistent reads) so the application doesn't have to think about it. Most teams pick option 1 and move on.
🔑
Key takeaways
1) ACID sounds like one thing, but isolation is the part that actually decides what races can happen — find your level. 2) B-tree for read-heavy/transactional, LSM-tree for write-heavy/time-series. 3) Reach for Postgres first; specialised stores when the access pattern proves it. 4) Vector indexes (HNSW, IVF) join the data layer for the AI era — start with pgvector, graduate when you must. 5) Replication and sharding are powerful but introduce read-your-writes traps; vertical scale will take you much further than the internet suggests. 6) Connection pools, EXPLAIN, and the right indexes are the three Postgres skills with the highest ROI.

Finished reading?