The Engineering Codex/LLM Systems Engineering

Vector DB & Retrieval Pipelines

4 min · Intermediate · 988 words
Build production RAG systems. Learn vector database internals, indexing algorithms, embedding strategies, chunking, and how to measure retrieval quality.

What you will learn

01 Why Vector Databases?
02 How Vector Search Works
03 Vector Index Algorithms
04 Popular Vector Databases
05 Chunking Strategies
06 Hybrid Search: BM25 + Dense

Why Vector Databases?

LLMs have knowledge cutoffs and can't access your private data. RAG (Retrieval-Augmented Generation) solves this by storing your data as vectors and retrieving relevant chunks at query time — injecting live, relevant context into the prompt.

📚
The RAG Pipeline
Indexing: Document → Chunking → Embedding model → Vectors stored in VectorDB. Querying: User query → Embed query → Nearest neighbor search in VectorDB → Retrieve top-k chunks → Inject into LLM prompt → Generate answer. The quality of every step compounds.
Figure: the RAG pipeline.
Index path: Documents (raw text) → Chunk (512 tokens) → Embed (dim=1536) → VectorDB (HNSW index).
Query path: User Query → Embed (same model) → ANN Search (top-k chunks) → LLM + injected context → Answer.
RAG has two paths: the index path (one-time, offline) and the query path (real-time). Quality compounds — a bad embedding model or poor chunking strategy will degrade every answer.
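
Both paths can be sketched end to end in plain Python. This is a toy illustration under loud assumptions: `toy_embed` is a hypothetical stand-in for a real embedding model (normalized word counts over a vocabulary built from the chunks), and the "vector DB" is a list scanned by brute force.

```python
import math
import re

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocab(chunks: list[str]) -> dict[str, int]:
    """Map every word seen in the chunks to a vector dimension."""
    words = sorted({w for c in chunks for w in tokenize(c)})
    return {w: i for i, w in enumerate(words)}

def toy_embed(text: str, vocab: dict[str, int]) -> list[float]:
    """Hypothetical stand-in for an embedding model: L2-normalized word counts."""
    vec = [0.0] * len(vocab)
    for w in tokenize(text):
        if w in vocab:  # words outside the vocabulary are ignored
            vec[vocab[w]] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def build_index(chunks: list[str], vocab: dict[str, int]):
    """Index path: embed each chunk once and store (vector, text) pairs."""
    return [(toy_embed(c, vocab), c) for c in chunks]

def retrieve(index, vocab, query: str, k: int = 2) -> list[str]:
    """Query path: embed the query the same way, return the top-k chunks."""
    q = toy_embed(query, vocab)
    scored = sorted(index, key=lambda item: -sum(a * b for a, b in zip(q, item[0])))
    return [text for _, text in scored[:k]]

def build_prompt(query: str, context: list[str]) -> str:
    """Inject the retrieved chunks into the prompt that would go to the LLM."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

chunks = [
    "HNSW builds a layered graph for fast approximate search.",
    "BM25 ranks documents by keyword overlap.",
    "Chunk overlap preserves context across chunk boundaries.",
]
vocab = build_vocab(chunks)
index = build_index(chunks, vocab)
top = retrieve(index, vocab, "How does HNSW search work?", k=1)
prompt = build_prompt("How does HNSW search work?", top)
```

In a real system the embedding call and the index would be replaced by a model and a vector database, but the shape of the two paths stays the same.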

How Vector Search Works

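Each document chunk is converted to a dense vector (e.g., 1536 dimensions for OpenAI's text-embedding-ada-002). Similarity between two vectors is measured by cosine similarity or dot product. Finding the vectors most similar to a query is nearest neighbor search. Because an exact scan over every stored vector is too slow at scale, vector databases use Approximate Nearest Neighbor (ANN) search, which trades a small amount of recall for much faster queries.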

Cosine Similarity
sim(A, B) = (A · B) / (||A|| × ||B||) ∈ [-1, 1]
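
A minimal sketch of the formula in plain Python; the example vectors are made up for illustration.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """(A · B) / (||A|| × ||B||); raises on zero vectors, where it is undefined."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

query = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]   # same direction, different length
orthogonal = [0.0, 0.0, 3.0]       # shares no component with the query
opposite = [-1.0, -2.0, 0.0]       # points the other way

sim_same = cosine_similarity(query, same_direction)   # close to 1
sim_orth = cosine_similarity(query, orthogonal)       # 0
sim_opp = cosine_similarity(query, opposite)          # close to -1
```

Cosine similarity ignores vector length. When vectors are normalized to unit length, it equals the dot product, which is why many systems normalize embeddings once and then use the cheaper dot product.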

Vector Index Algorithms

Algorithm | How It Works | Speed | Memory | Accuracy
HNSW | Hierarchical navigable small world graph | Very Fast | High (full vectors plus graph links, usually kept in RAM) | Very High (recall >99% achievable with tuning)
IVF | Inverted file index with Voronoi cells | Fast | Medium | High
IVF-PQ | IVF + Product Quantization compression | Fast | Low (8–32× compression) | Medium-High
Flat | Brute-force exact search | Slow (O(N) per query) | Lowest overhead (no index structure) | 100% (exact)
ScaNN | Google's learned quantization | Very Fast | Low | High
Index Selection Guide
Start with HNSW: it is the default index in most VectorDBs (Qdrant, Weaviate, Chroma) and, properly tuned, gives excellent recall (often >99%) with fast queries. Use IVF-PQ only when you need to reduce memory by 8–32× for very large datasets (>100M vectors). Flat (brute-force) search is exact, but query time grows linearly with collection size, so it is only viable for small collections (roughly <1M vectors) when latency matters.
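
To see where an 8–32× saving can come from, here is a back-of-the-envelope estimate. It assumes float32 vectors and product quantization with one byte per sub-vector code. The sub-vector counts (768 and 192) are illustrative choices, and the estimate ignores codebooks, IDs, and index structures.

```python
def flat_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw storage for uncompressed vectors (float32 by default)."""
    return num_vectors * dim * bytes_per_value

def pq_bytes(num_vectors: int, num_subvectors: int, bits_per_code: int = 8) -> int:
    """Storage for PQ codes only: one small code per sub-vector."""
    return num_vectors * num_subvectors * bits_per_code // 8

n, dim = 100_000_000, 1536

raw = flat_bytes(n, dim)                      # 614_400_000_000 bytes (about 614 GB)
pq_light = pq_bytes(n, num_subvectors=768)    # 76_800_000_000 bytes (about 77 GB)
pq_heavy = pq_bytes(n, num_subvectors=192)    # 19_200_000_000 bytes (about 19 GB)

print(f"raw: {raw / 1e9:.1f} GB")
print(f"PQ, 768 sub-vectors: {pq_light / 1e9:.1f} GB ({raw // pq_light}x smaller)")
print(f"PQ, 192 sub-vectors: {pq_heavy / 1e9:.1f} GB ({raw // pq_heavy}x smaller)")
```

Fewer sub-vectors means more compression but coarser distance estimates, which is the recall trade-off the table summarizes as "Medium-High".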

Popular Vector Databases

Database | Architecture | Scale | Best For | Free?
Qdrant | Rust, distributed | Billions | Production, high perf | Open source
Weaviate | Go, modular | Hundreds of millions | Multi-modal, rich filtering | Open source
Pinecone | Managed SaaS | Billions | Zero-ops, enterprise | Free tier
Chroma | Python, embedded | Millions | Prototyping, local dev | Open source
pgvector | PostgreSQL extension | Tens of millions | Already using Postgres | Open source
Milvus | C++, cloud-native | Billions | Large-scale enterprise | Open source

Chunking Strategies

How you split documents dramatically affects retrieval quality. This is often the most impactful thing to tune in a RAG pipeline.

Fixed-Size Chunking
  • Split every N tokens (e.g., 512)
  • With overlap (e.g., 50 tokens)
  • Fast, simple, predictable
  • Can split mid-sentence/concept
  • Use as baseline
Semantic Chunking
  • Split on semantic boundaries
  • Uses embeddings to detect topic shifts
  • Better coherence per chunk
  • Slower, more complex
  • Best for long documents
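
Fixed-size chunking with overlap is simple enough to sketch directly. This version assumes whitespace-separated words as "tokens" for readability; a real pipeline would count tokens with the embedding model's own tokenizer.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks

tokens = "the quick brown fox jumps over the lazy dog again".split()  # 10 tokens
chunks = chunk_tokens(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
# the quick brown fox
# fox jumps over the
# the lazy dog again
```

Each chunk repeats the last token of the previous one, so a sentence cut at a boundary still shares some context with its neighbor.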
Python · Qdrant with LangChain
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

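# 1. Chunk documents
# `documents` is assumed to be a list of LangChain Document objects loaded earlier.
# Note: chunk_size and chunk_overlap are measured in characters here, not tokens.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", " ", ""],
)
chunks = splitter.split_documents(documents)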

# 2. Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Qdrant.from_documents(
    chunks,
    embeddings,
    url="http://localhost:6333",
    collection_name="docs",
)

# 3. Retrieve
results = vectorstore.similarity_search(
    "What is PagedAttention?",
    k=5
)

Hybrid Search: BM25 + Dense

Pure vector search can miss exact keyword matches such as product codes, error strings, or rare names. BM25 (sparse retrieval) handles keyword precision. Hybrid search runs both retrievers and merges their ranked lists, typically with reciprocal rank fusion (RRF).
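
Reciprocal rank fusion itself is only a few lines. The sketch below assumes each retriever returns a ranked list of document IDs; the IDs are made up, and `k=60` is a commonly used default for the smoothing constant.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Score each doc as the sum of 1 / (k + rank) over every list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_ranking = ["d1", "d2", "d3"]    # e.g. from vector search
sparse_ranking = ["d2", "d4", "d1"]   # e.g. from BM25

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)  # ['d2', 'd1', 'd4', 'd3']
```

RRF uses only ranks, not raw scores, so it needs no score normalization between BM25 and cosine similarity. Documents that both retrievers rank highly (`d2`, `d1`) rise to the top.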

Python · Hybrid search in Qdrant
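from qdrant_client import QdrantClient
from qdrant_client.models import Fusion, FusionQuery, Prefetch, SparseVector

client = QdrantClient(url="http://localhost:6333")

# Assumes the "docs" collection was created with named vectors "dense" and "sparse".
# dense_vector: list[float] produced by the embedding model
# sparse_vector: SparseVector(indices=[...], values=[...]) from a BM25-style encoder

# Query with both dense + sparse vectors
results = client.query_points(
    collection_name="docs",
    prefetch=[
        Prefetch(query=dense_vector, using="dense", limit=20),
        Prefetch(query=sparse_vector, using="sparse", limit=20),
    ],
    query=FusionQuery(fusion=Fusion.RRF),  # Reciprocal Rank Fusion
    limit=5,
)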

Reranking: The Second Stage

ANN retrieval is optimized for speed — it returns the top-k vectors that are closest in embedding space, but "close in embedding space" doesn't always mean "actually relevant." A reranker (cross-encoder) runs a second, slower but more accurate pass over the retrieved candidates to reorder them before passing context to the LLM.

Bi-encoder (retriever)
  • Embeds query and docs independently
  • Fast: dot product or cosine similarity
  • Scales to millions of documents
  • Can miss subtle relevance nuances
  • Examples: text-embedding-3, BGE, E5
Cross-encoder (reranker)
  • Processes query + doc together
  • Slow: N forward passes for top-N chunks
  • Only feasible on top 20–50 candidates
  • Much higher accuracy than bi-encoder
  • Examples: Cohere Rerank, bge-reranker
🏆
Two-Stage Pipeline = Best of Both
Retrieve top-50 with fast ANN search, then rerank with a cross-encoder to get the best top-5. The expensive reranker only touches 50 candidates, not millions. This pattern often improves top-5 quality over retrieval alone (gains of 10–20% in Recall@5 are commonly cited, though the actual figure depends on the dataset and models) and is now standard in production RAG pipelines.
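
The control flow of the two-stage pattern can be sketched independently of any model. Both scoring functions below are hypothetical toy stand-ins: `overlap_score` plays the role of the fast retriever and `toy_cross_score` the role of the cross-encoder, which sees query and document together.

```python
from typing import Callable

Scorer = Callable[[str, str], float]

def two_stage_search(query: str, docs: list[str], first_stage: Scorer,
                     reranker: Scorer, candidates: int = 50, top_k: int = 5) -> list[str]:
    # Stage 1: cheap score over the whole corpus (stand-in for ANN search).
    shortlist = sorted(docs, key=lambda d: first_stage(query, d), reverse=True)[:candidates]
    # Stage 2: expensive score over the shortlist only.
    return sorted(shortlist, key=lambda d: reranker(query, d), reverse=True)[:top_k]

def overlap_score(query: str, doc: str) -> float:
    """Toy retriever: number of distinct words shared by query and doc."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def toy_cross_score(query: str, doc: str) -> float:
    """Toy reranker: rewards the exact query phrase, otherwise word-overlap density."""
    if query.lower() in doc.lower():
        return 1.0
    return 0.5 * overlap_score(query, doc) / max(len(doc.split()), 1)

docs = [
    "the cache is paged and attention is paid to the operating system",
    "paged attention stores the kv cache in fixed-size blocks",
    "bananas are rich in potassium",
]
best = two_stage_search("paged attention", docs, overlap_score, toy_cross_score,
                        candidates=2, top_k=1)
print(best)  # ['paged attention stores the kv cache in fixed-size blocks']
```

The first stage cannot tell the first two documents apart because both contain "paged" and "attention". The second stage separates them by looking at the query and document jointly, which is the job a cross-encoder does in a real pipeline.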

Retrieval Quality Metrics

Metric | What It Measures | Good Value (rule of thumb)
Recall@K | Fraction of relevant docs that appear in the top-K results | > 0.85
Precision@K | Fraction of the top-K retrieved docs that are relevant | > 0.70
NDCG@K | Normalized Discounted Cumulative Gain: relevance weighted by rank position | > 0.75
MRR | Mean Reciprocal Rank: how high the first relevant doc ranks, averaged over queries | > 0.80
Faithfulness | Fraction of the answer supported by retrieved context | > 0.90
Context Relevance | Fraction of retrieved context actually used | > 0.70
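
The ranking metrics in the table are straightforward to compute once you have labeled queries. The sketch below assumes binary relevance (a chunk is either relevant or not); the chunk IDs are made up.

```python
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of all relevant docs that show up in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k slots filled with relevant docs."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant doc; MRR is the mean of this over queries."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """DCG of the actual ranking divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(retrieved[:k], start=1)
              if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

retrieved = ["c7", "c2", "c9", "c4", "c1"]   # ranked output for one query
relevant = {"c2", "c4", "c8"}                # ground-truth labels

print(f"Recall@5    = {recall_at_k(retrieved, relevant, 5):.3f}")
print(f"Precision@5 = {precision_at_k(retrieved, relevant, 5):.3f}")
print(f"RR          = {reciprocal_rank(retrieved, relevant):.3f}")
print(f"NDCG@5      = {ndcg_at_k(retrieved, relevant, 5):.3f}")
```

Here two of the three relevant chunks are retrieved, at ranks 2 and 4, so Recall@5 is 2/3, Precision@5 is 2/5, and the reciprocal rank is 1/2. Faithfulness and context relevance depend on the generated answer and usually need an LLM or human judge, so they are not covered by this sketch.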
🔑
Key Takeaways
1. Start with Qdrant or Chroma and an HNSW index: both are battle-tested and simple to run.
2. Chunking strategy often matters more than which VectorDB you choose. Experiment with chunk size and overlap.
3. Hybrid search (BM25 + dense) often outperforms pure dense search, especially on keyword-heavy queries, so enable it early.
4. Measure Recall@5 before tuning anything else. If your retriever doesn't find the right chunk, no amount of LLM tuning will save the answer quality.
