DAY 6

Vector DB & Retrieval Pipelines

4 min · Intermediate · 988 words

Build production RAG systems. Learn vector database internals, indexing algorithms, embedding strategies, chunking, and how to measure retrieval quality.

What you will learn

01 Why Vector Databases?

02 How Vector Search Works

03 Vector Index Algorithms

04 Popular Vector Databases

05 Chunking Strategies

06 Hybrid Search: BM25 + Dense

Why Vector Databases?

LLMs have knowledge cutoffs and can't access your private data. RAG (Retrieval-Augmented Generation) solves this by storing your data as vectors and retrieving relevant chunks at query time — injecting live, relevant context into the prompt.

📚

The RAG Pipeline

Indexing: Document → Chunking → Embedding model → Vectors stored in VectorDB.
Querying: User query → Embed query → Nearest neighbor search in VectorDB → Retrieve top-k chunks → Inject into LLM prompt → Generate answer.
The quality of every step compounds.

RAG has two paths: the index path (offline, rerun whenever documents change) and the query path (real-time). Errors carry forward along both, so a bad embedding model or a poor chunking strategy will degrade every answer downstream.
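To make the two paths concrete, here is a minimal in-memory sketch. Everything in it is a stand-in chosen for illustration: `embed` is a toy bag-of-words counter rather than a real embedding model, and `ToyVectorStore` and `build_prompt` are hypothetical names, not part of any library.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store; a real system would use a vector database."""

    def __init__(self):
        self.rows = []  # list of (vector, chunk_text) pairs

    def index(self, chunks):
        # Index path: embed each chunk once and store it.
        for chunk in chunks:
            self.rows.append((embed(chunk), chunk))

    def query(self, question, k=2):
        # Query path: embed the question, rank stored chunks, keep the top k.
        q = embed(question)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]


def build_prompt(question, context_chunks):
    """Inject the retrieved chunks into the prompt sent to the LLM."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


store = ToyVectorStore()
store.index([
    "HNSW builds a layered graph for fast nearest neighbor search",
    "BM25 ranks documents by keyword overlap",
    "Chunk overlap keeps context across chunk boundaries",
])
top_chunks = store.query("how does HNSW search work", k=1)
prompt = build_prompt("how does HNSW search work", top_chunks)
```

Swapping `embed` for a real model and `ToyVectorStore` for a vector database changes the quality and scale, but not the shape of the two paths.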

How Vector Search Works

Each document chunk is converted to a dense vector (e.g., 1536 dimensions for OpenAI's text-embedding-ada-002). Similarity between two vectors is measured by cosine similarity or dot product. Finding the vectors most similar to a query is nearest neighbor search; at scale, vector databases use Approximate Nearest Neighbor (ANN) search, which trades a small amount of recall for much faster queries.

Cosine Similarity

sim(A, B) = (A · B) / (||A|| × ||B||) ∈ [-1, 1]
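The formula can be checked with a few lines of plain Python. This is a sketch for intuition, not how a vector database computes it internally; the helper names are illustrative. It also shows why cosine similarity and dot product agree once vectors are normalized to unit length.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """sim(A, B) = (A · B) / (||A|| × ||B||)."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot(a, b) / (norm_a * norm_b)


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # 0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # -1

# For unit-length vectors the dot product equals the cosine similarity.
cos_value = cosine_similarity([3.0, 4.0], [4.0, 3.0])
unit_dot = dot(normalize([3.0, 4.0]), normalize([4.0, 3.0]))
```

Cosine similarity ignores vector length, which is why two vectors pointing the same way score near 1 regardless of magnitude.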

Vector Index Algorithms

Algorithm	How It Works	Speed	Memory	Accuracy
HNSW	Hierarchical navigable small world graph	Very Fast	High (full vectors plus graph links held in memory)	Very High (recall >99% achievable with tuning)
IVF	Inverted file index with Voronoi cells	Fast	Medium	High
IVF-PQ	IVF + Product Quantization compression	Fast	Low (8-32× compression)	Medium-High
Flat	Brute-force exact search	Slow (O(N))	Lowest overhead	100% (exact)
ScaNN	Google's learned quantization	Very Fast	Low	High

✅

Index Selection Guide

Start with HNSW: it is the default in most VectorDBs (Qdrant, Weaviate, Chroma) and, when tuned, gives excellent recall (>99%) with fast queries. Use IVF-PQ only when you need to reduce memory by 8–32× for very large datasets (>100M vectors). Flat search is exact but scans every vector, so it is only viable for small collections (roughly <1M vectors) where that per-query cost is acceptable.
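A back-of-the-envelope calculation shows where the 8–32× figure comes from. The sketch below assumes float32 vectors and 8-bit PQ codes, and it ignores codebooks, IVF centroids, and graph links; the subvector counts (768 and 192) are assumptions picked to reproduce the two ends of that range.

```python
def flat_index_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw storage for uncompressed vectors (float32 by default)."""
    return num_vectors * dims * bytes_per_value


def pq_index_bytes(num_vectors, num_subvectors, bits_per_code=8):
    """Storage for PQ codes only: one small code per subvector."""
    return num_vectors * num_subvectors * bits_per_code // 8


n, d = 100_000_000, 1536                              # 100M vectors, 1536 dimensions
raw = flat_index_bytes(n, d)                          # uncompressed float32 vectors
pq_light = pq_index_bytes(n, num_subvectors=768)      # mild compression
pq_heavy = pq_index_bytes(n, num_subvectors=192)      # aggressive compression

print(f"raw:      {raw / 1e9:.1f} GB")
print(f"PQ (768): {pq_light / 1e9:.1f} GB, {raw / pq_light:.0f}x smaller")
print(f"PQ (192): {pq_heavy / 1e9:.1f} GB, {raw / pq_heavy:.0f}x smaller")
```

At this scale the uncompressed vectors alone exceed 600 GB, which is why compression becomes attractive even though it costs some accuracy.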

Popular Vector Databases

Database	Architecture	Scale	Best For	Free?
Qdrant	Rust, distributed	Billions	Production, high perf	Open source
Weaviate	Go, modular	Hundreds of millions	Multi-modal, rich filtering	Open source
Pinecone	Managed SaaS	Billions	Zero-ops, enterprise	Free tier
Chroma	Python, embedded	Millions	Prototyping, local dev	Open source
pgvector	PostgreSQL extension	Tens of millions	Already using Postgres	Open source
Milvus	C++, cloud-native	Billions	Large-scale enterprise	Open source

Chunking Strategies

How you split documents dramatically affects retrieval quality. This is often the most impactful thing to tune in a RAG pipeline.

Fixed-Size Chunking

Split every N tokens (e.g., 512)
With overlap (e.g., 50 tokens)
Fast, simple, predictable
Can split mid-sentence/concept
Use as baseline
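Fixed-size chunking with overlap fits in a few lines. This sketch assumes the text is already tokenized; whitespace-separated words stand in for real tokenizer output, and `chunk_fixed` is an illustrative name.

```python
def chunk_fixed(tokens, chunk_size=512, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


# Whitespace words stand in for tokens in this small example.
tokens = "the quick brown fox jumps over the lazy dog again".split()
chunks = chunk_fixed(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk begins with the last token of the previous one, so a sentence cut at a boundary still appears partly in both chunks.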

Semantic Chunking

Split on semantic boundaries
Uses embeddings to detect topic shifts
Better coherence per chunk
Slower, more complex
Best for long documents
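One simple way to detect topic shifts is to compare adjacent sentences and start a new chunk when their similarity drops. The sketch below uses a toy bag-of-words `embed` in place of a real embedding model, and the 0.2 threshold is an arbitrary assumption; production semantic chunkers differ in detail.

```python
import math
from collections import Counter


def embed(sentence):
    """Toy stand-in for a sentence embedding: bag-of-words counts."""
    return Counter(sentence.lower().replace(".", "").split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def chunk_semantic(sentences, threshold=0.2):
    """Start a new chunk whenever adjacent sentences fall below the threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            groups.append([cur])       # topic shift: open a new chunk
        else:
            groups[-1].append(cur)     # same topic: extend the current chunk
    return [" ".join(group) for group in groups]


sentences = [
    "Vector indexes speed up search.",
    "Graph indexes speed up search further.",
    "Cats sleep most of the day.",
    "Cats also sleep at night.",
]
semantic_chunks = chunk_semantic(sentences)
```

Here the two indexing sentences end up in one chunk and the two cat sentences in another, because the similarity between sentence two and sentence three is zero.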

Python · Qdrant with LangChain

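from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# `documents` is assumed to be a list of LangChain Document objects loaded earlier.

# 1. Chunk documents. Sizes are measured in characters here (the splitter's
#    default length function is len), not tokens.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", " ", ""],  # paragraph, line, word, then character breaks
)
chunks = splitter.split_documents(documents)

# 2. Embed and store (needs OPENAI_API_KEY and a Qdrant server on localhost:6333)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Qdrant.from_documents(
    chunks,
    embeddings,
    url="http://localhost:6333",
    collection_name="docs",
)

# 3. Retrieve the 5 chunks most similar to the query
results = vectorstore.similarity_search(
    "What is PagedAttention?",
    k=5
)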

Hybrid Search: BM25 + Dense

Pure vector search can miss exact keyword matches such as identifiers, error strings, or rare names. BM25 (sparse retrieval) handles keyword precision. Hybrid search runs both and merges the two ranked lists, commonly with reciprocal rank fusion (RRF).

Python · Hybrid search in Qdrant

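from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Assumes the "docs" collection defines a named dense vector "dense" and a named
# sparse vector "sparse", and that dense_vector (list of floats) and
# sparse_vector (a models.SparseVector with indices and values) were computed
# for the query beforehand.

# Query with both dense + sparse vectors
results = client.query_points(
    collection_name="docs",
    prefetch=[
        models.Prefetch(query=dense_vector, using="dense", limit=20),
        models.Prefetch(query=sparse_vector, using="sparse", limit=20),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),  # Reciprocal Rank Fusion
    limit=5,
)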
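The fusion step itself is small enough to write out. The sketch below implements the standard RRF formula, score(d) = Σ 1 / (k + rank), over two ranked lists of document ids. The constant k=60 is the commonly used default and an assumption here; it is not a statement about what Qdrant uses internally.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids; higher fused score ranks first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense_ranking = ["d1", "d2", "d3"]    # hypothetical dense (vector) results
sparse_ranking = ["d3", "d1", "d4"]   # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

RRF uses only ranks, not raw scores, so it needs no calibration between BM25 scores and cosine similarities. Documents that rank well in both lists (d1 and d3 here) rise above those found by only one retriever.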

Reranking: The Second Stage

ANN retrieval is optimized for speed — it returns the top-k vectors that are closest in embedding space, but "close in embedding space" doesn't always mean "actually relevant." A reranker (cross-encoder) runs a second, slower but more accurate pass over the retrieved candidates to reorder them before passing context to the LLM.

Bi-encoder (retriever)

Embeds query and docs independently
Fast: dot product or cosine similarity
Scales to millions of documents
Can miss subtle relevance nuances
Examples: text-embedding-3, BGE, E5

Cross-encoder (reranker)

Processes query + doc together
Slow: N forward passes for top-N chunks
Only feasible on top 20–50 candidates
Much higher accuracy than bi-encoder
Examples: Cohere Rerank, bge-reranker

🏆

Two-Stage Pipeline = Best of Both

Retrieve the top 50 with fast ANN search, then rerank with a cross-encoder and keep the best 5. The expensive reranker only touches 50 candidates, not millions. This pattern often boosts Recall@5 by 10–20% over retrieval alone, though the gain varies with the dataset and the reranker, and it is now standard in production RAG pipelines.
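The control flow of the two stages can be sketched with toy scorers. Both scoring functions are stand-ins chosen for illustration: `cheap_score` (shared words) plays the role of the bi-encoder and `expensive_score` (shared adjacent word pairs) plays the role of the cross-encoder. Neither reflects how real models score text.

```python
def cheap_score(query, doc):
    """Stage 1 stand-in for a bi-encoder: number of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def expensive_score(query, doc):
    """Stage 2 stand-in for a cross-encoder: shared adjacent word pairs."""
    q = query.lower().split()
    d = doc.lower().split()
    return len(set(zip(q, q[1:])) & set(zip(d, d[1:])))


def retrieve_then_rerank(query, corpus, retrieve_k=50, final_k=5):
    # Stage 1: cheap scoring over the whole corpus, keep retrieve_k candidates.
    candidates = sorted(
        corpus, key=lambda doc: cheap_score(query, doc), reverse=True
    )[:retrieve_k]
    # Stage 2: the expensive scorer only sees those candidates.
    reranked = sorted(
        candidates, key=lambda doc: expensive_score(query, doc), reverse=True
    )
    return reranked[:final_k]


corpus = [
    "memory attention paged words shuffled",
    "paged attention memory management in vLLM",
    "bm25 keyword search",
]
best = retrieve_then_rerank("paged attention memory", corpus, retrieve_k=2, final_k=1)
```

Stage 1 cannot tell the first two documents apart because both share three words with the query. Stage 2 looks at word order and promotes the document that actually contains the phrase.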

Retrieval Quality Metrics

Metric	What It Measures	Good Value (rule of thumb)
Recall@K	Fraction of all relevant docs that appear in the top-K results	> 0.85
Precision@K	Fraction of the top-K results that are relevant	> 0.70
NDCG@K	Normalized Discounted Cumulative Gain: relevance weighted by rank position	> 0.75
MRR	Mean Reciprocal Rank: how high the first relevant doc ranks, averaged over queries	> 0.80
Faithfulness	Fraction of the answer supported by retrieved context	> 0.90
Context Relevance	Fraction of retrieved context actually used	> 0.70
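The first four metrics can be computed directly from a ranked result list and a set of known-relevant ids. The sketch below assumes binary relevance (a document is either relevant or not) and uses made-up document ids. Faithfulness and Context Relevance need a judge (human or LLM) and are not covered.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Share of all relevant docs that appear in the top k."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0


def precision_at_k(retrieved, relevant, k):
    """Share of the top k results that are relevant."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant doc, or 0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Two hypothetical queries: (ranked results, set of relevant ids).
runs = [
    (["d7", "d2", "d9", "d4", "d1"], {"d2", "d4", "d8"}),
    (["d3", "d5"], {"d3"}),
]
retrieved, relevant = runs[0]
recall = recall_at_k(retrieved, relevant, k=5)        # 2 of 3 relevant docs found
precision = precision_at_k(retrieved, relevant, k=5)  # 2 of 5 results relevant
ndcg = ndcg_at_k(retrieved, relevant, k=5)
mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```

In practice these are averaged over a labeled evaluation set of queries, which is what makes thresholds like those in the table meaningful.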

🔑

Key Takeaways

1. Start with Qdrant or Chroma + HNSW index: battle-tested and simple.
2. Chunking strategy often matters more than which VectorDB you choose. Experiment with chunk size and overlap.
3. Hybrid search (BM25 + dense) almost always outperforms pure dense search, so enable it early.
4. Measure recall@5 before tuning anything else. If your retriever doesn't find the right chunk, no amount of LLM tuning will save the answer quality.
