
Vector DB & Retrieval Pipelines
Build production RAG systems. Learn vector database internals, indexing algorithms, embedding strategies, chunking, and how to measure retrieval quality.
Why Vector Databases?
LLMs have knowledge cutoffs and can't access your private data. RAG (Retrieval-Augmented Generation) solves this by storing your data as vectors and retrieving relevant chunks at query time — injecting live, relevant context into the prompt.
How Vector Search Works
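Each document chunk is converted to a dense vector (e.g., 1536 dimensions for OpenAI's text-embedding-ada-002). Similarity between two vectors is typically measured by cosine similarity or dot product. Finding the stored vectors most similar to a query is nearest-neighbor search. Comparing the query against every stored vector costs O(N), so production systems use Approximate Nearest Neighbor (ANN) search, which gives up a small amount of recall in exchange for much lower latency.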
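The similarity computation described above can be sketched in a few lines of plain Python. This is a minimal, exact (brute-force) illustration on toy 3-dimensional vectors, not how a vector database implements search; the function names and the sample vectors are assumptions made up for this example.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)


def top_k(query, vectors, k):
    """Exact nearest neighbors: score every vector, return the best k as (index, score)."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # highest score first, index breaks ties
    return [(i, score) for score, i in scored[:k]]


# Toy data (hypothetical): real embeddings have hundreds or thousands of dimensions.
chunk_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
query_vector = [1.0, 0.05, 0.0]

nearest = top_k(query_vector, chunk_vectors, k=2)
print(nearest)
```

An ANN index returns (approximately) the same top-k without scoring every stored vector, which is what makes search over millions of chunks practical.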
Vector Index Algorithms
| Algorithm | How It Works | Speed | Memory | Accuracy |
|---|---|---|---|---|
| HNSW | Hierarchical navigable small world graph | Very Fast | High (full vectors plus graph links, usually held in RAM) | Very High (recall above 99% with tuned parameters) |
| IVF | Inverted file index with Voronoi cells | Fast | Medium | High |
| IVF-PQ | IVF + Product Quantization compression | Fast | Low (8-32× compression) | Medium-High |
| Flat | Brute-force exact search | Slow (O(N)) | Lowest overhead | 100% (exact) |
| ScaNN | Google's library: partitioning plus learned (anisotropic) vector quantization | Very Fast | Low | High |
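To make the Memory column concrete, the arithmetic below compares raw float32 storage with product-quantized codes. The corpus size (1 million vectors), the 1536 dimensions, and the choice of 192 sub-vectors at 8 bits each are assumptions picked for illustration; the calculation ignores codebooks, IVF centroids, and any graph or metadata overhead.

```python
def flat_index_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw storage for uncompressed vectors (float32 by default)."""
    return num_vectors * dims * bytes_per_value


def pq_code_bytes(num_vectors, num_subvectors, bits_per_code=8):
    """Storage for product-quantized codes: one small code per sub-vector."""
    return num_vectors * num_subvectors * bits_per_code // 8


num_vectors, dims = 1_000_000, 1536  # assumed corpus size and embedding width

raw_bytes = flat_index_bytes(num_vectors, dims)
pq_bytes = pq_code_bytes(num_vectors, num_subvectors=192)
compression_ratio = raw_bytes / pq_bytes

print(f"raw: {raw_bytes / 1e9:.2f} GB, PQ codes: {pq_bytes / 1e9:.2f} GB, "
      f"ratio: {compression_ratio:.0f}x")
```

With these assumed settings the codes are 32 times smaller than the raw vectors, which sits at the upper end of the 8-32× range in the table. Fewer bytes per vector generally costs some recall.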
Popular Vector Databases
| Database | Architecture | Scale | Best For | Free? |
|---|---|---|---|---|
| Qdrant | Rust, distributed | Billions | Production, high perf | Open source |
| Weaviate | Go, modular | Hundreds of millions | Multi-modal, rich filtering | Open source |
| Pinecone | Managed SaaS | Billions | Zero-ops, enterprise | Free tier |
| Chroma | Python, embedded | Millions | Prototyping, local dev | Open source |
| pgvector | PostgreSQL extension | Tens of millions | Already using Postgres | Open source |
| Milvus | C++, cloud-native | Billions | Large-scale enterprise | Open source |
Chunking Strategies
How you split documents dramatically affects retrieval quality. This is often the most impactful thing to tune in a RAG pipeline.
Fixed-size chunking
- Split every N tokens (e.g., 512)
- Add overlap between neighboring chunks (e.g., 50 tokens)
- Fast, simple, predictable
- Can split mid-sentence or mid-concept
- Use as a baseline
Semantic chunking
- Split on semantic boundaries
- Uses embeddings to detect topic shifts
- Better coherence per chunk
- Slower, more complex
- Best for long documents
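Fixed-size chunking with overlap, as described in the list above, can be sketched as a sliding window. This is a minimal illustration: the function name is made up for this example, and whitespace splitting stands in for a real tokenizer, which would produce different token counts.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Slide a window of chunk_size tokens, stepping by chunk_size - overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end; avoid a redundant tail chunk
    return chunks


# Whitespace split is an assumed stand-in for a model tokenizer.
words = "one two three four five six seven".split()
chunks = chunk_tokens(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk repeats the last `overlap` tokens of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.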
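```python
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# `documents` is assumed to be a list of LangChain Document objects
# loaded earlier (for example, by a document loader).

# 1. Chunk documents. chunk_size and chunk_overlap are measured in
#    characters by default, not tokens.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", " ", ""],  # paragraphs, lines, words, characters
)
chunks = splitter.split_documents(documents)

# 2. Embed and store (needs OPENAI_API_KEY and a Qdrant server on localhost:6333)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Qdrant.from_documents(
    chunks,
    embeddings,
    url="http://localhost:6333",
    collection_name="docs",
)

# 3. Retrieve the 5 chunks most similar to the query
results = vectorstore.similarity_search("What is PagedAttention?", k=5)
```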
Hybrid Search: BM25 + Dense
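Pure vector search can miss exact keyword matches such as identifiers, product codes, or rare terms. BM25 (sparse retrieval) supplies that keyword precision. Hybrid search runs both retrievers and merges the two ranked lists, commonly with reciprocal rank fusion (RRF), which scores each document by summing 1/(k + rank) over every list it appears in.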
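```python
from qdrant_client import QdrantClient, models

# The server URL is an assumption; point it at your own Qdrant instance.
client = QdrantClient(url="http://localhost:6333")

# dense_vector: list[float] produced by the embedding model.
# sparse_vector: models.SparseVector(indices=[...], values=[...]) produced by
# a sparse encoder such as BM25. Both are assumed to be computed beforehand,
# and the collection is assumed to define named vectors "dense" and "sparse".

# Query with both dense and sparse vectors, then fuse the two candidate lists
results = client.query_points(
    collection_name="docs",
    prefetch=[
        models.Prefetch(query=dense_vector, using="dense", limit=20),
        models.Prefetch(query=sparse_vector, using="sparse", limit=20),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),  # Reciprocal Rank Fusion
    limit=5,
)
```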
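For intuition about what the fusion step does, here is a minimal sketch of reciprocal rank fusion in plain Python. It is an illustration of the general technique, not Qdrant's internal implementation; the function name and document ids are made up, and k=60 is a commonly used constant rather than a required value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: each list adds 1 / (k + rank) to a doc's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; doc id breaks ties deterministically.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a dense retriever and a BM25 retriever.
dense_ranking = ["doc_a", "doc_b", "doc_c"]
bm25_ranking = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Because RRF uses only ranks, it sidesteps the problem that BM25 scores and cosine similarities live on different scales. Documents that appear in both lists (`doc_a`, `doc_c`) rise above documents that appear in only one.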
Reranking: The Second Stage
ANN retrieval is optimized for speed — it returns the top-k vectors that are closest in embedding space, but "close in embedding space" doesn't always mean "actually relevant." A reranker (cross-encoder) runs a second, slower but more accurate pass over the retrieved candidates to reorder them before passing context to the LLM.
Bi-encoder (first-stage retrieval)
- Embeds query and docs independently
- Fast: dot product or cosine similarity
- Scales to millions of documents
- Can miss subtle relevance nuances
- Examples: text-embedding-3, BGE, E5
Cross-encoder (reranker)
- Processes query + doc together
- Slow: one forward pass per candidate (N passes for the top-N chunks)
- Only feasible on the top 20–50 candidates
- Usually more accurate than a bi-encoder, because it sees query and doc jointly
- Examples: Cohere Rerank, bge-reranker
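The two-stage pattern above can be sketched as a function that rescores a short candidate list with a pairwise scorer. In this illustration `toy_overlap_score` is a deliberately crude stand-in for a real cross-encoder (it only counts shared words), and all names and sample texts are assumptions made up for the example.

```python
def rerank(query, candidates, score_pair, top_n=5):
    """Score each (query, doc) pair jointly and return the top_n docs, best first."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep input order
    return [doc for _, doc in scored[:top_n]]


def toy_overlap_score(query, doc):
    """Stand-in scorer (NOT a cross-encoder): fraction of query words found in doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)


# Pretend these came back from first-stage ANN retrieval, in that order.
candidates = [
    "Vector indexes trade recall for speed.",
    "PagedAttention manages the KV cache in fixed-size blocks.",
    "Attention is computed over all tokens.",
]

top = rerank(
    "how does PagedAttention manage the KV cache",
    candidates,
    toy_overlap_score,
    top_n=2,
)
print(top[0])
```

In a real pipeline `score_pair` would call a cross-encoder model or a reranking API, and `candidates` would be the 20 to 50 chunks returned by the first stage. The cost grows linearly with the number of candidates, which is why reranking is applied only after retrieval has narrowed the field.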
Retrieval Quality Metrics
| Metric | What It Measures | Good Value (rule of thumb) |
|---|---|---|
| Recall@K | Fraction of all relevant docs that appear in the top-K results | > 0.85 |
| Precision@K | Fraction of the top-K retrieved docs that are relevant | > 0.70 |
| NDCG@K | Normalized Discounted Cumulative Gain: rewards relevant docs ranked near the top | > 0.75 |
| MRR | Mean Reciprocal Rank: how high the first relevant doc ranks, averaged over queries | > 0.80 |
| Faithfulness | Fraction of claims in the answer that are supported by the retrieved context | > 0.90 |
| Context Relevance | Fraction of the retrieved context that is relevant to the question | > 0.70 |
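The ranking metrics in the table can be computed from a ranked list of retrieved ids and a set of ids judged relevant. The sketch below uses binary relevance (a doc is either relevant or not) and made-up document ids; Faithfulness and Context Relevance are omitted because they typically require an LLM or human judge rather than a closed-form formula.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs found in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k slots filled by relevant docs."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant doc; MRR is this value averaged over queries."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical single query: ranked results and ground-truth relevant ids.
retrieved = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}

print(f"Recall@5    = {recall_at_k(retrieved, relevant, 5):.3f}")
print(f"Precision@5 = {precision_at_k(retrieved, relevant, 5):.3f}")
print(f"RR          = {reciprocal_rank(retrieved, relevant):.3f}")
print(f"NDCG@5      = {ndcg_at_k(retrieved, relevant, 5):.3f}")
```

In practice these per-query values are averaged over a labeled evaluation set, and the thresholds in the table are best treated as starting points that depend on the corpus and the task.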
- RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al.), arxiv.org
- Efficient and Robust Approximate Nearest Neighbor Search Using HNSW Graphs, arxiv.org
- FAISS: Facebook AI Similarity Search Library, github.com
- Qdrant Vector Database: Official Documentation, qdrant.tech
- RAGAS: Automated Evaluation of Retrieval Augmented Generation, arxiv.org