Vector Search and Reranking
Vector Search & Reranking: Complete Summary
A comprehensive guide covering ANN, HNSW, Bi-Encoder, Cross-Encoder, ColBERT, and production RAG pipelines with sticky analogies and practical examples.
1. Approximate Nearest Neighbor (ANN)
The Problem
Finding similar items in massive datasets (millions of vectors) is slow with brute force, which compares the query against every single item.
Sticky Analogy: The City Map
You're at Times Square and need to find the 5 closest coffee shops.
- Brute force: Measure distance to every coffee shop in the city; takes hours.
- ANN approach: Only check coffee shops in nearby neighborhoods; find 5 great options in minutes.
Key insight: Use the structure of space (neighborhoods) to narrow your search dramatically.
TL;DR
- Exact NN: Check everything, guaranteed best, slow
- Approximate NN: Check strategically, probably best, fast
One-liner to Remember
"Zoom, don't scan."
3. NSW Graph Building (Proximity Graph)
How the Graph Gets Built
Three steps:
- Compute distances between all document vectors
- Add one node to the graph for each document
- Connect each node to its k nearest neighbors
What is k?
k = number of connections (edges) each node has
Analogy: 1 million houses in a city. Each house = 1 node.
- k=5 means each house knows 5 neighbors
- k=50 means each house knows 50 neighbors
k Trade-offs
| Small k | Large k |
|---|---|
| Faster search | Slower search |
| Might miss good paths | More accurate |
| Less memory | More memory |
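A minimal sketch of the three build steps above, assuming toy random vectors and brute-force pairwise similarities (real libraries build the graph incrementally and far more cleverly):

```python
import numpy as np

def build_knn_graph(vectors: np.ndarray, k: int = 5) -> dict:
    """Connect each node to its k nearest neighbors by cosine similarity."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T                 # step 1: similarities between all vectors
    np.fill_diagonal(sims, -np.inf)      # a node is not its own neighbor
    graph = {}
    for i in range(len(vectors)):        # step 2: one node per document
        graph[i] = np.argsort(-sims[i])[:k].tolist()   # step 3: k nearest edges
    return graph

vectors = np.random.default_rng(0).normal(size=(100, 64)).astype("float32")
graph = build_knn_graph(vectors, k=5)    # graph[0] -> the 5 neighbors of node 0
```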
4. Query Entry Point & Search Algorithm
Query Entry Point
You always start from a designated entry point node at the top layer. From there, greedily hop to whichever neighbor is closest to your query.
Search Algorithm
Key insight:
The search may not find the closest possible vectors; the algorithm doesn't pick the optimal overall path, just the best hop at each moment.
This is why it's called Approximate Nearest Neighbor: good enough, fast enough!
The Search Process
```
Top layer:    Entry → hop → hop → (can't improve) → drop down
Middle layer:       → hop → hop → (can't improve) → drop down
Bottom layer:       → hop → hop → FOUND closest neighbors!
```
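A single-layer sketch of that greedy hop, reusing the toy `build_knn_graph` output from the earlier sketch; HNSW runs this loop once per layer and drops down whenever no neighbor improves:

```python
import numpy as np

def greedy_search(graph: dict, vectors: np.ndarray, query: np.ndarray, entry: int = 0) -> int:
    """Hop to whichever neighbor is closest to the query; stop when stuck."""
    def dist(i):
        return np.linalg.norm(vectors[i] - query)

    current = entry
    while True:
        best_neighbor = min(graph[current], key=dist)
        if dist(best_neighbor) >= dist(current):
            return current            # local optimum -- the "approximate" answer
        current = best_neighbor       # hop

# Usage (with vectors/graph from the previous sketch):
# nearest = greedy_search(graph, vectors, vectors[42])
```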
5. Vector Search Libraries & Databases
Most Common Algorithm
HNSW is used by nearly every production vector search system.
Libraries
| Library | Creator | Notes |
|---|---|---|
| FAISS | Meta | Industry standard |
| Annoy | Spotify | Tree-based, simpler |
| ScaNN | Google | Optimized for large scale |
| hnswlib | - | Pure HNSW, very fast |
Vector Databases (HNSW under the hood)
| Database | Notes |
|---|---|
| Pinecone | Fully managed, production-ready |
| ChromaDB | Lightweight, great for local dev |
| Weaviate | Open-source, feature-rich |
| Milvus | Open-source, highly scalable |
| Qdrant | Rust-based, fast |
| pgvector | PostgreSQL extension |
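For reference, a minimal hnswlib example on toy random data; the parameter values are illustrative defaults, not tuned settings:

```python
import hnswlib
import numpy as np

dim = 128
data = np.random.rand(10_000, dim).astype("float32")

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=10_000, ef_construction=200, M=16)  # M ~ edges per node
index.add_items(data, np.arange(10_000))
index.set_ef(50)                      # search-time accuracy/speed knob

labels, distances = index.knn_query(data[:1], k=5)   # 5 approximate nearest neighbors
```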
6. Bi-Encoder
How It Works
```
Query → [Encoder] → Vector A
Doc   → [Encoder] → Vector B
            ↓
Compare vectors (cosine similarity)
```
Query and document encoded separately. The encoder never sees them together.
Sticky Analogy: Resume Screening
HR reads each resume independently, assigns a "fit score" to each.
- Fast: can process thousands quickly
- But never sees resume and job description together
Pros & Cons
| Pros | Cons |
|---|---|
| Very fast | Lower accuracy |
| Pre-compute doc vectors | Misses nuanced relationships |
| Search millions of docs | Information compressed |
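A minimal bi-encoder sketch with sentence-transformers; the model name is just a common small checkpoint, not a recommendation:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Documents and query are encoded separately -- the encoder never sees them together.
doc_vecs = model.encode(["I hate you", "I really like you"])
query_vec = model.encode("I love you")

scores = util.cos_sim(query_vec, doc_vecs)   # cosine similarity between the vectors
print(scores)
```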
7. Cross-Encoder
How It Works
[CLS] Query [SEP] Document [SEP] → Encoder → Relevance Score
Query and document go together through the model. Words can attend to each other!
Sticky Analogy: Job Interview
Interviewer sees candidate AND job requirements at the same time.
- Much more accurate: catches nuances
- But slow: can only do one at a time
Alternative Analogy: Chemistry
- Bi-encoder: Describe chemicals on paper, guess if they'll react
- Cross-encoder: Put chemicals in same beaker, observe what happens
Why Cross-Encoder is Slow
| | Bi-Encoder | Cross-Encoder |
|---|---|---|
| Query time | 1 encoder call + 1M dot products | 1M encoder calls |
| Speed | ~milliseconds | ~hours |
You can't pre-compute: the model needs query and doc together.
When to Use
Re-ranking only: the top 50-100 docs from the bi-encoder.
```python
# Cross-Encoder with the rerankers library
from rerankers import Reranker

ranker = Reranker('cross-encoder')
results = ranker.rank(
    query="I love you",
    docs=["I hate you", "I really like you"]
)
# Results:
# "I really like you": -2.468 (rank 1)  <- correct
# "I hate you":        -4.150 (rank 2)
```
8. ColBERT
The Middle Ground
| Model | Document Representation |
|---|---|
| Bi-encoder | 1 vector per doc |
| ColBERT | 1 vector per token |
| Cross-encoder | No vectors (score directly) |
Sticky Analogy: Recipe Search
Bi-encoder: Recipe labeled "Italian comfort food"
ColBERT: Recipe keeps ingredients visible:
- [tomato] [basil] [pasta] [garlic]
Query "tomato pasta" β direct matches on [tomato] and [pasta] tokens!
Alternative Analogy: Dating App
Bi-encoder: One bio sentence → "I like outdoors"
ColBERT: Individual interests listed:
- [hiking] [photography] [camping]
Search "hiking photography" β direct token matches!
9. MaxSim Score (ColBERT's Scoring Method)
How It Works
For each query token, find its maximum similarity with any doc token. Then sum.
```
Query: [I] [love] [you]
Doc:   [I] [hate] [you]

Similarity matrix:
          [I]   [hate]  [you]
  [I]     0.9    0.1    0.2   → max = 0.9
  [love]  0.1    0.7    0.1   → max = 0.7   ("love" matched "hate"!)
  [you]   0.2    0.1    0.9   → max = 0.9

MaxSim = 0.9 + 0.7 + 0.9 = 2.5
```
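The same calculation in a few lines of NumPy, using the similarity matrix above:

```python
import numpy as np

# Rows = query tokens [I, love, you]; columns = doc tokens [I, hate, you]
sim = np.array([
    [0.9, 0.1, 0.2],
    [0.1, 0.7, 0.1],
    [0.2, 0.1, 0.9],
])

maxsim = sim.max(axis=1).sum()   # per-query-token maxima: 0.9 + 0.7 + 0.9 = 2.5
print(maxsim)
```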
ColBERT's Limitation
"love" and "hate" get similar scores because they're both intense emotion words used in similar contexts.
ColBERT matches tokens. Cross-encoder understands relationships.
```python
# ColBERT with the rerankers library
from rerankers import Reranker

colbert_ranker = Reranker('colbert')
results = colbert_ranker.rank(
    query="I love you",
    docs=["I hate you", "I really like you"]
)
# Results:
# "I hate you":         1.753 (rank 1)  <- wrong!
# "I really like you":  1.703 (rank 2)
```
10. Comparison: Bi-Encoder vs Cross-Encoder vs ColBERT
Restaurant Analogy
| Model | Analogy |
|---|---|
| Bi-Encoder | Match craving to menu descriptions |
| ColBERT | Match craving to ingredient lists |
| Cross-Encoder | Chef serves dish knowing your craving |
Full Comparison Table
| | Bi-Encoder | ColBERT | Cross-Encoder |
|---|---|---|---|
| Speed | Fastest | Medium | Slowest |
| Accuracy | Good | Better | Best |
| Pre-compute docs? | Yes (1 vec/doc) | Yes (1 vec/token) | No |
| Storage | Low | High | N/A |
| Search 1M docs? | Yes | Yes | No |
| Best for | Retrieval | Retrieval + precision | Re-ranking |
When to Use Each
| Scenario | Best Choice |
|---|---|
| Search millions of docs fast | Bi-Encoder |
| Short, dense text (tweets, titles) | ColBERT |
| Re-rank top 50-100 results | Cross-Encoder |
| E-commerce product search (1000s) | ColBERT |
| Production RAG pipeline | Bi-Encoder → Cross-Encoder |
11. HyDE (Hypothetical Document Embeddings)
The Problem
Questions and documents "sound" different in embedding space.
The Trick
- Ask LLM to generate a fake answer (without retrieval)
- Embed the fake answer
- Search with that embedding instead
Sticky Analogy: House Hunting
You have a vague wish: "I want something cozy with natural light."
Normal: Tell the agent your wish → they struggle
HyDE: Describe your ideal house first ("A 2-bedroom cottage with south-facing windows...") → the agent finds similar actual houses
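A minimal HyDE sketch, assuming an OpenAI client and a sentence-transformers embedder; the model names, question, and prompt are illustrative choices, not part of HyDE itself:

```python
from openai import OpenAI
from sentence_transformers import SentenceTransformer

client = OpenAI()                                     # assumes OPENAI_API_KEY is set
embedder = SentenceTransformer("all-MiniLM-L6-v2")

question = "What causes rainbows to appear?"

# 1. Generate a hypothetical (possibly wrong) answer -- no retrieval yet
fake_answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": f"Write a short passage answering: {question}"}],
).choices[0].message.content

# 2. Embed the fake answer instead of the raw question
hyde_vector = embedder.encode(fake_answer)

# 3. hyde_vector now "sounds" like a document -- use it for the vector search
```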
12. Production RAG Pipeline
Sticky Analogy: Intern + Professor
Intern (Retriever): Grabs 20 files mentioning "Project X"; some relevant, some not.
Professor (Reranker): Reads each carefully, keeps the best 5.
The Standard Two-Stage Architecture
```
User Query
    │
    ▼
Stage 1: RETRIEVAL (cast a wide net)
    ├── Vector Search (bi-encoder)
    └── Keyword Search (BM25)
    │
    ▼  Top 50-100 candidates
    │
Stage 2: RERANKING
    └── Cross-Encoder scores each doc
    │
    ▼  Top 5-10 most relevant
    │
Stage 3: LLM GENERATION
    └── Generate answer with context
```
Why Hybrid Retrieval?
- Bi-encoder: Semantic similarity ("car" ≈ "automobile")
- BM25: Exact keyword matches ("Toyota Camry 2024")
Combined = better recall!
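A compressed sketch of the two-stage pipeline (BM25 omitted for brevity); the model names are common examples and the three toy documents stand in for a real index:

```python
from sentence_transformers import SentenceTransformer, util
from rerankers import Reranker

docs = [
    "2024 Toyota Camry pricing starts at $26,420.",
    "Automobile safety ratings explained.",
    "Ten tips for growing tomatoes at home.",
]
query = "How much does a new Camry cost?"

# Stage 1: bi-encoder retrieval casts a wide net
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
scores = util.cos_sim(bi_encoder.encode(query), bi_encoder.encode(docs))[0]
candidates = [docs[int(i)] for i in scores.argsort(descending=True)[:2]]

# Stage 2: cross-encoder reranks the small shortlist
reranker = Reranker("cross-encoder")
results = reranker.rank(query=query, docs=candidates)
# Stage 3: pass the top-ranked docs to the LLM as context
```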
13. LLM-Based Scoring
Fine-tuned LLM takes query + document → outputs a relevance score.
Slowest but most accurate: use when quality matters more than latency.
14. Open Source Rerankers
Popular Models
| Model | Creator | Speed | Notes |
|---|---|---|---|
| ms-marco-MiniLM | Microsoft | Fastest | 22M params |
| mxbai-rerank | Mixedbread | Medium | Default in rerankers |
| BGE-reranker | BAAI | Medium | Most popular |
| ColBERTv2 | Stanford | Medium | Token-level |
Organizations
- BAAI = Beijing Academy of Artificial Intelligence
- Mixedbread = German AI startup
- Microsoft = MS-MARCO dataset + MiniLM architecture
```python
# Switching models in the rerankers library
from rerankers import Reranker

# Microsoft's fast model
ranker = Reranker('cross-encoder', model_name='cross-encoder/ms-marco-MiniLM-L-6-v2')

# BGE model (most popular)
ranker = Reranker('cross-encoder', model_name='BAAI/bge-reranker-base')

# Mixedbread (default)
ranker = Reranker('cross-encoder', model_name='mixedbread-ai/mxbai-rerank-base-v1')

# ColBERT
ranker = Reranker('colbert', model_name='colbert-ir/colbertv2.0')

# LLM-based
ranker = Reranker('rankgpt', model_name='gpt-4o-mini')
```
15. Handling Long Documents
The Constraint
Most models have a 512-token limit (roughly 1-2 paragraphs). A 50-page doc is ~25,000 tokens.
The Solution: Chunking
```
50-page document
   ↓
Chunking (~100 chunks)
   ↓
Bi-encoder creates a vector for each chunk
   ↓
Query → retrieve top 50-100 chunks
   ↓
Cross-encoder re-ranks those chunks
   ↓
Top 10 chunks → LLM
```
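One common way to do the chunking step, using LangChain's recursive character splitter; the file path and sizes below are placeholders, and any splitter that keeps chunks under the model's token limit works:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # characters, roughly a few hundred tokens
    chunk_overlap=200,   # overlap so ideas aren't cut off mid-thought
)

with open("fifty_page_document.txt") as f:   # placeholder path
    text = f.read()

chunks = splitter.split_text(text)           # each chunk gets its own embedding
print(len(chunks))
```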
16. Next Steps
Chunking Strategies
- Fixed Width / Recursive Character Splitting: Good defaults
- Semantic / LLM Chunking: Higher performance, more complex
- Context-Aware Chunking: Good "first improvement" to explore
Query Enhancement
- Query Parsing
- NER (Named Entity Recognition) for query parsing
- HyDE for better query-document matching
Advanced Ideas
- Use ColBERT on subject lines, Bi-encoder on body (hybrid approach)
- Rank fusion combining multiple retrieval methods
17. Real-World ColBERT Use Case
E-commerce Product Search (eBay, Amazon)
Product titles are short and every word matters:
- "iPhone 15 Pro Max 256GB Blue Unlocked"
- "Nike Air Jordan 1 Retro High OG Chicago"
Why ColBERT works:
- Titles are short → every token matters
- Exact terms matter → "256GB" must match "256GB"
- Need to re-rank 1000s of products quickly
Query: "Nike Jordan 1 Chicago size 10"
β
Bi-encoder retrieves top 1000 products
β
ColBERT re-ranks (fast, token-level)
β
Top 50 shown to user
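A sketch of the reranking step with the rerankers library; the candidate titles are made up, and in practice they would come from the bi-encoder's top 1000:

```python
from rerankers import Reranker

colbert = Reranker("colbert")

candidates = [
    "Nike Air Jordan 1 Retro High OG Chicago",
    "Nike Air Max 90 White",
    "Adidas Samba Classic Black",
]
results = colbert.rank(query="Nike Jordan 1 Chicago size 10", docs=candidates)
# Token-level matches on "Nike", "Jordan", "1", "Chicago" push the right product up.
```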
Quick Reference
One-Liners to Remember
| Concept | One-liner |
|---|---|
| ANN | "Zoom, don't scan" |
| HNSW | "Multi-layer graph, start sparse, end dense" |
| Bi-encoder | "Encode separately, compare vectors" |
| Cross-encoder | "Encode together, understand relationship" |
| ColBERT | "Token-level matching with MaxSim" |
When to Use What
| Task | Solution |
|---|---|
| Search 1M docs | Bi-encoder + HNSW |
| Re-rank top 50-100 | Cross-encoder |
| Re-rank 500-1000 | ColBERT |
| Short dense text | ColBERT |
| Maximum quality | LLM-based reranker |
| Long documents | Chunk first! |
Speed Ranking
Bi-encoder (fastest) > ColBERT > Cross-encoder > LLM-based (slowest)
Accuracy Ranking
LLM-based > Cross-encoder > ColBERT > Bi-encoder