Vector Search and Reranking

2025-12-22
nlp, vector-search, rag, embeddings

Vector Search & Reranking: Complete Summary

A comprehensive guide covering ANN, HNSW, Bi-Encoder, Cross-Encoder, ColBERT, and production RAG pipelines with sticky analogies and practical examples.


1. Approximate Nearest Neighbor (ANN)

The Problem

Finding similar items in massive datasets (millions of vectors) is slow with brute force - comparing against every single item.

πŸ—ΊοΈ Sticky Analogy: The City Map

You're at Times Square and need to find the 5 closest coffee shops.

  • Brute force: Measure distance to every coffee shop in the city - takes hours.
  • ANN approach: Only check coffee shops in nearby neighborhoods - find 5 great options in minutes.

Key insight: Use the structure of space (neighborhoods) to narrow your search dramatically.

TL;DR

  • Exact NN: Check everything, guaranteed best, slow
  • Approximate NN: Check strategically, probably best, fast
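
To make the trade-off concrete, here is a minimal sketch comparing exact and approximate search with FAISS (one of the libraries covered later). The data is random and the index parameters are illustrative, not tuned.

# Exact vs. approximate nearest neighbor with FAISS (illustrative sketch)
import numpy as np
import faiss

d = 128                                             # vector dimension
xb = np.random.rand(10_000, d).astype("float32")    # "documents"
xq = np.random.rand(5, d).astype("float32")         # queries

# Exact NN: brute force, compares against every vector
exact = faiss.IndexFlatL2(d)
exact.add(xb)
D_exact, I_exact = exact.search(xq, 5)

# Approximate NN: HNSW graph, only explores a small neighborhood
ann = faiss.IndexHNSWFlat(d, 32)                    # 32 = edges per node
ann.add(xb)
D_ann, I_ann = ann.search(xq, 5)

# The ANN results usually overlap heavily with the exact ones,
# while touching far fewer vectors per query.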

One-liner to Remember

"Zoom, don't scan."


3. NSW Graph Building (Proximity Graph)

How the Graph Gets Built

Navigable Small World - Proximity Graph

Three steps:

  1. Compute distances between all document vectors
  2. Add one node to the graph for each document
  3. Connect each node to its k nearest neighbors

What is k?

k = number of connections (edges) each node has

🏘️ Analogy: 1 million houses in a city. Each house = 1 node.

  • k=5 means each house knows 5 neighbors
  • k=50 means each house knows 50 neighbors

k Trade-offs

| Small k               | Large k       |
|-----------------------|---------------|
| Faster search         | Slower search |
| Might miss good paths | More accurate |
| Less memory           | More memory   |
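
Here is a tiny NumPy/SciPy sketch of the three build steps above; this is the naive all-pairs construction, not what production libraries do internally, and the sizes and k are made up.

# Naive proximity-graph construction (illustrative sketch)
import numpy as np
from scipy.spatial.distance import cdist

docs = np.random.rand(1_000, 64)            # 1,000 document vectors
k = 5                                       # edges per node

# Step 1: compute distances between all document vectors
dists = cdist(docs, docs)
np.fill_diagonal(dists, np.inf)             # a node is not its own neighbor

# Steps 2-3: one node per document, connected to its k nearest neighbors
graph = {i: np.argsort(dists[i])[:k].tolist() for i in range(len(docs))}

print(graph[0])                             # the 5 neighbor ids of document 0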

4. Query Entry Point & Search Algorithm

Query Entry Point

You always start from a designated entry point node at the top layer. From there, greedily hop to whichever neighbor is closest to your query.

Search Algorithm

Key insight from diagram:

"The search may not find the closest possible vectors: the algorithm doesn't pick the optimal overall path, just the best hop at each moment."

This is why it's called Approximate Nearest Neighbor - good enough, fast enough!

The Search Process

Top layer:    Entry → hop → hop → (can't improve) → drop down
Middle layer: → hop → hop → (can't improve) → drop down
Bottom layer: → hop → hop → FOUND closest neighbors!
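
Continuing the proximity-graph sketch from the previous section, a single-layer version of this greedy hopping could look like the following; graph and docs are the objects built in that sketch, and the entry point is arbitrary.

# Greedy search over a proximity graph (single layer, illustrative sketch)
import numpy as np

def greedy_search(graph, docs, query_vec, entry=0):
    current = entry
    while True:
        # consider the current node and its neighbors, hop to whichever is closest
        candidates = [current] + graph[current]
        best = min(candidates, key=lambda i: np.linalg.norm(docs[i] - query_vec))
        if best == current:     # can't improve: stop (a local optimum, hence "approximate")
            return current
        current = best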

5. Vector Search Libraries & Databases

Most Common Algorithm

HNSW - used by nearly every production vector search system

Libraries

| Library | Creator | Notes                 |
|---------|---------|-----------------------|
| FAISS   | Meta    | Industry standard     |
| Annoy   | Spotify | Tree-based, simpler   |
| ScaNN   | Google  | Large-scale optimized |
| hnswlib | -       | Pure HNSW, very fast  |
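
As a quick taste, here is roughly what using hnswlib from the table above looks like; the data is random and the parameters (M, ef) are illustrative.

# Minimal hnswlib index build and query (illustrative sketch)
import numpy as np
import hnswlib

dim, n = 64, 10_000
data = np.random.rand(n, dim).astype("float32")

index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)   # M = edges per node
index.add_items(data, np.arange(n))

index.set_ef(50)                  # search breadth: higher = more accurate, slower
labels, distances = index.knn_query(data[:1], k=5)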

Vector Databases (HNSW under the hood)

| Database | Notes                            |
|----------|----------------------------------|
| Pinecone | Fully managed, production-ready  |
| ChromaDB | Lightweight, great for local dev |
| Weaviate | Open-source, feature-rich        |
| Milvus   | Open-source, highly scalable     |
| Qdrant   | Rust-based, fast                 |
| pgvector | PostgreSQL extension             |
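
For local experimentation, a minimal ChromaDB session looks roughly like this; the collection name and documents are made up.

# Minimal ChromaDB example (illustrative sketch)
import chromadb

client = chromadb.Client()                      # in-memory, good for local dev
collection = client.create_collection("notes")

collection.add(
    documents=["HNSW builds a layered proximity graph",
               "BM25 scores exact keyword matches"],
    ids=["doc1", "doc2"],
)

results = collection.query(query_texts=["graph-based vector search"], n_results=1)
print(results["documents"])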

6. Bi-Encoder

How It Works

Query  →  [Encoder]  →  Vector A
Doc    →  [Encoder]  →  Vector B
                          ↓
               Compare vectors (cosine similarity)

Query and document are encoded separately. The encoder never sees them together.
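
A minimal sketch of this with sentence-transformers; the model name is a common default and the sentences are illustrative.

# Bi-encoder: encode separately, compare vectors (illustrative sketch)
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query_vec = model.encode("I love you")                        # Vector A
doc_vecs = model.encode(["I hate you", "I really like you"])  # one Vector B per doc

scores = util.cos_sim(query_vec, doc_vecs)    # compare vectors (cosine similarity)
print(scores)   # doc vectors can be pre-computed once and stored in an index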

🎀 Sticky Analogy: Resume Screening

HR reads each resume independently and assigns a "fit score" to each.

  • Fast - can process thousands quickly
  • But never sees resume and job description together

Pros & Cons

| Pros                    | Cons                         |
|-------------------------|------------------------------|
| ⚡ Very fast             | Lower accuracy               |
| Pre-compute doc vectors | Misses nuanced relationships |
| Search millions of docs | Information compressed       |

7. Cross-Encoder

How It Works

[CLS] Query [SEP] Document [SEP]  →  Encoder  →  Relevance Score

Query and document go through the model together, so their words can attend to each other!

🎀 Sticky Analogy: Job Interview

The interviewer sees the candidate AND the job requirements at the same time.

  • Much more accurate - catches nuances
  • But slow - can only do one at a time

🧪 Alternative Analogy: Chemistry

  • Bi-encoder: Describe chemicals on paper, guess if they'll react
  • Cross-encoder: Put chemicals in same beaker, observe what happens

Why Cross-Encoder is Slow

|            | Bi-Encoder                       | Cross-Encoder    |
|------------|----------------------------------|------------------|
| Query time | 1 encoder call + 1M dot products | 1M encoder calls |
| Speed      | ~milliseconds                    | ~hours           |

You can't pre-compute - it needs query and doc together.

When to Use

Re-ranking only - top 50-100 docs from the bi-encoder.

# Cross-Encoder with rerankers library
from rerankers import Reranker

ranker = Reranker('cross-encoder')
results = ranker.rank(
    query="I love you", 
    docs=["I hate you", "I really like you"]
)

# Results:
# "I really like you": -2.468 (rank 1) ✓
# "I hate you": -4.150 (rank 2)

8. ColBERT

The Middle Ground

| Model         | Document Representation    |
|---------------|----------------------------|
| Bi-encoder    | 1 vector per doc           |
| ColBERT       | 1 vector per token         |
| Cross-encoder | No vectors (score directly)|

🍳 Sticky Analogy: Recipe Search

Bi-encoder: Recipe labeled "Italian comfort food"

ColBERT: Recipe keeps ingredients visible:

  • [tomato] [basil] [pasta] [garlic]

Query "tomato pasta" → direct matches on [tomato] and [pasta] tokens!

📱 Alternative Analogy: Dating App

Bi-encoder: One bio sentence - "I like outdoors"

ColBERT: Individual interests listed:

  • [hiking] [photography] [camping]

Search "hiking photography" → direct token matches!


9. MaxSim Score (ColBERT's Scoring Method)

MaxSim Score Matrix

How It Works

For each query token, find its maximum similarity with any doc token. Then sum.

Query: [I] [love] [you]
Doc:   [I] [hate] [you]

Similarity matrix:
         [I]  [hate] [you]
[I]      0.9   0.1    0.2  → max = 0.9
[love]   0.1   0.7    0.1  → max = 0.7  ← "love" matched "hate"!
[you]    0.2   0.1    0.9  → max = 0.9

MaxSim = 0.9 + 0.7 + 0.9 = 2.5
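
The same calculation in a few lines of NumPy, with the similarity numbers copied from the matrix above (in real ColBERT they come from per-token embeddings):

# MaxSim by hand (numbers from the example above)
import numpy as np

sim = np.array([
    [0.9, 0.1, 0.2],   # query token [I]    vs doc tokens [I] [hate] [you]
    [0.1, 0.7, 0.1],   # query token [love]
    [0.2, 0.1, 0.9],   # query token [you]
])

maxsim = sim.max(axis=1).sum()   # max over doc tokens, then sum over query tokens
print(maxsim)                    # 2.5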

ColBERT's Limitation

"love" and "hate" get similar scores because they're both intense emotion words used in similar contexts.

ColBERT matches tokens. Cross-encoder understands relationships.

# ColBERT with rerankers library
from rerankers import Reranker

colbert_ranker = Reranker('colbert')
results = colbert_ranker.rank(
    query="I love you", 
    docs=["I hate you", "I really like you"]
)

# Results:
# "I hate you": 1.753 (rank 1) ✗ Wrong!
# "I really like you": 1.703 (rank 2)

10. Comparison: Bi-Encoder vs Cross-Encoder vs ColBERT

🍽️ Restaurant Analogy

| Model         | Analogy                               |
|---------------|---------------------------------------|
| Bi-Encoder    | Match craving to menu descriptions    |
| ColBERT       | Match craving to ingredient lists     |
| Cross-Encoder | Chef serves dish knowing your craving |

Full Comparison Table

|                   | Bi-Encoder        | ColBERT               | Cross-Encoder |
|-------------------|-------------------|-----------------------|---------------|
| Speed             | ⚡ Fastest         | 🚗 Medium             | 🐢 Slowest    |
| Accuracy          | Good              | Better                | Best          |
| Pre-compute docs? | ✅ Yes (1 vec/doc) | ✅ Yes (1 vec/token)   | ❌ No         |
| Storage           | Low               | High                  | N/A           |
| Search 1M docs?   | ✅ Yes             | ✅ Yes                 | ❌ No         |
| Best for          | Retrieval         | Retrieval + precision | Re-ranking    |

When to Use Each

| Scenario                           | Best Choice                |
|------------------------------------|----------------------------|
| Search millions of docs fast       | Bi-Encoder                 |
| Short, dense text (tweets, titles) | ColBERT                    |
| Re-rank top 50-100 results         | Cross-Encoder              |
| E-commerce product search (1000s)  | ColBERT                    |
| Production RAG pipeline            | Bi-Encoder → Cross-Encoder |

11. HyDE (Hypothetical Document Embeddings)

The Problem

Questions and documents "sound" different in embedding space.

The Trick

  1. Ask LLM to generate a fake answer (without retrieval)
  2. Embed the fake answer
  3. Search with that embedding instead

🏠 Sticky Analogy: House Hunting

You have a vague wish: "I want something cozy with natural light."

Normal: Tell agent your wish → they struggle

HyDE: Describe ideal house first: "A 2-bedroom cottage with south-facing windows..." → Agent finds similar actual houses
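
A rough sketch of those three steps; llm_generate and vector_index are placeholders for whatever LLM client and vector store you already have, not real APIs.

# HyDE: search with the embedding of a hypothetical answer (illustrative sketch)
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def hyde_search(question, llm_generate, vector_index, k=10):
    # 1. Ask the LLM for a hypothetical answer (no retrieval yet)
    fake_answer = llm_generate(f"Write a short passage answering: {question}")
    # 2. Embed the fake answer instead of the question
    fake_vec = encoder.encode(fake_answer)
    # 3. Search the index with that embedding
    return vector_index.search(fake_vec, k)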


12. Production RAG Pipeline

🎯 Sticky Analogy: Intern + Professor

Intern (Retriever): Grabs 20 files mentioning "Project X" - some relevant, some not.

Professor (Reranker): Reads each carefully, keeps the best 5.

The Standard Two-Stage Architecture

User Query
    │
    ▼
Stage 1: RETRIEVAL (Cast wide net)
    ├── Vector Search (bi-encoder)
    └── Keyword Search (BM25)
    │
    ▼ Top 50-100 candidates
    │
Stage 2: RERANKING
    └── Cross-Encoder scores each doc
    │
    ▼ Top 5-10 most relevant
    │
Stage 3: LLM GENERATION
    └── Generate answer with context

Why Hybrid Retrieval?

  • Bi-encoder: Semantic similarity ("car" ≈ "automobile")
  • BM25: Exact keyword matches ("Toyota Camry 2024")

Combined = better recall!
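
Here is a rough sketch of the two-stage pipeline: hybrid retrieval with a bi-encoder plus BM25 (via the rank_bm25 package), then cross-encoder reranking with the rerankers library used earlier. The corpus is tiny and the naive score combination is for illustration only; real systems use proper rank fusion.

# Two-stage RAG retrieval + reranking (illustrative sketch)
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util
from rerankers import Reranker

docs = ["The Toyota Camry 2024 has a hybrid engine.",
        "Automobiles with electric motors are quiet.",
        "Pasta with tomato and basil."]

# Stage 1a: vector search (bi-encoder), doc vectors pre-computed once
encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = encoder.encode(docs)

# Stage 1b: keyword search (BM25)
bm25 = BM25Okapi([d.lower().split() for d in docs])

def retrieve(query, top_k=2):
    vec_scores = util.cos_sim(encoder.encode(query), doc_vecs)[0]
    kw_scores = bm25.get_scores(query.lower().split())
    # naive sum of the two signals, just for illustration
    combined = {i: float(vec_scores[i]) + float(kw_scores[i]) for i in range(len(docs))}
    best = sorted(combined, key=combined.get, reverse=True)[:top_k]
    return [docs[i] for i in best]

# Stage 2: cross-encoder re-ranks the candidates
ranker = Reranker("cross-encoder")
candidates = retrieve("quiet electric car")
reranked = ranker.rank(query="quiet electric car", docs=candidates)

# Stage 3: pass the top reranked docs to the LLM as context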


13. LLM-Based Scoring

Fine-tuned LLM takes query + document → outputs relevance score.

Slowest but most accurate - use when quality matters more than latency.


14. Open Source Rerankers

Popular Models

| Model           | Creator    | Speed     | Notes                |
|-----------------|------------|-----------|----------------------|
| ms-marco-MiniLM | Microsoft  | ⚡ Fastest | 22M params           |
| mxbai-rerank    | Mixedbread | 🚗 Medium | Default in rerankers |
| BGE-reranker    | BAAI       | 🚗 Medium | Most popular         |
| ColBERTv2       | -          | 🚗 Medium | Token-level          |

Organizations

  • BAAI = Beijing Academy of Artificial Intelligence
  • Mixedbread = German AI startup
  • Microsoft = MS-MARCO dataset + MiniLM architecture

# Switching models in rerankers library
from rerankers import Reranker

# Microsoft's fast model
ranker = Reranker('cross-encoder', model_name='cross-encoder/ms-marco-MiniLM-L-6-v2')

# BGE model (most popular)
ranker = Reranker('cross-encoder', model_name='BAAI/bge-reranker-base')

# Mixedbread (default)
ranker = Reranker('cross-encoder', model_name='mixedbread-ai/mxbai-rerank-base-v1')

# ColBERT
ranker = Reranker('colbert', model_name='colbert-ir/colbertv2.0')

# LLM-based
ranker = Reranker('rankgpt', model_name='gpt-4o-mini')

15. Handling Long Documents

The Constraint

Most models have a 512-token limit (~1-2 paragraphs). A 50-page doc is ~25,000 tokens.

The Solution: Chunking

50-page document
       ↓
   Chunking (~100 chunks)
       ↓
   Bi-encoder creates vector for each chunk
       ↓
    Query → retrieve top 50-100 chunks
       ↓
   Cross-encoder re-ranks those chunks
       ↓
    Top 10 chunks → LLM
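
A minimal fixed-width chunker with overlap (the simplest of the strategies listed in the next section); sizes here are in characters rather than tokens, purely for illustration.

# Fixed-width chunking with overlap (illustrative sketch)
def chunk_text(text, chunk_size=1000, overlap=200):
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap       # slide forward, keeping some overlap
    return chunks

sample = "word " * 3000          # stand-in for a long document
chunks = chunk_text(sample)
print(len(chunks), len(chunks[0]))   # roughly ~100 chunks for a 50-page doc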

16. Next Steps

Chunking Strategies

  • Fixed Width / Recursive Character Splitting: Good defaults
  • Semantic / LLM Chunking: Higher performance, more complex
  • Context-Aware Chunking: Good "first improvement" to explore

Query Enhancement

  • Query Parsing
  • NER (Named Entity Recognition) for query parsing
  • HyDE for better query-document matching

Advanced Ideas

  • Use ColBERT on subject lines, Bi-encoder on body (hybrid approach)
  • Rank fusion combining multiple retrieval methods

17. Real-World ColBERT Use Case

E-commerce Product Search (eBay, Amazon)

Product titles are short and every word matters:

  • "iPhone 15 Pro Max 256GB Blue Unlocked"
  • "Nike Air Jordan 1 Retro High OG Chicago"

Why ColBERT works:

  1. Titles are short - every token matters
  2. Exact terms matter - "256GB" must match "256GB"
  3. Need to re-rank 1000s of products quickly

Query: "Nike Jordan 1 Chicago size 10"
    ↓
Bi-encoder retrieves top 1000 products
    ↓
ColBERT re-ranks (fast, token-level)
    ↓
Top 50 shown to user

Quick Reference

One-Liners to Remember

| Concept       | One-liner                                    |
|---------------|----------------------------------------------|
| ANN           | "Zoom, don't scan"                           |
| HNSW          | "Multi-layer graph, start sparse, end dense" |
| Bi-encoder    | "Encode separately, compare vectors"         |
| Cross-encoder | "Encode together, understand relationship"   |
| ColBERT       | "Token-level matching with MaxSim"           |

When to Use What

| Task               | Solution           |
|--------------------|--------------------|
| Search 1M docs     | Bi-encoder + HNSW  |
| Re-rank top 50-100 | Cross-encoder      |
| Re-rank 500-1000   | ColBERT            |
| Short dense text   | ColBERT            |
| Maximum quality    | LLM-based reranker |
| Long documents     | Chunk first!       |

Speed Ranking

Bi-encoder ⚡ > ColBERT 🚗 > Cross-encoder 🐢 > LLM-based 🐌

Accuracy Ranking

LLM-based 🎯 > Cross-encoder > ColBERT > Bi-encoder