Vector Search and Reranking

2025-12-22
nlp, vector-search, rag, embeddings

Vector Search & Reranking: Complete Summary

A comprehensive guide covering ANN, HNSW, Bi-Encoder, Cross-Encoder, ColBERT, and production RAG pipelines with sticky analogies and practical examples.


1. Approximate Nearest Neighbor (ANN)

The Problem

Finding similar items in massive datasets (millions of vectors) is slow with brute force - comparing against every single item.

πŸ—ΊοΈ Sticky Analogy: The City Map

You're at Times Square and need to find the 5 closest coffee shops.

  • Brute force: Measure distance to every coffee shop in the city - takes hours.
  • ANN approach: Only check coffee shops in nearby neighborhoods - find 5 great options in minutes.

Key insight: Use the structure of space (neighborhoods) to narrow your search dramatically.

TL;DR

  • Exact NN: Check everything, guaranteed best, slow
  • Approximate NN: Check strategically, probably best, fast
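
To make the trade-off concrete, here is a minimal sketch comparing exact and approximate search with FAISS (one of the libraries covered later). The data is random and the index parameters are illustrative, not tuned.

# Exact vs. approximate nearest neighbor with FAISS (illustrative sketch)
import numpy as np
import faiss

d = 128                                             # vector dimension
xb = np.random.rand(10_000, d).astype("float32")    # "documents"
xq = np.random.rand(5, d).astype("float32")         # queries

# Exact NN: brute force, compares against every vector
exact = faiss.IndexFlatL2(d)
exact.add(xb)
D_exact, I_exact = exact.search(xq, 5)

# Approximate NN: HNSW graph, only explores a small neighborhood
ann = faiss.IndexHNSWFlat(d, 32)                    # 32 = edges per node
ann.add(xb)
D_ann, I_ann = ann.search(xq, 5)

# The ANN results usually overlap heavily with the exact ones,
# while touching far fewer vectors per query.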

One-liner to Remember

"Zoom, don't scan."


3. NSW Graph Building (Proximity Graph)

How the Graph Gets Built

Navigable Small World - Proximity Graph

Three steps:

  1. Compute distances between all document vectors
  2. Add one node to the graph for each document
  3. Connect each node to its k nearest neighbors

What is k?

k = number of connections (edges) each node has

🏘️ Analogy: 1 million houses in a city. Each house = 1 node.

  • k=5 means each house knows 5 neighbors
  • k=50 means each house knows 50 neighbors

k Trade-offs

| Small k               | Large k       |
|-----------------------|---------------|
| Faster search         | Slower search |
| Might miss good paths | More accurate |
| Less memory           | More memory   |
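
Here is a tiny NumPy/SciPy sketch of the three build steps above; this is the naive all-pairs construction, not what production libraries do internally, and the sizes and k are made up.

# Naive proximity-graph construction (illustrative sketch)
import numpy as np
from scipy.spatial.distance import cdist

docs = np.random.rand(1_000, 64)            # 1,000 document vectors
k = 5                                       # edges per node

# Step 1: compute distances between all document vectors
dists = cdist(docs, docs)
np.fill_diagonal(dists, np.inf)             # a node is not its own neighbor

# Steps 2-3: one node per document, connected to its k nearest neighbors
graph = {i: np.argsort(dists[i])[:k].tolist() for i in range(len(docs))}

print(graph[0])                             # the 5 neighbor ids of document 0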

4. Query Entry Point & Search Algorithm

Query Entry Point

You always start from a designated entry point node at the top layer. From there, greedily hop to whichever neighbor is closest to your query.

Search Algorithm

Key insight from diagram:

"The search may not find the closest possible vectors: the algorithm doesn't pick the optimal overall path, just the best hop at each moment."

This is why it's called Approximate Nearest Neighbor - good enough, fast enough!

The Search Process

Top layer:    Entry → hop → hop → (can't improve) → drop down
Middle layer: → hop → hop → (can't improve) → drop down
Bottom layer: → hop → hop → FOUND closest neighbors!
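
Continuing the proximity-graph sketch from the previous section, a single-layer version of this greedy hopping could look like the following; graph and docs are the objects built in that sketch, and the entry point is arbitrary.

# Greedy search over a proximity graph (single layer, illustrative sketch)
import numpy as np

def greedy_search(graph, docs, query_vec, entry=0):
    current = entry
    while True:
        # consider the current node and its neighbors, hop to whichever is closest
        candidates = [current] + graph[current]
        best = min(candidates, key=lambda i: np.linalg.norm(docs[i] - query_vec))
        if best == current:     # can't improve: stop (a local optimum, hence "approximate")
            return current
        current = best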

5. Vector Search Libraries & Databases

Most Common Algorithm

HNSW - used by nearly every production vector search system

Libraries

| Library | Creator | Notes                 |
|---------|---------|-----------------------|
| FAISS   | Meta    | Industry standard     |
| Annoy   | Spotify | Tree-based, simpler   |
| ScaNN   | Google  | Large-scale optimized |
| hnswlib | -       | Pure HNSW, very fast  |
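
As a quick taste, here is roughly what using hnswlib from the table above looks like; the data is random and the parameters (M, ef) are illustrative.

# Minimal hnswlib index build and query (illustrative sketch)
import numpy as np
import hnswlib

dim, n = 64, 10_000
data = np.random.rand(n, dim).astype("float32")

index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)   # M = edges per node
index.add_items(data, np.arange(n))

index.set_ef(50)                  # search breadth: higher = more accurate, slower
labels, distances = index.knn_query(data[:1], k=5)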

Vector Databases (HNSW under the hood)

| Database | Notes                            |
|----------|----------------------------------|
| Pinecone | Fully managed, production-ready  |
| ChromaDB | Lightweight, great for local dev |
| Weaviate | Open-source, feature-rich        |
| Milvus   | Open-source, highly scalable     |
| Qdrant   | Rust-based, fast                 |
| pgvector | PostgreSQL extension             |
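
For local experimentation, a minimal ChromaDB session looks roughly like this; the collection name and documents are made up.

# Minimal ChromaDB example (illustrative sketch)
import chromadb

client = chromadb.Client()                      # in-memory, good for local dev
collection = client.create_collection("notes")

collection.add(
    documents=["HNSW builds a layered proximity graph",
               "BM25 scores exact keyword matches"],
    ids=["doc1", "doc2"],
)

results = collection.query(query_texts=["graph-based vector search"], n_results=1)
print(results["documents"])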

6. Bi-Encoder

How It Works

Query  →  [Encoder]  →  Vector A
Doc    →  [Encoder]  →  Vector B
                          ↓
               Compare vectors (cosine similarity)

Query and document are encoded separately. The encoder never sees them together.
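
A minimal sketch of this with sentence-transformers; the model name is a common default and the sentences are illustrative.

# Bi-encoder: encode separately, compare vectors (illustrative sketch)
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query_vec = model.encode("I love you")                        # Vector A
doc_vecs = model.encode(["I hate you", "I really like you"])  # one Vector B per doc

scores = util.cos_sim(query_vec, doc_vecs)    # compare vectors (cosine similarity)
print(scores)   # doc vectors can be pre-computed once and stored in an index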

🎀 Sticky Analogy: Resume Screening

HR reads each resume independently and assigns a "fit score" to each.

  • Fast - can process thousands quickly
  • But never sees resume and job description together

Pros & Cons

| Pros                    | Cons                         |
|-------------------------|------------------------------|
| ⚡ Very fast             | Lower accuracy               |
| Pre-compute doc vectors | Misses nuanced relationships |
| Search millions of docs | Information compressed       |

7. Cross-Encoder

How It Works

[CLS] Query [SEP] Document [SEP]  →  Encoder  →  Relevance Score

Query and document go through the model together, so their words can attend to each other!

🎀 Sticky Analogy: Job Interview

The interviewer sees the candidate AND the job requirements at the same time.

  • Much more accurate - catches nuances
  • But slow - can only do one at a time

🧪 Alternative Analogy: Chemistry

  • Bi-encoder: Describe chemicals on paper, guess if they'll react
  • Cross-encoder: Put chemicals in same beaker, observe what happens

Why Cross-Encoder is Slow

|            | Bi-Encoder                       | Cross-Encoder    |
|------------|----------------------------------|------------------|
| Query time | 1 encoder call + 1M dot products | 1M encoder calls |
| Speed      | ~milliseconds                    | ~hours           |

You can't pre-compute - it needs query and doc together.

When to Use

Re-ranking only - top 50-100 docs from the bi-encoder.

# Cross-Encoder with rerankers library
from rerankers import Reranker

ranker = Reranker('cross-encoder')
results = ranker.rank(
    query="I love you", 
    docs=["I hate you", "I really like you"]
)

# Results:
# "I really like you": -2.468 (rank 1) ✓
# "I hate you": -4.150 (rank 2)

8. ColBERT

The Middle Ground

| Model         | Document Representation    |
|---------------|----------------------------|
| Bi-encoder    | 1 vector per doc           |
| ColBERT       | 1 vector per token         |
| Cross-encoder | No vectors (score directly)|

🍳 Sticky Analogy: Recipe Search

Bi-encoder: Recipe labeled "Italian comfort food"

ColBERT: Recipe keeps ingredients visible:

  • [tomato] [basil] [pasta] [garlic]

Query "tomato pasta" → direct matches on [tomato] and [pasta] tokens!

📱 Alternative Analogy: Dating App

Bi-encoder: One bio sentence - "I like outdoors"

ColBERT: Individual interests listed:

  • [hiking] [photography] [camping]

Search "hiking photography" → direct token matches!


9. MaxSim Score (ColBERT's Scoring Method)

MaxSim Score Matrix

How It Works

For each query token, find its maximum similarity with any doc token. Then sum.

Query: [I] [love] [you]
Doc:   [I] [hate] [you]

Similarity matrix:
         [I]  [hate] [you]
[I]      0.9   0.1    0.2  → max = 0.9
[love]   0.1   0.7    0.1  → max = 0.7  ← "love" matched "hate"!
[you]    0.2   0.1    0.9  → max = 0.9

MaxSim = 0.9 + 0.7 + 0.9 = 2.5
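
The same calculation in a few lines of NumPy, with the similarity numbers copied from the matrix above (in real ColBERT they come from per-token embeddings):

# MaxSim by hand (numbers from the example above)
import numpy as np

sim = np.array([
    [0.9, 0.1, 0.2],   # query token [I]    vs doc tokens [I] [hate] [you]
    [0.1, 0.7, 0.1],   # query token [love]
    [0.2, 0.1, 0.9],   # query token [you]
])

maxsim = sim.max(axis=1).sum()   # max over doc tokens, then sum over query tokens
print(maxsim)                    # 2.5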

ColBERT's Limitation

"love" and "hate" get similar scores because they're both intense emotion words used in similar contexts.

ColBERT matches tokens. Cross-encoder understands relationships.

# ColBERT with rerankers library
from rerankers import Reranker

colbert_ranker = Reranker('colbert')
results = colbert_ranker.rank(
    query="I love you", 
    docs=["I hate you", "I really like you"]
)

# Results:
# "I hate you": 1.753 (rank 1) ✗ Wrong!
# "I really like you": 1.703 (rank 2)

10. Comparison: Bi-Encoder vs Cross-Encoder vs ColBERT

🍽️ Restaurant Analogy

| Model         | Analogy                               |
|---------------|---------------------------------------|
| Bi-Encoder    | Match craving to menu descriptions    |
| ColBERT       | Match craving to ingredient lists     |
| Cross-Encoder | Chef serves dish knowing your craving |

Full Comparison Table

|                   | Bi-Encoder        | ColBERT               | Cross-Encoder |
|-------------------|-------------------|-----------------------|---------------|
| Speed             | ⚡ Fastest         | 🚗 Medium             | 🐢 Slowest    |
| Accuracy          | Good              | Better                | Best          |
| Pre-compute docs? | ✅ Yes (1 vec/doc) | ✅ Yes (1 vec/token)   | ❌ No         |
| Storage           | Low               | High                  | N/A           |
| Search 1M docs?   | ✅ Yes             | ✅ Yes                 | ❌ No         |
| Best for          | Retrieval         | Retrieval + precision | Re-ranking    |

When to Use Each

| Scenario                           | Best Choice                |
|------------------------------------|----------------------------|
| Search millions of docs fast       | Bi-Encoder                 |
| Short, dense text (tweets, titles) | ColBERT                    |
| Re-rank top 50-100 results         | Cross-Encoder              |
| E-commerce product search (1000s)  | ColBERT                    |
| Production RAG pipeline            | Bi-Encoder → Cross-Encoder |

11. HyDE (Hypothetical Document Embeddings)

The Problem

Questions and documents "sound" different in embedding space.

The Trick

  1. Ask LLM to generate a fake answer (without retrieval)
  2. Embed the fake answer
  3. Search with that embedding instead

🏠 Sticky Analogy: House Hunting

You have a vague wish: "I want something cozy with natural light."

Normal: Tell agent your wish → they struggle

HyDE: Describe ideal house first: "A 2-bedroom cottage with south-facing windows..." → Agent finds similar actual houses
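
A rough sketch of those three steps; llm_generate and vector_index are placeholders for whatever LLM client and vector store you already have, not real APIs.

# HyDE: search with the embedding of a hypothetical answer (illustrative sketch)
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def hyde_search(question, llm_generate, vector_index, k=10):
    # 1. Ask the LLM for a hypothetical answer (no retrieval yet)
    fake_answer = llm_generate(f"Write a short passage answering: {question}")
    # 2. Embed the fake answer instead of the question
    fake_vec = encoder.encode(fake_answer)
    # 3. Search the index with that embedding
    return vector_index.search(fake_vec, k)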


12. Production RAG Pipeline

🎯 Sticky Analogy: Intern + Professor

Intern (Retriever): Grabs 20 files mentioning "Project X" - some relevant, some not.

Professor (Reranker): Reads each carefully, keeps the best 5.

The Standard Two-Stage Architecture

User Query
    │
    ▼
Stage 1: RETRIEVAL (Cast wide net)
    ├── Vector Search (bi-encoder)
    └── Keyword Search (BM25)
    │
    ▼ Top 50-100 candidates
    │
Stage 2: RERANKING
    └── Cross-Encoder scores each doc
    │
    ▼ Top 5-10 most relevant
    │
Stage 3: LLM GENERATION
    └── Generate answer with context

Why Hybrid Retrieval?

  • Bi-encoder: Semantic similarity ("car" ≈ "automobile")
  • BM25: Exact keyword matches ("Toyota Camry 2024")

Combined = better recall!
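
Here is a rough sketch of the two-stage pipeline: hybrid retrieval with a bi-encoder plus BM25 (via the rank_bm25 package), then cross-encoder reranking with the rerankers library used earlier. The corpus is tiny and the naive score combination is for illustration only; real systems use proper rank fusion.

# Two-stage RAG retrieval + reranking (illustrative sketch)
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util
from rerankers import Reranker

docs = ["The Toyota Camry 2024 has a hybrid engine.",
        "Automobiles with electric motors are quiet.",
        "Pasta with tomato and basil."]

# Stage 1a: vector search (bi-encoder), doc vectors pre-computed once
encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = encoder.encode(docs)

# Stage 1b: keyword search (BM25)
bm25 = BM25Okapi([d.lower().split() for d in docs])

def retrieve(query, top_k=2):
    vec_scores = util.cos_sim(encoder.encode(query), doc_vecs)[0]
    kw_scores = bm25.get_scores(query.lower().split())
    # naive sum of the two signals, just for illustration
    combined = {i: float(vec_scores[i]) + float(kw_scores[i]) for i in range(len(docs))}
    best = sorted(combined, key=combined.get, reverse=True)[:top_k]
    return [docs[i] for i in best]

# Stage 2: cross-encoder re-ranks the candidates
ranker = Reranker("cross-encoder")
candidates = retrieve("quiet electric car")
reranked = ranker.rank(query="quiet electric car", docs=candidates)

# Stage 3: pass the top reranked docs to the LLM as context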


13. LLM-Based Scoring

Fine-tuned LLM takes query + document → outputs relevance score.

Slowest but most accurate - use when quality matters more than latency.


14. Open Source Rerankers

Popular Models

| Model           | Creator    | Speed     | Notes                |
|-----------------|------------|-----------|----------------------|
| ms-marco-MiniLM | Microsoft  | ⚡ Fastest | 22M params           |
| mxbai-rerank    | Mixedbread | 🚗 Medium | Default in rerankers |
| BGE-reranker    | BAAI       | 🚗 Medium | Most popular         |
| ColBERTv2       | -          | 🚗 Medium | Token-level          |

Organizations

  • BAAI = Beijing Academy of Artificial Intelligence
  • Mixedbread = German AI startup
  • Microsoft = MS-MARCO dataset + MiniLM architecture

# Switching models in rerankers library
from rerankers import Reranker

# Microsoft's fast model
ranker = Reranker('cross-encoder', model_name='cross-encoder/ms-marco-MiniLM-L-6-v2')

# BGE model (most popular)
ranker = Reranker('cross-encoder', model_name='BAAI/bge-reranker-base')

# Mixedbread (default)
ranker = Reranker('cross-encoder', model_name='mixedbread-ai/mxbai-rerank-base-v1')

# ColBERT
ranker = Reranker('colbert', model_name='colbert-ir/colbertv2.0')

# LLM-based
ranker = Reranker('rankgpt', model_name='gpt-4o-mini')

15. Handling Long Documents

The Constraint

Most models have a 512-token limit (~1-2 paragraphs). A 50-page doc is ~25,000 tokens.

The Solution: Chunking

50-page document
       ↓
   Chunking (~100 chunks)
       ↓
   Bi-encoder creates vector for each chunk
       ↓
    Query → retrieve top 50-100 chunks
       ↓
   Cross-encoder re-ranks those chunks
       ↓
    Top 10 chunks → LLM
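
A minimal fixed-width chunker with overlap (the simplest of the strategies listed in the next section); sizes here are in characters rather than tokens, purely for illustration.

# Fixed-width chunking with overlap (illustrative sketch)
def chunk_text(text, chunk_size=1000, overlap=200):
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap       # slide forward, keeping some overlap
    return chunks

sample = "word " * 3000          # stand-in for a long document
chunks = chunk_text(sample)
print(len(chunks), len(chunks[0]))   # roughly ~100 chunks for a 50-page doc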

16. Next Steps

Chunking Strategies

  • Fixed Width / Recursive Character Splitting: Good defaults
  • Semantic / LLM Chunking: Higher performance, more complex
  • Context-Aware Chunking: Good "first improvement" to explore

Query Enhancement

  • Query Parsing
  • NER (Named Entity Recognition) for query parsing
  • HyDE for better query-document matching

Advanced Ideas

  • Use ColBERT on subject lines, Bi-encoder on body (hybrid approach)
  • Rank fusion combining multiple retrieval methods

17. Real-World ColBERT Use Case

E-commerce Product Search (eBay, Amazon)

Product titles are short and every word matters:

  • "iPhone 15 Pro Max 256GB Blue Unlocked"
  • "Nike Air Jordan 1 Retro High OG Chicago"

Why ColBERT works:

  1. Titles are short - every token matters
  2. Exact terms matter - "256GB" must match "256GB"
  3. Need to re-rank 1000s of products quickly

Query: "Nike Jordan 1 Chicago size 10"
    ↓
Bi-encoder retrieves top 1000 products
    ↓
ColBERT re-ranks (fast, token-level)
    ↓
Top 50 shown to user

Quick Reference

One-Liners to Remember

| Concept       | One-liner                                    |
|---------------|----------------------------------------------|
| ANN           | "Zoom, don't scan"                           |
| HNSW          | "Multi-layer graph, start sparse, end dense" |
| Bi-encoder    | "Encode separately, compare vectors"         |
| Cross-encoder | "Encode together, understand relationship"   |
| ColBERT       | "Token-level matching with MaxSim"           |

When to Use What

| Task               | Solution           |
|--------------------|--------------------|
| Search 1M docs     | Bi-encoder + HNSW  |
| Re-rank top 50-100 | Cross-encoder      |
| Re-rank 500-1000   | ColBERT            |
| Short dense text   | ColBERT            |
| Maximum quality    | LLM-based reranker |
| Long documents     | Chunk first!       |

Speed Ranking

Bi-encoder ⚡ > ColBERT 🚗 > Cross-encoder 🐢 > LLM-based 🐌

Accuracy Ranking

LLM-based 🎯 > Cross-encoder > ColBERT > Bi-encoder