Understanding Retrieval Metrics for RAG Systems

2025-12-21
Tags: RAG, ML, Evaluation Metrics

RAG, Retrieval & Evaluation Metrics: Complete Study Notes

A comprehensive guide covering contrastive learning, embeddings, hybrid search fusion, and retrieval evaluation metrics.

1. How Do We Know What's Similar? (No Manual Labels Needed!)

The clever trick is self-supervision: similarity pairs are created automatically from the structure of the data:

| Strategy | How it works |
|---|---|
| Data augmentation (SimCSE) | Same sentence + different dropout = positive pair |
| Nearby context (Word2Vec, BERT) | Words/sentences close in a document = similar |
| Back-translation | Original + translated-back version = positive pair |
| Natural pairs | Question-answer, title-body, query-clicked result |
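
To make these strategies concrete, here is a minimal NumPy sketch of the contrastive (InfoNCE) objective that such positive pairs typically feed into; the function name, toy batch, and temperature value are illustrative assumptions, not taken from any specific library.

```python
import numpy as np

def info_nce_loss(anchors: np.ndarray, positives: np.ndarray, temperature: float = 0.05) -> float:
    """Contrastive (InfoNCE) loss with in-batch negatives.

    anchors[i] and positives[i] form a positive pair (e.g. the same sentence
    encoded twice with different dropout, as in SimCSE); every other row in
    the batch serves as a negative for row i.
    """
    # Cosine similarities between every anchor and every positive, scaled by temperature
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sims = (a @ p.T) / temperature  # shape: (batch, batch)

    # Row-wise log-softmax; the "correct class" for row i is column i
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Toy batch: each positive is a slightly perturbed copy of its anchor
rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
positives = anchors + 0.01 * rng.normal(size=(4, 8))
print(f"Loss when pairs line up: {info_nce_loss(anchors, positives):.4f}")
```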

How Off-the-Shelf Embeddings (OpenAI, etc.) Work So Well

They are trained on massive amounts of naturally occurring paired data:

  • Web pages and their titles
  • Reddit posts and top comments
  • Wikipedia sections and headings
  • Forum questions and accepted answers

Key Insight: Humans already created millions of implicit pairs by how they structured content online; no manual labeling needed!


2. Reciprocal Rank Fusion (RRF)

The Problem

How do you combine rankings from multiple search systems (e.g., semantic search + keyword search)?

The Formula

$$\text{RRF}(d) = \sum_{\text{each ranker}} \frac{1}{k + \text{rank}(d)}$$

Where k is a constant (usually 60) and rank(d) is the document's 1-based position in a given ranker's list.
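
As a sketch, the formula maps directly onto a few lines of Python; rrf_fuse and the document IDs below are illustrative, not from any particular search library.

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Combine several ranked lists with Reciprocal Rank Fusion.

    rankings: each inner list is one ranker's output, best document first.
    Returns (doc_id, score) pairs sorted by descending RRF score.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranked_list in rankings:
        for rank, doc_id in enumerate(ranked_list, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Semantic ranker vs. keyword ranker over the same corpus
fused = rrf_fuse([["A", "C", "B"], ["B", "C", "D", "E", "A"]])
print(fused)  # B and C edge out A because they place well in both lists
```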

Sticky Analogy: The Restaurant Recommendations 🍽️

Imagine asking multiple friends for restaurant recommendations. Instead of just counting votes, give more credit to restaurants ranked higher; still, being consistently "pretty good" across all friends beats being one friend's favorite that everyone else ignores.

Numerical Example

Two search systems ranking documents (k=60):

| Document | Semantic Rank | Keyword Rank | RRF Score |
|---|---|---|---|
| Doc A | 1 | 5 | 1/61 + 1/65 = 0.03178 |
| Doc B | 3 | 1 | 1/63 + 1/61 = 0.03227 |
| Doc C | 2 | 2 | 1/62 + 1/62 = 0.03226 |

Final RRF Ranking: B → C → A

Key insight: Doc C (ranked 2nd, 2nd) beats Doc A (ranked 1st, 5th) because RRF rewards consistency.
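
A quick check of these numbers in Python (the rank dictionary simply mirrors the table above):

```python
# Ranks taken from the table above, with k = 60
ranks = {"Doc A": (1, 5), "Doc B": (3, 1), "Doc C": (2, 2)}
k = 60

scores = {doc: 1 / (k + semantic) + 1 / (k + keyword)
          for doc, (semantic, keyword) in ranks.items()}
for doc, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{doc}: {score:.5f}")
# Doc B: 0.03227
# Doc C: 0.03226
# Doc A: 0.03178
```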

The k Parameter: The "Patience" Knob 🎛️

| k value | Behavior | Analogy |
|---|---|---|
| Low k (e.g., 1) | Top ranks dominate heavily | Impatient judge: only looks at gold medalists |
| High k (e.g., 60) | Differences compressed, rewards consistency | Patient judge: "the top 10 are all pretty good" |

Numerical Proof (from our exploration):

```python
# Without k (k=0): gap between rank 1 and rank 10
gap_no_k = (1 / 1) - (1 / 10)
print(f"Without k: gap = {gap_no_k}")

# With k=50: gap between rank 1 and rank 10
gap_with_k = (1 / 51) - (1 / 60)
print(f"With k=50: gap = {gap_with_k:.6f}")

# Compression ratio
print(f"Gap shrinks to {gap_with_k / gap_no_k:.2%} of original!")
```

Output:

```
Without k: gap = 0.9
With k=50: gap = 0.002941
Gap shrinks to 0.33% of original!
```

Insight: Adding k compresses the differences between ranks, making #1 vs #10 feel almost the same instead of 10x better.

The β (Beta) Parameter: Weighting Rankers

When you want to trust one ranker more than another:

RRF(d) = β × 1/(k + semantic_rank) + (1 - β) × 1/(k + keyword_rank)

| β value | Effect |
|---|---|
| β = 1.0 | Pure semantic search |
| β = 0.8 | 80% semantic, 20% keyword |
| β = 0.5 | Equal weight |
| β = 0.0 | Pure keyword search |
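
A minimal sketch of this weighted variant, assuming the two ranks for a document are already known; weighted_rrf is an illustrative helper name, and the example ranks are Doc A's from earlier.

```python
def weighted_rrf(semantic_rank: int, keyword_rank: int, beta: float = 0.5, k: int = 60) -> float:
    """Weighted RRF score for one document, given its 1-based rank in each list.

    beta = 1.0 trusts only the semantic ranker; beta = 0.0 trusts only the keyword ranker.
    """
    return beta / (k + semantic_rank) + (1 - beta) / (k + keyword_rank)

# Doc A from the earlier example: semantic rank 1, keyword rank 5
for beta in (1.0, 0.8, 0.5, 0.0):
    print(f"beta = {beta}: score = {weighted_rrf(1, 5, beta):.5f}")
```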

RRF Protects Against Bad Rankers

Panel of Judges Analogy: If 4 judges are fair and 1 is bribed to push Contestant Z, that one loud wrong voice loses to four quieter right voices. The irrelevant item gets boosted by one ranker, but can't overcome consensus from others.
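
A small simulation of that panel, assuming five judges ranking three contestants; the helper below is a stripped-down restatement of the RRF sum, not a library function.

```python
from collections import defaultdict

def rrf_scores(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """RRF score per item across several ranked lists (best item first)."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in rankings:
        for rank, item in enumerate(ranked, start=1):
            scores[item] += 1 / (k + rank)
    return dict(scores)

# Four fair judges rank X first; one bribed judge pushes Z to the top.
rankings = [["X", "Y", "Z"]] * 4 + [["Z", "X", "Y"]]
for item, score in sorted(rrf_scores(rankings).items(), key=lambda kv: kv[1], reverse=True):
    print(f"{item}: {score:.4f}")
# X: 0.0817  <- four quiet rank-1 votes win
# Y: 0.0804
# Z: 0.0799  <- one loud rank-1 vote is not enough
```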

Benefits of RRF:

  • Simple: no training needed
  • Robust to bad rankers
  • Works well for hybrid search
  • Just combine your lists and go!

3. Precision vs Recall Tradeoff

The Fundamental Tradeoff

| Action | Precision | Recall |
|---|---|---|
| Retrieve fewer docs | ✅ High | ❌ Low (miss relevant items) |
| Retrieve more docs | ❌ Low (more noise) | ✅ High |

Example (10 relevant documents exist in the corpus):

| Retrieval | Precision | Recall |
|---|---|---|
| 12 docs retrieved | 66.7% (8/12) | 80% (8/10) |
| 15 docs retrieved | 60% (9/15) | 90% (9/10) |
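
A sketch of the arithmetic behind that table, assuming 10 relevant documents exist in total (as the recall denominators imply):

```python
def precision_recall(relevant_retrieved: int, retrieved: int, total_relevant: int) -> tuple[float, float]:
    """Precision = relevant retrieved / retrieved; recall = relevant retrieved / all relevant."""
    return relevant_retrieved / retrieved, relevant_retrieved / total_relevant

# (relevant found, total retrieved) for the two retrieval sizes above
for hits, retrieved in [(8, 12), (9, 15)]:
    p, r = precision_recall(hits, retrieved, total_relevant=10)
    print(f"{retrieved} docs retrieved: precision = {p:.1%}, recall = {r:.1%}")
# 12 docs retrieved: precision = 66.7%, recall = 80.0%
# 15 docs retrieved: precision = 60.0%, recall = 90.0%
```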

How RRF helps: retrieve more from each ranker to boost recall, and let fusion act as a quality filter. Only docs appearing in multiple rankings score high, so precision holds up even with larger retrieval sets.


4. Search Evaluation Metrics

Overview Table

| Metric | What it measures | Formula | Sticky Analogy |
|---|---|---|---|
| MRR | Position of first correct answer | avg(1/rank) | Waiter bringing your order on the 1st vs 3rd try |
| Precision@K | Quality of top K results | relevant in top K / K | – |
| Recall@K | Coverage of relevant items | relevant found in top K / total relevant | – |
| MAP | Ranking quality of all relevant items | avg precision at each relevant doc | 🛒 Shopping list efficiency |

Mean Reciprocal Rank (MRR)

Formula: Average of 1/rank for the first correct answer across queries.

| First relevant at rank | Score |
|---|---|
| 1 | 1.0 |
| 2 | 0.5 |
| 4 | 0.25 |

Analogy: Rating restaurants by how quickly the waiter brings what you actually ordered. First try = perfect. Third try = frustrating.

Example calculation:

  • Query 1: first relevant at rank 2 → 1/2 = 0.5
  • Query 2: first relevant at rank 1 → 1/1 = 1.0
  • Query 3: first relevant at rank 5 → 1/5 = 0.2

MRR = (0.5 + 1.0 + 0.2) / 3 = 0.567
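
The same calculation as a small helper; mean_reciprocal_rank is an illustrative name, not a library function.

```python
def mean_reciprocal_rank(first_relevant_ranks: list[int]) -> float:
    """MRR across queries, given the 1-based rank of the first relevant result per query."""
    return sum(1 / rank for rank in first_relevant_ranks) / len(first_relevant_ranks)

# Ranks of the first relevant document for the three example queries
print(f"MRR = {mean_reciprocal_rank([2, 1, 5]):.3f}")  # MRR = 0.567
```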

Mean Average Precision (MAP)

Formula: $$AP = \frac{1}{R} \sum_{k: \text{doc}_k \text{ is relevant}} \text{Precision}@k$$

where R is the total number of relevant documents for the query.

Sticky Analogy: The Shopping Trip 🛒

"How little did I wander before finding what I needed?"

You're in a supermarket with a list of 3 items. Each time you find a list item, ask yourself: "How efficient have I been so far?" (list items found ÷ total items grabbed). Average those efficiency checks to get the AP for this trip; MAP is that average taken across many trips (queries).

Example:

  • Find list item at pick 1: 1/1 = 100%
  • Find list item at pick 4: 2/4 = 50%
  • Find list item at pick 5: 3/5 = 60%

AP = (100% + 50% + 60%) / 3 = 70%

Key insight: MAP rewards finding relevant items early; less wandering = higher score.
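
A sketch of AP following the formula above; the relevance list encodes the shopping trip (list items found at picks 1, 4, and 5), and average_precision is an illustrative name.

```python
def average_precision(relevance: list[bool], total_relevant: int) -> float:
    """Average precision for one ranked list.

    relevance[i] is True if the item at position i (0-based) is relevant.
    Precision is recorded each time a relevant item appears, then the sum is
    divided by R, the total number of relevant items (matching the formula above).
    """
    hits = 0
    precision_sum = 0.0
    for position, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / total_relevant

# Shopping trip: list items found at picks 1, 4, and 5 out of 5 grabs
print(f"AP = {average_precision([True, False, False, True, True], total_relevant=3):.2f}")  # AP = 0.70
```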


5. Choosing the Right Metric for Your Use Case

Decision Guide

| Your Priority | Metric to Use | Example Use Case |
|---|---|---|
| First answer must be right | MRR | Customer support chatbot |
| Don't miss any relevant docs | Recall@K | Legal research |
| Minimize noise/irrelevant results | Precision@K | E-commerce filters |
| Rank good stuff early | MAP | Search results browsing |

Key insight: There's no universally "best" metric; the right one depends on whether your users value speed, completeness, or cleanliness of results.

Interpreting Metric Changes

Scenario: After tuning RRF, MRR went up but Recall dropped.

  • Good if: Building a chatbot (first answer matters most)
  • Bad if: Building legal search (need ALL relevant docs)

Scenario: High Recall@20 but low MAP

  • Diagnosis: Finding relevant docs, but ranking them poorly (appearing late instead of early)

6. Quick Reference

RRF Cheat Sheet

RRF(d) = Σ 1/(k + rank)
  • k=60 (default): Balanced, rewards consistency
  • Low k: Trust top picks strongly
  • High k: Compress rank differences
  • β: Weight between rankers (0-1)

Metrics Cheat Sheet

| Metric | One-liner |
|---|---|
| MRR | "How far do I scroll to find the first right answer?" |
| Precision@K | "How much of my top K is good?" |
| Recall@K | "How much of the good stuff did I find?" |
| MAP | "How little did I wander before finding what I needed?" |

All Metrics Require Ground Truth

You need to know which documents are actually relevant to evaluate any of these metrics!


7. Quick Answers

Q: Explain contrastive learning in one sentence.

Train a model to pull similar items close and push different items apart in vector space, using natural pairs from data structure instead of manual labels.

Q: What is RRF and when would you use it?

A simple way to combine rankings from multiple search systems by summing 1/(k+rank). Use it for hybrid search (semantic + keyword); no training needed, robust to bad rankers.

Q: What's the difference between Precision and Recall?

Precision = "of what I returned, how much is good?" Recall = "of all the good stuff, how much did I find?" Tradeoff: more results → higher recall, lower precision.

Q: When would you use MRR vs MAP?

MRR when only the first result matters (chatbots). MAP when you care about ranking quality across all relevant items (search browsing).

Q: How does the k parameter in RRF work?

It's a "patience" knob. Low k = trust top picks strongly. High k = compress differences, reward consistency across rankers.