Understanding Retrieval Metrics for RAG Systems

2025-12-21
Tags: RAG, ML, Evaluation Metrics

RAG, Retrieval & Evaluation Metrics: Complete Study Notes

A comprehensive guide covering contrastive learning, embeddings, hybrid search fusion, and retrieval evaluation metrics.

1. How Do We Know What's Similar? (No Manual Labels Needed!)

The clever trick is self-supervision: similarity pairs are created automatically from the structure of the data:

| Strategy | How it works |
|---|---|
| Data augmentation (SimCSE) | Same sentence + different dropout = positive pair |
| Nearby context (Word2Vec, BERT) | Words/sentences close in a document = similar |
| Back-translation | Original + translated-back version = positive pair |
| Natural pairs | Question-answer, title-body, query-clicked result |
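
To make these strategies concrete, here is a minimal NumPy sketch of the contrastive (InfoNCE) objective that such positive pairs typically feed into; the function name, toy batch, and temperature value are illustrative assumptions, not taken from any specific library.

```python
import numpy as np

def info_nce_loss(anchors: np.ndarray, positives: np.ndarray, temperature: float = 0.05) -> float:
    """Contrastive (InfoNCE) loss with in-batch negatives.

    anchors[i] and positives[i] form a positive pair (e.g. the same sentence
    encoded twice with different dropout, as in SimCSE); every other row in
    the batch serves as a negative for row i.
    """
    # Cosine similarities between every anchor and every positive, scaled by temperature
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sims = (a @ p.T) / temperature  # shape: (batch, batch)

    # Row-wise log-softmax; the "correct class" for row i is column i
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Toy batch: each positive is a slightly perturbed copy of its anchor
rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
positives = anchors + 0.01 * rng.normal(size=(4, 8))
print(f"Loss when pairs line up: {info_nce_loss(anchors, positives):.4f}")
```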

How Off-the-Shelf Embeddings (OpenAI, etc.) Work So Well

They are trained on massive amounts of naturally occurring paired data:

  • Web pages and their titles
  • Reddit posts and top comments
  • Wikipedia sections and headings
  • Forum questions and accepted answers

Key Insight: Humans already created millions of implicit pairs by how they structured content online; no manual labeling needed!


2. Reciprocal Rank Fusion (RRF)

The Problem

How do you combine rankings from multiple search systems (e.g., semantic search + keyword search)?

The Formula

$$\text{RRF}(d) = \sum_{\text{each ranker}} \frac{1}{k + \text{rank}(d)}$$

Where k is a constant (usually 60) and rank(d) is the document's 1-based position in a given ranker's list.
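
As a sketch, the formula maps directly onto a few lines of Python; rrf_fuse and the document IDs below are illustrative, not from any particular search library.

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Combine several ranked lists with Reciprocal Rank Fusion.

    rankings: each inner list is one ranker's output, best document first.
    Returns (doc_id, score) pairs sorted by descending RRF score.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranked_list in rankings:
        for rank, doc_id in enumerate(ranked_list, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Semantic ranker vs. keyword ranker over the same corpus
fused = rrf_fuse([["A", "C", "B"], ["B", "C", "D", "E", "A"]])
print(fused)  # B and C edge out A because they place well in both lists
```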

Sticky Analogy: The Restaurant Recommendations 🍽️

Imagine asking multiple friends for restaurant recommendations. Instead of just counting votes, give more credit to restaurants ranked higher; still, being consistently "pretty good" across all friends beats being one friend's favorite that everyone else ignores.

Numerical Example

Two search systems ranking documents (k=60):

| Document | Semantic Rank | Keyword Rank | RRF Score |
|---|---|---|---|
| Doc A | 1 | 5 | 1/61 + 1/65 = 0.03178 |
| Doc B | 3 | 1 | 1/63 + 1/61 = 0.03227 |
| Doc C | 2 | 2 | 1/62 + 1/62 = 0.03226 |

Final RRF Ranking: B → C → A

Key insight: Doc C (ranked 2nd, 2nd) beats Doc A (ranked 1st, 5th) because RRF rewards consistency.
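
A quick check of these numbers in Python (the rank dictionary simply mirrors the table above):

```python
# Ranks taken from the table above, with k = 60
ranks = {"Doc A": (1, 5), "Doc B": (3, 1), "Doc C": (2, 2)}
k = 60

scores = {doc: 1 / (k + semantic) + 1 / (k + keyword)
          for doc, (semantic, keyword) in ranks.items()}
for doc, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{doc}: {score:.5f}")
# Doc B: 0.03227
# Doc C: 0.03226
# Doc A: 0.03178
```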

The k Parameter: The "Patience" Knob 🎛️

| k value | Behavior | Analogy |
|---|---|---|
| Low k (e.g., 1) | Top ranks dominate heavily | Impatient judge: only looks at gold medalists |
| High k (e.g., 60) | Differences compressed, rewards consistency | Patient judge: "the top 10 are all pretty good" |

Numerical Proof (from our exploration):

```python
# Without k (k=0): gap between rank 1 and rank 10
gap_no_k = (1 / 1) - (1 / 10)
print(f"Without k: gap = {gap_no_k}")

# With k=50: gap between rank 1 and rank 10
gap_with_k = (1 / 51) - (1 / 60)
print(f"With k=50: gap = {gap_with_k:.6f}")

# Compression ratio
print(f"Gap shrinks to {gap_with_k / gap_no_k:.2%} of original!")
```

Output:

```
Without k: gap = 0.9
With k=50: gap = 0.002941
Gap shrinks to 0.33% of original!
```

Insight: Adding k compresses the differences between ranks, making #1 vs #10 feel almost the same instead of 10x better.

The β (Beta) Parameter: Weighting Rankers

When you want to trust one ranker more than another:

RRF(d) = β × 1/(k + semantic_rank) + (1 - β) × 1/(k + keyword_rank)

| β value | Effect |
|---|---|
| β = 1.0 | Pure semantic search |
| β = 0.8 | 80% semantic, 20% keyword |
| β = 0.5 | Equal weight |
| β = 0.0 | Pure keyword search |
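
A minimal sketch of this weighted variant, assuming the two ranks for a document are already known; weighted_rrf is an illustrative helper name, and the example ranks are Doc A's from earlier.

```python
def weighted_rrf(semantic_rank: int, keyword_rank: int, beta: float = 0.5, k: int = 60) -> float:
    """Weighted RRF score for one document, given its 1-based rank in each list.

    beta = 1.0 trusts only the semantic ranker; beta = 0.0 trusts only the keyword ranker.
    """
    return beta / (k + semantic_rank) + (1 - beta) / (k + keyword_rank)

# Doc A from the earlier example: semantic rank 1, keyword rank 5
for beta in (1.0, 0.8, 0.5, 0.0):
    print(f"beta = {beta}: score = {weighted_rrf(1, 5, beta):.5f}")
```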

RRF Protects Against Bad Rankers

Panel of Judges Analogy: If 4 judges are fair and 1 is bribed to push Contestant Z, that one loud wrong voice loses to four quieter right voices. The irrelevant item gets boosted by one ranker, but can't overcome consensus from others.
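
A small simulation of that panel, assuming five judges ranking three contestants; the helper below is a stripped-down restatement of the RRF sum, not a library function.

```python
from collections import defaultdict

def rrf_scores(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """RRF score per item across several ranked lists (best item first)."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in rankings:
        for rank, item in enumerate(ranked, start=1):
            scores[item] += 1 / (k + rank)
    return dict(scores)

# Four fair judges rank X first; one bribed judge pushes Z to the top.
rankings = [["X", "Y", "Z"]] * 4 + [["Z", "X", "Y"]]
for item, score in sorted(rrf_scores(rankings).items(), key=lambda kv: kv[1], reverse=True):
    print(f"{item}: {score:.4f}")
# X: 0.0817  <- four quiet rank-1 votes win
# Y: 0.0804
# Z: 0.0799  <- one loud rank-1 vote is not enough
```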

Benefits of RRF:

  • Simple: no training needed
  • Robust to bad rankers
  • Works well for hybrid search
  • Just combine your lists and go!

3. Precision vs Recall Tradeoff

The Fundamental Tradeoff

| Action | Precision | Recall |
|---|---|---|
| Retrieve fewer docs | ✅ High | ❌ Low (miss relevant items) |
| Retrieve more docs | ❌ Low (more noise) | ✅ High |

Example (10 relevant documents exist in the corpus):

| Retrieval | Precision | Recall |
|---|---|---|
| 12 docs retrieved | 66.7% (8/12) | 80% (8/10) |
| 15 docs retrieved | 60% (9/15) | 90% (9/10) |
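
A sketch of the arithmetic behind that table, assuming 10 relevant documents exist in total (as the recall denominators imply):

```python
def precision_recall(relevant_retrieved: int, retrieved: int, total_relevant: int) -> tuple[float, float]:
    """Precision = relevant retrieved / retrieved; recall = relevant retrieved / all relevant."""
    return relevant_retrieved / retrieved, relevant_retrieved / total_relevant

# (relevant found, total retrieved) for the two retrieval sizes above
for hits, retrieved in [(8, 12), (9, 15)]:
    p, r = precision_recall(hits, retrieved, total_relevant=10)
    print(f"{retrieved} docs retrieved: precision = {p:.1%}, recall = {r:.1%}")
# 12 docs retrieved: precision = 66.7%, recall = 80.0%
# 15 docs retrieved: precision = 60.0%, recall = 90.0%
```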

How RRF helps: retrieve more from each ranker to boost recall, and let fusion act as a quality filter. Only docs appearing in multiple rankings score high, so precision holds up even with larger retrieval sets.


4. Search Evaluation Metrics

Overview Table

| Metric | What it measures | Formula | Sticky Analogy |
|---|---|---|---|
| MRR | Position of first correct answer | avg(1/rank) | Waiter bringing your order on the 1st vs 3rd try |
| Precision@K | Quality of top K results | relevant in top K / K | – |
| Recall@K | Coverage of relevant items | relevant found in top K / total relevant | – |
| MAP | Ranking quality of all relevant items | avg precision at each relevant doc | 🛒 Shopping list efficiency |

Mean Reciprocal Rank (MRR)

Formula: Average of 1/rank for the first correct answer across queries.

| First relevant at rank | Score |
|---|---|
| 1 | 1.0 |
| 2 | 0.5 |
| 4 | 0.25 |

Analogy: Rating restaurants by how quickly the waiter brings what you actually ordered. First try = perfect. Third try = frustrating.

Example calculation:

  • Query 1: first relevant at rank 2 → 1/2 = 0.5
  • Query 2: first relevant at rank 1 → 1/1 = 1.0
  • Query 3: first relevant at rank 5 → 1/5 = 0.2

MRR = (0.5 + 1.0 + 0.2) / 3 = 0.567
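
The same calculation as a small helper; mean_reciprocal_rank is an illustrative name, not a library function.

```python
def mean_reciprocal_rank(first_relevant_ranks: list[int]) -> float:
    """MRR across queries, given the 1-based rank of the first relevant result per query."""
    return sum(1 / rank for rank in first_relevant_ranks) / len(first_relevant_ranks)

# Ranks of the first relevant document for the three example queries
print(f"MRR = {mean_reciprocal_rank([2, 1, 5]):.3f}")  # MRR = 0.567
```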

Mean Average Precision (MAP)

Formula: $$AP = \frac{1}{R} \sum_{k: \text{doc}_k \text{ is relevant}} \text{Precision}@k$$

where R is the total number of relevant documents for the query.

Sticky Analogy: The Shopping Trip 🛒

"How little did I wander before finding what I needed?"

You're in a supermarket with a list of 3 items. Each time you find a list item, ask yourself: "How efficient have I been so far?" (list items found ÷ total items grabbed). Average those efficiency checks to get the AP for this trip; MAP is that average taken across many trips (queries).

Example:

  • Find list item at pick 1: 1/1 = 100%
  • Find list item at pick 4: 2/4 = 50%
  • Find list item at pick 5: 3/5 = 60%

AP = (100% + 50% + 60%) / 3 = 70%

Key insight: MAP rewards finding relevant items early; less wandering = higher score.
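
A sketch of AP following the formula above; the relevance list encodes the shopping trip (list items found at picks 1, 4, and 5), and average_precision is an illustrative name.

```python
def average_precision(relevance: list[bool], total_relevant: int) -> float:
    """Average precision for one ranked list.

    relevance[i] is True if the item at position i (0-based) is relevant.
    Precision is recorded each time a relevant item appears, then the sum is
    divided by R, the total number of relevant items (matching the formula above).
    """
    hits = 0
    precision_sum = 0.0
    for position, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / total_relevant

# Shopping trip: list items found at picks 1, 4, and 5 out of 5 grabs
print(f"AP = {average_precision([True, False, False, True, True], total_relevant=3):.2f}")  # AP = 0.70
```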


5. Choosing the Right Metric for Your Use Case

Decision Guide

| Your Priority | Metric to Use | Example Use Case |
|---|---|---|
| First answer must be right | MRR | Customer support chatbot |
| Don't miss any relevant docs | Recall@K | Legal research |
| Minimize noise/irrelevant results | Precision@K | E-commerce filters |
| Rank good stuff early | MAP | Search results browsing |

Key insight: There's no universally "best" metric; the right one depends on whether your users value speed, completeness, or cleanliness of results.

Interpreting Metric Changes

Scenario: After tuning RRF, MRR went up but Recall dropped.

  • Good if: Building a chatbot (first answer matters most)
  • Bad if: Building legal search (need ALL relevant docs)

Scenario: High Recall@20 but low MAP

  • Diagnosis: Finding relevant docs, but ranking them poorly (appearing late instead of early)

6. Quick Reference

RRF Cheat Sheet

RRF(d) = Σ 1/(k + rank)
  • k=60 (default): Balanced, rewards consistency
  • Low k: Trust top picks strongly
  • High k: Compress rank differences
  • β: Weight between rankers (0-1)

Metrics Cheat Sheet

| Metric | One-liner |
|---|---|
| MRR | "How far do I scroll to find the first right answer?" |
| Precision@K | "How much of my top K is good?" |
| Recall@K | "How much of the good stuff did I find?" |
| MAP | "How little did I wander before finding what I needed?" |

All Metrics Require Ground Truth

You need to know which documents are actually relevant to evaluate any of these metrics!


7. Quick Answers

Q: Explain contrastive learning in one sentence.

Train a model to pull similar items close and push different items apart in vector space, using natural pairs from data structure instead of manual labels.

Q: What is RRF and when would you use it?

A simple way to combine rankings from multiple search systems by summing 1/(k+rank). Use it for hybrid search (semantic + keyword); no training needed, robust to bad rankers.

Q: What's the difference between Precision and Recall?

Precision = "of what I returned, how much is good?" Recall = "of all the good stuff, how much did I find?" Tradeoff: more results → higher recall, lower precision.

Q: When would you use MRR vs MAP?

MRR when only the first result matters (chatbots). MAP when you care about ranking quality across all relevant items (search browsing).

Q: How does the k parameter in RRF work?

It's a "patience" knob. Low k = trust top picks strongly. High k = compress differences, reward consistency across rankers.