TIL: NDCG (Normalized Discounted Cumulative Gain)
Why Do We Need NDCG?
The Gap in Existing Metrics
| Metric | Limitation |
|---|---|
| Precision@K | Binary relevance only — a doc is relevant or not |
| Recall@K | Binary relevance only — doesn't distinguish "amazing" from "okay" |
| MRR | Only cares about the FIRST relevant item, still binary |
The Problem: Not all relevant documents are equally relevant, and position matters more than these metrics capture.
MRR's Two Blind Spots
- Only cares about FIRST relevant item — ignores everything after
- Still binary — can't tell if first result is "amazing" vs just "okay"
Key Insight:
- MRR = "How quickly did you find something relevant?"
- NDCG = "How good is your entire ranked list, considering how relevant each item is?"
🍽️ Sticky Analogy: The Restaurant Concierge Story
Imagine you're a hungry tourist in a new city. You ask two hotel concierges for their top 5 restaurant recommendations.
Concierge A:
- Life-changing biryani place 🤩
- Decent cafe ☕
- Meh fast food 🍔
- Closed restaurant 💀
- Terrible place 🤢
Concierge B:
- Decent cafe ☕
- Okay-ish dhaba 🍛
- Life-changing biryani place 🤩
- Closed restaurant 💀
- Terrible place 🤢
You're hungry NOW. You'll probably only try the first one or two suggestions. Concierge A is clearly better — they put the amazing place first, when you're most likely to actually use it.
Step 1: CG (Cumulative Gain)
Assign Numerical Relevance Scores
| Rating | Score |
|---|---|
| 🤩 Life-changing | 3 |
| ☕ Decent | 2 |
| 🍛 Okay-ish | 1 |
| 💀🤢 Bad/Closed | 0 |
Formula
$ CG = \sum_{i=1}^{k} \mathrm{rel}_i $
Calculation
Concierge A: 🤩, ☕, 🍔, 💀, 🤢 → scores: 3, 2, 0, 0, 0
Concierge B: ☕, 🍛, 🤩, 💀, 🤢 → scores: 2, 1, 3, 0, 0
- $ CG_A = 3 + 2 + 0 + 0 + 0 = 5 $
- $ CG_B = 2 + 1 + 3 + 0 + 0 = 6 $
⚠️ Problem: CG says Concierge B is better! But we know A is better because they put the best item first. CG ignores position entirely.
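To make the arithmetic concrete, here is a minimal Python sketch of cumulative gain; the score lists simply mirror the concierge example above.

```python
def cumulative_gain(relevances):
    """Plain sum of relevance scores; position is ignored entirely."""
    return sum(relevances)

concierge_a = [3, 2, 0, 0, 0]  # 🤩, ☕, 🍔, 💀, 🤢
concierge_b = [2, 1, 3, 0, 0]  # ☕, 🍛, 🤩, 💀, 🤢

print(cumulative_gain(concierge_a))  # 5
print(cumulative_gain(concierge_b))  # 6  -> CG prefers B, which contradicts our intuition
```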
Step 2: DCG (Discounted Cumulative Gain)
🍽️ The "Hunger Decay" Intuition
Your attention decays as you go down the list:
- Position 1: You're STARVING. Almost certainly try this one. (100% attention)
- Position 2: If #1 is closed, maybe try this. (Less attention)
- Position 3: Getting impatient... (Even less)
- Position 4: Might not even read this far. (Very little)
- Position 5: Probably already left the lobby. (Almost zero)
Formula
$ DCG = \sum_{i=1}^{k} \frac{\mathrm{rel}_i}{\log_2(i + 1)} $
The Discount Factor
| Position (i) | $\log_2(i + 1)$ | What it means |
|---|---|---|
| 1 | 1.0 | Full credit! |
| 2 | 1.58 | Divided by ~1.6 |
| 3 | 2.0 | Half credit |
| 4 | 2.32 | Less than half |
| 5 | 2.58 | Even smaller |
Calculation
Concierge A: (3, 2, 0, 0, 0)
$ DCG_A = \frac{3}{1.0} + \frac{2}{1.58} + \frac{0}{2.0} + \frac{0}{2.32} + \frac{0}{2.58} = 3.0 + 1.26 + 0 + 0 + 0 = \textbf{4.26} $
Concierge B: (2, 1, 3, 0, 0)
$ DCG_B = \frac{2}{1.0} + \frac{1}{1.58} + \frac{3}{2.0} + \frac{0}{2.32} + \frac{0}{2.58} = 2.0 + 0.63 + 1.5 + 0 + 0 = \textbf{4.13} $
✅ Now Concierge A wins! The discounting punished B for hiding the best restaurant at position 3.
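Here is the same idea as a small Python sketch. It applies the $\log_2(i+1)$ discount from the formula above, counting positions from 1.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: divide each relevance by log2(position + 1)."""
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevances, start=1))

print(round(dcg([3, 2, 0, 0, 0]), 2))  # 4.26  (Concierge A)
print(round(dcg([2, 1, 3, 0, 0]), 2))  # 4.13  (Concierge B)
```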
Step 3: NDCG (Normalized DCG)
🏨 The Two Hotels Problem
Hotel Mumbai — Many amazing biryani places exist → DCG = 8.2
Hotel Shimla — Barely any biryani places exist → DCG = 2.5
You can't compare 8.2 vs 2.5 directly! The Mumbai concierge had an easier task.
👨‍🍳 The Chef Competition Analogy
Chef A gets: truffle, wagyu beef, lobster, caviar
Chef B gets: potato, onion, garlic, salt
Scores: Chef A = 85 points, Chef B = 72 points
But what's the best possible with each set of ingredients?
- Chef A's ceiling: 100 points → 85/100 = 85%
- Chef B's ceiling: 75 points → 72/75 = 96%
Chef B is actually more skilled! They extracted almost everything possible from what they had.
Formula
$ NDCG = \frac{DCG}{IDCG} $
Where IDCG = Ideal DCG — the DCG if you ranked everything perfectly.
Calculation
Ideal ranking: Put the highest-scoring candidates from the whole restaurant pool first (this includes the 🍛 dhaba that Concierge A never recommended) → 3, 2, 1, 0, 0
$ IDCG = \frac{3}{1.0} + \frac{2}{1.58} + \frac{1}{2.0} + \frac{0}{2.32} + \frac{0}{2.58} = 3.0 + 1.26 + 0.5 = \textbf{4.76} $
Final Scores:
$ NDCG_A = \frac{4.26}{4.76} = \textbf{0.895} $ (89.5% of perfect)
$ NDCG_B = \frac{4.13}{4.76} = \textbf{0.868} $ (86.8% of perfect)
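A self-contained sketch tying the pieces together. Note that the ideal ranking is built from the relevance scores of every labeled candidate (the full restaurant pool), not just the items a given concierge happened to recommend.

```python
import math

def dcg(relevances):
    """DCG: each relevance divided by log2(position + 1), positions counted from 1."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances, start=1))

def ndcg(retrieved_rels, all_labeled_rels, k=None):
    """NDCG = DCG of the actual ranking / DCG of the ideal ranking over all labeled candidates."""
    k = k or len(retrieved_rels)
    ideal = sorted(all_labeled_rels, reverse=True)[:k]
    return dcg(retrieved_rels[:k]) / dcg(ideal)

pool = [3, 2, 1, 0, 0]                        # relevance of every candidate restaurant
print(round(ndcg([3, 2, 0, 0, 0], pool), 3))  # 0.895  (Concierge A)
print(round(ndcg([2, 1, 3, 0, 0], pool), 3))  # 0.868  (Concierge B)
```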
What NDCG Values Mean
| Score | Interpretation |
|---|---|
| 1.0 | Perfect ranking |
| 0.9+ | Excellent |
| 0.7-0.9 | Good |
| < 0.5 | Poor |
🔑 Critical Insight: NDCG is an Evaluation Metric, NOT a Runtime Computation
⚠️ NDCG is used after the fact to measure how good your retriever is.
At runtime, when a real user query comes in, you don't know the true relevance. Your retriever just does its best.
Analogy: Think of it like a student's exam score. The exam (with answer key) is created beforehand. The student takes the exam. Then you grade it. You don't grade during the exam.
The Evaluation Pipeline
- Create test set — humans label query-document relevance (IDCG is known here!)
- Run retriever on test queries
- Compute NDCG — compare retriever's ranking vs ideal
- Iterate — tweak retriever, re-evaluate
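Roughly, that pipeline looks like the sketch below. The `retrieve` callable and the test-set layout are hypothetical placeholders (not a specific library's API), and `ndcg` is the helper defined in the Step 3 sketch above.

```python
# Hypothetical test-set shape: query -> {chunk_id: graded relevance label}
test_set = {
    "What is the maternity leave policy?": {
        "chunk_17": 3, "chunk_42": 1, "chunk_08": 1, "chunk_91": 0, "chunk_33": 0,
    },
    # ... more labeled queries
}

def evaluate(retrieve, test_set, k=5):
    """Run a retriever over every labeled query and return its mean NDCG@k."""
    scores = []
    for query, labels in test_set.items():
        ranked_ids = retrieve(query, k)                      # retriever under test
        rels = [labels.get(cid, 0) for cid in ranked_ids]    # unlabeled chunks count as 0
        scores.append(ndcg(rels, list(labels.values()), k))  # ndcg() from the Step 3 sketch
    return sum(scores) / len(scores)
```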
Real-World RAG Evaluation Example
Step 1: Human-Labeled Test Set (Done Once)
Query: "What is the maternity leave policy?"
| Chunk ID | Content | Human Label |
|---|---|---|
| chunk_17 | "Maternity leave is 26 weeks paid..." | 3 (perfect) |
| chunk_42 | "Paternity leave is 2 weeks..." | 1 (related) |
| chunk_08 | "Leave requests must be submitted..." | 1 (tangential) |
| chunk_91 | "Annual leave balance can be..." | 0 (irrelevant) |
| chunk_33 | "Office timings are 9am to 6pm..." | 0 (irrelevant) |
IDCG@5 (computed from labels): 4.13
Step 2: Run Retrievers
Retriever A (BM25): chunk_17, chunk_91, chunk_42, chunk_08, chunk_33
- Relevances: 3, 0, 1, 1, 0
- $ DCG_A = 3.93 $
- $ NDCG_A = 0.952 $
Retriever B (Embeddings): chunk_42, chunk_17, chunk_08, chunk_91, chunk_33
- Relevances: 1, 3, 1, 0, 0
- $ DCG_B = 3.40 $
- $ NDCG_B = 0.823 $
🏆 Retriever A wins! Even though B uses "smarter" embeddings, it put the perfect answer at position 2 instead of 1.
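If scikit-learn is available, `sklearn.metrics.ndcg_score` gives a quick cross-check of these numbers. It takes graded labels and a parallel array of ranking scores; here the labels are listed in the order each retriever returned the chunks, and the descending scores just encode that order.

```python
from sklearn.metrics import ndcg_score

y_score = [[5, 4, 3, 2, 1]]      # any strictly decreasing scores: position 1 ranks highest

y_true_a = [[3, 0, 1, 1, 0]]     # labels in the order Retriever A (BM25) returned them
y_true_b = [[1, 3, 1, 0, 0]]     # labels in the order Retriever B (embeddings) returned them

print(round(ndcg_score(y_true_a, y_score), 3))  # ~0.952
print(round(ndcg_score(y_true_b, y_score), 3))  # ~0.821 (0.823 above uses two-decimal rounding)
```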
NDCG@K — "I Only Care About The Top Few"
In real RAG systems, only the top K retrieved chunks are fed to the LLM. Everything beyond that is never seen!
Common Values
| Metric | Use Case |
|---|---|
| NDCG@3 | Very strict, limited context |
| NDCG@5 | Common for RAG |
| NDCG@10 | Common for search engines |
Calculation (Top 3 Only)
IDCG@3 (best possible top 3: 3, 1, 1): 4.13
Retriever A: (3, 0, 1) → $ DCG@3 = 3.5 $ → $ NDCG@3 = 0.847 $
Retriever B: (1, 3, 1) → $ DCG@3 = 3.4 $ → $ NDCG@3 = 0.823 $
💡 Insight: Retriever A's score dropped from 0.952 to 0.847. The relevant chunk it placed at position 4 now falls outside the cutoff, while the ideal ranking fits all three relevant chunks inside the top 3, so A's position-2 mistake (0 relevance) is no longer partially compensated further down the list.
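The `k` parameter on the `ndcg` helper from Step 3 handles the cutoff; a quick check of the numbers above (the small deviation for Retriever B is again just two-decimal rounding):

```python
rels_a = [3, 0, 1, 1, 0]   # Retriever A, in retrieved order
rels_b = [1, 3, 1, 0, 0]   # Retriever B, in retrieved order
pool   = [3, 1, 1, 0, 0]   # all labeled relevances for this query

print(round(ndcg(rels_a, pool, k=3), 3))  # 0.847
print(round(ndcg(rels_b, pool, k=3), 3))  # ~0.821
```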
Mean NDCG — Evaluating Across Many Queries
One query proves nothing. We need to test on MANY queries.
Example: 5 HR Queries
| Query | NDCG@3 (A) | NDCG@3 (B) |
|---|---|---|
| Maternity leave policy | 0.847 | 0.823 |
| Expense reports | 0.912 | 0.956 |
| Work from home policy | 0.634 | 0.891 |
| Sick days | 0.889 | 0.845 |
| Dress code | 0.723 | 0.912 |
Mean NDCG
$ \text{Mean NDCG@3 (A)} = \frac{0.847 + 0.912 + 0.634 + 0.889 + 0.723}{5} = \textbf{0.801} $
$ \text{Mean NDCG@3 (B)} = \frac{0.823 + 0.956 + 0.891 + 0.845 + 0.912}{5} = \textbf{0.885} $
🔄 Plot twist! Retriever B wins overall despite losing on some individual queries.
Debugging with Per-Query Scores
Retriever A's Q3 score (0.634) is a red flag — investigate why "work from home" queries fail!
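A small sketch of the aggregation plus the red-flag check, using the per-query numbers from the table above (the 0.7 threshold is an arbitrary choice for illustration):

```python
per_query = {  # NDCG@3 per query, from the table above
    "Maternity leave policy": {"A": 0.847, "B": 0.823},
    "Expense reports":        {"A": 0.912, "B": 0.956},
    "Work from home policy":  {"A": 0.634, "B": 0.891},
    "Sick days":              {"A": 0.889, "B": 0.845},
    "Dress code":             {"A": 0.723, "B": 0.912},
}

for retriever in ("A", "B"):
    scores = [q[retriever] for q in per_query.values()]
    print(retriever, round(sum(scores) / len(scores), 3))  # A 0.801, B 0.885

# Per-query debugging: surface queries where Retriever A underperforms
flagged = [q for q, s in per_query.items() if s["A"] < 0.7]
print(flagged)  # ['Work from home policy']
```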
How Relevance Scores Are Determined
⚠️ Relevance is about how well a document matches YOUR specific query, not the document's overall quality.
Methods to Generate Labels
| Method | Pros | Cons |
|---|---|---|
| Human annotators | Accurate, nuanced | Expensive, slow |
| User clicks/engagement | Real behavior | Noisy, biased |
| LLM-as-judge | Scalable, cheap | May not match humans |
Quick Reference
The NDCG Family
| Metric | Formula | What it captures |
|---|---|---|
| CG | $\sum rel_i$ | Total relevance (ignores position) |
| DCG | $\sum \frac{rel_i}{\log_2(i+1)}$ | Relevance with position penalty |
| IDCG | DCG of perfect ranking | Your ceiling |
| NDCG | DCG / IDCG | Score from 0 to 1 |
| NDCG@K | Same, but top K only | Practical evaluation |
| Mean NDCG | Average across queries | System-level metric |
Comparison with Other Metrics
| Metric | Graded Relevance? | Position-Aware? | Considers the Whole List? |
|---|---|---|---|
| Precision@K | ❌ Binary | ❌ No | ✅ Yes |
| Recall@K | ❌ Binary | ❌ No | ✅ Yes |
| MRR | ❌ Binary | ✅ First only | ❌ No |
| NDCG | ✅ Yes | ✅ Yes | ✅ Yes |
When to Use NDCG
- ✅ RAG system evaluation
- ✅ Search engine ranking
- ✅ Recommendation systems
- ✅ When relevance is graded (not binary)
- ✅ When position matters
TL;DR
NDCG answers: "How good is my retrieval system at putting the most relevant items at the top of the list, measured as a percentage of the best possible ranking?"