TIL: NDCG (Normalized Discounted Cumulative Gain)
Why Do We Need NDCG?
The Gap in Existing Metrics
| Metric | Limitation |
|---|---|
| Precision@K | Binary relevance only — a doc is relevant or not |
| Recall@K | Binary relevance only — doesn't distinguish "amazing" from "okay" |
| MRR | Only cares about the FIRST relevant item, still binary |
The Problem: Not all relevant documents are equally relevant, and position matters more than these metrics capture.
MRR's Two Blind Spots
- Only cares about FIRST relevant item — ignores everything after
- Still binary — can't tell if first result is "amazing" vs just "okay"
Key Insight:
- MRR = "How quickly did you find something relevant?"
- NDCG = "How good is your entire ranked list, considering how relevant each item is?"
🍽️ Sticky Analogy: The Restaurant Concierge Story
Imagine you're a hungry tourist in a new city. You ask two hotel concierges for their top 5 restaurant recommendations.
Concierge A:
- Life-changing biryani place 🤩
- Decent cafe ☕
- Meh fast food 🍔
- Closed restaurant 💀
- Terrible place 🤢
Concierge B:
- Decent cafe ☕
- Okay-ish dhaba 🍛
- Life-changing biryani place 🤩
- Closed restaurant 💀
- Terrible place 🤢
You're hungry NOW. You'll probably only try the first one or two suggestions. Concierge A is clearly better — they put the amazing place first, when you're most likely to actually use it.
Step 1: CG (Cumulative Gain)
Assign Numerical Relevance Scores
| Rating | Score |
|---|---|
| 🤩 Life-changing | 3 |
| ☕ Decent | 2 |
| 🍛 Okay-ish | 1 |
| 💀🤢 Bad/Closed | 0 |
Formula
$ CG = \sum_{i=1}^{k} \mathrm{rel}_i $
Calculation
Concierge A: 🤩, ☕, 🍔, 💀, 🤢 → scores: 3, 2, 0, 0, 0
Concierge B: ☕, 🍛, 🤩, 💀, 🤢 → scores: 2, 1, 3, 0, 0
- $ CG_A = 3 + 2 + 0 + 0 + 0 = 5 $
- $ CG_B = 2 + 1 + 3 + 0 + 0 = 6 $
⚠️ Problem: CG says Concierge B is better! But we know A is better because they put the best item first. CG ignores position entirely.
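To make the arithmetic concrete, here is a minimal Python sketch of cumulative gain; the score lists simply mirror the concierge example above.

```python
def cumulative_gain(relevances):
    """Plain sum of relevance scores; position is ignored entirely."""
    return sum(relevances)

concierge_a = [3, 2, 0, 0, 0]  # 🤩, ☕, 🍔, 💀, 🤢
concierge_b = [2, 1, 3, 0, 0]  # ☕, 🍛, 🤩, 💀, 🤢

print(cumulative_gain(concierge_a))  # 5
print(cumulative_gain(concierge_b))  # 6  -> CG prefers B, which contradicts our intuition
```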
Step 2: DCG (Discounted Cumulative Gain)
🍽️ The "Hunger Decay" Intuition
Your attention decays as you go down the list:
- Position 1: You're STARVING. Almost certainly try this one. (100% attention)
- Position 2: If #1 is closed, maybe try this. (Less attention)
- Position 3: Getting impatient... (Even less)
- Position 4: Might not even read this far. (Very little)
- Position 5: Probably already left the lobby. (Almost zero)
Formula
$ DCG = \sum_{i=1}^{k} \frac{\mathrm{rel}_i}{\log_2(i + 1)} $
The Discount Factor
| Position (i) | $\log_2(i + 1)$ | What it means |
|---|---|---|
| 1 | 1.0 | Full credit! |
| 2 | 1.58 | Divided by ~1.6 |
| 3 | 2.0 | Half credit |
| 4 | 2.32 | Less than half |
| 5 | 2.58 | Even smaller |
Calculation
Concierge A: (3, 2, 0, 0, 0)
$ DCG_A = \frac{3}{1.0} + \frac{2}{1.58} + \frac{0}{2.0} + \frac{0}{2.32} + \frac{0}{2.58} = 3.0 + 1.26 + 0 + 0 + 0 = \textbf{4.26} $
Concierge B: (2, 1, 3, 0, 0)
$ DCG_B = \frac{2}{1.0} + \frac{1}{1.58} + \frac{3}{2.0} + \frac{0}{2.32} + \frac{0}{2.58} = 2.0 + 0.63 + 1.5 + 0 + 0 = \textbf{4.13} $
✅ Now Concierge A wins! The discounting punished B for hiding the best restaurant at position 3.
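Here is the same idea as a small Python sketch. It applies the $\log_2(i+1)$ discount from the formula above, counting positions from 1.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: divide each relevance by log2(position + 1)."""
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevances, start=1))

print(round(dcg([3, 2, 0, 0, 0]), 2))  # 4.26  (Concierge A)
print(round(dcg([2, 1, 3, 0, 0]), 2))  # 4.13  (Concierge B)
```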
Step 3: NDCG (Normalized DCG)
🏨 The Two Hotels Problem
Hotel Mumbai — Many amazing biryani places exist → DCG = 8.2
Hotel Shimla — Barely any biryani places exist → DCG = 2.5
You can't compare 8.2 vs 2.5 directly! The Mumbai concierge had an easier task.
👨‍🍳 The Chef Competition Analogy
Chef A gets: truffle, wagyu beef, lobster, caviar
Chef B gets: potato, onion, garlic, salt
Scores: Chef A = 85 points, Chef B = 72 points
But what's the best possible with each set of ingredients?
- Chef A's ceiling: 100 points → 85/100 = 85%
- Chef B's ceiling: 75 points → 72/75 = 96%
Chef B is actually more skilled! They extracted almost everything possible from what they had.
Formula
$ NDCG = \frac{DCG}{IDCG} $
Where IDCG = Ideal DCG — the DCG if you ranked everything perfectly.
Calculation
Ideal ranking: Put the highest-scoring candidates from the whole restaurant pool first (this includes the 🍛 dhaba that Concierge A never recommended) → 3, 2, 1, 0, 0
$ IDCG = \frac{3}{1.0} + \frac{2}{1.58} + \frac{1}{2.0} + \frac{0}{2.32} + \frac{0}{2.58} = 3.0 + 1.26 + 0.5 = \textbf{4.76} $
Final Scores:
$ NDCG_A = \frac{4.26}{4.76} = \textbf{0.895} $ (89.5% of perfect)
$ NDCG_B = \frac{4.13}{4.76} = \textbf{0.868} $ (86.8% of perfect)
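A self-contained sketch tying the pieces together. Note that the ideal ranking is built from the relevance scores of every labeled candidate (the full restaurant pool), not just the items a given concierge happened to recommend.

```python
import math

def dcg(relevances):
    """DCG: each relevance divided by log2(position + 1), positions counted from 1."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances, start=1))

def ndcg(retrieved_rels, all_labeled_rels, k=None):
    """NDCG = DCG of the actual ranking / DCG of the ideal ranking over all labeled candidates."""
    k = k or len(retrieved_rels)
    ideal = sorted(all_labeled_rels, reverse=True)[:k]
    return dcg(retrieved_rels[:k]) / dcg(ideal)

pool = [3, 2, 1, 0, 0]                        # relevance of every candidate restaurant
print(round(ndcg([3, 2, 0, 0, 0], pool), 3))  # 0.895  (Concierge A)
print(round(ndcg([2, 1, 3, 0, 0], pool), 3))  # 0.868  (Concierge B)
```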
What NDCG Values Mean
| Score | Interpretation |
|---|---|
| 1.0 | Perfect ranking |
| 0.9+ | Excellent |
| 0.7-0.9 | Good |
| < 0.5 | Poor |
🔑 Critical Insight: NDCG is an Evaluation Metric, NOT a Runtime Computation
⚠️ NDCG is used after the fact to measure how good your retriever is.
At runtime, when a real user query comes in, you don't know the true relevance. Your retriever just does its best.
Analogy: Think of it like a student's exam score. The exam (with answer key) is created beforehand. The student takes the exam. Then you grade it. You don't grade during the exam.
The Evaluation Pipeline
- Create test set — humans label query-document relevance (IDCG is known here!)
- Run retriever on test queries
- Compute NDCG — compare retriever's ranking vs ideal
- Iterate — tweak retriever, re-evaluate
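Roughly, that pipeline looks like the sketch below. The `retrieve` callable and the test-set layout are hypothetical placeholders (not a specific library's API), and `ndcg` is the helper defined in the Step 3 sketch above.

```python
# Hypothetical test-set shape: query -> {chunk_id: graded relevance label}
test_set = {
    "What is the maternity leave policy?": {
        "chunk_17": 3, "chunk_42": 1, "chunk_08": 1, "chunk_91": 0, "chunk_33": 0,
    },
    # ... more labeled queries
}

def evaluate(retrieve, test_set, k=5):
    """Run a retriever over every labeled query and return its mean NDCG@k."""
    scores = []
    for query, labels in test_set.items():
        ranked_ids = retrieve(query, k)                      # retriever under test
        rels = [labels.get(cid, 0) for cid in ranked_ids]    # unlabeled chunks count as 0
        scores.append(ndcg(rels, list(labels.values()), k))  # ndcg() from the Step 3 sketch
    return sum(scores) / len(scores)
```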
Real-World RAG Evaluation Example
Step 1: Human-Labeled Test Set (Done Once)
Query: "What is the maternity leave policy?"
| Chunk ID | Content | Human Label |
|---|---|---|
| chunk_17 | "Maternity leave is 26 weeks paid..." | 3 (perfect) |
| chunk_42 | "Paternity leave is 2 weeks..." | 1 (related) |
| chunk_08 | "Leave requests must be submitted..." | 1 (tangential) |
| chunk_91 | "Annual leave balance can be..." | 0 (irrelevant) |
| chunk_33 | "Office timings are 9am to 6pm..." | 0 (irrelevant) |
IDCG@5 (computed from labels): 4.13
Step 2: Run Retrievers
Retriever A (BM25): chunk_17, chunk_91, chunk_42, chunk_08, chunk_33
- Relevances: 3, 0, 1, 1, 0
- $ DCG_A = 3.93 $
- $ NDCG_A = 0.952 $
Retriever B (Embeddings): chunk_42, chunk_17, chunk_08, chunk_91, chunk_33
- Relevances: 1, 3, 1, 0, 0
- $ DCG_B = 3.40 $
- $ NDCG_B = 0.823 $
🏆 Retriever A wins! Even though B uses "smarter" embeddings, it put the perfect answer at position 2 instead of 1.
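If scikit-learn is available, `sklearn.metrics.ndcg_score` gives a quick cross-check of these numbers. It takes graded labels and a parallel array of ranking scores; here the labels are listed in the order each retriever returned the chunks, and the descending scores just encode that order.

```python
from sklearn.metrics import ndcg_score

y_score = [[5, 4, 3, 2, 1]]      # any strictly decreasing scores: position 1 ranks highest

y_true_a = [[3, 0, 1, 1, 0]]     # labels in the order Retriever A (BM25) returned them
y_true_b = [[1, 3, 1, 0, 0]]     # labels in the order Retriever B (embeddings) returned them

print(round(ndcg_score(y_true_a, y_score), 3))  # ~0.952
print(round(ndcg_score(y_true_b, y_score), 3))  # ~0.821 (0.823 above uses two-decimal rounding)
```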
NDCG@K — "I Only Care About The Top Few"
In real RAG systems, only the top K retrieved chunks are fed to the LLM. Everything beyond that is never seen!
Common Values
| Metric | Use Case |
|---|---|
| NDCG@3 | Very strict, limited context |
| NDCG@5 | Common for RAG |
| NDCG@10 | Common for search engines |
Calculation (Top 3 Only)
IDCG@3 (best possible top 3: 3, 1, 1): 4.13
Retriever A: (3, 0, 1) → $ DCG@3 = 3.5 $ → $ NDCG@3 = 0.847 $
Retriever B: (1, 3, 1) → $ DCG@3 = 3.4 $ → $ NDCG@3 = 0.823 $
💡 Insight: Retriever A's score dropped from 0.952 to 0.847. The relevant chunk it placed at position 4 now falls outside the cutoff, while the ideal ranking fits all three relevant chunks inside the top 3, so A's position-2 mistake (0 relevance) is no longer partially compensated further down the list.
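The `k` parameter on the `ndcg` helper from Step 3 handles the cutoff; a quick check of the numbers above (the small deviation for Retriever B is again just two-decimal rounding):

```python
rels_a = [3, 0, 1, 1, 0]   # Retriever A, in retrieved order
rels_b = [1, 3, 1, 0, 0]   # Retriever B, in retrieved order
pool   = [3, 1, 1, 0, 0]   # all labeled relevances for this query

print(round(ndcg(rels_a, pool, k=3), 3))  # 0.847
print(round(ndcg(rels_b, pool, k=3), 3))  # ~0.821
```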
Mean NDCG — Evaluating Across Many Queries
One query proves nothing. We need to test on MANY queries.
Example: 5 HR Queries
| Query | NDCG@3 (A) | NDCG@3 (B) |
|---|---|---|
| Maternity leave policy | 0.847 | 0.823 |
| Expense reports | 0.912 | 0.956 |
| Work from home policy | 0.634 | 0.891 |
| Sick days | 0.889 | 0.845 |
| Dress code | 0.723 | 0.912 |
Mean NDCG
$ \text{Mean NDCG@3 (A)} = \frac{0.847 + 0.912 + 0.634 + 0.889 + 0.723}{5} = \textbf{0.801} $
$ \text{Mean NDCG@3 (B)} = \frac{0.823 + 0.956 + 0.891 + 0.845 + 0.912}{5} = \textbf{0.885} $
🔄 Plot twist! Retriever B wins overall despite losing on some individual queries.
Debugging with Per-Query Scores
Retriever A's Q3 score (0.634) is a red flag — investigate why "work from home" queries fail!
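A small sketch of the aggregation plus the red-flag check, using the per-query numbers from the table above (the 0.7 threshold is an arbitrary choice for illustration):

```python
per_query = {  # NDCG@3 per query, from the table above
    "Maternity leave policy": {"A": 0.847, "B": 0.823},
    "Expense reports":        {"A": 0.912, "B": 0.956},
    "Work from home policy":  {"A": 0.634, "B": 0.891},
    "Sick days":              {"A": 0.889, "B": 0.845},
    "Dress code":             {"A": 0.723, "B": 0.912},
}

for retriever in ("A", "B"):
    scores = [q[retriever] for q in per_query.values()]
    print(retriever, round(sum(scores) / len(scores), 3))  # A 0.801, B 0.885

# Per-query debugging: surface queries where Retriever A underperforms
flagged = [q for q, s in per_query.items() if s["A"] < 0.7]
print(flagged)  # ['Work from home policy']
```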
How Relevance Scores Are Determined
⚠️ Relevance is about how well a document matches YOUR specific query, not the document's overall quality.
Methods to Generate Labels
| Method | Pros | Cons |
|---|---|---|
| Human annotators | Accurate, nuanced | Expensive, slow |
| User clicks/engagement | Real behavior | Noisy, biased |
| LLM-as-judge | Scalable, cheap | May not match humans |
Quick Reference
The NDCG Family
| Metric | Formula | What it captures |
|---|---|---|
| CG | $\sum rel_i$ | Total relevance (ignores position) |
| DCG | $\sum \frac{rel_i}{\log_2(i+1)}$ | Relevance with position penalty |
| IDCG | DCG of perfect ranking | Your ceiling |
| NDCG | DCG / IDCG | Score from 0 to 1 |
| NDCG@K | Same, but top K only | Practical evaluation |
| Mean NDCG | Average across queries | System-level metric |
Comparison with Other Metrics
| Metric | Graded Relevance? | Position-Aware? | Considers the Whole List? |
|---|---|---|---|
| Precision@K | ❌ Binary | ❌ No | ✅ Yes |
| Recall@K | ❌ Binary | ❌ No | ✅ Yes |
| MRR | ❌ Binary | ✅ First only | ❌ No |
| NDCG | ✅ Yes | ✅ Yes | ✅ Yes |
When to Use NDCG
- ✅ RAG system evaluation
- ✅ Search engine ranking
- ✅ Recommendation systems
- ✅ When relevance is graded (not binary)
- ✅ When position matters
TL;DR
NDCG answers: "How good is my retrieval system at putting the most relevant items at the top of the list, measured as a percentage of the best possible ranking?"