TIL (Understanding NDCG)

2026-01-03
Information Retrieval · Evaluation Metrics · TIL

TIL NDCG (Normalized Discounted Cumulative Gain)


Why Do We Need NDCG?

The Gap in Existing Metrics

| Metric | Limitation |
|---|---|
| Precision@K | Binary relevance only — a doc is relevant or not |
| Recall@K | Binary relevance only — doesn't distinguish "amazing" from "okay" |
| MRR | Only cares about the FIRST relevant item, still binary |

The Problem: Not all relevant documents are equally relevant, and position matters more than these metrics capture.

MRR's Two Blind Spots

  1. Only cares about FIRST relevant item — ignores everything after
  2. Still binary — can't tell if first result is "amazing" vs just "okay"

Key Insight:

  • MRR = "How quickly did you find something relevant?"
  • NDCG = "How good is your entire ranked list, considering how relevant each item is?"

🍽️ Sticky Analogy: The Restaurant Concierge Story

Imagine you're a hungry tourist in a new city. You ask two hotel concierges for their top 5 restaurant recommendations.

Concierge A:

  1. Life-changing biryani place 🤩
  2. Decent cafe ☕
  3. Meh fast food 🍔
  4. Closed restaurant 💀
  5. Terrible place 🤢

Concierge B:

  1. Decent cafe ☕
  2. Okay-ish dhaba 🍛
  3. Life-changing biryani place 🤩
  4. Closed restaurant 💀
  5. Terrible place 🤢

You're hungry NOW. You'll probably only try the first one or two suggestions. Concierge A is clearly better — they put the amazing place first, when you're most likely to actually use it.


Step 1: CG (Cumulative Gain)

Assign Numerical Relevance Scores

| Rating | Score |
|---|---|
| 🤩 Life-changing | 3 |
| ☕ Decent | 2 |
| 🍛 Okay-ish | 1 |
| 💀🤢 Bad/Closed | 0 |

Formula

$ CG = \sum_{i=1}^{k} relevance_i $

Calculation

Concierge A: 🤩, ☕, 🍔, 💀, 🤢 → scores: 3, 2, 0, 0, 0

Concierge B: ☕, 🍛, 🤩, 💀, 🤢 → scores: 2, 1, 3, 0, 0

  • $ CG_A = 3 + 2 + 0 + 0 + 0 = 5 $
  • $ CG_B = 2 + 1 + 3 + 0 + 0 = 6 $

⚠️ Problem: CG says Concierge B is better! But we know A is better because they put the best item first. CG ignores position entirely.
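
A tiny sanity check in plain Python (the `cumulative_gain` helper and the score lists are just my own restatement of the table above):

```python
def cumulative_gain(relevances):
    """CG: sum the relevance scores, ignoring position entirely."""
    return sum(relevances)

concierge_a = [3, 2, 0, 0, 0]
concierge_b = [2, 1, 3, 0, 0]

print(cumulative_gain(concierge_a))  # 5
print(cumulative_gain(concierge_b))  # 6  <- CG wrongly prefers B
```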


Step 2: DCG (Discounted Cumulative Gain)

🍽️ The "Hunger Decay" Intuition

Your attention decays as you go down the list:

  • Position 1: You're STARVING. Almost certainly try this one. (100% attention)
  • Position 2: If #1 is closed, maybe try this. (Less attention)
  • Position 3: Getting impatient... (Even less)
  • Position 4: Might not even read this far. (Very little)
  • Position 5: Probably already left the lobby. (Almost zero)

Formula

$ DCG = \sum_{i=1}^{k} \frac{relevance_i}{\log_2(i + 1)} $

The Discount Factor

| Position (i) | $\log_2(i + 1)$ | What it means |
|---|---|---|
| 1 | 1.0 | Full credit! |
| 2 | 1.58 | Divided by ~1.6 |
| 3 | 2.0 | Half credit |
| 4 | 2.32 | Less than half |
| 5 | 2.58 | Even smaller |

Calculation

Concierge A: (3, 2, 0, 0, 0)

$ DCG_A = \frac{3}{1.0} + \frac{2}{1.58} + \frac{0}{2.0} + \frac{0}{2.32} + \frac{0}{2.58} = 3.0 + 1.26 + 0 + 0 + 0 = \textbf{4.26} $

Concierge B: (2, 1, 3, 0, 0)

$ DCG_B = \frac{2}{1.0} + \frac{1}{1.58} + \frac{3}{2.0} + \frac{0}{2.32} + \frac{0}{2.58} = 2.0 + 0.63 + 1.5 + 0 + 0 = \textbf{4.13} $

Now Concierge A wins! The discounting punished B for hiding the best restaurant at position 3.
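
Here's a minimal sketch of that discounted sum. The `dcg` helper name is mine, and `math.log2(i + 2)` is just $\log_2(i + 1)$ written for 0-based indexing:

```python
import math

def dcg(relevances):
    """DCG: each relevance score divided by log2(position + 1), positions starting at 1."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

print(round(dcg([3, 2, 0, 0, 0]), 2))  # 4.26  (Concierge A)
print(round(dcg([2, 1, 3, 0, 0]), 2))  # 4.13  (Concierge B)
```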


Step 3: NDCG (Normalized DCG)

🏨 The Two Hotels Problem

Hotel Mumbai — Many amazing biryani places exist → DCG = 8.2

Hotel Shimla — Barely any biryani places exist → DCG = 2.5

You can't compare 8.2 vs 2.5 directly! Mumbai concierge had an easier task.

👨‍🍳 The Chef Competition Analogy

Chef A gets: truffle, wagyu beef, lobster, caviar

Chef B gets: potato, onion, garlic, salt

Scores: Chef A = 85 points, Chef B = 72 points

But what's the best possible with each set of ingredients?

  • Chef A's ceiling: 100 points → 85/100 = 85%
  • Chef B's ceiling: 75 points → 72/75 = 96%

Chef B is actually more skilled! They extracted almost everything possible from what they had.

Formula

$ NDCG = \frac{DCG}{IDCG} $

Where IDCG = Ideal DCG — the DCG if you ranked everything perfectly.

Calculation

Ideal ranking: take every restaurant we have a label for (the union of both concierges' lists) and put the highest scores first → 3, 2, 1, 0, 0

$ IDCG = \frac{3}{1.0} + \frac{2}{1.58} + \frac{1}{2.0} + \frac{0}{2.32} + \frac{0}{2.58} = 3.0 + 1.26 + 0.5 = \textbf{4.76} $

Final Scores:

$ NDCG_A = \frac{4.26}{4.76} = \textbf{0.895} $ (89.5% of perfect)

$ NDCG_B = \frac{4.13}{4.76} = \textbf{0.868} $ (86.8% of perfect)
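
A small sketch of the normalization step. One assumption to call out: the ideal list is built from the whole labeled pool (every restaurant either concierge mentioned), which is where the relevance-1 dhaba in the IDCG comes from even though Concierge A never recommended it. Function and variable names are mine.

```python
import math

def dcg(relevances):
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(ranked_relevances, labeled_pool, k=5):
    """DCG of the actual ranking divided by the DCG of the best possible ranking (IDCG)."""
    ideal = dcg(sorted(labeled_pool, reverse=True)[:k])
    return dcg(ranked_relevances[:k]) / ideal if ideal else 0.0

# Every restaurant we have a label for: biryani=3, cafe=2, dhaba=1, the rest=0
pool = [3, 2, 1, 0, 0, 0]

print(round(ndcg([3, 2, 0, 0, 0], pool), 3))  # 0.895  (Concierge A)
print(round(ndcg([2, 1, 3, 0, 0], pool), 3))  # 0.868  (Concierge B)
```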

What NDCG Values Mean

| Score | Interpretation |
|---|---|
| 1.0 | Perfect ranking |
| 0.9+ | Excellent |
| 0.7-0.9 | Good |
| < 0.5 | Poor |

🔑 Critical Insight: NDCG is an Evaluation Metric, NOT Runtime

⚠️ NDCG is used after the fact to measure how good your retriever is.

At runtime, when a real user query comes in, you don't know the true relevance. Your retriever just does its best.

Analogy: Think of it like a student's exam score. The exam (with answer key) is created beforehand. The student takes the exam. Then you grade it. You don't grade during the exam.

The Evaluation Pipeline

  1. Create test set — humans label query-document relevance (IDCG is known here!)
  2. Run retriever on test queries
  3. Compute NDCG — compare retriever's ranking vs ideal
  4. Iterate — tweak retriever, re-evaluate

Real-World RAG Evaluation Example

Step 1: Human-Labeled Test Set (Done Once)

Query: "What is the maternity leave policy?"

| Chunk ID | Content | Human Label |
|---|---|---|
| chunk_17 | "Maternity leave is 26 weeks paid..." | 3 (perfect) |
| chunk_42 | "Paternity leave is 2 weeks..." | 1 (related) |
| chunk_08 | "Leave requests must be submitted..." | 1 (tangential) |
| chunk_91 | "Annual leave balance can be..." | 0 (irrelevant) |
| chunk_33 | "Office timings are 9am to 6pm..." | 0 (irrelevant) |

IDCG@5 (computed from labels): 4.13

Step 2: Run Retrievers

Retriever A (BM25): chunk_17, chunk_91, chunk_42, chunk_08, chunk_33

  • Relevances: 3, 0, 1, 1, 0
  • $ DCG_A = 3.93 $
  • $ NDCG_A = 0.952 $

Retriever B (Embeddings): chunk_42, chunk_17, chunk_08, chunk_91, chunk_33

  • Relevances: 1, 3, 1, 0, 0
  • $ DCG_B = 3.39 $
  • $ NDCG_B = 0.821 $

🏆 Retriever A wins! Even though B uses "smarter" embeddings, it put the perfect answer at position 2 instead of 1.
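
The same comparison as a runnable sketch. The `labels` dict mirrors the human-labeled table above; `ndcg_at_k` is my own helper name, and any chunk without a label is treated as relevance 0.

```python
import math

# Human labels for the query "What is the maternity leave policy?"
labels = {"chunk_17": 3, "chunk_42": 1, "chunk_08": 1, "chunk_91": 0, "chunk_33": 0}

def dcg(relevances):
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(ranked_chunk_ids, labels, k=5):
    """Score a retriever's ranking against the ideal ordering of the labeled chunks."""
    gains = [labels.get(cid, 0) for cid in ranked_chunk_ids[:k]]
    ideal = dcg(sorted(labels.values(), reverse=True)[:k])
    return dcg(gains) / ideal if ideal else 0.0

bm25_ranking      = ["chunk_17", "chunk_91", "chunk_42", "chunk_08", "chunk_33"]
embedding_ranking = ["chunk_42", "chunk_17", "chunk_08", "chunk_91", "chunk_33"]

print(f"Retriever A (BM25):       {ndcg_at_k(bm25_ranking, labels):.3f}")       # 0.952
print(f"Retriever B (Embeddings): {ndcg_at_k(embedding_ranking, labels):.3f}")  # 0.821
```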


NDCG@K — "I Only Care About The Top Few"

In real RAG systems, only top K chunks are fed to the LLM. Everything beyond that is never seen!

Common Values

| Metric | Use Case |
|---|---|
| NDCG@3 | Very strict, limited context |
| NDCG@5 | Common for RAG |
| NDCG@10 | Common for search engines |

Calculation (Top 3 Only)

IDCG@3 (best possible top 3: 3, 1, 1): 4.13

Retriever A: (3, 0, 1) → $ DCG@3 = 3.5 $ → $ NDCG@3 = 0.847 $

Retriever B: (1, 3, 1) → $ DCG@3 = 3.39 $ → $ NDCG@3 = 0.821 $

💡 Insight: Retriever A's score dropped from 0.952 to 0.847 because the relevant chunk it parked at position 4 no longer counts at K=3, so nothing offsets its position-2 mistake (0 relevance) when we zoom into just the top 3.
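
If scikit-learn is available, its `ndcg_score` uses the same linear gain and $\log_2$ discount as the formulas above, so it should reproduce these numbers. The score arrays below are just a stand-in encoding of each retriever's ranking order (higher score = ranked earlier), not real BM25 or embedding scores.

```python
import numpy as np
from sklearn.metrics import ndcg_score

# Columns in a fixed order: chunk_17, chunk_42, chunk_08, chunk_91, chunk_33
true_relevance = np.asarray([[3, 1, 1, 0, 0]])

bm25_scores      = np.asarray([[5, 3, 2, 4, 1]])  # ranking: 17, 91, 42, 08, 33
embedding_scores = np.asarray([[4, 5, 3, 2, 1]])  # ranking: 42, 17, 08, 91, 33

print(round(ndcg_score(true_relevance, bm25_scores, k=3), 3))       # 0.847
print(round(ndcg_score(true_relevance, embedding_scores, k=3), 3))  # 0.821
```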


Mean NDCG — Evaluating Across Many Queries

One query proves nothing. We need to test on MANY queries.

Example: 5 HR Queries

| Query | NDCG@3 (A) | NDCG@3 (B) |
|---|---|---|
| Maternity leave policy | 0.847 | 0.821 |
| Expense reports | 0.912 | 0.956 |
| Work from home policy | 0.634 | 0.891 |
| Sick days | 0.889 | 0.845 |
| Dress code | 0.723 | 0.912 |

Mean NDCG

$ \text{Mean NDCG@3 (A)} = \frac{0.847 + 0.912 + 0.634 + 0.889 + 0.723}{5} = \textbf{0.801} $

$ \text{Mean NDCG@3 (B)} = \frac{0.821 + 0.956 + 0.891 + 0.845 + 0.912}{5} = \textbf{0.885} $

🔄 Plot twist! Retriever B wins overall despite losing on some individual queries.

Debugging with Per-Query Scores

Retriever A's Q3 score (0.634) is a red flag — investigate why "work from home" queries fail!
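
A short sketch of both ideas: average the per-query scores from the table above, then flag weak queries. The 0.7 cut-off is an arbitrary threshold I picked for illustration.

```python
# Per-query NDCG@3 from the table above, as (Retriever A, Retriever B)
per_query = {
    "maternity leave policy": (0.847, 0.821),
    "expense reports":        (0.912, 0.956),
    "work from home policy":  (0.634, 0.891),
    "sick days":              (0.889, 0.845),
    "dress code":             (0.723, 0.912),
}

mean_a = sum(a for a, _ in per_query.values()) / len(per_query)
mean_b = sum(b for _, b in per_query.values()) / len(per_query)
print(f"Mean NDCG@3: A={mean_a:.3f}, B={mean_b:.3f}")  # A=0.801, B=0.885

# The per-query breakdown is where debugging happens: flag anything unusually low
for query, (a, b) in per_query.items():
    if min(a, b) < 0.7:
        print(f"Investigate: {query!r} (A={a}, B={b})")  # flags 'work from home policy'
```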


How Relevance Scores Are Determined

⚠️ Relevance is about how well it matches YOUR specific query, not the item's overall quality.

Methods to Generate Labels

| Method | Pros | Cons |
|---|---|---|
| Human annotators | Accurate, nuanced | Expensive, slow |
| User clicks/engagement | Real behavior | Noisy, biased |
| LLM-as-judge | Scalable, cheap | May not match humans |

Quick Reference

The NDCG Family

| Metric | Formula | What it captures |
|---|---|---|
| CG | $\sum rel_i$ | Total relevance (ignores position) |
| DCG | $\sum \frac{rel_i}{\log_2(i+1)}$ | Relevance with position penalty |
| IDCG | DCG of perfect ranking | Your ceiling |
| NDCG | DCG / IDCG | Score from 0 to 1 |
| NDCG@K | Same, but top K only | Practical evaluation |
| Mean NDCG | Average across queries | System-level metric |

Comparison with Other Metrics

| Metric | Graded Relevance? | Position-Aware? | Full Ranking? |
|---|---|---|---|
| Precision@K | ❌ Binary | ❌ No | ✅ Yes |
| Recall@K | ❌ Binary | ❌ No | ✅ Yes |
| MRR | ❌ Binary | ✅ First only | ❌ No |
| NDCG | ✅ Yes | ✅ Yes | ✅ Yes |

When to Use NDCG

✅ RAG system evaluation
✅ Search engine ranking
✅ Recommendation systems
✅ When relevance is graded (not binary)
✅ When position matters

TL;DR

NDCG answers: "How good is my retrieval system at putting the most relevant items at the top of the list, measured as a percentage of the best possible ranking?"