Understanding BM25 and TF-IDF
BM25 & TF-IDF: Complete Summary
TF-IDF Fundamentals
What is TF-IDF?
TF-IDF (Term Frequency–Inverse Document Frequency) measures how important a word is to a document in a collection.
Formula: TF-IDF = TF × IDF
| Component | Definition | Intuition |
|---|---|---|
| TF (Term Frequency) | How often the word appears in this document | More occurrences → higher score |
| IDF (Inverse Document Frequency) | How rare the word is across all documents | Rarer words → higher score |
Interview-ready: "TF-IDF scores a word high when it's frequent in the current document AND rare overall."
Why Log for IDF?
Problem: Without log, rare words get extreme scores (e.g., 10,000 for a word in 1 of 10,000 docs).
Solution: Log compresses large ratios.
Sticky analogy: "Being the only person in town with a rare skill is distinctive. But 1,000 vs 1,001 people knowing it? Barely matters."
Log captures diminishing returns: the jump from 1→2 docs matters more than 1000→1001 docs.
import math
raw_idf = 10000 / 1        # 10000.0 — the linear ratio explodes
log_idf = math.log(10000)  # ~9.21 — log compresses it
print(f"Raw IDF: {raw_idf}")
print(f"Log IDF: {log_idf:.2f}")
Raw IDF: 10000.0
Log IDF: 9.21
TF-IDF Worked Example
Documents:
- Doc 1: "cat cat dog"
- Doc 2: "dog bird"
Step 1: Term Frequency (TF)
| Doc | "cat" | "dog" | "bird" |
|---|---|---|---|
| Doc 1 | 2/3 ≈ 0.67 | 1/3 ≈ 0.33 | 0 |
| Doc 2 | 0 | 1/2 = 0.5 | 1/2 = 0.5 |
Step 2: Inverse Document Frequency (IDF)
Formula: IDF(t) = log(N / df_t), where N = total number of docs and df_t = number of docs containing term t
| Term | Calculation | IDF |
|---|---|---|
| "cat" | log(2/1) | 0.69 |
| "dog" | log(2/2) | 0 (in both!) |
| "bird" | log(2/1) | 0.69 |
Step 3: TF-IDF = TF × IDF
| Doc | "cat" | "dog" | "bird" |
|---|---|---|---|
| Doc 1 | 0.67 × 0.69 = 0.46 | 0.33 × 0 = 0 | — |
| Doc 2 | — | 0.5 × 0 = 0 | 0.5 × 0.69 = 0.35 |
Key insight: "dog" gets zero because it's in every document—it's not distinctive!
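The whole worked example can be reproduced in a few lines. This is a sketch, not a library implementation; the helper names (`tf`, `idf`) are made up for this note:

```python
import math

# The two documents from the example, pre-tokenized
docs = [["cat", "cat", "dog"], ["dog", "bird"]]
N = len(docs)

def tf(term, doc):
    # Relative term frequency within one document
    return doc.count(term) / len(doc)

def idf(term):
    # log(total docs / docs containing the term)
    df = sum(term in doc for doc in docs)
    return math.log(N / df)

for i, doc in enumerate(docs, 1):
    scores = {t: round(tf(t, doc) * idf(t), 2) for t in sorted(set(doc))}
    print(f"Doc {i}: {scores}")
```

Running this reproduces the table: "cat" in Doc 1 scores 0.46, "bird" in Doc 2 scores 0.35, and "dog" scores 0 everywhere.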
BM25 Term Frequency Formula
(freq × (k1 + 1)) / (freq + k1)
Why this works:
- As freq grows, both numerator and denominator grow at nearly the same rate
- The `+ k1` in the denominator acts as a "brake"
- Limit: as freq → ∞, the score approaches `k1 + 1`
def bm25_tf(freq, k1=1.5):
    """BM25 term frequency component"""
    return (freq * (k1 + 1)) / (freq + k1)

# Compare TF-IDF vs BM25 as frequency increases
print(f"{'freq':>4} {'TF-IDF':>7} {'BM25':>7}")
print("-" * 20)
for f in [1, 2, 5, 10, 20, 50]:
    tfidf_score = f  # raw TF keeps growing linearly
    bm25_score = bm25_tf(f)
    print(f"{f:4d} {tfidf_score:7d} {bm25_score:7.3f}")
freq  TF-IDF    BM25
--------------------
   1       1   1.000
   2       2   1.429
   5       5   1.923
  10      10   2.174
  20      20   2.326
  50      50   2.427
Saturation Visualization
import matplotlib.pyplot as plt
import numpy as np
freq = np.arange(1, 51)
bm25_scores = bm25_tf(freq, k1=1.5)
tfidf_scores = freq
plt.plot(freq, tfidf_scores, label='TF-IDF (linear)', linestyle='--')
plt.plot(freq, bm25_scores, label='BM25 (saturating)', linewidth=2)
plt.xlabel('Term Frequency')
plt.ylabel('Score')
plt.legend()
plt.grid(True)
plt.title('BM25 vs TF-IDF Saturation')
plt.show()
Sticky analogy (Saturation): "The first slice of pizza is amazing, the second is great, but by the 10th slice... you're not getting much extra satisfaction!"
BM25 Parameters: The Two Knobs
Overview Table
| Parameter | Intuitive Name | Analogy | Default |
|---|---|---|---|
| k1 | "Stop counting obsessively" | Patience — how long we keep listening to repetition | 1.5 |
| b | "Don't reward rambling" | Tax — penalty on long documents | 0.75 |
Interview-ready: "k1 controls term frequency saturation (patience with repetition), b controls document length normalization (tax on long docs)."
Parameter k1: "Stop Counting Obsessively" (Patience)
What it controls: How quickly we stop caring about repeated mentions.
| Direction | Effect |
|---|---|
| Higher k1 | More patience → saturation slower → repetition keeps mattering longer |
| Lower k1 | Less patience → saturation faster → stop rewarding spam quickly |
Analogy: "Higher patience = you listen longer before tuning out."
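As a quick numeric sketch of what "patience" means, using the TF formula from above (the three k1 values are just illustrative settings):

```python
def bm25_tf(freq, k1):
    # BM25 term frequency component
    return (freq * (k1 + 1)) / (freq + k1)

# How much more is a word worth when mentioned 10 times vs once?
for k1 in [0.5, 1.5, 3.0]:
    gain = bm25_tf(10, k1) / bm25_tf(1, k1)
    print(f"k1={k1}: 10 mentions are worth {gain:.2f}x one mention")
```

With low k1 (0.5) ten mentions are worth only ~1.43x a single mention; with high k1 (3.0) they are worth ~3.08x — repetition keeps paying off longer.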
Parameter b: "Don't Reward Rambling" (Tax)
What it controls: How much we penalize document length.
| Direction | Effect |
|---|---|
| Higher b | More tax → penalize long docs more → short focused docs rank better |
| Lower b | Less tax → long comprehensive docs can rank better |
Analogy: "Higher tax = you pay more for being long."
Sticky phrase: "Density matters, not just count" — a short article that's 10% about cats is more focused than a long book that mentions cats the same number of times.
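The "tax" enters BM25 as a length-normalization factor that multiplies k1 in the TF denominator. A minimal sketch (`length_factor` is a name made up for this note):

```python
def length_factor(doc_len, avg_len, b):
    # 1.0 for an average-length doc; grows (more tax) for longer docs when b > 0
    return 1 - b + b * (doc_len / avg_len)

avg_len = 100
for b in [0.0, 0.75, 1.0]:
    short = length_factor(50, avg_len, b)
    longer = length_factor(300, avg_len, b)
    print(f"b={b}: 50-word factor {short:.2f}, 300-word factor {longer:.2f}")
```

A larger factor inflates the TF denominator and shrinks the score: at b=0 length is ignored entirely, while at b=1 a document three times the average length pays the full 3x tax.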
Visualization of Both Parameters
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
freq = np.arange(1, 31)

# Plot 1: k1 effect (term frequency saturation)
for k1 in [0.5, 1.5, 3.0]:
    scores = (freq * (k1 + 1)) / (freq + k1)
    axes[0].plot(freq, scores, label=f'k1={k1}', linewidth=2)
axes[0].set_xlabel('Times word appears')
axes[0].set_ylabel('Score')
axes[0].set_title('k1: How fast do we stop caring?')
axes[0].legend()
axes[0].grid(True)

# Plot 2: b effect (document length penalty)
doc_lengths = np.linspace(10, 200, 50)
avg_len = 100
for b in [0, 0.5, 1.0]:
    penalty = 1 - b + b * (doc_lengths / avg_len)
    axes[1].plot(doc_lengths, penalty, label=f'b={b}', linewidth=2)
axes[1].axvline(x=avg_len, color='gray', linestyle=':', label='avg length')
axes[1].set_xlabel('Document length (words)')
axes[1].set_ylabel('Penalty factor')
axes[1].set_title('b: How much do we penalize long docs?')
axes[1].legend()
axes[1].grid(True)

plt.tight_layout()
plt.show()
Tuning Guide: Quick Reference
| Problem | Which knob? | Direction | Analogy |
|---|---|---|---|
| Keyword stuffing/spam | k1 | ↓ Decrease | Less patience |
| Long docs always winning | b | ↑ Increase | More tax |
| Short docs always winning | b | ↓ Decrease | Less tax |
| Repetition rewarded too much | k1 | ↓ Decrease | Less patience |
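Both knobs meet in the per-term BM25 score. Here's a minimal sketch of the standard Okapi form (function and variable names are my own, and real libraries differ slightly in their IDF smoothing):

```python
import math

def bm25_term_score(freq, doc_len, avg_len, n_docs, df, k1=1.5, b=0.75):
    """Contribution of one query term to one document's BM25 score."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))  # smoothed IDF
    tax = 1 - b + b * (doc_len / avg_len)                 # length penalty (knob b)
    tf = (freq * (k1 + 1)) / (freq + k1 * tax)            # saturating TF (knob k1)
    return idf * tf

# Same raw term count (5) in a short vs a long document
short = bm25_term_score(freq=5, doc_len=50, avg_len=100, n_docs=1000, df=10)
longer = bm25_term_score(freq=5, doc_len=500, avg_len=100, n_docs=1000, df=10)
print(f"short doc: {short:.2f}, long doc: {longer:.2f}")
```

With the defaults the shorter document scores higher for the same raw count; set b=0 and the two scores become identical — which is exactly the "less tax" lever from the table above.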
Scenario-Based Examples
Example 1: Customer Support Tickets
Problem: Angry customers who repeat "WRONG wrong wrong!" rank too high; short professional tickets rank too low.
Solution:
- Lower k1 (less patience) → stop rewarding repetition
- Higher b (more tax) → penalize rambling complaints
Example 2: Recipe Search
Problems:
- "garlic garlic garlic" spam ranks too high
- Long recipes with backstories dominate
Solution:
- Lower k1 (less patience) → stop keyword stuffing
- Higher b (more tax) → penalize verbose recipes
Example 3: Legal Documents (Tricky!)
Problems:
- Short emails beat comprehensive 100-page legal briefs
- Some papers spam keywords 200 times
Solution:
- Lower b (less tax) → let comprehensive docs compete
- Lower k1 (less patience) → stop keyword spam
Example 4: Medical Research (Advanced!)
Problems:
- Short 1-page pamphlets beat 100-page clinical studies
- Low-quality papers spam keywords in references
- Short focused abstracts (200 words) should still rank well
Solution:
- Lower b (less tax) → helps BOTH short abstracts AND long studies against medium-length pamphlets
- Lower k1 (less patience) → stops keyword spam
Key insight: "Lowering tax doesn't just help long docs—it helps any document being unfairly compared to 'medium' length winners."
Practical Implementation
from langchain_community.retrievers import BM25Retriever

# Sample documents
docs = [
    "The cat sat on the mat",
    "The dog ran in the park",
    "Cats and dogs are popular pets",
    "The mat was red and fluffy"
]

# Create a retriever that returns the top k=2 matches
retriever = BM25Retriever.from_texts(docs, k=2)

# Adjust BM25 parameters on the underlying vectorizer
retriever.vectorizer.k1 = 1.2  # Less patience (faster saturation)
retriever.vectorizer.b = 0.8   # More tax (penalize long docs more)

# Query
results = retriever.invoke("cat on mat")
for r in results:
    print(r.page_content)
The cat sat on the mat
The mat was red and fluffy
Limitations & Trade-offs
- No universal perfect values — tuning depends entirely on your corpus and use case
- Mixed corpus challenge — if you have tweets AND legal documents, one setting won't fit both
- Empirical tuning required — start with defaults, test, observe, adjust
Practitioner strategies:
- Start with defaults (k1=1.5, b=0.75)
- Tune empirically based on search result quality
- Segment corpus by document type if needed
- Use relevance feedback from users
Quick Reference Card
TF-IDF
- TF = word count in document
- IDF = log(total docs / docs with term)
- Problem: Linear TF, no length normalization
BM25 Parameters
| | k1 | b |
|---|---|---|
| Name | "Stop counting obsessively" | "Don't reward rambling" |
| Analogy | Patience | Tax |
| Default | 1.5 | 0.75 |
| ↑ Increase | More patience with repetition | More penalty for length |
| ↓ Decrease | Stop caring about repetition sooner | Less penalty for length |
Sticky Analogies
| Concept | Analogy |
|---|---|
| TF saturation | Pizza slices — 10th slice doesn't add much |
| Log in IDF | Being unique in town vs 1000 vs 1001 people |
| k1 | Patience — how long before tuning out |
| b | Tax — penalty you pay for being long |
| Document length normalization | "Don't reward rambling" |
| Term frequency saturation | "Stop counting obsessively" |