BM25 & TF-IDF: Complete Summary

TF-IDF Fundamentals

What is TF-IDF?

TF-IDF (Term Frequency–Inverse Document Frequency) measures how important a word is to a document in a collection.

Formula: TF-IDF = TF × IDF

Component	Definition	Intuition
TF (Term Frequency)	How often the word appears in this document	More occurrences → higher score
IDF (Inverse Document Frequency)	How rare the word is across all documents	Rarer words → higher score

Interview-ready: "TF-IDF scores a word high when it's frequent in the current document AND rare overall."

Why Log for IDF?

Problem: Without log, rare words get extreme scores (e.g., 10,000 for a word in 1 of 10,000 docs).

Solution: Log compresses large ratios.

Sticky analogy: "Being the only person in town with a rare skill is distinctive. But 1,000 vs 1,001 people knowing it? Barely matters."

Log captures diminishing returns: the jump from 1→2 docs matters more than 1000→1001 docs.

import math

raw_idf = 10000 / 1        # 10000
log_idf = math.log(10000)  # ~9.21

print(f"Raw IDF: {raw_idf}")
print(f"Log IDF: {log_idf:.2f}")

Raw IDF: 10000.0
Log IDF: 9.21

TF-IDF Worked Example

Documents:

Doc 1: "cat cat dog"
Doc 2: "dog bird"

Step 1: Term Frequency (TF)

Doc	"cat"	"dog"	"bird"
Doc 1	2/3 ≈ 0.67	1/3 ≈ 0.33	0
Doc 2	0	1/2 = 0.5	1/2 = 0.5

Step 2: Inverse Document Frequency (IDF)

Formula: IDF(t) = log(N / df_t) where N = total docs, df = docs containing term

Term	Calculation	IDF
"cat"	log(2/1)	0.69
"dog"	log(2/2)	0 (in both!)
"bird"	log(2/1)	0.69

Step 3: TF-IDF = TF × IDF

Doc	"cat"	"dog"	"bird"
Doc 1	0.67 × 0.69 = 0.46	0.33 × 0 = 0	—
Doc 2	—	0.5 × 0 = 0	0.5 × 0.69 = 0.35

Key insight: "dog" gets zero because it's in every document—it's not distinctive!

BM25 Term Frequency Formula

(freq × (k1 + 1)) / (freq + k1)

Why this works:

As freq grows, both numerator and denominator grow at nearly the same rate
The + k1 in denominator acts as a "brake"
Limit: As freq → ∞, formula approaches k1 + 1

def bm25_tf(freq, k1=1.5):
    """BM25 term frequency component"""
    return (freq * (k1 + 1)) / (freq + k1)

# Compare TF-IDF vs BM25 as frequency increases
print("freq  TF-IDF  BM25")
print("-" * 20)
for f in [1, 2, 5, 10, 20, 50]:
    tfidf_score = f
    bm25_score = bm25_tf(f)
    print(f"{f:3d}   {tfidf_score:3d}     {bm25_score:.3f}")

freq  TF-IDF  BM25
--------------------
  1     1     1.000
  2     2     1.429
  5     5     1.923
 10    10     2.174
 20    20     2.326
 50    50     2.427

Saturation Visualization

import matplotlib.pyplot as plt
import numpy as np

freq = np.arange(1, 51)
bm25_scores = bm25_tf(freq, k1=1.5)
tfidf_scores = freq

plt.plot(freq, tfidf_scores, label='TF-IDF (linear)', linestyle='--')
plt.plot(freq, bm25_scores, label='BM25 (saturating)', linewidth=2)
plt.xlabel('Term Frequency')
plt.ylabel('Score')
plt.legend()
plt.grid(True)
plt.title('BM25 vs TF-IDF Saturation')
plt.show()

Sticky analogy (Saturation): "The first slice of pizza is amazing, the second is great, but by the 10th slice... you're not getting much extra satisfaction!"

BM25 Parameters: The Two Knobs

Overview Table

Parameter	Intuitive Name	Analogy	Default
k1	"Stop counting obsessively"	Patience — how long we keep listening to repetition	1.5
b	"Don't reward rambling"	Tax — penalty on long documents	0.75

Interview-ready: "k1 controls term frequency saturation (patience with repetition), b controls document length normalization (tax on long docs)."

Parameter k1: "Stop Counting Obsessively" (Patience)

What it controls: How quickly we stop caring about repeated mentions.

Direction	Effect
Higher k1	More patience → saturation slower → repetition keeps mattering longer
Lower k1	Less patience → saturation faster → stop rewarding spam quickly

Analogy: "Higher patience = you listen longer before tuning out."

Parameter b: "Don't Reward Rambling" (Tax)

What it controls: How much we penalize document length.

Direction	Effect
Higher b	More tax → penalize long docs more → short focused docs rank better
Lower b	Less tax → long comprehensive docs can rank better

Analogy: "Higher tax = you pay more for being long."

Sticky phrase: "Density matters, not just count" — a short article that's 10% about cats is more focused than a long book that mentions cats the same number of times.

Visualization of Both Parameters

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

freq = np.arange(1, 31)

# Plot 1: k1 effect
for k1 in [0.5, 1.5, 3.0]:
    scores = (freq * (k1 + 1)) / (freq + k1)
    axes[0].plot(freq, scores, label=f'k1={k1}', linewidth=2)

axes[0].set_xlabel('Times word appears')
axes[0].set_ylabel('Score')
axes[0].set_title('k1: How fast do we stop caring?')
axes[0].legend()
axes[0].grid(True)

# Plot 2: b effect (document length)
doc_lengths = np.linspace(10, 200, 50)
avg_len = 100

for b in [0, 0.5, 1.0]:
    penalty = 1 - b + b * (doc_lengths / avg_len)
    axes[1].plot(doc_lengths, penalty, label=f'b={b}', linewidth=2)

axes[1].axvline(x=100, color='gray', linestyle=':', label='avg length')
axes[1].set_xlabel('Document length (words)')
axes[1].set_ylabel('Penalty factor')
axes[1].set_title('b: How much do we penalize long docs?')
axes[1].legend()
axes[1].grid(True)

plt.tight_layout()
plt.show()

Tuning Guide: Quick Reference

Problem	Which knob?	Direction	Analogy
Keyword stuffing/spam	k1	↓ Decrease	Less patience
Long docs always winning	b	↑ Increase	More tax
Short docs always winning	b	↓ Decrease	Less tax
Repetition rewarded too much	k1	↓ Decrease	Less patience

Scenario-Based Examples

Example 1: Customer Support Tickets

Problem: Angry customers who repeat "WRONG wrong wrong!" rank too high; short professional tickets rank too low.

Solution:

Lower k1 (less patience) → stop rewarding repetition
Higher b (more tax) → penalize rambling complaints

Example 2: Recipe Search

Problems:

"garlic garlic garlic" spam ranks too high
Long recipes with backstories dominate

Solution:

Lower k1 (less patience) → stop keyword stuffing
Higher b (more tax) → penalize verbose recipes

Example 3: Legal Documents (Tricky!)

Problems:

Short emails beat comprehensive 100-page legal briefs
Some papers spam keywords 200 times

Solution:

Lower b (less tax) → let comprehensive docs compete
Lower k1 (less patience) → stop keyword spam

Example 4: Medical Research (Advanced!)

Problems:

Short 1-page pamphlets beat 100-page clinical studies
Low-quality papers spam keywords in references
Short focused abstracts (200 words) should still rank well

Solution:

Lower b (less tax) → helps BOTH short abstracts AND long studies against medium-length pamphlets
Lower k1 (less patience) → stops keyword spam

Key insight: "Lowering tax doesn't just help long docs—it helps any document being unfairly compared to 'medium' length winners."

Practical Implementation

from langchain_community.retrievers import BM25Retriever

# Sample documents
docs = [
    "The cat sat on the mat",
    "The dog ran in the park", 
    "Cats and dogs are popular pets",
    "The mat was red and fluffy"
]

# Create retriever with custom parameters
retriever = BM25Retriever.from_texts(docs, k=2)

# Adjust BM25 parameters
retriever.vectorizer.k1 = 1.2  # Less patience (faster saturation)
retriever.vectorizer.b = 0.8   # More tax (penalize long docs more)

# Query
results = retriever.invoke("cat on mat")
for r in results:
    print(r.page_content)

The cat sat on the mat
The mat was red and fluffy

Limitations & Trade-offs

No universal perfect values — tuning depends entirely on your corpus and use case
Mixed corpus challenge — if you have tweets AND legal documents, one setting won't fit both
Empirical tuning required — start with defaults, test, observe, adjust

Practitioner strategies:

Start with defaults (k1=1.5, b=0.75)
Tune empirically based on search result quality
Segment corpus by document type if needed
Use relevance feedback from users

Quick Reference Card

TF-IDF

TF = word count in document
IDF = log(total docs / docs with term)
Problem: Linear TF, no length normalization

BM25 Parameters

	k1	b
Name	"Stop counting obsessively"	"Don't reward rambling"
Analogy	Patience	Tax
Default	1.5	0.75
↑ Increase	More patience with repetition	More penalty for length
↓ Decrease	Stop caring about repetition sooner	Less penalty for length

Sticky Analogies

Concept	Analogy
TF saturation	Pizza slices — 10th slice doesn't add much
Log in IDF	Being unique in town vs 1000 vs 1001 people
k1	Patience — how long before tuning out
b	Tax — penalty you pay for being long
Document length normalization	"Don't reward rambling"
Term frequency saturation	"Stop counting obsessively"

Understanding BM25 and TF-IDF

BM25 & TF-IDF: Complete Summary

TF-IDF Fundamentals

What is TF-IDF?

Why Log for IDF?

TF-IDF Worked Example

BM25 Term Frequency Formula

Saturation Visualization

BM25 Parameters: The Two Knobs

Overview Table

Parameter k1: "Stop Counting Obsessively" (Patience)

Parameter b: "Don't Reward Rambling" (Tax)

Visualization of Both Parameters

Tuning Guide: Quick Reference

Scenario-Based Examples

Example 1: Customer Support Tickets

Example 2: Recipe Search

Example 3: Legal Documents (Tricky!)

Example 4: Medical Research (Advanced!)

Practical Implementation

Limitations & Trade-offs

Quick Reference Card

TF-IDF

BM25 Parameters

Sticky Analogies