Classification Metrics

January 3, 2026
Machine Learning · Metrics · Model Evaluation

📊 Classification Metrics: Complete Reference Guide

A comprehensive exploration of 16+ classification metrics with intuitive analogies, formulas, and practical code


Table of Contents

  1. Business Context: Spam Detection
  2. The Problem with Accuracy
  3. Confusion Matrix
  4. Precision
  5. Recall (Sensitivity)
  6. F1 Score
  7. F-beta Score
  8. Threshold Optimization
  9. Specificity & Related Rates
  10. Top-k Accuracy
  11. Average Precision & MAP
  12. ROC AUC
  13. PR AUC
  14. Balanced Accuracy
  15. Cohen's Kappa
  16. Hamming Loss
  17. Jaccard Index
  18. Matthews Correlation Coefficient (MCC)
  19. Quick Reference

1. Business Context: Spam Detection

The Problem Setup

We're building a spam detection model with an imbalanced dataset:

  • Not Spam (Class 0) ≈ 70% of data
  • Spam (Class 1) ≈ 30% of data

Key Insight

When data is imbalanced, accuracy alone is misleading. A model that predicts "Not Spam" for everything achieves ~70% accuracy while being completely useless!

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                             f1_score, confusion_matrix, roc_auc_score, 
                             roc_curve, precision_recall_curve, 
                             average_precision_score, ConfusionMatrixDisplay)
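
To put the key insight in numbers, here is a minimal sketch (hypothetical 70/30 data, variable names illustrative) of the do-nothing baseline:

# Baseline accuracy on imbalanced data
y_true_baseline = np.array([0] * 700 + [1] * 300)   # 70% not spam, 30% spam
y_pred_baseline = np.zeros_like(y_true_baseline)    # "lazy" model: always predict Not Spam

print(f"Baseline accuracy: {accuracy_score(y_true_baseline, y_pred_baseline):.1%}")
# Baseline accuracy: 70.0%  -- a high score while catching zero spam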

3. Confusion Matrix

The Four Quadrants

|                      | Predicted: Negative (0) | Predicted: Positive (1) |
|----------------------|-------------------------|-------------------------|
| Actual: Negative (0) | True Negative (TN) ✅   | False Positive (FP) ❌  |
| Actual: Positive (1) | False Negative (FN) ❌  | True Positive (TP) ✅   |

Definitions

  • True Positive (TP): $\hat{y} = 1$, $y = 1$ — Correctly identified spam
  • True Negative (TN): $\hat{y} = 0$, $y = 0$ — Correctly identified non-spam
  • False Positive (FP): $\hat{y} = 1$, $y = 0$ — Wrongly marked as spam (Type I Error)
  • False Negative (FN): $\hat{y} = 0$, $y = 1$ — Spam slipped through (Type II Error)

📧 Real-World Impact:

  • FP: Your boss's promotion email goes to spam folder! 😬
  • FN: A phishing attack reaches your inbox! 😱
# Confusion Matrix Code
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Example predictions
y_true = [0, 0, 0, 1, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0, 1, 0]

# Create confusion matrix
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)

# Extract values
TN, FP, FN, TP = cm.ravel()
print(f"\nTP={TP}, TN={TN}, FP={FP}, FN={FN}")
Confusion Matrix:
[[4 1]
 [1 4]]

TP=4, TN=4, FP=1, FN=1


4. Precision

Formula

$$\text{Precision} = \frac{TP}{TP + FP}$$

The Question It Answers

"When the model says 'Spam', how often is it actually spam?"

👮 Sticky Analogy: The Overeager Security Guard

A paranoid guard stops 100 people, accusing all of being intruders:

  • 20 were actually intruders ✅
  • 80 had valid tickets ❌

Precision = 20/100 = 20% — Only 20% of his accusations were correct!

When to Use Precision

When False Positives are costly:

  • Email spam filter (don't block important emails!)
  • Recommendation systems (don't annoy users with bad recommendations)
# Precision Calculation
from sklearn.metrics import precision_score

# Scratch implementation
def precision_calc(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    tp = cm[1, 1]
    fp = cm[0, 1]
    return tp / (tp + fp)

# Using sklearn
precision = precision_score(y_true, y_pred)
print(f"Precision: {precision:.3f}")
print(f"Interpretation: When model says 'Spam', it's correct {precision*100:.1f}% of the time")
Precision: 0.800
Interpretation: When model says 'Spam', it's correct 80.0% of the time


5. Recall (Sensitivity)

Formula

$$\text{Recall} = \frac{TP}{TP + FN}$$

The Question It Answers

"Of all the actual spam emails, how many did we catch?"

😴 Sticky Analogy: The Lazy Security Guard

A conservative guard only stops people he's absolutely sure are intruders:

  • 50 actual intruders tried to sneak in
  • He only stopped 10 (all genuine intruders — great precision!)
  • But 40 intruders walked right past him!

Recall = 10/50 = 20% — He only caught 20% of the intruders!

When to Use Recall

When False Negatives are costly:

  • Cancer detection (don't miss a patient with cancer!)
  • Fraud detection (don't let fraud slip through!)
  • Airport security (don't let dangerous people board!)
# Recall Calculation
from sklearn.metrics import recall_score

# Scratch implementation
def recall_calc(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    tp = cm[1, 1]
    fn = cm[1, 0]
    return tp / (tp + fn)

# Using sklearn
recall = recall_score(y_true, y_pred)
print(f"Recall: {recall:.3f}")
print(f"Interpretation: We caught {recall*100:.1f}% of all actual spam")
Recall: 0.800
Interpretation: We caught 80.0% of all actual spam

Precision vs Recall: Key Difference

| Metric    | Question                        | Looks at...           |
|-----------|---------------------------------|-----------------------|
| Precision | "When I say Spam, am I right?"  | Predicted Spam column |
| Recall    | "Did I catch all the Spam?"     | Actually Spam row     |

6. F1 Score

The Problem: Precision-Recall Trade-off

You can game individual metrics:

  • High Recall, Low Precision: Predict everything as positive
  • High Precision, Low Recall: Only predict when 100% confident

Formula (Harmonic Mean)

$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

⚖️ Sticky Analogy: The Balanced Diet

If you eat only protein (Precision) or only carbs (Recall), you're unhealthy. F1 Score is like your nutritionist saying: "You need BOTH in balance!"

The harmonic mean punishes imbalance — having 90% precision but 10% recall gives you a low F1.

Why Harmonic Mean?

  • Arithmetic mean of 90% and 10% = 50% (too generous!)
  • Harmonic mean of 90% and 10% = 18% (exposes the weakness!)
# F1 Score Calculation
from sklearn.metrics import f1_score

f1 = f1_score(y_true, y_pred)
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1 Score: {f1:.3f}")
Precision: 0.800
Recall: 0.800
F1 Score: 0.800
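
To see how harshly the harmonic mean punishes imbalance, here is a quick sketch comparing it with the arithmetic mean for the 90%/10% case above (pure arithmetic, no dataset involved):

# Harmonic vs arithmetic mean for an imbalanced precision/recall pair
p, r = 0.90, 0.10

arithmetic = (p + r) / 2
harmonic = 2 * p * r / (p + r)   # exactly the F1 formula

print(f"Arithmetic mean: {arithmetic:.2f}")   # 0.50 -- looks acceptable
print(f"Harmonic mean (F1): {harmonic:.2f}")  # 0.18 -- exposes the weak recall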


7. F-beta Score

When F1 Isn't Enough

What if you care more about one type of error?

Formula

$$F_\beta = (1 + \beta^2) \times \frac{\text{Precision} \times \text{Recall}}{(\beta^2 \times \text{Precision}) + \text{Recall}}$$

Understanding Beta Values

| Beta   | Emphasis                      | Example Use Case                          |
|--------|-------------------------------|-------------------------------------------|
| β = 0.5 | Precision 2x more important  | Spam filtering (don't block good emails)  |
| β = 1.0 | Equal weight (F1 Score)      | Balanced importance                       |
| β = 2.0 | Recall 2x more important     | Cancer screening (don't miss any cases)   |

🏥 Sticky Analogy: The Cancer Screening Dial

  • β < 1: "I'd rather miss some cancer cases than cause panic with false alarms"
  • β = 1: "Both errors are equally bad"
  • β > 1: "I'd rather have false alarms than miss a single cancer case!"
# F-beta Score Calculation
from sklearn.metrics import fbeta_score

f05 = fbeta_score(y_true, y_pred, beta=0.5)  # Precision-focused
f1 = fbeta_score(y_true, y_pred, beta=1.0)   # Balanced
f2 = fbeta_score(y_true, y_pred, beta=2.0)   # Recall-focused

print(f"F0.5 (Precision-focused): {f05:.3f}")
print(f"F1 (Balanced): {f1:.3f}")
print(f"F2 (Recall-focused): {f2:.3f}")
F0.5 (Precision-focused): 0.800
F1 (Balanced): 0.800
F2 (Recall-focused): 0.800
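
Because precision and recall are both 0.8 in this toy example, all three scores coincide. A small sketch computed straight from the F-beta formula, with illustrative precision/recall values of 0.9 and 0.3, shows how beta shifts the score:

# F-beta computed directly from precision and recall
def f_beta(precision, recall, beta):
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.90, 0.30  # high precision, low recall (illustrative values)
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: {f_beta(p, r, beta):.3f}")
# beta=0.5: 0.643  (rewards the high precision)
# beta=1.0: 0.450
# beta=2.0: 0.346  (punished for the low recall)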


8. Threshold Optimization

The Key Insight

Models output probabilities (0.7, 0.3, etc.). The default threshold is 0.5, but this isn't always optimal!

How Threshold Affects Metrics

| Threshold          | Effect                                              |
|--------------------|-----------------------------------------------------|
| Higher (e.g., 0.7) | More conservative → Higher Precision, Lower Recall  |
| Lower (e.g., 0.3)  | More aggressive → Higher Recall, Lower Precision    |

🦁 Sticky Analogy: The Cautious Banker

"I'd rather reject 10 good customers than give a loan to 1 person who won't pay me back."

Lower threshold = More paranoid = Catches more defaulters = Higher recall

# Threshold Optimization Code
def find_optimal_threshold(y_true, y_scores, metric='f1'):
    """Find threshold that maximizes chosen metric"""
    thresholds = np.linspace(0.01, 0.99, 50)
    best_score, best_threshold = -1, -1
    
    for threshold in thresholds:
        y_pred = (np.array(y_scores) >= threshold).astype(int)
        
        if metric == 'f1':
            score = f1_score(y_true, y_pred)
        elif metric == 'recall':
            score = recall_score(y_true, y_pred)
        elif metric == 'precision':
            score = precision_score(y_true, y_pred)
        
        if score > best_score:
            best_score = score
            best_threshold = threshold
    
    return best_threshold, best_score

# Example
y_scores_example = [0.9, 0.2, 0.8, 0.7, 0.6, 0.4, 0.9, 0.3, 0.85, 0.1]
y_true_example = [1, 0, 1, 1, 1, 0, 1, 0, 1, 0]

threshold, score = find_optimal_threshold(y_true_example, y_scores_example, 'f1')
print(f"Optimal threshold for F1: {threshold:.2f} (F1 Score: {score:.3f})")
Optimal threshold for F1: 0.41 (F1 Score: 1.000)


9. Specificity & Related Rates

Aliases and Formulas

| Metric                    | Formula              | Also Called                      | The Question                                              |
|---------------------------|----------------------|----------------------------------|-----------------------------------------------------------|
| True Positive Rate (TPR)  | $\frac{TP}{TP+FN}$   | Sensitivity, Recall              | "Of actual positives, how many did we catch?"             |
| False Positive Rate (FPR) | $\frac{FP}{FP+TN}$   | 1 - Specificity, Fall-out        | "Of actual negatives, how many did we wrongly flag?"      |
| Specificity               | $\frac{TN}{TN+FP}$   | True Negative Rate (TNR)         | "Of actual negatives, how many did we correctly identify?" |
| Precision                 | $\frac{TP}{TP+FP}$   | Positive Predictive Value (PPV)  | "When we say 'yes', how often are we right?"              |

Worked Example

100 people: 10 sick, 90 healthy

Test results:

  • 8 sick correctly identified (TP)
  • 2 sick missed (FN)
  • 15 healthy wrongly flagged (FP)
  • 75 healthy correctly identified (TN)

| Metric        | Calculation  | Result |
|---------------|--------------|--------|
| TPR (Recall)  | 8/(8+2)      | 80%    |
| FPR           | 15/(15+75)   | 16.7%  |
| Specificity   | 75/(75+15)   | 83.3%  |
| Precision     | 8/(8+15)     | 34.8%  |
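
The same numbers can be reproduced in code. A minimal sketch using the counts from the worked example above (computed directly from the four counts rather than a dedicated sklearn helper):

# Specificity and related rates from the 100-person screening example
tp, fn, fp, tn = 8, 2, 15, 75

tpr = tp / (tp + fn)          # sensitivity / recall
fpr = fp / (fp + tn)          # fall-out
specificity = tn / (tn + fp)  # true negative rate
ppv = tp / (tp + fp)          # precision

print(f"TPR (Recall): {tpr:.1%}")         # 80.0%
print(f"FPR: {fpr:.1%}")                  # 16.7%
print(f"Specificity: {specificity:.1%}")  # 83.3%
print(f"Precision: {ppv:.1%}")            # 34.8%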

10. Top-k Accuracy

The Concept

Top-k accuracy gives credit if the correct answer appears anywhere in the top k predictions.

🎮 Sticky Analogy: The Quiz Show with Multiple Guesses

"What breed is this dog?"

  • Top-1: "Labrador" ❌ (wrong)
  • Top-3: "Labrador, Golden Retriever, German Shepherd" ✅ (correct answer in top 3!)

Top-1 accuracy = 0%, Top-3 accuracy = 100%

When to Use

  • Multi-class problems with similar classes (dog breeds, product categories)
  • Search engines and recommendation systems
  • When being "close" matters
# Top-k Accuracy
from sklearn.metrics import top_k_accuracy_score

# Example: 3-class problem
y_true_multi = [0, 1, 2, 0, 1]
y_scores_multi = [
    [0.8, 0.1, 0.1],  # Class 0 predicted
    [0.4, 0.3, 0.3],  # Class 0 predicted (wrong!)
    [0.2, 0.3, 0.5],  # Class 2 predicted
    [0.6, 0.3, 0.1],  # Class 0 predicted
    [0.1, 0.8, 0.1],  # Class 1 predicted
]

top1 = top_k_accuracy_score(y_true_multi, y_scores_multi, k=1)
top2 = top_k_accuracy_score(y_true_multi, y_scores_multi, k=2)

print(f"Top-1 Accuracy: {top1:.1%}")
print(f"Top-2 Accuracy: {top2:.1%}")
Top-1 Accuracy: 80.0%
Top-2 Accuracy: 80.0%


11. Average Precision & MAP

The Problem with Precision@k

Precision@k doesn't capture ranking order!

| Engine | Results       | Precision@5 |
|--------|---------------|-------------|
| A      | 🍕🍕🍕❌❌   | 3/5 = 60%   |
| B      | ❌❌🍕🍕🍕   | 3/5 = 60%   |

Same score, but Engine A is clearly better — good results first!

🎣 Sticky Analogy: The Fishing Competition

You're fishing for golden fish 🐠 among regular fish 🐟.

  • Precision@k: "After k catches, how many are golden?"
  • AP: "Average your precision ONLY when you catch a golden fish"
  • MAP: "Your average score across multiple fishing trips"

The Hand Trick 🖐️

  • Precision@k = Count ALL fingers up to k → "How many have rings?"
  • AP = Only look at fingers WITH rings → "Average the precision at each ring's position"
  • MAP = Do this for BOTH hands → "Average across multiple hands"

Formula

$$AP = \frac{1}{\text{Total Relevant}} \sum_{k=1}^{n} (Precision@k \times Relevance_k)$$

# Average Precision
from sklearn.metrics import average_precision_score

y_true_ap = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0]
y_scores_ap = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

ap = average_precision_score(y_true_ap, y_scores_ap)
print(f"Average Precision: {ap:.3f}")
Average Precision: 0.756
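
MAP is simply the mean of AP across queries (the fishing trips in the analogy). A minimal sketch with two hypothetical queries, reusing average_precision_score per query:

# Mean Average Precision (MAP): average AP across queries
queries = [
    # (relevance labels, model scores) per query -- hypothetical data
    ([1, 0, 1, 0, 1, 0, 0, 0, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]),
    ([0, 1, 1, 0, 0],                [0.9, 0.8, 0.7, 0.6, 0.5]),
]

ap_per_query = [average_precision_score(y, s) for y, s in queries]
map_score = np.mean(ap_per_query)

print(f"AP per query: {[round(ap, 3) for ap in ap_per_query]}")  # [0.756, 0.583]
print(f"MAP: {map_score:.3f}")                                   # 0.669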


12. ROC AUC

What is ROC?

ROC (Receiver Operating Characteristic) plots TPR vs FPR at different thresholds.

The Probabilistic Interpretation ✨

"If I randomly pick ONE positive and ONE negative, what's the probability my model ranks the positive higher?"

| AUC  | Meaning                                  |
|------|------------------------------------------|
| 1.0  | Perfect — positive ALWAYS ranked higher  |
| 0.5  | Random — 50/50 coin flip                 |
| 0.85 | Good — 85% chance positive ranked higher |

🎴 Sticky Analogy: The Card Sorting Game

You have 3 red cards ♥️ (positives) and 7 blue cards 💙 (negatives).

Perfect model (AUC = 1.0):

♥️ ♥️ ♥️ 💙 💙 💙 💙 💙 💙 💙

All reds before all blues!

Random model (AUC = 0.5):

💙 ♥️ 💙 💙 ♥️ 💙 💙 ♥️ 💙 💙

Mixed randomly.

# ROC AUC Calculation and Visualization
from sklearn.metrics import roc_auc_score, roc_curve

# Hospital example
y_true_roc = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0]
y_scores_roc = [0.95, 0.85, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]

# Calculate AUC
auc = roc_auc_score(y_true_roc, y_scores_roc)
print(f"AUC: {auc:.2f}")

# Get curve points
fpr, tpr, thresholds = roc_curve(y_true_roc, y_scores_roc)

# Plot
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC Curve (AUC = {auc:.2f})')
plt.fill_between(fpr, tpr, alpha=0.3)
plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.5)')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.show()
AUC: 0.86
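
The probabilistic interpretation can be checked by brute force on the same data: count the fraction of (positive, negative) pairs where the positive example receives the higher score. A minimal sketch:

# AUC as the probability that a random positive outranks a random negative
pos_scores = [s for s, y in zip(y_scores_roc, y_true_roc) if y == 1]
neg_scores = [s for s, y in zip(y_scores_roc, y_true_roc) if y == 0]

wins = sum(p > n for p in pos_scores for n in neg_scores)   # positive ranked higher
ties = sum(p == n for p in pos_scores for n in neg_scores)  # ties count half
pairs = len(pos_scores) * len(neg_scores)

auc_by_pairs = (wins + 0.5 * ties) / pairs
print(f"Pairwise AUC: {auc_by_pairs:.2f}")  # 0.86 -- matches roc_auc_score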

Interpreting AUC

| AUC Range | Interpretation             |
|-----------|----------------------------|
| 0.5       | No better than random      |
| 0.7-0.8   | Acceptable discrimination  |
| 0.8-0.9   | Excellent discrimination   |
| > 0.9     | Outstanding discrimination |

13. PR AUC

When ROC AUC Lies

ROC AUC can be misleading on highly imbalanced data because the FPR denominator (FP + TN) is dominated by the huge negative class, so even a large number of false positives barely moves the FPR.

💎 Sticky Analogy: The Gem Inspector

Mining operation: 1000 samples, only 10 diamonds, 990 rocks.

Inspector A: Finds 8 diamonds, also flags 12 rocks as diamonds

  • FPR = 12/990 = 1.2% — looks amazing!

But wait:

  • Precision = 8/20 = 40% — exposes the problem!

"Precision only cares about YOUR picks — it doesn't care how big the haystack is."

When to Use

| Metric  | Best For                                |
|---------|-----------------------------------------|
| ROC AUC | Balanced or moderately imbalanced data  |
| PR AUC  | Highly imbalanced data (rare events)    |
# PR AUC Calculation
from sklearn.metrics import precision_recall_curve, auc

precision_vals, recall_vals, _ = precision_recall_curve(y_true_roc, y_scores_roc)
pr_auc = auc(recall_vals, precision_vals)  # trapezoidal area under the PR curve

# Note: the `auc` import shadows the earlier `auc` variable, so recompute ROC AUC explicitly
print(f"ROC AUC: {roc_auc_score(y_true_roc, y_scores_roc):.2f}")
print(f"PR AUC: {pr_auc:.2f}")

# Plot PR Curve
plt.figure(figsize=(8, 6))
plt.plot(recall_vals, precision_vals, 'g-', linewidth=2, label=f'PR Curve (AUC = {pr_auc:.2f})')
plt.fill_between(recall_vals, precision_vals, alpha=0.3, color='green')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc='lower left')
plt.grid(True, alpha=0.3)
plt.show()


14. Balanced Accuracy

The Problem

Regular accuracy is skewed by the majority class.

Formula

$$\text{Balanced Accuracy} = \frac{\text{TPR} + \text{TNR}}{2} = \frac{\text{Sensitivity} + \text{Specificity}}{2}$$

⚖️ Sticky Analogy: The Fair Judge

A judge evaluates performers from two cities:

  • City A: 100 performers
  • City B: 10 performers

Regular accuracy: City A dominates the score

Balanced accuracy: "How well did you do on City A? How well on City B? Average those!"

When to Use

  • Imbalanced datasets
  • When performance on both classes matters equally
# Balanced Accuracy
from sklearn.metrics import balanced_accuracy_score

# Imbalanced example
y_true_imb = [0]*90 + [1]*10
y_pred_imb = [0]*100  # Lazy model predicts all 0

acc = accuracy_score(y_true_imb, y_pred_imb)
bal_acc = balanced_accuracy_score(y_true_imb, y_pred_imb)

print(f"Regular Accuracy: {acc:.1%}")
print(f"Balanced Accuracy: {bal_acc:.1%}")
print("\nBalanced accuracy exposes the lazy model!")
Regular Accuracy: 90.0%
Balanced Accuracy: 50.0%

Balanced accuracy exposes the lazy model!


15. Cohen's Kappa

The Problem

Simple agreement percentage doesn't account for agreement by chance.

Two lazy labelers guessing "Cat" for everything = 100% agreement!

Formula

$$\kappa = \frac{\text{Observed Agreement} - \text{Expected by Chance}}{1 - \text{Expected by Chance}}$$

🎲 Sticky Analogy: The Honest Referee

Two doctors diagnosing 100 patients:

  • They agree on 75 cases → 75% agreement
  • But if both say "sick" 50% of the time randomly, chance agreement = 50%

Kappa = (75% - 50%) / (100% - 50%) = 0.50

"Kappa only gives points for REAL teamwork, not lucky coincidences!"

Interpretation

| Kappa   | Meaning                  |
|---------|--------------------------|
| < 0     | Worse than chance        |
| 0.0     | Same as random guessing  |
| 0.4-0.6 | Moderate agreement       |
| 0.6-0.8 | Substantial agreement    |
| 0.8-1.0 | Almost perfect agreement |
# Cohen's Kappa
from sklearn.metrics import cohen_kappa_score

rater1 = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
rater2 = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0]

kappa = cohen_kappa_score(rater1, rater2)
print(f"Cohen's Kappa: {kappa:.3f}")
Cohen's Kappa: 0.600
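
A scratch version following the formula above reproduces the sklearn value for these two raters (a minimal sketch for binary labels):

# Cohen's kappa from scratch: observed vs chance agreement
r1, r2 = np.array(rater1), np.array(rater2)

observed = np.mean(r1 == r2)              # fraction of cases they agree on
p1, p2 = r1.mean(), r2.mean()             # how often each rater says "1"
expected = p1 * p2 + (1 - p1) * (1 - p2)  # chance agreement for binary labels

kappa_scratch = (observed - expected) / (1 - expected)
print(f"Observed: {observed:.2f}, Expected by chance: {expected:.2f}, Kappa: {kappa_scratch:.3f}")
# Observed: 0.80, Expected by chance: 0.50, Kappa: 0.600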


16. Hamming Loss

The Multi-Label Problem

What if an example can have multiple labels at once?

Example: A movie can be Action + Comedy + Sci-Fi simultaneously!

Formula

$$\text{Hamming Loss} = \frac{\text{Incorrect Labels}}{\text{Total Labels}}$$

📚 Sticky Analogy: Document Classification

A "Quantum Mechanics" paper should be labeled: Math + Physics

Regular (exact-match) accuracy: One wrong label = FAIL the whole prediction

Hamming Loss: "You got 3 out of 4 labels right — that's 75%!"

"The Fair Teacher gives partial credit!"

Example

| Subject   | True | Predicted | Correct? |
|-----------|------|-----------|----------|
| Math      | ✅   | ✅        | ✅       |
| Physics   | ✅   | ❌        | ❌       |
| Chemistry | ❌   | ❌        | ✅       |
| Biology   | ❌   | ✅        | ❌       |

Hamming Loss = 2/4 = 0.50 (Lower is better)

# Hamming Loss
from sklearn.metrics import hamming_loss

# Multi-label example (each row = one document, each column = one label)
y_true_ml = [[1, 1, 0, 0],   # Math, Physics
             [1, 0, 1, 0],   # Math, Chemistry
             [0, 0, 0, 1]]   # Biology

y_pred_ml = [[1, 0, 0, 1],   # Math, Biology (missed Physics, added Biology)
             [1, 0, 1, 0],   # Correct!
             [0, 0, 1, 1]]   # Chemistry, Biology (added Chemistry)

loss = hamming_loss(y_true_ml, y_pred_ml)
print(f"Hamming Loss: {loss:.3f}")
print(f"Hamming Accuracy: {1-loss:.1%}")
Hamming Loss: 0.250
Hamming Accuracy: 75.0%
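
The same number falls straight out of the formula: count the mismatched label slots and divide by the total number of slots. A quick scratch check:

# Hamming loss from scratch: fraction of label slots that disagree
mismatches = np.not_equal(y_true_ml, y_pred_ml)
print(f"Wrong labels: {mismatches.sum()} / {mismatches.size}")  # 3 / 12
print(f"Hamming Loss: {mismatches.mean():.3f}")                 # 0.250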


17. Jaccard Index

The Problem with Hamming Loss

With many possible labels, Hamming Loss gets distracted by labels you correctly DIDN'T assign.

🛒 Sticky Analogy: E-commerce Product Tags

10,000 possible tags for a handbag:

  • True: {Leather, Black, Gucci}
  • Predicted: {Leather, Black, Prada}

Hamming Loss: "You got 9,998 tags right!" (misleading!)

Jaccard: "Of the tags that matter, you got 2/4 = 50%" (honest!)

Formula

$$\text{Jaccard} = \frac{|True \cap Predicted|}{|True \cup Predicted|} = \frac{\text{Intersection}}{\text{Union}}$$
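
The handbag example maps directly onto Python sets (tag names are illustrative); a minimal sketch:

# Jaccard on the handbag tags: intersection over union
true_tags = {"Leather", "Black", "Gucci"}
pred_tags = {"Leather", "Black", "Prada"}

jaccard_tags = len(true_tags & pred_tags) / len(true_tags | pred_tags)
print(f"Jaccard: {jaccard_tags:.2f}")  # 2/4 = 0.50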

Key Difference

| Metric       | Considers                                      |
|--------------|------------------------------------------------|
| Hamming Loss | ALL possible labels (even absent ones)         |
| Jaccard      | Only labels that APPEAR in true OR predicted   |
# Jaccard Score
from sklearn.metrics import jaccard_score

jaccard = jaccard_score(y_true_ml, y_pred_ml, average='samples')
print(f"Jaccard Score (samples): {jaccard:.3f}")
Jaccard Score (samples): 0.611


18. Matthews Correlation Coefficient (MCC)

The Problem MCC Solves

Remember all the issues?

  • Accuracy fails with imbalanced data
  • Precision ignores False Negatives
  • Recall ignores False Positives
  • F1 ignores True Negatives

MCC uses ALL FOUR values: TP, TN, FP, FN

Formula

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

🎯 Sticky Analogy: The Complete Judge

Imagine hiring a judge for a talent show:

  • Accuracy: "How many decisions were correct?" (fooled by imbalance)
  • Precision: "When you said YES, were you right?" (ignores missed talent)
  • Recall: "Did you find all the talent?" (ignores false alarms)
  • MCC: "Overall, how well did your YES/NO decisions correlate with actual talent?"

MCC is the complete judge — considers everything!

Interpretation

| MCC | Meaning                                         |
|-----|-------------------------------------------------|
| +1  | Perfect prediction                              |
| 0   | Random guessing                                 |
| -1  | Perfectly WRONG (flip predictions to be perfect!) |

Factory Example

1000 phones: 980 good, 20 defective

| Inspector | Strategy                                 | Accuracy | MCC  |
|-----------|------------------------------------------|----------|------|
| A (Lazy)  | Says "all good"                          | 98%      | 0    |
| B (Works) | Catches 18/20 defects, 30 false alarms   | 96.8%    | 0.57 |

MCC exposes that Inspector B is actually doing real work!

# Matthews Correlation Coefficient
from sklearn.metrics import matthews_corrcoef

# Imbalanced factory example
y_true_factory = [0]*980 + [1]*20  # 980 good, 20 defective

# Inspector A: Lazy (predicts all good)
y_pred_lazy = [0]*1000

# Inspector B: Works hard (catches 18 defects, 30 false alarms)
y_pred_works = [0]*950 + [1]*30 + [0]*2 + [1]*18  # Simplified

mcc_lazy = matthews_corrcoef(y_true_factory, y_pred_lazy)
acc_lazy = accuracy_score(y_true_factory, y_pred_lazy)

print(f"Inspector A (Lazy):")
print(f"  Accuracy: {acc_lazy:.1%}")
print(f"  MCC: {mcc_lazy:.2f}")
print(f"\nMCC = 0 exposes the lazy model as useless!")
Inspector A (Lazy):
  Accuracy: 98.0%
  MCC: 0.00

MCC = 0 exposes the lazy model as useless!
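
For completeness, the same check on Inspector B, reusing the y_pred_works array defined above; a minimal sketch:

# Inspector B: hard-working but imperfect
mcc_works = matthews_corrcoef(y_true_factory, y_pred_works)
acc_works = accuracy_score(y_true_factory, y_pred_works)

print("Inspector B (Works):")
print(f"  Accuracy: {acc_works:.1%}")  # 96.8% -- slightly lower than the lazy inspector
print(f"  MCC: {mcc_works:.2f}")       # 0.57 -- far from zero: real predictive signal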


Quick Reference

📋 Metric Selection Guide

| Scenario                              | Recommended Metric(s)       |
|---------------------------------------|-----------------------------|
| Balanced data, simple classification  | Accuracy                    |
| Imbalanced data                       | F1, Balanced Accuracy, MCC  |
| False Positives are costly            | Precision, F0.5             |
| False Negatives are costly            | Recall, F2                  |
| Comparing models (threshold-free)     | ROC AUC                     |
| Highly imbalanced rare events         | PR AUC                      |
| Multi-label classification            | Hamming Loss, Jaccard       |
| Inter-annotator agreement             | Cohen's Kappa               |
| Most robust single metric             | MCC                         |

🎯 Sticky Analogies Cheat Sheet

| Metric        | Analogy                                                        |
|---------------|----------------------------------------------------------------|
| Precision     | Overeager Security Guard — stops many but includes innocents   |
| Recall        | Lazy Security Guard — only stops those he's sure of, misses many |
| F1            | Balanced Diet — need both protein and carbs                    |
| F-beta        | Cancer Screening Dial — tune for your priority                 |
| ROC AUC       | Card Sorting Game — rank reds before blues                     |
| PR AUC        | Gem Inspector — honest about rare finds                        |
| AP/MAP        | Fishing Competition — precision at each golden fish            |
| Top-k         | Quiz Show — multiple guesses allowed                           |
| Cohen's Kappa | Honest Referee — no credit for lucky coincidences              |
| Hamming Loss  | Fair Teacher — partial credit for labels                       |
| Jaccard       | E-commerce Tags — focus on what matters                        |
| MCC           | Complete Judge — considers everything                          |

📊 Formula Quick Reference

| Metric       | Formula                                                      |
|--------------|--------------------------------------------------------------|
| Accuracy     | $\frac{TP + TN}{TP + TN + FP + FN}$                          |
| Precision    | $\frac{TP}{TP + FP}$                                         |
| Recall       | $\frac{TP}{TP + FN}$                                         |
| F1           | $2 \times \frac{Precision \times Recall}{Precision + Recall}$ |
| Specificity  | $\frac{TN}{TN + FP}$                                         |
| FPR          | $\frac{FP}{FP + TN}$                                         |
| Balanced Acc | $\frac{TPR + TNR}{2}$                                        |

🐍 Code Import Cheat Sheet

# All sklearn imports you need
from sklearn.metrics import (
    # Basic
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    fbeta_score,
    
    # Confusion Matrix
    confusion_matrix,
    ConfusionMatrixDisplay,
    
    # ROC & PR
    roc_auc_score,
    roc_curve,
    precision_recall_curve,
    average_precision_score,
    
    # Advanced
    balanced_accuracy_score,
    matthews_corrcoef,
    cohen_kappa_score,
    
    # Multi-label
    hamming_loss,
    jaccard_score,
    
    # Multi-class
    top_k_accuracy_score,
)

📌 Key Takeaways

  1. Never rely on accuracy alone — especially with imbalanced data
  2. Choose metrics based on business cost — what's worse, FP or FN?
  3. Use threshold-free metrics (AUC) for model comparison
  4. Use PR AUC for rare event detection
  5. MCC is the most robust single metric for binary classification
  6. Visualize the trade-offs — ROC and PR curves tell a story