Logistic Regression: Complete Study Notes
Topics Covered:
- Sklearn Implementation of Logistic Regression
- Accuracy Metric
- Hyperparameter Tuning & Regularization
- Log Odds / Logit
- Impact of Outliers
- Multiclass Classification
1. Sklearn Implementation of Logistic Regression
1.1 The Pipeline
Load Data → Select Features → Train/Val/Test Split → Scale Features → Fit Model → Evaluate
1.2 Feature Selection
Why select only some features?
- Relevance: Not all features predict the target (e.g., phone number doesn't predict churn)
- Multicollinearity: Correlated features cause instability
- Overfitting: Too many features → model memorizes noise
- Simplicity: Fewer features = easier interpretation
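For a concrete picture, here is a minimal pandas sketch of keeping only useful predictor columns. The toy DataFrame and column names are illustrative, not the actual churn dataset.
import pandas as pd
# Toy churn-style data (columns are illustrative)
df = pd.DataFrame({
    "Phone Number": ["555-0101", "555-0102", "555-0103", "555-0104"],
    "CustServ Calls": [1, 4, 0, 5],
    "Day Mins": [180.0, 265.5, 120.3, 310.2],
    "Churn": [0, 1, 0, 1],
})
features = ["CustServ Calls", "Day Mins"]   # keep predictive columns, drop the phone number
X = df[features]
y = df["Churn"]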
1.3 Train/Validation/Test Split
| Set | Purpose | Typical Size |
|---|---|---|
| Training | Learn patterns | 60% |
| Validation | Tune hyperparameters, check overfitting | 20% |
| Test | Final evaluation (only used once!) | 20% |
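A minimal sketch of producing this 60/20/20 split with two calls to sklearn's train_test_split (the toy arrays below stand in for the real features and target):
import numpy as np
from sklearn.model_selection import train_test_split
X = np.arange(20).reshape(10, 2)   # toy feature matrix (10 samples)
y = np.array([0, 1] * 5)           # toy binary target
# Hold out 40% first, then split that 40% in half → 60% train / 20% val / 20% test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
print(len(X_train), len(X_val), len(X_test))   # 6 2 2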
1.4 Feature Scaling with StandardScaler
The Problem: Features on different scales (e.g., minutes: 0-350, calls: 0-9) make training uneven: large-range features dominate the optimization and the regularization penalty regardless of how predictive they are.
The Solution: Standardization transforms each feature to mean=0, std=1:
$$ z = \frac{x - \mu}{\sigma} $$
Critical Rule: Fit scaler on training data ONLY, then transform all sets.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)                  # Learn mean/std from training data ONLY
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)      # Apply the same transformation
X_test = scaler.transform(X_test)
1.5 Model Coefficients
After fitting, the model learns weights for each feature:
| Feature | Coefficient | Interpretation |
|---|---|---|
| CustServ Calls | 0.796 | Strongest predictor! |
| Day Mins | 0.684 | Second strongest |
| Eve Mins | 0.291 | Moderate |
| Night Mins | 0.136 | Weak |
| Account Length | 0.061 | Very weak |
Positive coefficient → higher feature value = higher probability of churn. Because the features are standardized, the magnitudes can be compared directly.
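A minimal, self-contained sketch of fitting a model on toy standardized data and reading off the weights (the feature names and numbers here are illustrative, not the coefficients above):
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
rng = np.random.default_rng(0)
features = ["CustServ Calls", "Day Mins", "Night Mins"]   # illustrative names
X = rng.normal(size=(200, 3))                             # already standardized (mean 0, std 1)
y = ((0.8 * X[:, 0] + 0.6 * X[:, 1] + rng.normal(size=200)) > 0).astype(int)
model = LogisticRegression().fit(X, y)
print(pd.Series(model.coef_[0], index=features).sort_values(ascending=False))
print("intercept (b):", model.intercept_[0])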
3. Hyperparameter Tuning & Regularization
3.1 The Problem: Feature Dictatorship
Without constraints, one feature with a huge coefficient can dominate all predictions.
🎯 Analogy: Democracy for Features
Without regularization = Dictatorship — one feature takes over: "I am CustServ Calls! I alone decide who churns!"
With regularization = Democracy — every feature gets a fair say, no one dominates.
3.2 The Solution: Regularization
Add a penalty for large coefficients:
$$ \text{Total Loss} = \text{Prediction Error} + \lambda \times (\text{Size of Coefficients}) $$
3.3 The Trade-off
| λ value | Effect | Risk |
|---|---|---|
| Too small | Dictatorship (large coefficients) | Overfitting |
| Too large | Over-equality (all suppressed) | Underfitting |
| Just right | Balanced influence | Good generalization |
3.4 sklearn Notation
sklearn uses $ C = \frac{1}{\lambda} $:
- High C → Low regularization
- Low C → High regularization
model = LogisticRegression(C=0.001)  # small C → λ = 1/C = 1000 → strong regularization
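A small sketch on synthetic data showing the effect: as C shrinks (regularization grows), the largest coefficient is pulled toward zero. sklearn's default penalty is L2, i.e. the sum of squared coefficients.
import numpy as np
from sklearn.linear_model import LogisticRegression
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                                # 5 standardized features
y = (2.0 * X[:, 0] + rng.normal(size=500) > 0).astype(int)   # one feature dominates
for C in [100, 1.0, 0.001]:
    model = LogisticRegression(C=C).fit(X, y)
    print(f"C = {C:>7}: max |coefficient| = {np.abs(model.coef_).max():.3f}")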
4. Log Odds / Logit: The Complete Story
4.1 Chapter 1: The Goal
Model computes: $ z = w_1 x_1 + w_2 x_2 + ... + b $
This z can be any real number (-∞ to +∞), but we need a probability (0 to 1).
4.2 Chapter 2: Odds
$$ \text{Odds} = \frac{P(\text{event})}{P(\text{not event})} = \frac{p}{1-p} $$
Example: 20% probability of churn
- Out of 100 customers: 20 churn, 80 stay
- Odds = 20:80 = 1:4
- "For every 1 churner, there are 4 stayers"
4.3 Chapter 3: The Problem with Odds
| Probability | Odds |
|---|---|
| 1% | 0.01 |
| 50% | 1 |
| 99% | 99 |
Scale is lopsided — unlikely events squished near 0, likely events explode to infinity.
4.4 Chapter 4: Log Odds to the Rescue
$$ \text{Log Odds} = \log\left(\frac{p}{1-p}\right) $$
| Probability | Odds | Log Odds |
|---|---|---|
| 1% | 0.01 | -4.6 |
| 10% | 0.11 | -2.2 |
| 50% | 1 | 0 |
| 90% | 9 | +2.2 |
| 99% | 99 | +4.6 |
Symmetric around zero!
🎯 Analogy 1: The Seesaw
- Log odds = 0 → Seesaw balanced (50-50)
- Log odds negative → Tilted toward "No"
- Log odds positive → Tilted toward "Yes"
- The model computes "how tilted is the seesaw?"
🎯 Analogy 2: The Weather Forecaster
- Each feature = a weather clue worth "evidence points"
- z = total evidence score (positive = likely rain, negative = unlikely, zero = 50-50)
- Sigmoid = translator that converts evidence score to probability %
4.5 Chapter 5: The Connection
The z that logistic regression computes IS the log odds!
$$ z = 0.684 \times \text{DayMins} + 0.796 \times \text{CustServCalls} + \dots + b $$
4.6 Chapter 6: Back to Probability
Sigmoid converts log odds back to probability:
$$ p = \frac{1}{1 + e^{-z}} $$
4.7 The Full Pipeline
$$ \text{Features} \xrightarrow{\text{weighted sum}} z \text{ (log odds)} \xrightarrow{\text{sigmoid}} p \text{ (probability)} $$
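A minimal round-trip check in torch: take a probability, compute its log odds (logit), and confirm that the sigmoid maps it back.
import torch
p = torch.tensor(0.20)
z = torch.log(p / (1 - p))         # log odds (logit) ≈ -1.386
p_back = 1 / (1 + torch.exp(-z))   # sigmoid undoes the logit
print(z.item(), p_back.item())     # ≈ -1.386, ≈ 0.20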
4.8 Why Symmetry Matters
🎯 Analogy 3: The Mountain Climber
Log odds = altitude. Probability = % of climb completed.
- Climber A at sea level (50%) climbs 500m → gains 12% progress
- Climber B near summit (98%) climbs 500m → gains only 1% progress
Same log odds change ≠ Same probability change!
# Visualization: Odds vs Log Odds
import torch
import matplotlib.pyplot as plt
p = torch.linspace(0.01, 0.99, 100)
odds = p / (1 - p)
log_odds = torch.log(odds)
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].plot(p, odds)
axes[0].axhline(y=1, color='r', linestyle='--', label='Odds = 1 (50-50)')
axes[0].set_xlabel('Probability')
axes[0].set_ylabel('Odds')
axes[0].set_title('Probability → Odds (Lopsided!)')
axes[0].legend()
axes[0].grid(True)
axes[1].plot(p, log_odds)
axes[1].axhline(y=0, color='r', linestyle='--', label='Log Odds = 0 (50-50)')
axes[1].set_xlabel('Probability')
axes[1].set_ylabel('Log Odds')
axes[1].set_title('Probability → Log Odds (Symmetric!)')
axes[1].legend()
axes[1].grid(True)
plt.tight_layout()
plt.show()
# Demonstrating: Same log odds change ≠ Same probability change
import torch
def sigmoid(z):
    return 1 / (1 + torch.exp(-z))
log_odds_A = torch.tensor(0.0) # seesaw balanced
log_odds_B = torch.tensor(4.0) # seesaw tilted high
added = 0.5
print("Before adding 0.5:")
print(f" A: log odds = {log_odds_A.item()}, probability = {sigmoid(log_odds_A).item():.0%}")
print(f" B: log odds = {log_odds_B.item()}, probability = {sigmoid(log_odds_B).item():.0%}")
print("\nAfter adding 0.5:")
print(f" A: log odds = {(log_odds_A + added).item()}, probability = {sigmoid(log_odds_A + added).item():.0%}")
print(f" B: log odds = {(log_odds_B + added).item()}, probability = {sigmoid(log_odds_B + added).item():.0%}")
print("\n→ Same +0.5 added, but A gained 12%, B gained only 1%!")
Before adding 0.5:
A: log odds = 0.0, probability = 50%
B: log odds = 4.0, probability = 98%
After adding 0.5:
A: log odds = 0.5, probability = 62%
B: log odds = 4.5, probability = 99%
→ Same +0.5 added, but A gained 12%, B gained only 1%!
5. Impact of Outliers
5.1 The Two Cases
| Case | Example | Loss | Impact |
|---|---|---|---|
| Outlier on correct side | Model confident & RIGHT | Small | Minimal |
| Outlier on wrong side | Model confident & WRONG | Huge! | Messes up training |
🎯 Analogy: The Cocky Weatherman
Weatherman says: "I'm 98% sure it will rain! Cancel your picnics!"
It's bright sunshine. ☀️
The more confident you are when WRONG, the bigger the punishment!
5.2 The Math: Why Confidence Explodes Loss
For a wrong prediction where true label = 0:
$$ \text{Loss} = -\log(1 - \hat{y}) $$
As $\hat{y} \to 1$ (high confidence), $(1 - \hat{y}) \to 0$, and $-\log(\text{tiny}) \to \infty$
| Confidence (ŷ) | 1 - ŷ | Loss |
|---|---|---|
| 50% | 0.5 | 0.69 |
| 90% | 0.1 | 2.3 |
| 98% | 0.02 | 3.9 |
| 99% | 0.01 | 4.6 |
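The table can be reproduced with one line of torch per row (a quick sketch, using the natural log as in the table):
import torch
# Loss = -log(1 - ŷ) for a confidently wrong prediction (true label = 0)
for y_hat in [0.5, 0.9, 0.98, 0.99]:
    loss = -torch.log(torch.tensor(1.0 - y_hat))
    print(f"ŷ = {y_hat:.2f} → loss = {loss.item():.2f}")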
5.3 The Chain Reaction
- Confident but wrong → Big loss
- Big loss → Big gradient
- Big gradient → Model over-corrects
- Over-correction → Ruins learning from thousands of normal cases
5.4 Practical Takeaway
Find and remove outliers on the wrong side before training!
6. Multiclass Classification
6.1 The Problem
Logistic regression is designed for yes/no questions. But what if we have 3+ classes?
🎯 Analogy: The Fruit Sorting Factory
Fruits come down a conveyor belt. You must sort into: 🍊 Orange, 🍎 Apple, 🍌 Banana
How do we use a yes/no tool for a multiple choice question?
6.2 The Solution: One-vs-Rest (OVR)
Train 3 separate binary classifiers:
| Detector | Question |
|---|---|
| Detector 1 | "Is it an Orange? Yes/No" |
| Detector 2 | "Is it an Apple? Yes/No" |
| Detector 3 | "Is it a Banana? Yes/No" |
6.3 How Labels Are Modified
Same data, different labels for each detector:
Original:
| Fruit | Label |
|---|---|
| 1 | Orange |
| 2 | Apple |
| 3 | Banana |
For Orange Detector:
| Fruit | Label |
|---|---|
| 1 | 1 |
| 2 | 0 |
| 3 | 0 |
For Apple Detector:
| Fruit | Label |
|---|---|
| 1 | 0 |
| 2 | 1 |
| 3 | 0 |
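A minimal sketch of this relabeling with sklearn's LabelBinarizer (fruit names are illustrative). Note that the output columns follow alphabetical class order (Apple, Banana, Orange):
from sklearn.preprocessing import LabelBinarizer
y = ["Orange", "Apple", "Banana"]
lb = LabelBinarizer()
print(lb.fit_transform(y))   # one 0/1 column per class
print(lb.classes_)           # ['Apple' 'Banana' 'Orange']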
6.4 At Prediction Time
All detectors vote:
- Orange detector: 40%
- Apple detector: 35%
- Banana detector: 60%
Pick highest → Banana! 🍌
6.5 Why Probabilities Don't Sum to 100%
Each detector is a separate model. They don't know about each other!
40% + 35% + 60% = 135% — Not a bug!
6.6 sklearn Implementation
model = LogisticRegression(multi_class='ovr') # One-vs-Rest
model.fit(X_train, y_train)
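To see the un-normalized votes from section 6.5, here is a minimal hand-rolled one-vs-rest sketch on toy data (three clusters standing in for the three fruits). Note that sklearn's built-in 'ovr' mode normalizes predict_proba across classes, so its reported probabilities do sum to 1; the raw detector scores below are under no such constraint.
import numpy as np
from sklearn.linear_model import LogisticRegression
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), size=(100, 2)),   # "Orange" cluster
    rng.normal(loc=(3, 0), size=(100, 2)),   # "Apple" cluster
    rng.normal(loc=(0, 3), size=(100, 2)),   # "Banana" cluster
])
y = np.array(["Orange"] * 100 + ["Apple"] * 100 + ["Banana"] * 100)
# One binary detector per fruit: "this fruit vs. everything else"
detectors = {fruit: LogisticRegression().fit(X, (y == fruit).astype(int))
             for fruit in np.unique(y)}
x_new = np.array([[2.0, 2.0]])   # an ambiguous point between the Apple and Banana clusters
scores = {fruit: det.predict_proba(x_new)[0, 1] for fruit, det in detectors.items()}
print(scores, "sum =", round(sum(scores.values()), 2))   # independent scores; nothing forces them to sum to 1
print("prediction:", max(scores, key=scores.get))        # the final call is simply the highest score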
7. Quiz Questions for Self-Assessment
Quiz 1: Log Odds & Coefficients
Question: Two customers differ only in customer service calls:
- Customer A: 2 calls
- Customer B: 5 calls
- Coefficient for CustServ Calls: 0.8
Which is true?
a) Customer B's log odds is 2.4 higher than A's ✅
b) Customer B's probability is exactly 2.4 higher ❌
c) Customer B is guaranteed to churn ❌
Explanation:
- (a) ✅ Difference = 3 calls × 0.8 = 2.4 log odds
- (b) ❌ Same log odds change ≠ same probability change (Mountain Climber analogy!)
- (c) ❌ Higher log odds = more likely, never guaranteed. Sigmoid approaches but never reaches 1.
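A quick numeric check of (a) vs (b), reusing the sigmoid: the same +2.4 shift in log odds produces different probability jumps depending on the starting point (the starting log odds below are made up for illustration).
import torch
for base in [0.0, 2.0]:   # hypothetical starting log odds
    before = torch.sigmoid(torch.tensor(base)).item()
    after = torch.sigmoid(torch.tensor(base + 2.4)).item()
    print(f"log odds {base} → {base + 2.4}: probability {before:.0%} → {after:.0%}")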
Quiz 2: Outliers
Question: Model is 95% confident Customer X will NOT churn. Customer X churned. Impact?
a) Very little impact ❌
b) Huge impact ✅
c) Increase regularization to fix ❌
Explanation: Cocky Weatherman! Confident but wrong → Loss = $-\log(0.05) = 3.0$ → Big gradient → Over-correction → Training messed up.
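The loss value can be checked directly:
import torch
print((-torch.log(torch.tensor(0.05))).item())   # ≈ 3.0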
Quiz 3: Multiclass Classification
Question: OVR outputs: Orange 40%, Apple 35%, Banana 60%
a) Which fruit? Banana (highest probability)
b) Why don't they sum to 100%? Three separate models — they don't know about each other!
8. Quick Reference
Key Formulas
| Concept | Formula |
|---|---|
| Odds | $\frac{p}{1-p}$ |
| Log Odds (Logit) | $\log\left(\frac{p}{1-p}\right)$ |
| Sigmoid | $\frac{1}{1+e^{-z}}$ |
| Accuracy | $\frac{\text{correct}}{\text{total}}$ |
| Regularized Loss | $\text{Error} + \lambda \times \text{(size of coefficients)}$ |
Sticky Analogies Summary
| Concept | Analogy |
|---|---|
| Regularization | Democracy for Features |
| Log Odds | Seesaw tilt |
| z computation | Weather Forecaster adding evidence |
| Same Δ log odds ≠ same Δ probability | Mountain Climber near summit |
| Outlier impact | Cocky Weatherman |
| Multiclass OVR | Fruit Sorting Factory |
sklearn Cheat Sheet
# Scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
# Binary Classification
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=0.001) # C = 1/λ
model.fit(X_train, y_train)
# Multiclass Classification
model = LogisticRegression(multi_class='ovr')
# Inspect model
model.coef_ # Feature weights
model.intercept_ # Bias term