Understanding Logistic Regression with PyTorch
Logistic Regression — Complete Summary
A comprehensive guide with intuitive analogies, mathematical rigor, and practical PyTorch implementation
Table of Contents
- The Problem: Binary Classification
- Why Not Linear Regression?
- Step Function — The Failed Attempt
- Sigmoid Function — The Squisher
- Geometric Interpretation
- Log-Loss — Measuring Wrongness
- Likelihood vs Probability
- Why Not MSE?
- Gradient Descent — Finding the Best Weights
- Complete Implementation
- Regularization
- Interpreting Coefficients
- Quick Reference
2. Why Not Linear Regression?
Linear regression outputs any real number from $-\infty$ to $+\infty$.
Problem: What does a prediction of -0.3 or 1.7 mean for churn probability?
| Model | Output Range | Problem for Classification |
|---|---|---|
| Linear Regression | $(-\infty, +\infty)$ | Negative probability? >100% probability? |
| Logistic Regression | $(0, 1)$ | Always valid probability ✓ |
We need a "squisher" function that compresses any input into the $(0, 1)$ range.
3. Step Function — The Failed Attempt
Idea: Just threshold!
$$
\text{step}(z) =
\begin{cases}
1 & \text{if } z \geq 0 \\
0 & \text{if } z < 0
\end{cases}
$$
Fatal Flaw: The gradient is zero everywhere except the jump (where it's undefined) → gradient descent gets no signal to follow!
🏔️ The Blindfolded Hiker Analogy:
Gradient descent is like a blindfolded hiker feeling the slope to find the valley.
- Step function is flat everywhere except at one point (cliff)
- Hiker feels "flat" and doesn't know which way to go
- Stuck!
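A quick way to see the flaw in PyTorch: a hard threshold breaks the gradient chain entirely, while sigmoid (previewing the next section) has a slope everywhere. A minimal sketch:

```python
import torch

z = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)

# Hard threshold: autograd cannot differentiate a boolean comparison,
# so the result is disconnected from the graph (no gradient at all)
step = (z >= 0).float()
print(step.requires_grad)  # False: the hiker feels nothing

# Sigmoid: smooth everywhere, so backward() returns a usable slope
torch.sigmoid(z).sum().backward()
print(z.grad)  # non-zero at every point
```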
4. Sigmoid Function — The Squisher
The Formula: $$ \sigma(z) = \frac{1}{1 + e^{-z}} $$
Properties:
- Output range: $(0, 1)$ — always valid probability
- $\sigma(0) = 0.5$ — uncertainty point
- Smooth and differentiable everywhere ✓
| Input $z$ | Output $\sigma(z)$ | Interpretation |
|---|---|---|
| -100 | ≈ 0 | Almost certainly NO |
| -10 | 0.00005 | Very likely NO |
| 0 | 0.5 | Coin flip — uncertain |
| +10 | 0.99995 | Very likely YES |
| +100 | ≈ 1 | Almost certainly YES |
import torch
import matplotlib.pyplot as plt
# Visualize sigmoid
z = torch.linspace(-10, 10, 100)
sigmoid = 1 / (1 + torch.exp(-z))
plt.figure(figsize=(10, 4))
plt.plot(z, sigmoid, 'b-', linewidth=2)
plt.axhline(y=0.5, color='gray', linestyle=':', alpha=0.5)
plt.axvline(x=0, color='gray', linestyle=':', alpha=0.5)
plt.xlabel('z (input)')
plt.ylabel('σ(z) (output)')
plt.title('The Sigmoid Function — Our "Squisher"')
plt.grid(True, alpha=0.3)
plt.show()
5. Geometric Interpretation
Logistic regression finds a hyperplane (line in 2D) that separates classes.
🚪 The Bouncer at the Door Analogy:
The decision boundary is the door of a nightclub.
- Far on VIP side → 100% confident they're getting in
- Far on rejected side → 100% confident they're NOT getting in
- Standing AT the door → 50/50, could go either way
Distance from boundary = Confidence
What z = 0 Means
CHURN SIDE (z > 0)
| 😟 z = +5 → ŷ = 0.99
|
=============|============ ← THE DOOR (z = 0 → ŷ = 0.5)
|
| 😊 z = -5 → ŷ = 0.01
STAY SIDE (z < 0)
- $z = 0$ is the equation of the decision boundary
- $\hat{y} = 0.5$ means maximum uncertainty
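To make the bouncer concrete, here is a small sketch with made-up parameters (w = [1, 1], b = -5 are illustrative, not learned): points farther from the z = 0 line get more extreme probabilities.

```python
import torch

# Illustrative (not learned) parameters for a 2-feature model
w = torch.tensor([1.0, 1.0])
b = torch.tensor(-5.0)

doorway = [
    ("far on STAY side", torch.tensor([0.0, 0.0])),    # z = -5
    ("exactly at the door", torch.tensor([2.5, 2.5])), # z = 0
    ("far on CHURN side", torch.tensor([5.0, 5.0])),   # z = +5
]
for label, x in doorway:
    z = x @ w + b
    print(f"{label}: z = {z.item():+.1f} → ŷ = {torch.sigmoid(z).item():.4f}")
```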
6. Log-Loss — Measuring Wrongness
The Cliff Edge Story
🏔️ The Cliff Edge Analogy:
Position 1 = safe ground 🏠, Position 0 = cliff edge 💀
"Log stays calm near 1, screams near 0"
- Standing at 0.95 → relaxing at home, tiny penalty
- Standing at 0.10 → toes hanging off the edge, HUGE penalty!
The Formulas
When actual = 1: Use $-\log(\hat{y})$
- Prediction near 1 → calm (small penalty)
- Prediction near 0 → screaming (huge penalty)
When actual = 0: Use $-\log(1-\hat{y})$
- Prediction near 0 → calm (small penalty)
- Prediction near 1 → screaming (huge penalty)
Combined Formula (The Two Light Switches)
💡 The Light Switch Analogy:
- Lamp A = $-\log(\hat{y})$
- Lamp B = $-\log(1-\hat{y})$
- $y$ is the switch: only ONE lamp is ever on!
$$ \text{Loss} = -\left[ y \cdot \log(\hat{y}) + (1-y) \cdot \log(1-\hat{y}) \right] $$
| Actual $y$ | Lamp A | Lamp B | Active Loss |
|---|---|---|---|
| 1 | ON | OFF | $-\log(\hat{y})$ |
| 0 | OFF | ON | $-\log(1-\hat{y})$ |
# Log Loss Implementation
def log_loss(y_actual, y_pred):
    eps = 1e-7  # avoid log(0)
    return -(y_actual * torch.log(y_pred + eps) + (1 - y_actual) * torch.log(1 - y_pred + eps))
# Test scenarios
print("When actual = 1:")
print(f" Predict 0.99 → Loss = {log_loss(1, torch.tensor(0.99)):.4f}")
print(f" Predict 0.10 → Loss = {log_loss(1, torch.tensor(0.10)):.4f}")
print("\nWhen actual = 0:")
print(f" Predict 0.10 → Loss = {log_loss(0, torch.tensor(0.10)):.4f}")
print(f" Predict 0.90 → Loss = {log_loss(0, torch.tensor(0.90)):.4f}")
When actual = 1:
Predict 0.99 → Loss = 0.0101
Predict 0.10 → Loss = 2.3026
When actual = 0:
Predict 0.10 → Loss = 0.1054
Predict 0.90 → Loss = 2.3026
7. Likelihood vs Probability
🍪 The Cookie Jar Story:
- Jar A: 90 chocolate, 10 vanilla
- Jar B: 10 chocolate, 90 vanilla
Probability: "Given Jar A, what's the chance of picking chocolate?" (90%)
Likelihood: Friend picks chocolate → "Which jar was it probably from?" (Jar A!)
| | What's fixed? | What's unknown? | Question |
|---|---|---|---|
| Probability | The model | The data | "Given this model, what data might I see?" |
| Likelihood | The data | The model | "Given this data, what model fits best?" |
Maximum Likelihood Estimation (MLE): Find the weights that make the observed data most probable.
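A tiny numeric version of the cookie jar (the draws are made up): given observed picks, compute how probable each jar makes them; MLE chooses the winner.

```python
import torch

# Hypothetical observed draws: 1 = chocolate, 0 = vanilla
draws = torch.tensor([1.0, 1.0, 0.0, 1.0])

# Each "model" is a jar, i.e. a probability of drawing chocolate
for jar, p_choc in [("Jar A", 0.9), ("Jar B", 0.1)]:
    p = torch.tensor(p_choc)
    # Likelihood: probability of the whole observed sequence under this jar
    likelihood = torch.prod(torch.where(draws == 1, p, 1 - p))
    print(f"{jar}: likelihood = {likelihood.item():.4f}")
# Jar A wins: MLE picks the model that makes the data most probable
```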
8. Why Not MSE?
📢 The Screaming Guide Story:
You're a blind hiker in fog. A guide shouts directions.
- Log Loss Guide: Far from village → SCREAMS "GO THAT WAY!!!"
- MSE Guide: Far from village → speaks calmly "maybe go that way..."
Even when completely lost, the MSE guide barely raises their voice!
The Math (The Magic)
| Loss | Gradient w.r.t. $\hat{y}$ (when $y = 1$) | When $\hat{y} = 0.01$ (very wrong) |
|---|---|---|
| MSE | $2(\hat{y} - y)$ | $-1.98$ (calm) |
| Log Loss | $-\frac{1}{\hat{y}}$ | $-100$ (screaming!) |
The key: $\frac{1}{\hat{y}}$ — smaller prediction → bigger gradient → louder guide!
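You can verify the table with autograd; the label is y = 1 and the prediction is badly wrong:

```python
import torch

y_pred = torch.tensor(0.01, requires_grad=True)  # very wrong when y = 1

mse = (y_pred - 1.0) ** 2
mse.backward()
print(f"MSE gradient:      {y_pred.grad.item():+.2f}")  # -1.98 (calm)

y_pred.grad = None  # reset before the second loss
nll = -torch.log(y_pred)
nll.backward()
print(f"Log-loss gradient: {y_pred.grad.item():+.2f}")  # -100.00 (screaming)
```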
Convex vs Non-Convex
- Log-loss + Sigmoid = Convex (one valley) ✓
- MSE + Sigmoid = Non-convex (multiple valleys) ✗
With MSE, the hiker might get stuck in a crater thinking it's the valley!
9. Gradient Descent — Finding the Best Weights
🏔️ The Blind Hiker on a Mountain:
- Start at random position (random weights)
- Feel the slope (calculate gradient)
- Step downhill (update weights)
- Repeat until you reach the valley (minimum loss)
The Update Rule
$$ w_{\text{new}} = w_{\text{old}} - \alpha \times \text{gradient} $$
where $\alpha$ is the learning rate.
Learning Rate Trade-off
| Learning Rate | Behavior |
|---|---|
| Too big | Bounces around, overshoots the valley |
| Too small | Converges but takes forever |
| Just right | Smooth descent, reaches valley efficiently |
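The whole loop fits in a few lines on a toy problem (the loss function and learning rate here are illustrative):

```python
import torch

# Minimize a toy loss L(w) = (w - 3)^2; the valley floor is at w = 3
w = torch.tensor(0.0, requires_grad=True)
lr = 0.3  # "just right" for this toy problem

for step in range(6):
    loss = (w - 3) ** 2
    loss.backward()          # feel the slope
    with torch.no_grad():
        w -= lr * w.grad     # step downhill
        w.grad.zero_()
    print(f"step {step}: w = {w.item():.3f}")
# w marches toward 3.0; try lr = 1.1 to watch it overshoot and diverge
```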
10. Complete Implementation
import torch
import matplotlib.pyplot as plt
# Create fake customer data
torch.manual_seed(42)
n = 100
stayed_charge = torch.randn(n//2) * 20 + 30
stayed_calls = torch.randn(n//2) * 1 + 1
churned_charge = torch.randn(n//2) * 20 + 60
churned_calls = torch.randn(n//2) * 1 + 4
# Combine into X and y
X_stayed = torch.stack([stayed_charge, stayed_calls], dim=1)
X_churned = torch.stack([churned_charge, churned_calls], dim=1)
X = torch.cat([X_stayed, X_churned], dim=0)
y = torch.cat([torch.zeros(50), torch.ones(50)])
print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
X shape: torch.Size([100, 2])
y shape: torch.Size([100])
# Training Loop
w = torch.randn(2, requires_grad=True)
b = torch.randn(1, requires_grad=True)
learning_rate = 0.005
eps = 1e-7
for step in range(1000):
    # 1. Forward pass
    z = X @ w + b
    y_pred = torch.sigmoid(z)

    # 2. Calculate loss
    loss = -torch.mean(y * torch.log(y_pred + eps) + (1 - y) * torch.log(1 - y_pred + eps))

    # 3. Backward pass
    loss.backward()

    # 4. Update weights (inside no_grad so the update itself isn't tracked)
    with torch.no_grad():
        w -= learning_rate * w.grad
        b -= learning_rate * b.grad
        w.grad.zero_()
        b.grad.zero_()

    if step % 100 == 0:
        print(f"Step {step}: Loss = {loss.item():.4f}")
Step 0: Loss = 8.1505
Step 100: Loss = 8.0899
Step 200: Loss = 8.0255
Step 300: Loss = 1.0870
Step 400: Loss = 1.0464
Step 500: Loss = 1.0089
Step 600: Loss = 0.9735
Step 700: Loss = 0.9395
Step 800: Loss = 0.9069
Step 900: Loss = 0.8759
# Check accuracy
y_pred_final = torch.sigmoid(X @ w + b)
predictions = (y_pred_final > 0.5).float()
accuracy = (predictions.squeeze() == y).float().mean()
print(f"Accuracy: {accuracy.item()*100:.1f}%")
print(f"Learned weights: w = {w.data}, b = {b.data}")
Accuracy: 49.0%
Learned weights: w = tensor([-0.0717, 0.7376]), b = tensor([1.2532])
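Accuracy is near chance here: the two features live on very different scales and the loss was still falling at step 900, so the model simply hasn't converged (standardizing the features or training longer should help). The learned w and b can still score a new customer. A minimal sketch with a hypothetical customer:

```python
# Hypothetical new customer: $55 day charge, 3 service calls
new_customer = torch.tensor([55.0, 3.0])
with torch.no_grad():
    p_churn = torch.sigmoid(new_customer @ w + b)
print(f"P(churn) = {p_churn.item():.2f}")  # ≈ 0.38 with the weights above
```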
# Visualize learned decision boundary
x_range = torch.linspace(-20, 100, 200)
y_range = torch.linspace(-2, 7, 200)
xx, yy = torch.meshgrid(x_range, y_range, indexing='xy')
grid_input = torch.stack([xx.flatten(), yy.flatten()], dim=1)
with torch.no_grad():
z_grid = grid_input @ w + b
probs = torch.sigmoid(z_grid).reshape(200, 200)
plt.figure(figsize=(10, 7))
plt.contourf(xx, yy, probs, levels=20, cmap='RdBu_r', alpha=0.7)
plt.colorbar(label='P(Churn)')
plt.scatter(X[:50, 0], X[:50, 1], c='blue', edgecolor='white', s=60, label='Stayed')
plt.scatter(X[50:, 0], X[50:, 1], c='red', edgecolor='white', s=60, label='Churned')
plt.contour(xx, yy, probs, levels=[0.5], colors='green', linewidths=2)
plt.xlabel('Day Charge ($)')
plt.ylabel('Customer Service Calls')
plt.title('LEARNED Decision Boundary')
plt.legend()
plt.show()
11. Regularization
Prevents overfitting — when the model memorizes training data but fails on new data.
L2 Regularization (Ridge): $$ \text{Loss} = \text{Log Loss} + \lambda \sum w_i^2 $$
L1 Regularization (Lasso): $$ \text{Loss} = \text{Log Loss} + \lambda \sum |w_i| $$
Same concept as linear regression — just added to log loss instead of MSE.
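In the training loop above, this is one extra term on the loss line. A minimal sketch (the `lam` value is illustrative, not tuned):

```python
lam = 0.01  # regularization strength (illustrative)

# Inside the training loop, replacing step 2:
data_loss = -torch.mean(y * torch.log(y_pred + eps) + (1 - y) * torch.log(1 - y_pred + eps))
loss = data_loss + lam * torch.sum(w ** 2)          # L2 / Ridge
# loss = data_loss + lam * torch.sum(torch.abs(w))  # L1 / Lasso
```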
12. Interpreting Coefficients
In Linear Regression: "Increase X by 1 → Y increases by w"
In Logistic Regression: Coefficients represent change in log-odds, not probability!
$$ \log\left(\frac{P(\text{churn})}{P(\text{stay})}\right) = w_1 \cdot x_1 + w_2 \cdot x_2 + b $$
Example: If $w_2 = 0.74$:
- Log-odds interpretation: "Each additional service call increases log-odds of churn by 0.74"
- Odds ratio: $e^{0.74} \approx 2.1$ → "Each additional service call roughly doubles the odds of churning"
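Checking the odds-ratio arithmetic:

```python
import torch

w2 = torch.tensor(0.74)      # coefficient for service calls
print(torch.exp(w2).item())  # ≈ 2.096: each extra call roughly doubles the odds
```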
13. Quick Reference
Key Formulas
| Component | Formula |
|---|---|
| Sigmoid | $\sigma(z) = \frac{1}{1 + e^{-z}}$ |
| Linear part | $z = w^T x + b$ |
| Log Loss | $-\left[ y \log(\hat{y}) + (1-y) \log(1-\hat{y}) \right]$ |
| Gradient Update | $w_{new} = w_{old} - \alpha \cdot \nabla L$ |
Sticky Analogies Summary
| Concept | Analogy |
|---|---|
| Gradient Descent | Blind hiker feeling slope to find valley |
| Step function problem | Hiker on flat ground, can't feel direction |
| Sigmoid output | Bouncer confidence — distance from door |
| Log loss penalty | Cliff edge — calm near 1, screams near 0 |
| Log loss formula | Two light switches — only one ON at a time |
| Likelihood vs Probability | Cookie jar — know data, guess the jar |
| MSE vs Log Loss gradients | Calm guide vs screaming guide |
| Convex vs Non-convex | Single valley vs mountain with craters |
| Variance/Std Dev | Village well — measuring walking distance |
Standard Deviation Reminder (Village Well Story)
| Step | Why? |
|---|---|
| Subtract mean | Find distance from center |
| Square | Stop negative and positive from cancelling |
| Average | Get typical squared-distance (variance) |
| Square root | Convert back to real-world units (std dev) |
"Variance is for math; standard deviation is for humans."
Training Loop Checklist
for step in range(n_steps):
    # 1. Forward pass
    z = X @ w + b
    y_pred = torch.sigmoid(z)

    # 2. Calculate loss (mean over samples so backward() gets a scalar)
    loss = log_loss(y, y_pred).mean()

    # 3. Backward pass
    loss.backward()

    # 4. Update weights
    with torch.no_grad():
        w -= lr * w.grad
        b -= lr * b.grad
        w.grad.zero_()
        b.grad.zero_()
Quiz Questions — Test Your Understanding
Question 1 (Easy)
Q: Why can't we use a step function for classification?
A: Its gradient is zero everywhere (undefined at the jump) → no learning signal → weights never update.
Question 2 (Easy-Medium)
Q: Sigmoid outputs what range? What does it represent?
A: Range (0, 1). Represents probability of belonging to class 1.
Question 3 (Medium)
Q: Why use $-\log(\hat{y})$ instead of $(1 - \hat{y})$?
A: Cliff edge story! Log explodes near 0 → penalizes confident wrong predictions severely.
Question 4 (Medium-Tricky) ⚠️
Q: What if y = 0.5 in the log loss formula?
A: Trick question! $y$ is the actual label — always 0 or 1, never 0.5.
Question 5 (Tricky)
Q: Why was w₁ (charge) ≈ 0 but w₂ (calls) = 0.74?
A: Gradient descent learned that service calls is a stronger predictor than charge from the data.
Question 6 (Tricky) ⚠️
Q: If ŷ = 0.5, what is z?
A: z = 0. The point is exactly on the decision boundary (the door), since $\sigma(0) = 0.5$.