Complete Guide to Linear Regression & ML Fundamentals
From First Principles to Production-Ready Models
What this guide covers:
- Building Linear Regression from scratch with PyTorch
- Gradient Descent variants (Batch, SGD, Mini-Batch)
- Polynomial Regression & Feature Engineering
- Underfitting, Overfitting & Bias-Variance Tradeoff
- The 5 Assumptions of Linear Regression
- Regularization (L1, L2, Elastic Net)
- Cross-Validation techniques
Key Skills: Forward pass, MSE loss, gradient descent, feature scaling, R² evaluation, VIF, regularization, cross-validation
Table of Contents
Part 1: Linear Regression from Scratch
- Data Setup
- Model Initialization
- Forward Pass (Predict)
- Loss Function (MSE)
- Gradient Descent
- Training Loop
- Feature Scaling
- Converting Scaled Weights
- R² Score Evaluation
- Visualization
- Comparison with sklearn
- Refactored Code with FastCore
Part 2: ML Fundamentals
- Batch Gradient Descent
- SGD & Mini-Batch Gradient Descent
- Polynomial Regression
- Multicollinearity & Polynomial Features
- Underfitting & Overfitting
- Occam's Razor
- Bias-Variance Tradeoff
Part 3: Linear Regression Assumptions
- Feature Scaling Deep Dive
- R² vs Adjusted R²
- Introduction to statsmodels
- The 5 Assumptions Overview
- Assumption 1: Linearity
- Assumption 2: No Multicollinearity (VIF)
- Assumption 3: Normal Residuals
- P-Values: Accept/Reject
- Assumption 4: Homoscedasticity
- Assumption 5: No Autocorrelation
Part 4: Regularization & Cross-Validation
- Regularization: The Core Problem
- Lambda (λ): The Control Dial
- L1 vs L2 Regularization
- Elastic Net
- Hyperparameters vs Parameters
- Cross Validation
- K-Fold Cross Validation
Quick Reference
1. Data Setup
We created synthetic house price data with a known relationship to verify our model learns correctly.
True relationship: $$ \text{price} = 100000 + 300 \cdot \text{size} - 50 \cdot \text{age} - 20 \cdot \text{distance} + \text{noise} $$
import torch
import matplotlib.pyplot as plt
torch.manual_seed(42)
torch.set_printoptions(sci_mode=False)
n = 100
size = torch.randint(800, 3000, (n,)).float()
age = torch.randint(1, 50, (n,)).float()
distance_to_city = torch.randint(1, 30, (n,)).float()
# True relationship with noise
price = 100000 + 300*size - 50*age - 20*distance_to_city + torch.randn(n)*5000
# Stack features into X matrix
X = torch.stack([size, age, distance_to_city], dim=1)
y = price
print(f"X shape: {X.shape}") # (100, 3) - 100 houses, 3 features
print(f"y shape: {y.shape}") # (100,) - 100 prices
X shape: torch.Size([100, 3])
y shape: torch.Size([100])
2. Model Initialization
Linear regression finds the best weights (W) and bias (b) such that:
$$ \hat{y} = X \cdot W + b $$
🎯 Analogy: The Real Estate Agent
You're a real estate agent with a "pricing formula." Each weight tells you how much each feature (size, age, distance) contributes to the price. The bias is your base price when all features are zero.
# Initialize weights to zeros
W = torch.zeros(3, requires_grad=True) # One weight per feature
b = torch.zeros(1, requires_grad=True) # Single bias term
print(f"W: {W}")
print(f"b: {b}")
W: tensor([0., 0., 0.], requires_grad=True)
b: tensor([0.], requires_grad=True)
Key Insight: requires_grad=True tells PyTorch to track operations for automatic differentiation.
3. Forward Pass (Predict)
For a single house: price = w₁·size + w₂·age + w₃·distance + b
For 100 houses at once: Matrix multiplication!
def predict(X, W, b):
    return X @ W + b
# Test with zero weights - predictions will all be zero
y_pred = predict(X, W, b)
print(f"Predictions (first 5): {y_pred[:5]}")
Predictions (first 5): tensor([0., 0., 0., 0., 0.], grad_fn=<SliceBackward0>)
| Shape | Meaning |
|---|---|
| X: (100, 3) | 100 houses, 3 features |
| W: (3,) | 3 weights |
| X @ W: (100,) | 100 predictions |
| b: (1,) | broadcasts to all 100 |
4. Loss Function (MSE)
How do we measure "how wrong" we are?
🎯 Analogy: Darts
Imagine playing darts blindfolded. You throw 100 darts. Your friend tells you "on average, you missed the bullseye by 15 inches." That average error is your loss. Lower = better!
Why Square the Errors?
- Makes all errors positive (no cancellation)
- Punishes big errors more than small ones
$$ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$
def mse_loss(y_pred, y_true):
    return ((y_true - y_pred)**2).mean()
# Calculate initial loss
loss = mse_loss(y_pred, price)
print(f"Initial loss: {loss:,.0f}") # Huge! We're predicting $0 for houses worth $300k-900k
Initial loss: 483,810,443,264
5. Gradient Descent
🎯 Analogy: Lost in Foggy Mountains
Imagine you're lost in foggy mountains wanting to reach the lowest valley. You can't see far, but you can feel the ground under your feet.
Strategy: Feel which direction slopes downward, then take a small step that way. Repeat.
- Gradient = the slope (which direction is "downhill" for the loss)
- Descent = move in that direction
The Update Rule
$$ W_{\text{new}} = W_{\text{old}} - \alpha \cdot \nabla_W L $$
Where $\alpha$ (alpha) is the learning rate — how big a step we take.
Two Ways to Get Gradients
| Method | How | Pros/Cons |
|---|---|---|
| Autograd | loss.backward() | Easy, automatic |
| Manual | Implement formulas | Deep understanding |
# Using PyTorch autograd
W = torch.zeros(3, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
y_pred = predict(X, W, b)
loss = mse_loss(y_pred, price)
loss.backward() # PyTorch calculates gradients automatically!
print(f"W.grad: {W.grad}")
print(f"b.grad: {b.grad}")
W.grad: tensor([-2781148416., -31536902., -20261836.])
b.grad: tensor([-1339185.3750])
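For the manual route from the table above, the MSE gradients have a closed form: ∂L/∂W = -2/n · Xᵀ(y - ŷ) and ∂L/∂b = -2/n · Σ(y - ŷ). A minimal sketch (variable names are mine) that should reproduce autograd's numbers up to floating-point noise:

# Manual gradients, derived directly from the MSE formula
with torch.no_grad():
    residual = price - predict(X, W, b)        # y - ŷ (here ŷ is still all zeros)
    grad_W_manual = -2 / n * (X.T @ residual)  # should match W.grad above
    grad_b_manual = -2 / n * residual.sum()    # should match b.grad above
print(grad_W_manual)
print(grad_b_manual)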
Why Zero Gradients?
🎯 Analogy: The Sticky Notepad
Imagine writing directions on a notepad. Without erasing, each new direction adds to the old ones:
- Person 1: "5 north" → notepad shows 5
- Person 2: "3 north" → notepad shows 8 (5+3)
- Person 3: "2 north" → notepad shows 10 (8+2)
But you wanted just "2 north"! Zeroing = erasing the notepad before writing the new direction.
PyTorch accumulates gradients by default. For standard training, always zero gradients before backward().
6. Training Loop
Putting it all together:
# Reset
W = torch.zeros(3, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
learning_rate = 1e-9 # Tiny because gradients are billions!
for i in range(1000):
    # Forward pass
    y_pred = predict(X, W, b)
    loss = mse_loss(y_pred, price)
    # Backward pass
    loss.backward()
    # Update weights
    with torch.no_grad():
        W -= learning_rate * W.grad
        b -= learning_rate * b.grad
    # Zero gradients for next iteration
    W.grad.zero_()
    b.grad.zero_()
    if i % 200 == 0:
        print(f"Iteration {i}: Loss = {loss:,.0f}")
Iteration 0: Loss = 483,810,443,264
Iteration 200: Loss = 20,317,921,280
Iteration 400: Loss = 1,750,683,648
Iteration 600: Loss = 1,006,857,664
Iteration 800: Loss = 977,015,040
Problem Discovered: Size weight learned okay (~347 vs true 300), but age and distance were stuck!
| Weight | Learned | True |
|---|---|---|
| size | 347 | 300 |
| age | 4.3 | -50 |
| distance | 2.8 | -20 |
7. Feature Scaling
The Problem
Features have wildly different scales:
- size: 800 - 3000 (big)
- age: 1 - 50 (small)
- distance: 1 - 30 (small)
🎯 Analogy: The Shouting Problem
Imagine a classroom discussion where:
- One student speaks at 100 decibels (super loud)
- Two students speak at 10 decibels (quiet whisper)
The teacher only hears the loud student!
Feature scaling = giving everyone a microphone that adjusts their volume to the same level.
Why Big Features Dominate
🎯 Analogy: The Seesaw
- Size = heavy adult (1900 kg) on seesaw
- Age = small child (23 kg) on seesaw
Nudge the adult slightly → whole seesaw tips dramatically. Nudge the child a lot → barely moves.
Concrete Example: Same weight change of 0.001:
- Size: 0.001 × 1900 = 1.9 change in prediction
- Age: 0.001 × 23 = 0.023 change in prediction
Size has 80x more impact from the same weight change!
The Solution: Standardization
Transform each feature to have mean=0 and std=1:
$$ X_{scaled} = \frac{X - \mu}{\sigma} $$
# Scale features
new_X = (X - X.mean(dim=0)) / X.std(dim=0)
print("Before scaling (first row):")
print(X[0]) # [2742., 34., 12.]
print("\nAfter scaling (first row):")
print(new_X[0]) # [1.34, 0.76, -0.43] - All similar range!
Before scaling (first row):
tensor([2742., 34., 12.])
After scaling (first row):
tensor([ 1.3359, 0.7573, -0.4266])
# Train with scaled features - NOW we can use a bigger learning rate!
W = torch.zeros(3, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
learning_rate = 0.1 # 100 million times bigger than before!
losses = []
for i in range(1000):
    y_pred = predict(new_X, W, b)
    loss = mse_loss(y_pred, price)
    losses.append(loss.item())
    loss.backward()
    with torch.no_grad():
        W -= learning_rate * W.grad
        b -= learning_rate * b.grad
    W.grad.zero_()
    b.grad.zero_()
print(f"Final loss: {losses[-1]:,.0f}") # ~21 million (vs 975 million before!)
Final loss: 21,687,478
Results Comparison
| Metric | Without Scaling | With Scaling |
|---|---|---|
| Learning rate | 1e-9 | 0.1 |
| Final loss | ~975 million | ~21 million |
| Convergence | Slow, incomplete | Fast, complete |
8. Converting Scaled Weights
Since we trained on scaled features, weights are in "scaled space." To interpret them:
$$ W_{original} = \frac{W_{scaled}}{\sigma} $$
🎯 Analogy: Unit Conversion
The model learned "$189,224 per scaled unit." But 1 scaled unit = 630 sq ft (one std).
So in real units: $189,224 per 630 sq ft = $300 per sq ft
It's like converting "miles per gallon" to "kilometers per liter"!
stds = X.std(dim=0)
original_weights = W / stds
print(f"Scaled weights: {W.data}")
print(f"Original weights: {original_weights}")
print(f"\nTrue coefficients: [300, -50, -20]")
Scaled weights: tensor([189224.5781, -488.8487, 105.2894])
Original weights: tensor([300.4156, -34.5646, 14.0370], grad_fn=<DivBackward0>)
True coefficients: [300, -50, -20]
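The intercept can be un-scaled the same way. Since the model learned ŷ = W_scaled · (X - μ)/σ + b_scaled, expanding gives b_original = b_scaled - Σⱼ (W_scaled,j / σⱼ) · μⱼ. A small sketch (not part of the original run) that should land in the neighborhood of the true intercept of 100,000, give or take the error in the age and distance coefficients:

means = X.mean(dim=0)
with torch.no_grad():
    original_bias = b - (W / stds * means).sum()
print(f"Original intercept: {original_bias.item():,.0f}")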
9. R² Score Evaluation
R² answers: How much better is your model than just guessing the mean?
🎯 Analogy: The Lazy Predictor
- Strategy 1 (Lazy): Guess the average price for every house
- Strategy 2 (Your Model): Use features to make personalized predictions
R² = how much better is Strategy 2?
| R² Value | Meaning |
|---|---|
| 1.0 | Perfect! Explains all variation |
| 0.5 | 50% better than guessing mean |
| 0.0 | No better than guessing mean |
| Negative | Worse than guessing mean! 😬 |
$$ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} $$
def r2_score(y_true, y_pred):
    ss_res = ((y_true - y_pred)**2).sum()  # Model errors
    ss_tot = ((y_true - y_true.mean())**2).sum()  # Baseline errors
    return 1 - (ss_res / ss_tot)
y_pred = predict(new_X, W, b)
print(f"R² score: {r2_score(price, y_pred):.4f}") # ~0.9994!
R² score: 0.9994
10. Visualization
# Loss over time - the "hockey stick" curve
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(losses)
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Loss Over Time')
# Predicted vs Actual
plt.subplot(1, 2, 2)
with torch.no_grad():
    y_pred = predict(new_X, W, b)
plt.scatter(price, y_pred, alpha=0.5)
plt.plot([price.min(), price.max()], [price.min(), price.max()], 'r--')
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Predicted vs Actual')
plt.tight_layout()
plt.show()
11. Comparison with sklearn
from sklearn.linear_model import LinearRegression as SklearnLR
sk_model = SklearnLR()
sk_model.fit(X.numpy(), price.numpy())
print("Comparison:")
print(f"{'Metric':<20} {'Our Model':<15} {'sklearn':<15}")
print(f"{'-'*50}")
print(f"{'Size coef':<20} {(W/stds)[0].item():<15.2f} {sk_model.coef_[0]:<15.2f}")
print(f"{'Age coef':<20} {(W/stds)[1].item():<15.2f} {sk_model.coef_[1]:<15.2f}")
print(f"{'Distance coef':<20} {(W/stds)[2].item():<15.2f} {sk_model.coef_[2]:<15.2f}")
print(f"{'R² score':<20} {r2_score(price, y_pred).item():<15.4f} {sk_model.score(X.numpy(), price.numpy()):<15.4f}")
Comparison:
Metric Our Model sklearn
--------------------------------------------------
Size coef 300.42 300.42
Age coef -34.56 -34.56
Distance coef 14.04 14.04
R² score 0.9994 0.9994
Result: Identical! 🎉
12. Refactored Code with FastCore
Using @patch to build up a class incrementally — clean, modular, Pythonic.
from fastcore.utils import patch
class LR:
    def __init__(self, lr=0.1):
        self.lr = lr

@patch
def predict(self: LR, X):
    return X @ self.W + self.b

@patch
def mse_loss(self: LR, y_pred, y_true):
    return ((y_pred - y_true)**2).mean()

@patch
def r2_score(self: LR, y_true, y_pred):
    ss_res = ((y_true - y_pred)**2).sum()
    ss_tot = ((y_true - y_true.mean())**2).sum()
    return 1 - (ss_res / ss_tot)

@patch
def fit(self: LR, X, y, iterations=1000):
    n, d = X.shape
    self.W = torch.zeros(d, requires_grad=True)
    self.b = torch.zeros(1, requires_grad=True)
    self.losses = []
    for i in range(iterations):
        # Forward pass
        y_pred = self.predict(X)
        loss = self.mse_loss(y_pred, y)
        self.losses.append(loss.item())
        # Backward pass
        loss.backward()
        # Update weights
        with torch.no_grad():
            self.W -= self.lr * self.W.grad
            self.b -= self.lr * self.b.grad
        # Zero gradients
        self.W.grad.zero_()
        self.b.grad.zero_()
    return self
# Usage
my_model = LR(lr=0.1)
my_model.fit(new_X, price, iterations=1000)
print(f"R² score: {my_model.r2_score(price, my_model.predict(new_X)):.4f}")
R² score: 0.9994
Part 2: ML Fundamentals
13. Batch Gradient Descent
Key Concept
In Batch Gradient Descent, weights are updated using the entire dataset per iteration:
$$ w = w - \alpha \cdot \nabla L $$
Where:
- $\alpha$ = learning rate (controls step size)
- $\nabla L$ = gradient of the loss (NOT the loss itself!)
Data Scientist Vocabulary
| Term | Definition |
|---|---|
| Epoch | One complete pass through the entire training dataset |
| Gradient | Partial derivative of loss w.r.t. each weight; indicates direction & steepness |
| Learning Rate (α) | Hyperparameter controlling step size in gradient direction |
The Problem
With large datasets (e.g., 10 million samples):
- Memory bottleneck — Loading all samples into RAM at once
- Time bottleneck — Computing gradients over all samples for every single weight update
Key Term: Computational bottleneck — when a specific operation limits overall performance
14. SGD & Mini-Batch Gradient Descent
The Solution: Use Fewer Samples Per Update
Introduce batch size k:
| Method | Batch Size (k) | Description |
|---|---|---|
| Batch GD | k = n (all data) | Update once per epoch |
| SGD | k = 1 | Update after every single sample |
| Mini-Batch GD | k = 32, 64, 256... | Update after each mini-batch |
Data Scientist Vocabulary
| Term | Definition |
|---|---|
| Stochastic | "Random" — in SGD, each update uses one randomly selected sample |
| Mini-batch | A subset of training data used for one gradient update |
| Batch size | Number of samples (k) in each mini-batch |
Convergence Behavior
| Method | Gradient Accuracy | Fluctuations | Speed per Update |
|---|---|---|---|
| Batch GD (k=n) | Most accurate | Least | Slowest |
| SGD (k=1) | Noisiest | Most | Fastest |
| Mini-Batch | In between | Moderate | Fast (vectorized) |
Why Does SGD Fluctuate Most?
One sample might not represent the true pattern — it could be:
- An outlier
- Noise
- Not representative of overall trend
The gradient from one sample can point in a misleading direction → weights "zigzag" toward minimum.
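To make the three variants concrete, here is a minimal mini-batch loop. It is a sketch reusing new_X, price, predict, and mse_loss from Part 1; set k = 1 for SGD or k = len(price) for Batch GD.

W = torch.zeros(3, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr, k = 0.1, 32                                # learning rate and batch size

for epoch in range(100):
    perm = torch.randperm(len(price))          # shuffle the data each epoch
    for start in range(0, len(price), k):
        idx = perm[start:start + k]            # one mini-batch of (up to) k samples
        loss = mse_loss(predict(new_X[idx], W, b), price[idx])
        loss.backward()                        # gradient estimated from k samples only
        with torch.no_grad():
            W -= lr * W.grad
            b -= lr * b.grad
        W.grad.zero_()
        b.grad.zero_()

With k = 32 and 100 samples, each epoch makes four weight updates (three full batches plus one smaller one) instead of a single full-batch update.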
15. Polynomial Regression
The Problem
Linear Regression can't fit non-linear data well.
The Solution
Add polynomial features! Transform x into [x, x², x³, ...]:
$$ \hat{y} = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + ... $$
Why Is It Still "Linear" Regression?
Key Insight: "Linear" refers to linear in the weights, not linear in features!
The weights (w₀, w₁, w₂) all appear to the first power — no w₁², no w₁·w₂. It's still a weighted sum.
Data Scientist Vocabulary
| Term | Definition |
|---|---|
| Linear in parameters | Model is a linear combination of weights |
| Feature engineering | Creating new features (x², x³) from existing ones |
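A quick sketch of the feature-engineering step using sklearn's PolynomialFeatures (one common way to do it; the data here is made up):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 1 + 2 * x - 0.5 * x**3 + rng.normal(scale=1.0, size=(100, 1))   # cubic data + noise

X_poly = PolynomialFeatures(degree=3, include_bias=False).fit_transform(x)   # [x, x², x³]
model = LinearRegression().fit(X_poly, y)
print(model.coef_, model.intercept_)   # still a plain weighted sum: linear in the parameters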
16. Multicollinearity & Polynomial Features
Key Question
Does adding x² as a feature cause multicollinearity with x?
Answer: NO!
Multicollinearity requires a linear relationship, not just high correlation.
🔑 Sticky Analogy: The Cookie Detective
Alice & Bob always together (z = 2x + 3):
- Every time Alice is in the kitchen, Bob is too
- Can't tell who ate the cookie → Multicollinearity!
Alice walks steady, Bob accelerates (x²):
- At first they're close, but Bob zooms ahead faster and faster
- Different behaviors → No problem!
Visual Proof
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(1, 10, 50)
z = 2 * x + 3 # Linear relationship
x_squared = x ** 2 # Non-linear relationship
fig, axes = plt.subplots(1, 2, figsize=(10, 3))
# Left: Linear (causes multicollinearity)
axes[0].scatter(x, z, label='data')
axes[0].plot(x, z, color='red', label='best fit line')
axes[0].set_xlabel("x")
axes[0].set_ylabel("z = 2x + 3")
axes[0].set_title("Linear: line fits PERFECTLY")
axes[0].legend()
# Right: Polynomial (no multicollinearity)
axes[1].scatter(x, x_squared, label='data')
slope, intercept = np.polyfit(x, x_squared, 1)
axes[1].plot(x, slope * x + intercept, color='red', label='best fit line')
axes[1].set_xlabel("x")
axes[1].set_ylabel("x²")
axes[1].set_title("Polynomial: line MISSES the curve!")
axes[1].legend()
plt.tight_layout()
plt.show()
# Correlation comparison
print(f"Correlation x vs z (linear): {np.corrcoef(x, z)[0,1]:.3f}")
print(f"Correlation x vs x²: {np.corrcoef(x, x_squared)[0,1]:.3f}")
Correlation x vs z (linear): 1.000
Correlation x vs x²: 0.978
Key Takeaways
| Concept | Definition |
|---|---|
| Multicollinearity | Features have a linear relationship (one is a straight-line function of another) |
| High correlation | Features move in the same general direction |
The problem is LINEARITY, not just correlation!
x and x² are related but carry different information:
- x → linear trend
- x² → curvature
They're teammates, not duplicates!
17. Underfitting & Overfitting
🔑 Sticky Analogy: Three Students Preparing for an Exam
Student 1: The Lazy Learner (Underfitting)
- Barely studies, memorizes one rule: "answer is always B"
- Practice test: 40% | Real exam: 40%
- Under-learned — too simple to capture patterns
Student 2: The Obsessive Memorizer (Overfitting)
- Memorizes everything: exact wording, ink color, smudge on page 47
- Practice test: 100% | Real exam: 50%
- Over-learned — memorized noise, not patterns
Student 3: The Smart Learner (Just Right)
- Learns underlying concepts, ignores irrelevant details
- Practice test: 90% | Real exam: 88%
- Generalizes — works on new data
The Golden Rules
| Diagnosis | Train Error | Test Error | Problem |
|---|---|---|---|
| Underfitting | Bad | Bad | Model too simple |
| Overfitting | Great | Terrible | Model too complex |
| Good Fit | Good | Good | Model generalizes |
PyTorch Implementation: The Three Students
import torch
import matplotlib.pyplot as plt
# Set seed for reproducibility
torch.manual_seed(42)
# Generate data: a curve with noise
X = torch.linspace(0, 1, 30).reshape(-1, 1)
y = X**2 + 0.1 * torch.randn(30, 1)
# Train/test split
X_train, X_test = X[:20], X[20:]
y_train, y_test = y[:20], y[20:]
# Helper functions
def make_poly_features(X, degree):
    """Create polynomial features: [x, x², x³, ..., x^degree]"""
    features = [X ** i for i in range(1, degree + 1)]
    return torch.cat(features, dim=1)

def mse(y_true, y_pred):
    return ((y_true - y_pred) ** 2).mean().item()
# Compare all three students
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
X_line = torch.linspace(0, 1, 200).reshape(-1, 1)
results = []
for degree, title, ax in zip(
    [1, 15, 2],
    ["Lazy Learner (Degree 1)\nUNDERFIT", "Obsessive Memorizer (Degree 15)\nOVERFIT", "Smart Learner (Degree 2)\nJUST RIGHT"],
    axes
):
    # Fit model
    X_tr = torch.cat([torch.ones(20,1), make_poly_features(X_train, degree)], dim=1)
    X_te = torch.cat([torch.ones(10,1), make_poly_features(X_test, degree)], dim=1)
    weights = torch.linalg.lstsq(X_tr, y_train).solution
    # Calculate errors
    tr_err = mse(y_train, X_tr @ weights)
    te_err = mse(y_test, X_te @ weights)
    results.append((degree, tr_err, te_err))
    # Plot
    ax.scatter(X_train.numpy(), y_train.numpy(), label='Train', color='blue')
    ax.scatter(X_test.numpy(), y_test.numpy(), label='Test', color='orange')
    X_poly = make_poly_features(X_line, degree)
    X_bias = torch.cat([torch.ones(200,1), X_poly], dim=1)
    y_line = X_bias @ weights
    ax.plot(X_line.numpy(), y_line.numpy(), color='red', linewidth=2)
    ax.set_ylim(-0.5, 1.5)
    ax.set_xlabel("X")
    ax.set_ylabel("y")
    ax.set_title(title)
    ax.legend()
plt.tight_layout()
plt.show()
# Print results
print("\n📊 Results:")
print("Degree | Train MSE | Test MSE")
print("-" * 35)
for deg, tr, te in results:
    print(f" {deg:2d} | {tr:.4f} | {te:.4f}")
📊 Results:
Degree | Train MSE | Test MSE
-----------------------------------
1 | 0.0204 | 0.1392
15 | 0.0071 | 271692.8125
2 | 0.0108 | 0.1423
TL;DR
- Underfitting: Didn't learn the pattern at all, just wild guessed
- Overfitting: Literally crammed it — memorized noise
- Good fit: Learning from examples and can generalize to real-world data
18. Occam's Razor
The Principle
"When you have two explanations that work equally well, choose the simpler one."
— William of Ockham (1300s)
Why Prefer Simplicity?
Reason 1: Robustness to New Data
A chef who knows "salt enhances flavor" can cook anywhere. A chef who memorized "add exactly 2.3g salt to this specific pot at 6:47pm" is useless in a new kitchen.
Reason 2: Interpretability
- ŷ = 0.1 + 0.05x + 0.9x² → You can explain this!
- 50-term polynomial → Black box
Reason 3: Computational Cost
- More weights = more memory, slower predictions
Occam's Razor in One Sentence
"Don't cram when understanding is enough."
# Proof: Higher degrees don't help!
print("Degree | Train MSE | Test MSE")
print("-" * 35)
for degree in [1, 2, 3, 4, 5, 10, 15]:
    X_tr = torch.cat([torch.ones(20,1), make_poly_features(X_train, degree)], dim=1)
    X_te = torch.cat([torch.ones(10,1), make_poly_features(X_test, degree)], dim=1)
    w = torch.linalg.lstsq(X_tr, y_train).solution
    tr_err = mse(y_train, X_tr @ w)
    te_err = mse(y_test, X_te @ w)
    print(f" {degree:2d} | {tr_err:.4f} | {te_err:.4f}")
Degree | Train MSE | Test MSE
-----------------------------------
1 | 0.0204 | 0.1392
2 | 0.0108 | 0.1423
3 | 0.0108 | 0.2315
4 | 0.0077 | 22.1452
5 | 0.0076 | 49.2805
10 | 0.0071 | 29204.7227
15 | 0.0071 | 271692.8125
Observation: Train MSE keeps improving, but Test MSE explodes after degree 2-3. Extra complexity = memorizing noise!
19. Bias-Variance Tradeoff
🔑 Sticky Analogy: Weather Forecasters
Forecaster 1: The Lazy One (High Bias, Low Variance)
- Every day: "Temperature will be 20°C"
- Doesn't matter if summer or winter
- Consistent but systematically wrong
Forecaster 2: The Overreactor (Low Bias, High Variance)
- Looks at every tiny detail (cloud shape, bird flying by)
- Monday: "35°C!" Tuesday: "5°C!"
- Predictions jump wildly — sometimes right, sometimes way off
Forecaster 3: The Balanced One (Low Bias, Low Variance)
- Understands weather patterns (seasons, pressure systems)
- Predictions are close to reality and stable
Definitions
| Term | Meaning | Model Equivalent |
|---|---|---|
| High Bias | Systematically off target | Model too simple, can't learn patterns |
| Low Bias | Can hit the bullseye | Model has enough capacity |
| High Variance | Predictions scattered wildly | Model too sensitive to training data |
| Low Variance | Predictions stable/consistent | Model is robust |
The Tradeoff
As model complexity increases:
- Bias goes DOWN (can capture more patterns)
- Variance goes UP (becomes sensitive to noise)
Goal: Find the sweet spot where total error is minimized!
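One rough way to see the tradeoff numerically (a sketch that reuses make_poly_features from section 17; the setup is mine, not from the original notebook): refit models of different degrees on many freshly sampled noisy datasets from the same true curve y = x², then look at how the predictions at a single point behave.

torch.manual_seed(0)
x_point = torch.tensor([[0.5]])
true_value = 0.5 ** 2

for degree in [1, 2, 10]:
    preds = []
    for _ in range(200):                      # 200 independently sampled training sets
        X_s = torch.rand(30, 1)
        y_s = X_s ** 2 + 0.1 * torch.randn(30, 1)
        X_tr = torch.cat([torch.ones(30, 1), make_poly_features(X_s, degree)], dim=1)
        w = torch.linalg.lstsq(X_tr, y_s).solution
        x_feat = torch.cat([torch.ones(1, 1), make_poly_features(x_point, degree)], dim=1)
        preds.append((x_feat @ w).item())
    preds = torch.tensor(preds)
    bias = preds.mean().item() - true_value   # systematic offset at x = 0.5
    variance = preds.var().item()             # spread across training sets
    print(f"degree {degree:2d}: bias = {bias:+.3f}, variance = {variance:.5f}")

Typically the degree-1 model shows the largest bias, the degree-10 model the largest variance, and degree 2 is low on both (exact numbers vary with the seed).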
Connecting Everything
| Problem | Bias | Variance | Training Error | Test Error |
|---|---|---|---|---|
| Underfit | High | Low | Bad | Bad |
| Overfit | Low | High | Good | Bad |
| Good Fit | Low | Low | Good | Good |
Part 3: Linear Regression Assumptions
20. Feature Scaling Deep Dive
The Problem
Features on different scales cause:
- Slow gradient descent — optimizer takes zigzag path
- Uninterpretable coefficients — can't compare feature importance
Two Methods
| Method | Formula | Result |
|---|---|---|
| Standardization (Z-score) | $(x - \mu) / \sigma$ | Mean = 0, Std = 1 |
| Min-Max Normalization | $(x - x_{min}) / (x_{max} - x_{min})$ | Range [0, 1] |
When to Use Which?
🎯 Sticky Analogy: Outlier Handling
Imagine salaries: [30k, 35k, 40k, 45k, 500k]
- Min-Max: 500k becomes 1.0, everything else squished near 0
- Standardization: 500k becomes roughly 2 (standard deviations above the mean), others spread normally around 0
| Scenario | Use |
|---|---|
| Outliers present | Standardization |
| Natural bounds, no outliers (images, ratings) | Min-Max |
| Linear regression, PCA | Standardization |
| Neural networks with sigmoid | Min-Max |
# Standardization with sklearn
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# X_scaled = scaler.fit_transform(X_train)
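Here's a runnable sketch of the salary example above (the numbers are the illustrative ones from the analogy, so treat the output as a demonstration, not a rule):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

salaries = np.array([[30_000], [35_000], [40_000], [45_000], [500_000]], dtype=float)

print("Standardized:", StandardScaler().fit_transform(salaries).ravel().round(2))
print("Min-Max:     ", MinMaxScaler().fit_transform(salaries).ravel().round(3))
# Min-Max squeezes the four ordinary salaries into a tiny band near 0 because the
# outlier defines the whole [0, 1] range; standardization keeps them spread out.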
21. R² vs Adjusted R²
The Problem with R²
🎯 Sticky Analogy: The Exam Cheater
- R² = Score on the test you memorized (always goes up with more features)
- Adjusted R² = Score on a NEW exam (penalizes for "memorized answers")
R² can NEVER decrease when you add features — even useless ones!
The Formula
$$ \text{Adj. } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1} $$
Where: n = data points, p = features
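A tiny sketch of the formula in code (the function name and the numbers are mine, purely illustrative):

def adjusted_r2(r2, n, p):
    """n = number of data points, p = number of features."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(r2=0.82, n=100, p=10))   # ≈ 0.80: a lean model pays almost no penalty
print(adjusted_r2(r2=0.82, n=100, p=60))   # ≈ 0.54: the same R² looks much worse with 60 features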
Decision Table
| What you add | R² | Adjusted R² |
|---|---|---|
| Useful feature | Goes up ✅ | Goes up ✅ |
| Useless feature | Still goes up 😬 | Goes DOWN ⬇️ |
Gap Interpretation
- Big gap (R² = 0.82, Adj R² = 0.75) → Too many features, overfitting
- Small gap → Model is lean and efficient
22. Introduction to statsmodels
🎯 Sticky Analogy: Two Chefs
- sklearn = Fast-food chef: "Here's your prediction. Next!"
- statsmodels = Master chef who teaches: "Let me show you WHY each ingredient matters"
What statsmodels Gives You
| Metric | What it tells you |
|---|---|
| R² and Adj R² | Model fit quality |
| Coefficients | Feature impact |
| p-values | Feature significance |
| Confidence intervals | Coefficient certainty |
| Durbin-Watson | Autocorrelation check |
The Constant (Intercept)
statsmodels requires you to manually add the intercept:
$$ y = b_0 + b_1x_1 + b_2x_2 + ... $$
Without it, your line is forced through origin (0,0).
import statsmodels.api as sm
# Must add constant for intercept!
# X_sm = sm.add_constant(X_train)
# model = sm.OLS(y_train, X_sm).fit()
# print(model.summary())
# Get residuals directly
# residuals = model.resid
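A runnable sketch using the house-price data from Part 1 (statsmodels wants numpy/pandas inputs; note that section 17 later reuses the name X, so re-run the Part 1 data cell first):

import pandas as pd
import statsmodels.api as sm

X_df = pd.DataFrame(X.numpy(), columns=["size", "age", "distance"])
X_sm = sm.add_constant(X_df)              # adds the intercept column ("const")
model = sm.OLS(price.numpy(), X_sm).fit()
print(model.summary())                    # coefficients, p-values, R², Durbin-Watson, ...
residuals = model.resid                   # used by the assumption tests below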
23. The 5 Assumptions Overview
🎯 Sticky Analogy: Washing Machine Conditions
Like a washing machine that needs 220V, max 7kg, water only — linear regression has conditions. Violate them, and it still runs... but gives terrible results.
| # | Assumption | Test | Memory Trick |
|---|---|---|---|
| 1 | Linearity | Visual | "Straight line fits" |
| 2 | No Multicollinearity | VIF | "Witnesses copying stories" |
| 3 | Normal Residuals | Shapiro-Wilk | "Bell curve errors" |
| 4 | Homoscedasticity | Goldfeld-Quandt | "Gold Medal Archery" |
| 5 | No Autocorrelation | Durbin-Watson | "Detective Watson" |
24. Assumption 1: Linearity
The relationship between X and y must be a straight line.
The Trick: If data is curved, transform your features!
| Original | Transformed |
|---|---|
| km_driven | km_driven, km_driven², km_driven³ |
Key Insight: "Linear" means linear in coefficients, not features.
Other transformations: log(x), √x, interaction terms (x₁ × x₂)
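A small sketch of these transforms on a hypothetical km_driven column (numbers are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({"km_driven": [5_000, 20_000, 60_000, 120_000]})
df["km_driven_sq"] = df["km_driven"] ** 2     # polynomial term for curvature
df["log_km"] = np.log(df["km_driven"])        # compresses large values
df["sqrt_km"] = np.sqrt(df["km_driven"])      # milder compression
print(df)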
25. Assumption 2: No Multicollinearity (VIF)
🎯 Sticky Analogy: Crime Investigation Witnesses
- Good: 3 witnesses, each saw something different → unique information
- Bad: 3 witnesses who copied each other's story → same info repeated 3x
What is VIF?
VIF asks: "Can I predict THIS feature using all OTHER features?"
The Formula
$$ VIF = \frac{1}{1 - R^2} $$
Where R² comes from regressing one feature against all others.
How It Works Behind the Scenes
For 4 features (X1, X2, X3, X4), it fits 4 separate regressions:
- Predict X1 from X2, X3, X4 → VIF for X1
- Predict X2 from X1, X3, X4 → VIF for X2
- ... and so on
Note: The target variable (y) is NEVER involved!
VIF Interpretation
| VIF Value | Meaning |
|---|---|
| 1 | No correlation — feature is unique |
| 1-5 | Acceptable |
| 5-10 | Warning zone |
| >10 | Remove this feature! |
Why Remove One at a Time?
If A and B are correlated, removing both loses the information entirely. Remove one, and the other's VIF drops — it becomes "independent" again.
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd
# vif_data = pd.DataFrame()
# vif_data['Feature'] = df.columns
# vif_data['VIF'] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
# vif_data.sort_values('VIF', ascending=False)
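A runnable sketch on made-up data, where x3 is built as a near-linear copy of x1 so its VIF should explode (adding a constant column first is the usual practice; its own VIF row is ignored):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["x3"] = 2 * df["x1"] + rng.normal(scale=0.1, size=200)   # almost a straight-line copy of x1

X_c = sm.add_constant(df)
vif = pd.DataFrame({
    "Feature": X_c.columns,
    "VIF": [variance_inflation_factor(X_c.values, i) for i in range(X_c.shape[1])],
})
print(vif[vif["Feature"] != "const"].sort_values("VIF", ascending=False))
# Expect x1 and x3 far above 10, while x2 stays near 1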
26. Assumption 3: Normal Residuals
🎯 Sticky Analogy: The Archer
- Good archer: Arrows land around bullseye in bell-curve pattern
- Drunk archer: Arrows all over the place, no pattern
Why It Matters
When errors are normal:
- Confidence intervals are valid
- p-values are trustworthy
- Predictions are reliable
Shapiro-Wilk Test
- Statistic (0 to 1): Closer to 1 = more normal
- p > 0.05 → Errors ARE normal ✅
- p < 0.05 → Errors NOT normal ❌
from scipy import stats
# result = stats.shapiro(model.resid)
# print(f"Statistic: {result.statistic:.4f}")
# print(f"p-value: {result.pvalue:.6f}")
# if result.pvalue > 0.05:
# print("✅ Errors are normally distributed")
# else:
# print("❌ Errors are NOT normal")
27. P-Values: Accept/Reject
🎯 Sticky Analogy: Fire Alarm
- High p-value (0.65) = "Not surprising" → No alarm 🔕 → Accept null
- Low p-value (0.001) = "Extremely surprising!" → ALARM 🔔 → Reject null
Memory Trick
Small p = Reject null
Big p = Accept null
What p-value means: "Probability of seeing this data IF the null hypothesis were true"
28. Assumption 4: Homoscedasticity
Homo = same, Skedasticity = spread
Errors should have the same spread across all predictions.
🎯 Sticky Analogy: Dartboard at Different Distances
- Good (Homoscedasticity): Same accuracy at all distances
- Bad (Heteroskedasticity): Accuracy gets worse with distance
Why Heteroskedasticity Happens
Cheap cars (₹2-5 lakhs): Limited variation, easy to predict (±₹20k error)
Expensive cars (₹50+ lakhs): Huge variation from custom features, rarity, prestige (±₹5L error)
Core Insight: Cheap things have less room for variation. Expensive things have more "hidden factors."
Visual Detection
Plot Predicted vs Residuals:
- Good: Flat band around zero
- Bad: Funnel/cone shape (errors fan out)
Goldfeld-Quandt Test
🎯 Memory Trick: Gold Medal Archery Competition
Two archers compete. Judge asks: "Are both equally consistent, or is one way more scattered?"
- Gold = Groups (splits data into two)
- Quandt = Question: "Is variance equal?"
p > 0.05 → No heteroskedasticity ✅
p < 0.05 → Heteroskedasticity exists ❌
from statsmodels.stats.api import het_goldfeldquandt
# f_stat, p_value, _ = het_goldfeldquandt(y_train, X_sm)
# print(f"F-statistic: {f_stat:.4f}")
# print(f"p-value: {p_value:.6f}")
# if p_value > 0.05:
# print("✅ No heteroskedasticity")
# else:
# print("❌ Heteroskedasticity detected!")
Important Distinction: Normality vs Homoscedasticity
| Test | What it checks |
|---|---|
| Shapiro-Wilk | Do residuals follow a bell-curve shape? |
| Goldfeld-Quandt | Is the spread of residuals constant? |
🎯 Exam Scores Analogy:
- Class A: 45, 48, 50, 52, 55 (bell curve, spread = 10)
- Class B: 20, 35, 50, 65, 80 (bell curve, spread = 60)
Both normal ✅, but different spread ❌
29. Assumption 5: No Autocorrelation
🎯 Sticky Analogy: Factory Assembly Line
- Good: Each car's quality is independent
- Bad: Machine broken → Car 1 bad, Car 2 bad, Car 3 bad (consecutive errors related)
Most Relevant For
Time-series data — where order matters (stock prices, temperature, sales over time)
Durbin-Watson Test
🎯 Memory Trick: Detective Watson
"Watson watches for patterns in sequence"
Watson's favorite number is 2!
| DW Value | Interpretation |
|---|---|
| ≈ 2 | No autocorrelation ✅ |
| → 0 | Positive autocorrelation (errors stick together) |
| → 4 | Negative autocorrelation (errors alternate) |
from statsmodels.stats.stattools import durbin_watson
# dw_stat = durbin_watson(model.resid)
# print(f"Durbin-Watson: {dw_stat:.4f}")
# if 1.5 < dw_stat < 2.5:
# print("✅ No significant autocorrelation")
# else:
# print("❌ Autocorrelation may exist")
Part 4: Regularization & Cross-Validation
The Kitchen Analogy: Throughout this section, we use the analogy of a chef trying to create the perfect dish — balancing taste (fitting data) with health (preventing overfitting).
30. Regularization: The Core Problem
What Problem Does Regularization Solve?
When a model has many features (or high polynomial degree), weights can become very large to perfectly fit training data. This leads to overfitting — great training performance, terrible test performance.
The Solution
Add a penalty term to the loss function:
$$ L = MSE + \lambda \sum_{j=1}^{d} w_j^2 $$
🍳 Kitchen Analogy: MSE makes the dish tasty (fit the data). Regularization makes it healthy (prevents overfitting). We want BOTH!
Why Adding $w^2$ Keeps Weights Small
Gradient descent minimizes everything in the loss function. If $w_j^2$ is part of the loss, large weights get "punished."
Numerical Example:
| $w_3$ | MSE | $w_3^2$ | Total Loss (with reg) |
|---|---|---|---|
| 10 | 5 | 100 | 105 |
| 2 | 8 | 4 | 12 ✓ |
The model prefers $w_3 = 2$ because 12 < 105!
Key Insight
💡 Low training loss with huge weights is a trap. We sacrifice a little training accuracy to get a model that works on real-world data.
31. Lambda (λ): The Control Dial
$$ L = MSE + \lambda \sum_{j=1}^{d} w_j^2 $$
🍳 Kitchen Analogy: Lambda is your personal nutritionist who decides how strictly you follow the health rules.
| λ Value | Effect | Analogy |
|---|---|---|
| λ = 0 | Only MSE matters → overfitting | No nutritionist (all butter!) |
| λ too large | Weights crushed → underfitting | Extremely strict (bland dish) |
| λ just right | Good fit + generalization | Balanced advice |
Numerical Example:
| $w_3$ | MSE | $w_3^2$ | Loss (λ=0.1) | Loss (λ=10) |
|---|---|---|---|---|
| 10 | 5 | 100 | 15 | 1005 |
| 2 | 8 | 4 | 8.4 ✓ | 48 ✓ |
Higher λ makes the model more "afraid" of large weights.
32. L1 vs L2 Regularization
Two Types of Nutritionists
| | L2 (Ridge) | L1 (Lasso) |
|---|---|---|
| Formula | $\sum w_j^2$ | $\sum \lvert w_j \rvert$ |
| Analogy | "Use less of every ingredient" | "Eliminate the junk entirely" |
| Effect | Shrinks all weights | Makes some weights exactly 0 |
| Use when | All features likely matter | Many features probably useless |
Why L1 Creates Exact Zeros
The key is in the derivatives:
- L2 derivative = $2w$ → push weakens as $w$ shrinks → never reaches zero
- L1 derivative = $\pm 1$ (constant) → steady push until exactly zero
🍳 Kitchen Analogy:
- L2: "Reduce each ingredient by 10% of current amount" → 100g → 90g → 81g → ... (never 0)
- L1: "Reduce each ingredient by 5g every day" → 100g → 95g → ... → 0g (gone!)
Mathematical Detail
The derivative of $|w|$ is:
$$ \frac{d|w|}{dw} = \begin{cases} +1 & \text{if } w > 0 \\ -1 & \text{if } w < 0 \\ \text{undefined} & \text{if } w = 0 \end{cases} $$
Reference: Wikipedia: Absolute Value
Gradient Descent Comparison
L2 (η=0.1, starting w=5):
| Step | w | Derivative (2w) | New w |
|---|---|---|---|
| 0 | 5.0 | 10.0 | 4.0 |
| 1 | 4.0 | 8.0 | 3.2 |
| 2 | 3.2 | 6.4 | 2.56 |
Steps get smaller as w shrinks.
L1 (η=0.1, starting w=5):
| Step | w | Derivative | New w |
|---|---|---|---|
| 0 | 5.0 | 1 | 4.9 |
| 1 | 4.9 | 1 | 4.8 |
| ... | ... | ... | ... |
| 50 | 0.0 | — | exactly 0 |
Every step is the same size (0.1), marching steadily to zero.
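To see both behaviours side by side, here is a hedged sklearn sketch on made-up data with five informative features and five pure-noise features (the alpha values are arbitrary choices, not tuned):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 10))
true_w = np.array([5, -3, 2, 4, -2, 0, 0, 0, 0, 0])        # last five features are useless
y_demo = X_demo @ true_w + rng.normal(scale=2.0, size=200)

X_demo = StandardScaler().fit_transform(X_demo)
for name, model in [("OLS", LinearRegression()),
                    ("Ridge (L2)", Ridge(alpha=10.0)),
                    ("Lasso (L1)", Lasso(alpha=0.5))]:
    model.fit(X_demo, y_demo)
    print(f"{name:>10}: {np.round(model.coef_, 2)}")
# Expect Ridge to shrink every coefficient a little, and Lasso to push the useless five to exactly 0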
33. Elastic Net
Can't decide between L1 and L2? Hire both!
$$ L = MSE + \lambda \left( \alpha \sum |w_j| + (1-\alpha) \sum w_j^2 \right) $$
| Parameter | Role |
|---|---|
| alpha | Overall strictness (how much power do the nutritionists have?) |
| l1_ratio (α) | Which nutritionist leads? (1=pure L1, 0=pure L2, 0.5=equal) |
Why Use Elastic Net?
🍳 Kitchen Analogy: You have three similar spices (garam masala, curry powder, turmeric mix).
- L1 alone: Randomly picks ONE, throws out others
- Elastic Net: Keeps all related spices together, eliminates truly useless ones
Best for: Correlated features where you want some elimination but keep feature groups together.
# Elastic Net Code Example
from sklearn.linear_model import ElasticNet
elastic_model = ElasticNet(
alpha=1.0, # Overall regularization strength
l1_ratio=0.5 # Mix: 1=pure L1, 0=pure L2, 0.5=equal
)
34. Hyperparameters vs Parameters
| Parameters | Hyperparameters | |
|---|---|---|
| Who chooses? | Model learns them | YOU choose them |
| When? | During training | Before training |
| Examples | Weights $w_0, w_1, ...$ | λ, polynomial degree, l1_ratio |
🍳 Kitchen Analogy:
- Parameters = Exact amounts of each ingredient (model figures out)
- Hyperparameters = Kitchen rules: "How strict is the nutritionist?" (you decide before cooking)
35. Cross Validation
The Problem
How do we find the best λ without cheating?
- Test on training data → biased (model memorized it)
- Test on test data → cheating (peeking at final exam)
The Solution: Validation Set
Split data three ways:
| Set | Purpose | Typical % |
|---|---|---|
| Training | Model learns weights | 60% |
| Validation | Tune hyperparameters | 20% |
| Test | Final evaluation (one shot!) | 20% |
🍳 Kitchen Analogy:
- Training data = Ingredients you cook with daily
- Validation data = Family who taste-tests experiments
- Test data = Competition judges (one shot!)
36. K-Fold Cross Validation
The Problem
With small datasets, splitting leaves too little validation data → unreliable hyperparameter tuning → higher test error.
The Solution: K-Fold
- Split data into K equal folds
- Rotate which fold is validation
- Train K times, average results
Example (K=5 with 100 samples):
| Round | Training | Validation |
|---|---|---|
| 1 | Folds 2-5 (80) | Fold 1 (20) |
| 2 | Folds 1,3,4,5 (80) | Fold 2 (20) |
| 3 | Folds 1,2,4,5 (80) | Fold 3 (20) |
| ... | ... | ... |
🍳 Kitchen Analogy: 10 family members take turns being taste-testers. Everyone gets to give feedback — no opinion is wasted!
Trade-off
K training runs = K× compute cost
| Dataset Size | Recommendation |
|---|---|
| Small (100s) | Use K-Fold — need reliable validation, training is fast |
| Large (millions) | Simple split is enough — plenty of data |
# K-Fold Code Example
from sklearn.model_selection import KFold
# kf = KFold(n_splits=10)
# for train_index, val_index in kf.split(X):
# X_train, X_val = X[train_index], X[val_index]
# y_train, y_val = y[train_index], y[val_index]
# model.fit(X_train, y_train)
# score = model.score(X_val, y_val)
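And a runnable sketch that actually uses K-Fold to pick a regularization strength (made-up data; cross_val_score handles the rotate-and-average loop for you):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 1.0, 0.0]) + rng.normal(size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for alpha in [0.01, 0.1, 1, 10, 100]:
    scores = cross_val_score(Ridge(alpha=alpha), X_demo, y_demo, cv=kf, scoring="r2")
    print(f"alpha={alpha:>6}: mean validation R² = {scores.mean():.3f}")
# Pick the alpha with the best average score, refit on the full training data, then touch the test set once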
37. Complete Quick Reference
Key Formulas
| Formula | PyTorch Code |
|---|---|
| Prediction: $\hat{y} = XW + b$ | X @ W + b |
| MSE Loss | ((y - y_pred)**2).mean() |
| R² Score | 1 - ss_res/ss_tot |
| Standardization | (X - X.mean(dim=0)) / X.std(dim=0) |
| Weight update | W -= lr * W.grad |
| Convert scaled weights | W / X.std(dim=0) |
| VIF | $\frac{1}{1 - R^2}$ |
| Adjusted R² | $1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$ |
Gradient Descent Variants
| Method | Batch Size | Pros | Cons |
|---|---|---|---|
| Batch GD | k = n | Accurate gradient | Slow, memory-heavy |
| SGD | k = 1 | Fast updates | Noisy, fluctuates |
| Mini-Batch | k = 32-256 | Fast + stable (the sweet spot) | Adds batch size as a hyperparameter |
Model Complexity
| Symptom | Diagnosis | Fix |
|---|---|---|
| Train bad, Test bad | Underfitting | Increase complexity |
| Train great, Test bad | Overfitting | Decrease complexity / Regularize |
| Train good, Test good | Good fit | Keep it! |
Regularization Cheat Sheet
| Type | Formula | When to Use | Effect |
|---|---|---|---|
| L2 (Ridge) | $\sum w_j^2$ | All features matter | Shrinks all weights |
| L1 (Lasso) | $\sum \lvert w_j \rvert$ | Many useless features | Eliminates features |
| Elastic Net | Both combined | Correlated features | Best of both |
Lambda Effects
| λ | Effect |
|---|---|
| Too small | Overfitting |
| Too large | Underfitting |
| Just right | Good generalization |
All Assumption Tests at a Glance
| Assumption | Test | Good Result | Memory Trick |
|---|---|---|---|
| Linearity | Visual plot | Straight pattern | - |
| No Multicollinearity | VIF | VIF < 5 | Witnesses copying |
| Normal Residuals | Shapiro-Wilk | p > 0.05 | Fire alarm |
| Homoscedasticity | Goldfeld-Quandt | p > 0.05 | Gold Medal Archery |
| No Autocorrelation | Durbin-Watson | DW ≈ 2 | Detective Watson |
P-Value Quick Rule
| p-value | Action |
|---|---|
| p > 0.05 | Accept null hypothesis ✅ |
| p < 0.05 | Reject null hypothesis ❌ |
VIF Action Guide
- Calculate VIF for all features
- Remove feature with highest VIF (if > 5)
- Recalculate VIF
- Repeat until all VIF < 5
Validation Strategy
| Data Size | Strategy |
|---|---|
| Large | Train/Val/Test split |
| Small | K-Fold CV |
Key Vocabulary
| Term | Definition |
|---|---|
| Epoch | One pass through entire dataset |
| Gradient | Derivative of loss w.r.t. weights |
| Learning Rate | Step size in gradient direction |
| Bias (error) | Systematic error from oversimplification |
| Variance (error) | Sensitivity to training data fluctuations |
| Multicollinearity | Linear relationship between features |
| Feature Engineering | Creating new features from existing ones |
| Generalization | Model works on unseen data |
| Hyperparameter | Settings YOU choose before training |
| Parameter | Weights the model learns during training |
| Homoscedasticity | Constant variance of errors |
| Autocorrelation | Errors related to each other in sequence |
Analogies Cheat Sheet
| Concept | Analogy |
|---|---|
| Gradient Descent | Lost in foggy mountains, feeling for downhill |
| Loss Function | Darts — average distance from bullseye |
| Feature Scaling | Classroom with one shouting student |
| Big features dominate | Seesaw with heavy adult vs small child |
| Zero gradients | Erasing the sticky notepad |
| Scaled weights | Unit conversion (miles/gallon → km/liter) |
| R² score | How much better than the lazy predictor? |
| Underfitting | Student who only answers "B" |
| Overfitting | Student who memorizes page smudges |
| Good fit | Student who learns concepts |
| High Bias | Forecaster who always says 20°C |
| High Variance | Forecaster who overreacts to every detail |
| Multicollinearity | Alice & Bob always together (can't tell apart) |
| No Multicollinearity | Alice walks, Bob accelerates (different behaviors) |
| R² vs Adjusted R² | Exam cheater vs new exam |
| sklearn vs statsmodels | Fast-food chef vs Master chef who teaches |
| Washing machine | Linear regression needs 5 conditions |
| VIF (Witnesses) | Crime witnesses copying each other's story |
| Normal Residuals (Archer) | Good archer → bell-curve pattern |
| Homoscedasticity | Dartboard at different distances |
| Autocorrelation | Factory assembly line with broken machine |
| Regularization | Nutritionist for your model |
| L2 (Ridge) | "Use less of every ingredient" |
| L1 (Lasso) | "Eliminate the junk entirely" |
| Lambda | How strict is the nutritionist? |
| Training data | Ingredients you cook with daily |
| Validation data | Family taste-testers |
| Test data | Competition judges (one shot!) |
| K-Fold | Everyone takes turns being taste-tester |
Key sklearn/statsmodels Classes
# sklearn
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler
# statsmodels
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.api import het_goldfeldquandt
from statsmodels.stats.stattools import durbin_watson
# scipy
from scipy import stats # for shapiro test
Common Gotchas
| Problem | Solution |
|---|---|
| Loss exploding | Reduce learning rate |
| Loss barely moving | Increase learning rate |
| Some weights not learning | Feature scaling |
| W.grad is None | Use requires_grad=True |
| Weird gradient accumulation | Zero gradients each iteration |
| Can't multiply sequence by float | Non-numeric columns in data |
| High R² but poor test performance | Check assumptions, regularize |
| High VIF | Remove correlated features one at a time |
Training Loop Template
for i in range(iterations):
    y_pred = predict(X, W, b)     # Forward
    loss = mse_loss(y_pred, y)    # Loss
    loss.backward()               # Backward
    with torch.no_grad():         # Update
        W -= lr * W.grad
        b -= lr * b.grad
    W.grad.zero_()                # Zero gradients
    b.grad.zero_()
Complete Assumption Check Code Template
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.api import het_goldfeldquandt
from statsmodels.stats.stattools import durbin_watson
def check_all_assumptions(X, y):
    """
    Run all linear regression assumption tests
    """
    # Fit model
    X_sm = sm.add_constant(X)
    model = sm.OLS(y, X_sm).fit()
    print("="*50)
    print("LINEAR REGRESSION ASSUMPTION CHECKS")
    print("="*50)
    # 1. VIF (Multicollinearity)
    print("\n1. MULTICOLLINEARITY (VIF)")
    vif_data = pd.DataFrame()
    vif_data['Feature'] = X.columns if hasattr(X, 'columns') else [f'X{i}' for i in range(X.shape[1])]
    vif_data['VIF'] = [variance_inflation_factor(np.array(X), i) for i in range(X.shape[1])]
    print(vif_data.sort_values('VIF', ascending=False))
    # 2. Shapiro-Wilk (Normality)
    print("\n2. NORMALITY OF RESIDUALS (Shapiro-Wilk)")
    shapiro_result = stats.shapiro(model.resid)
    print(f" p-value: {shapiro_result.pvalue:.6f}")
    print(f" {'✅ Normal' if shapiro_result.pvalue > 0.05 else '❌ Not Normal'}")
    # 3. Goldfeld-Quandt (Homoscedasticity)
    print("\n3. HOMOSCEDASTICITY (Goldfeld-Quandt)")
    _, gq_pvalue, _ = het_goldfeldquandt(y, X_sm)
    print(f" p-value: {gq_pvalue:.6f}")
    print(f" {'✅ Homoscedastic' if gq_pvalue > 0.05 else '❌ Heteroskedastic'}")
    # 4. Durbin-Watson (Autocorrelation)
    print("\n4. AUTOCORRELATION (Durbin-Watson)")
    dw = durbin_watson(model.resid)
    print(f" DW statistic: {dw:.4f}")
    print(f" {'✅ No autocorrelation' if 1.5 < dw < 2.5 else '❌ Autocorrelation present'}")
    return model
The Complete Kitchen Analogy Summary
🍳 You're a chef preparing for a cooking competition.
MSE = Making the dish tasty (fitting the data)
Regularization = Making it healthy (preventing overfitting)
Lambda (λ) = Your nutritionist's strictness level
L2 Nutritionist = "Use less of every ingredient" (shrinks all)
L1 Nutritionist = "Eliminate the junk entirely" (zeros out)
Elastic Net = Hire both nutritionists
Training data = Daily cooking practice
Validation data = Family taste-testers
Test data = Competition judges (one shot!)
K-Fold = Everyone takes turns being taste-tester
Final Summary
This comprehensive guide covered:
Part 1: Linear Regression from Scratch
✅ Forward pass (prediction)
✅ Loss function (MSE)
✅ Backward pass (autograd)
✅ Weight updates (gradient descent)
✅ Feature scaling (and why it matters)
✅ Evaluation (R² score)
✅ Clean code with FastCore @patch
Part 2: ML Fundamentals
✅ Batch, SGD, Mini-Batch gradient descent
✅ Polynomial regression & feature engineering
✅ Multicollinearity concepts
✅ Underfitting vs Overfitting
✅ Occam's Razor
✅ Bias-Variance Tradeoff
Part 3: Linear Regression Assumptions
✅ Feature scaling methods (Standardization vs Min-Max)
✅ R² vs Adjusted R²
✅ statsmodels for statistical analysis
✅ All 5 assumptions with tests
✅ P-value interpretation
Part 4: Regularization & Cross-Validation
✅ L1 (Lasso), L2 (Ridge), Elastic Net
✅ Lambda tuning
✅ Hyperparameters vs Parameters
✅ Train/Val/Test splits
✅ K-Fold Cross Validation
This is the foundation of ALL deep learning! Neural networks are just more layers of the same ideas.
Key Insight: High R² Is Not Enough!
Even with R² = 0.99, always check:
- Test set performance (overfitting?)
- All 5 assumptions
- Data leakage
- Adjusted R² gap