Building Linear Regression from First Principles
Linear Regression from Scratch with PyTorch
A Deep Dive into Gradient Descent, Feature Scaling, and Model Building
What we built: A complete Linear Regression implementation using PyTorch tensors, understanding every component from first principles.
Key Skills: Forward pass, MSE loss, gradient descent, feature scaling, R² evaluation, FastCore @patch pattern
Table of Contents
- Data Setup
- Model Initialization
- Forward Pass (Predict)
- Loss Function (MSE)
- Gradient Descent
- Training Loop
- Feature Scaling
- Converting Scaled Weights
- R² Score Evaluation
- Visualization
- Comparison with sklearn
- Refactored Code with FastCore
- Real-World Application: Cars24 Data
- Quick Reference
1. Data Setup
import torch
import matplotlib.pyplot as plt
torch.manual_seed(42)
torch.set_printoptions(sci_mode=False)
n = 100
size = torch.randint(800, 3000, (n,)).float()
age = torch.randint(1, 50, (n,)).float()
distance_to_city = torch.randint(1, 30, (n,)).float()
# True relationship with noise
price = 100000 + 300*size - 50*age - 20*distance_to_city + torch.randn(n)*5000
# Stack features into X matrix
X = torch.stack([size, age, distance_to_city], dim=1)
y = price
print(f"X shape: {X.shape}") # (100, 3) - 100 houses, 3 features
print(f"y shape: {y.shape}") # (100,) - 100 prices
X shape: torch.Size([100, 3])
y shape: torch.Size([100])
2. Model Initialization
Linear regression finds the best weights (W) and bias (b) such that:
$$\hat{y} = X \cdot W + b$$
🎯 Analogy: The Real Estate Agent
You're a real estate agent with a "pricing formula." Each weight tells you how much each feature (size, age, distance) contributes to the price. The bias is your base price when all features are zero.
# Initialize weights to zeros
W = torch.zeros(3, requires_grad=True) # One weight per feature
b = torch.zeros(1, requires_grad=True) # Single bias term
print(f"W: {W}")
print(f"b: {b}")
W: tensor([0., 0., 0.], requires_grad=True)
b: tensor([0.], requires_grad=True)
Key Insight: requires_grad=True tells PyTorch to track operations for automatic differentiation.
3. Forward Pass (Predict)
For a single house: price = w₁·size + w₂·age + w₃·distance + b
For 100 houses at once: Matrix multiplication!
def predict(X, W, b):
    return X @ W + b
# Test with zero weights - predictions will all be zero
y_pred = predict(X, W, b)
print(f"Predictions (first 5): {y_pred[:5]}")
Predictions (first 5): tensor([0., 0., 0., 0., 0.], grad_fn=<SliceBackward0>)
| Shape | Meaning |
|---|---|
| X: (100, 3) | 100 houses, 3 features |
| W: (3,) | 3 weights |
| X @ W: (100,) | 100 predictions |
| b: (1,) | broadcasts to all 100 |
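A quick shape check makes the broadcasting in the last row concrete (a small sketch using the tensors defined above):
# Sketch: confirm the shapes line up and that b broadcasts across all rows
print((X @ W).shape)       # torch.Size([100]) - one value per house
print((X @ W + b).shape)   # torch.Size([100]) - b, shape (1,), broadcasts to every row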
4. Loss Function (MSE)
How do we measure "how wrong" we are?
🎯 Analogy: Darts
Imagine playing darts blindfolded. You throw 100 darts. Your friend tells you "on average, you missed the bullseye by 15 inches." That average error is your loss. Lower = better!
Why Square the Errors?
- Makes all errors positive (no cancellation)
- Punishes big errors more than small ones
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
def mse_loss(y_pred, y_true):
    return ((y_true - y_pred)**2).mean()
# Calculate initial loss
loss = mse_loss(y_pred, price)
print(f"Initial loss: {loss:,.0f}") # Huge! We're predicting $0 for houses worth $300k-900k
Initial loss: 483,810,443,264
5. Gradient Descent
🎯 Analogy: Lost in Foggy Mountains
Imagine you're lost in foggy mountains wanting to reach the lowest valley. You can't see far, but you can feel the ground under your feet.
Strategy: Feel which direction slopes downward, then take a small step that way. Repeat.
- Gradient = the slope (which direction is "downhill" for the loss)
- Descent = move in that direction
The Update Rule
$$W_{\text{new}} = W_{\text{old}} - \alpha \cdot \nabla W$$
Where α (alpha) is the learning rate — how big a step we take.
Two Ways to Get Gradients
| Method | How | Pros/Cons |
|---|---|---|
| Autograd | loss.backward() | Easy, automatic |
| Manual | Implement formulas | Deep understanding (sketched below) |
# Using PyTorch autograd
W = torch.zeros(3, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
y_pred = predict(X, W, b)
loss = mse_loss(y_pred, price)
loss.backward() # PyTorch calculates gradients automatically!
print(f"W.grad: {W.grad}")
print(f"b.grad: {b.grad}")
W.grad: tensor([-2781148416., -31536902., -20261836.])
b.grad: tensor([-1339185.3750])
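The "Manual" route from the table above is worth seeing once. This is a sketch that is not in the original notebook: differentiating the MSE by hand gives
$$\nabla_W \text{MSE} = -\frac{2}{n} X^\top (y - \hat{y}), \qquad \nabla_b \text{MSE} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)$$
and the result should agree with autograd up to floating-point precision:
# Sketch: manual MSE gradients, checked against the autograd values above
with torch.no_grad():
    residual = price - predict(X, W, b)           # y - y_hat, shape (100,)
    grad_W_manual = -2.0 / n * (X.T @ residual)   # should match W.grad
    grad_b_manual = -2.0 / n * residual.sum()     # should match b.grad
print(torch.allclose(grad_W_manual, W.grad))
print(torch.allclose(grad_b_manual, b.grad))
These billion-scale gradients are also why the learning rate in the next section is 1e-9: a step of 1e-9 × 2.8e9 ≈ 2.8 is already a meaningful nudge to the size weight.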
Why Zero Gradients?
🎯 Analogy: The Sticky Notepad
Imagine writing directions on a notepad. Without erasing, each new direction adds to the old ones:
- Person 1: "5 north" → notepad shows 5
- Person 2: "3 north" → notepad shows 8 (5+3)
- Person 3: "2 north" → notepad shows 10 (8+2)
But you wanted just "2 north"! Zeroing = erasing the notepad before writing the new direction.
PyTorch accumulates gradients by default. For standard training, always zero gradients before backward().
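A small experiment makes the accumulation visible (a sketch with throwaway w_demo/b_demo tensors, separate from the training code):
# Sketch: gradients accumulate across backward() calls until you zero them
w_demo = torch.zeros(3, requires_grad=True)
b_demo = torch.zeros(1, requires_grad=True)
mse_loss(predict(X, w_demo, b_demo), price).backward()
first_grad = w_demo.grad.clone()
mse_loss(predict(X, w_demo, b_demo), price).backward()   # same loss again, no zeroing in between
print(torch.allclose(w_demo.grad, 2 * first_grad))        # the two gradients added up (the sticky notepad)
w_demo.grad.zero_()                                       # erase the notepad
print(w_demo.grad)                                        # back to zeros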
6. Training Loop
Putting it all together:
# Reset
W = torch.zeros(3, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
learning_rate = 1e-9 # Tiny because gradients are billions!
for i in range(1000):
    # Forward pass
    y_pred = predict(X, W, b)
    loss = mse_loss(y_pred, price)
    # Backward pass
    loss.backward()
    # Update weights
    with torch.no_grad():
        W -= learning_rate * W.grad
        b -= learning_rate * b.grad
    # Zero gradients for next iteration
    W.grad.zero_()
    b.grad.zero_()
    if i % 200 == 0:
        print(f"Iteration {i}: Loss = {loss:,.0f}")
Iteration 0: Loss = 483,810,443,264
Iteration 200: Loss = 20,317,921,280
Iteration 400: Loss = 1,750,683,648
Iteration 600: Loss = 1,006,857,664
Iteration 800: Loss = 977,015,040
Problem Discovered: Size weight learned okay (~347 vs true 300), but age and distance were stuck!
| Weight | Learned | True |
|---|---|---|
| size | 347 | 300 |
| age | 4.3 | -50 |
| distance | 2.8 | -20 |
7. Feature Scaling
The Problem
Features have wildly different scales:
- size: 800 - 3000 (big)
- age: 1 - 50 (small)
- distance: 1 - 30 (small)
🎯 Analogy: The Shouting Problem
Imagine a classroom discussion where:
- One student speaks at 100 decibels (super loud)
- Two students speak at 10 decibels (quiet whisper)
The teacher only hears the loud student!
Feature scaling = giving everyone a microphone that adjusts their volume to the same level.
Why Big Features Dominate
🎯 Analogy: The Seesaw
- Size = heavy adult (1900 kg) on seesaw
- Age = small child (23 kg) on seesaw
Nudge the adult slightly → whole seesaw tips dramatically. Nudge the child a lot → barely moves.
Concrete Example: Same weight change of 0.001:
- Size: 0.001 × 1900 = 1.9 change in prediction
- Age: 0.001 × 23 = 0.023 change in prediction
Size has 80x more impact from the same weight change!
The Solution: Standardization
Transform each feature to have mean=0 and std=1:
$$X_{scaled} = \frac{X - \mu}{\sigma}$$
# Scale features
new_X = (X - X.mean(dim=0)) / X.std(dim=0)
print("Before scaling (first row):")
print(X[0]) # [2742., 34., 12.]
print("\nAfter scaling (first row):")
print(new_X[0]) # [1.34, 0.76, -0.43] - All similar range!
Before scaling (first row):
tensor([2742., 34., 12.])
After scaling (first row):
tensor([ 1.3359, 0.7573, -0.4266])
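As a quick check (not in the original output), every column of new_X should now have a mean of about 0 and a standard deviation of about 1:
# Sketch: confirm the standardization did what we claimed
print(new_X.mean(dim=0))   # approximately [0., 0., 0.]
print(new_X.std(dim=0))    # approximately [1., 1., 1.]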
# Train with scaled features - NOW we can use a bigger learning rate!
W = torch.zeros(3, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
learning_rate = 0.1 # 100 million times bigger than before!
losses = []
for i in range(1000):
    y_pred = predict(new_X, W, b)
    loss = mse_loss(y_pred, price)
    losses.append(loss.item())
    loss.backward()
    with torch.no_grad():
        W -= learning_rate * W.grad
        b -= learning_rate * b.grad
    W.grad.zero_()
    b.grad.zero_()
print(f"Final loss: {losses[-1]:,.0f}") # ~21 million (vs 975 million before!)
Final loss: 21,687,478
Results Comparison
| Metric | Without Scaling | With Scaling |
|---|---|---|
| Learning rate | 1e-9 | 0.1 |
| Final loss | ~975 million | ~21 million |
| Convergence | Slow, incomplete | Fast, complete |
8. Converting Scaled Weights
Since we trained on scaled features, weights are in "scaled space." To interpret them:
$$W_{original} = \frac{W_{scaled}}{\sigma}$$
🎯 Analogy: Unit Conversion
The model learned "$189,224 per scaled unit." But 1 scaled unit = 630 sq ft (one std).
So in real units: $189,224 per 630 sq ft = $300 per sq ft
It's like converting "miles per gallon" to "kilometers per liter"!
stds = X.std(dim=0)
original_weights = W / stds
print(f"Scaled weights: {W.data}")
print(f"Original weights: {original_weights}")
print(f"\nTrue coefficients: [300, -50, -20]")
Scaled weights: tensor([189224.5781, -488.8487, 105.2894])
Original weights: tensor([300.4156, -34.5646, 14.0370], grad_fn=<DivBackward0>)
True coefficients: [300, -50, -20]
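The snippet above converts only the weights; the bias is still in scaled space. As a sketch (not in the original notebook), expanding $\hat{y} = W_{scaled} \cdot \frac{x - \mu}{\sigma} + b$ gives the intercept in original units:
$$b_{original} = b - \sum_j \frac{W_{scaled,j}\, \mu_j}{\sigma_j}$$
# Sketch: recover the intercept in original units too ("means" is a new helper name)
means = X.mean(dim=0)
original_bias = b - (W / stds * means).sum()
print(f"Original bias: {original_bias.item():,.0f}")  # the data was generated with a base price of 100,000, so expect the same ballpark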
9. R² Score Evaluation
R² answers: How much better is your model than just guessing the mean?
🎯 Analogy: The Lazy Predictor
- Strategy 1 (Lazy): Guess the average price for every house
- Strategy 2 (Your Model): Use features to make personalized predictions
R² = how much better is Strategy 2?
| R² Value | Meaning |
|---|---|
| 1.0 | Perfect! Explains all variation |
| 0.5 | 50% better than guessing mean |
| 0.0 | No better than guessing mean |
| Negative | Worse than guessing mean! 😬 |
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$
def r2_score(y_true, y_pred):
    ss_res = ((y_true - y_pred)**2).sum()         # Model errors
    ss_tot = ((y_true - y_true.mean())**2).sum()  # Baseline errors
    return 1 - (ss_res / ss_tot)
y_pred = predict(new_X, W, b)
print(f"R² score: {r2_score(price, y_pred):.4f}") # ~0.9994!
R² score: 0.9994
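To tie this back to the R² table earlier, here are two quick sanity checks with the same function (a sketch, not in the original):
# Sketch: the "lazy predictor" scores about 0, and a nonsense predictor goes negative
lazy_pred = torch.full_like(price, price.mean().item())   # always guess the average price
print(r2_score(price, lazy_pred))                          # ~0.0: no better than the mean, by construction
print(r2_score(price, torch.zeros_like(price)))            # large and negative: predicting $0 is worse than the mean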
10. Visualization
# Loss over time - the "hockey stick" curve
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(losses)
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Loss Over Time')
# Predicted vs Actual
plt.subplot(1, 2, 2)
with torch.no_grad():
    y_pred = predict(new_X, W, b)
plt.scatter(price, y_pred, alpha=0.5)
plt.plot([price.min(), price.max()], [price.min(), price.max()], 'r--')
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Predicted vs Actual')
plt.tight_layout()
plt.show()
11. Comparison with sklearn
from sklearn.linear_model import LinearRegression as SklearnLR
sk_model = SklearnLR()
sk_model.fit(X.numpy(), price.numpy())
print("Comparison:")
print(f"{'Metric':<20} {'Our Model':<15} {'sklearn':<15}")
print(f"{'-'*50}")
print(f"{'Size coef':<20} {(W/stds)[0].item():<15.2f} {sk_model.coef_[0]:<15.2f}")
print(f"{'Age coef':<20} {(W/stds)[1].item():<15.2f} {sk_model.coef_[1]:<15.2f}")
print(f"{'Distance coef':<20} {(W/stds)[2].item():<15.2f} {sk_model.coef_[2]:<15.2f}")
print(f"{'R² score':<20} {r2_score(price, y_pred).item():<15.4f} {sk_model.score(X.numpy(), price.numpy()):<15.4f}")
Comparison:
Metric Our Model sklearn
--------------------------------------------------
Size coef 300.42 300.42
Age coef -34.56 -34.56
Distance coef 14.04 14.04
R² score 0.9994 0.9994
Result: Identical! 🎉
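Why do they match so closely? Ordinary least squares has a closed-form solution, and sklearn solves it directly instead of iterating; our gradient descent simply converged to the same optimum. As an optional cross-check (a sketch, not in the original), torch.linalg.lstsq gives a third opinion on the unscaled data:
# Sketch: closed-form least squares on the raw features (column of ones = bias term)
X_aug = torch.cat([X, torch.ones(n, 1)], dim=1)
solution = torch.linalg.lstsq(X_aug, y.unsqueeze(1)).solution.squeeze()
print(solution[:3])   # coefficients for size, age, distance
print(solution[3])    # intercept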
12. Refactored Code with FastCore
Using @patch to build up a class incrementally — clean, modular, Pythonic.
from fastcore.utils import patch
class LR:
    def __init__(self, lr=0.1):
        self.lr = lr

@patch
def predict(self: LR, X):
    return X @ self.W + self.b

@patch
def mse_loss(self: LR, y_pred, y_true):
    return ((y_pred - y_true)**2).mean()

@patch
def r2_score(self: LR, y_true, y_pred):
    ss_res = ((y_true - y_pred)**2).sum()
    ss_tot = ((y_true - y_true.mean())**2).sum()
    return 1 - (ss_res / ss_tot)

@patch
def fit(self: LR, X, y, iterations=1000):
    n, d = X.shape
    self.W = torch.zeros(d, requires_grad=True)
    self.b = torch.zeros(1, requires_grad=True)
    self.losses = []
    for i in range(iterations):
        # Forward pass
        y_pred = self.predict(X)
        loss = self.mse_loss(y_pred, y)
        self.losses.append(loss.item())
        # Backward pass
        loss.backward()
        # Update weights
        with torch.no_grad():
            self.W -= self.lr * self.W.grad
            self.b -= self.lr * self.b.grad
        # Zero gradients
        self.W.grad.zero_()
        self.b.grad.zero_()
    return self
# Usage
my_model = LR(lr=0.1)
my_model.fit(new_X, price, iterations=1000)
print(f"R² score: {my_model.r2_score(price, my_model.predict(new_X)):.4f}")
R² score: 0.9994
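One nice property of the @patch pattern: the class stays open, so you can keep bolting on methods after training. A small illustration (plot_losses is a hypothetical helper, not part of the original):
# Sketch: patch in one more method after the fact
@patch
def plot_losses(self: LR):
    plt.plot(self.losses)
    plt.xlabel('Iteration')
    plt.ylabel('Loss')
    plt.title('Loss Over Time')
    plt.show()

my_model.plot_losses()   # works on the already-trained instance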
13. Real-World Application: Cars24 Data
We applied our model to real car sales data.
import pandas as pd
# df = pd.read_csv('cars24-car-price-cleaned.csv')
# # Drop non-numeric columns
# X_cars = df.drop(['selling_price', 'make', 'model'], axis=1).values
# y_cars = df['selling_price'].values
# # Convert to tensors and scale
# X_cars_t = torch.tensor(X_cars, dtype=torch.float32)
# y_cars_t = torch.tensor(y_cars, dtype=torch.float32)
# X_cars_scaled = (X_cars_t - X_cars_t.mean(dim=0)) / X_cars_t.std(dim=0)
# # Train
# cars_model = LR(lr=0.1)
# cars_model.fit(X_cars_scaled, y_cars_t, iterations=1000)
# # Evaluate
# y_pred_cars = cars_model.predict(X_cars_scaled)
# print(f"R² score on Cars24: {cars_model.r2_score(y_cars_t, y_pred_cars):.4f}") # ~0.78
Result: R² = 0.778
Why not 99% like synthetic data?
- Real data is messy
- Many factors affect price (brand, condition, market trends...)
- Relationship might not be perfectly linear
- We dropped the make and model columns, and those matter! (see the sketch below for one way to keep them)
For a simple linear model on real data, 0.78 is quite good!
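If you want to keep make and model rather than dropping them, one common route is one-hot encoding. This is a hypothetical sketch (the actual Cars24 column names are assumptions here), kept commented out like the block above:
# Sketch: one-hot encode the categorical columns instead of dropping them
# df_encoded = pd.get_dummies(df, columns=['make', 'model'], dtype=float)
# X_cars = df_encoded.drop('selling_price', axis=1).values
# # beware: any dummy column that is constant in the data will have std 0 and break standardization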
14. Quick Reference
Key Formulas
| Formula | PyTorch Code |
|---|---|
| Prediction: $\hat{y} = XW + b$ | X @ W + b |
| MSE Loss | ((y - y_pred)**2).mean() |
| R² Score | 1 - ss_res/ss_tot |
| Standardization | (X - X.mean(dim=0)) / X.std(dim=0) |
| Weight update | W -= lr * W.grad |
| Convert scaled weights | W / X.std(dim=0) |
Sticky Analogies Summary
| Concept | Analogy |
|---|---|
| Gradient Descent | Lost in foggy mountains, feeling for downhill |
| Loss Function | Darts — average distance from bullseye |
| Feature Scaling | Classroom with one shouting student |
| Big features dominate | Seesaw with heavy adult vs small child |
| Zero gradients | Erasing the sticky notepad |
| Convergence | Reaching the valley floor |
| Scaled weights | Unit conversion (miles/gallon → km/liter) |
| R² score | How much better than the lazy predictor? |
Training Loop Template
for i in range(iterations):
    y_pred = predict(X, W, b)    # Forward
    loss = mse_loss(y_pred, y)   # Loss
    loss.backward()              # Backward
    with torch.no_grad():        # Update
        W -= lr * W.grad
        b -= lr * b.grad
    W.grad.zero_()               # Zero gradients
    b.grad.zero_()
Common Gotchas
| Problem | Solution |
|---|---|
| Loss exploding | Reduce learning rate |
| Loss barely moving | Increase learning rate |
| Some weights not learning | Feature scaling |
| W.grad is None | Use requires_grad=True |
| Weird gradient accumulation | Zero gradients each iteration |
| Can't multiply sequence by float | Non-numeric columns in data |
Summary
This guide covered building Linear Regression from scratch, understanding every component:
✅ Forward pass (prediction)
✅ Loss function (MSE)
✅ Backward pass (autograd)
✅ Weight updates (gradient descent)
✅ Feature scaling (and why it matters)
✅ Evaluation (R² score)
✅ Clean code with FastCore @patch
✅ Applied to real data
This is the foundation of ALL deep learning! Neural networks are just more layers of the same idea.