
Regression & Prediction

Drawing Lines Through Data

The Simplest Useful Model

Regression is the workhorse of data science. At its core, it answers one question: given what I know, what can I predict?

Linear regression draws a straight line through data points. It is roughly 200 years old, yet it remains one of the most widely used models in production because it's fast, interpretable, and surprisingly powerful.

Linear Regression: The Foundation

Given input features (square footage, bedrooms, location), predict a continuous output (house price).

The model learns weights for each feature:

price = w₁ × sqft + w₂ × bedrooms + w₃ × location_score + bias
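As a minimal sketch, the weighted sum can be written in a few lines of Python, along with mean squared error, one common way to measure the gap between predicted and actual prices. The weights, bias, and listings are hypothetical values picked only to make the arithmetic easy to follow.

```python
def predict_price(sqft, bedrooms, location_score, weights, bias):
    """Weighted sum of the features plus a bias term."""
    w_sqft, w_bedrooms, w_location = weights
    return w_sqft * sqft + w_bedrooms * bedrooms + w_location * location_score + bias


def mean_squared_error(actual, predicted):
    """Average squared gap between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical weights and listings, chosen only for illustration.
weights = (150.0, 10_000.0, 5_000.0)
bias = 20_000.0
homes = [(1000, 2, 3), (2000, 4, 8)]      # (sqft, bedrooms, location_score)
actual_prices = [210_000.0, 400_000.0]

predicted = [predict_price(*home, weights, bias) for home in homes]
loss = mean_squared_error(actual_prices, predicted)

print(predicted)  # [205000.0, 400000.0]
print(loss)       # 12500000.0
```

Training amounts to searching for the weights and bias that make this loss as small as possible over all training examples.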

"Learning" means finding the weights that minimize the gap between predicted and actual prices across all training examples.


Features: What You Feed the Model

Feature engineering — choosing and transforming input variables — is often more impactful than choosing the right algorithm.

Feature Type	Example	Transformation
Numeric	Square footage	Scale to 0-1 range (normalization)
Categorical	Neighborhood	One-hot encode: Downtown=1,0,0; Suburb=0,1,0
Derived	House age	Subtract year built from year listed (creates new signal; never derive a feature from the target itself, such as price)
Interaction	Beds × Baths	Multiply features (captures relationships)
Temporal	Month listed	Extract from date (captures seasonality)
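A few of these transformations can be sketched in plain Python. The listings, the third neighborhood name ("Rural"), and the listing date are assumptions made up for the example; a real pipeline would typically lean on a library for this.

```python
from datetime import date


def min_max_scale(values):
    """Rescale numbers to the 0-1 range (assumes the values are not all equal)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the matching category."""
    return [1 if value == c else 0 for c in categories]


# Hypothetical listings, for illustration only.
sqft = [800, 1200, 2000]
scaled_sqft = min_max_scale(sqft)                    # numeric: scale to 0-1

neighborhoods = ["Downtown", "Suburb", "Rural"]
suburb_encoded = one_hot("Suburb", neighborhoods)    # categorical: one-hot

beds, baths = 3, 2
beds_x_baths = beds * baths                          # interaction feature

month_listed = date(2024, 6, 15).month               # temporal feature

print(suburb_encoded)  # [0, 1, 0]
```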

The rule: More features aren't always better. Each feature adds complexity and can add noise. Start with the features that domain experts say matter, then add more only if they improve the model on held-out data.

Train/Test Split: The Golden Rule

Never evaluate a model on data it trained on. This is the single most important rule in machine learning.

Split	Purpose	Typical Size
Training set	Model learns patterns	70-80%
Validation set	Tune hyperparameters	10-15%
Test set	Final evaluation (touch once!)	10-15%

Why? A model can memorize training data perfectly (100% accuracy) while being useless on new data. The test set simulates "new data" to give an honest evaluation.
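One way to sketch a three-way split with only the standard library is to shuffle once with a fixed seed and then slice. The function name, the 70/15/15 proportions, and the seed are illustrative choices, not a prescribed recipe.

```python
import random


def train_val_test_split(rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle a copy of the rows, then cut it into train, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


rows = list(range(100))   # stand-in for 100 labeled examples
train, val, test = train_val_test_split(rows)

print(len(train), len(val), len(test))  # 70 15 15
```

Every row lands in exactly one of the three sets, so nothing the model trains on can leak into its final evaluation.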

Overfitting vs Underfitting

The central tension in all of machine learning:

Problem	Symptom	Cause	Fix
Overfitting	Great on training, bad on test	Too complex, too little data	Simplify model, add data, regularize
Underfitting	Bad on both training and test	Too simple for the patterns	More features, more complex model
Just right	Good on both	Right complexity for the data	Ship it
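To make the overfitting row concrete, here is a toy comparison under stated assumptions: a "memorizer" that stores every training pair (and guesses the training mean for anything unseen) against a one-feature least-squares line. Both models and the data are hypothetical and deliberately tiny.

```python
def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def fit_memorizer(xs, ys):
    """Overly flexible model: store every training pair, guess the mean otherwise."""
    table = dict(zip(xs, ys))
    fallback = sum(ys) / len(ys)
    return lambda x: table.get(x, fallback)


def fit_line(xs, ys):
    """Ordinary least squares for one feature: y = slope * x + intercept."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return lambda x: slope * x + intercept


# Hypothetical data, roughly y = 2x with some noise in the training set.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [2.5, 3.5, 6.5, 7.5]
test_x, test_y = [5.0, 6.0], [10.0, 12.0]

memorizer = fit_memorizer(train_x, train_y)
line = fit_line(train_x, train_y)

memorizer_train = mse(train_y, [memorizer(x) for x in train_x])
memorizer_test = mse(test_y, [memorizer(x) for x in test_x])
line_train = mse(train_y, [line(x) for x in train_x])
line_test = mse(test_y, [line(x) for x in test_x])
```

On this toy data the memorizer scores a perfect 0 on training but 37 on test, while the line scores about 0.2 on training and about 0.37 on test: slightly worse on data it has seen, far better on data it has not.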

Regularization penalizes complex models. It adds a cost for large weights, forcing the model to use only the features that truly matter. Two flavors:

L1 (Lasso): Drives some weights to exactly zero — automatic feature selection

L2 (Ridge): Shrinks all weights toward zero — smoother predictions
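The difference between the two penalties shows up even in the simplest possible case: one feature, no bias term. The sketch below assumes the penalized losses sum((y - w*x)^2) + lam * w^2 for ridge and sum((y - w*x)^2) + lam * |w| for lasso, both of which have closed-form solutions in this setting. The data and penalty strengths are made up for illustration.

```python
def ridge_weight(xs, ys, lam):
    """Minimize sum((y - w*x)^2) + lam * w^2 for one feature, no bias."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def lasso_weight(xs, ys, lam):
    """Minimize sum((y - w*x)^2) + lam * |w| for one feature, no bias."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam / 2, 0.0)   # soft-thresholding
    return (shrunk if sxy >= 0 else -shrunk) / sxx


xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # toy data where y = 2x exactly

penalties = (0.0, 14.0, 56.0)
ridge = [ridge_weight(xs, ys, lam) for lam in penalties]
lasso = [lasso_weight(xs, ys, lam) for lam in penalties]

print(lasso)  # [2.0, 1.5, 0.0]
```

With no penalty both recover the true weight of 2. As the penalty grows, the ridge weight keeps shrinking (2, 1, 0.4) but stays above zero, while the lasso weight hits exactly zero at the largest penalty.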

Logistic Regression: When the Answer Is Yes/No

Despite the name, logistic regression is for classification, not regression. It predicts probabilities between 0 and 1.

Use cases: Will this customer churn? Is this transaction fraud? Will this lead convert?

The model outputs a probability: "This customer has a 73% chance of churning." You choose the threshold: above 50%? Above 80%? The threshold depends on the business cost of errors.
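A minimal sketch of how this works: logistic regression passes the same kind of weighted sum used in linear regression through a sigmoid, which maps it to a probability, and a threshold turns that probability into a decision. The two features, their weights, and the bias are hypothetical.

```python
import math


def sigmoid(z):
    """Squash a real number into the (0, 1) range (fine for moderate z)."""
    return 1.0 / (1.0 + math.exp(-z))


def churn_probability(features, weights, bias):
    """Logistic regression: sigmoid of the weighted sum of features."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def classify(probability, threshold=0.5):
    """Turn a probability into a yes/no decision."""
    return probability >= threshold


# Hypothetical weights and customer, for illustration only.
weights = [0.8, -0.5]      # e.g. support tickets filed, years as a customer
bias = -0.2
customer = [2.0, 1.0]

p = churn_probability(customer, weights, bias)   # weighted sum z = 0.9
flag_at_50 = classify(p, threshold=0.5)
flag_at_80 = classify(p, threshold=0.8)
```

Here the probability comes out at roughly 0.71, so the same customer is flagged at a 50% threshold but not at an 80% threshold. The model is identical in both cases; only the business decision changes.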

Evaluation Metrics

For Regression (Continuous Outputs)

Metric	What It Measures	Intuition
RMSE	Square root of the mean squared error	"A typical miss is about $15K," with large misses weighted more heavily
MAE	Average absolute error	"On average, predictions are off by $15K"; less sensitive to outliers than RMSE
R²	Proportion of variance explained	0.85 = model explains 85% of variation

R² = 0.85 doesn't mean the model is 85% "accurate." It means 85% of the variation in the target is captured by the features. The remaining 15% is noise, luck, or missing features.
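The three regression metrics are short enough to write out directly. This sketch uses the standard definitions; the prices (in $K) are invented, with one deliberately large miss to show how RMSE and MAE diverge.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical prices in $K; the last prediction is a large miss.
actual = [200.0, 300.0, 400.0, 500.0]
predicted = [210.0, 290.0, 410.0, 440.0]

mae_value = mae(actual, predicted)
rmse_value = rmse(actual, predicted)
r2_value = r_squared(actual, predicted)
```

On these numbers MAE is 22.5 while RMSE is about 31.2: the single $60K miss pulls RMSE up much more than MAE. R² comes out near 0.92.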

For Classification (Yes/No Outputs)

Metric	What It Measures	When to Use
Accuracy	% correct overall	Only when classes are balanced
Precision	Of predicted positives, % actually positive	When false positives are costly (spam filter)
Recall	Of actual positives, % correctly found	When false negatives are costly (cancer screening)
AUC-ROC	Ranking quality across all thresholds	General model comparison

Accuracy is misleading with imbalanced data. If 99% of transactions are legitimate, a model that always says "not fraud" gets 99% accuracy — while catching zero fraud.
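The fraud example can be checked in a few lines. The sketch below compares an always-"not fraud" model with a hypothetical detector on 1,000 invented transactions, 10 of them fraudulent. Returning 0.0 for precision or recall when the denominator is zero is a convention assumed here, not a universal rule.

```python
def confusion_counts(actual, predicted):
    """Count true positives, false positives, false negatives, true negatives."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(actual, predicted):
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    return (tp + tn) / len(actual)


def precision(actual, predicted):
    tp, fp, _, _ = confusion_counts(actual, predicted)
    return tp / (tp + fp) if (tp + fp) else 0.0   # assumed convention when undefined


def recall(actual, predicted):
    tp, _, fn, _ = confusion_counts(actual, predicted)
    return tp / (tp + fn) if (tp + fn) else 0.0   # assumed convention when undefined


# Hypothetical data: 1 = fraud, 0 = legitimate; 10 fraud cases out of 1,000.
actual = [1] * 10 + [0] * 990
always_negative = [0] * 1000
# Hypothetical detector: catches 8 of 10 fraud cases, raises 20 false alarms.
detector = [1] * 8 + [0] * 2 + [1] * 20 + [0] * 970

print(accuracy(actual, always_negative), recall(actual, always_negative))  # 0.99 0.0
print(accuracy(actual, detector), recall(actual, detector))                # 0.978 0.8
```

The detector has lower accuracy than the do-nothing model, yet it is the only one of the two that catches any fraud, which is why precision and recall are the metrics to watch on imbalanced data.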

When Regression Beats Complex Models

Linear/logistic regression beats complex ML models when:

Data is limited (as a rough guide, fewer than 1,000 rows): complex models tend to overfit

Interpretability matters — "price increases $150 per sqft" is actionable

Relationships are roughly linear — no need for complexity

Speed matters: regression typically trains and predicts far faster than neural networks

Regulatory requirements — some industries require explainable models

A common mistake: jumping to random forests or neural networks when regression would work fine with proper feature engineering.

Key Takeaways

Linear regression predicts continuous values by learning feature weights

Feature engineering is often more impactful than algorithm choice

Never evaluate on training data — use train/test splits

Overfitting = memorization; underfitting = too simple. Regularization helps.

Logistic regression handles classification (yes/no) with probability outputs

Choose metrics based on business cost of errors, not just accuracy

Regression often beats complex models when data is limited or interpretability matters

This is chapter 2 of Data Science for AI.
