Regression & Prediction
Drawing Lines Through Data
The Simplest Useful Model
Regression is the workhorse of data science. At its core, it answers one question: given what I know, what can I predict?
Linear regression draws a straight line through data points. The method is roughly 200 years old, and it remains one of the most widely used models in production because it's fast, interpretable, and surprisingly powerful.
Linear Regression: The Foundation
Given input features (square footage, bedrooms, location), predict a continuous output (house price).
The model learns weights for each feature:
price = w₁ × sqft + w₂ × bedrooms + w₃ × location_score + bias
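One way to find such weights can be sketched with ordinary least squares. This is a minimal illustration, assuming a single feature (sqft) so the closed-form solution fits in a few lines; the function name `fit_line` and the toy prices are made up for the example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w * x + bias for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    bias = mean_y - w * mean_x
    return w, bias

# Hypothetical training data that lies exactly on a line.
sqft = [1000, 1500, 2000, 2500]
price = [200_000, 275_000, 350_000, 425_000]

w, bias = fit_line(sqft, price)
predicted_price = w * 1800 + bias  # prediction for an unseen 1,800 sqft house
```

With several features the same idea applies, but the weights are usually found with linear algebra or gradient descent rather than this one-variable formula.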
"Learning" means finding the weights that minimize the gap between predicted and actual prices across all training examples.
Features: What You Feed the Model
Feature engineering — choosing and transforming input variables — is often more impactful than choosing the right algorithm.
| Feature Type | Example | Transformation |
|---|---|---|
| Numeric | Square footage | Scale to 0-1 range (normalization) |
| Categorical | Neighborhood | One-hot encode: Downtown=1,0,0; Suburb=0,1,0 |
| Derived | Neighborhood price per sqft | Divide prior sale prices by sqft for nearby homes (creates new signal; never use the listing's own price, which is the target) |
| Interaction | Beds × Baths | Multiply features (captures relationships) |
| Temporal | Month listed | Extract from date (captures seasonality) |
The rule: More features aren't always better. Each feature adds complexity and can add noise. Start with the features that domain experts say matter, then add more only if they improve the model on held-out data.
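Several of the transformations in the table can be sketched in plain Python. The listings, the helper names `min_max_scale` and `one_hot`, and the third neighborhood category "Rural" are assumptions made up for illustration.

```python
from datetime import date

def min_max_scale(values):
    """Rescale numbers to the 0-1 range; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the matching category."""
    return [1 if value == c else 0 for c in categories]

# "Rural" is a hypothetical third category; the table names only two.
NEIGHBORHOODS = ["Downtown", "Suburb", "Rural"]

# Hypothetical listings.
listings = [
    {"sqft": 1000, "beds": 2, "baths": 1, "neighborhood": "Downtown", "listed": date(2024, 3, 15)},
    {"sqft": 1500, "beds": 3, "baths": 2, "neighborhood": "Suburb", "listed": date(2024, 7, 1)},
    {"sqft": 2000, "beds": 4, "baths": 3, "neighborhood": "Rural", "listed": date(2024, 11, 20)},
]

scaled_sqft = min_max_scale([row["sqft"] for row in listings])

feature_rows = []
for row, sqft01 in zip(listings, scaled_sqft):
    features = [sqft01]                                       # numeric, scaled to 0-1
    features += one_hot(row["neighborhood"], NEIGHBORHOODS)   # categorical
    features.append(row["beds"] * row["baths"])               # interaction
    features.append(row["listed"].month)                      # temporal
    feature_rows.append(features)
```

In a real pipeline the minimum and maximum used for scaling would be computed on the training set only and then reused for new data.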
Train/Test Split: The Golden Rule
Never evaluate a model on data it trained on. This is the single most important rule in machine learning.
| Split | Purpose | Typical Size |
|---|---|---|
| Training set | Model learns patterns | 70-80% |
| Validation set | Tune hyperparameters | 10-15% |
| Test set | Final evaluation (touch once!) | 10-15% |
Why? A model can memorize training data perfectly (100% accuracy) while being useless on new data. The test set simulates "new data" to give an honest evaluation.
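A three-way split along these lines might look like the sketch below. The function name `train_val_test_split`, the 70/15/15 default, and the fixed seed are assumptions chosen for the example, not a prescribed recipe.

```python
import random

def train_val_test_split(rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle a copy of rows, then slice it into train/validation/test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_test = round(len(rows) * test_frac)
    n_val = round(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

# Stand-in for 100 labeled examples.
examples = list(range(100))
train, val, test = train_val_test_split(examples)
```

Random shuffling suits data without a time order. For time-dependent data, a split by date is usually safer so the model never trains on the future.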
Overfitting vs Underfitting
The central tension in all of machine learning:
| Problem | Symptom | Cause | Fix |
|---|---|---|---|
| Overfitting | Great on training, bad on test | Too complex, too little data | Simplify model, add data, regularize |
| Underfitting | Bad on both training and test | Too simple for the patterns | More features, more complex model |
| Just right | Good on both | Right complexity for the data | Ship it |
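Regularization penalizes complex models. It adds a cost for large weights, which pushes the model to rely on the features that truly matter. Two flavors: L1 (lasso), which can shrink some weights to exactly zero and so drops features, and L2 (ridge), which shrinks all weights toward zero without eliminating them.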
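The effect of a weight penalty can be sketched with an L2 (ridge) penalty on a single feature. This is a simplified illustration, assuming one input, an unpenalized bias, and a hypothetical penalty strength `lam`; the name `fit_ridge_line` is made up for the example.

```python
def fit_ridge_line(xs, ys, lam):
    """Minimize sum((y - w*x - bias)^2) + lam * w^2 for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    # The penalty lam is added to the denominator, which shrinks w toward zero.
    w = cov / (var + lam)
    bias = mean_y - w * mean_x
    return w, bias

# Toy data on the line y = 2x + 1.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]

w_plain, _ = fit_ridge_line(xs, ys, lam=0.0)   # no penalty: ordinary least squares
w_ridge, _ = fit_ridge_line(xs, ys, lam=5.0)   # penalty pulls the weight down
```

A larger `lam` means a smaller weight and a simpler model. The right value is typically chosen on the validation set, not the test set.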
Logistic Regression: When the Answer Is Yes/No
Despite the name, logistic regression is for classification, not regression. It predicts probabilities between 0 and 1.
Use cases: Will this customer churn? Is this transaction fraud? Will this lead convert?
The model outputs a probability: "This customer has a 73% chance of churning." You choose the threshold: above 50%? Above 80%? The threshold depends on the business cost of errors.
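The probability-then-threshold step can be sketched as follows. The weights, bias, and feature values are hypothetical numbers picked so the output lands near the 73% figure above. A real model would learn them from data.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def churn_probability(features, weights, bias):
    """Logistic regression: a weighted sum passed through the sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def classify(probability, threshold=0.5):
    """Turn a probability into a yes/no decision."""
    return probability >= threshold

# Hypothetical customer features and learned parameters.
features = [2.0, 1.0]
weights = [0.25, 0.5]
bias = 0.0

p = churn_probability(features, weights, bias)  # about 0.73
flag_at_50 = classify(p, threshold=0.5)         # flagged as likely to churn
flag_at_80 = classify(p, threshold=0.8)         # not flagged at the stricter cutoff
```

The same customer gets a different decision at each threshold, which is why the cutoff is a business choice and not a property of the model.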
Evaluation Metrics
For Regression (Continuous Outputs)
| Metric | What It Measures | Intuition |
|---|---|---|
| RMSE | Square root of the mean squared error | Typical error size: "predictions are off by about $15K"; large misses count extra |
| MAE | Average absolute error | Less sensitive to outliers than RMSE |
| R² | Proportion of variance explained | 0.85 = model explains 85% of variation |
R² = 0.85 doesn't mean the model is 85% "accurate." It means 85% of the variation in the target is captured by the model's predictions. The remaining 15% is noise, luck, or missing features.
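The three regression metrics can be computed directly from their definitions. The prices below (in thousands of dollars) are invented for illustration.

```python
import math

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error; squaring weights large misses more heavily."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)

def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical house prices in $K.
actual = [200, 300, 400, 500]
predicted = [210, 290, 420, 480]

mae_value = mae(actual, predicted)
rmse_value = rmse(actual, predicted)
r2_value = r_squared(actual, predicted)
```

RMSE comes out slightly higher than MAE here because the two larger errors are squared before averaging.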
For Classification (Yes/No Outputs)
| Metric | What It Measures | When to Use |
|---|---|---|
| Accuracy | % correct overall | Only when classes are balanced |
| Precision | Of predicted positives, % actually positive | When false positives are costly (spam filter) |
| Recall | Of actual positives, % correctly found | When false negatives are costly (cancer screening) |
| AUC-ROC | Ranking quality across all thresholds | General model comparison |
Accuracy is misleading with imbalanced data. If 99% of transactions are legitimate, a model that always says "not fraud" gets 99% accuracy — while catching zero fraud.
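The imbalance problem is easy to reproduce. The sketch below assumes 100 made-up transactions with a single fraud case, and compares an "always legitimate" baseline against a hypothetical model that flags three transactions. Returning 0.0 when a metric's denominator is zero is a convention chosen for this example.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels (1 = fraud)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(actual, predicted):
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    return (tp + tn) / len(actual)

def precision(actual, predicted):
    tp, fp, _, _ = confusion_counts(actual, predicted)
    return tp / (tp + fp) if (tp + fp) else 0.0  # convention: 0.0 if nothing was flagged

def recall(actual, predicted):
    tp, _, fn, _ = confusion_counts(actual, predicted)
    return tp / (tp + fn) if (tp + fn) else 0.0  # convention: 0.0 if no positives exist

actual = [1] + [0] * 99          # one fraud among 100 transactions
always_legit = [0] * 100         # baseline that never flags anything
flags_three = [1, 1, 1] + [0] * 97  # hypothetical model: catches the fraud, two false alarms

baseline_accuracy = accuracy(actual, always_legit)
baseline_recall = recall(actual, always_legit)
model_accuracy = accuracy(actual, flags_three)
model_recall = recall(actual, flags_three)
model_precision = precision(actual, flags_three)
```

The baseline wins on accuracy yet finds no fraud, while the second model gives up a little accuracy to catch the one case that matters.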
When Regression Beats ML
Linear/logistic regression beats complex ML models when:
A common mistake: jumping to random forests or neural networks when regression would work fine with proper feature engineering.
Key Takeaways
This is chapter 2 of Data Science for AI.