Regression & Prediction

Drawing Lines Through Data

The Simplest Useful Model

Regression is the workhorse of data science. At its core, it answers one question: given what I know, what can I predict?

Linear regression draws a straight line through data points. The method is about 200 years old (least squares was published in the early 1800s), and it remains one of the most widely used models in production because it is fast, interpretable, and surprisingly powerful.

Linear Regression: The Foundation

Given input features (square footage, bedrooms, location), predict a continuous output (house price).

The model learns weights for each feature:

price = w₁ × sqft + w₂ × bedrooms + w₃ × location_score + bias
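
To make the formula concrete, here is a minimal sketch that fits such weights with ordinary least squares using NumPy. The fitting method, the data values, and the names `true_weights` and `predict_price` are assumptions for illustration. Prices are generated from known weights so the fit can be checked.

```python
import numpy as np

# Hypothetical training data: columns are sqft, bedrooms, location_score.
X = np.array([
    [1400.0, 3.0, 7.0],
    [1600.0, 3.0, 8.0],
    [1700.0, 4.0, 6.0],
    [1875.0, 4.0, 9.0],
    [1100.0, 2.0, 5.0],
    [2350.0, 5.0, 8.0],
])

# Prices built from known weights (illustrative only) so the result is checkable.
true_weights = np.array([150.0, 10_000.0, 5_000.0])
true_bias = 20_000.0
y = X @ true_weights + true_bias

# Append a column of ones so the bias is learned as one more weight.
X_with_bias = np.column_stack([X, np.ones(len(X))])

# Least squares: find the weights that minimize the sum of squared gaps
# between predicted and actual prices.
weights, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)
w_sqft, w_bedrooms, w_location, bias = weights


def predict_price(sqft, bedrooms, location_score):
    """Apply the learned weights to one house."""
    return w_sqft * sqft + w_bedrooms * bedrooms + w_location * location_score + bias
```

Because the synthetic prices contain no noise, the fitted weights recover the ones used to generate them. On real data the fit would only approximate the underlying relationship.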

"Learning" means finding the weights that minimize the gap between predicted and actual prices across all training examples.

Features: What You Feed the Model

Feature engineering — choosing and transforming input variables — is often more impactful than choosing the right algorithm.

Feature Type | Example | Transformation
Numeric | Square footage | Scale to 0-1 range (normalization)
Categorical | Neighborhood | One-hot encode: Downtown = 1,0,0; Suburb = 0,1,0
Derived | Price per sqft | Divide price by sqft (creates new signal; not usable when price itself is the target)
Interaction | Beds × Baths | Multiply features (captures relationships)
Temporal | Month listed | Extract from date (captures seasonality)
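
A short sketch of several of these transformations, using NumPy and the standard library. The listings, the third category ("Rural"), and all variable names are invented for illustration.

```python
from datetime import date

import numpy as np

# Hypothetical listings (values invented for illustration).
sqft = np.array([1100.0, 1600.0, 2350.0])
beds = np.array([2, 3, 5])
baths = np.array([1, 2, 3])
neighborhoods = ["Downtown", "Suburb", "Rural"]
listed_on = [date(2024, 1, 15), date(2024, 6, 3), date(2024, 11, 20)]

# Numeric: min-max scale square footage to the 0-1 range.
sqft_scaled = (sqft - sqft.min()) / (sqft.max() - sqft.min())

# Categorical: one-hot encode neighborhood, one column per category.
categories = ["Downtown", "Suburb", "Rural"]
one_hot = np.array([[1 if n == c else 0 for c in categories] for n in neighborhoods])

# Interaction: multiply bedrooms by bathrooms.
beds_x_baths = beds * baths

# Temporal: extract the listing month from the date.
month_listed = np.array([d.month for d in listed_on])

# Final feature matrix: 1 scaled column + 3 one-hot columns + 2 derived columns.
features = np.column_stack([sqft_scaled, one_hot, beds_x_baths, month_listed])
```

In a real pipeline the scaling minimum and maximum would normally be computed on the training data only and then reused for new data.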

The rule: More features aren't always better. Each feature adds complexity and can add noise. Start with the features that domain experts say matter, then add more only if they improve the model.

Train/Test Split: The Golden Rule

Never evaluate a model on data it trained on. This is the single most important rule in machine learning.

Split | Purpose | Typical Size
Training set | Model learns patterns | 70-80%
Validation set | Tune hyperparameters | 10-15%
Test set | Final evaluation (touch once!) | 10-15%

Why? A model can memorize training data perfectly (100% accuracy) while being useless on new data. The test set simulates "new data" to give an honest evaluation.
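
One possible way to produce such a split, sketched with NumPy. The function name `train_val_test_split` and the 70/15/15 proportions are illustrative choices, not a prescribed recipe.

```python
import numpy as np


def train_val_test_split(n_rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Return three disjoint index arrays that together cover all rows."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)  # shuffle so each split is a random sample
    n_test = round(n_rows * test_frac)
    n_val = round(n_rows * val_frac)
    test_idx = shuffled[:n_test]
    val_idx = shuffled[n_test:n_test + n_val]
    train_idx = shuffled[n_test + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = train_val_test_split(1000)
```

The index arrays can then be used to slice both the feature matrix and the target, so the three sets never share a row.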

Overfitting vs Underfitting

The central tension in all of machine learning:

Problem | Symptom | Cause | Fix
Overfitting | Great on training, bad on test | Too complex, too little data | Simplify model, add data, regularize
Underfitting | Bad on both training and test | Too simple for the patterns | More features, more complex model
Just right | Good on both | Right complexity for the data | Ship it
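
The symptoms in the table can be checked by comparing training and test error as model complexity grows. The sketch below does this with polynomial fits of increasing degree on synthetic data. The sine-shaped target, the noise level, and the degrees 1, 3, and 10 are all assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def make_data(n):
    """Synthetic data: a sine curve plus noise (illustrative only)."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

# Map each polynomial degree to (training RMSE, test RMSE).
results = {}
for degree in (1, 3, 10):
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (rmse(y_train, model(x_train)), rmse(y_test, model(x_test)))

for degree, (train_err, test_err) in results.items():
    print(f"degree {degree:2d}: train RMSE {train_err:.3f}, test RMSE {test_err:.3f}")
```

Training error can only go down as the degree rises, because each higher-degree model contains the lower-degree ones. A high-degree fit that looks excellent on training data but worse on test data shows the overfitting pattern. A straight line that is poor on both shows underfitting.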

Regularization penalizes complex models. It adds a cost for large weights, forcing the model to use only the features that truly matter. Two flavors:

  • L1 (Lasso): Drives some weights to exactly zero — automatic feature selection
  • L2 (Ridge): Shrinks all weights toward zero — smoother predictions
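
A small sketch of both flavors, assuming NumPy and synthetic data. `ridge_weights` uses the standard closed-form L2 solution. `soft_threshold` is the shrink-and-zero operation that L1 solvers rely on. It is the exact Lasso solution only in the special case of orthonormal features, so treat it here as an illustration of the zeroing effect, not a full Lasso fit. The function names and penalty values are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))  # 50 synthetic rows, 5 features
true_w = np.array([3.0, -2.0, 0.0, 0.0, 0.5])
y = X @ true_w + rng.normal(scale=0.5, size=50)


def ridge_weights(X, y, lam):
    """Closed-form L2 solution: (X^T X + lam * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


def soft_threshold(w, t):
    """Shrink each weight toward zero by t; magnitudes at most t become exactly 0."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)


w_unregularized = ridge_weights(X, y, lam=0.0)  # lam = 0 is plain least squares
w_ridge = ridge_weights(X, y, lam=50.0)         # all weights shrink, none hit zero exactly
w_l1_style = soft_threshold(w_unregularized, 1.0)  # small weights are zeroed out
```

Comparing `w_unregularized`, `w_ridge`, and `w_l1_style` side by side shows the difference in behavior: the L2 penalty pulls every weight toward zero, while the L1-style operation removes small weights entirely.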

Logistic Regression: When the Answer Is Yes/No

Despite the name, logistic regression is for classification, not regression. It predicts probabilities between 0 and 1.

Use cases: Will this customer churn? Is this transaction fraud? Will this lead convert?

The model outputs a probability: "This customer has a 73% chance of churning." You choose the threshold: above 50%? Above 80%? The threshold depends on the business cost of errors.
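
A minimal sketch of how a probability and a threshold turn into a decision. The weights, the bias, and the two customer features are hypothetical values chosen so the score works out to roughly 73%.

```python
import math


def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def churn_probability(features, weights, bias):
    """Weighted sum of the features, squashed into a probability."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)


def will_churn(probability, threshold=0.5):
    """Turn a probability into a yes/no decision at the chosen threshold."""
    return probability >= threshold


# Hypothetical weights for: support tickets filed, years as a customer.
weights = [0.8, -0.5]
bias = -0.1
p = churn_probability([2.0, 1.0], weights, bias)  # score is about 1.0

flag_at_50 = will_churn(p, threshold=0.5)
flag_at_80 = will_churn(p, threshold=0.8)
```

The same customer is flagged at a 50% threshold but not at 80%, which is why the threshold is a business decision and not a property of the model.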

Evaluation Metrics

For Regression (Continuous Outputs)

Metric | What It Measures | Intuition
RMSE | Square root of the mean squared error | "Predictions are typically off by about $15K"; large errors count more
MAE | Mean absolute error | Less sensitive to outliers than RMSE
R² | Proportion of variance explained | 0.85 = model explains 85% of variation

R² = 0.85 doesn't mean the model is 85% "accurate." It means 85% of the variation in the target is captured by the features. The remaining 15% is noise, luck, or missing features.
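
The three regression metrics can be written in a few lines of NumPy. The prices below (in $K) are invented, with one deliberately large miss to show how RMSE reacts to it more strongly than MAE.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: large misses are penalized more heavily."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error: every miss counts in proportion to its size."""
    return float(np.mean(np.abs(y_true - y_pred)))


def r_squared(y_true, y_pred):
    """1 minus (residual variation / total variation around the mean)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Hypothetical house prices in $K; the last prediction misses by 50.
y_true = np.array([200.0, 300.0, 400.0, 500.0])
y_pred = np.array([210.0, 290.0, 410.0, 450.0])

print(f"MAE  = {mae(y_true, y_pred):.1f}")         # 20.0
print(f"RMSE = {rmse(y_true, y_pred):.1f}")        # 26.5
print(f"R^2  = {r_squared(y_true, y_pred):.3f}")   # 0.944
```

The single 50K miss pushes RMSE well above MAE, which matches the table's note about outlier sensitivity.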

For Classification (Yes/No Outputs)

Metric | What It Measures | When to Use
Accuracy | % correct overall | Only when classes are balanced
Precision | Of predicted positives, % actually positive | When false positives are costly (spam filter)
Recall | Of actual positives, % correctly found | When false negatives are costly (cancer screening)
AUC-ROC | Ranking quality across all thresholds | General model comparison

Accuracy is misleading with imbalanced data. If 99% of transactions are legitimate, a model that always says "not fraud" gets 99% accuracy while catching zero fraud.
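
The fraud example can be reproduced directly. This sketch uses plain Python and a hypothetical helper, `classification_metrics`. Reporting precision as 0.0 when nothing is predicted positive is a convention chosen here, not a universal rule.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate, left alone
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# 1,000 transactions, 1% of them fraud (label 1).
y_true = [1] * 10 + [0] * 990

# A "model" that always says "not fraud".
always_legit = [0] * 1000

metrics = classification_metrics(y_true, always_legit)
print(metrics)  # accuracy 0.99, precision 0.0, recall 0.0
```

Recall exposes what accuracy hides: the model finds none of the ten fraudulent transactions.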

When Regression Beats ML

Linear/logistic regression often beats complex ML models when:

  • Data is limited (roughly < 1,000 rows): complex models tend to overfit
  • Interpretability matters: "price increases $150 per sqft" is actionable
  • Relationships are roughly linear: no need for extra complexity
  • Speed matters: regression usually trains and predicts far faster than neural networks
  • Regulatory requirements apply: some industries require explainable models

A common mistake: jumping to random forests or neural networks when regression would work fine with proper feature engineering.

Key Takeaways

  • Linear regression predicts continuous values by learning feature weights
  • Feature engineering is often more impactful than algorithm choice
  • Never evaluate on training data — use train/test splits
  • Overfitting = memorization; underfitting = too simple. Regularization helps.
  • Logistic regression handles classification (yes/no) with probability outputs
  • Choose metrics based on business cost of errors, not just accuracy
  • Regression often beats complex models when data is limited or interpretability matters

This is chapter 2 of Data Science for AI.
