Classification & Clustering

Sorting and Grouping at Scale

Two Fundamental Tasks

Classification and clustering both organize data into groups — but in fundamentally different ways:

  • Classification (supervised): You tell the model the groups. "Here are 1,000 emails labeled spam/not-spam. Learn the pattern."
  • Clustering (unsupervised): The model finds the groups. "Here are 10,000 customers. Find natural segments."

    Decision Trees: If-Then Logic

    A decision tree splits data using if-then rules. It is one of the most interpretable classification algorithms.

    How it works:

  • Find the feature and threshold that best separates classes
  • Split the data at that point
  • Repeat for each branch until leaves are pure (or you stop early)

    Example (loan default prediction):

  • If income > $75K → low risk
  • If income ≤ $75K AND debt ratio > 0.4 → high risk
  • If income ≤ $75K AND debt ratio ≤ 0.4 AND employment > 2 years → medium risk

    Strengths: Easy to explain ("this customer was flagged because their debt ratio exceeds 40%"). Handles both numeric and categorical features. No scaling needed.

    Weaknesses: Prone to overfitting (deep trees memorize the training data). Unstable (a small change in the data can produce a very different tree). Poorly suited to smooth continuous outputs, because predictions are piecewise constant.
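
The split-finding step described above can be sketched in a few lines. This is a minimal illustration, not a full tree learner: it assumes Gini impurity as the split criterion (the text does not name one), a single numeric feature, and made-up debt-ratio data. The names `gini` and `best_split` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure group, larger for mixed groups."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best 'value <= threshold' split."""
    n = len(values)
    best = (None, gini(labels))  # baseline: do not split at all
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best[1]:
            best = (threshold, impurity)
    return best


# Hypothetical toy data: debt ratio and whether the loan defaulted.
debt_ratio = [0.10, 0.20, 0.30, 0.45, 0.50, 0.60]
defaulted = ["no", "no", "no", "yes", "yes", "yes"]

threshold, impurity = best_split(debt_ratio, defaulted)
print(threshold, impurity)
```

A full tree would run `best_split` on every feature, keep the best result, and recurse on each side until the leaves are pure or a depth limit is reached.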

    Random Forests: Wisdom of Crowds

    A random forest builds hundreds of decision trees, each trained on a random subset of data and features. The final prediction is the majority vote across all trees.

    Why it works: Individual trees overfit, but because each tree sees different rows and features, their errors are only weakly correlated. Averaging many weakly correlated errors cancels most of them out, so the ensemble is more stable than any single tree.
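
A rough way to see this effect is to compute how often a majority vote is right when every tree is right with the same probability and the trees err independently. Both conditions are idealizing assumptions (real trees are correlated, so the real gain is smaller), and `majority_vote_accuracy` is a hypothetical helper for binary classification with an odd number of trees.

```python
from math import comb


def majority_vote_accuracy(n_trees, p_correct):
    """Probability that more than half of n_trees independent voters are right."""
    needed = n_trees // 2 + 1
    return sum(
        comb(n_trees, k) * p_correct**k * (1 - p_correct) ** (n_trees - k)
        for k in range(needed, n_trees + 1)
    )


# Each tree alone is right 60% of the time (an invented figure).
for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the vote becomes more reliable as trees are added, which is the intuition behind using hundreds of them.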

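    | Parameter        | What It Controls    | Typical Setting             |
    |------------------|---------------------|-----------------------------|
    | n_estimators     | Number of trees     | 100-500                     |
    | max_depth        | Tree depth limit    | None (let trees grow fully) |
    | max_features     | Features per split  | sqrt(total features)        |
    | min_samples_leaf | Min samples in leaf | 1-5                         |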
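
To make the two sources of randomness concrete, the sketch below draws one tree's training rows and one split's candidate features. It assumes bootstrap sampling (rows drawn with replacement), which is the usual choice but is not stated above; the function names are illustrative only.

```python
import math
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one tree's training sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def candidate_features(n_features, rng):
    """Pick about sqrt(n_features) feature indices to consider at one split."""
    k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = bootstrap_indices(1000, rng)
print(len(rows), len(set(rows)))    # sample size vs. distinct rows actually drawn
print(candidate_features(16, rng))  # 4 of the 16 feature indices
```

Because rows are drawn with replacement, each tree sees some rows several times and misses others entirely, which is what makes the trees differ.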

    Random forests are the Swiss Army knife of machine learning:

  • Work well with minimal tuning
  • Are robust to outliers; some implementations also handle missing values natively
  • Provide feature importance rankings
  • Resist overfitting: adding more trees stabilizes the vote rather than memorizing noise

    When to reach for random forests: Any classification task where you have structured data, moderate dataset size (1K-1M rows), and need good performance without heavy tuning.

    The Confusion Matrix

    The confusion matrix is the truth table for classification:

    |                   | Predicted Positive  | Predicted Negative  |
    |-------------------|---------------------|---------------------|
    | Actually Positive | True Positive (TP)  | False Negative (FN) |
    | Actually Negative | False Positive (FP) | True Negative (TN)  |

    From this, we derive:

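    | Metric    | Formula               | Business Meaning                                 |
    |-----------|-----------------------|--------------------------------------------------|
    | Precision | TP / (TP + FP)        | "When we say yes, how often are we right?"       |
    | Recall    | TP / (TP + FN)        | "Of all actual yeses, how many did we catch?"    |
    | F1 Score  | 2 × (P × R) / (P + R) | Harmonic mean of precision (P) and recall (R)    |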
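
These formulas translate directly into code. A small sketch, assuming binary labels encoded as 1 (positive) and 0 (negative); the function names and the ten example predictions are made up for illustration.

```python
def confusion_counts(actual, predicted):
    """Count (TP, FP, FN, TN) for binary labels where 1 means positive."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1
        elif p:
            fp += 1  # predicted positive, actually negative
        elif a:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1), using 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
print((tp, fp, fn, tn))                 # (3, 1, 1, 5)
print(precision_recall_f1(tp, fp, fn))  # (0.75, 0.75, 0.75)
```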

    The Precision-Recall Trade-off

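    You usually cannot maximize both at once: catching more true positives (higher recall) tends to let in more false alarms (lower precision). Which side to favor depends on business costs: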

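    | Scenario                | Optimize For            | Why                                                    |
    |-------------------------|-------------------------|--------------------------------------------------------|
    | Spam filter             | Precision               | False positive = real email in spam (bad)              |
    | Cancer screening        | Recall                  | False negative = missed cancer (very bad)              |
    | Fraud detection         | Recall (then precision) | Missing fraud is worse than investigating false alarms |
    | Product recommendations | Precision               | Bad recommendations annoy users                        |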
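
The trade-off shows up as soon as you move the decision threshold on a model's scores. The sketch below uses ten invented scores and labels; `precision_recall_at` is a hypothetical helper, and returning 0.0 when nothing is predicted positive is a convention chosen here.

```python
def precision_recall_at(scores, actual, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    predicted = [score >= threshold for score in scores]
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented model scores (higher = more likely positive) and true labels.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
actual = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

for threshold in (0.85, 0.5, 0.25):
    precision, recall = precision_recall_at(scores, actual, threshold)
    print(threshold, round(precision, 3), round(recall, 3))
# 0.85 1.0 0.5
# 0.5 0.6 0.75
# 0.25 0.571 1.0
```

Lowering the threshold raises recall from 0.5 to 1.0 while precision falls from 1.0 to about 0.57, so the right threshold is whichever point on that curve matches the cost of each error type.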

    Class Imbalance: The Silent Killer

    Most real-world classification problems are imbalanced: 1% fraud, 5% churn, 0.1% manufacturing defects. Standard algorithms optimize for overall accuracy, so they drift toward predicting the majority class: a model that never flags fraud is 99% accurate on data with 1% fraud, yet catches nothing.
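
A quick calculation shows why accuracy misleads here. The numbers are invented (1,000 transactions, 1% fraud), and `always_negative_report` is a hypothetical helper that scores a model which never predicts the minority class.

```python
def always_negative_report(actual):
    """Score a 'model' that predicts the negative class (0) for every row."""
    predicted = [0] * len(actual)
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    caught = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    accuracy = correct / len(actual)
    recall = caught / sum(actual)  # share of real positives that were found
    return accuracy, recall


actual = [1] * 10 + [0] * 990  # 1 = fraud, 0 = legitimate
accuracy, recall = always_negative_report(actual)
print(accuracy, recall)  # 0.99 0.0
```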

    Solutions:

    | Strategy             | How                                      | When                       |
    |----------------------|------------------------------------------|----------------------------|
    | Oversampling (SMOTE) | Generate synthetic minority examples     | Small datasets             |
    | Undersampling        | Remove majority examples                 | Large datasets             |
    | Class weights        | Penalize minority misclassification more | Built into most algorithms |
    | Threshold tuning     | Lower the decision threshold             | When you need higher recall |
    | Anomaly detection    | Treat minority as anomalies              | Extreme imbalance (< 0.1%) |
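
Class weights are simple enough to compute by hand. The sketch below assumes the common "balanced" heuristic, n_samples / (n_classes × class_count); libraries may use other schemes, and both function names are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def weighted_error(actual, predicted, weights):
    """Share of the total class weight that sits on misclassified rows."""
    total = sum(weights[a] for a in actual)
    wrong = sum(weights[a] for a, p in zip(actual, predicted) if a != p)
    return wrong / total


actual = ["fraud"] * 10 + ["ok"] * 990  # invented 1% fraud data
weights = balanced_class_weights(actual)
always_ok = ["ok"] * len(actual)        # lazy model: never flags fraud

print(weights["fraud"])                                      # 50.0
print(round(weighted_error(actual, always_ok, weights), 3))  # 0.5
```

Unweighted, the lazy model is wrong on only 1% of rows; with balanced weights the same mistakes count for half of the total, so a learner minimizing weighted error can no longer ignore the minority class.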

    K-Means Clustering: Finding Natural Groups

    K-means divides data into K clusters by minimizing the distance between points and their cluster center (centroid).

    How it works:

  • Place K random centroids
  • Assign each point to its nearest centroid
  • Move each centroid to the mean of its assigned points
  • Repeat until centroids stop moving
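
The four steps above map onto a short loop. This is a bare-bones sketch rather than a production implementation: it assumes points are tuples of floats, picks the starting centroids from the data itself, and uses squared Euclidean distance. The helper names and the six sample points are invented.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))


def kmeans(points, k, n_iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: k random starting centroids
    clusters = []
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: move each centroid to the mean of its points
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```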

    The K problem: You must choose K in advance. Methods to find the right K:

  • Elbow method: Plot the total within-cluster squared distance against K and look for the "elbow" where gains diminish
  • Silhouette score: Measures how well each point fits its cluster vs neighboring clusters
  • Domain knowledge: "We want 4 customer tiers" — sometimes business decides
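
The silhouette score mentioned above can be computed directly from its definition. A minimal sketch, assuming Euclidean distance and the convention that a point alone in its cluster scores 0; `silhouette_score` here is a hand-written helper, and the points and labelings are invented.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette over all points; close to 1 means tight, well-separated clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton contributes 0
        # a: mean distance to the other members of p's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
good = [0, 0, 0, 1, 1, 1]   # matches the two visible groups
mixed = [0, 1, 0, 1, 0, 1]  # deliberately poor grouping

print(round(silhouette_score(points, good), 3))
print(round(silhouette_score(points, mixed), 3))
```

To choose K, you would cluster the data once per candidate K and keep the K whose labeling scores highest.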

    Clustering Pitfalls

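  • Scale sensitivity: Features on different scales (income in the tens of thousands, age in the tens) distort distances, because the largest-valued feature dominates. Standardize features before clustering.
  • Spherical assumption: K-means assumes roughly round, similarly sized clusters. Elongated or irregular shapes are better handled by density-based methods such as DBSCAN.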
  • Outlier sensitivity: A single outlier can pull a centroid far from the real cluster center.
  • Curse of dimensionality: Distances become meaningless in very high dimensions. Reduce dimensions first (PCA or embeddings).
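
The scale-sensitivity pitfall is easy to demonstrate. The sketch below assumes z-score standardization (mean 0, standard deviation 1 per feature) as the normalization; the three customers and the `standardize` helper are invented for illustration.

```python
from math import dist
from statistics import mean, pstdev


def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1 (z-scores)."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    # A constant column has zero spread; divide by 1.0 instead of 0.
    sigmas = [pstdev(col) or 1.0 for col in columns]
    return [
        tuple((x - mu) / sigma for x, mu, sigma in zip(row, mus, sigmas))
        for row in rows
    ]


# Hypothetical customers: (annual income in dollars, age in years)
customers = [(40_000, 25), (41_000, 60), (80_000, 26)]

# Distances from customer 0 to customers 1 and 2, before scaling.
raw = (dist(customers[0], customers[1]), dist(customers[0], customers[2]))

scaled_rows = standardize(customers)
scaled = (dist(scaled_rows[0], scaled_rows[1]), dist(scaled_rows[0], scaled_rows[2]))

print([round(d, 1) for d in raw])
print([round(d, 2) for d in scaled])
```

Before scaling, the 35-year age gap barely registers next to a $40,000 income gap; after scaling, both differences contribute on a comparable footing.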

    Real Business Applications

    | Task                    | Method                               | Example                                               |
    |-------------------------|--------------------------------------|-------------------------------------------------------|
    | Customer churn          | Random forest + class weights        | Predict which customers will cancel next month        |
    | Customer segmentation   | K-means on RFM features              | Group customers by Recency, Frequency, Monetary value |
    | Email triage            | Decision tree                        | Route support emails to the right team                |
    | Lead scoring            | Logistic regression or random forest | Rank sales leads by conversion probability            |
    | Anomaly detection       | Isolation forest                     | Find unusual transactions in banking data             |
    | Document categorization | Random forest on TF-IDF features     | Classify support tickets by topic                     |

    Key Takeaways

  • Classification needs labels; clustering finds groups on its own
  • Decision trees are interpretable; random forests are powerful
  • The confusion matrix reveals precision/recall trade-offs
  • Optimize for the business cost of errors, not just accuracy
  • Class imbalance is the norm — use oversampling, class weights, or threshold tuning
  • K-means requires normalized data and a good choice of K
  • Random forests are the default starting point for structured data classification

    This is chapter 3 of Data Science for AI.
