Classification & Clustering
Sorting and Grouping at Scale
Two Fundamental Tasks
Classification and clustering both organize data into groups — but in fundamentally different ways:
Decision Trees: If-Then Logic
A decision tree splits data using a sequence of if-then rules. It is one of the most interpretable classification algorithms.
How it works:
Example — Loan Default Prediction:
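As a hypothetical illustration (the applicants, the `debt_ratio` feature, and the function names below are invented for this sketch, not taken from the course data), the core step of tree building can be shown as a search for the single threshold that best separates defaulters from non-defaulters. The sketch assumes Gini impurity as the split criterion, which is one common choice among several.

```python
# Hypothetical toy data: (debt_ratio, defaulted). Values are made up.
applicants = [
    (0.15, 0), (0.22, 0), (0.28, 0), (0.31, 0), (0.35, 0),
    (0.38, 1), (0.42, 1), (0.44, 0), (0.47, 1), (0.55, 1),
]

def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means a pure group)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows):
    """Try each midpoint between sorted values; return (threshold, weighted_gini)."""
    values = sorted({x for x, _ in rows})
    best = (None, float("inf"))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in rows if x <= threshold]
        right = [y for x, y in rows if x > threshold]
        # Weight each side's impurity by how many rows landed there.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if score < best[1]:
            best = (threshold, score)
    return best

threshold, impurity = best_split(applicants)
print(f"if debt_ratio > {threshold:.3f}: predict default (weighted Gini {impurity:.3f})")
```

A full tree would repeat this search on each resulting group, across all features, until a stopping rule such as a depth limit is reached.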
Strengths: Easy to explain ("this customer was flagged because their debt ratio exceeds 40%"). Handles both numeric and categorical features. No scaling needed.
Weaknesses: Prone to overfitting (deep trees memorize the training data). Unstable (a small change in the data can produce a very different tree). Weaker on continuous outputs, because a tree can only predict a fixed value per leaf.
Random Forests: Wisdom of Crowds
A random forest builds hundreds of decision trees, each trained on a random subset of data and features. The final prediction is the majority vote across all trees.
Why it works: Individual trees overfit, but their errors are only weakly correlated (because each tree sees different data and features). Combining many weakly correlated predictions cancels much of the individual error.
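The error-cancelling effect can be sketched with a small simulation. It rests on a strong simplifying assumption: every "tree" is reduced to a voter that is right 65% of the time, fully independently of the others. Real trees in a forest are partly correlated, so the real gain is smaller. The function name and the numbers are hypothetical.

```python
import random

def majority_vote_accuracy(n_voters, p_correct, n_trials, rng):
    """Fraction of trials where more than half of the independent voters are right."""
    wins = 0
    for _ in range(n_trials):
        correct_votes = sum(rng.random() < p_correct for _ in range(n_voters))
        if correct_votes > n_voters / 2:
            wins += 1
    return wins / n_trials

rng = random.Random(0)  # fixed seed so the run is repeatable
single = majority_vote_accuracy(1, 0.65, 5000, rng)
forest = majority_vote_accuracy(101, 0.65, 5000, rng)
print(f"one voter: {single:.3f}, 101 voters: {forest:.3f}")
```

Under these idealized assumptions the ensemble is far more accurate than any single voter, which is the intuition behind majority voting across trees.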
| Parameter | What It Controls | Typical Setting |
|---|---|---|
| n_estimators | Number of trees | 100-500 |
| max_depth | Tree depth limit | None (let them grow) |
| max_features | Features per split | sqrt(total features) |
| min_samples_leaf | Min samples in leaf | 1-5 |
Random forests are the Swiss Army knife of machine learning:
When to reach for random forests: Any classification task where you have structured data, moderate dataset size (1K-1M rows), and need good performance without heavy tuning.
The Confusion Matrix
The confusion matrix is the truth table for classification:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (TP) | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN) |
From this, we derive:
| Metric | Formula | Business Meaning |
|---|---|---|
| Precision | TP / (TP + FP) | "When we say yes, how often are we right?" |
| Recall | TP / (TP + FN) | "Of all actual yeses, how many did we catch?" |
| F1 Score | 2 × (P × R) / (P + R) | Harmonic mean — balances precision and recall |
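The formulas in the table can be checked by hand on a small example. The labels and predictions below are made up, and the helper names are hypothetical; the sketch assumes binary labels with 1 as the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """Apply the table's formulas, returning 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up example: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here the model is right two times out of three when it says yes, but it catches only half of the actual positives, and F1 sits between the two.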
The Precision-Recall Trade-off
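You usually can't maximize both at once: catching more positives (higher recall) tends to let in more false alarms (lower precision). Which side to favor depends on business costs: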
| Scenario | Optimize For | Why |
|---|---|---|
| Spam filter | Precision | False positive = real email in spam (bad) |
| Cancer screening | Recall | False negative = missed cancer (very bad) |
| Fraud detection | Recall (then precision) | Missing fraud is worse than investigating false alarms |
| Product recommendations | Precision | Bad recommendations annoy users |
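The trade-off shows up directly when the decision threshold moves. In this sketch the scores and labels are invented, and `precision_recall_at` is a hypothetical helper; it assumes the model outputs a probability-like score and that a score at or above the threshold counts as a positive prediction.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up model scores and true labels, sorted by score for readability.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.45, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

for threshold in (0.75, 0.50, 0.25):
    p, r = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold:.2f} precision={p:.3f} recall={r:.3f}")
```

On this toy data, lowering the threshold from 0.75 to 0.25 raises recall from 0.6 to 1.0 while precision falls from 1.0 to 0.625.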
Class Imbalance: The Silent Killer
Most real-world classification problems are imbalanced: 1% fraud, 5% churn, 0.1% manufacturing defects. Many standard algorithms optimize overall accuracy by default, which rewards simply predicting the majority class.
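A quick sketch of why accuracy misleads here, using a made-up dataset with the 1% fraud rate mentioned above and a "model" that never predicts fraud:

```python
# Hypothetical dataset: 10 fraud cases (1) among 1,000 transactions.
labels = [1] * 10 + [0] * 990

# A degenerate model that always predicts the majority class.
preds = [0] * len(labels)

accuracy = sum(t == p for t, p in zip(labels, preds)) / len(labels)
caught = sum(1 for t, p in zip(labels, preds) if t == 1 and p == 1)
recall = caught / sum(labels)

print(f"accuracy={accuracy:.2%} recall={recall:.2%}")
```

The model is 99% accurate and catches no fraud at all, which is why recall and precision matter more than accuracy on imbalanced data.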
Solutions:
| Strategy | How | When |
|---|---|---|
| Oversampling (SMOTE) | Generate synthetic minority examples | Small datasets |
| Undersampling | Remove majority examples | Large datasets |
| Class weights | Penalize minority misclassification more | Built into most algorithms |
| Threshold tuning | Lower the decision threshold | When you need higher recall |
| Anomaly detection | Treat minority as anomalies | Extreme imbalance (< 0.1%) |
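Two of the strategies above can be sketched in a few lines. The weighting formula, n_samples / (n_classes × class_count), is the heuristic scikit-learn documents for `class_weight="balanced"`. The oversampler is plain random duplication, a simpler stand-in for SMOTE, which would instead synthesize new points by interpolating between minority neighbors. The function names and data are hypothetical, and the oversampler assumes a binary problem.

```python
import random
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

def random_oversample(rows, labels, minority, rng):
    """Duplicate random minority rows until both classes are the same size.

    Binary labels only. This is random oversampling, not SMOTE.
    """
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    n_needed = len(labels) - 2 * len(minority_rows)  # majority minus minority
    extra = [rng.choice(minority_rows) for _ in range(n_needed)]
    return rows + extra, labels + [minority] * n_needed

# Made-up data: 990 majority rows, 10 minority rows, one dummy feature each.
labels = [0] * 990 + [1] * 10
rows = [[float(i)] for i in range(len(labels))]

weights = balanced_class_weights(labels)
new_rows, new_labels = random_oversample(rows, labels, minority=1, rng=random.Random(0))

print(weights)
print(Counter(new_labels))
```

With these weights, one misclassified minority example costs about as much as 99 misclassified majority examples, so the two classes contribute equally to the total loss.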
K-Means Clustering: Finding Natural Groups
K-means divides data into K clusters by minimizing the sum of squared distances between each point and its cluster center (centroid).
How it works:
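In outline, the standard procedure (Lloyd's algorithm) alternates two steps: assign every point to its nearest centroid, then move each centroid to the mean of its assigned points, repeating until the assignments stop changing. The sketch below is a minimal version under one loud assumption: it starts from the first K points as centroids, which keeps the run deterministic but is naive. Practical implementations typically use random or k-means++ initialization. The points are made up.

```python
import math

def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    assignments = None
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # converged: nothing moved between clusters
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignments

# Made-up 2-D points forming two visible groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 1.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
print(centroids)
print(assignments)
```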
The K problem: You must choose K in advance. Methods to find the right K:
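One widely used heuristic is the elbow method: run K-means for several values of K, record the within-cluster sum of squares (inertia), and look for the K after which the improvement flattens out. The sketch below assumes that method and uses made-up 1-D data to stay short. `inertia_1d` and its evenly spread starting centroids are hypothetical choices for this illustration.

```python
def inertia_1d(values, k, n_iter=100):
    """Run a simple 1-D K-means and return the within-cluster sum of squares."""
    ordered = sorted(values)
    # Hypothetical deterministic init: spread starting centroids across the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # Empty clusters keep their previous centroid.
        updated = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(min((v - c) ** 2 for c in centroids) for v in values)

# Made-up values with three visible groups near 1, 5, and 9.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7, 9.0, 9.2, 8.8]
inertias = {k: inertia_1d(values, k) for k in range(1, 6)}
for k, value in inertias.items():
    print(k, round(value, 3))
```

On this toy data the inertia falls steeply up to K = 3 and only slightly afterwards, so the "elbow" points to three clusters. Real data rarely gives such a clean bend, so the elbow is best treated as a guide rather than a rule.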
Clustering Pitfalls
Real Business Applications
| Task | Method | Example |
|---|---|---|
| Customer churn | Random forest + class weights | Predict which customers will cancel next month |
| Customer segmentation | K-means on RFM features | Group customers by Recency, Frequency, Monetary value |
| Email triage | Decision tree | Route support emails to the right team |
| Lead scoring | Logistic regression or random forest | Rank sales leads by conversion probability |
| Anomaly detection | Isolation forest | Find unusual transactions in banking data |
| Document categorization | Random forest on TF-IDF features | Classify support tickets by topic |
Key Takeaways
This is chapter 3 of Data Science for AI.