
Classification & Clustering

Sorting and Grouping at Scale

Two Fundamental Tasks

Classification and clustering both organize data into groups — but in fundamentally different ways:

Classification (supervised): You tell the model the groups. "Here are 1,000 emails labeled spam/not-spam. Learn the pattern."

Clustering (unsupervised): The model finds the groups. "Here are 10,000 customers. Find natural segments."


Decision Trees: If-Then Logic

A decision tree splits data using if-then rules. It is one of the most interpretable classification algorithms.

How it works:

1. Find the feature and threshold that best separates the classes

2. Split the data at that point

3. Repeat for each branch until leaves are pure (or you stop early)
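The first step above can be sketched in a few lines. The text does not say how "best separates" is measured; this sketch assumes Gini impurity, a common choice, and the tiny dataset (income in $K, debt ratio) is made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Try every feature and every midpoint between adjacent values.

    Returns (feature_index, threshold, weighted_gini) for the split whose
    two sides have the lowest size-weighted impurity.
    """
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, threshold, score)
    return best

# Hypothetical data: [income in $K, debt ratio]
rows = [[30, 0.5], [40, 0.3], [80, 0.45], [90, 0.2]]
labels = ["high", "high", "low", "low"]
print(best_split(rows, labels))  # (0, 60.0, 0.0): split on income at 60
```

A full tree would call `best_split` again on each side of the split until the leaves are pure or a stopping rule is hit.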

Example — Loan Default Prediction:

If income > $75K → low risk

If income ≤ $75K AND debt ratio > 0.4 → high risk

If income ≤ $75K AND debt ratio ≤ 0.4 AND employment > 2 years → medium risk
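Written as code, the three rules above are just nested conditions. The function and parameter names are illustrative. The example does not say what happens when income and debt ratio are both low and employment is two years or less, so this sketch returns `None` for that branch rather than guessing.

```python
def loan_risk(income, debt_ratio, years_employed):
    """Apply the example loan rules in order, top of the tree first."""
    if income > 75_000:
        return "low"
    if debt_ratio > 0.4:
        return "high"
    if years_employed > 2:
        return "medium"
    return None  # branch not covered by the example rules

print(loan_risk(90_000, 0.6, 1))   # low
print(loan_risk(50_000, 0.5, 10))  # high
print(loan_risk(50_000, 0.3, 3))   # medium
```

Each returned label can be traced to exactly one path through the conditions, which is what makes the prediction easy to explain.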

Strengths: Easy to explain ("this customer was flagged because their debt ratio exceeds 40%"). Handles both numeric and categorical features. No scaling needed.

Weaknesses: Prone to overfitting (deep trees memorize the training data). Unstable (small changes in the data can produce a very different tree). Poor at smooth continuous outputs, because each leaf predicts a single constant value.

Random Forests: Wisdom of Crowds

A random forest builds hundreds of decision trees, each trained on a bootstrap sample of the rows and restricted to a random subset of features at each split. For classification, the final prediction is the majority vote across all trees.

Why it works: Individual trees overfit, but because each sees different data and features, their errors are only weakly correlated. Averaging many weakly correlated errors reduces variance, so much of the individual overfitting cancels out.
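A quick calculation shows why voting helps. This sketch assumes an idealized case that real forests only approximate: every tree is correct with the same probability and the trees err fully independently. Under that assumption the accuracy of the majority vote is a binomial tail sum.

```python
from math import comb

def majority_vote_accuracy(p, n_trees):
    """Probability that more than half of n_trees independent voters,
    each correct with probability p, give the right answer.
    Use an odd n_trees so ties cannot occur."""
    needed = n_trees // 2 + 1
    return sum(
        comb(n_trees, k) * p**k * (1 - p) ** (n_trees - k)
        for k in range(needed, n_trees + 1)
    )

for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

With individually weak voters (60% accurate), the vote becomes far more reliable as the count grows. Real trees share training data, so their errors are correlated and the actual gain is smaller than this idealized figure.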

Parameter	What It Controls	Typical Setting
n_estimators	Number of trees	100-500
max_depth	Tree depth limit	None (let them grow)
max_features	Features per split	sqrt(total features)
min_samples_leaf	Min samples in leaf	1-5

Random forests are the Swiss Army knife of machine learning:

Work well with minimal tuning

Tolerate outliers well; handle missing values in some implementations (others require imputation first)

Provide feature importance rankings

Adding more trees does not increase overfitting, though very deep trees can still fit noise

When to reach for random forests: Any classification task where you have structured data, moderate dataset size (1K-1M rows), and need good performance without heavy tuning.

The Confusion Matrix

The confusion matrix is the truth table for classification:

	Predicted Positive	Predicted Negative
Actually Positive	True Positive (TP)	False Negative (FN)
Actually Negative	False Positive (FP)	True Negative (TN)

From this, we derive:

Metric	Formula	Business Meaning
Precision	TP / (TP + FP)	"When we say yes, how often are we right?"
Recall	TP / (TP + FN)	"Of all actual yeses, how many did we catch?"
F1 Score	2 × (P × R) / (P + R)	Harmonic mean — balances precision and recall
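The formulas in the table translate directly into code. The counts below are hypothetical, chosen only to give round numbers; the zero-denominator guards are a convenience assumption, not part of the definitions.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 80 true positives, 20 false positives, 40 false negatives
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.667 0.727
```

True negatives do not appear in any of the three formulas, which is why these metrics stay informative when negatives vastly outnumber positives.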

The Precision-Recall Trade-off

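You usually can't maximize both: lowering the decision threshold catches more positives (higher recall) but admits more false alarms (lower precision). Which side to favor depends on business costs: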

Scenario	Optimize For	Why
Spam filter	Precision	False positive = real email in spam (bad)
Cancer screening	Recall	False negative = missed cancer (very bad)
Fraud detection	Recall (then precision)	Missing fraud is worse than investigating false alarms
Product recommendations	Precision	Bad recommendations annoy users
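The trade-off becomes visible when you sweep the decision threshold over a model's scores. The scores and labels below are invented for illustration; in practice they would come from a classifier's predicted probabilities on a validation set.

```python
def precision_recall_at(scores, labels, threshold):
    """Predict positive when score >= threshold, then measure both metrics."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 1)
    fp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 0)
    fn = sum(1 for pred, y in zip(predicted, labels) if not pred and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical model scores and true labels (1 = positive)
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

for threshold in (0.75, 0.50, 0.25):
    print(threshold, precision_recall_at(scores, labels, threshold))
# 0.75 -> precision 1.0, recall 0.75
# 0.50 -> precision 0.8, recall 1.0
# 0.25 -> precision 4/7, recall 1.0
```

A spam filter would sit near the high-threshold end of this sweep, a cancer screen near the low end.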

Class Imbalance: The Silent Killer

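Most real-world classification problems are imbalanced: 1% fraud, 5% churn, 0.1% manufacturing defects. Standard algorithms optimize for overall accuracy, so they can score well by always predicting the majority class: with 1% fraud, a model that never flags fraud is 99% accurate and catches nothing.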
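The accuracy trap is easy to reproduce. This sketch uses a made-up dataset with 1% positives and a "model" that always predicts the majority class.

```python
# Hypothetical dataset: 10 fraud cases (1) among 1,000 transactions
labels = [1] * 10 + [0] * 990
predictions = [0] * len(labels)  # always predict "not fraud"

correct = sum(1 for pred, y in zip(predictions, labels) if pred == y)
accuracy = correct / len(labels)

caught = sum(1 for pred, y in zip(predictions, labels) if pred == 1 and y == 1)
recall = caught / sum(labels)

print(accuracy)  # 0.99
print(recall)    # 0.0
```

Accuracy looks excellent while recall on the class that matters is zero, which is why imbalanced problems call for precision and recall instead.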

Solutions:

Strategy	How	When
Oversampling (SMOTE)	Generate synthetic minority examples	Small datasets
Undersampling	Remove majority examples	Large datasets
Class weights	Penalize minority misclassification more	Built into most algorithms
Threshold tuning	Lower the decision threshold	When you need higher recall
Anomaly detection	Treat minority as anomalies	Extreme imbalance (< 0.1%)
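As an illustration of the class-weights row, one common heuristic weights each class inversely to its frequency: n_samples / (n_classes × class_count). This is the formula behind scikit-learn's "balanced" option; the dataset below is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * count_of_that_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical dataset: 990 legitimate (0) and 10 fraudulent (1) transactions
labels = [0] * 990 + [1] * 10
weights = balanced_class_weights(labels)
print(weights[1])            # 50.0
print(round(weights[0], 3))  # 0.505
```

With these weights, misclassifying one fraud case costs the model about as much as misclassifying 99 legitimate ones, so the minority class can no longer be ignored.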

K-Means Clustering: Finding Natural Groups

K-means divides data into K clusters by minimizing the total squared distance between each point and its cluster center (centroid).

How it works:

1. Place K random centroids

2. Assign each point to its nearest centroid

3. Move each centroid to the mean of its assigned points

4. Repeat until centroids stop moving
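The four steps above fit in a short pure-Python sketch. It assumes points are tuples of numbers and picks random data points as the starting centroids; the optional `init` argument and the sample points are additions for reproducibility, not part of the algorithm itself.

```python
import math
import random

def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]

def kmeans(points, k, init=None, max_iter=100, seed=0):
    # Place K starting centroids (random data points unless init is given)
    centroids = list(init) if init is not None else random.Random(seed).sample(points, k)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                # Move the centroid to the mean of its assigned points
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # empty cluster: leave it in place
        if new_centroids == centroids:  # centroids stopped moving
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)

# Hypothetical 2-D points forming two obvious groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2, init=[(1.0, 1.0), (1.5, 2.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Even though both starting centroids sit inside the same group here, the assign-and-update loop pulls one of them across to the other group within a few iterations. On harder data a bad start can get stuck, which is why library implementations typically run several random starts.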

The K problem: You must choose K in advance. Methods to find the right K:

Elbow method: Plot the total within-cluster squared distance against K and look for the "elbow" where gains diminish

Silhouette score: Measures how well each point fits its cluster vs neighboring clusters

Domain knowledge: "We want 4 customer tiers" — sometimes business decides
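To make the elbow method concrete, this sketch computes the total within-cluster squared distance for K = 1, 2 and 3. To stay self-contained it uses hand-specified partitions of a made-up dataset as a stand-in for actual k-means output.

```python
def inertia(clusters):
    """Total squared distance from each point to the mean of its cluster."""
    total = 0.0
    for members in clusters:
        center = [sum(axis) / len(members) for axis in zip(*members)]
        total += sum(
            sum((value - c) ** 2 for value, c in zip(point, center))
            for point in members
        )
    return total

# Hypothetical points forming two obvious groups
low = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5)]
high = [(8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Hand-specified partitions standing in for k-means results at each K
partitions = {
    1: [low + high],
    2: [low, high],
    3: [low, high[:1], high[1:]],
}
inertias = {k: inertia(clusters) for k, clusters in partitions.items()}
for k, value in inertias.items():
    print(k, round(value, 2))
```

The drop from K = 1 to K = 2 is large, while going to K = 3 barely helps. That bend is the elbow, and it matches the two groups built into the data.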

Clustering Pitfalls

Scale sensitivity: Features on different scales (income in tens of thousands, age in tens) distort distances, because the large-valued feature dominates. Standardize or normalize features before clustering.

Spherical assumption: K-means assumes roughly round clusters of similar size. Elongated or irregular shapes are better handled by density-based methods such as DBSCAN.

Outlier sensitivity: A single outlier can pull a centroid far from the real cluster center.

Curse of dimensionality: Distances become meaningless in very high dimensions. Reduce dimensions first (PCA or embeddings).
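The scale-sensitivity pitfall above is easy to demonstrate. This sketch uses z-score standardization (one of several ways to normalize) on four made-up customers described by income and age, and assumes no column is constant.

```python
import math
import statistics

def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1 (z-scores).
    Assumes every column has some variation (non-zero standard deviation)."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stdevs))
        for row in rows
    ]

# Hypothetical customers: (annual income in dollars, age in years)
customers = [(50_000, 25), (51_000, 65), (58_000, 27), (90_000, 45)]
a, b, c = customers[0], customers[1], customers[2]

raw_ab = math.dist(a, b)  # similar income, 40-year age gap
raw_ac = math.dist(a, c)  # similar age, $8,000 income gap

scaled = standardize(customers)
scaled_ab = math.dist(scaled[0], scaled[1])
scaled_ac = math.dist(scaled[0], scaled[2])

print(raw_ab < raw_ac)        # True: on raw data, income swamps the age gap
print(scaled_ab > scaled_ac)  # True: after scaling, the age gap counts
```

On the raw numbers, the 25-year-old looks closest to the 65-year-old simply because their incomes are similar in dollar terms. After standardization both features contribute on comparable terms.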

Real Business Applications

Task	Method	Example
Customer churn	Random forest + class weights	Predict which customers will cancel next month
Customer segmentation	K-means on RFM features	Group customers by Recency, Frequency, Monetary value
Email triage	Decision tree	Route support emails to the right team
Lead scoring	Logistic regression or random forest	Rank sales leads by conversion probability
Anomaly detection	Isolation forest	Find unusual transactions in banking data
Document categorization	Random forest on TF-IDF features	Classify support tickets by topic

Key Takeaways

Classification needs labels; clustering finds groups on its own

Decision trees are interpretable; random forests are powerful

The confusion matrix reveals precision/recall trade-offs

Optimize for the business cost of errors, not just accuracy

Class imbalance is the norm — use oversampling, class weights, or threshold tuning

K-means requires normalized data and a good choice of K

Random forests are the default starting point for structured data classification

This is chapter 3 of Data Science for AI.
