Data Thinking
See the World as Data
Everything Is Data
Before algorithms, before models, before AI — there's data. And the single most important skill in data science isn't coding or math. It's learning to see the world as data.
Every business process generates data. Every customer interaction, every transaction, every sensor reading, every click. The question isn't whether you have data — it's whether you're looking at it correctly.
Structured vs Unstructured
Data comes in two fundamental forms:
| Type | What It Looks Like | Examples | % of Enterprise Data |
|---|---|---|---|
| Structured | Rows and columns, fixed schema | Sales records, user profiles, sensor readings | ~20% |
| Unstructured | Free-form, no fixed schema | Emails, images, chat logs, PDFs, audio | ~80% |
The irony: by commonly cited estimates, roughly 80% of enterprise data is unstructured, yet most traditional analysis tools handle only structured data. This is the gap AI is closing: LLMs process unstructured text, vision models process images, and embeddings convert both into numeric vectors that structured tools can work with.
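To make "unstructured in, structured out" concrete, here is a minimal sketch that turns free-form text into fixed-length count vectors. It uses a plain bag-of-words count, which is far cruder than a learned embedding and is only a stand-in for the idea. The function names and the sample messages are invented for illustration.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(docs: list[str]) -> tuple[list[str], list[list[int]]]:
    """Return a sorted vocabulary and one count vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)  # missing words count as 0
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


# Hypothetical support messages, invented for this example.
docs = [
    "Refund not received",
    "Where is my refund",
    "Love the new dashboard",
]
vocab, vectors = bag_of_words(docs)
for doc, vec in zip(docs, vectors):
    print(vec, doc)
```

Every message now has the same number of columns, one per vocabulary word, so it fits the rows-and-columns shape from the table above. A real embedding model would produce dense vectors that also capture meaning, which word counts do not.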
Distributions: How Data Is Shaped
Every dataset has a distribution — the pattern of how values spread out. Understanding distributions is the foundation of every statistical method.
Why it matters: The shape of your data determines which methods work. Standard linear regression inference assumes roughly normal residuals. K-means clustering works best on roughly spherical clusters of similar size. Applying a method to data that violates its assumptions gives misleading results.
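One quick way to check shape before choosing a method is to compute skewness, the average cubed z-score. The sketch below compares a simulated symmetric sample with a simulated right-skewed one. The data is randomly generated for illustration, and `sample_skewness` is a hypothetical helper, not a library function.

```python
import random
import statistics


def sample_skewness(values):
    """Skewness as the mean of cubed z-scores (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


rng = random.Random(42)  # fixed seed so the run is repeatable

# Simulated data: a bell-shaped sample and a long-right-tail sample.
symmetric = [rng.gauss(100, 15) for _ in range(5000)]
right_skewed = [rng.lognormvariate(0, 1) for _ in range(5000)]

skew_symmetric = sample_skewness(symmetric)
skew_right = sample_skewness(right_skewed)

print(f"symmetric sample skewness:    {skew_symmetric:.2f}")
print(f"right-skewed sample skewness: {skew_right:.2f}")
```

A value near zero suggests a roughly symmetric distribution, while a clearly positive value signals a long right tail. For a strongly skewed variable, a log transform or a median-based method is often worth considering.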
Summary Statistics: The First Five Numbers
Before any analysis, compute these five summary statistics:
| Statistic | What It Tells You | Watch Out For |
|---|---|---|
| Mean | Average value | Sensitive to outliers — one $10M salary skews the average |
| Median | Middle value | Better for skewed data (income, prices) |
| Std Dev | How spread out values are around the mean | High = widely dispersed values, low = tightly clustered; like the mean, sensitive to outliers |
| Min / Max | Range boundaries | Extreme values may be errors or outliers |
| Count | How much data you have | Too little = unreliable conclusions |
The median vs mean gap is your first diagnostic. If median income is $50K but mean is $85K, you have right-skewed data with high earners pulling the average up. Reporting the mean would be misleading.
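The statistics in the table can be computed with Python's standard library alone. The sketch below uses a small made-up salary list that includes one $10M outlier, echoing the example in the table. The `summarize` helper and the figures are illustrative assumptions, not real data.

```python
import statistics


def summarize(values):
    """Return the basic summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "std_dev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


# Hypothetical salaries: six typical values plus one extreme earner.
salaries = [42_000, 48_000, 50_000, 52_000, 55_000, 61_000, 10_000_000]

stats = summarize(salaries)
for name, value in stats.items():
    print(f"{name:>8}: {value:,.0f}")

# A mean far above the median is the right-skew warning sign.
gap_ratio = stats["mean"] / stats["median"]
print(f"mean / median = {gap_ratio:.1f}")
```

Here the median stays at $52,000 while the mean jumps above $1.4M, so the median is the more honest summary of a typical salary in this list.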
Data Quality: Garbage In, Garbage Out
The oldest rule in data science: your model is only as good as your data. Common quality issues:
Missing Values
Outliers
Not all outliers are errors. A $500K transaction might be fraud or a data-entry mistake (investigate, then exclude or correct it) or a legitimate enterprise deal (keep it). Domain knowledge is usually the only way to decide.
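Because the decision needs domain knowledge, code should flag outliers for review rather than delete them. One common convention, used here as an assumption and not taken from the chapter, is the 1.5 × IQR rule. The transaction amounts are invented, and `flag_outliers_iqr` is a hypothetical helper.

```python
import statistics


def flag_outliers_iqr(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] without removing them."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical transaction amounts in dollars.
transactions = [120, 340, 95, 410, 280, 150, 500_000, 230, 310, 175]

flagged = flag_outliers_iqr(transactions)
print("Needs review:", flagged)
```

Only the 500,000 value falls outside the fences in this list. The function returns it for a person to inspect, which leaves open whether it is fraud, an error, or a real enterprise deal.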
Bias in Data
Exploratory Data Analysis (EDA)
EDA is the practice of looking at your data before modeling. It's the most underrated step in data science — and the one most often skipped.
The EDA checklist:
Correlation vs Causation
One of the most important concepts in data science: correlation does not imply causation.
Ice cream sales correlate with drowning deaths. Both increase in summer. Ice cream doesn't cause drowning — temperature drives both.
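The ice cream example can be simulated to show how a shared driver creates a correlation between two variables that never touch each other. In this sketch temperature drives both series. All coefficients and noise levels are invented for illustration, and the helpers are hypothetical.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing a least-squares line fit on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Simulated days: temperature influences both variables, which do not
# influence each other. The numbers are made up.
temperature = [rng.uniform(0, 35) for _ in range(2000)]
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]
drownings = [0.3 * t + rng.gauss(0, 1) for t in temperature]

raw = pearson(ice_cream, drownings)
adjusted = pearson(
    residuals(ice_cream, temperature), residuals(drownings, temperature)
)

print(f"raw correlation:               {raw:.2f}")
print(f"after removing temperature:    {adjusted:.2f}")
```

The raw correlation is strong, but once the temperature effect is subtracted from both series the remaining correlation is close to zero. This adjustment works only because the confounder is known and measured here, which is rarely guaranteed with real observational data.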
In business contexts:
To establish causation, you need one of three things: a randomized experiment (an A/B test), a natural experiment, or careful causal inference methods. Without one of these, observational data can only show correlation.
Key Takeaways
This is chapter 1 of Data Science for AI.