
Data Thinking

See the World as Data

Everything Is Data

Before algorithms, before models, before AI — there's data. And the single most important skill in data science isn't coding or math. It's learning to see the world as data.

Every business process generates data. Every customer interaction, every transaction, every sensor reading, every click. The question isn't whether you have data — it's whether you're looking at it correctly.

Structured vs Unstructured

Data comes in two fundamental forms:

Type	What It Looks Like	Examples	% of Enterprise Data
Structured	Rows and columns, fixed schema	Sales records, user profiles, sensor readings	~20%
Unstructured	Free-form, no fixed schema	Emails, images, chat logs, PDFs, audio	~80%

The irony: by an often-cited estimate, roughly 80% of enterprise data is unstructured, yet most traditional analysis tools expect structured data. This gap is exactly what AI is closing: LLMs process unstructured text, vision models process images, and embeddings convert both into fixed-length numeric vectors that standard tools can work with.
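To make "converting unstructured data into vectors" concrete, here is a minimal sketch that uses plain word counts as a stand-in for learned embeddings. The sample messages and the `tokenize` and `vectorize` helpers are hypothetical, and a real embedding model would produce dense vectors that capture meaning rather than counts.

```python
from collections import Counter

# Hypothetical unstructured inputs: free-form support messages.
messages = [
    "refund please the order arrived late",
    "love the product fast delivery",
    "order arrived late again",
]

def tokenize(text):
    """Lowercase and split on whitespace (deliberately naive)."""
    return text.lower().split()

# The fixed vocabulary plays the role of a schema: one column per word.
vocabulary = sorted({word for m in messages for word in tokenize(m)})

def vectorize(text):
    """Map a message to one count per vocabulary word."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

vectors = [vectorize(m) for m in messages]

for message, vector in zip(messages, vectors):
    print(vector, "<-", message)
```

Every message, whatever its length, ends up as a row with the same number of columns, which is what lets structured tools take over from there.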

Distributions: How Data Is Shaped

Every dataset has a distribution — the pattern of how values spread out. Understanding distributions is the foundation of every statistical method.


Why it matters: The shape of your data determines which methods work. The standard confidence intervals and p-values for linear regression assume roughly normal residuals. K-means clustering works best on roughly spherical clusters of similar size. Using a method on data that violates its assumptions gives misleading results.
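One quick way to check shape numerically is skewness. The sketch below is an illustration on simulated data, not a prescribed workflow: the `skewness` helper, the seed, and both samples are assumptions made for the example.

```python
import random
import statistics

def skewness(values):
    """Population skewness: the mean of cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Simulated samples with the same mean (50) but different shapes.
symmetric = [rng.gauss(50, 10) for _ in range(10_000)]
right_skewed = [rng.expovariate(1 / 50) for _ in range(10_000)]

skew_symmetric = skewness(symmetric)
skew_right = skewness(right_skewed)

print(f"symmetric sample skewness:    {skew_symmetric:.2f}")
print(f"right-skewed sample skewness: {skew_right:.2f}")
```

A value near zero suggests a roughly symmetric distribution, while a clearly positive value signals a long right tail. A histogram is still worth plotting, since a single number cannot reveal features such as bimodality.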

Summary Statistics: The First Five Numbers

Before any analysis, compute these five summaries for every numeric column:

Statistic	What It Tells You	Watch Out For
Mean	Average value	Sensitive to outliers — one $10M salary skews the average
Median	Middle value	Better for skewed data (income, prices)
Std Dev	How spread out values are	Judge it relative to the mean; high spread can be real variation rather than noise, and outliers inflate it
Min / Max	Range boundaries	Extreme values may be errors or outliers
Count	How much data you have	Too little = unreliable conclusions

The median vs mean gap is your first diagnostic. If median income is $50K but mean is $85K, you have right-skewed data with high earners pulling the average up. Reporting the mean would be misleading.
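The table and the mean-versus-median diagnostic can be sketched with Python's standard `statistics` module. The salary figures below are invented for illustration, with one extreme earner echoing the $10M example in the table.

```python
import statistics

# Hypothetical salaries in dollars; the last value is an extreme high earner.
salaries = [42_000, 48_000, 50_000, 52_000, 55_000, 61_000, 10_000_000]

summary = {
    "count": len(salaries),
    "mean": statistics.mean(salaries),
    "median": statistics.median(salaries),
    "std_dev": statistics.stdev(salaries),  # sample standard deviation
    "min": min(salaries),
    "max": max(salaries),
}

# First diagnostic: how far the mean sits from the median.
mean_to_median = summary["mean"] / summary["median"]

for name, value in summary.items():
    print(f"{name:>8}: {value:,.0f}")
print(f"mean / median ratio: {mean_to_median:.1f}")
```

A ratio far above 1 indicates that a few large values are pulling the mean upward, so the median is the better description of a typical salary here.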

Data Quality: Garbage In, Garbage Out

The oldest rule in data science: your model is only as good as your data. Common quality issues:

Missing Values

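MCAR (Missing Completely At Random): Missingness is unrelated to any value, observed or not. Dropping incomplete rows does not bias results, though it shrinks the sample; imputing is also reasonable. Rare in practice.

MAR (Missing At Random): Missingness depends on observed data. E.g., income missing more often for younger people. Imputation that uses those observed variables can correct for it.

MNAR (Missing Not At Random): Missingness depends on the missing value itself. E.g., high-income people refuse to report income. Dangerous: the data alone cannot reveal or correct it, so any fix rests on untestable assumptions about why values are missing.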
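A first practical check is whether missingness varies with a column you did observe. The sketch below uses invented survey rows and a hypothetical `missing_rate_by_group` helper. It can surface a MAR-like pattern, but it cannot detect MNAR, because that depends on values you never see.

```python
from collections import defaultdict

# Hypothetical survey rows: (age_group, income); None marks a missing income.
rows = [
    ("18-29", None), ("18-29", 31_000), ("18-29", None), ("18-29", 28_000),
    ("30-49", 54_000), ("30-49", 61_000), ("30-49", None), ("30-49", 58_000),
    ("50+", 72_000), ("50+", 65_000), ("50+", 70_000), ("50+", 68_000),
]

def missing_rate_by_group(rows):
    """Return the fraction of missing incomes within each age group."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for group, income in rows:
        totals[group] += 1
        if income is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}

rates = missing_rate_by_group(rows)

for group, rate in rates.items():
    print(f"{group:>6}: {rate:.0%} missing")
```

Rates that differ sharply across groups, as they do in this toy data, suggest the gaps are not completely random and that dropping incomplete rows would under-represent younger respondents.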

Outliers

Not all outliers are errors. A $500K transaction might be fraud (remove it) or a legitimate enterprise deal (keep it). Domain knowledge is the only way to decide.
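Code can flag candidates for review even though it cannot make the keep-or-remove decision. The sketch below uses the common 1.5 × IQR rule, which is a convention chosen for this example rather than something the chapter prescribes, and the transaction amounts are invented.

```python
import statistics

# Hypothetical transaction amounts in dollars.
transactions = [120, 95, 130, 110, 105, 140, 125, 115, 500_000]

# Quartiles, then fences at 1.5 * IQR beyond the first and third quartile.
q1, _, q3 = statistics.quantiles(transactions, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Flag values for human review instead of deleting them.
flagged = [t for t in transactions if t < lower_fence or t > upper_fence]

print(f"fences: [{lower_fence:.2f}, {upper_fence:.2f}]")
print(f"flagged for review: {flagged}")
```

The flagged list is a starting point for a conversation with someone who knows the business, not a deletion list.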

Bias in Data

Selection bias: Your data doesn't represent the population. A survey on a tech forum doesn't represent "all users."

Survivorship bias: You only see successes. Analyzing successful startups without failed ones gives a distorted picture.

Measurement bias: The way you collect data affects what you see. Self-reported data is often less reliable than directly observed data.

Historical bias: Past data reflects past decisions. A hiring model trained on historical data tends to reproduce past hiring biases.
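Survivorship bias is easy to reproduce in a small simulation. In the sketch below, the return distribution, the seed, and the rule that only startups with positive returns "survive" are all assumptions made for illustration.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical startups: every return is drawn from the same distribution.
all_returns = [rng.gauss(-0.2, 1.0) for _ in range(1_000)]

# Suppose only startups with a positive return are still around to be studied.
survivors = [r for r in all_returns if r > 0]

mean_all = statistics.fmean(all_returns)
mean_survivors = statistics.fmean(survivors)

print(f"all startups:   n={len(all_returns)}, mean return {mean_all:+.2f}")
print(f"survivors only: n={len(survivors)}, mean return {mean_survivors:+.2f}")
```

The survivors look far better than the full population even though nothing about them differs except the filter, which is the distortion the list above warns about.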

Exploratory Data Analysis (EDA)

EDA is the practice of looking at your data before modeling. It's the most underrated step in data science — and the one most often skipped.

The EDA checklist:

Shape: How many rows? How many columns? What types?

Distributions: Histogram every numeric column. Are they normal? Skewed? Bimodal?

Missing values: Which columns have gaps? How much? What pattern?

Correlations: Which variables move together? Which are independent?

Outliers: Any extreme values? Are they real or errors?

Time patterns: If there's a date column, plot trends. Is there seasonality?
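Several items on the checklist can be sketched in a few lines of standard-library Python. The records, column names, and the `pearson` helper below are hypothetical. In practice a dataframe library would do the same job with less code, and plotting would cover the histogram and time-trend steps.

```python
import statistics

# Hypothetical dataset: a few rows of daily store records.
records = [
    {"date": "2024-01-01", "visits": 120, "sales": 14.0},
    {"date": "2024-01-02", "visits": 135, "sales": 16.5},
    {"date": "2024-01-03", "visits": None, "sales": 15.0},
    {"date": "2024-01-04", "visits": 150, "sales": 18.0},
    {"date": "2024-01-05", "visits": 160, "sales": None},
    {"date": "2024-01-06", "visits": 900, "sales": 19.5},
]
columns = list(records[0])

# Shape: number of rows and columns.
shape = (len(records), len(columns))

# Missing values: count the gaps in each column.
missing = {c: sum(r[c] is None for r in records) for c in columns}

# Distributions and outliers: compare median and max for a numeric column.
visits = [r["visits"] for r in records if r["visits"] is not None]
visits_median, visits_max = statistics.median(visits), max(visits)

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Correlations: use only rows where both values are present.
pairs = [(r["visits"], r["sales"]) for r in records
         if r["visits"] is not None and r["sales"] is not None]
visits_sales_corr = pearson(*zip(*pairs))

print("shape:", shape)
print("missing:", missing)
print("visits median vs max:", visits_median, visits_max)
print(f"visits/sales correlation: {visits_sales_corr:.2f}")
```

Even on six rows the checks raise questions worth asking: two gaps to explain, and a 900-visit day that is either a data error or a real event.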

Correlation vs Causation

The most important concept in data science: correlation does not imply causation.

Ice cream sales correlate with drowning deaths. Both increase in summer. Ice cream doesn't cause drowning — temperature drives both.

In business contexts:

"Users who complete onboarding have 3x higher retention" — Does onboarding cause retention, or do motivated users both complete onboarding AND stay?

"Teams using our tool ship 40% faster" — Did the tool help, or did already-fast teams adopt it?

To establish causation, you need a randomized experiment (A/B test), a natural experiment, or careful causal inference methods with explicitly stated assumptions. Observational data analyzed without such a design can only show correlation.
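The ice cream example can be reproduced in a simulation where temperature drives both variables and neither affects the other. All numbers, the seed, and the `pearson` and `residuals` helpers are assumptions for illustration. Removing the temperature effect by linear regression is one simple way to adjust for a known confounder, not a general recipe for causal inference.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def residuals(ys, xs):
    """What is left of ys after removing a least-squares line fit on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Temperature is the confounder: it drives both series independently.
temperature = [rng.uniform(0, 35) for _ in range(5_000)]
ice_cream = [t + rng.gauss(0, 5) for t in temperature]
drownings = [t + rng.gauss(0, 5) for t in temperature]

raw_corr = pearson(ice_cream, drownings)

# Correlation that remains once temperature is accounted for.
partial_corr = pearson(residuals(ice_cream, temperature),
                       residuals(drownings, temperature))

print(f"raw correlation:            {raw_corr:.2f}")
print(f"after removing temperature: {partial_corr:.2f}")
```

The raw correlation is strong, yet it nearly vanishes once temperature is accounted for. This works only because the confounder was measured; with unmeasured confounders, randomization is what breaks the link.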

Key Takeaways

Learn to see the world as data — every process generates it

Distributions determine which methods work on your data

Summary statistics (mean, median, std dev) are your first diagnostic

Data quality issues (missing values, outliers, bias) can invalidate any analysis

EDA before modeling — always look at your data first

Correlation ≠ causation — the most important lesson in data science

This is chapter 1 of Data Science for AI.
