
Data Thinking

See the World as Data

Everything Is Data

Before algorithms, before models, before AI — there's data. And the single most important skill in data science isn't coding or math. It's learning to see the world as data.

Every business process generates data. Every customer interaction, every transaction, every sensor reading, every click. The question isn't whether you have data — it's whether you're looking at it correctly.

Structured vs Unstructured

Data comes in two fundamental forms:

| Type | What It Looks Like | Examples | % of Enterprise Data |
|---|---|---|---|
| Structured | Rows and columns, fixed schema | Sales records, user profiles, sensor readings | ~20% |
| Unstructured | Free-form, no fixed schema | Emails, images, chat logs, PDFs, audio | ~80% |

The irony: by common estimates, roughly 80% of enterprise data is unstructured, yet most analysis tools handle only structured data. This gap is exactly what AI is closing: LLMs process unstructured text, vision models process images, and embeddings convert both into structured vectors.
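To make "unstructured in, structured out" concrete, here is a toy sketch that turns free-form text into fixed-length count vectors. It is a deliberately simplified stand-in for learned embeddings, not how an embedding model works; the sample messages and the helper names `tokenize` and `to_count_vector` are hypothetical.

```python
from collections import Counter

# Hypothetical unstructured inputs: two short support messages.
messages = [
    "refund not received, please send refund",
    "please reset my password",
]

def tokenize(text):
    # Lowercase, replace commas with spaces, split on whitespace.
    return text.lower().replace(",", " ").split()

# The sorted vocabulary acts as a fixed schema shared by every vector.
vocabulary = sorted({tok for msg in messages for tok in tokenize(msg)})

def to_count_vector(text):
    # One integer per vocabulary word: how often it appears in the text.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

vectors = [to_count_vector(m) for m in messages]
```

Every message now has the same eight columns, so it can sit in a table next to structured fields. Real embeddings capture meaning rather than raw counts, but the shape of the result is the same idea.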

Distributions: How Data Is Shaped

Every dataset has a distribution — the pattern of how values spread out. Understanding distributions is the foundation of every statistical method.

Why it matters: The shape of your data determines which methods work. Linear regression assumes roughly normal residuals. K-means clustering assumes roughly spherical clusters. Applying a method to data that violates its assumptions gives misleading results.
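One quick way to put a number on shape is skewness, the average cubed z-score. The sketch below is a minimal illustration with made-up samples; the function name `skewness` and both lists are assumptions for this example.

```python
import statistics

def skewness(values):
    """Population skewness: the mean of cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)

# Hypothetical samples: one symmetric, one with a long right tail.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
right_skewed = [1, 1, 2, 2, 3, 3, 4, 5, 40]

skew_symmetric = skewness(symmetric)   # close to zero
skew_right = skewness(right_skewed)    # clearly positive
```

A value near zero suggests a symmetric distribution, while a clearly positive value signals a right tail, which is a hint to check a method's assumptions before applying it.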

Summary Statistics: The First Five Numbers

Before any analysis, compute these five numbers:

| Statistic | What It Tells You | Watch Out For |
|---|---|---|
| Mean | Average value | Sensitive to outliers: one $10M salary skews the average |
| Median | Middle value | Better for skewed data (income, prices) |
| Std Dev | How spread out values are | High = noisy data, low = consistent |
| Min / Max | Range boundaries | Extreme values may be errors or outliers |
| Count | How much data you have | Too little = unreliable conclusions |

The median vs mean gap is your first diagnostic. If median income is $50K but mean is $85K, you have right-skewed data with high earners pulling the average up. Reporting the mean would be misleading.
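The numbers above can be computed with Python's standard `statistics` module. The income list below is hypothetical, chosen so that the mean and median match the $85K and $50K example.

```python
import statistics

# Hypothetical annual incomes in dollars, including one very high earner.
incomes = [32_000, 41_000, 45_000, 48_000, 52_000, 58_000, 61_000, 343_000]

summary = {
    "count": len(incomes),
    "mean": statistics.mean(incomes),
    "median": statistics.median(incomes),
    "std_dev": statistics.stdev(incomes),  # sample standard deviation
    "min": min(incomes),
    "max": max(incomes),
}

# A mean well above the median suggests right skew.
mean_to_median = summary["mean"] / summary["median"]
```

Here a single 343,000 entry lifts the mean to 85,000 while the median stays at 50,000, so the median is the more honest "typical" value.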

Data Quality: Garbage In, Garbage Out

The oldest rule in data science: your model is only as good as your data. Common quality issues:

Missing Values

  • MCAR (Missing Completely At Random): Safe to drop or impute. Rare in practice.
  • MAR (Missing At Random): Missingness depends on observed data. E.g., income missing more often for younger people.
  • MNAR (Missing Not At Random): Missingness depends on the missing value itself. E.g., high-income people refuse to report income. Dangerous: there is no simple statistical fix, because the data alone cannot tell you why values are missing.
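A first check on the missingness pattern is to compare missing rates across an observed variable. The sketch below uses a hypothetical survey with made-up rows; `missing_rate_by_group` is an illustrative helper, not a library function.

```python
# Hypothetical survey rows: (age_group, income). None marks a missing income.
rows = [
    ("18-29", None), ("18-29", 31_000), ("18-29", None), ("18-29", 28_000),
    ("30-49", 54_000), ("30-49", 61_000), ("30-49", None), ("30-49", 58_000),
    ("50+", 72_000), ("50+", 66_000), ("50+", 70_000), ("50+", 69_000),
]

def missing_rate_by_group(rows):
    """Return the fraction of missing incomes within each group."""
    totals, missing = {}, {}
    for group, income in rows:
        totals[group] = totals.get(group, 0) + 1
        if income is None:
            missing[group] = missing.get(group, 0) + 1
    return {g: missing.get(g, 0) / totals[g] for g in totals}

rates = missing_rate_by_group(rows)
```

Rates that differ sharply by group (here 50%, 25%, and 0%) argue against MCAR and are consistent with MAR. This check cannot detect MNAR, since that depends on values you never observed.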
Outliers

Not all outliers are errors. A $500K transaction might be fraud (remove it) or a legitimate enterprise deal (keep it). Domain knowledge is the only way to decide.
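Code can surface candidates for that review even though it cannot make the decision. The sketch below uses Tukey's 1.5 × IQR rule, a common convention that is swapped in here for illustration; the transaction amounts and the function name `iqr_outliers` are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical transaction amounts in dollars.
amounts = [120, 135, 150, 160, 175, 180, 200, 210, 240, 500_000]

# Flagged for human review, not automatically deleted.
flagged = iqr_outliers(amounts)
```

The rule flags only the 500,000 entry. Whether it is fraud or an enterprise deal is still a question for someone who knows the business.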

Bias in Data

  • Selection bias: Your data doesn't represent the population. A survey on a tech forum doesn't represent "all users."
  • Survivorship bias: You only see successes. Analyzing successful startups without failed ones gives a distorted picture.
  • Measurement bias: The way you collect data affects what you see. Self-reported data is less reliable than observed data.
  • Historical bias: Past data reflects past decisions. A hiring model trained on historical data will replicate past hiring biases.
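Survivorship bias is easy to see with a few numbers. The sketch below uses an entirely made-up set of startups and compares the average return across all of them with the average across survivors only.

```python
import statistics

# Hypothetical startups: (name, still_operating, return_multiple).
startups = [
    ("A", True, 5.0), ("B", False, 0.0), ("C", False, 0.0),
    ("D", True, 2.0), ("E", False, 0.0), ("F", False, 0.5),
    ("G", True, 8.0), ("H", False, 0.0),
]

all_returns = [r for _, _, r in startups]
survivor_returns = [r for _, alive, r in startups if alive]

mean_all = statistics.fmean(all_returns)              # full population
mean_survivors = statistics.fmean(survivor_returns)   # survivors only
```

In this toy data the survivors average a 5.0x return while the full set averages under 2x. An analysis that starts from a list of companies still operating never sees that gap.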
Exploratory Data Analysis (EDA)

EDA is the practice of looking at your data before modeling. It's the most underrated step in data science, and the one most often skipped.

The EDA checklist:

  • Shape: How many rows? How many columns? What types?
  • Distributions: Histogram every numeric column. Are they normal? Skewed? Bimodal?
  • Missing values: Which columns have gaps? How much? What pattern?
  • Correlations: Which variables move together? Which are independent?
  • Outliers: Any extreme values? Are they real or errors?
  • Time patterns: If there's a date column, plot trends. Is there seasonality?
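A first pass over part of this checklist needs nothing beyond the standard library. The sketch below covers shape, types, missing values, and one correlation on a tiny hypothetical dataset; the column names and the `pearson` helper are assumptions for illustration. In practice a dataframe library would do the same work in fewer lines.

```python
import statistics

# Hypothetical dataset as a list of dicts. None marks a missing value.
rows = [
    {"day": 1, "visits": 120, "orders": 12, "region": "north"},
    {"day": 2, "visits": 150, "orders": 15, "region": "south"},
    {"day": 3, "visits": None, "orders": 9, "region": "north"},
    {"day": 4, "visits": 200, "orders": 21, "region": None},
    {"day": 5, "visits": 180, "orders": 18, "region": "south"},
]
columns = list(rows[0])

# Shape: (row count, column count), plus the types seen in each column.
shape = (len(rows), len(columns))
types = {c: {type(r[c]).__name__ for r in rows if r[c] is not None} for c in columns}

# Missing values: number of None entries per column.
missing = {c: sum(r[c] is None for r in rows) for c in columns}

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Correlation: use only rows where both columns are present.
pairs = [(r["visits"], r["orders"]) for r in rows
         if r["visits"] is not None and r["orders"] is not None]
visits_orders_corr = pearson([p[0] for p in pairs], [p[1] for p in pairs])
```

Histograms, outlier checks, and time plots would follow the same pattern: one small, explicit question per step, answered before any model is fit.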
Correlation vs Causation

The most important concept in data science: correlation does not imply causation.

Ice cream sales correlate with drowning deaths. Both increase in summer. Ice cream doesn't cause drowning; temperature drives both.

In business contexts:

  • "Users who complete onboarding have 3x higher retention" — Does onboarding cause retention, or do motivated users both complete onboarding AND stay?
  • "Teams using our tool ship 40% faster" — Did the tool help, or did already-fast teams adopt it?
To establish causation, you need: randomized experiments (A/B tests), natural experiments, or careful causal inference methods. Observational data alone can only show correlation.
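The ice cream example can be reproduced with simulated data. In the sketch below, all numbers are made up and the random seed is fixed: temperature drives both series, and neither series affects the other. The helpers `pearson` and `residuals` are illustrative, and removing a linear temperature effect is only one simple form of adjustment.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def residuals(ys, xs):
    """Residuals of a least-squares line fit of ys on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the run is repeatable
temperature = [rng.uniform(0, 30) for _ in range(2000)]

# Both outcomes depend on temperature plus independent noise, never on each other.
ice_cream = [t + rng.gauss(0, 5) for t in temperature]
drownings = [t + rng.gauss(0, 5) for t in temperature]

# Raw correlation is strong even though there is no causal link.
raw_corr = pearson(ice_cream, drownings)

# Correlate what remains after removing the linear temperature effect.
adjusted_corr = pearson(residuals(ice_cream, temperature),
                        residuals(drownings, temperature))
```

The raw correlation is strong, while the adjusted one falls close to zero. That adjustment works only because temperature was measured; an unmeasured confounder cannot be removed this way, which is why randomized experiments remain the stronger tool.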

Key Takeaways

  • Learn to see the world as data — every process generates it
  • Distributions determine which methods work on your data
  • Summary statistics (mean, median, std dev) are your first diagnostic
  • Data quality issues (missing values, outliers, bias) can invalidate any analysis
  • EDA before modeling — always look at your data first
  • Correlation ≠ causation — the most important lesson in data science
This is chapter 1 of Data Science for AI.