Neural Networks Demystified
From Perceptrons to Transformers
The Biological Inspiration
Neural networks are inspired by the brain — but don't take the analogy too far. A biological neuron fires when its inputs exceed a threshold. An artificial neuron does something similar: multiply inputs by weights, sum them, pass through an activation function.
The key insight: a single neuron is just linear regression with a twist. The magic happens when you stack thousands of them in layers.
The Perceptron: One Neuron
A perceptron takes inputs, multiplies each by a weight, sums them, adds a bias, and passes the result through an activation function:
output = activation(w₁x₁ + w₂x₂ + ... + wₙxₙ + bias)
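As a minimal sketch of the formula above, the snippet below implements one neuron in plain Python. The `step` activation, the function names, and the hand-picked weights are assumptions made for illustration; nothing here is learned.

```python
def step(z):
    """Threshold activation: fire (1) when the weighted sum is positive."""
    return 1 if z > 0 else 0


def perceptron(inputs, weights, bias, activation=step):
    """Compute activation(w1*x1 + ... + wn*xn + bias)."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)


# Hand-picked (not learned) weights that make the neuron act like logical AND.
and_weights = [1.0, 1.0]
and_bias = -1.5

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", perceptron([x1, x2], and_weights, and_bias))
```

Only the input (1, 1) pushes the weighted sum above zero, so the neuron fires for that case alone.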
Activation functions add non-linearity — without them, stacking layers would just be another linear model:
| Function | Output Range | Used Where |
|---|---|---|
| Sigmoid | 0 to 1 | Output layer (binary classification) |
| ReLU | 0 to ∞ | Hidden layers (most common, fast) |
| Tanh | -1 to 1 | Hidden layers (centered around 0) |
| Softmax | 0 to 1 (sums to 1) | Output layer (multi-class) |
ReLU (Rectified Linear Unit) is the default for hidden layers: output = max(0, x). It's simple, fast, and works well in practice.
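The four functions in the table can be written in a few lines each. This is a plain-Python sketch for single values and short lists; real frameworks apply the same formulas to whole tensors at once.

```python
import math


def sigmoid(x):
    """Squash any real number into (0, 1); branches avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def relu(x):
    """Pass positive values through unchanged, clamp negatives to 0."""
    return max(0.0, x)


def tanh(x):
    """Squash any real number into (-1, 1), centered on 0."""
    return math.tanh(x)


def softmax(xs):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(xs)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


for x in (-2.0, 0.0, 2.0):
    print(x, round(sigmoid(x), 3), relu(x), round(tanh(x), 3))
print([round(p, 3) for p in softmax([1.0, 2.0, 3.0])])
```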
Deep Networks: Stacking Layers
A single neuron can only learn linear boundaries. Stack them in layers and they learn hierarchical features:
| Term | Meaning |
|---|---|
| Input layer | Your raw data (pixels, features, tokens) |
| Hidden layers | Where learning happens (1 to hundreds of layers) |
| Output layer | The prediction (class probabilities, regression value) |
| Deep learning | Networks with many hidden layers (typically 3+) |
| Parameters | All weights and biases — what the network learns |
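A small network (2 hidden layers, 128 neurons each) has roughly 20K parameters when it takes about 20 input features; the count grows with the input size. GPT-4 is reported, though not officially confirmed, to have about 1.8 trillion. The difference in capability is staggering.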
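Parameter counts like the one above are easy to check by hand or in code. The sketch below counts weights and biases for a fully connected network. The input and output widths (20 and 1, then 784 and 10) are assumptions chosen for illustration, since the count depends on them as much as on the hidden layers.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out  # one weight per connection
        total += fan_out           # one bias per neuron
    return total


# Assumed shapes: 20 input features and 1 output.
small = count_parameters([20, 128, 128, 1])
# Same hidden layers, but 784 inputs (a 28x28 image) and 10 classes.
image_sized = count_parameters([784, 128, 128, 10])

print(small)        # 19329
print(image_sized)  # 118282
```

The hidden layers are identical in both cases; the wider input alone multiplies the parameter count by about six.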
Backpropagation: How Networks Learn
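Training a neural network repeats the same cycle: run a forward pass to get predictions, measure the error with a loss function, use backpropagation to compute how much each weight contributed to that error, and nudge every weight in the direction that reduces it.
The learning rate controls how much weights change per update. Set it too high and training overshoots or diverges; set it too low and training crawls.
Batch size determines how many examples the network sees before updating weights. Small batches give noisier but more frequent updates; large batches give smoother updates at a higher memory cost.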
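To see where the learning rate and batch size enter, here is a sketch of mini-batch gradient descent for a single linear neuron with a mean-squared-error loss. With one neuron the gradients can be written by hand; backpropagation is the same idea applied layer by layer through the chain rule. The toy data, default values, and function name are assumptions for illustration.

```python
import random


def train_linear_neuron(data, learning_rate=0.1, batch_size=4, epochs=200, seed=0):
    """Fit y ~ w*x + b with mini-batch gradient descent on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Forward pass: prediction error for each example in the batch.
            errors = [(w * x + b) - y for x, y in batch]
            # Backward pass: gradients of the batch's mean squared error.
            grad_w = 2 * sum(e * x for e, (x, _) in zip(errors, batch)) / len(batch)
            grad_b = 2 * sum(errors) / len(batch)
            # Update: step against the gradient, scaled by the learning rate.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (made up for illustration).
points = [(x / 10, 2 * (x / 10) + 1) for x in range(-10, 11)]
w, b = train_linear_neuron(points)
print(round(w, 3), round(b, 3))  # should land close to 2 and 1
```

Rerunning with a much smaller `learning_rate` or fewer `epochs` leaves `w` and `b` further from the true values, which is a quick way to get a feel for both settings.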
CNNs: Neural Networks That See
Convolutional Neural Networks are specialized for grid-like data (images, spectrograms, spatial data).
Instead of connecting every neuron to every input (a 1080p RGB image has about 6.2 million values, so even one fully connected layer of 1,000 neurons would need over 6 billion weights), CNNs use filters that slide across the image:
| Layer | What It Does | Example |
|---|---|---|
| Convolutional | Detects local patterns | Edge detection, texture recognition |
| Pooling | Reduces spatial size | 224×224 → 112×112 (keeps important features) |
| Fully connected | Combines features for classification | "This combination of features = cat" |
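The convolutional and pooling rows of the table can be sketched in plain Python. The filter below is written by hand to detect a vertical edge; in a trained CNN the filter values are learned. As in most deep learning libraries, the "convolution" here slides the filter without flipping it. The image and filter are made up for illustration.

```python
def conv2d(image, kernel):
    """Slide a kernel over a 2D image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# A 5x5 image: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-written vertical-edge filter (a trained CNN would learn these values).
vertical_edge = [[-1, 1], [-1, 1]]

edges = conv2d(image, vertical_edge)  # 4x4 feature map
pooled = max_pool2d(edges)            # 2x2 after pooling
print(edges)   # [[0, 2, 0, 0], [0, 2, 0, 0], [0, 2, 0, 0], [0, 2, 0, 0]]
print(pooled)  # [[2, 0], [2, 0]]
```

The filter responds only where dark meets bright, and pooling halves each dimension while keeping that response.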
Key insight: CNNs learn hierarchical visual features automatically. Early layers detect edges, middle layers detect shapes (eyes, wheels), deep layers detect objects (faces, cars). This hierarchy emerges from training — it's not programmed.
Applications beyond images:
RNNs and Transformers: Neural Networks That Read
Recurrent Neural Networks (RNNs) process sequences one element at a time, maintaining a "hidden state" that carries information forward.
The problem: RNNs forget. By the time they reach the end of a long document, they've lost information from the beginning. LSTM and GRU architectures partially solve this with "memory gates."
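The hidden-state update and the forgetting problem can both be seen in a toy RNN with a single number as its hidden state. Real RNNs use vectors and weight matrices, and the weights below are made up rather than trained, so treat this as an illustration only.

```python
import math


def rnn_step(x, h_prev, w_x, w_h, b):
    """One step of a scalar RNN: mix the new input with the previous hidden state."""
    return math.tanh(w_x * x + w_h * h_prev + b)


def run_rnn(sequence, w_x=1.0, w_h=0.5, b=0.0):
    """Process a sequence one element at a time and return every hidden state."""
    h = 0.0
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states


# A single spike at the start followed by silence: watch its trace fade.
states = run_rnn([1.0] + [0.0] * 9)
print([round(h, 4) for h in states])
```

Each step shrinks what remains of the first input, so after ten steps almost nothing of it is left in the hidden state. Gated architectures such as LSTM and GRU add mechanisms that let the network choose what to keep.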
Transformers (2017) revolutionized sequence processing with self-attention: every token can directly attend to every other token, regardless of distance.
| Architecture | How It Processes Sequences | Strength | Weakness |
|---|---|---|---|
| RNN/LSTM | One token at a time, left to right | Simple, low memory | Forgets long-range context |
| Transformer | All tokens simultaneously (attention) | Long-range context, parallelizable | Quadratic memory in sequence length |
The transformer architecture powers nearly every modern LLM, including GPT, Claude, Gemini, Llama, and Mistral. It's also behind BERT (embeddings), Vision Transformers (images), and Whisper (audio).
Self-attention in one sentence: For each token, compute how much attention to pay to every other token, then create a weighted combination. "The cat sat on the [mat]" — when processing "mat," the model attends heavily to "sat on" and "cat" for context.
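The "weighted combination" can be made concrete with a small sketch of scaled dot-product self-attention. To keep it short, queries, keys, and values are all the raw embeddings; real transformers first multiply the embeddings by learned query, key, and value matrices and use many attention heads. The three 2-dimensional embeddings are made up for illustration.

```python
import math


def softmax(xs):
    """Turn a list of scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(embeddings):
    """Scaled dot-product self-attention with queries = keys = values."""
    scale = math.sqrt(len(embeddings[0]))
    weights, outputs = [], []
    for query in embeddings:
        # Score this token against every token, then normalize the scores.
        scores = [dot(query, key) / scale for key in embeddings]
        attn = softmax(scores)
        weights.append(attn)
        # Weighted combination of every token's value vector.
        outputs.append([
            sum(a * value[d] for a, value in zip(attn, embeddings))
            for d in range(len(query))
        ])
    return weights, outputs


# Made-up 2-dimensional embeddings for three tokens.
tokens = ["cat", "sat", "mat"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
weights, outputs = self_attention(embeddings)
for token, row in zip(tokens, weights):
    print(token, [round(w, 3) for w in row])
```

The weight matrix has one row and one column per token, which is where the quadratic memory cost in sequence length comes from.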
When Neural Networks Are Overkill
Neural networks are powerful but not always the right tool:
| Situation | Better Alternative | Why |
|---|---|---|
| < 1,000 rows | Linear/logistic regression | NNs overfit with little data |
| Tabular data | Random forest / XGBoost | NNs rarely beat tree methods on tables |
| Need interpretability | Decision tree / linear model | NNs are black boxes |
| Real-time, low latency | Simple model + caching | Large NNs can be orders of magnitude slower at inference |
| Limited compute budget | Classical ML | GPUs are expensive |
The rule of thumb: For structured/tabular data, try XGBoost first. For images, use CNNs. For text, use transformers (or just call an LLM API). For time series, try ARIMA before LSTMs.
Key Takeaways
This is chapter 5 of Data Science for AI.