
Neural Networks Demystified

From Perceptrons to Transformers

The Biological Inspiration

Neural networks are inspired by the brain — but don't take the analogy too far. A biological neuron fires when its inputs exceed a threshold. An artificial neuron does something similar: multiply inputs by weights, sum them, pass through an activation function.

The key insight: a single neuron is just a linear model (as in linear regression) followed by a non-linear activation. The magic happens when you stack thousands of them in layers.

The Perceptron: One Neuron

A perceptron takes inputs, multiplies each by a weight, sums them, adds a bias, and passes the result through an activation function:

output = activation(w₁x₁ + w₂x₂ + ... + wₙxₙ + bias)
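The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the `step` activation and the AND-gate weights are hand-picked assumptions, whereas a real perceptron would learn its weights from data.

```python
def step(z):
    """Threshold activation: output 1 when the weighted sum is positive."""
    return 1 if z > 0 else 0


def perceptron(inputs, weights, bias, activation=step):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Hand-picked (not learned) parameters that make the neuron behave like AND.
and_weights = [1.0, 1.0]
and_bias = -1.5

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", perceptron([x1, x2], and_weights, and_bias))
```

Only the input (1, 1) pushes the weighted sum above zero, so it is the only case where the neuron outputs 1.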


Activation functions add non-linearity — without them, stacking layers would just be another linear model:

Function	Output Range	Used Where
Sigmoid	0 to 1	Output layer (binary classification)
ReLU	0 to ∞	Hidden layers (most common, fast)
Tanh	-1 to 1	Hidden layers (centered around 0)
Softmax	0 to 1 (sums to 1)	Output layer (multi-class)

ReLU (Rectified Linear Unit) is the default for hidden layers: output = max(0, x). It's simple, fast, and works well in practice.

Deep Networks: Stacking Layers

A single neuron can only learn linear boundaries. Stack them in layers and they learn hierarchical features:

Layer 1: Detects simple patterns (edges in images, common words in text)

Layer 2: Combines simple patterns into complex ones (shapes, phrases)

Layer 3+: Combines complex patterns into abstract concepts (faces, sentiment)

Term	Meaning
Input layer	Your raw data (pixels, features, tokens)
Hidden layers	Intermediate layers that transform inputs into learned features (1 to hundreds of layers)
Output layer	The prediction (class probabilities, regression value)
Deep learning	Networks with many hidden layers (typically 3+)
Parameters	All weights and biases — what the network learns

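A small fully connected network (2 hidden layers, 128 neurons each) has roughly 20K parameters when the input has a few dozen features; the exact count depends on the input and output sizes. GPT-4's size has not been officially disclosed, but unofficial reports put it at around 1.8 trillion parameters. The difference in capability is staggering.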
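Parameter counts for fully connected networks are simple arithmetic: each layer has one weight per incoming connection plus one bias per neuron. The sketch below assumes layer sizes the text does not specify (20 input features and a single output, then a flattened 28x28 image with 10 classes) to show how strongly the total depends on the input size.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection
        total += n_out         # one bias per neuron
    return total


# Assumed sizes: 20 input features, two hidden layers of 128, one output.
print(count_parameters([20, 128, 128, 1]))    # 19329

# Same hidden layers on a flattened 28x28 image with 10 output classes.
print(count_parameters([784, 128, 128, 10]))  # 118282
```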

Backpropagation: How Networks Learn

Training a neural network:

Forward pass: Input flows through the network, producing a prediction

Loss calculation: Compare prediction to actual value (how wrong were we?)

Backward pass: Calculate how each weight contributed to the error

Weight update: Adjust weights to reduce error (gradient descent)

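Repeat for many iterations. One iteration is a single weight update; one epoch is a full pass over the training data, and training typically runs for many epochs.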
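The four steps can be seen end to end on the smallest possible network: a single sigmoid neuron. The sketch below is illustrative only. The OR dataset, the learning rate of 0.5, and the 1,000 epochs are assumptions, and with one neuron the backward pass reduces to a single chain-rule step (for sigmoid with cross-entropy loss, the gradient with respect to the weighted sum is `prediction - target`). In a multi-layer network, backpropagation repeats that chain-rule step layer by layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, inputs):
    """Forward pass for a single neuron."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


def train(data, epochs=1000, learning_rate=0.5):
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    loss_history = []
    for _ in range(epochs):
        epoch_loss = 0.0
        for inputs, target in data:
            # 1. Forward pass
            pred = predict(weights, bias, inputs)
            # 2. Loss calculation (binary cross-entropy)
            epoch_loss += -(target * math.log(pred)
                            + (1 - target) * math.log(1 - pred))
            # 3. Backward pass: gradient of the loss w.r.t. the weighted sum
            error = pred - target
            # 4. Weight update (gradient descent)
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
        loss_history.append(epoch_loss / len(data))
    return weights, bias, loss_history


# Toy dataset (assumed for illustration): the logical OR function.
or_data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
           ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

weights, bias, loss_history = train(or_data)
print("first epoch loss:", round(loss_history[0], 4))
print("last epoch loss: ", round(loss_history[-1], 4))
for inputs, target in or_data:
    print(inputs, "->", round(predict(weights, bias, inputs), 3), "target", target)
```

This version updates the weights after every example (a batch size of 1), which keeps the code short.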

The learning rate controls how much weights change per update:

Too high: Overshoots the optimal weights, training is unstable

Too low: Converges very slowly and can stall on plateaus or in poor local minima

Just right: Smooth convergence to good weights
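A one-parameter toy problem makes the three cases visible. The sketch below minimizes f(w) = (w - 3)^2 with plain gradient descent; the function, the starting point, and the three learning rates are assumptions chosen for illustration. Because this function has a single minimum, it shows overshooting and slow convergence but not local minima.

```python
def gradient_descent(learning_rate, steps=50, start=0.0):
    """Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)."""
    w = start
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)
        w -= learning_rate * grad
    return w


for lr in (1.1, 0.001, 0.1):
    print(f"lr={lr}: w after 50 steps = {gradient_descent(lr):.4f}")
```

With a rate of 1.1 each step overshoots the minimum by more than it corrects, so `w` swings further away every iteration. With 0.001 it has barely moved from the start after 50 steps. With 0.1 it lands very close to the optimum at 3.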

Batch size determines how many examples the network sees before updating weights:

Small batch (32): Noisier gradients, often better generalization, slower per epoch (more weight updates)

Large batch (256+): Smoother gradients, faster per epoch, may generalize worse
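The "per epoch" difference comes from how many weight updates one pass over the data requires. A minimal sketch, assuming a dataset of 10,000 examples and a hypothetical `iterate_minibatches` helper:

```python
def iterate_minibatches(examples, batch_size):
    """Yield consecutive slices of `examples`, each at most batch_size long."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


dataset = list(range(10_000))  # stand-in for 10,000 training examples (assumed)

updates_per_epoch = {}
for batch_size in (32, 256):
    updates_per_epoch[batch_size] = sum(
        1 for _ in iterate_minibatches(dataset, batch_size)
    )
    print(f"batch_size={batch_size}: {updates_per_epoch[batch_size]} updates per epoch")
```

The final batch is smaller when the dataset size is not a multiple of the batch size. Real training loops usually also shuffle the data before each epoch, which this sketch omits.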

CNNs: Neural Networks That See

Convolutional Neural Networks are specialized for grid-like data (images, spectrograms, spatial data).

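Instead of connecting every neuron to every input (a 1080p RGB image has about 6.2 million input values, so a single fully connected layer of 1,000 neurons would already need over 6 billion weights), CNNs use small filters that slide across the image and reuse the same weights at every position: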

Layer	What It Does	Example
Convolutional	Detects local patterns	Edge detection, texture recognition
Pooling	Reduces spatial size	224×224 → 112×112 (keeps important features)
Fully connected	Combines features for classification	"This combination of features = cat"
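The convolutional and pooling rows of the table can be sketched in plain Python. This is an illustration under stated assumptions: the 5x5 image and the vertical-edge filter are hand-written (a trained CNN learns its filter values), there is no padding, the stride is 1, and, as in most deep learning libraries, the "convolution" is technically a cross-correlation.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 block."""
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, len(feature_map[0]) - 1, 2)
        ]
        for r in range(0, len(feature_map) - 1, 2)
    ]


# 5x5 image: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hand-written vertical-edge filter: responds where brightness rises left to right.
vertical_edge = [[-1, 1],
                 [-1, 1]]

features = conv2d(image, vertical_edge)  # 4x4 feature map
pooled = max_pool2x2(features)           # 2x2 after pooling

for row in features:
    print(row)
print(pooled)
```

The feature map is nonzero only in the column where dark meets bright, and pooling halves each spatial dimension while keeping that response. The filter has just 4 weights regardless of image size, which is the source of the savings over a fully connected layer.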

Key insight: CNNs learn hierarchical visual features automatically. Early layers detect edges, middle layers detect shapes (eyes, wheels), deep layers detect objects (faces, cars). This hierarchy emerges from training — it's not programmed.

Applications beyond images:

Audio: Spectrograms (2D) → genre classification, speech recognition

Text: Character-level CNNs for text classification

Medical: X-ray and MRI analysis

Manufacturing: Visual defect detection

RNNs and Transformers: Neural Networks That Read

Recurrent Neural Networks (RNNs) process sequences one element at a time, maintaining a "hidden state" that carries information forward.

The problem: RNNs forget. By the time they reach the end of a long document, they've lost information from the beginning. LSTM and GRU architectures partially solve this with "memory gates."
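Both the hidden state and the forgetting problem show up in a tiny hand-built example. The sketch below implements a plain (non-LSTM) recurrent step, h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h). The one-dimensional weights are hand-picked assumptions, not trained values, chosen so the effect is easy to see.

```python
import math


def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """One recurrent step: combine the current input with the previous state."""
    h_new = []
    for k in range(len(h_prev)):
        z = b_h[k]
        z += sum(W_xh[k][i] * x[i] for i in range(len(x)))
        z += sum(W_hh[k][j] * h_prev[j] for j in range(len(h_prev)))
        h_new.append(math.tanh(z))
    return h_new


def run_rnn(sequence, W_xh, W_hh, b_h):
    """Process a sequence one element at a time; return the final hidden state."""
    h = [0.0] * len(b_h)
    for x in sequence:
        h = rnn_step(x, h, W_xh, W_hh, b_h)
    return h


# Hand-picked 1-dimensional parameters (assumed, not trained).
W_xh, W_hh, b_h = [[1.0]], [[0.5]], [0.0]

# Two sequences that differ only in their first element.
short_a = run_rnn([[1.0]] + [[0.0]] * 1, W_xh, W_hh, b_h)
short_b = run_rnn([[-1.0]] + [[0.0]] * 1, W_xh, W_hh, b_h)
long_a = run_rnn([[1.0]] + [[0.0]] * 20, W_xh, W_hh, b_h)
long_b = run_rnn([[-1.0]] + [[0.0]] * 20, W_xh, W_hh, b_h)

print("difference after 1 more step:  ", abs(short_a[0] - short_b[0]))
print("difference after 20 more steps:", abs(long_a[0] - long_b[0]))
```

After one extra step the two final states are clearly different; after twenty, they are nearly identical, so the network can no longer tell which first token it saw. LSTM and GRU cells add gates so that the state is not forced to shrink at every step.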

Transformers (2017) revolutionized sequence processing with self-attention: every token can directly attend to every other token, regardless of distance.

Architecture	How It Processes Sequences	Strength	Weakness
RNN/LSTM	One token at a time, left to right	Simple, low memory	Forgets long-range context
Transformer	All tokens simultaneously (attention)	Long-range context, parallelizable	Quadratic memory in sequence length

The transformer architecture powers nearly every modern LLM: GPT, Claude, Gemini, Llama, Mistral. It's also behind BERT (embeddings), Vision Transformers (images), and Whisper (audio).

Self-attention in one sentence: For each token, compute how much attention to pay to every other token, then create a weighted combination. "The cat sat on the [mat]" — when processing "mat," the model attends heavily to "sat on" and "cat" for context.
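The "weighted combination" can be written out as single-head scaled dot-product attention. The sketch below makes simplifying assumptions: the 2-dimensional token embeddings are invented for illustration, and they are used directly as queries, keys, and values, whereas a real transformer first passes them through three learned projection matrices.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over one sequence (single head)."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # How well does this token's query match every token's key?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)  # attention weights sum to 1
        # Weighted combination of all value vectors.
        out = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


tokens = ["cat", "sat", "mat"]
# Invented 2-dimensional embeddings, for illustration only.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

outputs, attention = self_attention(embeddings, embeddings, embeddings)
for token, row in zip(tokens, attention):
    print(token, [round(w, 3) for w in row])
```

The attention matrix has one row and one column per token, which is where the quadratic cost in sequence length comes from.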

When Neural Networks Are Overkill

Neural networks are powerful but not always the right tool:

Situation	Better Alternative	Why
< 1,000 rows	Linear/logistic regression	NNs overfit with little data
Tabular data	Random forest / XGBoost	NNs rarely beat tree methods on tables
Need interpretability	Decision tree / linear model	NNs are black boxes
Real-time, low latency	Simple model + caching	Large NNs can be orders of magnitude slower per prediction than a linear model or small tree
Limited compute budget	Classical ML	GPUs are expensive

The rule of thumb: For structured/tabular data, try XGBoost first. For images, use CNNs. For text, use transformers (or just call an LLM API). For time series, try ARIMA before LSTMs.

Key Takeaways

A neuron = weighted sum + activation function. Stacking them creates depth.

Activation functions (such as ReLU) add non-linearity; without them, a deep stack of layers collapses into a single linear model

Backpropagation adjusts weights by tracing errors backward through the network

CNNs learn visual hierarchies (edges → shapes → objects) for image tasks

Transformers use self-attention to process entire sequences at once — the foundation of modern AI

Neural networks are overkill for small data, tabular data, and interpretability needs

For most business problems, start with classical ML and reach for NNs only when needed

This is chapter 5 of Data Science for AI.
