Neural Networks Demystified

From Perceptrons to Transformers

The Biological Inspiration

Neural networks are inspired by the brain — but don't take the analogy too far. A biological neuron fires when its inputs exceed a threshold. An artificial neuron does something similar: multiply inputs by weights, sum them, pass through an activation function.

The key insight: a single neuron is just linear regression with a twist. The magic happens when you stack thousands of them in layers.

The Perceptron: One Neuron

A perceptron takes inputs, multiplies each by a weight, sums them, adds a bias, and passes the result through an activation function:

output = activation(w₁x₁ + w₂x₂ + ... + wₙxₙ + bias)
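
The formula above can be sketched in a few lines of plain Python. The function names, the choice of sigmoid as the activation, and the input values are illustrative assumptions, not part of the original guide.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of inputs plus bias, passed through an activation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)

# Illustrative values only: two inputs, two weights, one bias.
# Weighted sum: 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0, and sigmoid(0.0) = 0.5.
output = perceptron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(output)  # 0.5
```

Passing a different `activation` (for example `lambda z: max(0.0, z)` for ReLU) changes the neuron's output range without touching the weighted sum.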

Activation functions add non-linearity — without them, stacking layers would just be another linear model:

| Function | Output Range | Used Where |
| --- | --- | --- |
| Sigmoid | 0 to 1 | Output layer (binary classification) |
| ReLU | 0 to ∞ | Hidden layers (most common, fast) |
| Tanh | -1 to 1 | Hidden layers (centered around 0) |
| Softmax | 0 to 1 (sums to 1) | Output layer (multi-class) |

ReLU (Rectified Linear Unit) is the default for hidden layers: output = max(0, x). It's simple, fast, and works well in practice.
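
As a rough sketch, the four functions in the table fit in a few lines of plain Python. The second half illustrates why non-linearity matters: two stacked linear layers with no activation reduce to one linear layer, while adding ReLU in between does not. The weights `w1`, `b1`, `w2`, `b2` are made-up scalars chosen for the illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def tanh(x):
    return math.tanh(x)

def softmax(values):
    # Subtract the max before exponentiating for numerical stability.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Two "layers" with no activation: y = w2 * (w1 * x + b1) + b2
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def two_linear_layers(x):
    return w2 * (w1 * x + b1) + b2

# Exactly the same function written as a single linear layer.
w_combined = w2 * w1        # -6.0
b_combined = w2 * b1 + b2   # -2.5

def one_linear_layer(x):
    return w_combined * x + b_combined

# With ReLU between the layers, the result no longer matches any single line.
def two_layers_with_relu(x):
    return w2 * relu(w1 * x + b1) + b2

print(two_linear_layers(-2.0), one_linear_layer(-2.0))  # 9.5 9.5
print(two_layers_with_relu(-2.0))                       # 0.5
```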

Deep Networks: Stacking Layers

A single neuron can only learn linear boundaries. Stack them in layers and they learn hierarchical features:

  • Layer 1: Detects simple patterns (edges in images, common words in text)
  • Layer 2: Combines simple patterns into complex ones (shapes, phrases)
  • Layer 3+: Combines complex patterns into abstract concepts (faces, sentiment)

| Term | Meaning |
| --- | --- |
| Input layer | Your raw data (pixels, features, tokens) |
| Hidden layers | Where learning happens (1 to hundreds of layers) |
| Output layer | The prediction (class probabilities, regression value) |
| Deep learning | Networks with many hidden layers (typically 3+) |
| Parameters | All weights and biases: what the network learns |

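A small fully connected network (2 hidden layers of 128 neurons each, with about 20 input features) has roughly 20K parameters. GPT-4's size has not been officially disclosed, but unofficial reports put it at around 1.8 trillion. The difference in capability is staggering.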
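
The parameter count of a fully connected network depends on every layer width, including the input. The sketch below counts weights and biases for a given list of layer sizes. The 20-feature input and single output are assumptions chosen to reproduce the ~20K figure; the original does not state them.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection
        total += n_out         # one bias per neuron
    return total

# Assumed shape: 20 input features, two hidden layers of 128, one output.
print(count_parameters([20, 128, 128, 1]))    # 19329
# Same hidden layers on flattened 28x28 images with 10 output classes.
print(count_parameters([784, 128, 128, 10]))  # 118282
```

The same two hidden layers cost about six times more parameters once the input grows from 20 features to 784 pixels, because the first layer's weight matrix scales with the input width.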

Backpropagation: How Networks Learn

Training a neural network:

  • Forward pass: Input flows through the network, producing a prediction
  • Loss calculation: Compare prediction to actual value (how wrong were we?)
  • Backward pass: Calculate how each weight contributed to the error
  • Weight update: Adjust weights to reduce error (gradient descent)
  • Repeat over many batches and many epochs (one epoch is one full pass over the training data)

The learning rate controls how much weights change per update:

  • Too high: Overshoots the optimal weights, training is unstable
  • Too low: Converges slowly and can stall on plateaus or in poor local minima
  • Just right: Smooth convergence to good weights

Batch size determines how many examples the network sees before updating weights:

  • Small batch (32): Noisier gradients, often better generalization, slower per epoch
  • Large batch (256+): Smoother gradients, faster per epoch, may generalize worse
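
The training steps, learning rate, and batch size can be seen together in a minimal sketch: mini-batch gradient descent on a single linear neuron with squared-error loss. With only one layer, the backward pass is a single application of the chain rule; a deep network repeats that step layer by layer. The toy data (y = 2x + 1), the function name `train`, and the default hyperparameters are assumptions for illustration.

```python
import random

def train(xs, ys, learning_rate=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y ≈ w*x + b with mini-batch gradient descent on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):                  # one epoch = one full pass over the data
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                prediction = w * xs[i] + b   # forward pass
                error = prediction - ys[i]   # loss for this example is error**2
                grad_w += 2 * error * xs[i]  # backward pass: d(loss)/dw
                grad_b += 2 * error          # backward pass: d(loss)/db
            # Weight update: step against the average gradient of the batch.
            w -= learning_rate * grad_w / len(batch)
            b -= learning_rate * grad_b / len(batch)
    return w, b

# Noise-free toy data generated from y = 2x + 1.
xs = [x / 4 for x in range(-8, 9)]
ys = [2.0 * x + 1.0 for x in xs]

w, b = train(xs, ys)
print(round(w, 3), round(b, 3))  # 2.0 1.0
```

Changing `learning_rate` and `batch_size` in the call to `train` is a quick way to observe the trade-offs listed above on a problem small enough to reason about by hand.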

CNNs: Neural Networks That See

    Convolutional Neural Networks are specialized for grid-like data (images, spectrograms, spatial data).

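Instead of connecting every neuron to every input (a 1080p RGB image has about 6.2 million input values, so a dense layer of just 1,000 neurons would already need over 6 billion weights), CNNs use small filters that slide across the image and reuse the same weights at every position: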

| Layer | What It Does | Example |
| --- | --- | --- |
| Convolutional | Detects local patterns | Edge detection, texture recognition |
| Pooling | Reduces spatial size | 224×224 → 112×112 (keeps important features) |
| Fully connected | Combines features for classification | "This combination of features = cat" |
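
To make the first two rows concrete, here is a minimal sketch of a convolution and a max-pooling step in plain Python. The 5×5 image and the hand-written vertical-edge filter are illustrative assumptions; in a real CNN the filter values are learned during training, and a library such as PyTorch or TensorFlow would do this work.

```python
def convolve2d(image, kernel):
    """Slide a kernel over a 2D image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]

# A 5x5 image: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# A hand-written vertical-edge filter (assumption for this sketch).
vertical_edge = [[-1, 1],
                 [-1, 1]]

edges = convolve2d(image, vertical_edge)
pooled = max_pool2d(edges)
print(edges[0])  # [0, 2, 0, 0]
print(pooled)    # [[2, 0], [2, 0]]
```

The filter responds only where dark meets bright, and pooling halves the 4×4 feature map to 2×2 while keeping that response.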

    Key insight: CNNs learn hierarchical visual features automatically. Early layers detect edges, middle layers detect shapes (eyes, wheels), deep layers detect objects (faces, cars). This hierarchy emerges from training — it's not programmed.

    Applications beyond images:

  • Audio: Spectrograms (2D) → genre classification, speech recognition
  • Text: Character-level CNNs for text classification
  • Medical: X-ray and MRI analysis
  • Manufacturing: Visual defect detection

RNNs and Transformers: Neural Networks That Read

    Recurrent Neural Networks (RNNs) process sequences one element at a time, maintaining a "hidden state" that carries information forward.

    The problem: RNNs forget. By the time they reach the end of a long document, they've lost information from the beginning. LSTM and GRU architectures partially solve this with "memory gates."
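
The forgetting can be shown with a one-unit RNN in plain Python. The sketch feeds in two sequences that differ only in their first element and measures how far apart the final hidden states end up. The weights `w_x` and `w_h` are hand-picked assumptions (a recurrent weight below 1 makes the fading easy to see); a trained network would learn its own values.

```python
import math

def rnn_final_state(sequence, w_x=1.0, w_h=0.5, bias=0.0):
    """Run a one-unit RNN over a sequence and return the last hidden state."""
    h = 0.0
    for x in sequence:
        # The new state mixes the current input with the previous state.
        h = math.tanh(w_x * x + w_h * h + bias)
    return h

def first_token_influence(length):
    """How much the final state changes when only the first input changes."""
    a = [1.0] + [0.0] * (length - 1)
    b = [-1.0] + [0.0] * (length - 1)
    return abs(rnn_final_state(a) - rnn_final_state(b))

# The longer the sequence, the less the first token matters at the end.
for n in (2, 10, 30):
    print(n, first_token_influence(n))
```

The printed influence shrinks at every step because each update multiplies the previous state by `w_h` and squashes it through tanh, which is the effect LSTM and GRU gates are designed to counteract.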

    Transformers (2017) revolutionized sequence processing with self-attention: every token can directly attend to every other token, regardless of distance.

| Architecture | How It Processes Sequences | Strength | Weakness |
| --- | --- | --- | --- |
| RNN/LSTM | One token at a time, left to right | Simple, low memory | Forgets long-range context |
| Transformer | All tokens simultaneously (attention) | Long-range context, parallelizable | Quadratic memory in sequence length |

The transformer architecture powers nearly every major LLM: GPT, Claude, Gemini, Llama, Mistral. It's also behind BERT (embeddings), Vision Transformers (images), and Whisper (audio).

Self-attention in a nutshell: For each token, compute how much attention to pay to every other token, then create a weighted combination. "The cat sat on the [mat]": when processing "mat," the model attends heavily to "sat on" and "cat" for context.
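
A minimal sketch of that computation, using scaled dot-product attention in plain Python. Real transformers first map each token through learned query, key, and value projections and run several attention heads in parallel; here the raw embeddings are used directly for all three roles, and the 2-dimensional embeddings are made up for illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        # How strongly this token attends to every token (itself included).
        weights = softmax([dot(q, k) / scale for k in keys])
        # Weighted combination of the value vectors.
        mixed = [sum(w * v[d] for w, v in zip(weights, values))
                 for d in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up 2-dimensional embeddings for three tokens (assumption).
tokens = ["cat", "sat", "mat"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

outputs, weights = self_attention(embeddings, embeddings, embeddings)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each row of `weights` sums to 1 and holds one entry per token, so the full table has one entry per pair of tokens. That pairwise table is where the quadratic cost in sequence length comes from.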

When Neural Networks Are Overkill

Neural networks are powerful but not always the right tool:

| Situation | Better Alternative | Why |
| --- | --- | --- |
| < 1,000 rows | Linear/logistic regression | NNs overfit with little data |
| Tabular data | Random forest / XGBoost | NNs rarely beat tree methods on tables |
| Need interpretability | Decision tree / linear model | NNs are black boxes |
| Real-time, low latency | Simple model + caching | NN inference is often orders of magnitude slower than a linear or tree model |
| Limited compute budget | Classical ML | GPUs are expensive |

    The rule of thumb: For structured/tabular data, try XGBoost first. For images, use CNNs. For text, use transformers (or just call an LLM API). For time series, try ARIMA before LSTMs.

Key Takeaways

  • A neuron = weighted sum + activation function. Stacking them creates depth.
  • Activation functions (ReLU) add non-linearity — without them, deep = shallow
  • Backpropagation adjusts weights by tracing errors backward through the network
  • CNNs learn visual hierarchies (edges → shapes → objects) for image tasks
  • Transformers use self-attention to process entire sequences at once — the foundation of modern AI
  • Neural networks are overkill for small data, tabular data, and interpretability needs
  • For most business problems, start with classical ML and reach for NNs only when needed

This is chapter 5 of Data Science for AI.
