Section 3.3: Feed-Forward Neural Networks

We can now represent tokens as continuous vectors. The next step: build a function that transforms these vectors into predictions.

Feed-forward neural networks (also called multi-layer perceptrons or MLPs) are the simplest such functions. They're the building blocks of all modern deep learning.

This section derives neural networks from first principles and explains what makes them powerful.

The Basic Unit: A Linear Layer

Mathematical Definition

A linear layer transforms an input vector x ∈ ℝⁿ into an output vector y ∈ ℝᵐ:

\[y = Wx + b\]

Where:

  • W ∈ ℝᵐˣⁿ is the weight matrix
  • b ∈ ℝᵐ is the bias vector
  • x ∈ ℝⁿ is the input
  • y ∈ ℝᵐ is the output

What It Computes

Each output component yᵢ is a weighted sum of inputs plus a bias:

\[y_i = \sum_{j=1}^{n} W_{ij} x_j + b_i\]

This is a linear combination of the inputs.

Geometric Interpretation

A linear layer performs:

  1. Rotation/scaling: W rotates and scales the input space
  2. Translation: b shifts the result

It maps the input space to a new space with potentially different dimensionality.

Example

Input: x = [2, 3] (n = 2)
Output dimension: m = 3

\[W = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}, \quad b = \begin{bmatrix} 0 \\ 1 \\ -1 \end{bmatrix}\]
\[y = \begin{bmatrix} 1·2 + 0·3 + 0 \\ 0·2 + 1·3 + 1 \\ 1·2 + 1·3 - 1 \end{bmatrix} = \begin{bmatrix} 2 \\ 4 \\ 4 \end{bmatrix}\]
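
As a quick check, here is a minimal pure-Python sketch of this computation (the helper name linear is our own):

def linear(W, b, x):
    """Compute y = Wx + b with plain Python lists (W is a list of rows)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

W = [[1, 0], [0, 1], [1, 1]]
b = [0, 1, -1]
x = [2, 3]
print(linear(W, b, x))  # [2, 4, 4]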

The Problem: Linear Functions Are Limited

Composition of Linear Functions is Linear

If f(x) = W₁x + b₁ and g(x) = W₂x + b₂, then:

\[g(f(x)) = W_2(W_1 x + b_1) + b_2 = W_2 W_1 x + (W_2 b_1 + b_2)\]

This is still a linear function! Let W' = W₂W₁ and b' = W₂b₁ + b₂:

\[g(f(x)) = W' x + b'\]

No matter how many linear layers we stack, we get a linear function.
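
To make this concrete, here is a small pure-Python sketch (illustrative names of our own) verifying the collapse numerically:

def matvec(W, x):
    """Matrix-vector product with plain lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product with plain lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

W1, b1 = [[2, 0], [1, 1]], [1, 0]
W2, b2 = [[1, 1], [0, 3]], [0, 2]
x = [4, 5]

# Apply the two layers in sequence
h = [hi + bi for hi, bi in zip(matvec(W1, x), b1)]
y_stacked = [yi + bi for yi, bi in zip(matvec(W2, h), b2)]

# Collapse into a single layer: W' = W2 W1, b' = W2 b1 + b2
W_prime = matmul(W2, W1)
b_prime = [wi + bi for wi, bi in zip(matvec(W2, b1), b2)]
y_single = [yi + bi for yi, bi in zip(matvec(W_prime, x), b_prime)]

print(y_stacked == y_single)  # True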

Why This Is a Problem

Linear functions can only:

  • Draw straight decision boundaries
  • Compute linear combinations

They cannot:

  • Compute XOR (or any non-linearly-separable function)
  • Model complex patterns
  • Learn hierarchical features

The Solution: Nonlinear Activation Functions

To break linearity, we apply a nonlinear function after each linear layer.

The Pattern

\[h = \sigma(Wx + b)\]

Where σ is a nonlinear activation function applied element-wise.

Common Activation Functions

ReLU (Rectified Linear Unit):

\[\text{ReLU}(x) = \max(0, x)\]
  • Simple and fast
  • Derivative: 1 if x > 0, else 0
  • Most popular for hidden layers

Sigmoid:

\[\sigma(x) = \frac{1}{1 + e^{-x}}\]
  • Outputs between 0 and 1
  • Historically popular, but less used now due to vanishing gradients: sigmoid's derivative is at most 0.25, so gradients passing through many sigmoid layers shrink exponentially (0.25 × 0.25 × ... → 0) and early layers learn very slowly

Tanh:

\[\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\]
  • Outputs between -1 and 1
  • Zero-centered (better than sigmoid)

GELU (Gaussian Error Linear Unit):

\[\text{GELU}(x) = x \cdot \Phi(x)\]

Where Φ is the standard normal CDF (cumulative distribution function)—the probability that a standard normal random variable is less than x. Approximation:

\[\text{GELU}(x) \approx 0.5x(1 + \tanh[\sqrt{2/\pi}(x + 0.044715x^3)])\]
  • Used in transformers
  • Smooth version of ReLU
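
For reference, here is a minimal pure-Python sketch of all four activations as scalar functions (helper names are our own; GELU uses the tanh approximation above):

import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def gelu(x):
    # Tanh approximation of x · Φ(x)
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x**3)))

for f in (relu, sigmoid, tanh, gelu):
    print(f.__name__, [round(f(v), 3) for v in (-2.0, 0.0, 2.0)])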

ReLU: A Closer Look

ReLU is piecewise linear:

  • For x ≤ 0: output = 0
  • For x > 0: output = x

This simple nonlinearity is enough to enable universal approximation.

%%{init: {'theme': 'neutral'}}%%
flowchart TB
    subgraph "Activation Function Comparison"
        direction LR

        subgraph ReLU["ReLU: max(0, x)"]
            R["Linear for x>0<br/>Zero for x≤0<br/>Fast, simple"]
        end

        subgraph Sigmoid["Sigmoid: 1/(1+e⁻ˣ)"]
            S["Output: (0, 1)<br/>S-shaped<br/>Vanishing gradients"]
        end

        subgraph Tanh["Tanh"]
            T["Output: (-1, 1)<br/>Zero-centered<br/>Better than sigmoid"]
        end

        subgraph GELU["GELU: x·Φ(x)"]
            G["Smooth ReLU<br/>Used in Transformers<br/>Modern choice"]
        end
    end

Why ReLU is preferred for hidden layers:

| Activation | Gradient range | Issue                           | Speed      |
|------------|----------------|---------------------------------|------------|
| ReLU       | 0 or 1         | Dead neurons (grad = 0 forever) | Fast       |
| Sigmoid    | (0, 0.25)      | Vanishing gradients             | Slow (exp) |
| Tanh       | (0, 1)         | Vanishing gradients             | Slow (exp) |
| GELU       | (0, 1+)        | No major issues                 | Medium     |

Derivative:

\[\frac{d}{dx}\text{ReLU}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x < 0 \\ \text{undefined} & \text{if } x = 0 \end{cases}\]

At x = 0, we conventionally use 0 as the derivative. (Technically, we're using a "subgradient"—a generalization of derivatives for non-smooth functions—but in practice, x = 0 happens rarely enough that this choice doesn't matter much.)

Building Deep Networks

Stacking Layers

A deep network is a composition of layers:

\[h_1 = \sigma(W_1 x + b_1)\]
\[h_2 = \sigma(W_2 h_1 + b_2)\]
\[h_3 = \sigma(W_3 h_2 + b_3)\]
\[\vdots\]
\[y = W_L h_{L-1} + b_L\]

Note: The last layer often has no activation (for regression) or a special activation (softmax for classification).

Why Depth Matters

Each layer can learn increasingly abstract features:

  • Layer 1: Low-level patterns (character combinations)
  • Layer 2: Mid-level patterns (syllables, common sequences)
  • Layer 3: High-level patterns (word-like structures)

Deep networks can express functions that would require exponentially wide shallow networks.

Width vs Depth Trade-off

| Architecture   | Advantages                        | Disadvantages                        |
|----------------|-----------------------------------|--------------------------------------|
| Wide + shallow | Easy to train, stable             | Many parameters, limited abstraction |
| Narrow + deep  | Parameter efficient, hierarchical | Harder to train, vanishing gradients |

In practice, moderate depth (2-4 layers for a character-level LM) works well.

The Universal Approximation Theorem

Statement

A feed-forward network with:

  • One hidden layer
  • Sufficient width (number of neurons)
  • Nonlinear activation

Can approximate any continuous function on a compact domain to arbitrary precision.

What This Means

Neural networks are universal function approximators. Given enough neurons, they can learn any reasonable input-output mapping.

What This Doesn't Mean

  • Doesn't say how many neurons needed (could be enormous)
  • Doesn't say the function is easy to find (optimization might fail)
  • Doesn't guarantee generalization (might overfit)

Intuition via ReLU

A ReLU network creates a piecewise linear function:

  • Each neuron contributes a "hinge" point
  • With enough hinges, any continuous curve can be approximated
  • More neurons = finer approximation
                      /\
                     /  \
        /\          /    \
       /  \        /      \___
    __/    \______/

Each change in slope corresponds to a neuron "turning on" or "off".
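
As an illustration (an example of our own, not the exercise below), three ReLU hinges are enough to build a triangular bump; summing shifted bumps like this approximates any continuous curve:

def relu(x):
    return max(0.0, x)

def bump(x):
    # Slope changes at x = 0, 1, 2: three neurons, three hinges
    return relu(x) - 2 * relu(x - 1) + relu(x - 2)

for x in [-1, 0, 0.5, 1, 1.5, 2, 3]:
    print(x, bump(x))
# Rises from 0 at x=0 to 1 at x=1, falls back to 0 at x=2, flat elsewhere.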

The Output Layer for Language Modeling

For language modeling, we need to output a probability distribution over the vocabulary.

The Softmax Function

Logits are the raw, unnormalized outputs of the network—real numbers that can be any value (positive, negative, or zero). The term comes from "log-odds" in statistics. We convert logits to probabilities using softmax.

Given logits z ∈ ℝ^|V|, softmax converts to probabilities:

\[\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{|V|} e^{z_j}}\]

Properties of Softmax

  1. Valid probability distribution:
\[\sum_i \text{softmax}(z)_i = 1\]
  2. All positive:
\[\text{softmax}(z)_i > 0 \quad \forall i\]
  3. Monotonic: Higher logit → higher probability
  4. Differentiable: Smooth gradients everywhere

The Complete Output Stage

\[\text{logits} = W_{\text{out}} h_L + b_{\text{out}}\]
\[P(\text{token} | \text{context}) = \text{softmax}(\text{logits})\]

Where:

  • h_L ∈ ℝ^h is the final hidden state
  • W_out ∈ ℝ^{|V|×h} projects to vocabulary size
  • logits ∈ ℝ^|V| are unnormalized scores
  • softmax normalizes to probabilities

Putting It Together: The Full Architecture

For a character-level language model:

flowchart LR
    subgraph Input["Input: Context Window"]
        C1["c₁"]
        C2["c₂"]
        C3["..."]
        C4["cₖ"]
    end

    subgraph Embed["Embedding Layer"]
        E1["e₁<br/>(d dims)"]
        E2["e₂"]
        E3["..."]
        E4["eₖ"]
    end

    subgraph Concat["Concatenation"]
        X["x<br/>(k×d dims)"]
    end

    subgraph Hidden["Hidden Layers"]
        H1["ReLU(W₁x + b₁)<br/>(h₁ dims)"]
        H2["ReLU(W₂h₁ + b₂)<br/>(h₂ dims)"]
    end

    subgraph Output["Output Layer"]
        L["Logits<br/>(|V| dims)"]
        S["Softmax"]
        P["P(next | context)"]
    end

    C1 --> E1
    C2 --> E2
    C3 --> E3
    C4 --> E4

    E1 --> X
    E2 --> X
    E3 --> X
    E4 --> X

    X --> H1 --> H2 --> L --> S --> P

Forward Pass

Input: context = [c_{t-k}, ..., c_{t-1}]  (k previous characters)

1. EMBED: For each position i:
   e_i = E[c_{t-k+i}]  ∈ ℝ^d

2. CONCATENATE:
   x = [e_0; e_1; ...; e_{k-1}]  ∈ ℝ^{k·d}

3. HIDDEN LAYER 1:
   h_1 = ReLU(W_1 x + b_1)  ∈ ℝ^{h_1}

4. HIDDEN LAYER 2:
   h_2 = ReLU(W_2 h_1 + b_2)  ∈ ℝ^{h_2}

5. OUTPUT:
   logits = W_out h_2 + b_out  ∈ ℝ^{|V|}

6. SOFTMAX:
   P(c_t | context) = softmax(logits)  ∈ ℝ^{|V|}

Parameter Count

For context length k, embedding dim d, hidden sizes h₁ and h₂, vocabulary |V|:

| Component | Parameters                                       |
|-----------|--------------------------------------------------|
| Embedding | \|V\| × d                                        |
| Layer 1   | (k·d) × h₁ + h₁                                  |
| Layer 2   | h₁ × h₂ + h₂                                     |
| Output    | h₂ × \|V\| + \|V\|                               |
| Total     | \|V\|d + kdh₁ + h₁ + h₁h₂ + h₂ + h₂\|V\| + \|V\| |

Example: |V| = 80, d = 32, k = 8, h₁ = h₂ = 128

  • Embedding: 80 × 32 = 2,560
  • Layer 1: 256 × 128 + 128 = 32,896
  • Layer 2: 128 × 128 + 128 = 16,512
  • Output: 128 × 80 + 80 = 10,320

Total: 62,288 ≈ 62,000 parameters

Compare to a count-based model conditioning on 5 previous characters: 80⁵ contexts × 80 next characters = 80⁶ ≈ 262 billion table entries. The neural model wins by a factor of about four million!
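
As a sanity check, here is a small sketch (the helper param_count is our own) that reproduces this count:

def param_count(V, d, k, hidden_sizes):
    """Count parameters for the embedding + MLP architecture above."""
    total = V * d                # embedding table
    in_dim = k * d               # concatenated context vector
    for h in hidden_sizes:
        total += in_dim * h + h  # weights + biases
        in_dim = h
    total += in_dim * V + V      # output projection
    return total

print(param_count(V=80, d=32, k=8, hidden_sizes=[128, 128]))  # 62288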

Implementing a Layer

Using our Stage 2 autograd:

import random

from autograd import Value  # the Stage 2 Value class; adjust the import to your module

class Linear:
    """A linear layer: y = Wx + b"""

    def __init__(self, in_features, out_features):
        # Xavier initialization: scale weights so activation variance
        # stays roughly constant across layers
        scale = (2.0 / (in_features + out_features)) ** 0.5
        self.W = [[Value(random.gauss(0, scale))
                   for _ in range(in_features)]
                  for _ in range(out_features)]
        self.b = [Value(0.0) for _ in range(out_features)]

    def __call__(self, x):
        """x is a list of Values, returns list of Values."""
        out = []
        for i in range(len(self.b)):
            # Compute W[i] · x + b[i]
            activation = self.b[i]
            for j in range(len(x)):
                activation = activation + self.W[i][j] * x[j]
            out.append(activation)
        return out

    def parameters(self):
        return [w for row in self.W for w in row] + self.b

ReLU Layer

def relu(x):
    """Apply ReLU to a list of Values."""
    return [v.relu() for v in x]

Softmax Layer

def softmax(logits):
    """Convert logits to probabilities."""
    # For numerical stability, subtract max
    max_logit = max(v.data for v in logits)
    exp_logits = [(v - max_logit).exp() for v in logits]
    sum_exp = sum(exp_logits, Value(0.0))
    return [e / sum_exp for e in exp_logits]
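
To tie the pieces together, here is a minimal sketch (an MLP wrapper of our own, assuming the Linear, relu, and softmax definitions above) implementing steps 3-6 of the forward pass; embedding and concatenation happen outside the class:

class MLP:
    """Two hidden layers + output projection, as in the forward pass above."""

    def __init__(self, in_dim, h1, h2, vocab_size):
        self.layer1 = Linear(in_dim, h1)
        self.layer2 = Linear(h1, h2)
        self.out = Linear(h2, vocab_size)

    def __call__(self, x):
        """x: concatenated embedding vector (list of Values)."""
        h1 = relu(self.layer1(x))
        h2 = relu(self.layer2(h1))
        logits = self.out(h2)
        return softmax(logits)

    def parameters(self):
        return (self.layer1.parameters()
                + self.layer2.parameters()
                + self.out.parameters())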

Numerical Stability: The Log-Sum-Exp Trick

The Problem

Softmax involves exponentials. For large logits:

\[e^{100} \approx 2.7 \times 10^{43} \quad \text{(overflow!)}\]

\[e^{-100} \approx 3.7 \times 10^{-44} \quad \text{(underflow!)}\]

The Solution

Softmax is invariant to constant shifts:

\[\text{softmax}(z - c) = \text{softmax}(z)\]

Proof:

\[\frac{e^{z_i - c}}{\sum_j e^{z_j - c}} = \frac{e^{z_i} e^{-c}}{\sum_j e^{z_j} e^{-c}} = \frac{e^{z_i}}{\sum_j e^{z_j}}\]

So we compute:

\[\text{softmax}(z)_i = \frac{e^{z_i - \max(z)}}{\sum_j e^{z_j - \max(z)}}\]

After subtracting max, all exponents are ≤ 0, preventing overflow.
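
A quick demonstration of why the shift matters (an example of our own; note that Python's math.exp raises OverflowError rather than returning inf):

import math

logits = [800.0, 802.0, 804.0]

# Naive: math.exp(800) overflows a float64
try:
    naive = [math.exp(z) for z in logits]
except OverflowError as e:
    print("naive softmax failed:", e)

# Stable: subtract the max first, so all exponents are <= 0
m = max(logits)
exps = [math.exp(z - m) for z in logits]
total = sum(exps)
print([e / total for e in exps])  # ~[0.016, 0.117, 0.867]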

For Log Probabilities

We often need log softmax:

\[\log \text{softmax}(z)_i = z_i - \log\sum_j e^{z_j}\]

The log-sum-exp (LSE) function:

\[\text{LSE}(z) = \log\sum_j e^{z_j} = \max(z) + \log\sum_j e^{z_j - \max(z)}\]

This is numerically stable.
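
Here is a minimal sketch of a stable logsumexp and log-softmax in pure Python (helper names are our own):

import math

def logsumexp(z):
    """Stable log(sum(exp(z))) via the max-shift identity."""
    m = max(z)
    return m + math.log(sum(math.exp(zi - m) for zi in z))

def log_softmax(z):
    """log softmax(z)_i = z_i - LSE(z), computed without overflow."""
    lse = logsumexp(z)
    return [zi - lse for zi in z]

print(log_softmax([1000.0, 1001.0, 1002.0]))  # ~[-2.408, -1.408, -0.408]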

Information Flow in the Network

Forward Pass

Information flows from input to output:

Context → Embeddings → Concatenate → Hidden₁ → Hidden₂ → Logits → Softmax

Each layer transforms representations, adding capacity to model complex patterns.

Backward Pass

Gradients flow from output to input (via backpropagation):

Loss → ∂/∂Softmax → ∂/∂Logits → ∂/∂Hidden₂ → ∂/∂Hidden₁ → ∂/∂Embeddings

Every parameter receives a gradient signal indicating how to change to reduce loss.

The Chain Rule at Work

For a parameter W₁[i,j] in the first layer:

\[\frac{\partial L}{\partial W_1[i,j]} = \frac{\partial L}{\partial h_1[i]} \cdot \frac{\partial h_1[i]}{\partial W_1[i,j]}\]

The gradient "chains" through all intermediate layers—exactly what we built in Stage 2.

Summary

| Concept                 | Description                                                     |
|-------------------------|-----------------------------------------------------------------|
| Linear layer            | y = Wx + b, an affine transformation                            |
| Activation              | Nonlinear function (ReLU, tanh, GELU, etc.)                     |
| Deep network            | Composition of linear layers and activations                    |
| Universal approximation | Any continuous function on a compact domain can be approximated |
| Softmax                 | Converts logits to a probability distribution                   |
| Log-sum-exp trick       | Numerically stable softmax computation                          |

Key insight: A neural network is a composition of simple functions (linear + nonlinear). Each layer transforms representations, and the composition can approximate any function. Training adjusts the parameters so this composition predicts well.

Exercises

  1. Linearity proof: Show that composing two linear functions y = Ax + a and z = By + b gives z = Cx + c where C = BA and c = Ba + b.

  2. ReLU universality: Sketch how a ReLU network with 4 neurons could approximate f(x) = |x| on [-1, 1].

  3. Parameter counting: For a network with input 256, hidden layers [512, 256, 128], and output 100, calculate the total number of parameters.

  4. Softmax verification: Verify that softmax([2, 1, 0]) sums to 1 by computing it explicitly.

  5. Stability test: Compute softmax([1000, 1001, 1002]) directly and with the max-subtraction trick. What happens in each case?

What's Next

We have the architecture. But how do we train it?

In Section 3.4, we'll derive the cross-entropy loss function—the objective that tells the network how wrong its predictions are and how to improve.