Section 3.3: Feed-Forward Neural Networks¶
We can now represent tokens as continuous vectors. The next step: build a function that transforms these vectors into predictions.
Feed-forward neural networks (also called multi-layer perceptrons or MLPs) are the simplest such functions. They're the building blocks of all modern deep learning.
This section derives neural networks from first principles and explains what makes them powerful.
The Basic Unit: A Linear Layer¶
Mathematical Definition¶
A linear layer transforms an input vector x ∈ ℝⁿ into an output vector y ∈ ℝᵐ:
y = Wx + b
Where:
- W ∈ ℝᵐˣⁿ is the weight matrix
- b ∈ ℝᵐ is the bias vector
- x ∈ ℝⁿ is the input
- y ∈ ℝᵐ is the output
What It Computes¶
Each output component yᵢ is a weighted sum of inputs plus a bias:
yᵢ = Σⱼ Wᵢⱼ xⱼ + bᵢ
This is a linear combination of the inputs.
Geometric Interpretation¶
A linear layer performs:
- Rotation/scaling: W rotates and scales the input space
- Translation: b shifts the result
It maps the input space to a new space with potentially different dimensionality.
Example¶
Input: x = [2, 3] (n = 2). Output dimension: m = 3. With a weight matrix W ∈ ℝ³ˣ² and bias b ∈ ℝ³, the layer produces y = Wx + b ∈ ℝ³.
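A minimal numeric sketch of this example; the entries of W and b below are arbitrary values chosen only for illustration:

# Worked example: a linear layer mapping R^2 -> R^3.
# W (3x2) and b (3) are illustrative values, not anything special.
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
b = [0.5, -1.0, 0.0]
x = [2.0, 3.0]

# y_i = sum_j W[i][j] * x[j] + b[i]
y = [sum(W[i][j] * x[j] for j in range(len(x))) + b[i]
     for i in range(len(b))]

print(y)  # [2.5, 2.0, 5.0]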
The Problem: Linear Functions Are Limited¶
Composition of Linear Functions is Linear¶
If f(x) = W₁x + b₁ and g(x) = W₂x + b₂, then:
g(f(x)) = W₂(W₁x + b₁) + b₂ = W₂W₁x + W₂b₁ + b₂
This is still a linear function! Let W' = W₂W₁ and b' = W₂b₁ + b₂:
g(f(x)) = W'x + b'
No matter how many linear layers we stack, we get a linear function.
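A quick numeric check of this collapse, with small hand-picked matrices and throwaway helper functions:

# Two linear layers f(x) = W1 x + b1 and g(y) = W2 y + b2, both 2x2.
W1, b1 = [[1.0, 2.0], [0.0, 1.0]], [1.0, -1.0]
W2, b2 = [[0.5, 0.0], [1.0, 1.0]], [0.0, 2.0]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def matmul(A, B):
    return [[sum(A[i][m] * B[m][j] for m in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add(u, v):
    return [a + c for a, c in zip(u, v)]

x = [3.0, 4.0]

# Apply the two layers one after the other...
stacked = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# ...and compare with the single collapsed layer W' = W2 W1, b' = W2 b1 + b2.
W_prime = matmul(W2, W1)
b_prime = add(matvec(W2, b1), b2)
collapsed = add(matvec(W_prime, x), b_prime)

print(stacked, collapsed)  # identical: stacking linear layers adds no power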
Why This Is a Problem¶
Linear functions can only:
- Draw straight decision boundaries
- Compute linear combinations
They cannot:
- Compute XOR (or any non-linearly-separable function)
- Model complex patterns
- Learn hierarchical features
The Solution: Nonlinear Activation Functions¶
To break linearity, we apply a nonlinear function after each linear layer.
The Pattern¶
y = σ(Wx + b)
Where σ is a nonlinear activation function applied element-wise.
Common Activation Functions¶
ReLU (Rectified Linear Unit):
ReLU(x) = max(0, x)
- Simple and fast
- Derivative: 1 if x > 0, else 0
- Most popular for hidden layers
Sigmoid:
σ(x) = 1 / (1 + e⁻ˣ)
- Outputs between 0 and 1
- Historically popular, less used now due to vanishing gradients: sigmoid's derivative is at most 0.25, so when gradients pass through many sigmoid layers, they shrink exponentially (0.25 × 0.25 × ... → 0), making early layers learn very slowly
Tanh:
tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)
- Outputs between -1 and 1
- Zero-centered (better than sigmoid)
GELU (Gaussian Error Linear Unit):
GELU(x) = x · Φ(x)
Where Φ is the standard normal CDF (cumulative distribution function): the probability that a standard normal random variable is less than x. A common approximation (compared numerically in the sketch after the bullets below):
GELU(x) ≈ 0.5x · (1 + tanh(√(2/π) · (x + 0.044715x³)))
- Used in transformers
- Smooth version of ReLU
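To make the approximation concrete, here is a small standalone sketch comparing the exact GELU (Φ computed with the error function) against the tanh approximation; the sample points are arbitrary:

import math

def gelu_exact(x):
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))), the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Tanh approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

for x in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(f"{x:+.1f}  exact={gelu_exact(x):+.4f}  approx={gelu_tanh(x):+.4f}")
# The two columns agree to roughly three decimal places.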
ReLU: A Closer Look¶
ReLU is piecewise linear:
- For x ≤ 0: output = 0
- For x > 0: output = x
This simple nonlinearity is enough to enable universal approximation.
%%{init: {'theme': 'neutral'}}%%
flowchart TB
subgraph "Activation Function Comparison"
direction LR
subgraph ReLU["ReLU: max(0, x)"]
R["Linear for x>0<br/>Zero for x≤0<br/>Fast, simple"]
end
subgraph Sigmoid["Sigmoid: 1/(1+e⁻ˣ)"]
S["Output: (0, 1)<br/>S-shaped<br/>Vanishing gradients"]
end
subgraph Tanh["Tanh"]
T["Output: (-1, 1)<br/>Zero-centered<br/>Better than sigmoid"]
end
subgraph GELU["GELU: x·Φ(x)"]
G["Smooth ReLU<br/>Used in Transformers<br/>Modern choice"]
end
end
Why ReLU is preferred for hidden layers:
| Activation | Gradient range | Issue | Speed |
|---|---|---|---|
| ReLU | 0 or 1 | Dead neurons (grad=0 forever) | Fast |
| Sigmoid | (0, 0.25] | Vanishing gradients | Slow (exp) |
| Tanh | (0, 1] | Vanishing gradients | Slow (exp) |
| GELU | ≈ (−0.13, 1.13) | None major | Medium |
Derivative:
ReLU′(x) = 1 if x > 0, else 0
At x = 0, we conventionally use 0 as the derivative. (Technically, we're using a "subgradient"—a generalization of derivatives for non-smooth functions—but in practice, x = 0 happens rarely enough that this choice doesn't matter much.)
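In plain Python floats (no autograd), the function and the derivative convention look like this:

def relu(x):
    return x if x > 0 else 0.0

def relu_grad(x):
    # Convention: derivative taken to be 0 at x = 0
    return 1.0 if x > 0 else 0.0

print([relu(v) for v in [-2.0, -0.5, 0.0, 1.5]])       # [0.0, 0.0, 0.0, 1.5]
print([relu_grad(v) for v in [-2.0, -0.5, 0.0, 1.5]])  # [0.0, 0.0, 0.0, 1.0]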
Building Deep Networks¶
Stacking Layers¶
A deep network is a composition of layers:
h₁ = σ(W₁x + b₁)
h₂ = σ(W₂h₁ + b₂)
...
y = W_L h_{L-1} + b_L
Note: The last layer often has no activation (for regression) or a special activation (softmax for classification).
Why Depth Matters¶
Each layer can learn increasingly abstract features:
- Layer 1: Low-level patterns (character combinations)
- Layer 2: Mid-level patterns (syllables, common sequences)
- Layer 3: High-level patterns (word-like structures)
Deep networks can express functions that a shallow network would need exponentially many neurons to represent.
Width vs Depth Trade-off¶
| Architecture | Advantages | Disadvantages |
|---|---|---|
| Wide + Shallow | Easy to train, stable | Many parameters, limited abstraction |
| Narrow + Deep | Parameter efficient, hierarchical | Harder to train, vanishing gradients |
In practice, moderate depth (2-4 layers for a character-level LM) works well.
The Universal Approximation Theorem¶
Statement¶
A feed-forward network with:
- One hidden layer
- Sufficient width (number of neurons)
- Nonlinear activation
Can approximate any continuous function on a compact domain to arbitrary precision.
What This Means¶
Neural networks are universal function approximators. Given enough neurons, they can learn any reasonable input-output mapping.
What This Doesn't Mean¶
- Doesn't say how many neurons needed (could be enormous)
- Doesn't say the function is easy to find (optimization might fail)
- Doesn't guarantee generalization (might overfit)
Intuition via ReLU¶
A ReLU network creates a piecewise linear function:
- Each neuron contributes a "hinge" point
- With enough hinges, any continuous curve can be approximated
- More neurons = finer approximation
Each change in slope corresponds to a neuron "turning on" or "off".
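A small sketch of the idea: each shifted ReLU contributes one hinge, and a weighted sum of them is a piecewise-linear curve whose slope changes at those hinges. The weights and hinge locations below are arbitrary illustrative choices:

def relu(x):
    return max(0.0, x)

def piecewise(x):
    # Slope changes at x = 0, 1, 2 -- one hinge per ReLU term.
    return 1.0 * relu(x) - 2.0 * relu(x - 1.0) + 1.5 * relu(x - 2.0)

for x in [-1.0, 0.5, 1.5, 3.0]:
    print(x, piecewise(x))
# Slopes between hinges: 0, then +1, then -1, then +0.5.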
The Output Layer for Language Modeling¶
For language modeling, we need to output a probability distribution over the vocabulary.
The Softmax Function¶
Logits are the raw, unnormalized outputs of the network—real numbers that can be any value (positive, negative, or zero). The term comes from "log-odds" in statistics. We convert logits to probabilities using softmax.
Given logits z ∈ ℝ^|V|, softmax converts to probabilities:
softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)
Properties of Softmax¶
- Valid probability distribution: Σᵢ softmax(z)ᵢ = 1
- All positive: softmax(z)ᵢ > 0 for every i
- Monotonic: Higher logit → higher probability
- Differentiable: Smooth gradients everywhere
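A quick numeric check of these properties using plain Python floats (the logits are arbitrary illustrative values; a numerically stable implementation appears later in this section):

import math

def softmax_naive(z):
    # Direct definition of softmax, fine for small logits.
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax_naive([3.0, 1.0, -2.0])
print(p)        # all positive, and the largest logit gets the largest probability
print(sum(p))   # 1.0 (up to floating-point rounding)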
The Complete Output Stage¶
logits = W_out h_L + b_out
P(next | context) = softmax(logits)
Where:
- h_L ∈ ℝ^h is the final hidden state
- W_out ∈ ℝ^{|V|×h} projects to vocabulary size, with bias b_out ∈ ℝ^{|V|}
- logits ∈ ℝ^|V| are unnormalized scores
- softmax normalizes to probabilities
Putting It Together: The Full Architecture¶
For a character-level language model:
flowchart LR
subgraph Input["Input: Context Window"]
C1["c₁"]
C2["c₂"]
C3["..."]
C4["cₖ"]
end
subgraph Embed["Embedding Layer"]
E1["e₁<br/>(d dims)"]
E2["e₂"]
E3["..."]
E4["eₖ"]
end
subgraph Concat["Concatenation"]
X["x<br/>(k×d dims)"]
end
subgraph Hidden["Hidden Layers"]
H1["ReLU(W₁x + b₁)<br/>(h₁ dims)"]
H2["ReLU(W₂h₁ + b₂)<br/>(h₂ dims)"]
end
subgraph Output["Output Layer"]
L["Logits<br/>(|V| dims)"]
S["Softmax"]
P["P(next | context)"]
end
C1 --> E1
C2 --> E2
C3 --> E3
C4 --> E4
E1 --> X
E2 --> X
E3 --> X
E4 --> X
X --> H1 --> H2 --> L --> S --> P
Forward Pass¶
Input: context = [c_{t-k}, ..., c_{t-1}] (k previous characters)
1. EMBED: For each position i:
e_i = E[c_{t-k+i}] ∈ ℝ^d
2. CONCATENATE:
x = [e_0; e_1; ...; e_{k-1}] ∈ ℝ^{k·d}
3. HIDDEN LAYER 1:
h_1 = ReLU(W_1 x + b_1) ∈ ℝ^{h_1}
4. HIDDEN LAYER 2:
h_2 = ReLU(W_2 h_1 + b_2) ∈ ℝ^{h_2}
5. OUTPUT:
logits = W_out h_2 + b_out ∈ ℝ^{|V|}
6. SOFTMAX:
P(c_t | context) = softmax(logits) ∈ ℝ^{|V|}
Parameter Count¶
For context length k, embedding dim d, hidden sizes h₁ and h₂, vocabulary |V|:
| Component | Parameters |
|---|---|
| Embedding | |V| × d |
| Layer 1 | (k·d) × h₁ + h₁ |
| Layer 2 | h₁ × h₂ + h₂ |
| Output | h₂ × |V| + |V| |
| Total | |V|d + kdh₁ + h₁ + h₁h₂ + h₂ + h₂|V| + |V| |
Example: |V| = 80, d = 32, k = 8, h₁ = h₂ = 128
- Embedding: 80 × 32 = 2,560
- Layer 1: 256 × 128 + 128 = 32,896
- Layer 2: 128 × 128 + 128 = 16,512
- Output: 128 × 80 + 80 = 10,320
Total: ~62,000 parameters
Compare to a count-based model conditioning on 5 previous characters: \(80^6\) ≈ 262 billion table entries (80⁵ contexts × 80 next characters). The neural network wins by a factor of about 4 million!
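A quick sanity check of these counts, using the formulas from the table and the sizes from the example above:

V, d, k, h1, h2 = 80, 32, 8, 128, 128

embedding = V * d                 # 2,560
layer1    = (k * d) * h1 + h1     # 32,896
layer2    = h1 * h2 + h2          # 16,512
output    = h2 * V + V            # 10,320

print(embedding + layer1 + layer2 + output)   # 62,288 ≈ 62,000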
Implementing a Layer¶
Using our Stage 2 autograd:
import random
# Value is the autograd scalar class built in Stage 2.

class Linear:
    """A linear layer: y = Wx + b"""

    def __init__(self, in_features, out_features):
        # Xavier initialization
        scale = (2.0 / (in_features + out_features)) ** 0.5
        self.W = [[Value(random.gauss(0, scale))
                   for _ in range(in_features)]
                  for _ in range(out_features)]
        self.b = [Value(0.0) for _ in range(out_features)]

    def __call__(self, x):
        """x is a list of Values, returns list of Values."""
        out = []
        for i in range(len(self.b)):
            # Compute W[i] · x + b[i]
            activation = self.b[i]
            for j in range(len(x)):
                activation = activation + self.W[i][j] * x[j]
            out.append(activation)
        return out

    def parameters(self):
        return [w for row in self.W for w in row] + self.b
ReLU Layer¶
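A minimal sketch of the corresponding activation layer, assuming the Stage 2 Value exposes a relu() operation (if yours names it differently, adapt accordingly); it holds no parameters of its own:

class ReLU:
    """Element-wise ReLU activation: max(0, x)."""

    def __call__(self, x):
        # Assumes Value.relu() exists in the Stage 2 autograd; its backward
        # rule passes gradient 1 where the input is positive, else 0.
        return [v.relu() for v in x]

    def parameters(self):
        return []  # no trainable parameters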
Softmax Layer¶
def softmax(logits):
    """Convert logits to probabilities."""
    # For numerical stability, subtract max
    max_logit = max(v.data for v in logits)
    exp_logits = [(v - max_logit).exp() for v in logits]
    sum_exp = sum(exp_logits, Value(0.0))
    return [e / sum_exp for e in exp_logits]
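Chaining these pieces reproduces the forward pass described earlier. This is a sketch under two assumptions: the embedding table is stored as a plain list of Value vectors indexed by character id, and the sizes match the parameter-count example above:

import random

# Sizes from the earlier example: |V| = 80, d = 32, k = 8, h1 = h2 = 128.
V, d, k, h1, h2 = 80, 32, 8, 128, 128

# Embedding table: one d-dimensional vector of Values per character id.
E = [[Value(random.gauss(0, 0.1)) for _ in range(d)] for _ in range(V)]

layer1 = Linear(k * d, h1)
layer2 = Linear(h1, h2)
out    = Linear(h2, V)
act    = ReLU()

def forward(context_ids):
    """context_ids: list of k character ids. Returns P(next char) as Values."""
    x = [e for cid in context_ids for e in E[cid]]   # embed, then concatenate
    h = act(layer1(x))
    h = act(layer2(h))
    return softmax(out(h))

probs = forward([random.randrange(V) for _ in range(k)])
print(sum(p.data for p in probs))   # ~1.0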
Numerical Stability: The Log-Sum-Exp Trick¶
The Problem¶
Softmax involves exponentials. For large logits:
- exp(1000) = inf in floating point (overflow!)
- exp(−1000) = 0 in floating point (underflow!)
The Solution¶
Softmax is invariant to constant shifts:
softmax(z) = softmax(z − c) for any constant c
Proof:
softmax(z − c)ᵢ = exp(zᵢ − c) / Σⱼ exp(zⱼ − c) = [exp(−c) · exp(zᵢ)] / [exp(−c) · Σⱼ exp(zⱼ)] = softmax(z)ᵢ
So we compute:
softmax(z)ᵢ = exp(zᵢ − max(z)) / Σⱼ exp(zⱼ − max(z))
After subtracting max, all exponents are ≤ 0, preventing overflow.
For Log Probabilities¶
We often need log softmax:
log softmax(z)ᵢ = zᵢ − log Σⱼ exp(zⱼ) = zᵢ − LSE(z)
The log-sum-exp (LSE) function:
LSE(z) = max(z) + log Σⱼ exp(zⱼ − max(z))
This is numerically stable.
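A plain-float sketch of a numerically stable log softmax using the LSE trick (standalone, no autograd, for illustration; the logits are arbitrary large values):

import math

def log_softmax(z):
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))   # log-sum-exp
    return [v - lse for v in z]

# Logits this large would overflow a naive exp(); the shifted form is fine.
print(log_softmax([500.0, 501.0, 503.0]))
# ≈ [-3.17, -2.17, -0.17]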
Information Flow in the Network¶
Forward Pass¶
Information flows from input to output:
x → h₁ → h₂ → logits → probabilities
Each layer transforms representations, adding capacity to model complex patterns.
Backward Pass¶
Gradients flow from output to input (via backpropagation):
loss → logits → h₂ → h₁ → x (and into every W and b along the way)
Every parameter receives a gradient signal indicating how to change to reduce loss.
The Chain Rule at Work¶
For a parameter W₁[i,j] in the first layer:
∂L/∂W₁[i,j] = ∂L/∂logits · ∂logits/∂h₂ · ∂h₂/∂h₁ · ∂h₁/∂W₁[i,j]
The gradient "chains" through all intermediate layers—exactly what we built in Stage 2.
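To see this concretely, here is a minimal sketch using the classes above; it assumes the Stage 2 Value records gradients in a .grad field populated by a .backward() call (adjust the names if your autograd differs):

# Tiny network: 2 inputs -> 3 hidden units (ReLU) -> 2 logits -> softmax.
lin1, lin2 = Linear(2, 3), Linear(3, 2)
act = ReLU()

x = [Value(0.5), Value(-1.2)]
probs = softmax(lin2(act(lin1(x))))

# Backpropagate from one output probability (a stand-in for the real loss,
# which Section 3.4 introduces).
probs[0].backward()

print(lin1.W[0][0].grad)   # generally nonzero: the chain reached the first layer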
Summary¶
| Concept | Description |
|---|---|
| Linear layer | y = Wx + b, affine transformation |
| Activation | Nonlinear function (ReLU, tanh, etc.) |
| Deep network | Composition of linear + activation |
| Universal approximation | Any continuous function can be approximated, given enough width |
| Softmax | Converts logits to probability distribution |
| Log-sum-exp trick | Numerically stable softmax computation |
Key insight: A neural network is a composition of simple functions (linear + nonlinear). Each layer transforms representations, and the composition can approximate any function. Training adjusts the parameters so this composition predicts well.
Exercises¶
1. Linearity proof: Show that composing two linear functions y = Ax + a and z = By + b gives z = Cx + c where C = BA and c = Ba + b.
2. ReLU universality: Sketch how a ReLU network with 4 neurons could approximate f(x) = |x| on [-1, 1].
3. Parameter counting: For a network with input 256, hidden layers [512, 256, 128], and output 100, calculate the total number of parameters.
4. Softmax verification: Verify that softmax([2, 1, 0]) sums to 1 by computing it explicitly.
5. Stability test: Compute softmax([1000, 1001, 1002]) directly and with the max-subtraction trick. What happens in each case?
What's Next¶
We have the architecture. But how do we train it?
In Section 3.4, we'll derive the cross-entropy loss function—the objective that tells the network how wrong its predictions are and how to improve.