flowchart LR
T["Tokens"] --> E["Embed + Position"]
E --> L1["Layer 1"]
L1 --> L2["Layer 2"]
L2 --> Ldots["..."]
Ldots --> LL["Layer L"]
LL --> U["Unembed"]
U --> S["Softmax"]
S --> P["Predictions"]
5 Transformers as Matrix Multiplication Machines
The computational substrate of modern AI
In this chapter, we'll cover:
- How transformers convert text to numbers (tokenization and embedding)
- The attention mechanism as “soft dictionary lookup”
- How MLPs process and transform information
- Why understanding the architecture matters for interpretability
Recommended background: Basic linear algebra (matrix multiplication, vectors). No prior deep learning knowledge required—we build from first principles.
From Chapter 1, recall:
- Why do we want to reverse-engineer neural networks? (Scientific understanding, safety, capability improvement)
- What makes neural networks “black boxes”? (We have the weights but don’t understand the algorithms)
- What’s the goal of mechanistic interpretability? (Find the how behind the what)
5.1 The Machine We’re Trying to Understand
In the previous chapter, we saw a neural network discover Fourier transforms for modular arithmetic—an algorithm no one taught it. We established that mechanistic interpretability is the project of reverse engineering such discoveries: finding the how behind the what.
Now we ask: What exactly are we reverse-engineering? What happens when you feed text into GPT-4? What operations transform “The capital of France is” into a prediction of “Paris”?
The answer is surprisingly simple, and surprisingly important: transformers are matrix multiplication machines.
This might sound reductive. A system that writes poetry, proves theorems, and carries on conversations—surely it’s doing something more sophisticated than multiplying matrices?
Yes and no. The sophistication lies not in the operations themselves, but in what those operations learn to represent. The transformer architecture provides a computational fabric—a substrate of linear algebra with a few nonlinearities—and training weaves complex algorithms into that fabric. Understanding the fabric is the first step to understanding what’s woven into it.
5.2 The Big Picture
Here’s the transformer forward pass in one sentence:
A transformer takes a sequence of tokens, converts them to vectors, repeatedly applies attention (information routing) and MLPs (information processing), and produces a probability distribution over next tokens.
Every step in this process is built from the same primitive: matrix-vector multiplication. Take a vector, multiply it by a matrix, get a new vector. That’s it. The magic lies in what the matrices learn to encode during training.
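That primitive, in isolation (sizes arbitrary; in a real model the matrix is learned during training rather than random):

import torch

x = torch.randn(768)        # a vector: one token's current representation
W = torch.randn(768, 768)   # a matrix: a learned transformation (random here)
y = x @ W                   # a new vector
print(y.shape)              # torch.Size([768])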
Let’s walk through each component.
5.3 From Text to Numbers: Token Embedding
Transformers don’t operate on text directly. They operate on numbers—specifically, vectors of floating-point numbers.
The first step is tokenization: breaking text into discrete units called tokens. These might be words, subwords, or even individual characters, depending on the tokenizer. “Understanding” might become one token; “transformers” might become [“transform”, “ers”].
Each token has an ID—an index in a vocabulary of, say, 50,000 possible tokens. But an index isn’t useful for computation. We need something richer.
Enter the embedding matrix. This is a giant lookup table: for each of the 50,000 token IDs, it stores a corresponding vector of, say, 768 dimensions. Token ID 4523 maps to a specific 768-dimensional vector. Token ID 8901 maps to a different one.
Token: "Paris"
Token ID: 4523
Embedding: [0.23, -0.87, 0.45, ..., 0.12] (768 numbers)
Looking up a token embedding is equivalent to multiplying a one-hot vector by the embedding matrix. If we represent token ID 4523 as a vector with a 1 in position 4523 and 0s elsewhere, then multiplying by the embedding matrix extracts exactly row 4523.
After embedding, we also add positional information. The token “Paris” in position 5 gets a different representation than “Paris” in position 20. This is typically done by adding a position-specific vector to the token embedding.
The result: each token becomes a vector, and the entire input sequence becomes a matrix—one row per token position.
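A minimal sketch of these first steps, including the one-hot equivalence described above (vocabulary size, width, and learned positional embeddings are illustrative assumptions, not a specific model's configuration):

import torch

vocab_size, d_model, max_len = 50_000, 768, 1024
W_E = torch.randn(vocab_size, d_model)    # embedding matrix (learned in practice)
W_pos = torch.randn(max_len, d_model)     # positional embeddings (learned in practice)

token_ids = torch.tensor([4523, 8901, 17, 3056])   # made-up token IDs for a four-token prompt
x = W_E[token_ids]                        # lookup: one row per token, shape (4, 768)

# Equivalent to multiplying one-hot vectors by the embedding matrix
one_hot = torch.nn.functional.one_hot(token_ids, vocab_size).float()
assert torch.allclose(one_hot @ W_E, x)

# Add position-specific vectors, so the same token differs by position
x = x + W_pos[: len(token_ids)]
print(x.shape)                            # torch.Size([4, 768])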
5.4 Attention: Soft Dictionary Lookup
Attention is the signature operation of transformers, and it’s where things get interesting.
The core insight is that attention performs soft dictionary lookup. In a regular dictionary, you have a key and you retrieve exactly one value. In attention, you have a query and you retrieve a weighted blend of all values, where the weights depend on how well each key matches your query.
Here’s how it works:
5.4.1 Queries, Keys, and Values
For each token position, we compute three vectors:
- Query (Q): What am I looking for?
- Key (K): What do I contain?
- Value (V): What information do I provide?
These are computed by—you guessed it—multiplying the token’s embedding by learned matrices:
Q = embedding × W_Q
K = embedding × W_K
V = embedding × W_V
Three matrix multiplications, three new vectors.
5.4.2 Computing Attention Weights
Now we compare each query to all keys. How similar is position 5’s query to position 3’s key? We measure this with a dot product:
attention_score = Q_5 · K_3
A high dot product means “position 5 is looking for something that position 3 has.” We compute this for all pairs of positions, giving us a matrix of attention scores.
We then apply softmax to turn these scores into probabilities—weights that sum to 1. This is one of the few nonlinearities in the transformer. It’s crucial: without it, everything would collapse into a single linear operation.
5.4.3 Retrieving Information
Finally, we use these weights to compute a weighted average of all values:
output_5 = 0.7 × V_3 + 0.2 × V_7 + 0.1 × V_1 + ...
Position 5 “attends to” position 3 (weight 0.7), a bit to position 7, a bit to position 1, and so on. The output is a blend of information from across the sequence.
Here’s the complete attention computation in Python. This is the core of what transformers do:
import torch
import torch.nn.functional as F
# Example: 4 tokens, each embedded as 8-dimensional vector
seq_len, d_model = 4, 8
x = torch.randn(seq_len, d_model) # Input embeddings
# Learned weight matrices (normally trained, here random)
d_head = 4 # Dimension of Q, K, V per head
W_Q = torch.randn(d_model, d_head)
W_K = torch.randn(d_model, d_head)
W_V = torch.randn(d_model, d_head)
# Step 1: Compute queries, keys, values
Q = x @ W_Q # Shape: (seq_len, d_head)
K = x @ W_K
V = x @ W_V
# Step 2: Compute attention scores (Q·K^T)
scores = Q @ K.T # Shape: (seq_len, seq_len)
scores = scores / (d_head ** 0.5) # Scale by sqrt(d_head)
# Step 3: Apply softmax to get attention weights
attn_weights = F.softmax(scores, dim=-1) # Each row sums to 1
# Step 4: Weighted sum of values
output = attn_weights @ V # Shape: (seq_len, d_head)
# attn_weights[i, j] = how much position i attends to position j
print(f"Attention weights:\n{attn_weights}")
Key insight: Every step is matrix multiplication except softmax, which is the essential nonlinearity. Without softmax, the whole thing would collapse to a single linear transformation.
Before continuing, ask yourself: Why does attention need three separate projections (Q, K, V)? What would happen if we used the same matrix for queries and keys?
Hint: Think about what a position is “asking for” versus what it “contains.”
Attention is how transformers move information between positions. A token at the end of the sequence can access information from the beginning. This is fundamentally different from older architectures that processed sequences one step at a time.
5.4.4 Multi-Head Attention
In practice, transformers use multiple attention heads in parallel. Each head has its own Q, K, V matrices, letting the model attend to different aspects simultaneously. One head might focus on syntax, another on semantics, another on nearby tokens.
The outputs of all heads are concatenated and multiplied by yet another matrix (W_O) to produce the final attention output. More matrix multiplications.
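In code, this is the single-head attention from above run H times with different matrices, followed by one more matrix multiplication. A minimal sketch with illustrative sizes:

import torch
import torch.nn.functional as F

seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
x = torch.randn(seq_len, d_model)

# One set of Q, K, V matrices per head (random here; learned in practice)
W_Q = torch.randn(n_heads, d_model, d_head)
W_K = torch.randn(n_heads, d_model, d_head)
W_V = torch.randn(n_heads, d_model, d_head)
W_O = torch.randn(n_heads * d_head, d_model)   # output projection

head_outputs = []
for h in range(n_heads):
    Q, K, V = x @ W_Q[h], x @ W_K[h], x @ W_V[h]
    weights = F.softmax(Q @ K.T / d_head ** 0.5, dim=-1)
    head_outputs.append(weights @ V)           # (seq_len, d_head) per head

# Concatenate the heads, then mix them with one more matrix multiplication
attn_out = torch.cat(head_outputs, dim=-1) @ W_O   # (seq_len, d_model)
print(attn_out.shape)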
5.5 MLPs: Where Knowledge Lives
After attention, each token’s representation passes through a multi-layer perceptron (MLP), also called a feed-forward network.
The structure is simple:
hidden = activation(input × W_1 + b_1)
output = hidden × W_2 + b_2
Two matrix multiplications with a nonlinearity in between. The original transformer used ReLU; modern language models typically use GELU (Gaussian Error Linear Unit), which is smoother and empirically works better. Both serve the same purpose: introducing nonlinearity between the linear transformations.
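A minimal sketch of that structure, with illustrative sizes (real models typically make the hidden layer about four times wider than d_model):

import torch
import torch.nn.functional as F

d_model, d_hidden = 8, 32            # hidden layer is usually 4x wider
x = torch.randn(d_model)             # one token's representation

W_1, b_1 = torch.randn(d_model, d_hidden), torch.randn(d_hidden)
W_2, b_2 = torch.randn(d_hidden, d_model), torch.randn(d_model)

hidden = F.gelu(x @ W_1 + b_1)       # nonlinearity between the two matmuls
output = hidden @ W_2 + b_2          # back to d_model dimensions
print(output.shape)                  # torch.Size([8])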
But recent research has revealed something fascinating: MLP layers function as key-value memories.
The first matrix (W_1) learns to recognize patterns—like keys in a dictionary. The second matrix (W_2) associates those patterns with likely next tokens—like values. When the input matches a learned pattern, the corresponding value “fires,” influencing the output.
Lower layers recognize shallow patterns like n-grams and common phrases. Upper layers recognize semantic patterns—detecting that “the capital of France” and “France’s capital city” should trigger similar responses.
MLP layers contain about two-thirds of a transformer’s parameters. They’re not just activation functions—they’re where much of the model’s knowledge is stored.
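This reading can be checked directly: the MLP output is a sum over hidden units, each contributing its value vector scaled by how strongly its key pattern fired. A sketch with random weights and illustrative sizes:

import torch
import torch.nn.functional as F

d_model, d_hidden = 8, 32
x = torch.randn(d_model)
W_1, b_1 = torch.randn(d_model, d_hidden), torch.randn(d_hidden)
W_2, b_2 = torch.randn(d_hidden, d_model), torch.randn(d_model)

# Standard MLP forward pass
hidden = F.gelu(x @ W_1 + b_1)
output = hidden @ W_2 + b_2

# Key-value reading: key i = column i of W_1, value i = row i of W_2
keys, values = W_1.T, W_2                      # both (d_hidden, d_model)
activations = F.gelu(keys @ x + b_1)           # how strongly each key pattern fires
kv_output = (activations[:, None] * values).sum(dim=0) + b_2

assert torch.allclose(kv_output, output, atol=1e-5)   # the two views are the same computation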
5.6 Layer Norm: Keeping Things Stable
Between attention and MLP blocks, transformers apply layer normalization. This rescales vectors to have consistent statistics (zero mean, unit variance across dimensions).
Layer norm isn’t glamorous, but it’s essential for training stability. Without it, activations would grow or shrink uncontrollably through the many layers of a deep transformer.
For interpretability, layer norm is a minor complication—another linear(ish) operation to account for. We’ll mostly set it aside, but it’s there.
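In code, layer norm is a normalization followed by a learned elementwise scale and shift. A sketch that matches PyTorch's built-in layer_norm (with the scale and shift at their initial values):

import torch

d_model = 8
x = torch.randn(d_model)                     # one token's vector
gamma, beta = torch.ones(d_model), torch.zeros(d_model)   # learned in practice

mean, var = x.mean(), x.var(unbiased=False)
x_norm = (x - mean) / torch.sqrt(var + 1e-5) # zero mean, unit variance
out = gamma * x_norm + beta                  # learned rescale and shift

assert torch.allclose(out, torch.nn.functional.layer_norm(x, (d_model,)), atol=1e-5)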
5.7 The Residual Stream: A Preview
Here’s a crucial architectural detail that will become central in the next chapter: residual connections.
After each attention block and each MLP block, the output is added to the input:
x = x + attention(x)
x = x + mlp(x)
This creates what’s called the residual stream—a vector that flows through the entire network, accumulating contributions from each component.
Think of it like a shared workspace. Each attention head and each MLP reads from this workspace, performs some computation, and writes its contribution back. Components don’t talk to each other directly; they communicate through the residual stream.
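A sketch of that accumulation, with random linear maps standing in for real attention and MLP blocks:

import torch

d_model = 8
x0 = torch.randn(d_model)                      # embedding: the stream's initial contents

W_attn = torch.randn(d_model, d_model) * 0.1   # stand-in for an attention block
W_mlp = torch.randn(d_model, d_model) * 0.1    # stand-in for an MLP block

attn_out = x0 @ W_attn        # what the attention block "writes"
x1 = x0 + attn_out            # added to the shared workspace

mlp_out = x1 @ W_mlp          # the MLP reads the updated stream, writes its piece
x2 = x1 + mlp_out

# The stream is the original embedding plus everything written so far
assert torch.allclose(x2, x0 + attn_out + mlp_out)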
This perspective—seeing the transformer as components reading from and writing to a shared communication channel—is foundational for mechanistic interpretability. We’ll develop it fully in the next chapter.
5.8 The Complete Forward Pass
Let’s put it all together. For a transformer with L layers:
1. Embed tokens → initial vectors
2. Add positional encoding
3. For each layer l = 1 to L:
a. Apply layer norm
b. Compute multi-head attention
c. Add attention output to residual stream
d. Apply layer norm
e. Compute MLP
f. Add MLP output to residual stream
4. Apply final layer norm
5. Multiply by "unembedding" matrix to get vocabulary logits
6. Apply softmax to get next-token probabilities
Here’s the same process as a visual diagram:
Each layer contains the same structure: layer norm → attention → add → layer norm → MLP → add. The residual connections (the “add” steps) accumulate information through the network.
flowchart TB
IN["Input from previous layer"] --> LN1["Layer Norm"]
LN1 --> ATT["Multi-Head Attention"]
ATT --> A1["Add"]
IN --> A1
A1 --> LN2["Layer Norm"]
LN2 --> MLP["MLP"]
MLP --> A2["Add"]
A1 --> A2
A2 --> OUT["Output to next layer"]
Count the matrix multiplications:
- Embedding: 1
- Per layer: Q, K, V projections (3 per head × H heads), output projection (1), MLP (2) = 3H + 3 per layer
- Unembedding: 1
For a model with 12 layers and 12 attention heads, that's 12 × (3 × 12 + 3) + 2 = 470 matrix multiplications. For GPT-3, with 96 layers and 96 heads, the same count exceeds 27,000.
And yet, it’s just matrix multiplications (plus a few nonlinearities: softmax, ReLU/GELU, layer norm).
5.9 Interactive: Explore a Real Transformer
Now that you understand the architecture conceptually, try exploring it interactively. The Transformer Explainer below lets you see a real GPT-2 model process text in your browser—watch attention patterns form, see how information flows through layers, and experiment with different inputs.
Type any text in the input box and watch the transformer process it in real-time. Click on attention heads to see their patterns. This visualization runs GPT-2 (124M parameters) directly in your browser.
Transformer Explainer was created by Aeree Cho and colleagues at Georgia Tech. It uses ONNX Runtime to run GPT-2 entirely in your browser. Paper | GitHub
5.10 Why Linearity Matters
This brings us to the key insight for interpretability: linear operations preserve structure.
A matrix multiplication is a linear map. It can rotate, scale, and project vectors, but it does so in a geometrically predictable way. Lines map to lines. Planes map to planes. Distances might change, but relationships are preserved in a tractable way.
This is why the “features as directions” hypothesis (coming in Chapter 5) has any hope of working. If the operations were arbitrary nonlinear transformations, internal representations could be uninterpretably tangled. But because they’re mostly linear, geometric structure—directions, subspaces, projections—provides a language for understanding what’s happening inside.
The nonlinearities (softmax, ReLU) are essential for the model’s expressiveness, but they’re sparse and localized. Most of the computation is linear algebra, and linear algebra we understand.
If you’re familiar with GPU programming, you know that matrix multiplication is the operation that GPUs are optimized for. The transformer architecture isn’t just mathematically elegant—it’s computationally efficient. The same property that makes transformers fast (linear algebra primitives) makes them interpretable (geometric structure).
5.11 Polya’s Perspective: Understanding the Substrate
In Polya’s framework, we’re still in phase one: understanding the problem. The previous chapter asked “what are we trying to reverse engineer?” This chapter answers: a machine built from matrix multiplications.
What have we learned?
- The transformer is a sequence of linear operations (matrix multiplications) interspersed with nonlinearities (softmax, ReLU)
- Attention routes information between positions via soft dictionary lookup
- MLPs store and retrieve knowledge via pattern-value associations
- The residual stream accumulates contributions from all components
What questions does this raise?
- If components communicate through the residual stream, what “language” do they use? (Chapter 3)
- If representations live in high-dimensional vector spaces, what structure do they have? (Chapter 4)
- What are the “atoms” of meaning in these spaces? (Chapter 5)
We now know the computational substrate. Next, we’ll examine how information flows through it.
5.12 Looking Ahead
The residual stream—that shared workspace where all components read and write—is the key abstraction for mechanistic interpretability. It’s not explicitly named in the original transformer paper, but it’s implicit in the architecture, and making it explicit changes how we think about interpretation.
In the next chapter, we’ll explore the residual stream in depth. We’ll see why it’s more useful to think of a transformer as “components contributing to a shared stream” rather than “layers processing sequentially.” This shift in perspective is foundational: once you see the residual stream, you can’t unsee it.
From there, we’ll be ready to ask: what’s actually represented in this stream? What structure do the vectors have? How do we find the “features” that the model has learned?
The matrix multiplications are just the fabric. Now let’s see what’s woven into it.
5.13 Further Reading
The Illustrated Transformer — Jay Alammar: The classic visual walkthrough of transformer architecture.
A Mathematical Framework for Transformer Circuits — Anthropic: The foundational paper for mechanistic interpretability, introducing the residual stream perspective.
Attention Is All You Need — Vaswani et al., 2017: The original transformer paper.
Transformer Feed-Forward Layers Are Key-Value Memories — Geva et al., 2021: Research showing MLPs function as pattern-value memory systems.
The Transformer Attention Mechanism — Machine Learning Mastery: Clear explanation of attention mechanics.