23  Exercises

Practice problems with solutions

The exercises are organized by chapter. Most include a hint and a full solution; the challenge problems in Section 23.4 are left open.


23.1 Arc I: Foundations

23.1.1 Chapter 2: Transformers

Problem: Load GPT-2 Small and run the prompt “The cat sat on the”. Visualize the attention pattern for layer 0, head 0. What token does “the” (final position) attend to most?

Hint: Use model.run_with_cache() and access cache["pattern", 0][0, 0] for layer 0, head 0.

import transformer_lens as tl

model = tl.HookedTransformer.from_pretrained("gpt2-small")
tokens = model.to_tokens("The cat sat on the")
_, cache = model.run_with_cache(tokens)

# Layer 0, head 0 attention pattern
pattern = cache["pattern", 0][0, 0]  # [seq_len, seq_len]

# What does final position attend to?
final_attn = pattern[-1]
print("Attention from final 'the':")
for tok, attn in zip(model.to_str_tokens(tokens[0]), final_attn):
    print(f"  {tok}: {attn:.3f}")

# Typically: early heads attend to nearby tokens or BOS
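The problem also asks for a visualization. A minimal sketch using circuitsvis (assuming it is installed; the widget renders in a notebook):

import circuitsvis as cv

# Interactive view of all layer-0 attention patterns for this prompt
cv.attention.attention_patterns(
    tokens=model.to_str_tokens(tokens[0]),
    attention=cache["pattern", 0][0],  # [n_heads, seq_len, seq_len]
)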

Problem: For the same prompt, compute the fraction of MLP neurons that are active (> 0 after GELU) in layer 5. Is MLP activation sparse?

Access cache["post", 5] for post-GELU activations. Count how many are > 0.

mlp_post = cache["post", 5][0]  # [seq_len, d_mlp]
active = (mlp_post > 0).float().mean()
print(f"Fraction active: {active:.1%}")
# Typically 30-50% are active - moderately sparse
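As a quick extension (same cache), you can check how this fraction varies with depth:

# Fraction of active MLP neurons at every layer
for layer in range(model.cfg.n_layers):
    frac = (cache["post", layer][0] > 0).float().mean()
    print(f"Layer {layer}: {frac:.1%} active")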

23.1.2 Chapter 3: Residual Stream

Problem: Run “The Eiffel Tower is in” through GPT-2 Small. At each layer, project the residual stream to vocabulary space and find the top prediction. At which layer does “Paris” first appear in the top 5?

prompt = "The Eiffel Tower is in"
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

for layer in range(12):
    resid = cache["resid_post", layer][0, -1]
    logits = resid @ model.W_U
    top5 = logits.topk(5).indices
    top5_tokens = [model.tokenizer.decode(t) for t in top5]
    has_paris = " Paris" in top5_tokens or "Paris" in top5_tokens
    print(f"Layer {layer}: {top5_tokens} {'<-- Paris!' if has_paris else ''}")

Problem: For the same prompt, which single component (attention head or MLP) contributes most to the “Paris” logit?

paris_token = model.to_single_token(" Paris")
paris_dir = model.W_U[:, paris_token]

contributions = []
for layer in range(12):
    # MLP contribution
    mlp_out = cache["mlp_out", layer][0, -1]
    contributions.append((f"L{layer}_MLP", (mlp_out @ paris_dir).item()))

    # Each attention head's contribution (compute from z and W_O)
    z = cache["z", layer][0, -1]  # [n_heads, d_head]
    W_O = model.W_O[layer]  # [n_heads, d_head, d_model]
    for head in range(12):
        head_out = z[head] @ W_O[head]  # [d_model]
        contributions.append((f"L{layer}H{head}", (head_out @ paris_dir).item()))

top = sorted(contributions, key=lambda x: -x[1])[:5]
print("Top contributors to 'Paris':")
for name, val in top:
    print(f"  {name}: {val:+.2f}")

23.1.3 Chapter 4: Geometry

Problem: Get the unembedding vectors for: “Paris”, “London”, “Berlin”, “cat”, “dog”, “fish”. Compute pairwise cosine similarities. Do cities cluster together? Do animals?

import torch

words = [" Paris", " London", " Berlin", " cat", " dog", " fish"]
tokens = [model.to_single_token(w) for w in words]
vecs = torch.stack([model.W_U[:, t] for t in tokens])

# Pairwise cosine similarity
vecs_norm = vecs / vecs.norm(dim=1, keepdim=True)
sims = vecs_norm @ vecs_norm.T

print("Cosine similarities:")
for i, w1 in enumerate(words):
    for j, w2 in enumerate(words):
        if j > i:
            print(f"  {w1} - {w2}: {sims[i,j]:.3f}")

# Cities should have higher similarity with each other than with animals

23.2 Arc II: Core Theory

23.2.1 Chapter 5: Features

Problem: Create a “sentiment” direction by taking the difference between embeddings of positive and negative words. Test if this direction correlates with sentiment in new sentences.

positive = [" good", " great", " excellent", " wonderful"]
negative = [" bad", " terrible", " awful", " horrible"]

pos_vecs = torch.stack([model.W_E[model.to_single_token(w)] for w in positive])
neg_vecs = torch.stack([model.W_E[model.to_single_token(w)] for w in negative])

sentiment_dir = pos_vecs.mean(0) - neg_vecs.mean(0)
sentiment_dir = sentiment_dir / sentiment_dir.norm()

# Test on new sentences
test = ["This movie is fantastic", "This movie is terrible"]
for sent in test:
    tokens = model.to_tokens(sent)
    _, cache = model.run_with_cache(tokens)
    final = cache["resid_post", 11][0, -1]
    score = (final @ sentiment_dir).item()
    print(f"{sent}: {score:+.2f}")

23.2.2 Chapter 6: Superposition

Problem: The residual stream of GPT-2 Small has 768 dimensions. If features were assigned mutually orthogonal directions, it could represent at most 768 of them, yet models appear to represent far more concepts. Pick 100 random unembedding vectors and compute the average absolute cosine similarity. How close to orthogonal are they?

import random

# Sample 100 random tokens
all_tokens = list(range(model.cfg.d_vocab))
sample = random.sample(all_tokens, 100)
vecs = model.W_U[:, sample].T  # [100, 768]
vecs = vecs / vecs.norm(dim=1, keepdim=True)

# Average absolute cosine similarity
sims = (vecs @ vecs.T).abs()
# Exclude diagonal
mask = ~torch.eye(100, dtype=torch.bool, device=sims.device)
avg_sim = sims[mask].mean()

print(f"Average |cosine similarity|: {avg_sim:.4f}")
# Expected: ~0.02-0.05 (nearly orthogonal in high dimensions)
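For context, a quick analytic baseline (assuming the usual Gaussian approximation for random directions): the cosine similarity of two independent random unit vectors in d dimensions is roughly N(0, 1/d), so the expected absolute similarity is about sqrt(2 / (π d)).

import math

d = model.cfg.d_model  # 768
print(f"Random-direction baseline: {math.sqrt(2 / (math.pi * d)):.4f}")  # ≈ 0.029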

23.3 Arc III: Techniques

23.3.1 Chapter 9: SAEs

Problem: Load an SAE for GPT-2 Small layer 8. Run “The president of the United States” and find the top 5 most active features at the final position. Look them up on Neuronpedia.

from sae_lens import SAE

sae, _, _ = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre"
)

prompt = "The president of the United States"
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

resid = cache["resid_pre", 8][0, -1]
acts = sae.encode(resid.unsqueeze(0))[0]

top5 = acts.topk(5)
print("Top 5 features:")
for idx, val in zip(top5.indices, top5.values):
    print(f"  Feature {idx.item()}: {val.item():.2f}")
    print(f"  https://neuronpedia.org/gpt2-small/{8}-res-jb/{idx.item()}")

23.3.2 Chapter 10: Attribution

Problem: For “2 + 2 =”, decompose the logit for “ 4” into contributions from each component. What fraction comes from MLPs vs attention?

prompt = "2 + 2 ="
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

target = model.to_single_token(" 4")
target_dir = model.W_U[:, target]

mlp_total = 0
attn_total = 0

for layer in range(12):
    mlp = cache["mlp_out", layer][0, -1]
    mlp_total += (mlp @ target_dir).item()

    attn = cache["attn_out", layer][0, -1]
    attn_total += (attn @ target_dir).item()

print(f"MLP contribution: {mlp_total:.2f}")
print(f"Attention contribution: {attn_total:.2f}")
print(f"MLP fraction: {mlp_total / (mlp_total + attn_total):.1%}")

23.3.3 Chapter 11: Patching

Problem: Patch the residual stream at layer 6 from “The Louvre is in” into “The Colosseum is in”. Does the prediction change from “Rome” to “Paris”?

clean = "The Colosseum is in"
corrupt = "The Louvre is in"

clean_tokens = model.to_tokens(clean)
corrupt_tokens = model.to_tokens(corrupt)

_, corrupt_cache = model.run_with_cache(corrupt_tokens)
corrupt_resid = corrupt_cache["resid_post", 6]

def patch_hook(act, hook):
    act[:, -1, :] = corrupt_resid[:, -1, :]
    return act

patched = model.run_with_hooks(
    clean_tokens,
    fwd_hooks=[("blocks.6.hook_resid_post", patch_hook)]
)

rome = model.to_single_token(" Rome")
paris = model.to_single_token(" Paris")

print(f"Rome logit: {patched[0, -1, rome]:.2f}")
print(f"Paris logit: {patched[0, -1, paris]:.2f}")
print(f"Prediction: {model.tokenizer.decode(patched[0, -1].argmax())}")

23.3.4 Chapter 12: Ablation

Problem: Zero-ablate each attention head individually for “The Eiffel Tower is in”. Which head, when ablated, most reduces the “Paris” logit?

prompt = "The Eiffel Tower is in"
tokens = model.to_tokens(prompt)
paris = model.to_single_token(" Paris")

baseline = model(tokens)[0, -1, paris].item()
print(f"Baseline Paris logit: {baseline:.2f}")

results = []
for layer in range(12):
    for head in range(12):
        def ablate_head(act, hook, h=head):
            act[:, :, h, :] = 0
            return act

        ablated = model.run_with_hooks(
            tokens,
            fwd_hooks=[(f"blocks.{layer}.hook_z", ablate_head)]
        )
        new_logit = ablated[0, -1, paris].item()
        results.append((f"L{layer}H{head}", baseline - new_logit))

top = sorted(results, key=lambda x: -x[1])[:5]
print("\nHeads whose ablation most reduces Paris logit:")
for name, drop in top:
    print(f"  {name}: -{drop:.2f}")

23.4 Challenge Problems

Problem: Analyze “The capital of France is” → “Paris”. Identify:

1. Which attention heads move information from “France” to the final position
2. Which MLP layers retrieve the answer
3. Verify with patching that these components are causally important

This is a multi-hour exercise. Document your methodology and findings.
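No solutions are provided for the challenge problems. As a starter sketch for step 1 only (attention to a token is suggestive of, not proof of, information movement):

prompt = "The capital of France is"
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

str_tokens = model.to_str_tokens(tokens[0])
france_pos = str_tokens.index(" France")

# How much does each head attend from the final position to " France"?
scores = []
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer][0]  # [n_heads, seq_len, seq_len]
    for head in range(model.cfg.n_heads):
        scores.append((f"L{layer}H{head}", pattern[head, -1, france_pos].item()))

for name, score in sorted(scores, key=lambda x: -x[1])[:5]:
    print(f"  {name}: attention to ' France' = {score:.2f}")

Steps 2 and 3 can reuse the logit-lens and patching code from Sections 23.1.2 and 23.3.3.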

Problem: Compare the circuits for “2 + 3 =” vs “7 + 8 =”.

- Do they use the same components?
- Is there evidence of different strategies for small vs large numbers?
- What happens for “99 + 1 =”?
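One possible starting point (a sketch): run the per-head zero ablation from Section 23.3.4 for each prompt and compare which heads matter. This assumes “ 5” and “ 15” are single GPT-2 tokens, which to_single_token will verify.

def head_effects(prompt, answer):
    tokens = model.to_tokens(prompt)
    target = model.to_single_token(answer)
    baseline = model(tokens)[0, -1, target].item()
    effects = {}
    for layer in range(model.cfg.n_layers):
        for head in range(model.cfg.n_heads):
            def ablate(act, hook, h=head):
                act[:, :, h, :] = 0
                return act
            out = model.run_with_hooks(tokens, fwd_hooks=[(f"blocks.{layer}.hook_z", ablate)])
            effects[f"L{layer}H{head}"] = baseline - out[0, -1, target].item()
    return effects

small = head_effects("2 + 3 =", " 5")
large = head_effects("7 + 8 =", " 15")

# Do the same heads show the largest logit drops for both prompts?
top_small = set(sorted(small, key=small.get)[-5:])
top_large = set(sorted(large, key=large.get)[-5:])
print("Top-5 heads shared between prompts:", top_small & top_large)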

Problem: Find an input that maximally activates a specific SAE feature. Start with feature 1000 in the layer 8 SAE. Use gradient-based optimization or greedy token search.
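A possible starting point for the greedy search (a sketch; it reuses the model and the layer-8 SAE from Section 23.3.1, and scores each candidate token on its own after the BOS token rather than optimizing a full prompt):

import random
import torch

FEATURE = 1000
candidates = random.sample(range(model.cfg.d_vocab), 500)  # subsample for speed

scores = []
for tok_id in candidates:
    toks = torch.tensor([[model.tokenizer.bos_token_id, tok_id]], device=model.cfg.device)
    _, cache = model.run_with_cache(toks)
    act = sae.encode(cache["resid_pre", 8][0, -1].unsqueeze(0))[0, FEATURE].item()
    scores.append((act, model.tokenizer.decode(tok_id)))

print(f"Tokens most activating feature {FEATURE}:")
for act, word in sorted(scores, reverse=True)[:10]:
    print(f"  {act:6.2f}  {word!r}")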


23.5 Solutions Notebook

All exercises are available as a runnable Colab notebook.
