Section 5.6: Positional Encoding — Injecting Position Information¶
Reading time: 20 minutes | Difficulty: ★★★★☆
Attention is permutation-equivariant: it treats positions as a set, not a sequence. This section explains why position information is essential and how we inject it into the model.
The Problem: Attention Has No Notion of Order¶
Consider self-attention on "dog bites man" vs "man bites dog":
Without positional info:
"dog bites man" → Same attention scores!
"man bites dog" → (just rows/cols permuted)
The attention mechanism computes:
- score(dog, bites) based on their embeddings
- score(man, bites) based on their embeddings
The positions of "dog" and "man" don't affect these scores, but the meaning changes completely!
Mathematical Proof¶
For any permutation \(\pi\) of positions, with permutation matrix \(P_\pi\) applied to the rows of the input \(X\):
\[ \text{Attention}(P_\pi X) = P_\pi \, \text{Attention}(X) \]
If we permute the input, the output is permuted the same way. The relative relationships are unchanged; attention doesn't know position 1 from position 5.
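To make this concrete, here is a minimal NumPy sketch (a toy single-head attention with random weights, not a full Transformer layer) that checks the equivariance numerically:

import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Plain single-head self-attention with no positional information."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n)
out = self_attention(X, W_q, W_k, W_v)
out_perm = self_attention(X[perm], W_q, W_k, W_v)

# Permuting the input just permutes the output rows the same way
print(np.allclose(out[perm], out_perm))  # True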
Why Position Matters¶
| Linguistic Phenomenon | Position-Dependent? |
|---|---|
| Subject-verb agreement | Yes (subject comes before verb) |
| Adjective-noun order | Yes (varies by language) |
| Pronoun reference | Yes (usually refers backward) |
| Negation scope | Yes ("not" affects what follows) |
Almost everything in language depends on position!
Solution: Add Position Information to Embeddings¶
The key insight: add position-specific vectors to token embeddings before attention.
Now each position has unique information that attention can use.
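As a minimal sketch (the PE matrix here is a random stand-in; the next subsection builds a real one), the injection is just an element-wise addition:

# Hypothetical toy setup: 3-token sequence, model dimension 8
vocab_size, d_model, seq_len = 100, 8, 3
token_embeddings = np.random.randn(vocab_size, d_model) * 0.02
PE = np.random.randn(seq_len, d_model)          # stand-in for a real positional encoding
token_ids = np.array([17, 4, 42])               # e.g. "dog bites man"

X = token_embeddings[token_ids] + PE[:seq_len]  # [seq_len, d_model], now position-aware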
Sinusoidal Positional Encoding¶
The original Transformer uses fixed sinusoidal functions:
\[ PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right) \]
Where:
- pos: position in sequence (0, 1, 2, ...)
- i: dimension index (0, 1, 2, ..., d/2-1)
- d: model dimension
Why Sinusoids?¶
Each dimension has a different frequency:
Dimensions 0-1: High frequency (change rapidly with position)
Dimensions d-2, d-1: Low frequency (change slowly with position)
Position 0: [sin(0), cos(0), sin(0), cos(0), ...]
Position 1: [sin(1/1), cos(1/1), sin(1/10000^(2/d)), cos(1/10000^(2/d)), ...]
Position 2: [sin(2/1), cos(2/1), sin(2/10000^(2/d)), cos(2/10000^(2/d)), ...]
Key Properties¶
1. Unique encoding per position: Each position gets a unique vector.
2. Bounded values: All values in [-1, 1], matching embedding scale.
3. Relative position as a linear function: \(PE_{pos+k}\) can be expressed as a linear function of \(PE_{pos}\).
This means the model can learn to compute relative positions!
Proof sketch: for a single frequency \(\omega\), the angle-addition identities give
\[
\begin{pmatrix} \sin(\omega(pos+k)) \\ \cos(\omega(pos+k)) \end{pmatrix}
=
\begin{pmatrix} \cos(\omega k) & \sin(\omega k) \\ -\sin(\omega k) & \cos(\omega k) \end{pmatrix}
\begin{pmatrix} \sin(\omega\, pos) \\ \cos(\omega\, pos) \end{pmatrix}
\]
This is a linear transformation of \([\sin(\omega\, pos), \cos(\omega\, pos)]\) with a matrix that depends only on the offset \(k\), not on \(pos\).
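A quick numerical check of this identity, with an arbitrary frequency and offset:

omega, pos, k = 0.37, 12.0, 5.0          # arbitrary frequency, position, offset
pe_pos = np.array([np.sin(omega * pos), np.cos(omega * pos)])
rotation = np.array([[np.cos(omega * k),  np.sin(omega * k)],
                     [-np.sin(omega * k), np.cos(omega * k)]])
pe_shifted = np.array([np.sin(omega * (pos + k)), np.cos(omega * (pos + k))])

print(np.allclose(rotation @ pe_pos, pe_shifted))  # True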
Visualization¶
Position: 0 1 2 3 4 5 6 7 8
Dim 0: [████ ░░░░ ████ ░░░░ ████ ░░░░ ████ ░░░░ ████] High freq
Dim 2: [████████ ░░░░░░░░ ████████ ░░░░░░░░ ████████] Medium freq
Dim d-2: [██████████████████████░░░░░░░░░░░░░░░░░░░░░░] Low freq
(█ = positive, ░ = negative)
Different dimensions encode position at different scales.
Implementation¶
import numpy as np
def sinusoidal_positional_encoding(max_len, d_model):
    """
    Generate sinusoidal positional encodings.

    Args:
        max_len: Maximum sequence length
        d_model: Model dimension

    Returns:
        Positional encodings [max_len, d_model]
    """
    PE = np.zeros((max_len, d_model))
    position = np.arange(max_len)[:, np.newaxis]  # [max_len, 1]
    div_term = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    PE[:, 0::2] = np.sin(position * div_term)  # Even dimensions
    PE[:, 1::2] = np.cos(position * div_term)  # Odd dimensions
    return PE
# Example usage
max_len = 100
d_model = 64
PE = sinusoidal_positional_encoding(max_len, d_model)
print(f"PE shape: {PE.shape}") # [100, 64]
print(f"PE[0]: {PE[0][:4]}") # [0, 1, 0, 1] (sin(0), cos(0), ...)
print(f"PE[1]: {PE[1][:4]}") # [0.84, 0.54, ...] (sin(1), cos(1), ...)
Learned Positional Embeddings¶
An alternative: learn position embeddings just like token embeddings.
class LearnedPositionalEncoding:
    """Learned positional embeddings."""

    def __init__(self, max_len, d_model):
        # Each position gets its own learnable vector
        self.PE = np.random.randn(max_len, d_model) * 0.02

    def forward(self, seq_len):
        return self.PE[:seq_len]
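Usage mirrors the sinusoidal case; the only difference is that these vectors would be updated by the optimizer during training (a sketch with toy shapes):

learned_pe = LearnedPositionalEncoding(max_len=512, d_model=64)
X = np.random.randn(10, 64)                 # stand-in for 10 token embeddings
X_positioned = X + learned_pe.forward(10)   # same add-to-embeddings recipe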
Comparison¶
| Aspect | Sinusoidal | Learned |
|---|---|---|
| Parameters | 0 | max_len × d |
| Extrapolation | Can extend to longer sequences | Fixed to training length |
| Expressivity | Fixed patterns | Can learn any pattern |
| Used by | Original Transformer | GPT-2, BERT |
Classic models such as BERT and GPT-2 use learned absolute embeddings; most recent large models have instead adopted relative methods (RoPE, ALiBi, covered below) for better length generalization.
Relative Positional Encoding¶
Instead of encoding absolute positions, encode relative distances.
The Intuition¶
"The cat sat" should have similar relationships whether it appears at positions [0,1,2] or [100,101,102].
Relative encoding: position i attending to position j uses encoding for (i-j).
Transformer-XL Style¶
Modify attention scores to include relative position (simplified, with projection matrices folded into \(q\), \(k\), and \(r\)):
\[ \text{score}(i, j) = q_i^\top k_j + q_i^\top r_{i-j} + u^\top k_j + v^\top r_{i-j} \]
Where:
- \(r_{i-j}\): relative position embedding
- \(u, v\): learnable global biases
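A minimal sketch of this four-term decomposition for a single query/key pair (toy random vectors; the variable names are illustrative, not Transformer-XL's actual parameters):

d = 8
rng = np.random.default_rng(1)
q_i, k_j = rng.normal(size=d), rng.normal(size=d)
r_ij = rng.normal(size=d)                      # relative position embedding for offset (i - j)
u, v = rng.normal(size=d), rng.normal(size=d)  # global content / position biases

score = (q_i @ k_j      # content-content
         + q_i @ r_ij   # content-position
         + u @ k_j      # global content bias
         + v @ r_ij)    # global position bias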
T5 Style (Simplified)¶
Add a learned bias based on relative position:
\[ \text{score}(i, j) = q_i^\top k_j + b_{i-j} \]
Where \(b\) is a learned bias table indexed by relative position (one bias per attention head).
class T5RelativePositionBias:
    """T5-style relative position bias."""

    def __init__(self, max_distance=128, n_heads=8):
        # Learn a bias for each relative distance and head
        self.bias_table = np.random.randn(2 * max_distance + 1, n_heads) * 0.02
        self.max_distance = max_distance

    def forward(self, seq_len):
        """Compute relative position bias matrix."""
        # Create relative position matrix
        positions = np.arange(seq_len)
        relative_pos = positions[None, :] - positions[:, None]  # [n, n]
        # Clip to max distance
        relative_pos = np.clip(relative_pos, -self.max_distance, self.max_distance)
        # Shift to positive indices
        indices = relative_pos + self.max_distance
        # Look up biases
        return self.bias_table[indices]  # [n, n, n_heads]
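A hedged usage sketch showing how this bias would be added to per-head attention logits (assuming scores of shape [n_heads, n, n]):

n_heads, seq_len = 8, 6
rel_bias = T5RelativePositionBias(max_distance=128, n_heads=n_heads)

scores = np.random.randn(n_heads, seq_len, seq_len)  # stand-in attention logits
bias = rel_bias.forward(seq_len)                     # [n, n, n_heads]
scores = scores + bias.transpose(2, 0, 1)            # reorder to [n_heads, n, n] and add
# ...followed by the usual softmax over the last axis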
Rotary Positional Embedding (RoPE)¶
The modern standard for many LLMs (LLaMA, Mistral, etc.).
Key Idea¶
Rotate query and key vectors based on position. Relative positions emerge from the dot product of rotated vectors:
\[ q_m' = R_m q_m, \quad k_n' = R_n k_n \quad\Longrightarrow\quad q_m'^\top k_n' = q_m^\top R_m^\top R_n k_n = q_m^\top R_{n-m} k_n \]
The rotation difference \(R_{n-m}\) depends only on the relative position \(n - m\)!
How Rotation Works¶
For 2D vectors, position \(m\) rotates the vector by angle \(m\theta\):
\[ R_m = \begin{pmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{pmatrix} \]
For higher dimensions, apply 2D rotations to pairs of dimensions, each pair with its own frequency \(\theta_i = 10000^{-2i/d}\).
Implementation¶
def rotary_embedding(x, position, base=10000):
    """
    Apply rotary positional embedding.

    Args:
        x: Input tensor [seq_len, d]
        position: Position indices [seq_len]
        base: Base for frequency computation

    Returns:
        Rotated tensor [seq_len, d]
    """
    d = x.shape[-1]
    assert d % 2 == 0, "Dimension must be even"
    # Compute frequencies
    freqs = 1.0 / (base ** (np.arange(0, d, 2) / d))  # [d/2]
    # Compute angles
    angles = position[:, None] * freqs[None, :]  # [seq_len, d/2]
    cos_angles = np.cos(angles)  # [seq_len, d/2]
    sin_angles = np.sin(angles)  # [seq_len, d/2]
    # Split x into pairs
    x1 = x[:, 0::2]  # Even dimensions
    x2 = x[:, 1::2]  # Odd dimensions
    # Apply rotation to each pair
    x_rotated = np.empty_like(x)
    x_rotated[:, 0::2] = x1 * cos_angles - x2 * sin_angles
    x_rotated[:, 1::2] = x1 * sin_angles + x2 * cos_angles
    return x_rotated
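A quick numerical check of the relative-position property: shifting both positions by the same offset leaves the query-key dot product unchanged (a sketch built on the function above):

rng = np.random.default_rng(2)
d = 16
q = rng.normal(size=(1, d))
k = rng.normal(size=(1, d))

def rope_score(q, k, m, n):
    """Dot product of q at position m with k at position n, after rotation."""
    q_rot = rotary_embedding(q, np.array([m]))
    k_rot = rotary_embedding(k, np.array([n]))
    return (q_rot @ k_rot.T).item()

# Same relative offset (n - m = 3), different absolute positions
print(np.isclose(rope_score(q, k, 2, 5), rope_score(q, k, 40, 43)))  # True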
Why RoPE is Popular¶
- Relative positions naturally: No need for explicit relative position computation
- Efficient: Simple element-wise operations
- Length extrapolation: With NTK-aware scaling, can extend to longer sequences
- Linear attention compatible: Works with efficient attention variants
Connection to Modern LLMs
Most recent LLMs use RoPE or variants:
- LLaMA, LLaMA 2: Standard RoPE
- Mistral, Mixtral: RoPE with sliding window attention
- GPT-4, Claude: Details not public, but likely relative position methods
- Gemini: Uses relative position encoding
The field has converged on relative methods for their better length generalization.
ALiBi: Attention with Linear Biases¶
Another simple and effective method.
The Idea¶
Add a linear penalty to attention scores based on distance:
\[ \text{score}(i, j) = \frac{q_i^\top k_j}{\sqrt{d_k}} - m \cdot |i - j| \]
Where \(m\) is a head-specific slope.
Implementation¶
from scipy.special import softmax  # or reuse a softmax helper defined earlier in the chapter

def alibi_attention(Q, K, V, slopes):
    """
    Attention with ALiBi positional encoding (single head).

    Args:
        Q, K, V: Query, Key, Value matrices [n, d_k]
        slopes: Head-specific slope m for the distance penalty (a scalar here)
    """
    n = Q.shape[0]
    d_k = Q.shape[-1]
    # Standard attention scores
    scores = Q @ K.T / np.sqrt(d_k)  # [n, n]
    # Create distance matrix
    positions = np.arange(n)
    distances = positions[None, :] - positions[:, None]  # [n, n]
    # Apply linear bias (for causal attention, only negative distances matter)
    bias = slopes * np.abs(distances)
    scores = scores - bias
    # Rest is standard attention
    weights = softmax(scores, axis=-1)
    return weights @ V
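The ALiBi paper sets the slopes to a geometric sequence; for \(H\) heads (a power of two), head \(h\) gets roughly \(m_h = 2^{-8h/H}\). A minimal sketch of the schedule and a single-head call:

def alibi_slopes(n_heads):
    """Geometric slope schedule from the ALiBi paper (assumes n_heads is a power of two)."""
    return np.array([2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)])

slopes = alibi_slopes(8)   # [0.5, 0.25, ..., 0.00390625]
out = alibi_attention(Q=np.random.randn(5, 16),
                      K=np.random.randn(5, 16),
                      V=np.random.randn(5, 16),
                      slopes=slopes[0])  # one head at a time in this sketch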
Advantages¶
- Zero extra parameters: Just a simple bias
- Excellent length generalization: Works on sequences 10x training length
- Efficient: Simple addition operation
Combining Positional Encodings¶
Some architectures combine methods:
class HybridPositionalEncoding:
    """Combine absolute and relative encodings."""

    def __init__(self, max_len, d_model, n_heads):
        # Absolute (added to embeddings)
        self.absolute = sinusoidal_positional_encoding(max_len, d_model)
        # Relative (added to attention scores)
        self.relative = T5RelativePositionBias(max_distance=128, n_heads=n_heads)

    def encode_input(self, X, positions):
        """Add absolute encoding to input."""
        return X + self.absolute[positions]

    def attention_bias(self, seq_len):
        """Get relative position bias for attention."""
        return self.relative.forward(seq_len)
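Usage sketch (toy shapes; the two pieces plug into the layer at different points):

hybrid = HybridPositionalEncoding(max_len=512, d_model=64, n_heads=8)
X = np.random.randn(10, 64)                   # token embeddings for 10 positions
X_in = hybrid.encode_input(X, np.arange(10))  # absolute part, added before attention
bias = hybrid.attention_bias(10)              # [10, 10, 8], added to the attention logits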
Position Encoding Summary¶
| Method | Parameters | Length Generalization | Modern Usage |
|---|---|---|---|
| Sinusoidal | 0 | Good | Limited |
| Learned | max_len × d | Poor | Common |
| Relative (T5) | distance × heads | Good | Common |
| RoPE | 0 | Good with scaling | Very common |
| ALiBi | 0 | Excellent | Common |
Exercises¶
- Implement sinusoidal: Write the encoding and visualize it as a heatmap.
- Dot product distance: For sinusoidal PE, compute the dot product between positions. What pattern emerges?
- Extrapolation test: Train with max length 100, test at 200. Compare methods.
- RoPE derivation: Prove that q_m' · k_n' depends only on (m-n).
- Design your own: Create a positional encoding and analyze its properties.
Summary¶
| Concept | Definition | Purpose |
|---|---|---|
| Position encoding | Vector added/applied per position | Give attention position awareness |
| Sinusoidal | Fixed sine/cosine patterns | Unique, bounded, allows relative |
| Learned | Trainable per-position vectors | More flexible, limited length |
| Relative | Encode (i-j) not absolute i, j | Better generalization |
| RoPE | Rotate Q, K by position | Efficient relative encoding |
Key takeaway: Attention mechanisms are permutation-equivariant by design, treating input as a set rather than a sequence. Positional encodings inject position information, enabling the model to understand word order. Modern methods favor relative positions (RoPE, ALiBi) for better length generalization, while the original Transformer used fixed sinusoidal patterns.
→ Next: Section 5.7: Causal Masking