Section 7.3: Byte Pair Encoding (BPE)¶
Reading time: 15 minutes
Origins¶
Byte Pair Encoding was originally a data compression algorithm (Gage, 1994). Sennrich et al. (2016) adapted it for NLP, and it became the standard for GPT-2 and subsequent models.
The Core Idea¶
BPE iteratively merges the most frequent pair of tokens:
- Start with a vocabulary of individual characters (or bytes)
- Count all adjacent pairs in the corpus
- Merge the most frequent pair into a new token
- Repeat until reaching the desired vocabulary size
Algorithm Walkthrough¶
Let's trace BPE on a small corpus:
Corpus: "low lower lowest"
Step 0: Initialize¶
Split into characters with an end-of-word marker (</w>):
Vocabulary: ['l', 'o', 'w', '</w>', 'e', 'r', 's', 't']
Words with frequencies (each word appears once; note that all three share the prefix "l o w"):
"l o w </w>" : 1
"l o w e r </w>" : 1
"l o w e s t </w>" : 1
Step 1: Count Pairs¶
| Pair | Frequency |
|---|---|
| (l, o) | 3 |
| (o, w) | 3 |
| (w, </w>) | 1 |
| (w, e) | 2 |
| (e, r) | 1 |
| (r, </w>) | 1 |
| (e, s) | 1 |
| (s, t) | 1 |
| (t, </w>) | 1 |
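The table can be reproduced from the Step 0 snippet with a few more lines:

pair_freqs = Counter()
for word, freq in word_freqs.items():
    for pair in zip(word, word[1:]):
        pair_freqs[pair] += freq

print(pair_freqs.most_common(3))
# [(('l', 'o'), 3), (('o', 'w'), 3), (('w', 'e'), 2)]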
Step 2: Merge Best Pair¶
Most frequent: (l, o) and (o, w) are tied at 3. Break the tie arbitrarily (here, take the pair encountered first): merge (l, o) → create new token "lo"
Vocabulary: ['l', 'o', 'w', '</w>', 'e', 'r', 's', 't', 'lo']
Words updated:
"lo w </w>" : 3
"lo w e r </w>" : 1
"lo w e s t </w>" : 1
Step 3: Repeat¶
New pair counts:
| Pair | Frequency |
|---|---|
| (lo, w) | 3 |
| (w, </w>) | 1 |
| (w, e) | 2 |
| ... | ... |
Merge (lo, w) → "low"
Continue...¶
On this tiny corpus, the next merge is (low, e) → "lowe" with frequency 2; after that, every remaining pair ties at frequency 1 and the order depends on tie-breaking. On a realistic corpus, where suffixes recur across many different words, the algorithm creates tokens like:
- "low" (the complete word "low")
- "er" (shared by lower, bigger, faster, ...)
- "est" (shared by lowest, biggest, fastest, ...)
The Merge Rules¶
BPE learns an ordered sequence of merge rules. From the walkthrough so far:
1. (l, o) → "lo"
2. (lo, w) → "low"
3. (low, e) → "lowe"
...
During encoding, these rules are applied in the same order they were learned.
Formal Algorithm¶
Training¶
from typing import Dict, List, Tuple

def train_bpe(corpus: List[str], vocab_size: int) -> Tuple[Dict, List]:
    """Train a BPE tokenizer.

    Returns:
        vocab: token → id mapping
        merges: list of (token_a, token_b) merge rules
    """
    # Initialize with the 256 byte values (represented as characters for simplicity)
    vocab = {chr(i): i for i in range(256)}

    # Split words into symbol sequences and count them
    word_freqs = count_word_frequencies(corpus)
    # word_freqs: {('l','o','w','</w>'): 1, ('l','o','w','e','r','</w>'): 1, ...}

    merges = []
    while len(vocab) < vocab_size:
        # Count all adjacent pairs across the corpus
        pair_freqs = count_pairs(word_freqs)
        if not pair_freqs:
            break

        # Find the most frequent pair
        best_pair = max(pair_freqs, key=pair_freqs.get)

        # Create the merged token and record the rule
        merged = best_pair[0] + best_pair[1]
        vocab[merged] = len(vocab)
        merges.append(best_pair)

        # Apply the merge to every word
        word_freqs = apply_merge(word_freqs, best_pair, merged)

    return vocab, merges
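The helpers count_word_frequencies, count_pairs, and apply_merge are left abstract above. Here is one minimal way they might look — a sketch consistent with the tuple-of-symbols representation in the comments, not a reference implementation:

from collections import Counter
from typing import Dict, List, Tuple

def count_word_frequencies(corpus: List[str]) -> Dict[Tuple[str, ...], int]:
    """Split each whitespace-separated word into characters plus </w>, and count."""
    freqs = Counter()
    for text in corpus:
        for word in text.split():
            freqs[tuple(word) + ("</w>",)] += 1
    return dict(freqs)

def count_pairs(word_freqs: Dict[Tuple[str, ...], int]) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in word_freqs.items():
        for pair in zip(word, word[1:]):
            pairs[pair] += freq
    return pairs

def apply_merge(word_freqs, pair, merged):
    """Rewrite every word, replacing each occurrence of `pair` with `merged`."""
    new_freqs = {}
    for word, freq in word_freqs.items():
        symbols, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                symbols.append(merged)
                i += 2
            else:
                symbols.append(word[i])
                i += 1
        new_freqs[tuple(symbols)] = freq
    return new_freqs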
Encoding¶
from typing import List, Tuple

def encode_bpe(text: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Encode text using learned BPE merges."""
    # Start with individual characters
    tokens = list(text)

    # Apply merges in the order they were learned
    for pair in merges:
        new_tokens = []
        i = 0
        while i < len(tokens):
            # Merge this pair wherever it occurs
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                new_tokens.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        tokens = new_tokens

    return tokens
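Applying the first two merges from the walkthrough:

>>> encode_bpe("lowest", [("l", "o"), ("lo", "w")])
['low', 'e', 's', 't']

This is the naive O(L × M) strategy discussed under Complexity Analysis below: the token list is re-scanned once per merge rule.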
Why BPE Works¶
Frequency-Based Compression¶
BPE creates tokens for common patterns:
"the" appears 1000 times → gets merged into one token
"xylophone" appears once → stays as characters
This follows Huffman coding principles: frequent things get short representations.
Morphological Discovery¶
BPE naturally discovers morphological structure:
Corpus contains: "playing", "played", "player", "plays"
BPE learns:
- "play" (common root)
- "ing" (common suffix)
- "ed" (common suffix)
- "er" (common suffix)
The algorithm doesn't know about morphology—it discovers it from frequency patterns.
Compositionality¶
Rare words are composed from frequent pieces, for example:
"unbreakable" → ["un", "break", "able"]
The model can generalize:
- "un-" indicates negation
- "-able" indicates possibility
Byte-Level BPE (GPT-2 Style)¶
GPT-2 uses byte-level BPE:
- Start with 256 byte tokens (not Unicode characters)
- All text is UTF-8 encoded first
- Merges operate on bytes, not characters
Advantages¶
- Handles any language without special tokenization
- Works with emojis, rare scripts, binary data
- No [UNK] tokens ever needed
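A quick way to see why this works (plain Python; GPT-2's actual implementation additionally remaps bytes to printable characters, which is omitted here):

>>> list("café 🙂".encode("utf-8"))
[99, 97, 102, 195, 169, 32, 240, 159, 153, 130]

The accented é becomes two bytes and the emoji four, yet every byte already has a token in the 256-entry base vocabulary, so no [UNK] is ever needed.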
The GPT-2 Tweak¶
GPT-2 adds special handling:
- Pretokenize by splitting on whitespace and punctuation
- Add space prefix to words (except first): " the" not "the"
- Run BPE within each pretokenized unit
This prevents merging across word boundaries.
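GPT-2's released code implements this pretokenization with a single regular expression along the following lines (quoted from memory; it needs the third-party regex module for the \p{...} character classes):

import regex as re

# Contractions first, then optionally space-prefixed runs of letters,
# digits, or punctuation, then residual whitespace.
pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

print(pat.findall("I'm learning about tokenization!"))
# ['I', "'m", ' learning', ' about', ' tokenization', '!']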
Complexity Analysis¶
Training Time¶
For vocabulary size V and corpus size N:
- Each iteration scans corpus: O(N)
- Number of iterations: O(V)
- Total: O(N × V)
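For a sense of scale (illustrative numbers): with N ≈ 10⁹ characters and V = 50,000 merges, the naive O(N × V) bound is on the order of 5 × 10¹³ pair-count updates.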
In practice, efficient implementations use data structures to update counts incrementally.
Encoding Time¶
For text length L and number of merges M:
- Naive: O(L × M) — re-scan the text once per merge rule
- Optimized: index which pairs actually occur in the text and apply only the merges that can fire, so cost scales with the number of merges applied rather than with M
Example: Real BPE Output¶
After training on a larger corpus, a BPE tokenizer might produce output like this (token strings shown instead of ids for readability):
>>> tokenizer.encode("The transformer architecture is revolutionary")
['The', ' transform', 'er', ' architecture', ' is', ' revolution', 'ary']
>>> tokenizer.encode("I'm learning about tokenization!")
['I', "'m", ' learning', ' about', ' token', 'ization', '!']
Notice:
- Common words stay whole: "The", "is"
- Derived words split at morpheme boundaries: "transform" + "er"
- Morphological structure preserved: "revolution" + "ary"
Limitations of BPE¶
- Greedy algorithm: May not find the globally optimal tokenization (see the sketch after this list)
- Training corpus dependent: Different corpora → different vocabularies
- Language-specific quirks: Works best for space-separated languages
- No probability model: Just frequency counts
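A hypothetical merge list, chosen only to illustrate the first point, shows how applying merges in learned order can miss a shorter segmentation that the same vocabulary allows:

>>> merges = [("b", "c"), ("a", "b"), ("c", "d")]
>>> encode_bpe("abcd", merges)
['a', 'bc', 'd']

The vocabulary built from these merges also contains "ab" and "cd", so the two-token segmentation ['ab', 'cd'] exists, but greedy in-order merging never finds it.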
Summary¶
| Aspect | Description |
|---|---|
| Core idea | Iteratively merge most frequent pairs |
| Initialization | Characters or bytes |
| Termination | Desired vocabulary size reached |
| Encoding | Apply merges in training order |
| Strength | Simple, effective, widely used |
| Used by | GPT-2, GPT-3, GPT-4, many others |
Next: We'll see how WordPiece modifies this algorithm with a likelihood-based criterion.