Section 7.3: Byte Pair Encoding (BPE)¶
Reading time: 15 minutes
Origins¶
Byte Pair Encoding was originally a data compression algorithm (Gage, 1994). Sennrich et al. (2016) adapted it for NLP, and it became the standard for GPT-2 and subsequent models.
The Core Idea¶
BPE iteratively merges the most frequent pair of tokens:
- Start with a vocabulary of individual characters (or bytes)
- Count all adjacent pairs in the corpus
- Merge the most frequent pair into a new token
- Repeat until reaching the desired vocabulary size
Algorithm Walkthrough¶
Let's trace BPE on a small corpus:
Corpus: "low lower lowest"
Step 0: Initialize¶
Split into characters with an end-of-word marker (</w>):
Vocabulary: ['l', 'o', 'w', '</w>', 'e', 'r', 's', 't']
Words with frequencies (each word appears once; note that all three share the prefix "l o w"):
"l o w </w>" : 1
"l o w e r </w>" : 1
"l o w e s t </w>" : 1
Step 1: Count Pairs¶
| Pair | Frequency |
|---|---|
| (l, o) | 3 |
| (o, w) | 3 |
| (w, </w>) | 1 |
| (w, e) | 2 |
| (e, r) | 1 |
| (r, </w>) | 1 |
| (e, s) | 1 |
| (s, t) | 1 |
| (t, </w>) | 1 |
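The table can be reproduced from the Step 0 snippet with a few more lines:

pair_freqs = Counter()
for word, freq in word_freqs.items():
    for pair in zip(word, word[1:]):
        pair_freqs[pair] += freq

print(pair_freqs.most_common(3))
# [(('l', 'o'), 3), (('o', 'w'), 3), (('w', 'e'), 2)]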
Step 2: Merge Best Pair¶
Most frequent: (l, o) and (o, w) are tied at 3. Break the tie arbitrarily (here, take the pair encountered first): merge (l, o) → create new token "lo"
Vocabulary: ['l', 'o', 'w', '</w>', 'e', 'r', 's', 't', 'lo']
Words updated:
"lo w </w>" : 3
"lo w e r </w>" : 1
"lo w e s t </w>" : 1
Step 3: Repeat¶
New pair counts:
| Pair | Frequency |
|---|---|
| (lo, w) | 3 |
| (w, </w>) | 1 |
| (w, e) | 2 |
| ... | ... |
Merge (lo, w) → "low"
Continue...¶
On this tiny corpus, the next merge is (low, e) → "lowe" with frequency 2; after that, every remaining pair ties at frequency 1 and the order depends on tie-breaking. On a realistic corpus, where suffixes recur across many different words, the algorithm creates tokens like:
- "low" (the complete word "low")
- "er" (shared by lower, bigger, faster, ...)
- "est" (shared by lowest, biggest, fastest, ...)
The Merge Rules¶
BPE learns an ordered sequence of merge rules. From the walkthrough so far:
1. (l, o) → "lo"
2. (lo, w) → "low"
3. (low, e) → "lowe"
...
During encoding, these rules are applied in the same order they were learned.
Formal Algorithm¶
Training¶
from typing import Dict, List, Tuple

def train_bpe(corpus: List[str], vocab_size: int) -> Tuple[Dict, List]:
    """Train a BPE tokenizer.

    Returns:
        vocab: token → id mapping
        merges: list of (token_a, token_b) merge rules
    """
    # Initialize with the 256 byte values (represented as characters for simplicity)
    vocab = {chr(i): i for i in range(256)}

    # Split words into symbol sequences and count them
    word_freqs = count_word_frequencies(corpus)
    # word_freqs: {('l','o','w','</w>'): 1, ('l','o','w','e','r','</w>'): 1, ...}

    merges = []
    while len(vocab) < vocab_size:
        # Count all adjacent pairs across the corpus
        pair_freqs = count_pairs(word_freqs)
        if not pair_freqs:
            break

        # Find the most frequent pair
        best_pair = max(pair_freqs, key=pair_freqs.get)

        # Create the merged token and record the rule
        merged = best_pair[0] + best_pair[1]
        vocab[merged] = len(vocab)
        merges.append(best_pair)

        # Apply the merge to every word
        word_freqs = apply_merge(word_freqs, best_pair, merged)

    return vocab, merges
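The helpers count_word_frequencies, count_pairs, and apply_merge are left abstract above. Here is one minimal way they might look — a sketch consistent with the tuple-of-symbols representation in the comments, not a reference implementation:

from collections import Counter
from typing import Dict, List, Tuple

def count_word_frequencies(corpus: List[str]) -> Dict[Tuple[str, ...], int]:
    """Split each whitespace-separated word into characters plus </w>, and count."""
    freqs = Counter()
    for text in corpus:
        for word in text.split():
            freqs[tuple(word) + ("</w>",)] += 1
    return dict(freqs)

def count_pairs(word_freqs: Dict[Tuple[str, ...], int]) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in word_freqs.items():
        for pair in zip(word, word[1:]):
            pairs[pair] += freq
    return pairs

def apply_merge(word_freqs, pair, merged):
    """Rewrite every word, replacing each occurrence of `pair` with `merged`."""
    new_freqs = {}
    for word, freq in word_freqs.items():
        symbols, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                symbols.append(merged)
                i += 2
            else:
                symbols.append(word[i])
                i += 1
        new_freqs[tuple(symbols)] = freq
    return new_freqs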
Encoding¶
from typing import List, Tuple

def encode_bpe(text: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Encode text using learned BPE merges."""
    # Start with individual characters
    tokens = list(text)

    # Apply merges in the order they were learned
    for pair in merges:
        new_tokens = []
        i = 0
        while i < len(tokens):
            # Merge this pair wherever it occurs
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                new_tokens.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        tokens = new_tokens

    return tokens
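Applying the first two merges from the walkthrough:

>>> encode_bpe("lowest", [("l", "o"), ("lo", "w")])
['low', 'e', 's', 't']

This is the naive O(L × M) strategy discussed under Complexity Analysis below: the token list is re-scanned once per merge rule.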
Why BPE Works¶
Frequency-Based Compression¶
BPE creates tokens for common patterns:
"the" appears 1000 times → gets merged into one token
"xylophone" appears once → stays as characters
This follows Huffman coding principles: frequent things get short representations.
Morphological Discovery¶
BPE naturally discovers morphological structure:
Corpus contains: "playing", "played", "player", "plays"
BPE learns:
- "play" (common root)
- "ing" (common suffix)
- "ed" (common suffix)
- "er" (common suffix)
The algorithm doesn't know about morphology—it discovers it from frequency patterns.
Compositionality¶
Rare words are composed from frequent pieces, for example:
"unbreakable" → ["un", "break", "able"]
The model can generalize:
- "un-" indicates negation
- "-able" indicates possibility
Byte-Level BPE (GPT-2 Style)¶
GPT-2 uses byte-level BPE:
- Start with 256 byte tokens (not Unicode characters)
- All text is UTF-8 encoded first
- Merges operate on bytes, not characters
Advantages¶
- Handles any language without special tokenization
- Works with emojis, rare scripts, binary data
- No [UNK] tokens ever needed
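A quick way to see why this works (plain Python; GPT-2's actual implementation additionally remaps bytes to printable characters, which is omitted here):

>>> list("café 🙂".encode("utf-8"))
[99, 97, 102, 195, 169, 32, 240, 159, 153, 130]

The accented é becomes two bytes and the emoji four, yet every byte already has a token in the 256-entry base vocabulary, so no [UNK] is ever needed.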
The GPT-2 Tweak¶
GPT-2 adds special handling:
- Pretokenize by splitting on whitespace and punctuation
- Add space prefix to words (except first): " the" not "the"
- Run BPE within each pretokenized unit
This prevents merging across word boundaries.
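GPT-2's released code implements this pretokenization with a single regular expression along the following lines (quoted from memory; it needs the third-party regex module for the \p{...} character classes):

import regex as re

# Contractions first, then optionally space-prefixed runs of letters,
# digits, or punctuation, then residual whitespace.
pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

print(pat.findall("I'm learning about tokenization!"))
# ['I', "'m", ' learning', ' about', ' tokenization', '!']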
Complexity Analysis¶
Training Time¶
For vocabulary size V and corpus size N:
- Each iteration scans corpus: O(N)
- Number of iterations: O(V)
- Total: O(N × V)
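For a sense of scale (illustrative numbers): with N ≈ 10⁹ characters and V = 50,000 merges, the naive O(N × V) bound is on the order of 5 × 10¹³ pair-count updates.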
In practice, efficient implementations use data structures to update counts incrementally.
Encoding Time¶
For text length L and number of merges M:
- Naive: O(L × M) — re-scan the text once per merge rule
- Optimized: index which pairs actually occur in the text and apply only the merges that can fire, so cost scales with the number of merges applied rather than with M
Example: Real BPE Output¶
After training on a larger corpus, a BPE tokenizer might produce output like this (token strings shown instead of ids for readability):
>>> tokenizer.encode("The transformer architecture is revolutionary")
['The', ' transform', 'er', ' architecture', ' is', ' revolution', 'ary']
>>> tokenizer.encode("I'm learning about tokenization!")
['I', "'m", ' learning', ' about', ' token', 'ization', '!']
Notice:
- Common words stay whole: "The", "is"
- Derived words split at morpheme boundaries: "transform" + "er"
- Morphological structure preserved: "revolution" + "ary"
Limitations of BPE¶
- Greedy algorithm: May not find the globally optimal tokenization (see the sketch after this list)
- Training corpus dependent: Different corpora → different vocabularies
- Language-specific quirks: Works best for space-separated languages
- No probability model: Just frequency counts
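A hypothetical merge list, chosen only to illustrate the first point, shows how applying merges in learned order can miss a shorter segmentation that the same vocabulary allows:

>>> merges = [("b", "c"), ("a", "b"), ("c", "d")]
>>> encode_bpe("abcd", merges)
['a', 'bc', 'd']

The vocabulary built from these merges also contains "ab" and "cd", so the two-token segmentation ['ab', 'cd'] exists, but greedy in-order merging never finds it.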
Summary¶
| Aspect | Description |
|---|---|
| Core idea | Iteratively merge most frequent pairs |
| Initialization | Characters or bytes |
| Termination | Desired vocabulary size reached |
| Encoding | Apply merges in training order |
| Strength | Simple, effective, widely used |
| Used by | GPT-2, GPT-3, GPT-4, many others |
Next: We'll see how WordPiece modifies this algorithm with a likelihood-based criterion.