Section 7.4: WordPiece¶
Reading time: 12 minutes
Overview¶
WordPiece is the tokenization algorithm used by BERT, DistilBERT, and related models. It's similar to BPE but uses a different criterion for selecting which pairs to merge.
BPE vs. WordPiece¶
| Aspect | BPE | WordPiece |
|---|---|---|
| Merge criterion | Most frequent pair | Pair that maximizes likelihood |
| Question asked | "What's most common?" | "What co-occurs more than chance?" |
| Optimization | Greedy frequency | Greedy likelihood |
The Likelihood Criterion¶
WordPiece asks: "Which merge would most increase the likelihood of the training data?"
Intuition¶
Consider two potential merges:
- "t" + "h" → "th" (both very common individually)
- "q" + "u" → "qu" (q rarely appears without u)
BPE might prefer (1) because "t" and "h" are both frequent.
WordPiece prefers (2) because "qu" co-occurs more than chance would predict.
The Score¶
For a pair (a, b), WordPiece computes:

score(a, b) = freq(ab) / (freq(a) × freq(b))

This is essentially pointwise mutual information (PMI):

PMI(a, b) = log( P(ab) / (P(a) · P(b)) )

Up to a constant factor (the corpus size) and the monotonic log, ranking pairs by the score is the same as ranking them by PMI. A high score means a and b co-occur much more often than independent chance would predict.
Example¶
Corpus frequencies:
- freq("q") = 100
- freq("u") = 5000
- freq("qu") = 95
- freq("t") = 8000
- freq("h") = 6000
- freq("th") = 4000
Scores:
- score("q", "u") = 95 / (100 × 5000) = 1.9 × 10⁻⁴
- score("t", "h") = 4000 / (8000 × 6000) ≈ 8.3 × 10⁻⁵
WordPiece prefers "qu" despite lower absolute frequency!
The ## Prefix Convention¶
WordPiece uses a special prefix ## to mark continuation tokens, i.e. pieces that do not start a word. For example, "playing" might be tokenized as ["play", "##ing"].
This distinguishes:
- "play" (start of word) — ID 1234
- "##play" (continuation) — ID 5678
Why This Matters¶
Without the prefix, the word "play" on its own and the fragment "play" at the end of a compound such as "roleplay" would map to the same vocabulary entry:

- "play" → ["play"]
- "roleplay" → ["role", "play"]

Both outputs contain "play", but they represent different things:

- First: a standalone word
- Second: part of a compound

The ## prefix keeps these separate in the vocabulary: with it, "roleplay" becomes ["role", "##play"], which shares no token with the standalone word "play".
Algorithm¶
Training¶
```python
from collections import Counter
from typing import Dict, List


def split_to_chars(word: str) -> List[str]:
    """Split a word into characters, prefixing non-initial characters with '##'."""
    return [word[0]] + ['##' + char for char in word[1:]]


def train_wordpiece(corpus: List[str], vocab_size: int) -> Dict[str, int]:
    """Train a WordPiece tokenizer."""
    # Initialize with special tokens
    vocab = {'[PAD]': 0, '[UNK]': 1, '[CLS]': 2, '[SEP]': 3, '[MASK]': 4}

    # Collect word frequencies (simple whitespace pre-tokenization)
    word_freqs = Counter(word for text in corpus for word in text.split())

    # Initialize vocabulary with characters
    for word in word_freqs:
        for i, char in enumerate(word):
            token = char if i == 0 else '##' + char
            if token not in vocab:
                vocab[token] = len(vocab)

    # Convert words to token sequences
    word_tokens = {word: split_to_chars(word) for word in word_freqs}

    # Iteratively merge
    while len(vocab) < vocab_size:
        # Count token and pair frequencies under the current segmentation
        token_freqs = Counter()
        pair_freqs = Counter()
        for word, freq in word_freqs.items():
            tokens = word_tokens[word]
            for token in tokens:
                token_freqs[token] += freq
            for i in range(len(tokens) - 1):
                pair_freqs[(tokens[i], tokens[i + 1])] += freq

        # Find best pair by WordPiece score: freq(ab) / (freq(a) * freq(b))
        best_pair, best_score = None, -1.0
        for pair, freq in pair_freqs.items():
            score = freq / (token_freqs[pair[0]] * token_freqs[pair[1]])
            if score > best_score:
                best_score = score
                best_pair = pair
        if best_pair is None:
            break  # nothing left to merge

        # Create the merged token, dropping the '##' of the second element
        first, second = best_pair
        merged = first + (second[2:] if second.startswith('##') else second)
        if merged not in vocab:
            vocab[merged] = len(vocab)

        # Apply the merge to all words
        for word, tokens in word_tokens.items():
            new_tokens = []
            i = 0
            while i < len(tokens):
                if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best_pair:
                    new_tokens.append(merged)
                    i += 2
                else:
                    new_tokens.append(tokens[i])
                    i += 1
            word_tokens[word] = new_tokens

    return vocab
```
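A quick usage sketch of the trainer above on a toy corpus (the corpus text and vocabulary size are arbitrary, chosen only to produce a few merges):

```python
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the quiet queen asked a quick question",
]

vocab = train_wordpiece(corpus, vocab_size=80)

# Show the multi-character pieces created by merges (skip special and single-char tokens).
learned = [tok for tok in vocab
           if not tok.startswith('[') and len(tok.lstrip('#')) > 1]
print(learned)
```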
Encoding (Greedy Longest Match)¶
WordPiece encoding uses greedy longest-match:
```python
from typing import Dict, List


def encode_wordpiece(word: str, vocab: Dict[str, int]) -> List[str]:
    """Encode a single word using greedy longest-match WordPiece."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        found = False
        # Find the longest substring starting at `start` that is in the vocabulary
        while start < end:
            substr = word[start:end]
            if start > 0:
                substr = '##' + substr  # word-internal pieces carry the ## prefix
            if substr in vocab:
                tokens.append(substr)
                found = True
                break
            end -= 1
        if not found:
            # Simplification: skip one character. BERT's reference implementation
            # instead replaces the entire word with [UNK].
            tokens.append('[UNK]')
            start += 1
        else:
            start = end
    return tokens
```
Example Encoding¶
```python
vocab = {'un': 1, '##believ': 2, '##able': 3, 'the': 4, '##s': 5, ...}

encode_wordpiece("unbelievable", vocab)
# Try "unbelievable"  → not in vocab
# Try "unbelievabl"   → not in vocab
# ...
# Try "un"            → in vocab! → ["un"]
# Continue from position 2 (remaining pieces get the ## prefix)
# Try "##believable"  → not in vocab
# ...
# Try "##believ"      → in vocab! → ["un", "##believ"]
# Continue from position 8
# Try "##able"        → in vocab! → ["un", "##believ", "##able"]
```
Result: ["un", "##believ", "##able"]
Comparison with BPE¶
Same Input, Different Tokenizations¶
For the same vocabulary size, BPE and WordPiece often produce similar (but not identical) tokenizations:
| Input | BPE | WordPiece |
|---|---|---|
| "tokenization" | ["token", "ization"] | ["token", "##ization"] |
| "transformer" | ["transform", "er"] | ["transform", "##er"] |
| "unhappy" | ["un", "happy"] | ["un", "##happy"] |
Key Differences¶
- Continuation markers: WordPiece uses ##; BPE (as used in GPT-2) typically marks word starts with a leading-space symbol instead (see the sketch below)
- Merge selection: frequency (BPE) vs. likelihood (WordPiece)
- Vocabulary composition: slightly different token sets
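One quick way to see the difference in practice is to run the same word through pretrained tokenizers from the Hugging Face transformers library. A sketch, assuming the transformers package is installed (the calls download the two vocabularies on first use, and the exact pieces depend on each model's learned vocabulary):

```python
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")               # byte-level BPE

word = "tokenization"
print(bert_tok.tokenize(word))  # WordPiece pieces use ## for continuations, e.g. ['token', '##ization']
print(gpt2_tok.tokenize(word))  # BPE pieces carry no ## markers
```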
BERT's Special Tokens¶
BERT's WordPiece vocabulary includes special tokens:
| Token | Meaning | ID |
|---|---|---|
| [PAD] | Padding for batching | 0 |
| [UNK] | Unknown token | 100 |
| [CLS] | Start of sequence (classification) | 101 |
| [SEP] | Separator between segments | 102 |
| [MASK] | Masked token for MLM training | 103 |
Example: a single input is wrapped as [CLS] tokens [SEP]; a sentence pair is packed as [CLS] sentence A [SEP] sentence B [SEP], with [PAD] tokens filling the sequence up to the batch length.
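A minimal sketch of this packing (the helper build_bert_input is illustrative, not part of any library, and it operates on already-tokenized word pieces):

```python
from typing import List

def build_bert_input(tokens_a: List[str], tokens_b: List[str], max_len: int) -> List[str]:
    """Wrap two already-tokenized segments with BERT's special tokens and pad."""
    tokens = ['[CLS]'] + tokens_a + ['[SEP]'] + tokens_b + ['[SEP]']
    tokens += ['[PAD]'] * (max_len - len(tokens))
    return tokens

print(build_bert_input(['the', 'cat'], ['sat', 'down'], max_len=10))
# ['[CLS]', 'the', 'cat', '[SEP]', 'sat', 'down', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
```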
Advantages of WordPiece¶
- Better handling of rare combinations: The likelihood criterion prefers truly associated pairs
- Clear word boundaries: The ## prefix makes boundaries explicit
- Well-suited for masked LM: Clear token identity for masking
Limitations¶
- Greedy encoding: May not find the optimal segmentation (see the sketch after this list)
- Slower training: Computing likelihood scores is more expensive than counting
- Language-specific: Assumes whitespace word boundaries
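The greedy-encoding limitation is easy to demonstrate with encode_wordpiece from above and a toy vocabulary (made up for illustration): the longest first match can block a segmentation that would have covered the whole word.

```python
toy_vocab = {'a': 1, 'ab': 2, '##bc': 3}  # note: no '##c' entry

print(encode_wordpiece("abc", toy_vocab))
# Greedy matches "ab" first, leaving "c" with no '##c' entry → ['ab', '[UNK]']
# The non-greedy segmentation ['a', '##bc'] would have covered the word exactly.
```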
Summary¶
| Aspect | WordPiece |
|---|---|
| Merge criterion | Maximize likelihood (PMI) |
| Continuation marker | ## prefix |
| Encoding | Greedy longest match |
| Special tokens | [CLS], [SEP], [MASK], etc. |
| Used by | BERT, DistilBERT, ELECTRA |
Next: We'll explore the Unigram approach, which takes a completely different strategy.