Section 7.1: The Tokenization Problem¶
Reading time: 10 minutes
The Challenge¶
Neural networks operate on numerical vectors, but language is composed of characters, words, and meanings. Tokenization bridges this gap by converting text into discrete units.
But how should we split text? Consider the sentence "I'm learning about tokenization!"
Possible segmentations:
| Method | Result | Token Count |
|---|---|---|
| Whitespace | ["I'm", "learning", "about", "tokenization!"] | 4 |
| Word + punctuation | ["I", "'m", "learning", "about", "tokenization", "!"] | 6 |
| Character | ["I", "'", "m", " ", "l", "e", "a", "r", "n", ...] | 32 |
| Subword | ["I", "'m", " learn", "ing", " about", " token", "ization", "!"] | 8 |
There's no obviously "correct" answer—each choice involves trade-offs.
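The first three segmentations are easy to reproduce directly. Here is a minimal Python sketch (the regular expression is just one possible word-plus-punctuation rule, and the subword split needs a learned vocabulary, so it is omitted):

```python
import re

text = "I'm learning about tokenization!"

# Whitespace: split on runs of whitespace -> 4 tokens
whitespace_tokens = text.split()

# Word + punctuation: keep contractions and punctuation as separate tokens -> 6 tokens
word_punct_tokens = re.findall(r"\w+|'\w+|[^\w\s]", text)

# Character: every character (including spaces) is a token -> 32 tokens
char_tokens = list(text)

print(whitespace_tokens)  # ["I'm", 'learning', 'about', 'tokenization!']
print(word_punct_tokens)  # ['I', "'m", 'learning', 'about', 'tokenization', '!']
print(char_tokens[:8])    # ['I', "'", 'm', ' ', 'l', 'e', 'a', 'r']
```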
The Vocabulary Problem¶
The simplest approach is word-level tokenization: each unique word gets a token ID.
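As a minimal sketch of the idea (the toy corpus below is invented for illustration):

```python
corpus = ["the cat sat", "the cat ran"]  # toy corpus

# Assign an integer ID to each unique word, in order of first appearance
vocab = {}
for sentence in corpus:
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab)

print(vocab)  # {'the': 0, 'cat': 1, 'sat': 2, 'ran': 3}

# Encoding replaces each word with its ID
ids = [vocab[w] for w in "the cat ran".split()]
print(ids)    # [0, 1, 3]
```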
Problem 1: Vocabulary Explosion¶
English has roughly 170,000 words in current use. But that's just the beginning:
- Technical terms: "transformer", "backpropagation", "eigenvalue"
- Proper nouns: company names, people, places
- Morphological variants: "run", "runs", "running", "ran"
- Compound words: "self-attention", "machine-learning"
- Neologisms: "GPT-4", "COVID-19"
A word-level vocabulary for a large corpus might need millions of entries.
Problem 2: Out-of-Vocabulary (OOV) Tokens¶
What happens when the model encounters a word it hasn't seen during training?
vocab = {"the": 0, "cat": 1, "sat": 2}  # ...plus every other word in the training corpus
text = "The xerophyte grew in the desert"  # "xerophyte" not in vocab!
Options:
- [UNK] token: Replace unknown words with a special token
  - Loses all information about the unknown word
  - "The [UNK] grew" tells us nothing
- Subword fallback: Break unknown words into known pieces (see the sketch below)
  - "xerophyte" → "xero" + "phyte" (if these pieces are in the vocabulary)
  - Preserves morphological information
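A rough sketch of both options (the toy vocabulary below, including the pieces "xero" and "phyte", is invented for illustration):

```python
vocab = {"[UNK]": 0, "the": 1, "grew": 2, "in": 3, "desert": 4, "xero": 5, "phyte": 6}

def encode_with_unk(words):
    # Option 1: map anything unknown to the special [UNK] token
    return [vocab.get(w, vocab["[UNK]"]) for w in words]

def encode_with_fallback(words):
    # Option 2: greedily break unknown words into the longest known pieces
    ids = []
    for w in words:
        start = 0
        while start < len(w):
            for end in range(len(w), start, -1):
                if w[start:end] in vocab:
                    ids.append(vocab[w[start:end]])
                    start = end
                    break
            else:
                ids.append(vocab["[UNK]"])  # no known piece starts here; give up on one character
                start += 1
    return ids

words = "the xerophyte grew in the desert".split()
print(encode_with_unk(words))       # [1, 0, 2, 3, 1, 4]    -- "xerophyte" becomes [UNK]
print(encode_with_fallback(words))  # [1, 5, 6, 2, 3, 1, 4]  -- "xerophyte" becomes "xero" + "phyte"
```

The fallback variant keeps the information that the word contains "phyte", which the [UNK] version throws away.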
The Sequence Length Problem¶
Attention mechanisms have complexity \(O(n^2)\), where \(n\) is the sequence length:
| Tokenization | Vocab Size | Sequence Length | Attention Cost |
|---|---|---|---|
| Words | ~500,000 | ~100 | \(100^2 = 10,000\) |
| Characters | ~300 | ~500 | \(500^2 = 250,000\) |
| Subwords | ~50,000 | ~150 | \(150^2 = 22,500\) |
Character-level models pay a heavy computational price for their flexibility.
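To see why, compare word-level and character-level sequence lengths for the same passage; a back-of-the-envelope sketch (the repeated sentence is just a stand-in for roughly 100 words of text):

```python
paragraph = "I'm learning about tokenization! " * 25  # stand-in for a ~100-word passage

n_words = len(paragraph.split())  # word-level sequence length
n_chars = len(paragraph)          # character-level sequence length

print(n_words, n_words ** 2)      # 100 words -> 10,000 attention scores
print(n_chars, n_chars ** 2)      # 825 chars -> 680,625 attention scores
```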
The Linguistic Granularity Problem¶
What's the "right" unit of meaning?
Morphemes vs. Words¶
Consider the word "unhappiness":
- As one word: The model must learn this specific word
- As morphemes: "un" + "happy" + "ness"
- The model can generalize: "un-" means negation, "-ness" creates nouns
- Works for "unkindness", "unfairness", etc.
The Compositionality Principle¶
Subword tokenization captures compositional semantics: "unhappiness" → "un" + "happy" + "ness", "unkindness" → "un" + "kind" + "ness". The shared pieces carry shared meaning.
This enables generalization to unseen words through known components.
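A tiny illustration of that point (the pieces and IDs below are invented): even if "unfairness" never appeared in training, every ID in its encoding has already been seen as part of other words.

```python
subword_ids = {"un": 0, "happy": 1, "kind": 2, "fair": 3, "ness": 4}  # invented IDs

print([subword_ids[p] for p in ["un", "happy", "ness"]])  # unhappiness -> [0, 1, 4]
print([subword_ids[p] for p in ["un", "fair", "ness"]])   # unfairness  -> [0, 3, 4]
```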
Design Criteria for Tokenization¶
A good tokenization scheme should:
- Have a bounded vocabulary: Thousands of tokens, not millions
- Handle any input: No OOV tokens (or minimal OOV rate)
- Produce reasonable sequence lengths: Not too long for attention
- Capture meaningful units: Subwords should correspond to linguistic units
- Be deterministic: Same input → same output
- Be efficient: Fast encoding and decoding
The Data-Driven Solution¶
Modern tokenizers are learned from data, not hand-designed:
- Start with some base units (bytes or characters)
- Learn which sequences should be merged into tokens (sketched after this list)
- The corpus determines the vocabulary
This approach:
- Adapts to the domain (scientific text, code, conversation)
- Automatically handles morphology
- Requires no linguistic expertise to create
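A toy sketch of steps 1 and 2 (the corpus is invented; real tokenizers count pairs over word frequencies in a large corpus and record each merge as a reusable rule, but the core loop is this kind of frequency-based merging, which is the strategy behind BPE):

```python
from collections import Counter

corpus = ["low", "lower", "lowest", "newer", "newest"]  # invented toy corpus

# Step 1: start from characters as the base units
words = [list(w) for w in corpus]

# Step 2: repeatedly merge the most frequent adjacent pair into a new token
for _ in range(2):
    pairs = Counter()
    for w in words:
        pairs.update(zip(w, w[1:]))          # count adjacent pairs
    (a, b), _ = pairs.most_common(1)[0]      # most frequent pair
    for w in words:
        i = 0
        while i < len(w) - 1:
            if (w[i], w[i + 1]) == (a, b):
                w[i:i + 2] = [a + b]         # replace the pair with one merged token
            else:
                i += 1

print(words)
# [['lo', 'w'], ['lo', 'we', 'r'], ['lo', 'we', 's', 't'],
#  ['n', 'e', 'we', 'r'], ['n', 'e', 'we', 's', 't']]
```

After two merges, the frequent sequences "we" and "lo" have become single tokens; the corpus, not a linguist, decided that.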
Three Algorithms¶
The three main subword tokenization algorithms:
| Algorithm | Strategy | Used By |
|---|---|---|
| BPE | Merge most frequent pairs | GPT-2, GPT-3, GPT-4 |
| WordPiece | Merge pairs that maximize likelihood | BERT, DistilBERT |
| Unigram | Start large, prune low-impact tokens | T5 (via SentencePiece) |
All three produce similar results but use different optimization criteria.
Summary¶
| Concept | Key Point |
|---|---|
| Vocabulary size | Trade-off between coverage and tractability |
| Sequence length | Affects attention cost (\(O(n^2)\)) |
| OOV handling | Subword tokenization eliminates OOV |
| Compositionality | Subwords can capture morphological structure |
| Data-driven | Modern tokenizers are learned, not designed |
Next: We'll explore the design space of character vs. subword tokenization in detail.