Section 7.2: Character vs. Subword Tokenization¶
Reading time: 8 minutes
The Design Space¶
Tokenization exists on a spectrum from fine-grained to coarse-grained:
Bytes → Characters → Subwords → Words → Phrases

At the fine-grained end (bytes and characters), the vocabulary is tiny (as few as 256 tokens) but sequences are long; at the coarse-grained end (words and phrases), the vocabulary grows to millions of tokens but sequences are short.
Let's analyze each end of this spectrum.
Character-Level Tokenization¶
The simplest approach: each character is a token.
Implementation¶
    class CharTokenizer:
        def __init__(self):
            self.vocab = {}

        def train(self, texts):
            # Collect every distinct character in the corpus
            chars = set()
            for text in texts:
                chars.update(text)
            # Assign ids in sorted order for determinism
            self.vocab = {c: i for i, c in enumerate(sorted(chars))}

        def encode(self, text):
            return [self.vocab[c] for c in text]

        def decode(self, ids):
            inv_vocab = {v: k for k, v in self.vocab.items()}
            return ''.join(inv_vocab[i] for i in ids)
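A minimal usage sketch (the training text is invented for illustration):

    tok = CharTokenizer()
    tok.train(["hello world"])
    ids = tok.encode("hello")
    print(ids)              # [3, 2, 4, 4, 5] -- one id per character
    print(tok.decode(ids))  # "hello"
    # Note: a character never seen during training would raise a KeyError in encode()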
Advantages¶
- No OOV tokens: Any text can be encoded
- Tiny vocabulary: ~100-300 tokens for most alphabetic languages
- Simple implementation: Just character lookups
- Handles typos and neologisms: "transformerr" still works
Disadvantages¶
- Long sequences: "transformer" = 11 tokens
- Expensive attention: \(O(n^2)\) where n is sequence length
- Must learn spelling: The model must learn that "cat" and "c-a-t" are related
- No explicit morphology: No built-in notion of prefixes, suffixes
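To make the attention cost concrete, here is a back-of-the-envelope comparison; the per-word token counts are rough assumptions (about 6 characters per English word including spaces, versus roughly 1.3 subword tokens per word):

    words = 1_000                        # assumed document length in words
    char_tokens = words * 6              # ~6 characters per word, including spaces (assumption)
    subword_tokens = int(words * 1.3)    # ~1.3 subword tokens per word (assumption)

    # Self-attention cost scales with the square of the sequence length
    print((char_tokens ** 2) / (subword_tokens ** 2))   # ~21x more attention compute at the character level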
When to Use Characters¶
- Low-resource languages: Limited training data
- Code: Where every character matters
- Noisy text: Typos, OCR errors, social media
- Small models: When vocabulary size dominates parameters
Word-Level Tokenization¶
Traditional NLP approach: split on whitespace and punctuation.
Implementation¶
    from collections import Counter

    class WordTokenizer:
        def __init__(self, unk_token='<UNK>'):
            self.vocab = {}
            self.unk_token = unk_token

        def train(self, texts, max_vocab=50000):
            # Count word frequencies across the corpus
            # (simplified: splits on whitespace only; real word tokenizers also handle punctuation)
            word_counts = Counter()
            for text in texts:
                word_counts.update(text.split())
            # Reserve id 0 for <UNK>, then add the most frequent words
            self.vocab = {self.unk_token: 0}
            for word, _ in word_counts.most_common(max_vocab - 1):
                self.vocab[word] = len(self.vocab)

        def encode(self, text):
            # Out-of-vocabulary words fall back to the <UNK> id
            unk_id = self.vocab[self.unk_token]
            return [self.vocab.get(w, unk_id) for w in text.split()]
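A usage sketch showing the OOV problem in action (the training text is invented for illustration):

    tok = WordTokenizer()
    tok.train(["the cat sat on the mat"])
    print(tok.encode("the cat sat"))   # every word is in the vocabulary
    print(tok.encode("the dog sat"))   # "dog" was never seen, so it maps to <UNK> (id 0)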
Advantages¶
- Short sequences: One token per word
- Linguistically intuitive: Tokens are words
- Fast attention: Fewer tokens = faster
Disadvantages¶
- Huge vocabulary: English needs 100K+ words
- OOV problem: Unknown words collapse to a single <UNK> token
- Morphological blindness: "run", "runs", "running" are unrelated
- Language-specific: Some languages don't use spaces (Chinese, Japanese)
Subword Tokenization: The Middle Ground¶
Subword tokenization finds a balance:
- Common words get their own tokens
- Rare words are split into common subwords
Example¶
| Word | Subword Tokens | Interpretation |
|---|---|---|
| the | ["the"] | Common → single token |
| transformer | ["trans", "former"] | Split into known pieces |
| unhappiness | ["un", "happi", "ness"] | Morphemes preserved |
| GPT-4 | ["G", "PT", "-", "4"] | Unknown → character fallback |
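The splits above are illustrative; real tokenizers differ in the details. As a sketch, assuming the tiktoken library is installed, you can inspect what GPT-2's actual byte-level BPE produces:

    import tiktoken  # assumed available: pip install tiktoken

    enc = tiktoken.get_encoding("gpt2")
    for word in ["the", "transformer", "unhappiness"]:
        ids = enc.encode(" " + word)                  # GPT-2's BPE is sensitive to the leading space
        print(word, [enc.decode([i]) for i in ids])   # common words -> one token, rarer words -> several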
The Key Insight¶
Zipf's law tells us that word frequencies follow a power law:

\(f(r) \propto \dfrac{1}{r}\)

where \(f(r)\) is the frequency of the word at rank r (most common word = rank 1).
This means:
- A few words are extremely common (the, of, and, to)
- Most words are rare
Subword tokenization exploits this:
- Give common words their own tokens
- Build rare words from common pieces
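You can check Zipf's law on any sizeable plain-text file with a few lines (the file path below is a placeholder):

    from collections import Counter

    with open("corpus.txt", encoding="utf-8") as f:   # placeholder path to any large text file
        counts = Counter(f.read().lower().split())

    # Under Zipf's law, rank * frequency stays roughly constant
    for rank, (word, freq) in enumerate(counts.most_common(1000), start=1):
        if rank in (1, 10, 100, 1000):
            print(rank, word, freq, rank * freq)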
Comparing the Approaches¶
Consider the text: "The transformers are transforming NLP"
| Approach | Tokens | Count | Vocab Needed |
|---|---|---|---|
| Character | ["T","h","e"," ","t","r",...] | 37 | ~70 |
| Word | ["The","transformers","are","transforming","NLP"] | 5 | ~50,000 |
| Subword | ["The"," transform","ers"," are"," transform","ing"," NLP"] | 7 | ~10,000 |
Subword achieves a balance:
- Reasonable vocabulary size
- Moderate sequence length
- Captures that "transformers" and "transforming" share a root
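You can reproduce the character and word rows with the tokenizers defined earlier in this section (the subword row requires a trained BPE model, which the next section builds):

    text = "The transformers are transforming NLP"

    char_tok = CharTokenizer()
    char_tok.train([text])
    print(len(char_tok.encode(text)))   # 37 -- one token per character

    word_tok = WordTokenizer()
    word_tok.train([text])
    print(len(word_tok.encode(text)))   # 5 -- one token per word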
The Vocabulary Size vs. Sequence Length Trade-off¶
There's an approximate invariant:

\(V \times L \approx C\)

where:
- V = vocabulary size
- L = average sequence length (in tokens)
- C = a constant that depends on the text
Doubling vocabulary roughly halves sequence length (for subwords).
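As a rough illustration with invented numbers: if a corpus averages \(L = 1{,}200\) tokens per document at \(V = 16{,}000\), the invariant predicts roughly \(L \approx 600\) tokens at \(V = 32{,}000\).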
Optimal Operating Point¶
Modern LLMs use vocabularies of 32K-100K tokens:
| Model | Vocabulary Size | Tokenizer |
|---|---|---|
| GPT-2 | 50,257 | BPE |
| GPT-4 | ~100,000 | BPE variant |
| BERT | 30,522 | WordPiece |
| LLaMA | 32,000 | SentencePiece |
| Claude | ~100,000 | BPE variant |
Byte-Level Tokenization¶
A modern variant: start with bytes (256 possible values) instead of characters.
Advantages¶
- Handles any encoding (UTF-8, UTF-16, binary)
- No character-level preprocessing needed
- Truly universal: works for any language
Used By¶
- GPT-2, GPT-3, GPT-4 (byte-level BPE)
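A quick sketch of why bytes are universal: UTF-8 turns any string, in any script, into values between 0 and 255, so the base vocabulary never needs to grow:

    # Every string becomes a sequence of byte values in the 0-255 range;
    # non-ASCII characters simply expand to multiple bytes.
    for s in ["cat", "café", "猫"]:
        print(s, list(s.encode("utf-8")))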
Summary¶
| Approach | Vocab Size | Seq Length | OOV Handling | Best For |
|---|---|---|---|---|
| Character | ~100 | Very long | Perfect | Low-resource |
| Word | ~100K | Short | Poor | Traditional NLP |
| Subword | ~50K | Medium | Excellent | Modern LLMs |
| Byte | 256 base | Long | Perfect | Universal |
Next: We'll derive Byte Pair Encoding (BPE), the tokenization algorithm behind GPT.