Stage 7: Tokenization

The bridge between raw text and neural networks

Overview

Tokenization is the process of converting raw text into discrete units (tokens) that neural networks can process. While it may seem like a preprocessing detail, the choice of tokenization scheme profoundly affects model performance, efficiency, and capabilities.
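
To make this concrete, here is a minimal sketch of what a tokenizer ultimately produces: a sequence of integer IDs that an embedding layer can consume. The vocabulary and sentence below are made up for illustration; real tokenizers learn their vocabularies from a corpus.

```python
# Toy word-level tokenizer: raw text -> tokens -> integer IDs.
# The vocabulary is hypothetical; real tokenizers learn it from data.
vocab = {"to": 0, "be": 1, "or": 2, "not": 3}

def encode(text: str) -> list[int]:
    """Split on whitespace and look up each word's ID."""
    return [vocab[word] for word in text.split()]

print(encode("to be or not to be"))  # [0, 1, 2, 3, 0, 1]
```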

In this stage, we'll derive and implement three major tokenization algorithms from first principles:

  1. Byte Pair Encoding (BPE) - Used by GPT-2, GPT-3, and GPT-4 (previewed in the sketch after this list)
  2. WordPiece - Used by BERT
  3. Unigram Language Model - Implemented in the SentencePiece library and used by models such as T5 and ALBERT
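
As a preview of the kind of procedure we will derive, here is a minimal sketch of a single BPE merge step on a toy three-word corpus (the full derivation lives in the BPE section): count adjacent symbol pairs and merge the most frequent one into a new symbol.

```python
from collections import Counter

# One BPE merge step on a toy corpus: count adjacent symbol pairs,
# then merge the most frequent pair into a new symbol.
corpus = [list("low"), list("lower"), list("lowest")]

pairs = Counter()
for word in corpus:
    pairs.update(zip(word, word[1:]))

best = max(pairs, key=pairs.get)
print(best, pairs[best])  # ('l', 'o') occurs 3 times (tied with ('o', 'w')) -> merge into "lo"
```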

Why Tokenization Matters

Consider the sentence: "The transformer architecture is revolutionary."

How should we split this for a neural network?

Approach   | Tokens                                                                          | Count
Words      | ["The", "transformer", "architecture", "is", "revolutionary", "."]             | 6
Characters | ["T", "h", "e", " ", "t", "r", "a", ...]                                        | 46
Subwords   | ["The", " transform", "er", " architecture", " is", " revolution", "ary", "."] | 8

Each approach has trade-offs:

  • Word-level: Short sequences, but a huge vocabulary and no way to represent unseen words
  • Character-level: Tiny vocabulary, but very long sequences, and the model must learn spelling
  • Subword-level: Balances vocabulary size against sequence length

The Fundamental Trade-off

\[\text{Sequence Length} \times \text{Vocabulary Size} \approx \text{constant}\]

  • Larger vocabulary → shorter sequences → cheaper attention, whose cost scales as O(n²) in sequence length (see the quick calculation after this list)
  • Smaller vocabulary → longer sequences → slower, but more flexible
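
A quick back-of-the-envelope check using the example sentence above (46 character tokens vs. 8 subword tokens):

```python
# Self-attention compares every token with every other token, so cost grows as n^2.
char_len, subword_len = 46, 8

char_pairs = char_len ** 2         # 2116 pairwise attention scores
subword_pairs = subword_len ** 2   # 64 pairwise attention scores

print(char_pairs / subword_pairs)  # ~33x fewer interactions with subword tokens
```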

Modern LLMs use subword tokenization with vocabularies of 32K-100K tokens.

Learning Objectives

By the end of this stage, you will:

  1. Understand why subword tokenization dominates modern NLP
  2. Derive the BPE algorithm from first principles
  3. Understand how WordPiece differs from BPE
  4. Implement a Unigram tokenizer
  5. Analyze the trade-offs in vocabulary size selection

Sections

  1. The Tokenization Problem - Why this is hard
  2. Character vs. Subword - The design space
  3. Byte Pair Encoding - The algorithm behind GPT
  4. WordPiece - The algorithm behind BERT
  5. Unigram Language Model - A probabilistic approach
  6. Vocabulary Size Trade-offs - How to choose
  7. Implementation - Building tokenizers from scratch

Prerequisites

  • Understanding of n-gram language models (Stage 1)
  • Basic probability (Stage 1)
  • Familiarity with the attention mechanism (Stage 5) to understand sequence length trade-offs

Key Insight

Tokenization is not just preprocessing—it defines the atomic units of meaning that your model can learn. A good tokenizer creates tokens that correspond to meaningful linguistic units while keeping the vocabulary tractable.
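
As a hypothetical illustration (neither split comes from a real trained tokenizer), compare two ways of cutting the same word into three pieces:

```python
# Two hypothetical three-piece splits of "unhappiness" (illustrative only):
good = ["un", "happi", "ness"]   # prefix / stem / suffix: pieces that recur across many words
bad  = ["unh", "appin", "ess"]   # same number of pieces, but they carry no reusable meaning
print(good, bad)
```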

Code & Resources

Resource                   | Description
code/stage-07/tokenizer.py | BPE and tokenizer implementations
code/stage-07/tests/       | Test suite
Exercises                  | Practice problems
Common Mistakes            | Debugging guide