Section 3.5: Building a Character-Level Neural Language Model

Theory meets practice. In this section, we'll build a complete neural language model from scratch, using only the autograd system we developed in Stage 2.

By the end, you'll have a working model that learns to generate text character by character.

The Complete Architecture

Let's specify exactly what we're building:

Input: k previous characters [c_{t-k}, ..., c_{t-1}]
Output: P(c_t | context) for each character in vocabulary

Architecture:
  1. Embedding layer: map each character to d-dimensional vector
  2. Concatenate: combine k embeddings into one vector
  3. Hidden layer 1: linear + ReLU
  4. Hidden layer 2: linear + ReLU
  5. Output layer: linear (logits)
  6. Softmax: convert to probabilities

Prerequisites: The Value Class

First, let's ensure we have our Stage 2 autograd. Here's the complete Value class:

import math
import random

class Value:
    """Scalar value with automatic differentiation."""

    def __init__(self, data, _parents=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._parents = set(_parents)
        self._op = _op

    def __repr__(self):
        return f"Value(data={self.data:.4f})"

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += out.grad * other.data
            other.grad += out.grad * self.data
        out._backward = _backward
        return out

    def __pow__(self, n):
        out = Value(self.data ** n, (self,), f'**{n}')
        def _backward():
            self.grad += out.grad * (n * self.data ** (n - 1))
        out._backward = _backward
        return out

    def __neg__(self):
        return self * -1

    def __sub__(self, other):
        return self + (-other)

    def __truediv__(self, other):
        return self * (other ** -1)

    def __radd__(self, other):
        return self + other

    def __rmul__(self, other):
        return self * other

    def __rsub__(self, other):
        return other + (-self)

    def __rtruediv__(self, other):
        return other * (self ** -1)

    def exp(self):
        out = Value(math.exp(self.data), (self,), 'exp')
        def _backward():
            self.grad += out.grad * out.data
        out._backward = _backward
        return out

    def log(self):
        out = Value(math.log(self.data), (self,), 'log')
        def _backward():
            self.grad += out.grad / self.data
        out._backward = _backward
        return out

    def relu(self):
        out = Value(max(0, self.data), (self,), 'relu')
        def _backward():
            self.grad += out.grad * (1.0 if self.data > 0 else 0.0)
        out._backward = _backward
        return out

    def backward(self):
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for p in v._parents:
                    build_topo(p)
                topo.append(v)
        build_topo(self)

        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

Building the Model Components

Embedding Layer

class Embedding:
    """Lookup table for token embeddings."""

    def __init__(self, vocab_size, embed_dim):
        """
        vocab_size: number of unique tokens
        embed_dim: dimension of embedding vectors
        """
        # Initialize with small random values
        self.weight = [
            [Value(random.gauss(0, 0.1)) for _ in range(embed_dim)]
            for _ in range(vocab_size)
        ]
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim

    def __call__(self, token_idx):
        """Return embedding for token index."""
        return self.weight[token_idx]  # List of Value objects

    def parameters(self):
        """Return all learnable parameters."""
        return [v for row in self.weight for v in row]
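A quick usage check (relying only on the Embedding class just defined): a 5-token vocabulary with 3-dimensional embeddings returns a list of three Value objects per lookup and exposes 15 learnable parameters.

emb = Embedding(vocab_size=5, embed_dim=3)
vec = emb(2)                    # embedding for token index 2
print(len(vec))                 # 3 Value objects
print(len(emb.parameters()))    # 5 * 3 = 15 learnable parameters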

Linear Layer

class Linear:
    """Fully connected layer: y = Wx + b."""

    def __init__(self, in_features, out_features):
        """
        in_features: input dimension
        out_features: output dimension
        """
        # Xavier initialization for stable gradients
        scale = (2.0 / (in_features + out_features)) ** 0.5
        self.weight = [
            [Value(random.gauss(0, scale)) for _ in range(in_features)]
            for _ in range(out_features)
        ]
        self.bias = [Value(0.0) for _ in range(out_features)]
        self.in_features = in_features
        self.out_features = out_features

    def __call__(self, x):
        """
        x: list of Value objects (length = in_features)
        Returns: list of Value objects (length = out_features)
        """
        out = []
        for i in range(self.out_features):
            activation = self.bias[i]
            for j in range(self.in_features):
                activation = activation + self.weight[i][j] * x[j]
            out.append(activation)
        return out

    def parameters(self):
        return [v for row in self.weight for v in row] + self.bias
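Usage mirrors the embedding layer (a small check using the classes above): a 4-in, 2-out layer maps a list of four Values to a list of two and holds 4 × 2 + 2 = 10 parameters.

layer = Linear(in_features=4, out_features=2)
x = [Value(0.5), Value(-1.0), Value(2.0), Value(0.0)]
y = layer(x)                    # y = Wx + b
print(len(y))                   # 2 output Values
print(len(layer.parameters()))  # 4*2 weights + 2 biases = 10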

Activation Functions

def relu(x):
    """Apply ReLU to list of Values."""
    return [v.relu() for v in x]


def softmax(logits):
    """
    Convert logits to probabilities.
    Numerically stable implementation.
    """
    # Subtract max for numerical stability
    max_val = max(v.data for v in logits)
    exp_logits = [(v - max_val).exp() for v in logits]
    sum_exp = sum(exp_logits, Value(0.0))
    return [e / sum_exp for e in exp_logits]
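Two properties are worth verifying (a quick check using the functions above): the outputs sum to one, and adding a constant to every logit leaves the result unchanged, which is exactly why subtracting the maximum is safe.

logits = [Value(2.0), Value(1.0), Value(-1.0)]
probs = softmax(logits)
print(sum(p.data for p in probs))             # 1.0 (up to floating point)

shifted = [v + 100.0 for v in logits]         # shift every logit by a constant
print([round(p.data, 4) for p in softmax(shifted)])
print([round(p.data, 4) for p in probs])      # same probabilities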

Cross-Entropy Loss

def cross_entropy_loss(logits, target_idx):
    """
    Compute cross-entropy loss.

    logits: list of Value objects (unnormalized scores)
    target_idx: index of true class

    Returns: Value (scalar loss)
    """
    # Log-sum-exp trick for numerical stability
    max_logit = max(v.data for v in logits)
    shifted = [v - max_logit for v in logits]
    exp_logits = [v.exp() for v in shifted]
    sum_exp = sum(exp_logits, Value(0.0))
    log_sum_exp = sum_exp.log() + max_logit

    # Loss = -logit[target] + log_sum_exp
    loss = log_sum_exp - logits[target_idx]
    return loss
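To confirm this matches the textbook definition, loss = -log p(target), we can compare it against the softmax route; the log-sum-exp form gives the same number while avoiding the exponentiation of large logits (a small check using the functions above).

logits = [Value(2.0), Value(0.5), Value(-1.0)]
target = 0

loss = cross_entropy_loss(logits, target)
direct = -softmax(logits)[target].log()             # -log p(target), the naive route
print(round(loss.data, 6), round(direct.data, 6))   # identical up to rounding

loss.backward()                                     # gradients flow back to the logits
print(round(logits[target].grad, 4))                # equals p(target) - 1, here about -0.21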

The Complete Language Model

Now we assemble everything:

class CharacterLM:
    """Character-level neural language model."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, context_length):
        """
        vocab_size: number of unique characters
        embed_dim: dimension of character embeddings
        hidden_dim: size of hidden layers
        context_length: number of previous characters to use
        """
        self.context_length = context_length
        self.vocab_size = vocab_size

        # Embedding layer
        self.embedding = Embedding(vocab_size, embed_dim)

        # Hidden layers
        concat_dim = context_length * embed_dim
        self.layer1 = Linear(concat_dim, hidden_dim)
        self.layer2 = Linear(hidden_dim, hidden_dim)

        # Output layer
        self.output = Linear(hidden_dim, vocab_size)

    def forward(self, context):
        """
        context: list of character indices (length = context_length)
        Returns: list of Values (logits for each vocabulary item)
        """
        # 1. Embed each character
        embeddings = [self.embedding(idx) for idx in context]

        # 2. Concatenate all embeddings
        x = []
        for emb in embeddings:
            x.extend(emb)

        # 3. First hidden layer
        h1 = relu(self.layer1(x))

        # 4. Second hidden layer
        h2 = relu(self.layer2(h1))

        # 5. Output logits
        logits = self.output(h2)

        return logits

    def predict_probs(self, context):
        """Get probability distribution over next character."""
        logits = self.forward(context)
        return softmax(logits)

    def loss(self, context, target_idx):
        """Compute cross-entropy loss for one example."""
        logits = self.forward(context)
        return cross_entropy_loss(logits, target_idx)

    def parameters(self):
        """Return all learnable parameters."""
        params = []
        params.extend(self.embedding.parameters())
        params.extend(self.layer1.parameters())
        params.extend(self.layer2.parameters())
        params.extend(self.output.parameters())
        return params
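Before touching real data, a smoke test on a dummy context catches wiring mistakes early (sizes here are arbitrary; at initialization the loss should sit near ln(vocab_size), since the untrained model is close to a uniform guess).

random.seed(0)                                # reproducible initialization
tiny = CharacterLM(vocab_size=10, embed_dim=4, hidden_dim=8, context_length=3)

context = [1, 5, 2]                           # any three token indices
logits = tiny.forward(context)
probs = tiny.predict_probs(context)

print(len(logits))                            # 10, one logit per vocabulary item
print(round(sum(p.data for p in probs), 4))   # 1.0
print(tiny.loss(context, target_idx=7).data)  # roughly ln(10) ≈ 2.3 at initialization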

Data Preparation

Building the Vocabulary

def build_vocab(text):
    """Create character-to-index mappings."""
    chars = sorted(set(text))
    char_to_idx = {c: i for i, c in enumerate(chars)}
    idx_to_char = {i: c for c, i in char_to_idx.items()}
    return char_to_idx, idx_to_char


def encode(text, char_to_idx):
    """Convert text to list of indices."""
    return [char_to_idx[c] for c in text]


def decode(indices, idx_to_char):
    """Convert indices back to text."""
    return ''.join(idx_to_char[i] for i in indices)
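A round trip through these helpers should reproduce the original text exactly; the indices depend only on the sorted character set.

text = "hello world"
char_to_idx, idx_to_char = build_vocab(text)

ids = encode(text, char_to_idx)
print(ids)                                # [3, 2, 4, 4, 5, 0, 7, 5, 6, 4, 1]
print(decode(ids, idx_to_char) == text)   # True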

Creating Training Examples

def create_examples(encoded_text, context_length):
    """
    Create (context, target) pairs from encoded text.

    Each example is:
      - context: previous context_length characters
      - target: the next character
    """
    examples = []
    for i in range(context_length, len(encoded_text)):
        context = encoded_text[i - context_length : i]
        target = encoded_text[i]
        examples.append((context, target))
    return examples
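For example, with a context length of 3 the string "hello" yields exactly two pairs, "hel" → "l" and "ell" → "o" (a small illustration reusing the helpers above).

char_to_idx, idx_to_char = build_vocab("hello")
encoded = encode("hello", char_to_idx)            # [1, 0, 2, 2, 3]
examples = create_examples(encoded, context_length=3)

for context, target in examples:
    print(decode(context, idx_to_char), '->', idx_to_char[target])
# hel -> l
# ell -> o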

The Training Loop

def train(model, examples, epochs, learning_rate, print_every=100):
    """
    Train the language model.

    model: CharacterLM instance
    examples: list of (context, target) pairs
    epochs: number of passes through the data
    learning_rate: step size for gradient descent
    """
    params = model.parameters()
    n_params = len(params)
    print(f"Training model with {n_params} parameters")

    for epoch in range(epochs):
        # Shuffle examples each epoch
        random.shuffle(examples)

        total_loss = 0.0
        for i, (context, target) in enumerate(examples):
            # Forward pass
            loss = model.loss(context, target)
            total_loss += loss.data

            # Zero gradients
            for p in params:
                p.grad = 0.0

            # Backward pass
            loss.backward()

            # Update parameters
            for p in params:
                p.data -= learning_rate * p.grad

            # Print progress
            if (i + 1) % print_every == 0:
                avg_loss = total_loss / (i + 1)
                print(f"Epoch {epoch+1}, Example {i+1}/{len(examples)}, "
                      f"Avg Loss: {avg_loss:.4f}")

        # End of epoch
        avg_loss = total_loss / len(examples)
        perplexity = math.exp(avg_loss)
        print(f"Epoch {epoch+1} complete. "
              f"Loss: {avg_loss:.4f}, Perplexity: {perplexity:.2f}")

    return model
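A useful check before a long run is to overfit a single example: if the model, loss, and update rule are wired correctly, a few dozen SGD steps on one (context, target) pair should drive its loss well below the initial value of about ln(vocab_size). A sketch of that check, reusing the components above:

random.seed(0)
model = CharacterLM(vocab_size=5, embed_dim=4, hidden_dim=16, context_length=2)
params = model.parameters()

context, target = [0, 3], 1                   # one arbitrary training pair
for step in range(50):
    loss = model.loss(context, target)
    for p in params:                          # zero, backprop, update
        p.grad = 0.0
    loss.backward()
    for p in params:
        p.data -= 0.1 * p.grad

print(model.loss(context, target).data)       # far below the initial ~ln(5) ≈ 1.61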

Text Generation

Once trained, we can generate new text. We use temperature sampling (introduced in Section 1.6) to control the randomness of generation:

  • Temperature = 1.0: Sample from the model's learned distribution
  • Temperature < 1.0: More deterministic, favors high-probability tokens
  • Temperature > 1.0: More random, explores lower-probability tokens

A short numerical demonstration of the temperature effect follows the generate function below.

def generate(model, idx_to_char, char_to_idx, seed_text, length, temperature=1.0):
    """
    Generate text from the model.

    seed_text: initial text to condition on
    length: number of characters to generate
    temperature: controls randomness (lower = more deterministic)
    """
    context_length = model.context_length

    # Ensure seed is long enough
    if len(seed_text) < context_length:
        seed_text = ' ' * (context_length - len(seed_text)) + seed_text

    # Encode seed
    generated = list(seed_text)
    context = [char_to_idx[c] for c in seed_text[-context_length:]]

    for _ in range(length):
        # Get logits
        logits = model.forward(context)

        # Apply temperature
        if temperature != 1.0:
            logits = [Value(v.data / temperature) for v in logits]

        # Convert to probabilities
        probs = softmax(logits)
        prob_values = [p.data for p in probs]

        # Sample from distribution
        next_idx = random.choices(range(len(prob_values)),
                                  weights=prob_values, k=1)[0]

        # Add to generated text
        next_char = idx_to_char[next_idx]
        generated.append(next_char)

        # Update context
        context = context[1:] + [next_idx]

    return ''.join(generated)
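As promised above, here is the temperature effect on a fixed set of logits: dividing by a temperature below 1 sharpens the distribution toward the largest logit, while a temperature above 1 flattens it (a small demonstration using the softmax defined earlier).

logits = [Value(2.0), Value(1.0), Value(0.0)]
for t in (0.5, 1.0, 1.5):
    scaled = [Value(v.data / t) for v in logits]
    probs = [round(p.data, 3) for p in softmax(scaled)]
    print(t, probs)
# t = 0.5 concentrates mass on the top logit; t = 1.5 spreads it more evenly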

Putting It All Together

Here's a complete training script. The configuration values at the top are hyperparameters—settings we choose before training (as opposed to parameters like weights, which are learned during training). Common hyperparameters include learning rate, batch size, number of layers, and hidden dimensions.

def main():
    # Hyperparameters (configuration choices, not learned)
    CONTEXT_LENGTH = 8
    EMBED_DIM = 32
    HIDDEN_DIM = 128
    EPOCHS = 3
    LEARNING_RATE = 0.01

    # Sample training text
    text = """
    The quick brown fox jumps over the lazy dog.
    A journey of a thousand miles begins with a single step.
    To be or not to be, that is the question.
    All that glitters is not gold.
    The only thing we have to fear is fear itself.
    """

    # Build vocabulary
    char_to_idx, idx_to_char = build_vocab(text)
    vocab_size = len(char_to_idx)
    print(f"Vocabulary size: {vocab_size}")
    print(f"Characters: {''.join(sorted(char_to_idx.keys()))}")

    # Encode text
    encoded = encode(text, char_to_idx)
    print(f"Encoded length: {len(encoded)}")

    # Create training examples
    examples = create_examples(encoded, CONTEXT_LENGTH)
    print(f"Number of examples: {len(examples)}")

    # Create model
    model = CharacterLM(
        vocab_size=vocab_size,
        embed_dim=EMBED_DIM,
        hidden_dim=HIDDEN_DIM,
        context_length=CONTEXT_LENGTH
    )
    print(f"Model parameters: {len(model.parameters())}")

    # Train
    print("\n--- Training ---")
    model = train(model, examples, EPOCHS, LEARNING_RATE, print_every=50)

    # Generate
    print("\n--- Generation ---")
    seed = "The quick"
    generated = generate(model, idx_to_char, char_to_idx,
                        seed, length=100, temperature=0.8)
    print(f"Seed: '{seed}'")
    print(f"Generated: '{generated}'")


if __name__ == '__main__':
    main()

Expected Output

After a few epochs, you should see something like the following (the character list begins with the newline character, so it spills onto a second line):

Vocabulary size: 32
Characters:
 ,.ATabcdefghijklmnopqrstuvwxyz
Encoded length: 247
Number of examples: 239
Model parameters: 54560

--- Training ---
Epoch 1, Example 50/239, Avg Loss: 3.2541
Epoch 1, Example 100/239, Avg Loss: 2.8923
...
Epoch 3 complete. Loss: 1.4521, Perplexity: 4.27

--- Generation ---
Seed: 'The quick'
Generated: 'The quick brown fox the only thing we that is the question...'

The model learns common patterns:

  • Word boundaries (spaces after words)
  • Common words ("the", "is", "that")
  • Phrase structures

With more data and training, quality improves significantly.

Analysis: What Did We Build?

Parameter Count

For our example (vocab=32, embed=32, hidden=128, context=8):

Component   Size                    Parameters
Embedding   32 × 32                 1,024
Layer 1     (8 × 32) × 128 + 128    32,896
Layer 2     128 × 128 + 128         16,512
Output      128 × 32 + 32           4,128
Total                               54,560

Computational Cost

Per training example:

  • Forward: O(context × embed × hidden + hidden² + hidden × vocab)
  • Backward: Same order (automatic via autograd)
  • Memory: Proportional to computation (store activations)
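Plugging in the script's configuration (context 8, embed 32, hidden 128) and the sample text's 32-character vocabulary gives a rough sense of scale; a back-of-the-envelope count of the multiply-adds in one forward pass, ignoring biases and the embedding lookup (which is just indexing):

context_length, embed_dim, hidden_dim, vocab_size = 8, 32, 128, 32
concat_dim = context_length * embed_dim   # 256

layer1 = concat_dim * hidden_dim          # 256 * 128 = 32,768
layer2 = hidden_dim * hidden_dim          # 128 * 128 = 16,384
output = hidden_dim * vocab_size          # 128 * 32  =  4,096
print(layer1 + layer2 + output)           # 53,248 multiply-adds per example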

What Makes It Work

  1. Embeddings: Similar characters get similar representations
  2. Hidden layers: Learn to combine patterns
  3. Cross-entropy: Proper training objective
  4. Gradient descent: Iterative improvement

Common Issues and Solutions

Gradient Issues

Problem: Loss becomes NaN

Solutions:

  • Reduce learning rate
  • Check for division by zero
  • Use gradient clipping (cap gradient magnitudes), as in the snippet below

# Element-wise gradient clipping: cap the magnitude of each gradient
clip_value = 1.0
for p in params:
    if abs(p.grad) > clip_value:
        p.grad = clip_value * (1 if p.grad > 0 else -1)
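Larger frameworks more often clip by the global gradient norm instead of element by element; a sketch of that variant, reusing the same parameter list, looks like this:

# Global-norm gradient clipping: rescale all gradients if their combined norm is too large
max_norm = 1.0
total_norm = sum(p.grad ** 2 for p in params) ** 0.5
if total_norm > max_norm:
    scale = max_norm / total_norm
    for p in params:
        p.grad *= scale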

Poor Generation

Problem: Generated text is repetitive or nonsensical

Solutions:

  • More training data
  • More epochs
  • Adjust temperature during generation
  • Larger context length

Slow Training

Problem: Training takes too long

Solutions:

  • Smaller model (fewer hidden units)
  • Fewer training examples
  • Early stopping when loss plateaus

Summary

We built a complete neural language model:

Component       Purpose
Embedding       Discrete → continuous representation
Linear layers   Learn pattern combinations
ReLU            Add nonlinearity
Softmax         Normalize to probabilities
Cross-entropy   Measure prediction quality
Backprop        Compute all gradients
SGD             Update parameters

Key insight: With ~100 lines of autograd and ~200 lines of model code, we have a working neural language model. The same principles scale to billion-parameter models.

Exercises

  1. Experiment with hyperparameters: Try different embedding dimensions, hidden sizes, and context lengths. How does each affect final perplexity?

  2. Add a third hidden layer: Modify the model to have 3 hidden layers instead of 2. Does it help?

  3. Different activation: Replace ReLU with tanh. Compare training dynamics.

  4. Bigger dataset: Train on a larger text corpus (e.g., a book from Project Gutenberg). How does quality change?

  5. Temperature exploration: Generate text at temperatures 0.5, 1.0, and 1.5. Describe the differences.

What's Next

Our model trains, but there's a lot we glossed over:

  • How to choose the learning rate?
  • When to stop training?
  • How to prevent overfitting?

In Section 3.6, we'll dive deep into training dynamics—the art and science of making neural networks learn effectively.