
Further Reading & Resources

Curated resources for going deeper

This page collects the most valuable external resources for each topic covered in this book.


Papers

Foundational Papers

| Paper | Year | Key Contribution | Stage |
| --- | --- | --- | --- |
| Attention Is All You Need | 2017 | The transformer architecture (see the sketch below) | 5, 6 |
| A Neural Probabilistic Language Model | 2003 | Word embeddings for language modeling | 3 |
| Adam: A Method for Stochastic Optimization | 2014 | The Adam optimizer | 4 |
| Layer Normalization | 2016 | LayerNorm for transformers | 6 |
| Deep Residual Learning | 2015 | Residual connections | 6 |
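As a quick refresher on the core idea behind "Attention Is All You Need", here is a minimal NumPy sketch of scaled dot-product attention. The shapes and variable names are illustrative, not taken from any particular implementation.

```python
# Minimal scaled dot-product attention sketch (single head, no masking).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (seq_q, seq_k) similarity scores
    weights = softmax(scores, axis=-1)        # each query attends over all keys
    return weights @ V                        # weighted sum of value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```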

Tokenization Papers

| Paper | Year | Key Contribution | Stage |
| --- | --- | --- | --- |
| Neural Machine Translation of Rare Words with Subword Units | 2016 | BPE for NLP (see the sketch below) | 7 |
| Google's Neural Machine Translation System | 2016 | WordPiece | 7 |
| SentencePiece | 2018 | Unigram tokenization | 7 |
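To make the BPE paper concrete, here is a toy sketch of one training step: count adjacent symbol pairs and merge the most frequent pair. Real implementations work over a word-frequency dictionary and respect word boundaries, so treat this purely as an illustration.

```python
# Toy BPE merge step: count adjacent pairs, merge the most frequent one.
from collections import Counter

def most_frequent_pair(symbols):
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(symbols, pair):
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])   # fuse the pair into one symbol
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

symbols = list("lower lowest low")       # start from characters
for _ in range(5):                       # apply five merges
    symbols = merge_pair(symbols, most_frequent_pair(symbols))
print(symbols)
```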

PEFT Papers

| Paper | Year | Key Contribution | Stage |
| --- | --- | --- | --- |
| LoRA: Low-Rank Adaptation of Large Language Models | 2021 | Low-rank fine-tuning (see the sketch below) | 9 |
| Parameter-Efficient Transfer Learning for NLP | 2019 | Adapter layers | 9 |
| Prefix-Tuning | 2021 | Soft prefixes | 9 |
| The Power of Scale for Parameter-Efficient Prompt Tuning | 2021 | Prompt tuning | 9 |
| QLoRA: Efficient Finetuning of Quantized LLMs | 2023 | 4-bit fine-tuning | 9 |
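The LoRA paper's core idea fits in a few lines of NumPy: freeze the pretrained weight W and learn a low-rank update B @ A scaled by alpha / r. The dimensions and hyperparameters below are arbitrary illustrations; the training loop is omitted.

```python
# Minimal LoRA forward-pass sketch: frozen base weight plus a low-rank update.
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 8, 16

W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                 # trainable up-projection, zero-init so the update starts at 0

def lora_forward(x):
    return W @ x + (alpha / r) * (B @ (A @ x))   # base path + scaled low-rank path

x = rng.normal(size=(d_in,))
print(np.allclose(lora_forward(x), W @ x))       # True until B receives gradient updates
```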

Alignment Papers

| Paper | Year | Key Contribution | Stage |
| --- | --- | --- | --- |
| Training Language Models to Follow Instructions with Human Feedback | 2022 | InstructGPT, RLHF | 10 |
| Direct Preference Optimization | 2023 | DPO (see the sketch below) | 10 |
| Constitutional AI | 2022 | Self-critique | 10 |
| Proximal Policy Optimization Algorithms | 2017 | PPO | 10 |
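The DPO paper reduces preference optimization to a simple classification-style loss. Below is a minimal NumPy sketch for a single preference pair; the inputs are summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model, and the numbers at the end are toy values.

```python
# Minimal DPO loss sketch for one (chosen, rejected) preference pair.
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    chosen_margin = logp_chosen - ref_logp_chosen        # how much the policy favors the chosen response vs. the reference
    rejected_margin = logp_rejected - ref_logp_rejected  # same for the rejected response
    logits = beta * (chosen_margin - rejected_margin)
    return -np.log(1.0 / (1.0 + np.exp(-logits)))        # -log sigmoid(logits)

print(dpo_loss(-12.0, -15.0, -13.0, -14.0))              # toy log-probabilities
```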

Scaling & Architecture Papers

| Paper | Year | Key Contribution | Stage |
| --- | --- | --- | --- |
| Scaling Laws for Neural Language Models | 2020 | Power-law scaling of loss with model size, data, and compute | 6 |
| Training Compute-Optimal Large Language Models | 2022 | Chinchilla scaling | 6 |
| LLaMA: Open and Efficient Foundation Language Models | 2023 | Modern architecture | 6 |
| Language Models are Unsupervised Multitask Learners | 2019 | GPT-2 | 6 |
| RoFormer: Enhanced Transformer with Rotary Position Embedding | 2021 | RoPE (see the sketch below) | 5 |
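RoPE is easy to state in code: consecutive dimension pairs of each query/key vector are rotated by an angle that grows with the token position. The sketch below uses the paper's default base of 10000; shapes are illustrative.

```python
# Minimal rotary position embedding (RoPE) sketch.
import numpy as np

def apply_rope(x, base=10000.0):
    seq_len, d = x.shape                                  # (positions, head dim), d must be even
    half = d // 2
    freqs = base ** (-np.arange(half) / half)             # theta_i = base^(-2i/d)
    angles = np.arange(seq_len)[:, None] * freqs[None]    # angle = position * theta_i
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                       # even / odd dims form the rotated pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

x = np.random.default_rng(0).normal(size=(6, 8))
# Rotations preserve vector norms, a quick sanity check:
print(np.allclose(np.linalg.norm(apply_rope(x), axis=-1), np.linalg.norm(x, axis=-1)))
```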

Libraries & Tools

Essential Libraries

| Library | Purpose | Relevant Stages |
| --- | --- | --- |
| NumPy | Array operations (used throughout this book) | All |
| PyTorch | Production deep learning | All |
| JAX | Autodiff and accelerators | 2 |
| Hugging Face Transformers | Pre-trained models | 6, 9 |
| PEFT | LoRA and adapters | 9 |
| TRL | RLHF and DPO | 10 |

Tokenization Libraries

| Library | Purpose |
| --- | --- |
| tiktoken | OpenAI's BPE tokenizer (see the example below) |
| SentencePiece | Unigram and BPE |
| tokenizers | Fast tokenization |
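A quick way to build intuition for BPE vocabularies is to poke at tiktoken directly, assuming the package is installed (`pip install tiktoken`):

```python
# Encode and decode with the GPT-2 BPE vocabulary via tiktoken.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("Tokenization is lossy in interesting ways.")
print(ids)                              # token ids
print([enc.decode([i]) for i in ids])   # the text fragment each id maps back to
print(enc.decode(ids))                  # round-trips to the original string
```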

Training Tools

| Library | Purpose |
| --- | --- |
| Weights & Biases | Experiment tracking (see the example below) |
| TensorBoard | Training visualization |
| DeepSpeed | Distributed training |
| Accelerate | Multi-GPU training |
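All of these trackers follow the same log-a-scalar-per-step pattern. Here is a minimal Weights & Biases sketch, assuming the wandb package is installed and you are logged in; the project name and logged loss are placeholders.

```python
# Minimal experiment-tracking sketch with Weights & Biases.
import wandb

run = wandb.init(project="tiny-lm", config={"lr": 3e-4, "batch_size": 32})
for step in range(100):
    loss = 1.0 / (step + 1)                    # placeholder for a real training loss
    wandb.log({"train/loss": loss}, step=step)
run.finish()
```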

Books

Machine Learning Foundations

| Book | Author(s) | Focus |
| --- | --- | --- |
| Deep Learning | Goodfellow, Bengio, Courville | Comprehensive ML theory |
| Pattern Recognition and Machine Learning | Bishop | Probabilistic ML |
| The Elements of Statistical Learning | Hastie, Tibshirani, Friedman | Statistical methods |

NLP & Language Models

| Book | Author(s) | Focus |
| --- | --- | --- |
| Speech and Language Processing | Jurafsky & Martin | NLP foundations |
| Natural Language Processing with Transformers | Tunstall, von Werra, Wolf | Practical transformers |
| Dive into Deep Learning | Zhang et al. | Interactive ML book |

Courses

| Course | Institution | Focus |
| --- | --- | --- |
| CS231n | Stanford | CNNs, backprop basics |
| CS224n | Stanford | NLP with deep learning |
| CS324 | Stanford | Large language models |
| fast.ai | fast.ai | Practical deep learning |

Blog Posts & Tutorials

Understanding Transformers

Understanding Training

Understanding Alignment


Codebases to Study

Educational Implementations

| Repo | Author | What to Learn |
| --- | --- | --- |
| nanoGPT | Karpathy | Minimal GPT training |
| minGPT | Karpathy | Simple GPT implementation |
| micrograd | Karpathy | Tiny autograd engine |
| llm.c | Karpathy | GPT in C |

Production Implementations

| Repo | What to Learn |
| --- | --- |
| llama | Production transformer |
| transformers | Library architecture |
| vLLM | Inference optimization |

Datasets

Language Modeling

| Dataset | Size | Use Case |
| --- | --- | --- |
| TinyStories | Small | Learning, debugging |
| OpenWebText | Medium | GPT-2 reproduction |
| The Pile | Large | Serious pre-training |
| RedPajama | Large | LLaMA reproduction |

Alignment

| Dataset | Purpose |
| --- | --- |
| Anthropic HH-RLHF | Preference data |
| OpenAssistant | Conversation data |
| Alpaca | Instruction data |

Communities


Staying Current

Research Feeds

Newsletters