# Further Reading & Resources
Curated resources for going deeper
This page collects the most valuable external resources for each topic covered in this book.
## Papers
### Foundational Papers
| Paper | Year | Key Contribution | Stage |
|---|---|---|---|
| Attention Is All You Need | 2017 | The transformer architecture | 5, 6 |
| A Neural Probabilistic Language Model | 2003 | Word embeddings for language modeling | 3 |
| Adam: A Method for Stochastic Optimization | 2014 | The Adam optimizer | 4 |
| Layer Normalization | 2016 | LayerNorm for transformers | 6 |
| Deep Residual Learning for Image Recognition | 2015 | Residual connections | 6 |
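Several of these papers boil down to a few lines of NumPy. As one example, here is a minimal sketch of the Adam update rule from the 2014 paper above, using its standard default hyperparameters (the `adam_step` helper is illustrative, not from any library):

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moment estimates, bias correction, then the step."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (running mean of grads)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (running mean of squared grads)
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Usage: minimize f(x) = x^2 starting from x = 5.0
x, m, v = np.array(5.0), 0.0, 0.0
for t in range(1, 1001):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.1)
```

The bias-correction terms are what distinguish Adam from plain RMSProp-with-momentum: without them, the zero-initialized moments bias the first steps toward zero.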
### Tokenization Papers
| Paper | Year | Key Contribution | Stage |
|---|---|---|---|
| Neural Machine Translation of Rare Words with Subword Units | 2016 | BPE for NLP | 7 |
| Google's Neural Machine Translation System | 2016 | WordPiece | 7 |
| SentencePiece | 2018 | Language-independent BPE and unigram tokenization | 7 |
### PEFT Papers
| Paper | Year | Key Contribution | Stage |
|---|---|---|---|
| LoRA: Low-Rank Adaptation of Large Language Models | 2021 | Low-rank fine-tuning | 9 |
| Parameter-Efficient Transfer Learning for NLP | 2019 | Adapter layers | 9 |
| Prefix-Tuning | 2021 | Soft prefixes | 9 |
| The Power of Scale for Parameter-Efficient Prompt Tuning | 2021 | Prompt tuning | 9 |
| QLoRA: Efficient Finetuning of Quantized LLMs | 2023 | 4-bit fine-tuning | 9 |
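The core idea of the LoRA paper fits in a few lines: freeze the pre-trained weight `W` and learn a low-rank update `(alpha/r) * B @ A`. A minimal NumPy sketch (illustrative variable names, not the PEFT library's API):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 8, 16       # r << d is the low-rank bottleneck

W = rng.normal(size=(d_out, d_in))           # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01        # trainable down-projection
B = np.zeros((d_out, r))                     # trainable up-projection, zero-initialized

def lora_forward(x):
    """h = W x + (alpha/r) * B A x — only A and B are updated during fine-tuning."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# Because B starts at zero, the adapted model matches the base model exactly
# at initialization — fine-tuning starts from the pre-trained behavior.
assert np.allclose(lora_forward(x), W @ x)
```

Note the parameter saving: `A` and `B` together hold `(d_in + d_out) * r = 1024` trainable values versus `4096` in `W`, and the ratio improves as `d` grows.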
### Alignment Papers
| Paper | Year | Key Contribution | Stage |
|---|---|---|---|
| Training Language Models to Follow Instructions with Human Feedback | 2022 | InstructGPT, RLHF | 10 |
| Direct Preference Optimization | 2023 | DPO | 10 |
| Constitutional AI | 2022 | Self-critique | 10 |
| Proximal Policy Optimization Algorithms | 2017 | PPO | 10 |
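DPO's appeal over PPO-based RLHF is that its objective is a simple classification-style loss on preference pairs, with no reward model or rollout loop. A sketch of the per-pair loss from the 2023 paper (the `dpo_loss` helper and its argument names are illustrative):

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin), where the
    margin compares the policy's log-prob gain over the reference model on the
    chosen response versus the rejected one."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))

# With zero margin (policy == reference behavior), the loss is exactly log(2);
# as the policy learns to prefer the chosen response more than the reference
# does, the margin grows and the loss falls toward zero.
assert abs(dpo_loss(-10.0, -12.0, -10.0, -12.0) - np.log(2)) < 1e-9
assert dpo_loss(-10.0, -12.0, -11.0, -11.0) < np.log(2)
```

`beta` controls how strongly the policy is penalized for drifting from the reference model, playing a role analogous to the KL coefficient in RLHF.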
### Scaling & Architecture Papers
| Paper | Year | Key Contribution | Stage |
|---|---|---|---|
| Scaling Laws for Neural Language Models | 2020 | Power-law scaling of loss with model size, data, and compute | 6 |
| LLaMA: Open and Efficient Foundation Language Models | 2023 | Modern architecture | 6 |
| Language Models are Unsupervised Multitask Learners | 2019 | GPT-2 | 6 |
| RoFormer: Enhanced Transformer with Rotary Position Embedding | 2021 | RoPE | 5 |
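RoPE, from the RoFormer paper above, encodes position by rotating consecutive pairs of query/key dimensions by position-dependent angles. A minimal NumPy sketch (the `rope` helper is illustrative) demonstrating its key property, that attention scores depend only on *relative* position:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotary position embedding: rotate consecutive dimension pairs of x
    by angles pos * theta_i, where theta_i = base**(-2i/d)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per dimension pair
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin        # standard 2-D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q, k = np.random.default_rng(0).normal(size=(2, 8))
# The dot product of rotated vectors depends only on the position offset:
a = rope(q, 5) @ rope(k, 3)      # offset 2, at positions (5, 3)
b = rope(q, 12) @ rope(k, 10)    # offset 2, at positions (12, 10)
assert np.allclose(a, b)
```

This relative-position property is why RoPE composes cleanly with the attention dot product and why it has become the default in LLaMA-style architectures.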
## Libraries & Tools
### Essential Libraries
| Library | Purpose | Relevant Stages |
|---|---|---|
| NumPy | Array operations (used throughout this book) | All |
| PyTorch | Production deep learning | All |
| JAX | Autodiff and accelerators | 2 |
| Hugging Face Transformers | Pre-trained models | 6, 9 |
| PEFT | LoRA and adapters | 9 |
| TRL | RLHF and DPO | 10 |
### Tokenization Libraries
| Library | Purpose |
|---|---|
| tiktoken | OpenAI's BPE tokenizer |
| SentencePiece | Unigram and BPE |
| tokenizers | Fast tokenization |
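All three libraries implement variants of the same core training loop: repeatedly merge the most frequent adjacent symbol pair. A toy pure-Python sketch of that loop (the `bpe_train` and `merge_pair` helpers are illustrative, not any library's API):

```python
from collections import Counter

def merge_pair(word, pair):
    """Replace every occurrence of the adjacent symbol pair with one merged symbol."""
    out, i = [], 0
    while i < len(word):
        if word[i:i + 2] == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)

def bpe_train(words, num_merges):
    """Toy BPE trainer: learn merges greedily by pair frequency."""
    vocab = {tuple(w): c for w, c in Counter(words).items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, count in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)         # most frequent adjacent pair
        merges.append(best)
        vocab = {merge_pair(word, best): c for word, c in vocab.items()}
    return merges

# On a tiny corpus, the first merges capture the shared stem "low":
merges = bpe_train(["low", "lower", "lowest", "low"], num_merges=3)
```

Production tokenizers add byte-level fallback, pre-tokenization rules, and heavy optimization on top of this loop, but the greedy merge procedure is the same.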
### Training Tools
| Library | Purpose |
|---|---|
| Weights & Biases | Experiment tracking |
| TensorBoard | Training visualization |
| DeepSpeed | Distributed training |
| Accelerate | Multi-GPU training |
## Books
### Machine Learning Foundations
| Book | Author(s) | Focus |
|---|---|---|
| Deep Learning | Goodfellow, Bengio, Courville | Comprehensive ML theory |
| Pattern Recognition and Machine Learning | Bishop | Probabilistic ML |
| The Elements of Statistical Learning | Hastie, Tibshirani, Friedman | Statistical methods |
### NLP & Language Models
| Book | Author(s) | Focus |
|---|---|---|
| Speech and Language Processing | Jurafsky & Martin | NLP foundations |
| Natural Language Processing with Transformers | Tunstall, von Werra, Wolf | Practical transformers |
| Dive into Deep Learning | Zhang et al. | Interactive ML book |
## Courses
| Course | Institution | Focus |
|---|---|---|
| CS231n | Stanford | CNNs, backprop basics |
| CS224n | Stanford | NLP with deep learning |
| CS324 | Stanford | Large language models |
| fast.ai | fast.ai | Practical deep learning |
## Blog Posts & Tutorials
### Understanding Transformers
- The Illustrated Transformer - Visual walkthrough
- The Annotated Transformer - Code walkthrough
- Transformer Math 101 - Memory and compute
### Understanding Training
- A Recipe for Training Neural Networks - Karpathy's practical guide
- Why Momentum Really Works - Visual explanation
### Understanding Alignment
- RLHF: Reinforcement Learning from Human Feedback - Hugging Face overview
- Illustrating RLHF - Visual guide
## Codebases to Study
### Educational Implementations
| Repo | Author | What to Learn |
|---|---|---|
| nanoGPT | Karpathy | Minimal GPT training |
| minGPT | Karpathy | Simple GPT implementation |
| micrograd | Karpathy | Tiny autograd engine |
| llm.c | Karpathy | GPT in C |
### Production Implementations
| Repo | What to Learn |
|---|---|
| llama | Production transformer |
| transformers | Library architecture |
| vLLM | Inference optimization |
## Datasets
### Language Modeling
| Dataset | Size | Use Case |
|---|---|---|
| TinyStories | Small | Learning, debugging |
| OpenWebText | Medium | GPT-2 reproduction |
| The Pile | Large | Serious pre-training |
| RedPajama | Large | LLaMA reproduction |
### Alignment
| Dataset | Purpose |
|---|---|
| Anthropic HH-RLHF | Preference data |
| OpenAssistant | Conversation data |
| Alpaca | Instruction data |
## Communities
- Hugging Face Forums - Library questions
- r/MachineLearning - Research discussion
- r/LocalLLaMA - Running LLMs locally
- EleutherAI Discord - Open-source LLMs
## Staying Current
### Research Feeds
- Papers With Code - Language Models
- arXiv cs.CL - NLP papers
- arXiv cs.LG - ML papers