raiyanyahya/how-to-train-your-gpt
Build a modern LLM from scratch. Every line commented. Explained like we are five.
# 🧠 How to Train Your GPT
A guide to building a world-class language model from absolute scratch. Taught like you're five. Built like you're an engineer.
## 📖 What Is This?
This is a 12-chapter, 3,671-line interactive textbook that teaches you how to build, train, and run a modern language model from absolute scratch: the same family of architectures behind ChatGPT, Claude, LLaMA, and Mistral.
You won't just read about Transformers. You'll write every line yourself: tokenizer, embeddings, attention, training loop, inference engine. Every single line is annotated to explain what it does and why it's there.
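To give a taste of that style, here is a minimal sketch of one BPE merge step, the heart of the tokenizer you'll build. The helper names and toy corpus here are illustrative, not the guide's actual code:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, return the most common."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single fused symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if tuple(symbols[i:i + 2]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # fuse the two symbols
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word is a tuple of characters with its frequency.
words = {tuple("unbelievably"): 3, tuple("believe"): 5}
pair = most_frequent_pair(words)  # which pair wins is data-dependent
words = merge_pair(words, pair)
print(pair, list(words))
```

Run this in a loop and the vocabulary grows one merged symbol at a time, which is exactly how "unbelievably" ends up as a handful of subword tokens.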
## 🤔 Why This Exists
Most ML tutorials fall into one of two traps:
| ❌ Too Shallow | ❌ Too Academic | ✅ This Guide |
|---|---|---|
| `model = GPT().fit(data)` | 40-page papers, dense notation | 5-year-old analogies + full working code |
| You learn to call APIs | Assumes a PhD in ML | Zero ML experience required |
| No understanding of internals | No worked examples | Every line annotated with WHAT & WHY |
The goal: after finishing, you won't just know that attention "works". You'll understand the variance argument behind 1/√d_k, how RoPE captures relative position through rotation, why pre-norm beats post-norm for deep networks, and exactly where every gradient flows during backpropagation.
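That 1/√d_k claim is easy to check numerically. Each component product qᵢkᵢ has variance 1, so a sum of d_k independent terms has variance d_k; dividing by √d_k restores unit variance before the softmax. A minimal sketch in plain PyTorch (the tensor names are ours, not the guide's):

```python
import torch

torch.manual_seed(0)
d_k = 512                         # head dimension
q = torch.randn(10_000, d_k)      # queries with unit-variance components
k = torch.randn(10_000, d_k)      # keys with unit-variance components

raw = (q * k).sum(dim=-1)         # dot products q·k, one per row
scaled = raw / d_k ** 0.5         # the 1/√d_k rescaling

print(raw.var().item())           # ≈ d_k (≈ 512): softmax would saturate
print(scaled.var().item())        # ≈ 1: attention weights stay well-behaved
```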
## 👥 Who Is This For?
| 🧑‍💻 You Are... | 📋 You Need... |
|---|---|
| A Python developer curious about how ChatGPT actually works | Basic Python (functions, classes, lists). No ML experience |
| A student who wants to deeply understand Transformers | Willingness to read ~3,600 lines of commented code |
| An engineer evaluating LLM architectures | Understanding of tradeoffs (RoPE vs. learned positions, RMSNorm vs. LayerNorm) |
| Someone who got lost at "attention" in other tutorials | The party analogy + a worked numeric example with real numbers |
🔧 Prerequisites: Python basics (variables, functions, classes, `pip install`). That's it. No calculus, no linear algebra, no PyTorch experience required. We teach those as we go.
## 🗺️ Chapters
| Chapter | What You'll Learn |
|---|---|
| 0: Overview | What is a GPT? The big picture |
| 1: Setup | Install tools, GPU vs. CPU, venv, PyTorch basics |
| 2: Tokenization | BPE walkthrough: how "unbelievably" becomes tokens |
| 3: Embeddings | How numbers become meaning. king − man + woman = queen |
| 4: Positional Encoding | RoPE: why LLaMA rotates vectors instead of adding numbers |
| 5: Attention | ⭐ THE CORE. Q, K, V, scaling, causal mask, 8-step walkthrough (sketched below) |
| 6: Transformer Block | RMSNorm, SwiGLU, residuals, pre-norm vs. post-norm |
| 7: Complete GPT Model | 124M-parameter model, weight tying, logits explained |
| **[8
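And a flavor of the core itself: a minimal, self-contained sketch of causal scaled dot-product attention in plain PyTorch. This is the standard formulation, not the guide's own annotated implementation; the function name and shapes are ours:

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal mask.

    q, k, v: (batch, seq_len, d_k) tensors.
    """
    d_k = q.size(-1)
    # 1. Similarity score between every query and every key, scaled by 1/√d_k.
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5        # (batch, seq, seq)
    # 2. Causal mask: position i may only attend to positions <= i.
    seq_len = q.size(-2)
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))
    # 3. Softmax turns each row of scores into weights that sum to 1.
    weights = F.softmax(scores, dim=-1)
    # 4. Each output is a weighted average of the value vectors.
    return weights @ v

# Toy usage: batch of 1, sequence of 4 tokens, head dimension 8.
x = torch.randn(1, 4, 8)
out = causal_attention(x, x, x)   # self-attention: q = k = v = x
print(out.shape)                  # torch.Size([1, 4, 8])
```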