raiyanyahya/how-to-train-your-gpt
Build a modern LLM from scratch. Every line commented. Explained like we are five.
# 🧠 How to Train Your GPT
A guide to building a world-class language model from absolute scratch. Taught like you're five. Built like you're an engineer.
## 📖 What Is This?
This is a 12-chapter, 3,671-line interactive textbook that teaches you how to build, train, and run a modern language model from absolute scratch: the same family of architectures behind ChatGPT, Claude, LLaMA, and Mistral.
You won't just read about Transformers. You'll write every line yourself: tokenizer, embeddings, attention, training loop, inference engine. Every single line is annotated to explain what it does and why it's there.
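To give a taste of that style, here is a minimal sketch of one BPE merge step, the heart of the tokenizer you'll build. The helper names and toy corpus here are illustrative, not the guide's actual code:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, return the most common."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single fused symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if tuple(symbols[i:i + 2]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # fuse the two symbols
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word is a tuple of characters with its frequency.
words = {tuple("unbelievably"): 3, tuple("believe"): 5}
pair = most_frequent_pair(words)  # which pair wins is data-dependent
words = merge_pair(words, pair)
print(pair, list(words))
```

Run this in a loop and the vocabulary grows one merged symbol at a time, which is exactly how "unbelievably" ends up as a handful of subword tokens.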
## 🤔 Why This Exists
Most ML tutorials fall into one of two traps:
| ❌ Too Shallow | ❌ Too Academic | ✅ This Guide |
|---|---|---|
| `model = GPT().fit(data)` | 40-page papers, dense notation | 5-year-old analogies + full working code |
| You learn to call APIs | Assumes a PhD in ML | Zero ML experience required |
| No understanding of internals | No worked examples | Every line annotated with WHAT & WHY |
The goal: after finishing, you won't just know that attention "works". You'll understand the variance argument behind 1/√d_k, how RoPE captures relative position through rotation, why pre-norm beats post-norm for deep networks, and exactly where every gradient flows during backpropagation.
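That 1/√d_k claim is easy to check numerically. Each component product qᵢkᵢ has variance 1, so a sum of d_k independent terms has variance d_k; dividing by √d_k restores unit variance before the softmax. A minimal sketch in plain PyTorch (the tensor names are ours, not the guide's):

```python
import torch

torch.manual_seed(0)
d_k = 512                         # head dimension
q = torch.randn(10_000, d_k)      # queries with unit-variance components
k = torch.randn(10_000, d_k)      # keys with unit-variance components

raw = (q * k).sum(dim=-1)         # dot products q·k, one per row
scaled = raw / d_k ** 0.5         # the 1/√d_k rescaling

print(raw.var().item())           # ≈ d_k (≈ 512): softmax would saturate
print(scaled.var().item())        # ≈ 1: attention weights stay well-behaved
```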
## 👥 Who Is This For?
| 🧑‍💻 You Are... | 📋 You Need... |
|---|---|
| A Python developer curious about how ChatGPT actually works | Basic Python (functions, classes, lists). No ML experience |
| A student who wants to deeply understand Transformers | Willingness to read ~3,600 lines of commented code |
| An engineer evaluating LLM architectures | Understanding of tradeoffs (RoPE vs. learned positions, RMSNorm vs. LayerNorm) |
| Someone who got lost at "attention" in other tutorials | The party analogy + a worked numeric example with real numbers |
🔧 Prerequisites: Python basics (variables, functions, classes, `pip install`). That's it. No calculus, no linear algebra, no PyTorch experience required. We teach those as we go.
## 🗺️ Chapters
| Chapter | What You'll Learn |
|---|---|
| 0: Overview | What is a GPT? The big picture |
| 1: Setup | Install tools, GPU vs. CPU, venv, PyTorch basics |
| 2: Tokenization | BPE walkthrough: how "unbelievably" becomes tokens |
| 3: Embeddings | How numbers become meaning. king − man + woman = queen |
| 4: Positional Encoding | RoPE: why LLaMA rotates vectors instead of adding numbers |
| 5: Attention | ⭐ THE CORE. Q, K, V, scaling, causal mask, 8-step walkthrough (sketched below) |
| 6: Transformer Block | RMSNorm, SwiGLU, residuals, pre-norm vs. post-norm |
| 7: Complete GPT Model | 124M-parameter model, weight tying, logits explained |
| **[8
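And a flavor of the core itself: a minimal, self-contained sketch of causal scaled dot-product attention in plain PyTorch. This is the standard formulation, not the guide's own annotated implementation; the function name and shapes are ours:

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal mask.

    q, k, v: (batch, seq_len, d_k) tensors.
    """
    d_k = q.size(-1)
    # 1. Similarity score between every query and every key, scaled by 1/√d_k.
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5        # (batch, seq, seq)
    # 2. Causal mask: position i may only attend to positions <= i.
    seq_len = q.size(-2)
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))
    # 3. Softmax turns each row of scores into weights that sum to 1.
    weights = F.softmax(scores, dim=-1)
    # 4. Each output is a weighted average of the value vectors.
    return weights @ v

# Toy usage: batch of 1, sequence of 4 tokens, head dimension 8.
x = torch.randn(1, 4, 8)
out = causal_attention(x, x, x)   # self-attention: q = k = v = x
print(out.shape)                  # torch.Size([1, 4, 8])
```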