Train Your Own LLM From Scratch
A hands-on workshop where you write every piece of a GPT training pipeline yourself, understanding what each component does and why.
Andrej Karpathy's nanoGPT was my first real exposure to LLMs and transformers. Seeing how a working language model could be built in a few hundred lines of PyTorch completely changed how I thought about AI and inspired me to go deeper into the space.
This workshop is my attempt to give others that same experience. nanoGPT targets reproducing GPT-2 (124M params) and covers a lot of ground. This project strips it down to the essentials and scales it to a ~10M param model that trains on a laptop in under an hour — designed to be completed in a single workshop session.
What You'll Build
A working GPT model trained from scratch on your MacBook, capable of generating Shakespeare-like text. You'll write:
- Tokenizer — turning text into numbers the model can process (a rough sketch follows this list)
- Model architecture — the transformer: embeddings, attention, feed-forward layers
- Training loop — forward pass, loss, backprop, optimizer, learning rate scheduling
- Text generation — sampling from your trained model
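To make the first piece concrete, here is a rough sketch of a character-level tokenizer. It reads the repo's data/shakespeare.txt; the names (stoi, itos, encode, decode) are illustrative, not necessarily what you'll end up writing:

```python
# Character-level tokenizer sketch: every unique character becomes one token id.
text = open("data/shakespeare.txt").read()
chars = sorted(set(text))                      # vocabulary: all unique characters
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> character

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)
```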
Prerequisites
- Any laptop or desktop (Mac, Linux, or Windows)
- Python 3.12+
- Comfort reading Python code (you don't need ML experience)
Training uses Apple Silicon GPU (MPS), NVIDIA GPU (CUDA), or CPU automatically. Also works on Google Colab — upload the files and run with !python train.py.
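A minimal sketch of that device selection (assuming PyTorch; train.py may structure it differently):

```python
import torch

# Prefer NVIDIA CUDA, then Apple Silicon MPS, then fall back to CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
```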
Getting Started
Local (recommended)
Install uv if you don't have it:
```bash
# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```
Then set up the project:
```bash
uv sync
mkdir scratchpad && cd scratchpad
```
Google Colab
If you don't have a local setup, upload the repo to Colab and install dependencies:
```bash
!pip install torch numpy tqdm tiktoken
```
Upload data/shakespeare.txt to your Colab files, then write your code in notebook cells or upload .py files and run them with !python train.py.
Work through the docs in order. Each part walks you through writing a piece of the pipeline, explaining what each component does and why. By the end, you'll have a working model.py, train.py, and generate.py that you wrote yourself.
| Part | What You'll Write | Concepts |
|------|-------------------|----------|
| Part 1: Tokenization | Character-level tokenizer | Character encoding, vocabulary size, why BPE fails on small data |
| Part 2: The Transformer | Full GPT model architecture | Embeddings, self-attention, layer norm, MLP blocks |
| Part 3: The Training Loop | Complete training pipeline | Loss functions, AdamW, gradient clipping, LR scheduling |
| Part 4: Text Generation | Inference and sampling | Temperature, top-k, autoregressive decoding |
| [Part 5: Putting It All Together](docs/05-
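As a preview of the sampling concepts in Part 4, here is a rough sketch of temperature and top-k sampling for a single decoding step. The function name and signature are hypothetical, not the workshop's exact generate.py:

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0,
                      top_k: int | None = None) -> torch.Tensor:
    # logits: (batch, vocab_size) scores for the next token.
    logits = logits / temperature                      # higher temperature -> flatter distribution
    if top_k is not None:
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < v[..., [-1]]] = -float("inf")  # mask everything outside the top k
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)     # draw one token id per sequence
```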