Karpathy’s LLM Education Stack

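Four projects by Andrej Karpathy that expose how LLMs work at increasing depth. The first three run from a dependency-free Python implementation down to bare C/CUDA, and the fourth applies an AI agent to the research loop itself. Together they form one of the most complete open-source curricula for understanding LLMs, from algorithm to hardware.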

The Stack (Increasing Depth)

Project | Language | Lines | Dependencies | Level | What You Learn
MicroGPT | Pure Python | ~200 | Zero | Beginner | The complete algorithm: autograd, transformer, training, inference
nanoGPT | Python + PyTorch | ~600 | PyTorch | Intermediate | How to train real models (reproduces GPT-2 124M)
llm.c | C / CUDA | ~2,000 | None | Advanced | Hardware-level training: what happens on the GPU
Autoresearch | Python + Agent | Variable | GPU + LLM API | Applied | AI-driven ML research: automated experiment loops

MicroGPT: The Algorithm (200 Lines, Zero Dependencies)

A single Python file containing everything needed to build a working language model:

  • Dataset (32K names), character-level tokenizer
  • Custom autograd engine (backpropagation from scratch)
  • GPT-2-like transformer architecture
  • Adam optimizer, training loop, inference loop
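
The custom autograd engine in the list above can be illustrated in a few dozen lines. What follows is a minimal sketch in the spirit of micrograd, not MicroGPT's actual code; the class name Value and its attributes are assumptions chosen for the example. Each scalar records which values it was computed from and the local derivative with respect to each, and backward() applies the chain rule in reverse topological order.

```python
import math


class Value:
    """A scalar that remembers how it was computed so gradients can flow back."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def tanh(self):
        t = math.tanh(self.data)
        return Value(t, (self,), (1.0 - t * t,))

    def backward(self):
        # Post-order traversal lists every node after the nodes it depends on;
        # walking that list in reverse visits outputs before their inputs.
        order, seen = [], set()

        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v._parents:
                    visit(p)
                order.append(v)

        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            for parent, local in zip(v._parents, v._local_grads):
                parent.grad += local * v.grad  # chain rule, accumulated


# One tiny "neuron": y = tanh(w * x + b)
x, w, b = Value(2.0), Value(-0.5), Value(0.25)
y = (w * x + b).tanh()
y.backward()
```

After the call, w.grad, x.grad and b.grad hold the derivatives of y with respect to each input. A full model applies the same mechanism to every parameter in the network.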

Why it exists: “Strip away everything that isn’t essential. What’s left is the pure algorithm.” The culmination of Karpathy’s decade-long simplification: micrograd → makemore → nanoGPT → MicroGPT.

nanoGPT: Real Training (600 Lines)

Two files: train.py (~300 lines) and model.py (~300 lines). Can reproduce GPT-2 (124M parameters) on OpenWebText.
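
To see where the 124M figure comes from, the parameters can be tallied by hand. The sketch below assumes the published GPT-2 small configuration (12 layers, 768-dimensional embeddings, a 50,257-token vocabulary, a 1,024-token context) and a token embedding shared with the output head; none of these numbers appear in the text above, and the helper names are made up for the example.

```python
# Published GPT-2 small hyperparameters (an assumption; not stated in this document).
n_layer, n_embd, vocab_size, block_size = 12, 768, 50257, 1024


def linear(n_in, n_out):
    """Weights plus bias of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Scale and shift vectors."""
    return 2 * n


token_emb = vocab_size * n_embd   # assumed shared with the output head (weight tying)
pos_emb = block_size * n_embd

per_block = (
    layer_norm(n_embd)
    + linear(n_embd, 3 * n_embd)   # fused query/key/value projection
    + linear(n_embd, n_embd)       # attention output projection
    + layer_norm(n_embd)
    + linear(n_embd, 4 * n_embd)   # MLP expansion
    + linear(4 * n_embd, n_embd)   # MLP contraction
)

# Embeddings, all transformer blocks, and the final layer norm.
total = token_emb + pos_emb + n_layer * per_block + layer_norm(n_embd)
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the transformer blocks account for roughly 85M parameters and the token embedding for most of the rest.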

Successor: nanochat (late 2025) — “The best ChatGPT that $100 can buy.” nanoGPT is legacy but remains the gold standard for learning.

llm.c: The Hardware Level (2,000 Lines C/CUDA)

Trains GPT-2 in pure C/CUDA at speeds matching PyTorch (roughly 78 ms/iter for llm.c versus 80 ms/iter for PyTorch on an A100). No Python and no frameworks: just a single compiled binary.

If you want to understand what actually happens when you call loss.backward(), this is the definitive resource.
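
A small taste of what a hand-written backward pass involves, kept in Python for consistency with the rest of this page. This sketch is not code from llm.c, and its function names are assumptions. It derives the gradient of a softmax cross-entropy loss analytically (probabilities minus the one-hot target) and checks it against finite differences, which is a common way to validate a manual backward pass.

```python
import math


def softmax(logits):
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def cross_entropy(logits, target):
    """Forward pass: negative log-probability of the target class."""
    return -math.log(softmax(logits)[target])


def cross_entropy_backward(logits, target):
    """Backward pass: d(loss)/d(logit_i) = prob_i - 1 if i is the target, else prob_i."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]


def numerical_grad(logits, target, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grads = []
    for i in range(len(logits)):
        up = logits[:i] + [logits[i] + eps] + logits[i + 1:]
        down = logits[:i] + [logits[i] - eps] + logits[i + 1:]
        grads.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * eps))
    return grads


logits, target = [2.0, -1.0, 0.5], 0
analytic = cross_entropy_backward(logits, target)
numeric = numerical_grad(logits, target)
max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
print(f"largest disagreement: {max_err:.2e}")
```

A framework derives and chains such gradients automatically. Writing them by hand for every layer, and then as GPU kernels, is the exercise the C/CUDA approach forces.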

Autoresearch: AI Does the Research

The most provocative project. An AI coding agent that:

  1. Reads research directions in program.md (plain English)
  2. Modifies training code autonomously
  3. Trains for 5 minutes per experiment
  4. Keeps improvements, discards failures
  5. Runs ~12 experiments/hour, ~100 overnight
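
The loop described in the steps above amounts to greedy search: propose a change, measure it, keep it only if the metric improves. The sketch below illustrates that control flow under heavy assumptions. run_experiment and propose_change are hypothetical stand-ins (a toy quadratic instead of a 5-minute training run, a random nudge instead of an agent's code edit), and the 8-hour "overnight" window is also assumed.

```python
import random


def run_experiment(config):
    """Hypothetical stand-in for a 5-minute training run; returns a loss (lower is better)."""
    return (config["lr"] - 0.3) ** 2 + (config["warmup"] - 0.1) ** 2


def propose_change(config, rng):
    """Hypothetical stand-in for the agent's edit: nudge one setting at random."""
    candidate = dict(config)
    key = rng.choice(sorted(candidate))
    candidate[key] += rng.uniform(-0.1, 0.1)
    return candidate


def research_loop(config, n_experiments, seed=0):
    rng = random.Random(seed)
    best_score = run_experiment(config)
    kept = 0
    for _ in range(n_experiments):
        candidate = propose_change(config, rng)
        score = run_experiment(candidate)
        if score < best_score:      # keep improvements
            config, best_score, kept = candidate, score, kept + 1
        # otherwise discard the change and continue from the current best
    return config, best_score, kept


start = {"lr": 1.0, "warmup": 0.5}
minutes_per_experiment = 5
experiments_per_hour = 60 // minutes_per_experiment     # 12 runs per hour
overnight = 8 * experiments_per_hour                    # assumed 8-hour night: 96 runs
final, final_score, kept = research_loop(start, n_experiments=overnight)
print(f"{overnight} experiments, {kept} kept, loss {run_experiment(start):.3f} -> {final_score:.3f}")
```

The budget arithmetic matches the figures in the list: 5-minute runs give 12 per hour, and an assumed 8-hour night gives 96, or roughly 100.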

Real results (2 days, ~700 autonomous changes): Found ~20 additive improvements, dropped “Time to GPT-2” from 2.02h to 1.80h (11% gain). All discovered by AI.
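
The headline numbers can be checked with a few lines of arithmetic. The inputs are the approximate figures quoted above; the variable names are only for illustration.

```python
before_h, after_h = 2.02, 1.80     # "Time to GPT-2" before and after, in hours
changes, improvements = 700, 20    # approximate counts quoted above

reduction = (before_h - after_h) / before_h   # fraction of training time saved
speedup = before_h / after_h                  # how many times faster
hit_rate = improvements / changes             # share of changes that were kept

print(f"time saved: {reduction:.1%}, speedup: {speedup:.2f}x, hit rate: {hit_rate:.1%}")
```

The quoted 11% corresponds to the reduction in training time. Only about 3% of the attempted changes survived, which is why the volume of cheap, short experiments matters.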

This represents L4-level automation for ML research: the human sets the direction and the AI explores the space.

Sources