mon, apr 27, 2026, 05:15:15

On reproducing GPT-2 from scratch

There’s a particular kind of clarity that comes from rebuilding something from first principles. Not the polished clarity of reading a well-documented library, but the rougher, quieter clarity of wrestling with each moving part until it yields. GPT-2 isn’t ancient history, but it feels far enough now—long enough in the rearview mirror—that we can study it the way you study a classic text: not just to see what it says, but to understand how it was written.

I wanted to reproduce GPT-2 (small) as faithfully as possible: same architecture, same hyperparameters, same BPE tokenizer, same data pipeline. Not to beat it—absurd goal—but to inhabit it. To trace the path from random initialization to coherent text and, along the way, map the territory in my own hand.

Reproducing it revealed a thousand tiny facts I had glossed over. The exact placement of layer norms. The reason weight tying isn’t just a neat trick but a stabilizing choice. How a too-eager learning rate shows up first as a tone problem in generations, long before the loss curve betrays it.
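
The tying itself is a single assignment: the output projection reuses the token-embedding matrix, so one tensor receives gradients from both ends of the network. A minimal sketch, using GPT-2 small’s sizes and variable names of my own:

from torch import nn

n_embd, vocab_size = 768, 50257                       # GPT-2 small's width and BPE vocab
wte = nn.Embedding(vocab_size, n_embd)                # token embedding table
lm_head = nn.Linear(n_embd, vocab_size, bias=False)   # output projection to logits
lm_head.weight = wte.weight                           # weight tying: one shared matrix
assert lm_head.weight.data_ptr() == wte.weight.data_ptr()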

tokens → embed → block × N → lm head → logits
block: norm → attn (causal) → mlp
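
In code, that flow is a loop over blocks. Here is a shape-level sketch of the skeleton, with GPT-2 small’s published sizes as defaults; the blocks are placeholder identities standing in for the real attention-and-MLP blocks, and the module names are my own shorthand:

import torch
from torch import nn

class GPT2Skeleton(nn.Module):
    def __init__(self, vocab_size=50257, block_size=1024, n_layer=12, n_embd=768):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_embd)    # token embeddings
        self.wpe = nn.Embedding(block_size, n_embd)    # learned positional embeddings
        self.blocks = nn.ModuleList([nn.Identity() for _ in range(n_layer)])  # stand-ins
        self.ln_f = nn.LayerNorm(n_embd)               # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)

    def forward(self, idx):                            # idx: (B, T) token ids
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.wte(idx) + self.wpe(pos)              # tokens + positions
        for block in self.blocks:                      # block × N
            x = block(x)
        return self.lm_head(self.ln_f(x))              # logits over the vocabulary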

Each block is simple on paper. The difficulty is not in the components, but in the choreography: the precise ordering, the dropout, the masks, the initialization scale that keeps signal alive across depth. Tiny choices compound. Get one subtly wrong and the model still trains—just into mediocrity.
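
The initialization scale is the least visible of these. The GPT-2 paper describes scaling residual-layer weights at initialization by 1/√N, with N the number of residual layers; a common reading takes N as 2 · n_layer, one attention and one MLP residual per block. A sketch of that rule under those assumptions, with residual_proj_names as a placeholder of my own for selecting the residual output projections:

import math
from torch import nn

def init_gpt2_style(model, n_layer, residual_proj_names=()):
    # baseline init: small normal weights, zero biases, LayerNorm left at its defaults
    for name, p in model.named_parameters():
        if p.dim() >= 2:
            nn.init.normal_(p, mean=0.0, std=0.02)
        elif name.endswith("bias"):
            nn.init.zeros_(p)
    # residual output projections are scaled down by 1/sqrt(2 * n_layer) so the
    # residual stream's variance stays roughly flat as contributions accumulate
    for name, p in model.named_parameters():
        if name in residual_proj_names:   # placeholder hook, not a standard API
            nn.init.normal_(p, mean=0.0, std=0.02 / math.sqrt(2 * n_layer))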

import torch
from torch import nn
from torch.nn import functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.n_head = config.n_head
        self.head_dim = config.n_embd // config.n_head
        assert self.head_dim * self.n_head == config.n_embd
        self.qkv = nn.Linear(config.n_embd, 3 * config.n_embd, bias=False)
        self.proj = nn.Linear(config.n_embd, config.n_embd, bias=False)

    def forward(self, x):
        B, T, C = x.shape
        # project once, split into q/k/v, reshape each to (B, n_head, T, head_dim)
        q, k, v = (t.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
                   for t in self.qkv(x).split(C, dim=2))
        # fused attention; is_causal=True masks attention to future positions
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)
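
A quick shape check, with GPT-2 small’s width and head count (the SimpleNamespace config is just a stand-in):

from types import SimpleNamespace

config = SimpleNamespace(n_embd=768, n_head=12)   # stand-in config: GPT-2 small's sizes
attn = CausalSelfAttention(config)
x = torch.randn(2, 16, config.n_embd)             # (batch, time, channels)
print(attn(x).shape)                              # torch.Size([2, 16, 768])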

It’s easy to romanticize the final generations. The more interesting story is in the middle: the long plateau where nothing seems to change, and then everything does.

thanks for reading —j.