python · pytorch · ddp · 2026

nanochat

reproducing the chat model training pipeline, end to end.

[ shipped ]

Why build yet another training pipeline? Because understanding comes from rebuilding. I wanted to trace the full arc — from raw text to responsive chat model — without leaning on the abstractions that hide the most interesting decisions. nanochat is my attempt to recreate, in miniature, the systems that make modern LLMs work.

The goal is not state-of-the-art performance. It’s clarity, control, and the ability to reason about every moving part: data, tokenization, optimization, parallelism, and the thousand papercuts in between. When something breaks (at scale, it will), I want the stack to be mine.

I restrict the scope deliberately: decoder-only transformer, causal language modeling objective, and a chat-style instruction format. No RLHF, no retrieval, no frills. Just the spine.
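The chat-style instruction format boils down to flattening a conversation into one token stream with role delimiters. A minimal sketch of that rendering step, assuming hypothetical `<|role|>` / `<|end|>` delimiters (not nanochat's actual special tokens):

```python
def render_chat(messages):
    """Flatten a list of {role, content} dicts into one training string.
    The delimiter tokens are illustrative placeholders, not nanochat's real ones."""
    parts = []
    for m in messages:
        parts.append(f"<|{m['role']}|>{m['content']}<|end|>")
    return "".join(parts)

render_chat([
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello!"},
])
# "<|user|>hi<|end|><|assistant|>hello!<|end|>"
```

In training, this rendered string is tokenized and the loss is typically masked so only assistant spans contribute.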

tokens → embed → [ norm → attn (causal) → mlp ] × N blocks → lm head → logits
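Each repeated block follows the standard pre-norm residual pattern. A minimal sketch (the constructor signature and submodule wiring here are assumptions, not nanochat's exact code):

```python
import torch.nn as nn

class Block(nn.Module):
    """One pre-norm transformer block: x + attn(norm(x)), then x + mlp(norm(x))."""
    def __init__(self, n_embd, attn, mlp):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
        self.attn = attn  # causal self-attention submodule
        self.mlp = mlp    # position-wise feed-forward submodule

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # residual around attention
        x = x + self.mlp(self.ln2(x))   # residual around the MLP
        return x
```

Pre-norm (normalizing before each sublayer rather than after) keeps gradients stable at depth, which matters once you stack N of these.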

Training is distributed with PyTorch DistributedDataParallel over NCCL. Sequence packing, gradient accumulation, and a cosine schedule keep GPUs fed and losses smooth. Every component is small enough to read in an afternoon, but together they form a system I can trust.
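The cosine schedule mentioned above is small enough to write by hand. A sketch with linear warmup (the warmup and floor parameters are my assumptions, not nanochat's exact settings):

```python
import math

def cosine_lr(step, max_steps, base_lr, warmup_steps=0, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    # progress through the decay phase, in [0, 1]
    t = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))

cosine_lr(0, 100, 1.0)    # 1.0 at the start of decay
cosine_lr(50, 100, 1.0)   # 0.5 at the midpoint
cosine_lr(100, 100, 1.0)  # 0.0 at the end
```

In the training loop this is evaluated per step and written into each optimizer param group before `optimizer.step()`.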

import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.n_head = config.n_head
        self.head_dim = config.n_embd // config.n_head
        self.qkv = nn.Linear(config.n_embd, 3 * config.n_embd, bias=False)
        self.proj = nn.Linear(config.n_embd, config.n_embd, bias=False)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split heads: (B, T, C) -> (B, n_head, T, head_dim)
        q, k, v = (t.view(B, T, self.n_head, self.head_dim).transpose(1, 2) for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # fused, causally masked
        return self.proj(y.transpose(1, 2).reshape(B, T, C))