
The Carmack Optimization

Robotic arms defragmenting Metal GPU shaders like a Formula 1 pit stop for silicon.

Date: 2026-04-02
Status: Implemented and Verified

Training a 27-billion-parameter model on a laptop is either visionary or reckless. For a while, it was both.

The mlx-finetune pipeline was crashing mid-training with Metal GPU errors that read like suicide notes from the shader compiler. Not “out of memory” — nothing that civilized. The errors said things like “Broken Pipe” and “resource limit exceeded at 499,000 allocations,” which is Apple’s way of telling you the GPU has given up on you as a person. Inspired by John Carmack’s philosophy — “if you don’t understand the hardware, you don’t understand the problem” — we profiled everything and found four distinct failure modes. Each one was the kind of bug that only appears when you push consumer hardware past the edge of what Apple probably imagined anyone would do with it.

Phase 1: Sequence Bucketing

The Issue: Training with batch_size: 1 on unpadded datasets caused MLX to compile a new Metal shader for every unique sequence length. Every. Single. One. Three thousand training examples, three thousand unique graph shapes, three thousand shader compilations. The MTLCompilerService eventually threw a “Broken Pipe” — Apple’s polite way of saying “I give up.”

The Fix: All training examples are now padded to fixed buckets: 512, 1024, or 2048 tokens. The compiler sees 3 graph shapes instead of 3,000. The difference between “crashes after 200 steps” and “runs to completion” was literally just padding.
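A minimal sketch of the bucketing idea, assuming the examples are plain token-id lists; the bucket sizes come from the text, but `bucket_and_pad`, the pad-token id, and the drop-when-too-long policy are illustrative, not the repo's actual code:

```python
# Pad each tokenized example up to the nearest fixed bucket so the compiler
# only ever sees three graph shapes. Sequences longer than the largest bucket
# are dropped rather than triggering a fourth shape.
BUCKETS = (512, 1024, 2048)
PAD_ID = 0  # assumed pad-token id

def bucket_and_pad(tokens, buckets=BUCKETS, pad_id=PAD_ID):
    """Pad `tokens` to the smallest bucket that fits, or return None to drop it."""
    for size in buckets:
        if len(tokens) <= size:
            return tokens + [pad_id] * (size - len(tokens))
    return None  # too long for any bucket: drop instead of compiling a new shape

# Every surviving example now has one of exactly three lengths.
examples = [[1, 2, 3], [7] * 600, [9] * 4096]
padded = [p for p in (bucket_and_pad(t) for t in examples) if p is not None]
print(sorted({len(p) for p in padded}))  # → [512, 1024]
```

Three fixed shapes means at most three shader compilations for the whole run, no matter how many examples the dataset holds.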

Phase 2: Kernel Fusion & Graph Defragmentation


The Issue: Deep LoRA layers with high gradient accumulation created an explosion of tiny memory allocations, hitting Metal’s [metal::malloc] resource limit at 499,000 allocations. The GPU was spending more time managing memory than doing math — like a librarian who spends all day organizing the card catalog and never actually shelves a book.

The Fix: Once sequence bucketing locked the graph shapes (Phase 1), the allocation pattern stabilized. We pushed back to 32 layers and 2048 context without hitting the limit. The computation graph went from a Jackson Pollock painting to a circuit diagram. Same parameters, same data, same hardware — just a GPU that could finally see the forest for the allocations.

Phase 3: Parallel Data Generation

The Issue: Synthetic data generation was single-threaded. Nine agents, each needing 700+ training examples, all waiting in line behind Python’s Global Interpreter Lock. Generating the dataset took longer than training on it, which is the computational equivalent of spending more time packing for vacation than being on vacation.

The Fix: multiprocessing.Pool(). All CPU cores now participate. The GIL gets to watch from the sidelines while actual work happens. Generation time dropped from “go make coffee” to “the coffee isn’t ready yet and we’re already done.”
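A hedged sketch of the fan-out, assuming the generator is a CPU-bound function per example; `make_example` and `generate` are stand-ins for whatever the pipeline actually calls:

```python
import multiprocessing as mp

def make_example(seed: int) -> dict:
    # Placeholder for the real CPU-bound synthetic-example generator.
    return {"id": seed, "text": f"example-{seed}"}

def generate(n: int) -> list[dict]:
    # The "fork" start method keeps this snippet import-safe on POSIX;
    # the real script can use the platform default.
    ctx = mp.get_context("fork")
    with ctx.Pool() as pool:  # one worker per CPU core by default
        # chunksize batches small jobs to keep per-task IPC overhead low
        return pool.map(make_example, range(n), chunksize=64)
```

`pool.map` preserves input order, so `generate(700)` returns the same 700 examples a serial loop would, just across every core at once.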

Phase 4: Profiling with Instruments

The Issue: Configuration changes were vibes-based. “This crashed, so reduce rank. That didn’t crash, so increase context. Repeat until the model converges or you lose patience.” This is not engineering. This is dowsing.

The Fix: scripts/profile_training.sh wraps training with Apple Instruments:

```sh
xcrun xctrace record --template 'Metal System Trace' -- python -m mlx_lm.lora ...
```

Real-time ALU occupancy, shader compilation time, unified memory bandwidth. No more guessing. The profiler doesn’t care about your intuition — it cares about what the silicon is actually doing. Carmack would approve.

A dedicated test suite (tests/test_e2e_pipeline.py) enforces all four optimizations:

| Test | What It Validates |
| --- | --- |
| Data Generation | Parallel execution across CPU cores |
| Sequence Bucketing | Padding and dropping logic; no unbucketed sequences |
| Live Inference | HTTP payloads to `mlx_lm.server` prove the adapter streams tokens |
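As an illustrative sketch of what the bucketing check can reduce to (the real assertions live in tests/test_e2e_pipeline.py; the names and toy dataset here are assumptions):

```python
# Every padded example must land on exactly one of the three fixed lengths;
# anything else means a new graph shape slipped back into the pipeline.
BUCKETS = {512, 1024, 2048}

def assert_no_unbucketed_sequences(dataset):
    bad = [len(seq) for seq in dataset if len(seq) not in BUCKETS]
    assert not bad, f"unbucketed sequence lengths: {bad}"

# A toy dataset that passes the check:
assert_no_unbucketed_sequences([[0] * 512, [0] * 1024, [0] * 2048])
```

The test is deliberately blunt: it does not care how a sequence got the wrong length, only that no wrong length ever reaches the trainer.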

Because the only thing worse than a crash at 3 AM is a crash at 3 AM that you thought you’d fixed. The tests don’t just verify that the pipeline works — they verify that each specific failure mode from each specific phase stays dead. Bugs that expensive don’t get to come back.