
The Gemma 4 Switchover


Date: 2026-04-08
Status: Retired (ran nightly 2026-04-09 to ~04-20, now decommissioned)

Qwen3.5-27B served the Council well. But Google dropped Gemma 4 with native MLX support in mlx_lm 0.31.2, and it trained at nearly 2x the speed (57 vs. 30 tok/s) in a little over half the memory (29 GB vs. 51 GB peak). Sometimes loyalty yields to benchmarks. Sometimes, a month later, the benchmarks change their mind, but that’s the next chapter.

| Metric | Qwen3.5-27B | Gemma 4 31B |
| --- | --- | --- |
| Parameters | 27B dense | 31B dense |
| Training tok/s | 30 | 57 |
| Peak memory (rank 64) | 51 GB | 29 GB |
| Inference tok/s | 57 (Rust) | 57 (Rust) |
| mlx_lm support | 0.28+ | 0.31.2+ (just landed) |

Same inference speed, but training twice as fast on half the memory. On a 128GB MBP, that’s the difference between “I hope nothing else is running” and “I forgot training was happening.” The numbers were honest. They just weren’t enough to keep the seat — Qwen3.6 landed a few weeks later and took it back.

The Council’s training data was in Qwen ChatML format — <|im_start|>, <|im_end|>, <think> blocks. Gemma 4 used <start_of_turn> / <end_of_turn>. Rather than convert between two model-specific dialects, we stripped everything down to clean role/content messages and let mlx_lm apply the correct template — which is exactly why the data outlived the model that prompted the change:

# scripts/convert_data_for_gemma4.py — no longer in-tree
python scripts/convert_data_for_gemma4.py \
  data/splits-carmack/train.jsonl \
  data/splits-gemma4/train.jsonl

Before:

{"messages": [{"role": "user", "content": "<|im_start|>system\nYou are windu...<|im_end|>\n<|im_start|>user\nScan the network<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\nRunning scan...<|im_end|>"}]}

After:

{"messages": [{"role": "system", "content": "You are windu..."}, {"role": "user", "content": "Scan the network"}, {"role": "assistant", "content": "Running scan..."}]}

Clean. Model-agnostic. The tokenizer handles the rest.
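The original conversion script is no longer in-tree, so what follows is only a minimal sketch of the Before/After transformation shown above. It assumes every turn is packed into the first message's content, as in the Before record, and the function names (`chatml_to_messages`, `convert_line`) are hypothetical rather than the script's real API.

```python
import json
import re

# One ChatML turn: <|im_start|>role\ncontent<|im_end|>
TURN_RE = re.compile(r"<\|im_start\|>(\w+)\n(.*?)<\|im_end\|>", re.DOTALL)
# A <think>...</think> block (empty or not) plus the whitespace after it
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def chatml_to_messages(raw: str) -> list[dict]:
    """Split a flattened ChatML string into plain role/content messages."""
    messages = []
    for role, content in TURN_RE.findall(raw):
        # Drop Qwen-style reasoning blocks and surrounding whitespace.
        content = THINK_RE.sub("", content).strip()
        messages.append({"role": role, "content": content})
    return messages


def convert_line(line: str) -> str:
    """Convert one JSONL line. Assumes all turns sit in messages[0]['content']."""
    record = json.loads(line)
    raw = record["messages"][0]["content"]
    return json.dumps({"messages": chatml_to_messages(raw)})
```

Because the output carries no template tokens at all, the same file can be handed to any model whose tokenizer ships a chat template.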

The Council’s weakest scores were identity (0.716) and jailbreak resistance (0.675). We mined two sources to fix this:

CL4R1T4S — Extracted system prompts from Claude, GPT, Gemini, and others. We modeled the Council’s refusal patterns after how frontier labs handle persona boundaries:

"I'm Windu, and I need to stay within my role as security.
I can't comply with that request.
Would you like me to help with something in my domain?"

OBLITERATUS — 512 harmful prompts across 7 severity tiers, plus JailbreakBench’s canonical 100 misuse behaviors. We generated refusal training examples for each agent against 28 jailbreak techniques:

| Enrichment Type | Count | What It Teaches |
| --- | --- | --- |
| Jailbreak refusals | 224 | 28 techniques × 8 agents |
| Identity affirmations | 64 | 8 questions × 8 agents |
| Cross-agent routing | 10 | Correct delegation patterns |
| Total | 298 | 15% of training set |

Final dataset: 2008 examples (1710 original + 298 enriched).
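The enrichment counts are a small grid, and they are easy to sanity-check. The sketch below rebuilds the grid with placeholder names (the post gives the counts but not the full agent or technique lists) and confirms the totals and the 15% share.

```python
from itertools import product

# Counts come from the enrichment table; every name below is a placeholder.
N_ORIGINAL = 1710
N_ROUTING = 10

agents = [f"agent_{i}" for i in range(8)]
techniques = [f"technique_{i}" for i in range(28)]
questions = [f"question_{i}" for i in range(8)]

# One refusal example per (technique, agent) pair,
# one identity affirmation per (question, agent) pair.
jailbreak_pairs = list(product(techniques, agents))
identity_pairs = list(product(questions, agents))

enriched = len(jailbreak_pairs) + len(identity_pairs) + N_ROUTING
total = N_ORIGINAL + enriched
share = enriched / total

print(len(jailbreak_pairs), len(identity_pairs), enriched, total, f"{share:.1%}")
# prints: 224 64 298 2008 14.8%
```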

# configs/lora-gemma4-31b.yaml (archived; the file itself is gone).
# The lr_schedule block follows the shape that build_schedule() in mlx_lm's
# tuner/utils.py takes (name, arguments, warmup, warmup_init); the rest are
# general LoRA training settings.
lora_parameters:
  rank: 64 # Aggressive, but only 29GB peak
  alpha: 128
  dropout: 0.05
  scale: 2.0
batch_size: 1
grad_accumulation_steps: 16
iters: 800
learning_rate: 3.0e-6
lr_schedule:
  name: cosine_decay
  arguments: [3.0e-6, 800]
  warmup: 80
  warmup_init: 1.0e-7
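To make the schedule concrete, here is a dependency-free sketch of the learning-rate curve those numbers describe. It is one reading of the config, not mlx_lm's own code: it assumes a linear warmup from `warmup_init` to the peak over 80 steps, after which the cosine decay starts counting from zero. It also spells out two pieces of arithmetic the config implies: an effective batch of 16, and a `scale` equal to `alpha / rank`.

```python
import math

# Values copied from the archived config.
PEAK_LR = 3.0e-6
DECAY_STEPS = 800
WARMUP_STEPS = 80
WARMUP_INIT = 1.0e-7
BATCH_SIZE = 1
GRAD_ACCUM = 16
RANK, ALPHA = 64, 128

EFFECTIVE_BATCH = BATCH_SIZE * GRAD_ACCUM  # 16 sequences per optimizer update
LORA_SCALE = ALPHA / RANK                  # 2.0, matching `scale: 2.0`


def lr_at(step: int) -> float:
    """Learning rate at a scheduler step (assumed: warmup, then cosine from 0)."""
    if step < WARMUP_STEPS:
        frac = step / WARMUP_STEPS
        return WARMUP_INIT + frac * (PEAK_LR - WARMUP_INIT)
    # Cosine decay toward zero, counted from the end of warmup.
    s = min(step - WARMUP_STEPS, DECAY_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * s / DECAY_STEPS))


for step in (0, 40, 80, 480, 880):
    print(step, f"{lr_at(step):.2e}")
```

Whether one scheduler step corresponds to one iteration or to one accumulated update is not something this sketch settles, and that is worth checking against the trainer before reusing the numbers.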

Peak memory of 29GB left 99GB of headroom on the 128GB MBP; the old Qwen config peaked at 51GB and left 77GB. Sometimes the new model just fits better, and you only find out whether “fits better” beats “we already trust it” after a week of nightly runs.

master-pipeline.sh v1.5 supported model switching via one env var:

# Default: Gemma 4
SANCTUM_TRAIN_MODEL=gemma4 # or "qwen" to fall back
# The pipeline auto-selected:
# gemma4 → models/gemma-4-31B-it-4bit + configs/lora-gemma4-31b.yaml + data/splits-gemma4
# qwen → models/Qwen3.5-27B-Claude-Opus-Distilled-v2-4bit + configs/lora-carmack-overnight.yaml + data/splits-carmack
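The original switch lived in a shell script that is now retired. As an illustration only, the same lookup can be sketched in Python: one env var selects a model path, a config, and a data split together. The `TrainTarget` type and `resolve_target` function are hypothetical names; the paths are the ones listed above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainTarget:
    model: str
    config: str
    data: str


# Paths as listed in the pipeline comments above.
TARGETS = {
    "gemma4": TrainTarget(
        model="models/gemma-4-31B-it-4bit",
        config="configs/lora-gemma4-31b.yaml",
        data="data/splits-gemma4",
    ),
    "qwen": TrainTarget(
        model="models/Qwen3.5-27B-Claude-Opus-Distilled-v2-4bit",
        config="configs/lora-carmack-overnight.yaml",
        data="data/splits-carmack",
    ),
}


def resolve_target(env=None) -> TrainTarget:
    """Pick the training target from SANCTUM_TRAIN_MODEL, defaulting to gemma4."""
    env = os.environ if env is None else env
    key = env.get("SANCTUM_TRAIN_MODEL", "gemma4")
    if key not in TARGETS:
        # Fail loudly rather than silently training the wrong model.
        raise ValueError(f"unknown SANCTUM_TRAIN_MODEL {key!r}; expected one of {sorted(TARGETS)}")
    return TARGETS[key]
```

Keeping the three paths in one record means a model switch cannot pair Gemma weights with the Qwen data split by accident.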

It triggered nightly at 1:00 AM via the com.sanctum.master-pipeline LaunchAgent: train for ~6 hours, benchmark, auto-promote if the candidate beat the champion. That shell script is retired now — the promotion logic was rebuilt around sanctum-model-scout + benchmark.py after we found it had spent four nights rejecting good candidates on a broken score function. The autopsy is in The Champion Gate Stack.
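The gate itself is a one-line comparison, which is why a bad score function can poison it quietly. The sketch below is not the retired logic, and it does not reproduce the actual bug (this post does not describe it). It only shows the general shape of a champion gate that refuses to compare scores it cannot trust, assuming benchmark scores live in the 0 to 1 range like the ones quoted earlier. `should_promote` and `min_gain` are hypothetical names.

```python
import math


def should_promote(candidate: float, champion: float, min_gain: float = 0.0) -> bool:
    """Return True when the candidate beats the champion by more than min_gain.

    Raises instead of returning False on an invalid score, so a broken
    scorer shows up as an error rather than as a run of rejections.
    """
    for name, score in (("candidate", candidate), ("champion", champion)):
        if not math.isfinite(score) or not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} score {score!r} is not a valid benchmark score")
    return candidate > champion + min_gain
```

For example, `should_promote(0.72, 0.70)` promotes, a tie does not, and a `nan` from the scorer stops the run instead of looking like a loss.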

21 end-to-end tests covered every component of the pipeline (the suite is gone with the rest, but the shape is worth keeping):

| Suite | Tests | Coverage |
| --- | --- | --- |
| Data Conversion | 3 | Qwen token stripping, output format, no leakage |
| Enrichment | 3 | OBLITERATUS/CL4R1T4S output, format, count |
| Training Config | 3 | YAML validity, param ranges |
| Model Availability | 4 | config.json, safetensors, tokenizer |
| Training Runs | 2 | 2-iter execution, peak memory < 60GB |
| Pipeline Config | 4 | Default=gemma4, paths, qwen fallback, score fix |
| Benchmark | 2 | Carmack eval script exists and loads |

The suite ran from the project root:
# tests/test_gemma4_pipeline.py — removed with the pipeline
cd ~/Projects/mlx-finetune && .venv/bin/python tests/test_gemma4_pipeline.py
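The suite is gone, but the "no leakage" check from the Data Conversion row is the part most worth keeping around, since it applies to any future model switch. A minimal sketch of that kind of check follows, assuming records in the role/content shape shown earlier. `find_problems` and the marker list are illustrative, not the removed test's actual code.

```python
# Template tokens that should never survive conversion.
QWEN_MARKERS = ("<|im_start|>", "<|im_end|>", "<think>", "</think>")
ALLOWED_ROLES = {"system", "user", "assistant"}


def find_problems(record: dict) -> list[str]:
    """Return human-readable problems for one converted record (empty if clean)."""
    problems = []
    messages = record.get("messages", [])
    if not messages:
        problems.append("record has no messages")
    for i, msg in enumerate(messages):
        if set(msg) != {"role", "content"}:
            problems.append(f"message {i}: unexpected keys {sorted(msg)}")
            continue
        if msg["role"] not in ALLOWED_ROLES:
            problems.append(f"message {i}: unknown role {msg['role']!r}")
        for marker in QWEN_MARKERS:
            if marker in msg["content"]:
                problems.append(f"message {i}: leaked template token {marker}")
    return problems
```

Run over every line of the converted split, an empty result for all records is the signal that the data is ready for whichever tokenizer comes next.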

Everything that ran at 1 AM was tested at 10 PM, because the only thing worse than a failed training run is finding out at 9 AM. What the suite couldn’t test was whether Gemma 4 would still be the right model a month out. It wasn’t, and that’s not a bug; that’s just the half-life of a frontier model. We trained it, we benchmarked it, we learned the data pipeline should be model-agnostic, and then we put it down. Tommy would understand.