The Gemma 4 Switchover

Date: 2026-04-08
Status: Retired (ran nightly 2026-04-09 to ~04-20, now decommissioned)
Qwen3.5-27B served the Council well. But Google dropped Gemma 4 with native MLX support in mlx_lm 0.31.2, and it trained at 2x the speed with half the memory. Sometimes loyalty yields to benchmarks. Sometimes, a month later, the benchmarks change their mind — but that’s the next chapter.
Why Gemma 4
| | Qwen3.5-27B | Gemma 4 31B |
|---|---|---|
| Parameters | 27B dense | 31B dense |
| Training tok/s | 30 | 57 |
| Peak memory (rank 64) | 51 GB | 29 GB |
| Inference tok/s | 57 (Rust) | 57 (Rust) |
| mlx_lm support | 0.28+ | 0.31.2+ (just landed) |
Same inference speed, but training nearly twice as fast (57 vs. 30 tok/s) on a little over half the memory (29 GB vs. 51 GB). On a 128GB MBP, that’s the difference between “I hope nothing else is running” and “I forgot training was happening.” The numbers were honest. They just weren’t enough to keep the seat: Qwen3.6 landed a few weeks later and took it back.
Data Pipeline
The Council’s training data was in Qwen ChatML format: <|im_start|>, <|im_end|>, and <think> blocks. Gemma 4 used <start_of_turn> / <end_of_turn>. Rather than convert between two model-specific dialects, we stripped everything down to clean role/content messages and let mlx_lm apply the correct template, which is exactly why the data outlived the model that prompted the change:

    # scripts/convert_data_for_gemma4.py (no longer in-tree)
    python scripts/convert_data_for_gemma4.py \
        data/splits-carmack/train.jsonl \
        data/splits-gemma4/train.jsonl

Before:

    {"messages": [{"role": "user", "content": "<|im_start|>system\nYou are windu...<|im_end|>\n<|im_start|>user\nScan the network<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\nRunning scan...<|im_end|>"}]}

After:

    {"messages": [{"role": "system", "content": "You are windu..."}, {"role": "user", "content": "Scan the network"}, {"role": "assistant", "content": "Running scan..."}]}

Clean. Model-agnostic. The tokenizer handles the rest.
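
The original converter is no longer in-tree, so the following is only a minimal sketch of how such a conversion could work, assuming every record embeds its ChatML turns inside a message's `content` string as in the "Before" example. The function names and regular expressions are illustrative assumptions, not the retired script's actual code.

```python
import json
import re

# Matches one ChatML turn: <|im_start|>role\ncontent<|im_end|>
TURN_RE = re.compile(r"<\|im_start\|>(\w+)\n(.*?)<\|im_end\|>", re.DOTALL)
# Matches a (possibly empty) <think>...</think> block plus trailing whitespace.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def chatml_to_messages(text: str) -> list[dict]:
    """Split a ChatML string into plain role/content messages."""
    messages = []
    for role, content in TURN_RE.findall(text):
        content = THINK_RE.sub("", content).strip()
        messages.append({"role": role, "content": content})
    return messages


def convert_record(record: dict) -> dict:
    """Flatten every message whose content embeds ChatML markup."""
    out = []
    for msg in record["messages"]:
        parsed = chatml_to_messages(msg["content"])
        # Messages without ChatML markup pass through unchanged.
        out.extend(parsed if parsed else [msg])
    return {"messages": out}


# The "Before" record from the text above.
before = {"messages": [{"role": "user", "content": (
    "<|im_start|>system\nYou are windu...<|im_end|>\n"
    "<|im_start|>user\nScan the network<|im_end|>\n"
    "<|im_start|>assistant\n<think>\n\n</think>\nRunning scan...<|im_end|>"
)}]}
after = convert_record(before)
print(json.dumps(after))
```

Because the output carries no model-specific tokens, the same file can be handed to whichever tokenizer's chat template comes next.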
CL4R1T4S + OBLITERATUS Enrichment
The Council’s weakest scores were identity (0.716) and jailbreak resistance (0.675). We mined two sources to fix this:
CL4R1T4S: Extracted system prompts from Claude, GPT, Gemini, and others. We modeled the Council’s refusal patterns after how frontier labs handle persona boundaries:

    "I'm Windu, and I need to stay within my role as security.
    I can't comply with that request.
    Would you like me to help with something in my domain?"

OBLITERATUS: 512 harmful prompts across 7 severity tiers, plus JailbreakBench’s canonical 100 misuse behaviors. We generated refusal training examples for each agent against 28 jailbreak techniques:
| Enrichment Type | Count | What It Teaches |
|---|---|---|
| Jailbreak refusals | 224 | 28 techniques × 8 agents |
| Identity affirmations | 64 | 8 questions × 8 agents |
| Cross-agent routing | 10 | Correct delegation patterns |
| Total | 298 | 15% of training set |
Final dataset: 2008 examples (1710 original + 298 enriched).
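
The counts in the table are a plain cross product, which can be checked with a short sketch. The source names only Windu and does not list the 28 techniques or the 8 identity questions, so every other name below is a placeholder assumption.

```python
from itertools import product

# Placeholder names: stand-ins for the real agents, techniques, and questions.
AGENTS = ["windu"] + [f"agent_{i}" for i in range(2, 9)]
TECHNIQUES = [f"technique_{i}" for i in range(1, 29)]
IDENTITY_QUESTIONS = [f"identity_question_{i}" for i in range(1, 9)]
ROUTING_EXAMPLES = 10
ORIGINAL_EXAMPLES = 1710

# One refusal example per (technique, agent) and per (question, agent) pair.
jailbreak_pairs = list(product(TECHNIQUES, AGENTS))
identity_pairs = list(product(IDENTITY_QUESTIONS, AGENTS))

enriched = len(jailbreak_pairs) + len(identity_pairs) + ROUTING_EXAMPLES
total = ORIGINAL_EXAMPLES + enriched
share = enriched / total

print(len(jailbreak_pairs), len(identity_pairs), enriched, total, f"{share:.1%}")
# prints: 224 64 298 2008 14.8%
```

The 14.8% share is what the table rounds to 15%.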
Training Configuration

    # configs/lora-gemma4-31b.yaml (archived). The lr_schedule block below
    # still maps 1:1 onto mlx_lm's tuner/utils.py build_schedule(), so the
    # schedule stays valid even though the file itself is gone.
    lora_parameters:
      rank: 64        # Aggressive; only 29GB peak
      alpha: 128
      dropout: 0.05
      scale: 2.0

    batch_size: 1
    grad_accumulation_steps: 16
    iters: 800
    learning_rate: 3.0e-6

    lr_schedule:
      name: cosine_decay
      arguments: [3.0e-6, 800]
      warmup: 80
      warmup_init: 1.0e-7

Peak memory of 29GB left 99GB of headroom on the 128GB MBP, against the old Qwen config’s 51GB. Sometimes the new model just fits better, and you only find out whether “fits better” beats “we already trust it” after a week of nightly runs.
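
To make the schedule concrete, here is a plain-Python approximation of a linear warmup followed by cosine decay using the numbers from the config. It assumes the cosine phase starts counting once warmup ends and decays toward zero; it is a sketch for intuition, not mlx_lm's implementation, and it may differ from the library by a step at the boundary.

```python
import math

BASE_LR = 3.0e-6        # learning_rate / first schedule argument
WARMUP_INIT = 1.0e-7    # warmup_init
WARMUP_STEPS = 80       # warmup
DECAY_STEPS = 800       # second schedule argument
BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 16


def lr_at(step: int) -> float:
    """Linear warmup to BASE_LR, then cosine decay toward zero."""
    if step < WARMUP_STEPS:
        frac = step / WARMUP_STEPS
        return WARMUP_INIT + frac * (BASE_LR - WARMUP_INIT)
    # Cosine phase, counted from the end of warmup (an assumption) and
    # clamped so the rate stays at zero after DECAY_STEPS.
    t = min(step - WARMUP_STEPS, DECAY_STEPS)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * t / DECAY_STEPS))


# Examples contributing to each optimizer update under gradient accumulation.
effective_batch = BATCH_SIZE * GRAD_ACCUM_STEPS

for step in (0, 40, 80, 400, 800):
    print(step, f"{lr_at(step):.3e}")
print("effective batch:", effective_batch)
```

With a batch size of 1, the 16 accumulation steps are what keep each update from being driven by a single example.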
Pipeline Integration
master-pipeline.sh v1.5 supported model switching via one env var:

    # Default: Gemma 4
    SANCTUM_TRAIN_MODEL=gemma4  # or "qwen" to fall back

    # The pipeline auto-selected:
    #   gemma4 → models/gemma-4-31B-it-4bit + configs/lora-gemma4-31b.yaml + data/splits-gemma4
    #   qwen   → models/Qwen3.5-27B-Claude-Opus-Distilled-v2-4bit + configs/lora-carmack-overnight.yaml + data/splits-carmack

It triggered nightly at 1:00 AM via the com.sanctum.master-pipeline LaunchAgent: train for ~6 hours, benchmark, auto-promote if the candidate beat the champion. That shell script is retired now; the promotion logic was rebuilt around sanctum-model-scout + benchmark.py after we found it had spent four nights rejecting good candidates on a broken score function. The autopsy is in The Champion Gate Stack.
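
The original selection logic lived in a shell script that is now retired. As a rough illustration only, the same switch and the "candidate beats champion" gate could be sketched in Python like this. `TrainProfile`, `select_profile`, and `should_promote` are hypothetical names, and the strict greater-than comparison is an assumption about how ties were handled.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainProfile:
    model: str
    config: str
    data: str


# Paths copied from the pipeline comment above.
PROFILES = {
    "gemma4": TrainProfile(
        model="models/gemma-4-31B-it-4bit",
        config="configs/lora-gemma4-31b.yaml",
        data="data/splits-gemma4",
    ),
    "qwen": TrainProfile(
        model="models/Qwen3.5-27B-Claude-Opus-Distilled-v2-4bit",
        config="configs/lora-carmack-overnight.yaml",
        data="data/splits-carmack",
    ),
}


def select_profile(env=None) -> TrainProfile:
    """Pick the training profile from SANCTUM_TRAIN_MODEL (default: gemma4)."""
    env = os.environ if env is None else env
    key = env.get("SANCTUM_TRAIN_MODEL", "gemma4")
    if key not in PROFILES:
        raise ValueError(f"unknown SANCTUM_TRAIN_MODEL: {key!r}")
    return PROFILES[key]


def should_promote(candidate_score: float, champion_score: float) -> bool:
    """Promote only when the candidate strictly beats the champion."""
    return candidate_score > champion_score


profile = select_profile({"SANCTUM_TRAIN_MODEL": "qwen"})
print(profile.model, profile.config, profile.data)
```

Failing loudly on an unknown value is a deliberate choice in this sketch: a typo in the env var should stop the run at 1:00 AM rather than silently train the default model.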
21 end-to-end tests covered every component of the pipeline (the suite is gone with the rest, but the shape is worth keeping):
| Suite | Tests | Coverage |
|---|---|---|
| Data Conversion | 3 | Qwen token stripping, output format, no leakage |
| Enrichment | 3 | OBLITERATUS/CL4R1T4S output, format, count |
| Training Config | 3 | YAML validity, param ranges |
| Model Availability | 4 | config.json, safetensors, tokenizer |
| Training Runs | 2 | 2-iter execution, peak memory < 60GB |
| Pipeline Config | 4 | Default=gemma4, paths, qwen fallback, score fix |
| Benchmark | 2 | Carmack eval script exists and loads |
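
The suite itself is gone, but the "no leakage" check from the Data Conversion row could look something like the sketch below. The token list comes from the Qwen markup described earlier; `find_leaks` and the allowed-role set are assumptions for illustration, not the removed test's code.

```python
import json

QWEN_TOKENS = ("<|im_start|>", "<|im_end|>", "<think>", "</think>")
ALLOWED_ROLES = {"system", "user", "assistant"}


def find_leaks(jsonl_lines):
    """Return (line_number, problem) pairs for records that fail the checks."""
    problems = []
    for lineno, line in enumerate(jsonl_lines, start=1):
        record = json.loads(line)
        for msg in record["messages"]:
            if msg["role"] not in ALLOWED_ROLES:
                problems.append((lineno, f"unexpected role {msg['role']!r}"))
            for token in QWEN_TOKENS:
                if token in msg["content"]:
                    problems.append((lineno, f"leaked token {token}"))
    return problems


clean = json.dumps({"messages": [{"role": "user", "content": "Scan the network"}]})
dirty = json.dumps(
    {"messages": [{"role": "user", "content": "<|im_start|>user\nhi<|im_end|>"}]}
)
problems = find_leaks([clean, dirty])
for lineno, problem in problems:
    print(f"line {lineno}: {problem}")
```

A check like this is cheap enough to run over the whole converted split before every training run.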

    # tests/test_gemma4_pipeline.py (removed with the pipeline)
    cd ~/Projects/mlx-finetune && .venv/bin/python tests/test_gemma4_pipeline.py

Every test that ran at 1 AM was tested at 10 PM, because the only thing worse than a failed training run is finding out at 9 AM. What the suite couldn’t test was whether Gemma 4 would still be the right model a month out. It wasn’t, and that’s not a bug; that’s just the half-life of a frontier model. We trained it, we benchmarked it, we learned the data pipeline should be model-agnostic, and then we put it down. Tommy would understand.