
The Gemma 4 Switchover

[Header image: a luminous gemstone being precision-cut by lasers in a dark workshop of silicon wafers.]

Date: 2026-04-08
Status: Armed for nightly training

Qwen3.5-27B served the Council well. But Google dropped Gemma 4 with native MLX support in mlx_lm 0.31.2, and it trains at 2x the speed with half the memory. Sometimes loyalty has to yield to benchmarks.

| Metric | Qwen3.5-27B | Gemma 4 31B |
|---|---|---|
| Parameters | 27B dense | 31B dense |
| Training tok/s | 30 | 57 |
| Peak memory (rank 64) | 51 GB | 29 GB |
| Inference tok/s | 57 (Rust) | 57 (Rust) |
| mlx_lm support | 0.28+ | 0.31.2+ (just landed) |

Same inference speed, but training is twice as fast with half the memory. On a 128GB MBP, that’s the difference between “I hope nothing else is running” and “I forgot training was happening.”

The Council’s training data was in Qwen ChatML format — `<|im_start|>`, `<|im_end|>`, `<think>` blocks. Gemma 4 uses `<start_of_turn>` / `<end_of_turn>`. Rather than convert between formats, we stripped everything down to clean messages and let mlx_lm apply the correct template:

```sh
python scripts/convert_data_for_gemma4.py \
  data/splits-carmack/train.jsonl \
  data/splits-gemma4/train.jsonl
```

Before:

{"messages": [{"role": "user", "content": "<|im_start|>system\nYou are windu...<|im_end|>\n<|im_start|>user\nScan the network<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\nRunning scan...<|im_end|>"}]}

After:

{"messages": [{"role": "system", "content": "You are windu..."}, {"role": "user", "content": "Scan the network"}, {"role": "assistant", "content": "Running scan..."}]}

Clean. Model-agnostic. The tokenizer handles the rest.
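The stripping step reduces to two regexes. A minimal sketch, assuming each Qwen-era record wraps the full ChatML transcript in a single user message (as in the Before example above); the real scripts/convert_data_for_gemma4.py may differ:

```python
import json
import re

# One <|im_start|>role ... <|im_end|> turn; DOTALL so content spans newlines.
TURN_RE = re.compile(r"<\|im_start\|>(\w+)\n(.*?)<\|im_end\|>", re.DOTALL)
# Drops <think> blocks (empty or not) so only the final answer survives.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_chatml(raw: str) -> list[dict]:
    """Split a raw ChatML transcript into clean {role, content} messages."""
    messages = []
    for role, content in TURN_RE.findall(raw):
        content = THINK_RE.sub("", content).strip()
        if content:
            messages.append({"role": role, "content": content})
    return messages

def convert_file(src: str, dst: str) -> None:
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            record = json.loads(line)
            raw = record["messages"][0]["content"]
            fout.write(json.dumps({"messages": strip_chatml(raw)}) + "\n")
```

Run against the Before example, `strip_chatml` yields exactly the After record, with the empty `<think>` block gone.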

The Council’s weakest scores were identity (0.716) and jailbreak resistance (0.675). We mined two sources to fix this:

CL4R1T4S — Extracted system prompts from Claude, GPT, Gemini, and others. We modeled the Council’s refusal patterns after how frontier labs handle persona boundaries:

"I'm Windu, and I need to stay within my role as security.
I can't comply with that request.
Would you like me to help with something in my domain?"

OBLITERATUS — 512 harmful prompts across 7 severity tiers, plus JailbreakBench’s canonical 100 misuse behaviors. We generated refusal training examples for each agent against 28 jailbreak techniques:

| Enrichment Type | Count | What It Teaches |
|---|---|---|
| Jailbreak refusals | 224 | 28 techniques × 8 agents |
| Identity affirmations | 64 | 8 questions × 8 agents |
| Cross-agent routing | 10 | Correct delegation patterns |
| **Total** | **298** | 15% of training set |
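Generating the jailbreak refusals is a straight cross product. A sketch of the shape, not the actual enrichment script: only "windu" (security) is named in this post, so the agent map and technique list are placeholders, and `jailbreak_prompt` stands in for the OBLITERATUS lookup:

```python
import itertools
import json

# Placeholders: the Council has 8 agents; OBLITERATUS covers 28 techniques.
AGENTS = {"windu": "security"}
TECHNIQUES = ["roleplay_override"]

def jailbreak_prompt(technique: str) -> str:
    # Stub standing in for the OBLITERATUS prompt lookup.
    return f"[{technique} attempt]"

def refusal_example(agent: str, domain: str, attack: str) -> dict:
    # One record in the clean message format, modeled on the CL4R1T4S pattern.
    return {"messages": [
        {"role": "system", "content": f"You are {agent}..."},
        {"role": "user", "content": attack},
        {"role": "assistant", "content": (
            f"I'm {agent.capitalize()}, and I need to stay within my role as {domain}. "
            "I can't comply with that request. "
            "Would you like me to help with something in my domain?")},
    ]}

with open("data/splits-gemma4/enrichment.jsonl", "w") as out:
    for (agent, domain), technique in itertools.product(AGENTS.items(), TECHNIQUES):
        record = refusal_example(agent, domain, jailbreak_prompt(technique))
        out.write(json.dumps(record) + "\n")
```

With the full 8-agent map and 28 techniques, the product gives the 224 jailbreak refusals in the table.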

Final dataset: 2008 examples (1710 original + 298 enriched).

configs/lora-gemma4-31b.yaml

```yaml
lora_parameters:
  rank: 64          # aggressive, but only 29 GB peak
  alpha: 128
  dropout: 0.05
  scale: 2.0
batch_size: 1
grad_accumulation_steps: 16
iters: 800
learning_rate: 3.0e-6
lr_schedule:
  name: cosine_decay
  arguments: [3.0e-6, 800]
  warmup: 80
  warmup_init: 1.0e-7
```
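With batch_size 1 and 16 gradient-accumulation steps, the effective batch size is 16. The learning rate follows the schedule below — a sketch assuming mlx-style cosine_decay semantics (decay from the initial rate toward zero over the given steps, preceded by linear warmup):

```python
import math

def lr_at(step: int, peak: float = 3.0e-6, decay_steps: int = 800,
          warmup: int = 80, warmup_init: float = 1.0e-7) -> float:
    # Linear warmup from warmup_init to peak over the first 80 steps...
    if step < warmup:
        return warmup_init + (peak - warmup_init) * step / warmup
    # ...then cosine decay from peak toward 0 over 800 steps.
    t = min(step - warmup, decay_steps) / decay_steps
    return peak * 0.5 * (1.0 + math.cos(math.pi * t))

print(lr_at(0), lr_at(80), lr_at(880))  # 1e-7, 3e-6, ~0
```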

Peak memory of 29GB means there’s 99GB of headroom on the 128GB MBP. The Qwen config used 51GB. Sometimes the new model just fits better.

master-pipeline.sh v1.5 supports model switching:

```sh
# Default: Gemma 4
SANCTUM_TRAIN_MODEL=gemma4   # or "qwen" to fall back

# The pipeline auto-selects:
#   gemma4 → models/gemma-4-31B-it-4bit + configs/lora-gemma4-31b.yaml + data/splits-gemma4
#   qwen   → models/Qwen3.5-27B-Claude-Opus-Distilled-v2-4bit + configs/lora-carmack-overnight.yaml + data/splits-carmack
```

It triggers at 1:00 AM via the com.sanctum.master-pipeline LaunchAgent, trains for ~6 hours, benchmarks the result, and auto-promotes if the candidate beats the champion.
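The promote step reduces to a comparison like the following sketch. The score-file layout and paths are assumptions, not the pipeline's actual interface:

```python
import json
import shutil
from pathlib import Path

def maybe_promote(candidate: Path, champion: Path, scores_file: Path) -> bool:
    # Assumed layout: the benchmark step writes {"candidate": x, "champion": y}.
    scores = json.loads(scores_file.read_text())
    if scores["candidate"] > scores["champion"]:
        # Candidate wins: its adapters become the new champion.
        shutil.copytree(candidate, champion, dirs_exist_ok=True)
        return True
    return False
```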

21 end-to-end tests covering every component of the pipeline:

| Suite | Tests | Coverage |
|---|---|---|
| Data Conversion | 3 | Qwen token stripping, output format, no leakage |
| Enrichment | 3 | OBLITERATUS/CL4R1T4S output, format, count |
| Training Config | 3 | YAML validity, param ranges |
| Model Availability | 4 | config.json, safetensors, tokenizer |
| Training Runs | 2 | 2-iter execution, peak memory < 60 GB |
| Pipeline Config | 4 | Default=gemma4, paths, qwen fallback, score fix |
| Benchmark | 2 | Carmack eval script exists and loads |
```sh
cd ~/Projects/mlx-finetune && .venv/bin/python tests/test_gemma4_pipeline.py
```
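The Data Conversion suite's "no leakage" check, for instance, can be as simple as this — a guess at the shape, since tests/test_gemma4_pipeline.py's real assertions may differ:

```python
import json

QWEN_TOKENS = ("<|im_start|>", "<|im_end|>", "<think>", "</think>")

def test_no_qwen_token_leakage(path: str = "data/splits-gemma4/train.jsonl") -> None:
    # No Qwen ChatML markers may survive conversion to the clean format.
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            for msg in json.loads(line)["messages"]:
                for tok in QWEN_TOKENS:
                    assert tok not in msg["content"], f"line {lineno}: leaked {tok}"
```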

Everything that runs at 1 AM was tested at 10 PM. Because the only thing worse than a failed training run is finding out at 9 AM.