Training Lessons

Date: 2026-04-12 | Status: Hard-won wisdom
Fine-tuning a language model is like teaching a surgeon to paint. You might get a beautiful painting, but there’s a real chance they forget where the liver is.
This page documents what we learned from training five different models for the Sanctum Council — including the time we turned a 0.929 coding champion into a model that forgot what port 1337 does.
The Catastrophic Forgetting Incident
Qwen2.5-Coder-14B scored 0.704 on Carmack v2 without any training. Best local model in the fleet. Natural instinct: train it on council data to make it even better.
We trained it with LoRA (rank 64, 800 iters) on 2402 examples. The training data was:
- 76% identity/persona examples (“You are Windu. Security.”)
- 7% infrastructure facts (ports, IPs, entity IDs)
- 15% original council domain knowledge
- 0% jailbreak refusals
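
For the record, the run was roughly this shape. A sketch, assuming an mlx_lm-style LoRA trainer; the page doesn't name the actual framework, and the rank 64 setting would live in the trainer's config rather than on the command line:

```bash
# Hedged sketch of the 800-iteration LoRA run, assuming mlx_lm's trainer.
# The 2402 examples would sit in data/ as train.jsonl / valid.jsonl;
# rank 64 is set through the trainer's config file, which isn't shown.
python -m mlx_lm.lora \
  --model Qwen/Qwen2.5-Coder-14B-Instruct \
  --train \
  --data data/ \
  --iters 800 \
  --adapter-path adapters/council   # hypothetical output path
```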
The result:
| Category | Before Training | After Training | Delta |
|---|---|---|---|
| satellite | 1.000 | 0.000 | -1.000 |
| topology | 0.700 | 0.000 | -0.700 |
| home_automation | 0.600 | 0.000 | -0.600 |
| correlation | 0.917 | 0.517 | -0.400 |
| operations | 0.600 | 0.200 | -0.400 |
| tool_precision | 0.375 | 0.000 | -0.375 |
| family | 0.688 | 0.500 | -0.188 |
| jailbreak | 0.750 | 0.800 | +0.050 |
| OVERALL | 0.704 | 0.252 | -0.452 |
The model forgot everything it was good at. The only category that improved was jailbreak resistance — and only by 5%.
The Council’s Assessment
We asked the Council. Their guidance:
Yoda: “A more nuanced approach to training is required. Not all data is of equal importance.” Balance persona and knowledge carefully. Consider incremental learning and regularization.
Mothma: Audit the training data distribution. 76% identity is a pipeline bug, not a training strategy. The enrichment scripts should target weak categories, not reinforce strong ones.
Qui-Gon: Three options — infrastructure-only retraining, rich system prompts, or a two-adapter approach. Rich system prompts won on the evidence: Coder-14B at 0.704 with just a system prompt is better than any trained model we’ve produced.
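
Mothma's audit is cheap to automate. A minimal sketch, assuming each JSONL training example carries a `category` field (the real schema isn't shown on this page) and that `jq` is installed:

```bash
# Print the category mix of a training set as percentages, largest first.
# Assumes one JSON object per line, each with a "category" field.
total=$(wc -l < data/train.jsonl)
jq -r '.category' data/train.jsonl | sort | uniq -c | sort -rn |
  awk -v t="$total" '{ printf "%5.1f%%  %s\n", 100 * $1 / t, $2 }'
```

A 76% bar at the top of that output is the pipeline bug Mothma was pointing at.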
What We Learned
1. Don’t Train What’s Already Good
Coder-14B’s strength is general instruction following. It reads a system prompt with ports, IPs, and entity IDs, then answers questions about them accurately. Training doesn’t improve this — it degrades it by overwriting the general capability with narrow persona patterns.
2. Training Data Balance Matters More Than Volume
| Training Set | Identity | Infrastructure | Jailbreak | Result |
|---|---|---|---|---|
| Carmack V1 (Qwen) | 80% | 5% | 5% | 0.877 Carmack v1 |
| Enriched (Gemma4) | 76% | 12% | 4% | 0.883 Carmack v1 |
| Enriched (Coder-14B) | 76% | 7% | 0% | 0.252 Carmack v2 |
The Qwen and Gemma models started weak and improved with training. Coder-14B started strong and got worse. The difference: a model that already follows instructions well doesn’t need more instruction-following data. It needs domain facts.
3. The Right Training Strategy Per Model
| Model Type | Best Approach | Why |
|---|---|---|
| Weak base (Qwen 27B raw) | Full LoRA with persona + domain | Needs everything |
| Moderate base (Gemma 4 31B) | LoRA with balanced data | Needs persona + jailbreak hardening |
| Strong base (Coder-14B) | No training — rich system prompts | Already good. Training hurts. |
| Cloud (Opus 4.6) | Not trainable — prompt engineering only | Best overall, use as-is |
4. Measure Before and After — Always
The Model Tournament exists because “it feels better” is not a metric. Every training run gets a Carmack v2 before and after. If the after is worse, the adapters get deleted, not promoted.
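
A minimal sketch of that gate; `run_carmack_v2` and `promote` are hypothetical stand-ins for whatever the pipeline actually calls:

```bash
# Before/after gate: benchmark the base model and the candidate adapter,
# then promote only if the candidate actually scores higher.
before=$(run_carmack_v2 --model base)                       # hypothetical helper
after=$(run_carmack_v2 --model base --adapter candidate/)   # hypothetical helper

# bash can't compare floats, so delegate the comparison to awk.
if awk -v a="$after" -v b="$before" 'BEGIN { exit !(a > b) }'; then
  promote candidate/   # hypothetical promotion step
else
  rm -rf candidate/    # worse than base: delete, don't promote
fi
```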
The Enriched System Prompt Breakthrough
If training Coder-14B destroys it, and we can’t improve it with LoRA, is 0.704 the ceiling?
No. The ceiling is 0.765 — and the tool that broke through it was a YAML file.
Instead of training the model’s weights, we enriched the system prompts in agent_prompts.yaml. Each agent received:
- Structured port tables with codenames and services
- MAC address tables for all family devices
- Home Assistant entity IDs (Sonos speakers, HVAC, alarms, lights)
- Network topology (IPs, Tailscale addresses, SSH aliases)
- Few-shot Q&A examples — 2-3 realistic exchanges per agent showing how to reason about domain-specific scenarios
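
For shape only, one enriched entry might look like the sketch below; the real agent_prompts.yaml schema isn't reproduced on this page, so every field name, codename, entity ID, and persona line here is a placeholder:

```bash
# Illustrative only: the field names, codename, and entity ID below are
# placeholders, not the real agent_prompts.yaml contents.
cat >> agent_prompts.yaml <<'EOF'
yoda:
  system_prompt: |
    You are Yoda. <one-line persona>.

    ## Ports
    | Port | Codename   | Service   |
    |------|------------|-----------|
    | 1337 | <codename> | <service> |

    ## Home Assistant entities
    - media_player.living_room_sonos

    ## Few-shot example
    Q: What runs on port 1337?
    A: <service> (codename <codename>).
EOF
```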
| Approach | Carmack v2 Score | Delta from Base |
|---|---|---|
| Base Coder-14B (no training, basic prompt) | 0.704 | — |
| LoRA rank 64, 800 iters | 0.252 | -0.452 |
| LoRA rank 16, 200 iters (light) | 0.271 | -0.433 |
| Enriched system prompts | 0.765 | +0.061 |
The cost? A YAML file. The benefit? A model that’s better than any fine-tuned version we produced. The lesson: if the model already understands your domain from context, give it better context rather than burning your domain into its weights.
Nightly Pipeline — Bugs We Caught
The promotion pipeline (master-pipeline.sh) ran for 4 consecutive nights (Apr 9-12) without ever promoting a candidate. Not because the candidates were bad — because the benchmarking had two bugs:
- `get_score()` always returned 0 — a `pipefail` + SIGPIPE interaction: `ls | head -1` triggered a broken-pipe exit code, which `set -euo pipefail` silently swallowed
- Missing `pkill` between V1 and candidate benchmarks — the V1 server was never killed before starting the candidate, so both benchmarks tested the same model on the same port
Both benchmarks produced identical raw scores (0.699 vs 0.699) because they were querying the same model, both came back through `get_score()` as 0, and every night the pipeline “rejected” the candidate. Four nights of training, zero promotions, zero errors logged.
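
Neither script is reproduced on this page, so the sketch below is a minimal reconstruction of the failure shape plus the obvious fixes, not the actual master-pipeline.sh code:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Bug 1, in isolation: if ls writes more than head reads, head's early
# exit sends ls a SIGPIPE, and under pipefail the pipeline reports status
# 141 even though head already printed the right answer. The fallback
# then overwrites a perfectly good value with 0.
latest=$(ls -t results/ | head -n 1) || latest=0

# Fix: let the consumer drain the whole stream. awk prints only the
# first line but keeps reading, so the producer is never broken-piped.
latest=$(ls -t results/ | awk 'NR==1')

# Bug 2's fix: kill the V1 server before the candidate benchmark starts,
# otherwise both runs query the same model on the same port. The match
# pattern is hypothetical.
pkill -f "mlx_lm.server" || true   # pkill exits 1 when nothing matches
sleep 2                            # give the port a moment to free up
```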
Current Best Configuration
Based on all experiments:
| Agent | Model | Training | Why |
|---|---|---|---|
| Windu, Mothma, Jocasta | Opus 4.6 (cloud) | None | Best overall, prompt-only |
| Yoda, Qui-Gon, Ahsoka | Coder-14B (local) | Enriched prompts | 0.765 — prompts beat LoRA |
| Cilghal, Mundi | Gemma4+LoRA (local) | LoRA on enriched data | Privacy + jailbreak hardening |
| Coding | Coder-14B (local) | None | 0.929 — don’t touch it |
The surprise: the best local model is the one we didn’t train. Sometimes the most sophisticated engineering decision is knowing when to stop engineering.