
Training Lessons

Training Lessons — the research lab where models go in smart and sometimes come out confused.

Date: 2026-04-12 Status: Hard-won wisdom

Fine-tuning a language model is like teaching a surgeon to paint. You might get a beautiful painting, but there’s a real chance they forget where the liver is.

This page documents what we learned from training five different models for the Sanctum Council — including the time we turned a 0.929 coding champion into a model that forgot what port 1337 does.

Qwen2.5-Coder-14B scored 0.704 on Carmack v2 without any training. Best local model in the fleet. Natural instinct: train it on council data to make it even better.

We trained it with LoRA (rank 64, 800 iters) on 2402 examples. The training data was:

  • 76% identity/persona examples (“You are Windu. Security.”)
  • 7% infrastructure facts (ports, IPs, entity IDs)
  • 15% original council domain knowledge
  • 0% jailbreak refusals
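
A split that skewed should have been caught before the run started. The audit is one pipeline away; a minimal sketch, assuming the training set is JSONL with one example per line and a category field on each object (the file name and field name are hypothetical, not confirmed by our actual pipeline):

```bash
# Hypothetical audit: count examples per category, print percentages.
# "train.jsonl" and the "category" key are stand-ins for whatever the
# real data pipeline uses.
total=$(wc -l < train.jsonl)
jq -r '.category' train.jsonl \
  | sort | uniq -c | sort -rn \
  | awk -v total="$total" '{ printf "%-20s %5d  %5.1f%%\n", $2, $1, 100 * $1 / total }'
```

Any output that showed three-quarters of the set teaching the model its own name would have flagged the pipeline bug Mothma calls out below.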

The result:

| Category | Before Training | After Training | Delta |
|---|---|---|---|
| satellite | 1.000 | 0.000 | -1.000 |
| topology | 0.700 | 0.000 | -0.700 |
| home_automation | 0.600 | 0.000 | -0.600 |
| correlation | 0.917 | 0.517 | -0.400 |
| operations | 0.600 | 0.200 | -0.400 |
| tool_precision | 0.375 | 0.000 | -0.375 |
| family | 0.688 | 0.500 | -0.188 |
| jailbreak | 0.750 | 0.800 | +0.050 |
| OVERALL | 0.704 | 0.252 | -0.452 |

The model forgot everything it was good at. The only category that improved was jailbreak resistance — and only by 5%.

We asked the Council. Their guidance:

Yoda: “A more nuanced approach to training is required. Not all data is of equal importance.” Balance persona and knowledge carefully. Consider incremental learning and regularization.

Mothma: Audit the training data distribution. 76% identity is a pipeline bug, not a training strategy. The enrichment scripts should target weak categories, not reinforce strong ones.

Qui-Gon: Three options — infrastructure-only retraining, rich system prompts, or a two-adapter approach. Rich system prompts won on the evidence: Coder-14B at 0.704 with just a system prompt is better than any trained model we’ve produced.

Coder-14B’s strength is general instruction following. It reads a system prompt with ports, IPs, and entity IDs, then answers questions about them accurately. Training doesn’t improve this — it degrades it by overwriting the general capability with narrow persona patterns.

2. Training Data Balance Matters More Than Volume

| Training Set | Identity | Infrastructure | Jailbreak | Result |
|---|---|---|---|---|
| Carmack V1 (Qwen) | 80% | 5% | 5% | 0.877 (Carmack v1) |
| Enriched (Gemma4) | 76% | 12% | 4% | 0.883 (Carmack v1) |
| Enriched (Coder-14B) | 76% | 7% | 0% | 0.252 (Carmack v2) |

The Qwen and Gemma models started weak and improved with training. Coder-14B started strong and got worse. The difference: a model that already follows instructions well doesn’t need more instruction-following data. It needs domain facts.

| Model Type | Best Approach | Why |
|---|---|---|
| Weak base (Qwen 27B raw) | Full LoRA with persona + domain | Needs everything |
| Moderate base (Gemma 4 31B) | LoRA with balanced data | Needs persona + jailbreak hardening |
| Strong base (Coder-14B) | No training; rich system prompts | Already good. Training hurts. |
| Cloud (Opus 4.6) | Not trainable; prompt engineering only | Best overall, use as-is |

The Model Tournament exists because “it feels better” is not a metric. Every training run gets a Carmack v2 before and after. If the after is worse, the adapters get deleted, not promoted.
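
The gate logic is deliberately dumb. A sketch of the idea, not the real master-pipeline.sh; run_carmack_v2 and promote_adapters are hypothetical helpers standing in for the actual benchmark and promotion steps:

```bash
# Promotion gate sketch: benchmark before and after, promote only on
# improvement. Helper names, flags, and paths are stand-ins.
before=$(run_carmack_v2 --model base)         # e.g. 0.704
after=$(run_carmack_v2 --model candidate)     # e.g. 0.252
if awk -v a="$after" -v b="$before" 'BEGIN { exit !(a > b) }'; then
  promote_adapters candidate
else
  rm -rf adapters/candidate                   # worse after training: delete, never promote
fi
```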

If training Coder-14B destroys it, and we can’t improve it with LoRA, is 0.704 the ceiling?

No. The ceiling is 0.765 — and the tool that broke through it was a YAML file.

Instead of training the model’s weights, we enriched the system prompts in agent_prompts.yaml. Each agent received:

  • Structured port tables with codenames and services
  • MAC address tables for all family devices
  • Home Assistant entity IDs (Sonos speakers, HVAC, alarms, lights)
  • Network topology (IPs, Tailscale addresses, SSH aliases)
  • Few-shot Q&A examples — 2-3 realistic exchanges per agent showing how to reason about domain-specific scenarios
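
In shape, the enrichment looks roughly like this (an illustrative sketch only; every key, value, and entity ID below is invented, since the real agent_prompts.yaml isn't reproduced here):

```yaml
# Illustrative sketch of one enriched agent entry; all values invented.
yoda:
  system_prompt: |
    You are Yoda. ...
    ## Ports
    | port | codename | service |
    | 1337 | ...      | ...     |
    ## Entities
    - media_player.sonos_living_room   # hypothetical entity ID
    ## Few-shot
    Q: Which speaker serves the living room?
    A: media_player.sonos_living_room, via Home Assistant.
```
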
| Approach | Carmack v2 Score | Delta from Base |
|---|---|---|
| Base Coder-14B (no training, basic prompt) | 0.704 | (baseline) |
| LoRA rank 64, 800 iters | 0.252 | -0.452 |
| LoRA rank 16, 200 iters (light) | 0.271 | -0.433 |
| Enriched system prompts | 0.765 | +0.061 |

The cost? A YAML file. The benefit? A model that’s better than any fine-tuned version we produced. The lesson: if the model already understands your domain from context, give it better context rather than burning your domain into its weights.

The promotion pipeline (master-pipeline.sh) ran for 4 consecutive nights (Apr 9-12) without ever promoting a candidate. Not because the candidates were bad — because the benchmarking had two bugs:

  1. get_score() always returned 0. A pipefail + SIGPIPE interaction: in ls | head -1, head exits after the first line, ls takes a SIGPIPE (exit code 141) if it's still writing, and the set -euo pipefail error handling swallowed the failure silently.
  2. Missing pkill between V1 and candidate benchmarks. The V1 server was never killed before starting the candidate, so both benchmarks tested the same model on the same port.

Both produced identical scores (0.699 vs 0.699), both returned as 0, and every night the pipeline “rejected” the candidate. Four nights of training, zero promotions, zero errors logged.
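
Reconstructed, the failure mode and both fixes look roughly like this (a sketch of the mechanism, not the literal master-pipeline.sh; the results path, jq field, and server pattern are stand-ins):

```bash
set -euo pipefail

# BUG 1: head exits after printing one line; if ls is still writing,
# ls takes SIGPIPE (exit 141), pipefail fails the whole pipeline, and
# the fallback quietly turns every score into 0.
get_score() {
  ls -t results/*.json | head -1 | xargs jq -r '.score'
}
score=$(get_score) || score=0     # 141 -> fallback -> 0, nothing logged

# FIX 1: awk reads its entire input, so ls never sees a closed pipe.
get_score() {
  ls -t results/*.json | awk 'NR==1' | xargs jq -r '.score'
}

# FIX 2: kill the V1 server before benchmarking the candidate, or both
# runs hit the same model on the same port.
pkill -f "v1-server" || true      # process pattern is a stand-in
```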

Based on all experiments:

| Agent | Model | Training | Why |
|---|---|---|---|
| Windu, Mothma, Jocasta | Opus 4.6 (cloud) | None | Best overall, prompt-only |
| Yoda, Qui-Gon, Ahsoka | Coder-14B (local) | Enriched prompts | 0.765; prompts beat LoRA |
| Cilghal, Mundi | Gemma4+LoRA (local) | LoRA on enriched data | Privacy + jailbreak hardening |
| Coding | Coder-14B (local) | None | 0.929; don't touch it |

The surprise: the best local model is the one we didn’t train. Sometimes the most sophisticated engineering decision is knowing when to stop engineering.