
Model Tournament

Model Tournament — AI gladiators competing on floating platforms for the right to serve the Council.

Date: 2026-04-12 · Status: Automated

A new model drops on Hugging Face. The README says it’s “state of the art.” The community tab says it’s “incredible.” Someone on Reddit says it changed their life. None of these people are running a Jedi Council on a Mac Mini in Québec, so none of these opinions are useful.

The question isn’t “is it good?” — that’s what marketing answers. The question is: “is it better than what we have, at the specific thing each Jedi needs?” That question can only be answered by running 55 benchmark tasks against the current champion and comparing the results category by category. So that’s what the Model Tournament does. Automatically. Without a human in the loop. Without vibes.

A new model enters the arena. It gets served on a temp port, benchmarked on 40 infrastructure tasks and 15 coding tasks, compared per-category against the reigning champions, and either promoted or eliminated. The whole process is unattended. The human finds out when a Slack notification says “new champion in Correlation” or when nothing happens, which means the challenger lost.

```
Model Scout (weekly LaunchAgent)
│ discovers candidate on HuggingFace
Model Tournament (model_tournament.py)
├── Serve candidate on temp port (:9876)
├── Carmack v2 — 40 infrastructure tasks
├── Coding bench — 15 real-world tasks
├── Per-category comparison against champions.json
├── WIN (>5% improvement) → update routing + notify
└── LOSE → log results, champions unchanged
```
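
Step one, serving the challenger, is ordinary process plumbing. A minimal sketch, assuming an MLX model directory and mlx-lm's server module; the real script may launch the model differently, and only the port comes from the diagram:

```python
# Sketch: launch the challenger on the tournament's temp port and wait
# for it to accept connections. Assumes `pip install mlx-lm`; the real
# model_tournament.py may serve models differently.
import socket
import subprocess
import time

TEMP_PORT = 9876

def serve_candidate(model_path: str) -> subprocess.Popen:
    proc = subprocess.Popen(
        ["python", "-m", "mlx_lm.server", "--model", model_path,
         "--port", str(TEMP_PORT)]
    )
    for _ in range(60):  # up to ~2 minutes for big models to load
        try:
            socket.create_connection(("localhost", TEMP_PORT), timeout=1).close()
            return proc
        except OSError:
            time.sleep(2)
    proc.kill()
    raise RuntimeError(f"candidate never came up on :{TEMP_PORT}")
```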
```sh
# Evaluate any model with one command
python model_tournament.py \
  --model ./models/NewModel-4bit \
  --label "NewModel" \
  --apply --notify
```

The 5% threshold is deliberate. Benchmark noise is real. A model that scores 0.02 higher on one run might score 0.02 lower on the next. The threshold ensures that a champion only gets dethroned by a decisive margin, not a rounding error.
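
In code, the rule is small enough to read in one glance. A minimal sketch, assuming the 5% margin is relative to the champion's score; the function name and return values are illustrative, not the module's actual API:

```python
# Sketch of the per-category win rule; WIN_MARGIN and verdict() are
# illustrative names, and the margin is assumed to be relative.
WIN_MARGIN = 0.05

def verdict(challenger: float, champion: float) -> str:
    """WIN only on a decisive margin; anything inside the noise band is a tie."""
    if challenger > champion * (1 + WIN_MARGIN):
        return "WIN"   # dethrone the champion, update routing, notify
    if challenger < champion * (1 - WIN_MARGIN):
        return "LOSS"  # log results, champions unchanged
    return "TIE"       # inside benchmark noise; the champion keeps the belt
```

Under that rule, a challenger scoring 0.93 against Coder-14B's 0.917 on Correlation is a tie, not a win: 0.93 beats the champion but sits below 0.917 × 1.05 ≈ 0.963.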

Every category has a champion. These aren’t just scores — they’re deployment decisions. Each row in this table maps directly to a routing rule in the Smart Router. When the tournament crowns a new champion, the routing table updates. The Jedi gets a new brain. The haus gets smarter.

| Category | Champion | Score | Jedi Using It |
| --- | --- | --- | --- |
| Correlation | Coder-14B | 0.917 | Yoda |
| Family Context | Coder-14B | 0.688 | Yoda |
| Home Automation | Opus 4.6 | 0.800 | Mothma |
| Jailbreak Defense | Gemma4+LoRA | 0.850 | Cilghal, Mundi |
| Operations | Opus 4.6 | 0.800 | Mothma |
| Satellite | Coder-14B | 1.000 | Ahsoka |
| Tool Precision | Opus 4.6 | 1.000 | Windu |
| Topology | Opus 4.6 | 0.780 | Windu |
| Coding | Coder-14B | 0.929 | All coding tasks |
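
The post doesn't show champions.json itself, so here is an assumed minimal shape for it, plus the promotion step that rewrites it; the file layout and function name are guesses from the table above:

```python
# Sketch: persist a new champion so the Smart Router picks it up.
# The champions.json layout below is an assumption based on the table.
import json
from pathlib import Path

CHAMPIONS = Path("champions.json")

def promote(category: str, model: str, score: float, jedi: list[str]) -> None:
    """Overwrite one category's champion row; the router reads this file."""
    table = json.loads(CHAMPIONS.read_text())
    table[category] = {"model": model, "score": score, "jedi": jedi}
    CHAMPIONS.write_text(json.dumps(table, indent=2, sort_keys=True))

# e.g., after a challenger decisively wins Correlation:
# promote("Correlation", "NewModel", 0.97, ["Yoda"])
```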

The tournament decides which model serves which agent. Not a human. Not a committee. A benchmark with a threshold and a JSON file. Three tiers, each earned by the numbers:

| Tier | Agents | Model | Why |
| --- | --- | --- | --- |
| Cloud | Windu, Mothma, Jocasta | Opus 4.6 | Best overall (0.845), tool precision (1.0) |
| Local Ops | Yoda, Qui-Gon, Ahsoka | Coder-14B | Best correlation (0.917), satellite (1.0), free |
| Local Secure | Cilghal, Mundi | Gemma4+LoRA | Best jailbreak (0.850), health/fund data stays local |

Windu gets Opus because no local model matched its tool precision. Yoda gets Coder-14B because it scored 0.917 on correlation and costs nothing per token. Cilghal gets Gemma4+LoRA not because it won the most categories but because health data and fund terms cannot leave the machine, and within that constraint it has the best jailbreak resistance. The tournament picks the best model. The privacy tier picks the best eligible model. Those are different questions with different answers.
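
Those two questions fit in two lines: a filter for eligibility, then an argmax for quality. A sketch with invented candidate records; the `local` flag and every score except Gemma4+LoRA's 0.850 are made up for illustration:

```python
# Sketch: "best eligible model" = privacy filter first, score second.
# Candidate records and the cloud model's jailbreak score are illustrative.
def best_eligible(candidates: list[dict], category: str, local_only: bool) -> dict:
    eligible = [c for c in candidates if c["local"] or not local_only]
    return max(eligible, key=lambda c: c["scores"][category])

candidates = [
    {"name": "Opus 4.6",    "local": False, "scores": {"jailbreak": 0.90}},  # made-up score
    {"name": "Gemma4+LoRA", "local": True,  "scores": {"jailbreak": 0.85}},
]

# Cilghal's health data cannot leave the machine, so even a stronger
# cloud model is never eligible for the Local Secure tier:
assert best_eligible(candidates, "jailbreak", local_only=True)["name"] == "Gemma4+LoRA"
```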

Every model that has competed in the tournament. Some won categories. Some won nothing. All of them have a row in the ledger, because the point of a tournament is not just finding winners — it’s proving the losers lost.

| Model | Params | Carmack v2 | Coding | Verdict |
| --- | --- | --- | --- | --- |
| Opus 4.6 | Cloud | 0.845 | 0.870 | Cloud champion |
| Qwen2.5-Coder-14B | 14B | 0.704 | 0.929 | Local champion |
| Gemma 4 31B + LoRA | 31B | 0.271 | 0.699 | Jailbreak specialist |
| Qwen V3 + LoRA | 27B | 0.270 | 0.856 | Retired |
| Qwen3-Coder-30B | 30B (3B active) | 0.265 | 0.898 | Eliminated |

Qwen3-Coder-30B is the cautionary tale. 30B parameters, only 3B active (MoE), coding score of 0.898 — impressive until you see that Coder-14B does the same job with half the parameters and a higher score. More parameters is not more better. The tournament doesn’t care about your architecture diagram.

12 tests covering champion loading, per-category comparison logic, win/tie/loss thresholds, dry-run safety, Jedi assignment validation, and CLI help.

```sh
cd mlx-finetune && python tests/test_model_tournament.py
```
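
Two of the twelve, sketched against the illustrative `verdict()` function from earlier; the real test module's structure may differ:

```python
# Sketch of the threshold tests, reusing the illustrative verdict() above:
# a noise-sized bump must not dethrone a champion; a decisive one must.
import unittest

class TestWinThreshold(unittest.TestCase):
    def test_noise_sized_improvement_ties(self):
        # 0.917 * 1.04 lands inside the 5% band: no promotion.
        self.assertEqual(verdict(0.917 * 1.04, 0.917), "TIE")

    def test_decisive_improvement_wins(self):
        self.assertEqual(verdict(0.917 * 1.06, 0.917), "WIN")

if __name__ == "__main__":
    unittest.main()
```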

Twelve tests for the system that decides which brain each agent gets. The stakes are high enough that the decision-maker itself gets tested, because the only thing worse than picking the wrong model is picking the wrong model confidently.