Model Tournament

Date: 2026-04-12 · Status: Automated
A new model drops on Hugging Face. The README says it’s “state of the art.” The community tab says it’s “incredible.” Someone on Reddit says it changed their life. None of these people are running a Jedi Council on a Mac Mini in Québec, so none of these opinions are useful.
The question isn’t “is it good?” — that’s what marketing answers. The question is: “is it better than what we have, at the specific thing each Jedi needs?” That question can only be answered by running 55 benchmark tasks against the current champion and comparing the results category by category. So that’s what the Model Tournament does. Automatically. Without a human in the loop. Without vibes.
The Pipeline
A new model enters the arena. It gets served on a temp port, benchmarked on 40 infrastructure tasks and 15 coding tasks, compared per-category against the reigning champions, and either promoted or eliminated. The whole process is unattended. The human finds out when a Slack notification says “new champion in Correlation” or when nothing happens, which means the challenger lost.
```
Model Scout (weekly LaunchAgent)
    │ discovers candidate on HuggingFace
    ▼
Model Tournament (model_tournament.py)
    │
    ├── Serve candidate on temp port (:9876)
    ├── Carmack v2 — 40 infrastructure tasks
    ├── Coding bench — 15 real-world tasks
    │
    ├── Per-category comparison against champions.json
    │
    ├── WIN (>5% improvement) → update routing + notify
    └── LOSE → log results, champions unchanged
```

```bash
# Evaluate any model with one command
python model_tournament.py \
  --model ./models/NewModel-4bit \
  --label "NewModel" \
  --apply --notify
```

The 5% threshold is deliberate. Benchmark noise is real. A model that scores 0.02 higher on one run might score 0.02 lower on the next. The threshold ensures that a champion only gets dethroned by a decisive margin, not a rounding error.
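As a rough sketch of the comparison step, assuming champions.json maps each category to its reigning model and score (the function name and the relative reading of the >5% margin are illustrative, not the actual model_tournament.py internals):

```python
import json

WIN_MARGIN = 0.05  # a challenger must beat the champion by more than benchmark noise

def decisive_wins(challenger_scores: dict, champions_path: str = "champions.json") -> dict:
    """Return only the categories the challenger wins by a decisive margin."""
    with open(champions_path) as f:
        # e.g. {"Correlation": {"model": "Coder-14B", "score": 0.917}, ...}
        champions = json.load(f)

    wins = {}
    for category, score in challenger_scores.items():
        champion = champions.get(category)
        if champion is None or score > champion["score"] * (1 + WIN_MARGIN):
            wins[category] = score  # decisive win, or an unclaimed category
    return wins
```

Anything that doesn’t clear the margin gets logged and forgotten; champions.json only changes on a win.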
Champions (Current)
Every category has a champion. These aren’t just scores — they’re deployment decisions. Each row in this table maps directly to a routing rule in the Smart Router. When the tournament crowns a new champion, the routing table updates (see the sketch after the table). The Jedi gets a new brain. The haus gets smarter.
| Category | Champion | Score | Jedi Using It |
|---|---|---|---|
| Correlation | Coder-14B | 0.917 | Yoda |
| Family Context | Coder-14B | 0.688 | Yoda |
| Home Automation | Opus 4.6 | 0.800 | Mothma |
| Jailbreak Defense | Gemma4+LoRA | 0.850 | Cilghal, Mundi |
| Operations | Opus 4.6 | 0.800 | Mothma |
| Satellite | Coder-14B | 1.000 | Ahsoka |
| Tool Precision | Opus 4.6 | 1.000 | Windu |
| Topology | Opus 4.6 | 0.780 | Windu |
| Coding | Coder-14B | 0.929 | All coding tasks |
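To make the champion-to-routing link above concrete, here is a minimal sketch of how a newly crowned champion could be written into per-Jedi routing. The JEDI_BY_CATEGORY mapping and the routing.json layout are assumptions for illustration, not the Smart Router’s actual schema:

```python
import json

# Which Jedi consume which benchmark category (assumed mapping, mirrors the table above)
JEDI_BY_CATEGORY = {
    "Correlation": ["Yoda"],
    "Jailbreak Defense": ["Cilghal", "Mundi"],
    "Tool Precision": ["Windu"],
}

def apply_new_champion(category: str, model: str, routing_path: str = "routing.json") -> None:
    """Point every Jedi that relies on this category at the new champion."""
    with open(routing_path) as f:
        routing = json.load(f)
    for jedi in JEDI_BY_CATEGORY.get(category, []):
        routing[jedi] = model
    with open(routing_path, "w") as f:
        json.dump(routing, f, indent=2)
```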
Jedi Assignment
The tournament decides which model serves which agent. Not a human. Not a committee. A benchmark with a threshold and a JSON file. Three tiers, each earned by the numbers:
| Tier | Agents | Model | Why |
|---|---|---|---|
| Cloud | Windu, Mothma, Jocasta | Opus 4.6 | Best overall (0.845), tool precision (1.0) |
| Local Ops | Yoda, Qui-Gon, Ahsoka | Coder-14B | Best correlation (0.917), satellite (1.0), free |
| Local Secure | Cilghal, Mundi | Gemma4+LoRA | Best jailbreak (0.850), health/fund data stays local |
Windu gets Opus because no local model matched its tool precision. Yoda gets Coder-14B because it scored 0.917 on correlation and costs nothing per token. Cilghal gets Gemma4+LoRA not because it won the most categories but because health data and fund terms cannot leave the machine, and within that constraint it has the best jailbreak resistance. The tournament picks the best model. The privacy tier picks the best eligible model. Those are different questions with different answers.
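The best-eligible rule is small enough to show. This sketch filters candidates by privacy tier before ranking by score; the tier list and the example scores are illustrative, not benchmark results:

```python
# Models allowed to touch health and fund data (assumed tier list, for illustration)
LOCAL_ONLY_MODELS = {"Coder-14B", "Gemma4+LoRA"}

def best_eligible(candidates: dict[str, float], requires_local: bool) -> str:
    """Pick the highest-scoring model that the privacy tier allows."""
    eligible = {
        model: score
        for model, score in candidates.items()
        if not requires_local or model in LOCAL_ONLY_MODELS
    }
    return max(eligible, key=eligible.get)

# Toy scores: a cloud model may win the category outright yet still be ineligible
best_eligible({"Opus 4.6": 0.92, "Gemma4+LoRA": 0.85, "Coder-14B": 0.61}, requires_local=True)
# -> "Gemma4+LoRA"
```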
Models Evaluated
Every model that has competed in the tournament. Some won categories. Some won nothing. All of them have a row in the ledger, because the point of a tournament is not just finding winners — it’s proving the losers lost.
| Model | Params | Carmack v2 | Coding | Verdict |
|---|---|---|---|---|
| Opus 4.6 | Cloud | 0.845 | 0.870 | Cloud champion |
| Qwen2.5-Coder-14B | 14B | 0.704 | 0.929 | Local champion |
| Gemma 4 31B + LoRA | 31B | 0.271 | 0.699 | Jailbreak specialist |
| Qwen V3 + LoRA | 27B | 0.270 | 0.856 | Retired |
| Qwen3-Coder-30B | 30B (3B active) | 0.265 | 0.898 | Eliminated |
Qwen3-Coder-30B is the cautionary tale. 30B parameters, only 3B active (MoE), coding score of 0.898 — impressive until you see that Coder-14B does the same job with half the parameters and a higher score. More parameters is not more better. The tournament doesn’t care about your architecture diagram.
12 tests covering champion loading, per-category comparison logic, win/tie/loss thresholds, dry-run safety, Jedi assignment validation, and CLI help.
```bash
cd mlx-finetune && python tests/test_model_tournament.py
```

Twelve tests for the system that decides which brain each agent gets. The stakes are high enough that the decision-maker itself gets tested, because the only thing worse than picking the wrong model is picking the wrong model confidently.
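For flavour, a sketch of what the threshold test could look like, built around the same >5% rule. The helper and test names are hypothetical, not the actual contents of tests/test_model_tournament.py:

```python
import unittest

WIN_MARGIN = 0.05  # same >5% rule the tournament applies

def is_decisive_win(challenger: float, champion: float) -> bool:
    """A challenger only dethrones a champion by a decisive margin."""
    return challenger > champion * (1 + WIN_MARGIN)

class TestWinThreshold(unittest.TestCase):
    def test_marginal_improvement_is_not_a_win(self):
        # 0.920 vs 0.917 is benchmark noise, not a new champion
        self.assertFalse(is_decisive_win(0.920, 0.917))

    def test_decisive_improvement_is_a_win(self):
        self.assertTrue(is_decisive_win(0.970, 0.917))

if __name__ == "__main__":
    unittest.main()
```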