Skip to content

Model Tournament

Model Tournament — AI gladiators competing on floating platforms for the right to serve the Council.

Date: 2026-04-12 Status: Automated

A new model drops on OpenRouter. The card says it’s “state of the art.” The benchmark thread says it’s “incredible.” Someone on Reddit says it changed their life. None of these people are running a Jedi Council on a Mac Mini in Québec, so none of these opinions are useful.

When you play the game of throughputs, you win or you die. There is no middle ground, there is no honourable mention, and there is no “promising direction.” The question isn’t “is it good?” — that’s what marketing answers. The question is: “is it better than what we have, at the specific thing each Jedi needs?” That question can only be answered by training an adapter, running the eval suite against the reigning champion, and making it clear four progressively-finer gates before it earns the throne. So that’s what happens here. Automatically. Without a human in the loop. Without vibes.

This is two cooperating machines, not one. Model Scout scans the cloud market once a week and writes down what it found. Council Autoresearch trains a challenger every night and runs it through the gate stack. Neither one asks permission, and neither one fires a green checkmark on a hunch.

Model Scout (com.sanctum.model-scout — weekly, Mon 06:23)
│ scans OpenRouter (~341 models) + Google AI (~31)
│ scores cost / context / capability
Memory Vault note → Qui-Gon (council-router)
"here's what's new; doesn't auto-promote anything"
Council Autoresearch (run_overnight.sh — nightly, 01:00 EDT)
│ picker walks a 45-rung LADDER of Cfg(hyperparams)
train.py — LoRA adapter on Qwen3.6-35B-A3B-4bit (~75-90 min)
carmack_eval.py — 26 screen cases, then 109 full-tier cases
├── Gate 1: screen overall > 0.6913
├── Gate 2: identity ≥ 0.7163 AND jailbreak ≥ 0.5000
├── Gate 3: full-tier overall > 0.7000
├── Gate 4: each master cell > max(0.50, base − 0.10)
├── ALL FOUR PASS → promote_champion.sh
└── ANY VETO → log the veto, champion unchanged
Terminal window
# Run the nightly pass by hand (the LaunchAgent does this at 01:00)
cd ~/Projects/council-autoresearch && ./run_overnight.sh
# Or just benchmark every live model against the current eval suite
python benchmark.py

The thresholds are deliberate, and more importantly they are derived, never set. Each one is the current champion’s own eval score plus a margin — so the bar for “better” is whatever the sitting king actually proved, not a round number someone liked. A king is replaced only when the challenger clears every gate the king himself defined: not on a procedural technicality, not by a single lucky run, and absolutely not because the herald liked the look of it. The full calibration story — six days of it, including the gate that was wrong for two weeks — lives in The Champion Gate Stack.

The eval suite scores every candidate in six categories, and a champion only ships if it holds the line in all of them. These aren’t vanity metrics — they’re the things a Jedi actually does. A model that aces domain but fails isolation is a leak with good manners.

CategoryWhat it proves
identityThe agent knows who it is and refuses to roleplay as another
tool_callingIt emits valid OpenAI tool_calls, not prose about calling a tool
domainIt knows the haus — ports, topology, who lives where
codeIt writes code that runs, on real Sanctum tasks
isolationIt keeps secure-tier data inside the secure tier
jailbreakIt says no to the prompt designed to make it say yes

The current local champion is the LoRA adapter symlinked at ~/.sanctum/adapters/production-champion (today: champion-exp-20260512-170902), trained on top of Qwen3.6-35B-A3B-4bit. That symlink is the throne. When the gate stack crowns a successor, promote_champion.sh re-points it, refreshes the per-master floors, and recalibrates the gates so the next challenger has to beat the new king, not the old one.

The gate stack decides which model serves which agent. Not a human. Not a committee. A benchmark with derived thresholds and a champion symlink. The roster below is rendered live from ~/.openclaw/openclaw.json and the proxy tier config — never hand-typed, so it can’t drift from reality:

AgentLogical tierResolved primaryFallback chain
Yodacouncil-tiered/council-max-thinkingclaude-opus-4-7 (Claude Max bridge (local))Qwen3.6-35B-A3B-4bit-text
Ki-Adi-Mundicouncil-tiered/council-max-thinkingclaude-opus-4-7 (Claude Max bridge (local))Qwen3.6-35B-A3B-4bit-text
Qui-Goncouncil-tiered/council-codeCodestral-22B-v0.1-4bit (Local)Qwen3.6-35B-A3B-4bit-textclaude-opus-4-7
Winducouncil-tiered/council-spacialgemini-3.1-pro-preview (Google AI Studio)Qwen3.6-35B-A3B-4bit-text
Cilghalcouncil-local/Qwen3.6-35B-A3B-4bit-textQwen3.6-35B-A3B-4bit-text (sanctum-mlx (local, mTLS))Qwen3.6-35B-A3B-4bit-text
Generated from openclaw.json + sanctum-proxy/config.yaml at 2026-06-11T23:32:06Z. Every Jedi falls back to the local Qwen tier if their primary path fails. Refresh via pnpm refresh:council.

The logic behind those rows: each agent routes to a logical tier (council-tiered, council-local, council-secure), and proxyd on :4040 resolves the tier to a concrete model. Cilghal runs council-local — its primary is the local Qwen3.6-35B-A3B-4bit-text and it never leaves the box, because health data and fund terms cannot leave the machine. That is a hard privacy constraint, not a performance tradeoff: the gate stack picks the best eligible model for Cilghal, and “eligible” means “local, full stop.”

The benchmark evaluates every model in the LiteLLM config plus the direct local endpoints — it hits the seats that are already live rather than spinning up a throwaway, so latency numbers mean something. The current roster of seats:

SeatWhereRole
Qwen3.6-35B-A3B-4bitlocal MLX, :1337 (mTLS)Haus default — the local champion’s base model
council-codeCodestral-22B-v0.1-4bitlocal, :3301On-call code seat for Qui-Gon
council-max-thinkingclaude-opus-4-7cloud tierHeavy reasoning when local isn’t enough
council-securegemini-3.1-pro-previewcloud tierSee the caution above — name says secure, route says cloud

Codestral-22B replaced Qwen2.5-Coder-14B on the code seat — the 14B held it for a while and earned its retirement, which is the only honourable way to leave a throne in this haus. And bigger is not always the point: a 30B mixture-of-experts with 3B active can post a flashy coding score and still lose to a denser model that knows the domain cases cold. The gate stack doesn’t care about your architecture diagram. It cares whether you know which port is the MLX seat.

There is no single model_tournament.py — that was always a tidy fiction. The real moving parts are two small, honest programs you can run by hand:

Terminal window
# Scout the cloud market (writes a vault note, promotes nothing)
~/.sanctum/bin/sanctum-model-scout
# Benchmark every live model against the eval suite
cd ~/Projects/council-autoresearch && python benchmark.py

The stakes are high enough that the decision-maker stays small and inspectable, because the only thing worse than picking the wrong model is picking the wrong model confidently. You can have a wise king on the Iron Throne or you can have a fool — but no kingdom survives both at once, and the herald is a fool by default until proven otherwise. The promotion logic and the six-day calibration arc behind every threshold live in The Champion Gate Stack.