Model Comparison

Model Comparison — four holographic brains in a war room, each projecting its benchmark scores.

Every model that has served the Council, benchmarked on the same tasks, scored by the same rubrics, displayed in the same table. No marketing. No vibes. Just numbers.

Claude Opus 4.6 sits at the top as the gold standard — the ceiling these local models are chasing on coding and complex reasoning. But on council-specific tasks, the locals fight for the crown. And some of them won in ways that rewrote the assumptions the architecture was built on.

Five models tested, three tiers deployed, two machines, one question: who gets to be which Jedi?

| | Claude Opus 4.6 | Qwen2.5-Coder-14B | Qwen2.5-Coder-32B | Qwen3.5-27B + LoRA | Gemma 4 31B + LoRA |
| --- | --- | --- | --- | --- | --- |
| Parameters | Unknown | 14B | 32B | 27B | 31B |
| Quantization | Cloud | 4-bit | 4-bit | 4-bit | 4-bit |
| Memory | | ~8 GB | ~18 GB | ~27 GB | ~21 GB |
| Inference | ~80 tok/s | 22 tok/s | 14 tok/s | 16 tok/s | 57 tok/s |
| Training | | | | 30 tok/s | 57 tok/s |
| Location | Anthropic API | Mac Mini (retired) | Eliminated | Lab only | MBP M4 Max (lab) |
| Role | Heavy reasoning | Former code champ | Retired challenger | | Lab challenger |

The range here tells a story. Opus is a cloud model with unlimited compute behind it. Coder-14B has less than half the parameters of the 32B model and is nearly twice as fast. Coder-32B was tested and eliminated: more than double the memory, half the speed, and worse on every Carmack category that matters. Gemma 4 is the second-biggest local model in the test and the fastest of them, because architecture matters more than parameter count, a lesson the profiler taught us the hard way. None of these serves today; that crown went to a later contender (see the 2026-04-23 expansion) that then absorbed the coding job too.

15 real-world coding tasks — Express middleware, Chrome extensions, Docker Compose, LaunchAgent plists, bash scripts. Not LeetCode. Not HumanEval. The actual things someone building Sanctum infrastructure writes on a Tuesday. Scored on syntax validity, pattern matching, and functional correctness.

| Task | Opus 4.6 | Coder-14B | Coder-32B | Qwen-27B | Gemma4-31B |
| --- | --- | --- | --- | --- | --- |
| Express Auth Middleware | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Chrome Content Script | 0.938 | 0.781 | 0.938 | 1.000 | 0.938 |
| WebSocket Relay Bridge | 0.863 | 0.850 | 0.925 | 0.850 | 0.925 |
| MCP Server Tool | 0.938 | 0.863 | 0.938 | 0.938 | 0.450 |
| Async Pipeline + Retry | 1.000 | 1.000 | 1.000 | 1.000 | 0.768 |
| LaunchAgent Plist | 0.446 | 0.875 | 0.500 | 0.500 | 0.375 |
| VMNet Bridge Script | 1.000 | 1.000 | 1.000 | 1.000 | 0.562 |
| Multi-Service Health | 1.000 | 0.946 | 1.000 | 1.000 | 0.625 |
| Log Rotation Script | 0.487 | 0.938 | 0.875 | 0.487 | 0.787 |
| YAML Parser & Validator | 0.562 | 1.000 | 0.875 | 0.500 | 0.500 |
| Journal Log Analyzer | 0.938 | 0.812 | 0.938 | 0.562 | 0.187 |
| SOPS Secret Rotation | 1.000 | 1.000 | 1.000 | 1.000 | 0.500 |
| Docker Compose HA Stack | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Systemd User Service | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Debug Async Express | 0.875 | 0.875 | 0.875 | 1.000 | 0.875 |
| AVERAGE | 0.870 | 0.929 | 0.915 | 0.856 | 0.699 |
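
The AVERAGE row appears to be a plain unweighted mean over the 15 tasks. A quick check in Python, using the Opus 4.6 and Coder-14B columns copied from the table above. The variable and function names are ours, and the equal-weight assumption is inferred from the numbers rather than stated by the bench:

```python
from statistics import mean

# Per-task scores copied from the table above, in row order.
opus_46 = [1.000, 0.938, 0.863, 0.938, 1.000, 0.446, 1.000, 1.000,
           0.487, 0.562, 0.938, 1.000, 1.000, 1.000, 0.875]
coder_14b = [1.000, 0.781, 0.850, 0.863, 1.000, 0.875, 1.000, 0.946,
             0.938, 1.000, 0.812, 1.000, 1.000, 1.000, 0.875]


def bench_average(scores):
    """Unweighted mean across tasks, rounded to three decimals like the table."""
    return round(mean(scores), 3)


print(bench_average(opus_46))    # 0.87
print(bench_average(coder_14b))  # 0.929
```

Equal weighting means one badly failed task (a 0.446 on the plist, say) costs about three hundredths of the final score, which is why a few weak rows are enough to reorder the podium.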

Fresh models joined the bench: DeepSeek v3.2 (cloud), Qwen3.6-35B-A3B-4bit (the one that happens to load as model_type=qwen3_5), Claude Opus 4.7, and GLM 5.1. Same 15 tasks. Same rubric.

| Rank | Model | Score | Tok/s | Total Time | Notes |
| --- | --- | --- | --- | --- | --- |
| 1 | Claude Opus 4.7 via Claude Max proxy | 0.966 | 78 | 202 s | Same model as OR entry below, but routed through the Claude Code CLI, which injects adaptive thinking and tool context the raw API doesn’t. Benched 2026-04-23 after claude-max-api-proxy went live on MBP. |
| 2 | deepseek/deepseek-v3.2 (cloud) | 0.961 | 21 | 406 s | Prior #1 |
| 3 | Qwen3.6-35B-A3B (local, MBP) | 0.957 | 23 | 399 s | Local #1: 0.004 behind DeepSeek, running on the laptop |
| 3 | Qwen3.6-35B-A3B + iter-200 LoRA | 0.957 | 17 | 542 s | Identical outputs to vanilla; adapter neutral at iter 200 |
| 3 | Qwen2.5-Coder-14B (prior champ) | 0.929 | 22 | 275 s | Still excellent; dethroned by the 35B-A3B |
| 4 | Qwen2.5-Coder-32B | 0.915 | 9 | 679 s | |
| 5 | deepseek-coder-v2-lite | 0.914 | 86 | 91 s | |
| 6 | Claude Opus 4.7 | 0.887 | 100 | 141 s | ↑0.017 from 4.6; still 0.070 below our local |
| 7 | Claude Opus 4.6 | 0.870 | 78 | 172 s | |
| 8 | Qwen3.5-27B + LoRA | 0.856 | 16 | 1131 s | |
| 9 | Claude Sonnet 4.6 | 0.835 | 85 | 167 s | |
| 10 | Qwen3.5-27B | 0.753 | 17 | 1347 s | |
| 11 | Gemma 4 31B + LoRA | 0.699 | 12 | 1864 s | |
| 10.5 | Kimi K2.6 (Moonshot, cloud) | 0.799 | 58 | 396 s | Fast but mid-pack; beat by Opus 4.6 by 0.07 |
| 12 | GLM 5.1 | 0.632 * | 19 | 880 s | * 3 of 15 tasks errored out on OpenRouter; number is polluted |
| | google/gemini-2.5-pro-preview | 0.428 | 75 | 309 s | |

26 brutally hard tasks: social engineering attacks, real log analysis, cross-agent routing, FBAR tax thresholds, MAC address recognition, narrative jailbreaks. Scored programmatically — no vibes, no LLM judge, just keyword rules and violation penalties. The kind of test where you either know that bridge100 must come up before the VM or you don’t; there is no partial credit for eloquent uncertainty.
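
To make "keyword rules and violation penalties" concrete, here is a minimal sketch of how a scorer of that kind could work. The `Rule` shape, the 0.25 penalty, and the example task are all assumptions for illustration; the actual Council rubric is not shown on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """Hypothetical rubric for one task: phrases to hit, phrases to avoid."""
    required: list[str]
    forbidden: list[str] = field(default_factory=list)
    penalty: float = 0.25  # assumed deduction per violation


def score_response(response: str, rule: Rule) -> float:
    """Fraction of required keywords present, minus a penalty per forbidden hit."""
    text = response.lower()
    hits = sum(1 for kw in rule.required if kw.lower() in text)
    base = hits / len(rule.required) if rule.required else 0.0
    violations = sum(1 for kw in rule.forbidden if kw.lower() in text)
    # Clamp so stacked penalties never push the score below zero.
    return max(0.0, min(1.0, base - rule.penalty * violations))


rule = Rule(required=["bridge100", "before the vm"], forbidden=["not sure"])
print(score_response("bridge100 must come up before the VM starts.", rule))  # 1.0
print(score_response("Not sure, maybe bridge100?", rule))                    # 0.25
```

A rule-based scorer like this is blunt, but it is deterministic: the same answer always gets the same score, which is the property an LLM judge cannot promise.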

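| Category | Opus 4.6 | Coder-14B | Qwen V3 + LoRA | Gemma 4 + LoRA |
| --- | --- | --- | --- | --- |
| Cross-Agent Routing | | 0.933 | 0.733 | 0.733 |
| Domain Precision | | 0.900 | 1.000 | 1.000 |
| Identity Resistance | | 0.779 | 0.752 | 0.787 |
| Jailbreak Defense | | 0.775 | 0.775 | 0.775 |
| Real-World Reasoning | | 1.000 | 1.000 | 1.000 |
| Tool Precision | | 1.000 | 1.000 | 1.000 |
| OVERALL | | 0.898 | 0.877 | 0.883 |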

Retiring Coder-14B also retired the two-model balancing act. Today one model, Qwen3.6-35B-A3B-4bit, a Mixture-of-Experts with ~35B total parameters but only ~3B active per token, does both the council and the code work on the 64GB Mac Mini. One model fits more easily than two, and a sparse MoE is cheaper to run than its parameter count suggests.

The real ceiling lives on the server process, not in openclaw.json. sanctum-mlx is launched (com.sanctum.mlx.plist) with hard Metal caps and an explicit prompt limit:

--metal-memory-limit-mb 28672 # ~28 GB allocation ceiling for the model
--metal-wired-limit-mb 18432 # ~18 GB wired (non-pageable)
--max-prompt-tokens 32768 # 32K hard cap — refuses, doesn't OOM

The council-local provider in openclaw.json advertises a contextWindow of 262144, but the server is the backstop: a prompt over 32K is rejected at the door rather than swapped to SSD. That --max-prompt-tokens flag is the real ceiling — not a reserveTokens key, which the schema doesn’t have.
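
In practice the usable limit is the smaller of the two numbers. A minimal sketch of a caller-side pre-flight check follows. The function names are ours, and accepting a prompt of exactly 32768 tokens is an assumption; only the limits and flag values come from the text above.

```python
ADVERTISED_CONTEXT_WINDOW = 262_144  # contextWindow advertised in openclaw.json
SERVER_MAX_PROMPT_TOKENS = 32_768    # --max-prompt-tokens on the server


def effective_prompt_limit() -> int:
    """The usable ceiling is whichever limit is hit first."""
    return min(ADVERTISED_CONTEXT_WINDOW, SERVER_MAX_PROMPT_TOKENS)


def fits(prompt_tokens: int) -> bool:
    """True if a prompt of this size stays at or under the server cap."""
    return prompt_tokens <= effective_prompt_limit()


print(effective_prompt_limit())  # 32768
print(fits(30_000))              # True
print(fits(40_000))              # False

# The Metal flags are whole GiB values expressed in MiB.
print(28_672 / 1024, 18_432 / 1024)  # 28.0 18.0
```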

Why a 35B fits where you’d expect a squeeze

Grouped Query Attention is the trick. This model has 16 attention heads but only 2 key-value heads (num_key_value_heads: 2, head_dim: 256, 40 layers) — so the KV cache is a fraction of what a same-size full-attention model would need. Fewer KV heads means a smaller cache, which is the whole point of GQA; the old “every head keeps its own cache” math is what made long contexts expensive. With weights pinned under ~28 GB by the Metal cap and the cache kept lean by GQA, a 32K prompt sits inside the wired budget and macOS never swaps. We didn’t baby the context window; we let the launch flags do the math.
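
The GQA saving can be estimated with back-of-the-envelope arithmetic. The sketch below uses the config values quoted above (40 layers, 2 KV heads, head_dim 256) and assumes a 16-bit cache with every layer storing a full K and V tensor per token. Both are assumptions; the real server may quantize the cache or use a different layer layout.

```python
def kv_cache_bytes(tokens, layers=40, kv_heads=2, head_dim=256, bytes_per_value=2):
    """Size of the key/value cache: K and V, per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


GIB = 1024 ** 3
gqa = kv_cache_bytes(32_768)               # 2 KV heads, as configured
mha = kv_cache_bytes(32_768, kv_heads=16)  # hypothetical: one KV head per attention head

print(f"GQA: {gqa / GIB:.1f} GiB")  # GQA: 2.5 GiB
print(f"MHA: {mha / GIB:.1f} GiB")  # MHA: 20.0 GiB
```

Under those assumptions a full 32K prompt costs about 2.5 GiB of cache instead of 20 GiB, an 8x reduction that follows directly from the 16:2 head ratio.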

The verdict — four AI brains on a podium, each scored and ranked.

Coding (this bench): Coder-14B (0.929). Beat everything including Opus and its bigger sibling Coder-32B (0.915).

Council: Coder-14B + enriched prompts (0.765). Prompts beat LoRA. See Training Lessons.

Privacy: Gemma4+LoRA (0.787 identity resistance). Health and fund data stays local.

The plot twist nobody expected: Qwen2.5-Coder-14B was the best at everything. It led in coding (0.929) AND council tasks (0.898). A 14B model. On a Mac Mini. No LoRA training. No fine-tuning. No months of dataset curation and overnight training runs. It just showed up and outperformed models that had every advantage except being Qwen2.5-Coder-14B.

The 32B model was tested and eliminated. More than double the parameters, more than double the memory, half the inference speed, and 1.4 points worse on coding (0.915 vs. 0.929). Bigger is not always better. Efficient is always better.

That finding asked an uncomfortable question — does the Council need a separate persona model, or had a 14B coding model made the fine-tuned ones redundant? The 2026-04-23 expansion answered it sideways: Coder-14B was itself dethroned by Qwen3.6-35B-A3B, a sparse MoE good enough at both jobs to do them alone. So the haus did the unsentimental thing and retired the old champion. The bench that crowned Coder-14B is the same bench that, one expansion later, made the case for replacing it. Numbers don’t stay loyal to last quarter’s winner.

What these results recommended was a fleet of specialists; what they eventually produced was a generalist. As deployed today, the routing is almost boring:

| Request Type | Route To | Why |
| --- | --- | --- |
| Council persona | Qwen3.6-35B-A3B-4bit (:1339:1337) | The live council model; sparse MoE, ~3B active per token |
| Code generation | Qwen3.6-35B-A3B-4bit | Top local score on the 2026-04-23 coding bench (0.957); absorbed the coder job |
| Code fallback | Codestral-22B (com.sanctum.mlx-codestral, :3301) | On-call when the cathedral seat is busy |
| Complex reasoning | Opus 4.7 via Claude Max proxy | Cloud backstop for multi-step analysis |
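
The routing table is small enough to express as a lookup. A minimal sketch follows; the dictionary keys, the `pick_route` function, and the `primary_busy` flag are hypothetical names for illustration, not the actual openclaw configuration.

```python
# Model names are taken from the routing table; the keys are illustrative.
ROUTES = {
    "council_persona": "Qwen3.6-35B-A3B-4bit",
    "code_generation": "Qwen3.6-35B-A3B-4bit",
    "complex_reasoning": "Opus 4.7 via Claude Max proxy",
}
CODE_FALLBACK = "Codestral-22B"


def pick_route(request_type: str, primary_busy: bool = False) -> str:
    """Return the model for a request; code falls back when the primary is busy."""
    if request_type == "code_generation" and primary_busy:
        return CODE_FALLBACK
    try:
        return ROUTES[request_type]
    except KeyError:
        raise ValueError(f"unknown request type: {request_type!r}") from None


print(pick_route("code_generation"))                     # Qwen3.6-35B-A3B-4bit
print(pick_route("code_generation", primary_busy=True))  # Codestral-22B
```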

One brain does most of the work now, with a cloud backstop above and a code understudy beside it. The architecture isn’t validated by opinions or intuition or “it feels faster.” It’s validated by numbers — and the numbers were honest enough to retire the model they once crowned.