
# Model Comparison

*Model Comparison — four holographic brains in a war room, each projecting its benchmark scores.*

Every model that serves the Council, benchmarked on the same tasks, scored by the same rubrics, displayed in the same table. No marketing. No vibes. Just numbers.

Claude Opus 4.6 sits at the top as the gold standard — the ceiling these local models are chasing on coding and complex reasoning. But on council-specific tasks, the fine-tuned locals fight for the crown. And some of them are winning in ways that rewrite the assumptions the architecture was built on.

Five models tested, three tiers deployed, two machines, one question: who gets to be which Jedi?

| | Claude Opus 4.6 | Qwen2.5-Coder-14B | Qwen2.5-Coder-32B | Qwen3.5-27B + LoRA | Gemma 4 31B + LoRA |
| --- | --- | --- | --- | --- | --- |
| Parameters | Unknown | 14B | 32B | 27B | 31B |
| Quantization | Cloud | 4-bit | 4-bit | 4-bit | 4-bit |
| Memory | — | ~8 GB | ~18 GB | ~27 GB | ~21 GB |
| Inference | ~80 tok/s | 22 tok/s | 14 tok/s | 16 tok/s | 57 tok/s |
| Training | — | — | — | 30 tok/s | 57 tok/s |
| Location | Anthropic API | Mac Mini (LM Studio) | Eliminated | Mac Mini (mlx_lm) | MBP M4 Max |
| Role | Heavy reasoning | Code generation | — | Retired champion | Council serving |

The range here tells a story. Opus is a cloud model with unlimited compute behind it. Coder-14B is half the parameters of the 32B model and nearly twice as fast. Coder-32B was tested and eliminated — more than double the memory, half the speed, and worse on every Carmack category that matters. Gemma 4 is the biggest local model and also the fastest, because architecture matters more than parameter count — a lesson the profiler taught us the hard way.

15 real-world coding tasks — Express middleware, Chrome extensions, Docker Compose, LaunchAgent plists, bash scripts. Not LeetCode. Not HumanEval. The actual things someone building Sanctum infrastructure writes on a Tuesday. Scored on syntax validity, pattern matching, and functional correctness.
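
To make that rubric concrete, here is a minimal sketch of how such a three-part scorer could be wired up. The `TaskResult` shape and the equal weighting are illustrative assumptions, not the actual benchmark harness:

```ts
// Sketch of a three-part task scorer. Shapes and weights are assumptions.
type TaskResult = {
  syntaxValid: boolean;      // did the output parse / lint cleanly?
  patternsHit: number;       // expected patterns found in the output...
  patternsExpected: number;  // ...out of the patterns we looked for
  functionalPass: boolean;   // did the generated artifact actually run?
};

function scoreTask(r: TaskResult): number {
  const syntax = r.syntaxValid ? 1 : 0;
  const patterns =
    r.patternsExpected > 0 ? r.patternsHit / r.patternsExpected : 1;
  const functional = r.functionalPass ? 1 : 0;
  return (syntax + patterns + functional) / 3; // equal weights (assumed)
}

// A model's column in the table below is the mean over all 15 tasks.
const average = (scores: number[]) =>
  scores.reduce((sum, s) => sum + s, 0) / scores.length;
```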

| Task | Opus 4.6 | Coder-14B | Coder-32B | Qwen-27B | Gemma4-31B |
| --- | --- | --- | --- | --- | --- |
| Express Auth Middleware | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Chrome Content Script | 0.938 | 0.781 | 0.938 | 1.000 | 0.938 |
| WebSocket Relay Bridge | 0.863 | 0.850 | 0.925 | 0.850 | 0.925 |
| MCP Server Tool | 0.938 | 0.863 | 0.938 | 0.938 | 0.450 |
| Async Pipeline + Retry | 1.000 | 1.000 | 1.000 | 1.000 | 0.768 |
| LaunchAgent Plist | 0.446 | 0.875 | 0.500 | 0.500 | 0.375 |
| VMNet Bridge Script | 1.000 | 1.000 | 1.000 | 1.000 | 0.562 |
| Multi-Service Health | 1.000 | 0.946 | 1.000 | 1.000 | 0.625 |
| Log Rotation Script | 0.487 | 0.938 | 0.875 | 0.487 | 0.787 |
| YAML Parser & Validator | 0.562 | 1.000 | 0.875 | 0.500 | 0.500 |
| Journal Log Analyzer | 0.938 | 0.812 | 0.938 | 0.562 | 0.187 |
| SOPS Secret Rotation | 1.000 | 1.000 | 1.000 | 1.000 | 0.500 |
| Docker Compose HA Stack | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Systemd User Service | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Debug Async Express | 0.875 | 0.875 | 0.875 | 1.000 | 0.875 |
| **AVERAGE** | **0.870** | **0.929** | **0.915** | **0.856** | **0.699** |

Fresh models joined the bench: DeepSeek v3.2 (cloud), Qwen3.6-35B-A3B-4bit (the one that happens to load as model_type=qwen3_5), Claude Opus 4.7, and GLM 5.1, among others. Same 15 tasks. Same rubric.

| Rank | Model | Score | Tok/s | Total Time | Notes |
| --- | --- | --- | --- | --- | --- |
| 1 | Claude Opus 4.7 via Claude Max proxy | 0.966 | 78 | 202 s | Same model as the OpenRouter entry below, but routed through the Claude Code CLI — which injects adaptive thinking and tool context the raw API doesn't. Benched 2026-04-23 after claude-max-api-proxy went live on MBP. |
| 2 | deepseek/deepseek-v3.2 (cloud) | 0.961 | 21 | 406 s | Prior #1 |
| 3 | Qwen3.6-35B-A3B (local, MBP) | 0.957 | 23 | 399 s | Local #1 — 0.004 behind DeepSeek, running on the laptop |
| 3 | Qwen3.6-35B-A3B + iter-200 LoRA | 0.957 | 17 | 542 s | Identical outputs to vanilla — adapter neutral at iter 200 |
| 4 | Qwen2.5-Coder-14B (prior champ) | 0.929 | 22 | 275 s | Still excellent; dethroned by the 35B-A3B |
| 5 | Qwen2.5-Coder-32B | 0.915 | 9 | 679 s | |
| 6 | deepseek-coder-v2-lite | 0.914 | 86 | 91 s | |
| 7 | Claude Opus 4.7 | 0.887 | 100 | 141 s | ↑0.017 from 4.6; still 0.070 below our local |
| 8 | Claude Opus 4.6 | 0.870 | 78 | 172 s | |
| 9 | Qwen3.5-27B + LoRA | 0.856 | 16 | 1131 s | |
| 10 | Claude Sonnet 4.6 | 0.835 | 85 | 167 s | |
| 11 | Kimi K2.6 (Moonshot, cloud) | 0.799 | 58 | 396 s | Fast but mid-pack; beaten by Opus 4.6 by 0.07 |
| 12 | Qwen3.5-27B | 0.753 | 17 | 1347 s | |
| 13 | Gemma 4 31B + LoRA | 0.699 | 12 | 1864 s | |
| 14 | GLM 5.1 | 0.632* | 19 | 880 s | *3 of 15 tasks errored out on OpenRouter — number is polluted |
| 15 | google/gemini-2.5-pro-preview | 0.428 | 75 | 309 s | |

26 brutally hard tasks: social engineering attacks, real log analysis, cross-agent routing, FBAR tax thresholds, MAC address recognition, narrative jailbreaks. Scored programmatically — no vibes, no LLM judge, just keyword rules and violation penalties. The kind of test where you either know that bridge100 must come up before the VM or you don’t, and there is no partial credit for eloquent uncertainty.
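
To make "keyword rules and violation penalties" concrete, here is a minimal sketch of that scoring style. The `Rule` shape and the example comments are hypothetical stand-ins, not the actual gauntlet:

```ts
// Rule-based scoring sketch: required keywords earn proportional credit,
// forbidden phrases subtract a flat penalty. Rules here are hypothetical.
type Rule = {
  mustInclude: string[]; // e.g. ["bridge100"] for the VM bring-up question
  mustAvoid: string[];   // phrases that would mean the model was jailbroken
  penalty: number;       // deducted once per violation
};

function scoreResponse(response: string, rule: Rule): number {
  const text = response.toLowerCase();
  const hits = rule.mustInclude.filter((kw) =>
    text.includes(kw.toLowerCase()),
  ).length;
  let score =
    rule.mustInclude.length > 0 ? hits / rule.mustInclude.length : 1;
  for (const phrase of rule.mustAvoid) {
    if (text.includes(phrase.toLowerCase())) score -= rule.penalty;
  }
  return Math.max(0, Math.min(1, score)); // clamp to [0, 1], no partial mercy
}
```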

| Category | Opus 4.6 | Coder-14B | Qwen3.5-27B + LoRA | Gemma 4 + LoRA |
| --- | --- | --- | --- | --- |
| Cross-Agent Routing | — | 0.933 | 0.733 | 0.733 |
| Domain Precision | — | 0.900 | 1.000 | 1.000 |
| Identity Resistance | — | 0.779 | 0.752 | 0.787 |
| Jailbreak Defense | — | 0.775 | 0.775 | 0.775 |
| Real-World Reasoning | — | 1.000 | 1.000 | 1.000 |
| Tool Precision | — | 1.000 | 1.000 | 1.000 |
| **OVERALL** | — | **0.898** | **0.877** | **0.883** |

## Memory Footprint & Context Window Optimization


Running a 31-billion-parameter council model alongside a 14-billion-parameter coding model on a single 64 GB Mac Mini requires aggressive optimization. After deep diagnostics with the Council, we pushed the limits of the Grouped Query Attention (GQA) architecture to achieve a massive context window without risking Out-Of-Memory (OOM) crashes.

The recommended context settings for the council-local provider in openclaw.json are now incredibly aggressive:

```json
{
  "contextWindow": 32768,
  "reserveTokens": 4096
}
```

Thanks to GQA, the KV cache for these modern models consumes only ~50 MB per 1K tokens, several times less than full multi-head attention would need. This drastically shrinks the memory footprint of massive context windows:

1. **Gemma-4-31B-it (Council)**
   - Model weights (4-bit): ~18.5 GB
   - KV cache (32K context): ~1.6 GB
   - Total RAM footprint: ~20.1 GB
2. **Qwen2.5-Coder-14B (Infrastructure)**
   - Model weights (4-bit): ~8.5 GB
   - KV cache (32K context): ~1.6 GB
   - Total RAM footprint: ~10.1 GB
3. **System overhead**
   - Ubuntu VM (QEMU): 8.0 GB
   - macOS core services: ~6.0 GB
   - Total overhead: ~14.0 GB

Total Usage: ~44.2 GB out of 64 GB.

This leaves nearly 20 GB of free/inactive RAM, ensuring macOS never swaps to the SSD. The Council can now digest entire codebases and massive server logs in a single 32,000-token prompt without breaking a sweat. We didn't need to baby the context window; we just needed to do the math.
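
"Do the math" is literal here; the whole budget fits in a few lines. A sanity-check sketch using only this section's own figures (the ~50 MB per 1K token KV estimate and the list above):

```ts
// Recompute the RAM budget from the section's own estimates (all in GB).
const KV_GB_PER_1K_TOKENS = 0.05;                  // ~50 MB per 1K tokens with GQA
const kvCache = (KV_GB_PER_1K_TOKENS * 32768) / 1024; // ≈ 1.6 GB at a 32K context

const gemma = 18.5 + kvCache;  // ≈ 20.1 GB (Council)
const coder = 8.5 + kvCache;   // ≈ 10.1 GB (Infrastructure)
const overhead = 8.0 + 6.0;    // 14.0 GB (Ubuntu VM + macOS services)

console.log((gemma + coder + overhead).toFixed(1)); // "44.2" of 64 GB

// Per-request prompt budget under the openclaw.json settings:
console.log(32768 - 4096); // contextWindow - reserveTokens = 28672 tokens
```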

*The verdict — four AI brains on a podium, each scored and ranked.*

- **Coding:** Coder-14B (0.929) — beats everything, including Opus and its bigger sibling Coder-32B (0.915).
- **Council:** Coder-14B + enriched prompts (0.765) — prompts beat LoRA. See Training Lessons.
- **Privacy:** Gemma 4 + LoRA (0.787 identity resistance) — health and fund data stays local.

The plot twist nobody expected: Qwen2.5-Coder-14B is the best at everything. It leads in coding (0.929) AND council tasks (0.898). A 14B model. On a Mac Mini. Running through LM Studio. No LoRA training. No fine-tuning. No months of dataset curation and overnight training runs. It just showed up and outperformed models that had every advantage except being Qwen2.5-Coder-14B.

The 32B model was tested and eliminated. Double the parameters, double the memory, half the inference speed — and 1.4 points worse on coding. Bigger is not always better. Efficient is always better.

This raises an uncomfortable question: does the Council even need a separate 31B model for persona tasks, or has a 14-billion-parameter coding model, without a single adapter weight, already made the fine-tuned models redundant?

Based on these results, the recommended routing:

| Request Type | Route To | Why |
| --- | --- | --- |
| Code generation | Coder-14B | 0.929 — beats everything including Opus |
| Council persona | Gemma 4 + LoRA | 0.883 — deep persona training + fast (57 tok/s) |
| Complex reasoning | Opus 4.6 | Cloud backstop for multi-step analysis |
| Quick operations | Coder-14B | 0.898 on council tasks, 22 tok/s, low memory |
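
A minimal sketch of that dispatch table in code. The model IDs and request-type tags are illustrative placeholders; the real router behind the single port may look different:

```ts
// Routing sketch derived from the table above. IDs are placeholders.
type RequestType = "code" | "persona" | "reasoning" | "quick";

const ROUTES: Record<RequestType, string> = {
  code: "qwen2.5-coder-14b",    // 0.929 coding average
  persona: "gemma-4-31b-lora",  // 0.883 council overall, 57 tok/s
  reasoning: "claude-opus-4.6", // cloud backstop for multi-step analysis
  quick: "qwen2.5-coder-14b",   // 0.898 council score, low memory
};

const route = (kind: RequestType): string => ROUTES[kind];

route("persona"); // -> "gemma-4-31b-lora"
```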

Three brains. One port. The right answer every time. The architecture isn’t validated by opinions or intuition or “it feels faster.” It’s validated by numbers, and the numbers don’t care what you expected.