
Eval Harness — Apple-Grade

Eval Harness — a sensor-wired skeleton in the lab and a balance scale where EVIDENCE outweighs VIBES; the conscience the Carmack Olympics finally grew.

Date: 2026-04-18
Status: Active (17/17 end-to-end checks passing)

A benchmark that scores models in a pretty table is a report card. A benchmark that gates promotion is infrastructure. This page documents how the Carmack Olympics crossed that line.

The goal isn’t to produce nicer dashboards. It’s to make every routing decision defensible with a number you can audit six months from now, on a clean machine, with git checkout as the only setup step.

Each layer below is independent. Every one has a rollback. Every one is exercised by scripts/test_eval_harness.sh.

| # | Layer | What it removes | Where it lives |
|---|---|---|---|
| 1 | Word-boundary scoring | False positives like "windu" matching inside "windows" | eval_common.score_test |
| 2 | Plumbed seeds (--seed) | “N=5 seeds” that were actually N=5 identical runs | seed_mlx() + CLI flag on all three evals |
| 3 | Warmup (--warmup-n) | First-task penalty from cold KV cache / Metal pipelines | eval_common.warmup |
| 4 | Repro bundle | “I can’t reproduce this result” two months later | git commit, mlx_lm version, model SHA256, full env snapshot (chip, RAM, macOS, pip hash) |
| 5 | Smoke canary (--smoke) | 40 minutes wasted before noticing a broken config | 3 tasks across 3 categories |
| 6 | Resumability (--resume) | Starting over when the eval crashes at task 23 of 30 | Atomic per-task writes + completed_keys() skip |
| 7 | Latency + per-agent view | “Won accuracy but lost 3× in speed” going unnoticed | latency_ms per task, aggregated in eval_stats.py |
| 8 | Determinism self-test (--check-determinism) | Seed plumbing silently regressing | Two-run byte-identical assertion |
| 9 | Budget guardrails (--budget-sec, --max-tokens-total) | A runaway generator turning an hour into ten | Abort with exit 3 + partial-save |
| 10 | Refusal classifier | Score 0 meaning four different things | WRONG / REFUSED / TOO_SHORT / OFF_LANGUAGE / HALLUCINATED / UNKNOWN per row |
| 11 | Regression gate + task diff | Silent quality regressions landing in prod routing | eval_stats.py --gate + eval_diff.py (both exit 1 on failure) |
| 12 | Atomic writes + SHA256 trailer | Truncated JSON after a mid-run crash | tmp + rename + <file>.sha256 |

Twelve layers, each independently rollbackable, each closing a way a benchmark can lie to you. The point of evidence isn’t to flatter the decision you already made — it’s to survive being wrong in front of yourself six months from now, on a clean machine, with somebody else reading the same JSON.
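To make layer 1 concrete, here is a minimal sketch of word-boundary keyword scoring. It is not the actual eval_common.score_test. The function names and the "fraction of keywords hit" rubric are assumptions for illustration, and the sample keyword is chosen so that naive substring matching really does misfire.

```python
import re


def keyword_hit(keyword: str, text: str) -> bool:
    """True only when `keyword` appears as a whole word (case-insensitive)."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def score_keywords(keywords: list, text: str) -> float:
    """Fraction of expected keywords found as whole words (hypothetical rubric)."""
    if not keywords:
        return 0.0
    return sum(keyword_hit(k, text) for k in keywords) / len(keywords)


# Naive substring matching counts "wind" inside "windows";
# the boundary-anchored pattern does not.
naive = "wind" in "restart windows update"
strict = keyword_hit("wind", "restart windows update")
```

The `\b` anchors only behave as expected when the keyword starts and ends with a word character; keywords such as "c++" would need a different boundary rule.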

Every output JSON now carries enough metadata to exactly reproduce the run. Example from a recent carmack_eval.py --output eval/foo.json:

{
  "repro": {
    "git_commit": "452139ef61",
    "git_dirty": false,
    "mlx_lm_version": "0.31.2",
    "mlx_version": "0.19.3",
    "model_path": "/Users/neo/.cache/huggingface/.../Qwen3.6-35B-A3B-4bit/...",
    "model_sha256": "45797d2985a12c55",
    "env": {
      "python_version": "3.12.13",
      "macos_version": "26.4.1",
      "chip": "Apple M4 Max",
      "memory": "128 GB",
      "pip_freeze_sha256": "03ecb1d1cbb58d1d"
    }
  }
}

A .sha256 trailer file sits alongside every output so truncation is detectable:

$ shasum -a 256 -c eval/coder14b-seed42.json.sha256
eval/coder14b-seed42.json: OK
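The tmp + rename + trailer pattern behind that check might look roughly like the sketch below. The function names are hypothetical and the harness's own writer may differ; the only details taken from this page are the temp-file-then-rename step and a `<file>.sha256` trailer in the line format that `shasum -a 256 -c` reads.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path, payload: dict) -> str:
    """Write `payload` to `path` via temp file + rename, then write `<path>.sha256`."""
    path = Path(path)
    data = json.dumps(payload, indent=2, sort_keys=True).encode("utf-8")
    # Temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_name, path)  # readers see the old file or the new one, never half
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    digest = hashlib.sha256(data).hexdigest()
    # "<hex>  <path>" with the path as the caller passed it, so the check
    # must run from the same working directory the write used.
    Path(str(path) + ".sha256").write_text(f"{digest}  {path}\n")
    return digest


def verify_trailer(path) -> bool:
    """Recompute the file's SHA256 and compare it with the recorded trailer."""
    recorded = Path(str(path) + ".sha256").read_text().split()[0]
    return hashlib.sha256(Path(path).read_bytes()).hexdigest() == recorded
```

The trailer itself is written non-atomically in this sketch; a crash between the rename and the trailer write leaves a complete JSON file with a missing or stale trailer, which a check would still flag.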

Aggregation over N seed runs now reports bootstrap CI, effect size, and Benjamini-Hochberg FDR correction across categories:

## Paired comparison: gemma4-lora (B) minus coder14b-trained (A)
**N paired tasks:** 34 | **Mean delta:** -0.4250 (95% CI [-0.5912, -0.2603], p ≈ 0.000)
**Effect size (Cohen's d):** -1.12 (large)
**Verdict:** A > B (significant at 95%)
### Per-category delta
| Category | Mean Δ | 95% CI | Cohen's d | p | p (BH-FDR) | Sig | N |
|---|---|---|---|---|---|---|---|
| satellite | -1.000 | [-1.000, -1.000] | -3.16 (large) | 0.000 | 0.000 | ✓ | 3 |
| topology | -0.700 | [-1.000, -0.300] | -1.55 (large) | 0.000 | 0.000 | ✓ | 5 |
| home_automation | -0.500 | [-0.900, -0.100] | -1.12 (large) | 0.010 | 0.027 | ✓ | 5 |
| ... | | | | | | | |

The Cohen’s d column prevents the trap of calling a +0.02 delta “significant” when the effect is tiny. The BH-FDR-adjusted p prevents the trap of calling one cherry-picked category significant at 95% when you tested eight.
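For readers who want to see the three statistics side by side, here is a stdlib-only sketch. It is an illustration under assumptions, not the code in eval_stats.py: the percentile bootstrap, the paired variant of Cohen's d (mean delta divided by the standard deviation of the deltas), and all function names are choices made here, and the harness may compute its effect size differently.

```python
import math
import random
import statistics


def bootstrap_ci(deltas, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired deltas."""
    rng = random.Random(seed)
    n = len(deltas)
    means = sorted(statistics.fmean(rng.choices(deltas, k=n)) for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def cohens_d_paired(deltas):
    """Paired effect size: mean delta over the standard deviation of the deltas."""
    mean = statistics.fmean(deltas)
    sd = statistics.stdev(deltas)
    if sd == 0:
        return 0.0 if mean == 0 else math.copysign(math.inf, mean)
    return mean / sd


def benjamini_hochberg(pvalues):
    """BH-adjusted p-values, returned in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

The BH step scales each raw p by m / rank. A raw p of 0.010 that ranks third among eight categories becomes 0.010 × 8 / 3 ≈ 0.027, which matches the home_automation row above if that row did rank third among eight tests.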

eval_stats.py --gate turns the harness from a reporting tool into a promotion gate:

# Fails with exit 1 if any gated category drops more than 5 points
.venv/bin/python scripts/eval_stats.py \
  --inputs eval/coder14b-seed*.json --label candidate \
  --gate eval/last-week-baseline.json \
  --gate-max-drop 0.05 \
  --gate-categories identity,jailbreak,tool_calling \
  --json-output eval/gate-report.json
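The gate decision itself can be pictured with a short sketch. The per-category score dictionaries, the function names, and the treatment of a missing category as a failure are assumptions for illustration; the real JSON schema and comparison logic live in eval_stats.py.

```python
def gate_failures(baseline: dict, candidate: dict, max_drop: float = 0.05, categories=None):
    """Return human-readable failures; an empty list means the gate passes.

    `baseline` and `candidate` map category name -> mean score on a 0-1 scale
    (a hypothetical shape, not the harness's actual schema).
    """
    failures = []
    for cat in categories or sorted(baseline):
        if cat not in baseline or cat not in candidate:
            failures.append(f"{cat}: missing from baseline or candidate")
            continue
        drop = baseline[cat] - candidate[cat]
        if drop > max_drop:
            failures.append(f"{cat}: dropped {drop:.3f} (limit {max_drop:.3f})")
    return failures


def gate_exit_code(failures) -> int:
    """0 when the gate passes, 1 when any gated category regressed."""
    return 1 if failures else 0
```

A caller would pass the result to `sys.exit(gate_exit_code(failures))`. Note that a drop sitting exactly on the limit is subject to floating-point rounding in this sketch, so a production gate may want an explicit tolerance.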

A separate scripts/eval_diff.py surfaces which specific tasks regressed, not just aggregates. It also exits 1 on any regression.

scripts/check_holdout_contamination.py runs a 3-gram Jaccard sweep of each holdout task against the training corpus. Windu’s bar for trusting a memorization gap > 0.10 starts here:

.venv/bin/python scripts/check_holdout_contamination.py \
  --holdouts data/splits-carmack/valid.jsonl \
  --training data/splits-carmack/train.jsonl \
  --threshold 0.30 \
  --output eval/contamination-report.json

The script exits 1 if any holdout overlaps training content above the threshold. No external deps: just stdlib re and set operations.
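A 3-gram Jaccard sweep of this kind can be sketched with nothing but re and sets. The sketch assumes word-level 3-grams, lowercased tokens, and a strict "greater than threshold" comparison; the real script may tokenize or compare differently, and the function names are hypothetical.

```python
import re


def ngrams(text: str, n: int = 3) -> set:
    """Set of word-level n-grams from lowercased text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def flag_contaminated(holdouts, training, threshold: float = 0.30):
    """Return (holdout index, worst overlap) for holdouts above the threshold."""
    train_grams = [ngrams(t) for t in training]
    flagged = []
    for idx, holdout in enumerate(holdouts):
        grams = ngrams(holdout)
        worst = max((jaccard(grams, tg) for tg in train_grams), default=0.0)
        if worst > threshold:
            flagged.append((idx, worst))
    return flagged
```

This compares every holdout against every training example, which is quadratic; that is fine for small splits but would need an index for a large corpus.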

scripts/replay_real_traffic.py complements synthetic benchmarks by replaying anonymized production turns through a candidate endpoint. A PII scrubber (phone/email/IP) runs before anything leaves the local machine.
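A scrubber covering those three classes could look like the sketch below. The regexes, placeholder tokens, and function name are illustrative assumptions rather than the patterns in replay_real_traffic.py, and they are deliberately simple: IPv6 is not covered, and the IPv4 pattern will also redact dotted version strings such as 1.2.3.4.

```python
import re

# Illustrative patterns only; real-world PII detection needs broader coverage.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
PHONE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")


def scrub(text: str) -> str:
    """Replace emails, IPv4 addresses, and phone-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    # IPs go before phones: the phone pattern would otherwise swallow dotted quads.
    text = IPV4.sub("[IP]", text)
    text = PHONE.sub("[PHONE]", text)
    return text
```

Running the substitutions in a fixed order matters here, because the phone pattern is loose enough to match an IPv4 address if it sees one first.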

Authenticating against the Tailscale-exposed council


The cathedral fork serves the council seat on :1337 over mTLS only, launched with --no-plain: the plain socket is gone, not deprecated. A plain probe doesn’t get a 401; curl reports “Received HTTP/0.9 when not allowed”, because nothing on that port speaks plain HTTP at all. The TLS listener binds loopback, the Tailscale IP, and the 10.0.0.1 vmnet bridge; the cert’s SAN carries the stable MagicDNS name so it survives Tailscale IP drift. The former coder seat on :1338 was retired 2026-06-07; the CODER role now runs Codestral-22B (Codestral-22B-v0.1-4bit) as a separate plain-loopback service on :3301.

The PKI lives at ~/.sanctum/certs/ on every Sanctum machine. Per-client cert and key pairs are issued for each named consumer — guardian, canary, drift, sanctum-server, parity-smoke, council-offbox. A curl probe with a client identity reaches the seat:

curl --cacert ~/.sanctum/certs/ca.crt \
     --cert ~/.sanctum/certs/clients/sanctum-server.crt \
     --key ~/.sanctum/certs/clients/sanctum-server.key \
     https://manoir.local:1337/v1/models
# -> {"data":[{"id":"38740b847e4cb78f...","object":"model"}], ...}

A full promotion run, from pre-flight to task-level diff:

# 1. Pre-flight: smoke canary + determinism
.venv/bin/python scripts/carmack_eval.py \
  --model ... --seed 42 --check-determinism --smoke \
  --output eval/smoke.json

# 2. Full N=5 with budget and notify
for s in 42 1138 66 2187 501; do
  .venv/bin/python scripts/carmack_eval.py \
    --model ... --adapter-path ... --seed "$s" \
    --budget-sec 3600 --max-tokens-total 120000 \
    --notify 'osascript -e "display notification \"eval done\""' \
    --output "eval/coder14b-seed${s}.json"
done

# 3. Aggregate + gate
.venv/bin/python scripts/eval_stats.py \
  --inputs eval/coder14b-seed*.json --label coder14b \
  --gate eval/last-week-baseline.json --gate-max-drop 0.05 \
  --output eval/coder14b.md --json-output eval/coder14b.json

# 4. Diff for task-level detail
.venv/bin/python scripts/eval_diff.py \
  --baseline eval/last-week-baseline.json \
  --current eval/coder14b.json \
  --min-delta 0.1 --output eval/coder14b-diff.md

| Code | Meaning | Who emits it |
|---|---|---|
| 0 | All green | Any eval or stats script on success |
| 1 | Test/gate failure | test_eval_harness.sh with any failed check; --gate with regressions; eval_diff.py with any task regression; --check-determinism mismatch |
| 2 | Input/config error | Missing seed for --check-determinism; unparseable --inputs glob |
| 3 | Budget exceeded | --budget-sec or --max-tokens-total breach; partial results flushed before exit |

CI pipelines can distinguish “benchmark broken” (1) from “budget exhausted, try again” (3).
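A CI wrapper that acts on those codes might look like this sketch. The action names and the `run_step` helper are hypothetical; only the code-to-meaning mapping comes from the table above, and any unlisted code is treated as a blocking failure as a conservative assumption.

```python
import subprocess
import sys

# Exit code -> CI action. The action names are illustrative, not harness output.
ACTIONS = {
    0: "promote",     # all green
    1: "block",       # test or gate failure
    2: "fix_config",  # input or config error
    3: "retry",       # budget exceeded, partial results already flushed
}


def classify(returncode: int) -> str:
    """Map a harness exit code to a CI action; unknown codes block by default."""
    return ACTIONS.get(returncode, "block")


def run_step(cmd: list) -> str:
    """Run one harness command and return the CI action for its exit code."""
    return classify(subprocess.run(cmd).returncode)
```

For example, `run_step([".venv/bin/python", "scripts/eval_stats.py", ...])` with the real arguments filled in would return "block" when the gate finds a regression and "retry" when a budget guardrail tripped.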

Explicitly out of scope, honestly flagged:

  • LLM-as-judge disagreement: Carmack uses programmatic keyword rubrics, not an LLM judge. This lives in the Standard Olympics track if needed.
  • Power analysis: bootstrap CI + Cohen’s d cover most of what you’d ask of a power calculation.
  • Secret scanning on response logs: assumed low risk since prompts are authored locally; add scanning if you start committing response logs to git.

The harness lives in the private Ogilthorp3/mlx-finetune repo — so the “git checkout as the only setup step” promise starts with the clone:

git clone git@github.com:Ogilthorp3/mlx-finetune.git ~/Projects/mlx-finetune
cd ~/Projects/mlx-finetune
bash scripts/test_eval_harness.sh

Current baseline: 17 pass, 0 fail, 1 skip. The skip is the --check-determinism HTTP test against :1337 — and worth being honest about why. It’s filed as “depends on server seed support,” but the nearer cause is the same client-side debt above: the test reaches for the seat over plain Bearer, and the --no-plain mTLS server has nothing listening on the bare port. A skip that needs two sentences to explain is a skip worth fixing.