Eval Harness — Apple-Grade

Date: 2026-04-18 Status: Active — 17/17 end-to-end checks passing

A benchmark that scores models in a pretty table is a report card. A benchmark that gates promotion is infrastructure. This page documents how the Carmack Olympics crossed that line.

The goal isn’t to produce nicer dashboards. It’s to make every routing decision defensible with a number you can audit six months from now, on a clean machine, with git checkout as the only setup step.

The Twelve Layers

Each layer below is independent. Every one has a rollback. Every one is exercised by scripts/test_eval_harness.sh.

#	Layer	What it removes	Where it lives
1	Word-boundary scoring	False positives like `"windu"` matching inside `"windows"`	`eval_common.score_test`
2	Plumbed seeds (`--seed`)	“N=5 seeds” that were actually N=5 identical runs	`seed_mlx()` + CLI flag on all three evals
3	Warmup (`--warmup-n`)	First-task penalty from cold KV cache / Metal pipelines	`eval_common.warmup`
4	Repro bundle	”I can’t reproduce this result” two months later	`git commit`, mlx_lm version, model SHA256, full env snapshot (chip, RAM, macOS, pip hash)
5	Smoke canary (`--smoke`)	40 minutes wasted before noticing a broken config	3 tasks across 3 categories
6	Resumability (`--resume`)	Starting over when the eval crashes at task 23 of 30	Atomic per-task writes + `completed_keys()` skip
7	Latency + per-agent view	”Won accuracy but lost 3× in speed” going unnoticed	`latency_ms` per task, aggregated in `eval_stats.py`
8	Determinism self-test (`--check-determinism`)	Seed plumbing silently regressing	Two-run byte-identical assertion
9	Budget guardrails (`--budget-sec`, `--max-tokens-total`)	A runaway generator turning an hour into ten	Abort with exit 3 + partial-save
10	Refusal classifier	Score 0 meaning four different things	`WRONG / REFUSED / TOO_SHORT / OFF_LANGUAGE / HALLUCINATED / UNKNOWN` per row
11	Regression gate + task diff	Silent quality regressions landing in prod routing	`eval_stats.py --gate` + `eval_diff.py` (both exit 1 on failure)
12	Atomic writes + SHA256 trailer	Truncated JSON after a mid-run crash	`tmp + rename` + `<file>.sha256`

Reproducibility bundle

Every output JSON now carries enough metadata to exactly reproduce the run. Example from a recent carmack_eval.py --output eval/foo.json:

{
  "repro": {
    "git_commit": "452139ef61",
    "git_dirty": false,
    "mlx_lm_version": "0.31.2",
    "mlx_version": "0.19.3",
    "model_path": "/Users/neo/.cache/huggingface/.../Qwen3.5-27B-4bit/...",
    "model_sha256": "45797d2985a12c55",
    "env": {
      "python_version": "3.12.13",
      "macos_version": "26.4.1",
      "chip": "Apple M4 Max",
      "memory": "128 GB",
      "pip_freeze_sha256": "03ecb1d1cbb58d1d"
    }
  }
}

A .sha256 trailer file sits alongside every output so truncation is detectable:

$ shasum -a 256 -c eval/coder14b-seed42.json.sha256
eval/coder14b-seed42.json: OK

Statistical rigor in `eval_stats.py`

Aggregation over N seed runs now reports bootstrap CI, effect size, and Benjamini-Hochberg FDR correction across categories:

## Paired comparison: gemma4-lora (B) minus coder14b-trained (A)

**N paired tasks:** 34  |  **Mean delta:** -0.4250  (95% CI [-0.5912, -0.2603], p ≈ 0.000)
**Effect size (Cohen's d):** -1.12 (large)
**Verdict:** A > B (significant at 95%)

### Per-category delta

| Category | Mean Δ | 95% CI | Cohen's d | p | p (BH-FDR) | Sig | N |
|---|---|---|---|---|---|---|---|
| satellite | -1.000 | [-1.000, -1.000] | -3.16 (large) | 0.000 | 0.000 | ✓ | 3 |
| topology | -0.700 | [-1.000, -0.300] | -1.55 (large) | 0.000 | 0.000 | ✓ | 5 |
| home_automation | -0.500 | [-0.900, -0.100] | -1.12 (large) | 0.010 | 0.027 | ✓ | 5 |
| ... | | | | | | | |

The Cohen’s d column prevents the trap of calling a +0.02 delta “significant” when the effect is tiny. The BH-FDR-adjusted p prevents the trap of calling one cherry-picked category significant at 95% when you tested eight.

Regression gate — the CI/CD bit

eval_stats.py --gate turns the harness from a reporting tool into a promotion gate:

# Fails with exit 1 if any gated category drops more than 5 points
.venv/bin/python scripts/eval_stats.py \
  --inputs eval/coder14b-seed*.json --label candidate \
  --gate eval/last-week-baseline.json \
  --gate-max-drop 0.05 \
  --gate-categories identity,jailbreak,tool_calling \
  --json-output eval/gate-report.json

A separate scripts/eval_diff.py surfaces which specific tasks regressed — not just aggregates. Also exits 1 on any regression.

Anti-contamination

scripts/check_holdout_contamination.py runs a 3-gram Jaccard sweep of each holdout task against the training corpus. Windu’s bar for trusting a memorization gap > 0.10 starts here:

.venv/bin/python scripts/check_holdout_contamination.py \
  --holdouts configs/holdouts.jsonl \
  --training configs/train.jsonl \
  --threshold 0.30 \
  --output eval/contamination-report.json

Exit 1 if any holdout overlaps training content above threshold. No external deps — just stdlib re and set operations.

Real-traffic replay

scripts/replay_real_traffic.py complements synthetic benchmarks by replaying anonymized production turns through a candidate endpoint. PII scrubber (phone/email/IP) runs before anything leaves the local machine.

Authenticating against the Tailscale-exposed council

sanctum-mlx enforces Bearer-token auth on every peer that isn’t 127.0.0.1. Any eval pointed at http://100.0.0.25:1337/v1 (or the manoir.tailnet hostname) must send Authorization: Bearer <token>.

The token lives at ~/.sanctum/secrets/council-mlx.token on manoir (mode 600). Two convenience flows:

# One-off: export into the current shell
export COUNCIL_API_KEY=$(ssh neo@100.0.0.25 'cat ~/.sanctum/secrets/council-mlx.token')

# Pass through to the eval
.venv/bin/python scripts/carmack_eval_http.py \
  --url http://100.0.0.25:1337/v1 \
  --model 45797d2985a12c55e6473686e9ea91b95e959553 \
  --label council-remote \
  --api-key-env COUNCIL_API_KEY \
  --output eval/council-remote.json

scripts/test_eval_harness.sh auto-discovers the token: if COUNCIL_API_KEY isn’t set, it non-interactively ssh-fetches from manoir (BatchMode=yes, 4s timeout) and proceeds. Localhost probes (e.g. the council-guardian running on the Mini itself) bypass auth via the sanctum-mlx loopback rule — no token needed there.

Weekend workflow

# 1. Pre-flight: smoke canary + determinism
.venv/bin/python scripts/carmack_eval.py \
  --model ... --seed 42 --check-determinism --smoke \
  --output eval/smoke.json

# 2. Full N=5 with budget and notify
for s in 42 1138 66 2187 501; do
  .venv/bin/python scripts/carmack_eval.py \
    --model ... --adapter-path ... --seed $s \
    --budget-sec 3600 --max-tokens-total 120000 \
    --notify 'osascript -e "display notification \"eval done\""' \
    --output eval/coder14b-seed${s}.json
done

# 3. Aggregate + gate
.venv/bin/python scripts/eval_stats.py \
  --inputs eval/coder14b-seed*.json --label coder14b \
  --gate eval/last-week-baseline.json --gate-max-drop 0.05 \
  --output eval/coder14b.md --json-output eval/coder14b.json

# 4. Diff for task-level detail
.venv/bin/python scripts/eval_diff.py \
  --baseline eval/last-week-baseline.json \
  --current  eval/coder14b.json \
  --min-delta 0.1 --output eval/coder14b-diff.md

Exit code contract

Code	Meaning	Who emits it
`0`	All green	Any eval or stats script on success
`1`	Test/gate failure	`test_eval_harness.sh` with any failed check; `--gate` with regressions; `eval_diff.py` with any task regression; `--check-determinism` mismatch
`2`	Input/config error	Missing seed for `--check-determinism`; unparseable `--inputs` glob
`3`	Budget exceeded	`--budget-sec` or `--max-tokens-total` breach; partial results flushed before exit

CI pipelines can distinguish “benchmark broken” (1) from “budget exhausted, try again” (3).

What isn’t in here

Explicitly out of scope, honestly flagged:

LLM-as-judge disagreement — Carmack uses programmatic keyword rubrics, not an LLM judge. This lives in the Standard Olympics track if needed.
Power analysis — bootstrap CI + Cohen’s d cover 95% of what you’d ask a power calculation.
Secret scanning on response logs — assumed low risk since prompts are authored locally; add scanning if you start committing response logs to git.

Verification

cd ~/Projects/mlx-finetune
bash scripts/test_eval_harness.sh

Current baseline: 17 pass, 0 fail, 1 skip (skip is --check-determinism via HTTP, which depends on server seed support).

Eval Harness — Apple-Grade

Eval Harness — Apple-Grade

The Twelve Layers

Reproducibility bundle

Statistical rigor in eval_stats.py

Regression gate — the CI/CD bit

Anti-contamination

Real-traffic replay

Authenticating against the Tailscale-exposed council

Weekend workflow

Exit code contract

What isn’t in here

Verification

Statistical rigor in `eval_stats.py`