Skip to content

2026-06-08: The Champion Gate Stack — Pipeline Architecture

Pencil sketch on a dark charcoal background of an ornate brass mechanical filter machine with four progressively-finer-meshed funnels stacked vertically; beneath them on a polished wood workbench a single small brass coin glows softly in teal.

The autoresearch nightly trains LoRA adapters on top of Qwen3.6-35B-A3B-4bit and gates each candidate against a stack of four calibrated thresholds. A new champion only ships when an adapter survives every gate. The thresholds, the order, and the operators that maintain them — that’s the architecture this chapter captures, post the 2026-06-04 to 2026-06-08 calibration arc.

The pipeline is one daemon-supervised pass per night, gated on a LaunchAgent at 01:00 EDT. The picker walks a 45-rung ladder; each rung is a Cfg(...) of hyperparameters + a recipe name. Each picked rung becomes one experiment.

Nightly @ 01:00 EDT — run_overnight.shsearch_space.pyLADDER (45 rungs)picker.py + controller.pyPhase A: UCB1 / ε-greedytrain.pyPhase 1: 300 iters, ~75-90 mincarmack_eval.pyPhase 2: 26 + 109 casesThe four-gate stack (run per candidate)Gate 1: screen overallcandidate > 0.6913(calibrated_screen_floor + margin)Gate 2: screen regressionidentity ≥ 0.7163AND jailbreak ≥ 0.5000Gate 3: full-tier overallcandidate > 0.7000(calibrated_gate)Gate 4: per-master flooreach cell > max(0.50,base_master − 0.10)screen_failedscreen_floor_vetogate_failedper_master_floor_vetopromote_champion.shPATCH → PROBE → FAILSAFE → RUNBOOK(+ auto-refresh per-master + recalibrate gates)Phase 4.7 propose.pyread-only proposal enginePhase 4.75 apply_proposals.pysentinel-gated read-writedisk-pressure-monitorLaunchAgent, <40 GB triggercalibration-retry-watcherLaunchAgent, GPU-clear pollContinuous loopsCalibration backbone (state of truth)~/.sanctum/state/production-champion.json — calibrated_gate, calibrated_screen_floor, calibrated_screen_baseline_path, per-master sidecar pointer.Every gate threshold derives from a real eval of the current champion. No magic numbers; new champions automatically rebase their gates on promote.

A candidate that survives the four gates promotes. The gates are ordered: gate 1 saves GPU by deferring full-tier eval until screen tier has been earned; gate 2 catches identity/jailbreak collapse before scoring; gate 3 is the absolute aggregate floor; gate 4 catches the master cells the aggregate could hide. Every refused candidate’s breadcrumb names the cell that killed it.

GateThreshold (today)OriginCost if it fires falsely
1 — screen overall> 0.6913calibrate_screen_floor.py against May-12 championrejects a candidate that would have promoted (high)
2 — screen regressionidentity ≥ 0.7163, jailbreak ≥ 0.5000same calibration runrefuses candidates that trade preservation for aggregate gains (low — that trade is the bug)
3 — full-tier overall> 0.7000calibrate_gate.py against May-12 championrejects a near-miss (high)
4 — per-master floorper-cell > max(0.50, base − 0.10)aggregate_per_master.py Bayesian shrinkage k=5refuses a candidate that lifts the aggregate by collapsing a cell (low — that trade is the bug)

The doctrine: gate thresholds are derived, never set. The champion’s own measured behavior on the current eval is what becomes the next candidate’s floor.

Six weak points were on the CTO briefing’s R&D queue last night. Four shipped today; two remain operator-gated. The pipeline is now load-bearing instead of advisory.

#WasNow
1propose.py output sat as a markdown artifact; rungs only landed when the operator manually edited search_space.pytools/proposals/apply_proposals.py reads the latest proposals.json, prepends NewRungLever Cfg entries to LADDER, commits + signs. Phase 4.75 in run_overnight.sh invokes it. Sentinel-gated by ~/.sanctum/autoresearch-auto-apply-proposals-active — first run is shadow / dry-run.
2per-master baseline never refreshed when champion changed — first new champion would have shipped against a stale floorpromote_champion.sh now invokes aggregate_per_master.py + queues a calibration watcher run after the four-layer Critical Service Pattern completes successfully. Failsafe path stays off the recalibration branch.
3/tmp/screen-floor-retry-watcher.sh + disk-pressure-monitor.sh were unsupervised bash loops that died on rebootProductized: com.sanctum.disk-pressure-monitor.plist (RunAtLoad + KeepAlive) and com.sanctum.calibration-retry-watcher.plist (operator-triggered). Both in ~/Library/LaunchAgents/.
4MAX_EXPERIMENTS=4 + DEADLINE_HOUR=9 capped the picker’s reach to 4 rungs per night before the deadline killed the runMAX_EXPERIMENTS=8 + DEADLINE_HOUR=14 in both run_overnight.sh defaults and the LaunchAgent plist. More slots, longer window, same per-slot 2.5h budget.

Open R&D queue: the closed-loop controller is still in shadow mode pending operator activation (~/.sanctum/autoresearch-controller-active); RecipeWeightLever and LadderReorderLever proposals still surface as markdown for human review; the council endpoint on the MBP-side propose.py invocation is intentionally unreachable so the engine falls back to deterministic-only proposals.