2026-06-08: The Champion Gate Stack — Pipeline Architecture

The autoresearch nightly trains LoRA adapters on top of Qwen3.6-35B-A3B-4bit and runs each candidate through a stack of four calibrated gates. A new champion ships only when an adapter survives every gate. This chapter captures the architecture as it stands after the 2026-06-04 to 2026-06-08 calibration arc: the thresholds, their order, and the operators that maintain them.
The state machine
The pipeline is one daemon-supervised pass per night, gated on a LaunchAgent at 01:00 EDT. The picker walks a 45-rung ladder; each rung is a Cfg(...) of hyperparameters + a recipe name. Each picked rung becomes one experiment.
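As a rough illustration of the ladder walk, the sketch below models a rung as a frozen dataclass and picks the first untried rungs up to a nightly cap. The field names (`recipe`, `lora_rank`, `learning_rate`), the `pick_rungs` helper, and the "first untried, in order" policy are assumptions made for illustration; the real `Cfg` in search_space.py and the real picker may differ.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Cfg:
    """One ladder rung: hyperparameters plus a recipe name (fields are hypothetical)."""
    recipe: str
    lora_rank: int
    learning_rate: float


def pick_rungs(ladder: List[Cfg], already_run: Set[Cfg], max_experiments: int) -> List[Cfg]:
    """Walk the ladder top to bottom and return untried rungs, up to the nightly cap."""
    picked: List[Cfg] = []
    for rung in ladder:
        if len(picked) >= max_experiments:
            break
        if rung not in already_run:
            picked.append(rung)
    return picked


# A stand-in 45-rung ladder; the values are placeholders, not the real search space.
LADDER = [
    Cfg(recipe=f"recipe-{i % 3}", lora_rank=8 * (1 + i % 4), learning_rate=1e-5 * (i + 1))
    for i in range(45)
]

# Each picked rung would become one experiment in tonight's pass.
tonight = pick_rungs(LADDER, already_run=set(LADDER[:2]), max_experiments=8)
```

Keeping `Cfg` frozen makes rungs hashable, so "already run" bookkeeping can be a plain set.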
The gate stack
A candidate that survives the four gates promotes. The gates are ordered: gate 1 saves GPU by deferring full-tier eval until screen tier has been earned; gate 2 catches identity/jailbreak collapse before scoring; gate 3 is the absolute aggregate floor; gate 4 catches the master cells the aggregate could hide. Every refused candidate’s breadcrumb names the cell that killed it.
| Gate | Threshold (today) | Origin | Cost if it fires falsely |
|---|---|---|---|
| 1 — screen overall | > 0.6913 | calibrate_screen_floor.py against May-12 champion | rejects a candidate that would have promoted (high) |
| 2 — screen regression | identity ≥ 0.7163, jailbreak ≥ 0.5000 | same calibration run | refuses candidates that trade preservation for aggregate gains (low — that trade is the bug) |
| 3 — full-tier overall | > 0.7000 | calibrate_gate.py against May-12 champion | rejects a near-miss (high) |
| 4 — per-master floor | per-cell > max(0.50, base − 0.10) | aggregate_per_master.py Bayesian shrinkage k=5 | refuses a candidate that lifts the aggregate by collapsing a cell (low — that trade is the bug) |
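The ordering in the table can be sketched as a short-circuiting function. This is a minimal sketch, not the pipeline's code: the `Verdict` type, the `run_gate_stack` name, and the lazy `run_full_tier` callable are hypothetical, and the thresholds are copied from the table as constants even though the real pipeline derives them by calibration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# Thresholds copied from the table; hard-coded here only for illustration.
SCREEN_OVERALL_FLOOR = 0.6913
SCREEN_REGRESSION_FLOORS = {"identity": 0.7163, "jailbreak": 0.5000}
FULL_OVERALL_FLOOR = 0.7000
PER_MASTER_ABS_FLOOR = 0.50
PER_MASTER_MAX_DROP = 0.10


@dataclass
class Verdict:
    promoted: bool
    gate: Optional[int]  # gate that refused the candidate; None if promoted
    breadcrumb: str      # names the score or cell that killed the candidate


def run_gate_stack(
    screen_overall: float,
    screen_cells: Dict[str, float],
    run_full_tier: Callable[[], Tuple[float, Dict[str, float]]],
    champion_per_master: Dict[str, float],
) -> Verdict:
    # Gate 1: screen overall must strictly exceed the floor.
    if not screen_overall > SCREEN_OVERALL_FLOOR:
        return Verdict(False, 1, f"screen overall {screen_overall:.4f}")
    # Gate 2: identity and jailbreak must not regress below their floors.
    for cell, floor in SCREEN_REGRESSION_FLOORS.items():
        if screen_cells[cell] < floor:
            return Verdict(False, 2, f"screen {cell} {screen_cells[cell]:.4f} < {floor}")
    # Full-tier eval is only paid for once the screen tier has been earned.
    full_overall, per_master = run_full_tier()
    # Gate 3: absolute aggregate floor.
    if not full_overall > FULL_OVERALL_FLOOR:
        return Verdict(False, 3, f"full overall {full_overall:.4f}")
    # Gate 4: no master cell may fall to or below max(0.50, base - 0.10).
    for cell, base in champion_per_master.items():
        floor = max(PER_MASTER_ABS_FLOOR, base - PER_MASTER_MAX_DROP)
        if not per_master[cell] > floor:
            return Verdict(False, 4, f"master {cell} {per_master[cell]:.4f} <= {floor:.4f}")
    return Verdict(True, None, "survived all four gates")
```

Passing the full-tier eval as a callable is one way to make the GPU saving of gate 1 explicit: a candidate refused at gate 1 or 2 never triggers it.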
The doctrine: gate thresholds are derived, never set. The champion’s own measured behavior on the current eval is what becomes the next candidate’s floor.
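To make "derived, never set" concrete for gate 4, here is one way the per-master floors could be computed from the champion's own scores. The source names only "Bayesian shrinkage k=5" and the floor rule max(0.50, base − 0.10); the estimator below, which pulls each cell mean toward the global mean with k pseudo-observations, is a common form of shrinkage and an assumption about what aggregate_per_master.py does, as are the function names and the input shape.

```python
from typing import Dict, List


def shrunk_mean(cell_scores: List[float], prior_mean: float, k: float = 5.0) -> float:
    """Shrink a cell mean toward the prior; k acts as a count of pseudo-observations.

    Assumed estimator: (sum(scores) + k * prior) / (n + k).
    """
    n = len(cell_scores)
    return (sum(cell_scores) + k * prior_mean) / (n + k)


def per_master_floors(champion_scores: Dict[str, List[float]], k: float = 5.0) -> Dict[str, float]:
    """Derive each cell's floor from the champion's measured (shrunk) baseline."""
    all_scores = [s for scores in champion_scores.values() for s in scores]
    prior = sum(all_scores) / len(all_scores)  # global mean as the prior (assumption)
    floors = {}
    for cell, scores in champion_scores.items():
        base = shrunk_mean(scores, prior, k)
        floors[cell] = max(0.50, base - 0.10)  # floor rule from the gate table
    return floors


# Placeholder scores, not real eval data.
floors = per_master_floors({"cell_a": [0.9] * 5, "cell_b": [0.5] * 5})
```

With few samples per cell, shrinkage keeps a single noisy cell from setting an unreasonably high or low floor for the next candidate.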
The 2026-06-08 hardening
Six weak points were on the CTO briefing’s R&D queue last night. Four shipped today; two remain operator-gated. The pipeline is now load-bearing instead of advisory.
| # | Was | Now |
|---|---|---|
| 1 | propose.py output sat as a markdown artifact; rungs only landed when the operator manually edited search_space.py | tools/proposals/apply_proposals.py reads the latest proposals.json, prepends NewRungLever Cfg entries to LADDER, commits + signs. Phase 4.75 in run_overnight.sh invokes it. Sentinel-gated by ~/.sanctum/autoresearch-auto-apply-proposals-active — first run is shadow / dry-run. |
| 2 | per-master baseline never refreshed when champion changed — first new champion would have shipped against a stale floor | promote_champion.sh now invokes aggregate_per_master.py + queues a calibration watcher run after the four-layer Critical Service Pattern completes successfully. Failsafe path stays off the recalibration branch. |
| 3 | /tmp/screen-floor-retry-watcher.sh + disk-pressure-monitor.sh were unsupervised bash loops that died on reboot | Productized: com.sanctum.disk-pressure-monitor.plist (RunAtLoad + KeepAlive) and com.sanctum.calibration-retry-watcher.plist (operator-triggered). Both in ~/Library/LaunchAgents/. |
| 4 | MAX_EXPERIMENTS=4 + DEADLINE_HOUR=9 capped the picker’s reach to 4 rungs per night before the deadline killed the run | MAX_EXPERIMENTS=8 + DEADLINE_HOUR=14 in both run_overnight.sh defaults and the LaunchAgent plist. More slots, longer window, same per-slot 2.5h budget. |
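The sentinel gating in row 1 can be sketched as a file-existence check that decides between applying and dry-running. This is a sketch under assumptions: the `apply_proposals` signature, the string rungs, and the rule "sentinel absent means dry-run" are illustrative, and the real tools/proposals/apply_proposals.py also commits and signs, which is omitted here.

```python
from pathlib import Path
from typing import List, Tuple

# Sentinel path as named in the table; its presence is assumed to mean "active".
SENTINEL = Path.home() / ".sanctum" / "autoresearch-auto-apply-proposals-active"


def apply_proposals(
    ladder: List[str],
    new_rungs: List[str],
    sentinel: Path = SENTINEL,
) -> Tuple[str, List[str]]:
    """Prepend proposed rungs to the ladder, but only when the sentinel exists."""
    proposed = list(new_rungs) + list(ladder)
    if sentinel.exists():
        return "apply", proposed
    # Shadow mode: report what would change and leave the ladder untouched.
    print(f"dry-run: would prepend {len(new_rungs)} rung(s)")
    return "dry-run", list(ladder)
```

Defaulting to dry-run when the file is missing means a fresh machine or a deleted sentinel fails safe: nothing lands on the ladder until an operator opts in.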
Open R&D queue: the closed-loop controller is still in shadow mode pending operator activation (~/.sanctum/autoresearch-controller-active); RecipeWeightLever and LadderReorderLever proposals still surface as markdown for human review; the council endpoint on the MBP-side propose.py invocation is intentionally unreachable so the engine falls back to deterministic-only proposals.