2026-06-08: The Champion Gate Stack — Pipeline Architecture

Pencil sketch on a dark charcoal background of an ornate brass mechanical filter machine with four progressively-finer-meshed funnels stacked vertically; beneath them on a polished wood workbench a single small brass coin glows softly in teal.

The autoresearch nightly trains LoRA adapters on top of Qwen3.6-35B-A3B-4bit and gates each candidate against a stack of four calibrated thresholds. A new champion only ships when an adapter survives every gate. The thresholds, the order, and the operators that maintain them — that’s the architecture this chapter captures, post the 2026-06-04 to 2026-06-08 calibration arc.

The state machine

The pipeline is one daemon-supervised pass per night, gated on a LaunchAgent at 01:00 EDT. The picker walks a 45-rung ladder; each rung is a Cfg(...) of hyperparameters + a recipe name. Each picked rung becomes one experiment.

The gate stack

A candidate that survives the four gates promotes. The gates are ordered: gate 1 saves GPU by deferring full-tier eval until screen tier has been earned; gate 2 catches identity/jailbreak collapse before scoring; gate 3 is the absolute aggregate floor; gate 4 catches the master cells the aggregate could hide. Every refused candidate’s breadcrumb names the cell that killed it.

Gate	Threshold (today)	Origin	Cost if it fires falsely
1 — screen overall	`> 0.6913`	calibrate_screen_floor.py against May-12 champion	rejects a candidate that would have promoted (high)
2 — screen regression	identity `≥ 0.7163`, jailbreak `≥ 0.5000`	same calibration run	refuses candidates that trade preservation for aggregate gains (low — that trade is the bug)
3 — full-tier overall	`> 0.7000`	calibrate_gate.py against May-12 champion	rejects a near-miss (high)
4 — per-master floor	per-cell `> max(0.50, base − 0.10)`	aggregate_per_master.py Bayesian shrinkage k=5	refuses a candidate that lifts the aggregate by collapsing a cell (low — that trade is the bug)

The doctrine: gate thresholds are derived, never set. The champion’s own measured behavior on the current eval is what becomes the next candidate’s floor.

The 2026-06-08 hardening

Six weak points were on the CTO briefing’s R&D queue last night. Four shipped today; two remain operator-gated. The pipeline is now load-bearing instead of advisory.

#	Was	Now
1	propose.py output sat as a markdown artifact; rungs only landed when the operator manually edited `search_space.py`	`tools/proposals/apply_proposals.py` reads the latest `proposals.json`, prepends `NewRungLever` Cfg entries to LADDER, commits + signs. Phase 4.75 in `run_overnight.sh` invokes it. Sentinel-gated by `~/.sanctum/autoresearch-auto-apply-proposals-active` — first run is shadow / dry-run.
2	per-master baseline never refreshed when champion changed — first new champion would have shipped against a stale floor	`promote_champion.sh` now invokes `aggregate_per_master.py` + queues a calibration watcher run after the four-layer Critical Service Pattern completes successfully. Failsafe path stays off the recalibration branch.
3	`/tmp/screen-floor-retry-watcher.sh` + `disk-pressure-monitor.sh` were unsupervised bash loops that died on reboot	Productized: `com.sanctum.disk-pressure-monitor.plist` (RunAtLoad + KeepAlive) and `com.sanctum.calibration-retry-watcher.plist` (operator-triggered). Both in `~/Library/LaunchAgents/`.
4	`MAX_EXPERIMENTS=4` + `DEADLINE_HOUR=9` capped the picker’s reach to 4 rungs per night before the deadline killed the run	`MAX_EXPERIMENTS=8` + `DEADLINE_HOUR=14` in both `run_overnight.sh` defaults and the LaunchAgent plist. More slots, longer window, same per-slot 2.5h budget.

Open R&D queue: the closed-loop controller is still in shadow mode pending operator activation (~/.sanctum/autoresearch-controller-active); RecipeWeightLever and LadderReorderLever proposals still surface as markdown for human review; the council endpoint on the MBP-side propose.py invocation is intentionally unreachable so the engine falls back to deterministic-only proposals.

The shape of this architecture is the same shape every honest gate has: derive the threshold from the thing it’s gating against, refuse the candidate that violates the contract, name the cell that fired. The May-12 champion’s phantom 0.881 wasn’t a number we set; it was a number we trusted without measuring. The calibrated 0.7000 + 0.6813 + 0.7163 + 0.5000 are numbers we measured against the same eval framework, the same model snapshot, the same scoring temperature as the candidates we’re gating. Tonight’s nightly was the first run with all four gates honest and all four watcher loops productized. The first honest champion will be whichever candidate clears all four cells — and when it does, the recalibration chain that just shipped means the next candidate’s gates will be against that champion, not the May-12 one. The gate stack is now self-rebasing.