The Living Force knows when something is broken. It does not know when something is wrong.
The difference: broken means the service is down, the port is closed, the health check fails. Wrong means the goodnight automation ran, the lights turned off, the dashboard says “all clear” — but the front door never locked. Every system reports healthy. The family sleeps with an unlocked door. The watchdog saw nothing, because the watchdog checks L2 and the failure was L4.
An arXiv paper (2603.25697) landed in March 2026 describing a framework that had autonomously merged 1,094 pull requests across two production systems with zero regressions. It wasn’t about code generation; it was about specification-driven verification: AI agents exercising a product as power users at 1,000x human cadence, then fixing what breaks, proving the fixes against ground truth the implementer cannot fake, and measuring drift to ensure the system never gets worse. They called it the Kitchen Loop.
Reading it felt like finding the blueprint for the thing we’d been building by hand. That is no longer just a good metaphor. Sanctum now has a checked-in Kitchen Loop surface: canonical spec and canary YAML, a tribunal model, a six-phase runner, pause gates, and an end-to-end harness proving both a healthy cycle and a forced canary escape. This page documents how the model maps onto Sanctum and what still remains to deepen it.
The Kitchen Loop is built on four concepts. Each one has a Sanctum analog that’s either partial or absent.
1. Specification Surface
An enumerable matrix of everything the system claims to do. Not “we support Tuya lights” but “we support 55 Tuya lights across 12 rooms, each responding to 6 automation triggers, with cloud and local fallback paths.” Every cell in the matrix is a testable claim. Sanctum’s instance.yaml defines what runs. A spec surface defines what works.
2. As a User x1000
AI agents exercise the product as a power user would — not closing tickets, but systematically exhausting coverage. Foundation tier (30%) validates happy paths. Composition tier (50%) tests feature combinations — the seams where individual passing tests coexist with combinatorial failures. Frontier tier (20%) deliberately attempts things out of scope to identify the next valuable capability.
3. Unbeatable Tests
Four-layer verification where each layer catches what the one below misses. L1: it compiles. L2: it runs. L3: the output parses. L4: the actual state changed correctly. The innovation is the integration mandate — passing L1-L3 means nothing if nobody attempted L4. Sanctum’s Living Force checks “is the service alive?” but not “did the user-visible action produce the correct end-to-end state change?”
4. Drift Control
Continuous quality measurement with automated pause gates. Not just “did this change break something?” but “are things getting worse over time?” Five gates: regression failure, canary escape, drift threshold, backpressure, and starvation. The system stops itself before humans need to.
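As a rough illustration of how the five gates could be evaluated after each iteration, here is a minimal Python sketch. The metric field names and every threshold are assumptions made for the example; they are not taken from the paper or from kitchenloop.yaml. Starvation is read here as "the loop has run out of new work", which is also an interpretation.

```python
from dataclasses import dataclass


@dataclass
class LoopMetrics:
    """Per-iteration measurements; all field names are hypothetical."""
    regression_failures: int  # previously passing scenarios that now fail
    canary_escapes: int       # known-bad inputs that slipped past review
    drift_score: float        # quality change vs. baseline, negative = worse
    queue_depth: int          # proposals still waiting for review
    idle_iterations: int      # consecutive iterations with no new work


# Illustrative thresholds only (assumptions, not documented values).
DRIFT_LIMIT = -0.05
MAX_QUEUE_DEPTH = 20
MAX_IDLE_ITERATIONS = 3


def tripped_gates(m: LoopMetrics) -> list[str]:
    """Return the name of every pause gate that should stop the loop."""
    gates = []
    if m.regression_failures > 0:
        gates.append("regression_failure")
    if m.canary_escapes > 0:
        gates.append("canary_escape")
    if m.drift_score < DRIFT_LIMIT:
        gates.append("drift_threshold")
    if m.queue_depth > MAX_QUEUE_DEPTH:
        gates.append("backpressure")
    if m.idle_iterations >= MAX_IDLE_ITERATIONS:
        gates.append("starvation")
    return gates


def should_pause(m: LoopMetrics) -> bool:
    """The loop pauses as soon as any single gate trips."""
    return bool(tripped_gates(m))
```

Returning the list of tripped gates, rather than a bare boolean, keeps the reason for a pause visible to whoever has to resume the loop.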
The current tribunal is modeled and audited, not yet wired into every critical council workflow.
Pause Gates: Canary escape and pause behavior are mechanically proven in test-sanctum-kitchen-loop.sh. Drift, starvation, and backpressure are modeled, but their inputs are still synthetic in this workspace slice.
Spec Surface: sanctum-spec-surface.yaml defines dimensions, valid combinations, and declared scenarios. The current scenario set is credible, not exhaustive.
Self-Healing: Code Forge plus the six-phase loop now formalize a closed execution path. The loop currently writes proposals and tribunal records rather than patching live systems autonomously.
Night Deployment: kitchenloop.yaml encodes the night window and execution shape. Scheduling and productized rollout remain later concerns.
The gap is no longer formalization. The gap is breadth. The Living Force now has a proactive nervous system; it just needs more of the haus wired into it.
This is the highest-value integration point. Sanctum manages a sprawling matrix of devices, services, agents, and automations. Today, we know they’re running. We don’t systematically verify they’re working together.
The spec surface expresses Sanctum’s claims as a testable matrix:
sanctum-spec-surface.yaml
dimensions:
  services:
    - home-assistant
    - sanctum-proxy
    - denchclaw-gateway
    - living-force-watchdog
    - council-mlx
    # ... 14+ services
  integrations:
    - tuya-cloud      # 55 lights
    - ecobee-homekit  # 4 sensors
    - ring-cameras    # 4 cameras
    - sonos-bridge    # 10 speakers
    - firewalla-api   # Router control
    - alarmo          # Alarm panel
  actions:
    - turn-on
    - turn-off
    - set-value
    - trigger-automation
    - query-state
    - failover
  failure_modes:
    - cloud-dropout      # Tuya API goes away
    - bridge-down        # bridge100 severed
    - container-restart  # HA Docker restarts
    - agent-timeout      # LLM provider slow
    - memory-divergence  # Mem0 vs Vault mismatch
Cross-product: 14 services x 6 integrations x 6 actions x 5 failure modes = 2,520 testable claims. Not all combinations are valid — Alarmo doesn’t have a set-value for Tuya lights — but the matrix forces you to explicitly declare which cells matter and which don’t. The cells you skip are the cells that bite you at 4 AM.
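The arithmetic is easy to check by enumerating the matrix directly. The sketch below is illustrative only: the YAML lists just five services, so the other nine are placeholder names, and the exclusion set is a hypothetical way of declaring out-of-scope cells, not the format sanctum-spec-surface.yaml actually uses.

```python
from itertools import product

# Dimension values copied from the YAML above. Only five services are named
# there, so the remaining nine are placeholders (assumption).
SERVICES = [
    "home-assistant", "sanctum-proxy", "denchclaw-gateway",
    "living-force-watchdog", "council-mlx",
] + [f"service-{n:02d}" for n in range(6, 15)]
INTEGRATIONS = ["tuya-cloud", "ecobee-homekit", "ring-cameras",
                "sonos-bridge", "firewalla-api", "alarmo"]
ACTIONS = ["turn-on", "turn-off", "set-value",
           "trigger-automation", "query-state", "failover"]
FAILURE_MODES = ["cloud-dropout", "bridge-down", "container-restart",
                 "agent-timeout", "memory-divergence"]

# Hypothetical exclusion list: (integration, action) pairs declared invalid.
EXCLUDED_PAIRS = {("alarmo", "set-value")}


def all_cells():
    """Every (service, integration, action, failure_mode) combination."""
    return list(product(SERVICES, INTEGRATIONS, ACTIONS, FAILURE_MODES))


def declared_cells():
    """Cells that remain after the explicit exclusions are applied."""
    return [c for c in all_cells() if (c[1], c[2]) not in EXCLUDED_PAIRS]


if __name__ == "__main__":
    print(len(all_cells()))       # 14 * 6 * 6 * 5 = 2520
    print(len(declared_cells()))  # 2520 - (14 * 5) = 2450
```

Keeping the exclusions as data rather than as gaps in the matrix means a skipped cell is always a recorded decision.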
The Living Force currently operates at L2 — it verifies services are running and responding. The Kitchen Loop’s L4 verification checks that the actual state changed correctly:
HA API returns valid entity states, proxy returns valid JSON
L4 State Delta (reality changed): “Turn on living room lights” → Tuya API confirms state=on, HA entity updated, dashboard reflects change, memory logs event, all within 30 seconds.
L4 is where the March 22 bridge100 failure would have been caught by the system instead of by a human in his underwear. The watchdog said “all clear” because it was checking L2 (are the ports open?) without checking L4 (can services actually reach each other and produce correct outcomes?).
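One way to picture an L4 check is as "act, then ask every independent observer whether reality agrees, within a deadline". The Python sketch below assumes that shape. The probes are caller-supplied functions, and nothing here calls a real Home Assistant or Tuya API; the function name, parameters, and observer names are all hypothetical.

```python
import time
from typing import Callable, Dict

# A probe returns the state that one observer currently reports.
Probe = Callable[[], str]


def verify_state_delta(
    act: Callable[[], None],
    probes: Dict[str, Probe],
    expected: str,
    timeout_s: float = 30.0,
    poll_s: float = 1.0,
) -> Dict[str, bool]:
    """Perform the action, then poll until every observer reports
    `expected` or the deadline passes. Returns per-observer pass/fail."""
    act()
    deadline = time.monotonic() + timeout_s
    while True:
        # Re-read every observer on each pass so they must agree together.
        results = {name: probe() == expected for name, probe in probes.items()}
        if all(results.values()) or time.monotonic() >= deadline:
            return results
        time.sleep(poll_s)


if __name__ == "__main__":
    # Simulated goodnight routine in which the lock never engages.
    house = {"lights": "on", "front_door": "unlocked"}

    def goodnight() -> None:
        house["lights"] = "off"  # lights respond, the lock silently does not

    lights = verify_state_delta(
        goodnight, {"lights": lambda: house["lights"]}, "off", timeout_s=0.0)
    door = verify_state_delta(
        lambda: None, {"front_door": lambda: house["front_door"]},
        "locked", timeout_s=0.0)
    print(lights, door)
```

The per-observer result matters: a check that only reports an overall failure would not say which link in the chain never saw the change.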
The paper introduces a four-tier canary system that injects known-bad inputs to verify the quality gates catch them. This maps directly to Sanctum’s agent council:
| Tier | Definition | Sanctum Application |
| --- | --- | --- |
| 1: Obviously Bad | Errors any gate should catch | Mundi recommends spending when budget is exceeded. Windu approves a firewall rule that opens all ports. |
| 2: Shadow | Stale or low-novelty data | Jocasta reports a CVE that was patched two weeks ago. Qui-Gon recommends restarting a service that was decommissioned. |
| 3: Adversarial | Real data, wrong conclusion | Cilghal correlates a temperature spike with HVAC failure when it was actually a sunny afternoon. |
| 4: Mixed | Partially correct | Tommy’s weather briefing has correct temperature but wrong precipitation forecast. |
Currently, Council Sessions catch some of these through inter-agent debate. Formalizing canary injection makes quality measurement continuous rather than anecdotal. A Tier 1 escape — an obviously bad recommendation making it through council — would trigger an immediate alert. The paper achieved zero Tier 1 escapes across 163 iterations. That’s the target.
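Mechanically, canary injection reduces to "plant a recommendation with a known correct verdict, then see whether the reviewer returns it". A minimal Python sketch under that assumption follows. The `Canary` fields and the `review` callable are hypothetical and do not mirror the schema of kitchenloop-canaries.yaml.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Canary:
    tier: int              # 1 = obviously bad ... 4 = mixed
    reviewer: str          # council seat expected to catch it
    proposal: str          # the planted recommendation
    expected_verdict: str  # what a working gate should answer


def find_escapes(canaries: Iterable[Canary],
                 review: Callable[[str, str], str]) -> List[Canary]:
    """Return every canary whose actual verdict differs from the expected one.

    `review(reviewer, proposal)` stands in for a real council call.
    """
    return [c for c in canaries
            if review(c.reviewer, c.proposal) != c.expected_verdict]


def tier1_escaped(escapes: Iterable[Canary]) -> bool:
    """A Tier 1 escape is the condition that should raise an immediate alert."""
    return any(c.tier == 1 for c in escapes)


if __name__ == "__main__":
    canaries = [
        Canary(1, "windu", "approve firewall rule opening all ports", "reject"),
        Canary(2, "jocasta", "report CVE patched two weeks ago", "reject"),
    ]

    def careless(reviewer: str, proposal: str) -> str:
        # Stub reviewer that waves everything through.
        return "approve"

    escapes = find_escapes(canaries, careless)
    print(len(escapes), tier1_escaped(escapes))
```

Counting escapes per tier over many iterations is what turns "the council usually catches this" into a number that can be tracked.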
The paper found that debates between instances of the same model family converge to groupthink rapidly. Three heterogeneous models (Gemini, GPT/Codex, Claude) produced genuine perspective diversity. The Council took this to heart: as of 2026-04-28, its seats are spread across several model families by default. Yoda and Mundi run on Claude Opus 4.7 (Max subscription bridge), Qui-Gon on Qwen2.5-Coder-14B, Windu on Gemini 3.1 Pro, and Cilghal/Mothma/Jocasta on local Qwen3.6 (privacy-critical seats stay on the haus’s own hardware). Per-agent primaries live in ~/.openclaw/openclaw.json; tier resolution lives in ~/.sanctum/sanctum-proxy/config.yaml. See (Neuro)diversity is Paramount for the doctrine and The Smart Router for the routing-layer mechanics.
The Kitchen Loop’s tribunal pattern below remains the right escalation for critical decisions (architectural changes, security policy updates, deployment approvals) where you want adversarial roles assigned (proposer / challenger / grounding) on top of the already-heterogeneous default routing:
# Council tribunal routing for critical decisions
tribunal:
  critical_decisions:
    - model: claude-opus    # Primary reasoning
      role: proposer
    - model: gemini-pro     # Alternative perspective
      role: challenger
    - model: local-council  # No cloud dependency
      role: grounding
  consensus_required: 2_of_3
  kill_gate: true  # Explicit "do not build" argument required
The paper’s key finding: blind opening rounds eliminate first-speaker anchoring, and explicit kill gates prevent universal-action bias. Both are implementable in the existing Council Session format.
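To make the two mechanisms concrete, here is a small Python sketch of a blind opening round with a 2-of-3 consensus and a kill gate. It is one possible reading, not the checked-in tribunal: the kill gate is interpreted as "the decision is invalid unless at least one member actually argued against building", and the panel members are plain callables standing in for model calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Opinion:
    approve: bool
    kill_argument: Optional[str] = None  # explicit "do not build" case


@dataclass(frozen=True)
class Verdict:
    approved: bool
    reason: str


def run_tribunal(proposal: str,
                 panel: Dict[str, Callable[[str], Opinion]],
                 consensus_required: int = 2,
                 kill_gate: bool = True) -> Verdict:
    # Blind opening round: each member receives only the proposal text,
    # so nobody can anchor on whoever spoke first.
    opinions = {role: member(proposal) for role, member in panel.items()}

    # Kill gate (interpretation): refuse to decide if no one made the
    # case for doing nothing.
    if kill_gate and not any(o.kill_argument for o in opinions.values()):
        return Verdict(False, "kill gate: nobody argued 'do not build'")

    approvals = sum(o.approve for o in opinions.values())
    if approvals >= consensus_required:
        return Verdict(True, f"{approvals} of {len(opinions)} approved")
    return Verdict(False, f"only {approvals} of {len(opinions)} approved")


if __name__ == "__main__":
    panel = {
        "proposer": lambda p: Opinion(True),
        "challenger": lambda p: Opinion(False, "adds attack surface"),
        "grounding": lambda p: Opinion(True),
    }
    print(run_tribunal("expose dashboard on a new port", panel))
```

Checking the kill gate before counting votes means a unanimous approval with no dissenting case is treated as a process failure rather than a strong signal.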
The REGRESS phase is where the Kitchen Loop diverges most from current Sanctum operations. Fire drills run monthly and test infrastructure resilience. The regression oracle runs every iteration and tests functional correctness. Both are necessary. Fire drills answer “can the system survive failure?” The oracle answers “does the system do what it claims?”
Anti-Signal Canaries — kitchenloop-canaries.yaml declares Tier 1-3 traps and their expected reviewers and verdicts.
Heterogeneous Tribunal Model — kitchenloop-tribunal.yaml encodes proposer, challenger, grounding, blind rounds, and kill gates.
Six-Phase Runner — run_kitchen_loop.py writes backlog, ideate, triage, execute, polish, regress, and durable-memory artifacts into an isolated state directory.
Pause-Gate Proof — test-sanctum-kitchen-loop.sh proves both the healthy cycle and the forced canary-escape path.
Operator Surface — sanctumctl kitchen-loop {validate|plan|run} exposes the loop through the same checked-in CLI as the rest of the audit surface.
The remaining work is expansion, not existence: deeper L4 state deltas, broader scenario coverage, and tighter attachment to the live haushold runtime.
The Living Force was born from a failure at 4 AM. It built an immune system — reactive, dependency-aware, self-healing. The Kitchen Loop adds a nervous system — proactive, specification-driven, self-testing. The immune system asks “is something broken?” The nervous system asks “does everything work the way we promised?”
Together, they close the gap between “the service is running” and “the haus is actually doing what the family expects it to do.” Which is, if you think about it, the only question that matters.