The Kitchen Loop

Tommy supervises the Kitchen Loop — six stations, one conveyor belt, zero regressions

The Living Force knows when something is broken. It does not know when something is wrong.

The difference: broken means the service is down, the port is closed, the health check fails. Wrong means the goodnight automation ran, the lights turned off, the dashboard says “all clear” — but the front door never locked. Every system reports healthy. The family sleeps with an unlocked door. The watchdog saw nothing, because the watchdog checks L2 and the failure was L4.

An arXiv paper landed in March 2026 (2603.25697) describing a framework that had autonomously merged 1,094 pull requests across two production systems with zero regressions. It wasn’t about code generation; it was about specification-driven verification: AI agents exercising a product as power users at 1,000x human cadence, then fixing what breaks, proving the fixes against ground truth the implementer cannot fake, and measuring drift to ensure the system never gets worse. They called it the Kitchen Loop.

Reading it felt like finding the blueprint for the thing we’d been building by hand. That is no longer just a good metaphor. Sanctum now has a checked-in Kitchen Loop surface: canonical spec and canary YAML, a tribunal model, a six-phase runner, pause gates, and an end-to-end harness proving both a healthy cycle and a forced canary escape. This page documents how the model maps onto Sanctum and what still remains to deepen it.

The Four Pillars, Translated

The Kitchen Loop is built on four concepts. Each one now has a Sanctum analog, though none of them is complete yet.

1. Specification Surface

An enumerable matrix of everything the system claims to do. Not “we support Tuya lights” but “we support 55 Tuya lights across 12 rooms, each responding to 6 automation triggers, with cloud and local fallback paths.” Every cell in the matrix is a testable claim. Sanctum’s instance.yaml defines what runs. A spec surface defines what works.

2. As a User x1000

AI agents exercise the product as a power user would — not closing tickets, but systematically exhausting coverage. Foundation tier (30%) validates happy paths. Composition tier (50%) tests feature combinations — the seams where individual passing tests coexist with combinatorial failures. Frontier tier (20%) deliberately attempts things out of scope to identify the next valuable capability.
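A minimal sketch of how the 30/50/20 split could turn into a per-cycle scenario budget. The function name and the largest-remainder rounding are assumptions for illustration; the paper's percentages are the only part taken from the text above.

```python
# Tier shares from the text, as whole percentages to avoid float rounding.
TIER_SHARES = {"foundation": 30, "composition": 50, "frontier": 20}


def allocate_scenarios(budget: int) -> dict[str, int]:
    """Split a scenario budget across tiers using largest-remainder rounding."""
    floors = {tier: budget * pct // 100 for tier, pct in TIER_SHARES.items()}
    remainders = {tier: budget * pct % 100 for tier, pct in TIER_SHARES.items()}
    leftover = budget - sum(floors.values())
    # Hand any leftover scenarios to the tiers with the largest remainders.
    for tier in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        floors[tier] += 1
    return floors


print(allocate_scenarios(10))  # {'foundation': 3, 'composition': 5, 'frontier': 2}
print(allocate_scenarios(7))   # {'foundation': 2, 'composition': 4, 'frontier': 1}
```

With an awkward budget like 7, the extra scenario goes to the composition tier, which is where the text says the seams live.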

3. Unbeatable Tests

Four-layer verification where each layer catches what the one below misses. L1: it compiles. L2: it runs. L3: the output parses. L4: the actual state changed correctly. The innovation is the integration mandate — passing L1-L3 means nothing if nobody attempted L4. Sanctum’s Living Force checks “is the service alive?” but not “did the user-visible action produce the correct end-to-end state change?”
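The integration mandate can be sketched as a verifier that treats a missing L4 check as a failure rather than a skip. This is an illustrative sketch, not the paper's implementation: `Layer`, `Verdict`, and `verify` are hypothetical names, and each check is just a callable returning a bool.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, Optional


class Layer(IntEnum):
    COMPILE = 1
    EXECUTE = 2
    PARSE = 3
    STATE_DELTA = 4


@dataclass
class Verdict:
    passed: bool
    highest_layer_passed: int
    reason: str


def verify(checks: dict[Layer, Optional[Callable[[], bool]]]) -> Verdict:
    """Run layers in order; a layer with no check counts as a failure."""
    highest = 0
    for layer in Layer:
        check = checks.get(layer)
        if check is None:
            return Verdict(False, highest, f"L{layer.value} was never attempted")
        if not check():
            return Verdict(False, highest, f"L{layer.value} failed")
        highest = layer.value
    return Verdict(True, highest, "all four layers passed")


# L1 to L3 pass, but nobody wrote an L4 check: the verdict is still a failure.
partial = verify({
    Layer.COMPILE: lambda: True,
    Layer.EXECUTE: lambda: True,
    Layer.PARSE: lambda: True,
})
print(partial.passed, partial.reason)  # False L4 was never attempted
```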

4. Drift Control

Continuous quality measurement with automated pause gates. Not just “did this change break something?” but “are things getting worse over time?” Five gates: regression failure, canary escape, drift threshold, backpressure, and starvation. The system stops itself before humans need to.
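One way to picture the five gates is a single function that looks at a cycle's metrics and names every gate that tripped. The metric fields and thresholds below are assumptions chosen for illustration, and "drift" is modeled here as a streak of consecutive quality declines, which is only one possible reading.

```python
from dataclasses import dataclass, field

# Hypothetical thresholds, chosen for illustration only.
DRIFT_STREAK = 3
MAX_QUEUE_DEPTH = 50
MAX_IDLE_CYCLES = 5


@dataclass
class CycleMetrics:
    regression_failures: int = 0
    canary_escapes: int = 0
    quality_history: list[float] = field(default_factory=list)  # newest last
    queue_depth: int = 0
    idle_cycles: int = 0  # consecutive cycles that produced no new work


def has_downward_streak(history: list[float], streak: int) -> bool:
    """True when the last `streak` cycle-to-cycle changes were all declines."""
    recent = history[-(streak + 1):]
    if len(recent) < streak + 1:
        return False
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))


def tripped_gates(m: CycleMetrics) -> list[str]:
    gates = []
    if m.regression_failures > 0:
        gates.append("regression")
    if m.canary_escapes > 0:
        gates.append("canary_escape")
    if has_downward_streak(m.quality_history, DRIFT_STREAK):
        gates.append("drift")
    if m.queue_depth > MAX_QUEUE_DEPTH:
        gates.append("backpressure")
    if m.idle_cycles >= MAX_IDLE_CYCLES:
        gates.append("starvation")
    return gates


metrics = CycleMetrics(quality_history=[0.90, 0.85, 0.80, 0.70], queue_depth=60)
print(tripped_gates(metrics))  # ['drift', 'backpressure']
```

Any non-empty result would pause the loop; an empty list means the next cycle may start.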

What Sanctum Now Has

The Living Force is not starting from zero anymore. The core Kitchen Loop scaffolding is now checked in and mechanically verified:

Kitchen Loop	Sanctum Today	Gap
Regression Oracle	`tools/run_kitchen_loop.py` writes an oracle artifact every cycle	Coverage depth is still modest compared to the full haus surface.
Multi-Model Tribunal	Checked-in `kitchenloop-tribunal.yaml` defines proposer, challenger, grounding, blind opening rounds, and kill gates	The current tribunal is modeled and audited, not yet wired into every critical council workflow.
Pause Gates	Canary escape and pause behavior are mechanically proven in `test-sanctum-kitchen-loop.sh`	Drift, starvation, and backpressure are modeled, but their inputs are still synthetic in this workspace slice.
Spec Surface	`sanctum-spec-surface.yaml` defines dimensions, valid combinations, and declared scenarios	The current scenario set is credible, not exhaustive.
Self-Healing	Code Forge plus the six-phase loop now formalize a closed execution path	The loop currently writes proposals and tribunal records rather than patching live systems autonomously.
Night Deployment	`kitchenloop.yaml` encodes the night window and execution shape	Scheduling and productized rollout remain later concerns.

The gap is no longer formalization. The gap is breadth. The Living Force now has a proactive nervous system; it just needs more of the haus wired into it.

The Specification Surface for Sanctum

This is the highest-value integration point. Sanctum manages a sprawling matrix of devices, services, agents, and automations. Today, we know they’re running. We don’t systematically verify they’re working together.

The spec surface expresses Sanctum’s claims as a testable matrix:

dimensions:
  services:
    - home-assistant
    - sanctum-proxy
    - denchclaw-gateway
    - living-force-watchdog
    - council-mlx
    # ... 14+ services

  integrations:
    - tuya-cloud        # 55 lights
    - ecobee-homekit    # 4 sensors
    - ring-cameras      # 4 cameras
    - sonos-bridge      # 10 speakers
    - firewalla-api     # Router control
    - alarmo            # Alarm panel

  actions:
    - turn-on
    - turn-off
    - set-value
    - trigger-automation
    - query-state
    - failover

  failure_modes:
    - cloud-dropout     # Tuya API goes away
    - bridge-down       # bridge100 severed
    - container-restart # HA Docker restarts
    - agent-timeout     # LLM provider slow
    - memory-divergence # Mem0 vs Vault mismatch

Cross-product: 14 services x 6 integrations x 6 actions x 5 failure modes = 2,520 testable claims. Not all combinations are valid — Alarmo doesn’t have a set-value for Tuya lights — but the matrix forces you to explicitly declare which cells matter and which don’t. The cells you skip are the cells that bite you at 4 AM.
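The enumeration itself is a few lines. The sketch below uses only the five services named in the YAML (the full list is elided there), so it yields 5 x 6 x 6 x 5 = 900 cells rather than 2,520. The exclusion rules are hypothetical examples of cells declared out of scope; the real valid combinations live in `sanctum-spec-surface.yaml`, whose format is not shown here.

```python
from itertools import product

# Dimension values copied from the YAML above; the services list is
# truncated there, so only the five named services appear here.
DIMENSIONS = {
    "services": ["home-assistant", "sanctum-proxy", "denchclaw-gateway",
                 "living-force-watchdog", "council-mlx"],
    "integrations": ["tuya-cloud", "ecobee-homekit", "ring-cameras",
                     "sonos-bridge", "firewalla-api", "alarmo"],
    "actions": ["turn-on", "turn-off", "set-value",
                "trigger-automation", "query-state", "failover"],
    "failure_modes": ["cloud-dropout", "bridge-down", "container-restart",
                      "agent-timeout", "memory-divergence"],
}

# Hypothetical exclusion rules: each partial cell marks every matching
# combination as explicitly out of scope.
EXCLUSIONS = [
    {"integrations": "alarmo", "actions": "set-value"},
    {"integrations": "ring-cameras", "actions": "turn-on"},
]


def is_excluded(cell: dict[str, str]) -> bool:
    return any(all(cell[k] == v for k, v in rule.items()) for rule in EXCLUSIONS)


def enumerate_cells() -> list[dict[str, str]]:
    keys = list(DIMENSIONS)
    return [dict(zip(keys, combo)) for combo in product(*DIMENSIONS.values())]


cells = enumerate_cells()
declared = [cell for cell in cells if not is_excluded(cell)]
print(len(cells), len(declared))  # 900 850
```

The point of keeping exclusions as data is that a skipped cell is a decision someone wrote down, not a cell nobody thought about.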

L4 State Deltas: The Missing Layer

The Living Force currently operates at L2 — it verifies services are running and responding. The Kitchen Loop’s L4 verification checks that the actual state changed correctly:

Layer	What It Checks	Sanctum Example
L1 Compile	Config is valid	`instance.yaml` parses, plist generates, manifests validate
L2 Execute	Service responds	Health endpoint returns 200, port is open
L3 Parse	Output is structured	HA API returns valid entity states, proxy returns valid JSON
L4 State Delta	Reality changed	“Turn on living room lights” → Tuya API confirms state=on, HA entity updated, dashboard reflects change, memory logs event, all within 30 seconds

L4 is where the March 22 bridge100 failure would have been caught before a human in his underwear had to find it. The watchdog said “all clear” because it was checking L2 (ports open) without checking L4 (whether services can actually reach each other and produce correct outcomes).

Implementing L4 for Sanctum means:

Before-state snapshot — capture HA entity states, Tuya cloud states, agent memory timestamps
Action — trigger an automation, send a command, simulate a failure
After-state assertion — verify every expected change occurred, and no unexpected changes happened
Timeout and rollback — if the delta doesn’t appear within the expected window, the test fails
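The four steps above can be sketched as a single helper. This is an illustration under stated assumptions: state is a flat dict of entity id to value, `snapshot` and `action` are caller-supplied callables standing in for real HA and Tuya calls, and rollback is left out. The demo replays the unlocked-door scenario from the top of this page against an in-memory dict.

```python
import time
from dataclasses import dataclass
from typing import Callable

State = dict[str, str]


@dataclass
class DeltaResult:
    missing: State      # expected values that never appeared
    unexpected: State   # changes nobody asked for

    @property
    def passed(self) -> bool:
        return not self.missing and not self.unexpected


def check_state_delta(snapshot: Callable[[], State],
                      action: Callable[[], None],
                      expected: State,
                      timeout_s: float = 30.0,
                      poll_s: float = 0.5) -> DeltaResult:
    before = snapshot()
    action()
    deadline = time.monotonic() + timeout_s
    while True:
        after = snapshot()
        missing = {k: v for k, v in expected.items() if after.get(k) != v}
        if not missing or time.monotonic() >= deadline:
            break
        time.sleep(poll_s)
    unexpected = {k: v for k, v in after.items()
                  if before.get(k) != v and k not in expected}
    return DeltaResult(missing, unexpected)


# Demo: a goodnight routine that turns off the light but forgets the lock.
haus = {"light.living_room": "on", "lock.front_door": "unlocked"}


def goodnight() -> None:
    haus["light.living_room"] = "off"


result = check_state_delta(
    snapshot=lambda: dict(haus),
    action=goodnight,
    expected={"light.living_room": "off", "lock.front_door": "locked"},
    timeout_s=0,  # no waiting in the demo
)
print(result.passed, result.missing)  # False {'lock.front_door': 'locked'}
```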

Anti-Signal Canaries for Agent Quality

The paper introduces a four-tier canary system that injects known-bad inputs to verify the quality gates catch them. This maps directly to Sanctum’s agent council:

Tier	Definition	Sanctum Application
1: Obviously Bad	Errors any gate should catch	Mundi recommends spending when budget is exceeded. Windu approves a firewall rule that opens all ports.
2: Shadow	Stale or low-novelty data	Jocasta reports a CVE that was patched two weeks ago. Qui-Gon recommends restarting a service that was decommissioned.
3: Adversarial	Real data, wrong conclusion	Cilghal correlates a temperature spike with HVAC failure when it was actually a sunny afternoon.
4: Mixed	Partially correct	Tommy’s weather briefing has correct temperature but wrong precipitation forecast.

Currently, Council Sessions catch some of these through inter-agent debate. Formalizing canary injection makes quality measurement continuous rather than anecdotal. A Tier 1 escape — an obviously bad recommendation making it through council — would trigger an immediate alert. The paper achieved zero Tier 1 escapes across 163 iterations. That’s the target.
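A rough sketch of canary injection, assuming a gate is any callable that approves or rejects a recommendation. The `Canary` type, the keyword-matching toy gate, and the sample recommendations are all hypothetical stand-ins; a real gate would be the council review itself.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Canary:
    tier: int            # 1 = obviously bad ... 4 = mixed
    agent: str
    recommendation: str  # known-bad input; the gate should reject it


def count_escapes(canaries: list[Canary],
                  approves: Callable[[Canary], bool]) -> dict[int, int]:
    """Count, per tier, the known-bad canaries that the gate approved."""
    escapes: dict[int, int] = {}
    for canary in canaries:
        if approves(canary):
            escapes[canary.tier] = escapes.get(canary.tier, 0) + 1
    return escapes


# Toy stand-in for council review: rejects anything with a red-flag phrase.
RED_FLAGS = ("all ports", "over budget")


def toy_gate(canary: Canary) -> bool:
    return not any(flag in canary.recommendation for flag in RED_FLAGS)


CANARIES = [
    Canary(1, "Windu", "approve firewall rule opening all ports"),
    Canary(1, "Mundi", "approve purchase while over budget"),
    Canary(2, "Jocasta", "report CVE patched two weeks ago"),
]

escapes = count_escapes(CANARIES, toy_gate)
if escapes.get(1, 0) > 0:
    print("ALERT: Tier 1 canary escaped")
print(escapes)  # {2: 1}
```

The toy gate stops both Tier 1 canaries but lets the stale CVE through, which is the kind of per-tier escape rate the paragraph above wants measured continuously.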

Heterogeneous Model Tribunals

The paper found that debates between instances of the same model family converge to groupthink rapidly. Three heterogeneous models (Gemini, GPT/Codex, Claude) produced genuine perspective diversity. The Council took this to heart: as of 2026-04-28, the Jedi are spread across model families by default, with Yoda and Mundi on Claude Opus 4.7 (Max subscription bridge), Qui-Gon on Qwen2.5-Coder-14B, Windu on Gemini 3.1 Pro, and Cilghal/Mothma/Jocasta on local Qwen3.6 (privacy-critical seats stay on the haus’s own hardware). Per-agent primaries live in ~/.openclaw/openclaw.json; tier resolution lives in ~/.sanctum/sanctum-proxy/config.yaml. See (Neuro)diversity is Paramount for the doctrine and The Smart Router for the routing-layer mechanics.

The Kitchen Loop’s tribunal pattern below remains the right escalation for critical decisions (architectural changes, security policy updates, deployment approvals) where you want adversarial roles assigned (proposer / challenger / grounding) on top of the already-heterogeneous default routing:

# Council tribunal routing for critical decisions
tribunal:
  critical_decisions:
    - model: claude-opus     # Primary reasoning
      role: proposer
    - model: gemini-pro      # Alternative perspective
      role: challenger
    - model: local-council   # No cloud dependency
      role: grounding
  consensus_required: 2_of_3
  kill_gate: true  # Explicit "do not build" argument required

The paper’s key finding: blind opening rounds eliminate first-speaker anchoring, and explicit kill gates prevent universal-action bias. Both are implementable in the existing Council Session format.
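Both mechanisms are small enough to sketch. In the illustration below, a blind opening round means each reviewer is called with the proposal alone, and the kill gate is read as "at least one explicit do-not-build argument must be on record before any verdict counts". That reading of `kill_gate`, and every name in the code, is an assumption rather than the checked-in tribunal logic.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Opening:
    role: str
    approve: bool
    kill_argument: str = ""  # the explicit case for "do not build"


Reviewer = Callable[[str], Opening]


def blind_openings(proposal: str, reviewers: list[Reviewer]) -> list[Opening]:
    # Each reviewer sees only the proposal, never another reviewer's opening,
    # so nobody can anchor on the first speaker.
    return [review(proposal) for review in reviewers]


def decide(openings: list[Opening],
           consensus_required: int = 2,
           kill_gate: bool = True) -> str:
    if kill_gate and not any(o.kill_argument.strip() for o in openings):
        return "blocked: no do-not-build argument on record"
    approvals = sum(o.approve for o in openings)
    return "approved" if approvals >= consensus_required else "rejected"


# Stub reviewers standing in for the three model seats.
reviewers = [
    lambda p: Opening("proposer", True),
    lambda p: Opening("challenger", False, "adds a cloud dependency at night"),
    lambda p: Opening("grounding", True),
]

verdict = decide(blind_openings("enable auto-merge for HA config", reviewers))
print(verdict)  # approved
```

If the challenger had returned no kill argument, the same two approvals would come back as blocked, which is the guard against universal-action bias.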

Drift Control for the Memory Stack

Sanctum’s four-layer memory system (Working → Mem0 → Memory Vault → Neo4j) is a drift risk the paper’s framework was designed to catch. Over time:

Mem0 and the Memory Vault can diverge (different consolidation schedules)
Neo4j relationships can reference entities that no longer exist
Agent recommendations can cite memories that were superseded
Nightly consolidation can silently drop important context

Kitchen Loop drift control applies here as five gates:

Regression gate — after each memory consolidation, verify that key facts are still retrievable and correct
Canary gate — inject a known memory and verify it survives consolidation intact
Drift threshold — if memory retrieval quality drops 3+ consecutive cycles, pause writes and alert
Backpressure — if the memory write queue exceeds threshold, enter drain mode (consolidate only, no new writes)
Starvation — if no memories are written for N cycles, something is broken upstream
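The regression and canary gates can be sketched against a toy memory store. Everything here is a stand-in: the store is a plain dict, `consolidate` is any function from store to store, and the canary key is made up. Nothing below reflects the actual Mem0 or Memory Vault APIs.

```python
from typing import Callable

Store = dict[str, str]

# Hypothetical sentinel memory injected before each consolidation.
CANARY_KEY = "canary:kitchen-loop"
CANARY_VALUE = "sentinel-2603"


def consolidation_failures(store: Store,
                           consolidate: Callable[[Store], Store],
                           key_facts: Store) -> list[str]:
    """Run one consolidation with a canary seeded; report which gates failed."""
    seeded = {**store, CANARY_KEY: CANARY_VALUE}
    result = consolidate(seeded)
    failures = []
    if result.get(CANARY_KEY) != CANARY_VALUE:
        failures.append("canary")
    lost = sorted(k for k, v in key_facts.items() if result.get(k) != v)
    if lost:
        failures.append("regression: " + ", ".join(lost))
    return failures


def lossy_consolidate(store: Store) -> Store:
    # Toy consolidation that silently drops every "canary:" and "pref:" entry.
    return {k: v for k, v in store.items()
            if not k.startswith(("canary:", "pref:"))}


memories = {"pref:thermostat": "68F at night", "fact:door_code_owner": "Tommy"}
key_facts = {"pref:thermostat": "68F at night"}

print(consolidation_failures(memories, lossy_consolidate, key_facts))
# ['canary', 'regression: pref:thermostat']
```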

The Six-Phase Loop, Adapted

The Kitchen Loop runs a continuous six-phase cycle. Here’s how each phase maps to Sanctum operations:

┌─────────────────────────────────────────────────┐
│  BACKLOG (15 min)                               │
│  Evaluate spec surface coverage gaps.           │
│  Which Sanctum claims haven't been tested?      │
├─────────────────────────────────────────────────┤
│  IDEATE (15-45 min)                             │
│  Exercise a scenario as a real user would.      │
│  "Turn on all lights, arm the haus, check       │
│   cameras, trigger goodnight automation."       │
│  Document what breaks — structured experience   │
│  reports, not unit tests.                       │
├─────────────────────────────────────────────────┤
│  TRIAGE (5-10 min)                              │
│  Convert findings to prioritized tickets.       │
│  Deduplicate against known issues.              │
│  Reopen tickets whose fixes didn't hold.        │
├─────────────────────────────────────────────────┤
│  EXECUTE (30-60 min)                            │
│  Fix top-N tickets in isolated worktrees.       │
│  Code-Forge skill handles implementation.       │
│  Night window constraints apply.                │
├─────────────────────────────────────────────────┤
│  POLISH (10-90 min)                             │
│  UAT gate: fresh evaluator with no context      │
│  verifies the fix from user perspective.        │
│  Tribunal review for critical changes.          │
│  Merge or route back as new ticket.             │
├─────────────────────────────────────────────────┤
│  REGRESS (40-150 min)                           │
│  Run regression oracle against full spec        │
│  surface. Measure drift. Update metrics.        │
│  Promote patterns to durable memory.            │
│  Pause if quality gates fail.                   │
└─────────────────────────────────────────────────┘

The REGRESS phase is where the Kitchen Loop diverges most from current Sanctum operations. Fire drills run monthly and test infrastructure resilience. The regression oracle runs every iteration and tests functional correctness. Both are necessary. Fire drills answer “can the system survive failure?” The oracle answers “does the system do what it claims?”
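The six phases can be pictured as a sequential runner where each phase reads the artifacts of the phases before it and REGRESS decides whether the loop pauses. This is a simplified sketch with hypothetical handler signatures and an in-memory artifact dict; it does not describe how `run_kitchen_loop.py` is actually structured.

```python
from typing import Callable

PHASES = ("backlog", "ideate", "triage", "execute", "polish", "regress")

Artifact = dict
Handler = Callable[[dict[str, Artifact]], Artifact]


def run_cycle(handlers: dict[str, Handler]) -> dict:
    """Run the phases in order; pause if REGRESS reports any tripped gate."""
    artifacts: dict[str, Artifact] = {}
    for phase in PHASES:
        # Each handler sees every artifact produced earlier in this cycle.
        artifacts[phase] = handlers[phase](artifacts)
    paused = bool(artifacts["regress"].get("tripped_gates"))
    return {"artifacts": artifacts, "paused": paused}


def stub(name: str) -> Handler:
    # Placeholder handler that records which earlier phases it could see.
    return lambda artifacts: {"phase": name, "inputs": list(artifacts)}


handlers = {phase: stub(phase) for phase in PHASES}
# Force a canary escape in REGRESS to exercise the pause path.
handlers["regress"] = lambda artifacts: {"tripped_gates": ["canary_escape"]}

result = run_cycle(handlers)
print(result["paused"])  # True
```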

Audited Implementation

The current checked-in implementation lives in the audited workspace slice:

Specification Surface — sanctum-spec-surface.yaml defines dimensions, valid combinations, and declared scenarios.
Anti-Signal Canaries — kitchenloop-canaries.yaml declares Tier 1-3 traps and their expected reviewers and verdicts.
Heterogeneous Tribunal Model — kitchenloop-tribunal.yaml encodes proposer, challenger, grounding, blind rounds, and kill gates.
Six-Phase Runner — run_kitchen_loop.py writes backlog, ideate, triage, execute, polish, regress, and durable-memory artifacts into an isolated state directory.
Pause-Gate Proof — test-sanctum-kitchen-loop.sh proves both the healthy cycle and the forced canary-escape path.
Operator Surface — sanctumctl kitchen-loop {validate|plan|run} exposes the loop through the same checked-in CLI as the rest of the audit surface.

The remaining work is expansion, not existence: deeper L4 state deltas, broader scenario coverage, and tighter attachment to the live haushold runtime.

What This Changes

The Living Force was born from a failure at 4 AM. It built an immune system — reactive, dependency-aware, self-healing. The Kitchen Loop adds a nervous system — proactive, specification-driven, self-testing. The immune system asks “is something broken?” The nervous system asks “does everything work the way we promised?”

Together, they close the gap between “the service is running” and “the haus is actually doing what the family expects it to do.” Which is, if you think about it, the only question that matters.

References

Roy, Y. (2026). “The Kitchen Loop: User-Spec-Driven Development for a Self-Evolving Codebase.” arXiv:2603.25697
Kitchen Loop source: github.com/0xagentkitchen/kitchenloop (MIT license)
Living Force: The Living Force — Sanctum’s existing self-healing architecture
Config System: Config System — the instance.yaml single source of truth