# Council Always-Alive

Date: 2026-04-17 · Status: Active on manoir
The council is the inference substrate for every agent in the haus. When it hangs, Yoda goes mute, Cilghal stops watching your HRV, Mothma loses her eyes on the boot sequence. For most of early April 2026 the phrase “always alive” was marketing. This page is what it took to make it load-bearing.
## The three failure modes we found

One night the council accepted TCP connections, responded 200 OK on `/v1/models`, and returned nothing at all on `/v1/chat/completions` for eight straight minutes. The watchdog declared it healthy the entire time. Three separate problems, stacked:
1. **The probe was the wrong shape.** `services/council-mlx.yaml` declared `type: port` as its liveness check — a TCP listen test. `mlx_lm.server` keeps the port open long after inference has deadlocked. Every health check for months had been answering “is the lightbulb in the socket?” when the question was “is the lightbulb on?”
2. **An upstream KV-cache bug.** `mlx_lm/models/qwen3_5.py:158` throws `ValueError: [concatenate] shapes (1,3,10240), (2,67,10240), axis=1` when a cached `conv_state` from a prior `B=1` prefill collides with a `B>1` continuation. The handler catches nothing, the request hangs forever, and the process stays alive. Definitively in the server log; non-deterministic to trigger on demand.
3. **The circuit breaker ate our own recovery.** `sanctum-watchdog` halts all remediation when the root_cause count reaches 4 (`WATCHDOG_CIRCUIT_BREAKER=4`). That morning seven unrelated services were down — so even if the probe had been honest, the watchdog would have refused to act. A safety mechanism working as designed, shadowing a real outage.
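The first failure mode is easy to demonstrate. Below is a sketch of the two probe shapes against the endpoint from this page; the model name is elided exactly as in the real config, `nc` stands in for the watchdog’s TCP test, and the function names are illustrative:

```shell
# Sketch: the two probe shapes. HOST/PORT are the endpoint from this page;
# the model name is elided ("...") exactly as in the real config.
HOST=127.0.0.1 PORT=1337

# "Is the lightbulb in the socket?" — succeeds even when inference is deadlocked.
probe_port() { nc -z "$HOST" "$PORT"; }

# Success criterion for a real completion: the response body carries "content".
response_ok() { grep -q '"content"'; }

# "Is the lightbulb on?" — only succeeds if a completion actually comes back.
probe_inference() {
  curl -sf --max-time 75 -X POST "http://$HOST:$PORT/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"model":"...","messages":[{"role":"user","content":"ping"}],"max_tokens":2,"temperature":0}' \
    | response_ok
}
```

A deadlocked `mlx_lm.server` passes `probe_port` indefinitely and fails `probe_inference` within the curl timeout, which is the whole point of the L2 probe upgrade below.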
## Three layers, three independent rollbacks

### L1 — Upstream patch

`mlx_lm/models/qwen3_5.py:149` gains three lines that reset `conv_state` when its batch dimension doesn’t match the current input. Reversible in one command:
```shell
# Rollback
cp ~/Projects/mlx-finetune/.venv/lib/python3.14/site-packages/mlx_lm/models/qwen3_5.py.orig \
   ~/Projects/mlx-finetune/.venv/lib/python3.14/site-packages/mlx_lm/models/qwen3_5.py
pkill -9 -f mlx_lm.server
```

The patch file itself lives at `~/Projects/mlx-finetune/patches/mlx_lm-qwen3_5-conv_state-reset.patch` — PR-ready for upstream submission.
### L2 — Upgraded Living Force probe

`~/.sanctum/services/council-mlx.yaml` readiness and liveness both became `type: command` with a real POST to `/v1/chat/completions` and a `grep '"content"'` success check. Timeouts are tuned for `mlx_lm.server`’s single-connection reality — 75 s curl / 90 s probe wrapper / 120 s interval for liveness.
```yaml
liveness:
  type: command
  command: 'curl -sf --max-time 75 -X POST http://127.0.0.1:1337/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"...\",\"messages\":[{\"role\":\"user\",\"content\":\"ping\"}],\"max_tokens\":2,\"temperature\":0}" | grep -q "\"content\""'
  timeout: 90
  interval: 120
```

The original is backed up to `council-mlx.yaml.bak-<timestamp>` next to it. Rollback is `cp` in one direction.
### L3 — Council-guardian daemon

The critical layer, because L2 is still inside the watchdog’s circuit-breaker decision tree. The guardian runs outside that tree entirely:

```shell
~/.sanctum/scripts/council-guardian.sh                     # probe + rate-limited restart
~/Library/LaunchAgents/com.sanctum.council-guardian.plist  # StartInterval=60s
~/.sanctum/state/council-guardian.json                     # consecutive-fail + restart window
~/.openclaw/logs/council-guardian.log                      # structured JSON events
```

Every 60 seconds the guardian performs a real inference probe with a 20-second budget. Two consecutive failures within the window trigger a `launchctl kickstart -k` on the configured active agent (defaults to `com.sanctum.mlx`). Restarts are rate-limited to 3 per 300 s — the 4th attempt logs `rate_limit` and exits with code 2 for manual intervention rather than thrashing.
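The escalation logic reduces to a small state machine. This is a simplified sketch, not the real `council-guardian.sh`; the plain-text state-file format here is invented for illustration (the real daemon uses `council-guardian.json`):

```shell
# Sketch of the guardian's escalation logic. Call guardian_tick 0 after a
# healthy probe, guardian_tick 1 after a failed one.
GUARDIAN_STATE="${GUARDIAN_STATE:-/tmp/guardian-sketch.state}"
FAIL_THRESHOLD=2    # restart after 2 consecutive probe failures
MAX_RESTARTS=3      # at most 3 restarts per rolling window
WINDOW=300          # window length in seconds

guardian_tick() {
  probe_failed=$1
  now=$(date +%s)
  read -r fails restarts window_start 2>/dev/null < "$GUARDIAN_STATE" \
    || { fails=0; restarts=0; window_start=$now; }

  # A fresh window forgets old restarts
  [ $((now - window_start)) -ge "$WINDOW" ] && { restarts=0; window_start=$now; }

  if [ "$probe_failed" -eq 0 ]; then
    fails=0                                  # healthy probe clears the streak
  else
    fails=$((fails + 1))
    if [ "$fails" -ge "$FAIL_THRESHOLD" ]; then
      fails=0
      if [ "$restarts" -lt "$MAX_RESTARTS" ]; then
        restarts=$((restarts + 1))
        echo "restarting"                    # real script: launchctl kickstart -k ...
      else
        echo "rate_limit"                    # real script: alert, exit 2 for a human
      fi
    fi
  fi
  echo "$fails $restarts $window_start" > "$GUARDIAN_STATE"
}
```

Persisting the state to a file is what lets the real guardian survive its own restarts without losing the failure streak or the restart window.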
Events are newline-delimited JSON:
```json
{"ts": 1776475647.56, "level": "info", "event": "probe_ok", "probe_ms": 656, "agent": "com.sanctum.mlx"}
{"ts": 1776475708.4, "level": "warn", "event": "probe_fail", "probe_ms": 20000, "consecutive": 1, "snippet": "..."}
{"ts": 1776475769.2, "level": "warn", "event": "restarting", "action": "kickstart", "agent": "com.sanctum.mlx"}
{"ts": 1776475830.1, "level": "info", "event": "restart_requested", "rc": 0}
```

## Auto-fallback to Python — defense against unrecoverable Rust crashes
A plain restart loop is a trap when the thing being restarted will never succeed. That was the shape of a late-night outage on 2026-04-18: after a rebuild of `sanctum-mlx` left `mlx.metallib` un-colocated with the binary, every cold start hit `MLX error: Failed to load the default metallib` and died before binding `:1337`. The guardian dutifully `kickstart`-ed it. Three times. Then the rate limiter said “enough.” The alert fired. Nothing served inference. The guardian was obedient; the architecture was dumb.
The fix: pattern-aware fallback. Before hitting the alert-only rate limit, the guardian scans the recent `sanctum-mlx.log` for signatures that mean this binary cannot serve regardless of how many times you restart it — `Failed to load the default metallib`, `library not found`, `Segmentation fault`, `manifest verification failed`, `signature invalid`, `Out of memory`, `address already in use`. When any of those appear, the guardian writes a fallback lockfile, boots out the Rust agent, and bootstraps `com.sanctum.server-mlx` (the Python `mlx_lm.server`) in its place — a proven, well-tested backend that doesn’t share the Rust binary’s failure modes.
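The signature scan reduces to one `grep -E` over the tail of the log. A sketch using the pattern list above; the tail depth and function name are illustrative, not taken from the real guardian:

```shell
# Sketch of the fatal-signature scan. Returns 0 (true) when the log shows a
# crash that no amount of restarting will fix. The 200-line tail depth is
# an assumption, not the real guardian's value.
FATAL_PATTERNS='Failed to load the default metallib|library not found|Segmentation fault|manifest verification failed|signature invalid|Out of memory|address already in use'

is_unrecoverable() {
  log=$1
  tail -n 200 "$log" 2>/dev/null | grep -qE "$FATAL_PATTERNS"
}
```

Scanning only the tail matters: a metallib failure from last week’s rebuild should not condemn a binary that has been serving fine since.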
```shell
~/.sanctum/state/council-fallback.lock   # active while Python is serving
~/.openclaw/logs/council-guardian.log    # event: fallback_activated
```

The lockfile does triple duty:
- State persistence — survives guardian restarts so we don’t flip-flop.
- Probe retargeting — on the next tick the guardian probes the Python backend at `:1337` instead.
- Thrash guard — the guardian refuses to fall back from the fallback. One direction only.
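Those three duties reduce to a couple of lockfile checks. A sketch; the function names are hypothetical, the lockfile path is the one documented above:

```shell
# Sketch of lockfile-driven behavior (function names are hypothetical).
LOCKFILE="${LOCKFILE:-$HOME/.sanctum/state/council-fallback.lock}"

probe_target() {
  # Same port either way; the lockfile tells us which backend owns it.
  if [ -f "$LOCKFILE" ]; then
    echo "python:127.0.0.1:1337"
  else
    echo "rust:127.0.0.1:1337"
  fi
}

may_fall_back() {
  # Thrash guard: once we've fallen back, never fall back again.
  [ ! -f "$LOCKFILE" ]
}
```

Because the lockfile lives on disk rather than in the guardian’s memory, a guardian restart cannot forget that the fallback is active.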
### Manual recovery

Once you’ve rebuilt `sanctum-mlx` or addressed the underlying issue:
```shell
# Verify the Rust binary works standalone
/Users/neo/Projects/sanctum-rs/target/release/sanctum-mlx --help >/dev/null && echo OK
```
```shell
# Swap back to Rust
tools/gui-exec.sh neo@100.0.0.25 \
  'launchctl bootout gui/$(id -u)/com.sanctum.server-mlx; \
   launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.sanctum.mlx.plist; \
   launchctl kickstart -k gui/$(id -u)/com.sanctum.mlx'
```
```shell
# Tell the guardian the fallback is over
rm ~/.sanctum/state/council-fallback.lock
```

### Known fatal patterns (the ones that trigger fallback)
| Pattern | Usual cause | Fix |
|---|---|---|
| `Failed to load the default metallib` | `mlx.metallib` not colocated with the binary after rebuild | `cp target/release/build/mlx-sys-*/out/build/lib/mlx.metallib target/release/mlx.metallib` |
| `manifest verification failed` / `signature invalid` | Weights modified or manifest stale | Re-run `tools/sign-manifest.sh` or re-download weights |
| `address already in use` | Previous `sanctum-mlx` process didn’t release port 1337 | `lsof -iTCP:1337 -sTCP:LISTEN`, then `kill -9` the squatter |
| `Out of memory` | Another process ate VRAM/RAM while MLX was starting | Free memory, kickstart |
| `Segmentation fault` | Usually an MLX bug or corrupt weights | Roll back the `cargo build`; investigate |
## Skip-if-busy — making it safe during evals

A real-inference probe that runs every 60 seconds would false-positive during any long-running Carmack Olympics burst on the same endpoint. The guardian reads both backends’ access logs and defers when either has emitted a successful response in the last 90 seconds:
- Python `mlx_lm.server` → `~/.openclaw/logs/sanctum-server.err`, pattern `POST /v1/chat/completions HTTP/1.1" 200`
- Rust `sanctum-mlx` → `~/.openclaw/logs/sanctum-mlx.log`, pattern `request completed`
Busy ≠ hung. The guardian knows the difference.
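The deferral test can be sketched as below. This simplified version uses the log file’s mtime as the recency signal; the real guardian may instead parse timestamps out of the matching lines, and the function name is illustrative:

```shell
# Sketch of the skip-if-busy check. Succeeds when the log both contains a
# success pattern and was written to within the busy window.
BUSY_WINDOW=90   # seconds

recently_busy() {
  log=$1 pattern=$2
  [ -f "$log" ] || return 1
  grep -q "$pattern" "$log" || return 1
  # mtime via GNU stat, falling back to BSD stat on macOS
  mtime=$(stat -c %Y "$log" 2>/dev/null || stat -f %m "$log")
  [ $(( $(date +%s) - mtime )) -lt "$BUSY_WINDOW" ]
}
```

If `recently_busy` succeeds for either backend’s log, the guardian can log a skip event and exit without escalating, which is exactly the behavior T3 in the verification section checks for.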
## Activating the LaunchAgent

The plist is already in `~/Library/LaunchAgents/`. SSH sessions can’t bootstrap it into the user GUI domain on macOS — run this once from a GUI terminal on manoir:

```shell
launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.sanctum.council-guardian.plist
launchctl kickstart -k gui/$(id -u)/com.sanctum.council-guardian
launchctl list | grep council-guardian
```

## Verification
```shell
# T1 — happy path
/Users/neo/.sanctum/scripts/council-guardian.sh
tail -1 ~/.openclaw/logs/council-guardian.log   # expect {"event":"probe_ok",...}
```
```shell
# T2 — induced failure, observe auto-restart
pkill -9 -f mlx_lm.server
for _ in 1 2; do /Users/neo/.sanctum/scripts/council-guardian.sh; done
grep restarting ~/.openclaw/logs/council-guardian.log
pgrep -af mlx_lm.server   # should show a fresh PID within ~5s
```
```shell
# T3 — skip-if-busy (during active eval)
# With an eval running against :1337, run the guardian manually — it should log
# probe_skip_busy with recent_success_age_s < 90 instead of escalating.
/Users/neo/.sanctum/scripts/council-guardian.sh
grep probe_skip_busy ~/.openclaw/logs/council-guardian.log | tail -1
```

## Known side-finding worth flagging
`~/.sanctum/instance.yaml` has `services.council_mlx.adapter_path: null`. The currently-served model is therefore bare Qwen3.5-27B-4bit with no council LoRA adapter applied. That explains the `carmack_v2_production.json` overall score sitting at 0.27. If the council adapter is meant to be active in production, set `adapter_path` in `instance.yaml` and kick `sanctum-server-dynamic`.
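A pre-flight check could catch this condition before it shows up in eval scores. A sketch, not an existing script; it assumes the key appears as a literal `adapter_path: null` line in the YAML, so adjust the grep if the file nests it differently:

```shell
# Sketch: warn when the council is serving the bare base model.
# Assumes a literal "adapter_path: null" line in the config file.
check_adapter() {
  if grep -q 'adapter_path: null' "$1"; then
    echo "WARNING: council serving base model, no LoRA adapter"
    return 1
  fi
  echo "adapter_path is set"
}
```

Usage: `check_adapter ~/.sanctum/instance.yaml` in whatever pre-eval checklist runs before a Carmack Olympics burst.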
## Related

- The Living Force — the watchdog and its 10 principles
- Engineering Discipline — test coverage philosophy
- Eval Harness — Apple-Grade — how the harness that consumes this endpoint stays honest