# Council Always-Alive

Date: 2026-04-17 · Status: Active on manoir
The council is the inference substrate for every agent in the haus. When it hangs, Yoda goes mute, Cilghal stops watching your HRV, Mothma loses her eyes on the boot sequence. For most of early April 2026 the phrase “always alive” was marketing. This page is what it took to make it load-bearing.
## The three failure modes we found

One night the council accepted TCP connections, responded 200 OK on `/v1/models`, and returned nothing at all on `/v1/chat/completions` for eight straight minutes. The watchdog declared it healthy the entire time. Three separate problems, stacked:
1. **The probe was the wrong shape.** `services/council-mlx.yaml` declared `type: port` as its liveness check — a TCP listen test. `mlx_lm.server` keeps the port open long after inference has deadlocked. Every health check for months had been answering “is the lightbulb in the socket?” when the question was “is the lightbulb on?”
2. **An upstream KV-cache bug.** `mlx_lm/models/qwen3_5.py:158` throws `ValueError: [concatenate] shapes (1,3,10240), (2,67,10240), axis=1` when a cached `conv_state` from a prior `B=1` prefill collides with a `B>1` continuation. The handler catches nothing, the request hangs forever, and the process stays alive. Definitively in the server log; non-deterministic to trigger on demand.
3. **The circuit breaker ate our own recovery.** `sanctum-watchdog` halts all remediation when the root_cause count reaches 4 (`WATCHDOG_CIRCUIT_BREAKER=4`). That morning seven unrelated services were down — so even if the probe had been honest, the watchdog would have refused to act. A safety mechanism working as designed, shadowing a real outage.
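The first failure mode is easy to demonstrate. Below is a sketch of the two probe shapes against the endpoint from this page; the model name is elided exactly as in the real config, `nc` stands in for the watchdog’s TCP test, and the function names are illustrative:

```shell
# Sketch: the two probe shapes. HOST/PORT are the endpoint from this page;
# the model name is elided ("...") exactly as in the real config.
HOST=127.0.0.1 PORT=1337

# "Is the lightbulb in the socket?" — succeeds even when inference is deadlocked.
probe_port() { nc -z "$HOST" "$PORT"; }

# Success criterion for a real completion: the response body carries "content".
response_ok() { grep -q '"content"'; }

# "Is the lightbulb on?" — only succeeds if a completion actually comes back.
probe_inference() {
  curl -sf --max-time 75 -X POST "http://$HOST:$PORT/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"model":"...","messages":[{"role":"user","content":"ping"}],"max_tokens":2,"temperature":0}' \
    | response_ok
}
```

A deadlocked `mlx_lm.server` passes `probe_port` indefinitely and fails `probe_inference` within the curl timeout, which is the whole point of the L2 probe upgrade below.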
## Three layers, three independent rollbacks

### L1 — Upstream patch

`mlx_lm/models/qwen3_5.py:149` gains three lines that reset `conv_state` when its batch dimension doesn’t match the current input. Reversible in one command:
```shell
# Rollback
cp ~/Projects/mlx-finetune/.venv/lib/python3.14/site-packages/mlx_lm/models/qwen3_5.py.orig \
   ~/Projects/mlx-finetune/.venv/lib/python3.14/site-packages/mlx_lm/models/qwen3_5.py
pkill -9 -f mlx_lm.server
```

The patch file itself lives at `~/Projects/mlx-finetune/patches/mlx_lm-qwen3_5-conv_state-reset.patch` — PR-ready for upstream submission.
### L2 — Upgraded Living Force probe

`~/.sanctum/services/council-mlx.yaml` readiness and liveness both became `type: command` with a real POST to `/v1/chat/completions` and a `grep '"content"'` success check. Timeouts are tuned for `mlx_lm.server`’s single-connection reality — 75 s curl / 90 s probe wrapper / 120 s interval for liveness.
```yaml
liveness:
  type: command
  command: 'curl -sf --max-time 75 -X POST http://127.0.0.1:1337/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"...\",\"messages\":[{\"role\":\"user\",\"content\":\"ping\"}],\"max_tokens\":2,\"temperature\":0}" | grep -q "\"content\""'
  timeout: 90
  interval: 120
```

The original is backed up to `council-mlx.yaml.bak-<timestamp>` next to it. Rollback is `cp` in one direction.
### L3 — Council-guardian daemon

The critical layer, because L2 is still inside the watchdog’s circuit-breaker decision tree. The guardian runs outside that tree entirely:

```shell
~/.sanctum/scripts/council-guardian.sh                     # probe + rate-limited restart
~/Library/LaunchAgents/com.sanctum.council-guardian.plist  # StartInterval=60s
~/.sanctum/state/council-guardian.json                     # consecutive-fail + restart window
~/.openclaw/logs/council-guardian.log                      # structured JSON events
```

Every 60 seconds the guardian performs a real inference probe with a 20-second budget. Two consecutive failures within the window trigger a `launchctl kickstart -k` on the configured active agent (defaults to `com.sanctum.mlx`). Restarts are rate-limited to 3 per 300 s — the 4th attempt logs `rate_limit` and exits with code 2 for manual intervention rather than thrashing.
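The escalation logic reduces to a small state machine. This is a simplified sketch, not the real `council-guardian.sh`; the plain-text state-file format here is invented for illustration (the real daemon uses `council-guardian.json`):

```shell
# Sketch of the guardian's escalation logic. Call guardian_tick 0 after a
# healthy probe, guardian_tick 1 after a failed one.
GUARDIAN_STATE="${GUARDIAN_STATE:-/tmp/guardian-sketch.state}"
FAIL_THRESHOLD=2    # restart after 2 consecutive probe failures
MAX_RESTARTS=3      # at most 3 restarts per rolling window
WINDOW=300          # window length in seconds

guardian_tick() {
  probe_failed=$1
  now=$(date +%s)
  read -r fails restarts window_start 2>/dev/null < "$GUARDIAN_STATE" \
    || { fails=0; restarts=0; window_start=$now; }

  # A fresh window forgets old restarts
  [ $((now - window_start)) -ge "$WINDOW" ] && { restarts=0; window_start=$now; }

  if [ "$probe_failed" -eq 0 ]; then
    fails=0                                  # healthy probe clears the streak
  else
    fails=$((fails + 1))
    if [ "$fails" -ge "$FAIL_THRESHOLD" ]; then
      fails=0
      if [ "$restarts" -lt "$MAX_RESTARTS" ]; then
        restarts=$((restarts + 1))
        echo "restarting"                    # real script: launchctl kickstart -k ...
      else
        echo "rate_limit"                    # real script: alert, exit 2 for a human
      fi
    fi
  fi
  echo "$fails $restarts $window_start" > "$GUARDIAN_STATE"
}
```

Persisting the state to a file is what lets the real guardian survive its own restarts without losing the failure streak or the restart window.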
Events are newline-delimited JSON:
```json
{"ts": 1776475647.56, "level": "info", "event": "probe_ok", "probe_ms": 656, "agent": "com.sanctum.mlx"}
{"ts": 1776475708.4, "level": "warn", "event": "probe_fail", "probe_ms": 20000, "consecutive": 1, "snippet": "..."}
{"ts": 1776475769.2, "level": "warn", "event": "restarting", "action": "kickstart", "agent": "com.sanctum.mlx"}
{"ts": 1776475830.1, "level": "info", "event": "restart_requested", "rc": 0}
```

## Auto-fallback to Python — defense against unrecoverable Rust crashes
A plain restart loop is a trap when the thing being restarted will never succeed. That was the shape of a late-night outage on 2026-04-18: after a rebuild of `sanctum-mlx` left `mlx.metallib` un-colocated with the binary, every cold start hit `MLX error: Failed to load the default metallib` and died before binding `:1337`. The guardian dutifully `kickstart`-ed it. Three times. Then the rate limiter said “enough.” The alert fired. Nothing served inference. The guardian was obedient; the architecture was dumb.
The fix: pattern-aware fallback. Before hitting the alert-only rate limit, the guardian scans the recent `sanctum-mlx.log` for signatures that mean this binary cannot serve regardless of how many times you restart it — `Failed to load the default metallib`, `library not found`, `Segmentation fault`, `manifest verification failed`, `signature invalid`, `Out of memory`, `address already in use`. When any of those appear, the guardian writes a fallback lockfile, boots out the Rust agent, and bootstraps `com.sanctum.server-mlx` (the Python `mlx_lm.server`) in its place — a proven, well-tested backend that doesn’t share the Rust binary’s failure modes.
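The signature scan reduces to one `grep -E` over the tail of the log. A sketch using the pattern list above; the tail depth and function name are illustrative, not taken from the real guardian:

```shell
# Sketch of the fatal-signature scan. Returns 0 (true) when the log shows a
# crash that no amount of restarting will fix. The 200-line tail depth is
# an assumption, not the real guardian's value.
FATAL_PATTERNS='Failed to load the default metallib|library not found|Segmentation fault|manifest verification failed|signature invalid|Out of memory|address already in use'

is_unrecoverable() {
  log=$1
  tail -n 200 "$log" 2>/dev/null | grep -qE "$FATAL_PATTERNS"
}
```

Scanning only the tail matters: a metallib failure from last week’s rebuild should not condemn a binary that has been serving fine since.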
```shell
~/.sanctum/state/council-fallback.lock   # active while Python is serving
~/.openclaw/logs/council-guardian.log    # event: fallback_activated
```

The lockfile does triple duty:
- State persistence — survives guardian restarts so we don’t flip-flop.
- Probe retargeting — on the next tick the guardian probes the Python backend at `:1337` instead.
- Thrash guard — the guardian refuses to fall back from the fallback. One direction only.
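Those three duties reduce to a couple of lockfile checks. A sketch; the function names are hypothetical, the lockfile path is the one documented above:

```shell
# Sketch of lockfile-driven behavior (function names are hypothetical).
LOCKFILE="${LOCKFILE:-$HOME/.sanctum/state/council-fallback.lock}"

probe_target() {
  # Same port either way; the lockfile tells us which backend owns it.
  if [ -f "$LOCKFILE" ]; then
    echo "python:127.0.0.1:1337"
  else
    echo "rust:127.0.0.1:1337"
  fi
}

may_fall_back() {
  # Thrash guard: once we've fallen back, never fall back again.
  [ ! -f "$LOCKFILE" ]
}
```

Because the lockfile lives on disk rather than in the guardian’s memory, a guardian restart cannot forget that the fallback is active.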
### Manual recovery

Once you’ve rebuilt `sanctum-mlx` or addressed the underlying issue:
```shell
# Verify the Rust binary works standalone
/Users/neo/Projects/sanctum-rs/target/release/sanctum-mlx --help >/dev/null && echo OK
```
```shell
# Swap back to Rust
tools/gui-exec.sh neo@100.0.0.25 \
  'launchctl bootout gui/$(id -u)/com.sanctum.server-mlx; \
   launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.sanctum.mlx.plist; \
   launchctl kickstart -k gui/$(id -u)/com.sanctum.mlx'
```
```shell
# Tell the guardian the fallback is over
rm ~/.sanctum/state/council-fallback.lock
```

### Known fatal patterns (the ones that trigger fallback)
| Pattern | Usual cause | Fix |
|---|---|---|
| `Failed to load the default metallib` | `mlx.metallib` not colocated with the binary after rebuild | `cp target/release/build/mlx-sys-*/out/build/lib/mlx.metallib target/release/mlx.metallib` |
| `manifest verification failed` / `signature invalid` | Weights modified or manifest stale | Re-run `tools/sign-manifest.sh` or re-download weights |
| `address already in use` | Previous `sanctum-mlx` process didn’t release port 1337 | `lsof -iTCP:1337 -sTCP:LISTEN`, then `kill -9` the squatter |
| `Out of memory` | Another process ate VRAM/RAM while MLX was starting | Free memory, kickstart |
| `Segmentation fault` | Usually an MLX bug or corrupt weights | Roll back the `cargo build`; investigate |
## Skip-if-busy — making it safe during evals

A real-inference probe that runs every 60 seconds would false-positive during any long-running Carmack Olympics burst on the same endpoint. The guardian reads both backends’ access logs and defers when either has emitted a successful response in the last 90 seconds:
- Python `mlx_lm.server` → `~/.openclaw/logs/sanctum-server.err`, pattern `POST /v1/chat/completions HTTP/1.1" 200`
- Rust `sanctum-mlx` → `~/.openclaw/logs/sanctum-mlx.log`, pattern `request completed`
Busy ≠ hung. The guardian knows the difference.
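The deferral test can be sketched as below. This simplified version uses the log file’s mtime as the recency signal; the real guardian may instead parse timestamps out of the matching lines, and the function name is illustrative:

```shell
# Sketch of the skip-if-busy check. Succeeds when the log both contains a
# success pattern and was written to within the busy window.
BUSY_WINDOW=90   # seconds

recently_busy() {
  log=$1 pattern=$2
  [ -f "$log" ] || return 1
  grep -q "$pattern" "$log" || return 1
  # mtime via GNU stat, falling back to BSD stat on macOS
  mtime=$(stat -c %Y "$log" 2>/dev/null || stat -f %m "$log")
  [ $(( $(date +%s) - mtime )) -lt "$BUSY_WINDOW" ]
}
```

If `recently_busy` succeeds for either backend’s log, the guardian can log a skip event and exit without escalating, which is exactly the behavior T3 in the verification section checks for.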
## Activating the LaunchAgent

The plist is already in `~/Library/LaunchAgents/`. SSH sessions can’t bootstrap it into the user GUI domain on macOS — run this once from a GUI terminal on manoir:

```shell
launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.sanctum.council-guardian.plist
launchctl kickstart -k gui/$(id -u)/com.sanctum.council-guardian
launchctl list | grep council-guardian
```

## Verification
```shell
# T1 — happy path
/Users/neo/.sanctum/scripts/council-guardian.sh
tail -1 ~/.openclaw/logs/council-guardian.log   # expect {"event":"probe_ok",...}
```
```shell
# T2 — induced failure, observe auto-restart
pkill -9 -f mlx_lm.server
for _ in 1 2; do /Users/neo/.sanctum/scripts/council-guardian.sh; done
grep restarting ~/.openclaw/logs/council-guardian.log
pgrep -af mlx_lm.server   # should show a fresh PID within ~5s
```
```shell
# T3 — skip-if-busy (during active eval)
# With an eval running against :1337, run the guardian manually — it should log
# probe_skip_busy with recent_success_age_s < 90 instead of escalating.
/Users/neo/.sanctum/scripts/council-guardian.sh
grep probe_skip_busy ~/.openclaw/logs/council-guardian.log | tail -1
```

## Known side-finding worth flagging
`~/.sanctum/instance.yaml` has `services.council_mlx.adapter_path: null`. The currently-served model is therefore bare Qwen3.5-27B-4bit with no council LoRA adapter applied. That explains the `carmack_v2_production.json` overall score sitting at 0.27. If the council adapter is meant to be active in production, set `adapter_path` in `instance.yaml` and kick `sanctum-server-dynamic`.
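A pre-flight check could catch this condition before it shows up in eval scores. A sketch, not an existing script; it assumes the key appears as a literal `adapter_path: null` line in the YAML, so adjust the grep if the file nests it differently:

```shell
# Sketch: warn when the council is serving the bare base model.
# Assumes a literal "adapter_path: null" line in the config file.
check_adapter() {
  if grep -q 'adapter_path: null' "$1"; then
    echo "WARNING: council serving base model, no LoRA adapter"
    return 1
  fi
  echo "adapter_path is set"
}
```

Usage: `check_adapter ~/.sanctum/instance.yaml` in whatever pre-eval checklist runs before a Carmack Olympics burst.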
## Related

- The Living Force — the watchdog and its 10 principles
- Engineering Discipline — test coverage philosophy
- Eval Harness — Apple-Grade — how the harness that consumes this endpoint stays honest