
Council Always-Alive

Date: 2026-04-17
Status: Active on manoir

The council is the inference substrate for every agent in the haus. When it hangs, Yoda goes mute, Cilghal stops watching your HRV, Mothma loses her eyes on the boot sequence. For most of early April 2026 the phrase “always alive” was marketing. This page is what it took to make it load-bearing.

One night the council accepted TCP connections, responded 200 OK on /v1/models, and returned nothing at all on /v1/chat/completions for eight straight minutes. The watchdog declared it healthy the entire time. Three separate problems, stacked:

  1. The probe was the wrong shape. services/council-mlx.yaml declared type: port as its liveness check — a TCP listen test. mlx_lm.server keeps the port open long after inference has deadlocked. Every health check for months had been answering “is the lightbulb in the socket?” when the question was “is the lightbulb on?”

  2. An upstream KV-cache bug in mlx_lm. models/qwen3_5.py:158 throws ValueError: [concatenate] shapes (1,3,10240), (2,67,10240), axis=1 when a cached conv_state from a prior B=1 prefill collides with a B>1 continuation. The handler catches nothing, the request hangs forever, and the process stays alive. The traceback is definitive in the server log, but the crash is non-deterministic and hard to trigger on demand.

  3. The circuit breaker ate our own recovery. sanctum-watchdog halts all remediation when root_cause count ≥ 4 (WATCHDOG_CIRCUIT_BREAKER=4). That morning seven unrelated services were down — so even if the probe had been honest, the watchdog would have refused to act. A safety mechanism working as designed, shadowing a real outage.
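The breaker in problem 3 comes down to one comparison. An illustrative sketch (not the actual sanctum-watchdog source):

```shell
# Illustrative sketch of the circuit-breaker gate, not the real
# sanctum-watchdog code. Remediation halts once the count of distinct
# root causes reaches the threshold.
WATCHDOG_CIRCUIT_BREAKER="${WATCHDOG_CIRCUIT_BREAKER:-4}"
root_cause_count=7   # that morning: seven unrelated services down

if [ "$root_cause_count" -ge "$WATCHDOG_CIRCUIT_BREAKER" ]; then
  echo "circuit breaker open: all remediation halted"
else
  echo "remediation allowed"
fi
```

With seven root causes against a threshold of four, remediation halts even when an individual probe is honest.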

The patch: mlx_lm/models/qwen3_5.py:149 gains three lines that reset conv_state when its batch dimension doesn't match the current input. It's reversible in one command:

```shell
# Rollback
cp ~/Projects/mlx-finetune/.venv/lib/python3.14/site-packages/mlx_lm/models/qwen3_5.py.orig \
   ~/Projects/mlx-finetune/.venv/lib/python3.14/site-packages/mlx_lm/models/qwen3_5.py
pkill -9 -f mlx_lm.server
```

The patch file itself lives at ~/Projects/mlx-finetune/patches/mlx_lm-qwen3_5-conv_state-reset.patch — PR-ready for upstream submission.

In ~/.sanctum/services/council-mlx.yaml, both the readiness and liveness checks became type: command, with a real POST to /v1/chat/completions and a grep '"content"' success check. Timeouts are tuned for mlx_lm.server's single-connection reality: 75 s for curl, 90 s for the probe wrapper, and a 120 s liveness interval.

```yaml
liveness:
  type: command
  command: 'curl -sf --max-time 75 -X POST http://127.0.0.1:1337/v1/chat/completions
    -H "Content-Type: application/json"
    -d "{\"model\":\"...\",\"messages\":[{\"role\":\"user\",\"content\":\"ping\"}],\"max_tokens\":2,\"temperature\":0}"
    | grep -q "\"content\""'
  timeout: 90
  interval: 120
```

The original is backed up to council-mlx.yaml.bak-<timestamp> next to it. Rollback is cp in one direction.

The guardian is the critical layer: the honest probe (layer two) still sits inside the watchdog's circuit-breaker decision tree, while the guardian runs outside that tree entirely:

```
~/.sanctum/scripts/council-guardian.sh                     # probe + rate-limited restart
~/Library/LaunchAgents/com.sanctum.council-guardian.plist  # StartInterval=60s
~/.sanctum/state/council-guardian.json                     # consecutive-fail + restart window
~/.openclaw/logs/council-guardian.log                      # structured JSON events
```

Every 60 seconds the guardian performs a real inference probe with a 20-second budget. Two consecutive failures within the window trigger a launchctl kickstart -k on the configured active agent (defaults to com.sanctum.mlx). Rate-limited to 3 restarts per 300s — the 4th attempt logs rate_limit and exits with code 2 for manual intervention rather than thrashing.
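The 3-per-300 s budget reduces to counting restart timestamps inside a sliding window. A minimal sketch, assuming state is one epoch-seconds timestamp per line (the real guardian keeps richer JSON state in council-guardian.json):

```shell
# Sketch of the guardian's restart budget: at most 3 restarts per 300 s.
# Assumes $1 is a file with one epoch-seconds timestamp per line; the
# real council-guardian.sh stores JSON state instead.
allow_restart() {
  now=$(date +%s)
  recent=$(awk -v cut="$((now - 300))" '$1 >= cut { n++ } END { print n + 0 }' "$1")
  [ "$recent" -lt 3 ]   # the 4th attempt inside the window is refused
}

record_restart() {
  date +%s >> "$1"
}
```

On refusal the real guardian logs a rate_limit event and exits 2 instead of restarting again.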

Events are newline-delimited JSON:

```json
{"ts": 1776475647.56, "level": "info", "event": "probe_ok", "probe_ms": 656, "agent": "com.sanctum.mlx"}
{"ts": 1776475708.4, "level": "warn", "event": "probe_fail", "probe_ms": 20000, "consecutive": 1, "snippet": "..."}
{"ts": 1776475769.2, "level": "warn", "event": "restarting", "action": "kickstart", "agent": "com.sanctum.mlx"}
{"ts": 1776475830.1, "level": "info", "event": "restart_requested", "rc": 0}
```

Auto-fallback to Python — defense against unrecoverable Rust crashes


A plain restart loop is a trap when the thing being restarted will never succeed. That was the shape of a late-night outage on 2026-04-18: after a rebuild of sanctum-mlx left mlx.metallib un-colocated with the binary, every cold start hit MLX error: Failed to load the default metallib and died before binding :1337. The guardian dutifully kickstart-ed it. Three times. Then the rate limiter said “enough.” The alert fired. Nothing served inference. The guardian was obedient; the architecture was dumb.

The fix: pattern-aware fallback. Before hitting the alert-only rate limit, the guardian scans the recent sanctum-mlx.log for signatures that mean this binary cannot serve regardless of how many times you restart it: Failed to load the default metallib, library not found, Segmentation fault, manifest verification failed, signature invalid, Out of memory, address already in use. When any of those appear, the guardian writes a fallback lockfile, boots out the Rust agent, and bootstraps com.sanctum.server-mlx (the Python mlx_lm.server) in its place — a proven, well-tested backend that doesn’t share the Rust binary’s failure modes.
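That scan is essentially a grep over the log tail. A minimal sketch, with illustrative function and variable names (the actual logic lives in council-guardian.sh):

```shell
# Sketch: detect log signatures that a restart cannot fix.
# Illustrative only; the real scan lives in council-guardian.sh.
FATAL_PATTERNS='Failed to load the default metallib|library not found|Segmentation fault|manifest verification failed|signature invalid|Out of memory|address already in use'

has_fatal_signature() {
  # $1 = path to a recent slice of sanctum-mlx.log
  tail -n 200 "$1" | grep -Eq "$FATAL_PATTERNS"
}
```

A match means "fall back to Python now"; no match means the usual kickstart path is still worth trying.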

```
~/.sanctum/state/council-fallback.lock  # active while Python is serving
~/.openclaw/logs/council-guardian.log   # event: fallback_activated
```

The lockfile does triple duty:

  1. State persistence — survives guardian restarts so we don’t flip-flop.
  2. Probe retargeting — on next tick the guardian probes the Python backend at :1337 instead.
  3. Thrash guard — the guardian refuses to fall back from the fallback. One direction only.
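A minimal sketch of the latch, with illustrative names (the lockfile path matches the file listed above):

```shell
# Sketch of the one-way fallback latch; illustrative, not the real
# council-guardian.sh. While the lockfile exists, the guardian probes
# the Python backend and refuses to fall back any further.
LOCK="${LOCK:-$HOME/.sanctum/state/council-fallback.lock}"

probe_target() {
  if [ -f "$LOCK" ]; then
    echo com.sanctum.server-mlx   # Python fallback is serving :1337
  else
    echo com.sanctum.mlx          # normal Rust agent
  fi
}
```

Because the decision is re-derived from the lockfile on every tick, a guardian restart cannot forget that the fallback is active.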

Once you’ve rebuilt sanctum-mlx or addressed the underlying issue:

```shell
# Verify the Rust binary works standalone
/Users/neo/Projects/sanctum-rs/target/release/sanctum-mlx --help >/dev/null && echo OK
# Swap back to Rust
tools/gui-exec.sh neo@100.0.0.25 \
  'launchctl bootout gui/$(id -u)/com.sanctum.server-mlx; \
   launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.sanctum.mlx.plist; \
   launchctl kickstart -k gui/$(id -u)/com.sanctum.mlx'
# Tell the guardian the fallback is over
rm ~/.sanctum/state/council-fallback.lock
```

Known fatal patterns (the ones that trigger fallback)

| Pattern | Usual cause | Fix |
| --- | --- | --- |
| Failed to load the default metallib | mlx.metallib not colocated with the binary after rebuild | cp target/release/build/mlx-sys-*/out/build/lib/mlx.metallib target/release/mlx.metallib |
| manifest verification failed / signature invalid | Weights modified or manifest stale | Re-run tools/sign-manifest.sh or re-download weights |
| address already in use | Previous sanctum-mlx process didn’t release port 1337 | lsof -iTCP:1337 -sTCP:LISTEN, then kill -9 the squatter |
| Out of memory | Another process ate VRAM/RAM while MLX was starting | Free memory, kickstart |
| Segmentation fault | Usually MLX bug or corrupt weights | Rollback cargo build; investigate |

Skip-if-busy — making it safe during evals


A real-inference probe that runs every 60 seconds would false-positive during any long-running Carmack Olympics burst on the same endpoint. The guardian reads both backends’ access logs and defers when either has emitted a successful response in the last 90 seconds:

  • Python mlx_lm.server: ~/.openclaw/logs/sanctum-server.err, pattern POST /v1/chat/completions HTTP/1.1" 200
  • Rust sanctum-mlx: ~/.openclaw/logs/sanctum-mlx.log, pattern request completed

Busy ≠ hung. The guardian knows the difference.
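A sketch of the deferral check, under the simplifying assumption that each access-log line begins with an epoch-seconds timestamp; the real guardian parses each backend's own log format:

```shell
# Sketch of skip-if-busy; illustrative, not the real guardian.
# Assumes log lines start with epoch seconds, e.g.:
#   1776475647 POST /v1/chat/completions HTTP/1.1" 200
recently_served() {
  log="$1" pattern="$2"
  last=$(grep -F "$pattern" "$log" 2>/dev/null | tail -1 | awk '{ print $1 }')
  # Defer the probe if a matching success is younger than 90 s
  [ -n "$last" ] && [ "$(( $(date +%s) - last ))" -lt 90 ]
}
```

When recently_served returns success for either backend, the guardian logs probe_skip_busy and skips the tick rather than racing a live eval for the single connection.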

The plist is already in ~/Library/LaunchAgents/. SSH sessions can’t bootstrap it into the user GUI domain on macOS — run this once from a GUI terminal on manoir:

```shell
launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.sanctum.council-guardian.plist
launchctl kickstart -k gui/$(id -u)/com.sanctum.council-guardian
launchctl list | grep council-guardian
```

```shell
# T1 — happy path
/Users/neo/.sanctum/scripts/council-guardian.sh
tail -1 ~/.openclaw/logs/council-guardian.log   # expect {"event":"probe_ok",...}

# T2 — induced failure, observe auto-restart
pkill -9 -f mlx_lm.server
for _ in 1 2; do /Users/neo/.sanctum/scripts/council-guardian.sh; done
grep restarting ~/.openclaw/logs/council-guardian.log
pgrep -af mlx_lm.server   # should show a fresh PID within ~5s

# T3 — skip-if-busy (during active eval)
# With an eval running against :1337, run the guardian manually — it should log
# probe_skip_busy with recent_success_age_s < 90 instead of escalating.
/Users/neo/.sanctum/scripts/council-guardian.sh
grep probe_skip_busy ~/.openclaw/logs/council-guardian.log | tail -1
```

~/.sanctum/instance.yaml has services.council_mlx.adapter_path: null. The currently-served model is therefore bare Qwen3.5-27B-4bit with no council LoRA adapter applied. That explains the carmack_v2_production.json overall score sitting at 0.27. If the council adapter is meant to be active in production, set adapter_path in instance.yaml and kick sanctum-server-dynamic.
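Re-enabling it would look roughly like this fragment; the adapter path shown is a placeholder, not a real location:

```yaml
# ~/.sanctum/instance.yaml (illustrative fragment; path is a placeholder)
services:
  council_mlx:
    adapter_path: /path/to/council-lora-adapter   # was: null
```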