2026-04-17: The Full-Stack Health Sweep
A parallel session had fixed a narrow Claude CLI proxy routing issue and closed the ticket. The health sweep started as a quick confirmation that the rest of the stack was fine. It was not. The supposed ten-minute spot-check turned up nine components degraded or missing and took three sessions across two calendar days to clear. The Q2 catalog rename had landed three weeks earlier. It had also quietly broken things nobody had looked at since.
Apr 17 — The Sweep Itself
Components Checked
Twenty-eight components were walked end to end. Twelve of them had something meaningfully wrong.
| Component | Status | Notes |
|---|---|---|
| Navigator Sidecar (:3344) | FAIL | Process not started; no monitor-status.json for any project |
| Holocron UI (:3333) | DOWN | Not running |
| Command Center (:1111) | PASS | Serving HTML |
| Health Center (:2222) | PASS process / WARN data | /health returns 502 because health-tunnel is down |
| OBLITERATUS (:7860) | DOWN | remedy_venv.sh does not exist in OBLITERATUS directory |
| sanctum-watchdog (:2187) | PASS | Reporting overall: degraded with 9 root causes |
| sanctum-proxy (:4040) | PASS | Health endpoint responds correctly |
| council-mlx (:1337) | PASS | Running |
| tommy (:3355) | PASS | Dawn + dusk briefings sent successfully |
| xtts-server (:8008) | PASS process | Running via homebrew python3.12; LaunchAgent symlink was broken |
| health-tunnel (:18095) | DOWN | SSH tunnel to VM not established |
| ha-tunnel (:18092) | DOWN | SSH tunnel to VM not established |
| graphiti-server (:31416) | DOWN | VM-hosted service, VM SSH unreachable |
| network-control (:4007) | DOWN | VM-hosted service, VM SSH unreachable |
| signal-proxy | DOWN | VM unreachable via SSH |
| anthropic-proxy | DOWN | VM unreachable via SSH |
| VM (openclaw SSH) | UNREACHABLE | ssh openclaw times out; local qemu-system-aarch64 is running |
The rest (tommy, sonos-bridge, voice-agent, lmstudio, memory-vault, home-assistant, kiwix, rewind-dashboard, health-ingester, sanctumctl.py, the sanctum-rs binary, living-force.mdx) were green.
Root Causes Found and Fixed
Eight independent bugs, each with its own small story. Most were downstream of a single architectural event: the Q2 catalog rename (285e817) had updated instance.yaml service keys (xtts → xtts_server, gateway → openclaw_gateway, mlx_server → council_mlx), but nothing around those keys had been re-synced since.
1. Runtime manifests stale. render_runtime_services.py had not been re-run after the Q2 rename. Three manifests were showing DIFF against their source: council-mlx.yaml, xtts-server.yaml, openclaw-gateway.yaml. Running the renderer produced 33 manifests and cleared all diffs.
2. sync_runtime_calibration.py SERVICE_MAP drift. com.sanctum.xtts-server.plist was still mapped to service key "xtts" in the SERVICE_MAP constant, but instance.yaml now used xtts_server. The enabled() check returned False, so the plist was never rendered. Changed "com.sanctum.xtts-server.plist": "xtts" to "xtts_server" in tools/sync_runtime_calibration.py. Re-running the tool created the plist and cleared the launchagent audit.
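The failure mode is easy to reproduce in miniature. A sketch — SERVICE_MAP and enabled() are named in the real tool, but the shapes below are hypothetical stand-ins for the plist map and instance.yaml services:

```python
# Hypothetical reproduction of the SERVICE_MAP drift: the plist map still
# uses the pre-rename key, so the lookup misses and the plist is skipped.
SERVICE_MAP = {"com.sanctum.xtts-server.plist": "xtts"}   # stale, pre-rename
instance_services = {"xtts_server": {"enabled": True}}    # post-rename key

def enabled(plist_name: str) -> bool:
    """True only if the mapped service key exists and is enabled."""
    svc = instance_services.get(SERVICE_MAP.get(plist_name, ""))
    return bool(svc and svc.get("enabled"))

print(enabled("com.sanctum.xtts-server.plist"))  # → False: plist silently skipped

SERVICE_MAP["com.sanctum.xtts-server.plist"] = "xtts_server"  # the fix
print(enabled("com.sanctum.xtts-server.plist"))  # → True: plist gets rendered
```

The nasty part is that nothing errors: a missed lookup reads exactly like a deliberately disabled service.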
3. sanctum-xtts-server symlink broken. ~/.sanctum/bin/sanctum-xtts-server pointed to a venv that no longer existed (~/Projects/yoda-voice-agent/.xtts-venv/bin/python). The audit_runtime_launchagents.py tool flagged MISSING. The xtts server was actually running via python3.12 from the LaunchAgent’s PATH — the symlink is the entry point, not the runtime. Repointed to /opt/homebrew/bin/python3.11 (the interpreter the pin_deps transformers constraint expects).
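The audit boils down to a dangling-symlink scan. A minimal sketch of that check — broken_symlinks is a hypothetical helper, not the real audit_runtime_launchagents.py code — demonstrated against a temp directory:

```python
import os
import tempfile

def broken_symlinks(bin_dir: str) -> list[str]:
    """Symlinks whose target no longer exists (what the audit flags MISSING)."""
    broken = []
    for name in os.listdir(bin_dir):
        path = os.path.join(bin_dir, name)
        # islink() is true for the link itself; exists() follows the target.
        if os.path.islink(path) and not os.path.exists(path):
            broken.append(name)
    return broken

# Demo: a shim pointing at a deleted venv interpreter.
with tempfile.TemporaryDirectory() as d:
    os.symlink(os.path.join(d, "gone-venv/bin/python"),
               os.path.join(d, "sanctum-xtts-server"))
    print(broken_symlinks(d))  # → ['sanctum-xtts-server']
```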
4. Legacy living-force plist marker missing. test-sanctum-runtime-audit.sh expects com.sanctum.living-force.plist.disabled as confirmation that the legacy watchdog is retired. Neither the active plist nor the disabled marker existed. Created the empty .disabled marker.
5. mlx-finetune/configs/agents.yaml missing. sync_agent_markdown.py defaults to this path. The file didn’t exist — only the patches/ directory was in the repo. The script crashed with FileNotFoundError. Created the file with all six canonical agents (windu, quigon, cilghal, jocasta, mundi, yoda), each referencing a workspace subdirectory with workspace_optional: true so missing workspaces are skipped gracefully.
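The skip-gracefully behavior can be sketched as follows — agent names are from the file above, but agents_to_sync and the config shape are illustrative, not the real sync_agent_markdown.py internals:

```python
import os

# Six canonical agents, each with an optional workspace subdirectory.
AGENTS = {
    name: {"workspace": f"workspaces/{name}", "workspace_optional": True}
    for name in ("windu", "quigon", "cilghal", "jocasta", "mundi", "yoda")
}

def agents_to_sync(agents: dict, root: str) -> list[str]:
    """Keep agents whose workspace exists; skip optional-workspace misses."""
    keep = []
    for name, cfg in agents.items():
        if os.path.isdir(os.path.join(root, cfg["workspace"])):
            keep.append(name)
        elif not cfg.get("workspace_optional"):
            raise FileNotFoundError(cfg["workspace"])  # the old crash mode
    return keep

print(agents_to_sync(AGENTS, "/no-such-root"))  # → []: all skipped gracefully
```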
6. Test harnesses not updated after the Q2 rename. Three test files still referenced old service slugs and counts:
- test-sanctum-system-e2e.sh: Services: 30 → Services: 33; xtts --> voice-agent → xtts-server --> voice-agent; proxy mode/server fields retired in favor of routing/providers (the proxy health response never included mode or server — that assertion was aspirational the whole time).
- test-sanctum-runtime-audit.sh: SUPPLEMENTAL_COUNT:6 → 9; VOICE_AGENT_DEPS:xtts → xtts_server.
- test-sanctum-evolution-loop.sh: incident-learn.sh gateway → openclaw-gateway.
7. Agent capabilities stale. ~/.sanctum/config/agent-capabilities.yaml had drifted. sync_agent_capabilities.py brought it back in sync.
8. Four LaunchAgent plists stale. Running sync_runtime_calibration.py synced gateway.docker, gateway, ha-tunnel, and health-tunnel.
Still Degraded — Infrastructure, Not Code
Six components remained unhealthy at end of day, and every one of them was a tunnel or a VM reachability issue, not a code defect:
- health-center /health → 502 (health-tunnel down)
- health export canary → 502 (same tunnel)
- VM → mac MLX bridge → SSH unreachable
- VM → mac LM Studio bridge → SSH unreachable
- Navigator sidecar → not running (no monitor-status.json files, so it starts degraded anyway)
- OBLITERATUS UI → not running (venv setup not done, remedy_venv.sh missing)
The watchdog correctly reflected all of this with overall: degraded.
Apr 18 — Infrastructure Recovery
The previous session closed nine code issues. Three infrastructure problems were left: openclaw VM SSH unreachable, navigator-sidecar not running, OBLITERATUS not running. This session was meant to finish them.
Pre-Session Watchdog State
overall: degraded, 22/33 healthy. Root causes listed by the watchdog: anthropic-proxy, firewalla-bridge, graphiti-server, ha-tunnel, health-center, health-tunnel, network-control, signal-proxy, triage.
The watchdog API was responding on :2187, but the last_check_at timestamp was stale (14:00 UTC). The launchd-managed watchdog kept failing to start with failed to bind port 2187: Address already in use. An orphan watchdog process (PID 1494), started by sanctum-bootstrap.sh on Apr 17, was squatting the port and serving stale check results.
What Was Wrong
1. Stale watchdog serving cached “VM unreachable” state. PID 1494 had run its last check at 14:00 UTC yesterday, when VM SSH was unreachable. By session start today, ssh openclaw echo ok returned immediately — the SSH path had self-recovered overnight. But the watchdog had stale state, and the launchd instance couldn’t start because 1494 held the port.
Killed PID 1494. Launchd immediately started a fresh watchdog instance. After the 15-second settle delay, the new watchdog ran fresh checks. anthropic-proxy, triage, and signal-proxy (partially) all resolved from this single fix. The stale “VM unreachable” messages for anthropic-proxy and signal-proxy were phantom failures — the services were running on the VM the entire time.
Root cause of VM SSH being unreachable yesterday: not fully determined. The qemu-system-aarch64 process was running throughout. The bridge interface was up. SSH connectivity had self-recovered by session start. Likely a transient network hiccup or a brief bridge flap.
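The bind failure itself is ordinary socket behavior, reproducible in a few lines (an ephemeral port stands in for 2187):

```python
import errno
import socket

# The orphan watchdog holding the port:
squatter = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
squatter.bind(("127.0.0.1", 0))   # kernel picks a free port
squatter.listen(1)
port = squatter.getsockname()[1]

# The launchd-managed instance trying to start on the same port:
newcomer = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    newcomer.bind(("127.0.0.1", port))
except OSError as exc:
    print(exc.errno == errno.EADDRINUSE)  # → True: "Address already in use"
finally:
    newcomer.close()
    squatter.close()  # killing the squatter is what frees the port
```

The retry loop in launchd never succeeds until the squatter exits, which is why killing PID 1494 was the entire fix.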
2. ha-tunnel plist stale — loaded config used 70707:127.0.0.1:70707. The running launchd ha-tunnel had a different port spec than the on-disk plist. The plist on disk said 18092:127.0.0.1:18092 (valid SSH -L format); the loaded launchd config still had the old 70707:127.0.0.1:70707 from before the last sync_runtime_calibration.py run. SSH was rejecting every connection attempt with Bad local forwarding specification '70707:127.0.0.1:70707'.
launchctl unload + launchctl load on /Users/neo/Library/LaunchAgents/com.sanctum.ha-tunnel.plist. Port 18092 opened immediately.
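The rejected spec is invalid because TCP ports are 16-bit: 70707 is above 65535, so ssh refuses it before attempting anything. A sketch of a pre-flight validator (valid_forward_spec is a hypothetical helper, not part of the repo):

```python
def valid_forward_spec(spec: str) -> bool:
    """Pre-flight check for an ssh -L spec.

    Accepts 'LPORT:HOST:RPORT' or 'BIND:LPORT:HOST:RPORT' and requires
    every port to fit TCP's 16-bit range (1-65535).
    """
    parts = spec.split(":")
    if len(parts) == 3:
        ports = (parts[0], parts[2])
    elif len(parts) == 4:
        ports = (parts[1], parts[3])
    else:
        return False
    return all(p.isdigit() and 1 <= int(p) <= 65535 for p in ports)

print(valid_forward_spec("70707:127.0.0.1:70707"))  # → False: port > 65535
print(valid_forward_spec("18092:127.0.0.1:18092"))  # → True
```

Running something like this against a plist's ProgramArguments before loading it would have flagged the stale 70707 spec without a single failed SSH attempt.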
3. health-center (:2222) in restart loop. com.sanctum.health-center showed exit code 143 (SIGTERM) with 979 runs logged. The server was starting successfully but dying because a stale test process from a previous session (PID 92849, started by run_sanctum.sh) was holding port 2222. After the test process was killed, the launchd-managed health-center took over and the port stabilized.
4. firewalla-bridge port mismatch. The service manifest at ~/.sanctum/services/firewalla-bridge.yaml declared port: 1984 for the liveness check, but the actual firewalla-bridge.sh binds to port 18094 (hardcoded via FIREWALLA_BRIDGE_PORT="18094"). The watchdog was checking a port that was never open. Updated the YAML to use port: 18094 in provides, liveness.port, and port fields.
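The watchdog-style liveness probe is just a TCP connect, so a manifest pointing it at the wrong port fails forever regardless of service health. A sketch (port_open is a hypothetical stand-in for the real check; the ephemeral listener stands in for firewalla-bridge on 18094):

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """True if a TCP connect to host:port succeeds (the liveness probe)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# A listener on an ephemeral port stands in for the actual service bind.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
actual = srv.getsockname()[1]

print(port_open("127.0.0.1", actual))  # → True: probing the real bind
srv.close()
print(port_open("127.0.0.1", actual))  # → False: probing after the listener is gone
```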
5. navigator-sidecar — already running. Was actually running (PID 43966) when the session started. The previous session’s “not running” finding had self-resolved overnight (launchd or a bootstrap mechanism restarted it). Confirmed via curl http://127.0.0.1:3344/status.
6. OBLITERATUS — Python 3.14 + torch startup deadlock. obliteratus ui failed with ModuleNotFoundError: No module named 'obliteratus'. Root cause: Python 3.14 silently skips .pth files located in directories whose name starts with a dot. .venv/lib/python3.14/site-packages/ had __editable__.obliteratus-0.1.2.pth and _virtualenv.pth, and Python 3.14 logged Skipping hidden .pth file for all of them. The package was installed but unreachable.
A partial fix worked interactively but not in the background: PYTHONPATH=/path/to/OBLITERATUS ./.venv/bin/obliteratus ui imports correctly, but when the same command runs as a detached background process, torch 2.11.0 stalls on loading libtorch_cpu.dylib (216 MB) at low I/O priority (SN state). Interactive: 0.7 seconds. Background: over ten minutes.
OBLITERATUS remained down at end of session. The proper fix — recreate the venv with Python 3.12 at a non-hidden path — carried over to the next session.
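For reference, the .pth mechanism that went missing: the site machinery reads *.pth files in a site directory and appends the paths they name to sys.path. A self-contained demo via site.addsitedir() in a temp (non-hidden) directory, which any interpreter that processes the .pth should pass:

```python
import os
import site
import sys
import tempfile

with tempfile.TemporaryDirectory() as d:   # a non-hidden directory
    pkg_dir = os.path.join(d, "editable-src")
    os.mkdir(pkg_dir)
    # An editable install drops a .pth file whose line is a path to add.
    with open(os.path.join(d, "__editable__.demo.pth"), "w") as f:
        f.write(pkg_dir + "\n")
    site.addsitedir(d)                     # processes every *.pth in d
    print(pkg_dir in sys.path)             # → True when the .pth is honored
```

Under the skip behavior described above, the same .pth sitting inside a dot-prefixed directory contributes nothing, so the editable package resolves to nothing.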
Post-Session State
overall: degraded, 29/33 healthy (up from 22/33 at session start). Newly green: anthropic-proxy, ha-tunnel, health-center, triage, firewalla-bridge. Four services still unhealthy, all pre-existing infrastructure gaps.
Apr 18 — Second Session, The Last Four
The previous session ended at 29/33. This session targeted the remaining four: graphiti-server, health-tunnel, network-control, signal-proxy.
What Was Wrong
1. health-tunnel port mismatch between plist and VM service. The LaunchAgent plist forwarded 18095→VM:18095, but the health-ingester service on the VM was actually bound to 10.10.10.10:10101. The running instance had been launched with a different port than the source code declared. The service YAML checked port: 18095, which was never open on the mac side.
Updated the LaunchAgent plist to forward 127.0.0.1:10101:10.10.10.10:10101. Updated the service YAML to check port: 10101. Killed the stale bootstrap-era tunnel (PID 72802) that was using the old 10101 forward, then reloaded the LaunchAgent. Port 10101 opened immediately; /health returned {"status":"ok"}.
Port 10101 is 101 doubled (101 being binary for 5) — a mathematician’s joke. Port 18095 was vestigial from an earlier health-ingester config that bound to loopback:18095. No new port assignments were made.
2. graphiti-server and network-control — missing SSH tunnel plists. Both services run inside the VM on 127.0.0.1 (VM loopback). Confirmed via lsof -i :31416 -n -P and lsof -i :4007 -n -P on the VM. No mac-side LaunchAgent forwarded these ports, so the watchdog’s port checks always found them closed.
Created two new SSH tunnel LaunchAgents and matching sanctum-*-tunnel symlinks:
- ~/.sanctum/bin/sanctum-graphiti-tunnel → /usr/bin/ssh
- ~/Library/LaunchAgents/com.sanctum.graphiti-tunnel.plist — forwards 127.0.0.1:31416:127.0.0.1:31416 via openclaw.
- ~/.sanctum/bin/sanctum-network-control-tunnel → /usr/bin/ssh
- ~/Library/LaunchAgents/com.sanctum.network-control-tunnel.plist — forwards 127.0.0.1:4007:127.0.0.1:4007 via openclaw.
Both loaded immediately. Verified: graphiti /health returns {"status":"ok","neo4j":"connected"}; network-control /health returns {"status":"ok","dns_connected":true}. Updated both service YAMLs to reference their launchagent fields (previously null).
Port 31416 is approximately π × 10000 — nerd canon. Port 4007 is the canonical network-control port from the original service design. Neither required reassignment.
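A tunnel LaunchAgent of this shape can be sketched with plistlib — the keys shown (Label, ProgramArguments, RunAtLoad, KeepAlive) are standard launchd keys, but the plists actually created may carry more fields:

```python
import plistlib

def tunnel_plist(label: str, forward: str, host: str) -> bytes:
    """Serialize a kept-alive ssh -L tunnel as a launchd plist."""
    return plistlib.dumps({
        "Label": label,
        "ProgramArguments": ["/usr/bin/ssh", "-N", "-L", forward, host],
        "RunAtLoad": True,
        "KeepAlive": True,   # launchd restarts the tunnel when it drops
    })

data = tunnel_plist("com.sanctum.graphiti-tunnel",
                    "127.0.0.1:31416:127.0.0.1:31416", "openclaw")
print(plistlib.loads(data)["ProgramArguments"][3])
# → 127.0.0.1:31416:127.0.0.1:31416
```

Generating both tunnel plists from one function also keeps the forward spec in exactly one place, which is the class of drift items 1 and 2 were about.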
3. signal-proxy — broken grep pattern. signal-health.sh CHECK 4 (check_forceflow_port) used:
```
grep -E '127\.0\.0\.1:[0-9]+/api/v1/rpc' "$FORCE_FLOW_PY"
```
But force_flow.py’s send_signal() uses http://127.0.0.1:8080/v2/send — REST format, not a JSON-RPC path. The pattern never matched, configured_port was always empty, and the check always reported cannot parse signal port from force_flow.py. The watchdog read that as overall: 2 (needs_intervention) even though signal was fully healthy. Updated the pattern:
```
grep -E 'http://127\.0\.0\.1:[0-9]+/v[0-9]+/' "$FORCE_FLOW_PY"
```
This correctly extracts port 8080. Since configured_port == CANONICAL_PORT (both 8080), CHECK 4 now reports healthy. A full script run exits 0 with all 6 components healthy. The watchdog picks it up as healthy on the next check cycle.
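The two patterns can be checked side by side against the URL force_flow.py actually emits. A capture group is added here to show the port extraction; grep -E just prints the matching text:

```python
import re

url = "http://127.0.0.1:8080/v2/send"   # what send_signal() actually calls

old = re.compile(r"127\.0\.0\.1:[0-9]+/api/v1/rpc")         # CHECK 4 before
new = re.compile(r"http://127\.0\.0\.1:([0-9]+)/v[0-9]+/")  # CHECK 4 after

print(old.search(url))                    # → None: never matched a REST URL
match = new.search(url)
print(match.group(1) if match else None)  # → 8080
```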
4. OBLITERATUS — Python 3.12 venv migration at a non-hidden path. The actual fix for the problem the previous session had only worked around.
```
python3.12 -m venv /Users/neo/Documents/Claude_Code/OBLITERATUS/venv
venv/bin/pip install -e ".[spaces]"
venv/bin/python -c "import obliteratus; print('ok')"   # → ok
venv/bin/obliteratus ui --port 7860 --host 127.0.0.1 --no-browser
```
Torch loaded in under 60 seconds with Python 3.12, which is within its officially supported range (3.9–3.12). Port 7860 opened; curl http://127.0.0.1:7860/ returned HTTP 200. Created OBLITERATUS/remedy_venv.sh to document the recreation procedure with the correct flags.
Why 3.12 avoids the torch stall is only partly understood. The working theory: the 3.14 interpreter’s newer import dispatch and dynamic-linker behavior interact poorly with torch’s low-level Metal and OpenMP initialization, while Python 3.12 uses established import paths that the macOS page cache handles efficiently even at SN priority.
Post-Session Watchdog State
overall: healthy, 33/33 services healthy (up from 29/33 at session start).
Newly green: graphiti-server, health-tunnel, network-control, signal-proxy.
Gotchas for Next Time
- Q2 catalog renames have long tails. After any instance.yaml service key rename, run render_runtime_services.py and re-check the SERVICE_MAP in sync_runtime_calibration.py for stale key names. The two files drift independently.
- Symlink audit catches broken venvs. If a venv is deleted, the .sanctum/bin/ shim symlinks will break. audit_runtime_launchagents.py will catch this — the fix is to recreate the venv or repoint the symlink to the system interpreter.
- Test harness service counts are exact. test-sanctum-system-e2e.sh asserts Services: N. Any instance.yaml addition increments this. Update the test immediately when adding services.
- Bootstrap watchdog squats launchd. On boot, sanctum-bootstrap.sh starts a watchdog directly. The launchd com.sanctum.watchdog plist also tries to start one. They race for port 2187. Bootstrap wins. The launchd instance logs failed to bind port 2187 every ten seconds indefinitely. If the bootstrap-started watchdog runs long enough, its check cache goes stale. Kill the bootstrap PID; launchd restarts fresh. Long-term: remove the watchdog from sanctum-bootstrap.sh — launchd manages it now.
- launchctl loaded config can diverge from the on-disk plist. launchctl print gui/UID/com.sanctum.ha-tunnel may show different args than the plist file if the plist was regenerated via sync_runtime_calibration.py but never reloaded. launchctl unload + load is the fix. Check with launchctl print before assuming disk is what’s running.
- Python 3.14 skips .pth files in hidden dirs. Any editable install in .venv/ (or any dot-prefixed path) breaks silently. Use PYTHONPATH explicitly or recreate the venv at a non-hidden path (venv/). The rule is venv/ not .venv/ until torch officially supports Python 3.13+.
- SSH -L spec depends on where the service binds. VM-loopback services need 127.0.0.1:PORT:127.0.0.1:PORT. Bridge-IP services need PORT:10.10.10.10:PORT. When a service changes its bind address without updating the tunnel spec, the tunnel forwards to a port that nothing listens on. Verify with lsof -i :PORT -n -P on the VM after any bind-config change.
- signal-health.sh grep must track force_flow.py. If send_signal() changes URL path (/v2/send vs /api/v1/rpc), update CHECK 4’s grep pattern. The pattern is documented in the script header. Any change to the signal URL in force_flow.py requires a parallel update here.