
Health Monitoring & Watchdog

The watchdog is the part of Sanctum that doesn’t sleep — a long-running Rust daemon (sanctumd) that wakes every ten minutes, sweeps every service, fixes what it can, and tells you about whatever it couldn’t. It is, literally, a program that watches other programs and judges them. If it ever develops opinions about your uptime, unplug everything.

The watchdog — tireless guardian of three dozen services

  1. Check health: service-graph.py check-all --json probes every manifest, reporting per-service status plus root causes.
  2. Evaluate: The graph separates root causes from downstream casualties, so dependants aren’t restarted blindly.
  3. Auto-remediate: With auto_fix on, each root cause goes to service-graph.py remediate <svc>, bounded by a restart budget and circuit breaker.
  4. Settle: Waits settle_delay seconds for services to stabilize.
  5. Re-check: Runs check-all again.
  6. Notify: Alerts on anything still failed.
┌──────────────┐     ┌──────────────┐     ┌─────────────┐
│  check-all   │────→│ Root causes? │─No─→│  All Clear  │
└──────────────┘     └──────────────┘     └─────────────┘
                            │ Yes
                            ↓
                     ┌──────────────┐
                     │  remediate   │
                     │    <svc>     │
                     └──────────────┘
                            ↓
                     ┌──────────────┐
                     │ Settle Delay │
                     └──────────────┘
                            ↓
                     ┌──────────────┐     ┌─────────────┐
                     │   Re-check   │─OK─→│    Fixed    │
                     └──────────────┘     └─────────────┘
                            │ Still failing
                            ↓
                     ┌──────────────┐
                     │    Notify    │
                     └──────────────┘

Most days the path is: check, all clear, sleep. On the bad days it finds a root cause, remediates, re-checks, and either nods or tells you on Signal your evening is about to get worse.
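
The loop above can be sketched in a few lines. This is a minimal illustration, not the sanctumd implementation (which is Rust): check_all, remediate, and notify are hypothetical callables standing in for the real service-graph and notification calls, and the behavior with auto_fix off (notify straight away) is an assumption.

```python
import time
from typing import Callable


def sweep(
    check_all: Callable[[], list],       # hypothetical: returns failing root causes
    remediate: Callable[[str], None],    # hypothetical: runs the manifest action
    notify: Callable[[list], None],      # hypothetical: sends the alert
    auto_fix: bool = True,
    settle_delay: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run one sweep and return 'all-clear', 'fixed', or 'notified'."""
    root_causes = check_all()
    if not root_causes:
        return "all-clear"

    if not auto_fix:
        # Assumption: with remediation disabled, failures are reported as-is.
        notify(root_causes)
        return "notified"

    for svc in root_causes:
        remediate(svc)
    sleep(settle_delay)  # give restarted services time to stabilize

    still_failing = check_all()
    if not still_failing:
        return "fixed"
    notify(still_failing)
    return "notified"
```

Passing sleep in as a parameter keeps the sketch testable without waiting out the real settle delay.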

The watchdog is not a periodic LaunchAgent — it’s a continuously-running daemon that loops on its own timer. launchd’s only job is to keep it alive:

| Property | Value |
| --- | --- |
| Label | com.sanctum.watchdog |
| Plist | /Library/LaunchDaemons/com.sanctum.watchdog-rust.plist |
| Program | sanctumd (sanctum-rs release build) |
| KeepAlive | true (relaunched immediately if it ever exits) |
| Check interval | WATCHDOG_CHECK_INTERVAL, default 600 (seconds) |
| Health API | port 2187 |

It starts at boot and runs forever, one sweep every check_interval seconds. Being a single long-lived process is why the dedup state below lives in memory.

Nothing helps you fix a problem like your phone buzzing about it every ten minutes — so once a service fails and notifies, further failures of that same service are suppressed for the dedup_window. State is an in-memory map of service name to last-notified time, no on-disk file — so a restart wipes it, and the first failure after a relaunch always notifies.

First failure of "home-assistant" → Notify ✓
Same failure 5 min later → Suppressed (within 30 min window)
Same failure 35 min later → Notify ✓ (window expired)
Different service fails → Notify ✓ (independent key)
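
The suppression rule is small enough to sketch. The following is an illustrative Python version of an in-memory dedup map, assuming a 30-minute window; the class and method names are hypothetical, and the real state lives inside the Rust daemon.

```python
import time
from typing import Optional


class NotificationDedup:
    """In-memory map of service name to last-notified time (lost on restart)."""

    def __init__(self, window_seconds: float = 1800.0):
        self.window = window_seconds
        self.last_notified = {}  # service name -> timestamp in seconds

    def should_notify(self, service: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        last = self.last_notified.get(service)
        if last is not None and now - last < self.window:
            return False  # still inside the window: suppress
        self.last_notified[service] = now
        return True


dedup = NotificationDedup(window_seconds=1800)
dedup.should_notify("home-assistant", now=0)      # first failure: notify
dedup.should_notify("home-assistant", now=300)    # 5 min later: suppressed
dedup.should_notify("home-assistant", now=2100)   # 35 min later: notify again
```

Because the map is keyed by service name, a second service failing inside the window still notifies on its own first failure.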

All settings live in instance.yaml under services.watchdog:

services:
  watchdog:
    enabled: true
    interval: 600        # seconds between check sweeps
    settle_delay: 15     # seconds to wait after a fix attempt
    auto_fix: true       # enable manifest-driven auto-remediation
    dedup_window: 1800   # seconds (30 min) to suppress repeat alerts
    notify_channels:
      - macos
      - signal
      - dashboard

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | true | Enable or disable the watchdog entirely |
| interval | number | 600 | Seconds between check sweeps |
| settle_delay | number | 15 | Seconds to wait after repair before re-checking |
| auto_fix | boolean | true | Whether to run manifest remediation on root causes |
| dedup_window | number | 1800 | Seconds to suppress duplicate notifications |
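
As a rough illustration of how these defaults combine with a partial config, here is a sketch that merges an already-parsed instance.yaml mapping over the documented defaults. WatchdogConfig and from_instance are hypothetical names, the notify_channels default is assumed from the sample above (the table does not list one), and rejecting unknown keys is a design choice of this sketch rather than documented behavior.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class WatchdogConfig:
    enabled: bool = True
    interval: int = 600          # seconds between check sweeps
    settle_delay: int = 15       # seconds to wait after a fix attempt
    auto_fix: bool = True
    dedup_window: int = 1800     # seconds to suppress repeat alerts
    notify_channels: tuple = ("macos", "signal", "dashboard")  # assumed default

    @classmethod
    def from_instance(cls, instance: dict) -> "WatchdogConfig":
        """Build a config from a parsed instance.yaml mapping."""
        raw = (instance.get("services") or {}).get("watchdog") or {}
        known = {f.name for f in fields(cls)}
        unknown = set(raw) - known
        if unknown:
            # Surface typos early instead of silently ignoring them.
            raise ValueError(f"unknown watchdog settings: {sorted(unknown)}")
        values = dict(raw)
        if "notify_channels" in values:
            values["notify_channels"] = tuple(values["notify_channels"])
        return cls(**values)


cfg = WatchdogConfig.from_instance({"services": {"watchdog": {"interval": 300}}})
# cfg.interval is 300; every other field keeps its documented default.
```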

The watchdog delivers alerts through three channels, routed by severity in ~/.sanctum/lib/notify.sh — the louder the problem, the more channels it lights up:

| Severity | macOS | Dashboard | Signal | Channel detail |
| --- | --- | --- | --- | --- |
| ambient | | | | Log + chitti samskara only |
| warn | yes | yes | | osascript Notification Center + dashboard banner |
| error | yes | yes | yes | Full fan-out |
| critical | yes | yes | yes | Full fan-out (memory kills land here) |

The dashboard banner appends to ~/.sanctum/alerts.json, which the command center polls. Signal goes to a configured group via the apple-toolkit skill, reserved for error and critical — and nothing quite says good evening like a Signal from your haus at midnight telling you a service is down.
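
The routing table can be expressed as a simple lookup. The real routing lives in the notify.sh shell script; this Python sketch only mirrors the table, and the extra filtering by enabled notify_channels is an assumption about how the two settings interact.

```python
# Severity -> channels, mirroring the routing table above.
ROUTES = {
    "ambient": (),                              # log only, no channels
    "warn": ("macos", "dashboard"),
    "error": ("macos", "dashboard", "signal"),
    "critical": ("macos", "dashboard", "signal"),
}


def channels_for(severity, enabled=("macos", "signal", "dashboard")):
    """Return the channels an alert of this severity should reach."""
    if severity not in ROUTES:
        raise ValueError(f"unknown severity: {severity!r}")
    # Assumption: a channel must also be listed in notify_channels to fire.
    return [ch for ch in ROUTES[severity] if ch in enabled]


channels_for("warn")      # ['macos', 'dashboard']
channels_for("critical")  # ['macos', 'dashboard', 'signal']
```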

The repair engine is the service graph, not a separate doctor process. With auto_fix enabled, each root cause goes to service-graph.py remediate <svc>, running that service’s manifest-defined action — a heal_launchagent.sh restart, a systemd kick over SSH, a Docker bounce. Blunt, but most failures here respond to “have you tried turning it off and on again.”

Two guards keep a flapping service from turning the watchdog into a respawn machine: a per-service restart budget (max_restarts_per_hour, default 3) and a circuit breaker that halts remediation when too many fail at once.
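
One plausible shape for the two guards is a sliding-window counter per service plus a threshold on simultaneous failures. This is a sketch under assumptions: the class and function names are hypothetical, and the circuit-breaker threshold of 5 is an illustrative value, since the text only says "too many".

```python
from collections import defaultdict, deque


class RestartBudget:
    """Allow at most max_restarts_per_hour remediation attempts per service."""

    def __init__(self, max_restarts_per_hour=3, window_seconds=3600.0):
        self.max_restarts = max_restarts_per_hour
        self.window = window_seconds
        self.history = defaultdict(deque)  # service -> recent restart times

    def allow(self, service, now):
        attempts = self.history[service]
        # Drop attempts that have aged out of the window.
        while attempts and now - attempts[0] >= self.window:
            attempts.popleft()
        if len(attempts) >= self.max_restarts:
            return False  # budget spent: leave the service alone
        attempts.append(now)
        return True


def circuit_open(root_causes, max_simultaneous=5):
    """True when so many services fail at once that remediation should halt.

    max_simultaneous is a hypothetical threshold chosen for illustration.
    """
    return len(root_causes) > max_simultaneous
```

A caller would check circuit_open once per sweep, then budget.allow(svc, now) before each individual remediate call.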

The suite isn’t a hardcoded list — it’s whatever manifests live in ~/.sanctum/services/. Each *.yaml declares how its service is probed, and check-all walks every one (roughly 37 today). Add a manifest and the watchdog picks it up next sweep, no code change. A probe is one of a few types:

| Probe type | What it checks | Example |
| --- | --- | --- |
| port | TCP port is open (nc -z), not an HTTP status code | home-assistant :8123, sanctum-tts :8008 |
| process | A named process is alive | sanctum-tts (tts_server) |
| command | A command exits zero | tailscale (tailscale status) |
| interface | A network interface holds its expected IP | bridge interface |
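
To make the port probe concrete, here is a minimal Python equivalent of an nc -z check: it only tests that a TCP connection can be opened and never speaks HTTP. ProbeResult and probe_port are hypothetical names, and the result fields are modeled on the pass/fail, latency, and message triple that probes return.

```python
import socket
import time
from dataclasses import dataclass


@dataclass
class ProbeResult:
    ok: bool
    latency_ms: float
    message: str


def probe_port(host, port, timeout=2.0):
    """Pass if a TCP connection to host:port opens within the timeout."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            ok, message = True, f"{host}:{port} accepting connections"
    except OSError as exc:  # refused, timed out, unreachable, DNS failure
        ok, message = False, f"{host}:{port} unreachable: {exc}"
    latency_ms = (time.monotonic() - start) * 1000.0
    return ProbeResult(ok, latency_ms, message)
```

For example, probe_port("127.0.0.1", 8123) would check the home-assistant port listed above, assuming the service runs on the local host.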

Probes return pass/fail, latency, and a diagnostic message; the graph follows requires edges so a fallen dependency takes the blame, not everything downstream. Roughly 37 services every ten minutes is about 5,300 probes a day. Your haus gets more checkups than you do.
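
The root-cause rule can be sketched as a graph walk: a failed service is a root cause only if nothing it requires, directly or transitively, has also failed. The function name and the example edges below are hypothetical, and the sketch assumes the requires graph is acyclic (a cycle of failed services would report no root cause).

```python
def root_causes(failed, requires):
    """Return failed services that have no failed dependency upstream.

    failed:   set of service names whose probes failed
    requires: mapping of service -> list of services it depends on
    """
    def has_failed_dependency(svc):
        seen = set()
        stack = list(requires.get(svc, ()))
        while stack:
            dep = stack.pop()
            if dep in seen:
                continue
            seen.add(dep)
            if dep in failed:
                return True
            stack.extend(requires.get(dep, ()))
        return False

    return sorted(svc for svc in failed if not has_failed_dependency(svc))


# Hypothetical edges, for illustration only.
requires = {"dashboard": ["home-assistant"], "home-assistant": ["tailscale"]}
root_causes({"dashboard", "home-assistant", "tailscale"}, requires)
# ['tailscale']: the two dependants are downstream casualties.
```

The daily probe count follows from the sweep interval: 86,400 s / 600 s = 144 sweeps a day, and 37 × 144 = 5,328 probes.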

On April 23, 2026, a runaway session pushed the 64 GB Mac Mini into OOM until WindowServer missed its watchdog checkin and the kernel panicked — twice in one evening. The service graph kept dutifully checking ports while the system asphyxiated, because knowing a service is up doesn’t help when there’s no RAM left to run it.

The memory sentinel exists because of that night. It runs before the service graph — which itself spawns Python, threads, and SSH connections, all of which cost memory. Running diagnostics on a starved system is the monitoring equivalent of operating on a burning patient.

The sentinel doesn’t reason about whole-system free memory. It reasons about Sanctum’s own services, one at a time: it enumerates every running com.sanctum.* job, reads each PID’s RSS via ps -o rss=, and compares it to a per-service limit in megabytes. Limits are tuned per service: mlx-builtin gets 20 GB because it holds a model in RAM, while a tunnel gets 64 MB, and anything over that is treated as a leak. Override any limit under services.<key>.memory_limit_mb. The worst grade across all services sets the overall status:

| Level | Trigger | Action |
| --- | --- | --- |
| OK | Every service under 80% of its limit | Exit 0, all clear |
| Warning | At least one service at or above 80% | Exit 0, log only |
| Critical | At least one service over 100% of its limit | Exit 1, kill the worst offender |
| Emergency | Three or more services over their limits | Exit 1, kill the worst offender |

The “worst offender” is the service that exceeds its own limit by the most megabytes. On a kill, the sentinel sends SIGTERM, pauses, then sends SIGKILL, and since every victim is a com.sanctum.* job, launchd restarts it clean.
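
The grading rules above reduce to a few comparisons. This sketch takes RSS readings and limits as plain dictionaries in megabytes; the function name is hypothetical, services without a configured limit are skipped by assumption, and 20 GB is taken as 20 × 1024 MB.

```python
def grade(usage_mb, limits_mb):
    """Return (level, exit_code, worst_offender) for one sentinel pass."""
    over = {}        # service -> megabytes above its own limit
    warned = False
    for svc, rss in usage_mb.items():
        limit = limits_mb.get(svc)
        if limit is None:
            continue  # assumption: no limit configured means not graded
        if rss > limit:
            over[svc] = rss - limit
        elif rss >= 0.8 * limit:
            warned = True

    if len(over) >= 3:
        level = "emergency"
    elif over:
        level = "critical"
    elif warned:
        level = "warning"
    else:
        level = "ok"

    # Worst offender: the service furthest over its own limit in absolute MB.
    worst = max(over, key=over.get) if over else None
    return level, (1 if over else 0), worst


limits = {"mlx-builtin": 20 * 1024, "tunnel": 64}
grade({"mlx-builtin": 21000, "tunnel": 70}, limits)
# ('critical', 1, 'mlx-builtin'): 520 MB over beats the tunnel's 6 MB over.
```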

The sentinel runs as Step 0.5 — the first thing each sweep does, before any service-graph work:

Watchdog check starting
→ Step 0.5: Memory sentinel pre-flight (--json)
→ If emergency/critical: --kill, notify, sleep 3s
→ If warning: log only
→ Step 1: validate manifests + service-graph check-all

On a kill it fires a critical notification and sleeps three seconds for the system to reclaim memory — so the check-all that follows sees a machine that has exhaled.

The real knobs are the per-service memory_limit_mb overrides under each service’s key. The dedicated instance.yaml block also carries some forward-looking scaffolding:

services:
  memory_sentinel:
    enabled: true
    auto_kill: true
    kill_grace_seconds: 5
    thresholds:
      warning_free_pct: 15
      critical_free_pct: 8
      emergency_free_pct: 5
    safelist:
      - "com.apple.Virtualization.VirtualMachine"
      - "sanctum-mlx"
      - "WindowServer"
      - "kernel_task"

The health agent can invoke the sentinel through the service-doctor skill’s memory-check.sh, checking pressure and resetting a runaway service without SSH to the host.