Health Monitoring & Watchdog

The watchdog is the part of Sanctum that doesn’t sleep — a long-running Rust daemon (sanctumd) that wakes every ten minutes, sweeps every service, fixes what it can, and tells you about whatever it couldn’t. It is, literally, a program that watches other programs and judges them. If it ever develops opinions about your uptime, unplug everything.

The watchdog — tireless guardian of three dozen services

How It Works

1. Check health: `service-graph.py check-all --json` probes every manifest, reporting per-service status plus root causes.
2. Evaluate: the graph separates root causes from downstream casualties, so dependants aren’t restarted blindly.
3. Auto-remediate: with `auto_fix` on, each root cause goes to `service-graph.py remediate <svc>`, bounded by a restart budget and a circuit breaker.
4. Settle: waits `settle_delay` seconds for services to stabilize.
5. Re-check: runs `check-all` again.
6. Notify: alerts on anything still failing.

┌──────────────┐     ┌──────────────┐     ┌─────────────┐
│ check-all    │────→│ Root causes? │─No─→│ All Clear   │
└──────────────┘     └──────────────┘     └─────────────┘
                           │ Yes
                           ▼
                    ┌──────────────┐
                    │ remediate    │
                    │ <svc>        │
                    └──────────────┘
                           │
                           ▼
                    ┌──────────────┐
                    │ Settle Delay │
                    └──────────────┘
                           │
                           ▼
                    ┌──────────────┐     ┌─────────────┐
                    │ Re-check     │─OK─→│ Fixed       │
                    └──────────────┘     └─────────────┘
                           │ Still failing
                           ▼
                    ┌──────────────┐
                    │ Notify       │
                    └──────────────┘

Most days the path is: check, all clear, sleep. On the bad days it finds a root cause, remediates, re-checks, and either nods or tells you on Signal your evening is about to get worse.
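The loop above can be sketched in a few lines. This is a minimal illustration, not the daemon's actual code (which is Rust): the function name `run_sweep`, the report shape (`failed` and `root_causes` lists), and the injected callables are all assumptions made so the control flow can be read and tested in isolation.

```python
import time
from typing import Callable, Dict, List

# Hypothetical report shape: {"failed": [...], "root_causes": [...]}.
Report = Dict[str, List[str]]


def run_sweep(
    check_all: Callable[[], Report],
    remediate: Callable[[str], bool],
    notify: Callable[[List[str]], None],
    auto_fix: bool = True,
    settle_delay: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run one check / remediate / re-check pass and return the outcome."""
    report = check_all()
    if not report["failed"]:
        return "all_clear"

    if auto_fix:
        # Only root causes are remediated; downstream casualties are left alone.
        for svc in report["root_causes"]:
            remediate(svc)
        sleep(settle_delay)  # give services time to stabilize
        report = check_all()
        if not report["failed"]:
            return "fixed"

    # Anything still failing after the re-check is reported.
    notify(report["failed"])
    return "notified"
```

Passing the check, remediation, and notification steps in as callables keeps the sketch independent of how `service-graph.py` is actually invoked. What happens when `auto_fix` is off (here: notify straight away) is a guess.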

Schedule

The watchdog is not a periodic LaunchAgent. It is a continuously running daemon that loops on its own timer; launchd’s only job is to keep it alive:

Property	Value
Label	`com.sanctum.watchdog`
Plist	`/Library/LaunchDaemons/com.sanctum.watchdog-rust.plist`
Program	`sanctumd` (sanctum-rs release build)
KeepAlive	true (relaunched immediately if it ever exits)
Check interval	`WATCHDOG_CHECK_INTERVAL`, default 600 (seconds)
Health API	port 2187

It starts at boot and runs forever, one sweep per check interval (600 seconds by default). Because it is a single long-lived process, the dedup state described below can live in memory.

Notification Deduplication

Nothing helps you fix a problem like your phone buzzing about it every ten minutes, so once a service fails and notifies, further failures of that same service are suppressed for `dedup_window` seconds (30 minutes by default). The state is an in-memory map from service name to last-notified time, with no on-disk file, so a daemon restart wipes it and the first failure after a relaunch always notifies.

First failure of "home-assistant" → Notify ✓
Same failure 5 min later           → Suppressed (within 30 min window)
Same failure 35 min later          → Notify ✓ (window expired)
Different service fails             → Notify ✓ (independent key)
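The timeline above maps onto a very small data structure. The sketch below is one plausible shape for it, written in Python for readability. The class name `NotificationDeduper` and its method are hypothetical, and it assumes that suppressed failures do not extend the window, which is what the 35-minute line in the example implies.

```python
import time
from typing import Dict, Optional


class NotificationDeduper:
    """In-memory map of service name -> time of the last notification sent."""

    def __init__(self, window_seconds: float = 1800.0) -> None:
        self.window_seconds = window_seconds
        self._last_notified: Dict[str, float] = {}

    def should_notify(self, service: str, now: Optional[float] = None) -> bool:
        """Return True (and record the time) if this failure should alert."""
        now = time.monotonic() if now is None else now
        last = self._last_notified.get(service)
        if last is not None and now - last < self.window_seconds:
            return False  # still inside the window: suppress
        self._last_notified[service] = now
        return True
```

Because the map lives only in the object, constructing a fresh `NotificationDeduper` behaves like a daemon restart: every service notifies again on its next failure.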

Configuration

All settings live in instance.yaml under services.watchdog:

services:
  watchdog:
    enabled: true
    interval: 600            # seconds between check sweeps
    settle_delay: 15         # seconds to wait after a fix attempt
    auto_fix: true           # enable manifest-driven auto-remediation
    dedup_window: 1800       # seconds (30 min) to suppress repeat alerts
    notify_channels:
      - macos
      - signal
      - dashboard

Setting	Type	Default	Description
`enabled`	boolean	true	Enable or disable the watchdog entirely
`interval`	number	600	Seconds between check sweeps
`settle_delay`	number	15	Seconds to wait after repair before re-checking
`auto_fix`	boolean	true	Whether to run manifest remediation on root causes
`dedup_window`	number	1800	Seconds to suppress duplicate notifications
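As a rough illustration of how these defaults could be applied, the sketch below merges an already-parsed `instance.yaml` (a plain dict, for example from a YAML parser) over the documented defaults. The names `WatchdogConfig` and `load_watchdog_config` are hypothetical, rejecting unknown keys is a design choice of this sketch rather than documented behavior, and `notify_channels` defaults to an empty list only because no default is documented.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, List


@dataclass
class WatchdogConfig:
    enabled: bool = True
    interval: int = 600        # seconds between check sweeps
    settle_delay: int = 15     # seconds to wait after a fix attempt
    auto_fix: bool = True
    dedup_window: int = 1800   # seconds to suppress repeat alerts
    # No default is documented for this key; empty is an assumption.
    notify_channels: List[str] = field(default_factory=list)


def load_watchdog_config(instance: Dict[str, Any]) -> WatchdogConfig:
    """Overlay services.watchdog from a parsed instance.yaml on the defaults."""
    raw = (instance.get("services") or {}).get("watchdog") or {}
    known = {f.name for f in fields(WatchdogConfig)}
    unknown = set(raw) - known
    if unknown:
        # Fail loudly on typos such as "interval_s" instead of ignoring them.
        raise ValueError(f"unknown watchdog settings: {sorted(unknown)}")
    return WatchdogConfig(**raw)
```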

Notification Channels

The watchdog delivers alerts through three channels, routed by severity in ~/.sanctum/lib/notify.sh — the louder the problem, the more channels it lights up:

Severity	macOS	Dashboard	Signal	Channel detail
`ambient`				Log + chitti samskara only
`warn`	yes	yes		`osascript` Notification Center + dashboard banner
`error`	yes	yes	yes	Full fan-out
`critical`	yes	yes	yes	Full fan-out (memory kills land here)

The dashboard banner appends to ~/.sanctum/alerts.json, which the command center polls. Signal goes to a configured group via the apple-toolkit skill, reserved for error and critical — and nothing quite says good evening like a Signal from your haus at midnight telling you a service is down.
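The severity table reduces to a lookup. The real routing lives in `notify.sh`; the Python below is only a sketch of the same mapping, with the hypothetical names `ROUTES` and `channels_for`, to make the fan-out rules explicit.

```python
from typing import Dict, Tuple

# Severity -> channels, transcribed from the table above.
# "ambient" reaches no alert channel (log only).
ROUTES: Dict[str, Tuple[str, ...]] = {
    "ambient": (),
    "warn": ("macos", "dashboard"),
    "error": ("macos", "dashboard", "signal"),
    "critical": ("macos", "dashboard", "signal"),
}


def channels_for(severity: str) -> Tuple[str, ...]:
    """Return the alert channels for a severity, rejecting unknown levels."""
    try:
        return ROUTES[severity]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None
```

Raising on an unknown severity is a choice made for this sketch; silently dropping a misspelled `critical` seems like the worse failure mode.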

Remediation

The repair engine is the service graph, not a separate doctor process. With auto_fix enabled, each root cause goes to service-graph.py remediate <svc>, running that service’s manifest-defined action — a heal_launchagent.sh restart, a systemd kick over SSH, a Docker bounce. Blunt, but most failures here respond to “have you tried turning it off and on again.”

Two guards keep a flapping service from turning the watchdog into a respawn machine: a per-service restart budget (max_restarts_per_hour, default 3) and a circuit breaker that halts remediation when too many services fail at once.
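A sketch of how the two guards might fit together, under stated assumptions: the class name `RemediationGuard` is hypothetical, the budget is modelled as a sliding one-hour window, and `breaker_threshold` is an invented parameter, since the text does not say how many simultaneous failures count as "too many".

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List


class RemediationGuard:
    def __init__(self, max_restarts_per_hour: int = 3, breaker_threshold: int = 5) -> None:
        self.max_restarts_per_hour = max_restarts_per_hour
        self.breaker_threshold = breaker_threshold  # hypothetical value
        self._restarts: Dict[str, Deque[float]] = defaultdict(deque)

    def breaker_open(self, root_causes: List[str]) -> bool:
        """Halt all remediation when too many services fail at once."""
        return len(root_causes) >= self.breaker_threshold

    def allow_restart(self, service: str, now: float) -> bool:
        """Spend one unit of the service's hourly budget, if any is left."""
        history = self._restarts[service]
        # Forget restarts that are an hour old or more.
        while history and now - history[0] >= 3600:
            history.popleft()
        if len(history) >= self.max_restarts_per_hour:
            return False
        history.append(now)
        return True
```

A caller would check `breaker_open` once per sweep and `allow_restart` once per root cause, skipping remediation (but still notifying) when either says no.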

Health Test Suite

The suite isn’t a hardcoded list — it’s whatever manifests live in ~/.sanctum/services/. Each *.yaml declares how its service is probed, and check-all walks every one (roughly 37 today). Add a manifest and the watchdog picks it up next sweep, no code change. A probe is one of a few types:

Probe `type`	What It Checks	Example
`port`	TCP port is open (`nc -z`) — not an HTTP code	home-assistant :8123, sanctum-tts :8008
`process`	A named process is alive	sanctum-tts (`tts_server`)
`command`	A command exits zero	tailscale (`tailscale status`)
`interface`	A network interface holds its expected IP	bridge interface
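For a feel of what a `port` probe involves, here is a minimal sketch using a plain TCP connect, the same check `nc -z` performs. The names `ProbeResult` and `probe_port` and the two-second timeout are assumptions; the result carries pass/fail, latency, and a diagnostic message.

```python
import socket
import time
from dataclasses import dataclass


@dataclass
class ProbeResult:
    ok: bool
    latency_ms: float
    message: str


def probe_port(host: str, port: int, timeout: float = 2.0) -> ProbeResult:
    """Pass if a TCP connection to host:port can be opened within timeout."""
    start = time.monotonic()
    try:
        # Connect and immediately close; no HTTP request is sent.
        with socket.create_connection((host, port), timeout=timeout):
            ok, message = True, f"{host}:{port} accepting connections"
    except OSError as exc:  # refused, unreachable, or timed out
        ok, message = False, f"{host}:{port} unreachable: {exc}"
    latency_ms = (time.monotonic() - start) * 1000.0
    return ProbeResult(ok, latency_ms, message)
```

For example, `probe_port("127.0.0.1", 8123)` would check the home-assistant port from the table, assuming it listens on the local host.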

Probes return pass/fail, latency, and a diagnostic message; the graph follows `requires` edges so a fallen dependency takes the blame, not everything downstream. Roughly 37 services every ten minutes (144 sweeps a day) comes to about 5,300 probes a day. Your haus gets more checkups than you do.
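The blame assignment can be sketched as a graph walk: a failed service counts as a root cause only if nothing it requires, directly or transitively, has also failed. The function name `find_root_causes`, the dict-of-lists graph shape, and the choice to follow dependencies transitively are assumptions of this sketch, and it presumes the dependency graph has no cycles.

```python
from typing import Dict, List, Set


def find_root_causes(requires: Dict[str, List[str]], failed: Set[str]) -> List[str]:
    """Return failed services that have no failed dependency beneath them."""

    def has_failed_dependency(service: str) -> bool:
        seen: Set[str] = set()
        stack = list(requires.get(service, []))
        while stack:
            dep = stack.pop()
            if dep in seen:
                continue
            seen.add(dep)
            if dep in failed:
                return True
            stack.extend(requires.get(dep, []))
        return False

    return sorted(s for s in failed if not has_failed_dependency(s))
```

With a hypothetical chain where `dashboard` requires `api` and `api` requires `db`, all three failing yields `["db"]`: only the bottom of the chain is remediated, and the other two get a chance to recover on their own.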

Memory Sentinel

On April 23, 2026, a runaway session pushed the 64 GB Mac Mini into OOM until WindowServer missed its watchdog check-in and the kernel panicked, twice in one evening. The service graph kept dutifully checking ports while the system asphyxiated, because knowing a service is up doesn’t help when there’s no RAM left to run it.

The memory sentinel exists because of that night. It runs before the service graph — which itself spawns Python, threads, and SSH connections, all of which cost memory. Running diagnostics on a starved system is the monitoring equivalent of operating on a burning patient.

How It Works

The sentinel doesn’t reason about whole-system free memory. It reasons about Sanctum’s own services, one at a time: it enumerates every running com.sanctum.* job, reads each PID’s RSS via `ps -o rss=` (which reports kilobytes), and compares it to a per-service limit in megabytes. Limits are tuned per service: mlx-builtin gets 20 GB because it holds a model in RAM, while a tunnel gets 64 MB, and anything over that is a leak. Override any limit under services.<key>.memory_limit_mb. The worst grade sets the overall status:

Level	Trigger	Action
OK	Every service under 80% of its limit	Exit 0, all clear
Warning	At least one service at or above 80%	Exit 0, log only
Critical	At least one service over 100% of its limit	Exit 1, kill the worst offender
Emergency	Three or more services over their limits	Exit 1, kill the worst offender

The “worst offender” is the service the most megabytes over its own limit. On a kill it sends SIGTERM, pauses, then SIGKILLs — and since every victim is a com.sanctum.* job, launchd restarts it clean.
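The grading rules are compact enough to sketch. The function names below (`parse_rss_mb`, `resolve_limits`, `grade`) are hypothetical and the real sentinel may structure this differently; the thresholds and the "most megabytes over" rule come from the table and paragraph above, and the kilobyte-to-megabyte conversion reflects the units `ps -o rss=` reports.

```python
from typing import Any, Dict, Optional, Tuple


def parse_rss_mb(ps_output: str) -> float:
    """Convert the output of `ps -o rss= -p <pid>` (kilobytes) to megabytes."""
    return int(ps_output.strip()) / 1024.0


def resolve_limits(defaults_mb: Dict[str, float], instance: Dict[str, Any]) -> Dict[str, float]:
    """Apply services.<key>.memory_limit_mb overrides on top of the defaults."""
    limits = dict(defaults_mb)
    for key, settings in (instance.get("services") or {}).items():
        if isinstance(settings, dict) and "memory_limit_mb" in settings:
            limits[key] = float(settings["memory_limit_mb"])
    return limits


def grade(rss_mb: Dict[str, float], limits_mb: Dict[str, float]) -> Tuple[str, Optional[str]]:
    """Return (level, worst_offender); services without a limit are skipped."""
    measured = {s: r for s, r in rss_mb.items() if s in limits_mb}
    over = {s: r - limits_mb[s] for s, r in measured.items() if r > limits_mb[s]}
    if over:
        # Worst offender: the service the most megabytes over its own limit.
        worst = max(over, key=lambda s: over[s])
        return ("emergency" if len(over) >= 3 else "critical"), worst
    if any(r >= 0.8 * limits_mb[s] for s, r in measured.items()):
        return "warning", None
    return "ok", None
```

Note that the worst offender is chosen by absolute overage, so a large service slightly over its limit can outrank a small one that has doubled in size.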

Watchdog Integration

The sentinel runs as Step 0.5 — the first thing each sweep does, before any service-graph work:

Watchdog check starting
  → Step 0.5: Memory sentinel pre-flight (--json)
    → If emergency/critical: --kill, notify, sleep 3s
    → If warning: log only
  → Step 1: validate manifests + service-graph check-all

On a kill it fires a critical notification and sleeps three seconds for the system to reclaim memory — so the check-all that follows sees a machine that has exhaled.
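Step 0.5 can be read as a small gate in front of the sweep. The sketch below uses hypothetical names (`memory_preflight` and its injected callables) and only mirrors the branching shown above; how the daemon actually invokes the sentinel is not modelled.

```python
import time
from typing import Callable


def memory_preflight(
    sentinel_level: Callable[[], str],
    kill_worst: Callable[[], None],
    notify: Callable[[str, str], None],
    log: Callable[[str], None] = print,
    sleep: Callable[[float], None] = time.sleep,
    reclaim_seconds: float = 3.0,
) -> bool:
    """Run the sentinel before the service graph; return True if it killed."""
    level = sentinel_level()
    if level in ("critical", "emergency"):
        kill_worst()
        notify("critical", f"memory sentinel killed a service ({level})")
        sleep(reclaim_seconds)  # let the system reclaim memory first
        return True
    if level == "warning":
        log("memory sentinel: warning, no action taken")
    return False
```

The caller would run this first and proceed to manifest validation and `check-all` regardless of the return value, which exists mainly for logging.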

Configuration

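The real knobs are the per-service memory_limit_mb overrides under each service’s key. The dedicated instance.yaml block also carries some forward-looking scaffolding; note that its free-percentage thresholds describe whole-system free memory, which the sentinel as described above does not evaluate: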

services:
  memory_sentinel:
    enabled: true
    auto_kill: true
    kill_grace_seconds: 5
    thresholds:
      warning_free_pct: 15
      critical_free_pct: 8
      emergency_free_pct: 5
    safelist:
      - "com.apple.Virtualization.VirtualMachine"
      - "sanctum-mlx"
      - "WindowServer"
      - "kernel_task"

Agent Integration

The health agent can invoke the sentinel through the service-doctor skill’s memory-check.sh, checking pressure and resetting a runaway service without SSH to the host.