
The Living Force

The Living Force — a ghostly figure standing in a server room of bioluminescent veins, healing sick nodes.

On the night of March 22, 2026, bridge100 didn’t come up. The VM booted into a world with no bridge to anywhere. Twenty-six services tried to start anyway — each one assuming the last had done its job — and cascaded into failure like a Jenga tower at a toddler’s birthday party.

The watchdog ran. It checked ports that had changed months ago. It pinged addresses that no longer existed. It reported: all clear. Meanwhile, Neo4j had entered an unrelated crash loop — its APOC plugin helpfully rewriting its own config into garbage on every restart, then dying on the garbage it had just written. The watchdog missed that too, because the watchdog was checking localhost:4001 and Neo4j was on localhost:7474. Close enough if you’re drunk.

A human noticed two hours later. In his underwear. At 4 AM.

The cascade — one bridge failure toppling twenty-six services like dominoes

That night exposed a truth the architecture had been politely hiding: the system didn’t understand itself. It had a list of services and a blunt instrument that restarted them. It had no concept of why a service was down, what depended on it, or whether retrying would make things worse. It was a smoke detector with no batteries, hanging on the wall for decorative purposes.

What followed was not a patch. It was the infrastructure equivalent of burning your haus down and rebuilding it with actual load-bearing walls this time.

The Living Force — from flat watchdog to 6-phase self-healing organism

The old watchdog was a security guard asleep at the desk with the monitors turned off. The Living Force is an immune system — it maps its own body, detects illness at the cellular level, quarantines what it can’t fix, and learns from every infection. It also holds committee meetings about its own improvement, which is either inspiring or dystopian depending on how you feel about AI governance.

The Living Force council — Tommy at the governance table where incidents become doctrine instead of folklore

The first shape is governance. Incidents are not just resolved; they are turned into explicit doctrine, manifests, and escalation rules. The system has opinions now, which is how you know it has finally become difficult in a more sophisticated way.

The Living Force router — Tommy at the remediation core deciding whether a service gets restarted, repaired, or quarantined

The second shape is routing. A healthy system distinguishes between a dead dependency, a bad config, a transient crash, and a service that needs to be quarantined before it embarrasses itself again. Restarting everything blindly is not healing. It is percussion.

The Living Force evolution spiral — Tommy filing incidents into a machine that turns mistakes into architecture

The third shape is evolution. Every failure becomes part of the next design decision. Postmortems, proposal synthesis, feature adoption, and calibration all exist to ensure the same class of mistake has to work harder the second time.

Phase 1: Service Graph

Every service gets a YAML manifest declaring its ports, dependencies, health checks, and failure modes. A topological sort builds the dependency DAG. When something breaks, the system traces the graph to the root cause instead of restarting everything and hoping.
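
The manifest grammar itself isn’t reproduced on this page, so the field names below (name, depends_on) are assumptions; the mechanic, though, is a plain topological sort over the declared edges. A minimal Python sketch:

```python
from collections import deque
from pathlib import Path

import yaml  # pip install pyyaml


def load_graph(manifest_dir: str) -> dict[str, list[str]]:
    """Map each service to its declared dependencies, one YAML manifest each."""
    graph: dict[str, list[str]] = {}
    for path in sorted(Path(manifest_dir).glob("*.yaml")):
        manifest = yaml.safe_load(path.read_text())
        graph[manifest["name"]] = manifest.get("depends_on", [])
    return graph


def start_order(graph: dict[str, list[str]]) -> list[str]:
    """Kahn's algorithm: every dependency comes up before its dependents."""
    indegree = {svc: 0 for svc in graph}
    dependents: dict[str, list[str]] = {svc: [] for svc in graph}
    for svc, deps in graph.items():
        for dep in deps:
            if dep not in graph:  # external dependency: assume already up
                continue
            indegree[svc] += 1
            dependents[dep].append(svc)
    queue = deque(svc for svc, deg in indegree.items() if deg == 0)
    order: list[str] = []
    while queue:
        svc = queue.popleft()
        order.append(svc)
        for child in dependents[svc]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if len(order) != len(graph):  # leftovers mean a dependency cycle
        raise ValueError(f"cycle among: {sorted(set(graph) - set(order))}")
    return order
```

The same graph, walked in the other direction from a failing node, is what turns “twenty-six services down” into “bridge100 is down.”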

Phase 2: Immune System

A metrics collector feeds anomaly detection. Failures escalate through a remediation ladder — restart, then repair, then quarantine. Services stuck in crash loops get isolated instead of hammered with retries. The system that lies about its health is more dangerous than the system that fails.
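
The ladder’s shape as a minimal sketch. The three rungs mirror the prose above; the window and limit are illustrative numbers, not the watchdog’s real tuning:

```python
import time

CRASH_LOOP_WINDOW = 300  # seconds of history to consider (illustrative)
CRASH_LOOP_LIMIT = 3     # restarts inside the window before quarantine


class Remediator:
    """Escalate restart -> repair -> quarantine instead of hammering retries."""

    def __init__(self) -> None:
        self.restarts: dict[str, list[float]] = {}

    def handle_failure(self, service: str) -> str:
        now = time.time()
        recent = [t for t in self.restarts.get(service, [])
                  if now - t < CRASH_LOOP_WINDOW]
        if len(recent) >= CRASH_LOOP_LIMIT:
            return "quarantine"  # isolate; stop feeding the crash loop
        recent.append(now)
        self.restarts[service] = recent
        # the first failure gets the cheap fix, repeats get the deeper one
        return "restart" if len(recent) == 1 else "repair"
```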

Phase 3: Agent Autonomy

Agents gain the code-forge skill: the ability to write, test, and deploy fixes through a staging pipeline with an audit log. Deployments happen during a night window when the haushold is asleep. Yes, the robots fix things while you dream. No, this is not how Terminator starts. Probably.

Phase 4: Tech Lookout

Jocasta scans for CVEs, dependency updates, and knowledge frontier shifts on a daily cadence. New vulnerabilities get flagged before they become incidents. The system stops being surprised by the things it should have seen coming.

Phase 5: Battle Testing

Chaos-forge runs scheduled fire drills — killing services, severing bridges, corrupting configs — and measures how fast the immune system responds. Think of it as a fire drill where the AI sets the actual fire. On purpose. Monthly. You’re welcome.
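
A drill in miniature: kill a target, then time how long the immune system takes to bring it back. The liveness check and process names here are stand-ins for whatever chaos-forge actually injects:

```python
import subprocess
import time


def is_alive(process_name: str) -> bool:
    """pgrep exits 0 when a process with that exact name exists."""
    return subprocess.run(["pgrep", "-x", process_name],
                          capture_output=True).returncode == 0


def drill(process_name: str, timeout_s: float = 300.0) -> float:
    """Set the fire on purpose, then measure seconds until self-heal."""
    subprocess.run(["pkill", "-x", process_name], check=False)
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        if is_alive(process_name):
            return time.monotonic() - start
        time.sleep(1.0)
    raise TimeoutError(f"{process_name} never came back: drill failed")
```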

Phase 6: Continuous Evolution

Every incident feeds a learning loop. Performance reviews surface degradation trends. Evolution reports propose architectural changes. The system doesn’t just heal — it holds post-mortems, writes improvement proposals, and argues with itself about priorities. It’s basically a startup with no humans and no funding rounds.

Phase 7: Genetic Health

The system expands into the biological layer, recognizing neuro-diversity (ADHD, Dyslexia, ASD) as a first-class cognitive profile. Cilghal’s genome-mcp analyzes the owner’s 23andMe data to suggest optimal working environments and cognitive scaffolding. Biology informs collaboration.

Phase 8: Centralized Calibration

All hardcoded coordinates are eliminated. The Neural Link (Port 1138) and the Sanctum Watchdog (Port 2187 — the daemon sanctumd from sanctum-rs, label com.sanctum.watchdog, which serves the Living Force API) draw their configurations dynamically from a single master holocron-config.yaml. The system reads its own DNA to align its ports and paths upon every ignition.
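
A sketch of the pattern, assuming nothing about the real schema beyond a services map with ports; the file path and key names here are placeholders:

```python
from functools import lru_cache

import yaml  # pip install pyyaml

HOLOCRON = "holocron-config.yaml"  # real path not shown on this page


@lru_cache(maxsize=1)
def holocron() -> dict:
    """Read the master config once per ignition; everything resolves from it."""
    with open(HOLOCRON) as f:
        return yaml.safe_load(f)


def port_of(service: str) -> int:
    """No hardcoded coordinates: every port lookup goes through the holocron."""
    return holocron()["services"][service]["port"]

# Illustrative lookups, assuming the master file declares these keys:
#   port_of("neural-link")      -> 1138
#   port_of("sanctum-watchdog") -> 2187
```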

The Immune System phase sounds elegant in the abstract — anomaly detection, remediation ladders, quarantine protocols. In practice, it means Qui-Gon stares at Home Assistant every thirty minutes and asks: “Are the lights still talking to us? How about the cameras? The thermostat? The thing that knows whether the doors are locked?”

The answer, with alarming regularity, is no.

Home Assistant manages 55 Tuya smart lights (cloud API), 4 Ecobee sensors (HomeKit Controller), 4 Ring cameras, 10 Sonos speakers, Alarmo (alarm panel), and 24 automations that do everything from turning on the porch light at sunset to arming the haus when everyone leaves. Each integration has its own failure mode, its own opinion about reconnection, and its own way of dying silently while the dashboard stays green.

The ha-self-healer skill at ~/Projects/openclaw-skills/ha-self-healer/ is Phase 2’s concrete implementation. It runs a five-stage pipeline (a driver sketch follows the list):

  1. Diagnose (ha-diagnose.sh) — queries the HA API for every integration, entity, and automation. Assigns a severity: 0 (OK), 1 (warning), 2 (attention), 3 (degraded), 4 (critical).
  2. API Heal (ha-heal-api.sh) — reloads integrations, restarts the container, re-enables automations that tripped. The kind of fixes you’d do from the settings page, except at 3 AM without waking anyone.
  3. UI Heal (ha-heal-ui.js) — headless Playwright. For the problems that can only be fixed by clicking through a browser — Tuya OAuth re-authentication being the prime offender.
  4. Verify (ha-verify.sh) — runs the diagnostic again. If severity dropped, declare victory. If not, escalate.
  5. Escalate — pings Yoda. Something is structurally wrong and an agent with higher-tier model access needs to look at it.
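
Stitched together, the stages form a loop with a severity gate between rungs. A sketch of the driver, assuming ha-diagnose.sh prints its severity number on stdout; the escalation hook is a stand-in for the real Yoda ping:

```python
import subprocess


def severity() -> int:
    """Stages 1 and 4: run the diagnostic, read back a severity of 0..4."""
    out = subprocess.run(["./ha-diagnose.sh"], capture_output=True, text=True)
    return int(out.stdout.strip())


def escalate(sev: int) -> None:
    """Stage 5 stand-in: the real pipeline pings Yoda."""
    print(f"escalating: severity {sev} survived both heal stages")


def run_pipeline() -> int:
    before = severity()
    if before == 0:
        return 0                               # nothing to heal
    for stage in (["./ha-heal-api.sh"],        # stage 2: API-level fixes
                  ["node", "ha-heal-ui.js"]):  # stage 3: headless Playwright
        subprocess.run(stage, check=False)
        after = severity()                     # stage 4: re-diagnose to verify
        if after < before:
            return after                       # severity dropped: victory
    escalate(before)
    return before
```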

Two incidents illustrate the difference between “monitoring” and “understanding”:

The Ecobee Incident. Four temperature sensors went offline. The healer’s diagnosis traced the failure to a stale HomeKit Controller config entry — the kind of ghost that survives a reboot and blocks rediscovery. The API heal stage deleted the stale entry. HA auto-rediscovered the sensors within minutes. No human involved. No 4 AM underwear.

The Great Tuya Blackout. 48 lights went offline simultaneously. The healer ran its diagnosis, saw that every Tuya entity had failed at the same timestamp, and correctly identified this as a cloud connection drop — not a software failure. It did not attempt to reload the integration 48 times. It did not launch Playwright. It logged the event, set severity to 3, and waited. The cloud came back twenty minutes later. The lights came back with it. The healer verified and closed the incident.
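
The Tuya call comes down to a correlation test: when every entity of one integration fails inside the same narrow window, the fault is upstream and per-device remediation is noise. A sketch of that test, with failed_at as an assumed timestamp field:

```python
from datetime import datetime, timedelta


def looks_like_cloud_drop(failures: list[dict], window_s: int = 60) -> bool:
    """True when one integration's entities all died in a single burst."""
    times = sorted(datetime.fromisoformat(f["failed_at"]) for f in failures)
    return len(times) > 1 and times[-1] - times[0] <= timedelta(seconds=window_s)

# If True: log it, set severity, wait for the cloud. Do not reload 48 times.
```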

Four incidents that taught the system something the diagrams couldn’t. Each one is a paragraph because the lesson is the point; the forensics live in the annex.

The Metal Crash. On April 1, 2026, sanctum-server spawned mlx_lm.server, which called abort() fifteen seconds later on a Metal command-buffer error. The gateway kept polling a dead child for two minutes before returning a timeout. Readiness-checking without process-death-checking is a bouncer carding patrons at a building that already burned down. The fix: try_wait() on every poll tick, stderr captured and pattern-matched, three retries with exponential backoff, and a new E2E phase that kills the backend during startup on purpose. Principle 9 again — the tests were green because they tested a world where Metal worked. See the principles below.
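
The gateway fix is Rust’s try_wait(); the same shape in Python is Popen.poll(). A sketch of readiness-polling that also checks for a dead child, with probe standing in for whatever health check the service exposes:

```python
import subprocess
import time
from typing import Callable


def wait_until_ready(cmd: list[str], probe: Callable[[], bool],
                     timeout_s: float = 120.0) -> subprocess.Popen:
    """Poll for readiness AND for child death; never only one of the two."""
    child = subprocess.Popen(cmd, stderr=subprocess.PIPE, text=True)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if child.poll() is not None:        # the try_wait() analogue
            stderr = child.stderr.read()    # captured for pattern-matching
            raise RuntimeError(f"backend died during startup: {stderr.strip()}")
        if probe():                         # e.g. a health-endpoint hit
            return child
        time.sleep(0.5)
    child.kill()
    raise TimeoutError("backend never became ready")
```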

The Phantom Heal Loop. On April 6, 2026, the Living Force performed 28 heals in four hours on three services that weren’t broken. model-scout, post-boot, and vm were one-shot scripts declared as persistent services; the watchdog kept “fixing” them by restarting jobs that had finished doing their job. The bug wasn’t in the watchdog. The bug was in the contract. When a manifest promises a persistent process and the service delivers a one-shot, perfect execution of the contract makes things worse — the distilled form is Principle 8 below.
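
The distilled fix is a field in the contract: the manifest has to say whether exiting counts as failure. A sketch, with lifecycle as an assumed field name:

```python
def exit_is_failure(manifest: dict, exit_code: int) -> bool:
    """A one-shot that exits 0 finished its job; only persistents get healed."""
    lifecycle = manifest.get("lifecycle", "persistent")  # assumed field name
    if lifecycle == "oneshot":
        return exit_code != 0  # a clean exit is completion, not an outage
    return True                # a persistent service should never exit
```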

The Enforcement That Said Yes. On April 18, 2026, Force Flow logged BLOCKED seven times in a row while five of seven devices kept streaming Netflix. The Firewalla bridge was returning {"success": false, "errors": [{}]} on HTTP 200, and a bare if result: check passes for any non-empty dict — including one that explicitly says the command did not land. The fix was a two-line guard and a new rule: every write to an external control plane is followed by a read, and the two must agree. Commands can lie as fluently as dashboards. Screen Time enforcement details →
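
The guard, in the shape the incident implies; a non-empty dict is truthy even when it carries success: false:

```python
def command_landed(result: dict | None) -> bool:
    """HTTP 200 with a non-empty body is not success; the body must say so."""
    return bool(result) and result.get("success") is True

# The companion rule: after every write to the external control plane,
# issue a read and require the two to agree before reporting BLOCKED.
```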

The Desktop Sync Healers. On April 15, 2026, the Living Force extended into user-space: Apple Mail, Messages, WhatsApp, Signal Desktop, and Telegram all depend on local SQLite databases that only sync while the app is running. Watchdog now monitors them via pgrep and reopens them with open -g -a <App> when they vanish. For Mail specifically, the auto-healer parses crash logs and clears corrupted Envelope Index files before restart. The sync pipeline stopped falling behind. The user stopped noticing the app was ever closed.
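
A sketch of that watch-and-reopen loop. The process names are an illustrative subset; pgrep -x matches an exact process name, and open -g -a launches an app without stealing focus:

```python
import subprocess

# process-name -> app-name; illustrative subset of the monitored apps
APPS = {"Mail": "Mail", "Messages": "Messages", "Signal": "Signal"}


def is_running(process_name: str) -> bool:
    """pgrep exits 0 when a process with that exact name exists."""
    return subprocess.run(["pgrep", "-x", process_name],
                          capture_output=True).returncode == 0


def reopen_in_background(app_name: str) -> None:
    """-g keeps the reopened app from taking focus away from the human."""
    subprocess.run(["open", "-g", "-a", app_name], check=False)


for proc, app in APPS.items():
    if not is_running(proc):
        reopen_in_background(app)
```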

The Living Force has bones (services), muscles (agents), and an immune system (watchdog, sentinel, pressure-valve, canary). What it didn’t have through most of its life was the fascia — the connective tissue that lets bones and muscles coordinate with the immune system in real time, under pressure, everywhere at once. That layer shipped late April 2026 and now wraps the whole organism.

In yogic philosophy the word is chitti-shakti — pure consciousness, the ever-changing awareness-under-pressure that sits between embodiment and mind. The five koshas (annamaya, pranamaya, manomaya, vijnanamaya, anandamaya) are sheaths from dense to subtle; each modulates the next. In biotensegrity research, the equivalent is the fluid matrix of the fascial network — both the medium signals travel through and the tissue that rearranges itself in response to load. Ground fluid and fluid ground.

The Living Force runs reflexes (fast, local). Chitti is what modulates those reflexes (slow, learned, directional). One regulates. The other modulates.

sanctum-chitti is a Rust daemon on 127.0.0.1:2188. Each kosha is a real endpoint:

  • Annamaya/fluid: pressure block (memory available, swap, thermal probe).
  • Pranamaya/presence: mid-flight work signals.
  • Manomaya/mood: directional posture (Poised · Cautious ~ Conserving ⌣ Healing ↺ Alert !) with 30 s hysteresis.
  • Vijnanamaya/samskara: learned action grooves; the watchdog consults them before remediation.
  • Anandamaya/attention: what the human is doing; can override every kosha below it.

The watchdog’s old reflexes (circuit breaker, restart budget, dedup, memory gate) are now modulated by manomaya before each pass. The old restart_cmd is now one entry in an ordered actions list; the picker reads vijnanamaya and switches actions when the primary stops working. Wisdom flows back into reflex.
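
Mechanically, “modulated by manomaya” means the watchdog reads the posture before each pass and scales its reflexes to it. A sketch over HTTP: only the address 127.0.0.1:2188 comes from above, so the endpoint path, payload shape, and budget mapping are all assumptions:

```python
import json
import urllib.request

CHITTI = "http://127.0.0.1:2188"


def current_posture() -> str:
    """Ask manomaya for the body's posture; fail safe to Poised."""
    try:  # /manomaya and the {"posture": ...} shape are hypothetical
        with urllib.request.urlopen(f"{CHITTI}/manomaya", timeout=2) as resp:
            return json.load(resp).get("posture", "Poised")
    except OSError:
        return "Poised"


def restart_budget(posture: str) -> int:
    """Illustrative mapping: calmer postures get bolder reflexes."""
    return {"Poised": 3, "Cautious": 2, "Conserving": 1,
            "Healing": 1, "Alert": 0}.get(posture, 1)
```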

See the chitti architecture page → for the full body.

Ten rules that emerged from the wreckage. We had twelve until someone pointed out that the Commandments and Burning Man’s principles both stop at ten, which is the kind of cosmic peer pressure you don’t argue with — so we merged the duplicates and stopped pretending two of them were separate. None of these were obvious before March 22. All of them are obvious now, which is how you know they were expensive lessons.

The Living Force is a roadmap, not a finished building. Read the stable doctrine here, then use the operations pages for the machine as it actually exists.

The night of March 22 broke twenty-six services. It also broke the assumption that a system this complex could be managed by a flat loop and a restart command. What replaced it is still growing — still learning from its own failures, still arguing with itself about what to build next.

Which, if you think about it, is the most alive thing a system can do.

The Living Force keeps a journal. Every dated entry below is a moment the architecture learned something — a kernel panic, a failover drill, a daemon that killed the thing it was meant to protect. The principles above are the distilled lessons; these are the evenings the lessons were earned. The most recent entries are shown here; the rest live in the archive.

  • 2026-04-30 — The Gates Held — pristine pass on the council self-heal. Schema-mismatch caught: new-grammar manifests were silently dropped by service-graph.py’s legacy parser; rewrote four manifests in legacy shape with cmd: paths pointing at the chitti-gated heal-actions; new-grammar versions stashed for the eventual catalog-parser PR. Watchdog chitti_client::Posture got the Fever mirror it was missing (Yoda’s enum-mirror gap closed, glyph ^, breaker:1, budget:1, body in shock). Six drills run live: outline-kill (defer-during-sleep correct), jsonl-lock-flood (50→0 after sentinel→heal-action chain wired in), do-not-heal (autoimmune allowlist holds), agent-trace-collection, quarantine-recovery, quiet-defer — three real assertion-shape bugs found and fixed by actually running them. Force Flow was binding 0.0.0.0:4077; added loopback_guard middleware that 403s POST /notify from non-loopback peers (Windu’s gate). Source extracted into sanctum-runtime/force-flow/ so the gate stays durable across redeploys.
  • 2026-04-29 — First Breath — the council living-force self-heal lands. Five plists bootstrapped in safety order; ram-sentinel fires its inaugural decision (graceful-restart-lmstudio), records success=false, declines to act on the next tick because top-RSS shifts to a tool it doesn’t have. Schema mismatch found: sanctumd’s deserializer rejects the new actions[] grammar — the new heal-actions are scripts that work by hand, not reflexes the body fires automatically yet. Outline back from five days of darkness via manual compose-up. The body is half-wired, breathing, in Healing posture.
  • 2026-04-26 — Wisdom Informs Reflex — chitti grows two more axes (mood + samskara), then a third (attention) and a taxonomy (the five koshas). Watchdog learns to consult its grooves: when the primary action fails confidently, the picker rotates to the alternative. Live demo within minutes of deploy — compose-up failing 0/6 → switched to colima-restart-then-up. Direction loops. Heart broken open.
  • 2026-04-24 — Five Locks on the Voice Door — sanctum-xtts renamed to sanctum-tts. Adapter dispatcher on :8007 with five layers of defense: mTLS, bearer-token ACL, reference-clip pinning, Ed25519 signing, hashed audit log. Two new CLIs (admin + verify). The voice that will answer the phone is now safe to put on a line.
  • 2026-04-24 — Every Jedi Answers to Their Name — LM Studio coder-14b gets a memory-gated auto-load and an hourly council roll-call. The probe that surfaced the chitti gap in the same breath — a local pressure gate exposing the absence of a global one.
  • 2026-04-21 — The mTLS Day — five probes migrated in the morning, the sanctum-server router’s code shipped late-night, the full failover matrix proven overnight on the MBP shadow.
  • 2026-04-20 — The Pressure Valve Trilogy — a kernel panic at midnight, a Rust daemon shipped before lunch, and the same daemon killing the service it existed to protect by dinner. Five corrections and a dry-run window.
  • 2026-04-20 — The A+ Roadmap Closes — Principle 1 becomes a 5-minute LaunchAgent, streaming-path metrics land, HA failover is exercised under real failure, and the last two ASC/sanctum-server doors close in a single session.
  • 2026-04-19 — The Reasoner That Went Quiet — 25 hours of green probes, then the service answered a real question with nothing at all. Telemetry caught what probes couldn’t.
  • 2026-04-18 — The Off-Catalogue Audit — five services found running that no manifest knew about, five registered, watchdog goes from 33/33 to 38/38 healthy.
  • 2026-04-17 — The Full-Stack Health Sweep — a narrow CLI fix turns into a four-hour diagnostic across nine components. Several things that claimed to be running were not.
  • 2026-04-15 — Living Force Manifest Deployment — seven YAML manifests, 148,181 errors categorized, one broken symlink restored. The immune system gets its memory back.
  • 2026-04-07 — Provider Overhaul — five independent provider paths, self-healing fallbacks, Signal migrated off Docker, and the quiet death of fnm.

Earlier entries → operational-history