
Engineering Discipline

Engineering Discipline — the security fortress where every commit is tested and every key is rotated.

Date: 2026-04-11
Status: Active

Running AI agents in your haus is either professional engineering or professional negligence. There is no middle ground. The same system that controls your kid’s screen time also holds your API keys, your health data, and the credentials to your fund’s deal flow pipeline. If that sentence doesn’t make you want to write tests, nothing will.

This page documents the standards that keep Sanctum from becoming a cautionary tale — the kind that gets forwarded around Hacker News with subject lines like “man lets AI agents manage home, discovers API key on public GitHub three weeks later.” Every number below is verified by automated tests. If it’s not tested, it doesn’t exist.

166 tests across 10 components. No untested code ships. No exceptions. No “we’ll add tests later.” Later is where bugs go to breed.

| Component | Tests | What They Verify |
|---|---|---|
| Smart Router (Rust) | 23 | Pattern matching, intent classification, mock backend dispatch |
| LoRA in Rust | 27 | Config parsing, quantize roundtrip, 248-pair merge, weight hash |
| Sanctum Olympics | 31 | Config loading, task YAML validation, judge parsing, mock e2e |
| Gemma 4 Pipeline | 21 | Data conversion, enrichment, training config, 2-iter run, memory |
| Cloud Proxy | 16 | Health, models, daily caps, fallback chain, agent extraction, audit |
| Carmack Eval v2 | 19 | Task structure, scoring rules, partial scores, HTTP error handling |
| Key Rotation | 5 | Dry-run, verify, keychain access, mask function |
| Secret Rotation | 10 | Scan, keychain audit, token patterns, usage display |
| Docs Quality | 8 | Frontmatter, CSS hygiene, inline styles, image paths, hero uniqueness |
| Build Release | 6 | Prerequisites, binary, metallib, dist packaging, deploy flags |
| Total | 166 | |
```bash
# Run everything. Every line starts from the same directory; the subshells keep cd from leaking.
(cd sanctum-rs && cargo test -- --test-threads=1)          # 50 Rust tests (23 router + 27 LoRA)
(cd sanctum-olympics && python test_olympics.py)           # 31 Python tests
(cd mlx-finetune && python tests/test_gemma4_pipeline.py)  # 21 pipeline tests
(cd mlx-finetune && python tests/test_carmack_v2.py)       # 19 eval tests
python tests/test_cloud_proxy.py                           # 16 proxy tests
bash tests/test_sanctum_docs.sh                            # 8 docs tests
bash ~/.sanctum/tests/test_rotate_openrouter.sh            # 5 rotation tests
bash ~/.sanctum/tests/test_secret_rotation.sh              # 10 secret tests
bash tests/test_build_release.sh                           # 6 build tests
```

Every test suite runs independently. Every test suite can be run by a human who just woke up and doesn’t remember what day it is. This is a design requirement, not an accident.
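One way to keep that independence honest is a runner that gives each suite its own process and working directory, and never lets one failure hide another. This is a minimal sketch, assuming the directory layout implied by the commands above. `Suite`, `SUITES`, and `run_suites` are hypothetical names, not part of the Sanctum codebase, and only three of the suites are listed.

```python
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Suite:
    name: str
    cwd: str        # directory to run in, relative to the workspace root
    command: list   # argv for the suite


# Illustrative subset; the directories mirror the commands shown earlier.
SUITES = [
    Suite("rust", "sanctum-rs", ["cargo", "test", "--", "--test-threads=1"]),
    Suite("olympics", "sanctum-olympics", [sys.executable, "test_olympics.py"]),
    Suite("docs", ".", ["bash", "tests/test_sanctum_docs.sh"]),
]


def run_suites(suites, root="."):
    """Run every suite in its own process and working directory.

    Returns {suite name: exit code}. A failing suite never stops the others.
    """
    results = {}
    for suite in suites:
        proc = subprocess.run(suite.command, cwd=Path(root) / suite.cwd)
        results[suite.name] = proc.returncode
    return results
```

Calling `run_suites(SUITES)` from the workspace root returns one exit code per suite, so a red suite is visible by name instead of buried in scrollback.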

Zero hardcoded API keys. Anywhere. Verified across 31 repos in 2 GitHub organizations. This is the line in the sand that does not move.

The API key was hardcoded in a Python file pushed to a public GitHub repo. For three weeks. Nobody noticed. The key had full API access. The bill was… educational. That was the last time a secret touched source control, and the infrastructure that prevents it from ever happening again is aggressive by design.

| Check | Status |
|---|---|
| Hardcoded keys in source files | 0 (all repos, both machines) |
| Hardcoded keys in public repos | 0 (Ogilthorp3 + Triptyq-Capital) |
| API keys in git history | Neutralized (old keys revoked) |
| OpenRouter key storage | macOS Keychain (both machines) |
| OpenRouter Management key | macOS Keychain (for programmatic rotation) |
| Key rotation | Zero-touch via rotate-openrouter.sh |
| Secret scanning | 6-phase sweep via secret-rotation.sh |
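The token-pattern part of a secret scan can be sketched in a few lines. This is not the actual secret-rotation.sh, whose six phases are not spelled out here. It is a minimal Python illustration of pattern matching plus masking, and the key shapes in `TOKEN_PATTERNS` are assumptions to adjust per provider. `mask` and `scan_text` are hypothetical names.

```python
import re

# Assumed key shapes, for illustration only. Verify against each provider.
TOKEN_PATTERNS = {
    "openrouter": re.compile(r"sk-or-v1-[0-9a-f]{64}"),
    "anthropic": re.compile(r"sk-ant-[A-Za-z0-9_-]{20,}"),
}


def mask(secret, visible=4):
    """Hide all but the last `visible` characters of a secret."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


def scan_text(text):
    """Return (line_number, provider, masked_match) for every suspected key."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for provider, pattern in TOKEN_PATTERNS.items():
            for match in pattern.finditer(line):
                findings.append((lineno, provider, mask(match.group())))
    return findings
```

Masking before reporting matters: a scanner that prints the full key into a log has just created a second leak.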
```text
macOS Keychain
├── openrouter-api-key  → Used by cloud proxy + benchmarks
├── openrouter-mgmt-key → Used by rotation script (admin-only)
└── anthropic-api-key   → Used by coding-llm-bench

rotate-openrouter.sh:
  1. Create new key (via Management API)
  2. Verify new key works (test completion)
  3. Store in Keychain (both machines)
  4. Delete old key (via Management API)

Zero browser. Zero human. Zero downtime.
```
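Reading one of those entries at runtime might look like the sketch below. It assumes the keys are stored as generic-password items whose service name matches the labels above, which the source does not confirm. `read_keychain_secret` is a hypothetical helper built on the stock macOS `security` tool.

```python
import subprocess


def read_keychain_secret(service, run=subprocess.run):
    """Fetch a generic-password item from the macOS Keychain by service name.

    `run` is injectable so the function can be exercised without a real Keychain.
    """
    result = run(
        ["security", "find-generic-password", "-s", service, "-w"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise LookupError(f"no Keychain item for service {service!r}")
    # -w prints only the password, followed by a newline.
    return result.stdout.strip()
```

The secret lives in process memory only for as long as the caller holds it. Nothing is written to a file, an env file, or a repo.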

The rotation script doesn’t ask for permission. It creates the new key, proves it works, stores it, and kills the old one. The entire cycle takes under ten seconds. If you’re still rotating API keys by logging into a dashboard and clicking buttons, you’re doing it wrong — and more importantly, you’re doing it rarely, which means your keys are old, which means the blast radius when one leaks is measured in months.
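The ordering is the whole trick: the old key dies only after the new one has proven itself and been stored. Here is a sketch of that sequence, with the four callables standing in for the Management API and Keychain calls. Their names and signatures are hypothetical, and the rollback on failed verification is an assumption about sensible behavior, not a documented feature of rotate-openrouter.sh.

```python
def rotate_key(create, verify, store, delete, old_key_id):
    """Create, verify, store, then delete the old key, in that order.

    create()         -> (new_key_id, new_secret)
    verify(secret)   -> True if a test request succeeds
    store(secret)    -> persist the secret (e.g. in the Keychain)
    delete(key_id)   -> revoke a key by id
    Returns the new key's id.
    """
    new_id, new_secret = create()
    if not verify(new_secret):
        # Roll back: revoke the unproven key and leave the old one live.
        delete(new_id)
        raise RuntimeError("new key failed verification; old key kept")
    store(new_secret)
    delete(old_key_id)
    return new_id
```

Because the old key stays valid until the final step, a failure at any earlier point leaves the system exactly where it started.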

| Metric | Before | After |
|---|---|---|
| LaunchAgents: disabled/backup files | 23 | 0 |
| Shell scripts without `set -euo pipefail` | 13 | 0 |
| Rust compiler warnings | 4 | 0 |
| Stale adapter directories | 51 GB | 1.9 GB (2 active) |
| CSS !important overrides | 36 | 0 |
| Inline CSS styles in MDX | 9 | 0 |
| Legacy image path conventions | 55 | 0 |
| instance.yaml stale services | 5+ | 0 (all annotated) |

Every row in that table represents a class of problem that existed, was measured, and was eliminated. The 23 disabled LaunchAgent files were ghosts from previous configurations — not hurting anything, not helping anything, just cluttering the namespace and making it harder to tell what was real. The 13 shell scripts without set -euo pipefail were time bombs waiting for a failed command to silently continue. The 51 GB of stale adapters were eleven training experiments that nobody cleaned up because nobody knew which ones were active.
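A count like "13 scripts without strict mode" only stays at zero if something keeps checking. Below is a minimal sketch of such a check. `missing_strict_mode` is a hypothetical helper, and treating any uncommented line that starts with `set -euo pipefail` as compliant is a simplifying assumption.

```python
STRICT_LINE = "set -euo pipefail"


def missing_strict_mode(scripts):
    """Return the names of scripts lacking an uncommented strict-mode line.

    `scripts` maps a file name to that file's contents.
    """
    missing = []
    for name, body in scripts.items():
        lines = (line.strip() for line in body.splitlines())
        # A commented-out line starts with '#', so it does not match.
        if not any(line.startswith(STRICT_LINE) for line in lines):
            missing.append(name)
    return sorted(missing)
```

Fed the contents of every `.sh` file in a repo, an empty return value is the "0" in the After column.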

The cloud proxy (sanctum-cloud-proxy.py) sits between the Smart Router and OpenRouter. It exists because giving six AI agents unlimited access to a pay-per-token cloud API is how you wake up to a surprise invoice. Five layers of protection:

| Layer | What It Does |
|---|---|
| Key security | API key from macOS Keychain only; never in code, env, or git |
| Fallback chain | Cloud → Coder-14B → Council MLX → error |
| Cost control | Per-agent daily caps ($2/day Windu, $0.50 default) |
| Audit trail | Every cloud call logged: agent, tokens, cost, latency |
| Rate limiting | Circuit breaker on spending caps |
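The cost-control, audit, and circuit-breaker rows fit together naturally. The sketch below shows one way they could, using the cap values from the table. `SpendTracker` and its methods are hypothetical, and tripping the breaker once spend reaches the cap (rather than before a call that would exceed it) is an assumption, not a description of sanctum-cloud-proxy.py.

```python
from collections import defaultdict
from datetime import date

DEFAULT_CAP_USD = 0.50
AGENT_CAPS_USD = {"Windu": 2.00}  # per-agent overrides, from the table


class SpendTracker:
    def __init__(self, caps=None, default_cap=DEFAULT_CAP_USD):
        self.caps = dict(AGENT_CAPS_USD if caps is None else caps)
        self.default_cap = default_cap
        self.spent = defaultdict(float)  # (agent, day) -> dollars spent
        self.audit_log = []

    def cap_for(self, agent):
        return self.caps.get(agent, self.default_cap)

    def allow(self, agent, day):
        """Circuit breaker: refuse cloud calls once the daily cap is reached."""
        return self.spent[(agent, day)] < self.cap_for(agent)

    def record(self, agent, day, tokens, cost_usd, latency_ms):
        """Add the call's cost to the daily total and append an audit entry."""
        self.spent[(agent, day)] += cost_usd
        self.audit_log.append(
            {"agent": agent, "day": day.isoformat(), "tokens": tokens,
             "cost_usd": cost_usd, "latency_ms": latency_ms}
        )
```

Keying the totals by `(agent, day)` means the cap resets at midnight without any cleanup job. A new day is simply a new key.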

The fallback chain is the interesting part. When Opus is unavailable or the daily cap is hit, the request falls to Coder-14B. When Coder-14B is down, it falls to Council MLX. When everything is down, the system returns an honest error instead of a hallucinated response. Graceful degradation, not graceful lying.
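In code, the chain is an ordered list and a loop. This is a minimal sketch rather than the proxy's actual implementation. `complete_with_fallback` and `AllBackendsDown` are hypothetical names, and signalling "unavailable" by raising an exception is an assumed convention.

```python
class AllBackendsDown(Exception):
    """Raised when every backend in the chain has failed."""


def complete_with_fallback(prompt, backends):
    """Try each (name, call) pair in order and return the first success.

    Returns (backend_name, response). If every backend raises, the collected
    errors are surfaced in one exception instead of a made-up answer.
    """
    errors = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise AllBackendsDown("; ".join(errors))
```

A cap-exceeded check slots in as just another reason for the cloud backend to raise, which sends the request down the chain without special-casing.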

Every commit follows the same discipline:

  1. Code — write it
  2. Test — prove it works
  3. Document — explain why it exists
  4. Scan — verify no secrets leaked
  5. Push — ship it

If any step fails, the commit doesn’t ship. This is not optional. This is not aspirational. This is how the haus runs.
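"If any step fails, the commit doesn't ship" is a short-circuit over an ordered list of gates. Here is a sketch under the assumption that each gate is a function returning True on success. `run_gates` is a hypothetical name, and the real checks (test runner, docs check, secret scan, push) would be plugged in as the callables.

```python
def run_gates(gates):
    """Run (name, check) pairs in order and stop at the first failure.

    Returns (shipped, names_of_gates_run). Gates after a failure never run,
    so a failed scan can never be followed by a push.
    """
    ran = []
    for name, check in gates:
        ran.append(name)
        if not check():
            return False, ran
    return True, ran
```

Putting the push last in the list is what makes the rule enforceable rather than aspirational: there is no code path that reaches it past a red gate.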

Why go this far? Because the only thing worse than an AI agent making a mistake is an AI agent making a mistake with your API keys, your health data, and your kid’s screen time schedule all in the same system, with nobody finding out until the credit card bill arrives. The discipline isn’t paranoia. It’s the minimum viable responsibility for running this kind of infrastructure in a place where people sleep.