2026-04-21: The mTLS Day
The morning after the A+ roadmap push, the obvious next move was to copy the cert-detection pattern from the two clients that already had it out to the three that didn’t. Nothing heroic. Boring is the right shape for security work — boring means the pattern is composing rather than surprising.
What Moved
Three probe clients migrated from bearer auth to mutual TLS using the same auto-detect-cert recipe that canary and guardian established earlier in the week: parity-smoke (the nightly correctness battery that runs on the Mini), council-canary-offbox (the cross-machine correctness probe that runs on the MBP), and the cross-machine probe inside deploy-sanctum-mlx.sh verify that council-drift-offbox drives. All three now look for their client cert first and fall back to bearer only when the cert files are absent.
The pattern is boring, which is the point:
```shell
CA_CERT="/Users/neo/.sanctum/certs/ca.crt"
CLIENT_CERT="/Users/neo/.sanctum/certs/clients/<probe>.crt"
CLIENT_KEY="/Users/neo/.sanctum/certs/clients/<probe>.key"

if [ -z "${COUNCIL_URL:-}" ] && [ -r "$CA_CERT" ] \
   && [ -r "$CLIENT_CERT" ] && [ -r "$CLIENT_KEY" ]; then
  URL="https://127.0.0.1:1338"
  MTLS_ARGS=(--cacert "$CA_CERT" --cert "$CLIENT_CERT" --key "$CLIENT_KEY")
  TRANSPORT="mtls"
else
  URL="${COUNCIL_URL:-http://127.0.0.1:1337}"
  MTLS_ARGS=()
  TRANSPORT="bearer"
fi
```

Each probe logs its chosen transport on every call. When Prometheus panels get built for monitor traffic, the bearer curve will decay and the mTLS curve will climb as clients flip — the migration graph is already in the data, just waiting for someone to plot it.
Five of six probe clients are on mTLS now. The holdout is sanctum-server’s HttpProxyBackend — the Rust router — which is a real refactor (reqwest client with rustls config, per-backend cert paths read from instance.yaml), not a bash one-liner. Moved to its own roadmap slot for a dedicated session. No reason to rush it.
One Small Sharp Edge Caught Along the Way
The off-box drift check — which runs deploy-sanctum-mlx.sh verify from the MBP — briefly reported /v1/models FAILED even though a manual verify from the same shell succeeded seconds later. The culprit wasn’t a config drift or a TLS issue. It was that the Rust agent had just been bootout/bootstrap’d during the HA failover test, and its 60-second model reload was still in progress when the off-box drift tick landed. A stop-the-world probe run against a service that takes a minute to start is always going to race the first tick.
No action required: the next tick was green, the alert-suppression cooldown was already in place, nothing escalated. But it’s worth logging as a known mode — cross-machine probes will see transient failures whenever the service is in its cold-start window. If the alert threshold ever gets tighter than “2 consecutive failures across 10 minutes,” this will fire. Today it didn’t.
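The “2 consecutive failures across 10 minutes” gate can be sketched as a small state machine. This is an illustrative stand-in, not the actual monitor code: the type and field names (`AlertGate`, `record`) are hypothetical, but the logic shows why a single cold-start race never escalates while a sustained outage does.

```rust
use std::time::{Duration, Instant};

/// Fire only after `threshold` consecutive failures, all inside `window`.
/// A success, or a gap wider than `window`, resets the streak, so one
/// transient cold-start failure never reaches the alert path.
struct AlertGate {
    threshold: u32,
    window: Duration,
    streak: u32,
    first_failure: Option<Instant>,
}

impl AlertGate {
    fn new(threshold: u32, window: Duration) -> Self {
        Self { threshold, window, streak: 0, first_failure: None }
    }

    /// Record one probe tick; returns true when the alert should fire.
    fn record(&mut self, now: Instant, ok: bool) -> bool {
        if ok {
            // Green tick: the streak is over.
            self.streak = 0;
            self.first_failure = None;
            return false;
        }
        match self.first_failure {
            // Failure within the rolling window extends the streak.
            Some(t0) if now.duration_since(t0) <= self.window => self.streak += 1,
            // First failure, or a stale one outside the window: restart.
            _ => {
                self.first_failure = Some(now);
                self.streak = 1;
            }
        }
        self.streak >= self.threshold
    }
}
```

With a threshold of 2 over 10 minutes, the race described above (one red tick followed by a green one) stays silent, exactly as observed.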
Morning Post-Entry State
Five of six probe clients on mTLS (all four Mini-side probes plus the off-box canary). The deploy-script cross-machine probe is also on mTLS, which means council-drift-offbox inherits mTLS without its own migration — the verify path was the probe path the whole time, we just hadn’t noticed. sanctum-server on the roadmap as the last client. Bearer retirement is honest-close, gated on the router migration.
Commit 054f6cf in sanctum-rs. No service interruption.
Late Night — Shipping Code vs Flipping Config
The evening’s exercise: close the sanctum-server mTLS gap the council asked for. Four steps planned — step 1 wrote the code (commit c95eaa5); steps 2–4 were supposed to flip configuration and soak. Code landed cleanly. Configuration did not.
What shipped
HttpProxyBackend learned mTLS. A new BackendTls struct on BackendDef lets any router entry in instance.yaml opt in with three paths — ca_cert_path, client_cert_path, client_key_path. When all three resolve to readable files, the backend’s reqwest::Client is built with rustls, trusting only the internal Sanctum CA, presenting the client cert on every upstream request. When any are missing, the backend builds the same plain-HTTP client it has always built. The transition is strictly additive.
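The opt-in shape can be sketched like this. The field names match the entry above; the readability check is a simplified stand-in for the real construction path (the actual code builds a rustls-backed reqwest client, which isn’t reproduced here):

```rust
use std::path::PathBuf;

/// Mirrors the BackendTls shape described above. All three paths must be
/// present and readable for mTLS; anything less falls back to plain HTTP.
#[derive(Default)]
struct BackendTls {
    ca_cert_path: Option<PathBuf>,
    client_cert_path: Option<PathBuf>,
    client_key_path: Option<PathBuf>,
}

impl BackendTls {
    /// True only when every path is set and points at an existing file,
    /// which is what makes the transition strictly additive: absent or
    /// broken config silently means "same client as always".
    fn wants_mtls(&self) -> bool {
        [&self.ca_cert_path, &self.client_cert_path, &self.client_key_path]
            .iter()
            .all(|p| p.as_deref().map_or(false, |path| path.is_file()))
    }
}
```

The all-or-nothing check is what the “all-missing-paths” and “bad paths” unit tests below exercise.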
Four new unit tests exercise the configuration matrix: default (no TLS), all-missing-paths (graceful fallback), bad paths (warn-log + fallback), and a live end-to-end certificate generation via the system openssl followed by a successful client construction. 22/22 tests pass.
One small reqwest landmine along the way: the use_preconfigured_tls API takes a bare rustls::ClientConfig, not an Arc<ClientConfig>. Wrapping in Arc causes a runtime downcast failure with the unhelpful message "Unknown TLS backend passed to use_preconfigured_tls". The fix is one line; the time cost was in reading reqwest’s source to confirm.
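The mechanism behind that landmine is worth pinning down, because it generalizes: any API that accepts `impl Any` and downcasts internally only matches the exact concrete type. A minimal sketch with a stand-in type (this is not reqwest’s actual internals, just the downcast behavior it relies on):

```rust
use std::any::Any;
use std::sync::Arc;

// Stand-in for rustls::ClientConfig, purely to show the mechanism.
struct ClientConfig;

/// Sketch of what an `impl Any`-taking API does internally: downcast to the
/// one concrete type it knows. `Arc<ClientConfig>` is a *different* type
/// from `ClientConfig`, so the mismatch surfaces at runtime, not compile time.
fn accept_tls(cfg: Box<dyn Any>) -> Result<ClientConfig, &'static str> {
    cfg.downcast::<ClientConfig>()
        .map(|boxed| *boxed)
        .map_err(|_| "unknown TLS backend")
}
```

Passing `ClientConfig` succeeds; passing `Arc::new(ClientConfig)` fails with an error the type system never warned about — which is why the real message from reqwest is so unhelpful.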
A second landmine: reqwest’s __rustls feature requires a crypto provider to be globally installed even when the backend is plain HTTP. Client::new() panics with "No provider set" if you forget. A Once wrapper installs rustls::crypto::ring::default_provider() at every HttpProxyBackend construction, which is enough.
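The Once pattern itself is standard library plumbing. A minimal sketch, with the real provider install replaced by a counter so the run-exactly-once property is visible (function names here are illustrative, not the shipped code):

```rust
use std::sync::Once;
use std::sync::atomic::{AtomicU32, Ordering};

static INIT: Once = Once::new();
static INSTALL_COUNT: AtomicU32 = AtomicU32::new(0);

/// Stand-in for the real provider install
/// (rustls::crypto::ring::default_provider() in the entry above).
fn install_provider() {
    INSTALL_COUNT.fetch_add(1, Ordering::SeqCst);
}

/// Safe to call from every HttpProxyBackend constructor: `call_once` is
/// cheap after the first call and correct under concurrent callers, so no
/// construction path can hit the "No provider set" panic.
fn ensure_crypto_provider() {
    INIT.call_once(install_provider);
}
```

Calling it unconditionally from every constructor trades a no-op atomic check for never having to reason about which code path runs first.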
Why the config flip didn’t happen
Steps 2–4 needed a live :1338 mTLS listener on the Mini for the router’s primary URL to succeed. The com.sanctum.mlx LaunchAgent has that listener in its plist. But the actual sanctum-mlx process running at the moment is a different invocation — pointed at Qwen3.6-35B-A3B-4bit-text with --turboquant and no TLS arguments, presumably launched by a parallel session’s work. That process owns :1337 with bearer auth but has no :1338 listener at all.
Flipping council-secure.url to https://127.0.0.1:1338 against that topology would have failed every primary call, silently routing everything to the MBP shadow — the exact failure mode the council flagged as the biggest risk. Confirmed with an actual request (error sending request for url (https://127.0.0.1:1338/v1/chat/completions)), then rolled the config back to the bearer primary, kickstarted sanctum-server, and smoke-tested: "Understood" in 8 s, no failover lines in the log.
Ship the code before the config. A code deploy that doesn’t change behavior for any existing caller is boring and safe. A config flip that depends on the deployed topology is a live rollout, and live rollouts need the assumed topology to actually exist. Today’s topology diverged from yesterday’s — someone’s in-flight work is serving the council from a different binary — and that’s a legitimate state of affairs, not a failure. But mTLS rollout needs a :1338 listener to aim at, and the aim had moved.
Corollary
launchctl list is the ground truth, not the plist on disk. Checking ~/Library/LaunchAgents/com.sanctum.mlx.plist and seeing it contains --tls-host 127.0.0.1 --tls-port 1338 tells you what should run when the agent is loaded. It tells you nothing about whether the agent is loaded, or whether a different process has bound the same ports. Always confirm with lsof + launchctl list.
Late-Night Post-Entry State
Step 1 shipped, pushed to feat/proxy-hardening as commit c95eaa5. sanctum-server on the Mini restored to the bearer-primary config, production chat round-trip measured at 8 s (first-call cold cache). The mTLS config flip (steps 2–4) is a ten-minute job the next time the canonical com.sanctum.mlx LaunchAgent is the listener on :1337/:1338 — code, certs, and wrapper are all staged.
Overnight — The Router Proven on MBP Shadow
With the Mini running someone else’s turboquant work, steps 2–4 of the sanctum-server mTLS rollout didn’t have a :1338 listener to aim at. So we moved the rollout to the MBP — which already has the shadow on :8902 (plain+bearer) and :8903 (mTLS) from the earlier P7.4 work — and exercised the full primary/fallback matrix against a disposable sanctum-server instance on :19900.
What the test exercised
A minimal router.yaml with one backend, two URLs, three cert paths:
```yaml
router:
  backends:
    council-secure:
      url: https://127.0.0.1:8903/v1        # mTLS primary
      fallback_urls:
        - http://127.0.0.1:8902/v1          # bearer fallback (transition state)
      ca_cert_path: /Users/neo/.sanctum/certs/ca.crt
      client_cert_path: /Users/neo/.sanctum/certs/clients/sanctum-server.crt
      client_key_path: /Users/neo/.sanctum/certs/clients/sanctum-server.key
  default: council-secure
```

The server started cleanly, logged mTLS client configured + Registered backend ... mtls=true. A chat request through :19900 round-tripped in 379 ms to the mTLS shadow. Three warm runs averaged ~280 ms, which is the inter-process baseline for :19900 → :8903 (TLS handshake reused) → Metal.
The bug that surfaced (and got fixed)
Pointing the primary URL at a dead :18903 to force a failover returned a zero-byte "builder error for url (http://...)" response. The mTLS client — shared across all URLs in a backend’s list — had https_only(true) set, which rejected the http:// fallback at request-build time. Not at TLS-handshake time, not at connect time, at build time, before the request ever hit the wire.
The fix is one line: drop https_only(true). reqwest is already scheme-driven — https:// URLs use the configured rustls context, http:// URLs run plain. The https_only flag layered an additional build-time reject that was useful only if the code followed HTTP redirects, which it does not. The scheme of a URL is the operator’s contract; the client should honor it.
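The contrast between the two behaviors can be sketched as a tiny dispatch function. This is an illustration of the failure mode, not reqwest’s real code; the `Transport` enum and `build_transport` name are hypothetical:

```rust
#[derive(Debug, PartialEq)]
enum Transport {
    PlainHttp, // http:// URL, plain client, bearer auth upstream
    Mtls,      // https:// URL, rustls context with the client cert
}

/// Scheme-driven dispatch, with the `https_only` reject shown explicitly.
/// With the flag set, an http:// fallback dies at build time: no bytes on
/// the wire, no TLS error, just a builder error and an empty response.
fn build_transport(url: &str, https_only: bool) -> Result<Transport, String> {
    if url.starts_with("https://") {
        Ok(Transport::Mtls)
    } else if url.starts_with("http://") {
        if https_only {
            // The zero-byte failure path described above.
            Err(format!("builder error for url ({url})"))
        } else {
            Ok(Transport::PlainHttp)
        }
    } else {
        Err(format!("unsupported scheme in {url}"))
    }
}
```

Dropping the flag makes the scheme the single source of truth, which is exactly what a mixed mTLS/bearer fallback list needs.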
After the fix, all three failover modes work end-to-end:
| Topology | Latency | Winning URL |
|---|---|---|
| mTLS primary healthy | 300 ms | https://127.0.0.1:8903/v1 |
| mTLS primary dead, bearer fallback | 311 ms | http://127.0.0.1:8902/v1 |
| mTLS primary dead, mTLS fallback | 300 ms | https://127.0.0.1:8903/v1 |
The gradual-migration story now actually holds: a backend can legitimately run with an mTLS primary and a bearer fallback during the transition, flip the fallback to mTLS when ready, and eventually drop the bearer endpoint entirely.
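The failover order the table exercises reduces to "first healthy URL wins, primary first". A minimal sketch, with a `probe` closure standing in for the real upstream request (names are illustrative, not the router’s actual API):

```rust
/// Try the primary, then each fallback in declared order; return the first
/// URL whose probe succeeds. This is the whole gradual-migration contract:
/// swap entries in the list and the traffic follows, no code change.
fn first_healthy<'a>(
    primary: &'a str,
    fallbacks: &'a [&'a str],
    probe: impl Fn(&str) -> bool,
) -> Option<&'a str> {
    std::iter::once(primary)
        .chain(fallbacks.iter().copied())
        .find(|url| probe(url))
}
```

Flipping the fallback from bearer to mTLS, or dropping it entirely, is then just an edit to the list in instance.yaml.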
https_only(true) on an HTTP client is not a security feature; it is a URL-scheme contract. If the client is given an http:// URL it was supposed to reject, that’s an operator bug in the config, not an attacker bug on the wire. Reject it with a clear error at config-parse time if you must, but do not bury the rejection inside the request-build path — the failure mode there produces zero-byte responses and empty-looking logs, which silently breaks gradual rollouts the configuration has no way to anticipate.
Overnight Post-Entry State
Steps 2, 3, and 4 of the sanctum-server mTLS rollout are proven against the MBP shadow. Mini deployment remains deferred — not for technical reasons (the code works end-to-end), but because the Mini’s :1337/:1338 currently belongs to a different sanctum-mlx process than the signed com.sanctum.mlx LaunchAgent. Whenever that reconciles, the instance.yaml flip is a 10-second edit and the result is already verified on this box.
Commits
- c95eaa5 — step 1, mTLS-capable backend (earlier today).
- 3b6bef2 — fix, drop https_only to allow mixed-transport fallback lists.
- This entry in sanctum-docs.