Sanctum-MLX

Tommy in the Sanctum-MLX chamber — one Mac Mini, one fused kernel, zero Python overhead.

There was a time when the Council’s local inference ran through mlx_lm.server — a Python process that loaded a model, accepted HTTP requests, and generated tokens. It worked. It was fine. And then we looked at the profiler and realized 40% of the wall clock was spent in Python overhead, garbage collection, and the seven layers of abstraction between “give me a token” and the actual GPU computation.

So we did what any reasonable person would do: we rewrote the entire inference stack in Rust, implemented a GatedDeltaNet state-space model from scratch, wrote a custom Metal GPU kernel, and shipped it as a single binary that starts in 3 seconds and decodes at 13 tokens per second.

This is sanctum-mlx v0.2.0. It has no Python dependency. It has no regrets.

Architecture

[Client] → 127.0.0.1:11337
    ↓
[sanctum-mlx (Rust/axum)]
  ├─ Direct model loading: Qwen3.5 safetensors → mlx-rs Model
  ├─ 4-bit quantized inference: QuantizedLinear with group_size=64
  ├─ Fused Metal kernel: GatedDeltaNet SSM scan in one GPU dispatch
  ├─ Zero-copy GQA: kernel handles head expansion internally
  ├─ SSE streaming: token-by-token via axum Sse + tokio mpsc
  ├─ Non-streaming: block_in_place for GPU compute
  ├─ EOS tokens: [248046, 248044]
  └─ system_fingerprint: sanctum-mlx-0.2.0-native-arm64

The server is an axum HTTP service that exposes an OpenAI-compatible /v1/chat/completions endpoint. It supports multi-model loading with LoRA adapter hot-swapping — the adapter pipeline dequantizes the base weights, merges the LoRA deltas, and requantizes the result back to 4-bit for inference. Models load directly from safetensors files into mlx-rs arrays, and the full model graph runs on the Metal GPU without ever touching Python, NumPy, or the existential dread of pip install. The portable Metal build system compiles on any Apple Silicon Mac without external dependencies.

The Model: Qwen3.5-27B-4bit (and others)

Qwen3.5 is a hybrid architecture — 64 decoder layers split between two attention mechanisms:

Component	Count	Description
Linear Attention (GatedDeltaNet)	48 layers	Recurrent SSM with gated delta updates. O(1) memory per token.
Full Attention (Qwen3NextAttention)	16 layers	Standard multi-head attention with a twist: 2x Q projection split into queries + sigmoid gate.

Key Dimensions

Parameter	Value
hidden_size	5120
head_dim	256
num_attention_heads	24
num_kv_heads	4
linear_key_heads	16
linear_value_heads	48
linear_key_head_dim	128
linear_value_head_dim	128
Quantization	4-bit, group_size=64

Performance Optimizations (v0.2.0 → v0.3.0)

Six optimizations took decode throughput from ~9 tok/s to ~75 tok/s on the same hardware. The first four are GatedDeltaNet-shaped wins from the v0.2.0 cutover; the last two are the late-April push that closed the TurboQuant tax and bounded MLX’s buffer cache. John Carmack would probably find another 21%, but he’s busy with other things.

1. Fused Metal Kernel for GatedDeltaNet SSM

The GatedDeltaNet recurrent scan was originally implemented as a Rust loop over timesteps, issuing separate MLX operations for each step: decay, matrix multiply, delta update, output projection. Each operation meant a separate Metal kernel dispatch, a separate synchronization point, and a separate opportunity for the GPU to wonder why it was born.

The fused kernel (metal_kernels.rs) replaces this with a single Metal dispatch that processes all timesteps in one GPU launch. It uses SIMD group reductions (simd_sum) for the Dk-dimension dot products and keeps the recurrent state in thread-local registers.

Grid: (32, Dv, B*Hv)    Threadgroup: (32, 4, 1)

Each threadgroup tile handles one (batch, value_head, dv_slice) and loops over all timesteps internally. The Dk dimension is distributed across 32 SIMD lanes with simd_sum for horizontal reduction.

2. Zero-Copy GQA in the Kernel

Grouped Query Attention maps 16 key heads to 48 value heads (3x expansion). The naive approach broadcasts and reshapes — allocating a new contiguous buffer every forward pass. The fused Metal kernel handles the mapping internally via hk_idx = hv_idx / (Hv / Hk), reading directly from the original 16-head Q/K tensors. No broadcast, no reshape, no allocation.

3. Eval Cadence Tuning

MLX uses lazy evaluation — operations build a computation graph that only executes when you call eval(). The cadence matters:

Mode	Before	After	Rationale
Streaming	Every 4 tokens	Every 8 tokens	Halves GPU sync overhead. Adds ~80ms latency at 13 tok/s — imperceptible in chat.
Non-streaming	Every 32 tokens	Every 64 tokens	Larger graph batches for throughput. The GPU prefers big meals.

4. Pre-computed Normalization Constants

Q/K RMS normalization uses a ones_weight vector and scale factors (inv_scale² for Q, inv_scale for K). Previously allocated fresh each forward call — 48 layers × 2 arrays × every token = a lot of unnecessary allocation. Now computed once at model construction and cached in the struct.

5. Fused Attention with Inline V Dequantization (2026-04-24)

The Slice 1 TurboQuant cache rebuilt the entire dequantized V tensor every decode step on the CPU — O(T²) work and a forced GPU sync per layer. The replacement is a custom Metal kernel sdpa_dequant_v (in turboquant/attention_kernel.rs) that takes Q + materialized K + the compressed V state (indices, scales, zeros) and runs scaled-dot-product attention with V dequantized inline in registers. No full V tensor ever materializes.

Grid: (32, 1, B*H_q*L_q)   Threadgroup: (32, 1, 1)

One threadgroup per (batch, query head, query position); 32 threads cooperate on the D dimension via simd_sum. Online (FlashAttention) softmax in fp32. Causal mask via additive -INF bias, template-specialized so the decode path compiles the branch away. Routed through a new KeyValueCache::fused_attention trait method — when the cache opts in, FullAttention::forward skips the full-V re-materialization for both decode AND prefill.

Bench (MBP M4 Max, 3 × 8 × 100 tokens, vanilla MoE):

Config	Mean tok/s	vs Slice 1
no TurboQuant	79.42	—
Slice 1 (CPU K+V)	62.38	baseline
Slice 4a-final (plain K + fused V)	75.64	+21.2%

The keys-plain choice is empirical: at our context lengths, dropping key compression saves us the CPU round-trip without surrendering meaningful memory (~4 MB/layer of bf16 K is nothing on 64 GB). See TurboQuant KV Compression for the full pivot story.

6. Metal Memory Caps + Inter-Request Cache Drain (2026-04-24)

The Mini was running at 31.7 GB of 32 GB swap with 85 MB free pages — not because anything leaked, but because MLX’s buffer cache had no upper bound and ratcheted up across thousands of requests. Two changes:

The launchd plist now passes --metal-cache-limit-mb 1024 --metal-memory-limit-mb 40960 --metal-wired-limit-mb 24576. Without these, MLX’s previous cache limit was effectively 65 GB (the entire machine).
mlx_clear_cache was wrapped into mlx_rs::memory::clear_cache() and called at the end of both chat handler paths (sync and SSE) so the buffer cache is drained between requests instead of growing forever.

Result: post-restart steady state holds at ~13 GB wired (the actual weights + warm experts), with 1.3 GB free pages. macOS dynamically shrunk the swap file from 32 GB to 9.2 GB once the pressure eased — the clearest signal the fix worked.

The April 2026 Cutover

For a while, sanctum-mlx was a science project. It loaded models, it ran the forward pass, it produced valid tensors — it just wasn’t the thing answering your questions. Production still routed through the Python mlx_lm.server that sanctum-server babysat. The plan was to flip the switch eventually. April 2026 is when we flipped it.

What landed in a single sustained push:

LoRA Merge at Load

AppState::load calls lora::load_and_merge when --adapter-path is supplied. Quantized path dequantizes → adds the delta → requantizes at group_size=64, bits=4. Non-quantized path just adds. AdapterInfo { name, rank, alpha, merged_pairs } is carried on the app state and surfaced in system_fingerprint.

Full Sampling Pipeline

A new sampling module replaces the single-temp qwen3_5::sample(). Pipeline: repetition penalty → top-p nucleus → temperature → argmax or categorical. RecentTokens is a bounded dedup-aware ring buffer. OpenAI-style stop is accepted as string-or-array; a StopSeqBuffer carries max-stop-length bytes across SSE batches so a stop sequence split across the 8-token flush boundary still trips.

Custom Decode Loop

sampling::decode drives Model::forward directly with a Control::Continue/Stop callback. The callback batches token IDs for the caller to decode + stream + check stops. Replaces qwen3_5::Generate so top_p and repetition_penalty actually affect logits without patching vendored mlx-lm.

Multimodal Config Support

The production mlx-community/Qwen3.5-27B-4bit checkpoint is the VL variant — text_config nests the text-model fields. get_qwen3_5_model_args now flattens text_config into the root before deserializing, so both flat and nested configs load. Root-level keys still win on conflict.

Along the way we fixed two bugs that would have made production unusable:

Array::deep_clone() on unevaluated lazy tensors segfaults inside mlx_array_data_bfloat16 because the buffer pointer is null until eval() runs. The “no-op” branches of sample, apply_top_p, and apply_repetition_penalty were cloning the prefill logits that hadn’t been materialized yet. Fix: the pipeline passes &Array through every stage and only materializes when a transform actually runs. There is now a fast path for temp=0 + no rep-penalty that goes straight to argmax with zero intermediate Arrays.
Missing <think>…</think> generation-prompt prefix. Python runs with --chat-template-args {"enable_thinking": false}, which causes the Qwen3.5 chat template to emit <|im_start|>assistant\n<think>\n\n</think>\n\n. We were sending only <|im_start|>assistant\n. Without the empty think block the model slipped into open-ended reasoning and emitted degenerate token loops. messages_to_prompt now matches production exactly.

Autoregressive SSM State — The Hybrid-Attention Cache

This is the piece that took the longest to find and the shortest to fix.

When you load a Qwen3.5-27B base model and send one prompt, sanctum-mlx would produce a few coherent tokens and then collapse:

Prompt:  "Say hello."
Output:  "Hello, and the 190/ / 190/ / 190/ / 190/ / …"

The tokenization was right. The special tokens were right. The sampler was right. The model weights were right. And yet.

The bug was that the linear-attention layers had no way to carry state across forward calls. Qwen3.5 is a hybrid — 48 layers of Mamba-style GatedDeltaNet SSM interleaved with 16 full-attention layers. Full attention has a KV cache. Linear attention didn’t. Every decode step (L=1) entered the forward pass with:

A fresh zero SSM recurrent state h ∈ ℝ^{B × H_v × D_v × D_k}
A depthwise causal Conv1d left-padded with zeros

The full-attention KV cache carried the prompt context, which was enough to keep the first few tokens coherent. After that, the linear-attention layers — running with amnesia — produced garbage that cascaded through the residual stream until greedy decode locked onto a self-reinforcing token.

The Fix

Two pieces of state get threaded through the decode loop now:

pub struct LinearAttentionState {
    /// GatedDeltaNet matrix memory: h[t] = state * g + k * ((v − state@k) * beta)
    pub ssm_state: Array,     // [B, num_value_heads, value_head_dim, key_head_dim]
    /// Previous (conv_kernel_dim - 1) raw QKV projections.
    pub conv_buffer: Array,   // [B, K-1, qkv_dim]
}

LinearAttention::forward_with_cache(&mut self, x, cache: &mut Option<LinearAttentionState>) replaces the old stateless forward():

None on entry → prefill. Conv is left-padded with zeros, SSM starts from Array::zeros. On exit the cache gets populated with the final SSM state and the last conv_kernel_dim - 1 raw QKV projections.
Some(state) on entry → decode continuation. Conv is left-padded with state.conv_buffer (the rolling tail from the previous call), SSM starts from state.ssm_state. Works for any L ≥ 1, which is to say every single-token decode step.

ModelInput gained an ssm_cache: Option<&mut Vec<Option<LinearAttentionState>>> field; Qwen3_5InnerModel::forward dispatches each layer’s cache slot to the right layer type (full-attn layers see their KV slot; linear-attn layers see their SSM slot). The old Module::forward(&Array) still exists and delegates with a None cache so one-shot prefill-only callers — benchmarks, tests, the hybrid sidecar — work unchanged.

The Correctness Test

The unit test that says “this is right” is short and brutal:

// Full prefill of L=6:
let y_full = la.forward_with_cache(&x6, &mut None)?;

// Prefill L=5, then decode L=1:
la.forward_with_cache(&x5, &mut cache)?;
let y_step = la.forward_with_cache(&x1, &mut cache)?;

// The 6th-position output must match both ways.
assert!(max_abs_diff(&y_full.last(), &y_step) < 1e-3);

If the cache correctly represents the state after L=5, then running one more step through it has to produce the same output as scanning all six tokens at once. It does.

End-to-End on the Real Model

Prompt:  "Say hello."
Before:  "Hello, and the 190/ / 190/ / 190/ / …"
After:   "Hello! How can I help you today? Whether you need help
          with a specific task, need clarification, or have a
          question, I'm here to help!"

Python mlx_lm.server on the same prompt emits EOS at “today?” — earlier than ours. The 50-ish-token drift before we diverge is a numerical-parity refinement, tracked separately. The model is now producing usable, grammatical output against real weights on a 64 GB Mac Mini with zero Python anywhere in the request path, which is the whole point.

Testing

The server ships with a comprehensive e2e test suite:

Metric	Value
Unit tests (sampling + server + lora + turboquant)	74/74
SSM cache tests (`prefill`, `decode`, `full == prefill+decode`)	3/3
E2E tests	34/34
Phases	10 (build, health, deep health, models, non-stream, stream, errors, multi-turn, content arrays, shutdown)

The test script (test_e2e_sanctum_mlx.sh) supports --skip-build for rapid iteration and covers the full OpenAI-compatible API surface including error handling and graceful SIGTERM shutdown.

Performance Baseline (M4 Pro 128GB, 2026-04-02)

Metric	v0.1.0 (Rust loop)	v0.2.0 (fused kernel)
Prefill	~5 tok/s	~5 tok/s
Decode	~9 tok/s	~13 tok/s
Model load	~3s	~3s
Binary size	~4.2 MB	~4.2 MB

The prefill speed is unchanged because it’s dominated by the initial matrix multiplications through 64 quantized layers — the SSM scan is a small fraction of that cost. Decode is where the fused kernel shines, because the recurrent scan becomes the bottleneck when you’re generating one token at a time.

Production Cutover — 2026-04-17 21:36 local

After eight months of “it works in the shadow, let’s ship it in one more sprint,” the Rust server replaced Python on the Mac Mini in a 65-second blue-green swap. Sixty-three of those seconds were Metal loading 27 billion 4-bit parameters into the GPU, which is not something one optimises. The other two were launchctl.

What changed

com.sanctum.server-mlx.plist (Python wrapper on mlx_lm.server) unloaded.
com.sanctum.mlx.plist (pure-Rust sanctum-mlx binary) loaded on the same port :1337, with KeepAlive=true, LimitLoadToSessionType=Aqua, ThrottleInterval=30.
council-guardian.sh rewritten to use launchctl kickstart -k gui/<uid>/$ACTIVE_AGENT instead of the old pkill mlx_lm.server + nohup wrapper pattern. Respects KeepAlive + ThrottleInterval by design. Rollback is flipping ACTIVE_AGENT back to com.sanctum.server-mlx.
Guardian probe latency fell from ~870 ms to ~485 ms. That’s a ping, not a generation — but it is now a faster ping.

Parity at cutover

7/10 byte-exact on the smoke battery at temperature=0. The remaining three differ only in single-word synonyms — bf16 ULP drift that accumulates after ~100 characters. Argmax is identical for the first ~40 tokens on every test prompt.
The winning fix: a C: Default trait bound on the cache generic plus (0..self.layers.len()).map(|_| Some(C::default())).collect() at the top of forward(). Without it, FullAttention saw an empty cache Vec, fell through to the no-cache branch, and silently recomputed keys/values every decode step. One three-line change, parity jumped 4/10 → 7/10, and nobody was going to catch it with a unit test.
The fix before that: renaming a field from a_log to A_log. The ModuleParameters derive macro uses stringify!(field) for checkpoint lookup, and Qwen’s safetensors key is A_log. Lowercase silently mismatched, exp(0)=1 collapsed the GatedDeltaNet decay, every decode step was corrupted. This is the software equivalent of losing a year to a missing semicolon, and it will happen again somewhere, to someone, probably to us.

The rollback plan

One line:

launchctl unload ~/Library/LaunchAgents/com.sanctum.mlx.plist && \
  launchctl load ~/Library/LaunchAgents/com.sanctum.server-mlx.plist

Both plists stay on disk. server-mlx.plist is the escape hatch, not legacy debt. We keep it until the 24-hour watch is clean and a week of guardian probes have the same 100% success rate the Python era used to have.

The macOS Application Firewall ambush

Localhost worked. The guardian worked. Every probe from the Mini itself said “yes, the Force is strong with this one.” Every probe from the MacBook Pro over Tailscale got curl: (56) Recv failure: Connection reset by peer.

The socketfilterfw ruleset had silently added the unsigned Rust binary to the BLOCK list — probably during an earlier shadow test, when the Application Firewall popup asked about incoming connections and there was no one at the Mini’s screen to click Allow. The block rule persisted. Localhost bypassed the filter, so every developer probe looked fine.

Diagnosis:

/usr/libexec/ApplicationFirewall/socketfilterfw --getappblocked \
  /Users/neo/Projects/sanctum-rs/target/release/sanctum-mlx
# → "Incoming connection to ... is blocked."

Fix (one-shot, requires sudo):

sudo /usr/libexec/ApplicationFirewall/socketfilterfw --unblockapp \
  /Users/neo/Projects/sanctum-rs/target/release/sanctum-mlx

This only bites once, per binary path. A rebuild at the same path keeps the rule. If the binary is moved, the new path starts in the default-deny state again and has to be explicitly allowed. Python mlx_lm.server never hit this because its executable was Python.app — already in the allow list for decades’ worth of reasons. Rust binaries produced by cargo build --release are ad-hoc-signed (codesign -dv shows Signature=adhoc) and enjoy no such privilege.

The guardian that ate itself

The council-guardian was rewritten in the same session to use launchctl kickstart -k instead of the old pkill mlx_lm.server + nohup pattern. It passed every test. Then we pointed the Sanctum Olympics benchmark at the Mini with 512-token prompts, and the guardian decided the server was dead.

It wasn’t. sanctum-mlx serializes Metal inference — one GPU context, one kernel graph at a time. A POST /v1/chat/completions health probe with max_tokens=2 still queues behind a max_tokens=512 request in flight. The probe timed out after 20 seconds waiting for its turn, the guardian read that as “hung”, and ran launchctl kickstart -k. Thirty seconds of Metal reload later, the next guardian probe hit the same queue again. Three restarts in eight minutes; none of them were necessary, all of them were outages.

Fix: probe GET /v1/models instead. It’s a static-ish route that returns the model list in ~25 ms regardless of inference load, because it never touches the model graph. Tightened the timeout from 20s to 10s (small enough to notice real hangs fast), loosened FAILS_BEFORE_RESTART from 2 to 3 (big enough to absorb a single network blip).

# Before
URL="http://127.0.0.1:1337/v1/chat/completions"
curl -X POST "$URL" -d '{"messages":[{"role":"user","content":"ping"}],"max_tokens":2}'
#   ^ queues behind in-flight inference under load

# After
URL="http://127.0.0.1:1337/v1/models"
curl "$URL"  # GET, 25 ms, bypasses inference queue

If the model graph itself goes sideways (HTTP alive, inference broken), this probe won’t catch it. That’s a separate monitor with a slower cadence — noted for the roadmap, not retrofitted into a 60 s loop.

Observed latency post-cutover (Mac Mini, 64 GB)

Workload	Latency	Throughput
Guardian ping (`max_tokens=2`)	~450 ms	—
Short chat (`max_tokens=8`, warm)	~520 ms	~15 tok/s
Paragraph (`max_tokens=80`)	~4.2 s	~13 tok/s

Warm-up on the first request after plist load is ~15 s (Metal context + kernel JIT). Subsequent requests are flat until the KV cache grows past the working-set window, at which point the linear-attention recurrent state keeps memory bounded instead of the usual “quadratic heartburn at the 32K mark.”

Operational hardening (P4 — 2026-04-17)

With the Rust binary on :1337 and the cutover stable, the remaining gap was operational: nobody would notice if the server started answering wrong but still 200-OK’d, nobody would notice if a hand-edited plist drifted from the repo, and nobody would catch a corrupted weights file before it served a bad token. Four additions closed those gaps in the same session.

1. Canary — a slow probe that asks a real question

council-guardian runs every 60 s and hits GET /v1/models. That proves the HTTP server is alive, nothing more. A model with bit-flipped weights would still happily list itself.

council-canary runs every 10 min and POSTs a fixed prompt:

What is 2+2? Respond with only the number.

The response has to contain 4 or the canary increments a consecutive-failures counter. Two in a row and it alerts Force Flow /notify, the same pipe signal-health uses. Cooldown is 30 min to avoid spamming the same outage. Canary does not trigger restarts — guardian owns that — it exists purely to tell a human the model has become smart enough to still respond but stupid enough to answer wrong. That’s a surprisingly common failure class.

2. Drift detection — nobody edits prod without somebody knowing

council-drift runs hourly and SHA-256-compares every deployed plist, script, and manifest against the canonical copies in services/sanctum-mlx/deploy/ in the Mini’s own sanctum-rs checkout. It also checks the Application Firewall state for the binary (because we already got bitten by that once) and asserts every expected agent is loaded in launchctl.

Any drift → structured log entry at level=error and a Force Flow alert. Running the deploy-sanctum-mlx.sh verify command on demand prints the same checks interactively.

3. Deploy script — the cutover you don’t type twice

scripts/deploy-sanctum-mlx.sh wraps the entire install/upgrade/rollback dance as a single tool with subcommands. install is idempotent — every artifact gets SHA-compared against the repo copy before being overwritten, and no-ops when they already match. upgrade rebuilds the binary on the Mini (native arm64 + Metal linkage), probes the MBP shadow as a pre-flight, unload+loads the agent (not kickstart -k — see below), runs a post-flight chat probe, and auto-rolls-back if that probe fails. rollback flips back to com.sanctum.server-mlx for the Python escape hatch.

The plist-reload gotcha earned a dedicated comment block in the source:

# `launchctl kickstart -k` restarts the process BUT reads
# ProgramArguments from the already-loaded plist, not from disk.
# When the plist content changes (e.g., adding --manifest), we must
# unload + load or the new args are silently ignored.

We learned this the direct way when the first rollout of the manifest check silently did nothing because the running plist was a stale in-memory copy.

4. Weight integrity manifest — don’t trust, verify

manifest.rs reads a coreutils-format shasum -a 256 manifest (committed at services/sanctum-mlx/deploy/qwen35-27b-4bit.manifest.sha256), parallel-hashes every listed file under --model using rayon, and refuses to bind the listener on any mismatch. Six unit tests cover the happy path, tampered file, missing file, missing manifest, bad syntax, and comment-line handling.

Perf: 10.5 s for 9 files / 16 GB on the Mini’s NVMe. Six cores saturate SHA-256 throughput; the disk is the bottleneck. Startup went from ~27 s to ~38 s — a cost we pay once per restart to know the 27 billion parameters we’re about to run haven’t silently corrupted.

The manifest itself isn’t cryptographically signed. An attacker with write access to both the model directory and the manifest can roll both together. Closing that gap means Developer-ID signing on a signed manifest-of-manifests, which is listed on the roadmap, not shipped. What this layer does defend against is the much larger and more likely threat surface: bit-rot on the SSD, corrupt downloads, accidental truncation, and the classic “I re-pulled the model and forgot to re-pull the manifest” self-inflicted wound.

Deferred (named, not done)

Item	Why not this session
Developer-ID signing + notarization	Needs a paid Apple Developer cert
Close the 3/10 bf16 ULP parity gap	Reduction-order surgery in the Metal kernel or selective fp32 promotions — not a 30-min fix
Auth on :1337 (mTLS or bearer)	Tailnet ACLs are the current defence; mTLS is a refactor
Prometheus/statsd metrics → Holocron	Requires designing the metric model and wiring the dashboard
Zero-downtime blue-green via front-proxy	Caddy in front of a :1337/:1338 pair, non-trivial plumbing
Parity smoke battery in CI	GitHub Actions runner can’t drive Metal; needs a self-hosted arm64 runner

None of these block day-to-day operation. They’re the shape of “what comes after apple-like and military-grade”: the layers that keep you boring when boring is what you want.

P5 — the “actually military-grade” pass (2026-04-17)

After P4 landed, the honest answer to “is this military-grade?” was “apple-like-adjacent, B+”. Five remaining gaps, closed in one push.

1. Bearer-token auth with a loopback bypass

auth.rs is an axum middleware that looks at the request’s connect-info peer address. If it’s 127.0.0.1 or ::1, pass through — that’s the Mini’s own guardian, canary, and sanctum-server hitting :1337 over the kernel loopback path. Anyone else must present Authorization: Bearer <token>. Constant-time compare via subtle to block timing oracles.

Token lives at /Users/neo/.sanctum/secrets/council-mlx.token, 0600, 64 bytes of secrets.token_hex(32). Plist wires --auth-token-file to that path. Missing token + non-loopback request = 401 by default — fail-closed. Operators who explicitly want the old open behaviour set SANCTUM_MLX_DISABLE_AUTH=1 and accept the blame.

10 unit tests cover: loopback IPv4/IPv6 bypass, non-loopback without token rejected, with correct token accepted, with wrong token rejected, missing header, wrong scheme (Basic instead of Bearer), wrong length, case-insensitive prefix.

2. Bind to loopback + Tailscale — never `0.0.0.0`

--host is now repeatable. Default is a single 127.0.0.1. Production plist adds 100.0.0.25 (the Mini’s Tailscale IP). LAN devices at 192.168.1.x cannot connect to :1337 directly — the socket doesn’t exist on that interface. Bridge100 (10.10.10.1) is owned by sanctum-triage, a proxy that normalizes VM traffic and forwards to loopback, so the VM path is intact.

Non-loopback bind failures are non-fatal. Tailscale can transiently disappear during boot; crashing the server would thrash launchd’s KeepAlive. Loopback bind is required — without it, the guardian has nothing to probe and we can’t even tell ourselves we’re alive.

3. ed25519-signed manifest

The SHA-256 manifest from P4 is plain text. An attacker with write access to both the model dir and the manifest can roll both together. P5 closes that: a detached ed25519 signature over the manifest bytes, verified before the hashes are even looked at.

The signing key lives on the MBP at ~/.ssh/sanctum-manifest-signing.key (mode 0600, NOT in the repo). The verifying key is 32 raw bytes at services/sanctum-mlx/deploy/manifest-pubkey.ed25519, committed. The signature is 64 raw bytes at services/sanctum-mlx/deploy/qwen35-27b-4bit.manifest.sha256.sig, also committed.

A new sign_manifest CLI wraps the workflow: keygen, sign, verify. At runtime, main calls manifest::verify_signature(manifest, sig, pubkey) before manifest::verify. A tamper test — appending one line to the manifest — reproduces the expected Verification equation was not satisfied and the process exits non-zero.

This does not stop an attacker with FS-write AND the signing key. What it does stop: anyone editing the manifest without also having the signing key. That’s a meaningful delta — the manifest now lives in a two-key trust zone where the hash list is trusted iff the signature is, and the signature requires something the repo does not contain.

4. Off-box watchers on the MBP

The on-Mini guardian / canary / drift agents watch each other across process boundaries but share a box. If the Mini is wedged, power-lost, kernel-panicked, or just switched off, every on-Mini alert channel goes silent.

Two new LaunchAgents on the MBP close this:

com.sanctum.council-canary-offbox — chat probe to the Mini via Tailscale with the bearer token, every 10 minutes.
com.sanctum.council-drift-offbox — runs deploy-sanctum-mlx.sh verify hourly from the MBP against the Mini.

The alert channel is osascript 'display notification' on the MBP’s own screen, with a “Submarine” sound. Force Flow is intentionally not used — the whole premise is the Mini might be dead. Two boxes now have to be compromised to silently lie about the council’s health, not one.

5. Hardened-runtime codesign

Vanilla cargo build produces a Mach-O with flags=0x20002(adhoc,linker-signed). That’s the weakest thing macOS accepts. scripts/codesign-sanctum-mlx.sh re-signs with --options runtime plus a minimal entitlement set — only com.apple.security.cs.allow-jit (Metal’s shader compiler needs W^X pages). After the re-sign, codesign -dv shows flags=0x10002(adhoc,runtime).

We do NOT add disable-library-validation. Library validation staying on is the whole point of hardened runtime: any unsigned dylib shoved into the process gets rejected by the loader. DYLD environment overrides are blocked, ptrace is blocked by default, JIT is only allowed because the entitlement explicitly permits it.

Without a paid Apple Developer ID the signature is still adhoc, which means we can’t notarize and distribute the binary to other users. Fine — this is an internal service, distribution is rsync. What we get with hardened runtime is the runtime-enforcement hardening, orthogonal to notarization.

Deferred (still, honestly)

After P5 + P6: the binary is hardened + auth-gated + bind-restricted + manifest-signed + doubly-watched + Developer-ID-signed + Apple-notarized. The remaining gaps from A → A+:

Gap	What it would take
Close the 3/10 bf16 ULP parity gap	fp32 accumulator in the fused Metal kernel, or reduction-order surgery
Proper mTLS on `:1337`	rustls, client cert distribution, a rotation policy
Prometheus metrics + dashboard	`metrics` crate, scrape endpoint, Holocron panel
Zero-downtime blue-green	Caddy front-proxy for `:1337/:1338` hot-swap
Parity smoke battery in CI	Self-hosted arm64 runner with Metal
True HA (multi-node council)	Active-passive or load-balanced across Mini + MBP

P6 — Developer ID signing + Apple notarization (2026-04-18)

The last B+ → A-minus step was trading the adhoc signature for Apple’s Developer ID chain and running the binary through the notary service.

Signing identity: Developer ID Application: Bertrand Nepveu (GJ994MN2YF). Both MBP and Mini produce binaries with the full three-link Apple chain — Developer ID Application → Developer ID Certification Authority → Apple Root CA — flags=0x10000(runtime) (no more adhoc), secure timestamp, allow-jit entitlement for Metal.

Notarization auth: App Store Connect API Key instead of an Apple ID + app-specific password. The .p8 was already on the Mini from the Holocron notarization work (~/.keys/holocron-notary/AuthKey_PLACEHLDR0.p8) — ASC keys are team-scoped, so the same key works for any binary under the same team. notarytool store-credentials sanctum stashes the triple (key/key-id/issuer) in macOS keychain. Future runs just use --keychain-profile sanctum, no interactive auth, no 2FA, no browser.

Submission IDs (audit trail):

MBP: 00000000-0000-0000-0000-000000000004 — Ready for distribution.
Mini: 00000000-0000-0000-0000-000000000005 — Ready for distribution.

Mini keychain ACL (the reason SSH signing used to fail): macOS private keys ship with a default access control that prompts for user approval on each use. Over SSH there’s no GUI, so codesign gets errSecInternalComponent and exits. One-time fix is security set-key-partition-list -S apple-tool:,apple:,codesign:,unsigned: -s -k <login-pw> ~/Library/Keychains/login.keychain-db. After that, as long as the login keychain is unlocked, SSH codesign works. security unlock-keychain -p <login-pw> handles the unlock — no GUI interaction required.

One-time friction points encountered during rollout:

Apple’s .p8 can only be downloaded once at creation. Miss that window and you must revoke + regenerate. We were lucky — the key from the Holocron project was still on disk at ~/.keys/holocron-notary/.
An expired Developer Program agreement blocked the ASC key with HTTP 403 — “required agreement is missing or has expired”. The account holder clicks through at appstoreconnect.apple.com/agreements, and the key immediately works. Nothing about the key itself was wrong.
CLI Mach-O binaries cannot be stapled (stapling attaches a ticket to a .app/.pkg/.dmg wrapper; a bare binary has nowhere to put it). Gatekeeper fetches the ticket online on first run from any Mac. The authoritative “it’s notarized” check is the log line "statusSummary": "Ready for distribution" + "issues": null, which is recorded at ~/.openclaw/logs/notarization-<id>.log.

Grade post-P6: A-minus. The “military-grade” label is unambiguously earned. The A+ layer — Prometheus, mTLS, HA, CI parity, kernel-level bit-exact parity — is a horizon-item, not a next-sprint item.

P7 — observability, failover, and parity CI (2026-04-18)

Four of the six A+ items landed in one push. mTLS deferred — Tailscale already encrypts wire transport and bearer auth handles identity, so the remaining mTLS value is cryptographic per-client identity, worth doing but worth doing carefully.

1. Prometheus `/metrics` endpoint

src/metrics.rs installs a Prometheus recorder at startup and middleware that times every HTTP request. Series:

sanctum_mlx_http_requests_total{route,method,status} — counter
sanctum_mlx_http_duration_seconds{route,method,status} — histogram, 5 ms – 180 s buckets
sanctum_mlx_http_inflight_requests — gauge
sanctum_mlx_inference_requests_total{stop_reason} — counter
sanctum_mlx_inference_{completion,prompt}_tokens_total — counters
sanctum_mlx_inference_duration_seconds — histogram
sanctum_mlx_inference_tokens_per_sec — histogram (0.5 – 96 tok/s buckets)
sanctum_mlx_inference_completion_tokens — histogram (1 – 4096 tokens)
sanctum_mlx_startup_seconds{phase="manifest_verify"|"model_load"} — histogram
sanctum_mlx_model_loaded — gauge (0/1)

GET /metrics renders Prometheus text format and goes through the same auth middleware as the rest — loopback bypasses, a cross-machine Prometheus scraper presents the bearer token. Operational detail doesn’t leak onto the open Tailnet.

2. HA failover in the sanctum-server router

HttpProxyBackend now holds a Vec<String> of URLs instead of one. generate tries primary, then each fallback on connect failure or 5xx. 4xx responses bail immediately — a fallback won’t fix a malformed request.

Once a stream starts, we stay on that URL. Mid-stream checkpointing across endpoints would require KV-cache / SSM-state migration the engines don’t expose, and “start over on the fallback” is an honest option for stateless endpoints, not streaming inference.

BackendDef in instance.yaml gains an optional fallback_urls array:

council-secure:
  url: http://127.0.0.1:1337/v1
  fallback_urls:
    - http://100.0.0.55:8902/v1  # MBP shadow via Tailscale
  api_key_env: COUNCIL_API_KEY

Deployment is config-only — sanctum-server picks up the fallbacks at next load. health_check is now “healthy if ANY url responds,” short-circuiting on primary success.

3. Nightly parity smoke battery

com.sanctum.council-parity-smoke runs at 03:00 every night. Executes 10 fixed prompts at temperature=0 and checks each response against a known-good substring spec (parity-smoke.json). Threshold: more than 1 failure out of 10 fires a Force Flow alert.

This catches what the manifest + signature layer can’t: weights that hash fine but answer wrong. A silent mlx-rs rebase that shifts the fused kernel one ULP, an adapter swap, a quiet kernel cache corruption — any of these would show up in the morning’s smoke log as a regression.

Initial run in prod: 10/10 pass, ~60 s wall time. The spec file is editable without recompiling — when an intentional kernel change shifts answers, update the expected strings and commit.

4. mutual TLS with per-client certs

src/mtls.rs + scripts/mtls-gen-certs.sh. sanctum-mlx optionally binds a second listener with rustls that only accepts clients presenting a cert signed by our own CA. The whole PKI lives at ~/.sanctum/certs/ on both machines:

certs/
├── ca.crt, ca.key              self-signed EC P-256 CA (5-year validity)
├── server.crt, server.key       server cert, SAN'd for every address
│                                sanctum-mlx might bind to (loopback,
│                                Tailscale IPs, bridge100, hostnames)
└── clients/
    ├── guardian.{crt,key}       per-client cert, CN=<client-name>
    ├── canary.{crt,key}         audit logs can pin every action to
    ├── drift.{crt,key}          a specific identity
    ├── parity-smoke.{crt,key}
    ├── sanctum-server.{crt,key}
    └── council-offbox.{crt,key}

The binary owns two routers now:

Plain HTTP (:1337) — bearer-auth enforced with loopback bypass. Everything we shipped in P5 still works; guardian/canary/drift/sanctum-server/VM-through-triage on loopback keep functioning without config changes.
mTLS (:1338) — no bearer check. The TLS handshake already verified the client’s certificate chain against our CA, and doubling up the gate with an HTTP header would just be more surface area to break.

Cross-machine end-to-end verified: curl --cacert ca.crt --cert clients/sanctum-server.crt --key clients/sanctum-server.key https://100.0.0.25:1338/v1/chat/completions from MBP returned "2 + 2 equals" in 1.84 s. Without a client cert the TLS handshake is rejected at layer-4 with rustls reason 1116.

Rollout is gradual, no flag day. Plain listeners stay up alongside TLS; migrate clients one at a time. Once every client has moved to mTLS, drop the plain --host lines from the plist and keep only --tls-host. The bearer token stays around as a belt-and-suspenders fallback during the transition and can be retired the day the bearer-auth log hits zero non-loopback hits for a week.

What mTLS buys you that bearer didn’t. Bearer is a single shared secret — leak the token and every bearer-holder can call the council. mTLS gives per-client cryptographic identity: each client has its own cert, can be revoked individually without re-issuing everyone’s credential, and the CN appears in audit logs so “guardian made this call” vs “sanctum-server made this call” is legible without additional middleware. Tailscale already encrypts wire transport, so the confidentiality gain is minor — the identity gain is the point.

One code-path gotcha fixed along the way. The P5 guard that refuses to start without a loopback listener was checking ip().is_loopback(), which is false for 0.0.0.0. That crashed the MBP shadow (which binds 0.0.0.0:8902) in a restart loop. Now is_unspecified() also counts — 0.0.0.0 implicitly accepts loopback connections.

Grade post-P7: A+

All four a-minus → A+ items landed. Observable (Prometheus), survivable (HA), verifiable (parity cron), and identity-gated (mTLS). The remaining work — closing the 3/10 bf16 ULP parity gap, migrating every client onto mTLS, retiring the bearer token — is tuning, not hardening.

File Map

File	Description
`sanctum-rs/services/sanctum-mlx/src/main.rs`	CLI (clap), axum routes, server setup, manifest gate before Metal load
`sanctum-rs/services/sanctum-mlx/src/server.rs`	Inference pipeline, SSE streaming, ChatML prompt formatting, `StopSeqBuffer`, `sampling_params_from_request`
`sanctum-rs/services/sanctum-mlx/src/sampling.rs`	Full sampling pipeline: `SamplingParams`, `RecentTokens` ring buffer, `apply_repetition_penalty`, `apply_top_p`, `sample`, and the callback-based `decode` loop that replaces `qwen3_5::Generate`
`sanctum-rs/services/sanctum-mlx/src/lora.rs`	LoRA adapter load + dequantize-merge-requantize pipeline
`sanctum-rs/services/sanctum-mlx/src/manifest.rs`	Coreutils-format SHA-256 weight manifest verify, rayon-parallel; refuses to serve on mismatch
`sanctum-rs/services/sanctum-mlx/deploy/`	Versioned plist + guardian + canary + drift + weight manifest; canonical copies for drift check
`sanctum-rs/scripts/deploy-sanctum-mlx.sh`	install / upgrade / rollback / status / verify wrapper
`sanctum-rs/vendor/mlx-rs/mlx-lm/src/models/qwen3_5.rs`	Full Qwen3.5 model: `LinearAttention` (GatedDeltaNet) with `forward_with_cache`, `LinearAttentionState`, VL config flattening, `FullAttention` (gated Q), quantized loading
`sanctum-rs/vendor/mlx-rs/mlx-lm/src/metal_kernels.rs`	Fused Metal GPU kernel + safe FFI wrappers
`sanctum-rs/services/sanctum-mlx/test_e2e_sanctum_mlx.sh`	34-test e2e suite, 10 phases