
The Smart Router Cathedral

[Figure: The Smart Router Cathedral — a holographic five-tier Jedi cathedral rising from a dark command floor, each tier a glowing ring of circuitry labeled by function, a central pillar of teal light connecting them.]

The Smart Router started as a glorified match statement — model field in, backend out. It ended as a cathedral. Five tiers of defense between a client request and a model response, each tier addressing a specific failure mode the system used to quietly absorb.

This page is the canonical reference. If you want a five-minute tour instead of a first-principles read, skip to the matrix at the end.

The model field the caller sends is a nickname. The upstream needs an ID. LM Studio doesn’t accept coder; Google AI Studio doesn’t accept spatial; Anthropic doesn’t accept cloud. Tier 0 added default_model per backend in instance.yaml so every outbound request carries the canonical upstream ID:

coder:
  url: http://127.0.0.1:1234/v1
  default_model: qwen2.5-coder-14b-instruct
cloud:
  url: http://100.0.0.55:3456/v1
  default_model: claude-opus-4
spatial:
  url: https://generativelanguage.googleapis.com/v1beta/openai/v1
  default_model: gemini-3.1-pro-preview

Without this, every non-council backend 404’d. With it, routing is actually correct.
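The rewrite itself is small once the config exists. A minimal sketch, assuming a BackendConfig shape like the YAML above and a serde_json request body; the names are illustrative, not the server's actual types:

```rust
use serde_json::Value;

/// Per-backend settings as parsed from instance.yaml (illustrative shape).
struct BackendConfig {
    url: String,
    default_model: Option<String>,
}

/// Rewrite the caller's nickname ("coder", "cloud", ...) to the backend's
/// canonical upstream model ID before the request goes out.
fn pin_model(body: &mut Value, backend: &BackendConfig) {
    if let (Some(canonical), Some(obj)) = (&backend.default_model, body.as_object_mut()) {
        obj.insert("model".to_string(), Value::String(canonical.clone()));
    }
}
```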

Four production backends wired through sanctum-server on 127.0.0.1:8900:

| Backend | Upstream | Route patterns | Auth |
| --- | --- | --- | --- |
| council-secure | Qwen3.6-35B-A3B on :1337 mTLS | yoda, mothma, windu, cilghal, mundi, jocasta, quigon | per-client cert |
| coder | Qwen2.5-Coder-14B via LM Studio :1234 | coder, code-*, *coder*, ahsoka | none (loopback) |
| cloud | Claude Opus 4.7 via claude-max-api-proxy on 100.0.0.55:3456 (Mobile primary) with 100.0.0.20:3456 (Hub secondary, since 2026-04-27) | opus, claude-*, escalation | Max OAuth (proxy-mediated, separate session per host) |
| spatial | Gemini 3.1 Pro via Google AI Studio | gemini, windu-spatial, spatial | API key in plist env |

Each backend declares HA fallback_urls. cloud falls over from the Max proxy to OpenRouter; council-secure falls over from Mini mTLS to MBP shadow.
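As a rough picture of what Tier 1 does inside a single request, here is a sketch of a fallback loop over the primary URL plus fallback_urls, using reqwest; the real server's error classification and retry policy are more involved than this:

```rust
use reqwest::Client;
use serde_json::Value;

/// Tier 1: try the primary URL, then each fallback_urls entry, returning the
/// first response that makes it onto the wire. Sustained outages are the
/// breaker's job (Tier 3); this loop only covers one request.
async fn post_with_fallback(
    client: &Client,
    urls: &[String], // primary first, then fallback_urls in declared order
    body: &Value,
) -> Result<reqwest::Response, reqwest::Error> {
    let mut last_err = None;
    for url in urls {
        match client
            .post(format!("{url}/chat/completions"))
            .json(body)
            .send()
            .await
        {
            Ok(resp) => return Ok(resp),
            Err(err) => last_err = Some(err),
        }
    }
    Err(last_err.expect("at least one URL is configured"))
}
```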

/metrics serves Prometheus text at :8900/metrics. Every request, every decision, every backend call is a counter or histogram:

  • sanctum_server_http_requests_total{route,method,status} — RED at the edge
  • sanctum_server_backend_requests_total{backend,outcome} — RED per backend, outcome in ok / client_error / server_error / connect_error / parse_error / breaker_open / budget_exhausted / quality_issue
  • sanctum_server_routing_decisions_total{from_model,to_backend,rule} — which Jedi went where, via which tier
  • sanctum_server_backend_duration_seconds — latency histogram for p50/p95/p99 PromQL
  • sanctum_server_backend_tokens_in_total / _out_total — what flowed
  • sanctum_server_backend_fallback_total — when HA kicked in
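Metric families like these are cheap to declare with the prometheus crate. The names below mirror the list above; the registration code itself is a sketch, not the server's actual source, and the per-backend label on the histogram is an assumption:

```rust
use once_cell::sync::Lazy;
use prometheus::{
    register_histogram_vec, register_int_counter_vec, HistogramVec, IntCounterVec,
};

static ROUTING_DECISIONS: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "sanctum_server_routing_decisions_total",
        "Routing decisions by source model, chosen backend, and rule tier",
        &["from_model", "to_backend", "rule"]
    )
    .unwrap()
});

static BACKEND_DURATION: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!(
        "sanctum_server_backend_duration_seconds",
        "Backend call latency in seconds",
        &["backend"]
    )
    .unwrap()
});

/// Record one routed request: which nickname went to which backend via which
/// rule, and how long the backend call took.
fn record(from_model: &str, backend: &str, rule: &str, secs: f64) {
    ROUTING_DECISIONS
        .with_label_values(&[from_model, backend, rule])
        .inc();
    BACKEND_DURATION.with_label_values(&[backend]).observe(secs);
}
```

With that histogram in place, a p95 is a one-liner in PromQL: histogram_quantile(0.95, sum by (le, backend) (rate(sanctum_server_backend_duration_seconds_bucket[5m]))).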

Structured JSON logs (SANCTUM_LOG_FORMAT=json) carry a per-request request_id via axum middleware. Inbound X-Request-ID is honoured; otherwise a UUIDv4 is minted and echoed on the response. jq '.span.request_id=="…"' pulls a whole request’s trace.
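A sketch of that middleware under axum 0.7's from_fn, with the ID stashed in request extensions for illustration (the actual server puts it on the tracing span, which is why it shows up under .span.request_id in the logs):

```rust
use axum::{extract::Request, http::HeaderValue, middleware::Next, response::Response};
use uuid::Uuid;

/// Honour an inbound X-Request-ID; otherwise mint a UUIDv4. Echo it on the
/// response so clients, logs, and dashboards agree on the same ID.
async fn request_id(mut req: Request, next: Next) -> Response {
    let id = req
        .headers()
        .get("x-request-id")
        .and_then(|v| v.to_str().ok())
        .map(str::to_owned)
        .unwrap_or_else(|| Uuid::new_v4().to_string());

    // Make the ID reachable downstream (the real server attaches it to the tracing span).
    req.extensions_mut().insert(id.clone());

    let mut resp = next.run(req).await;
    resp.headers_mut()
        .insert("x-request-id", HeaderValue::from_str(&id).unwrap());
    resp
}
```

It would hang off the router with something like Router::new().layer(axum::middleware::from_fn(request_id)).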

The Holocron RouterPanel scrapes /api/router/status every 10 s and renders the backend and decision tables live.

Three mechanisms that watch for known-broken states and route around them:

Circuit breaker. Per backend. Closed → Open after 5 consecutive failures → HalfOpen after 30 s → Closed on canary success or Open again on canary failure. Lock-free on the hot path (atomic u32 failure counter); mutex only on state transitions. Complementary to HA fallback: HA handles per-URL failures within a request, the breaker handles sustained per-backend outages across requests.
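A condensed sketch of that state machine, with the atomic failure counter on the hot path and the mutex reserved for transitions. The thresholds match the defaults above; everything else is simplified, and the real breaker keeps the success path lock-free as well:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Mutex;
use std::time::{Duration, Instant};

#[derive(Clone, Copy, PartialEq)]
enum State {
    Closed,
    Open,
    HalfOpen,
}

const TRIP_AFTER: u32 = 5; // consecutive failures before opening
const COOL_DOWN: Duration = Duration::from_secs(30); // Open -> HalfOpen delay

struct Breaker {
    failures: AtomicU32,            // hot path: lock-free failure counter
    state: Mutex<(State, Instant)>, // cold path: taken only on transitions
}

impl Breaker {
    fn record_failure(&self) {
        let n = self.failures.fetch_add(1, Ordering::Relaxed) + 1;
        if n >= TRIP_AFTER {
            *self.state.lock().unwrap() = (State::Open, Instant::now());
        }
    }

    fn record_success(&self) {
        self.failures.store(0, Ordering::Relaxed);
        let mut s = self.state.lock().unwrap();
        if s.0 == State::HalfOpen {
            *s = (State::Closed, Instant::now()); // canary succeeded
        }
    }

    /// Open backends become HalfOpen once the cool-down elapses, so a single
    /// canary request gets through on the next selection.
    fn current_state(&self) -> State {
        let mut s = self.state.lock().unwrap();
        if s.0 == State::Open && s.1.elapsed() >= COOL_DOWN {
            *s = (State::HalfOpen, Instant::now());
        }
        s.0
    }
}
```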

Breaker-aware routing. Router::select skips backends whose breaker is Open at every tier (direct, pattern, intent). Default backend is always returned as a last resort so callers get a deterministic error instead of hanging.

Token budget. Per backend, per day. daily_cap_completion_tokens in instance.yaml caps billable output; when exhausted, budget_ok() returns false and the router skips the backend exactly the way it skips Open breakers. Resets at UTC midnight; not persisted across process restarts.
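Putting the Tier 3 pieces together, here is a sketch of a breaker- and budget-aware selection step. It builds on the Breaker sketch above; the Backend fields are illustrative, not the server's actual layout:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct Backend {
    name: String,
    breaker: Breaker,                   // the Tier 3 sketch above
    daily_cap_completion_tokens: u64,   // from instance.yaml
    completion_tokens_today: AtomicU64, // reset at UTC midnight, not persisted
}

impl Backend {
    fn budget_ok(&self) -> bool {
        self.completion_tokens_today.load(Ordering::Relaxed) < self.daily_cap_completion_tokens
    }
}

/// Pick the first candidate whose breaker is not Open and whose daily budget
/// is not exhausted; always fall back to the default backend so callers get a
/// deterministic error instead of a hang.
fn select<'a>(candidates: &[&'a Backend], default: &'a Backend) -> &'a Backend {
    candidates
        .iter()
        .copied()
        .find(|b| b.breaker.current_state() != State::Open && b.budget_ok())
        .unwrap_or(default)
}
```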

Backends that answer every request correctly but in 45 s instead of 3 s are degraded, not broken. Tier 4 detects and trips on them.

Latency EMA. Exponential moving average (α = 0.2, ~5-sample half-life) of completed-request seconds per backend. AtomicU64-packed f64 bits for lock-free reads. Exposed as sanctum_server_backend_latency_ema_seconds{backend} gauge and rendered in the Router panel’s EMA column.
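A sketch of that packed-atomic EMA; the real implementation may use a compare-exchange loop so concurrent writers do not drop samples, but the bit-packing idea is the same:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const ALPHA: f64 = 0.2; // smoothing factor from the description above

/// Per-backend latency EMA stored as raw f64 bits in an AtomicU64 so the
/// router and the /metrics exporter can read it without locking.
struct LatencyEma(AtomicU64);

impl LatencyEma {
    fn new() -> Self {
        Self(AtomicU64::new(0f64.to_bits()))
    }

    fn get(&self) -> f64 {
        f64::from_bits(self.0.load(Ordering::Relaxed))
    }

    /// Fold one completed request's wall-clock seconds into the average.
    fn update(&self, sample_secs: f64) {
        let prev = self.get();
        let next = if prev == 0.0 {
            sample_secs // first sample seeds the average
        } else {
            ALPHA * sample_secs + (1.0 - ALPHA) * prev
        };
        self.0.store(next.to_bits(), Ordering::Relaxed);
    }
}
```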

Slow-backend trip. When the EMA crosses a configurable slow_threshold_s for slow_trip_count consecutive successful requests (default 3), the backend’s breaker gets a synthetic failure. The standard Open → HalfOpen → Closed cycle handles the rest.

cloud:
  url: http://100.0.0.55:3456/v1
  slow_threshold_s: 15.0   # Opus 4.7 p95 normal ~3s; 15s = broken
  slow_trip_count: 3
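The corresponding check, as a sketch that reuses the LatencyEma and Breaker types from the sketches above; whether a synthetic failure opens the breaker immediately or only counts toward its threshold is an implementation detail the config does not pin down:

```rust
/// Tier 4: after `slow_trip_count` consecutive successful requests completed
/// while the EMA sits above `slow_threshold_s`, feed the breaker a synthetic
/// failure and let the normal Open -> HalfOpen -> Closed cycle take over.
fn check_slow_trip(
    ema: &LatencyEma,
    breaker: &Breaker,
    slow_threshold_s: f64,
    slow_trip_count: u32,
    consecutive_slow: &mut u32,
) {
    if ema.get() > slow_threshold_s {
        *consecutive_slow += 1;
        if *consecutive_slow >= slow_trip_count {
            breaker.record_failure(); // synthetic failure toward the breaker
            *consecutive_slow = 0;
        }
    } else {
        *consecutive_slow = 0; // a fast request resets the streak
    }
}
```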

The final tier handles a failure mode that HTTP can’t see: the backend returned 200 OK with a response that’s garbage. Four detectors, one for each pathology we actually saw in production:

| Detector | Verdict | Trips breaker | Example |
| --- | --- | --- | --- |
| empty_content | Broken | yes | Google AI Studio at low max_tokens — “done” with zero tokens emitted |
| pure_repetition | Broken | yes | Qwen3.5 conv_state bug — “190/ 190/ 190/” for 80 tokens |
| think_tag_leak | Suspicious | no | Reasoning model leaked <think> past the stream filter |
| truncated_tiny | Suspicious | no | finish_reason=length and content < 4 bytes |

Every successful response accumulates its content across the stream and runs the detectors at completion. Broken verdicts call breaker.record_failure() and flip the metric outcome to quality_issue. The current caller still sees whatever was streamed; the next two after them get routed away. The second user is the guarantee; the first user is the canary.
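For a sense of how cheap the pass is, here is a sketch of the post-stream check; the heuristics (whitespace tokenisation, a 90% repetition cut-off, the 4-byte floor) are illustrative stand-ins for whatever the real detectors use:

```rust
/// Verdict from the Tier 5 sanity pass over the accumulated response.
enum Verdict {
    Ok,
    Suspicious,
    Broken,
}

/// Run the cheap detectors once the stream has completed. `content` is the
/// full accumulated text; `finish_reason` comes from the final chunk.
fn sanity_check(content: &str, finish_reason: &str) -> Verdict {
    // empty_content: upstream said "done" but emitted nothing usable.
    if content.trim().is_empty() {
        return Verdict::Broken;
    }

    // pure_repetition: one short token repeated for essentially the whole output.
    let words: Vec<&str> = content.split_whitespace().collect();
    if words.len() >= 20 {
        let first = words[0];
        let repeats = words.iter().filter(|w| **w == first).count();
        if repeats * 10 >= words.len() * 9 {
            return Verdict::Broken;
        }
    }

    // think_tag_leak: reasoning markup escaped the stream filter.
    if content.contains("<think>") {
        return Verdict::Suspicious;
    }

    // truncated_tiny: cut off by length with almost no content.
    if finish_reason == "length" && content.len() < 4 {
        return Verdict::Suspicious;
    }

    Verdict::Ok
}
```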

| Tier | Watches for | Signal | Action |
| --- | --- | --- | --- |
| 0 | wrong model ID | config | pin default_model |
| 1 | backend dead | HA fallback_urls | try next URL |
| 2 |  | /metrics + JSON logs | observe |
| 3 | sustained backend failure | error rate / budget cap | trip breaker → route around |
| 4 | degraded backend (slow) | latency EMA > threshold | trip breaker → route around |
| 5 | garbage output | sanity verdict = Broken | trip breaker → route around |

Thirteen stones across five tiers. 48 tests. The cathedral routes, observes, recovers, and reroutes on its own. The caller just sends {"model":"yoda"} and gets an answer — from wherever the answer is currently cheapest, fastest, and least likely to be nonsense.