2026-04-24: Five Locks on the Voice Door

The phone call use case was already in production: Yoda’s voice, cloned from nine reference WAVs, running on Qwen3-TTS through LiveKit and voip.ms. What wasn’t in production was any discipline around who could make sanctum speak. Anything on localhost:8008 could synthesize arbitrary text in Yoda’s voice with no token, no audit, and no way to prove later that a clip was genuine.
That was fine when TTS was a dev curiosity and the target was a family phone. It stops being fine the moment the digital-twin-of-Bert voice is real: without origin signing, a captured and shared clip of that voice saying something Bert didn’t say can never be proven genuine or fake. This is the day that gap closed.
What Changed in One Commit
The sanctum-xtts crate was renamed to sanctum-tts. The Coqui XTTS Python worker, which nobody had used in months, was dropped. Git history preserves it; the crate’s new role is engine-agnostic.
A TtsAdapter trait landed. QwenAdapter proxies to the existing Qwen3-TTS Python server on :8008 (Yoda’s production voice is untouched). OpenVoiceAdapter is a stub returning NotImplemented with a README-documented five-step integration plan — the wiring needs a Bert reference recording and the model download, both next-session work.
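The adapter split can be sketched roughly as follows. This is a minimal illustration, not the crate's actual code: the trait and adapter names come from the text above, but the method signatures, the SynthRequest and TtsError types, the Transport stand-in for an HTTP client, and the "/synthesize" path are all assumptions made for the example.

```rust
use std::fmt;

/// Errors an adapter can return. Variant names are illustrative assumptions.
#[derive(Debug, PartialEq)]
pub enum TtsError {
    /// The engine is declared but not wired up yet.
    NotImplemented(&'static str),
    /// The backing engine failed or was unreachable.
    Backend(String),
}

impl fmt::Display for TtsError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            TtsError::NotImplemented(engine) => write!(f, "{}: not implemented", engine),
            TtsError::Backend(msg) => write!(f, "backend error: {}", msg),
        }
    }
}

/// Hypothetical request shape: which voice, and what to say.
pub struct SynthRequest<'a> {
    pub voice: &'a str,
    pub text: &'a str,
}

/// Engine-agnostic interface: every engine turns a request into WAV bytes.
pub trait TtsAdapter {
    fn name(&self) -> &'static str;
    fn synthesize(&self, req: &SynthRequest<'_>) -> Result<Vec<u8>, TtsError>;
}

/// Stand-in for an HTTP client: (url, text) -> WAV bytes. Hypothetical.
pub type Transport = fn(&str, &str) -> Result<Vec<u8>, String>;

/// Proxies to an already-running TTS server instead of hosting a model.
pub struct QwenAdapter {
    pub base_url: String,
    pub transport: Transport,
}

impl TtsAdapter for QwenAdapter {
    fn name(&self) -> &'static str {
        "qwen"
    }

    fn synthesize(&self, req: &SynthRequest<'_>) -> Result<Vec<u8>, TtsError> {
        // The "/synthesize" path is a placeholder, not the real server route.
        let url = format!("{}/synthesize", self.base_url);
        (self.transport)(&url, req.text).map_err(TtsError::Backend)
    }
}

/// Stub engine: present in the type system, but refuses every request.
pub struct OpenVoiceAdapter;

impl TtsAdapter for OpenVoiceAdapter {
    fn name(&self) -> &'static str {
        "openvoice"
    }

    fn synthesize(&self, _req: &SynthRequest<'_>) -> Result<Vec<u8>, TtsError> {
        Err(TtsError::NotImplemented("openvoice"))
    }
}
```

With a shape like this, the daemon can hold a Box<dyn TtsAdapter> and leave the production Qwen path untouched while the OpenVoice stub waits for its reference recording and model download.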
Then five independent defenses, each individually toggleable.
The Five Layers
The one honest limitation
The embedded signature does not survive transcoding. A round trip through wav → mp3 → wav strips the RIFF INFO chunk. For intact clips we have cryptographic proof of origin. For an adversarially re-encoded clip we do not. Closing that gap requires a perceptual watermark (SilentCipher, AudioSeal), which was deliberately scoped out of today’s work rather than ship handwritten watermarking that any competent adversary could defeat.
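To make the limitation concrete, here is a minimal sketch of how a reader might look for the INFO metadata in a WAV file. It relies only on the general RIFF layout (a 12-byte header, then chunks of a 4-byte id, a little-endian u32 size, and a payload padded to even length). The function names are hypothetical, and which INFO sub-chunk sanctum-tts actually stores its signature in is not specified here.

```rust
/// Walk the top-level chunks of a RIFF/WAVE file.
/// Returns None if the header is wrong or a chunk runs past the end.
pub fn riff_chunks(wav: &[u8]) -> Option<Vec<([u8; 4], &[u8])>> {
    if wav.len() < 12 || &wav[0..4] != b"RIFF" || &wav[8..12] != b"WAVE" {
        return None;
    }
    let mut chunks = Vec::new();
    let mut pos = 12;
    while pos + 8 <= wav.len() {
        let id = [wav[pos], wav[pos + 1], wav[pos + 2], wav[pos + 3]];
        let size =
            u32::from_le_bytes([wav[pos + 4], wav[pos + 5], wav[pos + 6], wav[pos + 7]]) as usize;
        let start = pos + 8;
        let end = start.checked_add(size)?;
        if end > wav.len() {
            return None;
        }
        chunks.push((id, &wav[start..end]));
        // Chunk payloads are padded to an even number of bytes.
        pos = end + (size & 1);
    }
    Some(chunks)
}

/// Return the body of the LIST chunk whose list type is INFO, if present.
pub fn info_list(wav: &[u8]) -> Option<&[u8]> {
    riff_chunks(wav)?
        .into_iter()
        .find(|(id, body)| id == b"LIST" && body.starts_with(b"INFO"))
        .map(|(_, body)| &body[4..])
}
```

A file that has been through a lossy round trip typically comes back with only the chunks the encoder chose to write, so a lookup like info_list returns None and there is nothing left to verify.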
Operator Tools
Two new CLIs live alongside the daemon:
- sanctum-tts-admin: keygen, issue-token, pin, show-pubkey. Prepares tts.yaml and the Ed25519 keypair. Never contacts the daemon.
- sanctum-tts-verify: given a WAV, confirms the embedded signature checks out and the audio bytes match what was signed. Exit codes: 0 verified, 2 unsigned, 3 sig invalid, 4 audio tampered.
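The exit-code contract of sanctum-tts-verify is easy to script against. The sketch below models it with a hypothetical VerifyOutcome enum; the four codes are the ones listed above, while the type names and the order in which the checks are applied in classify are assumptions.

```rust
/// Possible results of verifying a clip. Names are illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VerifyOutcome {
    Verified,
    Unsigned,
    SignatureInvalid,
    AudioTampered,
}

impl VerifyOutcome {
    /// Map each outcome to the documented process exit code.
    pub fn exit_code(self) -> i32 {
        match self {
            VerifyOutcome::Verified => 0,
            VerifyOutcome::Unsigned => 2,
            VerifyOutcome::SignatureInvalid => 3,
            VerifyOutcome::AudioTampered => 4,
        }
    }
}

/// Assumed check order: is there a signature, does it verify,
/// and do the audio bytes still hash to what was signed?
pub fn classify(
    signature_present: bool,
    signature_valid: bool,
    audio_hash_matches: bool,
) -> VerifyOutcome {
    if !signature_present {
        VerifyOutcome::Unsigned
    } else if !signature_valid {
        VerifyOutcome::SignatureInvalid
    } else if !audio_hash_matches {
        VerifyOutcome::AudioTampered
    } else {
        VerifyOutcome::Verified
    }
}
```

Keeping 1 unused leaves room for generic failures such as a missing file, so a caller can tell "could not check" apart from "checked and failed"; that reading of the gap is an inference, not something the tool documents.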
A token for a new caller:
    $ sanctum-tts-admin issue-token livekit-agent
    caller_id    : livekit-agent
    token        : 5b4a...32bytes
    token_sha256 : c12d...

Paste the hash into tts.yaml. Hand the plaintext to the client exactly once. Rotation = re-issue.
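Because only the hash lives in tts.yaml, the daemon side reduces to hashing whatever token a caller presents and looking for a match. The following is a rough sketch under stated assumptions: caller_for_token and constant_time_eq are hypothetical names, the config is modelled as a plain map from caller_id to token_sha256, and the SHA-256 function is passed in as a parameter so the example stays dependency-free.

```rust
use std::collections::HashMap;

/// Compare two byte strings without stopping at the first mismatch,
/// so timing does not reveal how many leading bytes matched.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b) {
        diff |= x ^ y;
    }
    diff == 0
}

/// Resolve a presented plaintext token to a caller_id.
/// `callers` maps caller_id -> hex digest, as pasted into the config.
/// `sha256_hex` stands in for a real SHA-256 implementation.
pub fn caller_for_token<'a>(
    callers: &'a HashMap<String, String>,
    presented: &str,
    sha256_hex: impl Fn(&[u8]) -> String,
) -> Option<&'a str> {
    let digest = sha256_hex(presented.as_bytes());
    callers
        .iter()
        .find(|(_, stored)| constant_time_eq(stored.as_bytes(), digest.as_bytes()))
        .map(|(caller_id, _)| caller_id.as_str())
}
```

Under this model, rotation is just replacing one map entry: the old plaintext stops hashing to anything in the config, and nothing on disk ever needs to be scrubbed of a secret.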
Verification
35 unit tests across the new modules:
| Module | Tests | Covers |
|---|---|---|
| auth | 7 | Token hashing stability, enable/disable, per-voice allow, unknown token, malformed header, localhost bypass gating |
| audit | 4 | JSONL shape with plaintext non-leakage, size-based rotation, hash determinism |
| integrity | 5 | Pinned match, mismatch disables, unpinned flagged, missing file, SHA-256 known-vector |
| signing | 6 | Sign/verify roundtrip, tampered-hash breaks verify, WAV chunk embed + extract, PEM roundtrip |
| config | 3 | Minimal / empty / hardened YAML parse |
Workspace builds clean. Pre-existing crates untouched.
End-to-End Not Automated — On Purpose
The signing/verify pair is itself the end-to-end test. A healthy deploy produces a WAV for which sanctum-tts-verify exits 0, and that exit code is the round-trip proof. The reference-clip pin is its own check at startup. The mTLS handshake is rustls’s job. Every real call through the daemon is end-to-end verified as a side effect; writing a separate harness that simulates the whole pipeline would duplicate what production already verifies continuously.
Related
- Sanctum TTS: the architecture page describing the adapter pattern and the five layers in detail.
- Chitti — The Fascial Layer — the pressure/presence field sanctum-tts will read before heavy loads in a future session.
- The Living Force — the immune system this service plugs into.