Sanctum TTS

[Image: a single lit door in a stone wall with five distinct latches (cert, key, pin, seal, ledger), pencil-sketched against a dark void, an amber line glowing from under the door]

Yoda’s voice answers the phone. Mothma narrates the morning briefing. Someday soon Neo’s own digital twin does both, in English and in Quebec French, depending on who dialed. For any of that to be safe to put on a phone line that connects to the outside world, the TTS layer has to be two things at once: clean enough that adding a new voice is a YAML edit, and boundary-enforced enough that an adversary who reaches the port can’t impersonate someone. sanctum-tts is that layer.

Every TTS engine brings its own Python environment, model format, and cloning quirks. Qwen3-TTS lives inside mlx-audio’s venv. OpenVoice v2 clashes with it on torch/transformers pins. Coqui XTTS had its own world before we retired it. Keeping them in separate processes, each behind its own LaunchAgent, is the only sane way to add new engines without dependency wars.

sanctum-tts is the Rust dispatcher on :8007. Callers address voices by name; the daemon routes to whichever adapter the voice registry declares. The dispatcher never imports Python. Adapters are thin HTTP proxies.

           ┌──────────────┐       ┌──────────────┐
clients ──▶│ sanctum-tts  │──────▶│ qwen3-tts    │
(mTLS)     │ :8007        │       │ :8008 (py)   │
Bearer  ──▶│ dispatches   │──────▶│ openvoice    │
           │ by voice     │       │ :8009 (py)   │ (planned)
           │ signs WAV    │       └──────────────┘
           │ audits       │
           └──────────────┘
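
The routing step can be pictured with a few lines of Python. This is only an illustration of the lookup, since the real dispatcher is Rust: the `VOICES` and `ADAPTER_URLS` tables and the `route` function are hypothetical names, and plain HTTP on loopback between dispatcher and adapter is an assumption. The ports are the ones shown in the diagram.

```python
# Illustrative only: the real dispatcher is written in Rust.
# VOICES mirrors the shape of a voice registry; ADAPTER_URLS and route()
# are hypothetical names, and loopback HTTP is an assumption.
VOICES = {
    "yoda": {"adapter": "qwen3", "language": "en"},
}

ADAPTER_URLS = {
    "qwen3": "http://127.0.0.1:8008",
    "openvoice": "http://127.0.0.1:8009",  # planned, not yet serving
}


class UnknownVoice(KeyError):
    """Raised when a caller names a voice the registry does not declare."""


def route(voice: str) -> str:
    """Return the base URL of the adapter process that serves `voice`."""
    entry = VOICES.get(voice)
    if entry is None:
        raise UnknownVoice(voice)
    return ADAPTER_URLS[entry["adapter"]]
```

The point of the shape is that callers only ever see the voice name; swapping the engine behind a voice changes one registry entry and nothing on the client side.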

Every POST /speak passes through five independent defenses. Each is individually toggleable, so a laptop dev run can disable all of them while a phone-facing production deploy can demand every one.

| # | Layer | Purpose | Stored in |
|---|-------|---------|-----------|
| 1 | mTLS | Cert-authenticated wire; same CA as council-mlx. | ~/.sanctum/certs/ |
| 2 | Bearer-token ACL | Per-voice authorization. Hashes only in config; plaintext lives with the client. | tts.yaml (SHA-256 only) |
| 3 | Reference-clip pin | SHA-256 of every voice’s reference WAV. Mismatch at startup disables the voice. | tts.yaml per voice |
| 4 | Ed25519 signing | Every synthesized WAV carries a signed manifest (caller, voice, timestamp, text hash, audio hash). | WAV LIST/INFO chunk + HTTP header |
| 5 | Hashed audit log | Append-only JSONL; text is SHA-256’d before write so the log never becomes a surveillance vector. | ~/.openclaw/logs/tts-audit.jsonl |
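
Layers 2 and 5 both come down to "store the hash, never the plaintext". A minimal Python sketch of that idea follows. It is not the daemon's code (which is Rust); the function names and the audit record's field names are assumptions made for illustration.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def token_allows(presented: str, allowed_hashes: list[str]) -> bool:
    """Layer 2: hash the presented bearer token and compare it against the
    hashes configured for a voice. Only hashes live on the server side."""
    digest = sha256_hex(presented.encode("utf-8"))
    # compare_digest keeps each comparison constant-time.
    return any(hmac.compare_digest(digest, h) for h in allowed_hashes)


def audit_line(caller_id: str, voice: str, text: str) -> str:
    """Layer 5: one JSONL record; the spoken text is stored only as a hash.
    Field names here are hypothetical."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "caller_id": caller_id,
        "voice": voice,
        "text_sha256": sha256_hex(text.encode("utf-8")),
    }
    return json.dumps(record, separators=(",", ":"))
```

Because the log holds only `text_sha256`, an auditor who already has a suspect transcript can confirm it was spoken, but nobody can read conversations back out of the file.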

Signing — What It Proves, What It Doesn’t

The signed manifest is canonical JSON with a stable field order:

{"v":1,"ts":"...","signer_id":"...","caller_id":"...","voice":"...","text_sha256":"...","audio_sha256":"..."}

The signature is Ed25519 over the UTF-8 bytes of that JSON. Signature and JSON are both embedded in the WAV as a RIFF LIST/INFO chunk (ICMT = manifest, IART = base64 signature). The clip travels, and the signature travels with it. Standard WAV readers ignore the chunk and play the audio; sanctum-tts-verify reads it and reports whether the clip verifies.
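
To make the embedding concrete, here is a small Python sketch of appending and reading back a LIST/INFO chunk using only `struct`. It follows the generic RIFF layout (4-byte id, little-endian size, payload padded to an even length); the function names are hypothetical, and details such as null-terminating the strings or placing the chunk after the audio data are assumptions rather than a description of what sanctum-tts does.

```python
import struct


def _chunk(tag: bytes, payload: bytes) -> bytes:
    """One RIFF chunk: 4-byte id, little-endian size, data, pad to even length."""
    pad = b"\x00" if len(payload) % 2 else b""
    return tag + struct.pack("<I", len(payload)) + payload + pad


def embed_info(wav: bytes, manifest_json: str, signature_b64: str) -> bytes:
    """Append a LIST/INFO chunk (ICMT = manifest, IART = signature) to a WAV."""
    if wav[:4] != b"RIFF" or wav[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    info = (
        b"INFO"
        + _chunk(b"ICMT", manifest_json.encode("utf-8") + b"\x00")
        + _chunk(b"IART", signature_b64.encode("ascii") + b"\x00")
    )
    body = wav[8:] + _chunk(b"LIST", info)
    # The RIFF size field counts everything after the first 8 bytes.
    return b"RIFF" + struct.pack("<I", len(body)) + body


def read_info(wav: bytes) -> dict[str, str]:
    """Walk top-level chunks and return the INFO fields, or {} if none."""
    pos, fields = 12, {}
    while pos + 8 <= len(wav):
        tag = wav[pos:pos + 4]
        size = struct.unpack("<I", wav[pos + 4:pos + 8])[0]
        data = wav[pos + 8:pos + 8 + size]
        if tag == b"LIST" and data[:4] == b"INFO":
            sub = 4
            while sub + 8 <= len(data):
                stag = data[sub:sub + 4].decode("ascii")
                ssize = struct.unpack("<I", data[sub + 4:sub + 8])[0]
                raw = data[sub + 8:sub + 8 + ssize]
                fields[stag] = raw.rstrip(b"\x00").decode("utf-8")
                sub += 8 + ssize + (ssize % 2)
        pos += 8 + size + (size % 2)
    return fields
```

A player that only looks for the `fmt ` and `data` chunks never reaches the appended LIST chunk, which is why the audio plays unchanged. The same fact explains the fragility: any tool that rewrites the container is free to drop it.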

What this proves: any intact WAV came from a specific sanctum-tts instance (identified by signer_id = first 8 hex of SHA-256(pubkey)), with known caller/voice/text/timestamp.
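
A sketch of how the manifest bytes and `signer_id` could be derived, in Python for illustration. The field order and the `signer_id` rule come from the text above; everything else is an assumption: that the public key is hashed as raw bytes, that `audio_sha256` covers whatever `audio` bytes the caller passes in, and the timestamp format. The Ed25519 step itself is left out; any Ed25519 library would sign the bytes this function returns.

```python
import hashlib
import json


def signer_id(pubkey: bytes) -> str:
    """First 8 hex characters of SHA-256 over the public key bytes."""
    return hashlib.sha256(pubkey).hexdigest()[:8]


def build_manifest(pubkey: bytes, caller_id: str, voice: str,
                   text: str, audio: bytes, ts: str) -> bytes:
    """Serialize the manifest in the documented field order, no whitespace."""
    manifest = {
        "v": 1,
        "ts": ts,
        "signer_id": signer_id(pubkey),
        "caller_id": caller_id,
        "voice": voice,
        "text_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "audio_sha256": hashlib.sha256(audio).hexdigest(),
    }
    # dicts preserve insertion order, so the key order matches the spec line.
    out = json.dumps(manifest, separators=(",", ":"), ensure_ascii=False)
    return out.encode("utf-8")
```

Stable key order and fixed separators matter because the signature covers bytes, not parsed JSON: two serializations of the same object that differ by one space would not verify against each other.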

What it does not prove: robustness to re-encoding. wav → mp3 → wav drops the RIFF INFO chunk; signature gone. The clip still sounds the same; we just lose the origin proof. Closing that gap needs a perceptual watermark layer (SilentCipher, AudioSeal) — deliberately scoped out of the MVP rather than shipped as handwritten theater.

# One-time: create the Ed25519 keypair at ~/.sanctum/keys/
sanctum-tts-admin keygen
# Mint a bearer token for a caller. Plaintext shown ONCE.
sanctum-tts-admin issue-token livekit-agent
# Compute SHA-256 for a reference clip to pin in tts.yaml
sanctum-tts-admin pin ~/.openclaw/tts-voices/yoda-do-or-do-not.wav
# Inspect the current signer
sanctum-tts-admin show-pubkey
# Verify a clip's signature + integrity
sanctum-tts-verify clip.wav
sanctum-tts-verify --json clip.wav

Exit codes of sanctum-tts-verify: 0 verified, 1 I/O, 2 unsigned, 3 signature invalid, 4 audio tampered after signing.
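
For callers that script around the verifier, the exit codes map naturally onto an enum. A minimal Python wrapper, assuming `sanctum-tts-verify` is on `PATH`; the `VerifyResult` and `verify_clip` names are hypothetical.

```python
import subprocess
from enum import IntEnum


class VerifyResult(IntEnum):
    VERIFIED = 0
    IO_ERROR = 1
    UNSIGNED = 2
    SIGNATURE_INVALID = 3
    AUDIO_TAMPERED = 4


def verify_clip(path: str) -> VerifyResult:
    """Run sanctum-tts-verify on `path` and map its exit status.

    Raises ValueError if the tool exits with a code not listed above.
    """
    proc = subprocess.run(["sanctum-tts-verify", path], capture_output=True)
    return VerifyResult(proc.returncode)
```

Keeping "unsigned" (2) distinct from "signature invalid" (3) and "audio tampered" (4) lets a caller treat a re-encoded clip as merely unproven while treating a modified one as hostile.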

Add a voice with three YAML lines plus a pin:

voices:
  yoda:
    adapter: qwen3
    reference: yoda-do-or-do-not
    language: en
    reference_sha256: 8f2a...   # sanctum-tts-admin pin
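
The startup pin check (layer 3) amounts to hashing each reference clip and dropping any voice whose hash does not match. A Python sketch under stated assumptions: the mapping from `reference` to `<voices_dir>/<reference>.wav` is inferred from the path used with `sanctum-tts-admin pin`, and the function names are hypothetical.

```python
import hashlib
import hmac
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Stream the file through SHA-256 so large clips need not fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()


def enabled_voices(voices: dict, voices_dir: Path) -> dict:
    """Keep only voices whose reference clip matches its pinned hash.

    Assumes each voice's clip lives at <voices_dir>/<reference>.wav.
    """
    ok = {}
    for name, cfg in voices.items():
        clip = voices_dir / f"{cfg['reference']}.wav"
        if clip.is_file() and hmac.compare_digest(
            file_sha256(clip), cfg["reference_sha256"].lower()
        ):
            ok[name] = cfg
    return ok
```

A missing or swapped clip disables only that voice; the others keep serving, which matches the "mismatch at startup disables the voice" behavior in the table.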

Adding Neo as a digital twin once OpenVoice v2 lands becomes two entries — neo (English) and neo_fr (Quebec French) — pointing at the same 2-3 min reference recording.

Shipped

  • sanctum-xtts renamed → sanctum-tts (Coqui worker retired).
  • TtsAdapter trait + typed TtsError.
  • Dispatcher with POST /speak, GET /voices, GET /voices/{id}, GET /health.
  • QwenAdapter proxies to the existing Qwen3-TTS server on :8008.
  • OpenVoiceAdapter stub + 5-step wiring checklist in README.
  • All five defense layers, toggle-by-toggle.
  • sanctum-tts-admin + sanctum-tts-verify CLIs.
  • 35 unit tests.

Next

  • Record Neo’s reference.
  • Stand up com.sanctum.openvoice-tts on :8009.
  • Flip OpenVoiceAdapter from NotImplemented to real proxy.
  • A/B blind-test Neo-English through Qwen3 vs OpenVoice.
  • Neo-in-French becomes a YAML line.
  • Later: perceptual watermarking layer for re-encode robustness.