TurboQuant KV Compression

The KV cache is where long conversations go to eat memory. Every token a language model generates leaves behind a pair of tensors — the keys and values from every attention head in every layer. For a 27B model at 256-dim heads, that’s hundreds of megabytes per thousand tokens. On a Mac Mini that also runs LM Studio and a dozen other services, the KV cache is the thing that decides whether you get to have a conversation at all.
TurboQuant is how we make it smaller. The idea is not ours — it comes from Google’s ICLR 2026 paper (arXiv 2504.19874), layered on top of KVQuant and QJL. Our job is to implement it cleanly inside sanctum-mlx, wire it into Apple’s MLX runtime, and — most importantly — prove the quality doesn’t collapse when we turn the bits down.
This page is both an architecture reference and a scientific protocol for tuning it.
What ships today (Slice 4a-final)
```
incoming K,V tensor [1, H_kv, L, D]
        │
  ┌── keys ──┐                ┌── values ──┐
  ▼                           ▼
  concat-grow plain           group-affine quantize on device
  on-device Array             → store (indices u8, scale f16, zero f16)
  (bf16, bit-exact,                   │
   no compression)                    ▼
                              attention: sdpa_dequant_v fused Metal
                              kernel reads the compressed V state and
                              dequantizes it inline in registers — no
                              full V tensor ever materializes
```

Both halves live in `services/sanctum-mlx/src/turboquant/` and hand off to a `CompressedKVCache` that implements `mlx_lm::cache::KeyValueCache` — a drop-in for the stock `ConcatKeyValueCache`, routed via the new `KeyValueCache::fused_attention` trait method.
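The value half of the pipeline (group-affine quantization) is simple enough to sketch on the CPU. This is an illustrative round-trip, not the sanctum-mlx kernel: the function names are hypothetical, and the real path stores scale and zero as f16 and runs on device.

```rust
/// Sketch of group-affine quantization for one value group.
/// Illustrative only: names are hypothetical, the shipped kernel
/// stores scale/zero as f16 and operates on device buffers.
fn quantize_group(values: &[f32], bits: u32) -> (Vec<u8>, f32, f32) {
    let lo = values.iter().cloned().fold(f32::INFINITY, f32::min);
    let hi = values.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let levels = ((1u32 << bits) - 1) as f32;
    // Affine map: v ≈ index * scale + zero, index in [0, 2^bits - 1].
    let scale = if hi > lo { (hi - lo) / levels } else { 1.0 };
    let zero = lo;
    let indices = values
        .iter()
        .map(|&v| (((v - zero) / scale).round() as u32).min(levels as u32) as u8)
        .collect();
    (indices, scale, zero)
}

/// Inverse map, as the fused attention kernel would apply it per element.
fn dequantize_group(indices: &[u8], scale: f32, zero: f32) -> Vec<f32> {
    indices.iter().map(|&i| i as f32 * scale + zero).collect()
}
```

The round-trip error of any element is bounded by half a quantization step (`scale / 2`), which is why wider groups with larger dynamic range cost more accuracy per bit.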
The paper’s recipe — historical context
The original Slice 1 implementation followed the TurboQuant paper literally:

- keys: normalize → Hadamard rotate → Lloyd-Max quantize → QJL 1-bit sign correction → pack
- values: group-affine quantize (KVQuant style) → store scale + zero per group
- on every read: dequant all stored tokens → Array → attention

This is the algorithm the protocol below validated, and the bit-budget math behind the “~3.5 bits/channel quality-neutral” claim is real. We simply measured, at our specific workload, that the CPU round-trip cost more than the algorithm’s memory savings earned us, and pivoted accordingly. The receipts (Slice 1 PPL sweep, Slice 1.5 negative result, Slice 4a kernel correctness, Slice 4a-final pivot) are preserved in agent memory.
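The Hadamard rotate in that key pipeline is conventionally an in-place fast Walsh–Hadamard transform. A minimal sketch, assuming power-of-two dimensions (head_dim = 256 qualifies) and not taken from the sanctum-mlx source:

```rust
/// In-place fast Walsh-Hadamard transform, O(n log n), n a power of two.
/// With 1/sqrt(n) normalization the rotation is orthonormal and
/// self-inverse, so applying it twice recovers the input. That is the
/// property a dequant path relies on to undo the rotation.
fn fwht(x: &mut [f32]) {
    let n = x.len();
    assert!(n.is_power_of_two());
    let mut h = 1;
    while h < n {
        for i in (0..n).step_by(h * 2) {
            for j in i..i + h {
                let (a, b) = (x[j], x[j + h]);
                x[j] = a + b;
                x[j + h] = a - b;
            }
        }
        h *= 2;
    }
    let norm = (n as f32).sqrt();
    for v in x.iter_mut() {
        *v /= norm;
    }
}
```

The point of the rotation is to spread outlier channels across all dimensions before Lloyd-Max quantization, flattening the per-channel distribution the codebook has to cover.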
Why a tuning protocol
The first end-to-end run on a Qwen3.5-27B-4bit model with default settings (3-bit keys + QJL, 4-bit values, group_size=128) delivered this:
```
ppl (fp16)       =  9.3055
ppl (turboquant) = 28.6451
Δppl = +19.34 absolute, +207.83% relative
```

Catastrophic. The plumbing works end-to-end — no crashes, no SIGKILL, the bit stream roundtrips — but the output is a model that has forgotten what it was looking at. This is exactly the situation where “tune a few knobs and hope” fails. We need a protocol.
The protocol
The protocol has five principles, in priority order.
1. Ablate before you sweep
Do not touch bit widths until each stage of the pipeline has been proven correct in isolation. Loss could come from the Array↔f32 bridge, from Hadamard rotation, from Lloyd-Max quantization, from QJL, or from the value group-affine path — and you can’t distinguish them by watching perplexity collapse.
The ablation sequence:
| Run | Config | What it isolates | Expected |
|---|---|---|---|
| A1 | 8-bit keys + 8-bit values, QJL off | Bridge + minimal-loss quant | ≈ fp16 |
| A4 | 8-bit keys + 8-bit values, QJL on | QJL math correctness at high precision | ≈ fp16 |
| A5 | 8-bit keys, 4-bit values, QJL off | Value stage only | measures value loss |
| A6 | 3-bit keys + QJL, 8-bit values | Key stage only | measures key loss |
If A1 fails, you have a bug in the bridge or the rotation — stop sweeping and bisect. If A5 is clean but A6 is broken, the bug is in the key pipeline (Lloyd-Max, QJL, or rotation). And so on.
2. Sweep only after ablations pass
Once each stage is proven, search the Pareto frontier. Don’t grid-search 120 cells — use a staircase:
1. Fix `value_bits=8`, sweep `key_bits ∈ {3, 4, 5, 6, 8}` → find the minimum that meets budget
2. Fix `key_bits` at the step-1 winner, sweep `value_bits ∈ {2, 4, 6, 8}`
3. Fix both, check `group_size ∈ {32, 64, 128}` sensitivity
4. At the winning `(key_bits, value_bits)`, compare `use_qjl: on` vs `off` — does QJL actually help?
About 20 runs total. Each run takes ~10 seconds on an M4 Max once the model is cached.
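The staircase above is mechanical enough to enumerate in code. A sketch with hypothetical names: `winner_key` and `winner_value` stand in for whatever the earlier stages picked.

```rust
/// Enumerate the staircase: sweep one knob per stage, carrying each
/// stage's winner forward. Tuples are (key_bits, value_bits, group_size,
/// use_qjl). Names and the plain-tuple representation are illustrative.
fn staircase(winner_key: u32, winner_value: u32) -> Vec<(u32, u32, u32, bool)> {
    let mut runs = Vec::new();
    // Stage 1: key bits, values pinned at 8.
    for k in [3, 4, 5, 6, 8] {
        runs.push((k, 8, 128, false));
    }
    // Stage 2: value bits, stage-1 key winner pinned.
    for v in [2, 4, 6, 8] {
        runs.push((winner_key, v, 128, false));
    }
    // Stage 3: group-size sensitivity at the winning pair.
    for g in [32, 64, 128] {
        runs.push((winner_key, winner_value, g, false));
    }
    // Stage 4: QJL on/off at the winning pair.
    for q in [false, true] {
        runs.push((winner_key, winner_value, 128, q));
    }
    runs
}
```

That is 5 + 4 + 3 + 2 = 14 sweep cells; the ablations come on top, which is where the “about 20 runs” estimate comes from.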
3. Reproducibility
Every run appends one line to services/sanctum-mlx/bench/turboquant_sweep.jsonl:
```json
{
  "ts": "2026-04-18T11:40:00Z",
  "host": "mbp",
  "config": {"key_bits": 4, "value_bits": 6, "group_size": 128, "use_qjl": false, "head_dim": 256},
  "eval": {"text_sha": "3f4a…", "n_tokens": 54},
  "ppl_fp16": 9.3055,
  "ppl_tq": 10.12,
  "abs_delta": 0.8145,
  "rel_delta": 0.0876,
  "runtime_sec": {"fp16": 4.2, "tq": 6.8},
  "verdict": "warn"
}
```

The text SHA is the comparability key — different text means different rows, not different configurations. The verdict classifier is:
| Verdict | Budget |
|---|---|
| `pass` | abs_delta ≤ 0.3 AND rel_delta ≤ 5% |
| `warn` | abs_delta ≤ 1.0 |
| `fail` | beyond warn |
| `invalid` | non-finite — means a bug |
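The classifier in the table above is a straight-line function. A sketch (the enum and signature are illustrative, not the harness’s actual types):

```rust
/// Verdict classifier matching the budget table: non-finite values are
/// bugs, not data points, and are checked before anything else.
#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,
    Warn,
    Fail,
    Invalid,
}

fn classify(abs_delta: f64, rel_delta: f64) -> Verdict {
    if !abs_delta.is_finite() || !rel_delta.is_finite() {
        Verdict::Invalid // NaN/inf means a pipeline bug
    } else if abs_delta <= 0.3 && rel_delta <= 0.05 {
        Verdict::Pass
    } else if abs_delta <= 1.0 {
        Verdict::Warn
    } else {
        Verdict::Fail
    }
}
```

Note the ordering: a config with abs_delta ≤ 0.3 but rel_delta > 5% falls through to `warn`, not `pass`, because both budget conditions must hold.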
4. One knob at a time
Never change two variables between runs. The harness supports this through environment variables — the test binary reads `SANCTUM_TQ_KEY_BITS`, `SANCTUM_TQ_VALUE_BITS`, `SANCTUM_TQ_GROUP_SIZE`, `SANCTUM_TQ_USE_QJL`, and `SANCTUM_TQ_HEAD_DIM`, and installs them into a `OnceLock` before the model is loaded. No recompile between runs.
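The env-into-`OnceLock` pattern looks roughly like this. The struct, field set, and default values here are a sketch; the real harness reads more knobs.

```rust
use std::sync::OnceLock;

/// Illustrative subset of the sweep configuration. Defaults mirror the
/// documented scaffold (3-bit keys + QJL, 4-bit values), but the struct
/// itself is hypothetical.
#[derive(Debug, Clone)]
struct SweepConfig {
    key_bits: u32,
    value_bits: u32,
    use_qjl: bool,
}

static CONFIG: OnceLock<SweepConfig> = OnceLock::new();

/// Read the env once, before any model code runs; later calls see the
/// same frozen values, so a run cannot accidentally mix two configs.
fn sweep_config() -> &'static SweepConfig {
    CONFIG.get_or_init(|| SweepConfig {
        key_bits: std::env::var("SANCTUM_TQ_KEY_BITS")
            .ok().and_then(|v| v.parse().ok()).unwrap_or(3),
        value_bits: std::env::var("SANCTUM_TQ_VALUE_BITS")
            .ok().and_then(|v| v.parse().ok()).unwrap_or(4),
        use_qjl: std::env::var("SANCTUM_TQ_USE_QJL")
            .ok().and_then(|v| v.parse().ok()).unwrap_or(true),
    })
}
```

The `OnceLock` matters: it guarantees the config is read exactly once per process, which is what makes “one knob at a time” trustworthy across a multi-run sweep script.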
The shell driver bench/tune.sh orchestrates the staircase. Individual stages (ablation, key, value, group, qjl) can be run in isolation.
5. Early stop
Don’t over-measure. The protocol has two exit conditions:
- Confirmed bug (any ablation fails): stop, fix, restart. Sweeping over a bug produces garbage data.
- Confirmed winner (a config meets both budgets): stop. The knee is located. Tighten later if needed.
Running the sweep
```sh
cd ~/Projects/sanctum-rs
SANCTUM_MLX_TEST_MODEL=/path/to/Qwen3.5-27B-4bit \
SANCTUM_TQ_KEY_BITS=4 \
SANCTUM_TQ_VALUE_BITS=6 \
SANCTUM_TQ_USE_QJL=false \
SANCTUM_TQ_JSONL=services/sanctum-mlx/bench/turboquant_sweep.jsonl \
SANCTUM_TQ_SOFT=1 \
  cargo test -p sanctum-mlx --test turboquant_ppl --release \
  -- --ignored --nocapture
```

`SANCTUM_TQ_SOFT=1` disables the assertion failure on regression — during sweeps, bad numbers are data, not errors. Omit it when you’re gating.
```sh
cd ~/Projects/sanctum-rs
SANCTUM_MLX_TEST_MODEL=/path/to/Qwen3.5-27B-4bit \
  ./services/sanctum-mlx/bench/tune.sh all
```

Runs: 4 ablations, 5 key-bit cells, 4 value-bit cells, 3 group-size cells, 2 QJL cells = 18 runs. Writes all rows to bench/turboquant_sweep.jsonl. Takes about 3 minutes on an M4 Max once the model is cached.
For focused stages: ./tune.sh ablation or ./tune.sh key etc.
```sh
# Which configs met the pass budget?
jq -c 'select(.verdict == "pass")' services/sanctum-mlx/bench/turboquant_sweep.jsonl

# Pareto front: lowest abs_delta at each bit budget
# (-r, not -c: @tsv needs raw output to emit real tab-separated lines)
jq -r '[.config.key_bits, .config.value_bits, .abs_delta] | @tsv' \
  services/sanctum-mlx/bench/turboquant_sweep.jsonl | sort
```

For each stage boundary, record the decision in bench/ANALYSIS.md with the run rows that justified it. Future-you will thank present-you.
Knob reference
| Knob | Env | Range | Meaning |
|---|---|---|---|
| `key_bits` | `SANCTUM_TQ_KEY_BITS` | 3–8 | Codebook precision per key dim. Rotation runs regardless. |
| `value_bits` | `SANCTUM_TQ_VALUE_BITS` | 2–8 | Affine quant precision per value dim, per group. |
| `value_group_size` | `SANCTUM_TQ_GROUP_SIZE` | 32, 64, 128 | Granularity of per-group scale/zero for values. |
| `use_qjl` | `SANCTUM_TQ_USE_QJL` | true / false | Whether to add the QJL 1-bit sign correction after Lloyd-Max. Only meaningful at key_bits ≤ 4. |
| `head_dim` | `SANCTUM_TQ_HEAD_DIM` | model-specific | Must match the model. Qwen3.5 = 256, Qwen2.5 = 128. |
Metal memory ceilings (orthogonal to TurboQuant, but relevant to the why):
| Flag | Env | Effect |
|---|---|---|
| `--metal-cache-limit-mb` | `SANCTUM_MLX_CACHE_LIMIT_MB` | Cap MLX’s reusable buffer cache. Stops sanctum-mlx from fighting LM Studio for the Metal heap. |
| `--metal-memory-limit-mb` | `SANCTUM_MLX_MEMORY_LIMIT_MB` | Hard cap on total MLX device memory. |
| `--metal-wired-limit-mb` | `SANCTUM_MLX_WIRED_LIMIT_MB` | Cap non-swappable memory — keep hot tensors resident without starving the rest of the system. |
What “pass” means — and doesn’t
The default gate is abs_delta ≤ 0.3 AND rel_delta ≤ 5% on a short eval. That’s permissive compared to the TurboQuant paper’s reported < 0.1 Δppl, and intentionally so — our implementation omits some of the paper’s refinements (learned rotations, full TurboQuant value pipeline) in favor of shipping Slice 1.
A config that passes here is good enough to ship under --turboquant behind a flag. It is not yet good enough to be the default. Default-flip requires:
- Clean ablation pass across all stages
- Wider eval (wikitext-2 chunk, ≥1024 tokens)
- Sample-quality check: generate 200 tokens continuation from the same prefix with both caches, compare
- No regression across multiple prompt families (code, prose, reasoning)
Until then, turboquant is opt-in. Operators who want the memory savings know they’re in an experiment.
References
- TurboQuant — arXiv 2504.19874 · Google · ICLR 2026
- KVQuant — arXiv 2401.18079 · per-channel K, per-token V quantization
- QJL — arXiv 2310.19536 · AAAI 2025 · 1-bit error-corrected JL projections
- PolarQuant — arXiv 2502.02617 · rotation-based precursor
- DuoAttention — arXiv 2410.10819 · retrieval vs streaming heads — next slice
- EAGLE-3 — arXiv 2503.01840 · on-device speculative decoding — later slice
First real-world run — the protocol paid for itself on day one
The first end-to-end run on Qwen3.5-27B-4bit with scaffold defaults (3-bit keys + QJL, 4-bit values) produced:
```
ppl (fp16)       =  9.3055
ppl (turboquant) = 28.6451
Δppl = +19.34 (+208%)
```

Catastrophic. The instinct would have been to turn knobs — “try 4-bit, try 5-bit, maybe disable QJL.” The protocol said no: ablate first.
The ablation revealed that all four bit configurations produced the same ~+20 PPL collapse — 3-bit keys with QJL and 8-bit/8-bit with QJL off alike. That’s physically impossible if the quant math were the issue. The bug had to be upstream.
We added BypassMode::{IdentityPassthrough, BridgeOnly} — short-circuits that skip the quantization pipeline. Result:
| Run | Mode | Δppl |
|---|---|---|
| A0 | IdentityPassthrough (raw concat) | 0.0000 PASS |
| A0.5 | BridgeOnly (f32 round-trip, no quant) | +20.62 FAIL |
Bisected in two runs. The f32 Array↔CPU bridge was the culprit, not quantization. Root cause: MLX Arrays from transpose_axes([0, 2, 1, 3]) are strided views, and as_slice::<T>() reads the raw buffer without respecting logical layout. Without forcing materialization (via flatten()), a logical [B, H, L, D] view was yielding pre-transpose [B, L, H, D] bytes. Heads and tokens were silently permuted before any quantization ran.
Fix: three lines in `turboquant/cache.rs` — `flatten` before `as_slice`.
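The hazard generalizes beyond MLX. Here is a dependency-free Rust illustration of why a stride-swapping transpose plus a raw-buffer read permutes data, and why materializing to contiguous logical order first is the fix. This is a toy 2-D view, not the MLX API; the real arrays are 4-D.

```rust
/// Minimal strided 2-D view over a flat row-major buffer.
struct View<'a> {
    data: &'a [f32],
    shape: [usize; 2],
    strides: [usize; 2],
}

impl<'a> View<'a> {
    /// Logical element access through the strides.
    fn get(&self, i: usize, j: usize) -> f32 {
        self.data[i * self.strides[0] + j * self.strides[1]]
    }

    /// Zero-copy transpose: swap shape and strides, keep the buffer.
    /// The raw bytes are untouched, which is exactly the trap.
    fn transposed(&self) -> View<'a> {
        View {
            data: self.data,
            shape: [self.shape[1], self.shape[0]],
            strides: [self.strides[1], self.strides[0]],
        }
    }

    /// The safe path: walk logical order and materialize a contiguous
    /// copy before anyone reads a raw slice.
    fn to_contiguous(&self) -> Vec<f32> {
        let mut out = Vec::with_capacity(self.shape[0] * self.shape[1]);
        for i in 0..self.shape[0] {
            for j in 0..self.shape[1] {
                out.push(self.get(i, j));
            }
        }
        out
    }
}
```

Reading `transposed().data` directly would return pre-transpose order — the 2-D analogue of the `[B, L, H, D]`-for-`[B, H, L, D]` swap that poisoned every quantization run.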
Post-fix sweep — the actual Pareto frontier
Full staircase on the same 27B model, 54-token eval, protocol compliant (one knob at a time, pre-fix rows archived as invalid):
Ablations — all stages clean
| Run | Config | Δppl |
|---|---|---|
| A1 | 8-bit / 8-bit, no QJL | 0.023 |
| A4 | 8-bit / 8-bit, QJL on | 0.014 |
| A5 | 8-bit keys / 4-bit values | 0.024 |
| A6 | 3-bit + QJL / 8-bit values | 0.137 |
Key-bit sweep (values = 8, no QJL)
```
3-bit Δ=0.254   4-bit Δ=0.069   5-bit Δ=0.090
6-bit Δ=0.031   8-bit Δ=0.023
```

Keys quantize gracefully. All pass.
Value-bit sweep (keys = 4, no QJL)
```
2-bit Δ=1.22 ❌ breaks   4-bit Δ=0.12 ✅
6-bit Δ=0.05             8-bit Δ=0.07
```

Hard floor: 4-bit values. The value pathway (group-affine, no rotation) is less tolerant than keys.
QJL is regime-dependent
Section titled “QJL is regime-dependent”key_bits | values | QJL off | QJL on | |
|---|---|---|---|---|
| 4 | 6 | 0.050 | 0.184 | QJL hurts |
| 3 | 8 | 0.254 | 0.137 | QJL helps |
QJL costs one bit from the codebook. At 4-bit keys that trade is net-negative; at 3-bit keys the sign-correction wins. Rule: enable QJL only when key_bits ≤ 3.
Group size is quality-neutral
At (k=4, v=6) the group sizes 32, 64, 128 all land within 0.04 absolute — noise. Use 128 for minimum metadata overhead.
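The metadata cost behind that recommendation is easy to compute: each group stores one f16 scale plus one f16 zero (32 bits) regardless of group size, so the per-element overhead is 32/group_size bits. A sketch:

```rust
/// Per-element metadata overhead of group-affine storage: one f16
/// scale plus one f16 zero (32 bits total), amortized across the group.
fn metadata_bits_per_element(group_size: u32) -> f64 {
    32.0 / group_size as f64
}
```

So g=32 adds a full bit per element while g=128 adds only a quarter bit — and since the sweep shows quality is flat across group sizes, the cheapest metadata wins.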
The two defensible winners
Section titled “The two defensible winners”| Config | Δppl | Rel | Est. Compression | Risk |
|---|---|---|---|---|
| Safe: k=4, v=4, g=128, QJL=off | 0.123 | 1.32% | ~3.2× | solid pass |
| Bold: k=3, v=4, g=128, QJL=on | 0.338 | 3.63% | ~4.0× | 0.04 over abs budget — noise-level margin |
On a 54-token eval we cannot responsibly discriminate between these at the 0.3-boundary. Protocol Principle 5 applies: do not flip a default without a confirmed winner. The --turboquant flag remains opt-in; wider eval (wikitext-2, ≥1024 tokens) is the gate for default change.
Slice 1b — wider eval, Bold vindicated
Ran both candidates on a 1256-token corpus drawn from the sanctum-docs architecture pages (real technical prose the model has seen many variants of). Baseline shifted up — real prose is harder than the 54-token calibration string — but that’s a fairer measurement.
| Candidate | Δppl abs | Rel | Est. Compression |
|---|---|---|---|
| Safe: k=4, v=4, g=128, QJL=off | 0.157 | 1.18% | ~3.2× |
| Bold: k=3, v=4, g=128, QJL=on | 0.196 | 1.47% | ~4.0× |
Both decisively inside the 0.3 abs / 5% rel budgets. The wider sample firmed up what the short eval hinted: Bold’s borderline 0.34 tightened to 0.20, Safe’s 0.12 drifted to 0.16. Signal is stable.
Verdict — Bold is the consecrated default
`TurboQuantConfig::default()` stays `key_bits=3, value_bits=4, use_qjl=true` — the original scaffold had the right intent. What we earned through the protocol is evidence that it was right, not just a hunch.
Safe remains the documented fallback for operators who want a tighter quality gate or are running on a model we haven’t validated. Two env vars flip to Safe:
```sh
SANCTUM_TQ_KEY_BITS=4 SANCTUM_TQ_VALUE_BITS=4 SANCTUM_TQ_USE_QJL=false \
  sanctum-mlx --turboquant
```

--turboquant stays opt-in
Even with a validated default config, the flag stays opt-in until:
- Slice 4 (Metal-fused dequant) lands — current CPU-side path has O(T²) cost that surprises operators
- Model coverage widens — Qwen3.5-27B is one data point; Qwen3.6-35B, Qwen2.5-Coder, Gemma4-31B each deserve their own protocol run before auto-on
Evidence dictates defaults. Evidence is narrower than ambition. Keep the knob.
Slice 1c — cross-architecture de-risk (ecosystem wall + a finding)
We attempted to validate the Bold default on a non-Qwen3.5 architecture before committing to Gemma-4 Rust work. The attempt ran into an ecosystem wall and produced a real finding anyway.
The wall
Python mlx-lm 0.29.1 (latest on Python 3.9) supports qwen2, qwen3, gemma3, llama4, and more — but not qwen3_5 (our validated model) and not gemma4 (the target for tonight’s fine-tune). Our Rust mlx-lm fork has the inverse: it supports qwen3_5 and qwen3 but not qwen2 or any gemma.
```
Python ∩ Rust = { }
```

No model type loads in both stacks. Cross-validation requires either (a) porting TurboQuant to Python as a custom KVCache subclass (~2–3 hr), or (b) adding an architecture to the Rust fork (~4–6 hr Qwen2, ~8–16 hr Gemma-4). Neither is a “quick de-risk.”
The finding
Ran Python mlx-lm’s built-in QuantizedKVCache (plain affine per-group quant — not TurboQuant’s rotation path) on Qwen2.5-Coder-7B (head_dim=128, 7:1 GQA). Same 1238-token eval.
| KV bits | PPL | Δppl abs | Verdict |
|---|---|---|---|
| fp16 baseline | 25.99 | — | — |
| 8 | 25.94 | 0.05 | PASS |
| 4 | 339,210 | 339,184 | FAIL — catastrophic |
| 3 | 712,519 | 712,493 | FAIL |
| 2 | 648,756 | 648,730 | FAIL |
A PPL of 339,210 on a ~152k-vocab model is worse than uniform random. That’s not a quality gradient — it’s a cliff. Stock 4-bit affine quant isn’t losing information, it’s actively poisoning state.
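For reference, “worse than uniform random” follows directly from the definition of perplexity: a model that assigns every one of the $V$ vocabulary tokens probability $1/V$ has

```latex
\mathrm{PPL}_{\text{uniform}}
  = \exp\!\Big(-\frac{1}{N}\sum_{t=1}^{N}\ln \frac{1}{V}\Big)
  = V \approx 1.52 \times 10^{5}
```

so a measured PPL of 339,210 means the corrupted cache is, on average, steering probability mass away from the true token harder than chance would.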
Forward plan
Section titled “Forward plan”- Slice 1d — Port TurboQuant to Python
mlx-lmas a customKVCache. Dedicated ~3-hour session. Validates across qwen2 + qwen3 + gemma3. Should precede Slice G. - Slice G — Gemma-4 Rust loader. Gated on upstream Python landing
gemma4support AND Slice 1d passing. Not before. - Slice Q2 — Qwen2 Rust loader. Unlocks Qwen2.5-Coder in sanctum-mlx production. Adjacent practice for Slice G.
Details in bench/ANALYSIS.md and the raw JSON at bench/qwen25coder_kvq_validation.json.
The knowledge we now carry forward
Two things are load-bearing going into Slice 2+:
- Any MLX Array read via `as_slice` after a transpose must flatten first. Documented in `cache.rs`; future cache backends must follow suit or justify skipping.
- The value pathway is the quality floor, not the key pathway. Any compression improvement plan should prioritize better value quantization (rotation? learned group scales?) over squeezing key bits further.
Full run log: bench/turboquant_sweep.jsonl. Decisions recorded in bench/ANALYSIS.md.
Cross-references
- sanctum-mlx architecture — the server this lives inside
- eval harness — the broader evaluation philosophy
- engineering discipline — why we measure twice