
TurboQuant KV Compression

Carmack would optimize the KV cache, then optimize the optimization.

The KV cache is where long conversations go to eat memory. Every token a language model generates leaves behind a pair of tensors — the keys and values from every attention head in every layer. For a 27B model at 256-dim heads, that’s hundreds of megabytes per thousand tokens. On a Mac Mini that also runs LM Studio and a dozen other services, the KV cache is the thing that decides whether you get to have a conversation at all.
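To make "hundreds of megabytes" concrete, here is the back-of-envelope arithmetic. The layer and KV-head counts below are illustrative placeholders, not Qwen3.5's actual config:

```rust
/// Back-of-envelope KV cache footprint. Every generated token stores a key
/// and a value vector per KV head per layer. NOTE: the 48-layer / 8-KV-head
/// figures in the usage below are hypothetical example dims.
fn kv_bytes_per_token(
    n_layers: usize,
    n_kv_heads: usize,
    head_dim: usize,
    bytes_per_elem: usize, // 2 for bf16
) -> usize {
    2 * n_layers * n_kv_heads * head_dim * bytes_per_elem // 2 = keys + values
}
```

With, say, 48 layers, 8 KV heads, 256-dim heads, and bf16 storage, that is 393,216 bytes per token, i.e. ~384 MB per 1024 tokens before any compression.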

TurboQuant is how we make it smaller. The idea is not ours — it comes from Google’s ICLR 2026 paper (arXiv 2504.19874), layered on top of KVQuant and QJL. Our job is to implement it cleanly inside sanctum-mlx, wire it into Apple’s MLX runtime, and — most importantly — prove the quality doesn’t collapse when we turn the bits down.

This page is both an architecture reference and a scientific protocol for tuning it.

incoming K,V tensor [1, H_kv, L, D]
        │
        ├── keys ───▶ concat-grow plain on-device Array (bf16, bit-exact, no compression)
        │
        └── values ─▶ group-affine quantize on device → store (indices u8, scale f16, zero f16)

attention: the sdpa_dequant_v fused Metal kernel reads the compressed V state
           and dequantizes it inline in registers; no full V tensor ever materializes

Both halves live in services/sanctum-mlx/src/turboquant/ and hand off to a CompressedKVCache that implements mlx_lm::cache::KeyValueCache — a drop-in for the stock ConcatKeyValueCache. Attention is routed through the new KeyValueCache::fused_attention trait method.
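The value path's group-affine scheme can be sketched in plain Rust. This is a conceptual model only: the real cache stores f16 scale/zero and runs on device, and these function names are illustrative, not sanctum-mlx's API:

```rust
/// Per-group affine quantization sketch: each group of values gets one
/// scale and one zero-point (here the group minimum), plus u8 indices.
fn quantize_group(xs: &[f32], bits: u32) -> (Vec<u8>, f32, f32) {
    let min = xs.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = xs.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let levels = (1u32 << bits) - 1; // e.g. 15 at 4 bits
    let scale = if max > min { (max - min) / levels as f32 } else { 1.0 };
    let idx = xs
        .iter()
        .map(|&x| (((x - min) / scale).round() as u32).min(levels) as u8)
        .collect();
    (idx, scale, min)
}

/// Inverse map — this is what the fused kernel does inline in registers.
fn dequantize_group(idx: &[u8], scale: f32, zero: f32) -> Vec<f32> {
    idx.iter().map(|&i| i as f32 * scale + zero).collect()
}
```

The round-trip error per element is bounded by half a quantization step, which is why per-group (rather than per-tensor) scales matter: smaller groups mean tighter min/max ranges.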

The paper’s recipe — historical context


The original Slice 1 implementation followed the TurboQuant paper literally:

keys: normalize → Hadamard rotate → Lloyd-Max quantize → QJL 1-bit sign correction → pack
values: group-affine quantize (KVQuant style) → store scale + zero per group
on every read: dequant all stored tokens → Array → attention
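The "Hadamard rotate" step spreads per-channel outliers across all channels before quantization. A minimal CPU sketch of the transform (the real pipeline applies it per head on device; randomized signs and the Lloyd-Max codebook that follow it are omitted here):

```rust
/// In-place fast Walsh–Hadamard transform; len must be a power of two.
/// Dividing by sqrt(n) makes the transform orthonormal, and since the
/// normalized Hadamard matrix is symmetric, applying it twice is identity.
fn fwht(xs: &mut [f32]) {
    let n = xs.len();
    assert!(n.is_power_of_two());
    let mut h = 1;
    while h < n {
        for i in (0..n).step_by(h * 2) {
            for j in i..i + h {
                let (a, b) = (xs[j], xs[j + h]);
                xs[j] = a + b;
                xs[j + h] = a - b;
            }
        }
        h *= 2;
    }
    let norm = 1.0 / (n as f32).sqrt();
    for x in xs.iter_mut() {
        *x *= norm;
    }
}
```

Because the normalized transform is its own inverse, the dequant path can undo the rotation with the same kernel, and the rotation itself is lossless; only the quantizer after it loses information.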

This is the algorithm the protocol below validated, and the bit-budget math behind the “~3.5 bits/channel quality-neutral” claim is real. We just measured at our specific workload that the CPU round-trip cost more than the algorithm’s memory savings earned us, and pivoted accordingly. The receipts (Slice 1 PPL sweep, Slice 1.5 negative result, Slice 4a kernel correctness, Slice 4a-final pivot) are preserved in agent memory.

The first end-to-end run on a Qwen3.5-27B-4bit model with default settings (3-bit keys + QJL, 4-bit values, group_size=128) delivered this:

ppl (fp16) = 9.3055
ppl (turboquant) = 28.6451
Δppl = +19.34 absolute, +207.83% relative

Catastrophic. The plumbing works end-to-end — no crashes, no SIGKILL, the bit stream roundtrips — but the output is a model that has forgotten what it was looking at. This is exactly the situation where “tune a few knobs and hope” fails. We need a protocol.

The protocol has five principles, in priority order.

Do not touch bit widths until each stage of the pipeline has been proven correct in isolation. Loss could come from the Array↔f32 bridge, from Hadamard rotation, from Lloyd-Max quantization, from QJL, or from the value group-affine path — and you can’t distinguish them by watching perplexity collapse.

The ablation sequence:

Run   Config                               What it isolates                         Expected
A1    8-bit keys + 8-bit values, QJL off   Bridge + minimal-loss quant              ≈ fp16
A4    8-bit keys + 8-bit values, QJL on    QJL math correctness at high precision   ≈ fp16
A5    8-bit keys, 4-bit values, QJL off    Value stage only                         measures value loss
A6    3-bit keys + QJL, 8-bit values       Key stage only                           measures key loss

If A1 fails, you have a bug in the bridge or the rotation — stop sweeping and bisect. If A5 is clean but A6 is broken, the bug is in the key pipeline (Lloyd-Max, QJL, or rotation). And so on.

Once each stage is proven, search the Pareto frontier. Don’t grid-search 120 cells — use a staircase:

  1. Fix value_bits=8, sweep key_bits ∈ {3, 4, 5, 6, 8} → find minimum that meets budget
  2. Fix key_bits at step-1 winner, sweep value_bits ∈ {2, 4, 6, 8}
  3. Fix both, check group_size ∈ {32, 64, 128} sensitivity
  4. At the winning (key_bits, value_bits), compare use_qjl: on vs off — does QJL actually help?

About 20 runs total. Each run takes ~10 seconds on an M4 Max once the model is cached.
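The staircase can be pictured as code. The Config struct and the winner parameters are hypothetical harness-side names; under these assumptions the enumeration yields 14 configurations, within the "about 20 runs" envelope:

```rust
/// Hypothetical sweep-config record (not the harness's real types).
#[derive(Clone, Copy, Debug)]
struct Config {
    key_bits: u8,
    value_bits: u8,
    group_size: u16,
    use_qjl: bool,
}

/// Staircase enumeration: later stages pin the winner of earlier stages,
/// passed in here once the earlier measurements are done.
fn staircase(key_winner: u8, value_winner: u8) -> Vec<Config> {
    let mut runs = Vec::new();
    // Stage 1: sweep key_bits with values held at high precision.
    for k in [3u8, 4, 5, 6, 8] {
        runs.push(Config { key_bits: k, value_bits: 8, group_size: 128, use_qjl: false });
    }
    // Stage 2: sweep value_bits at the stage-1 winner.
    for v in [2u8, 4, 6, 8] {
        runs.push(Config { key_bits: key_winner, value_bits: v, group_size: 128, use_qjl: false });
    }
    // Stage 3: group-size sensitivity at the winning pair.
    for g in [32u16, 64, 128] {
        runs.push(Config { key_bits: key_winner, value_bits: value_winner, group_size: g, use_qjl: false });
    }
    // Stage 4: QJL on vs off at the winning pair.
    for q in [false, true] {
        runs.push(Config { key_bits: key_winner, value_bits: value_winner, group_size: 128, use_qjl: q });
    }
    runs
}
```

The design choice is that each stage changes exactly one knob, so any Δppl movement between adjacent runs is attributable.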

Every run appends one line to services/sanctum-mlx/bench/turboquant_sweep.jsonl:

{
"ts": "2026-04-18T11:40:00Z",
"host": "mbp",
"config": {"key_bits": 4, "value_bits": 6, "group_size": 128, "use_qjl": false, "head_dim": 256},
"eval": {"text_sha": "3f4a…", "n_tokens": 54},
"ppl_fp16": 9.3055,
"ppl_tq": 10.12,
"abs_delta": 0.8145,
"rel_delta": 0.0876,
"runtime_sec": {"fp16": 4.2, "tq": 6.8},
"verdict": "warn"
}

The text SHA is the comparability key — different text means different rows, not different configurations. The verdict classifier is:

Verdict   Budget
pass      abs_delta ≤ 0.3 AND rel_delta ≤ 5%
warn      abs_delta ≤ 1.0
fail      beyond warn
invalid   non-finite — means a bug
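The classifier is small enough to state exactly; the function name is illustrative, not the harness's real API:

```rust
/// Verdict classifier for one sweep row. rel_delta is a fraction (0.05 = 5%).
fn verdict(abs_delta: f64, rel_delta: f64) -> &'static str {
    if !abs_delta.is_finite() || !rel_delta.is_finite() {
        "invalid" // non-finite ppl is a bug, not a quality grade
    } else if abs_delta <= 0.3 && rel_delta <= 0.05 {
        "pass"
    } else if abs_delta <= 1.0 {
        "warn"
    } else {
        "fail"
    }
}
```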

Never change two variables between runs. The harness supports this through environment variables — the test binary reads SANCTUM_TQ_KEY_BITS, SANCTUM_TQ_VALUE_BITS, SANCTUM_TQ_GROUP_SIZE, SANCTUM_TQ_USE_QJL, SANCTUM_TQ_HEAD_DIM and installs them into a OnceLock before the model is loaded. No recompile between runs.
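A sketch of that env-to-OnceLock pattern, with hypothetical struct and field names; the unwrap_or fallbacks here mirror the scaffold defaults (3-bit keys, 4-bit values, QJL on), so with the env unset you get the defaults:

```rust
use std::sync::OnceLock;

/// Hypothetical override bundle read once from the environment.
#[derive(Debug)]
struct TqOverrides {
    key_bits: u8,
    value_bits: u8,
    use_qjl: bool,
}

static TQ_OVERRIDES: OnceLock<TqOverrides> = OnceLock::new();

/// First call parses the env vars and freezes the result; every later call
/// (including during model load) sees the same values. No recompile needed.
fn tq_overrides() -> &'static TqOverrides {
    TQ_OVERRIDES.get_or_init(|| TqOverrides {
        key_bits: std::env::var("SANCTUM_TQ_KEY_BITS")
            .ok().and_then(|s| s.parse().ok()).unwrap_or(3),
        value_bits: std::env::var("SANCTUM_TQ_VALUE_BITS")
            .ok().and_then(|s| s.parse().ok()).unwrap_or(4),
        use_qjl: std::env::var("SANCTUM_TQ_USE_QJL")
            .ok().and_then(|s| s.parse().ok()).unwrap_or(true),
    })
}
```

The OnceLock matters because the config must be fixed before the model loads; reading env vars lazily mid-run would let two halves of a run disagree.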

The shell driver bench/tune.sh orchestrates the staircase. Individual stages (ablation, key, value, group, qjl) can be run in isolation.

Don’t over-measure. The protocol has two exit conditions:

  • Confirmed bug (any ablation fails): stop, fix, restart. Sweeping over a bug produces garbage data.
  • Confirmed winner (a config meets both budgets): stop. The knee is located. Tighten later if needed.
A single sweep run looks like:
cd ~/Projects/sanctum-rs
SANCTUM_MLX_TEST_MODEL=/path/to/Qwen3.5-27B-4bit \
SANCTUM_TQ_KEY_BITS=4 \
SANCTUM_TQ_VALUE_BITS=6 \
SANCTUM_TQ_USE_QJL=false \
SANCTUM_TQ_JSONL=services/sanctum-mlx/bench/turboquant_sweep.jsonl \
SANCTUM_TQ_SOFT=1 \
cargo test -p sanctum-mlx --test turboquant_ppl --release \
-- --ignored --nocapture

SANCTUM_TQ_SOFT=1 disables the assertion failure on regression — during sweeps, bad numbers are data, not errors. Omit it when you’re gating.

Knob               Env                     Range            Meaning
key_bits           SANCTUM_TQ_KEY_BITS     3–8              Codebook precision per key dim. Rotation runs regardless.
value_bits         SANCTUM_TQ_VALUE_BITS   2–8              Affine quant precision per value dim, per group.
value_group_size   SANCTUM_TQ_GROUP_SIZE   32, 64, 128      Granularity of per-group scale/zero for values.
use_qjl            SANCTUM_TQ_USE_QJL      true / false     Whether to add the QJL 1-bit sign correction after Lloyd-Max. Only meaningful at key_bits ≤ 4.
head_dim           SANCTUM_TQ_HEAD_DIM     model-specific   Must match the model. Qwen3.5 = 256, Qwen2.5 = 128.

Metal memory ceilings (orthogonal to TurboQuant, but relevant to the why):

Flag                      Env                           Effect
--metal-cache-limit-mb    SANCTUM_MLX_CACHE_LIMIT_MB    Cap MLX’s reusable buffer cache. Stops sanctum-mlx from fighting LM Studio for the Metal heap.
--metal-memory-limit-mb   SANCTUM_MLX_MEMORY_LIMIT_MB   Hard cap on total MLX device memory.
--metal-wired-limit-mb    SANCTUM_MLX_WIRED_LIMIT_MB    Cap non-swappable memory; keep hot tensors resident without starving the rest of the system.

The default gate is abs_delta ≤ 0.3 AND rel_delta ≤ 5% on a short eval. That’s permissive compared to the TurboQuant paper’s reported < 0.1 Δppl, and intentionally so — our implementation omits some of the paper’s refinements (learned rotations, full TurboQuant value pipeline) in favor of shipping Slice 1.

A config that passes here is good enough to ship under --turboquant behind a flag. It is not yet good enough to be the default. Default-flip requires:

  • Clean ablation pass across all stages
  • Wider eval (wikitext-2 chunk, ≥1024 tokens)
  • Sample-quality check: generate 200 tokens continuation from the same prefix with both caches, compare
  • No regression across multiple prompt families (code, prose, reasoning)

Until then, turboquant is opt-in. Operators who want the memory savings know they’re in an experiment.

First real-world run — the protocol paid for itself on day one


The first end-to-end run on Qwen3.5-27B-4bit with scaffold defaults (3-bit keys + QJL, 4-bit values) produced the catastrophic numbers quoted earlier: Δppl = +19.34 (+208%). The instinct would have been to turn knobs — “try 4-bit, try 5-bit, maybe disable QJL.” The protocol said no: ablate first.

The ablation revealed that all four bit configurations gave the same +20 PPL collapse: 3-bit keys with QJL and 8-bit/8-bit with QJL off both landed at +20. That is impossible if the quant math were at fault, because 8-bit quantization cannot lose as much information as 3-bit. The bug had to be upstream.

We added BypassMode::{IdentityPassthrough, BridgeOnly} — short-circuits that skip the quantization pipeline. Result:

Run    Mode                                    Δppl
A0     IdentityPassthrough (raw concat)        0.0000  PASS
A0.5   BridgeOnly (f32 round-trip, no quant)   +20.62  FAIL

Bisected in two runs. The f32 Array↔CPU bridge was the culprit, not quantization. Root cause: MLX Arrays from transpose_axes([0, 2, 1, 3]) are strided views, and as_slice::<T>() reads the raw buffer without respecting logical layout. Without forcing materialization (via flatten()), a logical [B, H, L, D] view was yielding pre-transpose [B, L, H, D] bytes. Heads and tokens were silently permuted before any quantization ran.

Fix: three lines in turboquant/cache.rs, calling flatten before as_slice.
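The failure mode is easy to reproduce in pure Rust, no MLX required: a transpose changes the logical layout but not the underlying buffer, so a raw buffer read returns pre-transpose order. The helper below plays the role of flatten for a 2-D case by materializing the logical order (names are illustrative):

```rust
/// Given a row-major [h, l] buffer viewed logically as its transpose [l, h],
/// walk the logical layout and copy elements out in that order. Reading the
/// raw buffer instead would silently yield the pre-transpose order.
fn materialize_transposed(buf: &[i32], h: usize, l: usize) -> Vec<i32> {
    let mut out = Vec::with_capacity(buf.len());
    for i in 0..l {
        for j in 0..h {
            out.push(buf[j * l + i]); // stride-aware index into the raw buffer
        }
    }
    out
}
```

For a 2×3 buffer [0, 1, 2, 3, 4, 5], the materialized transpose is [0, 3, 1, 4, 2, 5], which is exactly the reordering the 4-D [B, H, L, D] case needed before as_slice.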

Post-fix sweep — the actual Pareto frontier


Full staircase on the same 27B model, 54-token eval, protocol compliant (one knob at a time, pre-fix rows archived as invalid):

Run   Config                       Δppl
A1    8-bit / 8-bit, no QJL        0.023
A4    8-bit / 8-bit, QJL on        0.014
A5    8-bit keys / 4-bit values    0.024
A6    3-bit + QJL / 8-bit values   0.137

Key-bits sweep (value_bits held at 8):

key_bits   3       4       5       6       8
Δppl       0.254   0.069   0.090   0.031   0.023

Keys quantize gracefully. All pass.

Value-bits sweep:

value_bits   2                4         6      8
Δppl         1.22 ❌ breaks   0.12 ✅   0.05   0.07

Hard floor: 4-bit values. The value pathway (group-affine, no rotation) is less tolerant than keys.

key_bits   value_bits   QJL off   QJL on   Effect
4          6            0.050     0.184    QJL hurts
3          8            0.254     0.137    QJL helps
QJL costs one bit from the codebook. At 4-bit keys that trade is net-negative; at 3-bit keys the sign-correction wins. Rule: enable QJL only when key_bits ≤ 3.

At (k=4, v=6) the group sizes 32, 64, 128 all land within 0.04 absolute — noise. Use 128 for minimum metadata overhead.

Config                           Δppl    Rel     Est. compression   Risk
Safe: k=4, v=4, g=128, QJL=off   0.123   1.32%   ~3.2×              solid pass
Bold: k=3, v=4, g=128, QJL=on    0.338   3.63%   ~4.0×              0.04 over abs budget; noise-level margin

On a 54-token eval we cannot responsibly discriminate between these at the 0.3-boundary. Protocol Principle 5 applies: do not flip a default without a confirmed winner. The --turboquant flag remains opt-in; wider eval (wikitext-2, ≥1024 tokens) is the gate for default change.

Ran both candidates on a 1256-token corpus drawn from the sanctum-docs architecture pages (real technical prose the model has seen many variants of). Baseline shifted up — real prose is harder than the 54-token calibration string — but that’s a fairer measurement.

Candidate                        Δppl abs   Rel     Est. compression
Safe: k=4, v=4, g=128, QJL=off   0.157      1.18%   ~3.2×
Bold: k=3, v=4, g=128, QJL=on    0.196      1.47%   ~4.0×

Both decisively inside the 0.3 abs / 5% rel budgets. The wider sample firmed up what the short eval hinted: Bold’s borderline 0.34 tightened to 0.20, Safe’s 0.12 drifted to 0.16. Signal is stable.

Verdict — Bold is the consecrated default


TurboQuantConfig::default() stays key_bits=3, value_bits=4, use_qjl=true — the original scaffold had the right intent. What we earned through the protocol is evidence that it was right, not just a hunch.

Safe remains the documented fallback for operators who want a tighter quality gate or are running on a model we haven’t validated. Two env vars flip to Safe:

SANCTUM_TQ_KEY_BITS=4 SANCTUM_TQ_VALUE_BITS=4 SANCTUM_TQ_USE_QJL=false \
sanctum-mlx --turboquant

Even with a validated default config, the flag stays opt-in until:

  • Slice 4 (Metal-fused dequant) lands — current CPU-side path has O(T²) cost that surprises operators
  • Model coverage widens — Qwen3.5-27B is one data point; Qwen3.6-35B, Qwen2.5-Coder, Gemma4-31B each deserve their own protocol run before auto-on

Evidence dictates defaults. Evidence is narrower than ambition. Keep the knob.

Slice 1c — cross-architecture de-risk (ecosystem wall + a finding)


We attempted to validate the Bold default on a non-Qwen3.5 architecture before committing to Gemma-4 Rust work. The attempt ran into an ecosystem wall and produced a real finding anyway.

Python mlx-lm 0.29.1 (latest on Python 3.9) supports qwen2, qwen3, gemma3, llama4, and more — but not qwen3_5 (our validated model) and not gemma4 (the target for tonight’s fine-tune). Our Rust mlx-lm fork has the inverse: it supports qwen3_5 and qwen3 but not qwen2 or any gemma.

Python ∩ Rust = { }

No model type loads in both stacks. Cross-validation requires either (a) porting TurboQuant to Python as a custom KVCache subclass (~2–3 hr), or (b) adding an architecture to the Rust fork (~4–6 hr Qwen2, ~8–16 hr Gemma-4). Neither is a “quick de-risk.”

Ran Python mlx-lm’s built-in QuantizedKVCache (plain affine per-group quant — not TurboQuant’s rotation path) on Qwen2.5-Coder-7B (head_dim=128, 7:1 GQA). Same 1238-token eval.

KV bits         PPL       Δppl abs   Verdict
fp16 baseline   25.99     —          —
8               25.94     0.05       PASS
4               339,210   339,184    FAIL — catastrophic
3               712,519   712,493    FAIL
2               648,756   648,730    FAIL

A PPL of 339,210 on a ~152k-vocab model is worse than uniform random. That’s not a quality gradient — it’s a cliff. Stock 4-bit affine quant isn’t losing information, it’s actively poisoning state.
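Why "worse than uniform random": perplexity is exp of the mean per-token negative log-likelihood, and a uniform guess over a V-token vocabulary scores ln V per token, so its perplexity is exactly V — about 152,064 here, well under 339,210. A tiny sketch:

```rust
/// Perplexity from per-token negative log-likelihoods (natural log).
fn perplexity(nlls: &[f64]) -> f64 {
    (nlls.iter().sum::<f64>() / nlls.len() as f64).exp()
}
```

Feeding it a constant ln(152_064) per token returns 152,064: the worst score any model that has not been actively corrupted should ever produce on that vocabulary.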

  • Slice 1d — Port TurboQuant to Python mlx-lm as a custom KVCache. Dedicated ~3-hour session. Validates across qwen2 + qwen3 + gemma3. Should precede Slice G.
  • Slice G — Gemma-4 Rust loader. Gated on upstream Python landing gemma4 support AND Slice 1d passing. Not before.
  • Slice Q2 — Qwen2 Rust loader. Unlocks Qwen2.5-Coder in sanctum-mlx production. Adjacent practice for Slice G.

Details in bench/ANALYSIS.md and the raw JSON at bench/qwen25coder_kvq_validation.json.

Two things are load-bearing going into Slice 2+:

  1. Any MLX Array read via as_slice after a transpose must flatten first. Documented in cache.rs; future cache backends must follow suit or justify skipping.
  2. The value pathway is the quality floor, not the key pathway. Any compression improvement plan should prioritize better value quantization (rotation? learned group scales?) over squeezing key bits further.

Full run log: bench/turboquant_sweep.jsonl. Decisions recorded in bench/ANALYSIS.md.