
TurboQuant KV Compression

Carmack would optimize the KV cache, then optimize the optimization.

The KV cache is where long conversations go to eat memory. Every token a language model generates leaves behind a pair of tensors — the keys and values from every attention head in every layer. For a 27B model at 256-dim heads, that’s hundreds of megabytes per thousand tokens. On a Mac Mini that also runs LM Studio and a dozen other services, the KV cache is the thing that decides whether you get to have a conversation at all.
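To make "hundreds of megabytes" concrete, here is the back-of-envelope arithmetic. The layer and KV-head counts below are illustrative placeholders, not Qwen3.5's actual config:

```rust
/// Back-of-envelope KV cache footprint. Every generated token stores a key
/// and a value vector per KV head per layer. NOTE: the 48-layer / 8-KV-head
/// figures in the usage below are hypothetical example dims.
fn kv_bytes_per_token(
    n_layers: usize,
    n_kv_heads: usize,
    head_dim: usize,
    bytes_per_elem: usize, // 2 for bf16
) -> usize {
    2 * n_layers * n_kv_heads * head_dim * bytes_per_elem // 2 = keys + values
}
```

With, say, 48 layers, 8 KV heads, 256-dim heads, and bf16 storage, that is 393,216 bytes per token, i.e. ~384 MB per 1024 tokens before any compression.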

TurboQuant is how we make it smaller. The idea is not ours — it comes from Google’s ICLR 2026 paper (arXiv 2504.19874), layered on top of KVQuant and QJL. Our job is to implement it cleanly inside sanctum-mlx, wire it into Apple’s MLX runtime, and — most importantly — prove the quality doesn’t collapse when we turn the bits down.

This page is both an architecture reference and a scientific protocol for tuning it.

incoming K,V tensor [1, H_kv, L, D]
        │
        ├── keys ───▶ concat-grow plain on-device Array (bf16, bit-exact, no compression)
        │
        └── values ─▶ group-affine quantize on device → store (indices u8, scale f16, zero f16)

attention: the sdpa_dequant_v fused Metal kernel reads the compressed V state
           and dequantizes it inline in registers; no full V tensor ever materializes

Both halves live in services/sanctum-mlx/src/turboquant/ and hand off to a CompressedKVCache that implements mlx_lm::cache::KeyValueCache — a drop-in for the stock ConcatKeyValueCache. Attention is routed through the new KeyValueCache::fused_attention trait method.
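The value path's group-affine scheme can be sketched in plain Rust. This is a conceptual model only: the real cache stores f16 scale/zero and runs on device, and these function names are illustrative, not sanctum-mlx's API:

```rust
/// Per-group affine quantization sketch: each group of values gets one
/// scale and one zero-point (here the group minimum), plus u8 indices.
fn quantize_group(xs: &[f32], bits: u32) -> (Vec<u8>, f32, f32) {
    let min = xs.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = xs.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let levels = (1u32 << bits) - 1; // e.g. 15 at 4 bits
    let scale = if max > min { (max - min) / levels as f32 } else { 1.0 };
    let idx = xs
        .iter()
        .map(|&x| (((x - min) / scale).round() as u32).min(levels) as u8)
        .collect();
    (idx, scale, min)
}

/// Inverse map — this is what the fused kernel does inline in registers.
fn dequantize_group(idx: &[u8], scale: f32, zero: f32) -> Vec<f32> {
    idx.iter().map(|&i| i as f32 * scale + zero).collect()
}
```

The round-trip error per element is bounded by half a quantization step, which is why per-group (rather than per-tensor) scales matter: smaller groups mean tighter min/max ranges.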

The paper’s recipe — historical context


The original Slice 1 implementation followed the TurboQuant paper literally:

keys: normalize → Hadamard rotate → Lloyd-Max quantize → QJL 1-bit sign correction → pack
values: group-affine quantize (KVQuant style) → store scale + zero per group
on every read: dequant all stored tokens → Array → attention
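The "Hadamard rotate" step spreads per-channel outliers across all channels before quantization. A minimal CPU sketch of the transform (the real pipeline applies it per head on device; randomized signs and the Lloyd-Max codebook that follow it are omitted here):

```rust
/// In-place fast Walsh–Hadamard transform; len must be a power of two.
/// Dividing by sqrt(n) makes the transform orthonormal, and since the
/// normalized Hadamard matrix is symmetric, applying it twice is identity.
fn fwht(xs: &mut [f32]) {
    let n = xs.len();
    assert!(n.is_power_of_two());
    let mut h = 1;
    while h < n {
        for i in (0..n).step_by(h * 2) {
            for j in i..i + h {
                let (a, b) = (xs[j], xs[j + h]);
                xs[j] = a + b;
                xs[j + h] = a - b;
            }
        }
        h *= 2;
    }
    let norm = 1.0 / (n as f32).sqrt();
    for x in xs.iter_mut() {
        *x *= norm;
    }
}
```

Because the normalized transform is its own inverse, the dequant path can undo the rotation with the same kernel, and the rotation itself is lossless; only the quantizer after it loses information.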

This is the algorithm the protocol below validated, and the bit-budget math behind the “~3.5 bits/channel quality-neutral” claim is real. We just measured at our specific workload that the CPU round-trip cost more than the algorithm’s memory savings earned us, and pivoted accordingly. The receipts (Slice 1 PPL sweep, Slice 1.5 negative result, Slice 4a kernel correctness, Slice 4a-final pivot) are preserved in agent memory.

The first end-to-end run on a Qwen3.5-27B-4bit model with default settings (3-bit keys + QJL, 4-bit values, group_size=128) delivered this:

ppl (fp16) = 9.3055
ppl (turboquant) = 28.6451
Δppl = +19.34 absolute, +207.83% relative

Catastrophic. The plumbing works end-to-end — no crashes, no SIGKILL, the bit stream roundtrips — but the output is a model that has forgotten what it was looking at. This is exactly the situation where “tune a few knobs and hope” fails. We need a protocol.

The protocol has five principles, in priority order.

Do not touch bit widths until each stage of the pipeline has been proven correct in isolation. Loss could come from the Array↔f32 bridge, from Hadamard rotation, from Lloyd-Max quantization, from QJL, or from the value group-affine path — and you can’t distinguish them by watching perplexity collapse.

The ablation sequence:

Run   Config                               What it isolates                         Expected
A1    8-bit keys + 8-bit values, QJL off   Bridge + minimal-loss quant              ≈ fp16
A4    8-bit keys + 8-bit values, QJL on    QJL math correctness at high precision   ≈ fp16
A5    8-bit keys, 4-bit values, QJL off    Value stage only                         measures value loss
A6    3-bit keys + QJL, 8-bit values       Key stage only                           measures key loss

If A1 fails, you have a bug in the bridge or the rotation — stop sweeping and bisect. If A5 is clean but A6 is broken, the bug is in the key pipeline (Lloyd-Max, QJL, or rotation). And so on.

Once each stage is proven, search the Pareto frontier. Don’t grid-search 120 cells — use a staircase:

  1. Fix value_bits=8, sweep key_bits ∈ {3, 4, 5, 6, 8} → find minimum that meets budget
  2. Fix key_bits at step-1 winner, sweep value_bits ∈ {2, 4, 6, 8}
  3. Fix both, check group_size ∈ {32, 64, 128} sensitivity
  4. At the winning (key_bits, value_bits), compare use_qjl: on vs off — does QJL actually help?

About 20 runs total. Each run takes ~10 seconds on an M4 Max once the model is cached.
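The staircase can be pictured as code. The Config struct and the winner parameters are hypothetical harness-side names; under these assumptions the enumeration yields 14 configurations, within the "about 20 runs" envelope:

```rust
/// Hypothetical sweep-config record (not the harness's real types).
#[derive(Clone, Copy, Debug)]
struct Config {
    key_bits: u8,
    value_bits: u8,
    group_size: u16,
    use_qjl: bool,
}

/// Staircase enumeration: later stages pin the winner of earlier stages,
/// passed in here once the earlier measurements are done.
fn staircase(key_winner: u8, value_winner: u8) -> Vec<Config> {
    let mut runs = Vec::new();
    // Stage 1: sweep key_bits with values held at high precision.
    for k in [3u8, 4, 5, 6, 8] {
        runs.push(Config { key_bits: k, value_bits: 8, group_size: 128, use_qjl: false });
    }
    // Stage 2: sweep value_bits at the stage-1 winner.
    for v in [2u8, 4, 6, 8] {
        runs.push(Config { key_bits: key_winner, value_bits: v, group_size: 128, use_qjl: false });
    }
    // Stage 3: group-size sensitivity at the winning pair.
    for g in [32u16, 64, 128] {
        runs.push(Config { key_bits: key_winner, value_bits: value_winner, group_size: g, use_qjl: false });
    }
    // Stage 4: QJL on vs off at the winning pair.
    for q in [false, true] {
        runs.push(Config { key_bits: key_winner, value_bits: value_winner, group_size: 128, use_qjl: q });
    }
    runs
}
```

The design choice is that each stage changes exactly one knob, so any Δppl movement between adjacent runs is attributable.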

Every run appends one line to services/sanctum-mlx/bench/turboquant_sweep.jsonl:

{
"ts": "2026-04-18T11:40:00Z",
"host": "mbp",
"config": {"key_bits": 4, "value_bits": 6, "group_size": 128, "use_qjl": false, "head_dim": 256},
"eval": {"text_sha": "3f4a…", "n_tokens": 54},
"ppl_fp16": 9.3055,
"ppl_tq": 10.12,
"abs_delta": 0.8145,
"rel_delta": 0.0876,
"runtime_sec": {"fp16": 4.2, "tq": 6.8},
"verdict": "warn"
}

The text SHA is the comparability key — different text means different rows, not different configurations. The verdict classifier is:

Verdict   Budget
pass      abs_delta ≤ 0.3 AND rel_delta ≤ 5%
warn      abs_delta ≤ 1.0
fail      beyond warn
invalid   non-finite — means a bug
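The classifier is small enough to state exactly; the function name is illustrative, not the harness's real API:

```rust
/// Verdict classifier for one sweep row. rel_delta is a fraction (0.05 = 5%).
fn verdict(abs_delta: f64, rel_delta: f64) -> &'static str {
    if !abs_delta.is_finite() || !rel_delta.is_finite() {
        "invalid" // non-finite ppl is a bug, not a quality grade
    } else if abs_delta <= 0.3 && rel_delta <= 0.05 {
        "pass"
    } else if abs_delta <= 1.0 {
        "warn"
    } else {
        "fail"
    }
}
```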

Never change two variables between runs. The harness supports this through environment variables — the test binary reads SANCTUM_TQ_KEY_BITS, SANCTUM_TQ_VALUE_BITS, SANCTUM_TQ_GROUP_SIZE, SANCTUM_TQ_USE_QJL, SANCTUM_TQ_HEAD_DIM and installs them into a OnceLock before the model is loaded. No recompile between runs.
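A sketch of that env-to-OnceLock pattern, with hypothetical struct and field names; the unwrap_or fallbacks here mirror the scaffold defaults (3-bit keys, 4-bit values, QJL on), so with the env unset you get the defaults:

```rust
use std::sync::OnceLock;

/// Hypothetical override bundle read once from the environment.
#[derive(Debug)]
struct TqOverrides {
    key_bits: u8,
    value_bits: u8,
    use_qjl: bool,
}

static TQ_OVERRIDES: OnceLock<TqOverrides> = OnceLock::new();

/// First call parses the env vars and freezes the result; every later call
/// (including during model load) sees the same values. No recompile needed.
fn tq_overrides() -> &'static TqOverrides {
    TQ_OVERRIDES.get_or_init(|| TqOverrides {
        key_bits: std::env::var("SANCTUM_TQ_KEY_BITS")
            .ok().and_then(|s| s.parse().ok()).unwrap_or(3),
        value_bits: std::env::var("SANCTUM_TQ_VALUE_BITS")
            .ok().and_then(|s| s.parse().ok()).unwrap_or(4),
        use_qjl: std::env::var("SANCTUM_TQ_USE_QJL")
            .ok().and_then(|s| s.parse().ok()).unwrap_or(true),
    })
}
```

The OnceLock matters because the config must be fixed before the model loads; reading env vars lazily mid-run would let two halves of a run disagree.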

The shell driver bench/tune.sh orchestrates the staircase. Individual stages (ablation, key, value, group, qjl) can be run in isolation.

Don’t over-measure. The protocol has two exit conditions:

  • Confirmed bug (any ablation fails): stop, fix, restart. Sweeping over a bug produces garbage data.
  • Confirmed winner (a config meets both budgets): stop. The knee is located. Tighten later if needed.
A single sweep run looks like:
cd ~/Projects/sanctum-rs
SANCTUM_MLX_TEST_MODEL=/path/to/Qwen3.5-27B-4bit \
SANCTUM_TQ_KEY_BITS=4 \
SANCTUM_TQ_VALUE_BITS=6 \
SANCTUM_TQ_USE_QJL=false \
SANCTUM_TQ_JSONL=services/sanctum-mlx/bench/turboquant_sweep.jsonl \
SANCTUM_TQ_SOFT=1 \
cargo test -p sanctum-mlx --test turboquant_ppl --release \
-- --ignored --nocapture

SANCTUM_TQ_SOFT=1 disables the assertion failure on regression — during sweeps, bad numbers are data, not errors. Omit it when you’re gating.

Knob               Env                     Range            Meaning
key_bits           SANCTUM_TQ_KEY_BITS     3–8              Codebook precision per key dim. Rotation runs regardless.
value_bits         SANCTUM_TQ_VALUE_BITS   2–8              Affine quant precision per value dim, per group.
value_group_size   SANCTUM_TQ_GROUP_SIZE   32, 64, 128      Granularity of per-group scale/zero for values.
use_qjl            SANCTUM_TQ_USE_QJL      true / false     Whether to add the QJL 1-bit sign correction after Lloyd-Max. Only meaningful at key_bits ≤ 4.
head_dim           SANCTUM_TQ_HEAD_DIM     model-specific   Must match the model. Qwen3.5 = 256, Qwen2.5 = 128.

Metal memory ceilings (orthogonal to TurboQuant, but relevant to the why):

Flag                      Env                           Effect
--metal-cache-limit-mb    SANCTUM_MLX_CACHE_LIMIT_MB    Cap MLX’s reusable buffer cache. Stops sanctum-mlx from fighting LM Studio for the Metal heap.
--metal-memory-limit-mb   SANCTUM_MLX_MEMORY_LIMIT_MB   Hard cap on total MLX device memory.
--metal-wired-limit-mb    SANCTUM_MLX_WIRED_LIMIT_MB    Cap non-swappable memory; keep hot tensors resident without starving the rest of the system.

The default gate is abs_delta ≤ 0.3 AND rel_delta ≤ 5% on a short eval. That’s permissive compared to the TurboQuant paper’s reported < 0.1 Δppl, and intentionally so — our implementation omits some of the paper’s refinements (learned rotations, full TurboQuant value pipeline) in favor of shipping Slice 1.

A config that passes here is good enough to ship under --turboquant behind a flag. It is not yet good enough to be the default. Default-flip requires:

  • Clean ablation pass across all stages
  • Wider eval (wikitext-2 chunk, ≥1024 tokens)
  • Sample-quality check: generate 200 tokens continuation from the same prefix with both caches, compare
  • No regression across multiple prompt families (code, prose, reasoning)

Until then, turboquant is opt-in. Operators who want the memory savings know they’re in an experiment.

First real-world run — the protocol paid for itself on day one


The first end-to-end run on Qwen3.5-27B-4bit with scaffold defaults (3-bit keys + QJL, 4-bit values) produced the catastrophic numbers quoted earlier: Δppl = +19.34 (+208%). The instinct would have been to turn knobs — “try 4-bit, try 5-bit, maybe disable QJL.” The protocol said no: ablate first.

The ablation revealed that all four bit configurations gave the same +20 PPL collapse: 3-bit keys with QJL and 8-bit/8-bit with QJL off both landed at +20. That is impossible if the quant math were at fault, because 8-bit quantization cannot lose as much information as 3-bit. The bug had to be upstream.

We added BypassMode::{IdentityPassthrough, BridgeOnly} — short-circuits that skip the quantization pipeline. Result:

Run    Mode                                    Δppl
A0     IdentityPassthrough (raw concat)        0.0000  PASS
A0.5   BridgeOnly (f32 round-trip, no quant)   +20.62  FAIL

Bisected in two runs. The f32 Array↔CPU bridge was the culprit, not quantization. Root cause: MLX Arrays from transpose_axes([0, 2, 1, 3]) are strided views, and as_slice::<T>() reads the raw buffer without respecting logical layout. Without forcing materialization (via flatten()), a logical [B, H, L, D] view was yielding pre-transpose [B, L, H, D] bytes. Heads and tokens were silently permuted before any quantization ran.

Fix: three lines in turboquant/cache.rs, calling flatten before as_slice.
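The failure mode is easy to reproduce in pure Rust, no MLX required: a transpose changes the logical layout but not the underlying buffer, so a raw buffer read returns pre-transpose order. The helper below plays the role of flatten for a 2-D case by materializing the logical order (names are illustrative):

```rust
/// Given a row-major [h, l] buffer viewed logically as its transpose [l, h],
/// walk the logical layout and copy elements out in that order. Reading the
/// raw buffer instead would silently yield the pre-transpose order.
fn materialize_transposed(buf: &[i32], h: usize, l: usize) -> Vec<i32> {
    let mut out = Vec::with_capacity(buf.len());
    for i in 0..l {
        for j in 0..h {
            out.push(buf[j * l + i]); // stride-aware index into the raw buffer
        }
    }
    out
}
```

For a 2×3 buffer [0, 1, 2, 3, 4, 5], the materialized transpose is [0, 3, 1, 4, 2, 5], which is exactly the reordering the 4-D [B, H, L, D] case needed before as_slice.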

Post-fix sweep — the actual Pareto frontier


Full staircase on the same 27B model, 54-token eval, protocol compliant (one knob at a time, pre-fix rows archived as invalid):

Run   Config                       Δppl
A1    8-bit / 8-bit, no QJL        0.023
A4    8-bit / 8-bit, QJL on        0.014
A5    8-bit keys / 4-bit values    0.024
A6    3-bit + QJL / 8-bit values   0.137

Key-bits sweep (value_bits held at 8):

key_bits   3       4       5       6       8
Δppl       0.254   0.069   0.090   0.031   0.023

Keys quantize gracefully. All pass.

Value-bits sweep:

value_bits   2                4         6      8
Δppl         1.22 ❌ breaks   0.12 ✅   0.05   0.07

Hard floor: 4-bit values. The value pathway (group-affine, no rotation) is less tolerant than keys.

key_bits   value_bits   QJL off   QJL on   Effect
4          6            0.050     0.184    QJL hurts
3          8            0.254     0.137    QJL helps
QJL costs one bit from the codebook. At 4-bit keys that trade is net-negative; at 3-bit keys the sign-correction wins. Rule: enable QJL only when key_bits ≤ 3.

At (k=4, v=6) the group sizes 32, 64, 128 all land within 0.04 absolute — noise. Use 128 for minimum metadata overhead.

Config                           Δppl    Rel     Est. compression   Risk
Safe: k=4, v=4, g=128, QJL=off   0.123   1.32%   ~3.2×              solid pass
Bold: k=3, v=4, g=128, QJL=on    0.338   3.63%   ~4.0×              0.04 over abs budget; noise-level margin

On a 54-token eval we cannot responsibly discriminate between these at the 0.3-boundary. Protocol Principle 5 applies: do not flip a default without a confirmed winner. The --turboquant flag remains opt-in; wider eval (wikitext-2, ≥1024 tokens) is the gate for default change.

Ran both candidates on a 1256-token corpus drawn from the sanctum-docs architecture pages (real technical prose the model has seen many variants of). Baseline shifted up — real prose is harder than the 54-token calibration string — but that’s a fairer measurement.

Candidate                        Δppl abs   Rel     Est. compression
Safe: k=4, v=4, g=128, QJL=off   0.157      1.18%   ~3.2×
Bold: k=3, v=4, g=128, QJL=on    0.196      1.47%   ~4.0×

Both decisively inside the 0.3 abs / 5% rel budgets. The wider sample firmed up what the short eval hinted: Bold’s borderline 0.34 tightened to 0.20, Safe’s 0.12 drifted to 0.16. Signal is stable.

Verdict — Bold is the consecrated default


TurboQuantConfig::default() stays key_bits=3, value_bits=4, use_qjl=true — the original scaffold had the right intent. What we earned through the protocol is evidence that it was right, not just a hunch.

Safe remains the documented fallback for operators who want a tighter quality gate or are running on a model we haven’t validated. Two env vars flip to Safe:

SANCTUM_TQ_KEY_BITS=4 SANCTUM_TQ_VALUE_BITS=4 SANCTUM_TQ_USE_QJL=false \
sanctum-mlx --turboquant

Even with a validated default config, the flag stays opt-in until:

  • Slice 4 (Metal-fused dequant) lands — current CPU-side path has O(T²) cost that surprises operators
  • Model coverage widens — Qwen3.5-27B is one data point; Qwen3.6-35B, Qwen2.5-Coder, Gemma4-31B each deserve their own protocol run before auto-on

Evidence dictates defaults. Evidence is narrower than ambition. Keep the knob.

Slice 1c — cross-architecture de-risk (ecosystem wall + a finding)


We attempted to validate the Bold default on a non-Qwen3.5 architecture before committing to Gemma-4 Rust work. The attempt ran into an ecosystem wall and produced a real finding anyway.

Python mlx-lm 0.29.1 (latest on Python 3.9) supports qwen2, qwen3, gemma3, llama4, and more — but not qwen3_5 (our validated model) and not gemma4 (the target for tonight’s fine-tune). Our Rust mlx-lm fork has the inverse: it supports qwen3_5 and qwen3 but not qwen2 or any gemma.

Python ∩ Rust = { }

No model type loads in both stacks. Cross-validation requires either (a) porting TurboQuant to Python as a custom KVCache subclass (~2–3 hr), or (b) adding an architecture to the Rust fork (~4–6 hr Qwen2, ~8–16 hr Gemma-4). Neither is a “quick de-risk.”

Ran Python mlx-lm’s built-in QuantizedKVCache (plain affine per-group quant — not TurboQuant’s rotation path) on Qwen2.5-Coder-7B (head_dim=128, 7:1 GQA). Same 1238-token eval.

KV bits         PPL       Δppl abs   Verdict
fp16 baseline   25.99     —          —
8               25.94     0.05       PASS
4               339,210   339,184    FAIL — catastrophic
3               712,519   712,493    FAIL
2               648,756   648,730    FAIL

A PPL of 339,210 on a ~152k-vocab model is worse than uniform random. That’s not a quality gradient — it’s a cliff. Stock 4-bit affine quant isn’t losing information, it’s actively poisoning state.
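Why "worse than uniform random": perplexity is exp of the mean per-token negative log-likelihood, and a uniform guess over a V-token vocabulary scores ln V per token, so its perplexity is exactly V — about 152,064 here, well under 339,210. A tiny sketch:

```rust
/// Perplexity from per-token negative log-likelihoods (natural log).
fn perplexity(nlls: &[f64]) -> f64 {
    (nlls.iter().sum::<f64>() / nlls.len() as f64).exp()
}
```

Feeding it a constant ln(152_064) per token returns 152,064: the worst score any model that has not been actively corrupted should ever produce on that vocabulary.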

  • Slice 1d — Port TurboQuant to Python mlx-lm as a custom KVCache. Dedicated ~3-hour session. Validates across qwen2 + qwen3 + gemma3. Should precede Slice G.
  • Slice G — Gemma-4 Rust loader. Gated on upstream Python landing gemma4 support AND Slice 1d passing. Not before.
  • Slice Q2 — Qwen2 Rust loader. Unlocks Qwen2.5-Coder in sanctum-mlx production. Adjacent practice for Slice G.

Details in bench/ANALYSIS.md and the raw JSON at bench/qwen25coder_kvq_validation.json.

Two things are load-bearing going into Slice 2+:

  1. Any MLX Array read via as_slice after a transpose must flatten first. Documented in cache.rs; future cache backends must follow suit or justify skipping.
  2. The value pathway is the quality floor, not the key pathway. Any compression improvement plan should prioritize better value quantization (rotation? learned group scales?) over squeezing key bits further.

Full run log: bench/turboquant_sweep.jsonl. Decisions recorded in bench/ANALYSIS.md.