TurboQuant Slice 4 — GPU Dequantization Scoping

Carmack would profile the optimization, then optimize the profiler.

Slices 1 through 1f got us from “TurboQuant doesn’t compile” to “ships 21× better quality at 4× compression as an opt-in flag.” The missing piece for --turboquant to be the default is performance. Right now our dequant lives on the CPU, and that’s fine for correctness validation but unshippable for production.

This page scopes Slice 4.

Every call to CompressedKVCache::update_and_fetch inside the decode loop does the following:

```
step t: quantize new K,V     → push to Vec<CompressedHeadKV>
step t: rebuild_f32          → dequantize ALL T stored tokens,
                               build a flat Vec<f32>, ship to Metal
step t: Array::from_slice    → fp32→bf16 cast on GPU
```

Three problems, in priority order:

  1. O(T²) CPU work across a generation. Each step rebuilds every previously-stored token; at T=1000 that is roughly 500,000 per-vector dequants on the CPU over the full run (see the sketch after this list).
  2. CPU↔GPU bounce every step. The dequant output is an f32 Vec in host memory, then Array::from_slice ships it back to the GPU. Full round-trip per decode step.
  3. No kernel fusion. Even with GPU dequant, the attention path is: dequant → cast → scaled-dot-product-attention. Two separate Metal dispatches. Ideal: one fused kernel that does dequant + Q·Kᵀ + softmax + ·V in one shot.
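
To make problem 1 concrete, here is an illustrative reconstruction of the current CPU rebuild. CompressedHeadKV and rebuild_f32 are names from the trace above; the struct fields and dequantize_head are hypothetical stand-ins for the real cache.rs internals.

```rust
// Hypothetical stand-ins: only the shape of the work matters here.
struct CompressedHeadKV {
    codes: Vec<u8>, // quantized values for one stored vector
    scale: f32,     // per-vector affine scale
    zero: f32,      // per-vector zero point
}

// Illustrative per-vector affine dequant, on the CPU.
fn dequantize_head(t: &CompressedHeadKV) -> impl Iterator<Item = f32> + '_ {
    t.codes.iter().map(|&q| q as f32 * t.scale + t.zero)
}

// Called every decode step over ALL stored tokens, so step t re-pays the
// cost of the t-1 tokens it already dequantized last step: O(T²) per run.
fn rebuild_f32(stored: &[CompressedHeadKV], head_dim: usize) -> Vec<f32> {
    let mut flat = Vec::with_capacity(stored.len() * head_dim);
    for token in stored {
        flat.extend(dequantize_head(token));
    }
    flat // then Array::from_slice(&flat, ...) ships it back to the GPU
}
```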

On an M4 Max at realistic context lengths (1k–4k tokens), the CPU dequant cost is the dominant factor. That’s why --turboquant ships with a bold warning about “dev-grade CPU dequant, expect throughput regression” in the server startup log.

Slice 4a is the practical production unlock: replace every CPU-side dequant step with MLX array operations, which dispatch as Metal kernels automatically.

Storage refactor: CompressedKVCache stores mlx_rs::Array instead of Vec<u8> / Vec<half::f16>. Quantize produces Arrays directly; concatenate across tokens with mx.concatenate; dequantize with mx.astype + mx.multiply + mx.add (all Metal-backed).
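
A minimal sketch of that storage model, assuming the vendored mlx_rs wrapper exposes concatenate, an as_dtype-style cast, and elementwise multiply/add; the exact spellings and signatures need checking against vendor/mlx-rs, so read this as typed pseudocode over the real API.

```rust
use mlx_rs::Array;

// Array-backed replacement for the Vec<u8> / Vec<half::f16> storage.
struct CompressedKeys {
    codes: Array,  // [T, d] quantized codes
    scales: Array, // [T, 1] per-vector scale
    zeros: Array,  // [T, 1] per-vector zero point
}

impl CompressedKeys {
    // Write side: append one token along the token axis. Metal kernel, no host copy.
    fn append(&mut self, codes: Array, scale: Array, zero: Array) {
        self.codes = mlx_rs::ops::concatenate(&[self.codes.clone(), codes], 0).unwrap();
        self.scales = mlx_rs::ops::concatenate(&[self.scales.clone(), scale], 0).unwrap();
        self.zeros = mlx_rs::ops::concatenate(&[self.zeros.clone(), zero], 0).unwrap();
    }

    // Read side: astype + multiply + add, all Metal-backed. No Vec, no round-trip.
    fn dequantize_all(&self) -> Array {
        let f = self.codes.as_dtype(mlx_rs::Dtype::Float32).unwrap();
        mlx_rs::ops::add(&mlx_rs::ops::multiply(&f, &self.scales).unwrap(), &self.zeros).unwrap()
    }
}
```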

Key path (rotation + per-vector affine):

```
on write:
  normalize_and_rotate  :: Array → Array          # matmul against H_d, O(d²) on GPU
  find scale/zero       :: Array → (Array, Array) # mx.min, mx.max
  quantize              :: Array → Array          # (x - zero) / scale, round, clamp
on read:
  dequantize + unrotate :: Array → Array          # reverse of the above
```

No per-token CPU loops, no Vec allocation, no host round-trip.
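
The same path sketched in Rust. Op spellings below (matmul, min/max reductions, round, clip, Array::from_f32) are modeled on MLX's Python API and are assumptions about the vendored wrapper; treat this as typed pseudocode until they are verified.

```rust
use mlx_rs::{error::Exception, ops, Array};

// Write-side key path: rotate, find per-vector scale/zero, affine-quantize.
// Every op dispatches as a Metal kernel; nothing touches host memory.
fn quantize_key(x: &Array, h_d: &Array, qmax: f32) -> Result<(Array, Array, Array), Exception> {
    // normalize_and_rotate: matmul against H_d, O(d²) on GPU
    let rotated = ops::matmul(x, h_d)?;

    // find scale/zero: per-vector min/max along the feature axis
    let zero = rotated.min_axes(&[-1], true)?;
    let max = rotated.max_axes(&[-1], true)?;
    let scale = ops::divide(&ops::subtract(&max, &zero)?, &Array::from_f32(qmax))?;

    // quantize: (x - zero) / scale, round, clamp to [0, qmax]
    let shifted = ops::divide(&ops::subtract(&rotated, &zero)?, &scale)?;
    let q = ops::clip(&ops::round(&shifted)?, (0.0, qmax))?;
    Ok((q, scale, zero))
}
```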

Value path: mlx_quantize and mlx_dequantize already exist as C FFI → Metal kernels, and they implement exactly our per-group affine scheme, making them a drop-in replacement for quantize_value_group / dequantize_value_group.

Estimated effort: 200–400 LOC net (replacement of cache.rs internals + removal of CPU vec storage + plumbing of Array↔Array in the KeyValueCache::update_and_fetch contract).

Risk: Medium-low. The algebra is unchanged; we’re porting CPU float loops to MLX Array ops that do exactly the same math. Main risks are (a) Array layout quirks we already burned ourselves on once (the flatten stride bug from Slice 1), and (b) dtype conversions introducing numerical drift vs the validated CPU path. Mitigation: re-run the 3-arm A/B and bench/ANALYSIS.md sanity checks before merging.

Wins:

  • O(T) dequant work per step → O(1) amortized per step with proper concatenation (one way to get there is sketched after this list).
  • Zero CPU↔GPU round-trips in the decode loop.
  • Expected throughput at 2k context: 3–6× current --turboquant path; within spitting distance of the fp16 reference.
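
"Proper concatenation" carries weight in that first bullet: a naive mx.concatenate per step still copies O(T) per step. One standard way to reach the amortized bound (our reading, not something the plan above commits to) is geometric growth of the backing Array; zeros_like and write_row below are hypothetical stand-ins for the wrapper's alloc and slice-write ops.

```rust
use mlx_rs::Array;

// Hypothetical helpers: the real code would use whatever zero-alloc and
// indexed slice-write the vendored wrapper exposes.
fn zeros_like(a: &Array) -> Array { todo!("ops::zeros with a's shape/dtype") }
fn write_row(buf: &mut Array, row: usize, token: &Array) { todo!("slice assign") }

// Doubling capacity makes total copy work across T appends O(T),
// i.e. O(1) amortized per decode step, instead of an O(T) concat per step.
struct GrowableTokens {
    buf: Array, // [capacity, d] backing storage on the GPU
    len: usize, // tokens actually written so far
}

impl GrowableTokens {
    fn push(&mut self, token: &Array) {
        let capacity = self.buf.shape()[0] as usize;
        if self.len == capacity {
            // One O(capacity) copy per doubling; amortized away across pushes.
            self.buf =
                mlx_rs::ops::concatenate(&[self.buf.clone(), zeros_like(&self.buf)], 0).unwrap();
        }
        write_row(&mut self.buf, self.len, token);
        self.len += 1;
    }
}
```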

Blocking concerns: None. MLX has all the primitives. This is a straightforward engineering port.

  1. Ship 4a first. It gives us real-world production throughput and removes the “dev-grade” warning from the server log. Measurable quickly on a wikitext decode loop.
  2. Measure before 4b or 4c. Profile at 2k and 4k contexts. If 4a is within 2× of fp16, 4b and 4c become nice-to-haves, not necessities. If 4a is still 5× slower than fp16, 4b is the right next move.
  3. 4c only if 4b hits a wall. mx.compile is Apple’s own answer to kernel fusion; building a bespoke Metal kernel to beat their compiler is a long bet with high carrying cost.

The production gate on --turboquant becoming auto-on isn’t “fastest possible” — it’s “within 2× of fp16 throughput at acceptable context lengths.” If 4a clears that bar, the flag graduates to default-on for memory-constrained contexts (shared-hardware Mac Mini) and stays opt-in for throughput-dominated contexts (M4 Max with 128 GB).

  • Quality is already paid for. Slices 1a–1f landed at Δppl ≤ 0.01 on Qwen3.5, Δppl ≤ 0.17 on Qwen2. Any perf optimization that shifts those numbers by more than measurement noise is a bug, not a win.
  • The KeyValueCache trait is a load-bearing boundary. Slice 4 changes the internals of CompressedKVCache; it must not change update_and_fetch's signature or the KeyValueCache + Default bound that the Qwen3.5 and Qwen2 model impls depend on (an illustrative shape follows this list).
  • Bypass modes stay. BypassMode::{IdentityPassthrough, BridgeOnly} saved us from a stride bug on day one and saved us from a mislabeled A/B on day two. Whatever storage refactor 4a imposes, the bypass short-circuits must keep working.
  • 4a implementation: 1 focused day. Much of the surgery is in cache.rs::update_and_fetch — rewriting the storage model from Vec-based to Array-based and teaching the two halves of the pipeline (write-side quantize, read-side dequantize) to speak in mlx_rs::Array the whole way through.
  • 4a validation: 2–3 hours. Re-run turboquant_ppl across all validated configs, diff against current results. Any Δppl change > 0.01 means a correctness bug somewhere in the port.
  • 4a profiling + doc: half a day. Before/after latency + memory numbers, update ANALYSIS.md, add a “GPU dequant” section to turboquant-kv-compression.mdx.
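
For reference, an illustrative shape of that boundary. The trait name and the + Default bound come from the doc; the signature itself is a hypothetical reconstruction, not the real sanctum-mlx code.

```rust
use mlx_rs::Array;

// Illustrative only. Slice 4a may rewrite everything behind update_and_fetch
// (Array-based storage, GPU dequant); it must not change this surface.
pub trait KeyValueCache {
    /// Ingest this step's new K/V; return the full K/V to attend over.
    fn update_and_fetch(&mut self, keys: &Array, values: &Array) -> (Array, Array);
}

// The Qwen3.5 / Qwen2 model impls are generic over `KeyValueCache + Default`
// and must not notice the Vec→Array storage swap.
pub struct Attention<C: KeyValueCache + Default> {
    cache: C,
}
```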

Total: ~2 days of focused work for the change that unlocks --turboquant default.

  • services/sanctum-mlx/src/turboquant/cache.rs — the main target
  • services/sanctum-mlx/src/turboquant/quantizer.rs — migrate quantize_key_affine / quantize_value_group to Array-based signatures
  • vendor/mlx-rs/mlx-rs/src/ops/ — may need to expose mlx_quantize / mlx_dequantize in the Rust wrapper (C FFI exists; Rust safe wrapper may not)
  • New: services/sanctum-mlx/bench/slice4_perf.jsonl — throughput measurements per config

Whether any sanctum-rs commits are needed for mlx-rs FFI changes depends on whether mx.quantize is already wrapped. Quick check: grep -r 'pub fn quantize' vendor/mlx-rs/mlx-rs/src/ops/ came up empty on the first look, so we'd need to add a thin safe wrapper around mlx_quantize/mlx_dequantize first (sketched below).
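
If the grep result holds, the addition is small. Here is a sketch of the intended public surface only, assuming MLX's convention that quantize returns (w_q, scales, biases) per group, as in its Python API; the unsafe bodies are elided because the exact generated mlx-sys bindings need confirming first, and the file path is hypothetical.

```rust
// vendor/mlx-rs/mlx-rs/src/ops/quantization.rs (hypothetical new module)
use crate::{error::Exception, Array};

/// Per-group affine quantization via the existing mlx_quantize Metal kernel.
/// Returns (w_q, scales, biases), matching our per-group affine scheme.
pub fn quantize(w: &Array, group_size: i32, bits: i32) -> Result<(Array, Array, Array), Exception> {
    // One unsafe call into the mlx_quantize C symbol plus handle plumbing;
    // deliberately left as todo!() until the FFI signature is confirmed.
    todo!("wrap mlx_quantize once the generated binding is verified")
}

/// Inverse: reconstruct floats from (w_q, scales, biases).
pub fn dequantize(
    w_q: &Array,
    scales: &Array,
    biases: &Array,
    group_size: i32,
    bits: i32,
) -> Result<Array, Exception> {
    todo!("wrap mlx_dequantize once the generated binding is verified")
}
```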