KV Cache Compression¶

★★★★★ Advanced

Reducing KV cache memory during LLM inference to enable longer contexts and more concurrent requests on the same GPU. The key-value cache grows linearly with sequence length and batch size, often becoming the memory bottleneck before model weights do.

Key Facts¶

KV cache for a 70B model at 128K context can exceed 40GB - more than the weights themselves
3-bit KV cache quantization achieves ~5-6x memory reduction with 99.5% attention fidelity
Random rotation + scalar quantization is the simplest effective approach - implementations appeared within hours of the paper
Weight quantization and KV cache quantization are orthogonal - combine them for maximum savings
The benefit scales with context length: minimal at 4K, significant at 32K+, transformative at 128K+

Why KV Cache Is the Bottleneck¶

During autoregressive generation, the model stores key and value tensors for every past token in every layer:

KV cache size = 2 * num_layers * num_heads * head_dim * seq_len * batch_size * bytes_per_element

Example (Llama 70B, FP16, single sequence):
= 2 * 80 * 64 * 128 * 128000 * 1 * 2 bytes
= ~42 GB

Model weights (Q4_K_M quantized) take ~40GB. The KV cache alone matches or exceeds weight memory at long contexts.

TurboQuant: Rotation + Scalar Quantization¶

Core algorithm with no model retraining:

Random rotation - multiply K and V vectors by a random orthogonal matrix. This spreads outlier magnitudes evenly across dimensions, making all channels quantization-friendly
Scalar quantization - map rotated values to a small lookup table (3-4 bits per element)
Optional error correction - QJL (Quantized Johnson-Lindenstrauss) reduces approximation error in attention scores

# llama.cpp with KV cache quantization (fork)
./llama-server -m model.gguf \
  --cache-type-k turbo3 \
  --cache-type-v turbo3

# Standard llama.cpp (when merged)
./llama-server -m model.gguf \
  --cache-type-k q4_0 \
  --cache-type-v q4_0

Memory Savings by Configuration¶

Config	Weights	KV Cache	Total (128K ctx)	Quality
FP16 + FP16 cache	140 GB	42 GB	182 GB	Baseline
Q4_K_M + FP16 cache	40 GB	42 GB	82 GB	~99.9%
Q4_K_M + Q8 cache	40 GB	21 GB	61 GB	~99.8%
Q4_K_M + turbo3 cache	40 GB	7 GB	47 GB	~99.5%
Q4_K_M + turbo2 cache	40 GB	5 GB	45 GB	~98.5%

Implementation Landscape (2026)¶

Project	Status	Platform
llama.cpp fork (TheTom)	Working, 18/18 tests pass	CPU
llama.cpp mainline	Feature request, merge expected	CPU/GPU
vLLM	Feature request open	GPU (CUDA)
Standalone Triton kernels	Working	GPU (CUDA)
Apple Silicon variant	4.6x compression, q8_0 speed parity	Metal
PyTorch reference	Educational implementation	Any

When to Use¶

High value:

Long-context inference (64K+ tokens) where KV cache dominates memory
Multi-user serving where batch size multiplies cache memory
Running large models on limited VRAM (single GPU for 70B)

Low value:

Short prompts (<4K tokens) - cache is small anyway
Batch size 1 with short context - weight quantization matters more
Training - KV cache is recomputed per step, not stored

Quality Considerations¶

Base weight quantization affects how well KV cache quantization works:

Higher quality weights + aggressive cache quant = good
  Q8_0 weights + turbo3 cache → minimal degradation

Lower quality weights + aggressive cache quant = compounding errors
  Q4_K_M weights + turbo2 cache → noticeable on complex reasoning

Validation approach:

# Needle-in-haystack test across context lengths
python eval_needle.py \
  --model model.gguf \
  --cache-type turbo3 \
  --context-lengths 4096 16384 65536 131072 \
  --num-needles 100

# Compare perplexity
./llama-perplexity -m model.gguf \
  --cache-type-k turbo3 --cache-type-v turbo3 \
  -f wiki_test.txt

Combining with Other Optimizations¶

KV cache compression stacks with other inference techniques:

Technique	Target	Combines?
Weight quantization (GGUF, AWQ, GPTQ)	Model weights	Yes - orthogonal
KV cache quantization	Attention cache	This article
Flash Attention	Attention compute	Yes - reduces compute, this reduces memory
Sliding window attention	Context structure	Yes - reduces cache entries, this reduces per-entry size
Speculative decoding	Generation speed	Yes - no conflict
PagedAttention (vLLM)	Memory fragmentation	Yes - paged allocation + smaller pages

TriAttention - Pre-RoPE Importance Scoring (2026)¶

10.7x KV memory reduction with zero accuracy loss on reasoning tasks. Exploits the discovery that pre-RoPE Q/K vectors concentrate around fixed centers across heads.

Architecture¶

Trigonometric Series Scoring (S_trig) - estimates key importance via Q/K centers + positional distance
Norm-Based Scoring (S_norm) - complementary signal for low-concentration heads
Adaptive Weighting - auto-balances using Mean Resultant Length (R) metric

Performance:
  AIME25: 32.9% (vs R-KV 17.5%, SnapKV similar budget) at 10.7x compression
  MATH 500: 6.3x speedup (1405 vs 223 tokens/sec)
  RULER retrieval: 66.1 (vs SnapKV 55.6)

Deployment: vLLM plugin (auto-discovery, no code changes)
Validated on: Qwen3-8B, DeepSeek-R1-Distill-Llama-8B, GPT-OSS-20B, GLM-4.7-Flash (MLA)

Why it beats prior methods: H2O, SnapKV, R-KV use "limited observation windows" - only recent queries maintain representative orientations. TriAttention operates in pre-RoPE space where vectors are stable, providing intrinsic importance signals independent of position.

KVPress Toolkit (NVIDIA, 2026)¶

Framework for benchmarking and deploying KV compression strategies. Not a standalone method - wraps multiple approaches for comparison.

Methods: KnormPress, SnapKVPress, ObservedAttention, SinkPress, TurboQuant
Integration: HuggingFace transformers native
Use case: Benchmark different strategies on your specific model/task

CompLLM - Segment-Based Soft Compression¶

Divides context into segments, compresses each independently. Results: 2x compression yields 4x TTFT speedup, 50% KV cache reduction. Practical for long-context Q&A where full attention is unnecessary.

LoPace - Lossless Prompt Compression for Storage¶

Lossless compression for prompt caching and storage (not inference-time):

Methods: Zstandard + BPE tokenization + binary packing + hybrid
Compression: 4.89x average (range 1.22-19.09x), 72.2% space savings
Reconstruction: 100% lossless
Use case: Prompt caching, template storage, prompt libraries
NOT for: Runtime KV cache reduction

Gotchas¶

QJL correction can hurt through standard attention paths - the second-stage error correction from the paper assumes a specific attention implementation. Through standard scaled dot-product attention, QJL correction empirically worsens quality. Use the _mse variant that optimizes mean squared error directly instead
turbo3 is not equivalent to q3_0 - the rotation step before quantization is critical. Standard 3-bit quantization without rotation produces much worse results because outlier channels dominate the error. The rotation spreads information evenly across dimensions first
Benchmark at YOUR context length - a model that passes perplexity tests at 4K may fail needle-in-haystack at 128K with aggressive cache quantization. Always test at the actual context length you plan to serve
MemPalace AAAK debunked - claimed 30x lossless compression via abbreviation dialect, but token counting was wrong (len(text)//3 instead of proper tokenizer). AAAK actually increases tokens at small scales. LongMemEval regressed from 96.6% to 84.2%. The valid finding: raw verbatim storage + smart retrieval beats LLM extraction
TriAttention requires pre-RoPE access - the method operates on Q/K vectors before rotary position embedding is applied. Inference frameworks that fuse RoPE into the attention kernel may need modification