Efficient KV-Cache Compression for Long-Context and Reasoning Models
Large language models (LLMs) increasingly handle very long input contexts, and their inference relies on storing key-value (KV) caches for past tokens to avoid redundant computation. However, as context length grows, the memory footprint of full KV caches becomes a major bottleneck. In this talk, Zefan Cai (CS PhD Student, UW-Madison, advised by Prof. Junjie Hu) presents two complementary approaches to compressing the KV cache, highlighting the underlying principles, trade-offs, and practical benefits for inference efficiency.
Pyramid KV
Pyramid KV is motivated by the observation that in transformer-based LLMs, attention flows from broad scopes in lower layers to narrow, focused contexts in higher layers ("pyramidal information funneling"). By allocating more of the cache budget to lower layers and progressively less to higher layers, Pyramid KV achieves near-full performance on long-context benchmarks while retaining only ~12% of the full KV cache.
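The layer-wise allocation idea can be illustrated with a minimal sketch. This is not the paper's actual budgeting scheme (Pyramid KV derives its allocation from observed attention patterns); the linear interpolation and the `ratio` parameter below are purely illustrative assumptions.

```python
def pyramid_budgets(num_layers: int, total_budget: int, ratio: float = 8.0) -> list[int]:
    """Illustrative pyramid-shaped per-layer KV-cache budgets.

    Linearly interpolates from a large budget at the bottom layer down to a
    small budget at the top layer, while keeping the sum of per-layer
    budgets (approximately, after rounding) equal to total_budget.
    `ratio` is the assumed bottom-to-top budget ratio.
    """
    # Solve for top/bottom budgets so the linear ramp sums to total_budget:
    # total = num_layers * (bottom + top) / 2, with bottom = ratio * top.
    top = 2 * total_budget / (num_layers * (1 + ratio))
    bottom = ratio * top
    return [
        round(bottom + (top - bottom) * layer / (num_layers - 1))
        for layer in range(num_layers)
    ]
```

For a 32-layer model with a total budget of 3,200 cached tokens, this gives the bottom layer roughly eight times the budget of the top layer, mirroring the broad-to-narrow attention funnel described above.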
R-KV: Redundancy-aware KV Cache Compression
Building upon Pyramid KV, R-KV targets reasoning-heavy tasks (e.g., chain-of-thought) whose long generated outputs produce very large KV caches. R-KV identifies and prunes redundant tokens in the cache, enabling a roughly 90% memory saving and ~6.6x throughput improvement while preserving, or even slightly improving, accuracy relative to the full cache.
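The redundancy-aware selection can be sketched as a greedy filter: rank tokens by an importance score and skip any token whose key vector is nearly identical to one already kept. This is only an illustrative sketch; the importance scores, the cosine-similarity test, and the 0.9 threshold are assumptions for demonstration, not R-KV's actual scoring rule.

```python
import numpy as np

def select_nonredundant(keys: np.ndarray, importance: np.ndarray, keep: int) -> list[int]:
    """Greedy, redundancy-aware token selection (illustrative).

    keys:       (T, d) cached key vectors for T tokens.
    importance: (T,) per-token importance scores (e.g., attention-based).
    keep:       target number of tokens to retain.
    """
    normed = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    kept: list[int] = []
    for idx in np.argsort(-importance):  # visit tokens by descending importance
        if len(kept) == keep:
            break
        if kept:
            sims = normed[kept] @ normed[idx]
            if sims.max() > 0.9:  # near-duplicate of a kept token: prune it
                continue
        kept.append(int(idx))
    return sorted(kept)
```

With duplicated key vectors in the cache, the second copy is pruned even when its importance is high, which is the mechanism that lets long chain-of-thought outputs be compressed aggressively without discarding distinct information.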