Efficient KV-Cache Compression for Long-Context and Reasoning Models

Videos

Tags

ML+X, UW-Madison, LLM, Deep learning, NLP, GenAI, Foundation models, GPU
Presenter

Zefan Cai

Date

November 4, 2025

Large language models (LLMs) increasingly handle very long input contexts, and their inference relies on storing key-value (KV) caches for past tokens to avoid redundant computation. However, the KV cache grows linearly with context length (and with batch size and model depth), so its memory footprint becomes a major bottleneck for long inputs. In this talk, Zefan Cai (CS PhD student, UW-Madison, advised by Prof. Junjie Hu) presents two complementary approaches to compressing the KV cache, highlighting the underlying principles, trade-offs, and practical benefits for inference efficiency.
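To give a feel for the scale involved, the sketch below estimates full KV-cache memory from model shape and context length. The function name and the example configuration are illustrative assumptions for this write-up, not figures from the talk.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   batch_size=1, bytes_per_elem=2):
    """Full KV-cache size: two tensors (K and V) per layer, each of shape
    [batch_size, num_kv_heads, context_len, head_dim]."""
    return (2 * num_layers * batch_size * num_kv_heads
            * context_len * head_dim * bytes_per_elem)

# Illustrative 7B-scale configuration (assumed for this sketch, not from the talk):
# 32 layers, 32 KV heads, head dimension 128, fp16 values (2 bytes each).
print(kv_cache_bytes(32, 32, 128, 128_000) / 1e9)  # ~67 GB at a 128k-token context
```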

Pyramid KV

Pyramid KV is motivated by the observation that in transformer-based LLMs, attention is spread broadly across the context in lower layers and progressively concentrates on a small set of salient tokens in higher layers (“pyramidal information funneling”). By allocating a larger cache budget to lower layers and progressively smaller budgets to higher layers, Pyramid KV achieves near-full performance on long-context benchmarks while retaining only about 12% of the full KV cache.
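A minimal sketch of the layer-wise budget idea, assuming a simple linear (arithmetic) schedule from a larger budget at the lowest layer to a smaller one at the top, with per-layer token selection by attention score. The function names, the `ratio` parameter, and the exact schedule are illustrative, not the paper's precise formulation.

```python
import torch

def pyramid_budgets(total_budget, num_layers, ratio=8):
    """Split a total KV budget across layers so lower layers keep more tokens
    than higher layers (linear decay from b_max to b_min; sums to total_budget)."""
    avg = total_budget / num_layers
    b_max, b_min = 2 * avg * ratio / (ratio + 1), 2 * avg / (ratio + 1)
    steps = torch.linspace(b_max, b_min, num_layers)
    return [int(round(b.item())) for b in steps]

def compress_layer(keys, values, attn_scores, budget):
    """Keep the `budget` cached tokens with the highest attention scores.
    keys/values: [seq_len, head_dim]; attn_scores: [seq_len]."""
    keep = torch.topk(attn_scores, k=min(budget, keys.shape[0])).indices.sort().values
    return keys[keep], values[keep]

budgets = pyramid_budgets(total_budget=2048, num_layers=32)
print(budgets[0], budgets[-1])  # lower layers receive a much larger share than higher layers
```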

R-KV: Redundancy-aware KV Cache Compression

Building on Pyramid KV, R-KV targets reasoning-heavy tasks (e.g., chain-of-thought) where long generated outputs produce very large KV caches. R-KV identifies and prunes redundant tokens in the cache during decoding, yielding roughly 90% memory savings and a ~6.6x throughput improvement while matching, and in some cases slightly exceeding, the accuracy of the full cache.
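A minimal sketch of redundancy-aware selection, assuming redundancy is measured by cosine similarity between cached key vectors and importance by recent attention mass, combined with a weighting factor. The scoring details and the `alpha` weight below are illustrative assumptions, not the exact R-KV formulation.

```python
import torch

def rkv_select(keys, attn_importance, budget, alpha=0.1):
    """Return indices of `budget` tokens ranked by importance minus redundancy.
    keys: [seq_len, head_dim] cached key vectors
    attn_importance: [seq_len], e.g. attention mass from recent queries."""
    k = torch.nn.functional.normalize(keys, dim=-1)
    sim = k @ k.T                                 # pairwise cosine similarity
    sim.fill_diagonal_(0.0)
    redundancy = sim.max(dim=-1).values           # closeness to the nearest other cached token
    score = attn_importance - alpha * redundancy  # prefer important, non-redundant tokens
    return torch.topk(score, k=min(budget, keys.shape[0])).indices.sort().values

keys = torch.randn(4096, 128)
importance = torch.rand(4096)
kept = rkv_select(keys, importance, budget=512)   # indices of the ~12.5% of tokens retained
```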