No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization
Authors: June Yong Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho Park, Eunho Yang, Se Jung Kwon, Dongsoo Lee
Abstract: Key-Value (KV) caching has become an essential technique for accelerating the inference speed and throughput of generative Large Language Models (LLMs). However, the memory footprint of the KV cache poses a critical bottleneck in LLM deployment, as the cache size grows with batch size and sequence length, often surpassing even the size of the model itself. Although recent methods have been proposed to select and evict unimportant KV pairs from the cache to reduce memory consumption, the potential ramifications of eviction on the generative process have yet to be thoroughly examined. In this paper, we examine the detrimental impact of cache eviction and observe that unforeseen risks arise when the information contained in the KV pairs is exhaustively discarded, resulting in safety breaches, hallucinations, and context loss. Surprisingly, we find that preserving even a small amount of the information contained in the evicted KV pairs via reduced-precision quantization substantially recovers the incurred degradation. On the other hand, we observe that the important KV pairs must be kept at relatively higher precision to safeguard generation quality. Motivated by these observations, we propose Mixed-precision KV cache (MiKV), a reliable cache compression method that simultaneously preserves context details by retaining the evicted KV pairs in low precision and ensures generation quality by keeping the important KV pairs in high precision. Experiments on diverse benchmarks and LLM backbones show that our proposed method offers a state-of-the-art trade-off between compression ratio and performance compared to other baselines.
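
The following is a minimal, illustrative sketch of the mixed-precision idea described in the abstract: the most important KV pairs are kept at high precision, while the KV pairs that would otherwise be evicted are retained at low precision instead of being discarded. The function names, the uniform per-token quantizer, and the use of precomputed per-token importance scores (e.g., accumulated attention weights) are assumptions for this sketch, not the authors' implementation.

```python
# Sketch of importance-aware mixed-precision KV cache quantization.
# Assumptions (not from the paper): uniform asymmetric quantization,
# importance scores supplied by the caller.
import torch

def quantize(x: torch.Tensor, n_bits: int):
    """Uniform per-token asymmetric quantization to n_bits.
    Returns integer codes plus scale/zero-point for dequantization."""
    qmax = 2 ** n_bits - 1
    xmin = x.amin(dim=-1, keepdim=True)
    xmax = x.amax(dim=-1, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-8) / qmax
    codes = ((x - xmin) / scale).round().clamp(0, qmax)
    return codes, scale, xmin

def dequantize(codes, scale, xmin):
    return codes * scale + xmin

def mixed_precision_kv(keys, values, importance, keep_ratio=0.25,
                       high_bits=8, low_bits=2):
    """Keep the top-`keep_ratio` most important KV pairs in high precision;
    instead of evicting the rest, retain them in low precision.

    keys, values: [seq_len, head_dim]
    importance:   [seq_len] per-token importance scores
                  (assumed here to come from accumulated attention weights)
    """
    seq_len = keys.shape[0]
    n_keep = max(1, int(seq_len * keep_ratio))
    important = torch.zeros(seq_len, dtype=torch.bool)
    important[importance.topk(n_keep).indices] = True

    out_k, out_v = torch.empty_like(keys), torch.empty_like(values)
    for mask, bits in ((important, high_bits), (~important, low_bits)):
        for src, dst in ((keys, out_k), (values, out_v)):
            codes, scale, xmin = quantize(src[mask], bits)
            dst[mask] = dequantize(codes, scale, xmin)
    return out_k, out_v
```

In this sketch, the low-precision branch is what replaces outright eviction: even at 2 bits, the retained codes preserve coarse context details, while the high-precision branch protects the KV pairs judged most important for generation quality.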