No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization
Authors: June Yong Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho Park, Eunho Yang, Se Jung Kwon, Dongsoo Lee
Abstract: Key-Value (KV) caching has become an essential technique for accelerating the inference speed and throughput of generative Large Language Models (LLMs). However, the memory footprint of the KV cache poses a critical bottleneck in LLM deployment, as the cache size grows with batch size and sequence length, often surpassing even the size of the model itself. Although recent methods have been proposed to select and evict unimportant KV pairs from the cache to reduce memory consumption, the potential ramifications of eviction on the generative process have yet to be thoroughly examined. In this paper, we examine the detrimental impact of cache eviction and observe that unforeseen risks arise when the information contained in the KV pairs is exhaustively discarded, resulting in safety breaches, hallucinations, and context loss. Surprisingly, we find that preserving even a small amount of the information contained in the evicted KV pairs via reduced-precision quantization substantially recovers the incurred degradation. On the other hand, we observe that the important KV pairs must be kept at relatively higher precision to safeguard generation quality. Motivated by these observations, we propose Mixed-precision KV cache (MiKV), a reliable cache compression method that simultaneously preserves context details by retaining the evicted KV pairs in low precision and ensures generation quality by keeping the important KV pairs in high precision. Experiments on diverse benchmarks and LLM backbones show that our proposed method offers a state-of-the-art trade-off between compression ratio and performance compared to other baselines.
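
The following is a minimal, illustrative sketch of the mixed-precision idea described in the abstract: the most important KV pairs are kept at high precision, while the KV pairs that would otherwise be evicted are retained at low precision instead of being discarded. The function names, the uniform per-token quantizer, and the use of precomputed per-token importance scores (e.g., accumulated attention weights) are assumptions for this sketch, not the authors' implementation.

```python
# Sketch of importance-aware mixed-precision KV cache quantization.
# Assumptions (not from the paper): uniform asymmetric quantization,
# importance scores supplied by the caller.
import torch

def quantize(x: torch.Tensor, n_bits: int):
    """Uniform per-token asymmetric quantization to n_bits.
    Returns integer codes plus scale/zero-point for dequantization."""
    qmax = 2 ** n_bits - 1
    xmin = x.amin(dim=-1, keepdim=True)
    xmax = x.amax(dim=-1, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-8) / qmax
    codes = ((x - xmin) / scale).round().clamp(0, qmax)
    return codes, scale, xmin

def dequantize(codes, scale, xmin):
    return codes * scale + xmin

def mixed_precision_kv(keys, values, importance, keep_ratio=0.25,
                       high_bits=8, low_bits=2):
    """Keep the top-`keep_ratio` most important KV pairs in high precision;
    instead of evicting the rest, retain them in low precision.

    keys, values: [seq_len, head_dim]
    importance:   [seq_len] per-token importance scores
                  (assumed here to come from accumulated attention weights)
    """
    seq_len = keys.shape[0]
    n_keep = max(1, int(seq_len * keep_ratio))
    important = torch.zeros(seq_len, dtype=torch.bool)
    important[importance.topk(n_keep).indices] = True

    out_k, out_v = torch.empty_like(keys), torch.empty_like(values)
    for mask, bits in ((important, high_bits), (~important, low_bits)):
        for src, dst in ((keys, out_k), (values, out_v)):
            codes, scale, xmin = quantize(src[mask], bits)
            dst[mask] = dequantize(codes, scale, xmin)
    return out_k, out_v
```

In this sketch, the low-precision branch is what replaces outright eviction: even at 2 bits, the retained codes preserve coarse context details, while the high-precision branch protects the KV pairs judged most important for generation quality.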