LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression
Authors: Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, Alan Yuille
Abstract: While significant advances have been made in compressed representations for text embeddings in large language models (LLMs), the compression of visual tokens in large multi-modal models (LMMs) has remained a largely overlooked area. In this work, we present a study on the analysis of redundancy in visual tokens and on efficient training within these models. Our initial experiments show that eliminating up to 70% of visual tokens at the testing stage by simply average pooling leads to only a minimal 3% reduction in visual question answering accuracy on the GQA benchmark, indicating significant redundancy in visual context. Addressing this, we introduce Visual Context Compressor, which reduces the number of visual tokens during training to enhance training efficiency without sacrificing performance. To minimize information loss caused by the compression of visual tokens while maintaining training efficiency, we develop LLaVolta as a lite training scheme. LLaVolta incorporates stage-wise visual context compression to progressively compress the visual tokens from heavily to lightly, and finally applies no compression at the end of training, yielding no loss of information at test time. Extensive experiments demonstrate that our approach enhances the performance of MLLMs in both image-language and video-language understanding, while also significantly cutting training costs. Code is available at https://github.com/Beckschen/LLaVolta
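Below is a minimal sketch of the average-pooling token compression described in the abstract, assuming visual tokens arrive as a tensor of shape (batch, num_tokens, dim); the class name, stride, and token counts are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VisualContextCompressor(nn.Module):
    """Illustrative sketch: compress visual tokens with 1D average pooling.

    Assumes visual tokens of shape (batch, num_tokens, dim). The name and
    stride here are hypothetical, not taken from the LLaVolta codebase.
    """
    def __init__(self, stride: int = 4):
        super().__init__()
        # stride=4 keeps ~25% of tokens, in the spirit of the ~70%
        # token reduction reported in the abstract.
        self.pool = nn.AvgPool1d(kernel_size=stride, stride=stride)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # (B, N, D) -> (B, D, N) so pooling runs over the token axis.
        x = visual_tokens.transpose(1, 2)
        x = self.pool(x)
        return x.transpose(1, 2)  # (B, N // stride, D)

# Usage: 576 visual tokens (a 24x24 ViT patch grid) -> 144 tokens.
tokens = torch.randn(2, 576, 1024)
compressed = VisualContextCompressor(stride=4)(tokens)
print(compressed.shape)  # torch.Size([2, 144, 1024])
```

Under the stage-wise scheme the compression ratio would presumably be relaxed over successive training stages (heavy, then light, then none), so that the final model consumes uncompressed visual tokens at test time.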