1. MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training (arXiv)
Author : Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Mark Lee, et al. (5 additional authors not shown)
Abstract : In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision-language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we show that for large-scale multimodal pre-training, using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder, together with image resolution and the image token count, has a substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models of up to 30B parameters, consisting of both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning and multi-image reasoning, enabling few-shot chain-of-thought prompting.
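The abstract's central recipe point is mixing several pre-training data sources (image-caption pairs, interleaved image-text documents, text-only corpora) in fixed proportions. Below is a minimal Python sketch of such weighted source mixing; the iterator names and the mixture weights are illustrative placeholders, not the ratios or implementation reported in the paper.

```python
import itertools
import random
from typing import Dict, Iterator, List


def mixed_pretraining_stream(
    sources: Dict[str, Iterator[dict]],
    weights: Dict[str, float],
    seed: int = 0,
) -> Iterator[dict]:
    """Yield examples by sampling a data source per step in proportion to its weight.

    This mimics the general idea of mixing image-caption, interleaved
    image-text, and text-only data during multimodal pre-training.
    """
    rng = random.Random(seed)
    names: List[str] = list(sources.keys())
    probs = [weights[n] for n in names]
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        yield {"source": name, "example": next(sources[name])}


if __name__ == "__main__":
    # Dummy infinite iterators stand in for real dataset readers.
    dummy_sources = {
        "image_caption": itertools.count(),
        "interleaved_image_text": itertools.count(),
        "text_only": itertools.count(),
    }
    # Placeholder mixture weights for illustration only.
    dummy_weights = {"image_caption": 0.45, "interleaved_image_text": 0.45, "text_only": 0.10}

    stream = mixed_pretraining_stream(dummy_sources, dummy_weights)
    for _ in range(5):
        print(next(stream))
```

In practice, each source would be a shuffled shard reader and the weights would come from ablations like those the paper describes; the sketch only shows the sampling mechanism.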
2. Taming Pre-trained LLMs for Generalised Time Series Forecasting via Cross-modal Knowledge Distillation (arXiv)
Author : Peiyuan Liu, Hang Guo, Tao Dai, Naiqi Li, Jigang Bao, Xudong Ren, Yong Jiang, Shu-Tao Xia
Abstract :