MM1 Technical Summary (MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training)

TL;DR

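Key takeaways for building an effective multimodal model:

The proportions of the different data types need careful design. The mix covers interleaved image-text data, text-only data, and image-caption data. The authors recommend interleaved : caption : text-only = 45% : 45% : 10%.

The image encoder, the image resolution, and the number of image tokens all matter a great deal for the results.

The design of the vision-language connector matters comparatively little for a performant multimodal model.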

Motivation

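Through extensive experiments, the authors analyze how to build an effective multimodal model. They mainly evaluate the importance of the following components:

The image encoder, including image resolution and image token length.

The form of the vision-language connector.

The composition of the pre-training data.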

Method

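Starting from an existing architecture, the authors test different combinations of these components and compare the results.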

Result

Models are evaluated in zero-shot and few-shot settings on several VQA and captioning datasets: COCO Captioning, NoCaps, TextCaps, VQAv2, TextVQA, VizWiz, GQA, and OK-VQA.

Image encoder ablation

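The authors study design guidelines for the image encoder along two axes: training objective and resolution.

The ablation results show:

Image resolution during pre-training has the largest impact on results, followed by model size, and lastly the addition of synthetic caption data (VeCap).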
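Resolution and image token length are tied together, which a short calculation makes concrete. The sketch below assumes a standard ViT that splits a square image into non-overlapping patches and emits one token per patch; the patch size of 14 is an assumption for illustration, not a value stated in this note.

```python
def num_image_tokens(resolution: int, patch_size: int = 14) -> int:
    """Number of patch tokens a ViT emits for a square image (CLS token not counted)."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be divisible by patch_size")
    side = resolution // patch_size  # patches along one edge
    return side * side


# Token count grows quadratically with resolution.
for res in (224, 336, 378):
    print(res, num_image_tokens(res))
```

Under this assumption, moving from 224 to 378 pixels roughly triples the number of tokens the connector and the LLM have to handle, which is why resolution and token count are studied together.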

Vision-language connector ablation

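The authors experiment with three types of connector:

Average pooling. Apply n×n average pooling, followed by a fully connected (linear) layer. This follows "Generative Multimodal Models are In-Context Learners".

Attention pooling. Use k learnable queries as the representation of the image (similar to the Q-Former in BLIP-2). The paper does not say in detail which prior work's method it uses.

Convolutional mapping. Main idea: feed the features extracted by the ViT into a CNN, then apply adaptive pooling to obtain the desired number of tokens. This follows "Honeybee: Locality-enhanced Projector for Multimodal LLM".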

From these experiments, the authors conclude:

The number of visual tokens and the image resolution matter more than the type of connector.
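To make the simplest of the three connectors concrete, here is a minimal pure-Python sketch of the n×n average pooling step. It is an illustration only: the grid size, the toy 4-dimensional features, and the function name are assumptions, and the linear projection that follows pooling is left out.

```python
from typing import List

Vector = List[float]


def average_pool_tokens(grid: List[List[Vector]], n: int) -> List[Vector]:
    """Apply n x n average pooling to a square grid of patch features.

    grid[r][c] is the feature vector of the patch at row r, column c.
    Returns the pooled tokens flattened in row-major order.
    """
    side = len(grid)
    if side % n != 0:
        raise ValueError("grid side must be divisible by n")
    dim = len(grid[0][0])
    pooled = []
    for r0 in range(0, side, n):
        for c0 in range(0, side, n):
            acc = [0.0] * dim
            # Sum the n x n window, then divide by its size.
            for r in range(r0, r0 + n):
                for c in range(c0, c0 + n):
                    for d in range(dim):
                        acc[d] += grid[r][c][d]
            pooled.append([v / (n * n) for v in acc])
    return pooled


# A 24 x 24 patch grid with toy 4-dim features (hypothetical sizes).
grid = [[[float(r), float(c), 1.0, 0.0] for c in range(24)] for r in range(24)]
tokens = average_pool_tokens(grid, n=2)
print(len(tokens))  # 144
print(tokens[0])    # [0.5, 0.5, 1.0, 0.0]
```

The window size n is what sets the visual token count here: a 24×24 grid pooled with n=2 gives 144 tokens, and n=3 would give 64.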

Pre-training data ablation

Two kinds of data are typically used to train MLLMs:

Captioning data, made up of image-text pairs.

Interleaved image-text documents from the web, i.e. documents that contain both images and text.

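The ablations show:

Interleaved image-text data effectively improves few-shot and text-only performance, but degrades zero-shot performance to some extent. Caption data improves zero-shot performance.

Text-only data improves performance on text-only tasks and slightly improves few-shot performance.

Synthetic data gives some help on few-shot performance.

The best results come from tuning the proportions of text-only, caption, interleaved image-text, and synthetic data. The ratio the authors use is interleaved : caption : text-only = 45% : 45% : 10%.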
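One simple way to realize such a mixture is to draw the source of each training example at random with the stated weights. The sketch below assumes per-example sampling; whether the paper mixes per example or per batch is not covered in this note, and the names `MIXTURE` and `sample_sources` are hypothetical.

```python
import random
from collections import Counter

# Mixture weights quoted above: interleaved : caption : text-only = 45 : 45 : 10.
MIXTURE = {"interleaved": 0.45, "caption": 0.45, "text_only": 0.10}


def sample_sources(num_examples: int, seed: int = 0) -> list:
    """Draw the data source for each training example according to MIXTURE."""
    rng = random.Random(seed)  # local RNG so the draw is reproducible
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=num_examples)


counts = Counter(sample_sources(10_000))
for name in MIXTURE:
    print(name, counts[name] / 10_000)
```

Over many draws the empirical fractions approach the target ratios, so changing the mixture for a new ablation only requires editing the weights.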

TextCore tasks include ARC, PIQA, LAMBADA, WinoGrande, HellaSwag, SciQ, TriviaQA, and WebQS.

Recipe

Image encoder: ViT-H at 378×378 resolution, pre-trained with a CLIP objective on DFN-5B.

VL connector: C-Abstractor with 144 image tokens.

Data: interleaved : caption : text-only = 45% : 45% : 10%.

Performance of the pre-trained multimodal model

Performance after SFT

The -chat suffix indicates a model that has gone through SFT.

The following tricks are also used:

To support multiple resolutions, positional embedding interpolation is used.

To reduce the self-attention cost of high-resolution images, sub-image decomposition is used.
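Both techniques can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the paper's implementation: the interpolation is plain bilinear with aligned corners (real code would more likely call a tensor library's resampling routine, often bicubic), and the 1344-pixel image, 672-pixel tile, and patch size of 14 are example values chosen for the demonstration.

```python
from typing import List, Tuple

Grid = List[List[List[float]]]   # position embeddings indexed [row][col][dim]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def interpolate_pos_embed(pos: Grid, new_side: int) -> Grid:
    """Bilinearly resize a square grid of position embeddings to new_side x new_side."""
    old_side = len(pos)
    dim = len(pos[0][0])

    def src_coord(i: int) -> float:
        # Map a target index to a source coordinate so that corners line up.
        return 0.0 if new_side == 1 else i * (old_side - 1) / (new_side - 1)

    out = []
    for i in range(new_side):
        y = src_coord(i)
        y0 = int(y)
        y1 = min(y0 + 1, old_side - 1)
        wy = y - y0
        row = []
        for j in range(new_side):
            x = src_coord(j)
            x0 = int(x)
            x1 = min(x0 + 1, old_side - 1)
            wx = x - x0
            vec = []
            for d in range(dim):
                top = (1 - wx) * pos[y0][x0][d] + wx * pos[y0][x1][d]
                bottom = (1 - wx) * pos[y1][x0][d] + wx * pos[y1][x1][d]
                vec.append((1 - wy) * top + wy * bottom)
            row.append(vec)
        out.append(row)
    return out


def sub_image_boxes(width: int, height: int, tile: int) -> List[Box]:
    """Crop boxes that cover a width x height image with tile x tile sub-images."""
    if width % tile or height % tile:
        raise ValueError("image size must be a multiple of the tile size")
    return [
        (left, top, left + tile, top + tile)
        for top in range(0, height, tile)
        for left in range(0, width, tile)
    ]


# A 2 x 2 grid of 1-dim embeddings resized to 3 x 3: the new centre is the mean.
resized = interpolate_pos_embed([[[0.0], [1.0]], [[2.0], [3.0]]], 3)
print(resized[1][1])  # [1.5]

# Self-attention scores every token pair, so cost is quadratic in sequence length.
boxes = sub_image_boxes(1344, 1344, tile=672)
tokens_per_tile = (672 // 14) ** 2
whole = (len(boxes) * tokens_per_tile) ** 2  # one pass over the full image
tiled = len(boxes) * tokens_per_tile ** 2    # each tile encoded independently
print(len(boxes), whole // tiled)  # 4 4
```

Encoding the four tiles independently scores a quarter as many token pairs as a single pass over the full image. A downscaled copy of the whole image can be encoded alongside the tiles to keep global context; that step is omitted here.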