Diffusion Model (11): A Technical Summary of InstantID

1 Motivation

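There are currently two main families of diffusion-based customized generation: methods that fine-tune before inference, and methods that run inference without fine-tuning. Fine-tuning methods must train the model again for every new concept, which is cumbersome, but they give better results; representative works include DreamBooth, LoRA, and textual inversion. Tuning-free methods first train a robust object-embedding extractor on a large amount of data, after which no further training is needed at inference time; representative works include AnyDoor and IP-Adapter.

Customized face generation usually requires more fine-grained feature extraction, and existing tuning-free methods do not handle it well. InstantID is a plug-and-play model for customized face generation (pluggability): given a single face photo, it generates images in a specified style and pose. It has a low up-front training cost (compatibility), needs no fine-tuning at inference time (tuning-free), and generates high-fidelity images (superior performance). It strikes a good balance among fidelity, efficiency, and flexibility.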

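Category	Main idea	Representative methods	Advantages	Disadvantages
Fine-tune before inference	Collect a few images of the concept to be customized and fine-tune the diffusion model on them so that it learns the new concept	`DreamBooth`, `LoRA`, `textual inversion`	Good results	Every new concept requires retraining; resource-intensive and time-consuming
Inference without fine-tune	Train a fairly general object-level embedding extractor in advance. The embedding can be injected in several ways: (1) replacing the text embedding in cross-attention (`AnyDoor`); (2) concatenating it with the text embedding; (3) adding a separate image cross-attention to the diffusion model for the image features (`IP-Adapter`)	`AnyDoor`, `BLIP-Diffusion`, `IP-Adapter`	Convenient inference	(1) Results are usually worse than the first category; (2) the up-front training cost is high; (3) some architectures are not plug-and-play, so they cannot be used with the many open-source community models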

2 Method

Given only one reference ID image, InstantID aims to generate customized images with various poses or styles while ensuring high fidelity.

To this end, InstantID is built from three modules:

ID embedding: extracts the facial features of the reference image.

Adapted module: injects the facial features into the diffusion model.

IdentityNet: injects the spatial information of the face into the diffusion model.

2.1 ID Embedding

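Most earlier work of this kind uses the CLIP image encoder or DINOv2 as the ID-embedding extractor: IP-Adapter [1], PhotoMaker [2], and FaceStudio [3] use the CLIP image encoder, and AnyDoor [4] uses DINOv2. The authors argue that CLIP was trained mostly on weakly aligned data, so the features from its image encoder capture broad, ambiguous semantics and are too coarse for identity-level features. They therefore extract the ID embedding with a pretrained model from the face domain. Once the embedding is extracted, the next task is to inject it into the diffusion model.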

2.2 Adapted module

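In text-to-image generation, the text is converted to an embedding and injected into the diffusion model's cross-attention layers to achieve text-conditioned, classifier-free image generation. For customized generation there is, in addition to the text embedding, an image embedding from the reference image, and both embeddings must be injected into the diffusion model (if text control is not needed, injecting only the image embedding is enough). Following IP-Adapter, the paper injects the image embedding and the text embedding separately through decoupled cross-attention.

Here is a brief introduction to decoupled cross-attention.

Compared with the cross-attention used in text-to-image generation, decoupled cross-attention adds two trainable parameters, the key and value projection matrices of the new image branch, which are initialized from the corresponding matrices of the text branch. There is one such pair for each cross-attention layer. Note that the UNet parameters are frozen while the adapted module is trained; only the newly added parameters are updated.

The influence of the image condition can be controlled by adjusting the weight applied to the image embedding.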
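For reference, decoupled cross-attention can be written as below. The notation is an assumption that follows IP-Adapter [1], since the formula itself is not reproduced in this note: $Z$ is the query feature from the UNet, $c_t$ and $c_i$ are the text and image embeddings, the superscripts $t$ and $i$ mark the text and image branches, $\lambda$ is the image weight, and the layer index is omitted.

$$
Z_{new} = \mathrm{Softmax}\!\left(\frac{Q (K^{t})^{\top}}{\sqrt{d}}\right) V^{t} + \lambda \cdot \mathrm{Softmax}\!\left(\frac{Q (K^{i})^{\top}}{\sqrt{d}}\right) V^{i}
$$

$$
Q = Z W_q, \quad K^{t} = c_t W_k^{t}, \quad V^{t} = c_t W_v^{t}, \quad K^{i} = c_i W_k^{i}, \quad V^{i} = c_i W_v^{i}
$$

Under this notation, only $W_k^{i}$ and $W_v^{i}$ are trained, and setting $\lambda = 0$ recovers the original text-only cross-attention.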
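The mechanism can be illustrated with a small NumPy sketch. This is a toy, single-head version under stated assumptions, not the InstantID or IP-Adapter implementation: the function names, the argument `scale` (the image weight), and all tensor sizes are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention for a single head.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def decoupled_cross_attention(z, c_text, c_image,
                              W_q, W_k, W_v, W_k_img, W_v_img, scale=1.0):
    # The query is shared; text and image each get their own key/value projections.
    q = z @ W_q
    text_out = attention(q, c_text @ W_k, c_text @ W_v)
    image_out = attention(q, c_image @ W_k_img, c_image @ W_v_img)
    # `scale` controls how strongly the image condition influences the output.
    return text_out + scale * image_out

rng = np.random.default_rng(0)
d_model, d_ctx = 8, 6                    # hypothetical toy sizes
z = rng.normal(size=(4, d_model))        # 4 latent (query) tokens
c_text = rng.normal(size=(5, d_ctx))     # 5 text tokens
c_image = rng.normal(size=(3, d_ctx))    # 3 image (ID) tokens

W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_ctx, d_model))
W_v = rng.normal(size=(d_ctx, d_model))
# The new image-branch projections start as copies of the text-branch ones.
W_k_img, W_v_img = W_k.copy(), W_v.copy()

out = decoupled_cross_attention(z, c_text, c_image,
                                W_q, W_k, W_v, W_k_img, W_v_img, scale=0.8)
text_only = decoupled_cross_attention(z, c_text, c_image,
                                      W_q, W_k, W_v, W_k_img, W_v_img, scale=0.0)
```

With `scale=0.0` the image branch contributes nothing and the layer behaves like ordinary text cross-attention. In training, only `W_k_img` and `W_v_img` would receive gradients.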

2.3 IdentityNet

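The ID embedding and the adapted module together allow the diffusion model to generate images of a specific identity. The core purpose of IdentityNet is to add spatial control to the diffusion model, compensating for the text-editing ability that is otherwise degraded. The authors implement IdentityNet following the ControlNet design, with two differences:

1) IdentityNet takes only five facial keypoints as input (two for the eyes, one for the nose, two for the mouth). The authors see two benefits: faces usually occupy a small region of the image, so detecting only five keypoints is easier; and it avoids imposing spatial control so strong that it harms editability, which is a trade-off between diversity and fidelity.

2) IdentityNet is conditioned on the ID embedding, whereas the original ControlNet is conditioned on the text embedding. The motivation is that IdentityNet exists to provide spatial control over the face, and conditioning on the ID embedding makes the model more "sensitive" to the face region.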
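To make the five-keypoint input concrete, here is a minimal NumPy sketch that rasterizes five keypoints into an image-shaped conditioning map, the kind of spatial input a ControlNet-style branch consumes. It is a simplified stand-in rather than the official drawing code: the function name, the colors, the keypoint order, and the coordinates are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical colors, one per keypoint, in the assumed order:
# left eye, right eye, nose, left mouth corner, right mouth corner.
KPS_COLORS = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0), (255, 0, 255)]

def draw_keypoint_map(kps, height, width, radius=4):
    """Render each (x, y) keypoint as a filled colored disk on a black canvas."""
    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    ys, xs = np.mgrid[0:height, 0:width]
    for (x, y), color in zip(kps, KPS_COLORS):
        mask = (xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2
        canvas[mask] = color
    return canvas

# Made-up keypoints for a 64x64 image, given as (x, y) pixel coordinates.
kps = np.array([[24, 26], [40, 26], [32, 36], [26, 46], [38, 46]], dtype=float)
cond = draw_keypoint_map(kps, height=64, width=64)
```

Because the map carries only five sparse points, it fixes where the face sits and how it is oriented without dictating expression or fine geometry, which matches the weak-spatial-control motivation described above.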

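The figure below shows generation results for different poses.

2.4 Model Training

Base models	`SDXL-1.0`, `antelopev2` [5] (for extracting the face embedding)
Datasets	LAION-Face (50M), plus 10M high-quality crawled images (captioned with BLIP2)
Training cost	48 H800 GPUs, batch size 2 per GPU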

3 Result

The figure below compares InstantID with earlier work.

4 Summary

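Overall, InstantID does not propose a strikingly novel idea, but it applies well-suited solutions to the specific characteristics of customized face generation. Its plug-and-play design also makes InstantID much more fun to work with, since it can be combined with the community's rich pool of LoRA weights, textual embeddings, and other resources. Most importantly, the authors open-sourced a model that took substantial compute to train.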

Reference

[1] IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

[2] PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding

[3] FaceStudio: Put Your Face Everywhere in Seconds

[4] AnyDoor: Zero-shot Object-level Image Customization

[5] https://github.com/deepinsight/insightface

paper	InstantID: Zero-shot Identity-Preserving Generation in Seconds
code	https://github.com/InstantID/InstantID
org	InstantX Team
demo	https://huggingface.co/spaces/InstantX/InstantID