InstantID: Zero-shot Identity-Preserving Generation in Seconds

AI-powered image generation technology has witnessed remarkable growth in the past few years, ever since large text to image diffusion models like DALL-E, GLIDE, Stable Diffusion, Imagen, and more burst onto the scene. Although these models differ in architecture and training methods, they share a common focal point: customized and personalized image generation that aims to create images with a consistent character ID, subject, and style on the basis of reference images. Owing to their remarkable generative capabilities, modern image generation AI frameworks have found applications in fields including image animation, virtual reality, E-Commerce, AI portraits, and more. However, these frameworks also share a common hurdle: a majority of them are unable to generate customized images while preserving the delicate identity details of human subjects.

Generating customized images while preserving intricate details is of critical importance, especially in human facial identity tasks that demand a high standard of fidelity, detail, and nuanced semantics compared to general object image generation tasks that concentrate primarily on coarse-grained textures and colors. Personalized image synthesis frameworks such as LoRA, DreamBooth, and Textual Inversion have advanced significantly in recent years. However, these models are still not ideal for deployment in real-world scenarios since they have high storage requirements, require multiple reference images, and often involve a lengthy fine-tuning process. On the other hand, although existing ID-embedding based methods require only a single forward pass, they either lack compatibility with publicly available pre-trained models, require excessive fine-tuning across numerous parameters, or fail to maintain high face fidelity.

To address these challenges, and to further enhance image generation capabilities, in this article we will be talking about InstantID, a diffusion model based solution for image generation. InstantID is a plug-and-play module that handles image generation and personalization adeptly across various styles with just a single reference image while ensuring high fidelity. The primary aim of this article is to provide our readers with a thorough understanding of the technical underpinnings and components of the InstantID framework, as we will take a detailed look at the model's architecture, training process, and application scenarios. So let's get started.


The emergence of text to image diffusion models has contributed significantly to the advancement of image generation technology. A primary aim of these models is customized and personalized generation: creating images with a consistent subject, style, and character ID using one or more reference images. The ability of these frameworks to create consistent images has opened up potential applications across different industries including image animation, AI portrait generation, E-Commerce, virtual and augmented reality, and much more.

However, despite their remarkable abilities, these frameworks face a fundamental challenge: they often struggle to generate customized images that accurately preserve the intricate details of human subjects. Generating customized images with such intricate details is a challenging task since human facial identity requires a higher degree of fidelity and detail, along with more advanced semantics, when compared to general objects or styles that focus primarily on colors or coarse-grained textures. Existing text to image models depend on detailed textual descriptions and struggle to achieve strong semantic relevance for customized image generation. Furthermore, some large pre-trained text to image frameworks add spatial conditioning controls to enhance controllability, facilitating fine-grained structural control using elements like body poses, depth maps, user-drawn sketches, semantic segmentation maps, and more. However, despite these additions and enhancements, these frameworks achieve only partial fidelity of the generated image to the reference image.

To overcome these hurdles, the InstantID framework focuses on instant identity-preserving image synthesis and attempts to bridge the gap between efficiency and high fidelity by introducing a simple plug-and-play module that allows the framework to handle image personalization using only a single facial image while maintaining high fidelity. Furthermore, to preserve the facial identity from the reference image, the InstantID framework implements a novel face encoder that retains the intricate image details by imposing weak spatial and strong semantic conditions that guide the image generation process, incorporating the facial image, a landmark image, and textual prompts.

There are three distinguishing features that separate the InstantID framework from existing text to image generation frameworks. 

  • Compatibility and Pluggability: Instead of training the full parameters of the UNet, the InstantID framework focuses on training a lightweight adapter. As a result, the InstantID framework is compatible and pluggable with existing pre-trained models. 
  • Tuning-Free: The methodology of the InstantID framework eliminates the requirement for fine-tuning, since it needs only a single forward propagation for inference, making the model highly practical and economical for real-world use. 
  • Superior Performance: The InstantID framework demonstrates high flexibility and fidelity since it is able to deliver state of the art performance using only a single reference image, comparable to training based methods that rely on multiple reference images. 

Overall, the contributions of the InstantID framework can be categorized in the following points. 

  1. The InstantID framework is an innovative, ID-preserving adaptation method for pre-trained text to image diffusion models, aiming to bridge the gap between efficiency and fidelity. 
  2. The InstantID framework is compatible and pluggable with custom models fine-tuned from the same base diffusion model, allowing ID preservation in pre-trained models without any additional cost. 

InstantID: Methodology and Architecture

As mentioned earlier, the InstantID framework is an efficient lightweight adapter that endows pre-trained text to image diffusion models with ID preservation capabilities effortlessly. 

Talking about the architecture, the InstantID framework is built on top of the Stable Diffusion model, renowned for its ability to perform the diffusion process with high computational efficiency in a low-dimensional latent space instead of pixel space, using a variational autoencoder. For an input image, the encoder first maps the image to a latent representation at a reduced spatial resolution (Stable Diffusion uses a downsampling factor of 8) with a small number of latent channels. The diffusion process then adopts a denoising UNet to predict the normally distributed noise that was added to the noisy latent, conditioned on the current timestep and a condition embedding. This condition is an embedding of the textual prompts generated by a pre-trained CLIP text encoder. 
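
To make the latent-space formulation concrete, here is a minimal sketch using the diffusers library and the public runwayml/stable-diffusion-v1-5 checkpoint (both assumptions for illustration, not InstantID's own code) of how an image is mapped into this low-dimensional latent space.

```python
import numpy as np
import torch
from PIL import Image
from diffusers import AutoencoderKL

# Load only the VAE of Stable Diffusion; the diffusion process runs in its latent space.
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

# Scale pixel values to [-1, 1], the range the autoencoder expects.
image = Image.open("face.jpg").convert("RGB").resize((512, 512))
x = torch.from_numpy(np.array(image)).float() / 127.5 - 1.0
x = x.permute(2, 0, 1).unsqueeze(0)  # (1, 3, 512, 512)

with torch.no_grad():
    latents = vae.encode(x).latent_dist.sample() * vae.config.scaling_factor

# Downsampling factor of 8: a 512x512 image becomes a 64x64 latent grid with 4 channels.
print(latents.shape)  # torch.Size([1, 4, 64, 64])
```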

Furthermore, the InstantID framework also utilizes a ControlNet component, which is capable of adding spatial controls to a pre-trained diffusion model as conditions, extending well beyond the traditional capabilities of textual prompts. The ControlNet component integrates the UNet architecture from the Stable Diffusion framework using a trainable replica of the UNet that features zero convolution layers within the encoder blocks and the middle block. Despite these similarities, the ControlNet component differs from the base Stable Diffusion model in the residual term it contributes: it encodes spatial condition information like poses, depth maps, sketches, and more with its trainable copy, and adds the resulting residuals to the corresponding UNet blocks of the frozen original network. 
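
As a hedged illustration of this mechanism (not InstantID's own weights), the diffusers library exposes ControlNet directly; the sketch below uses the public lllyasviel/sd-controlnet-openpose checkpoint as an assumed example of spatial conditioning with a pose image.

```python
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

# A ControlNet trained for pose conditioning, attached to a frozen Stable Diffusion model.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

pose_image = load_image("pose.png")  # spatial condition: a rendered pose skeleton

# The ControlNet copy processes the pose image; its residuals are added to the
# corresponding UNet blocks at every denoising step.
image = pipe(
    "a portrait photo of a person", image=pose_image, num_inference_steps=30
).images[0]
image.save("controlled.png")
```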

The InstantID framework also draws inspiration from IP-Adapter (Image Prompt Adapter), which introduces a novel approach to achieve image prompting capabilities that run in parallel with textual prompts, without requiring any modification of the original text to image model. The IP-Adapter component employs a decoupled cross-attention strategy that uses additional cross-attention layers to embed the image features while leaving the other parameters unchanged. 
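
The following toy PyTorch sketch (an illustration under our own assumptions, not the released IP-Adapter code) shows the idea of decoupled cross-attention: the same query attends separately to text features and, through newly added key/value projections, to image features, and the two results are summed with a scaling factor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledCrossAttention(nn.Module):
    def __init__(self, dim, cond_dim, image_scale=1.0):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        # Original (frozen) text cross-attention projections.
        self.to_k_text = nn.Linear(cond_dim, dim)
        self.to_v_text = nn.Linear(cond_dim, dim)
        # Newly added, trainable projections for the image features.
        self.to_k_image = nn.Linear(cond_dim, dim)
        self.to_v_image = nn.Linear(cond_dim, dim)
        self.image_scale = image_scale

    def forward(self, hidden_states, text_features, image_features):
        q = self.to_q(hidden_states)
        text_out = F.scaled_dot_product_attention(
            q, self.to_k_text(text_features), self.to_v_text(text_features))
        image_out = F.scaled_dot_product_attention(
            q, self.to_k_image(image_features), self.to_v_image(image_features))
        # The image branch is added on top of the unchanged text branch.
        return text_out + self.image_scale * image_out

attn = DecoupledCrossAttention(dim=320, cond_dim=768)
out = attn(torch.randn(1, 4096, 320), torch.randn(1, 77, 768), torch.randn(1, 4, 768))
print(out.shape)  # torch.Size([1, 4096, 320])
```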

Methodology

To give you a brief overview, the InstantID framework aims to generate customized images with different styles or poses using only a single reference ID image with high fidelity. The following figure briefly provides an overview of the InstantID framework. 

As it can be observed, the InstantID framework has three essential components:

  1. An ID embedding component that captures robust semantic information of the facial features in the image. 
  2. A lightweight adapter module with a decoupled cross-attention component to facilitate the use of an image as a visual prompt. 
  3. An IdentityNet component that encodes the detailed features from the reference image using additional spatial control. 

ID Embedding

Unlike existing methods like FaceStudio, PhotoMaker, IP-Adapter, and more that rely on a pre-trained CLIP image encoder to extract visual prompts, the InstantID framework focuses on the enhanced fidelity and stronger semantic details needed for the ID preservation task. The inherent limitation of the CLIP encoder lies primarily in its training on weakly aligned data, meaning its encoded features primarily capture broad and ambiguous semantic information like colors, style, and composition. Although these features can act as a general supplement to text embeddings, they are not suitable for precise ID preservation tasks that place heavy emphasis on strong semantics and high fidelity. Furthermore, recent research in face representation, especially around facial recognition, has demonstrated the effectiveness of face representations in complex tasks including facial reconstruction and recognition. Building on this, the InstantID framework leverages a pre-trained face model to detect and extract face ID embeddings from the reference image, which then guide the image generation process. 
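
As a concrete example of extracting such an ID embedding, the sketch below uses the insightface library with the "antelopev2" model pack, a setup commonly paired with InstantID but assumed here for illustration rather than taken from the official code.

```python
import cv2
from insightface.app import FaceAnalysis

# Face detection, alignment, and recognition models bundled in one analyzer.
app = FaceAnalysis(name="antelopev2", providers=["CPUExecutionProvider"])
app.prepare(ctx_id=0, det_size=(640, 640))

img = cv2.imread("reference_face.jpg")
faces = app.get(img)  # detect faces and compute their identity embeddings

# Take the largest detected face and use its 512-d identity embedding as the
# strong semantic condition that guides generation.
face = max(faces, key=lambda f: (f.bbox[2] - f.bbox[0]) * (f.bbox[3] - f.bbox[1]))
id_embedding = face.normed_embedding
print(id_embedding.shape)  # (512,)
```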

Image Adapter

Image prompting in pre-trained text to image diffusion models significantly enhances text prompts, especially for content that cannot be described adequately by text alone. The InstantID framework adopts a strategy resembling the one used by the IP-Adapter model for image prompting, introducing a lightweight adapter module paired with a decoupled cross-attention component to support images as input prompts. However, in contrast to the coarsely aligned CLIP embeddings, the InstantID framework diverges by employing ID embeddings as the image prompts in an attempt to achieve a semantically richer and more nuanced prompt integration. 
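
A toy sketch of this idea follows: a single 512-d face ID embedding is projected into a short sequence of "image prompt" tokens that the decoupled cross-attention layers can consume in place of CLIP image features. The module structure and token count are assumptions chosen for illustration, not the released design.

```python
import torch
import torch.nn as nn

class IDProjection(nn.Module):
    def __init__(self, id_dim=512, cross_attention_dim=768, num_tokens=4):
        super().__init__()
        self.num_tokens = num_tokens
        # Map one identity vector to num_tokens prompt tokens via a small MLP.
        self.proj = nn.Sequential(
            nn.Linear(id_dim, cross_attention_dim * num_tokens),
            nn.GELU(),
            nn.Linear(cross_attention_dim * num_tokens, cross_attention_dim * num_tokens),
        )
        self.norm = nn.LayerNorm(cross_attention_dim)

    def forward(self, id_embedding):  # (batch, 512)
        tokens = self.proj(id_embedding).reshape(id_embedding.shape[0], self.num_tokens, -1)
        return self.norm(tokens)  # (batch, num_tokens, cross_attention_dim)

tokens = IDProjection()(torch.randn(1, 512))
print(tokens.shape)  # torch.Size([1, 4, 768])
```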

IdentityNet

Although existing methods are capable of integrating image prompts with text prompts, the InstantID framework argues that these methods only enhance coarse-grained features, with a level of integration that is insufficient for ID-preserving image generation. Furthermore, adding the image and text tokens directly in the cross-attention layers tends to weaken the control of the text tokens, and an attempt to increase the strength of the image tokens may impair the ability of the text tokens to edit the output. To counter these challenges, the InstantID framework opts for ControlNet, an alternative feature embedding method that utilizes spatial information as input for the controllable module, allowing it to maintain consistency with the UNet settings in the diffusion model. 

The InstantID framework makes two changes to the traditional ControlNet architecture. First, for the conditional input, the InstantID framework opts for 5 facial keypoints instead of fine-grained OpenPose facial keypoints. Second, the InstantID framework uses ID embeddings instead of text prompts as conditions for the cross-attention layers in the ControlNet architecture. 
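
To illustrate the first change, the sketch below turns the five facial keypoints returned by a face detector (eyes, nose tip, mouth corners) into a sparse condition image for the spatial-control branch. The drawing style and helper function are assumptions for illustration.

```python
import cv2
import numpy as np

def draw_keypoints(kps, height, width, radius=6):
    """Render five facial keypoints as dots on a black canvas to form the condition image."""
    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    for x, y in kps:
        cv2.circle(canvas, (int(x), int(y)), radius, (255, 255, 255), -1)
    return canvas

# `face.kps` from the insightface sketch above is a (5, 2) array of pixel coordinates;
# the values here are placeholders.
example_kps = np.array([[180, 210], [330, 210], [255, 300], [200, 380], [310, 380]])
condition_image = draw_keypoints(example_kps, height=512, width=512)
cv2.imwrite("keypoint_condition.png", condition_image)
```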

Training and Inference

During the training phase, the InstantID framework optimizes the parameters of the IdentityNet and the Image Adapter while freezing the parameters of the pre-trained diffusion model. The entire InstantID pipeline is trained on image-text pairs that feature human subjects, and employs a training objective similar to the one used in the Stable Diffusion framework, with task-specific image conditions. The highlight of the InstantID training method is the separation between the image and text cross-attention layers within the image prompt adapter, a choice that allows the InstantID framework to adjust the weights of these image conditions flexibly and independently, ensuring a more targeted and controlled training and inference process.
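
For reference, the objective takes the familiar denoising form; the notation below is ours (an assumption following the standard latent diffusion loss), with C denoting the text condition and C_id the identity conditions supplied by the ID embedding and the keypoint image.

```latex
\mathcal{L} \;=\; \mathbb{E}_{z_0,\; \epsilon \sim \mathcal{N}(0, I),\; t,\; C,\; C_{\mathrm{id}}}
\Big[ \big\lVert \epsilon \;-\; \epsilon_\theta\!\left(z_t,\; t,\; C,\; C_{\mathrm{id}}\right) \big\rVert_2^2 \Big]
```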

InstantID: Experiments and Results

The InstantID framework is implemented on top of Stable Diffusion and trained on LAION-Face, a large-scale open-source dataset consisting of over 50 million image-text pairs. Additionally, the InstantID framework collects over 10 million human images, with annotations generated automatically by the BLIP-2 model, to further enhance image generation quality. The InstantID framework focuses primarily on single-person images, employs a pre-trained face model to detect and extract face ID embeddings from the human images, and, instead of training on cropped face datasets, trains on the original human images. Furthermore, during training, the InstantID framework freezes the pre-trained text to image model and only updates the parameters of the IdentityNet and the Image Adapter. 
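
As a hedged sketch of this kind of automatic annotation (an assumed setup, not the authors' actual pipeline), BLIP-2 captioning of a human image can be done with the Hugging Face transformers library as follows.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Public BLIP-2 checkpoint used here purely as an example.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

image = Image.open("person.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)

# Generate a short caption to pair with the image as its text annotation.
caption_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(caption_ids, skip_special_tokens=True)[0].strip()
print(caption)
```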

Image Only Generation

The InstantID model can use an empty prompt to guide the image generation process using only the reference image; the results without prompts are demonstrated in the following image. 

As demonstrated in the above image, 'empty prompt' generation shows the ability of the InstantID framework to robustly maintain rich semantic facial features like identity, age, and expression. However, it is worth noting that using empty prompts may not replicate other semantics, such as gender, accurately. Furthermore, in the above image, columns 2 to 4 use an image and a prompt, and as can be seen, the generated image does not demonstrate any degradation in text control capabilities while also ensuring identity consistency. Finally, columns 5 to 9 use an image, a prompt, and spatial control, demonstrating the compatibility of the model with pre-trained spatial control models, which allows the InstantID model to flexibly introduce spatial controls using a pre-trained ControlNet component. 

It is also worth noting that the number of reference images has a significant impact on the generated image, as demonstrated in the above image. Although the InstantID framework is able to deliver good results using a single reference image, multiple reference images produce a better quality image, since the InstantID framework takes the mean of their ID embeddings as the image prompt, as sketched below.
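
Here is a tiny sketch of that averaging step: one identity embedding is extracted per reference image (for instance with the insightface setup shown earlier) and their mean is used as the image prompt. The re-normalization to unit length is our assumption, not a detail stated in the paper.

```python
import numpy as np

def average_id_embedding(embeddings):
    """Average per-image identity embeddings and re-normalize to unit length."""
    mean = np.mean(np.stack(embeddings, axis=0), axis=0)
    return mean / np.linalg.norm(mean)

# Each entry stands in for one reference image's 512-d identity embedding.
embeddings = [np.random.randn(512) for _ in range(4)]
prompt_embedding = average_id_embedding(embeddings)
print(prompt_embedding.shape)  # (512,)
```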

Moving along, it is essential to compare the InstantID framework with previous methods that generate personalized images using a single reference image. The following figure compares the results generated by the InstantID framework and existing state of the art models for single-reference customized image generation. As can be seen, the InstantID framework is able to preserve facial characteristics because the ID embedding inherently carries rich semantic information, such as identity, age, and gender. It would be safe to say that the InstantID framework outperforms existing frameworks in customized image generation since it is able to preserve human identity while maintaining control and stylistic flexibility. 

Final Thoughts

In this article, we have talked about InstantID, a diffusion model based solution for identity-preserving image generation. InstantID is a plug-and-play module that handles image generation and personalization adeptly across various styles with just a single reference image while ensuring high fidelity. The InstantID framework focuses on instant identity-preserving image synthesis and attempts to bridge the gap between efficiency and fidelity through a simple, tuning-free module that handles image personalization using only a single facial image and remains compatible with existing pre-trained diffusion models.
