This Artificial Intelligence (AI) Paper Proposes a Novel Method to Fuse Language Structures into Diffusion Guidance for Compositional Text-to-Image Generation

Text-to-image generative models have received significant attention recently due to their potential to synthesize high-quality images from text descriptions. They have many applications, including image synthesis, data augmentation, and a better understanding of the relationship between language and visual representations.

Approaches to text-to-image generation include generative adversarial networks (GANs), variational autoencoders (VAEs), and normalizing flow models. These models differ in the specific techniques they use to learn the probability distribution of the data, but they all aim to capture its underlying structure and generate new samples representative of the original dataset.

Despite their promise, text-to-image generative models face several challenges, including the need to model complex and varied distributions, train on large datasets, and balance the trade-off between image quality and diversity. The problems, however, are not limited to training. At inference time, the main issues are attribute leakage, interchanged attributes, and missing objects. Addressing these problems is the key contribution of this paper.

Source: https://weixi-feng.github.io/structure-diffusion-guidance/

The state-of-the-art text-to-image generative model addressed in this paper is Stable Diffusion, the latent diffusion model released by Stability AI.

Stable Diffusion is a diffusion model, a particular class of generative model that has recently gained attention for its ability to synthesize high-quality images from text descriptions. It operates by iteratively denoising a latent representation conditioned on the text input over a series of intermediate steps, ultimately generating a final image that reflects the content of the text. Although the generated images are stunning and contain incredible details, inference is error-prone. The main issues relate to the semantic information in the input text and how the cross-attention mechanism between text and image features impacts image generation. As shown in the picture above, Stable Diffusion frequently presents problems in the guidance process.
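These failure modes are easy to reproduce with the diffusers library. Below is a minimal sketch; the model identifier and sampling parameters are illustrative defaults we chose, not settings from the paper.

```python
import torch
from diffusers import StableDiffusionPipeline

# Minimal reproduction sketch; model id and sampling settings are
# illustrative assumptions, not choices made in the paper.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Compositional prompts like this one often show attribute leakage
# ("red" bleeding onto the apple) or a missing object.
image = pipe(
    "a red banana and a yellow apple",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("red_banana_yellow_apple.png")
```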

The authors try to solve this issue by improving the traditional cross-attention approach. Indeed, according to the authors, the reason behind the lack of semantic accuracy in Stable Diffusion is the incorrect binding between attributes and objects. For instance, feeding the model the text prompt “red banana and yellow apple” might confuse it into associating the “red” attribute with both the banana and the apple. The idea to solve this problem is based on the observation that attention maps provide free token-region associations in text-to-image models. By modifying the key-value pairs in the cross-attention layers, the authors map the encoding of each text span onto its attended region in 2D image space.
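The mechanism can be sketched in a few lines of PyTorch. In the sketch below, the keys (and hence the attention map) come from the full-prompt encoding, while the values are computed separately from each noun-phrase encoding and the outputs are averaged; the function name, tensor shapes, and projection matrices are our assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def structured_cross_attention(q, full_emb, np_embs, w_k, w_v):
    """Sketch of per-span value substitution in a cross-attention layer.

    q:        image-feature queries, shape (pixels, d)
    full_emb: CLIP encoding of the full prompt, shape (tokens, d_text)
    np_embs:  list of encodings, one per noun phrase, each aligned to
              the full-prompt token positions (same shape as full_emb)
    w_k, w_v: key/value projection matrices, shape (d_text, d)
    """
    # Keys come from the full prompt, so the token-region attention
    # map (the "free" layout signal) is left untouched.
    k = full_emb @ w_k
    attn = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)

    # Values are computed separately from each noun-phrase encoding,
    # so an attribute only modulates the region its own span attends
    # to; the per-span outputs are then averaged.
    outs = [attn @ (emb @ w_v) for emb in np_embs]
    return torch.stack(outs).mean(dim=0)
```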

The pipeline of the architecture is depicted in the figure below.

Source: https://weixi-feng.github.io/structure-diffusion-guidance/

First, the prompt is fed to a parser, whose goal is to extract a collection of concepts from the input text and arrange them into a hierarchical tree. Noun phrases (NPs) are then decoded from the tree and passed to the CLIP text encoder to produce text embeddings. These embeddings are aligned with the encoding of the full prompt to ensure no information is lost. Finally, the embeddings are fused with the latent feature maps in the cross-attention layers, within a classifier-free guidance scheme, to identify the 2D regions of the image that each text span should influence and thus guide the diffusion process. A sketch of the first two stages follows below.
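A minimal sketch of the parsing and encoding stages might look as follows. Here spaCy's noun_chunks is a lightweight stand-in for the constituency parser used in the paper, and the CLIP checkpoint name is an assumption on our part.

```python
import spacy
import torch
from transformers import CLIPTextModel, CLIPTokenizer

nlp = spacy.load("en_core_web_sm")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a red banana and a yellow apple"

# Extract noun phrases; the paper decodes NPs from a full constituency
# tree, so noun_chunks is only a lighter-weight approximation.
spans = [prompt] + [chunk.text for chunk in nlp(prompt).noun_chunks]

# Encode the full prompt and each noun phrase separately with the
# CLIP text encoder (padded to the fixed 77-token context length).
with torch.no_grad():
    embeddings = [
        text_encoder(
            **tokenizer(span, padding="max_length", max_length=77,
                        truncation=True, return_tensors="pt")
        ).last_hidden_state
        for span in spans
    ]

# The per-span embeddings would then be aligned with the full-prompt
# tokens and fed into the cross-attention layers as sketched above.
```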

This was a summary of the text-to-image approach presented in the paper: a novel, training-free diffusion guidance that addresses the compositionality problems of Stable Diffusion. If you are interested, you can find more information in the links below.


Check out the Paper, Project, and Code. All credit for this research goes to the researchers on this project.


Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He is currently working in the Christian Doppler Laboratory ATHENA and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.


