Text-to-image synthesis refers to the process of generating realistic images from textual prompt descriptions. This technology is a branch of generative models in the field of artificial intelligence (AI) and has been gaining increasing attention in recent years.
Text-to-image generation aims to enable neural networks to interpret and translate human language into visual representations, allowing for a wide variety of synthesis combinations. Furthermore, unless taught otherwise, the generative network outcomes several different pictures for the same textual description. This can be extremely useful to gather new ideas or portray the exact vision we have in mind but cannot find on the Internet.
This technology has potential applications in various fields, such as virtual and augmented reality, digital marketing, and entertainment.
Among the most adopted text-to-image generative networks, we find diffusion models.
Text-to-image diffusion models generate images by iteratively refining a noise distribution conditioned on textual input. They encode the given textual description into a latent vector, which affects the noise distribution, and iteratively refine the noise distribution using a diffusion process. This process results in high-resolution and diverse images that match the input text, achieved through a U-net architecture that captures and incorporates visual features of the input text.
The conditioning space in these models is referred to as the P space, defined by the language model’s token embedding space. Essentially, P represents the textual-conditioning space, where an input instance “p” belonging to P (which has passed through a text encoder) is injected into all attention layers of a U-net during synthesis.
An overview of the text-conditioning mechanism of a denoising diffusion model is presented below.
Through this process, since only one instance, “p,” is fed to the U-net architecture, the obtained disentanglement and control over the encoded text is limited.
For this reason, the authors introduce a new text-conditioning space termed P+.
This space consists of multiple textual conditions, each injected into a different layer in the U-net. This way, P+ can guarantee higher expressivity and disentanglement, providing better control of the synthesized image. As described by the authors, different layers of the U-net have varying degrees of control over the attributes of the synthesized image. In particular, the coarse layers primarily affect the structure of the image, while the fine layers predominantly influence its appearance.
Having presented the P+ space, the authors introduce a related process called Extended Textual Inversion (XTI). It refers to a revisited version of the classic Textual Inversion (TI), a process in which the model learns to represent a specific concept described in a few input images as a dedicated token. In XTI, the goal is to invert the input images into a set of token embeddings, one per layer, namely, inversion into P+.
To state clearly the difference between the two, imagine providing the picture of a “green lizard” in input to a two-layers U-net. The aim for TI is to get “green lizard” in output, while XTI requires two different instances in output, which in this case would be “green” and “lizard.”
The authors prove in their work that the expanded inversion process in P+ is not only more expressive and precise than TI but also faster.
Furthermore, increasing disentanglement on P+ enables mixing through text-to-image generation, such as object-style mixing.
One example from the mentioned work is reported below.
This was the summary of P+, a rich text-conditioning space for extended textual inversion.
Check out the Paper and Project. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 16k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.
Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He is currently working in the Christian Doppler Laboratory ATHENA and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.
Credit: Source link
Comments are closed.