Meta AI Unveils A New Artificial Intelligence (AI) Method For Text-To-Image Generation Using Open-Vocabulary Scene Control

Source: https://omriavrahami.com/spatext/

Imagine being able to create an image by dipping your digital paintbrush (in a metaphorical sense) in “black horse” paint, drawing the horse’s particular pose, then dipping it again in “red full moon” paint and drawing the necessary region. Last but not least, you want the whole picture to be in the vein of “The Starry Night.” Modern text-to-image models leave much to be desired in realizing this ambition.

One of the current drawbacks of state-of-the-art (SoTA) models is their inability to express spatial relationships through a single text prompt. The text-to-image interface is very powerful precisely because a single prompt can correspond to an endless number of different pictures, but that power comes at a price: it lets a novice user explore an infinite number of concepts, yet it restricts controllability. For example, if a user wants to generate an image they have in mind, with a specific arrangement and shape of objects or regions, doing so with text alone is practically impossible.

To solve this issue, Make-A-Scene suggested adding a dense segmentation map with fixed labels as an extra (optional) input to text-to-image models. Two inputs are available to the user: a text prompt outlining the scene and a detailed segmentation map with labels for each part of the image. The user can easily control the image’s layout in this way. However, it has the following shortcomings:

  1. Training the model with a fixed set of labels limits the quality for objects that are not in that set at inference time.
  2. Providing a dense segmentation map can be laborious for users and undesirable in some cases, such as when the user prefers to sketch only a few main objects they care about and let the model infer the rest of the layout.
  3. A lack of fine-grained control over the specific characteristics of each instance.

Even if the label “dog” is included in the label set, for instance, it is unclear how to produce several dogs of different breeds in a single scene. To address these drawbacks, the researchers suggest two changes: (1) representing each pixel in the segmentation map with spatial free-form text rather than a fixed set of labels; and (2) using a sparse segmentation map that describes, in free-form text, only the objects the user specifies, leaving the rest of the scene unspecified (a minimal illustration of such a sparse scene follows below).
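To make this concrete, a sparse spatio-textual scene can be thought of as a global prompt plus a handful of (mask, free-form text) pairs, with everything outside those masks left for the model to decide. The sketch below is only an illustrative data structure under that assumption; the names (Segment, SpaTextScene, the example masks) are hypothetical and not the paper’s code.

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class Segment:
    """One user-specified region: a binary mask and a free-form description."""
    mask: np.ndarray  # (H, W) boolean array marking the region's shape
    text: str         # open-vocabulary description, e.g. "black horse"


@dataclass
class SpaTextScene:
    """A sparse spatio-textual scene: only the segments the user cares about."""
    global_prompt: str       # describes the whole image, e.g. style and mood
    segments: list[Segment]  # the rest of the layout is left to the model


# Example: two sketched regions, everything else unspecified.
H, W = 512, 512
horse_mask = np.zeros((H, W), dtype=bool)
horse_mask[200:480, 100:350] = True
moon_mask = np.zeros((H, W), dtype=bool)
moon_mask[40:160, 360:480] = True

scene = SpaTextScene(
    global_prompt="a painting in the style of The Starry Night",
    segments=[Segment(horse_mask, "black horse"),
              Segment(moon_mask, "red full moon")],
)
```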

To sum up, they introduce a new problem setting: given a global text prompt that describes the complete image and a spatio-textual scene that specifies the position and shape of the segments of interest along with their local text descriptions, generate a matching image. These adjustments increase expressivity by giving users more control over the areas they care about while leaving the rest for the model to fill in. Because, to the best of their knowledge, no large-scale dataset provides free-form textual descriptions for each segment of an image, and acquiring one would be prohibitively expensive, they instead extract the relevant data from existing image-text datasets.

To do this, they propose a novel CLIP-based spatio-textual representation that allows a user to specify the position and shape of each segment as well as its description in free-form text. During training, they extract local regions using a pretrained panoptic segmentation model and feed them into a CLIP image encoder to produce the representation. At inference time, the user-provided text descriptions are embedded with a CLIP text encoder and translated into the CLIP image embedding space using a prior model.
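A rough sketch of how such a representation could be assembled with the publicly available CLIP package is shown below. The region isolation helper (mask_bbox) and the text-to-image prior (prior) are hypothetical stand-ins, and the masking details and tensor shapes are assumptions rather than the paper’s actual implementation.

```python
# pip install torch git+https://github.com/openai/CLIP.git
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)


def region_image_embedding(image: Image.Image, mask) -> torch.Tensor:
    """Training time: isolate a segment and embed it with CLIP's image encoder."""
    # The segment is naively isolated by cropping its bounding box; the actual
    # pipeline (panoptic segmentation, background handling) is omitted here.
    x0, y0, x1, y1 = mask_bbox(mask)  # hypothetical helper, not a real API
    crop = preprocess(image.crop((x0, y0, x1, y1))).unsqueeze(0).to(device)
    with torch.no_grad():
        return model.encode_image(crop)  # (1, 512) CLIP image embedding


def region_text_embedding(description: str, prior) -> torch.Tensor:
    """Inference time: embed the user's free-form text with CLIP's text encoder,
    then map it into CLIP's image-embedding space with a prior model (stubbed)."""
    tokens = clip.tokenize([description]).to(device)
    with torch.no_grad():
        text_emb = model.encode_text(tokens)  # (1, 512) CLIP text embedding
    return prior(text_emb)  # hypothetical text-to-image-embedding prior


def build_spatext_map(embeddings, masks, H, W) -> torch.Tensor:
    """Stack per-segment embeddings into an (H, W, D) spatio-textual map:
    pixels inside each mask carry that segment's CLIP embedding, the rest is zero."""
    D = embeddings[0].shape[-1]
    spatext = torch.zeros(H, W, D)
    for emb, mask in zip(embeddings, masks):
        spatext[torch.as_tensor(mask)] = emb.squeeze(0).float().cpu()
    return spatext
```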

To evaluate its efficacy, they apply their proposed representation, SpaText, to two state-of-the-art text-to-image diffusion models: a pixel-based model (DALL·E 2) and a latent-based model (Stable Diffusion). Both models support a single conditioning input (the text prompt) at inference time via classifier-free guidance. The authors show how classifier-free guidance can be extended to any multi-conditional setting and apply it to their multi-conditional input (the global text and the spatio-textual representation); as far as they know, they are the first to demonstrate this.
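Standard classifier-free guidance evaluates the denoiser once with and once without the text prompt and extrapolates between the two noise predictions. One natural way to extend this to two conditions, shown in the hedged sketch below, is to give each condition its own guidance scale; the `denoiser` callable and the exact combination are illustrative assumptions and may differ from the paper’s formulation.

```python
import torch


def multi_cond_cfg(denoiser, x_t, t, text_emb, spatext_map,
                   s_text: float = 7.5, s_spatext: float = 4.0) -> torch.Tensor:
    """Multi-conditional classifier-free guidance (illustrative sketch).

    Three forward passes per denoising step: fully unconditional, text-only,
    and text + spatio-textual map. Each added condition contributes its own
    guided direction with its own scale.
    """
    eps_uncond = denoiser(x_t, t, text=None,     spatext=None)
    eps_text   = denoiser(x_t, t, text=text_emb, spatext=None)
    eps_full   = denoiser(x_t, t, text=text_emb, spatext=spatext_map)

    return (eps_uncond
            + s_text    * (eps_text - eps_uncond)   # push toward the global prompt
            + s_spatext * (eps_full - eps_text))    # push toward the scene layout
```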

They also propose a faster variant of this extension that trades controllability for inference speed. Finally, they introduce several automatic evaluation metrics for their problem setting and, together with the FID score, use them to compare their approach against baselines. They also conduct a user study showing that human evaluators prefer their method. In sum, their contributions, all addressing the novel scenario of image generation with free-form textual scene control, are:

(1) they extend classifier-free guidance in diffusion models to the multi-conditional case and present an alternative accelerated inference algorithm (sketched after this list),

(2) they propose a novel spatio-textual representation that, for each segment, represents its semantic properties and structure, 

(3) they demonstrate its effectiveness on two state-of-the-art diffusion models, one pixel-based and one latent-based, and 

(4) they propose several automatic evaluation metrics and use them, along with a user study, to compare against baselines, finding that their approach achieves state-of-the-art results.
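The accelerated variant mentioned in contribution (1) can be pictured as collapsing the per-condition passes above into a single jointly conditioned pass, cutting the number of denoiser evaluations per step at the cost of a single shared guidance scale. This is again a hedged sketch under the same assumptions (hypothetical `denoiser` callable), not the paper’s exact algorithm.

```python
def fast_multi_cond_cfg(denoiser, x_t, t, text_emb, spatext_map, s: float = 7.5):
    """Accelerated variant (illustrative): two denoiser calls per step instead
    of three, at the cost of one shared guidance scale for both conditions."""
    eps_uncond = denoiser(x_t, t, text=None,     spatext=None)
    eps_joint  = denoiser(x_t, t, text=text_emb, spatext=spatext_map)
    return eps_uncond + s * (eps_joint - eps_uncond)
```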


Check out the Paper and Project. All Credit For This Research Goes To Researchers on This Project. Also, don’t forget to join our Reddit page and Discord channel, where we share the latest AI research news, cool AI projects, and more.


Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is in image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.

