Researchers From Stanford Introduce Locally Conditioned Diffusion: A Method For Compositional Text-To-Image Generation Using Diffusion Models

3D scene modeling has traditionally been a time-consuming procedure reserved for people with domain expertise. Although a sizable collection of 3D assets is available in the public domain, it is uncommon to find a 3D scene that matches a user’s requirements. Because of this, 3D designers sometimes devote hours or even days to modeling individual 3D objects and assembling them into a scene. Making 3D creation straightforward while preserving control over a scene’s components (e.g., the size and position of individual objects) would help close the gap between experienced 3D designers and the general public.

The accessibility of 3D scene modeling has recently improved because of work on 3D generative models. Promising results for 3D object synthesis have been obtained using 3D-aware generative adversarial networks (GANs), a first step towards combining generated objects into scenes. GANs, however, are specialized to a single object category, which restricts the variety of outputs and makes scene-level text-to-3D generation difficult. In contrast, text-to-3D generation using diffusion models allows users to prompt the creation of 3D objects from a wide range of categories.

Current research uses a single text prompt to impose global conditioning on rendered views of a differentiable scene representation, relying on robust 2D image diffusion priors learned from internet-scale data. These techniques can produce excellent object-centric generations, but they struggle to produce scenes containing several distinct objects. Global conditioning also restricts controllability: user input is limited to a single text prompt, and there is no way to influence the layout of the generated scene. To address this, researchers from Stanford propose locally conditioned diffusion, a technique for compositional text-to-image generation using diffusion models.

Their proposed technique builds coherent 3D scenes with control over the size and position of individual objects, taking text prompts and 3D bounding boxes as input. The approach applies conditional diffusion steps selectively to specific regions of the image, using an input segmentation mask and a matching text prompt for each region, so that the output follows the user-specified composition. By incorporating this technique into a text-to-3D generation pipeline based on score distillation sampling, they can also create compositional text-to-3D scenes.
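The idea of applying different text conditions to different image regions can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the masks are binary and partition the image, and that compositing happens on the predicted noise at each denoising step. `toy_noise_predictor` is a hypothetical stand-in for a pretrained text-conditioned diffusion model, and `locally_conditioned_noise` is a name chosen here for illustration.

```python
import numpy as np


def toy_noise_predictor(x_t, t, prompt):
    """Hypothetical stand-in for a text-conditioned diffusion model.

    A real model would return the predicted noise for x_t given the prompt.
    Here a deterministic per-prompt offset makes different prompts yield
    different predictions, which is enough to show the masking logic.
    """
    offset = (sum(ord(c) for c in prompt) % 7) / 10.0
    return 0.1 * x_t + offset + 0.01 * t


def locally_conditioned_noise(x_t, t, masks, prompts, predictor=toy_noise_predictor):
    """Composite per-region noise predictions using binary masks.

    Assumption: each pixel belongs to exactly one region, so the masks
    sum to 1 everywhere.
    """
    if len(masks) != len(prompts):
        raise ValueError("need exactly one prompt per mask")
    masks = [np.asarray(m, dtype=x_t.dtype) for m in masks]
    if not np.all(sum(masks) == 1):
        raise ValueError("masks must partition the image")

    eps = np.zeros_like(x_t)
    for mask, prompt in zip(masks, prompts):
        # Predict noise for the whole image under this prompt,
        # then keep the prediction only inside this prompt's region.
        eps += mask * predictor(x_t, t, prompt)
    return eps


# Usage: left half conditioned on one prompt, right half on another.
rng = np.random.default_rng(0)
x_t = rng.standard_normal((4, 4))
left = np.zeros((4, 4))
left[:, :2] = 1
right = 1 - left
eps = locally_conditioned_noise(
    x_t, t=10, masks=[left, right], prompts=["a lighthouse", "a beach"]
)
```

Because every region's prediction is computed from the same full noisy image, the regions share context at each step, which is one plausible reason the composited result can stay coherent across mask boundaries.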


Specifically, they make the following contributions:

• They present locally conditioned diffusion, a technique that gives 2D diffusion models more compositional flexibility. 

• They propose camera pose sampling strategies that are crucial for compositional 3D generation.

• They introduce a method for compositional 3D synthesis by adding locally conditioned diffusion to a score distillation sampling-based 3D generation pipeline.




Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.

