Meet pix2pix-zero: A Diffusion-Based Image-to-Image Translation Method that Allows Users to Specify the Edit Direction on-the-fly (e.g., Cat → Dog)
Over the past few years, many advances have been made in the field of artificial intelligence, and text-to-image generation models are one such development. DALL·E 2, the recently developed model from OpenAI, creates images from textual descriptions or prompts. Today, a number of text-to-image models can not only generate a new image from a text description but also edit an existing one, and they can synthesize diverse, high-quality images. Generating an image from a text prompt is usually easier than editing an existing image, because editing must preserve the original image's fine details and important content.
A team from Carnegie Mellon University and Adobe Research has introduced a zero-shot image-to-image translation method called pix2pix-zero. This diffusion-based approach edits images without requiring the user to write a prompt describing the edit, while preserving the fine details of the original image that should survive the edit. Using text-to-image models such as DALL·E 2 for editing has two main limitations. First, it is difficult for the user to come up with a prompt that accurately describes the target image down to its minute details. Second, the model often makes unwanted changes to regions of the image that should stay untouched. The new approach, pix2pix-zero, requires no manual prompting and lets users specify the edit direction on the fly, such as cat → dog or man → woman.
The method builds directly on the pre-trained Stable Diffusion model, a latent text-to-image diffusion model. It lets users edit both real and synthetic images while maintaining the structure of the input, and it requires neither additional training nor manually written prompts. The researchers use cross-attention guidance to keep the cross-attention maps of the edited image consistent with those of the original (a toy sketch appears after the list below). Cross-attention is the transformer mechanism that mixes two different embedding sequences of the same dimension, here the image features and the text embeddings; the guidance encourages the edited generation's cross-attention maps to stay close to the reference maps, which preserves the input's layout. Pix2pix-zero also improves inversion quality and inference speed through two techniques:
- Autocorrelation regularization – keeps the noise map close to Gaussian white noise during inversion (a toy version is sketched after this list).
- Conditional GAN distillation – distills the diffusion model into a conditional GAN, letting users edit images interactively with real-time inference.
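The snippet below is a minimal, illustrative sketch of two of the ideas just described: an autocorrelation penalty that keeps the inverted noise close to i.i.d. Gaussian noise, and a cross-attention guidance loss that keeps the edited generation's attention maps close to the reference maps. The tensor shapes, shift offsets, and exact loss terms are simplifying assumptions for illustration, not the authors' implementation.

```python
# Toy sketch (not the authors' code) of the two auxiliary losses,
# using random tensors in place of real diffusion latents and attention maps.
import torch
import torch.nn.functional as F

def autocorrelation_loss(noise: torch.Tensor, shifts=(1, 2, 4, 8)) -> torch.Tensor:
    """Penalize spatial autocorrelation so the inverted noise stays close to
    i.i.d. Gaussian. `noise` is assumed to have shape (B, C, H, W)."""
    loss = noise.mean(dim=(1, 2, 3)).pow(2).mean()  # keep the mean near zero
    for s in shifts:
        # Correlation between the noise map and a spatially shifted copy of itself;
        # for white Gaussian noise this should be close to zero.
        loss = loss + (noise * torch.roll(noise, shifts=s, dims=-1)).mean().pow(2)
        loss = loss + (noise * torch.roll(noise, shifts=s, dims=-2)).mean().pow(2)
    return loss

def cross_attention_guidance_loss(ref_attn: torch.Tensor,
                                  edit_attn: torch.Tensor) -> torch.Tensor:
    """Encourage the cross-attention maps of the edited generation to match the
    reference maps from reconstructing the input, which preserves layout."""
    return F.mse_loss(edit_attn, ref_attn)

# Toy usage with random stand-ins for real latents / attention maps.
noise = torch.randn(1, 4, 64, 64, requires_grad=True)
ref_attn = torch.rand(8, 64 * 64, 77)                      # (heads, spatial, text tokens)
edit_attn = torch.rand(8, 64 * 64, 77, requires_grad=True)

loss = autocorrelation_loss(noise) + cross_attention_guidance_loss(ref_attn, edit_attn)
loss.backward()  # in the real method, gradients like these steer each denoising step
```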
Pix2pix-zero first reconstructs the input image using only the original text, without the edit direction. To obtain the edit direction, it then generates two groups of sentences, one containing the original word (for example, cat) and one containing the edited word (for example, dog), and computes the CLIP embedding direction between the two groups. This step takes only about 5 seconds and can also be pre-computed.
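Concretely, such a direction can be approximated as the difference between the average CLIP text embeddings of the two sentence groups. The sketch below illustrates this with the Hugging Face transformers CLIP text encoder; the model name, the hand-written sentence templates, and the use of the pooled embedding are illustrative assumptions rather than the authors' exact setup.

```python
# Hedged sketch: compute a CLIP-space edit direction from two groups of sentences.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Hand-written templates standing in for the larger generated sentence groups.
templates = ["a photo of a {}", "a picture of a {}", "an image of a {}"]

@torch.no_grad()
def mean_embedding(word: str) -> torch.Tensor:
    sentences = [t.format(word) for t in templates]
    tokens = tokenizer(sentences, padding=True, return_tensors="pt")
    # Pooled sentence embeddings, averaged over the group.
    return text_encoder(**tokens).pooler_output.mean(dim=0)

# Edit direction "cat -> dog": difference of the two group means.
direction = mean_embedding("dog") - mean_embedding("cat")
direction = direction / direction.norm()  # normalize so only the direction matters
```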
Overall, this new image-to-image translation method is a notable development, as it preserves image quality and structure without additional training or manual prompting, and it could prove as remarkable a breakthrough as DALL·E 2.
Check out the Paper, Project, and GitHub. All credit for this research goes to the researchers on this project.
Tanya Malhotra is a final-year undergraduate at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with strong analytical and critical-thinking skills and a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.