In the real world, information is presented in many formats. Images, for example, are frequently paired with tags and text captions, and a text may rely on images to explain an article's main topic more effectively. Different modalities are characterized by different statistical properties: images are typically represented as pixel intensities or as the outputs of feature extractors, while text is typically represented as discrete word-count vectors. Because different information sources have such different statistical properties, it is crucial to capture the relationships between modalities.
Multimodal learning has emerged as a promising approach for jointly representing multiple modalities. Text-to-image synthesis and image-text contrastive learning are two examples of multimodal learning that have recently gained popularity.
New research from Google Brain introduces Imagen, a text-to-image diffusion model that combines the power of transformer language models (LMs) with high-fidelity diffusion models, delivering an unprecedented degree of photorealism and language understanding in text-to-image synthesis. The key finding behind Imagen is that text embeddings from large LMs pretrained on text-only corpora are astonishingly effective for text-to-image synthesis, in contrast to previous work that relies solely on image-text data for model training.
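For illustration, here is a minimal sketch of extracting frozen text embeddings with the Hugging Face transformers library. Imagen uses T5-XXL; a small T5 checkpoint and a made-up prompt are used below purely to keep the example lightweight.

```python
# Minimal sketch: frozen T5 text embeddings via Hugging Face transformers.
# Imagen uses T5-XXL; "t5-small" is used here only for illustration.
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")
encoder.eval()  # the text encoder stays frozen; only the diffusion models are trained

prompt = "A photo of a corgi riding a skateboard"
tokens = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    text_embeddings = encoder(**tokens).last_hidden_state  # (1, seq_len, d_model)
print(text_embeddings.shape)
```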
A frozen T5-XXL encoder maps the input text into a sequence of embeddings; a 64×64 image diffusion model generates a base image from this sequence, and two super-resolution diffusion models then upsample it to 256×256 and 1024×1024. All diffusion models are conditioned on the text embedding sequence and use classifier-free guidance.
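The cascade can be pictured as three stages chained together. The sketch below only illustrates that wiring: the three "models" are placeholder callables (random noise and plain bilinear upsampling), not real diffusion models, and the function names are hypothetical.

```python
# Runnable sketch of the 64x64 -> 256x256 -> 1024x1024 cascade wiring.
# The stages below are placeholders that only show how outputs are chained,
# each stage conditioned on the same text embeddings.
import torch
import torch.nn.functional as F

def base_model(text_emb):
    # Placeholder for the 64x64 text-conditional diffusion model.
    return torch.rand(1, 3, 64, 64)

def super_res(low_res, text_emb, size):
    # Placeholder for a text-conditional super-resolution diffusion model.
    return F.interpolate(low_res, size=(size, size), mode="bilinear", align_corners=False)

def generate(text_emb):
    img_64 = base_model(text_emb)                  # base 64x64 sample
    img_256 = super_res(img_64, text_emb, 256)     # 64 -> 256 super-resolution
    img_1024 = super_res(img_256, text_emb, 1024)  # 256 -> 1024 super-resolution
    return img_1024

print(generate(torch.zeros(1, 8, 512)).shape)  # torch.Size([1, 3, 1024, 1024])
```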
Imagen employs novel sampling techniques that allow large guidance weights to be used without sacrificing sample quality, yielding images with better image-text alignment than was previously feasible. Despite being conceptually simple and easy to train, Imagen produces remarkably strong results.
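One such technique described in the paper is dynamic thresholding, which clips and rescales the predicted image at each sampling step so that large classifier-free guidance weights do not saturate pixel values. The sketch below illustrates the idea only; the model predictions are random placeholders and the guidance weight is arbitrary.

```python
# Sketch: classifier-free guidance plus dynamic thresholding for one sampling step.
# The epsilon and image predictions are random stand-ins for diffusion model outputs.
import torch

def guided_epsilon(eps_cond, eps_uncond, w):
    # Classifier-free guidance: push the prediction toward the text-conditional one.
    return eps_uncond + w * (eps_cond - eps_uncond)

def dynamic_threshold(x0, percentile=0.995):
    # Clip the predicted image to a per-sample percentile of its absolute values,
    # then rescale so pixels stay in [-1, 1] even under large guidance weights.
    s = torch.quantile(x0.abs().flatten(1), percentile, dim=1)
    s = torch.clamp(s, min=1.0).view(-1, 1, 1, 1)
    return x0.clamp(-s, s) / s

eps_cond, eps_uncond = torch.randn(2, 1, 3, 64, 64)  # placeholder model outputs
eps = guided_epsilon(eps_cond, eps_uncond, w=8.0)     # guided noise prediction
x0_pred = torch.randn(1, 3, 64, 64) * 3.0             # stand-in predicted image
x0_pred = dynamic_threshold(x0_pred)
print(x0_pred.min().item(), x0_pred.max().item())     # now within [-1, 1]
```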
According to their findings on COCO, Imagen outperforms other approaches with a zero-shot FID-30K of 7.27, greatly exceeding previous work such as GLIDE and concurrent work such as DALL-E 2. The paper notes that this zero-shot FID score even outperforms state-of-the-art models trained on COCO, such as Make-A-Scene. Furthermore, human raters find that Imagen samples are on par with the reference images on COCO captions in image-text alignment.
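For context, FID-30K compares Inception feature statistics of generated samples against real COCO images. A minimal sketch of the underlying Fréchet distance computation is shown below, using random placeholder features rather than real Inception activations.

```python
# Sketch of the Frechet Inception Distance (FID) between two sets of features.
# The feature matrices here are random placeholders, not Inception activations.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_fake):
    mu1, mu2 = feats_real.mean(0), feats_fake.mean(0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(c1 @ c2).real  # matrix square root; drop tiny imaginary parts
    return float(((mu1 - mu2) ** 2).sum() + np.trace(c1 + c2 - 2.0 * covmean))

rng = np.random.default_rng(0)
feats_real = rng.normal(size=(1000, 64))  # stand-in for real-image features
feats_fake = rng.normal(size=(1000, 64))  # stand-in for generated-image features
print(frechet_distance(feats_real, feats_fake))
```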
The team also introduced DrawBench, a new structured suite of prompts for evaluating text-to-image models. With text prompts designed to probe distinct semantic capabilities of models, DrawBench enables deeper insights through a multi-dimensional evaluation of text-to-image models.
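A DrawBench-style evaluation can be thought of as prompts grouped by the skill they probe, with human raters comparing two models side by side within each category. The sketch below is only illustrative; the category names and prompts are invented examples, not the actual DrawBench set.

```python
# Illustrative sketch of a DrawBench-style, per-category human evaluation.
# Categories and prompts below are made-up examples.
from collections import defaultdict

example_prompts = {
    "colors": ["A blue banana next to a red apple."],
    "counting": ["Three cats sitting on two chairs."],
    "spatial": ["A cube on top of a sphere, to the left of a cone."],
    "text_rendering": ["A storefront sign that says 'Imagen'."],
}

def aggregate_preferences(ratings):
    # ratings: (category, winner) pairs from human raters,
    # where winner is "model_a", "model_b", or "tie".
    per_category = defaultdict(lambda: defaultdict(int))
    for category, winner in ratings:
        per_category[category][winner] += 1
    return {cat: dict(counts) for cat, counts in per_category.items()}

example = [("colors", "model_a"), ("colors", "tie"), ("counting", "model_a")]
print(aggregate_preferences(example))
```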
Imagen surpasses other current approaches by a large margin on DrawBench, according to comprehensive human evaluation. Their work also shows the clear advantages of using large pretrained language models over multimodal embeddings such as CLIP as the text encoder for Imagen.
According to the researchers, there has been relatively little work on social bias evaluation methods for text-to-image models, even though there has been significant work auditing image-to-text and image-labeling models for forms of social bias.
They believe this is an important direction for future research and plan to investigate benchmark evaluations for social and cultural bias, for example, examining whether the normalized pointwise mutual information metric can be used to measure bias in image generation models. They also point to a pressing need to develop a conceptual vocabulary around the potential harms of text-to-image models, which they believe will help establish evaluation criteria and guide the responsible release of models.
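As a rough illustration of the metric they mention, the sketch below computes normalized pointwise mutual information from co-occurrence counts, for instance between an attribute in the prompt and an attribute detected in the generated images. The counts are made-up placeholders.

```python
# Sketch of normalized pointwise mutual information (NPMI) from co-occurrence counts.
# NPMI(x, y) = PMI(x, y) / -log p(x, y), ranging from -1 to 1.
import math

def npmi(count_xy, count_x, count_y, total):
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / (-math.log(p_xy))

# Example with placeholder counts: how often prompt attribute "x" co-occurs with
# attribute "y" detected in the generated images, out of 1000 generations.
print(round(npmi(count_xy=120, count_x=300, count_y=250, total=1000), 3))
```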
This article is based on the research paper 'Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding'. All credit for this research goes to the researchers of this project.