Meta AI Introduces CM3leon: The Multimodal Game-Changer Delivering State-of-the-Art Text-to-Image Generation with Unmatched Compute Efficiency
Systems that process natural language and generate images from text prompts have recently sparked renewed interest in generative AI models. A recent Meta study unveils CM3leon (pronounced “chameleon”), a single foundation model that can generate both text and images.
CM3leon is the first multimodal model trained with a recipe adapted from text-only language models: a large-scale retrieval-augmented pre-training stage followed by a second, multitask supervised fine-tuning (SFT) stage.
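As a rough illustration of what retrieval augmentation means during pre-training, the sketch below prepends retrieved text-image documents to each training example before applying the usual next-token loss. The `retriever`, `tokenizer`, and `model` objects are hypothetical stand-ins for this sketch, not Meta's actual implementation.

```python
# Minimal sketch of retrieval-augmented pre-training (illustrative only).
# `retriever`, `tokenizer`, and `model` are hypothetical stand-ins.
import torch
import torch.nn.functional as F

def training_step(model, retriever, tokenizer, document, max_len=4096):
    # Retrieve related text-image documents from an external memory bank
    # and prepend them to the training example as extra context.
    retrieved_docs = retriever.search(document, k=2)
    context_tokens = [tok for d in retrieved_docs for tok in tokenizer(d)]
    target_tokens = tokenizer(document)

    tokens = torch.tensor((context_tokens + target_tokens)[:max_len]).unsqueeze(0)

    # Standard next-token prediction over the concatenated sequence.
    logits = model(tokens[:, :-1])
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))
    loss.backward()
    return loss.item()
```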
The CM3leon architecture is similar to that of popular text-based models, employing a decoder-only transformer. What sets CM3leon apart is that it can take in and produce both text and images. Despite being trained with five times less compute than earlier transformer-based approaches, CM3leon delivers state-of-the-art performance for text-to-image generation.
CM3leon combines the flexibility and power of autoregressive models with efficient, low-cost training and inference. Because it can generate sequences of text and images conditioned on arbitrary sequences of other text and image content, it qualifies as a causal masked mixed-modal (CM3) model. This is a considerable improvement over earlier models that could perform only one of these tasks.
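To make the idea of a causal masked mixed-modal objective more concrete, here is a minimal sketch: a span of an interleaved text/image token sequence is replaced by a sentinel and appended at the end, so a single decoder-only model learns both infilling and left-to-right generation. The token names and helper function are illustrative assumptions, not CM3leon's implementation.

```python
# Illustrative sketch of a CM3-style causal masked transform over a mixed
# text/image token sequence. Token values here are hypothetical.
import random

MASK, EOS = "<mask>", "<eos>"

def cm3_transform(tokens):
    # `tokens` is an interleaved sequence of text tokens and discrete image tokens.
    start = random.randrange(len(tokens))
    end = random.randrange(start + 1, len(tokens) + 1)
    span = tokens[start:end]
    # Replace the span with a mask sentinel and move its contents to the end,
    # so the decoder-only model must infill it from the surrounding context.
    return tokens[:start] + [MASK] + tokens[end:] + [MASK] + span + [EOS]

mixed = ["A", "photo", "of", "a", "chameleon", "<img_017>", "<img_342>", "<img_905>"]
print(cm3_transform(mixed))
```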
The researchers show that applying large-scale multitask instruction tuning to CM3leon for both image and text generation dramatically enhances performance on tasks including image caption generation, visual question answering, text-based editing, and conditional image generation. The team also added an independently trained super-resolution stage to produce higher-resolution images from the original model outputs.
According to the findings, CM3leon outperforms Google’s Parti text-to-image model, setting a new state of the art with an FID (Fréchet Inception Distance) score of 4.88 on the most widely used image generation benchmark, zero-shot MS-COCO. This result demonstrates the power of retrieval augmentation and the importance of scaling strategies in determining the output quality of autoregressive models. CM3leon also excels in vision-language tasks such as long-form captioning and visual question answering. Its zero-shot performance is competitive with larger models trained on larger datasets, even though it was trained on a dataset of only three billion text tokens.
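For readers unfamiliar with the metric, FID compares Gaussians fitted to Inception features of real and generated images. The snippet below is a standard, self-contained way to compute it once those features have been extracted; it is a generic sketch of the metric, not Meta's evaluation code.

```python
# Fréchet Inception Distance between two sets of Inception activations.
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats, gen_feats):
    # real_feats, gen_feats: (N, D) arrays of Inception features.
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from numerics
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * covmean))
```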
CM3leon’s impressive performance across a wide range of tasks gives the team hope that multimodal models can eventually generate and understand images with even greater accuracy.
Check out the Paper and Meta Article.
Dhanshree Shenwai is a Computer Science Engineer with solid experience at FinTech companies covering the Financial, Cards & Payments, and Banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements in today’s evolving world to make everyone’s life easier.