Meet JourneyDB: A Large Scale Dataset with 4 Million Diverse and High-Quality Generated Images Curated for Multimodal Visual Understanding

With the advancement of generative models like ChatGPT and DALL-E and the rise in popularity of generative Artificial Intelligence, generating human-like content is no longer a dream. Many tasks are now feasible, including question answering, code completion, generating content from textual descriptions, and creating images from both text and images. On some of these tasks, AI output has begun to rival human work. ChatGPT, the well-known chatbot developed by OpenAI, is based on the GPT 3.5 transformer architecture and is very widely used. The latest version of GPT, GPT 4, is multimodal, unlike the previous version, GPT 3.5, which only lets ChatGPT take textual inputs.

The quality of generated content has increased significantly with the development of diffusion models. As a result, Artificial Intelligence Generated Content (AIGC) platforms such as DALL-E, Stability AI, Runway, and Midjourney have become increasingly popular, since they let users create high-quality images from text prompts written in natural language. Despite advances in multimodal understanding, vision-language models still have difficulty understanding generated visuals. Compared with real data, synthetic images show greater variability in content and style, which makes them far more challenging for models to interpret correctly.

To address these issues, a team of researchers has introduced JourneyDB, a large-scale dataset specifically curated for multimodal visual understanding of generative images. JourneyDB contains about 4 million unique, high-quality generated images, each created from a different text prompt. The dataset covers both content and style interpretation and aims to offer a complete resource for training and assessing models' ability to comprehend generated images.

The proposed benchmark includes the following four tasks.

  1. Prompt inversion – The model must recover the text prompt that the user supplied to generate an image. This tests the model's comprehension of the generated image's content and style.
  2. Style retrieval – The model identifies and retrieves similar generative images based on their stylistic attributes. This assesses the model's proficiency in discerning stylistic nuances within generative images.
  3. Image captioning – The model generates descriptive captions that accurately represent the content of the generative image. This evaluates the model's ability to comprehend the visual elements of generated content and express them in natural language.
  4. Visual Question Answering – In Visual Question Answering (VQA), the model must answer questions about the generative image accurately. This tests whether the model can comprehend both the visual content and the style of the image and give relevant responses to the given questions.
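
To make the style retrieval task more concrete, here is a minimal sketch of embedding-based retrieval. It assumes each image has already been mapped to a style embedding vector by some model; the function names, image IDs, and three-dimensional toy vectors are hypothetical and are not taken from the JourneyDB paper or code.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def retrieve_by_style(
    query: Sequence[float],
    gallery: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k gallery images ranked by similarity to the query."""
    scored = [
        (image_id, cosine_similarity(query, embedding))
        for image_id, embedding in gallery.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy "style embeddings", made up purely for illustration.
gallery = {
    "watercolor_cat": [0.9, 0.1, 0.0],
    "watercolor_city": [0.8, 0.2, 0.1],
    "pixel_art_dragon": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]  # a query image with a strongly "watercolor" style

results = retrieve_by_style(query, gallery, top_k=2)
for image_id, score in results:
    print(f"{image_id}: {score:.3f}")
```

In this toy setup the two watercolor images rank above the pixel-art one. A real system would replace the hand-written vectors with embeddings produced by a vision or vision-language model.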

The team gathered 4,692,751 image-text prompt pairs and divided them into three sets: a training set, a validation set, and a test set. For evaluation, the team conducted extensive experiments on the benchmark. The results showed that current state-of-the-art multimodal models do not perform as well on generated images as they do on real datasets, but adapting them to the proposed dataset (for example, by fine-tuning) greatly improved their performance.
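
As a rough illustration of dividing image-prompt pairs into training, validation, and test sets, the sketch below shuffles a list of pairs and slices it. The article does not state the split ratios or procedure the team used, so the fractions, the `split_pairs` helper, and the toy file names are all assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_pairs(
    pairs: Sequence[T],
    val_fraction: float = 0.1,   # assumed value, not from the article
    test_fraction: float = 0.1,  # assumed value, not from the article
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle the pairs reproducibly and return (train, val, test)."""
    if val_fraction < 0 or test_fraction < 0 or val_fraction + test_fraction >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split
    n_val = round(len(shuffled) * val_fraction)
    n_test = round(len(shuffled) * test_fraction)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


# Toy stand-ins for (image file, text prompt) pairs.
pairs = [(f"image_{i}.png", f"prompt {i}") for i in range(50)]
train, val, test = split_pairs(pairs)
print(len(train), len(val), len(test))
```

Fixing the random seed keeps the split reproducible, so that models evaluated at different times see the same test set.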





Tanya Malhotra is a final year undergrad from the University of Petroleum & Energy Studies, Dehradun, pursuing BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with good analytical and critical thinking, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.

