Latest Artificial Intelligence (AI) Research From the Czech Republic Proposes ‘GLAMI-1M,’ A Multilingual Image-Text Classification Dataset And Benchmark
Public datasets are among machine learning research’s most important building blocks: they let anyone train and evaluate models on personal devices or cloud services, and their pre-defined training and test splits make it possible to test and compare different methods on equal footing.
Image classification is one of the most well-known problems in computer vision, and modern models already perform well at it. For instance, when a predecessor of the ALIGN model was trained on the proprietary WebImageText dataset, it achieved state-of-the-art classification performance on the Fashion-Gen dataset. Observations like these suggest that image classification can be improved further using image-text models.
However, public large-scale image-text classification datasets are limited in size and language diversity (see Table 1). In this paper, the authors therefore introduce GLAMI-1M, a public multilingual image-text classification benchmark of fashion products. The dataset contains 1.1M images of fashion products together with descriptions in one of 13 languages, taken from e-commerce websites. The images are categorized into 191 classes (see Figure 2) with high-quality labels: the complete test set and 75% of the 1M training-set images are human-labeled.
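To make the record structure concrete, here is an illustrative Python sketch of what a single GLAMI-1M sample might contain; the field names are our assumptions for illustration, not the dataset’s actual column names:

```python
from dataclasses import dataclass

@dataclass
class GlamiItem:
    """Illustrative record layout for one GLAMI-1M sample.
    Field names are assumptions, not the dataset's actual schema."""
    image_path: str      # fashion-product image
    name: str            # product title, in one of the 13 languages
    description: str     # product description from the e-commerce site
    language: str        # ISO code of the text language (13 in total)
    category: str        # one of the 191 target classes
    label_source: str    # human annotation vs. heuristic labeling
```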
Because the data is collected from e-commerce websites, it poses several challenges: imbalanced, long-tailed class distributions; noisy labels; multimodal inputs; and multilingual texts, among others.
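One common way to mitigate such a long-tailed class distribution is to reweight the classification loss by inverse class frequency. The sketch below is our illustration of that idea, not a method taken from the paper:

```python
import numpy as np

def inverse_frequency_weights(labels: np.ndarray, n_classes: int) -> np.ndarray:
    """Weight each class inversely to its frequency so that rare
    fashion categories contribute as much to the loss as common ones."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    # Guard against empty classes, and normalize so that the
    # frequency-weighted average weight equals 1.
    return counts.sum() / (n_classes * np.maximum(counts, 1.0))
```

The resulting weights can then be passed to a weighted loss, e.g. `torch.nn.CrossEntropyLoss(weight=...)`.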
There are several other fashion image-text datasets (see Tables 2 and 3), but only one bilingual one, Fashion-MMT, and it is ten times smaller than GLAMI-1M.
How is the data collected and cleaned?
The fashion items in the dataset are selected from the GLAMI catalog in two phases:
- Items with high-quality human annotations are sampled based on the annotation source; 100k randomly selected samples form the test set.
- Items are sampled from a less reliable heuristic labeling system to get a training set of 1M items.
In addition, there is no overlap between the training and test sets in either images or texts, as verified via MD5 hashes and cosine similarity.
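For intuition, a minimal version of such a leakage check might look like the sketch below; the `embed` feature extractor and the 0.99 similarity threshold are illustrative assumptions, not the authors’ exact setup:

```python
import hashlib
import numpy as np

def md5_digest(path: str) -> str:
    """Exact-duplicate check: hash the raw file bytes."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def find_leaks(train_paths, test_paths, embed, threshold=0.99):
    """Return test images that collide with a training image, either
    exactly (same MD5) or approximately (high cosine similarity).
    `embed` maps a file path to a feature vector; it is a placeholder."""
    train_hashes = {md5_digest(p) for p in train_paths}
    exact = [p for p in test_paths if md5_digest(p) in train_hashes]

    # L2-normalize embeddings so a dot product equals cosine similarity.
    train_vecs = np.stack([embed(p) for p in train_paths])
    train_vecs /= np.linalg.norm(train_vecs, axis=1, keepdims=True)
    near = []
    for p in test_paths:
        v = embed(p)
        v = v / np.linalg.norm(v)
        if float((train_vecs @ v).max()) >= threshold:
            near.append(p)
    return exact, near
```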
Table 4 gives some more information about the dataset.
The researchers also provide baselines for multimodal classification and text-conditional image generation on GLAMI-1M.
Let’s talk about classification first.
In multimodal classification, the inputs come from different modalities: textual (title + description), visual (image), and categorical (label source). For the baseline, the authors used EmbraceNet because it can take encoded inputs from any set of modalities and combine them into a single joint representation.
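To illustrate the idea, here is a simplified PyTorch sketch of EmbraceNet-style fusion: each modality is projected to a common size by a “docking” layer, and the “embracement” step then stochastically picks, per feature dimension, which modality contributes it. The dimensions and uniform selection probabilities below are illustrative assumptions, not the authors’ exact baseline configuration:

```python
import torch
import torch.nn as nn

class EmbraceFusion(nn.Module):
    """Simplified EmbraceNet-style fusion of several modality encodings."""

    def __init__(self, input_dims, embrace_dim=512):
        super().__init__()
        # One 'docking' projection per modality, mapping to a shared size.
        self.docking = nn.ModuleList(nn.Linear(d, embrace_dim) for d in input_dims)

    def forward(self, inputs):
        # inputs: list of (batch, dim_i) tensors, one per modality.
        docked = torch.stack(
            [torch.relu(dock(x)) for dock, x in zip(self.docking, inputs)], dim=1
        )  # (batch, n_modalities, embrace_dim)
        b, m, d = docked.shape
        # 'Embracement': sample one modality index per feature dimension
        # (uniform here; EmbraceNet can also drop missing modalities this way).
        probs = torch.full((b, m), 1.0 / m, device=docked.device)
        idx = torch.multinomial(probs, d, replacement=True)           # (b, d)
        mask = torch.zeros(b, m, d, device=docked.device)
        mask.scatter_(1, idx.unsqueeze(1), 1.0)
        return (docked * mask).sum(dim=1)  # single fused (batch, embrace_dim) vector

# Illustrative usage with assumed encoder output sizes for text, image,
# and the categorical label-source feature:
fusion = EmbraceFusion(input_dims=[768, 2048, 16])
out = fusion([torch.randn(4, 768), torch.randn(4, 2048), torch.randn(4, 16)])
```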
Turning to text-conditional image generation: the researchers trained a small Imagen-like model on a subset of the dataset.
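Imagen-style models are text-conditioned diffusion models that steer generation with classifier-free guidance. Below is a minimal sketch of one guided denoising step, assuming a hypothetical noise-prediction network `unet`; the paper’s actual architecture and guidance weight are not reproduced here:

```python
import torch

@torch.no_grad()
def guided_noise_prediction(unet, x_t, t, text_emb, null_emb, guidance_scale=5.0):
    """One classifier-free-guidance step for text-conditional diffusion.
    `unet(x, t, cond)` predicts the noise in x_t; it is a placeholder."""
    eps_text = unet(x_t, t, text_emb)   # prediction conditioned on the caption
    eps_null = unet(x_t, t, null_emb)   # prediction with an empty caption
    # Extrapolate toward the text-conditioned direction to strengthen
    # agreement between the generated image and the caption.
    return eps_null + guidance_scale * (eps_text - eps_null)
```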
Results from both baselines can be seen in Table 6 and Figures 5, 6, and 7.
In conclusion, GLAMI-1M is the largest publicly available multilingual image-text classification dataset. It has the potential to accelerate research in text-conditional image generation, image-text classification, and multilingual machine translation. It can also help produce detailed listings of fashion products on e-commerce websites.
Check out the Paper and GitHub link. All credit for this research goes to the researchers on this project. Also, don’t forget to join our Reddit page and Discord channel, where we share the latest AI research news, cool AI projects, and more.
Vineet Kumar is a consulting intern at MarktechPost. He is currently pursuing his BS from the Indian Institute of Technology (IIT), Kanpur. He is a machine learning enthusiast, passionate about research and the latest advancements in Deep Learning, Computer Vision, and related fields.