This AI Paper Presents X-Decoder: A Single Model Trained to Support a Wide Range of Vision and Vision-Language Tasks
A long-standing challenge in computer vision is interpreting visual content at different levels of granularity. The tasks range from pixel-level grouping (e.g., instance, semantic, and panoptic segmentation) to region-level localization (e.g., object detection and phrase grounding) and image-level understanding (e.g., image classification, image-text retrieval, image captioning, and visual question answering). Until recently, most of these tasks were handled separately with specialized model designs, which prevented exploiting the synergy between tasks of different granularities. In light of the versatility of transformers, there is growing interest in building general-purpose models that can learn from, and be applied to, a diverse set of vision and vision-language tasks through multi-task learning, sequential decoding, or unified learning strategies.
Although these studies have shown promising cross-task generalization, most focus on unifying image-level and region-level tasks, leaving the crucial pixel-level understanding underexplored. Some attempts include segmentation by decoding a coordinate sequence or a color map, but this yields subpar performance and only limited support for open-world generalization. Understanding images down to the pixel level is arguably one of the most important yet challenging problems: (1) pixel-level annotations are costly and far scarcer than other types of annotations; (2) grouping every pixel and recognizing the groups in an open-vocabulary manner is less studied; and (3) most importantly, learning from data at two very different granularities while also obtaining mutual benefits is non-trivial.
Recent efforts have tried to close this gap in various ways. Mask2Former tackles all three types of segmentation with a single architecture, but only in a closed-set setting. Several studies transfer or distill rich semantic knowledge from image-level vision-language foundation models such as CLIP and ALIGN into specialist models to enable open-vocabulary recognition. However, these initial explorations focus on specific segmentation tasks of interest and have yet to demonstrate generalization to tasks of different granularities.
The decoder design is the key novelty. The researchers unify all tasks, including pixel-level image segmentation and vision-language problems, into a general decoding procedure. Following the design of Mask2Former, X-Decoder is built on top of a vision backbone and a transformer encoder that extract multi-scale image features. First, it takes two sets of queries as input: (i) generic, non-semantic queries that decode segmentation masks for universal segmentation, similar to Mask2Former, and (ii) newly introduced textual queries that make the decoder language-aware for a variety of language-related vision tasks. Second, it predicts two types of outputs, pixel-level masks and token-level semantics, and different combinations of these outputs can seamlessly support all the tasks of interest.
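To make this decoding scheme more concrete, below is a minimal PyTorch sketch of a generalized decoder that consumes latent (non-semantic) queries alongside optional textual queries and emits both pixel-level masks and token-level semantic features. It is only an illustration of the idea, not the authors' implementation; the class name `GeneralizedDecoderSketch`, the dimensions, and the head designs are all assumptions.

```python
# Minimal sketch (NOT the official X-Decoder code) of a decoder that takes
# (i) generic non-semantic latent queries and (ii) optional textual queries,
# and produces pixel-level masks plus token-level semantic features.
import torch
import torch.nn as nn


class GeneralizedDecoderSketch(nn.Module):
    def __init__(self, dim=256, num_latent_queries=100, num_layers=3, num_heads=8):
        super().__init__()
        # (i) learnable, non-semantic queries, as in Mask2Former-style decoders
        self.latent_queries = nn.Parameter(torch.randn(num_latent_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        # heads for the two output types
        self.mask_head = nn.Linear(dim, dim)      # -> pixel-level masks
        self.semantic_head = nn.Linear(dim, dim)  # -> token-level semantics

    def forward(self, image_features, pixel_features, text_queries=None):
        """image_features: (B, HW, dim) flattened multi-scale encoder features
        pixel_features:  (B, dim, H, W) high-resolution per-pixel features
        text_queries:    (B, T, dim) optional textual queries (language-aware decoding)
        """
        b = image_features.size(0)
        latent = self.latent_queries.unsqueeze(0).expand(b, -1, -1)
        queries = latent if text_queries is None else torch.cat([latent, text_queries], dim=1)
        decoded = self.decoder(tgt=queries, memory=image_features)

        # Output 1: one mask per query, via dot product with per-pixel features
        masks = torch.einsum("bqc,bchw->bqhw", self.mask_head(decoded), pixel_features)
        # Output 2: semantic features, later matched against text-encoder embeddings
        semantics = self.semantic_head(decoded)
        return masks, semantics


# Example: 100 latent queries plus 8 textual queries over a 32x32 feature map
dec = GeneralizedDecoderSketch()
masks, sems = dec(torch.randn(2, 32 * 32, 256), torch.randn(2, 256, 128, 128),
                  text_queries=torch.randn(2, 8, 256))
print(masks.shape, sems.shape)  # torch.Size([2, 108, 128, 128]) torch.Size([2, 108, 256])
```

In this spirit, generic segmentation would use both the masks and the semantics of the latent queries, while caption- or retrieval-style tasks would rely only on the semantic outputs, which is how different pairings of the two output types can cover tasks at different granularities.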
Third, they use a single text encoder to encode all the text involved in the various tasks, including concepts in segmentation, phrases in referring segmentation, tokens in image captioning, questions in VQA, and so on. As a result, X-Decoder respects the heterogeneous nature of the different tasks while naturally promoting synergy across them and encouraging the learning of a shared visual-semantic space. With this generalized decoder, the researchers adopt an end-to-end pretraining strategy that learns from all granularities of supervision, combining three types of data: panoptic segmentation, referring segmentation, and image-text pairs.
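As a rough illustration of this shared text encoder, the hedged sketch below uses generic stand-ins (not the paper's actual tokenizer or encoder) to embed class names, referring phrases, and captions with one module and to score them against the decoder's token-level semantic outputs with a simple dot product.

```python
# Hedged sketch of a single text encoder shared by all tasks; the module,
# vocabulary size, and mean-pooling choice are illustrative assumptions.
import torch
import torch.nn as nn


class SharedTextEncoder(nn.Module):
    def __init__(self, vocab_size=30000, dim=256, num_layers=2, num_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_ids):
        # token_ids: (B, T) tokens for any text: segmentation concepts, referring
        # phrases, caption tokens, or VQA questions. Mean-pool to one vector per text.
        return self.encoder(self.embed(token_ids)).mean(dim=1)  # (B, dim)


# The same encoder embeds a concept (e.g., a class name) and a referring phrase;
# the decoder's token-level semantics (B, Q, dim) are scored against these
# embeddings, tying every task to one shared visual-semantic space.
text_enc = SharedTextEncoder()
concept_emb = text_enc(torch.randint(0, 30000, (1, 2)))   # e.g., a segmentation concept
phrase_emb = text_enc(torch.randint(0, 30000, (1, 7)))    # e.g., a referring phrase
decoder_semantics = torch.randn(1, 100, 256)               # token-level outputs from the decoder
class_scores = torch.einsum("bqd,bd->bq", decoder_semantics, concept_emb)  # per-query similarity
ref_scores = torch.einsum("bqd,bd->bq", decoder_semantics, phrase_emb)
```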
In contrast to prior work that relies on pseudo-labeling to extract fine-grained supervision from image-text pairs, X-Decoder directly groups pixels and proposes a few meaningful segmentation candidates on the fly, so that regions can readily be mapped to the contents described in the captions. Meanwhile, the referring segmentation task bridges generic segmentation and image captioning by sharing pixel-level decoding with the former and semantic queries with the latter. Pretrained on a limited amount of segmentation data and millions of image-text pairs, X-Decoder supports a wide range of tasks in a zero-shot and open-vocabulary manner, exhibiting strong zero-shot transferability to various segmentation and vision-language tasks as well as task-specific transferability.
In particular, their approach sets new state-of-the-art results on ten settings across seven datasets and can be applied directly to all three types of segmentation tasks in a variety of domains. The flexibility of the model architecture also enables some novel task compositions and efficient fine-tuning, and the authors observe several intriguing properties in their model. When fine-tuned on specific tasks, it consistently outperforms previous work. The code implementation is available on GitHub, and a Hugging Face demo lets readers try the model out themselves.
Check out the Paper, Project, and GitHub. All Credit For This Research Goes To Researchers on This Project. Also, don’t forget to join our Reddit page and Discord channel, where we share the latest AI research news, cool AI projects, and more.
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest lies in image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.