AI Researchers From PRIOR@AI2 Release ‘GRIT’: A General Robust Image Task Benchmark For Evaluating Computer Vision Models’ Performance
Most computer vision (CV) models are trained and assessed on a small number of concepts, under the strong assumption that the images and annotations in the training and test sets are similarly distributed. As a result, despite significant advances in CV, vision systems’ flexibility and generality still fall short of humans’ ability to learn from varied sources and generalize to new data sources and tasks.
The lack of a consistent technique and benchmark for measuring performance under distribution shifts is one of the hurdles to developing more flexible and general computer vision systems.
A new study by the PRIOR team at the Allen Institute for AI (AI2) and the University of Illinois introduces the General Robust Image Task Benchmark (GRIT), a unified benchmark for evaluating a CV model’s overall competence, in terms of both performance and robustness, across a variety of image prediction tasks, concepts, and data sources. The seven GRIT tasks are object classification, object localization, referring expressions, visual question answering, semantic segmentation, human keypoint estimation, and surface normal estimation. Together, these tasks probe a range of visual skills, including the ability to make predictions for concepts learned from other data sources or tasks, robustness to image perturbations, and calibration of confidence.
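As a rough illustration of what such a unified, multi-task setup might look like, the sketch below defines a single model interface that handles all seven tasks. The class, method, and field names are hypothetical assumptions for illustration only, not part of any official GRIT tooling.

```python
# Hypothetical sketch of a unified interface covering GRIT's seven tasks.
# Names here are illustrative assumptions, not an official GRIT API.
from typing import Any, Dict, List, Optional

GRIT_TASKS: List[str] = [
    "object classification",
    "object localization",
    "referring expressions",
    "visual question answering",
    "semantic segmentation",
    "human keypoint estimation",
    "surface normal estimation",
]

class UnifiedVisionModel:
    """One model, one entry point, every task."""

    def predict(self, image: Any, task: str, query: Optional[str] = None) -> Dict[str, Any]:
        # A real submission would return a task-specific output plus a
        # per-sample confidence score, since GRIT also scores calibration.
        raise NotImplementedError

def run_benchmark(model: UnifiedVisionModel, samples: List[dict]) -> None:
    # Each sample carries its task name; the same model handles all of them.
    for sample in samples:
        output = model.predict(sample["image"], sample["task"], sample.get("query"))
        print(sample["task"], output.get("confidence"))
```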
Most existing benchmarks evaluate only under i.i.d. conditions. In contrast, in the GRIT Restricted track, models are tested on data sources and concepts that are not permitted during training. To complete tasks involving novel concepts, a model must be able to transfer skills across concepts.
GRIT also assesses robustness in the face of image perturbations. Performance on each task is measured on a collection of examples with and without 20 distinct types of image distortions of varying strengths, such as JPEG compression, motion blur, Gaussian noise, and universal adversarial perturbations.
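For illustration, the sketch below applies two of the perturbation types mentioned above, Gaussian noise and JPEG compression, using NumPy and PIL. The severity values are arbitrary examples, not the benchmark’s actual corruption levels.

```python
# Minimal sketch of image perturbations of the kind GRIT describes.
# Severity values are illustrative, not GRIT's actual corruption settings.
import io
import numpy as np
from PIL import Image

def gaussian_noise(img: Image.Image, sigma: float = 20.0) -> Image.Image:
    # Add zero-mean Gaussian noise in pixel space, then clip back to [0, 255].
    arr = np.asarray(img).astype(np.float32)
    noisy = arr + np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

def jpeg_compress(img: Image.Image, quality: int = 15) -> Image.Image:
    # Round-trip the image through an aggressive JPEG encode/decode.
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

# Usage idea: score the same model on clean and distorted copies of each
# image and compare the per-sample results to measure robustness.
```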
GRIT encourages the construction of large-scale foundation models for vision while allowing a fair comparison of models built with minimal compute resources. GRIT’s two tracks, Restricted and Unrestricted, help achieve this. The Unrestricted track facilitates the creation of large models, with few limits on the training data that can be used. The Restricted track, with its rich but fixed set of training data sources, lets researchers focus on skill-concept transfer and efficient learning. By limiting training data to publicly available sources, the Restricted track also levels the playing field in terms of computational resource requirements and access to data.
The following design principles guide all design decisions in the development of GRIT:
- Clear tasks: The researchers choose vision and vision-language tasks with a clear task definition and unambiguous ground truth. They exclude captioning, for example, because there are many valid ways to caption an image. For tasks such as VQA where several answers may be acceptable, they follow the standard VQA evaluation protocol of collecting multiple candidate answers and normalizing answer text to reduce ambiguity as much as possible.
- Generality and Robustness Tests: Each task includes evaluation samples drawn from multiple data sources, covering concepts not found in the task’s training data as well as image perturbations. These measure the ability to transfer skills across data sources and concepts, along with resistance to visual distortions.
- Concept Diversity and Balance: Task samples are chosen to cover a diverse and evenly distributed range of concepts. The team further groups objects (noun concepts) into 24 concept categories (e.g., animals, food, tools).
- Per-Sample Evaluation: All metrics are computed at the sample level so that performance can be summarized by averaging over different subsets of the data.
- Knowledge Assessment and Calibration: For each prediction, models must also output a confidence score, which is used to analyze the model’s knowledge, degree of misinformation, and calibration of its beliefs (see the calibration sketch after this list).
- Using Existing Datasets: To ensure that annotations and tasks are well validated, the team sources tasks from existing, well-established datasets whenever possible and selects annotations from those sources that are held out or previously unused.
- Level Playing Field: A Restricted track with a fixed set of publicly available training data sources is offered. It enables a fair comparison of submissions and allows researchers with limited data and computing resources to participate and contribute novel, robust, and efficient learning methods.
- Promote Unified Models: All submissions must report the total number of parameters used in their models, which serves as a simple, albeit imprecise, indicator of compute and sample efficiency. While participants may use entirely separate models for each task, they are encouraged to choose models that share parameters across tasks and use fewer parameters overall.
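As a rough illustration of the calibration idea in the “Knowledge Assessment and Calibration” principle above, the sketch below computes expected calibration error (ECE) from per-sample confidence scores. GRIT’s exact calibration metric may differ; treat this only as a generic example.

```python
# Generic expected calibration error (ECE) sketch; GRIT's actual
# calibration metric may differ from this illustration.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Average |accuracy - mean confidence| over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# Example: a model that is right 70% of the time but always claims 0.99
# confidence shows a large calibration gap (~0.29).
print(expected_calibration_error([0.99] * 10, [1] * 7 + [0] * 3))
```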
The researchers believe that consolidating these advances into more robust, general-purpose systems that require no per-task architecture changes will help models withstand the distribution-shift issues that plague vision and vision-language models in the open world.
This article is written as a summary by Marktechpost Staff based on the research paper 'GRIT: General Robust Image Task Benchmark'. All credit for this research goes to the researchers of this project. Check out the paper, GitHub, project, and reference article. Please don't forget to join our ML subreddit.