Self-supervised learning is a form of unsupervised learning in which the supervision signal is constructed from raw, unlabeled data. Supervised learning is effective but usually requires large amounts of labeled data, and obtaining high-quality labels is time-consuming and resource-intensive, especially for sophisticated tasks like object detection and instance segmentation, where more detailed annotations are required.
Self-supervised learning therefore aims to first learn usable representations from an unlabeled pool of data via self-supervision and then to refine these representations with a small number of labels for supervised downstream tasks such as image classification and semantic segmentation.
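To make this workflow concrete, here is a minimal PyTorch sketch of the pretrain-then-finetune pattern. It is only an illustration of the general recipe, not the data2vec method: the tiny encoder, the reconstruction pretext task, and the random stand-in data are all assumptions.

```python
# Minimal sketch of the pretrain-then-finetune workflow described above.
# The tiny encoder, the reconstruction pretext task, and the random data are
# illustrative assumptions, not the data2vec method itself.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU())

# 1) Self-supervised pretraining on an unlabeled pool (here: random images,
#    with a stand-in reconstruction objective as the pretext task).
decoder = nn.Linear(256, 3 * 32 * 32)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
for _ in range(10):
    images = torch.randn(8, 3, 32, 32)            # stand-in for unlabeled data
    loss = nn.functional.mse_loss(decoder(encoder(images)), images.flatten(1))
    opt.zero_grad(); loss.backward(); opt.step()

# 2) Supervised fine-tuning with only a few labeled examples: reuse the
#    pretrained encoder and attach a small classification head.
head = nn.Linear(256, 10)
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-4)
for _ in range(2):
    images = torch.randn(8, 3, 32, 32)            # stand-in for a small labeled set
    labels = torch.randint(0, 10, (8,))
    loss = nn.functional.cross_entropy(head(encoder(images)), labels)
    opt.zero_grad(); loss.backward(); opt.step()
```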
Self-supervised learning is at the heart of many recent advances in artificial intelligence. However, existing algorithms typically focus on a single modality (such as images or text) and demand substantial computational resources. Humans, on the other hand, appear to learn far more efficiently than current AI systems and to learn consistently from diverse types of information, rather than requiring distinct learning systems for text, speech, and other modalities.
It is therefore not obvious whether the same learning mechanisms apply to all sensory modalities. For this reason, recent efforts have moved toward unified model architectures and training objectives that work across modalities. For some modalities, models with hundreds of billions of parameters are trained, which pushes the bounds of what is computationally practical.
A year ago, Meta AI unveiled data2vec, the first high-performance self-supervised system to learn in the same way for three separate modalities: speech, vision, and text. With data2vec, it became simpler to apply advances in text-understanding research to problems such as image segmentation or speech translation.
As part of their most recent work, they introduced data2vec 2.0, a new method that significantly improves upon the already impressive performance of its predecessor. It’s 16 times faster than the current leading self-supervised method in computer vision and is just as accurate.
Data2vec 2.0, like its predecessor, predicts contextualized representations of the data, i.e., the internal layer activations of a neural network, rather than predicting the pixels of an image, the words of a text passage, or the sounds of speech. These target representations are context-aware and take the whole training example into account. According to the researchers, data2vec 2.0 can learn more quickly than competing algorithms because of these contextualized targets.
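As a rough illustration of this idea, the toy sketch below uses a teacher network to produce layer-averaged, contextualized targets from the full input, which a student then regresses from a masked view. It is not Meta AI's implementation; the model size, zero-masking scheme, and EMA rate are assumptions made for brevity.

```python
# Toy sketch of contextualized target prediction in the style of data2vec.
# Illustrative only: sizes, masking, and the EMA rate are assumptions.
import copy
import torch
import torch.nn as nn

dim, n_layers = 64, 4
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
student = nn.TransformerEncoder(layer, num_layers=n_layers)
teacher = copy.deepcopy(student)            # teacher = EMA copy of the student
for p in teacher.parameters():
    p.requires_grad_(False)

tokens = torch.randn(2, 16, dim)            # 2 sequences of 16 patch/token embeddings

# The teacher sees the full input; the target is the average of its layer
# outputs, so it encodes the whole training example ("contextualized").
with torch.no_grad():
    x, layer_outs = tokens, []
    for blk in teacher.layers:
        x = blk(x)
        layer_outs.append(x)
    target = torch.stack(layer_outs).mean(0)

# The student sees a masked version of the same input and regresses the
# targets at the masked positions.
mask = torch.rand(2, 16) < 0.6
masked = tokens.clone()
masked[mask] = 0.0                          # simple zero-masking stand-in for a learned mask token
pred = student(masked)
loss = nn.functional.mse_loss(pred[mask], target[mask])
loss.backward()

# After each step, the teacher weights are updated as an EMA of the student.
tau = 0.999
with torch.no_grad():
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(tau).add_(ps, alpha=1 - tau)
```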
The team made several enhancements to the original data2vec algorithm that greatly increased its efficiency (a sketch after the list illustrates them):
- The target representations computed for a training example are reused across multiple masked versions of that example. Each masked version is fed into the student model, which is trained to predict the same contextualized target representation. This amortizes the time and energy spent on computing the targets.
- As in masked autoencoders (MAE), the student encoder network is not run on the blanked-out portions of the training samples, which avoids wasting computation.
- The decoder uses a lightweight multilayer convolutional network in place of a Transformer network.
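The following toy sketch ties these three ideas together: one set of teacher targets is reused for several masked versions of a sample, the student encoder processes only the visible tokens, and a small convolutional decoder produces predictions for every position. The shapes, mask ratio, and modules are illustrative assumptions rather than the released data2vec 2.0 code.

```python
# Toy sketch of the efficiency ideas listed above:
# (1) targets are computed once per sample and reused for several masked
#     versions, (2) the student encoder only sees visible tokens,
# (3) a small convolutional decoder restores the full sequence length.
import torch
import torch.nn as nn

B, T, D, M = 2, 16, 64, 4                   # batch, tokens, dim, masked versions per sample
enc_layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
student_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
conv_decoder = nn.Conv1d(D, D, kernel_size=3, padding=1)   # stands in for the multilayer conv decoder

tokens = torch.randn(B, T, D)
target = torch.randn(B, T, D)               # placeholder for the teacher's contextualized targets,
                                            # computed once per sample (see the sketch above)

loss = 0.0
for _ in range(M):                          # (1) reuse the same targets for M masked versions
    keep = torch.rand(B, T).argsort(dim=1)[:, : T // 4]    # keep 25% of tokens visible
    idx = keep.unsqueeze(-1).expand(-1, -1, D)
    visible = torch.gather(tokens, 1, idx)
    latent = student_encoder(visible)       # (2) encoder runs only on visible tokens

    # (3) scatter the encoded visible tokens back into a full-length sequence
    #     and let the conv decoder predict representations at every position.
    full = torch.zeros(B, T, D).scatter(1, idx, latent)
    pred = conv_decoder(full.transpose(1, 2)).transpose(1, 2)
    loss = loss + nn.functional.mse_loss(pred, target)

(loss / M).backward()
```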
The team conducted experiments on popular benchmarks for computer vision, speech, and text to compare the efficiency of data2vec 2.0 with earlier approaches.
They evaluated data2vec 2.0 on the industry-standard ImageNet-1K image classification benchmark to see how well it represents images for computer vision applications. Data2vec 2.0 is 16 times faster than masked autoencoders (MAE) while maintaining the same accuracy, and with more training time it can outperform MAE in accuracy while still being faster.
They also evaluated it on the LibriSpeech speech recognition benchmark, where data2vec 2.0 is 11 times faster than wav2vec 2.0 with comparable accuracy. Data2vec 2.0 was also tested on the widely used General Language Understanding Evaluation (GLUE) benchmark for NLP; the results show that it is just as accurate as RoBERTa, a reimplementation of BERT, but requires only half as much training time.
The team has open-sourced their code and pretrained models. They hope their work will help the research community envision a future when machines can fully comprehend massive amounts of complicated data, like a movie’s plot.
Check out the Paper and GitHub. All credit for this research goes to the researchers on this project.
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Bhubaneswar. She is a Data Science enthusiast and has a keen interest in the applications of artificial intelligence across various fields. She is passionate about exploring new advancements in technology and their real-life applications.