In the past decade, Artificial Intelligence (AI) and Machine Learning (ML) have seen tremendous progress. Today, they are more accurate, efficient, and capable than they have ever been. Modern AI and ML models can seamlessly and accurately recognize objects in images or video, and they can generate text and speech that closely parallels human writing and conversation.
Today's AI and ML models rely heavily on training on labeled datasets that teach them how to interpret a block of text, identify objects in an image or video frame, and perform several other tasks.
Despite their capabilities, AI and ML models are not perfect, and scientists are working towards building models that can learn from the information they are given without necessarily relying on labeled or annotated data. This approach is known as self-supervised learning, and it's one of the most promising methods for building ML and AI models that have the "common sense" or background knowledge to solve problems beyond the capabilities of today's AI models.
Self-supervised learning has already delivered results in Natural Language Processing: it has allowed developers to train large models on enormous amounts of data, and it has led to several breakthroughs in natural language inference, machine translation, and question answering.
The SEER model by Facebook AI aims at maximizing the capabilities of self-supervised learning in the field of computer vision. SEER, short for SElf-supERvised, is a self-supervised computer vision model with over a billion parameters that is capable of finding patterns and learning even from a random group of images found on the internet without annotations or labels.
The Need for Self-Supervised Learning in Computer Vision
Data annotation or data labeling is a pre-processing stage in the development of machine learning and artificial intelligence models. The data annotation process takes raw data such as images or video frames and adds labels that specify the context of the data for the model. These labels are what allow the model to make accurate predictions on the data.
One of the greatest hurdles developers face when working on computer vision models is finding high-quality annotated data. Computer vision models today rely on these labeled or annotated datasets to learn the patterns that allow them to recognize objects in an image.
Data annotation, and its use in computer vision models, poses the following challenges:
Managing Consistent Dataset Quality
Probably the greatest hurdle for developers is gaining consistent access to high-quality datasets, because datasets with proper labels and clear images result in better learning and more accurate models. However, accessing high-quality data consistently comes with its own challenges.
Workforce Management
Data labeling often comes with workforce management issues, mainly because a large number of workers is required to process and label large amounts of unstructured, unlabeled data while ensuring quality. It's therefore essential for developers to strike a balance between quality and quantity when it comes to data labeling.
Financial Constraints
Perhaps the biggest hurdle of all is the financial constraint that accompanies the data labeling process: most of the time, data labeling accounts for a significant share of the overall project cost.
As you can see, data annotation is a major hurdle in developing advanced computer vision models, especially complex models that deal with large amounts of training data. This is why the computer vision industry needs self-supervised learning to develop complex, advanced models capable of tackling tasks that are beyond the scope of current models.
With that being said, there are already plenty of self-supervised learning models that perform well in a controlled environment, primarily on the ImageNet dataset. Although these models might be doing a good job, they do not satisfy the primary condition of self-supervised learning in computer vision: to learn from any unbounded dataset or random image, not just from a well-defined dataset. When implemented ideally, self-supervised learning can help develop computer vision models that are more accurate, more capable, and also cost effective and viable.
SEER or SElf-supERvised Model: An Introduction
Recent trends in the AI & ML industry have indicated that model pre-training approaches like semi-supervised, weakly-supervised, and self-supervised learning can significantly improve the performance for most deep learning models for downstream tasks.
There are two key factors that have massively contributed towards the boost in performance of these deep learning models.
Pre-Training on Massive Datasets
Pre-training on massive datasets generally results in better accuracy and performance because it exposes the model to a wide variety of data. Large datasets allow models to understand the patterns in the data better, which ultimately results in better performance in real-life scenarios.
Some of the best performing models, like GPT-3 and Wav2vec 2.0, are trained on massive datasets. The GPT-3 language model uses a pre-training dataset with over 300 billion words, whereas the Wav2vec 2.0 model for speech recognition uses a dataset with over 53 thousand hours of audio data.
Models with Massive Capacity
Models with a higher number of parameters often yield more accurate results because greater capacity allows the model to capture the signal in the data that actually matters instead of fitting to interference or noise.
In the past, developers have attempted to train self-supervised learning models on unlabeled or uncurated data, but only with smaller datasets containing a few million images. Can self-supervised learning models yield high accuracy when they are trained on a large amount of unlabeled, uncurated data? This is precisely the question the SEER model aims to answer.
The SEER model is a deep learning framework that aims to learn from any image available on the internet, independent of curated or labeled datasets. The SEER framework allows developers to train large, complex ML models on random data with no supervision, i.e., the model analyzes the data and learns patterns or information on its own without any added manual input.
The ultimate goal of the SEER model is to help develop pre-training strategies that use uncurated data to deliver state-of-the-art performance in transfer learning. Furthermore, the SEER model also aims at creating systems that can continuously learn from a never-ending stream of data in a self-supervised manner.
The SEER framework trains high-capacity models on billions of random, unconstrained images extracted from the internet. The models trained on these images do not rely on image metadata or annotations to train or to filter the data. In recent times, self-supervised learning has shown high potential, as training models on uncurated data has yielded better results than supervised pretrained models on downstream tasks.
SEER Framework and RegNet: What's the Connection?
The SEER model focuses on the RegNet family of architectures, scaled to over 700 million parameters, which aligns with SEER's goal of self-supervised learning on uncurated data for two primary reasons:
- They offer a perfect balance between performance & efficiency.
- They are highly flexible, and can be scaled to a wide range of parameter counts.
SEER Framework: Prior Work from Different Areas
The SEER framework aims at exploring the limits of training large model architectures on uncurated, unlabeled datasets using self-supervised learning, and it draws inspiration from prior work in the field.
Unsupervised Pre-Training of Visual Features
Self-supervised learning has been used in computer vision for some time now, with methods based on autoencoders, instance-level discrimination, or clustering. In recent times, methods using contrastive learning have indicated that models pre-trained with unsupervised learning can outperform a supervised learning approach on downstream tasks.
The major takeaway from unsupervised learning of visual features is that, as long as you are training on filtered data, supervised labels are not required. The SEER model aims to explore whether a model can learn accurate representations when large architectures are trained on a large amount of uncurated, unlabeled, random images.
Learning Visual Features at Scale
Prior models have benefited from pre-training on large labeled datasets with weakly supervised, supervised, and semi-supervised learning on millions of filtered images. Furthermore, model analysis has also indicated that pre-training a model on billions of images often yields better accuracy than training the model from scratch.
Furthermore, training a model at large scale usually relies on data filtering steps to make the images resonate with the target concepts. These filtering steps either make use of predictions from a pre-trained classifier, or they use hashtags that are often synsets of the ImageNet classes. The SEER model works differently: it aims at learning features from any random image, and hence its training data is not curated to match a predefined set of features or concepts.
Scaling Architectures for Image Recognition
The quality of the resulting visual features usually benefits from training larger architectures. Training a large architecture is essential when pre-training on a large dataset, because a model with limited capacity will often underfit. This is even more important when pre-training is combined with contrastive learning, because in such cases the model has to learn to discriminate between dataset instances in order to learn better visual representations.
However, for image recognition, scaling an architecture involves a lot more than just changing the depth and width of the model, and a considerable body of literature has been dedicated to building scale-efficient models with higher capacity. The SEER model shows the benefits of using the RegNet family of models for deploying self-supervised learning at large scale.
SEER: Methods and Components Used
The SEER framework uses a variety of methods and components to pretrain the model to learn visual representations. The main ones are RegNet and SwAV. Let's discuss the methods and components used in the SEER framework briefly.
Self-Supervised Pre-Training with SwAV
The SEER framework is pre-trained with SwAV, an online self-supervised learning approach. SwAV is an online clustering method used to train convnets without annotations. The SwAV framework works by training an embedding that produces cluster assignments that are consistent between different views of the same image. The system then learns semantic representations by mining clusters that are invariant to data augmentations.
In practice, the SwAV framework compares the features of different views of an image by making use of their independent cluster assignments. If these assignments capture the same or similar information, it is possible to predict the assignment of one view from the feature of the other view.
The SEER model considers a set of K clusters, and each of these clusters is associated with a learnable d-dimensional prototype vector v_k. For a batch of B images, each image i is transformed into two different views, x_i1 and x_i2. The views are then featurized with a convnet, resulting in two sets of features, (f_11, …, f_B1) and (f_12, …, f_B2). Each feature set is then assigned independently to the cluster prototypes with the help of an Optimal Transport solver.
The Optimal Transport solver ensures that the features are split evenly across the clusters, which helps avoid trivial solutions where all the representations are mapped to a single prototype. The resulting assignments are then swapped between the two sets: the cluster assignment y_i1 of the view x_i1 must be predicted from the feature representation f_i2 of the view x_i2, and vice versa.
The prototype weights and the convnet are then trained to minimize the loss over all examples. The cluster prediction loss l is essentially the cross-entropy between the cluster assignment and the softmax of the dot product between the feature and the prototype vectors.
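To make the swapped-prediction idea concrete, here is a minimal, self-contained sketch of a SwAV-style loss in PyTorch. This is an illustration written for this article, not SEER's or SwAV's actual code; the function names `sinkhorn` and `swav_loss`, the toy tensor sizes, and the number of Sinkhorn iterations are assumptions for demonstration only.

```python
# Minimal SwAV-style swapped-assignment loss (illustrative sketch, not production code).
import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn(scores, n_iters=3, eps=0.05):
    """Optimal-Transport style normalization that spreads assignments evenly
    across clusters, avoiding the trivial single-prototype solution."""
    q = torch.exp(scores / eps).t()              # shape (K, B)
    q /= q.sum()
    K, B = q.shape
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True); q /= K  # normalize over clusters
        q /= q.sum(dim=0, keepdim=True); q /= B  # normalize over images
    return (q * B).t()                           # shape (B, K) soft assignments

def swav_loss(f1, f2, prototypes, temperature=0.1):
    """Swapped prediction: the assignments of one view supervise the other view."""
    p1 = f1 @ prototypes.t()                     # (B, K) scores for view 1
    p2 = f2 @ prototypes.t()                     # (B, K) scores for view 2
    q1, q2 = sinkhorn(p1), sinkhorn(p2)          # cluster assignments (no gradient)
    return -0.5 * (
        (q1 * F.log_softmax(p2 / temperature, dim=1)).sum(dim=1).mean()
        + (q2 * F.log_softmax(p1 / temperature, dim=1)).sum(dim=1).mean()
    )

# Toy usage: B=8 images, d=256 features, K=16 prototypes.
B, d, K = 8, 256, 16
f1 = F.normalize(torch.randn(B, d), dim=1)
f2 = F.normalize(torch.randn(B, d), dim=1)
prototypes = F.normalize(torch.randn(K, d), dim=1)
print(swav_loss(f1, f2, prototypes))
```

In words: the Sinkhorn step turns each view's prototype scores into a balanced soft cluster assignment, and the loss asks the other view's scores to predict that assignment, which is exactly the "swapped" structure described above.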
RegNetY: Scale Efficient Model Family
Scaling model capacity and data requires architectures that are efficient not only in terms of memory but also in terms of runtime, and the RegNet framework is a family of models designed specifically for this purpose.
The RegNet family of architectures is defined by a design space of convnets with 4 stages, where each stage contains a series of identical blocks, while the structure of the block itself, mainly the residual bottleneck block, remains fixed.
The SEER framework focuses on the RegNetY architecture, which adds a Squeeze-and-Excitation block to the standard RegNet architecture in an attempt to improve performance. Furthermore, the RegNetY model family has 5 parameters that help in searching for good instances with a fixed number of FLOPs and reasonable resource consumption. The SEER model aims at improving its results by applying the RegNetY architecture directly to its self-supervised pre-training task.
The RegNetY 256GF Architecture: The SEER model focuses mainly on the RegNetY 256GF architecture in the RegNetY family, and its parameters follow the scaling rule of the RegNet architecture. The parameters are described as follows.
The RegNetY 256GF architecture has 4 stages with stage widths (528, 1056, 2904, 7392) and stage depths (2, 7, 17, 1) that add up to over 696 million parameters. When training on 512 NVIDIA V100 32GB GPUs, each iteration takes about 6125 ms for a batch size of 8,704 images. Training the model on a dataset with over a billion images, with a batch size of 8,704 images across 512 GPUs, requires 114,890 iterations, and the training lasts about 8 days.
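These training-budget numbers fit together, as the back-of-the-envelope check below shows. It simply recomputes the iteration count and wall-clock time from the figures quoted above; it is a sanity check, not part of the SEER codebase.

```python
# Back-of-the-envelope check of the training budget quoted above.
import math

images       = 1_000_000_000   # ~1 billion pre-training images
batch_size   = 8_704           # images per iteration
sec_per_iter = 6.125           # ~6125 ms per iteration on 512 V100 32GB GPUs

iterations = math.ceil(images / batch_size)        # ≈ 114,890 iterations
total_days = iterations * sec_per_iter / 86_400    # ≈ 8.1 days of training
print(iterations, round(total_days, 1))
```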
Optimization and Training at Scale
The SEER model proposes several adjustments to self-supervised training methods so they can be applied and adapted at large scale. These adjustments are:
- Learning rate schedule.
- Reducing memory consumption per GPU.
- Optimizing training speed.
- Pre-training data at a large scale.
Let’s discuss them briefly.
Learning Rate Schedule
The SEER model explores the possibility of using two learning rate schedules: the cosine wave learning rate schedule, and the fixed learning rate schedule.
The cosine wave learning rate schedule is used for comparing different models fairly, as it adapts to the number of updates. However, the cosine wave schedule does not adapt well to large-scale training, primarily because it weighs images differently depending on when they are seen during training, and it also requires the full number of updates in advance for scheduling.
The fixed learning rate schedule keeps the learning rate fixed until the loss stops decreasing, and then the learning rate is divided by 2. Analysis shows that the fixed learning rate schedule works better, as it leaves room for making the training more flexible. However, because the model only trains on 1 billion images, it uses the cosine wave learning rate schedule for training its biggest model, the RegNetY 256GF.
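For intuition, the fixed-rate rule can be captured in a few lines: keep the learning rate constant and halve it whenever the loss stops decreasing. The sketch below is illustrative only; the `HalveOnPlateau` name and the `patience` threshold are assumptions, not part of SEER's implementation.

```python
# Illustrative sketch of the fixed learning-rate rule described above.
class HalveOnPlateau:
    def __init__(self, lr, patience=1):
        self.lr, self.best, self.bad, self.patience = lr, float("inf"), 0, patience

    def step(self, loss):
        if loss < self.best:
            self.best, self.bad = loss, 0
        else:
            self.bad += 1
            if self.bad >= self.patience:   # loss has stopped decreasing
                self.lr /= 2                # divide the learning rate by 2
                self.bad = 0
        return self.lr
```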
Reducing Memory Consumption per GPU
The model also aims at reducing the amount of GPU memory needed during training by making use of mixed precision and gradient checkpointing. The model uses the NVIDIA Apex library's O1 optimization level to perform operations like convolutions and GEMMs in 16-bit floating point precision. The model also uses PyTorch's gradient checkpointing implementation, which trades compute for memory.
Specifically, the model discards intermediate activations during the forward pass and recomputes them during the backward pass.
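The snippet below sketches these two memory savers together, using PyTorch's native automatic mixed precision and activation checkpointing as stand-ins for the Apex O1 level described above. It assumes a recent PyTorch version and an available GPU, and the toy `nn.Sequential` model is purely illustrative.

```python
# Hedged sketch: mixed precision + gradient checkpointing (not SEER's actual code).
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

model = nn.Sequential(
    *[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(8)]
).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

def forward_with_checkpointing(x):
    # Intermediate activations are discarded in the forward pass and recomputed
    # in the backward pass, trading extra compute for lower memory use.
    for block in model:
        x = checkpoint(block, x, use_reentrant=False)
    return x

x = torch.randn(64, 1024, device="cuda")
optimizer.zero_grad()
with torch.cuda.amp.autocast():        # matrix multiplies run in FP16
    loss = forward_with_checkpointing(x).pow(2).mean()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```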
Optimizing Training Speed
Using mixed precision to optimize memory usage has additional benefits, as accelerators take advantage of the reduced size of FP16 to increase throughput compared to FP32. This speeds up training by easing the memory-bandwidth bottleneck.
The SEER model also synchronizes the BatchNorm layers across GPUs within smaller process groups instead of using a global sync, which usually takes more time. Finally, the data loader used in the SEER model pre-fetches more training batches, which leads to higher data throughput than PyTorch's default data loader.
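A sketch of the grouped BatchNorm synchronization is shown below. It assumes `torch.distributed` has already been initialized, and the group size of 64 is taken from the implementation details later in the article; the helper function name is hypothetical.

```python
# Illustrative sketch: sync BatchNorm stats inside smaller process groups,
# rather than across every GPU globally.
import torch
import torch.distributed as dist

def convert_to_grouped_syncbn(model, group_size=64):
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    # Partition ranks into groups of `group_size`; every rank must create all
    # groups, but BN stats are all-reduced only within each rank's own group.
    groups = [
        dist.new_group(list(range(start, min(start + group_size, world_size))))
        for start in range(0, world_size, group_size)
    ]
    my_group = groups[rank // group_size]
    return torch.nn.SyncBatchNorm.convert_sync_batchnorm(model, process_group=my_group)
```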
Large-Scale Pre-Training Data
The SEER model uses over a billion images during pre-training, drawn from a data loader that samples random public images directly from the internet and Instagram. Because the SEER model trains on these images in the wild and online, it does not apply any pre-processing to them, nor does it curate them using processes like de-duplication or hashtag filtering.
It’s worth noting that the dataset is not static, and the images in the dataset are refreshed every three months. However, refreshing the dataset does not affect the model’s performance.
SEER Model Implementation
The SEER model pretrains a RegNetY 256GF with SwAV using six crops per image, at resolutions of 2×224 + 4×96. During the pre-training phase, the model uses a 3-layer MLP (multi-layer perceptron) projection head with dimensions 10444×8192, 8192×8192, and 8192×256.
The SEER model does not use BatchNorm layers in the head, and it uses 16 thousand prototypes with the temperature t set to 0.1. The Sinkhorn regularization parameter is set to 0.05, and the model performs 10 iterations of the Sinkhorn algorithm. The model further synchronizes the BatchNorm stats across GPUs, creating process groups of size 64 for synchronization.
Furthermore, the model uses a LARS (Layer-wise Adaptive Rate Scaling) optimizer, a weight decay of 10^-5, activation checkpointing, and O1 mixed-precision optimization. The model is then trained with stochastic gradient descent using a batch size of 8192 random images distributed over 512 NVIDIA GPUs, resulting in 16 images per GPU.
The learning rate is ramped up linearly from 0.15 to 9.6 for the first 8 thousand training updates. After the warmup, the model follows a cosine learning rate schedule that decays to a final value of 0.0096. Overall, the SEER model trains on over a billion images across 122 thousand iterations.
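The warmup-plus-cosine schedule quoted above can be written as a small function of the iteration index. The sketch below is illustrative, not the exact SEER code, but it reproduces the stated endpoints: 0.15 at step 0, a peak of 9.6 after 8,000 updates, and a final value of 0.0096 at 122,000 iterations.

```python
# Illustrative warmup + half-wave cosine learning-rate schedule.
import math

def seer_lr(step, warmup=8_000, total=122_000,
            lr_start=0.15, lr_peak=9.6, lr_end=0.0096):
    if step < warmup:                               # linear warmup
        return lr_start + (lr_peak - lr_start) * step / warmup
    progress = (step - warmup) / (total - warmup)   # 0 -> 1 over the cosine phase
    return lr_end + 0.5 * (lr_peak - lr_end) * (1 + math.cos(math.pi * progress))

print(seer_lr(0), seer_lr(8_000), seer_lr(122_000))   # 0.15, 9.6, 0.0096
```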
SEER Framework: Results
The quality of the features generated by this self-supervised pre-training approach is studied and analyzed on a variety of benchmarks and downstream tasks. The evaluation also considers a low-shot setting that grants only limited access to the images and their labels for downstream tasks.
Fine-Tuning Large Pre-Trained Models
The quality of the models pretrained on random data is measured by transferring them to the ImageNet object classification benchmark. The fine-tuning results for large pretrained models are determined by the following setup.
Experimental Settings
The model pretrains 6 RegNet architectures with different capacities, namely RegNetY-{8,16,32,64,128,256}GF, on over 1 billion random, public Instagram images with SwAV. The models are then fine-tuned for image classification on ImageNet, which provides over 1.28 million standard training images with proper labels and a standard validation set of 50 thousand images for evaluation.
The model applies the same data augmentation techniques as in SwAV and fine-tunes for 35 epochs with an SGD (stochastic gradient descent) optimizer, a batch size of 256, a learning rate of 0.0125 that is reduced by a factor of 10 after 30 epochs, momentum of 0.9, and weight decay of 10^-4. The model reports top-1 accuracy on the validation dataset using a center crop of 224×224.
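In PyTorch terms, that fine-tuning recipe looks roughly like the sketch below. The `model` placeholder stands in for the pretrained RegNetY with a new classification head; this is an illustration of the stated hyperparameters, not the authors' training script.

```python
# Illustrative fine-tuning setup: SGD, momentum 0.9, weight decay 1e-4,
# lr 0.0125 divided by 10 after 30 of 35 epochs.
import torch

model = torch.nn.Linear(8, 1000)   # placeholder for the pretrained network + head
optimizer = torch.optim.SGD(model.parameters(), lr=0.0125,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30], gamma=0.1)

for epoch in range(35):
    # ... one pass over the 1.28M labeled ImageNet training images ...
    scheduler.step()
```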
Comparing with Other Self-Supervised Pre-Training Approaches
In the following table, the largest pretrained model in RegNetY-256GF is compared with existing pre-trained models that use the self supervised learning approach.
As you can see, the SEER model reaches a top-1 accuracy of 84.2% on ImageNet, surpassing SimCLRv2, the best existing pretrained model, by 1%.
Furthermore, the following figure compares the SEER framework with models of different capacities. As you can see, regardless of the model capacity, combining the RegNet framework with SwAV yields accurate results during pre-training.
The SEER models are pretrained on uncurated, random images and use the RegNet architecture with the SwAV self-supervised learning method. They are compared against SimCLRv2 and ViT models with different network architectures. Finally, each model is fine-tuned on the ImageNet dataset, and the top-1 accuracy is reported.
Impact of the Model Capacity
Model capacity has a significant impact on performance when pretraining, and the figure below compares this with the impact when training from scratch.
It can be clearly seen that the top-1 accuracy of pretrained models is higher than that of models trained from scratch, and the difference keeps growing as the number of parameters increases. It is also evident that although model capacity benefits both the pretrained models and the models trained from scratch, the impact is greater on pretrained models when dealing with a large number of parameters.
A possible reason why a model trained from scratch overfits on the ImageNet dataset is the dataset's relatively small size.
Low-Shot Learning
Low-shot learning refers to evaluating the performance of the SEER model in a low-shot setting, i.e., using only a fraction of the total data when performing downstream tasks.
Experimental Settings
The SEER framework uses two datasets for low-shot learning, namely Places205 and ImageNet. Furthermore, the model is assumed to have limited access to the dataset during transfer learning, both in terms of images and their labels. This limited-access setting differs from the default setting used for self-supervised learning, where the model has access to the entire dataset and only access to the image labels is limited.
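For clarity, building such a low-shot split simply means keeping a small fraction of the images and their labels. The sketch below is a generic, class-balanced sampler written for this article; the exact split construction used in the evaluation is not specified here, so treat this purely as an illustration.

```python
# Illustrative sketch of building a low-shot (e.g. 1% or 10%) transfer split.
import random
from collections import defaultdict

def low_shot_subset(samples, fraction=0.1, seed=0):
    """samples: list of (image_path, label); returns a class-balanced fraction of it."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append(path)
    subset = []
    for label, paths in by_class.items():
        k = max(1, int(len(paths) * fraction))      # keep at least one image per class
        subset += [(p, label) for p in rng.sample(paths, k)]
    return subset
```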
Results on Places205 Dataset
The figure below shows the impact of pretraining the model on different portions of the Places205 dataset.
This approach is compared to pre-training the model on the ImageNet dataset under supervision with the same RegNetY-128GF architecture. The results of the comparison are surprising: there is a stable gain of about 2.5% in top-1 accuracy regardless of the portion of training data available for fine-tuning on the Places205 dataset.
The difference observed between supervised and self-supervised pre-training can be explained by the difference in the nature of the training data, as features learned from random images in the wild may be better suited to classifying scenes. Furthermore, a non-uniform distribution of underlying concepts might prove to be an advantage when pretraining for an unbalanced dataset like Places205.
Results on ImageNet
The table above compares the SEER model's approach with self-supervised and semi-supervised pre-training approaches on low-shot learning. It's worth noting that all these methods use all 1.2 million images in the ImageNet dataset for pre-training, restricting only access to the labels. The SEER model's approach, on the other hand, sees only 1 to 10% of the images in the dataset.
Because those networks have seen far more images from the same distribution during pre-training, they benefit immensely. What's impressive is that even though the SEER model only sees 1 to 10% of the ImageNet dataset, it still achieves a top-1 accuracy of about 80%, falling just short of the approaches in the table above.
Impact of the Model Capacity
The figure below shows the impact of model capacity on low-shot learning at 1%, 10%, and 100% of the ImageNet dataset.
It can be observed that increasing the model capacity improves the accuracy of the model even as access to the images and labels in the dataset decreases.
Transfer to Other Benchmarks
To evaluate the SEER model further, and analyze its performance, the pretrained features are transferred to other downstream tasks.
Linear Evaluation of Image Classification
The table above compares features from SEER's pre-trained RegNetY-256GF with a RegNetY-128GF pretrained on the ImageNet dataset, using the same architecture with and without supervision. To analyze the quality of the features, the weights are frozen and a linear classifier is trained on top of the features using the training set of each downstream task. The following benchmarks are considered: Open Images (OpIm), iNaturalist (iNat), Places205 (Places), and Pascal VOC (VOC).
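The linear-evaluation protocol itself is simple: freeze the pretrained trunk and train only a linear layer on its features. The sketch below illustrates that; the tiny `backbone`, the feature dimension, and the optimizer settings are placeholders, not the values used in the paper.

```python
# Illustrative linear evaluation: frozen backbone + trainable linear classifier.
import torch
import torch.nn as nn

backbone = nn.Sequential(                     # stands in for the pretrained RegNetY trunk
    nn.Conv2d(3, 64, 3, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
feature_dim, num_classes = 64, 205            # e.g. Places205

for p in backbone.parameters():
    p.requires_grad = False                   # freeze the pretrained weights
backbone.eval()

classifier = nn.Linear(feature_dim, num_classes)
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.01, momentum=0.9)

def train_step(images, labels):
    with torch.no_grad():                     # features come from the frozen trunk
        feats = backbone(images)
    loss = nn.functional.cross_entropy(classifier(feats), labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```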
Detection and Segmentation
The figure below compares and evaluates the pre-trained features on detection and segmentation tasks.
The SEER framework trains a Mask-RCNN model on the COCO benchmark with pre-trained RegNetY-64GF and RegNetY-128GF backbones. For both architectures and both downstream tasks, SEER's self-supervised pre-training approach outperforms supervised training by 1.5 to 2 AP points.
Comparison with Weakly Supervised Pre-Training
Most of the images available on the internet come with a meta description, alt text, or geolocation that can be leveraged during pre-training. Prior work has indicated that predicting a curated set of hashtags can improve the quality of the resulting visual features. However, this approach needs to filter images, and it works best only when textual metadata is present.
The figure below compares pre-training a ResNeXt101-32x8d architecture on random images against training the same architecture on labeled images with hashtags and metadata, and reports the top-1 accuracy for both.
It can be seen that although the SEER framework does not use metadata during pre-training, its accuracy is comparable to the models that use metadata for pre-training.
Ablation Studies
An ablation study is performed to analyze the impact of a particular component on the overall performance of the model. It is done by removing the component from the model altogether and observing how the model performs. It gives developers a clear picture of the impact of that particular component on the model's performance.
Impact of the Model Architecture
The model architecture has a significant impact on the performance of the model, especially when the model is scaled or the specifications of the pre-training data are modified.
The following figure shows how changing the architecture affects the quality of the pre-trained features, evaluated linearly on the ImageNet dataset. The pre-trained features can be probed directly in this case because the evaluation does not favor models that return high accuracy when trained from scratch on ImageNet.
It can be observed that for the ResNeXt and ResNet architectures, the features obtained from the penultimate layer work better in the current setting. The RegNet architecture, on the other hand, outperforms the other architectures.
Overall, it can be concluded that increasing the model capacity has a positive impact on the quality of features, and there is a logarithmic gain in the model performance.
Scaling the Pre-Training Data
There are two primary reasons why training a model on a larger dataset can improve the overall quality of the visual features the model learns: more unique images, and more parameters. Let's have a brief look at how these affect the model's performance.
Increasing the Number of Unique Images
The figure above compares two architectures, the RegNet8 and the RegNet16, that have the same number of parameters but are trained on different numbers of unique images. The SEER framework trains the models for the number of updates corresponding to 1 epoch over a billion images, or 32 epochs over 32 million unique images, with a single half-wave cosine learning rate.
It can be observed that for a model to perform well, the number of unique images fed to the model should ideally be high. In this case, the model performs well when the number of unique images it sees exceeds the number of images in the ImageNet dataset.
More Parameters
The figure below shows the model's performance as it is trained over a billion images using the RegNet-128GF architecture. It can be observed that the performance of the model increases steadily as the number of parameters is increased.
Self-Supervised Computer Vision in the Real World
Until now, we have discussed how self-supervised learning and the SEER model for computer vision work in theory. Now, let us have a look at how self-supervised computer vision works in real-world scenarios, and why SEER is the future of self-supervised computer vision.
The SEER model mirrors the work done in the Natural Language Processing industry, where high-end, state-of-the-art models with enormous numbers of parameters are pretrained on trillions of words of text. Performance on downstream tasks generally increases with the amount of input data used for training the model, and the same is true for computer vision tasks.
But using self-supervised learning techniques for Natural Language Processing is different from using them for computer vision. When dealing with text, the semantic concepts are usually broken down into discrete words, but when dealing with images, the model has to decide which pixel belongs to which concept.
Additionally, different images offer different views, and even though multiple images might contain the same object, the concept might vary significantly. For example, consider a dataset with images of a cat. Although the primary object, the cat, is common across all the images, the cat might be standing still in one image and playing with a ball in the next, and so on. Because images often present varying concepts, it's essential for the model to look at a significant number of images to grasp the variations around the same concept.
Scaling a model successfully so that it works efficiently with high-dimensional and complex image data needs two components:
- A convolutional neural network or CNN that’s large enough to capture & learn the visual concepts from a very large image dataset.
- An algorithm that can learn the patterns from a large amount of images without any labels, annotations, or metadata.
The SEER model aims to bring these two components together for computer vision. It builds on the advancements made by SwAV, a self-supervised learning framework that uses online clustering to group images with similar visual concepts and leverages these similarities to identify patterns better.
With the SwAV architecture, the SEER model makes the use of self-supervised learning in computer vision much more effective and reduces the training time by up to 6 times.
Furthermore, training models at large scale, in this case over 1 billion images, requires a model architecture that is efficient not only in terms of runtime and memory but also in accuracy. This is where the RegNet models come into play: these RegNets are ConvNet models that can scale to billions, potentially even trillions, of parameters and can be optimized to comply with memory limitations and runtime constraints.
Conclusion : A Self-Supervised Future
Self-supervised learning has been a major talking point in the AI and ML industry for a while now because it allows AI models to learn directly from the vast amounts of data available randomly on the internet, instead of relying on carefully curated, labeled datasets built for the sole purpose of training AI models.
Self-supervised learning is a vital concept for the future of AI and ML because it has the potential to allow developers to create AI models that adapt well to real-world scenarios and serve multiple use cases rather than a single specific purpose, and SEER is a milestone in the implementation of self-supervised learning in the computer vision industry.
The SEER model takes the first step in transforming the computer vision industry and reducing our dependence on labeled datasets. By eliminating the need to annotate the dataset, SEER lets developers work with diverse and large amounts of data. This is especially helpful for developers working on models in areas with limited images or metadata, such as the medical industry.
Furthermore, eliminating human annotations will allow developers to develop and deploy models quicker, which in turn will allow them to respond to rapidly evolving situations faster and with more accuracy.