Microsoft AI Introduces DeBERTa-V3: A Novel Pre-Training Paradigm for Language Models Based on the Combination of DeBERTa and ELECTRA

Natural Language Processing (NLP) and Natural Language Understanding (NLU) have long been two of the primary goals in the field of Artificial Intelligence. With the introduction of Large Language Models (LLMs), these domains have seen rapid progress. These pre-trained neural language models belong to the family of generative AI and are setting new benchmarks on tasks such as language comprehension, text generation, and question answering by imitating humans.

The famous BERT (Bidirectional Encoder Representations from Transformers) model, which achieves state-of-the-art results on a wide range of NLP tasks, was improved upon last year by a new model architecture. This model, called DeBERTa (Decoding-enhanced BERT with disentangled attention) and released by Microsoft Research, improves on BERT and RoBERTa using two novel techniques. The first is a disentangled attention mechanism in which each word is represented by two separate vectors: one that encodes its content and another that encodes its position. This allows the model to better capture the relationships between words and their positions in a sentence. The second technique is an enhanced mask decoder that replaces the output softmax layer to predict the masked tokens during model pre-training.
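
To make the disentangled attention idea concrete, here is a minimal, illustrative sketch (not the official DeBERTa implementation): each token gets a content vector and a shared table of relative-position vectors, and the attention score combines content-to-content, content-to-position, and position-to-content terms. The projection matrices, sizes, and bucketing function below are simplified assumptions for illustration.

```python
# Toy sketch of disentangled attention (illustration only, not the official DeBERTa code).
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 4, 8                        # toy sizes (assumed)
H = rng.normal(size=(seq_len, d))        # content embeddings (one per token)
k = 2 * seq_len                          # number of relative-position buckets
P = rng.normal(size=(k, d))              # relative-position embeddings (shared)

# Hypothetical projection matrices for content and position queries/keys.
Wq_c, Wk_c = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Wq_r, Wk_r = rng.normal(size=(d, d)), rng.normal(size=(d, d))

Qc, Kc = H @ Wq_c, H @ Wk_c              # content queries/keys
Qr, Kr = P @ Wq_r, P @ Wk_r              # relative-position queries/keys

def rel_bucket(i, j):
    # Map the signed distance (i - j) into [0, k) -- simplified bucketing.
    return int(np.clip(i - j + seq_len, 0, k - 1))

scores = np.zeros((seq_len, seq_len))
for i in range(seq_len):
    for j in range(seq_len):
        c2c = Qc[i] @ Kc[j]                    # content -> content
        c2p = Qc[i] @ Kr[rel_bucket(i, j)]     # content -> position
        p2c = Kc[j] @ Qr[rel_bucket(j, i)]     # position -> content
        scores[i, j] = (c2c + c2p + p2c) / np.sqrt(3 * d)

attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
print(attn.round(3))
```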

Now comes an improved version of the DeBERTa model, called DeBERTaV3. This open-source release improves the original DeBERTa model with a more sample-efficient pre-training task. Compared with earlier versions, DeBERTaV3 is better at understanding language and keeping track of the order of words in a sentence. It uses a method called “self-attention” to look at all the words in a sentence and derive each word’s context from the words around it.
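
In practice, this means every token comes out with a representation conditioned on the whole sentence. The sketch below shows one way to obtain such contextual representations from a pre-trained DeBERTaV3 checkpoint via Hugging Face Transformers; it assumes the `transformers`, `torch`, and `sentencepiece` packages are installed.

```python
# Sketch: contextual token representations from a pre-trained DeBERTaV3 checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModel.from_pretrained("microsoft/deberta-v3-base")

inputs = tokenizer("DeBERTaV3 builds contextual word representations.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per token, each conditioned on the whole sentence via self-attention.
print(outputs.last_hidden_state.shape)   # (1, num_tokens, hidden_size)
```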

DeBERTaV3 improves on the original model in two ways. First, it replaces masked language modeling (MLM) with replaced token detection (RTD), a more sample-efficient pre-training task. Second, it introduces a new way of sharing token embeddings between the generator and the discriminator. The researchers found that the vanilla embedding sharing used in ELECTRA actually hurt training efficiency and model quality, because the two components pull the shared embeddings toward conflicting objectives. This led them to develop a new sharing scheme, called gradient-disentangled embedding sharing, which improves both the efficiency and the quality of the pre-trained model.
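
The following is a minimal sketch of the gradient-disentangled idea, not the official implementation: the discriminator re-uses the generator’s token embeddings, but its gradients are blocked (detached) so the RTD loss cannot pull the shared embeddings away from what MLM needs, while a small residual embedding table adapts them for the discriminator. Sizes and names here are illustrative assumptions.

```python
# Sketch of gradient-disentangled embedding sharing (GDES), illustration only.
import torch
import torch.nn as nn

vocab_size, hidden = 1000, 64                 # toy sizes (assumed)

E_shared = nn.Embedding(vocab_size, hidden)   # updated by the generator's MLM loss
Delta_E = nn.Embedding(vocab_size, hidden)    # discriminator-only correction
nn.init.zeros_(Delta_E.weight)

def generator_embed(token_ids):
    # MLM gradients flow into E_shared as usual.
    return E_shared(token_ids)

def discriminator_embed(token_ids):
    # RTD gradients reach only Delta_E; E_shared is detached (stop-gradient).
    return E_shared(token_ids).detach() + Delta_E(token_ids)

tokens = torch.randint(0, vocab_size, (2, 8))
g_emb = generator_embed(tokens)        # feeds the MLM generator
d_emb = discriminator_embed(tokens)    # feeds the RTD discriminator
print(g_emb.shape, d_emb.shape)        # same shapes, different gradient paths
```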


The researchers trained three versions of the DeBERTaV3 model and tested them on different NLU tasks, where they outperformed previous models on various benchmarks. DeBERTaV3-large improved the GLUE benchmark score by 1.37%, DeBERTaV3-base performed better on MNLI-matched and SQuAD v2.0 by 1.8% and 2.2%, respectively, and DeBERTaV3-small outperformed previous models on MNLI-matched and SQuAD v2.0 by more than 1.2% in accuracy and 1.3% in F1, respectively.
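
For readers who want to try these checkpoints on an NLU task such as MNLI, the sketch below loads the three publicly released sizes with a sequence-classification head via Hugging Face Transformers. Fine-tuning details (data, hyperparameters) are omitted; this only prepares the models.

```python
# Sketch: loading the three DeBERTaV3 sizes for a 3-way NLI classification task.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

for checkpoint in ("microsoft/deberta-v3-small",
                   "microsoft/deberta-v3-base",
                   "microsoft/deberta-v3-large"):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=3)  # MNLI labels: entailment / neutral / contradiction
    n_params = sum(p.numel() for p in model.parameters()) // 1_000_000
    print(f"{checkpoint}: ~{n_params}M parameters")
```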

DeBERTaV3 is a significant advancement in the field of NLP with a wide range of use cases. It is also capable of processing up to 4,096 tokens in a single pass, considerably more than models such as BERT and GPT-3, which makes it useful for lengthy documents where large volumes of text must be processed or analyzed. Overall, these comparisons show that DeBERTaV3 models are efficient and set a strong foundation for future research in language understanding.


Check out the Paper and Github. All credit for this research goes to the researchers on this project. Also, don’t forget to join our 16k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.


Tanya Malhotra is a final-year undergraduate at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical-thinking skills, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.

