The Three Key Changes Driving the Success of Pre-trained Foundation Models and Large Language Models (LLMs)

Large Language Models (LLMs) have attracted worldwide attention and immense popularity in the field of Natural Language Processing. They have made it possible to build intelligent systems that understand and produce language more fluently than ever before. Models such as GPT-3, T5, and PaLM have delivered steadily improving performance, and they are here to stay: they can hold human-like conversations, generate text, and summarize long passages. Studies of scaling behavior consistently find that LLMs perform better as they grow larger, and training them on huge volumes of data lets them pick up the syntax, semantics, and pragmatics of human language.

ChatGPT, the popular LLM developed by OpenAI, owes much of its growth to Reinforcement Learning from Human Feedback (RLHF). In RLHF, human preference judgments over model outputs are used to train a reward signal, which then guides fine-tuning of the pre-trained LLM for applications such as chatbots and virtual assistants; a toy sketch of the loop is shown below.
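The following is a minimal, self-contained sketch of the RLHF idea, not OpenAI's actual pipeline: real systems train a neural reward model and optimize the policy with PPO over full token sequences, whereas this toy fits a Bradley-Terry-style reward over three hypothetical completions and applies a REINFORCE-style update with a KL penalty that keeps the tuned policy close to the original one.

```python
import numpy as np

# Hypothetical candidate completions for a single prompt (toy example).
completions = ["polite, correct answer", "terse answer", "dismissive answer"]
n = len(completions)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# --- 1) Fit a scalar reward per completion from pairwise human preferences ---
# Each pair (winner, loser) records that a human preferred the first completion.
preference_pairs = [(0, 1), (0, 2), (1, 2)]
reward = np.zeros(n)
for _ in range(500):
    for winner, loser in preference_pairs:
        p = 1.0 / (1.0 + np.exp(reward[loser] - reward[winner]))  # Bradley-Terry
        reward[winner] += 0.1 * (1.0 - p)   # gradient ascent on log-likelihood
        reward[loser]  -= 0.1 * (1.0 - p)

# --- 2) Fine-tune the "policy" against the learned reward with a KL penalty ---
ref_logits = np.zeros(n)            # frozen pre-trained (reference) policy
policy_logits = ref_logits.copy()   # policy being fine-tuned
beta, lr = 0.1, 0.05                # KL penalty strength, learning rate

for _ in range(200):
    probs, ref_probs = softmax(policy_logits), softmax(ref_logits)
    shaped = reward - beta * np.log(probs / ref_probs)   # reward minus KL term
    advantage = shaped - probs @ shaped                  # centre the reward
    policy_logits += lr * probs * advantage              # REINFORCE-style step

print("learned rewards:", np.round(reward, 2))
print("tuned policy:   ", np.round(softmax(policy_logits), 2))
```

Beyond RLHF, the pre-trained foundation models on which LLMs like ChatGPT are built have also improved significantly in recent years. That improvement has mainly come from three changes.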

  • Scaling up the model has proven effective at improving performance. The Pathways Language Model (PaLM), for example, saw large gains in few-shot learning as it was scaled up. Few-shot learning reduces the number of task-specific training examples needed to adapt the model to a particular application (a prompt-format sketch follows this list). By training a 540-billion-parameter model on 6,144 TPU v4 chips with the Pathways system, PaLM showed that the benefits of scale keep accruing, outperforming many earlier models. Scaling both depth and width has thus been a major factor in the improved performance of foundation models.
  • Another change has been increasing the number of tokens seen during pre-training. Chinchilla demonstrated that large language models perform better when given more pre-training data: a compute-optimal 70B-parameter model trained on roughly four times as much data as the larger Gopher model under the same compute budget, it uniformly outperformed Gopher and also beat LLMs such as GPT-3, Jurassic-1, and Megatron-Turing NLG. The takeaway is that, for compute-optimal training, the number of training tokens should be scaled in step with model size: double the model size, double the training tokens (a back-of-the-envelope calculation follows this list).
  • The third change is the use of clean and diverse pre-training data. Galactica, a large language model designed to store, combine, and reason over scientific knowledge, was trained on a curated corpus of scientific papers and outperformed models such as GPT-3 and Chinchilla on scientific tasks. Similarly, BioMedLM, a domain-specific LLM for biomedical text, showed a large performance improvement when trained on domain-specific data, indicating that pre-training on carefully selected, in-domain data can beat pre-training on general-purpose data.
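To make the few-shot idea from the first bullet concrete, here is a small illustration of few-shot prompting: a handful of demonstrations are placed directly in the prompt and the model is asked to complete the pattern. The example reviews are made up, and `generate` is a placeholder for whatever text-generation call a given model exposes.

```python
# Few-shot prompting sketch: demonstrations go in the prompt itself, so no
# task-specific fine-tuning data is needed beyond these few examples.
few_shot_examples = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I want my two hours back.", "negative"),
    ("A serviceable but forgettable thriller.", "neutral"),
]

def build_prompt(query: str) -> str:
    lines = ["Classify the sentiment of each review."]
    for review, label in few_shot_examples:
        lines.append(f"Review: {review}\nSentiment: {label}")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

prompt = build_prompt("An astonishing, career-best performance.")
print(prompt)
# completion = generate(prompt)  # placeholder: call your model of choice here
```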
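The scaling rule from the second bullet can be checked with a quick back-of-the-envelope calculation. The sketch below uses the common approximation that training compute is about 6 x parameters x tokens FLOPs; the Gopher and Chinchilla figures are rounded values from the papers and are meant only to show that a smaller model trained on more tokens can land on a similar compute budget.

```python
import math

# Rough FLOPs estimate for one training run: C ~ 6 * N * D
# (N = parameters, D = training tokens); a common back-of-the-envelope rule.
def train_flops(params: float, tokens: float) -> float:
    return 6.0 * params * tokens

# Rounded figures: a Gopher-like run (280B params, ~300B tokens) versus
# Chinchilla (70B params, ~1.4T tokens) lands at a similar compute budget.
gopher_like = train_flops(280e9, 300e9)
chinchilla  = train_flops(70e9, 1.4e12)
print(f"Gopher-like: {gopher_like:.2e} FLOPs, Chinchilla: {chinchilla:.2e} FLOPs")

# Under the "scale parameters and tokens equally" rule, doubling the compute
# budget means growing BOTH by about sqrt(2), not just the model size.
for c_mult in (1, 2, 4, 8):
    growth = math.sqrt(c_mult)
    print(f"{c_mult}x compute -> ~{growth:.2f}x parameters and ~{growth:.2f}x tokens")
```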

The success of LLMs is undoubtedly due to a mixture of factors, including RLHF and the advances in pre-trained foundation models described above. Architecture matters as well: GLaM (Generalist Language Model) achieved large performance improvements by using a sparsely activated mixture-of-experts architecture, which scales the model's capacity at a lower training cost than a dense model of the same size (a minimal sketch of the idea follows). Together, these changes have paved the way for even more capable language models that will continue to make our lives easier.
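For intuition, here is a minimal, illustrative sketch of sparse mixture-of-experts routing, not GLaM's actual implementation: a gating network scores a set of expert feed-forward blocks and each token is processed by only its top-2 experts, so total capacity grows with the number of experts while per-token compute stays roughly constant. All dimensions and weights below are made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

# Each "expert" here is just a small feed-forward weight matrix.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
gate_w = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts and mix their outputs."""
    out = np.zeros_like(x)
    for t, token in enumerate(x):                 # x: (tokens, d_model)
        logits = token @ gate_w
        top = np.argsort(logits)[-top_k:]         # indices of the chosen experts
        weights = np.exp(logits[top] - logits[top].max())
        weights /= weights.sum()                  # softmax over the chosen experts
        for w, e in zip(weights, top):
            out[t] += w * (token @ experts[e])    # only k of n experts run per token
    return out

tokens = rng.standard_normal((4, d_model))
print(moe_layer(tokens).shape)  # (4, 16): same shape, ~top_k/n_experts of the expert FLOPs
```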



All credit for this research goes to the researchers on these projects. Special credit to the tweet from Cameron.


Some References and Resources:

  • MT-NLG: http://arxiv.org/abs/2201.11990
  • Chinchilla: http://arxiv.org/abs/2203.15556
  • PaLM: http://arxiv.org/abs/2204.02311
  • GLaM: http://arxiv.org/abs/2112.06905
  • BioMedLM: http://bit.ly/3KuE7GY
  • Galactica: http://arxiv.org/abs/2211.09085


Tanya Malhotra is a final year undergrad from the University of Petroleum & Energy Studies, Dehradun, pursuing BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with good analytical and critical thinking, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.



