Do Large Language Models Really Need All Those Layers? This AI Research Unmasks Model Efficiency and the Quest for the Essential Components of LLMs
The advent of large language models (LLMs) has sparked significant public interest, particularly with the emergence of ChatGPT. Trained on vast amounts of data, these models can learn new tasks in context from only a handful of examples, without any updates to their parameters. A paper presented at this year's meeting of the Association for Computational Linguistics (ACL) delves into the importance of model scale for in-context learning and examines the interpretability of LLM architectures.
The study focuses on the OPT-66B model, a 66-billion-parameter LLM developed by Meta as an open replica of GPT-3. By analyzing OPT-66B, the researchers sought to determine whether all components of LLMs are essential for in-context learning, aiming to provide insights into potential areas for improved training.
LLMs are built on the Transformer architecture, which relies on an attention mechanism: it lets the model weigh which prior tokens in a sequence to focus on when generating the current token. In practice, these models use multi-head attention, running multiple attention mechanisms in parallel. OPT-66B consists of 64 layers, each containing 72 attention heads, and in every layer the output of the multi-head attention passes through a separate feed-forward network (FFN).
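To make the architecture concrete, here is a minimal PyTorch sketch of a single decoder layer with the layer and head counts cited above; the hidden size of 9216, the 4x FFN expansion, ReLU activation, and pre-layer normalization are assumptions based on the published OPT family configurations, not details from this article.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One OPT-style decoder layer: multi-head self-attention followed by an FFN."""
    def __init__(self, d_model=9216, n_heads=72, d_ffn=4 * 9216):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)
        # 72 attention heads run in parallel inside one multi-head attention module.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn_norm = nn.LayerNorm(d_model)
        # Position-wise feed-forward network applied after attention in every layer.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ffn), nn.ReLU(), nn.Linear(d_ffn, d_model)
        )

    def forward(self, x, causal_mask):
        h = self.attn_norm(x)
        # The causal mask restricts each position to attend only to prior tokens.
        a, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + a
        return x + self.ffn(self.ffn_norm(x))

# OPT-66B stacks 64 such layers. Instantiating at full size needs tens of gigabytes
# of memory, so a quick local check can use smaller dimensions, for example:
# layer = DecoderLayer(d_model=64, n_heads=4, d_ffn=256)
```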
To investigate OPT-66B, the researchers employed two methods. First, they assigned a score to each attention head and FFN indicating how important it was for a given task, then used those scores to prune the model, discarding the lowest-scoring components. Surprisingly, they found that a significant portion of the model could be removed without affecting performance, suggesting that OPT-66B, and potentially other prominent LLMs, were undertrained.
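The article does not spell out how the importance scores are computed, so the following is only a minimal sketch of one common gradient-based proxy (in the spirit of Michel et al., 2019): multiply each head's output by a gate fixed at 1, accumulate the magnitude of the loss gradient with respect to that gate over task data, and mark the heads with the smallest accumulated values for pruning. The `head_gates` hook and the Hugging Face-style `model(**batch).loss` call are hypothetical stand-ins, not the paper's actual code.

```python
import torch

def head_importance(model, batches, head_gates):
    """Accumulate |dLoss/dGate| per head. `head_gates` is a (n_layers, n_heads)
    tensor of ones (requires_grad=True) that the model multiplies into each
    head's output via a hook -- a hypothetical setup, not an OPT/HF API."""
    importance = torch.zeros_like(head_gates)
    for batch in batches:
        loss = model(**batch).loss                      # task loss on in-context prompts
        grads = torch.autograd.grad(loss, head_gates)[0]
        importance += grads.abs().detach()
    return importance

def heads_to_prune(importance, fraction=0.7):
    """Boolean mask marking the least-important `fraction` of heads for removal."""
    flat = importance.flatten()
    k = max(1, int(fraction * flat.numel()))
    threshold = flat.kthvalue(k).values
    return importance <= threshold
```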
The researchers discovered that important attention heads predominantly resided in the intermediate layers of the model, while important FFNs were primarily located in the later layers. Strikingly, even after removing up to 70% (around 15.7 billion parameters) of the attention heads, the ability to perform zero- or few-shot in-context learning on 14 different natural language processing (NLP) datasets/tasks remained largely unaffected. Moreover, they identified a common subset of attention heads responsible for in-context learning across tasks and shots, indicating task-agnostic functionality. Furthermore, they observed that approximately 20% of the FFNs (around 8.5 billion parameters) could be removed with minimal impact on zero- or few-shot in-context learning.
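As a sanity check on those figures, a back-of-envelope calculation using OPT-66B's published shape (64 layers, hidden size 9216, a 4x FFN expansion, with biases and embeddings ignored) lands close to the reported numbers; the small gap comes from the terms the estimate leaves out.

```python
# Rough parameter accounting for OPT-66B (assumed shape: 64 layers, d_model=9216, 4x FFN).
d_model, n_layers, d_ffn = 9216, 64, 4 * 9216

attn_params = n_layers * 4 * d_model * d_model   # Q, K, V and output projections
ffn_params = n_layers * 2 * d_model * d_ffn      # two linear maps per FFN

print(f"attention: {attn_params / 1e9:.1f}B total, 70% ≈ {0.7 * attn_params / 1e9:.1f}B")
print(f"FFN:       {ffn_params / 1e9:.1f}B total, 20% ≈ {0.2 * ffn_params / 1e9:.1f}B")
# attention: 21.7B total, 70% ≈ 15.2B   (vs. ~15.7B reported removable)
# FFN:       43.5B total, 20% ≈ 8.7B    (vs. ~8.5B reported removable)
```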
For their second analytic technique, the researchers evaluated the capacity of all attention heads in OPT-66B to perform task-agnostic primitive operations associated with in-context learning: prefix matching (searching the context for a prior occurrence of the current token) and copying (reproducing the token that followed that occurrence). They found that only a small set of attention heads scored nontrivially on both primitives. Interestingly, these heads overlapped with the attention heads identified as important for specific tasks, suggesting their involvement in more sophisticated in-context learning behaviors, such as latent concept matching.
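Prefix matching is usually measured with a synthetic prompt made of two identical halves of random tokens: an induction-style head, sitting at a token in the second half, should attend to the position just after that token's earlier occurrence. The sketch below scores a single head's attention matrix that way; extracting per-head attention from OPT-66B itself would require model hooks, so the attention matrix here is a hypothetical input.

```python
import torch

def prefix_matching_score(attn_weights, half_len):
    """Prefix-matching score for one head on a sequence made of two identical
    halves of `half_len` random tokens (in the spirit of induction-head analyses).
    `attn_weights` is the head's (seq_len, seq_len) attention matrix.
    For the i-th token of the second half, the 'correct' place to look is
    position i + 1: the token that followed its earlier occurrence."""
    score, count = 0.0, 0
    for i in range(half_len - 1):          # skip the last token of the half
        query_pos = half_len + i           # current token, second half
        key_pos = i + 1                    # token after its earlier occurrence
        score += attn_weights[query_pos, key_pos].item()
        count += 1
    return score / count

# Usage on a toy, uniform attention pattern (a real run would use OPT-66B weights):
seq_len = 32
toy_attn = torch.full((seq_len, seq_len), 1.0 / seq_len)
print(prefix_matching_score(toy_attn, half_len=seq_len // 2))  # ~0.03 for uniform attention
```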
The study concluded that only a core group of attention heads and FFNs appeared crucial for in-context learning, implying that OPT-66B, and potentially other leading LLMs, were undertrained. This observation aligns with recent research questioning whether a fixed amount of pretraining data remains effective as models are scaled up: the findings suggest that models and pretraining data must be scaled in tandem to achieve optimal performance. Future investigations could explore how newer LLM variants, including those tailored to follow instructions, fare under similar analyses.
Check out the Paper and Blog.