This AI Paper Unveils the Key to Extending Language Models to 128K Contexts with Continual Pretraining

A context window of 128K tokens lets large language models take on tasks that go beyond current paradigms, such as reading code at the repository level, modeling long-history dialogs, and powering autonomous language-model agents. The recent Needle-in-a-Haystack test has become a popular way to check whether models can actually use such long contexts: the model is asked to accurately repeat the information in a given sentence, where that sentence is placed at an arbitrary location within a 128K-token document.
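As a rough illustration of how such a probe can be set up, the sketch below builds a single Needle-in-a-Haystack prompt by inserting a "needle" sentence at a chosen depth inside a long filler document. The needle text, question, filler file, and tokenizer choice are illustrative assumptions (using the Hugging Face transformers library), not the test's official implementation.

```python
# Minimal sketch of constructing one Needle-in-a-Haystack prompt.
# The needle/question strings, filler file, and tokenizer are illustrative assumptions.
from transformers import AutoTokenizer

NEEDLE = "The best thing to do in San Francisco is to eat a sandwich in Dolores Park."
QUESTION = "What is the best thing to do in San Francisco?"

def build_haystack_prompt(filler_text, needle, question, tokenizer,
                          context_len=128_000, depth=0.5):
    """Insert `needle` at relative `depth` inside roughly `context_len` tokens of filler."""
    filler_ids = tokenizer(filler_text, add_special_tokens=False)["input_ids"][:context_len]
    cut = int(len(filler_ids) * depth)
    prefix = tokenizer.decode(filler_ids[:cut])
    suffix = tokenizer.decode(filler_ids[cut:])
    return f"{prefix}\n{needle}\n{suffix}\n\nQuestion: {question}\nAnswer:"

# Usage: sweep `depth` from 0.0 to 1.0 and `context_len` up to 128K, feed the prompt
# to the model, and check whether its answer reproduces the needle's key information.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
prompt = build_haystack_prompt(open("filler.txt").read(), NEEDLE, QUESTION, tokenizer)
```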

A recent study by researchers at the University of Edinburgh, MIT-IBM Watson AI Lab, University of Washington, MIT, University of Melbourne, Ohio State University, and UIUC examines data engineering techniques for extending the context lengths of language models. They continually pretrain a language model on suitable data mixtures so that it passes the Needle-in-a-Haystack test at 128K length. Continual pretraining with full attention on significantly longer contexts (the authors train on 64K-80K context lengths) may appear prohibitively costly at first glance, given that most existing models are trained on contexts shorter than 4K tokens and that attention has quadratic complexity.

The team’s base models are LLaMA-2 7B and 13B. Aside from adjusting the base frequency of RoPE (rotary position embeddings), they did not alter the model architecture in any major way.
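To make that adjustment concrete, the sketch below shows how the RoPE base controls the rotation frequencies used for positional encoding; enlarging the base slows the rotations so that positions far beyond the original 4K training length remain distinguishable. The enlarged base value (5e6) is an illustrative assumption, not necessarily the exact setting used in the paper.

```python
# Hedged sketch of how increasing RoPE's frequency base stretches the usable context.
# LLaMA-2 uses base 10,000; the larger base below (5e6) is an illustrative choice.
import torch

def rope_frequencies(head_dim: int, base: float) -> torch.Tensor:
    """Per-dimension rotation frequencies used by rotary position embeddings."""
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

def rope_angles(positions: torch.Tensor, head_dim: int, base: float) -> torch.Tensor:
    """Rotation angles for each (position, frequency) pair; a larger base rotates
    more slowly, keeping distant positions distinguishable at 64K-128K lengths."""
    freqs = rope_frequencies(head_dim, base)
    return torch.outer(positions.float(), freqs)  # shape: (seq_len, head_dim // 2)

positions = torch.arange(131_072)                             # a 128K-token sequence
angles_short = rope_angles(positions, 128, base=10_000.0)     # original LLaMA-2 base
angles_long = rope_angles(positions, 128, base=5_000_000.0)   # enlarged base for long context
```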

Most of their attention goes to the data recipe, i.e., the ingredients needed to train a model to succeed at the Needle-in-a-Haystack test with a 128K context length. The researchers postulate that the capacity to use information at arbitrary positions within an extended context is (largely) already acquired during pretraining, even for models pretrained on much shorter 4K contexts. In contrast to this hypothesis, existing work performs continual pretraining on massive datasets (around 400B tokens) to instill long-context modeling capabilities, an approach that can be nearly as expensive as pretraining from scratch.

In this study, the team demonstrates that a 7B model can be “unlocked” to perform accurate retrieval over context lengths significantly longer than those seen during original pretraining by continually pretraining on a small amount of long-context data, in this case 1-5B tokens. In addition, they show that previous studies neglected the need to upsample long sequences while keeping the domain mixture of the pretraining corpora unchanged, even though this is critical for context scaling. Most previous work, exemplified by LongChat 32K and YaRN Mistral 128K, upsamples entire domains that contain long sequences, because domains like books supply the necessary long-sequence data. But as the paper argues, this obvious answer is not the best one, since it degrades performance on the other domains. For the most consistent improvement, it is better to use a data mixture that keeps the same domain mixing ratio as the pretraining mixture and then upsamples long sequences within each domain, as sketched below.
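The following is a minimal sketch of that per-domain upsampling idea: each domain keeps its original share of the token budget, but long documents within a domain are sampled more often. The domain names, ratios, length threshold, and boost factor are made-up placeholders, not the paper’s actual recipe.

```python
# Illustrative sketch: preserve each domain's share of the token budget (its original
# pretraining mixing ratio) while sampling long documents within each domain more often.
import random

def sample_mixture(domains, token_budget, long_threshold=32_768, long_boost=5.0):
    """domains: {name: {"ratio": float, "docs": [(doc_id, n_tokens), ...]}}"""
    picked = []
    for name, spec in domains.items():
        domain_budget = int(token_budget * spec["ratio"])        # keep pretraining ratio
        weights = [long_boost if n >= long_threshold else 1.0    # upsample long docs only
                   for _, n in spec["docs"]]
        used = 0
        while used < domain_budget:
            doc_id, n = random.choices(spec["docs"], weights=weights, k=1)[0]
            picked.append((name, doc_id))
            used += n
    return picked

# Hypothetical mixture with placeholder ratios and document lengths.
mixture = {
    "web":   {"ratio": 0.82, "docs": [("w1", 2_000), ("w2", 80_000)]},
    "books": {"ratio": 0.05, "docs": [("b1", 150_000), ("b2", 40_000)]},
    "code":  {"ratio": 0.13, "docs": [("c1", 10_000), ("c2", 70_000)]},
}
batch = sample_mixture(mixture, token_budget=1_000_000)
```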

Compared with strong baselines such as YaRN-Mistral 128K and LongLoRA 100K, the findings indicate that this data recipe is the fundamental reason for the solution’s improved long-context task performance while preserving short-context performance.

On the retrieval challenge, the team believes their approach narrows the gap to frontier models like GPT-4 128K and lays the groundwork for future research on long-context instruction fine-tuning.


Check out the Paper and Github. All credit for this research goes to the researchers of this project.




