A New Study from the University of Wisconsin Investigates How Small Transformers Trained from Random Initialization Can Efficiently Learn Arithmetic Operations Using the Next-Token Prediction Objective
Large language models such as GPT-3/4, PaLM, and LaMDA have shown general-purpose capabilities, and sometimes emergent abilities, across a range of downstream tasks, including language and code translation, compositional reasoning, and basic arithmetic. Perhaps surprisingly, these abilities are not directly encoded in the model's training objective, which is usually an autoregressive loss based on next-token prediction. Earlier studies have explored these abilities in depth, along with how they change with training compute, data type, and model size. However, given the complexity of the data and the range of tasks evaluated, it remains difficult to isolate the contributing factors. Curious about what elicits these abilities in next-token predictors, the researchers set out to identify the key ingredients that accelerate their emergence.
These factors include the format and size of the data, the model size, the presence of pretraining, and the prompting style. The work is carried out in a controlled setting to enable a more thorough analysis of each factor. The researchers from UW-Madison focus on teaching arithmetic to small transformer models, such as NanoGPT and GPT-2, trained from random initialization with the standard autoregressive next-token prediction loss, scaling from a 10.6-million-parameter model up to 124 million parameters. The goal is to understand how these models can efficiently learn basic arithmetic operations such as addition, subtraction, multiplication, square root, and sine, giving a deeper perspective on how emergent abilities are elicited. Their main findings are outlined below.
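To make the setup concrete, the following is a minimal sketch of the kind of character-level next-token prediction training described above: a small PyTorch transformer trained from random initialization on addition strings. This is not the authors' NanoGPT/GPT-2 code; the vocabulary, the "$" end marker, the padding scheme, and the hyperparameters are illustrative assumptions.

```python
# Minimal sketch (assumption, not the authors' code): character-level
# next-token prediction on addition strings, in the spirit of NanoGPT.
import random
import torch
import torch.nn as nn

VOCAB = list("0123456789+=$ ")            # '$' ends a sample; ' ' pads (assumption)
stoi = {c: i for i, c in enumerate(VOCAB)}

def make_sample(n_digits=3):
    a, b = random.randint(0, 10**n_digits - 1), random.randint(0, 10**n_digits - 1)
    return f"{a}+{b}={a+b}$"

def encode(s, length=16):
    s = s.ljust(length)                   # pad with spaces to a fixed context length
    return torch.tensor([stoi[c] for c in s])

class TinyGPT(nn.Module):
    def __init__(self, vocab=len(VOCAB), d=128, n_layers=4, n_heads=4, ctx=16):
        super().__init__()
        self.tok = nn.Embedding(vocab, d)
        self.pos = nn.Embedding(ctx, d)
        layer = nn.TransformerEncoderLayer(d, n_heads, 4 * d, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d, vocab)

    def forward(self, idx):
        T = idx.size(1)
        x = self.tok(idx) + self.pos(torch.arange(T, device=idx.device))
        # causal mask: each position may only attend to earlier positions
        mask = torch.triu(torch.full((T, T), float("-inf"), device=idx.device), diagonal=1)
        return self.head(self.blocks(x, mask=mask))

model = TinyGPT()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
for step in range(1000):
    batch = torch.stack([encode(make_sample()) for _ in range(64)])
    logits = model(batch[:, :-1])         # predict the next character at every position
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, len(VOCAB)), batch[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```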
- Sample size and data format are both important.
First, they observe that teaching a model addition with conventionally formatted samples, such as “A3A2A1 + B3B2B1 = C3C2C1,” is suboptimal, because it forces the model to emit the most significant digit C3 of the result first, even though that digit depends on all the digits of the two summands. Training instead on samples with the output reversed, such as “A3A2A1 + B3B2B1 = C1C2C3,” lets the model learn a simpler function and markedly improves sample efficiency. Learning improves further with balanced sampling over the many “variations” of addition, depending on the digits and carries involved. Even in this simple setting, they observe sharp phase transitions from 0% to 100% accuracy as a function of the amount of training data. Unexpectedly, they find that learning an addition map on n digits from random samples resembles completing a low-rank matrix, and this connection gives them a principled explanation for such phase transitions.
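As a concrete illustration of the two formats, here is a small, hypothetical Python generator for plain and output-reversed addition samples; the exact delimiters and templates used in the paper may differ.

```python
# Illustrative sketch of the two data formats (plain vs. reversed output).
import random

def addition_sample(n_digits=3, reverse_output=False):
    a = random.randint(0, 10**n_digits - 1)
    b = random.randint(0, 10**n_digits - 1)
    result = str(a + b)
    if reverse_output:
        result = result[::-1]            # least significant digit comes first
    return f"{a}+{b}={result}"

print(addition_sample())                     # plain:    e.g. "123+456=579"
print(addition_sample(reverse_output=True))  # reversed: e.g. "123+456=975"
```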
- Chain-of-thought data during training.
Building on these findings, they investigate the potential benefits of chain-of-thought (CoT) data during training. This format comprises step-by-step operations and intermediate outputs, allowing the model to learn the individual components of a harder task; the structure is borrowed directly from the chain-of-thought finetuning literature. Consistent with that literature, they find that CoT-style training data considerably improves learning in terms of both sample complexity and accuracy, and their results hold even in the absence of language pretraining. They hypothesize that this is because decomposing the required compositional function into individual components lets the model learn a higher-dimensional but simpler function map. Figure 1 of the paper shows examples of each of the four data-formatting approaches they study.
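The sketch below illustrates what a chain-of-thought (scratchpad-style) addition sample with intermediate digit-and-carry steps could look like; the actual template in the paper differs, so treat this as an assumption about the general shape of such data.

```python
# Hypothetical CoT-style addition sample with per-digit intermediate steps.
def cot_addition_sample(a: int, b: int) -> str:
    """Build a step-by-step (scratchpad) addition example as a training string."""
    da, db = str(a)[::-1], str(b)[::-1]          # digits, least significant first
    steps, carry, out_digits = [], 0, []
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        total = x + y + carry
        new_carry, digit = divmod(total, 10)
        steps.append(f"digit {i}: {x}+{y}+{carry}={total}, write {digit}, carry {new_carry}")
        out_digits.append(str(digit))
        carry = new_carry
    if carry:
        out_digits.append(str(carry))
    steps.append(f"answer: {''.join(reversed(out_digits))}")
    return f"{a}+{b}\n" + "\n".join(steps)

print(cot_addition_sample(128, 367))   # intermediate steps end with "answer: 495"
```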
- Training with text and math mixes.
Since LLMs are trained on enormous volumes of data scraped from the internet, where different kinds of data are hard to separate cleanly, they also examine how text and arithmetic data interact during training. They track how the ratio of text to arithmetic data affects the model’s perplexity and accuracy. They find that knowledge of previously seen arithmetic operations can improve performance on each individual task, and that switching from zero-shot to one-shot prompting significantly increases accuracy; however, the gains are smaller when more examples are provided.
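A minimal sketch of the two knobs discussed here, the text-to-arithmetic mixing ratio and zero-shot versus one-shot prompting, assuming hypothetical sample pools and prompt strings:

```python
# Sketch (assumption): mix plain text and arithmetic samples at a given ratio,
# and build zero-shot vs. one-shot prompts for evaluation.
import random

def mixed_corpus(text_samples, arithmetic_samples, text_fraction=0.5, size=10_000):
    """Draw a training corpus containing roughly `text_fraction` text samples."""
    return [
        random.choice(text_samples) if random.random() < text_fraction
        else random.choice(arithmetic_samples)
        for _ in range(size)
    ]

def make_prompt(query="152+39=", exemplar=None):
    """Zero-shot prompt if exemplar is None, otherwise one-shot."""
    return query if exemplar is None else f"{exemplar}\n{query}"

print(make_prompt())                        # zero-shot: "152+39="
print(make_prompt(exemplar="23+58=81"))     # one-shot: exemplar prepended to the query
```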
- Role of pre-training and model scale.
They also investigate the role of pretraining by finetuning models such as GPT-2 and GPT-3. They find that although zero-shot performance on arithmetic operations is poor, the “skills” acquired during pretraining enable reasonable performance on some basic arithmetic tasks even with a limited number of finetuning samples. However, because pretraining exposes the model to standard-formatted operations, finetuning on non-standard formats, such as reversed outputs, can interfere with the pretrained behavior and reduce accuracy. Finally, they study how scale affects arithmetic performance and find that while scale helps in learning arithmetic operations, it is not a prerequisite.
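For readers who want to reproduce the flavor of this experiment, here is a hedged sketch of finetuning a pretrained GPT-2 on arithmetic strings with the Hugging Face transformers library; the hyperparameters, sample format, and training length are assumptions, not the paper's exact recipe.

```python
# Hedged sketch (not the authors' exact setup): finetune pretrained GPT-2
# on addition strings using Hugging Face transformers and PyTorch.
import random
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

def batch(n=16, digits=3):
    texts = []
    for _ in range(n):
        a, b = random.randint(0, 10**digits - 1), random.randint(0, 10**digits - 1)
        texts.append(f"{a}+{b}={a+b}{tok.eos_token}")
    return tok(texts, return_tensors="pt", padding=True)

model.train()
for step in range(200):
    enc = batch()
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100   # ignore loss on padding positions
    # passing labels makes the model compute the next-token loss internally
    out = model(input_ids=enc["input_ids"],
                attention_mask=enc["attention_mask"],
                labels=labels)
    out.loss.backward()
    opt.step()
    opt.zero_grad()
```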
- Length and compositional generalization.
One may wonder whether the trained models acquire a solid understanding of arithmetic. The research offers a nuanced answer: the models struggle to generalize to digit lengths beyond those seen in training. For instance, a model trained on all n-digit lengths except one particular length has difficulty adjusting to, and correctly computing on, the missing length. As a result, the models perform well within the range of trained digit lengths but much worse outside of it. This suggests that the models learn arithmetic as a mapping confined to the trained digit lengths rather than as a flexible procedure; it goes beyond rote memorization but falls short of a thorough “understanding” of arithmetic.
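An evaluation along these lines can be sketched as accuracy measured separately per digit length, where a held-out length would be expected to show a sharp drop; the `model_answer` hook below is a hypothetical placeholder, not an API from the paper.

```python
# Sketch of a per-digit-length evaluation. `model_answer` is a hypothetical
# hook that generates a completion from the trained model and parses it.
import random

def model_answer(prompt: str) -> int:
    raise NotImplementedError("hook up your trained model here")

def accuracy_by_length(lengths=range(1, 8), n_eval=200):
    results = {}
    for k in lengths:
        lo = 10**(k - 1) if k > 1 else 0
        hi = 10**k - 1
        correct = 0
        for _ in range(n_eval):
            a, b = random.randint(lo, hi), random.randint(lo, hi)
            if model_answer(f"{a}+{b}=") == a + b:
                correct += 1
        results[k] = correct / n_eval
    return results   # trained lengths tend to score high; held-out lengths drop sharply
```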
- Novelty versus previous efforts.
The researchers do not claim novelty in the type of training data they use; rather, it draws heavily on earlier work that employs instructive data to improve model performance. What distinguishes their work is the emphasis on randomly initialized models and the in-depth ablations across sampling/data formats and model-scale settings, aimed at isolating the factors that lead to the rapid emergence of arithmetic capabilities. In addition, they offer simple but potentially illuminating theoretical explanations for some of the phenomena they observe.