Large Language Models (LLMs) like GPT-3 and ChatGPT have revolutionized AI by offering natural language understanding and content generation capabilities. But their development comes at a hefty price, limiting accessibility and further research. Researchers estimate that training GPT-3 cost OpenAI around $5 million. Nevertheless, Microsoft recognized the potential and invested $1 billion in 2019 and a reported $10 billion in 2023 in OpenAI’s GPT-3 and ChatGPT venture.
LLMs are machine learning models trained on extensive textual data for natural language processing (NLP) applications. They are based on the transformer architecture and use attention mechanisms for tasks such as question answering, machine translation, and sentiment analysis.
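For intuition, here is a minimal sketch of the scaled dot-product attention operation at the core of the transformer architecture. The shapes and inputs are illustrative toy values, not taken from any particular model.

```python
# Minimal sketch of scaled dot-product attention, the core operation behind
# transformer-based LLMs. Shapes and values are illustrative only.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (seq_len, d_k). Returns attended values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # token-to-token similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over keys
    return weights @ V                                    # weighted sum of values

# Toy usage: 4 tokens, each with an 8-dimensional representation
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)        # (4, 8)
```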
The question arises: can the efficiency of these large models be increased while simultaneously reducing computational cost and training time?
Several approaches, like Progressive Neural Networks, Network Morphism, intra-layer model parallelism, and knowledge inheritance, have been developed to reduce the computational cost of training neural networks. The novel LiGO (Linear Growth Operator) approach discussed here sets a new benchmark by roughly halving the computational cost of training LLMs.
Before discussing this technique, examining the factors contributing to the high price of making LLMs is essential.
Cost of Building Large Language Models
Three major expenses for developing LLMs are as follows:
1. Computational Resources
Building LLMs requires massive computational resources for training on large datasets. These models must process billions of parameters and learn complex patterns from enormous volumes of text.
Achieving state-of-the-art performance requires investment in specialized hardware such as Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs).
For instance, GPT-3 was trained on a Microsoft-built supercomputer with 10,000 enterprise-grade NVIDIA GPUs and 285,000 CPU cores.
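The scale becomes clearer with a back-of-envelope calculation. The figures below (fp16 weights, 80 GB of memory per GPU) are assumptions for illustration, not official specifications of the GPT-3 training setup.

```python
# Back-of-envelope estimate (assumed figures, not official ones) of why a
# 175-billion-parameter model cannot even be stored on a single GPU.
params = 175e9                  # GPT-3 parameter count
bytes_per_param = 2             # assuming fp16 weights
weights_gb = params * bytes_per_param / 1e9
print(f"Model weights alone: ~{weights_gb:.0f} GB")              # ~350 GB

# Training also keeps gradients and optimizer state (Adam adds two more
# tensors per parameter), so the real footprint is several times larger.
gpu_memory_gb = 80              # assumed memory of one high-end data-center GPU
print(f"GPUs needed just to hold the weights: {weights_gb / gpu_memory_gb:.1f}+")
```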
2. Energy Consumption
The intensive computational resources required for building LLMs result in significant energy consumption. For instance, training the 175-billion-parameter GPT-3 took 14.8 days on 10,000 V100 GPUs, equivalent to about 3.55 million GPU hours. Such a high level of energy consumption also carries a significant environmental footprint.
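The GPU-hour figure above can be reproduced with simple arithmetic, and a rough energy estimate follows from an assumed per-GPU power draw (the 300 W value below is an assumption for a V100-class card, not a reported measurement):

```python
# Reproducing the GPU-hour figure quoted above and deriving a rough energy
# estimate. The 300 W per-GPU draw is an assumed average for a V100-class card.
num_gpus = 10_000
days = 14.8
gpu_hours = num_gpus * days * 24
print(f"GPU hours: {gpu_hours / 1e6:.2f} million")               # ~3.55 million

watts_per_gpu = 300                                              # assumption
energy_gwh = gpu_hours * watts_per_gpu / 1e9                     # Wh -> GWh
print(f"GPU energy alone: ~{energy_gwh:.2f} GWh, before cooling and overhead")
```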
3. Data Storage & Management
LLMs are trained on large datasets. For instance, GPT-3 was trained on a vast corpus of textual data, including Common Crawl, WebText2, Books1, Books2, and Wikipedia, among other sources. Significant infrastructure investment is required to collect, curate and store these datasets.
In addition, cloud storage is needed to hold the data, and human expertise is needed for preprocessing and version control. Ensuring that the data strategy complies with regulations such as GDPR adds further cost.
LiGO Technique: Reduce the Cost of Building Large Language Models to Half
LiGO (Linear Growth Operator) is a novel technique developed by researchers at MIT to reduce the computational cost of training LLMs by 50%. The method involves initializing the weights of larger models from those of smaller pre-trained models, enabling efficient scaling of neural networks.
Yoon Kim, the senior author of the paper, says:
“It’s been estimated that training models at the scale of what ChatGPT is hypothesized to run on could take millions of dollars just for a single training run. Can we improve the efficiency of these training methods, so we can still get good models in less time and for less money? We propose to do this by leveraging smaller language models that have previously been trained.”
This method retains the performance benefits of larger models while reducing computational cost and training time compared to training a large model from scratch. LiGO uses a data-driven linear growth operator that combines width and depth operators, both learned from data, to initialize the larger model effectively.
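The sketch below illustrates the general idea of width growth with a learned linear map. It is a heavily simplified illustration under assumed names and shapes, not the authors' implementation; the depth operator, which composes existing layers in a similar learned way, is only described in the comments.

```python
# Simplified sketch of the idea behind LiGO (not the authors' code): the
# weights of a wider layer are initialized as a learned linear transformation
# of the smaller pretrained layer's weights. All names and shapes are
# illustrative assumptions.
import torch
import torch.nn as nn

d_small, d_large = 256, 512

# Stand-in for a pretrained weight matrix of one layer in the small model
W_small = torch.randn(d_small, d_small)

# Learnable width-expansion operators: one grows the output dimension,
# the other grows the input dimension.
A = nn.Parameter(torch.randn(d_large, d_small) * 0.02)
B = nn.Parameter(torch.randn(d_large, d_small) * 0.02)

def grow_width(W_small, A, B):
    """Initialize a (d_large x d_large) weight from a (d_small x d_small) one."""
    return A @ W_small @ B.T

W_large_init = grow_width(W_small, A, B)
print(W_large_init.shape)   # torch.Size([512, 512])

# In LiGO, these growth operators (plus analogous depth operators that mix
# existing layers) are fit briefly on data so the grown model starts close to
# the small model's behavior; the large model is then trained as usual,
# cutting total training compute roughly in half.
```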
The paper utilized various datasets to conduct text-based experiments, including the English Wikipedia corpus for training BERT and RoBERTa models and the C4 dataset for training GPT2.
The LiGO technique experimentation included growing BERT-Small to BERT-Base, BERT-Base to BERT-Large, RoBERTa-Small to RoBERTa-Base, GPT2-Base to GPT2-Medium, and CaiT-XS to CaiT-S.
The researchers compared their approach with several other baselines, including training from scratch, progressive training, bert2BERT, and KI.
The LiGO technique offered 44.7% savings in FLOPs (floating-point operations) and 40.7% savings in wall-clock time compared to training BERT-Base from scratch, by reusing the BERT-Small model. The LiGO growth operator also outperforms StackBERT, MSLT, bert2BERT, and KI in efficient training.
Benefits of Using a Training Optimization Technique Like LiGO
LiGO is an efficient neural network training method with several benefits:
1. Faster Training
As stated earlier, faster training is the main advantage of the LiGO technique. It trains LLMs in roughly half the time, increasing productivity and reducing costs.
2. Resource Efficient
LiGO is resource-efficient since it minimizes wall time and FLOPs, leading to a more cost-effective and eco-friendly approach to training large transformer models.
3. Generalization
The LiGO technique has improved the training of both language and vision transformers, suggesting that it is a generalizable technique that can be applied to various tasks.
Building commercial AI products is just one facet of the overall expenses associated with AI systems. Another significant component comes from daily operations: it reportedly costs OpenAI about $700,000 a day to answer queries with ChatGPT. Researchers are expected to keep exploring approaches that make LLMs cheaper to train and more affordable to run at inference time.
For more AI-related content, visit unite.ai.