Google Introduces TPU v4: An Optically Reconfigurable Machine-Learning Supercomputer With Hardware Support for Embeddings
Machine learning (ML) models continue to evolve in challenging ways, in both size and technique. Large language models (LLMs) exemplify the former, while Deep Learning Recommendation Models (DLRMs) and the massive computations of Transformers and BERT exemplify the latter. Google's ML supercomputer has grown from 256 TPU v2 nodes to 4,096 TPU v4 nodes because of the enormous size of recent LLMs. Reaching that scale introduces reliability issues, which are exacerbated by the fact that deep neural network (DNN) training is carried out in an HPC-style, checkpoint/restore, everything-must-work manner. That is very different from the software-level fault tolerance characteristic of Google's mainline distributed systems.
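The paper itself contains no code, but the checkpoint/restore style it contrasts with fault-tolerant serving can be pictured with a minimal sketch (all names and numbers below are hypothetical, not from the paper): training makes no attempt at live failover; it periodically snapshots the full state and, after any node failure, the whole job restarts from the newest snapshot.

```python
import pickle

def next_batch():                      # stand-in data source (illustrative)
    return 1.0

def update(params, batch):             # stand-in optimizer step (illustrative)
    return params + 0.01 * batch

def train(params, num_steps, ckpt_path="ckpt.pkl", every=1000):
    # HPC-style fault handling: every node must work between snapshots.
    # On failure, the job simply restarts from the latest checkpoint;
    # there is no per-request failover as in distributed serving systems.
    for step in range(num_steps):
        params = update(params, next_batch())
        if step % every == 0:
            with open(ckpt_path, "wb") as f:
                pickle.dump((step, params), f)  # snapshot full training state
    return params

print(train(0.0, 5000))
```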
Researchers from Google outlined three key TPU v4 enhancements that address these problems:
1. To overcome the challenges of scale and reliability, they introduced optical circuit switches (OCSes) with optical data links, enabling a 4K-node supercomputer to tolerate 1K CPU hosts that are unavailable 0.1%–1.0% of the time by reconfiguring around them.
2. They describe SparseCore (SC), the hardware support for embeddings used in DLRMs that has been a feature of TPUs since TPU v2 (see the embedding sketch after this list).
3. Beyond those two features, embeddings raise the requirements for supercomputer-scale networking by introducing all-to-all communication patterns. Unlike all-reduce, which is used in backpropagation and maps well onto 2D and 3D tori, all-to-all patterns stress bisection bandwidth. OCSes allow flexible topology construction, including topologies with improved bisection bandwidth (see the communication sketch after this list).
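The paper does not spell out SparseCore's interface; as a rough, hedged illustration of the memory-access pattern it accelerates, the sketch below (sizes and names are illustrative, not from the paper) performs a DLRM-style embedding lookup: many small, data-dependent gathers from a large table, followed by a pooling sum.

```python
import jax.numpy as jnp

vocab_size, dim = 100_000, 128           # illustrative sizes, not from the paper
table = jnp.zeros((vocab_size, dim))     # large embedding table

def embed(ids):
    # Irregular, data-dependent gathers dominate: tiny reads scattered
    # across a big table rather than dense matrix math, which is why a
    # dedicated unit such as SparseCore helps DLRMs.
    vectors = table[ids]                 # (batch, ids_per_example, dim)
    return vectors.sum(axis=1)           # pooled embedding per example

ids = jnp.array([[3, 17, 101],           # sparse categorical features
                 [42, 42, 7]])
print(embed(ids).shape)                  # (2, 128)
```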
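To make the contrast between the two communication patterns concrete, here is a hedged JAX sketch (simulating 8 devices on CPU; this is not Google's code): all-reduce leaves every device with the same sum and decomposes into neighbor-to-neighbor traffic on a torus, while all-to-all ships a distinct shard from every device to every other device, which is what stresses bisection bandwidth.

```python
import os
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=8"

from functools import partial
import jax
import jax.numpy as jnp

n = jax.device_count()                   # 8 simulated devices

@partial(jax.pmap, axis_name="i")
def all_reduce(x):
    # Backprop-style gradient aggregation: every device receives the same
    # sum, which maps onto nearest-neighbor rings of a 2D/3D torus.
    return jax.lax.psum(x, "i")

@partial(jax.pmap, axis_name="i")
def all_to_all(x):
    # Embedding-style exchange: device d sends its k-th chunk to device k,
    # so roughly half of all traffic must cross any bisection of the network.
    return jax.lax.all_to_all(x, "i", split_axis=0, concat_axis=0)

x = jnp.arange(n * n, dtype=jnp.float32).reshape(n, n)  # one row per device
print(all_reduce(x).shape, all_to_all(x).shape)          # (8, 8) (8, 8)
```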
LLMs are now a hot topic in the ML community. The OCSes in TPU v4 were initially motivated by scale and reliability, but their topological flexibility and deployment benefits ended up greatly reducing LLM training time. Although the principles behind earlier TPUs for training and for inference have been covered in previous publications, this paper concentrates on the three novel aspects of TPU v4 that have not previously been described.
The paper’s main contributions are as follows:
- It discusses and evaluates the first production deployment of OCSes in a supercomputer, and the first that allows topology changes for performance improvement.
- It discusses and evaluates the first hardware embedding accelerator in a commercial ML system.
- It details the rapid evolution of production model types since 2016 in the fast-moving ML sector.
- It demonstrates how Google co-optimizes DNN models, OCS topology, and the SparseCore using machine learning (a topology sketch follows this list).
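As a back-of-the-envelope illustration of the topology flexibility the contributions touch on (the shapes and chip count below are illustrative, not taken from the paper), one can enumerate 3D-torus shapes for a fixed chip count and compare their bisection widths; OCS reconfiguration is what makes choosing among such shapes per job practical.

```python
def bisection_links(a, b, c):
    # Cutting an a*b*c torus across its longest (even) dimension severs
    # two links per ring (one per wrap direction) for each of the b*c rings.
    a, b, c = sorted((a, b, c), reverse=True)
    return 2 * b * c

# Candidate shapes for a 64-chip slice (illustrative only).
for shape in [(4, 4, 4), (8, 4, 2), (16, 2, 2), (64, 1, 1)]:
    print(shape, "->", bisection_links(*shape), "bisection links")
# Cube-like shapes maximize bisection, which favors all-to-all traffic;
# elongated shapes trade it away.
```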
Check out the Paper.