PyTorch introduces ‘nvFuser’: a Deep Learning Compiler for NVIDIA GPUs that automatically just-in-time compiles fast and flexible kernels to reliably accelerate users’ networks

The PyTorch team recently released nvFuser, a deep learning compiler for NVIDIA GPUs. nvFuser automatically generates fast, flexible "fusion" kernels at runtime, significantly accelerating deep learning networks running on Volta and later CUDA accelerators. It supports a wide variety of network architectures, handles applications with dynamic inputs of varying shapes and strides, and was designed specifically around the needs of the PyTorch community.

To optimize and accelerate PyTorch operations, nvFuser works on graph representations of the user's program. Because PyTorch uses an eager execution model, a user's operations are not directly available as a complete program that a system like nvFuser can optimize. Intermediary systems are therefore needed to capture user programs and translate them into a form nvFuser can consume. These systems hand the captured operations to nvFuser, which then tailors their execution for NVIDIA GPUs.
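To make the fusion opportunity concrete, here is a minimal, hypothetical sketch (the bias_gelu helper, shapes, and constants are illustrative, not from the announcement). In eager mode, each pointwise operation launches its own CUDA kernel and round-trips through GPU memory; that repeated memory traffic is exactly what a fusion kernel removes.

```python
import torch

# Hypothetical chain of pointwise ops. In eager mode, each op below
# launches a separate CUDA kernel and reads/writes GPU memory; a fuser
# can combine the whole chain into a single kernel.
def bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    y = x + bias                                         # pointwise add (broadcast)
    return y * 0.5 * (1.0 + torch.erf(y * 0.70710678))   # exact GELU, also pointwise

x = torch.randn(1024, 1024, device="cuda")
bias = torch.randn(1024, device="cuda")
out = bias_gelu(x, bias)   # eager: one kernel launch per op, no fusion
```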

Three systems capture, translate, and send user programs to nvFuser for optimization. TorchScript's jit.script directly parses sections of an annotated Python script and converts them into its own representation of what the user is doing. It then applies its own form of automatic differentiation to the graph and sends sections of the resulting forward and backward graphs to nvFuser for optimization. FuncTorch does not directly examine the user's Python script; instead, it installs a mechanism that records PyTorch operations as they run. FuncTorch does not perform its own automatic differentiation: it reuses PyTorch's autograd to produce backward graphs. TorchDynamo is another program-acquisition tool, built on top of FuncTorch. It parses the Python bytecode generated from the user's script to choose which sections to trace with FuncTorch. TorchDynamo's advantage is that it can apply decorators to the user's program itself, effectively isolating what should be sent to FuncTorch and making it easier for FuncTorch to trace even intricate Python scripts.
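Of the three capture paths, TorchScript's jit.script is the simplest to show. The sketch below is hypothetical (the function, shapes, and warm-up loop are illustrative) and assumes a PyTorch build in which nvFuser serves as the TorchScript fuser for GPU graphs:

```python
import torch

@torch.jit.script  # TorchScript parses this function's annotated source
def fused_bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    y = x + bias
    return y * 0.5 * (1.0 + torch.erf(y * 0.70710678))

x = torch.randn(1024, 1024, device="cuda")
bias = torch.randn(1024, device="cuda")

# Early calls profile and compile; later calls hit the fused kernel the
# fuser generated for these particular shapes and strides.
for _ in range(3):
    out = fused_bias_gelu(x, bias)
```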

For the operations it supports, nvFuser can generate highly specialized and optimized GPU functions, which is what enables it to power new PyTorch systems like TorchDynamo and FuncTorch. These systems deliver captured user programs to nvFuser automatically. nvFuser then analyzes the GPU-bound operations, plans parallelization strategies for them, applies those strategies in generated GPU code, compiles the resulting CUDA kernels, and executes them. It is important to note that nvFuser does not yet support all PyTorch operations, and several of its features still have room for improvement. The number of performance-critical deep learning operations the compiler supports is expected to grow in future PyTorch releases.
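As an end-to-end sketch of this pipeline, one convenient entry point around the time of the release was functorch's memory_efficient_fusion wrapper; treating that as an assumption (it is not named in the text above), a hypothetical usage looks like this, with composite, the shapes, and the loss all illustrative:

```python
import torch
from functorch.compile import memory_efficient_fusion  # assumed entry point

# Pointwise chain whose forward and backward graphs can both be fused.
def composite(x, bias, residual):
    return torch.nn.functional.gelu(x + bias) + residual

x = torch.randn(1024, 1024, device="cuda", requires_grad=True)
bias = torch.randn(1024, device="cuda", requires_grad=True)
residual = torch.randn(1024, 1024, device="cuda")

# Records the ops as they run, builds the backward graph via PyTorch's
# autograd machinery, and hands both graphs to nvFuser for compilation.
fused = memory_efficient_fusion(composite)
out = fused(x, bias, residual)
out.sum().backward()   # the backward pass runs fused kernels as well
```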

nvFuser was evaluated on a range of models from the HuggingFace Transformers and PyTorch Image Models (TIMM) libraries to measure its training speedups. Paired with another significant optimization, the compiler can markedly accelerate the training of HuggingFace Transformers, with gains on a selection of popular HuggingFace Transformer networks ranging from 1.12x to 1.50x. nvFuser can also dramatically shorten the training time of TIMM networks: up to over 1.3x compared to eager PyTorch, and up to 1.44x when additionally combined with the torch.amp module.
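The torch.amp pairing works independently of how the program is captured; it is the standard PyTorch mixed-precision recipe. A generic sketch follows, with the toy model, shapes, and hyperparameters all illustrative rather than taken from the benchmarks:

```python
import torch

# Illustrative stand-in for a TIMM/HuggingFace model; any nn.Module fits.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024),
).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()   # loss scaling for mixed precision

x = torch.randn(64, 1024, device="cuda")
target = torch.randn(64, 1024, device="cuda")

for _ in range(10):
    opt.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():    # run eligible ops in half precision
        loss = torch.nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()
    scaler.step(opt)
    scaler.update()
```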

nvFuser's performance is not yet consistent across all situations because it has not been tuned for inference workloads. Even so, many models still benefit substantially from nvFuser during inference, and the PyTorch team encourages users to try the latest compiler. The team has also published a tutorial showing how to use nvFuser to speed up a standard transformer block and how to build fast, custom operations with it. The developers have said they are continuously working on nvFuser's current limitations so that they can be addressed in upcoming releases.

Reference: https://pytorch.org/blog/introducing-nvfuser-a-deep-learning-compiler-for-pytorch/

Github: https://github.com/pytorch/tutorials/blob/master/intermediate_source/nvfuser_intro_tutorial.py

Tutorial: https://pytorch.org/tutorials/intermediate/nvfuser_intro_tutorial.html
