Meet BiLLM: A Novel Post-Training Binary Quantization Method Specifically Tailored for Compressing Pre-Trained LLMs
Pretrained large language models (LLMs) boast remarkable language processing abilities but require substantial computational resources. Binarization, which reduces model weights to a single bit, offers a solution by drastically cutting computation and memory demands. However, existing quantization techniques struggle to maintain LLM performance at such low bit widths, making it difficult to deploy LLMs efficiently while preserving their effectiveness across language processing tasks.
Recent works have highlighted the exceptional performance of LLMs like OPT and LLaMA across various benchmarks, but their deployment on memory-constrained devices remains challenging. Model quantization, particularly Post-Training Quantization (PTQ), effectively compresses LLMs and reduces GPU memory consumption. While PTQ methods have succeeded at 8-bit and 4-bit quantization, the expanding size of LLMs calls for more aggressive approaches such as neural network binarization. Existing PTQ methods, however, suffer performance collapse under ultra-low-bit quantization.
Researchers from the University of Hong Kong, Beihang University, and ETH Zurich introduced BiLLM, a groundbreaking 1-bit post-training quantization scheme designed for pre-trained LLMs. BiLLM utilizes weight distribution analysis to identify salient weights and employs a binary residual approximation strategy to minimize compression loss. It also introduces an optimal splitting search for accurate binarization of non-salient weights with a bell-shaped distribution.
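The two-stage idea behind the binary residual approximation can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration rather than the authors' implementation: the helper names `binarize` and `residual_binarize` are hypothetical, and the per-group scale is simply the mean absolute value, the closed-form optimum for a fixed sign matrix.

```python
import torch

def binarize(w):
    # w ≈ alpha * sign(w); alpha = mean(|w|) minimizes the L2 error for a fixed sign matrix
    alpha = w.abs().mean()
    return alpha * torch.sign(w), alpha

def residual_binarize(w):
    # Illustrative sketch of binary residual approximation (not the official BiLLM code):
    # binarize the salient weight group once, then binarize the leftover residual,
    # so the group is represented by two binary matrices and two scalars.
    b1, a1 = binarize(w)      # first-order binary approximation
    resid = w - b1            # what the first pass missed
    b2, a2 = binarize(resid)  # binarize the residual a second time
    return b1 + b2            # ≈ a1*sign(w) + a2*sign(resid)

# toy usage on a random "salient" weight group
w = torch.randn(128, 128)
w_hat = residual_binarize(w)
print("relative error:", ((w - w_hat).norm() / w.norm()).item())
```

The second binarization pass targets exactly the error left behind by the first, which is why salient weights tolerate 1-bit compression better under this scheme than under a single binarization.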
BiLLM introduces a novel 1-bit post-training quantization method for LLMs, leveraging weight sensitivity analysis via the Hessian matrix. It employs a structured selection of salient weights and optimal splitting for non-salient weights, minimizing quantization error. BiLLM implements binary residual approximation for salient weights and bell-shaped distribution splitting for non-salient ones, achieving high-accuracy inference with ultra-low bit widths and efficient deployment on GPUs.
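The optimal splitting for non-salient weights can likewise be illustrated with a short sketch. The code below is a simplified assumption-laden example: it omits the Hessian-based sensitivity analysis, the helper names `binarize_group` and `split_binarize` are hypothetical, and it simply sweeps candidate break-points, binarizing the concentrated region (|w| ≤ p) and the sparse tails separately and keeping the split with the lowest reconstruction error.

```python
import torch

def binarize_group(w):
    # Binarize one group: alpha = mean(|w|); return zeros for an empty group.
    if w.numel() == 0:
        return w, torch.tensor(0.0)
    alpha = w.abs().mean()
    return alpha * torch.sign(w), alpha

def split_binarize(w, num_candidates=50):
    # Sketch of a bell-shaped splitting search: try candidate break-points p,
    # binarize the concentrated (|w| <= p) and sparse (|w| > p) regions separately,
    # and keep the split that minimizes the squared reconstruction error.
    candidates = torch.linspace(0.0, w.abs().max().item(), num_candidates + 1)[1:]
    best_err, best_w_hat = None, None
    for p in candidates:
        mask = w.abs() <= p
        w_hat = torch.zeros_like(w)
        w_hat[mask], _ = binarize_group(w[mask])
        w_hat[~mask], _ = binarize_group(w[~mask])
        err = (w - w_hat).pow(2).sum()
        if best_err is None or err < best_err:
            best_err, best_w_hat = err, w_hat
    return best_w_hat

# toy usage on a bell-shaped (Gaussian) weight group
w = torch.randn(256, 256)
print("relative error:", ((w - split_binarize(w)).norm() / w.norm()).item())
```

Because non-salient weights cluster near zero with long tails, giving the concentrated and sparse regions their own binary scales recovers far more of the distribution than a single scale would.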
BiLLM, implemented with the PyTorch and Hugging Face libraries, presents a groundbreaking 1-bit PTQ framework for LLMs. It surpasses existing methods like GPTQ and PB-LLM, achieving superior perplexity results across various model sizes and datasets, including WikiText2, PTB, and C4. BiLLM’s structured salient binarization and optimal splitting of non-salient weights significantly enhance binary performance, demonstrating its universal applicability and robustness in diverse LLM settings.
In conclusion, researchers from the University of Hong Kong, Beihang University, and ETH Zurich introduced BiLLM, a novel post-training binary quantization method for compressing pre-trained LLMs. By leveraging binary residual approximation for salient weights and optimal segmentation for non-salient ones, BiLLM achieves ultra-low-bit quantization without significant loss of precision. It sets a new frontier in LLMs’ bit-width quantization, enabling deployment in edge scenarios and on resource-constrained devices while maintaining performance guarantees.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in mechanical engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching the applications of machine learning in healthcare.