Meet BLOOMChat: An Open-Source 176-Billion-Parameter Multilingual Chat Large Language Model (LLM) Built on Top of the BLOOM Model

On May 23, 2023

Natural language systems are progressing rapidly alongside broader advances in Artificial Intelligence. Large Language Models (LLMs) are becoming more capable and more popular with each release, and new features and modifications arrive almost daily, extending LLMs to applications in nearly every domain. They are now used for everything from machine translation and text summarization to sentiment analysis and question answering.

The open-source community has made remarkable progress in developing chat-based LLMs, but mostly for English. Comparable multilingual chat capability has received far less attention. To address that, SambaNova, a software company that focuses on generative AI solutions, has introduced an open-source, multilingual chat LLM called BLOOMChat. Developed in collaboration with Together, which offers an open, scalable, and decentralized cloud for Artificial Intelligence, BLOOMChat is a 176-billion-parameter multilingual chat LLM built on top of the BLOOM model.

The BLOOM model can generate text in 46 natural languages and 13 programming languages. For languages such as Spanish, French, and Arabic, BLOOM is the first language model ever created with over 100 billion parameters. BLOOM was developed by the BigScience organization, an international collaboration of more than 1,000 researchers. BLOOMChat extends BLOOM's core capabilities into the chat domain by fine-tuning it on open conversation and alignment datasets from projects such as OpenChatKit, Dolly 2.0, and OASST1.

To develop BLOOMChat, SambaNova and Together trained the model on SambaNova DataScale systems, which use SambaNova's Reconfigurable Dataflow Architecture. The training data combines synthetic conversation data with human-written samples. OpenChatKit, a large synthetic dataset, provides the basis for chat functionality, while the higher-quality human-generated datasets Dolly 2.0 and OASST1 were used to improve performance significantly. The code and scripts used for instruction-tuning on the OpenChatKit and Dolly-v2 datasets are available on SambaNova's GitHub.
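As a rough illustration of combining a large synthetic source with smaller human-written sources, here is a minimal sketch. It is not SambaNova's training recipe: the article does not describe how the datasets were sampled or ordered, so the function name `build_mixture`, the `synthetic_cap` parameter, and the toy strings are all hypothetical.

```python
import random


def build_mixture(synthetic, human_sets, synthetic_cap, seed=0):
    """Combine a capped sample of synthetic examples with all human-written ones.

    Hypothetical helper for illustration only; the cap and the shuffling
    are assumptions, not details taken from the BLOOMChat release.
    """
    rng = random.Random(seed)
    # Down-sample the large synthetic source so it does not swamp the human data.
    sampled = rng.sample(synthetic, min(synthetic_cap, len(synthetic)))
    mixture = [{"source": "synthetic", "text": t} for t in sampled]
    # Keep every human-written example, tagged with the dataset it came from.
    for name, examples in human_sets.items():
        mixture.extend({"source": name, "text": t} for t in examples)
    rng.shuffle(mixture)
    return mixture


# Toy placeholder data standing in for the real datasets.
synthetic = [f"synthetic dialogue {i}" for i in range(1000)]
human_sets = {
    "dolly": [f"dolly sample {i}" for i in range(30)],
    "oasst1": [f"oasst1 sample {i}" for i in range(20)],
}
mixture = build_mixture(synthetic, human_sets, synthetic_cap=100)
print(len(mixture))
```

Tagging each example with its source makes it easy to check the final proportions before training.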

In human evaluations conducted across six languages, BLOOMChat responses were preferred over GPT-4 responses 45.25% of the time. Compared with four other open-source chat-aligned models in the same six languages, BLOOMChat's responses were ranked best 65.92% of the time. These results narrow the gap in multilingual chat capability among open-source models. On WMT translation tasks, BLOOMChat outperformed other BLOOM variants as well as popular open-source chat models.
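Preference figures like these are typically the share of human judgments that favored a given model. The sketch below shows that arithmetic on made-up votes. The function name `preference_rates`, the labels `model_a` and `model_b`, and the vote counts are hypothetical and are not the study's data.

```python
from collections import Counter


def preference_rates(votes):
    """Return the percentage of votes each model received."""
    if not votes:
        raise ValueError("no votes to aggregate")
    counts = Counter(votes)
    total = len(votes)
    return {model: 100.0 * n / total for model, n in counts.items()}


# Hypothetical toy data: 20 pairwise judgments, each naming the preferred model.
votes = ["model_a"] * 9 + ["model_b"] * 11
rates = preference_rates(votes)
print(rates)
```

With 9 of 20 votes, `model_a` has a 45% preference rate in this toy example. The same calculation applies to a "ranked best" rate when each vote names the top-ranked model among several.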

BLOOMChat, like other chat LLMs, has limitations. It may produce factually incorrect or irrelevant information, switch languages by mistake, repeat phrases, or occasionally generate toxic content, and its coding and math capabilities are limited. Further research aims to address these challenges.

In conclusion, BLOOMChat builds on the extensive work of the open-source community and is a welcome addition to the growing list of useful multilingual LLMs. By releasing it under an open-source license, SambaNova and Together aim to expand access to advanced multilingual chat capabilities and encourage further innovation in the AI research community.

Tanya Malhotra is a final year undergrad from the University of Petroleum & Energy Studies, Dehradun, pursuing BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with good analytical and critical thinking, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.
