Alibaba Researchers Introduce Qwen-Audio Series: A Set of Large-Scale Audio-Language Models with Universal Audio Understanding Abilities

Researchers from Alibaba Group introduced Qwen-Audio, which addresses the challenge that existing pre-trained audio models cover only a narrow range of audio types and tasks. A hierarchical, tag-based multi-task framework is designed to avoid the interference that arises when tasks with different textual labels are co-trained. Qwen-Audio achieves impressive performance across benchmark tasks without any task-specific fine-tuning. Qwen-Audio-Chat, built upon Qwen-Audio, supports multi-turn dialogues and diverse audio-centric scenarios, demonstrating its universal audio understanding abilities.
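To make the idea of a hierarchical, tag-based multi-task format concrete, here is a minimal Python sketch. The tag names, their ordering, and the build_prompt helper are illustrative assumptions loosely modeled on the Whisper-style conditioning scheme that this line of work builds on, not the exact tags used by Qwen-Audio.

```python
# Illustrative sketch of a hierarchical, tag-conditioned multi-task format.
# Tag names and ordering are hypothetical; see the Qwen-Audio paper for the real scheme.

def build_prompt(audio_lang: str, task: str, text_lang: str, timestamps: bool) -> str:
    """Compose the conditioning sequence the text decoder sees before the target output."""
    tags = [
        "<|startoftranscripts|>",                          # sequence-level start tag
        f"<|{audio_lang}|>",                               # language of the input audio
        f"<|{task}|>",                                     # task tag: transcribe / caption / ...
        f"<|{text_lang}|>",                                # language of the output text
        "<|timestamps|>" if timestamps else "<|notimestamps|>",
    ]
    return "".join(tags)

# Shared tags let related tasks (e.g. ASR in different languages) reinforce each other,
# while distinct task tags keep unrelated label formats from interfering during co-training.
print(build_prompt("en", "transcribe", "en", timestamps=True))
print(build_prompt("zh", "caption", "en", timestamps=False))
```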

Qwen-Audio overcomes the limitations of previous audio-language models by handling diverse audio types and tasks. Unlike prior work restricted to speech, Qwen-Audio incorporates human speech, natural sounds, music, and songs, allowing co-training on datasets annotated at varying granularities. The model excels at speech perception and recognition tasks without task-specific modifications. Qwen-Audio-Chat extends these capabilities to align with human intent, supporting multilingual, multi-turn dialogues over audio and text inputs and showcasing robust, comprehensive audio understanding.
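For readers who want to try the dialogue model, the sketch below shows roughly how it can be loaded through the Hugging Face trust_remote_code pattern used across the Qwen family. The Qwen/Qwen-Audio-Chat checkpoint name, the from_list_format helper, the chat method, and the audio file path are assumptions based on that convention; consult the official repository for the exact interface.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the chat checkpoint with the custom remote code shipped by the model repo.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-Audio-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-Audio-Chat", device_map="auto", trust_remote_code=True
).eval()

# First turn: an audio clip plus a text question (placeholder file path).
query = tokenizer.from_list_format([
    {"audio": "example.wav"},
    {"text": "What does the speaker say?"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)

# Second turn: a follow-up text question that reuses the dialogue history.
response, history = model.chat(
    tokenizer, "What emotion does the speaker convey?", history=history
)
print(response)
```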

Large language models (LLMs) excel at general-purpose reasoning but lack native audio comprehension. Qwen-Audio addresses this by scaling audio-language pre-training to cover over 30 tasks and diverse audio types. A multi-task framework mitigates interference between tasks while enabling knowledge sharing. Qwen-Audio performs impressively across benchmarks without task-specific fine-tuning. Qwen-Audio-Chat, an extension, supports multi-turn dialogues and diverse audio-centric scenarios, showcasing comprehensive audio interaction capabilities in LLMs.

Qwen-Audio and Qwen-Audio-Chat are models for universal audio understanding and flexible human interaction, and they are trained in two stages. In the first stage, multi-task pre-training, the audio encoder is optimized while the language model weights are kept frozen. In the second stage, supervised fine-tuning, Qwen-Audio-Chat is produced by optimizing the language model while the audio encoder weights are fixed. The resulting Qwen-Audio-Chat enables versatile human interaction, supporting multilingual, multi-turn dialogues over audio and text inputs and showcasing its adaptability and comprehensive audio understanding.
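The two-stage recipe above can be summarized as selective parameter freezing. The following PyTorch sketch is a simplified illustration under assumed module names (audio_encoder, llm are placeholders), not the project's actual training code.

```python
import torch

def set_trainable(module: torch.nn.Module, trainable: bool) -> None:
    """Enable or disable gradients for every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(audio_encoder: torch.nn.Module, llm: torch.nn.Module, stage: str):
    """Freeze/unfreeze components per training stage and return an optimizer."""
    if stage == "multitask_pretrain":
        set_trainable(audio_encoder, True)   # stage 1: optimize the audio encoder
        set_trainable(llm, False)            #          keep language model weights fixed
    elif stage == "sft":
        set_trainable(audio_encoder, False)  # stage 2: freeze the audio encoder
        set_trainable(llm, True)             #          fine-tune the language model
    else:
        raise ValueError(f"unknown stage: {stage}")

    trainable = [p for m in (audio_encoder, llm) for p in m.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-4)  # learning rate is a placeholder value
```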

Qwen-Audio demonstrates remarkable performance across diverse benchmark tasks, surpassing its counterparts without task-specific fine-tuning. It consistently outperforms baselines by a substantial margin on tasks such as AAC, SRWT, ASC, SER, AQA, VSC, and MNA. The model establishes state-of-the-art results on CochlScene, ClothoAQA, and VocalSound, showcasing robust audio understanding capabilities. These consistently superior results highlight its effectiveness in challenging audio tasks.

The Qwen-Audio series introduces large-scale audio-language models with universal understanding across diverse audio types and tasks. Developed through a multi-task training framework, these models facilitate knowledge sharing and overcome interference from varying textual labels in different datasets. Achieving impressive performance across benchmarks without task-specific fine-tuning, Qwen-Audio surpasses prior works. Qwen-Audio-Chat extends these capabilities, enabling multi-turn dialogues and supporting diverse audio scenarios, showcasing robust alignment with human intent and facilitating multilingual interactions.

Future work on Qwen-Audio includes expanding its capabilities to additional audio types, languages, and specific tasks. Refining the multi-task framework or exploring alternative knowledge-sharing approaches could further reduce interference during co-training, and investigating task-specific fine-tuning could enhance performance on individual tasks. Continuous updates based on new benchmarks, datasets, and user feedback aim to improve universal audio understanding, while Qwen-Audio-Chat will continue to be refined to align with human intent, support multilingual interactions, and enable dynamic multi-turn dialogues.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.

