Meet Occiglot: A Large-Scale Research Collective for Open-Source Development of Large Language Models by and for Europe

On Mar 8, 2024

A team of researchers in Europe has introduced Occiglot, a research collective formed to address the need for dedicated European language modeling solutions. The collective aims to maintain Europe’s academic and economic competitiveness, AI sovereignty, and digital language equality. It focuses on incorporating European values such as linguistic diversity and cultural richness, which are lacking in current large language models from big tech companies and deep tech startups that concentrate primarily on English.

Currently, the field of language modeling is dominated by a few major players, leaving European languages and cultural diversity underrepresented. In response, Occiglot has published Model Release v0.1, a set of intermediary 7B model checkpoints focused on the five largest European languages: English, German, French, Spanish, and Italian. This release is the result of bilingual continual pre-training and instruction tuning for each language, as well as the development of a multilingual model covering all five languages. The models are available under an open-source license on Hugging Face, with the aim of democratizing access to language models.

Occiglot’s approach involves continual pre-training and instruction tuning of transformer-based language models for each target language, starting from an existing pre-trained model for English. The models are then fine-tuned and optimized for each specific language, with a focus on linguistic diversity and cultural nuances. This iterative process is intended to produce high-quality language models tailored to the European context. The collective also emphasizes collaboration within the community to gather large-scale training data, curate instruction-tuning datasets, and evaluate model performance accurately.

The performance of Occiglot’s language models is evaluated based on their ability to support diverse linguistic tasks and applications across different European languages. The release of intermediary model checkpoints marks a significant step towards achieving the long-term goal of creating a cohesive language modeling approach covering all official languages within the European Union and beyond. Furthermore, the commitment of hessian.AI to provide computing resources supports the initiative’s scalability and sustainability.

In conclusion, Occiglot’s initiative addresses the pressing need for accessible and culturally sensitive language models in Europe. By releasing open-source LLM checkpoints and fostering collaboration within the research community, the collective is paving the way for advancements in language technology that align with European values of linguistic diversity and cultural richness.

Today, we are announcing Occiglot!

A large-scale collaborative research collective focusing on open-source European LLMs.

We invite anybody working on multilingual datasets, benchmarks, or models to get in touch/join our discord. https://t.co/OcT7DNM4Ky

— OcciGlot (@occiglot) March 7, 2024

Pragati Jhunjhunwala is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She is a tech enthusiast with a keen interest in the scope of software and data science applications, and she follows developments across the different fields of AI and ML.

