MOSEL: Advancing Speech Data Collection for All European Languages

The development of AI language models has largely been dominated by English, leaving many European languages underrepresented. This has created a significant imbalance in how AI technologies understand and respond to different languages and cultures. MOSEL aims to change this narrative by creating a comprehensive, open-source collection of speech data for the 24 official languages of the European Union. By providing diverse language data, MOSEL seeks to ensure that AI models are more inclusive and representative of Europe’s rich linguistic landscape.

Contents

Overview of MOSEL
Bridging the Data Gap for Underrepresented Languages
The Role of Open Access in Driving AI Innovation
Future Directions and the Road Ahead

Language diversity is crucial for ensuring inclusivity in AI development. Over-relying on English-centric models can result in technologies that are less effective or even inaccessible for speakers of other languages. Multilingual datasets help create AI systems that serve everyone, regardless of the language they speak. Embracing language diversity enhances technology accessibility and ensures fair representation of different cultures and communities. By promoting linguistic inclusivity, AI can truly reflect the diverse needs and voices of its users.

Overview of MOSEL

MOSEL, or Massive Open-source Speech data for European Languages, is a groundbreaking project that aims to build an extensive, open-source collection of speech data covering all 24 official languages of the European Union. Developed by an international team of researchers, MOSEL integrates data from 18 different projects, such as CommonVoice, LibriSpeech, and VoxPopuli. This collection includes both transcribed speech recordings and unlabeled audio data, offering a significant resource for advancing multilingual AI development.

One of the key contributions of MOSEL is the inclusion of both transcribed and unlabeled data. The transcribed data provides a reliable foundation for training AI models, while the unlabeled audio data can be used for further research and experimentation, especially for resource-poor languages. The combination of these datasets creates a unique opportunity to develop language models that are more inclusive and capable of understanding the diverse linguistic landscape of Europe.

Bridging the Data Gap for Underrepresented Languages

The distribution of speech data across European languages is highly uneven, with English dominating the majority of available datasets. This imbalance presents significant challenges for developing AI models that can understand and accurately respond to less-represented languages. Many of the official EU languages, such as Maltese or Irish, have very limited data, which hinders the ability of AI technologies to effectively serve these linguistic communities.

To bridge this data gap, the MOSEL team used OpenAI’s Whisper model to automatically transcribe 441,000 hours of previously unlabeled audio data. This approach has significantly expanded the availability of training material, particularly for languages that lacked extensive manually transcribed data. Although automatic transcription is not perfect, it provides a valuable starting point for further development, allowing more inclusive language models to be built.
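
As a rough illustration of what automatic transcription of unlabeled audio can look like in practice, here is a minimal Python sketch. It is not MOSEL's actual pipeline, whose model size, decoding settings, and filtering steps are not described in this article. The function names pseudo_label and make_whisper_transcriber and the manifest fields are hypothetical; the only external calls are load_model and transcribe from the openai-whisper package, which is imported lazily so the rest of the sketch runs without it.

```python
from pathlib import Path
from typing import Callable, Iterable


def pseudo_label(
    audio_paths: Iterable[Path],
    transcribe: Callable[[Path], str],
    language: str,
) -> list[dict]:
    """Return one manifest record per audio file, flagged as machine-transcribed."""
    records = []
    for path in audio_paths:
        records.append(
            {
                "audio": str(path),
                "language": language,
                "text": transcribe(path).strip(),
                # Keep automatic labels distinguishable from human transcripts.
                "label_source": "automatic",
            }
        )
    return records


def make_whisper_transcriber(model_name: str, language: str) -> Callable[[Path], str]:
    """Build a transcriber backed by openai-whisper (assumed to be installed)."""
    import whisper  # lazy import: only needed when this helper is called

    model = whisper.load_model(model_name)
    return lambda path: model.transcribe(str(path), language=language)["text"]
```

A caller would pass something like make_whisper_transcriber("large-v3", "mt") as the transcribe argument, assuming the installed Whisper version provides that model name. Recording the label source in every record makes it possible to filter or down-weight automatic transcripts later, which matters for languages where the transcription model is unreliable.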

However, the quality of these automatic transcripts varies widely by language. For instance, the Whisper model struggled with Maltese, producing a word error rate of over 80 percent, meaning that the number of word-level errors exceeded 80 percent of the words in the reference transcripts. Such high error rates highlight the need for additional work, including improving transcription models and collecting more high-quality, manually transcribed data. The MOSEL team is committed to continuing these efforts, ensuring that even resource-poor languages can benefit from advancements in AI technology.
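
For readers unfamiliar with the metric, word error rate (WER) is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the system output into the reference transcript, divided by the number of words in the reference. The Python sketch below computes it with a standard word-level edit distance. It is an illustration only: the function name is hypothetical, it applies no text normalization (casing, punctuation), and it is not the evaluation code used by the MOSEL team.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two texts, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors but do not lengthen the reference, WER can exceed 100 percent when a system outputs many extra words.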

The Role of Open Access in Driving AI Innovation

MOSEL’s open-source availability is a key factor in driving innovation in European AI research. By making the speech data freely accessible, MOSEL empowers researchers and developers to work with extensive, high-quality datasets that were previously unavailable or limited. This accessibility encourages collaboration and experimentation, fostering a community-driven approach to advancing AI technologies for all European languages.

Researchers and developers can leverage MOSEL’s data to train, test, and refine AI language models, especially for languages that have been underrepresented in the AI landscape. The open nature of this data also allows smaller organizations and academic institutions to participate in cutting-edge AI research, breaking down barriers that often favor large tech companies with exclusive resources.

Future Directions and the Road Ahead

Looking ahead, the MOSEL team plans to continue expanding the dataset, particularly for underrepresented languages. By collecting more data and improving the accuracy of automated transcriptions, MOSEL aims to create a more balanced and inclusive resource for AI development. These efforts are crucial for ensuring that all European languages, regardless of the number of speakers, have a place in the evolving AI landscape.

The success of MOSEL could also inspire similar initiatives globally, promoting linguistic diversity in AI beyond Europe. By setting a precedent for open access and collaborative development, MOSEL paves the way for future projects that prioritize inclusivity and representation in AI, ultimately contributing to a more equitable technological future.