MLCommons Collaborates with Hugging Face to Launch Unprecedented Million-Hour Speech Dataset for AI Research
In a significant move to advance AI research, MLCommons, a nonprofit group focused on AI safety, has partnered with AI development platform Hugging Face to release one of the largest public domain collections of voice recordings.
The new dataset, named Unsupervised People’s Speech, includes more than one million hours of audio in at least 89 languages. By making this volume of data freely available to researchers and developers, MLCommons aims to advance innovation in speech technology.
“Supporting broader natural language processing research for languages other than English helps bring communication technologies to more people globally,” the organization said in a blog post. “We anticipate several avenues for the research community to continue to build and develop, especially in the areas of improving low-resource language speech models, enhanced speech recognition across different accents and dialects, and novel applications in speech synthesis.”
While the release of Unsupervised People’s Speech is a commendable initiative aimed at advancing AI research, it carries potential risks for those who use it. The primary concern is data bias. The recordings come from Archive.org, the nonprofit that maintains the well-known Wayback Machine web archiving service.
According to the project’s README, contributions to Archive.org come primarily from English-speaking communities in the United States, so Unsupervised People’s Speech contains many recordings of American-accented English.
This linguistic skew can degrade the performance of AI systems trained on the dataset, notably speech recognition and voice synthesis models. Without careful filtering, such systems may struggle to process the speech of non-native English speakers and to synthesize voices in languages other than English.
The dataset may also include recordings of people who never consented to having their voices used for AI research or commercial applications. MLCommons says that all of the recordings are in the public domain or available under Creative Commons licenses, but mistakes remain possible.
An analysis by MIT found that numerous publicly available AI training datasets lack proper licensing information and contain errors. MLCommons has stated its commitment to maintaining and improving the quality of Unsupervised People’s Speech. Given the issues outlined above, however, developers should evaluate the dataset carefully before relying on it for their AI projects.