Any high-quality speech-to-text engines require thousands of hours of voice data to train them, but publicly available voice data is very limited and the cost of commercial datasets is exorbitant. This prompted the question, how might we collect large quantities of voice data for Open Source machine learning?
Source: Branson, M. (2018). We’re intentionally designing open experiences, here’s why.
One of the big problems with the development of AI is that few organisations have the large, inclusive, diverse datasets that are necessary to reduce the inherent bias in algorithmic training. Mozilla’s Common Voice project is an attempt to create a large, multilanguage dataset of human voices with which to train natural language AI.
This is why we built Common Voice. To tell the story of voice data and how it relates to the need for diversity and inclusivity in speech technology. To better enable this storytelling, we created a robot that users on our website would “teach” to understand human speech by speaking to it through reading sentences.
I think that voice and audio is probably going to be the next compter-user interface so this is an important project to support if we want to make sure that Google, Facebook, Baidu and Tencent don’t have a monopoly on natural language processing. I see this project existing on the same continuum as OpenAI, which aims to ensure that “…AGI’s benefits are as widely and evenly distributed as possible.” Whatever you think about the possibility of AGI arriving anytime soon, I think it’s a good thing that people are working to ensure that the benefits of AI aren’t mediated by a few gatekeepers whose primary function is to increase shareholder value.
Most of the data used by large companies isn’t available to the majority of people. We think that stifles innovation. So we’ve launched Common Voice, a project to help make voice recognition open and accessible to everyone. Now you can donate your voice to help us build an open-source voice database that anyone can use to make innovative apps for devices and the web. Read a sentence to help machines learn how real people speak. Check the work of other contributors to improve the quality. It’s that simple!
The datasets are openly licensed and available for anyone to download and use, alongside other open language datasets that Mozilla links to on the page. This is an important project that everyone should consider contributing to. The interface is intuitive and makes it very easy to either submit your own voice or to validate the recordings that other people have made. Why not give it a go?