The researchers started with 140,000 hours of YouTube videos of people talking in diverse situations. Then, they designed a program that created clips a few seconds long with the mouth movement for each phoneme, or word sound, annotated. The program filtered out non-English speech, nonspeaking faces, low-quality video, and video that wasn’t shot straight ahead. Then, they cropped the videos around the mouth. That yielded nearly 4000 hours of footage, including more than 127,000 English words.
After training, the researchers tested their system on 37 minutes of video it had not seen before. The AI misidentified only 41% of the words… That might not sound like a lot, but the best previous computer method, which focuses on individual letters rather than phonemes, had a word error rate of 77%. In the same study, professional lip readers erred at a rate of 93% (though in real life they have context and body language to go on, which helps).
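For readers who have not met the metric before, here is a minimal sketch of how a word error rate is commonly computed: the word-level edit distance (substitutions, deletions and insertions) between what was actually said and what the system guessed, divided by the number of words actually said. The excerpt does not say exactly how the study scored its results, so treat this as an illustration of the standard definition rather than the researchers' own code; the function name `word_error_rate` and the sample sentences are made up for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: assumes the standard WER definition, which the
    excerpt above does not confirm was the one used in the study.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the guess
            insertion = curr[j - 1] + 1  # extra word in the guess
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# Hypothetical example: one wrong word out of six.
spoken = "the cat sat on the mat"
guessed = "the cat sat on a mat"
print(f"{word_error_rate(spoken, guessed):.0%}")
```

Under this definition, a rate of 41% means roughly four word-level mistakes for every ten words spoken. Because inserted words also count as errors, the rate can exceed 100% when the guess is much longer than the reference.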
There’s not much else to say here, other than to highlight one of the potential applications in healthcare. For example, patients who are hard of hearing could have a universal translator with them at all times. In a country like South Africa, where the Constitution mandates the provision of healthcare in a language of the patient’s choosing but where we have 12 official languages and a huge shortage of translators, you can see how this might be useful.