Google has developed a new text-to-speech (TTS) system called Tacotron 2 that the company claims is not only better than past systems of its type but is capable of generating speech that is almost indistinguishable from its human counterpart.
Like a lot of the company’s recent breakthroughs, Tacotron 2’s human-like speech capabilities were developed with the help of neural networks.
In contrast to past TTS systems, which the company says used complex linguistic and acoustic markers to help machines generate human speech from text, Google allowed Tacotron 2 to develop its own methodology. The company’s researchers say they trained Tacotron 2 using only speech examples and their corresponding text transcripts.
In other words, what sets Tacotron 2 apart is that past systems depended on tools built by human researchers to translate text into speech, while Google's system was left to work out its own rules from the training examples.
According to Quartz’s Dave Gershgorn, the system works in two stages. It first translates text into a spectrogram, a visual representation of the sounds the words make. It then reads the spectrogram to generate the corresponding audio.
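For readers unfamiliar with spectrograms, here is a minimal Python sketch of what one encodes: how much energy a sound carries at each frequency, over successive slices of time. This is not Google's code. The function name `spectrogram`, the frame sizes, and the synthetic 1,000 Hz test tone are all illustrative assumptions, and the sketch only goes from audio to spectrogram. Tacotron 2 uses neural networks to go from text to a spectrogram and then from the spectrogram to audio.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Return one list of frequency-bin magnitudes per time frame.

    Hypothetical helper for illustration only: a plain short-time
    Fourier transform written with the standard library.
    """
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # A Hann window tapers the frame edges to reduce spectral leakage.
        windowed = [
            s * 0.5 * (1 - math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, s in enumerate(frame)
        ]
        magnitudes = []
        # Only the first half of the bins is needed for a real-valued signal.
        for k in range(frame_size // 2 + 1):
            total = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            magnitudes.append(abs(total))
        frames.append(magnitudes)
    return frames


# Assumed test signal: a pure 1,000 Hz tone sampled at 8,000 Hz.
SAMPLE_RATE = 8000
TONE_HZ = 1000
FRAME_SIZE = 64
tone = [math.sin(2 * math.pi * TONE_HZ * t / SAMPLE_RATE) for t in range(512)]

spec = spectrogram(tone, frame_size=FRAME_SIZE, hop=32)

# Find the strongest frequency bin in the first frame and convert it to Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_SIZE
print(f"{len(spec)} frames, strongest frequency near {peak_hz} Hz")
```

Each row of `spec` is one slice of time and each column is one frequency band. Drawn as an image, that grid is the "visual representation" described above, and for this test tone the brightest band sits at the tone's own frequency.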
Google has uploaded audio examples to showcase Tacotron 2’s capabilities. In short, what’s on display is impressive.
The system is able to parse homonyms, as well as change its inflection to suit a sentence’s punctuation. It can even make sense of spelling mistakes.
The company even provides comparative examples to showcase just how good the system is at mimicking human speech. If you listen carefully, you can tell the human voice from the computer-generated one by the cadence of the speech. There’s a machine-like precision to Tacotron 2’s speech that’s missing from its human counterpart, but it’s an impressive mimic nonetheless.