Google has developed a new text-to-speech system called Tacotron 2, and it works with stunning accuracy, delivering voice narration that is nearly indistinguishable from the voice of a real human. This is not an exaggeration. Tacotron 2 is the second generation of the technology and consists of two deep neural networks: the first converts the text into a special spectrogram (like the one you see in the picture above), and the second, WaveNet, reads this chart and turns it into a real voice.
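To make the two-stage design easier to picture, here is a minimal Python sketch of the data flow only. It is not Google's code and contains no neural networks: both stage functions are hypothetical placeholders, and the constants `N_MEL_CHANNELS` and `SAMPLES_PER_FRAME` are illustrative assumptions rather than figures from this article. It simply shows that stage one produces a grid of frames (the spectrogram) and stage two turns that grid into audio samples.

```python
from typing import List

# Assumptions for illustration only, not values taken from the article.
N_MEL_CHANNELS = 80      # frequency bands per spectrogram frame
SAMPLES_PER_FRAME = 300  # audio samples produced for each frame

Spectrogram = List[List[float]]  # shape: frames x frequency bands


def text_to_spectrogram(text: str) -> Spectrogram:
    """Stage 1 placeholder: emits one dummy frame per character.

    The real first network predicts a variable number of frames
    from the whole sentence; this stand-in only mimics the shape.
    """
    return [[(ord(ch) % 32) / 32.0] * N_MEL_CHANNELS for ch in text]


def spectrogram_to_audio(spec: Spectrogram) -> List[float]:
    """Stage 2 placeholder: expands each frame into a run of samples.

    The real second network (WaveNet) generates a waveform conditioned
    on the spectrogram; here each frame is just averaged and repeated.
    """
    audio: List[float] = []
    for frame in spec:
        level = sum(frame) / len(frame)
        audio.extend([level] * SAMPLES_PER_FRAME)
    return audio


def synthesize(text: str) -> List[float]:
    """Chains the two stages: text -> spectrogram -> audio samples."""
    return spectrogram_to_audio(text_to_spectrogram(text))
```

Calling `synthesize("Hi")` runs both placeholder stages in sequence and returns a flat list of samples. The point of splitting the work this way is that the spectrogram acts as a compact hand-off format between the network that understands text and the network that produces sound.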
The system is currently trained to work only in English, with the single female voice that you can hear below. It does more than simply read text aloud: it also picks up on nuance, and if a word is written in all caps, it will stress that word. It can also cope with a small number of typing errors.
Here are a few examples showing Tacotron 2 in action:
“That girl did a video about Star Wars lipstick.”