
Google says its Parallel Tacotron model generates synthetic voices 13 times faster than its predecessor

In December 2017, Google released Tacotron 2, a machine learning text-to-speech (TTS) system that generates natural-sounding speech from raw transcripts. It’s used in user-facing services like Google Assistant to create voices that sound humanlike, but it’s relatively compute-intensive. In a new paper, researchers at the search giant claim to have addressed this limitation with what they call Parallel Tacotron, a model that’s highly parallelized during training and inference to enable efficient voice generation on less-powerful hardware.

Text-to-speech synthesis is what’s known as a one-to-many mapping problem. Given any snippet of text, multiple voices with different prosodies (intonation, tone, stress, and rhythm) could be generated. Even sophisticated models like Tacotron 2 are prone to errors like babble, cut-off speech, and repeating or skipping words as a result. One way to address this is to augment models by incorporating representations that capture latent speech factors. These representations can be extracted by an encoder that takes ground-truth spectrograms (a visual representation of speech frequencies over time) as its input; this is the approach Parallel Tacotron takes.
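To make the idea of a spectrogram concrete, here is a minimal sketch in plain Python. It is not the paper's feature pipeline: Tacotron-style models typically use mel-scaled spectrograms, whereas this computes a simple linear-frequency magnitude spectrogram with a naive Fourier transform. The function name, frame size, hop size, and test tone are all assumptions chosen for illustration.

```python
import cmath
import math


def magnitude_spectrogram(samples, frame_size=64, hop=32):
    """Return a list of frames, each a list of magnitudes per frequency bin."""
    # A Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        # Naive DFT over the non-negative frequency bins (0 .. frame_size / 2).
        bins = []
        for k in range(frame_size // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            bins.append(abs(acc))
        frames.append(bins)
    return frames


# Hypothetical input: a pure tone chosen to land exactly on frequency bin 8
# (bin index = tone_hz * frame_size / sample_rate).
sample_rate = 1024
tone_hz = 128
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(256)]

spec = magnitude_spectrogram(signal)
# For each time frame, find the frequency bin with the most energy.
peak_bins = [max(range(len(frame)), key=frame.__getitem__) for frame in spec]
```

Each row of `spec` is one time slice and each column is one frequency band, which is the grid an encoder of this kind would consume in place of raw audio.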

To train Parallel Tacotron, the researchers say they used a dataset of 405 hours of speech comprising 347,872 utterances from 45 speakers in 3 English accents (32 U.S. English, 8 British English, and 5 Australian English speakers). Training took a day using Google Cloud TPUs, application-specific integrated circuits developed specifically to accelerate AI.

To evaluate Parallel Tacotron’s performance, the researchers had human reviewers assess 1,000 sentences synthesized using 10 U.S. English speakers (5 male and 5 female) in a round-robin style (100 sentences per speaker). While there’s room for improvement, the results suggest that Parallel Tacotron “did well” compared with human speech. Moreover, Parallel Tacotron was about 13 times faster than Tacotron 2.
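The round-robin split in the evaluation can be sketched in a few lines of Python. This is only an illustration of how cycling through 10 speakers yields 100 sentences each; the speaker and sentence identifiers are placeholders, and the function name is hypothetical rather than anything taken from the paper.

```python
from collections import Counter
from itertools import cycle


def assign_round_robin(sentences, speakers):
    """Pair each sentence with a speaker, cycling through speakers in order."""
    return list(zip(sentences, cycle(speakers)))


# Placeholder identifiers: 5 male and 5 female speakers, 1,000 sentences.
speakers = ([f"male_{i}" for i in range(1, 6)]
            + [f"female_{i}" for i in range(1, 6)])
sentences = [f"sentence_{i}" for i in range(1000)]

assignments = assign_round_robin(sentences, speakers)
# Count how many sentences each speaker received.
per_speaker = Counter(speaker for _, speaker in assignments)
```

Because 1,000 divides evenly by 10, every speaker ends up with exactly 100 sentences, matching the per-speaker figure reported above.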

“A number of models have been proposed to synthesize various aspects of speech (e.g., speaking styles) in a natural sounding way,” the researchers wrote. “Parallel Tacotron matched the baseline Tacotron 2 in naturalness and offered significantly faster inference than Tacotron 2.”

The release of Parallel Tacotron, which is available on GitHub, comes after Microsoft and Facebook detailed speedy text-to-speech techniques of their own. Microsoft’s FastSpeech features a unique architecture that not only improves performance in a number of areas but eliminates errors like word skipping and affords fine-grained adjustment of speed and word break. As for Facebook’s system, it leverages a language model for curation to create voices 160 times faster compared with a baseline.

