In a paper originally published last October and accepted to the International Conference on Learning Representations (ICLR) 2020, researchers affiliated with Google and University College London propose an AI model that enables control of speech characteristics like pitch, emotion, and speaking rate with as little as 30 minutes of data.

The work has obvious commercial implications. Brand voices such as Progressive’s Flo (played by comedian Stephanie Courtney) are often pulled in for pick-ups (sessions to address errors, changes, or additions in voiceover scripts) long after a recording finishes. AI-assisted voice correction could eliminate the need for these, saving the actors’ employers time and money.

A previous study investigated the use of so-called style tokens (which represented different categories of emotion) to control speech affect. The method achieved good results with only 5% of labeled data, but it couldn’t handle speech samples with varying prosody (i.e., intonation, tone, stress, and rhythm) and fixed emotion. The work from Google and University College London addresses this limitation.

The researchers trained the system for 300,000 steps across 32 of Google’s custom-designed tensor processing units (TPUs), a scale of compute exceeding that used in previous work. They report that using 30 minutes of labeled data allowed for a “significant degree” of control over speech rate, valence, and arousal, and that affect accuracy didn’t degrade noticeably with at least 10% of labeled data. The researchers said that just three minutes of data allowed for control of speech rate and extrapolation outside data seen during training, a result they claim beat out state-of-the-art baselines.
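To put those amounts in perspective, the short Python sketch below works out how small the labeled portions are relative to the full corpus. It assumes the labeled minutes are drawn from the same 45-hour data set described later in this article; the function and variable names are illustrative only.

```python
# Figures quoted in the article. Assumption: the labeled audio is a
# subset of the 45-hour training corpus described further down.
CORPUS_HOURS = 45
corpus_minutes = CORPUS_HOURS * 60  # 2,700 minutes of audio in total


def labeled_share(labeled_minutes: float) -> float:
    """Return the labeled audio as a percentage of the full corpus."""
    return 100 * labeled_minutes / corpus_minutes


for minutes in (30, 3):
    print(f"{minutes} min labeled = {labeled_share(minutes):.2f}% of the corpus")

# Conversely, 10% of the corpus expressed in hours.
ten_percent_hours = 0.10 * CORPUS_HOURS
print(f"10% of the corpus = {ten_percent_hours:.1f} hours")
```

Under that assumption, 30 minutes works out to roughly 1.1% of the corpus and three minutes to roughly 0.1%, while 10% corresponds to 4.5 hours of labeled audio.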


The researchers’ system taps a trained generative model that can synthesize acoustic features from text. Similar to Google’s Tacotron 2, a text-to-speech (TTS) system that generates natural-sounding speech from raw transcripts, the new system produces visual representations of frequencies called spectrograms and pairs them with a second model, such as DeepMind’s WaveNet, trained to act as a vocoder, a voice codec that analyzes and synthesizes voice data. (This system uses WaveRNN.)

An annotated data set comprising 72,405 roughly 5-second recordings from 40 English speakers, amounting to 45 hours of audio, was used to train the system. The speakers, all of whom were trained voice actors, were prompted to read text snippets with varying levels of valence (emotions like sadness or happiness) and arousal (excitement or energy). From these sessions, the researchers obtained six possible affective states, which they modeled and used as labels along with labels for speaking rate (here defined as the number of syllables per second in each utterance).
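The speaking-rate label is simple enough to sketch in a few lines of Python. The example below is a minimal illustration, not the researchers’ pipeline: the syllable counter is a rough vowel-group heuristic chosen for brevity (the article does not say how syllables were counted), and the `UtteranceLabel` record and its field names are hypothetical.

```python
import re
from dataclasses import dataclass


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per run of vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def speaking_rate(text: str, duration_seconds: float) -> float:
    """Syllables per second for one utterance, per the article's definition."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return syllables / duration_seconds


@dataclass
class UtteranceLabel:
    """Hypothetical label record: an affective state plus a speaking rate."""
    valence: str   # e.g. "happy", "sad", "angry"
    arousal: str   # e.g. "high", "low"
    rate: float    # syllables per second


text = "The cat sat on the mat"
label = UtteranceLabel(valence="happy", arousal="high",
                       rate=speaking_rate(text, 2.0))
print(label.rate)  # 6 syllables over 2.0 seconds -> 3.0
```

A production system would more likely derive syllable counts from a pronunciation dictionary or phoneme alignment; the heuristic here miscounts words with silent vowels.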

Here’s one of the voices the system modified (which, interestingly, sounds not unlike the default Google Assistant voice) to have high arousal and an “angry” valence:

And here’s that same voice with high arousal and a “happy” valence:

And with low arousal and a “sad” valence:


The study’s coauthors acknowledge that the work might raise ethical concerns because it could be misused for misinformation or to commit fraud. Indeed, deepfakes (media that takes a person in an existing image, audio recording, or video and replaces them with someone else’s likeness using AI) are multiplying quickly, and have already been used to defraud a major energy producer. In tandem with tools like Resemble, Baidu’s Deep Voice, and Lyrebird, which need only seconds to minutes of audio samples to clone someone’s voice, it’s not difficult to imagine how this new system might add fuel to the fire.

But the coauthors also assert that in this case, since the focus of the work is on improved prosody with potential benefits to human-computer interfaces, the benefits likely outweigh the risks. “We … urge the research community to take seriously the potential for misuse both of this work and broader advances in TTS,” they wrote.