Facebook researchers have developed what they claim is the largest automatic speech recognition (ASR) model of its kind — a model that learned to understand words in 51 languages after training on over 16,000 hours of voice recordings. In a paper published on the preprint server Arxiv.org, the coauthors say the system, which contains around a billion parameters, improves speech recognition performance up to 28.8% on one benchmark compared with baselines.
Designing a single model to recognize speech in multiple languages is desirable for several reasons. It simplifies the backend production pipeline, for one thing, and studies have shown that training multilingual models on related languages can lower overall word error rate (WER).
Facebook’s model — a so-called joint sequence-to-sequence (Seq2Seq) model — was trained while sharing the parameters of an encoder, decoder, and token set across all languages. The encoder maps input audio sequences to intermediate representations, while the decoder maps those representations to output text; the token set simplifies the process of working with many languages by sampling sentences at different frequencies.
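Sampling sentences "at different frequencies" typically means upweighting low-resource languages when building the shared token set. A common recipe for this is temperature-based sampling — the sketch below is illustrative (the exponent, language codes, and counts are assumptions, not figures from the paper):

```python
def sampling_probs(sentence_counts, alpha=0.5):
    """Compute per-language sampling probabilities p_l proportional to (n_l / N) ** alpha.

    An alpha below 1.0 flattens the distribution, so sentences from
    low-resource languages are better represented in the shared token set.
    """
    total = sum(sentence_counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in sentence_counts.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Hypothetical corpus: English has 100x the sentences of Swahili.
counts = {"en": 1_000_000, "hi": 400_000, "sw": 10_000}
probs = sampling_probs(counts, alpha=0.5)
```

With `alpha=0.5`, Swahili's sampling probability rises well above its raw 0.7% share of the corpus, while English's falls correspondingly.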
The researchers divided the 51 languages into distinct groups with a different decoder for each, and then selected 10,000 “subword” units as the token set for each individual language group. Next, they manually combined some of the smaller language groups until they ended up with six in total, which prevented the group sizes from becoming overly skewed by the number of languages they contained.
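The resulting layout — one shared encoder feeding a group-specific decoder — can be sketched as follows. This is a minimal illustration only; the group names, language assignments, and stub functions are hypothetical stand-ins, not Facebook's actual architecture:

```python
# Hypothetical language-to-group assignment (illustrative, not from the paper).
LANG_TO_GROUP = {"en": "germanic", "hi": "indic", "sw": "other"}

def shared_encoder(audio_features):
    # Stand-in for the acoustic encoder: maps audio features to an
    # intermediate representation shared by all languages.
    return [x * 2 for x in audio_features]

# One decoder per language group; each maps the intermediate
# representation to output text (stubbed here as formatted strings).
DECODERS = {
    "germanic": lambda h: f"germanic-text({sum(h)})",
    "indic": lambda h: f"indic-text({sum(h)})",
    "other": lambda h: f"other-text({sum(h)})",
}

def transcribe(audio_features, lang):
    h = shared_encoder(audio_features)       # encoder parameters shared across languages
    return DECODERS[LANG_TO_GROUP[lang]](h)  # routed to the group-specific decoder
```

Routing at the group level rather than the language level keeps the number of decoders small (six, in the paper) while still letting related languages share decoder capacity.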
The coauthors created a training data set from anonymized videos publicly shared by Facebook, which they divided into three categories: high-resource languages with over 600 hours of training data (e.g., English, Hindi, French), mid-resource languages with 300 to 500 hours of data (Bengali, Japanese, Russian), and low-resource languages with 100 to 150 hours of data (Norwegian, Swahili, Lithuanian). After transcribing the videos according to certain guidelines, they tuned the model’s hyperparameters — the parameters whose values control the learning process.
The researchers report that across several experiments, the best-performing version of their model improved WER by 9.1% on average for high-resource languages, by 12.44% for mid-resource languages, and by 28.76% for low-resource languages. It also performed well on low-resource languages it hadn’t seen before, including Traditional Chinese, Persian, and Telugu.
“To the best of our knowledge, this work is the first one to study multilingual systems at a massive scale,” the Facebook researchers wrote. “We demonstrated that it is possible to train a massive single ASR architecture for 51 various languages, which we found in practice considerably less time-consuming to tune than 51 different monolingual baselines.”
The unveiling of the new model comes after Facebook detailed wav2vec 2.0, an improved framework for self-supervised speech recognition. In a paper, researchers claimed wav2vec 2.0 outperformed the best semi-supervised methods while being conceptually simpler, achieving state-of-the-art results using just 10 minutes of labeled data and pretraining on 53,000 hours of unlabeled data.