
Researchers’ data set pinpoints challenges adapting speech recognition models to new hardware

A new study from researchers affiliated with University College London, Nokia Bell Labs Cambridge, and the University of Oxford reveals how variations in microphone quality can affect speech recognition accuracy. The coauthors use a custom data set called Libri-Adapt, which contains 7,200 hours of English speech, to test how well Mozilla’s DeepSpeech model handles different environments and microphones. The findings suggest a noticeable degradation in accuracy occurs across certain “domain shifts,” with word error rate rising to as high as 28% after switching microphones.

Automatic speech recognition models must perform well across hardware to be reliable. Users expect the models powering Alexa to work equally well on different smart speakers, smart displays, and smart devices, for instance. But some models fall short of this ideal because they’re not consistently trained on corpora containing speech recorded on microphones of varying quality and in novel settings.

Libri-Adapt is designed to surface these flaws with speech recorded using the microphones in six different products: a PlayStation Eye camera, a generic USB mic, a Google Nexus 6 smartphone, the Shure MV5, a Raspberry Pi accessory called ReSpeaker, and the Matrix Voice developer kit. The corpus contains speech in three English accents (U.S. English, British English, and Indian English), culled from 251 U.S. speakers and synthetic voices generated by Google Cloud Platform’s text-to-speech API. Beyond this, Libri-Adapt incorporates wind, rain, and laughter background noises as added confounders.
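Overlaying a background noise such as rain onto clean speech amounts to mixing a noise track into the signal at a chosen signal-to-noise ratio. Here is a minimal sketch of that kind of augmentation using NumPy; the function name and the SNR handling are illustrative, not taken from the Libri-Adapt tooling:

```python
import numpy as np

def mix_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Overlay background noise onto a speech signal at a target SNR (in dB)."""
    # Loop the noise track so it covers the full speech length, then trim it.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)

    # Scale the noise so that 10*log10(speech_power / noise_power) == snr_db.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    noise = noise * np.sqrt(target_noise_power / noise_power)
    return speech + noise
```

Lower SNR values bury the speech deeper in noise, which is how confounders like wind or laughter can be made progressively harder for a recognizer.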

Libri-Adapt word error rate

Above: Word error rate of a fine-tuned DeepSpeech model trained and tested on various microphone pairs for U.S. English speech. Columns correspond to the training microphone domain and rows correspond to the test microphone domain.

During experiments, the researchers compared the speech recognition performance of a pretrained DeepSpeech model (version 0.5.0) across the aforementioned six devices. They found that when data from the same microphone was used for training and testing, DeepSpeech unsurprisingly achieved the lowest error rate (e.g., 11.39% in the case of the PlayStation Eye). But the inverse was also true: When there was a mismatch between the training and testing sets, the word error rate jumped significantly (e.g., 24.18% when a model trained on PlayStation Eye-recorded speech was tested on Matrix Voice speech).
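Word error rate, the metric behind these comparisons, is the word-level edit distance between a model's transcript and the reference, divided by the number of reference words. A minimal sketch of the standard Levenshtein dynamic program (not the authors' evaluation code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions + insertions + deletions)
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)
```

A WER of 24.18% therefore means that, on average, roughly one in four reference words requires a correction to match the model's output.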

The researchers say Libri-Adapt, which is available in open source, can be used to create scenarios that test the generalizability of speech recognition algorithms. As an example, they tested a DeepSpeech model trained on U.S.-accented speech collected by a ReSpeaker microphone against Indian-accented speech with rain background noise recorded by a PlayStation Eye. The results show the model suffered an error rate increase of nearly 29.8%, pointing to poor robustness on the model’s part.
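Given a grid of results like the one pictured above, the domain-shift penalty for any microphone pair is simply the gap between the mismatched and matched scores. A toy illustration, where the two WER values come from the article's PlayStation Eye example and the helper function itself is hypothetical:

```python
# WER (%) keyed by (train_mic, test_mic); values from the article's example.
wer = {
    ("playstation_eye", "playstation_eye"): 11.39,
    ("playstation_eye", "matrix_voice"): 24.18,
}

def shift_penalty(train_mic: str, test_mic: str) -> float:
    """Absolute WER increase from testing on a different microphone
    than the one the model was trained on."""
    return wer[(train_mic, test_mic)] - wer[(train_mic, train_mic)]

penalty = shift_penalty("playstation_eye", "matrix_voice")
# 24.18 - 11.39 = 12.79 percentage points of degradation
```

Sweeping this over all train/test pairs reproduces the kind of cross-domain matrix the researchers report.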

Although the coauthors claim to have manually verified much of Libri-Adapt’s recordings, they caution that some may be incomplete or noisy. They plan to develop unsupervised domain adaptation algorithms in future work to tackle domain shifts in the data set.
