Imagine that a tourist calls an emergency service speaking a foreign language. How do you find an operator who speaks the right language? Or imagine you have tons of multilingual television broadcasts in need of automatic translation or subtitling. Most current automatic speech recognition (ASR) and other language technology tools assume that the source language is known and does not change abruptly. Furthermore, current off-the-shelf automatic language identification systems are text-based, which means that the speech must first be transformed into some kind of script. But ASR cannot do this without knowing the language first. Identifying the language directly from the audio is the problem called spoken language identification (SLI).
In MeMAD, spoken language identification turned out to be an important problem to solve: large quantities of audiovisual data are routinely processed with language-dependent tools. Even in national broadcasts many languages are spoken, and often there is no metadata or any other way to segment the data by language except automatic spoken language identification.
Humans are capable of recognizing familiar languages with high accuracy. Still, distinguishing between several unfamiliar languages is surprisingly challenging, even for a human listener. Brain studies have shown that when attempting to recognize an unfamiliar language, people pay attention to low-level acoustic cues such as sound patterns and intonation. For familiar languages, on the other hand, they recognize the language from its semantics and grammar. In other words, if you understand what is being said, the language is obvious.
Three approaches to solving the problem
For machines, the three main approaches to SLI are phonotactic, acoustic-phonetic and language embedding. Phonotactic models first use a phoneme recognizer to transform the audio into a phoneme sequence, and then a language classifier to detect patterns and structures in the phoneme sequences that are unique to each language. While phonotactic models can be accurate, they are severely limited by how well the phoneme recognizer copes with the variability of target languages and audio quality. Acoustic-phonetic SLI models, on the other hand, separate languages directly in the acoustic domain by detecting the characteristic acoustic structure of each language, and they only require audio files labeled by language for training. The idea of language embeddings is to use the training data to learn a low-dimensional vector space in which acoustic samples are mapped to vectors, so that the distance between two vectors reflects the similarity of the languages. Much of the language embedding work originates from speaker identification, where the two best-known embedding methods are the so-called i-vectors and x-vectors.
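As a rough illustration of the embedding idea, the sketch below classifies a clip by mapping it into the embedding space and picking the language whose average training vector (its centroid) is closest by cosine similarity. The embed() function is a placeholder for any extractor, such as an i-vector or x-vector system; nothing here is taken from lidbox or the thesis.

```python
# Minimal sketch of embedding-based SLI, assuming a placeholder embed()
# function that maps an audio clip to a fixed-size vector.
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def language_centroids(train_clips, embed):
    # Average the embeddings of all training clips per language to get
    # one reference vector ("centroid") per language.
    centroids = {}
    for lang, clips in train_clips.items():
        vectors = np.stack([embed(c) for c in clips])
        centroids[lang] = vectors.mean(axis=0)
    return centroids

def identify(clip, centroids, embed):
    # Map the clip into the embedding space and pick the language whose
    # centroid is closest (highest cosine similarity).
    v = embed(clip)
    return max(centroids, key=lambda lang: cosine(v, centroids[lang]))
```

Nearest-centroid scoring is only the simplest possible back-end; in practice a separate classifier is usually trained on top of the embeddings.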
Master’s thesis provides preliminary tools
In his Master’s thesis, Matias Lindgren from Aalto University studied different deep-learning-based approaches for detecting natural languages from speech data. He analysed and compared several state-of-the-art SLI models on various speech datasets. The main results of the thesis are the following:
- It is hard to learn an unsupervised acoustic unit representation that could replace a supervised phoneme representation while still providing good language discriminability. All models that used acoustic unit representations learned by an autoencoder produced worse results, on average, than those based on multilingual phoneme recognition.
- None of the evaluated state-of-the-art SLI models outperformed all other models in all of the available SLI test setups. However, models based on convolutional neural networks (CNNs) consistently performed better than models that did not use CNNs.
- His work resulted in a free, open-source SLI software library called “lidbox”.
- A large new SLI dataset called “YTN-Aalto2019” was created. It contains almost 1200 hours of speech in 6 languages, collected from YouTube news channels and labeled weakly under the assumption that each news channel contains speech in only a single language (see the sketch after this list). The metadata and collection scripts are freely available, but access to the videos depends on their producers.
- A real-time SLI demo was developed based on the x-vector architecture and trained on the YTN-Aalto2019 dataset. It runs in a web browser, taking audio input in real time from the user’s microphone.
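To give a feel for the weak-labeling idea behind the dataset, here is a rough sketch of a collection script: it downloads the audio track of every video on a set of news channels and labels each clip with that channel’s language. The channel URLs and output layout are hypothetical; the actual YTN-Aalto2019 collection scripts are the ones published with the dataset metadata.

```python
# Rough sketch of weak labeling from monolingual YouTube news channels.
# Channel URLs below are hypothetical placeholders.
import subprocess

CHANNELS = {
    # language label -> hypothetical channel URL
    "fin": "https://www.youtube.com/c/example-finnish-news",
    "swe": "https://www.youtube.com/c/example-swedish-news",
}

for lang, url in CHANNELS.items():
    # yt-dlp extracts the audio track of every video on the channel;
    # the language label is encoded in the output directory name.
    subprocess.run([
        "yt-dlp", "--extract-audio", "--audio-format", "wav",
        "-o", f"data/{lang}/%(id)s.%(ext)s", url,
    ], check=True)
```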
See for yourself how it works in this example video of the real-time spoken language identification demo. As with MeMAD’s real-time audio tagging demo, more work and training data would be required to make the segments coherent enough for practical use. The demo does not try to model language transitions, nor does it separate speech and noise regions to produce a smooth output. Nevertheless, it is a very good start on an important topic for improving the accessibility and discoverability of audiovisual content.
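One simple way to make such output smoother, not implemented in the demo, would be to post-process the per-window predictions with a sliding majority vote, as in the sketch below.

```python
# A simple post-processing idea (not part of the demo): smooth the raw
# per-window language predictions with a sliding majority vote so that
# brief misclassifications do not fragment the output into tiny segments.
from collections import Counter

def majority_smooth(labels, window=5):
    # Replace each prediction with the most common label in a window
    # centered on it; the window size is an illustrative choice.
    smoothed = []
    half = window // 2
    for i in range(len(labels)):
        context = labels[max(0, i - half):i + half + 1]
        smoothed.append(Counter(context).most_common(1)[0][0])
    return smoothed

print(majority_smooth(["fin", "fin", "swe", "fin", "fin", "eng", "eng", "eng"]))
# -> ['fin', 'fin', 'fin', 'fin', 'fin', 'eng', 'eng', 'eng']
```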