Master Thesis

Connectionist Temporal Classification for Robust Speech Recognition Applications

Advisors: Prof. Hervé Bourlard & Dr. Wilhelm Hagg
Work completed at Sony Deutschland GmbH, Stuttgart, Germany
Completed in August 2016

Abstract

Automatic Speech Recognition (ASR) is the task of converting a speech signal into the sequence of words that it incorporates, by the help of an algorithm that is implemented on computers or computerized devices. Today, ASR systems find applications in a wide range of domains, including automatic call processing in telephone networks, automatic query-based information systems, human-to-machine interaction, and many others. One of the most popular approaches followed in ASR systems is to model the temporal variability of speech signals using hiddenMarkovModel (HMM) states. In the recent years, significant progress has been made into the acoustic modelling of these states based on the training of context-dependent deep neural networks (DNN’s) with observed feature vectors extracted from the signals. The training procedure of a traditional DNN-based acoustic model requires first to obtain the state alignments of the input feature frames with the HMMphoneme states, because traditional objective functions require a network output target value for every input feature vector. Connectionist Temporal Classification (CTC) overcomes this requirement. CTC is a cost function that is well-suited for sequence labelling tasks, where large sequences of input data (e.g. input feature vectors of a speech signal) are transcribed with smaller sequences of discrete labels (e.g. phoneme labels spoken in the speech stream). On the other hand, another major factor that degrades performance of current ASR systems is background noise and reverberation in the recorded speech.

In this work, we show the suitability of CTC training for robust speech recognition applications. We conduct experiments on public speech corpora, invoking different noise levels on the speech data, while comparing with the traditional hybrid DNN-HMM framework. Our results show that CTC is well-suited for large amounts of training data. As less amounts of information is given to the CTC-trained network compared to framewise training (i.e. the alignment of input features with target labels is not given), a larger training set is required for the network to learn this alignment. Once available, CTC-trained networks outperform the hybrid approach, especially for noisy data, where the confusion between input-output correlations is higher than the case of clean data. Moreover, due to its single-state modeling of target labels, CTC allows to achieve a faster decoding, compared to the context-dependent triphone-state hybrid approach. Consequently, the decoding graphs are smaller in size, which saves storage space and allows to keep a larger number of hypotheses in beam-search decoding.

Download Links

[Thesis]
[Defense Slides]