Improved training of end-to-end attention models for speech recognition

Sequence-to-sequence attention-based models on subword units allow simple open-vocabulary end-to-end speech recognition. In this work, we show that such models can achieve competitive results on the Switchboard 300h and LibriSpeech 1000h tasks. In particular, we report the state-of-the-art word error rates (WER) of 3.54% on the dev-clean and 3.82% on the test-clean evaluation subsets of LibriSpeech. We introduce a new pretraining scheme by starting with a high time reduction factor and lowering it during training, which is crucial both for convergence and final performance. In some experiments, we also use an auxiliary CTC loss function to help the convergence. In addition, we train long short-term memory (LSTM) language models on subword units. By shallow fusion, we report up to 27% relative improvements in WER over the attention baseline without a language model.


Introduction
Conventional speech recognition systems [1] with neural network (NN) based acoustic models using the hybrid hidden Markov model (HMM) / NN approach [2,3] usually operate on the phone level, given a phonetic pronunciation lexicon (from phones to words). They require a pretraining scheme with HMMs and Gaussian mixture models (GMM) as emission probabilities to bootstrap good alignments of the HMM states. Context-independent phones are used initially because context-dependent phones need a good clustering, which is usually created on good existing alignments (via a Classification And Regression Tree (CART) clustering [4]). This bootstrapping process is iterated a few times. Then a hybrid HMM / NN model is trained with frame-wise cross entropy. Recognition with such a model requires a sophisticated beam search decoder. Handling out-of-vocabulary words is also not straightforward and increases the complexity. There has been work to remove the GMM dependency in the pretraining [5], to train without an existing alignment [6][7][8], or to avoid the lexicon [9], which simplifies the pretraining procedure but is still not end-to-end.
An end-to-end model in speech recognition generally denotes a simple single model which can be trained from scratch, and usually directly operates on words, sub-words or characters/graphemes. This removes the need for a pronunciation lexicon and the whole explicit modeling of phones, and it greatly simplifies the decoding.
The encoder-decoder framework with attention has become the standard approach for machine translation [20][21][22] and many other domains such as images [23]. Recent investigations have shown promising results by applying the same approach to speech recognition [24][25][26][27][28]. In this work, we also investigate techniques to improve recurrent encoder-attention-decoder based systems for speech recognition. We use long short-term memory (LSTM) neural networks [29] for the encoder and the decoder. Our model is similar to the architecture used in machine translation [30], except for the encoder time reduction. The generality and simplicity of this model are its strength. A valid argument against this model for speech recognition is that it is in fact too powerful, since it does not require monotonicity in its implicit alignments. There are attempts to restrict the attention to become monotonic in various ways [31][32][33][34][35][36][37][38]. In this work, our models are without these modifications and extensions.
Recently, alternative models for end-to-end modeling were also suggested, such as inverted HMMs [39], the recurrent transducer [40][41][42], or the recurrent neural aligner [43]. In many ways, these can all be interpreted in the same encoder-decoder-attention framework, but these approaches often use some variant of hard latent monotonic attention instead of soft attention.
Our models operate on subword units which are created via byte-pair encoding (BPE) [44]. We introduce a pretraining scheme applied to the encoder, which grows the encoder in layer depth and decreases the initially high encoder time reduction factor. To the best of our knowledge, we are the first to apply pretraining for encoder-attention-decoder models. We use RETURNN [30,45] based on TensorFlow [46] for its computation. We have implemented our own flexible and efficient beam search decoder, as well as efficient LSTM kernels in native CUDA. In addition, we train subword-level LSTM language models [47], which we integrate in the beam search by shallow fusion [48]. The source code is fully open 1 , as well as all the setups of the experiments in this paper 2 . We report competitive results on the 300h Switchboard and LibriSpeech [49] tasks. In particular on LibriSpeech, our system achieves WERs of 3.54% on the dev-clean and 3.82% on the test-clean evaluation subsets, which are, to the best of our knowledge, the best results obtained on this task.

Pretraining
Compared to machine translation, the input sequences in speech recognition are much longer relative to the output sequence (e.g. with 10K BPE subword units and audio feature frames every 10 ms, more than 30 times longer on Switchboard on average). However, as the original input is continuous, some form of downscaling in the time dimension is possible, such as concatenating consecutive time-frames in the feature dimension [7,24,42,50]. We instead use max-pooling in the time dimension, which is simpler. The time reduction can be done directly on the features or alternatively at multiple steps inside the encoder, e.g. after every encoder layer [24], which is what we do. This allows the encoder to better compress any necessary information.
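As an illustration, the max-pooling time reduction can be sketched in a few lines of NumPy; the frame count, feature size, and per-layer pool factors here are arbitrary examples, not the paper's exact configuration:

```python
import numpy as np

def time_max_pool(frames, factor):
    """Max-pool a (time, features) array along the time axis by `factor`.
    A partial last window is padded with -inf so it still contributes
    its maximum."""
    T, F = frames.shape
    T_pad = -(-T // factor) * factor  # round T up to a multiple of factor
    padded = np.full((T_pad, F), -np.inf)
    padded[:T] = frames
    # Reshape into (T', factor, F) windows and take the max per window.
    return padded.reshape(T_pad // factor, factor, F).max(axis=1)

# A total time reduction factor of 8, built from factor-2 pools applied
# after three of the encoder layers:
x = np.random.randn(100, 40)  # 1 second of 10 ms frames, 40-dim features
h = x
for pool in (2, 2, 2):
    h = time_max_pool(h, pool)
```

With 100 input frames and three factor-2 pools, the encoder length shrinks to ceil(ceil(ceil(100/2)/2)/2) = 13 time steps.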
We observed that a high time reduction factor makes the training much simpler. In fact, without careful tuning, the model usually will not converge without a high time reduction factor (16 or 32), as was also observed in the literature [24]. However, we also observed that a low time reduction factor (e.g. 8) can perform better after all, when pretrained with a high time reduction factor. Also, it has been shown that deep LSTM models can benefit from layer-wise pretraining, by starting with 1 or 2 layers and successively adding more layers [1]. We apply the same pretraining.
To improve the convergence further, we disable label smoothing during pretraining and only enable it after pretraining. Also, we disable dropout during the first few pretraining epochs in the encoder.

Model
We use a deep bidirectional LSTM encoder network and an LSTM decoder network. After every layer in the encoder, we optionally apply max-pooling in the time dimension to reduce the encoder length. I.e., for the input sequence $x_1^T$, we end up with the encoder states $h_1^{T'}$, where $T' = T / \mathrm{red}$ for the time reduction factor $\mathrm{red}$, and #enc is the number of encoder layers, with #enc ≥ 2. We use the MLP attention [20,21,31,32,51]. Our model closely follows the machine translation model presented by Bahar et al. [51] and Bahdanau et al. [20], and we use a variant of attention weight / fertility feedback [52], which is inverted in our case so that a multiplication is used instead of a division, for better numerical stability. More specifically, the attention energies $e_{i,t} \in \mathbb{R}$ for encoder time-step $t$ and decoder step $i$ are defined as $e_{i,t} = v^\top \tanh(W [s_i, h_t, \beta_{i,t}])$, where $v$ is a trainable vector, $W$ a trainable matrix, $s_i$ the current decoder state, $h_t$ the encoder state, and $\beta_{i,t}$ the attention weight feedback, defined as $\beta_{i,t} = \sigma(v_\beta^\top s_i) \cdot \sum_{k=1}^{i-1} \alpha_{k,t}$, where $v_\beta$ is a trainable vector. The attention weights are then $\alpha_i = \mathrm{softmax}_t(e_i)$, and the attention context vector is given as $c_i = \sum_t \alpha_{i,t} h_t$. The decoder state is a recurrent function implemented as $s_i = \mathrm{LSTMCell}(s_{i-1}, y_{i-1}, c_{i-1})$, and the final prediction probability for the output symbol $y_i$ is given as $p(y_i \mid y_1^{i-1}, x_1^T) = \mathrm{softmax}(\mathrm{MLP\,readout}(s_i, y_{i-1}, c_i))$. In our case, we use $\mathrm{MLP\,readout} = \mathrm{linear} \circ \mathrm{maxout} \circ \mathrm{linear}$.
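One decoder step of the attention computation can be sketched in NumPy; the dimension sizes, parameter names, and random initialization below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_step(s_i, h, alpha_accum, v, W_s, W_h, w_b, v_beta):
    """One decoder step of MLP attention with inverse fertility feedback.
    s_i:         decoder state, shape (D,)
    h:           encoder states, shape (T', E)
    alpha_accum: accumulated weights sum_{k<i} alpha_{k,t}, shape (T',)
    v, W_s, W_h, w_b, v_beta: trainable parameters (here random).
    Returns (context, alpha)."""
    # beta_{i,t} = sigmoid(v_beta . s_i) * sum_{k<i} alpha_{k,t}
    # (a multiplication, not a division, for numerical stability)
    beta = sigmoid(v_beta @ s_i) * alpha_accum                  # (T',)
    # e_{i,t} = v . tanh(W_s s_i + W_h h_t + w_b beta_{i,t})
    pre = np.tanh(W_s @ s_i + h @ W_h.T + np.outer(beta, w_b))  # (T', A)
    energies = pre @ v                                          # (T',)
    alpha = softmax(energies)
    context = alpha @ h                                         # (E,)
    return context, alpha

rng = np.random.default_rng(0)
D, E, A, Tp = 8, 6, 5, 11
s = rng.standard_normal(D)
h = rng.standard_normal((Tp, E))
params = (rng.standard_normal(A), rng.standard_normal((A, D)),
          rng.standard_normal((A, E)), rng.standard_normal(A),
          rng.standard_normal(D))
context, alpha = attention_step(s, h, np.zeros(Tp), *params)
```

Across decoder steps, `alpha_accum` is updated by adding each step's `alpha`, which feeds the fertility term back into the energies.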

Sub-word units
Characters/graphemes are probably the most generic and simple output units for generating text, but it has been shown that sub-word units can perform better [26], and they can be just as generic, since the characters can be included in the set of subword units. Using words as output units is also possible, but it does not allow recognizing out-of-vocabulary words, and it requires a large softmax output and is thus computationally expensive. An inhomogeneous length distribution as well as an imbalance in the label occurrence can also make training harder.
In all experiments, we use byte-pair encoding (BPE) [44] to create the subword units, which are the output targets of the decoder. The beam search decoding goes over these BPE units and then selects the best hypothesis. Therefore, our system is open-vocabulary. At the end of decoding, the BPE units are merged into words in order to obtain the best hypothesis on the word level. In addition, we add the special tokens from the transcriptions which denote noise, vocalized noise and laughter to our BPE vocabulary set, so our recognizer can potentially also recognize these special events.
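The final merge step can be sketched as follows, assuming the common "@@" continuation marker of the subword-nmt BPE implementation (the example tokens are made up):

```python
def bpe_to_words(bpe_units):
    """Merge BPE units back into words, assuming the '@@' continuation
    marker of the subword-nmt toolkit: a unit ending in '@@' is glued
    to the following unit."""
    return " ".join(bpe_units).replace("@@ ", "").split()

# "recognizer" split into three subword units:
hyp = ["the", "recog@@", "ni@@", "zer", "works"]
words = bpe_to_words(hyp)  # -> ["the", "recognizer", "works"]
```

Word-level WER is then computed on the merged word sequence, while the search itself scores BPE units.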

Language model combination
We also improve the recognition accuracy of our recognizer using external language models. We train LSTM language models [47] on the same BPE vocabulary set as the end-to-end model, using RETURNN with TensorFlow. For Switchboard, we used a training set of 27M words, concatenating the Switchboard and Fisher transcription parts. For LibriSpeech, we use the 800M-word dataset officially available 3 for training language models. It can be noted that in the case of Switchboard, there is some overlap between the training data for the language models and the transcriptions used to train the end-to-end model: 3M out of the 27M words are used to train the end-to-end system. For LibriSpeech, in contrast, the 800M-word data is fully external to the end-to-end model. Our experiments show that this difference in the amount of external data directly affects the performance improvements obtained by using an external language model. For both tasks, we use an LSTM LM with an input projection layer of 512 dimensions and two LSTM layers with 2048 nodes. We apply dropout at the input of all hidden layers with a rate of 0.2. Standard stochastic gradient descent with global gradient clipping is used to train all LSTM LMs.
We integrate the external language model in the beam search by shallow fusion [48]. The weight for the language model has been optimized by grid search on the development set WER. We found 0.23 and 0.36 to be optimal respectively for Switchboard and LibriSpeech (the weight on the attention model is 1).
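A minimal sketch of one shallow-fusion beam step, assuming label-synchronous expansion and a fixed LM weight (the toy probabilities and beam layout are illustrative, not from our decoder implementation):

```python
import numpy as np

def fused_expand(beam, att_log_probs, lm_log_probs, lm_weight, beam_size):
    """One beam step with shallow fusion: each partial hypothesis is
    expanded with every label and scored by
        score + log p_att + lm_weight * log p_lm,
    keeping the best beam_size candidates.
    beam: list of (score, label_list)
    att_log_probs, lm_log_probs: arrays of shape (len(beam), vocab)."""
    candidates = []
    for b, (score, labels) in enumerate(beam):
        fused = att_log_probs[b] + lm_weight * lm_log_probs[b]  # (vocab,)
        for y, lp in enumerate(fused):
            candidates.append((score + lp, labels + [y]))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:beam_size]

# Two labels, one hypothesis: the LM flips the decision at weight 1.0.
att = np.log(np.array([[0.6, 0.4]]))
lm = np.log(np.array([[0.2, 0.8]]))
no_lm = fused_expand([(0.0, [])], att, lm, lm_weight=0.0, beam_size=1)
with_lm = fused_expand([(0.0, [])], att, lm, lm_weight=1.0, beam_size=1)
```

With weight 0 the attention model alone picks label 0 (0.6 > 0.4); at weight 1.0 the fused score 0.4 * 0.8 > 0.6 * 0.2 selects label 1.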
For LibriSpeech, we also train Kneser-Ney smoothed n-gram count based language models [53] on the same BPE vocabulary set using the SRILM toolkit [54]. The comparison of perplexities can be found in Table 1. We also report WERs using the 4-gram count model by shallow fusion with a weight of 0.01, for comparison to the performance of the LSTM LM.

Experiments
All attention models and neural network language models were trained and decoded with RETURNN. For both Switchboard and LibriSpeech, we first used the BPE vocabulary of 10K subword units to tune the hyperparameters of the model, then trained the models with 1K and 5K BPE units. We found 1K and 10K to be optimal for Switchboard and LibriSpeech respectively. We use label smoothing [55], dropout [56], Adam [57], learning rate warmup [26], and automatic learning rate scheduling according to a cross-validation set ("Newbob") [1].

Pretraining
In all cases we use layer-wise pretraining for the encoder, where we start with two encoder layers and a single max-pool in between with factor 32. Then we add an LSTM layer and a max-pool in between, reducing the first max-pool to factor 16 and setting the new one to factor 2, such that we always keep the same total encoder time reduction factor of 32. Only when we end up at 6 layers do we remove some of the max-pooling ops to get a final total time reduction factor of e.g. 8. Directly starting with a time reduction factor of 8 and 2 layers did not work for us. Also, directly starting with 6 layers and a time reduction factor of 32 did not work for us. Similar experiments for translation also converged without pretraining, however with much worse performance compared to when layer-wise pretraining was used [30]. With more careful tuning or more training data, it might have worked without pretraining, as is seen in the literature; however, that is not necessary with pretraining. We were interested in the optimal final total time reduction factor, after the pretraining with time reduction factor 32. We tried factors 8, 16 and 32, and obtained 20.4%, 21.0% and 21.9% WER respectively, on the full Hub5'00 set (Switchboard + Callhome). Thus we use a final reduction factor of 8 in all further experiments. Note that a lower factor requires more memory and more computation for the global attention and was not feasible with our hardware and computational resources.
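The growing schedule described above can be sketched as follows; this is an illustrative reconstruction, not the exact RETURNN config (the linked setups are authoritative). Each stage is a pair (number of encoder layers, list of max-pool factors between layers):

```python
from math import prod

def pretrain_schedule(final_layers=6, initial_red=32, final_red=8):
    """Every growing stage keeps the product of the pool factors,
    i.e. the total time reduction, at initial_red; the last stage
    removes trailing pools until the reduction drops to final_red."""
    layers, pools = 2, [initial_red]
    stages = [(layers, list(pools))]
    while layers < final_layers:
        layers += 1
        # Halve the first pool and add a factor-2 pool after the new layer.
        pools = [pools[0] // 2] + pools[1:] + [2]
        stages.append((layers, list(pools)))
    # Final stage: drop trailing pools, e.g. reduction 32 -> 16 -> 8.
    while prod(pools) > final_red:
        pools = pools[:-1]
    stages.append((layers, list(pools)))
    return stages

schedule = pretrain_schedule()
```

This yields stages (2, [32]), (3, [16, 2]), (4, [8, 2, 2]), (5, [4, 2, 2, 2]), (6, [2, 2, 2, 2, 2]), and finally (6, [2, 2, 2]) with total reduction 8.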

Switchboard 300h
Switchboard consists of about 300 hours of training data. There is also the additional Fisher training dataset; combined, this makes a total of about 2000h. In this work, we only use the 300h Switchboard training data. We use 40-dimensional Gammatone features [58], and the feature extraction was done with RASR [59]. Results are shown in Table 2. We observe that our attention model performs better on the easier Switchboard subset of the Hub5'00 dev set, where it is the best end-to-end model we know of. On the harder Callhome part, it also performs well compared to other end-to-end models, but the relative difference is not as high.

LibriSpeech 1000h
The LibriSpeech training dataset consists of about 1000 hours of read audio books. The dev and test sets are split into simple ("clean") and harder ("other") subsets [49]. We do 40-dim. MFCC feature extraction on-the-fly in RETURNN, based on librosa [62]. We use CTC as an additional loss function applied on top of the encoder to help the convergence, although it is not used in decoding [63]. We initially trained only on the train-clean set, restricting it to sequences not longer than 75 characters in the orthography. Results are shown in Table 3. Our end-to-end system achieves competitive performance even without using language models. We observed that shallow fusion with the LSTM LM brings 17% to 27% relative improvement in WER on the different subsets. This improvement is much larger than in the case of Switchboard. The amount of LM training data is most likely the reason for this observation. For LibriSpeech, the external data of 800M words used to train the language models is 80 times larger than the 10M words corresponding to the transcriptions of the 1000 hours of audio. In addition, these 10M words of transcriptions are not part of the language model training data. In the case of Switchboard, the LM is trained only on about 27M words, including the 3M words of transcriptions used to train the end-to-end system. Text data for conversational speech is not as readily available as for read speech. To the best of our knowledge, the WERs of 3.54% on the dev-clean and 3.82% on the test-clean subsets are the best reported for systems trained only on LibriSpeech data.
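The auxiliary-loss setup can be sketched as a simple weighted sum; `ctc_weight` and the loss values below are hypothetical knobs for illustration, not values from the paper:

```python
def multitask_loss(att_ce_per_label, ctc_loss, ctc_weight=0.5):
    """Sketch of a training objective with an auxiliary CTC loss:
    the per-label attention cross-entropy is summed and the CTC loss
    on the encoder output is added with a weight.  The CTC branch is
    only used during training, never in decoding."""
    return sum(att_ce_per_label) + ctc_weight * ctc_loss

# Toy values: three output labels and one CTC loss for the utterance.
total = multitask_loss([0.5, 0.25, 0.25], ctc_loss=2.0)  # -> 2.0
```

The gradient from the CTC branch flows only into the encoder, which is what stabilizes early training.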

Beam search prune error analysis
Beam search is an approximation to the decision rule $x_1^T \mapsto \arg\max_{N, y_1^N} p(y_1^N \mid x_1^T)$. The approximation is the pruning we apply due to the beam size. Beam search decoding for hybrid models is very sophisticated and uses a dynamic beam size based on the partial hypothesis scores, which can become very large (on the order of thousands) [66]. The beam search for attention models works directly on the labels, i.e. on the BPE units in our case, and usually a static, very low beam size (e.g. 10) is used. It has been shown that increasing the beam size much further does not improve the overall performance. This indicates that we do not have a search problem, but we wanted to analyze this in more detail. Specifically, we are interested in how many errors we make due to the pruning in our attention models. We can count these by calculating the search score of the real target sequence and comparing it to the search score of the decoded sequence. If the decoded sequence has a higher score than the real target sequence, we have not made a search error but a model error. We count the number of sequences where the decoded sequence has a lower score than the real target sequence. We report our results in Table 4. We observe that for our standard beam size of 12, the number of search errors is well below 1%, and the WER does not noticeably improve with a larger beam size. Note that we only analyzed search errors with respect to reaching the real target sequence; we did not count search errors with respect to reaching any sequence with lower WER. However, our results still suggest that we do not have a search problem but a model problem.
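The counting procedure described above can be sketched as follows; the toy log-scores are made-up values for illustration:

```python
def count_search_errors(decoded, references, score):
    """Count search errors: a sequence counts as a search error only if
    the decoded hypothesis is wrong AND scores worse than the ground-
    truth target, i.e. a larger beam could have found the target.  If a
    wrong hypothesis scores better, the model is at fault, not the
    search.  `score(seq)` returns the model log-score of a sequence."""
    errors = 0
    for hyp, ref in zip(decoded, references):
        if hyp != ref and score(hyp) < score(ref):
            errors += 1
    return errors

# Toy example: pair 1 is a model error (wrong hypothesis scores better),
# pair 2 is a search error (wrong hypothesis scores worse).
log_score = {"a b": -1.0, "a c": -2.0}.get
decoded = ["a b", "a c"]
references = ["a c", "a b"]
n = count_search_errors(decoded, references, log_score)  # -> 1
```

In practice the reference score is obtained by forced-scoring the ground-truth BPE sequence with the model.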

Conclusions
We presented an encoder-decoder-attention model for speech recognition operating on BPE subword units. We introduced a new method for pretraining the encoder, which was crucial both for convergence and for the performance in terms of WER. We further improved the recognition accuracy by a joint beam search with an LSTM LM trained on the same subword vocabulary. We carried out experiments on two standard datasets. On the 300h Switchboard task, we achieved competitive results compared to previously reported end-to-end models, while the WERs are still higher than those of conventional hybrid systems. On the 1000h LibriSpeech task, we obtained competitive results across the different evaluation subsets. To the best of our knowledge, the WERs of 3.54% on the dev-clean and 3.82% on the test-clean subsets are the best results reported on this task when only the official LibriSpeech training data is used.

Acknowledgements
This work has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 694537, project "SEQCLAS"). The work reflects only the authors' views and the ERC Executive Agency is not responsible for any use that may be made of the information it contains. The GPU cluster used for the experiments was partially funded by Deutsche Forschungsgemeinschaft (DFG) Grant INST 222/1168-1.