Speech Recognition: a review of the different deep learning approaches

Ilias Papastratis on 2021-07-14 · 22 mins

Humans communicate preferably through speech using the same language. Speech recognition can be defined as the ability to understand the spoken words of the person speaking.

Automatic speech recognition (ASR) refers to the task of recognizing human speech and translating it into text. This research field has gained a lot of focus over the last decades. It is an important research area for human-to-machine communication. Early methods focused on manual feature extraction and conventional techniques such as Gaussian Mixture Models (GMM), the Dynamic Time Warping (DTW) algorithm and Hidden Markov Models (HMM).

More recently, neural networks such as recurrent neural networks (RNNs), convolutional neural networks (CNNs) and, in recent years, Transformers have been applied to ASR and have achieved great performance.

How to formulate Automatic Speech Recognition (ASR)?

The overall flow of ASR can be represented as shown below:

[Figure: Overview of an ASR system]

The main goal of an ASR system is to transform an audio input signal $\mathbf{x} = (x_1, x_2, \dots, x_T)$ with a specific length $T$ into a sequence of words or characters (i.e., labels) $\mathbf{y} = (y_1, y_2, \dots, y_N)$, $y_{n}\in \mathbf{V}$, where $\mathbf{V}$ is the vocabulary. The labels might be character-level labels (i.e., letters) or word-level labels (i.e., words).

The most probable output sequence is given by:

$$\mathbf{\hat{y}} = \argmax_{\mathbf{y} \in \mathbf{V}}~ p(\mathbf{y}|\mathbf{x})$$

A typical ASR system has the following processing steps:

  1. Pre-processing

  2. Feature extraction

  3. Classification

  4. Language modeling.

The pre-processing step aims to improve the audio signal by increasing the signal-to-noise ratio, i.e., reducing the noise and filtering the signal.

In general, the features used for ASR are extracted as a specific number of values or coefficients, which are generated by applying various methods to the input. This step must be robust to various quality factors, such as noise or echo.

The majority of ASR methods adopt a small set of feature extraction techniques, the most common being Mel-frequency cepstral coefficients (MFCCs), which are described in the feature extraction section below.

The classification model aims to find the spoken text contained in the input signal. It takes the features produced by the feature extraction step and generates the output text.

The language model (LM) is an important module as it captures the grammatical rules or the semantic information of a language. Language models are important in order to recognize the output token from the classification model as well as to make corrections on the output text.

Datasets for ASR

Various databases with text from audiobooks, conversations, and talks have been recorded.

  1. The CallHome English, Spanish and German databases (Post et al.1) contain conversational data with a high number of out-of-vocabulary words. They are challenging databases with foreign words and telephone channel distortion. The English CallHome database has 120 spontaneous English telephone conversations between native English speakers. The training set has 80 conversations with about 15 hours of speech, while the test and development sets contain 20 conversations each, with 1.8 hours of audio per set.

Moreover, the CallHome Spanish database consists of 120 telephone conversations between native speakers. The training part has 16 hours of speech and its test set has 20 conversations with 2 hours of speech. Finally, the CallHome German database consists of 100 telephone conversations between native German speakers, with 15 hours of speech in the training set and 3.7 hours of speech in the test set.

  2. TIMIT2 is a large dataset with broadband recordings of American English, where each speaker reads 10 grammatically rich sentences. TIMIT contains audio signals that have been time-aligned and corrected, and can be used for character or word recognition. The audio files are encoded in 16 bits. The training set contains recordings from 462 speakers in total, while the validation set has recordings from 50 speakers and the test set from 24 speakers.

Feature extraction for ASR

Mel-frequency cepstral coefficients (MFCCs) are the most common method for extracting speech features. The human ear is a nonlinear system in how it perceives the audio signal. To account for this, the mel scale was developed as a perceptually linear model of the human auditory system: it is approximately linear for frequencies in the range of [0, 1] kHz and logarithmic for the remaining frequencies. The mel-scale frequency is computed as:

$$f_{mel} = \frac{1000}{\log(2)} \log\left(1+ \frac{f_{Hz}}{1000}\right)$$

where $f_{Hz}$ is the frequency of the original signal in Hz.
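As a quick numerical check, the mapping can be sketched in a few lines of Python. This assumes the common form $f_{mel} = \frac{1000}{\log 2}\log(1 + f_{Hz}/1000)$, under which 1000 Hz maps to exactly 1000 mel; the function names `hz_to_mel` and `mel_to_hz` are illustrative.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Map a frequency in Hz to mels: (1000 / log 2) * log(1 + f_hz / 1000)."""
    return 1000.0 / math.log(2.0) * math.log(1.0 + f_hz / 1000.0)


def mel_to_hz(f_mel: float) -> float:
    """Inverse mapping, obtained by solving the formula above for f_hz."""
    return 1000.0 * (math.exp(f_mel * math.log(2.0) / 1000.0) - 1.0)


if __name__ == "__main__":
    # Below 1 kHz the curve is close to linear; above it grows logarithmically.
    for f in (250.0, 1000.0, 4000.0):
        print(f, round(hz_to_mel(f), 1))
```

Equal steps in Hz above 1 kHz correspond to progressively smaller steps in mel, which is why mel filter banks use wider filters at high frequencies.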

The MFCC feature extraction technique basically includes the following steps:

  • Window the signal

  • Apply Discrete Fourier Transform

  • Logarithm of the magnitude

  • Convert to a Mel scale

  • Apply inverse discrete cosine transform (DCT)
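The steps above can be sketched with NumPy as follows. This is a minimal illustration, not a reference implementation: it follows the widely used ordering in which a triangular mel filter bank is applied to the power spectrum before the logarithm, and it uses a DCT-II as the final transform. The frame length, hop size, FFT size, and numbers of filters and coefficients are assumed values for 16 kHz audio.

```python
import numpy as np


def hz_to_mel(f):
    return 1000.0 / np.log(2.0) * np.log1p(f / 1000.0)


def mel_to_hz(m):
    return 1000.0 * np.expm1(m * np.log(2.0) / 1000.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centers are equally spaced on the mel scale."""
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):          # rising edge of the triangle
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling edge of the triangle
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank


def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_ceps=13):
    signal = np.asarray(signal, dtype=float)
    # 1. Window the signal (25 ms Hamming frames every 10 ms at 16 kHz).
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # 2. Discrete Fourier transform, kept as a power spectrum.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    # 3-4. Warp to the mel scale and take the logarithm.
    fbank = mel_filterbank(n_filters, n_fft, sample_rate)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # 5. Decorrelate with a DCT-II and keep the first n_ceps coefficients.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return log_mel @ basis.T


if __name__ == "__main__":
    t = np.arange(16000) / 16000.0
    print(mfcc(np.sin(2 * np.pi * 440.0 * t)).shape)
```

For one second of 16 kHz audio this yields one 13-dimensional vector per 10 ms frame.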

Deep Neural Networks for ASR

In the deep learning era, neural networks have shown significant improvement in the speech recognition task. Various methods have been applied such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), while recently Transformer networks have achieved great performance.

Recurrent Neural Networks

RNNs perform computations on the time sequence, since their current hidden state depends on all the previous hidden states. More specifically, they are designed to model time-series signals and to capture long-term and short-term dependencies between different time-steps of the input.

Concerning speech recognition applications, the input signal $\mathbf{x} = (x_1, x_2, \dots, x_T)$ is passed through the RNN to compute the hidden sequence $\mathbf{h} = (h_1, h_2, \dots, h_N)$ and the output sequence $\mathbf{y} = (y_1, y_2, \dots, y_N)$. One major drawback of the simple form of RNNs is that it generates the next output based only on the previous context.

[Figure: Bidirectional RNN]

RNNs compute the sequence of hidden vectors h\mathbf{h} as:

$$\begin{aligned} h_t &= H(W_{xh}x_t + W_{hh}h_{t-1} + b_{h})\\ y_t &= W_{hy}h_t + b_{y}, \end{aligned}$$

where $\mathbf{W}$ are the weights, $\mathbf{b}$ are the bias vectors and $H$ is the nonlinear function.
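The recurrence $h_t = H(W_{xh}x_t + W_{hh}h_{t-1} + b_h)$, $y_t = W_{hy}h_t + b_y$ can be unrolled over a sequence as in the sketch below. The choice of `tanh` for $H$ and the zero initial hidden state are assumptions made for illustration.

```python
import numpy as np


def rnn_forward(x, W_xh, W_hh, W_hy, b_h, b_y):
    """Unroll a simple RNN over x of shape (T, d_in).

    Returns the hidden states (T, d_h) and the outputs (T, d_out).
    """
    h_t = np.zeros(W_hh.shape[0])        # assumed initial state h_0 = 0
    hidden, outputs = [], []
    for x_t in x:
        # H is taken to be tanh here (an assumption).
        h_t = np.tanh(W_xh @ x_t + W_hh @ h_t + b_h)
        hidden.append(h_t)
        outputs.append(W_hy @ h_t + b_y)
    return np.stack(hidden), np.stack(outputs)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d_in, d_h, d_out = 5, 3, 4, 2
    h, y = rnn_forward(rng.normal(size=(T, d_in)),
                       rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)),
                       rng.normal(size=(d_out, d_h)), np.zeros(d_h), np.zeros(d_out))
    print(h.shape, y.shape)
```

Each output depends only on inputs up to the current time-step, which is the limitation of the unidirectional form.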

RNNs limitations and solutions

However, in speech recognition, usually the information of the future context is equally significant as the past context (Graves et al.3). That’s why instead of using a unidirectional RNN, bidirectional RNNs (BiRNNs) are commonly selected in order to address this shortcoming. BiRNNs process the input vectors in both directions i.e., forward and backward, and keep the hidden state vectors for each direction as shown in the above figure.

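Neural networks, both feed-forward and recurrent, can only be used for frame-wise classification of the input audio, which requires the alignment between the audio frames and the transcription to be known in advance.

This problem can be addressed using: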

  • Hidden Markov Models (HMMs) to get the alignment between the input audio and its transcribed output.

  • Connectionist Temporal Classification (CTC) loss, which is the most common technique.

CTC is an objective function that computes the alignment between the input speech signal and the output sequence of words. CTC uses a blank label that represents a silent time-step, i.e., the person doesn't speak, or the transition between words or phonemes. Given the input $\mathbf{x}$ and the output probability sequence of words or characters $\mathbf{y}$, the probability of an alignment path $\boldsymbol{\alpha}$ is calculated as:

$$P(\boldsymbol{\alpha}|\mathbf{x}) = \prod_{t=1}^{T}P(\alpha_t|\mathbf{x})$$

where $\alpha_t$ is a single alignment at time-step $t$.

For a given transcription sequence, there are several possible alignments, since labels can be separated by blanks in different ways. For example, the alignments $(a,-,b,c,-,-)$ and $(-,-,a,-,b,c)$, where $-$ is the blank symbol, both correspond to the character sequence $(a,b,c)$.
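The mapping from an alignment to its label sequence can be sketched as a two-step collapse: merge consecutive repeated labels, then drop the blanks. The helper name `collapse` and the use of `"-"` as the blank symbol are illustrative choices.

```python
from itertools import groupby

BLANK = "-"


def collapse(alignment, blank=BLANK):
    """Merge consecutive repeated labels, then remove blank symbols."""
    merged = [label for label, _ in groupby(alignment)]
    return [label for label in merged if label != blank]


if __name__ == "__main__":
    print(collapse(["a", "-", "b", "c", "-", "-"]))
    print(collapse(["-", "-", "a", "-", "b", "c"]))
```

A blank between two identical labels keeps them distinct, so `["a", "-", "a"]` collapses to two labels while `["a", "a"]` collapses to one.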

Finally, the total probability of all alignment paths $\boldsymbol{\alpha}$ that correspond to $\mathbf{y}$ is calculated as:

$$P(\mathbf{y}|\mathbf{x}) = \sum_{\boldsymbol{\alpha}} P(\boldsymbol{\alpha}|\mathbf{x})$$

CTC aims to maximize the total probability of the correct alignments in order to get the correct output word sequence. One main benefit of CTC is that it doesn't require prior segmentation or alignment of the data. DNNs can be used directly to model the features and achieve great performance in speech recognition tasks.
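Summing over every alignment explicitly is exponential in the number of frames, so in practice the sum is computed with dynamic programming over a blank-augmented label sequence. The sketch below is a minimal, unoptimized version of this forward recursion (no log-domain arithmetic), checked against a brute-force enumeration; the function names and the convention that label `0` is the blank are assumptions.

```python
import itertools
import numpy as np


def ctc_probability(probs, target, blank=0):
    """Total probability of all alignments of `target`.

    probs:  (T, K) per-frame label probabilities.
    target: list of label ids without blanks.
    """
    T = probs.shape[0]
    ext = [blank]
    for label in target:
        ext += [label, blank]            # blank, y1, blank, y2, ..., blank
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1, s]
            if s >= 1:
                total += alpha[t - 1, s - 1]
            # A blank may be skipped only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[t - 1, s - 2]
            alpha[t, s] = total * probs[t, ext[s]]
    # Valid paths end on the last label or on the trailing blank.
    return alpha[-1, -1] + (alpha[-1, -2] if S > 1 else 0.0)


def brute_force(probs, target, blank=0):
    """Enumerate every path and sum those that collapse to `target`."""
    T, K = probs.shape
    total = 0.0
    for path in itertools.product(range(K), repeat=T):
        merged = [k for k, _ in itertools.groupby(path)]
        if [k for k in merged if k != blank] == list(target):
            total += float(np.prod([probs[t, k] for t, k in enumerate(path)]))
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = rng.random((4, 3))
    p /= p.sum(axis=1, keepdims=True)
    print(ctc_probability(p, [1, 2]), brute_force(p, [1, 2]))
```

The negative logarithm of this quantity is what a CTC loss minimizes during training.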


The decoding process is used to generate predictions from a trained model using CTC. There are several decoding algorithms. The most common is best-path decoding, where the label with the maximum probability is selected at every time-step. Since the model assumes that the latent symbols are independent given the network outputs in the frame-wise case, the output with the highest probability is obtained at each time-step as:

$$\mathbf{\hat{y}} = \argmax P(\mathbf{y}|\mathbf{x})$$
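Best-path decoding can be sketched as a per-frame argmax followed by the usual CTC collapse (merge repeats, drop blanks). The convention that label `0` is the blank is an assumption.

```python
from itertools import groupby

import numpy as np


def best_path_decode(probs, blank=0):
    """Greedy CTC decoding for probs of shape (T, K)."""
    best = np.argmax(probs, axis=1)                  # most likely label per frame
    return [int(k) for k, _ in groupby(best) if k != blank]


if __name__ == "__main__":
    probs = np.array([[0.1, 0.8, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.9, 0.05, 0.05],
                      [0.1, 0.1, 0.8]])
    print(best_path_decode(probs))
```

This is fast but not guaranteed to return the most probable label sequence, because it ignores the probability mass spread over other alignments of the same sequence.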

Beam search has also been adopted for CTC decoding. The most likely transcription is searched for by moving left to right over the time-steps while maintaining a small number $B$ of partial hypotheses. Each hypothesis is a prefix of the output sequence, and at each time-step it is extended in the beam with every possible word in the vocabulary.


In other works (e.g., Rao et al.4), an architecture commonly known as the RNN-Transducer has also been employed for ASR. This method combines an RNN with CTC and a separate RNN that predicts the next output given the previous one. It determines a separate probability distribution $P(y_k|t,u)$ for every time-step $t$ of the input and time-step $u$ of the output for the $k$-th element of the output $\mathbf{y}$.

An encoder network converts the acoustic feature $x_t$ at time-step $t$ to a representation $he_t = f_{enc}(x_t)$. Furthermore, a prediction network takes the previous label $y_{u-1}$ and generates a new representation $hp_u = f_{p}(y_{u-1})$. The joint network is a fully-connected layer that combines the two representations and generates the posterior probability $P(\mathbf{y}|t,u) = f_{joint}(he_t : hp_u)$. In this way, the RNN-Transducer can generate the next symbols or words by using information from both the encoder and the prediction network, depending on whether the predicted label is a blank or a non-blank label. The inference procedure stops when a blank label is emitted at the last time-step.
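The joint network can be sketched as a single fully-connected layer over the concatenated encoder and prediction-network representations, followed by a softmax. The dimensions, the random weights, and the function names below are illustrative assumptions; real systems typically add a nonlinearity and learn the weights jointly with both networks.

```python
import numpy as np


def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def joint(he_t, hp_u, W, b):
    """Posterior over labels (including blank) for one (t, u) pair."""
    return softmax(W @ np.concatenate([he_t, hp_u]) + b)


def joint_lattice(he, hp, W, b):
    """Evaluate the joint network on every (t, u) pair: shape (T, U, K)."""
    return np.array([[joint(he_t, hp_u, W, b) for hp_u in hp] for he_t in he])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, U, d_enc, d_pred, K = 6, 4, 8, 5, 10
    he = rng.normal(size=(T, d_enc))     # encoder outputs, one per input frame
    hp = rng.normal(size=(U, d_pred))    # prediction-network outputs, one per label
    W, b = rng.normal(size=(K, d_enc + d_pred)), np.zeros(K)
    print(joint_lattice(he, hp, W, b).shape)
```

Decoding then walks this $T \times U$ lattice: a blank advances $t$, a non-blank label advances $u$.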

[Figure: RNN Transducer overview]

Graves et al.3 tested regular RNNs with CTC and RNN-Transducers on the TIMIT2 database using different numbers of layers and hidden states.

The feature extraction is performed with a Fourier transform filter-bank method of 40 coefficients that are distributed on a logarithmic mel-scale concatenated with the first and second temporal derivatives.

In the table below, it is shown that RNN-T with 3 layers of 250 hidden states each has the best performance of 17.7% phoneme error rate (PER), while simple RNN-CTC models perform worse with PER > 18.4%.

[Figure: RNN performance on TIMIT3]

End-to-end ASR with RNN-Transducer (RNN-T)

Rao et al.5 proposed an encoder-decoder RNN. The proposed method adopts an encoder network consisting of several blocks of LSTM layers, which are pre-trained with CTC using phonemes, graphemes, and words as output. In addition, a 1D-CNN reduces the length $T$ of the time sequence by a factor of 3 using specific kernel strides and sizes.

The decoder network is an RNN-T model trained along with an LSTM language model that also predicts words. The target of the network is the next label in the sequence and is used in the cross-entropy loss to optimize the network. Concerning feature extraction, 80-dimensional mel-scale features are computed every 10 msec and stacked every 30 msec to a single 240-dimensional acoustic feature vector.
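The frame stacking described above (80-dimensional features every 10 msec, stacked into one 240-dimensional vector every 30 msec) amounts to a reshape. The sketch below assumes that leftover frames at the end of the utterance are simply dropped; the function name is illustrative.

```python
import numpy as np


def stack_frames(features, factor=3):
    """Stack `factor` consecutive frames: (T, d) -> (T // factor, factor * d)."""
    T, d = features.shape
    usable = (T // factor) * factor      # drop trailing frames (assumption)
    return features[:usable].reshape(T // factor, factor * d)


if __name__ == "__main__":
    feats = np.random.default_rng(0).normal(size=(100, 80))   # 1 s at 10 ms hop
    print(stack_frames(feats).shape)                          # 33 vectors of size 240
```

Stacking reduces the sequence length seen by the encoder by the same factor, which lowers the cost of the recurrent layers.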

[Figure: RNN-T method5]

The method is trained on a set of 22 million hand-transcribed audio recordings extracted from Google US English voice traffic, which corresponds to 18,000 hours of training data. These include voice-search as well as voice-dictation utterances. The language model was pretrained on text sentences obtained from the dataset. The method was tested with different configurations. It achieves 5.2% WER on this large dataset when the encoder contains 12 layers of 700 hidden units and the decoder 2 layers of 1000 hidden units each.

[Figure: Results from the RNN-T method5]

Streaming end-to-end speech recognition for mobile devices

RNN-Transducers have also been adopted for real-time speech recognition (He et al.6). In this work, the model consists of 8 layers of uni-directional LSTM cells, while a time-reduction layer is used in the encoder to speed up training and inference. Memory caching techniques are also used to avoid redundant computation for identical prediction histories. This saves about 50–60% of the prediction network computations. In addition, different threads are used for the encoder and the prediction network to enable pipelining and save a significant amount of time.

The encoder inference procedure is split over two threads corresponding to the components before and after the time-reduction layer, which balances the computation between the two encoder components and the prediction network, and gives a speedup of 28% compared with single-threaded execution. Furthermore, parameters are quantized from 32-bit floating-point precision into 8-bit to reduce memory consumption, both on disk and at run-time, and to optimize the model’s execution in real-time.
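To make the idea of 8-bit parameter quantization concrete, the sketch below shows a generic per-tensor symmetric scheme: weights are scaled so that the largest magnitude maps to 127 and are then rounded to `int8`. This is an illustrative textbook scheme, not necessarily the exact procedure used by He et al.; the function names are assumptions.

```python
import numpy as np


def quantize_symmetric(w):
    """Quantize float32 weights to int8 with a single symmetric scale."""
    scale = float(np.max(np.abs(w))) / 127.0
    if scale == 0.0:
        scale = 1.0                      # all-zero tensor: any scale works
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Recover an approximation of the original float32 weights."""
    return q.astype(np.float32) * scale


if __name__ == "__main__":
    w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
    q, scale = quantize_symmetric(w)
    print(q.nbytes, w.nbytes)                        # 4x smaller storage
    print(np.max(np.abs(dequantize(q, scale) - w)))  # at most about scale / 2
```

Because zero maps exactly to zero and no offset is stored, symmetric schemes keep the integer matrix multiplications simple.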

The algorithm was trained on a dataset that consists of 35 million English utterances with a total duration of 27,500 hours. The training utterances are hand-transcribed and are obtained from Google’s voice search and dictation traffic, and the training set was created by artificially corrupting clean utterances using a room simulator. The reported results are evaluated on 14,800 voice search (VS) samples extracted from Google traffic assistant, as well as 15,700 voice command samples, denoted as the IME test set. The feature extraction step creates 80-dimensional mel-scale features computed every 25 msec. The results are reported in inference speed divided by audio duration (RT90) and WER. The RNN-T model with symmetric quantization achieves WERs of 7.3% on the voice search set and 4.2% on the IME set.

[Figure: Quantization results6]

Fast and Accurate Recurrent Neural Network Acoustic Models for ASR

Sak et al.7 adopt long short-term memory (LSTM) networks for large vocabulary speech recognition. Their method extracts high-dimensional features using mel-filter banks with a sliding window technique. In addition, they incorporate context-dependent states and further improve the performance of the model. The method is evaluated on hand-transcribed audio recordings from real Google voice search traffic. The training set has 3 million utterances with an average duration of 4 seconds. The results are shown in the tables below:

[Figure: Context dependent and independent results7]

[Figure: Results with different vocabulary size7]

Attention-based models

Other works have adopted the attention encoder-decoder structure of the RNN, which directly computes the conditional probability of the output sequence given the input sequence without assuming a fixed alignment. The encoder-decoder method uses an attention mechanism, which does not require pre-segmented alignment of the data. An attention-based model uses a single decoder to produce a distribution over the labels conditioned on the full sequence of previous predictions and the input audio. With attention, it can implicitly learn the soft alignment between input and output sequences, which solves a big problem for speech recognition.

Such models also perform well on long input sequences, so they can handle speech input of various lengths. More specifically, the model computes the output probability $P(\mathbf{y}|\mathbf{x})$, where the lengths of the input and output are different. The encoder maps the input to the context vector $\mathbf{c}_i$ for each output $y_i$. The decoder computes:

$$P(\mathbf{y}|\mathbf{x}) = \prod_{i=1}^{I} P(y_i | y_1, \dots, y_{i-1}, \mathbf{c}_i)$$

conditioned on the previous outputs $y_1, \dots, y_{i-1}$ and the context $\mathbf{c}_i$, where $I$ is the length of the output sequence.

The posterior probability of symbol $y_i$ is calculated as:

$$\begin{aligned} P(y_i | y_1, \dots, y_{i-1}, c_i) &= g(y_{i-1}, s_i, c_i)\\ s_i &= f(y_{i-1}, s_{i-1}, c_i), \end{aligned}$$

where $s_i$ is the output of the recurrent layer $f$ and $g$ is the softmax function.

The context is obtained from the weighted average of the hidden states of all time-steps as:

$$\begin{aligned} c_i &= \sum_{t=1}^{T} a_{i,t} h_t \\ a_{i,t} &= \frac{\exp(e_{i,t})}{\sum_{t'=1}^{T}\exp(e_{i,t'})}, \end{aligned}$$

where $e_{i,t}$ is the relevance score of time-step $t$ for output $i$, $a_{i,t} \in [0,1]$ and $\sum_{t=1}^{T} a_{i,t} = 1$.

The attention mechanism selects the temporal locations over the input sequence that should be used to update the hidden state of the RNN and to predict the next output value. It assigns the attention weights $a_{i,t}$, which measure the relevance between the input and the output.
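The context computation can be sketched as a softmax over the relevance scores followed by a weighted average of the encoder hidden states. How the scores themselves are produced depends on the model; here they are simply passed in, and the function name is illustrative.

```python
import numpy as np


def attention_context(scores, h):
    """Turn relevance scores into weights and a context vector.

    scores: (T,) relevance of each encoder time-step for the current output.
    h:      (T, d) encoder hidden states.
    """
    e = np.exp(scores - scores.max())    # subtract the max for numerical stability
    a = e / e.sum()                      # weights lie in [0, 1] and sum to 1
    return a, a @ h                      # context: weighted average of hidden states


if __name__ == "__main__":
    h = np.random.default_rng(0).normal(size=(5, 3))
    a, c = attention_context(np.array([0.1, 2.0, 0.3, -1.0, 0.0]), h)
    print(a.round(3), c.round(3))
```

With equal scores the context reduces to the plain mean of the hidden states; a single dominant score makes it approach one hidden state.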

Attention-based recurrent sequence generator

Chorowski et al.8 adopt an attention-based recurrent sequence generator (ARSG) that generates the output word sequence from speech features $\mathbf{h} = (h_1, h_2, \dots, h_L)$ that can be modelled by any type of encoder. ARSG generates the output $y_i$ by focusing on the relevant features:

$$\begin{aligned} a_i &= \mathrm{attend}(s_{i-1}, a_{i-1}, \mathbf{h})\\ g_i &= \sum_{j=1}^{L} a_{i,j}h_j\\ y_i &= \mathrm{generate}(s_{i-1}, g_i), \end{aligned}$$

where $s_i$ is the $i$-th state of the RNN and $a_i$ are the attention weights.

A new state is generated as:

$$s_i = \mathrm{recurrency}(s_{i-1}, g_i, y_i)$$

In more detail, the scoring mechanism works as:

$$\begin{aligned} e_{i,j} &= \mathrm{score}(s_{i-1}, h_j)\\ a_{i,j} &= \frac{\exp(e_{i,j})}{\sum_{j=1}^{L}\exp(e_{i,j})} \end{aligned}$$

ARSG is evaluated on the TIMIT dataset and achieves phoneme error rates (PERs) of 15.8% and 17.6% on the validation and test sets, respectively.

Listen-Attend-Spell (LAS)

In Chan et al.9 and Chiu et al.10, the Listen-Attend-Spell (LAS) method was developed. The encoder (i.e., Listen) takes the input audio $\mathbf{x}$ and generates the representation $\mathbf{h}$. More specifically, it uses a bidirectional Long Short Term Memory (BLSTM) module with a pyramid structure, where in each layer the time resolution is reduced. The output at the $i$-th time step from the $j$-th layer is computed as:

$$\begin{aligned} \mathbf{h} &= \mathrm{listen}(\mathbf{x})\\ h_i^j &= \mathrm{BLSTM}(h_{i-1}^{j}, h_i^{j-1}) \end{aligned}$$

The decoder (i.e., Attend-Spell) is an attention-based module that attends to the representation $\mathbf{h}$ and produces the output probability $P(\mathbf{y}|\mathbf{x})$. In more detail, an attention-based LSTM transducer produces the next character based on the previous outputs as:

$$\begin{aligned} c_i &= \mathrm{AttentionContext}(s_i, \mathbf{h})\\ s_i &= \mathrm{LSTM}(s_{i-1}, y_{i-1}, c_{i-1})\\ P(y_i|\mathbf{x}) &= \mathrm{FC}(s_i, c_i), \end{aligned}$$

where $s_i$ and $c_i$ are the decoder state and the context vector, respectively.

LAS was evaluated on 3 million Google voice search utterances with 2000 hours of speech, where 10 hours of utterances were randomly selected for validation. Data augmentation was also performed on the training dataset using a room simulator as well as by adding other types of noise and reverberations. It was able to achieve great recognition rates with WERs of 10.3% and 12.0% on clean and noisy environments, respectively.

[Figure: Overview of LAS method]

End-to-end Speech Recognition with Word-based RNN Language Models and Attention

Hori et al.11 adopt a joint decoder using CTC, an attention decoder, and an RNN language model. A CNN encoder network takes the input audio $\mathbf{x}$ and outputs the hidden sequence $\mathbf{h}$ that is shared between the decoder modules. The decoder network iteratively predicts the label sequence $\mathbf{c}$ based on the hidden sequence. The joint decoder utilizes CTC, attention and the language model to enforce better alignments between the input and the output and find a better output sequence. The network is trained to maximize the following joint function:

$$L = \lambda~\mathrm{log} ~p_{CTC} (\mathbf{c}| \mathbf{x}) + (1-\lambda)~ \mathrm{log}~ p_{Att} (\mathbf{c}| \mathbf{x})$$

During inference, to find the most probable word sequence $\hat{\mathbf{c}}$, the decoder finds the most probable words as:

$$\hat{\mathbf{c}} = \mathrm{argmax}~[~\lambda~\mathrm{log}~p_{CTC} (\mathbf{c}| \mathbf{x}) + (1-\lambda)~ \mathrm{log}~ p_{Att} (\mathbf{c}| \mathbf{x}) + \gamma ~ \mathrm{log}~ p_{LM}(\mathbf{c})]$$

where the language model probability is also used.

[Figure: Joint decoder]
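The joint decoding score can be sketched as a weighted sum of log-probabilities, assuming the CTC term is weighted by $\lambda$ in the log domain as in the training objective. The hypothesis strings, their log-probabilities, and the values of `lam` and `gamma` below are made-up illustrative numbers, not values from the paper.

```python
def joint_score(log_p_ctc, log_p_att, log_p_lm, lam=0.3, gamma=0.5):
    """lam * log p_CTC + (1 - lam) * log p_Att + gamma * log p_LM."""
    return lam * log_p_ctc + (1.0 - lam) * log_p_att + gamma * log_p_lm


def best_hypothesis(hypotheses, lam=0.3, gamma=0.5):
    """Pick the hypothesis with the highest joint score.

    hypotheses: dict mapping a transcription to its
                (log p_CTC, log p_Att, log p_LM) triple.
    """
    return max(hypotheses, key=lambda c: joint_score(*hypotheses[c], lam, gamma))


if __name__ == "__main__":
    hyps = {
        "recognize speech": (-2.0, -1.0, -1.0),
        "wreck a nice beach": (-1.0, -2.0, -3.0),
    }
    print(best_hypothesis(hyps))
```

In a real decoder this score is applied incrementally to partial hypotheses inside beam search rather than to a fixed list of complete candidates.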

The method is evaluated on Wall Street Journal (WSJ) and LibriSpeech datasets.

WSJ12 is a well-known English clean speech database including approximately 80 hours.

LibriSpeech is a large data set of read speech from audiobooks and contains 1000 hours of audio and transcriptions13. The experimental results of the proposed method on LibriSpeech and WSJ are shown in the following tables, respectively.

[Figure: Evaluation on the LibriSpeech dataset]

[Figure: Evaluation on the WSJ dataset]

Convolutional Models

Convolutional neural networks were initially implemented for computer vision (CV) tasks. In recent years, CNNs have also been widely applied in the field of natural language processing (NLP), due to their good generation and discrimination capabilities.

A very typical CNN architecture is formed of several convolutional and pooling layers with fully connected layers for classification. A convolutional layer is composed of kernels that are convolved with the input. A convolutional kernel divides the input signal into smaller parts, namely the receptive field of the kernel. Furthermore, the convolution operation is performed by multiplying the kernel with the corresponding parts of the input that fall within the receptive field. Convolutional methods can be grouped into 1-dimensional and 2-dimensional networks, respectively.

2D-CNNs construct 2D feature maps from the acoustic signal. Similar to images, they organize acoustic features i.e., MFCC features, in a 2-dimensional feature map, where one axis represents the frequency domain and the other represents the time domain. In contrast, 1D-CNNs accept acoustic features directly as input.

In 1D-CNNs for speech recognition, every input feature map $X=(X_1,\dots, X_I)$ is connected to many feature maps $O = (O_1, \dots, O_J)$. The convolution operation can be written as:

$$O_j = \sigma\left(\sum_{i=1}^{I}X_i w_{i,j} \right), ~ j\in[1,J]$$

where $\mathbf{w}$ is the local weight.

  • In 1D-CNNs, $\mathbf{w}$ and $\mathbf{O}$ are vectors.

  • In 2D-CNNs, they are matrices.
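For the 1D case, the layer $O_j = \sigma(\sum_i X_i w_{i,j})$ can be sketched as below, where each product $X_i w_{i,j}$ is read as sliding the kernel $w_{i,j}$ along the feature map $X_i$. The logistic sigmoid for $\sigma$, the absence of padding and bias, and the function name are assumptions.

```python
import numpy as np


def conv1d_layer(X, w):
    """1D convolutional layer.

    X: (I, T) input feature maps.
    w: (I, J, k) local weights, one length-k kernel per (input, output) pair.
    Returns O of shape (J, T - k + 1).
    """
    I, T = X.shape
    _, J, k = w.shape
    O = np.zeros((J, T - k + 1))
    for j in range(J):
        for i in range(I):
            # np.correlate slides the kernel without flipping it, as CNN layers do.
            O[j] += np.correlate(X[i], w[i, j], mode="valid")
    return 1.0 / (1.0 + np.exp(-O))      # sigma: logistic sigmoid (assumption)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(conv1d_layer(rng.normal(size=(3, 20)), rng.normal(size=(3, 4, 5))).shape)
```

In the 2D case the same sum runs over 2D feature maps and 2D kernels, so each $O_j$ becomes a matrix over time and frequency.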

Abdel-Hamid et al.14 were the first to apply CNNs to speech recognition. Their method adopts two types of convolutional layers. The first one adopts full weight sharing (FWS), where weights are shared across all positions of the input. This technique is common in CNNs for image recognition, since the same characteristics may appear at any location in an image. However, in speech recognition, the signal varies across different frequencies and has distinct feature patterns in different filters. To tackle this, limited weight sharing (LWS) is used, where only the convolution filters that are attached to the same pooling filters share the same weights.

[Figure: Illustration of 2D-CNN feature map for speech recognition14]

The speech input was analyzed with a 25-ms Hamming window with a fixed 10-ms frame rate. More specifically, feature vectors are generated by Fourier-transform-based filter-bank analysis, which includes 40 log energy coefficients distributed on a mel scale, along with their first and second temporal derivatives. All speech data were normalized so that each vector dimension has a zero mean and unit variance.
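The derivative and normalization steps can be sketched as follows, starting from a matrix of 40 log mel energies per frame. Using `np.gradient` (central finite differences) for the temporal derivatives and normalizing over a single utterance are simplifying assumptions; speech toolkits usually compute deltas with a regression window and may normalize over the whole corpus.

```python
import numpy as np


def add_deltas_and_normalize(feats):
    """feats: (T, 40) log mel energies -> (T, 120) normalized features."""
    delta = np.gradient(feats, axis=0)           # first temporal derivative
    delta2 = np.gradient(delta, axis=0)          # second temporal derivative
    full = np.concatenate([feats, delta, delta2], axis=1)
    # Zero mean and unit variance for every feature dimension.
    return (full - full.mean(axis=0)) / (full.std(axis=0) + 1e-8)


if __name__ == "__main__":
    feats = np.random.default_rng(0).normal(size=(200, 40))
    out = add_deltas_and_normalize(feats)
    print(out.shape, out.mean().round(6), out.std().round(6))
```

The resulting 120-dimensional vectors are what would then be arranged into the input feature maps of the CNN.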

The building block of their CNN architecture has convolution and pooling layers. The input features are organized as several feature maps. The size (resolution) of the feature maps gets smaller at upper layers as more convolution and pooling operations are applied, as shown in the figure below. Usually, one or more fully connected hidden layers are added on top of the final CNN layer to combine the features across all frequency bands before feeding to the output layer. They made a comprehensive study with different CNN configurations and achieved great results on TIMIT, which are shown in the table below. Their best model adopts only LWS layers and achieves a WER of 20.23%.

[Figure: Illustration of CNN method14]

[Figure: Results of CNN method14]

Residual CNN

Wang et al.15 adopted residual 2D-CNN (RCNN) with CTC loss for speech recognition. The residual block uses direct connections between the previous and the next layer as follows:

$$\mathbf{x}_{i+1} = f(\mathbf{x}_{i}, \mathbf{W}) + \mathbf{x}_{i}$$

where $f$ is a nonlinear function. This helps the network to converge faster without the use of extra parameters. The proposed architecture is depicted in the figure below. The Residual CNN-CTC method adopts 4 groups of residual blocks with small $3 \times 3$ filters. Each residual group has $N$ convolutional blocks with 2 layers. Each residual group also has different strides to reduce the computational cost and model temporal dependencies with different contexts. Batch normalization and ReLU activation are applied on each layer.
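The skip connection $\mathbf{x}_{i+1} = f(\mathbf{x}_i, \mathbf{W}) + \mathbf{x}_i$ can be sketched in a few lines. For brevity, $f$ is taken to be a single dense layer with ReLU instead of the convolutional blocks with batch normalization used in the paper; that substitution and the function names are assumptions.

```python
import numpy as np


def relu_layer(x, W):
    """Stand-in for the nonlinear function f: a dense layer followed by ReLU."""
    return np.maximum(0.0, W @ x)


def residual_block(x, W, f=relu_layer):
    """Add the input back to the output of f; no extra parameters are needed."""
    return f(x, W) + x


if __name__ == "__main__":
    x = np.array([1.0, -2.0, 3.0])
    print(residual_block(x, np.zeros((3, 3))))   # f contributes nothing: output is x
    print(residual_block(x, np.eye(3)))
```

When $f$ outputs zeros the block reduces to the identity, which is one intuition for why stacking many such blocks remains easy to optimize.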

[Figure: Illustration of residual CNN architecture15]

The RCNN is evaluated on WSJ with the standard configuration (si284 set for training, eval92 set for validation, and dev93 set for testing). Furthermore, it is evaluated on the Tencent Chat data set that contains about 1400 hours of speech data for training and an independent set of 2000 sentences for testing. The experimental results demonstrate the effectiveness of residual convolutional neural networks. RCNN can achieve WERs of 4.29%/7.65% on the validation and test sets of WSJ and 13.33% on the Tencent Chat dataset.


Li et al.17 implemented Jasper, a residual 1D-CNN with dense and residual blocks, as shown below. The network takes mel-filter-bank features as input and uses residual blocks that contain batch normalization and dropout layers for faster convergence and better generalization. The input is constructed from mel-filter-bank features obtained using 20 msec windows with a 10 msec overlap. The network has been tested with different types of normalization and activation functions, while each block is optimized to fit on a single GPU kernel for faster inference. Jasper is evaluated on LibriSpeech with different configuration settings. The best model has 10 blocks of 4 layers with BatchNorm + ReLU and achieves validation WERs of 6.15% and 17.38% on clean and noisy sets, respectively.

[Figure: Illustration of Jasper17]

Fully Convolutional Network

Zeghidour et al.16 implement a fully convolutional network (FCN) with 3 main modules. The convolutional front-end is a CNN with low-pass filters, convolutional filters similar to filter-banks, and a logarithmic function to extract features. The second module is a convolutional acoustic model with several convolutional layers, GELU activation functions, dropout, and weight regularization, which predicts the letters from the input. Finally, there is a convolutional language model with 14 convolutional residual blocks and bottleneck layers.

This module is used to evaluate the candidate transcriptions of the acoustic model using a beam search decoder. FCN is evaluated on WSJ and LibriSpeech datasets. Their best configuration adopts a trainable convolutional front-end with 80 filters and a convolutional language model. FCN achieves WERs of 6.8% on the validation set and 3.5% on the test set of WSJ, while on LibriSpeech it achieves validation WERs of 3.08%/9.94% on clean and noisy sets and testing WERs of 3.26%/10.47% on clean and noisy sets, respectively.

[Figure: Illustration of fully convolutional architecture16]

Time-Depth Separable Convolutions (TDS)

Differently from other works, Hannun et al.18 use time-separable convolutional networks with a limited number of parameters, because time-separable CNNs generalize better and are more efficient. The encoder uses 2D depth-wise convolutions along with layer normalization. The encoder outputs two vectors, the keys $\mathbf{k} = k_1, k_2,\dots, k_T$ and the values $\mathbf{v} = v_1, v_2,\dots, v_T$, from the input sequence $\mathbf{x}$ as:

$$[\mathbf{k},~ \mathbf{v}] = \mathrm{TSN}(\mathbf{x})$$

As for the decoder, a simple RNN is used and outputs the next token $y_u$ as:

$$\begin{aligned} \mathbf{Q}_u &= \mathrm{RNN}(y_{u-1}, \mathbf{Q}_{u-1})\\ \mathbf{S}_u &= \mathrm{attend}(\mathbf{Q}_u, \mathbf{k}, \mathbf{v})\\ y_u &= \mathrm{softmax}([\mathbf{S}_u , \mathbf{Q}_u]), \end{aligned}$$

where $\mathbf{S}_u$ is a summary vector and $\mathbf{Q}_u$ is the query vector.

TDS is evaluated on LibriSpeech with different receptive fields and kernel sizes in order to find the best setting for the time-separable convolutional layers. The best option is 11 time-separable blocks, which achieve WERs of 5.04% and 14.46% on dev clean and other sets, respectively.

[Figure: 2D depth-wise convolutional ASR method18]


ContextNet19 is a fully convolutional network that feeds global context information into the layers with squeeze-and-excitation modules. The CNN has $K$ layers and generates the features as:

$$\mathbf{h} = C_K(C_{K-1}(\dots (C_1(\mathbf{x})))),$$

where $C$ is a convolutional block followed by batch normalization and activation functions. Furthermore, the squeeze-and-excitation block generates a global channel-wise weight $\theta$ with a global average pooling layer, which is multiplied by the input $\mathbf{x}$ as:

$$\begin{aligned} \mathbf{\bar{x}} &= \frac{1}{T}\sum_{t=1}^{T} x_t \\ \theta &= \mathrm{fc}(\mathbf{\bar{x}})\\ SE(\mathbf{x}) &= \theta * \mathbf{x} \end{aligned}$$
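The squeeze-and-excitation step can be sketched as global average pooling over time, a small fully-connected network, and a channel-wise rescaling of every frame. The two-layer bottleneck with ReLU and a sigmoid gate used for $\mathrm{fc}$ here is an assumption borrowed from common squeeze-and-excitation designs, and the sizes are illustrative.

```python
import numpy as np


def squeeze_excite(x, W1, b1, W2, b2):
    """x: (T, C) features; returns x rescaled per channel by a global weight."""
    x_bar = x.mean(axis=0)                               # squeeze: average over time
    hidden = np.maximum(0.0, W1 @ x_bar + b1)            # bottleneck layer (assumption)
    theta = 1.0 / (1.0 + np.exp(-(W2 @ hidden + b2)))    # gate in (0, 1) per channel
    return theta * x                                     # broadcast over all T frames


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, C, r = 50, 16, 4
    x = rng.normal(size=(T, C))
    out = squeeze_excite(x, rng.normal(size=(r, C)), np.zeros(r),
                         rng.normal(size=(C, r)), np.zeros(C))
    print(out.shape)
```

Because $\theta$ is computed from the whole utterance, every frame is modulated by global context even though the convolutions themselves only see a local window.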

ContextNet is validated on LibriSpeech with 3 different configurations of ContextNet, with or without a language model. The 3 configurations are ContextNet(Small), ContextNet(Medium), and ContextNet(Large), which contain different numbers of layers and filters.

[Figure: Results on LibriSpeech with 3 different configurations of ContextNet, with or without language model]


Recently, with the introduction of Transformer networks20, machine translation and speech recognition have seen significant improvements. Transformer models that are designed for speech recognition are usually based on the encoder-decoder architecture similar to seq2seq models. In more detail, they are based on the self-attention mechanism instead of recurrence that is adopted by RNNs. The self-attention can attend to different positions of a sequence and extract meaningful representations. The self-attention mechanism takes three inputs, queries, values, and keys.
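As a concrete illustration, the standard scaled dot-product form of attention, $\mathrm{softmax}(\mathbf{Q}\mathbf{K}^T/\sqrt{d_k})\mathbf{V}$, can be sketched with NumPy as below. The shapes and random inputs are illustrative, and masking and batching are omitted.

```python
import numpy as np


def scaled_dot_product_attention(Q, K, V):
    """Q: (t_q, d_k), K: (t_k, d_k), V: (t_k, d_v) -> output (t_q, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # similarity of queries and keys
    scores = scores - scores.max(axis=-1, keepdims=True) # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the keys
    return weights @ V, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    out, w = scaled_dot_product_attention(rng.normal(size=(2, 4)),
                                          rng.normal(size=(3, 4)),
                                          rng.normal(size=(3, 5)))
    print(out.shape, w.sum(axis=-1))
```

In self-attention the queries, keys, and values are all projections of the same sequence, so every position can attend to every other position in one step.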

Let us denote the queries as $\mathbf{Q}\in\mathbb{R}^{t_q\times d_q}$, the values as $\mathbf{V}\in\mathbb{R}^{t_v\times d_v}$, and the keys as $\mathbf{K}\in\mathbb{R}^{t_k\times d_k}$, where $t_{*}$ are the corresponding sequence lengths and $d_{*}$ the corresponding feature dimensions. The output of self-attention is calculated as:

$$\mathrm{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V},$$

where $\frac{1}{\sqrt{d_k}}$ is a scaling factor. The Transformer, however, adopts multi-head attention, which calculates the self-attention $h$ times, once for each head $i$. In this way, each attention head focuses on different parts of the input and learns different representations. The multi-head attention is computed as:

$$\begin{aligned} \mathrm{MHA}(\mathbf{Q},\mathbf{K},\mathbf{V}) &= \mathrm{concat}(h_1, h_2, \dots, h_h)\mathbf{W}^O \\ h_i &= \mathrm{Attention}(\mathbf{Q}\mathbf{W}_i^Q,\mathbf{K}\mathbf{W}_i^K,\mathbf{V}\mathbf{W}_i^V), \end{aligned}$$

where $\mathbf{W}_i^Q\in \mathbb{R}^{d_{model}\times d_q}$, $\mathbf{W}_i^K\in\mathbb{R}^{d_{model}\times d_k}$, $\mathbf{W}_i^V\in\mathbb{R}^{d_{model}\times d_v}$, $\mathbf{W}^O\in \mathbb{R}^{hd_v\times d_{model}}$, and $d_{model}$ is the dimensionality of the Transformer. Finally, a feed-forward network is used that contains two fully connected layers with a ReLU activation function in between:

$$\mathrm{FFN}(\mathbf{x}) = \mathrm{ReLU}(\mathbf{x}\mathbf{W}_1+\mathbf{b}_1)\mathbf{W}_2+\mathbf{b}_2,$$

where $\mathbf{W}_1\in \mathbb{R}^{d_{model}\times d_{ff}}$ and $\mathbf{W}_2\in \mathbb{R}^{d_{ff}\times d_{model}}$ are the weights, and $\mathbf{b}_1\in \mathbb{R}^{d_{ff}}$ and $\mathbf{b}_2\in \mathbb{R}^{d_{model}}$ are the biases. To enable the Transformer to attend to relative positions, a positional encoding is added to the input. The most common technique is the sinusoidal encoding, described by:

$$\mathbf{PE}(j,i) = \begin{cases} \sin(j/10000^{2i/d_{model}}) & 0 \leq i < d_{model}/2 \\ \cos(j/10000^{2i/d_{model}}) & d_{model}/2 \leq i < d_{model} \end{cases}$$

where $j$ is the position in the sequence and $i$ is the index of the feature dimension. Finally, normalization layers and residual connections are used to speed up training.
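To make the building blocks above concrete, here is a minimal NumPy sketch of scaled dot-product attention, multi-head attention, and the sinusoidal encoding. It is an illustration rather than a reference implementation: it handles a single unbatched sequence without masking, the projection lists `Wq`, `Wk`, `Wv` and the output matrix `Wo` are hypothetical randomly initialized parameters, and `positional_encoding` follows the formula above literally (sine on the first half of the channels, cosine on the second half).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)     # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention. Q: (t_q, d_k), K: (t_k, d_k), V: (t_k, d_v)."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (t_q, t_k), every row sums to 1
    return weights @ V, weights

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo):
    """Wq, Wk, Wv: lists with one projection matrix per head; Wo: (h * d_v, d_model)."""
    heads = [attention(Q @ wq, K @ wk, V @ wv)[0] for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

def positional_encoding(length, d_model):
    j = np.arange(length)[:, None]              # positions, shape (length, 1)
    i = np.arange(d_model)[None, :]             # channel indices, shape (1, d_model)
    angle = j / np.power(10000.0, 2 * i / d_model)
    return np.where(i < d_model // 2, np.sin(angle), np.cos(angle))

rng = np.random.default_rng(0)
t, d_model, h = 6, 16, 4
d_k = d_v = d_model // h
pe = positional_encoding(t, d_model)
x = rng.normal(size=(t, d_model)) + pe          # positional encoding is added to the input
Wq = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wk = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wv = [rng.normal(size=(d_model, d_v)) for _ in range(h)]
Wo = rng.normal(size=(h * d_v, d_model))
out = multi_head_attention(x, x, x, Wq, Wk, Wv, Wo)   # self-attention: Q = K = V = x
```

In self-attention the same sequence plays all three roles, so the output keeps the input shape `(t, d_model)` and can be added back to `x` through a residual connection.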


The Speech-Transformer [21] transforms the speech feature sequence into the corresponding character sequence. The feature sequence, which is longer than the output character sequence, is constructed from 2-dimensional spectrograms with time and frequency dimensions. More specifically, CNNs are used to exploit the structural locality of spectrograms and to mitigate the length mismatch by striding along time.

Figure: Illustration of the Speech Transformer [21]

Figure: Illustration of 2D attention [21]

In the Speech Transformer, 2D attention is used in order to attend to both the frequency and the time dimensions. The queries, keys, and values are extracted from convolutional neural networks and fed to the two self-attention modules. The Speech Transformer is evaluated on WSJ datasets and achieves competitive recognition results with a WER of $10.9\%$, while it needs about $80\%$ less training time than conventional RNNs or CNNs.

Transformers with convolutional context

Mohamed et al. [22] adopt an encoder-decoder model formed by CNNs and a Transformer to learn local relationships and context of the speech signal. For the encoder, 2D convolutional modules with layer normalization and ReLU activation are used. In addition, each 2D convolutional module is formed by $K$ convolutional layers with max-pooling. For the decoder, 1D convolutions are performed over embeddings of the past predicted words.


Similar to the RNN-Transducer, a Transformer-Transducer [23] model has also been developed for speech recognition. As in RNN-T, the model's joint network combines the output of the audio encoder $\mathrm{AE}$ at time-step $t_i$ with the encoding of the previously predicted label sequence $\mathbf{z}_0^{i-1} = (z_1,\dots, z_{i-1})$, which is produced by a label encoder denoted as $\mathrm{LE}$. The combination is then passed through a feed-forward network and a softmax layer.

The joint representation is produced as:

$$J = \mathrm{fc}(\mathrm{AE}(\mathbf{x}))(t_i) + \mathrm{fc}(\mathrm{LE}(\mathbf{z}_0^{i-1})),$$

where $\mathrm{fc}$ is a fully connected layer.

Then, the distribution of the alignment at time-step $t_i$ is computed as:

$$P(z_i|\mathbf{x}, t_i, \mathbf{z}_0^{i-1}) = \mathrm{softmax}(\mathrm{fc}(J))$$
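The two equations above translate almost directly into code. The sketch below is a hedged illustration: the encoder outputs are random stand-ins for $\mathrm{AE}(\mathbf{x})$ and $\mathrm{LE}(\mathbf{z}_0^{i-1})$, the matrices `Wa`, `Wl`, and `Wout` are hypothetical weights of the three fully connected layers, and biases are omitted for brevity.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                     # subtract the max for numerical stability
    return e / e.sum()

def joint_distribution(audio_enc, label_enc, t_i, Wa, Wl, Wout):
    """audio_enc: (T, d_a) stand-in for AE(x); label_enc: (d_l,) stand-in for LE(z_0^{i-1})."""
    # fc is applied per frame, so projecting frame t_i equals indexing the projected sequence
    J = audio_enc[t_i] @ Wa + label_enc @ Wl    # joint representation, shape (d_j,)
    return softmax(J @ Wout)                    # distribution over the output labels

rng = np.random.default_rng(0)
T, d_a, d_l, d_j, vocab = 20, 12, 10, 8, 30
audio_enc = rng.normal(size=(T, d_a))
label_enc = rng.normal(size=d_l)
Wa = rng.normal(size=(d_a, d_j))
Wl = rng.normal(size=(d_l, d_j))
Wout = rng.normal(size=(d_j, vocab))
p = joint_distribution(audio_enc, label_enc, 3, Wa, Wl, Wout)
```

Here `p` holds one probability per label for time-step `t_i = 3`; during decoding this computation is repeated for every pair of audio frame and label history.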


The Conformer [24] is a variant of the original Transformer that combines CNNs and Transformers in order to model both local and global speech dependencies with a more efficient architecture and fewer parameters. A Conformer block contains two feed-forward modules (FFN), one convolutional module (CNN), and a multi-head attention module (MHA). The two feed-forward modules use half-step residual connections, so the output of the block is computed as:

$$\begin{aligned} \mathbf{x}_1 &= \mathbf{x} + \tfrac{1}{2}\mathrm{FFN}(\mathbf{x})\\ \mathbf{x}_2 &= \mathbf{x}_1 + \mathrm{MHA}(\mathbf{x}_1)\\ \mathbf{x}_3 &= \mathbf{x}_2 + \mathrm{CNN}(\mathbf{x}_2)\\ \mathbf{y} &= \mathrm{LN}\left(\mathbf{x}_3 + \tfrac{1}{2}\mathrm{FFN}(\mathbf{x}_3)\right) \end{aligned}$$

Here, the convolutional module adopts efficient pointwise and depthwise convolutions along with layer normalization.
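The ordering of the residual connections in a Conformer block can be sketched as follows. This is a structural illustration only: the sub-modules are passed in as plain functions, `toy` is a hypothetical stand-in for the real FFN, MHA, and convolution modules, the layer normalization has no learnable gain or bias, and the feed-forward residuals use the half-step weighting ($\tfrac{1}{2}$) of the Conformer paper.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)        # normalize every frame over its channels

def conformer_block(x, ffn1, mha, conv, ffn2):
    """x: (T, d); ffn1, mha, conv, ffn2 are callables mapping (T, d) -> (T, d)."""
    x1 = x + 0.5 * ffn1(x)                      # first half-step feed-forward residual
    x2 = x1 + mha(x1)                           # global context via self-attention
    x3 = x2 + conv(x2)                          # local context via convolution
    return layer_norm(x3 + 0.5 * ffn2(x3))      # second half-step feed-forward, then LN

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))
W = rng.normal(size=(4, 4)) * 0.1
toy = lambda z: np.tanh(z @ W)                  # hypothetical stand-in for a real sub-module
y = conformer_block(x, toy, toy, toy, toy)
```

Passing the sub-modules as arguments keeps the sketch short; in a real model each of them would carry its own parameters.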

Figure: Overview of the Conformer method [24]

CTC and language models have also been used with Transformer networks [25].

Semantic mask for transformer-based ASR

Wang et al. [26] utilized a semantic mask of the input speech according to corresponding output tokens in order to generate the next word based on the previous context. A VGG-like convolution layer is used in order to generate short-term dependent features from the input spectrogram, which are then modeled by a Transformer. On the decoder network, the position encoding is replaced by a 1D convolutional layer to extract local features.

Weak-attention suppression for transformer-based ASR

Shi et al. [27] propose a weak-attention suppression module that suppresses non-informative parts of the speech signal, such as silence. The module sets the attention probabilities that are smaller than a threshold to zero and re-normalizes the remaining attention probabilities.

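The threshold is determined based on the following:

$$\begin{aligned} \theta_i &= m_i -\gamma \delta_i\\ &= \frac{1}{L}-\gamma \sqrt{\frac{\sum_{j=1}^{L}{\left(a_{i,j}-\frac{1}{L}\right)}^2}{L-1}} \end{aligned}$$

where $a_{i,j}$ are the attention probabilities of the $i$-th query over $L$ positions, $m_i = \frac{1}{L}$ is their mean, $\delta_i$ is their standard deviation, and $\gamma$ is a scaling factor.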

Then, softmax is applied again on the new attention probabilities to generate the new attention matrix.
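The thresholding can be sketched for a single row of attention probabilities using only the Python standard library. The value `gamma=0.5` and the example row are illustrative assumptions. Dividing by the sum of the surviving probabilities is used here because it gives the same result as re-applying softmax over the surviving attention scores.

```python
import statistics

def weak_attention_suppression(probs, gamma=0.5):
    """probs: one row of attention probabilities (non-negative, sums to 1)."""
    L = len(probs)
    # the mean of a probability row is 1/L; stdev uses the (L - 1) denominator
    threshold = 1.0 / L - gamma * statistics.stdev(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]   # suppress weak attention
    total = sum(kept)                                      # > 0: the largest entry always survives
    return [p / total for p in kept], threshold

row = [0.70, 0.20, 0.05, 0.03, 0.02]
new_row, theta = weak_attention_suppression(row)
# only the first two positions survive and are re-normalized to sum to 1
```

Since the largest probability is never below the mean $\frac{1}{L}$ and the threshold never exceeds it for $\gamma \geq 0$, at least one position always survives.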

Figure: Overview of the Semantic Masked Transformer method [26]


It is evident that deep architectures have already had a significant impact on automatic speech recognition. Convolutional neural networks, recurrent neural networks, and transformers have all been utilized with great success. Today’s SOTA models are all based on some combination of the aforementioned techniques. You can find some benchmarks on the popular datasets on paperswithcode.

If you find this article useful, you might also be interested in a previous one where we review the best speech synthesis methods. And as always, feel free to share it with your friends.

Cite as

title = "Speech Recognition: a review of the different deep learning approaches",
author = "Papastratis, Ilias",
journal = "",
year = "2021",
howpublished = {},


  1. Matt Post, Gaurav Kumar, Adam Lopez, Damianos Karakos, Chris Callison-Burch, and Sanjeev Khudanpur, “Improved speech-to-text translation with the Fisher and Callhome Spanish–English speech translation corpus,” in Proceedings of the International Workshop on Spoken Language Translation (IWSLT), Heidelberg, Germany, December 2013.
  2. John S Garofolo, “Timit acoustic phonetic continuous speech corpus,” Linguistic Data Consortium, 1993.
  3. Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 6645–6649.
  4. Kanishka Rao, Haşim Sak, and Rohit Prabhavalkar, “Exploring architectures, data and units for streaming end-to-end speech recognition with rnn-transducer,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 193–199.
  5. Kanishka Rao, Haşim Sak, and Rohit Prabhavalkar, “Exploring architectures, data and units for streaming end-to-end speech recognition with rnn-transducer,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 193–199.
  6. Yanzhang He, Tara N Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang, et al., “Streaming end-to-end speech recognition for mobile devices,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6381–6385.
  7. Haşim Sak, Andrew Senior, Kanishka Rao, and Françoise Beaufays, “Fast and accurate recurrent neural network acoustic models for speech recognition,” arXiv preprint arXiv:1507.06947, 2015.
  8. Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, “Attention-based models for speech recognition,” arXiv preprint arXiv:1506.07503, 2015.
  9. William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4960–4964.
  10. Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J Weiss, Kanishka Rao, Ekaterina Gonina, et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4774–4778.
  11. Takaaki Hori, Jaejin Cho, and Shinji Watanabe, “End-to-end speech recognition with word-based rnn language models,” in 2018 IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 389–396.
  12. Douglas B Paul and Janet Baker, “The design for the wall street journal-based csr corpus,” in Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, February 23-26, 1992, 1992.
  13. Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
  14. Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu, “Convolutional neural networks for speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1533–1545, 2014.
  15. Yisen Wang, Xuejiao Deng, Songbai Pu, and Zhiheng Huang, “Residual convolutional ctc networks for automatic speech recognition,” arXiv preprint arXiv:1702.07793, 2017.
  16. Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M Cohen, Huyen Nguyen, and Ravi Teja Gadde, “Jasper: An end-to-end convolutional neural acoustic model,” arXiv preprint arXiv:1904.03288, 2019.
  17. Neil Zeghidour, Qiantong Xu, Vitaliy Liptchinsky, Nicolas Usunier, Gabriel Synnaeve, and Ronan Collobert, “Fully convolutional speech recognition,” arXiv e-prints, pp. arXiv–1812, 2018.
  18. Awni Hannun, Ann Lee, Qiantong Xu, and Ronan Collobert, “Sequence-to-sequence speech recognition with time-depth separable convolutions,” arXiv preprint arXiv:1904.02619, 2019.
  19. Wei Han, Zhengdong Zhang, Yu Zhang, Jiahui Yu, Chung-Cheng Chiu, James Qin, Anmol Gulati, Ruoming Pang, and Yonghui Wu, “Contextnet: Improving convolutional neural networks for automatic speech recognition with global context,” arXiv preprint arXiv:2005.03191, 2020.
  20. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” arXiv preprint arXiv:1706.03762, 2017.
  21. Linhao Dong, Shuang Xu, and Bo Xu, “Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5884–5888.
  22. Abdelrahman Mohamed, Dmytro Okhonko, and Luke Zettlemoyer, “Transformers with convolutional context for asr,” arXiv preprint arXiv:1904.11660, 2019.
  23. Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, and Shankar Kumar, “Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7829–7833.
  24. Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al., “Conformer: Convolution-augmented transformer for speech recognition,” arXiv preprint arXiv:2005.08100, 2020.
  25. Shigeki Karita, Nelson Enrique Yalta Soplin, Shinji Watanabe, Marc Delcroix, Atsunori Ogawa, and Tomohiro Nakatani, “Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration,” Proc. Interspeech 2019, pp. 1408–1412, 2019.
  26. Chengyi Wang, Yu Wu, Yujiao Du, Jinyu Li, Shujie Liu, Liang Lu, Shuo Ren, Guoli Ye, Sheng Zhao, and Ming Zhou, “Semantic mask for transformer-based end-to-end speech recognition,” arXiv preprint arXiv:1912.03010, 2019.
  27. Yangyang Shi, Yongqiang Wang, Chunyang Wu, Christian Fuegen, Frank Zhang, Duc Le, Ching-Feng Yeh, and Michael L Seltzer, “Weak-attention suppression for transformer-based speech recognition,” arXiv preprint arXiv:2005.09137, 2020.
