Conformer: Gulati, Anmol, et al. "Conformer: Convolution-augmented Transformer for Speech Recognition." arXiv preprint arXiv:2005.08100 (2020). Review by June-Woo Kim
2. Abstract
• This paper achieved the best of both worlds by studying how to combine convolutional neural networks and Transformers to model both local and global dependencies of an audio sequence in a parameter-efficient way.
• This paper proposed the convolution-augmented Transformer for speech recognition, named Conformer, which significantly outperforms previous Transformer- and CNN-based models, achieving state-of-the-art accuracy.
• On the LibriSpeech benchmark, they achieved the following Word Error Rates (WER) on the test/test-other sets:
• 2.1%/4.3% without using a language model;
• 1.9%/3.9% with an external language model.
3. Background
• End-to-end automatic speech recognition (ASR) systems based on neural networks have seen large improvements
in recent years.
• RNNs have been the de-facto choice for ASR, as they can model the temporal dependencies in audio sequences effectively.
• Recently, the Transformer [1] architecture based on self-attention has enjoyed widespread adoption for modeling sequences due to its ability to capture long-distance interactions and its high training efficiency.
• Alternatively, CNNs have also been successful for ASR; they capture local context progressively via local receptive fields, layer by layer.
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
4. Background
• However, models with self-attention or convolution alone each have their limitations.
• While Transformers are good at modeling long-range global context, they are less capable of extracting fine-grained local feature patterns.
• CNNs, on the other hand, exploit local information and are used as the de-facto computational block in vision.
• One limitation of using local connectivity is that we need many more layers or parameters to capture global information.
• To combat this issue, contemporary work ContextNet [2] adopts the squeeze-and-excitation module [3] in each
residual block to capture longer context.
• Squeeze-and-excitation module’s goal is to improve the representational power of a network by explicitly modelling the
interdependencies between the channels of its convolutional features.
[2] Han, Wei, et al. "ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context." arXiv preprint arXiv:2005.03191 (2020).
[3] Hu, Jie, Li Shen, and Gang Sun. "Squeeze-and-excitation networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
5. Background: SENet (Squeeze-and-Excitation Networks)
• The goal of SENet is to improve the representational power of a network by explicitly modelling the
interdependencies between the channels of its convolutional features.
• Contributions of this paper
• The SE block can be attached directly to any network, such as VGG, GoogLeNet, ResNet, etc.
• The improvement in model performance is significant relative to the increase in parameters.
• As a result, model complexity and computational burden do not increase significantly.
6. Background: SENet (Squeeze-and-Excitation Networks)
• Squeeze: Global information embedding
• This paper used GAP (Global Average Pooling), one of the most common methodologies for extracting the important
information.
• This paper described that using GAP can compress global spatial information into a channel descriptor.
• Excitation: Adaptive recalibration
• Calculate the channel-wise dependencies.
• In this paper, this is computed simply with two fully connected layers and nonlinear functions.
• Finally, s is multiplied channel-wise with each of the C feature maps of the original network (the features that entered GAP).
$s = F_{ex}(z, W) = \sigma(W_2\,\delta(W_1 z))$,
where $\delta$ = ReLU, $\sigma$ = sigmoid, and $W_1, W_2$ are FC layers.
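As a concrete illustration of the squeeze and excitation steps above, here is a minimal PyTorch sketch of an SE block for a 2-D convolutional feature map of shape (N, C, H, W). The class name, the reduction ratio r = 16, and the use of PyTorch are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal squeeze-and-excitation block (illustrative sketch)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Squeeze: global average pooling compresses each channel map to one scalar.
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        # Excitation: two FC layers (W1, W2) with ReLU and sigmoid,
        # i.e. s = sigma(W2 * delta(W1 * z)).
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = x.shape
        z = self.squeeze(x).view(n, c)        # (N, C) channel descriptor
        s = self.excite(z).view(n, c, 1, 1)   # (N, C, 1, 1) channel-wise gates
        return x * s                          # recalibrate the original feature maps
```

For example, `SEBlock(64)(torch.randn(2, 64, 32, 32))` returns a tensor of the same shape with each channel rescaled by its learned gate.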
8. Background
• However, the SE module is still limited in capturing dynamic global context, as it only applies global averaging over the entire sequence.
• Recent works have shown that combining convolution and self-attention improves over using them individually.
• Papers like [4], [5] have augmented self-attention with relative position based information that maintains equivariance.
[4] Yang, Baosong, et al. "Convolutional self-attention networks." arXiv preprint arXiv:1904.03107 (2019).
[5] Yu, Adams Wei, et al. "QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension." International Conference on Learning Representations. 2018.
9. Method
• This paper showed how to organically combine convolutions with self-attention in ASR models.
• The distinctive feature of their model is the use of Conformer blocks in place of Transformer blocks.
• A Conformer block is composed of four modules stacked together: a feed-forward module, a self-attention module, a convolution module, and a second feed-forward module at the end (the overall layout is sketched below).
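As a rough sketch of how these four modules compose, here is the half-step (Macaron-style) residual layout with a final layer norm. The sub-modules are injected placeholders for the modules sketched in the following sections, each assumed to include its own pre-norm and dropout.

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """Sketch of the Conformer block layout; sub-modules are injected placeholders."""

    def __init__(self, d_model: int, ffn1: nn.Module, mhsa: nn.Module,
                 conv: nn.Module, ffn2: nn.Module):
        super().__init__()
        self.ffn1, self.mhsa, self.conv, self.ffn2 = ffn1, mhsa, conv, ffn2
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + 0.5 * self.ffn1(x)   # first feed-forward module (half-step residual)
        x = x + self.mhsa(x)         # multi-headed self-attention module
        x = x + self.conv(x)         # convolution module
        x = x + 0.5 * self.ffn2(x)   # second feed-forward module (half-step residual)
        return self.final_norm(x)    # final layer normalization
```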
10. Method: Multi-Headed Self-Attention Module
• This paper employed multi-headed self-attention (MHSA) while integrating an important technique from
Transformer-XL, the relative sinusoidal positional encoding scheme.
• The relative positional encoding allows the self-attention module to generalize better to different input lengths, and the resulting encoder is more robust to variance in utterance length.
• This paper used pre-norm residual units [6, 7] with dropout, which helps train and regularize deeper models (a simplified sketch of this wrapping follows below).
[6] Wang, Qiang, et al. "Learning Deep Transformer Models for Machine Translation." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
[7] Nguyen, Toan Q., and Julian Salazar. "Transformers without tears: Improving the normalization of self-attention." arXiv preprint arXiv:1910.05895 (2019).
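A simplified sketch of the pre-norm residual wrapping described above. Plain nn.MultiheadAttention is used here as a stand-in: the relative sinusoidal positional encoding from Transformer-XL is omitted for brevity, so this only illustrates the pre-norm and dropout structure, not the exact module.

```python
import torch
import torch.nn as nn

class MHSAModule(nn.Module):
    """Pre-norm residual self-attention unit (sketch; the relative positional
    encoding from Transformer-XL is not implemented here)."""

    def __init__(self, d_model: int, num_heads: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads,
                                          dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-norm: normalize before attention; the residual connection is
        # added by the enclosing Conformer block (x + MHSAModule(x)).
        y = self.norm(x)
        y, _ = self.attn(y, y, y, need_weights=False)
        return self.dropout(y)
```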
11. Method: Convolution module
• The convolution module starts with a gating mechanism [8]—a pointwise convolution and a gated linear unit
(GLU).
• This is followed by a single 1-D depthwise convolution layer.
• Batchnorm is deployed just after the convolution to aid training deep models.
• [8] showed this mechanism to be useful for language modeling, as it allows the model to select which words or features are relevant for predicting the next word. (The full module is sketched below.)
[8] Dauphin, Yann N., et al. "Language modeling with gated convolutional networks." International conference on machine learning. 2017.
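A minimal sketch of this convolution module, following the layout described above: pre-norm, pointwise conv, GLU gating, 1-D depthwise conv, batch norm, Swish, pointwise conv, dropout. The kernel size of 31 and the dropout rate are illustrative defaults, not values taken from the paper.

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Sketch of the Conformer convolution module (illustrative hyper-parameters)."""

    def __init__(self, d_model: int, kernel_size: int = 31, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)
        self.glu = nn.GLU(dim=1)  # gating mechanism: halves the channel dimension
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)  # batch norm right after the convolution
        self.swish = nn.SiLU()             # Swish with beta = 1
        self.pointwise2 = nn.Conv1d(d_model, d_model, kernel_size=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model); Conv1d expects (batch, channels, time).
        y = self.norm(x).transpose(1, 2)
        y = self.glu(self.pointwise1(y))               # pointwise conv + GLU
        y = self.swish(self.bn(self.depthwise(y)))     # depthwise conv -> BN -> Swish
        y = self.pointwise2(y)
        return self.dropout(y.transpose(1, 2))         # back to (batch, time, d_model)
```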
12. Method: Feed Forward Module
• The Transformer architecture deploys a feed-forward module after the MHSA layer, composed of two linear transformations with a nonlinear activation in between.
• A residual connection is added over the feed-forward layers, followed by layer normalization.
13. Method: Feed Forward Module
• However, this paper followed pre-norm residual units and applies layer normalization within the residual unit and on the input before the first linear layer.
• This paper also applied the Swish activation [9] and dropout, which helps regularize the network (a sketch of the module follows below).
[9] Ramachandran, Prajit, Barret Zoph, and Quoc V. Le. "Searching for activation functions." arXiv preprint arXiv:1710.05941 (2017).
[9] proposed leveraging automatic search techniques to discover new activation functions; Swish was found this way:
$f(x) = x \cdot \sigma(\beta x)$, where $\sigma(z) = (1 + \exp(-z))^{-1}$ and $\beta$ is a constant or a trainable parameter.
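A minimal sketch of this feed-forward module, combining the points from the two slides above (layer norm on the input, two linear layers with Swish in between, dropout for regularization). The 4x expansion factor is an assumption borrowed from the standard Transformer setup; the half-step residual scaling is applied by the enclosing Conformer block.

```python
import torch
import torch.nn as nn

class FeedForwardModule(nn.Module):
    """Sketch of the Conformer feed-forward module (pre-norm, Swish, dropout)."""

    def __init__(self, d_model: int, expansion: int = 4, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(d_model),                  # layer norm inside the residual unit
            nn.Linear(d_model, expansion * d_model),
            nn.SiLU(),                              # Swish: f(x) = x * sigmoid(x)
            nn.Dropout(dropout),
            nn.Linear(expansion * d_model, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual (and the 0.5 half-step scaling) is added by the Conformer block.
        return self.net(x)
```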
15. Experiments
• Dataset: LibriSpeech
• 970 hours of labeled speech and an additional 800M word-token text-only corpus for building the language model.
• Preprocessing
• 80-channel filterbank features computed from a 25 ms window with a stride of 10 ms.
• SpecAugment is also used with mask parameter F = 27.
• Ten time masks with maximum time-mask ratio p_S = 0.05, where the maximum size of the time mask is set to p_S times the length of the utterance (a rough front-end sketch follows below).
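A rough front-end sketch using torchaudio, assuming 16 kHz audio (so 25 ms = 400 samples and 10 ms = 160 samples). The time-mask width here only approximates the paper's adaptive p_S = 0.05 policy via torchaudio's `p` argument (available in recent versions), and the specific numbers besides F = 27 and n_mels = 80 are illustrative.

```python
import torch
import torchaudio

# 80-channel log-mel filterbanks from a 25 ms window with a 10 ms stride (16 kHz assumed).
melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, win_length=400, hop_length=160, n_mels=80
)
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=27)  # F = 27
# Rough stand-in for ten time masks with maximum ratio p_S = 0.05:
# `p` caps each mask at 5% of the utterance length in recent torchaudio versions.
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=100, p=0.05)

waveform = torch.randn(1, 16000)                     # one second of dummy audio
features = (melspec(waveform) + 1e-6).log()          # (1, 80, frames) log-mel features
augmented = freq_mask(features)
for _ in range(10):                                  # apply ten time masks
    augmented = time_mask(augmented)
```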
16. Hyper-parameters for Conformer
• Decoder: a single LSTM layer.
• Applied dropout in each residual unit of the Conformer:
• P_drop = 0.1
• ℓ2 regularization with weight 1e-6 is also added to all trainable weights.
• Adam optimizer with β1 = 0.9, β2 = 0.98, and ε = 10⁻⁹.
• Transformer learning rate schedule with 10k warm-up steps (a short sketch of this schedule follows below).
• LM
• 3-layer LSTM with width 4096, trained on the LibriSpeech language-model corpus with the LibriSpeech 960h transcripts added, tokenized with the 1k WPM built from LibriSpeech 960h.
• The LM has a word-level perplexity of 63.9 on the dev-set transcripts.
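The Transformer learning-rate schedule mentioned above (linear warm-up followed by inverse-square-root decay) can be sketched as below; the formula follows Vaswani et al. [1], and d_model = 512 is only an illustrative value rather than the Conformer's actual encoder dimension.

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 10_000) -> float:
    """Transformer learning-rate schedule: warm up linearly for `warmup_steps`,
    then decay proportionally to the inverse square root of the step number."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```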
18. Ablation Studies
• Conformer Block vs. Transformer Block
• Among all the differences, the convolution sub-block is the most important feature, while having a Macaron-style FFN pair is also more effective than a single FFN with the same number of parameters.
• They also found that using Swish activations led to faster convergence in the Conformer models.
19. Ablation Studies
• Combinations of Convolution and Transformer Modules
• First, replacing the depthwise convolution in the convolution module with a lightweight convolution significantly dropped performance, especially on the dev-other set.
• They also found that splitting the input into parallel branches of a multi-headed self-attention module and a convolution module, with their outputs concatenated, worsens performance compared to the proposed architecture.
20. Ablation Studies
• Macaron FFN
• Table 5 shows the impact of changing the Conformer block to use a single FFN or full-step residuals.
21. Ablation Studies
• # of Attention Heads
• They performed experiments to study the effect of varying the number of attention heads from 4 to 32 in the large model, using the same number of heads in all layers.
• They found that increasing the number of attention heads up to 16 improves accuracy, especially on the dev-other set.
22. Ablation Studies
• Convolution Kernel Sizes
• They swept the kernel size over {3, 7, 17, 32, 65} for the large model, using the same kernel size in all layers.
• They found that performance improves with larger kernel sizes up to 17 and 32, but worsens with kernel size 65.
23. Contribution
• In this work, authors introduced Conformer, an architecture that integrates components from CNNs and
Transformers for end-to-end speech recognition.
• They studied the importance of each component, and demonstrated that the inclusion of convolution modules is
critical to the performance of the Conformer model.
• The model exhibits better accuracy with fewer parameters than previous work on the LibriSpeech dataset, and achieves a new state-of-the-art performance of 1.9%/3.9% WER on test/test-other.
24. Reference
• Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
• Han, Wei, et al. "ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context." arXiv preprint arXiv:2005.03191 (2020).
• Hu, Jie, Li Shen, and Gang Sun. "Squeeze-and-excitation networks." Proceedings of the IEEE conference on computer
vision and pattern recognition. 2018.
• Yang, Baosong, et al. "Convolutional self-attention networks." arXiv preprint arXiv:1904.03107 (2019).
• Yu, Adams Wei, et al. "QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension."
International Conference on Learning Representations. 2018.
• Wang, Qiang, et al. "Learning Deep Transformer Models for Machine Translation." Proceedings of the 57th Annual
Meeting of the Association for Computational Linguistics. 2019.
• Nguyen, Toan Q., and Julian Salazar. "Transformers without tears: Improving the normalization of self-attention." arXiv
preprint arXiv:1910.05895 (2019).
• Dauphin, Yann N., et al. "Language modeling with gated convolutional networks." International conference on machine
learning. 2017.
• Ramachandran, Prajit, Barret Zoph, and Quoc V. Le. "Searching for activation functions." arXiv preprint
arXiv:1710.05941 (2017).