Gulati, Anmol, et al. "Conformer: Convolution-augmented Transformer for Speech Recognition." arXiv preprint arXiv:2005.08100 (2020). Review by June-Woo Kim.

- 1. Conformer: Convolution-augmented Transformer for Speech Recognition
  Gulati, Anmol, et al. Presented by June-Woo Kim, Kyungpook National University, 6 Nov. 2020.
- 2. Abstract
  • This paper achieved the best of both worlds by studying how to combine convolutional neural networks and Transformers to model both the local and global dependencies of an audio sequence in a parameter-efficient way.
  • The paper proposed the convolution-augmented Transformer for speech recognition, named Conformer, which significantly outperforms previous Transformer- and CNN-based models, achieving state-of-the-art accuracy.
  • On the LibriSpeech benchmark, they achieved the following Word Error Rates (WER) on the test/test-other sets:
  • 2.1%/4.3% WER without using a language model.
  • 1.9%/3.9% WER with an external language model.
- 3. Background
  • End-to-end automatic speech recognition (ASR) systems based on neural networks have seen large improvements in recent years.
  • RNNs have been the de-facto choice for ASR, as they can model the temporal dependencies in audio sequences effectively.
  • Recently, the Transformer [1] architecture based on self-attention has enjoyed widespread adoption for modeling sequences, due to its ability to capture long-distance interactions and its high training efficiency.
  • Alternatively, CNNs have also been successful for ASR; they capture local context progressively via a local receptive field, layer by layer.
  [1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
- 4. Background
  • However, models with self-attention or convolution each have their limitations.
  • While Transformers are good at modeling long-range global context, they are less capable of extracting fine-grained local feature patterns.
  • CNNs, on the other hand, exploit local information and are used as the de-facto computational block in vision.
  • One limitation of local connectivity is that many more layers or parameters are needed to capture global information.
  • To combat this issue, the contemporary work ContextNet [2] adopts the squeeze-and-excitation module [3] in each residual block to capture longer context.
  • The squeeze-and-excitation module's goal is to improve the representational power of a network by explicitly modelling the interdependencies between the channels of its convolutional features.
  [2] Han, Wei, et al. "ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context." arXiv preprint arXiv:2005.03191 (2020).
  [3] Hu, Jie, Li Shen, and Gang Sun. "Squeeze-and-excitation networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
- 5. Background: SENet (Squeeze-and-Excitation Networks)
  • The goal of SENet is to improve the representational power of a network by explicitly modelling the interdependencies between the channels of its convolutional features.
  • Contributions of this paper:
  • The SE block can be attached directly to any network, such as VGG, GoogLeNet, ResNet, etc.
  • The improvement in model performance is significant compared to the increase in parameters.
  • This has the advantage that model complexity and computational burden do not increase significantly.
- 6. Background: SENet (Squeeze-and-Excitation Networks)
  • Squeeze: global information embedding.
  • The paper used GAP (Global Average Pooling), one of the most common methodologies for extracting important information.
  • The paper described that GAP can compress global spatial information into a channel descriptor.
  • Excitation: adaptive recalibration.
  • Computes the channel-wise dependencies; in the paper, this is calculated simply with fully connected layers and nonlinear functions:
  $s = F_{ex}(z, W) = \sigma(W_2 \, \delta(W_1 z))$, where $\delta$ is ReLU, $\sigma$ is the sigmoid, and $W_1$, $W_2$ are FC layers.
  • Finally, $s$ is multiplied with each of the $C$ feature maps before GAP in the existing network.
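A minimal sketch of the squeeze-and-excitation computation above, written in PyTorch for illustration (the SENet paper is framework-agnostic; the `reduction` bottleneck ratio of 16 is an assumed typical value):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)  # W1
        self.fc2 = nn.Linear(channels // reduction, channels)  # W2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        z = x.mean(dim=(2, 3))                                # squeeze: GAP over H, W
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))  # excitation: sigma(W2 delta(W1 z))
        return x * s[:, :, None, None]                        # channel-wise recalibration
```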
- 7. Background: SENet (Squeeze-and-Excitation Networks)
  • Exemplars: SE-Inception and SE-ResNet.
- 8. Background
  • However, the squeeze-and-excitation module is still limited in capturing dynamic global context, as it only applies a global averaging over the entire sequence.
  • Recent works have shown that combining convolution and self-attention improves over using them individually.
  • Papers such as [4] and [5] have augmented self-attention with relative-position-based information that maintains equivariance.
  [4] Yang, Baosong, et al. "Convolutional self-attention networks." arXiv preprint arXiv:1904.03107 (2019).
  [5] Yu, Adams Wei, et al. "QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension." International Conference on Learning Representations. 2018.
- 9. Method
  • This paper showed how to organically combine convolutions with self-attention in ASR models.
  • The distinctive feature of the model is the use of Conformer blocks in place of Transformer blocks.
  • A Conformer block is composed of four modules stacked together: a feed-forward module, a self-attention module, a convolution module, and a second feed-forward module at the end.
- 10. Method: Multi-Headed Self-Attention Module
  • This paper employed multi-headed self-attention (MHSA) while integrating an important technique from Transformer-XL: the relative sinusoidal positional encoding scheme.
  • The relative positional encoding allows the self-attention module to generalize better to different input lengths, and the resulting encoder is more robust to variance in utterance length.
  • The paper used pre-norm residual units [6, 7] with dropout, which helps train and regularize deeper models.
  [6] Wang, Qiang, et al. "Learning Deep Transformer Models for Machine Translation." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
  [7] Nguyen, Toan Q., and Julian Salazar. "Transformers without tears: Improving the normalization of self-attention." arXiv preprint arXiv:1910.05895 (2019).
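A hedged sketch of the pre-norm residual MHSA sub-block in PyTorch. Note that `nn.MultiheadAttention` does not implement Transformer-XL's relative sinusoidal positional encoding, so that part of the paper's design is simplified away here; `d_model`, `num_heads`, and `dropout` defaults are illustrative:

```python
import torch
import torch.nn as nn

class MHSAModule(nn.Module):
    def __init__(self, d_model: int = 256, num_heads: int = 4, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)  # pre-norm: normalize before the sub-layer
        self.attn = nn.MultiheadAttention(d_model, num_heads,
                                          dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model); the residual connection wraps the whole sub-layer
        y = self.norm(x)
        y, _ = self.attn(y, y, y, need_weights=False)
        return x + self.dropout(y)
```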
- 11. Method: Convolution module
  • The convolution module starts with a gating mechanism [8]: a pointwise convolution followed by a gated linear unit (GLU).
  • This is followed by a single 1-D depthwise convolution layer.
  • Batchnorm is deployed just after the convolution to aid the training of deep models.
  • [8] shows this mechanism to be useful for language modeling, as it allows the model to select which words or features are relevant for predicting the next word.
  [8] Dauphin, Yann N., et al. "Language modeling with gated convolutional networks." International conference on machine learning. 2017.
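A hedged sketch of this convolution module (pointwise conv + GLU, depthwise conv + batchnorm + Swish, pointwise conv), following the module diagram in the Conformer paper; the exact dropout placement and default sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvModule(nn.Module):
    def __init__(self, d_model: int = 256, kernel_size: int = 32, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, 1)  # doubled channels feed the GLU
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding="same", groups=d_model)
        self.batchnorm = nn.BatchNorm1d(d_model)
        self.pointwise2 = nn.Conv1d(d_model, d_model, 1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        y = self.norm(x).transpose(1, 2)               # (batch, d_model, time) for conv
        y = F.glu(self.pointwise1(y), dim=1)           # gating mechanism [8]
        y = F.silu(self.batchnorm(self.depthwise(y)))  # batchnorm then Swish
        y = self.dropout(self.pointwise2(y)).transpose(1, 2)
        return x + y                                   # residual connection
```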
- 12. Method: Feed Forward Module
  • The Transformer architecture deploys a feed-forward module after the MHSA layer, composed of two linear transformations with a nonlinear activation in between.
  • A residual connection is added over the feed-forward layers, followed by layer normalization.
- 13. Method: Feed Forward Module
  • However, this paper followed pre-norm residual units and applied layer normalization within the residual unit, on the input before the first linear layer.
  • The paper also applied the Swish activation [9] and dropout, which helps regularize the network.
  [9] Ramachandran, Prajit, Barret Zoph, and Quoc V. Le. "Searching for activation functions." arXiv preprint arXiv:1710.05941 (2017). That paper proposed leveraging automatic search techniques to discover new activation functions, yielding
  $f(x) = x \cdot \sigma(\beta x)$, where $\sigma(z) = (1 + \exp(-z))^{-1}$ and $\beta$ is a constant or trainable parameter.
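A hedged sketch of this pre-norm feed-forward module in PyTorch. The 4x expansion factor matches the paper, the dropout placement follows its module diagram, and `nn.SiLU` is Swish with $\beta = 1$:

```python
import torch
import torch.nn as nn

class FeedForwardModule(nn.Module):
    def __init__(self, d_model: int = 256, expansion: int = 4, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(d_model),                    # pre-norm, inside the residual unit
            nn.Linear(d_model, expansion * d_model),  # first linear transformation
            nn.SiLU(),                                # Swish activation with beta = 1
            nn.Dropout(dropout),
            nn.Linear(expansion * d_model, d_model),  # project back to d_model
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)  # the (half-step) residual is applied by the caller
```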
- 14. Method: Conformer Block
  • The proposed Conformer block contains two feed-forward modules sandwiching the multi-headed self-attention module and the convolution module:
  $\tilde{x}_i = x_i + \tfrac{1}{2}\mathrm{FFN}(x_i)$
  $x_i' = \tilde{x}_i + \mathrm{MHSA}(\tilde{x}_i)$
  $x_i'' = x_i' + \mathrm{Conv}(x_i')$
  $y_i = \mathrm{Layernorm}\left(x_i'' + \tfrac{1}{2}\mathrm{FFN}(x_i'')\right)$
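A hedged sketch wiring the four equations above together, reusing the illustrative `FeedForwardModule`, `MHSAModule`, and `ConvModule` classes sketched on the previous slides (assumptions for exposition, not the authors' code; the MHSA and convolution sketches apply their residual connections internally):

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    def __init__(self, d_model: int = 256, num_heads: int = 4, kernel_size: int = 32):
        super().__init__()
        self.ffn1 = FeedForwardModule(d_model)
        self.mhsa = MHSAModule(d_model, num_heads)
        self.conv = ConvModule(d_model, kernel_size)
        self.ffn2 = FeedForwardModule(d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + 0.5 * self.ffn1(x)  # half-step "Macaron" feed-forward
        x = self.mhsa(x)            # x' = x~ + MHSA(x~)
        x = self.conv(x)            # x'' = x' + Conv(x')
        return self.norm(x + 0.5 * self.ffn2(x))  # final half-step FFN + layernorm

block = ConformerBlock()
out = block(torch.randn(2, 100, 256))  # (batch, time, d_model)
```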
- 15. Experiments
  • Dataset: LibriSpeech
  • 970 hours of labeled speech and an additional 800M-word-token text-only corpus for building the language model.
  • Preprocessing
  • 80-channel filterbank features computed from a 25ms window with a stride of 10ms.
  • SpecAugment is also used, with mask parameter $F = 27$.
  • Ten time masks with maximum time-mask ratio $p_S = 0.05$, where the maximum size of the time mask is set to $p_S$ times the length of the utterance.
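A hedged sketch of the adaptive time masking described above, in NumPy (the authors' actual SpecAugment pipeline is not shown in the slides; masked frames are zeroed here, though masking with the mean is also common):

```python
import numpy as np

def time_mask(spec: np.ndarray, num_masks: int = 10, p_s: float = 0.05,
              rng=None) -> np.ndarray:
    """Apply SpecAugment-style time masks to (time, mel_bins) features."""
    rng = rng or np.random.default_rng()
    out = spec.copy()
    T = out.shape[0]
    max_width = int(p_s * T)                     # adaptive: scales with utterance length
    for _ in range(num_masks):
        w = int(rng.integers(0, max_width + 1))  # mask width in frames
        t0 = int(rng.integers(0, T - w + 1))     # mask start position
        out[t0:t0 + w, :] = 0.0
    return out

masked = time_mask(np.random.randn(1000, 80))    # e.g. 10s of 80-dim filterbanks
```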
- 16. Hyper-parameters for Conformer
  • Decoder: a single LSTM layer.
  • Dropout is applied in each residual unit of the Conformer: $P_{drop} = 0.1$.
  • $\ell_2$ regularization with weight $10^{-6}$ is also added to all trainable weights.
  • Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-9}$.
  • Transformer learning-rate schedule with 10k warm-up steps.
  • LM
  • A 3-layer LSTM with width 4096, trained on the LibriSpeech language-model corpus with the LibriSpeech 960h transcripts added, tokenized with the 1k WPM (word-piece model) built from LibriSpeech 960h.
  • The LM has a word-level perplexity of 63.9 on the dev-set transcripts.
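A hedged sketch of the transformer learning-rate schedule with 10k warm-up steps: linear warm-up followed by inverse-square-root decay. The peak value of $0.05/\sqrt{d}$ is taken from the Conformer paper; `d_model = 256` is an illustrative default:

```python
import math

def transformer_lr(step: int, d_model: int = 256, warmup: int = 10_000) -> float:
    """Linear warm-up to a peak of 0.05 / sqrt(d_model), then ~1/sqrt(step) decay."""
    step = max(step, 1)
    peak = 0.05 / math.sqrt(d_model)
    return peak * min(step / warmup, math.sqrt(warmup / step))

# The rate rises until step 10k, then decays: transformer_lr(10_000) == 0.05 / 16
```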
- 17. Results
- 18. Ablation Studies
  • Conformer Block vs. Transformer Block
  • Among all the differences, the convolution sub-block is the most important feature, while having a Macaron-style FFN pair is also more effective than a single FFN with the same number of parameters.
  • They also found that using Swish activations led to faster convergence in the Conformer models.
- 19. Ablation Studies
  • Combinations of Convolution and Transformer Modules
  • First, replacing the depthwise convolution in the convolution module with a lightweight convolution significantly degraded performance, especially on the dev-other set.
  • They also found that splitting the input into parallel branches of a multi-headed self-attention module and a convolution module, with their outputs concatenated, worsens performance compared to the proposed architecture.
- 20. Ablation Studies • Macaron FFN • Table 5 shows the impact of changing the Conformer block to use a single FFN or full-step residuals.
- 21. Ablation Studies
  • Number of Attention Heads
  • They performed experiments to study the effect of varying the number of attention heads from 4 to 32 in their large model, using the same number of heads in all layers.
  • They found that increasing the number of attention heads up to 16 improves accuracy, especially on the dev-other dataset.
- 22. Ablation Studies
  • Convolution Kernel Sizes
  • They swept the kernel size in {3, 7, 17, 32, 65} for the large model, using the same kernel size for all layers.
  • They found that performance improves with larger kernel sizes up to sizes 17 and 32, but worsens for kernel size 65.
- 23. Contribution
  • In this work, the authors introduced Conformer, an architecture that integrates components from CNNs and Transformers for end-to-end speech recognition.
  • They studied the importance of each component and demonstrated that the inclusion of convolution modules is critical to the performance of the Conformer model.
  • The model exhibits better accuracy with fewer parameters than previous work on the LibriSpeech dataset, and achieves a new state-of-the-art performance at 1.9%/3.9% WER on test/test-other.
- 24. Reference
  • Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
  • Han, Wei, et al. "ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context." arXiv preprint arXiv:2005.03191 (2020).
  • Hu, Jie, Li Shen, and Gang Sun. "Squeeze-and-excitation networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
  • Yang, Baosong, et al. "Convolutional self-attention networks." arXiv preprint arXiv:1904.03107 (2019).
  • Yu, Adams Wei, et al. "QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension." International Conference on Learning Representations. 2018.
  • Wang, Qiang, et al. "Learning Deep Transformer Models for Machine Translation." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
  • Nguyen, Toan Q., and Julian Salazar. "Transformers without tears: Improving the normalization of self-attention." arXiv preprint arXiv:1910.05895 (2019).
  • Dauphin, Yann N., et al. "Language modeling with gated convolutional networks." International conference on machine learning. 2017.
  • Ramachandran, Prajit, Barret Zoph, and Quoc V. Le. "Searching for activation functions." arXiv preprint arXiv:1710.05941 (2017).
- 25. Thank you!

- Note on Swish limits: $\beta = 1$ gives the Sigmoid-weighted Linear Unit (SiLU), originally proposed for reinforcement learning; $\beta = 0$ gives the scaled linear function $f(x) = x/2$; $\beta \to \infty$ approaches ReLU.