Presenter: Miroslav Skácel
BUT Zero-Cost Speech Recognition 2016 System Description In Working Notes Proceedings of the MediaEval 2016 Workshop, Hilversum, Netherlands, October 20-21, CEUR-WS.org (2016) by Miroslav Skácel, Martin Karafiát, Lucas Ondel, Albert Uchytil, Igor Szöke
Paper: http://ceur-ws.org/Vol-1739/MediaEval_2016_paper_48.pdf
Video: https://youtu.be/0pNiLLVTa28
Abstract: This paper describes our work on developing speech recognizers for Vietnamese. It focuses on procedures to prepare provided data precisely. We aim on analysis of the textual transcriptions in particular. Methods to filter out defective data to improve performance of final system are proposed and described in detail. We also propose cleaning of other textual data used for language modeling. Several architectures are investigated to reach both sub-tasks goals. The achieved results are discussed.
Biopesticide (2).pptx .This slides helps to know the different types of biop...
MediaEval 2016 - BUT Zero-Cost Speech Recognition
1. fit
speech fit
BUT Zero-Cost 2016 Speech Recognition
Miroslav Skácel, Martin Karafiát, Lucas Ondel, Albert Uchytil, Igor Szöke
BUT Speech@FIT
Brno University of Technology
Czech Republic
MediaEval 2016 Workshop, October 20-21, 2016, Hilversum, Netherlands
2. Overview
Our ideas for this task were:
• to use previous knowledge of low-resource languages
• to use existing systems from Babel1
• to adapt that on Vietnamese
• to follow zero-cost requirements
We ended up with:
• lots of data processing and cleaning
• 2x LVCSR sub-task systems
• 1x subword sub-task systems
• suggestions for further improvements
1https://www.iarpa.gov/index.php/research-programs/babel
1
3. Data preparation
• audio downsampled from 16kHz to 8kHz
• got Vietnamese alphabet from wordlist
• cleaned other characters (punctuation marks, brackets, etc.)
• numerals expanded to textual form
2
4. Data segmentation
• audio longer than 1 minute caused troubles
• alignment from first training stages used
• data split when:
• silence longer than 0.5s occurs
• segment would be longer than 15s
3
5. Language models
3 LMs were created:
1) cleaned transcriptions from train set
2) cleaned subtitles
3) cleaned text from websites
• 3 or more words
• we combined them together
4
6. LVCSR systems - GMM/DNN2
2Martin Karafiát et al. BUT Neural Network Features for Spontaneous Vietnamese in
BABEL. In Proceedings of ICASSP 2014, pages 5659–5663. IEEE Signal Processing Society,
2014.
5
7. LVCSR systems - BLSTM3
• 16kHz audio, original transcriptions
• Kaldi/TNet toolkits
3Martin Karafiát et al. Multilingual BLSTM and Speaker-Specific Vector Adaptation in
2016 BUT Babel System. Accepted at SLT 2016, 2016.
6
8. Subword system
• Acoustic Unit Discovery4
(AUD) model
• unlabeled data - no transcription needed
• like phone-loop model, fully Bayesian, Dirichlet distribution
• no fixed number of components like in HMM
• can learn complexity of model
4Lucas Ondel et al. Variational Inference for Acoustic Unit Discovery. In Procedia
Computer Science, volume 2016, pages 80–86. Elsevier Science, 2016.
7
10. Further work
Ideas for future experiments:
• find better way to clean original transcriptions
• methods to detect inappropriate audio/transcriptions
• adding noise to audio for more robust system
• audio reverberation using RIR
9