
MediaEval 2016 - BUT Zero-Cost Speech Recognition

Presenter: Miroslav Skácel

Miroslav Skácel, Martin Karafiát, Lucas Ondel, Albert Uchytil, Igor Szöke: BUT Zero-Cost Speech Recognition 2016 System Description. In Working Notes Proceedings of the MediaEval 2016 Workshop, Hilversum, Netherlands, October 20-21, CEUR-WS.org (2016).

Paper: http://ceur-ws.org/Vol-1739/MediaEval_2016_paper_48.pdf

Video: https://youtu.be/0pNiLLVTa28

Abstract: This paper describes our work on developing speech recognizers for Vietnamese. It focuses on procedures for carefully preparing the provided data, in particular on the analysis of the textual transcriptions. Methods to filter out defective data and thereby improve the performance of the final system are proposed and described in detail. We also propose cleaning of the other textual data used for language modeling. Several architectures are investigated to reach the goals of both sub-tasks. The achieved results are discussed.

Slide 1: BUT Zero-Cost 2016 Speech Recognition
Miroslav Skácel, Martin Karafiát, Lucas Ondel, Albert Uchytil, Igor Szöke
BUT Speech@FIT, Brno University of Technology, Czech Republic
MediaEval 2016 Workshop, October 20-21, 2016, Hilversum, Netherlands

Slide 2: Overview
Our ideas for this task were:
• to use previous knowledge of low-resource languages
• to use existing systems from Babel [1]
• to adapt them to Vietnamese
• to follow the zero-cost requirements
We ended up with:
• lots of data processing and cleaning
• 2x LVCSR sub-task systems
• 1x subword sub-task system
• suggestions for further improvements

[1] https://www.iarpa.gov/index.php/research-programs/babel

Slide 3: Data preparation
• audio downsampled from 16 kHz to 8 kHz
• Vietnamese alphabet obtained from the wordlist
• other characters cleaned out (punctuation marks, brackets, etc.)
• numerals expanded to textual form

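The character cleaning described on this slide can be sketched in a few lines. This is only an illustration, not the actual BUT scripts: the wordlist file name is hypothetical, the alphabet is taken to be exactly the characters occurring in the wordlist, and expand_numerals is a placeholder since the Vietnamese numeral expansion itself is not shown.

```python
import re

def load_alphabet(wordlist_path):
    """Collect the character set used by the Vietnamese wordlist."""
    alphabet = set()
    with open(wordlist_path, encoding="utf-8") as f:
        for word in f:
            alphabet.update(word.strip().lower())
    return alphabet

def expand_numerals(text):
    """Placeholder: map digits to their Vietnamese textual form (not shown here)."""
    return text

def clean_transcription(line, alphabet):
    """Lowercase, expand numerals, and drop any character outside the alphabet."""
    line = expand_numerals(line.lower())
    kept = [c if c in alphabet or c.isspace() else " " for c in line]
    return re.sub(r"\s+", " ", "".join(kept)).strip()

alphabet = load_alphabet("wordlist.txt")                       # hypothetical path
print(clean_transcription("Xin chào, (thế giới)!", alphabet))  # punctuation and brackets dropped
```
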
Slide 4: Data segmentation
• audio longer than 1 minute caused trouble
• alignments from the first training stages were used
• data split when:
  • a silence longer than 0.5 s occurs
  • a segment would be longer than 15 s

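A rough sketch of the splitting rule from this slide, assuming the alignment is available as a list of (start, end, label) tuples in seconds with silence labelled "sil"; the 0.5 s and 15 s thresholds come from the slide, everything else (names, data layout) is an assumption.

```python
SIL_MIN = 0.5   # split at silences longer than 0.5 s
SEG_MAX = 15.0  # never let a segment grow past 15 s

def segment(alignment):
    """Split an aligned utterance into shorter segments.
    alignment: list of (start, end, label) tuples in seconds; "sil" marks silence."""
    segments, seg_start, prev_end = [], None, None
    for start, end, label in alignment:
        if seg_start is None:
            if label == "sil":
                continue                     # do not open a segment on silence
            seg_start = start
        long_silence = label == "sil" and (end - start) > SIL_MIN
        too_long = (end - seg_start) > SEG_MAX
        if long_silence or too_long:
            segments.append((seg_start, start if long_silence else end))
            seg_start = None
        prev_end = end
    if seg_start is not None:
        segments.append((seg_start, prev_end))
    return segments

# toy example: the 0.7 s silence after the second word triggers a cut
print(segment([(0.0, 1.0, "xin"), (1.0, 2.0, "chào"),
               (2.0, 2.7, "sil"), (2.7, 4.0, "bạn")]))
# -> [(0.0, 2.0), (2.7, 4.0)]
```
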
Slide 5: Language models
Three LMs were created from:
1) cleaned transcriptions from the train set
2) cleaned subtitles
3) cleaned text from websites (sentences of 3 or more words)
• the three LMs were combined together

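The slide does not say how the three LMs were combined; a common way to combine n-gram models is linear interpolation. The sketch below shows the idea on toy unigram models with made-up interpolation weights.

```python
from collections import Counter

def unigram_lm(corpus):
    """Maximum-likelihood unigram probabilities from a list of sentences."""
    counts = Counter(w for sent in corpus for w in sent.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def interpolate(lms, weights):
    """Linear interpolation: P(w) = sum_i weight_i * P_i(w)."""
    vocab = set().union(*lms)
    return {w: sum(wt * lm.get(w, 0.0) for lm, wt in zip(lms, weights))
            for w in vocab}

# toy corpora standing in for transcriptions, subtitles and web text
lm_train = unigram_lm(["xin chào", "chào bạn"])
lm_subs  = unigram_lm(["xin lỗi", "chào bạn"])
lm_web   = unigram_lm(["bạn khỏe không"])
combined = interpolate([lm_train, lm_subs, lm_web], [0.5, 0.3, 0.2])
print(combined["chào"])   # 0.5*0.5 + 0.3*0.25 + 0.2*0.0 = 0.325
```
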
Slide 6: LVCSR systems - GMM/DNN [2]

[2] Martin Karafiát et al. BUT Neural Network Features for Spontaneous Vietnamese in BABEL. In Proceedings of ICASSP 2014, pages 5659–5663. IEEE Signal Processing Society, 2014.

Slide 7: LVCSR systems - BLSTM [3]
• 16 kHz audio, original transcriptions
• Kaldi/TNet toolkits

[3] Martin Karafiát et al. Multilingual BLSTM and Speaker-Specific Vector Adaptation in 2016 BUT Babel System. Accepted at SLT 2016, 2016.

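The actual BLSTM system was built with the Kaldi/TNet toolkits. Purely as an illustration of the architecture named on the slide, here is a minimal PyTorch sketch of a bidirectional LSTM acoustic model mapping feature frames to per-frame state posteriors; the feature dimension, layer sizes and number of output states are invented.

```python
import torch
import torch.nn as nn

class BLSTMAcousticModel(nn.Module):
    """Bidirectional LSTM mapping acoustic feature frames to state posteriors."""
    def __init__(self, feat_dim=40, hidden=512, layers=3, n_states=3000):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                             bidirectional=True, batch_first=True)
        self.output = nn.Linear(2 * hidden, n_states)

    def forward(self, feats):              # feats: (batch, frames, feat_dim)
        hidden, _ = self.blstm(feats)      # (batch, frames, 2*hidden)
        return self.output(hidden)         # per-frame state logits

model = BLSTMAcousticModel()
frames = torch.randn(1, 200, 40)           # 2 s of 40-dim features at 10 ms shift
print(model(frames).shape)                  # torch.Size([1, 200, 3000])
```
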
Slide 8: Subword system
• Acoustic Unit Discovery (AUD) model [4]
• unlabeled data - no transcription needed
• like a phone-loop model, fully Bayesian, Dirichlet distribution
• no fixed number of components as in an HMM
• can learn the complexity of the model

[4] Lucas Ondel et al. Variational Inference for Acoustic Unit Discovery. In Procedia Computer Science, volume 2016, pages 80–86. Elsevier Science, 2016.

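The "no fixed number of components" property of such Bayesian phone-loop models typically comes from a nonparametric prior over the acoustic units. As a loose illustration only (the concentration value and truncation level are made up, and this is not the variational inference from the cited paper), a truncated stick-breaking construction yields unit weights that decay, so only a data-supported number of units receives appreciable mass:

```python
import numpy as np

def stick_breaking_weights(concentration=1.0, truncation=50, rng=None):
    """Truncated stick-breaking construction of mixture weights:
    v_k ~ Beta(1, concentration), w_k = v_k * prod_{j<k} (1 - v_j)."""
    rng = rng or np.random.default_rng(0)
    v = rng.beta(1.0, concentration, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining

w = stick_breaking_weights()
print(w[:5], w.sum())   # weights decay quickly; only a few units get real mass
```
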
Slide 9: Results

LVCSR sub-task, all (ELSA / Forvo / RhinoSpike) on devel, all (ELSA / Forvo / RhinoSpike / YouTube) on test:

System                                    | Devel [WER]                | Test [WER]
Kaldi Baseline                            | 60.4 (45.3 / 99.8 / 73.5)  | 75.5 (46.7 / 98.1 / 76.2 / 97.1)
P-BUT - Babel Kaldi BLSTM 16kHz           | 17.9 (6.4 / 58.1 / 15.8)   | 48.0 (4.9 / 55.7 / 35.4 / 87.2)
L-BUT - Babel Kaldi BLSTM 16kHz - LM tune | 17.6 (6.2 / 56.4 / 16.9)   | 46.3 (4.6 / 52.6 / 32.2 / 84.7)
L-BUT - Babel GMM/DNN 8kHz                | 36.1 (29.7 / 68.5 / 23.4)  | 55.7 (28.0 / 59.3 / 44.9 / 81.4)

Subword sub-task, all (ELSA / Forvo / RhinoSpike) on devel, all (ELSA / Forvo / RhinoSpike / YouTube) on test:

System               | Devel [NMI]                 | Test [NMI]
Kaldi Baseline       | 5.48 (7.88 / 10.44 / 13.8)  | 4.29 (7.49 / 11.25 / 15.99 / 6.35)
P-BUT AUD phone-loop | 5.08 (6.45 / 8.76 / 14.19)  | 4.56 (5.52 / 9.59 / 18.49 / 7.59)

• the BLSTM outperformed the GMM/DNN system
• the exception is unseen data (the YouTube test set)

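For reference, the WER reported in the first table is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words; a minimal dynamic-programming sketch:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

print(wer("xin chào các bạn", "xin chào bạn"))  # 1 deletion / 4 words = 0.25
```
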
Slide 10: Further work
Ideas for future experiments:
• find a better way to clean the original transcriptions
• methods to detect inappropriate audio/transcriptions
• adding noise to the audio for a more robust system
• audio reverberation using room impulse responses (RIRs)

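Reverberation augmentation, as mentioned in the last bullet, usually amounts to convolving the clean waveform with a measured room impulse response. A minimal sketch, assuming both signals are already loaded as numpy arrays at the same sampling rate (the toy RIR here is synthetic):

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate(clean, rir):
    """Convolve a clean waveform with a room impulse response (RIR)
    and rescale so the augmented signal keeps the original peak level."""
    wet = fftconvolve(clean, rir)[: len(clean)]
    return wet * (np.max(np.abs(clean)) / (np.max(np.abs(wet)) + 1e-12))

# toy example: a decaying exponential as a stand-in for a measured RIR
clean = np.random.randn(8000)                        # 1 s of audio at 8 kHz
rir = np.exp(-np.linspace(0, 8, 2000)) * np.random.randn(2000)
print(reverberate(clean, rir).shape)                 # (8000,)
```
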
Slide 11: Thanks for your attention!
