MediaEval 2016 - ININ Submission to Zero Cost ASR Task

ININ submission to MediaEval Zero Cost ASR task
Tejas Godambe, Naresh Kumar, Pavan Kumar,
Veera Raghavendra and Aravind Ganapathiraju
October 21, 2016
MediaEval Workshop, October 20-21, Hilversum, Netherlands

Introduction to Zero Cost ASR task
Motivation: To bridge the gap between “top speech labs and
companies”, which can afford to buy and collect data for
development and research, and “other small players”.
Task: To build the best possible ASR system for the Vietnamese
language using limited public-domain data that covers diverse
acoustic conditions and has imperfect transcripts.
More details of the task are in [Szoke and Anguera, 2016].

Data description
Official data from organizers
ELSA: Proprietary recordings of sentences read from a book
of Vietnamese quotes.
Forvo.com: Collection of short recordings downloaded from
forvo.com.
Rhinospike.com: Collection of both short and long
recordings downloaded from rhinospike.com.
“Surprise” test data: 35 videos downloaded from YouTube
(broadcast news, presentations, talks, etc.).
Data from participants
Not used for training.

System Description
The Kaldi toolkit [Povey et al., 2011] was used for system building.
Steps followed for building the final system:
1. Audio pre-processing: Long silences in the training data were
truncated to 0.3 seconds.
2. Audio augmentation: The data was augmented with 0.9x and
1.1x speed-perturbed versions of itself [Ko et al., 2015].
3. Use of pitch information: Pitch features were extracted
along with conventional MFCCs [Ghahremani et al., 2014].
4. Estimation of robust parameters with less data: An SGMM
acoustic model was used [Povey et al., 2010].
5. Use of more history: A 5-gram language model (LM) was used.
6. Use of test data for training: The test data was decoded and
its approximate transcripts were added to the training data.
7. Hypothesis re-ranking with a different LM: Lattices were
generated and rescored using an RNN LM [Mikolov et al., 2011].
8. Final decoding.
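
Steps 1 and 2 above can be illustrated with a minimal sketch. This is not the authors' Kaldi recipe: the function names, the energy threshold, the 10 ms frame size, and the use of linear interpolation for resampling are all assumptions made for illustration. Only the 0.3 second limit and the 0.9x and 1.1x factors come from the slides.

```python
def truncate_silence(samples, rate, max_sil=0.3, frame=0.01, threshold=0.01):
    """Shorten every silent run longer than max_sil seconds to max_sil seconds.

    A frame counts as silent when its RMS amplitude is below `threshold`
    (a hypothetical value; a real system would tune it or use a VAD).
    """
    frame_len = max(1, int(round(rate * frame)))
    max_sil_frames = int(round(max_sil * rate / frame_len))
    out = []
    silent_run = 0
    for start in range(0, len(samples), frame_len):
        chunk = samples[start:start + frame_len]
        energy = sum(x * x for x in chunk) / len(chunk)
        if energy < threshold ** 2:
            silent_run += 1
            if silent_run > max_sil_frames:
                continue  # drop silence beyond the allowed duration
        else:
            silent_run = 0
        out.extend(chunk)
    return out


def speed_perturb(samples, factor):
    """Resample by linear interpolation so that playback at the original
    sampling rate is `factor` times faster (factor > 1 shortens the audio)."""
    if len(samples) < 2:
        return list(samples)
    n_out = int(len(samples) / factor)
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * factor
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
    return out


def augment(samples):
    """Return the original audio plus its 0.9x and 1.1x speed-perturbed copies."""
    return [list(samples), speed_perturb(samples, 0.9), speed_perturb(samples, 1.1)]
```

With a 1 kHz toy signal, a one-second silent stretch between two bursts is cut to 0.3 seconds by `truncate_silence`, while a 0.2 second pause is left unchanged. `augment` triples the amount of training audio, at the cost of the quality loss that simple linear interpolation introduces.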

Results on dev-local data
Row  Experiment                            WER (%)  WERR (%)
1    Training the triphone model           37.0
2    Truncating silence in training data   27.4     37.0-27.4=9.6
3    Truncating silence in test data       50.3     27.4-50.3=-22.9
4    Using SGMM model                      18.1     27.4-18.1=9.3
5    Using DNN model                       23.5     18.1-23.5=-5.4
6    Using position independent phones     19.1     18.1-19.1=-1.0
7    Unsupervised adaptation               16.1     18.1-16.1=2.0
8    Audio augmentation-1                  17.0     18.1-17.0=1.1
9    Audio augmentation-2                  17.3     18.1-17.3=0.8
10   Using pitch information               16.9     18.1-16.9=1.2
11   Using 5 gram LM                       16.1     18.1-16.1=2.0
12   Using 7 gram LM                       16.6     18.1-16.6=1.5
13   Combined system                       13.8
14   Rescoring lattices using RNN LM       13.5     13.8-13.5=0.3
15   ROVER [Fiscus, 1997]                  13.5     13.5-13.5=0.0
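
The two metrics in the table can be sketched as follows. WER is computed here in the usual way, as the word-level edit distance divided by the number of reference words. WERR is taken to be the absolute difference between a baseline WER and the new WER, which is what the arithmetic in the table shows. The function names are hypothetical, and real evaluations would normally rely on a scoring tool rather than this simplified version.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return 100.0 * prev[-1] / len(ref)


def wer_reduction(baseline_wer, new_wer):
    """Absolute WER reduction in percentage points, as used in the table."""
    return round(baseline_wer - new_wer, 1)
```

For example, `wer_reduction(18.1, 16.6)` gives 1.5 for the 7-gram LM row, and a negative value such as `wer_reduction(27.4, 50.3)` indicates that the change made recognition worse.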

Final results and discussion
[Table: results on the Dev-local, Dev, and Test sets]
Our system performed decently on data from ELSA and
rhinospike.com, but relatively poorly on data from forvo.com
and YouTube. This warrants further investigation.
Immediate and complementary areas to explore include
ways to artificially increase the amount of data to train better
ANNs, and ways to train robust ANN models with less data.

References I
Fiscus, J. G. (1997).
A post-processing system to yield reduced word error rates: Recognizer output
voting error reduction (ROVER).
In Proceedings of the 1997 IEEE Workshop on Automatic Speech Recognition
and Understanding, pages 347–354. IEEE.
Ghahremani, P., BabaAli, B., Povey, D., Riedhammer, K., Trmal, J., and
Khudanpur, S. (2014).
A pitch extraction algorithm tuned for automatic speech recognition.
In 2014 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 2494–2498. IEEE.
Ko, T., Peddinti, V., Povey, D., and Khudanpur, S. (2015).
Audio augmentation for speech recognition.
In Proceedings of INTERSPEECH.
Mikolov, T., Kombrink, S., Deoras, A., Burget, L., and Cernocky, J. (2011).
RNNLM - Recurrent neural network language modeling toolkit.
In Proc. of the 2011 ASRU Workshop, pages 196–201.

References II
Povey, D., Burget, L., Agarwal, M., Akyazi, P., Feng, K., Ghoshal, A., Goel,
N. K., Karafiát, M., Rastrow, A., Rose, R. C., et al. (2010).
Subspace Gaussian mixture models for speech recognition.
In 2010 IEEE International Conference on Acoustics, Speech and Signal
Processing, pages 4330–4333. IEEE.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N.,
Hannemann, M., Motlíček, P., Qian, Y., Schwarz, P., et al. (2011).
The Kaldi speech recognition toolkit.
Szoke, I. and Anguera, X. (2016).
Zero cost speech recognition task at MediaEval 2016.
In Proc. of the 2016 MediaEval Workshop.