The CUHK Systems for Spoken Web Search task at
MediaEval 2012
Haipeng Wang and Tan Lee
Department of Electronic Engineering
The Chinese University of Hong Kong
September 30, 2012
Outline
1 Overview
2 System Description
PTDTW framework
Tokenizers
DTW detection
Pseudo-relevance Feedback and Score Normalization
3 System configuration and performance
4 Conclusion
5 Acknowledgement
Overview
2012 Spoken Web Search task [Metze et al., 2012]
QbyE STD: Audio search using audio queries.
Multilingual: Four South African languages.
Low-resource: less than 4 hours of DEV audio data in total.
Extreme case: One example for each query term.
Overview of our systems
Aiming at a language-independent QbyE STD system.
Multiple resources:
1) the DEV audio data; 2) rich-resource languages.
Combine different resources: PTDTW framework.
Pseudo-relevance feedback (PRF).
Score normalization.
Posteriorgram-based template matching
Figure: Posteriorgram-based template matching [Hazen et al., 2009]. Training resources are used to train a tokenizer; the tokenizer converts the query example and each test utterance into posteriorgrams, and DTW detection between the two posteriorgrams produces the detection score.
Training resources: audio data with or without transcriptions.
Tokenizer: unsupervised if trained without transcriptions; supervised otherwise.
Posteriorgrams: more robust than spectral features.
How to effectively combine different resources?
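As an illustration of how a Gaussian posteriorgram is obtained, the sketch below evaluates each frame's posterior over the components of a diagonal-covariance GMM. This is a minimal numpy sketch; the function name and interface are illustrative, not taken from the systems described here.

```python
import numpy as np

def gaussian_posteriorgram(feats, means, variances, weights):
    """Posterior of each GMM component for each frame (a posteriorgram).

    feats: (T, D) feature frames, e.g. MFCCs.
    means, variances: (M, D) diagonal-covariance component parameters.
    weights: (M,) component priors.
    Returns a (T, M) matrix whose rows sum to 1.
    """
    diff = feats[:, None, :] - means[None, :, :]                  # (T, M, D)
    # Per-frame, per-component diagonal-Gaussian log-likelihood.
    ll = -0.5 * np.sum(diff**2 / variances
                       + np.log(2 * np.pi * variances), axis=2)   # (T, M)
    ll += np.log(weights)
    # Softmax over components gives the posterior per frame.
    ll -= ll.max(axis=1, keepdims=True)
    post = np.exp(ll)
    return post / post.sum(axis=1, keepdims=True)
```

Each row of the posteriorgram is a probability distribution over the M components, which is what makes the inner-product distance in the DTW stage meaningful.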
PTDTW framework
Figure: PTDTW framework. The query example and each test utterance are decoded by N parallel tokenizers; tokenizer n produces a pair of posteriorgrams and a DTW distance matrix D_n; the matrices D_1, ..., D_N are combined into a single distance matrix D, on which DTW detection produces the raw score.
Parallel tokenizers followed by DTW detection (PTDTW).
Modified from the posteriorgram-based template matching
approach.
Key idea: Combining DTW distance matrices.
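The deck states only that the per-tokenizer DTW distance matrices D_1, ..., D_N are combined into a single matrix D; a simple choice, assumed here for illustration, is to average them. The sketch below uses the negative-log inner-product distance between posteriorgram frames; all names are illustrative.

```python
import numpy as np

def combined_distance_matrix(query_posts, test_posts):
    """Average per-tokenizer DTW distance matrices into one matrix D.

    query_posts, test_posts: lists with one (frames, units) posteriorgram
    per tokenizer, in matching order.  The averaging rule is an assumption;
    the frame-level distance is -log of the inner product.
    """
    eps = 1e-10
    mats = []
    for q, t in zip(query_posts, test_posts):
        # (I, J) distance matrix D_n for tokenizer n.
        mats.append(-np.log(np.maximum(q @ t.T, eps)))
    return np.mean(mats, axis=0)
```

Combining at the distance-matrix level lets tokenizers with different numbers of units (GMM components, ASM units, phonemes) contribute on equal footing.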
Unsupervised tokenizers
MFCC-GMM tokenizer [Zhang and Glass, 2009]
Unsupervised training from the DEV data without transcription.
1024 Gaussian components.
39-dim MFCC + MVN + VTLN
MFCC-ASM tokenizer [Lee et al., 1988, Wang et al., 2012]
Acoustic segment model (ASM), also known as the self-organized unit
(SOU) [Siu et al., 2010].
Unsupervised training from the DEV data without transcription.
256 ASM units; each unit has 3 states, with 16 Gaussian
components per state.
39-dim MFCC + MVN + VTLN
Phoneme recognizers
Czech, Hungarian, Russian phoneme recognizers
developed by BUT [Schwarz, 2009].
trained from SpeechDat-E corpora.
Mandarin phoneme recognizer
179 tonal phonemes.
About 15 hours of training data from the CallHome and
CallFriend corpora.
English phoneme recognizer
40 phonemes.
About 15 hours of training data from the Fisher and Switchboard
Cellular corpora.
Phoneme recognizers
Figure: Tandem structure. Input data → phoneme recognizers → taking logarithm → PCA transform → GMM → Gaussian posteriorgrams.
256 Gaussian components trained on the DEV data.
Using the tandem structure, we obtain 5 supervised tokenizers:
CZ-GMM, HU-GMM, RU-GMM, MA-GMM and EN-GMM.
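The log + PCA stage of the tandem structure can be sketched in numpy as below. For illustration the PCA transform is estimated on the input itself; in the systems described, the transform and the 256-component GMM are trained on the DEV data. Names are illustrative.

```python
import numpy as np

def tandem_features(phone_posts, pca_dim):
    """Take the logarithm of phoneme posteriors and apply a PCA transform.

    phone_posts: (T, P) phoneme posteriors from one recognizer.
    Returns (T, pca_dim) decorrelated features, ready for GMM training.
    """
    logp = np.log(np.maximum(phone_posts, 1e-10))
    centered = logp - logp.mean(axis=0)
    # Principal axes from the SVD of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:pca_dim].T
```

The resulting features would then feed the 256-component GMM whose component posteriors form the tokenizer's posteriorgrams.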
DTW detection
DTW detection is performed with a sliding window.
Find the path minimizing the normalized distance:

\hat{d} = \min_{K,\, i(k),\, j(k)} \frac{1}{Z(w)} \sum_{k=1}^{K} d(i(k), j(k))\, w_k

where d(i(k), j(k)) is set to the inner-product distance, w_k = 1,
and Z(w) = K.
Additional constraint: |i(k) − j(k)| ≤ R.
Due to the large variation of the query length, R is not set to a
fixed number, but in proportion to the query length I:
R = α × I (α = 1/3 in our systems).
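A minimal Python sketch of this banded DTW: it uses unit step weights with moves (1,0), (0,1) and (1,1); the joint minimization over K in the formula is approximated by accumulating the minimum cost and dividing by that path's length. The function name and interface are illustrative.

```python
import numpy as np

def band_dtw(dist, alpha=1/3):
    """Length-normalized DTW with band constraint |i(k) - j(k)| <= R.

    dist: (I, J) frame-level distance matrix between the query and a
    sliding window of the test utterance.  R = alpha * I, alpha = 1/3.
    Returns the average per-step distance of the best path.
    """
    I, J = dist.shape
    R = max(1, int(alpha * I))
    cost = np.full((I, J), np.inf)        # accumulated distance
    steps = np.zeros((I, J), dtype=int)   # path length K at each cell
    for i in range(I):
        for j in range(J):
            if abs(i - j) > R:
                continue                  # outside the band
            if i == 0 and j == 0:
                cost[i, j], steps[i, j] = dist[i, j], 1
                continue
            best, k = np.inf, 0
            for pi, pj in ((i - 1, j), (i, j - 1), (i - 1, j - 1)):
                if pi >= 0 and pj >= 0 and cost[pi, pj] < best:
                    best, k = cost[pi, pj], steps[pi, pj]
            if np.isfinite(best):
                cost[i, j] = best + dist[i, j]
                steps[i, j] = k + 1
    final = cost[I - 1, J - 1]
    return final / steps[I - 1, J - 1] if np.isfinite(final) else np.inf
```

Sliding this window across the test utterance and keeping the minimum normalized distance per region yields the raw detection score.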
Pseudo-relevance Feedback and Score Normalization
Pseudo-relevance feedback for each query:
1) The top H hits from all the test utterances were selected as the
relevance examples. Selection criteria: a) H ≤ 3; b) the
raw detection score must exceed a pre-set threshold.
2) The relevance examples were used to score the top Ĥ (Ĥ = 2
for this task) hits from each test utterance.
3) The scores obtained by the relevance examples were linearly
fused with the scores of the original query examples.
Score normalization for each query:

\hat{s}_{q,t} = (s_{q,t} − μ_q) / δ_q

where s_{q,t} is the score of the qth query on the tth hit region,
and μ_q and δ_q² are the mean and variance of the scores for the qth
query, estimated from the development data.
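The two steps above can be sketched in plain Python. The hit representation and function names are illustrative; H ≤ 3 and the pre-set threshold follow the slide, and μ_q, δ_q are estimated from development data.

```python
def prf_select(hits, threshold, max_h=3):
    """PRF step 1: pick up to H <= 3 top-scoring hits whose raw detection
    score exceeds a pre-set threshold, as pseudo-relevance examples.

    hits: list of (utterance_id, score) pairs across all test utterances.
    """
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    return [h for h in ranked if h[1] > threshold][:max_h]

def normalize_score(score, mu, sigma):
    """Per-query score normalization: s_hat = (s - mu) / sigma."""
    return (score - mu) / sigma
```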
System Configuration and Performance
Table: System configurations and ATWV performances.
Each system uses a subset of five components: MFCC-GMM, MFCC-ASM, PHNREC-GMM, PRF, and score normalization; score normalization is used in all five systems.

System No.     1     2     3     4     5
devQ - devC   0.68  0.63  0.73  0.78  0.74
devQ - evlC   0.60  0.55  0.70  0.75  0.70
evlQ - devC   0.68  0.65  0.73  0.77  0.75
evlQ - evlC   0.64  0.59  0.72  0.74  0.74
Systems 1 and 2 belong to the required run condition.
Systems 3, 4 and 5 belong to the general run condition.
The best performance (System 4) is achieved when all the tokenizers, PRF and
score normalization are used.
PHNREC-GMM denotes the combination of the five tandem tokenizers: CZ-GMM,
HU-GMM, RU-GMM, MA-GMM, and EN-GMM.
Supervised tokenizers perform better than the unsupervised tokenizers.
Training resources for the unsupervised tokenizers are limited in this task,
whereas those for the supervised tokenizers are not.
The PTDTW framework provides a flexible way to combine all these resources.
Combining supervised and unsupervised tokenizers leads to consistent
improvement.
Pseudo-relevance feedback also provides consistent improvement.
Conclusion
A PTDTW framework was proposed for the query-by-example
STD task in this evaluation.
Supervised tokenizers performed better than unsupervised
tokenizers for this task. The combination of supervised and
unsupervised tokenizers provided consistent gain.
Pseudo-relevance feedback and score normalization were used.
Acknowledgement
Thanks to Cheung-Chi Leung from I2R for helpful discussions.
Thanks to the organizers for organizing this evaluation.
Thanks to BUT for sharing the phoneme recognizers and scripts.
This research is partially supported by the General Research
Funds (Ref: 414010 and 413811) from the Hong Kong Research
Grants Council.
Thank you!
Reference
Hazen, T., Shen, W., and White, C. (2009).
Query-by-example spoken term detection using phonetic posteriorgram templates.
In ASRU.
Lee, C., Soong, F., and Juang, B. (1988).
A segment model based approach to speech recognition.
In ICASSP.
Metze, F., Barnard, E., Davel, M., van Heerden, C., Anguera, X., Gravier, G., and Rajput, N. (2012).
The spoken web search task.
In MediaEval 2012 Workshop.
Schwarz, P. (2009).
Phoneme recognition based on long temporal context. PhD thesis, Brno University of Technology.
Siu, M., Gish, H., Chan, A., and Belfield, W. (2010).
Improved topic classification and keyword discovery using an HMM-based speech recognizer trained without
supervision.
In INTERSPEECH.
Wang, H., Leung, C.-C., Lee, T., Li, H., and Ma, B. (2012).
An acoustic segment modeling approach to query-by-example spoken term detection.
In ICASSP.
Zhang, Y. and Glass, J. (2009).
Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams.
In ASRU.