This talk describes the work developed by the Yandex Speech Group over the last two years. Starting from scratch, large amounts of voice recordings were collected from the field of application, and the most popular open source speech projects were studied to gain a thorough understanding of the problem and to gather ideas for building our own technology. This talk presents key experiments and their results, as well as our latest achievements in automatic speech recognition for Russian.
Currently, the Yandex Speech Group provides three different services in Russian: maps, navigation, and general search, with performance comparable to competitor products.
2.
Automatic speech recognition for mobile applications in Yandex
Fran Campillo
7.
Road map
● Sep-2011: study of open source tools and data collection.
  – HTK, Sphinx, RASR, Kaldi, ...
  – Service provided by a 3rd party.
● Jan-2012: development of in-house technology.
● Jan-2013: launch of our own services.
9.
ASR: complexity
Style                 Planned      Spontaneous
Audio quality         CD           Telephone
Vocabulary size       Hundreds     Hundreds of thousands
Number of speakers    One          Many
Recognition rate      Better       Worse
Complexity            Smaller      Bigger
11. 11
Word pronunciations: dictionary
а a
аб a tc p
абад a dc b a tc t
абаза a dc b a z a
абакан a dc b a tc k ax n
абакана a dc b a tc k a n a
абакане a dc b a tc k a nj e
абаканская a dc b a tc k a n s tc k ax j a
абаканский a dc b a tc k a n s tc kj I j
абакумова a dc b a tc k u m ax v a
абанский a dc b a n s tc kj I j
абганеровская a dc b dc g ax nj I r ax f s tc k ax j a
абдулино a dc b dc d u lj i n a
абельмановская a dc bj e lj m ax n ax f s tc k ax j a
абзаково a dc b z a tc k o v a
абзелиловский a dc b zj i lj i l ax f s tc kj I j
13.
ASR: the problem
● We have a sequence of observations:
  – O = {o1, o2, …, oT}
  – oi is a feature vector representing a speech frame.
● Goal: find the likeliest sequence of words wi for O:
  argmaxi P(wi|O)
14.
ASR: the problem (II)
● We cannot compute P(wi|O) directly.
● Bayes' rule:
  P(wi|O) = P(O|wi) P(wi) / P(O)
  argmaxi P(wi|O) = argmaxi { P(O|wi) P(wi) }
  where P(O|wi) is given by the acoustic model and P(wi) by the language model.
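The decomposition above can be sketched in a few lines: since P(O) is the same for every hypothesis, we can pick the word maximizing P(O|w)·P(w), usually in log space. This is a toy illustration with made-up probabilities, not Yandex's system.

```python
import math

# Hypothetical scores for two word hypotheses (illustrative numbers only):
# acoustic model P(O|w) and language model P(w), stored as log-probabilities.
acoustic_log_prob = {"we": math.log(0.04), "will": math.log(0.03)}
language_log_prob = {"we": math.log(0.5), "will": math.log(0.1)}

def best_word(candidates):
    # argmax_i { P(O|w_i) P(w_i) } computed in log space;
    # P(O) is constant over hypotheses, so it is dropped.
    return max(candidates,
               key=lambda w: acoustic_log_prob[w] + language_log_prob[w])

print(best_word(["we", "will"]))  # "we": higher combined score
```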
15.
Language model
● Probability of sequences of words:
  – "We will rock you" => P1.
  – "Will will rock you" => P2.
● Trained on large corpora.
● The closer the training data is to the application domain, the better.
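The idea on this slide can be sketched as a minimal bigram language model with add-one smoothing, trained on a tiny invented corpus (real systems use vastly larger, domain-matched data):

```python
from collections import Counter

# Tiny hypothetical training corpus.
corpus = [["we", "will", "rock", "you"],
          ["we", "will", "win"],
          ["you", "will", "rock"]]

bigrams, unigrams, vocab = Counter(), Counter(), set()
for sent in corpus:
    toks = ["<s>"] + sent + ["</s>"]
    vocab.update(toks)
    for a, b in zip(toks, toks[1:]):
        bigrams[(a, b)] += 1
        unigrams[a] += 1

def p_bigram(a, b):
    # Add-one (Laplace) smoothed P(b | a).
    return (bigrams[(a, b)] + 1) / (unigrams[a] + len(vocab))

def sentence_prob(sent):
    # Product of bigram probabilities over the sentence.
    toks = ["<s>"] + sent + ["</s>"]
    p = 1.0
    for a, b in zip(toks, toks[1:]):
        p *= p_bigram(a, b)
    return p

p1 = sentence_prob(["we", "will", "rock", "you"])
p2 = sentence_prob(["will", "will", "rock", "you"])
print(p1 > p2)  # the in-domain sentence scores higher
```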
16.
Acoustic model: Hidden Markov Models
● First-order HMM: a sequence of states, each depending only on the previous state and associated with events we can observe.
● Typical layout for ASR: three states Q1, Q2, Q3 in a left-to-right chain, with self-loops a11, a22, a33, transitions a12, a23, and emission probabilities b1(o), b2(o), b3(o).
● aij: transition probabilities.
● bj(o): probability of observation o in state j.
17.
Acoustic model: HMM and speech
● Each state models a part of the phoneme:
  – 1st: beginning of the phoneme.
  – 2nd: stationary part.
  – 3rd: end of the phoneme.
● aij: duration of each part.
● bj(o): probability of producing a feature vector o in state j.
18.
Modeling probability of observation
● Gaussian mixtures:
  bj(x) = Σm cjm N(x; μjm, Σjm)
  – cjm: weight of the mth Gaussian of state j.
  – μjm: mean (vector) of the mth Gaussian of state j.
  – Σjm: covariance matrix of the mth Gaussian of state j.
● Neural networks.
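The mixture formula above can be sketched directly, assuming diagonal covariances (a common choice in ASR); the parameter values are illustrative, not from any real model:

```python
import math

def gauss_diag(x, mu, var):
    # Multivariate normal density with diagonal covariance,
    # computed via the sum of per-dimension log densities.
    log_p = 0.0
    for xi, mi, vi in zip(x, mu, var):
        log_p += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)

def b_j(x, weights, means, variances):
    # Mixture likelihood: b_j(x) = sum_m c_jm N(x; mu_jm, Sigma_jm).
    return sum(c * gauss_diag(x, mu, var)
               for c, mu, var in zip(weights, means, variances))

# Two-component mixture over 2-dimensional features (toy values).
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

print(b_j([0.1, -0.2], weights, means, variances))
```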
20.
Block diagram for training
● Prototype HMM → Initialization: initial μjm, Σjm, cjm for the GMMs.
● Baum-Welch on the training sentences: alignments of observations to states.
● HMM parameter update: new estimates for μjm, Σjm, cjm.
● Convergence check: if not converged, repeat Baum-Welch; otherwise, output the trained models.
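The loop in the block diagram can be sketched for the simplest case of a single GMM state: initialize, softly align observations to mixture components (the E-step, analogous to the Baum-Welch alignment), update the parameters (M-step), and repeat until the likelihood converges. This is a toy 1-D, two-component EM, not the full HMM procedure:

```python
import math

def gauss(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def train_gmm(data, iters=50):
    # Initialization: prototype parameters (means at the extremes,
    # unit variances, uniform weights).
    mus = [min(data), max(data)]
    vars_ = [1.0, 1.0]
    ws = [0.5, 0.5]
    prev_ll = float("-inf")
    for _ in range(iters):
        # E-step: soft alignment of each observation to each component.
        resp, ll = [], 0.0
        for x in data:
            ps = [w * gauss(x, m, v) for w, m, v in zip(ws, mus, vars_)]
            s = sum(ps)
            ll += math.log(s)
            resp.append([p / s for p in ps])
        # Convergence check on the log-likelihood.
        if ll - prev_ll < 1e-6:
            break
        prev_ll = ll
        # M-step: new estimates for weights, means, variances.
        for m in range(2):
            n_m = sum(r[m] for r in resp)
            ws[m] = n_m / len(data)
            mus[m] = sum(r[m] * x for r, x in zip(resp, data)) / n_m
            vars_[m] = max(1e-3,
                           sum(r[m] * (x - mus[m]) ** 2
                               for r, x in zip(resp, data)) / n_m)
    return ws, mus, vars_

# Toy data with two clusters near 0 and 5.
data = [0.1, -0.2, 0.3, 4.9, 5.2, 5.0, 0.0, 5.1]
ws, mus, vars_ = train_gmm(data)
print(sorted(mus))  # means settle near the two clusters
```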
21.
Decoding
● Lexicon: words that can be recognized.
● Decoder: dynamic programming, with the constraints imposed by the lexicon, the acoustic models, and the language model.
● Pipeline: speech signal → parametrization → decoder (using lexicon, acoustic models, language model) → words.
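The dynamic-programming search at the heart of the decoder box can be sketched as Viterbi decoding over a tiny HMM; the states, transition and emission probabilities below are invented for illustration:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s]: best probability of any path ending in state s at time t.
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Best predecessor for state s at time t.
            prev, p = max(((r, V[t - 1][r] * trans_p[r][s]) for r in states),
                          key=lambda rp: rp[1])
            V[t][s] = p * emit_p[s][obs[t]]
            back[t][s] = prev
    # Backtrack from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

states = ["q1", "q2"]
start_p = {"q1": 0.9, "q2": 0.1}
trans_p = {"q1": {"q1": 0.6, "q2": 0.4}, "q2": {"q1": 0.1, "q2": 0.9}}
emit_p = {"q1": {"a": 0.8, "b": 0.2}, "q2": {"a": 0.1, "b": 0.9}}

print(viterbi(["a", "a", "b", "b"], states, start_p, trans_p, emit_p))
# ['q1', 'q1', 'q2', 'q2']
```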
22.
Our decoder
● Based on weighted finite-state transducers (WFSTs).
● The lexicon, the language model, and the acoustic model are composed into a single structure (HCLG).
  – Same information, but more efficient.
23.
Composition of WFST: example
● Composing the lexicon with the language model yields a single transducer from phonemes to words (ε = empty output):
  0 --B:Bob--> 1 --ah:ε--> 2 --b:ε--> 3 --l:likes--> 4 --ay:ε--> 5 --k:ε--> 6 --s:ε--> …
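The composed path on the slide can be sketched as a list of arcs that map input phonemes to output words or epsilon; this toy representation (weights omitted, lowercased phone symbols, made-up state numbering past state 6) just walks the single path, it is not a real WFST library:

```python
# (source state, input phoneme, destination state, output word; "" = epsilon)
arcs = [
    (0, "b", 1, "Bob"),
    (1, "ah", 2, ""),
    (2, "b", 3, ""),
    (3, "l", 4, "likes"),
    (4, "ay", 5, ""),
    (5, "k", 6, ""),
    (6, "s", 7, ""),
]

def transduce(phonemes, arcs, start=0):
    # Deterministically walk the arcs, collecting non-epsilon outputs.
    state, words = start, []
    for ph in phonemes:
        nxt = [(d, out) for (s, i, d, out) in arcs if s == state and i == ph]
        if not nxt:
            return None  # no path: phoneme sequence not in the lexicon
        state, out = nxt[0]
        if out:
            words.append(out)
    return words

print(transduce(["b", "ah", "b", "l", "ay", "k", "s"], arcs))
# ['Bob', 'likes']
```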
25.
Data collection
● Speech samples taken from the field.
● Manual transcriptions, annotated with:
  – Speaker features: gender, native, ...
  – Anomalies in the pronunciation.
  – Noises in the recording.
37.
Results: relative word error rate
● Results relative to our WER in each experiment (negative values: experiments in which our system is outperformed):

               Maps     Navigation   General search
  Yandex-GMM   1        1            1
  3rd party    44.6%    31.8%        37.3%
  Competitor   1.9%     -9.7%        -23.4%

               General search
  Yandex-DNN   1
  Competitor   6.6%