This talk describes the work developed by the Yandex Speech Group over the last two years. Starting from scratch, large amounts of voice recordings were collected from the field of application, and the most popular open source speech projects were studied to gain a thorough understanding of the problem and to gather ideas for building our own technology. This talk presents key experiments and their results, as well as our latest achievements in automatic speech recognition for Russian.
Currently, the Yandex Speech Group provides three different services in Russian: maps, navigation, and general search, with performance comparable to competitor products.
2.
Automatic speech recognition for mobile applications in Yandex
Fran Campillo
7.
Road map
● Sep-2011: study of open source tools and data collection.
  – HTK, Sphinx, RASR, Kaldi, ...
  – Service provided by a 3rd party.
● Jan-2012: development of in-house technology.
● Jan-2013: launch of our own services.
9.
ASR: complexity
Style                 Planned      Spontaneous
Audio quality         CD           Telephone
Vocabulary size       Hundreds     Hundreds of thousands
Number of speakers    One          Many
Recognition rate      Better       Worse
Complexity            Smaller      Bigger
11. 11
Word pronunciations: dictionary
а a
аб a tc p
абад a dc b a tc t
абаза a dc b a z a
абакан a dc b a tc k ax n
абакана a dc b a tc k a n a
абакане a dc b a tc k a nj e
абаканская a dc b a tc k a n s tc k ax j a
абаканский a dc b a tc k a n s tc kj I j
абакумова a dc b a tc k u m ax v a
абанский a dc b a n s tc kj I j
абганеровская a dc b dc g ax nj I r ax f s tc k ax j a
абдулино a dc b dc d u lj i n a
абельмановская a dc bj e lj m ax n ax f s tc k ax j a
абзаково a dc b z a tc k o v a
абзелиловский a dc b zj i lj i l ax f s tc kj I j
13.
ASR: the problem
● We have a sequence of observations:
  – O = {o1, o2, …, oT}
  – oi is a feature vector representing a speech frame.
● Goal: find the likeliest sequence of words wi for O:
  argmaxi P(wi|O)
14.
ASR: the problem (II)
● We cannot compute P(wi|O) directly.
● Bayes' rule:
  P(wi|O) = P(O|wi) P(wi) / P(O)
  argmaxi P(wi|O) = argmaxi { P(O|wi) P(wi) }
  where P(O|wi) is given by the acoustic model and P(wi) by the language model.
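The decomposition above can be sketched in a few lines: since P(O) is the same for every hypothesis, we can pick the word maximizing P(O|w)·P(w), usually in log space. This is a toy illustration with made-up probabilities, not Yandex's system.

```python
import math

# Hypothetical scores for two word hypotheses (illustrative numbers only):
# acoustic model P(O|w) and language model P(w), stored as log-probabilities.
acoustic_log_prob = {"we": math.log(0.04), "will": math.log(0.03)}
language_log_prob = {"we": math.log(0.5), "will": math.log(0.1)}

def best_word(candidates):
    # argmax_i { P(O|w_i) P(w_i) } computed in log space;
    # P(O) is constant over hypotheses, so it is dropped.
    return max(candidates,
               key=lambda w: acoustic_log_prob[w] + language_log_prob[w])

print(best_word(["we", "will"]))  # "we": higher combined score
```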
15.
Language model
● Probability of sequences of words:
  – "We will rock you" => P1.
  – "Will will rock you" => P2.
● Trained on large corpora.
● The closer the training data is to the application domain, the better.
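The idea on this slide can be sketched as a minimal bigram language model with add-one smoothing, trained on a tiny invented corpus (real systems use vastly larger, domain-matched data):

```python
from collections import Counter

# Tiny hypothetical training corpus.
corpus = [["we", "will", "rock", "you"],
          ["we", "will", "win"],
          ["you", "will", "rock"]]

bigrams, unigrams, vocab = Counter(), Counter(), set()
for sent in corpus:
    toks = ["<s>"] + sent + ["</s>"]
    vocab.update(toks)
    for a, b in zip(toks, toks[1:]):
        bigrams[(a, b)] += 1
        unigrams[a] += 1

def p_bigram(a, b):
    # Add-one (Laplace) smoothed P(b | a).
    return (bigrams[(a, b)] + 1) / (unigrams[a] + len(vocab))

def sentence_prob(sent):
    # Product of bigram probabilities over the sentence.
    toks = ["<s>"] + sent + ["</s>"]
    p = 1.0
    for a, b in zip(toks, toks[1:]):
        p *= p_bigram(a, b)
    return p

p1 = sentence_prob(["we", "will", "rock", "you"])
p2 = sentence_prob(["will", "will", "rock", "you"])
print(p1 > p2)  # the in-domain sentence scores higher
```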
16.
Acoustic model: Hidden Markov Models
● First-order HMM: a sequence of states, each depending only on the previous state and associated with events we can observe.
● Typical layout for ASR: three states Q1, Q2, Q3 in a left-to-right chain, with self-loops a11, a22, a33, transitions a12, a23, and emission probabilities b1(o), b2(o), b3(o).
● aij: transition probabilities.
● bj(o): probability of observation o in state j.
17.
Acoustic model: HMM and speech
● Each state models a part of the phoneme:
  – 1st: beginning of the phoneme.
  – 2nd: stationary part.
  – 3rd: end of the phoneme.
● aij: duration of each part.
● bj(o): probability of producing a feature vector o in state j.
18.
Modeling probability of observation
● Gaussian mixtures:
  bj(x) = Σm cjm N(x; μjm, Σjm)
  – cjm: weight of the mth Gaussian of state j.
  – μjm: mean (vector) of the mth Gaussian of state j.
  – Σjm: covariance matrix of the mth Gaussian of state j.
● Neural networks.
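The mixture formula above can be sketched directly, assuming diagonal covariances (a common choice in ASR); the parameter values are illustrative, not from any real model:

```python
import math

def gauss_diag(x, mu, var):
    # Multivariate normal density with diagonal covariance,
    # computed via the sum of per-dimension log densities.
    log_p = 0.0
    for xi, mi, vi in zip(x, mu, var):
        log_p += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)

def b_j(x, weights, means, variances):
    # Mixture likelihood: b_j(x) = sum_m c_jm N(x; mu_jm, Sigma_jm).
    return sum(c * gauss_diag(x, mu, var)
               for c, mu, var in zip(weights, means, variances))

# Two-component mixture over 2-dimensional features (toy values).
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

print(b_j([0.1, -0.2], weights, means, variances))
```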
20.
Block diagram for training
● Prototype HMM → Initialization: initial μjm, Σjm, cjm for the GMMs.
● Baum-Welch on the training sentences: alignments of observations to states.
● HMM parameter update: new estimates for μjm, Σjm, cjm.
● Convergence check: if not converged, repeat Baum-Welch; otherwise, output the trained models.
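The loop in the block diagram can be sketched for the simplest case of a single GMM state: initialize, softly align observations to mixture components (the E-step, analogous to the Baum-Welch alignment), update the parameters (M-step), and repeat until the likelihood converges. This is a toy 1-D, two-component EM, not the full HMM procedure:

```python
import math

def gauss(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def train_gmm(data, iters=50):
    # Initialization: prototype parameters (means at the extremes,
    # unit variances, uniform weights).
    mus = [min(data), max(data)]
    vars_ = [1.0, 1.0]
    ws = [0.5, 0.5]
    prev_ll = float("-inf")
    for _ in range(iters):
        # E-step: soft alignment of each observation to each component.
        resp, ll = [], 0.0
        for x in data:
            ps = [w * gauss(x, m, v) for w, m, v in zip(ws, mus, vars_)]
            s = sum(ps)
            ll += math.log(s)
            resp.append([p / s for p in ps])
        # Convergence check on the log-likelihood.
        if ll - prev_ll < 1e-6:
            break
        prev_ll = ll
        # M-step: new estimates for weights, means, variances.
        for m in range(2):
            n_m = sum(r[m] for r in resp)
            ws[m] = n_m / len(data)
            mus[m] = sum(r[m] * x for r, x in zip(resp, data)) / n_m
            vars_[m] = max(1e-3,
                           sum(r[m] * (x - mus[m]) ** 2
                               for r, x in zip(resp, data)) / n_m)
    return ws, mus, vars_

# Toy data with two clusters near 0 and 5.
data = [0.1, -0.2, 0.3, 4.9, 5.2, 5.0, 0.0, 5.1]
ws, mus, vars_ = train_gmm(data)
print(sorted(mus))  # means settle near the two clusters
```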
21.
Decoding
● Lexicon: words that can be recognized.
● Decoder: dynamic programming, with the constraints imposed by the lexicon, the acoustic models, and the language model.
● Pipeline: speech signal → parametrization → decoder (using lexicon, acoustic models, language model) → words.
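The dynamic-programming search at the heart of the decoder box can be sketched as Viterbi decoding over a tiny HMM; the states, transition and emission probabilities below are invented for illustration:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s]: best probability of any path ending in state s at time t.
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Best predecessor for state s at time t.
            prev, p = max(((r, V[t - 1][r] * trans_p[r][s]) for r in states),
                          key=lambda rp: rp[1])
            V[t][s] = p * emit_p[s][obs[t]]
            back[t][s] = prev
    # Backtrack from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

states = ["q1", "q2"]
start_p = {"q1": 0.9, "q2": 0.1}
trans_p = {"q1": {"q1": 0.6, "q2": 0.4}, "q2": {"q1": 0.1, "q2": 0.9}}
emit_p = {"q1": {"a": 0.8, "b": 0.2}, "q2": {"a": 0.1, "b": 0.9}}

print(viterbi(["a", "a", "b", "b"], states, start_p, trans_p, emit_p))
# ['q1', 'q1', 'q2', 'q2']
```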
22.
Our decoder
● Based on weighted finite-state transducers (WFSTs).
● The lexicon, the language model, and the acoustic model are composed into a single structure (HCLG).
  – Same information, but more efficient.
23.
Composition of WFST: example
● Composing the lexicon with the language model yields a single transducer from phonemes to words (ε = empty output):
  0 --B:Bob--> 1 --ah:ε--> 2 --b:ε--> 3 --l:likes--> 4 --ay:ε--> 5 --k:ε--> 6 --s:ε--> …
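The composed path on the slide can be sketched as a list of arcs that map input phonemes to output words or epsilon; this toy representation (weights omitted, lowercased phone symbols, made-up state numbering past state 6) just walks the single path, it is not a real WFST library:

```python
# (source state, input phoneme, destination state, output word; "" = epsilon)
arcs = [
    (0, "b", 1, "Bob"),
    (1, "ah", 2, ""),
    (2, "b", 3, ""),
    (3, "l", 4, "likes"),
    (4, "ay", 5, ""),
    (5, "k", 6, ""),
    (6, "s", 7, ""),
]

def transduce(phonemes, arcs, start=0):
    # Deterministically walk the arcs, collecting non-epsilon outputs.
    state, words = start, []
    for ph in phonemes:
        nxt = [(d, out) for (s, i, d, out) in arcs if s == state and i == ph]
        if not nxt:
            return None  # no path: phoneme sequence not in the lexicon
        state, out = nxt[0]
        if out:
            words.append(out)
    return words

print(transduce(["b", "ah", "b", "l", "ay", "k", "s"], arcs))
# ['Bob', 'likes']
```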
25.
Data collection
● Speech samples taken from the field.
● Manual transcriptions, annotated with:
  – Speaker features: gender, native, ...
  – Anomalies in the pronunciation.
  – Noises in the recording.
37.
Results: relative word error rate
● Results relative to our WER in each experiment (negative values: experiments in which our system is outperformed):

               Maps     Navigation   General search
  Yandex-GMM   1        1            1
  3rd party    44.6%    31.8%        37.3%
  Competitor   1.9%     -9.7%        -23.4%

               General search
  Yandex-DNN   1
  Competitor   6.6%