5. Existing solutions
• Server-based
– Telephony, IVR
– Dictation (health care industry)
– Audio indexing
Either offline or with significant delays
6. Existing solutions
• Desktop-based
– Real-time dictation
– Language learning
Requires a good setup, powerful computer,
quiet environment
Very good accuracy, no training required
7. Existing solutions
• Embedded applications
– Simple voice commands
(‘Call-mum’ type command)
– Isolated word recognition
Small vocabulary and lack of naturalness restrict
the range of applications
11. Technical challenge
[Diagram: acoustic feature vectors enter the Recognizer (senone calculation, then Viterbi decoding), which outputs the transcription ‘Hello world’. Knowledge sources feeding the recognizer: Acoustic Models, Phoneme Lexicon, Word Lexicon, Statistical Language Model.]
13. Technical challenge
Acoustic Models
• 4000 acoustic models
• Sub-acoustic unit
• Functions that score 10 ms of speech
• Sets of mean and variance vectors (40 long) of Gaussian mixtures (16)
[Diagram: recognizer pipeline; the senone calculation stage is labelled ‘Multi-dim. Gaussian mixt. calculation’.]
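A minimal sketch of how one acoustic model could score a single 10 ms feature vector, assuming diagonal-covariance Gaussians (suggested by the "mean and variance vectors" on the slide, but not stated). The names `log_gaussian_diag` and `senone_score` are illustrative, and the toy model uses 2 components over 3 dimensions rather than the 16 and 40 cited above.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def senone_score(x, weights, means, variances):
    """Log-likelihood of one frame under a Gaussian mixture (log-sum-exp)."""
    logs = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))

# Toy model (assumed values): 2 mixture components over 3 dimensions.
weights = [0.5, 0.5]
means = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
variances = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

frame = [0.0, 0.0, 0.0]
score = senone_score(frame, weights, means, variances)
```

A recognizer would evaluate a function like this for every active model at every 10 ms step, which is why this stage is a natural candidate for hardware acceleration.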
16. Technical challenge
Phoneme
• 50 in English
• Distinguishable sounds
• Represent a sequence of senones: HMM (Hidden Markov Model)
‘ah’: ah1 ah2 ah3
‘l’: l1 l2 l3
[Diagram: recognizer pipeline with Senone Lexicon, Phoneme Lexicon, Word Lexicon and Statistical Language Model.]
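The ‘ah’: ah1 ah2 ah3 example can be sketched as a left-to-right HMM in which each state either loops on itself or moves to the next state. The helper `phoneme_hmm`, the `<exit>` marker and the 0.5 self-loop probability are assumptions for illustration, not values from the slides.

```python
def phoneme_hmm(phone, n_states=3, self_loop=0.5):
    """Return (states, transitions) for a left-to-right HMM of one phoneme."""
    states = [f"{phone}{i}" for i in range(1, n_states + 1)]
    transitions = {}
    for i, state in enumerate(states):
        # The last state leaves the phoneme through a hypothetical exit marker.
        nxt = states[i + 1] if i + 1 < n_states else "<exit>"
        transitions[state] = {state: self_loop, nxt: 1.0 - self_loop}
    return states, transitions

states, transitions = phoneme_hmm("ah")
# states == ['ah1', 'ah2', 'ah3']
```

The self-loops let a phoneme last for a variable number of 10 ms frames.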
17. Technical challenge
Triphone
• 2500 in English
• Distinguishable sounds in their context: continuous speech
‘hh-ah+l’: ah1 ah2 ah3
‘ah-l+ow’: l1 l2 l3
[Diagram: recognizer pipeline with Senone Lexicon, Phoneme Lexicon, Word Lexicon and Statistical Language Model.]
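The left-centre+right notation used above (‘hh-ah+l’) can be derived mechanically from a phoneme sequence. In this sketch the function name `to_triphones` and the use of `sil` as the context at the word boundaries are assumptions.

```python
def to_triphones(phones, boundary="sil"):
    """Expand a phoneme sequence into left-centre+right triphone labels."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]

triphones = to_triphones(["hh", "ah", "l", "ow"])
# triphones == ['sil-hh+ah', 'hh-ah+l', 'ah-l+ow', 'l-ow+sil']
```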
18. Technical challenge
[Diagram: acoustic feature vectors enter the Recognizer (senone calculation, then Viterbi decoding), which outputs the transcription ‘Hello world’. Knowledge sources feeding the recognizer: Acoustic Models, Triphone Lexicon, Word Lexicon, Statistical Language Model.]
20. Technical challenge
Word
• Large vocabulary: 64000
• Represent a sequence of phonemes/triphones
‘hello’: hh ah l ow
‘world’: w er l d
[Diagram: recognizer pipeline with Senone Lexicon, Phoneme Lexicon, Word Lexicon and Statistical Language Model.]
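A word lexicon of this kind is essentially a lookup table from words to phoneme sequences. The sketch below uses the two entries from the slide and expands a word into per-phoneme state names, assuming three states per phoneme as in the ‘ah’: ah1 ah2 ah3 example; `LEXICON` and `word_to_states` are illustrative names.

```python
# The two pronunciations shown on the slide; a real lexicon would hold
# tens of thousands of entries.
LEXICON = {
    "hello": ["hh", "ah", "l", "ow"],
    "world": ["w", "er", "l", "d"],
}

def word_to_states(word, states_per_phone=3):
    """Expand a word into the state sequence of its phonemes."""
    return [
        f"{phone}{i}"
        for phone in LEXICON[word]
        for i in range(1, states_per_phone + 1)
    ]

hello_states = word_to_states("hello")
# 4 phonemes x 3 states = 12 states, starting with 'hh1', 'hh2', 'hh3'
```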
23. Technical challenge
Statistical language model
• Bi-gram / Tri-gram
• Give the probability of a sequence of 2/3 words
• 64000 words leads to roughly 10 million states / 50 million arcs
[Diagram: arcs from ‘hello’ to ‘mum’ (0.3), ‘dad’ (0.2) and ‘world’ (0.05), shown over the recognizer pipeline.]
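The bigram probabilities in the diagram can be stored and queried as follows. This is a minimal sketch: the table holds only the three arcs shown on the slide, and the `floor` value for unseen word pairs is an assumption standing in for proper smoothing or back-off.

```python
import math

# P(next word | previous word), taken from the slide's diagram.
BIGRAMS = {
    ("hello", "mum"): 0.3,
    ("hello", "dad"): 0.2,
    ("hello", "world"): 0.05,
}

def bigram_logprob(words, bigrams, floor=1e-6):
    """Sum of log P(w[i+1] | w[i]); unseen pairs get an assumed floor."""
    return sum(
        math.log(bigrams.get(pair, floor))
        for pair in zip(words, words[1:])
    )

hello_world = bigram_logprob(["hello", "world"], BIGRAMS)

# Most likely word after 'hello' according to the table.
best_next = max(
    (w for (prev, w) in BIGRAMS if prev == "hello"),
    key=lambda w: BIGRAMS[("hello", w)],
)
# best_next == 'mum'
```

With 64000 words there are 64000 × 64000 ≈ 4.1 billion possible word pairs, far more than the roughly 50 million arcs cited above.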
25. Technical challenge
[Diagram: acoustic feature vectors enter the Recognizer (senone calculation, then Viterbi decoding), which outputs the transcription ‘Hello world’. Knowledge sources feeding the recognizer: Acoustic Models, Triphone Lexicon, Word Lexicon, Statistical Language Model.]
~ 25 million states / 250 million arcs
27. Technical challenge
Viterbi decoding
• Token passing algorithm
• 5000/10000 tokens to propagate every 10 ms
• Select the most promising tokens and output the associated sequence of:
– senones
– triphones
– words
– sentence
[Diagram: network of senone states (l1 l2 l3, ow1 ow2 ow3, s1 s2 s3, ey1 ey2 ey3, d1 d2 d3, v1 v2 v3) over the recognizer pipeline.]
~ 25 million states / 250 million arcs
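The token passing idea can be sketched on a toy three-state network. This is an illustrative implementation under stated assumptions, not the engine described in the slides: the function name, the `beam` and `max_tokens` parameters, the `<s>` start state and the toy emission model are all hypothetical.

```python
import math

def token_passing(frames, transitions, emit_logprob, start,
                  beam=10.0, max_tokens=5000):
    """Viterbi search by token passing; returns the best state sequence.

    frames: one observation per time step (10 ms in the slides).
    transitions: dict state -> {next_state: probability}.
    emit_logprob: function (state, frame) -> log-likelihood.
    """
    tokens = {start: (0.0, [])}  # state -> (log score, state history)
    for frame in frames:
        new_tokens = {}
        for state, (score, history) in tokens.items():
            for nxt, prob in transitions.get(state, {}).items():
                cand = score + math.log(prob) + emit_logprob(nxt, frame)
                # Recombination: keep only the best token in each state.
                if nxt not in new_tokens or cand > new_tokens[nxt][0]:
                    new_tokens[nxt] = (cand, history + [nxt])
        if not new_tokens:
            return []
        # Pruning: keep tokens within `beam` of the best, at most max_tokens.
        best = max(score for score, _ in new_tokens.values())
        kept = sorted(
            ((s, t) for s, t in new_tokens.items() if t[0] >= best - beam),
            key=lambda item: item[1][0],
            reverse=True,
        )[:max_tokens]
        tokens = dict(kept)
    return max(tokens.values(), key=lambda t: t[0])[1]

# Toy network: the three states of 'l', entered from a start state.
transitions = {
    "<s>": {"l1": 1.0},
    "l1": {"l1": 0.5, "l2": 0.5},
    "l2": {"l2": 0.5, "l3": 0.5},
    "l3": {"l3": 1.0},
}

def emit_logprob(state, frame):
    # Assumed toy emission model: each frame is labelled with its likeliest state.
    return math.log(0.9 if state == frame else 0.1)

path = token_passing(["l1", "l1", "l2", "l3"], transitions, emit_logprob, "<s>")
# path == ['l1', 'l1', 'l2', 'l3']
```

At one step every 10 ms (100 steps per second), propagating 5000 to 10000 tokens per step amounts to 500,000 to 1,000,000 token propagations per second.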
29. Challenges in embedded systems
• Low computational resources
• Power consumption constraints
• Noisy environment, poor audio quality
For a truly embedded speech recognition
engine that works, we must move away from
the pure software approach:
• Make the best of all hardware acceleration available
• Dedicated chip (accelerator) to offload the CPU and
relax memory constraints
30. Why do we want speech
recognition on embedded
devices anyway?
31. Applications on mobiles
• Complement touch screen interface with
speech interface
• Speech-enable existing mobile applications
– Browse complex menus
– Easily find items in large libraries,
local or online (contacts, music…)
– Browse Web and search maps
– Games
– Compose text-messages,
emails…
32. Applications on mobiles
• Speech-enable mobile applications
Rubicon, "The Apple iPhone: Successes and Challenges for the Mobile Industry", 31 March 2008
33. Applications on mobiles
• Key to safety when driving
– Text-messaging
– Satellite-Navigation function
• Voice Memo
– Shopping list
– Activity scheduler
• Market of Speech technology in embedded
devices
– $125 million in 2006
– $500 million in 2010
Opus Research report, March 2007
34. Other markets
• Developing countries
– Access to information technology for illiterate people
• Administrative tasks
• Education
• Social integration
• Health-care at home
(self-manage diseases)
– Exploding market
• Chronic diseases
• Elderly people (Baby Boomers reach retirement age)
• Market for home health care products is valued at $4.3 billion today
– Place for Speech recognition
• Inexperience of patients with electronic interfaces
• Poor physical condition (e.g. low vision)
• Illiteracy
Medical Device Today, March 2009
36. Okay, I can’t wait!
Is there anything I can use now?
37. Upcoming solutions
• Voicemail accessible via text-message,
email or dedicated application
– Server-based
– Require agreement and implementation by the
carriers
38. Upcoming solutions
• Nuance Voice Control 2
– Online search
– Text-messaging
• Embedded software for
simple voice command
• Server-based engine for large
vocabulary speech recognition
• Speech Recognition API
on Android 1.5
40. Conclusion
• A truly embedded speech recognition system
– A range of exciting applications
• Real-time dictation with no perceived delay
• Natural language interface (ASR + TTS)
• Applications independent of the carrier
– But… not available yet!
• New speech recognition APIs are arriving soon
– Rely on network/server availability
– Can still lead to innovative applications
41. Conclusion
• Keys to success
– Robustness, accuracy
– Fast to load and execute
– Well designed interface
• Speech cannot be used on its own
• Should be cleverly combined with other interfaces
– Graphical
– Touch
– …
– Don’t put customers off with clumsy speech recognition
widgets again!