Dev Days, Speech Recognition, LM Aubert

Speech Recognition on
embedded devices

Louis-Marie Aubert
ECIT – Queen’s University Belfast

DevDays – Belfast – April 24, 2009

What should we expect from
speech recognition?

Speech Recognition success?
• Natural continuous speech
• Real-time
• Large vocabulary (up to 100,000 words)
• No training (speaker independent)
• Adaptive to speaker accent
• Robust against
– Background noise
– Audio frontend imperfections
• N-best hypotheses with confidence value

What are the solutions on the
market?

Existing solutions
• Server-based

– Telephony, IVR

– Dictation (Health care industry)

– Audio indexing

Either offline or with significant delays

Existing solutions
• Desktop-based

– Real-time dictation

– Language learning

Requires a good setup, powerful computer,
quiet environment
Very good accuracy, no training required

Existing solutions
• Embedded applications

– Simple voice commands
(‘Call-mum’ type command)

– Disconnected word recognition

Small vocabulary and lack
of naturalness restrict the
range of applications

Technical challenge

Speech waveform → Speech Recognizer → Transcription (‘Hello world’)

Technical challenge

Speech waveform → Spectral Analyser → Acoustic feature vectors (~40 coeff. every 10 ms)
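As a rough illustration of this front end, the sketch below cuts a waveform into 10 ms frames and turns one frame into about 40 coefficients. The 16 kHz sampling rate, the function names, and the use of equal-width frequency bands are assumptions made for the example; real front ends typically use overlapping windows and a perceptual frequency scale, which the slide does not detail.

```python
import cmath
import math

SAMPLE_RATE = 16000  # assumed; the slide does not state a sampling rate
FRAME_MS = 10        # from the slide: one feature vector per 10 ms
N_COEFF = 40         # from the slide: ~40 coefficients per vector


def split_frames(samples, sample_rate=SAMPLE_RATE, frame_ms=FRAME_MS):
    """Cut the waveform into consecutive, non-overlapping frames."""
    size = sample_rate * frame_ms // 1000
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


def band_log_energies(frame, n_coeff=N_COEFF):
    """Naive DFT, then log energy in n_coeff equal-width frequency bands."""
    n = len(frame)
    power = []
    for k in range(1, n // 2 + 1):
        bin_sum = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                      for t, x in enumerate(frame))
        power.append(abs(bin_sum) ** 2)
    per_band = len(power) // n_coeff
    return [math.log(sum(power[b * per_band:(b + 1) * per_band]) + 1e-10)
            for b in range(n_coeff)]


# Usage: one second of a 500 Hz test tone (hypothetical input).
tone = [math.sin(2 * math.pi * 500 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]
frames = split_frames(tone)
features = band_log_energies(frames[0])
```

With these assumed settings, one second of audio yields 100 frames of 160 samples, and each frame becomes one 40-element vector for the recognizer.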

Technical challenge

Acoustic feature vectors → Recognizer (senone calculation, Viterbi decoding) → Transcription (‘Hello world’)

Models used by the recognizer: Acoustic Models, Phoneme Lexicon, Word Lexicon, Statistical Language Model

Technical challenge

Acoustic Models
• 4000 acoustic models
• Sub-acoustic unit
• Functions that score 10 ms of speech
• Sets of 40-long mean and variance vectors of Gaussian mixtures (16)

Recognizer stage: multi-dim. Gaussian mixt. calculation
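To make "functions that score 10 ms of speech" concrete, here is a minimal sketch of scoring one feature vector against one Gaussian mixture. It assumes diagonal covariances (suggested by the 40-long variance vectors) and that "(16)" is the number of mixture components; the function names and the toy parameter values are placeholders, not trained models.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of x under a Gaussian with diagonal covariance."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def mixture_log_score(x, weights, means, variances):
    """Log-likelihood of one feature vector under a Gaussian mixture."""
    logs = [math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)  # log-sum-exp keeps the sum numerically stable
    return top + math.log(sum(math.exp(l - top) for l in logs))


# Toy model: 16 components over 40-dimensional vectors (placeholder values).
DIM, N_MIX = 40, 16
weights = [1.0 / N_MIX] * N_MIX
means = [[float(c)] * DIM for c in range(N_MIX)]
variances = [[1.0] * DIM for _ in range(N_MIX)]

near = mixture_log_score([3.0] * DIM, weights, means, variances)
far = mixture_log_score([30.0] * DIM, weights, means, variances)
```

A vector close to one of the component means (`near`) scores higher than one far from all of them (`far`). Repeating this for thousands of models every 10 ms is where much of the computational load comes from.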

Technical challenge

Phoneme
• 50 in English
• Distinguishable sounds
• Represents a sequence of senones: HMM (Hidden Markov Model)

‘ah’: ah1 ah2 ah3
‘l’: l1 l2 l3

Technical challenge

Triphone
• 2500 in English
• Distinguishable sounds in their context (continuous speech)

‘hh-ah+l’: ah1 ah2 ah3
‘ah-l+ow’: l1 l2 l3
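The left-centre+right notation used above can be generated mechanically from a phoneme sequence. The sketch below is one way to do it; the function name and the `sil` (silence) symbol used as context at the word edges are assumptions, since the slide only shows word-internal triphones.

```python
def to_triphones(phonemes, boundary="sil"):
    """Rewrite each phoneme as left-centre+right using its neighbours."""
    padded = [boundary] + list(phonemes) + [boundary]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]


# Usage: the phonemes of 'hello'
hello_triphones = to_triphones(["hh", "ah", "l", "ow"])
```

For ‘hello’ this produces the two triphones shown on the slide (‘hh-ah+l’ and ‘ah-l+ow’), plus two edge units whose outer context is the assumed boundary symbol.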

Technical challenge

Acoustic feature vectors → Recognizer (senone calculation) → Transcription

Models used by the recognizer: Acoustic Models, Triphone Lexicon, Word Lexicon, Statistical Language Model

Technical challenge

Word
• Large vocabulary: 64000
• Represents a sequence of phonemes/triphones

‘hello’: hh ah l ow
‘world’: w er l d
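A word lexicon of this kind can be pictured as a lookup table from words to phoneme sequences, with each phoneme then unfolding into its HMM states. The sketch below uses the two entries from the slide; the names `LEXICON` and `word_to_states` are hypothetical, and three states per phoneme follows the ‘ah’: ah1 ah2 ah3 example.

```python
# Two-entry lexicon taken from the slide; a real one holds ~64000 words.
LEXICON = {
    "hello": ["hh", "ah", "l", "ow"],
    "world": ["w", "er", "l", "d"],
}
STATES_PER_PHONEME = 3  # as in the 'ah': ah1 ah2 ah3 example


def word_to_states(word):
    """Expand a word into the HMM state sequence of its phonemes."""
    return [f"{phoneme}{i}"
            for phoneme in LEXICON[word]
            for i in range(1, STATES_PER_PHONEME + 1)]


hello_states = word_to_states("hello")
```

Here ‘hello’ expands to 12 states, starting with hh1 hh2 hh3. In a context-dependent system the same expansion would go through triphones rather than plain phonemes.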

Technical challenge

Statistical language model
• Bi-gram / Tri-gram
• Gives the probability of a sequence of 2/3 words
• 64000 words leads to roughly 10 million states / 50 million arcs

Example, probability of the word following ‘hello’:
– mum: 0.3
– dad: 0.2
– world: 0.05
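The bi-gram idea can be sketched with the three probabilities from the example. The table below holds only those values; the floor used for unseen word pairs and the function names are assumptions for illustration, not part of the original model.

```python
import math

# Bi-gram probabilities P(next | previous) taken from the slide example.
BIGRAMS = {
    ("hello", "mum"): 0.3,
    ("hello", "dad"): 0.2,
    ("hello", "world"): 0.05,
}
UNSEEN = 1e-6  # assumed floor for word pairs missing from the table


def sequence_log_prob(words):
    """Sum of log bi-gram probabilities over consecutive word pairs."""
    return sum(math.log(BIGRAMS.get(pair, UNSEEN))
               for pair in zip(words, words[1:]))


def rank_next(previous):
    """Candidate next words after `previous`, most probable first."""
    candidates = [(nxt, p) for (prev, nxt), p in BIGRAMS.items() if prev == previous]
    return sorted(candidates, key=lambda c: c[1], reverse=True)


ranking = rank_next("hello")
```

During decoding these scores are combined with the acoustic scores, so a likely word pair can win over an acoustically similar but improbable one.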

Technical challenge

Acoustic feature vectors → Recognizer (senone calculation) → Transcription

Models used by the recognizer: Acoustic Models, Triphone Lexicon, Word Lexicon, Statistical Language Model

~ 25 million states / 250 million arcs

Technical challenge

Viterbi decoding
• Token passing algorithm
• 5000/10000 tokens to propagate every 10 ms
• Select the most promising tokens and output the associated sequence of:
– senones
– triphones
– words
– sentence

[Diagram: network of HMM states (l1 l2 l3, ow1 ow2 ow3, s1 s2 s3, ey1 ey2 ey3, d1 d2 d3, v1 v2 v3)]

~ 25 million states / 250 million arcs
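The following is a minimal sketch of token passing on a toy network, to show the propagate-then-prune loop described above. The network, the transition and emission values, and all names are made up for the example; a real decoder works on the full network of millions of states, with acoustic scores coming from the Gaussian mixtures.

```python
import math


def token_passing(frames, arcs, start, finals, emit_logp, max_tokens=5000):
    """Viterbi search by token passing, keeping the top max_tokens per frame."""
    tokens = {start: (0.0, [])}  # state -> (log score, state history)
    for frame in frames:
        new_tokens = {}
        for state, (score, path) in tokens.items():
            for nxt, log_p in arcs.get(state, []):
                cand = score + log_p + emit_logp(nxt, frame)
                # Keep only the best token arriving in each state.
                if nxt not in new_tokens or cand > new_tokens[nxt][0]:
                    new_tokens[nxt] = (cand, path + [nxt])
        ranked = sorted(new_tokens.items(), key=lambda kv: kv[1][0], reverse=True)
        tokens = dict(ranked[:max_tokens])  # prune to the most promising tokens
    ending = [tok for state, tok in tokens.items() if state in finals]
    return max(ending, key=lambda tok: tok[0]) if ending else None


# Toy left-to-right network for 'l' followed by 'ow', three states each.
chain = ["l1", "l2", "l3", "ow1", "ow2", "ow3"]
LOG_HALF = math.log(0.5)
arcs = {"start": [("l1", 0.0)]}
for i, state in enumerate(chain):
    arcs[state] = [(state, LOG_HALF)]  # self-loop
    if i + 1 < len(chain):
        arcs[state].append((chain[i + 1], LOG_HALF))  # move forward


def emit_logp(state, frame):
    """Toy acoustic score: frames are phoneme labels, not feature vectors."""
    phoneme = state.rstrip("0123456789")
    return math.log(0.9 if phoneme == frame else 0.1)


result = token_passing(["l"] * 3 + ["ow"] * 3, arcs, "start", {"ow3"}, emit_logp)
```

`result` holds the best log score and the state sequence that produced it. The `max_tokens` limit plays the role of the 5000/10000 tokens per 10 ms mentioned on the slide: a smaller value saves work but risks pruning the correct path.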

Challenges in embedded systems
• Low computational resources
• Power consumption constraints
• Noisy environment, poor audio quality

For a truly embedded speech recognition
engine that works, we must move away from
the pure software approach:
• Make the best of all hardware acceleration available
• Dedicated chip (accelerator) to unload CPU and
relax memory constraints

Why do we want speech
recognition on embedded
devices anyway?

Applications on mobiles
• Complement touch screen interface with
speech interface
• Speech-enable existing mobile applications
– Browse complex menus
– Easily find items in large libraries,
local or online (contacts, music…)
– Browse Web and search maps
– Games
– Compose text-messages,
emails…

• Speech-enable mobile applications

Rubicon, "The Apple iPhone: Successes and Challenges for the Mobile Industry", 31 March 2008

• Key to safety when driving
– Text-messaging
– Satellite-Navigation function

• Voice Memo
– Shopping list
– Activity scheduler

• Market of Speech technology in embedded
devices
– $125 million in 2006
– $500 million in 2010
Opus Research report, March 2007

Other markets
• Developing countries
– Access to information technology for illiterate people
• Administrative tasks
• Education
• Social integration

• Health-care at home
(self-manage diseases)
– Exploding market
• Chronic diseases
• Elderly people (Baby Boomers reach retirement age)
• Market for home health care products is valued at $4.3 billion today
– Place for Speech recognition
• Inexperience of patients with electronic interfaces
• Poor physical condition (e.g. low vision)
• Illiteracy
Medical Device Today, March 2009

Other applications
• Speech translation
– IraqCom

Okay, I can’t wait!
Is there anything I can use now?

Upcoming solutions
• Voicemail accessible via text-message,
email or dedicated application

– Server-based
– Require agreement and implementation by the
carriers

Upcoming solutions
• Nuance Voice Control 2
– Online search
– Text-messaging

• Embedded software for
simple voice command
• Server-based engine for large
vocabulary speech recognition

• Speech Recognition API
on Android 1.5

Conclusion
• A truly embedded speech recognition system
– A range of exciting applications
• Real-time dictation with no perceived delay
• Natural language interface (ASR + TTS)
• Applications independent of the carrier
– But… not available yet!

• New speech recognition APIs are arriving soon
– Rely on network/server availability
– Can still lead to innovative applications

Conclusion
• Keys to success
– Robustness, accuracy
– Fast to load and execute
– Well designed interface
• Speech cannot be used on its own
• Should be cleverly combined with other interfaces
– Graphical
– Touch
– …

– Don’t put customers off by clumsy speech recognition
widgets, again!
