This document summarizes a presentation on practical applications of speech technology. It discusses speech recognition, text-to-speech, biometrics, and data analytics. For speech recognition, call centers have excellent standardized systems while dictation and personalized answers are more expensive. Text-to-speech requires understanding language and translating terms. Speaker identification and characterization are practical using biometrics but verification is still rare. Data analytics through mining is useful but not real-time. The document also lists sponsors of the conference.
14. Lessons
Exercise 1: Everyone has the same & simple answers
Call centers; standard device commands
Speaker Independent
Exercise 2: Highly Personal Answers
Dictation; voice search
Speaker Dependent or Speaker Independent
21. Engines: Speech Recognition (ASR)
Summary:
You can do almost anything, but the more you do, the more you pay.
22. Telephony ASR is excellent:
Inexpensive: “What city?” / “Amsterdam”
Very expensive: “What is wrong with your phone?” / “I dropped it on the floor, and the screen is cracked, and now I can’t see anything.”
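The cost gap on this slide can be illustrated with a minimal sketch, assuming the recognizer has already produced a text hypothesis. For a prompt like “What city?” the system only has to pick from a small closed list, so even a noisy hypothesis can be snapped to a valid answer. The city list, the `match_city` helper, and the 0.6 cutoff are all hypothetical; a real telephony engine would apply its grammar during recognition rather than afterwards.

```python
import difflib

# Hypothetical closed list of answers the "What city?" prompt accepts.
CITIES = ["Amsterdam", "Atlanta", "Boston", "Rotterdam"]

def match_city(hypothesis, cities=CITIES, cutoff=0.6):
    """Return the closest city in the closed list, or None if nothing is close enough."""
    by_lowercase = {city.lower(): city for city in cities}
    hits = difflib.get_close_matches(
        hypothesis.strip().lower(), list(by_lowercase), n=1, cutoff=cutoff
    )
    return by_lowercase[hits[0]] if hits else None

print(match_city("amstrdam"))                   # a slightly garbled hypothesis
print(match_city("I dropped it on the floor"))  # an open-ended answer has no match
```

An open-ended answer such as the cracked-screen description falls outside any short list, which is one way to see why that kind of recognition sits at the expensive end.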
23. Cautions
No such thing as “speech to text”
Speaker dependent comes closest
Voicemail to text: human assisted
Some telephone ASR is also human assisted
24. Speaker Dependent
Desktop computers can do excellent transcription, but still need corrections
Hand-held devices have more memory & power → better ASR
25. Engines: Text-to-speech (TTS)
Summary:
Available in many languages, reasonable quality, sometimes difficult to understand.
31. TTS requires language understanding and specific jargon translation:
“Mr.” → “Mister”
“bbl” → “Be Back Later”
“287 m” → “about 300 meters”
Custom voices available
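The three translations above can be sketched as a small text-normalization pass that runs before synthesis. This is only an illustration under stated assumptions: the `ABBREVIATIONS` table, the `normalize` and `round_distance` helpers, and the round-to-the-nearest-hundred rule are hypothetical, and a production TTS front end would need a much larger, domain-specific table.

```python
import re

# Hypothetical expansion table taken from the slide's examples.
ABBREVIATIONS = {"Mr.": "Mister", "bbl": "Be Back Later"}

def round_distance(match):
    """Turn an exact distance such as '287 m' into a rounded spoken form."""
    meters = int(match.group(1))
    if meters < 100:
        return f"{meters} meters"          # short distances are spoken exactly
    return f"about {round(meters, -2)} meters"  # otherwise round to the nearest hundred

def normalize(text):
    """Expand known abbreviations, then rewrite distances given in meters."""
    for short, spoken in ABBREVIATIONS.items():
        # Match the abbreviation only when it is not glued to other word characters.
        pattern = r"(?<!\w)" + re.escape(short) + r"(?!\w)"
        text = re.sub(pattern, spoken, text)
    return re.sub(r"\b(\d+)\s?m\b", round_distance, text)

print(normalize("Mr. Smith says bbl, the exit is in 287 m"))
```

The table is where the “language understanding” lives: the same letters can need a different expansion in another domain, so the mapping has to be chosen per application.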
32. Engines: Biometrics (Speaker Identification, Speaker Verification, Speaker Characterization)
Summary:
Speaker verification practical but still rare; speaker identification & characterization practical and secret
33. Speaker Verification (is that really you?)
Available, practical
Rare in the US, more prevalent in Australia, Israel, and Canada
Roadblocks: valid fear; fear of biometrics; love of fingerprints; only part of complete solution
Other topics: APIs, IDE, Grammar building tools, VUI tools
1. Ask the person next to you a question as if you were an airline reservations system. Find out what city he wants to fly to.
2. Ask the person next to you for a Twitter update on the conference.
Google, for example, does voicemail transcription, but poorly.
Practical deployment configurations
The telco server is also hosted. The voice of the user (the “utterance”) must have a good, clean path to the recognition system.
Known text: address book, firmware
Complex: dictation, add-on
Not practical in the network: who is using the phone?
We have reviewed the hardware and the types of recognition. I will now review some more specific details about recognition.
Not magic. You still have to manage the data; enroll users; deal with users who are locked out; etc.
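The bookkeeping mentioned in that last note (enrolling users, handling lockouts) can be sketched roughly as follows. Everything here is an assumption for illustration: the `VoiceprintStore` class, the idea that each voice sample has already been reduced to a numeric vector, the cosine-similarity comparison, and the threshold and failure limits are hypothetical and not taken from any particular verification product.

```python
import math

MAX_FAILURES = 3   # hypothetical lockout policy
THRESHOLD = 0.8    # hypothetical similarity threshold

def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class VoiceprintStore:
    """Tracks enrolled voiceprints and consecutive failed attempts per user."""

    def __init__(self):
        self.prints = {}
        self.failures = {}

    def enroll(self, user, voiceprint):
        self.prints[user] = list(voiceprint)
        self.failures[user] = 0

    def verify(self, user, sample):
        if user not in self.prints:
            return "not enrolled"
        if self.failures[user] >= MAX_FAILURES:
            return "locked out"
        if cosine(self.prints[user], sample) >= THRESHOLD:
            self.failures[user] = 0
            return "accepted"
        self.failures[user] += 1
        return "rejected"

    def unlock(self, user):
        """Administrative reset for a locked-out user."""
        self.failures[user] = 0

store = VoiceprintStore()
store.enroll("ann", [1.0, 0.0])
print(store.verify("ann", [0.9, 0.1]))
```

Even in this toy form, most of the code is account management rather than voice matching, which is the point of the note.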