Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1nyoC1W.
Halle Winkler overviews the state of speech technology, examining the opportunities in usability and new forms of usage that become available with speech interfaces in mobile apps. Filmed at qconlondon.com.
Halle Winkler is the software developer and UX designer behind Berlin's Politepix, which produces OpenEars, the most popular offline speech recognition framework for iOS. She has worked in tech startups in the Silicons Alley, Valley and Allee since the days of Web .9 and has been preoccupied with human factors since day one.
Time Series Foundation Models - current state and future directions
An Unseen Interface
1.
2. Watch the video with slide
synchronization on InfoQ.com!
http://www.infoq.com/presentations
/speech-ui-mobile
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
3. Presented at QCon London
www.qconlondon.com
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
4. An Unseen Interface
Creating Speech-driven UI For Your App That Makes Users Happy
by Halle Winkler, @politepix http://www.politepix.com
:D
6. A speech-driven UI uses either
speech recognition as an input
method, speech synthesis as
an information source for the
user, or both together.
...but it can also be multi-modal.
7. How does speech
recognition work?
The elements of speech recognition are:
1. An acoustic model
2. A lexicon
3. A language model (probability) or grammar (ruleset for states)
4. A decoder
8. What kind of apps benefit
from speech UIs?
Large Vocabulary Tasks: server, built-in vocabulary
(UITextView, Android.speech, Nuance, AT&T, iSpeech)
Tasks in which free-form dictation is useful
Tasks which relate specifically to language
Command and control tasks: offline, you generally define
vocabulary (OpenEars or other CMU Sphinx or Julius
implementations, some Android.speech devices and OSes)
Interfaces where the user is looking somewhere else
Interfaces where speech provides a new input or output
Interfaces that are more fun with speech
Interfaces where it’s easier to speak than type
Interfaces where it’s easier to listen than read
Interfaces where a heavy obstacle is removed
9. Why offline?
The interface is always available to your user
Speed is as fast or faster as a network API – and it's quantifiable!
Interface design and implementation is simpler and more
predictable without an asynchronous network dependency
The user is not giving away any of their data
10. How is a speech UI
different from a
visual UI?
What are the dimensions on which a
visual UI is rendered?
What are the dimensions on which a speech UI is rendered?
A speech UI is rendered on the dimension of time.
People value their time exquisitely.
11. Do people understand each other
perfectly all the time?
Why not?
Accents
Lack of shared vocabulary/Dialect
Noise
Distractions
Interruptions
Hearing difficulties
Distance
Language errors
!
Human speech interactions have frequent comprehension faults
Emotional intelligence makes us incredibly fault-tolerant
12. Automated speech
recognition is subject to all
the same issues as human
speech recognition, but
without the emotional
intelligence
13. We have to stack
the deck in our
(users’) favor.
14. Short is good.
Don't bite off more than you can chew – small (read "fast") steps
forward means small (read "fast") steps backwards
!
Use keyword detection to launch events
!
Switch between small vocabularies that each relate to one domain
This results in accuracy, speed, and a large vocabulary!
15. Short is bad.
Phonemes are the smallest unit of speech
Words with few of them have a lot of rhymes
Contextless rhyming is our enemy
Medium-sized, crunchy granola words are our friends
16. My app, my rules
Some apps need to recognize words
or phrases in ways that can be
expressed by rules.
Or be flexible
Some apps need to do probability-based detection
There are probability-based language models for
expressing this such as ARPA models
17. Out of vocabulary
Your app also has to behave well when
people aren't speaking to it!
18. Mic distance and
vocabulary
The more distance, the less vocabulary
20. Case study 1: Recipe App
A natural implementation of offline speech recognition
21. What are our interface
considerations?
• What are we buying with our time? Hands-free operation, moving locus
• Hands-free doesn't mean eyes-free! We can provide visual info
• Operational distance is pretty far
• Instead of NLP, offline grammar
• Secret weapon: we know all the words in a recipe in advance
• Fault tolerance: one level of complexity, don't confirm; return!
• Challenges: noise, moving locus, reflection, competing speech
!
22. Case study 2: Marco Polo
A dialog management tag game: one user checks in
a single location and the other user receives
volume-based speech feedback about their
proximity to the target when they say “Marco”
23. UX Considerations
• What are we buying with our time: play!
• For a single word, language model is fast and sufficient
• Acoustic environment and OOV semi-important
• This is a single-mode interface – an actual dialog
manager
• Extra development time should be put into increasing
voice dynamic range
24. Case study 3:
TalkCheater
An app to whisper sweet presentation notes in your ear
25. UX Considerations
• What are we buying? Eye contact, moving locus,
enhanced human capabilities
• Is this a speech recognition app?
• Does this have a visual or a touch interface?
• The body is the interface
• Fault tolerance, always important but most
important in a high-value scenario
• Volume
• Speaking speed of synthesized speech
26. Talk to me @politepix
and the OpenEars
forums. I will tell you
all the things.
27.
28. Watch the video with slide synchronization on
InfoQ.com!
http://www.infoq.com/presentations/speech-ui-
mobile