Most AI assistants on mobile phones use a conversational user interface (CUI) that mimics a chat app and translates user requests into API calls to backend services. I will present Conversational GUI (CGUI), which provides a thin layer of conversational interaction on top of the existing GUI of mobile apps by translating user requests into sequences of GUI actions, such as clicks and swipes, that users would otherwise have to perform themselves. CGUI avoids rebuilding existing user experiences in a chat window. More importantly, it makes it possible for end users, rather than software engineers, to create new skills by providing pairs of a natural language expression and a demonstration of the corresponding GUI actions.
2. A Tale of Two Uber Rides
uber ride to crowne plaza sfo
3. Naturali
A Beijing-based startup company
Upgrade apps with a speech interface
Naturali Sesami
✦ Translate speech inputs to action sequences
in apps and execute them on users’ behalf.
✦ Chinese version launched on LeTV phones
as a system app on April 12, 2017
✦ Available as a third-party app on all Android
phones since Aug. 2017
4. Advantages of Speech
Speed:
✦ voice input is three times as fast as typing
Hands-free:
✦ send messages, play music, order food
✦ turn on hotspot: 5 clicks
Mind-free:
✦ where is my luggage?
8. Chat + API: the downsides
Chat assistants displace apps, but
Chat is not the best mode of
interaction for everything.
editing
browsing
viewing
Nonetheless, there is plenty of
need for voice interaction.
who has access to this?
10. Chat + API: the downsides
Re-invention of user
experience inside the
chat window:
✦ usually not as good as
specialized apps,
✦ requires a great deal of
repeated development
effort
12. Chat + API: the downsides
Economic interests of the assistant and the backend
services may not be aligned.
13. Naturali Sesami
A thin, transparent translation
layer over apps.
✦ voice ➜ front end UI actions
Seamless integration of speech
and graphics
✦ Existing GUI interactions are still
available
✦ Making voice interaction available
on any app page
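The idea of a skill as a pair of a natural language expression and a demonstrated sequence of GUI actions can be sketched roughly as follows. This is only an illustration under stated assumptions: the `GuiAction` and `Skill` classes, the widget identifiers such as `uber/where_to`, and the regex-based matching are all hypothetical and are not taken from Sesami itself.

```python
import re
from dataclasses import dataclass


@dataclass
class GuiAction:
    kind: str       # "click", "type", or "swipe"
    target: str     # hypothetical widget identifier
    text: str = ""  # text to enter; may contain {slot} placeholders


@dataclass
class Skill:
    pattern: str    # utterance template with named groups (assumed format)
    actions: list   # GUI actions recorded from a demonstration

    def plan(self, utterance):
        """Return the concrete GUI actions for an utterance, or None if it does not match."""
        m = re.fullmatch(self.pattern, utterance.strip(), flags=re.IGNORECASE)
        if m is None:
            return None
        slots = m.groupdict()
        # Fill the placeholders in the demonstrated actions with the spoken values.
        return [GuiAction(a.kind, a.target, a.text.format(**slots)) for a in self.actions]


# A hypothetical skill: one expression plus one demonstration.
ride_skill = Skill(
    pattern=r"uber ride to (?P<dest>.+)",
    actions=[
        GuiAction("click", "launcher/uber"),
        GuiAction("click", "uber/where_to"),
        GuiAction("type", "uber/destination_field", "{dest}"),
        GuiAction("click", "uber/first_suggestion"),
    ],
)

plan = ride_skill.plan("uber ride to crowne plaza sfo")
for action in plan:
    print(action.kind, action.target, action.text)
```

Because the plan is just a list of ordinary GUI actions, the app's own screens remain visible and usable while the actions are replayed.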
14. Use Yelp to find Greek food near Santa Clara Convention
Center
15. Voice to Actions in Three Steps
Speech Recognition: sound → text
✦ data
Semantic Interpretation: text → intent
✦ knowledge
Plan Generation: intent → actions
✦ grounding
17. Naturali Speech
End-to-end DNN: CNN+LSTM+Attention+CTC
✦ built from scratch with TensorFlow
✦ trained with thousands of hours of transcribed speech
Personalized and contextualized language model:
✦ contact names
✦ app specific vocabulary
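One simple way to picture the effect of a personalized, contextualized language model is n-best rescoring: recognizer hypotheses that contain a contact name or an app-specific term receive a score bonus. The sketch below is an assumption for illustration only; the slide does not say how Naturali Speech integrates context, and the `rescore` function, the bonus value, and the example scores are hypothetical.

```python
def rescore(nbest, context_phrases, bonus=1.0):
    """Pick the best hypothesis after boosting those that mention context phrases.

    nbest: list of (text, log_score) pairs from a recognizer (assumed format).
    context_phrases: contact names or app vocabulary for the current user and app.
    """
    def biased_score(item):
        text, score = item
        hits = sum(1 for phrase in context_phrases if phrase.lower() in text.lower())
        return score + bonus * hits

    return max(nbest, key=biased_score)[0]


# Made-up hypotheses and scores, for illustration only.
nbest = [("call alex chain", -4.1), ("call Alex Chen", -4.6)]
best = rescore(nbest, context_phrases=["Alex Chen"])
print(best)
```

Without the contact list the acoustically better "call alex chain" would win; with it, the hypothesis that matches a known contact is preferred.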
18. Semantic Interpretation: text → intent
An intent identifies a task and the necessary
information (parameters) for the task
Example:
✦ task: FlightSearch
✦ parameters: (to, from, date, airline, class)
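An intent of this shape can be represented as a task name plus a dictionary of parameters. The following is a minimal sketch, assuming a single hand-written pattern for flight searches; the `Intent` class, the `interpret` function, and the pattern itself are hypothetical and only illustrate the text → intent mapping.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Intent:
    task: str
    parameters: dict = field(default_factory=dict)


# Hypothetical pattern covering phrases like "flights from X to Y on DATE".
FLIGHT_PATTERN = re.compile(
    r"flights? from (?P<origin>.+?) to (?P<dest>.+?)(?: on (?P<date>.+))?",
    re.IGNORECASE,
)


def interpret(text):
    """Map a sentence to a FlightSearch intent, or None if it does not match."""
    m = FLIGHT_PATTERN.fullmatch(text.strip())
    if m is None:
        return None
    found = {"from": m.group("origin"), "to": m.group("dest"), "date": m.group("date")}
    # Keep only the parameters the user actually mentioned.
    return Intent("FlightSearch", {k: v for k, v in found.items() if v is not None})


intent = interpret("flights from San Francisco to Beijing on November 4")
print(intent)
```

Parameters that are not mentioned (here airline and class) are simply absent from the dictionary.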
19. Entities and Types
Persons: singers/directors/contacts
Locations: cities/POIs/addresses
Apps and Games
Media: songs/shows/movies/books
Time and Date
Food
Sports teams
……
20. Recognizing Thousands of Types
Manually labeling training examples for thousands of
types is not an option.
An alternative is to use naturally annotated data:
✦ Hearst patterns: NP_type such as NP_inst
✦ Other examples: navigate to NP_loc
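The two patterns above can be approximated with regular expressions to harvest (instance, type) pairs from raw text. This is a rough sketch, assuming plain regex matching instead of real noun-phrase chunking; the `harvest` function and the example sentence are made up for illustration.

```python
import re

# "NP_type such as NP_inst": the word before "such as" is taken as the type.
HEARST = re.compile(r"(?P<type>\w+) such as (?P<insts>[^.;]+)")
# "navigate to NP_loc": whatever follows is taken as a location.
NAVIGATE = re.compile(r"navigate to (?P<loc>[^.;,]+)", re.IGNORECASE)
# Separators between listed instances: commas, "and", "or".
SEPARATORS = re.compile(r",\s*(?:and\s+|or\s+)?|\s+and\s+|\s+or\s+")


def harvest(text):
    """Return (instance, type) pairs found by the two patterns."""
    pairs = []
    for m in HEARST.finditer(text):
        for inst in SEPARATORS.split(m.group("insts")):
            inst = inst.strip()
            if inst:
                pairs.append((inst, m.group("type")))
    for m in NAVIGATE.finditer(text):
        pairs.append((m.group("loc").strip(), "location"))
    return pairs


text = "She likes singers such as Adele, Taylor Swift and Bruno Mars. Navigate to Crowne Plaza SFO."
pairs = harvest(text)
print(pairs)
```

A production system would need proper noun-phrase detection and noise filtering; the point here is only that the annotation comes for free from how people naturally write.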
21. Multi-round Conversation
Complex intents may not be articulated in one shot
✦ FlightSearch(to, from, date, airline, class)
A multi-round conversation incrementally collects
information from the user and guides the user through the
process.
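The incremental collection of parameters can be sketched as a slot-filling loop: ask for the first missing parameter, record the reply, and repeat until the intent is complete. The question wording and the `next_question` and `converse` helpers below are hypothetical, and user replies are passed in as a list so that the example runs without interactive input.

```python
SLOTS = ("to", "from", "date", "airline", "class")

# Hypothetical prompts, one per FlightSearch parameter.
QUESTIONS = {
    "to": "Where are you flying to?",
    "from": "Where are you flying from?",
    "date": "On which date?",
    "airline": "Which airline do you prefer?",
    "class": "Which cabin class?",
}


def next_question(params):
    """Return (slot, question) for the first missing slot, or None if complete."""
    for slot in SLOTS:
        if slot not in params:
            return slot, QUESTIONS[slot]
    return None


def converse(params, replies):
    """Fill the missing slots from successive user replies."""
    params = dict(params)
    asked = []
    replies = iter(replies)
    step = next_question(params)
    while step is not None:
        slot, question = step
        asked.append(question)
        params[slot] = next(replies)  # raises StopIteration if replies run out
        step = next_question(params)
    return params, asked


# The first utterance only supplied the destination.
params, asked = converse({"to": "Beijing"}, ["San Francisco", "2017-11-04", "United", "economy"])
print(asked)
print(params)
```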
23. Composite Intents
Messenger chat with Alex and say let’s meet on saturday
✦ OpenMessenger
✦ ChatWithPerson
✦ SendMessage
get a uber black ride to SFO
✦ UberRide
✦ SetDest
✦ SelectUberBlack
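A composite intent can be viewed as an ordered list of simpler intents, each carrying its own parameters. The sketch below hard-codes two patterns that mirror the two examples above; the `decompose` function, the regexes, and the parameter names `person`, `text`, and `dest` are assumptions for illustration.

```python
import re


def decompose(utterance):
    """Split a composite request into an ordered list of (task, parameters)."""
    m = re.fullmatch(
        r"messenger chat with (?P<person>\w+) and say (?P<text>.+)",
        utterance.strip(),
        flags=re.IGNORECASE,
    )
    if m:
        return [
            ("OpenMessenger", {}),
            ("ChatWithPerson", {"person": m.group("person")}),
            ("SendMessage", {"text": m.group("text")}),
        ]
    m = re.fullmatch(
        r"get an? uber black ride to (?P<dest>.+)",
        utterance.strip(),
        flags=re.IGNORECASE,
    )
    if m:
        return [
            ("UberRide", {}),
            ("SetDest", {"dest": m.group("dest")}),
            ("SelectUberBlack", {}),
        ]
    return []


steps = decompose("Messenger chat with Alex and say let's meet on saturday")
for task, parameters in steps:
    print(task, parameters)
```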
25. Plan Generation: intent → actions
Grounding: establishes the connection between the
inside (the assistant) and the outside (apps and devices).
Example:
✦ intent:
{"task": "FlightStatus", "number": "UA888", "date": "2017-11-04"}
✦ action:
select * from flight_db where airline = 'United Airlines' and flight_num = '888'
and year = 2017 and month = 11 and day = 4
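The grounding step in this example can be sketched as a function that turns the intent into a parameterized query. The table name `flight_db` and the columns `airline`, `flight_num`, `year`, `month`, and `day` come from the slide; the `status` column, the `AIRLINE_NAMES` lookup, the two-letter airline-code assumption, and the in-memory SQLite table are added here purely so that the example runs.

```python
import sqlite3
from datetime import date

# Hypothetical mapping from airline code to the name stored in the database.
AIRLINE_NAMES = {"UA": "United Airlines"}


def ground(intent):
    """Translate a FlightStatus intent into a SQL query and its parameters."""
    # Assumes a two-letter airline code followed by the flight number.
    code, flight_num = intent["number"][:2], intent["number"][2:]
    day = date.fromisoformat(intent["date"])
    sql = (
        "SELECT * FROM flight_db WHERE airline = ? AND flight_num = ? "
        "AND year = ? AND month = ? AND day = ?"
    )
    return sql, (AIRLINE_NAMES[code], flight_num, day.year, day.month, day.day)


# Toy database standing in for the outside world.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flight_db "
    "(airline TEXT, flight_num TEXT, year INT, month INT, day INT, status TEXT)"
)
conn.execute("INSERT INTO flight_db VALUES ('United Airlines', '888', 2017, 11, 4, 'on time')")

intent = {"task": "FlightStatus", "number": "UA888", "date": "2017-11-04"}
sql, params = ground(intent)
rows = conn.execute(sql, params).fetchall()
print(rows)
```

Using placeholders instead of string concatenation keeps user-supplied values out of the SQL text itself.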
30. Crowdsourced Skills
Skills are immediately usable by the creator.
✦ The user may share the skills with others, e.g., tech support
for parents
Vetted skills can be made available to the public
31. Summary
Voice interaction is inevitable
Naturali Sesami translates user requests into sequences
of actions in apps.
Sesami grows by crowdsourcing skills.
Join us!
✦ jobs@naturali.ai