5. Why Do We Need It?
Daily Life Usage
• Weather
• Schedule
• Transportation
• Restaurant Seeking
6. Why Do We Need It?
• Get things done
  - E.g. set up alarm/reminder, take note
• Easy access to structured data, services and apps
  - E.g. find docs/photos/restaurants
• Assist your daily schedule and routine
  - E.g. commute alerts to/from work
• Be more productive in managing your work and personal life
7. Why Natural Language?
• Global Digital Statistics (January 2015)
  - Global Population: 7.21B
  - Active Internet Users: 3.01B
  - Active Social Media Accounts: 2.08B
  - Active Unique Mobile Users: 3.65B
Device input is evolving toward speech, which is more natural and convenient.
8. Intelligent Assistant Architecture
• Reactive Assistance: ASR, LU, Dialog, LG, TTS
• Proactive Assistance: Inferences, User Modeling, Suggestions
• Data: Back-end Databases, Services and Client Signals
• Device/Service End-points (Phone, PC, Xbox, Web Browser, Messaging Apps)
• User Experience: "restaurant suggestions", "call taxi"
9. Spoken Dialogue System (SDS)
• Spoken dialogue systems are intelligent agents that help users finish tasks more efficiently via spoken interactions.
• Spoken dialogue systems are being incorporated into various devices (smartphones, smart TVs, in-car navigation systems, etc.).
• Examples: JARVIS, Iron Man's personal assistant; Baymax, a personal healthcare companion
Good dialogue systems help users access information conveniently and finish tasks efficiently.
10. APP → BOT
• A bot is responsible for a "single" domain, similar to an app
Goal: Schedule a lunch with Vivian
Apps involved: 愛食記, 地圖 (Maps), LINE, KKBOX
Seamless and automatic information transfer across domains
→ reduces duplicate information and interaction
12. System Framework
Speech Signal → Speech Recognition → Language Understanding (LU) → Dialogue Management (DM) → Output Generation
1. Speech Recognition
   Input: speech signal
   Hypothesis: are there any action movies to see this weekend
   (Text input enters here directly: "Are there any action movies to see this weekend?")
2. Language Understanding (LU)
   • Domain Identification
   • User Intent Detection
   • Slot Filling
   Semantic Frame: request_movie, genre=action, date=this weekend
3. Dialogue Management (DM)
   • Dialogue State Tracking
   • System Action/Policy Decision
   System Action/Policy: request_location
4. Output Generation
   Text response: Where are you located?
   Screen Display: location?
Current bottleneck → error propagation
13. Interaction Example
User: find a good eating place for taiwanese food
Intelligent Agent: Good Taiwanese eating places include Din Tai Fung, Boiling Point, etc. What do you want to choose? I can help you go there.
Q: How does a dialogue system process this request?
15. 1. Domain Identification
Requires Predefined Domain Ontology
User: find a good eating place for taiwanese food
Intelligent Agent: Organized Domain Knowledge (Database)
• Restaurant DB
• Taxi DB
• Movie DB
Classification!
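Domain identification can be treated as classifying the utterance into one of the predefined domains. The following is a minimal sketch, assuming a hypothetical keyword table (`DOMAIN_KEYWORDS`) and function name (`identify_domain`) that are not in the slides; a practical system would train a statistical classifier instead.

```python
# Hypothetical keyword lists per domain; a trained classifier would replace this.
DOMAIN_KEYWORDS = {
    "restaurant": {"eating", "food", "restaurant", "lunch", "dinner"},
    "taxi": {"taxi", "ride", "pickup", "cab"},
    "movie": {"movie", "movies", "film", "theater"},
}

def identify_domain(utterance):
    """Return the domain whose keyword set overlaps the utterance most."""
    tokens = set(utterance.lower().split())
    scores = {d: len(tokens & kws) for d, kws in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # No keyword matched at all: report that no domain was identified.
    return best if scores[best] > 0 else None

domain = identify_domain("find a good eating place for taiwanese food")
```

For the slide's utterance, "eating" and "food" match only the restaurant keywords, so the restaurant domain is selected.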
16. 2. Intent Detection
Requires Predefined Schema
User: find a good eating place for taiwanese food
Intelligent Agent: Restaurant DB
• FIND_RESTAURANT
• FIND_PRICE
• FIND_TYPE
• :
Classification!
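Intent detection is again a classification step, this time over the intents defined in the domain schema. One simple way to sketch it is nearest-example matching by token overlap. The labeled utterances in `EXAMPLES` and the function names are assumptions made for illustration only.

```python
# Hypothetical labeled utterances for the restaurant domain (illustration only).
EXAMPLES = [
    ("find a place to eat", "FIND_RESTAURANT"),
    ("find a good restaurant", "FIND_RESTAURANT"),
    ("how expensive is it", "FIND_PRICE"),
    ("what is the price range", "FIND_PRICE"),
    ("what kind of food do they serve", "FIND_TYPE"),
]

def jaccard(a, b):
    """Token-set overlap between two utterances, in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def detect_intent(utterance):
    """Return the intent label of the most similar labeled example."""
    return max(EXAMPLES, key=lambda ex: jaccard(utterance, ex[0]))[1]

intent = detect_intent("find a good eating place for taiwanese food")
```

The slide's utterance shares the most tokens with "find a good restaurant", so the sketch returns FIND_RESTAURANT.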
17. 3. Slot Filling
Requires Predefined Schema
User: find a good eating place for taiwanese food

Restaurant DB
Restaurant | Rating | Type
Rest 1     | good   | Taiwanese
Rest 2     | bad    | Thai
:          | :      | :

Sequence Labeling
find  a  good      eating  place  for  taiwanese  food
O     O  B-rating  O       O      O    B-type     O

Semantic Frame
FIND_RESTAURANT
rating="good"
type="taiwanese"

SELECT restaurant {
rest.rating="good"
rest.type="taiwanese"
}
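Once each word carries a B-/I-/O tag, the semantic frame can be read off by collecting the tagged spans. The sketch below shows that decoding step only; the tags themselves are taken from the slide, while the helper name `decode_bio` and the dictionary layout of `frame` are assumptions.

```python
def decode_bio(tokens, tags):
    """Collect B-/I- tagged token spans into a {slot: value} dictionary."""
    slots, current = {}, None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            current = tag[2:]          # a new slot span begins here
            slots[current] = token
        elif tag.startswith("I-") and current == tag[2:]:
            slots[current] += " " + token  # continue the open span
        else:
            current = None             # "O" (or a stray I- tag) closes the span
    return slots

tokens = "find a good eating place for taiwanese food".split()
tags = ["O", "O", "B-rating", "O", "O", "O", "B-type", "O"]
frame = {"intent": "FIND_RESTAURANT", "slots": decode_bio(tokens, tags)}
# frame["slots"] is {"rating": "good", "type": "taiwanese"}
```

The resulting slot dictionary is what a backend query such as the SELECT statement above would be built from.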
19. State Tracking
Requires Hand-Crafted States
User: find a good eating place for taiwanese food
User: i want it near to my office
Hand-crafted states, from no slots filled to all slots filled:
NULL
location | rating | type
loc, rating | rating, type | loc, type
all
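With hand-crafted states, tracking amounts to accumulating slot values across turns and mapping the set of filled slots to a state label. A minimal sketch follows; the slot names and state labels come from the slide, while the function names and the slot values extracted from each turn are assumptions.

```python
SLOTS = ("location", "rating", "type")  # slot names from the slide

def update_state(state, new_slots):
    """Merge newly observed slot values into the accumulated dialogue state."""
    merged = dict(state)
    merged.update(new_slots)
    return merged

def state_name(state):
    """Map the filled slots to the hand-crafted state label."""
    filled = [s for s in SLOTS if s in state]
    if not filled:
        return "NULL"
    return "all" if len(filled) == len(SLOTS) else ", ".join(filled)

state = {}
# Turn 1: "find a good eating place for taiwanese food"
state = update_state(state, {"rating": "good", "type": "taiwanese"})
after_turn_1 = state_name(state)   # "rating, type"
# Turn 2: "i want it near to my office"
state = update_state(state, {"location": "near my office"})
after_turn_2 = state_name(state)   # "all"
```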
21. State Tracking
Handling Errors and Confidence
User: find a good eating place for taixxxx food
Candidate semantic frames:
• FIND_RESTAURANT, rating="good", type="taiwanese"
• FIND_RESTAURANT, rating="good", type="thai"
• FIND_RESTAURANT, rating="good"
Candidate states:
NULL
location | rating | type
loc, rating | rating, type | loc, type
all
? (it is uncertain which state the dialogue is in)
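One common way to cope with recognition errors is to keep several hypotheses with confidence scores and accumulate a belief for each slot value, rather than committing to a single frame. The sketch below illustrates this under stated assumptions: the three frames are from the slide, but the confidence scores (0.5, 0.3, 0.2) and the function name are invented for illustration.

```python
from collections import defaultdict

# N-best semantic frames with hypothetical confidence scores.
nbest = [
    ({"rating": "good", "type": "taiwanese"}, 0.5),
    ({"rating": "good", "type": "thai"}, 0.3),
    ({"rating": "good"}, 0.2),
]

def slot_beliefs(hypotheses):
    """Sum hypothesis confidences into a belief for each (slot, value) pair."""
    beliefs = defaultdict(float)
    for slots, confidence in hypotheses:
        for slot, value in slots.items():
            beliefs[(slot, value)] += confidence
    return dict(beliefs)

beliefs = slot_beliefs(nbest)
```

Here every hypothesis agrees on rating="good", so that belief is high, while the belief in type is split between "taiwanese" and "thai", which is a natural trigger for a confirmation question.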
22. Policy for Agent Action
• Inform
  - "The nearest one is at Taipei 101"
• Request
  - "Where is your home?"
• Confirm
  - "Did you want Taiwanese food?"
• Database Search
  - e.g. Din Tai Fung, ...
• Task Completion / Information Display
  - ticket booked, weather information
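The action types above can be tied together by a simple rule-based policy: request what is missing, confirm what is uncertain, otherwise inform. This is a sketch under assumptions, not the policy of any particular system; `REQUIRED_SLOTS`, `CONFIRM_THRESHOLD`, and `choose_action` are hypothetical names and values.

```python
REQUIRED_SLOTS = ("type", "location")  # assumed minimum for a database search
CONFIRM_THRESHOLD = 0.6                # assumed confidence cut-off

def choose_action(state, confidence):
    """Pick the next system action from slot values and their confidences."""
    # Request the first required slot that has not been filled yet.
    for slot in REQUIRED_SLOTS:
        if slot not in state:
            return ("request", slot)
    # Confirm any filled slot whose confidence is below the threshold.
    for slot, value in state.items():
        if confidence.get(slot, 1.0) < CONFIRM_THRESHOLD:
            return ("confirm", slot, value)
    # Everything needed is known: a database search and inform would follow.
    return ("inform", dict(state))

action = choose_action({"type": "taiwanese"}, {"type": 0.9})
```

With only the food type known, the sketch asks for the location, which corresponds to a Request action such as "Where is your home?".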
24. Output / NL Generation
• Inform
  - "The nearest one is at Taipei 101" vs. [graphical display]
• Request
  - "Where is your home?" vs. [graphical display]
• Confirm
  - "Did you want Taiwanese food?"
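The simplest form of natural language generation fills a fixed template per action type. The following sketch uses the three example sentences from the slide as templates; the `TEMPLATES` table, the placeholder names, and the `generate` function are assumptions.

```python
# Hypothetical templates keyed by system action type.
TEMPLATES = {
    "inform": "The nearest one is at {location}.",
    "request": "Where is your {slot}?",
    "confirm": "Did you want {value} food?",
}

def generate(action, **slots):
    """Fill the template for a system action with the given slot values."""
    return TEMPLATES[action].format(**slots)

text = generate("confirm", value="Taiwanese")
# text is "Did you want Taiwanese food?"
```

Templates are predictable but rigid, which is one motivation for learned generation models.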
26. Challenge
• Predefined semantic schema
  Chen et al., "Matrix Factorization with Knowledge Graph Propagation for Unsupervised Spoken Language Understanding," in ACL-IJCNLP, 2015.
• Data without annotations
  Chen et al., "Zero-Shot Learning of Intent Embeddings for Expansion by Convolutional Deep Structured Semantic Models," in ICASSP, 2016.
• Semantic concept interpretation
  Chen et al., "Deriving Local Relational Surface Forms from Dependency-Based Entity Embeddings for Unsupervised Spoken Language Understanding," in SLT, 2014.
• Predefined dialogue states
  Chen et al., "End-to-End Memory Networks with Knowledge Carryover for Multi-Turn Spoken Language Understanding," in Interspeech, 2016.
• Error propagation
  Hakkani-Tur et al., "Multi-Domain Joint Semantic Frame Parsing using Bi-directional RNN-LSTM," in Interspeech, 2016.
• Cross-domain intention/bot hierarchy
  Sun et al., "An Intelligent Assistant for High-Level Task Understanding," in IUI, 2016.
  Sun et al., "AppDialogue: Multi-App Dialogues for Intelligent Assistants," in LREC, 2016.
  Chen et al., "Leveraging Behavioral Patterns of Mobile Applications for Personalized Spoken Language Understanding," in ICMI, 2016.
• Cross-domain information transferring
  Kim et al., "New Transfer Learning Techniques For Disparate Label Sets," in ACL-IJCNLP, 2015.
[Figure: FIND_RESTAURANT with rating="good": does "good" mean rating=5? 4?]
[Figure: bot hierarchy with Hotel, Rest, and Flight under Travel, under Trip Planning]
30. A Layer of Neurons
• Handwriting digit classification, $f: \mathbb{R}^N \to \mathbb{R}^M$
A layer of neurons can handle multiple possible outputs, and the result depends on the max one.
[Figure: inputs $x_1, x_2, \dots, x_N$ and a constant input 1 feed 10 neurons, one per class (10 neurons/10 classes). Outputs: $y_1$ ("1" or not), $y_2$ ("2" or not), $y_3$ ("3" or not), ... Which one is max?]
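The layer can be sketched in a few lines: each neuron computes a weighted sum of the inputs plus a bias, and the predicted class is the neuron with the largest output. The sigmoid activation, the toy weights, and the function names are assumptions, since the slide does not specify them.

```python
import math

def layer(x, weights, biases):
    """One layer: y_j = sigmoid(sum_i w_ji * x_i + b_j) for each neuron j."""
    return [
        1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
        for row, b in zip(weights, biases)
    ]

def predict(x, weights, biases):
    """Return the index of the neuron with the largest output."""
    y = layer(x, weights, biases)
    return max(range(len(y)), key=y.__getitem__)

# Toy example: N = 2 inputs, M = 3 neurons (the slide uses 10, one per digit).
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.0]
label = predict([0.2, 0.9], weights, biases)
```

In this toy setting the second neuron has the largest weighted sum (0.9), so `label` is index 1.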
31. Deep Neural Network (DNN)
• Fully connected feedforward network, $f: \mathbb{R}^N \to \mathbb{R}^M$
[Figure: input vector $x = (x_1, x_2, \dots, x_N)$ → Layer 1 → Layer 2 → ... → Layer L → output vector $y = (y_1, y_2, \dots, y_M)$]
Deep NN: multiple hidden layers
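A fully connected feedforward network simply applies such layers one after another, with the output of each layer serving as the input of the next. A minimal sketch, assuming sigmoid activations and arbitrary toy weights (neither is given on the slide):

```python
import math

def dense(x, weights, biases):
    """Fully connected layer with a sigmoid activation."""
    return [
        1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
        for row, b in zip(weights, biases)
    ]

def feedforward(x, layers):
    """Apply each (weights, biases) layer in order."""
    for weights, biases in layers:
        x = dense(x, weights, biases)
    return x

# Toy network: N = 2 inputs -> 3 hidden units -> M = 2 outputs.
layers = [
    ([[0.5, -0.5], [1.0, 1.0], [-1.0, 0.5]], [0.0, 0.1, -0.1]),
    ([[1.0, -1.0, 0.5], [-0.5, 0.5, 1.0]], [0.0, 0.0]),
]
y = feedforward([1.0, 2.0], layers)
```

Adding more entries to `layers` makes the network deeper without changing `feedforward`.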
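33. RNN for SLU
• Joint Multi-Domain Intent Prediction and Slot Filling
  - Information from the two tasks can mutually enhance each other
[Figure: an RNN reads "taiwanese food please EOS". Hidden states $h_{t-1}, h_t, h_{t+1}, h_{T+1}$ are connected through W, inputs enter through U, and outputs leave through V. Slot tagging emits B-type, O, O for the words; intent prediction emits FIND_REST at EOS. Together they form the semantic frame sequence.]
Hakkani-Tur, et al., "Multi-Domain Joint Semantic Frame Parsing using Bi-directional RNN-LSTM," in Interspeech, 2016.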
34. Contextual SLU (Chen et al., 2016)
Domain Identification → Intent Prediction → Slot Filling

Single-turn example
U: just sent email to bob about fishing this weekend
S: O O O O B-contact_name O B-subject I-subject I-subject
D: communication
I: send_email
→ send_email(contact_name="bob", subject="fishing this weekend")

Multi-turn example
U1: send email to bob
S1: O O O B-contact_name
→ send_email(contact_name="bob")
U2: are we going to fish this weekend
S2: B-message I-message I-message I-message I-message I-message I-message
→ send_email(message="are we going to fish this weekend")
35. Contextual SLU (Chen et al., 2016)
Idea: additionally incorporate contextual knowledge during slot tagging
→ track dialogue states in a latent way
1. Sentence Encoding 2. Knowledge Attention 3. Knowledge Encoding
[Figure: the history utterances {x_i} are encoded by the contextual sentence encoder RNN_mem into memory representations m_i, and the current utterance c is encoded by the sentence encoder RNN_in into u. Inner products between u and each m_i give the knowledge attention distribution p_i. The weighted sum h is passed through W_kg to give the knowledge encoding representation o, which is fed (through M) into the RNN tagger with weights U, V, W over words w_{t-1}, w_t, hidden states h_{t-1}, h_t, and tags y_{t-1}, y_t, producing the slot tagging sequence y.]
Chen, et al., "End-to-End Memory Networks with Knowledge Carryover for Multi-Turn Spoken Language Understanding," in Interspeech, 2016.
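The knowledge attention and knowledge encoding steps can be sketched numerically. The symbols u, m_i, p_i, h, W_kg, and o follow the slide. Everything else is an assumption made for illustration: the encoders RNN_in and RNN_mem are replaced by fixed toy vectors, a softmax is used to normalize the inner products, and h is added to u before the W_kg projection.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Normalize scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def knowledge_encoding(u, memories, w_kg):
    """Attend over memory vectors m_i with query u and project the result."""
    p = softmax([dot(u, m) for m in memories])        # attention p_i
    h = [sum(pi * m[k] for pi, m in zip(p, memories))  # weighted sum h
         for k in range(len(u))]
    summed = [hk + uk for hk, uk in zip(h, u)]        # h + u (assumption)
    o = [dot(row, summed) for row in w_kg]            # o = W_kg (h + u)
    return p, o

u = [1.0, 0.0]                        # toy encoding of the current utterance
memories = [[1.0, 0.0], [0.0, 1.0]]   # toy encodings of two history utterances
w_kg = [[1.0, 0.0], [0.0, 1.0]]       # identity matrix, for readability
p, o = knowledge_encoding(u, memories, w_kg)
```

The first history vector is more similar to u, so it receives the larger attention weight and dominates the encoded knowledge o.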
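36. E2E Supervised Dialogue System
[Figure: the user input "Can I have korean" is delexicalized to "Can I have <v.food>" and encoded by the Intent Network. The Belief Tracker outputs a distribution over food types (Korean 0.7, British 0.2, French 0.1, ...). The Database Operator issues the MySQL query "Select * where food=Korean" against the Database (Sevendays, CurryPrince, Nirala, RoyalStandard, LittleSeuol) and returns a DB pointer (0 0 0 ... 0 1). The Policy Network combines these signals (z_t, p_t, x_t, with query q_t), and the Generation Network produces "<v.name> serves great <v.food> ." with the fields copied from the selected entry.]
Wen, et al., "A Network-based End-to-End Trainable Task-Oriented Dialogue System," arXiv:1604.04562v2.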
37. InfoBot: E2E Dialogue System with Supervised & Reinforcement Learning
User: Find me the Bill Murray's movie.
KB-InfoBot: When was it released?
User: I think it came out in 1993.
KB-InfoBot: Groundhog Day is a Bill Murray movie which came out in 1993.
Query: Movie=?; Actor=Bill Murray; Release Year=1993
Knowledge Base (head, relation, tail)
(Groundhog Day, actor, Bill Murray)
(Groundhog Day, release year, 1993)
(Australia, actor, Nicole Kidman)
(Mad Max: Fury Road, release year, 2015)
Dhingra, et al., "End-to-End Reinforcement Learning of Dialogue Agents for Information Access," arXiv:1609.00777v2.
Idea: differentiable database for propagating the gradients
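The intuition behind a differentiable database can be illustrated with a "soft" lookup: instead of a hard query that returns matching rows, each row receives a weight computed smoothly from the belief over slot values, so small changes in the beliefs change the result gradually. The sketch below is a simplified illustration and not the cited paper's formulation; the belief values, the `missing` weight for absent fields, and the function name are all assumptions.

```python
# Knowledge base rows built from the slide's triples; absent fields are omitted.
KB = [
    {"movie": "Groundhog Day", "actor": "Bill Murray", "release year": "1993"},
    {"movie": "Australia", "actor": "Nicole Kidman"},
    {"movie": "Mad Max: Fury Road", "release year": "2015"},
]

def soft_lookup(kb, beliefs, missing=0.5):
    """Weight each row by the product of beliefs in its slot values, normalized."""
    weights = []
    for row in kb:
        w = 1.0
        for slot, dist in beliefs.items():
            value = row.get(slot)
            # A missing field gets a neutral weight (an assumption of this sketch).
            w *= missing if value is None else dist.get(value, 0.0)
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights] if total > 0 else weights

beliefs = {  # hypothetical belief-tracker output after the two user turns
    "actor": {"Bill Murray": 0.9, "Nicole Kidman": 0.1},
    "release year": {"1993": 0.8, "2015": 0.2},
}
posterior = soft_lookup(KB, beliefs)
```

With these beliefs, Groundhog Day receives most of the probability mass, while the other rows keep small non-zero weights rather than being excluded outright.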
38. TC-Bot: End-to-End Task-Completion Bot
Idea: supervised learning for each component and reinforcement learning for end-to-end training of the neural dialogue system
[Figure: a User Simulator (User Goal, User Agenda Modeling, and Natural Language Generation (NLG) over w0 w1 w2 ... EOS) interacts with an End-to-End Neural Dialogue System. The text input "Are there any action movies to see this weekend?" goes to Language Understanding (LU), which tags the words w_i, w_i+1, w_i+2 with slots (e.g. B-type, O, O) and predicts <intent> at EOS for the turns at Time t-2, Time t-1, and Time t. The resulting Semantic Frame is request_movie, genre=action, date=this weekend. Dialogue Management (DM) outputs the System Action / Policy request_location, and the User Dialogue Action in reply is Inform(location=San Francisco).]
Li, et al., "End-to-End Task-Completion Neural Dialogue Systems," arXiv:1703.01008.
39. RL TC-Bot
User goal: Two tickets for "the witch" tomorrow 9:30 PM at regal meridian 16, Seattle.
RULE BASED SYSTEM
User: Can I get 2 tickets for the witch?
Agent: Which theater do you want?
User: Regal meridian 16
Agent: What time would you like to see it?
User: 9:30 pm
Agent: What date would you like to see it?
User: Tomorrow!
Agent: How many people are you?
User: Just two adults.
Agent: What is your location?
User: Seattle!
Agent: What kind of video format do you like?
User: Oh, any kind is okay.
Agent: Do you have any price constraint?
User: No, I don't care.
Agent: Great - I was able to purchase 2 tickets for
you to see the witch tomorrow at regal meridian
16 theater in Seattle at 9:30 pm.
User: Thank you.
Agent: Thank you!
REINFORCEMENT LEARNING SYSTEM
User: Can I get 2 tickets for the witch?
Agent: What date would you like to see it?
User: Tomorrow
Agent: Which theater do you prefer?
User: Regal meridian 16
Agent: What is your location?
User: Seattle
Agent: What time would you like to see it?
User: 9:30 pm.
Agent: Great - I was able to purchase 2 tickets
for you to see the witch tomorrow at regal
meridian 16 theater in Seattle at 9:30 pm.
User: Thanks.
Agent: Thanks!
The system can learn how to efficiently interact with
users for task completion
41. Conclusion
• Conversational systems can manage information access via spoken interactions
• A domain is usually constrained by the backend service
  - The semantic schema must be predefined
  - Cross-domain knowledge and intention are difficult to handle
• NN-Based Dialogue System
  - Pipeline outputs are represented as vectors → distributional
    • Semantic frames as vectors to encode confidence
    • Dialogue states are implicitly represented in hidden vectors
  - The execution is constrained by backend services → symbolic
42. Q & A
Thanks for your attention!