1. Towards End-to-End Reinforcement Learning of Dialogue Agents for Information Access
Bhuwan Dhingra Carnegie Mellon University
Lihong Li Microsoft Research
Xiujun Li Microsoft Research
Jianfeng Gao Microsoft Research
Yun-Nung (Vivian) Chen National Taiwan University
Faisal Ahmed Microsoft Research
Li Deng Citadel
2. KB-InfoBot: An interactive search engine
• Setting
– User is looking for a piece of information from one or more tables/KBs
– System must iteratively ask for user constraints (“slots”) to retrieve the answer
• Interactive search is more natural
– Users are used to issuing queries of fewer than 5 words (Spink et al., 2001)
– Users may not know the structure of the database being queried
User goal: Movie=?; Actor=Bill Murray; Release Year=1993
User: Find me the Bill Murray’s movie.
KB-InfoBot: When was it released?
User: I think it came out in 1993.
KB-InfoBot: Groundhog Day is a Bill Murray movie which came out in 1993.
Entity-Centric Knowledge Base (X = missing value):
Movie              | Actor         | Release Year
Groundhog Day      | Bill Murray   | 1993
Australia          | Nicole Kidman | X
Mad Max: Fury Road | X             | 2015
3. Goal-Oriented Dialogue System (Young et al., 2013)
[Pipeline diagram]
User Utterance → Natural Language Understanding (NLU) → Acts/Entities → State Tracker / Belief Tracker → Dialogue State → Dialogue Policy → Dialogue Act → Natural Language Generator (NLG) → System Response → User
The Dialogue Policy sends a Query to the Database / KB and receives Results.
Query Example:
SELECT Movie
WHERE
Actor==Bill Murray AND
Genre==Comedy
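The symbolic query can be sketched as an exact-match filter over KB rows; a minimal hypothetical illustration (the row dicts and the `hard_lookup` helper are assumptions for this sketch, not from the paper):

```python
# Hypothetical sketch of a hard (symbolic) KB lookup: an exact-match
# filter, analogous to a SQL SELECT. Any uncertainty from upstream
# modules is lost once the constraints are fixed to single values.
def hard_lookup(kb, constraints):
    """kb: list of row dicts; constraints: slot -> required value."""
    return [row for row in kb
            if all(row.get(slot) == value for slot, value in constraints.items())]

kb = [
    {"movie": "Groundhog Day", "actor": "Bill Murray", "release_year": 1993},
    {"movie": "Australia", "actor": "Nicole Kidman", "release_year": None},
    {"movie": "Mad Max: Fury Road", "actor": None, "release_year": 2015},
]
matches = hard_lookup(kb, {"actor": "Bill Murray"})  # one matching row
```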
4. KB-InfoBot
• A simple rule-based approach:
– Use heuristics to maintain a belief state over slots
– Ask for the slot with maximum uncertainty, until some “inform” criterion is met
• Drawbacks:
– Has no notion of what the user is likely to be looking for or likely to know
– Symbolic queries lose the notion of uncertainty in upstream modules
– Cannot improve online with user feedback
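The “ask for the slot with maximum uncertainty” heuristic can be sketched as picking the slot whose belief distribution has the highest entropy; a minimal illustration (the slot names, sample beliefs, and the 0.8 inform threshold are assumptions for this sketch, not values from the paper):

```python
# Sketch of the rule-based policy: request the slot with the most
# uncertain belief, or inform once every slot is near-certain.
# All concrete values below are illustrative.
import math

def entropy(dist):
    """Shannon entropy of a distribution given as {value: prob}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def next_action(beliefs, inform_threshold=0.8):
    """beliefs: slot name -> distribution over that slot's values."""
    # If every slot has a near-certain value, meet the "inform" criterion.
    if all(max(d.values()) >= inform_threshold for d in beliefs.values()):
        return ("inform",)
    # Otherwise request the slot with maximum uncertainty (entropy).
    slot = max(beliefs, key=lambda s: entropy(beliefs[s]))
    return ("request", slot)

beliefs = {
    "actor": {"Bill Murray": 0.9, "Nicole Kidman": 0.1},
    "release_year": {"1993": 0.4, "2015": 0.6},
}
action = next_action(beliefs)  # ("request", "release_year")
```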
5. KB-InfoBot
• Supervised / Reinforcement Learning-based approach
– Use neural networks to model LU, Belief Tracker and Policy
• Advantage: learns user behaviors (e.g. slots the user is likely to know)
• Drawbacks:
– Symbolic queries lose the notion of uncertainty in upstream modules
– End-to-end and online learning possible, but gradients cannot be backpropagated through the symbolic query
6. Network-Based Dialogue System (Wen et al., 2017)
[Pipeline diagram]
User Utterance → Acts/Entities → Dialogue State → Dialogue Act → System Response → User
The agent sends a Query to the Database / KB and receives Results; a Loss / Reward signal is backpropagated through the neural modules (Supervised Learning / Reinforcement Learning).
Query Example:
SELECT Movie
WHERE
Actor==Bill Murray AND
Genre==Comedy
The symbolic query is not differentiable, so truly “end-to-end” learning is not possible.
7. Piecewise Training (Wen et al., 2017)
[Pipeline diagram]
The same pipeline, trained in pieces: the belief tracker is trained with Supervised Learning on labeled data (its loss backpropagated through the tracker alone), while the policy is trained with Supervised Learning / Reinforcement Learning on a separate loss / reward signal.
Drawbacks:
- Labeling is expensive
- Cannot learn online
8. Our Approach: Soft-KB Lookup via Attention
• Replace the symbolic query with an attention distribution
– Compose slot-wise belief states into one posterior distribution over the entire database
– The KB structure is encoded in the computation of attention
• Uncertainty over database entries is propagated to the policy network (rule-based + RL)
• Differentiable operations allow backpropagation of gradients (RL)
• Drawback: computationally expensive for large databases
9. Our Approach: Soft-KB Lookup via Attention
[Pipeline diagram]
User Utterance → Acts/Entities → Dialogue State → Dialogue Act → System Response → User
The symbolic query is replaced by Soft Attention over the Database / KB, which produces a full distribution over DB entries for the policy.
Uncertainty propagated forward
Gradients propagated backward (Supervised Learning / Reinforcement Learning)
11. State Tracker
For each slot j, the tracker maintains:
1. A multinomial distribution over the slot’s values
2. A binomial probability of whether the user knows the value of the slot
Example: values x1, x2 with probabilities 0.3, 0.7; probability 0.8 that the user knows the slot.
12. KB Posterior
Entity | Slot1 | Slot2
A      | x1    | y1
B      | x2    | ?
C      | ?     | y2
Assumption: Slot values are independently distributed
15. KB Posterior
• Distribution over all entities in the database
• Posterior reflects uncertainty in LU + State Tracking
• All operations are differentiable
– Gradients can pass through during backward pass
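Under the slot-independence assumption, the posterior can be sketched as a product of slot-wise belief probabilities, renormalized over entities. A simplified illustration of the idea only: the uniform fallback for missing KB values and all numbers are assumptions for this sketch, and the paper’s actual computation also models whether the user knows each slot.

```python
# Simplified sketch of the Soft-KB lookup: score each entity by the
# product of the tracker's probabilities for its slot values, with a
# uniform fallback for values missing from the KB, then normalize.
def kb_posterior(kb, slot_beliefs):
    """kb: entity -> {slot: value or None}; slot_beliefs: slot -> {value: prob}."""
    scores = {}
    for entity, slots in kb.items():
        p = 1.0
        for slot, value in slots.items():
            belief = slot_beliefs[slot]
            if value is None:            # missing in KB: uniform over values
                p *= 1.0 / len(belief)
            else:
                p *= belief.get(value, 0.0)
        scores[entity] = p
    total = sum(scores.values())
    return {e: s / total for e, s in scores.items()}

kb = {
    "A": {"slot1": "x1", "slot2": "y1"},
    "B": {"slot1": "x2", "slot2": None},
    "C": {"slot1": None, "slot2": "y2"},
}
beliefs = {"slot1": {"x1": 0.3, "x2": 0.7}, "slot2": {"y1": 0.6, "y2": 0.4}}
posterior = kb_posterior(kb, beliefs)  # a distribution over entities A, B, C
```

Because every step here is a product or a normalization, gradients can flow from the posterior back into the slot beliefs, which is exactly what the hard SQL-style lookup prevents.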
16. Evaluation – Three Questions
Does Soft-KB lookup lead to better dialog policies?
Does Reinforcement Learning improve over Rule-based approach?
Does End-to-end learning lead to higher rewards?
17. KB-InfoBot Versions
Belief Trackers:
A. Hand-Crafted (Bayesian updates)
B. Neural (GRU)
Policy Network:
C. Hand-Crafted (Entropy Minimization)
D. Neural (GRU)
KB-lookup:
1. No KB lookup (Policy unaware of KB)
2. Hard-KB lookup (SQL type lookup)
3. Soft-KB lookup (KB Posterior)
Rule-Based Agents: A + C + (1, 2, 3)
RL-Based Agents: A + D + (1, 2, 3)
E2E Agent: B + D + (3)
18. Training
• All agents trained against a publicly available user simulator (Li et al., 2017)*
• Optimize the expected future discounted reward, E[Σ_h γ^h r_{t+h}], with the REINFORCE policy gradient
• RL agent: gradients update the policy network only
• E2E agent: gradients also flow through the KB posterior into the belief trackers
• Credit assignment:
– E2E agent always fails with random initialization
– Imitation learning at the beginning to mimic the rule-based policy
* https://github.com/MiuLab/TC-Bot
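A REINFORCE-style update can be sketched for a toy one-state softmax policy; everything here is illustrative (the actual agents use recurrent networks over the dialogue state, and the learning rate, discount, and episode are made up):

```python
# Toy sketch of a REINFORCE update: the discounted return from each step
# weights the log-probability gradient of the action taken at that step.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_update(theta, episode, lr=0.1, gamma=0.99):
    """theta: action logits of a one-state policy; episode: [(action, reward)]."""
    # Discounted return G_t from each step, computed backwards.
    G, returns = 0.0, []
    for _, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for (a, _), G in zip(episode, returns):
        probs = softmax(theta)
        # Gradient of log pi(a) w.r.t. softmax logits: one-hot(a) - probs.
        for i in range(len(theta)):
            theta[i] += lr * G * ((1.0 if i == a else 0.0) - probs[i])
    return theta

# Taking action 1 with positive reward should raise its probability.
theta = reinforce_update([0.0, 0.0], [(1, 1.0)])
```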
19. Simulation Results
• Evaluated on Movie-Centric KBs – small, medium, large, X-large
• Metrics:
– # of Dialogue Turns (T)
– Success Rate (correct movie returned) (S)
– Average Reward (R)
• All agents tuned to maximize average reward
Soft-KB > Hard-KB > No-KB
RL > Rule-based
E2E performs best
20. Human Evaluation
• Setting
– Typed interactions
– Given 1) a goal entity 2) a subset of slot values
– Multiple values per slot, to model noise
– Users are free to frame their inputs
Soft-KB lookup > Hard-KB lookup (Success Rate)
RL agent > Rule-based agent (#Turns)
However, the full E2E agent performed worse than the RL-Soft and Rule-Soft agents
21. Discussion
• Soft-KB lookup
– Better dialogue policies
• E2E agent
– Strong performance in simulations
– Does not transfer to real interactions
– Overfits to the limited natural language from the simulator
• Future research: personalized dialogue assistants?
– Deploy using RL-Soft agent
– Collect interactions to train E2E agent
– Gradually switch to the E2E agent
There has been interest in semantic parsing of complicated queries using neural models (Neural GenQA), but evidence suggests an interactive setting may be more appropriate.
What is a goal-oriented dialogue system?
Description of each module:
- NLU – extract entities and intents
- Tracker – maintain distribution over user goals and information