3. Introduction 3
Motivation
Markov Decision Processes have proven to be an extremely useful
model for decision making in stochastic environments
• Model requires dynamics and rewards
4. Introduction 4
Motivation
Markov Decision Processes have proven to be an extremely useful
model for decision making in stochastic environments
• Model requires dynamics and rewards
Specifying dynamics a priori can be difficult
• We can learn a model of the world in either an offline or online
(reinforcement learning) setting
5. Introduction 5
Motivation
Markov Decision Processes have proven to be an extremely useful
model for decision making in stochastic environments
• Model requires dynamics and rewards
In some simple cases reward can be thought of as being directly
“observed”
• For instance: the reward in a robot navigation problem
corresponding to the distance travelled
6. Introduction 6
Motivation
Except in some simple cases, the specification of reward
functions for MDPs is problematic
• Rewards can vary user-to-user
• Preferences about which states/actions are “good” and “bad”
need to be translated into precise numerical reward
• Time-consuming to specify reward for all states/actions
Example domain: assistive technology
7. Introduction 7
Motivation
However,
• Near-optimal policies can be found without a fully specified
reward function
• We can bound the performance of a policy using regret
8. Outline 1. Decision Theory
2. Preference Elicitation
3. MDPs
4. Current Work
9. Decision Theory 9
Utility
Given: A decision maker (DM)
A set of possible outcomes Θ
A set of lotteries L of the form:
    l ≡ ⟨p1, x1, p2, x2, …, pn, xn⟩  where xi ∈ Θ and Σi pi = 1
Binary shorthand: ⟨x1, p, x2⟩ ≡ ⟨p, x1, (1 − p), x2⟩
Compound lotteries reduce to simple ones:
    l1 = ⟨0.75, x, 0.25, ⟨0.6, y, 0.4, z⟩⟩ = ⟨0.75, x, 0.15, y, 0.1, z⟩
11. Decision Theory 10
Utility
Axioms Completeness
Transitivity
Independence
Continuity
12. Decision Theory 11
Utility
Completeness: For x, y ∈ Θ it is the case that either:
    x is weakly preferred to y: x ≽ y
    y is weakly preferred to x: y ≽ x
    or one is indifferent: x ~ y
13. Decision Theory 12
Utility
Transitivity: For any x, y, z ∈ Θ
    If x ≽ y and y ≽ z, then x ≽ z
14. Decision Theory 13
Utility
Independence: For every l1, l2, l3 ∈ L and p ∈ (0,1)
    If l1 ≻ l2, then ⟨l1, p, l3⟩ ≻ ⟨l2, p, l3⟩
15. Decision Theory 14
Utility
Continuity: For every l1, l2, l3 ∈ L
    If l1 ≻ l2 ≻ l3, then for some p ∈ (0,1): l2 ~ ⟨l1, p, l3⟩
16. Decision Theory 15
Utility
Given the axioms, there exists a utility function u : Θ → ℝ such that:
    u(x) ≥ u(y) ⇔ x ≽ y
    u(l) = u(⟨p1, x1, …, pn, xn⟩) = Σi pi u(xi)
The utility of a lottery is the expected utility of its outcomes
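To make the expected-utility computation concrete, here is a minimal Python sketch (the list-of-pairs lottery representation and the example utility values are assumptions, not from the slides):

```python
def lottery_utility(lottery, u):
    """Expected utility of a lottery given as [(p1, x1), ..., (pn, xn)].
    Each xi is either an outcome (a key of u) or a nested (compound) lottery."""
    return sum(p * (lottery_utility(x, u) if isinstance(x, list) else u[x])
               for p, x in lottery)

u = {"x": 1.0, "y": 0.5, "z": 0.0}                    # illustrative utilities
l1 = [(0.75, "x"), (0.25, [(0.6, "y"), (0.4, "z")])]  # the compound lottery from slide 9
print(lottery_utility(l1, u))                          # 0.75*1.0 + 0.25*(0.6*0.5) = 0.825
```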
17. Outline 1. Decision Theory
2. Preference Elicitation
3. MDPs
4. Current Work
18. Preference Elicitation 17
Queries
Ranking: Please order this subset of outcomes: ⟨x1, x2, …, xm⟩
    The response implies u(x1) ≥ u(x2) ≥ u(x3) ≥ … ≥ u(xm)
19. Preference Elicitation 18
Queries
Standard Gamble: Please choose a p for which you are indifferent between y and the lottery ⟨x⊤, p, x⊥⟩ (x⊤ the best outcome, x⊥ the worst)
    y ~ ⟨x⊤, p, x⊥⟩  ⇒  u(y) = p
20. Preference Elicitation 19
Queries
Bound: Is y at least as good as the lottery ⟨x⊤, b, x⊥⟩?
    If so, y ≽ ⟨x⊤, b, x⊥⟩  ⇒  u(y) ≥ b
21. Preference Elicitation 20
Preference Elicitation
Rather than fully specifying a utility function, we
1. Make decision w.r.t. an imprecisely specified utility function
2. Perform elicitation until we are satisfied with the decision
[Loop: make a decision using the current probability/utility model → satisfied? If yes, done; if no, select a query, ask the user, update, and repeat]
22. Preference Elicitation 21
Robust Decision Criteria
Maximax: Given a set of feasible utility functions U
    x* = arg max_{x∈Θ} max_{u∈U} u(x)

         u1   u2   u3
    x1    8    2    1
    x2    7    7    1
    x3    2    2    2
24. Preference Elicitation 23
Robust Decision Criteria
Maximin: Given a set of feasible utility functions U
    x* = arg max_{x∈Θ} min_{u∈U} u(x)

         u1   u2   u3
    x1    8    2    1
    x2    7    7    1
    x3    2    2    2
26. Preference Elicitation 25
Robust Decision Criteria
Minimax Regret: Given a set of feasible utility functions U
    x* = arg min_{x∈Θ} max_{x′∈Θ} max_{u∈U} [u(x′) − u(x)]

         u1   u2   u3   Max Regret
    x1    8    2    1        5
    x2    7    7    1        1
    x3    2    2    2        6
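As a quick check of the three criteria against the table above, here is a short Python sketch (NumPy; the array layout is an assumption):

```python
import numpy as np

# Rows are outcomes x1..x3, columns are the feasible utility functions u1..u3.
U = np.array([[8., 2., 1.],
              [7., 7., 1.],
              [2., 2., 2.]])

maximax = np.argmax(U.max(axis=1))       # x1: best-case utility 8
maximin = np.argmax(U.min(axis=1))       # x3: worst-case utility 2
regret = U.max(axis=0) - U               # regret of each outcome under each u
max_regret = regret.max(axis=1)          # [5, 1, 6], matching the table
minimax_regret = np.argmin(max_regret)   # x2: max regret of only 1
```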
34. Preference Elicitation 33
Bayesian Decision Criteria
Assuming we have a prior φ over potential utility functions:
Expected Utility:
    x* = arg max_{x∈Θ} E_{u∼φ}[u(x)]
Value at Risk:
    x* = arg max_{x∈Θ} max δ  subject to  Pr_{u∼φ}(u(x) ≥ δ) ≥ η
[Plot: utility distribution of an outcome with η = 90%; δ is the utility level exceeded with probability 90%]
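A minimal sketch of both Bayesian criteria, assuming the prior φ is represented by sampled utility functions (the sampling representation and names are assumptions):

```python
import numpy as np

def bayesian_choices(U, eta=0.9):
    """U[i, x] = utility of outcome x under the i-th utility function sampled
    from the prior phi.  Returns the expected-utility choice and the
    value-at-risk choice (largest delta achieved with probability >= eta)."""
    eu_choice = int(np.argmax(U.mean(axis=0)))
    deltas = np.quantile(U, 1.0 - eta, axis=0)   # Pr(u(x) >= delta) >= eta
    var_choice = int(np.argmax(deltas))
    return eu_choice, var_choice
```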
37. Outline 1. Decision Theory
2. Preference Elicitation
3. MDPs
4. Current Work
38. Markov Decision Processes 37
Markov Decision Process (MDP)
S - Set of states
A - Set of actions
Pr(s′ | s, a) - Transitions (assumed known)
r(s) - Reward [or r(s, a)] (unknown)
α - Starting state distribution
γ - Discount factor
[Diagram: at each step t the agent observes state st and reward rt from the world and chooses action at]
40. Markov Decision Processes 38
MDP - Policies
Policy A stationary policy π maps each state to an action
For infinite-horizon discounted MDPs there is always an optimal policy that is stationary
Policy Value: Given a policy π, the value of a state is
    V^π(s0) = E[ Σ_{t=0}^∞ γ^t r_t | π, s0 ]
41. Markov Decision Processes 39
MDP - Computing Value Function
The value of a policy can be found by successive approximation:
    V_0^π(s) = r(s, aπ)
    V_1^π(s) = r(s, aπ) + γ Σ_{s′} Pr(s′ | s, aπ) V_0^π(s′)
    ⋮
    V_k^π(s) = r(s, aπ) + γ Σ_{s′} Pr(s′ | s, aπ) V_{k−1}^π(s′)
There will exist a fixed point:
    V^π(s) = r(s, aπ) + γ Σ_{s′} Pr(s′ | s, aπ) V^π(s′)
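Because V^π is the fixed point of a linear system, it can also be computed directly; a minimal sketch (matrix shapes are assumptions):

```python
import numpy as np

def evaluate_policy(P_pi, r_pi, gamma):
    """V^pi solves V = r_pi + gamma * P_pi V, i.e. (I - gamma P_pi) V = r_pi.
    P_pi: |S| x |S| transition matrix under pi; r_pi: length-|S| reward vector."""
    S = len(r_pi)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)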
42. Markov Decision Processes 40
MDP - Optimal Value Functions
Optimal Policy: We wish to find the optimal policy π*
    π* : V^{π*} ≥ V^{π′}  ∀π′
Bellman Equation:
    V^{π*}(s) = max_a [ r(s, a) + γ Σ_{s′} Pr(s′ | s, a) V^{π*}(s′) ]
43. Markov Decision Processes 41
Value Iteration Algorithm
Yields an ε-optimal policy
1. Initialize V_0, set n = 0, choose ε > 0
2. For each s:
    V_{n+1}(s) = max_a [ r(s, a) + γ Σ_{s′} Pr(s′ | s, a) V_n(s′) ]
3. If ||V_{n+1} − V_n|| > ε(1 − γ) / (2γ): increment n and return to step 2
We can recover the policy by finding the best one-step action:
    π(s) = arg max_a [ r(s, a) + γ Σ_{s′} Pr(s′ | s, a) V(s′) ]
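A minimal NumPy sketch of the algorithm above, using the slide's stopping test (array shapes are assumptions):

```python
import numpy as np

def value_iteration(P, r, gamma, eps):
    """P[a]: |S| x |S| transition matrix for action a; r: |S| x |A| rewards.
    Returns an eps-optimal value function and the greedy policy."""
    S, A = r.shape
    V = np.zeros(S)
    while True:
        Q = np.stack([r[:, a] + gamma * P[a] @ V for a in range(A)], axis=1)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) <= eps * (1 - gamma) / (2 * gamma):
            return V_new, Q.argmax(axis=1)   # recover the best one-step action
        V = V_new
```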
44. Markov Decision Processes 42
Linear Programming Formulation
minimize_V  Σ_s α(s) V(s)
subject to: V(s) ≥ r(s, a) + γ Σ_{s′} Pr(s′ | s, a) V(s′)   ∀a, s
45. Markov Decision Processes 43
MDP - Occupancy Frequencies
An occupancy frequency f(s, a) expresses the total discounted probability of being in state s and taking action a
Valid occupancy frequencies satisfy:
    Σ_a f(s0, a) − γ Σ_s Σ_a Pr(s0 | s, a) f(s, a) = α(s0)   ∀s0
46. Markov Decision Processes 44
LP - Occupancy Frequency
Primal:
    minimize_V  Σ_s α(s) V(s)
    subject to: V(s) ≥ r(s, a) + γ Σ_{s′} Pr(s′ | s, a) V(s′)   ∀a, s
Dual:
    maximize_f  Σ_s Σ_a f(s, a) r(s, a)
    subject to: Σ_a f(s0, a) − γ Σ_s Σ_a Pr(s0 | s, a) f(s, a) = α(s0)   ∀s0
By LP duality the optimal values coincide:
    Σ_s Σ_a f(s, a) r(s, a) = Σ_s α(s) V(s)
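The dual (occupancy-frequency) LP can be handed to an off-the-shelf solver; a minimal sketch using scipy.optimize.linprog (the array shapes and the variable-flattening convention are assumptions):

```python
import numpy as np
from scipy.optimize import linprog

def solve_occupancy_lp(P, r, alpha, gamma):
    """max_f sum_{s,a} f(s,a) r(s,a) subject to the flow constraints, f >= 0.
    P: shape (A, S, S) with P[a, s, s2] = Pr(s2 | s, a); r: shape (S, A)."""
    A, S, _ = P.shape
    c = -r.reshape(-1)                       # linprog minimizes; vars f[s*A + a]
    A_eq = np.zeros((S, S * A))
    for s0 in range(S):
        for s in range(S):
            for a in range(A):
                A_eq[s0, s * A + a] = float(s == s0) - gamma * P[a, s, s0]
    res = linprog(c, A_eq=A_eq, b_eq=alpha, bounds=(0, None), method="highs")
    f = res.x.reshape(S, A)
    return f, f.argmax(axis=1)               # a greedy policy read off f
```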
48. Markov Decision Processes 45
MDP Summary Slide
Policies Over the past couple of decades, a lot of work
Dynamics has been done on scaling MDPs
Rewards
Factored Models
Decomposition
Linear Approximation
49. Markov Decision Processes 46
MDP Summary Slide
Policies To use these algorithms we need a model of
Dynamics the dynamics (transition function). There are
techniques for:
Rewards
Deriving models of
dynamics from data.
Finding policies that are robust
to inaccurate transition models
50. Markov Decision Processes 47
MDP Summary Slide
Policies There has been comparatively little work on
Dynamics specifying rewards
Rewards
Finding policies that are robust
to imprecise reward models
Eliciting reward information
from users
51. Outline 1. Decision Theory
2. Preference Elicitation
3. MDPs
4. Current Work
52. Outline 1. Decision Theory
2. Preference Elicitation
3. MDPs
4. Current Work A. Imprecise Reward Specification
B. Computing Robust Policies
C. Preference Elicitation
D. Evaluation
E. Future Work
53. Current Work 50
Current Work
[Loop: given the MDP with imprecisely specified reward set R, compute a robust policy → satisfied? If yes, done; if no, select a query, ask the user, refine R, and repeat]
54. Model : MDP 51
MDP - Reward Uncertainty
We quantify the strict uncertainty over reward
with a set of feasible reward functions R
We specify R using a set of linear
inequalities forming a polytope
Where do these inequalities come from?
Bound queries: Is r(s,a) > b?
Policy comparisons: Is fπ ·r > fπ′ ·r ?
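A sketch of how answered queries become rows of the polytope R = {r : Cr ≤ d}; the vector indexing of r over (s, a) pairs, the dimension, and the helper names are assumptions for illustration:

```python
import numpy as np

n = 50                       # number of (s, a) reward components (illustrative)
C_rows, d_vals = [], []

def add_bound_query(idx, b, answer_is_yes):
    """'Is r[idx] > b?'  Yes encodes -r[idx] <= -b; No encodes r[idx] <= b."""
    row = np.zeros(n)
    row[idx] = -1.0 if answer_is_yes else 1.0
    C_rows.append(row)
    d_vals.append(-b if answer_is_yes else b)

def add_policy_comparison(f_pi, f_other, prefers_pi):
    """'Is f_pi . r > f_other . r?'  Yes encodes (f_other - f_pi) . r <= 0."""
    diff = np.asarray(f_other) - np.asarray(f_pi)
    C_rows.append(diff if prefers_pi else -diff)
    d_vals.append(0.0)

add_bound_query(idx=3, b=0.5, answer_is_yes=True)
C, d = np.vstack(C_rows), np.array(d_vals)   # the polytope R: C r <= d
```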
55. Outline 1. Decision Theory
2. Preference Elicitation
3. MDPs
4. Current Work A. Imprecise Reward Specification
B. Computing Robust Policies
C. Preference Elicitation
D. Evaluation
E. Future Work
56. Computation 53
Minimax Regret
Original Formulation:
    min_{f∈F} max_{g∈F} max_{r∈R}  g·r − f·r
Benders' Decomposition:
    minimize_{f∈F, δ}  δ
    subject to:  δ ≥ g·r − f·r   ∀ g ∈ F, r ∈ R
57. Computation 54
Minimax Regret
Original Formulation:
    min_{f∈F} max_{g∈F} max_{r∈R}  g·r − f·r
Benders' Decomposition:
    minimize_{f∈F, δ}  δ
    subject to:  δ ≥ g·r − f·r   ∀ g ∈ V(F), r ∈ V(R)
The maxima are attained at vertices of F and R
Rather than enumerating an exponential number of vertices we use
constraint generation
58. Computation 55
Minimax Regret - Constraint Generation
1. We limit adversary
• Player minimizes regret w.r.t. a small set of adversary
responses
2. We untie adversary’s hands
• Adversary finds maximum regret w.r.t. player’s policy
• Add adversary’s choice of r and g to set of adversary
responses
Done when untying the adversary's hands yields no improvement,
• i.e., the player's minimized regret equals the adversary's maximized regret (a minimal sketch of this loop follows below)
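A minimal sketch of the loop over explicit, finite vertex sets; in the actual method the master step is an LP over all occupancy frequencies and the subproblem is the MIP on the following slides, so the brute-force inner searches here are stand-ins:

```python
import numpy as np

def minimax_regret_constraint_generation(F, R, tol=1e-8):
    """F: candidate player/adversary policies as occupancy-frequency vectors;
    R: reward vertices.  Both are explicit arrays here purely for illustration."""
    gen = [(F[0], R[0])]                                  # initial adversary responses
    while True:
        # 1. Limit adversary: minimize regret w.r.t. generated responses only.
        regrets = [max(r @ g - r @ f for g, r in gen) for f in F]
        i = int(np.argmin(regrets))
        f_star, master_regret = F[i], regrets[i]
        # 2. Untie adversary's hands: find max regret against f_star.
        g_best, r_best = max(((g, r) for g in F for r in R),
                             key=lambda gr: gr[1] @ gr[0] - gr[1] @ f_star)
        if r_best @ g_best - r_best @ f_star <= master_regret + tol:
            return f_star, master_regret                  # no improvement: done
        gen.append((g_best, r_best))                      # add violated constraint
```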
59. Computation 56
Constraint Generation - Player
1. Limit adversary
minimize_{f∈F, δ}  δ
subject to:  δ ≥ g·r − f·r   ∀ ⟨g, r⟩ ∈ Gen
60. Computation 57
Constraint Generation - Adversary
2. Untie adversary’s hands: Given player policy f
    max_{g∈F} max_{r∈R}  g·r − f·r
This is a non-convex (bilinear) optimization problem
We reformulate it as a mixed integer linear program
61. Computation 58
Constraint Generation - Adversary
Computation of MR(f, R) is realized by the following MIP, using value and Q-functions:
    maximize_{Q, V, I, r}  α·V − r·f
    subject to:  Q_a = r_a + γ P_a V         ∀a ∈ A
                 V ≥ Q_a                      ∀a ∈ A
                 V ≤ (1 − I_a) M_a + Q_a      ∀a ∈ A
                 C r ≤ d
                 Σ_a I_a = 1
                 I_a(s) ∈ {0, 1}              ∀a, s
    where M_a = M⊤ − M_a⊥ is a big-M constant
Here I represents the adversary's policy, with I_a(s) denoting the probability of action a being taken at state s
Only tractable for small Markov Decision Problems
62. Computation 59
Approximating Minimax Regret
We relax the Max Regret MIP formulation
The value of the resulting policy is no longer exact; however, the resulting reward is still feasible, so we find the optimal policy w.r.t. that reward
[Figure 3: Relative approximation error of the linear relaxation vs. number of states, measured against both max regret and minimax regret]
63. Computation 60
Scaling (Log Scale)
[Figure 2: Scaling of constraint generation with number of states — runtime (log scale) vs. number of states, exact minimax regret compared with the approximate version]
64. Outline 1. Decision Theory
2. Preference Elicitation
3. MDPs
4. Current Work A. Imprecise Reward Specification
B. Computing Robust Policies
C. Preference Elicitation
D. Evaluation
E. Future Work
65. Reward Elicitation 62
Reward Elicitation
[Loop: given the MDP with reward set R, compute a robust policy → satisfied? If yes, done; if no, select a query, ask the user, refine R, and repeat]
66. Reward Elicitation 63
Bound Queries
Query: Is r(s, a) > b?
    where b is a point between the upper and lower bounds on r(s, a)
Gap: Δ(s, a) = max_{r′∈R} r′(s, a) − min_{r∈R} r(s, a)
At each step of elicitation we need to select the parameters s, a and b using the gap
67. Reward Elicitation 64
Selecting Bound Queries
Halve the Largest Gap (HLG):
    Select the (s, a) with the largest gap Δ(s, a)
    Set b to the midpoint of the interval for r(s, a)
Current Solution (CS):
    Use the current solution g(s, a) [or f(s, a)] of the minimax regret calculation to weight each gap Δ(s, a)
    Select the (s, a) with the largest weighted gap g(s, a)Δ(s, a)
    Set b to the midpoint of the interval for r(s, a)
(A minimal sketch of both strategies follows below)
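Both strategies reduce to one scoring rule; a minimal sketch (array shapes and names are assumptions):

```python
import numpy as np

def select_bound_query(r_lo, r_hi, occupancy=None):
    """Pick the next query 'Is r(s, a) > b?'.
    r_lo, r_hi: |S| x |A| lower/upper bounds on reward.
    occupancy: g(s, a) or f(s, a) from the minimax-regret solution for the
    Current Solution strategy; None gives Halve the Largest Gap."""
    gap = r_hi - r_lo
    score = gap if occupancy is None else occupancy * gap
    s, a = np.unravel_index(np.argmax(score), score.shape)
    b = 0.5 * (r_lo[s, a] + r_hi[s, a])      # midpoint of the interval
    return s, a, b
```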
68. Outline 1. Decision Theory
2. Preference Elicitation
3. MDPs
4. Current Work A. Imprecise Reward Specification
B. Computing Robust Policies
C. Preference Elicitation
D. Evaluation
E. Future Work
69. Evaluation 66
Experimental Setup
Randomly generated MDPs
Semi-sparse random transition function,
discount factor of 0.95
Random true reward drawn from fixed interval,
upper and lower bounds on reward drawn
randomly
All results are averaged over 20 runs; each run uses an MDP with 10 states and 5 actions
70. Evaluation 67
Elicitation Effectiveness
We examine the combination of each criterion for computing robust policies with each of the elicitation strategies:
    Minimax Regret (MMR)        Halve the Largest Gap (HLG)
    Maximin (MR)                Current Solution (CS)
73. Evaluation 70
Queries per Reward Point - Random MDP
Queries per Reward Point - Random MDP
[Plot: number of reward points vs. the number of queries each received]
Most of reward space is left unexplored
We repeatedly query a small set of "high impact" reward points
74. Evaluation 71
Autonomic Computing
Setup: 2 hosts, 3 demand levels, 3 units of resource
Model: 90 states, 10 actions
[Diagram: each host has a demand level and is allocated a share of the total resource]
77. Outline 1. Decision Theory
2. Preference Elicitation
3. MDPs
4. Current Work A. Imprecise Reward Specification
B. Computing Robust Policies
C. Preference Elicitation
D. Evaluation
E. Future Work
78. Introduction 75
Overview
[Loop: given the MDP with reward set R, compute a robust policy → satisfied? If yes, done; if no, select a query, ask the user, refine R, and repeat]
79. Introduction 75
Contributions
1. A technique for finding robust policies using minimax regret
2. A simple elicitation procedure that quickly leads to near-optimal or optimal policies
80. Conclusion 76
Future Work
Scaling
Bottleneck: the adversary's max regret computation
Idea: the set Γ of adversary policies g that will ever be a regret-maximizing response can be small
Approaches that use Γ to efficiently compute max regret
We have: an algorithm to find Γ, and a theorem showing that the algorithm runs in time polynomial in the number of policies found
81. Conclusion 77
Future Work
Factored MDPs
Working with factored MDPs will:
    Model problems in a more natural way
    Allow us to lower the dimensionality of the reward function
    Leverage existing techniques for scaling MDPs that take advantage of factored structure
82. Conclusion 78
Future Work
Richer Queries
In state s, which action would you like to take?
In state s, do you prefer action a1 to a2?
Do you prefer the sequence s1, a1, s2, a2, …, sk to s′1, a′1, s′2, a′2, …, s′k?
83. Conclusion 79
Future Work
Richer Queries (tradeoffs)
Do you prefer the tradeoff
    f(s2, a3) = f1 amount of time doing (s2, a3) and f(s1, a4) = f2 amount of time doing (s1, a4)
or
    f′(s2, a3) = f′1 amount of time doing (s2, a3) and f′(s1, a4) = f′2 amount of time doing (s1, a4)?
[Example diagram: two alternatives trading off time between (s = Cab Available, a = Take Cab) and (s = No Street Car, a = Waiting), with the frequencies f1 and f2 swapped between them]
87. Appendix 82
Full Formulation
Master (Benders' relaxation of minimax regret):
    minimize_{f, δ}  δ
    subject to:  r·g − r·f ≤ δ    ∀ ⟨g, r⟩ ∈ Gen
                 f a valid occupancy frequency (γE f + α = 0 in the paper's matrix notation)
Subproblem (max regret MR(f, R) as a MIP, using value and Q-functions):
    maximize_{Q, V, I, r}  α·V − r·f
    subject to:  Q_a = r_a + γ P_a V         ∀a ∈ A
                 V ≥ Q_a                      ∀a ∈ A
                 V ≤ (1 − I_a) M_a + Q_a      ∀a ∈ A
                 C r ≤ d
                 Σ_a I_a = 1
                 I_a(s) ∈ {0, 1}              ∀a, s
This corresponds to the standard dual LP formulation of an MDP with the addition of adversarial policy constraints; I represents the adversary's policy, with I_a(s) denoting the probability of action a being taken at state s
89. Computation 84
Regret Gap vs Time
[Plot: regret gap vs. time (ms) as constraint generation proceeds]
Editor's notes
Markov decision processes are an extremely useful model for decision making in stochastic environments. To use it we need to know dynamics and the rewards involved. A lot of work has been done on learning dynamics both in an offline and online setting.
Markov decision processes are an extremely useful model for decision making in stochastic environments. To use it we need to know the dynamics and the rewards involved. A lot of work has been done on learning dynamics both in an offline and online setting. Rewards are often assumed to be directly observable parts of the world. My perspective is that "rewards are in people's heads": in some cases there is a simple mapping between what you want (in your head) and the world (i.e., finding the shortest path).
Markov decision processes are an extremely useful model for decision making in stochastic environments. To use it we need to know the dynamics and the rewards involved. A lot of work has been done on learning dynamics both in an offline and online setting. In some simple cases, rewards can be thought of as being directly observable: for instance, the distance travelled in a robot navigation problem where we are trying to get a robot from point A to point B. When I am in my car trying to get from point A to point B I want the path with the fewest stoplights, someone else may want the path with the nicest scenery, while someone else may sacrifice some stoplights for some scenery. Reward is a surrogate for subjective preferences ... flip slide (sometimes it's easy but)
The dynamics in combination simple bounds on reward function lead to areas of reward space not having a high impact on the value of a policy
Maximin is common but we use regret
So for f2, no matter what the instantiation of reward, by changing policy the player only stood to gain by one. This is an intuitive measure.
Robust MDP literature often assumes transitions are known
transitions are learnable - do not change from user to user
Convergence properties
explain how constraints create max
Question: Is it worth spending more time to familiarize the audience with "occupancy frequencies"?
I will vocally mention other representations
In the voice over I will explain how each reformulation proceeds from the previous expression
I would also like to give a clear intuition as to why this is inherently hard.
On average less than 10% error
I will also note that on a 90-state MDP with 16 actions the relaxation computes minimax regret in less than 3 seconds.
Here I will review the preference elicitation process
Note that it is useful in non-sequential settings as well
Now I have left out the autonomic computing results, due to lack of time. If there is a little time, after giving the results for the random MDPs I can state that we have similar results for a large MDP instance.
20 runs --> 20 MDPs with 10 states and 5 actions
We have a partially specified reward function; we compute a robust policy that minimizes maximum regret --> We then elicit information about the reward function, which leads to a better policy --> We continue this process until we have an optimal policy or the regret guarantee is small enough
I will give the context in the voice over. The main idea on this slide is that: in practice constraint generation quickly converges. To segue to the next slide I will recall that we still need to solve a MIP with |S||A| variables and constraints, thus we developed an approximation.