3. Introduction 3
Motivation
Markov Decision Processes have proven to be an extremely useful
model for decision making in stochastic environments
• Model requires dynamics and rewards
4. Introduction 4
Motivation
Markov Decision Processes have proven to be an extremely useful
model for decision making in stochastic environments
• Model requires dynamics and rewards
Specifying dynamics a priori can be difficult
• We can learn a model of the world in either an offline or online
(reinforcement learning) setting
5. Introduction 5
Motivation
Markov Decision Processes have proven to be an extremely useful
model for decision making in stochastic environments
• Model requires dynamics and rewards
In some simple cases reward can be thought of as being directly
“observed”
• For instance: the reward in a robot navigation problem
corresponding to the distance travelled
6. Introduction 6
Motivation
Except in some simple cases, the specification of reward
functions for MDPs is problematic
• Rewards can vary user-to-user
• Preferences about which states/actions are “good” and “bad”
need to be translated into precise numerical reward
• Time-consuming to specify reward for all states/actions
Example domain: assistive technology
7. Introduction 7
Motivation
However,
• Near-optimal policies can be found without a fully specified
reward function
• We can bound the performance of a policy using regret
8. Outline 1. Decision Theory
2. Preference Elicitation
3. MDPs
4. Current Work
9. Decision Theory 9
Utility
Given: A decision maker (DM)
A set of possible outcomes Θ
A set of lotteries L of the form:
    l ≡ ⟨p1, x1, p2, x2, …, pn, xn⟩  where xi ∈ Θ and Σi pi = 1
Binary shorthand: ⟨x1, p, x2⟩ ≡ ⟨p, x1, (1 − p), x2⟩
Compound lotteries reduce to simple ones:
    l1 = ⟨0.75, x, 0.25, ⟨0.6, y, 0.4, z⟩⟩ = ⟨0.75, x, 0.15, y, 0.1, z⟩
11. Decision Theory 10
Utility
Axioms Completeness
Transitivity
Independence
Continuity
12. Decision Theory 11
Utility
Completeness: For x, y ∈ Θ it is the case that either:
    x is weakly preferred to y: x ≽ y
    y is weakly preferred to x: y ≽ x
    or one is indifferent: x ~ y
13. Decision Theory 12
Utility
Transitivity: For any x, y, z ∈ Θ
    If x ≽ y and y ≽ z, then x ≽ z
14. Decision Theory 13
Utility
Independence: For every l1, l2, l3 ∈ L and p ∈ (0,1)
    If l1 ≻ l2, then ⟨l1, p, l3⟩ ≻ ⟨l2, p, l3⟩
15. Decision Theory 14
Utility
Continuity: For every l1, l2, l3 ∈ L
    If l1 ≻ l2 ≻ l3, then for some p ∈ (0,1): l2 ~ ⟨l1, p, l3⟩
16. Decision Theory 15
Utility
Given the axioms, there exists a utility function u : Θ → ℝ such that:
    u(x) ≥ u(y) ⇔ x ≽ y
    u(l) = u(⟨p1, x1, …, pn, xn⟩) = Σi pi u(xi)
The utility of a lottery is the expected utility of its outcomes
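To make the expected-utility computation concrete, here is a minimal Python sketch (the list-of-pairs lottery representation and the example utility values are assumptions, not from the slides):

```python
def lottery_utility(lottery, u):
    """Expected utility of a lottery given as [(p1, x1), ..., (pn, xn)].
    Each xi is either an outcome (a key of u) or a nested (compound) lottery."""
    return sum(p * (lottery_utility(x, u) if isinstance(x, list) else u[x])
               for p, x in lottery)

u = {"x": 1.0, "y": 0.5, "z": 0.0}                    # illustrative utilities
l1 = [(0.75, "x"), (0.25, [(0.6, "y"), (0.4, "z")])]  # the compound lottery from slide 9
print(lottery_utility(l1, u))                          # 0.75*1.0 + 0.25*(0.6*0.5) = 0.825
```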
17. Outline 1. Decision Theory
2. Preference Elicitation
3. MDPs
4. Current Work
18. Preference Elicitation 17
Queries
Ranking: Please order this subset of outcomes: ⟨x1, x2, …, xm⟩
    The response implies u(x1) ≥ u(x2) ≥ u(x3) ≥ … ≥ u(xm)
19. Preference Elicitation 18
Queries
Standard Gamble: Please choose a p for which you are indifferent between y and the lottery ⟨x⊤, p, x⊥⟩ (x⊤ the best outcome, x⊥ the worst)
    y ~ ⟨x⊤, p, x⊥⟩  ⇒  u(y) = p
20. Preference Elicitation 19
Queries
Bound: Is y at least as good as the lottery ⟨x⊤, b, x⊥⟩?
    If so, y ≽ ⟨x⊤, b, x⊥⟩  ⇒  u(y) ≥ b
21. Preference Elicitation 20
Preference Elicitation
Rather than fully specifying a utility function, we
1. Make decision w.r.t. an imprecisely specified utility function
2. Perform elicitation until we are satisfied with the decision
[Loop: make a decision using the current probability/utility model → satisfied? If yes, done; if no, select a query, ask the user, update, and repeat]
22. Preference Elicitation 21
Robust Decision Criteria
Maximax: Given a set of feasible utility functions U
    x* = arg max_{x∈Θ} max_{u∈U} u(x)

         u1   u2   u3
    x1    8    2    1
    x2    7    7    1
    x3    2    2    2
24. Preference Elicitation 23
Robust Decision Criteria
Maximin: Given a set of feasible utility functions U
    x* = arg max_{x∈Θ} min_{u∈U} u(x)

         u1   u2   u3
    x1    8    2    1
    x2    7    7    1
    x3    2    2    2
26. Preference Elicitation 25
Robust Decision Criteria
Minimax Regret: Given a set of feasible utility functions U
    x* = arg min_{x∈Θ} max_{x′∈Θ} max_{u∈U} [u(x′) − u(x)]

         u1   u2   u3   Max Regret
    x1    8    2    1        5
    x2    7    7    1        1
    x3    2    2    2        6
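As a quick check of the three criteria against the table above, here is a short Python sketch (NumPy; the array layout is an assumption):

```python
import numpy as np

# Rows are outcomes x1..x3, columns are the feasible utility functions u1..u3.
U = np.array([[8., 2., 1.],
              [7., 7., 1.],
              [2., 2., 2.]])

maximax = np.argmax(U.max(axis=1))       # x1: best-case utility 8
maximin = np.argmax(U.min(axis=1))       # x3: worst-case utility 2
regret = U.max(axis=0) - U               # regret of each outcome under each u
max_regret = regret.max(axis=1)          # [5, 1, 6], matching the table
minimax_regret = np.argmin(max_regret)   # x2: max regret of only 1
```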
34. Preference Elicitation 33
Bayesian Decision Criteria
Assuming we have a prior φ over potential utility functions:
Expected Utility:
    x* = arg max_{x∈Θ} E_{u∼φ}[u(x)]
Value at Risk:
    x* = arg max_{x∈Θ} max δ  subject to  Pr_{u∼φ}(u(x) ≥ δ) ≥ η
[Plot: utility distribution of an outcome with η = 90%; δ is the utility level exceeded with probability 90%]
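A minimal sketch of both Bayesian criteria, assuming the prior φ is represented by sampled utility functions (the sampling representation and names are assumptions):

```python
import numpy as np

def bayesian_choices(U, eta=0.9):
    """U[i, x] = utility of outcome x under the i-th utility function sampled
    from the prior phi.  Returns the expected-utility choice and the
    value-at-risk choice (largest delta achieved with probability >= eta)."""
    eu_choice = int(np.argmax(U.mean(axis=0)))
    deltas = np.quantile(U, 1.0 - eta, axis=0)   # Pr(u(x) >= delta) >= eta
    var_choice = int(np.argmax(deltas))
    return eu_choice, var_choice
```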
37. Outline 1. Decision Theory
2. Preference Elicitation
3. MDPs
4. Current Work
38. Markov Decision Processes 37
Markov Decision Process (MDP)
S - Set of states
A - Set of actions
Pr(s′ | s, a) - Transitions (assumed known)
r(s) - Reward [or r(s, a)] (unknown)
α - Starting state distribution
γ - Discount factor
[Diagram: at each step t the agent observes state st and reward rt from the world and chooses action at]
40. Markov Decision Processes 38
MDP - Policies
Policy A stationary policy π maps each state to an action
For infinite-horizon discounted MDPs there is always an optimal policy that is stationary
Policy Value: Given a policy π, the value of a state is
    V^π(s0) = E[ Σ_{t=0}^∞ γ^t r_t | π, s0 ]
41. Markov Decision Processes 39
MDP - Computing Value Function
The value of a policy can be found by successive approximation:
    V_0^π(s) = r(s, aπ)
    V_1^π(s) = r(s, aπ) + γ Σ_{s′} Pr(s′ | s, aπ) V_0^π(s′)
    ⋮
    V_k^π(s) = r(s, aπ) + γ Σ_{s′} Pr(s′ | s, aπ) V_{k−1}^π(s′)
There will exist a fixed point:
    V^π(s) = r(s, aπ) + γ Σ_{s′} Pr(s′ | s, aπ) V^π(s′)
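Because V^π is the fixed point of a linear system, it can also be computed directly; a minimal sketch (matrix shapes are assumptions):

```python
import numpy as np

def evaluate_policy(P_pi, r_pi, gamma):
    """V^pi solves V = r_pi + gamma * P_pi V, i.e. (I - gamma P_pi) V = r_pi.
    P_pi: |S| x |S| transition matrix under pi; r_pi: length-|S| reward vector."""
    S = len(r_pi)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)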
42. Markov Decision Processes 40
MDP - Optimal Value Functions
Optimal Policy: We wish to find the optimal policy π*
    π* : V^{π*} ≥ V^{π′}  ∀π′
Bellman Equation:
    V^{π*}(s) = max_a [ r(s, a) + γ Σ_{s′} Pr(s′ | s, a) V^{π*}(s′) ]
43. Markov Decision Processes 41
Value Iteration Algorithm
Yields an ε-optimal policy
1. Initialize V_0, set n = 0, choose ε > 0
2. For each s:
    V_{n+1}(s) = max_a [ r(s, a) + γ Σ_{s′} Pr(s′ | s, a) V_n(s′) ]
3. If ||V_{n+1} − V_n|| > ε(1 − γ) / (2γ): increment n and return to step 2
We can recover the policy by finding the best one-step action:
    π(s) = arg max_a [ r(s, a) + γ Σ_{s′} Pr(s′ | s, a) V(s′) ]
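A minimal NumPy sketch of the algorithm above, using the slide's stopping test (array shapes are assumptions):

```python
import numpy as np

def value_iteration(P, r, gamma, eps):
    """P[a]: |S| x |S| transition matrix for action a; r: |S| x |A| rewards.
    Returns an eps-optimal value function and the greedy policy."""
    S, A = r.shape
    V = np.zeros(S)
    while True:
        Q = np.stack([r[:, a] + gamma * P[a] @ V for a in range(A)], axis=1)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) <= eps * (1 - gamma) / (2 * gamma):
            return V_new, Q.argmax(axis=1)   # recover the best one-step action
        V = V_new
```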
44. Markov Decision Processes 42
Linear Programming Formulation
minimize_V  Σ_s α(s) V(s)
subject to: V(s) ≥ r(s, a) + γ Σ_{s′} Pr(s′ | s, a) V(s′)   ∀a, s
45. Markov Decision Processes 43
MDP - Occupancy Frequencies
An occupancy frequency f(s, a) expresses the total discounted probability of being in state s and taking action a
Valid occupancy frequencies satisfy:
    Σ_a f(s0, a) − γ Σ_s Σ_a Pr(s0 | s, a) f(s, a) = α(s0)   ∀s0
46. Markov Decision Processes 44
LP - Occupancy Frequency
Primal:
    minimize_V  Σ_s α(s) V(s)
    subject to: V(s) ≥ r(s, a) + γ Σ_{s′} Pr(s′ | s, a) V(s′)   ∀a, s
Dual:
    maximize_f  Σ_s Σ_a f(s, a) r(s, a)
    subject to: Σ_a f(s0, a) − γ Σ_s Σ_a Pr(s0 | s, a) f(s, a) = α(s0)   ∀s0
By LP duality the optimal values coincide:
    Σ_s Σ_a f(s, a) r(s, a) = Σ_s α(s) V(s)
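The dual (occupancy-frequency) LP can be handed to an off-the-shelf solver; a minimal sketch using scipy.optimize.linprog (the array shapes and the variable-flattening convention are assumptions):

```python
import numpy as np
from scipy.optimize import linprog

def solve_occupancy_lp(P, r, alpha, gamma):
    """max_f sum_{s,a} f(s,a) r(s,a) subject to the flow constraints, f >= 0.
    P: shape (A, S, S) with P[a, s, s2] = Pr(s2 | s, a); r: shape (S, A)."""
    A, S, _ = P.shape
    c = -r.reshape(-1)                       # linprog minimizes; vars f[s*A + a]
    A_eq = np.zeros((S, S * A))
    for s0 in range(S):
        for s in range(S):
            for a in range(A):
                A_eq[s0, s * A + a] = float(s == s0) - gamma * P[a, s, s0]
    res = linprog(c, A_eq=A_eq, b_eq=alpha, bounds=(0, None), method="highs")
    f = res.x.reshape(S, A)
    return f, f.argmax(axis=1)               # a greedy policy read off f
```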
48. Markov Decision Processes 45
MDP Summary Slide
Policies Over the past couple of decades, a lot of work
Dynamics has been done on scaling MDPs
Rewards
Factored Models
Decomposition
Linear Approximation
49. Markov Decision Processes 46
MDP Summary Slide
Policies To use these algorithms we need a model of
Dynamics the dynamics (transition function). There are
techniques for:
Rewards
Deriving models of
dynamics from data.
Finding policies that are robust
to inaccurate transition models
50. Markov Decision Processes 47
MDP Summary Slide
Policies There has been comparatively little work on
Dynamics specifying rewards
Rewards
Finding policies that are robust
to imprecise reward models
Eliciting reward information
from users
51. Outline 1. Decision Theory
2. Preference Elicitation
3. MDPs
4. Current Work
52. Outline 1. Decision Theory
2. Preference Elicitation
3. MDPs
4. Current Work A. Imprecise Reward Specification
B. Computing Robust Policies
C. Preference Elicitation
D. Evaluation
E. Future Work
53. Current Work 50
Current Work
[Loop: given the MDP with imprecisely specified reward set R, compute a robust policy → satisfied? If yes, done; if no, select a query, ask the user, refine R, and repeat]
54. Model : MDP 51
MDP - Reward Uncertainty
We quantify the strict uncertainty over reward
with a set of feasible reward functions R
We specify R using a set of linear
inequalities forming a polytope
Where do these inequalities come from?
Bound queries: Is r(s,a) > b?
Policy comparisons: Is fπ ·r > fπ′ ·r ?
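A sketch of how answered queries become rows of the polytope R = {r : Cr ≤ d}; the vector indexing of r over (s, a) pairs, the dimension, and the helper names are assumptions for illustration:

```python
import numpy as np

n = 50                       # number of (s, a) reward components (illustrative)
C_rows, d_vals = [], []

def add_bound_query(idx, b, answer_is_yes):
    """'Is r[idx] > b?'  Yes encodes -r[idx] <= -b; No encodes r[idx] <= b."""
    row = np.zeros(n)
    row[idx] = -1.0 if answer_is_yes else 1.0
    C_rows.append(row)
    d_vals.append(-b if answer_is_yes else b)

def add_policy_comparison(f_pi, f_other, prefers_pi):
    """'Is f_pi . r > f_other . r?'  Yes encodes (f_other - f_pi) . r <= 0."""
    diff = np.asarray(f_other) - np.asarray(f_pi)
    C_rows.append(diff if prefers_pi else -diff)
    d_vals.append(0.0)

add_bound_query(idx=3, b=0.5, answer_is_yes=True)
C, d = np.vstack(C_rows), np.array(d_vals)   # the polytope R: C r <= d
```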
55. Outline 1. Decision Theory
2. Preference Elicitation
3. MDPs
4. Current Work A. Imprecise Reward Specification
B. Computing Robust Policies
C. Preference Elicitation
D. Evaluation
E. Future Work
56. Computation 53
Minimax Regret
Original Formulation:
    min_{f∈F} max_{g∈F} max_{r∈R}  g·r − f·r
Benders' Decomposition:
    minimize_{f∈F, δ}  δ
    subject to:  δ ≥ g·r − f·r   ∀ g ∈ F, r ∈ R
57. Computation 54
Minimax Regret
Original Formulation:
    min_{f∈F} max_{g∈F} max_{r∈R}  g·r − f·r
Benders' Decomposition:
    minimize_{f∈F, δ}  δ
    subject to:  δ ≥ g·r − f·r   ∀ g ∈ V(F), r ∈ V(R)
The maxima are attained at vertices of F and R
Rather than enumerating an exponential number of vertices we use
constraint generation
58. Computation 55
Minimax Regret - Constraint Generation
1. We limit adversary
• Player minimizes regret w.r.t. a small set of adversary
responses
2. We untie adversary’s hands
• Adversary finds maximum regret w.r.t. player’s policy
• Add adversary’s choice of r and g to set of adversary
responses
Done when untying the adversary's hands yields no improvement,
• i.e., the player's minimized regret equals the adversary's maximized regret (a minimal sketch of this loop follows below)
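A minimal sketch of the loop over explicit, finite vertex sets; in the actual method the master step is an LP over all occupancy frequencies and the subproblem is the MIP on the following slides, so the brute-force inner searches here are stand-ins:

```python
import numpy as np

def minimax_regret_constraint_generation(F, R, tol=1e-8):
    """F: candidate player/adversary policies as occupancy-frequency vectors;
    R: reward vertices.  Both are explicit arrays here purely for illustration."""
    gen = [(F[0], R[0])]                                  # initial adversary responses
    while True:
        # 1. Limit adversary: minimize regret w.r.t. generated responses only.
        regrets = [max(r @ g - r @ f for g, r in gen) for f in F]
        i = int(np.argmin(regrets))
        f_star, master_regret = F[i], regrets[i]
        # 2. Untie adversary's hands: find max regret against f_star.
        g_best, r_best = max(((g, r) for g in F for r in R),
                             key=lambda gr: gr[1] @ gr[0] - gr[1] @ f_star)
        if r_best @ g_best - r_best @ f_star <= master_regret + tol:
            return f_star, master_regret                  # no improvement: done
        gen.append((g_best, r_best))                      # add violated constraint
```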
59. Computation 56
Constraint Generation - Player
1. Limit adversary
minimize_{f∈F, δ}  δ
subject to:  δ ≥ g·r − f·r   ∀ ⟨g, r⟩ ∈ Gen
60. Computation 57
Constraint Generation - Adversary
2. Untie adversary’s hands: Given player policy f
    max_{g∈F} max_{r∈R}  g·r − f·r
This is a non-convex (bilinear) optimization problem
We reformulate it as a mixed integer linear program
61. Computation 58
Constraint Generation - Adversary
Computation of MR(f, R) is realized by the following MIP, using value and Q-functions:
    maximize_{Q, V, I, r}  α·V − r·f
    subject to:  Q_a = r_a + γ P_a V         ∀a ∈ A
                 V ≥ Q_a                      ∀a ∈ A
                 V ≤ (1 − I_a) M_a + Q_a      ∀a ∈ A
                 C r ≤ d
                 Σ_a I_a = 1
                 I_a(s) ∈ {0, 1}              ∀a, s
    where M_a = M⊤ − M_a⊥ is a big-M constant
Here I represents the adversary's policy, with I_a(s) denoting the probability of action a being taken at state s
Only tractable for small Markov Decision Problems
62. Computation 59
Approximating Minimax Regret
We relax the Max Regret MIP formulation
The value of the resulting policy is no longer exact; however, the resulting reward is still feasible, so we find the optimal policy w.r.t. that reward
[Figure 3: Relative approximation error of the linear relaxation vs. number of states, measured against both max regret and minimax regret]
63. Computation 60
Scaling (Log Scale)
[Figure 2: Scaling of constraint generation with number of states — runtime (log scale) vs. number of states, exact minimax regret compared with the approximate version]
64. Outline 1. Decision Theory
2. Preference Elicitation
3. MDPs
4. Current Work A. Imprecise Reward Specification
B. Computing Robust Policies
C. Preference Elicitation
D. Evaluation
E. Future Work
65. Reward Elicitation 62
Reward Elicitation
[Loop: given the MDP with reward set R, compute a robust policy → satisfied? If yes, done; if no, select a query, ask the user, refine R, and repeat]
66. Reward Elicitation 63
Bound Queries
Query: Is r(s, a) > b?
    where b is a point between the upper and lower bounds on r(s, a)
Gap: Δ(s, a) = max_{r′∈R} r′(s, a) − min_{r∈R} r(s, a)
At each step of elicitation we need to select the parameters s, a and b using the gap
67. Reward Elicitation 64
Selecting Bound Queries
Halve the Largest Gap (HLG):
    Select the (s, a) with the largest gap Δ(s, a)
    Set b to the midpoint of the interval for r(s, a)
Current Solution (CS):
    Use the current solution g(s, a) [or f(s, a)] of the minimax regret calculation to weight each gap Δ(s, a)
    Select the (s, a) with the largest weighted gap g(s, a)Δ(s, a)
    Set b to the midpoint of the interval for r(s, a)
(A minimal sketch of both strategies follows below)
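Both strategies reduce to one scoring rule; a minimal sketch (array shapes and names are assumptions):

```python
import numpy as np

def select_bound_query(r_lo, r_hi, occupancy=None):
    """Pick the next query 'Is r(s, a) > b?'.
    r_lo, r_hi: |S| x |A| lower/upper bounds on reward.
    occupancy: g(s, a) or f(s, a) from the minimax-regret solution for the
    Current Solution strategy; None gives Halve the Largest Gap."""
    gap = r_hi - r_lo
    score = gap if occupancy is None else occupancy * gap
    s, a = np.unravel_index(np.argmax(score), score.shape)
    b = 0.5 * (r_lo[s, a] + r_hi[s, a])      # midpoint of the interval
    return s, a, b
```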
68. Outline 1. Decision Theory
2. Preference Elicitation
3. MDPs
4. Current Work A. Imprecise Reward Specification
B. Computing Robust Policies
C. Preference Elicitation
D. Evaluation
E. Future Work
69. Evaluation 66
Experimental Setup
Randomly generated MDPs
Semi-sparse random transition function,
discount factor of 0.95
Random true reward drawn from fixed interval,
upper and lower bounds on reward drawn
randomly
All results are averaged over 20 runs; each run uses an MDP with 10 states and 5 actions
70. Evaluation 67
Elicitation Effectiveness
We examine the combination of each criterion for computing robust policies with each of the elicitation strategies:
    Minimax Regret (MMR)        Halve the Largest Gap (HLG)
    Maximin (MR)                Current Solution (CS)
73. Evaluation 70
Queries per Reward Point - Random MDP
Queries per Reward Point - Random MDP
[Plot: number of reward points vs. the number of queries each received]
Most of reward space is left unexplored
We repeatedly query a small set of "high impact" reward points
74. Evaluation 71
Autonomic Computing
Setup: 2 hosts, 3 demand levels, 3 units of resource
Model: 90 states, 10 actions
[Diagram: each host has a demand level and is allocated a share of the total resource]
77. Outline 1. Decision Theory
2. Preference Elicitation
3. MDPs
4. Current Work A. Imprecise Reward Specification
B. Computing Robust Policies
C. Preference Elicitation
D. Evaluation
E. Future Work
78. Introduction 75
Overview
[Loop: given the MDP with reward set R, compute a robust policy → satisfied? If yes, done; if no, select a query, ask the user, refine R, and repeat]
79. Introduction 75
Contributions
1. A technique for finding robust policies using minimax regret
2. A simple elicitation procedure that quickly leads to near-optimal or optimal policies
80. Conclusion 76
Future Work
Scaling
Bottleneck: the adversary's max regret computation
Idea: the set Γ of adversary policies g that will ever be a regret-maximizing response can be small
Approaches that use Γ to efficiently compute max regret
We have: an algorithm to find Γ, and a theorem showing that the algorithm runs in time polynomial in the number of policies found
81. Conclusion 77
Future Work
Factored MDPs
Working with factored MDPs will:
    Model problems in a more natural way
    Allow us to lower the dimensionality of the reward function
    Leverage existing techniques for scaling MDPs that take advantage of factored structure
82. Conclusion 78
Future Work
Richer Queries
In state s, which action would you like to take?
In state s, do you prefer action a1 to a2?
Do you prefer the sequence s1, a1, s2, a2, …, sk to s′1, a′1, s′2, a′2, …, s′k?
83. Conclusion 79
Future Work
Richer Queries (tradeoffs)
Do you prefer the tradeoff
    f(s2, a3) = f1 amount of time doing (s2, a3) and f(s1, a4) = f2 amount of time doing (s1, a4)
or
    f′(s2, a3) = f′1 amount of time doing (s2, a3) and f′(s1, a4) = f′2 amount of time doing (s1, a4)?
[Example diagram: two alternatives trading off time between (s = Cab Available, a = Take Cab) and (s = No Street Car, a = Waiting), with the frequencies f1 and f2 swapped between them]
87. Appendix 82
Full Formulation
Master (Benders' relaxation of minimax regret):
    minimize_{f, δ}  δ
    subject to:  r·g − r·f ≤ δ    ∀ ⟨g, r⟩ ∈ Gen
                 f a valid occupancy frequency (γE f + α = 0 in the paper's matrix notation)
Subproblem (max regret MR(f, R) as a MIP, using value and Q-functions):
    maximize_{Q, V, I, r}  α·V − r·f
    subject to:  Q_a = r_a + γ P_a V         ∀a ∈ A
                 V ≥ Q_a                      ∀a ∈ A
                 V ≤ (1 − I_a) M_a + Q_a      ∀a ∈ A
                 C r ≤ d
                 Σ_a I_a = 1
                 I_a(s) ∈ {0, 1}              ∀a, s
This corresponds to the standard dual LP formulation of an MDP with the addition of adversarial policy constraints; I represents the adversary's policy, with I_a(s) denoting the probability of action a being taken at state s
89. Computation 84
Regret Gap vs Time
[Plot: regret gap vs. time (ms) as constraint generation proceeds]
Editor's notes
Markov decision processes are an extremely useful model for decision making in stochastic environments. To use it we need to know dynamics and the rewards involved. A lot of work has been done on learning dynamics both in an offline and online setting.
Markov decision processes are an extremely useful model for decision making in stochastic environments. To use it we need to know the dynamics and the rewards involved. A lot of work has been done on learning dynamics both in an offline and online setting. Rewards are often assumed to be directly observable parts of the world. My perspective is that "rewards are in people's heads": in some cases there is a simple mapping between what you want (in your head) and the world (i.e., finding the shortest path).
Markov decision processes are an extremely useful model for decision making in stochastic environments. To use it we need to know the dynamics and the rewards involved. A lot of work has been done on learning dynamics both in an offline and online setting. In some simple cases, rewards can be thought of as being directly observable: for instance, the distance travelled in a robot navigation problem where we are trying to get a robot from point A to point B. When I am in my car trying to get from point A to point B I want the path with the fewest stoplights, someone else may want the path with the nicest scenery, while someone else may sacrifice some stoplights for some scenery. Reward is a surrogate for subjective preferences ... flip slide (sometimes it's easy but)
The dynamics in combination simple bounds on reward function lead to areas of reward space not having a high impact on the value of a policy
Maximin is common but we use regret
So for f2, no matter what the instantiation of reward, by changing policy the player only stood to gain by one. This is an intuitive measure.
Robust MDP literature often assumes transitions are known
transitions are learnable - do not change from user to user
Convergence properties
explain how constraints create max
Question: Is it worth spending more time to familiarize the audience with "occupancy frequencies"?
I will vocally mention other representations
In the voice over I will explain how each reformulation proceeds from the previous expression
I would also like to give a clear intuition as to why this is inherently hard.
On average less than 10% error
I will also note that on a 90-state MDP with 16 actions the relaxation computes minimax regret in less than 3 seconds.
Here I will review the preference elicitation process
Note that it is useful in non-sequential settings as well
Now I have left out the autonomic computing results, due to lack of time. If there is a little time, after giving the results for the random MDPs I can state that we have similar results for a large MDP instance.
20 runs --> 20 MDPs with 10 states and 5 actions
We have a partially specified reward function; we compute a robust policy that minimizes maximum regret --> We then elicit information about the reward function, which leads to a better policy --> We continue this process until we have an optimal policy or the regret guarantee is small enough
I will give the context in the voice over. The main idea on this slide is that: in practice constraint generation quickly converges. To segue to the next slide I will recall that we still need to solve a MIP with |S||A| variables and constraints, thus we developed an approximation.