Regret-Based Reward Elicitation for
Markov Decision Processes
Kevin Regan                           University of Toronto
Craig Boutilier
Introduction   2




Motivation
Introduction   3




Motivation

 Markov Decision Processes have proven to be an extremely useful
 model for decision making in stochastic environments

    •   Model requires dynamics and rewards
Introduction   4




Motivation

 Markov Decision Processes have proven to be an extremely useful
 model for decision making in stochastic environments

    •   Model requires dynamics and rewards

 Specifying dynamics a priori can be difficult

    •   We can learn a model of the world in either an offline or online
        (reinforcement learning) setting
Introduction   5




Motivation

 Markov Decision Processes have proven to be an extremely useful
 model for decision making in stochastic environments

    •   Model requires dynamics and rewards

 In some simple cases reward can be thought of as being directly
 “observed”

    •   For instance: the reward in a robot navigation problem
        corresponding to the distance travelled
Introduction   6




Motivation

 Except in some simple cases, the specification of reward
 functions for MDPs is problematic

    •   Rewards can vary user-to-user

    •   Preferences about which states/actions are “good” and “bad”
        need to be translated into precise numerical reward

    •   Time consuming to specify reward for all states/actions

 Example domain: assistive technology
Introduction   7




Motivation

 However,

    •   Near-optimal policies can be found without a fully specified
        reward function

    •   We can bound the performance of a policy using regret
Outline   1. Decision Theory
          2. Preference Elicitation
          3. MDPs
          4. Current Work
Decision Theory   9




Utility	

 Given     A decision maker (DM)
           A set of possible outcomes Θ
           A set of lotteries L of the form:
               l ≡ 〈 p1 , x1 , p2 , x2 , … , pn , xn 〉   where xi ∈ Θ,  Σi pi = 1

                l ≡ 〈 x1 , p, x2 〉 = 〈 p, x1 , (1 − p), x2 〉
           Compound lotteries
                l1 = 〈 0.75, x, 0.25, 〈 0.6, y, 0.4, z 〉 〉

           [Tree diagrams: l1 over x and the sub-lottery l2; l2 over y and z.]
Decision Theory   9




Utility	

 Given     A decision maker (DM)
           A set of possible outcomes Θ
           A set of lotteries L of the form:
               l ≡ 〈 p1 , x1 , p2 , x2 , … , pn , xn 〉   where xi ∈ Θ,  Σi pi = 1

                l ≡ 〈 x1 , p, x2 〉 = 〈 p, x1 , (1 − p), x2 〉
           Compound lotteries
                l1 = 〈 0.75, x, 0.25, 〈 0.6, y, 0.4, z 〉 〉 = 〈 0.75, x, 0.15, y, 0.1, z 〉

           [Tree diagrams: l1 over x and the sub-lottery l2; l2 over y and z;
           and the reduced simple lottery over x, y, z.]
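To make the reduction concrete, here is a small Python sketch (not from the slides) that flattens a nested lottery; a lottery is represented as a list of (probability, outcome) pairs, and an outcome may itself be such a list.

def reduce_lottery(lottery):
    """Flatten a compound lottery into a simple lottery over outcomes."""
    flat = {}
    for p, outcome in lottery:
        if isinstance(outcome, list):              # nested sub-lottery
            for q, x in reduce_lottery(outcome):   # recurse, then chain probabilities
                flat[x] = flat.get(x, 0.0) + p * q
        else:
            flat[outcome] = flat.get(outcome, 0.0) + p
    return [(prob, x) for x, prob in flat.items()]

# Example from the slide: <0.75, x, 0.25, <0.6, y, 0.4, z>>
l1 = [(0.75, "x"), (0.25, [(0.6, "y"), (0.4, "z")])]
print(reduce_lottery(l1))   # approximately [(0.75, 'x'), (0.15, 'y'), (0.1, 'z')]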
Decision Theory   10




Utility	

 Axioms    Completeness
           Transitivity
           Independence
           Continuity
Decision Theory   11




Utility	

 Axioms    Completeness   For x, y ∈ Θ
           Transitivity   It is the case that either:
           Independence       x is weakly preferred to y : x ≽ y
           Continuity         y is weakly preferred to x : y ≽ x
                              One is indifferent : x ~ y
Decision Theory   12




Utility	

 Axioms    Completeness   For any x, y, z ∈ Θ
           Transitivity   If x ≽ y and y ≽ z
           Independence   Then x ≽ z
           Continuity
Decision Theory   13




Utility	

 Axioms    Completeness   For every l1 , l2 , l3 ∈ L and p ∈ (0,1)
           Transitivity   If l1 ≻ l2
           Independence   Then 〈 l1 , p, l3 〉 ≻ 〈 l2 , p, l3 〉
           Continuity
Decision Theory   14




Utility	

 Axioms    Completeness   For every l1 , l2 , l3 ∈ L
           Transitivity   If l1 ≻ l2 ≻ l3
           Independence   Then for some p ∈ (0,1) :
           Continuity       l2 ~ 〈 l1 , p, l3 〉
Decision Theory   15




Utility	

 Axioms    Completeness   There exists a utility function u : Θ → ℝ
           Transitivity   Such that:
           Independence     u(x) ≥ u(y) ⇔ x ≽ y
           Continuity
                            u(l) = u( 〈 p1 , x1 , … , pn , xn 〉 ) = Σ_{i=1}^{n} pi u(xi)



                          The utility of a lottery is the
                          expected utility of its outcomes
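A one-line Python rendering of this identity (purely illustrative; the utility values below are made up):

def expected_utility(lottery, u):
    """Utility of a simple lottery: u(l) = sum_i p_i * u(x_i).

    `lottery` is a list of (probability, outcome) pairs and `u` maps
    outcomes to utilities (here just a dict)."""
    return sum(p * u[x] for p, x in lottery)

u = {"x": 1.0, "y": 0.6, "z": 0.0}   # hypothetical utilities
print(expected_utility([(0.75, "x"), (0.15, "y"), (0.1, "z")], u))   # 0.84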
Outline   1. Decision Theory
          2. Preference Elicitation
          3. MDPs
          4. Current Work
Preference Elicitation   17




Queries

 Ranking
                   Please order this subset of outcomes
 Standard Gamble
                        〈x1 , x2 ,…, xm 〉
 Bound

                       u(x1) ≥ u(x2) ≥ u(x3) ≥ ⋯ ≥ u(xm)
Preference Elicitation   18




Queries

 Ranking
                   Please choose a p for which you
 Standard Gamble
                   are indifferent between y and the
 Bound
                   lottery 〈 x⊤ , p, x⊥ 〉

                         y ~ 〈 x⊤ , p, x⊥ 〉

                         u(y) = p
Preference Elicitation   19




Queries

 Ranking
                   Please choose a b for which y is at
 Standard Gamble
                   least as good as the lottery 〈 x⊤ , b, x⊥ 〉
 Bound
                        y ≽ 〈 x⊤ , b, x⊥ 〉


                        u(y) ≥ b
Preference Elicitation   20




Preference Elicitation

 Rather than fully specifying a utility function, we
 1. Make decision w.r.t. an imprecisely specified utility function
 2. Perform elicitation until we are satisfied with the decision


        Prob


                          Make Decision      Satisfied?   YES        Done
           Util


                                                NO




        User              Select Query
Preference Elicitation   21




Robust Decision Criteria	

 Maximax          Given a set of feasible utility functions U
 Maximin
                              arg max  max  u(x)
 Minimax Regret                 x∈Θ    u∈U




                               u1    u2    u3    Max Regret

                         x1     8     2     1

                         x2     7     7     1

                         x3     2     2     2
Preference Elicitation   23




Robust Decision Criteria	

 Maximax          Given a set of feasible utility functions U
 Maximin
                              arg max  min  u(x)
 Minimax Regret                 x∈Θ    u∈U




                               u1    u2    u3    Max Regret

                         x1     8     2     1

                         x2     7     7     1

                         x3     2     2     2
Preference Elicitation   25




Robust Decision Criteria	

 Maximax          Given a set of feasible utility functions U
 Maximin
                        arg min  max   max  u(x′) − u(x)
 Minimax Regret           x∈Θ    x′∈Θ  u∈U




                               u1    u2    u3    Max Regret

                         x1     8     2     1

                         x2     7     7     1

                         x3     2     2     2
Preference Elicitation   27




Robust Decision Criteria	

 Maximax          Given a set of feasible utility functions U
 Maximin
                        arg min  max   max  u(x′) − u(x)
 Minimax Regret           x∈Θ    x′∈Θ  u∈U




                               u1    u2    u3    Max Regret

                         x1     8     2     1        5

                         x2     7     7     1

                         x3     2     2     2
Preference Elicitation   29




Robust Decision Criteria	

 Maximax          Given a set of feasible utility functions U
 Maximin
                        arg min  max   max  u(x′) − u(x)
 Minimax Regret           x∈Θ    x′∈Θ  u∈U




                               u1    u2    u3    Max Regret

                         x1     8     2     1        5

                         x2     7     7     1        1

                         x3     2     2     2
Preference Elicitation   31




Robust Decision Criteria	

 Maximax          Given a set of feasible utility functions U
 Maximin
                        arg min  max   max  u(x′) − u(x)
 Minimax Regret           x∈Θ    x′∈Θ  u∈U




                               u1    u2    u3    Max Regret

                         x1     8     2     1        5

                         x2     7     7     1        1

                         x3     2     2     2        6
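The three criteria can be checked against the table above with a few lines of Python (a sketch, not part of the slides); rows are outcomes, columns are the feasible utility functions u1 to u3.

import numpy as np

U = np.array([[8, 2, 1],     # u1(x), u2(x), u3(x) for x1
              [7, 7, 1],     # x2
              [2, 2, 2]])    # x3

maximax = np.argmax(U.max(axis=1))       # best best-case outcome: x1
maximin = np.argmax(U.min(axis=1))       # best worst-case outcome: x3
regret = U.max(axis=0) - U               # regret of each outcome under each u
max_regret = regret.max(axis=1)          # [5, 1, 6], matching the table
minimax_regret = np.argmin(max_regret)   # x2
print(maximax, maximin, minimax_regret, max_regret)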
Preference Elicitation   33




Bayesian Decision Criteria	

 Expected Utility   Assuming we have a prior φ over
 Value At Risk      potential utility functions
Preference Elicitation   34




Bayesian Decision Criteria	

 Expected Utility   Assuming we have a prior φ over
 Value At Risk      potential utility functions

                         arg max  E^φ_u [ u(x) ]
                           x∈Θ
Preference Elicitation   35




Bayesian Decision Criteria	

 Expected Utility      Assuming we have a prior φ over
 Value At Risk         potential utility functions

                           arg max  max { δ : Pr^φ_u( u(x) ≥ δ ) ≥ η }
                             x∈Θ     δ

                    [Illustration: distribution over u(x) with η = 90% of the
                    probability mass lying above the threshold δ.]
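A minimal sketch of this criterion under a sampled prior (assuming we can draw utility functions from φ; the names are mine, not from the slides): for each outcome, δ is the largest threshold with at least η of the sampled utility mass above it, i.e. roughly the (1 − η)-quantile of the sampled u(x) values.

import numpy as np

def value_at_risk_choice(utility_samples, eta=0.9):
    """utility_samples[x] is an array of u(x) values for utility functions
    drawn from the prior phi; pick the x with the largest eta-VaR."""
    deltas = {x: np.quantile(samples, 1.0 - eta)   # Pr(u(x) >= delta) >= eta
              for x, samples in utility_samples.items()}
    return max(deltas, key=deltas.get), deltas

rng = np.random.default_rng(0)
samples = {"x1": rng.normal(5, 3, 1000), "x2": rng.normal(4, 1, 1000)}
print(value_at_risk_choice(samples, eta=0.9))   # x2 typically wins: less downside risk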
Outline   1. Decision Theory
          2. Preference Elicitation
          3. MDPs
          4. Current Work
Markov Decision Processes    37




Markov Decision Process (MDP)

S - Set of States

A - Set of Actions

Pr(s' | a, s) - Transitions

α - Starting State Distribution

γ - Discount Factor

r(s) - Reward [or r(s, a)]

[Diagram: agent-world interaction loop with states s_t, actions a_t,
and rewards r_t over time.]
Markov Decision Processes    37




Markov Decision Process (MDP)

Known   S - Set of States

        A - Set of Actions

        Pr(s' | a, s) - Transitions

        α - Starting State Distribution

        γ - Discount Factor

  ?     r(s) - Reward [or r(s, a)]

        [Diagram: agent-world interaction loop with states s_t, actions a_t,
        and rewards r_t over time.]
Markov Decision Processes   38




MDP - Policies

 Policy   A stationary policy π maps each state to an action
                 For infinite-horizon MDPs there is always an
                 optimal policy that is stationary



 Policy   Given a policy π , the value of a state is
 Value
                  V^π(s_0) = E[ Σ_{t=0}^∞ γ^t r_t | π , s_0 ]
Markov Decision Processes   39




MDP - Computing Value Function

The value of a policy can be found by successive approximation


             V_0^π(s) = r(s, a_π)
             V_1^π(s) = r(s, a_π) + γ Σ_{s'} Pr(s' | s, a_π) V_0^π(s')
                ⋮
             V_k^π(s) = r(s, a_π) + γ Σ_{s'} Pr(s' | s, a_π) V_{k-1}^π(s')


 There will exist a fixed point

             V^π(s) = r(s, a_π) + γ Σ_{s'} Pr(s' | s, a_π) V^π(s')
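A short Python sketch of this successive-approximation scheme (the array shapes and names are my own conventions, not from the slides):

import numpy as np

def evaluate_policy(P, r, pi, gamma, iters=1000):
    """Successive approximation of V^pi.

    P: (A, S, S) transition probabilities P[a, s, s'];  r: (S, A) rewards;
    pi: (S,) integer array, pi[s] = action taken in state s."""
    S = r.shape[0]
    V = np.zeros(S)
    for _ in range(iters):
        # V_{k+1}(s) = r(s, pi(s)) + gamma * sum_{s'} Pr(s'|s, pi(s)) V_k(s')
        V = np.array([r[s, pi[s]] + gamma * P[pi[s], s] @ V for s in range(S)])
    return V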
Markov Decision Processes   40




MDP - Optimal Value Functions


 Optimal     We wish to find the optimal policy π*
 Policy
               π* :  V^{π*} ≥ V^{π'}   ∀ π'




 Bellman     V^{π*}(s) = max_a [ r(s, a) + γ Σ_{s'} Pr(s' | s, a) V^{π*}(s') ]
 Equation
Markov Decision Processes   41




Value Iteration Algorithm

             Yields an ε-optimal policy
             1. Initialize V_0 , set n = 0, choose ε > 0
             2. For each s :
                        V_{n+1}(s) = max_a [ r(s, a) + γ Σ_{s'} Pr(s' | s, a) V_n(s') ]
             3. If  || V_{n+1} − V_n || > ε (1 − γ) / (2γ) :
                        increment n and return to step 2


 We can recover the policy by finding the best one-step action
    π(s) = arg max_a [ r(s, a) + γ Σ_{s'} Pr(s' | s, a) V(s') ]
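Value iteration itself, with the stopping rule above, is equally short; again a sketch with my own array conventions rather than anything from the slides:

import numpy as np

def value_iteration(P, r, gamma, eps=1e-3):
    """P: (A, S, S) transitions, r: (S, A) rewards; returns (V, greedy policy)."""
    A, S, _ = P.shape
    V = np.zeros(S)
    while True:
        # Q[s, a] = r(s, a) + gamma * sum_{s'} Pr(s'|s, a) V(s')
        Q = r + gamma * np.einsum("ast,t->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) <= eps * (1 - gamma) / (2 * gamma):
            return V_new, Q.argmax(axis=1)   # epsilon-optimal greedy policy
        V = V_new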
Markov Decision Processes   42




Linear Programming Formulation




         minimize_V    Σ_s α(s) V(s)

         subject to    V(s) ≥ r(s, a) + γ Σ_{s'} Pr(s' | s, a) V(s')   ∀ a, s
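A sketch of this LP using scipy.optimize.linprog (the helper name and array layout are assumptions of mine, not part of the slides):

import numpy as np
from scipy.optimize import linprog

def solve_mdp_primal_lp(P, r, alpha, gamma):
    """min_V alpha.V  s.t.  V(s) >= r(s,a) + gamma * sum_{s'} Pr(s'|s,a) V(s').

    P: (A, S, S) transitions, r: (S, A) rewards, alpha: (S,) start distribution."""
    A, S, _ = P.shape
    # Rewrite each constraint as (gamma * P_a - I) V <= -r(., a)
    A_ub = np.vstack([gamma * P[a] - np.eye(S) for a in range(A)])
    b_ub = np.concatenate([-r[:, a] for a in range(A)])
    res = linprog(c=alpha, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * S)
    return res.x   # the optimal value function V*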
Markov Decision Processes   43




MDP - Occupancy Frequencies


  f (s, a)   An occupancy frequency f (s, a) expresses the
             total discounted probability of being in state s
             and taking action a




  Valid      Σ_a f(s_0, a) − γ Σ_s Σ_a Pr(s_0 | s, a) f(s, a) = α(s_0)   ∀ s_0
  f (s, a)
Markov Decision Processes   44




LP - Occupancy Frequency




     min_V    Σ_s α(s) V(s)

     subj:    V(s) ≥ r(s, a) + γ Σ_{s'} Pr(s' | s, a) V(s')   ∀ a, s




     max_f    Σ_s Σ_a f(s, a) r(s, a)

     subj:    Σ_a f(s_0, a) − γ Σ_s Σ_a Pr(s_0 | s, a) f(s, a) = α(s_0)   ∀ s_0
Markov Decision Processes   44




LP - Occupancy Frequency

                                   Σ_s Σ_a f(s, a) r(s, a)   =   Σ_s α(s) V(s)




     min_V    Σ_s α(s) V(s)

     subj:    V(s) ≥ r(s, a) + γ Σ_{s'} Pr(s' | s, a) V(s')   ∀ a, s




     max_f    Σ_s Σ_a f(s, a) r(s, a)

     subj:    Σ_a f(s_0, a) − γ Σ_s Σ_a Pr(s_0 | s, a) f(s, a) = α(s_0)   ∀ s_0
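And a matching sketch of the dual LP over occupancy frequencies (again with names and layout of my own choosing); the optimal deterministic policy can be read off from each state's dominant frequency:

import numpy as np
from scipy.optimize import linprog

def solve_mdp_dual_lp(P, r, alpha, gamma):
    """max_f sum_{s,a} f(s,a) r(s,a)  subject to the flow constraints
    sum_a f(s0,a) - gamma * sum_{s,a} Pr(s0|s,a) f(s,a) = alpha(s0),  f >= 0."""
    A, S, _ = P.shape
    n = S * A                         # f flattened in (s, a) order
    A_eq = np.zeros((S, n))
    for s0 in range(S):
        for s in range(S):
            for a in range(A):
                A_eq[s0, s * A + a] -= gamma * P[a, s, s0]
        for a in range(A):
            A_eq[s0, s0 * A + a] += 1.0
    res = linprog(c=-r.reshape(-1), A_eq=A_eq, b_eq=alpha, bounds=[(0, None)] * n)
    f = res.x.reshape(S, A)
    policy = f.argmax(axis=1)         # action with positive occupancy in each state
    return f, policy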
Markov Decision Processes   45




MDP Summary Slide

  Policies   Over the past couple of decades, there has
  Dynamics   been a lot of work done on scaling MDPs
  Rewards
                             Factored Models
                             Decomposition
                             Linear Approximation
Markov Decision Processes   46




MDP Summary Slide

  Policies   To use these algorithms we need a model of
  Dynamics   the dynamics (transition function). There are
             techniques for:
  Rewards

                              Deriving models of
                              dynamics from data.


                              Finding policies that are robust
                              to inaccurate transition models
Markov Decision Processes   47




MDP Summary Slide

  Policies   There has been comparatively little work on
  Dynamics   specifying rewards
  Rewards
                             Finding policies that are robust
                             to imprecise reward models


                             Eliciting reward information
                             from users
Outline   1. Decision Theory
          2. Preference Elicitation
          3. MDPs
          4. Current Work
Outline   1. Decision Theory
          2. Preference Elicitation
          3. MDPs
          4. Current Work      A. Imprecise Reward Specification
                               B.   Computing Robust Policies
                               C. Preference Elicitation
                               D. Evaluation
                               E.   Future Work
Text   50




Current Work


 MDP

                 Compute
                               Satisfied?   YES   Done
               Robust Policy
   R


                                  NO




 User          Select Query
Model : MDP   51




MDP - Reward Uncertainty

We quantify the strict uncertainty over reward
with a set of feasible reward functions R


We specify R using a set of linear
inequalities forming a polytope



Where do these inequalities come from?


        Bound queries: Is r(s,a) > b?
        Policy comparisons: Is fπ ·r > fπ′ ·r ?
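As a tiny illustration (the helper name and flattened reward layout are my own assumptions, not from the slides), a bound-query answer can be folded into the polytope R = {r : Cr ≤ d} by appending one row:

import numpy as np

def add_bound_constraint(C, d, s, a, b, answer, num_states, num_actions):
    """Append the linear constraint implied by a bound query "Is r(s,a) > b?".

    Rewards are flattened in (s, a) order; R is the polytope {r : C r <= d}.
    A "yes" answer gives -r(s,a) <= -b, a "no" answer gives r(s,a) <= b."""
    row = np.zeros(num_states * num_actions)
    row[s * num_actions + a] = -1.0 if answer else 1.0
    rhs = -b if answer else b
    return np.vstack([C, row]), np.append(d, rhs)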
Outline   1. Decision Theory
          2. Preference Elicitation
          3. MDPs
          4. Current Work      A.   Imprecise Reward Specification

                               B. Computing Robust Policies
                               C. Preference Elicitation
                               D. Evaluation
                               E.   Future Work
Computation   53




Minimax Regret


Original        min max max g ·r − f ·r
Formulation     f∈F    g∈F   r∈R




Benders’        minimize           δ
                  f∈F , δ
Decomposition
                subject to : δ ≥ g ·r − f ·r ∀ g ∈F r ∈R
Computation   54




Minimax Regret


Original         min max max g ·r − f ·r
Formulation       f∈F    g∈F   r∈R




Benders’         minimize            δ
                    f∈F , δ
Decomposition
                 subject to : δ ≥ g ·r − f ·r   ∀ g ∈ V(F), r ∈ V(R)

 Maxima will occur at the vertices of F and R
 Rather than enumerating an exponential number of vertices we use
 constraint generation
Computation   55




Minimax Regret - Constraint Generation

  1.        We limit adversary
               •   Player minimizes regret w.r.t. a small set of adversary
                   responses

  2.        We untie adversary’s hands
               •   Adversary finds maximum regret w.r.t. player’s policy
               •   Add adversary’s choice of r and g to set of adversary
                   responses

  Done when: untying the adversary's hands yields no improvement
       •   i.e. the player's minimized regret equals the adversary's maximized regret
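As an illustrative sketch of the player's side of this loop (not the authors' implementation; the array layout and function name are mine), the master problem below minimizes δ over valid occupancy frequencies f subject to the regret constraints generated so far. An adversary oracle, such as the MIP a few slides ahead, would supply a new 〈g, r〉 pair after each solve until no constraint is violated.

import numpy as np
from scipy.optimize import linprog

def master_problem(P, alpha, gamma, gen):
    """Master LP of the constraint-generation scheme (a sketch).

    Minimizes delta over valid occupancy frequencies f subject to
        delta >= g.r - f.r   for every generated pair (g, r) in `gen`.
    P: (A, S, S) transitions, alpha: (S,), gen: list of (g, r), each (S, A).
    Returns (f, delta)."""
    A, S, _ = P.shape
    n = S * A                                  # f flattened in (s, a) order
    c = np.zeros(n + 1); c[-1] = 1.0           # minimize delta

    # Flow constraints: sum_a f(s0,a) - gamma * sum_{s,a} Pr(s0|s,a) f(s,a) = alpha(s0)
    A_eq = np.zeros((S, n + 1))
    for s0 in range(S):
        for s in range(S):
            for a in range(A):
                A_eq[s0, s * A + a] -= gamma * P[a, s, s0]
        for a in range(A):
            A_eq[s0, s0 * A + a] += 1.0

    # Regret constraints: (g - f).r <= delta  <=>  -r.f - delta <= -g.r
    A_ub, b_ub = [], []
    for g, r in gen:
        row = np.zeros(n + 1)
        row[:n] = -r.reshape(-1)
        row[-1] = -1.0
        A_ub.append(row)
        b_ub.append(-np.sum(g * r))

    # delta >= 0 keeps the LP bounded even before any constraints are generated
    bounds = [(0, None)] * n + [(0, None)]
    res = linprog(c, A_ub=np.array(A_ub) if A_ub else None,
                  b_ub=np.array(b_ub) if b_ub else None,
                  A_eq=A_eq, b_eq=alpha, bounds=bounds)
    return res.x[:n].reshape(S, A), res.x[-1]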
Computation   56




Constraint Generation - Player


  1.     Limit adversary

         minimize      δ
            f∈F , δ

         subject to : δ ≥ g ·r − f ·r ∀ 〈 g, r 〉 ∈GEN
Computation   57




Constraint Generation - Adversary


  2.    Untie adversary’s hands: Given player policy f


         max max g ·r − f ·r
         g∈F   r∈R



                            This formulation is non-convex
                            (the objective is bilinear in g and r)


                            We reformulate as a mixed
                            integer linear program
                                                         Computation    58




Constraint Generation - Adversary


  2.    Untie adversary's hands: computation of MR(f, R) is realized
        by the following MIP, using value and Q-functions:

            maximize_{Q, V, I, r}   α · V − r · f

            subject to:  Q_a = r_a + γ P_a V          ∀ a ∈ A
                         V ≥ Q_a                      ∀ a ∈ A
                         V ≤ (1 − I_a) M_a + Q_a      ∀ a ∈ A
                         Cr ≤ d
                         Σ_a I_a = 1
                         I_a(s) ∈ {0, 1}              ∀ a, s
                         M_a = M⊤ − M_a⊥

        Here I represents the adversary's policy, with I_a(s) denoting
        the probability of action a being taken at state s


  Only tractable for small Markov Decision Problems
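For illustration only, the adversary's MIP can be sketched with an off-the-shelf solver such as PuLP. This is not the authors' code: it stands in for the general polytope Cr ≤ d with simple per-entry bounds r_lo ≤ r ≤ r_hi, and it uses a single user-supplied big-M constant.

import pulp

def max_regret_mip(P, f, alpha, gamma, r_lo, r_hi, big_M):
    """Sketch of MR(f, R): the adversary picks a reward r (box bounds here)
    and a deterministic policy I maximizing alpha.V - r.f.
    P: (A, S, S), f / r_lo / r_hi: (S, A), alpha: (S,)."""
    A, S = len(P), len(alpha)
    prob = pulp.LpProblem("max_regret", pulp.LpMaximize)
    V = [pulp.LpVariable(f"V_{s}") for s in range(S)]
    Q = [[pulp.LpVariable(f"Q_{a}_{s}") for s in range(S)] for a in range(A)]
    I = [[pulp.LpVariable(f"I_{a}_{s}", cat="Binary") for s in range(S)] for a in range(A)]
    r = [[pulp.LpVariable(f"r_{s}_{a}", lowBound=r_lo[s][a], upBound=r_hi[s][a])
          for a in range(A)] for s in range(S)]

    # Objective: adversary's value minus the player's value under reward r
    prob += (pulp.lpSum(alpha[s] * V[s] for s in range(S))
             - pulp.lpSum(r[s][a] * f[s][a] for s in range(S) for a in range(A)))

    for a in range(A):
        for s in range(S):
            # Q_a = r_a + gamma * P_a V
            prob += Q[a][s] == r[s][a] + gamma * pulp.lpSum(P[a][s][t] * V[t] for t in range(S))
            prob += V[s] >= Q[a][s]
            # Big-M: V(s) equals Q_a(s) for the action the adversary selects (I_a(s) = 1)
            prob += V[s] <= Q[a][s] + (1 - I[a][s]) * big_M
    for s in range(S):
        prob += pulp.lpSum(I[a][s] for a in range(A)) == 1   # one action per state

    prob.solve()
    return pulp.value(prob.objective)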
                                                         Computation    59




Approximating Minimax Regret


  We relax the Max Regret MIP formulation

  The value of the resulting policy is no longer exact; however, the
  resulting reward is still feasible. We find the optimal policy w.r.t.
  the resulting reward

  [Figure 3: Relative approximation error of the linear relaxation,
  w.r.t. max regret and w.r.t. minimax regret, vs. number of states.]
Computation   60




Scaling (Log Scale)

[Figure 2: Scaling of constraint generation with number of states:
time in ms (log scale) vs. number of states, for exact and approximate
minimax regret.]
Outline   1. Decision Theory
          2. Preference Elicitation
          3. MDPs
          4. Current Work      A.   Imprecise Reward Specification
                               B.   Computing Robust Policies

                               C. Preference Elicitation
                               D. Evaluation
                               E.   Future Work
Reward Elicitation   62




Reward Elicitation


 MDP

                  Compute
                                Satisfied?   YES              Done
                Robust Policy
    R


                                   NO




  User          Select Query
Reward Elicitation   63




Bound Queries

 Query    Is r(s,a) > b?

                where b is a point between the
                upper and lower bounds of r(s,a)



 Gap     Δ(s, a) = max_{r'∈R} r'(s, a) − min_{r∈R} r(s, a)


                At each step of elicitation we need
                to select the s, a parameters
                and b using the gap:
Reward Elicitation   64




Selecting Bound Queries

Halve the Largest Gap (HLG)     Current Solution (CS)


 Select the s,a with the         Use the current solution g(s,a)
 largest gap Δ(s,a)              [or f(s,a)] of the minimax
                                 regret calculation to weight
 Set b to the midpoint of the    each gap Δ(s,a)
 interval for r(s,a)
                                 Select the s,a with the largest
                                 weighted gap g(s,a)Δ(s, a)

                                 Set b to the midpoint of the
                                 interval for r(s,a)
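Both selection rules fit in a few lines of Python; this is a sketch with my own array conventions rather than the authors' code. Passing the current-solution occupancy frequencies as weights gives CS; passing nothing gives HLG.

import numpy as np

def select_bound_query(lower, upper, weights=None):
    """Pick (s, a, b) for a bound query "Is r(s,a) > b?".

    lower, upper: (S, A) arrays bounding each reward r(s, a).
    weights: optional (S, A) occupancy frequencies from the current
             minimax-regret solution; None gives HLG, otherwise CS."""
    gap = upper - lower                            # Delta(s, a)
    score = gap if weights is None else weights * gap
    s, a = np.unravel_index(np.argmax(score), score.shape)
    b = 0.5 * (lower[s, a] + upper[s, a])          # midpoint of the interval
    return s, a, b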
Outline   1. Decision Theory
          2. Preference Elicitation
          3. MDPs
          4. Current Work      A.   Imprecise Reward Specification
                               B.   Computing Robust Policies
                               C. Preference Elicitation

                               D. Evaluation
                               E.   Future Work
Evaluation   66




Experimental Setup

Randomly generated MDPs


             Semi-sparse random transition function,
             discount factor of 0.95

             Random true reward drawn from fixed interval,
             upper and lower bounds on reward drawn
             randomly

             All results are averaged over 20 runs

             10 states, 5 actions
Evaluation   67




Elicitation Effectiveness

We examine the combination of each criterion for robust policies with
each of the elicitation strategies



        Minimax Regret                 Halve the Largest Gap
            (MMR)                              (HLG)

        Maximin                           Current Solution
            (MR)                               (CS)
Evaluation         68




     Max Regret - Random MDP

     [Plot: Max Regret vs. number of queries, for the Maximin and
     Minimax Regret policies under the HLG and CS query strategies.]
Evaluation   69




      True Regret (Loss) - Random MDP

      [Plot: True regret vs. number of queries, for the Maximin and
      Minimax Regret policies under the HLG and CS query strategies.]
Evaluation   70




Queries per Reward Point - Random MDP

[Histogram: number of reward points vs. number of queries each received.
Most of the reward space is left unexplored; we repeatedly query a small
set of "high impact" reward points.]
Evaluation   71




Autonomic Computing

   [Diagram: k hosts, each with a demand level and an allocated resource,
   drawing from a shared total resource.]

                                       Setup
                                               2 Hosts
                                               3 Demand levels
                                               3 Units of Resource

                                       Model
                                               90 States
                                               10 Actions
Evaluation            72




        Max Regret - Autonomic Computing

        [Plot: Max Regret vs. number of queries, for the Maximin and
        Minimax Regret policies.]
Evaluation   73




    True Regret (Loss) - Autonomic Computing

    [Plot: True regret vs. number of queries, for the Maximin and
    Minimax Regret policies.]
Outline   1. Decision Theory
          2. Preference Elicitation
          3. MDPs
          4. Current Work      A.   Imprecise Reward Specification
                               B.   Computing Robust Policies
                               C. Preference Elicitation
                               D. Evaluation

                               E.   Future Work
Introduction   75




Overview


 MDP

             Compute
                           Satisfied?   YES        Done
           Robust Policy
    R


                              NO




  User     Select Query
Introduction   75




Contributions



                 1. A technique for finding robust policies using
                    minimax regret



                 2. A simple elicitation procedure that quickly leads to
                    near-optimal/optimal policies
Conclusion   76




Future Work

            Bottleneck: Adversary’s max regret computation
 Scaling    Idea: The set Γ of adversary policies g that will ever
                  be a regret maximizing response can be small
 Factored
 MDPs
                                Approaches that use Γ to
 Richer
                                efficiently compute max regret
 Queries
               We have          An algorithm to find Γ

                                A theorem that shows the
                                algorithm runs in time polynomial
                                in the number of policies found
Conclusion   77




Future Work


 Scaling    Working with Factored MDPs will

 Factored          Model problems in a more natural way
 MDPs

 Richer             Allow us to lower the dimensionality of
 Queries            the reward function


                    Leverage existing techniques for scaling
                    MDPs that take advantage of factored structure
Conclusion   78




Future Work


              In state s which action would you like to take?
 Scaling

 Factored
 MDPs
              In state s do you prefer action a1 to a2 ?
 Richer
 Queries

              Do you prefer sequence
                s_1 , a_1 , s_2 , a_2 , … , s_k   to
                s′_1 , a′_1 , s′_2 , a′_2 , … , s′_k ?
Conclusion   79




Future Work

              Do you prefer the tradeoff
 Scaling        f (s_2 , a_3) = f_1 amount of time doing (s_2 , a_3) and
                f (s_1 , a_4) = f_2 amount of time doing (s_1 , a_4)
 Factored                     or
 MDPs
                f ′(s_2 , a_3) = f ′_1 amount of time doing (s_2 , a_3) and
 Richer         f ′(s_1 , a_4) = f ′_2 amount of time doing (s_1 , a_4) ?
 Queries

              [Illustration: bars comparing the time spent in
              (s: No Street Car, a: Waiting) and (s: Cab Available, a: Take Cab)
              under the two tradeoffs f and f ′.]
Thank you.
Regret-Based Reward Elicitation for
Markov Decision Processes
Kevin M Regan                         University of Toronto
Craig Boutilier
                                                              Appendix   82

      Full Formulation

      Minimax regret:
                  min   max   max    g · r − f · r
                   f     g    r∈R
                  subject to:  γE f + α = 0
                               γE g + α = 0
                               Cr ≤ d

      Master:
                  minimize_{f, δ}   δ
                  subject to:  r · g − r · f ≤ δ   ∀ g ∈ F, r ∈ R
                               γE f + α = 0
                  (in practice relaxed to the generated pairs 〈g, r〉 ∈ Gen)

      Subproblem: computation of MR(f, R) is realized by the following MIP,
      using value and Q-functions:

                  maximize_{Q, V, I, r}   α · V − r · f
                  subject to:  Q_a = r_a + γ P_a V          ∀ a ∈ A
                               V ≥ Q_a                      ∀ a ∈ A
                               V ≤ (1 − I_a) M_a + Q_a      ∀ a ∈ A
                               Cr ≤ d
                               Σ_a I_a = 1
                               I_a(s) ∈ {0, 1}              ∀ a, s
                               M_a = M⊤ − M_a⊥

      Here I represents the adversary's policy, with I_a(s) denoting the
      probability of action a being taken at state s.
Evaluation         83




Maximin Value - Random MDP

[Plot: Maximin value (left axis) and Max Regret (right axis) vs. number
of queries, for the Maximin and Minimax Regret policies under the HLG
and CS query strategies.]
Computation   84




Regret Gap vs Time

[Plot: Regret gap vs. time (ms).]

 
ML unit-1.pptx
ML unit-1.pptxML unit-1.pptx
ML unit-1.pptx
 
Kernels and Support Vector Machines
Kernels and Support Vector  MachinesKernels and Support Vector  Machines
Kernels and Support Vector Machines
 
Optimization Techniques.pdf
Optimization Techniques.pdfOptimization Techniques.pdf
Optimization Techniques.pdf
 
Deep Learning for Cyber Security
Deep Learning for Cyber SecurityDeep Learning for Cyber Security
Deep Learning for Cyber Security
 
4 greedy methodnew
4 greedy methodnew4 greedy methodnew
4 greedy methodnew
 
9 Multi criteria Operation Decision Making - Nov 16 2020. pptx (ver2).pptx
9 Multi criteria Operation Decision Making - Nov 16 2020. pptx (ver2).pptx9 Multi criteria Operation Decision Making - Nov 16 2020. pptx (ver2).pptx
9 Multi criteria Operation Decision Making - Nov 16 2020. pptx (ver2).pptx
 
Introduction to Big Data Science
Introduction to Big Data ScienceIntroduction to Big Data Science
Introduction to Big Data Science
 
Statement of stochastic programming problems
Statement of stochastic programming problemsStatement of stochastic programming problems
Statement of stochastic programming problems
 
Understanding variable importances in forests of randomized trees
Understanding variable importances in forests of randomized treesUnderstanding variable importances in forests of randomized trees
Understanding variable importances in forests of randomized trees
 
Regression_1.pdf
Regression_1.pdfRegression_1.pdf
Regression_1.pdf
 
lecture on support vector machine in the field of ML
lecture on support vector machine in the field of MLlecture on support vector machine in the field of ML
lecture on support vector machine in the field of ML
 
Cuckoo Search Algorithm: An Introduction
Cuckoo Search Algorithm: An IntroductionCuckoo Search Algorithm: An Introduction
Cuckoo Search Algorithm: An Introduction
 
Interval Type-2 fuzzy decision making
Interval Type-2 fuzzy decision makingInterval Type-2 fuzzy decision making
Interval Type-2 fuzzy decision making
 
02.03 Artificial Intelligence: Search by Optimization
02.03 Artificial Intelligence: Search by Optimization02.03 Artificial Intelligence: Search by Optimization
02.03 Artificial Intelligence: Search by Optimization
 
Artificial bee colony (abc)
Artificial bee colony (abc)Artificial bee colony (abc)
Artificial bee colony (abc)
 
1004_theorem_proving_2018.pptx on the to
1004_theorem_proving_2018.pptx on the to1004_theorem_proving_2018.pptx on the to
1004_theorem_proving_2018.pptx on the to
 

Kürzlich hochgeladen

WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 

Kürzlich hochgeladen (20)

WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 

Regret-Based Reward Elicitation for MDPs

  • 1. Regret-Based Reward Elicitation for Markov Decision Processes Kevin Regan University of Toronto Craig Boutilier
  • 2. Introduction 2 Motivation
  • 3. Introduction 3 Motivation Markov Decision Processes have proven to be an extremely useful model for decision making in stochastic environments • Model requires dynamics and rewards
  • 4. Introduction 4 Motivation Markov Decision Processes have proven to be an extremely useful model for decision making in stochastic environments • Model requires dynamics and rewards Specifying dynamics a priori can be difficult • We can learn a model of the world in either an offline or online (reinforcement learning) setting
  • 5. Introduction 5 Motivation Markov Decision Processes have proven to be an extremely useful model for decision making in stochastic environments • Model requires dynamics and rewards In some simple cases reward can be thought of as being directly “observed” • For instance: the reward in a robot navigation problem corresponding to the distance travelled
  • 6. Introduction 6 Motivation Except in some simple cases, the specification of reward functions for MDPs is problematic • Rewards can vary user-to-user • Preferences about which states/actions are “good” and “bad” need to be translated into precise numerical reward • Time consuming to specify reward for all states/actions Example domain: assistive technology
  • 7. Introduction 7 Motivation However, • Near-optimal policies can be found without a fully specified reward function • We can bound the performance of a policy using regret
  • 8. Outline 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work
  • 9. Decision Theory 9 Utility. Given: a decision maker (DM); a set of possible outcomes Θ; a set of lotteries L of the form l ≡ ⟨p1, x1, p2, x2, …, pn, xn⟩ where xi ∈ Θ and Σi pi = 1, with the shorthand l ≡ ⟨x1, p, x2⟩ = ⟨p, x1, (1 − p), x2⟩; compound lotteries, e.g. l1 = ⟨0.75, x, 0.25, ⟨0.6, y, 0.4, z⟩⟩. [Diagram: lottery trees for l1 and l2.]
  • 10. Decision Theory 9 Utility (continued). Compound lotteries can be reduced: l1 = ⟨0.75, x, 0.25, ⟨0.6, y, 0.4, z⟩⟩ = ⟨0.75, x, 0.15, y, 0.1, z⟩. [Diagram: the compound lottery tree and its reduced form.]
  • 11. Decision Theory 10 Utility Axioms Completeness Transitivity Independence Continuity
  • 12. Decision Theory 11 Utility Axioms (Completeness, Transitivity, Independence, Continuity). Completeness: for x, y ∈ Θ it is the case that either x is weakly preferred to y (x ≽ y), y is weakly preferred to x (y ≽ x), or one is indifferent (x ~ y).
  • 13. Decision Theory 12 Utility Axioms. Transitivity: for any x, y, z ∈ Θ, if x ≽ y and y ≽ z, then x ≽ z.
  • 14. Decision Theory 13 Utility Axioms. Independence: for every l1, l2, l3 ∈ L and p ∈ (0,1), if l1 ≻ l2 then ⟨l1, p, l3⟩ ≻ ⟨l2, p, l3⟩.
  • 15. Decision Theory 14 Utility Axioms. Continuity: for every l1, l2, l3 ∈ L, if l1 ≻ l2 ≻ l3 then for some p ∈ (0,1): l2 ~ ⟨l1, p, l3⟩.
  • 16. Decision Theory 15 Utility. There exists a utility function u : Θ → ℝ such that u(x) ≥ u(y) ⇔ x ≽ y and u(l) = u(⟨p1, x1, …, pn, xn⟩) = Σi pi u(xi). The utility of a lottery is the expected utility of its outcomes.
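To make the expected-utility identity above concrete, here is a minimal Python sketch (an editor's illustration, not part of the original deck); the utility values assigned to the outcomes x, y, z are made up for illustration.

```python
# Illustrative sketch: the utility of a lottery is the expected utility of its outcomes.
def expected_utility(lottery, u):
    """lottery = [(p1, x1), ..., (pn, xn)] with probabilities summing to 1."""
    assert abs(sum(p for p, _ in lottery) - 1.0) < 1e-9
    return sum(p * u(x) for p, x in lottery)

# Hypothetical utilities for the outcomes used on the earlier slides.
u = {"x": 1.0, "y": 0.4, "z": 0.0}
l1 = [(0.75, "x"), (0.15, "y"), (0.10, "z")]   # the reduced compound lottery from slide 10
print(expected_utility(l1, u.get))              # 0.75*1.0 + 0.15*0.4 + 0.10*0.0 = 0.81
```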
  • 17. Outline 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work
  • 18. Preference Elicitation 17 Queries (Ranking, Standard Gamble, Bound). Ranking: please order this subset of outcomes ⟨x1, x2, …, xm⟩; the response yields u(x1) ≥ u(x2) ≥ u(x3) ≥ ⋯ ≥ u(xm).
  • 19. Preference Elicitation 18 Queries. Standard gamble: please choose a p for which you are indifferent between y and the lottery ⟨x⊤, p, x⊥⟩; then y ~ ⟨x⊤, p, x⊥⟩ and u(y) = p.
  • 20. Preference Elicitation 19 Queries. Bound: please indicate whether y is at least as good as the lottery ⟨x⊤, b, x⊥⟩; if y ≽ ⟨x⊤, b, x⊥⟩ then u(y) ≥ b.
  • 21. Preference Elicitation 20 Preference Elicitation. Rather than fully specifying a utility function, we 1. make decisions w.r.t. an imprecisely specified utility function and 2. perform elicitation until we are satisfied with the decision. [Flowchart: Make Decision → Satisfied? → Yes: Done / No: Select Query → User → (updated utility information) → Make Decision.]
  • 22.–33. Preference Elicitation 21–32 Robust Decision Criteria. Given a set of feasible utility functions U:
    Maximax: argmax_{x∈Θ} max_{u∈U} u(x)
    Maximin: argmax_{x∈Θ} min_{u∈U} u(x)
    Minimax Regret: argmin_{x∈Θ} max_{x'∈Θ} max_{u∈U} u(x') − u(x)
    Worked example (the table built up over these slides):
            u1  u2  u3  Max Regret
        x1   8   2   1      5
        x2   7   7   1      1
        x3   2   2   2      6
    Maximax picks x1 (best case 8), maximin picks x3 (guaranteed 2), and minimax regret picks x2 (its utility is within 1 of the best alternative under every feasible utility function).
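A small Python check of the three criteria on the table above (an editor's sketch, not from the deck); it reproduces the max-regret column 5, 1, 6 and the choices listed at the end of the consolidated slide.

```python
# Illustrative sketch: maximax, maximin and minimax regret on the 3x3 utility table.
import numpy as np

U = np.array([[8., 2., 1.],   # rows: outcomes x1, x2, x3
              [7., 7., 1.],   # columns: feasible utility functions u1, u2, u3
              [2., 2., 2.]])
outcomes = ["x1", "x2", "x3"]

maximax = outcomes[np.argmax(U.max(axis=1))]               # best best case      -> x1
maximin = outcomes[np.argmax(U.min(axis=1))]               # best worst case     -> x3
regret = U.max(axis=0) - U                                 # regret under each u
minimax_regret = outcomes[np.argmin(regret.max(axis=1))]   # smallest max regret -> x2

print(maximax, maximin, minimax_regret)   # x1 x3 x2
print(regret.max(axis=1))                 # [5. 1. 6.]
```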
  • 34. Preference Elicitation 33 Bayesian Decision Criteria (Expected Utility, Value at Risk): assuming we have a prior φ over potential utility functions.
  • 35. Preference Elicitation 34 Bayesian Decision Criteria. Expected utility: argmax_{x∈Θ} E_φ[u(x)].
  • 36. Preference Elicitation 35 Bayesian Decision Criteria. Value at risk: argmax_{x∈Θ} max_δ δ such that Pr_φ(u(x) ≥ δ) ≥ η, e.g. η = 90%. [Plot: a utility distribution split 90%/10% at the threshold δ.]
  • 37. Outline 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work
  • 38. Markov Decision Processes 37 Markov Decision Process (MDP): S - set of states; A - set of actions; Pr(s′ | s, a) - transitions; α - starting state distribution; γ - discount factor; r(s) - reward [or r(s, a)]. [Diagram: agent-world loop with states s_t, actions a_t, rewards r_t.]
  • 39. Markov Decision Processes 37 Markov Decision Process (MDP): the dynamics (S, A, Pr(s′ | s, a), α, γ) are assumed known; the reward r(s) [or r(s, a)] is the unknown component ("?"). [Same agent-world diagram.]
  • 40. Markov Decision Processes 38 MDP - Policies. A stationary policy π maps each state to an action; for infinite horizon MDPs it suffices to consider stationary policies. Given a policy π, the value of a state is V^π(s0) = E[ Σ_{t=0}^∞ γ^t r_t | π, s0 ].
  • 41. Markov Decision Processes 39 MDP - Computing the Value Function. The value of a policy can be found by successive approximation:
    V^π_0(s) = r(s, a_π)
    V^π_1(s) = r(s, a_π) + γ Σ_{s′} Pr(s′ | s, a_π) V^π_0(s′)
    ⋮
    V^π_k(s) = r(s, a_π) + γ Σ_{s′} Pr(s′ | s, a_π) V^π_{k−1}(s′)
    There will exist a fixed point V^π(s) = r(s, a_π) + γ Σ_{s′} Pr(s′ | s, a_π) V^π(s′).
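The successive-approximation recursion above can be written directly as a short policy-evaluation loop; this is an editor's sketch in Python (the tensor layout P[a, s, s'] is an assumption, not the deck's notation).

```python
# Illustrative sketch: evaluating a fixed stationary policy by successive approximation.
import numpy as np

def evaluate_policy(P, r, pi, gamma, iters=1000):
    """P[a, s, s2] = Pr(s2 | s, a);  r[s] = reward;  pi[s] = action taken in s."""
    n = len(r)
    P_pi = np.array([P[pi[s], s, :] for s in range(n)])    # Pr(s2 | s, pi(s))
    V = np.zeros(n)
    for _ in range(iters):
        V = r + gamma * P_pi @ V    # V_k(s) = r(s, a_pi) + gamma * sum_s2 Pr(s2|s,a_pi) V_{k-1}(s2)
    return V
```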
  • 42. Markov Decision Processes 40 MDP - Optimal Value Functions. We wish to find the optimal policy π*: V^{π*} ≥ V^{π′} for all π′. Bellman equation: V^{π*}(s) = max_a [ r(s, a) + γ Σ_{s′} Pr(s′ | s, a) V^{π*}(s′) ].
  • 43. Markov Decision Processes 41 Value Iteration Algorithm (yields an ε-optimal policy):
    1. Initialize V_0, set n = 0, choose ε > 0.
    2. For each s: V_{n+1}(s) = max_a [ r(s, a) + γ Σ_{s′} Pr(s′ | s, a) V_n(s′) ].
    3. If ‖V_{n+1} − V_n‖ > ε(1 − γ)/(2γ): increment n and return to step 2.
    We can recover the policy by finding the best one-step action: π(s) = argmax_a [ r(s, a) + γ Σ_{s′} Pr(s′ | s, a) V(s′) ].
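For comparison, a compact sketch of the value iteration loop with the stopping rule from step 3 (again an editor's illustration; the reward is taken to be a state-action reward r[s, a]).

```python
# Illustrative sketch of value iteration with the epsilon*(1-gamma)/(2*gamma) stopping rule.
import numpy as np

def value_iteration(P, r, gamma, eps=1e-3):
    """P[a, s, s2] = Pr(s2 | s, a);  r[s, a] = reward;  returns (V, greedy policy)."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = r(s, a) + gamma * sum_s2 Pr(s2 | s, a) V(s2)
        Q = r + gamma * np.einsum("ast,t->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) <= eps * (1 - gamma) / (2 * gamma):
            return V_new, Q.argmax(axis=1)   # recover the policy from the best one-step action
        V = V_new
```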
  • 44. Markov Decision Processes 42 Linear Programming Formulation: minimize_V Σ_s α(s) V(s) subject to V(s) ≥ r(s, a) + γ Σ_{s′} Pr(s′ | s, a) V(s′) for all a, s.
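The LP above can be handed to any LP solver; as a sketch (assuming SciPy is available, which the deck does not mention), each constraint V(s) ≥ r(s, a) + γ Σ Pr(s′ | s, a) V(s′) becomes one row of an inequality matrix.

```python
# Illustrative sketch: the primal MDP linear program solved with scipy.optimize.linprog.
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, r, alpha, gamma):
    """min_V alpha.V  s.t.  V(s) >= r(s, a) + gamma * sum_s2 Pr(s2 | s, a) V(s2)  for all s, a."""
    n_actions, n_states, _ = P.shape
    A_ub, b_ub = [], []
    for a in range(n_actions):
        for s in range(n_states):
            row = gamma * P[a, s, :].copy()   # -V(s) + gamma * sum_s2 P[a,s,s2] V(s2) <= -r(s,a)
            row[s] -= 1.0
            A_ub.append(row)
            b_ub.append(-r[s, a])
    res = linprog(c=alpha, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=(None, None))
    return res.x    # the optimal value function V
```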
  • 45. Markov Decision Processes 43 MDP - Occupancy Frequencies. An occupancy frequency f(s, a) expresses the total discounted probability of being in state s and taking action a. Valid occupancy frequencies satisfy Σ_a f(s0, a) − γ Σ_s Σ_a Pr(s0 | s, a) f(s, a) = α(s0) for all s0.
  • 46. Markov Decision Processes 44 LP - Occupancy Frequency.
    Primal: minimize_V Σ_s α(s) V(s) subject to V(s) ≥ r(s, a) + γ Σ_{s′} Pr(s′ | s, a) V(s′) for all a, s.
    Dual: maximize_f Σ_s Σ_a f(s, a) r(s, a) subject to Σ_a f(s0, a) − γ Σ_s Σ_a Pr(s0 | s, a) f(s, a) = α(s0) for all s0.
  • 47. Markov Decision Processes 44 LP - Occupancy Frequency. At the optimum the two objectives coincide: Σ_s Σ_a f(s, a) r(s, a) = Σ_s α(s) V(s).
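The dual can be solved the same way, with f(s, a) as the decision variables; this sketch (editor's illustration, SciPy assumed) also shows the usual way a policy is read off the occupancy frequencies.

```python
# Illustrative sketch: the dual (occupancy-frequency) LP and policy recovery.
import numpy as np
from scipy.optimize import linprog

def solve_mdp_dual(P, r, alpha, gamma):
    """max_f sum f(s,a)r(s,a)  s.t.  sum_a f(s0,a) - gamma*sum_{s,a} Pr(s0|s,a) f(s,a) = alpha(s0), f >= 0."""
    n_actions, n_states, _ = P.shape
    idx = lambda s, a: s * n_actions + a            # flatten f(s, a) into a vector
    A_eq = np.zeros((n_states, n_states * n_actions))
    for s0 in range(n_states):
        for s in range(n_states):
            for a in range(n_actions):
                A_eq[s0, idx(s, a)] -= gamma * P[a, s, s0]
        for a in range(n_actions):
            A_eq[s0, idx(s0, a)] += 1.0
    c = -np.array([r[s, a] for s in range(n_states) for a in range(n_actions)])   # maximize => minimize -c
    res = linprog(c=c, A_eq=A_eq, b_eq=alpha, bounds=(0, None))
    f = res.x.reshape(n_states, n_actions)
    return f, f.argmax(axis=1)   # in each state, the action carrying the occupancy mass
```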
  • 48. Markov Decision Processes 45 MDP Summary (Policies, Dynamics, Rewards). Over the past couple of decades there has been a lot of work on scaling MDPs: factored models, decomposition, linear approximation.
  • 49. Markov Decision Processes 46 MDP Summary. To use these algorithms we need a model of the dynamics (transition function). There are techniques for deriving models of dynamics from data and for finding policies that are robust to inaccurate transition models.
  • 50. Markov Decision Processes 47 MDP Summary. There has been comparatively little work on specifying rewards: finding policies that are robust to imprecise reward models, and eliciting reward information from users.
  • 51. Outline 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work
  • 52. Outline 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work A. Imprecise Reward Specification B. Computing Robust Policies C. Preference Elicitation D. Evaluation E. Future Work
  • 53. Current Work 50. [Flowchart: MDP + reward polytope R → Compute Robust Policy → Satisfied? → Yes: Done / No: Select Query → User → (response tightens R) → Compute Robust Policy.]
  • 54. Model: MDP 51 MDP - Reward Uncertainty. We quantify the strict uncertainty over reward with a set of feasible reward functions R. We specify R using a set of linear inequalities forming a polytope. Where do these inequalities come from? Bound queries: is r(s, a) > b? Policy comparisons: is f_π·r > f_π′·r?
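A minimal sketch of how such a polytope R = { r : Cr ≤ d } can be maintained as bound-query responses arrive (editor's illustration; the class and method names are assumptions). A policy comparison f_π·r ≥ f_π′·r would be added the same way, as the single row (f_π′ − f_π)·r ≤ 0.

```python
# Illustrative sketch: the feasible reward polytope R = { r : C r <= d }.
import numpy as np

class RewardPolytope:
    def __init__(self, n_states, n_actions, r_lo, r_hi):
        self.n = n_states * n_actions
        self.idx = lambda s, a: s * n_actions + a
        # Start from box constraints r_lo <= r(s, a) <= r_hi.
        self.C = np.vstack([np.eye(self.n), -np.eye(self.n)])
        self.d = np.concatenate([np.full(self.n, r_hi), np.full(self.n, -r_lo)])

    def add_bound_response(self, s, a, b, answer_yes):
        """'Yes' to 'Is r(s,a) > b?' adds -r(s,a) <= -b; 'No' adds r(s,a) <= b."""
        row = np.zeros(self.n)
        row[self.idx(s, a)] = -1.0 if answer_yes else 1.0
        self.C = np.vstack([self.C, row])
        self.d = np.append(self.d, -b if answer_yes else b)
```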
  • 55. Outline 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work A. Imprecise Reward Specification B. Computing Robust Policies C. Preference Elicitation D. Evaluation E. Future Work
  • 56. Computation 53 Minimax Regret. Original formulation: min_{f∈F} max_{g∈F} max_{r∈R} g·r − f·r. Benders' decomposition: minimize_{f∈F, δ} δ subject to δ ≥ g·r − f·r for all g ∈ F, r ∈ R.
  • 57. Computation 54 Minimax Regret. The constraints can be restricted to δ ≥ g·r − f·r for all g ∈ V(F), r ∈ V(R): the maximum will be attained at vertices of F and R. Rather than enumerating an exponential number of vertices, we use constraint generation.
  • 58. Computation 55 Minimax Regret - Constraint Generation. 1. We limit the adversary: the player minimizes regret w.r.t. a small set of adversary responses. 2. We untie the adversary's hands: the adversary finds the maximum regret w.r.t. the player's policy, and the adversary's choice of r and g is added to the set of adversary responses. Done when untying the adversary's hands yields no improvement, i.e. the regret of the minimizing player equals the regret of the maximizing adversary.
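The alternation described above is easy to state as a loop; this skeleton is an editor's sketch, with solve_master (the relaxed minimization over generated constraints) and max_regret (the adversary's computation) assumed to exist and left abstract.

```python
# Illustrative skeleton of constraint generation for minimax regret.
def minimax_regret_constraint_generation(solve_master, max_regret, tol=1e-6):
    """
    solve_master(GEN) -> (f, delta): player policy minimizing regret w.r.t. GEN = [(g, r), ...]
    max_regret(f)     -> (g, r, mr): adversary's regret-maximizing response to f
    """
    GEN = []                      # arbitrary (possibly empty) initial set of adversary responses
    f, delta = solve_master(GEN)
    while True:
        g, r, mr = max_regret(f)  # untie the adversary's hands
        if mr <= delta + tol:     # no violated constraint: f is minimax-regret optimal
            return f, mr
        GEN.append((g, r))        # add the maximally violated constraint
        f, delta = solve_master(GEN)
```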
  • 59. Computation 56 Constraint Generation - Player. 1. Limit the adversary: minimize_{f∈F, δ} δ subject to δ ≥ g·r − f·r for all ⟨g, r⟩ ∈ GEN.
  • 60. Computation 57 Constraint Generation - Adversary. 2. Untie the adversary's hands: given the player policy f, solve max_{g∈F} max_{r∈R} g·r − f·r. This formulation is non-convex (the objective is bilinear in g and r), so we reformulate it as a mixed integer linear program.
  • 61. Computation 58 Constraint Generation - Adversary. Computation of MR(f, R) is realized by the following MIP, using value and Q-functions:
    maximize_{Q, V, I, r} α·V − r·f
    subject to: Q_a = r_a + γ P_a V for all a ∈ A
                V ≥ Q_a for all a ∈ A
                V ≤ (1 − I_a) M_a + Q_a for all a ∈ A
                C r ≤ d
                Σ_a I_a = 1,  I_a(s) ∈ {0, 1} for all a, s
                M_a = M⊤ − M_a⊥
    Here I represents the adversary's policy, with I_a(s) denoting the probability of action a being taken at state s. Only tractable for small Markov decision problems.
  • 62. Computation 59 Approximating Minimax Regret. We relax the max regret MIP formulation. The value of the resulting policy is no longer exact; however, the resulting reward is still feasible, and we find the optimal policy w.r.t. that reward. [Plots: relative approximation error vs. number of states, measured against max regret and against minimax regret.]
  • 63. Computation 60 Scaling (Log Scale). [Figure 2: time (ms, log scale) vs. number of states for exact and approximate minimax regret, showing the scaling of constraint generation with the number of states.]
  • 64. Outline 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work A. Imprecise Reward Specification B. Computing Robust Policies C. Preference Elicitation D. Evaluation E. Future Work
  • 65. Reward Elicitation 62 Reward Elicitation. [Flowchart: MDP + R → Compute Robust Policy → Satisfied? → Yes: Done / No: Select Query → User → Compute Robust Policy.]
  • 66. Reward Elicitation 63 Bound Queries. Query: is r(s, a) > b?, where b is a point between the upper and lower bounds of r(s, a). Gap: Δ(s, a) = max_{r′∈R} r′(s, a) − min_{r∈R} r(s, a). At each step of elicitation we need to select the parameters s, a and b using the gap.
  • 67. Reward Elicitation 64 Selecting Bound Queries.
    Halve the Largest Gap (HLG): select the (s, a) with the largest gap Δ(s, a); set b to the midpoint of the interval for r(s, a).
    Current Solution (CS): use the current solution g(s, a) [or f(s, a)] of the minimax regret calculation to weight each gap Δ(s, a); select the (s, a) with the largest weighted gap g(s, a)Δ(s, a); set b to the midpoint of the interval for r(s, a).
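The two heuristics amount to an argmax over a gap matrix; a short sketch follows (editor's illustration, with r_hi/r_lo the current upper/lower reward bounds and g the adversary's occupancy frequencies from the current minimax regret solution).

```python
# Illustrative sketch: HLG and CS query selection over the reward gaps.
import numpy as np

def select_query_hlg(r_hi, r_lo):
    """Halve the Largest Gap: pick the (s, a) with the largest gap, query its midpoint."""
    gap = r_hi - r_lo
    s, a = np.unravel_index(np.argmax(gap), gap.shape)
    return s, a, 0.5 * (r_hi[s, a] + r_lo[s, a])

def select_query_cs(r_hi, r_lo, g):
    """Current Solution: weight each gap by the current solution's occupancy g(s, a)."""
    weighted = g * (r_hi - r_lo)
    s, a = np.unravel_index(np.argmax(weighted), weighted.shape)
    return s, a, 0.5 * (r_hi[s, a] + r_lo[s, a])
```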
  • 68. Outline 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work A. Imprecise Reward Specification B. Computing Robust Policies C. Preference Elicitation D. Evaluation E. Future Work
  • 69. Evaluation 66 Experimental Setup. Randomly generated MDPs: semi-sparse random transition function, discount factor of 0.95; random true reward drawn from a fixed interval, upper and lower bounds on reward drawn randomly; all results averaged over 20 runs; 10 states, 5 actions.
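A sketch of a random instance in the spirit of this setup (editor's illustration; the sparsity level, reward range and bound widths below are assumptions, only the 10 states / 5 actions / discount 0.95 figures come from the slide).

```python
# Illustrative sketch: a semi-sparse random MDP with interval bounds around a hidden true reward.
import numpy as np

def random_mdp(n_states=10, n_actions=5, n_successors=3, reward_range=(0.0, 10.0), seed=None):
    rng = np.random.default_rng(seed)
    P = np.zeros((n_actions, n_states, n_states))
    for a in range(n_actions):
        for s in range(n_states):
            succ = rng.choice(n_states, size=n_successors, replace=False)   # semi-sparse transitions
            P[a, s, succ] = rng.dirichlet(np.ones(n_successors))
    r_true = rng.uniform(*reward_range, size=(n_states, n_actions))         # hidden "true" reward
    width = rng.uniform(0.0, reward_range[1], size=(n_states, n_actions))   # random bound widths
    return P, r_true, r_true - width, r_true + width                        # (P, r, r_lo, r_hi); gamma = 0.95
```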
  • 70. Evaluation 67 Elicitation Effectiveness. We examine the combination of each criterion for robust policies, minimax regret (MMR) and maximin (MR), with each of the elicitation strategies, Halve the Largest Gap (HLG) and Current Solution (CS).
  • 71. Evaluation 68 Max Regret - Random MDP. [Plots: max regret and true regret vs. number of queries (0–300) for maximin and minimax regret combined with HLG and CS.]
  • 72. Evaluation 69 True Regret (Loss) - Random MDP. [Plot: true regret vs. number of queries for the same four combinations.]
  • 73. Evaluation 70 Queries per Reward Point - Random MDP. Most of reward space is left unexplored; we repeatedly query a small set of "high impact" reward points. [Histogram: number of reward points vs. number of queries received.]
  • 74. Evaluation 71 Autonomic Computing Setup. 2 hosts total, 3 demand levels, 3 units of resource; the resulting model has 90 states and 10 actions. [Diagram: hosts 1…k, each with a demand level and an allocated resource.]
  • 75. Evaluation 72 Max Regret - Autonomic Computing. [Plot: max regret vs. number of queries (0–1000) for maximin and minimax regret.]
  • 76. Evaluation 73 True Regret (Loss) - Autonomic Computing. [Plot: true regret vs. number of queries (0–1000) for maximin and minimax regret.]
  • 77. Outline 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work A. Imprecise Reward Specification B. Computing Robust Policies C. Preference Elicitation D. Evaluation E. Future Work
  • 78. Introduction 75 Overview. [Flowchart: MDP + R → Compute Robust Policy → Satisfied? → Yes: Done / No: Select Query → User.]
  • 79. Introduction 75 Contributions. 1. A technique for finding robust policies using minimax regret. 2. A simple elicitation procedure that quickly leads to near-optimal/optimal policies.
  • 80. Conclusion 76 Future Work (Scaling, Factored MDPs, Richer Queries). Scaling: the bottleneck is the adversary's max regret computation. Idea: the set Γ of adversary policies g that will ever be a regret-maximizing response can be small, suggesting approaches that use Γ to efficiently compute max regret. We have an algorithm to find Γ and a theorem showing the algorithm runs in time polynomial in the number of policies found.
  • 81. Conclusion 77 Future Work. Factored MDPs: working with factored MDPs will let us model problems in a more natural way, lower the dimensionality of the reward function, and leverage existing techniques for scaling MDPs that take advantage of factored structure.
  • 82. Conclusion 78 Future Work. Richer queries: In state s, which action would you like to take? In state s, do you prefer action a1 to a2? Do you prefer the sequence s1, a1, s2, a2, …, sk to s′1, a′1, s′2, a′2, …, s′k?
  • 83. Conclusion 79 Future Work. Tradeoff queries: do you prefer spending f1 of your time doing (s2, a3) and f2 doing (s1, a4), or f′1 doing (s2, a3) and f′2 doing (s1, a4)? [Diagram: example states "Cab Available" / "No Street Car" with actions "Take Cab" / "Waiting" and the corresponding occupancy frequencies.]
  • 85.
  • 86. Regret-Based Reward Elicitation for Markov Decision Processes Kevin M Regan University of Toronto Craig Boutilier
  • 87. Appendix 82 Full Formulation. [Screenshot of the paper's full formulation: the max regret program over g, r and f with the occupancy-frequency constraints (γE f + α = 0, γE g + α = 0) and the reward polytope constraint C r ≤ d; the Benders master problem minimize_{f, δ} δ subject to r·g − r·f ≤ δ for all generated ⟨g, r⟩; and the max-regret MIP over Q, V, I, r shown earlier, together with the surrounding discussion of related work.]
  • 88. Evaluation 83 Maximin Value - Random MDP. [Plot: maximin value vs. number of queries (0–300) for maximin and minimax regret combined with HLG and CS.]
  • 89. Computation 84 Regret Gap vs. Time. [Plot: regret gap vs. time (ms) as constraint generation proceeds.]

Editor's Notes

  1.–2. Markov decision processes are an extremely useful model for decision making in stochastic environments. To use it we need to know dynamics and the rewards involved. A lot of work has been done on learning dynamics both in an offline and online setting.
  3. Markov decision processes are an extremely useful model for decision making in stochastic environments. To use it we need to know dynamics and the rewards involved. A lot of work has been done on learning dynamics both in an offline and online setting. Rewards are often assumed to be directly observable parts of the world. My perspective is that "rewards are in people's heads": in some cases there is a simple mapping between what you want (in your head) and the world (i.e. finding the shortest path).
  4. Markov decision processes are an extremely useful model for decision making in stochastic environments. To use it we need to know dynamics and the rewards involved. A lot of work has been done on learning dynamics both in an offline and online setting. In some simple cases, rewards can be thought of as being directly observable: → for instance the distance travelled in a robot navigation problem where we are trying to get a robot from point A to point B. → When I am in my car trying to get from point A to point B I want the path with the fewest stoplights, someone else may want the path with the nicest scenery, while someone else may sacrifice some stoplights for some scenery. Reward is a surrogate for subjective preferences... flip slide (sometimes it's easy, but...).
  5. The dynamics, in combination with simple bounds on the reward function, lead to areas of reward space not having a high impact on the value of a policy.
  6.–28. Maximin is common but we use regret
  29. So for f2, no matter what the instantiation of reward, by changing policy the player only stood to gain by one. This is an intuitive measure.
  30.–32. Maximin is common but we use regret
  33.–35. Robust MDP literature often assumes transitions are known; transitions are learnable and do not change from user to user.
  36. Convergence properties
  37. explain how constraints create max
  38. Question: Is it worth spending more time to familiarize the audience with "occupancy frequencies"?
  39. I will vocally mention other representations
  40.–44. In the voice over I will explain how each reformulation proceeds from the previous expression
  45. I would also like to give a clear intuition as to why this is inherently hard.
  46. On average less than 10% error
  47. I will also note that on a 90-state MDP with 16 actions the relaxation computes minimax regret in less than 3 seconds.
  48. Here I will review the preference elicitation process
  49. Note that it is useful in non-sequential settings as well.
  50. Now I have left out the autonomic computing results, due to lack of time. If there is a little time, after giving the results for the random MDPs I can state that we have similar results for a large MDP instance.
  51. 20 runs → 20 MDPs with 10 states and 5 actions
  52.–56. We have a partially specified reward function; we compute a robust policy that minimizes maximum regret → we then elicit information about the reward function, which leads to a better policy → we continue this process until we have an optimal policy or the regret guarantee is small enough.
  57. Notes test
  58. I will give the context in the voice over. The main idea on this slide is that: in practice constraint generation quickly converges. To segue to the next slide I will recall that we still need to solve a MIP with |S||A| variables and constraints, thus we developed an approximation.