Generalized Reinforcement Learning Framework
Barnett P. Chiu
3.22.2013
Overview
•  Standard Formulation of Reinforcement Learning
•  Challenges in standard RL framework due to its
   representation
•  Generalized/Alternative action formulation
  •  Action as an Operator
  •  Parametric-Action Model
•  Reinforcement Field
   –  Using kernel as a similarity measure over “decision
      contexts” (i.e. generalized state-action pair)
   –  Value predictions using functions (vectors) from RKHS (a vector
      space)
   –  Representing policy using kernelized samples
Reinforcement Learning: Examples
•  A learning paradigm that formalizes
   sequential decision process under uncertainty
    –  Navigating in an unknown environment
    –  Playing and winning a game (e.g. Backgammon)
    –  Retrieving information over the web (finding the
       right info on the right websites)
    –  Assigning user tasks to a set of computational
       resources
•  Reference:
   –  Reinforcement Learning: A Survey by Leslie P. Kaelbling, Michael L.
     Littman, Andrew W. Moore
   –  Autonomous Helicopter: Andrew Ng
Reinforcement Learning:
                  Optimization Objective

•  Optimizing performance through trial and error.
   –  The agent interacts with the environment and performs actions that
      induce a state trajectory toward maximizing reward.
   –  Task dependent; can have multiple subgoals/subtasks
   –  Learning from incomplete background knowledge
•  Ex1: Navigating in an unknown environment
   –  Objective: shortest path + avoiding obstacles +
                 minimize fuel consumption + …
•  Ex2: Assigning user tasks to a set of servers with
   unknown resource capacity
   –  Objective: minimize turnaround time,
                 maximize success rate, load balancing, …
Reinforcement Learning:
Markov Decision Process (typical but not always)
Reinforcement Learning:
Markov Decision Process
                          Potential Function:
                           Q: S × A è utility
Challenges in Standard RL Formulations (1)
•  Challenges from large state and action space
    –  the complexity of RL methods depends largely on the
       dimensionality of the state space representation

Later …
•  Solution: Generalized action representation
    –  Explicitly express errors/variations of actions

                        (x_s, x_a) ∈ S ⊗ A+
                        x = (x_s, x_a)
                        k(x, x′)            // compare two decision contexts

   –  It becomes possible to express correlations within a state-action
      combination and reveal their interdependency.
   –  The value function no longer needs to express a concrete mapping
      from state-action pairs to their values.
   –  Enables simultaneous control over multiple parameters that
      collectively describe behavioral details as actions are performed.
Challenges in Standard RL Formulations (2)

•  Challenges from Environmental Shifts
   –  Unexpected behaviors inherent in actions (same action but
      different outcomes over time)
   –  E.g. Recall the rover navigation example earlier
      •  Navigational policy learned under different surface conditions


•  Challenges from Irreducible and Varying Action Sets
    –  Large number of decision points
   –  Feasible actions do not stay the same
   –  E.g. Assigning tasks to time-varying computational resources
           in a dynamic virtual cluster (DVC)
      •  Compute resources are dynamically acquired with limited
         walltimes
Reinforcement Learning Algorithms …
•  In most complex domains, T and R need to be estimated → RL framework
   –  Temporal Difference (TD) Learning
       •  Q-learning, SARSA, TD(λ)
   –  Example of TD learning: SARSA

[Figure: agent-environment loop — the agent in state s_t takes action a_t; the MDP environment returns reward r_t and next state s_{t+1}.]

   –  SARSA Update Rule:

          Q(s, a) ← Q(s, a) + α [ r + γ Q(s′, a′) − Q(s, a) ]

   –  Q-learning Update Rule:

          Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]

   –  Function approximation, SMC, Policy Gradient, etc.
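A minimal tabular sketch of the two update rules above, assuming Q is a NumPy array indexed by discrete state and action ids (the array layout and step sizes are illustrative assumptions, not part of the original slides):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy TD update: Q(s,a) += alpha * [r + gamma*Q(s',a') - Q(s,a)]."""
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
    return Q

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy TD update: Q(s,a) += alpha * [r + gamma*max_a' Q(s',a') - Q(s,a)]."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    return Q
```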



“But the problems are …”
Alternative View of Actions
•  Actions are not just decision choices
   –  Variational procedure (e.g. principle of least {action, time})
   –  Errors
       •  Poor calibrations of actuators
       •  Extra bumpy surfaces
   –  Real-world domains may involve actions with high complexity
       •  Robotic control (e.g. a simultaneous control over a set of joint
          parameters)

“What is an action really?”
 Actions induce a shift in the state configuration
   –  A continuous process
   –  A process involving errors
   –  State and action have hidden correlations
       •  Current knowledge base (state)
       •  New info retrieved from external world (action)
   –  Similarity between decisions: (x1,a1) vs (x2,a2)
Action as an Operator
•  Action Operator (aop)
  –  Acts on current state and produces its successor
     state.
      •  aop takes an input state … (1)
      •  aop resolves the stochastic effects in the action … (2)
           –  Recall: the action is now parameterized by constrained
              random variables (e.g. Δr, Δθ)
      •  Given (1) and (2), aop maps the input state
         to the output state (i.e. the successor state)
      •  The current state vector + action vector → augmented state
      •  E.g.

             ⎡ 1 + Δx/x       0      ⎤ ⎡ x ⎤   ⎡ x + Δx ⎤       Δx = Δr cos(Δθ)
             ⎣     0       1 + Δy/y  ⎦ ⎣ y ⎦ = ⎣ y + Δy ⎦       Δy = Δr sin(Δθ)

             ⇒ (x_s, x_a) = ((x, y), (Δx, Δy))
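A small sketch of this operator, assuming a 2-D position state and a Gaussian perturbation on the action parameters (the noise model and its scales are illustrative assumptions):

```python
import numpy as np

def action_operator(state, delta_r, delta_theta, rng=None):
    """Resolve a parametric action (dr, dtheta) into a fixed displacement, apply it to
    the current state, and return the successor plus the augmented state (x_s, x_a)."""
    rng = rng or np.random.default_rng()
    x, y = state
    # resolve the stochastic effect in the action parameters (assumed Gaussian noise)
    dr = delta_r + rng.normal(0.0, 0.05)
    dtheta = delta_theta + rng.normal(0.0, 0.01)
    dx, dy = dr * np.cos(dtheta), dr * np.sin(dtheta)
    successor = (x + dx, y + dy)
    augmented = np.array([x, y, dx, dy])   # (x_s, x_a) = ((x, y), (dx, dy))
    return successor, augmented
```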
Value Prediction: Part I
“How to connect the notion of action
 operator with value predictions?”
Parametric Actions (1)
[Figure: 12 parametric actions arranged like a clock face; each action perturbs the state by a displacement (Δr, Δθ) bounded within a pie-shaped scope; A and B mark two example decision contexts.]
•  Actions as a random process:   x_a = (x_1, x_2) = (Δr, Δθ)
•  Example:    Y(x_i) = P((x_i + w_i) ∈ Γ),
               (x_a)_i = x_i | Y(x_i) ≥ η

•  These 12 parametric actions all take on parameters
   bounded within pie-shaped scopes
Parametric Actions (2)
[Figure: the same 12 clock-face parametric actions as above.]

•  Actions as a random process:   x_a = (x_1, x_2) = (Δr, Δθ)
•  Augmented state space

      ⇒ (x_s, x_a) = ((x, y), (Δx, Δy))

      ⎡ 1 + Δx/x       0      ⎤ ⎡ x ⎤   ⎡ x + Δx ⎤        action as an operator …
      ⎣     0       1 + Δy/y  ⎦ ⎣ y ⎦ = ⎣ y + Δy ⎦

•  Learn a potential function Q(x_s, x_a) = Q(x, y, Δx, Δy)
Using GPR: A Thought Process (1)
•  Need to gauge the similarity/correlation between any two decisions
   → kernel functions k(x, x′)
•  Need to estimate the (potential) value of an arbitrary combination of
   state and action without needing to explore the entire state space →
   the value predictor is a "function" of kernels
•  The class of functions that exhibits the above properties →
   functions drawn from a Reproducing Kernel Hilbert Space (RKHS)
                 Q+(·) = Σ_{i=1}^n α_i k(x_i, ·)        Q+(x_*) = Σ_{i=1}^n α_i k(x_i, x_*)


•  GP regression (GPR) method induces such functions [GPML, Rasmussen]
    –  Representer Theorem [B. Schölkopf et al. 2000]

        cost( (x_1, y_1, f(x_1)), ..., (x_m, y_m, f(x_m)) ) + regularizer(f)
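A sketch of such a kernel-expansion value function Q+, assuming the weights α_i and the retained samples x_i are already available (the squared-exponential kernel and its lengthscale are illustrative assumptions):

```python
import numpy as np

def se_kernel(x, x_prime, lengthscale=1.0):
    """Squared-exponential kernel used as the similarity between two decision contexts."""
    d = np.asarray(x, float) - np.asarray(x_prime, float)
    return float(np.exp(-0.5 * d @ d / lengthscale**2))

def q_plus(x_star, samples, alphas, kernel=se_kernel):
    """Q+(x*) = sum_i alpha_i * k(x_i, x*): a function drawn from the kernel's RKHS."""
    return sum(a * kernel(x_i, x_star) for a, x_i in zip(alphas, samples))
```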
Using GPR: A Thought Process (2)
•  With GPR, we work with samples of experience
                      Ω : {(x1 , q1 ),...,(xm , qm )}
•  The Approach
   –  Find the “best” function(s) that explain the data (decisions made so
      far)
       •  Model Selection
   –  Keep only the most important samples from which to derive the
      value functions
       •  Sample Selection
•  It is relatively easy to implement the notion of model selection
   with GPR
   –  Tune hyperparameters of the kernel (as a covariance function of
      GP) such that the marginal likelihood (of the data) is maximized
   –  Periodic model selection to cope with environmental shifts
•  Sample selection
   –  Two-tier learning architecture
        •  Using a baseline RL algorithm to obtain utility estimates
       •  Using GPR to achieve generalization
*GPR I: Gaussian Process

•  GP: a probability distribution over a set of random
       variables (or function values), any finite set of
       which has a (joint) Gaussian distribution

      y | x, M_i ~ N(f, σ²_noise I)   ⇒   f | M_i ~ GP( m(x), k(x, x′) )
   –  Given a prior assumption over functions (See previous page)
   –  Make observations (e.g. policy learning) and gather evidences
    –  Posterior distribution over functions eliminates those not consistent
       with the evidence

[Figure "Prior and Posterior": samples from the GP prior (left) and from the posterior after conditioning on observations (right); axes: input x, output f(x).]

                            Predictive distribution:
                            p(y_* | x_*, X, y) = N( k(x_*, X)ᵀ [K + σ²_noise I]⁻¹ y,  · )
*GPR II: State Correlation Hypothesis
•  Kernel: covariance function (PD Kernel)
    –  Correlation hypothesis for states
    –  Prior
                 k(x, x′) = θ_0 exp(−½ x̃ᵀ D⁻¹ x̃) + θ_1 σ²_noise        (x̃ = x − x′)
                          = θ_0 exp(−½ Σ_{i=2}^{d} x̃_i² / θ_i) + θ_1 σ²_noise

    –  Observe samples

                 Ω : {(x_1, q_1), ..., (x_m, q_m)}

    –  Compute GP posterior (over latent functions)

       Q+(x_i) | X, q, x_i ~ GP( m_post(x_i) = k(x_i, X)ᵀ K(X, X)⁻¹ q,
                                 cov_post(x_i, x) = k(x_i, x_i) − k(x_i, X)ᵀ K(X, X)⁻¹ k(X, x_i) )

    –  Predicted distribution → averaging over all possible posterior
       weights with respect to the Gaussian likelihood function
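A sketch of this ARD squared-exponential covariance with per-dimension lengthscales θ_i; treating the noise term as a contribution only on the diagonal is a common GP convention and an assumption here:

```python
import numpy as np

def ard_se_kernel(x, x_prime, theta0=1.0, lengthscales=None, noise_var=0.0):
    """ARD SE kernel: theta0 * exp(-0.5 * sum_i (x_i - x'_i)^2 / theta_i^2), plus a
    noise term that only contributes when a point is compared with itself."""
    x, x_prime = np.asarray(x, float), np.asarray(x_prime, float)
    ell = np.ones_like(x) if lengthscales is None else np.asarray(lengthscales, float)
    k = theta0 * np.exp(-0.5 * np.sum((x - x_prime) ** 2 / ell ** 2))
    if np.array_equal(x, x_prime):
        k += noise_var
    return float(k)
```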
*GPR III: Value Prediction
•  Predictive Distribution with a test point

        q_* = Q+(x_*) = k(x_*)ᵀ K(X, X)⁻¹ q = Σ_{i=1}^n α_i k(x_i, x_*)

        cov(q_*) = k(x_*, x_*) − k_*ᵀ (K + σ²_n I)⁻¹ k_*
    –  Prediction at a new test point is achieved by comparing it
       with all the samples retained in memory
    –  The predictive value largely depends on correlated samples
    –  The (sample) correlation hypothesis (i.e. the kernel)
       applies to the entire state space
    –  Reproducing property from the RKHS [B. Schölkopf and A. Smola, 2002]

           (x_*, Q+(x_*)) :   k(·, x_*) → Q+(·),   ⟨k(·, x_*), Q+(·)⟩ = Q+(x_*)
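A self-contained numerical sketch of these predictions over a handful of retained samples (the kernel, its lengthscale, and the noise level are illustrative assumptions):

```python
import numpy as np

def se_kernel(a, b, lengthscale=1.0):
    d = np.asarray(a, float) - np.asarray(b, float)
    return float(np.exp(-0.5 * d @ d / lengthscale**2))

def gp_predict(X, q, x_star, noise_var=0.1):
    """Predictive mean and variance of Q+ at a test point x_*, given retained
    samples X = [x_1..x_m] and their utility targets q = [q_1..q_m]."""
    K = np.array([[se_kernel(a, b) for b in X] for a in X])
    k_star = np.array([se_kernel(x, x_star) for x in X])
    A = K + noise_var * np.eye(len(X))
    alpha = np.linalg.solve(A, q)              # alpha = (K + sigma^2 I)^{-1} q
    mean = k_star @ alpha                      # sum_i alpha_i k(x_i, x_*)
    var = se_kernel(x_star, x_star) - k_star @ np.linalg.solve(A, k_star)
    return mean, var
```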
*GPR IV: Model Selection
•  Maximizing Marginal Likelihood (ARD)
       log p(q | X) = −½ qᵀ K⁻¹ q − ½ log|K| − (n/2) log 2π

  –  A trade-off between data fit and model complexity
  –  Optimization: take the partial derivative w.r.t. each hyperparameter:

       ∂/∂θ_i log p(q | X, θ) = ½ qᵀ K⁻¹ (∂K/∂θ_i) K⁻¹ q − ½ tr( K⁻¹ ∂K/∂θ_i )

      •  conjugate gradient optimization
      •  We obtain the hyperparameters that best explain the data
      •  The resulting model follows the Occam's Razor principle
      •  Computing K⁻¹ is expensive → Reinforcement Sampling
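One off-the-shelf way to realize this model-selection step is scikit-learn's GaussianProcessRegressor, which maximizes the log marginal likelihood over the kernel hyperparameters when fit; the kernel composition below (constant × RBF + white noise) mirrors the SE-plus-noise covariance used in these slides, and the data shapes are illustrative placeholders:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel

# X: augmented states (x_s, x_a); q: utility targets from the baseline learner
X = np.random.rand(50, 4)
q = np.random.rand(50)

kernel = ConstantKernel(1.0) * RBF(length_scale=np.ones(4)) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5).fit(X, q)

print(gpr.kernel_)                                      # tuned (ARD-style) hyperparameters
print(gpr.log_marginal_likelihood(gpr.kernel_.theta))   # evidence of the data under the model
mean, std = gpr.predict(np.random.rand(3, 4), return_std=True)
```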
Value Predictions (Part II)
•  What remains to be solved:
  –  How to obtain the training signals {qi}?
     •  Use baseline RL agent to estimate utility based on MDP
        with concrete action choices
     •  Use GPR to generalize the utility estimate with
        parameterized action representation
  –  How to train the desired value function using only
     essential samples?
     •  memory constraint!
     •  Sample replacement using reinforcement signals
        –  Old samples referencing old utility estimates
           can be replaced by new samples with new estimates
        - Experience Association
[Figure: two-tier learning architecture — a baseline RL learner operating on the MDP feeds utility estimates into a GPR layer over augmented states; the clock-face parametric-action diagram and a thesis excerpt (Eq. 6.2) appear in the background.]

        Q+(x_s, x_a) = Q+(x) = Σ_{i=1}^m α_i k(x, x_i)                 (6.2)

        Ω : {(x_1, q_1), ..., (x_m, q_m)}

1a. Baseline RL: e.g. SARSA
    –  MDP-based RL treats actions as decision choices
    –  Estimates utility without looking at the action parameters
1b. Estimate utility for each "regular" state-action pair
2.  Use the generalization capacity of GP to predict values
    –  Expand the action into its parametric representation
    –  The kernel assumption accounts for random effects in the action
3.  From the baseline RL layer to GPR (next)
[Figure: two-tier learning architecture, continued — from the baseline RL layer to GPR.]

3.  From the baseline RL layer to GPR
    3a. Take a sample of the random action vector
    3b. Form the augmented state: (s, a) → (x_s, x_a)
    3c. Propagate the utility signal from the baseline learner and use it as the training signal
    3d. Insert the new (functional) data point into the current working memory
4.  Use GPR to predict new test points
    4a. Kernelize the new test point: (s, a) → (x_s, x_a) → k(·, x)
    4b. Take the inner product of k(·, x) and Q+ to obtain the utility estimate (fitness value)
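A compact sketch of this two-tier pipeline under the same assumptions as the earlier GPR snippets (squared-exponential kernel, fixed noise level); the class and method names are illustrative, not from the original slides:

```python
import numpy as np

def se_kernel(a, b, lengthscale=1.0):
    d = np.asarray(a, float) - np.asarray(b, float)
    return float(np.exp(-0.5 * d @ d / lengthscale**2))

class TwoTierLearner:
    """The baseline TD learner supplies utility targets q_i for augmented states
    x_i = (x_s, x_a); GPR over the retained samples generalizes those targets."""

    def __init__(self, noise_var=0.1):
        self.X, self.q = [], []                       # working memory: Omega = {(x_i, q_i)}
        self.noise_var = noise_var

    def insert(self, x_s, x_a, utility):              # steps 3a-3d
        self.X.append(np.concatenate([np.ravel(x_s), np.ravel(x_a)]))
        self.q.append(float(utility))

    def predict(self, x_s, x_a):                      # steps 4a-4b
        x_star = np.concatenate([np.ravel(x_s), np.ravel(x_a)])
        K = np.array([[se_kernel(a, b) for b in self.X] for a in self.X])
        k_star = np.array([se_kernel(x, x_star) for x in self.X])
        alpha = np.linalg.solve(K + self.noise_var * np.eye(len(self.X)), np.array(self.q))
        return k_star @ alpha                         # Q+(x*) = sum_i alpha_i k(x_i, x*)
```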
Policy Estimation Using Q+
•  Policy Evaluation using Q+(xs,xa) ~ GP
   –  Q+ can be estimated through GPR
   –  Define policy through Q+ : e.g. softmax (Gibbs
      distr.)
             π(s, a(i)) =  exp[ Q+(s, a(i)) / τ ]  /  Σ_j exp[ Q+(s, a(j)) / τ ]

                        =  exp[ Q+(x_s, x_a(i)) / τ ]  /  Σ_j exp[ Q+(x_s, x_a(j)) / τ ]

•  π [Q+ ] is an increasing functional over Q+
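A small sketch of this Gibbs/softmax policy, assuming the Q+ values of the candidate parametric actions at the current state have already been computed:

```python
import numpy as np

def softmax_policy(q_values, tau=1.0):
    """pi(a_i) proportional to exp(Q+(x_s, x_a_i) / tau); a max-shift keeps exp() stable."""
    q = np.asarray(q_values, dtype=float)
    z = (q - q.max()) / tau
    p = np.exp(z)
    return p / p.sum()

# example: sample one of the candidate (clock-face) actions
probs = softmax_policy([1.2, 0.4, -0.3, 2.0], tau=0.5)
action = np.random.default_rng().choice(len(probs), p=probs)
```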
Particle Reinforcement (1)

•  Recall from ARD: we want to minimize
   the dimension of K
  –  Preserve only the "essential" information
  –  Samples that lead to an increase in TD → positively-
     polarized particles
  –  Samples that lead to a decrease in TD → negatively-
     polarized particles
  –  Positive particles lead to a policy that is aligned with
     the global objective ________
  –  Negative particles serve as counterexamples for the
     agent to avoid repeating the same mistakes
  –  Example next
Particle Reinforcement (2)

[Figure: state partitions containing positively (+) and negatively (−) polarized particles.]

                                          •  Maintain a set of state partitions

                                          •  Keep track of both positive particles
                                             and negative particles

                                          •  Positive particles refer to the desired
                                             control policy, while negative particles
                                             point out what to avoid

                                          •  "Interpolate" the control policy. Recall:

                                             π[Q+] = π(s, a(i))
                                                   = exp[ Q+(x_s, x_a(i)) / τ ]  /  Σ_j exp[ Q+(x_s, x_a(j)) / τ ]
                    “Problem: How to replace older samples?”
Experience Association (1)
•  Basis learning principle:
    –  “If a decision led to a desired result in the past,
       then a similar decision should be replicated to cope
       with similar situations in the future”
      •  Again, use the kernel as a similarity measure


•  Agent: “Is my current situation similar to a particular
            experience in the past?”
•  Agent: "I see two highly related instances in memory where I did
           action #2, which led to a pretty decent result. OK, I'll try
           that again (or something similar)."
Experience Association (2):
               Policy Generalization

[Figure: clock-face action sets at nearby states A, B, C; correlated decision contexts generalize across neighboring regions.]

                                                      (x_s, x_a) ∈ S ⊗ A+

                                                      (x_s, x_a) ≈ (x_s + δx_s, x_a + δx_a)

                                                      k(x, x′)

                                               Similarity measure comes in handy
                                                - relate similar samples of experience
                                                - sample update
Reinforcement Field

•  A reinforcement field is a vector field in a Hilbert space, established
   by one or more kernels through their linear combination as a
   representation of the fitness function, where each kernel is centered
   on a particular augmented state vector:

                          Q+(·) = Σ_{i=1}^n α_i k(·, x_i)
Reinforcement Field: Example
                     •  Objective:
                        Travel to the destination
                        while circumventing obstacles

[Figure: navigation grid with the clock-face action set; gray regions are filled with obstacles.]

                     •  State space is partitioned
                        into a set of local regions

                     •  Gray areas are filled with
                        obstacles
                     •  A strong penalty is imposed
                        when the agent runs into
                        obstacle-filled areas
Reinforcement Field: Using Different Kernels


                k(x, x′) = θ_0 exp(−½ x̃ᵀ D⁻¹ x̃) + θ_1 σ²_noise

                k(x, x′) = θ_0 k_s(x_s, x′_s) k_a(x_a, x′_a) + θ_1 σ²_noise
Reinforcement Field Example

[Figure: the navigation example above, repeated — objective, partitioned state space, and obstacle-filled gray areas, with a state labeled S.]

Reinforcement Field Example

[Figure: image-only slide.]
Action Operator: Step-by-Step (1)
•  At a given state s ∈ S, the agent
   chooses a (parametric) action
   according to the current policy π [Q+ ]
•  The action operator resolves the
   random effect in the action parameters
   through a sampling process such that
   the (stochastic) action is reduced to
   a fixed action vector x_a.
•  The action vector resolved from
   above is subsequently paired with the
   current state vector x_s to form an
   augmented state x = (x_s, x_a).

[Figure: clock-face action set.]
Action Operator: Step-by-Step (2)
•  The new augmented state x is
   kernelized in terms of k(·, x), such
   that any state x implicitly maps
   to a function that expects
   another state x′ as an argument;
   k(x′, x) evaluates to a high value
   provided that x and x′ are
   strongly correlated.
•  The value prediction for the new
   augmented state is given by the
   reproducing property (in RKHS):

   ⟨Q+, k(·, x)⟩ = ⟨ Σ_{i=1}^m α_i k(·, x_i), k(·, x) ⟩ = Σ_{i=1}^m α_i k(x, x_i) = Q+(x)

[Figure: clock-face action set.]
Next …
•  The entire reinforcement field is generated by
   a set of training samples that are aligned with
   the global objective – maximizing payoff

              Ω : {(x1 , q1 ),...,(xm , qm )}
“Can we learn decision concept out of these training
 samples by treating them as a structured functional
 data?”
A Few Observations
•  Consider Ω : {(x1 , q1 ),...,(xm , qm )}

•  Properties of GPR
    –  Correlated inputs have correlated signals
    –  Can we assemble correlated samples together to form
       clusters of “similar decisions?”
    –  Functional clustering
        •  Cluster criteria
            –  Similarity takes into account both input patterns and their
               corresponding signals
Example: Task Assignment Domain (1)

•  Given a large set of computational resources and
   a continuous stream of user tasks, find the
   optimal task assignment policy
  –  A simplified control policy with actions: dispatch job or
     do not dispatch job (→ actions as decision choices)
     •  Not practical
     •  Users' concern: under what conditions can we optimize a
        given performance metric (e.g. minimize turnaround time)?
  –  Characterize each candidate server in terms of its
     resource capacity (e.g. CPU percentage time, disk
     space, memory, bandwidth, owner-imposed usage
     criteria, etc.)
     •  Actions: dispatching the current task to machine X
     •  Problems?
Example: Task Assignment Domain (2)

•  A general resource-sharing environment could
   have a large number of distributed resources
   (e.g. Grid network, Volunteer Computing, Cloud
   Computing, etc.)
  –  1000 machines → 1000 decision points per state.
•  Treat user tasks and machines collectively
   as a multi-agent system?
  –  Combinatorial state/action space
  –  Large number of agents
Task Assignment, Match Making:
            Other Similar Examples

•  Recommender Systems
  -  Selecting {movies, music, …} catering to the
     interests/needs of the (potential)
     customers: matching users with their
     favorite items (content-based)
  –  Online advertising campaign trying to match
     products and services with the right target
     audience
•  NLP Apps: Question and Answering Systems
  –  Matching questions with relevant documents
Generalizing RL with Functional Clustering
•  Abstract Action Representation
  –  (Functional) pattern discovery from within the
     generalized state-action pairs (as localized policy)
  –  Functional clustering by relating correlated inputs w.r.t.
     their correlated functional responses
     •  Inputs: generalized state-action pairs
     •  Functional responses: utilities
  –  Covariance matrix from GPR → fully-connected
     similarity graph → graph Laplacian
  –  Spectral clustering → a set of abstractions over
     (functionally) similar localized policies
  –  Control policy over these abstractions used as control
     actions
     •  Reduces decision points per state
     •  Reveals interesting correlations between state features
        and action parameters, e.g. match-making criteria
Policy Generalization

[Figure: clock-face action sets at neighboring states A, B, C; correlated decision contexts generalize across regions.]

                                                   (x_s, x_a) ∈ S ⊗ A+

                                                   (x_s, x_a) ≈ (x_s + δx_s, x_a + δx_a)

                                                   k(x, x′)

                                                     "Policy as a Functional over Q+"

                                                     "→ Experience Association"
Experience Association (1)
•  Basis learning principle:
    –  “If a decision led to a desired result in the past, then a
       similar decision can be re-applied to cope with similar
       situations in the future”
        •  Again, use the kernel as a similarity measure

•  Agent: "Is my current situation similar to a particular scenario
           in the past?"
•  Agent: "I see two similar instances in memory where I did action #2,
           and that action led to a decent result. OK, I'll try that again
           (or something similar)."
           "And if not, let me avoid repeating the mistake again."
    –  Hypothetical state s_h+
        •  Definition
        •  Agent: "If I look at a past experience, replicate the action of
           that experience and apply it to my current situation, is the result
           going to be similar?"
*Experience Association (2)
•  Step 1: form a hypothetical state (If I were to be …)
   –  I am at a state s′ = x′_s
   –  I pick a (relevant) particle from a set (later used in an
      abstraction)
             ω(i) ∈ Ω(i) ← A(i) ,       s+ : (x_s, x_a)
   –  Replicate the action and apply it to my own state
              s_h+ = (x′_s, x_a)
•  Step 2: Compare (… is the result going to be similar?)
   –  Compare the result using the kernel (that's my state
      correlation hypothesis)

              k(s_h+, s+) = k(x′, x) = k( (x′_s, x_a), (x_s, x_a) )

   –  If k(s_h+, s+) ≥ τ, then this sample is correlated with my
      state, and the state of the target sample is in context;
      otherwise, it is out of context
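A sketch of this two-step test, assuming state and action vectors are NumPy arrays and reusing an SE kernel as the correlation hypothesis (the kernel choice and threshold are illustrative assumptions):

```python
import numpy as np

def se_kernel(a, b, lengthscale=1.0):
    d = np.asarray(a, float) - np.asarray(b, float)
    return float(np.exp(-0.5 * d @ d / lengthscale**2))

def in_context(x_s_new, particle, tau=0.8, kernel=se_kernel):
    """Step 1: replicate the particle's action at the current state, s_h+ = (x'_s, x_a).
    Step 2: compare s_h+ with the stored augmented state s+ = (x_s, x_a) via the kernel."""
    x_s_old, x_a_old = particle
    s_hyp = np.concatenate([x_s_new, x_a_old])
    s_old = np.concatenate([x_s_old, x_a_old])
    return kernel(s_hyp, s_old) >= tau
```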
*Experience Association (3)

•  Normalize the kernel such that it assumes a probabilistic
   semantics
  –  This is a generalization of the concept of probability
     amplitude in QM.

             k(x, x′) / sqrt( k(x, x) k(x′, x′) ) = k′(x, x′) = ⟨Φ(x), Φ(x′)⟩ = ⟨Φ | Φ′⟩

•  Nadaraya-Watson model (Section 5.7)
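A one-line sketch of this normalization (cosine normalization in the kernel-induced feature space):

```python
import numpy as np

def normalized_kernel(k, x, x_prime):
    """k'(x, x') = k(x, x') / sqrt(k(x, x) * k(x', x')), i.e. the cosine of the angle
    between the feature maps Phi(x) and Phi(x'); its value lies in [-1, 1]."""
    return k(x, x_prime) / np.sqrt(k(x, x) * k(x_prime, x_prime))
```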
Concept-Driven Learning Architecture (CDLA)

•  The agent derives its control policy only at the level
   of abstract actions
   –  Further reduce decision points per state
   –  Find interesting patterns across state features and
      action parameters. E.g. match-making criterion
•  We have the necessary representation to form
   (functional) clusters
   –  Kernel as a similarity measure over augmented states
   –  Covariance matrix K from GPR
•  Each abstract action is represented through a
   set of experience particles
CDLA Conceptual Hierarchy

[Figure: conceptual hierarchy —
  (1) a set of unstructured experience particles;
  (2) graph representation from K, where K_ij = k(x_i, x_j);
  (3) partitioned graph with two abstract actions.]
*Spectral Clustering: Big Picture

•  Construct similarity graph (← GPR)
•  Graph Laplacian (GL)
•  Graph cut as objective function (e.g. normalized
   cut)
•  Optimize the graph cut criterion
   –  Minimizing the Normalized Cut → partitions as tight as
      possible
      ← maximal in-cluster links (weights) and
        minimal between-cluster links
   –  NP-hard → spectral relaxation
•  Use the eigenvectors of the GL as continuous versions of the
   cluster indicator vectors
•  Evaluate final clusters using a selected instance-
   based clustering algorithm (k-means++)
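A compact sketch of this pipeline using the random-walk Laplacian and scikit-learn's KMeans (which uses k-means++ initialization by default); the affinity matrix W is assumed to come from the GPR covariance as described above:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clusters(W, k):
    """Affinity matrix -> random-walk graph Laplacian -> k leading eigenvectors -> k-means."""
    d = W.sum(axis=1)
    L_rw = np.eye(W.shape[0]) - W / d[:, None]       # L_rw = I - D^{-1} W
    eigvals, eigvecs = np.linalg.eig(L_rw)           # L_rw is not symmetric in general
    order = np.argsort(eigvals.real)
    U = eigvecs[:, order[:k]].real                   # eigenvectors with the k smallest eigenvalues
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```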
*Spectral Clustering: Definitions
Pairwise affinity:            w_nm = k(x_n, x_m)

Degree:                       D_n = Σ_{m=1}^N w_nm

Volume of a set:              Vol(C) = Σ_{n∈C} D_n

Cut between two sets:         Cut(C_1, C_2) = Σ_{n∈C_1} Σ_{m∈C_2} w_nm
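These four quantities in code, assuming W is a symmetric NumPy affinity matrix and clusters are given as index lists:

```python
import numpy as np

def degree(W):
    """D_n = sum_m w_nm for every node n."""
    return W.sum(axis=1)

def volume(W, C):
    """Vol(C) = sum of degrees of the nodes in C."""
    return degree(W)[list(C)].sum()

def cut(W, C1, C2):
    """Cut(C1, C2) = total affinity between the two node sets."""
    return W[np.ix_(list(C1), list(C2))].sum()
```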
*Spectral Clustering: Graph Cut

•  Graph Cut
  –  Naïve Cut

        Cut(A(1), ..., A(k)) = ½ Σ_{i=1}^k W( A(i), V \ A(i) )

  –  (K-way) Ratio Cut

        RatioCut(A(1), ..., A(k)) = ½ Σ_{i=1}^k W( A(i), V \ A(i) ) / |A(i)|

  –  (K-way) Normalized Cut                                            NP-hard!

        NCut(A(1), ..., A(k)) = ½ Σ_{i=1}^k W( A(i), V \ A(i) ) / vol(A(i))
*Spectral Clustering: Approximation
•  Approximation
   –  Given a set of data to cluster
   –  Form affinity matrix W
   –  Find leading k eigenvectors

        Lv = λ v
   –  Cluster data set in the
      eigenspace
   –  Projecting back to the
      original data
•  Major differences in
   algorithms:
        L = f (W )
*Random Walk Graph Laplacian (1)

•  Definition:   L_rw = D⁻¹(D − K) = I − D⁻¹K
•  First-order Markov transition matrix
   –  Each entry: probability of transitioning from node i to
      node j in a single step

               d_i = Σ_{j=1}^m K_ij = Σ_{j=1}^m k(x_i, x_j)

               P_ij = K_ij / Σ_{j=1}^m K_ij = k(x_i, x_j) / d_i

               L_rw = I − D⁻¹K = I − P
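The same construction in a few lines, assuming K is the kernel/affinity matrix obtained from GPR:

```python
import numpy as np

def random_walk_laplacian(K):
    """L_rw = I - D^{-1} K, with P = D^{-1} K as the one-step transition matrix."""
    d = K.sum(axis=1)                 # d_i = sum_j K_ij
    P = K / d[:, None]                # P_ij = K_ij / d_i
    return np.eye(K.shape[0]) - P
```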
*Random Walk Graph Laplacian (2)

•  K-way normalized cut: find a partitioning s.t. the
   probability of transitioning across clusters is
   minimized
        NCut(C) = Σ_{i=1}^k ( 1 − P(C_i → C_i | C_i) )
                = Σ_{i=1}^k ( 1 −  Σ_{n,m∈C_i} w_nm  /  Σ_{n∈C_i} Σ_{m=1}^N w_nm )


    –  When used with Random-Walk GL, this corresponds to
       minimizing the probability of the state-transitions
       between clusters
•  In CDLA
    –  Each abstract action corresponds to a coherent decision
       concept
    –  Why? By taking any action in any of the states associated
       with the same concept, there is a minimal chance of
       transitioning to states associated with other concepts
Functional Clustering using GPSC

•  GPR + SC à SGP Clustering
  –  Kernel as correlation hypothesis
  –  Same hypothesis used as similarity measure for SC
  –  Correlated inputs share approximately identical
     outputs:
  –  Similar augmented states ~ close fitness values or
     utilities
•  Warning: The reverse is NOT true and this is why
   we need multiple concepts that may share similar
   output signals
  –  E.g. match-making job and machine requirements
CDLA: Context Matching
•  Each abstract action implicitly defines an action-
   selection strategy
   –  In context: at a given state, find the most correlated
      state pair with its action, followed by applying that
      action
   –  Out of context: random action selection
      •  Applicable in 1) infant agent 2) empty cluster 3)
         referenced particles don’t match (by experience
         association)
       •  Caveat:

                   Q(s, A(i)) ≠ Q+(s, a(i)) !

       •  The utility (fitness value) for random action selection
          does not correspond to the true estimate of resolving
          the random action
       •  Need another way to adjust the utility estimate for its
          fitness value
Empirical Study: Task Assignment Domain
Goal: Find for each incoming user task the best candidate
      server(s) that are mutually agreeable in terms of
      matching criteria.

  Parameter Spec.              Values
  Task feature set (state)     Task type, size, expected runtime
  Server feature set (action)  Service type, percentage CPU time, memory,
                               disk space, CPU speed, job slots
  Kernel k                     SE kernel + noise (see (5.3))
  Num. of Abstract Actions     10 (assumed to be known)
  Model update cycle T         10 state transitions towards 100
Empirical Study: Learned Concepts

       Task type   Size   Expected runtime   Service type   %CPU time   Fitness Value
  1        1        1.1        0.93               1            9.784        120.41
  2        2        2.5        1.98               2           10.235        128.13
  3        3        3.2        2.92               3           15.29         135.23
  4        1        1.0        1.02               2           20.36         -50.05
  5        2        2.0        2.09               3            0.58         -47.28


     •  Illustration of 5 different learned decision concepts.
     •  The top 3 rows (in blue) indicate successful matches, while
        the bottom 2 rows (in yellow) indicate failed matches.
Empirical Study: Comparison
[Figure: reward per episode over 100 episodes for G-SARSA, Condor, and Random; y-axis: reward per episode (−2000 to 1500), x-axis: episodes (0–100).]
       •  Performance comparison among
          (1) Stripped-down Condor match-making (black)
          (2) G-SARSA (blue)
          (3) Random (red)
Sample Promotion (1)



           k(s_h+, s) ≥ τ

Two possibilities:
   (1) the target is indeed correlated with a given abstract action (AA)

   (2) the target is NOT correlated with ANY abstract action
       → trial and error + sample promotion
Sample Promotion (2)
•  Recall: each abstract action implicitly defines an action
   selection strategy
•  Key: the functional data set must be as accurate as possible in
   terms of its predictive strength
•  Match a new experience particle against the memory, find the
   most relevant piece and use its value estimation.
•  The out-of-context case leads to randomized action selection.
•  How does the agent still manage to gain experience in this case?
•  Randomized action selection + sample promotion
    –  Case 1: the random action does get a positive result
       •  Match the result against the abstract actions by sampling, using the
          experience association operation
                          (s, a(i)) → (x_s, x_a)
       •  If indeed correlated with some experience, then update the
          fitness value; otherwise → case 2
    –  Case 2: Discard the sample because there are no points of reference;
               the sample is not useful
GRL Schematic




Figure 8.1: CDLA schematic. Periodic iterations between value-function approximation
and concept-driven cluster formation constitute the heartbeat of CDLA. The conceptual …
Demo of 4 Decision Concepts
[Figure: rover demo annotated with four learned decision concepts A(1)–A(4) and the action sequence A_t, ..., A_{t−k−1}.]

   •  A(1): "Trapped in high hills" → avoid
   •  A(3): "Fallen into deep water" → avoid
   •  A(2): "Discovery of plant life" → collect
   •  A(4): "Trash removed and organized" → clean

   Concept polarity:   positive  { A(2), A(4) }
                       negative  { A(1), A(3) }
Future Work
More on …
•    Using spectral clustering to cluster functional data
•    Experience association
•    Context matching
•    Evolving the samples
     –  sample promotion
     –  Probabilistic model for experience associations
•  Evolving the clusters
     –  Need to adjust value estimate for abstract actions as new
        samples join in
     –  Morphing clusters
