1. The document discusses challenges with standard reinforcement learning formulations due to large state and action spaces. It proposes representing actions as operators that induce state transitions rather than discrete choices.
2. It introduces a generalized reinforcement learning framework using kernel methods to compare "decision contexts" or state-action pairs. Value functions are represented as vectors in a Reproducing Kernel Hilbert Space rather than concrete mappings.
3. Gaussian process regression is used to predict values for unseen state-action pairs by comparing them to stored samples, enabling generalization beyond explored contexts. Hyperparameters are tuned to best explain sample data using marginal likelihood optimization.
2. Overview
• Standard Formulation of Reinforcement Learning
• Challenges in the standard RL framework due to its representation
• Generalized/Alternative action formulation
• Action as an Operator
• Parametric-Action Model
• Reinforcement Field
– Using a kernel as a similarity measure over “decision contexts” (i.e. generalized state-action pairs)
– Value predictions using functions (vectors) from an RKHS (a vector space)
– Representing the policy using kernelized samples
3. Reinforcement Learning: Examples
• A learning paradigm that formalizes sequential decision processes under uncertainty
– Navigating in an unknown environment
– Playing and winning a game (e.g. Backgammon)
– Retrieving information over the web (finding the right info on the right websites)
– Assigning user tasks to a set of computational resources
• References:
– Reinforcement Learning: A Survey by Leslie P. Kaelbling, Michael L. Littman, Andrew W. Moore
– Autonomous Helicopter: Andrew Ng
4. Reinforcement Learning: Optimization Objective
• Optimizing performance through trial and error.
– The agent interacts with the environment, performing the right actions so that they induce a state trajectory that maximizes rewards.
– Task dependent; can have multiple subgoals/subtasks
– Learning from incomplete background knowledge
• Ex1: Navigating in an unknown environment
– Objective: shortest path + avoiding obstacles + minimizing fuel consumption + …
• Ex2: Assigning user tasks to a set of servers with unknown resource capacity
– Objective: minimize turnaround time, maximize success rate, load balancing, …
7. Challenges in Standard RL Formulations (1)
• Challenges from large state and action space
– the complexity of RL methods depends largely on the dimensionality of the state space representation
Later …
• Solution: Generalized action representation
– Explicitly express errors/variations of actions
(x_s, x_a) ∈ S ⊗ A+
x = (x_s, x_a)
k(x, x′) // compare two decision contexts (see the kernel sketch below)
– It becomes possible to express correlations within a state-action combination and reveal their interdependency.
– The value function no longer needs to express a concrete mapping from state-action pairs to their values.
– Enables simultaneous control over multiple parameters that collectively describe behavioral details as actions are performed.
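A minimal sketch of such a kernel over decision contexts, assuming a squared-exponential (ARD) form; the feature values and length scales below are made up for illustration:

```python
import numpy as np

def se_ard_kernel(x, x_prime, length_scales, theta0=1.0):
    """Squared-exponential (ARD) kernel k(x, x') over augmented decision
    contexts x = (x_s, x_a): state and action features share one vector."""
    d = (x - x_prime) / length_scales          # per-dimension scaling
    return theta0 * np.exp(-0.5 * np.dot(d, d))

# Two decision contexts: state (x, y) concatenated with action (dx, dy).
x1 = np.array([0.0, 0.0, 1.0, 0.5])            # (x_s, x_a)
x2 = np.array([0.2, 0.1, 0.9, 0.6])
ell = np.array([1.0, 1.0, 0.5, 0.5])           # hyperparameters (later tuned via ARD)
print(se_ard_kernel(x1, x2, ell))              # high value => similar decisions
```

Because state and action features enter the same vector, the kernel can pick up correlations between them once the length scales are tuned.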
8. Challenges in Standard RL Formulations (2)
• Challenges from Environmental Shifts
– Unexpected behaviors inherent in actions (same action but different outcomes over time)
– E.g. Recall the rover navigation example earlier
• Navigational policy learned under different surface conditions
• Challenges from Irreducible and Varying Action Sets
– Large number of decision points
– Feasible actions do not stay the same
– E.g. Assigning tasks to time-varying computational resources in a dynamic virtual cluster (DVC)
• Compute resources are dynamically acquired with limited walltimes
9. Reinforcement Learning Algorithms …
• In most complex domains, T and R need to be estimated → RL framework
[MDP diagram: state s_t, action a_t, reward r_t, successor state s_{t+1}]
– Temporal Difference (TD) Learning
• Q-learning, SARSA, TD(λ)
– Example of TD learning: SARSA
– SARSA update rule (see the code sketch below):
Q(s, a) ← Q(s, a) + α [r + γ Q(s′, a′) − Q(s, a)]
– Q-learning update rule:
Q(s, a) ← Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)]
– Function approximation, SMC, Policy Gradient, etc.
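A hedged sketch of the two tabular updates above (a dictionary-backed Q table; alpha and gamma are illustrative defaults):

```python
# Q is a dict keyed by (state, action); alpha is the step size, gamma the discount.

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    """On-policy: bootstrap with the action a_next actually chosen in s_next."""
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """Off-policy: bootstrap with the greedy value max over a' of Q(s', a')."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
```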
“But the problems are …”
10. Alternative View of Actions
• Actions are not just decision choices
– Variational procedure (e.g. principle of least {action, time})
– Errors
• Poor calibrations of actuators
• Extra bumpy surfaces
– Real-world domains may involve actions with high complexity
• Robotic control (e.g. simultaneous control over a set of joint parameters)
“What is an action really?”
Actions induce a shift in the state configuration
– A continuous process
– A process involving errors
– State and action have hidden correlations
• Current knowledge base (state)
• New info retrieved from external world (action)
– Similarity between decisions: (x1, a1) vs. (x2, a2)
11. Action as an Operator
• Action Operator (aop)
– Acts on the current state and produces its successor state.
• aop takes an input state … (1)
• aop resolves the stochastic effects in the action … (2)
– Recall: the action is now parameterized by constrained random variables (e.g. Δr, Δθ)
• Given a fixed operator by (1) and (2), aop now maps the input state to the output state (i.e. the successor state)
• The current state vector + action vector → augmented state
• E.g. (a code sketch follows below)
diag(1 + Δx/x, 1 + Δy/y) (x, y)ᵀ = (x + Δx, y + Δy)ᵀ,  where Δx = Δr cos(Δθ), Δy = Δr sin(Δθ)
⇒ (x_s, x_a) = ((x, y), (Δx, Δy))
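A minimal sketch of the operator view, assuming uniform sampling bounds for (Δr, Δθ); the ranges and helper names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

def resolve_action(dr_range=(0.5, 1.0), dtheta_range=(-0.3, 0.3)):
    """Sample the constrained random variables (dr, dtheta) once."""
    return rng.uniform(*dr_range), rng.uniform(*dtheta_range)

def apply_action_operator(state, dr, dtheta):
    """Resolve the stochastic action, shift the state, and form (x_s, x_a)."""
    x, y = state
    dx, dy = dr * np.cos(dtheta), dr * np.sin(dtheta)
    successor = (x + dx, y + dy)                 # aop acting on the state
    augmented = np.array([x, y, dx, dy])         # x = (x_s, x_a)
    return successor, augmented

succ, x_aug = apply_action_operator((1.0, 2.0), *resolve_action())
```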
12. Value Prediction: Part I
“How to connect the notion of action
operator with value predictions?”
13. Parametric Actions (1)
[Figure: 12 pie-shaped action scopes (numbered 1–12) around points A and B, each parameterized by Δr and Δθ]
• Actions as a random process: x_a = (x_1, x_2) = (Δr, Δθ)
• Example: Y(x_i) = P((x_i + w_i) ∈ Γ),  (x_a)_i = x_i | Y(x_i) ≥ η
• These 12 parametric actions all take on parameters bounded within pie-shaped scopes
14. Parametric Actions (2)
[Figure: the same 12 pie-shaped action scopes, parameterized by Δr and Δθ]
• Actions as a random process: x_a = (x_1, x_2) = (Δr, Δθ)
• Augmented state space
⇒ (x_s, x_a) = ((x, y), (Δx, Δy))
• Action as an operator:
diag(1 + Δx/x, 1 + Δy/y) (x, y)ᵀ = (x + Δx, y + Δy)ᵀ
• Learn a potential function Q+(x_s, x_a) = Q+(x, y, Δx, Δy)
15. Using GPR: A Thought Process (1)
• Need to gauge the similarity/correlation between any two decisions → kernel functions k(x, x′)
• Need to estimate the (potential) value of an arbitrary combination of state and action without needing to explore the whole state space → the value predictor is a “function” of kernels
• The class of functions that exhibit the above properties → functions drawn from a Reproducing Kernel Hilbert Space (RKHS)
Q+(·) = Σ_{i=1..n} α_i k(x_i, ·)        Q+(x*) = Σ_{i=1..n} α_i k(x_i, x*)
• The GP regression (GPR) method induces such functions [GPML, Rasmussen]
– Representer Theorem [B. Schölkopf et al. 2000]
cost((x_1, y_1, f(x_1)), …, (x_m, y_m, f(x_m))) + regularizer(‖f‖)
16. Using GPR: A Thought Process (2)
• With GPR, we work with samples of experience
Ω : {(x1 , q1 ),...,(xm , qm )}
• The Approach
– Find the “best” function(s) that explain the data (decisions made so far)
• Model Selection
– Keep only the most important samples from which to derive the value functions
• Sample Selection
• It is relatively easy to implement the notion of model selection with GPR
– Tune the hyperparameters of the kernel (as the covariance function of the GP) such that the marginal likelihood (of the data) is maximized (see the sketch below)
– Periodic model selection to cope with environmental shifts
• Sample selection
– Two-tier learning architecture
• Using a baseline RL learning algorithm for obtaining utility estimates
• Using GPR to achieve generalization
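A hedged sketch of this model-selection step using scikit-learn's GP regressor, which maximizes the log marginal likelihood when fitting; the toy data and kernel choice are assumptions, not the implementation described here:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel

# Toy sample set Omega = {(x_i, q_i)}: augmented states and utility targets.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(30, 4))                  # x_i = (x_s, x_a)
q = np.sin(X[:, 0] + X[:, 2]) + 0.05 * rng.standard_normal(30)

# ARD-style RBF kernel + noise; fit() tunes the hyperparameters by maximizing
# the log marginal likelihood (model selection).
kernel = ConstantKernel() * RBF(length_scale=np.ones(4)) + WhiteKernel()
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=3).fit(X, q)

x_star = rng.uniform(-1, 1, size=(1, 4))              # unseen decision context
q_mean, q_std = gpr.predict(x_star, return_std=True)  # generalization via kernels
print(gpr.kernel_, gpr.log_marginal_likelihood_value_)
```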
17. *GPR I: Gaussian Process
• GP: a probability distribution over a set of random
variables (or function values), any finite set of
which has a (joint) Gaussian distribution
y | x, M_i ~ N(f, σ_noise² I)  ⇒  f | M_i ~ GP(m(x), k(x, x′))
– Given a prior assumption over functions (See previous page)
– Make observations (e.g. policy learning) and gather evidence
– Posterior probability distribution over functions, eliminating those not consistent with the evidence
[Figure “Prior and Posterior”: samples of f(x) drawn from the GP prior and from the posterior after conditioning on observations; output f(x) vs. input x]
Predictive distribution:
p(y* | x*, X, y) ~ N( k(x*, X)ᵀ [K + σ_noise² I]⁻¹ y,  k(x*, x*) − k(x*, X)ᵀ [K + σ_noise² I]⁻¹ k(X, x*) )
18. *GPR II: State Correlation Hypothesis
• Kernel: covariance function (PD Kernel)
– Correlation hypothesis for states
– Prior:
k(x, x′) = θ_0 exp(−½ x̃ᵀ D⁻¹ x̃) + θ_1 σ_noise² = θ_0 exp(−½ Σ_i x̃_i² / θ_i) + θ_1 σ_noise²,  where x̃ = x − x′ and D = diag(θ_i)
– Observe samples
Ω : {(x_1, q_1), …, (x_m, q_m)}
– Compute the GP posterior (over latent functions)
Q+(x_i) | X, q, x_i ~ GP( m_post(x_i) = k(x_i, X)ᵀ K(X, X)⁻¹ q,  cov_post(x_i, x) = k(x_i, x_i) − k(x_i, X)ᵀ K(X, X)⁻¹ k(X, x_i) )
– Predicted distribution → averaging over all possible posterior weights with respect to the Gaussian likelihood function (a from-scratch sketch follows below)
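A from-scratch sketch of the posterior computation above (fixed, hand-picked hyperparameters and toy fitness targets; jitter added for numerical stability):

```python
import numpy as np

def k(a, b, theta0=1.0, lengths=None):
    """Squared-exponential kernel between two batches of augmented states."""
    lengths = np.ones(a.shape[-1]) if lengths is None else lengths
    d = (a[:, None, :] - b[None, :, :]) / lengths
    return theta0 * np.exp(-0.5 * np.sum(d ** 2, axis=-1))

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(20, 4))                 # augmented states
q = np.sin(2 * X[:, 0]) + 0.3 * X[:, 3]              # observed fitness values
K = k(X, X) + 1e-6 * np.eye(len(X))                  # jitter for stability

X_star = rng.uniform(-1, 1, size=(3, 4))             # test decision contexts
K_s = k(X_star, X)
mean_post = K_s @ np.linalg.solve(K, q)                          # k(x*,X) K^-1 q
cov_post = k(X_star, X_star) - K_s @ np.linalg.solve(K, K_s.T)   # posterior covariance
```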
19. *GPR III: Value Prediction
• Predictive distribution at a test point x*:
q* = Q+(x*) = k(x*)ᵀ K(X, X)⁻¹ q = Σ_{i=1..n} α_i k(x_i, x*)
cov(q*) = k(x*, x*) − k*ᵀ (K + σ_n² I)⁻¹ k*
– The prediction for a new test point is obtained by comparing it with all the samples retained in memory
– The predictive value largely depends on correlated samples
– The (sample) correlation hypothesis (i.e. the kernel) applies across the entire state space
– Reproducing property from the RKHS [B. Schölkopf and A. Smola, 2002]:
(x*, Q+(x*)):  ⟨Q+(·), k(·, x*)⟩ = Q+(x*)
20. *GPR IV: Model Selection
• Maximizing Marginal Likelihood (ARD)
log p(q | X) = −½ qᵀ K⁻¹ q − ½ log|K| − (n/2) log 2π
– A trade-off between data fit and model complexity
– Optimization: take the partial derivative w.r.t. each hyperparameter (sketched below) to get
∂/∂θ_i log p(q | X, θ) = ½ qᵀ K⁻¹ (∂K/∂θ_i) K⁻¹ q − ½ tr(K⁻¹ ∂K/∂θ_i)
• Conjugate gradient optimization
• We get the optimal hyperparameters that best explain the data
• The resulting model follows the Occam’s Razor principle
• Computing K⁻¹ is expensive → Reinforcement Sampling
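A small numeric sketch of the objective and its gradient for a single shared length scale ell (the noise level and toy data are illustrative); a real implementation would hand these to a conjugate-gradient optimizer:

```python
import numpy as np

def kernel(X, ell):
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def log_marginal_likelihood(q, X, ell, noise=1e-2):
    K = kernel(X, ell) + noise * np.eye(len(X))
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * q @ np.linalg.solve(K, q) - 0.5 * logdet - 0.5 * len(X) * np.log(2 * np.pi)

def gradient_wrt_ell(q, X, ell, noise=1e-2):
    K = kernel(X, ell) + noise * np.eye(len(X))
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    dK = kernel(X, ell) * d2 / ell ** 3          # dK / d(ell)
    Kinv = np.linalg.inv(K)
    alpha = Kinv @ q
    # (1/2) q^T K^-1 dK K^-1 q - (1/2) tr(K^-1 dK)
    return 0.5 * alpha @ dK @ alpha - 0.5 * np.trace(Kinv @ dK)

X = np.random.default_rng(5).uniform(-1, 1, (15, 3))
q = np.sin(X[:, 0])
print(log_marginal_likelihood(q, X, ell=0.8), gradient_wrt_ell(q, X, ell=0.8))
```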
21. Value Predictions (Part II)
• Now what remains to solve are:
– How to obtain the training signals {q_i}?
• Use a baseline RL agent to estimate utility based on an MDP with concrete action choices
• Use GPR to generalize the utility estimates with the parameterized action representation
– How to train the desired value function using only essential samples?
• Memory constraint!
• Sample replacement using reinforcement signals
– Old samples referencing old utility estimates can be replaced by new samples with new estimates
– Experience Association
22. From Baseline RL Layer to GPR (1)
[Annotated figure over thesis excerpts: a baseline MDP learner interacts with the environment (s_t, a_t, r_t, s_{t+1}); its utility estimates feed a GPR layer defined over augmented states, with the pie-shaped parametric action scopes and the kernel expansion Q+(x_s, x_a) = Q+(x) = Σ_{i=1..m} α_i k(x, x_i) over the retained samples Ω : {(x_1, q_1), …, (x_m, q_m)}.]
• 1a. Baseline RL: e.g. SARSA
– MDP-based RL treats actions as decision choices
– Estimates utility without looking at the action parameters
• 1b. Estimate a utility for each “regular” state-action pair
• 2. Use the generalization capacity of the GP to predict values
– Expand the action into its parametric representation
– The kernel assumption accounts for the random effect in the action
• 3. From the baseline RL layer to GPR
– As the agent traverses the state space over m steps, a sequence of training samples is gathered and retained in memory: Ω : {(x_1, q_1), …, (x_m, q_m)}, the augmented states and their corresponding observed fitness values
– With the noisy kernel k(x_i, x_j) as the covariance function, the predictive distribution over new targets is Gaussian (next)
23. From Baseline RL Layer to GPR (2)
[Annotated figure, continued: the same architecture with the kernelization steps called out; an end-to-end sketch follows below.]
• 3. From the baseline RL layer to GPR
– 3a. Take a sample of the random action vector
– 3b. Form the augmented state: (s, a) → (x_s, x_a)
– 3c. Propagate the utility signal from the baseline learner and use it as the training signal
– 3d. Insert the new (functional) data into the current working memory
• 4. Use GPR to predict new test points
– 4a. Kernelize the new test point: (s, a) → (x_s, x_a) → k(·, x)
– 4b. Take the inner product of k(·, x) and Q+ to obtain the utility estimate (fitness value)
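Putting steps 1-4 together, a hedged end-to-end sketch (the toy utility targets stand in for the baseline learner's output; the kernel length scale and noise level are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

def kern(A, B, ell=0.5):
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / ell ** 2)

# Steps 1 and 3c: pretend the baseline learner produced utilities for visited contexts.
memory_X = rng.uniform(-1, 1, size=(50, 4))          # x_i = (x_s, x_a), steps 3a-3b
memory_q = np.cos(memory_X[:, 0]) - memory_X[:, 2]   # q_i from the baseline layer

# Step 3d: the "working memory" is (memory_X, memory_q); alpha = (K + sigma^2 I)^-1 q.
K = kern(memory_X, memory_X) + 1e-2 * np.eye(len(memory_X))
alpha = np.linalg.solve(K, memory_q)

# Steps 4a-4b: kernelize a new test point and take the inner product with Q+.
x_test = rng.uniform(-1, 1, size=(1, 4))
q_hat = kern(x_test, memory_X) @ alpha               # Q+(x*) = sum_i alpha_i k(x_i, x*)
```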
24. Policy Estimation Using Q+
• Policy evaluation using Q+(x_s, x_a) ~ GP
– Q+ can be estimated through GPR
– Define the policy through Q+: e.g. softmax (Gibbs distribution), sketched below
π(s, a^(i)) = exp[Q+(s, a^(i)) / τ] / Σ_j exp[Q+(s, a^(j)) / τ]
           = exp[Q+(x_s, x_a^(i)) / τ] / Σ_j exp[Q+(x_s, x_a^(j)) / τ]
• π[Q+] is an increasing functional over Q+
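A minimal sketch of this softmax policy over candidate parametric actions, with a stand-in function for the GP predictive mean:

```python
import numpy as np

def softmax_policy(q_plus, x_s, candidate_actions, tau=0.5):
    """Score each candidate action vector x_a with Q+ and normalize (Gibbs)."""
    scores = np.array([q_plus(np.concatenate([x_s, x_a])) for x_a in candidate_actions])
    weights = np.exp((scores - scores.max()) / tau)   # subtract max for stability
    return weights / weights.sum()

x_s = np.array([0.2, -0.1])
candidates = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
fake_q = lambda x: -np.sum(x ** 2)                    # stand-in for the GP mean
print(softmax_policy(fake_q, x_s, candidates, tau=0.5))
```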
25. Particle Reinforcement (1)
• Recall from ARD: we want to minimize the dimension of K
– Preserving only the “essential info”
– Samples that lead to an increase in TD → positively-polarized particles
– Samples that lead to a decrease in TD → negatively-polarized particles
– Positive particles lead to a policy that is aligned with the global objective ________
– Negative particles serve as counterexamples for the
agent to avoid repeating the same mistakes.
– Example next
26. Particle Reinforcement (2)
• Maintain a set of state partitions
• Keep track of both positive particles and negative particles
• Positive particles refer to the desired control policy, while negative particles point out what to avoid
• “Interpolate” the control policy. Recall:
π[Q+] = π(s, a^(i)) = exp[Q+(x_s, x_a^(i)) / τ] / Σ_j exp[Q+(x_s, x_a^(j)) / τ]
[Figure: state partitions annotated with positive (+) and negative (−) particles]
“Problem: How to replace older samples?”
27. Experience Association (1)
• Basic learning principle:
– “If a decision led to a desired result in the past,
then a similar decision should be replicated to cope
with similar situations in the future”
• Again, use the kernel as a similarity measure
• Agent: “Is my current situation similar to a particular
experience in the past?”
• Agent: “I see there are two highly related instances in memory where I did action #2, which led to a pretty decent result. OK, I’ll try that again (or something similar).”
28. Experience Association (2): Policy Generalization
(x_s, x_a) ∈ S ⊗ A+
(x_s, x_a) ≈ (x_s + Δx_s, x_a + Δx_a)
k(x, x′)
[Figure: regions A, B, C of the partitioned state space with the pie-shaped action scopes]
• The similarity measure comes in handy to
– relate similar samples of experience
– perform sample updates
29. Reinforcement Field
• A reinforcement field is a vector field in a Hilbert space established by one or more kernels through their linear combination as a representation of the fitness function, where each kernel centers around a particular augmented state vector
Q+(·) = Σ_{i=1..n} α_i k(·, x_i)
30. Reinforcement Field: Example
• Objective: travel to the destination while circumventing obstacles
[Figure: grid world partitioned into local regions, with the 12 pie-shaped parametric actions; gray areas are obstacles]
• The state space is partitioned into a set of local regions
• Gray areas are filled with obstacles
• A strong penalty is imposed when the agent runs into obstacle-filled areas
31. Reinforcement Field: Using Different Kernels
k(x, x′) = θ_0 exp(−½ x̃ᵀ D⁻¹ x̃) + θ_1 σ_noise²,  where x̃ = x − x′
k(x, x′) = θ_0 k_s(x_s, x′_s) k_a(x_a, x′_a) + θ_1 σ_noise²
32. Reinforcement Field Example
• Objective: travel to the destination while circumventing obstacles
[Figure: the same partitioned grid world, with the start location S marked]
• The state space is partitioned into a set of local regions
• Gray areas are filled with obstacles
• A strong penalty is imposed when the agent runs into obstacle-filled areas
34. Action Operator: Step-by-Step (1)
• At a given state s ∈ S, the agent chooses a (parametric) action according to the current policy π[Q+]
• The action operator resolves the random effect in the action parameters through a sampling process, such that the (stochastic) action is reduced to a fixed action vector x_a
• The action vector resolved above is subsequently paired with the current state vector x_s to form an augmented state x = (x_s, x_a)
[Figure: the 12 pie-shaped parametric action scopes]
35. Action Operator: Step-by-Step (2)
• The new augmented state x is kernelized in terms of k(·, x), such that any state x implicitly maps to a function that expects another state x′ as an argument; k(x′, x) evaluates to a high value provided that x and x′ are strongly correlated.
• The value prediction for the new augmented state is given by the reproducing property (in the RKHS):
⟨Q+, k(·, x)⟩ = ⟨Σ_{i=1..m} α_i k(·, x_i), k(·, x)⟩ = Σ_{i=1..m} α_i k(x, x_i) = Q+(x)
[Figure: the 12 pie-shaped parametric action scopes]
36. Next …
• The entire reinforcement field is generated by a set of training samples that are aligned with the global objective of maximizing payoff
Ω : {(x_1, q_1), …, (x_m, q_m)}
“Can we learn decision concepts from these training samples by treating them as structured functional data?”
37. A Few Observations
• Consider Ω : {(x_1, q_1), …, (x_m, q_m)}
• Properties of GPR
– Correlated inputs have correlated signals
– Can we assemble correlated samples together to form clusters of “similar decisions”?
– Functional clustering
• Cluster criteria
– Similarity takes into account both the input patterns and their corresponding signals
38. Example: Task Assignment Domain (1)
• Given a large set of computational resources and a continuous stream of user tasks, find the optimal task assignment policy
– Simplified control policy with actions: dispatch job or do not dispatch job (→ actions as decision choices)
• Not practical
• Users’ concern: under what conditions can we optimize a given performance metric (e.g. minimizing turnaround time)?
– Characterize each candidate server in terms of its resource capacity (e.g. CPU percentage time, disk space, memory, bandwidth, owner-imposed usage criteria, etc.)
• Actions: dispatching the current task to machine X
• Problems?
39. Example: Task Assignment Domain (2)
• A general resource-sharing environment could have a large number of distributed resources (e.g. Grid networks, Volunteer Computing, Cloud Computing, etc.)
– 1000 machines → 1000 decision points per state
• Treat user tasks and machines collectively as a multi-agent system?
– Combinatorial state/action space
– Large number of agents
40. Task Assignment, Match Making: Other Similar Examples
• Recommender Systems
– Selecting {movies, music, …} catering to the interests/needs of (potential) customers: matching users with their favorite items (content-based)
– Online advertising campaigns trying to match products and services with the right target audience
• NLP Apps: Question Answering Systems
– Matching questions with relevant documents
41. Generalizing RL with Functional Clustering
• Abstract Action Representation
– (Functional) pattern discovery from within the generalized state-action pairs (as localized policies)
– Functional clustering by relating correlated inputs w.r.t. their correlated functional responses
• Inputs: generalized state-action pairs
• Functional responses: utilities
– Covariance matrix from GPR → fully-connected similarity graph → graph Laplacian
– Spectral clustering → a set of abstractions over (functionally) similar localized policies
– Control policy over these abstractions used as control actions
• Reduces decision points per state
• Reveals interesting correlations between state features and action parameters, e.g. match-making criteria
42. Policy Generalization
(x_s, x_a) ∈ S ⊗ A+
(x_s, x_a) ≈ (x_s + Δx_s, x_a + Δx_a)
k(x, x′)
[Figure: regions A, B, C of the partitioned state space with the pie-shaped action scopes]
“Policy as a Functional over Q+”
“→ Experience Association”
43. Experience Association (1)
• Basic learning principle:
– “If a decision led to a desired result in the past, then a similar decision can be re-applied to cope with similar situations in the future”
• Again, use the kernel as a similarity measure
• Agent: “Is my current situation similar to a particular scenario in the past?”
• Agent: “I see there are two similar instances in memory where I did action #2, and that action led to a decent result. OK, I’ll try that again (or something similar).”
“And if not, let me avoid repeating the mistake again”
– Hypothetical state s_h+
• Definition
• Agent: “I’ll look into a past experience, replicate the action of that experience, and apply it to my current situation; is the result going to be similar?”
44. *Experience Association (2)
• Step 1: form a hypothetical state (If I were to be …)
– I am at a state s′ = x′_s
– I pick a (relevant) particle from a set (later used in an abstraction): ω^(i) ∈ Ω^(i) ← A^(i),  s+ : (x_s, x_a)
– Replicate the action and apply it to my own state: s_h+ = (x′_s, x_a)
• Step 2: Compare (… is the result going to be similar?)
– Compare the result using the kernel (that’s my state correlation hypothesis):
k(s+, s_h+) = k(x′, x) = k((x′_s, x_a), (x_s, x_a))
– If k(s+, s_h+) ≥ τ, then this sample is correlated to my state and the state of the target sample is in context; otherwise, it is out of context (see the sketch below)
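A hedged sketch of the two steps, with an illustrative kernel and threshold τ:

```python
import numpy as np

def kern(a, b, ell=0.5):
    return float(np.exp(-0.5 * np.sum(((a - b) / ell) ** 2)))

def in_context(x_s_current, particle, tau=0.7):
    """Form the hypothetical state s_h+ = (x'_s, x_a) from a stored particle's
    action and test its kernel correlation with the particle's own state s+."""
    x_s_past, x_a_past = particle                     # s+ = (x_s, x_a) from memory
    s_plus = np.concatenate([x_s_past, x_a_past])
    s_h = np.concatenate([x_s_current, x_a_past])     # s_h+ = (x'_s, x_a)
    return kern(s_plus, s_h) >= tau                   # correlated => in context

particle = (np.array([0.1, 0.2]), np.array([0.8, -0.1]))
print(in_context(np.array([0.15, 0.25]), particle))   # likely True: states are close
```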
45. *Experience Association (3)
• Normalize the kernel such that it assumes a probability semantics
– This is a generalization of the concept of probability amplitude in QM.
k′(x, x′) = k(x, x′) / √(k(x, x) k(x′, x′)) = ⟨Φ(x), Φ(x′)⟩ / (‖Φ(x)‖ ‖Φ(x′)‖)
• Nadaraya-Watson model (Section 5.7)
46. Concept-Driven Learning Architecture (CDLA)
• The agent derives its control policy only at the level of abstract actions
– Further reduces decision points per state
– Finds interesting patterns across state features and action parameters, e.g. match-making criteria
• We have the necessary representation to form (functional) clusters
– Kernel as a similarity measure over augmented states
– Covariance matrix K from GPR
• Each abstract action is represented through a set of experience particles
47. CDLA Conceptual Hierarchy
A set of unstructured experience particles
Graph representation from K, where K_ij = k(x_i, x_j)
Partitioned graph with two abstract actions
48. *Spectral Clustering: Big Picture
• Construct a similarity graph (← GPR)
• Graph Laplacian (GL)
• Graph cut as the objective function (e.g. normalized cut)
• Optimize the graph cut criterion
– Minimizing the normalized cut → partitions as tight as possible
← maximal in-cluster links (weights) and minimal between-cluster links
– NP-hard → spectral relaxation
• Use the eigenvectors of the GL as a continuous version of the cluster indicator vectors
• Evaluate the final clusters using a selected instance-based clustering algorithm (k-means++); see the sketch below
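A hedged sketch of this pipeline, reusing a kernel matrix as the affinity W and the random-walk Laplacian defined on the later slides; the data, length scale, and number of abstract actions are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(60, 4))                        # augmented states
d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
W = np.exp(-0.5 * d2 / 0.5 ** 2)                            # affinity = kernel matrix

D = np.diag(W.sum(axis=1))
L_rw = np.eye(len(X)) - np.linalg.inv(D) @ W                # L_rw = I - D^-1 W

k = 3                                                       # number of abstract actions
eigvals, eigvecs = np.linalg.eig(L_rw)
order = np.argsort(eigvals.real)
U = eigvecs[:, order[:k]].real                              # leading k eigenvectors
                                                            # (relaxed cluster indicators)
labels = KMeans(n_clusters=k, n_init=10).fit_predict(U)     # clusters = decision concepts
```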
49. *Spectral Clustering: Definitions
Pairwise affinity:  w_nm = k(x_n, x_m)
Degree:  D_n = Σ_{m=1..N} w_nm
Volume of a set:  Vol(C) = Σ_{n∈C} D_n
Cut between two sets:  Cut(C_1, C_2) = Σ_{n∈C_1} Σ_{m∈C_2} w_nm
50. *Spectral Clustering: Graph Cut
• Graph Cut
– Naïve cut:
Cut(A^(1), …, A^(k)) = ½ Σ_{i=1..k} W(A^(i), V \ A^(i))
– (K-way) Ratio cut:
RatioCut(A^(1), …, A^(k)) = ½ Σ_{i=1..k} W(A^(i), V \ A^(i)) / |A^(i)|
– (K-way) Normalized cut (NP-hard!):
NCut(A^(1), …, A^(k)) = ½ Σ_{i=1..k} W(A^(i), V \ A^(i)) / vol(A^(i))
51. *Spectral Clustering: Approximation
• Approximation
– Given a set of data to cluster
– Form the affinity matrix W
– Find the leading k eigenvectors: Lv = λv
– Cluster the data set in the eigenspace
– Project back to the original data
• Major differences between algorithms: L = f(W)
52. *Random Walk Graph Laplacian (1)
• Definition: L_rw = D⁻¹(D − K) = I − D⁻¹K
• First-order Markov transition matrix
– Each entry is the probability of transitioning from node i to node j in a single step
d_i = Σ_{j=1..m} K_ij = Σ_{j=1..m} k(x_i, x_j)
P_ij = K_ij / Σ_j K_ij = k(x_i, x_j) / d_i
L_rw = I − D⁻¹K = I − P
53. *Random Walk Graph Laplacian (2)
• K-way normalized cut: find a partitioning such that the probability of transitioning across clusters is minimized
NCut(C) = Σ_{i=1..k} (1 − P(C_i → C_i | C_i)) = Σ_{i=1..k} (1 − Σ_{n,m∈C_i} w_nm / (Σ_{n∈C_i} Σ_{m=1..N} w_nm))
– When used with the random-walk GL, this corresponds to minimizing the probability of state transitions between clusters
• In CDLA
– Each abstract action corresponds to a coherent decision concept
– Why? By taking any action in any of the states associated with the same concept, there is a minimal chance of transitioning to the states associated with other concepts
54. Functional Clustering using GPSC
• GPR + SC → SGP Clustering
– Kernel as a correlation hypothesis
– The same hypothesis is used as the similarity measure for SC
– Correlated inputs share approximately identical outputs:
– Similar augmented states ~ close fitness values or utilities
• Warning: the reverse is NOT true, and this is why we need multiple concepts that may share similar output signals
– E.g. match-making job and machine requirements
55. CDLA: Context Matching
• Each abstract action implicitly defines an action-selection strategy
– In context: at a given state, find the most correlated state pair with its action, followed by applying that action
– Out of context: random action selection
• Applicable to 1) an infant agent, 2) an empty cluster, 3) referenced particles that don’t match (by experience association)
• Caveat:
Q(s, A^(i)) ≠ Q+(s, a^(i)) !
• The utility (fitness value) obtained from random action selection does not correspond to the true estimate of resolving A^(i)
• Need another way to adjust the utility estimate for its fitness value
56. Empirical Study: Task Assignment Domain
Goal: Find for each incoming user task the best candidate server(s) that are mutually agreeable in terms of the matching criteria.
Parameter | Spec. Values
Task feature set (state) | Task type, size, expected runtime
Server feature set (action) | Service type, percentage CPU time, memory, disk space, CPU speed, job slots
Kernel k | SE kernel + noise (see (5.3))
Num. of Abstract Actions | 10 (assumed to be known)
Model update cycle T | 10 state transitions towards 100
57. Empirical Study: Learned Concepts
Concept | Task type | Size | Expected runtime | Service type | %CPU time | Fitness value
1 | 1 | 1.1 | 0.93 | 1 | 9.784 | 120.41
2 | 2 | 2.5 | 1.98 | 2 | 10.235 | 128.13
3 | 3 | 3.2 | 2.92 | 3 | 15.29 | 135.23
4 | 1 | 1.0 | 1.02 | 2 | 20.36 | -50.05
5 | 2 | 2.0 | 2.09 | 3 | 0.58 | -47.28
• Illustration of 5 different learned decision concepts.
• The top 3 rows (in blue) indicate successful matches, while the bottom 2 rows (in yellow) indicate failed matches.
58. Empirical Study: Comparison
[Plot: reward per episode (roughly −2000 to 1500) over 100 episodes for G-SARSA, Condor, and Random]
• Performance comparison among
(1) Stripped-down Condor match-making (black)
(2) G-SARSA (blue)
(3) Random (red)
59. Sample Promotion (1)
k(s_h+, s) ≥ τ
Two possibilities:
(1) the target is indeed correlated to a given abstract action (AA)
(2) the target is NOT correlated to ANY abstract action
→ trial and error + sample promotion
60. Sample Promotion (2)
• Recall: each abstract action implicitly defines an action-selection strategy
• Key: the functional data set must be as accurate as possible in terms of its predictive strength
• Match a new experience particle against the memory, find the most relevant piece, and use its value estimation.
• The out-of-context case leads to randomized action selection.
• How does the agent still manage to gain experience in this case?
• Randomized action selection + sample promotion
– Case 1: the random action does get a positive result
• Match the result per abstract action by sampling, using the experience association operation: (s, a^(i)) → (x_s, x_a)
• If it is indeed correlated to some experience, then update the fitness value; otherwise → Case 2
– Case 2: Discard the sample because there is no point of reference; the sample is not useful
61. GRL Schematic
Figure 8.1 CDLA schematic. Periodic iterations between value function approximation and concept-driven cluster formation constitute the heartbeat of CDLA. The conceptual
62. Demo of 4 Decision Concepts
[Figure: demo frames for four learned decision concepts A^(1)–A^(4), with annotated actions (Avoid, Collect, Clean) and rewards (+1, −1, −k) over time steps t−k−1, t−k, t−k+1, …, t]
“Trapped in high hills ☹”
“Fallen into deep water ☹”
“Discovery of plant life ☺”
“Trash removed and organized ☺”
Concept polarity:  +/☺ : {A^(2), A^(4)}   −/☹ : {A^(1), A^(3)}
63. Future Work
More on …
• Using spectral clustering to cluster functional data
• Experience association
• Context matching
• Evolving the samples
– sample promotion
– Probabilistic model for experience associations
• Evolving the clusters
– Need to adjust value estimate for abstract actions as new
samples join in
– Morphing clusters