The policy gradient theorem is from "Reinforcement Learning: An Introduction". DPG and DDPG are from their original papers.
Original slides: https://docs.google.com/presentation/d/1I3QqfY6h2Pb0a-KEIbKy6v5NuZtnTMLN16Fl-IuNtUo/edit?usp=sharing
Reinforcement Learning Classification
● Value-Based
○ Learned Value Function
○ Implicit Policy (usually Ɛ-greedy)
● Policy-Based
○ No Value Function
○ Explicit Policy Parameterization
● Mixed (Actor-Critic)
○ Learned Value Function
○ Policy Parameterization
Policy Approximation (Discrete Actions)
● To ensure exploration, we generally require that the policy never becomes deterministic
● The most common parameterization for discrete action spaces is softmax in action preferences
○ the discrete action space cannot be too large
● Action preferences can be parameterized arbitrarily (linear, ANN, ...); a sketch follows below
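A minimal sketch of softmax in action preferences with linear preferences h(s, a, θ) = θ·x(s, a); the feature function `x` and this interface are illustrative assumptions, not the book's pseudocode:

```python
# Softmax-in-action-preferences policy with linear preferences (illustrative sketch).
import numpy as np

def softmax_policy(theta, x, state, actions):
    """Return pi(a|s, theta) for every action under softmax action preferences."""
    prefs = np.array([theta @ x(state, a) for a in actions])
    prefs -= prefs.max()                      # subtract max for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

def grad_log_pi(theta, x, state, action, actions):
    """Eligibility vector: grad ln pi(a|s) = x(s, a) - sum_b pi(b|s) x(s, b)."""
    pi = softmax_policy(theta, x, state, actions)
    return x(state, action) - sum(p * x(state, b) for p, b in zip(pi, actions))
```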
Advantages of Policy Approximation
1. Can approach a deterministic policy (Ɛ-greedy always has probability Ɛ of selecting a random action), e.g. by letting the softmax temperature T -> 0
○ In practice, it is difficult to choose the reduction schedule or initial value of T
2. Enables the selection of actions with arbitrary probabilities
○ e.g. bluffing in poker; action-value methods have no natural way to do this
https://en.wikipedia.org/wiki/Softmax_function (temperature parameter)
3. The policy may be a simpler function to approximate, depending on the relative complexity of policies and action-value functions
4. A good way of injecting prior knowledge about the desired form of the policy into the reinforcement learning system (often the most important reason)
Short Corridor with Switched Actions
● All the states appear identical under the function approximation
● A method can do significantly better if it can learn a specific probability with which to select right
● The best probability is about 0.59 (a small numerical check follows below)
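A small numerical check of the 0.59 figure, assuming the short-corridor dynamics of the book's example: three nonterminal states, reward -1 per step, and a middle state whose actions are reversed. The environment encoding here is an illustrative assumption:

```python
# Solve v(start) as a function of p = Pr(select right) and locate the optimum.
import numpy as np

def start_state_value(p):
    """v(start) when 'right' is selected with probability p in every state."""
    # Transition matrix over nonterminal states s0, s1, s2 (terminal omitted):
    # s0: right -> s1, left -> s0; s1 (reversed): right -> s0, left -> s2;
    # s2: right -> terminal, left -> s1.  Solve v = -1 + P v.
    P = np.array([[1 - p, p,     0.0],
                  [p,     0.0,   1 - p],
                  [0.0,   1 - p, 0.0]])
    v = np.linalg.solve(np.eye(3) - P, -np.ones(3))
    return v[0]

ps = np.linspace(0.01, 0.99, 981)
best_p = max(ps, key=start_state_value)
print(best_p, start_state_value(best_p))   # roughly 0.59 and about -11.6
```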
The Policy Gradient Theorem (Episodic)
NIPS 2000, Policy Gradient Methods for Reinforcement Learning with Function Approximation (Richard S. Sutton et al.)
https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf
The Policy Gradient Theorem
● Stronger convergence guarantees are available for policy-gradient methods than for action-value methods
○ Ɛ-greedy selection may change dramatically for an arbitrarily small change in an action value, if that change results in a different action having the maximal value
● There are two cases, which define different performance measures
○ Episodic case: the performance measure is the value of the start state of the episode
○ Continuing case: no episode end, not even a start state (refer to Chapter 10.3)
The Policy Gradient Theorem (Episodic) - On-Policy Distribution
● The fraction of time spent in s under on-policy training (the on-policy distribution, the same as p.43)
● It is better written as shown on the next slide
The Policy Gradient Theorem (Episodic) - On-Policy Distribution
● The number of time steps spent, on average, in state s in a single episode
● h(s) denotes the probability that an episode begins in state s
(both are written out below)
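In the book's notation (with η(s) the expected number of time steps spent in s in an episode and h(s) the probability that an episode starts in s), these quantities satisfy:

```latex
\eta(s) = h(s) + \sum_{\bar{s}} \eta(\bar{s}) \sum_{a} \pi(a \mid \bar{s}) \, p(s \mid \bar{s}, a),
\qquad
\mu(s) = \frac{\eta(s)}{\sum_{s'} \eta(s')}
```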
The Policy Gradient Theorem (Episodic) - Concept
● The ratio with which s appears in the state-action tree
● Gather gradients over the whole action space of every state
The Policy Gradient Theorem (Episodic):
Sum Over States Weighted by How Often the States Occur Under the Policy
● Policy gradient for the episodic case (stated below)
● The distribution is the on-policy distribution under the policy
● The constant of proportionality is the average length of an episode and can be absorbed into the step size
● The performance gradient does not involve the derivative of the state distribution
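In the book's notation, the theorem summarized on this slide is:

```latex
\nabla J(\boldsymbol{\theta}) \;\propto\; \sum_{s} \mu(s) \sum_{a} q_\pi(s, a) \, \nabla_{\boldsymbol{\theta}} \pi(a \mid s, \boldsymbol{\theta})
```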
REINFORCE Meaning
● The update increases the parameter vector in this direction (the gradient of the probability of the action actually taken), proportional to the return
● It is inversely proportional to the action probability (this makes sense because otherwise actions that are selected frequently would be at an advantage)
● The theorem's expression is a summation over actions; if we instead sample actions according to the action probabilities, we have to average the gradient over the number of samples (a REINFORCE sketch follows below)
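A minimal REINFORCE sketch for this setting; `env` and `policy` are assumed illustrative interfaces (reset/step and sample/grad_log_pi, as in the softmax sketch above), not the book's pseudocode:

```python
# REINFORCE: Monte-Carlo policy gradient for one episode (illustrative sketch).
import numpy as np

def reinforce_episode(env, policy, theta, alpha=2e-4, gamma=1.0):
    """Generate one episode with pi(.|., theta), then apply the Monte-Carlo update."""
    trajectory = []                         # list of (state, action, reward)
    state, done = env.reset(), False
    while not done:
        action = policy.sample(state, theta)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state

    G = 0.0
    for t in reversed(range(len(trajectory))):   # accumulate returns backwards
        s, a, r = trajectory[t]
        G = r + gamma * G
        theta = theta + alpha * (gamma ** t) * G * policy.grad_log_pi(s, a, theta)
    return theta
```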
REINFORCE on the Short-Corridor Gridworld
● With a good step size, the total reward per episode approaches the optimal value of the start state
REINFORCE Defects & Solutions
● Slow convergence
● High variance from the sampled returns
● Hard to choose the learning rate
REINFORCE with Baseline (episodic)
● The expected value of the update is unchanged (unbiased), but the baseline can have a large effect on its variance
● The baseline can be any function, even a random variable (as long as it does not vary with the action)
● For MDPs, the baseline should vary with the state; one natural choice is the state-value function
○ in some states all actions have high values => use a high baseline
○ in other states all actions have low values => use a low baseline
● Treat the state-value function as an independent value-function approximation!
REINFORCE with Baseline (episodic)
● The state-value baseline can be learned by any of the methods of the previous chapters, independently; we use the same Monte-Carlo approach here (Section 9.3, Gradient Monte Carlo). A sketch follows below.
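A minimal sketch of REINFORCE with a state-value baseline learned by gradient Monte Carlo; `policy` and `value` are assumed illustrative interfaces, and the step sizes are placeholders:

```python
# REINFORCE with baseline: both the policy and the baseline are updated from returns.
import numpy as np

def reinforce_with_baseline(episode, policy, value, theta, w,
                            alpha_theta=2e-4, alpha_w=2e-3, gamma=1.0):
    """episode: list of (state, action, reward) from one on-policy rollout."""
    G = 0.0
    for t in reversed(range(len(episode))):
        state, action, reward = episode[t]
        G = reward + gamma * G
        delta = G - value.predict(state, w)              # return minus baseline
        w = w + alpha_w * delta * value.grad(state, w)   # gradient Monte-Carlo baseline update
        theta = theta + alpha_theta * (gamma ** t) * delta * policy.grad_log_pi(state, action, theta)
    return theta, w
```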
Short-Corridor Gridworld
● Learns much faster with the baseline
● The step size for the policy parameters is much less clear how to set
● The step size for the state-value-function parameters can be set as in Section 9.6
Defects
● Learns slowly (produces estimates of high variance)
● Inconvenient to implement for online or continuing problems
One-Step Actor-Critic Method
● Add one-step bootstrapping to make it online
● But TD methods always introduce bias
● TD(0), with only one random step, has lower variance than Monte Carlo and accelerates learning (the updates are outlined below)
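In outline, the one-step actor-critic updates (episodic case, with I initialized to 1 at the start of each episode) are:

```latex
\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}), \\
\mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}} \, \delta_t \, \nabla \hat{v}(S_t, \mathbf{w}), \\
\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha^{\boldsymbol{\theta}} \, I \, \delta_t \, \nabla \ln \pi(A_t \mid S_t, \boldsymbol{\theta}),
\qquad I \leftarrow \gamma I
```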
Actor-Critic
● Actor: the policy function
● Critic: the state-value function
● The critic assigns credit to criticize the actor's selections
https://cs.wmich.edu/~trenary/files/cs5300/RLBook/node66.html
Actor-Critic with Eligibility Traces (episodic)
● The weight vector is a long-term memory
● The eligibility trace is a short-term memory, keeping track of which components of the weight vector have contributed to recent state estimates (a sketch follows below)
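A minimal sketch of episodic actor-critic with accumulating eligibility traces; `policy` and `value` are the same assumed illustrative interfaces as in the earlier sketches:

```python
# Actor-critic with eligibility traces: traces are the short-term memory,
# the weight/parameter vectors are the long-term memory.
import numpy as np

def actor_critic_traces_episode(env, policy, value, theta, w,
                                alpha_theta=1e-3, alpha_w=1e-2,
                                gamma=1.0, lambda_theta=0.9, lambda_w=0.9):
    z_theta = np.zeros_like(theta)    # actor trace
    z_w = np.zeros_like(w)            # critic trace
    I = 1.0
    state, done = env.reset(), False
    while not done:
        action = policy.sample(state, theta)
        next_state, reward, done = env.step(action)
        v_next = 0.0 if done else value.predict(next_state, w)
        delta = reward + gamma * v_next - value.predict(state, w)
        z_w = gamma * lambda_w * z_w + value.grad(state, w)
        z_theta = gamma * lambda_theta * z_theta + I * policy.grad_log_pi(state, action, theta)
        w = w + alpha_w * delta * z_w
        theta = theta + alpha_theta * delta * z_theta
        I *= gamma
        state = next_state
    return theta, w
```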
The Policy Gradient Theorem (Continuing) - Performance Measure with Ergodicity
● "Ergodicity assumption"
○ Any early decision made by the agent can have only a temporary effect
○ In the long run, the expectation over states depends only on the policy and the MDP transition probabilities
○ The steady-state distribution is assumed to exist and to be independent of S0 (this guarantees that the limit exists)
● Performance is measured as the average rate of reward per time step
The Policy Gradient Theorem (Continuing) - Performance Measure Definition
● J(θ) is defined as r(π), the average reward per time step ("every step's average reward is the same")
● (r(π) is a fixed value for any s; in the theorem's derivation we will only treat it as a variable independent of s)
The Policy Gradient Theorem (Continuing) - Steady-State Distribution
● The steady-state distribution under the policy π (written out below)
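In the book's notation, the average-reward objective and the steady-state distribution referred to here are:

```latex
J(\boldsymbol{\theta}) = r(\pi)
  = \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\!\left[ R_t \mid S_0, A_{0:t-1} \sim \pi \right]
  = \sum_{s} \mu(s) \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \, r \\
\mu(s) = \lim_{t \to \infty} \Pr\{ S_t = s \mid A_{0:t-1} \sim \pi \},
\qquad
\sum_{s} \mu(s) \sum_{a} \pi(a \mid s, \boldsymbol{\theta}) \, p(s' \mid s, a) = \mu(s')
```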
Replace Discount with Average Reward for Continuing Problems (Sections 10.3, 10.4)
● The discounted setting for continuing problems is useful in the tabular case, but questionable in the function-approximation case
● For continuing problems, the performance measure with the discounted setting is proportional to the average-reward setting (they have almost the same effect) (Section 10.4)
● The discounted setting is problematic with function approximation
○ with function approximation we have lost the policy improvement theorem (Section 4.3), which is important in the policy iteration method
Proof of the Policy Gradient Theorem (Continuing) 1/2
● Gradient definition
● Parameterization of the policy, replacing the discount with the average-reward setting
Proof of the Policy Gradient Theorem (Continuing) 2/2
● Introduce the steady-state distribution and its property
● By definition, r(π) is independent of s (the trick used in the proof); the result is stated below
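The statement being proved (the continuing-case policy gradient theorem) is, in the book's notation:

```latex
\nabla J(\boldsymbol{\theta}) = \nabla r(\pi) = \sum_{s} \mu(s) \sum_{a} \nabla_{\boldsymbol{\theta}} \pi(a \mid s, \boldsymbol{\theta}) \, q_\pi(s, a)
```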
Actor-Critic with Eligibility Traces (continuing)
● Replace the discount with the average reward
● Train with semi-gradient TD(0)
● The average-reward estimate is learned with an independent semi-gradient TD(0) update
● There is no discounting (the discount is effectively 1)
Policy Parameterization for Continuous Actions
● Can deal with large or even infinite continuous action spaces
● The actions are given a normal distribution whose mean and standard deviation are parameterized functions of the state
● Feature vectors can be constructed by polynomials, Fourier bases, ... (Section 9.5)
● The standard deviation is exponentiated to make it positive (a sketch follows below)
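A minimal sketch of such a Gaussian policy, assuming linear parameterizations of the mean and of the log standard deviation over a state feature vector x_s (names and the two-parameter-vector split are illustrative assumptions):

```python
# Gaussian policy for continuous actions: mean linear in x(s), sigma = exp(linear in x(s)).
import numpy as np

def gaussian_policy_sample(theta_mu, theta_sigma, x_s, rng=None):
    """Sample a ~ N(mu(s), sigma(s)^2)."""
    rng = rng or np.random.default_rng()
    mu = theta_mu @ x_s
    sigma = np.exp(theta_sigma @ x_s)          # exponentiation keeps sigma positive
    return rng.normal(mu, sigma)

def gaussian_grad_log_pi(theta_mu, theta_sigma, x_s, a):
    """Gradients of ln pi(a|s) with respect to theta_mu and theta_sigma."""
    mu = theta_mu @ x_s
    sigma = np.exp(theta_sigma @ x_s)
    grad_mu = (a - mu) / sigma**2 * x_s
    grad_sigma = ((a - mu)**2 / sigma**2 - 1.0) * x_s
    return grad_mu, grad_sigma
```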
Chapter 19 Summary
● Policy gradient is superior to Ɛ-greedy and action-value methods in that it
○ can learn specific probabilities for taking the actions
○ can approach deterministic policies asymptotically
○ can naturally handle continuous action spaces
● The policy gradient theorem gives an exact formula for how performance is affected by the policy parameters that does not involve derivatives of the state distribution
● REINFORCE method
○ Adding the state value as a baseline reduces variance without introducing bias
● Actor-Critic method
○ Adding a state-value function for bootstrapping introduces bias but reduces variance and accelerates learning
○ The critic assigns credit to criticize the actor's selections
Comparison with Stochastic Policy Gradient
Advantages
● No sampling over the action space, so more efficient (usually 10x faster)
● Can deal with large action spaces more efficiently
Weaknesses
● Less exploration
Deterministic Policy Gradient Theorem - Performance Measure
● Deterministic policy performance measure
● Policy gradient (continuing) performance measure
(the paper does not distinguish between the episodic and continuing cases)
● Both are similar to V(s)
Deterministic Policy Gradient Theorem
● Compare with the (stochastic) Policy Gradient Theorem
● The transition probability is parameterized by θ (through the deterministic policy)
● The reward is parameterized by θ as well
Deterministic Policy Gradient Theorem vs Policy Gradient Theorem (episodic)
● Both sample from the steady-state distribution, but PG also has to sum over the whole action space
● The sampling spaces of the two are compared below
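For reference, the two gradients as given in the DPG paper: the stochastic policy gradient takes an expectation over both states and actions, while the deterministic policy gradient only averages over states:

```latex
\nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_{\theta}}\!\left[ \nabla_{\theta} \log \pi_{\theta}(a \mid s) \, Q^{\pi}(s, a) \right] \\
\nabla_{\theta} J(\mu_{\theta}) = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[ \nabla_{\theta} \mu_{\theta}(s) \, \nabla_{a} Q^{\mu}(s, a) \big|_{a = \mu_{\theta}(s)} \right]
```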
On-Policy Deterministic Actor-Critic Problems
● Behaving according to a deterministic policy will not ensure adequate exploration and may lead to suboptimal solutions
● On-policy learning is therefore not practical in general; it may only be useful for environments in which there is sufficient noise in the environment to ensure adequate exploration, even with a deterministic behaviour policy
● The critic uses a Sarsa update
Off-Policy Deterministic Actor-Critic (OPDAC)
● The original deterministic target policy is µθ(s)
● Trajectories are generated by an arbitrary stochastic behaviour policy β(s,a)
● The action-value function is updated off-policy with a Q-learning-style update (outlined below)
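A sketch of the OPDAC updates, roughly as presented in the DPG paper (Q-learning-style critic, deterministic actor); the notation follows the paper:

```latex
\delta_t = r_t + \gamma\, Q^{w}\!\big(s_{t+1}, \mu_{\theta}(s_{t+1})\big) - Q^{w}(s_t, a_t) \\
w_{t+1} = w_t + \alpha_w \, \delta_t \, \nabla_w Q^{w}(s_t, a_t) \\
\theta_{t+1} = \theta_t + \alpha_\theta \, \nabla_\theta \mu_\theta(s_t) \, \nabla_a Q^{w}(s_t, a_t) \big|_{a = \mu_\theta(s_t)}
```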
Off-Policy Actor-Critic (using importance sampling in both the actor and the critic):
https://arxiv.org/pdf/1205.4839.pdf
Off-Policy Deterministic Actor-Critic:
● DAC removes the integral over actions, so we can avoid importance sampling in the actor
Experiment Designs
1. Continuous bandit, with a fixed-width Gaussian behaviour policy
2. Mountain car, with a fixed-width Gaussian behaviour policy
3. Octopus arm with 6 segments
a. Sigmoidal multi-layer perceptron (8 hidden units and sigmoidal output units) to represent the policy μ(s)
b. A(s) function approximator (Section 4.3)
c. V(s): multi-layer perceptron (40 hidden units and linear output units)
Experiment Results
In practice, the DAC significantly outperformed its stochastic counterpart by several
orders of magnitude in a bandit with 50 continuous action dimensions, and solved a
challenging reinforcement learning problem with 20 continuous action dimensions
and 50 state dimensions.
Deep Deterministic Policy Gradient (DDPG)
ICLR 2016, Continuous Control with Deep Reinforcement Learning (DeepMind)
https://arxiv.org/pdf/1509.02971.pdf
Q-Learning Limitations
Tabular Q-learning limitations
● Very limited states/actions
● Can't generalize to unobserved states
Q-learning with function approximation (a neural net) can overcome the limits above, but is still unstable or divergent
● Correlations present in the sequence of observations
● Small updates to the Q function may significantly change the policy (the policy may oscillate)
● The scale of rewards varies greatly from game to game
○ leading to largely unstable gradient calculations
http://doremi2016.logdown.com/posts/2017/01/25/convolutional-neural-networks-cnn (my CNN architecture)
http://www.davidqiu.com:8888/research/nature14236.pdf Human-level control through deep reinforcement learning
Deep Q-Learning
1. Experience replay
○ Breaks the samples' correlations
○ Off-policy learning over all past policies
2. Independent target Q-network, with weights copied from the Q-network every C steps
○ Avoids oscillations
○ Breaks correlations with the Q-network
3. Clip rewards to limit the scale of the TD error
○ Robust gradient behaviour
Flow diagram labels: Ɛ-greedy behaviour policy; experience replay buffer; minibatch-size samples for training; freeze and update the target Q-network.
http://www.davidqiu.com:8888/research/nature14236.pdf Human-level control through deep reinforcement learning
DQN Flow (cont.)
1. At each time step, use Ɛ-greedy on the Q-network to create a sample and add it to the experience buffer
2. At each time step, the experience buffer randomly provides a minibatch of samples to the networks (Q-network and target network Q')
3. Calculate the Q-network's TD error; update the Q-network, and update the target network Q' every C steps (a sketch of one such step follows below)
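A hedged sketch of one step of the flow above; `env`, `q_net`, `target_net`, and `buffer` are assumed illustrative objects (predict/train/sample interfaces), not the papers' code:

```python
# One DQN step: epsilon-greedy acting, replay, TD targets from a frozen target network.
import random
import numpy as np

def dqn_step(env, state, q_net, target_net, buffer, t,
             epsilon=0.1, gamma=0.99, batch_size=32, copy_every=1000):
    # 1. Epsilon-greedy action from the Q-network; store the clipped transition.
    if random.random() < epsilon:
        action = env.sample_action()
    else:
        action = int(np.argmax(q_net.predict(state)))
    next_state, reward, done = env.step(action)
    reward = float(np.clip(reward, -1.0, 1.0))        # reward clipping
    buffer.add((state, action, reward, next_state, done))

    # 2. Random minibatch from the replay buffer to break sample correlations.
    if len(buffer) >= batch_size:
        batch = buffer.sample(batch_size)
        targets = [r + (0.0 if d else gamma * float(np.max(target_net.predict(s2))))
                   for (_, _, r, s2, d) in batch]
        # 3. Minimize the TD error against the frozen target network.
        q_net.train_on_batch([(s, a) for (s, a, _, _, _) in batch], targets)

    # Copy Q-network weights into the target network every C steps.
    if t % copy_every == 0:
        target_net.set_weights(q_net.get_weights())
    return next_state, done
```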
DQN Disadvantages
● Many tasks of interest, most notably physical control tasks, have continuous (real-valued) and high-dimensional action spaces
● With high-dimensional observation spaces, DQN can only handle discrete and low-dimensional action spaces (finding the greedy action would otherwise require an iterative optimization process at every step)
● A simple approach for DQN to deal with continuous domains is to discretize them, but this has many limitations: the number of actions increases exponentially with the number of degrees of freedom. For example, a 7-degree-of-freedom system (as in the human arm) with the coarsest discretization a ∈ {−k, 0, k} for each joint already gives an action space of dimensionality 3^7 = 2187.
https://arxiv.org/pdf/1509.02971.pdf Continuous Control with Deep Reinforcement Learning
DDPG Contributions (DQN + DPG)
● Can learn policies "end-to-end", directly from raw pixel inputs (from DQN)
● Can learn policies over high-dimensional, continuous action spaces (from DPG)
● Combining the two, it can learn policies in large state and action spaces online
DDPG Algorithm
● Experience replay
● Independent target networks
● Batch normalization of the minibatch
● Temporally correlated exploration
Flow diagram labels: temporally correlated random policy; experience replay buffer; minibatch to train the actor; minibatch to train the critic; weighted blending between the Q and target Q' networks; weighted blending between the actor μ and the target actor μ' networks.
DDPG Flow (cont.)
1. At each time step, use the temporally correlated policy to create a sample and add it to the experience replay buffer
2. At each time step, the experience buffer provides a minibatch of samples to all networks (actor μ, target actor μ', Q-network, target Q' network)
3. Calculate the Q-network's TD error and update the Q-network and the target Q' network; calculate the actor's gradient and update μ and the target μ' (the update formulas are summarized below)
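For reference, the per-minibatch updates in the DDPG paper take roughly this form (target networks Q' and μ', soft-update rate τ):

```latex
y_i = r_i + \gamma\, Q'\!\big(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \,\big|\, \theta^{Q'}\big) \\
L = \tfrac{1}{N} \sum_i \big( y_i - Q(s_i, a_i \mid \theta^{Q}) \big)^2 \\
\nabla_{\theta^{\mu}} J \approx \tfrac{1}{N} \sum_i \nabla_a Q(s, a \mid \theta^{Q}) \big|_{s = s_i,\, a = \mu(s_i)} \, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \big|_{s_i} \\
\theta^{Q'} \leftarrow \tau\, \theta^{Q} + (1 - \tau)\, \theta^{Q'},
\qquad
\theta^{\mu'} \leftarrow \tau\, \theta^{\mu} + (1 - \tau)\, \theta^{\mu'}
```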
DDPG Challenges and Solutions
● A replay buffer is used to break up sequential samples (like DQN)
● Target networks are used for stable learning, but with "soft" updates
○ θ' ← τθ + (1 − τ)θ'
○ The target networks change slowly, but greatly improve the stability of learning
● Batch normalization is used to normalize each dimension across the minibatch samples (in a low-dimensional feature space, observations may have components with different physical units, such as position and velocity)
● An Ornstein-Uhlenbeck process is used to generate temporally correlated exploration for exploration efficiency in problems with inertia (a noise sketch follows below)
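A minimal sketch of Ornstein-Uhlenbeck exploration noise as commonly paired with DDPG; the parameter values and the class interface are illustrative assumptions:

```python
# Ornstein-Uhlenbeck noise: dx = theta * (mu - x) dt + sigma * sqrt(dt) * N(0, 1).
import numpy as np

class OUNoise:
    """Temporally correlated exploration noise added to the deterministic actor's output."""

    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=None):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = np.random.default_rng(seed)
        self.x = np.full(action_dim, mu, dtype=float)

    def sample(self):
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.x.shape))
        self.x = self.x + dx
        return self.x

# Usage: add the noise to the actor's output at every step, e.g. action = actor(state) + noise.sample()
```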
Speaker notes
Learn a parameterized policy that can select actions without consulting a value function.
A value function may still be used to learn the policy parameters, but it is not required for action selection.
Without loss of generality, S0 is fixed (non-random) in every episode.
J(θ) is defined differently for the episodic and continuing cases.
In problems with significant function approximation, the best approximate policy may be stochastic.
In the latter case, a policy-based method will typically learn faster and yield a superior asymptotic policy (Kothiyal 2016).
Changing the policy parameters affects π, the reward, and p; π and the reward are easy to calculate, but p belongs to the environment (unknown).
The policy gradient theorem therefore avoids involving the derivative of the state distribution (p).
The principles of value-function approximation all come from the Mean Squared Value Error (Section 9.2).
Expectations are conditioned on the initial state S0; they are restricted to the states that can be visited (ergodically) starting from S0. States that cannot be reached (perhaps reachable only from some other start state) are not covered by μ(s).
The definition J(θ) = r(π) assumes a fixed value that is independent of s (this is used in the derivation); in the theorem's derivation this definition itself is not used, and r(π) is treated only as a variable.
N is noise: an Ornstein-Uhlenbeck process (Uhlenbeck & Ornstein, 1930) is used to generate temporally correlated exploration for exploration efficiency in physical control problems with inertia.