This document provides a summary of sampling-based approximations for reinforcement learning. It discusses using samples to approximate value iteration and policy iteration (leading to Q-learning and TD-learning), and using function approximation in place of a table when the state-action space is too large to store a value for every state-action pair. Key points covered include Q-learning with function approximation instead of a table, using features to generalize Q-values across states, and example feature representations such as those used for the Tetris domain. Convergence properties of approximate Q-learning are also discussed.
2. Quick One-Slide Recap
• Optimal Control = given an MDP (S, A, P, R, γ, H), find the optimal policy π*
• Exact Methods:
  • Value Iteration
  • Policy Iteration
• Limitations:
  • Update equations require access to the dynamics model
  • Iteration over / storage for all states and actions: requires a small, discrete state-action space
-> sampling-based approximations
-> Q/V function fitting
4. Recap Q-Values
• Q*(s, a) = expected utility starting in s, taking action a, and (thereafter) acting optimally
• Bellman Equation: $Q^*(s,a) = \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma \max_{a'} Q^*(s',a')\right]$
• Q-Value Iteration: $Q_{k+1}(s,a) \leftarrow \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma \max_{a'} Q_k(s',a')\right]$
5. (Tabular) Q-Learning
• Q-value iteration: $Q_{k+1}(s,a) \leftarrow \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma \max_{a'} Q_k(s',a')\right]$
• Rewrite as expectation: $Q_{k+1}(s,a) \leftarrow \mathbb{E}_{s' \sim P(s'|s,a)}\left[R(s,a,s') + \gamma \max_{a'} Q_k(s',a')\right]$
• (Tabular) Q-Learning: replace the expectation by samples
  • For a state-action pair (s, a), receive a sample next state: $s' \sim P(s'|s,a)$
  • Consider your old estimate: $Q_k(s,a)$
  • Consider your new sample estimate: $\text{target}(s') = R(s,a,s') + \gamma \max_{a'} Q_k(s',a')$
  • Incorporate the new estimate into a running average: $Q_{k+1}(s,a) \leftarrow (1-\alpha)\,Q_k(s,a) + \alpha\,\text{target}(s')$
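As an illustration (not from the slides), a minimal Python sketch of this tabular update; the table layout Q[s, a], the sampled transition (r, s'), and the hyperparameter values are assumptions:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move the old estimate toward the sample target."""
    target = r + gamma * np.max(Q[s_next])            # new sample estimate target(s')
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target  # running average
    return Q

# Hypothetical usage: 10 states, 4 actions, one observed transition (s=3, a=1, r=-1.0, s'=4)
Q = np.zeros((10, 4))
Q = q_learning_update(Q, s=3, a=1, r=-1.0, s_next=4)
```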
7. How to sample actions?
• Choose random actions?
• Choose the action that maximizes $Q_k(s,a)$ (i.e. act greedily)?
• ε-Greedy: choose a random action with probability ε, otherwise choose the action greedily
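A minimal ε-greedy sampler in Python might look like the following sketch; the table layout Q[s, a] and the value of ε are assumptions for illustration:

```python
import numpy as np

def epsilon_greedy_action(Q, s, epsilon=0.1, rng=np.random.default_rng()):
    """With probability epsilon take a uniformly random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))  # explore: random action
    return int(np.argmax(Q[s]))               # exploit: argmax_a Q_k(s, a)
```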
9. Q-Learning Properties
• Technical requirements:
  • All states and actions are visited infinitely often
    • Basically, in the limit, it doesn’t matter how you select actions (!)
  • Learning rate schedule such that for all state-action pairs (s, a):
    $\sum_{t=0}^{\infty} \alpha_t(s,a) = \infty \qquad \sum_{t=0}^{\infty} \alpha_t^2(s,a) < \infty$
For details, see Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6), November 1994.
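As an illustration (not stated on the slide), a schedule such as $\alpha_t(s,a) = 1/t$, with t counting the visits to (s, a), satisfies both conditions:

```latex
% Illustrative schedule: alpha_t = 1/t for the t-th visit to (s, a)
\alpha_t(s,a) = \tfrac{1}{t}
\quad\Longrightarrow\quad
\sum_{t=1}^{\infty} \tfrac{1}{t} = \infty
\qquad\text{and}\qquad
\sum_{t=1}^{\infty} \tfrac{1}{t^2} = \tfrac{\pi^2}{6} < \infty .
```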
16. Sampling-Based Approximation
• Q-Value Iteration -> (Tabular) Q-Learning
• Value Iteration?
• Policy Iteration
  • Policy Evaluation -> (Tabular) TD-Learning
  • Policy Improvement (for now)
22. Connection to Tabular Q-Learning
• Suppose $\theta \in \mathbb{R}^{|S| \times |A|}$ and $Q_\theta(s,a) \equiv \theta_{sa}$. Then
  $\nabla_{\theta_{sa}} \tfrac{1}{2}\left(Q_\theta(s,a) - \text{target}(s')\right)^2 = \nabla_{\theta_{sa}} \tfrac{1}{2}\left(\theta_{sa} - \text{target}(s')\right)^2 = \theta_{sa} - \text{target}(s')$
• Plug into the update:
  $\theta_{sa} \leftarrow \theta_{sa} - \alpha\left(\theta_{sa} - \text{target}(s')\right) = (1-\alpha)\,\theta_{sa} + \alpha\left[\text{target}(s')\right]$
• Compare with the Tabular Q-Learning update:
  $Q_{k+1}(s,a) \leftarrow (1-\alpha)\,Q_k(s,a) + \alpha\left[\text{target}(s')\right]$
23. Engineered Approximation Example: Tetris
• state: naïve board configuration + shape of the falling piece (~10^60 states!)
• action: rotation and translation applied to the falling piece
• 22 features aka basis functions φ_i:
  • Ten basis functions, 0, ..., 9, mapping the state to the height h[k] of each column.
  • Nine basis functions, 10, ..., 18, each mapping the state to the absolute difference between heights of successive columns: |h[k+1] - h[k]|, k = 1, ..., 9.
  • One basis function, 19, that maps the state to the maximum column height: max_k h[k]
  • One basis function, 20, that maps the state to the number of ‘holes’ in the board.
  • One basis function, 21, that is equal to 1 in every state.
• $\hat{V}(s) = \sum_{i=0}^{21} \theta_i \phi_i(s) = \theta^\top \phi(s)$
[Bertsekas & Ioffe, 1996 (TD); Bertsekas & Tsitsiklis, 1996 (TD); Kakade, 2002 (policy gradient); Farias & Van Roy, 2006 (approximate LP)]
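A sketch of these 22 features in Python (illustrative; the extraction of column heights and the hole count from the raw board is assumed to be given):

```python
import numpy as np

def tetris_features(heights, num_holes):
    """Basis functions phi_0..phi_21: column heights, height differences,
    max height, number of holes, and a constant feature."""
    h = np.asarray(heights, dtype=float)   # length-10 array of column heights
    return np.concatenate([
        h,                                 # phi_0..phi_9
        np.abs(np.diff(h)),                # phi_10..phi_18: |h[k+1] - h[k]|
        [h.max()],                         # phi_19: maximum column height
        [float(num_holes)],                # phi_20: number of holes
        [1.0],                             # phi_21: constant
    ])                                     # shape (22,)

def value_estimate(theta, heights, num_holes):
    """Linear value estimate V_hat(s) = theta^T phi(s)."""
    return float(theta @ tetris_features(heights, num_holes))
```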
29. Composing Operators**
• Definition. An operator G is a non-expansion with respect to a norm ||·|| if $\|GV - GV'\| \le \|V - V'\|$ for all V, V'.
• Fact. If the operator F is a γ-contraction with respect to a norm ||·|| and the operator G is a non-expansion with respect to the same norm, then the sequential application of the operators G and F is a γ-contraction, i.e., $\|GFV - GFV'\| \le \gamma \|V - V'\|$.
• Corollary. If the supervised learning step is a non-expansion, then each iteration of value iteration with function approximation is a γ-contraction, and in this case we have a convergence guarantee.
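The argument behind the Fact is one line, using the two definitions above (a sketch):

```latex
% G is a non-expansion, F is a gamma-contraction
\| G F V - G F V' \|
  \;\le\; \| F V - F V' \|          % non-expansion of G
  \;\le\; \gamma \, \| V - V' \| .  % gamma-contraction of F
```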