Knowledge Diffusion Network
Visiting Lecturer Program (114)
Speaker: Alborz Geramifard, Ph.D. Candidate
Department of Computing Science, University of Alberta, Edmonton, Canada
Title: Incremental Least-Squares Temporal Difference Learning
Time: Tuesday, Sep 11, 2007, 12:00-1:00 pm
Location: Department of Computer Engineering, Sharif University of Technology, Tehran
11-12. The Vazir of Agencia
The Ambassador of Envirocia: "Your king is generous, wise, and kind, but he is forgetful. Sometimes he makes decisions which are not wise in light of our earlier conversations."

13. The Vazir of Agencia
Having consultants for various subjects might be the answer ...
19. Vazir
The Ambassador: "I should admit that recently the king makes decent judgments, but as the number of our meetings increases, it takes a while for me to hear back from his majesty ..."
35. Contributions
iLSTD: a new policy evaluation algorithm
Extension with eligibility traces
Running time analysis
Dimension selection methods
Proof of convergence
Empirical results
42-43. Markov Decision Process
    ⟨S, A, P^a_{ss'}, R^a_{ss'}, γ⟩
[Figure: MDP over the states B.S., M.S., and Ph.D. with actions Studying and Working; edges are labeled with transition probabilities and rewards (100%,+60; 80%,-50; 10%,-50; 10%,-300; 100%,+85; 70%,-200; 30%,-200; 100%,+120).]
Sample trajectory: B.S., Working, +60, B.S., Studying, -50, M.S., ...
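The ⟨S, A, P, R, γ⟩ tuple and the sample trajectory above can be sketched in code. This is a minimal illustration only: the transition table below is a simplified guess at the diagram's wiring (the exact probability/reward assignments are not fully recoverable from the slide), and the function names are mine.

```python
import random

# transitions[state][action] -> list of (probability, reward, next_state).
# Simplified illustration inspired by the B.S./M.S./Ph.D. diagram;
# the exact wiring here is assumed, not a faithful copy of the figure.
transitions = {
    "B.S.":  {"Working":  [(1.0, 60, "B.S.")],
              "Studying": [(0.8, -50, "M.S."), (0.2, -50, "B.S.")]},
    "M.S.":  {"Working":  [(1.0, 85, "M.S.")],
              "Studying": [(0.7, -200, "Ph.D."), (0.3, -200, "M.S.")]},
    "Ph.D.": {"Working":  [(1.0, 120, "Ph.D.")]},
}

def step(state, action, rng=random):
    """Sample (reward, next_state) from P^a_{ss'} and R^a_{ss'}."""
    outcomes = transitions[state][action]
    u, acc = rng.random(), 0.0
    for prob, reward, nxt in outcomes:
        acc += prob
        if u <= acc:
            return reward, nxt
    return outcomes[-1][1], outcomes[-1][2]

def trajectory(policy, start="B.S.", steps=4, rng=random):
    """Roll out s0, a0, r1, s1, ... as in the slide's sample trajectory."""
    out, s = [start], start
    for _ in range(steps):
        a = policy(s)
        r, s = step(s, a, rng)
        out += [a, r, s]
    return out
```

Running `trajectory` with a fixed "always work" policy reproduces the shape of the sample trajectory on the slide: a state, an action, a reward, the next state, and so on.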
51-54. Sparsity of features
Sparsity: only k features are active at any given moment, with k ≪ n.
Acrobot [Sutton 96]: 48 ≪ 18,648
Card game [Bowling et al. 02]: 3 ≪ 10^6
Keepaway soccer [Stone et al. 05]: 416 ≪ 10^4
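Sparsity is what makes the later per-step costs cheap: a state can be represented by its k active indices rather than a length-n vector. A minimal sketch, assuming binary (tile-coding-style) features; the variable names are mine.

```python
import numpy as np

n = 10_000   # total number of features (e.g., tile coding)
k = 10       # features active in any one state

def sparse_dot(active_indices, theta):
    """V(s) = φ(s)ᵀθ in O(k) rather than O(n): with binary features,
    φ(s)ᵀθ is just the sum of θ over the active indices."""
    return sum(theta[j] for j in active_indices)

# A state is stored as its k active indices, not a length-n vector.
theta = np.zeros(n)
theta[[3, 42, 77]] = [1.0, 2.0, 0.5]
phi_s = [3, 42, 77]               # hypothetical active tiles for some state s
value = sparse_dot(phi_s, theta)  # O(k) work regardless of n
```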
73-74. The New Approach
A and b matrices change on each iteration:
    µ_t(θ) = Σ_{i=1}^t φ_i r_{i+1} − [ Σ_{i=1}^t φ_i (φ_i − γ φ_{i+1})^T ] θ = b_t − A_t θ
Incremental updates:
    b_t = b_{t−1} + r_t φ_t,                 ∆b_t = r_t φ_t
    A_t = A_{t−1} + φ_t (φ_t − γ φ_{t+1})^T, ∆A_t = φ_t (φ_t − γ φ_{t+1})^T
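The incremental updates above can be written directly with NumPy; this is a sketch with dense arrays and names of my choosing, showing only the bookkeeping.

```python
import numpy as np

def update_Ab(A, b, phi_t, r_t, phi_tp1, gamma):
    """Accumulate one transition (φ_t, r_t, φ_{t+1}) into the running
    statistics, matching the slide's updates:
        ∆b_t = r_t φ_t,   ∆A_t = φ_t (φ_t − γ φ_{t+1})ᵀ
    so that µ_t(θ) = b_t − A_t θ."""
    dB = r_t * phi_t
    dA = np.outer(phi_t, phi_t - gamma * phi_tp1)
    return A + dA, b + dB
```

LSTD then solves A_t θ = b_t outright each step; iLSTD instead keeps µ_t = b_t − A_t θ and applies cheap partial updates to θ, as the following slides describe.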
81-90. iLSTD
    µ_t(θ_{t+1}) = µ_t(θ_t) − A_t (∆θ_t)
How to change θ? Descent in the direction of µ_t(θ)? That touches all n dimensions, and maintaining µ then costs O(n²).
91-93. iLSTD Algorithm
Algorithm 5: iLSTD
 0  s ← s0, A ← 0, µ ← 0, t ← 0
 1  Initialize θ arbitrarily
 2  repeat
 3      Take action according to π and observe r, s′
 4      t ← t + 1
 5      ∆b ← φ(s) r
 6      ∆A ← φ(s)(φ(s) − γφ(s′))^T
 7      A ← A + ∆A
 8      µ ← µ + ∆b − (∆A)θ
 9      for i from 1 to m do
10          j ← choose an index of µ using a dimension selection mechanism
11          θ_j ← θ_j + α µ_j
12          µ ← µ − α µ_j A e_j
13      end for
14      s ← s′
15  end repeat
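The per-step body of the pseudocode translates almost line for line. The sketch below assumes dense NumPy arrays and the random non-zero dimension selection used in the experiments, so it shows the logic rather than the O(mn + k²) sparse implementation; names and signatures are mine.

```python
import numpy as np

def ilstd_step(A, mu, theta, phi_s, r, phi_sp, gamma, alpha, m, rng):
    """One time step of iLSTD (lines 3-14 of the pseudocode).
    A:  running sum of φ(s)(φ(s) − γφ(s′))ᵀ
    mu: running residual b − Aθ
    Dimension selection here: a random non-zero component of µ."""
    dA = np.outer(phi_s, phi_s - gamma * phi_sp)   # ∆A
    A = A + dA
    mu = mu + r * phi_s - dA @ theta               # µ += ∆b − (∆A)θ
    for _ in range(m):
        nz = np.flatnonzero(mu)
        if nz.size == 0:
            break
        j = int(rng.choice(nz))                    # dimension selection
        mu_j = mu[j]
        theta[j] += alpha * mu_j                   # θ_j += α µ_j
        mu = mu - alpha * mu_j * A[:, j]           # µ −= α µ_j A e_j
    return A, mu, theta
```

Note that only one column of A is touched per selected dimension; with sparse features and a sparse matrix representation this is what yields the O(mn + k²) per-step cost stated later.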
96-98. iLSTD
Per-time-step computational complexity: O(mn + k²)
    m: number of descent iterations per time step
    n: number of features
    k: maximum number of active features
Linear in n, and more data efficient than TD.
99. iLSTD
Theorem: iLSTD converges with probability one to the same solution as TD, under the usual step-size conditions, for any dimension selection method such that all dimensions for which µ_t is non-zero are selected in the limit an infinite number of times.
102. Settings
Averaged over 30 runs
Same random seed for all methods
Sparse matrix representation
iLSTD
Non-zero random selection
One descent per iteration
115. Mountain Car
[Figure: Mountain Car domain, with the goal at the hilltop]
Tile coding: n = 10,000, k = 10
Start position = −1 (Easy); start position = −.5 (Hard)
[For details see RL-Library]
116. Easy Mountain Car
[Figure: log-scale loss vs. episode and loss vs. time (s) for TD, iLSTD, and LSTD on the easy Mountain Car task]
Loss function: Loss = ||b* − A*θ||₂²
117-118. Hard Mountain Car
[Figure: log-scale loss vs. episode and loss vs. time (s) for TD, iLSTD, and LSTD on the hard Mountain Car task]
119-120. Running Time Constant
[Figure: bar chart of per-step time (ms, up to ~155) for TD, iLSTD, and LSTD on the Boyan Chain and Mountain Car problems, annotated with linear vs. exponential growth]
137-139. Empirical Results
ε-greedy: ε = .1; Boltzmann: ψ = 10^-9, τ = 1
[Figure: log-scale error measure for the Random, Greedy, ε-greedy, and Boltzmann dimension selection methods on the Boyan Chain (Small, Medium, Large) and Mountain Car (Easy, Hard) problems]
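The four dimension selection mechanisms being compared can be sketched as follows. This is an illustrative guess at their shapes: the exact tie-breaking and the role of the ψ parameter are not specified on the slides, so the sketch uses only τ for Boltzmann, and the function name is mine.

```python
import numpy as np

def select_dimension(mu, method, rng, eps=0.1, tau=1.0):
    """Pick a component of µ to descend on.
    random:     uniform over the non-zero dimensions of µ
    greedy:     argmax |µ_j|
    eps-greedy: greedy with prob 1−ε, else random (ε = .1 in the talk)
    boltzmann:  sample j with prob ∝ exp(|µ_j| / τ) (τ = 1 in the talk)"""
    nz = np.flatnonzero(mu)
    if method == "random":
        return int(rng.choice(nz))
    if method == "greedy":
        return int(np.argmax(np.abs(mu)))
    if method == "eps-greedy":
        if rng.random() < eps:
            return int(rng.choice(nz))
        return int(np.argmax(np.abs(mu)))
    if method == "boltzmann":
        p = np.exp(np.abs(mu) / tau)
        p /= p.sum()
        return int(rng.choice(len(mu), p=p))
    raise ValueError(method)
```

The trade-off the charts illustrate: greedy-style rules concentrate updates where the residual µ is largest, while random selection is cheapest per step and still satisfies the convergence theorem's "every non-zero dimension selected infinitely often" condition.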
140. Running Time
[Figure: clock time per step (ms, up to ~11.0) for the Random, Greedy, ε-greedy, and Boltzmann selection methods on the Boyan chain (Small, Medium, Large) and Mountain car (Easy, Hard) problems]