Knowledge Diffusion Network
Visiting Lecturer Program (114)
Speaker: Alborz Geramifard, Ph.D. Candidate
Department of Computing Science, University of Alberta, Edmonton, Canada
Title: Incremental Least-Squares Temporal Difference Learning
Time: Tuesday, Sep 11, 2007, 12:00-1:00 pm
Location: Department of Computer Engineering, Sharif University of Technology, Tehran
11-12. The Vazir of Agencia
The Ambassador of Envirocia: "Your king is generous, wise, and kind, but he is forgetful. Sometimes he makes decisions which are not wise in light of our earlier conversations."

13. The Vazir of Agencia
Having consultants for various subjects might be the answer ...
19. Vazir
The Ambassador: "I should admit that recently the king makes decent judgments, but as the number of our meetings increases, it takes a while for me to hear back from his majesty ..."
35. Contributions
iLSTD: a new policy evaluation algorithm
Extension with eligibility traces
Running time analysis
Dimension selection methods
Proof of convergence
Empirical results
42-43. Markov Decision Process
    ⟨S, A, P^a_{ss'}, R^a_{ss'}, γ⟩
[Figure: MDP over the states B.S., M.S., and Ph.D. with actions Studying and Working; edges are labeled with transition probabilities and rewards (100%,+60; 80%,-50; 10%,-50; 10%,-300; 100%,+85; 70%,-200; 30%,-200; 100%,+120).]
Sample trajectory: B.S., Working, +60, B.S., Studying, -50, M.S., ...
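The ⟨S, A, P, R, γ⟩ tuple and the sample trajectory above can be sketched in code. This is a minimal illustration only: the transition table below is a simplified guess at the diagram's wiring (the exact probability/reward assignments are not fully recoverable from the slide), and the function names are mine.

```python
import random

# transitions[state][action] -> list of (probability, reward, next_state).
# Simplified illustration inspired by the B.S./M.S./Ph.D. diagram;
# the exact wiring here is assumed, not a faithful copy of the figure.
transitions = {
    "B.S.":  {"Working":  [(1.0, 60, "B.S.")],
              "Studying": [(0.8, -50, "M.S."), (0.2, -50, "B.S.")]},
    "M.S.":  {"Working":  [(1.0, 85, "M.S.")],
              "Studying": [(0.7, -200, "Ph.D."), (0.3, -200, "M.S.")]},
    "Ph.D.": {"Working":  [(1.0, 120, "Ph.D.")]},
}

def step(state, action, rng=random):
    """Sample (reward, next_state) from P^a_{ss'} and R^a_{ss'}."""
    outcomes = transitions[state][action]
    u, acc = rng.random(), 0.0
    for prob, reward, nxt in outcomes:
        acc += prob
        if u <= acc:
            return reward, nxt
    return outcomes[-1][1], outcomes[-1][2]

def trajectory(policy, start="B.S.", steps=4, rng=random):
    """Roll out s0, a0, r1, s1, ... as in the slide's sample trajectory."""
    out, s = [start], start
    for _ in range(steps):
        a = policy(s)
        r, s = step(s, a, rng)
        out += [a, r, s]
    return out
```

Running `trajectory` with a fixed "always work" policy reproduces the shape of the sample trajectory on the slide: a state, an action, a reward, the next state, and so on.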
51-54. Sparsity of features
Sparsity: only k features are active at any given moment, with k ≪ n.
Acrobot [Sutton 96]: 48 ≪ 18,648
Card game [Bowling et al. 02]: 3 ≪ 10^6
Keepaway soccer [Stone et al. 05]: 416 ≪ 10^4
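Sparsity is what makes the later per-step costs cheap: a state can be represented by its k active indices rather than a length-n vector. A minimal sketch, assuming binary (tile-coding-style) features; the variable names are mine.

```python
import numpy as np

n = 10_000   # total number of features (e.g., tile coding)
k = 10       # features active in any one state

def sparse_dot(active_indices, theta):
    """V(s) = φ(s)ᵀθ in O(k) rather than O(n): with binary features,
    φ(s)ᵀθ is just the sum of θ over the active indices."""
    return sum(theta[j] for j in active_indices)

# A state is stored as its k active indices, not a length-n vector.
theta = np.zeros(n)
theta[[3, 42, 77]] = [1.0, 2.0, 0.5]
phi_s = [3, 42, 77]               # hypothetical active tiles for some state s
value = sparse_dot(phi_s, theta)  # O(k) work regardless of n
```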
73-74. The New Approach
A and b matrices change on each iteration:
    µ_t(θ) = Σ_{i=1}^t φ_i r_{i+1} − [ Σ_{i=1}^t φ_i (φ_i − γ φ_{i+1})^T ] θ = b_t − A_t θ
Incremental updates:
    b_t = b_{t−1} + r_t φ_t,                 ∆b_t = r_t φ_t
    A_t = A_{t−1} + φ_t (φ_t − γ φ_{t+1})^T, ∆A_t = φ_t (φ_t − γ φ_{t+1})^T
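The incremental updates above can be written directly with NumPy; this is a sketch with dense arrays and names of my choosing, showing only the bookkeeping.

```python
import numpy as np

def update_Ab(A, b, phi_t, r_t, phi_tp1, gamma):
    """Accumulate one transition (φ_t, r_t, φ_{t+1}) into the running
    statistics, matching the slide's updates:
        ∆b_t = r_t φ_t,   ∆A_t = φ_t (φ_t − γ φ_{t+1})ᵀ
    so that µ_t(θ) = b_t − A_t θ."""
    dB = r_t * phi_t
    dA = np.outer(phi_t, phi_t - gamma * phi_tp1)
    return A + dA, b + dB
```

LSTD then solves A_t θ = b_t outright each step; iLSTD instead keeps µ_t = b_t − A_t θ and applies cheap partial updates to θ, as the following slides describe.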
81-90. iLSTD
    µ_t(θ_{t+1}) = µ_t(θ_t) − A_t (∆θ_t)
How to change θ? Descent in the direction of µ_t(θ)? That touches all n dimensions, and maintaining µ then costs O(n²).
91-93. iLSTD Algorithm
Algorithm 5: iLSTD
 0  s ← s0, A ← 0, µ ← 0, t ← 0
 1  Initialize θ arbitrarily
 2  repeat
 3      Take action according to π and observe r, s′
 4      t ← t + 1
 5      ∆b ← φ(s) r
 6      ∆A ← φ(s)(φ(s) − γφ(s′))^T
 7      A ← A + ∆A
 8      µ ← µ + ∆b − (∆A)θ
 9      for i from 1 to m do
10          j ← choose an index of µ using a dimension selection mechanism
11          θ_j ← θ_j + α µ_j
12          µ ← µ − α µ_j A e_j
13      end for
14      s ← s′
15  end repeat
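The per-step body of the pseudocode translates almost line for line. The sketch below assumes dense NumPy arrays and the random non-zero dimension selection used in the experiments, so it shows the logic rather than the O(mn + k²) sparse implementation; names and signatures are mine.

```python
import numpy as np

def ilstd_step(A, mu, theta, phi_s, r, phi_sp, gamma, alpha, m, rng):
    """One time step of iLSTD (lines 3-14 of the pseudocode).
    A:  running sum of φ(s)(φ(s) − γφ(s′))ᵀ
    mu: running residual b − Aθ
    Dimension selection here: a random non-zero component of µ."""
    dA = np.outer(phi_s, phi_s - gamma * phi_sp)   # ∆A
    A = A + dA
    mu = mu + r * phi_s - dA @ theta               # µ += ∆b − (∆A)θ
    for _ in range(m):
        nz = np.flatnonzero(mu)
        if nz.size == 0:
            break
        j = int(rng.choice(nz))                    # dimension selection
        mu_j = mu[j]
        theta[j] += alpha * mu_j                   # θ_j += α µ_j
        mu = mu - alpha * mu_j * A[:, j]           # µ −= α µ_j A e_j
    return A, mu, theta
```

Note that only one column of A is touched per selected dimension; with sparse features and a sparse matrix representation this is what yields the O(mn + k²) per-step cost stated later.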
96-98. iLSTD
Per-time-step computational complexity: O(mn + k²)
    m: number of descent iterations per time step
    n: number of features
    k: maximum number of active features
Linear in n, and more data efficient than TD.
99. iLSTD
Theorem: iLSTD converges with probability one to the same solution as TD, under the usual step-size conditions, for any dimension selection method such that all dimensions for which µ_t is non-zero are selected in the limit an infinite number of times.
102. Settings
Averaged over 30 runs
Same random seed for all methods
Sparse matrix representation
iLSTD
Non-zero random selection
One descent per iteration
115. Mountain Car
[Figure: Mountain Car domain, with the goal at the hilltop]
Tile coding: n = 10,000, k = 10
Start position = −1 (Easy); start position = −.5 (Hard)
[For details see RL-Library]
116. Easy Mountain Car
[Figure: log-scale loss vs. episode and loss vs. time (s) for TD, iLSTD, and LSTD on the easy Mountain Car task]
Loss function: Loss = ||b* − A*θ||₂²
117-118. Hard Mountain Car
[Figure: log-scale loss vs. episode and loss vs. time (s) for TD, iLSTD, and LSTD on the hard Mountain Car task]
119-120. Running Time Constant
[Figure: bar chart of per-step time (ms, up to ~155) for TD, iLSTD, and LSTD on the Boyan Chain and Mountain Car problems, annotated with linear vs. exponential growth]
137-139. Empirical Results
ε-greedy: ε = .1; Boltzmann: ψ = 10^-9, τ = 1
[Figure: log-scale error measure for the Random, Greedy, ε-greedy, and Boltzmann dimension selection methods on the Boyan Chain (Small, Medium, Large) and Mountain Car (Easy, Hard) problems]
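The four dimension selection mechanisms being compared can be sketched as follows. This is an illustrative guess at their shapes: the exact tie-breaking and the role of the ψ parameter are not specified on the slides, so the sketch uses only τ for Boltzmann, and the function name is mine.

```python
import numpy as np

def select_dimension(mu, method, rng, eps=0.1, tau=1.0):
    """Pick a component of µ to descend on.
    random:     uniform over the non-zero dimensions of µ
    greedy:     argmax |µ_j|
    eps-greedy: greedy with prob 1−ε, else random (ε = .1 in the talk)
    boltzmann:  sample j with prob ∝ exp(|µ_j| / τ) (τ = 1 in the talk)"""
    nz = np.flatnonzero(mu)
    if method == "random":
        return int(rng.choice(nz))
    if method == "greedy":
        return int(np.argmax(np.abs(mu)))
    if method == "eps-greedy":
        if rng.random() < eps:
            return int(rng.choice(nz))
        return int(np.argmax(np.abs(mu)))
    if method == "boltzmann":
        p = np.exp(np.abs(mu) / tau)
        p /= p.sum()
        return int(rng.choice(len(mu), p=p))
    raise ValueError(method)
```

The trade-off the charts illustrate: greedy-style rules concentrate updates where the residual µ is largest, while random selection is cheapest per step and still satisfies the convergence theorem's "every non-zero dimension selected infinitely often" condition.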
140. Running Time
[Figure: clock time per step (ms, up to ~11.0) for the Random, Greedy, ε-greedy, and Boltzmann selection methods on the Boyan chain (Small, Medium, Large) and Mountain car (Easy, Hard) problems]