This deck covers the MDP, Monte-Carlo, Temporal-Difference, SARSA, and Q-learning methods. It was used for a lecture of the Reinforcement Learning study group at the Korea Artificial Intelligence Laboratory.
6. Monte-Carlo
• Learn from episodes.
• Actions are selected by the policy → the episode ends → the value function is calculated from the episode.
• If you run the episode several times, the value function is calculated by averaging the returns obtained from state $s_t$ (see the sketch below).
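As a minimal sketch of this averaging idea (the `episodes` structure, a list of (state, reward) pairs per episode, and the discount `gamma` are illustrative assumptions, not from the slides):

```python
from collections import defaultdict

def mc_value_estimate(episodes, gamma=0.9):
    """First-visit Monte-Carlo: V(s) is the average of the returns G_t
    observed after the first visit to s in each episode."""
    returns = defaultdict(list)              # state -> observed returns
    for episode in episodes:                 # episode: [(state, reward), ...]
        G = 0.0
        first_visit = {}                     # state -> G_t at its first visit
        for state, reward in reversed(episode):
            G = reward + gamma * G           # accumulate discounted return
            first_visit[state] = G           # overwriting keeps earliest visit
        for state, G_t in first_visit.items():
            returns[state].append(G_t)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```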
9. Incremental Monte-Carlo update

$$V_\pi(s_t) \leftarrow V_\pi(s_t) + \alpha\,\big(G_t - V_\pi(s_t)\big)$$

Here $V_\pi(s_t)$ on the right is the previous value function (the return it predicted), $G_t$ is the actual return, and $\alpha$ is the step size, so each update moves the estimate toward the actual return by a fraction $\alpha$ of the error $G_t - V_\pi(s_t)$. This is how we update the value function using the Monte-Carlo method.
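The same update as a runnable sketch (an every-visit variant; representing `V` as a `defaultdict(float)` and the step size `alpha` are illustrative choices):

```python
from collections import defaultdict

def mc_update(V, episode, alpha=0.1, gamma=0.9):
    """Incremental Monte-Carlo: after the episode ends, move each V(s_t)
    toward the actual return G_t by a fraction alpha of the error."""
    G = 0.0
    for state, reward in reversed(episode):  # episode: [(state, reward), ...]
        G = reward + gamma * G               # actual return G_t from this state
        V[state] += alpha * (G - V[state])   # V <- V + alpha * (G_t - V)
    return V

V = defaultdict(float)                       # V(s) = 0 initially
```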
10. Limitations of Monte-Carlo
• The episode must end before we can update the value function.
• It is difficult to train if the environment never terminates or if episodes take too long to finish.
11. Temporal Difference (TD)
• Let's turn the Monte-Carlo method into one that updates in real time.

The return satisfies the recursion $G_t = R_{t+1} + \gamma G_{t+1}$, and the unknown future return can be replaced by the current estimate of the next state's value:

$$G_t \approx R_{t+1} + \gamma V(S_{t+1})$$

Substituting this target for $G_t$ in the Monte-Carlo update $V_\pi(s_t) \leftarrow V_\pi(s_t) + \alpha\,\big(G_t - V(s_t)\big)$ gives the TD update:

$$V(S_t) \leftarrow V(S_t) + \alpha\,\big(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big)$$
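A minimal TD(0) sketch of this update, applied to one transition at a time (names such as `td0_update` are illustrative, not from the slides):

```python
from collections import defaultdict

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.9):
    """TD(0): update V(state) from a single transition, without waiting
    for the episode to end; the target bootstraps on V(next_state)."""
    td_target = reward + gamma * V[next_state]   # R_{t+1} + gamma * V(S_{t+1})
    V[state] += alpha * (td_target - V[state])   # move toward the TD target
    return V

V = defaultdict(float)  # value estimates, 0 by default
```

Unlike the Monte-Carlo update, this can be called after every single step of the episode.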
12. Temporal Difference (TD)
• Advantage
• The value function can be updated in real time, at every step.
• Disadvantage
• The target is not the true return $G_t$; it is derived from the current estimate $V(S_{t+1})$ (we call this phenomenon bootstrapping).
13. SARSA
• An algorithm that finds the optimal Q-value through the TD method.

$$Q(S, A) \leftarrow Q(S, A) + \alpha\,\big(R + \gamma\,Q(S', A') - Q(S, A)\big)$$

The name comes from the tuple used in each update: State, Action, Reward, next State, next Action $(S, A, R, S', A')$.
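A minimal SARSA sketch, assuming a table `Q` keyed by (state, action) and an ε-greedy behavior policy (both are illustrative assumptions):

```python
import random
from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)], 0 by default

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Behavior policy: explore with probability epsilon, else act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    """On-policy TD update over the tuple (S, A, R, S', A')."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
    return Q
```

Because the next action $A'$ is itself chosen by the current policy, SARSA is an on-policy method.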