
Reinforcement Learning basics part1

This presentation covers MDPs, Monte-Carlo, Temporal-Difference learning, SARSA, and Q-learning. It was used for a lecture of the Reinforcement Learning study group at the Korea Artificial Intelligence Laboratory.

  1. MDP, MC, TD, SARSA, Q-learning (Uijin Jung)
  2. Bellman equation: $V_\pi(s) = E[\,r_{t+1} + \gamma \cdot V_\pi(S_{t+1}) \mid S_t = s\,]$, where $V_\pi(s)$ is the value function, $r_{t+1}$ the reward, and $\gamma$ the discount factor.
  3. [Backup diagram: state $s$ (value $V_\pi(s)$) branches to actions $a$ (values $q_\pi(s,a)$), which branch to successor states $s'$ (values $V_\pi(s')$).] $V_\pi(s) = \sum_{a\in A} \pi(a \mid s)\, q_\pi(s,a)$ and $q_\pi(s,a) = R_s^a + \gamma \sum_{s'\in S} P_{ss'}^a\, V_\pi(s')$, consistent with $V_\pi(s) = E[\,r_{t+1} + \gamma \cdot V_\pi(S_{t+1}) \mid S_t = s\,]$.
  4. Substituting $q_\pi(s,a) = R_s^a + \gamma \sum_{s'\in S} P_{ss'}^a\, V_\pi(s')$ into $V_\pi(s) = \sum_{a\in A} \pi(a \mid s)\, q_\pi(s,a)$ gives $V_\pi(s) = \sum_{a\in A} \pi(a \mid s)\left(R_s^a + \gamma \sum_{s'\in S} P_{ss'}^a\, V_\pi(s')\right)$.
  5. Limitation of the Bellman equation • To solve it directly, we must know the MDP model perfectly, i.e. the transition probabilities $P_{ss'}^a$ and rewards $R_s^a$ (a policy-evaluation sketch illustrating this appears after the slides).
  6. Monte-Carlo • Learn from episodes. • Select actions with the policy until the episode ends, then calculate the value function from that episode. • If you run several episodes, estimate the value function by averaging the returns obtained from state $s_t$.
  7. Incremental mean: $V_t = \frac{1}{t}\sum_{j=1}^{t} G_j$ and $V_{t-1} = \frac{1}{t-1}\sum_{j=1}^{t-1} G_j$, so $\sum_{j=1}^{t-1} G_j = (t-1)\,V_{t-1}$ and therefore $V_t = \frac{1}{t}\left(G_t + \sum_{j=1}^{t-1} G_j\right) = \frac{1}{t}\left(G_t + (t-1)\,V_{t-1}\right)$.
  8. $V_t = \frac{1}{t}\left(G_t + (t-1)\,V_{t-1}\right) = \frac{1}{t} G_t + V_{t-1} - \frac{1}{t} V_{t-1} = V_{t-1} + \frac{1}{t}\left(G_t - V_{t-1}\right)$; replacing $\frac{1}{t}$ with a constant step size $\alpha$ gives $V_t = V_{t-1} + \alpha\left(G_t - V_{t-1}\right)$ (a numeric check appears after the slides).
  9. $V_t = V_{t-1} + \alpha\,(G_t - V_{t-1})$: the new value is the previous value plus a step from the return predicted by the previous value function toward the actual return $G_t$. In state-value form, $V_\pi(s_t) \leftarrow V_\pi(s_t) + \alpha\,(G_t - V_\pi(s_t))$. This is how we update the value function with the Monte-Carlo method (see the sketch after the slides).
  10. Limitations of Monte-Carlo • The episode must end before we can update the value function. • It is difficult to train if the environment never terminates or if episodes take too long to finish.
  11. Temporal difference (TD) • Let's change the Monte-Carlo method into a real-time (online) method. Since $G_t = R_{t+1} + \gamma G_{t+1}$, the return can be approximated by $R_{t+1} + \gamma V(S_{t+1})$; substituting this into the Monte-Carlo update $V_\pi(s_t) \leftarrow V_\pi(s_t) + \alpha\,(G_t - V(s_t))$ gives $V(S_t) \leftarrow V(S_t) + \alpha\,\left(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right)$ (see the TD(0) sketch after the slides).
  12. Temporal difference • Advantage: you can update the value function in real time, during the episode. • Disadvantage: the target is not the true return $G_t$; it is built from the current value estimate $V(S_{t+1})$ (this is called bootstrapping).
  13. SARSA • An algorithm that finds the optimal Q value with the TD method: $Q(S,A) \leftarrow Q(S,A) + \alpha\,\left(R + \gamma\,Q(S',A') - Q(S,A)\right)$. The name comes from the tuple it uses: State, Action, Reward, next State, next Action.
  14. SARSA pseudocode (https://stackoverflow.com/questions/32846262/q-learning-vs-sarsa-with-greedy-select); a minimal sketch also appears after the slides.
  15. [Diagram: exploration vs. prediction example for SARSA with rewards -1 and +1.] • SARSA is on-policy. • There is a possibility that its estimates become biased.
  16. Q-learning • An algorithm that finds the optimal Q value with the TD method using an off-policy target.
  17. SARSA vs. Q-learning: SARSA updates with $Q(s,a) \leftarrow Q(s,a) + \alpha\,\left(R + \gamma\,Q(s',a') - Q(s,a)\right)$, while Q-learning updates with $Q(s,a) \leftarrow Q(s,a) + \alpha\,\left(R + \gamma \max_{a'} Q(s',a') - Q(s,a)\right)$.
  18. Q-learning pseudocode (https://stackoverflow.com/questions/32846262/q-learning-vs-sarsa-with-greedy-select); a minimal sketch also appears after the slides.
  19. [Diagram: exploration vs. prediction example for Q-learning with rewards -1 and +1.]
  20. The end
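
Slides 4-5 in code: a minimal iterative policy-evaluation sketch that repeatedly applies the combined Bellman expectation equation. It assumes a small tabular MDP given as NumPy arrays; the names policy_evaluation, P, R and pi are illustrative, not from the deck. It also makes slide 5 concrete: the full transition model P and reward table R must be known.

    import numpy as np

    def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-6):
        # P[s, a, s'] : transition probability, shape (S, A, S)
        # R[s, a]     : expected immediate reward, shape (S, A)
        # pi[s, a]    : policy probability pi(a|s), shape (S, A)
        V = np.zeros(P.shape[0])
        while True:
            q = R + gamma * (P @ V)          # q(s,a) = R_s^a + gamma * sum_s' P_ss'^a V(s')
            V_new = (pi * q).sum(axis=1)     # V(s)   = sum_a pi(a|s) q(s,a)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new
            V = V_new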
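
Slides 7-8 in code: a quick numeric check (with made-up returns) that the incremental update $V_t = V_{t-1} + \frac{1}{t}(G_t - V_{t-1})$ reproduces the ordinary average of the returns.

    # Incremental mean: V_t = V_{t-1} + (1/t) * (G_t - V_{t-1})
    returns = [2.0, 0.0, 5.0, 1.0]           # made-up returns G_1..G_4
    V = 0.0
    for t, G in enumerate(returns, start=1):
        V = V + (1.0 / t) * (G - V)
    assert abs(V - sum(returns) / len(returns)) < 1e-12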
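
Slide 9 in code: a minimal every-visit Monte-Carlo prediction sketch using $V(s_t) \leftarrow V(s_t) + \alpha\,(G_t - V(s_t))$. The episode format (a list of $(s_t, r_{t+1})$ pairs) and the function name are assumptions for illustration.

    from collections import defaultdict

    def mc_prediction(episodes, alpha=0.1, gamma=0.9):
        # episodes: list of episodes; each episode is a list of (s_t, r_{t+1}) pairs,
        # where r_{t+1} is the reward received after leaving s_t
        V = defaultdict(float)
        for episode in episodes:
            G = 0.0
            # Walk the episode backwards so that G_t = r_{t+1} + gamma * G_{t+1}
            for state, reward in reversed(episode):
                G = reward + gamma * G
                V[state] += alpha * (G - V[state])   # slide 9 update
        return V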
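
Slide 11 in code: a minimal TD(0) prediction sketch using $V(S_t) \leftarrow V(S_t) + \alpha\,(R_{t+1} + \gamma V(S_{t+1}) - V(S_t))$. The environment is a stand-in assumed to provide reset() -> state and step(action) -> (next_state, reward, done); it and the policy function are not from the deck.

    from collections import defaultdict

    def td0_prediction(env, policy, episodes=1000, alpha=0.1, gamma=0.9):
        V = defaultdict(float)
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                a = policy(s)
                s_next, r, done = env.step(a)
                target = r if done else r + gamma * V[s_next]   # bootstrapped target (slide 12)
                V[s] += alpha * (target - V[s])                  # slide 11 update
                s = s_next
        return V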
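
Slides 13-14 in code: a minimal SARSA sketch with an epsilon-greedy behaviour policy, assuming the same stand-in environment interface as the TD(0) sketch; all names here are illustrative.

    import random
    from collections import defaultdict

    def epsilon_greedy(Q, s, actions, eps):
        # Explore with probability eps, otherwise act greedily w.r.t. Q
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    def sarsa(env, actions, episodes=1000, alpha=0.1, gamma=0.9, eps=0.1):
        Q = defaultdict(float)
        for _ in range(episodes):
            s = env.reset()
            a = epsilon_greedy(Q, s, actions, eps)
            done = False
            while not done:
                s_next, r, done = env.step(a)
                a_next = epsilon_greedy(Q, s_next, actions, eps)      # next action from the same policy (on-policy)
                target = r if done else r + gamma * Q[(s_next, a_next)]
                Q[(s, a)] += alpha * (target - Q[(s, a)])             # slide 13 update
                s, a = s_next, a_next
        return Q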
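
Slides 17-18 in code: a minimal Q-learning sketch. It reuses the assumed environment interface and the epsilon_greedy helper from the SARSA sketch above; the only change from SARSA is the max over next actions, which makes the target independent of the action the behaviour policy actually takes (off-policy, as on slide 16).

    from collections import defaultdict

    def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, eps=0.1):
        Q = defaultdict(float)
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                a = epsilon_greedy(Q, s, actions, eps)                # behave epsilon-greedily
                s_next, r, done = env.step(a)
                best_next = max(Q[(s_next, a2)] for a2 in actions)    # max_a' Q(s', a')
                target = r if done else r + gamma * best_next
                Q[(s, a)] += alpha * (target - Q[(s, a)])             # slide 17 update
                s = s_next
        return Q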
