SlideShare a Scribd company logo
1 of 20
MDP, MC, TD
sarsa, q-learning
Uijin Jung
𝑉𝜋 𝑠 = 𝐸 𝑟 𝑡+1
+ 𝛾 ∙ 𝑉𝜋 𝑆𝑡+1 𝑆𝑡 = 𝑠]
Bellman equation
𝑉𝜋 𝑠 : value function
𝑟 𝑡+1 ∶ reward
𝛾 : discount factor
S
A1 A2
Vπ(s) ↤ s
𝑞 𝜋(𝑠, 𝑎) ↤ 𝑎
Vπ s =
𝑎∈𝐴
𝜋 𝑎 𝑠 𝑞 𝜋(𝑠, 𝑎)
Vπ(s′) ↤ s’
𝑞 𝜋 𝑠, 𝑎 = 𝑅 𝑠
𝑎
+ 𝛾
𝑠′∈𝑆
Ρ𝑠𝑠′
𝑎
𝑉𝜋(𝑠′
)
𝑆1
′
𝑆2
′
𝑉𝜋 𝑠 = 𝐸 𝑟 𝑡+1 + 𝛾 ∙ 𝑉𝜋 𝑆𝑡+1 𝑆𝑡 = 𝑠]
Vπ s =
𝑎∈𝐴
𝜋 𝑎 𝑠 𝑞 𝜋(𝑠, 𝑎)
𝑞 𝜋 𝑠, 𝑎 = 𝑅 𝑠
𝑎
+ 𝛾
𝑠′∈𝑆
Ρ𝑠𝑠′
𝑎
𝑉𝜋(𝑠′
)
Vπ s =
𝑎∈𝐴
𝜋 𝑎 𝑠 (𝑅 𝑠
𝑎
+ 𝛾
𝑠′∈𝑆
Ρ𝑠𝑠′
𝑎
𝑉𝜋(𝑠′
))
Limitation of Bellman equation
• We must know MDP model perfectly.
Monte-Carlo
• Learn from the episodes
• Action selected by policy-> end of episode-> calculate value function
based on the episode
• If you iterate an episode several time, calculate value function by
averaging each rewards took from the state 𝑠𝑡.
=
1
𝑡
𝐺𝑡 +
𝑗=1
𝑡−1
𝐺𝑗
𝑗=1
𝑡−1
𝐺𝑗 = (𝑡 − 1)𝑉𝑡−1
=
1
𝑡
𝐺𝑡 + (𝑡 − 1)𝑉𝑡−1
𝑉𝑡 =
1
𝑡
𝑗=1
𝑡
𝐺𝑗 𝑉𝑡−1 =
1
𝑡 − 1
𝑗=1
𝑡−1
𝐺𝑗
(𝑡 − 1)𝑉𝑡−1=
𝑗=1
𝑡−1
𝐺𝑗
𝑉𝑡 =
1
𝑡
𝐺𝑡 + (𝑡 − 1)𝑉𝑡−1
=
1
𝑡
𝐺𝑡 + 𝑉𝑡−1 −
1
𝑡
∙ 𝑉𝑡−1
= 𝑉𝑡−1 +
1
𝑡
𝐺𝑡 − 𝑉𝑡−1
𝑉𝑡−1 +
1
𝑡
𝐺𝑡 −
1
𝑡
∙ 𝑉𝑡−1
= 𝑉𝑡−1 + 𝑎 𝐺𝑡 − 𝑉𝑡−1
𝑉𝑡 = 𝑉𝑡−1 + 𝑎 𝐺𝑡 − 𝑉𝑡−1
Reward predicted by
previous value function
Actual reward
Previous
value function
Value function
𝑉𝜋 𝑠𝑡 ← 𝑉𝜋(𝑠𝑡) + 𝑎 𝐺𝑡 − 𝑉(𝑠𝑡)
This is how we update value function
Using Monte-Carlo method
Limitations of Monte-Carlo
• Episode must end before we update the value function.
• It is difficult to train if the environment is endless or if the time to finish is too
long.
Time difference
• Let's change the Monte-Carlo method to a real time processing
method.
𝐺𝑡 = 𝑅𝑡+1 + 𝛾𝐺𝑡+1
𝑅𝑡+1 + 𝛾𝑉 (𝑆𝑡+1)
𝑉𝑡 = 𝑉𝑡 + 𝑎 𝑅𝑡+1 + 𝛾𝑉 (𝑆𝑡+1) − 𝑉𝑡
𝑉𝜋 𝑠𝑡 ← 𝑉𝜋(𝑠𝑡) + 𝑎 𝐺𝑡 − 𝑉(𝑠𝑡)
Time difference
• advantage
• You can update the value function in real time.
• disadvantage
• The reference 𝐺𝑡 is not true, it is derived
from the expected value : (we call this phenomenon, bootstrap)
Sarsa
• Algorithm to find optimal q value through TD method
𝑄 𝑆, 𝐴 ← 𝑄 𝑆, 𝐴 + 𝛼(𝑅 + 𝛾𝑄 𝑆′, 𝐴′ − 𝑄 𝑆, 𝐴 )
State, Action, Reward, next State, next Action
Sarsa pseudo code (https://stackoverflow.com/questions/32846262/q-learning-vs-sarsa-with-greedy-select)
Me
exploration
predict
Sarsa
-1 +1
• On-Policy
• There is a possibility
of being biased.
Q-learaning
• Algorithm to find optimal q value through TD method using off-policy.
Sarsa Q-learning
𝑄 𝑠, 𝑎 ← 𝑄 𝑠, 𝑎 + 𝛼(𝑅 + 𝛾𝑄 𝑠′, 𝑎′
− 𝑄 𝑠, 𝑎 ) 𝑄 𝑠, 𝑎 ← 𝑄(𝑠, 𝑎) + 𝛼(𝑅 + 𝛾 ∙ 𝑚𝑎𝑥 𝑎′ 𝑄 𝑠′
, 𝑎′
− 𝑄 𝑠, 𝑎 )
Q-learning pseudo code (https://stackoverflow.com/questions/32846262/q-learning-vs-sarsa-with-greedy-select)
Me
exploration
predict
Q-learning
-1 +1
v
The end

More Related Content

Similar to Reinforcement Learning basics part1

2Multi_armed_bandits.pptx
2Multi_armed_bandits.pptx2Multi_armed_bandits.pptx
2Multi_armed_bandits.pptxZhiwuGuo1
 
Solving Poisson Equation using Conjugate Gradient Method and its implementation
Solving Poisson Equation using Conjugate Gradient Methodand its implementationSolving Poisson Equation using Conjugate Gradient Methodand its implementation
Solving Poisson Equation using Conjugate Gradient Method and its implementationJongsu "Liam" Kim
 
Stochastic optimal control & rl
Stochastic optimal control & rlStochastic optimal control & rl
Stochastic optimal control & rlChoiJinwon3
 
Reinfrocement Learning
Reinfrocement LearningReinfrocement Learning
Reinfrocement LearningNatan Katz
 
Support vector machines
Support vector machinesSupport vector machines
Support vector machinesJinho Lee
 
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdfvariBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdftaeseon ryu
 
Vehicle Routing Problem using PSO (Particle Swarm Optimization)
Vehicle Routing Problem using PSO (Particle Swarm Optimization)Vehicle Routing Problem using PSO (Particle Swarm Optimization)
Vehicle Routing Problem using PSO (Particle Swarm Optimization)Niharika Varshney
 
Lecture 5 backpropagation
Lecture 5 backpropagationLecture 5 backpropagation
Lecture 5 backpropagationParveenMalik18
 
Intro to Quant Trading Strategies (Lecture 6 of 10)
Intro to Quant Trading Strategies (Lecture 6 of 10)Intro to Quant Trading Strategies (Lecture 6 of 10)
Intro to Quant Trading Strategies (Lecture 6 of 10)Adrian Aley
 
Reinforcement Learning and Artificial Neural Nets
Reinforcement Learning and Artificial Neural NetsReinforcement Learning and Artificial Neural Nets
Reinforcement Learning and Artificial Neural NetsPierre de Lacaze
 
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pptx
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pptxvariBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pptx
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pptxSeungeon Baek
 
Learning group em - 20171025 - copy
Learning group   em - 20171025 - copyLearning group   em - 20171025 - copy
Learning group em - 20171025 - copyShuai Zhang
 
Machine learning introduction lecture notes
Machine learning introduction lecture notesMachine learning introduction lecture notes
Machine learning introduction lecture notesUmeshJagga1
 
Digital control systems (dcs) lecture 18-19-20
Digital control systems (dcs) lecture 18-19-20Digital control systems (dcs) lecture 18-19-20
Digital control systems (dcs) lecture 18-19-20Ali Rind
 
MachineLearning_QLearningCircuit
MachineLearning_QLearningCircuitMachineLearning_QLearningCircuit
MachineLearning_QLearningCircuitSean Williams
 
Arjrandomjjejejj3ejjeejjdjddjjdjdjdjdjdjdjdjdjd
Arjrandomjjejejj3ejjeejjdjddjjdjdjdjdjdjdjdjdjdArjrandomjjejejj3ejjeejjdjddjjdjdjdjdjdjdjdjdjd
Arjrandomjjejejj3ejjeejjdjddjjdjdjdjdjdjdjdjdjd12345arjitcs
 

Similar to Reinforcement Learning basics part1 (20)

2Multi_armed_bandits.pptx
2Multi_armed_bandits.pptx2Multi_armed_bandits.pptx
2Multi_armed_bandits.pptx
 
Solving Poisson Equation using Conjugate Gradient Method and its implementation
Solving Poisson Equation using Conjugate Gradient Methodand its implementationSolving Poisson Equation using Conjugate Gradient Methodand its implementation
Solving Poisson Equation using Conjugate Gradient Method and its implementation
 
Stochastic optimal control & rl
Stochastic optimal control & rlStochastic optimal control & rl
Stochastic optimal control & rl
 
Reinfrocement Learning
Reinfrocement LearningReinfrocement Learning
Reinfrocement Learning
 
04 Multi-layer Feedforward Networks
04 Multi-layer Feedforward Networks04 Multi-layer Feedforward Networks
04 Multi-layer Feedforward Networks
 
Support vector machines
Support vector machinesSupport vector machines
Support vector machines
 
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdfvariBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
 
Vehicle Routing Problem using PSO (Particle Swarm Optimization)
Vehicle Routing Problem using PSO (Particle Swarm Optimization)Vehicle Routing Problem using PSO (Particle Swarm Optimization)
Vehicle Routing Problem using PSO (Particle Swarm Optimization)
 
Lecture 5 backpropagation
Lecture 5 backpropagationLecture 5 backpropagation
Lecture 5 backpropagation
 
Intro to Quant Trading Strategies (Lecture 6 of 10)
Intro to Quant Trading Strategies (Lecture 6 of 10)Intro to Quant Trading Strategies (Lecture 6 of 10)
Intro to Quant Trading Strategies (Lecture 6 of 10)
 
Lec05.pptx
Lec05.pptxLec05.pptx
Lec05.pptx
 
Reinforcement Learning and Artificial Neural Nets
Reinforcement Learning and Artificial Neural NetsReinforcement Learning and Artificial Neural Nets
Reinforcement Learning and Artificial Neural Nets
 
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pptx
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pptxvariBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pptx
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pptx
 
Deep RL.pdf
Deep RL.pdfDeep RL.pdf
Deep RL.pdf
 
Learning group em - 20171025 - copy
Learning group   em - 20171025 - copyLearning group   em - 20171025 - copy
Learning group em - 20171025 - copy
 
Machine learning introduction lecture notes
Machine learning introduction lecture notesMachine learning introduction lecture notes
Machine learning introduction lecture notes
 
Digital control systems (dcs) lecture 18-19-20
Digital control systems (dcs) lecture 18-19-20Digital control systems (dcs) lecture 18-19-20
Digital control systems (dcs) lecture 18-19-20
 
MachineLearning_QLearningCircuit
MachineLearning_QLearningCircuitMachineLearning_QLearningCircuit
MachineLearning_QLearningCircuit
 
Arjrandomjjejejj3ejjeejjdjddjjdjdjdjdjdjdjdjdjd
Arjrandomjjejejj3ejjeejjdjddjjdjdjdjdjdjdjdjdjdArjrandomjjejejj3ejjeejjdjddjjdjdjdjdjdjdjdjdjd
Arjrandomjjejejj3ejjeejjdjddjjdjdjdjdjdjdjdjdjd
 
Hidden Markov Model
Hidden Markov ModelHidden Markov Model
Hidden Markov Model
 

Recently uploaded

%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...masabamasaba
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrainmasabamasaba
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park masabamasaba
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyviewmasabamasaba
 
What Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationJuha-Pekka Tolvanen
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in sowetomasabamasaba
 
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benonimasabamasaba
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplatePresentation.STUDIO
 
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open SourceWSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open SourceWSO2
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...Jittipong Loespradit
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...SelfMade bd
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2
 
tonesoftg
tonesoftgtonesoftg
tonesoftglanshi9
 
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...WSO2
 
WSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security ProgramWSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security ProgramWSO2
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension AidPhilip Schwarz
 

Recently uploaded (20)

%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
 
What Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the Situation
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open SourceWSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
tonesoftg
tonesoftgtonesoftg
tonesoftg
 
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
 
WSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security ProgramWSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security Program
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 

Reinforcement Learning basics part1

  • 1. MDP, MC, TD sarsa, q-learning Uijin Jung
  • 2. 𝑉𝜋 𝑠 = 𝐸 𝑟 𝑡+1 + 𝛾 ∙ 𝑉𝜋 𝑆𝑡+1 𝑆𝑡 = 𝑠] Bellman equation 𝑉𝜋 𝑠 : value function 𝑟 𝑡+1 ∶ reward 𝛾 : discount factor
  • 3. S A1 A2 Vπ(s) ↤ s 𝑞 𝜋(𝑠, 𝑎) ↤ 𝑎 Vπ s = 𝑎∈𝐴 𝜋 𝑎 𝑠 𝑞 𝜋(𝑠, 𝑎) Vπ(s′) ↤ s’ 𝑞 𝜋 𝑠, 𝑎 = 𝑅 𝑠 𝑎 + 𝛾 𝑠′∈𝑆 Ρ𝑠𝑠′ 𝑎 𝑉𝜋(𝑠′ ) 𝑆1 ′ 𝑆2 ′ 𝑉𝜋 𝑠 = 𝐸 𝑟 𝑡+1 + 𝛾 ∙ 𝑉𝜋 𝑆𝑡+1 𝑆𝑡 = 𝑠]
  • 4. Vπ s = 𝑎∈𝐴 𝜋 𝑎 𝑠 𝑞 𝜋(𝑠, 𝑎) 𝑞 𝜋 𝑠, 𝑎 = 𝑅 𝑠 𝑎 + 𝛾 𝑠′∈𝑆 Ρ𝑠𝑠′ 𝑎 𝑉𝜋(𝑠′ ) Vπ s = 𝑎∈𝐴 𝜋 𝑎 𝑠 (𝑅 𝑠 𝑎 + 𝛾 𝑠′∈𝑆 Ρ𝑠𝑠′ 𝑎 𝑉𝜋(𝑠′ ))
  • 5. Limitation of Bellman equation • We must know MDP model perfectly.
  • 6. Monte-Carlo • Learn from the episodes • Action selected by policy-> end of episode-> calculate value function based on the episode • If you iterate an episode several time, calculate value function by averaging each rewards took from the state 𝑠𝑡.
  • 7. = 1 𝑡 𝐺𝑡 + 𝑗=1 𝑡−1 𝐺𝑗 𝑗=1 𝑡−1 𝐺𝑗 = (𝑡 − 1)𝑉𝑡−1 = 1 𝑡 𝐺𝑡 + (𝑡 − 1)𝑉𝑡−1 𝑉𝑡 = 1 𝑡 𝑗=1 𝑡 𝐺𝑗 𝑉𝑡−1 = 1 𝑡 − 1 𝑗=1 𝑡−1 𝐺𝑗 (𝑡 − 1)𝑉𝑡−1= 𝑗=1 𝑡−1 𝐺𝑗
  • 8. 𝑉𝑡 = 1 𝑡 𝐺𝑡 + (𝑡 − 1)𝑉𝑡−1 = 1 𝑡 𝐺𝑡 + 𝑉𝑡−1 − 1 𝑡 ∙ 𝑉𝑡−1 = 𝑉𝑡−1 + 1 𝑡 𝐺𝑡 − 𝑉𝑡−1 𝑉𝑡−1 + 1 𝑡 𝐺𝑡 − 1 𝑡 ∙ 𝑉𝑡−1 = 𝑉𝑡−1 + 𝑎 𝐺𝑡 − 𝑉𝑡−1
  • 9. 𝑉𝑡 = 𝑉𝑡−1 + 𝑎 𝐺𝑡 − 𝑉𝑡−1 Reward predicted by previous value function Actual reward Previous value function Value function 𝑉𝜋 𝑠𝑡 ← 𝑉𝜋(𝑠𝑡) + 𝑎 𝐺𝑡 − 𝑉(𝑠𝑡) This is how we update value function Using Monte-Carlo method
  • 10. Limitations of Monte-Carlo • Episode must end before we update the value function. • It is difficult to train if the environment is endless or if the time to finish is too long.
  • 11. Time difference • Let's change the Monte-Carlo method to a real time processing method. 𝐺𝑡 = 𝑅𝑡+1 + 𝛾𝐺𝑡+1 𝑅𝑡+1 + 𝛾𝑉 (𝑆𝑡+1) 𝑉𝑡 = 𝑉𝑡 + 𝑎 𝑅𝑡+1 + 𝛾𝑉 (𝑆𝑡+1) − 𝑉𝑡 𝑉𝜋 𝑠𝑡 ← 𝑉𝜋(𝑠𝑡) + 𝑎 𝐺𝑡 − 𝑉(𝑠𝑡)
  • 12. Time difference • advantage • You can update the value function in real time. • disadvantage • The reference 𝐺𝑡 is not true, it is derived from the expected value : (we call this phenomenon, bootstrap)
  • 13. Sarsa • Algorithm to find optimal q value through TD method 𝑄 𝑆, 𝐴 ← 𝑄 𝑆, 𝐴 + 𝛼(𝑅 + 𝛾𝑄 𝑆′, 𝐴′ − 𝑄 𝑆, 𝐴 ) State, Action, Reward, next State, next Action
  • 14. Sarsa pseudo code (https://stackoverflow.com/questions/32846262/q-learning-vs-sarsa-with-greedy-select)
  • 15. Me exploration predict Sarsa -1 +1 • On-Policy • There is a possibility of being biased.
  • 16. Q-learaning • Algorithm to find optimal q value through TD method using off-policy.
  • 17. Sarsa Q-learning 𝑄 𝑠, 𝑎 ← 𝑄 𝑠, 𝑎 + 𝛼(𝑅 + 𝛾𝑄 𝑠′, 𝑎′ − 𝑄 𝑠, 𝑎 ) 𝑄 𝑠, 𝑎 ← 𝑄(𝑠, 𝑎) + 𝛼(𝑅 + 𝛾 ∙ 𝑚𝑎𝑥 𝑎′ 𝑄 𝑠′ , 𝑎′ − 𝑄 𝑠, 𝑎 )
  • 18. Q-learning pseudo code (https://stackoverflow.com/questions/32846262/q-learning-vs-sarsa-with-greedy-select)