SlideShare ist ein Scribd-Unternehmen logo
1 von 19
Reinforcement Learning
from Explore to Exploit
Task: Learn how to behave successfully to achieve a
goal while interacting with an external environment
Learn via experiences!
Examples
• Game playing: player knows whether it win or lose, but
not know how to move at each step
• Control: a traffic system can measure the delay of cars,
but not know how to decrease it.
Reinforcement Learning
2
RL is learning from interaction
3
Agent acts on its environment, it receives some evaluation of
its action (reinforcement)
The goal of the agent is to learn a policy that maximize its total
(future) reward
St →At →Rt →St+1 →At+1 →Rt+1 →St+2…
At each State S, choose the a action a which
maximizes the function Q(S, a)
Q is the estimated utility function – it tells us how
good an action is given a certain state
Q-Learning Basics
4
所有決策都依據Q-table (best policy),但Q table 要從何得來?
5
If this number > epsilon, then we will do “exploitation” (this means we
use what we already know to select the best action at each step).
Bellman equation (Q-table Update Rule)
6
s0 s2
s1
s3
a
c
b
a
c
d
a
c
f
a
b
d
Q(S0,b) is max
Get maximum Q value for this next
state based on all possible actions.
0 <= 𝛾 <1Q(S,a)=R(S,a)+ 𝛾 max(Q(S’,a’))
a’immediate reward future reward
This is a recursive definition
Discount rate
Q(S,a)=R(S,a)+ 𝛾 max(Q(S’,a’))
𝛾=0.9
Example
7
reward=1
?
8
Initially we explore the environment and update the Q-Table.
When the Q-Table is ready, the agent will start to exploit the
environment and start taking better actions.
This Q-table becomes a reference table for our agent to
select the best action
1. Set current state = initial state.
2. From current state, find the action with the highest Q value.
3. Set current state = next state.
4. Repeat Steps 2 and 3 until current state = goal state.
Algorithm to utilize the Q matrix:
9
根據Q-table 走,
有可能有其他更
好的線但你不知
道, 也就是你可
能未必找到整體
的最佳路線!
10
Bellman equation 是看到當下狀態, 因此未必能能找到最
佳解, 故引入MDP方法以增加隨機性
只看Q-table 的值 (greedy method), 即永遠只做當下最好
的選擇,但未必最好。 故引入某種程度的隨機選擇,則有
機會去嘗試一條沒有走過的路,也許可以找到最佳的路
MDP (Markov Decision Process)
11
Q-Learning algorithm
12
Bellman equation
13
Note: 更新的是落差的部份..
NewQ(s,a)= (1-𝛼) Q(s,a) + 𝛼 {R (s,a)+ 𝛾 max(Q (s’, a’) }
0 <= 𝛾 < 1 Discount rate
0 <= 𝛼 <= 1 Learning rate
Example : Q-Learning By Hand
14http://mnemstudio.org/path-finding-q-learning-tutorial.htm
The outside of the building can be thought of as one big
room (5). Notice that doors 1 and 4 lead into the building
from room 5 (outside).
15
The -1's in the table represent null
values (i.e.; where there isn't a
link between nodes). For example,
State 0 cannot go to State 1.
Taking Action: Exploit or Explore
16
Again, Initial state = 3, randomly select action 1
Update Q table
Q(3,1) =
R(3, 1) + 0.8 * Max(Q(1, 3), Q(1, 5))
= 0 + 0.8 * Max(0, 100) = 80
Initial random state = 1
Update Q table
Q(1, 5) =
R(1, 5) + 0.8 * Max(Q(5, 1), Q(5, 4), Q(5, 5))
= 100 + 0.8 * 0 = 100
1 2
Create a q-table0
17
If our agent learns more through further
episodes, it will finally reach convergence
values in matrix Q like:
This matrix Q, can then be
normalized (i.e.; converted to
percentage) by dividing all non-zero
entries by the highest number (500 in
this case):
實作
18
19
X
y
North
South
EastWest
1
4
2
3

Weitere ähnliche Inhalte

Ähnlich wie Reinforcement Learning

Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement LearningUsman Qayyum
 
Reinforcement learning, Q-Learning
Reinforcement learning, Q-LearningReinforcement learning, Q-Learning
Reinforcement learning, Q-LearningKuppusamy P
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learningElias Hasnat
 
24.09.2021 Reinforcement Learning Algorithms.pptx
24.09.2021 Reinforcement Learning Algorithms.pptx24.09.2021 Reinforcement Learning Algorithms.pptx
24.09.2021 Reinforcement Learning Algorithms.pptxManiMaran230751
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learningDing Li
 
Lecture notes
Lecture notesLecture notes
Lecture notesbutest
 
Intro to Reinforcement Learning
Intro to Reinforcement LearningIntro to Reinforcement Learning
Intro to Reinforcement LearningUtkarsh Garg
 
Demystifying deep reinforement learning
Demystifying deep reinforement learningDemystifying deep reinforement learning
Demystifying deep reinforement learning재연 윤
 
Q-Learning Algorithm: A Concise Introduction [Shakeeb A.]
Q-Learning Algorithm: A Concise Introduction [Shakeeb A.]Q-Learning Algorithm: A Concise Introduction [Shakeeb A.]
Q-Learning Algorithm: A Concise Introduction [Shakeeb A.]Shakeeb Ahmad Mohammad Mukhtar
 
Problem solving agents
Problem solving agentsProblem solving agents
Problem solving agentsMegha Sharma
 
Deep reinforcement learning from scratch
Deep reinforcement learning from scratchDeep reinforcement learning from scratch
Deep reinforcement learning from scratchJie-Han Chen
 
REINFORCEMENT LEARNING
REINFORCEMENT LEARNINGREINFORCEMENT LEARNING
REINFORCEMENT LEARNINGpradiprahul
 

Ähnlich wie Reinforcement Learning (20)

Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement Learning
 
Reinforcement learning, Q-Learning
Reinforcement learning, Q-LearningReinforcement learning, Q-Learning
Reinforcement learning, Q-Learning
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 
24.09.2021 Reinforcement Learning Algorithms.pptx
24.09.2021 Reinforcement Learning Algorithms.pptx24.09.2021 Reinforcement Learning Algorithms.pptx
24.09.2021 Reinforcement Learning Algorithms.pptx
 
Introduction to Deep Reinforcement Learning
Introduction to Deep Reinforcement LearningIntroduction to Deep Reinforcement Learning
Introduction to Deep Reinforcement Learning
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 
Lecture notes
Lecture notesLecture notes
Lecture notes
 
Intro to Reinforcement Learning
Intro to Reinforcement LearningIntro to Reinforcement Learning
Intro to Reinforcement Learning
 
Demystifying deep reinforement learning
Demystifying deep reinforement learningDemystifying deep reinforement learning
Demystifying deep reinforement learning
 
Finalver
FinalverFinalver
Finalver
 
Hill climbing algorithm
Hill climbing algorithmHill climbing algorithm
Hill climbing algorithm
 
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement Learning
 
Q-Learning Algorithm: A Concise Introduction [Shakeeb A.]
Q-Learning Algorithm: A Concise Introduction [Shakeeb A.]Q-Learning Algorithm: A Concise Introduction [Shakeeb A.]
Q-Learning Algorithm: A Concise Introduction [Shakeeb A.]
 
Making Complex Decisions(Artificial Intelligence)
Making Complex Decisions(Artificial Intelligence)Making Complex Decisions(Artificial Intelligence)
Making Complex Decisions(Artificial Intelligence)
 
Problem solving agents
Problem solving agentsProblem solving agents
Problem solving agents
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Deep reinforcement learning from scratch
Deep reinforcement learning from scratchDeep reinforcement learning from scratch
Deep reinforcement learning from scratch
 
CS799_FinalReport
CS799_FinalReportCS799_FinalReport
CS799_FinalReport
 
Reinforcement Learning - DQN
Reinforcement Learning - DQNReinforcement Learning - DQN
Reinforcement Learning - DQN
 
REINFORCEMENT LEARNING
REINFORCEMENT LEARNINGREINFORCEMENT LEARNING
REINFORCEMENT LEARNING
 

Mehr von 艾鍗科技

TinyML - 4 speech recognition
TinyML - 4 speech recognition TinyML - 4 speech recognition
TinyML - 4 speech recognition 艾鍗科技
 
Appendix 1 Goolge colab
Appendix 1 Goolge colabAppendix 1 Goolge colab
Appendix 1 Goolge colab艾鍗科技
 
Project-IOT於餐館系統的應用
Project-IOT於餐館系統的應用Project-IOT於餐館系統的應用
Project-IOT於餐館系統的應用艾鍗科技
 
02 IoT implementation
02 IoT implementation02 IoT implementation
02 IoT implementation艾鍗科技
 
Tiny ML for spark Fun Edge
Tiny ML for spark Fun EdgeTiny ML for spark Fun Edge
Tiny ML for spark Fun Edge艾鍗科技
 
2. 機器學習簡介
2. 機器學習簡介2. 機器學習簡介
2. 機器學習簡介艾鍗科技
 
5.MLP(Multi-Layer Perceptron)
5.MLP(Multi-Layer Perceptron) 5.MLP(Multi-Layer Perceptron)
5.MLP(Multi-Layer Perceptron) 艾鍗科技
 
心率血氧檢測與運動促進
心率血氧檢測與運動促進心率血氧檢測與運動促進
心率血氧檢測與運動促進艾鍗科技
 
利用音樂&情境燈幫助放鬆
利用音樂&情境燈幫助放鬆利用音樂&情境燈幫助放鬆
利用音樂&情境燈幫助放鬆艾鍗科技
 
IoT感測器驅動程式 在樹莓派上實作
IoT感測器驅動程式在樹莓派上實作IoT感測器驅動程式在樹莓派上實作
IoT感測器驅動程式 在樹莓派上實作艾鍗科技
 
無線聲控遙控車
無線聲控遙控車無線聲控遙控車
無線聲控遙控車艾鍗科技
 
最佳光源的研究和實作
最佳光源的研究和實作最佳光源的研究和實作
最佳光源的研究和實作 艾鍗科技
 
無線監控網路攝影機與控制自走車
無線監控網路攝影機與控制自走車無線監控網路攝影機與控制自走車
無線監控網路攝影機與控制自走車 艾鍗科技
 
人臉辨識考勤系統
人臉辨識考勤系統人臉辨識考勤系統
人臉辨識考勤系統艾鍗科技
 
智慧家庭Smart Home
智慧家庭Smart Home智慧家庭Smart Home
智慧家庭Smart Home艾鍗科技
 

Mehr von 艾鍗科技 (20)

TinyML - 4 speech recognition
TinyML - 4 speech recognition TinyML - 4 speech recognition
TinyML - 4 speech recognition
 
Appendix 1 Goolge colab
Appendix 1 Goolge colabAppendix 1 Goolge colab
Appendix 1 Goolge colab
 
Project-IOT於餐館系統的應用
Project-IOT於餐館系統的應用Project-IOT於餐館系統的應用
Project-IOT於餐館系統的應用
 
02 IoT implementation
02 IoT implementation02 IoT implementation
02 IoT implementation
 
Tiny ML for spark Fun Edge
Tiny ML for spark Fun EdgeTiny ML for spark Fun Edge
Tiny ML for spark Fun Edge
 
Openvino ncs2
Openvino ncs2Openvino ncs2
Openvino ncs2
 
Step motor
Step motorStep motor
Step motor
 
2. 機器學習簡介
2. 機器學習簡介2. 機器學習簡介
2. 機器學習簡介
 
5.MLP(Multi-Layer Perceptron)
5.MLP(Multi-Layer Perceptron) 5.MLP(Multi-Layer Perceptron)
5.MLP(Multi-Layer Perceptron)
 
3. data features
3. data features3. data features
3. data features
 
心率血氧檢測與運動促進
心率血氧檢測與運動促進心率血氧檢測與運動促進
心率血氧檢測與運動促進
 
利用音樂&情境燈幫助放鬆
利用音樂&情境燈幫助放鬆利用音樂&情境燈幫助放鬆
利用音樂&情境燈幫助放鬆
 
IoT感測器驅動程式 在樹莓派上實作
IoT感測器驅動程式在樹莓派上實作IoT感測器驅動程式在樹莓派上實作
IoT感測器驅動程式 在樹莓派上實作
 
無線聲控遙控車
無線聲控遙控車無線聲控遙控車
無線聲控遙控車
 
最佳光源的研究和實作
最佳光源的研究和實作最佳光源的研究和實作
最佳光源的研究和實作
 
無線監控網路攝影機與控制自走車
無線監控網路攝影機與控制自走車無線監控網路攝影機與控制自走車
無線監控網路攝影機與控制自走車
 
Linux Device Tree
Linux Device TreeLinux Device Tree
Linux Device Tree
 
人臉辨識考勤系統
人臉辨識考勤系統人臉辨識考勤系統
人臉辨識考勤系統
 
智慧家庭Smart Home
智慧家庭Smart Home智慧家庭Smart Home
智慧家庭Smart Home
 
智能健身
智能健身智能健身
智能健身
 

Kürzlich hochgeladen

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 

Kürzlich hochgeladen (20)

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 

Reinforcement Learning

  • 2. Task: Learn how to behave successfully to achieve a goal while interacting with an external environment Learn via experiences! Examples • Game playing: player knows whether it win or lose, but not know how to move at each step • Control: a traffic system can measure the delay of cars, but not know how to decrease it. Reinforcement Learning 2
  • 3. RL is learning from interaction 3 Agent acts on its environment, it receives some evaluation of its action (reinforcement) The goal of the agent is to learn a policy that maximize its total (future) reward St →At →Rt →St+1 →At+1 →Rt+1 →St+2…
  • 4. At each State S, choose the a action a which maximizes the function Q(S, a) Q is the estimated utility function – it tells us how good an action is given a certain state Q-Learning Basics 4 所有決策都依據Q-table (best policy),但Q table 要從何得來?
  • 5. 5 If this number > epsilon, then we will do “exploitation” (this means we use what we already know to select the best action at each step).
  • 6. Bellman equation (Q-table Update Rule) 6 s0 s2 s1 s3 a c b a c d a c f a b d Q(S0,b) is max Get maximum Q value for this next state based on all possible actions. 0 <= 𝛾 <1Q(S,a)=R(S,a)+ 𝛾 max(Q(S’,a’)) a’immediate reward future reward This is a recursive definition Discount rate
  • 8. 8 Initially we explore the environment and update the Q-Table. When the Q-Table is ready, the agent will start to exploit the environment and start taking better actions.
  • 9. This Q-table becomes a reference table for our agent to select the best action 1. Set current state = initial state. 2. From current state, find the action with the highest Q value. 3. Set current state = next state. 4. Repeat Steps 2 and 3 until current state = goal state. Algorithm to utilize the Q matrix: 9
  • 11. Bellman equation 是看到當下狀態, 因此未必能能找到最 佳解, 故引入MDP方法以增加隨機性 只看Q-table 的值 (greedy method), 即永遠只做當下最好 的選擇,但未必最好。 故引入某種程度的隨機選擇,則有 機會去嘗試一條沒有走過的路,也許可以找到最佳的路 MDP (Markov Decision Process) 11
  • 13. Bellman equation 13 Note: 更新的是落差的部份.. NewQ(s,a)= (1-𝛼) Q(s,a) + 𝛼 {R (s,a)+ 𝛾 max(Q (s’, a’) } 0 <= 𝛾 < 1 Discount rate 0 <= 𝛼 <= 1 Learning rate
  • 14. Example : Q-Learning By Hand 14http://mnemstudio.org/path-finding-q-learning-tutorial.htm The outside of the building can be thought of as one big room (5). Notice that doors 1 and 4 lead into the building from room 5 (outside).
  • 15. 15 The -1's in the table represent null values (i.e.; where there isn't a link between nodes). For example, State 0 cannot go to State 1.
  • 16. Taking Action: Exploit or Explore 16 Again, Initial state = 3, randomly select action 1 Update Q table Q(3,1) = R(3, 1) + 0.8 * Max(Q(1, 3), Q(1, 5)) = 0 + 0.8 * Max(0, 100) = 80 Initial random state = 1 Update Q table Q(1, 5) = R(1, 5) + 0.8 * Max(Q(5, 1), Q(5, 4), Q(5, 5)) = 100 + 0.8 * 0 = 100 1 2 Create a q-table0
  • 17. 17 If our agent learns more through further episodes, it will finally reach convergence values in matrix Q like: This matrix Q, can then be normalized (i.e.; converted to percentage) by dividing all non-zero entries by the highest number (500 in this case):