Challenges in Reinforcement Learning
Elias Hasnat
Definition of Reinforcement Learning
We can place reinforcement learning between supervised and unsupervised learning:
1. In supervised learning there is a target label for each training example.
2. In unsupervised learning there are no labels at all.
3. In reinforcement learning there are sparse, time-delayed labels: the rewards.
Based only on those rewards, the agent has to learn to behave in the environment.
Steps to consider in RL problem solving:
1. the credit assignment problem, and
2. the exploration-exploitation dilemma.
The credit-assignment step is used for defining the strategy (deciding which of the earlier actions was responsible for a reward); the exploration-exploitation step is used for improving the reward outcomes.
Credit Assignment
Mathematical relation formalization
● An agent is situated in an environment.
● The environment is in a certain state.
● The agent can perform certain actions in the environment.
● These actions sometimes result in a reward.
● Actions transform the environment and lead to a new state, where the
agent can perform another action, and so on.
● The rules for how the agent chooses those actions are called the policy.
● The environment in general is stochastic, which means the next state may
be somewhat random.
● The set of states and actions, together with rules for transitioning from one
state to another, make up a Markov decision process.
● One episode of this process forms a finite sequence of states, actions and rewards, as in the sketch below.
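The loop below sketches this formalization as code: a hypothetical environment object with Gym-style reset() and step() methods, driven by a policy until one episode of states, actions and rewards is complete. The env and policy interfaces are assumptions made here for illustration, not part of the slides.

```python
def run_episode(env, policy):
    """Run one episode of the MDP: a finite sequence of states, actions and rewards.

    `env` is assumed to expose reset() -> state and step(action) -> (state, reward, done);
    `policy` maps a state to an action. Both interfaces are illustrative assumptions.
    """
    episode = []
    state = env.reset()                            # the environment starts in a certain state
    done = False
    while not done:
        action = policy(state)                     # the policy chooses an action
        next_state, reward, done = env.step(action)  # the action may yield a reward and a new state
        episode.append((state, action, reward))
        state = next_state
    return episode
```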
Reward Strategy
To perform well in the long term, we need to take into account not only the immediate rewards but also the future rewards we are going to get.
Total reward function: R = r_1 + r_2 + ... + r_n
Total reward function from time point t: R_t = r_t + r_{t+1} + ... + r_n
But because the environment is stochastic, we can never be sure of getting the same rewards in the next run, which is why we consider the
discounted future reward: R_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... + γ^(n−t) r_n
γ is the discount factor between 0 and 1; the further a reward lies in the future, the smaller its weight γ^k becomes.
Discounted future reward calculation
The discounted future reward at time step t can be expressed in terms of the same quantity at time step t+1:
R_t = r_t + γ R_{t+1}
If we set the discount factor to γ=0, our strategy is short-sighted and relies only on immediate rewards.
If we want to balance immediate and future rewards, we should set the discount factor to something like γ=0.9.
If our environment is deterministic and the same actions always result in the same rewards, we can set the discount factor to γ=1.
A good strategy for an agent would be to always choose an action that maximizes the (discounted) future reward.
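As a quick illustration of the recursion R_t = r_t + γ R_{t+1}, the helper below computes the discounted return for every time step of an episode (plain Python written for this summary, not taken from the slides):

```python
def discounted_returns(rewards, gamma=0.9):
    """Compute R_t = r_t + γ·R_{t+1} for each time step, working backwards."""
    returns = [0.0] * len(rewards)
    future = 0.0
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future   # R_t = r_t + γ R_{t+1}
        returns[t] = future
    return returns

# Example: rewards [1, 1, 1] with γ = 0.9 give returns [2.71, 1.9, 1.0].
```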
The Q-function represents the quality of an action
The Q-learning function Q(s, a) represents the maximum discounted future reward when we perform action a in state s and continue optimally from that point on:
Q(s, a) = max R_{t+1}
Suppose you are in state s and deciding whether to take action a or b. You want to select the action that results in the highest score at the end of the game. π represents the policy:
π(s) = argmax_a Q(s, a)
Now let's focus on just one transition <s, a, r, s'>. We can express the Q-value of state s and action a in terms of the Q-value of the next state s':
Q(s, a) = r + γ max_{a'} Q(s', a')
This is called the Bellman equation.
Q-Learning Formulation
Q[s, a] ← Q[s, a] + α (r + γ max_{a'} Q[s', a'] − Q[s, a])
α in the algorithm is a learning rate that controls how much of the difference between the previous Q-value and the newly proposed Q-value is taken into account.
In particular, when α=1 the two Q[s, a] terms cancel and the update is exactly the Bellman equation.
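A minimal sketch of this tabular update, assuming a NumPy array Q of shape [n_states, n_actions] and integer state/action indices (the table layout is an assumption for illustration):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: move Q[s, a] toward the Bellman target."""
    target = r + gamma * np.max(Q[s_next])     # r + γ max_a' Q[s', a']
    Q[s, a] += alpha * (target - Q[s, a])      # with α = 1 this reduces to the Bellman equation
    return Q
```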
Deep Q Network
Neural networks are exceptionally good at coming up
with good features for highly structured data.
We can represent our Q-function with a neural network in one of two ways.
Use-case 1:
take the state and action as input and output the corresponding Q-value.
Use-case 2:
take only the state as input and output a Q-value for each possible action.
* The benefit of the latter use-case over the former is that when we want to perform a Q-value update or pick the action with the highest Q-value, we only need one forward pass through the network to have all Q-values for all actions available immediately.
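A sketch of use-case 2 as a small network. PyTorch and the layer sizes are assumptions made here for illustration; the slides do not prescribe a framework or architecture.

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Use-case 2: the network takes a state and outputs one Q-value per action."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),   # one output unit per possible action
        )

    def forward(self, state):
        return self.layers(state)       # one forward pass yields all Q-values at once
```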
Rewrite the Quality function using Q Network
Given a transition <s, a, r, s'>, the Q-table update rule in the previous algorithm must be replaced with the following steps (sketched in code after the list):
1. Do a feedforward pass for the current state s to get predicted Q-values for all
actions.
2. Do a feedforward pass for the next state s' and take the maximum over all network outputs, max_{a'} Q(s', a').
3. Set the Q-value target for the taken action to r + γ max_{a'} Q(s', a') (the max calculated in step 2). For all other actions, set the Q-value target to the value originally returned in step 1, making the error 0 for those outputs.
4. Update the weights using backpropagation.
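A hedged sketch of steps 1-4 for a mini-batch of transitions, reusing the QNetwork above. The mean-squared-error loss and the omission of terminal-state handling are simplifications assumed here, not taken from the slides.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, optimizer, s, a, r, s_next, gamma=0.99):
    """One Q-network update for a batch of transitions <s, a, r, s'>."""
    q_pred = q_net(s)                                         # 1. Q-values for all actions in s
    with torch.no_grad():
        max_next_q = q_net(s_next).max(dim=1).values          # 2. max_a' Q(s', a')
    target = q_pred.detach().clone()
    target[torch.arange(len(a)), a] = r + gamma * max_next_q  # 3. target only for the taken action
    loss = F.mse_loss(q_pred, target)                         #    other outputs keep their predictions -> error 0
    optimizer.zero_grad()
    loss.backward()                                           # 4. update the weights by backpropagation
    optimizer.step()
    return loss.item()
```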
Experience Replay (Memory)
We estimate the future reward in each state using Q-learning and approximate the Q-function with a neural network.
Challenge:
The approximation of Q-values using non-linear functions is not very stable, and training can easily diverge.
Solution: memorize the experience.
During gameplay all the experiences < s, a, r, s’ > are stored in a replay memory.
When training the network, random mini-batches from the replay memory are used instead of the most recent transition. This breaks the similarity of subsequent training samples, which otherwise might drive the network into a local minimum. Experience replay also makes the training task more similar to usual supervised learning, which simplifies debugging and testing of the algorithm.
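A minimal replay memory along these lines (the capacity and batch size are arbitrary illustrative values):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of <s, a, r, s'> transitions, sampled as random mini-batches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped automatically

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=32):
        # Random sampling breaks the similarity of subsequent training samples.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```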
Exploration and Exploitation
First, observe that when a Q-table or Q-network is initialized randomly, its predictions are initially random as well. If we pick the action with the highest Q-value, that action will be random and the agent performs crude "exploration". As the Q-function converges, it returns more consistent Q-values and the amount of exploration decreases. So one could say that Q-learning incorporates exploration as part of the algorithm. But this exploration is "greedy": it settles on the first effective strategy it finds.
A simple and effective fix for this problem is ε-greedy exploration: with probability ε choose a random action, otherwise go with the "greedy" action that has the highest Q-value. In practice ε is decreased over time from 1 to 0.1 – in the beginning the system makes completely random moves to explore the state space maximally, and then it settles down to a fixed exploration rate.
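A small sketch of ε-greedy action selection with the decay described above. The linear schedule and the step count are assumptions; the slides only state that ε goes from 1 to 0.1.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability ε pick a random action, otherwise the action with the highest Q-value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                     # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])    # exploit

def decayed_epsilon(step, eps_start=1.0, eps_end=0.1, decay_steps=100_000):
    """Anneal ε linearly from 1.0 down to 0.1 over the first `decay_steps` steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```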
Deep Q-Learning
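Putting the pieces together, a minimal sketch of a deep Q-learning loop that combines the Q-network, experience replay and ε-greedy exploration sketched above. The Gym-style env interface and all hyperparameters are illustrative assumptions.

```python
import torch

def train(env, q_net, optimizer, episodes=500, gamma=0.99, batch_size=32):
    memory = ReplayMemory()
    step = 0
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            with torch.no_grad():
                q_values = q_net(torch.as_tensor(state, dtype=torch.float32))
            action = epsilon_greedy(q_values.tolist(), decayed_epsilon(step))
            next_state, reward, done = env.step(action)
            memory.store(state, action, reward, next_state)        # keep the experience
            if len(memory) >= batch_size:
                s, a, r, s_next = zip(*memory.sample(batch_size))  # random mini-batch, not the latest transition
                dqn_update(q_net, optimizer,
                           torch.as_tensor(s, dtype=torch.float32),
                           torch.as_tensor(a, dtype=torch.long),
                           torch.as_tensor(r, dtype=torch.float32),
                           torch.as_tensor(s_next, dtype=torch.float32),
                           gamma)
            state = next_state
            step += 1
```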
Thank You