Introduction to Deep Reinforcement Learning
Khaled Saleh
PhD Researcher at IISRI / Deakin University, Australia
Agenda
• Motivation
• What is Reinforcement Learning (RL)?
• Characteristics of RL
• Formulation of the RL Problem
• Different Components of RL
• Taxonomy of Algorithms for Solving RL
• Q-Learning
• Deep Q Network (DQN)
• Policy Gradient Methods
• Inverse RL
• Deep RL/IRL Potential Applications
Motivation
Video credit: Ng et al. NIPS 2007; Google DeepMind 2015
What is Reinforcement Learning (RL)?
Image credit: Sutton and Barto (1998)
Characteristics of RL
• In comparison to other machine learning paradigms, the following are what make RL different:
• No supervision is needed, only a reward signal
• Feedback is delayed, not instantaneous
• Sequential decision making
Formulation of RL
• The most common way to formulate the RL problem is as a Markov Decision Process (MDP)
• One episode of this process forms a finite sequence of states, actions and rewards:
• $s_0, a_0, r_1, s_1, a_1, r_2, s_2, \ldots, s_{n-1}, a_{n-1}, r_n, s_n$
Image credits: Wikipedia; Sutton and Barto (1998)
Formulation of RL
• A good policy needs to take into account not only the immediate rewards, but also the future rewards we are going to get.
• Thus, the ultimate goal of an RL agent is to select actions that maximize the total future reward.
• Given one run of the Markov decision process, we can easily calculate the total reward for one episode from time step t onward as follows:
• $R_t = r_t + r_{t+1} + r_{t+2} + \cdots + r_n$
• Due to the inherent uncertainty in the environment, we usually use the discounted future reward instead:
• $R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{n-t} r_n = r_t + \gamma R_{t+1}$
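To make the recursion concrete, here is a minimal Python sketch (an added illustration, not part of the original slides) that computes the discounted return for one episode via $R_t = r_t + \gamma R_{t+1}$; the reward list and $\gamma = 0.9$ are arbitrary example values.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return [R_0, R_1, ..., R_{n-1}] for a list of per-step rewards."""
    returns = [0.0] * len(rewards)
    future = 0.0
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future   # R_t = r_t + gamma * R_{t+1}
        returns[t] = future
    return returns

print(discounted_returns([0, 0, 1, 0, 5], gamma=0.9))
```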
Components of RL
• An RL agent may include one or more of these components:
• Policy: the agent's behaviour function $a = \pi(s)$
• Value function: a prediction of future reward - how good each state and/or action is
• Model: the agent's representation of the environment; given state $s$ and action $a$, the model gives us both the reward for that state and action and the probability of the next state $s'$
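As a purely illustrative sketch (all names and the placeholder behaviour below are assumptions, not from the slides), the three components can be pictured as simple Python callables:

```python
import random

ACTIONS = ["N", "E", "S", "W"]

def policy(state):
    """Policy: the agent's behaviour function a = pi(s)."""
    return random.choice(ACTIONS)                 # placeholder behaviour to be learned

value = {}                                        # Value function: state -> predicted future reward

def model(state, action):
    """Model: the agent's own estimate of the environment's response."""
    predicted_reward = -1.0                       # e.g. -1 per time-step, as in the maze example
    next_state_probs = {state: 1.0}               # hypothetical transition distribution
    return predicted_reward, next_state_probs
```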
Components of RL: Policy
Example adapted from: http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html
• Given the maze example on the slide, the policy is shown as an arrow $\pi(s)$ for each state $s$.
Components of RL: Value Function
• Used to evaluate the goodness/badness of states
• And therefore to select between actions:
• $Q^{\pi}(s, a) = \max_{\pi} R_{t+1}$
Taxonomy of Algorithms for Solving RL
• Model Free
• Policy and/or Value Function
• Model Based
• Model + Policy and/or Value Function
• Approximated Learned Model + Policy and/or Value Function
Q-Learning
• Q-learning is a model-free paradigm for learning the value function of the RL problem.
• In Q-learning, we define a function $Q(s, a)$ representing the discounted future reward when we perform action $a$ in state $s$ and continue optimally from that point on.
• $Q(s_t, a_t) = \max_{\pi} R_{t+1}$
• Once we have the Q-function, the question of which action to choose in a given state $s$ can be broken down into:
• $\pi(s) = \operatorname{argmax}_a Q(s, a)$
Q-Learning (2)
• To obtain the Q-function, we will focus on just one transition $\langle s, a, r, s' \rangle$.
• Recall that
$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{n-t} r_n = r_t + \gamma R_{t+1}$
• Similarly, we can represent the Q-value of state $s$ and action $a$ in terms of the Q-value of the next state $s'$:
$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$  (the Bellman equation)
Q-Learning (3)
Algorithm adapted from: http://artint.info/html/ArtInt_265.html
• We can then iteratively approximate the Q-function using the Bellman equation, updating $Q[s,a] \leftarrow Q[s,a] + \alpha\,(r + \gamma \max_{a'} Q[s',a'] - Q[s,a])$, where $\alpha$ is the learning rate.
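A minimal tabular sketch of this update is shown below (an added illustration: the action set, learning rate, and example transition are assumptions, not values from the slides):

```python
from collections import defaultdict

Q = defaultdict(float)          # Q-table keyed by (state, action), default 0.0
alpha, gamma = 0.1, 0.9         # learning rate and discount factor
ACTIONS = [0, 1]                # hypothetical action set

def q_update(s, a, r, s_next):
    """One Q-learning update for a single observed transition <s, a, r, s'>."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def greedy_action(s):
    """pi(s) = argmax_a Q(s, a), once the Q-table has been learned."""
    return max(ACTIONS, key=lambda a: Q[(s, a)])

# Example: apply the update to one transition and query the greedy policy.
q_update(s=0, a=1, r=1.0, s_next=2)
print(Q[(0, 1)], greedy_action(0))
```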
Deep Q-Networks
• The Q-function could be represented with a neural network that takes the state and action as input and outputs the corresponding Q-value.
• Alternatively, we could take only game screens as input and output the Q-value for each possible action.
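A rough PyTorch sketch of the second variant (game screens in, one Q-value per action out) is given below; the layer sizes follow the commonly cited Mnih et al. (2015) setup, but treat the details here as illustrative assumptions rather than the exact published network.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, n_actions, in_frames=4):
        super().__init__()
        self.features = nn.Sequential(                 # convolutional feature extractor
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(                     # fully connected layers -> Q-values
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),                 # one output per possible action
        )

    def forward(self, screens):                        # screens: (batch, 4, 84, 84)
        return self.head(self.features(screens))

q_net = QNetwork(n_actions=4)
q_values = q_net(torch.zeros(1, 4, 84, 84))            # -> tensor of shape (1, 4)
```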
DQN: Atari
Image credit: Mnih et al., Nature 2015
DQN: Training
• Given a transition $\langle s, a, r, s' \rangle$ and the loss function $L = \frac{1}{2}\big[\,r + \gamma \max_{a'} Q(s', a') - Q(s, a)\,\big]^2$ (the first term inside the brackets is the target, $Q(s, a)$ is the prediction):
1. Do a feedforward pass for the current state $s$ to get predicted Q-values for all actions.
2. Do a feedforward pass for the next state $s'$ and calculate the maximum over all network outputs, $\max_{a'} Q(s', a')$.
3. Set the Q-value target for action $a$ to $r + \gamma \max_{a'} Q(s', a')$ (using the max calculated in step 2). For all other actions, set the Q-value target to the same value as returned in step 1, making the error 0 for those outputs.
4. Update the weights using backpropagation.
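The four steps above could look roughly like the following PyTorch sketch (an added illustration reusing the `QNetwork` sketch from earlier; the batched tensor shapes, the optimizer choice, and the omission of terminal-state handling are all simplifying assumptions):

```python
import torch
import torch.nn.functional as F

def dqn_training_step(q_net, optimizer, s, a, r, s_next, gamma=0.99):
    """s, s_next: (batch, 4, 84, 84) screens; a: (batch,) long; r: (batch,) float."""
    q_pred = q_net(s)                                            # step 1: Q-values for all actions in s
    with torch.no_grad():
        q_next_max = q_net(s_next).max(dim=1).values             # step 2: max_a' Q(s', a')
        target = q_pred.detach().clone()                         # step 3: copy the predictions ...
        target[torch.arange(len(a)), a] = r + gamma * q_next_max # ... overwrite only the taken action
    loss = 0.5 * F.mse_loss(q_pred, target)                      # error is zero for every other action
    optimizer.zero_grad()
    loss.backward()                                              # step 4: backpropagation
    optimizer.step()
    return loss.item()
```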
DQN: Experience Replay
• One of the engineering tricks that made the training of DQN much more stable.
• During gameplay, all the experiences $\langle s, a, r, s' \rangle$ are stored in a replay memory.
• When training the network, random samples from the replay memory are used instead of the most recent transition:
1. This breaks the similarity of subsequent training samples, which otherwise might drive the network into a local minimum.
2. It makes the training task more similar to usual supervised learning, which simplifies debugging and testing the algorithm.
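A minimal replay memory could be sketched as below (the capacity and the `push`/`sample` API are assumptions for illustration, not DeepMind's actual implementation):

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)            # old experiences are dropped when full

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))           # store one transition <s, a, r, s'>

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)   # random minibatch, not the latest transitions

    def __len__(self):
        return len(self.buffer)
```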
DQN: ε-greedy exploration
• When the Q-network is initialized randomly, its predictions are initially random as well.
• If we pick the action with the highest Q-value, the action will be random and the agent performs crude "exploration".
• As the Q-function converges, it returns more consistent Q-values and the amount of exploration decreases.
• Another engineering trick is ε-greedy exploration: with probability ε choose a random action, otherwise go with the "greedy" action with the highest Q-value.
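A small sketch of ε-greedy selection with a decaying ε is shown below; the 1.0 to 0.1 schedule mirrors the DeepMind setting mentioned in the notes, while the linear schedule and decay length are assumptions.

```python
import random

def epsilon_by_step(step, eps_start=1.0, eps_end=0.1, decay_steps=1_000_000):
    """Linearly anneal epsilon from eps_start to eps_end over decay_steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(q_values, step):
    """With probability epsilon act randomly, otherwise take the greedy action."""
    if random.random() < epsilon_by_step(step):
        return random.randrange(len(q_values))                          # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])         # exploit: argmax_a Q(s, a)
```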
DQN: Algorithm
Algorithm adapted from: http://artint.info/html/ArtInt_265.html
• The slide shows the full pseudocode, annotating where experience replay and ε-greedy exploration enter the training loop (see the sketch below).
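The sketch below ties the pieces together in a self-contained toy loop. It is tabular Q-learning on a hypothetical 5-state chain, not the actual DQN pseudocode from the slide: the environment, hyperparameters, and episode count are made up for illustration, and swapping the Q-table for a Q-network (plus the training step above) would give the DQN variant.

```python
import random
from collections import defaultdict, deque

N_STATES, ACTIONS = 5, [0, 1]                      # toy chain: 0 = move left, 1 = move right
Q = defaultdict(float)                             # tabular stand-in for the Q-network
memory = deque(maxlen=10_000)                      # experience replay memory
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def step_env(s, a):
    """Hypothetical dynamics: reaching the right end gives reward 1 and ends the episode."""
    s_next = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    done = (s_next == N_STATES - 1)
    return s_next, (1.0 if done else 0.0), done

def greedy(s):
    return max(ACTIONS, key=lambda x: (Q[(s, x)], random.random()))    # random tie-break

for episode in range(200):
    s = 0
    for t in range(100):                                               # cap episode length
        a = random.choice(ACTIONS) if random.random() < epsilon else greedy(s)   # epsilon-greedy
        s_next, r, done = step_env(s, a)
        memory.append((s, a, r, s_next, done))                         # store transition
        if len(memory) >= 32:
            for ms, ma, mr, ms2, md in random.sample(list(memory), 32):   # random minibatch
                target = mr if md else mr + gamma * max(Q[(ms2, x)] for x in ACTIONS)
                Q[(ms, ma)] += alpha * (target - Q[(ms, ma)])
        s = s_next
        if done:
            break

print([greedy(st) for st in range(N_STATES)])       # learned greedy action per state
```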
Policy Gradient Methods
• Another common paradigm for solving the RL problem is to learn the policy directly.
• Learning the policy directly can be much more efficient in the case of continuous action spaces (human locomotion, etc.).
• One of the key families of methods in this paradigm is policy gradient methods (gradient descent, conjugate gradient, quasi-Newton).
• The formulation is as follows: let $J(\theta)$ be any policy objective function.
• Policy gradient methods search for a local maximum of $J(\theta)$ by ascending the gradient of the policy w.r.t. the parameters $\theta$:
$\Delta\theta = \alpha \nabla_{\theta} J(\theta)$, where $\nabla_{\theta} J(\theta)$ is the policy gradient.
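As one concrete instance of this idea, the sketch below performs a REINFORCE-style gradient ascent step in PyTorch, estimating $\nabla_\theta J(\theta)$ from one sampled episode as $\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R_t$; the network sizes and learning rate are assumptions, and REINFORCE is only one member of the policy gradient family mentioned on the slide.

```python
import torch
import torch.nn as nn

policy_net = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))  # assumed 4-dim state, 2 actions
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-2)

def reinforce_update(states, actions, returns):
    """states: (T, 4) float, actions: (T,) long, returns: (T,) discounted returns R_t."""
    log_probs = torch.log_softmax(policy_net(states), dim=1)
    chosen = log_probs[torch.arange(len(actions)), actions]   # log pi(a_t | s_t)
    loss = -(chosen * returns).mean()                          # minimizing -J ascends J
    optimizer.zero_grad()
    loss.backward()                                            # Delta theta = alpha * grad_theta J(theta)
    optimizer.step()
```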
Policy Gradient Methods
Heess, Nicolas, et al. "Emergence of locomotion behaviours in rich environments." arXiv preprint arXiv:1707.02286 (2017).
Inverse RL
Adapted from CS 294: Deep Reinforcement Learning, UC Berkeley, Fall 2017
Inverse RL
• In most real-world applications, the notion of reward is not obvious or is really hard to specify.
• In the IRL problem, we try to learn the reward (and the transition model as well) from expert or human demonstrations.
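As a very rough sketch of one way this can be set up (an added illustration in the spirit of apprenticeship learning / feature matching, not the method from the cited works), assume a linear reward $r(s) = w^\top \phi(s)$ and nudge the weights so that the learner's expected feature counts move toward the expert's; re-solving the RL problem for the current reward at each iteration is only stubbed out here.

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma=0.9):
    """Average discounted feature counts over a set of state trajectories."""
    mu = np.zeros_like(phi(trajectories[0][0]), dtype=float)
    for traj in trajectories:
        for t, s in enumerate(traj):
            mu += (gamma ** t) * phi(s)
    return mu / len(trajectories)

def irl_step(w, expert_trajs, policy_trajs, phi, lr=0.05):
    """One gradient step: push w toward features the expert visits more often."""
    mu_expert = feature_expectations(expert_trajs, phi)
    mu_policy = feature_expectations(policy_trajs, phi)   # trajectories from the current policy
    return w + lr * (mu_expert - mu_policy)
```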
Inverse RL: Autonomous Driving
Image credit: Wulfmeier et al. IROS 2016
(Slide figure labels: Reward, Features)
Inverse RL: Intent Prediction
Image credit: KITTI Dataset (slide figure: a pedestrian highlighted for intent prediction)
Deep RL/IRL Potential Applications
• Autonomous Navigation
• Semantic Segmentation
• Recommendation Systems
• Chatbots
• Inventory Management
• Power Systems
• Financial investment decisions*
• Medical Sector (Dynamic treatment regime)
* http://pit.ai/
Further Educational Resources
• Reinforcement Learning: An Introduction (Sutton and Barto’s
Book, 2nd Edition)
• David Silver's Reinforcement Learning Course (UCL, 2015)
• CS 294: Deep Reinforcement Learning, Fall 2017
• Deep RL Bootcamp, Summer 2017
DeepMind AlphaGo
Image and video credit: Google Brain & DeepMind
References
1. Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. Vol. 1. No. 1. Cambridge: MIT press, 1998.
2. Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
3. Abbeel, Pieter, and Andrew Y. Ng. "Apprenticeship learning via inverse reinforcement learning." Proceedings of the twenty-first
international conference on Machine learning. ACM, 2004.
4. Cassandra, Anthony Rocco. "Exact and approximate algorithms for partially observable Markov decision processes." (1998).
5. Heess, Nicolas, et al. "Emergence of Locomotion Behaviours in Rich Environments." arXiv preprint arXiv:1707.02286 (2017).
6. Heess, Nicolas, et al. "Learning and Transfer of Modulated Locomotor Controllers." arXiv preprint arXiv:1610.05182 (2016).
7. Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, and Pieter Abbeel. Deep Spatial Autoencoders for Visuomotor
Learning. In ICRA, 2016.
8. Jakob N Foerster, Yannis M Assael, Nando de Freitas, and Shimon Whiteson. Learning to Communicate to Solve Riddles with Deep
Distributed Recurrent QNetworks. arXiv:1602.02672, 2016.
9. Sham M Kakade. A Natural Policy Gradient. In NIPS, 2002
10. Nate Kohl and Peter Stone. Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion. In ICRA, volume 3, 2004
11. Sascha Lange, Martin Riedmiller, and Arne Voigtlander. Autonomous Reinforcement Learning on Raw Visual Input Data in a Real World
Application. In IJCNN, 2012.
12. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep Learning. Nature, 521 (7553):436–444, 2015.
13. Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end Training of Deep Visuomotor Policies. JMLR, 17(39):1–40,
2016
14. Xiujun Li, Lihong Li, Jianfeng Gao, Xiaodong He, Jianshu Chen, Li Deng, and Ji He. Recurrent Reinforcement Learning: A Hybrid
Approach. arXiv:1509.03044,
15. Wulfmeier, Markus, Dominic Zeng Wang, and Ingmar Posner. "Watch this: Scalable cost-function learning for path planning in urban
environments." Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on. IEEE, 2016.
Thank You!
Editor's Notes
1. In reinforcement learning, we have an agent that interacts with the environment: at each time step it gets an observation from the environment about its state s_t, executes an action a_t, and receives a reward r_t from the environment. From the agent's perspective, it only outputs an action and receives from the environment an observation s_t and a reward r_t. From the environment's perspective, it outputs both the observation about the agent's state and the reward r_t. The reward is a scalar feedback signal that indicates how well the agent is doing at each time step. The job of the agent is to maximize the cumulative reward.
2. Sequential decision making -> the agent's actions affect the subsequent data it receives, which is why time really matters. This is the distinction from supervised learning, where you only have independent predictions for each input sample.
3. The set of states and actions, together with rules for transitioning from one state to another and for getting rewards, makes up a Markov decision process. The episode ends with the terminal state s_n (e.g. the "game over" screen). The rules for how you choose those actions are called the policy. A Markov decision process relies on the Markov assumption: the probability of the next state s_{i+1} depends only on the current state s_i and the performed action a_i, not on preceding states or actions.
4. Because our environment is stochastic, we can never be sure that we will get the same rewards the next time we perform the same actions; the further into the future we go, the more it may diverge. For that reason it is common to use the discounted future reward. Here γ is the discount factor between 0 and 1: the further into the future a reward is, the less we take it into consideration. It is easy to see that the discounted future reward at time step t can be expressed in terms of the same quantity at time step t+1. If we set the discount factor γ=0, our strategy will be short-sighted and rely only on immediate rewards. If we want to balance immediate and future rewards, we should set the discount factor to something like γ=0.9. If our environment is deterministic and the same actions always result in the same rewards, we can set the discount factor γ=1.
5. P predicts the next state.
6. Rewards: -1 per time-step -> motivates the agent to finish as quickly as possible. Actions: N, E, S, W. States: the agent's location. Arrows represent the policy π(s) for each state s.
7. Numbers represent the value v_π(s) of each state s.
8. The main distinction: in model-free RL you learn on the job by trial and error, whereas in model-based RL you learn about the environment offline or from demonstrations. Policy-based methods have better convergence properties and are effective in high-dimensional or continuous action spaces.
9. The way to think about Q(s,a) is that it is "the best possible score at the end of the game after performing action a in state s". It is called the Q-function because it represents the "quality" of a certain action in a given state. Once you have the magical Q-function, the answer becomes really simple: pick the action with the highest Q-value!
10. This may sound like quite a puzzling definition. How can we estimate the score at the end of the game if we know just the current state and action, and not the actions and rewards coming after that? We really can't. But as a theoretical construct we can assume the existence of such a function. Let's focus on just one transition <s, a, r, s'>. Just like with discounted future rewards in the previous section, we can express the Q-value of state s and action a in terms of the Q-value of the next state s'. If you think about it, it is quite logical: the maximum future reward for this state and action is the immediate reward plus the maximum future reward for the next state.
11. In the simplest case the Q-function is implemented as a table, with states as rows and actions as columns. α in the algorithm is a learning rate that controls how much of the difference between the previous Q-value and the newly proposed Q-value is taken into account. In particular, when α=1 the two Q[s,a] terms cancel and the update is exactly the Bellman equation. The max_a' Q[s',a'] that we use to update Q[s,a] is only an estimate, and in the early stages of learning it may be completely wrong; however, the estimates get more and more accurate with every iteration, so if we perform this update enough times the Q-function will converge and represent the true Q-value. The state of the environment in the Breakout game can be defined by the location of the paddle, the location and direction of the ball, and the existence of each individual brick. This intuitive representation is, however, game specific. Could we come up with something more universal that would be suitable for all games? The obvious choice is screen pixels: they implicitly contain all of the relevant information about the game situation except for the speed and direction of the ball, and two consecutive screens cover these as well.
12. In the case of the Breakout Atari game from the first videos, constructing the Q(s,a) table from raw pixels as the state space (84*84*4) would mean millions of possible game states, corresponding to millions of rows in our (s,a) table. This is the point where deep learning steps in. Neural networks are exceptionally good at coming up with good features for highly structured data. We could represent our Q-function with a neural network that takes the state (four game screens) and action as input and outputs the corresponding Q-value. This approach has the advantage that if we want to perform a Q-value update or pick the action with the highest Q-value, we only have to do one forward pass through the network and have all Q-values for all actions immediately available.
13. This is a classical convolutional neural network with three convolutional layers, followed by two fully connected layers. People familiar with object recognition networks may notice that there are no pooling layers. If you think about it, pooling layers buy you translation invariance: the network becomes insensitive to the location of an object in the image. That makes perfect sense for a classification task like ImageNet, but for games the location of the ball is crucial in determining the potential reward, and we wouldn't want to discard this information!
14. (Same note as 12.)
15. So we could say that Q-learning incorporates exploration as part of the algorithm. But this exploration is "greedy": it settles on the first effective strategy it finds. In their system, DeepMind actually decreases ε over time from 1 to 0.1; in the beginning the system makes completely random moves to explore the state space maximally, and then it settles down to a fixed exploration rate.
16. (Same note as 11.)
17. A 15-month-old infant can interpret the intentions of another human demonstrator, even when seeing the demonstration for the first time.
18. Reinforcement learning is used to develop distributed control structures for a set of distributed generation sources; the exchange of information between these sources is governed by a communication graph topology. Reinforcement learning algorithms can be built to reduce transit time for stocking as well as retrieving products in a warehouse, optimizing space utilization and warehouse operations. Pit.ai is at the forefront of leveraging reinforcement learning for evaluating trading strategies. A dynamic treatment regime (DTR) is a subject of medical research that sets rules for finding effective treatments for patients. Diseases like cancer demand treatment over a long period, where drugs and treatment levels are administered over time. Reinforcement learning addresses this DTR problem: RL algorithms help process clinical data to come up with a treatment strategy, using various clinical indicators collected from patients as inputs.