This document provides an introduction to deep reinforcement learning. It begins with an overview of reinforcement learning and its key characteristics such as using reward signals rather than supervision and sequential decision making. The document then covers the formulation of reinforcement learning problems using Markov decision processes and the typical components of an RL agent including policies, value functions, and models. It discusses popular RL algorithms like Q-learning, deep Q-networks, and policy gradient methods. The document concludes by outlining some potential applications of deep reinforcement learning and recommending further educational resources.
Intro to Deep Reinforcement Learning
1. Introduction to Deep Reinforcement Learning
Khaled Saleh
PhD Researcher at IISRI/ Deakin University
Australia
2. Agenda
• Motivation
• What is Reinforcement Learning (RL)?
• Characteristics of RL
• Formulation of the RL Problem
• Different Components of RL
• Taxonomy of Algorithms for Solving RL
• Q-Learning
• Deep Q Network (DQN)
• Policy Gradient Methods
• Inverse RL
• Deep RL/IRL Potential Applications
5. Characteristics of RL
• Compared with other machine learning paradigms, RL differs
in the following ways:
• No supervision is needed, only a reward signal
• Feedback is delayed, not instantaneous
• Decision making is sequential
6. Formulation of RL
• The most common way to formulate an RL problem is as a
Markov Decision Process (MDP)
• One episode of this process forms a finite sequence of states,
actions and rewards:
• $s_0, a_0, r_1, s_1, a_1, r_2, s_2, \dots, s_{n-1}, a_{n-1}, r_n, s_n$
Image credit: Wikipedia; Sutton and Barto (1998)
7. Formulation of RL
• A good policy needs to take into account not only the
immediate rewards, but also the future rewards we are going
to get.
• Thus, the ultimate goal of an RL agent is to select actions that
maximize the total future reward.
• Given one run of the Markov decision process, we can easily
calculate the total reward for one episode from time
step t onward as follows:
• $R_t = r_t + r_{t+1} + r_{t+2} + \dots + r_n$
• Due to the inherent uncertainty in the environment, we usually
use the discounted future reward instead:
• $R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{n-t} r_n = r_t + \gamma R_{t+1}$
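The recursion R_t = r_t + γR_{t+1} gives a simple way to compute the discounted future reward by walking an episode backwards. The following is a minimal sketch; the function name `discounted_return` and the reward values are hypothetical, chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """Return R_t for rewards [r_t, r_{t+1}, ..., r_n] via R_t = r_t + gamma * R_{t+1}."""
    total = 0.0
    # Walk backwards so that each step applies the recursion exactly once.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]  # hypothetical rewards r_t, r_{t+1}, r_{t+2}
print(discounted_return(rewards, 0.0))  # only the immediate reward counts
print(discounted_return(rewards, 0.9))  # later rewards count less
print(discounted_return(rewards, 1.0))  # plain undiscounted sum
```

With γ=0 only the first reward survives, and with γ=1 the result is the plain sum from the previous bullet.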
8. Components of RL
• An RL agent may include one or more of these components:
• Policy: agent’s behavior function 𝑎 = π(𝑠)
• Value function: a prediction of future reward - how good
is each state and/or action
• Model: agent’s representation of the environment. Given
state 𝑠 and action 𝑎, the model gives us both the reward
for this state and action and the probability of the
next state 𝑠′
9. Components of RL: Policy
Example adapted from: http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html
• Given the following maze example, the policy would be:
[Figure: maze with an arrow in each cell showing the policy π(s)]
10. Components of RL: Value Function
• Used to evaluate the goodness/badness of states
• And therefore to select between actions:
$Q^{\pi}(s, a) = \max_{\pi} R_{t+1}$
11. Taxonomy of Algorithms for Solving RL
• Model-free
• Policy and/or value function
• Model-based
• Model + policy and/or value function
• Approximate learned model + policy and/or value function
12. Q-Learning
• Q-learning is a model-free approach to learning the value
function of the RL problem.
• In Q-learning, we define a function 𝑄(𝑠, 𝑎) representing the
discounted future reward when we perform action 𝑎 in state 𝑠
and continue optimally from that point on.
• $Q(s_t, a_t) = \max_{\pi} R_{t+1}$
• Once we have the Q-function, the question of which action to
choose in a given state 𝑠 reduces to:
• $\pi(s) = \arg\max_{a} Q(s, a)$
13. Q-Learning (2)
• To obtain the Q-function, we will focus on just one transition
<𝑠, 𝑎, 𝑟, 𝑠′>.
• Recall:
$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{n-t} r_n = r_t + \gamma R_{t+1}$
• Similarly, we can express the Q-value of state 𝑠 and
action 𝑎 in terms of the Q-value of the next state 𝑠′:
$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$
(Bellman equation)
14. Q-Learning (3)
Algorithm adapted from: http://artint.info/html/ArtInt_265.html
• We can then iteratively approximate the Q-function using the
Bellman equation, as follows:
[Figure: Q-learning algorithm listing, with the learning rate α labeled]
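The iterative approximation can be sketched as a tabular Q-learning loop. This is a minimal sketch under stated assumptions, not the listing from the original slide: the five-state corridor environment, the `step` function, the constants, and the purely random behaviour policy are all hypothetical choices made to keep the example short. Terminal transitions use the reward alone as the target, since there is no next state to back up from.

```python
import random

N_STATES = 5        # hypothetical corridor: states 0..4, state 4 is terminal
ACTIONS = [-1, +1]  # move left or move right
GAMMA = 0.9         # discount factor
ALPHA = 0.5         # learning rate


def step(state, action):
    """Hypothetical deterministic environment: reward 1 on reaching the goal, else 0."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return reward, next_state, done


def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    # Q-table: one row per state, one column per action, initialised to zero.
    Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.randrange(len(ACTIONS))  # random behaviour policy (an assumption)
            r, s_next, done = step(s, ACTIONS[a])
            # Bellman target; no bootstrapping from a terminal state.
            target = r if done else r + GAMMA * max(Q[s_next])
            # Move Q[s][a] a fraction ALPHA of the way towards the target.
            Q[s][a] += ALPHA * (target - Q[s][a])
            s = s_next
    return Q


Q = q_learning()
# Greedy policy: index of the best action in each non-terminal state.
greedy = [max(range(len(ACTIONS)), key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(greedy)
```

In this toy corridor the greedy policy read from the learned table should point right (action index 1) in every non-terminal state.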
15. Deep Q-Networks
• The Q-function can be represented by a neural network that
takes the state and action as input and outputs the
corresponding Q-value
• Alternatively, the network can take only the game screens as
input and output the Q-value for each possible action.
17. DQN: Training
• Given a transition <𝑠, 𝑎, 𝑟, 𝑠′> and the loss function
$L = \frac{1}{2}\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]^2$:
1. Do a feedforward pass for the current state 𝑠 to get
predicted Q-values for all actions.
2. Do a feedforward pass for the next state 𝑠′ and calculate
maximum over all network outputs 𝑚𝑎𝑥 𝑎′ 𝑄 𝑠′, 𝑎′
3. Set Q-value target for
action 𝑎 to 𝑟 + 𝛾𝑚𝑎𝑥 𝑎′ 𝑄 𝑠′, 𝑎′ (use the max calculated
in step 2). For all other actions, set the Q-value target to
the same as originally returned from step 1, making the
error 0 for those outputs
4. Update the weights using backpropagation.
In the loss, $r + \gamma \max_{a'} Q(s', a')$ is the target and $Q(s, a)$ is the prediction.
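Steps 1 to 3 can be sketched in plain Python. This is only an illustration: `q_network` is a hypothetical stand-in that returns fixed numbers instead of running a real neural network, and the discount factor and transition values are made up. Step 4 (backpropagation) is left out because it depends on the deep learning framework in use.

```python
GAMMA = 0.9  # hypothetical discount factor


def q_network(state):
    """Stand-in for a neural network: one Q-value per action (hypothetical numbers)."""
    table = {"s": [1.0, 2.0, 0.5], "s_next": [0.0, 3.0, 1.0]}
    return list(table[state])


def dqn_targets(s, a, r, s_next):
    q_pred = q_network(s)              # step 1: forward pass for the current state
    max_next = max(q_network(s_next))  # step 2: forward pass for the next state, take the max
    targets = list(q_pred)             # step 3: copy predictions so other actions have zero error
    targets[a] = r + GAMMA * max_next  # only the taken action gets the Bellman target
    return q_pred, targets


def squared_loss(q_pred, targets):
    """Sum of 1/2 * (target - prediction)^2 over all network outputs."""
    return sum(0.5 * (t - q) ** 2 for q, t in zip(q_pred, targets))


q_pred, targets = dqn_targets("s", a=0, r=1.0, s_next="s_next")
print(targets)
print(squared_loss(q_pred, targets))
```

Because the targets for the untaken actions equal their predictions, only the output for action 𝑎 contributes to the loss.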
18. DQN: Experience Replay
• One of the engineering tricks that made DQN training
much more stable
• During gameplay, all the experiences <𝑠, 𝑎, 𝑟, 𝑠′> are stored in
a replay memory
• When training the network, random samples from the replay
memory are used instead of the most recent transition
1. This breaks the similarity of subsequent training
samples, which otherwise might drive the network into a
local minimum
2. It makes the training task more similar to usual
supervised learning, which simplifies debugging and
testing the algorithm.
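A replay memory can be sketched with the standard library alone. The class name `ReplayMemory`, the fixed capacity that drops the oldest transitions, and the `seed` parameter are assumptions made for this illustration; the slides do not specify how the memory is bounded.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity store of transitions (s, a, r, s_next); oldest entries are dropped first."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up runs of consecutive transitions.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=3, seed=0)
for t in range(5):
    memory.push(t, 0, 0.0, t + 1)  # hypothetical transitions
batch = memory.sample(2)
print(len(memory), batch)
```

A training loop would call `push` after every environment step and `sample` whenever it needs a minibatch.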
19. DQN: ε-greedy exploration
• When the Q-network is initialized randomly, its predictions
are initially random as well
• If we pick the action with the highest Q-value, the action will
be random and the agent performs crude “exploration”.
• As the Q-function converges, it returns more consistent
Q-values and the amount of exploration decreases
• Another engineering trick is ε-greedy exploration: with
probability ε choose a random action, otherwise go with the
“greedy” action with the highest Q-value.
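The ε-greedy rule fits in a few lines. This is a minimal sketch: the function name `epsilon_greedy`, the optional `rng` argument, and the Q-values are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon return a random action index, otherwise the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, 0.7, 0.2]  # hypothetical Q-values for three actions
print(epsilon_greedy(q_values, epsilon=0.0))  # epsilon = 0 is purely greedy
print(epsilon_greedy(q_values, epsilon=1.0))  # epsilon = 1 is purely random
```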
21. Policy Gradient Methods
• Another common paradigm for solving the RL problem is
learning the policy directly.
• Learning the policy directly can be much more efficient in
the case of continuous action spaces (human locomotion, etc.)
• One of the key families in this paradigm is policy gradient
methods (gradient descent, conjugate gradient, quasi-Newton).
• The formulation is as follows: let $J(\theta)$ be any policy objective
function
• Policy gradient methods search for a local maximum of $J(\theta)$
by ascending the gradient of the objective w.r.t. the parameters $\theta$,
with step size $\alpha$:
$\Delta\theta = \alpha \nabla_{\theta} J(\theta)$
where $\nabla_{\theta} J(\theta)$ is the policy gradient
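The update Δθ = α∇J(θ) can be illustrated on a one-step problem where the gradient is available in closed form. This is a sketch under stated assumptions rather than a method from the slides: the softmax policy, the three hypothetical action rewards, and the choice of J(θ) as the expected reward are all picked for illustration. For a softmax policy π, the exact gradient is dJ/dθ_k = π_k (r_k − J).

```python
import math


def softmax(theta):
    m = max(theta)  # subtract the max for numerical stability
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def objective(theta, rewards):
    """J(theta): expected reward of the softmax policy on a one-step problem."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))


def gradient(theta, rewards):
    """Exact gradient of J for a softmax policy: dJ/dtheta_k = pi_k * (r_k - J)."""
    pi = softmax(theta)
    j = sum(p * r for p, r in zip(pi, rewards))
    return [p * (r - j) for p, r in zip(pi, rewards)]


rewards = [1.0, 0.0, 0.5]  # hypothetical expected reward of each action
theta = [0.0, 0.0, 0.0]    # policy parameters, one per action
alpha = 0.5                # step size
history = [objective(theta, rewards)]
for _ in range(200):
    grad = gradient(theta, rewards)
    theta = [t + alpha * g for t, g in zip(theta, grad)]  # delta theta = alpha * grad J
    history.append(objective(theta, rewards))

print(history[0], history[-1])
print(softmax(theta))
```

The objective starts at the average reward of the uniform policy and rises as probability mass moves towards the best action.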
22. Policy Gradient Methods
Heess, Nicolas, et al. "Emergence of locomotion behaviours in rich environments." arXiv preprint
arXiv:1707.02286 (2017).
24. Inverse RL
• In most real-world applications, the notion of
reward is not obvious or is really hard to specify.
• In the IRL problem, we try to learn the reward (and the transition
model as well) from expert or human demonstrations.
30. References
1. Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. Vol. 1. No. 1. Cambridge: MIT press, 1998.
2. Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
3. Abbeel, Pieter, and Andrew Y. Ng. "Apprenticeship learning via inverse reinforcement learning." Proceedings of the twenty-first
international conference on Machine learning. ACM, 2004.
4. Cassandra, Anthony Rocco. "Exact and approximate algorithms for partially observable Markov decision processes." (1998).
5. Heess, Nicolas, et al. "Emergence of Locomotion Behaviours in Rich Environments." arXiv preprint arXiv:1707.02286 (2017).
6. Heess, Nicolas, et al. "Learning and Transfer of Modulated Locomotor Controllers." arXiv preprint arXiv:1610.05182 (2016).
7. Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, and Pieter Abbeel. Deep Spatial Autoencoders for Visuomotor
Learning. In ICRA, 2016.
8. Jakob N Foerster, Yannis M Assael, Nando de Freitas, and Shimon Whiteson. Learning to Communicate to Solve Riddles with Deep
Distributed Recurrent QNetworks. arXiv:1602.02672, 2016.
9. Sham M Kakade. A Natural Policy Gradient. In NIPS, 2002
10. Nate Kohl and Peter Stone. Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion. In ICRA, volume 3, 2004
11. Sascha Lange, Martin Riedmiller, and Arne Voigtlander. Autonomous Reinforcement Learning on Raw Visual Input Data in a Real World
Application. In IJCNN, 2012.
12. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep Learning. Nature, 521 (7553):436–444, 2015.
13. Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end Training of Deep Visuomotor Policies. JMLR, 17(39):1–40,
2016
14. Xiujun Li, Lihong Li, Jianfeng Gao, Xiaodong He, Jianshu Chen, Li Deng, and Ji He. Recurrent Reinforcement Learning: A Hybrid
Approach. arXiv:1509.03044,
15. Wulfmeier, Markus, Dominic Zeng Wang, and Ingmar Posner. "Watch this: Scalable cost-function learning for path planning in urban
environments." Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on. IEEE, 2016.
In reinforcement learning, an agent interacts with the environment: at each time step it gets an observation from the environment about its state s_t, executes an action a_t, and receives a reward r_t from the environment.
From the agent's perspective: it outputs only an action, and gets as input from the environment an observation s_t and a reward r_t.
From the environment's perspective: it outputs both the observation of the agent's state and the reward r_t.
The reward is a scalar feedback signal that indicates how well the agent is doing at each time step.
The job of the agent is to maximize the cumulative reward.
Sequential decision making -> the agent's actions affect the subsequent data it receives, which is why time really matters.
This is what distinguishes RL from supervised learning, where you only have independent predictions for each input sample.
The set of states and actions, together with rules for transitioning from one state to another and for getting rewards, make up a Markov decision process.
The episode ends with the terminal state s_n (e.g. the “game over” screen).
The rules for how you choose those actions are called the policy.
A Markov decision process relies on the Markov assumption: the probability of the next state s_{i+1} depends only on the current state s_i and the performed action a_i, not on preceding states or actions.
But because our environment is stochastic, we can never be sure that we will get the same rewards the next time we perform the same actions. The further into the future we go, the more the outcomes may diverge. For that reason it is common to use the discounted future reward.
Here γ is the discount factor between 0 and 1: the further into the future the reward is, the less we take it into consideration. It is easy to see that the discounted future reward at time step t can be expressed in terms of the same quantity at time step t+1.
If we set the discount factor γ=0, then our strategy will be short-sighted and rely only on the immediate rewards. If we want to balance immediate and future rewards, we should set the discount factor to something like γ=0.9. If our environment is deterministic and the same actions always result in the same rewards, then we can set the discount factor γ=1.
P predicts the next state
Rewards: -1 per time step -> motivates the agent to finish as soon as possible
Actions: N, E, S, W
States: Agent’s location
Arrows represent policy π(s) for each state s
Numbers represent value vπ(s) of each state s
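The maze notes above (rewards, actions, states, policy arrows, and state values) can be made concrete with a tiny grid. The sketch below is hypothetical throughout: the 2x3 grid without walls, the goal position, and the particular policy are invented for illustration and are not the maze from the slide.

```python
# Hypothetical 2x3 grid without walls; the goal is the top-right cell.
ROWS, COLS = 2, 3
GOAL = (0, 2)
MOVES = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}


def model(state, action):
    """Deterministic model: returns (reward, next_state). Moves off the grid leave the state unchanged."""
    dr, dc = MOVES[action]
    r, c = state[0] + dr, state[1] + dc
    next_state = (r, c) if 0 <= r < ROWS and 0 <= c < COLS else state
    return -1.0, next_state  # reward of -1 per time step


# The policy maps each non-goal state to an action (the arrows on the slide).
policy = {(0, 0): "E", (0, 1): "E", (1, 0): "E", (1, 1): "E", (1, 2): "N"}


def value(state, max_steps=100):
    """v_pi(state): undiscounted sum of rewards collected by following the policy to the goal."""
    total = 0.0
    for _ in range(max_steps):
        if state == GOAL:
            break
        reward, state = model(state, policy[state])
        total += reward
    return total


values = {s: value(s) for s in policy}
print(values)
```

With a reward of -1 per step and no discounting, each value is simply minus the number of steps the policy needs to reach the goal from that cell.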
The main distinction: in model-free RL you learn on the job by trial and error, whereas in model-based RL you learn about the environment offline or from demonstrations.
Policy-based methods have better convergence properties and are effective in high-dimensional or continuous action spaces.
The way to think about Q(s,a) is that it is “the best possible score at the end of the game after performing action a in state s”. It is called the Q-function because it represents the “quality” of a certain action in a given state.
Once you have the magical Q-function, the answer becomes really simple: pick the action with the highest Q-value!
This may sound like quite a puzzling definition. How can we estimate the score at the end of the game if we know just the current state and action, and not the actions and rewards coming after that? We really can’t. But as a theoretical construct we can assume the existence of such a function.
Let’s focus on just one transition <s,a,r,s′>. Just like with discounted future rewards in the previous section, we can express the Q-value of state s and action a in terms of the Q-value of the next state s′.
If you think about it, it is quite logical: the maximum future reward for this state and action is the immediate reward plus the maximum future reward for the next state.
In the simplest case the Q-function is implemented as a table, with states as rows and actions as columns.
α in the algorithm is a learning rate that controls how much of the difference between the previous Q-value and the newly proposed Q-value is taken into account. In particular, when α=1, the two Q[s,a] terms cancel and the update is exactly the same as the Bellman equation.
The max_a' Q[s',a'] that we use to update Q[s,a] is only an estimate, and in the early stages of learning it may be completely wrong. However, the estimates get more and more accurate with every iteration, and if we perform this update enough times, the Q-function will converge and represent the true Q-value.
The state of the environment in the Breakout game can be defined by the location of the paddle, the location and direction of the ball, and the existence of each individual brick. This intuitive representation is, however, game specific. Could we come up with something more universal that would be suitable for all games? The obvious choice is screen pixels. They implicitly contain all of the relevant information about the game situation, except for the speed and direction of the ball. Two consecutive screens would have these covered as well.
In the case of the Atari game Breakout from the first videos, building the Q(s,a) table from raw pixels as the state space (84*84*4) would mean an enormous number of possible game states, and correspondingly an enormous number of rows in our Q(s,a) table.
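A quick calculation shows how large such a table would be. The 84*84*4 input size comes from the note above; the 256 gray levels per pixel are an assumption (8-bit grayscale screens) that the notes do not state.

```python
import math

pixels = 84 * 84 * 4  # four stacked 84x84 screens, as in the note above
gray_levels = 256     # assumption: 8-bit grayscale pixels

# The count gray_levels ** pixels is far too large to print in full,
# so report its order of magnitude instead.
exponent = math.floor(pixels * math.log10(gray_levels))
print(f"{pixels} pixels, about 10^{exponent} possible states")
```

Under that assumption no table could ever hold one row per state, which motivates approximating the Q-function instead of storing it.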
This is the point where deep learning steps in. Neural networks are exceptionally good at coming up with good features for highly structured data.
We could represent our Q-function with a neural network that takes the state (four game screens) and action as input and outputs the corresponding Q-value.
The alternative, taking only the game screens as input and outputting the Q-value for each possible action, has the advantage that if we want to perform a Q-value update or pick the action with the highest Q-value, we only have to do one forward pass through the network and have the Q-values for all actions immediately available.
This is a classical convolutional neural network with three convolutional layers, followed by two fully connected layers. People familiar with object recognition networks may notice that there are no pooling layers.
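To see how three convolutional layers without pooling shrink the input, the feature-map sizes can be computed by hand. The kernel sizes, strides, and filter counts below are assumptions taken from the network described in Mnih et al. (2015), reference 2; the notes themselves only give the number of layers.

```python
def conv_out(size, kernel, stride):
    """Output width/height of a convolution without padding."""
    return (size - kernel) // stride + 1


# Assumed layer settings (kernel, stride, filters), following Mnih et al. (2015).
layers = [(8, 4, 32), (4, 2, 64), (3, 1, 64)]

size, channels = 84, 4  # input: four stacked 84x84 screens
shapes = []
for kernel, stride, filters in layers:
    size, channels = conv_out(size, kernel, stride), filters
    shapes.append((size, size, channels))

flat = size * size * channels  # number of inputs to the first fully connected layer
print(shapes, flat)
```

The strides do the downsampling that pooling layers would otherwise provide.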
But if you really think about it, pooling layers buy you translation invariance: the network becomes insensitive to the location of an object in the image. That makes perfect sense for a classification task like ImageNet, but for games the location of the ball is crucial in determining the potential reward and we wouldn’t want to discard this information!
So we could say that Q-learning incorporates exploration as part of the algorithm. But this exploration is “greedy”: it settles on the first effective strategy it finds.
In their system, DeepMind actually decreases ε over time from 1 to 0.1: in the beginning the system makes completely random moves to explore the state space maximally, and then it settles down to a fixed exploration rate.
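A decreasing ε can be sketched as a simple schedule. The start and end values 1 and 0.1 come from the note above; the linear shape and the `anneal_steps` value are assumptions made for illustration, since the note only says that ε decreases over time.

```python
def epsilon_at(step, start=1.0, end=0.1, anneal_steps=1000):
    """Linearly decrease epsilon from start to end over anneal_steps, then hold it at end."""
    fraction = min(step / anneal_steps, 1.0)
    return end + (1.0 - fraction) * (start - end)


for step in (0, 500, 1000, 5000):
    print(step, round(epsilon_at(step), 3))
```

Early steps are almost entirely random; after `anneal_steps` the agent keeps exploring at the fixed final rate.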
* A 15-month-old infant can interpret the intentions of a human demonstrator, even when seeing the demonstration for the first time.
Reinforcement learning is used to develop a distributed control structure for a set of distributed generation sources. The exchange of information between these sources is governed by a communication graph topology.
Reinforcement learning algorithms can be built to reduce transit time for stocking and retrieving products in the warehouse, optimizing space utilization and warehouse operations.
Pit.ai is at the forefront of leveraging reinforcement learning for evaluating trading strategies.
A dynamic treatment regime (DTR) is a subject of medical research that sets rules for finding effective treatments for patients. Diseases like cancer demand treatment over a long period, during which drugs and treatment levels are administered. Reinforcement learning addresses this DTR problem: RL algorithms help process clinical data to come up with a treatment strategy, using various clinical indicators collected from patients as inputs.