Intro to Deep Reinforcement Learning


  1. Introduction to Deep Reinforcement Learning. Khaled Saleh, PhD Researcher at IISRI, Deakin University, Australia
  2. Agenda
     • Motivation
     • What is Reinforcement Learning (RL)?
     • Characteristics of RL
     • Formulation of the RL Problem
     • Different Components of RL
     • Taxonomy of Algorithms for Solving RL
     • Q-Learning
     • Deep Q Network (DQN)
     • Policy Gradient Methods
     • Inverse RL
     • Deep RL/IRL Potential Applications
  3. Motivation. Video credits: Ng et al., NIPS 2007; Google DeepMind, 2015
  4. What is Reinforcement Learning (RL)? Image credit: Sutton and Barto (1998)
  5. Characteristics of RL
     Compared with other machine learning paradigms, RL differs in the following ways:
     • No supervision is needed, only a reward signal
     • Feedback is delayed, not instantaneous
     • Decision making is sequential
  6. Formulation of RL
     • The most common way to formulate an RL problem is as a Markov Decision Process (MDP).
     • One episode of this process forms a finite sequence of states, actions and rewards: $s_0, a_0, r_1, s_1, a_1, r_2, s_2, \ldots, s_{n-1}, a_{n-1}, r_n, s_n$
     Image credits: Wikipedia; Sutton and Barto (1998)
  7. Formulation of RL
     • A good policy needs to take into account not only the immediate rewards but also the future rewards we are going to get.
     • Thus, the ultimate goal of an RL agent is to select actions that maximize the total future reward.
     • Given one run of the Markov decision process, the total reward for one episode from time step t onward is: $R_t = r_t + r_{t+1} + r_{t+2} + \cdots + r_n$
     • Because of the inherent uncertainty in the environment, we usually use the discounted future reward instead: $R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{n-t} r_n = r_t + \gamma R_{t+1}$
  8. Components of RL
     An RL agent may include one or more of these components:
     • Policy: the agent's behavior function, $a = \pi(s)$
     • Value function: a prediction of future reward, that is, how good each state and/or action is
     • Model: the agent's representation of the environment. Given state $s$ and action $a$, the model gives both the reward for this state and action and the probability of the next state $s'$
  9. Components of RL: Policy
     • Given a maze example, the policy is shown for each state (figure not reproduced).
     Example adapted from: http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html
  10. Components of RL: Value Function
      • Used to evaluate the goodness/badness of states
      • And therefore to select between actions: $Q^{\pi}(s, a) = \max_{\pi} R_{t+1}$
  11. Taxonomy of Algorithms for Solving RL
      • Model Free
        • Policy and/or Value Function
      • Model Based
        • Model + Policy and/or Value Function
        • Approximated Learned Model + Policy and/or Value Function
  12. Q-Learning
      • Q-learning is a model-free paradigm for learning the value function of the RL problem.
      • In Q-learning, we define a function $Q(s, a)$ representing the discounted future reward when we perform action $a$ in state $s$ and continue optimally from that point on: $Q(s_t, a_t) = \max_{\pi} R_{t+1}$
      • Once we have the Q-function, the question of which policy to choose in a given state $s$ reduces to: $\pi(s) = \arg\max_{a} Q(s, a)$
  13. Q-Learning (2)
      • To obtain the Q-function, we focus on just one transition $\langle s, a, r, s' \rangle$.
      • Recall that $R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{n-t} r_n = r_t + \gamma R_{t+1}$
      • Similarly, we can express the Q-value of state $s$ and action $a$ in terms of the Q-value of the next state $s'$ (the Bellman equation): $Q(s, a) = r + \gamma \max_{a'} Q(s', a')$
  14. Q-Learning (3)
      • We can then iteratively approximate the Q-function using the Bellman equation (algorithm listing not reproduced; it uses a learning rate α).
      Algorithm adapted from: http://artint.info/html/ArtInt_265.html
  15. Deep Q-Networks
      • The Q-function could be represented with a neural network that takes the state and action as input and outputs the corresponding Q-value.
      • Alternatively, the network could take only game screens as input and output the Q-value for each possible action.
  16. DQN: Atari. Image credit: Mnih et al., Nature 2015
  17. DQN: Training
      Given a transition $\langle s, a, r, s' \rangle$ and the loss function $L = \frac{1}{2}\left[\underbrace{r + \gamma \max_{a'} Q(s', a')}_{\text{target}} - \underbrace{Q(s, a)}_{\text{prediction}}\right]^2$:
      1. Do a feedforward pass for the current state $s$ to get predicted Q-values for all actions.
      2. Do a feedforward pass for the next state $s'$ and calculate the maximum over all network outputs, $\max_{a'} Q(s', a')$.
      3. Set the Q-value target for action $a$ to $r + \gamma \max_{a'} Q(s', a')$ (use the max calculated in step 2). For all other actions, set the Q-value target to the value originally returned in step 1, making the error 0 for those outputs.
      4. Update the weights using backpropagation.
  18. DQN: Experience Replay
      • One of the engineering tricks that made the training of DQN much more stable
      • During gameplay, all experiences $\langle s, a, r, s' \rangle$ are stored in a replay memory
      • When training the network, random samples from the replay memory are used instead of the most recent transition
        1. This breaks the similarity of subsequent training samples, which otherwise might drive the network into a local minimum
        2. It makes the training task more similar to usual supervised learning, which simplifies debugging and testing the algorithm
  19. DQN: ε-greedy exploration
      • When the Q-network is initialized randomly, its predictions are initially random as well.
      • If we pick the action with the highest Q-value, the action will be random and the agent performs crude "exploration".
      • As the Q-function converges, it returns more consistent Q-values and the amount of exploration decreases.
      • Another engineering trick is ε-greedy exploration: with probability ε choose a random action, otherwise go with the "greedy" action with the highest Q-value.
  20. DQN: Algorithm (algorithm listing not reproduced; it highlights the experience replay and ε-greedy exploration steps). Algorithm adapted from: http://artint.info/html/ArtInt_265.html
  21. Policy Gradient Methods
      • Another common paradigm for solving the RL problem is to learn the policy directly.
      • Learning the policy directly can be much more efficient for continuous action spaces (human locomotion, etc.).
      • One of the key families in this paradigm is policy gradient methods (gradient descent, conjugate gradient, quasi-Newton).
      • The formulation is as follows: let $J(\theta)$ be any policy objective function.
      • Policy gradient methods search for a local maximum of $J(\theta)$ by ascending its gradient with respect to the policy parameters $\theta$: $\Delta\theta = \alpha \nabla_{\theta} J(\theta)$, where $\nabla_{\theta} J(\theta)$ is the policy gradient.
  22. Policy Gradient Methods. Heess, Nicolas, et al. "Emergence of locomotion behaviours in rich environments." arXiv preprint arXiv:1707.02286 (2017).
  23. Inverse RL. Adapted from CS 294: Deep Reinforcement Learning, UC Berkeley, Fall 2017
  24. Inverse RL
      • In most real-world applications, the reward is not obvious or is very hard to specify.
      • In the IRL problem, we try to learn the reward (and the transition model as well) from expert or human demonstrations.
  25. Inverse RL: Autonomous Driving (figure labels: Reward, Features). Image credit: Wulfmeier et al., IROS 2016
  26. Inverse RL: Intent Prediction (figure label: Pedestrian). Image credit: KITTI Dataset
  27. Deep RL/IRL Potential Applications
      • Autonomous Navigation
      • Semantic Segmentation
      • Recommendation Systems
      • Chatbots
      • Inventory Management
      • Power Systems
      • Financial investment decisions (see http://pit.ai/)
      • Medical Sector (Dynamic treatment regime)
  28. Further Educational Resources
      • Reinforcement Learning: An Introduction (Sutton and Barto's book, 2nd edition)
      • David Silver's Reinforcement Learning Course (UCL, 2015)
      • CS 294: Deep Reinforcement Learning, Fall 2017
      • Deep RL Bootcamp, Summer 2017
  29. DeepMind AlphaGo. Image and video credit: Google Brain & DeepMind
  30. References
      1. Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. Vol. 1. No. 1. Cambridge: MIT press, 1998.
      2. Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
      3. Abbeel, Pieter, and Andrew Y. Ng. "Apprenticeship learning via inverse reinforcement learning." Proceedings of the twenty-first international conference on Machine learning. ACM, 2004.
      4. Cassandra, Anthony Rocco. "Exact and approximate algorithms for partially observable Markov decision processes." (1998).
      5. Heess, Nicolas, et al. "Emergence of Locomotion Behaviours in Rich Environments." arXiv preprint arXiv:1707.02286 (2017).
      6. Heess, Nicolas, et al. "Learning and Transfer of Modulated Locomotor Controllers." arXiv preprint arXiv:1610.05182 (2016).
      7. Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, and Pieter Abbeel. Deep Spatial Autoencoders for Visuomotor Learning. In ICRA, 2016.
      8. Jakob N Foerster, Yannis M Assael, Nando de Freitas, and Shimon Whiteson. Learning to Communicate to Solve Riddles with Deep Distributed Recurrent QNetworks. arXiv:1602.02672, 2016.
      9. Sham M Kakade. A Natural Policy Gradient. In NIPS, 2002.
      10. Nate Kohl and Peter Stone. Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion. In ICRA, volume 3, 2004.
      11. Sascha Lange, Martin Riedmiller, and Arne Voigtlander. Autonomous Reinforcement Learning on Raw Visual Input Data in a Real World Application. In IJCNN, 2012.
      12. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep Learning. Nature, 521(7553):436–444, 2015.
      13. Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end Training of Deep Visuomotor Policies. JMLR, 17(39):1–40, 2016.
      14. Xiujun Li, Lihong Li, Jianfeng Gao, Xiaodong He, Jianshu Chen, Li Deng, and Ji He. Recurrent Reinforcement Learning: A Hybrid Approach. arXiv:1509.03044.
      15. Wulfmeier, Markus, Dominic Zeng Wang, and Ingmar Posner. "Watch this: Scalable cost-function learning for path planning in urban environments." Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on. IEEE, 2016.
  31. Thank You!
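
The training procedure on the "DQN: Training" and "DQN: Experience Replay" slides can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the names `ReplayMemory`, `build_target` and `squared_loss` are hypothetical, Q-values are plain lists standing in for network outputs, and no actual neural network or backpropagation step is included.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of <s, a, r, s'> transitions (hypothetical helper)."""

    def __init__(self, capacity):
        # deque with maxlen silently drops the oldest transition when full
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size, rng=random):
        # Random samples break the similarity of consecutive transitions
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)


def build_target(q_current, q_next, action, reward, gamma):
    """Steps 1-3 of the slide: copy the predictions, overwrite one entry."""
    target = list(q_current)                          # error 0 for other actions
    target[action] = reward + gamma * max(q_next)     # r + gamma * max_a' Q(s', a')
    return target


def squared_loss(q_current, target):
    """L = 1/2 * (target - prediction)^2, summed over outputs."""
    return 0.5 * sum((t - q) ** 2 for q, t in zip(q_current, target))


# Illustrative numbers only: three actions, made-up network outputs
q_s = [1.0, 2.0, 3.0]        # predicted Q-values for state s
q_s_next = [0.5, 4.0, 1.0]   # predicted Q-values for next state s'
target = build_target(q_s, q_s_next, action=0, reward=1.0, gamma=0.5)
print(target)                     # [3.0, 2.0, 3.0]
print(squared_loss(q_s, target))  # 2.0
```

Only the output for the action actually taken contributes to the loss, which is why the targets for all other actions are copied from the prediction. In a real DQN the loss would be minimized by backpropagation through the network, which this sketch leaves out.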

Editor's notes

  • In reinforcement learning, an agent interacts with an environment: at each time step it gets an observation from the environment about its state s_t, executes an action a_t, and receives a reward r_t from the environment.

    From the agent's perspective: it outputs an action and receives as input from the environment an observation s_t and a reward r_t.
    From the environment's perspective: it outputs both an observation about the agent's state and the reward r_t.

    The reward is a scalar feedback signal that indicates how well the agent is doing at each time step.
    The job of the agent is to maximize the cumulative reward.
  • Sequential decision making: the agent's actions affect the subsequent data it receives, which is why time really matters.
    This distinguishes RL from supervised learning, where each input sample gets an independent prediction.
  • The set of states and actions, together with the rules for transitioning from one state to another and for getting rewards, make up a Markov decision process.
    The episode ends with a terminal state s_n (e.g. a "game over" screen).
    The rules for how you choose those actions are called the policy.
    A Markov decision process relies on the Markov assumption: the probability of the next state s_{i+1} depends only on the current state s_i and the performed action a_i, not on preceding states or actions.

  • Because our environment is stochastic, we can never be sure that we will get the same rewards the next time we perform the same actions. The further into the future we go, the more the rewards may diverge. For that reason it is common to use the discounted future reward instead:

    $R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{n-t} r_n$

    Here γ is the discount factor between 0 and 1: the further into the future the reward is, the less we take it into consideration. It is easy to see that the discounted future reward at time step t can be expressed in terms of the same quantity at time step t+1:

    $R_t = r_t + \gamma R_{t+1}$

    If we set the discount factor γ=0, our strategy will be short-sighted and rely only on the immediate rewards. If we want to balance immediate and future rewards, we should set the discount factor to something like γ=0.9. If our environment is deterministic and the same actions always result in the same rewards, we can set the discount factor γ=1.
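
The effect of the discount factor can be checked numerically with a minimal sketch that applies the recursion R_t = r_t + γ R_{t+1} backwards over one episode. The function name `discounted_returns` and the reward values are made up for illustration.

```python
def discounted_returns(rewards, gamma):
    """Return R_t for every time step, using R_t = r_t + gamma * R_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0  # R_{t+1}; zero after the final step of the episode
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = [1.0, 0.0, 2.0]  # illustrative episode rewards
print(discounted_returns(rewards, 0.0))  # [1.0, 0.0, 2.0]  short-sighted
print(discounted_returns(rewards, 0.5))  # [1.5, 1.0, 2.0]
print(discounted_returns(rewards, 1.0))  # [3.0, 2.0, 2.0]  undiscounted sum
```

With γ=0 each return equals the immediate reward, and with γ=1 it equals the plain sum of all remaining rewards, matching the two extremes described in the note.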

  • In the model, P predicts the next state
  • Rewards: -1 per time step, which motivates the agent to finish as quickly as possible
    Actions: N, E, S, W
    States: the agent's location

    Arrows represent the policy π(s) for each state s

  • Numbers represent the value v_π(s) of each state s
  • The main distinction: in model-free methods you learn on the job by trial and error, whereas in model-based methods you learn about the environment offline or from demonstrations.
    Policy-based methods have better convergence properties and are effective in high-dimensional or continuous action spaces.



  • The way to think about Q(s,a) is that it is "the best possible score at the end of the game after performing action a in state s". It is called the Q-function because it represents the "quality" of a certain action in a given state.

    Once you have the magical Q-function, the answer becomes really simple: pick the action with the highest Q-value!

  • This may sound like quite a puzzling definition. How can we estimate the score at the end of the game if we know just the current state and action, and not the actions and rewards coming after that? We really can't. But as a theoretical construct we can assume that such a function exists.

    Let's focus on just one transition <s,a,r,s′>. Just as with the discounted future reward, we can express the Q-value of state s and action a in terms of the Q-value of the next state s′:

    $Q(s, a) = r + \gamma \max_{a'} Q(s', a')$

    If you think about it, it is quite logical: the maximum future reward for this state and action is the immediate reward plus the (discounted) maximum future reward for the next state.
  • In the simplest case the Q-function is implemented as a table, with states as rows and actions as columns.

    The update is Q[s,a] ← Q[s,a] + α (r + γ max_a' Q[s',a'] − Q[s,a]). Here α is a learning rate that controls how much of the difference between the previous Q-value and the newly proposed Q-value is taken into account. In particular, when α=1 the two Q[s,a] terms cancel and the update is exactly the Bellman equation.

    The max_a' Q[s',a'] that we use to update Q[s,a] is only an estimate, and in the early stages of learning it may be completely wrong. However, the estimates get more and more accurate with every iteration, and if we perform this update enough times, the Q-function will converge and represent the true Q-value.

    The state of the environment in the Breakout game can be defined by the location of the paddle, the location and direction of the ball, and the existence of each individual brick. This intuitive representation is, however, game specific. Could we come up with something more universal that would be suitable for all games? The obvious choice is screen pixels. They implicitly contain all of the relevant information about the game situation, except for the speed and direction of the ball. Two consecutive screens would have these covered as well.
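
The tabular update with learning rate α described in this note can be sketched as follows. This is a minimal illustration rather than the algorithm from the slide: the state names, the two-element `ACTIONS` list, and the `terminal` flag are assumptions, and the table is stored as a dictionary keyed by (state, action) instead of a literal rows-and-columns array.

```python
from collections import defaultdict

ACTIONS = ["left", "right"]  # hypothetical action set


def q_update(Q, s, a, r, s_next, alpha, gamma, terminal=False):
    """One Q-learning step: Q[s,a] += alpha * (r + gamma * max Q[s',.] - Q[s,a])."""
    # Assumption: no future reward is available from a terminal state
    best_next = 0.0 if terminal else max(Q[(s_next, b)] for b in ACTIONS)
    target = r + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


Q = defaultdict(float)        # every unseen (state, action) entry starts at 0
Q[("s1", "right")] = 2.0      # made-up estimate for the next state

# alpha = 0.5 moves Q halfway from its old value towards the target
print(q_update(Q, "s0", "right", r=1.0, s_next="s1", alpha=0.5, gamma=0.9))

# alpha = 1 replaces the old value with the Bellman target r + gamma * max Q[s',a']
print(q_update(Q, "s0", "right", r=1.0, s_next="s1", alpha=1.0, gamma=0.9))
```

Repeating this update over many transitions is what lets the early, possibly wrong, estimates improve over time.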

  • For the Atari Breakout game shown in the first videos, building the Q(s,a) table from raw pixels as the state space (four 84×84 screens) is infeasible: assuming 256 gray levels per pixel, there are 256^(84×84×4) ≈ 10^67970 possible game states, and the table would need one row for each of them.

    This is the point where deep learning steps in. Neural networks are exceptionally good at coming up with good features for highly structured data.

    We could represent our Q-function with a neural network that takes the state (four game screens) and action as input and outputs the corresponding Q-value.

    Alternatively, the network could take only the game screens as input and output the Q-value for each possible action. This approach has the advantage that if we want to perform a Q-value update or pick the action with the highest Q-value, we only have to do one forward pass through the network and have the Q-values for all actions immediately available.
  • This is a classical convolutional neural network with three convolutional layers, followed by two fully connected layers. People familiar with object recognition networks may notice that there are no pooling layers.

    But if you really think about it, pooling layers buy you translation invariance: the network becomes insensitive to the location of an object in the image. That makes perfect sense for a classification task like ImageNet, but for games the location of the ball is crucial in determining the potential reward, and we wouldn't want to discard this information!
  • So we could say that Q-learning incorporates exploration as part of the algorithm. But this exploration is "greedy": it settles on the first effective strategy it finds.

    In their system DeepMind actually decreases ε over time from 1 to 0.1. In the beginning the system makes completely random moves to explore the state space maximally, and then it settles down to a fixed exploration rate.
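
A minimal sketch of ε-greedy action selection with a decaying ε is shown below. The note only states that ε goes from 1 to 0.1, so the linear schedule, the `decay_steps` value of 1000, and the function names are assumptions made for illustration.

```python
import random


def epsilon_at(step, eps_start=1.0, eps_end=0.1, decay_steps=1000):
    """Assumed linear decay from eps_start to eps_end, then constant."""
    fraction = min(step / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


q_values = [0.1, 0.9, 0.3]  # made-up Q-values for three actions
for step in (0, 500, 1000, 5000):
    eps = epsilon_at(step)
    action = epsilon_greedy(q_values, eps)
    print(step, round(eps, 2), action)
```

With ε=0 the function always returns the index of the largest Q-value, and with ε=1 every action is chosen uniformly at random, which corresponds to the completely random moves at the start of training.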

  • A 15-month-old infant can interpret the intentions of a human demonstrator, even when seeing the demonstration for the first time.
  • Reinforcement learning is used to develop a distributed control structure for a set of distributed generation sources. The exchange of information between these sources is governed by a communication graph topology.

    Reinforcement learning algorithms can be built to reduce the transit time for stocking and retrieving products in a warehouse, optimizing space utilization and warehouse operations.

    Pit.ai uses reinforcement learning to evaluate trading strategies.

    A dynamic treatment regime (DTR) is a subject of medical research that sets rules for finding effective treatments for patients. Diseases like cancer demand treatment in which drugs and treatment levels are administered over a long period. Reinforcement learning addresses this DTR problem: RL algorithms help process clinical data to come up with a treatment strategy, using various clinical indicators collected from patients as inputs.
