Intro to Reinforcement Learning

  1. Introduction to Reinforcement Learning - Utkarsh Garg
  2. How do we learn to do stuff?
     • When a living organism is exposed to a specific stimulus (or situation), the behaviour it performs in response can be strengthened, making it more likely to repeat that behaviour the next time it encounters the same stimulus.
     • The organism's behaviour is shaped by detectable changes in its environment, i.e. external signals that influence an activity. For example, our bodies can detect touch, sound, vision, etc.
     • The organism's brain uses reinforcement or punishment to change the likelihood of a behaviour. This involves voluntary behaviour, as in the following example from animal behaviour:
     • A dog can be trained to jump higher when rewarded with treats; its behaviour is reinforced by the treats to perform specific actions.
  3. With advancements in robotic arm manipulation, Google DeepMind's AlphaGo beating a professional Go player, and more recently the OpenAI team beating a professional Dota 2 player, the field of reinforcement learning has exploded in recent years. Before we look at how these systems accomplished such feats, let's first learn about the building blocks of reinforcement learning. Let's learn to crawl before we run!
  4. Grid World
     • A maze-like problem: the agent lives in a grid, and walls block the agent's path.
     • Noisy movement: actions do not always go as planned. 80% of the time, the action North takes the agent North (if there is no wall there); 10% of the time it takes the agent West, and 10% East. If there is a wall in the direction the agent would have been taken, the agent stays put.
     • The agent receives a reward each time step: a small "living" reward each step (which can be negative), and big rewards at the end (good or bad).
     • Goal: maximize the sum of rewards.
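The 80/10/10 noisy movement above can be written as a tiny transition function. This is only a sketch under assumed conventions: a (row, col) coordinate grid, a set of wall cells, and an illustrative living reward; none of these names come from the slides.

```python
import random

# Illustrative noisy-movement rule: 80% intended direction, 10% each side.
NOISE = {"N": [("N", 0.8), ("W", 0.1), ("E", 0.1)],
         "S": [("S", 0.8), ("E", 0.1), ("W", 0.1)],
         "E": [("E", 0.8), ("N", 0.1), ("S", 0.1)],
         "W": [("W", 0.8), ("S", 0.1), ("N", 0.1)]}
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

def noisy_step(state, action, walls, living_reward=-0.04):
    """Apply a noisy action; bumping into a wall leaves the agent in place."""
    directions, probs = zip(*NOISE[action])
    actual = random.choices(directions, weights=probs, k=1)[0]
    dr, dc = MOVES[actual]
    next_state = (state[0] + dr, state[1] + dc)
    if next_state in walls:          # wall in the sampled direction: stay put
        next_state = state
    return next_state, living_reward

# e.g. trying to go North from (2, 1) on a grid with a wall at (1, 1)
print(noisy_step((2, 1), "N", walls={(1, 1)}))
```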
  5. Deterministic Grid World vs. Stochastic Grid World
  6. Another Example
     • We need to travel from point A to point B. Each segment shows its travel time in minutes; A to C takes 4 minutes.
     • The shortest path in this problem is A-C-D-E-G-H-B. This is a deterministic problem.
     • Now let's introduce traffic on each segment with some probability: say there is a 25% chance it takes 10 minutes and a 75% chance it takes 3 minutes to reach point C from point A, with similar probabilities for the other segments.
     • If we run the simulation multiple times, the shortest-time path differs from iteration to iteration because of the randomness the traffic introduces into the system. This is called a stochastic process.
     • Finding the shortest-time route is no longer straightforward, and in the real world we may not even know these probabilities. Our goal is now to find the most probable shortest path.
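A quick way to see the stochastic segment in action is to sample it many times. The sketch below simulates only the A-to-C segment from the slide (25% chance of 10 minutes, 75% chance of 3 minutes); the function name is hypothetical.

```python
import random

# Sample the stochastic A -> C travel time many times and average it.
def sample_segment_time():
    return 10 if random.random() < 0.25 else 3   # minutes

runs = [sample_segment_time() for _ in range(100_000)]
print(sum(runs) / len(runs))   # close to 0.25*10 + 0.75*3 = 4.75 minutes
```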
  7. Reinforcement Learning
     • Reinforcement learning (RL) is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.
  8. A simple example of the above system:
     • Imagine a baby is given a TV remote control at your home (environment).
     • The baby (agent) will first observe the TV and its state (whether it's on or off, which channel is playing, etc.).
     • The curious baby will then take certain actions, like hitting the remote control (action), and observe how the TV responds (next state).
     • A non-responding TV is dull, so the baby dislikes it (receiving a negative reward) and will take fewer of the actions that lead to such a result (updating the policy), and vice versa.
     • The baby repeats this process until he/she finds a policy (what to do under different circumstances) that he/she is happy with (maximizing the total (discounted) reward).
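The baby-and-TV story is the standard agent-environment interaction loop. Below is a minimal, library-free sketch of that loop; `env` and `agent` are hypothetical objects with assumed `reset`/`step`/`act`/`update` methods, not any specific library's API.

```python
# Observe -> act -> receive reward -> update policy, repeated until the episode ends.
def run_episode(env, agent):
    state = env.reset()                      # the baby observes the TV's state
    total_reward, done = 0.0, False
    while not done:
        action = agent.act(state)            # e.g. hit the remote, press a button
        next_state, reward, done = env.step(action)
        agent.update(state, action, reward, next_state)  # adjust the policy
        total_reward += reward
        state = next_state
    return total_reward
```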
  9. BREAKOUT
  10. Reward and Policy
     • The reward structure of our system depends on how and what we want the system to learn.
     • (Figure: grid worlds with different living rewards, R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, and R(s) = -2.0.)
  11. • We not only want the system to greedily take whatever highest reward it can get right now; we also want it to consider future rewards. Why? It leads to better strategies!
  12. Therefore, we want to:
     • Maximize the sum of rewards.
     • Prefer rewards now over rewards later, since we are dealing with a stochastic process and we never know whether the action we take will actually lead to the target state with the reward.
  13. Calculating Rewards
     • In the picture on the left, the two paths are policies. Each circle is a state and each diamond a reward.
     • The agent needs to decide on the optimal path (or policy) that maximizes its total reward.
     • If this were a deterministic process, both paths would lead to the same sum of rewards. But since we are dealing with a stochastic process, we cannot count on reaching the 4th circle, as the policy may not take us to the maximum reward.
     • One way to model this is to exponentially decay future rewards, where 𝛾 (gamma) is the decay factor. The reward equation becomes: Total discounted reward = r_1 + 𝛾 r_2 + 𝛾² r_3 + 𝛾³ r_4 + 𝛾⁴ r_5 + …
     • This equation gives us a quantitative basis to say that the agent would prefer path 1, since its total discounted reward is higher than in the second case.
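The discounted-reward formula on this slide is a one-liner in code. A minimal sketch, assuming the rewards along a path are collected as a plain list; the example numbers are illustrative.

```python
# Total discounted reward = r_1 + gamma*r_2 + gamma^2*r_3 + ...
def discounted_return(rewards, gamma=0.9):
    return sum(r * gamma**t for t, r in enumerate(rewards))

# A small reward now vs. a bigger reward later (illustrative numbers):
print(discounted_return([1, 0, 0, 0]))   # 1.0
print(discounted_return([0, 0, 0, 2]))   # 2 * 0.9**3 ≈ 1.458
```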
  14. Done with the basics. Let's go deeper!
  15. Q-Learning: What is Q?
     • Q-value: Q(s, a) is the total discounted reward when the agent takes action a in state s and then follows the optimal path afterwards; that is why we take the max over all actions in the equation below:
       Q*(s, a) = R(s, a) + 𝛾 max over a′ of Q*(s′, a′)
     • Q*(s, a) is this value for the best action at state s. By having this value for all combinations of states and actions, we obtain the Q-table.
     • Rewards used in the example grid (with 𝛾 = 0.9):
       Reward   | Value
       1 step   | -0.04
       Power    | +0.5
       Mines    | -10
       End      | +1 or -1
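A hedged sketch of tabular Q-learning is shown below, using the slide's 𝛾 = 0.9. The learning rate, episode count, ε-greedy exploration, and the `env.reset()/step()/actions()` interface are all assumptions for illustration, not the exact setup behind the slide's grid example.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                      # Q[(state, action)] -> value
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            actions = env.actions(state)
            if random.random() < epsilon:       # explore
                action = random.choice(actions)
            else:                               # exploit the current estimate
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            best_next = 0.0 if done else max(Q[(next_state, a)]
                                             for a in env.actions(next_state))
            # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            Q[(state, action)] += alpha * (reward + gamma * best_next
                                           - Q[(state, action)])
            state = next_state
    return Q
```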
  16. Learned Q-Values
  17. Exploration vs. Exploitation
     • There is an important trade-off between exploration and exploitation in reinforcement learning.
     • Exploration is about finding more information about the environment, whereas exploitation uses already-known information to maximize reward.
     • Real-life example: say you go to the same restaurant (which you like) every day. You are basically exploiting. If, on the other hand, you search for a new restaurant each time before picking one, that is exploration. Exploration is very important in the search for future rewards, which might be higher than the near-term ones; you may find a new restaurant that is even better than the one you were exploiting.
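The restaurant story maps directly onto ε-greedy action selection: exploit the best-known option most of the time, explore a random one with probability ε. A small illustrative sketch (the restaurant names and ratings are made up):

```python
import random

def pick_restaurant(estimated_rating, epsilon=0.1):
    if random.random() < epsilon:                            # explore
        return random.choice(list(estimated_rating))
    return max(estimated_rating, key=estimated_rating.get)   # exploit

ratings = {"usual place": 4.2, "new thai place": 0.0, "food truck": 0.0}
print(pick_restaurant(ratings, epsilon=0.2))
```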
  18. Generalization across States
     • Basic Q-learning keeps a table of all Q-values.
     • In realistic situations, we cannot possibly learn about every single state: there are too many states to visit them all in training, and too many to hold the Q-table in memory.
     • Instead, we want to generalize: learn about a small number of training states from experience, then generalize that experience to new, similar situations.
     • This is a fundamental idea in machine learning, and we'll see it over and over again.
  19. Generalization Example 1: Flappy Bird
     • State space: discretized vertical distance from the lower pipe, discretized horizontal distance from the next pair of pipes, and life (dead or living).
     • Actions: click, or do nothing.
     • Rewards: +1 if Flappy Bird is still alive, -1000 if Flappy Bird is dead.
     • Trained with 6-7 hours of Q-learning.
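As a sketch of what "discretized" means here, the continuous pipe distances can be bucketed so that many nearby raw observations collapse into one table state. The bucket size and function name below are assumptions, not the actual Flappy Bird project's code.

```python
# Collapse continuous distances into coarse buckets to keep the Q-table small.
def discretize(dx_to_pipe, dy_to_pipe, alive, bucket=10):
    return (int(dx_to_pipe) // bucket, int(dy_to_pipe) // bucket, alive)

state = discretize(dx_to_pipe=57.3, dy_to_pipe=-24.9, alive=True)
print(state)   # e.g. (5, -3, True)
```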
  20. Generalization Example 2
     • Let's say we discover through experience that this state is bad.
     • In naïve Q-learning, we know nothing about this state... or even this one!
  21. Feature-Based Representation
     • Solution: describe a state using a vector of features (properties).
     • Features are functions from states to real numbers (often 0/1) that capture important properties of the state.
     • Example features: distance to the closest ghost, distance to the closest dot, number of ghosts, 1 / (distance to dot)², whether Pacman is in a tunnel (0/1), ... or even whether it is the exact state on this slide.
     • We can also describe a q-state (s, a) with features (e.g. "this action moves closer to food").
     • Now, instead of a Q-table, we have these features, on which we can train any supervised learning algorithm to learn the Q-values and hence the right actions.
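This is the idea behind approximate (feature-based) Q-learning: Q(s, a) becomes a weighted sum of features, and learning updates the weights instead of table cells, so similar states share what was learned. A minimal sketch, assuming a hand-written feature dictionary like the Pacman examples above; the feature names, values, and hyperparameters are illustrative.

```python
# Q(s, a) as a linear combination of features.
def q_value(weights, features):
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def update(weights, features, reward, max_next_q, q_sa, alpha=0.1, gamma=0.9):
    """One approximate Q-learning step: nudge each weight by the TD error."""
    td_error = reward + gamma * max_next_q - q_sa
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) + alpha * td_error * value
    return weights

features = {"dist_to_closest_ghost": 0.3, "inv_dist_to_dot_sq": 0.8, "in_tunnel": 0.0}
weights = update({}, features, reward=-1.0, max_next_q=0.5, q_sa=0.0)
print(weights, q_value(weights, features))
```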
  22. Generalization Example 3 (play video)
     • Task: learn to hover.
     • States: data from various sensors.
     • Four actions available: the average angle of the blades, the difference in angle between front and back, the difference in angle between left and right, and the angle of the tail rotor.
     • Note: the most efficient policy it found was to fly inverted!
  23. Going even deeper…
  24. Deep Q-Networks (DQN)
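A DQN replaces the Q-table with a neural network that maps a state vector to one Q-value per action and is trained toward the TD target r + 𝛾 max over a′ of Q(s′, a′). Below is a hedged PyTorch sketch of just that training step; the network size, hyperparameters, and tensor shapes are assumptions, and a full DQN would also use a replay buffer and a separate target network (omitted here for brevity).

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99
# Small network: state vector in, one Q-value per action out.
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def train_step(states, actions, rewards, next_states, dones):
    # Q(s, a) for the actions that were actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # TD target: r + gamma * max_a' Q(s', a'), zeroed on terminal states
        target = rewards + gamma * q_net(next_states).max(1).values * (1 - dones)
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative batch of 8 random transitions:
s = torch.randn(8, state_dim); a = torch.randint(n_actions, (8,))
r = torch.randn(8); s2 = torch.randn(8, state_dim); d = torch.zeros(8)
print(train_step(s, a, r, s2, d))
```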
  25. AlphaGo
     • In 2016, the initial version, AlphaGo Lee, beat 18-time world champion Lee Sedol.
     • Just a year later came AlphaGo Zero, which, unlike its predecessor, was trained without any data from real human games; it learned only by playing against itself. The 2016 version was defeated 100-0 by AlphaGo Zero.
     • Go has shown us that AI has started to move beyond what humans can tell it to do. This was shown when AlphaGo made move 37: to humans, even the world champion, it seemed a bad move, but it turned out to be a game-changing move that led to AlphaGo's victory.
     • Architecture link: https://applied-data.science/static/main/res/alpha_go_zero_cheat_sheet.png
  26. AlphaGo Training Graph
  27. Self-Driving Cars
     • A supervised-learning-based self-driving car (with a simulator): https://www.youtube.com/watch?v=EaY5QiZwSP4&t=1111s
     • The reinforcement learning way to do this: https://wayve.ai/blog/learning-to-drive-in-a-day-with-reinforcement-learning
  28. Landing SpaceX Rockets: https://www.youtube.com/watch?v=4_igzo4qNmQ
  29. Thank You
