Reinforcement Learning

  1. 1. Reinforcement Learning
  2. 2. Learning Through Interaction
      • Sutton: “When an infant plays, waves its arms, or looks about, it has no explicit teacher, but it does have a direct sensorimotor connection to its environment. Exercising this connection produces a wealth of information about cause and effect, about the consequences of actions, and about what to do in order to achieve goals.”
      • Reinforcement learning is a computational approach to this type of learning. It adopts an AI perspective to model learning through interaction.
  3. 3. Reinforcement Learning
      • A single agent interacts with the system: it takes an action, receives a reward for that action, and moves to the next state.
      • Online learning becomes plausible.
  4. 4. Reinforcement Objective
      • Learn the relation between the current situation (state) and the action to be taken in order to optimize a “payment”.
      • Predict the expected future reward given the current state (s):
        1. Which actions should we take in order to maximize our gain?
        2. Which actions should we take in order to maximize the click rate?
      • The action that is taken influences the next step (a “closed loop”).
      • The learner has to discover which action to take (in ML terminology, we can write the feature vector with some features being functions of others).
  5. 5. RL Elements
      • State (s): the place where the agent is right now. Examples:
        1. A position on a chess board
        2. A potential customer on a sales website
      • Action (a): an action that the agent can take while it is in a state. Examples:
        1. Knight pawn captures bishop
        2. The user buys a ticket
      • Reward (r): the reward that is obtained as a result of the action. Examples:
        1. A better or worse position
        2. More money or more clicks
  6. 6. Basic Elements (Cont.)
      • Policy (π): the “strategy” by which the agent decides which action to take. Abstractly speaking, the policy is simply a probability distribution over actions that is defined for each state.
      • Episode: a sequence of states and their actions.
      • $V_\pi(s)$: the value function of a state $s$ when using policy π. Mostly it is the expected reward (e.g. in chess, the expected final outcome of the game if we follow a given strategy).
      • $V(s)$: similar to $V_\pi(s)$ but without a fixed policy (the expected reward over all possible trajectories starting from $s$).
      • $Q(s,a)$: the analog of $V(s)$ on the state-action plane, i.e. the value function for state $s$ and action $a$.
  7. 7. Examples
      • Tic-Tac-Toe
  8. 8. GridWorld (0,-1,10,5)
  9. 9. Slot Machines: n-Armed Bandit
      • We wish to find the best slot machine (best = maximum reward).
      • Strategy: play, and find the machine with the biggest reward (on average).
      • At the beginning we pick each slot machine randomly.
      • After several steps we gain some knowledge. How do we choose which machine to play?
        1. Should we always use the best machine?
        2. Should we pick one randomly?
        3. Is there any other mechanism?
  10. 10. Exploration, Exploitation & Epsilon Greedy
      • The common trade-off:
        1. Always play the best machine (exploitation): we may miss better machines due to statistical “noise”.
        2. Choose a machine randomly (exploration): we do not take the optimal machine.
      • Epsilon greedy: we exploit with probability (1 − ε) and explore with probability ε. Typically ε = 0.1.
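The epsilon-greedy rule on this slide can be sketched in a few lines. This is a minimal illustration, not the presenter's code: the function name `run_epsilon_greedy`, the Gaussian payout with unit variance, and the three example machines are all assumptions made for the example.

```python
import random

def run_epsilon_greedy(true_means, steps=5000, epsilon=0.1, seed=0):
    """Play a k-armed bandit with epsilon-greedy action selection."""
    rng = random.Random(seed)
    k = len(true_means)
    estimates = [0.0] * k   # running average reward per machine
    counts = [0] * k        # number of pulls per machine
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                           # explore: random machine
        else:
            arm = max(range(k), key=lambda a: estimates[a])  # exploit: best estimate so far
        reward = rng.gauss(true_means[arm], 1.0)             # noisy payout (assumed Gaussian)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean
    return estimates, counts

# Hypothetical machines with average payouts 0.1, 0.5 and 1.5.
estimates, counts = run_epsilon_greedy([0.1, 0.5, 1.5])
print(estimates, counts)
```

With ε = 0.1 roughly one pull in ten is spent on exploration, which is what lets the estimates of the weaker machines recover from early statistical noise.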
  11. 11. Episodes
      • Some problems (like the n-armed bandit) are “next best action” problems:
        1. A single given state
        2. A set of options that are associated with this state
        3. A reward for each action
      • Sometimes we wish to learn journeys. Examples:
        1. Teach a robot to go from point A to point B
        2. Find the fastest way to drive home
  12. 12. Episode (Cont.)
      • An episode is:
        1. A “time series” of states $\{S_1, S_2, S_3, \dots, S_K\}$
        2. For each state $S_i$ there is a set of options $\{O_1, O_2, \dots, O_{k_i}\}$
        3. The learning formula (the “gradient”) depends not only on the immediate rewards but on the next state as well
  13. 13. Reinforcement Learning: Foundation
      • The observed sequence: $s_t, a_t, R_{t+1}, s_{t+1}, a_{t+1}, R_{t+2}, \dots, s_T, a_T, R_{T+1}$ (s: state, a: action, R: reward).
      • We optimize our goal function (commonly maximizing the average): $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{l} R_{t+l+1}$, where $0 < \gamma \le 1$ is the aging (discount) factor.
      • Classical example, pole balancing: find the exact force to apply in order to keep the pole up. The reward is 1 for every time step in which the pole did not fall.
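The return $G_t$ can be computed for a finite reward sequence by accumulating from the last reward backwards. The sketch below is only an illustration; the function name `discounted_return` and the sample rewards are made up for the example.

```python
def discounted_return(rewards, gamma):
    """Return R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... for a finite list."""
    g = 0.0
    for r in reversed(rewards):   # work backwards: G = r + gamma * (return of the rest)
        g = r + gamma * g
    return g

# Three steps of reward 1 (e.g. the pole stayed up three times), gamma = 0.5:
# 1 + 0.5*1 + 0.25*1 = 1.75
print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

The backward recursion used here is the same identity, $G_t = R_{t+1} + \gamma G_{t+1}$, that underlies the Bellman equation.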
  14. 14. Markov Decision Process (MDP)
      • Markov property: $\Pr\{S_{t+1} = s', R_{t+1} = r \mid S_0, A_0, R_1, \dots, S_{t-1}, A_{t-1}, R_t, S_t, A_t\} = \Pr\{S_{t+1} = s', R_{t+1} = r \mid S_t, A_t\}$, i.e. the current state captures the entire history.
      • Markov processes are fully determined by the transition matrix $P$.
      • Markov process (or Markov chain): a tuple $\langle S, P \rangle$, where $S$ is a set of states (mostly finite) and $P$ is a state transition probability matrix, namely $P_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$.
  15. 15. MDP (Cont.)
      • A Markov Reward Process (MRP) is a Markov chain with values: a tuple $\langle S, P, R, \gamma \rangle$.
      • $S$ and $P$ are as in a Markov process; $R$ is a reward function, $R_s = \mathbb{E}[R_{t+1} \mid S_t = s]$.
      • $\gamma$ is a discount factor, $\gamma \in [0, 1]$ (as in $G_t$).
      • State value function for an MRP: $v(s) = \mathbb{E}[G_t \mid S_t = s]$.
  16. 16. MDP (Cont.): Bellman Equation
      • $v(s) = \mathbb{E}[G_t \mid S_t = s] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s] = \mathbb{E}[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \dots) \mid S_t = s] = \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s]$
      • We get a recursion rule: $v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]$
      • Similarly, we can define a value on the state-action space: $Q(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]$
      • An MDP is an MRP with a finite set of actions $A$.
  17. 17. Policy
      • Recall: the policy π is the strategy; it maps states to actions: $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$
      • We assume that for each time $t$ and state $S_t$, $\pi(\cdot \mid S_t)$ is fixed (π is stationary).
      • Clearly, for an MDP, a given policy π modifies the MDP: $R \to R^\pi$, $P \to P^\pi$.
      • We modify V and Q accordingly: $V_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$ and $Q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
  18. 18. Optimal Value Function
      • For V (or Q), the optimal value function $v_*$ is defined for each state $s$ as $v_*(s) = \max_\pi v_\pi(s)$, where π ranges over policies.
      • Solving an MDP ≡ finding the optimal value function.
      • Optimal policy: $\pi \ge \pi'$ if $v_\pi(s) \ge v_{\pi'}(s)$ for all $s$.
      • Theorem: for every MDP there exists an optimal policy.
  19. 19. Optimal Value (Cont.)
      • If we know $q_*(s,a)$ we can find the optimal policy by acting greedily with respect to it: $\pi_*(s) = \arg\max_a q_*(s,a)$
  20. 20. Solution Methods
      • Dynamic programming
      • Monte Carlo
      • TD methods
      Objectives
      • Prediction: find the value function of a given policy
      • Control: find the optimal policy
  21. 21. Dynamic Programming
      • A class of algorithms used in many applications, such as graph theory (shortest path) and bioinformatics. It relies on two essential properties:
        1. The problem can be decomposed into sub-solutions
        2. Solutions can be cached and reused
        An MDP satisfies both.
      • We assume full knowledge of the MDP.
      • Prediction. Input: MDP and policy π. Output: value function $v_\pi$.
      • Control. Input: MDP. Output: optimal value function $v_*$ and optimal policy $\pi_*$.
  22. 22. Prediction: Value Function of a Given Policy
      • Given a policy π and an MDP, we wish to find $V_\pi(s)$: $V_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$
      • Since the policy and the MDP are known, this is a system of linear equations in the values $v_\pi(s)$, but solving it directly is extremely tedious.
      • Let's do something iterative (Sutton & Barto).
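The iterative approach can be sketched as repeated Bellman backups until the values stop changing. This is a minimal sketch in the spirit of Sutton & Barto's iterative policy evaluation, not a copy of the algorithm shown on the slide. The dictionary layout (`transitions[state][action]` as a list of `(probability, next_state, reward)` triples), the function name, and the two-state toy MDP are assumptions made for the example.

```python
def policy_evaluation(transitions, policy, gamma=0.9, tol=1e-10):
    """Sweep the Bellman expectation backup over all states until convergence."""
    v = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            # Expected one-step reward plus discounted value of the successor,
            # averaged over the policy's action probabilities.
            new_v = sum(
                policy[s][a] * sum(p * (r + gamma * v[s2]) for p, s2, r in outcomes)
                for a, outcomes in actions.items()
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

# Toy MDP (hypothetical): in state "A" we can stay (reward 1) or quit (reward 0).
transitions = {
    "A": {"stay": [(1.0, "A", 1.0)], "quit": [(1.0, "end", 0.0)]},
    "end": {},  # terminal state: no actions, value stays 0
}
policy = {"A": {"stay": 0.5, "quit": 0.5}}
v = policy_evaluation(transitions, policy)
print(v)
```

For this toy problem the linear equation is $v(A) = 0.5\,(1 + 0.9\,v(A))$, so the loop should settle at $v(A) = 0.5 / 0.55$.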
  23. 23. Policy Improvement (Policy Iteration)
      • Following the previous algorithm, one can use an algorithm (often a greedy one) to improve the policy. Alternating evaluation and improvement leads to an optimal value function.
  24. 24. Value Iteration
      • Policy iteration requires updating the policy, which can be heavy.
      • We can study $V_*$ directly and obtain the policy from it by acting greedily.
      • The idea is that $V_*$ satisfies the Bellman optimality equation.
      • Hence we can find $V_*$ iteratively (and derive the optimal policy).
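A compact sketch of value iteration follows, assuming the standard Bellman optimality backup $v(s) \leftarrow \max_a \sum_{s'} p(s' \mid s,a)\,[r + \gamma v(s')]$. The MDP layout (`mdp[state][action]` as a list of `(probability, next_state, reward)` triples), the function name, and the toy problem are illustrative assumptions, not taken from the slides.

```python
def value_iteration(mdp, gamma=0.9, tol=1e-10):
    """Iterate the Bellman optimality backup, then read off a greedy policy."""
    def backup(s, a, v):
        return sum(p * (r + gamma * v[s2]) for p, s2, r in mdp[s][a])

    v = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s in mdp:
            if not mdp[s]:
                continue                      # terminal state keeps value 0
            best = max(backup(s, a, v) for a in mdp[s])
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            break
    # The policy is derived only once, at the end, from the converged values.
    policy = {s: max(mdp[s], key=lambda a: backup(s, a, v)) for s in mdp if mdp[s]}
    return v, policy

# Toy MDP (hypothetical): staying pays 1 per step forever, quitting pays 5 once.
mdp = {
    "A": {"stay": [(1.0, "A", 1.0)], "quit": [(1.0, "end", 5.0)]},
    "end": {},
}
values, greedy_policy = value_iteration(mdp)
print(values, greedy_policy)
```

With γ = 0.9, staying forever is worth 1 / (1 − 0.9) = 10, which beats the one-off 5, so the greedy policy should pick "stay". Note that no explicit policy is maintained during the sweeps, which is the saving over policy iteration.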
  25. 25. DP: Remarks
      • The formula supports online updates
      • Bootstrapping
      • In most cases we do not have the MDP
  26. 26. Monte Carlo Methods
      • A model-free method (we do not need the MDP):
        1. It learns from generated episodes.
        2. It must complete an episode to obtain the required average.
        3. It is unbiased.
      • For a policy π, with $S_0, A_0, R_1, \dots, S_t \sim \pi$, we use the empirical mean return rather than the expected return: $V(S_t) \leftarrow V(S_t) + \frac{1}{N(t)} [G_t - V(S_t)]$, where $N(t)$ is the number of visits at time $t$.
      • For non-stationary cases we update differently: $V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)]$
      • In MC the episode must terminate before we get the value (we calculate the mean explicitly), hence in grid problems it may perform poorly.
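The incremental-mean update can be illustrated with an every-visit Monte Carlo estimator. This is a sketch under assumptions: episodes are given as lists of `(state, reward)` pairs, where the reward is the one received on leaving that state, the visit counter is kept per state, and the names are invented for the example.

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=1.0):
    """Every-visit Monte Carlo: average the observed returns per state."""
    values = defaultdict(float)
    visits = defaultdict(int)
    for episode in episodes:
        g = 0.0
        # Returns are only known once the episode is over, so walk it backwards.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            visits[state] += 1
            values[state] += (g - values[state]) / visits[state]  # V += (G - V) / N
    return dict(values)

# Two hypothetical completed episodes, each visiting A and then B.
episodes = [
    [("A", 1.0), ("B", 0.0)],   # returns: G(A) = 1, G(B) = 0
    [("A", 0.0), ("B", 2.0)],   # returns: G(A) = 2, G(B) = 2
]
V = mc_prediction(episodes)
print(V)   # mean returns: A -> 1.5, B -> 1.0
```

Replacing `1 / visits[state]` by a constant step size α gives the non-stationary variant from the slide.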
  27. 27. Monte Carlo Control
      • Learn the optimal policy (using the Q function).
  28. 28. Temporal Difference (TD)
      • Motivation: combining DP and MC.
        Like MC: learning from experience, no explicit MDP.
        Like DP: bootstrapping, no need to complete the episodes.
      • Prediction: recall that for MC we have $V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)]$, where $G_t$ is known only at the end of the episode.
  29. 29. TD Methods (Cont.)
      • The TD method needs to wait only until the next step (TD(0)).
      • We can see that it leads to different targets:
        MC: $G_t$
        TD: $R_{t+1} + \gamma V(S_{t+1})$
      • Hence it is a bootstrapping method.
      • The estimation of V given a policy is straightforward, since the policy chooses $S_{t+1}$.
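A single TD(0) step, using the target $R_{t+1} + \gamma V(S_{t+1})$ in place of the full return, might look as follows. The function name, the dictionary-based value table, and the numbers are assumptions for illustration.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=1.0):
    """One TD(0) step: move V(state) toward reward + gamma * V(next_state)."""
    old = values.get(state, 0.0)
    target = reward + gamma * values.get(next_state, 0.0)   # bootstrapped target
    values[state] = old + alpha * (target - old)
    return values

# Hypothetical table: V(B) is already estimated at 1.0, V(A) is still 0.
values = {"A": 0.0, "B": 1.0}
td0_update(values, "A", reward=0.5, next_state="B", alpha=0.5)
print(values["A"])   # 0 + 0.5 * (0.5 + 1.0 - 0) = 0.75
```

The update can be applied right after each transition, so the episode never has to finish before learning starts.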
  30. 30. Driving Home Problem
  31. 31. TD vs. MC: Summary
      MC
      • High variance, unbiased
      • Good convergence
      • Easy to understand
      • Low sensitivity to initial conditions (i.c.)
      TD
      • Efficiency
      • Convergence to $V_\pi$
      • More sensitive to initial conditions
  32. 32. SARSA
      • On-policy method for learning the Q function (update after every step).
      • The next step is to use SARSA to develop a control algorithm as well: we learn the Q function on-policy and update the policy toward greediness.
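The slide's update formula did not survive extraction, so the sketch below uses the standard SARSA rule, $Q(S,A) \leftarrow Q(S,A) + \alpha\,[R + \gamma Q(S',A') - Q(S,A)]$, as an assumption. The name `sarsa_update` and the example states are hypothetical.

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One SARSA step using the tuple (S, A, R, S', A')."""
    old = q.get((s, a), 0.0)
    # On-policy: the target uses the action a_next that the policy actually chose.
    target = r + gamma * q.get((s_next, a_next), 0.0)
    q[(s, a)] = old + alpha * (target - old)
    return q[(s, a)]

q = {("s1", "right"): 2.0}
new_value = sarsa_update(q, "s0", "right", 1.0, "s1", "right", alpha=0.5, gamma=0.9)
print(new_value)   # 0 + 0.5 * (1.0 + 0.9 * 2.0 - 0), about 1.4
```

Pairing this update with an epsilon-greedy choice of `a_next` from the same `q` table gives the on-policy control loop the slide refers to.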
  33. 33. On Policy Control Algorithm
  34. 34. Example Windy Grid-World
  35. 35. Q-Learning: Off-Policy
      • Rather than learning from the action that has been offered, we simply take the best action for the state.
      • The control algorithm is straightforward.
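A small end-to-end sketch of tabular Q-learning is given below. The environment is an invented five-state corridor (start on the left, reward 1 for reaching the right end), and the function name, hyperparameters and random tie-breaking are assumptions for the example rather than anything shown in the slides.

```python
import random

def q_learning_corridor(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
                        epsilon=0.1, seed=0):
    rng = random.Random(seed)
    actions = (-1, +1)                       # step left or step right
    goal = n_states - 1
    q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    for _ in range(episodes):
        s = 0
        while s != goal:
            if rng.random() < epsilon:
                a = rng.choice(actions)      # explore
            else:                            # greedy, ties broken at random
                best = max(q[(s, b)] for b in actions)
                a = rng.choice([b for b in actions if q[(s, b)] == best])
            s_next = min(max(s + a, 0), goal)
            r = 1.0 if s_next == goal else 0.0
            # Off-policy target: best next action, regardless of what is played next.
            target = r + gamma * max(q[(s_next, b)] for b in actions)
            q[(s, a)] += alpha * (target - q[(s, a)])
            s = s_next
    return q

q_table = q_learning_corridor()
greedy_actions = [max((-1, +1), key=lambda a: q_table[(s, a)]) for s in range(4)]
print(greedy_actions)   # the learned greedy policy for states 0..3
```

The only difference from SARSA is the `max` in the target: the behaviour policy explores with ε, while the learned values describe the greedy policy.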
  36. 36. Value Function Approximation
      • Sometimes we have a large-scale RL problem:
        1. TD backgammon (Appendix)
        2. Go (DeepMind)
        3. Helicopter (continuous)
      • Our objectives are still control and prediction, but we have a huge number of states.
      • The tabular solutions that we presented are not scalable.
      • Value function approximation will allow us to use models.
  37. 37. Value Function (Cont.)
      • Consider a large (continuous) MDP: $V_\pi(s) \approx V'_\pi(s, w)$ and $Q_\pi(s,a) \approx Q'_\pi(s, a, w)$, where $w$ is a set of function parameters.
      • We can train them with both TD and MC.
      • We can extend values to unseen states.
  38. 38. Types of Approximation
      1. Linear combinations
      2. Neural networks (lead to DQN)
      3. Wavelet solutions
  39. 39. Function Approximation: Techniques
      • Define a feature vector $X(S)$ for the state $S$, e.g.:
        Distance from target
        Trend in a stock
        Chess board configuration
      • Training method for $W$: SGD
      • A linear function takes the form $V'_\pi = \langle X(S), W \rangle$
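For the linear case, an SGD step on the squared error between a target value and $\langle X(S), W \rangle$ moves the weights along the feature vector. The sketch below assumes a single fixed target (for instance a Monte Carlo return); the function names and numbers are illustrative only.

```python
def linear_value(features, weights):
    """V'(s) = <X(s), W>: inner product of feature vector and weights."""
    return sum(x * w for x, w in zip(features, weights))

def sgd_step(features, weights, target, alpha=0.1):
    """One SGD step on 0.5 * (target - <X, W>)^2."""
    error = target - linear_value(features, weights)
    return [w + alpha * error * x for x, w in zip(features, weights)]

# Hypothetical state with two features and an observed return of 3.0.
x = [1.0, 2.0]
w = [0.0, 0.0]
for _ in range(50):
    w = sgd_step(x, w, target=3.0)
print(linear_value(x, w))   # approaches 3.0
```

In TD learning the fixed target would be replaced by the bootstrapped one, $r + \gamma \langle X(S'), W \rangle$, treated as a constant when taking the gradient.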
  40. 40. RL-Based Problems
      • There is no supervisor, only rewards. The solutions become:
  41. 41. Deep RL
      Why use deep RL?
      • It allows us to find an optimal model (value/policy)
      • It allows us to optimize a model
      • Commonly we will use SGD
      Examples
      • Autonomous cars
      • Atari
      • DeepMind
      • TD-Gammon
  42. 42. Q-Network
      • We follow the value function approximation approach: $Q(s,a,w) \approx Q_*(s,a)$
  43. 43. Q-Learning
      • We simply follow the TD target function in a supervised manner:
        Target: $r + \gamma \max_{a'} Q(s', a', w)$
        Loss (MSE): $\left(r + \gamma \max_{a'} Q(s', a', w) - Q(s,a,w)\right)^2$
      • We solve it with SGD.
  44. 44. Q-Network: Stability Issues
      Divergences
      • Correlation between successive samples (non-i.i.d.)
      • The policy is not necessarily stationary (it influences the Q value)
      • The scale of the rewards and the Q value is unknown
  45. 45. Deep Q-Network: Experience Replay
      Replay the data from the past with the current $W$. This removes correlation in the data:
      • Pick $a_t$ according to a greedy algorithm
      • Store the tuple $(s_t, a_t, r_{t+1}, s_{t+1})$ in memory (the replay memory)
      • Now calculate the MSE
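A replay memory can be as simple as a bounded queue with uniform random sampling. The class below is a minimal sketch; the name `ReplayMemory`, the capacity, and uniform sampling without replacement are assumptions, and real DQN implementations usually store extra fields such as a terminal flag.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)   # oldest tuples are dropped first
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between successive steps.
        return self.rng.sample(list(self.buffer), batch_size)

memory = ReplayMemory(capacity=3)
for t in range(5):                 # five transitions into a memory of size three
    memory.store(t, "a", 0.0, t + 1)
batch = memory.sample(2)
print(list(memory.buffer), batch)
```

The MSE is then computed on the sampled batch rather than on the most recent transition.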
  46. 46. Experience Replay
  47. 47. DQN (Cont.): Fixed Target Q-Network
      • In order to handle oscillations, we calculate the targets with respect to old parameters $w^-$: $r + \gamma \max_{a'} Q(s', a', w^-)$
      • The loss becomes $\left(r + \gamma \max_{a'} Q(s', a', w^-) - Q(s,a,w)\right)^2$
      • Periodically we update $w^- \leftarrow w$.
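The role of the frozen parameters $w^-$ can be shown with a tiny linear Q-function standing in for the network. Everything here is an illustrative assumption: one weight vector per action, the helper names `q_value`, `td_loss` and `sync`, and the numbers.

```python
def q_value(features, action, weights):
    """Linear stand-in for the Q-network: Q(s, a, w) = <x(s), w[a]>."""
    return sum(x * w for x, w in zip(features, weights[action]))

def td_loss(transition, weights, target_weights, gamma=0.99):
    """Squared TD error with the target computed from the frozen copy w^-."""
    feats, action, reward, next_feats = transition
    target = reward + gamma * max(
        q_value(next_feats, a, target_weights) for a in target_weights
    )
    return (target - q_value(feats, action, weights)) ** 2

def sync(weights):
    """w^- <- w: copy the online parameters into the target parameters."""
    return {a: list(w) for a, w in weights.items()}

weights = {"left": [0.0, 0.0], "right": [1.0, 0.0]}
target_weights = sync(weights)      # freeze a copy
weights["right"][0] = 2.0           # online weights keep changing ...
transition = ([1.0, 0.0], "right", 1.0, [1.0, 0.0])
loss = td_loss(transition, weights, target_weights, gamma=0.5)
print(loss)   # target = 1 + 0.5 * 1.0 = 1.5, prediction = 2.0, loss = 0.25
```

Because the target is built from the copy, it does not move while `weights` is being updated, and only changes when `sync` is called again.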
  48. 48. DQN –Summary Many further methods: • RewardValue • Double DQN • Parallel Updates Requires another lecture
  49. 49. Policy Gradient
      • We have discussed:
        1. Function approximations
        2. Algorithms in which the policy is learned through the value functions
      • We can parametrize the policy using parameters θ: $\pi_\theta(s, a) = \mathbb{P}[a \mid s, \theta]$
      • Remark: we focus on model-free methods.
  50. 50. Policy-Based Methods: Good & Bad
      Good
      • Better in high dimensions
      • Faster convergence
      Bad
      • Less efficient, with high variance
      • Local minima
      Example: Rock-Paper-Scissors
  51. 51. How to Optimize a Policy?
      • We assume it is differentiable and calculate the log-likelihood.
      • We further assume a Gibbs distribution, i.e. a policy that is exponential in the value function: $\pi_\theta(s, a) \propto e^{-\theta \Phi(s,a)}$
      • Differentiating with respect to θ gives the gradient of the log-likelihood.
      • We can also use a Gaussian policy.
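The derivative of the log-likelihood for the Gibbs policy can be worked out explicitly. The derivation below is a sketch that keeps the slide's sign convention ($e^{-\theta \Phi}$) and assumes that θ and Φ(s,a) are vectors and that the normalization runs over a finite set of actions $b$; neither assumption is stated in the slides.

```latex
\begin{align*}
\pi_\theta(s,a) &= \frac{e^{-\theta^\top \Phi(s,a)}}{\sum_b e^{-\theta^\top \Phi(s,b)}} \\
\log \pi_\theta(s,a) &= -\theta^\top \Phi(s,a) - \log \sum_b e^{-\theta^\top \Phi(s,b)} \\
\nabla_\theta \log \pi_\theta(s,a) &= -\Phi(s,a) + \sum_b \pi_\theta(s,b)\,\Phi(s,b)
  = -\bigl(\Phi(s,a) - \mathbb{E}_{b \sim \pi_\theta(s,\cdot)}[\Phi(s,b)]\bigr)
\end{align*}
```

Up to the sign convention, the gradient is the feature vector of the chosen action minus the average feature vector under the current policy.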
  52. 52. Optimize Policy (Cont.): Actor-Critic
      • Critic: updates the action-value function parameters $w$
      • Actor: updates the policy parameters θ in the direction suggested by the critic
  53. 53. Gradient Bandit Algorithm
      • Rather than learning value functions, we learn probabilities. Let $A_t$ be the action that is taken at time $t$: $\Pr(A_t = a) = \pi_t(a) = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$
      • $H$: numerical preference. We assume a Gibbs (Boltzmann) distribution.
      • $\bar{R}_t$: the average reward until time $t$; $R_t$: the reward at time $t$.
      • $H_{t+1}(A_t) = H_t(A_t) + \alpha (R_t - \bar{R}_t)(1 - \pi_t(A_t))$
      • $H_{t+1}(a) = H_t(a) - \alpha (R_t - \bar{R}_t)\, \pi_t(a) \quad \forall a \ne A_t$
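The two preference updates translate almost line by line into code. The sketch below makes several assumptions that the slide leaves open: rewards are Gaussian with unit variance, the baseline $\bar{R}_t$ is the mean of the rewards seen before the current step, and the function names and example machines are invented.

```python
import math
import random

def softmax(prefs):
    """Gibbs/Boltzmann distribution over the preferences H(a)."""
    m = max(prefs)                               # subtract max for numerical stability
    exps = [math.exp(h - m) for h in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def gradient_bandit(true_means, steps=3000, alpha=0.1, seed=0):
    rng = random.Random(seed)
    k = len(true_means)
    prefs = [0.0] * k        # H_t(a)
    baseline = 0.0           # running average reward (assumed: rewards before step t)
    for t in range(1, steps + 1):
        probs = softmax(prefs)
        action = rng.choices(range(k), weights=probs)[0]
        reward = rng.gauss(true_means[action], 1.0)
        for a in range(k):
            if a == action:
                prefs[a] += alpha * (reward - baseline) * (1 - probs[a])
            else:
                prefs[a] -= alpha * (reward - baseline) * probs[a]
        baseline += (reward - baseline) / t      # update the average afterwards
    return softmax(prefs)

# Hypothetical machines with average payouts 0.0, 0.5 and 2.0.
final_probs = gradient_bandit([0.0, 0.5, 2.0])
print(final_probs)
```

No action values are stored at all: the algorithm only shifts probability mass toward actions whose rewards beat the running average.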
  54. 54. Further Reading
      • Sutton & Barto: http://ufal.mff.cuni.cz/~straka/courses/npfl114/2016/sutton-bookdraft2016sep.pdf
      • Pole balancing: https://www.youtube.com/watch?v=Lt-KLtkDlh8
      • DeepMind papers
      • David Silver: YouTube and ucl.ac.uk
      • TD-Backgammon
  55. 55. Thank you