
Multi-armed Bandits

Hello~! :)

While studying the Sutton & Barto book, the classic textbook for Reinforcement Learning, I created a PPT about Multi-armed Bandits, Chapter 2.

If there are any mistakes, I would appreciate your feedback.

Thank you.

Multi-armed Bandits

  1. Multi-armed Bandits, Dongmin Lee, RLI Study
  2. Reference - Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto (Link) - Multi-Armed Bandits (멀티 암드 밴딧), 송호연 (https://brunch.co.kr/@chris-song/62)
  3. Index 1. Why do you need to know Multi-armed Bandits? 2. A k-armed Bandit Problem 3. Simple-average Action-value Methods 4. A simple Bandit Algorithm 5. Gradient Bandit Algorithm 6. Summary
  4. 1. Why do you need to know Multi-armed Bandits (MAB)?
  5. 1. Why do you need to know MAB? - Reinforcement Learning (RL) uses training information that evaluates (rather than instructs) the actions taken - Evaluative feedback indicates how good the action taken was - For simplicity, we study the nonassociative setting, in which there is only one situation - Most prior work with evaluative feedback was done in this setting - 'Nonassociative' + 'evaluative feedback' -> MAB - MAB introduces the basic learning methods used in later chapters
  6. 1. Why do you need to know MAB? In my opinion, - We cannot really understand RL without understanding MAB - MAB deals with 'exploitation vs. exploration', one of the core ideas in RL - The ideas from MAB carry over to the full reinforcement learning problem - MAB is useful in many practical fields
  7. 2. A k-armed Bandit Problem
  8. 2. A k-armed Bandit Problem Do you know what MAB is?
  9. Do you know what MAB is? Source: Multi-Armed Bandit Image source: https://brunch.co.kr/@chris-song/62 - Slot machine -> bandit - Slot machine's lever -> arm - N slot machines -> Multi-armed Bandits
  10. Do you know what MAB is? Source: Multi-Armed Bandit Image source: https://brunch.co.kr/@chris-song/62 Among the various slot machines, which slot machine should I put my money in and pull the lever of?
  11. Do you know what MAB is? Source: Multi-Armed Bandit Image source: https://brunch.co.kr/@chris-song/62 How can you make the best return on your investment?
  12. Do you know what MAB is? Source: Multi-Armed Bandit Image source: https://brunch.co.kr/@chris-song/62 MAB is an algorithm created to optimize investment in slot machines.
  13. 2. A k-armed Bandit Problem A k-armed Bandit Problem
  14. A k-armed Bandit Problem Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view t - discrete time step or play number; A_t - action at time t; R_t - reward at time t; q*(a) - true value (expected reward) of action a
  15. A k-armed Bandit Problem Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view In our k-armed bandit problem, each of the k actions has an expected or mean reward given that that action is selected; let us call this the value of that action.
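The value referred to here is, in the book's notation, q*(a) = E[R_t | A_t = a]: the expected reward given that action a is selected.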
  16. 3. Simple-average Action-value Methods
  17. 3. Simple-average Action-value Methods Simple-average Method
  18. Simple-average Method Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view Q_t(a) converges to q*(a)
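The simple-average (sample-average) estimate behind this slide is Q_t(a) = (sum of rewards received when a was taken prior to t) / (number of times a was taken prior to t). By the law of large numbers, Q_t(a) converges to q*(a) as the action is selected infinitely often.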
  19. 3. Simple-average Action-value Methods Action-value Methods
  20. Action-value Methods Action-value Methods - Greedy Action Selection Method - ε-greedy Action Selection Method - Upper-Confidence-Bound (UCB) Action Selection Method
  21. Action-value Methods Action-value Methods - Greedy Action Selection Method - ε-greedy Action Selection Method - Upper-Confidence-Bound (UCB) Action Selection Method
  22. Greedy Action Selection Method argmax_a f(a) - the value of a at which f(a) takes its maximal value Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  23. Greedy Action Selection Method Greedy action selection always exploits current knowledge to maximize immediate reward Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
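The greedy selection rule shown in the slide image is A_t = argmax_a Q_t(a), with ties broken arbitrarily (e.g., at random).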
  24. Greedy Action Selection Method Greedy Action Selection Method's disadvantage: is it a good idea to always select the greedy action, exploiting current knowledge to maximize the immediate reward?
  25. Action-value Methods Action-value Methods - Greedy Action Selection Method - ε-greedy Action Selection Method - Upper-Confidence-Bound (UCB) Action Selection Method
  26. ε-greedy Action Selection Method Exploitation is the right thing to do to maximize the expected reward on the one step, but exploration may produce the greater total reward in the long run.
  27. ε-greedy Action Selection Method Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view ε - probability of taking a random action in an ε-greedy policy
  28. ε-greedy Action Selection Method Exploitation vs. Exploration Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
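As a minimal sketch of ε-greedy selection (the function name and array layout are my own, not from the slides; NumPy is assumed):

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one (ties broken randomly)."""
    k = len(Q)
    if rng.random() < epsilon:
        return rng.integers(k)               # explore: uniform random action
    best = np.flatnonzero(Q == Q.max())      # all greedy (maximal) actions
    return rng.choice(best)                  # exploit: break ties at random

# usage: Q is the current action-value estimate for a 10-armed bandit
rng = np.random.default_rng(0)
Q = np.zeros(10)
action = epsilon_greedy(Q, epsilon=0.1, rng=rng)
```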
  29. Action-value Methods Action-value Methods - Greedy Action Selection Method - ε-greedy Action Selection Method - Upper-Confidence-Bound (UCB) Action Selection Method
  30. Upper-Confidence-Bound (UCB) Action Selection Method ln t - natural logarithm of t; N_t(a) - the number of times that action a has been selected prior to time t; c - a number c > 0 that controls the degree of exploration Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
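The UCB rule these terms belong to is A_t = argmax_a [ Q_t(a) + c · sqrt(ln t / N_t(a)) ]. If N_t(a) = 0, then a is considered a maximizing action, so every action is tried at least once.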
  31. Upper-Confidence-Bound (UCB) Action Selection Method Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view The idea of UCB action selection is that the square-root term is a measure of the uncertainty (or potential) in the estimate of a's value: the chance that this slot machine may still turn out to be the optimal one.
  32. Upper-Confidence-Bound (UCB) Action Selection Method UCB Action Selection Method's disadvantage: UCB is more difficult than ε-greedy to extend beyond bandits to the more general reinforcement learning settings. One difficulty is dealing with nonstationary problems; another difficulty is dealing with large state spaces. Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
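A minimal sketch of UCB selection under the same assumptions as the ε-greedy snippet above (variable names are mine; N holds the per-action selection counts):

```python
import numpy as np

def ucb_select(Q, N, t, c=2.0):
    """UCB action selection: value estimate plus an exploration bonus.

    Actions never tried (N[a] == 0) are selected first, matching the
    convention that they count as maximizing."""
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])
    ucb = Q + c * np.sqrt(np.log(t) / N)   # exploration bonus shrinks as N[a] grows
    return int(np.argmax(ucb))

# usage: t is the current (1-based) time step
Q = np.array([0.2, 0.5, 0.1])
N = np.array([3, 5, 2])
action = ucb_select(Q, N, t=10)
```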
  33. 4. A simple Bandit Algorithm
  34. 4. A simple Bandit Algorithm - Incremental Implementation - Tracking a Nonstationary Problem
  35. 4. A simple Bandit Algorithm - Incremental Implementation - Tracking a Nonstationary Problem
  36. Incremental Implementation Q_n denotes the estimate of an action's value after it has been selected n - 1 times Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  37. Incremental Implementation Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  38. Incremental Implementation Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view Q_{n+1} = Q_n + (1/n)[R_n − Q_n] holds even for n = 1, obtaining Q_2 = R_1 for arbitrary Q_1
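The derivation behind the images on these slides is the standard incremental rewrite of the sample average: Q_{n+1} = (1/n) Σ_{i=1}^{n} R_i = (1/n)[R_n + (n − 1)Q_n] = Q_n + (1/n)[R_n − Q_n], so each new estimate is the previous one plus a step toward the latest reward, without storing all past rewards.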
  39. Incremental Implementation Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  40. Incremental Implementation Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  41. Incremental Implementation Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view The 1/n step size varies over time (as opposed to a constant step size) and is suitable for stationary problems
  42. Incremental Implementation Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  43. Incremental Implementation Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view The general update rule is NewEstimate <- OldEstimate + StepSize [Target − OldEstimate]. The expression Target − OldEstimate is an error in the estimate. The target is presumed to indicate a desirable direction in which to move, though it may be noisy.
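A minimal sketch of the loop these slides describe, ε-greedy selection with the incremental sample-average update; the Gaussian-reward testbed (q_star plus unit-variance noise) is my own stand-in, not something specified on the slides:

```python
import numpy as np

def run_simple_bandit(q_star, steps=1000, epsilon=0.1, seed=0):
    """epsilon-greedy bandit with incremental sample-average updates.

    q_star: true action values (one per arm); rewards are q_star[a] + N(0, 1) noise.
    Returns the learned estimates Q and the total reward collected."""
    rng = np.random.default_rng(seed)
    k = len(q_star)
    Q = np.zeros(k)   # value estimates
    N = np.zeros(k)   # selection counts
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.integers(k)                           # explore
        else:
            a = rng.choice(np.flatnonzero(Q == Q.max()))  # exploit, random tie-break
        r = q_star[a] + rng.normal()                      # noisy reward from the chosen arm
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                         # incremental update: Q_{n+1} = Q_n + (R_n - Q_n)/n
        total_reward += r
    return Q, total_reward

Q, total = run_simple_bandit(q_star=np.array([0.1, 0.5, 0.9, -0.2]))
```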
  44. 4. A simple Bandit Algorithm - Incremental Implementation - Tracking a Nonstationary Problem
  45. Tracking a Nonstationary Problem Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  46. Tracking a Nonstationary Problem Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view Why do you think it should be changed from 1/n to α?
  47. Tracking a Nonstationary Problem Why do you think it should be changed from 1/n to α? We often encounter RL problems that are effectively nonstationary. In such cases it makes sense to give more weight to recent rewards than to long-past rewards. One of the most popular ways of doing this is to use a constant step-size parameter. The step-size parameter α ∈ (0, 1] is constant.
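With a constant step size, the update on the following slides unrolls into an exponential recency-weighted average, in which recent rewards get more weight: Q_{n+1} = Q_n + α[R_n − Q_n] = (1 − α)^n Q_1 + Σ_{i=1}^{n} α(1 − α)^{n−i} R_i.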
  48. Tracking a Nonstationary Problem Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  49. Tracking a Nonstationary Problem ? = Q_n + αR_n − αQ_n = αR_n + (1 − α)Q_n; Q_n = αR_{n−1} + (1 − α)Q_{n−1}? ∴ Q_n = αR_{n−1} + (1 − α)Q_{n−1} Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  50. Tracking a Nonstationary Problem ? Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  51. Tracking a Nonstationary Problem ∴ Q_{n−1} = αR_{n−2} + (1 − α)αR_{n−3} + ⋯ + (1 − α)^{n−3}αR_1 + (1 − α)^{n−2}Q_1, so (1 − α)^2 Q_{n−1} = (1 − α)^2 αR_{n−2} + (1 − α)^3 αR_{n−3} + ⋯ + (1 − α)^{n−1} αR_1 + (1 − α)^n Q_1 = (1 − α)^2 {αR_{n−2} + (1 − α)αR_{n−3} + ⋯ + (1 − α)^{n−3}αR_1 + (1 − α)^{n−2}Q_1}
  52. Tracking a Nonstationary Problem Sequences of step-size parameters often converge very slowly or need considerable tuning in order to obtain a satisfactory convergence rate. Thus, step-size parameters should be tuned effectively. Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  53. 5. Gradient Bandit Algorithm
  54. 5. Gradient Bandit Algorithm In addition to the simple bandit algorithm, there is another approach: using a gradient method as a bandit algorithm
  55. 5. Gradient Bandit Algorithm We consider learning a numerical preference for each action a, which we denote H_t(a). The larger the preference, the more often that action is taken, but the preference has no interpretation in terms of reward. In other words, a large preference H_t(a) does not by itself mean the reward is large; however, a large reward can increase the preference H_t(a).
  56. 5. Gradient Bandit Algorithm The action probabilities are determined according to a soft-max distribution (i.e., Gibbs or Boltzmann distribution) Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  57. 5. Gradient Bandit Algorithm π_t(a) - probability of selecting action a at time t. Initially all action preferences are the same so that all actions have an equal probability of being selected. Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
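The soft-max distribution referenced on these slides is π_t(a) = e^{H_t(a)} / Σ_{b=1}^{k} e^{H_t(b)}, so equal preferences (e.g., H_1(a) = 0 for all a) give every action probability 1/k.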
  58. 5. Gradient Bandit Algorithm Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view There is a natural learning algorithm for this setting based on the idea of stochastic gradient ascent. On each step, after selecting action A_t and receiving the reward R_t, the action preferences are updated (one rule for the selected action A_t, another for the non-selected actions).
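The preference updates shown in the slide image are H_{t+1}(A_t) = H_t(A_t) + α(R_t − R̄_t)(1 − π_t(A_t)) for the selected action, and H_{t+1}(a) = H_t(a) − α(R_t − R̄_t)π_t(a) for all a ≠ A_t, where R̄_t is the average of the rewards received so far and serves as a baseline.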
  59. 5. Gradient Bandit Algorithm Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view 1) What does R̄_t mean? 2) What does (R_t − R̄_t)(1 − π_t(A_t)) mean?
  60. 5. Gradient Bandit Algorithm Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view 1) What does R̄_t mean? 2) What does (R_t − R̄_t)(1 − π_t(A_t)) mean? ?
  61. What does R̄_t mean? Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view R̄_t ∈ ℝ is the average of all the rewards. The R̄_t term serves as a baseline. If the reward is higher than the baseline, then the probability of taking A_t in the future is increased, and if the reward is below the baseline, then the probability is decreased. The non-selected actions move in the opposite direction.
  62. 5. Gradient Bandit Algorithm Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view 1) What does R̄_t mean? 2) What does (R_t − R̄_t)(1 − π_t(A_t)) mean? ?
  63. What does (R_t − R̄_t)(1 − π_t(A_t)) mean? Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view - Stochastic approximation to gradient ascent in the Gradient Bandit Algorithm - Expected reward - Expected reward by the law of total expectation: E[R_t] = E[E[R_t | A_t]]
  64. What does (R_t − R̄_t)(1 − π_t(A_t)) mean? Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  65. What does (R_t − R̄_t)(1 − π_t(A_t)) mean? Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view ?
  66. What does (R_t − R̄_t)(1 − π_t(A_t)) mean? Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view ? The gradient sums to zero over all the actions, Σ_x ∂π_t(x)/∂H_t(a) = 0 - as H_t(a) is changed, some actions' probabilities go up and some go down, but the sum of the changes must be zero because the sum of the probabilities is always one.
  67. What does (R_t − R̄_t)(1 − π_t(A_t)) mean? Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  68. What does (R_t − R̄_t)(1 − π_t(A_t)) mean? Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view ? E[R_t] = E[E[R_t | A_t]]
  69. What does (R_t − R̄_t)(1 − π_t(A_t)) mean? Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view Please refer to page 40 of the link on the reference slide
  70. What does (R_t − R̄_t)(1 − π_t(A_t)) mean? Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view ∴ We can substitute a sample of the expectation above for the performance gradient in H_{t+1}(a) ≈ H_t(a) + α ∂E[R_t]/∂H_t(a)
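A minimal sketch of the gradient bandit algorithm as summarized above, assuming the same Gaussian-reward stand-in as the earlier snippet; the incremental baseline and variable names are my own choices:

```python
import numpy as np

def run_gradient_bandit(q_star, steps=1000, alpha=0.1, seed=0):
    """Gradient bandit: soft-max over preferences H, updated by (R - baseline)."""
    rng = np.random.default_rng(seed)
    k = len(q_star)
    H = np.zeros(k)          # action preferences H_t(a)
    baseline = 0.0           # running average of all rewards (R bar)
    for t in range(1, steps + 1):
        pi = np.exp(H - H.max())
        pi /= pi.sum()                          # soft-max probabilities pi_t
        a = rng.choice(k, p=pi)                 # sample A_t ~ pi_t
        r = q_star[a] + rng.normal()            # noisy reward
        baseline += (r - baseline) / t          # incremental average of rewards
        one_hot = np.zeros(k)
        one_hot[a] = 1.0
        H += alpha * (r - baseline) * (one_hot - pi)   # +(1 - pi) for A_t, -pi for the rest
    return H

H = run_gradient_bandit(q_star=np.array([0.1, 0.5, 0.9, -0.2]))
```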
  71. 6. Summary
  72. 6. Summary In this chapter, 'Exploitation & Exploration' is the core idea.
  73. 6. Summary Action-value Methods - Greedy Action Selection Method - ε-greedy Action Selection Method - Upper-Confidence-Bound (UCB) Action Selection Method
  74. 6. Summary A simple bandit algorithm: Incremental Implementation Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  75. 6. Summary A simple bandit algorithm: Incremental Implementation Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  76. 6. Summary A simple bandit algorithm: Tracking a Nonstationary Problem Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  77. 6. Summary A simple bandit algorithm: Tracking a Nonstationary Problem Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  78. 6. Summary Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view Gradient Bandit Algorithm
  79. 6. Summary Source: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto Image source: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view A parameter study of the various bandit algorithms
  80. Reinforcement Learning is LOVE♥
  81. Thank you
