
The Evolution of Distributed Architectures for Reinforcement Learning

Presentation given at Reinforcement Learning Architecture Study Group #14.
A survey of the history of distributed architectures for reinforcement learning, covering Gorila, A3C, GA3C, A2C, Ape-X, IMPALA and others.
https://rlarch.connpass.com/event/81669/



  1. The Evolution of Distributed Architectures for Reinforcement Learning (RL Architecture Study Group #14) Twitter: @eratostennis
  2. Scope. Included: Gorila, A3C, GA3C, batched A2C, Ape-X, IMPALA, and Accelerated Methods for Deep Reinforcement Learning. Excluded: the details of the trace methods [8] (Importance Sampling, Tree-backup, Retrace) and UNREAL [9].
  3. Timeline: DistBelief [10] (2012), Gorila (2015), A3C and GA3C (2016), batched A2C (2017), Ape-X, IMPALA and Accelerated Methods (2018).
  4. Agenda: Gorila, A3C, GA3C, batched A2C, Ape-X, IMPALA, Accelerated Methods for Deep Reinforcement Learning.
  5. Gorila [1]
  6. DQN (background).
  7. Gorila architecture.
  8. Results: where DQN needed 12-14 days of training, Gorila reached comparable scores in about 6 days and outperformed DQN on 41 of 49 Atari games.
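A minimal single-process sketch of the Gorila decomposition outlined above: parallel actors fill a shared replay memory while parallel learners sample batches and send DQN-style gradients to a parameter server. The toy bandit "environment", the linear Q-values and every name below are illustrative stand-ins, not the paper's system.

    import random
    import numpy as np

    class ParameterServer:
        def __init__(self, n_actions):
            self.q = np.zeros(n_actions)            # stand-in for the central Q-network weights
        def apply_gradient(self, grad, lr=0.1):
            self.q -= lr * grad
        def get_params(self):
            return self.q.copy()

    def actor_step(server, replay, epsilon=0.1):
        # One actor interaction: act epsilon-greedily with a fresh parameter copy,
        # then append the experience to the shared replay memory.
        q = server.get_params()
        a = random.randrange(len(q)) if random.random() < epsilon else int(np.argmax(q))
        reward = 1.0 if a == 0 else 0.0             # toy bandit-style reward
        replay.append((a, reward))

    def learner_step(server, replay, batch_size=8):
        # One learner update: sample a batch from replay and ship the gradient
        # of the squared error 0.5 * (Q(a) - r)^2 to the parameter server.
        batch = random.sample(replay, min(batch_size, len(replay)))
        q = server.get_params()
        grad = np.zeros_like(q)
        for a, r in batch:
            grad[a] += (q[a] - r) / len(batch)
        server.apply_gradient(grad)

    server, replay = ParameterServer(n_actions=4), []
    for step in range(500):
        actor_step(server, replay)                  # in Gorila, many of these run in parallel
        if len(replay) >= 8:
            learner_step(server, replay)            # ... and many of these, on separate machines
    print(server.q)                                 # the Q-value for action 0 approaches 1.0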
  9. A3C [2]
  10. A3C as characterized in the GA3C paper [3]: on-policy, and runs entirely on CPUs (no GPU) on a single machine.
  11. A3C network: a DNN with two convolutional layers, a fully connected layer, and two output heads (a softmax policy head and a value head).
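A sketch of that network in tf.keras. The filter counts and the 256-unit hidden layer follow the A3C paper's feed-forward Atari network; the slide only lists the layer types, so treat the exact sizes as assumptions.

    import tensorflow as tf

    def build_a3c_net(num_actions, input_shape=(84, 84, 4)):
        # Shared trunk: two conv layers and one fully connected layer,
        # followed by separate policy (softmax) and value heads.
        inputs = tf.keras.Input(shape=input_shape)
        x = tf.keras.layers.Conv2D(16, 8, strides=4, activation="relu")(inputs)
        x = tf.keras.layers.Conv2D(32, 4, strides=2, activation="relu")(x)
        x = tf.keras.layers.Flatten()(x)
        x = tf.keras.layers.Dense(256, activation="relu")(x)
        policy = tf.keras.layers.Dense(num_actions, activation="softmax", name="policy")(x)
        value = tf.keras.layers.Dense(1, name="value")(x)
        return tf.keras.Model(inputs, [policy, value])

    model = build_a3c_net(num_actions=6)
    model.summary()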
  12. Experiments: LSTM variant, multi-step returns, A3C results on Pong.
  13. GA3C [3]
  14. GA3C background: the then state-of-the-art A3C runs purely on CPUs, whereas DQN, Double-DQN and Dueling Double-DQN all exploit GPUs (AlphaGo used 50 GPUs; Gorila ran DQN over roughly 100 parallel processes). GA3C is a hybrid CPU/GPU A3C built around queues, implemented in TensorFlow, and matches state-of-the-art results.
  15. A3C setup: 16 agents on 16 CPU cores, Atari environments (OpenAI Gym, Brockman et al.), no GPU, and no replay memory.
  16. Hybrid CPU/GPU A3C (GA3C)
  17. Performance metrics and trade-offs: CPU-GPU data transfer and GPU utilization. GA3C routes all GPU access through two components, predictors (agents <=> prediction queue <=> GPU) and trainers (agents <=> training queue <=> GPU); a sketch follows below.
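A rough sketch of that queueing pattern. Agents never call the model directly: they push observations onto a prediction queue and wait for a predictor thread that batches many requests into one GPU forward pass (a trainer thread batches experience for GPU updates in the same way). The fake model and all names are illustrative, not the GA3C code.

    import queue
    import threading
    import numpy as np

    n_agents, n_actions = 4, 3
    prediction_q = queue.Queue()                          # (agent_id, observation)
    result_qs = [queue.Queue() for _ in range(n_agents)]  # per-agent (policy, value) answers

    def fake_model_predict(obs_batch):
        # Stand-in for one batched forward pass on the GPU.
        return (np.full((len(obs_batch), n_actions), 1.0 / n_actions),
                np.zeros(len(obs_batch)))

    def predictor(max_batch=4):
        while True:
            agent_id, o = prediction_q.get()              # block for the first request
            ids, obs = [agent_id], [o]
            while len(obs) < max_batch and not prediction_q.empty():
                agent_id, o = prediction_q.get()          # opportunistically fill the batch
                ids.append(agent_id)
                obs.append(o)
            policies, values = fake_model_predict(obs)    # one call serves many agents
            for i, agent_id in enumerate(ids):
                result_qs[agent_id].put((policies[i], values[i]))

    threading.Thread(target=predictor, daemon=True).start()

    # What an agent does instead of calling the model itself:
    prediction_q.put((0, np.zeros(4)))                    # send the observation ...
    policy, value = result_qs[0].get()                    # ... and wait for the batched answer
    print(policy, value)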
  18. Metrics: Training Per Second (TPS) and Predictions Per Second (PPS). Since A3C updates once every 5 steps, PPS ≈ TPS × 5.
  19. Dynamic adjustment of trade-offs: the numbers of agents, predictors and trainers are tuned to maximize TPS, since the best configuration depends on the machine, the DNN size and the game (e.g. Atari BOXING vs. Atari PONG).
  20. Policy lag in GA3C: unlike A3C, the policy that selected an action can be k updates behind the policy being trained, because requests and experience pass through queues.
  21. Maximizing training speed (GPU utilization, compared against A3C).
  22. GPU utilization and DNN size: with a large DNN, TPS changes by only about 7% while GPU utilization rises by about 12% (baseline: A3C).
  23. Effect of TPS on learning speed
  24. Score comparison: A3C after 4 days (CPU) vs. GA3C after 1 day.
  25. Training curves for GA3C at different learning rates.
  26. Policy lag, learning stability and convergence speed: higher TPS comes with larger policy lag (values roughly in the 1~40 range).
  27. GA3C summary: a hybrid CPU/GPU implementation of A3C that trains faster than CPU-only A3C.
  28. batched A2C [4]
  29. batched A2C background: Gorila distributes DQN over separate actors and learners; A3C learns asynchronously; GA3C adds a GPU but introduces policy lag. Batched A2C keeps Gorila-style parallel actors and A3C/GA3C-style actor-critic learning on a single machine, steps the environments synchronously (so there is no policy lag), and reaches state-of-the-art Atari scores.
  30. Parallel Framework for Deep Reinforcement Learning: like A3C, it learns from on-line experience rather than a replay memory.
  31. Algorithm (synchronous batched stepping; a sketch follows below).
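A minimal sketch of the synchronous stepping that the algorithm is built around: one process holds n_e environments, every policy evaluation is one batched forward pass over all of them, and one gradient update is done per t_max-step rollout. The toy environments and linear policy are illustrative placeholders, not the paper's code.

    import numpy as np

    n_envs, t_max, obs_dim, n_actions = 16, 5, 4, 2
    rng = np.random.default_rng(0)
    W = np.zeros((obs_dim, n_actions))                   # toy linear policy parameters

    def policy_batch(obs):
        logits = obs @ W                                 # ONE batched forward pass
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        return p / p.sum(axis=1, keepdims=True)

    obs = rng.normal(size=(n_envs, obs_dim))             # all environments step in lockstep
    for update in range(10):
        batch_obs, batch_actions, batch_rewards = [], [], []
        for t in range(t_max):                           # collect a t_max-step rollout
            probs = policy_batch(obs)
            actions = np.array([rng.choice(n_actions, p=p) for p in probs])
            rewards = (actions == 0).astype(np.float64)  # toy reward signal
            batch_obs.append(obs)
            batch_actions.append(actions)
            batch_rewards.append(rewards)
            obs = rng.normal(size=(n_envs, obs_dim))     # toy transition to the next states
        # One synchronous update over the n_envs * t_max collected samples would go here,
        # using batched advantages; there is no per-worker asynchrony and no policy lag.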
  32. Experiments: 12 Atari games; implemented in Python with TensorFlow; a 4-core Intel i7-4790K CPU and an Nvidia 980 Ti GPU; baselines: Gorila, A3C, GA3C; two networks, Arch(nips): Conv×2 + FC×2 and Arch(nature): Conv×3 + FC×2 (larger than nips).
  33. Results
  34. The number of actors vs. the learning rate: the base rate of 0.0007 is scaled with the number of actors, following [13].
  35. Time usage in the game of Pong: CPU/GPU time breakdown for the nips (ne=32) and nature architectures (roughly 22% and 41% spent on the CPU side).
  36. batched A2C summary: synchronous batching uses the GPU efficiently with a simple single-machine design.
  37. Ape-X [5]
  38. Ape-X overview: many distributed actors generate experience into a shared replay memory, and the learner trains with Prioritized Experience Replay.
  39. Distributed Prioritized Experience Replay
  40. Distributed Prioritized Experience Replay: unlike Gorila, there is a single shared replay memory fed by many actors; unlike Prioritized DQN, which has one actor, Ape-X has each actor compute initial priorities locally before sending its experience.
  41. Experiments (Ape-X DQN): 57 Atari games; 12.5K FPS = 360 actors × 139 FPS / 4 (action repeat); the learner consumes about 19 batches/sec of batch size 512; frames are stored PNG-compressed; actors refresh their copy of the network from the learner every 400 frames; each actor runs ε-greedy with its own ε; the replay memory evicts old data in FIFO order; Priority Exponent = 0.6, Importance Sampling Exponent = 0.4.
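For reference, the proportional prioritized sampling behind those two exponents (standard prioritized experience replay, which Ape-X builds on) is, in LaTeX notation:

    P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}, \qquad
    w_i = \Bigl( \frac{1}{N \cdot P(i)} \Bigr)^{\beta}

Here p_i is the priority of transition i (its absolute TD error, which in Ape-X the actor computes before sending the data), N is the replay size, and α = 0.6, β = 0.4 are the exponents quoted above. The importance-sampling weights w_i are normalized by their maximum before being applied to the loss.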
  42. Scaling the number of actors (Ape-X DQN): performance keeps improving as actors are added.
  43. Varying the capacity of the replay
  44. Conclusion: distributed prioritized replay scales simply and reaches state-of-the-art results.
  45. IMPALA [6]
  46. IMPALA overview: compared with A2C and GA3C, IMPALA fully decouples acting from learning and corrects the resulting off-policy data with V-trace; evaluated on DMLab-30 and Atari 57.
  47. IMPALA architecture: each of the n actors repeatedly 1. copies the latest parameters from the learner, then 2. runs an unroll and sends the trajectory of observations, actions and rewards (plus the initial LSTM state) to the learner. Because the learner's policy π runs ahead of each actor's behaviour policy μ, the gap is corrected with V-trace, and the learner itself runs on a GPU. Point: acting (actors) and learning (learner) are decoupled and exchange data rather than gradients, unlike A3C (a sketch follows below).
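A rough sketch of that decoupling: actors run whole unrolls with their own (possibly stale) copy of the policy μ and push complete trajectories, including μ's action probabilities, onto a queue; the learner pulls trajectories and corrects the lag with V-trace. The toy policy, observations and single-process queue below are illustrative, not the paper's code.

    import queue
    import numpy as np

    rng = np.random.default_rng(0)
    unroll_len, obs_dim, n_actions = 20, 4, 3
    trajectory_q = queue.Queue(maxsize=16)

    def actor_unroll(learner_params):
        mu_params = learner_params.copy()        # copied once per unroll (unused in this toy)
        obs = rng.normal(size=(unroll_len, obs_dim))
        actions = rng.integers(n_actions, size=unroll_len)        # "sampled from mu"
        rewards = rng.normal(size=unroll_len)
        mu_logprobs = np.full(unroll_len, -np.log(n_actions))     # stored log mu(a_t | x_t)
        trajectory_q.put((obs, actions, rewards, mu_logprobs))    # ship data, not gradients

    def learner_step(params):
        obs, actions, rewards, mu_logprobs = trajectory_q.get()
        pi_logprobs = np.full(unroll_len, -np.log(n_actions))     # stand-in for the current pi
        log_rhos = pi_logprobs - mu_logprobs     # these ratios feed the V-trace targets below
        return params - 0.01 * rng.normal(size=params.shape)      # stand-in gradient step

    params = np.zeros(8)
    for _ in range(3):
        actor_unroll(params)                     # in IMPALA these run on remote machines
        params = learner_step(params)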
  48. Efficiency optimisations: the network runs on the GPU while the environments run on CPUs; like GA3C, A2C and Ape-X (and unlike A3C), IMPALA batches work for the accelerator; implemented in TensorFlow, with additional speedups from XLA [11] and cuDNN [12].
  49. V-trace: the actors lag behind the learner, so the actor-learner gap makes the data off-policy; V-trace is the off-policy actor-critic correction. Notation: an MDP with states x_t, actions a_t and rewards r_t; the behaviour policy μ generates the data, and the target policy π is the one being learned.
  50. V-trace target: a temporal-difference target with truncated importance sampling weights; when π = μ (on-policy) it reduces to the ordinary n-step target.
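For reference, the n-step V-trace target from the IMPALA paper [6], in LaTeX notation (x_s: state, V: the current value estimate, ρ̄ and c̄: truncation levels):

    v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{\,t-s} \Bigl( \prod_{i=s}^{t-1} c_i \Bigr) \delta_t V,
    \qquad
    \delta_t V = \rho_t \bigl( r_t + \gamma V(x_{t+1}) - V(x_t) \bigr),

    \rho_t = \min\Bigl( \bar{\rho}, \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)} \Bigr),
    \qquad
    c_i = \min\Bigl( \bar{c}, \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)} \Bigr).

When π = μ and ρ̄, c̄ ≥ 1, every ratio truncates to 1 and v_s becomes the usual n-step Bellman target, as the slide notes.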
  51. Importance sampling: each TD error is reweighted by how likely the target policy π is to take the actor's action relative to the behaviour policy μ that actually generated it; the weight goes to 0 for actions the target policy would never take.
  52. Retrace: the c_i are the "trace cutting" coefficients from Retrace [8]. Their product measures how much the TD error at time t affects the update of the value function at the earlier time s; the more π and μ differ (the more off-policy the data), the larger the variance (the fluctuation of learning), so the truncation level c̄ acts as a variance-reduction coefficient. Using this technique does not change the value the method converges to.
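A small numpy sketch of how these targets can be computed for one trajectory, using the backward recursion v_s = V(x_s) + δ_sV + γ c_s (v_{s+1} − V(x_{s+1})). It assumes a constant discount with no terminal handling, and all names are illustrative.

    import numpy as np

    def vtrace_targets(rewards, values, bootstrap_value, log_rhos,
                       gamma=0.99, rho_bar=1.0, c_bar=1.0):
        # log_rhos[t] = log pi(a_t|x_t) - log mu(a_t|x_t), from the learner's pi
        # and the behaviour probabilities the actor stored with the trajectory.
        rhos = np.exp(np.asarray(log_rhos, dtype=np.float64))
        clipped_rhos = np.minimum(rho_bar, rhos)       # the rho_t in the target
        cs = np.minimum(c_bar, rhos)                   # the "trace cutting" c_i
        values = np.asarray(values, dtype=np.float64)
        values_tp1 = np.append(values[1:], bootstrap_value)
        deltas = clipped_rhos * (np.asarray(rewards) + gamma * values_tp1 - values)
        vs_minus_v = np.zeros_like(values)
        acc = 0.0                                      # v_{s+1} - V(x_{s+1}) beyond the end
        for t in reversed(range(len(values))):
            acc = deltas[t] + gamma * cs[t] * acc
            vs_minus_v[t] = acc
        return values + vs_minus_v                     # the V-trace targets v_s

    rng = np.random.default_rng(0)
    print(vtrace_targets(rewards=rng.normal(size=5), values=rng.normal(size=5),
                         bootstrap_value=0.0, log_rhos=0.1 * rng.normal(size=5)))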
  53. Experiments: computational performance, single-task training, and multi-task training on DMLab-30 and Atari 57.
  54. Computational performance: throughput compared with A3C and batched A2C (shallow model).
  55. Single-task training: 5 DeepMind Lab levels: a planning task, two maze navigation tasks, a laser tag task, and a simple fruit collection task.
  56. Convergence and stability: across the 5 levels, IMPALA with V-trace converges more stably and is less sensitive to hyperparameters than A3C.
  57. V-trace analysis, comparing: No-correction (no off-policy correction); ε-correction (a small ε is added when computing gradients so that log π does not become too small); 1-step importance sampling (the weight is applied at each step, i.e. V-trace without the trace); and V-trace itself, each with and without experience replay.
  58. Multi-task (Atari): a single agent trained on all 57 Atari games; IMPALA (shallow and deep) compared against A3C (shallow).
  59. IMPALA summary: the off-policy correction V-trace is more stable than other off-policy actor-critic methods; experiments on multi-task learning over DMLab-30 and Atari 57 show better performance than A3C.
  60. Accelerated Methods for Deep Reinforcement Learning [7]
  61. Accelerated Methods overview: scales both policy-gradient and Q-learning methods across many CPU cores and multiple GPUs on a single machine, evaluated on Atari; hardware: an NVIDIA DGX-1 with 40 CPU cores and 8 P100 GPUs.
  62. Related work: Gorila (scaling was sub-linear); Ape-X (prioritized replay distributed across CPUs and GPUs); A3C; GA3C (moves A3C's network from the CPU onto a GPU); IMPALA (GPU learner with V-trace); PAAC (batched A2C).
  63. Parallel, accelerated RL framework: simulation runs on many CPU cores while the deep neural network runs on the GPU.
  64. Synchronized sampling: the simulators run on CPU cores and their observations are gathered for one batched inference call; sampling and inference alternate, but the simulators can also be split into two groups so that the two phases overlap (a sketch follows below).
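A toy illustration of that two-group trick. The real system overlaps the two phases (CPUs step one group of simulators while the GPU infers actions for the other); here the calls are merely interleaved in one thread to show the ordering. The environments and the "GPU" call are fakes, and all names are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    n_envs, obs_dim, n_actions = 8, 4, 3
    group = [np.arange(0, n_envs // 2), np.arange(n_envs // 2, n_envs)]
    obs = rng.normal(size=(n_envs, obs_dim))
    pending = [None, None]                  # actions already inferred, waiting to be stepped

    def infer(batch_obs):                   # stand-in for one batched GPU forward pass
        return rng.integers(n_actions, size=len(batch_obs))

    def step(env_idx, actions):             # stand-in for stepping half of the simulators
        return rng.normal(size=(len(env_idx), obs_dim))

    for it in range(6):
        g = it % 2                          # alternate between the two halves
        other = 1 - g
        pending[g] = infer(obs[group[g]])   # the "GPU" works on group g ...
        if pending[other] is not None:      # ... while the CPUs step the other group
            obs[group[other]] = step(group[other], pending[other])
            pending[other] = None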
  65. Synchronous multi-GPU optimization: each GPU computes gradients on its own batch, and the gradients are combined with an all-reduce (averaged) across GPUs using the NVIDIA Collective Communication Library.
  66. Asynchronous multi-GPU optimization: each GPU has its own sampler-learner pair that 1. computes gradients on its GPU, 2.-3. applies them to a shared central copy of the parameters (involving the CPU), and 4. pulls the updated parameters back to its GPU.
  67. Experiments: Atari 2600, covering both policy-gradient and Q-learning methods.
  68. Atari frame processing: largely follows Mnih et al., 2015 (Human-level control...); the max over 2 consecutive frames is kept, but frames are downsampled to 104×80; the Q-networks use 3 conv layers (DQN-Net) and the policy networks 2 (A3C-Net).
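A sketch of that preprocessing: take the pixelwise max over the two most recent frames (to remove sprite flicker), convert to grayscale, and downsample to the 104×80 resolution quoted above. The crop offsets below are illustrative, not the paper's exact code.

    import numpy as np

    def preprocess(frame_a, frame_b):
        """frame_a, frame_b: (210, 160, 3) uint8 RGB frames -> (104, 80) float32."""
        frame = np.maximum(frame_a, frame_b).astype(np.float32)
        gray = frame @ np.array([0.299, 0.587, 0.114], dtype=np.float32)  # luminance
        cropped = gray[1:209, :]            # 208 x 160, so a 2x subsample gives 104 x 80
        return cropped[::2, ::2] / 255.0

    frames = np.random.randint(0, 256, size=(2, 210, 160, 3), dtype=np.uint8)
    print(preprocess(frames[0], frames[1]).shape)        # (104, 80)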
  69. Sampling: hardware utilization efficiency measured on the game BREAKOUT (around 80%).
  70. A2C: the learning rate is scaled as 0.0007 × Num_Actors; as the number of simulators grows from 16 to 512 (batch size 80 to 2560), sample efficiency gradually gets worse.
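Those batch sizes follow directly from A2C's 5-step rollouts (an assumption here, but consistent with the slide's numbers):

    batch size = t_max × Num_Actors = 5 × 16 = 80   up to   5 × 512 = 2560,
    learning rate = 0.0007 × Num_Actors.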
  71. PPO: the usual PPO batch (8 environments × 256 steps = 2048) is kept while the per-environment batch size is reduced from 256 down to 4; raising parallelism this way improved some games and hurt others.
  72. Q-value learning with large training batches: DQN's batch size is raised from 32 toward 2048, with 512 working well; several schedules were swept (2.5, 7.5, 15×10^4); Categorical DQN also handles batches as large as 2048.
  73. Update rule: Adam vs. RMSProp, including for Categorical DQN and Rainbow, in the large-batch learner setting.
  74. Learning speed: A2C, A3C, PPO and APPO compared; PPO reaches strong Pong scores within minutes; with 256 environments, A2C processes about 25,000 samples/sec (= 90 million/hour).
  75. Time to 50 million steps. Policy-gradient learning: about a 6× speedup using 8 GPUs. Q-learning: measuring the time to reach 50 million steps, DQN takes about 8 hours with 1 GPU and 5 CPUs; Categorical DQN benefits most from the GPU because of its large batch size.
  76. Accelerated Methods summary: both policy-gradient and Q-learning methods scale to many CPUs and GPUs on a single machine, greatly shortening Atari training times.
  77. Overall summary: Gorila (distributed DQN); A3C (asynchronous actor-critic); GA3C (A3C served through queues on a GPU); batched A2C (synchronous batching); Ape-X (distributed Prioritized Replay); IMPALA (importance-sampling corrections in the spirit of Retrace); Accelerated Methods for Deep Reinforcement Learning (many CPUs and GPUs on one machine).
  78. Closing notes: learning speed (Pong: minutes now vs. hours for A3C); GPU utilization; off-policy corrections / Retrace [8]; auxiliary tasks (UNREAL) [9].
  79. References
     1. Nair, Arun, et al. "Massively Parallel Methods for Deep Reinforcement Learning". arXiv preprint arXiv:1507.04296, 2015.
     2. Mnih, Volodymyr, et al. "Asynchronous Methods for Deep Reinforcement Learning". Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), New York City, NY, USA, June 19-24, 2016.
     3. Babaeizadeh, Mohammad, et al. "GA3C: GPU-based A3C for Deep Reinforcement Learning". NIPS Workshop, 2016.
     4. Clemente, Alfredo V., et al. "Efficient Parallel Methods for Deep Reinforcement Learning". CoRR, abs/1705.04862, 2017.
     5. Horgan, Dan, et al. "Distributed Prioritized Experience Replay". arXiv e-prints, March 2018.
     6. Espeholt, Lasse, et al. "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures". arXiv preprint arXiv:1802.01561, 2018.
     7. Stooke, Adam, and Abbeel, Pieter. "Accelerated Methods for Deep Reinforcement Learning". arXiv preprint arXiv:1803.02811, 2018.
  80. References (continued)
     8. Munos, Rémi, et al. "Safe and Efficient Off-Policy Reinforcement Learning". In Advances in Neural Information Processing Systems, pp. 1046-1054, 2016.
     9. Jaderberg, Max, et al. "Reinforcement Learning with Unsupervised Auxiliary Tasks". International Conference on Learning Representations, 2017.
     10. Dean, Jeffrey, et al. "Large Scale Distributed Deep Networks". In Advances in Neural Information Processing Systems 25, pp. 1223-1231, 2012.
     11. "TensorFlow w/XLA: TensorFlow, Compiled!" https://autodiff-workshop.github.io/slides/JeffDean.pdf
     12. Chetlur, Sharan, et al. "cuDNN: Efficient Primitives for Deep Learning". CoRR, abs/1410.0759, 2014.
     13. Goyal, Priya, et al. "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour". arXiv preprint arXiv:1706.02677, 2017.
     14. "Intuitive RL: Intro to Advantage-Actor-Critic (A2C)". https://hackernoon.com/intuitive-rl-intro-to-advantage-actor-critic-a2c-4ff545978752
