[DL Reading Group] Deep Reinforcement Learning that Matters

Deep Learning JP

2017/12/8 Deep Learning JP: http://deeplearning.jp/seminar-2/

DEEP LEARNING JP
[DL Papers]
http://deeplearning.jp/
Deep Reinforcement Learning that Matters
Reiji Hatsugai

$a_t \sim \pi(\cdot \mid s_t)$
$s_{t+1} \sim P(\cdot \mid s_t, a_t)$
$r_{t+1} = r(s_t, a_t, s_{t+1})$

$\pi^* = \arg\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{\tau=0}^{\infty} \gamma^{\tau} r_{\tau}\right]$
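As a minimal sketch of the quantity inside the expectation above, the discounted sum of rewards for one finite episode can be computed as follows. The function name `discounted_return` and the sample rewards are illustrative assumptions, not taken from the slides.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**tau * rewards[tau] over a finite list of rewards."""
    total = 0.0
    # Walk backwards so each earlier step multiplies the tail by gamma once.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical three-step episode.
episode_rewards = [1.0, 0.0, 2.0]
print(discounted_return(episode_rewards, gamma=0.5))
```

The policy objective is the expectation of this value over trajectories sampled from the policy; in practice it is estimated by averaging over many episodes.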

TRPO, DQN, DDQN, A3C, UNREAL, PCL, ACER, PPO, Q-Prop, IPG, ACKTR, DDPG, D4PG, SAC, Soft Q
Since reinforcement learning became "deep", many methods have been developed.

Deep Reinforcement Learning that Matters
• ICML 2017 reproducibility workshop: Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control
• Accepted at AAAI 2018

Deep Reinforcement Learning that Matters
•
– ACKTR (Wu et al. 2017)
– PPO (Schulman et al. 2017)
– DDPG (Lillicrap et al. 2015)
– TRPO (Schulman et al. 2015)
• ACKTR, PPO
• DDPG, TRPO baseline

Deep Reinforcement Learning that Matters
• Network Architecture
• Reward Scale
• Random Seeds and Trials
• Environments
• Codebases
• Reporting Evaluation Metrics
• Grouped into extrinsic factors and intrinsic factors

Network Architecture
•
– (64, 64) (rllab)
– (100, 50, 25) (Q-Prop)
– (400, 300) (DDPG)
• Activation Function
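To make the comparison concrete, here is a minimal sketch of a policy network whose hidden layer sizes and activation function are plain parameters, so the three configurations listed above can be swapped without touching the rest of the code. It is pure Python for illustration only; the helper names, the initialization scheme, and the observation and action dimensions (17 and 6) are assumptions, not values from the slides.

```python
import math
import random


def init_mlp(in_dim, hidden_sizes, out_dim, seed=0):
    """Build a list of (weights, biases) layers with small Gaussian weights."""
    rng = random.Random(seed)
    sizes = [in_dim, *hidden_sizes, out_dim]
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def relu(value):
    return max(0.0, value)


def forward(layers, x, activation=math.tanh):
    """Forward pass; the activation is applied to hidden layers only."""
    for index, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            x = [activation(v) for v in x]
    return x


def num_params(layers):
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


OBS_DIM, ACT_DIM = 17, 6  # hypothetical environment dimensions
for hidden in [(64, 64), (100, 50, 25), (400, 300)]:
    policy = init_mlp(OBS_DIM, hidden, ACT_DIM)
    action = forward(policy, [0.0] * OBS_DIM, activation=relu)
    print(hidden, num_params(policy), len(action))
```

Keeping architecture and activation as explicit arguments also makes it easy to report them alongside results.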

Network Architecture
• PPO
• Tanh
• PPO
• “This also suggests a possible need for hyper parameter agnostic algorithms”

Reward Scale
• Q DQN clipping
• 0.
= 20
• σ=0.1
• LeCun et al. 2012; Glorot and Bengio 2010; Vincent, de Brébisson, and Bouthillier 2015
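The two reward treatments named on this slide can be sketched as follows: DQN-style clipping of each reward to [-1, 1], and multiplying each reward by a scale factor σ. The function names and the sample rewards are hypothetical; σ = 0.1 is the value that appears on the slide.

```python
def clip_reward(reward, low=-1.0, high=1.0):
    """DQN-style clipping: bound every reward to [low, high]."""
    return max(low, min(high, reward))


def scale_reward(reward, sigma):
    """Reward rescaling: multiply every reward by a constant factor."""
    return reward * sigma


raw_rewards = [-250.0, 0.3, 12.0, 4000.0]  # hypothetical raw environment rewards
SIGMA = 0.1

clipped = [clip_reward(r) for r in raw_rewards]
scaled = [scale_reward(r, SIGMA) for r in raw_rewards]
print(clipped)
print(scaled)
```

Clipping discards magnitude information beyond the bounds, while scaling preserves the ratios between rewards, so the two choices are not interchangeable.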

Reward Scale
• Reward Scale
• Reward Scale
• Layer norm
• Learning values across many orders of magnitude (van Hasselt et al. 2016)
– adaptive
• HumanoidStandup-v1 100
– Reward Scale

Random Seeds and Trials
• 10 random seeds
• The 10 runs are split into two groups of 5
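Assuming the 10 runs are split into two groups of 5 that differ only in their random seeds, one way to quantify how far apart the groups are is a two-sample t-test. The sketch below computes Welch's t statistic and its approximate degrees of freedom with the standard library only; the choice of test, the function name `welch_t`, and the return values are illustrative assumptions, not data from the slides.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    se2_a = statistics.variance(sample_a) / len(sample_a)
    se2_b = statistics.variance(sample_b) / len(sample_b)
    t_stat = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(sample_a) - 1) + se2_b ** 2 / (len(sample_b) - 1)
    )
    return t_stat, dof


# Hypothetical final average returns: same algorithm and hyperparameters, 10 seeds.
returns = [3100.0, 2950.0, 3300.0, 3050.0, 3200.0,
           1500.0, 1900.0, 1200.0, 1700.0, 1600.0]
group_a, group_b = returns[:5], returns[5:]

t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.2f}, dof = {dof:.1f}")
```

A large |t| between two groups that share everything except the seed suggests that a comparison based on only 5 runs can be misleading.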

Random Seeds and Trials
• 2
• seed
• power analysis

Environment
• Hopper, HalfCheetah, Swimmer, Walker2D

HalfCheetah
• HalfCheetah DDPG
• Hopper DDPG
• Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control
• DDPG Q
• HalfCheetah DDPG DDPG base HalfCheetah unfair

Swimmer
• TRPO
• policy local optimal

Code base
• TRPO DDPG rllab, baseline

Code base
• dramatic impacts on performance

Reporting Evaluation Metrics

Deep Reinforcement Learning that Matters
• Hyperparameter-agnostic algorithms
• “There is often no clear winner among all benchmark environments.”

• HalfCheetah Hopper DDPG stable, unstable
• task difficulty algorithm
• Simple Nearest Neighbor Policy Method for Continuous Control Tasks
– Nearest Neighbor Policy
– task difficulty task
– NN task

• NN-1, NN-2
• NN-1
1.
2. action
• NN-2
1.
2. action 1step 1
• Sparse reward
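The steps of NN-1 and NN-2 did not survive on this slide, so the following is only a rough sketch under stated assumptions: NN-1 is taken to pick the stored trajectory whose initial state is closest to the current initial state and replay all of its actions, and NN-2 is taken to look up the closest stored state at every step and execute its action for one step. The buffer contents and function names are hypothetical, and the original paper may condition the lookup on additional information, such as a goal state, that is omitted here.

```python
import math


def nearest(query, candidates):
    """Index of the candidate vector closest to query (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda i: math.dist(query, candidates[i]))


# Hypothetical buffer of stored trajectories, each a list of (state, action) pairs.
buffer = [
    [((0.0, 0.0), 1.0), ((0.1, 0.0), 1.0), ((0.3, 0.1), 0.0)],
    [((1.0, 1.0), -1.0), ((0.9, 0.8), -1.0), ((0.7, 0.5), 0.0)],
]


def nn1_actions(initial_state, trajectories):
    """NN-1 style (assumed): replay the whole trajectory with the closest start."""
    starts = [trajectory[0][0] for trajectory in trajectories]
    best = trajectories[nearest(initial_state, starts)]
    return [action for _, action in best]


def nn2_action(state, trajectories):
    """NN-2 style (assumed): one-step lookup of the closest stored state."""
    pairs = [pair for trajectory in trajectories for pair in trajectory]
    states = [stored_state for stored_state, _ in pairs]
    return pairs[nearest(state, states)][1]


print(nn1_actions((0.9, 0.9), buffer))
print(nn2_action((0.28, 0.1), buffer))
```

Neither function involves any gradient-based optimization, which is what lets such a policy serve as a probe of task difficulty rather than of the learning algorithm.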

Simple Nearest Neighbor
• Sparse Mountain Car
• HalfCheetah
• HalfCheetah
• task difficulty
• ICLR scores: 3, 4, 4
• NNPolicy

• HalfCheetah
•
– sensor
• 3 MLP
• Towards Generalization and Simplicity in Continuous Control
– Policy parameterized with RBF
– Natural Gradient
– Neural Net humanoid
– mujoco Todorov Natural Gradient Kakade
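A policy parameterized with RBF features can be sketched as a fixed random feature map followed by a linear layer. The sketch below assumes random Fourier features of the form sin(P·s / ν + φ), with P drawn from a standard normal and φ drawn uniformly from [-π, π); this form, the bandwidth value, and all names are assumptions made for illustration and should be checked against the original paper.

```python
import math
import random


def make_rff(obs_dim, num_features, bandwidth, seed=0):
    """Return a function mapping a state to fixed random Fourier features."""
    rng = random.Random(seed)
    projection = [[rng.gauss(0.0, 1.0) for _ in range(obs_dim)] for _ in range(num_features)]
    phase = [rng.uniform(-math.pi, math.pi) for _ in range(num_features)]

    def features(state):
        return [
            math.sin(sum(p * s for p, s in zip(row, state)) / bandwidth + shift)
            for row, shift in zip(projection, phase)
        ]

    return features


def linear_policy(weights, bias, feats):
    """Action mean as a linear function of the features; only these are trained."""
    return [sum(w * f for w, f in zip(row, feats)) + b for row, b in zip(weights, bias)]


featurize = make_rff(obs_dim=3, num_features=8, bandwidth=2.0)  # hypothetical sizes
feats = featurize([0.1, -0.2, 0.3])
action_mean = linear_policy([[0.0] * 8, [0.0] * 8], [0.5, -0.5], feats)
print(len(feats), action_mean)
```

The random projection stays fixed after initialization, so the number of trainable parameters is just that of the linear layer.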

Towards Generalization and Simplicity in Continuous Control

• sensor DeepLearning
•
– sparse reward
• IL, IRL??
– normalize
