Discovering Reinforcement Learning
Algorithms
Oh et al. in <NeurIPS 2020>
Presenter: 윤지상
Graduate School of Information, Yonsei Univ.
Machine Learning & Computational Finance Lab.
INDEX
1. Introduction
2. LPG
3. Details in LPG Architecture
4. Experiments
1. Introduction
In RL, meta-learning (also known as "learning to learn") has produced a line of work showing that, given a particular value function, a policy update rule can be learned automatically and then applied to unseen tasks.
Could an RL update rule that optimises learning be discovered entirely from scratch?
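This study contributes:
1. It shows the feasibility of a model that discovers by itself how to train the agent's policy and a semantic prediction vector, and that this can reach good performance.
2. No assumptions are imposed on the semantic prediction vector, which minimises user-specified design choices and brings the model closer to true meta-learning.
3. The RL learning algorithm discovered on simple tasks also shows meaningful performance on complex tasks.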
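2. LPG
Learned Policy Gradient (LPG)
Game analogy:
1. After a few actions, learn the jump timing for a particular situation.
2. Over many decisions, learn game strategies such as monster speed and shortcuts.
3. After the game ends, think about how to learn more game strategies in order to score higher.
4. Apply the know-how for picking up more game strategies to other games.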
There are two learnable models:
- LPG, parameterized by $\eta$ (backward LSTM)
- agent, parameterized by $\theta$
Final goal: find the optimal $\eta$.
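① The agent uses its parameters $\theta$ to output two values:
1. policy $\pi_\theta$: the distribution from which actions for the task are sampled
2. prediction $y_\theta$: an estimate of the information used to choose actions
(Game analogy: the choice of actions during the game, and the criteria behind it.)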
② The agent acts for $T$ time-steps to form a trajectory, and $\theta$ is updated towards the targets $\hat{\pi}$, $\hat{y}$ that LPG outputs to guide the agent's learning.
(Game analogy: picking up many strategies through many actions.)
③ Across a variety of environments, each agent is trained every $T$ time-steps; once all environments have finished, LPG's $\eta$ is updated so that the total reward is maximised.
(Game analogy: learning the know-how for scoring higher.)
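3. Details in LPG Architecture
1) LPG Architecture
2) Agent Update ($\theta$)
3) LPG Update ($\eta$)
4) Balancing Agent Hyperparameters for Stabilisation ($\alpha$)
Objective:
$$\eta^* = \arg\max_{\eta} \mathbb{E}_{\mathcal{E} \sim p(\mathcal{E})} \mathbb{E}_{\theta_0 \sim p(\theta_0)} [G]$$
- $p(\mathcal{E})$: distribution over environments $\mathcal{E}$
- $p(\theta_0)$: distribution over initial values of the agent parameters $\theta$
- $G$: sum of rewards over the entire lifetime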
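To make the nested expectation in the objective concrete, here is a minimal Monte Carlo sketch in Python. It is an illustration only: `sample_env`, `sample_theta0`, and `run_lifetime` are hypothetical callables that do not come from the slides, and a real lifetime would involve training an agent with LPG rather than the toy arithmetic used in the usage lines.

```python
import random

def estimate_objective(run_lifetime, sample_env, sample_theta0, n_lifetimes, seed=0):
    """Average lifetime return G over sampled (environment, theta_0) pairs.

    All three callables are hypothetical placeholders:
      sample_env(rng)          -> an environment drawn from p(E)
      sample_theta0(rng)       -> initial agent parameters drawn from p(theta_0)
      run_lifetime(env, theta) -> lifetime return G for that pair
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_lifetimes):
        env = sample_env(rng)
        theta0 = sample_theta0(rng)
        total += run_lifetime(env, theta0)
    return total / n_lifetimes

# Toy usage: constant "environment" and "parameters", G is just their product.
g_hat = estimate_objective(
    run_lifetime=lambda env, theta0: env * theta0,
    sample_env=lambda rng: 1.0,
    sample_theta0=lambda rng: 2.0,
    n_lifetimes=4,
)
```

The objective asks for the $\eta$ that maximises this average; the sketch only evaluates it for one fixed update rule.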
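1) LPG Architecture
LPG is a backward LSTM with:
- input: $x_t = [r_t, d_t, \pi_\theta(a_t|s_t), \varphi(y_\theta(s_t)), \varphi(y_\theta(s_{t+1}))]$
- output: $\hat{\pi} \in \mathbb{R}$, $\hat{y} \in [0,1]^m$
where
- $r_t$: reward
- $d_t$: whether the episode has terminated (binary value)
- $\pi_\theta(a_t|s_t)$: probability of the chosen action under the agent's policy
- $y_\theta \in [0,1]^m$: $m$-dimensional categorical prediction vector ($m = 30$ is used)
- $\varphi$: shared neural network (dim 16 → dim 1)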
Because LPG takes as input the probability of the action in a state rather than the action itself, it can be applied to a variety of environments.
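As a rough illustration of the input layout and of what "backward" means here, the following Python sketch assembles $x_t$ for a short trajectory and scans it from the last step to the first. The names `build_inputs`, `backward_scan`, and `toy_cell` are hypothetical, and `toy_cell` is a deliberately simple stand-in for the LSTM (a discounted reward accumulator), not the learned network.

```python
def build_inputs(rewards, dones, action_probs, y_embed):
    """Assemble x_t = [r_t, d_t, pi(a_t|s_t), phi(y(s_t)), phi(y(s_{t+1}))].

    y_embed holds T+1 scalars, phi(y(s_0)) ... phi(y(s_T)), assumed precomputed.
    """
    T = len(rewards)
    assert len(dones) == T and len(action_probs) == T and len(y_embed) == T + 1
    return [
        [rewards[t], float(dones[t]), action_probs[t], y_embed[t], y_embed[t + 1]]
        for t in range(T)
    ]

def backward_scan(inputs, cell, h0):
    """Run a recurrent cell from the last time-step to the first."""
    h = h0
    outputs = [None] * len(inputs)
    for t in reversed(range(len(inputs))):
        h, outputs[t] = cell(h, inputs[t])
    return outputs

def toy_cell(h, x):
    # Stand-in for the LSTM (assumption): discounted reward carried backwards,
    # reset whenever the episode terminates (d_t = 1).
    r, d = x[0], x[1]
    h = r + 0.5 * (1.0 - d) * h
    return h, h

xs = build_inputs(
    rewards=[0.0, 0.0, 1.0],
    dones=[False, False, True],
    action_probs=[0.5, 0.25, 0.5],
    y_embed=[0.1, 0.2, 0.3, 0.0],
)
outs = backward_scan(xs, toy_cell, 0.0)  # [0.25, 0.5, 1.0]
```

Scanning backwards lets the output at step $t$ depend on rewards that arrive after $t$; note that the action itself never appears in $x_t$, only its probability.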
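2) Agent Update ($\theta$)
$\hat{\pi}$ updates $\theta$ directly, so that the agent's policy follows $\hat{\pi}$.
$\hat{y}$ updates $\theta$ indirectly, so that the prediction represents the state semantically, much like a value function.
$\theta$ is updated after a trajectory of $T$ time-steps has been formed ($T = 20$ is used).
$$\Delta\theta \propto \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s)\,\hat{\pi} - \alpha_y \nabla_\theta D_{KL}(y_\theta(s) \,\|\, \hat{y})\right]$$
(first term: categorical cross entropy; second term: KL-divergence)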
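The update direction can be worked out by hand for a single state when both the policy and the prediction are softmax outputs of logits. The sketch below is an assumption-laden illustration, not the authors' implementation: it treats the logits themselves as the parameters, takes $\hat{\pi}$ and $\hat{y}$ as given numbers, and assumes every entry of $\hat{y}$ is strictly positive.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def kl(p, q):
    # D_KL(p || q); q is assumed strictly positive.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def agent_update_direction(policy_logits, action, pi_hat, pred_logits, y_hat, alpha_y):
    """Per-state update direction for the two sets of logits.

    Policy term:     grad_logits log pi(a|s) * pi_hat = (onehot(a) - pi) * pi_hat
    Prediction term: -alpha_y * grad_logits KL(y || y_hat),
                     where d KL / d z_k = y_k * (log(y_k / y_hat_k) - KL).
    """
    pi = softmax(policy_logits)
    d_policy = [((1.0 if i == action else 0.0) - p) * pi_hat for i, p in enumerate(pi)]
    y = softmax(pred_logits)
    d = kl(y, y_hat)
    d_pred = [-alpha_y * yk * (math.log(yk / qk) - d) for yk, qk in zip(y, y_hat)]
    return d_policy, d_pred

# Toy usage: two actions, two prediction categories.
d_policy, d_pred = agent_update_direction(
    policy_logits=[0.0, 0.0], action=0, pi_hat=2.0,
    pred_logits=[0.0, 0.0], y_hat=[0.9, 0.1], alpha_y=1.0,
)
stepped = [z + 0.1 * g for z, g in zip([0.0, 0.0], d_pred)]
```

A positive $\hat{\pi}$ raises the logit of the action that was taken and lowers the others, while a small step along `d_pred` moves $y_\theta$ closer to $\hat{y}$ in KL terms.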
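3) LPG Update ($\eta$)
Ideally $\Delta\eta$ would be computed after training has run all the way from $\theta_0$ to $\theta_N$, but because of memory constraints $\Delta\eta$ is computed after only $K$ agent updates ($K < N$; $K = 5$ is used).
(e.g., with $T = 20$ and $K = 5$: every 20 time-steps $\theta_n \to \theta_{n+1}$ is updated; once 20×5 (=100) time-steps have passed and the agent has been updated up to $\theta_{n+5}$, $G$ is computed and $\Delta\eta$ is calculated; this repeats until the environment lifetime ends.)
Objective:
$$\eta^* = \arg\max_{\eta} \mathbb{E}_{\mathcal{E} \sim p(\mathcal{E})} \mathbb{E}_{\theta_0 \sim p(\theta_0)} [G]$$
Gradient:
$$\Delta\eta \propto \mathbb{E}_{\mathcal{E}} \mathbb{E}_{\theta}\left[\nabla_\eta \log \pi_{\theta_N}(a|s)\, G\right]$$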
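The $T$/$K$ arithmetic in the example can be checked with a few lines of Python. The helper name `update_schedule` is hypothetical; it only lists the time-steps at which each kind of update would fire, assuming updates happen exactly at multiples of $T$.

```python
def update_schedule(n_steps, T=20, K=5):
    """Return (theta_update_steps, eta_gradient_steps) for one lifetime.

    theta is updated every T time-steps; a meta-gradient for eta is
    computed after every K consecutive theta updates.
    """
    theta_updates, eta_grads = [], []
    for step in range(1, n_steps + 1):
        if step % T == 0:
            theta_updates.append(step)
            if len(theta_updates) % K == 0:
                eta_grads.append(step)
    return theta_updates, eta_grads

theta_steps, eta_steps = update_schedule(300)
# theta_steps -> [20, 40, ..., 300] (15 updates)
# eta_steps   -> [100, 200, 300]
```

Each meta-gradient therefore only has to keep $K \cdot T = 100$ time-steps of computation in memory, regardless of the lifetime length $N$.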
Regularisation terms are added to the gradient for stable training:
$$\Delta\eta \propto \mathbb{E}_{\mathcal{E}} \mathbb{E}_{\theta}\left[\nabla_\eta \log \pi_{\theta_N}(a|s)\, G + \beta_0 \nabla_\eta \mathcal{H}(\pi_{\theta_N}) + \beta_1 \nabla_\eta \mathcal{H}(y_{\theta_N}) - \beta_2 \nabla_\eta \|\hat{\pi}\|_2^2 - \beta_3 \nabla_\eta \|\hat{y}\|_2^2\right]$$
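The four regularisers can be read as one scalar whose gradient with respect to $\eta$ is added to the meta-gradient. The sketch below only evaluates that scalar for given numbers; differentiating it through the agent updates would need an autodiff framework and is not shown. The function names and the $\beta$ values in the usage lines are illustrative assumptions, not values from the slides.

```python
import math

def entropy(p):
    """Shannon entropy of a categorical distribution (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def regulariser(pi, y, pi_hats, y_hats, betas):
    """beta0*H(pi) + beta1*H(y) - beta2*||pi_hat||^2 - beta3*||y_hat||^2.

    pi, y    : the agent's policy and prediction distributions at one state
    pi_hats  : LPG's scalar outputs pi_hat along a trajectory
    y_hats   : LPG's vector outputs y_hat along a trajectory
    """
    b0, b1, b2, b3 = betas
    l2_pi = sum(v * v for v in pi_hats)
    l2_y = sum(v * v for row in y_hats for v in row)
    return b0 * entropy(pi) + b1 * entropy(y) - b2 * l2_pi - b3 * l2_y

value = regulariser(
    pi=[0.5, 0.5], y=[1.0, 0.0],
    pi_hats=[3.0, 4.0], y_hats=[[1.0, 0.0]],
    betas=(1.0, 1.0, 0.1, 0.5),  # placeholder weights
)
```

The entropy terms reward update rules that keep the policy and prediction from collapsing, while the L2 terms penalise large LPG outputs.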
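4) Balancing Agent Hyperparameters for Stabilisation ($\alpha$)
Many different environments are trained on at once, and applying the same hyperparameters (e.g., learning rate) to all of them makes training unstable, so the hyperparameters are set dynamically per environment.
$$\eta^* = \arg\max_{\eta} \mathbb{E}_{\mathcal{E} \sim p(\mathcal{E})} \max_{\alpha} \mathbb{E}_{\theta_0 \sim p(\theta_0)} [G]$$
$$\alpha \sim p(\alpha|\mathcal{E})$$
For each given environment $\mathcal{E}$, the probability of sampling hyperparameters that increase $G$ is raised. ($\alpha$ = learning rate and KL-divergence weight.)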
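One simple way to realise "raise the probability of hyperparameters that increase $G$" is a gradient bandit over a fixed set of candidates, kept separately for each environment. The sketch below is that swapped-in textbook technique, not necessarily the sampler used in the paper; the class name, the step size, and the candidate values are all assumptions.

```python
import math
import random

class HyperparamBandit:
    """Softmax-preference (gradient) bandit over candidate hyperparameters."""

    def __init__(self, arms, step_size=0.1, seed=0):
        self.arms = list(arms)            # e.g. (learning rate, KL weight) pairs
        self.prefs = [0.0] * len(self.arms)
        self.baseline = 0.0               # running mean of observed returns
        self.count = 0
        self.step_size = step_size
        self.rng = random.Random(seed)

    def probs(self):
        m = max(self.prefs)
        e = [math.exp(p - m) for p in self.prefs]
        s = sum(e)
        return [v / s for v in e]

    def sample(self):
        """Draw an arm index from p(alpha | E)."""
        return self.rng.choices(range(len(self.arms)), weights=self.probs())[0]

    def update(self, arm, G):
        """Shift probability towards arms whose return beats the running mean."""
        self.count += 1
        self.baseline += (G - self.baseline) / self.count
        advantage = G - self.baseline
        p = self.probs()
        for i in range(len(self.prefs)):
            indicator = 1.0 if i == arm else 0.0
            self.prefs[i] += self.step_size * advantage * (indicator - p[i])

# Toy usage: arm 0 yields a higher lifetime return than arm 1.
bandit = HyperparamBandit(arms=[(1e-3, 0.1), (1e-2, 1.0)])
bandit.update(0, 1.0)
bandit.update(1, 0.0)
p0, p1 = bandit.probs()
```

After the two updates `p0 > p1`, so the better-performing candidate is more likely to be sampled for that environment in the next lifetime.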
Ablation Study Result (figure)
Training procedure (diagram):
- SAMPLE $\mathcal{E} \sim p(\mathcal{E})$, $\theta \sim p(\theta)$, $\alpha \sim p(\alpha|\mathcal{E})$
- Each agent interacts with its environment for a lifetime of $N$ time-steps: $x_1 \to \cdots \to x_{20} \to \cdots \to x_{100} \to \cdots \to x_N$
- UPDATE agent parameter $\theta$
- COMPUTE & SAVE LPG parameter $\eta$
- UPDATE $p(\alpha|\mathcal{E})$
- UPDATE LPG parameter $\eta$ using averaged $\eta$
Diagram labels: Environment, Agent; 940, 64
4. Experiments
Setting
- Baselines
1. A2C
2. LPG-V (learns only $\hat{\pi}$, given $y$ as a TD($\lambda$) value function)
- Training Environments
1. Tabular grid worlds
2. Random grid worlds
3. Delayed chain MDP
Result figures (slide titles):
- Specialising in Training Environments
- What does the prediction (y) look like?
- Does the prediction (y) capture true values and beyond?
- Does the prediction (y) converge?
- Ablation Study
- Generalising from Toy Environments to Atari Games (selected results)
