Multi-Agent Actor-Critic for Mixed
Cooperative-Competitive Environments
Ryan Lowe & Yi Wu (OpenAI), NIPS 2017
Presenter: Jisang Yoon
Graduate School of Information, Yonsei Univ.
Machine Learning & Computational Finance Lab.
1. Introduction
2. Background
3. MADDPG
a. Multi-Agent Actor Critic
b. Inferring Policies of Other Agents
c. Agents with Policy Ensembles
4. Experiments
INDEX
1 Introduction
1. Introduction
https://sites.google.com/site/multiagentac/
1. Introduction
To generalize the reinforcement learning problem, multi-agent reinforcement
learning (MARL) is important and should be developed further.
MARL is widely used in social science, finance, signal/communication networks,
virtual physical systems, etc.
However, several problems make structuring a MARL model difficult:
 Non-stationary environment distribution
 Inefficient communication
 Suboptimal decisions caused by partial observation
 Scalability issues from the joint action space growing exponentially
Multi-agent Reinforcement Learning
2 Background
2. Background
Partially Observable Markov Games
⟨𝑵, 𝑺, 𝑨, 𝑻, 𝑹, 𝜸, 𝑶, 𝒁⟩
1. 𝑁 is the number of agents
2. 𝑆 is the state set
3. 𝐴 is the action set
4. 𝑇 is the state transition function
5. 𝑅 is the reward function
6. 𝛾 is the discount factor
7. 𝑂 is the observation set
8. 𝑍 is the observation function (mapping states to observations)
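As a minimal illustration (not part of the original slides), this tuple maps naturally onto an environment interface; a Python sketch with hypothetical class and method names:

from dataclasses import dataclass
from typing import Any, List

@dataclass
class StepResult:
    observations: List[Any]   # one observation per agent, emitted via Z
    rewards: List[float]      # one reward per agent, from R
    done: bool

class POMarkovGame:
    """Hypothetical interface mirroring the tuple <N, S, A, T, R, gamma, O, Z>."""
    def __init__(self, n_agents: int, gamma: float = 0.95):
        self.n_agents = n_agents   # N
        self.gamma = gamma         # discount factor
    def reset(self) -> List[Any]:
        """Sample an initial state from S; return per-agent observations from O."""
        raise NotImplementedError
    def step(self, actions: List[Any]) -> StepResult:
        """Apply the joint action through T; emit observations via Z and rewards via R."""
        raise NotImplementedError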
2. Background
Partially Observable Markov Games
[Diagram: each agent 𝑖 observes its own 𝑜𝑏𝑠𝑒𝑟𝑣𝑎𝑡𝑖𝑜𝑛𝑖 of the shared state, takes 𝑎𝑐𝑡𝑖𝑜𝑛𝑖, and receives 𝑟𝑒𝑤𝑎𝑟𝑑𝑖.]
2. Background
Target Q function
ℒ(𝜃) = 𝔼𝑠,𝑎,𝑟,𝑠′[(𝑄∗(𝑠, 𝑎|𝜃) − 𝑦)²],  𝑦 = 𝑟 + 𝛾𝑄∗(𝑠′, 𝑎′|𝜃′)
where 𝑄∗(·|𝜃′) is the target Q function.
After some time steps, 𝜃′ ← 𝜏𝜃 + (1 − 𝜏)𝜃′ (target 𝑄 update).
[Diagram: along an exploration trajectory, each 𝑄𝑖 is regressed toward the one-step target 𝑟𝑖 + 𝛾𝑄𝑖+1 instead of the full discounted return 𝑟𝑖+1 + 𝛾𝑟𝑖+2 + ⋯ — training the Q function with supervised learning!]
To know which action is optimal, learn an accurate action-value function 𝑄
and select the action that maximizes 𝑄.
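A minimal PyTorch-style sketch of this scheme, assuming a discrete-action network q_net(s) that outputs one value per action; all names here are illustrative assumptions, not the paper's code:

import torch
import torch.nn.functional as F

def td_loss(q_net, target_net, batch, gamma=0.99):
    # batch: tensors sampled from a replay buffer; a is int64 of shape [batch, 1]
    s, a, r, s_next = batch
    q = q_net(s).gather(1, a)                 # Q(s, a | theta) for the taken actions
    with torch.no_grad():                     # the target y is a fixed regression label
        y = r + gamma * target_net(s_next).max(dim=1, keepdim=True).values
    return F.mse_loss(q, y)                   # train Q with supervised learning

def soft_update(target_net, q_net, tau=0.01):
    # theta' <- tau * theta + (1 - tau) * theta'
    for tp, p in zip(target_net.parameters(), q_net.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * p.data)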
2. Background
Deterministic Policy Gradient
The objective 𝐽(𝜃) = 𝔼𝑠∼𝑝𝜇[𝑄𝜇(𝑠, 𝑎)|𝑎=𝜇𝜃(𝑠)] — the action value over all states —
should be maximized to find the optimal policy 𝜇𝜃.

∇𝜃𝐽(𝜃) = 𝔼𝑠∼𝑝𝜇[∇𝜃𝑄𝜇(𝑠, 𝑎)|𝑎=𝜇𝜃(𝑠)]
       = 𝔼𝑠∼𝑝𝜇[∇𝑎𝑄𝜇(𝑠, 𝑎) ∇𝜃𝜇𝜃(𝑠)|𝑎=𝜇𝜃(𝑠)]   (chain rule)
       ≈ 𝔼𝑠∼𝒟[∇𝑎𝑄𝜇(𝑠, 𝑎) ∇𝜃𝜇𝜃(𝑠)|𝑎=𝜇𝜃(𝑠)]

where
• 𝑝𝜇(𝑠) = state distribution under 𝜇
• 𝜇𝜃(𝑠) = action selected by the deterministic policy
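A tiny toy check (my own illustration, not from the slides) that autograd reproduces this chain rule when we differentiate 𝑄(𝑠, 𝜇𝜃(𝑠)) with respect to 𝜃:

import torch

# Toy check that the gradient of Q(s, mu_theta(s)) w.r.t. theta realizes the
# chain rule dQ/da * dmu/dtheta; Q and mu here are toys, not the paper's models.
theta = torch.tensor(0.7, requires_grad=True)
s = torch.tensor(2.0)

def mu(s):                 # deterministic policy mu_theta(s)
    return torch.tanh(theta * s)

def Q(s, a):               # toy action-value function
    return -(a - 0.5 * s) ** 2

J = Q(s, mu(s))
J.backward()               # autograd applies the chain rule
print(theta.grad)          # equals (dQ/da) * (dmu/dtheta) evaluated at a = mu(s)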
2. Background
Actor Critic
Actor: policy 𝜇 that outputs the action
Critic: action-value function 𝑄 that evaluates which state-action pair is best

Critic update: ℒ(𝜃) = 𝔼𝑠,𝑎,𝑟,𝑠′[(𝑄𝜇(𝑠, 𝑎|𝜃) − 𝑦)²]
Actor update:  ∇𝜃𝐽(𝜃) = 𝔼𝑠∼𝒟[∇𝑎𝑄𝜇(𝑠, 𝑎) ∇𝜃𝜇𝜃(𝑠)|𝑎=𝜇𝜃(𝑠)]
[Diagram: the actor maps state → action; the critic maps (state, action) → 𝑄𝜇(𝑠, 𝑎|𝜃), which in turn guides the actor's gradient.]
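Putting the two updates together, a hedged DDPG-style sketch in PyTorch (actor, critic, and their target networks are assumed to exist; all names are illustrative):

import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99):
    s, a, r, s_next = batch
    # Critic: minimize (Q(s, a | theta) - y)^2 with y from the target networks
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor: follow grad_a Q(s, a) * grad_theta mu_theta(s); autograd applies
    # the chain rule when we maximize Q(s, mu_theta(s))
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()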
3 MADDPG
3.MADDPG
Contribution
a. Decentralized actors with a centralized critic turn the non-stationary
multi-agent problem into a stationary one
b. An approximation of other agents' policies is introduced so the centralized
critic can still be used when other agents' policies are unknown
c. Policy ensembles are used to compete against adversaries

From a single agent's view, the environment is non-stationary as other policies change:
𝑃(𝑠′|𝑠, 𝑎𝑖, 𝜋1, … , 𝜋𝑖, … , 𝜋𝑁) ≠ 𝑃(𝑠′|𝑠, 𝑎𝑖, 𝜋′1, … , 𝜋′𝑖, … , 𝜋′𝑁)
but conditioning on all agents' actions restores stationarity:
𝑃(𝑠′|𝑠, 𝑎1, … , 𝑎𝑁, 𝜋1, … , 𝜋𝑁) = 𝑃(𝑠′|𝑠, 𝑎1, … , 𝑎𝑁, 𝜋′1, … , 𝜋′𝑁)
3.MADDPG
Overview
3.MADDPG
Overview(a. Multi-Agent Actor Critic)
In this case, each agent (actor) can use only its partial observation to select actions,
→ so the agents' policies may fail to converge to optimality (non-stationarity).
Therefore, this work uses a critic that takes all agents' observations and actions
as input to guide each agent (actor).
3.MADDPG
Overview(b. Inferring Policies of Other Agents)
Agents can exploit the centralized critic 𝑄(𝑜, 𝑎1, 𝑎2, … , 𝑎𝑁)
only when the policies of all agents are known.
3.MADDPG
Overview(b. Inferring Policies of Other Agents)
When the policies of the other agents are unknown, predicted actions 𝑎𝑛 (= 𝜇𝑛(𝑜𝑛))
from learned approximate policies are fed into 𝑄 (i.e., 𝑄(𝑜, 𝑎1, 𝜇2, … , 𝜇𝑁) for agent 1).
[Diagram: approximate policies 𝜇1, 𝜇2, … , 𝜇𝑁 produce the predicted 𝑎𝑛.]
3.MADDPG
Overview(c. Agents with Policy Ensembles)
In competitive environments, each agent's policy is an ensemble of 𝑲 sub-policies
to prevent overfitting to the policies of competitors:
(𝜇1^(1), 𝜇1^(2), … , 𝜇1^(𝐾)), … , (𝜇𝑁^(1), 𝜇𝑁^(2), … , 𝜇𝑁^(𝐾))
3.a Multi-Agent Actor Critic
Decentralized Actor, Centralized Critic
 Observations of 𝑁 agents: (𝑜1, 𝑜2, … , 𝑜𝑁)
 Deterministic policies 𝜇 (actors) parameterized by 𝜃: (𝜇1, 𝜇2, … , 𝜇𝑁)
 Centralized action-value functions 𝑄 (critics): 𝑄𝑖^𝜇(x, 𝑎1, 𝑎2, … , 𝑎𝑁),
where x = (𝑜1, 𝑜2, … , 𝑜𝑁), optionally including additional state information
 Experience replay buffer 𝒟 contains tuples (x, x′, 𝑎1, … , 𝑎𝑁, 𝑟1, … , 𝑟𝑁)
 x′ = x at the next time step / 𝑎′𝑖 = action of agent 𝑖 at the next time step

1. The gradient of 𝜇𝑖 can be written as:
∇𝜃𝑖𝐽(𝜇𝑖) = 𝔼x,𝑎∼𝒟[∇𝜃𝑖𝜇𝑖(𝑎𝑖|𝑜𝑖) ∇𝑎𝑖𝑄𝑖^𝜇(x, 𝑎1, 𝑎2, … , 𝑎𝑁)|𝑎𝑖=𝜇𝑖(𝑜𝑖)]

2. The centralized action-value function 𝑄𝑖^𝜇 is updated as:
ℒ(𝜃𝑖) = 𝔼x,𝑎,𝑟,x′[(𝑄𝑖^𝜇(x, 𝑎1, 𝑎2, … , 𝑎𝑁) − 𝑦)²]
𝑦 = 𝑟𝑖 + 𝛾𝑄𝑖^𝜇′(x′, 𝑎′1, 𝑎′2, … , 𝑎′𝑁)|𝑎′𝑗=𝜇′𝑗(𝑜𝑗)

where 𝜇′ = {𝜇𝜃′1, … , 𝜇𝜃′𝑁} is the set of target policies with delayed parameters 𝜃′𝑖
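A rough Python sketch of these two updates for agent 𝑖, assuming lists of per-agent networks, optimizers, and a shared replay buffer; this is one interpretation of the equations, not the authors' implementation:

import torch
import torch.nn.functional as F

def maddpg_update(i, actors, critics, target_actors, target_critics,
                  actor_opts, critic_opts, batch, gamma=0.95):
    obs, acts, rews, next_obs = batch       # each: a list of per-agent tensors
    x = torch.cat(obs, dim=1)               # x = (o_1, ..., o_N)
    x_next = torch.cat(next_obs, dim=1)
    # Centralized critic: y = r_i + gamma * Q_i^mu'(x', a'_1..a'_N), a'_j = mu'_j(o_j)
    with torch.no_grad():
        a_next = torch.cat([ta(o) for ta, o in zip(target_actors, next_obs)], dim=1)
        y = rews[i] + gamma * target_critics[i](x_next, a_next)
    critic_loss = F.mse_loss(critics[i](x, torch.cat(acts, dim=1)), y)
    critic_opts[i].zero_grad(); critic_loss.backward(); critic_opts[i].step()
    # Decentralized actor: substitute a_i = mu_i(o_i), keep others' sampled actions
    a = [a_j.detach() for a_j in acts]
    a[i] = actors[i](obs[i])
    actor_loss = -critics[i](x, torch.cat(a, dim=1)).mean()
    actor_opts[i].zero_grad(); actor_loss.backward(); actor_opts[i].step()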
3.b Inferring Policies of Other Agents
Policy Approximation
If agents do not know the other agents' policies, 𝑄𝑖^𝜇(x, 𝑎1, 𝑎2, … , 𝑎𝑁) cannot be computed.
So 𝝁𝒊^𝒋(𝒂𝒋|𝒐𝒋) is introduced to approximate 𝒂𝒋.

The approximate policy 𝜇𝑖^𝑗, parameterized by 𝜙𝑖^𝑗, is learned by minimizing:
ℒ(𝜙𝑖^𝑗) = −𝔼𝑜𝑗,𝑎𝑗[log 𝜇𝑖^𝑗(𝑎𝑗|𝑜𝑗) + 𝜆𝐻(𝜇𝑖^𝑗)]
where 𝐻 is the entropy of the policy distribution, which encourages exploration.
(If 𝜇𝑖^𝑗(𝑎𝑗|𝑜𝑗) = 1, then log 𝜇𝑖^𝑗(𝑎𝑗|𝑜𝑗) = 0.)

The critic is then updated with:
ℒ(𝜃𝑖) = 𝔼x,𝑎,𝑟,x′[(𝑄𝑖^𝜇(x, 𝑎1, 𝑎2, … , 𝑎𝑁) − 𝑦)²]
𝑦 = 𝑟𝑖 + 𝛾𝑄𝑖^𝜇′(x′, 𝜇𝑖^′1(𝑜1), 𝜇𝑖^′2(𝑜2), … , 𝜇𝑖^′𝑁(𝑜𝑁))
where 𝜇𝑖^′𝑗 denotes the target network of the approximate policy 𝜇𝑖^𝑗
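A minimal sketch of this approximation loss, assuming each approximate policy outputs a categorical distribution over agent 𝑗's discrete actions (function and argument names are hypothetical):

import torch

def approx_policy_loss(approx_policy, o_j, a_j, lam=0.001):
    """L(phi_i^j) = -E[ log mu_i^j(a_j | o_j) + lambda * H(mu_i^j) ]."""
    logits = approx_policy(o_j)                        # [batch, n_actions]
    dist = torch.distributions.Categorical(logits=logits)
    log_prob = dist.log_prob(a_j)                      # log mu_i^j(a_j | o_j)
    entropy = dist.entropy()                           # H(mu_i^j), encourages exploration
    return -(log_prob + lam * entropy).mean()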
3.c Agents with Policy Ensembles
Policy Ensembles to prevent overfitting to competitors
In competitive settings, agents can derive a strong policy by overfitting to the behavior
of their competitors. Such a policy may fail when the competitors alter their strategies.
This work therefore trains a collection of 𝐾 different sub-policies per agent.

Sub-policy 𝑘 of agent 𝑖: 𝜇𝑖^(𝑘), with 𝜇𝑖^(𝑘) ∈ 𝜇𝑖

Ensemble objective:
𝐽𝑒(𝜇𝑖) = 𝔼𝑘∼unif(1,𝐾), 𝑠∼𝑝^𝜇, 𝑎∼𝜇𝑖^(𝑘)[𝑅𝑖(𝑠, 𝑎)]

Maximizing the ensemble objective:
∇𝜃𝑖^(𝑘)𝐽𝑒(𝜇𝑖) = (1/𝐾) 𝔼x,𝑎∼𝒟𝑖^(𝑘)[∇𝜃𝑖^(𝑘)𝜇𝑖^(𝑘)(𝑎𝑖|𝑜𝑖) ∇𝑎𝑖𝑄𝑖^𝜇𝑖(x, 𝑎1, 𝑎2, … , 𝑎𝑁)|𝑎𝑖=𝜇𝑖^(𝑘)(𝑜𝑖)]

The model maintains a replay buffer 𝒟𝑖^(𝑘) for each sub-policy 𝜇𝑖^(𝑘) of agent 𝑖.
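A sketch of the ensemble bookkeeping implied above — one sub-policy sampled uniformly per episode and one replay buffer per sub-policy; the env and buffer APIs are assumptions:

import random

def run_episode_with_ensembles(env, sub_policies, buffers, n_agents, K):
    # k ~ unif(1, K): each agent commits to one sub-policy for the whole episode
    ks = [random.randrange(K) for _ in range(n_agents)]
    obs = env.reset()
    done = False
    while not done:
        acts = [sub_policies[i][ks[i]](obs[i]) for i in range(n_agents)]
        next_obs, rewards, done = env.step(acts)
        for i in range(n_agents):
            # experience feeds only the buffer D_i^(k) of the sub-policy that produced it
            buffers[i][ks[i]].add(obs, acts, rewards, next_obs)
        obs = next_obs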
4 Experiments
4. Experiments
Problems
I. Comparison to Decentralized Reinforcement Learning Methods
II. Effect of Learning Policies of Other Agents
III. Effect of Training with Policy Ensembles
4. Experiments
Problems
[Diagram: the benchmark environments — Cooperative Communication, Cooperative Navigation,
Keep-away (agent 1 vs. adversary 1), Predator-prey, Physical Deception, and Covert
Communication (agent 1 sends agent 2 an encrypted message, e.g. "apple" → "A98F1C4",
which the adversary tries to decode).]
4. Experiments
I. Comparison to Decentralized Reinforcement Learning
Methods
<Agent reward on cooperative
communication after 25000 episodes>
<Policy learning success rate on cooperative
communication after 25000 episodes>
4. Experiments
I. Comparison to Decentralized Reinforcement Learning
Methods
[Diagram: covert communication — agent 1 encrypts "apple" as "A98F1C4" for agent 2,
while the adversary attempts to decode it.]
4. Experiments
I. Comparison to Decentralized Reinforcement Learning
Methods
4. Experiments
II. Effect of Learning Policies of Other Agents
• 𝜆 = 0.001 in ℒ(𝜙𝑖^𝑗) = −𝔼𝑜𝑗,𝑎𝑗[log 𝜇𝑖^𝑗(𝑎𝑗|𝑜𝑗) + 𝜆𝐻(𝜇𝑖^𝑗)]
The approximate policies differ noticeably from the true ones,
but training with them reaches the same result!
4. Experiments
III. Effect of Training with Policy Ensembles
[Diagram: keep-away (agent 1 vs. adversary 1) trained with policy ensembles,
with ensemble sizes 𝐾 = 3, 𝐾 = 3, and 𝐾 = 2.]
4. Experiments
Results