Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments
Ryan Lowe & Yi Wu (OpenAI), NIPS 2017
Presenter: Jisang Yoon
Graduate School of Information, Yonsei Univ.
Machine Learning & Computational Finance Lab.
INDEX
1. Introduction
2. Background
3. MADDPG
a. Multi-Agent Actor-Critic
b. Inferring Policies of Other Agents
c. Agents with Policy Ensembles
4. Experiments
1. Introduction
Multi-Agent Reinforcement Learning
To generalize the reinforcement learning problem, multi-agent reinforcement learning (MARL) is important and deserves further development.
MARL is widely used in social science, finance, signal/communication networks, cyber-physical systems, etc.
However, several problems make structuring a MARL model difficult:
• Non-stationary environment distributions
• Inefficient communication
• Suboptimal decisions caused by partial observation
• Scalability issues from a joint action space that grows exponentially
2. Background
Partially Observable Markov Games
⟨N, S, A, T, R, γ, O, Z⟩
1. N is the number of agents
2. S is the set of states
3. A is the set of actions
4. T is the state transition function
5. R is the reward function
6. γ is the discount factor
7. O is the set of observations
8. Z is the observation transition function
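As a concrete data-structure sketch, this tuple could be held in a small Python container like the one below; the class name and callable signatures are illustrative assumptions, not from the paper.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

State = Any
Action = Any
Observation = Any

@dataclass
class PartiallyObservableMarkovGame:
    """Hypothetical container mirroring the tuple <N, S, A, T, R, gamma, O, Z>."""
    n_agents: int                                  # N: number of agents
    states: Sequence[State]                        # S: set of states
    actions: Sequence[Sequence[Action]]            # A: one action set per agent
    transition: Callable[..., State]               # T(s, a_1, ..., a_N) -> next state
    reward: Callable[..., Sequence[float]]         # R(s, a_1, ..., a_N) -> reward per agent
    gamma: float                                   # discount factor
    observations: Sequence[Sequence[Observation]]  # O: one observation set per agent
    observe: Callable[..., Sequence[Observation]]  # Z(s) -> observation per agent
```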
2. Background
Target Q function
ℒ(θᵢ) = 𝔼_{s,a,r,s′}[(Q(s, a | θᵢ) − y)²],  y = r + γ Q(s′, a′ | θ′)
where Q(· | θ′) is the target Q function.
Every few time steps, θ′ ← τθ + (1 − τ)θ′ (target Q update).
[Diagram: exploration/update loop in which each bootstrapped target γQᵢ + rᵢ stands in for the remaining discounted return r_{i+1} + λr_{i+2} + ⋯ + λ^{n−i−1} rₙ]
This trains the Q function with supervised learning!
To know which action is optimal, learn an accurate action-value function Q and select the action that maximizes Q.
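A minimal PyTorch sketch of this supervised TD loss and the soft target update, using a greedy max over next actions (the DQN choice); network sizes, optimizer, and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))       # Q(s, . | theta)
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # Q(s, . | theta')
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma, tau = 0.99, 0.01

def td_update(s, a, r, s_next):
    """One supervised step on L(theta) = E[(Q(s, a | theta) - y)^2]."""
    with torch.no_grad():                                  # y is computed with frozen theta'
        y = r + gamma * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a | theta)
    loss = ((q_sa - y) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # target update: theta' <- tau * theta + (1 - tau) * theta'
    for p, p_t in zip(q_net.parameters(), target_net.parameters()):
        p_t.data.mul_(1 - tau).add_(tau * p.data)
```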
2. Background
Deterministic Policy Gradient
The objective J(θ) = 𝔼_{s∼p^μ}[Q^μ(s, a)|_{a=μ_θ(s)}] (the action value over all states) should be maximized to find the optimal policy μ_θ.
∇_θ J(θ) = 𝔼_{s∼p^μ}[∇_θ Q^μ(s, a)|_{a=μ_θ(s)}]
= 𝔼_{s∼p^μ}[∇_a Q^μ(s, a) ∇_θ μ_θ(s)|_{a=μ_θ(s)}]   (chain rule)
Sampling states from a replay buffer 𝒟 in practice:
∇_θ J(θ) = 𝔼_{s∼𝒟}[∇_a Q^μ(s, a) ∇_θ μ_θ(s)|_{a=μ_θ(s)}]
where
• p^μ(s) is the state distribution under policy μ
• μ_θ(s) is the action chosen by the deterministic policy at state s
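In code, autograd applies this chain rule automatically once the policy's action is fed into the critic; a minimal sketch, assuming small illustrative networks:

```python
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))       # mu_theta(s)
critic = nn.Sequential(nn.Linear(4 + 2, 64), nn.ReLU(), nn.Linear(64, 1))  # Q^mu(s, a)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

def dpg_step(states):
    """Ascend E_s[grad_a Q(s, a) * grad_theta mu_theta(s)] by minimizing -Q(s, mu(s))."""
    actions = actor(states)                   # a = mu_theta(s)
    q = critic(torch.cat([states, actions], dim=1))
    loss = -q.mean()                          # maximize Q <=> minimize -Q
    actor_opt.zero_grad()
    loss.backward()                           # autograd chains grad_a Q through mu_theta
    actor_opt.step()
```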
2. Background
Actor-Critic
Actor: the policy μ, which outputs an action.
Critic: the action-value function Q, which evaluates which state-action pair is best.
Critic loss: ℒ(θᵢ) = 𝔼_{s,a,r,s′}[(Q^μ(s, a | θ) − y)²]
Actor gradient: ∇_θ J(θ) = 𝔼_{s∼𝒟}[∇_a Q^μ(s, a) ∇_θ μ_θ(s)|_{a=μ_θ(s)}]
[Diagram: the actor maps the state to an action; the critic takes the state and action and outputs Q^μ(s, a | θ)]
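Combining the two updates gives a DDPG-style step: the critic is trained with the supervised TD loss, then the actor with the deterministic policy gradient. A sketch, with all arguments assumed to be pre-built networks and batched tensors:

```python
import torch

def actor_critic_step(s, a, r, s_next, actor, critic, target_actor, target_critic,
                      actor_opt, critic_opt, gamma=0.99):
    """One critic (supervised) step followed by one actor (policy-gradient) step."""
    # critic: minimize (Q(s, a | theta) - y)^2 with y from the target networks
    with torch.no_grad():
        a_next = target_actor(s_next)
        y = r + gamma * target_critic(torch.cat([s_next, a_next], dim=1)).squeeze(1)
    q = critic(torch.cat([s, a], dim=1)).squeeze(1)
    critic_loss = ((q - y) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # actor: ascend Q(s, mu_theta(s))
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```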
3. MADDPG
Overview (a. Multi-Agent Actor-Critic)
In this setting, each agent can use only its own partial observation to choose actions,
→ so the agents' policies cannot converge to optimality (the environment is non-stationary from each agent's perspective).
This work therefore uses a centralized critic that takes the observations and actions of all agents as input to guide each agent (actor).
3. MADDPG
Overview (b. Inferring Policies of Other Agents)
When the policies of the other agents are unknown, predicted actions âⱼ (= μ̂ⱼ) are fed to Q instead (i.e., Q(o, a₁, μ̂₂, …, μ̂_N) for agent 1).
3. MADDPG
Overview (c. Agents with Policy Ensembles)
In competitive environments, each agent's policy is an ensemble of k policies,
(μ₁⁽¹⁾, μ₁⁽²⁾, …, μ₁⁽ᵏ⁾), …, (μ_N⁽¹⁾, μ_N⁽²⁾, …, μ_N⁽ᵏ⁾),
to prevent overfitting to the competitors' policies.
3.a Multi-Agent Actor-Critic
Decentralized Actor, Centralized Critic
• Observations of the N agents: (o₁, o₂, …, o_N)
• Deterministic policies (actors) μ parameterized by θ: (μ₁, μ₂, …, μ_N)
• Centralized action-value function (critic): Qᵢ^μ(x, a₁, a₂, …, a_N), where x = (o₁, o₂, …, o_N, ε) and ε is optional additional state information
• Experience replay buffer 𝒟 contains (x, x′, a₁, …, a_N, r₁, …, r_N), where x′ is x at the next time step and a′ᵢ is the action of agent i at the next time step

1. The gradient of μᵢ can be written as:
∇_{θᵢ} J(μᵢ) = 𝔼_{x,a∼𝒟}[∇_{θᵢ} μᵢ(aᵢ | oᵢ) ∇_{aᵢ} Qᵢ^μ(x, a₁, a₂, …, a_N)|_{aᵢ=μᵢ(oᵢ)}]

2. The centralized action-value function Qᵢ^μ is updated as:
ℒ(θᵢ) = 𝔼_{x,a,r,x′}[(Qᵢ^μ(x, a₁, a₂, …, a_N) − y)²]
y = rᵢ + γ Qᵢ^{μ′}(x′, a′₁, a′₂, …, a′_N)|_{a′ⱼ=μ′ⱼ(oⱼ)}
where μ′ = {μ_{θ′₁}, …, μ_{θ′_N}} is the set of target policies with delayed parameters θ′ᵢ.
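A condensed PyTorch sketch of these two updates for agent i, assuming per-agent lists of networks and a batch already sampled from 𝒟 (every name and shape here is an illustrative assumption):

```python
import torch

def maddpg_step(i, obs, acts, rew_i, obs_next, actors, critics,
                target_actors, target_critic_i, actor_opt_i, critic_opt_i, gamma=0.95):
    """Centralized-critic / decentralized-actor update for agent i.
    obs, acts, obs_next are lists of per-agent batched tensors."""
    x = torch.cat(obs, dim=1)               # x = (o_1, ..., o_N)
    x_next = torch.cat(obs_next, dim=1)     # x'
    # critic target: y = r_i + gamma * Q_i^{mu'}(x', a'_1, ..., a'_N), a'_j = mu'_j(o_j)
    with torch.no_grad():
        a_next = [mu_t(o) for mu_t, o in zip(target_actors, obs_next)]
        y = rew_i + gamma * target_critic_i(torch.cat([x_next, *a_next], dim=1)).squeeze(1)
    q = critics[i](torch.cat([x, *acts], dim=1)).squeeze(1)
    critic_loss = ((q - y) ** 2).mean()
    critic_opt_i.zero_grad(); critic_loss.backward(); critic_opt_i.step()
    # actor: ascend Q_i with a_i = mu_i(o_i) while the other agents' actions stay fixed
    acts_pg = [a.detach() for a in acts]
    acts_pg[i] = actors[i](obs[i])
    actor_loss = -critics[i](torch.cat([x, *acts_pg], dim=1)).mean()
    actor_opt_i.zero_grad(); actor_loss.backward(); actor_opt_i.step()
```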
3.b Inferring Policies of Other Agents
Policy Approximation
If an agent does not know the other agents' policies, Qᵢ^μ(x, a₁, a₂, …, a_N) cannot be calculated.
So an approximate policy μ̂ᵢʲ(aⱼ | oⱼ) is introduced to approximate aⱼ.
The approximate policy μ̂ᵢʲ, parameterized by φᵢʲ, is learned by minimizing:
ℒ(φᵢʲ) = −𝔼_{oⱼ,aⱼ}[log μ̂ᵢʲ(aⱼ | oⱼ) + λ H(μ̂ᵢʲ)]
where H is the entropy of the policy distribution, which makes the model explore more.
(If μ̂ᵢʲ(aⱼ | oⱼ) = 1, then log μ̂ᵢʲ(aⱼ | oⱼ) = 0.)
The critic is then updated with:
ℒ(θᵢ) = 𝔼_{x,a,r,x′}[(Qᵢ^μ(x, a₁, a₂, …, a_N) − y)²]
y = rᵢ + γ Qᵢ^{μ′}(x′, μ̂ᵢ′¹(o₁), μ̂ᵢ′²(o₂), …, μ̂ᵢ′ᴺ(o_N))
where μ̂ᵢ′ʲ denotes the target network for the approximate policy μ̂ᵢʲ.
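A sketch of this maximum-likelihood-plus-entropy loss for a batch of observed (oⱼ, aⱼ) pairs, assuming discrete actions and an illustrative logits network:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

approx_policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 5))  # logits over a_j
opt = torch.optim.Adam(approx_policy.parameters(), lr=1e-3)
lam = 0.001  # entropy coefficient lambda (the value used in the experiments)

def infer_policy_step(o_j, a_j):
    """Minimize -E[log mu_hat(a_j | o_j) + lambda * H(mu_hat)]."""
    log_probs = F.log_softmax(approx_policy(o_j), dim=1)
    log_lik = log_probs.gather(1, a_j.unsqueeze(1)).squeeze(1)  # log mu_hat(a_j | o_j)
    entropy = -(log_probs.exp() * log_probs).sum(dim=1)         # H(mu_hat(. | o_j))
    loss = -(log_lik + lam * entropy).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```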
3.c Agents with Policy Ensembles
Policy Ensembles to Prevent Overfitting to Competitors
In competitive settings, agents can derive a strong policy by overfitting to the behavior of their competitors, but such a policy may fail when the competitors alter their strategies.
This work therefore trains a collection of K different sub-policies per agent, where sub-policy k is μᵢ⁽ᵏ⁾ ∈ μᵢ, and maximizes the ensemble objective:
J_e(μᵢ) = 𝔼_{k∼unif(1,K), s∼p^μ, a∼μᵢ⁽ᵏ⁾}[Rᵢ(s, a)]
∇_{θᵢ⁽ᵏ⁾} J_e(μᵢ) = (1/K) 𝔼_{x,a∼𝒟ᵢ⁽ᵏ⁾}[∇_{θᵢ⁽ᵏ⁾} μᵢ⁽ᵏ⁾(aᵢ | oᵢ) ∇_{aᵢ} Qᵢ^{μᵢ}(x, a₁, a₂, …, a_N)|_{aᵢ=μᵢ⁽ᵏ⁾(oᵢ)}]
The model maintains a replay buffer 𝒟ᵢ⁽ᵏ⁾ for each sub-policy μᵢ⁽ᵏ⁾ of agent i.
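The training-loop skeleton is then: draw one sub-policy per agent per episode and route each transition to that sub-policy's own buffer. In the sketch below, make_actor, ReplayBuffer, and env are hypothetical stand-ins for the networks and environment, not APIs from the paper:

```python
import random

N, K = 3, 3  # illustrative numbers of agents and sub-policies
sub_actors = [[make_actor() for _ in range(K)] for _ in range(N)]  # mu_i^(k) (hypothetical)
buffers = [[ReplayBuffer() for _ in range(K)] for _ in range(N)]   # D_i^(k) (hypothetical)

for episode in range(10000):
    ks = [random.randrange(K) for _ in range(N)]  # k ~ unif(1, K), one per agent
    obs = env.reset()                             # hypothetical environment
    done = False
    while not done:
        acts = [sub_actors[i][ks[i]](obs[i]) for i in range(N)]
        obs_next, rews, done = env.step(acts)
        for i in range(N):                        # each transition goes to D_i^(k)
            buffers[i][ks[i]].add(obs, acts, rews[i], obs_next)
        obs = obs_next
```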
4. Experiments
Problems
I. Comparison to Decentralized Reinforcement Learning Methods
II. Effect of Learning Policies of Other Agents
III. Effect of Training with Policy Ensembles
4. Experiments
I. Comparison to Decentralized Reinforcement Learning Methods
[Figure: agent reward on cooperative communication after 25,000 episodes]
[Figure: policy learning success rate on cooperative communication after 25,000 episodes]
4. Experiments
I. Comparison to Decentralized Reinforcement Learning Methods
[Diagram: agent 1 communicates the message "apple" to agent 2, while the adversary sees only an encoding ("A98F1C4") it cannot interpret]
4. Experiments
II. Effect of Learning Policies of Other Agents
• λ = 0.001
ℒ(φᵢʲ) = −𝔼_{oⱼ,aⱼ}[log μ̂ᵢʲ(aⱼ | oⱼ) + λ H(μ̂ᵢʲ)]
The approximated policy is quite different from the true policy, but yields the same result!
4. Experiments
III. Effect of Training with Policy Ensembles
[Diagram: agents trained with policy ensembles (K = 3) competing against an adversary (K = 2)]