Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments
Ryan Lowe & Yi Wu (OpenAI), NIPS 2017
Presenter: Jisang Yoon
Graduate School of Information, Yonsei Univ.
Machine Learning & Computational Finance Lab.
INDEX
1. Introduction
2. Background
3. MADDPG
a. Multi-Agent Actor-Critic
b. Inferring Policies of Other Agents
c. Agents with Policy Ensembles
4. Experiments
1. Introduction
Multi-Agent Reinforcement Learning
To generalize the reinforcement learning problem, multi-agent reinforcement learning (MARL) is important and deserves further development.
MARL is widely used in social science, finance, signal/communication networks, cyber-physical systems, etc.
However, several problems make structuring a MARL model difficult:
• Non-stationary environment distributions
• Inefficient communication
• Suboptimal decisions caused by partial observation
• Scalability issues from a joint action space that grows exponentially
2. Background
Partially Observable Markov Games
⟨N, S, A, T, R, γ, O, Z⟩
1. N is the number of agents
2. S is the set of states
3. A is the set of actions
4. T is the state transition function
5. R is the reward function
6. γ is the discount factor
7. O is the set of observations
8. Z is the observation transition function
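As a concrete data-structure sketch, this tuple could be held in a small Python container like the one below; the class name and callable signatures are illustrative assumptions, not from the paper.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

State = Any
Action = Any
Observation = Any

@dataclass
class PartiallyObservableMarkovGame:
    """Hypothetical container mirroring the tuple <N, S, A, T, R, gamma, O, Z>."""
    n_agents: int                                  # N: number of agents
    states: Sequence[State]                        # S: set of states
    actions: Sequence[Sequence[Action]]            # A: one action set per agent
    transition: Callable[..., State]               # T(s, a_1, ..., a_N) -> next state
    reward: Callable[..., Sequence[float]]         # R(s, a_1, ..., a_N) -> reward per agent
    gamma: float                                   # discount factor
    observations: Sequence[Sequence[Observation]]  # O: one observation set per agent
    observe: Callable[..., Sequence[Observation]]  # Z(s) -> observation per agent
```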
2. Background
Target Q function
ℒ(θᵢ) = 𝔼_{s,a,r,s′}[(Q(s, a | θᵢ) − y)²],  y = r + γ Q(s′, a′ | θ′)
where Q(· | θ′) is the target Q function.
Every few time steps, θ′ ← τθ + (1 − τ)θ′ (target Q update).
[Diagram: exploration/update loop in which each bootstrapped target γQᵢ + rᵢ stands in for the remaining discounted return r_{i+1} + λr_{i+2} + ⋯ + λ^{n−i−1} rₙ]
This trains the Q function with supervised learning!
To know which action is optimal, learn an accurate action-value function Q and select the action that maximizes Q.
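A minimal PyTorch sketch of this supervised TD loss and the soft target update, using a greedy max over next actions (the DQN choice); network sizes, optimizer, and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))       # Q(s, . | theta)
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # Q(s, . | theta')
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma, tau = 0.99, 0.01

def td_update(s, a, r, s_next):
    """One supervised step on L(theta) = E[(Q(s, a | theta) - y)^2]."""
    with torch.no_grad():                                  # y is computed with frozen theta'
        y = r + gamma * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a | theta)
    loss = ((q_sa - y) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # target update: theta' <- tau * theta + (1 - tau) * theta'
    for p, p_t in zip(q_net.parameters(), target_net.parameters()):
        p_t.data.mul_(1 - tau).add_(tau * p.data)
```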
2. Background
Deterministic Policy Gradient
The objective J(θ) = 𝔼_{s∼p^μ}[Q^μ(s, a)|_{a=μ_θ(s)}] (the action value over all states) should be maximized to find the optimal policy μ_θ.
∇_θ J(θ) = 𝔼_{s∼p^μ}[∇_θ Q^μ(s, a)|_{a=μ_θ(s)}]
= 𝔼_{s∼p^μ}[∇_a Q^μ(s, a) ∇_θ μ_θ(s)|_{a=μ_θ(s)}]   (chain rule)
Sampling states from a replay buffer 𝒟 in practice:
∇_θ J(θ) = 𝔼_{s∼𝒟}[∇_a Q^μ(s, a) ∇_θ μ_θ(s)|_{a=μ_θ(s)}]
where
• p^μ(s) is the state distribution under policy μ
• μ_θ(s) is the action chosen by the deterministic policy at state s
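In code, autograd applies this chain rule automatically once the policy's action is fed into the critic; a minimal sketch, assuming small illustrative networks:

```python
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))       # mu_theta(s)
critic = nn.Sequential(nn.Linear(4 + 2, 64), nn.ReLU(), nn.Linear(64, 1))  # Q^mu(s, a)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

def dpg_step(states):
    """Ascend E_s[grad_a Q(s, a) * grad_theta mu_theta(s)] by minimizing -Q(s, mu(s))."""
    actions = actor(states)                   # a = mu_theta(s)
    q = critic(torch.cat([states, actions], dim=1))
    loss = -q.mean()                          # maximize Q <=> minimize -Q
    actor_opt.zero_grad()
    loss.backward()                           # autograd chains grad_a Q through mu_theta
    actor_opt.step()
```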
2. Background
Actor-Critic
Actor: the policy μ, which outputs an action.
Critic: the action-value function Q, which evaluates which state-action pair is best.
Critic loss: ℒ(θᵢ) = 𝔼_{s,a,r,s′}[(Q^μ(s, a | θ) − y)²]
Actor gradient: ∇_θ J(θ) = 𝔼_{s∼𝒟}[∇_a Q^μ(s, a) ∇_θ μ_θ(s)|_{a=μ_θ(s)}]
[Diagram: the actor maps the state to an action; the critic takes the state and action and outputs Q^μ(s, a | θ)]
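Combining the two updates gives a DDPG-style step: the critic is trained with the supervised TD loss, then the actor with the deterministic policy gradient. A sketch, with all arguments assumed to be pre-built networks and batched tensors:

```python
import torch

def actor_critic_step(s, a, r, s_next, actor, critic, target_actor, target_critic,
                      actor_opt, critic_opt, gamma=0.99):
    """One critic (supervised) step followed by one actor (policy-gradient) step."""
    # critic: minimize (Q(s, a | theta) - y)^2 with y from the target networks
    with torch.no_grad():
        a_next = target_actor(s_next)
        y = r + gamma * target_critic(torch.cat([s_next, a_next], dim=1)).squeeze(1)
    q = critic(torch.cat([s, a], dim=1)).squeeze(1)
    critic_loss = ((q - y) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # actor: ascend Q(s, mu_theta(s))
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```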
3. MADDPG
Overview (a. Multi-Agent Actor-Critic)
In this setting, each agent can use only its own partial observation to choose actions,
→ so the agents' policies cannot converge to optimality (the environment is non-stationary from each agent's perspective).
This work therefore uses a centralized critic that takes the observations and actions of all agents as input to guide each agent (actor).
3. MADDPG
Overview (b. Inferring Policies of Other Agents)
When the policies of the other agents are unknown, predicted actions âⱼ (= μ̂ⱼ) are fed to Q instead (i.e., Q(o, a₁, μ̂₂, …, μ̂_N) for agent 1).
3. MADDPG
Overview (c. Agents with Policy Ensembles)
In competitive environments, each agent's policy is an ensemble of k policies,
(μ₁⁽¹⁾, μ₁⁽²⁾, …, μ₁⁽ᵏ⁾), …, (μ_N⁽¹⁾, μ_N⁽²⁾, …, μ_N⁽ᵏ⁾),
to prevent overfitting to the competitors' policies.
3.a Multi-Agent Actor-Critic
Decentralized Actor, Centralized Critic
• Observations of the N agents: (o₁, o₂, …, o_N)
• Deterministic policies (actors) μ parameterized by θ: (μ₁, μ₂, …, μ_N)
• Centralized action-value function (critic): Qᵢ^μ(x, a₁, a₂, …, a_N), where x = (o₁, o₂, …, o_N, ε) and ε is optional additional state information
• Experience replay buffer 𝒟 contains (x, x′, a₁, …, a_N, r₁, …, r_N), where x′ is x at the next time step and a′ᵢ is the action of agent i at the next time step

1. The gradient of μᵢ can be written as:
∇_{θᵢ} J(μᵢ) = 𝔼_{x,a∼𝒟}[∇_{θᵢ} μᵢ(aᵢ | oᵢ) ∇_{aᵢ} Qᵢ^μ(x, a₁, a₂, …, a_N)|_{aᵢ=μᵢ(oᵢ)}]

2. The centralized action-value function Qᵢ^μ is updated as:
ℒ(θᵢ) = 𝔼_{x,a,r,x′}[(Qᵢ^μ(x, a₁, a₂, …, a_N) − y)²]
y = rᵢ + γ Qᵢ^{μ′}(x′, a′₁, a′₂, …, a′_N)|_{a′ⱼ=μ′ⱼ(oⱼ)}
where μ′ = {μ_{θ′₁}, …, μ_{θ′_N}} is the set of target policies with delayed parameters θ′ᵢ.
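A condensed PyTorch sketch of these two updates for agent i, assuming per-agent lists of networks and a batch already sampled from 𝒟 (every name and shape here is an illustrative assumption):

```python
import torch

def maddpg_step(i, obs, acts, rew_i, obs_next, actors, critics,
                target_actors, target_critic_i, actor_opt_i, critic_opt_i, gamma=0.95):
    """Centralized-critic / decentralized-actor update for agent i.
    obs, acts, obs_next are lists of per-agent batched tensors."""
    x = torch.cat(obs, dim=1)               # x = (o_1, ..., o_N)
    x_next = torch.cat(obs_next, dim=1)     # x'
    # critic target: y = r_i + gamma * Q_i^{mu'}(x', a'_1, ..., a'_N), a'_j = mu'_j(o_j)
    with torch.no_grad():
        a_next = [mu_t(o) for mu_t, o in zip(target_actors, obs_next)]
        y = rew_i + gamma * target_critic_i(torch.cat([x_next, *a_next], dim=1)).squeeze(1)
    q = critics[i](torch.cat([x, *acts], dim=1)).squeeze(1)
    critic_loss = ((q - y) ** 2).mean()
    critic_opt_i.zero_grad(); critic_loss.backward(); critic_opt_i.step()
    # actor: ascend Q_i with a_i = mu_i(o_i) while the other agents' actions stay fixed
    acts_pg = [a.detach() for a in acts]
    acts_pg[i] = actors[i](obs[i])
    actor_loss = -critics[i](torch.cat([x, *acts_pg], dim=1)).mean()
    actor_opt_i.zero_grad(); actor_loss.backward(); actor_opt_i.step()
```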
3.b Inferring Policies of Other Agents
Policy Approximation
If an agent does not know the other agents' policies, Qᵢ^μ(x, a₁, a₂, …, a_N) cannot be calculated.
So an approximate policy μ̂ᵢʲ(aⱼ | oⱼ) is introduced to approximate aⱼ.
The approximate policy μ̂ᵢʲ, parameterized by φᵢʲ, is learned by minimizing:
ℒ(φᵢʲ) = −𝔼_{oⱼ,aⱼ}[log μ̂ᵢʲ(aⱼ | oⱼ) + λ H(μ̂ᵢʲ)]
where H is the entropy of the policy distribution, which makes the model explore more.
(If μ̂ᵢʲ(aⱼ | oⱼ) = 1, then log μ̂ᵢʲ(aⱼ | oⱼ) = 0.)
The critic is then updated with:
ℒ(θᵢ) = 𝔼_{x,a,r,x′}[(Qᵢ^μ(x, a₁, a₂, …, a_N) − y)²]
y = rᵢ + γ Qᵢ^{μ′}(x′, μ̂ᵢ′¹(o₁), μ̂ᵢ′²(o₂), …, μ̂ᵢ′ᴺ(o_N))
where μ̂ᵢ′ʲ denotes the target network for the approximate policy μ̂ᵢʲ.
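A sketch of this maximum-likelihood-plus-entropy loss for a batch of observed (oⱼ, aⱼ) pairs, assuming discrete actions and an illustrative logits network:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

approx_policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 5))  # logits over a_j
opt = torch.optim.Adam(approx_policy.parameters(), lr=1e-3)
lam = 0.001  # entropy coefficient lambda (the value used in the experiments)

def infer_policy_step(o_j, a_j):
    """Minimize -E[log mu_hat(a_j | o_j) + lambda * H(mu_hat)]."""
    log_probs = F.log_softmax(approx_policy(o_j), dim=1)
    log_lik = log_probs.gather(1, a_j.unsqueeze(1)).squeeze(1)  # log mu_hat(a_j | o_j)
    entropy = -(log_probs.exp() * log_probs).sum(dim=1)         # H(mu_hat(. | o_j))
    loss = -(log_lik + lam * entropy).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```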
3.c Agents with Policy Ensembles
Policy Ensembles to Prevent Overfitting to Competitors
In competitive settings, agents can derive a strong policy by overfitting to the behavior of their competitors, but such a policy may fail when the competitors alter their strategies.
This work therefore trains a collection of K different sub-policies per agent, where sub-policy k is μᵢ⁽ᵏ⁾ ∈ μᵢ, and maximizes the ensemble objective:
J_e(μᵢ) = 𝔼_{k∼unif(1,K), s∼p^μ, a∼μᵢ⁽ᵏ⁾}[Rᵢ(s, a)]
∇_{θᵢ⁽ᵏ⁾} J_e(μᵢ) = (1/K) 𝔼_{x,a∼𝒟ᵢ⁽ᵏ⁾}[∇_{θᵢ⁽ᵏ⁾} μᵢ⁽ᵏ⁾(aᵢ | oᵢ) ∇_{aᵢ} Qᵢ^{μᵢ}(x, a₁, a₂, …, a_N)|_{aᵢ=μᵢ⁽ᵏ⁾(oᵢ)}]
The model maintains a replay buffer 𝒟ᵢ⁽ᵏ⁾ for each sub-policy μᵢ⁽ᵏ⁾ of agent i.
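The training-loop skeleton is then: draw one sub-policy per agent per episode and route each transition to that sub-policy's own buffer. In the sketch below, make_actor, ReplayBuffer, and env are hypothetical stand-ins for the networks and environment, not APIs from the paper:

```python
import random

N, K = 3, 3  # illustrative numbers of agents and sub-policies
sub_actors = [[make_actor() for _ in range(K)] for _ in range(N)]  # mu_i^(k) (hypothetical)
buffers = [[ReplayBuffer() for _ in range(K)] for _ in range(N)]   # D_i^(k) (hypothetical)

for episode in range(10000):
    ks = [random.randrange(K) for _ in range(N)]  # k ~ unif(1, K), one per agent
    obs = env.reset()                             # hypothetical environment
    done = False
    while not done:
        acts = [sub_actors[i][ks[i]](obs[i]) for i in range(N)]
        obs_next, rews, done = env.step(acts)
        for i in range(N):                        # each transition goes to D_i^(k)
            buffers[i][ks[i]].add(obs, acts, rews[i], obs_next)
        obs = obs_next
```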
4. Experiments
Problems
I. Comparison to Decentralized Reinforcement Learning Methods
II. Effect of Learning Policies of Other Agents
III. Effect of Training with Policy Ensembles
4. Experiments
I. Comparison to Decentralized Reinforcement Learning Methods
[Figure: agent reward on cooperative communication after 25,000 episodes]
[Figure: policy learning success rate on cooperative communication after 25,000 episodes]
4. Experiments
I. Comparison to Decentralized Reinforcement Learning Methods
[Diagram: agent 1 communicates the message "apple" to agent 2, while the adversary sees only an encoding ("A98F1C4") it cannot interpret]
4. Experiments
II. Effect of Learning Policies of Other Agents
• λ = 0.001
ℒ(φᵢʲ) = −𝔼_{oⱼ,aⱼ}[log μ̂ᵢʲ(aⱼ | oⱼ) + λ H(μ̂ᵢʲ)]
The approximated policy is quite different from the true policy, but yields the same result!
4. Experiments
III. Effect of Training with Policy Ensembles
[Diagram: agents trained with policy ensembles (K = 3) competing against an adversary (K = 2)]