Multi agent reinforcement learning for sequential social dilemmas

Multi-agent RL in Sequential Social Dilemmas
Paper Review

MARL in SSD
• Multi-Agent Reinforcement Learning
• Sequential Social Dilemmas
=> Understanding agent cooperation
=> Agents learn policies in sequential situations that keep the mixed incentive structure of matrix game social dilemmas.

Sequential situation
Fruit Gathering
Wolfpack Hunting

Social Dilemma
• A social dilemma is a situation in which an
individual profits from selfishness unless everyone
chooses the selfish alternative, in which case the
whole group loses
=> Represented as a matrix game

Matrix Game – prisoner’s dilemma
Actions : Cooperate or Betray
( Cooperate, Cooperate ) : the best choice from a global perspective
( Betray, Betray ) : Nash Equilibrium, the choice rational agents make
( Payoffs are treated as negative rewards )
Matrix Game Social Dilemma == MGSD
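The gap between the Nash equilibrium and the globally best outcome can be checked by brute force. The sketch below is only an illustration: the payoff numbers are assumed values (negative rewards, in the spirit of the slide) and the function names are hypothetical, not taken from the paper.

```python
from itertools import product

C, D = "cooperate", "betray"
ACTIONS = (C, D)

# Assumed, illustrative payoffs (negative rewards); not from the slides.
# PAYOFF[(a1, a2)] = (reward to player 1, reward to player 2)
PAYOFF = {
    (C, C): (-1, -1),
    (C, D): (-3, 0),
    (D, C): (0, -3),
    (D, D): (-2, -2),
}

def pure_nash_equilibria(payoff, actions):
    """Return joint actions where neither player gains by deviating alone."""
    equilibria = []
    for a1, a2 in product(actions, repeat=2):
        r1, r2 = payoff[(a1, a2)]
        # Player 1 cannot do better by switching while player 2 stays put.
        best1 = all(payoff[(alt, a2)][0] <= r1 for alt in actions)
        # Player 2 cannot do better by switching while player 1 stays put.
        best2 = all(payoff[(a1, alt)][1] <= r2 for alt in actions)
        if best1 and best2:
            equilibria.append((a1, a2))
    return equilibria

def best_joint_outcome(payoff):
    """Joint action with the highest total reward (the global perspective)."""
    return max(payoff, key=lambda joint: sum(payoff[joint]))

print(pure_nash_equilibria(PAYOFF, ACTIONS))  # [('betray', 'betray')]
print(best_joint_outcome(PAYOFF))             # ('cooperate', 'cooperate')
```

With these assumed numbers, mutual betrayal is the only pure Nash equilibrium even though mutual cooperation gives the larger total reward.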

MGSD ignores…
1. Real-world social dilemmas are temporally extended
2. Cooperation and defection are labels that apply to policies implementing
strategic decisions
3. Cooperativeness may be a graded quantity
4. Decisions to cooperate or defect occur only quasi-simultaneously, since some
information about what player 2 is starting to do can inform player 1’s decision
and vice versa
5. Decisions must be made despite only having partial information about the
state of the world and the activities of the other players

Sequential Social Dilemma
SSD = Markov Games + Matrix Game Social Dilemma

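SSD – Markov Games
Two-player partially observable Markov game M :
Observation function O : S x {1,2} -> R^d ( each player's d-dimensional view )
Observation space O_i = { o_i | s in S, o_i = O(s, i) }
Transition function T : S x A_1 x A_2 -> delta(S) ( discrete probability distributions over S )
Reward function r_i : S x A_1 x A_2 -> R
Policy π_i : O_i -> delta(A_i)
== Find MGSD with Reinforcement Learning
State-value function V_i : expected discounted sum of r_i under the joint policy ( π_1, π_2 )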

SSD – Definition of SSD
Sequential Social Dilemma
Empirical payoff matrix
In a Markov game, the policy changes as the observation changes
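Once an empirical payoff matrix is available, it can be tested against the usual social dilemma inequalities. The slide does not spell these out, so the sketch below assumes the standard formulation with R (mutual cooperation), P (mutual defection), S (cooperating against a defector) and T (defecting against a cooperator); the function name is hypothetical.

```python
def classify_payoffs(R, P, S, T):
    """Classify a 2x2 symmetric payoff matrix under the assumed inequalities.

    Assumed social dilemma conditions:
      R > P      mutual cooperation beats mutual defection
      R > S      mutual cooperation beats being exploited
      2R > T + S mutual cooperation beats taking turns exploiting
      T > R (greed) or P > S (fear)
    """
    is_dilemma = R > P and R > S and 2 * R > T + S and (T > R or P > S)
    if not is_dilemma:
        return "not a social dilemma"
    greed, fear = T > R, P > S
    if greed and fear:
        return "Prisoner's Dilemma"
    if greed:
        return "Chicken"
    return "Stag Hunt"

print(classify_payoffs(R=3, P=1, S=0, T=4))  # Prisoner's Dilemma
print(classify_payoffs(R=4, P=1, S=2, T=3))  # not a social dilemma
```

The numbers passed in are made-up examples, not results from the paper.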

Learning Algorithm
== Deep Multiagent Reinforcement Learning
Use Deep Q-Network
Uniform Dist.

Simulation Method
Game : 2D grid-world
Observation : 3 ( RGB ) x 15 ( ahead ) x 10 ( side to side )
Action : 8 ( arrow keys + rotate left + rotate right + use beam + stand still )
Episode : 1000 steps
NN : two hidden layers of 32 units each, ReLU activations, 8 outputs
Policy : ε-greedy ( ε decreases from 1.0 to 0.1 )
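The ε-greedy policy over the 8 actions might look like the sketch below. The slide only gives the start and end values of ε, so the linear decay shape and the `decay_steps` parameter are assumptions, and the Q-values would come from the network rather than a hand-written list.

```python
import random

N_ACTIONS = 8  # arrow keys (4) + rotate left + rotate right + use beam + stand still

def epsilon_at(step, decay_steps, eps_start=1.0, eps_end=0.1):
    """Assumed linear decay from eps_start to eps_end over decay_steps, then flat."""
    frac = min(max(step / decay_steps, 0.0), 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Usage with made-up Q-values for the 8 actions.
q = [0.1, 0.0, 0.3, 0.2, 0.9, 0.4, 0.0, 0.5]
eps = epsilon_at(step=500, decay_steps=1000)
action = select_action(q, eps)
```

Early in training almost every action is random; once ε reaches 0.1 the agent mostly follows its Q-values.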

Result – Gathering
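There is no reward for it, but agents use the laser to remove the other agent for a while
When food ( green ) is plentiful, agents coexist and collect reward;
when it is scarce, they start attacking each other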

Touch green : reward +1 ( the green apple is removed temporarily )
Beam hits the other player : ( tagging )
hit twice, the opponent is removed from the game for N_tagged frames
Apple respawns after N_apple frames
=>
Defecting policy == aggressive ( uses the beam )
Cooperative policy == does not seek to tag the other player
https://www.youtube.com/watch?v=F97lqqpcqsM
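The tagging rule can be written as a small state machine. This is a rough sketch: the class name is hypothetical, and it assumes that hits accumulate without expiring and that the counter resets once the player is removed, neither of which the slide specifies.

```python
class TagState:
    """Tracks beam hits on one player and how long that player stays removed."""

    def __init__(self, n_tagged):
        self.n_tagged = n_tagged  # frames a tagged player stays out of the game
        self.hits = 0
        self.frames_out = 0

    @property
    def in_game(self):
        return self.frames_out == 0

    def hit_by_beam(self):
        """Register one beam hit; the second hit removes the player."""
        if not self.in_game:
            return
        self.hits += 1
        if self.hits >= 2:
            self.hits = 0  # assumption: the counter resets on removal
            self.frames_out = self.n_tagged

    def tick(self):
        """Advance one frame, counting down the removal timer."""
        if self.frames_out > 0:
            self.frames_out -= 1

# Usage: two hits remove the player for n_tagged frames.
opponent = TagState(n_tagged=3)
opponent.hit_by_beam()
opponent.hit_by_beam()
print(opponent.in_game)  # False
```

A larger N_tagged makes a successful tag more valuable to the aggressor, which is why it acts as a knob for conflict.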

*After training for 4- million steps for each option
[Figure: conflict cost vs. abundance, with regions from highly aggressive to low aggressive]

RL to SSD
1. Train policies on different games
2. Extract the trained policies from step 1
3. Calculate the MGSD
4. Repeat steps 2-3 until convergence
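Step 3 can be sketched as averaging episode returns over pairs of extracted policies. The code below assumes the policies have already been sorted into cooperative and defecting groups, and that a caller-supplied `play_episode(p1, p2)` returns the two players' returns; both the function and its signature are hypothetical stand-ins for a real environment rollout.

```python
from statistics import mean

def empirical_payoffs(coop_policies, defect_policies, play_episode, episodes=10):
    """Estimate R, S, T, P as player 1's average return for each policy pairing."""

    def avg_return_p1(group_1, group_2):
        return mean(
            play_episode(p1, p2)[0]
            for p1 in group_1
            for p2 in group_2
            for _ in range(episodes)
        )

    return {
        "R": avg_return_p1(coop_policies, coop_policies),      # both cooperate
        "S": avg_return_p1(coop_policies, defect_policies),    # cooperate vs. defect
        "T": avg_return_p1(defect_policies, coop_policies),    # defect vs. cooperate
        "P": avg_return_p1(defect_policies, defect_policies),  # both defect
    }

# Usage with a toy stand-in for the environment (made-up returns).
TOY_RETURNS = {("C", "C"): (3, 3), ("C", "D"): (0, 4),
               ("D", "C"): (4, 0), ("D", "D"): (1, 1)}

def toy_episode(p1, p2):
    return TOY_RETURNS[(p1, p2)]

payoffs = empirical_payoffs(["C"], ["D"], toy_episode, episodes=2)
print(payoffs)  # {'R': 3, 'S': 0, 'T': 4, 'P': 1}
```

In a real run, each call to `play_episode` would roll out two trained networks in the grid-world for one episode.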

Gathering : DRL to SSD
Prisoner's Dilemma
or
Non-SSD ( the NE is the global optimum )

Wolfpack
Catching the prey together gives a higher reward

Wolfpack
r_team : reward when the wolves touch the prey at the same time
radius : capture radius ( collision size )
== difficulty of capture
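A capture reward along these lines could be sketched as follows. The slide names only r_team and the capture radius, so the lone-capture reward `r_lone`, the Euclidean distance and the zero reward for a wolf outside the radius are all assumptions made for illustration.

```python
import math

def wolfpack_rewards(prey_pos, wolf_positions, radius, r_team, r_lone):
    """Reward per wolf at the moment the prey is captured.

    Wolves within `radius` of the prey share in the capture: each gets r_team
    if at least two wolves are in range, otherwise the single wolf in range
    gets r_lone (assumed). Wolves out of range get 0.
    """
    near = [math.dist(w, prey_pos) <= radius for w in wolf_positions]
    per_wolf = r_team if sum(near) >= 2 else r_lone
    return [per_wolf if is_near else 0.0 for is_near in near]

# Usage with made-up positions and reward values.
print(wolfpack_rewards((0, 0), [(0, 0), (1, 1)], radius=2, r_team=5.0, r_lone=1.0))
# [5.0, 5.0]
print(wolfpack_rewards((0, 0), [(0, 0), (5, 5)], radius=2, r_team=5.0, r_lone=1.0))
# [1.0, 0.0]
```

A larger radius makes it easier for the second wolf to count as part of the capture, while a larger r_team relative to r_lone raises the payoff of hunting together.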

Material Link
• https://arxiv.org/pdf/1702.03037.pdf
• https://deepmind.com/blog/understanding-agent-cooperation/
