QMIX: Monotonic Value Function Factorisation for
Deep Multi-Agent Reinforcement Learning
2021.4.30 정민재
Multi-agent system (MAS) setting
't Hoen, Pieter Jan, et al. "An overview of cooperative and competitive multiagent learning." International Workshop on Learning and
Adaption in Multi-Agent Systems. Springer, Berlin, Heidelberg, 2005.
• Cooperative
- The agents pursue a common goal
• Competitive
- Non-aligned goals
- Individual agents seek only to maximize their own gains
MAS setting
● Challenge
○ Joint action space grows exponentially with the number of agents
■ e.g., N agents with 4 discrete actions each: 4^N joint actions
○ Agents' partial observability and communication constraints
■ Agents cannot access the full state
NEED: decentralized policies
[Figure: a single agent has 4 actions (Up, Down, Left, Right); two agents already have 4^2 = 16 joint actions: (Up, Up), (Up, Down), (Up, Left), (Up, Right), (Down, Up), (Down, Down), ...]
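To make the growth concrete, here is a tiny illustrative snippet (mine, not from the paper) that enumerates the joint action space shown above:

```python
# Joint action space size is 4**n for n agents with 4 primitive actions each.
from itertools import product

actions = ["Up", "Down", "Left", "Right"]

for n_agents in (1, 2, 3):
    joint_actions = list(product(actions, repeat=n_agents))
    print(f"{n_agents} agent(s): {len(joint_actions)} joint actions")  # 4, 16, 64
```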
Centralized Training, Decentralized Execution (CTDE)
● In simulation and laboratory environments, we can use the global state or extra state information and remove the communication constraint during training
[Figure: centralized training vs. decentralized execution]
CTDE Approach
How to learn the joint action-value function Q_tot and extract decentralized policies?
● Learn Q_tot -> figure out the effectiveness of agents' actions
● Extract decentralized policies <- joint-action-space growth, local observability, communication constraints

Independent Q-learning (IQL) [Tan 1993]: learn independent individual action-value functions
● Each agent learns its own Q_a and follows the greedy policy with respect to it
+ Simplest option
+ Learns decentralized policies trivially
- Cannot handle the non-stationary case: the other agents are also learning and changing their strategies -> no convergence guarantee

Value Decomposition Network (VDN) [Sunehag 2017]: learn a centralized, but factored action-value function
● Q_tot = Q_a1 + Q_a2 + Q_a3; each agent follows the greedy policy for its own Q_a
+ Learns Q_tot
+ Easy to extract decentralized policies
- Limited representational capacity
- Does not use additional global state information

Counterfactual multi-agent policy gradient (COMA) [Foerster 2018]: learn a centralized, full-state action-value function
● A centralized critic Q_tot guides the decentralized policies 1, 2, 3
+ Learns Q_tot directly with an actor-critic framework
- On-policy: sample-inefficient
- Less scalable

-> QMIX!
Background
DEC-POMDP (decentralized partially observable Markov decision process)
s ∈ S : state
u ∈ U : joint action
P(s' | s, u) : transition function
r(s, u) : reward function
n : number of agents
a ∈ A : agent
z ∈ Z : observation
O(s, a) : observation function
γ : discount rate
τ^a ∈ T : action-observation history
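For readers who prefer code, the tuple above can be captured in a small container; a sketch with field names of my own choosing (not from the paper's code):

```python
# Illustrative container for the Dec-POMDP tuple (S, U, P, r, Z, O, n, gamma).
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class DecPOMDP:
    states: Sequence          # s in S
    joint_actions: Sequence   # u in U
    transition: Callable      # P(s' | s, u)
    reward: Callable          # r(s, u)
    n_agents: int             # n
    observations: Sequence    # z in Z
    observe: Callable         # O(s, a) -> z, the observation function
    gamma: float              # discount rate
```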
Background: Value decomposition
Factored joint value function [Guestrin 2001]
● Each local value function depends only on a subset of the agents
● A factored value function reduces the number of parameters that have to be learned -> improves scalability
VDN full factorization [Sunehag 2017]: Q_tot(τ, u) = Σ_a Q_a(τ^a, u^a)
● Each Q_a is a utility function, not a value function: by itself it does not estimate an expected return
Guestrin, Carlos, Daphne Koller, and Ronald Parr. "Multiagent Planning with Factored MDPs." NIPS. Vol. 1. 2001.
Sunehag, Peter, et al. "Value-decomposition networks for cooperative multi-agent learning." arXiv preprint arXiv:1706.05296 (2017).
https://youtu.be/W_9kcQmaWjo
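As a concrete reference point, a minimal sketch of VDN's additive factorization, assuming PyTorch (my own code, not the authors' release):

```python
import torch

def vdn_qtot(agent_qs: torch.Tensor) -> torch.Tensor:
    """agent_qs: (batch, n_agents) chosen-action utilities Q_a(tau^a, u^a)."""
    return agent_qs.sum(dim=1)  # Q_tot = sum_a Q_a: the full additive factorization

print(vdn_qtot(torch.tensor([[1.0, 2.0, 0.5]])))  # tensor([3.5000])
```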
QMIX: Key Idea
Key idea: the full factorization of VDN is not necessary to extract decentralized policies
• Consistency holds if a global argmax performed on Q_tot yields the same result as a set of individual argmax operations performed on each Q_a:
argmax_u Q_tot(τ, u) = (argmax_{u^1} Q_1(τ^1, u^1), ..., argmax_{u^n} Q_n(τ^n, u^n))
• Assumption: the environment is not adversarial
• VDN's additive representation also satisfies this consistency
• QMIX generalizes it to the larger family of monotonic functions
How to ensure this?
QMIX: Monotonicity constraint
Rashid, Tabish, et al. "Monotonic value function factorisation for deep multi-agent reinforcement learning." Journal of Machine Learning Research 21.178 (2020): 1-51.
If Q_tot is monotone in every per-agent Q_a,
∂Q_tot / ∂Q_a ≥ 0, ∀a ∈ A,
then the argmax joint action of Q_tot is the set of individual argmax actions of the Q_a:
argmax_u Q_tot(τ, u) = (argmax_{u^1} Q_1(τ^1, u^1), ..., argmax_{u^n} Q_n(τ^n, u^n))
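A small numerical sanity check of this claim (my own, illustrative), using a linear mixer with non-negative weights, which is one member of the monotonic family:

```python
import numpy as np

rng = np.random.default_rng(0)
q1, q2 = rng.normal(size=4), rng.normal(size=4)  # per-agent Qs over 4 actions each

# Monotone mixer: non-negative weights, so dQ_tot/dQ_a >= 0 for both agents.
w1, w2, b = 0.7, 1.3, 0.2
q_tot = w1 * q1[:, None] + w2 * q2[None, :] + b  # Q_tot over all 16 joint actions

joint = np.unravel_index(q_tot.argmax(), q_tot.shape)
assert joint == (q1.argmax(), q2.argmax())  # global argmax == per-agent argmaxes
print(joint)
```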
QMIX: Architecture
● QMIX represents Q_tot using an architecture consisting of agent networks, a mixing network, and hypernetworks
QMIX: agent network
• DRQN (Hausknecht 2015)
- Recurrence deals with partial observability
• Last action u^a_{t-1} as input
- Allows stochastic policies during training
• Agent ID (optional)
- Allows heterogeneous policies
Hausknecht, Matthew, and Peter Stone. "Deep recurrent q-learning for partially observable mdps." arXiv preprint arXiv:1507.06527 (2015).
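A rough PyTorch sketch of such an agent network (shapes and names are my own simplifications, not the released implementation):

```python
import torch
import torch.nn as nn

class AgentNet(nn.Module):
    """DRQN-style agent network: MLP -> GRU -> per-action utilities Q_a."""
    def __init__(self, obs_dim: int, n_actions: int, n_agents: int, hidden: int = 64):
        super().__init__()
        # Input = observation + one-hot last action u^a_{t-1} + one-hot agent ID
        self.fc_in = nn.Linear(obs_dim + n_actions + n_agents, hidden)
        self.rnn = nn.GRUCell(hidden, hidden)  # recurrence copes with partial observability
        self.fc_out = nn.Linear(hidden, n_actions)

    def forward(self, obs, last_action, agent_id, h):
        x = torch.relu(self.fc_in(torch.cat([obs, last_action, agent_id], dim=-1)))
        h = self.rnn(x, h)
        return self.fc_out(h), h  # utilities Q_a(tau^a, .) and the new hidden state
```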
QMIX: Mixing Network
• Hypernetworks
- Transform the state s_t into the weights of the mixing network
- Allow Q_tot to depend on the extra state information
• Absolute operation on the generated weights
- Ensures the monotonicity constraint (non-negative mixing weights)
• Why not pass s_t directly into the mixing network?
- Forcing s_t through the monotonic mixing function would overly constrain it and reduce representational capacity; the hypernetworks let s_t influence Q_tot in a non-monotonic way
• ELU activation
- A negative input is likely to remain negative; with ReLU it would be zeroed out by the mixing network
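Putting these pieces together, a condensed sketch of the mixing network, assuming PyTorch (layer sizes and names are my own; the released code differs in details):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixingNet(nn.Module):
    """Mixes per-agent Qs into Q_tot; hypernetworks generate weights from s_t."""
    def __init__(self, n_agents: int, state_dim: int, embed: int = 32):
        super().__init__()
        self.n_agents, self.embed = n_agents, embed
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed)   # s_t -> 1st-layer weights
        self.hyper_b1 = nn.Linear(state_dim, embed)              # s_t -> 1st-layer bias
        self.hyper_w2 = nn.Linear(state_dim, embed)              # s_t -> 2nd-layer weights
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed), nn.ReLU(),
                                      nn.Linear(embed, 1))       # s_t -> final bias

    def forward(self, agent_qs, state):  # agent_qs: (B, n_agents), state: (B, state_dim)
        B = agent_qs.size(0)
        # abs() enforces non-negative weights => the monotonicity constraint holds
        w1 = torch.abs(self.hyper_w1(state)).view(B, self.n_agents, self.embed)
        b1 = self.hyper_b1(state).view(B, 1, self.embed)
        hidden = F.elu(torch.bmm(agent_qs.view(B, 1, -1), w1) + b1)  # ELU keeps negatives
        w2 = torch.abs(self.hyper_w2(state)).view(B, self.embed, 1)
        b2 = self.hyper_b2(state).view(B, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(B)                  # Q_tot
```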
QMIX: algorithm
1. Initialization
2. Rollout episode
3. Episode sampling
4. Update Q_tot
5. Update target
Rashid, Tabish, et al. "Monotonic value function factorisation for deep multi-agent reinforcement learning." Journal of Machine Learning Research 21.178 (2020):
1-51.
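The five steps above, as a hedged pseudocode sketch; the helper names (rollout_episode, chosen_agent_qs, max_agent_qs, sync) are placeholders of mine, not functions from the released code:

```python
import torch

def train_qmix(env, agents, mixer, target_agents, target_mixer, buffer, opt,
               num_steps: int, gamma: float = 0.99, target_update: int = 200):
    for step in range(num_steps):                       # (1) Initialization done outside
        buffer.add(rollout_episode(env, agents))        # (2) Rollout episode
        batch = buffer.sample()                         # (3) Episode sampling
        q_taken = mixer(chosen_agent_qs(agents, batch), batch.state)
        with torch.no_grad():                           # TD target from target networks
            q_next = target_mixer(max_agent_qs(target_agents, batch), batch.next_state)
            target = batch.reward + gamma * (1 - batch.done) * q_next
        loss = ((q_taken - target) ** 2).mean()         # (4) Update Q_tot end-to-end
        opt.zero_grad(); loss.backward(); opt.step()
        if step % target_update == 0:                   # (5) Update target
            sync(target_agents, agents); sync(target_mixer, mixer)
```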
Representational complexity
[Figure: learned QMIX Q_tot vs. learned VDN Q_tot on a non-monotonic payoff matrix]
• Any value function in which an agent's best action depends on the other agents' actions at the same time step will not factorize perfectly with QMIX
• The monotonicity constraint prevents QMIX from representing non-monotonic functions
https://youtu.be/W_9kcQmaWjo
Representational complexity
https://youtu.be/W_9kcQmaWjo
Shimon Whiteson:
• Even though VDN cannot represent the middle example exactly, it can approximate it with a value function from the left game
• Should we care about these games that are in the middle?
• It matters because of bootstrapping

Representational complexity
• QMIX still learns the correct maximum over the Q-values
[Figure: payoff matrix and learned QMIX Q_tot]
• Less bootstrapping error results in better action selection in earlier states
-> Let's see with a two-step game
Rashid, Tabish, et al. "Monotonic value function factorisation for deep multi-agent reinforcement learning." Journal of Machine Learning Research 21.178 (2020): 1-51.
Representational complexity
• Two-Step Game
[Figure: learned VDN and QMIX Q_tot for the two-step game, with the resulting greedy policies: (A, ∙) -> (A,B) or (B,B); (B, ∙) -> (B,B); greedy returns 7 (VDN) vs. 8 (QMIX)]
• QMIX's higher representational capacity, i.e., the ability to express the complex situation, yields a better strategy than VDN
Experiment: SC2
• StarCraft II has a rich set of complex micro-actions that allow the learning of complex interactions between collaborating agents
• The SC2LE environment mitigates many of the practical difficulties in using a game as an RL platform
[Figure: allied units vs. enemy units; https://youtu.be/HIqS-r4ZRGg]
Experiment: SC2
• Observation (within sight range)
- distance
- relative x, y
- unit_type
• Action
- move[direction]
- attack[enemy_id] (within shooting range)
- stop
- noop
• Reward
- joint reward: total damage dealt (each time step)
- bonus 1: 10 for killing each opponent
- bonus 2: 100 for killing all opponents
• Global state (hidden from agents)
- distance from center, health, shield, cooldown, and last action for all units
Experiment: SC2 - main results
• IQL: highly unstable <- non-stationarity of the environment
• VDN: better than IQL in every experiment setup; learns to focus fire
• QMIX: superior in the heterogeneous-agent setting
[Figure: win-rate curves for homogeneous and heterogeneous maps; the heterogeneous curves show an initial hump from first learning the simple strategy]
Experiment: SC2 - ablation results
• QMIX-NS: without hypernetworks -> significance of the extra state information
• QMIX-Lin: removing the hidden layer -> necessity of non-linear mixing
• VDN-S: adding a state-dependent term to the sum of the Q_a -> significance of utilizing the state s_t
[Figure: ablation curves for homogeneous and heterogeneous maps; non-linear factorization is not always required]
Conclusion - QMIX
• Centralized training - decentralized execution
• Allows a rich joint action-value function Q_tot while ensuring the monotonicity constraint
• Outperforms VDN on decentralized unit micromanagement tasks in the SC2 environment
More information
https://youtu.be/W_9kcQmaWjo
Rashid, Tabish, et al. "Monotonic value function factorisation for deep multi-agent
reinforcement learning." Journal of Machine Learning Research 21.178 (2020): 1-51.