Introduction to
AlphaGo Zero
Theoretical Foundations and Implementation Details
Chia-Ching Lin
National Taiwan University
2018/05/14
Outline
• Introduction
• Network in AlphaGo Zero
• Self-Play in AlphaGo Zero
• Experiment Results
• Why Does It Work?
• Conclusion
Outline
• Introduction
• Network in AlphaGo Zero
• Self-Play in AlphaGo Zero
• Experiment Results
• Why Does It Work?
• Conclusion
AlphaGo Evolution
• Starting tabula rasa, AlphaGo Zero [2] achieved superhuman
performance, winning 100–0 against the previously published,
champion-defeating AlphaGo [1]
• Summary of the different versions of AlphaGo from Wikipedia [1][2][3]:
[1] D. Silver and A. Huang et al., "Mastering the game of Go with deep neural networks and tree search," Nature, January 2016
[2] D. Silver, J. Schrittwieser, and K. Simonyan et al., "Mastering the game of Go without human knowledge," Nature, October 2017
[3] D. Silver, T. Hubert, and J. Schrittwieser et al., "Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm",
arXiv:1712.01815
Version         | Hardware (inference time)   | Elo rating | Matches
AlphaGo Fan     | 176 GPUs, distributed       | 3,144      | 5:0 against Fan Hui
AlphaGo Lee     | 48 TPUs, distributed        | 3,739      | 4:1 against Lee Sedol
AlphaGo Master  | 4 TPUs v2, single machine   | 4,858      | 60:0 against professional players (Future of Go Summit)
AlphaGo Zero    | 4 TPUs v2, single machine   | 5,185      | 100:0 against AlphaGo Lee; 89:11 against AlphaGo Master
AlphaZero       | 4 TPUs v2, single machine   | N/A        | 60:40 against AlphaGo Zero
Introduction
Two Main Components
• Two main components (working in parallel)
1. A policy-and-value network 𝑓𝜃 (parameters: 𝜃) takes as input the raw board representation 𝑠 of the position and its history, and outputs 𝒑 and 𝑣
• 𝒑: probabilities of the next move from 𝑠 (a 362-dim vector, including pass)
• 𝑣: expected outcome for the current player from 𝑠, from −1 (loss) to +1 (win)
2. In each position 𝑠, a Monte Carlo Tree Search (MCTS) guided by 𝑓𝜃 is executed to output 𝝅 = 𝛼𝜃(𝑠) (also a 362-dim probability vector) over the moves playable from 𝑠
[Figure: the policy-and-value network 𝑓𝜃 maps 𝑠 to (𝒑, 𝑣) and guides the MCTS simulations; the MCTS 𝛼𝜃 maps 𝑠 to 𝝅 and provides training data back to the network]
Introduction
Outline
• Introduction
• Network in AlphaGo Zero
• Self-Play in AlphaGo Zero
• Experiment Results
• Why Does It Work?
• Conclusion
How to Train the Network?
• Self-play
• At iteration 𝑖 ≥ 1, 25,000 games of self-play are generated based on 𝑓𝜃∗ (the current best player 𝛼𝜃∗)
• In each position 𝑠𝑡, an MCTS is executed to obtain 𝝅𝑡, and an action 𝑎𝑡 ~ 𝝅𝑡 is sampled accordingly
• From 𝑠1 to 𝑠𝑇: a self-play game ends with a final reward 𝑟𝑇 = −1 (loss) or +1 (win)
• All (𝑠𝑡, 𝝅𝑡, 𝑧𝑡) are stored, where 𝑧𝑡 = ±𝑟𝑇 with the sign determined by the current player at step 𝑡
• Network training
• Starting from randomly initialized weights 𝜃0, the parameters 𝜃𝑖 are trained on data (𝑠, 𝝅, 𝑧) sampled uniformly among all positions of the last 500,000 games of self-play, by minimizing the loss (a code sketch of this loss follows after this slide)
𝑙 = (𝑧 − 𝑣)² − 𝝅ᵀ log 𝒑 + 𝑐‖𝜃‖²
[Figure: the self-play pipeline, with MCTS producing (𝑠𝑡, 𝝅𝑡, 𝑧𝑡) tuples that are used to train the network]
Network in AlphaGo Zero
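To make the loss above concrete, here is a minimal NumPy sketch of how the three terms combine for a single training example. The function name alphazero_loss, the small constant added inside the log, and the default value of 𝑐 are illustrative assumptions, not values from the paper's implementation.

```python
import numpy as np

def alphazero_loss(z, v, pi, p, theta, c=1e-4):
    """Combined AlphaGo Zero-style loss for one position.

    z     : game outcome from the current player's view (-1 or +1)
    v     : value predicted by the network, in [-1, 1]
    pi    : MCTS search probabilities (length-362 vector)
    p     : move probabilities predicted by the network (length-362 vector)
    theta : flattened network parameters (for the L2 penalty)
    c     : L2 regularization constant (assumed value)
    """
    value_loss = (z - v) ** 2                      # (z - v)^2 term
    policy_loss = -np.dot(pi, np.log(p + 1e-12))   # -pi^T log p cross-entropy term
    l2_penalty = c * np.sum(theta ** 2)            # c * ||theta||^2 term
    return value_loss + policy_loss + l2_penalty

# Tiny usage example with dummy numbers
pi = np.full(362, 1 / 362)      # uniform search probabilities
p = np.full(362, 1 / 362)       # uniform network policy
theta = np.zeros(10)            # dummy parameters
print(alphazero_loss(z=1.0, v=0.3, pi=pi, p=p, theta=theta))
```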
Network Training Details
• Some details about network training
• Each neural network 𝑓𝜃ᵢ is optimized on Google Cloud using TensorFlow, with 64 GPU workers (batch size 32 per worker) and 19 CPU parameter servers
• The total batch size is 2,048, sampled uniformly at random from all positions of the most recent 500,000 games of self-play
• A new checkpoint is produced every 1,000 training steps
• Each checkpoint 𝑓𝜃ᵢ is further evaluated against the current best 𝑓𝜃∗ over 400 games, using MCTS to decide actions
• If the new player 𝛼𝜃ᵢ (guided by 𝑓𝜃ᵢ) wins by a margin of >55% (to avoid selecting on noise alone), it becomes the best player 𝛼𝜃∗ (guided by 𝑓𝜃∗), is subsequently used for self-play generation, and becomes the baseline for subsequent comparisons (see the sketch after this slide)
Network in AlphaGo Zero
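The 55% promotion rule can be sketched in a few lines. This is only an illustration: play_evaluation_game is an assumed helper that plays one MCTS-guided game and returns +1 if the candidate wins.

```python
def maybe_promote(candidate, best, play_evaluation_game, n_games=400, threshold=0.55):
    """Keep the candidate checkpoint only if it beats the current best by a clear margin."""
    wins = sum(play_evaluation_game(candidate, best) == +1 for _ in range(n_games))
    if wins / n_games > threshold:   # margin requirement avoids selecting on noise alone
        return candidate             # new best player: used for self-play and as the baseline
    return best                      # otherwise the previous best is kept
```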
Network Architecture
• Some details about network training (cont’d)
• The input to the neural network is a 19 × 19 × 17 image stack comprising 17 binary feature planes 𝑠𝑡 = (𝑋𝑡, 𝑌𝑡, 𝑋𝑡−1, 𝑌𝑡−1, …, 𝑋𝑡−7, 𝑌𝑡−7, 𝐶)
• 𝑋’s: presence of the current player’s stones (current position plus 7 history planes)
• 𝑌’s: presence of the opponent’s stones (current position plus 7 history planes)
• 𝐶: the color to play
• Network architecture (a code sketch follows after this slide)
• One convolutional block (256 filters of size 3 × 3, BN, ReLU) followed by either 19 or 39 residual blocks (256 filters of size 3 × 3, BN, ReLU, 256 filters of size 3 × 3, BN, skip connection, ReLU)
• Two separate “heads” for computing the policy and the value
• Policy head: convolutional block (2 filters of size 1 × 1, BN, ReLU) + fully connected linear layer that outputs a vector of size 362
• Value head: convolutional block (1 filter of size 1 × 1, BN, ReLU) + fully connected layer (size 256, ReLU) + fully connected layer (size 1, tanh) outputting a scalar in the range [−1, 1]
Network in AlphaGo Zero
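A minimal PyTorch sketch of this architecture is shown below. Layer sizes follow the slide (17 input planes, 256 filters, 19 residual blocks, 362-way policy head, tanh value head); class and variable names such as PolicyValueNet are illustrative assumptions, and this is not DeepMind's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)                 # skip connection, then ReLU

class PolicyValueNet(nn.Module):
    def __init__(self, blocks=19, channels=256):
        super().__init__()
        self.stem = nn.Sequential(             # initial convolutional block
            nn.Conv2d(17, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU())
        self.tower = nn.Sequential(*[ResidualBlock(channels) for _ in range(blocks)])
        # policy head: 2 filters of 1x1 -> FC to 362 logits (361 points + pass)
        self.policy_conv = nn.Sequential(
            nn.Conv2d(channels, 2, 1, bias=False), nn.BatchNorm2d(2), nn.ReLU())
        self.policy_fc = nn.Linear(2 * 19 * 19, 362)
        # value head: 1 filter of 1x1 -> FC 256 -> FC 1 -> tanh in [-1, 1]
        self.value_conv = nn.Sequential(
            nn.Conv2d(channels, 1, 1, bias=False), nn.BatchNorm2d(1), nn.ReLU())
        self.value_fc = nn.Sequential(
            nn.Linear(19 * 19, 256), nn.ReLU(), nn.Linear(256, 1), nn.Tanh())

    def forward(self, s):                      # s: (batch, 17, 19, 19) binary planes
        x = self.tower(self.stem(s))
        p_logits = self.policy_fc(self.policy_conv(x).flatten(1))
        v = self.value_fc(self.value_conv(x).flatten(1))
        return p_logits, v.squeeze(-1)
```

A forward pass on a batch of board tensors of shape (batch, 17, 19, 19) returns the raw policy logits (to be soft-maxed into 𝒑) and the scalar value 𝑣.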
Outline
• Introduction
• Network in AlphaGo Zero
• Self-Play in AlphaGo Zero
• Experiment Results
• Why Does It Work?
• Conclusion
Game Tree
• Game tree: a directed graph whose nodes are positions (states) in a game and whose edges are actions (moves)
• Starting from a position, we can search the tree to find the best next action
• E.g., the minimax algorithm for tic-tac-toe (win: 1, draw: 0, loss: −1): leaf values are backed up the tree by alternating “backup min” (opponent’s turn) and “backup max” (our turn), and the root player chooses the child with the maximum value (a short code sketch follows after this slide)
• There are approximately 𝑏^𝑑 possible sequences of actions, where 𝑏 is the game’s breadth (number of legal moves per position) and 𝑑 is its depth (game length)
• 𝑏 ≈ 35, 𝑑 ≈ 80 in chess and 𝑏 ≈ 250, 𝑑 ≈ 150 in Go → exhaustive search is infeasible!
[Figure: minimax search on a tic-tac-toe game tree, alternating min and max backups from the terminal positions to the root]
Self-Play in AlphaGo Zero
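To make the minimax example concrete, here is a short negamax-style sketch of the idea (equivalent to alternating "backup min" / "backup max"). The game-specific helpers legal_moves, apply_move, is_terminal, and utility are assumed to be supplied by the caller, with players encoded as +1 and −1; this only illustrates exhaustive search, which is exactly what becomes infeasible for chess and Go.

```python
def minimax(state, to_move, legal_moves, apply_move, is_terminal, utility):
    """Exhaustive minimax (negamax form). Returns (value, best_action) for `to_move`.

    utility(state, player) is assumed to return +1 / 0 / -1 for a win / draw / loss
    from `player`'s point of view at a terminal state.
    """
    if is_terminal(state):
        return utility(state, to_move), None
    best_value, best_action = -float("inf"), None
    for a in legal_moves(state):
        # the opponent's best value is the negative of ours (zero-sum game)
        child_value, _ = minimax(apply_move(state, a), -to_move,
                                 legal_moves, apply_move, is_terminal, utility)
        value = -child_value
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action
```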
Monte Carlo Game Tree
• Instead of considering all 𝑏^𝑑 sequences of actions, MCTS applies Monte Carlo methods that rely on repeated random sampling to obtain numerical results
• At each position 𝑠, MCTS runs many simulations to find the (approximately) best
action
• In each simulation, 2 general principles are applied to reduce the search space
1. Action sampling: reduce 𝑏 by sampling high-probability actions from a policy
𝑝 𝑎|𝑠 that is a probability distribution over possible actions 𝑎 in position 𝑠
2. Position evaluation: reduce 𝑑 by truncating the search tree at state 𝑠 and
replacing the subtree below 𝑠 by an approximate value that predicts the
outcome from state 𝑠
• E.g., within one simulation in tic-tac-toe:
1. Sample one action to reach a child position
2. Evaluate the value of the child position by taking random actions until a win, loss, or draw
3. Update 𝑁 (# visits) and 𝑊 (# wins) of the nodes on the path, e.g., 𝑁++ and 𝑊++ after a win (more details on the next page)
Self-Play in AlphaGo Zero
General Ideas of MCTS
• MCTS iteratively builds a partial search tree, applying 4 main steps per simulation
• Select: traverse the partial search tree by sampling until a leaf is reached
• Expand: sample one more action to add a new child
• Evaluate: predict the outcome by a value function or a rollout (playout)
• No rollout in AlphaGo Zero (it uses the value function directly)
• Backup: use the evaluation result to update the statistics (# wins / # visits) of all nodes on the path, so that better nodes are more likely to be sampled in future simulations
• Eventually, the most explored move (i.e., the move with the maximum # visits) is played (a generic code sketch follows after this slide)
[Figure: the four MCTS steps, Select, Expand, Evaluate, Backup. Select and Expand follow the tree policy; Evaluate follows the default policy (a random or learned rollout policy in old versions of AlphaGo). Policies are prior distributions used to guide the simulation.]
Self-Play in AlphaGo Zero
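Below is a compact, generic MCTS sketch of the four steps above, using the classic UCT rule for selection and a random rollout for the Evaluate step (AlphaGo Zero replaces the rollout with the value network and the tree policy with the PUCT rule shown on the next slides). All game helpers (legal_moves, apply_move, is_terminal, rollout) are assumed, and rollout is assumed to return +1/0/−1 from the viewpoint of the player to move at the evaluated position.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}           # action -> Node
        self.N, self.W = 0, 0.0      # visit count, total value (from the view of
                                     # the player who moved into this node)

def mcts(root_state, legal_moves, apply_move, is_terminal, rollout,
         n_sim=1600, c=1.4):
    """Generic MCTS: Select / Expand / Evaluate / Backup, repeated n_sim times."""
    root = Node(root_state)
    for _ in range(n_sim):
        node = root
        # 1. Select: descend with UCT while every action of the node is expanded
        while not is_terminal(node.state) and \
                len(node.children) == len(legal_moves(node.state)) > 0:
            node = max(node.children.values(),
                       key=lambda ch: ch.W / ch.N
                       + c * math.sqrt(math.log(node.N) / ch.N))
        # 2. Expand: add one new child for a not-yet-tried action
        if not is_terminal(node.state):
            untried = [a for a in legal_moves(node.state) if a not in node.children]
            a = random.choice(untried)
            node.children[a] = Node(apply_move(node.state, a), parent=node)
            node = node.children[a]
        # 3. Evaluate: random rollout (AlphaGo Zero calls the value network instead)
        value = rollout(node.state)
        # 4. Backup: update statistics along the path, flipping the sign each ply
        while node is not None:
            node.N += 1
            node.W += -value         # reward for the player who moved into `node`
            value = -value
            node = node.parent
    # play the most explored move (maximum visit count)
    return max(root.children.items(), key=lambda kv: kv[1].N)[0]
```

Calling mcts(state, ...) returns one move; a real engine would also reuse the subtree between moves, as AlphaGo Zero does.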
MCTS in AlphaGo Zero
• Each node 𝑠 in the search tree has edges (𝑠, 𝑎) for all legal actions 𝑎 ∈ 𝒜(𝑠)
• Statistics are stored for each edge (not for each node): visit count 𝑁(𝑠, 𝑎), total action value 𝑊(𝑠, 𝑎), mean action value 𝑄(𝑠, 𝑎), and prior probability 𝑃(𝑠, 𝑎)
• Initially, 𝑁(𝑠, 𝑎) = 𝑊(𝑠, 𝑎) = 𝑄(𝑠, 𝑎) = 0, and 𝑃(𝑠, 𝑎) = 𝑝𝑎 (from the network 𝑓𝜃)
• a (Select): choose the action with the maximum 𝑄 + 𝑈, where 𝑈 ∝ 𝑃/(1 + 𝑁); this initially prefers actions with high prior probabilities and low visit counts, but asymptotically prefers actions with high action values
• b (Expand and evaluate): whenever the chosen action leads to a new leaf, the leaf is added and evaluated by the policy-and-value network 𝑓𝜃 (to decide its 𝑣 and the 𝑃’s of all its legal actions)
• c (Backup): update the edge statistics along the traversed path: 𝑁 = 𝑁 + 1, 𝑊 = 𝑊 + 𝑣, 𝑄 = 𝑊/𝑁
• Eventually, good actions will have larger 𝑁 than bad actions (see the sketch after this slide)
Self-Play in AlphaGo Zero
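In code, the per-edge statistics and steps a–c might look like the sketch below. The edges are assumed to be stored as a dict of per-action records; the exact form of 𝑈(𝑠, 𝑎) with 𝑐puct and the √Σ𝑏 𝑁(𝑠, 𝑏) factor is the one given later on the Self-Play Details slide, and for brevity the sketch omits the sign flip of 𝑣 between plies that a full two-player implementation needs.

```python
import math

def select_action(edges, c_puct=1.0):
    """Step a: pick argmax_a Q(s,a) + U(s,a), U = c_puct * P * sqrt(sum_b N(s,b)) / (1 + N)."""
    sqrt_total = math.sqrt(sum(e["N"] for e in edges.values()))
    def score(e):
        u = c_puct * e["P"] * sqrt_total / (1 + e["N"])   # high P, low N explored first
        return e["Q"] + u                                  # high Q dominates asymptotically
    return max(edges, key=lambda a: score(edges[a]))

def expand(node_edges, priors):
    """Step b: initialize the edges of a new leaf with the network priors p_a."""
    for a, p_a in priors.items():
        node_edges[a] = {"N": 0, "W": 0.0, "Q": 0.0, "P": p_a}

def backup(path_edges, v):
    """Step c: update N, W, Q on every edge traversed in this simulation."""
    for e in path_edges:
        e["N"] += 1
        e["W"] += v
        e["Q"] = e["W"] / e["N"]
```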
Self-Play via MCTS
• At the end of the search (1,600 simulations), MCTS selects an action 𝑎 to play in the root position 𝑠root with probability 𝜋(𝑎) ∝ 𝑁(𝑠root, 𝑎)^(1/𝜏), where 𝜏 controls the level of exploration (a sampling sketch follows after this slide)
• The larger 𝜏, the smaller the differences between actions with different 𝑁 → exploration
• 𝜏 → 0: choose the best move according to MCTS → exploitation
• The child node corresponding to the played action becomes the new root node, and another round of MCTS starts over from it, with all edge statistics (𝑁, 𝑊, 𝑄, and 𝑃) of the subtree below this child retained
• MCTS can thus be viewed as a self-play algorithm that, given neural network parameters 𝜃 and a root position 𝑠root, computes a vector of search probabilities recommending moves to play, 𝝅 = 𝛼𝜃(𝑠root)
[Figure, step d (Play): in each round of MCTS, the 1,600 simulations from 𝑠root decide only one move 𝑎 ~ 𝝅 of the self-play game; the subtree below the played child, with all its statistics, is retained and reused in the next MCTS, while the rest of the tree is discarded]
Self-Play in AlphaGo Zero
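The play step, choosing a move from the root visit counts with temperature 𝜏, can be written as a small sketch (the function and argument names are assumptions):

```python
import numpy as np

def sample_move(visit_counts, tau=1.0, rng=None):
    """Sample a move with probability pi(a) proportional to N(s_root, a)^(1/tau)."""
    rng = rng or np.random.default_rng()
    counts = np.asarray(visit_counts, dtype=float)
    if tau < 1e-3:                 # tau -> 0: play the most visited move (exploitation)
        return int(np.argmax(counts))
    pi = counts ** (1.0 / tau)     # larger tau flattens the distribution (exploration)
    pi /= pi.sum()
    return int(rng.choice(len(counts), p=pi))
```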
Self-Play Details
• Some details about self-play
• In each iteration, the best current player 𝛼𝜃∗ plays 25,000 games of self-play, using 1,600 simulations (0.4 s) of MCTS to select each move
• At step a (Select)
• 𝑈(𝑠, 𝑎) = 𝑐puct 𝑃(𝑠, 𝑎) √(Σ𝑏 𝑁(𝑠, 𝑏)) / (1 + 𝑁(𝑠, 𝑎)), where 𝑐puct is a constant used to control exploration
• Additional exploration is achieved by adding Dirichlet noise to the prior probabilities in the root node: 𝑃(𝑠root, 𝑎) = (1 − 𝜀) 𝑝𝑎 + 𝜀 𝜂𝑎, where 𝜼 ~ Dir(0.03) and 𝜀 = 0.25 (see the sketch after this slide)
• At step b (Expand and evaluate), a leaf node 𝑠leaf is randomly reflected or rotated before being evaluated by the current network, (𝑑𝑖(𝒑), 𝑣) = 𝑓𝜃(𝑑𝑖(𝑠leaf)), where 𝑑𝑖 is a dihedral reflection or rotation selected uniformly at random from 𝑖 ∈ {1, …, 8}
• At step d (Play), 𝜏 = 1 for the first 30 moves of each game, and 𝜏 → 0 for the remainder of the game
• To save computation, AlphaGo Zero resigns from a self-play game if its root value and best child value are below a threshold value 𝑣resign
• 𝑣resign is selected automatically to keep the fraction of false positives (games that could have been won had AlphaGo Zero not resigned) below 5%, measured by disabling resignation in 10% of self-play games
Self-Play in AlphaGo Zero
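The Dirichlet-noise step at the root can be sketched as below; add_root_noise is an assumed helper name, while the values 𝜀 = 0.25 and Dir(0.03) are those quoted on the slide.

```python
import numpy as np

def add_root_noise(priors, epsilon=0.25, alpha=0.03, rng=None):
    """Mix Dirichlet noise into the root priors: P(s_root, a) = (1 - eps) * p_a + eps * eta_a."""
    rng = rng or np.random.default_rng()
    priors = np.asarray(priors, dtype=float)        # network priors over legal root moves
    eta = rng.dirichlet([alpha] * len(priors))      # eta ~ Dir(0.03)
    return (1.0 - epsilon) * priors + epsilon * eta
```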
Outline
• Introduction
• Network in AlphaGo Zero
• Self-Play in AlphaGo Zero
• Experiment Results
• Why Does It Work?
• Conclusion
Empirical Analysis (1/2)
• AlphaGo Zero (19 residual blocks) outperformed AlphaGo Lee after 36 h (see a)
• Supervised learning (from human data using the same architecture) was better
at predicting human professional moves (see b), but the self-trained player still
performed much better overall, defeating the human-trained player within the
first 24 h (see a)
• This suggests that AlphaGo Zero may be learning a strategy that is qualitatively
different to human play
Experiment Results
Empirical Analysis (2/2)
• Comparison of network architectures for the policy and value networks
• dual-res (AlphaGo Zero): a single network with residual blocks
• sep-res: two separate networks with residual blocks
• dual-conv: a single convolutional network (no residual blocks)
• sep-conv (AlphaGo): two separate convolutional networks
[Figure: Elo comparison of the four architectures; the annotations indicate that combining policy and value into a single (dual) network and using residual blocks each account for roughly 600 Elo]
Experiment Results
Final Performance
• AlphaGo Zero with a deeper network (19 → 39 residual blocks) and longer training time (3 → 40 days)
• Raw network: directly selects the move 𝑎 with the maximum probability 𝑝𝑎 output by the network, without using MCTS (i.e., 𝜋(𝑎)) to sample the next move
Experiment Results
AlphaGo Zero vs. AlphaGo
• Main modifications compared to older versions of AlphaGo
• Self-play reinforcement learning without any human data
• Simpler board representations using only the black and white stones
• Single neural network, rather than separate policy and value networks
• Also, the residual blocks matter (mentioned in [2])
• Simpler tree search without rollouts
• AlphaGo Zero and AlphaZero compensate for the lower number of evaluations by using their deep neural networks to focus much more selectively on the most promising variations – arguably a more “human-like” approach to search
Experiment Results
Outline
• Introduction
• Network in AlphaGo Zero
• Self-Play in AlphaGo Zero
• Experiment Results
• Why Does It Work?
• Conclusion
Thinking Fast and Slow
• As mentioned in [4], human reasoning consists of two different kinds of thinking
• System 1 is a fast, unconscious and automatic mode of thought, also known as
intuition or heuristic process
• System 2, an evolutionarily recent process unique to humans, is a slow, conscious,
explicit and rule-based mode of reasoning
Why Does It Work?
[4] T. Anthony et al., "Thinking Fast and Slow with Deep Learning and Tree Search," NIPS, Dec. 2017
Imitation Learning and Expert Improvement
• According to [4], the learning loop can be viewed as an extension of Imitation Learning, in which an Apprentice policy (System 1; the neural network) is trained to imitate the behavior of an Expert policy (System 2; tree search)
• Between iterations, an Expert Improvement step is performed by bootstrapping the (fast) Apprentice policy to increase the performance of the (slow) Expert
• This allows us to exploit the fast convergence properties of Imitation Learning, even in contexts where no strong player was originally known
[Figure: the Expert (System 2; tree search 𝛼𝜃; exploration) produces search probabilities 𝝅 from 𝑠, which the Apprentice (System 1; network 𝑓𝜃 outputting 𝒑 and 𝑣; generalization) learns to imitate (Imitation Learning); the improved network in turn strengthens the search (Expert Improvement). AlphaGo / AlphaGo Zero / AlphaZero = Imitation Learning + Expert Improvement]
Why Does It Work?
[4] T. Anthony et al., "Thinking Fast and Slow with Deep Learning and Tree Search," NIPS, Dec. 2017
Outline
• Introduction
• Network in AlphaGo Zero
• Self-Play in AlphaGo Zero
• Experiment Results
• Why Does It Work?
• Conclusion
Conclusion
• Starting tabula rasa, AlphaGo Zero was able to rediscover much of the Go knowledge accumulated by humans, as well as novel strategies that provide new insights into the oldest of games
• Motivated by the dual process theory of human thought (System 1 and
System 2), a new Reinforcement Learning algorithm (Imitation Learning
+ Expert Improvement) is introduced and results in state-of-the-art
performance for challenging problems
Reference
[1] D. Silver and A. Huang et al., "Mastering the game of Go with deep neural networks and tree
search," Nature, Jan. 2016
[2] D. Silver, J. Schrittwieser, and K. Simonyan et al., "Mastering the game of Go without human knowledge," Nature, Oct. 2017
[3] D. Silver, T. Hubert, and J. Schrittwieser et al., "Mastering Chess and Shogi by Self-Play with a
General Reinforcement Learning Algorithm", arXiv:1712.01815
[4] T. Anthony et al., "Thinking Fast and Slow with Deep Learning and Tree Search," NIPS, Dec. 2017
[5] Deepmind website: https://deepmind.com/research/alphago/
[6] Wikipedia (AlphaGo): https://en.wikipedia.org/wiki/AlphaGo
[7] Wikipedia (MCTS): https://en.wikipedia.org/wiki/Monte_Carlo_tree_search
[8] Wikipedia (game tree): https://en.wikipedia.org/wiki/Game_tree
[9] Slides from Caltech CS 159: Advanced Topics in Machine Learning (Spring 2016), "Monte Carlo Tree
Search and AlphaGo": http://www.yisongyue.com/courses/cs159/lectures/MCTS.pdf
[10] Blog post by Tim Wheeler, "AlphaGo Zero - How and Why it Works":
http://tim.hibal.org/blog/blog/
(received: 2017/4/7; accepted: 2017/9/13)
(paper submission deadline: 2017/5/19)