The document provides an introduction and overview of AlphaGo Zero, including:
- AlphaGo Zero achieved superhuman performance at Go without human data by using self-play reinforcement learning.
- It uses a single policy-and-value network together with Monte Carlo tree search to select moves. The network is trained on self-play games, using the MCTS search probabilities and game outcomes as training targets.
- Experiments showed AlphaGo Zero outperformed previous AlphaGo versions and human-trained networks, and continued improving with deeper networks and more self-play training.
2. Outline
• Introduction
• Network in AlphaGo Zero
• Self-Play in AlphaGo Zero
• Experiment Results
• Why Does It Work?
• Conclusion
3. Outline
• Introduction
• Network in AlphaGo Zero
• Self-Play in AlphaGo Zero
• Experiment Results
• Why Does It Work?
• Conclusion
4. AlphaGo Evolution
• Starting tabula rasa, AlphaGo Zero [2] achieved superhuman
performance, winning 100–0 against the previously published,
champion-defeating AlphaGo [1]
• Summary of different versions of AlphaGo from Wikipedia:
[1] D. Silver and A. Huang et al., "Mastering the game of Go with deep neural networks and tree search," Nature, January 2016
[2] D. Silver, J. Schrittwieser, and K. Simonyan et al., "Mastering the game of Go without human knowledge," Nature, October 2017
[3] D. Silver, T. Hubert, and J. Schrittwieser et al., "Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm",
arXiv:1712.01815
Versions       | Hardware (inference time) | Elo rating | Matches
AlphaGo Fan    | 176 GPUs, distributed     | 3,144      | 5:0 against Fan Hui
AlphaGo Lee    | 48 TPUs, distributed      | 3,739      | 4:1 against Lee Sedol
AlphaGo Master | 4 TPUs v2, single machine | 4,858      | 60:0 against professional players (Future of Go Summit)
AlphaGo Zero   | 4 TPUs v2, single machine | 5,185      | 100:0 against AlphaGo Lee; 89:11 against AlphaGo Master
AlphaZero      | 4 TPUs v2, single machine | N/A        | 60:40 against AlphaGo Zero
Introduction
5. Two Main Components
• Two main components (work in parallel)
1. A policy and value network 𝑓𝜃 (parameters 𝜃) takes as input the raw board representation 𝑠 of the position and its history, and outputs 𝒑 and 𝑣
• 𝒑: probability of the next move (362-dim vector, including pass) from 𝑠
• 𝑣: expected outcome for the current player from 𝑠, ranging from −1 (loss) to +1 (win)
2. In each position 𝑠, a Monte Carlo Tree Search (MCTS) is executed, guided by 𝑓𝜃, to output 𝝅 = 𝛼𝜃(𝑠) (also a 362-dim probability vector) of playing each move from 𝑠
[Figure: the policy and value network 𝑓𝜃 maps 𝑠 to (𝒑, 𝑣) and guides the MCTS simulations; the MCTS player 𝛼𝜃 maps 𝑠 to 𝝅, which provides training data for the network. A minimal sketch of these two interfaces follows this slide.]
Introduction
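To make the division of labor concrete, here is a minimal interface sketch in Python; `PolicyValueNet` and `mcts_search` are hypothetical names (not from the paper), and the bodies are placeholders that only illustrate the shapes of 𝒑, 𝑣, and 𝝅.

```python
# Hypothetical interface sketch; names and bodies are illustrative, not the paper's API.
import numpy as np

class PolicyValueNet:
    """f_theta: raw board representation s -> (p, v)."""
    def predict(self, s: np.ndarray):
        # p: 362-dim move probabilities (19*19 points + pass); v: scalar in [-1, 1]
        p = np.full(362, 1.0 / 362)   # placeholder: uniform policy
        v = 0.0                       # placeholder: even position
        return p, v

def mcts_search(s: np.ndarray, net: PolicyValueNet, simulations: int = 1600):
    """alpha_theta: run MCTS guided by f_theta and return search probabilities pi."""
    p, v = net.predict(s)             # the network output guides the simulations
    pi = p                            # placeholder: a real search refines p into pi
    return pi
```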
6. Outline
• Introduction
• Network in AlphaGo Zero
• Self-Play in AlphaGo Zero
• Experiment Results
• Why Does It Work?
• Conclusion
7. How to Train the Network?
• Self-play: at iteration 𝑖 ≥ 1, 25,000 games of self-play are generated based on 𝑓𝜃∗ (the current best player 𝛼𝜃∗). In each position 𝑠𝑡, an MCTS is executed to obtain 𝝅𝑡, and an action 𝑎𝑡 is sampled accordingly. From 𝑠1 to 𝑠𝑇: a self-play game with a final reward 𝑟𝑇 = −1 (loss) or +1 (win). All (𝑠𝑡, 𝝅𝑡, 𝑧𝑡) are stored, where 𝑧𝑡 = ±𝑟𝑇 with the sign determined by the current player at step 𝑡
• Network training: starting from randomly initialized weights 𝜃0, each 𝜃𝑖 is trained using data sampled uniformly among all (𝑠, 𝝅, 𝑧)'s of the last 500,000 games of self-play, by minimizing the loss (sketched after this slide)
  𝑙 = (𝑧 − 𝑣)² − 𝝅ᵀ log 𝒑 + 𝑐‖𝜃‖²
[Figure: a self-play game 𝑠1, 𝑠2, …, 𝑠𝑇 with actions 𝑎𝑡 ~ 𝝅𝑡 chosen by MCTS, final reward 𝑟𝑇, and stored training targets (𝑠𝑡, 𝝅𝑡, 𝑧𝑡) with 𝑧𝑡 = ±𝑟𝑇; the network outputs (𝒑, 𝑣) guide each search]
Network in AlphaGo Zero
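A minimal NumPy sketch of the loss above and of uniform sampling from the self-play buffer; the variable names 𝑧, 𝑣, 𝝅, 𝒑, 𝑐 follow the slide, while `loss`, `sample_batch`, and the buffer layout are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of the AlphaGo Zero training loss
#   l = (z - v)^2 - pi^T log p + c * ||theta||^2
# for one mini-batch, plus uniform sampling from the replay buffer of recent games.
import numpy as np

def loss(z, v, pi, p, theta_flat, c=1e-4, eps=1e-8):
    """z, v: (B,) game outcomes and value predictions; pi, p: (B, 362) search and
    network move probabilities; theta_flat: all network weights flattened."""
    value_term  = np.mean((z - v) ** 2)                            # value MSE
    policy_term = -np.mean(np.sum(pi * np.log(p + eps), axis=1))   # policy cross-entropy
    l2_term     = c * np.sum(theta_flat ** 2)                      # L2 weight regularization
    return value_term + policy_term + l2_term

def sample_batch(buffer, batch_size=2048, rng=np.random.default_rng()):
    """buffer: list of (s, pi, z) tuples from the most recent self-play games."""
    idx = rng.integers(len(buffer), size=batch_size)               # uniform sampling
    s, pi, z = zip(*(buffer[i] for i in idx))
    return np.stack(s), np.stack(pi), np.asarray(z, dtype=np.float32)
```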
8. Network Training Details
• Some details about network training
• Each neural network 𝑓𝜃𝑖 is optimized on the Google Cloud using TensorFlow, with 64 GPU workers (batch-size 32 per worker) and 19 CPU parameter servers
• Total batch-size is 2,048, sampled uniformly at random from all positions of the most recent 500,000 games of self-play
• Produces a new checkpoint every 1,000 training steps
• Each checkpoint 𝑓𝜃𝑖 is further evaluated against the current best 𝑓𝜃∗ by 400 games, using MCTS to decide actions
• If the new player 𝛼𝜃𝑖 (guided by 𝑓𝜃𝑖) wins by a margin of > 55% (to avoid selecting on noise alone), then it becomes the best player 𝛼𝜃∗ (guided by 𝑓𝜃∗), is subsequently used for self-play generation, and also becomes the baseline for subsequent comparisons (a sketch of this gating rule follows this slide)
Network in AlphaGo Zero
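A sketch of the checkpoint-gating rule described above, assuming a hypothetical helper `play_eval_game` that plays one evaluation game (both sides using MCTS) and reports whether the candidate won.

```python
# Sketch of the evaluation/gating step; `play_eval_game` is a hypothetical helper.
def evaluate_checkpoint(candidate, best, play_eval_game, games=400, threshold=0.55):
    wins = sum(play_eval_game(candidate, best) for _ in range(games))
    if wins / games > threshold:   # > 55% margin, to avoid selecting on noise alone
        return candidate           # becomes the new best player used for self-play
    return best                    # otherwise keep generating data with the old best
```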
9. Network Architecture
• Some details about network training (cont’d)
• The input to the neural network is a 19 × 19 × 17 image stack comprising 17 binary feature planes, 𝑠𝑡 = [𝑋𝑡, 𝑌𝑡, 𝑋𝑡−1, 𝑌𝑡−1, …, 𝑋𝑡−7, 𝑌𝑡−7, 𝐶]
• 𝑋's: indicating the presence of the current player's stones (including histories)
• 𝑌's: indicating the presence of the opponent's stones (including histories)
• 𝐶: representing the color to play
• Network architecture (a sketch follows this slide)
• One convolutional block (256 filters of size 3 × 3, BN, ReLU) followed by either 19 or 39 residual blocks (256 filters of size 3 × 3, BN, ReLU, 256 filters of size 3 × 3, BN, skip connection, ReLU)
• Two separate "heads" for computing the policy and value
• Policy head: convolutional block (2 filters of size 1 × 1, BN, ReLU) + fully connected linear layer that outputs a vector of size 362
• Value head: convolutional block (1 filter of size 1 × 1, BN, ReLU) + fully connected linear layer (size 256, ReLU) + fully connected linear layer (size 1, tanh) to output a scalar in the range [−1, 1]
Network in AlphaGo Zero
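A PyTorch-style sketch of the architecture above (the original was trained with TensorFlow); layer sizes follow the slide, while the class and variable names are illustrative.

```python
# PyTorch-style sketch of the AlphaGo Zero network described on this slide.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, ch=256):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(ch)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(ch)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)                      # skip connection, then ReLU

class AlphaGoZeroNet(nn.Module):
    def __init__(self, blocks=19, ch=256):
        super().__init__()
        self.stem = nn.Sequential(                  # convolutional block
            nn.Conv2d(17, ch, 3, padding=1, bias=False), nn.BatchNorm2d(ch), nn.ReLU())
        self.tower = nn.Sequential(*[ResidualBlock(ch) for _ in range(blocks)])
        self.policy_head = nn.Sequential(           # 2 filters of 1x1 + BN + ReLU + FC -> 362
            nn.Conv2d(ch, 2, 1, bias=False), nn.BatchNorm2d(2), nn.ReLU(),
            nn.Flatten(), nn.Linear(2 * 19 * 19, 362))
        self.value_head = nn.Sequential(            # 1 filter of 1x1 + BN + ReLU + FC 256 + FC 1 + tanh
            nn.Conv2d(ch, 1, 1, bias=False), nn.BatchNorm2d(1), nn.ReLU(),
            nn.Flatten(), nn.Linear(19 * 19, 256), nn.ReLU(), nn.Linear(256, 1), nn.Tanh())

    def forward(self, s):                           # s: (B, 17, 19, 19) binary feature planes
        x = self.tower(self.stem(s))
        p_logits = self.policy_head(x)              # caller applies softmax over 362 moves
        v = self.value_head(x).squeeze(-1)          # scalar in [-1, 1] per position
        return p_logits, v
```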
10. Outline
• Introduction
• Network in AlphaGo Zero
• Self-Play in AlphaGo Zero
• Experiment Results
• Why Does It Work?
• Conclusion
11. Game Tree
• Game tree: a directed graph whose nodes are positions (states) in a game and whose edges are actions (moves)
• Starting from a position, we can search the tree to find the best next action
• E.g., the minimax algorithm for tic-tac-toe (a minimal sketch follows this slide):
• There are approximately 𝑏^𝑑 possible sequences of actions, where 𝑏 is the game's breadth (number of legal moves per position) and 𝑑 is its depth (game length)
• 𝑏 ≈ 35, 𝑑 ≈ 80 in Chess and 𝑏 ≈ 250, 𝑑 ≈ 150 in Go ⇒ exhaustive search is infeasible!
[Figure: minimax on a tic-tac-toe game tree; leaf outcomes (win: 1, draw: 0, loss: -1) are backed up by alternating min and max over each player's turn, and the root chooses the action with the max (best) backed-up value]
Self-Play in AlphaGo Zero
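A minimal minimax sketch for a small game such as tic-tac-toe; `legal_moves`, `apply`, and `terminal_value` are assumed game-interface functions (the last returning +1/0/-1 from the maximizing player's view, or None for non-terminal positions). Exhaustively exploring every branch this way is exactly the 𝑏^𝑑 cost that makes the approach infeasible for Chess and Go.

```python
# Minimal minimax sketch; the game-interface functions are hypothetical helpers.
def minimax(s, maximizing, legal_moves, apply, terminal_value):
    outcome = terminal_value(s)
    if outcome is not None:                  # leaf: win +1, draw 0, loss -1
        return outcome, None
    best_value, best_move = None, None
    for a in legal_moves(s):
        value, _ = minimax(apply(s, a), not maximizing, legal_moves, apply, terminal_value)
        better = (best_value is None or
                  (maximizing and value > best_value) or
                  (not maximizing and value < best_value))
        if better:                           # back up the max (or min) over the children
            best_value, best_move = value, a
    return best_value, best_move
```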
12. Monte Carlo Game Tree
• Instead of considering all 𝑏^𝑑 sequences of actions, MCTS applies Monte Carlo methods that rely on repeated random sampling to obtain numerical results
• At each position 𝑠, MCTS runs many simulations to find the (approximately) best
action
• In each simulation, 2 general principles are applied to reduce the search space
1. Action sampling: reduce 𝑏 by sampling high-probability actions from a policy
𝑝 𝑎|𝑠 that is a probability distribution over possible actions 𝑎 in position 𝑠
2. Position evaluation: reduce 𝑑 by truncating the search tree at state 𝑠 and
replacing the subtree below 𝑠 by an approximate value that predicts the
outcome from state 𝑠
• E.g., within one simulation in tic-tac-toe:
2. Evaluate the value of the child position by
taking random actions until a win, loss, or draw1. Sample one action
child
position Update 𝑁 (# visit) and 𝑊 (# win)
(more details next page)
win
𝑁++
𝑊++
Self-Play in AlphaGo Zero
13. General Ideas of MCTS
• MCTS iteratively builds a partial search tree, using 4 main steps per simulation
• Select: traverse the partial search tree until a leaf by sampling
• Expand: sample one more action to add a new child
• Evaluate: predict the outcome by a value function or rollout (playout)
• No rollout in AlphaGo Zero (instead, it uses the value function directly)
• Backup: use the evaluated result to update statistics (# wins / # visits) for all nodes in the path, so that better nodes are more likely to be sampled in future simulations
• Eventually, the most explored move (i.e., the move with the max # visits) is taken
[Figure: the four MCTS steps (Select, Expand, Evaluate, Backup); Select and Expand follow the tree policy, Evaluate uses the default policy (random, or some rollout policy used in old versions of AlphaGo); policies are prior distributions used to guide simulation]
Self-Play in AlphaGo Zero
14. MCTS in AlphaGo Zero
• Each node 𝑠 in the search tree has edges (𝑠, 𝑎) for all legal actions 𝑎 ∈ 𝒜(𝑠)
• Statistics are stored for each edge (not for each node): visit count 𝑁(𝑠, 𝑎), total action value 𝑊(𝑠, 𝑎), mean action value 𝑄(𝑠, 𝑎), and prior probability 𝑃(𝑠, 𝑎)
• Initially, 𝑁(𝑠, 𝑎) = 𝑊(𝑠, 𝑎) = 𝑄(𝑠, 𝑎) = 0, and 𝑃(𝑠, 𝑎) = 𝑝𝑎 (from the network 𝑓𝜃)
• a: Choose the action with the max 𝑄 + 𝑈, where 𝑈 ∝ 𝑃/(1 + 𝑁) (initially prefers actions with high prior probabilities and low visit counts, but asymptotically prefers actions with high action values)
• b: Whenever choosing an action that leads to a new leaf, the leaf is added and evaluated (to decide its 𝑣 and the 𝑃's for all its legal actions) by the policy and value network 𝑓𝜃
• c: Update edge statistics along the path: 𝑁 = 𝑁 + 1, 𝑊 = 𝑊 + 𝑣, 𝑄 = 𝑊/𝑁
• Eventually, good actions will have larger 𝑁 than bad actions (a sketch of one simulation follows this slide)
Self-Play in AlphaGo Zero
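A sketch of one simulation over the per-edge statistics (𝑁, 𝑊, 𝑄, 𝑃) described above, assuming integer action indices and hypothetical helpers `net.predict`, `legal_moves`, and `apply`; the 𝑈 term uses the 𝑐puct form given on the Self-Play Details slide, and values are negated between levels because each position is evaluated from the perspective of its player to move.

```python
# Sketch of a single MCTS simulation with per-edge statistics (N, W, Q, P).
import math

class Node:
    def __init__(self, prior):
        self.P = prior                     # prior probability from the network
        self.N = self.W = self.Q = 0.0     # visit count, total value, mean value
        self.children = {}                 # action index -> Node

def simulate(node, s, net, legal_moves, apply, c_puct=1.0):
    """Run one simulation from position s; returns the value of s for its player to move."""
    if not node.children:                  # leaf: expand and evaluate with the network
        p, v = net.predict(s)              # (p, v) = f_theta(s)
        for a in legal_moves(s):
            node.children[a] = Node(prior=p[a])
        return v
    total_N = sum(child.N for child in node.children.values())
    def puct(child):                       # U = c_puct * P * sqrt(sum_b N(s, b)) / (1 + N(s, a))
        return child.Q + c_puct * child.P * math.sqrt(total_N) / (1 + child.N)
    a, child = max(node.children.items(), key=lambda kv: puct(kv[1]))
    # The child position is seen by the opponent, so its value is negated on the way up.
    v = -simulate(child, apply(s, a), net, legal_moves, apply, c_puct)
    child.N += 1                           # backup: update edge statistics
    child.W += v
    child.Q = child.W / child.N
    return v
```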
15. Self-Play via MCTS
• At the end of the search (1,600 simulations), MCTS selects an action 𝑎 to play in the root position 𝑠root, with probability 𝜋(𝑎) ∝ 𝑁(𝑠root, 𝑎)^(1/𝜏), where 𝜏 controls the level of exploration (sketched after this slide)
• The larger 𝜏, the smaller the differences between actions with different 𝑁 ⇒ exploration
• 𝜏 → 0: choose the best move according to MCTS ⇒ exploitation
• The child node corresponding to the played action becomes the new root node, and another round of MCTS starts over from it, with all edge statistics (𝑁, 𝑊, 𝑄, and 𝑃) of the subtree below this child being retained
• MCTS can thus be viewed as a self-play algorithm that, given neural network parameters 𝜃 and a root position 𝑠root, computes a vector of search probabilities recommending moves to play, 𝝅 = 𝛼𝜃(𝑠root)
[Figure: d: in each round of MCTS, the 1,600 simulations only decide one move in the self-play game; after playing 𝑎 ~ 𝝅, the search tree is reused in the next MCTS, with all statistics of the retained subtree kept and the rest discarded]
Self-Play in AlphaGo Zero
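A small sketch of turning root visit counts into search probabilities 𝜋(𝑎) ∝ 𝑁(𝑠root, 𝑎)^(1/𝜏) and sampling the move to play; the numbers in the usage example are made up.

```python
# Sketch of temperature-controlled move selection from root visit counts.
import numpy as np

def search_probabilities(visit_counts, tau=1.0):
    """visit_counts: array of N(s_root, a) over legal actions."""
    counts = np.asarray(visit_counts, dtype=np.float64)
    if tau < 1e-3:                       # tau -> 0: play the most-visited move (exploitation)
        pi = np.zeros_like(counts)
        pi[np.argmax(counts)] = 1.0
        return pi
    powered = counts ** (1.0 / tau)      # larger tau flattens the distribution (exploration)
    return powered / powered.sum()

rng = np.random.default_rng()
pi = search_probabilities([900, 500, 150, 50], tau=1.0)   # made-up visit counts
a = rng.choice(len(pi), p=pi)            # the chosen child becomes the next root (subtree reused)
```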
16. Self-Play Details
• Some details about self-play
• In each iteration, the best current player 𝛼𝜃∗ plays 25,000 games of self-play, using 1,600 simulations (0.4 s) of MCTS to select each move
• At step a (Select)
• 𝑈(𝑠, 𝑎) = 𝑐puct 𝑃(𝑠, 𝑎) √(Σ𝑏 𝑁(𝑠, 𝑏)) / (1 + 𝑁(𝑠, 𝑎)), where 𝑐puct is a constant used to control exploration
• Additional exploration is achieved by adding Dirichlet noise to the prior probabilities in the root node: 𝑃(𝑠root, 𝑎) = (1 − 𝜀)𝑝𝑎 + 𝜀𝜂𝑎, where 𝜼 ~ Dir(0.03) and 𝜀 = 0.25 (sketched after this slide)
• At step b (Expand and evaluate), a leaf node 𝑠leaf is randomly reflected or rotated before being evaluated by the current network, (𝑑𝑖(𝒑), 𝑣) = 𝑓𝜃(𝑑𝑖(𝑠leaf)), where 𝑑𝑖 is a dihedral reflection or rotation selected uniformly at random from 𝑖 ∈ [1..8]
• At step d (Play), 𝜏 = 1 for the first 30 moves of each game, and 𝜏 → 0 for the
remainder of the game
• To save computation, AlphaGo Zero resigns from a self-play game if its root value and
best child value are lower than a threshold value 𝑣resign
• 𝑣resign is selected automatically to keep the fraction of false positives (games that could
have been won if AlphaGo Zero had not resigned) below 5% (measured by disabling
resignation in 10% of self-play games)
Self-Play in AlphaGo Zero
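A sketch of the two tricks above: mixing Dirichlet noise into the root priors and applying one of the eight dihedral symmetries before evaluation; the function names and the (17, 19, 19) plane ordering are illustrative assumptions.

```python
# Sketch of root Dirichlet noise and a random dihedral transform of the input planes.
import numpy as np

rng = np.random.default_rng()

def add_root_noise(priors, eps=0.25, alpha=0.03):
    """priors: P(s_root, a) over legal actions; returns (1 - eps) * p + eps * eta."""
    eta = rng.dirichlet([alpha] * len(priors))
    return (1.0 - eps) * np.asarray(priors) + eps * eta

def random_dihedral(planes):
    """planes: (17, 19, 19) feature stack; apply one of the 8 reflections/rotations."""
    i = rng.integers(8)
    out = np.rot90(planes, k=i % 4, axes=(1, 2))   # rotate by 0/90/180/270 degrees
    if i >= 4:
        out = np.flip(out, axis=2)                 # optionally reflect
    return out
```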
17. Outline
• Introduction
• Network in AlphaGo Zero
• Self-Play in AlphaGo Zero
• Experiment Results
• Why Does It Work?
• Conclusion
18. Empirical Analysis (1/2)
• AlphaGo Zero (19 residual blocks) outperformed AlphaGo Lee after 36 h (see a)
• Supervised learning (from human data using the same architecture) was better
at predicting human professional moves (see b), but the self-trained player still
performed much better overall, defeating the human-trained player within the
first 24 h (see a)
• This suggests that AlphaGo Zero may be learning a strategy that is qualitatively
different to human play
Experiment Results
19. Empirical Analysis (2/2)
• Comparison of network architectures for the policy network and the value network
• dual-res (AlphaGo Zero): a single network with residual blocks
• sep-res: two separate networks with residual blocks
• dual-conv: a single CNN
• sep-conv (AlphaGo): two separate CNNs
[Figure: Elo comparison of the four architectures, with gaps of roughly 600 Elo annotated]
Experiment Results
20. Final Performance
• AlphaGo Zero with a deeper network (19 → 39 residual blocks) and longer training time (3 → 40 days)
• Raw network: directly selects the move 𝑎 with the maximum probability 𝑝𝑎 output by the network, without using MCTS (i.e., 𝝅(𝑎)) to sample the next move
Experiment Results
21. AlphaGo Zero vs. AlphaGo
• Main modifications compared to old versions of AlphaGo
• Self-play reinforcement learning without any human data
• Simpler board representations using only the black and white stones
• A single neural network, rather than separate policy and value networks
• Also, the residual blocks matter (mentioned in [2])
• Simpler tree search without rollouts
• AlphaGo Zero and AlphaZero compensate for the lower number of evaluations by using their deep neural networks to focus much more selectively on the most promising variations, arguably a more "human-like" approach to search
Experiment Results
22. Outline
• Introduction
• Network in AlphaGo Zero
• Self-Play in AlphaGo Zero
• Experiment Results
• Why Does It Work?
• Conclusion
23. Thinking Fast and Slow
• As mentioned in [4], human reasoning consists of two different kinds of thinking
• System 1 is a fast, unconscious and automatic mode of thought, also known as
intuition or heuristic process
• System 2, an evolutionarily recent process unique to humans, is a slow, conscious,
explicit and rule-based mode of reasoning
Why Does It Work?
[4] T. Anthony et al., "Thinking Fast and Slow with Deep Learning and Tree Search," NIPS, Dec. 2017
24. Imitation Learning and Expert Improvement
• According to [4], the learning loop can be viewed as an extension of Imitation Learning, in which an Apprentice policy (System 1; the neural network) is trained to imitate the behavior of an Expert policy (System 2; tree search)
• Between each iteration, an Expert Improvement step is performed by bootstrapping the (fast) Apprentice policy to increase the performance of the (slow) Expert (a sketch of this loop follows this slide)
• This allows us to exploit the fast convergence properties of Imitation Learning, even in contexts where no strong player was originally known
[Figure: the Expert (System 2; tree search 𝛼𝜃; exploration) produces search probabilities 𝝅 from 𝑠, which the Apprentice (System 1; network 𝑓𝜃; generalization) imitates by outputting (𝒑, 𝑣); the improved Apprentice in turn strengthens the Expert. AlphaGo / AlphaGo Zero / AlphaZero = Imitation Learning + Expert Improvement]
Why Does It Work?
[4] T. Anthony et al., "Thinking Fast and Slow with Deep Learning and Tree Search," NIPS, Dec. 2017
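A compact sketch of this Imitation Learning + Expert Improvement loop; `run_tree_search`, `sample_position`, and `train_network` are hypothetical helpers, and in AlphaGo Zero the positions come from self-play games rather than an external sampler.

```python
# Sketch of the Expert Iteration loop described above; helper names are illustrative.
def expert_iteration(network, run_tree_search, sample_position, train_network,
                     iterations=10, games_per_iteration=1000):
    for _ in range(iterations):
        dataset = []
        for _ in range(games_per_iteration):
            s = sample_position()
            pi = run_tree_search(s, network)       # Expert (System 2): slow, search-based targets
            dataset.append((s, pi))
        network = train_network(network, dataset)  # Apprentice (System 1) imitates the Expert
        # Expert Improvement: the next round of search is guided by the stronger Apprentice.
    return network
```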
25. Outline
• Introduction
• Network in AlphaGo Zero
• Self-Play in AlphaGo Zero
• Experiment Results
• Why Does It Work?
• Conclusion
26. Conclusion
• Starting tabula rasa, AlphaGo Zero was able to rediscover much of the accumulated Go knowledge, as well as novel strategies that provide new insights into the oldest of games
• Motivated by the dual process theory of human thought (System 1 and
System 2), a new Reinforcement Learning algorithm (Imitation Learning
+ Expert Improvement) is introduced and results in state-of-the-art
performance for challenging problems
27. Reference
[1] D. Silver and A. Huang et al., "Mastering the game of Go with deep neural networks and tree
search," Nature, Jan. 2016
[2] D. Silver, J. Schrittwieser, and K. Simonyan et al., "Mastering the game of Go without human
knowledge," Nature, Oct. 2017
[3] D. Silver, T. Hubert, and J. Schrittwieser et al., "Mastering Chess and Shogi by Self-Play with a
General Reinforcement Learning Algorithm", arXiv:1712.01815
[4] T. Anthony et al., "Thinking Fast and Slow with Deep Learning and Tree Search," NIPS, Dec. 2017
[5] Deepmind website: https://deepmind.com/research/alphago/
[6] Wikipedia (AlphaGo): https://en.wikipedia.org/wiki/AlphaGo
[7] Wikipedia (MCTS): https://en.wikipedia.org/wiki/Monte_Carlo_tree_search
[8] Wikipedia (game tree): https://en.wikipedia.org/wiki/Game_tree
[9] Slides from Caltech CS 159: Advanced Topics in Machine Learning (Spring 2016), "Monte Carlo Tree
Search and AlphaGo": http://www.yisongyue.com/courses/cs159/lectures/MCTS.pdf
[10] Blog post by Tim Wheeler, "AlphaGo Zero - How and Why it Works":
http://tim.hibal.org/blog/blog/