1. AlphaGo: An AI Go Player Based on Deep Neural Networks and Monte Carlo Tree Search
Michael J. Moon
M.Sc. Candidate in Biostatistics
Dalla Lana School of Public Health
University of Toronto
April 7, 2016
4. Introduction | Background
The Game of Go
> Played on a square grid called a board, usually 19 x 19
> Black and white stones are placed alternately
> Points awarded for surrounding empty space
Complexity
> Possible number of move sequences ≈ 250¹⁵⁰
> A googol¹ times more complex than chess
> Viewed as an unsolved "grand challenge" for AI
"pinnacle of perfect information games"
Demis Hassabis, Co-founder of DeepMind
Example of a Go Board (shades represent territories)
1. 1 googol = 1.0 × 10¹⁰⁰
5. Introduction | Background
Google DeepMind's AI Go Player¹
5-0 against Fan Hui
> Victory against the three-time European champion
> First program to win against a professional player in an even game
4-1 against Lee Sedol
> Victory against the world's top player of the past decade
> Awarded the highest Go ranking after the match²
1. Image source: https://deepmind.com/alpha-go.html
2. Source: http://www.straitstimes.com/asia/east-asia/googles-alphago-gets-divine-go-ranking
8. Introduction | Overview of the Design
Training Pipeline (diagram)
> 30M human moves train the Rollout Policy and the SL Policy Network
> The RL Policy Network is initialized from the SL Policy Network and improved by self-play
> Self-play games train the RL Value Network
> Monte Carlo Tree Search combines the networks for move selection
Asynchronous Multi-threaded Search
> 40 search threads
> 48 CPUs
> 8 GPUs
Distributed Version¹
> 40 search threads
> 1,202 CPUs
> 176 GPUs
1. Used against Fan Hui; 1,920 CPUs and 280 GPUs against Lee
http://www.economist.com/news/science-and-technology/21694540-win-or-lose-best-five-battle-contest-another-milestone
14. Methodologies | Deep Neural Network
Deep Learning Architecture
> Multilayer (5–20) stack of simple modules subject to learning
Backpropagation Training
> Trained by simple stochastic gradient descent to minimize error
> Rectified linear units (ReLU), $f(x) = \max(0, x)$, learn faster than other non-linearities
Forward pass (diagram: input units $i$, hidden units $j \in H_1$ and $k \in H_2$, output units $l$):
$z_j = \sum_{i \in In} w_{ij} x_i, \quad y_j = f(z_j)$
$z_k = \sum_{j \in H_1} w_{jk} y_j, \quad y_k = f(z_k)$
$z_l = \sum_{k \in H_2} w_{kl} y_k, \quad y_l = f(z_l)$
(A minimal numeric sketch follows below.)
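To make the forward and backward passes concrete, here is a minimal Python/numpy sketch assuming a toy regression task with arbitrary layer sizes; nothing here comes from the AlphaGo implementation.

```python
# Minimal sketch of the slide's forward/backward passes (toy sizes and data;
# not AlphaGo code): two ReLU hidden layers trained by gradient descent.
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

W1 = rng.normal(0, 0.1, (4, 8))    # w_ij: input -> H1
W2 = rng.normal(0, 0.1, (8, 8))    # w_jk: H1 -> H2
W3 = rng.normal(0, 0.1, (8, 1))    # w_kl: H2 -> output
lr = 0.05

X = rng.normal(size=(256, 4))      # toy task: learn y = sum(x)
Y = X.sum(axis=1, keepdims=True)

for step in range(2000):
    z1 = X @ W1; y1 = relu(z1)     # z_j = sum_i w_ij x_i, y_j = f(z_j)
    z2 = y1 @ W2; y2 = relu(z2)    # z_k = sum_j w_jk y_j, y_k = f(z_k)
    out = y2 @ W3                  # linear output unit y_l
    err = out - Y                  # dE/d(out) for squared error

    # Backward pass: propagate error, multiplying by the ReLU derivative.
    g3 = y2.T @ err
    d2 = (err @ W3.T) * (z2 > 0)
    g2 = y1.T @ d2
    d1 = (d2 @ W2.T) * (z1 > 0)
    g1 = X.T @ d1

    for W, g in ((W1, g1), (W2, g2), (W3, g3)):
        W -= lr * g / len(X)       # (full-batch) gradient descent step

print("final MSE:", float(np.mean((relu(relu(X @ W1) @ W2) @ W3 - Y) ** 2)))
```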
15. Methodologies | Deep Convolutional Neural Network
Properties of natural signals
Input
> Arrays such as signals, images and videos
Local Connections
> Each unit connects to a local patch of the layer below
Shared Weights
> Each filter applies common weights $W_1$ and a bias across positions to create a feature map
Non-linearity
> Local weighted sums passed to a non-linearity such as ReLU
Pooling
> Coarse-grains the position of each feature, typically by taking the max over neighbouring features
Size and Stride
> e.g., a filter of size 3 applied with stride 2
Deep Architecture
> Uses stacks of many layers
(A minimal sketch of these building blocks follows below.)
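The following toy 1-D sketch (illustrative signal and filter values, not from the slides) shows the three building blocks together: a shared-weight convolution, a ReLU non-linearity, and max pooling.

```python
# Minimal sketch: shared-weight convolution, ReLU, and max pooling on a
# toy 1-D signal (all values illustrative).
import numpy as np

def conv1d(x, w, b, stride=1):
    """Slide one filter w (shared weights) with a shared bias b across x."""
    k = len(w)
    return np.array([x[i:i + k] @ w + b
                     for i in range(0, len(x) - k + 1, stride)])

x = np.array([1., 0., 2., 3., 0., 1., 4., 0.])   # input array (signal)
w = np.array([0.5, -1.0, 0.5])                   # filter of size 3

feature = np.maximum(0.0, conv1d(x, w, b=0.1, stride=2))  # size 3, stride 2, + ReLU

# Pooling: coarse-grain positions by taking the max over neighbouring features.
pooled = np.array([feature[i:i + 2].max()
                   for i in range(0, len(feature) - 1, 2)])
print(feature, pooled)
```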
16. Methodologies | Deep Convolutional Neural Network
Architecture
> Exploits highly correlated local groups
> Local statistics invariant to location
Properties
> Compositional hierarchy
> Invariant to small shifts and distortions due to pooling
> Weights trained through backpropagation
17. Methodologies | Monte Carlo Tree Search
Overview
Find optimal decisions by:
> Taking random samples in the decision space
> Building a search tree according to the results
Notation
> $s \in S$: nodes; $a \in A$: edges
> $r(s)$: reward; $N(a)$: visit count
Tree Policy
> $g(p(a \mid s), N(a))$ for $s \in$ tree nodes
> Tries to balance exploration and exploitation
Default Policy
> $p(a \mid s)$ for $s \notin$ tree nodes
Four Steps per Iteration
Selection
> Traverse to the most urgent expandable node
Expansion
> Add a child node from the selected node
Simulation
> Simulate from the newly added node to an outcome $r(s')$
Backpropagation
> Back up the simulation result through the selected nodes
Strengths
> Anytime algorithm – gives a valid solution whenever interrupted
> Values of intermediate states are not evaluated – domain knowledge not required
(A minimal implementation sketch follows below.)
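The sketch below implements the four steps generically, with a UCB1 tree policy and a uniform-random default policy. The toy game (count up from 0 by 1 or 2; reward 1 for landing exactly on 10) and all helper names are hypothetical stand-ins, not from the presentation.

```python
# Minimal generic MCTS sketch following the four steps above.
import math, random

# Hypothetical toy game: add 1 or 2 to a counter; landing exactly on 10 pays 1.
def legal_moves(s): return [] if s >= 10 else [1, 2]
def step(s, a):     return s + a
def is_terminal(s): return s >= 10
def reward(s):      return 1.0 if s == 10 else 0.0

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}                       # action -> Node
        self.untried = list(legal_moves(state))  # expandable actions
        self.N, self.Q = 0, 0.0                  # visit count, total reward

def ucb1(node, c=1.4):
    # Tree policy: balance exploitation (Q/N) against exploration.
    return max(node.children.values(),
               key=lambda n: n.Q / n.N + c * math.sqrt(math.log(node.N) / n.N))

def mcts(root_state, iterations=500):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        while not node.untried and node.children:    # 1. Selection
            node = ucb1(node)
        if node.untried:                             # 2. Expansion
            a = node.untried.pop()
            node.children[a] = Node(step(node.state, a), parent=node)
            node = node.children[a]
        s = node.state                               # 3. Simulation (default policy)
        while not is_terminal(s):
            s = step(s, random.choice(legal_moves(s)))
        r = reward(s)
        while node is not None:                      # 4. Backpropagation
            node.N += 1; node.Q += r
            node = node.parent
    return max(root.children, key=lambda a: root.children[a].N)

print(mcts(0))   # most-visited first move toward landing exactly on 10
```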
23. Design | Problem Setting
Notation
> $s \in S$: state of the game
> $a \in A(s)$: legal actions at $s$
> $f(s, a)$: deterministic state transition
> $r^i(s)$: reward for player $i$ at $s$, $i \in \{1, 2\}$
> Zero-sum game: $r(s) = r^1(s) = -r^2(s)$, with $r(s) = 0$ if $s \neq s_T$
> $z_t = \pm r(s_T)$: terminal reward at $s_T$, $z_t \in \{-1, 1\}$
Policy
> $p(a \mid s)$: probability distribution over legal actions
Value Function
> $v^p(s) = \mathbb{E}[z_t \mid s_t = s, a_{t, \ldots, T} \sim p]$
Unique Optimal Value Function
$$v^*(s) = \begin{cases} z_T & \text{if } s = s_T \\ \max_a -v^*(f(s, a)) & \text{otherwise} \end{cases}$$
(A negamax sketch of this recursion follows below.)
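The optimal-value recursion above is exactly a negamax: each player maximizes the negated value of the successor state. Below is a minimal sketch on a hypothetical toy game (take 1 or 2 tokens; taking the last token wins); the helper names are illustrative stand-ins.

```python
# Minimal negamax sketch of v*(s) for a toy stand-in game.
def legal_actions(s):   return [a for a in (1, 2) if a <= s]
def transition(s, a):   return s - a                 # deterministic f(s, a)
def is_terminal(s):     return s == 0
def terminal_reward(s): return -1.0                  # player to move has lost

def optimal_value(s):
    if is_terminal(s):                               # s = s_T
        return terminal_reward(s)                    # z_T in {-1, 1}
    # max over a of -v*(f(s, a)): the opponent's value is the negation of ours.
    return max(-optimal_value(transition(s, a)) for a in legal_actions(s))

print(optimal_value(4))   # 1.0: positions not divisible by 3 are wins
```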
24. Design | Rollout Policy
$p_\pi(a \mid s)$
> A fast, linear softmax policy for simulation
> Pattern-based feature inputs
> Trained using 8 million positions
> Less domain knowledge implemented compared to existing MCTS Go programs
> 24.2% prediction accuracy
> A similar policy $p_\tau(a \mid s)$ is used for tree expansion
Maximize: $\Delta\pi \propto \dfrac{\partial \log p_\pi(a \mid s)}{\partial \pi}$
(A toy implementation follows below.)
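The sketch below shows a linear softmax policy trained by the log-likelihood gradient ascent above. The features and data are toy, illustrative stand-ins for AlphaGo's pattern-based inputs.

```python
# Minimal sketch of a linear softmax policy over pattern features.
import numpy as np

def softmax_policy(pi, feats):
    """feats: (n_actions, n_features) pattern features of each legal action."""
    logits = feats @ pi
    p = np.exp(logits - logits.max())        # numerically stable softmax
    return p / p.sum()

def sgd_step(pi, feats, a, lr=0.1):
    # d log p_pi(a|s) / d pi = phi(s, a) - E_p[phi(s, .)] for a linear model.
    p = softmax_policy(pi, feats)
    return pi + lr * (feats[a] - p @ feats)

pi = np.zeros(3)                             # 3 toy pattern features
feats = np.eye(3)                            # one-hot features per action
for _ in range(100):
    pi = sgd_step(pi, feats, a=1)            # observed human move: action 1
print(softmax_policy(pi, feats))             # mass concentrates on action 1
```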
25. Design | Neural Network Architectures
Input
> 19 x 19 intersections x 48 feature planes (one extra plane for the value network)
Input Feature Space (with respect to the current player)
> Stone Colour
> Ones & Zeros
> Turns Since
> Liberties
> Capture Size
> Self-atari Size
> Liberties after Move
> Ladder Capture
> Ladder Escape
> Sensibleness
Extra Feature for Value Network
> Player Colour
(Diagram: binary 19 x 19 feature planes; a sketch of the encoding follows below.)
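The sketch below illustrates the input encoding: binary 19 x 19 planes stacked into a 19 x 19 x 48 tensor (49 with the player-colour plane for the value network). Only the stone-colour planes are shown; the board position and plane layout are illustrative assumptions.

```python
# Minimal sketch of binary input feature planes for a Go position.
import numpy as np

EMPTY, BLACK, WHITE = 0, 1, 2
board = np.zeros((19, 19), dtype=np.int8)      # toy position
board[3, 3], board[15, 15] = BLACK, WHITE

def stone_colour_planes(board, to_play):
    """Three binary planes: own stones, opponent stones, empty points."""
    opponent = WHITE if to_play == BLACK else BLACK
    return np.stack([board == to_play,
                     board == opponent,
                     board == EMPTY]).astype(np.float32)

planes = np.zeros((48, 19, 19), dtype=np.float32)
planes[:3] = stone_colour_planes(board, to_play=BLACK)
# ... remaining planes (turns since, liberties, capture size, ...) omitted.

# Value network input: one extra player-colour plane appended.
value_input = np.concatenate([planes, np.zeros((1, 19, 19), np.float32)])
print(value_input.shape)                        # (49, 19, 19)
```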
29. Design | Neural Network Architectures
Convolution Layers
(Diagram: a binary 19 x 19 x 1 output plane)
Policy Head
> 1-stride convolution: 1 kernel of size 1 x 1 with a different bias for each intersection
> Softmax function outputs $p(a \mid s)$ for each of the 19 x 19 intersections
Value Head
> 1-stride convolution: 1 kernel of size 1 x 1
> Fully-connected layer of 256 rectifiers
> Fully-connected tanh layer outputs a single $v_\theta(s) \in [-1, 1]$
(A sketch of the two heads follows below.)
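The following sketch shows the two heads in numpy under simplifying assumptions: the shared feature map is reduced to a single 19 x 19 plane and all weights are toy values, so the 1 x 1 convolutions become scalar multiplications.

```python
# Minimal sketch (toy weights and a single-plane feature map) of the
# policy and value heads described above.
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=(19, 19))                    # toy final feature map

# Policy head: 1 x 1 convolution (a scalar weight for one input plane)
# plus a different bias per intersection, then softmax over all 361 points.
w_p, b_p = 0.5, rng.normal(size=(19, 19))
logits = (w_p * h + b_p).ravel()
p = np.exp(logits - logits.max()); p /= p.sum()  # p(a|s) over 19 x 19 moves

# Value head: 1 x 1 convolution, a fully-connected layer of 256 rectifiers,
# then a fully-connected tanh output producing a single scalar.
w_v = 0.5
W_fc = rng.normal(size=(256, 361)) * 0.05
w_out = rng.normal(size=256) * 0.05
hidden = np.maximum(0.0, W_fc @ (w_v * h).ravel())
v = float(np.tanh(w_out @ hidden))               # v_theta(s) in (-1, 1)
print(p.shape, v)
```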
30. Design | Supervised Learning Policy Network
$p_\sigma(a \mid s)$
> Trained using mini-batches of 16 positions randomly selected from 28.4 million positions
> Trained on 50 GPUs over 3 weeks
> Tested with 1 million positions
> 57.0% prediction accuracy
Maximize: $\Delta\sigma \propto \dfrac{\partial \log p_\sigma(a \mid s)}{\partial \sigma}$
31. Design | Reinforcement Learning Policy Network
$p_\rho(a \mid s)$
> Trained using self-play between the current network and a randomly selected previous iteration (initialized from $p_\sigma(a \mid s)$)
> Trained over 10,000 mini-batches of 128 games
> Evaluated through game play $a \sim p_\rho(\cdot \mid s)$ without search
> 80% win rate against $p_\sigma(a \mid s)$
> 85% win rate against the strongest open-source Go program
Maximize: $\Delta\rho \propto \dfrac{\partial \log p_\rho(a_t \mid s_t)}{\partial \rho} z_t$
(A REINFORCE-style sketch follows below.)
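The update above is a REINFORCE-style policy gradient: the log-likelihood gradient of the played move, scaled by the game outcome $z_t$. The sketch below reuses the toy linear softmax policy from the rollout example; all values are illustrative.

```python
# Minimal REINFORCE-style sketch of the policy-gradient update above.
import numpy as np

def softmax(logits):
    p = np.exp(logits - logits.max())
    return p / p.sum()

def reinforce_step(rho, feats, a_t, z_t, lr=0.1):
    """delta_rho ∝ (d log p_rho(a_t|s_t) / d rho) * z_t, for a linear model."""
    p = softmax(feats @ rho)
    return rho + lr * z_t * (feats[a_t] - p @ feats)

rho = np.zeros(3)
feats = np.eye(3)                                   # toy one-hot features
rho = reinforce_step(rho, feats, a_t=0, z_t=+1.0)   # won: reinforce the move
rho = reinforce_step(rho, feats, a_t=2, z_t=-1.0)   # lost: suppress the move
print(softmax(feats @ rho))
```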
32. Design | Value Network
$v_\theta(s)$
> Trained using 30 million distinct positions, each sampled from a separate game generated by a random mix of $p_\sigma(a \mid s)$ and $p_\rho(a \mid s)$ to prevent overfitting
> Consistently more accurate than $p_\pi(a \mid s)$
> Approaches the accuracy of Monte Carlo rollouts using $p_\rho(a \mid s)$ with less computation
Minimize: $\Delta\theta \propto \dfrac{\partial v_\theta(s)}{\partial \theta} (z - v_\theta(s))$
$v_\theta(s) \approx v^{p_\rho}(s) \approx v^*(s)$
(A toy regression sketch follows below.)
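The value network is trained by regression toward the game outcome $z$. The sketch below shows that gradient step for a toy linear-tanh value function; the features are illustrative stand-ins for the network.

```python
# Minimal sketch of the value regression update on a toy linear-tanh model.
import numpy as np

def value_step(theta, s, z, lr=0.1):
    """delta_theta ∝ (dv_theta/dtheta) * (z - v_theta(s))."""
    v = np.tanh(theta @ s)
    dv = (1.0 - v**2) * s                 # tanh derivative times features
    return theta + lr * dv * (z - v)

theta = np.zeros(4)
s = np.array([1.0, 0.5, -0.5, 0.0])      # toy position features
for _ in range(200):
    theta = value_step(theta, s, z=+1.0)
print(np.tanh(theta @ s))                 # approaches the outcome z = 1
```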
33. Design | Search Algorithm
*Image captured from Silver D. et al. (2016)
Edge $(s, a)$ Data
> $Q(s, a)$: action value
> $N(s, a)$: visit count
> $P(s, a)$: prior probability
(A sketch of the selection rule follows below.)
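Silver et al. (2016) select edges by maximizing $Q(s,a) + u(s,a)$, where the exploration bonus $u(s,a) \propto P(s,a)\sqrt{\sum_b N(s,b)}/(1 + N(s,a))$ decays with repeated visits. The sketch below implements that rule; the constant `c_puct` and the edge data are illustrative values, not from the paper.

```python
# Minimal sketch of AlphaGo's edge-selection rule: argmax of Q + u.
import math

def select(edges, c_puct=5.0):
    """edges: dict action -> (Q, N, P)."""
    n_total = sum(N for _, N, _ in edges.values())  # parent visit count
    def score(e):
        Q, N, P = e
        u = c_puct * P * math.sqrt(n_total) / (1 + N)   # prior-weighted bonus
        return Q + u
    return max(edges, key=lambda a: score(edges[a]))

edges = {"D4": (0.52, 120, 0.30), "Q16": (0.48, 40, 0.25), "K10": (0.10, 2, 0.05)}
print(select(edges))   # a high prior with few visits can outweigh a lower Q
```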
39. Discussion | Performance
Against AI Players
> Played against the strongest commercial and open-source Go programs based on MCTS
> Single-machine AlphaGo won 494 out of 495 even games
> The distributed version of AlphaGo won 77% of games against the single-machine version and 100% against the other programs
40. Discussion | Performance
Against Fan Hui
> Won 5-0 in formal games with 1 hour of main time + three 30s byoyomi¹ periods
> Won 3-2 in informal games with three 30s byoyomi¹ periods
1. Time slots consumed after the main time is exhausted; a period resets to its full length if not exceeded in a single turn
*Image captured from Silver D. et al. (2016)
41. Discussion | Performance
Against Lee Sedol
> Won 4-1 in formal games with 2 hours of main time + three 60s byoyomi periods
> Game 4 – the only loss – is still being analyzed
> MCTS may have overlooked Lee's game-changing move, the only move that could have saved the game from that position
Game 4: Lee Sedol (White) vs. AlphaGo (Black); Lee Sedol wins by resignation
*Image captured from https://gogameguru.com/lee-sedol-defeats-alphago-masterful-comeback-game-4/
42. Discussion | Future Work
Next Potential Matches
> Imperfect information games (e.g., Poker, StarCraft)
> AlphaGo based on pure learning
> Testbed for future algorithmic research
"it'd be cool if one day an AI was involved in finding a new particle"
Demis Hassabis, Co-founder of DeepMind
Application Areas
> Gaming
> Healthcare
> Smartphone Assistant
Healthcare Applications
> Medical diagnosis of images
> Longitudinal tracking of vital signs to help people lead healthier lifestyles
44. References
Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., . . . Colton, S. (2012). A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1), 1-43.
Byford, S. (2016, March 10). DeepMind founder Demis Hassabis on how AI will shape the future. The Verge. Retrieved April 02, 2016, from http://www.theverge.com/2016/3/10/11192774/demis-hassabis-interview-alphago-google-deepmind-ai
Google Inc. (2016). AlphaGo | Google DeepMind. Retrieved April 02, 2016, from https://deepmind.com/alpha-go.html
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.
Ormerod, D. (2016, March 13). Lee Sedol defeats AlphaGo in masterful comeback - Game 4. Retrieved April 06, 2016, from https://gogameguru.com/lee-sedol-defeats-alphago-masterful-comeback-game-4/
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., . . . Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484-489.
Editor's Notes
Pachi runs 100,000 simulations per move
AlphaGo seems to be able to manage risk more precisely than humans can, and is completely happy to accept losses as long as its probability of winning remains favorable.