1. AlphaGo: An AI Go Player Based on Deep Neural Networks and Monte Carlo Tree Search
Michael J. Moon
M.Sc. Candidate in Biostatistics
Dalla Lana School of Public Health
University of Toronto
April 7, 2016
4. Introduction | Background
The Game of Go
> Played on a square grid called a board, usually 19 x 19
> Black and white stones are placed alternately
> Points awarded for surrounding empty space
Complexity
> Possible number of move sequences ≈ 250¹⁵⁰
> A googol¹ times more complex than chess
> Viewed as an unsolved "grand challenge" for AI
"pinnacle of perfect information games"
Demis Hassabis, Co-founder of DeepMind
Example of a Go Board (shades represent territories)
1. 1 googol = 1.0 × 10¹⁰⁰
5. Introduction | Background
Google DeepMind's AI Go Player¹
5-0 against Fan Hui
> Victory against the three-time European champion
> First program to win against a professional player in an even game
4-1 against Lee Sedol
> Victory against the world's top player of the past decade
> Awarded the highest Go ranking after the match²
1. Image source: https://deepmind.com/alpha-go.html
2. Source: http://www.straitstimes.com/asia/east-asia/googles-alphago-gets-divine-go-ranking
8. Introduction | Overview of the Design
Training Pipeline (diagram)
> 30M human moves train the Rollout Policy and the SL Policy Network
> The RL Policy Network is initialized from the SL Policy Network and improved by self-play
> Self-play games train the RL Value Network
> Monte Carlo Tree Search combines the networks for move selection
Asynchronous Multi-threaded Search
> 40 search threads
> 48 CPUs
> 8 GPUs
Distributed Version¹
> 40 search threads
> 1,202 CPUs
> 176 GPUs
1. Used against Fan Hui; 1,920 CPUs and 280 GPUs against Lee
http://www.economist.com/news/science-and-technology/21694540-win-or-lose-best-five-battle-contest-another-milestone
14. Methodologies | Deep Neural Network
Deep Learning Architecture
> Multilayer (5–20) stack of simple modules subject to learning
Backpropagation Training
> Trained by simple stochastic gradient descent to minimize error
> Rectified linear units (ReLU), $f(x) = \max(0, x)$, learn faster than other non-linearities
Forward pass (diagram: input units $i$, hidden units $j \in H_1$ and $k \in H_2$, output units $l$):
$z_j = \sum_{i \in In} w_{ij} x_i, \quad y_j = f(z_j)$
$z_k = \sum_{j \in H_1} w_{jk} y_j, \quad y_k = f(z_k)$
$z_l = \sum_{k \in H_2} w_{kl} y_k, \quad y_l = f(z_l)$
(A minimal numeric sketch follows below.)
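To make the forward and backward passes concrete, here is a minimal Python/numpy sketch assuming a toy regression task with arbitrary layer sizes; nothing here comes from the AlphaGo implementation.

```python
# Minimal sketch of the slide's forward/backward passes (toy sizes and data;
# not AlphaGo code): two ReLU hidden layers trained by gradient descent.
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

W1 = rng.normal(0, 0.1, (4, 8))    # w_ij: input -> H1
W2 = rng.normal(0, 0.1, (8, 8))    # w_jk: H1 -> H2
W3 = rng.normal(0, 0.1, (8, 1))    # w_kl: H2 -> output
lr = 0.05

X = rng.normal(size=(256, 4))      # toy task: learn y = sum(x)
Y = X.sum(axis=1, keepdims=True)

for step in range(2000):
    z1 = X @ W1; y1 = relu(z1)     # z_j = sum_i w_ij x_i, y_j = f(z_j)
    z2 = y1 @ W2; y2 = relu(z2)    # z_k = sum_j w_jk y_j, y_k = f(z_k)
    out = y2 @ W3                  # linear output unit y_l
    err = out - Y                  # dE/d(out) for squared error

    # Backward pass: propagate error, multiplying by the ReLU derivative.
    g3 = y2.T @ err
    d2 = (err @ W3.T) * (z2 > 0)
    g2 = y1.T @ d2
    d1 = (d2 @ W2.T) * (z1 > 0)
    g1 = X.T @ d1

    for W, g in ((W1, g1), (W2, g2), (W3, g3)):
        W -= lr * g / len(X)       # (full-batch) gradient descent step

print("final MSE:", float(np.mean((relu(relu(X @ W1) @ W2) @ W3 - Y) ** 2)))
```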
15. Methodologies | Deep Convolutional Neural Network
Properties of natural signals
Input
> Arrays such as signals, images and videos
Local Connections
> Each unit connects to a local patch of the layer below
Shared Weights
> Each filter applies common weights $W_1$ and a bias across positions to create a feature map
Non-linearity
> Local weighted sums passed to a non-linearity such as ReLU
Pooling
> Coarse-grains the position of each feature, typically by taking the max over neighbouring features
Size and Stride
> e.g., a filter of size 3 applied with stride 2
Deep Architecture
> Uses stacks of many layers
(A minimal sketch of these building blocks follows below.)
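The following toy 1-D sketch (illustrative signal and filter values, not from the slides) shows the three building blocks together: a shared-weight convolution, a ReLU non-linearity, and max pooling.

```python
# Minimal sketch: shared-weight convolution, ReLU, and max pooling on a
# toy 1-D signal (all values illustrative).
import numpy as np

def conv1d(x, w, b, stride=1):
    """Slide one filter w (shared weights) with a shared bias b across x."""
    k = len(w)
    return np.array([x[i:i + k] @ w + b
                     for i in range(0, len(x) - k + 1, stride)])

x = np.array([1., 0., 2., 3., 0., 1., 4., 0.])   # input array (signal)
w = np.array([0.5, -1.0, 0.5])                   # filter of size 3

feature = np.maximum(0.0, conv1d(x, w, b=0.1, stride=2))  # size 3, stride 2, + ReLU

# Pooling: coarse-grain positions by taking the max over neighbouring features.
pooled = np.array([feature[i:i + 2].max()
                   for i in range(0, len(feature) - 1, 2)])
print(feature, pooled)
```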
16. Methodologies | Deep Convolutional Neural Network
Architecture
> Exploits highly correlated local groups
> Local statistics invariant to location
Properties
> Compositional hierarchy
> Invariant to small shifts and distortions due to pooling
> Weights trained through backpropagation
17. Methodologies | Monte Carlo Tree Search
Overview
Find optimal decisions by:
> Taking random samples in the decision space
> Building a search tree according to the results
Notation
> $s \in S$: nodes; $a \in A$: edges
> $r(s)$: reward; $N(a)$: visit count
Tree Policy
> $g(p(a \mid s), N(a))$ for $s \in$ tree nodes
> Tries to balance exploration and exploitation
Default Policy
> $p(a \mid s)$ for $s \notin$ tree nodes
Four Steps per Iteration
Selection
> Traverse to the most urgent expandable node
Expansion
> Add a child node from the selected node
Simulation
> Simulate from the newly added node to an outcome $r(s')$
Backpropagation
> Back up the simulation result through the selected nodes
Strengths
> Anytime algorithm – gives a valid solution whenever interrupted
> Values of intermediate states are not evaluated – domain knowledge not required
(A minimal implementation sketch follows below.)
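The sketch below implements the four steps generically, with a UCB1 tree policy and a uniform-random default policy. The toy game (count up from 0 by 1 or 2; reward 1 for landing exactly on 10) and all helper names are hypothetical stand-ins, not from the presentation.

```python
# Minimal generic MCTS sketch following the four steps above.
import math, random

# Hypothetical toy game: add 1 or 2 to a counter; landing exactly on 10 pays 1.
def legal_moves(s): return [] if s >= 10 else [1, 2]
def step(s, a):     return s + a
def is_terminal(s): return s >= 10
def reward(s):      return 1.0 if s == 10 else 0.0

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}                       # action -> Node
        self.untried = list(legal_moves(state))  # expandable actions
        self.N, self.Q = 0, 0.0                  # visit count, total reward

def ucb1(node, c=1.4):
    # Tree policy: balance exploitation (Q/N) against exploration.
    return max(node.children.values(),
               key=lambda n: n.Q / n.N + c * math.sqrt(math.log(node.N) / n.N))

def mcts(root_state, iterations=500):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        while not node.untried and node.children:    # 1. Selection
            node = ucb1(node)
        if node.untried:                             # 2. Expansion
            a = node.untried.pop()
            node.children[a] = Node(step(node.state, a), parent=node)
            node = node.children[a]
        s = node.state                               # 3. Simulation (default policy)
        while not is_terminal(s):
            s = step(s, random.choice(legal_moves(s)))
        r = reward(s)
        while node is not None:                      # 4. Backpropagation
            node.N += 1; node.Q += r
            node = node.parent
    return max(root.children, key=lambda a: root.children[a].N)

print(mcts(0))   # most-visited first move toward landing exactly on 10
```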
23. Design | Problem Setting
Notation
> $s \in S$: state of the game
> $a \in A(s)$: legal actions at $s$
> $f(s, a)$: deterministic state transition
> $r^i(s)$: reward for player $i$ at $s$, $i \in \{1, 2\}$
> Zero-sum game: $r(s) = r^1(s) = -r^2(s)$, with $r(s) = 0$ if $s \neq s_T$
> $z_t = \pm r(s_T)$: terminal reward at $s_T$, $z_t \in \{-1, 1\}$
Policy
> $p(a \mid s)$: probability distribution over legal actions
Value Function
> $v^p(s) = \mathbb{E}[z_t \mid s_t = s, a_{t, \ldots, T} \sim p]$
Unique Optimal Value Function
$$v^*(s) = \begin{cases} z_T & \text{if } s = s_T \\ \max_a -v^*(f(s, a)) & \text{otherwise} \end{cases}$$
(A negamax sketch of this recursion follows below.)
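The optimal-value recursion above is exactly a negamax: each player maximizes the negated value of the successor state. Below is a minimal sketch on a hypothetical toy game (take 1 or 2 tokens; taking the last token wins); the helper names are illustrative stand-ins.

```python
# Minimal negamax sketch of v*(s) for a toy stand-in game.
def legal_actions(s):   return [a for a in (1, 2) if a <= s]
def transition(s, a):   return s - a                 # deterministic f(s, a)
def is_terminal(s):     return s == 0
def terminal_reward(s): return -1.0                  # player to move has lost

def optimal_value(s):
    if is_terminal(s):                               # s = s_T
        return terminal_reward(s)                    # z_T in {-1, 1}
    # max over a of -v*(f(s, a)): the opponent's value is the negation of ours.
    return max(-optimal_value(transition(s, a)) for a in legal_actions(s))

print(optimal_value(4))   # 1.0: positions not divisible by 3 are wins
```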
24. Design | Rollout Policy
$p_\pi(a \mid s)$
> A fast, linear softmax policy for simulation
> Pattern-based feature inputs
> Trained using 8 million positions
> Less domain knowledge implemented compared to existing MCTS Go programs
> 24.2% prediction accuracy
> A similar policy $p_\tau(a \mid s)$ is used for tree expansion
Maximize: $\Delta\pi \propto \dfrac{\partial \log p_\pi(a \mid s)}{\partial \pi}$
(A toy implementation follows below.)
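The sketch below shows a linear softmax policy trained by the log-likelihood gradient ascent above. The features and data are toy, illustrative stand-ins for AlphaGo's pattern-based inputs.

```python
# Minimal sketch of a linear softmax policy over pattern features.
import numpy as np

def softmax_policy(pi, feats):
    """feats: (n_actions, n_features) pattern features of each legal action."""
    logits = feats @ pi
    p = np.exp(logits - logits.max())        # numerically stable softmax
    return p / p.sum()

def sgd_step(pi, feats, a, lr=0.1):
    # d log p_pi(a|s) / d pi = phi(s, a) - E_p[phi(s, .)] for a linear model.
    p = softmax_policy(pi, feats)
    return pi + lr * (feats[a] - p @ feats)

pi = np.zeros(3)                             # 3 toy pattern features
feats = np.eye(3)                            # one-hot features per action
for _ in range(100):
    pi = sgd_step(pi, feats, a=1)            # observed human move: action 1
print(softmax_policy(pi, feats))             # mass concentrates on action 1
```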
25. Design | Neural Network Architectures
Input
> 19 x 19 intersections x 48 feature planes (one extra plane for the value network)
Input Feature Space (with respect to the current player)
> Stone Colour
> Ones & Zeros
> Turns Since
> Liberties
> Capture Size
> Self-atari Size
> Liberties after Move
> Ladder Capture
> Ladder Escape
> Sensibleness
Extra Feature for Value Network
> Player Colour
(Diagram: binary 19 x 19 feature planes; a sketch of the encoding follows below.)
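The sketch below illustrates the input encoding: binary 19 x 19 planes stacked into a 19 x 19 x 48 tensor (49 with the player-colour plane for the value network). Only the stone-colour planes are shown; the board position and plane layout are illustrative assumptions.

```python
# Minimal sketch of binary input feature planes for a Go position.
import numpy as np

EMPTY, BLACK, WHITE = 0, 1, 2
board = np.zeros((19, 19), dtype=np.int8)      # toy position
board[3, 3], board[15, 15] = BLACK, WHITE

def stone_colour_planes(board, to_play):
    """Three binary planes: own stones, opponent stones, empty points."""
    opponent = WHITE if to_play == BLACK else BLACK
    return np.stack([board == to_play,
                     board == opponent,
                     board == EMPTY]).astype(np.float32)

planes = np.zeros((48, 19, 19), dtype=np.float32)
planes[:3] = stone_colour_planes(board, to_play=BLACK)
# ... remaining planes (turns since, liberties, capture size, ...) omitted.

# Value network input: one extra player-colour plane appended.
value_input = np.concatenate([planes, np.zeros((1, 19, 19), np.float32)])
print(value_input.shape)                        # (49, 19, 19)
```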
29. Design | Neural Network Architectures
Convolution Layers
(Diagram: a binary 19 x 19 x 1 output plane)
Policy Head
> 1-stride convolution: 1 kernel of size 1 x 1 with a different bias for each intersection
> Softmax function outputs $p(a \mid s)$ for each of the 19 x 19 intersections
Value Head
> 1-stride convolution: 1 kernel of size 1 x 1
> Fully-connected layer of 256 rectifiers
> Fully-connected tanh layer outputs a single $v_\theta(s) \in [-1, 1]$
(A sketch of the two heads follows below.)
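The following sketch shows the two heads in numpy under simplifying assumptions: the shared feature map is reduced to a single 19 x 19 plane and all weights are toy values, so the 1 x 1 convolutions become scalar multiplications.

```python
# Minimal sketch (toy weights and a single-plane feature map) of the
# policy and value heads described above.
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=(19, 19))                    # toy final feature map

# Policy head: 1 x 1 convolution (a scalar weight for one input plane)
# plus a different bias per intersection, then softmax over all 361 points.
w_p, b_p = 0.5, rng.normal(size=(19, 19))
logits = (w_p * h + b_p).ravel()
p = np.exp(logits - logits.max()); p /= p.sum()  # p(a|s) over 19 x 19 moves

# Value head: 1 x 1 convolution, a fully-connected layer of 256 rectifiers,
# then a fully-connected tanh output producing a single scalar.
w_v = 0.5
W_fc = rng.normal(size=(256, 361)) * 0.05
w_out = rng.normal(size=256) * 0.05
hidden = np.maximum(0.0, W_fc @ (w_v * h).ravel())
v = float(np.tanh(w_out @ hidden))               # v_theta(s) in (-1, 1)
print(p.shape, v)
```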
30. Design | Supervised Learning Policy Network
$p_\sigma(a \mid s)$
> Trained using mini-batches of 16 positions randomly selected from 28.4 million positions
> Trained on 50 GPUs over 3 weeks
> Tested with 1 million positions
> 57.0% prediction accuracy
Maximize: $\Delta\sigma \propto \dfrac{\partial \log p_\sigma(a \mid s)}{\partial \sigma}$
31. Design | Reinforcement Learning Policy Network
$p_\rho(a \mid s)$
> Trained using self-play between the current network and a randomly selected previous iteration (initialized from $p_\sigma(a \mid s)$)
> Trained over 10,000 mini-batches of 128 games
> Evaluated through game play $a \sim p_\rho(\cdot \mid s)$ without search
> 80% win rate against $p_\sigma(a \mid s)$
> 85% win rate against the strongest open-source Go program
Maximize: $\Delta\rho \propto \dfrac{\partial \log p_\rho(a_t \mid s_t)}{\partial \rho} z_t$
(A REINFORCE-style sketch follows below.)
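The update above is a REINFORCE-style policy gradient: the log-likelihood gradient of the played move, scaled by the game outcome $z_t$. The sketch below reuses the toy linear softmax policy from the rollout example; all values are illustrative.

```python
# Minimal REINFORCE-style sketch of the policy-gradient update above.
import numpy as np

def softmax(logits):
    p = np.exp(logits - logits.max())
    return p / p.sum()

def reinforce_step(rho, feats, a_t, z_t, lr=0.1):
    """delta_rho ∝ (d log p_rho(a_t|s_t) / d rho) * z_t, for a linear model."""
    p = softmax(feats @ rho)
    return rho + lr * z_t * (feats[a_t] - p @ feats)

rho = np.zeros(3)
feats = np.eye(3)                                   # toy one-hot features
rho = reinforce_step(rho, feats, a_t=0, z_t=+1.0)   # won: reinforce the move
rho = reinforce_step(rho, feats, a_t=2, z_t=-1.0)   # lost: suppress the move
print(softmax(feats @ rho))
```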
32. Design | Value Network
$v_\theta(s)$
> Trained using 30 million distinct positions, each sampled from a separate game generated by a random mix of $p_\sigma(a \mid s)$ and $p_\rho(a \mid s)$ to prevent overfitting
> Consistently more accurate than $p_\pi(a \mid s)$
> Approaches the accuracy of Monte Carlo rollouts using $p_\rho(a \mid s)$ with less computation
Minimize: $\Delta\theta \propto \dfrac{\partial v_\theta(s)}{\partial \theta} (z - v_\theta(s))$
$v_\theta(s) \approx v^{p_\rho}(s) \approx v^*(s)$
(A toy regression sketch follows below.)
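The value network is trained by regression toward the game outcome $z$. The sketch below shows that gradient step for a toy linear-tanh value function; the features are illustrative stand-ins for the network.

```python
# Minimal sketch of the value regression update on a toy linear-tanh model.
import numpy as np

def value_step(theta, s, z, lr=0.1):
    """delta_theta ∝ (dv_theta/dtheta) * (z - v_theta(s))."""
    v = np.tanh(theta @ s)
    dv = (1.0 - v**2) * s                 # tanh derivative times features
    return theta + lr * dv * (z - v)

theta = np.zeros(4)
s = np.array([1.0, 0.5, -0.5, 0.0])      # toy position features
for _ in range(200):
    theta = value_step(theta, s, z=+1.0)
print(np.tanh(theta @ s))                 # approaches the outcome z = 1
```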
33. Design | Search Algorithm
*Image captured from Silver D. et al. (2016)
Edge $(s, a)$ Data
> $Q(s, a)$: action value
> $N(s, a)$: visit count
> $P(s, a)$: prior probability
(A sketch of the selection rule follows below.)
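Silver et al. (2016) select edges by maximizing $Q(s,a) + u(s,a)$, where the exploration bonus $u(s,a) \propto P(s,a)\sqrt{\sum_b N(s,b)}/(1 + N(s,a))$ decays with repeated visits. The sketch below implements that rule; the constant `c_puct` and the edge data are illustrative values, not from the paper.

```python
# Minimal sketch of AlphaGo's edge-selection rule: argmax of Q + u.
import math

def select(edges, c_puct=5.0):
    """edges: dict action -> (Q, N, P)."""
    n_total = sum(N for _, N, _ in edges.values())  # parent visit count
    def score(e):
        Q, N, P = e
        u = c_puct * P * math.sqrt(n_total) / (1 + N)   # prior-weighted bonus
        return Q + u
    return max(edges, key=lambda a: score(edges[a]))

edges = {"D4": (0.52, 120, 0.30), "Q16": (0.48, 40, 0.25), "K10": (0.10, 2, 0.05)}
print(select(edges))   # a high prior with few visits can outweigh a lower Q
```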
39. Discussion | Performance
Against AI Players
> Played against the strongest commercial and open-source Go programs based on MCTS
> Single-machine AlphaGo won 494 out of 495 even games
> The distributed version of AlphaGo won 77% of games against the single-machine version and 100% against the other programs
40. Discussion | Performance
Against Fan Hui
> Won 5-0 in formal games with 1 hour of main time + three 30s byoyomi¹ periods
> Won 3-2 in informal games with three 30s byoyomi¹ periods
1. Time slots consumed after the main time is exhausted; a period resets to its full length if not exceeded in a single turn
*Image captured from Silver D. et al. (2016)
41. Discussion | Performance
Against Lee Sedol
> Won 4-1 in formal games with 2 hours of main time + three 60s byoyomi periods
> Game 4 – the only loss – is still being analyzed
> MCTS may have overlooked Lee's game-changing move, the only move that could have saved the game from that position
Game 4: Lee Sedol (White) vs. AlphaGo (Black); Lee Sedol wins by resignation
*Image captured from https://gogameguru.com/lee-sedol-defeats-alphago-masterful-comeback-game-4/
42. Discussion | Future Work
Next Potential Matches
> Imperfect information games (e.g., Poker, StarCraft)
> AlphaGo based on pure learning
> Testbed for future algorithmic research
"it'd be cool if one day an AI was involved in finding a new particle"
Demis Hassabis, Co-founder of DeepMind
Application Areas
> Gaming
> Healthcare
> Smartphone Assistant
Healthcare Applications
> Medical diagnosis of images
> Longitudinal tracking of vital signs to help people lead healthier lifestyles
44. References
Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., . . . Colton, S. (2012). A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1), 1-43.
Byford, S. (2016, March 10). DeepMind founder Demis Hassabis on how AI will shape the future. The Verge. Retrieved April 02, 2016, from http://www.theverge.com/2016/3/10/11192774/demis-hassabis-interview-alphago-google-deepmind-ai
Google Inc. (2016). AlphaGo | Google DeepMind. Retrieved April 02, 2016, from https://deepmind.com/alpha-go.html
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.
Ormerod, D. (2016, March 13). Lee Sedol defeats AlphaGo in masterful comeback - Game 4. Retrieved April 06, 2016, from https://gogameguru.com/lee-sedol-defeats-alphago-masterful-comeback-game-4/
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., . . . Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484-489.
Editor's Notes
Pachi runs 100,000 simulations per move
AlphaGo seems to be able to manage risk more precisely than humans can, and is completely happy to accept losses as long as its probability of winning remains favorable.