2. Contents
Ⅰ. What is AlphaGo?
- Go Machine
Ⅱ. Background
- Overview
- MCTS (Monte Carlo Tree Search)
- CNN (Convolutional Neural Network)
Ⅲ. Components
- Policy Networks
- Value Networks
- Searching with policy and value networks
Ⅳ. Conclusion
4. What is AlphaGo? – Go Machine
- AlphaGo is a computer program developed by Google DeepMind to play the board game Go.
- It was the first computer Go program to beat a professional human Go player without
handicaps.
5. What is AlphaGo? – Go Machine
- For chess, IBM's Deep Blue beat the world chess champion using brute-force search.
- A Go board is 19 × 19, and each point can be in one of 3 states (black, white, or
empty). The number of board configurations is therefore 3^361 ≈ 10^172.
- There are about 250 reasonable moves in each position, and a Go game ends after about
150 moves on average, so the search tree has breadth ≈ 250 and depth ≈ 150.
- It is impossible to search all of these cases with current technology.
- The key question is how to reduce the depth and breadth of the search tree.
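The search-space arithmetic on this slide can be checked directly; a quick sketch of the numbers (board configurations and game paths) in log scale:

```python
# Rough Go search-space arithmetic from the slide: a 19x19 board with
# 3 states per point, ~250 moves per turn, ~150 turns per game.
import math

points = 19 * 19                        # 361 intersections
configurations = 3 ** points            # every black/white/empty assignment
print(math.log10(configurations))       # ~172.2, i.e. 3^361 is about 10^172

game_paths = math.log10(250) * 150      # breadth^depth = 250^150 game paths
print(game_paths)                       # ~359.7, i.e. about 10^360 games
```

Even the smaller of these numbers dwarfs what brute-force search can enumerate, which is why the tree must be pruned in both depth and breadth.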
7. Background – Overview
1. MCTS (Monte Carlo Tree Search)
- It is used by many AI Go programs.
2. CNN (Convolutional Neural Networks)
- Policy Networks
- Value Networks
8. Background – MCTS
- MCTS is efficient when it is impossible to explore all paths exhaustively.
- Selection : select the most promising path from the root to a leaf.
- Expansion : if the game is not over, create one or more child nodes and choose one
of them.
- Simulation : play the game from the chosen node until it ends.
- Backpropagation : update the statistics on the path from the root to the chosen node
using the simulation result.
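The four steps above can be sketched as a minimal UCT-style implementation. The game here is a hypothetical stand-in, not Go: two players alternately take 1 or 2 stones from a pile, and whoever takes the last stone wins.

```python
# Minimal UCT-flavoured MCTS following the four steps on the slide.
# The take-away game and the Node/mcts names are illustrative stand-ins.
import math
import random

def moves(state):                      # legal moves: take 1 or 2 stones
    return [m for m in (1, 2) if m <= state]

class Node:
    def __init__(self, state, player, parent=None, move=None):
        self.state, self.player = state, player    # player to move here
        self.parent, self.move = parent, move
        self.children, self.wins, self.visits = [], 0, 0
        self.untried = moves(state)

def mcts(root_state, iters=3000):
    root = Node(root_state, player=1)
    for _ in range(iters):
        node = root
        # Selection: descend by the UCB1 score while fully expanded.
        while not node.untried and node.children:
            node = max(node.children, key=lambda c: c.wins / c.visits +
                       math.sqrt(2 * math.log(node.visits) / c.visits))
        # Expansion: add one unexplored child if the game is not over.
        if node.untried:
            m = node.untried.pop()
            child = Node(node.state - m, -node.player, node, m)
            node.children.append(child)
            node = child
        # Simulation: random rollout until the game ends.
        state, player = node.state, node.player
        winner = -node.player if state == 0 else None
        while winner is None:
            state -= random.choice(moves(state))
            if state == 0:
                winner = player            # mover who emptied the pile wins
            player = -player
        # Backpropagation: update stats from the chosen node up to the root.
        while node is not None:
            node.visits += 1
            node.wins += (winner == -node.player)  # credit the mover into node
            node = node.parent
    return max(root.children, key=lambda c: c.visits).move
```

From a pile of 4 stones the optimal move is to take 1 (leaving a multiple of 3), and with enough iterations the search concentrates its visits on that move.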
9. Background – CNN
- Convolution layer : extracts meaningful features (feature maps) from the input image.
- Sub-sampling layer : applies max-pooling to the feature maps to reduce their resolution.
- Fully-connected layer : performs the final classification from the feature maps.
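The three layer types can be traced through a toy forward pass; only the shapes matter here, since the weights are random rather than trained:

```python
# Toy forward pass through the three layer types on the slide
# (random weights, shapes only -- not a trained network).
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))      # tiny grayscale input image

# Convolution layer: slide a 3x3 filter over the image -> 6x6 feature map.
kernel = rng.standard_normal((3, 3))
fmap = np.array([[np.sum(image[i:i+3, j:j+3] * kernel)
                  for j in range(6)] for i in range(6)])

# Sub-sampling layer: 2x2 max-pooling -> 3x3 map.
pooled = fmap.reshape(3, 2, 3, 2).max(axis=(1, 3))

# Fully-connected layer: flatten and map to 2 class scores.
w = rng.standard_normal((2, 9))
logits = w @ pooled.ravel()
print(fmap.shape, pooled.shape, logits.shape)   # (6, 6) (3, 3) (2,)
```

Each stage shrinks the spatial dimensions while (in a real network) increasing the abstraction of the extracted features.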
11. Components – Notation
- s : state of the board
- a : next action
- v(s) : value function of a board state
- P(a|s) : probability distribution over possible moves a in position s
- P_σ : policy network trained with supervised learning
- P_π : fast rollout policy used to rapidly sample actions during rollouts
- P_ρ : policy network trained with reinforcement learning
- v_θ : value network that predicts the winner of games
12. Components - Policy Networks
- Decrease the breadth of the search tree.
- Convolutional neural networks for choosing the next action.
- Estimate the move distribution P(a|s).
- Trained by supervised learning and reinforcement learning.
1. SL (Supervised Learning) Policy Network
- Learns from human experts using 30 million positions from the KGS Go Server.
2. RL (Reinforcement Learning) Policy Network
- Initialized to the SL policy network.
- Learns from games of self-play with the RL policy network.
- The RL policy network won more than 80% of games against the SL policy network.
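What "estimating P(a|s)" means can be sketched in a few lines: turn per-move scores into a probability distribution and sample the next action. The scores here are made up; a real policy network computes them from the board state with convolutional layers.

```python
# Sketch of P(a|s): softmax over per-move scores, then sample a move.
# The scores are illustrative, not the output of a real network.
import numpy as np

rng = np.random.default_rng(0)
scores = np.array([2.0, 0.5, 0.1, -1.0])   # one score per legal move

probs = np.exp(scores - scores.max())      # numerically stable softmax
probs /= probs.sum()                       # -> P(a|s), sums to 1

move = rng.choice(len(scores), p=probs)    # sample the next action a
print(probs.round(3), move)
```

Reducing breadth then amounts to concentrating the search on the few moves with high P(a|s) instead of all ~250 legal ones.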
13. Components - Value Networks
- Decrease the depth of the search tree.
- A convolutional neural network for predicting the outcome
from position s.
- Estimates the value function v^p(s).
- Trained by reinforcement learning.
1. Reinforcement Learning
- Trained on self-play games generated by the RL policy networks.
- This avoids overfitting to the KGS data sets.
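The training signal for a value network can be sketched as a regression: nudge the predicted value v(s) toward the observed self-play outcome z. A linear model stands in for the real convolutional network, and the features are made-up numbers.

```python
# Sketch of the value-network training target: minimize (v(s) - z)^2.
# A linear model and toy features stand in for the real network.
import numpy as np

w = np.zeros(4)                          # toy "network" weights
s = np.array([0.5, -0.2, 0.8, 0.1])      # toy features for position s
z = 1.0                                  # self-play outcome: +1 = win

for _ in range(100):                     # gradient descent on (v - z)^2
    v = w @ s
    w -= 0.1 * 2 * (v - z) * s
print(w @ s)                             # prediction approaches z
```

At search time a single forward pass of the trained network replaces playing the game out to the end, which is how the value network cuts the depth of the tree.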
14. Components – Searching with policy and value networks
- Q : MCTS action value
- u(P) : a bonus that depends on a stored prior probability P and is inversely
proportional to the visit count.
- Selection : select the edge with the maximum Q + u(P) value at steps 1..L-1.
- Expansion : expand the nodes at step L, using P_σ to set prior probabilities.
- Evaluation : evaluate the win rate using v_θ and a random rollout from the leaf.
- Backup : update Q and the visit counts of all traversed edges.
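The selection rule and the leaf evaluation can be illustrated with made-up numbers. The exploration constant c_puct and the mixing weight lam are assumptions for this sketch, as are all the Q, P, and N values:

```python
# Sketch of selection by Q + u(P) and of mixing v_theta with a rollout.
# c_puct, lam, and all the edge statistics are illustrative assumptions.
import math

def u(P, N, N_parent, c_puct=5.0):
    # Bonus grows with the prior P and decays with the visit count N.
    return c_puct * P * math.sqrt(N_parent) / (1 + N)

edges = [            # (Q, prior P, visit count N) for each candidate move
    (0.52, 0.30, 40),
    (0.48, 0.50, 10),
    (0.50, 0.20, 5),
]
N_parent = sum(N for _, _, N in edges)
best = max(range(len(edges)),
           key=lambda i: edges[i][0] + u(edges[i][1], edges[i][2], N_parent))

# Evaluation: mix the value network v_theta with the rollout result z.
lam, v_theta, z = 0.5, 0.62, 1.0        # z = +1 : the rollout was won
V_leaf = (1 - lam) * v_theta + lam * z
print(best, V_leaf)
```

Note how the bonus u(P) steers early visits toward moves the policy network likes, while Q takes over as visit counts grow.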
16. Conclusion
- The single-machine AlphaGo is many dan ranks stronger than any previous Go program,
winning 494 out of 495 games (99.8%) against other Go programs.
- The distributed version of AlphaGo won the match 5 games to 0 against Fan Hui, the
European Go champion.