1. Mastering the game of Go with deep neural
networks and tree search
A Presentation
Aditya R Suryavamshi
Monday 12th March, 2018
Dayananda Sagar College of Engineering
2. Table of contents
1. Introduction
2. Games
3. Playing Games as a Computer
4. AlphaGo
5. Conclusion
4. About the Paper
• The paper is a seminal paper in the field of Artificial Intelligence
(AI), more specifically in the field of General Game Playing (GGP).
• Introduces a novel search algorithm that builds on top of
existing mechanisms to search through very dense trees of
games such as Go.
• The system developed in this paper went on to win against
one of the strongest Go players of our time.
5. Authors of the Paper
• David Silver
• Aja Huang (placed stones on the Go board for AlphaGo in the
2016 match versus Lee Sedol!)
Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den
Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda
Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe,
John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap,
Madeleine Leach, Koray Kavukcuoglu, Thore Graepel & Demis
Hassabis
7. Games
• Games of Perfect Information
• Games with no element of chance, in which all players have
complete information to make the best decision for themselves.
• These games have an associated optimal value function v^*(s), which
determines the outcome of the game from every state s under
perfect play by all players
• Complexity of Perfect Information Games
• A game with breadth b and depth d will have b^d possible
sequences of moves.
• Typical Values are in the order of b = 35, d = 80 for chess, and b =
250, d = 150 for Go.
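These magnitudes are easy to sanity-check; a quick sketch in Python (b and d are the paper's rough estimates):

```python
# Rough game-tree sizes b^d, using the paper's estimates for b and d.
def tree_size(breadth: int, depth: int) -> int:
    return breadth ** depth

chess = tree_size(35, 80)    # on the order of 10^124 move sequences
go = tree_size(250, 150)     # on the order of 10^360 move sequences

# Compare orders of magnitude via digit counts.
print(len(str(chess)), len(str(go)))  # 124 360
```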
8. Go
• Why Go?
• Search Space is extremely large (≈ 250^150)
• Go is generally considered the kind of game that requires intuition
Objective
Surround a larger area of the board with your own stones than your
opponent does.
Rules
• Players take alternate turns in placing the stone on the board
• Black moves first
• Rule of liberty
• Ko Rule
10. How do you play games when you are a Computer?
Take 1
Exhaustive Search
• MinMax
• Depth First MinMax with Alpha-Beta Pruning
Issues
Exhaustive search is infeasible for all but the simplest of games
(e.g. Tic-Tac-Toe).
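A minimal sketch of depth-first MinMax with alpha-beta pruning, run over a tiny hand-made game tree (the tree and its leaf payoffs are invented purely for illustration):

```python
import math

# Toy game tree: internal nodes name their children, leaves are
# integer payoffs for the maximizing player.
TREE = {
    "root": ["a", "b"],
    "a": [3, 5],
    "b": [2, 9],
}

def minimax(node, maximizing, alpha=-math.inf, beta=math.inf):
    """Depth-first minimax with alpha-beta pruning."""
    if isinstance(node, int):          # leaf: return its payoff
        return node
    if maximizing:
        value = -math.inf
        for child in TREE[node]:
            value = max(value, minimax(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:          # prune remaining siblings
                break
        return value
    value = math.inf
    for child in TREE[node]:
        value = min(value, minimax(child, True, alpha, beta))
        beta = min(beta, value)
        if alpha >= beta:
            break
    return value

print(minimax("root", True))  # 3; the leaf 9 is never even examined
```

Note how pruning works: once node "b" yields a value ≤ 2, it cannot beat the 3 already secured at "a", so the rest of "b" is skipped.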
11. How do you play games when you are a Computer?
Take 2
Cut Down on the Breadth and Depth of the Game
• Focus only on the promising moves
• Depth: truncate the search tree at state s and replace the subtree
below it by an approximate value function v(s) ≈ v^*(s) that
predicts the outcome from state s
• Breadth: sample actions from a policy p(a|s)
Monte Carlo Rollouts
Uses a policy for both the players to search until the end, without
branching.
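The rollout idea can be sketched on a trivial stand-in game; the "race to the target" game here is invented purely for illustration (players alternately add 1 or 2, whoever reaches the target wins):

```python
import random

def rollout(state, target, player):
    """One unbranched random playout: both players follow the same
    random policy until the game ends.
    Returns 1.0 if player 0 wins, else 0.0."""
    while True:
        state += random.choice([1, 2])
        if state >= target:
            return 1.0 if player == 0 else 0.0
        player = 1 - player

def estimate_value(state, target, n=20_000):
    # Monte Carlo estimate of player 0's winning chance from `state`:
    # just the average outcome over many rollouts, no tree at all.
    return sum(rollout(state, target, player=0) for _ in range(n)) / n

random.seed(0)
print(round(estimate_value(0, 3), 2))  # close to the exact value 0.25
```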
12. Monte Carlo Tree Search
Used to find out the most promising action that would optimize for
winning.
General Steps
Selection Select a Node according to a criteria (Usually
something like Upper Confidence Bound (UCB))
Expansion If the selected node is not a terminal state, expand it
with one or more child nodes, and select one of the
children.
Simulation Also called Evaluation, Playout or Rollout
Play a random playout from the node until the very
end to determine the value of the rollout.
Backpropagation Update the results of the rollout all the way up to
the root.
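The selection criterion above can be sketched with the standard UCB1 formula; the child statistics below are made-up numbers, not from the paper:

```python
import math

def ucb1(wins, visits, parent_visits, c=1.41):
    """Upper Confidence Bound: exploit the mean value, but add a bonus
    for rarely visited nodes so they still get explored."""
    if visits == 0:
        return math.inf            # always try unvisited children first
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)

# Child statistics (wins, visits) under a root visited 30 times.
children = {"move_a": (10, 20), "move_b": (4, 6), "move_c": (0, 0)}
best = max(children, key=lambda m: ucb1(*children[m], parent_visits=30))
print(best)  # the unvisited child is selected for expansion
```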
14. AlphaGo
• AlphaGo effectively combines the ideas of policy networks and
value networks with MCTS.
• A policy network pσ is trained via supervised learning of expert
human moves.
• A fast rollout policy pπ is also trained for rapidly sampling
actions during rollouts.
• A reinforcement learning policy network pρ is trained, which
adjusts the policy towards winning the game, rather than
maximizing predictive accuracy of human moves.
• A value network vθ is also trained to predict the winner of the
games played from a state.
15. Supervised Learning of Policy Network
• The Supervised Learning(SL) Policy Network pσ(a|s) gives a
probability distribution across all the legal moves a, given a
state s.
• Trained on 30 million randomly sampled state-action pairs
(s, a), with the goal of maximizing the likelihood of the human move.
• The moves were downloaded from the KGS Go Server, which
contains 160,000 games played by 6-9 dan human players.
• 13 Layer Network.
• The SL Policy Net has an accuracy of 57.0% using all input
features, and 55.7% using only raw board positions and move
history as inputs, compared to the state of the art at the time of 44.4%.
• A faster but less accurate rollout policy pπ(a|s) is also trained,
using a linear softmax of small pattern features. (Accuracy of
24.2%)
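The SL policy's output, a distribution over legal moves only, can be sketched as a masked softmax (the moves and logits here are placeholders, not real network outputs):

```python
import math

def masked_softmax(logits, legal):
    # Put probability mass only on legal moves; illegal moves get zero.
    exps = {a: math.exp(z) for a, z in logits.items() if a in legal}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

# Hypothetical raw scores for three board points; C3 is illegal here.
logits = {"A1": 2.0, "B2": 1.0, "C3": 0.5}
probs = masked_softmax(logits, legal={"A1", "B2"})
print(probs)  # a distribution summing to 1 over the two legal moves
```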
16. Reinforcement Learning of Policy Network
• The Reinforcement Learning (RL) Policy Network pρ improves upon
the SL Network, by optimizing for winning the game.
• The RL Policy Net weights ρ are initialized to the same values as
those of the SL Policy Net (ρ = σ)
• Games are played between the current policy pρ and a randomly
selected iteration of the policy network.
• A reward function r(s) is used: +1 for winning, −1 for losing,
and 0 for all non-terminal states.
• The RL Net, when played against the SL Net, won more than 80% of
the games
• Using no search at all, the RL Policy Net also won more than 85% of
games against Pachi, the strongest open-source Go program.
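Optimizing for the game outcome rather than prediction accuracy is a policy-gradient idea: nudge the weights so that moves in won games become more likely. A toy REINFORCE-style sketch on a two-action softmax policy (the setup is invented for illustration, not the paper's network):

```python
import math

def softmax2(theta):
    # Two-action policy governed by a single parameter theta.
    e = math.exp(theta)
    p = e / (e + 1.0)
    return {"win_move": p, "lose_move": 1.0 - p}

def reinforce_step(theta, action, reward, lr=0.5):
    # Gradient of log p(action) w.r.t. theta for this parameterization,
    # scaled by the game outcome (the reward).
    p = softmax2(theta)["win_move"]
    grad = (1.0 - p) if action == "win_move" else -p
    return theta + lr * reward * grad

theta = 0.0
# Reward +1 whenever "win_move" is played: its probability should rise.
for _ in range(50):
    theta = reinforce_step(theta, "win_move", reward=+1.0)
print(softmax2(theta)["win_move"])  # now well above 0.9
```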
17. Reinforcement Learning of Value Network
• Used to estimate the value of a state s when the game is played
using policy p
v^p(s) = E[zt | st = s, at...T ∼ p]
• The estimation is done for the strongest policy, using the RL
Policy Network pρ
• The Value Network has a similar architecture to the policy
network, but outputs a single value rather than a probability
distribution.
• Trained by regression on state-outcome pairs (s, z)
18. Searching with policy and Value Networks
• Each edge (s, a) of the search tree stores an action value Q(s, a),
a visit count N(s, a), and a prior probability P(s, a)
• At each time step of simulation traversal, an action is selected
from state st using
at = argmax_a (Q(st, a) + u(st, a))
which maximizes the action value with the u(st, a) term used to
incentivize exploration
• When the traversal reaches a leaf node sL it may be expanded.
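The paper's exploration bonus is of the PUCT form u(s, a) ∝ P(s, a) · sqrt(Σb N(s, b)) / (1 + N(s, a)); one selection step can be sketched with made-up edge statistics:

```python
import math

def select_action(edges, c_puct=5.0):
    """Pick argmax_a Q(s,a) + u(s,a). The bonus u decays with visit
    count, so high-prior but rarely visited moves still get explored."""
    total_n = sum(e["N"] for e in edges.values())
    def score(e):
        u = c_puct * e["P"] * math.sqrt(total_n) / (1 + e["N"])
        return e["Q"] + u
    return max(edges, key=lambda a: score(edges[a]))

# Illustrative edge statistics for three candidate moves.
edges = {
    "a": {"Q": 0.5, "N": 100, "P": 0.3},
    "b": {"Q": 0.4, "N": 10,  "P": 0.5},
    "c": {"Q": 0.1, "N": 1,   "P": 0.2},
}
print(select_action(edges))  # the barely visited move wins on its bonus
```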
19. Searching with policy and Value Networks
• The leaf position sL is processed just once by the SL Policy
network, and the output probabilities are stored as prior
probabilities P for each legal action a: P(s, a) = pσ(a|s).
• The leaf node valuation is done by mixing the values from the
value network vθ(sL) and the outcome zt of a random rollout
played until the terminal step using the fast rollout policy pπ
with a mixing parameter λ
• At the end of simulation, the action values and visit counts of all
traversed edges are updated.
• Once the search is complete, the algorithm chooses the most
visited move from the root position.
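The leaf evaluation mixes the two signals as V(sL) = (1 − λ)·vθ(sL) + λ·zL; the paper reports λ = 0.5 performing best. A direct sketch (the numeric inputs are illustrative):

```python
def leaf_value(v_theta, z_rollout, lam=0.5):
    # Mix the value-network prediction with the fast-rollout outcome;
    # lam = 0.5 weights the two sources equally.
    return (1.0 - lam) * v_theta + lam * z_rollout

# The value net thinks the position slightly favours us (+0.6),
# but the random rollout from here was lost (-1.0).
print(round(leaf_value(v_theta=0.6, z_rollout=-1.0), 2))  # 0.5*0.6 + 0.5*(-1.0) = -0.2
```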
20. How good is it?
Really Really Really good
AlphaGo, the system built from the components above, currently beats
everything under the sun at Go.
22. Conclusion
• The paper builds on top of existing literature and introduces a
new mechanism for searching through an intractable search space
by focusing on the branches most likely to be valuable.
• It also provides a pathway for future research on other tasks that
require human intelligence in an implicit and seemingly intuitive
way.
23. Things to Do!
• Read the Paper (Surprisingly Easy to Understand)
• Learn to Play Go
• Watch the documentary AlphaGo (2017) (available on Netflix)
• When it's published, skim or read the book Deep Learning and
the Game of Go (Manning Publications)