AlphaGo and AlphaGo Zero


  1. AlphaGo/AlphaGo Zero (Keita Watanabe)
  2. Motivation • Tree-based decision-making frameworks are common across robotics, autonomous vehicles (AV), and other domains. • Monte Carlo Tree Search (MCTS) is one of the most successful tree search algorithms. • A recent MCTS-based decision-making framework for AV (Cai 2019) is significantly influenced by AlphaGo.
  3. Overview of this presentation • Introduction to Go • AlphaGo • SL Policy Network • RL Policy Network • Value Network • MCTS (Monte Carlo Tree Search) • AlphaGo Zero • Improvements over AlphaGo
  4. Rules of Go I (retrieved from Wikipedia, https://en.wikipedia.org/wiki/Go_(game))
 Go is an adversarial game with the objective of surrounding a larger total area of the board with one's stones than the opponent. As the game progresses, the players position stones on the board to map out formations and potential territories. Contests between opposing formations are often extremely complex and may result in the expansion, reduction, or wholesale capture and loss of formation stones. Figure: the four liberties (adjacent empty points) of a single black stone (A), as White reduces those liberties by one (B, C, and D). When Black has only one liberty left (D), that stone is "in atari". White may capture that stone (remove it from the board) with a play on its last liberty (at D-1). A basic principle of Go is that a group of stones must have at least one "liberty" to remain on the board. A "liberty" is an open "point" (intersection) bordering the group. An enclosed liberty (or liberties) is called an eye (眼), and a group of stones with two or more eyes is said to be unconditionally "alive". Such groups cannot be captured, even if surrounded.
  5. Rules of Go II: points where Black can capture White; points where White cannot place a stone. Fig. 1.1 of (Otsuki 2017)
  6. Rules of Go IV: 
 victory judgment (if you want to know more, just ask Ivo or Erik). Fig. 1.2 of (Otsuki 2017) * Score: number of stones + number of eyes. * Komi: because Black moves first, White receives a compensation, typically 7.5 points. * Black territory 45, White territory 36:
 45 > 36 + 7.5 = 43.5, so Black wins.
  7. Why is Go so difficult? Approximate size of the search space: Othello 10^60, Chess 10^120, Shogi 10^220, Go 10^360 (Table 1.1 of Otsuki 2017). The size of the search space is enormous!
  8. AlphaGo
  9. Abstract (Silver 2016): The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses 'value networks' to evaluate board positions and 'policy networks' to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.
  10. Overview of AlphaGo. Three learned components, combined by MCTS: * Rollout Policy: logistic regression, fast, predicts the next move, used for playouts. * Policy Network: CNN, predicts the next move, used for node selection and expansion. * Value Network: CNN, predicts the win rate. Trained from records of strong players and from self-play (RL).
  11. Rollout Policy • Logistic regression over well-known hand-crafted features used in this field (see the table below). • Trained on 30 million positions from the KGS Go Server (https://www.gokgs.com/). • This model is used for rollouts (details explained later). • In total: 109,747 features. Extended Table 4 of (Silver 2016)
  12. Logistic Regression. The inputs x1, x2, …, x109747 are combined linearly and passed through a sigmoid: u = Σ_{k=1..109747} w_k x_k and p̃ = 1 / (1 + e^(−u)). • Logistic regression with well-known features used in this field. • Trained on 30 million positions from the KGS Go Server (https://www.gokgs.com/).
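As an illustration of the formula above, here is a minimal NumPy sketch of scoring a candidate move with a logistic regression and sampling a rollout move from the scores. The feature extraction and the trained weight vector (109,747 weights in the paper) are assumed to be given; the function names are placeholders, not from the paper.

```python
# Minimal sketch of the slide's logistic-regression move scorer (NumPy only).
# The binary feature vectors and the trained weights are assumed to exist.
import numpy as np

def move_score(features: np.ndarray, weights: np.ndarray) -> float:
    """p~ = sigmoid(u) with u = sum_k w_k * x_k, as on the slide."""
    u = float(np.dot(weights, features))     # u = sum_k w_k x_k
    return 1.0 / (1.0 + np.exp(-u))          # p~ = 1 / (1 + e^-u)

def rollout_move(candidate_features, weights):
    """Pick a rollout move by sampling proportionally to the per-move scores."""
    scores = np.array([move_score(x, weights) for x in candidate_features])
    probs = scores / scores.sum()
    return np.random.choice(len(scores), p=probs)
```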
  13. Tree Policy • A logistic regression model with additional features. • Improved performance at the cost of extra computation time. • Used for the Expansion step of Monte Carlo Tree Search. • In total: 141,989 features. Extended Table 4 of (Silver 2016)
  14. Overview of AlphaGo (same diagram as slide 10: Rollout Policy, Policy Network, and Value Network, combined by MCTS).
  15. Policy Network: Overview. Fig. 1 of (Silver 2016) • Convolutional neural network. • The network is first trained by supervised learning and later refined by reinforcement learning. • Trained on the KGS dataset: 29.4 million positions from 160,000 games played by KGS 6 to 9 dan players.
  16. SL policy network. The output is a percentage (probability for each move). Fig. 2.18 of (Otsuki 2017)
  17. SL Policy Network • Convolutional neural network. • Trained on the KGS dataset: 29.4 million positions from 160,000 games played by KGS 6 to 9 dan players. • 48 input channels (features) are prepared; the next slide explains the details. • Architecture (diagram): 19×19 input with 48 channels, a 5×5 convolution followed by a stack of 3×3 convolutions; output: probability of the next move over the 19×19 board. (Board image: https://senseis.xmp.net/?Go)
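A rough PyTorch sketch of this architecture is shown below: 48 input planes on a 19×19 board, a 5×5 first convolution, a stack of 3×3 convolutions, and a softmax over the 361 points. The layer count and filter width here are simplified placeholders rather than the paper's exact hyperparameters.

```python
# Sketch of the SL policy network described on this slide (hyperparameters are placeholders).
import torch
import torch.nn as nn

class SLPolicyNet(nn.Module):
    def __init__(self, in_planes=48, filters=192, hidden_layers=11):
        super().__init__()
        layers = [nn.Conv2d(in_planes, filters, kernel_size=5, padding=2), nn.ReLU()]
        for _ in range(hidden_layers):
            layers += [nn.Conv2d(filters, filters, kernel_size=3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(filters, 1, kernel_size=1)]   # one plane of move logits
        self.body = nn.Sequential(*layers)

    def forward(self, x):                     # x: (batch, 48, 19, 19)
        logits = self.body(x).flatten(1)      # (batch, 361)
        return torch.softmax(logits, dim=1)   # probability of the next move

probs = SLPolicyNet()(torch.zeros(1, 48, 19, 19))   # sanity check: shape (1, 361)
```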
  18. Input features (Silver 2016). Note: most of the hand-made features here are not new; they are commonly used in this field.
  19. RL Policy Network • The policy network is further trained by policy-gradient reinforcement learning. • Training is done by self-play. • The RL policy network won 80% of its games against the original SL policy network.
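Below is a hedged sketch of what such a policy-gradient (REINFORCE-style) refinement step could look like, reusing the policy-network sketch above; the opponent pool and the baseline used in the paper are omitted, and z is the game outcome (+1 win, −1 loss) from the player's point of view.

```python
# Sketch of one policy-gradient update after a finished self-play game.
import torch

def reinforce_update(policy_net, optimizer, states, actions, z):
    """states: (T, 48, 19, 19); actions: (T,) move indices; z: +1 or -1 outcome."""
    probs = policy_net(states)                                   # (T, 361)
    log_p = torch.log(probs.gather(1, actions.unsqueeze(1)) + 1e-8)
    loss = -(z * log_p).mean()       # gradient ascent on z * log p(a|s)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```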
  20. Overview of AlphaGo (same diagram as slide 10: Rollout Policy, Policy Network, and Value Network, combined by MCTS).
  21. Value Network • AlphaGo uses the RL policy network to generate training data for the value network, which predicts the win rate. • Training data: 30 million (position, win/lose) pairs. • Generating the data took 1 week on 50 GPUs; training also took 1 week on 50 GPUs. • The network provides an evaluation function for Go (something previously considered hard to build). Fig. 1 of (Silver 2016)
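A minimal sketch of the value-network regression described here: fit the predicted value v to the game outcome z with a squared error. `ValueNet` is assumed to be a CNN ending in a tanh scalar; it is a placeholder, not the paper's exact architecture.

```python
# One training step for a value network: regress positions onto outcomes z in {-1, +1}.
import torch
import torch.nn as nn

def value_train_step(value_net, optimizer, positions, outcomes):
    """positions: (B, 48, 19, 19); outcomes: (B,) float tensor of +/-1."""
    v = value_net(positions).squeeze(1)          # predicted value in [-1, 1]
    loss = nn.functional.mse_loss(v, outcomes)   # (z - v)^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```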
  22. Comparison of the three models:
 * Rollout Policy: logistic regression; 2 μs to evaluate a state; 0.4 ms per playout (200 moves); about 2,500 playouts per second; 24% move-prediction accuracy.
 * Policy Network: CNN (13 layers); 5 ms to evaluate a state; 1.0 s per playout; about 1 playout per second; 57% move-prediction accuracy.
 * Value Network: CNN (15 layers); 5 ms to evaluate a state; not used for playouts.
  23. Overview of AlphaGo (same diagram as slide 10: Rollout Policy, Policy Network, and Value Network, combined by MCTS).
  24. MCTS Example: Nim • You can take one or more stones from either the left or the right pile. • You win when you take the last stone. • This example is from http://blog.brainpad.co.jp/entry/2018/04/05/163000
  25. Game Tree. Green: the player who moves first wins; Yellow: the player who moves second wins. Retrieved from http://blog.brainpad.co.jp/entry/2018/04/05/163000
  26. Monte Carlo Simulation. Retrieved from http://blog.brainpad.co.jp/entry/2018/04/05/163000. You can find the Q value of each state by simulation. MCTS is a heuristic that lets us investigate promising states efficiently.
  27. Monte Carlo Tree Search. Monte Carlo tree search (MCTS) is a heuristic search algorithm for decision processes. The focus of Monte Carlo tree search is on the analysis of the most promising moves, expanding the search tree based on random sampling of the search space. The application of Monte Carlo tree search in games is based on many playouts. In each playout, the game is played out to the very end by selecting moves at random. The final game result of each playout is then used to weight the nodes in the game tree so that better nodes are more likely to be chosen in future playouts. (Browne 2012)
  28. MCTS Example. Each node stores N (number of visits to the state) and Q (expected reward); all start at N: 0, Q: 0. ① Start from the initial state. ② Selection: select the child node that maximizes Q(s, a) + Cp √(2 log n_s / n_{s,a}). The first term is the estimated reward; the second term is a bias that balances exploration vs. exploitation (Auer 2002). (In this case the choice is random, since every child has been visited zero times.) A code sketch of this selection rule follows below.
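The selection rule above can be written down in a few lines. This is a generic UCB1 sketch (not AlphaGo's code), assuming each child node tracks its visit count n and mean reward q.

```python
# UCB1 selection: pick the child that maximizes Q + Cp * sqrt(2 * ln(n_s) / n_(s,a)).
import math

def select_child(children, cp=1.0):
    """children: list of objects with fields .n (visit count) and .q (mean reward)."""
    n_s = sum(c.n for c in children)
    def ucb(c):
        if c.n == 0:
            return float("inf")     # unvisited children are tried first (random among them)
        return c.q + cp * math.sqrt(2.0 * math.log(n_s) / c.n)
    return max(children, key=ucb)
```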
  29. ③ Rollout: play the game out randomly and find out win/lose. ④ Backup: update the Q value of the visited state (in the diagram, the selected node goes from N: 1, Q: 0 to N: 1, Q: 1 after a win).
  30. ⑤–⑦ Expansion: expand the tree when a node has been visited a certain pre-defined number of times (in this case, 2). A compact sketch of the full loop is given below.
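For concreteness, here is a compact sketch of one iteration of the four steps illustrated on slides 28–30 (selection, expansion after a visit threshold, random rollout, backup). The game interface (`legal_moves`, `play`, `is_terminal`, `result`) is hypothetical, and the alternation of the reward sign between the two players is omitted for brevity.

```python
# One MCTS iteration: select -> (maybe) expand -> random rollout -> backup.
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.n, self.q = [], 0, 0.0

def ucb(parent, child, cp=1.0):
    if child.n == 0:
        return float("inf")
    return child.q + cp * math.sqrt(2.0 * math.log(parent.n) / child.n)

def mcts_iteration(root, expand_threshold=2):
    node = root
    while node.children:                                   # 1. selection (UCB1)
        node = max(node.children, key=lambda c: ucb(node, c))
    if node.n >= expand_threshold and not node.state.is_terminal():
        node.children = [Node(node.state.play(m), node)    # 2. expansion
                         for m in node.state.legal_moves()]
        node = random.choice(node.children)
    state = node.state
    while not state.is_terminal():                         # 3. rollout: random play to the end
        state = state.play(random.choice(state.legal_moves()))
    reward = state.result()                                # +1 win / -1 loss (single perspective)
    while node is not None:                                # 4. backup along the visited path
        node.n += 1
        node.q += (reward - node.q) / node.n               # running mean of rollout results
        node = node.parent
```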
  31. MCTS in AlphaGo • The bias term is the original bias combined with the output P(s, a) of the SL policy network. • The win rate is evaluated by combining playouts (rollout policy) with the output of the value network. • Massively parallel computation using both GPUs (176) and CPUs (1,202). The search uses Q(s, a) = (1 − λ) Wv(s, a)/Nv(s, a) + λ Wr(s, a)/Nr(s, a) and u(s, a) = c_puct P(s, a) √(Σ_b Nr(s, b)) / (1 + Nr(s, a)); a small sketch of these two quantities follows below.
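Below is a minimal Python sketch of the two per-edge quantities on this slide, assuming the search tracks the value-network statistics (Wv, Nv), the rollout statistics (Wr, Nr), and the prior P(s, a) for each edge. The default values of lam and c_puct here are placeholders, not the paper's tuned settings.

```python
# The mixed action value and the exploration bias used during AlphaGo's tree search.
import math

def q_value(w_v, n_v, w_r, n_r, lam=0.5):
    """Q(s,a) = (1 - lambda) * Wv/Nv + lambda * Wr/Nr (terms are zero if unvisited)."""
    value_term = w_v / n_v if n_v else 0.0
    rollout_term = w_r / n_r if n_r else 0.0
    return (1 - lam) * value_term + lam * rollout_term

def u_bias(prior, n_parent_total, n_edge, c_puct=1.0):
    """u(s,a) = c_puct * P(s,a) * sqrt(sum_b Nr(s,b)) / (1 + Nr(s,a))."""
    return c_puct * prior * math.sqrt(n_parent_total) / (1 + n_edge)
```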
  32. Performance. Figure 4 of (Silver 2016)
  33. AlphaGo Zero
  34. Abstract (Silver 2017): A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo's own move selections and also the winner of AlphaGo's games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100‒0 against the previously published, champion-defeating AlphaGo. Tabula rasa is a Latin phrase often translated as "clean slate".
  35. Point 1: Dual Network • Convolutional neural network with 40+ layers. • Each layer: 3×3 convolution + batch normalization + ReLU. • Layers 2–39 are residual (ResNet) blocks. • Trained by self-play (details described later). • 17 input channels (features) are prepared; the next slide shows the details. • The learning method for this network is discussed later (for now, let's assume we have trained it well). • Architecture (diagram): a 19×19 input feeds a shared convolutional body with two output heads, Output 1: p (prediction of the next move) and Output 2: v (win rate). (Board image: https://senseis.xmp.net/?Go)
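A loose PyTorch sketch of such a dual network: a convolutional stem, a body of residual blocks (3×3 convolution + batch normalization + ReLU with a skip connection), and two heads, p over the 19×19 + 1 pass moves and v in [−1, 1]. The block count and channel width here are scaled-down placeholders, not the real network's size.

```python
# Scaled-down sketch of a dual policy/value network with residual blocks.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.conv1, self.bn1 = nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c)
        self.conv2, self.bn2 = nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c)
    def forward(self, x):
        y = torch.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return torch.relu(x + y)                      # skip connection

class DualNet(nn.Module):
    def __init__(self, in_planes=17, c=64, blocks=4):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(in_planes, c, 3, padding=1),
                                  nn.BatchNorm2d(c), nn.ReLU())
        self.body = nn.Sequential(*[ResBlock(c) for _ in range(blocks)])
        self.policy_head = nn.Linear(c * 19 * 19, 19 * 19 + 1)   # +1 for pass
        self.value_head = nn.Sequential(nn.Linear(c * 19 * 19, 64),
                                        nn.ReLU(), nn.Linear(64, 1), nn.Tanh())
    def forward(self, x):                             # x: (B, 17, 19, 19)
        h = self.body(self.stem(x)).flatten(1)
        return torch.softmax(self.policy_head(h), dim=1), self.value_head(h)

p, v = DualNet()(torch.zeros(1, 17, 19, 19))          # p: (1, 362), v: (1, 1)
```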
  36. AlphaGo Zero depends less on hand-crafted features: 48 features in AlphaGo (Silver 2016) vs. 17 features in AlphaGo Zero (Silver 2017). The 17 feature planes: position of black stones (current), 1 plane; position of white stones (current), 1 plane; position of black stones k (1–7) steps before, 7 planes; position of white stones k (1–7) steps before, 7 planes; turn (colour to play), 1 plane.
  37. Point 2: Improvement of MCTS • The MCTS algorithm selects states using the value Q(s, a) + u(s, a), where Q(s, a) = W(s, a)/N(s, a) is the win-rate estimate and u(s, a) = c_puct p(s, a) √(Σ_b N(s, b)) / (1 + N(s, a)) is the bias term weighted by the predicted probability p(s, a) of move a. • No playouts; the search relies on the value output alone.
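A small sketch of this selection rule, assuming each edge stores its visit count N, total value W, and prior p(s, a) from the dual network; c_puct is a placeholder constant.

```python
# Select the action maximizing Q(s,a) + u(s,a) with no rollouts involved.
import math

def select_action(edges, c_puct=1.0):
    """edges: dict action -> {'N': visit count, 'W': total value, 'P': prior prob}."""
    total_n = sum(e["N"] for e in edges.values())
    def score(e):
        q = e["W"] / e["N"] if e["N"] else 0.0                 # Q = W / N
        u = c_puct * e["P"] * math.sqrt(total_n) / (1 + e["N"])
        return q + u
    return max(edges, key=lambda a: score(edges[a]))
```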
  38. MCTS 1: Selection. Select the node which has the maximum Q(s, a) + u(s, a). (Diagram: candidate nodes with win rates 25%, 48%, 35%.)
  39. MCTS 2: Expansion. Expand the selected node. (Diagram: the 48% node is expanded into children with win rates 30% and 42%.)
  40. MCTS 2: Evaluation. Evaluate p and v at the new leaf using the dual network. * p is used for the calculation of Q + u. * The win rate of the state is updated by v. (Diagram: the leaf is evaluated as v = 70%.)
  41. MCTS 3: Backup. Update the win rate of each state along the path and propagate up to the root node. (Diagram: v = 70% propagates upward; 60% -> 65%, 50% -> 55%.)
  42. Point 3: Improvements on RL. (p, v) = f_θ(s) and l = (z − v)² − πᵀ log p + c‖θ‖² • The dual network (parameters θ) accumulates data by self-play (step 1, repeated 25 thousand times). • Based on that data, the network parameters are updated (step 2), giving new parameters θ′. • The two network instances compete; the network parameters are replaced if the new parameter set θ′ wins. • Repeat steps 1 and 2.
  43. Step 1: Data Accumulation • Play a self-play game and store the outcome z. • Store all (s, π, z) tuples from the game. • The policy π is calculated as π_a = N(s, a)^(1/γ) / Σ_b N(s, b)^(1/γ). • Repeat the above process 25,000 times.
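The search policy π can be computed from the root visit counts in one line; here is a NumPy sketch with the temperature exponent 1/γ from the slide.

```python
# pi_a proportional to N(s, a)^(1/gamma), normalized over the legal moves at the root.
import numpy as np

def search_policy(visit_counts: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """visit_counts: N(s, a) for each legal move a at the root."""
    scaled = visit_counts ** (1.0 / gamma)
    return scaled / scaled.sum()

pi = search_policy(np.array([10.0, 30.0, 60.0]))   # -> [0.1, 0.3, 0.6]
```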
  44. Step 2: Parameter Update • Calculate the loss function using the (s, π, z) tuples collected in the previous step: (p, v) = f_θ(s) and l = (z − v)² − πᵀ log p + c‖θ‖². • Update the parameters by gradient descent: θ′ ← θ − α · Δθ, where Δθ is the gradient of the loss.
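A hedged sketch of this training step in PyTorch, reusing the dual-network sketch from slide 35. The c‖θ‖² term can be handled by the optimizer's weight_decay setting, so only the (z − v)² and −πᵀ log p terms appear explicitly.

```python
# One gradient-descent step on l = (z - v)^2 - pi^T log p (+ weight decay via the optimizer).
import torch

def train_step(net, optimizer, states, pis, zs):
    """states: (B, 17, 19, 19); pis: (B, 362) search policies; zs: (B,) outcomes in {-1, +1}."""
    p, v = net(states)
    value_loss = ((zs - v.squeeze(1)) ** 2).mean()                 # (z - v)^2
    policy_loss = -(pis * torch.log(p + 1e-8)).sum(dim=1).mean()   # -pi^T log p
    loss = value_loss + policy_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()               # theta' <- theta - alpha * gradient
    return loss.item()
```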
  45. Empirical evaluation of AlphaGo Zero. Fig. 3 of (Silver 2017)
  46. Performance of AlphaGo Zero. Fig. 6 of (Silver 2017)
  47. https://research.fb.com/facebook-open-sources-elf-opengo/
  48. References
 1. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., … Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489. https://doi.org/10.1038/nature16961 [AlphaGo]
 2. Otsuki, T., & Miyake. (2017). Saikyo igo AI arufago kaitai shinsho: Shinso gakushu, montekaruro ki tansaku, kyoka gakushu kara mita sono shikumi. Shoeisha.
 3. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., … Hassabis, D. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676), 354–359. https://doi.org/10.1038/nature24270 [AlphaGo Zero]
 4. Browne, C., Powley, E., Whitehouse, D., Lucas, S., Cowling, P. I., … Colton, S. (2012). A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1). https://doi.org/10.1109/TCIAIG.2012.2186810
 5. Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3), 235–256.
 6. Cai, P., Luo, Y., Saxena, A., Hsu, D., & Lee, W. S. (2019). LeTS-Drive: Driving in a crowd by learning from tree search. https://arxiv.org/pdf/1905.12197.pdf
