Mastering the game of Go
A deep neural networks and tree search approach
Alessandro Cudazzo
Department of Computer Science
University of Pisa
alessandro@cudazzo.com
ISPR Midterm IV, 2020
Back in time: 2015
We are in 2015, and you may be wondering why mastering the ancient Chinese game of
Go is an important challenge for researchers in artificial intelligence.
Let's start from the beginning of the story:
A 19x19 board game with approximately b^d possible sequences of moves (b ≈ 250, d ≈ 150).
It is a game of perfect information and can be formulated as a zero-sum game.
Figure: A Go board state.
Exhaustive search is infeasible: the optimal value function v∗(s) is intractable to
compute. It determines the outcome of the game from any state s, under perfect play
by both players.
Depth reduction with an approximate value function: v(s) ≈ v∗(s)
Breadth reduction by sampling actions from a policy function: P(a|s)
At that time, the strongest Go program was based on MCTS: Pachi [1], ranked at 2
amateur dan on KGS. Experts agreed that the major stumbling block to creating
stronger-than-amateur Go programs was the difficulty of building a reliable position
evaluation function [2].
Supervised learning of policy networks
So, DeepMind had a clear view: find better policy and value functions with
deep learning and efficiently combine both with the MCTS heuristic search algorithm.
Supervised learning approach ⇒ SL policy network pσ(a|s):
It takes a 19x19x48 stack of input feature planes to represent the board.
A 13-layer deep convolutional NN with a softmax output.
A DB of 30M state–action pairs (s, a); stochastic gradient ascent to
maximize the likelihood of the human move a selected in s (training-step sketch below):
∆σ ∝ ∂ log pσ(a|s) / ∂σ
One issue: it takes 3 ms per evaluation, too slow for the rollouts!
So a faster but less accurate rollout policy pπ(a|s), using a linear softmax of small
pattern features, has been trained: 24.2% accuracy and only 2 µs per move selection.
The network pσ surpassed the state of the art in accuracy: 57% vs. 44.4%. Small
improvements in accuracy led to large improvements in playing strength.
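To make the SL step concrete, here is a minimal PyTorch sketch, not DeepMind's code: the layer widths, learning rate, and the PolicyNet/sl_step names are illustrative assumptions. Maximizing log pσ(a|s) by gradient ascent is the same as minimizing cross-entropy on human (s, a) pairs:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    """Illustrative stand-in for the 13-layer SL policy network p_sigma."""
    def __init__(self, planes=48, filters=192, layers=13):
        super().__init__()
        convs = [nn.Conv2d(planes, filters, 5, padding=2)]
        convs += [nn.Conv2d(filters, filters, 3, padding=1)
                  for _ in range(layers - 2)]
        self.convs = nn.ModuleList(convs)
        self.head = nn.Conv2d(filters, 1, 1)  # 1x1 conv: one logit per point

    def forward(self, s):                     # s: (B, 48, 19, 19)
        for conv in self.convs:
            s = F.relu(conv(s))
        return self.head(s).flatten(1)        # (B, 361) move logits

policy = PolicyNet()
opt = torch.optim.SGD(policy.parameters(), lr=3e-3)

def sl_step(states, moves):
    """One gradient-ascent step on log p_sigma(a|s) via cross-entropy descent."""
    opt.zero_grad()
    loss = F.cross_entropy(policy(states), moves)  # = -log p_sigma(a|s)
    loss.backward()
    opt.step()
    return loss.item()

# Usage on a dummy mini-batch of human (state, move) pairs:
# sl_step(torch.randn(16, 48, 19, 19), torch.randint(0, 361, (16,)))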
Reinforcement learning policy and value networks
By considering the best player, which uses a certain policy p, we can approximate its
value function v^p(s).
Initialize an RL policy network pρ with ρ = σ and improve it in order to find the best player:
1 Play the current pρ against a randomly selected previous iteration of the policy
network, picked from a pool of opponents to prevent overfitting to a single one.
2 Improve it by policy gradient reinforcement learning (a REINFORCE-style sketch follows below):
∆ρ ∝ (∂ log pρ(at|st) / ∂ρ) zt ,   with zt = ±r(sT) and r(s) = 0 at non-terminal
time steps t < T, ±1 for winning/losing.
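A hedged sketch of this update; the hypothetical rl_step reuses the policy and opt objects from the SL sketch above and scales each move's log-likelihood gradient by the game outcome:

def rl_step(states, moves, z):
    """One REINFORCE step. states: (T, 48, 19, 19) positions of one game;
    moves: (T,) actions played; z: (T,) = ±1 outcome from the mover's view."""
    opt.zero_grad()
    logp = F.log_softmax(policy(states), dim=1)     # (T, 361)
    chosen = logp[torch.arange(len(moves)), moves]  # log p_rho(a_t|s_t)
    loss = -(z * chosen).mean()  # ascent on z_t * d log p_rho / d rho
    loss.backward()
    opt.step()
    return loss.item()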
Since v∗(s) is infeasible, we train a value network vθ(s) ≈ v^pρ(s) ≈ v∗(s):
A regression NN on state–outcome pairs with an architecture similar to the policy
network but with a single output; trained with SGD and MSE as the loss (sketch below).
Use a self-play dataset of 30M distinct positions, each sampled from a separate
game. Each game was played by pρ against itself until the end.
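A matching sketch for the value-network regression, again illustrative: a single conv trunk stands in for the full architecture, and ValueNet/value_step are assumed names. The regression target is the self-play outcome z:

class ValueNet(nn.Module):
    """Illustrative value network v_theta with one scalar output."""
    def __init__(self, planes=48, filters=192):
        super().__init__()
        self.conv = nn.Conv2d(planes, filters, 3, padding=1)
        self.fc = nn.Linear(filters * 19 * 19, 1)

    def forward(self, s):
        h = F.relu(self.conv(s)).flatten(1)
        return torch.tanh(self.fc(h)).squeeze(1)  # v_theta(s) in [-1, 1]

value = ValueNet()
vopt = torch.optim.SGD(value.parameters(), lr=3e-3)

def value_step(states, z):
    """One SGD step minimizing MSE between v_theta(s) and the outcome z."""
    vopt.zero_grad()
    loss = F.mse_loss(value(states), z)
    loss.backward()
    vopt.step()
    return loss.item()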
MCTS: searching with policy and value networks
An MCTS algorithm that selects actions by lookahead search combined with the
policy and value networks.
Each edge (s, a) stores an action value Q(s, a), a visit count N(s, a),
and a prior probability P(s, a).
Iterate for n simulations:
a Selection: traverse the tree from the root by selecting the edge/action a with max(Q + u)
until a leaf node sL is reached at time step L. Exploration/exploitation is controlled by u:
at = arg maxₐ (Q(st, a) + u(st, a));   u(s, a) ∝ P(s, a) / (1 + N(s, a));   P(s, a) = pσ(a|s)
b Expansion: sL may be expanded/processed with pσ(a|s) ⇒ store P(sL, a) for each legal a.
c Evaluation: compute vθ(sL) and the rollout outcome zL with the fast rollout policy pπ by
sampling actions to the end of the game. Then compute the leaf evaluation: V(sL) = (1 − λ)vθ(sL) + λzL
d Backup: update the action values and visit counts of all traversed edges:
N(s, a) = Σᵢ₌₁ⁿ 1(s, a, i);   Q(s, a) = (1 / N(s, a)) Σᵢ₌₁ⁿ 1(s, a, i) V(s_L^i),
where s_L^i is the leaf of the i-th simulation.
Once the search is complete, choose the most visited move from the root position (see the sketch below).
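A compact Python sketch of the per-edge statistics and the formulas above. Assumptions: the priors dict comes from pσ, the exploration constant C_PUCT and the Node API are illustrative, and the network/rollout evaluations are passed in as plain numbers:

import math
from collections import defaultdict

C_PUCT, LAM = 5.0, 0.5     # illustrative exploration constant; λ as above

class Node:
    """Per-state edge statistics for the tree search."""
    def __init__(self, priors):                # priors: {a: p_sigma(a|s)}
        self.P = priors
        self.N = defaultdict(int)              # visit counts N(s, a)
        self.W = defaultdict(float)            # accumulated leaf values

    def select(self):
        """a_t = argmax_a Q(s,a) + u(s,a), with u ∝ P(s,a) / (1 + N(s,a))."""
        total = 1 + sum(self.N.values())
        def score(a):
            q = self.W[a] / self.N[a] if self.N[a] else 0.0
            u = C_PUCT * self.P[a] * math.sqrt(total) / (1 + self.N[a])
            return q + u
        return max(self.P, key=score)

    def backup(self, a, v):
        """Update a traversed edge with the leaf evaluation V(s_L)."""
        self.N[a] += 1
        self.W[a] += v                         # Q(s, a) = W / N implicitly

def leaf_value(v_theta, z_rollout, lam=LAM):
    """Mixed evaluation V(s_L) = (1 − λ) v_theta(s_L) + λ z_L."""
    return (1 - lam) * v_theta + lam * z_rollout

def best_move(root):
    """After n simulations, play the most-visited move from the root."""
    return max(root.N, key=root.N.get)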
Conclusion
MCTS is asynchronous: it uses a multi-threaded search that executes simulations on
CPUs and computes policy and value networks in parallel on GPUs.
The RL policy network won 85% of games against Pachi by simply sampling the next
move from pρ(·|st); the SL policy network won only 11%.
Initially, a dataset of complete games led the value network to overfit! Successive
states are strongly correlated and the regression target is shared across the entire game.
In MCTS, the SL policy network performed better than the stronger RL policy network
for computing P(s, a), presumably because humans select a diverse beam of
promising moves, whereas RL optimizes for the single best move.
Setting λ = 0 shows that the value network provides a viable
alternative to Monte Carlo evaluation in Go! λ = 0.5 performed best.
Further details can be found in the original paper [3].
References:
[1] P. Baudiš and J.-L. Gailly, ‘Pachi: State of the art open source Go program,’ vol. 7168, Jan. 2012.
[2] M. Müller, ‘Computer Go,’ Artificial Intelligence, vol. 134, no. 1, pp. 145–179, 2002, ISSN: 0004-3702.
[3] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou,
V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap,
M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, ‘Mastering the game of Go with deep neural networks and tree
search,’ Nature, vol. 529, pp. 484–489, 2016.