AlphaZero and beyond: Polygames
1. Zero learning, old and new.
Tristan Cazenave, Univ. Dauphine
Yen-Chi Chen, National Taiwan Normal University
Guan-Wei Chen, National Dong Hwa University
Shi-Yu Chen, National Dong Hwa University
Xian-Dong Chiu, National Dong Hwa University
Julien Dehos, Univ. Littoral Cote d’Opale
Maria Elsa, National Dong Hwa University
Qucheng Gong, Facebook AI Research
Hengyuan Hu, Facebook AI Research
Vasil Khalidov, Facebook AI Research
Chen-Ling Li, National Dong Hwa University
Hsin-I Lin, National Dong Hwa University
Yu-Jin Lin, National Dong Hwa University
Olivier Teytaud
Started working in AI last century.
Currently working on games, AlphaZero-style learning, and derivative-free
optimization.
Has worked at Artelys, INRIA, Google, and Facebook.
Xavier Martinet, Facebook AI Research
Vegard Mella, Facebook AI Research
Jeremy Rapin, Facebook AI Research
Baptiste Roziere, Facebook AI Research
Gabriel Synnaeve, Facebook AI Research
Fabien Teytaud, Univ. Littoral Cote d’Opale
Olivier Teytaud, Facebook AI Research
Shi-Cheng Ye, National Dong Hwa University
Yi-Jun Ye, National Dong Hwa University
Shi-Jim Yen, National Dong Hwa University
Sergey Zagoruyko, Facebook AI Research
2. 1. MCTS = Monte Carlo Tree Search
2. AlphaZero: adding conv nets
3. AlphaZero – great performances
4. AlphaZero – limitations
5. Open Sourcing
6. Research directions
3. ALPHAZERO INGREDIENT #1: MCTS
MCTS (MONTE CARLO TREE SEARCH) WAS ORIGINALLY PUBLISHED IN [COULOM06].
IT WAS ENOUGH TO WIN GAMES AGAINST PROS IN 9X9 GO, AND IN 19X19 WITH A HANDICAP OF ABOUT 4 STONES.
QUITE STRONG FOR FULLY OBSERVABLE “GENERAL GAME PLAYING” (WHERE THE PROGRAM MUST
FIRST READ AND UNDERSTAND THE RULES).
UCT (UPPER CONFIDENCE TREES) IS A VARIANT OF MCTS
(USING UCB).
(image: usgo.org)
4. Coulom (06)
Chaslot, Saito & Bouzy (06)
Kocsis & Szepesvári (06)
UCT (UPPER CONFIDENCE TREES) STARTS
WITH SIMPLE MONTE CARLO
15. UCT IN ONE SLIDE
UCT for choosing a move in a board B
While ( I have time left )
{
Do a simulation
{
Start at board B
At each time step, choose action by UCB (or random if no statistics!)
}
Update statistics with this simulation
}
Return the most simulated action.
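A minimal C++ sketch of the action-selection rule used inside this loop, with invented names (per-action wins/visits statistics); this illustrates UCB1 as used by UCT, not Polygames code:

#include <cmath>
#include <limits>
#include <vector>

// Hypothetical per-action statistics stored in a tree node.
struct ActionStats {
  double wins = 0.0;    // sum of rewards of simulations that went through this action
  double visits = 0.0;  // number of such simulations
};

// UCB1 score: empirical mean + exploration bonus; unexplored actions come first.
double ucb(const ActionStats& a, double parentVisits, double c = std::sqrt(2.0)) {
  if (a.visits == 0.0) return std::numeric_limits<double>::infinity();
  return a.wins / a.visits + c * std::sqrt(std::log(parentVisits) / a.visits);
}

// "Choose action by UCB": pick the action with the highest score.
int chooseByUcb(const std::vector<ActionStats>& stats, double parentVisits) {
  int best = 0;
  for (int i = 1; i < (int)stats.size(); ++i)
    if (ucb(stats[i], parentVisits) > ucb(stats[best], parentVisits)) best = i;
  return best;
}

Each simulation descends the tree with chooseByUcb (random moves where no statistics exist yet), the final reward is backed up into wins/visits along the path, and after the time budget the most simulated root action is returned.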
16. 1. MCTS
2. AlphaZero: adding conv nets
3. AlphaZero – great performances
4. AlphaZero – limitations
5. Open Sourcing
6. Research directions
17. ALPHAZERO INGREDIENT #2: DEEP NETWORK
OVERVIEW IN “DEEP LEARNING”, LECUN, BENGIO, HINTON 2015
BOTH A CRITIC NETWORK (EVALUATING THE PROBABILITY OF WINNING IN A GIVEN POSITION) AND
A POLICY NETWORK (PROVIDING A PROBABILITY DISTRIBUTION ON ACTIONS).
(images: clarifai.com/technology and “the data science blog”)
(figure: a convolutional network; ← ← ← invariance by translation; high-level features → → →)
18. PUCT: UCT WITH PRIOR
SCORE(state, action) = Q(state, action) + NN(state, action) · sqrt(N(state)) / N(state, action)
Example: 5 wins out of 7 simulations of this action, 10 simulations of the parent state:
SCORE = 5/7 + NN(state, action) · sqrt(10) / 7
No log term (unlike UCB); exploration is weighted by the NN prior.
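A small C++ sketch of this score, with invented names; the “+ 1” in the denominator, as in AlphaZero-style implementations, keeps the score finite for unvisited actions (the numeric illustration above omits it), and cPuct is an exploration constant:

#include <cmath>

// PUCT: empirical mean reward plus prior-weighted exploration; no log term, unlike UCB1.
double puctScore(double sumRewards, double actionVisits, double parentVisits,
                 double prior, double cPuct = 1.0) {
  double q = actionVisits > 0.0 ? sumRewards / actionVisits : 0.0;
  return q + cPuct * prior * std::sqrt(parentVisits) / (1.0 + actionVisits);
}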
19. ALPHAZERO IN A NUTSHELL: A FIXED POINT METHOD!
MCTS(NN): A MCTS WHICH USES A NEURAL NET NN FOR
• EVALUATING LEAVES (NO RANDOM ROLLOUT)
• SUGGESTING POLICIES (BIASING THE MCTS)
NN ← MCTS:
• EACH CLIENT: PLAYS GAMES WITH MCTS(NN)
• SERVER:
• RECEIVES BATCHES “(STATES, ACTIONS, REWARD AT END OF GAMES)”
• TWO LOSS FUNCTIONS (+ WEIGHT DECAY):
• LEARN “STATE → REWARD” (CRITIC)
• LEARN “STATE → PROBABILITY DISTRIBUTION ON ACTIONS” (ACTOR), I.E. MIMIC THE MCTS
ALPHAZERO:
• RANDOMLY INITIALIZE NN
• ITERATE: THE NN ACTOR IMITATES MCTS(NN); THE NN CRITIC IMITATES GAME RESULTS
(Loss diagram: value prediction; policy p imitates π from MCTS; weight decay.)
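Written out, the slide's three terms correspond to the standard AlphaZero objective (Silver et al.), with v the predicted value, z the game outcome, p the network policy, π the MCTS visit distribution, and c the weight-decay coefficient:

loss = (z − v)² − π · log p + c ‖θ‖²

The value head regresses the final reward, the policy head is trained by cross-entropy to imitate the MCTS policy, and the L2 term regularizes the weights.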
21. ALPHAZERO TRAINING IN ONE SLIDE
THE CLIENTS PERFORM SIMULATIONS USING THE NEURAL NETWORK’S OUTPUTS:
- NN.PI
- NN.V
EACH CLIENT, IN MORE DETAIL:
- RUNS MCTS, MODIFIED AS FOLLOWS:
- THE UCB SCORE IS COMBINED WITH NN.PI (THE “PUCT” FORMULA)
- RANDOM ROLLOUTS (CORRESPONDING TO STATES WITH ZERO SIMULATIONS) ARE REPLACED BY:
- A SINGLE STEP WITH NN.PI FOR CHOOSING ONE ACTION
- THE REWARD OF THE SIMULATION IS REPLACED BY NN.V, WHICH PREDICTS THE REWARD
- SENDS THE MASTER MANY TUPLES (STATE, PI CHOSEN BY MCTS, REWARD OF THE GAME)
THE MASTER:
- PROVIDES NN.PI (PROBABILITY DISTRIBUTION ON ACTIONS) AND NN.V (ESTIMATED
REWARD) IN STATES, AS REQUESTED BY CLIENTS
- LEARNS A BETTER NN FROM THE CLIENTS’ SIMULATIONS
(Diagram: CLIENT 1 simulates games using MCTS. Inference: the client sends a state s to the MASTER (neural net) and receives NN.Pi(s) and NN.V(s). Training of π and V: the client sends back (state s, MCTS.Pi(s), real reward R). Pi(s) = probabilities of actions in state s; V(s) = estimated winning rate in s. Loss: value prediction; policy p imitates π from MCTS; weight decay.)
Actually there is a replay buffer.
Simulated results are sent to a data structure.
The training picks up data and performs
stochastic gradient descent.
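A minimal C++ sketch of such a data structure (names and the ring-buffer policy are invented for illustration): clients append (state, MCTS policy, reward) records, and the trainer samples batches from it for stochastic gradient descent.

#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// One self-play record: features of state s, MCTS policy pi(s), final reward of the game.
struct Record {
  std::vector<float> stateFeatures;
  std::vector<float> mctsPolicy;
  float reward;
};

// Fixed-capacity ring buffer: new records overwrite the oldest ones.
class ReplayBuffer {
 public:
  explicit ReplayBuffer(std::size_t capacity) : capacity_(capacity) {}

  void add(Record r) {
    if (data_.size() < capacity_) data_.push_back(std::move(r));
    else data_[next_] = std::move(r);
    next_ = (next_ + 1) % capacity_;
  }

  // Uniformly sample a batch for one SGD step (assumes the buffer is non-empty).
  std::vector<Record> sample(std::size_t batchSize, std::mt19937& rng) const {
    std::uniform_int_distribution<std::size_t> pick(0, data_.size() - 1);
    std::vector<Record> batch;
    for (std::size_t i = 0; i < batchSize; ++i) batch.push_back(data_[pick(rng)]);
    return batch;
  }

 private:
  std::vector<Record> data_;
  std::size_t capacity_;
  std::size_t next_ = 0;
};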
22. 1. MCTS
2. AlphaZero: adding conv nets
3. AlphaZero – great performances
4. AlphaZero – limitations
5. Open Sourcing
6. Research directions
24. ALPHAZERO: GREAT RESULTS
[SILVER ET AL, NATURE PAPERS + ARXIV]
• NO GAME SPECIFIC KNOWLEDGE
• USING MASSIVE COMPUTATIONAL POWER
• EXTENSIVE REPRESENTATION FOR ACTIONS ← ONE CHANNEL FOR EACH POSSIBLE RELATIVE
MOVE OF EACH PIECE! HOW MANY CHANNELS FOR GO? FOR CHESS?
25. 1. MCTS
2. AlphaZero: adding conv nets
3. AlphaZero – great performances
4. AlphaZero – limitations
5. Open Sourcing
6. Research directions
26. ALPHAZERO: OPEN PROBLEMS
1. BASED ON MCTS → APPLYING IT TO PARTIAL OBSERVABILITY IS NOT TRIVIAL
2. BASED ON MCTS → SIMULATORS/BACKTRACKING ARE NECESSARY (WHITE BOX)
3. BASED ON NN → HOW TO DEAL WITH HUGE / COMPLEX ACTION SPACES?
4. BASED ON NN → HUGE NUMBER OF SIMULATED GAMES → CAN WE REDUCE THE DATA NEEDED?
You need a very special
MCTS for partially-
observable games
- Should we learn just V or just 𝜋 or both ?
- Complex action spaces ?
- Partially observable games ? (defogization)
28. ALPHAZERO: OPEN PROBLEMS
1. BASED ON MCTS → APPLYING IT TO PARTIAL OBSERVABILITY IS NOT TRIVIAL
2. BASED ON MCTS → SIMULATORS/BACKTRACKING ARE NECESSARY (WHITE BOX)
3. BASED ON NN → HOW TO DEAL WITH HUGE / COMPLEX ACTION SPACES?
4. BASED ON NN → HUGE NUMBER OF SIMULATED GAMES → CAN WE REDUCE THE DATA NEEDED?
WHY DOES ZERO FAIL IN PARTIALLY OBSERVABLE GAMES?
→ BECAUSE MCTS NEEDS SIMULATIONS “STATE, ACTION → NEW STATE”
29. 1. MCTS
2. AlphaZero: adding conv nets
3. AlphaZero – great performances
4. AlphaZero – limitations
5. Open Sourcing
6. Research directions
30. POLYGAMES, OPEN SOURCED RECENTLY!
1. MODIFY ONE AND ONLY ONE FILE, SO THAT THE GAME IS YOURS:
• class State {
• bool PlayAction(Action action) ← what happens if we play this move?
• vector<Action> getLegalActions() ← what are the legal moves?
• float getReward(int player) ← which reward did that player win?
• bool terminated() ← is the game over?
• int getCurrentplayer() ← who should play now?
• vector<float> getFeature() ← input of the neural net (vector, reshaped as a 3D tensor as below)
• vector<int> getFeatureSize() ← shape of the input of the neural net (Polygames will reshape accordingly)
• … a few technical things…
• } and a class of actions (each action is mapped to an output neuron)
2. GET LEARNING CURVES ON YOUR GAME
• with different architectures (cool: structured output)
• In a couple of days, single machine
• In progress: a class of partially observable games
+ interface with LUDII ?
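To make the State interface above concrete, here is a toy, self-contained C++ sketch: a trivial one-row game where the current player picks any empty cell and wins by taking the middle one. The method names follow the slide; the Action type and exact signatures are invented for illustration and will differ from the real Polygames headers.

#include <vector>

using Action = int;  // toy action: index of the cell to play (the real class is richer)

class State {
 public:
  // what happens if we play this move?
  bool PlayAction(Action a) {
    if (board_[a] != 0 || done_) return false;
    board_[a] = current_;
    if (a == 1) { done_ = true; winner_ = current_; }           // middle cell wins
    else if (countEmpty() == 0) { done_ = true; winner_ = 0; }  // draw
    current_ = 3 - current_;  // switch player 1 <-> 2
    return true;
  }
  // what are the legal moves?
  std::vector<Action> getLegalActions() const {
    std::vector<Action> legal;
    for (int i = 0; i < 3; ++i) if (board_[i] == 0) legal.push_back(i);
    return legal;
  }
  // which reward did that player win? (+1 win, -1 loss, 0 otherwise)
  float getReward(int player) const {
    if (!done_ || winner_ == 0) return 0.f;
    return winner_ == player ? 1.f : -1.f;
  }
  bool terminated() const { return done_; }          // is the game over?
  int getCurrentplayer() const { return current_; }  // who should play now?
  // input of the neural net: one channel per player, on a 1x3 board
  std::vector<float> getFeature() const {
    std::vector<float> f(2 * 3, 0.f);
    for (int i = 0; i < 3; ++i) {
      if (board_[i] == 1) f[i] = 1.f;
      if (board_[i] == 2) f[3 + i] = 1.f;
    }
    return f;
  }
  // shape of that input: channels x height x width
  std::vector<int> getFeatureSize() const { return {2, 1, 3}; }

 private:
  int countEmpty() const { int n = 0; for (int v : board_) if (v == 0) ++n; return n; }
  std::vector<int> board_ = std::vector<int>(3, 0);
  int current_ = 1;
  bool done_ = false;
  int winner_ = 0;
};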
31. STRONG COMMITMENT TO OPEN SOURCE &
ACADEMIC PUBLICATION (FACEBOOK)
• Open source & exports to the ONNX format, usable everywhere
• Open source, beats pros in Go
• StarCraft OpenData
• (quote from a machine-translation paper:) “On […] English-French and […] German-English benchmarks, our models respectively obtain […], outperforming the state of the art by more than 11 BLEU points. On low-resource languages like English-Urdu and English-Romanian, our methods achieve even better results […]. Our code for NMT and PBSMT is publicly available.”
• https://github.com/TorchCraft/ (PyTorch for StarCraft)
32. STRONG COMMITMENT TO OPEN SOURCE &
ACADEMIC PUBLICATION (FACEBOOK)
<< At FAIR, we openly share our advances as much as we can, as fast as we
can in the form of technical papers, open source code and teaching
material. >> (Y. Le Cun, Facebook, BusinessInsider)
33. 1. MCTS
2. AlphaZero: adding conv nets
3. AlphaZero – great performances
4. AlphaZero – limitations
5. Open Sourcing
6. Research directions
34. POLYGAMES: PARTIALLY OBSERVABLE GAMES
Main challenge in partially observable games: building the probability distribution of hidden states, assuming Nash policies.
Papers: Mundhenk, Rintanen… show 2EXP complexity for many partially observable games, and undecidability in some cases.
Consider Chinese Dark Chess. Part of the information is hidden.
But it’s simple (in that case): just randomly draw the hidden information
when it’s revealed → you can simulate CDC in MCTS.
The same principle applies when the hidden information is the same for all players.
Example: Minesweeper (single player!).
- Naïve version: randomly draw the positions of the mines until you find something consistent with the observations.
- This is slow, but equivalent to classical Minesweeper.
- Faster: use constraint satisfaction (cf. Studholme’s paper).
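A minimal C++ sketch of the naïve rejection-sampling idea for Minesweeper (all names invented; observations are revealed cells with their adjacent-mine counts on a small grid):

#include <algorithm>
#include <optional>
#include <random>
#include <vector>

struct Revealed { int x, y, adjacentMines; };  // an observed, opened cell

// Count mines adjacent to (x, y) on a width x height grid stored row-major.
int adjacentMines(const std::vector<int>& mines, int w, int h, int x, int y) {
  int n = 0;
  for (int dy = -1; dy <= 1; ++dy)
    for (int dx = -1; dx <= 1; ++dx) {
      int nx = x + dx, ny = y + dy;
      if ((dx || dy) && nx >= 0 && nx < w && ny >= 0 && ny < h) n += mines[ny * w + nx];
    }
  return n;
}

// Naive determinization: draw random mine placements on unrevealed cells
// until one is consistent with every observation (slow but correct).
std::optional<std::vector<int>> sampleHiddenMines(
    int w, int h, int totalMines, const std::vector<Revealed>& obs,
    std::mt19937& rng, int maxTries = 100000) {
  for (int t = 0; t < maxTries; ++t) {
    std::vector<int> cells(w * h, 0);
    // Candidate positions: every cell except the revealed ones (which hold no mine).
    std::vector<int> candidates;
    for (int i = 0; i < w * h; ++i) candidates.push_back(i);
    for (const auto& o : obs)
      candidates.erase(std::remove(candidates.begin(), candidates.end(), o.y * w + o.x),
                       candidates.end());
    std::shuffle(candidates.begin(), candidates.end(), rng);
    for (int i = 0; i < totalMines && i < (int)candidates.size(); ++i) cells[candidates[i]] = 1;
    // Accept only if all observed counts match.
    bool ok = true;
    for (const auto& o : obs)
      if (adjacentMines(cells, w, h, o.x, o.y) != o.adjacentMines) { ok = false; break; }
    if (ok) return cells;
  }
  return std::nullopt;  // gave up; a constraint-satisfaction solver would be the faster route
}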
35. POLYGAMES: THE RED QUEEN EFFECT AND TOURNAMENTS
Zero-learning = a fixed-point algorithm.
• Does it stop when MCTS(NN) = NN?
• Is there a fixed point when there is
no total order on players? (A > B > C > A)
→ Keep an archive
→ Related: Grigoriadis & Khachiyan, 1994
37. POLYGAMES: STRUCTURED OUTPUT
Consider Go or Breakthrough or Draughts or many others.
The output space is topologically related to the input space.
This link is destroyed by an FCMLP (fully connected multilayer perceptron).
Let us make training faster by using convolutions everywhere in the network, plus global pooling.
Bonus: this is applicable to any board size!
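A minimal C++ sketch of why any board size works (purely illustrative, not the Polygames network code): after convolutions, global average pooling turns a C x H x W feature map into C numbers, whatever H and W are.

#include <vector>

// Global average pooling over a C x H x W tensor stored channel-major:
// returns C numbers regardless of the board size H x W.
std::vector<float> globalAveragePool(const std::vector<float>& features,
                                     int channels, int height, int width) {
  std::vector<float> pooled(channels, 0.f);
  for (int c = 0; c < channels; ++c) {
    for (int i = 0; i < height * width; ++i) pooled[c] += features[c * height * width + i];
    pooled[c] /= float(height * width);
  }
  return pooled;
}
// Since convolutions and this pooling never depend on H and W,
// a net trained on 13x13 boards can be evaluated on 19x19 boards.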
48. Cool stuff with Polygames
- learning on 13x13 and playing on 19x19 at a strong level (fully
convolutional nets)
- strong checkpoints for many games
- stochastic games
- possibility to add layers, channels, and kernel width dynamically
- distributed
- a few partially observable games (Minesweeper)
- maintained, open-sourced, readable
- gets crucial details right: mask illegal actions rather
than expecting the net to learn a logit of −infinity (see the sketch after this list)
- tournament mode for robust learning
- in progress: learning with side information
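A minimal C++ sketch of the masking point above (invented names, not Polygames code): illegal actions are excluded from the softmax explicitly, instead of hoping the network learns to emit −infinity logits for them.

#include <algorithm>
#include <cmath>
#include <vector>

// Softmax restricted to legal actions: illegal actions get probability exactly 0.
// Assumes at least one legal action.
std::vector<float> maskedSoftmax(const std::vector<float>& logits,
                                 const std::vector<bool>& legal) {
  std::vector<float> probs(logits.size(), 0.f);
  float maxLogit = -1e30f;
  for (size_t i = 0; i < logits.size(); ++i)
    if (legal[i]) maxLogit = std::max(maxLogit, logits[i]);
  float sum = 0.f;
  for (size_t i = 0; i < logits.size(); ++i)
    if (legal[i]) { probs[i] = std::exp(logits[i] - maxLogit); sum += probs[i]; }
  for (size_t i = 0; i < logits.size(); ++i) probs[i] /= sum;  // illegal entries stay 0
  return probs;
}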
49. HEX
According to Bonnet et al
(https://www.lamsade.dauphine.fr/~bonnet/publi/connection-games.pdf), “Since its independent inventions in 1942 and 1948 by
the poet and mathematician Piet Hein and the economist and
mathematician John Nash, the game of hex has acquired a special
spot in the heart of abstract game aficionados. Its purity and depth
has lead Jack van Rijswijck to conclude his PhD thesis with the
following hyperbole [1]: << Hex has a Platonic existence,
independent of human thought. If ever we find an
extraterrestrial civilization at all, they will know hex, without
any doubt.>> ”
50. HEX
Simplest rules ever!
I play black.
You play white.
We take turns placing a stone.
If I connect my sides, I win.
If you connect your sides, you win.
Theorem: no draw.
Until 2019/10/31: no computer managed to beat the best humans!
52. HEX
Polygames vs Arek Kulczycki
(winner of the last LG tournament, best Elo rank on the LittleGolem server).
A bunch of GPUs, several days.
Operated & trained by Vegard, a.k.a.
“one hell of a hacker”.
Thanks a lot!!!
53. HEX
(game images: Max Pixel, pngimg.com)
A fantastic game with a super long final path!
56. Breakthrough: seemingly a win for White. Draws between the best
bots. Maybe needs a pie rule or a 12* rule.
(12*: turn order 122112211221122…
Pie: the second player can swap roles.)
Othello: won all* games against 2 strong bots (incl. the winner of the
2019 Olympiad).
Einstein: not many results yet; it looks like we play well.
*except one, lost by a human
operator mistake!
57. THE END!!!
… we’re coming to many other
games, stay tuned :)
(and help us:
- join the group :) )
Havannah: big board, diversity of
winning conditions, long games,
hexagons…
LUDII: enormous library of games,
interfacing in progress with the
Maastricht games gang.