AlphaZero and beyond: Polygames
1. Zero learning, old and new.
Tristan Cazenave, Univ. Dauphine
Yen-Chi Chen, National Taiwan Normal University
Guan-Wei Chen, National Dong Hwa University
Shi-Yu Chen, National Dong Hwa University
Xian-Dong Chiu, National Dong Hwa University
Julien Dehos, Univ. Littoral Cote d’Opale
Maria Elsa, National Dong Hwa University
Qucheng Gong, Facebook AI Research
Hengyuan Hu, Facebook AI Research
Vasil Khalidov, Facebook AI Research
Chen-Ling Li, National Dong Hwa University
Hsin-I Lin, National Dong Hwa University
Yu-Jin Lin, National Dong Hwa University
Olivier Teytaud
Started working in AI last century.
Currently working on games, AlphaZero-style learning, and derivative-free
optimization.
Has worked at Artelys, INRIA, Google, and Facebook.
Xavier Martinet, Facebook AI Research
Vegard Mella, Facebook AI Research
Jeremy Rapin, Facebook AI Research
Baptiste Roziere, Facebook AI Research
Gabriel Synnaeve, Facebook AI Research
Fabien Teytaud, Univ. Littoral Cote d’Opale
Olivier Teytaud, Facebook AI Research
Shi-Cheng Ye, National Dong Hwa University
Yi-Jun Ye, National Dong Hwa University
Shi-Jim Yen, National Dong Hwa University
Sergey Zagoruyko, Facebook AI Research
2. 1. MCTS = Monte Carlo Tree Search
2. AlphaZero: adding conv nets
3. AlphaZero – great performances
4. AlphaZero – limitations
5. Open Sourcing
6. Research directions
3. ALPHAZERO INGREDIENT #1: MCTS
MCTS (MONTE CARLO TREE SEARCH) WAS ORIGINALLY PUBLISHED IN [COULOM06].
IT WAS ENOUGH TO WIN GAMES AGAINST PROS IN 9X9 GO, AND IN 19X19 WITH A HANDICAP OF ABOUT 4 STONES.
QUITE STRONG FOR FULLY OBSERVABLE “GENERAL GAME PLAYING” (WHERE THE PROGRAM MUST
FIRST READ AND UNDERSTAND THE RULES).
UCT (UPPER CONFIDENCE TREES) IS A VARIANT OF MCTS
(USING UCB).
(image: usgo.org)
4. Coulom (06)
Chaslot, Saito & Bouzy (06)
Kocsis & Szepesvári (06)
UCT (UPPER CONFIDENCE TREES) STARTS
WITH SIMPLE MONTE CARLO
15. UCT IN ONE SLIDE
UCT for choosing a move in a board B
While ( I have time left )
{
Do a simulation
{
Start at board B
At each time step, choose action by UCB (or random if no statistics!)
}
Update statistics with this simulation
}
Return the most simulated action.
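A minimal C++ sketch of the action-selection rule used inside this loop, with invented names (per-action wins/visits statistics); this illustrates UCB1 as used by UCT, not Polygames code:

#include <cmath>
#include <limits>
#include <vector>

// Hypothetical per-action statistics stored in a tree node.
struct ActionStats {
  double wins = 0.0;    // sum of rewards of simulations that went through this action
  double visits = 0.0;  // number of such simulations
};

// UCB1 score: empirical mean + exploration bonus; unexplored actions come first.
double ucb(const ActionStats& a, double parentVisits, double c = std::sqrt(2.0)) {
  if (a.visits == 0.0) return std::numeric_limits<double>::infinity();
  return a.wins / a.visits + c * std::sqrt(std::log(parentVisits) / a.visits);
}

// "Choose action by UCB": pick the action with the highest score.
int chooseByUcb(const std::vector<ActionStats>& stats, double parentVisits) {
  int best = 0;
  for (int i = 1; i < (int)stats.size(); ++i)
    if (ucb(stats[i], parentVisits) > ucb(stats[best], parentVisits)) best = i;
  return best;
}

Each simulation descends the tree with chooseByUcb (random moves where no statistics exist yet), the final reward is backed up into wins/visits along the path, and after the time budget the most simulated root action is returned.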
16. 1. MCTS
2. AlphaZero: adding conv nets
3. AlphaZero – great performances
4. AlphaZero – limitations
5. Open Sourcing
6. Research directions
17. ALPHAZERO INGREDIENT #2: DEEP NETWORK
OVERVIEW IN “DEEP LEARNING”, LECUN, BENGIO, HINTON 2015
BOTH A CRITIC NETWORK (EVALUATING THE PROBABILITY OF WINNING IN A GIVEN POSITION) AND
A POLICY NETWORK (PROVIDING A PROBABILITY DISTRIBUTION ON ACTIONS).
(images: clarifai.com/technology and “the data science blog”)
(figure: a convolutional network; ← ← ← invariance by translation; high-level features → → →)
18. PUCT: UCT WITH PRIOR
SCORE(state, action) = Q(state, action) + NN(state, action) · sqrt(N(state)) / N(state, action)
Example: 5 wins out of 7 simulations of this action, 10 simulations of the parent state:
SCORE = 5/7 + NN(state, action) · sqrt(10) / 7
No log term (unlike UCB); exploration is weighted by the NN prior.
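A small C++ sketch of this score, with invented names; the “+ 1” in the denominator, as in AlphaZero-style implementations, keeps the score finite for unvisited actions (the numeric illustration above omits it), and cPuct is an exploration constant:

#include <cmath>

// PUCT: empirical mean reward plus prior-weighted exploration; no log term, unlike UCB1.
double puctScore(double sumRewards, double actionVisits, double parentVisits,
                 double prior, double cPuct = 1.0) {
  double q = actionVisits > 0.0 ? sumRewards / actionVisits : 0.0;
  return q + cPuct * prior * std::sqrt(parentVisits) / (1.0 + actionVisits);
}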
19. ALPHAZERO IN A NUTSHELL: A FIXED POINT METHOD!
MCTS(NN): A MCTS WHICH USES A NEURAL NET NN FOR
• EVALUATING LEAVES (NO RANDOM ROLLOUT)
• SUGGESTING POLICIES (BIASING THE MCTS)
NN ← MCTS:
• EACH CLIENT: PLAYS GAMES WITH MCTS(NN)
• SERVER:
• RECEIVES BATCHES “(STATES, ACTIONS, REWARD AT END OF GAMES)”
• TWO LOSS FUNCTIONS (+ WEIGHT DECAY):
• LEARN “STATE → REWARD” (CRITIC)
• LEARN “STATE → PROBABILITY DISTRIBUTION ON ACTIONS” (ACTOR), I.E. MIMIC THE MCTS
ALPHAZERO:
• RANDOMLY INITIALIZE NN
• ITERATE: THE NN ACTOR IMITATES MCTS(NN); THE NN CRITIC IMITATES GAME RESULTS
(Loss diagram: value prediction; policy p imitates π from MCTS; weight decay.)
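Written out, the slide's three terms correspond to the standard AlphaZero objective (Silver et al.), with v the predicted value, z the game outcome, p the network policy, π the MCTS visit distribution, and c the weight-decay coefficient:

loss = (z − v)² − π · log p + c ‖θ‖²

The value head regresses the final reward, the policy head is trained by cross-entropy to imitate the MCTS policy, and the L2 term regularizes the weights.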
21. ALPHAZERO TRAINING IN ONE SLIDE
THE CLIENTS PERFORM SIMULATIONS USING THE NEURAL NETWORK’S OUTPUTS:
- NN.PI
- NN.V
EACH CLIENT, IN MORE DETAIL:
- RUNS MCTS, MODIFIED AS FOLLOWS:
- THE UCB SCORE IS COMBINED WITH NN.PI (THE “PUCT” FORMULA)
- RANDOM ROLLOUTS (CORRESPONDING TO STATES WITH ZERO SIMULATIONS) ARE REPLACED BY:
- A SINGLE STEP WITH NN.PI FOR CHOOSING ONE ACTION
- THE REWARD OF THE SIMULATION IS REPLACED BY NN.V, WHICH PREDICTS THE REWARD
- SENDS THE MASTER MANY TUPLES (STATE, PI CHOSEN BY MCTS, REWARD OF THE GAME)
THE MASTER:
- PROVIDES NN.PI (PROBABILITY DISTRIBUTION ON ACTIONS) AND NN.V (ESTIMATED
REWARD) IN STATES, AS REQUESTED BY CLIENTS
- LEARNS A BETTER NN FROM THE CLIENTS’ SIMULATIONS
(Diagram: CLIENT 1 simulates games using MCTS. Inference: the client sends a state s to the MASTER (neural net) and receives NN.Pi(s) and NN.V(s). Training of π and V: the client sends back (state s, MCTS.Pi(s), real reward R). Pi(s) = probabilities of actions in state s; V(s) = estimated winning rate in s. Loss: value prediction; policy p imitates π from MCTS; weight decay.)
Actually there is a replay buffer.
Simulated results are sent to a data structure.
The training picks up data and performs
stochastic gradient descent.
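A minimal C++ sketch of such a data structure (names and the ring-buffer policy are invented for illustration): clients append (state, MCTS policy, reward) records, and the trainer samples batches from it for stochastic gradient descent.

#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// One self-play record: features of state s, MCTS policy pi(s), final reward of the game.
struct Record {
  std::vector<float> stateFeatures;
  std::vector<float> mctsPolicy;
  float reward;
};

// Fixed-capacity ring buffer: new records overwrite the oldest ones.
class ReplayBuffer {
 public:
  explicit ReplayBuffer(std::size_t capacity) : capacity_(capacity) {}

  void add(Record r) {
    if (data_.size() < capacity_) data_.push_back(std::move(r));
    else data_[next_] = std::move(r);
    next_ = (next_ + 1) % capacity_;
  }

  // Uniformly sample a batch for one SGD step (assumes the buffer is non-empty).
  std::vector<Record> sample(std::size_t batchSize, std::mt19937& rng) const {
    std::uniform_int_distribution<std::size_t> pick(0, data_.size() - 1);
    std::vector<Record> batch;
    for (std::size_t i = 0; i < batchSize; ++i) batch.push_back(data_[pick(rng)]);
    return batch;
  }

 private:
  std::vector<Record> data_;
  std::size_t capacity_;
  std::size_t next_ = 0;
};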
22. 1. MCTS
2. AlphaZero: adding conv nets
3. AlphaZero – great performances
4. AlphaZero – limitations
5. Open Sourcing
6. Research directions
24. ALPHAZERO: GREAT RESULTS
[SILVER ET AL, NATURE PAPERS + ARXIV]
• NO GAME SPECIFIC KNOWLEDGE
• USING MASSIVE COMPUTATIONAL POWER
• EXTENSIVE REPRESENTATION FOR ACTIONS ← ONE CHANNEL FOR EACH POSSIBLE RELATIVE
MOVE OF EACH PIECE! HOW MANY CHANNELS FOR GO? FOR CHESS?
25. 1. MCTS
2. AlphaZero: adding conv nets
3. AlphaZero – great performances
4. AlphaZero – limitations
5. Open Sourcing
6. Research directions
26. ALPHAZERO: OPEN PROBLEMS
1. BASED ON MCTS → APPLYING IT TO PARTIAL OBSERVABILITY IS NOT TRIVIAL
2. BASED ON MCTS → SIMULATORS/BACKTRACKING ARE NECESSARY (WHITE BOX)
3. BASED ON NN → HOW TO DEAL WITH HUGE / COMPLEX ACTION SPACES?
4. BASED ON NN → HUGE NUMBER OF SIMULATED GAMES → CAN WE REDUCE THE DATA NEEDED?
You need a very special
MCTS for partially-
observable games
- Should we learn just V or just 𝜋 or both ?
- Complex action spaces ?
- Partially observable games ? (defogization)
28. ALPHAZERO: OPEN PROBLEMS
1. BASED ON MCTS → APPLYING IT TO PARTIAL OBSERVABILITY IS NOT TRIVIAL
2. BASED ON MCTS → SIMULATORS/BACKTRACKING ARE NECESSARY (WHITE BOX)
3. BASED ON NN → HOW TO DEAL WITH HUGE / COMPLEX ACTION SPACES?
4. BASED ON NN → HUGE NUMBER OF SIMULATED GAMES → CAN WE REDUCE THE DATA NEEDED?
WHY DOES ZERO FAIL IN PARTIALLY OBSERVABLE GAMES?
→ BECAUSE MCTS NEEDS SIMULATIONS “STATE, ACTION → NEW STATE”
29. 1. MCTS
2. AlphaZero: adding conv nets
3. AlphaZero – great performances
4. AlphaZero – limitations
5. Open Sourcing
6. Research directions
30. POLYGAMES, OPEN SOURCED RECENTLY!
1. MODIFY ONE AND ONLY ONE FILE, SO THAT THE GAME IS YOURS:
• class State {
• bool PlayAction(Action action) ← what happens if we play this move?
• vector<Action> getLegalActions() ← what are the legal moves?
• float getReward(int player) ← which reward did that player win?
• bool terminated() ← is the game over?
• int getCurrentplayer() ← who should play now?
• vector<float> getFeature() ← input of the neural net (vector, reshaped as a 3D tensor as below)
• vector<int> getFeatureSize() ← shape of the input of the neural net (Polygames will reshape accordingly)
• … a few technical things…
• } and a class of actions (each action is mapped to an output neuron)
2. GET LEARNING CURVES ON YOUR GAME
• with different architectures (cool: structured output)
• In a couple of days, single machine
• In progress: a class of partially observable games
+ interface with LUDII ?
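To make the State interface above concrete, here is a toy, self-contained C++ sketch: a trivial one-row game where the current player picks any empty cell and wins by taking the middle one. The method names follow the slide; the Action type and exact signatures are invented for illustration and will differ from the real Polygames headers.

#include <vector>

using Action = int;  // toy action: index of the cell to play (the real class is richer)

class State {
 public:
  // what happens if we play this move?
  bool PlayAction(Action a) {
    if (board_[a] != 0 || done_) return false;
    board_[a] = current_;
    if (a == 1) { done_ = true; winner_ = current_; }           // middle cell wins
    else if (countEmpty() == 0) { done_ = true; winner_ = 0; }  // draw
    current_ = 3 - current_;  // switch player 1 <-> 2
    return true;
  }
  // what are the legal moves?
  std::vector<Action> getLegalActions() const {
    std::vector<Action> legal;
    for (int i = 0; i < 3; ++i) if (board_[i] == 0) legal.push_back(i);
    return legal;
  }
  // which reward did that player win? (+1 win, -1 loss, 0 otherwise)
  float getReward(int player) const {
    if (!done_ || winner_ == 0) return 0.f;
    return winner_ == player ? 1.f : -1.f;
  }
  bool terminated() const { return done_; }          // is the game over?
  int getCurrentplayer() const { return current_; }  // who should play now?
  // input of the neural net: one channel per player, on a 1x3 board
  std::vector<float> getFeature() const {
    std::vector<float> f(2 * 3, 0.f);
    for (int i = 0; i < 3; ++i) {
      if (board_[i] == 1) f[i] = 1.f;
      if (board_[i] == 2) f[3 + i] = 1.f;
    }
    return f;
  }
  // shape of that input: channels x height x width
  std::vector<int> getFeatureSize() const { return {2, 1, 3}; }

 private:
  int countEmpty() const { int n = 0; for (int v : board_) if (v == 0) ++n; return n; }
  std::vector<int> board_ = std::vector<int>(3, 0);
  int current_ = 1;
  bool done_ = false;
  int winner_ = 0;
};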
31. STRONG COMMITMENT TO OPEN SOURCE &
ACADEMIC PUBLICATION (FACEBOOK)
• Open source & exports to the ONNX format, usable everywhere
• Open source, beats pros in Go
• StarCraft OpenData
• (quote from a machine-translation paper:) “On […] English-French and […] German-English benchmarks, our models respectively obtain […], outperforming the state of the art by more than 11 BLEU points. On low-resource languages like English-Urdu and English-Romanian, our methods achieve even better results […]. Our code for NMT and PBSMT is publicly available.”
• https://github.com/TorchCraft/ (PyTorch for StarCraft)
32. STRONG COMMITMENT TO OPEN SOURCE &
ACADEMIC PUBLICATION (FACEBOOK)
<< At FAIR, we openly share our advances as much as we can, as fast as we
can in the form of technical papers, open source code and teaching
material. >> (Y. Le Cun, Facebook, BusinessInsider)
33. 1. MCTS
2. AlphaZero: adding conv nets
3. AlphaZero – great performances
4. AlphaZero – limitations
5. Open Sourcing
6. Research directions
34. POLYGAMES: PARTIALLY OBSERVABLE GAMES
Main challenge in partially observable games: building the probability distribution of hidden states, assuming Nash policies.
Papers: Mundhenk, Rintanen… show 2EXP complexity for many partially observable games, and undecidability in some cases.
Consider Chinese Dark Chess. Part of the information is hidden.
But it’s simple (in that case): just randomly draw the hidden information
when it’s revealed → you can simulate CDC in MCTS.
The same principle applies when the hidden information is the same for all players.
Example: Minesweeper (single player!).
- Naïve version: randomly draw the positions of the mines until you find something consistent with the observations.
- This is slow, but equivalent to classical Minesweeper.
- Faster: use constraint satisfaction (cf. Studholme’s paper).
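A minimal C++ sketch of the naïve rejection-sampling idea for Minesweeper (all names invented; observations are revealed cells with their adjacent-mine counts on a small grid):

#include <algorithm>
#include <optional>
#include <random>
#include <vector>

struct Revealed { int x, y, adjacentMines; };  // an observed, opened cell

// Count mines adjacent to (x, y) on a width x height grid stored row-major.
int adjacentMines(const std::vector<int>& mines, int w, int h, int x, int y) {
  int n = 0;
  for (int dy = -1; dy <= 1; ++dy)
    for (int dx = -1; dx <= 1; ++dx) {
      int nx = x + dx, ny = y + dy;
      if ((dx || dy) && nx >= 0 && nx < w && ny >= 0 && ny < h) n += mines[ny * w + nx];
    }
  return n;
}

// Naive determinization: draw random mine placements on unrevealed cells
// until one is consistent with every observation (slow but correct).
std::optional<std::vector<int>> sampleHiddenMines(
    int w, int h, int totalMines, const std::vector<Revealed>& obs,
    std::mt19937& rng, int maxTries = 100000) {
  for (int t = 0; t < maxTries; ++t) {
    std::vector<int> cells(w * h, 0);
    // Candidate positions: every cell except the revealed ones (which hold no mine).
    std::vector<int> candidates;
    for (int i = 0; i < w * h; ++i) candidates.push_back(i);
    for (const auto& o : obs)
      candidates.erase(std::remove(candidates.begin(), candidates.end(), o.y * w + o.x),
                       candidates.end());
    std::shuffle(candidates.begin(), candidates.end(), rng);
    for (int i = 0; i < totalMines && i < (int)candidates.size(); ++i) cells[candidates[i]] = 1;
    // Accept only if all observed counts match.
    bool ok = true;
    for (const auto& o : obs)
      if (adjacentMines(cells, w, h, o.x, o.y) != o.adjacentMines) { ok = false; break; }
    if (ok) return cells;
  }
  return std::nullopt;  // gave up; a constraint-satisfaction solver would be the faster route
}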
35. POLYGAMES: THE RED QUEEN EFFECT AND TOURNAMENTS
Zero-learning = a fixed-point algorithm.
• Does it stop when MCTS(NN) = NN?
• Is there a fixed point when there is
no total order on players? (A > B > C > A)
→ Keep an archive
→ Related: Grigoriadis & Khachiyan, 1994
37. POLYGAMES: STRUCTURED OUTPUT
Consider Go or Breakthrough or Draughts or many others.
The output space is topologically related to the input space.
This link is destroyed by an FCMLP (fully connected multilayer perceptron).
Let us make training faster by using convolutions everywhere in the network, plus global pooling.
Bonus: this is applicable to any board size!
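A minimal C++ sketch of why any board size works (purely illustrative, not the Polygames network code): after convolutions, global average pooling turns a C x H x W feature map into C numbers, whatever H and W are.

#include <vector>

// Global average pooling over a C x H x W tensor stored channel-major:
// returns C numbers regardless of the board size H x W.
std::vector<float> globalAveragePool(const std::vector<float>& features,
                                     int channels, int height, int width) {
  std::vector<float> pooled(channels, 0.f);
  for (int c = 0; c < channels; ++c) {
    for (int i = 0; i < height * width; ++i) pooled[c] += features[c * height * width + i];
    pooled[c] /= float(height * width);
  }
  return pooled;
}
// Since convolutions and this pooling never depend on H and W,
// a net trained on 13x13 boards can be evaluated on 19x19 boards.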
48. Cool stuff with Polygames
- learning on 13x13 and playing on 19x19 at a strong level (fully
convolutional nets)
- strong checkpoints for many games
- stochastic games
- possibility to add layers, channels, and kernel width dynamically
- distributed
- a few partially observable games (Minesweeper)
- maintained, open-sourced, readable
- gets crucial details right: mask illegal actions rather
than expecting the net to learn a logit of −infinity (see the sketch after this list)
- tournament mode for robust learning
- in progress: learning with side information
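A minimal C++ sketch of the masking point above (invented names, not Polygames code): illegal actions are excluded from the softmax explicitly, instead of hoping the network learns to emit −infinity logits for them.

#include <algorithm>
#include <cmath>
#include <vector>

// Softmax restricted to legal actions: illegal actions get probability exactly 0.
// Assumes at least one legal action.
std::vector<float> maskedSoftmax(const std::vector<float>& logits,
                                 const std::vector<bool>& legal) {
  std::vector<float> probs(logits.size(), 0.f);
  float maxLogit = -1e30f;
  for (size_t i = 0; i < logits.size(); ++i)
    if (legal[i]) maxLogit = std::max(maxLogit, logits[i]);
  float sum = 0.f;
  for (size_t i = 0; i < logits.size(); ++i)
    if (legal[i]) { probs[i] = std::exp(logits[i] - maxLogit); sum += probs[i]; }
  for (size_t i = 0; i < logits.size(); ++i) probs[i] /= sum;  // illegal entries stay 0
  return probs;
}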
49. HEX
According to Bonnet et al
(https://www.lamsade.dauphine.fr/~bonnet/publi/connection-games.pdf), “Since its independent inventions in 1942 and 1948 by
the poet and mathematician Piet Hein and the economist and
mathematician John Nash, the game of hex has acquired a special
spot in the heart of abstract game aficionados. Its purity and depth
has lead Jack van Rijswijck to conclude his PhD thesis with the
following hyperbole [1]: << Hex has a Platonic existence,
independent of human thought. If ever we find an
extraterrestrial civilization at all, they will know hex, without
any doubt.>> ”
50. HEX
Simplest rules ever!
I play black.
You play white.
We take turns placing a stone.
If I connect my sides, I win.
If you connect your sides, you win.
Theorem: no draw.
Until 2019/10/31: no computer managed to beat the best humans!
52. HEX
Polygames vs Arek Kulczycki
(winner of the last LG tournament, best Elo rank on the LittleGolem server).
A bunch of GPUs, several days.
Operated & trained by Vegard, a.k.a.
“one hell of a hacker”.
Thanks a lot!!!
53. HEX
(game images: Max Pixel, pngimg.com)
A fantastic game with a super long final path!
56. Breakthrough: seemingly a win for White. Draws between the best
bots. Maybe needs a pie rule or a 12* rule.
(12*: turn order 122112211221122…
Pie: the second player can swap roles.)
Othello: won all* games against 2 strong bots (incl. the winner of the
2019 Olympiad).
Einstein: not many results yet; it looks like we play well.
*except one, lost by a human
operator mistake!
57. THE END!!!
… we’re coming to many other
games, stay tuned :)
(and help us:
- join the group :) )
Havannah: big board, diversity of
winning conditions, long games,
hexagons…
LUDII: enormous library of games,
interfacing in progress with the
Maastricht games gang.