Artificial Intelligence and Optimization with Parallelism
1. HABILITATION
Artificial Intelligence with Parallelism
Acknowledgments:
All the TAO team. People in Liège, Taiwan, LRI, Artelys, Mash, Iomca, ...
Thanks a lot to the committee.
Thanks + good recovery to Jonathan Shapiro.
Thanks to Grid5000.
Olivier Teytaud olivier.teytaud@inria.fr
2. Introduction
What is AI ?
Why evolutionary optimization is a part of AI
Why parallelism ?
Evolutionary computation
Comparison-based optimization
Parallelization
Noisy cases
Sequential decision making
Fundamental facts
Monte-Carlo Tree Search
Conclusion
3. AI = using computers where they
are weak / weaker than humans.
(thanks Michèle S.)
Difficult optimization (complex structure,
noisy objective functions)
Games (difficult ones)
Key difference with many operational research works:
AI = choosing a model as close as possible to reality, and solving it (very) approximately
OR = choosing the best model that you can solve almost exactly
7. Many works are about numbers.
Providing standard deviations, rates, etc.
Other goal (more ambitious ?):
switching from something which does not work
to something which works.
E.g. vision; a computer can distinguish:
9. And it's a disaster for categorizing
- children,
- women,
- pandas,
- babies,
- men,
- bears,
- trucks,
- cars.
11. And it's a disaster for categorizing children,
women, pandas, babies, men, bears, trucks, cars.
A 3-year-old child can do it.
12. ==> AI= focus on things which do not
work and (hopefully) make them work.
13. Introduction
What is AI ?
Why evolutionary optimization is a part of AI
Why parallelism ?
Evolutionary computation
Comparison-based optimization
Parallelization
Noisy cases
Sequential decision making
Fundamental facts
Monte-Carlo Tree Search
Conclusion
14. Evolutionary optimization is a part of A.I.
Often considered as bad, because many
EO tools are not that hard,
mathematically speaking.
I've met people using
- randomized mutations
- cross-overs
but who did not call this evolutionary or
genetic, because it would look bad.
15. Gives a lot of freedom:
- choose your operators (depending on the problem)
- choose your population size λ (depending on your computer/grid)
- choose μ (carefully), e.g. μ = min(dimension, λ/4)
==> Can work on strange domains
19. Voronoi representation:
- a family of points
- their labels
==> cross-over makes sense
==> you can optimize a shape
21. Voronoi representation:
- a family of points
- their labels
==> cross-over makes sense: a great substitute for averaging (“on the benefit of sex”)
==> you can optimize a shape
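As a sketch of how such a cross-over could look (the 2-D representation, the vertical-cut rule, and all names below are illustrative assumptions, not the exact operator from the talk):

```python
import random

def nearest_label(x, sites):
    """Label of the Voronoi cell containing point x."""
    (px, py), label = min(sites, key=lambda s: (s[0][0] - x[0]) ** 2 + (s[0][1] - x[1]) ** 2)
    return label

def crossover(parent_a, parent_b, rng=random):
    """Spatial cross-over: cells left of a random vertical cut come from A, the rest from B."""
    cut = rng.uniform(0.0, 1.0)
    child = [(p, l) for (p, l) in parent_a if p[0] < cut]
    child += [(p, l) for (p, l) in parent_b if p[0] >= cut]
    return child if child else list(parent_a)   # guard against an empty child

# a "shape" = labeled Voronoi sites on the unit square (label 1 = inside the shape)
a = [((0.2, 0.5), 1), ((0.8, 0.5), 0)]
b = [((0.5, 0.2), 0), ((0.5, 0.8), 1)]
child = crossover(a, b, random.Random(0))
```

The point of the spatial cut is that each child cell comes from a coherent region of one parent, which is why the operator behaves like a meaningful recombination of shapes rather than a blind mix.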
23. Introduction
What is AI ?
Why evolutionary optimization is a part of AI
Why parallelism ?
Evolutionary computation
Comparison-based optimization
Parallelization
Noisy cases
Sequential decision making
Fundamental facts
Monte-Carlo Tree Search
Conclusion
25. Parallelism.
Thank you G5K
Multi-core machines
Clusters
Grids
Sometimes parallelization completely changes
the picture.
Sometimes not.
We want to know when.
26. Introduction
What is AI ?
Why evolutionary optimization is a part of AI
Why parallelism ?
Evolutionary computation
Comparison-based optimization
Parallelization
Noisy cases                      <== Robustness, slow rates.
Sequential decision making
Fundamental facts
Monte-Carlo Tree Search
Conclusion
31. Derivative-free optimization of f
Why derivative-free optimization ?
Ok, it's slower.
But sometimes you have no derivative.
It's simpler (by far) ==> fewer bugs.
It's more robust (to noise, to strange functions...).
35. Optimization algorithms, from most to least structure:
==> Newton optimization
==> Quasi-Newton (BFGS)
==> Gradient descent
==> Derivative-free optimization (no gradients needed)
==> Comparison-based optimization (coming soon), just needing comparisons,
including evolutionary algorithms.
39. Comparison-based algorithms are robust
Consider
f: X --> R
We look for x* such that
for all x, f(x*) ≤ f(x)
==> what if we see g o f (g increasing) ?
==> x* is the same, but xn might change
parallel evolution 39
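The invariance can be checked directly: a comparison-based step returns the same points for f and g∘f whenever g is increasing. A minimal illustration (the generic truncation-selection helper `select_best` is an assumption of the sketch):

```python
import math, random

def select_best(points, fitness, mu):
    # a comparison-based selection step: keep the mu best points
    return sorted(points, key=fitness)[:mu]

rng = random.Random(0)
points = [rng.uniform(-2.0, 2.0) for _ in range(10)]
f = lambda x: x * x
g_of_f = lambda x: math.exp(f(x)) - 0.5   # g increasing ==> same comparisons
assert select_best(points, f, 3) == select_best(points, g_of_f, 3)
```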
40. Robustness of comparison-based algorithms: formal statement
For a comparison-based algorithm, the behavior (hence the convergence rate) does not depend on g;
and a comparison-based algorithm is optimal for the worst case over compositions g o f, g increasing.
41. Complexity bounds (N = dimension)
Runtime = nb of fitness evaluations for reaching precision ε
with probability at least ½, for all f.
Exp( - convergence ratio ) = convergence rate
Convergence ratio ~ 1 / computational cost
==> more convenient than the convergence rate for speed-ups.
42. Complexity bounds: basic technique
We want to know how many iterations we need for reaching precision ε
in an evolutionary algorithm.
Key observation: (most) evolutionary algorithms are comparison-based.
Let's consider (for simplicity) a deterministic selection-based non-elitist algorithm.
First idea: how many different branches do we have in a run ?
We select μ points among λ.
Therefore, at most K = λ! / ( μ! (λ-μ)! ) different branches per iteration.
Second idea: how many different answers should we be able to give ?
Use packing numbers: at least N(ε) different possible answers (ε-balls needed to cover the domain).
49. Conclusion: the number n of iterations should verify
K^n ≥ N(ε)
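Plugging in numbers makes the bound concrete: with K = C(λ,μ) branches per iteration and N(ε) ≈ (1/ε)^N candidate answers in dimension N, K^n ≥ N(ε) gives n ≥ N·log(1/ε)/log K. A quick check (the packing-number estimate for the unit cube is an assumption of the sketch):

```python
import math

def min_iterations(dim, eps, lam, mu):
    # K^n >= N(eps), with K = C(lam, mu) branches per iteration
    # and N(eps) ~ (1/eps)^dim points at pairwise distance eps (unit cube)
    K = math.comb(lam, mu)
    return math.ceil(dim * math.log(1.0 / eps) / math.log(K))

# dimension 10, precision 1e-3, selecting mu=5 among lam=10
n = min_iterations(dim=10, eps=1e-3, lam=10, mu=5)   # n == 13
```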
50. Complexity bounds on the convergence ratio
(Fournier, T., 2009; using VC-dimension.)
FR: full ranking (selected points are ranked)
SB: selection-based (selected points are not ranked)
This is why I love cross-over.
Quadratic functions easier than sphere functions ?
But not for translation-invariant quadratic functions...
Covers existing results. Compliant with discrete domains.
55. Introduction
What is AI ?
Why evolutionary optimization is a part of AI
Why parallelism ?
Evolutionary computation
Comparison-based optimization
Parallelization:
1) Mathematical proof that all comparison-based algorithms can be parallelized (log speed-up)
2) Practical hint: simple tricks for some well-known algorithms
Noisy cases
Sequential decision making
Fundamental facts
Monte-Carlo Tree Search
Conclusion
59. Speculative parallelization with branching factor 3
Parallel version for D=2:
population = union of all the populations of the possible branches over 2 iterations.
62. Define σ*, the step-size progress in one iteration.
Necessary condition for a log(λ) speed-up:
- E log( σ* ) ~ log(λ)
But for many algorithms,
- E log( σ* ) = O(1)
==> asymptotically constant speed-up.
63. These algos do not reach the log(λ) speed-up:
(1+1)-ES with 1/5th rule
Standard CSA
Standard EMNA
Standard SA
(Teytaud, T., PPSN 2010)
64. Example 1: Estimation of Multivariate Normal Algorithm
While ( I have time )
{
Generate λ points (x1,...,xλ) distributed as N(x,σ)
Evaluate the fitness at x1,...,xλ
x = mean of the μ best points
σ = standard deviation of the μ best points
σ /= log( λ / 7 )^(1/d)
}
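A runnable sketch of this loop on the sphere function: the λ/7 correction is the one from the slide, while the population sizing (μ = λ/4), the pooled variance estimate, and the test function are assumptions of the sketch.

```python
import math, random

def emna(dim=3, lam=40, iters=60, seed=0):
    """EMNA with the log(lambda) step-size correction, on the sphere function."""
    rng = random.Random(seed)
    mu = max(1, lam // 4)                      # number of selected points (assumption)
    x, sigma = [1.0] * dim, 1.0                # initial mean and step-size
    f = lambda p: sum(c * c for c in p)        # sphere fitness
    for _ in range(iters):
        pts = [[rng.gauss(xi, sigma) for xi in x] for _ in range(lam)]
        best = sorted(pts, key=f)[:mu]
        x = [sum(p[i] for p in best) / mu for i in range(dim)]
        # empirical std of the selected points, pooled over coordinates
        var = sum((p[i] - x[i]) ** 2 for p in best for i in range(dim)) / (mu * dim)
        sigma = math.sqrt(var)
        sigma /= math.log(lam / 7.0) ** (1.0 / dim)   # the log(lambda) correction
    return f(x)

final = emna()
```

Without the last line, σ shrinks too slowly when λ grows, which is exactly why the standard EMNA misses the log(λ) speed-up.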
65. Ex 2: log(λ) correction for mutative self-adaptation
μ = min( λ/4, d )
While ( I have time )
{
Generate λ step-sizes (σ1,...,σλ) as σi = σ × exp(-k.N(0,1))
Generate λ points (x1,...,xλ) with xi distributed as N(x,σi)
Select the μ best points
Update x (= mean), update σ (= logarithmic mean of the selected σi)
}
66. Log(λ) corrections (SA, dim 3)
● In the discrete case (XPs): automatic parallelization surprisingly efficient.
● Simple trick in the continuous case:
- E log( σ* ) should be linear in log(λ)
(this provides corrections which work for SA and CSA)
68. SUMMARY of the EA part up to now:
- evolutionary algorithms are robust (with
a precise statement of this robustness)
- evolutionary algorithms are somewhat
slow (precisely quantified...)
- evolutionary algorithms are parallel (at least
“until” the dimension, for the convergence rate)
69. Now, noisy optimization.
70. Introduction
What is AI ?
Why evolutionary optimization is a part of AI
Why parallelism ?
Evolutionary computation
Comparison-based optimization
Parallelization
Noisy cases
Sequential decision making
Fundamental facts
Monte-Carlo Tree Search
Conclusion
71. Many works focus on fitness functions with “small” noise:
f(x) = ||x||² × (1 + Gaussian)
This is because the more realistic case
f(x) = ||x||² + Gaussian (variance > 0 at the optimum)
is too hard for publishing nice curves.
==> see however Arnold & Beyer 2006.
==> a tool: races (Heidrich-Meisner et al., ICML 2009)
- reevaluating until statistically significant differences
- ... but we must (sometimes) limit the number of
reevaluations
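A race between two noisy candidates can be sketched with a Hoeffding-style stopping rule; the bound, the cap on reevaluations, and the noise model below are assumptions of the sketch, not the exact procedure from the cited paper.

```python
import math, random

def race(noisy_f, a, b, delta=0.05, max_evals=10_000):
    """Reevaluate a and b until a Hoeffding-style test separates them (smaller mean wins)."""
    sum_a = sum_b = 0.0
    for n in range(1, max_evals + 1):
        sum_a += noisy_f(a)
        sum_b += noisy_f(b)
        radius = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
        if abs(sum_a / n - sum_b / n) > 2.0 * radius:
            return a if sum_a < sum_b else b
    return None   # statistically indistinguishable within the budget

rng = random.Random(1)
noisy = lambda x: x * x + rng.gauss(0.0, 0.1)   # variance > 0 at the optimum
winner = race(noisy, 0.2, 1.0)                  # winner == 0.2
```

The `max_evals` cap is the point made on the slide: near the optimum the means get arbitrarily close, so without a cap the race may reevaluate forever.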
73. Another difficult case: Bernoulli functions.
fitness(x) = B( f(x) )
f(0) not necessarily = 0.
I like this case (with p=2).
74. Approach: an EDA + races, based on MaxUncertainty (Coulom).
We prove good results here.
79. Introduction
What is AI ?
Why evolutionary optimization is a part of AI
Why parallelism ?
Evolutionary computation
Comparison-based optimization
Parallelization
Noisy cases
Sequential decision making
Fundamental facts
Monte-Carlo Tree Search
Conclusion
80. The game of Go is a part of AI.
Computers are ridiculous in front of children.
Easy situation.
Termed “semeai”.
Requires a little bit
of abstraction.
81. The game of Go is a part of AI.
Computers are ridiculous in front of children.
800 cores, 4.7 GHz, top-level program: plays a stupid move.
82. The game of Go is a part of AI.
Computers are ridiculous in front of children.
8 years old;
little training;
finds the good move
83. Introduction
What is AI ?
Why evolutionary optimization is a part of AI
Why parallelism ?
Evolutionary computation
Comparison-based optimization
Parallelization
Noisy cases
Sequential decision making
Fundamental facts
Monte-Carlo Tree Search
Conclusion
84. Monte-Carlo Tree Search
1. Games (a bit of formalism)
2. Decidability / complexity
Games with simultaneous actions 84 Paris 1st of February
85. A game is a directed graph
parallel evolution 85
86. A game is a directed graph with actions
1
2
3
parallel evolution 86
87. A game is a directed graph with actions and players
1 White
Black
2
3
White 12
43
White Black
Black
Black
Black
parallel evolution 87
88. A game is a directed graph with actions
and players and observations
Bob
Bear Bee
Bee 1 White
Black
2
3
White 12
43
White Black
Black
Black
Black
parallel evolution 88
89. A game is a directed graph with actions
and players and observations and rewards
Bob
Bear Bee
Bee 1 White
Black
2
+1
3
0
White 12
Rewards
43
White Black on leafs
Black
only!
Black
Black
parallel evolution 89
90. A game is a directed graph +actions
+players +observations +rewards +loops
Bob
Bear Bee
Bee 1 White
Black
2
+1
3
0
White 12
43
White Black
Black
Black
Black
parallel evolution 90
91. Monte-Carlo Tree Search
1. Games (a bit of formalism)
2. Decidability / complexity
92. Complexity (2 players, no random)
                                Unbounded horizon | Exponential horizon | Polynomial horizon
Full observability              EXP               | EXP                 | PSPACE
No obs (X=100%)                 undecidable       | EXPSPACE            | NEXP   (Hasslum et al., 2000)
Partially observable (X=100%)   undecidable       | 2EXP                | EXPSPACE   (Rintanen, 97)
Simult. actions                 EXPSPACE ?        | <= EXP              | <= EXP
93. Complexity question ? (UD)
Instance = a position.
Question = Is there a strategy
which wins whatever the decisions
of the opponent ?
= the natural question if full observability.
Answering this question then allows perfect play.
94. Hummm ?
Do you know a PO game in which you can
ensure a win with probability 1 ?
95. Complexity question for matrix games ?
1 0 0 0 0 0
0 1 0 0 0 0
0 0 1 0 0 0
0 0 0 1 0 0
0 0 0 0 1 0
0 0 0 0 0 1
Good for the column-player !
==> but no sure win.
==> the “UD” question is not relevant here!
96. Complexity question for phantom-games ? (Joint work with F. Teytaud.)
This is phantom-go.
Good for black: wins with proba 1-1/(8!).
Here, there's no move which ensures a win.
But some moves are much better than others!
99. Madani et al.:
1 player + random = undecidable.
We extend to two players with no random.
Problem: rewrite the random nodes, thanks to an additional player.
102. A random node to be rewritten
Rewritten as follows:
Player 1 chooses a in [[0,N-1]]
Player 2 chooses b in [[0,N-1]]
c=(a+b) modulo N
Go to tc
Each player can force the game to be equivalent to
the initial one (by playing uniformly)
==> the proba of winning for player 1 (in case of perfect play)
is the same as for the initial game
==> undecidability!
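The key fact behind the rewriting is that if either player draws his number uniformly, c = (a+b) mod N is uniform whatever the opponent does, so each player can force the rewritten node to behave like the original random node. A quick check (N and the opponent's choices are arbitrary):

```python
from collections import Counter

N = 5
# whatever fixed b the opponent chooses, a uniform a makes c = (a+b) mod N uniform
for b in range(N):
    counts = Counter((a + b) % N for a in range(N))
    assert all(counts[c] == 1 for c in range(N))
```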
103. Important remark
Existence of a strategy for winning with
proba > 0.5
==> also undecidable for the
restriction to games in which the proba
is >0.6 or <0.4
==> not just a subtle
precision trouble.
116. ... or exploration ?
SCORE = 0/2 + k.sqrt( log(10)/2 )
(UCB: empirical mean + k·sqrt( log(total nb of sims) / nb of sims of this move ))
Binary win/loss games: no explo!
(Berthier, D., T., 2010)
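The score above is the standard UCB formula; a minimal version (k is the exploration constant):

```python
import math

def ucb_score(wins, sims, total_sims, k):
    # empirical mean plus an exploration bonus that shrinks with the visit count
    return wins / sims + k * math.sqrt(math.log(total_sims) / sims)

s = ucb_score(wins=0, sims=2, total_sims=10, k=1.0)   # the slide's 0/2 + k.sqrt(log(10)/2)
```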
117. Games vs pros in the game of Go:
First win in 9x9
First win over 5 games in 9x9 blind Go
First win with H2.5 (handicap) in 13x13 Go
First win with H6 in 19x19 Go
First win with H7 in 19x19 Go vs a top pro
118. ... or exploration ?
SCORE =
0/2
+ k.sqrt( log(10)/2 )
Simultaneous actions:
replace it with
EXP3 / INF
119. MCTS for simultaneous actions
Player 1 plays, then Player 2 plays, then both players play,
then Player 1 plays, then Player 2 plays, ...
120. MCTS for simultaneous actions
Player 1 plays ==> maxUCB node
Player 2 plays ==> minUCB node
Both players play ==> EXP3 node
Player 1 plays ==> maxUCB node
Player 2 plays ==> minUCB node
...
121. MCTS for hidden information
Player 1:
  observation set 1 ==> EXP3 node
  observation set 2 ==> EXP3 node
  observation set 3 ==> EXP3 node
Player 2:
  observation set 1 ==> EXP3 node
  observation set 2 ==> EXP3 node
  observation set 3 ==> EXP3 node
122. (Thanks Martin.)
(Incrementally + application to phantom-tic-tac-toe: see D. Auger 2010.)
123. EXP3 in one slide
Grigoriadis et al.; Auer et al.; Audibert & Bubeck, COLT 2009.
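EXP3 does fit in a few lines. A standard version following Auer et al.'s formulation with mixing parameter γ (the γ value, the class shape, and the toy 3-armed bandit below are assumptions):

```python
import math, random

class Exp3:
    """EXP3 with mixing parameter gamma (rewards must lie in [0, 1])."""
    def __init__(self, n_arms, gamma=0.1, rng=random):
        self.w, self.gamma, self.rng = [1.0] * n_arms, gamma, rng

    def probs(self):
        total, n = sum(self.w), len(self.w)
        return [(1 - self.gamma) * wi / total + self.gamma / n for wi in self.w]

    def draw(self):
        return self.rng.choices(range(len(self.w)), weights=self.probs())[0]

    def update(self, arm, reward):
        estimate = reward / self.probs()[arm]   # importance-weighted reward
        self.w[arm] *= math.exp(self.gamma * estimate / len(self.w))

rng = random.Random(0)
bandit = Exp3(n_arms=3, rng=rng)
for _ in range(2000):
    arm = bandit.draw()
    reward = 1.0 if rng.random() < (0.9 if arm == 2 else 0.2) else 0.0
    bandit.update(arm, reward)
best = max(range(3), key=lambda i: bandit.probs()[i])   # best == 2
```

The forced γ/n exploration is what makes the importance-weighted estimates bounded, and hence the regret guarantee holds even against an adversary.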
124. Monte-Carlo Tree Search
Application to Urban Rivals (simultaneous actions)
125. Let's have fun with Urban Rivals (4 cards)
Each player has
- four cards (each one can be used once)
- 12 pilz (each one can be used once)
- 12 life points
Each card has:
- one attack level
- one damage
- special effects (forget that...)
Four turns:
P1 attacks P2, P2 attacks P1,
P1 attacks P2, P2 attacks P1.
126. Let's have fun with Urban Rivals
First, the attacker plays:
- chooses a card
- chooses ( PRIVATELY ) a number of pilz
Attack level = attack(card) x (1 + nb of pilz)
Then, the defender plays:
- chooses a card
- chooses a number of pilz
Defense level = attack(card) x (1 + nb of pilz)
Result:
If attack > defense
  Defender loses Power(attacker's card) life points
Else
  Attacker loses Power(defender's card) life points
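The resolution rule above is easy to make concrete (the card attributes and values are made up for the illustration):

```python
def resolve_round(att_card, att_pilz, def_card, def_pilz):
    """Return (life lost by attacker, life lost by defender) for one attack."""
    attack = att_card["attack"] * (1 + att_pilz)
    defense = def_card["attack"] * (1 + def_pilz)
    if attack > defense:
        return 0, att_card["power"]     # defender loses Power(attacker's card)
    return def_card["power"], 0         # attacker loses Power(defender's card)

# hypothetical cards, for illustration only
a = {"attack": 6, "power": 4}
d = {"attack": 5, "power": 3}
assert resolve_round(a, 3, d, 4) == (3, 0)   # 6*4 = 24 <= 5*5 = 25: attacker loses 3
assert resolve_round(a, 4, d, 4) == (0, 4)   # 6*5 = 30 >  25: defender loses 4
```

Since the attacker's pilz count is private, the defender faces a simultaneous-move decision: this is exactly why the EXP3-style nodes of the previous slides apply here.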
127. Let's have fun with Urban Rivals
==> The MCTS-based AI is now at the best human level.
Experimental (only) remarks on EXP3:
- discarding strategies with a small number of sims = better approx of the Nash
- also an improvement by taking into account the other bandit
- virtual simulations (inspired by Kummer)
128. When is MCTS relevant ?
Robust in front of:
High dimension;
Non-convexity of Bellman values;
Complex models;
Delayed reward;
Simultaneous actions, partial information.
More difficult for:
High values of H (the horizon);
Model-free settings;
Highly unobservable cases (Monte-Carlo, but not Monte-Carlo Tree
Search, see Cazenave et al.);
Lack of a reasonable baseline for the MC.
129. (References: T., Dagstuhl 2010; D. Auger, EvoStar 2011;
unpublished results on undecidability; some endgames.)
130. Conclusion
Evo. opt.: robustness, tight bounds, simple
algorithmic modifications for better speed-ups (SA, 1/5th rule, (CSA)).
MCTS: just great (but requires a model); UCB
not necessary; extension to hidden info (rmk:
undecidability); PO endgames; but no abstraction power.
Noisy optimization: consider high noise. Use
QR and learning (in all EAs in fact).
Not mentioned here: multimodal, multiobjective, GP, bandits.
131. Future ?
- Solving semeais ? Would involve great AI progress I think...
- Noisy optimization; there are still things to be done.
==> Promoting high noise fitness functions even if it is less
publication-efficient.
- ``Inheritance'' of belief state in partially observable games.
Big progress to be done. Crucial for applications.
- Sparse bandits / mixed stochastic/adversarial cases.
Thanks for your attention.
Thanks to all collaborators for all I've learnt with them.
133. MCTS with hidden information: incremental version
While (there is time for thinking)
{
    s = initial state
    os(1) = ()    os(2) = ()
    while (s not terminal)
    {
        p = player(s)
        b = Exp3Bandit(os(p))
        d = b.makeDecision()
        (s,o) = transition(s,d)
        os(p) = os(p) + (o)    // append the new observation to p's history
    }
    send the reward to all bandits in the simulation
}
140. Possibly refine the family of bandits.
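The loop above can be run on a toy game. Below, matching pennies (one step, simultaneous moves), so each player keeps a single EXP3 bandit; the game, the EXP3 variant, and all parameters are assumptions of the sketch.

```python
import math, random

class Exp3:
    """A standard EXP3 bandit (rewards in [0, 1]); one per (player, observation set)."""
    def __init__(self, n_arms, gamma=0.05, rng=random):
        self.w, self.gamma, self.rng = [1.0] * n_arms, gamma, rng
    def probs(self):
        total, n = sum(self.w), len(self.w)
        return [(1 - self.gamma) * wi / total + self.gamma / n for wi in self.w]
    def draw(self):
        return self.rng.choices(range(len(self.w)), weights=self.probs())[0]
    def update(self, arm, reward):
        self.w[arm] *= math.exp(self.gamma * (reward / self.probs()[arm]) / len(self.w))

rng = random.Random(0)
bandits = {1: Exp3(2, rng=rng), 2: Exp3(2, rng=rng)}   # one observation set per player
avg, T = [0.0, 0.0], 5000
for _ in range(T):                        # "while there is time for thinking"
    d1, d2 = bandits[1].draw(), bandits[2].draw()
    r1 = 1.0 if d1 == d2 else 0.0         # player 1 wins iff the pennies match
    bandits[1].update(d1, r1)
    bandits[2].update(d2, 1.0 - r1)       # zero-sum: send the reward to all bandits
    p = bandits[1].probs()
    avg = [avg[0] + p[0] / T, avg[1] + p[1] / T]
# the average strategy drifts toward the uniform Nash equilibrium (1/2, 1/2)
```

The instantaneous strategies cycle, but the time-averaged play of two no-regret learners in a zero-sum game approaches the Nash equilibrium, which is exactly why EXP3 nodes are the right tool for the simultaneous-action and hidden-information cases above.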