Choosing between several options in uncertain environments
1. METAGAMING:
Bandits with simple regret and small budget
Chen-Wei Chou, Ping-Chiang Chou,
Chang-Shing Lee, David Lupien St-Pierre,
Olivier Teytaud, Mei-Hui Wang, Li-Wen Wu
and Shi-Jim Yen
2. Outline:
- What is a bandit problem?
- What is a strategic bandit problem?
- Is a strategic bandit different from a bandit?
- Algorithms
- Results
3. What is a bandit problem?
A finite number of time steps.
A (finite) number of options, each equipped with an (unknown) probability distribution.
At each time step:
- you choose one option
- you get a reward, drawn from that option's probability distribution
At the end:
- you choose one option (you cannot change it anymore...)
- your reward is the expected reward associated with this option
4.-8. What is a bandit problem? (the same slide, progressively annotated)
- The time steps are where we collect information and explore: the exploration phase.
- The final choice is where we use that information and take no risk: the recommendation phase.
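The exploration/recommendation structure above can be sketched as a minimal simple-regret bandit: uniform (round-robin) exploration, then Empirically Best Arm recommendation. The two Bernoulli arms with means 0.4 and 0.6 are hypothetical reward distributions, not from the paper:

```python
import random

rng = random.Random(42)

def simple_regret_bandit(arms, budget):
    """Uniform exploration over `arms` (reward samplers),
    then Empirically Best Arm recommendation."""
    counts = [0] * len(arms)
    sums = [0.0] * len(arms)
    for t in range(budget):          # exploration phase
        i = t % len(arms)            # round-robin = uniform allocation
        sums[i] += arms[i]()         # one stochastic reward
        counts[i] += 1
    # recommendation phase: pick the best empirical mean
    return max(range(len(arms)), key=lambda i: sums[i] / counts[i])

# two Bernoulli options with means (unknown to the algorithm) 0.4 and 0.6
arms = [lambda: float(rng.random() < 0.4),
        lambda: float(rng.random() < 0.6)]
best = simple_regret_bandit(arms, budget=1000)
```

With this budget the recommendation is the second option with overwhelming probability; the interesting regime in the paper is precisely when the budget is too small for uniform allocation to be safe.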
9. Which kind of bandit?
- In the bandit literature, options are also termed "arms".
- Here the criterion is the expected reward of the option chosen at the end, i.e. simple regret (sometimes the criterion is instead the sum of the rewards collected during exploration, i.e. cumulative regret).
- We presented here stochastic bandits (one probability distribution per option) ==> next slide is different.
10. And adversarial bandits?
A finite number of time steps.
A (finite) number of options for player 1, and a finite number of options for player 2.
An unknown probability distribution for each pair of options.
At each time step:
- you choose one option for P1 and one option for P2
- you get a reward, drawn from the corresponding probability distribution
At the end:
- you choose one **probabilistic** option for P1, i.e. a probability distribution over P1's options (you cannot change it anymore...)
- your reward is the expected reward of this choice against the worst choice by P2
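The final criterion above (expected reward against P2's worst-case response) can be sketched as follows; the matrix `M` of expected rewards is a hypothetical example, not data from the paper:

```python
# Hypothetical expected-reward matrix M[i][j]: P1 plays option i, P2 plays option j.
M = [[0.6, 0.3],
     [0.4, 0.5]]

def worst_case_value(p, M):
    """Expected reward of P1's mixed strategy p against P2's best response."""
    n_rows, n_cols = len(M), len(M[0])
    values = [sum(p[i] * M[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return min(values)            # P2 picks the column that is worst for P1

pure = worst_case_value([1.0, 0.0], M)    # a pure strategy can be exploited
mixed = worst_case_value([0.5, 0.5], M)   # mixing hedges against P2's choice
```

Here the pure strategy is worth 0.3 in the worst case while the 50/50 mixture is worth 0.4, which is why the recommendation must be probabilistic in the two-player setting.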
11. What is meta-gaming?
What is a "strategic choice"?
Strategic choices:
- decisions made once and for all, at a high level
- ≠ from the tactical level
Meta-gaming: a choice at the strategic level, in games:
- choosing cards, in card games
- choosing handicap positioning, in Go
==> made once and for all, at the beginning of the game
12. Example of a stochastic bandit
(i.e. a 1-player strategic choice)
Game of Go handicap bandit problem; at each time step:
- you choose one handicap positioning
- then you simulate one game from this position
==> only one player has a strategic choice
==> stochastic bandit
13. Example of an adversarial bandit
(i.e. a 2-player strategic choice)
Urban Rivals bandit problem; at each time step:
- you choose:
- one set of cards for you (P1)
- one set of cards for P2
- then you simulate one Urban Rivals game from this position
==> two players have a strategic choice
==> adversarial bandit
14. Is a strategic bandit problem different from a classical bandit problem?
No difference in nature; just a much smaller budget.
15. Algorithms
Reminder:
- two algorithms are needed:
- one for choosing during the N exploration steps
- one for the single recommendation step
- two settings:
- the one-player case
- the two-player case
16. Algorithms for exploration
Uniform: test all options uniformly.
Bernstein races:
- sample uniformly among non-discarded options,
- discard options with statistical tests.
Successive Reject:
- sample uniformly among non-discarded options,
- periodically discard the worst option.
UCB: choose the option with the best average result plus a bonus for weakly sampled options.
Adaptive-UCB-E: a variant of UCB aimed at removing hyper-parameters.
EXP3: the empirically best option plus a random perturbation (a distribution mixing exponential weights with uniform exploration).
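The UCB exploration rule above can be sketched as follows (a UCB1-style variant; the constant `c` and the Bernoulli arm means 0.2 and 0.8 are illustrative assumptions, not the paper's tuning):

```python
import math
import random

rng = random.Random(0)

def ucb_explore(arms, budget, c=math.sqrt(2)):
    """UCB1-style exploration: play the arm maximizing
    empirical mean + c * sqrt(log(t) / n_arm)."""
    counts = [0] * len(arms)
    sums = [0.0] * len(arms)
    for t in range(budget):
        if t < len(arms):
            i = t                 # play every arm once to initialize
        else:
            i = max(range(len(arms)),
                    key=lambda a: sums[a] / counts[a]
                                  + c * math.sqrt(math.log(t) / counts[a]))
        sums[i] += arms[i]()      # one stochastic reward
        counts[i] += 1
    return counts, sums           # statistics handed to the recommendation rule

# two Bernoulli arms with hypothetical means 0.2 and 0.8
arms = [lambda: float(rng.random() < 0.2),
        lambda: float(rng.random() < 0.8)]
counts, sums = ucb_explore(arms, budget=500)
```

The bonus term shrinks as an arm is sampled more, so the budget concentrates on the apparently best arm while weakly sampled arms keep being revisited.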
17. Algorithms for recommendation
Empirically Best Arm (EBA): choose the empirically best option.
Most Played Arm (MPA): choose the most simulated option.
Successive Reject: choose the only non-discarded option.
UCB: choose the option with the best average result plus a bonus for weakly sampled options.
LCB: choose the option with the best average result minus a malus for weakly sampled options.
Empirical Distribution of Play: an option has its frequency (during exploration) as its probability (for recommendation).
TEXP3: idem, but discard low-probability options.
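Three of the recommendation rules above can be sketched as functions of the exploration statistics (per-arm play counts and reward sums); the malus constant `c` and the toy statistics are illustrative assumptions:

```python
import math

def recommend(counts, sums, rule, c=1.0):
    """Recommendation rules on exploration statistics
    (a sketch; exact bonus/malus constants vary in the literature)."""
    n = len(counts)
    means = [sums[i] / counts[i] for i in range(n)]
    t = sum(counts)
    if rule == "EBA":    # Empirically Best Arm
        return max(range(n), key=lambda i: means[i])
    if rule == "MPA":    # Most Played Arm
        return max(range(n), key=lambda i: counts[i])
    if rule == "LCB":    # empirical mean minus an uncertainty malus
        return max(range(n),
                   key=lambda i: means[i] - c * math.sqrt(math.log(t) / counts[i]))
    raise ValueError(rule)

# toy statistics: arm 0 has a higher mean (0.7) but only 10 plays;
# arm 1 has mean 0.667 over 90 plays
counts, sums = [10, 90], [7.0, 60.0]
```

On these statistics EBA picks arm 0 (higher mean), while MPA and LCB both pick arm 1: LCB distrusts the weakly sampled arm, which is why it is a low-risk recommendation rule.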
23. Do you know Killall-Go?
Black has stones in advance (e.g. 8 in 13x13).
If White makes life, White wins.
If Black kills everything, Black wins.
Black chooses the stone positioning (a strategic decision).
24. Left: the human is Black and chooses E3 C4.
Right: the computer is Black and chooses D3 D5.
White won both games.
The human said that the computer's choice D3 D5 is good.
25. Killall-Go, H8 (left) and H9 (right)
Left: a human professional player (5P) as Black has 8 handicap stones; White (the computer) makes life and wins.
Right: a human professional player (5P) as Black has 9 handicap stones, kills everything, and wins.
26. CONCLUSIONS
1-player case:
UCB for exploration,
LCB or MPA for recommendation.
2-player case:
TEXP3 performs best.
Killall-Go:
Win against a pro with H2 in 7x7 Killall-Go as White.
Loss against a pro with H2 in 7x7 Killall-Go as Black.
13x13: the computer won as White with H8, lost with H9.
13x13: the computer lost as Black with H8 and with H9.
Further work:
Structured bandits: some options are close to each other.
Batoo: Go with a strategic choice for both players; a nice test case.
Industry: choosing investments for power grid simulations (in progress).