ADAPTIVE LEARNING IN GAMES
3/11/2010   Suvarup Saha
EECS 463 Course Project
Outline
2


     Motivation
     Games
     Learning in Games
     Adaptive Learning
       Example
     Gradient Techniques
     Conclusion


Motivation
3


     Adaptive filtering techniques generalize to many applications
     beyond filtering
       Gradient-based iterative search
       Stochastic gradient
       Least squares
     Applying game theory in less-than-rational multi-agent
     scenarios demands self-learning mechanisms
     Adaptive techniques can be applied in such settings to
     help the agents learn the game and play intelligently

Games
4


     A game is an interaction between two or more self-interested
     agents
     Each agent chooses a strategy si from a set of strategies, Si
     A (joint) strategy profile, s, is the set of chosen strategies, also
     called an outcome of the game in a single play
     Each agent has a utility function, ui(s), specifying their
     preference for each outcome in terms of a payoff
     An agent’s best response is the strategy with the highest
     payoff, given its opponents’ choice of strategy
     A Nash equilibrium is a strategy profile such that every
     agent’s strategy is a best response to others’ choice of strategy

A Normal Form Game
5

                                  B

                           b1           b2

                A     a1   4,4          5,2
                      a2   0,1          4,3


     This is a 2 player game with SA={a1,a2}, SB={b1,b2}
     The ui(s) are explicitly given in a matrix form, for
     example uA(a1, b2) = 5, uB(a1, b2) = 2
     The best response of A to B playing b2 is a1
     In this game, (a1, b1) is the unique Nash Equilibrium
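
The definitions above can be made concrete in a few lines of Python. The sketch below is an illustration only (the dictionary encoding and helper names are not from the slides); it encodes the game on this slide and finds its pure-strategy Nash equilibrium by brute-force best-response checks.

```python
from itertools import product

# The example game: payoffs keyed by the joint strategy profile (a, b)
U_A = {('a1', 'b1'): 4, ('a1', 'b2'): 5, ('a2', 'b1'): 0, ('a2', 'b2'): 4}
U_B = {('a1', 'b1'): 4, ('a1', 'b2'): 2, ('a2', 'b1'): 1, ('a2', 'b2'): 3}
S_A, S_B = ['a1', 'a2'], ['b1', 'b2']

def best_response_A(b):
    """A's best response to B playing b."""
    return max(S_A, key=lambda a: U_A[(a, b)])

def best_response_B(a):
    """B's best response to A playing a."""
    return max(S_B, key=lambda b: U_B[(a, b)])

# A pure-strategy Nash equilibrium: each strategy is a best response to the other
nash = [(a, b) for a, b in product(S_A, S_B)
        if a == best_response_A(b) and b == best_response_B(a)]
print(best_response_A('b2'), nash)   # a1, [('a1', 'b1')]
```
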
Learning in Games
6


     Classical Approach: Compute an optimal/equilibrium
     strategy
      Some criticisms of this approach are
       Other agents’ utilities might be unknown to an agent for
       computing an equilibrium strategy
       Other agents might not be playing an equilibrium strategy
       Computing an equilibrium strategy might be hard
      Another Approach: Learn how to ‘optimally’ play a game
     by
       playing it many times
       updating strategy based on experience
Learning Dynamics
7




     Three broad families, ordered along increasing
     rationality/sophistication of the agents:
       Evolutionary Dynamics
       Adaptive Learning  (focus of our discussion)
       Bayesian Learning




Evolutionary Dynamics
8

     Inspired by Evolutionary Biology with no appeal to
     rationality of the agents
     Entire population of agents all programmed to use some
     strategy
        Players are randomly matched to play with each other
     Strategies with high payoff spread within the population by
       Learning
       copying or inheriting strategies – Replicator Dynamics
       Infection
     Stability analysis – Evolutionarily Stable Strategies (ESS)
       Players playing an ESS must have strictly higher payoffs than a
       small group of invaders playing a different strategy

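
The slides do not write out the replicator equation, so the sketch below uses the standard discrete-time form as an assumed illustration; the payoff matrix is borrowed from the running example and the step size dt is arbitrary.

```python
import numpy as np

# Assumed standard replicator update: a strategy's population share grows in
# proportion to its payoff relative to the population average.
def replicator_step(x, A, dt=0.01):
    fitness = A @ x              # payoff of each pure strategy against the mix x
    average = x @ fitness        # average payoff in the population
    return x + dt * x * (fitness - average)

A = np.array([[4.0, 5.0],        # payoff matrix of the example game (row player)
              [0.0, 4.0]])
x = np.array([0.5, 0.5])         # initial population shares of a1 and a2
for _ in range(2000):
    x = replicator_step(x, A)
print(x)                         # the share of a1 grows toward 1
```
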
Bayesian Learning
9


     Assumes ‘informed agents’ playing repeated games
     with a finite action space
     Payoffs depend on some characteristics of agents
     represented by types – each agent’s type is private
     information
     The agents’ initial beliefs are given by a common prior
     distribution over agent types
     This belief is updated according to Bayes’ Rule to a
     posterior distribution with each stage of the game.
     In every finite Bayesian game, there is at least one
     Bayesian Nash equilibrium, possibly in mixed strategies

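
As a small illustration of the Bayes' Rule update described above; the opponent types, actions, and likelihoods below are invented for the sketch and are not part of the slides.

```python
# Illustration only: a two-type belief about the opponent, updated by Bayes'
# rule after observing one of the opponent's actions.
def bayes_update(prior, likelihood, observed_action):
    """prior: {type: prob}; likelihood: {type: {action: prob}}."""
    unnormalized = {t: prior[t] * likelihood[t][observed_action] for t in prior}
    total = sum(unnormalized.values())
    return {t: p / total for t, p in unnormalized.items()}

prior = {'aggressive': 0.5, 'cautious': 0.5}
likelihood = {'aggressive': {'raise': 0.8, 'fold': 0.2},
              'cautious':   {'raise': 0.3, 'fold': 0.7}}
posterior = bayes_update(prior, likelihood, 'raise')
print(posterior)   # belief shifts toward the 'aggressive' type (~0.73)
```
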
Adaptive Learning
10

      Agents are not fully rational, but can learn through
      experience and adapt their strategies
      Agents do not know the reward structure of the game
      Agents are only able to take actions and observe their own
      rewards (or opponents’ rewards as well)
      Popular Examples
        Best Response Update
        Fictitious Play
        Regret Matching
        Infinitesimal Gradient Ascent (IGA)
        Dynamic Gradient Play
        Adaptive Play Q-learning

Fictitious Play
11


      The learning process is used to develop a ‘historical
      distribution’ of the other agents’ play
      In fictitious play, agent i has an exogenous initial weight
      function ki0: S-i → R+
      The weight is updated by adding 1 to the weight of each
      opponent strategy, each time it is played
      The probability that player i assigns to player -i
      playing s-i at date t is given by
                  qit(s-i) = kit(s-i) / Σs'-i kit(s'-i)
      The ‘best response’ of agent i in this fictitious play is
      given by
                 sit+1 = arg maxsi Σs-i qit(s-i) ui(si, s-i)

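
A minimal sketch of the two formulas on this slide for player A in the example 2x2 game; the weight snapshot k_A is an arbitrary choice for illustration.

```python
# Player A's belief and best response from its opponent-strategy weights.
k_A = {'b1': 1.0, 'b2': 2.0}                       # A's weights on B's strategies
u_A = {('a1', 'b1'): 4, ('a1', 'b2'): 5,
       ('a2', 'b1'): 0, ('a2', 'b2'): 4}           # A's payoffs from the example game

q_A = {s: k / sum(k_A.values()) for s, k in k_A.items()}        # q_A^t(s_-i)
expected = {a: sum(q_A[b] * u_A[(a, b)] for b in k_A)           # Σ q_A(s_-i) u_A(a, s_-i)
            for a in ('a1', 'a2')}
best_response = max(expected, key=expected.get)                 # arg max over A's strategies
print(q_A, expected, best_response)    # a1 is the best response to this belief
```
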
An Example
12

      Consider the same 2x2 game example as before

                                 B
                            b1        b2
                  A    a1   4,4       5,2
                       a2   0,1       4,3

      Suppose we assign
          kA0(b1) = kA0(b2) = kB0(a1) = kB0(a2) = 1
      Then, qA0(b1) = qA0(b2) = qB0(a1) = qB0(a2) = 0.5
      For A, if A chooses a1
            qA0(b1)uA(a1, b1) + qA0(b2)uA(a1, b2) = .5*4 + .5*5 = 4.5
      while if A chooses a2
            qA0(b1)uA(a2, b1) + qA0(b2)uA(a2, b2) = .5*0 + .5*4 = 2
      For B, if B chooses b1
            qB0(a1)uB(a1, b1) + qB0(a2)uB(a2, b1) = .5*4 + .5*1 = 2.5
      while if B chooses b2
            qB0(a1)uB(a1, b2) + qB0(a2)uB(a2, b2) = .5*2 + .5*3 = 2.5
      Clearly, A plays a1; B can choose either b1 or b2; assume B plays b2

Game proceeds.
13



      stage                0
      A’s selection        a1
      B’s selection        b2
      A’s payoff           5
      B’s payoff           2
      kAt(b1), qAt(b1)     1, 0.5     1, 0.33
      kAt(b2), qAt(b2)     1, 0.5     2, 0.67
      kBt(a1), qBt(a1)     1, 0.5     2, 0.67
      kBt(a2), qBt(a2)     1, 0.5     1, 0.33


Game proceeds..
14



      stage                0          1
      A’s selection        a1         a1
      B’s selection        b2         b1
      A’s payoff           5          4
      B’s payoff           2          4
      kAt(b1), qAt(b1)     1, 0.5     1, 0.33    2, 0.5
      kAt(b2), qAt(b2)     1, 0.5     2, 0.67    2, 0.5
      kBt(a1), qBt(a1)     1, 0.5     2, 0.67    3, 0.75
      kBt(a2), qBt(a2)     1, 0.5     1, 0.33    1, 0.25


Game proceeds…
15



      stage                0          1          2
      A’s selection        a1         a1         a1
      B’s selection        b2         b1         b1
      A’s payoff           5          4          4
      B’s payoff           2          4          4
      kAt(b1), qAt(b1)     1, 0.5     1, 0.33    2, 0.5     3, 0.6
      kAt(b2), qAt(b2)     1, 0.5     2, 0.67    2, 0.5     2, 0.4
      kBt(a1), qBt(a1)     1, 0.5     2, 0.67    3, 0.75    4, 0.8
      kBt(a2), qBt(a2)     1, 0.5     1, 0.33    1, 0.25    1, 0.2


Game proceeds….
16



      stage                0          1          2          3
      A’s selection        a1         a1         a1         a1
      B’s selection        b2         b1         b1         b1
      A’s payoff           5          4          4          4
      B’s payoff           2          4          4          4
      kAt(b1), qAt(b1)     1, 0.5     1, 0.33    2, 0.5     3, 0.6     4, 0.67
      kAt(b2), qAt(b2)     1, 0.5     2, 0.67    2, 0.5     2, 0.4     2, 0.33
      kBt(a1), qBt(a1)     1, 0.5     2, 0.67    3, 0.75    4, 0.8     5, 0.83
      kBt(a2), qBt(a2)     1, 0.5     1, 0.33    1, 0.25    1, 0.2     1, 0.17


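
The table above can be reproduced with a short simulation. The sketch below follows the slides' tie-breaking assumption that B plays b2 in the tied first stage (no further ties occur afterwards).

```python
import numpy as np

R = np.array([[4.0, 5.0], [0.0, 4.0]])   # A's payoffs: rows a1,a2; cols b1,b2
C = np.array([[4.0, 2.0], [1.0, 3.0]])   # B's payoffs
k_A, k_B = np.ones(2), np.ones(2)        # initial weights, all set to 1

for stage in range(4):
    q_A, q_B = k_A / k_A.sum(), k_B / k_B.sum()   # beliefs at the start of the stage
    a = int(np.argmax(R @ q_A))                   # A's best response to its belief
    v_B = q_B @ C                                 # B's expected payoff for b1, b2
    b = 1 if stage == 0 and v_B[0] == v_B[1] else int(np.argmax(v_B))
    k_A[b] += 1.0                                 # each player records the opponent's play
    k_B[a] += 1.0
    print(stage, ('a1', 'a2')[a], ('b1', 'b2')[b],
          'kA =', k_A, 'qA =', np.round(k_A / k_A.sum(), 2),
          'kB =', k_B, 'qB =', np.round(k_B / k_B.sum(), 2))
# Output matches the table: after stage 0 play settles on (a1, b1).
```
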
Gradient Based Learning
17


      Fictitious Play assumes unbounded computation is
      allowed in every step – arg max calculation
      An alternative is to proceed in gradient ascent on some
      objective function – expected payoff
      Two players – row and column – have payoff matrices
             R = [ r11  r12 ]       C = [ c11  c12 ]
                 [ r21  r22 ]           [ c21  c22 ]
      The row player chooses action 1 with probability α while the
      column player chooses action 1 with probability β
      Expected payoffs are
            Vr(α, β) = r11 αβ + r12 α(1 − β) + r21 (1 − α)β + r22 (1 − α)(1 − β)
            Vc(α, β) = c11 αβ + c12 α(1 − β) + c21 (1 − α)β + c22 (1 − α)(1 − β)
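
A direct translation of the two expected-payoff expressions, evaluated on the running example game; the function name V and the test point (0.5, 0.5) are arbitrary choices.

```python
def V(M, alpha, beta):
    # Expected payoff of 2x2 game M when the row player puts probability alpha
    # on action 1 and the column player puts probability beta on action 1.
    return (M[0][0] * alpha * beta + M[0][1] * alpha * (1 - beta)
            + M[1][0] * (1 - alpha) * beta + M[1][1] * (1 - alpha) * (1 - beta))

R = [[4.0, 5.0], [0.0, 4.0]]   # row player's payoffs from the example game
C = [[4.0, 2.0], [1.0, 3.0]]   # column player's payoffs
print(V(R, 0.5, 0.5), V(C, 0.5, 0.5))   # 3.25 2.5 under uniform mixing
```
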
Gradient Ascent
18


      Each player repeatedly adjusts her half of the current strategy
      pair in the direction of the current gradient with some step size η
                  αk+1 = αk + η ∂Vr(αk, βk)/∂α
                  βk+1 = βk + η ∂Vc(αk, βk)/∂β
      In case an update takes a strategy outside the probability
      simplex, it is projected back to the boundary
      The gradient ascent algorithm assumes a full-information game –
      both players know the game matrices and can see the mixed
      strategy of their opponent in the previous step
      With
            u  = (r11 + r22) − (r21 + r12)
            u' = (c11 + c22) − (c21 + c12)
      the partial derivatives are
            ∂Vr(α, β)/∂α = βu − (r22 − r12)
            ∂Vc(α, β)/∂β = αu' − (c22 − c21)
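
A sketch of the projected gradient-ascent loop using the closed-form partial derivatives above, again on the example game; the step size η = 0.05, the iteration count, and the clipping-based projection to [0, 1] are illustrative choices.

```python
R = [[4.0, 5.0], [0.0, 4.0]]   # example game, row player
C = [[4.0, 2.0], [1.0, 3.0]]   # example game, column player
u  = (R[0][0] + R[1][1]) - (R[1][0] + R[0][1])   # (r11 + r22) - (r21 + r12)
up = (C[0][0] + C[1][1]) - (C[1][0] + C[0][1])   # (c11 + c22) - (c21 + c12)

def clip(p):                   # project back onto [0, 1]
    return min(1.0, max(0.0, p))

alpha, beta, eta = 0.5, 0.5, 0.05
for _ in range(200):
    d_alpha = beta * u - (R[1][1] - R[0][1])     # ∂Vr/∂α
    d_beta  = alpha * up - (C[1][1] - C[1][0])   # ∂Vc/∂β
    alpha, beta = clip(alpha + eta * d_alpha), clip(beta + eta * d_beta)
print(alpha, beta)   # ends at (1, 1): all mass on action 1, the (a1, b1) equilibrium
```
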
Infinitesimal Gradient Ascent
19

      Interesting to see what happens to the strategy pair and to the
      expected payoffs over time
      Strategy pair sequence produced by following a gradient ascent
      algorithm may never converge
      Average payoff of both the players always converges to that of some
      Nash pair
      Consider a small step size assumption – limη→0 – so that the update
      equations become
            [ ∂α/∂t ]   [ 0   u ] [ α ]   [ −(r22 − r12) ]
            [ ∂β/∂t ] = [ u'  0 ] [ β ] + [ −(c22 − c21) ]
      Point where the gradient is zero – Nash Equilibrium
            (α*, β*) = ( (c22 − c21)/u' , (r22 − r12)/u )
      This point might even lie outside the probability simplex.

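
Evaluating the zero-gradient point for the example game illustrates the final remark: the point can fall outside the probability simplex.

```python
# The zero-gradient point (α*, β*) from the formula above, for the example game.
R = [[4.0, 5.0], [0.0, 4.0]]
C = [[4.0, 2.0], [1.0, 3.0]]
u  = (R[0][0] + R[1][1]) - (R[1][0] + R[0][1])   # u  = 3
up = (C[0][0] + C[1][1]) - (C[1][0] + C[0][1])   # u' = 4
alpha_star = (C[1][1] - C[1][0]) / up            # 0.5
beta_star  = (R[1][1] - R[0][1]) / u             # -1/3, i.e. outside [0, 1]
print(alpha_star, beta_star)
```
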
IGA dynamics
20


      Denote the off-diagonal matrix containing u and u’ by U
      Depending on the nature of U (noninvertible, real or imaginary
      eigenvalues), the convergence dynamics will vary




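
A compact way to state the case analysis: the eigenvalues of U = [[0, u], [u', 0]] are ±√(u·u'), so the sign of u·u' (or u·u' = 0) determines the regime. The labels in the sketch below paraphrase the standard IGA analysis and are not quoted from the slides.

```python
# Assumed summary of the IGA case analysis via the eigenvalues of U.
def classify(u, u_prime):
    if u == 0 or u_prime == 0:
        return 'U not invertible (a zero eigenvalue)'
    if u * u_prime > 0:
        return 'real eigenvalues of opposite sign (saddle-like trajectories)'
    return 'purely imaginary eigenvalues (trajectories orbit the center point)'

print(classify(3.0, 4.0))    # the example game: u = 3, u' = 4
print(classify(1.0, -1.0))   # a game with u*u' < 0
```
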
WoLF - W(in)-o(r)-L(earn)-Fast
21


      Introduces a variable learning rate in place of a fixed η
                  αk+1 = αk + η lkr ∂Vr(αk, βk)/∂α
                  βk+1 = βk + η lkc ∂Vc(αk, βk)/∂β
      Let αe be the equilibrium strategy selected by the row player
      and βe be the equilibrium strategy selected by the column player
            lkr = lmin  if Vr(αk, βk) > Vr(αe, βk)   → Winning
                  lmax  otherwise                    → Losing
            lkc = lmin  if Vc(αk, βk) > Vc(αk, βe)   → Winning
                  lmax  otherwise                    → Losing
      If, in a two-person, two-action, iterated general-sum game, both
      players follow the WoLF-IGA algorithm (with lmax > lmin), then their
      strategies will converge to a Nash equilibrium
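
One WoLF-IGA step can be written directly from the rule above. The sketch assumes, as the slide does, that the equilibrium strategies αe and βe are known (here both equal to 1, the (a1, b1) equilibrium of the example game); η, lmin, and lmax are illustrative values.

```python
# Sketch of the WoLF-IGA update: learn slowly (l_min) when winning,
# fast (l_max) when losing.
def wolf_iga_step(alpha, beta, alpha_e, beta_e, R, C, eta, l_min, l_max):
    V = lambda M, a, b: (M[0][0]*a*b + M[0][1]*a*(1-b)
                         + M[1][0]*(1-a)*b + M[1][1]*(1-a)*(1-b))
    u  = (R[0][0] + R[1][1]) - (R[1][0] + R[0][1])
    up = (C[0][0] + C[1][1]) - (C[1][0] + C[0][1])
    l_r = l_min if V(R, alpha, beta) > V(R, alpha_e, beta) else l_max
    l_c = l_min if V(C, alpha, beta) > V(C, alpha, beta_e) else l_max
    d_alpha = beta * u - (R[1][1] - R[0][1])      # ∂Vr/∂α
    d_beta  = alpha * up - (C[1][1] - C[1][0])    # ∂Vc/∂β
    clip = lambda p: min(1.0, max(0.0, p))
    return clip(alpha + eta * l_r * d_alpha), clip(beta + eta * l_c * d_beta)

R = [[4.0, 5.0], [0.0, 4.0]]
C = [[4.0, 2.0], [1.0, 3.0]]
alpha = beta = 0.5
for _ in range(500):
    alpha, beta = wolf_iga_step(alpha, beta, 1.0, 1.0, R, C,
                                eta=0.05, l_min=0.1, l_max=1.0)
print(alpha, beta)   # approaches (1, 1), i.e. the (a1, b1) equilibrium
```
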
WoLF-IGA convergence
22




To Conclude
23


      Learning in games is popular in anticipation of a future in
      which less than rational agents play a game repeatedly to
      arrive at a stable and efficient equilibrium.
      The algorithmic structure and adaptive techniques involved in
      such learning are largely motivated by Machine Learning and
      Adaptive Filtering
      A gradient-based approach relieves the computational burden of the
      arg max step in fictitious play but might suffer from convergence issues
      A stochastic gradient method (not discussed in the presentation)
      makes use of minimal information available and still performs
      near-optimally

