Online learning & adaptive game playing
SAEID GHAFOURI
MOHAMMAD SABER POURHEYDARI
Definition
 Roughly speaking, online methods are those that process one datum at a time
 But
 This is a somewhat vague definition!
 Online learning is generally described as doing machine learning in a streaming data setting, i.e. training a model in consecutive rounds
Scope of online methods
 there is too much data to fit into memory, or
 the application is itself inherently online.
We have seen online methods before!
 Some of the oldest algorithms in machine learning are, in fact, online.
 Like:
 perceptrons
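The perceptron is exactly this one-example-at-a-time pattern: predict, check, and update only on mistakes. A minimal sketch, assuming NumPy feature vectors and labels in {-1, +1} (illustrative, not from the slides):

```python
import numpy as np

class OnlinePerceptron:
    """Mistake-driven online perceptron: one (x, y) example per round, y in {-1, +1}."""

    def __init__(self, n_features, lr=1.0):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return 1 if np.dot(self.w, x) + self.b >= 0 else -1

    def update(self, x, y):
        # Only adjust the weights when the prediction was wrong.
        if self.predict(x) != y:
            self.w += self.lr * y * np.asarray(x)
            self.b += self.lr * y
```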
Basic structure of online algorithms
 At the beginning of each round the algorithm is presented with an input sample and must make a prediction
 The algorithm then verifies whether its prediction was correct or incorrect, and feeds this information back into the model for subsequent rounds
Online vs Offline
Online vs Offline (cont.)
Pseudocode
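A minimal Python sketch of the round-based loop described under "Basic structure of online algorithms"; the `model` and `stream` objects are placeholders, assumed to expose `predict`/`update` and to yield labelled examples one at a time:

```python
def online_learning_loop(model, stream):
    """Generic round-based online loop: predict, observe the truth, update.

    `model` is assumed to expose predict(x) and update(x, y); `stream` is any
    iterable yielding (x, y) pairs one at a time. Returns the mistake count.
    """
    mistakes = 0
    for x, y in stream:
        y_hat = model.predict(x)   # 1. receive an input sample and make a prediction
        if y_hat != y:             # 2. verify whether the prediction was correct
            mistakes += 1
        model.update(x, y)         # 3. feed the feedback back into the model
    return mistakes
```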
Main concepts
 The main objective of online learning algorithms is to minimize the regret.
 The regret is the difference between the performance of:
 ● the online algorithm
 ● an ideal algorithm that has been able to train on all of the data seen so far, in batch fashion
 In other words, the main objective of an online machine learning algorithm is to perform as closely to the corresponding offline algorithm as possible. This is measured by the regret.
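For reference, a standard formulation of the regret after T rounds (conventional notation, not taken from the slides): with per-round losses ℓ_t, the online algorithm's choices w_t are compared against the best fixed hypothesis chosen in hindsight.

```latex
\mathrm{Regret}_T \;=\; \sum_{t=1}^{T} \ell_t(w_t) \;-\; \min_{w \in \mathcal{W}} \sum_{t=1}^{T} \ell_t(w)
```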
General use cases
 Online algorithms are useful in at least two scenarios:
 ● When your data is too large to fit in memory
 ○ So you need to train your model one example at a time
 ● When new data is constantly being generated, and/or is dependent upon time
Applications
 ● Real-time Recommendation
 ● Fraud Detection
 ● Spam detection
 ● Portfolio Selection
 ● Online ad placement
Types of Targets
 There are two main ways to think about an online learning problem, as far as the target functions (that we are trying to learn) are concerned:
 ● Stationary Targets
 ○ The target function you are trying to learn does not change over time (but may be stochastic)
 ● Dynamic Targets
 ○ The process that is generating input sample data is assumed to be non-stationary (i.e. it may change over time)
 ○ The process may even be adapting to your model (i.e. in an adversarial manner)
Example I: Stationary Targets
Example II: Dynamic Targets
Example II: Dynamic Targets (cont.)
Approaches
 Several approaches have been proposed in the literature:
 ● Online Learning from Expert Advice
 ● Online Learning from Examples
 ● General algorithms that may also be used in the online setting
Online vs. Reinforcement Learning
 Both make use of continuous feedback and newly arriving data.
 What are the differences?
Online vs. Reinforcement Learning
 Online learning can be supervised, unsupervised, or semi-supervised
 The complexities of online learning are:
 1. large data size
 2. moving timeframe
 Main goal: we are interested in “the answer”
Online vs. Reinforcement Learning
 The ultimate goal of reinforcement learning is reward optimization
 The complexities of reinforcement learning are:
 1. The objective might not be clear
 2. The reward might change
 3. Unexpected outputs
 Main goal: we are interested in “the process of getting to an answer”
 (The policy of rewarding)
Online vs. Reinforcement Learning
 Online learning vs. batch learning
 Online learning: looking at an example just once
 Batch learning: using the whole data set for training
 Q-learning: Online reinforcement
 Fitted-Q or LSPI: Batch reinforcement (not online)
 K-nearest-neighbor: Online (not reinforcement)
Online vs. Reinforcement Learning
 Conclusion
 Online learning: looking at each example only once and predicting the answer
 Reinforcement learning: you have a string of events that eventually results in something that is either good or bad.
 If the result is good, then the entire string of actions leading up to it is reinforced; if it is bad, the actions are penalized.
Further reading
 6.883: Online Methods in Machine Learning, MIT
 CSE599s: Online Learning, University of Washington
Adaptive learning
 Adaptive learning is an educational method which uses computers as interactive teaching devices and orchestrates the allocation of human and mediated resources according to the unique needs of each learner.
Adaptive vs traditional
 Adaptive learning has been partially driven by the realization that tailored learning cannot be achieved on a large scale using traditional, non-adaptive approaches.
Adaptive learning models
 Expert model – The model with the information which is to be taught
 Student model – The model which tracks and learns about the student
 Instructional model – The model which actually conveys the information
 Instructional environment – The user interface for interacting with the system
Adaptive Game Playing
 Traditionally, AI research for games has focused on developing static
strategies—fixed maps from the game state to a set of actions—which
maximize the probability of victory.
 Good for combinatorial games like chess and checkers
 While these algorithms can often play effectively, they tend to become repetitive and transparent to the players of the game, because their strategies are fixed.
Adaptive Game Playing
 Even the most advanced AI algorithms for complex games often have a hole in
their programming, a simple strategy which can be repeated over and over to
remove the challenge of the AI opponent.
 Because their algorithms are static and inflexible, these AIs are incapable of
adapting to these strategies and restoring the balance of the game.
Adaptive Game Playing
 In addition, many games simply do not take well to these sorts of algorithms.
Fighting games, such as Street Fighter and Virtua Fighter, often have no
objectively correct move in any given situation. Rather, any of the available
moves could be good or bad, depending on the opponent’s reaction
 A better AI would have to be able to adapt to the player and model his or her actions, in order to predict and react to them.
Goals
 The original goal is to produce an AI which learns online, allowing it to adapt its fighting style mid-game if a player decides to try a new technique.
 This is preferable to an offline algorithm that can only adjust its behavior
between games using accumulated data from previous play sessions.
 Such an AI would need to be able to learn a new strategy with a very small
number of training samples, and it would need to do so fast enough to avoid
causing the game to lag.
 Our first challenge was to find an algorithm with the learning speed and
computational efficiency necessary to adapt online.
Goals
 Since the goal of a video game is to entertain, an AI opponent should exhibit a variety of different behaviors, so as to avoid being too repetitive.
 The AI should not be too difficult to beat; ideally it would adjust its difficulty
according to the perceived skill level of the player.
Early Implementations
 Because implementing an algorithm for a complicated fighting game is very
difficult, given the number of possible moves and game states at any point in
time, we designed our first algorithms for the game rock-paper-scissors. This
game abstracts all of the complexity of a fighting game down to just the question
of predicting your opponent based on previous moves.
Naïve Bayes
 For our implementation of naive Bayes, we used the multinomial event model, treating a series of n rounds as an n-dimensional feature vector, where each feature x_j contains the moves (R, P, or S) made by the human and computer in the j-th round. For instance, if both the computer and human choose R on round j, then x_j = <R, R>. We classify a feature vector as R, P, or S according to the opponent's next move (in round n + 1). We then calculate the maximum likelihood estimates of φ_{<R,R>|R}, φ_{<R,P>|R}, φ_{<S,R>|R}, etc., as well as φ_R, φ_P, and φ_S.
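A hedged sketch of this counting scheme in Python, assuming Laplace smoothing and dictionary-based counts; the class and method names are illustrative, not the authors' code. The φ estimates above correspond to the smoothed ratios computed in `predict`:

```python
import math
from collections import defaultdict

MOVES = ("R", "P", "S")
PAIRS = [(h, c) for h in MOVES for c in MOVES]   # 9 possible (human, computer) pairs

class RPSNaiveBayes:
    """Multinomial-style naive Bayes: each of the last n rounds contributes one
    (human, computer) move pair; the class is the opponent's move in round n + 1."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n
        self.alpha = alpha                     # Laplace smoothing constant
        self.class_counts = defaultdict(int)   # counts per next-move class (phi_R, phi_P, phi_S)
        self.pair_counts = defaultdict(int)    # counts of (class, pair), i.e. phi_{pair | class}

    def train(self, history, next_move):
        """history: last n (human, computer) pairs; next_move: opponent's move in round n + 1."""
        self.class_counts[next_move] += 1
        for pair in history[-self.n:]:
            self.pair_counts[(next_move, pair)] += 1

    def predict(self, history):
        """Return the opponent move with the highest (smoothed) posterior probability."""
        total = sum(self.class_counts.values()) or 1
        best, best_score = "R", float("-inf")
        for c in MOVES:
            score = math.log((self.class_counts[c] + self.alpha) / (total + self.alpha * len(MOVES)))
            denom = sum(self.pair_counts.get((c, p), 0) for p in PAIRS) + self.alpha * len(PAIRS)
            for pair in history[-self.n:]:
                score += math.log((self.pair_counts.get((c, pair), 0) + self.alpha) / denom)
            if score > best_score:
                best, best_score = c, score
        return best
```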
Markov Decision Process
 To apply reinforcement learning to rock-paper-scissors, we defined a state to be
the outcomes of the previous n rounds (the same way we defined a feature
vector in the other algorithms). The reward function evaluated at a state was
equal to 1 if the nth round was a win, 0 if it was a tie, and -1 if it was a loss. The
set of actions, obviously, consisted of R, P and S.
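The slides describe a model-based MDP over this state/reward/action design. As a rough illustration of the same ingredients, here is a model-free tabular Q-learning sketch instead; the hyperparameters and helper names are assumptions, not the authors' implementation.

```python
import random
from collections import defaultdict

MOVES = ("R", "P", "S")
BEATS = {"R": "S", "P": "R", "S": "P"}   # key beats value

def round_reward(ai_move, opp_move):
    """+1 for a win, 0 for a tie, -1 for a loss (from the AI's point of view)."""
    if ai_move == opp_move:
        return 0
    return 1 if BEATS[ai_move] == opp_move else -1

class RPSQLearner:
    """Tabular Q-learning where the state is the outcome of the previous n rounds."""

    def __init__(self, n=2, alpha=0.3, gamma=0.9, epsilon=0.1):
        self.n, self.alpha, self.gamma, self.epsilon = n, alpha, gamma, epsilon
        self.Q = defaultdict(float)   # Q[(state, action)]
        self.state = tuple()          # last n (ai_move, opp_move) pairs

    def act(self):
        if random.random() < self.epsilon:   # occasional exploration
            return random.choice(MOVES)
        return max(MOVES, key=lambda a: self.Q[(self.state, a)])

    def observe(self, ai_move, opp_move):
        r = round_reward(ai_move, opp_move)
        next_state = (self.state + ((ai_move, opp_move),))[-self.n:]
        best_next = max(self.Q[(next_state, a)] for a in MOVES)
        # Standard Q-learning update toward r + gamma * max_a Q(s', a).
        key = (self.state, ai_move)
        self.Q[key] += self.alpha * (r + self.gamma * best_next - self.Q[key])
        self.state = next_state
```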
N-Gram
 To make a decision, our n-gram algorithm considers the last n moves made by
each player and searches for that string of n moves in the history of moves. If it
has seen those moves before, it makes its prediction based on the move that
most frequently followed that sequence.
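A small sketch of this predictor; instead of re-scanning the history on every decision, it keeps a running table of which move followed each length-n context, which is equivalent. Names and the fallback behaviour are illustrative:

```python
from collections import defaultdict, Counter

class NGramPredictor:
    """Predict the opponent's next move as the move that most often followed
    the last n (human, computer) move pairs seen so far."""

    def __init__(self, n=3):
        self.n = n
        self.followers = defaultdict(Counter)   # length-n context -> counts of the next human move
        self.history = []                       # full list of (human, computer) pairs

    def record(self, human_move, computer_move):
        context = tuple(self.history[-self.n:])
        if len(context) == self.n:
            self.followers[context][human_move] += 1
        self.history.append((human_move, computer_move))

    def predict(self):
        context = tuple(self.history[-self.n:])
        counts = self.followers.get(context)
        if not counts:
            return None   # unseen context: caller falls back, e.g. to a random move
        return counts.most_common(1)[0][0]
```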
Final Implementation
 Players have access to a total of 7 moves: left, right, punch, kick, block, throw, and wait. Each move is characterized by up to 4 parameters: strength, range, lag, and hit-stun. The moves interact in a similar fashion to those of rock-paper-scissors: blocks beat attacks, attacks beat throws, and throws beat blocks, but the additional parameters make these relationships more complicated. The objective is to use these moves to bring your opponent's life total to 0, while keeping yours above 0.
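To make the move set concrete, a tiny data-structure sketch; the numeric parameter values are placeholders for illustration, not the game's actual tuning:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Move:
    name: str
    strength: int = 0   # damage dealt on hit
    range: int = 0      # how far the move reaches
    lag: int = 0        # recovery frames during which no new action can start
    hit_stun: int = 0   # frames the opponent is frozen when hit

# Placeholder parameter values, purely for illustration.
MOVE_SET = [
    Move("left"), Move("right"), Move("wait"),
    Move("punch", strength=2, range=1, lag=3, hit_stun=2),
    Move("kick", strength=3, range=2, lag=5, hit_stun=3),
    Move("block", lag=1),
    Move("throw", strength=2, range=1, lag=4),
]
```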
Final Implementation
 The most difficult decision in porting our reinforcement learning algorithm to an
actual fighting game was how to define a state.
 Including a history of the most recent moves performed by each player led to a very slow learning speed
 We only kept track of the change in life totals from the previous frame, the
distance between the two players, and the opponent's remaining lag. We
ignored the AI's lag, since the AI can only perform actions during frames where
its lag is 0. We defined the reward function for a state as the change in life
totals since the last frame. For example, if the AI's life falls from 14 to 13, while the
opponent's life falls from 15 to 12, the reward is (13 - 14) - (12 - 15) = 2.
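A small sketch of this compact state and the per-frame reward (field names are illustrative); the final assertion reproduces the slide's worked example:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FightState:
    """Compact state: only the quantities the slide says were tracked."""
    ai_life_delta: int    # change in the AI's life since the previous frame
    opp_life_delta: int   # change in the opponent's life since the previous frame
    distance: int         # distance between the two players
    opp_lag: int          # opponent's remaining lag in frames

def frame_reward(ai_prev, ai_now, opp_prev, opp_now):
    """Reward = (AI's life change) - (opponent's life change)."""
    return (ai_now - ai_prev) - (opp_now - opp_prev)

# The slide's worked example: (13 - 14) - (12 - 15) = 2
assert frame_reward(14, 13, 15, 12) == 2
```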
Improvement of Implementation
 Suppose that the MDP has been playing against a player for several rounds, and in state S1 it has taken action A 1000 times, 100 of which resulted in a transition to state S2, yielding an estimated transition probability of P(S2 | S1, A) = 0.1. Now suppose the opponent shifts to a new strategy such that the probability of transitioning to S2 on A is 0.7. Clearly it will take many more visits to S1 before the estimated probability converges to 0.7. To address this, our MDP continually shrinks its store of previous transitions to a constant size, so that data contributed by very old transitions grows less and less significant with each rescaling. In our example, this might mean shrinking our count of action A down to 50 and our counts of transitions to S2 on A down to 5 (maintaining the old probability of 0.1). This way, it will take far fewer new transitions to increase the probability to 0.7.
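A hedged sketch of the rescaling idea: whenever the visit count for a (state, action) pair exceeds a cap, every stored count for that pair is scaled down by a constant factor, which preserves the estimated probabilities while letting new observations move them faster. The cap, factor, and data structures are assumptions, not the authors' implementation.

```python
from collections import defaultdict

class TransitionCounts:
    """Empirical transition counts that are periodically shrunk so old data decays."""

    def __init__(self, cap=100, factor=0.5):
        self.cap = cap
        self.factor = factor
        self.action_counts = defaultdict(int)   # (state, action) -> total visits
        self.trans_counts = defaultdict(int)    # (state, action, next_state) -> visits

    def record(self, state, action, next_state):
        self.action_counts[(state, action)] += 1
        self.trans_counts[(state, action, next_state)] += 1
        if self.action_counts[(state, action)] > self.cap:
            self._shrink(state, action)

    def _shrink(self, state, action):
        # Scale every count for this (state, action) down by the same factor:
        # the estimated probabilities stay the same, but newer observations
        # now shift them much faster.
        self.action_counts[(state, action)] = int(self.action_counts[(state, action)] * self.factor)
        for key in list(self.trans_counts):
            if key[0] == state and key[1] == action:
                self.trans_counts[key] = int(self.trans_counts[key] * self.factor)

    def prob(self, state, action, next_state):
        total = self.action_counts[(state, action)]
        return self.trans_counts[(state, action, next_state)] / total if total else 0.0
```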
Results
 We tested the MDP AI at first against other static AIs, and then against human
players. The MDP AI beat the static AIs nearly every game.
 Against human players, the MDP was very strong, winning most games and
never getting shut out. However, the only experienced players it faced were the
researchers themselves.
Conclusion
 The AI algorithm presented here demonstrates the viability of online learning in
fighting games. We programmed an AI using a simple MDP model, and taught it
only a few negative rules about the game, eliminating the largest empty area of
search space but giving no further directions. Under these conditions the AI was
able to learn the game rules in a reasonable amount of time and adapt to its
opponent.