An Introduction to
Reinforcement Learning
Jie-Han Chen
NetDB, National Cheng Kung University
3/27, 2018 @ National Cheng Kung University, Taiwan
1
Disclaimer
The content of this lecture was borrowed from:
1. Rich Sutton’s textbook
2. David Silver’s Reinforcement Learning class at UCL
3. Sergey Levine’s Deep Reinforcement Learning class at UC Berkeley
2
Syllabus
● Introduction to Reinforcement Learning
● Markov Decision Process
● Dynamic Programming
● Monte Carlo method
● Temporal Difference method
● Deep Reinforcement Learning
● Policy Gradient
● Hierarchical Reinforcement Learning and Multiagent Reinforcement Learning
● Active Research Issues
3
Resources
Textbooks:
● Reinforcement Learning: An Introduction, Sutton and Barto
● Algorithms for Reinforcement Learning, Szepesvari
Courses:
● CS 294 Deep Reinforcement Learning, Berkeley
● David Silver’s Reinforcement Learning course, UCL
● CMU 10703 Deep Reinforcement Learning and Control, CMU
● Shan-Hung Wu’s Deep Learning course at NTHU
All of them are reference materials for this lecture.
4
Outline
● Syllabus
● Introduction
● Elements of reinforcement learning and its objective
● History of RL
● Applications
● The challenge and active research fields in RL
● Research institute and notable researchers
5
Machine Learning
From David Silver’s RL course
6
Introduction to Reinforcement Learning
Reinforcement learning is a learning framework different from supervised learning
and unsupervised learning.
It consists of a series of perceptions and interactions between an agent and
its environment.
From Sutton’s book
7
Agent and Environment
At each step t the agent:
● Receives scalar reward R_t
● Receives observation O_t
● Executes action A_t
The environment:
● Receives action A_t
● Emits observation O_{t+1}
● Emits scalar reward R_{t+1}
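The interaction described above can be sketched as a simple loop. This is only an illustrative sketch: the coin-guessing environment, the random agent, and all names (`CoinFlipEnv`, `RandomAgent`, `run`) are hypothetical and not part of the lecture.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the agent guesses the outcome of a coin flip."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.coin = self.rng.choice([0, 1])

    def step(self, action):
        """Receive action A_t; emit observation O_{t+1} and scalar reward R_{t+1}."""
        reward = 1.0 if action == self.coin else 0.0
        observation = self.coin  # reveal the coin that was just guessed
        self.coin = self.rng.choice([0, 1])  # flip a new coin for the next step
        return observation, reward


class RandomAgent:
    """Hypothetical agent that ignores its inputs and acts at random."""

    def __init__(self, seed=1):
        self.rng = random.Random(seed)

    def act(self, observation, reward):
        """Receive O_t and R_t; execute action A_t."""
        return self.rng.choice([0, 1])


def run(env, agent, num_steps):
    observation, reward = None, 0.0  # nothing has been observed before the first step
    rewards = []
    for _ in range(num_steps):
        action = agent.act(observation, reward)
        observation, reward = env.step(action)
        rewards.append(reward)
    return rewards


rewards = run(CoinFlipEnv(), RandomAgent(), num_steps=10)
print(sum(rewards))
```

A learning agent would use the observation and reward passed to `act` to improve its choices; the random agent here only shows the data flow.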
8
Introduction to Reinforcement Learning
Reinforcement learning is often used to solve sequential decision problems.
● Goal: select actions to maximize total future reward
● Actions may have long-term consequences
● Reward may be delayed
● It may be better to sacrifice immediate reward to gain more long-term reward
● E.g.:
○ A financial investment
○ A chess game
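A tiny numeric sketch of why sacrificing immediate reward can pay off. The two reward sequences are made-up numbers chosen only for illustration.

```python
def total_reward(rewards):
    """Total future reward of one sequence of per-step rewards."""
    return sum(rewards)


# Hypothetical reward sequences over five steps.
spend_now = [5, 0, 0, 0, 0]      # take the immediate reward
invest_first = [-2, 0, 0, 4, 6]  # pay a cost now, collect delayed rewards later

# The goal is the sequence with the larger total, not the larger first reward.
best = max([spend_now, invest_first], key=total_reward)
print(total_reward(spend_now), total_reward(invest_first))
```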
9
Supervised Learning & Unsupervised Learning
The input data are independent and identically distributed (i.i.d.).
The current output does not affect the next
input.
10
Reinforcement Learning
The agent’s actions affect the data
it receives in the future.
Figure from Wikipedia, made by waldoalvarez
11
Introduction to Reinforcement Learning
● In reinforcement learning, the
agent learns by trial and error.
● Better experience lets the
agent learn a better policy.
● What kind of experience is
better?
The image is from:
http://www.homemeeting.us/franktmc/maze_2.jpg
12
Elements of reinforcement learning
● Policy
● Reward signal
● Value function
● Model of environment (optional)
13
Elements of reinforcement learning - policy
Policy
● Defines the learning agent’s way of behaving at a given time. It could be a
simple function, a lookup table, or a search process
● Often denoted by π
● Could be deterministic or stochastic
14
Elements of reinforcement learning - policy
Suppose you are Russell Westbrook and are
being defended by James Harden. In
this situation, you have 3 choices:
● Cut
● Shoot
● Pass
15
Stochastic policy
Figure: probability assigned to each action.
16
Deterministic policy
Figure: probability assigned to each action.
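The two kinds of policy can be sketched as lookup tables over the three choices from the basketball example. The probabilities below are illustrative assumptions, not values from the slides.

```python
import random

ACTIONS = ["cut", "shoot", "pass"]

# Stochastic policy: a probability for each action (illustrative numbers).
stochastic_policy = {"cut": 0.2, "shoot": 0.5, "pass": 0.3}

# Deterministic policy: all probability mass on a single action.
deterministic_policy = {"cut": 0.0, "shoot": 1.0, "pass": 0.0}


def sample_action(policy, rng):
    """Draw one action according to the policy's probabilities."""
    actions = list(policy)
    weights = [policy[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
stochastic_samples = [sample_action(stochastic_policy, rng) for _ in range(1000)]
deterministic_samples = [sample_action(deterministic_policy, rng) for _ in range(1000)]
print(set(stochastic_samples), set(deterministic_samples))
```

Sampling from the deterministic table always returns the same action, while the stochastic table can return any action with nonzero probability.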
17
Policies - Action space
In reinforcement learning, we can categorize problems by their action space into
two types:
● Discrete action space
● Continuous action space
In the previous example, the actions are in a discrete space, but there are
many examples of continuous control, e.g. a robotic arm. The stochastic policy of
a continuous control problem looks like a probability density function.
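One common way to write such a policy is a Gaussian over the action. The sketch below assumes a one-dimensional action (say, a joint torque); the choice of a Gaussian and the mean and standard deviation are assumptions made for illustration.

```python
import math
import random


def gaussian_policy_pdf(action, mean, std):
    """Probability density the policy assigns to a continuous action."""
    z = (action - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def sample_action(mean, std, rng):
    """Draw a continuous action from the Gaussian policy."""
    return rng.gauss(mean, std)


rng = random.Random(0)
mean, std = 0.3, 0.1  # hypothetical preferred torque and exploration noise
torque = sample_action(mean, std, rng)
print(torque, gaussian_policy_pdf(torque, mean, std))
```

Unlike the discrete case, the policy returns a density rather than a probability, so individual values can exceed 1.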
18
Elements of reinforcement learning - reward
Reward: r / R_t
● Defines the goal in a reinforcement learning problem
● Indicates how well the agent is doing at step t
● Perceived immediately from the environment
19
Elements of reinforcement learning - reward
+2
0 or -0.2?
20
Elements of reinforcement learning - reward
In chess or Go, the reward is defined
by its outcome.
● Win: +1
● Draw: 0
● Lose: -1
In most steps, we don’t receive any
reward (value = 0). This is a
sparse reward problem.
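A minimal sketch of the outcome-based reward above. The 40-move game and the function name `game_reward` are hypothetical.

```python
def game_reward(outcome):
    """Reward for one move; outcome is None while the game is still running."""
    return {"win": 1.0, "draw": 0.0, "lose": -1.0}.get(outcome, 0.0)


# A hypothetical 40-move game that ends in a win: only the last step is rewarded.
outcomes = [None] * 39 + ["win"]
rewards = [game_reward(o) for o in outcomes]
nonzero_steps = sum(1 for r in rewards if r != 0.0)
print(nonzero_steps, "of", len(rewards), "steps carry a nonzero reward")
```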
21
Elements of reinforcement learning - reward
If we want the agent to reach the goal in fewer
steps, we often set the reward to
-1 for every step it takes.
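With a reward of -1 per step, the total reward of an episode is the negative of its length, so shorter paths score higher. The step counts below are made up for illustration.

```python
def episode_return(num_steps, step_reward=-1.0):
    """Total reward of an episode that reaches the goal after num_steps steps."""
    return num_steps * step_reward


short_path = episode_return(6)   # hypothetical 6-step route to the goal
long_path = episode_return(14)   # hypothetical 14-step route to the goal
print(short_path, long_path)
```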
22
Elements of reinforcement learning - value function
Value function
● Indicates which decision is good in the long run.
● There are two forms:
○ state-value function
○ action-value function
● Unlike the reward, the value function is an estimated value.
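As a rough sketch of what "estimated" means here, the code below estimates action values by averaging sampled outcomes in a one-step task, then combines them into a state value under a policy. The task, the win probabilities, and the action names are hypothetical.

```python
import random

# Hypothetical one-step task: each action wins (+1) with a probability
# that is unknown to the agent.
TRUE_WIN_PROB = {"pass_to_A": 0.3, "pass_to_B": 0.6}


def sample_return(action, rng):
    return 1.0 if rng.random() < TRUE_WIN_PROB[action] else 0.0


def estimate_action_values(num_samples, rng):
    """Action-value estimate: the average sampled return of each action."""
    q = {}
    for action in TRUE_WIN_PROB:
        returns = [sample_return(action, rng) for _ in range(num_samples)]
        q[action] = sum(returns) / num_samples
    return q


def state_value(q, policy):
    """State-value estimate: action values weighted by the policy's probabilities."""
    return sum(policy[a] * q[a] for a in q)


rng = random.Random(0)
q = estimate_action_values(5000, rng)
uniform_policy = {"pass_to_A": 0.5, "pass_to_B": 0.5}
v = state_value(q, uniform_policy)
print(q, v)
```

The action-value function scores each choice in a state, while the state-value function scores the state itself under a given policy.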
23
Elements of reinforcement learning - value function
The score is 99 to 98 with our team behind, and only
5 seconds are left in the game.
If you have to throw the ball in from midcourt,
who would you pass the ball to?
1. Hanamichi Sakuragi (櫻木花道)
2. Hisashi Mitsui (三井壽)
24
Elements of reinforcement learning - model
Model of the environment (optional)
● Mimics the behavior of the environment.
● Allows inferences to be made about how the environment will behave
(planning).
● Methods for solving reinforcement learning problems that use models for
planning are called model-based methods. Methods that do not are called model-free
methods.
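A minimal sketch of a learned model used for planning, assuming a deterministic environment with a small number of states. The `TabularModel` class and the 3-cell corridor are hypothetical illustrations, not a method from the lecture.

```python
class TabularModel:
    """Remembers the last observed outcome of each (state, action) pair."""

    def __init__(self):
        self.transitions = {}

    def update(self, state, action, next_state, reward):
        """Learn the model from one real interaction."""
        self.transitions[(state, action)] = (next_state, reward)

    def predict(self, state, action):
        """Infer how the environment would respond, without acting in it."""
        return self.transitions[(state, action)]


def plan_return(model, state, actions):
    """Simulate a sequence of actions inside the model and sum the predicted rewards."""
    total = 0.0
    for action in actions:
        state, reward = model.predict(state, action)
        total += reward
    return state, total


model = TabularModel()
# Experience from a hypothetical 3-cell corridor with the goal at cell 2.
model.update(0, "right", 1, -1.0)
model.update(1, "right", 2, -1.0)
model.update(1, "left", 0, -1.0)

final_state, predicted_return = plan_return(model, 0, ["right", "right"])
print(final_state, predicted_return)
```

A model-free method would skip `TabularModel` entirely and learn values or a policy directly from the real transitions.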
25
Elements of reinforcement learning - model
Interaction, inferences
Learn the model
The image is from David Silver’s RL course
26
Just like ...
27
Elements of reinforcement learning - model
28
Elements of reinforcement learning - model
29
Elements of reinforcement learning
● Policy
● Reward signal
● Value function
● Model of environment (optional)
30
The objective of reinforcement learning
Reinforcement learning is a framework
for goal-directed learning.
The objective of reinforcement learning
is to maximize the cumulative reward in
each task.
The image is from:
https://www.wikijob.co.uk/content/interview-advice/competencies/decision-making
31
History of Reinforcement Learning
Reinforcement learning is inspired by two fields:
● Optimal control
● Biological learning system: Animal learning
32
Optimal control
Optimal control is a mathematical optimization method for deriving control policies,
especially under constraints.
The optimization method is largely due to the work of Lev Pontryagin and
Richard Bellman in the 1950s.
33
Richard Bellman
Richard Bellman was an applied
mathematician, who introduced dynamic
programming in 1953.
Work:
● Bellman Equation
● Curse of dimensionality
● Bellman-Ford algorithm
34
Animal Learning
● Teaching a dog: positive reward
35
Animal Learning
● Teaching a dog: penalty (negative reward)
36
Some questions about RL
● Why do we need to learn reinforcement learning?
● What has made reinforcement learning spring up like mushrooms?
37
Backgammon (IBM, 1992)
Temporal difference learning and TD-Gammon, by
Gerald Tesauro, 1992
Backgammon is 雙陸棋 in Chinese.
Source: Wikipedia
38
Autonomous Helicopter (Stanford, 2000)
Helicopter aerobatics has been studied since 2000 by Andrew Ng and
Pieter Abbeel at Stanford.
You can see more details at: http://heli.stanford.edu/
39
Deep reinforcement learning in Atari game (2013)
Deep Q-Network (DQN): proposed by V. Mnih et al. It is the first end-to-end
reinforcement learning model to combine deep learning with raw inputs.
40
Deep reinforcement learning in Atari game (2013)
41
Deep Reinforcement Learning for Robotic Manipulation
42
AlphaGo (DeepMind, 2016)
43
AlphaGo (DeepMind, 2016)
AlphaGo: David Silver, Aja Huang et al. use Monte Carlo tree search (MCTS) and
deep reinforcement learning (policy gradient) to master the game of Go.
44
AlphaGo Zero (DeepMind, 2017)
AlphaGo Zero: David Silver et al. use MCTS and policy iteration with a two-headed
ResNet architecture to learn from scratch without human knowledge.
45
46
AlphaGo Zero (DeepMind, 2017)
Dota2 (OpenAI, 2017)
● Beats the world’s top professionals at 1v1 matches
● The bot learned from scratch by self-play
47
Dota2 (OpenAI, 2017)
48
Dota2 (OpenAI, 2017)
49
Alibaba (Starcraft1, multiagent)
50
Deep RL for Dialogue Generation (Li et al., 2016)
● RL agent generates more interactive responses
● RL agent tends to end a sentence with a question and hand the conversation
over to the user
● Next step: explore intrinsic rewards, large-scale training
From the slides on http://opendialogue.miulab.tw
51
The challenges of reinforcement learning
● Sparse reward issue
● Reward credit assignment
● Large space for exploration (trial-and-error)
● Imperfect information, partial observation
52
Active research domains
● Multiagent reinforcement learning
● Hierarchical reinforcement learning
● Inverse reinforcement learning
● Multi-task transfer learning in reinforcement learning
● Meta learning
● One-shot reinforcement learning
● Deep reinforcement learning in dialogue generation
53
Research institutes and notable researchers
54
The research scientists in RL you must know!
● Richard S. Sutton
● David Silver
● Pieter Abbeel
● Sergey Levine
55
Richard S. Sutton
● The founding father of reinforcement
learning
● Professor of Computer Science at University
of Alberta
● Temporal difference learning
● Dyna architecture
56
David Silver
● Research scientist at DeepMind
● Lead researcher on the AlphaGo and AlphaGo
Zero teams
● Ph.D. supervised by Sutton
● Formerly a professor at University College London
57
Pieter Abbeel
● Professor at UC Berkeley
● Director of the UC Berkeley Robot Learning Lab
● Research scientist and advisor at OpenAI
58
Sergey Levine
● Assistant Professor at UC Berkeley
● Research scientist at Google Brain
● Autonomous robots
59
Questions?
60

 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physics
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 

An introduction to reinforcement learning

  • 1. An Introduction to Reinforcement Learning Jie-Han Chen NetDB, National Cheng Kung University 3/27, 2018 @ National Cheng Kung University, Taiwan 1
  • 2. The content of this lecture was borrowed from: 1. Rich Sutton’s textbook 2. David Silver’s Reinforcement Learning class at UCL 3. Sergey Levine’s Deep Reinforcement Learning class at UCB 2 Disclaimer
  • 3. Syllabus ● Introduction to Reinforcement Learning ● Markov Decision Process ● Dynamic Programming ● Monte Carlo method ● Temporal Difference method ● Deep Reinforcement Learning ● Policy Gradient ● Hierarchical Reinforcement Learning and Multiagent Reinforcement Learning ● Active Research Issue 3
  • 4. Resources Textbooks: ● Reinforcement Learning: An Introduction, Sutton and Barto ● Algorithms for Reinforcement Learning, Szepesvari Course: ● CS 294 Deep Reinforcement Learning, Berkeley ● David Silver’s Reinforcement Learning course, UCL ● CMU 10703 Deep Reinforcement Learning and Control, CMU ● Shan-Hung Wu’s Deep Learning course in NTHU All of them are our reference materials in this lecture. 4
  • 5. Outline ● Syllabus ● Introduction ● Elements of reinforcement learning and its objective ● History of RL ● Applications ● The challenge and active research fields in RL ● Research institute and notable researchers 5
  • 6. Machine Learning From David Silver’s RL course 6
  • 7. Introduction to Reinforcement Learning Reinforcement learning is a learning framework different from supervised learning and unsupervised learning. It consists of a series of perceptions and interactions between an agent and its environment. From Sutton’s book 7
  • 8. Agent and Environment At each step t the agent: ● Receives scalar reward Rt ● Receives observation Ot ● Executes action At The environment: ● Receives action At ● Emits observation Ot+1 ● Emits scalar reward Rt+1 8
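The interaction loop on this slide can be sketched in Python. The `CoinFlipEnv` environment and the random policy below are hypothetical, invented only to illustrate the loop; they do not come from the lecture.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the agent guesses a hidden coin (0 or 1)."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.coin = self.rng.randint(0, 1)

    def step(self, action):
        # The environment receives action A_t ...
        reward = 1.0 if action == self.coin else 0.0
        # ... and emits observation O_{t+1} (the coin just revealed) and reward R_{t+1}.
        observation = self.coin
        self.coin = self.rng.randint(0, 1)  # flip a new hidden coin for the next step
        return observation, reward


def run_episode(env, policy, num_steps=10):
    """Run the agent-environment loop and return the total reward collected."""
    observation, total_reward = None, 0.0
    for _ in range(num_steps):
        action = policy(observation)            # agent executes A_t
        observation, reward = env.step(action)  # agent receives O_{t+1} and R_{t+1}
        total_reward += reward
    return total_reward


agent_rng = random.Random(1)


def random_policy(observation):
    """Ignores the observation and guesses uniformly at random."""
    return agent_rng.randint(0, 1)


total = run_episode(CoinFlipEnv(seed=0), random_policy)
print(total)
```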
  • 9. Introduction to Reinforcement Learning Reinforcement learning is often used to solve sequential decision problems. ● Goal: select actions to maximize total future reward ● Actions may have long-term consequences ● Reward may be delayed ● It may be better to sacrifice immediate reward to gain more long-term reward ● E.g.: ○ A financial investment ○ A chess game 9
  • 10. Supervised Learning & Unsupervised Learning The input data are independent and identically distributed (i.i.d.). The current output does not affect the next input. 10
  • 11. Reinforcement Learning The agent’s actions do affect the data it receives in the future. Figure from Wikipedia, made by waldoalvarez 11
  • 12. Introduction to Reinforcement Learning ● In reinforcement learning the agent learns by trial and error. ● Better experience helps the agent learn a better policy. ● What kind of experience is better? The image is from: http://www.homemeeting.us/franktmc/maze_2.jpg 12
  • 13. Elements of reinforcement learning ● Policy ● Reward signal ● Value function ● Model of environment (optional) 13
  • 14. Elements of reinforcement learning - policy Policy ● Defines the learning agent’s way of behaving at a given time. Could be a simple function, a lookup table, or a search process ● Often denoted by π ● Could be deterministic or stochastic 14
  • 15. Elements of reinforcement learning - policy Suppose you are Russell Westbrook and are being defended by James Harden. In this situation, you have 3 choices: ● Cut ● Shoot ● Pass 15
  • 18. Policies - Action space In reinforcement learning, we can categorize problems by their action space into 2 types. ● Discrete action space ● Continuous action space In the previous example, the actions are in a discrete space, but there are many examples of continuous control, e.g. a robotic arm. The stochastic policy of a continuous control problem would look like a probability density function. 18
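As a minimal sketch of the policy types mentioned on these slides, the snippet below shows a deterministic lookup-table policy, a stochastic policy over the three discrete basketball actions, and a Gaussian policy for a continuous action. The state names (`"defended"`, `"open"`), the probabilities, and the linear mean are all assumptions made up for illustration.

```python
import random

ACTIONS = ["cut", "shoot", "pass"]


def deterministic_policy(state):
    """Maps each state to exactly one action via a lookup table."""
    table = {"defended": "pass", "open": "shoot"}  # hypothetical table
    return table[state]


def stochastic_policy(state, rng):
    """Samples a discrete action from a state-dependent distribution."""
    probs = {"defended": [0.2, 0.3, 0.5], "open": [0.1, 0.8, 0.1]}  # hypothetical
    return rng.choices(ACTIONS, weights=probs[state], k=1)[0]


def gaussian_policy(state_value, rng, std=0.1):
    """Samples a continuous action (e.g. a joint torque) from a normal density."""
    mean = 0.5 * state_value  # hypothetical linear mean
    return rng.gauss(mean, std)


rng = random.Random(0)
print(deterministic_policy("defended"))
print(stochastic_policy("defended", rng))
print(gaussian_policy(1.0, rng))
```

The deterministic policy always returns the same action for a given state, while the two stochastic policies return samples, so repeated calls can differ.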
  • 19. Elements of reinforcement learning - reward Reward: r / Rt ● Defines the goal in a reinforcement learning problem ● Indicates how well the agent is doing at step t ● Perceived immediately from the environment 19
  • 20. Elements of reinforcement learning - reward +2 0 or -0.2? 20
  • 21. Elements of reinforcement learning - reward In chess or Go, the reward is defined by the outcome of the game. ● Win: +1 ● Draw: 0 ● Lose: -1 In most steps, we don’t receive any reward (value = 0). This is a sparse reward problem. 21
  • 22. Elements of reinforcement learning - reward If we want to reach the goal in fewer steps, we often define the reward as -1 for each step taken. 22
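A small sketch of why a reward of -1 per step favors short paths: on a hypothetical one-dimensional corridor (not from the lecture), the policy that walks straight to the goal collects a higher total reward than one that detours first.

```python
def rollout_return(start, goal, policy, max_steps=50):
    """Sum of rewards when every step costs -1 until the goal is reached."""
    position, total = start, 0.0
    for t in range(max_steps):
        if position == goal:
            break
        position += policy(position, t)  # policy returns a move of -1 or +1
        total += -1.0                    # step penalty
    return total


def direct(position, t):
    """Always move toward the goal on the right."""
    return +1


def detour(position, t):
    """Move away from the goal for two steps, then head toward it."""
    return -1 if t < 2 else +1


direct_return = rollout_return(0, 5, direct)
detour_return = rollout_return(0, 5, detour)
print(direct_return)  # -5.0
print(detour_return)  # -9.0
```

Maximizing the total reward under this design is the same as minimizing the number of steps taken to reach the goal.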
  • 23. Elements of reinforcement learning - value function Value function ● Indicates which decision is good in the long run. ● There are two forms: ○ state-value function ○ action-value function ● Unlike reward, the value function is an estimated value. 23
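For reference, the two forms can be written in the notation of Sutton and Barto's textbook. The discount factor \gamma \in [0, 1] and the return G_t do not appear on the slide; they are taken from the textbook as assumptions here.

```latex
\begin{align}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
     = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \\
v_\pi(s) &= \mathbb{E}_\pi\left[ G_t \mid S_t = s \right] \\
q_\pi(s, a) &= \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]
\end{align}
```

Here v_\pi is the state-value function and q_\pi is the action-value function of a policy \pi; both are expectations of future reward, which is why they must be estimated rather than observed directly.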
  • 24. Elements of reinforcement learning - value function The score is 99 to 98 (we have 98) with just 5 seconds left in the game. If you need to throw the ball in from midfield, who would you pass it to? 1. Hanamichi Sakuragi (櫻木花道) 2. Hisashi Mitsui (三井壽) 24
  • 25. Elements of reinforcement learning - model Model of the environment (optional) ● Uses something to mimic the behavior of the environment. ● Allows inferences to be made about how the environment will behave (planning). ● Methods for solving reinforcement learning problems that use models for planning are called model-based methods. Methods that do not are called model-free methods. 25
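One very simple way to "mimic the behavior of the environment" is a count-based table learned from observed transitions. The `TabularModel` class and the toy transitions below are hypothetical and only illustrate the idea; practical model-based methods are usually more elaborate.

```python
from collections import Counter, defaultdict


class TabularModel:
    """Hypothetical count-based model of transitions and rewards."""

    def __init__(self):
        self.counts = defaultdict(Counter)    # (state, action) -> next-state counts
        self.reward_sum = defaultdict(float)  # (state, action) -> summed reward

    def update(self, state, action, reward, next_state):
        """Record one observed transition from real interaction."""
        self.counts[(state, action)][next_state] += 1
        self.reward_sum[(state, action)] += reward

    def predict(self, state, action):
        """Most frequent next state and mean reward.

        Assumes the (state, action) pair has been observed at least once.
        """
        seen = self.counts[(state, action)]
        n = sum(seen.values())
        next_state = seen.most_common(1)[0][0]
        return next_state, self.reward_sum[(state, action)] / n


model = TabularModel()
experience = [("A", "right", 0.0, "B"),
              ("A", "right", 0.0, "B"),
              ("A", "right", 1.0, "C")]
for s, a, r, s_next in experience:
    model.update(s, a, r, s_next)

prediction = model.predict("A", "right")
print(prediction)
```

A planner could then query `predict` instead of acting in the real environment, which is the inference step the slide refers to.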
  • 26. Elements of reinforcement learning - model Interaction, inferences Learn the model The image is from David Silver’s RL course 26
  • 28. Elements of reinforcement learning - model 28
  • 29. Elements of reinforcement learning - model 29
  • 30. Elements of reinforcement learning ● Policy ● Reward signal ● Value function ● Model of environment (optional) 30
  • 31. The objective of reinforcement learning Reinforcement learning is a framework for goal-directed learning. The objective of reinforcement learning is to maximize the cumulative reward in each task. The image is from: https://www.wikijob.co.uk/content/interview-advice/competencies/decision-making 31
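The cumulative reward of an episode can be computed as below. The optional discount factor `gamma` is an assumption borrowed from Sutton and Barto's textbook rather than from this slide, and the two reward sequences are made up to show that giving up an immediate reward can pay off in the long run.

```python
def discounted_return(rewards, gamma=1.0):
    """Cumulative reward r_1 + gamma * r_2 + gamma^2 * r_3 + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


greedy = [1.0, 0.0, 0.0, 0.0]   # takes a small reward immediately
patient = [0.0, 0.0, 0.0, 5.0]  # waits for a larger delayed reward

greedy_total = discounted_return(greedy)
patient_total = discounted_return(patient)
patient_discounted = discounted_return(patient, gamma=0.5)
print(greedy_total, patient_total, patient_discounted)
```

With `gamma=1.0` the patient sequence is better (5.0 versus 1.0); with a strong discount of 0.5 its delayed reward shrinks to 0.625, so how much the agent values the future changes which behavior is optimal.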
  • 32. History of Reinforcement Learning Reinforcement learning is inspired by two fields of knowledge: ● Optimal control ● Biological learning systems: animal learning 32
  • 33. Optimal control It is a mathematical optimization method for deriving control policies especially under certain constraints. The optimization method is largely due to the work of Lev Pontryagin and Richard Bellman in the 1950s. 33
  • 34. Richard Bellman Richard Bellman was an applied mathematician, who introduced dynamic programming in 1953. Work: ● Bellman Equation ● Curse of dimensionality ● Bellman-Ford algorithm 34
  • 35. Animal Learning ● Teaching a dog - positive reward 35
  • 36. Animal Learning ● Teaching a dog - penalty (negative reward) 36
  • 37. Some questions about RL ● Why do we need to learn reinforcement learning? ● What makes reinforcement learning spring up like mushrooms? 37
  • 38. Backgammon (IBM, 1992) Temporal difference learning and TD-Gammon, by Gerald Tesauro, 1992. Backgammon is called 雙陸棋 in Chinese. Source: Wikipedia 38
  • 39. Autonomous Helicopter (Stanford, 2000) Helicopter aerobatics has been studied since 2000 by Andrew Ng and Pieter Abbeel at Stanford. You can see more details at: http://heli.stanford.edu/ 39
  • 40. Deep reinforcement learning in Atari game (2013) Deep Q Network: proposed by V. Mnih et al. It is the first end-to-end reinforcement learning model to combine deep learning with raw inputs. 40
  • 41. Deep reinforcement learning in Atari game (2013) 41
  • 42. Deep Reinforcement Learning for Robotic Manipulation 42
  • 44. AlphaGo (DeepMind, 2016) AlphaGo: David Silver, Aja Huang et al. used Monte Carlo tree search (MCTS) and deep reinforcement learning (policy gradient) to master the game of Go. 44
  • 45. AlphaGo Zero (DeepMind, 2017) AlphaGo Zero: David Silver et al. used MCTS and policy iteration with a two-headed ResNet architecture to learn from scratch without human knowledge. 45
  • 47. Dota2 (OpenAI, 2017) ● Beats the world’s top professionals at 1v1 matches ● The bot learned from scratch by self-play 47
  • 51. Deep RL for Dialogue Generation (Li et al., 2016) ● RL agent generates more interactive responses ● RL agent tends to end a sentence with a question and hand the conversation over to the user ● Next step: explore intrinsic rewards, large-scale training From the slides on http://opendialogue.miulab.tw 51
  • 52. The Challenge of reinforcement learning ● Sparse reward issue ● Reward credit assignment ● Large space for exploration (trial-and-error) ● Imperfect information, partial observation 52
  • 53. Active research domain ● Multiagent reinforcement learning ● Hierarchical reinforcement learning ● Inverse reinforcement learning ● Multi-task Transfer learning in reinforcement learning ● Meta learning ● One-shot reinforcement learning ● Deep reinforcement learning in dialogue generation 53
  • 54. Research institute and notable researchers 54
  • 55. The research scientists in RL you must know! ● Richard S. Sutton ● David Silver ● Pieter Abbeel ● Sergey Levine 55
  • 56. Richard S. Sutton ● The founding father of reinforcement learning ● Professor of Computer Science at University of Alberta ● Temporal difference learning ● Dyna architecture 56
  • 57. David Silver ● Research scientist at DeepMind ● Lead researcher on the AlphaGo and AlphaGo Zero teams ● Supervised by Sutton during his Ph.D. ● Formerly a professor at University College London 57
  • 58. Pieter Abbeel ● Professor at UC Berkeley ● Director of the UC Berkeley Robot Learning Lab ● Research scientist and advisor at OpenAI 58
  • 59. Sergey Levine ● Assistant Professor at UC Berkeley ● Research scientist at Google Brain ● Autonomous robots 59