A Brief Survey of Reinforcement Learning
Short walk-through on building learning agents
Giancarlo Frison
December 12, 2018
Table of contents i
1. Introduction
2. Genetic Programming
3. Multi-armed Bandit
4. Markov Decision Process
5. Neural Networks in RL
gfrison 2018 1
Table of contents ii
6. Actor-Critic
7. Deep Deterministic Policy Gradient
8. Asgard
gfrison 2018 2
Introduction
What is Reinforcement Learning
[9]
Reinforcement learning is an area of machine learning concerned with how
software agents ought to take actions in an environment so as to maximize some
notion of cumulative reward.
gfrison 2018 3
Challenges in RL
RL vision is about creating systems that are capable of learning how to adapt in
the real world, solely based on trial-and-error. Challenges:
Reward signal
The optimal policy must be inferred by interacting with the environment. The only
learning signal the agent receives is the reward.
Credit assignment problem
Agents must deal with long-range time dependencies: often the consequences of
an action only materialise after many transitions of the environment [8].
Non-stationary environments
Especially in multi-agent environments, given the same state st and action at,
outcomes might differ.
gfrison 2018 4
Difference from other optimizations
There are major differences with respect to:
• Supervised learning
• Mathematical optimization
• Genetic programming
gfrison 2018 5
Genetic Programming
Genetic Programming
GP is a technique whereby computer programs are encoded as a set of genes that
are then modified using an evolutionary algorithm.
• A number of chromosomes are randomly created.
• Each chromosome is evaluated through a fitness function.
• The best ones are selected; the others are discarded.
• Selected chromosomes may be bred to produce a new generation.
• Offspring are randomly mutated.
• Repeat until the score threshold is reached [3] (a minimal sketch of this loop follows).
gfrison 2018 6
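A minimal sketch of this evolutionary loop, assuming user-supplied `random_individual`, `fitness`, `crossover` and `mutate` callables (all names here are illustrative, not from the slides):

```python
import random

def evolve(random_individual, fitness, crossover, mutate,
           pop_size=50, elite=10, generations=100, target=0.95):
    # 1. A number of chromosomes are randomly created.
    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        # 2. Each chromosome is evaluated through a fitness function.
        scored = sorted(population, key=fitness, reverse=True)
        # 6. Repeat until the score threshold is reached.
        if fitness(scored[0]) >= target:
            break
        # 3. The best ones are selected, the others discarded.
        parents = scored[:elite]
        # 4./5. Selected chromosomes are bred (crossover) and offspring mutated.
        population = parents + [
            mutate(crossover(random.choice(parents), random.choice(parents)))
            for _ in range(pop_size - elite)
        ]
    return max(population, key=fitness)
```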
Crossover
Figure 1: The “breeding” is called crossover. The chromosome (in this case an AST) is
merged between two individuals in search of a better function.
gfrison 2018 7
Strengths on many local minima
GP may overtake gradient-based algorithms when the solution space has many
local minima.
Figure 2: Rastrigin function
gfrison 2018 8
Strengths on non-differentiable functions
f(x) is differentiable when it has a finite derivative everywhere in its domain.
Figure 3: f(x) = |x| has no derivative at x = 0.
GP is tolerant of latent non-differentiable functions.
gfrison 2018 9
Multi-armed Bandit
Multi-armed Bandit
The multi-armed bandit problem has been the subject of decades of intense
study in statistics, operations research, A/B testing, computer science, and
economics [12]
gfrison 2018 10
Q Value’s Action
Figure 4: Measurements obtained after repeated pulls with 10 arms [8].
gfrison 2018 11
Q Value’s Action
Qn is the estimated value of its action after n selections.
Qn+1 = (R1 + R2 + ... + Rn) / n
A more scalable formula updates the average incrementally:
Qn+1 = Qn + (1/n)(Rn − Qn)
General expression of the bandit algorithm, a recurrent pattern in RL (Target can
be regarded as the reward R for now):
NewEstimate = OldEstimate + StepSize(Target − OldEstimate)
gfrison 2018 12
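The incremental update translates directly into code; a minimal sketch of the general estimate rule (variable names and rewards are illustrative):

```python
def update_estimate(old_estimate, target, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)

# Sample-average form for one arm: the step size shrinks as 1/n.
q, n = 0.0, 0
for reward in (1.0, 0.0, 1.0, 1.0):      # rewards observed from pulling this arm
    n += 1
    q = update_estimate(q, reward, 1.0 / n)
```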
Gambler’s Dilemma
When pulled, an arm produces a random payout drawn independently of the past.
Because the distribution of payouts corresponding to each arm is not known, the
player can learn it only by experimenting.
Exploitation: earn more money by exploiting arms that yielded high payouts in the past.
Exploration: exploring alternative arms may return higher payouts in the future.
There is a tradeoff between exploration and regret.
gfrison 2018 13
Markov Decision Process
Definition of MDP
Figure 5: 2015 DARPA Robotics Challenge [6]
Unlike bandits, an MDP formalizes decision making (policy π) over sequential
steps, aggregated into episodes.
gfrison 2018 14
Actions and States
Figure 6: Model representation of MDP [5]
MDP strives to find the best π for all possible states. In Markov processes, the
selected action depends only on the current state.
gfrison 2018 15
Bellman equation
In a multi-step process the total return G is the discounted sum of rewards, from
the initial state to the terminal state:
Gπ = r0 + γr1 + γ²r2 + ... + γⁿrn
Bellman equation: the value of a state Vs is the average reward obtained in that
state plus the discounted value of the states that follow π thereafter.
The goal of RL is to find the optimal π with the highest G:
π = argmax Σs [r + γvπ(s)]
gfrison 2018 16
Grid World
 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1 
• A = up, down, right, left
• Terminal states are the
flagged boxes.
• Rt = −1 for all states.
The problem is to find the best π. The value function is computed by iterative
policy evaluation (a sketch follows below).
gfrison 2018 17
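A minimal sketch of iterative policy evaluation for this 4×4 grid under the equiprobable random policy; the layout, R = −1 and terminal corners follow the slides, while the function and variable names are mine:

```python
import numpy as np

# 4x4 gridworld: states 0..15, corners 0 and 15 are terminal, R = -1 on every step.
N = 4
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]          # up, down, left, right

def step(s, a):
    r, c = divmod(s, N)
    r = min(max(r + a[0], 0), N - 1)                  # bumping a wall leaves you in place
    c = min(max(c + a[1], 0), N - 1)
    return r * N + c

def policy_evaluation(gamma=1.0, theta=1e-4):
    V = np.zeros(N * N)
    while True:
        new_V = V.copy()                              # synchronous sweep, as on the slides
        for s in range(1, N * N - 1):                 # states 0 and 15 are terminal
            # Equiprobable random policy: average the backup over the four actions.
            new_V[s] = sum(0.25 * (-1 + gamma * V[step(s, a)]) for a in ACTIONS)
        if np.max(np.abs(new_V - V)) < theta:
            return new_V.reshape(N, N)
        V = new_V
```

The first sweeps reproduce the values on the following slides (-1 everywhere, then -1.7/-2, then -2.4/-2.9/-3).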
Iteration 1
Calculated V1 Policy π1
0 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1 0
 ← ? ?
↑ ? ? ?
? ? ? ↓
? ? → 
gfrison 2018 18
Iteration 2
Calculated V2 Policy π2
0 -1.7 -2 -2
-1.7 -2 -2 -2
-2 -2 -2 -1.7
-2 -2 -1.7 0
 ← ← ?
↑ ? ↓
↑ ? ↓
? → → 
gfrison 2018 19
Iteration 3
Calculated V3 Policy π3
0 -2.4 -2.9 -3
-2.4 -2.9 -3 -2.9
-2.9 -3 -2.9 -2.4
-3 -2.9 -2.4 0
 ← ←
↑ ↓
↑ ↓
→ → 
gfrison 2018 20
Temporal difference
Model-free learning
Learning comes only from raw experience (episodes), with no prior knowledge of
the environment.
Bootstrapping
Value estimations are based partly on other learned estimations.
Bootstrapping helps to increase the stability of estimations.
V(st) = V(st) + α(rt+1 + γV(st+1) − V(st))
gfrison 2018 21
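The update above as a one-line sketch, where V is any dict or array of state values and alpha, gamma are hypothetical settings:

```python
def td0_update(V, s, r_next, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) step: V(st) <- V(st) + alpha * (r_{t+1} + gamma * V(s_{t+1}) - V(st))."""
    V[s] += alpha * (r_next + gamma * V[s_next] - V[s])
    return V
```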
Neural Networks in RL
Beyond the Gridworld
In the Gridworld every state value vt is stored in a table. The approach lacks
scalability and is inherently limited to fairly low-dimensional problems.
Figure 7: The state vt might be a frame of a videogame [11]
gfrison 2018 22
Deep Reinforcement Learning
Deep learning enables RL to scale to decision-making problems that were
previously intractable, i.e., settings with high-dimensional state and action
spaces. [1]
• Universal function approximator Multilayer neural networks can approximate
any continuous function to any degree of accuracy [4].
• Representation learning Automatically discover the representations needed
for feature detection or classification from raw data, also known as feature
engineering [2].
Neural networks can learn the approximated value of Vs or Q(s, a) for any given
state/action space.
gfrison 2018 23
Discrete space
When the action space is limited to a set of options (e.g. turn right, turn left), the
output layer can have as many neurons as there are possible actions.
A1 A2 A3 A4
0.21 0.27 0.30 0.21
Figure 8: Output layer with softmax
The chosen action is the one with the highest Qt (A3); see the sketch below.
gfrison 2018 24
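A sketch of how such an output layer can be read; the logits are illustrative and not the exact figures on the slide:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())      # subtract the max for numerical stability
    return z / z.sum()

q_values = np.array([0.1, 0.6, 0.9, 0.1])  # illustrative network outputs for A1..A4
probs = softmax(q_values)                  # probabilities over the four discrete actions
action = int(np.argmax(probs))             # pick the action with the largest value (A3)
```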
Exploration in discrete space
The balance between exploration and exploitation is one of the most
challenging tasks in RL. The most common approach is ε-greedy:
π(s) = random action      if ξ < ϵ,
       argmaxa Q(s, a)    otherwise.   (1)
With ε-greedy, at each time step, the agent selects a random action with a fixed
probability, 0 ≤ ϵ ≤ 1, instead of selecting the learned optimal action.
gfrison 2018 25
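Equation (1) in code, as a minimal sketch (epsilon here is a hypothetical fixed value):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```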
Exploration in discrete space
The ε parameter must be tuned and stays fixed during training. In contrast,
Value-Difference Based Exploration adapts the exploration probability based on
fluctuations of confidence [10].
A1 A2 A3 A4
0.21 0.27 0.30 0.21
Figure 9: Low confidence, flat
likelihood
A1 A2 A3 A4
0.21 0.07 0.70 0.01
Figure 10: High confidence, sharp
likelihood
gfrison 2018 26
Continuous space
Applications of RL often require continuous state and action spaces defined by
continuous variables, e.g. position, velocity, torque.
Single output
As in the DDPG algorithm, the NN yields only one output.
Probability distribution
The network yields a distribution. For convenience it is usually a normal
distribution defined by 2 outputs: μ and σ.
gfrison 2018 27
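A sketch of the two-output (μ, σ) head in PyTorch; the dimensions and layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Continuous-action head: the network emits mu and sigma of a normal distribution."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, 1)
        self.log_sigma = nn.Linear(hidden, 1)   # predict log(sigma) so sigma stays positive

    def forward(self, state):
        h = self.body(state)
        return torch.distributions.Normal(self.mu(h), self.log_sigma(h).exp())

# Usage: sample a continuous action (e.g. a torque) for one state.
policy = GaussianPolicy(state_dim=3)
action = policy(torch.randn(1, 3)).sample()
```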
Deep Q-Network
A NN comes into play by approximating state-action values Q(s, a).
Double network
Besides the main network, a sibling (target) network is used during training to
compute the Q targets; it is periodically aligned with the main network.
Prioritized experience replay
Past state/action/reward tuples are kept for future iterations. Tuples with
higher evaluation error are sampled more often.
First disruptive advance in Deep RL, in 2015 by DeepMind [7].
gfrison 2018 28
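A rough sketch of the double-network idea: the target network provides the bootstrapped Q targets and is periodically synchronized with the main network. Replay-buffer sampling and prioritization are omitted, and all names are illustrative:

```python
import torch

def dqn_loss(batch, q_net, target_net, gamma=0.99):
    """Loss against y = r + gamma * max_a Q_target(s', a) for a sampled minibatch."""
    states, actions, rewards, next_states, dones = batch
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
    targets = rewards + gamma * (1 - dones) * next_q
    predicted = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    return torch.nn.functional.mse_loss(predicted, targets)

# Periodically align the target network with the main network:
# target_net.load_state_dict(q_net.state_dict())
```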
Actor-Critic
Policy-based algorithm
Actor-critic methods lie in the family of policy-based algorithms. Unlike
value-based ones - like DQN - the main network (actor) learns directly the action
aπ(s) to return for a given state s.
Therefore there is no argmax Qπ(s, A) over all Qπ(s) for a given state.
gfrison 2018 29
Actor-Critic
Actor-critic defines an architecture based on two parts:
Actor
Main network which infers the action for a given state
(st → a).
Critic
Secondary network, used only during training, which evaluates the actor output
by returning the average value of the state (st → V(s)).
gfrison 2018 30
Advantage function for incremental learning
The advantage technique is used to give an evaluation of the last action compared
to the average of other actions.
It answers the following question:
Is the last action more rewarding than others in such a scenario?
Yes
The action’s likelihood is encouraged. Gradients are pushed toward this action.
No
The action’s likelihood is discouraged. Gradients are pushed away from this action.
gfrison 2018 31
Training
Figure 11: Actor-Critic interactions in an iterative training loop.
gfrison 2018 32
Actor training
• 1 The critic returns the averaged value vt of the state st.
• 2 The actor is trained considering st, at and advt = rt − vt.
gfrison 2018 33
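A minimal sketch of steps 1 and 2, assuming the actor returns a probability distribution (as in the Gaussian head sketched earlier) and the critic returns V(s); the optimizers, shapes and names are assumptions:

```python
import torch

def actor_critic_update(actor, critic, actor_opt, critic_opt, state, action, reward):
    value = critic(state)                        # (1) critic: averaged value of the state
    advantage = reward - value.detach()          # adv_t = r_t - v_t
    # (2) actor: push the action's log-likelihood up or down in proportion to the advantage.
    log_prob = actor(state).log_prob(action)
    actor_loss = -(log_prob * advantage).mean()
    critic_loss = (reward - value).pow(2).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
```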
Deep Deterministic Policy Gradient
DDPG
Especially effective in continuous spaces. Derived from actor-critic and DQN, it
uses 4 networks.
Value/action networks
Like actor-critic, there is a distinction between value and action networks.
Duplication of the actor-critic networks
For stability those networks are duplicated.
Replay buffer
Like DQN, a history of past experiences is reused.
gfrison 2018 34
Gradient propagation
Unlike other techniques, DDPG improves the current π by applying the gradient
obtained during the training iteration from the value network to the policy
network.
Figure 12: Like a pantograph, the gradient of Vs is applied to the π network for a similar
effect, without knowing the value of π.
gfrison 2018 35
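A sketch of this gradient propagation, assuming a deterministic actor a = π(s) and a critic Q(s, a); the target networks and replay buffer are omitted:

```python
import torch

def ddpg_actor_update(actor, critic, actor_opt, states):
    """Improve pi by following the gradient of the value network through the actor's output."""
    actions = actor(states)                        # deterministic policy: a = pi(s)
    actor_loss = -critic(states, actions).mean()   # maximize Q(s, pi(s))
    actor_opt.zero_grad()
    actor_loss.backward()                          # gradient flows from the critic into pi's weights
    actor_opt.step()
```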
Asgard
Missing slides for Intellectual Property
gfrison 2018 35
Demo
gfrison 2018 35
References i
K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath.
A brief survey of deep reinforcement learning.
arXiv preprint arXiv:1708.05866, 2017.
Y. Bengio, A. Courville, and P. Vincent.
Representation learning: A review and new perspectives.
IEEE transactions on pattern analysis and machine intelligence,
35(8):1798–1828, 2013.
Giancarlo Frison.
First Steps on Evolutionary Systems.
2018.
link https://labs.cx.sap.com/2018/07/02/first-steps-on-evolutionary-systems/.
gfrison 2018 36
References ii
K. Hornik, M. Stinchcombe, and H. White.
Multilayer feedforward networks are universal approximators.
Neural networks, 2(5):359–366, 1989.
J. Zico Kolter.
Markov Decision Process.
http://www.cs.cmu.edu/afs/cs/academic/class/15780-s16/www/slides/mdps.pdf.
John Williams.
2015 DARPA Robotics Challenge.
Public domain.
gfrison 2018 37
References iii
V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare,
A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al.
Human-level control through deep reinforcement learning.
Nature, 518(7540):529, 2015.
P. Montague.
Reinforcement Learning: An Introduction, by Sutton, R.S. and Barto, A.G.
Trends in Cognitive Sciences, 3(9):360, 1999.
Omanomanoman.
CC BY-SA 4.0, https://commons.wikimedia.org/.
gfrison 2018 38
References iv
M. Tokic and G. Palm.
Value-difference based exploration: adaptive control between
epsilon-greedy and softmax.
In Annual Conference on Artificial Intelligence, pages 335–346. Springer, 2011.
Valve Corporation.
A screenshot from the video game Dota 2.
Yamaguchi.
CC BY-SA 3.0, http://creativecommons.org/licenses/by-sa/3.0/.
gfrison 2018 39