Trends and developments in
Reinforcement Learning
London TensorFlow Meetup
Pierre H. Richemond
Imperial College London
Table of contents
1. Introduction: the RL Setting
2. Theoretical Developments
3. Progress in continuous control
4. Progress in board games
5. Some practical tips
Introduction: the RL Setting
RL in one minute
• Reinforcement learning = extreme trial-and-error
• Deep RL = RL combined with deep neural networks
• Given a certain environment (game abstraction), an agent
(player) selects between actions a available to her, and in doing
so, collects a reward r and moves from state s to state s′.
• The agent’s objective is to pick actions and build a trajectory so as to maximize the cumulative sum of rewards.
• Rewards and states are not known in advance, so RL involves a trade-off between exploring new, risky states and exploiting known but potentially sub-optimal rewards.
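This loop, written against the OpenAI Gym API of the time, is the whole setting in code (CartPole-v1 and the random policy are illustrative choices):

```python
import gym

env = gym.make("CartPole-v1")        # illustrative environment choice
s = env.reset()
done, total_reward = False, 0.0
while not done:
    a = env.action_space.sample()    # a random policy: pure exploration
    s, r, done, _ = env.step(a)      # collect reward r, move from s to s'
    total_reward += r
print(total_reward)
```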
Just one more minute :)
Two main families of algorithmic methods for solving RL:
1. Q-learning consists of computing an approximation to an action-value function Q(s, a), which for each state-action pair tells us how happy we will be if we take that action in that state.
2. Policy gradients consist of approximately following the gradient of the sum of rewards (’do more of what works best’) so as to maximize it. They seek to compute π(a|s) = P(a|s) directly.
Deep Q-Networks (DQN) and Asynchronous Advantage Actor Critic
(A3C) are the most famous deep RL algorithms.
In general, RL methods are numerically unstable and require tricks to stabilize them and accelerate their convergence. A minimal tabular Q-learning sketch follows.
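As a concrete instance of the first family, here is a tabular Q-learning sketch (toy-sized, assuming a small discrete environment exposing the classic Gym reset/step API):

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning sketch for a small discrete environment
    following the classic Gym reset/step API."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy: explore with probability eps, otherwise exploit
            a = np.random.randint(n_actions) if np.random.rand() < eps \
                else int(np.argmax(Q[s]))
            s2, r, done, _ = env.step(a)
            # TD update: the target uses the hard max over next-state actions
            target = r + gamma * np.max(Q[s2]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q
```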
Theoretical Developments
Equivalence of soft Q-learning and policy gradients
• The single most important theoretical development of 2017.
• We’ve been doing reinforcement learning wrong all these years!
• Idea = turn the hard max in Q-learning into a soft max.
• Mathematically, relax V∗(s) = max_{a∈A} Q∗(s, a) into V∗(s) = β log ∫ exp(Q∗(s, a)/β) da
• This relaxation has been known since at least 2010.
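A quick numerical check of that relaxation (illustrative values): the β-scaled log-sum-exp over actions approaches the hard max as β → 0, and a numerically stable logsumexp is needed precisely because exp(Q/β) overflows for small β.

```python
import numpy as np
from scipy.special import logsumexp

Q = np.array([1.0, 2.0, 3.5, 0.5])         # Q(s, a) over a discrete action set
for beta in (10.0, 1.0, 0.1, 0.01):
    soft_v = beta * logsumexp(Q / beta)    # soft value: beta * log sum exp(Q/beta)
    print(beta, soft_v)                    # tends to max(Q) = 3.5 as beta -> 0
print(Q.max())
```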
Equivalence of soft Q-learning and policy gradients (2)
• Policy gradients often use entropy regularization to prevent
early convergence to degenerate policies.
• Schulman et al. show that, in this context, soft Q-learning and entropic policy gradients are the same.
• Easily proven with an entropic representation formula
(Richemond & Maginnis, to appear.)
• This also proves that the A3C algorithm as originally published
is wrong/incomplete and explains why it may fail to converge.
• Strong theoretical connections both with thermodynamics and
convex/online optimization.
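As a minimal numpy sketch of the entropic objective in question (tensor names and the coefficient β below are my own, illustrative choices):

```python
import numpy as np

def entropic_pg_loss(logp_taken, advantages, probs, beta=0.01):
    """Policy-gradient surrogate with an entropy bonus (A3C-style regularizer).

    logp_taken: log pi(a_t|s_t) of the actions actually taken
    advantages: estimated advantages for those actions
    probs:      full action distributions pi(.|s_t), shape (T, n_actions)
    """
    entropy = -np.sum(probs * np.log(probs + 1e-8), axis=-1)
    return -(logp_taken * advantages + beta * entropy).mean()
```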
Wasserstein RL solves the heat equation
• New development, on arXiv soon
• Optimal transport / Wasserstein distances have become
popular in the context of GANs
• When doing trust-region policy optimization in a small Wasserstein-2 ball rather than a Kullback-Leibler one, one can prove that the policy itself, in the small-step limit, satisfies the heat equation
∂tπ = −∇ · (π∇r) + β∆π
• This might also be a justification for the emergence of
distributional RL
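A toy 1-D finite-difference simulation of that PDE, purely illustrative (the grid, reward landscape, and β below are made up):

```python
import numpy as np

# Toy 1-D discretisation of  dpi/dt = -d/dx(pi * dr/dx) + beta * d^2 pi/dx^2
n, dx, dt, beta = 200, 0.05, 1e-4, 0.1
x = np.arange(n) * dx
r = -(x - 5.0) ** 2                        # a made-up reward landscape peaked at x = 5
pi = np.ones(n) / (n * dx)                 # start from a uniform policy density
for _ in range(20000):
    drift = pi * np.gradient(r, dx)        # pi * grad r
    d_pi = -np.gradient(drift, dx) + beta * np.gradient(np.gradient(pi, dx), dx)
    pi = np.clip(pi + dt * d_pi, 0.0, None)
    pi /= pi.sum() * dx                    # renormalise to a probability density
# pi now concentrates around the reward peak, smoothed out by the beta term
```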
Progress in continuous control
TRPO-PPO
• Trust Region Policy Optimization (TRPO) is a very robust policy-gradient baseline that iteratively refines policies by moving them within a small step size.
• This approach is strongly sensitive to the step size, an important tunable hyperparameter, and is generally sample-inefficient.
• Proximal Policy Optimization is an evolution of TRPO that
attempts to do away with those constraints
• The idea is a new objective function that clips the ratio of action probabilities under the new and old policies
• A simple change in the code, sketched below.
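The clipped surrogate itself, as a minimal numpy sketch (ε = 0.2 is the value suggested in the PPO paper; array names are illustrative):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    ratio = np.exp(logp_new - logp_old)            # probability ratio pi_new / pi_old
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    return -np.minimum(unclipped, clipped).mean()  # pessimistic (lower) bound
```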
ACKTR
• Combines actor-critic, trust-region optimization, and Kronecker-factored approximate curvature (’K-FAC’)
• As a second-order (natural-gradient) method, it is more sample-efficient
• The flip side is increased computational cost per update
• Scales well with batch size
• Together, PPO and ACKTR represent the state-of-the-art baselines in continuous control... until the official release of soft actor-critic (Haarnoja et al., 2018)
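The Kronecker trick in isolation, as an illustrative numpy check (matrix sizes are made up): inverting the big factored curvature matrix A ⊗ S reduces to inverting the two small factors.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 4
A = rng.standard_normal((n, n)); A = A @ A.T + n * np.eye(n)  # input-activation factor
S = rng.standard_normal((m, m)); S = S @ S.T + m * np.eye(m)  # output-gradient factor
G = rng.standard_normal((m, n))                               # a layer's weight gradient

# Direct solve against the full (n*m x n*m) Kronecker product...
x_full = np.linalg.solve(np.kron(A, S), G.flatten(order="F"))
# ...equals two small solves: (A kron S)^-1 vec(G) = vec(S^-1 G A^-1)
x_kron = np.linalg.solve(S, G) @ np.linalg.inv(A)
assert np.allclose(x_full, x_kron.flatten(order="F"))
```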
Video examples
Progress in board games
AlphaGo Zero
• Arguably the landmark practical achievement of the year in
reinforcement learning !
• The first AlphaGo learned the game by analyzing 30 million Go moves from human players (the GoKiFu dataset) before refining that supervised policy with self-play
• This illustrates the difficulty of the mathematical optimization
problem posed by RL, and the necessity to initialize close to a
good solution.
• AlphaGo Zero plays significantly better, and does away with supervision and human knowledge completely (other than the rules of the game)
AlphaGo Zero: Architecture (1)
• Problem with Go = large branching factor in the tree of possible
moves.
• Neural networks enable dynamic pruning of that tree - a policy
network limits the breadth of the subtree to explore, while a
value network limits the depth of that subtree. If those networks
were perfect, little exploration would be needed - in order to get
there, we can refine them iteratively.
• Merging the policy and value networks encourages feature reuse and improves learning efficiency
• In order to train deeper networks and enable ever more abstract feature extraction, residual networks are used (a toy two-headed sketch follows)
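A toy sketch of such a two-headed residual network in tf.keras (far shallower and narrower than the real model; the 17-plane 19×19 input and the 19×19+1 move output follow the AlphaGo Zero paper):

```python
import tensorflow as tf

inp = tf.keras.Input(shape=(19, 19, 17))    # 17 binary feature planes, as in the paper
x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(inp)
for _ in range(3):                          # a few (toy-sized) residual blocks
    y = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    y = tf.keras.layers.Conv2D(64, 3, padding="same")(y)
    x = tf.keras.layers.ReLU()(tf.keras.layers.Add()([x, y]))
flat = tf.keras.layers.Flatten()(x)
# two heads on one shared trunk: move probabilities and a scalar value
policy = tf.keras.layers.Dense(19 * 19 + 1, activation="softmax", name="policy")(flat)
value = tf.keras.layers.Dense(1, activation="tanh", name="value")(flat)
model = tf.keras.Model(inp, [policy, value])
```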
AlphaGo Zero: Architecture (2)
• Monte Carlo Tree Search (MCTS) is the other part of the algorithm that, just like a neural network, dynamically expands and compresses the tree of possibilities. It averages results across many paths of the sub-tree, and can be viewed as a form of imagination, or slow thinking.
• The key insight of AlphaGo Zero is to view MCTS as a policy improvement mechanism: you can’t improve any more when your fast thinking (NN) and your slow thinking (MCTS) coincide.
• The loss function is ”simple” for a program that complex:
L(θ) = (z − vθ)² − πᵀ log pθ + c‖θ‖²
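That loss written out in numpy (shapes and the regularization constant c are illustrative): z is the game outcome, vθ the value-head output, π the MCTS visit distribution, pθ the policy-head output.

```python
import numpy as np

def alphago_zero_loss(z, v, pi, p, theta, c=1e-4):
    value_term = (z - v) ** 2                    # value head regresses the game outcome
    policy_term = -pi @ np.log(p + 1e-8)         # cross-entropy against MCTS visit counts
    reg_term = c * np.sum(theta ** 2)            # L2 weight regularization
    return value_term + policy_term + reg_term
```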
AlphaGo Zero: Insights
• AlphaZero is the generic, game-agnostic version of AlphaGo Zero that also taught itself to play Chess and Shogi (Japanese chess, where you can capture your opponent’s pieces and play them yourself) at a superhuman level using the same algorithm.
• In stark contrast with Deep Blue or Stockfish, AlphaZero ’only’ evaluates 50,000 positions per move thanks to MCTS, rather than millions.
• AlphaGo has now become so strong it is used as an online
teaching tool for human players - AlphaGo Teach
https://alphagoteach.deepmind.com/
• David Silver speculates that MCTS acts as a large averaging mechanism, rather than the ’maximum’ operation of standard tree pruning, smoothing out neural-network approximation error instead of propagating it, and hence enabling stable training of deep networks.
Some practical tips
Suggestions for experiment design
Factors influencing training, by order of importance:
• Fix your random seed for reproducibility!
• Softmax temperature is the single most critical factor according to Nachum & Haarnoja, followed by...
• Reward scaling, and more generally reward engineering, help a lot
• The number of steps used in the calculation of returns
• The discount factor is also an issue
Clip gradients to avoid NaNs, visualize them with TensorBoard, and finally be patient waiting for convergence... (the first two tips are sketched below)
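The seed and gradient-clipping tips in TensorFlow 1.x-era code, as a sketch (the variable and loss are stand-ins for your own):

```python
import tensorflow as tf

tf.set_random_seed(42)                                     # fix the graph-level seed

w = tf.get_variable("w", shape=[10], initializer=tf.zeros_initializer())
loss = tf.reduce_sum(tf.square(w - 1.0))                   # stand-in loss for the sketch

optimizer = tf.train.AdamOptimizer(learning_rate=1e-4)
grads, variables = zip(*optimizer.compute_gradients(loss))
clipped, global_norm = tf.clip_by_global_norm(grads, 5.0)  # clip to avoid NaNs
train_op = optimizer.apply_gradients(zip(clipped, variables))

tf.summary.scalar("grad_global_norm", global_norm)         # monitor in TensorBoard
```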
Key TensorFlow code repos
• As close to official as possible:
https://github.com/tensorflow/models/tree/
master/research/pcl_rl
by Ofir Nachum, the author of PCL; soon to be integrated into TensorFlow
• RLlib: A Composable and Scalable Reinforcement Learning
Library
https://github.com/ray-project/ray/tree/master/
python/ray/rllib
• Imperial’s Brendan Maginnis’ atari-rl
https://github.com/brendanator/atari-rl
• OpenAI’s baselines
https://github.com/openai/baselines
Deep Reinforcement Learning That Matters
• Encouraging reproducibility in reinforcement learning
• ’RL algorithms have many moving parts that are hard to debug,
and they require substantial effort in tuning in order to get good
results.’
• False discoveries everywhere? Strong dependency on the random seed
• Multiple runs are necessary to give an idea of the variability. Reporting the best returns is not enough - at the very least, report top and bottom quantiles (see the sketch below)
• A necessary debate, as seen at NIPS 2017.
Don’t be an alchemist
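Reporting quantiles over seeds, as a minimal sketch (the evaluate function standing in for a full training run is hypothetical):

```python
import numpy as np

def evaluate(seed):
    """Hypothetical stand-in: run one full training job with this seed
    and return its final score."""
    rng = np.random.default_rng(seed)
    return 200.0 + 40.0 * rng.standard_normal()

returns = np.array([evaluate(seed) for seed in range(10)])
q25, q50, q75 = np.percentile(returns, [25, 50, 75])
print(f"median {q50:.1f}, interquartile range [{q25:.1f}, {q75:.1f}]")
```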
Conclusion & perspectives
• Another landmark year both for theoretical and practical RL
• In imperfect-information settings - poker: Libratus and DeepStack became superhuman at Texas Hold’em
• In robotics, notable progress also in hierarchical reinforcement
learning, meta-learning and imitation learning.
• Planning, robustness and sample efficiency remain the greatest
challenges to current methods
• The next big challenge: StarCraft II, a partially observed real-time strategy game with sparse rewards and a huge action space. PySC2 is available for training agents at https://github.com/deepmind/pysc2
Questions from the audience?
Thanks for your time!
Twitter: @KloudStrife