Pong from Pixels
Deep RL Bootcamp
Andrej Karpathy, Aug 26, 2017
MDP
Policy network
Input: [80 x 80] array (height x width).
E.g. 200 nodes in the hidden layer, so:
[(80*80)*200 + 200] + [200*1 + 1] = ~1.3M parameters
(Layer 1 + Layer 2)
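As a rough illustration, this two-layer policy fits in a few lines of numpy (a sketch in the spirit of the accompanying gist, not the exact code; biases are omitted here, so the count comes to about 1.28M weights):

import numpy as np

H, D = 200, 80 * 80                         # hidden units, input dimensionality
W1 = np.random.randn(H, D) / np.sqrt(D)     # layer 1: (80*80)*200 weights
W2 = np.random.randn(H) / np.sqrt(H)        # layer 2: 200 weights

def policy_forward(x):
    # returns the probability of moving UP and the hidden activations
    h = np.dot(W1, x)
    h[h < 0] = 0                            # ReLU nonlinearity
    logit = np.dot(W2, h)
    p = 1.0 / (1.0 + np.exp(-logit))        # sigmoid squashes the logit into a probability
    return p, h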
Policy network
Network does not see this. Network sees 80*80 = 6,400 numbers.
It gets a reward of +1 or -1, some of the time.
Q: How do we efficiently find a good setting of the 1.3M parameters?
Problem is easy if you want to be inefficient... (sketched in code below)
1. Repeat Forever:
2. Sample 1.3M random numbers
3. Run the policy for a while
4. If the performance is best so far, save it
5. Return the best policy
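A minimal sketch of this brute-force random search; run_policy is a hypothetical helper that plays a few games of Pong with the given parameter vector and returns the average score:

import numpy as np

def random_search(run_policy, num_params=1_300_000, iters=1000):
    best_params, best_score = None, -np.inf
    for _ in range(iters):                        # repeat (for a while)
        params = np.random.randn(num_params)      # sample 1.3M random numbers
        score = run_policy(params)                # run the policy for a while
        if score > best_score:                    # if the performance is best so far, save it
            best_params, best_score = params, score
    return best_params                            # return the best policy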
Policy Gradients
Suppose we had the training labels…
(we know what to do in any state)
(x1, UP)
(x2, DOWN)
(x3, UP)
...
maximize:
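The expression to maximize is an equation image in the deck; written out in the same notation, it is presumably the standard log-likelihood objective over the labeled pairs:

maximize: sum_i log p(y_i | x_i)

i.e., ordinary supervised learning: push up the probability of the given label y_i for each state x_i.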
Except, we don’t have labels...
Should we go UP or DOWN?
Except, we don’t have labels...
“Try a bunch of stuff and see
what happens. Do more of the
stuff that worked in the future.”
-RL
Let’s just act according to our current policy...
Roll out the policy and collect an episode.
4 rollouts:
Collect many rollouts...
Not sure what we did here, but apparently it was good.
Not sure what we did here, but it was bad.
Pretend every action we took here was the correct label. maximize:
Pretend every action we took here was the wrong label. maximize:
Supervised Learning
maximize:
For images x_i and their labels y_i.
Reinforcement Learning
maximize:
1) we have no labels, so we sample:
2) once we collect a batch of rollouts:
We call this the advantage; it's a number, like +1.0 or -1.0, based on how this action eventually turned out.
+ve advantage will make that action more likely in the future, for that state.
-ve advantage will make that action less likely in the future, for that state.
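The two "maximize:" expressions are equation images in the deck; written out in the document's notation, they are presumably the supervised log-likelihood objective and its advantage-weighted RL counterpart, where y_i is the action we sampled and A_i is its advantage:

Supervised: maximize sum_i log p(y_i | x_i)
RL:         maximize sum_i A_i * log p(y_i | x_i)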
Discounting
Reward +1.0, Reward -1.0
Blame each action assuming that its effects have exponentially decaying impact into the future.
Discounted rewards, gamma = 0.9 (values along the timeline such as -1, -0.9, -0.81, ... and 0.27, 0.24, 0.21, with 0 where no credit remains).
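A minimal sketch of the discounting computation (this mirrors the gist's discount_rewards; in Pong a nonzero reward ends a point, so the running sum is reset there):

import numpy as np

gamma = 0.9  # discount factor (as on the slide)

def discount_rewards(r):
    # spread each reward backward in time with exponentially decaying weight
    discounted = np.zeros_like(r, dtype=float)
    running_add = 0.0
    for t in reversed(range(len(r))):
        if r[t] != 0:
            running_add = 0.0          # reset at a game (point) boundary
        running_add = running_add * gamma + r[t]
        discounted[t] = running_add
    return discounted

# e.g. losing a point three steps after the last action:
# discount_rewards(np.array([0, 0, 0, -1.0])) -> [-0.729, -0.81, -0.9, -1.0]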
[Figure: discounted reward over time for one episode. The episode starts; eventually the ball gets past our paddle (-1 reward). We screwed up somewhere shortly before that, and everything earlier was hopeless and only contributes noise to the gradient.]
130-line gist, numpy as the only dependency:
https://gist.github.com/karpathy/a4166c7fe253700972fcbc77e4ea32c5
Nothing too scary over here.
We use OpenAI Gym.
And start the main training loop.
Get the current image and
preprocess it.
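A sketch of the preprocessing step (roughly what the gist does: crop the playing field, downsample 2x to 80x80, binarize, flatten to 6,400 numbers):

import numpy as np

def prepro(I):
    # turn a 210x160x3 Atari frame into a 6400-dimensional float vector
    I = I[35:195]            # crop to the playing field
    I = I[::2, ::2, 0]       # downsample by a factor of 2, keep one color channel -> 80x80
    I[I == 144] = 0          # erase one background color
    I[I == 109] = 0          # erase the other background color
    I[I != 0] = 1            # paddles and ball become 1
    return I.astype(np.float32).ravel()

# the policy is actually fed a difference of consecutive frames so motion is visible:
# x = prepro(observation) - prev_x   (prev_x starts as zeros)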
Bookkeeping so that we can do
backpropagation later. If you were
to use PyTorch or something, this
would not be needed.
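In plain numpy this bookkeeping is just a few Python lists that grow as the episode unfolds (a sketch of what the gist keeps around):

xs = []      # preprocessed inputs we saw
hs = []      # hidden-layer activations
dlogps = []  # gradient of the log-prob of the taken action w.r.t. the logit (see below)
drs = []     # rewards received after each action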
A small piece of backprop:
recall:   loss:
Derivative of the [log probability of the taken action given this image] with respect to the [output of the network (before sigmoid)].
More compact:
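The "recall" and "loss" expressions are equation images; in the notation used here, with p = sigmoid(f) the probability of going UP and y in {0, 1} encoding the action actually taken (1 = UP), the loss is the log-probability of that action, and its derivative with respect to the pre-sigmoid output f reduces to the compact form:

log p(y | x) = y * log(sigmoid(f)) + (1 - y) * log(1 - sigmoid(f))
d/df log p(y | x) = y - sigmoid(f)

which is why the code can simply record y - p at every step (that is what goes into dlogps).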
Step the environment
(execute the action,
get new state and
record the reward)
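Continuing the sketch (policy_forward, x, and the bookkeeping lists are as defined above; env is the Gym Pong environment, e.g. gym.make("Pong-v0"), and the classic Gym step API is assumed):

aprob, h = policy_forward(x)                          # probability of moving UP
action = 2 if np.random.uniform() < aprob else 3      # 2 = UP, 3 = DOWN in Pong
y = 1 if action == 2 else 0                           # "fake label" for the action we took
xs.append(x); hs.append(h); dlogps.append(y - aprob)  # bookkeeping for backprop
observation, reward, done, info = env.step(action)    # execute the action, get the new state
drs.append(reward)                                    # record the reward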
Once a rollout is done,
Concatenate together all images, hidden
states, etc. that were seen in this batch.
Again, if using PyTorch, no need to do this.
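Continuing the sketch: once the episode ends, the per-step lists are stacked into arrays for a single vectorized backward pass:

epx = np.vstack(xs)           # all inputs
eph = np.vstack(hs)           # all hidden states
epdlogp = np.vstack(dlogps)   # all per-step gradients of the log-prob w.r.t. the logit
epr = np.vstack(drs)          # all rewards
xs, hs, dlogps, drs = [], [], [], []   # reset for the next episode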
backprop!!!!!!1
Advantage modulation
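Continuing the sketch: the discounted returns are standardized and used to scale the per-step gradients before backprop (policy_backward stands in for the gist's two-layer backward pass producing dW1 and dW2):

discounted_epr = discount_rewards(epr)
discounted_epr -= np.mean(discounted_epr)   # standardize the returns...
discounted_epr /= np.std(discounted_epr)    # ...to help control gradient variance
epdlogp *= discounted_epr                   # advantage modulation (the policy-gradient step)
dW1, dW2 = policy_backward(eph, epdlogp)    # backprop through the two-layer network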
Use RMSProp for the
parameter update.
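Continuing the sketch: gradients are accumulated over a batch of episodes and applied with RMSProp (model is assumed to be a dict {'W1': W1, 'W2': W2}, with grad_buffer and rmsprop_cache dicts of the same shapes, mirroring the gist's layout; the learning rate is illustrative):

decay_rate = 0.99
learning_rate = 1e-3                                     # illustrative value
for k, v in model.items():
    g = grad_buffer[k]                                   # gradient summed over the batch
    rmsprop_cache[k] = decay_rate * rmsprop_cache[k] + (1 - decay_rate) * g**2
    model[k] += learning_rate * g / (np.sqrt(rmsprop_cache[k]) + 1e-5)   # gradient ascent on the objective
    grad_buffer[k] = np.zeros_like(v)                    # reset the batch buffer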
prints etc
In summary (recapped in code below)
1. Initialize a policy network at random
2. Repeat Forever:
3. Collect a bunch of rollouts with the policy
4. Increase the probability of actions that worked well
5. ???
6. Profit.
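As a hypothetical one-screen recap (the helper names here are made up and stand in for the pieces sketched earlier):

policy = init_random_policy()                # 1. initialize a policy network at random
while True:                                  # 2. repeat forever:
    rollouts = collect_rollouts(policy)      # 3. collect a bunch of rollouts with the policy
    update_policy(policy, rollouts)          # 4. increase the probability of actions that worked well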
Thank you! Questions?