Pong from Pixels
Deep RL Bootcamp
Andrej Karpathy, Aug 26, 2017
MDP
Policy network
Input: [80 x 80] array (height x width).
E.g. 200 nodes in the hidden layer, so:
[(80*80)*200 + 200] + [200*1 + 1] = ~1.3M parameters
(Layer 1 + Layer 2)
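As a rough illustration, this two-layer policy fits in a few lines of numpy (a sketch in the spirit of the accompanying gist, not the exact code; biases are omitted here, so the count comes to about 1.28M weights):

import numpy as np

H, D = 200, 80 * 80                         # hidden units, input dimensionality
W1 = np.random.randn(H, D) / np.sqrt(D)     # layer 1: (80*80)*200 weights
W2 = np.random.randn(H) / np.sqrt(H)        # layer 2: 200 weights

def policy_forward(x):
    # returns the probability of moving UP and the hidden activations
    h = np.dot(W1, x)
    h[h < 0] = 0                            # ReLU nonlinearity
    logit = np.dot(W2, h)
    p = 1.0 / (1.0 + np.exp(-logit))        # sigmoid squashes the logit into a probability
    return p, h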
Policy network
Network does not see this. Network sees 80*80 = 6,400 numbers.
It gets a reward of +1 or -1, some of the time.
Q: How do we efficiently find a good setting of the 1.3M parameters?
Problem is easy if you want to be inefficient... (sketched in code below)
1. Repeat Forever:
2. Sample 1.3M random numbers
3. Run the policy for a while
4. If the performance is best so far, save it
5. Return the best policy
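A minimal sketch of this brute-force random search; run_policy is a hypothetical helper that plays a few games of Pong with the given parameter vector and returns the average score:

import numpy as np

def random_search(run_policy, num_params=1_300_000, iters=1000):
    best_params, best_score = None, -np.inf
    for _ in range(iters):                        # repeat (for a while)
        params = np.random.randn(num_params)      # sample 1.3M random numbers
        score = run_policy(params)                # run the policy for a while
        if score > best_score:                    # if the performance is best so far, save it
            best_params, best_score = params, score
    return best_params                            # return the best policy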
Policy Gradients
Suppose we had the training labels…
(we know what to do in any state)
(x1, UP)
(x2, DOWN)
(x3, UP)
...
maximize:
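The expression to maximize is an equation image in the deck; written out in the same notation, it is presumably the standard log-likelihood objective over the labeled pairs:

maximize: sum_i log p(y_i | x_i)

i.e., ordinary supervised learning: push up the probability of the given label y_i for each state x_i.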
Except, we don’t have labels...
Should we go UP or DOWN?
Except, we don’t have labels...
“Try a bunch of stuff and see
what happens. Do more of the
stuff that worked in the future.”
-RL
Let’s just act according to our current policy...
Roll out the policy and collect an episode.
4 rollouts:
Collect many rollouts...
Not sure what we did here, but apparently it was good.
Not sure what we did here, but it was bad.
Pretend every action we took here was the correct label. maximize:
Pretend every action we took here was the wrong label. maximize:
Supervised Learning
maximize:
For images x_i and their labels y_i.
Reinforcement Learning
maximize:
1) we have no labels, so we sample:
2) once we collect a batch of rollouts:
We call this the advantage; it's a number, like +1.0 or -1.0, based on how this action eventually turned out.
+ve advantage will make that action more likely in the future, for that state.
-ve advantage will make that action less likely in the future, for that state.
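The two "maximize:" expressions are equation images in the deck; written out in the document's notation, they are presumably the supervised log-likelihood objective and its advantage-weighted RL counterpart, where y_i is the action we sampled and A_i is its advantage:

Supervised: maximize sum_i log p(y_i | x_i)
RL:         maximize sum_i A_i * log p(y_i | x_i)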
Discounting
Reward +1.0, Reward -1.0
Blame each action assuming that its effects have exponentially decaying impact into the future.
Discounted rewards, gamma = 0.9 (values along the timeline such as -1, -0.9, -0.81, ... and 0.27, 0.24, 0.21, with 0 where no credit remains).
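A minimal sketch of the discounting computation (this mirrors the gist's discount_rewards; in Pong a nonzero reward ends a point, so the running sum is reset there):

import numpy as np

gamma = 0.9  # discount factor (as on the slide)

def discount_rewards(r):
    # spread each reward backward in time with exponentially decaying weight
    discounted = np.zeros_like(r, dtype=float)
    running_add = 0.0
    for t in reversed(range(len(r))):
        if r[t] != 0:
            running_add = 0.0          # reset at a game (point) boundary
        running_add = running_add * gamma + r[t]
        discounted[t] = running_add
    return discounted

# e.g. losing a point three steps after the last action:
# discount_rewards(np.array([0, 0, 0, -1.0])) -> [-0.729, -0.81, -0.9, -1.0]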
[Figure: discounted reward over time for one episode. The episode starts; eventually the ball gets past our paddle (-1 reward). We screwed up somewhere shortly before that, and everything earlier was hopeless and only contributes noise to the gradient.]
130-line gist, numpy as the only dependency:
https://gist.github.com/karpathy/a4166c7fe253700972fcbc77e4ea32c5
Nothing too scary over here.
We use OpenAI Gym.
And start the main training loop.
Get the current image and
preprocess it.
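A sketch of the preprocessing step (roughly what the gist does: crop the playing field, downsample 2x to 80x80, binarize, flatten to 6,400 numbers):

import numpy as np

def prepro(I):
    # turn a 210x160x3 Atari frame into a 6400-dimensional float vector
    I = I[35:195]            # crop to the playing field
    I = I[::2, ::2, 0]       # downsample by a factor of 2, keep one color channel -> 80x80
    I[I == 144] = 0          # erase one background color
    I[I == 109] = 0          # erase the other background color
    I[I != 0] = 1            # paddles and ball become 1
    return I.astype(np.float32).ravel()

# the policy is actually fed a difference of consecutive frames so motion is visible:
# x = prepro(observation) - prev_x   (prev_x starts as zeros)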
Bookkeeping so that we can do
backpropagation later. If you were
to use PyTorch or something, this
would not be needed.
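In plain numpy this bookkeeping is just a few Python lists that grow as the episode unfolds (a sketch of what the gist keeps around):

xs = []      # preprocessed inputs we saw
hs = []      # hidden-layer activations
dlogps = []  # gradient of the log-prob of the taken action w.r.t. the logit (see below)
drs = []     # rewards received after each action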
A small piece of backprop:
recall:   loss:
Derivative of the [log probability of the taken action given this image] with respect to the [output of the network (before sigmoid)].
More compact:
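The "recall" and "loss" expressions are equation images; in the notation used here, with p = sigmoid(f) the probability of going UP and y in {0, 1} encoding the action actually taken (1 = UP), the loss is the log-probability of that action, and its derivative with respect to the pre-sigmoid output f reduces to the compact form:

log p(y | x) = y * log(sigmoid(f)) + (1 - y) * log(1 - sigmoid(f))
d/df log p(y | x) = y - sigmoid(f)

which is why the code can simply record y - p at every step (that is what goes into dlogps).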
Step the environment
(execute the action,
get new state and
record the reward)
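Continuing the sketch (policy_forward, x, and the bookkeeping lists are as defined above; env is the Gym Pong environment, e.g. gym.make("Pong-v0"), and the classic Gym step API is assumed):

aprob, h = policy_forward(x)                          # probability of moving UP
action = 2 if np.random.uniform() < aprob else 3      # 2 = UP, 3 = DOWN in Pong
y = 1 if action == 2 else 0                           # "fake label" for the action we took
xs.append(x); hs.append(h); dlogps.append(y - aprob)  # bookkeeping for backprop
observation, reward, done, info = env.step(action)    # execute the action, get the new state
drs.append(reward)                                    # record the reward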
Once a rollout is done,
Concatenate together all images, hidden
states, etc. that were seen in this batch.
Again, if using PyTorch, no need to do this.
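Continuing the sketch: once the episode ends, the per-step lists are stacked into arrays for a single vectorized backward pass:

epx = np.vstack(xs)           # all inputs
eph = np.vstack(hs)           # all hidden states
epdlogp = np.vstack(dlogps)   # all per-step gradients of the log-prob w.r.t. the logit
epr = np.vstack(drs)          # all rewards
xs, hs, dlogps, drs = [], [], [], []   # reset for the next episode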
backprop!!!!!!1
Advantage modulation
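Continuing the sketch: the discounted returns are standardized and used to scale the per-step gradients before backprop (policy_backward stands in for the gist's two-layer backward pass producing dW1 and dW2):

discounted_epr = discount_rewards(epr)
discounted_epr -= np.mean(discounted_epr)   # standardize the returns...
discounted_epr /= np.std(discounted_epr)    # ...to help control gradient variance
epdlogp *= discounted_epr                   # advantage modulation (the policy-gradient step)
dW1, dW2 = policy_backward(eph, epdlogp)    # backprop through the two-layer network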
Use RMSProp for the
parameter update.
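Continuing the sketch: gradients are accumulated over a batch of episodes and applied with RMSProp (model is assumed to be a dict {'W1': W1, 'W2': W2}, with grad_buffer and rmsprop_cache dicts of the same shapes, mirroring the gist's layout; the learning rate is illustrative):

decay_rate = 0.99
learning_rate = 1e-3                                     # illustrative value
for k, v in model.items():
    g = grad_buffer[k]                                   # gradient summed over the batch
    rmsprop_cache[k] = decay_rate * rmsprop_cache[k] + (1 - decay_rate) * g**2
    model[k] += learning_rate * g / (np.sqrt(rmsprop_cache[k]) + 1e-5)   # gradient ascent on the objective
    grad_buffer[k] = np.zeros_like(v)                    # reset the batch buffer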
prints etc
In summary (recapped in code below)
1. Initialize a policy network at random
2. Repeat Forever:
3. Collect a bunch of rollouts with the policy
4. Increase the probability of actions that worked well
5. ???
6. Profit.
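As a hypothetical one-screen recap (the helper names here are made up and stand in for the pieces sketched earlier):

policy = init_random_policy()                # 1. initialize a policy network at random
while True:                                  # 2. repeat forever:
    rollouts = collect_rollouts(policy)      # 3. collect a bunch of rollouts with the policy
    update_policy(policy, rollouts)          # 4. increase the probability of actions that worked well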
Thank you! Questions?