SlideShare ist ein Scribd-Unternehmen logo
1 von 34
Introduction to Deep Reinforcement Learning
Moustafa Alzantot
PhD Student, Networked and Embedded Systems Lab, UCLA
Oct 22, 2017
Machine Learning
Computer programs can increase their performance on a given task
without being explicitly programmed for it, just by analyzing data !
Types Machine Learning
• Supervised Learning
• Given a set of labeled examples , predict the output label for new unseen
inputs.
• Unsupervised Learning
• Given unlabeled dataset, understand the structure of the data (e.g.
clustering, dimensionality reduction).
• Reinforcement Learning
• Branch of machine learning concerned with acting optimally in face of
uncertainty (i.e. learning to do ! )
Reinforcement Learning
• Agent observes the environment state, performs some action.
• In response, the environment state changes and agent receives reward.
• Goal of agent is to pick actions that maximizes the total reward received from
environment.
Environment
Agent
Actions: a
State: s
Reward: r
Source: Pieter Abeel, UC Berkley188
Examples
Ex: Grid World
 A maze-like problem
 The agent lives in a grid
 Walls block the agent’s path
 Noisy movement: actions do not always go as planned
 80% of the time, the action North takes the agent North
(if there is no wall there)
 10% of the time, North takes the agent West; 10% East
 If there is a wall in the direction the agent would have been taken, the agent stays put
 The agent receives rewards each time step
 Small “living” reward each step (can be negative)
 Big rewards come at the end (good or bad)
 Goal: maximize sum of rewards
Source: Pieter Abeel, UC Berkley188
Ex: Grid World
Deterministic Grid World Stochastic Grid World
Markov Decision Process
• MDP is used to describe RL environments.
• MDP is defined by:
• A set of states s S
A set of actions a A
A transition function
Probability that a from s leads to s’, i.e., P(s’| s, a)
Also called the model or the dynamics
A reward function
Sometimes just R(s) or R(s’)
Discount factor
Environment
Agent
Actions: a
State: s
Reward: r
Source: Pieter Abeel, UC Berkley188
Discounting
It’s reasonable to maximize the sum of rewards
It’s also reasonable to prefer rewards now to rewards later
One solution: values of rewards decay exponentially
0 < < 1
Worth Now Worth Next Step Worth In Two Steps
Why discount ?
— sooner rewards will probably have higher utility than later rewards
— Control preferences of different solutions.
— Avoid numerical issues (total rewards going to infinity)
Optimal policy
No penalty at each step • Reward for each step: -0.1
• Reward for each step: -2 • Reward for each step: +0.1
Remember MDPs
• MDP is defined by:
• A set of states s S
A set of actions a A
A transition function
Probability that a from s leads to s’, i.e., P(s’| s, a)
Also called the model or the dynamics
A reward function
Sometimes just R(s) or R(s’)
Discount factor
Environment
Agent
Actions: a
State: s
Reward: r
Solving MDPs
• If the MDP (environment model) is known, there are ways that are guaranteed
to find the optimal policy.
Value-function
The value (utility) of a state s:
V*(s) = expected utility starting in s and acting optimally
The value (utility) of a q-state (s,a):
Q*(s,a) = expected utility starting out having taken action a from state s and
(thereafter) acting optimally
The optimal policy:
*(s) = optimal action from state s
GridWorld: Q-Values
Noise = 0.2
Discount = 0.9
Living reward = 0
Source: Pieter Abeel, UC Berkley188
Value Iteration
 Theorem: will converge to unique optimal values
 Basic idea: approximations get refined towards optimal values
 Policy may converge long before values do
• Alpaydin: Introduction to Machine Learning, 3rd edition
Policy Iteration
• Value-iterations iterates to refine the value function estimates until it
converges.
• Optimal policy often converges before the value function.
• The final goal is to get an optimal policy.
• Policy-iteration: iterates to re-define the policy at each step.
• Alpaydin: Introduction to Machine Learning, 3rd edition
Reinforcement Learning ?!
Model-Based Learning
Model-Based Idea:
Learn an approximate model based on experiences
Solve for values as if the learned model were correct
Step 1: Learn empirical MDP model
Count outcomes s’ for each s, a
Normalize to give an estimate of
Discover each when we experience (s, a, s’)
Step 2: Solve the learned MDP
For example, use value iteration, as before
Model-Free Learning
• Directly learn the V and Q value functions without estimating T
and R.
• Remember:
Key question: how can we do this update to V without knowing T and R?
In other words, how to we take a weighted average without knowing the weights?
Q-Learning
 Use Temporal difference to learn Q(s, a) from observed samples.
 After convergence, extract the optimal policy !
How to Explore?
Several schemes for forcing exploration
Simplest: random actions (-greedy)
Every time step, flip a coin
With (small) probability , act randomly
With (large) probability 1-, act on current policy
Problems with random actions?
You do eventually explore the space, but keep
thrashing around once learning is done
One solution: lower over time
Another solution: exploration functions
Demo: MountainCar using Q-Learning
https://www.youtube.com/watch?v=ByOdncJE5bE
Approximate Q Learning
Approximate Q Learning
 Basic Q-Learning keeps a table of all q-values
 In realistic situations, we cannot possibly learn about every single state!
 Too many states to visit them all in training
 Too many states to hold the q-tables in memory
Approximate Q-Learning
 Using a feature representation, we can write a q function (or value function) for any
state using a few weights:
 Use optimization to find the weights that minimize MSE between predicted and
observed Q-values.
Questions:
How to approximate the Q(s, a) function ?
How to compute these features ?
Deep Q Networks
Remember:
Universal approximation theorem:
Neural Network with 1 hidden layer can learn any
bounded continuous function!
Deep Q Networks
Remember:
Deep neural networks are good as feature
extractors !
Deep Q Networks
Deep Q-Network: Atari
Deep Q-Network training
Deep Q-Network training
Experience Replay Trick
DQN Results in Atari
Resources
• Pieter Abeel, UC Berkley CS 188
• Alpaydin: Introduction to Machine Learning, 3rd edition
• David Silver, UCL Reinforcement Learning Course
• Yandex: Practical RL
• MIT: Deep Learning for self-driving cars !
• Stanford 234: Reinforcement Learning
Thanks
Send any question to
malzantot@ucla.edu

Weitere ähnliche Inhalte

Was ist angesagt?

Exploration Strategies in Reinforcement Learning
Exploration Strategies in Reinforcement LearningExploration Strategies in Reinforcement Learning
Exploration Strategies in Reinforcement LearningDongmin Lee
 
Reinforcement learning 7313
Reinforcement learning 7313Reinforcement learning 7313
Reinforcement learning 7313Slideshare
 
[1808.00177] Learning Dexterous In-Hand Manipulation
[1808.00177] Learning Dexterous In-Hand Manipulation[1808.00177] Learning Dexterous In-Hand Manipulation
[1808.00177] Learning Dexterous In-Hand ManipulationSeung Jae Lee
 
Reinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners TutorialReinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners TutorialOmar Enayet
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learningbutest
 
Dexterous In-hand Manipulation by OpenAI
Dexterous In-hand Manipulation by OpenAIDexterous In-hand Manipulation by OpenAI
Dexterous In-hand Manipulation by OpenAIAnand Joshi
 
Deep Q-learning from Demonstrations DQfD
Deep Q-learning from Demonstrations DQfDDeep Q-learning from Demonstrations DQfD
Deep Q-learning from Demonstrations DQfDAmmar Rashed
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement LearningYigit UNALLAR
 
1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration
1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration
1118_Seminar_Continuous_Deep Q-Learning with Model based accelerationHye-min Ahn
 
Competition winning learning rates
Competition winning learning ratesCompetition winning learning rates
Competition winning learning ratesMLconf
 
Planning and Learning with Tabular Methods
Planning and Learning with Tabular MethodsPlanning and Learning with Tabular Methods
Planning and Learning with Tabular MethodsDongmin Lee
 
Actor critic algorithm
Actor critic algorithmActor critic algorithm
Actor critic algorithmJie-Han Chen
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learningDing Li
 
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017MLconf
 
Real-time ranking with concept drift using expert advice
Real-time ranking with concept drift using expert adviceReal-time ranking with concept drift using expert advice
Real-time ranking with concept drift using expert adviceHila Becker
 
Real-time Ranking of Electrical Feeders using Expert Advice
Real-time Ranking of Electrical Feeders using Expert AdviceReal-time Ranking of Electrical Feeders using Expert Advice
Real-time Ranking of Electrical Feeders using Expert AdviceHila Becker
 
NUS-ISS Learning Day 2019-Introduction to reinforcement learning
NUS-ISS Learning Day 2019-Introduction to reinforcement learningNUS-ISS Learning Day 2019-Introduction to reinforcement learning
NUS-ISS Learning Day 2019-Introduction to reinforcement learningNUS-ISS
 
Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Dongmin Lee
 

Was ist angesagt? (20)

Exploration Strategies in Reinforcement Learning
Exploration Strategies in Reinforcement LearningExploration Strategies in Reinforcement Learning
Exploration Strategies in Reinforcement Learning
 
Reinforcement learning 7313
Reinforcement learning 7313Reinforcement learning 7313
Reinforcement learning 7313
 
[1808.00177] Learning Dexterous In-Hand Manipulation
[1808.00177] Learning Dexterous In-Hand Manipulation[1808.00177] Learning Dexterous In-Hand Manipulation
[1808.00177] Learning Dexterous In-Hand Manipulation
 
Reinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners TutorialReinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners Tutorial
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Dexterous In-hand Manipulation by OpenAI
Dexterous In-hand Manipulation by OpenAIDexterous In-hand Manipulation by OpenAI
Dexterous In-hand Manipulation by OpenAI
 
Deep Q-learning from Demonstrations DQfD
Deep Q-learning from Demonstrations DQfDDeep Q-learning from Demonstrations DQfD
Deep Q-learning from Demonstrations DQfD
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration
1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration
1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration
 
Competition winning learning rates
Competition winning learning ratesCompetition winning learning rates
Competition winning learning rates
 
Planning and Learning with Tabular Methods
Planning and Learning with Tabular MethodsPlanning and Learning with Tabular Methods
Planning and Learning with Tabular Methods
 
Actor critic algorithm
Actor critic algorithmActor critic algorithm
Actor critic algorithm
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 
Reinforcement Learning using OpenAI Gym
Reinforcement Learning using OpenAI GymReinforcement Learning using OpenAI Gym
Reinforcement Learning using OpenAI Gym
 
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
 
Real-time ranking with concept drift using expert advice
Real-time ranking with concept drift using expert adviceReal-time ranking with concept drift using expert advice
Real-time ranking with concept drift using expert advice
 
Real-time Ranking of Electrical Feeders using Expert Advice
Real-time Ranking of Electrical Feeders using Expert AdviceReal-time Ranking of Electrical Feeders using Expert Advice
Real-time Ranking of Electrical Feeders using Expert Advice
 
NUS-ISS Learning Day 2019-Introduction to reinforcement learning
NUS-ISS Learning Day 2019-Introduction to reinforcement learningNUS-ISS Learning Day 2019-Introduction to reinforcement learning
NUS-ISS Learning Day 2019-Introduction to reinforcement learning
 
Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)
 

Ähnlich wie Introduction to Deep Reinforcement Learning

Reinfrocement Learning
Reinfrocement LearningReinfrocement Learning
Reinfrocement LearningNatan Katz
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement LearningDongHyun Kwak
 
Introduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningIntroduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningNAVER Engineering
 
14_ReinforcementLearning.pptx
14_ReinforcementLearning.pptx14_ReinforcementLearning.pptx
14_ReinforcementLearning.pptxRithikRaj25
 
Demystifying deep reinforement learning
Demystifying deep reinforement learningDemystifying deep reinforement learning
Demystifying deep reinforement learning재연 윤
 
anintroductiontoreinforcementlearning-180912151720.pdf
anintroductiontoreinforcementlearning-180912151720.pdfanintroductiontoreinforcementlearning-180912151720.pdf
anintroductiontoreinforcementlearning-180912151720.pdfssuseradaf5f
 
24.09.2021 Reinforcement Learning Algorithms.pptx
24.09.2021 Reinforcement Learning Algorithms.pptx24.09.2021 Reinforcement Learning Algorithms.pptx
24.09.2021 Reinforcement Learning Algorithms.pptxManiMaran230751
 
Reinforcement Learning Guide For Beginners
Reinforcement Learning Guide For BeginnersReinforcement Learning Guide For Beginners
Reinforcement Learning Guide For Beginnersgokulprasath06
 
reinforcement-learning-141009013546-conversion-gate02.pdf
reinforcement-learning-141009013546-conversion-gate02.pdfreinforcement-learning-141009013546-conversion-gate02.pdf
reinforcement-learning-141009013546-conversion-gate02.pdfVaishnavGhadge1
 
An efficient use of temporal difference technique in Computer Game Learning
An efficient use of temporal difference technique in Computer Game LearningAn efficient use of temporal difference technique in Computer Game Learning
An efficient use of temporal difference technique in Computer Game LearningPrabhu Kumar
 
Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018
Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018
Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018Universitat Politècnica de Catalunya
 
DRL #2-3 - Multi-Armed Bandits .pptx.pdf
DRL #2-3 - Multi-Armed Bandits .pptx.pdfDRL #2-3 - Multi-Armed Bandits .pptx.pdf
DRL #2-3 - Multi-Armed Bandits .pptx.pdfGulamSarwar31
 
How to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysHow to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysYasutoTamura1
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learningDongHyun Kwak
 
Reinforcement Learning and Artificial Neural Nets
Reinforcement Learning and Artificial Neural NetsReinforcement Learning and Artificial Neural Nets
Reinforcement Learning and Artificial Neural NetsPierre de Lacaze
 
Head First Reinforcement Learning
Head First Reinforcement LearningHead First Reinforcement Learning
Head First Reinforcement Learningazzeddine chenine
 
reinforcement-learning-141009013546-conversion-gate02.pptx
reinforcement-learning-141009013546-conversion-gate02.pptxreinforcement-learning-141009013546-conversion-gate02.pptx
reinforcement-learning-141009013546-conversion-gate02.pptxMohibKhan79
 
Aaa ped-24- Reinforcement Learning
Aaa ped-24- Reinforcement LearningAaa ped-24- Reinforcement Learning
Aaa ped-24- Reinforcement LearningAminaRepo
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learningElias Hasnat
 

Ähnlich wie Introduction to Deep Reinforcement Learning (20)

Reinfrocement Learning
Reinfrocement LearningReinfrocement Learning
Reinfrocement Learning
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Introduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningIntroduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement Learning
 
14_ReinforcementLearning.pptx
14_ReinforcementLearning.pptx14_ReinforcementLearning.pptx
14_ReinforcementLearning.pptx
 
Demystifying deep reinforement learning
Demystifying deep reinforement learningDemystifying deep reinforement learning
Demystifying deep reinforement learning
 
anintroductiontoreinforcementlearning-180912151720.pdf
anintroductiontoreinforcementlearning-180912151720.pdfanintroductiontoreinforcementlearning-180912151720.pdf
anintroductiontoreinforcementlearning-180912151720.pdf
 
24.09.2021 Reinforcement Learning Algorithms.pptx
24.09.2021 Reinforcement Learning Algorithms.pptx24.09.2021 Reinforcement Learning Algorithms.pptx
24.09.2021 Reinforcement Learning Algorithms.pptx
 
Reinforcement Learning Guide For Beginners
Reinforcement Learning Guide For BeginnersReinforcement Learning Guide For Beginners
Reinforcement Learning Guide For Beginners
 
reinforcement-learning-141009013546-conversion-gate02.pdf
reinforcement-learning-141009013546-conversion-gate02.pdfreinforcement-learning-141009013546-conversion-gate02.pdf
reinforcement-learning-141009013546-conversion-gate02.pdf
 
An efficient use of temporal difference technique in Computer Game Learning
An efficient use of temporal difference technique in Computer Game LearningAn efficient use of temporal difference technique in Computer Game Learning
An efficient use of temporal difference technique in Computer Game Learning
 
Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018
Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018
Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018
 
DRL #2-3 - Multi-Armed Bandits .pptx.pdf
DRL #2-3 - Multi-Armed Bandits .pptx.pdfDRL #2-3 - Multi-Armed Bandits .pptx.pdf
DRL #2-3 - Multi-Armed Bandits .pptx.pdf
 
How to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysHow to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative ways
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 
Reinforcement Learning and Artificial Neural Nets
Reinforcement Learning and Artificial Neural NetsReinforcement Learning and Artificial Neural Nets
Reinforcement Learning and Artificial Neural Nets
 
Head First Reinforcement Learning
Head First Reinforcement LearningHead First Reinforcement Learning
Head First Reinforcement Learning
 
reinforcement-learning-141009013546-conversion-gate02.pptx
reinforcement-learning-141009013546-conversion-gate02.pptxreinforcement-learning-141009013546-conversion-gate02.pptx
reinforcement-learning-141009013546-conversion-gate02.pptx
 
GDRR Opening Workshop - Deep Reinforcement Learning for Asset Based Modeling ...
GDRR Opening Workshop - Deep Reinforcement Learning for Asset Based Modeling ...GDRR Opening Workshop - Deep Reinforcement Learning for Asset Based Modeling ...
GDRR Opening Workshop - Deep Reinforcement Learning for Asset Based Modeling ...
 
Aaa ped-24- Reinforcement Learning
Aaa ped-24- Reinforcement LearningAaa ped-24- Reinforcement Learning
Aaa ped-24- Reinforcement Learning
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 

Mehr von IDEAS - Int'l Data Engineering and Science Association

Mehr von IDEAS - Int'l Data Engineering and Science Association (20)

How to deliver effective data science projects
How to deliver effective data science projectsHow to deliver effective data science projects
How to deliver effective data science projects
 
Digital cracks in banking--Sid Nandi
Digital cracks in banking--Sid NandiDigital cracks in banking--Sid Nandi
Digital cracks in banking--Sid Nandi
 
“Full Stack” Data Science with R for Startups: Production-ready with Open-Sou...
“Full Stack” Data Science with R for Startups: Production-ready with Open-Sou...“Full Stack” Data Science with R for Startups: Production-ready with Open-Sou...
“Full Stack” Data Science with R for Startups: Production-ready with Open-Sou...
 
Battling Skynet: The Role of Humanity in Artificial Intelligence
Battling Skynet: The Role of Humanity in Artificial IntelligenceBattling Skynet: The Role of Humanity in Artificial Intelligence
Battling Skynet: The Role of Humanity in Artificial Intelligence
 
Implementing Artificial Intelligence with Big Data
Implementing Artificial Intelligence with Big DataImplementing Artificial Intelligence with Big Data
Implementing Artificial Intelligence with Big Data
 
Data Architecture (i.e., normalization / relational algebra) and Database Sec...
Data Architecture (i.e., normalization / relational algebra) and Database Sec...Data Architecture (i.e., normalization / relational algebra) and Database Sec...
Data Architecture (i.e., normalization / relational algebra) and Database Sec...
 
Blockchain Application in Real Estate Transactions
Blockchain Application in Real Estate TransactionsBlockchain Application in Real Estate Transactions
Blockchain Application in Real Estate Transactions
 
Learning to learn Model Behavior: How to use "human-in-the-loop" to explain d...
Learning to learn Model Behavior: How to use "human-in-the-loop" to explain d...Learning to learn Model Behavior: How to use "human-in-the-loop" to explain d...
Learning to learn Model Behavior: How to use "human-in-the-loop" to explain d...
 
Practical Machine Learning at Work
Practical Machine Learning at WorkPractical Machine Learning at Work
Practical Machine Learning at Work
 
Artificial Intelligence: Hype, Reality, Vision.
Artificial Intelligence: Hype, Reality, Vision.Artificial Intelligence: Hype, Reality, Vision.
Artificial Intelligence: Hype, Reality, Vision.
 
Operationalizing your Data Lake: Get Ready for Advanced Analytics
Operationalizing your Data Lake: Get Ready for Advanced AnalyticsOperationalizing your Data Lake: Get Ready for Advanced Analytics
Operationalizing your Data Lake: Get Ready for Advanced Analytics
 
Best Practices in Data Partnerships Between Mayor's Office and Academia
Best Practices in Data Partnerships Between Mayor's Office and AcademiaBest Practices in Data Partnerships Between Mayor's Office and Academia
Best Practices in Data Partnerships Between Mayor's Office and Academia
 
Everything You Wish You Knew About Search
Everything You Wish You Knew About SearchEverything You Wish You Knew About Search
Everything You Wish You Knew About Search
 
AliMe Bot Platform Technical Practice - Alibaba`s Personal Intelligent Assist...
AliMe Bot Platform Technical Practice - Alibaba`s Personal Intelligent Assist...AliMe Bot Platform Technical Practice - Alibaba`s Personal Intelligent Assist...
AliMe Bot Platform Technical Practice - Alibaba`s Personal Intelligent Assist...
 
Data-Driven AI for Entertainment and Healthcare
Data-Driven AI for Entertainment and HealthcareData-Driven AI for Entertainment and Healthcare
Data-Driven AI for Entertainment and Healthcare
 
Generating Creative Works with AI
Generating Creative Works with AIGenerating Creative Works with AI
Generating Creative Works with AI
 
Using AI to Tackle the Future of Health Care Data
Using AI to Tackle the Future of Health Care DataUsing AI to Tackle the Future of Health Care Data
Using AI to Tackle the Future of Health Care Data
 
State of AI/ML in Real Estate
State of AI/ML in Real EstateState of AI/ML in Real Estate
State of AI/ML in Real Estate
 
Hot Dog, Not Hot Dog! Generate new training data without taking more photos.
Hot Dog, Not Hot Dog! Generate new training data without taking more photos.Hot Dog, Not Hot Dog! Generate new training data without taking more photos.
Hot Dog, Not Hot Dog! Generate new training data without taking more photos.
 
Machine Learning in Healthcare and Life Science
Machine Learning in Healthcare and Life ScienceMachine Learning in Healthcare and Life Science
Machine Learning in Healthcare and Life Science
 

Kürzlich hochgeladen

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 

Kürzlich hochgeladen (20)

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

Introduction to Deep Reinforcement Learning

  • 1. Introduction to Deep Reinforcement Learning Moustafa Alzantot PhD Student, Networked and Embedded Systems Lab, UCLA Oct 22, 2017
  • 2. Machine Learning Computer programs can increase their performance on a given task without being explicitly programmed for it, just by analyzing data !
  • 3. Types Machine Learning • Supervised Learning • Given a set of labeled examples , predict the output label for new unseen inputs. • Unsupervised Learning • Given unlabeled dataset, understand the structure of the data (e.g. clustering, dimensionality reduction). • Reinforcement Learning • Branch of machine learning concerned with acting optimally in face of uncertainty (i.e. learning to do ! )
  • 4. Reinforcement Learning • Agent observes the environment state, performs some action. • In response, the environment state changes and agent receives reward. • Goal of agent is to pick actions that maximizes the total reward received from environment. Environment Agent Actions: a State: s Reward: r Source: Pieter Abeel, UC Berkley188
  • 6. Ex: Grid World  A maze-like problem  The agent lives in a grid  Walls block the agent’s path  Noisy movement: actions do not always go as planned  80% of the time, the action North takes the agent North (if there is no wall there)  10% of the time, North takes the agent West; 10% East  If there is a wall in the direction the agent would have been taken, the agent stays put  The agent receives rewards each time step  Small “living” reward each step (can be negative)  Big rewards come at the end (good or bad)  Goal: maximize sum of rewards Source: Pieter Abeel, UC Berkley188
  • 7. Ex: Grid World Deterministic Grid World Stochastic Grid World
  • 8. Markov Decision Process • MDP is used to describe RL environments. • MDP is defined by: • A set of states s S A set of actions a A A transition function Probability that a from s leads to s’, i.e., P(s’| s, a) Also called the model or the dynamics A reward function Sometimes just R(s) or R(s’) Discount factor Environment Agent Actions: a State: s Reward: r Source: Pieter Abeel, UC Berkley188
  • 9. Discounting It’s reasonable to maximize the sum of rewards It’s also reasonable to prefer rewards now to rewards later One solution: values of rewards decay exponentially 0 < < 1 Worth Now Worth Next Step Worth In Two Steps Why discount ? — sooner rewards will probably have higher utility than later rewards — Control preferences of different solutions. — Avoid numerical issues (total rewards going to infinity)
  • 10. Optimal policy No penalty at each step • Reward for each step: -0.1 • Reward for each step: -2 • Reward for each step: +0.1
  • 11. Remember MDPs • MDP is defined by: • A set of states s S A set of actions a A A transition function Probability that a from s leads to s’, i.e., P(s’| s, a) Also called the model or the dynamics A reward function Sometimes just R(s) or R(s’) Discount factor Environment Agent Actions: a State: s Reward: r
  • 12. Solving MDPs • If the MDP (environment model) is known, there are ways that are guaranteed to find the optimal policy.
  • 13. Value-function The value (utility) of a state s: V*(s) = expected utility starting in s and acting optimally The value (utility) of a q-state (s,a): Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally The optimal policy: *(s) = optimal action from state s
  • 14. GridWorld: Q-Values Noise = 0.2 Discount = 0.9 Living reward = 0 Source: Pieter Abeel, UC Berkley188
  • 15. Value Iteration  Theorem: will converge to unique optimal values  Basic idea: approximations get refined towards optimal values  Policy may converge long before values do • Alpaydin: Introduction to Machine Learning, 3rd edition
  • 16. Policy Iteration • Value-iterations iterates to refine the value function estimates until it converges. • Optimal policy often converges before the value function. • The final goal is to get an optimal policy. • Policy-iteration: iterates to re-define the policy at each step. • Alpaydin: Introduction to Machine Learning, 3rd edition
  • 18. Model-Based Learning Model-Based Idea: Learn an approximate model based on experiences Solve for values as if the learned model were correct Step 1: Learn empirical MDP model Count outcomes s’ for each s, a Normalize to give an estimate of Discover each when we experience (s, a, s’) Step 2: Solve the learned MDP For example, use value iteration, as before
  • 19. Model-Free Learning • Directly learn the V and Q value functions without estimating T and R. • Remember: Key question: how can we do this update to V without knowing T and R? In other words, how to we take a weighted average without knowing the weights?
  • 20. Q-Learning  Use Temporal difference to learn Q(s, a) from observed samples.  After convergence, extract the optimal policy !
  • 21. How to Explore? Several schemes for forcing exploration Simplest: random actions (-greedy) Every time step, flip a coin With (small) probability , act randomly With (large) probability 1-, act on current policy Problems with random actions? You do eventually explore the space, but keep thrashing around once learning is done One solution: lower over time Another solution: exploration functions
  • 22. Demo: MountainCar using Q-Learning https://www.youtube.com/watch?v=ByOdncJE5bE
  • 24. Approximate Q Learning  Basic Q-Learning keeps a table of all q-values  In realistic situations, we cannot possibly learn about every single state!  Too many states to visit them all in training  Too many states to hold the q-tables in memory
  • 25. Approximate Q-Learning  Using a feature representation, we can write a q function (or value function) for any state using a few weights:  Use optimization to find the weights that minimize MSE between predicted and observed Q-values. Questions: How to approximate the Q(s, a) function ? How to compute these features ?
  • 26. Deep Q Networks Remember: Universal approximation theorem: Neural Network with 1 hidden layer can learn any bounded continuous function!
  • 27. Deep Q Networks Remember: Deep neural networks are good as feature extractors !
  • 32. DQN Results in Atari
  • 33. Resources • Pieter Abeel, UC Berkley CS 188 • Alpaydin: Introduction to Machine Learning, 3rd edition • David Silver, UCL Reinforcement Learning Course • Yandex: Practical RL • MIT: Deep Learning for self-driving cars ! • Stanford 234: Reinforcement Learning
  • 34. Thanks Send any question to malzantot@ucla.edu