NILOOFAR SEDIGHIAN BIDGOLI
MACHINE LEARNING COURSE
CS DEPARTMENT, SBU UNIVERSITY
JUNE 2020, TEHRAN, IRAN
When it is not in our power to determine
what is true, we ought to act in accordance
with what is most probable.
- Descartes
That thing is a
“double bacon cheeseburger”
N.Sedighian - CS Dep. SBU - 06/2020
That thing is like this
other thing
Eat that thing because it
tastes good and will keep
you alive longer
Deep reinforcement learning is
about how we make decisions:
it tackles decision-making problems under uncertainty
Two core components in an RL system
 Agent: represents the “solution”
 A computer program whose single role is making decisions to solve complex
decision-making problems under uncertainty.
 Environment: the representation of the “problem”
 Everything that comes after the decision of the Agent.
Notations:
 State = s = x
 Action = control = a = u
 Policy π(a|s) is defined as a probability distribution over actions, not as a concrete action
 Like the weights in a deep learning model, it is parameterized by θ
 Gamma (γ): the discount factor; we discount rewards, i.e., lower their estimated value the further they lie in the future
 Human intuition: “In the long run, we are all dead.”
 If γ = 1: we care about all rewards equally
 If γ = 0: we care only about the immediate reward
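The effect of γ can be sketched in a few lines (a toy example, not from the slides):

```python
# Toy sketch: how the discount factor gamma weights a reward sequence.
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over the reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0]
print(discounted_return(rewards, 1.0))  # gamma = 1: all rewards count equally -> 4.0
print(discounted_return(rewards, 0.0))  # gamma = 0: only the immediate reward -> 1.0
print(discounted_return(rewards, 0.9))  # in between: later rewards count less
```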
Policy
Intuition: why humans?
 If you are the agent, the environment could be the laws of physics and the
rules of society that process your actions and determine their
consequences.
Were you ever in the wrong place at the wrong time?
That’s a state
There is no training data here
 Like a child learning how to live (and survive!)
 By trial and error
 With positive or negative rewards
 Reward and punishment method
Google's artificial
intelligence company,
DeepMind, has
developed an AI that
has managed to learn
how to walk, run, jump,
and climb without any
prior guidance. The result
is as impressive as it is
goofy
Watch Video
Google
DeepMind
Learning to play Atari
Watch Video
Reward vs Value
 Reward is an immediate signal received in a given state, while value is the
expected sum of all future rewards from that state.
 Value is a long-term expectation, while reward is an immediate pleasure.
Return
Tasks
 Natural ending: episodic tasks -> games
 Episode: sequence of time steps
 The sum of rewards collected in a single episode is called a return. Agents are
often designed to maximize the return.
 Without natural ending: continuing tasks -> learning forward motion
How the environment reacts to
certain actions is defined by a model
which may or may not be known by
the Agent
Approaches
 Analyze how good it is to reach a certain state or take a specific action (i.e.,
value learning)
 Measures the total reward you get from a particular state following a
specific policy
 Go cheat sheet
 Uses the V or Q value to derive the optimal policy
 Q-learning
 Use the model to find actions that have the maximum rewards (model-based
learning)
 Model-based RL uses the model and the cost function to find the optimal path
 Derive a policy directly to maximize rewards (policy gradient)
 For actions with better rewards, we make them more likely to happen (and vice versa).
For model-based
learning,
watch this video →
RL:
exploit and explore
How can we
mathematically formalize
the RL problem?
• Markov decision processes formalize the reinforcement
learning problem
• Q-learning and policy gradients are two major
algorithms in this area
MDP
 An attempt to model a complex probability distribution of rewards in relation
to a very large number of state-action pairs
 A Markov decision process is a method to sample from a complex distribution
to infer its properties, even when we do not understand the mechanism by
which they relate
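As a concrete sketch, a tiny MDP can be written out explicitly as states, actions, transition probabilities P(s′|s, a), and rewards R(s, a). All names and numbers below are made up for illustration:

```python
# A toy two-state MDP, written out explicitly (illustrative values only).
states = ["s0", "s1"]
actions = ["stay", "go"]

# P[(s, a)] maps each next state s' to its probability P(s' | s, a).
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s1": 0.9, "s0": 0.1},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 1.0},
}
R = {("s0", "go"): 1.0}  # sparse reward table; unlisted pairs give 0.0

# Sanity check: every transition distribution sums to 1.
assert all(abs(sum(d.values()) - 1.0) < 1e-9 for d in P.values())
```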
MDP
• Genes on a chromosome are
states. To read them (and create
amino acids) is to go through
their transitions
• Emotions are states in a
psychological system. Mood
swings are the transitions.
Markov chains have a particular property:
forgetting.
They assume the entirety of the past is encoded in
the present state
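The mood example above can be simulated as a Markov chain; note that `step` looks only at the current state, never at the history. The transition probabilities here are invented for illustration:

```python
import random

# Transition table of a toy "mood" Markov chain (made-up probabilities).
TRANSITIONS = {
    "calm":     {"calm": 0.8, "stressed": 0.2},
    "stressed": {"calm": 0.4, "stressed": 0.6},
}

def step(state, rng):
    """Sample the next state given only the current one (Markov property)."""
    r, acc = rng.random(), 0.0
    for nxt, p in TRANSITIONS[state].items():
        acc += p
        if r < acc:
            return nxt
    return nxt  # guard against floating-point rounding

rng = random.Random(0)
trajectory = ["calm"]
for _ in range(10):
    trajectory.append(step(trajectory[-1], rng))
print(trajectory)  # a sample path of mood swings
```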
Q-learning
the “quality” of an action taken in a given state
 Q-learning is a model-free reinforcement learning algorithm to learn a
policy telling an agent what action to take under what circumstances.
 For any finite Markov decision process (FMDP), Q-learning finds an optimal
policy in the sense of maximizing the expected value of the total reward
over any and all successive steps, starting from the current state.
Q
A value for each state-action pair, which is called
the action-value function, also known as the Q-function.
It is usually denoted by Q^π(s, a) and refers to the
expected return G when the Agent is at state s and
takes action a following the policy π.
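A minimal tabular sketch of Q-learning (the corridor environment and all constants below are invented for illustration, not from the slides): the agent behaves randomly, yet the update still learns the greedy-optimal Q values, which is what makes Q-learning model-free and off-policy.

```python
import random

# Tiny corridor: states 0..4, actions -1 (left) / +1 (right);
# reaching state 4 ends the episode with reward 1. Illustrative only.
N, ACTIONS = 5, (-1, +1)
ALPHA, GAMMA = 0.5, 0.9
Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}

def env_step(s, a):
    s2 = min(max(s + a, 0), N - 1)
    reward = 1.0 if s2 == N - 1 else 0.0
    return s2, reward, s2 == N - 1

rng = random.Random(0)
for _ in range(300):
    s, done, t = 0, False, 0
    while not done and t < 100:
        a = rng.choice(ACTIONS)          # random behaviour policy
        s2, r, done = env_step(s, a)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        best_next = 0.0 if done else max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s, t = s2, t + 1

# The learned greedy policy should now move right in every state.
print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N - 1)])
```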
Break
Westworld…
Creation of Adam, 1508-1512
Bellman Equation
It writes the "value" of a decision problem at a
certain point in time in terms of the payoff from
some initial choices and the "value" of the
remaining decision problem that results from
those initial choices
This means that if we know the value of s_{t+1}, we can very easily calculate the value of s_t.
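A worked micro-example of that statement, on a deterministic three-state chain invented for illustration: knowing V(s_{t+1}) gives V(s_t) in one step, so the values can be filled in backwards.

```python
# Deterministic chain s0 -> s1 -> s2 -> terminal; leaving s2 pays reward 1.
GAMMA = 0.9
rewards = [0.0, 0.0, 1.0]   # reward for the transition out of each state
V = [0.0] * 4               # V[3] is the terminal state, value 0

# Bellman backup, computed backwards: V(s) = r(s) + gamma * V(s+1)
for s in (2, 1, 0):
    V[s] = rewards[s] + GAMMA * V[s + 1]

print(V[:3])  # values grow toward the reward: V(s0) < V(s1) < V(s2)
```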
Iteration Phase:
DQN
Deep Q-network
Using a deep network to estimate Q
Experience Replay
Experience replay stores the last million state-action-reward
transitions in a replay buffer. We train Q with
batches of random samples from this buffer
 enabling the RL agent to sample from and train on previously observed data offline
 massively reducing the number of interactions needed with the environment
 batches of experience can be sampled, reducing the variance of learning updates
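A replay buffer of this kind is a few lines of code (class and parameter names below are illustrative, not from the slides):

```python
import collections
import random

class ReplayBuffer:
    """Stores the most recent `capacity` transitions; old ones drop out."""
    def __init__(self, capacity):
        self.buf = collections.deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buf.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random mini-batch: breaks the correlation between
        # consecutive transitions and reduces the variance of updates.
        return random.sample(list(self.buf), batch_size)

    def __len__(self):
        return len(self.buf)

buf = ReplayBuffer(capacity=1000)
for t in range(50):
    buf.push(t, 0, 0.0, t + 1, False)
batch = buf.sample(8)   # e.g. feed this mini-batch to the Q-network update
```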
Experience!
REINFORCE rule
= an estimator of the policy gradient
We change the policy in the direction of the steepest reward increase.
This means that for actions with better rewards, we make them more likely to happen.
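The rule can be demonstrated on a one-step, two-action bandit (the set-up and constants are invented for illustration; real REINFORCE runs over whole trajectories): the update θ += α · r · ∇ log π(a) makes rewarded actions more probable.

```python
import math
import random

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

theta = [0.0, 0.0]          # policy parameters, one per action
true_reward = [0.0, 1.0]    # action 1 is the good one (unknown to the agent)
LR = 0.1
rng = random.Random(0)

for _ in range(2000):
    probs = softmax(theta)
    a = 0 if rng.random() < probs[0] else 1
    r = true_reward[a]
    # REINFORCE: theta += lr * r * grad log pi(a); for a softmax policy,
    # grad_i log pi(a) = 1[a == i] - probs[i]
    for i in range(2):
        theta[i] += LR * r * ((1.0 if i == a else 0.0) - probs[i])

print(softmax(theta))   # probability mass has shifted onto action 1
```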
Actor-critic set-up:
the “actor” (policy) learns by using feedback
from the “critic” (value function).
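One update of that loop can be sketched in tabular form (all names and constants invented for illustration): the critic's TD error δ is the feedback signal; the critic moves V(s) toward the bootstrapped target, and the actor shifts its action preferences in proportion to δ.

```python
import math

GAMMA, ACTOR_LR, CRITIC_LR = 0.9, 0.1, 0.2
V = {s: 0.0 for s in range(3)}                           # critic: state values
pref = {(s, a): 0.0 for s in range(3) for a in (0, 1)}   # actor: preferences

def policy(s):
    exps = [math.exp(pref[(s, a)]) for a in (0, 1)]
    z = sum(exps)
    return [e / z for e in exps]

def actor_critic_update(s, a, r, s2, done):
    # Critic's TD error: how much better than expected was this step?
    delta = r + (0.0 if done else GAMMA * V[s2]) - V[s]
    V[s] += CRITIC_LR * delta                    # critic learns from delta
    probs = policy(s)
    for b in (0, 1):                             # actor learns from delta too
        pref[(s, b)] += ACTOR_LR * delta * ((1.0 if b == a else 0.0) - probs[b])
    return delta

delta = actor_critic_update(s=0, a=1, r=1.0, s2=1, done=False)
# A positive delta raises both V(0) and the preference for action 1 in state 0.
```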
So…
Questions
Sophia, from 2016
Thank you
More Related Content

What's hot

大富豪に対する機械学習の適用 + α
大富豪に対する機械学習の適用 + α大富豪に対する機械学習の適用 + α
大富豪に対する機械学習の適用 + αKatsuki Ohto
 
公共交通オープンデータ第2幕:「静的データは出来た、次はリアルタイム」と決めつける前に考えること
公共交通オープンデータ第2幕:「静的データは出来た、次はリアルタイム」と決めつける前に考えること公共交通オープンデータ第2幕:「静的データは出来た、次はリアルタイム」と決めつける前に考えること
公共交通オープンデータ第2幕:「静的データは出来た、次はリアルタイム」と決めつける前に考えることMasaki Ito
 
二人零和マルコフゲームにおけるオフ方策評価
二人零和マルコフゲームにおけるオフ方策評価二人零和マルコフゲームにおけるオフ方策評価
二人零和マルコフゲームにおけるオフ方策評価Kenshi Abe
 
Hierarchical Reinforcement Learning
Hierarchical Reinforcement LearningHierarchical Reinforcement Learning
Hierarchical Reinforcement Learningahmad bassiouny
 
拡大縮小から始める画像処理
拡大縮小から始める画像処理拡大縮小から始める画像処理
拡大縮小から始める画像処理yuichi takeda
 
Unity5.3をさわってみた
Unity5.3をさわってみたUnity5.3をさわってみた
Unity5.3をさわってみたKeizo Nagamine
 
レコメンドシステムの社会実装
レコメンドシステムの社会実装レコメンドシステムの社会実装
レコメンドシステムの社会実装西岡 賢一郎
 
Trust Region Policy Optimization
Trust Region Policy OptimizationTrust Region Policy Optimization
Trust Region Policy Optimizationmooopan
 
勾配降下法の 最適化アルゴリズム
勾配降下法の最適化アルゴリズム勾配降下法の最適化アルゴリズム
勾配降下法の 最適化アルゴリズムnishio
 
運用中のゲームにAIを導入するには〜プロジェクト推進・ユースケース・運用〜 [DeNA TechCon 2019]
運用中のゲームにAIを導入するには〜プロジェクト推進・ユースケース・運用〜 [DeNA TechCon 2019]運用中のゲームにAIを導入するには〜プロジェクト推進・ユースケース・運用〜 [DeNA TechCon 2019]
運用中のゲームにAIを導入するには〜プロジェクト推進・ユースケース・運用〜 [DeNA TechCon 2019]DeNA
 
Intro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningIntro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningKhaled Saleh
 
Interactive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and SpotifyInteractive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and SpotifyChris Johnson
 
機械学習のための数学のおさらい
機械学習のための数学のおさらい機械学習のための数学のおさらい
機械学習のための数学のおさらいHideo Terada
 
ゲームAIとマルチエージェント(下)
ゲームAIとマルチエージェント(下)ゲームAIとマルチエージェント(下)
ゲームAIとマルチエージェント(下)Youichiro Miyake
 
Multi-armed Bandits
Multi-armed BanditsMulti-armed Bandits
Multi-armed BanditsDongmin Lee
 
[DL輪読会]Neural Radiance Flow for 4D View Synthesis and Video Processing (NeRF...
[DL輪読会]Neural Radiance Flow for 4D View Synthesis and Video  Processing (NeRF...[DL輪読会]Neural Radiance Flow for 4D View Synthesis and Video  Processing (NeRF...
[DL輪読会]Neural Radiance Flow for 4D View Synthesis and Video Processing (NeRF...Deep Learning JP
 
確率的主成分分析
確率的主成分分析確率的主成分分析
確率的主成分分析Mika Yoshimura
 

What's hot (20)

大富豪に対する機械学習の適用 + α
大富豪に対する機械学習の適用 + α大富豪に対する機械学習の適用 + α
大富豪に対する機械学習の適用 + α
 
公共交通オープンデータ第2幕:「静的データは出来た、次はリアルタイム」と決めつける前に考えること
公共交通オープンデータ第2幕:「静的データは出来た、次はリアルタイム」と決めつける前に考えること公共交通オープンデータ第2幕:「静的データは出来た、次はリアルタイム」と決めつける前に考えること
公共交通オープンデータ第2幕:「静的データは出来た、次はリアルタイム」と決めつける前に考えること
 
二人零和マルコフゲームにおけるオフ方策評価
二人零和マルコフゲームにおけるオフ方策評価二人零和マルコフゲームにおけるオフ方策評価
二人零和マルコフゲームにおけるオフ方策評価
 
Hierarchical Reinforcement Learning
Hierarchical Reinforcement LearningHierarchical Reinforcement Learning
Hierarchical Reinforcement Learning
 
拡大縮小から始める画像処理
拡大縮小から始める画像処理拡大縮小から始める画像処理
拡大縮小から始める画像処理
 
Unity5.3をさわってみた
Unity5.3をさわってみたUnity5.3をさわってみた
Unity5.3をさわってみた
 
レコメンドシステムの社会実装
レコメンドシステムの社会実装レコメンドシステムの社会実装
レコメンドシステムの社会実装
 
Trust Region Policy Optimization
Trust Region Policy OptimizationTrust Region Policy Optimization
Trust Region Policy Optimization
 
How AlphaGo Works
How AlphaGo WorksHow AlphaGo Works
How AlphaGo Works
 
勾配降下法の 最適化アルゴリズム
勾配降下法の最適化アルゴリズム勾配降下法の最適化アルゴリズム
勾配降下法の 最適化アルゴリズム
 
運用中のゲームにAIを導入するには〜プロジェクト推進・ユースケース・運用〜 [DeNA TechCon 2019]
運用中のゲームにAIを導入するには〜プロジェクト推進・ユースケース・運用〜 [DeNA TechCon 2019]運用中のゲームにAIを導入するには〜プロジェクト推進・ユースケース・運用〜 [DeNA TechCon 2019]
運用中のゲームにAIを導入するには〜プロジェクト推進・ユースケース・運用〜 [DeNA TechCon 2019]
 
Intro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningIntro to Deep Reinforcement Learning
Intro to Deep Reinforcement Learning
 
Interactive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and SpotifyInteractive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and Spotify
 
機械学習のための数学のおさらい
機械学習のための数学のおさらい機械学習のための数学のおさらい
機械学習のための数学のおさらい
 
C++ マルチスレッド 入門
C++ マルチスレッド 入門C++ マルチスレッド 入門
C++ マルチスレッド 入門
 
ゲームAIとマルチエージェント(下)
ゲームAIとマルチエージェント(下)ゲームAIとマルチエージェント(下)
ゲームAIとマルチエージェント(下)
 
Multi-armed Bandits
Multi-armed BanditsMulti-armed Bandits
Multi-armed Bandits
 
[DL輪読会]Neural Radiance Flow for 4D View Synthesis and Video Processing (NeRF...
[DL輪読会]Neural Radiance Flow for 4D View Synthesis and Video  Processing (NeRF...[DL輪読会]Neural Radiance Flow for 4D View Synthesis and Video  Processing (NeRF...
[DL輪読会]Neural Radiance Flow for 4D View Synthesis and Video Processing (NeRF...
 
MRTK3を調べてみた
MRTK3を調べてみたMRTK3を調べてみた
MRTK3を調べてみた
 
確率的主成分分析
確率的主成分分析確率的主成分分析
確率的主成分分析
 

Similar to RL presentation

Reinforcement Learning to Mimic Portfolio Behavior
Reinforcement Learning to Mimic Portfolio BehaviorReinforcement Learning to Mimic Portfolio Behavior
Reinforcement Learning to Mimic Portfolio BehaviorYigal D. Jhirad
 
Evaluate deep q learning for sequential targeted marketing with 10-fold cross...
Evaluate deep q learning for sequential targeted marketing with 10-fold cross...Evaluate deep q learning for sequential targeted marketing with 10-fold cross...
Evaluate deep q learning for sequential targeted marketing with 10-fold cross...Jian Wu
 
Reinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine SweeperReinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine SweeperDataScienceLab
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement LearningSalem-Kabbani
 
Simulations Project Part II.pdf
Simulations Project Part II.pdfSimulations Project Part II.pdf
Simulations Project Part II.pdfJeanMarshall8
 
Applying Machine Learning for Mobile Games by Neil Patrick Del Gallego
Applying Machine Learning for Mobile Games by Neil Patrick Del GallegoApplying Machine Learning for Mobile Games by Neil Patrick Del Gallego
Applying Machine Learning for Mobile Games by Neil Patrick Del GallegoDEVCON
 
Using Open Source Tools for Machine Learning
Using Open Source Tools for Machine LearningUsing Open Source Tools for Machine Learning
Using Open Source Tools for Machine LearningAll Things Open
 
Modern Recommendation for Advanced Practitioners part2
Modern Recommendation for Advanced Practitioners part2Modern Recommendation for Advanced Practitioners part2
Modern Recommendation for Advanced Practitioners part2Flavian Vasile
 
Neural Network Model
Neural Network ModelNeural Network Model
Neural Network ModelEric Esajian
 
Why Now is the Best Time to Have a Phantom Stock Plan
Why Now is the Best Time to Have a Phantom Stock PlanWhy Now is the Best Time to Have a Phantom Stock Plan
Why Now is the Best Time to Have a Phantom Stock PlanThe VisionLink Advisory Group
 
Tensorflow KR PR12(Season 3) : 251th Paper Review
Tensorflow KR PR12(Season 3) : 251th Paper ReviewTensorflow KR PR12(Season 3) : 251th Paper Review
Tensorflow KR PR12(Season 3) : 251th Paper ReviewChanghoon Jeong
 
0415_seminar_DeepDPG
0415_seminar_DeepDPG0415_seminar_DeepDPG
0415_seminar_DeepDPGHye-min Ahn
 
Deep Reinforcement Learning: MDP & DQN - Xavier Giro-i-Nieto - UPC Barcelona ...
Deep Reinforcement Learning: MDP & DQN - Xavier Giro-i-Nieto - UPC Barcelona ...Deep Reinforcement Learning: MDP & DQN - Xavier Giro-i-Nieto - UPC Barcelona ...
Deep Reinforcement Learning: MDP & DQN - Xavier Giro-i-Nieto - UPC Barcelona ...Universitat Politècnica de Catalunya
 
Project global systems development corporation
Project global systems development corporationProject global systems development corporation
Project global systems development corporationReese Boone
 
Loan Eligibility Checker
Loan Eligibility CheckerLoan Eligibility Checker
Loan Eligibility CheckerKiranVodela
 
Predictive analytics for ROI driven decision making
Predictive analytics for ROI driven decision makingPredictive analytics for ROI driven decision making
Predictive analytics for ROI driven decision makingSai Kumar Devulapalli
 
Assignment 6 – Overall Instruction 1 Assignment 6 Pay.docx
Assignment 6 – Overall Instruction 1 Assignment 6 Pay.docxAssignment 6 – Overall Instruction 1 Assignment 6 Pay.docx
Assignment 6 – Overall Instruction 1 Assignment 6 Pay.docxbraycarissa250
 
NUS-ISS Learning Day 2018-How to train your program to play black jack
NUS-ISS Learning Day 2018-How to train your program to play black jackNUS-ISS Learning Day 2018-How to train your program to play black jack
NUS-ISS Learning Day 2018-How to train your program to play black jackNUS-ISS
 
Showing ROI for Your Analytic Project
Showing ROI for Your Analytic ProjectShowing ROI for Your Analytic Project
Showing ROI for Your Analytic ProjectDATAVERSITY
 

Similar to RL presentation (20)

Reinforcement Learning to Mimic Portfolio Behavior
Reinforcement Learning to Mimic Portfolio BehaviorReinforcement Learning to Mimic Portfolio Behavior
Reinforcement Learning to Mimic Portfolio Behavior
 
Evaluate deep q learning for sequential targeted marketing with 10-fold cross...
Evaluate deep q learning for sequential targeted marketing with 10-fold cross...Evaluate deep q learning for sequential targeted marketing with 10-fold cross...
Evaluate deep q learning for sequential targeted marketing with 10-fold cross...
 
Reinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine SweeperReinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine Sweeper
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Simulations Project Part II.pdf
Simulations Project Part II.pdfSimulations Project Part II.pdf
Simulations Project Part II.pdf
 
Applying Machine Learning for Mobile Games by Neil Patrick Del Gallego
Applying Machine Learning for Mobile Games by Neil Patrick Del GallegoApplying Machine Learning for Mobile Games by Neil Patrick Del Gallego
Applying Machine Learning for Mobile Games by Neil Patrick Del Gallego
 
Using Open Source Tools for Machine Learning
Using Open Source Tools for Machine LearningUsing Open Source Tools for Machine Learning
Using Open Source Tools for Machine Learning
 
Modern Recommendation for Advanced Practitioners part2
Modern Recommendation for Advanced Practitioners part2Modern Recommendation for Advanced Practitioners part2
Modern Recommendation for Advanced Practitioners part2
 
Neural Network Model
Neural Network ModelNeural Network Model
Neural Network Model
 
Why Now is the Best Time to Have a Phantom Stock Plan
Why Now is the Best Time to Have a Phantom Stock PlanWhy Now is the Best Time to Have a Phantom Stock Plan
Why Now is the Best Time to Have a Phantom Stock Plan
 
Tensorflow KR PR12(Season 3) : 251th Paper Review
Tensorflow KR PR12(Season 3) : 251th Paper ReviewTensorflow KR PR12(Season 3) : 251th Paper Review
Tensorflow KR PR12(Season 3) : 251th Paper Review
 
0415_seminar_DeepDPG
0415_seminar_DeepDPG0415_seminar_DeepDPG
0415_seminar_DeepDPG
 
scrib.pptx
scrib.pptxscrib.pptx
scrib.pptx
 
Deep Reinforcement Learning: MDP & DQN - Xavier Giro-i-Nieto - UPC Barcelona ...
Deep Reinforcement Learning: MDP & DQN - Xavier Giro-i-Nieto - UPC Barcelona ...Deep Reinforcement Learning: MDP & DQN - Xavier Giro-i-Nieto - UPC Barcelona ...
Deep Reinforcement Learning: MDP & DQN - Xavier Giro-i-Nieto - UPC Barcelona ...
 
Project global systems development corporation
Project global systems development corporationProject global systems development corporation
Project global systems development corporation
 
Loan Eligibility Checker
Loan Eligibility CheckerLoan Eligibility Checker
Loan Eligibility Checker
 
Predictive analytics for ROI driven decision making
Predictive analytics for ROI driven decision makingPredictive analytics for ROI driven decision making
Predictive analytics for ROI driven decision making
 
Assignment 6 – Overall Instruction 1 Assignment 6 Pay.docx
Assignment 6 – Overall Instruction 1 Assignment 6 Pay.docxAssignment 6 – Overall Instruction 1 Assignment 6 Pay.docx
Assignment 6 – Overall Instruction 1 Assignment 6 Pay.docx
 
NUS-ISS Learning Day 2018-How to train your program to play black jack
NUS-ISS Learning Day 2018-How to train your program to play black jackNUS-ISS Learning Day 2018-How to train your program to play black jack
NUS-ISS Learning Day 2018-How to train your program to play black jack
 
Showing ROI for Your Analytic Project
Showing ROI for Your Analytic ProjectShowing ROI for Your Analytic Project
Showing ROI for Your Analytic Project
 

Recently uploaded

Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 

Recently uploaded (20)

Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 

RL presentation

  • 1. NILOOFAR SEDIGHIAN BIDGOLI MACHINE LEARNING COURSE CS DEPARTMENT, SBU UNIVERSITY JUNE 2020, TEHRAN, IRAN
  • 2. When it is not in our power to determine what is true, we ought to act in accordance with what is most probable. - Descartes
  • 3. That thing is a “double bacon cheese burger N.Sedighian - CS Dep. SBU - 06/2020
  • 4. That thing is like this other thing N.Sedighian - CS Dep. SBU - 06/2020
  • 5. Eat that thing because it tastes good and will keep you alive longer N.Sedighian - CS Dep. SBU - 06/2020
  • 6. Deep reinforcement learning is about how we make decisions To tackle decision-making problems under uncertainty N.Sedighian - CS Dep. SBU - 06/2020
  • 7. N.Sedighian - CS Dep. SBU - 06/2020
  • 8. Two core components in a RL system  Agent: represents the “solution”  A computer program with a single role of making decisions to solve complex decision-making problems under uncertainty.  An Environment: that is the representation of a “problem”  Everything that comes after the decision of the Agent. N.Sedighian - CS Dep. SBU - 06/2020
  • 9. Notations:  State = s = x  Action = control = a = u  Policy 𝜋𝜋(𝑎𝑎|𝑠𝑠) is defined as probability and not as a concrete action  like weights in Deep Learning method, parameterized by θ  Gamma: We discount rewards or lower their estimated value in the future  Human intuition: “In the long run, we are all dead.  If it is 1: we care about all rewards equally  If it is 0: we care only about the immediate reward N.Sedighian - CS Dep. SBU - 06/2020
  • 10. Policy N.Sedighian - CS Dep. SBU - 06/2020
  • 11. Intuition: why humans?  If you are the agent, the environment could be the laws of physics and the rules of society that process your actions and determine the consequences of them. Were you ever in the wrong place at the wrong time? That’s a state N.Sedighian - CS Dep. SBU - 06/2020
  • 12. There is no training data here  Like humans learning how to live (and survive!) as a kid  By trial and error  With positive or negative rewards  Reward and punishment method N.Sedighian - CS Dep. SBU - 06/2020
  • 13. N.Sedighian - CS Dep. SBU - 06/2020
  • 14. N.Sedighian - CS Dep. SBU - 06/2020
  • 15. Google's artificial intelligence company, DeepMind, has developed an AI that has managed to learn how to walk, run, jump, and climb without any prior guidance. The result is as impressive as it is goofy Watch Video N.Sedighian - CS Dep. SBU - 06/2020
  • 16. N.Sedighian - CS Dep. SBU - 06/2020
  • 17. Google DeepMind Learning to play Atari Watch Video N.Sedighian - CS Dep. SBU - 06/2020
  • 18. Reward vs Value  Reward (Return) is an immediate signal that is received in a given state, while value is the sum of all rewards you might anticipate from that state.  Value is a long-term expectation, while reward is an immediate pleasure. N.Sedighian - CS Dep. SBU - 06/2020
  • 19. Return N.Sedighian - CS Dep. SBU - 06/2020
  • 20. Tasks
 Tasks with a natural ending: episodic tasks (e.g., games)
 Episode: a sequence of time steps
 The sum of rewards collected in a single episode is called the return. Agents are often designed to maximize the return.
 Tasks without a natural ending: continuing tasks (e.g., learning forward motion)
  • 21. How the environment reacts to certain actions is defined by a model, which may or may not be known by the Agent.
  • 22. Approaches
 Analyze how good it is to reach a certain state or take a specific action (value learning)
 Measures the total reward you get from a particular state by following a specific policy
 Go cheat sheet
 Uses the V or Q value to derive the optimal policy
 Q-learning
 Use a model to find the actions with the maximum reward (model-based learning)
 Model-based RL uses the model and the cost function to find the optimal path
 Derive a policy directly to maximize rewards (policy gradient)
 For actions with better rewards, we make them more likely to happen (and vice versa).
  • 23. For model-based learning, watch this → Watch Video
  • 25. How can we mathematically formalize the RL problem?
• Markov decision processes formalize the reinforcement learning problem
• Q-learning and policy gradients are two major algorithms in this area
  • 26. MDP
 An attempt to model a complex probability distribution of rewards in relation to a very large number of state-action pairs
 A Markov decision process is a way to sample from a complex distribution to infer its properties, even when we do not understand the mechanism by which they relate
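A minimal sketch of that sampling idea: draw transitions from an MDP's dynamics and estimate an expected reward empirically. The two-state MDP, its probabilities, and its rewards below are all invented for illustration:

```python
import random

# P[state][action] = list of (probability, next_state, reward)
P = {
    "s0": {"a": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"a": [(1.0, "s0", 0.0)]},
}

def step(state, action, rng):
    """Sample a (next_state, reward) pair from the MDP's dynamics."""
    r = rng.random()
    cum = 0.0
    for p, nxt, rew in P[state][action]:
        cum += p
        if r < cum:
            return nxt, rew
    return nxt, rew  # numerical safety for rounding at the boundary

rng = random.Random(0)
n = 10_000
total = sum(step("s0", "a", rng)[1] for _ in range(n))
print(total / n)  # Monte Carlo estimate, close to the true expectation 0.8
```

Even without a closed-form model of the distribution, repeated sampling recovers its properties.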
  • 32. MDP
• Genes on a chromosome are states. To read them (and create amino acids) is to go through their transitions.
• Emotions are states in a psychological system. Mood swings are the transitions.
  • 33. Markov chains have a particular property: oblivion, forgetting. They assume the entirety of the past is encoded in the present.
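That memorylessness can be sketched in a few lines: the next-state distribution depends only on the current state, never on how the chain got there. The weather states and probabilities here are invented:

```python
# Transition table: T[current_state][next_state] = probability
T = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}

def next_dist(history):
    """The Markov property: only the last state in the history matters."""
    return T[history[-1]]

# Two different pasts with the same present give identical predictions.
print(next_dist(["rainy", "rainy", "sunny"]))  # {'sunny': 0.9, 'rainy': 0.1}
print(next_dist(["sunny", "sunny", "sunny"]))  # {'sunny': 0.9, 'rainy': 0.1}
```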
  • 35. Q-learning: the “quality” of an action taken in a given state
 Q-learning is a model-free reinforcement learning algorithm that learns a policy telling an agent what action to take under what circumstances.
 For any finite Markov decision process (FMDP), Q-learning finds an optimal policy in the sense of maximizing the expected value of the total reward over all successive steps, starting from the current state.
  • 37. Q
A value for each state-action pair, called the action-value function, also known as the Q-function. It is usually denoted Q^π(s, a) and refers to the expected return G when the Agent is at state s and takes action a, following the policy π.
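A minimal tabular Q-learning sketch, assuming a tiny hand-made chain environment. The environment (states 0..3, actions 0 = left and 1 = right, reward 1 for reaching state 3) and all hyperparameters are illustrative, not from the slides:

```python
import random
from collections import defaultdict

def env_step(s, a):
    """Deterministic chain: move left or right; state 3 is terminal."""
    s2 = max(0, s - 1) if a == 0 else min(3, s + 1)
    return s2, (1.0 if s2 == 3 else 0.0), s2 == 3

alpha, gamma, eps = 0.5, 0.9, 0.3
Q = defaultdict(float)                     # Q[(state, action)]
rng = random.Random(0)

for _episode in range(500):
    s = 0
    for _t in range(200):                  # cap the episode length
        if rng.random() < eps:             # epsilon-greedy exploration
            a = rng.randrange(2)
        else:
            a = max((0, 1), key=lambda x: Q[(s, x)])
        s2, r, done = env_step(s, a)
        # The Q-learning update: move Q toward the bootstrapped target.
        target = r + (0.0 if done else gamma * max(Q[(s2, 0)], Q[(s2, 1)]))
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2
        if done:
            break

print(max((0, 1), key=lambda a: Q[(0, a)]))  # greedy action in state 0: 1 (right)
```

After training, Q(0, right) approaches γ² = 0.81, so the greedy policy walks toward the rewarding state.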
  • 38. Break: Westworld… Creation of Adam, 1508-1512
  • 40. Bellman Equation
It expresses the “value” of a decision problem at a certain point in time in terms of the payoff from some initial choices and the “value” of the remaining decision problem that results from those initial choices, so that if we know the value of s_{t+1}, we can very easily calculate the value of s_t.
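The Bellman recursion can be turned into a small value-iteration sketch: a state's value is the best achievable immediate reward plus the discounted value of the successor. The three-state deterministic toy problem and γ = 0.9 below are invented for illustration:

```python
gamma = 0.9
transitions = {                      # transitions[s][a] = (next_state, reward)
    "A": {"stay": ("A", 0.0), "go": ("B", 0.0)},
    "B": {"stay": ("B", 0.0), "go": ("C", 1.0)},
    "C": {"stay": ("C", 0.0)},
}

V = {s: 0.0 for s in transitions}
for _sweep in range(100):            # apply the Bellman backup until it settles
    V = {
        s: max(r + gamma * V[s2] for (s2, r) in acts.values())
        for s, acts in transitions.items()
    }

print(V)  # V["C"] = 0.0, V["B"] = 1.0, V["A"] = 0.9
```

Knowing V at the successor states makes each backup a one-line computation, which is exactly the point of the recursion.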
  • 41. Iteration Phase:
  • 45. DQN: Deep Q-Network
Using a deep neural network to estimate Q
  • 46. Experience Replay
Experience replay stores the last million state-action-reward transitions in a replay buffer. We train Q with batches of random samples from this buffer.
 It enables the RL agent to sample from, and train on, previously observed data offline
 It massively reduces the number of interactions needed with the environment
 Batches of experience can be sampled, reducing the variance of learning updates
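A minimal replay-buffer sketch (the capacity, batch size, and dummy transitions below are arbitrary choices, not from the slides):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of transitions with uniform random sampling."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experience evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=1000)
for t in range(50):
    buf.push(t, 0, 0.0, t + 1, False)    # dummy transitions
batch = buf.sample(8)                    # one random training batch
print(len(buf), len(batch))              # 50 8
```

Sampling uniformly from the buffer breaks the temporal correlation between consecutive transitions, which is what reduces the variance of the updates.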
  • 47. Experience!
  • 51. REINFORCE rule = an estimator of the gradient
We change the policy in the direction of the steepest reward increase. That means for actions with better rewards, we make them more likely to happen.
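A rough REINFORCE sketch on a two-armed bandit with a softmax policy; the bandit probabilities, learning rate, and step count are all invented for illustration. Under a softmax over logits, the gradient of log π(a) with respect to the logits is one_hot(a) − π, which the update below scales by the received reward:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    z = sum(e)
    return [x / z for x in e]

rng = random.Random(0)
true_rewards = [0.2, 0.8]       # arm 1 pays off more often
theta = [0.0, 0.0]              # policy parameters (logits)
lr = 0.1

for _ in range(2000):
    pi = softmax(theta)
    a = 0 if rng.random() < pi[0] else 1               # sample an action
    r = 1.0 if rng.random() < true_rewards[a] else 0.0  # sample its reward
    # REINFORCE: scale the score function (grad of log pi) by the reward.
    for i in range(2):
        grad_log = (1.0 if i == a else 0.0) - pi[i]
        theta[i] += lr * r * grad_log

print(softmax(theta))  # probability mass has shifted toward the better arm 1
```

Actions that earned reward are reinforced in proportion to how surprising they were under the current policy; a baseline is usually subtracted from r in practice to cut variance, which this sketch omits.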
  • 58. Actor-critic setup: the “actor” (policy) learns by using feedback from the “critic” (value function).
  • 63. So…
  • 65. Questions Sophia, on from 2016
  • 66. Thank you