Demystifying Reinforcement Learning
Slides by Jaeyeun Yoon, IDS Lab.
What is Reinforcement Learning?
• Learning by trial and error, in real time.
• Improves with experience.
• Inspired by psychology:
  - Agent + Environment
  - The agent selects actions to maximize a utility function.
When to use RL?
• Data in the form of trajectories.
• Need to make a sequence of (related) decisions.
• Observe (partial, noisy) feedback in response to the choice of actions.
• Tasks that require both learning and planning.
Supervised Learning vs. RL
(Comparison diagram slide; figure not reproduced.)
Markov Decision Process (MDP)
• Defined by:
  - $S = \{s_1, s_2, \dots, s_n\}$: the set of states (can be infinite / continuous)
  - $A = \{a_1, a_2, \dots, a_n\}$: the set of actions (can be infinite / continuous)
  - $T(s, a, s') = \Pr(s' \mid s, a)$: the dynamics of states (can be infinite / continuous)
  - $R(s, a)$: the reward function
  - $\mu(s)$: the initial state distribution
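To make the definition concrete, here is a minimal sketch of a toy MDP in Python; the two states, two actions, and all the numbers are invented for illustration and are not from the slides:

```python
import random

# A toy two-state MDP encoded as plain dictionaries: states S, actions A,
# transition probabilities T(s, a, s'), rewards R(s, a), and the
# initial-state distribution mu(s), mirroring the definition above.
states = ["s1", "s2"]
actions = ["stay", "move"]

T = {  # T[s][a] maps each possible next state to its probability
    "s1": {"stay": {"s1": 1.0}, "move": {"s2": 0.9, "s1": 0.1}},
    "s2": {"stay": {"s2": 1.0}, "move": {"s1": 0.9, "s2": 0.1}},
}
R = {"s1": {"stay": 0.0, "move": 1.0},
     "s2": {"stay": 0.5, "move": 0.0}}
mu = {"s1": 0.5, "s2": 0.5}

def step(s, a):
    """Sample s' ~ T(s, a, .) and return (s', r)."""
    next_states = list(T[s][a])
    probs = list(T[s][a].values())
    s_next = random.choices(next_states, weights=probs)[0]
    return s_next, R[s][a]
```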
The Markov Property
• The distribution over future states depends only on the present state and action, not on any earlier event:
  $\Pr(s_{t+1} \mid s_0, \dots, s_t, a_0, \dots, a_t) = \Pr(s_{t+1} \mid s_t, a_t)$
The goal of RL? Maximize return!
• The return $U_t$ of a trajectory is the sum of rewards starting from step t.
• Episodic task: consider the return over a finite horizon (e.g. games, mazes).
  → $U_t = r_t + r_{t+1} + r_{t+2} + \dots + r_T$
• Continuing task: consider the return over an infinite horizon (e.g. juggling, balancing).
  → $U_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$
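Both returns are one-liners in code. A sketch, with an arbitrary reward sequence chosen purely for illustration:

```python
rewards = [1.0, 0.0, 2.0, 1.0]  # r_t, r_{t+1}, ... (illustrative values)
gamma = 0.9

# Episodic return: plain sum of rewards up to the final step T.
u_episodic = sum(rewards)

# Discounted return: sum of gamma^k * r_{t+k}, as in the formula above.
u_discounted = sum(gamma**k * r for k, r in enumerate(rewards))

print(u_episodic)    # 4.0
print(u_discounted)  # 1.0 + 0.0 + 0.9**2 * 2.0 + 0.9**3 * 1.0 = 3.349
```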
The discount factor, γ
• Discount factor, $\gamma \in [0, 1]$ (usually close to 1).
• It values immediate reward above delayed reward:
  - γ close to 0 leads to “myopic” evaluation
  - γ close to 1 leads to “far-sighted” evaluation
• Intuition:
  - Receiving $80 today is worth the same as $100 tomorrow, assuming a discount factor of γ = 0.8.
  - At each time step, there is a (1 − γ) chance that the agent dies and does not receive rewards afterwards.
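The $80/$100 intuition as a one-line check (a sketch; the numbers come from the bullet above):

```python
gamma = 0.8
# Discounting tomorrow's $100 by gamma gives its value today.
print(gamma * 100.0)  # 80.0 -- the same worth as $80 received today
```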
Major Components of an RL Agent
•An RL agent may include one or more of these components:
- Policy: agent's behavior function
- Value function: how good is each state and/or action
- Model: agent's representation of the environment
Defining behavior: The policy
• The policy π defines the action-selection strategy at every state:
  $\pi(s, a) = P(a_t = a \mid s_t = s)$
  (or, for a deterministic policy, $\pi : S \to A$)
• Goal: find the policy that maximizes the expected total reward:
  $\arg\max_\pi \mathbb{E}_\pi[r_0 + r_1 + \dots + r_T \mid s_0]$
  (But there are many policies! How do we find the best one?)
Example: Career Options
(Two diagram slides; figures not reproduced.)
Value functions
• The expected return of a policy (from every state) is called the value function:
  $V^\pi(s) = \mathbb{E}_\pi[r_t + r_{t+1} + \dots + r_T \mid s_t = s]$
* A simple strategy to find the best policy (sketched in code below):
  1. Enumerate the space of all possible policies.
  2. Estimate the expected return of each one.
  3. Keep the policy that has the maximum expected return.
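A sketch of that three-step strategy for the toy MDP defined earlier; it reuses the illustrative `states`, `actions`, `mu`, and `step` names from that snippet, and estimates returns by Monte-Carlo rollouts:

```python
import itertools
import random

def estimate_return(policy, episodes=500, horizon=20, gamma=0.9):
    """Monte-Carlo estimate of the discounted return of a fixed policy."""
    total = 0.0
    for _ in range(episodes):
        s = random.choices(list(mu), weights=list(mu.values()))[0]  # s0 ~ mu
        for t in range(horizon):
            s_next, r = step(s, policy[s])
            total += gamma**t * r
            s = s_next
    return total / episodes

# 1. Enumerate every deterministic policy (one action per state).
candidates = [dict(zip(states, choice))
              for choice in itertools.product(actions, repeat=len(states))]
# 2.-3. Estimate each policy's return and keep the best one.
best = max(candidates, key=estimate_return)
print(best)
```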
Getting confused with terminology?
• Reward: one-step numerical feedback.
• Return: sum of rewards over the agent’s trajectory.
• Value: expected sum of rewards over the agent’s trajectory.
• Utility: a numerical function representing preferences.
* In RL, we assume Utility = Return.
RL algorithm outline
(Two diagram slides; figures not reproduced.)
Q-learning: Model-Free RL
• In Q-learning we define a function Q(s, a) representing the maximum discounted future reward when we perform action a in state s and continue optimally from that point on:
  $Q(s_t, a_t) = \max R_{t+1}$
• The way to think about Q(s, a) is that it is “the best possible score at the end of the game after performing action a in state s”. It is called the Q-function because it represents the “quality” of a certain action in a given state.
• Then the policy is simply to pick the action with the highest Q-value:
  $\pi(s) = \arg\max_a Q(s, a)$
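With a tabular Q-function stored as a nested dict, that policy is one line. A sketch with illustrative names and values:

```python
def greedy_policy(Q, s):
    """pi(s) = argmax_a Q(s, a), for a tabular Q stored as Q[s][a]."""
    return max(Q[s], key=Q[s].get)

Q = {"s1": {"stay": 0.2, "move": 1.3},
     "s2": {"stay": 0.7, "move": 0.1}}
print(greedy_policy(Q, "s1"))  # "move"
```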
Q-learning: Bellman equation
•How do we get that Q-function then? Let’s focus on just one
transition <s, a, r, s’>. Just like with discounted future rewards in
the previous section, we can express the Q-value of state s and
action a in terms of the Q-value of the next state s’.
$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$   (Bellman equation)
• The main idea in Q-learning
- we can iteratively approximate the Q-function using the Bellman equation.
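A minimal sketch of that iteration: tabular Q-learning with a learning rate α, reusing the illustrative `actions` and `step` names from the toy-MDP snippet above. (As editor's note 3 below points out, with α = 1 the update collapses to exactly the Bellman equation.)

```python
import collections
import random

alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = collections.defaultdict(lambda: {a: 0.0 for a in actions})

s = "s1"
for _ in range(10_000):
    # Pick an action (epsilon-greedy; exploration is discussed later).
    if random.random() < epsilon:
        a = random.choice(actions)
    else:
        a = max(Q[s], key=Q[s].get)
    s_next, r = step(s, a)  # observe one transition <s, a, r, s'>
    # Nudge Q(s, a) toward the Bellman target r + gamma * max_a' Q(s', a').
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
    s = s_next
```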
Q-learning: Atari Breakout
• For example, take the ‘Breakout’ game screens as in the DeepMind paper:
  → take the four last screen images, resize them to 84×84, and convert to grayscale with 256 gray levels
  → we would have $256^{84 \times 84 \times 4} \approx 10^{67970}$ possible game states.
  This means $10^{67970}$ rows in our imaginary Q-table → more than the number of atoms in the known universe!
Atari Breakout game. Image credit: DeepMind.
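The $10^{67970}$ figure above checks out with a quick log computation (a sketch):

```python
import math

# log10(256^(84*84*4)) = 84 * 84 * 4 * log10(256)
digits = 84 * 84 * 4 * math.log10(256)
print(round(digits))  # 67970
```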
Deep Q Network: Atari Breakout
• The Q-function can be approximated using a neural network model.
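A sketch of such an approximator in PyTorch. The layer shapes follow the published DQN architecture (four stacked 84×84 grayscale frames in, one Q-value per action out), but this particular code is illustrative, not the slides' exact model:

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a stack of 4 grayscale 84x84 frames to one Q-value per action."""
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),                # 64 * 7 * 7 = 3136 features
            nn.Linear(3136, 512), nn.ReLU(),
            nn.Linear(512, n_actions),   # no pooling layers -- see below
        )

    def forward(self, x):
        return self.net(x)

q_net = DQN(n_actions=4)
frames = torch.zeros(1, 4, 84, 84)  # a batch of one stacked observation
print(q_net(frames).shape)          # torch.Size([1, 4])
```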
Deep Q Network: Atari Breakout
(Architecture diagram slides; figures not reproduced.)
* No pooling layer? Why? (See editor's note 7 below.)
• Experience Replay
  - During gameplay all the experiences <s, a, r, s'> are stored in a replay memory. When training the network, random minibatches from the replay memory are used instead of the most recent transition.
• Exploration-Exploitation
  - ε-greedy exploration: with probability ε choose a random action, otherwise go with the “greedy” action with the highest Q-value. In their system, DeepMind actually decreases ε over time from 1 to 0.1.
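A sketch of both tricks together; the buffer capacity and batch size are illustrative choices, not values from the paper:

```python
import collections
import random

replay = collections.deque(maxlen=100_000)  # replay memory of <s, a, r, s'> tuples

def act(Q, s, actions, epsilon):
    """Epsilon-greedy: with probability epsilon explore, otherwise exploit.
    DeepMind anneals epsilon from 1.0 down to 0.1 over the course of training."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(Q[s], key=Q[s].get)

def sample_minibatch(transition, batch_size=32):
    """Store the newest transition, then draw a random minibatch to train on.
    Random draws break the correlation between consecutive transitions."""
    replay.append(transition)
    if len(replay) < batch_size:
        return []
    return random.sample(replay, batch_size)
```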
Editor's notes
  1. Emerging technologies such as smartphones and GPS enable the effortless collection of trajectories and other tracking data. More generally, a time-series is a recording of a signal that changes over time. 
  2. Markov assumption (the distribution depends only on the immediately preceding step).
  3. The α (alpha) in the algorithm is the learning rate, which controls how much of the difference between the previous Q-value and the newly proposed Q-value is taken into account. In particular, when α = 1, the two Q(s, a) terms cancel and the update is exactly the Bellman equation.
  4. Draw the Q-table here: states S down the rows, actions A across the columns, output Q(S, A).
  5. Draw the Q-table here: states S down the rows, actions A across the columns, output Q(S, A).
  6. Draw the Q-table here: states S down the rows, actions A across the columns, output Q(S, A).
  7. But if you really think about it, pooling layers buy you translation invariance: the network becomes insensitive to the location of an object in the image. That makes perfect sense for a classification task like ImageNet, but for games the location of the ball is crucial in determining the potential reward, and we wouldn't want to discard this information!
  8. Experience Replay: By now we have an idea of how to estimate the future reward in each state using Q-learning, and how to approximate the Q-function using a convolutional neural network. But it turns out that approximating Q-values using non-linear functions is very unstable. There is a whole bag of tricks needed to make it converge, and it takes a long time, almost a week on a single GPU. The most important trick is experience replay. During gameplay all the experiences <s, a, r, s'> are stored in a replay memory. When training the network, random samples from the replay memory are used instead of the most recent transition. This breaks the similarity of subsequent training examples, which might otherwise drive the network into a local minimum. Experience replay also makes the training task more similar to usual supervised learning, which simplifies debugging and testing of the algorithm. One could actually collect all these experiences from human gameplay and then train the network on them. Exploration-Exploitation: Q-learning tries to solve the credit assignment problem: it keeps propagating rewards back in time until it reaches the crucial decision point that actually caused the reward. So far, though, we have not addressed the exploration-exploitation dilemma. A first observation: when a Q-table or Q-network is initialized randomly, its predictions are initially random as well. If we pick the action with the highest Q-value, the action will be random and the agent performs crude “exploration”. As the Q-function converges, it returns more consistent Q-values and the amount of exploration decreases. In that sense, Q-learning incorporates exploration as part of the algorithm. But this exploration is “greedy”: it settles on the first effective strategy it finds. A simple and effective fix for this is ε-greedy exploration: with probability ε choose a random action, otherwise go with the “greedy” action with the highest Q-value. In their system, DeepMind actually decreases ε over time from 1 to 0.1: at the start the system makes completely random moves to explore the state space maximally, and then it settles down to a fixed exploration rate.