Google confidential | Do not distribute
Deep Q-Network for beginners
Etsuji Nakai
Cloud Solutions Architect at Google
2016/08/01 ver1.0
$ who am i
▪Etsuji Nakai
Cloud Solutions Architect at Google
Twitter @enakai00
Typical application of DQN
▪ Learning “the optimal operations to achieve the best score” through the screen
images of video games.
– In theory, you can learn it (without knowing the rules of the game) by collecting
data consisting of “on what screen, with which operation, how the score will
change, and what will be the next screen.”
– This is analogous to constructing an algorithm for the game of Go by collecting
data consisting of “on what board position, where you put the next stone, and how
your advantage will change.”
https://www.youtube.com/watch?v=r3pb-ZDEKVg
https://www.youtube.com/watch?v=V1eYniJ0Rnk
Theoretical framework of DQN
▪ Suppose that you have all the quartets (s, a, r, s') for any pair (s, a), meaning “with
the current state s and the action a, you will receive a reward (score) r and the next
state will be s'.”
– This corresponds to the data “on what screen, with which operation, how the score will
change, and what will be the next screen.”
– It is impractical to collect data for all possible pairs (s, a), but suppose that you have
enough of them to train the model to a certain level.
– Note that, mathematically, r and s' are functions of the pair (s, a).
▪ You may naively think that the following gives the optimal action given the current
state s.
 ⇒ Choose the action a which maximizes the immediate reward r.
– But this doesn't necessarily result in the best scenario. In the case of Breakout, it is
better to hit the blocks near the side walls even though it may take a little longer.
▪ In a nutshell, you have to figure out the action a which maximizes the long-term
rewards.
Let’s imagine the magical “Q” function
▪ First, we define the total rewards as below.

    R = \sum_{n=0}^{\infty} \gamma^n \, r(s_n, a_n)

– s_n and a_n represent the state and action at the n-th step. \gamma is a small number
around 0.9, introduced to keep the sum from becoming infinite.
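As a tiny illustration (ours, not from the slides) of how \gamma discounts later rewards, here is the truncated version of this sum in Python; the reward list is a made-up example:

```python
GAMMA = 0.9  # the discount factor; GAMMA < 1 keeps the infinite sum bounded

def total_rewards(rewards):
    """Discounted sum r_0 + GAMMA*r_1 + GAMMA^2*r_2 + ... over a finite prefix."""
    return sum((GAMMA ** n) * r for n, r in enumerate(rewards))

print(total_rewards([0.0, 0.0, 1.0, 1.0]))  # later rewards count less: 0.81 + 0.729
```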
▪ Now suppose that we have a convenient magical function Q(s, a) as below although
we don’t know how to calculate it at all.
– Q(s, a) = “The total rewards you will receive when you choose the next action a,
and keep choosing the optimal actions afterwards.”
▪ Once you have the function Q(s, a), you can choose the optimal action at state s
with the following rule.
 ⇒ Choose the action a = \arg\max_a Q(s, a), i.e., the action that maximizes the total
rewards if you keep choosing the optimal actions afterwards.
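As a minimal sketch of this rule (our illustration; `q` is a hypothetical stand-in for whatever implements Q(s, a)), it is just an argmax over the available actions:

```python
def best_action(q, s, actions):
    """Greedy rule: pick the action a that maximizes Q(s, a)."""
    return max(actions, key=lambda a: q(s, a))
```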
The black magic of “recursive definition”
▪ Although we are not sure how we could calculate Q(s, a), we can say that it satisfies
the following “Q-equation”.

    Q(s, a) = r(s, a) + \gamma \max_{a'} Q(s', a')

– See the next slide for the mathematical proof.
Proof of the Q-equation
– Suppose that the chain of states and actions is as below when you start from the state s_0
and keep choosing the best actions.

    s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} \cdots

– From the definition of Q(s, a), the following equation holds.

    Q(s_0, a_0) = r(s_0, a_0) + \gamma r(s_1, a_1) + \gamma^2 r(s_2, a_2) + \cdots    (1)

– Now suppose that the initial state is s_1 instead of s_0, and you keep choosing the best
actions; the chain will be as below.

    s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} \cdots

– Again, from the definition of Q(s, a), the following equation holds.

    Q(s_1, a_1) = r(s_1, a_1) + \gamma r(s_2, a_2) + \gamma^2 r(s_3, a_3) + \cdots    (2)

– Rearrange (1) as below, and substitute (2).

    Q(s_0, a_0) = r(s_0, a_0) + \gamma \left\{ r(s_1, a_1) + \gamma r(s_2, a_2) + \cdots \right\}
                = r(s_0, a_0) + \gamma Q(s_1, a_1)

– Considering the following relations, this is equivalent to the Q-equation.

    s_1 = s'(s_0, a_0), \qquad Q(s_1, a_1) = \max_{a'} Q(s_1, a')
    (a_1 is the optimal action at s_1)
Approximate “Q” function using Q-equation
▪ Prepare a function with adjustable parameters; by tuning the parameters, you may
find a function which satisfies the Q-equation.
– If you succeed, you now have the “Q” function!
– Strictly speaking, the Q-equation is not a sufficient condition but a necessary one. However,
under some assumptions, it has been proved to be sufficient.
▪ Here are the steps to adjust the parameters.
– 1. Let D be the set of all quartets (s, a, r, s') you have.
– 2. Prepare an initial candidate of “Q” function as Q(s, a | w).
– 3. Calculate the following error function E(w) using all (or a part of) the data in D. This is
the sum of squared differences between the LHS and RHS of the Q-equation.

    E(w) = \sum_{(s, a, r, s') \in D} \left\{ Q(s, a \mid w) - \left( r + \gamma \max_{a'} Q(s', a' \mid w) \right) \right\}^2
– 4. Adjust the parameter w so that E(w) becomes smaller. Then, go back to 3.
▪ After repeating steps 3 and 4, if E(w) becomes small enough, you have an approximate
version of the “Q” function.
– A more complex candidate Q(s, a | w) generally yields a better approximation; a sketch of
this fitting loop follows below.
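To make steps 1 to 4 concrete, here is a minimal sketch (ours, not from the slides) that fits a linear candidate Q(s, a | w) by gradient descent on E(w). The toy chain environment, the feature function, and all hyperparameters are assumptions for illustration only:

```python
import numpy as np

GAMMA = 0.9                      # discount factor from the earlier slide
N_STATES, N_ACTIONS = 5, 2       # a toy chain environment (our assumption)

def features(s, a):
    """One-hot features for the pair (s, a); Q(s, a | w) = w . phi(s, a)."""
    phi = np.zeros(N_STATES * N_ACTIONS)
    phi[s * N_ACTIONS + a] = 1.0
    return phi

def q_value(w, s, a):
    return w @ features(s, a)

def train(D, w, lr=0.1, epochs=200):
    """Steps 3 and 4: repeatedly shrink the squared Q-equation error E(w)."""
    for _ in range(epochs):
        for (s, a, r, s_next) in D:
            # RHS of the Q-equation: r + gamma * max_a' Q(s', a' | w)
            target = r + GAMMA * max(q_value(w, s_next, b) for b in range(N_ACTIONS))
            # (LHS - RHS) for this quartet; one gradient step on its square
            error = q_value(w, s, a) - target
            w -= lr * error * features(s, a)
    return w

# Hypothetical data D: quartets (s, a, r, s') from random play on the chain,
# where action 1 moves right, action 0 moves left, and the right end pays 1.
rng = np.random.default_rng(0)
D = []
for _ in range(500):
    s = int(rng.integers(N_STATES))
    a = int(rng.integers(N_ACTIONS))
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    D.append((s, a, 1.0 if s_next == N_STATES - 1 else 0.0, s_next))

w = train(D, np.zeros(N_STATES * N_ACTIONS))
print([int(np.argmax([q_value(w, s, a) for a in range(N_ACTIONS)]))
       for s in range(N_STATES)])
# under these assumptions, action 1 ("move right") should be preferred everywhere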
What is “the more complicated function”?
▪ Yes!
The Deep Neural Network!
▪ “Deep Q-Network” is essentially a multi-layer neural network used as the candidate
Q-function.
By the way, what is Neural Network?
▪ Roughly speaking, it's just a combination of multiple simple functions resulting in a
highly complex function, as the small sketch below illustrates.
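As an illustration of “simple functions combined into a complex one” (our sketch, not from the deck; all names and sizes are made up), a two-layer network is just affine maps composed with an elementwise nonlinearity:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def two_layer_net(x, W1, b1, W2, b2):
    # Each layer is a simple affine map followed by a simple nonlinearity;
    # composing them yields a far more expressive function of x.
    return W2 @ relu(W1 @ x + b1) + b2

rng = np.random.default_rng(0)
x = rng.normal(size=4)                        # e.g., a tiny "state" vector
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(2, 8)), np.zeros(2)
print(two_layer_net(x, W1, b1, W2, b2))       # e.g., one output per action
```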
How do you collect the quartets?
▪ How do you collect the quartets (s, a, r, s') in the real world?
– Basically, just keep playing the game with random actions.
– In theory, if you keep playing for an infinite time, you would eventually encounter all the
possible states.
▪ But in reality, the probability of reaching some states with random actions is quite
small. To compensate for this, you can take the following strategy (sketched in code at
the end of this slide).
– Once you have collected some amount of data, train the Q-function using these data.
– After that, you play the game by mixing random actions and the (presumably) best actions
calculated from the current Q-function.
– When you have collected some more additional data, train the Q-function again.
– Through this cycle, you can make the Q-function better and collect more data, including
states which are unreachable with only random actions.
▪ Why not play only using the Q-function, without random actions?
– It doesn't work. By collecting all kinds of states, even with random actions, the
model can learn “how to gain long-term rewards through some non-rewarding
states.”
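Here is a minimal sketch of the collect-and-retrain cycle above, assuming a hypothetical gym-style environment with `reset()`, `step()`, and `sample_action()` methods; none of these names come from the slides:

```python
import random

def collect_quartets(env, q_best_action, n_steps, epsilon=0.1):
    """Play the game, mixing random actions with the current Q-function's
    (presumably) best actions, recording quartets (s, a, r, s')."""
    D = []
    s = env.reset()
    for _ in range(n_steps):
        if q_best_action is None or random.random() < epsilon:
            a = env.sample_action()        # explore: random action
        else:
            a = q_best_action(s)           # exploit: current Q-function's choice
        s_next, r, done = env.step(a)
        D.append((s, a, r, s_next))
        s = env.reset() if done else s_next
    return D
```

Each pass through collect_quartets feeds the next training round; as the Q-function improves, the exploiting branch carries play into states that random actions alone would rarely reach.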
Thank you!