SlideShare ist ein Scribd-Unternehmen logo
1 von 12
Downloaden Sie, um offline zu lesen
Google confidential | Do not distribute
Deep Q-Network for beginners
Etsuji Nakai
Cloud Solutions Architect at Google
2016/08/01 ver1.0
$ who am i
▪Etsuji Nakai
Cloud Solutions Architect at Google
Twitter @enakai00
Typical application of DQN
▪ Learning “the optimal operations to achieve the best score” through the screen
images of video games.
– In theory, you can learn it (without knowing the rule of the game) by collecting
data consisting of “on what screen, with which operation, how the score will
change, and what will be the next screen.”
– This is analogous to construct an algorithm for the game of Go by collecting
data consisting of “On what face of board, where you put the next stone, how
your advantage will change.”
https://www.youtube.com/watch?v=r3pb-ZDEKVghttps://www.youtube.com/watch?v=V1eYniJ0Rnk
Theoretical framework of DQN
▪ Suppose that you have all the quartets (s, a, r, s') for any pair (s, a), meaning “with
the current state s and the action a, you will have a reward (score) r and the next
state will be s' ”
– This corresponds to the data “on what screen, with which operation, how the score will
change, and what will be the next screen.”
– It is impractical to collect data for all possible pairs (s, a), but suppose that you have
enough of them to train the model to a certain level.
– Note that, r and s' are functions of the pair (s, a) in terms of mathematics.
▪ You may naively think that the following gives the optimal action given the current
state s.
 ⇒ Choose the action a which maximizes the immediate reward r .
– But this doesn't necessarily result in the best scenario. In the case of Breakout, you’d
better hit blocks near the side walls even though it may take a little longer.
▪ In a nutshell, you have to figure out the action a which maximizes the long term
rewards.
Let’s imagine the magical “Q” function
▪ First, we define the total rewards as below.
– sn
and an
represent the state and action at n-th step. is a small number around 0.9
introduced to avoid the sum becomes infinite.
▪ Now suppose that we have a convenient magical function Q(s, a) as below although
we don’t know how to calculate it at all.
– Q(s, a) = “The total rewards you will receive when you choose the next action a,
and keep choosing the optimal actions afterwards.”
▪ Once you have the function Q(s, a) , you can choose the optimal action at state s
with the following rule.
 ⇒ Choose an action a which maximizes the total rewards if you keep choosing the optimal
actions afterwards.
The black magic of “recursive definition”
▪ Although we are not sure how we could calculate Q(s, a) , I can say that it satisfies
the following “Q-equation”.
– See the next slide for the mathematical proof.
Proof of the Q-equation
– Suppose that the chain of states and actions is as below when you start from the state s0
and keep choosing the best actions.
– From the definition of Q(s, a), the following equation holds.
– Now suppose that the initial state is s1
instead of s0
, and you keep choosing the best
actions, the chain will be as below.
– Again, from the definition of Q(s, a), the following equation holds.
– Rearrange (1) as below, and substitute (2).
– Considering the following relations, it’s equivalent to the Q-equation.
―― (1)
―― (2)
Approximate “Q” function using Q-equation
▪ Prepare some function with adjustable parameters, and by adjusting them, you may
find a function which satisfies the Q-equation.
– If you succeeded, now you have the “Q” function!
– Strictly speaking, Q-equation is not a sufficient condition but a necessary one. However,
under some assumptions, it’s been proved to be sufficient.
▪ Here’s the steps to adjust the parameters.
– 1. Let all the quartets (s, a, r, a') you have is D.
– 2. Prepare an initial candidate of “Q” function as Q(s, a | w).
– 3. Calculate the following error function E(w) using all (or a part of) data in D. This is the
sum of squared differences between LHS and RHS of the Q-equation.
– 4. Adjust the parameter w so that E(w) becomes smaller. Then, go back to 3.
▪ After repeating 3. and 4., if E(w) becomes small enough, you have the approximate
version of the “Q” function.
– You’d better use more complicated candidate Q(s, a | w) , so that you would have better
approximation.
What is “the more complicated function”?
▪ Yes!
The Deep Neural Network!
▪ “Deep Q-Network” is essentially a multi-layer neural network used as the candidate
of Q-function.
By the way, what is Neural Network?
▪ Roughly speaking, it’s just a combination of multiple simple functions resulting in a
highly complex function.
How do you collect the quartets?
▪ How do you collect the quartets (s, a, r, a') in the real world?
– Basically, just keep playing the game with random actions.
– In theory, if you keep playing for infinite time, you would encounter all the possible states.
▪ But in reality, the possibility to reach some states with random actions is quite
small. To compensate it, you can take the following strategy.
– Once you have collected some amount of data, train the Q-function using these data.
– After that, you play the game by mixing random actions and the (presumably) best actions
calculated from the current Q-function.
– When you have collected some more additional data, train the Q-function again.
– Through this cycle, you can make Q-function better, and collect more data including states
which is unreachable with only random actions.
▪ Why don’t you play only using Q-function without random actions?
– It doesn’t work. By collecting all kinds of states even with random actions, the
model can learn “how to gain long term rewards through some non-rewarding
states.”
Thank you!

Weitere ähnliche Inhalte

Andere mochten auch

A Brief History of My English Learning
A Brief History of My English LearningA Brief History of My English Learning
A Brief History of My English LearningEtsuji Nakai
 
Spannerに関する技術メモ
Spannerに関する技術メモSpannerに関する技術メモ
Spannerに関する技術メモEtsuji Nakai
 
Exploring the Philosophy behind Docker/Kubernetes/OpenShift
Exploring the Philosophy behind Docker/Kubernetes/OpenShiftExploring the Philosophy behind Docker/Kubernetes/OpenShift
Exploring the Philosophy behind Docker/Kubernetes/OpenShiftEtsuji Nakai
 
Using Kubernetes on Google Container Engine
Using Kubernetes on Google Container EngineUsing Kubernetes on Google Container Engine
Using Kubernetes on Google Container EngineEtsuji Nakai
 
TensorFlowによるニューラルネットワーク入門
TensorFlowによるニューラルネットワーク入門TensorFlowによるニューラルネットワーク入門
TensorFlowによるニューラルネットワーク入門Etsuji Nakai
 
TensorFlowプログラミングと分類アルゴリズムの基礎
TensorFlowプログラミングと分類アルゴリズムの基礎TensorFlowプログラミングと分類アルゴリズムの基礎
TensorFlowプログラミングと分類アルゴリズムの基礎Etsuji Nakai
 
Machine Learning Basics for Web Application Developers
Machine Learning Basics for Web Application DevelopersMachine Learning Basics for Web Application Developers
Machine Learning Basics for Web Application DevelopersEtsuji Nakai
 
Googleのインフラ技術から考える理想のDevOps
Googleのインフラ技術から考える理想のDevOpsGoogleのインフラ技術から考える理想のDevOps
Googleのインフラ技術から考える理想のDevOpsEtsuji Nakai
 
Open Shift v3 主要機能と内部構造のご紹介
Open Shift v3 主要機能と内部構造のご紹介Open Shift v3 主要機能と内部構造のご紹介
Open Shift v3 主要機能と内部構造のご紹介Etsuji Nakai
 
「TensorFlow Tutorialの数学的背景」 クイックツアー(パート1)
「TensorFlow Tutorialの数学的背景」 クイックツアー(パート1)「TensorFlow Tutorialの数学的背景」 クイックツアー(パート1)
「TensorFlow Tutorialの数学的背景」 クイックツアー(パート1)Etsuji Nakai
 
Docker活用パターンの整理 ― どう組み合わせるのが正解?!
Docker活用パターンの整理 ― どう組み合わせるのが正解?!Docker活用パターンの整理 ― どう組み合わせるのが正解?!
Docker活用パターンの整理 ― どう組み合わせるのが正解?!Etsuji Nakai
 
Python 機械学習プログラミング データ分析ライブラリー解説編
Python 機械学習プログラミング データ分析ライブラリー解説編Python 機械学習プログラミング データ分析ライブラリー解説編
Python 機械学習プログラミング データ分析ライブラリー解説編Etsuji Nakai
 
分散ストレージソフトウェアCeph・アーキテクチャー概要
分散ストレージソフトウェアCeph・アーキテクチャー概要分散ストレージソフトウェアCeph・アーキテクチャー概要
分散ストレージソフトウェアCeph・アーキテクチャー概要Etsuji Nakai
 
機械学習概論 講義テキスト
機械学習概論 講義テキスト機械学習概論 講義テキスト
機械学習概論 講義テキストEtsuji Nakai
 
Docker with RHEL7 技術勉強会
Docker with RHEL7 技術勉強会Docker with RHEL7 技術勉強会
Docker with RHEL7 技術勉強会Etsuji Nakai
 
OpenStackをさらに”使う”技術 - OpenStack&Docker活用テクニック
OpenStackをさらに”使う”技術 - OpenStack&Docker活用テクニックOpenStackをさらに”使う”技術 - OpenStack&Docker活用テクニック
OpenStackをさらに”使う”技術 - OpenStack&Docker活用テクニックEtsuji Nakai
 
Linux Container Technology inside Docker with RHEL7
Linux Container Technology inside Docker with RHEL7Linux Container Technology inside Docker with RHEL7
Linux Container Technology inside Docker with RHEL7Etsuji Nakai
 
OpenStackとDockerの未来像
OpenStackとDockerの未来像OpenStackとDockerの未来像
OpenStackとDockerの未来像Etsuji Nakai
 
Dockerイメージ管理の内部構造
Dockerイメージ管理の内部構造Dockerイメージ管理の内部構造
Dockerイメージ管理の内部構造Etsuji Nakai
 
OpenShift v3 Internal networking details
OpenShift v3 Internal networking detailsOpenShift v3 Internal networking details
OpenShift v3 Internal networking detailsEtsuji Nakai
 

Andere mochten auch (20)

A Brief History of My English Learning
A Brief History of My English LearningA Brief History of My English Learning
A Brief History of My English Learning
 
Spannerに関する技術メモ
Spannerに関する技術メモSpannerに関する技術メモ
Spannerに関する技術メモ
 
Exploring the Philosophy behind Docker/Kubernetes/OpenShift
Exploring the Philosophy behind Docker/Kubernetes/OpenShiftExploring the Philosophy behind Docker/Kubernetes/OpenShift
Exploring the Philosophy behind Docker/Kubernetes/OpenShift
 
Using Kubernetes on Google Container Engine
Using Kubernetes on Google Container EngineUsing Kubernetes on Google Container Engine
Using Kubernetes on Google Container Engine
 
TensorFlowによるニューラルネットワーク入門
TensorFlowによるニューラルネットワーク入門TensorFlowによるニューラルネットワーク入門
TensorFlowによるニューラルネットワーク入門
 
TensorFlowプログラミングと分類アルゴリズムの基礎
TensorFlowプログラミングと分類アルゴリズムの基礎TensorFlowプログラミングと分類アルゴリズムの基礎
TensorFlowプログラミングと分類アルゴリズムの基礎
 
Machine Learning Basics for Web Application Developers
Machine Learning Basics for Web Application DevelopersMachine Learning Basics for Web Application Developers
Machine Learning Basics for Web Application Developers
 
Googleのインフラ技術から考える理想のDevOps
Googleのインフラ技術から考える理想のDevOpsGoogleのインフラ技術から考える理想のDevOps
Googleのインフラ技術から考える理想のDevOps
 
Open Shift v3 主要機能と内部構造のご紹介
Open Shift v3 主要機能と内部構造のご紹介Open Shift v3 主要機能と内部構造のご紹介
Open Shift v3 主要機能と内部構造のご紹介
 
「TensorFlow Tutorialの数学的背景」 クイックツアー(パート1)
「TensorFlow Tutorialの数学的背景」 クイックツアー(パート1)「TensorFlow Tutorialの数学的背景」 クイックツアー(パート1)
「TensorFlow Tutorialの数学的背景」 クイックツアー(パート1)
 
Docker活用パターンの整理 ― どう組み合わせるのが正解?!
Docker活用パターンの整理 ― どう組み合わせるのが正解?!Docker活用パターンの整理 ― どう組み合わせるのが正解?!
Docker活用パターンの整理 ― どう組み合わせるのが正解?!
 
Python 機械学習プログラミング データ分析ライブラリー解説編
Python 機械学習プログラミング データ分析ライブラリー解説編Python 機械学習プログラミング データ分析ライブラリー解説編
Python 機械学習プログラミング データ分析ライブラリー解説編
 
分散ストレージソフトウェアCeph・アーキテクチャー概要
分散ストレージソフトウェアCeph・アーキテクチャー概要分散ストレージソフトウェアCeph・アーキテクチャー概要
分散ストレージソフトウェアCeph・アーキテクチャー概要
 
機械学習概論 講義テキスト
機械学習概論 講義テキスト機械学習概論 講義テキスト
機械学習概論 講義テキスト
 
Docker with RHEL7 技術勉強会
Docker with RHEL7 技術勉強会Docker with RHEL7 技術勉強会
Docker with RHEL7 技術勉強会
 
OpenStackをさらに”使う”技術 - OpenStack&Docker活用テクニック
OpenStackをさらに”使う”技術 - OpenStack&Docker活用テクニックOpenStackをさらに”使う”技術 - OpenStack&Docker活用テクニック
OpenStackをさらに”使う”技術 - OpenStack&Docker活用テクニック
 
Linux Container Technology inside Docker with RHEL7
Linux Container Technology inside Docker with RHEL7Linux Container Technology inside Docker with RHEL7
Linux Container Technology inside Docker with RHEL7
 
OpenStackとDockerの未来像
OpenStackとDockerの未来像OpenStackとDockerの未来像
OpenStackとDockerの未来像
 
Dockerイメージ管理の内部構造
Dockerイメージ管理の内部構造Dockerイメージ管理の内部構造
Dockerイメージ管理の内部構造
 
OpenShift v3 Internal networking details
OpenShift v3 Internal networking detailsOpenShift v3 Internal networking details
OpenShift v3 Internal networking details
 

Ähnlich wie Deep Q-Network for beginners

Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learningElias Hasnat
 
Review :: Demystifying deep reinforcement learning (written by Tambet Matiisen)
Review :: Demystifying deep reinforcement learning (written by Tambet Matiisen)Review :: Demystifying deep reinforcement learning (written by Tambet Matiisen)
Review :: Demystifying deep reinforcement learning (written by Tambet Matiisen)Hogeon Seo
 
Demystifying deep reinforement learning
Demystifying deep reinforement learningDemystifying deep reinforement learning
Demystifying deep reinforement learning재연 윤
 
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement LearningUsman Qayyum
 
Survey of Modern Reinforcement Learning
Survey of Modern Reinforcement Learning Survey of Modern Reinforcement Learning
Survey of Modern Reinforcement Learning Julia Maddalena
 
TensorFlow and Deep Learning Tips and Tricks
TensorFlow and Deep Learning Tips and TricksTensorFlow and Deep Learning Tips and Tricks
TensorFlow and Deep Learning Tips and TricksBen Ball
 
Intro to Reinforcement Learning
Intro to Reinforcement LearningIntro to Reinforcement Learning
Intro to Reinforcement LearningUtkarsh Garg
 
Single shot multiboxdetectors
Single shot multiboxdetectorsSingle shot multiboxdetectors
Single shot multiboxdetectors지현 백
 
Deep sarsa, Deep Q-learning, DQN
Deep sarsa, Deep Q-learning, DQNDeep sarsa, Deep Q-learning, DQN
Deep sarsa, Deep Q-learning, DQNEuijin Jeong
 
Reinfrocement Learning
Reinfrocement LearningReinfrocement Learning
Reinfrocement LearningNatan Katz
 
Chakrabarti alpha go analysis
Chakrabarti alpha go analysisChakrabarti alpha go analysis
Chakrabarti alpha go analysisDave Selinger
 
Single shot multiboxdetectors
Single shot multiboxdetectorsSingle shot multiboxdetectors
Single shot multiboxdetectors지현 백
 
Practical Reinforcement Learning with TensorFlow
Practical Reinforcement Learning with TensorFlowPractical Reinforcement Learning with TensorFlow
Practical Reinforcement Learning with TensorFlowIllia Polosukhin
 
Intro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningIntro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningKhaled Saleh
 
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017MLconf
 
Cs229 notes12
Cs229 notes12Cs229 notes12
Cs229 notes12VuTran231
 

Ähnlich wie Deep Q-Network for beginners (20)

Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 
Deep RL.pdf
Deep RL.pdfDeep RL.pdf
Deep RL.pdf
 
Deep Q-learning explained
Deep Q-learning explainedDeep Q-learning explained
Deep Q-learning explained
 
Review :: Demystifying deep reinforcement learning (written by Tambet Matiisen)
Review :: Demystifying deep reinforcement learning (written by Tambet Matiisen)Review :: Demystifying deep reinforcement learning (written by Tambet Matiisen)
Review :: Demystifying deep reinforcement learning (written by Tambet Matiisen)
 
Demystifying deep reinforement learning
Demystifying deep reinforement learningDemystifying deep reinforement learning
Demystifying deep reinforement learning
 
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement Learning
 
Survey of Modern Reinforcement Learning
Survey of Modern Reinforcement Learning Survey of Modern Reinforcement Learning
Survey of Modern Reinforcement Learning
 
TensorFlow and Deep Learning Tips and Tricks
TensorFlow and Deep Learning Tips and TricksTensorFlow and Deep Learning Tips and Tricks
TensorFlow and Deep Learning Tips and Tricks
 
Lec3 dqn
Lec3 dqnLec3 dqn
Lec3 dqn
 
Intro to Reinforcement Learning
Intro to Reinforcement LearningIntro to Reinforcement Learning
Intro to Reinforcement Learning
 
Single shot multiboxdetectors
Single shot multiboxdetectorsSingle shot multiboxdetectors
Single shot multiboxdetectors
 
Deep sarsa, Deep Q-learning, DQN
Deep sarsa, Deep Q-learning, DQNDeep sarsa, Deep Q-learning, DQN
Deep sarsa, Deep Q-learning, DQN
 
Reinfrocement Learning
Reinfrocement LearningReinfrocement Learning
Reinfrocement Learning
 
Chakrabarti alpha go analysis
Chakrabarti alpha go analysisChakrabarti alpha go analysis
Chakrabarti alpha go analysis
 
Single shot multiboxdetectors
Single shot multiboxdetectorsSingle shot multiboxdetectors
Single shot multiboxdetectors
 
Practical Reinforcement Learning with TensorFlow
Practical Reinforcement Learning with TensorFlowPractical Reinforcement Learning with TensorFlow
Practical Reinforcement Learning with TensorFlow
 
Intro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningIntro to Deep Reinforcement Learning
Intro to Deep Reinforcement Learning
 
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
 
Cs229 notes12
Cs229 notes12Cs229 notes12
Cs229 notes12
 
Reinforcement Learning - DQN
Reinforcement Learning - DQNReinforcement Learning - DQN
Reinforcement Learning - DQN
 

Mehr von Etsuji Nakai

「ITエンジニアリングの本質」を考える
「ITエンジニアリングの本質」を考える「ITエンジニアリングの本質」を考える
「ITエンジニアリングの本質」を考えるEtsuji Nakai
 
Googleのインフラ技術に見る基盤標準化とDevOpsの真実
Googleのインフラ技術に見る基盤標準化とDevOpsの真実Googleのインフラ技術に見る基盤標準化とDevOpsの真実
Googleのインフラ技術に見る基盤標準化とDevOpsの真実Etsuji Nakai
 
Introducton to Convolutional Nerural Network with TensorFlow
Introducton to Convolutional Nerural Network with TensorFlowIntroducton to Convolutional Nerural Network with TensorFlow
Introducton to Convolutional Nerural Network with TensorFlowEtsuji Nakai
 
Googleにおける機械学習の活用とクラウドサービス
Googleにおける機械学習の活用とクラウドサービスGoogleにおける機械学習の活用とクラウドサービス
Googleにおける機械学習の活用とクラウドサービスEtsuji Nakai
 
Lecture note on PRML 8.2
Lecture note on PRML 8.2Lecture note on PRML 8.2
Lecture note on PRML 8.2Etsuji Nakai
 
OpenShift v3 Technical Introduction
OpenShift v3 Technical IntroductionOpenShift v3 Technical Introduction
OpenShift v3 Technical IntroductionEtsuji Nakai
 
Python 機械学習プログラミング データ分析演習編
Python 機械学習プログラミング データ分析演習編Python 機械学習プログラミング データ分析演習編
Python 機械学習プログラミング データ分析演習編Etsuji Nakai
 
Red Hat Enterprise Linux OpenStack Platform 7 - VM Instance HA Architecture
Red Hat Enterprise Linux OpenStack Platform 7 - VM Instance HA ArchitectureRed Hat Enterprise Linux OpenStack Platform 7 - VM Instance HA Architecture
Red Hat Enterprise Linux OpenStack Platform 7 - VM Instance HA ArchitectureEtsuji Nakai
 

Mehr von Etsuji Nakai (10)

PRML11.2-11.3
PRML11.2-11.3PRML11.2-11.3
PRML11.2-11.3
 
「ITエンジニアリングの本質」を考える
「ITエンジニアリングの本質」を考える「ITエンジニアリングの本質」を考える
「ITエンジニアリングの本質」を考える
 
Googleのインフラ技術に見る基盤標準化とDevOpsの真実
Googleのインフラ技術に見る基盤標準化とDevOpsの真実Googleのインフラ技術に見る基盤標準化とDevOpsの真実
Googleのインフラ技術に見る基盤標準化とDevOpsの真実
 
Introducton to Convolutional Nerural Network with TensorFlow
Introducton to Convolutional Nerural Network with TensorFlowIntroducton to Convolutional Nerural Network with TensorFlow
Introducton to Convolutional Nerural Network with TensorFlow
 
Googleにおける機械学習の活用とクラウドサービス
Googleにおける機械学習の活用とクラウドサービスGoogleにおける機械学習の活用とクラウドサービス
Googleにおける機械学習の活用とクラウドサービス
 
Lecture note on PRML 8.2
Lecture note on PRML 8.2Lecture note on PRML 8.2
Lecture note on PRML 8.2
 
PRML7.2
PRML7.2PRML7.2
PRML7.2
 
OpenShift v3 Technical Introduction
OpenShift v3 Technical IntroductionOpenShift v3 Technical Introduction
OpenShift v3 Technical Introduction
 
Python 機械学習プログラミング データ分析演習編
Python 機械学習プログラミング データ分析演習編Python 機械学習プログラミング データ分析演習編
Python 機械学習プログラミング データ分析演習編
 
Red Hat Enterprise Linux OpenStack Platform 7 - VM Instance HA Architecture
Red Hat Enterprise Linux OpenStack Platform 7 - VM Instance HA ArchitectureRed Hat Enterprise Linux OpenStack Platform 7 - VM Instance HA Architecture
Red Hat Enterprise Linux OpenStack Platform 7 - VM Instance HA Architecture
 

Kürzlich hochgeladen

DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 

Kürzlich hochgeladen (20)

DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 

Deep Q-Network for beginners

  • 1. Google confidential | Do not distribute Deep Q-Network for beginners Etsuji Nakai Cloud Solutions Architect at Google 2016/08/01 ver1.0
  • 2. $ who am i ▪Etsuji Nakai Cloud Solutions Architect at Google Twitter @enakai00
  • 3. Typical application of DQN ▪ Learning “the optimal operations to achieve the best score” through the screen images of video games. – In theory, you can learn it (without knowing the rule of the game) by collecting data consisting of “on what screen, with which operation, how the score will change, and what will be the next screen.” – This is analogous to construct an algorithm for the game of Go by collecting data consisting of “On what face of board, where you put the next stone, how your advantage will change.” https://www.youtube.com/watch?v=r3pb-ZDEKVghttps://www.youtube.com/watch?v=V1eYniJ0Rnk
  • 4. Theoretical framework of DQN ▪ Suppose that you have all the quartets (s, a, r, s') for any pair (s, a), meaning “with the current state s and the action a, you will have a reward (score) r and the next state will be s' ” – This corresponds to the data “on what screen, with which operation, how the score will change, and what will be the next screen.” – It is impractical to collect data for all possible pairs (s, a), but suppose that you have enough of them to train the model to a certain level. – Note that, r and s' are functions of the pair (s, a) in terms of mathematics. ▪ You may naively think that the following gives the optimal action given the current state s.  ⇒ Choose the action a which maximizes the immediate reward r . – But this doesn't necessarily result in the best scenario. In the case of Breakout, you’d better hit blocks near the side walls even though it may take a little longer. ▪ In a nutshell, you have to figure out the action a which maximizes the long term rewards.
  • 5. Let’s imagine the magical “Q” function ▪ First, we define the total rewards as below. – sn and an represent the state and action at n-th step. is a small number around 0.9 introduced to avoid the sum becomes infinite. ▪ Now suppose that we have a convenient magical function Q(s, a) as below although we don’t know how to calculate it at all. – Q(s, a) = “The total rewards you will receive when you choose the next action a, and keep choosing the optimal actions afterwards.” ▪ Once you have the function Q(s, a) , you can choose the optimal action at state s with the following rule.  ⇒ Choose an action a which maximizes the total rewards if you keep choosing the optimal actions afterwards.
  • 6. The black magic of “recursive definition” ▪ Although we are not sure how we could calculate Q(s, a) , I can say that it satisfies the following “Q-equation”. – See the next slide for the mathematical proof.
  • 7. Proof of the Q-equation – Suppose that the chain of states and actions is as below when you start from the state s0 and keep choosing the best actions. – From the definition of Q(s, a), the following equation holds. – Now suppose that the initial state is s1 instead of s0 , and you keep choosing the best actions, the chain will be as below. – Again, from the definition of Q(s, a), the following equation holds. – Rearrange (1) as below, and substitute (2). – Considering the following relations, it’s equivalent to the Q-equation. ―― (1) ―― (2)
  • 8. Approximate “Q” function using Q-equation ▪ Prepare some function with adjustable parameters, and by adjusting them, you may find a function which satisfies the Q-equation. – If you succeeded, now you have the “Q” function! – Strictly speaking, Q-equation is not a sufficient condition but a necessary one. However, under some assumptions, it’s been proved to be sufficient. ▪ Here’s the steps to adjust the parameters. – 1. Let all the quartets (s, a, r, a') you have is D. – 2. Prepare an initial candidate of “Q” function as Q(s, a | w). – 3. Calculate the following error function E(w) using all (or a part of) data in D. This is the sum of squared differences between LHS and RHS of the Q-equation. – 4. Adjust the parameter w so that E(w) becomes smaller. Then, go back to 3. ▪ After repeating 3. and 4., if E(w) becomes small enough, you have the approximate version of the “Q” function. – You’d better use more complicated candidate Q(s, a | w) , so that you would have better approximation.
  • 9. What is “the more complicated function”? ▪ Yes! The Deep Neural Network! ▪ “Deep Q-Network” is essentially a multi-layer neural network used as the candidate of Q-function.
  • 10. By the way, what is Neural Network? ▪ Roughly speaking, it’s just a combination of multiple simple functions resulting in a highly complex function.
  • 11. How do you collect the quartets? ▪ How do you collect the quartets (s, a, r, a') in the real world? – Basically, just keep playing the game with random actions. – In theory, if you keep playing for infinite time, you would encounter all the possible states. ▪ But in reality, the possibility to reach some states with random actions is quite small. To compensate it, you can take the following strategy. – Once you have collected some amount of data, train the Q-function using these data. – After that, you play the game by mixing random actions and the (presumably) best actions calculated from the current Q-function. – When you have collected some more additional data, train the Q-function again. – Through this cycle, you can make Q-function better, and collect more data including states which is unreachable with only random actions. ▪ Why don’t you play only using Q-function without random actions? – It doesn’t work. By collecting all kinds of states even with random actions, the model can learn “how to gain long term rewards through some non-rewarding states.”