$$Q^{o}(s,a) = r(s,a) + \gamma \max_{a'} Q^{o}(s',a')$$
$$L = \left( r(s,a) + \gamma \max_{a'} Q^{o}_{\theta}(s',a') - Q^{o}_{\theta}(s,a) \right)^{2}$$
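As a minimal sketch of the target and squared loss above, the following uses a tabular Q stored as a list of rows. The function names `td_target` and `td_loss` and the table values are hypothetical and chosen only for illustration; the slides do not specify an implementation.

```python
def td_target(q, r, s_next, gamma):
    """Bootstrapped target r + gamma * max_a' Q(s', a') for a tabular Q."""
    return r + gamma * max(q[s_next])


def td_loss(q, s, a, r, s_next, gamma):
    """Squared difference between the bootstrapped target and Q(s, a)."""
    return (td_target(q, r, s_next, gamma) - q[s][a]) ** 2


# Hypothetical 2-state, 2-action table; the numbers are made up.
q_table = [[0.0, 1.0], [2.0, 0.5]]
loss = td_loss(q_table, s=0, a=1, r=1.0, s_next=1, gamma=0.9)
print(loss)  # target is 1 + 0.9 * 2 = 2.8, so the loss is about (2.8 - 1)^2 = 3.24
```

With a parameterized $Q^{o}_{\theta}$, the same quantity would be minimized by gradient descent on $\theta$, usually with the target held fixed.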
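$$
\begin{aligned}
\nabla_{\theta} J &= \nabla_{\theta}\, E_{\pi_{\theta}}\!\left[ \sum_{\tau=0}^{\infty} \gamma^{\tau} R_{\tau} \right] \\
&= \nabla_{\theta} \sum_{s'} \sum_{a} P(s' \mid s_t, a)\, \pi_{\theta}(a \mid s_t) \sum_{\tau=0}^{\infty} \gamma^{\tau} R_{\tau} \\
&= \sum_{s'} \sum_{a} P(s' \mid s_t, a)\, \nabla_{\theta} \pi_{\theta}(a \mid s_t) \sum_{\tau=0}^{\infty} \gamma^{\tau} R_{\tau} \\
&= \sum_{s'} \sum_{a} P(s' \mid s_t, a)\, \pi_{\theta}(a \mid s_t)\, \frac{\nabla_{\theta} \pi_{\theta}(a \mid s_t)}{\pi_{\theta}(a \mid s_t)} \sum_{\tau=0}^{\infty} \gamma^{\tau} R_{\tau} \\
&= \sum_{s'} \sum_{a} P(s' \mid s_t, a)\, \pi_{\theta}(a \mid s_t)\, \nabla_{\theta} \log\!\left(\pi_{\theta}(a \mid s_t)\right) \sum_{\tau=0}^{\infty} \gamma^{\tau} R_{\tau} \\
&= E_{\pi_{\theta}}\!\left[ \nabla_{\theta} \log\!\left(\pi_{\theta}(a \mid s_t)\right) \sum_{\tau=0}^{\infty} \gamma^{\tau} R_{\tau} \right]
\end{aligned}
$$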
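The log-derivative step in this derivation can be checked numerically. The sketch below assumes a one-step problem with a softmax policy over three actions (an assumption made only to keep the example small; the slides do not fix a policy class) and compares the score-function gradient with a central finite difference of the expected return.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def expected_return(theta, rewards):
    """J(theta) = sum_a pi_theta(a) * R(a) for a one-step problem."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))


def score_function_gradient(theta, rewards):
    """sum_a pi(a) * grad log pi(a) * R(a); for softmax, d/d theta_k log pi(a) = 1[a == k] - pi(k)."""
    pi = softmax(theta)
    n = len(theta)
    return [
        sum(pi[a] * ((1.0 if a == k else 0.0) - pi[k]) * rewards[a] for a in range(n))
        for k in range(n)
    ]


def finite_difference_gradient(theta, rewards, eps=1e-6):
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += eps
        down[k] -= eps
        grad.append((expected_return(up, rewards) - expected_return(down, rewards)) / (2 * eps))
    return grad


# Hypothetical parameters and rewards, chosen only for illustration.
theta = [0.2, -0.1, 0.5]
rewards = [1.0, 0.0, 2.0]
analytic = score_function_gradient(theta, rewards)
numeric = finite_difference_gradient(theta, rewards)
```

The two gradients should agree up to finite-difference error, which is what the identity $\nabla_{\theta}\pi_{\theta} = \pi_{\theta}\nabla_{\theta}\log\pi_{\theta}$ predicts.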
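$$E_{\pi_{\theta}}\!\left[ \nabla_{\theta} \log\!\left(\pi_{\theta}(a \mid s_t)\right) \sum_{\tau=0}^{\infty} \gamma^{\tau} R_{\tau} \right] = \frac{1}{M} \sum_{T} \sum_{i} \nabla_{\theta} \log\!\left(\pi_{\theta}(a_i^{T} \mid s_i^{T})\right) \left( \sum_{\tau=0}^{\infty} \gamma^{\tau} R_{\tau}^{T} \right)$$
$$T = s_0^{T}, a_0^{T}, r_0^{T}, \ldots, s_n^{T}, a_n^{T}, r_n^{T}$$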
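A sketch of this sample average over $M$ trajectories follows. It assumes a tabular softmax policy with one row of preferences per state, which is a hypothetical choice for illustration, and it treats each trajectory as a list of `(state, action, reward)` tuples. In practice the right-hand side is a finite-sample estimate of the expectation.

```python
import math


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


def reinforce_gradient(trajectories, theta, gamma):
    """(1/M) * sum_T sum_i grad log pi(a_i | s_i) * (discounted return of T)."""
    grad = [[0.0] * len(row) for row in theta]
    for traj in trajectories:
        g = discounted_return([r for _, _, r in traj], gamma)
        for s, a, _ in traj:
            pi = softmax(theta[s])
            for k in range(len(pi)):
                # Softmax score: d/d theta[s][k] log pi(a | s) = 1[a == k] - pi(k | s)
                grad[s][k] += ((1.0 if a == k else 0.0) - pi[k]) * g
    m = len(trajectories)
    return [[v / m for v in row] for row in grad]


# Two made-up trajectories over 2 states and 2 actions, uniform initial policy.
theta = [[0.0, 0.0], [0.0, 0.0]]
trajectories = [
    [(0, 1, 0.0), (1, 0, 1.0)],
    [(0, 0, 0.0), (1, 1, 0.0)],
]
grad = reinforce_gradient(trajectories, theta, gamma=0.5)
print(grad)  # [[-0.125, 0.125], [0.125, -0.125]]
```

Only the first trajectory has a nonzero return, so the estimate raises the preference for the actions taken in it.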
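$$\frac{1}{M} \sum_{T} \sum_{i} \nabla_{\theta} \log\!\left(\pi_{\theta}(a_i^{T} \mid s_i^{T})\right) \left( \sum_{\tau=0}^{\infty} \gamma^{\tau} R_{\tau}^{T} \right)$$
$$\frac{1}{M} \sum_{T} \sum_{i} \nabla_{\theta} \log\!\left(\pi_{\theta}(a_i^{T} \mid s_i^{T})\right) A(s_i^{T}, a_i^{T})$$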
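The slides do not say how $A(s_i^{T}, a_i^{T})$ is computed. One common choice, assumed here purely for illustration, is the discounted reward-to-go from step $i$ minus a baseline value estimate $V(s_i)$:

```python
def rewards_to_go(rewards, gamma):
    """Discounted sum of rewards from each step to the end of the trajectory."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]


def advantages(rewards, values, gamma):
    """Assumed advantage estimate: reward-to-go minus a baseline V(s_i)."""
    return [g - v for g, v in zip(rewards_to_go(rewards, gamma), values)]


# Made-up rewards and baseline values for a 3-step trajectory.
adv = advantages(rewards=[1.0, 0.0, 2.0], values=[1.0, 0.5, 1.0], gamma=0.5)
print(adv)  # [0.5, 0.5, 1.0]
```

Replacing the full return with such an estimate typically reduces the variance of the gradient estimate without changing its expectation, provided the baseline depends only on the state.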
$$Q^{\mathrm{aux}}(a, i, j)$$
$$L_{Q} = E\!\left[ \left( R_{t:t+n} + \gamma^{n} \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right)^{2} \right]$$
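A minimal tabular sketch of this $n$-step loss for a single sample is given here. The separate table `q_target` plays the role of the parameters $\theta^{-}$; the function names and all numbers are hypothetical.

```python
def n_step_return(rewards, gamma):
    """R_{t:t+n}: discounted sum of the n observed rewards."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


def n_step_q_loss(q, q_target, s, a, rewards, s_next, gamma):
    """Squared error between the n-step bootstrapped target and Q(s, a)."""
    n = len(rewards)
    target = n_step_return(rewards, gamma) + (gamma ** n) * max(q_target[s_next])
    return (target - q[s][a]) ** 2


# Made-up online and target tables with 2 states and 2 actions.
q_online = [[0.0, 0.0], [0.0, 0.0]]
q_target = [[1.0, 2.0], [4.0, 0.0]]
loss = n_step_q_loss(q_online, q_target, s=0, a=0, rewards=[1.0, 1.0], s_next=1, gamma=0.5)
print(loss)  # target = 1.5 + 0.25 * 4 = 2.5, loss = 6.25
```

The expectation in $L_{Q}$ would be approximated by averaging this quantity over sampled transitions.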
$$L_{VR} = E_{\pi}\!\left[ \left( R_{t:t+n} + \gamma^{n} V(s_{t+n+1}, \theta^{-}) - V(s_t, \theta) \right)^{2} \right]$$
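$$E_{p}[f(x)] = \sum_{x} p(x) f(x)$$
$$
\begin{aligned}
E_{q}[f(x)] &= \sum_{x} q(x) f(x) \\
&= \sum_{x} \frac{q(x)}{p(x)}\, p(x) f(x) \\
&= \sum_{x} p(x)\, \frac{q(x)}{p(x)}\, f(x) \\
&= E_{p}\!\left[ \frac{q(x)}{p(x)} f(x) \right]
\end{aligned}
$$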
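The identity can be verified directly on a small discrete example. The distributions `p` and `q` and the function `f` are made up; the only requirement is that $p(x) > 0$ wherever $q(x) > 0$.

```python
def expectation(dist, f):
    """E_dist[f(x)] for a discrete distribution given as {x: probability}."""
    return sum(prob * f(x) for x, prob in dist.items())


def importance_weighted_expectation(p, q, f):
    """E_p[(q(x) / p(x)) * f(x)]; requires p(x) > 0 wherever q(x) > 0."""
    return sum(p[x] * (q[x] / p[x]) * f(x) for x in p)


def square(x):
    return x * x


p = {0: 0.5, 1: 0.25, 2: 0.25}
q = {0: 0.2, 1: 0.3, 2: 0.5}
direct = expectation(q, square)
reweighted = importance_weighted_expectation(p, q, square)
```

Both values equal $E_{q}[x^{2}] = 0.3 + 4 \cdot 0.5 = 2.3$ up to floating-point rounding. In off-policy learning the same reweighting is applied to samples drawn from $p$ rather than to an exact sum.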
$$L_{A3C} = L_{\pi} + L_{V} - E_{s \sim \pi}\!\left[ \alpha H(\pi(\cdot \mid s)) \right]$$
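The slide gives only the symbols $L_{\pi}$ and $L_{V}$. The sketch below assumes, for illustration, the common forms $L_{\pi} = -\log \pi(a \mid s)\,(R - V(s))$ and $L_{V} = (R - V(s))^{2}$ for a single state, and shows how the entropy bonus enters with weight $\alpha$.

```python
import math


def entropy(pi):
    """H(pi) = -sum_a pi(a) log pi(a), skipping zero-probability actions."""
    return -sum(p * math.log(p) for p in pi if p > 0.0)


def a3c_loss(pi, action, ret, value, alpha):
    """Single-sample loss under the assumed forms of L_pi and L_V."""
    advantage = ret - value
    policy_loss = -math.log(pi[action]) * advantage  # assumed form of L_pi
    value_loss = (ret - value) ** 2                  # assumed form of L_V
    return policy_loss + value_loss - alpha * entropy(pi)


# Made-up numbers: uniform policy over two actions, return 1, value estimate 0.
total = a3c_loss(pi=[0.5, 0.5], action=0, ret=1.0, value=0.0, alpha=0.01)
```

Because the entropy term is subtracted, minimizing the loss pushes the policy toward higher entropy, which discourages premature collapse onto a single action.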
$$\tilde{Q}^{\pi}(s,a) = \alpha\left( \log \pi(s,a) + H^{\pi}(s) \right) + V^{\pi}(s)$$
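A small sketch of this action-value estimate for one state follows, with a made-up policy, value, and $\alpha$. A useful sanity check is that the policy-weighted average of the estimate recovers $V^{\pi}(s)$, since $\sum_a \pi(s,a) \log \pi(s,a) = -H^{\pi}(s)$.

```python
import math


def entropy(pi):
    return -sum(p * math.log(p) for p in pi if p > 0.0)


def q_from_policy(pi, value, alpha):
    """alpha * (log pi(s, a) + H(s)) + V(s) for every action a with pi(s, a) > 0."""
    h = entropy(pi)
    return [alpha * (math.log(p) + h) + value for p in pi]


# Hypothetical policy over three actions in a single state.
pi = [0.7, 0.2, 0.1]
q_values = q_from_policy(pi, value=1.5, alpha=0.1)
mean_q = sum(p * q for p, q in zip(pi, q_values))
```

Here `mean_q` equals the supplied value 1.5 up to rounding, and actions with higher probability receive higher estimates.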
$$Q^{*}(s,a) = r(s,a) + \gamma \tau \log \sum_{a'} \exp\!\left( Q^{*}(s',a') / \tau \right)$$
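The log-sum-exp backup can be sketched as follows, using the usual max-shift for numerical stability. The function names and inputs are hypothetical. As $\tau \to 0$ the backup approaches the hard-max Bellman backup, and for larger $\tau$ it is strictly above it when there is more than one action.

```python
import math


def soft_max_value(q_next, tau):
    """tau * log sum_a' exp(Q(s', a') / tau), computed with a max-shift for stability."""
    m = max(q_next)
    return m + tau * math.log(sum(math.exp((q - m) / tau) for q in q_next))


def soft_q_backup(r, q_next, gamma, tau):
    return r + gamma * soft_max_value(q_next, tau)


# Made-up next-state action values.
q_next = [1.0, 2.0]
hard = 0.5 + 0.9 * max(q_next)
nearly_hard = soft_q_backup(0.5, q_next, gamma=0.9, tau=0.01)
smoothed = soft_q_backup(0.5, q_next, gamma=0.9, tau=1.0)
```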
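$$V^{*}(s) = -\tau \log \pi^{*}(a \mid s) + r(s,a) + \gamma V^{*}(s')$$
$$-V^{*}(s_1) + \gamma^{t-1} V^{*}(s_t) + R(s_{1:t}) - \tau G(s_{1:t}, \pi^{*}) = 0$$
$$R(s_{m:n}) = \sum_{i=0}^{n-m-1} \gamma^{i}\, r(s_{m+i}, a_{m+i}) \qquad G(s_{m:n}, \pi) = \sum_{i=0}^{n-m-1} \gamma^{i} \log \pi(a_{m+i} \mid s_{m+i})$$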
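The single-step consistency can be checked in the simplest case, a one-step problem whose next state is terminal so that $\gamma V^{*}(s') = 0$. The sketch assumes the softmax forms $V^{*} = \tau \log \sum_a \exp(r(a)/\tau)$ and $\pi^{*}(a) = \exp((r(a) - V^{*})/\tau)$, which match the log-sum-exp backup in this terminal case; the rewards and $\tau$ are made up.

```python
import math


def soft_value(rewards, tau):
    """V* = tau * log sum_a exp(r(a) / tau) for a one-step problem."""
    m = max(r / tau for r in rewards)
    return tau * (m + math.log(sum(math.exp(r / tau - m) for r in rewards)))


def soft_policy(rewards, tau):
    """pi*(a) = exp((r(a) - V*) / tau)."""
    v = soft_value(rewards, tau)
    return [math.exp((r - v) / tau) for r in rewards]


rewards = [1.0, 0.0, 2.0]
tau = 0.5
v_star = soft_value(rewards, tau)
pi_star = soft_policy(rewards, tau)

# Residual of V*(s) = -tau * log pi*(a | s) + r(s, a) for every action a.
residuals = [-tau * math.log(p) + r - v_star for p, r in zip(pi_star, rewards)]
```

Every entry of `residuals` is zero up to rounding, so the consistency holds for each action and not only for the greedy one.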
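$$C_{\theta,\phi}(s_{1:t}) = -V_{\phi}(s_1) + \gamma^{t-1} V_{\phi}(s_t) + R(s_{1:t}) - \tau G(s_{1:t}, \pi_{\theta})$$
$$\Delta\theta \propto C_{\theta,\phi}(s_{1:t})\, \nabla_{\theta} G(s_{1:t}, \pi_{\theta})$$
$$\Delta\phi \propto C_{\theta,\phi}(s_{1:t}) \left( \nabla_{\phi} V_{\phi}(s_1) - \gamma^{t-1} \nabla_{\phi} V_{\phi}(s_t) \right)$$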
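A sketch of the consistency error $C_{\theta,\phi}$ for one sub-trajectory follows. It takes the value estimates at the two endpoints, the rewards, and the log-probabilities of the actions taken as plain numbers; the function names and inputs are hypothetical, and the gradient updates themselves are left out.

```python
import math


def discounted_sum(xs, gamma):
    return sum((gamma ** i) * x for i, x in enumerate(xs))


def path_consistency(v_first, v_last, rewards, log_probs, gamma, tau):
    """C = -V(s_1) + gamma^(t-1) * V(s_t) + R(s_{1:t}) - tau * G(s_{1:t}, pi)."""
    steps = len(rewards)  # t - 1 transitions between s_1 and s_t
    r = discounted_sum(rewards, gamma)    # R(s_{1:t})
    g = discounted_sum(log_probs, gamma)  # G(s_{1:t}, pi)
    return -v_first + (gamma ** steps) * v_last + r - tau * g


# Made-up 2-transition path.
c = path_consistency(
    v_first=1.0,
    v_last=2.0,
    rewards=[1.0, 0.0],
    log_probs=[math.log(0.5), math.log(0.25)],
    gamma=0.5,
    tau=1.0,
)
```

For these inputs $C = -1 + 0.5 + 1 + 2\log 2$. Both updates scale with $C$, so they vanish on paths that already satisfy the consistency.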
Recent rl