REBAR: Low-variance, unbiased gradient estimates
for discrete latent variable models
Sangwoo Mo
KAIST AI Lab.
November 29, 2017
Sangwoo Mo (KAIST AI Lab.) REBAR November 29, 2017 1 / 16
General Problem
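Let z ∼ p(z|θ). We want to maximize
$$L(\theta) = \mathbb{E}_{p(z)}[f(z)],$$
where f(z) is assumed to be independent of θ.
Examples:
ELBO (omitting the KL term):
$$L(\theta, \phi) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$$
Policy gradient:
$$L(\theta) = \mathbb{E}_{p_\theta(\tau)}[R(\tau)]$$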
General Problem
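Let z ∼ p(z|θ). We want to maximize
$$L(\theta) = \mathbb{E}_{p(z)}[f(z)].$$
We want to optimize L with gradient-based methods (assuming f(z) is differentiable), so we need to compute
$$\frac{d}{d\theta} L(\theta) = \frac{d}{d\theta}\,\mathbb{E}_{p(z)}[f(z)].$$
Caveat: we cannot simply move d/dθ inside the expectation, since the distribution of z depends on θ.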
Background
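REINFORCE:
$$
\begin{aligned}
\frac{d}{d\theta}\,\mathbb{E}_{p(z)}[f(z)]
&= \frac{d}{d\theta}\int f(z)\,p(z)\,dz \\
&= \int f(z)\,\frac{\partial}{\partial\theta}p(z)\,dz \\
&= \int f(z)\,\frac{\frac{\partial}{\partial\theta}p(z)}{p(z)}\,p(z)\,dz \\
&= \int f(z)\,\frac{\partial}{\partial\theta}\log p(z)\;p(z)\,dz \\
&= \mathbb{E}_{p(z)}\!\left[f(z)\,\frac{\partial}{\partial\theta}\log p(z)\right]
\end{aligned}
$$
The estimator is unbiased, but its variance is often too high to be practical.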
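As a rough illustration of the REINFORCE estimator, the sketch below applies it to a Bernoulli variable with f(z) = (z − 0.45)², the toy objective that appears later in the Experiments slides. The function names, the choice θ = 0.5, and the sample size are assumptions made for this example, not part of the talk.

```python
import random

T = 0.45  # target of the toy objective (taken from the Experiments slide)

def f(z):
    """Toy objective f(z) = (z - T)^2."""
    return (z - T) ** 2

def reinforce_samples(theta, n, seed=0):
    """Draw n single-sample REINFORCE estimates of d/dtheta E[f(z)], z ~ Bernoulli(theta)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        z = 1.0 if theta > rng.random() else 0.0
        # Score function: d/dtheta log p(z) for a Bernoulli(theta) variable.
        score = z / theta - (1.0 - z) / (1.0 - theta)
        out.append(f(z) * score)
    return out

samples = reinforce_samples(theta=0.5, n=200_000)
grad_estimate = sum(samples) / len(samples)
grad_variance = sum(g * g for g in samples) / len(samples) - grad_estimate ** 2

# Exact gradient: E[f(z)] = theta * f(1) + (1 - theta) * f(0), so d/dtheta = f(1) - f(0).
true_grad = f(1.0) - f(0.0)
print(grad_estimate, true_grad, grad_variance)
```

The sample mean should land near the exact gradient f(1) − f(0), while the per-sample variance is several times larger than the gradient itself, which is the problem the following slides address.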
Background
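Control variate: subtract a baseline c.
$$
\begin{aligned}
\frac{d}{d\theta}\,\mathbb{E}_{p(z)}[f(z)]
&= \frac{d}{d\theta}\left(\mathbb{E}_{p(z,c)}[f(z) - c] + \mathbb{E}_{p(z,c)}[c]\right) \\
&= \mathbb{E}_{p(z,c)}\!\left[(f(z) - c)\,\frac{\partial}{\partial\theta}\log p(z)\right] + \frac{\partial}{\partial\theta}\,\mathbb{E}_{p(z,c)}[c]
\end{aligned}
$$
Question: how do we choose a proper c?
Candidates include a constant value, e.g. $\mathbb{E}_{p(z)}[f(z)]$, or a linear approximation of f around $\mathbb{E}_{p(z)}[z]$.
A proper c should i) be correlated with f(z), and ii) if c is independent of θ (c ⊥ θ), the second term is eliminated.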
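To make the effect of a baseline concrete, the sketch below compares REINFORCE with and without a constant control variate on a Bernoulli toy problem. The objective f(z) = (z − 0.45)², the value θ = 0.3, and the helper name `estimate` are assumptions for illustration. The baseline is computed once and then treated as a fixed number, so the second term of the identity above vanishes.

```python
import random

T = 0.45  # toy target, assumed for illustration

def f(z):
    return (z - T) ** 2

def estimate(theta, baseline, n, seed=0):
    """Mean and variance of (f(z) - baseline) * d/dtheta log p(z), z ~ Bernoulli(theta)."""
    rng = random.Random(seed)
    total = total_sq = 0.0
    for _ in range(n):
        z = 1.0 if theta > rng.random() else 0.0
        score = z / theta - (1.0 - z) / (1.0 - theta)
        g = (f(z) - baseline) * score
        total += g
        total_sq += g * g
    mean = total / n
    return mean, total_sq / n - mean ** 2

theta = 0.3
# Constant baseline c = E[f(z)], held fixed (not differentiated through).
c = theta * f(1.0) + (1.0 - theta) * f(0.0)

mean_plain, var_plain = estimate(theta, 0.0, 200_000)
mean_cv, var_cv = estimate(theta, c, 200_000)
print(mean_plain, var_plain)
print(mean_cv, var_cv)
```

Both runs use the same seed, so they see the same draws of z. The two means should agree with the exact gradient f(1) − f(0), and the baseline version should show a much smaller variance.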
Background
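Reparametrization trick: assume $z = g(\theta, \epsilon)$.
$$
\begin{aligned}
\frac{d}{d\theta}\,\mathbb{E}_{p(z)}[f(z)]
&= \frac{d}{d\theta}\int f(z)\,p(z)\,dz \\
&= \frac{d}{d\theta}\int f(g(\theta, \epsilon))\,p(\epsilon)\,d\epsilon \\
&= \int \frac{\partial f}{\partial g}\,\frac{\partial g}{\partial\theta}\,p(\epsilon)\,d\epsilon \\
&= \mathbb{E}_{p(\epsilon)}\!\left[\frac{\partial f}{\partial g}\,\frac{\partial g}{\partial\theta}\right]
\end{aligned}
$$
The estimator is unbiased with low variance, and it has been successful for continuous z.
However, it is not directly applicable to the discrete case.
Example of the continuous case: a VAE assumes z ∼ N(µ, σ²) and reparametrizes it as z = µ + σε, where ε ∼ N(0, 1).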
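A minimal sketch of the reparametrization estimator for the Gaussian case follows. The objective f(z) = z² and the parameter values are assumptions chosen so that the exact answers are known: E[z²] = µ² + σ², so the gradients with respect to µ and σ are 2µ and 2σ.

```python
import random

def df(z):
    """Derivative of the illustrative objective f(z) = z^2."""
    return 2.0 * z

def reparam_gradients(mu, sigma, n, seed=0):
    """Monte Carlo estimate of d/dmu and d/dsigma of E[f(z)] with z = mu + sigma * eps."""
    rng = random.Random(seed)
    g_mu = g_sigma = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps          # z = g(theta, eps)
        g_mu += df(z) * 1.0           # dz/dmu = 1
        g_sigma += df(z) * eps        # dz/dsigma = eps
    return g_mu / n, g_sigma / n

mu, sigma = 1.0, 0.5
grad_mu, grad_sigma = reparam_gradients(mu, sigma, 200_000)
print(grad_mu, 2.0 * mu)
print(grad_sigma, 2.0 * sigma)
```

The chain rule passes through the sample itself, which is why f must be differentiable and why the same recipe fails once z is produced by a non-differentiable step such as an argmax.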
Background
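Gumbel-softmax trick:
It is well known that sampling z ∼ Cat(θ) is equivalent to
$$z = H(w) = \arg\max_i \left[\log\theta_i - \log(-\log\epsilon_i)\right],$$
where H is the hard argmax, $w = g(\theta, \epsilon)$ with $w_i = \log\theta_i - \log(-\log\epsilon_i)$, and $\epsilon_i \sim \mathrm{Uniform}(0, 1)$.
Instead of H, use the softmax $\sigma_\lambda(w)$ with temperature λ.
Then $\sigma_\lambda(g(\theta, \epsilon))$ is a differentiable reparametrization of a relaxed z.
The resulting gradient estimator has low variance, but it is biased.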
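The sketch below illustrates both halves of the trick under assumed values (θ = (0.2, 0.3, 0.5), λ = 0.1): the hard argmax of the perturbed log-probabilities reproduces Cat(θ), and the temperature softmax gives a smooth surrogate that shares the same argmax. The helper names are hypothetical, and the uniform draws are clipped away from 0 and 1 only to keep the logarithms finite.

```python
import math
import random

def open_uniform(rng, eps=1e-12):
    """Uniform draw clipped to [eps, 1 - eps] so that log(-log(u)) is finite."""
    return min(max(rng.random(), eps), 1.0 - eps)

def gumbel_logits(theta, rng):
    """w_i = log(theta_i) - log(-log(eps_i)) with eps_i ~ Uniform(0, 1)."""
    return [math.log(t) - math.log(-math.log(open_uniform(rng))) for t in theta]

def hard_argmax(w):
    """H(w): index of the largest entry."""
    return max(range(len(w)), key=w.__getitem__)

def softmax(w, lam):
    """sigma_lambda(w): softmax of w / lam, computed stably."""
    m = max(w)
    e = [math.exp((x - m) / lam) for x in w]
    s = sum(e)
    return [x / s for x in e]

theta = [0.2, 0.3, 0.5]
rng = random.Random(0)
n = 100_000
counts = [0] * len(theta)
for _ in range(n):
    counts[hard_argmax(gumbel_logits(theta, rng))] += 1
freqs = [c / n for c in counts]
print(freqs)  # empirical frequencies of z = H(w), to be compared with theta

w = gumbel_logits(theta, rng)
relaxed = softmax(w, lam=0.1)
print(hard_argmax(w), relaxed)
```

Lowering λ pushes the relaxed sample toward a one-hot vector, which reduces the bias of the relaxation but tends to make its gradients larger and noisier.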
REBAR
Motivation:
Gumbel-softmax gives a biased but highly correlated estimator.
Idea: use Gumbel-softmax as a control variate for REINFORCE.
However, we can do better than naïvely applying this idea.
REBAR
Observation:
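We can reduce the variance of REINFORCE by marginalizing over w given z.
$$
\begin{aligned}
\frac{\partial}{\partial\theta}\,\mathbb{E}_{p(w)}[f(\sigma_\lambda(w))]
&= \mathbb{E}_{p(w)}\!\left[f(\sigma_\lambda(w))\,\frac{\partial}{\partial\theta}\log p(w)\right] \\
&= \mathbb{E}_{p(z)}\,\mathbb{E}_{p(w|z)}\!\left[f(\sigma_\lambda(w))\,\frac{\partial}{\partial\theta}\big(\log p(w|z) + \log p(z)\big)\right] \\
&= \mathbb{E}_{p(z)}\!\left[\frac{\partial}{\partial\theta}\,\mathbb{E}_{p(w|z)}[f(\sigma_\lambda(w))]\right]
 + \mathbb{E}_{p(z)}\!\left[\mathbb{E}_{p(w|z)}[f(\sigma_\lambda(w))]\,\frac{\partial}{\partial\theta}\log p(z)\right]
\end{aligned}
$$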
REBAR
Observation:
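Here, the first term can be reparametrized as
$$
\mathbb{E}_{p(z)}\!\left[\frac{\partial}{\partial\theta}\,\mathbb{E}_{p(w|z)}[f(\sigma_\lambda(w))]\right]
= \mathbb{E}_{p(z)}\,\mathbb{E}_{p(\delta)}\!\left[\frac{\partial}{\partial\theta}\,f(\sigma_\lambda(\tilde{w}))\right],
$$
where $\tilde{w} = \tilde{g}(\theta, z, \delta)$ and $\delta_i \sim \mathrm{Uniform}(0, 1)$.
Here $\tilde{g}$ reparametrizes the conditional distribution of w given z.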
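The slide does not spell out the conditional reparametrization, so the sketch below works it out for the binary case only, as an assumption-laden illustration. With w = logit(θ) + logit(u) and z = 1 exactly when w > 0, the event z = 1 corresponds to u > 1 − θ. Sampling w given z therefore amounts to drawing u uniformly from the matching sub-interval. All function names here are hypothetical.

```python
import math
import random

def logit(p):
    return math.log(p) - math.log1p(-p)

def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def unit_uniform(rng, eps=1e-12):
    """Uniform draw clipped to [eps, 1 - eps] to keep logit finite."""
    return min(max(rng.random(), eps), 1.0 - eps)

def sample_w_z_wtilde(theta, rng):
    """Return (w, z, w_tilde) for a Bernoulli(theta) variable."""
    u, v = unit_uniform(rng), unit_uniform(rng)
    w = logit(theta) + logit(u)           # w = g(theta, u)
    z = 1.0 if w > 0 else 0.0             # z = H(w)
    if z == 1.0:
        u_cond = (1.0 - theta) + v * theta    # uniform on (1 - theta, 1)
    else:
        u_cond = v * (1.0 - theta)            # uniform on (0, 1 - theta)
    w_tilde = logit(theta) + logit(u_cond)    # w_tilde = g_tilde(theta, z, v)
    return w, z, w_tilde

rng = random.Random(0)
draws = [sample_w_z_wtilde(0.3, rng) for _ in range(200_000)]
consistent = all((wt > 0) == (z == 1.0) for _, z, wt in draws)
mean_relaxed = sum(sigmoid(w) for w, _, _ in draws) / len(draws)
mean_relaxed_cond = sum(sigmoid(wt) for _, _, wt in draws) / len(draws)
print(consistent, mean_relaxed, mean_relaxed_cond)
```

Two properties are worth checking: the sign of w̃ always agrees with z, and, once z is averaged out, w̃ has the same distribution as w, so any statistic of the relaxed sample matches on both sides.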
REBAR
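Putting it all together,
$$
\frac{\partial}{\partial\theta}\,\mathbb{E}_{p(z)}[f(z)]
= \mathbb{E}_{\epsilon,\delta}\!\left[
\big[f(H(w)) - \eta\,f(\sigma_\lambda(\tilde{w}))\big]\,
\frac{\partial}{\partial\theta}\log p(z)\Big|_{z=H(w)}
+ \eta\,\frac{\partial}{\partial\theta}\,f(\sigma_\lambda(w))
- \eta\,\frac{\partial}{\partial\theta}\,f(\sigma_\lambda(\tilde{w}))
\right]
$$
where $w = g(\theta, \epsilon)$, $\tilde{w} = \tilde{g}(\theta, H(w), \delta)$, and $\epsilon_i, \delta_i \sim \mathrm{Uniform}(0, 1)$.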
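A single-sample version of this estimator can be sketched for a Bernoulli variable with the toy objective f(z) = (z − 0.45)². Everything below is an illustrative assumption rather than the talk's implementation: the binary parametrization w = logit(θ) + logit(u), the conditional sampler for w̃, the hand-written derivatives, and the values of θ, η and λ.

```python
import math
import random

T = 0.45  # toy target, as in the Experiments slide

def f(s):
    return (s - T) ** 2

def df(s):
    return 2.0 * (s - T)

def logit(p):
    return math.log(p) - math.log1p(-p)

def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def unit_uniform(rng, eps=1e-12):
    return min(max(rng.random(), eps), 1.0 - eps)

def rebar_sample(theta, eta, lam, rng):
    """One REBAR estimate of d/dtheta E[f(z)] for z ~ Bernoulli(theta)."""
    u, v = unit_uniform(rng), unit_uniform(rng)
    dw = 1.0 / (theta * (1.0 - theta))            # dw/dtheta with u held fixed
    w = logit(theta) + logit(u)
    z = 1.0 if w > 0 else 0.0                     # z = H(w)
    score = z / theta - (1.0 - z) / (1.0 - theta)
    # Conditional sample w_tilde ~ p(w | z) and its derivative with z, v held fixed.
    if z == 1.0:
        u_cond, du_cond = (1.0 - theta) + v * theta, v - 1.0
    else:
        u_cond, du_cond = v * (1.0 - theta), -v
    w_tilde = logit(theta) + logit(u_cond)
    dw_tilde = dw + du_cond / (u_cond * (1.0 - u_cond))
    s, s_tilde = sigmoid(w / lam), sigmoid(w_tilde / lam)
    # Chain rule through the relaxed samples sigma_lambda(w) and sigma_lambda(w_tilde).
    d_relaxed = df(s) * s * (1.0 - s) / lam * dw
    d_cond = df(s_tilde) * s_tilde * (1.0 - s_tilde) / lam * dw_tilde
    return (f(z) - eta * f(s_tilde)) * score + eta * d_relaxed - eta * d_cond

rng = random.Random(0)
estimates = [rebar_sample(0.5, 1.0, 1.0, rng) for _ in range(100_000)]
rebar_mean = sum(estimates) / len(estimates)
true_grad = f(1.0) - f(0.0)
print(rebar_mean, true_grad)
```

The mean should match the exact gradient f(1) − f(0) for any η and λ, since the control-variate terms cancel in expectation. How much the variance drops relative to plain REINFORCE depends on those two hyperparameters, which motivates tuning them.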
Hyperparameter Optimization
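Let r(η, λ) be the Monte Carlo REBAR estimator.
Since r is unbiased, E[r] does not depend on η and λ. Thus,
$$
\frac{\partial}{\partial\eta}\,\mathrm{Var}(r)
= \frac{\partial}{\partial\eta}\left(\mathbb{E}[r^2] - \mathbb{E}[r]^2\right)
= \mathbb{E}\!\left[2r\,\frac{\partial r}{\partial\eta}\right].
$$
Now we can optimize η (and λ) to minimize the variance.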
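The variance gradient above can be used directly in a descent loop. The sketch below does this for η only, on the Bernoulli toy problem, by writing the estimator as r = a + η·b, where a is the REINFORCE term and b = ∂r/∂η collects the control-variate terms. The decomposition, the fixed batch of samples, the step size, and θ = 0.5, λ = 1.0 are all assumptions for this example. Because E[r] does not depend on η, lowering the second moment E[r²] is the same as lowering the variance.

```python
import math
import random

T = 0.45

def f(s):
    return (s - T) ** 2

def df(s):
    return 2.0 * (s - T)

def logit(p):
    return math.log(p) - math.log1p(-p)

def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def unit_uniform(rng, eps=1e-12):
    return min(max(rng.random(), eps), 1.0 - eps)

def rebar_parts(theta, lam, rng):
    """Return (a, b) such that the REBAR estimate is r = a + eta * b."""
    u, v = unit_uniform(rng), unit_uniform(rng)
    dw = 1.0 / (theta * (1.0 - theta))
    w = logit(theta) + logit(u)
    z = 1.0 if w > 0 else 0.0
    score = z / theta - (1.0 - z) / (1.0 - theta)
    if z == 1.0:
        u_cond, du_cond = (1.0 - theta) + v * theta, v - 1.0
    else:
        u_cond, du_cond = v * (1.0 - theta), -v
    w_tilde = logit(theta) + logit(u_cond)
    dw_tilde = dw + du_cond / (u_cond * (1.0 - u_cond))
    s, s_tilde = sigmoid(w / lam), sigmoid(w_tilde / lam)
    d_relaxed = df(s) * s * (1.0 - s) / lam * dw
    d_cond = df(s_tilde) * s_tilde * (1.0 - s_tilde) / lam * dw_tilde
    a = f(z) * score                               # REINFORCE term
    b = -f(s_tilde) * score + d_relaxed - d_cond   # dr/deta
    return a, b

rng = random.Random(0)
parts = [rebar_parts(0.5, 1.0, rng) for _ in range(10_000)]

def second_moment(eta):
    return sum((a + eta * b) ** 2 for a, b in parts) / len(parts)

eta, step = 1.0, 1.0
for _ in range(200):
    # Monte Carlo estimate of E[2 r dr/deta] on the fixed batch.
    grad = sum(2.0 * (a + eta * b) * b for a, b in parts) / len(parts)
    eta -= step * grad

# For this scalar, linear-in-eta case the batch minimizer is also available in closed form.
eta_closed = -sum(a * b for a, b in parts) / sum(b * b for _, b in parts)
print(eta, eta_closed, second_moment(0.0), second_moment(eta))
```

In practice the update would be made with fresh samples at every step and interleaved with updates of θ. The fixed batch is used here only so that the loop is deterministic and easy to compare with the closed-form minimizer. The learned η for this particular setting may turn out small, since its value depends strongly on λ and on f.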
Experiments
Minimize $\mathbb{E}_{p(z)}[(z - 0.45)^2]$ where z ∼ Bernoulli(θ).
[Figure: left panel shows log variance; right panel shows loss]
Experiments
Maximize the ELBO of a Sigmoid Belief Network:
$$\log p(x|\theta) \ge \mathbb{E}_{q(z|x,\theta)}\left[\log p(x, z|\theta) - \log q(z|x, \theta)\right]$$
[Figure: log variance; left panel: 2-layer linear, right panel: 1-layer nonlinear]
Experiments
Maximize the ELBO of a Sigmoid Belief Network:
$$\log p(x|\theta) \ge \mathbb{E}_{q(z|x,\theta)}\left[\log p(x, z|\theta) - \log q(z|x, \theta)\right]$$
[Figure: objective; left panel: 2-layer linear, right panel: 1-layer nonlinear]
Questions?
