Deep Reasoning
2016-03-16
Taehoon Kim
carpedm20@gmail.com
References
1. [Sukhbaatar, 2015] Sukhbaatar, Szlam, Weston, Fergus. “End-To-End Memory Networks.” Advances in Neural Information Processing Systems. 2015.
2. [Hill, 2016] Hill, Bordes, Chopra, Weston. “The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations.” arXiv preprint arXiv:1511.02301 (2015).
3. [Kumar, 2015] Kumar, Irsoy, Ondruska, Iyyer, Bradbury, Gulrajani, Zhong, Paulus, Socher. “Ask Me Anything: Dynamic Memory Networks for Natural Language Processing.” arXiv preprint arXiv:1506.07285 (2015).
4. [Xiong, 2016] Xiong, Merity, Socher. “Dynamic Memory Networks for Visual and Textual Question Answering.” arXiv preprint arXiv:1603.01417 (2016).
5. [Yin, 2015] Yin, Schütze, Xiang, Zhou. “ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs.” arXiv preprint arXiv:1512.05193 (2015).
6. [Yu, 2015] Yu, Zhang, Hang, Xiang, Zhou. “Empirical Study on Deep Learning Models for Question Answering.” arXiv preprint arXiv:1510.07526 (2015).
7. [Hermann, 2015] Hermann, Kočiský, Grefenstette, Espeholt, Kay, Suleyman, Blunsom. “Teaching Machines to Read and Comprehend.” arXiv preprint arXiv:1506.03340 (2015).
8. [Kadlec, 2016] Kadlec, Schmid, Bajgar, Kleindienst. “Text Understanding with the Attention Sum Reader Network.” arXiv preprint arXiv:1603.01547 (2016).
9. [Miao, 2015] Miao, Yu, Blunsom. “Neural Variational Inference for Text Processing.” arXiv preprint arXiv:1511.06038 (2015).
10. [Kingma, 2013] Kingma, Welling. “Auto-Encoding Variational Bayes.” arXiv preprint arXiv:1312.6114 (2013).
11. [Sohn, 2015] Sohn, Lee, Yan. “Learning Structured Output Representation Using Deep Conditional Generative Models.” Advances in Neural Information Processing Systems. 2015.
Models

Answer selection (WikiQA) | General QA (CNN)              | Transitive inference (bAbI)
ABCNN                     | E2E MN                        | E2E MN
Variational               | Impatient Attentive Reader    | DMN
Attentive Pooling         | Attentive (Impatient) Reader  | ReasoningNet
                          | Attention Sum Reader          | NTM
End-to-End Memory Network [Sukhbaatar, 2015]
[Architecture diagram: story sentences (“I go to school.”, “He gets ball.”, …) are embedded by matrix A into the input memory and by matrix C into the output memory. The question (“Where does he go?”) is embedded by B into the internal state u. Attention: a softmax over the inner products between u and the input memories; the output o is the attention-weighted sum of the output memories, followed by a final linear layer.]
[Diagram: three stacked memory hops (Σ) over the same story, followed by a final linear layer.]
Sentence representation (a toy sketch of one hop follows):
- i-th sentence: x_i = {x_i1, x_i2, …, x_in}
- BoW: m_i = Σ_j A x_ij
- Position Encoding: m_i = Σ_j l_j ∘ A x_ij (l_j: a position-dependent weight vector, ∘: element-wise product)
- Temporal Encoding: m_i = Σ_j A x_ij + T_A(i)
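A minimal numpy sketch of a single memory hop with BoW sentence representations; all shapes, initializations, and the toy story are illustrative assumptions, not the paper's setup:

```python
import numpy as np

d, V, n_sents, n_words = 20, 100, 4, 6   # toy sizes (assumptions)
rng = np.random.default_rng(0)
A = rng.normal(0, 0.1, (V, d))           # input-memory embedding
C = rng.normal(0, 0.1, (V, d))           # output-memory embedding
B = rng.normal(0, 0.1, (V, d))           # question embedding

story = rng.integers(0, V, (n_sents, n_words))   # word ids per sentence
question = rng.integers(0, V, n_words)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

m = A[story].sum(axis=1)     # BoW memories: m_i = sum_j A x_ij
c = C[story].sum(axis=1)     # output memories
u = B[question].sum(axis=0)  # question state u

p = softmax(m @ u)           # attention: softmax over inner products
o = p @ c                    # response: weighted sum of output memories
# a full model would predict softmax(W @ (o + u)) with a final matrix W
```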
Training details
Linear Start (LS) helps avoid local minima
- First train with the softmax in each memory layer removed, making the model entirely linear except for the final softmax
- When the validation loss stops decreasing, the softmax layers are re-inserted and training recommences
RNN-style layer-wise weight tying
- The input and output embeddings are the same across different layers
Learning time invariance by injecting random noise
- Jitter the time index with random empty memories
- Add “dummy” memories to regularize T_A(i)
Example of bAbI tasks
The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations [Hill, 2016]
• Context sentences: S = {s_1, s_2, …, s_n}, with s_i a BoW word representation
• Encoded memory: m_s = φ(s), ∀ s ∈ S
• Lexical memory
  • Each word occupies a separate slot in the memory
  • s is a single word and φ(s) has only one non-zero feature
  • Multiple hops are only beneficial in this memory model
• Window memory (best)
  • s corresponds to a window of text from the context S centered on an individual mention of a candidate c in S:
    m_s = {w_{i-(b-1)/2}, …, w_i, …, w_{i+(b-1)/2}}
  • where w_i ∈ C is an instance of one of the candidate words
• Sentential memory
  • Same as the original implementation of the Memory Network
Self-supervision for window memories
- Memory supervision (knowing which memories to attend to) is not provided at training time
- SGD gradient steps force the model to give a higher score to the supporting memory m̃ than to any memory from any other candidate, using:
  - Hard attention (training and testing): m_o1 = argmax_{i=1,…,n} c_i^T q
  - Soft attention (testing): m_o1 = Σ_{i=1…n} α_i m_i, with α_i = e^{c_i^T q} / Σ_j e^{c_j^T q}
- If m_o1 happens to differ from m̃ (the memory containing the true answer), the model is updated
- Can be understood as a way of achieving hard attention over memories (no new label information beyond the training data is needed); a toy sketch follows
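A toy numpy sketch of the hard/soft scoring above (the encodings and sizes are made-up assumptions):

```python
import numpy as np

d, n = 16, 5
rng = np.random.default_rng(1)
memories = rng.normal(size=(n, d))   # c_i: encoded window memories
q = rng.normal(size=d)               # encoded question

scores = memories @ q                # c_i^T q

# hard attention (training and testing): pick the top-scoring memory
m_hard = memories[np.argmax(scores)]

# soft attention (testing): alpha_i = exp(c_i^T q) / sum_j exp(c_j^T q)
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()
m_soft = alpha @ memories

# self-supervision: if argmax picks a window whose candidate is not the
# true answer, take a gradient step raising the supporting memory's score
```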
Gated Recurrent Unit (GRU)
[Diagram: a GRU cell with reset gate r_t, update gate z_t, candidate state h̃_t, and the interpolation between h_{t-1} and h̃_t.] The standard equations:
z_t = σ(W_z x_t + U_z h_{t-1})
r_t = σ(W_r x_t + U_r h_{t-1})
h̃_t = tanh(W x_t + U (r_t ∘ h_{t-1}))
h_t = z_t ∘ h̃_t + (1 - z_t) ∘ h_{t-1}
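A self-contained numpy sketch of one GRU step (the weights, sizes, and random toy sequence are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    z = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate z_t
    r = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate r_t
    h_tilde = np.tanh(W @ x_t + U @ (r * h_prev))   # candidate state
    return z * h_tilde + (1.0 - z) * h_prev         # interpolate new/old

d_in, d_h = 8, 16
rng = np.random.default_rng(2)
Wz, Wr, W = (rng.normal(0, 0.1, (d_h, d_in)) for _ in range(3))
Uz, Ur, U = (rng.normal(0, 0.1, (d_h, d_h)) for _ in range(3))

h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):   # run over a toy sequence
    h = gru_step(x, h, Wz, Uz, Wr, Ur, W, U)
```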
Ask Me Anything: Dynamic Memory Networks for Natural Language
Processing [Kumar, 2015]
[Architecture overview diagram: the Input Module runs a GRU over the GloVe-embedded story (“I go to school. He gets ball. …”); the Question Module encodes “Where does he go?” into q; the Episodic Memory Module attends over the input states with gates g_t^i and a modified GRU (“GRU-ish”); the Answer Module is a GRU that emits words y_t until <EOS>, conditioned on a_t and q.]
Input, Question, and Gate equations
[Diagram: the same architecture, annotated with the module equations.]
- Input module: c_t = h_t = GRU(L[w_t^I], h_{t-1}), where L is the (GloVe-initialized) embedding matrix
- Question module: q_t = GRU(L[w_t^Q], q_{t-1})
- Gate: g_t^i = G(c_t, m^{i-1}, q)
- “GRU-ish” update: h_t^i = g_t^i · GRU(c_t, h_{t-1}^i) + (1 - g_t^i) · h_{t-1}^i
- The episode e^i and the new memory m^i are built from these states; the answer module applies a softmax at each decoding step
Attention Mechanism
- Gate: g_t^i = G(c_t, m^{i-1}, q)
- G: a two-layer feed-forward neural network
- Its input is a feature vector that captures similarities between c_t, m^{i-1}, and q
Episodic Memory Module
- Iterates over the input representations c_t while updating the episode e^i
- Attention mechanism + recurrent network → update the memory m^i
- h_t^i = g_t^i · GRU(c_t, h_{t-1}^i) + (1 - g_t^i) · h_{t-1}^i
- Episodic memory update: e^i = h_{T_C}^i (the final hidden state of pass i)
- Memory update: m^i = GRU(e^i, m^{i-1}); a toy sketch follows
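A toy numpy sketch of one episodic pass; the gate features and the stand-in recurrence are simplifying assumptions (the paper's G is a two-layer network over a richer feature vector, and the inner cell is a full GRU):

```python
import numpy as np

d = 8
rng = np.random.default_rng(3)
Wg = rng.normal(0, 0.1, 4 * d)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(c, h):
    return np.tanh(c + h)   # stand-in for a full GRU cell (see sketch above)

def gate(c, m_prev, q):
    # assumed similarity features; the paper uses a richer feature set
    z = np.concatenate([c * q, c * m_prev, np.abs(c - q), np.abs(c - m_prev)])
    return sigmoid(Wg @ z)  # scalar gate g_t^i

def episode(facts, m_prev, q):
    h = np.zeros(d)
    for c in facts:         # iterate over fact representations c_t
        g = gate(c, m_prev, q)
        h = g * gru_step(c, h) + (1 - g) * h   # "GRU-ish" update
    return h                # e^i = h_{T_C}^i

facts, q = rng.normal(size=(5, d)), rng.normal(size=d)
m = q.copy()                # initialize memory with the question
for _ in range(3):          # multiple passes
    m = gru_step(episode(facts, m, q), m)   # m^i = GRU(e^i, m^{i-1})
```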
Multiple Episodes
- Allow the model to attend to different inputs during each pass
- Allow a type of transitive inference, since the first pass may uncover the need to retrieve additional facts
  Q: Where is the football?
  C1: John put down the football.
  Only once the model sees C1 is John relevant, and it can reason that the second iteration should retrieve where John was.
Criteria for Stopping
- Append a special end-of-passes representation to the inputs c
- Stop if this representation is chosen by the gate function
- Otherwise, set a maximum number of iterations
- This is why it is called a Dynamic Memory Network
Answer Module
- Triggered once at the end of the episodic memory, or at each time step
- Concatenates the last generated word and the question vector as the input at each time step
- Trained with a cross-entropy error
Training Details
- Adam optimization
- L2 regularization, dropout on the word embeddings (GloVe)
bAbI dataset
- Objective function: J = α E_CE(Gates) + β E_CE(Answers)
- Gate supervision aims to select one sentence per pass
- Without supervision: a GRU over c_t and h_t^i, with e^i = h_{T_C}^i
- With supervision (simpler): e^i = Σ_{t=1}^T softmax(g_t^i) c_t, where softmax(g_t^i) = exp(g_t^i) / Σ_{j=1}^T exp(g_j^i) and g_t^i is the gate value before the sigmoid
- Supervision gives better results, because the softmax encourages sparsity and is suited to picking one sentence
Training Details
Stanford Sentiment Treebank (Sentiment Analysis)
- Use all full sentences, subsample 50% of phrase-level labels every epoch
- Only evaluated on the full sentences
- Binary classification, neutral phrases are removed from the dataset
- Trained with GRU sequence models
Dynamic Memory Networks for Visual and Textual Question
Answering [Xiong 2016]
Several	design	choices	are motivated	by	intuition	and accuracy	improvements
Input Module in DMN
- A single GRU embeds the story and stores the hidden states
- The GRU provides a temporal component, letting each sentence know the content of the sentences that came before it
- Cons:
  - The GRU only gives sentences context from the sentences before them, not after them
  - Supporting sentences may be too far away from each other
- Hence the input fusion layer
Input Module in DMN+
Replaces the single GRU with two components (a sketch follows this list):
1. Sentence reader: responsible only for encoding the words into a sentence embedding
   • Uses the positional encoder from E2E MN: f_i = Σ_j l_j ∘ A x_ij
   • GRUs and LSTMs were considered, but they required more computational resources and were prone to overfitting
2. Input fusion layer: allows content interaction between sentences
   • A bi-directional GRU lets information flow from both past and future sentences
   • Gradients do not need to propagate through the words between sentences
   • Distant supporting sentences can have a more direct interaction
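A numpy sketch of the two components; the position-encoding formula is the E2E MemN2N scheme, and the fusion recurrence is a toy stand-in for the bi-directional GRU:

```python
import numpy as np

d, n_words = 8, 6
rng = np.random.default_rng(4)

def position_encoding(n, d):
    # l_jk = (1 - j/n) - (k/d)(1 - 2j/n), the E2E MemN2N scheme
    j = np.arange(1, n + 1)[:, None] / n
    k = np.arange(1, d + 1)[None, :] / d
    return (1 - j) - k * (1 - 2 * j)

def sentence_reader(word_embs):
    # f_i = sum_j l_j o (embedding of word j)
    return (position_encoding(len(word_embs), d) * word_embs).sum(axis=0)

sents = [rng.normal(size=(n_words, d)) for _ in range(4)]
f = np.stack([sentence_reader(s) for s in sents])   # per-sentence facts

def toy_rnn(xs):
    # stand-in for a GRU over sentences, kept self-contained
    h, out = np.zeros(d), []
    for x in xs:
        h = np.tanh(x + h)
        out.append(h)
    return np.stack(out)

# input fusion: combine forward and backward passes over the facts
fused = toy_rnn(f) + toy_rnn(f[::-1])[::-1]
```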
Input Module for DMN+
[Figure from the referenced paper: A Hierarchical Neural Autoencoder for Paragraphs and Documents [Li, 2015]]
Episodic Memory Module in DMN+
- F = [f_1, f_2, …, f_N]: the output of the input module
- Interactions between the fact f_i and both the question q and the episode memory state m^t
Attention Mechanism in DMN+
Use attention to extract a contextual vector c^t based on the current focus.
1. Soft attention
   • A weighted summation of F: c^t = Σ_{i=1}^N g_i^t f_i
   • Can approximate hard attention by selecting a single fact f_i
   • Cons: loses positional and ordering information
   • Additional attention passes can retrieve some of this information, but inefficiently
Attention Mechanism in DMN+
2. Attention-based GRU (best); a sketch follows this list
   - An RNN keeps position and ordering information, but a standard GRU cannot make use of the attention gate g_i^t
   - u_i: update gate; r_i: how much to retain from h_{i-1}
   - Replace the update gate u_i (a vector) with g_i^t (a scalar)
   - This also makes it easy to visualize how the attention gates activate
   - Use the final hidden state as c^t, which is used to update the episodic memory m^t
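A toy numpy sketch of the attention-based GRU (the weights and gate values are illustrative; a real cell also has biases):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 8
rng = np.random.default_rng(5)
Wr, Ur, W, U = (rng.normal(0, 0.1, (d, d)) for _ in range(4))

def attention_gru(facts, gates):
    h = np.zeros(d)
    for f, g in zip(facts, gates):
        r = sigmoid(Wr @ f + Ur @ h)            # reset gate, unchanged
        h_tilde = np.tanh(W @ f + U @ (r * h))  # candidate state
        h = g * h_tilde + (1 - g) * h           # scalar g_i^t replaces u_i
    return h                                    # c^t for the memory update

facts = rng.normal(size=(5, d))
gates = np.array([0.1, 0.7, 0.05, 0.1, 0.05])   # attention over facts
c_t = attention_gru(facts, gates)
```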
Episode Memory Updates in DMN+
1. Untied and tied (better) GRU: m^t = GRU(c^t, m^{t-1})
2. Untied ReLU layer (best): m^t = ReLU(W^t [m^{t-1}; c^t; q] + b)
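The untied ReLU update in a few lines of numpy (shapes are assumptions; “untied” means a separate W^t per pass):

```python
import numpy as np

d = 8
rng = np.random.default_rng(6)
W_t = rng.normal(0, 0.1, (d, 3 * d))   # one such matrix per pass t
b = np.zeros(d)

def relu_update(m_prev, c_t, q):
    z = np.concatenate([m_prev, c_t, q])   # [m^{t-1}; c^t; q]
    return np.maximum(0.0, W_t @ z + b)    # m^t = ReLU(W^t [...] + b)

m, c_t, q = (rng.normal(size=d) for _ in range(3))
m = relu_update(m, c_t, q)
```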
Training Details
- Adam optimization
- Xavier initialization for all weights except the word embeddings
- L2 regularization on all weights except biases
- Dropout on the word embeddings (GloVe) and the answer module with p = 0.9
ABCNN: Attention-Based Convolutional Neural Network for Modeling
Sentence Pairs [Yin 2015]
• Most prior work on answer selection models each sentence separately and neglects their mutual influence
• Humans focus on key parts of s_0 by extracting parts of s_1 related by identity, synonymy, antonymy, etc.
• ABCNN: takes the interdependence between s_0 and s_1 into account
• Convolution layer: increases the abstraction of a phrase beyond its words
The base network (BCNN):
1. Input embedding with word2vec
2-1. Convolution layer with wide convolution
   • So that each word v_i is detected by all weights in W
2-2. Average pooling layer
   • all-ap: column-wise averaging over all columns
   • w-ap: column-wise averaging over windows of width w
3. Output layer with logistic regression
   • The all-ap output of each non-final layer, plus the final ap layer, is forwarded to the output layer
37
Attention on feature map (ABCNN-1)
• Attention values of row 𝑖 in 𝑨 : attention distribution of the 𝑖−th unit of 𝑠Ÿ	with respect to 𝑠'
• 𝐴%,/ = 𝑚𝑎𝑡𝑐ℎ𝑠𝑐𝑜𝑟𝑒(𝐹Ÿ,¢ :, 𝑖 , 𝐹',¢ :, 𝑗 )
• 𝑚𝑎𝑡𝑐ℎ𝑠𝑐𝑜𝑟𝑒 = 1/(1 + 𝑥 − 𝑦 )
• Generate the attention feature map 𝐹%,¦ for 𝑠%
• Cons : need more parameters
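A numpy sketch of the ABCNN-1 attention matrix and attention feature maps; the W_0 A^T / W_1 A parameterization and all dimensions are assumptions for illustration:

```python
import numpy as np

d, len0, len1 = 8, 5, 7
rng = np.random.default_rng(7)
F0 = rng.normal(size=(d, len0))   # representation feature map of s_0
F1 = rng.normal(size=(d, len1))   # representation feature map of s_1

def match_score(x, y):
    return 1.0 / (1.0 + np.linalg.norm(x - y))   # 1 / (1 + |x - y|)

A = np.array([[match_score(F0[:, i], F1[:, j])   # A_ij
               for j in range(len1)] for i in range(len0)])

# attention feature maps, fed to convolution as an extra input channel
W0 = rng.normal(size=(d, len1))
W1 = rng.normal(size=(d, len0))
F0_attn = W0 @ A.T   # (d, len0), aligned with F0
F1_attn = W1 @ A     # (d, len1), aligned with F1
```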
Attention after convolution (ABCNN-2)
• Attention weights act directly on the representation, with the aim of improving the features computed by convolution
• a_{0,j} = Σ A[j, :] (row-wise sum for s_0) and a_{1,j} = Σ A[:, j] (column-wise sum for s_1)
• w-ap over the convolution features, weighted by these attention values
ABCNN-1                                          | ABCNN-2
Indirect impact on convolution                   | Direct impact on convolution via pooling (weighted attention)
Needs more parameters; vulnerable to overfitting | No extra parameters needed
Handles smaller-granularity units (e.g. word level) | Handles larger-granularity units (e.g. phrase level, phrase size = window size)

ABCNN-3: stacks ABCNN-1 and ABCNN-2, applying attention both before and after convolution.
Empirical Study on Deep Learning Models for QA [Yu 2015]
Empirical Study on Deep Learning Models for QA [Yu 2015]
42
The first to examine Neural Turing Machines on QA problems
Split QA into two step
1. search supporting facts
2. Generate answer from relevant pieces of information
NTM
• Single-layer LSTM network as controller
• Input : word embedding
1. Support fact only
2. Fact highlighted : user marker to annotate begin and end of supporting facts
• Output : softmax layer (multiclass classification) for answer
Teaching Machines to Read and Comprehend [Hermann 2015]
[Attentive Reader diagram,] where f_i = y_d(t).
s(t): the degree to which the network attends to a particular token in the document when answering the query (soft attention).
Text Understanding with the Attention Sum Reader Network [Kadlec 2016]
- The answer should appear in the context
- Inspired by Pointer Networks
- In contrast to the Attentive Reader, the answer is selected from the context directly, using the summed attention weights of individual token representations (a sketch follows)
[Diagram: comparison with the Attentive Reader]
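A tiny sketch of the attention-sum step, P(w | q, d) ∝ Σ_{i : d_i = w} s_i, on made-up attention values:

```python
import numpy as np
from collections import defaultdict

doc = ["john", "went", "to", "the", "garden",
       "john", "took", "the", "ball"]
s = np.array([0.30, 0.01, 0.01, 0.02, 0.25,
              0.20, 0.05, 0.01, 0.15])   # softmax attention over positions

p = defaultdict(float)
for word, weight in zip(doc, s):
    p[word] += weight        # sum attention over repeated mentions

answer = max(p, key=p.get)   # "john": 0.30 + 0.20 = 0.50
```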
Stochastic Latent Variable
[Diagrams: z → x (generative model); x and z → y, with z depending on x (conditional generative model).]
Generative model:
p(x) = Σ_z p(x, z) = Σ_z p(x|z) p(z), or with continuous z, p(x) = ∫_z p(x, z) dz = ∫_z p(x|z) p(z) dz
Conditional generative model:
p(y|x) = Σ_z p(y|z, x) p(z|x), or p(y|x) = ∫_z p(y|z, x) p(z|x) dz
Variational Inference Framework

p_θ(x, z) = p_θ(x|z) p(z) = ∫_h p_θ(x|h) p_θ(h|z) p(z) dh

log p_θ(x, z) = log ∫_h (q(h)/q(h)) p_θ(x|h) p_θ(h|z) p(z) dh
  ≥ ∫_h q(h) log [p_θ(x|h) p_θ(h|z) p(z) / q(h)] dh      (Jensen's inequality)
  = E_{q(h)}[log p_θ(x|h) p_θ(h|z)] - D_KL(q(h) ‖ p(z))
  = E_{q(h)}[log p_θ(x|h) p_θ(h|z) p(z) - log q(h)]

This is a tight lower bound if q(h) = p(h|x, z).
Conditional Variational Inference Framework

p_θ(y|x) = ∫_z p_θ(y, z|x) dz = ∫_z p_θ(y|x, z) p_φ(z|x) dz

log p(y|x) = log ∫_z (q(z)/q(z)) p(y|z, x) p(z|x) dz
  ≥ ∫_z q(z) log [p(y|z, x) p(z|x) / q(z)] dz      (Jensen's inequality)
  = E_{q(z)}[log p(y|z, x)] - D_KL(q(z) ‖ p(z|x))
  = E_{q(z)}[log p(y|z, x) p(z|x) - log q(z)]

This is a tight lower bound if q(z) = p(z|x).
Neural Variational Inference Framework

log p_θ(y|x) ≥ E_{q(z)}[log p(y|z, x)] - D_KL(q(z) ‖ p(z|x)) = ℒ

1. Vector representations of the observed variables: u = f_z(z), v = f_x(x)
2. Joint representation (concatenation): π = g(u, v)
3. Parameterize the variational distribution: μ = l_1(π), σ = l_2(π)
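A numpy sketch of steps 1-3 plus the usual reparameterized sample; all layer shapes and the tanh MLPs are assumptions for illustration:

```python
import numpy as np

d_in, d_h, d_z = 8, 16, 4
rng = np.random.default_rng(8)

def layer(w, b, x, act=np.tanh):
    return act(w @ x + b)

W_u, W_v = rng.normal(0, 0.1, (2, d_h, d_in))
W_mu, W_sig = rng.normal(0, 0.1, (2, d_z, 2 * d_h))
b_u, b_v = np.zeros(d_h), np.zeros(d_h)
b_mu, b_sig = np.zeros(d_z), np.zeros(d_z)

obs1, obs2 = rng.normal(size=(2, d_in))   # observed variables
u = layer(W_u, b_u, obs1)                 # 1. vector representations
v = layer(W_v, b_v, obs2)
pi = np.concatenate([u, v])               # 2. joint representation
mu = W_mu @ pi + b_mu                     # 3. variational parameters
log_sigma = W_sig @ pi + b_sig

eps = rng.normal(size=d_z)                # reparameterization trick
z = mu + np.exp(log_sigma) * eps          # sample from q(z)
```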
Neural Variational Document Model [Miao, 2015]