The document describes several papers on deep learning models for natural language processing tasks that utilize memory networks or attention mechanisms. It begins with references to seminal papers on end-to-end memory networks and dynamic memory networks. It then provides examples of tasks these models have been applied to, such as question answering, and summarizes the training procedures and architectures of memory networks and dynamic memory networks. Finally, it discusses extensions like utilizing episodic memory with multiple passes over the inputs and attention mechanisms.
6. End-to-End Memory Network [Sukhbaatar, 2015]
[Figure: single-hop architecture. The story sentences ("I go to school.", "He gets ball.", …) are embedded by matrix A into the input memory and by matrix C into the output memory; the question "Where does he go?" is embedded by matrix B into the query u. Attention is the softmax over the inner products of u with the input memories, the output o is the weighted sum of the output memories, and a final linear layer produces the answer.]
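A minimal NumPy sketch of the single hop pictured above, assuming toy bag-of-words inputs and randomly initialised embedding matrices A, B, C (all sizes are illustrative, not the paper's):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d, V, n_sent = 20, 100, 5                # embedding dim, vocab size, #sentences (toy sizes)
rng = np.random.default_rng(0)
A = rng.normal(size=(d, V))              # input-memory embedding matrix
B = rng.normal(size=(d, V))              # question embedding matrix
C = rng.normal(size=(d, V))              # output-memory embedding matrix
W = rng.normal(size=(V, d))              # final linear layer

x = rng.integers(0, 2, size=(n_sent, V)).astype(float)  # BoW story sentences
q = rng.integers(0, 2, size=V).astype(float)            # BoW question

m = x @ A.T                              # input memories  m_i = A x_i
c = x @ C.T                              # output memories c_i = C x_i
u = B @ q                                # question embedding u = B q
p = softmax(m @ u)                       # attention: softmax over inner products <u, m_i>
o = p @ c                                # output o = weighted sum of output memories
a_hat = softmax(W @ (o + u))             # predicted answer distribution over the vocabulary
```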
7. End-to-End Memory Network [Sukhbaatar, 2015]
[Figure: three-hop version of the same model. The output o of each hop is summed (Σ) with the query and passed to the next hop; the final sum goes through a linear layer to predict the answer.]
Sentence representation:
$i$-th sentence : $x_i = \{x_{i1}, x_{i2}, \ldots, x_{in}\}$
BoW : $m_i = \sum_j A x_{ij}$
Position Encoding : $m_i = \sum_j l_j \cdot A x_{ij}$
Temporal Encoding : $m_i = \sum_j A x_{ij} + T_A(i)$
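A small sketch of the position-encoding weights, using the form $l_{kj} = (1 - j/J) - (k/d)(1 - 2j/J)$ from the paper; the random toy embeddings below stand in for the $A x_{ij}$ terms:

```python
import numpy as np

def position_encoding(J, d):
    """Weights l_j with l_kj = (1 - j/J) - (k/d)(1 - 2j/J); returns a (J, d) matrix."""
    j = np.arange(1, J + 1)[:, None]     # word positions 1..J
    k = np.arange(1, d + 1)[None, :]     # embedding dimensions 1..d
    return (1 - j / J) - (k / d) * (1 - 2 * j / J)

J, d = 4, 20                             # 4 words, 20-dimensional embeddings (toy sizes)
rng = np.random.default_rng(0)
word_embeddings = rng.normal(size=(J, d))        # stands in for the A x_ij terms
l = position_encoding(J, d)
m_i = (l * word_embeddings).sum(axis=0)          # m_i = sum_j l_j * A x_ij (element-wise)
```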
8. Training details
Linear Start (LS) helps avoid local minima
- First train with the softmax in each memory layer removed, making the model entirely linear except for the final softmax
- When the validation loss stopped decreasing, the softmax layers were re-inserted and training recommenced
RNN-style layer-wise weight tying
- The input and output embeddings are the same across different layers
Learning time invariance by injecting random noise
- Jittering the time index with random empty memories
- Add “dummy” memories to regularize $T_A(i)$
10. The Goldilocks Principle: Reading Children's Books with Explicit
Memory Representations [Hill, 2016]
11. The Goldilocks Principle: Reading Children's Books with Explicit
Memory Representations [Hill, 2016]
• Context sentences : $S = \{s_1, s_2, \ldots, s_n\}$, where each $s_i$ is a BoW word representation
• Encoded memory : $m_s = \phi(s)\ \forall s \in S$
• Lexical memory
• Each word occupies a separate slot in the memory
• $s$ is a single word and $\phi(s)$ has only one non-zero feature
• Multiple hops are only beneficial in this memory model
• Window memory (best)
• 𝑠 corresponds to a window of text from the context 𝑆 centered on an individual mention of a candidate 𝑐 in 𝑆
$m_s = w_{i-(b-1)/2} \ldots w_i \ldots w_{i+(b-1)/2}$
• where $w_i \in C$ is an instance of one of the candidate words and $b$ is the window size
• Sentential memory
• Same as original implementation of Memory Network
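A hedged sketch of how the window memories above might be built, assuming simple whitespace tokenisation, a toy context, and naive clipping at the context edges (the helper name is an illustration, not the paper's code):

```python
def window_memories(context_tokens, candidates, b=5):
    """Return (candidate, window) pairs: a window of b tokens centred on each mention."""
    half = (b - 1) // 2
    memories = []
    for i, w in enumerate(context_tokens):
        if w in candidates:
            lo, hi = max(0, i - half), min(len(context_tokens), i + half + 1)
            memories.append((w, context_tokens[lo:hi]))
    return memories

tokens = "the king picked up the crown and the queen smiled".split()
print(window_memories(tokens, candidates={"king", "queen", "crown"}, b=5))
```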
12. The Goldilocks Principle: Reading Children's Books with Explicit
Memory Representations [Hill, 2016]
Self-supervision for window memories
- Memory supervision (knowing which memories to attend to) is not provided at training time
- Gradient steps are made using SGD to force the model to give a higher score to the supporting memory $m^*$ relative to any other memory from any other candidate, using:
Hard attention (training and testing) : $m_{o_1} = \underset{i=1,\ldots,n}{\arg\max}\ c_i^{\top} q$
Soft attention (testing) : $m_{o_1} = \sum_{i=1,\ldots,n} \alpha_i\, m_i$, with $\alpha_i = \frac{e^{\,c_i^{\top} q}}{\sum_j e^{\,c_j^{\top} q}}$
- If $m_{o_1}$ happens to differ from $m^*$ (the memory containing the true answer), the model is updated
- This can be understood as a way of achieving hard attention over memories (no new label information is needed beyond the training data)
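A rough sketch of the self-supervision step under stated assumptions: memories and the query are plain vectors, and the update is an illustrative perceptron-style step rather than the paper's exact margin objective:

```python
import numpy as np

def self_supervision_step(memories, query, supporting_idx, lr=0.1):
    """Score every memory against the query; update only when the argmax is wrong."""
    scores = memories @ query                 # c_i^T q for every window memory
    predicted = int(np.argmax(scores))        # hard attention
    if predicted != supporting_idx:
        memories[supporting_idx] += lr * query    # raise the supporting memory's score
        memories[predicted] -= lr * query         # lower the wrongly selected memory's score
    return memories, predicted

rng = np.random.default_rng(0)
memories, query = rng.normal(size=(4, 8)), rng.normal(size=8)
memories, predicted = self_supervision_step(memories, query, supporting_idx=2)
```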
13. The Goldilocks Principle: Reading Children's Books with Explicit
Memory Representations [Hill, 2016]
15. Ask Me Anything: Dynamic Memory Networks for Natural Language
Processing [Kumar, 2015]
16. Ask Me Anything: Dynamic Memory Networks for Natural Language
Processing [Kumar, 2015]
[Figure: DMN overview. The Input Module encodes the story ("I go to school.", "He gets ball.", …) with GloVe embeddings and a GRU; the Question Module encodes "Where does he go?" into $q$ the same way; the Episodic Memory Module ("GRU-ish") computes gates $g_t^i$ and episodes $e_i$ over the input hidden states; the Answer Module is a GRU that emits answer words $y_t$ conditioned on $q$ and the previous answer state $a_t$, ending with <EOS>.]
17. Ask Me Anything: Dynamic Memory Networks for Natural Language
Processing [Kumar, 2015]
[Figure: the same DMN diagram, annotated with the per-module equations.
- Input Module: $GRU(L[w_t^I], h_{t-1}) = h_t = c_t$ (GloVe-embedded story words fed to a GRU; the hidden states are the facts $c_t$)
- Question Module: $q_t = GRU(L[w_t^Q], q_{t-1})$
- Episodic Memory ("GRU-ish"): gate $g_t^i = G(c_t, m_{i-1}, q)$; update $h_t^i = g_t^i\, GRU(c_t, h_{t-1}^i) + (1 - g_t^i)\, h_{t-1}^i$; episode $e_i = h_T^i$; new memory $m_i$
- Answer Module: a GRU with softmax outputs $y_t$, conditioned on $q$ and the previous answer, ending with <EOS>.]
18. Ask Me Anything: Dynamic Memory Networks for Natural Language
Processing [Kumar, 2015]
[Figure: the same DMN diagram, highlighting the attention gate $g_t^i = G(c_t, m_{i-1}, q)$ in the Episodic Memory Module.]
Attention Mechanism
- Feature vector: captures similarities between $c_t$, $m_{i-1}$, and $q$
- $G$: a two-layer feed-forward neural network
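A minimal sketch of the gate, assuming a simplified feature vector (element-wise products and absolute differences of $c$, $m$, $q$; the paper's full feature set is larger) and a two-layer feed-forward network with illustrative sizes:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate(c, m, q, W1, b1, W2, b2):
    """Scalar gate g_t^i from a similarity feature vector and a 2-layer network."""
    z = np.concatenate([c * q, c * m, np.abs(c - q), np.abs(c - m)])  # similarity features
    return float(sigmoid(W2 @ np.tanh(W1 @ z + b1) + b2))

d, hidden = 8, 16
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(hidden, 4 * d)), np.zeros(hidden)
W2, b2 = rng.normal(size=(1, hidden)), np.zeros(1)
c, m, q = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
g = gate(c, m, q, W1, b1, W2, b2)            # value in (0, 1)
```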
19. Ask Me Anything: Dynamic Memory Networks for Natural Language
Processing [Kumar, 2015]
[Figure: the same DMN diagram, highlighting the episode $e_i = h_{T_C}^i$ and the memory update in the Episodic Memory Module.]
Episodic memory update
- Memory update: $m_i = GRU(e_i, m_{i-1})$
Episodic Memory Module
- Iterates over the input representations, producing an episode $e_i$ on each pass
- Attention mechanism + recurrent network → updated memory $m_i$
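A sketch of one pass of the module, assuming precomputed gates and a hand-rolled GRU cell with toy dimensions; it shows the GRU-ish update and the memory update $m_i = GRU(e_i, m_{i-1})$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, params):
    """Plain GRU cell used both inside the episode and for the memory update."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h)
    r = sigmoid(Wr @ x + Ur @ h)
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))
    return (1 - z) * h + z * h_tilde

def episode(facts, gates, params):
    """GRU-ish pass: h_t = g_t * GRU(c_t, h_{t-1}) + (1 - g_t) * h_{t-1}."""
    h = np.zeros(facts.shape[1])
    for c_t, g in zip(facts, gates):
        h = g * gru_cell(c_t, h, params) + (1 - g) * h
    return h                                   # e_i = final hidden state

d = 8
rng = np.random.default_rng(0)
params = [rng.normal(size=(d, d)) * 0.1 for _ in range(6)]
facts = rng.normal(size=(5, d))                # c_1..c_5 from the input module
gates = rng.random(5)                          # g_t^i from the gate network
e_i = episode(facts, gates, params)
m_i = gru_cell(e_i, np.zeros(d), params)       # m_i = GRU(e_i, m_{i-1}); zero m_{i-1} for illustration
```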
20. Ask Me Anything: Dynamic Memory Networks for Natural Language
Processing [Kumar, 2015]
[Figure: the same DMN diagram, highlighting multiple passes of the Episodic Memory Module.]
Multiple Episodes
- Allow the model to attend to different inputs during each pass
- Allow a type of transitive inference, since the first pass may uncover the need to retrieve additional facts
- Example: Q: Where is the football? C1: John put down the football. Only once the model sees C1 does it know that John is relevant, and it can then reason that the second pass should retrieve where John was.
Criteria for Stopping
- Append a special end-of-passes representation to the input $c$
- Stop if this representation is chosen by the gate function
- Otherwise, set a maximum number of iterations
- This is why the model is called a Dynamic Memory Network
21. Ask Me Anything: Dynamic Memory Networks for Natural Language
Processing [Kumar, 2015]
[Figure: the same DMN diagram, highlighting the Answer Module: a GRU with softmax outputs $y_t$, conditioned on $q$ and the previously generated word, ending with <EOS>.]
Answer Module
- Triggered either once at the end of the episodic memory passes or at each time step
- Concatenates the last generated word and the question vector as the input at each time step
- Trained with a cross-entropy error
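A hedged sketch of the decoding loop, assuming a simplified recurrent step in place of a real GRU and toy vocabulary sizes; it shows the concatenation of the previous output distribution with the question vector at each step:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def answer_module(m_final, q, Wa, step, max_steps=5, eos_idx=0):
    """Decode words until <EOS>: input at each step is [previous output ; question]."""
    a = m_final                                # a_0 = final episodic memory
    y = softmax(Wa @ a)
    outputs = []
    for _ in range(max_steps):
        idx = int(np.argmax(y))
        outputs.append(idx)
        if idx == eos_idx:
            break
        a = step(np.concatenate([y, q]), a)    # a_t = GRU([y_{t-1}, q], a_{t-1})
        y = softmax(Wa @ a)
    return outputs

d, V = 8, 12
rng = np.random.default_rng(0)
Wa = rng.normal(size=(V, d))
Wx, Uh = rng.normal(size=(d, V + d)) * 0.1, rng.normal(size=(d, d)) * 0.1
step = lambda x, h: np.tanh(Wx @ x + Uh @ h)   # simplified recurrent step (not a full GRU)
words = answer_module(rng.normal(size=d), rng.normal(size=d), Wa, step)
```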
22. Training Details
- Adam optimization
- $L_2$ regularization, dropout on the word embeddings (GloVe)
bAbI dataset
- Objective function : $J = \alpha E_{CE}(\text{Gates}) + \beta E_{CE}(\text{Answers})$, where $E_{CE}$ is a cross-entropy cost
- Gate supervision aims to select one sentence per pass
- Without supervision : the GRU-ish update over $c_t$, $h_t^i$ with $e_i = h_{T_C}^i$
- With supervision (simpler) : $e_i = \sum_{t=1}^{T} \mathrm{softmax}(g_t^i)\, c_t$, where $\mathrm{softmax}(g_t^i) = \frac{\exp(g_t^i)}{\sum_j \exp(g_j^i)}$ and $g_t^i$ is the value before the sigmoid
- Better results, because the softmax encourages sparsity and is suited to picking one sentence
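A small sketch of the supervised episode computation above, with toy facts and pre-sigmoid gate values standing in for one pass of the module:

```python
import numpy as np

def supervised_episode(facts, gate_logits):
    """e_i = sum_t softmax(g_t^i) c_t, with the softmax over one pass's gate values."""
    w = np.exp(gate_logits - gate_logits.max())
    w = w / w.sum()
    return w @ facts

rng = np.random.default_rng(0)
facts = rng.normal(size=(5, 8))          # c_1..c_5
gate_logits = rng.normal(size=5)         # g_t^i before the sigmoid
e_i = supervised_episode(facts, gate_logits)
```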
23. Training Details
Stanford Sentiment Treebank (Sentiment Analysis)
- Use all full sentences, subsample 50% of phrase-level labels every epoch
- Only evaluated on the full sentences
- Binary classification, neutral phrases are removed from the dataset
- Trained with GRU sequence models
25. Dynamic Memory Networks for Visual and Textual Question
Answering [Xiong 2016]
Several design choices are motivated by intuition and accuracy improvements
26. Input Module in DMN
- A single GRU embeds the story and stores the hidden states
- The GRU provides a temporal component by allowing a sentence to know the content of the sentences that came before it
- Cons:
- The GRU only allows sentences to have context from sentences before them, but not after them
- Supporting sentences may be too far away from each other
- This motivates the input fusion layer (next slide)
27. Input Module in DMN+
Replacing a single GRU with two different components
1. Sentence reader : responsible only for encoding the words into a sentence embedding
• Uses the positional encoding from the End-to-End Memory Network : $f_i = \sum_j l_j \cdot A x_{ij}$
• GRUs and LSTMs were considered, but they require more computational resources and are prone to overfitting
2. Input fusion layer : allows content interactions between sentences
• A bi-directional GRU lets information flow from both past and future sentences
• Gradients do not need to propagate through the words between sentences
• Distant supporting sentences can have a more direct interaction
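A sketch of the input fusion layer, assuming position-encoded sentence vectors as input and a simple tanh recurrence standing in for the bi-directional GRU (weights and sizes are illustrative):

```python
import numpy as np

def fusion_layer(sent_vecs, Wf, Uf, Wb, Ub):
    """Run forward and backward recurrences over sentence vectors and sum them."""
    n, d = sent_vecs.shape
    fwd, bwd = np.zeros((n, d)), np.zeros((n, d))
    h = np.zeros(d)
    for i in range(n):                       # past -> future
        h = np.tanh(Wf @ sent_vecs[i] + Uf @ h)
        fwd[i] = h
    h = np.zeros(d)
    for i in reversed(range(n)):             # future -> past
        h = np.tanh(Wb @ sent_vecs[i] + Ub @ h)
        bwd[i] = h
    return fwd + bwd                         # fused facts f_i

d, n = 8, 6
rng = np.random.default_rng(0)
Wf, Uf, Wb, Ub = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
facts = fusion_layer(rng.normal(size=(n, d)), Wf, Uf, Wb, Ub)
```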
28. Input Module for DMN+
Referenced paper : A Hierarchical Neural Autoencoder for Paragraphs and Documents [Li, 2015]
29. Episodic Memory Module in DMN+
- $\overleftrightarrow{F} = [f_1, f_2, \ldots, f_N]$ : output of the input module
- Interactions between the fact $f_i$ and both the question $q$ and the episode memory state $m^t$
30. Attention Mechanism in DMN+
Use attention to extract a contextual vector $c^t$ based on the current focus
1. Soft attention
• A weighted summation of $\overleftrightarrow{F}$ : $c^t = \sum_{i=1}^{N} g_i^t f_i$
• Can approximate hard attention by selecting a single fact $f_i$
• Cons: loses positional and ordering information
• Attention passes can retrieve some of this information, but this is inefficient
31. Attention Mechanism in DMN+
2. Attention-based GRU (best)
- An RNN is suited to exploiting position and ordering information, but a standard GRU cannot make use of the attention gates $g_i^t$
- In the GRU, $u_i$ is the update gate and $r_i$ controls how much to retain from $h_{i-1}$
- Replace $u_i$ (a vector) with $g_i^t$ (a scalar)
- Allows us to easily visualize how the attention gates activate
- Use the final hidden state as $c^t$, which is used to update the episodic memory $m^t$
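A sketch of the attention-based GRU, where the scalar gate $g_i^t$ replaces the update gate; the weights, toy facts, and precomputed gates are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_gru(facts, gates, Wr, Ur, Wh, Uh):
    """h_i = g_i^t * h~_i + (1 - g_i^t) * h_{i-1}: the attention gate replaces u_i."""
    h = np.zeros(facts.shape[1])
    for f_i, g in zip(facts, gates):
        r = sigmoid(Wr @ f_i + Ur @ h)              # reset gate, unchanged
        h_tilde = np.tanh(Wh @ f_i + Uh @ (r * h))  # candidate state
        h = g * h_tilde + (1 - g) * h               # scalar g_i^t instead of the update gate
    return h                                        # contextual vector c^t

d = 8
rng = np.random.default_rng(0)
Wr, Ur, Wh, Uh = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
facts = rng.normal(size=(5, d))                     # fused facts f_1..f_5
gates = rng.random(5)                               # attention gates g_i^t
c_t = attention_gru(facts, gates, Wr, Ur, Wh, Uh)
```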
33. Training Details
- Adam optimization
- Xavier initialization is used for all weights except for the word embeddings
- $L_2$ regularization on all weights except biases
- Dropout on the word embedding (GloVe) and answer module with 𝑝 = 0.9
35. ABCNN: Attention-Based Convolutional Neural Network for Modeling
Sentence Pairs [Yin 2015]
• Most prior work on answer selection models each sentence separately and neglects their mutual influence
• Humans focus on key parts of $s_0$ by extracting parts of $s_1$ related by identity, synonymy, antonymy, etc.
• ABCNN : takes the interdependence between $s_0$ and $s_1$ into account
• Convolution layer : increases abstraction from words to phrases
36. ABCNN: Attention-Based Convolutional Neural Network for Modeling
Sentence Pairs [Yin 2015]
1. Input embedding with word2vec
2-1. Convolution layer with wide convolution
• So that each word $v_i$ can be detected by all weights in $W$
2-2. Average pooling layer
• all-ap : column-wise averaging over all columns
• w-ap : column-wise averaging over windows of 𝑤
3. Output layer with logistic regression
• The all-ap outputs of the non-final layers and of the final layer are all forwarded to the output layer
37. ABCNN: Attention-Based Convolutional Neural Network for Modeling
Sentence Pairs [Yin 2015]
Attention on feature map (ABCNN-1)
• The attention values of row $i$ in $A$ : the attention distribution of the $i$-th unit of $s_0$ with respect to $s_1$
• $A_{i,j} = \text{match-score}(F_{0,r}[:, i],\, F_{1,r}[:, j])$
• $\text{match-score}(x, y) = 1/(1 + \lVert x - y \rVert)$
• Generate the attention feature map $F_{i,a}$ for $s_i$
• Cons : needs more parameters
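A sketch of the ABCNN-1 attention computation on toy feature maps; the matrix sizes and the weight matrices $W_0$, $W_1$ used to form the attention feature maps are illustrative assumptions:

```python
import numpy as np

def attention_matrix(F0, F1):
    """A[i, j] = 1 / (1 + ||F0[:, i] - F1[:, j]||) for feature maps F0 (d x len0), F1 (d x len1)."""
    diff = F0[:, :, None] - F1[:, None, :]          # (d, len0, len1)
    return 1.0 / (1.0 + np.linalg.norm(diff, axis=0))

d, len0, len1 = 6, 5, 7
rng = np.random.default_rng(0)
F0, F1 = rng.normal(size=(d, len0)), rng.normal(size=(d, len1))
A = attention_matrix(F0, F1)                        # (len0, len1)
W0, W1 = rng.normal(size=(d, len1)), rng.normal(size=(d, len0))
F0_attn, F1_attn = W0 @ A.T, W1 @ A                 # attention feature maps for s0 and s1
```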
38. ABCNN: Attention-Based Convolutional Neural Network for Modeling
Sentence Pairs [Yin 2015]
Attention after convolution (ABCNN-2)
• Attention weights are applied directly to the representation, with the aim of improving the features computed by convolution
• $a_{0,j} = \sum A[j, :]$ and $a_{1,j} = \sum A[:, j]$ (row-wise and column-wise sums of the attention matrix)
• w-ap pooling, weighted by these attention values, is applied to the convolution features
39. ABCNN: Attention-Based Convolutional Neural Network for Modeling
Sentence Pairs [Yin 2015]
ABCNN-1 vs. ABCNN-2
- Attention: ABCNN-1 affects the convolution only indirectly; ABCNN-2 affects pooling directly (attention-weighted pooling)
- Parameters: ABCNN-1 needs extra attention feature maps and is vulnerable to overfitting; ABCNN-2 needs no extra features
- Granularity: ABCNN-1 handles smaller-granularity units (e.g., word level); ABCNN-2 handles larger-granularity units (e.g., phrase level, phrase size = window size)
ABCNN-3 : combines ABCNN-1 and ABCNN-2
42. Empirical Study on Deep Learning Models for QA [Yu 2015]
The first to examine Neural Turing Machines on QA problems
Splits QA into two steps:
1. Search for supporting facts
2. Generate the answer from the relevant pieces of information
NTM
• Single-layer LSTM network as the controller
• Input : word embeddings
1. Supporting facts only
2. Facts highlighted : use markers to annotate the beginning and end of the supporting facts
• Output : softmax layer (multiclass classification) for the answer
44. Teaching Machines to Read and Comprehend [Hermann 2015]
where $f_i = y_d(t)$
$s(t)$ : degree to which the network attends to a particular token in the document when answering the query (soft attention)
45. Text Understanding with the Attention Sum Reader Network [Kadlec 2016]
The answer should appear in the context
Inspired by Pointer Networks
In contrast to the Attentive Reader:
• The answer is selected from the context directly using the computed attention (summing attention over occurrences of each candidate), rather than blending individual token representations with a weighted sum as the Attentive Reader does
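A minimal sketch of the attention-sum selection step, assuming a toy document and already-normalised attention weights:

```python
import numpy as np

def attention_sum(doc_tokens, attention, candidates):
    """Sum attention over every occurrence of each candidate; return the best candidate."""
    scores = {c: 0.0 for c in candidates}
    for token, a in zip(doc_tokens, attention):
        if token in scores:
            scores[token] += a
    return max(scores, key=scores.get), scores

doc = "mary went to the kitchen then mary took the apple".split()
rng = np.random.default_rng(0)
attention = rng.random(len(doc))
attention = attention / attention.sum()            # toy normalised attention weights
answer, scores = attention_sum(doc, attention, candidates={"kitchen", "apple", "mary"})
```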