The document discusses attention models and their applications. Attention models allow a model to focus on specific parts of the input that are important for predicting the output. This is unlike traditional models that use the entire input equally. Three key applications are discussed: (1) Image captioning models that attend to relevant regions of an image when generating each word of the caption, (2) Speech recognition models that attend to different audio fragments when predicting text, and (3) Visual attention models for tasks like saliency detection and fixation prediction that learn to focus on important regions of an image. The document also covers techniques like soft attention, hard attention, and spatial transformer networks.
Attention Models (D3L6 2017 UPC Deep Learning for Computer Vision)
1. Attention Models
Day 3 Lecture 6
#DLUPC
Amaia Salvador
amaia.salvador@upc.edu
PhD Candidate
Universitat Politècnica de Catalunya
2. Attention Models: Motivation
Image (H x W x 3) → network → "bird"
The whole input volume is used to predict the output...
...despite the fact that not all pixels are equally important.
3. Attention Models: Motivation
Example caption: "A bird flying over a body of water"
Attend to different parts of the input to optimize a certain output.
Case study: Image Captioning
4. Previously D3L5: Image Captioning
The Multimodal Recurrent Neural Network only takes image features into account in the first hidden state.
Karpathy and Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015
5. LSTM Decoder for Image Captioning
A CNN encoder produces a single feature vector (D-dimensional) that initializes an LSTM decoder, which then emits the caption word by word: "A bird flying ... <EOS>".
Vinyals et al. Show and tell: A neural image caption generator. CVPR 2015
Limitation: all output predictions are based on the final and static output of the encoder.
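A rough sketch of this setup (all layer names, sizes, and the <start> token id below are illustrative assumptions, not from the paper): the image feature enters only once, and every later word is predicted from the recurrent state alone.

import torch
import torch.nn as nn

# Minimal "Show and Tell"-style decoder sketch (illustrative names/sizes).
D, E, H, V = 512, 256, 512, 10000  # feature, embedding, hidden, vocab sizes

cnn_feat = torch.randn(1, D)             # static encoder output (one per image)
embed = nn.Embedding(V, E)
init_h = nn.Linear(D, H)                 # the image is only used here...
lstm = nn.LSTMCell(E, H)
vocab_proj = nn.Linear(H, V)

h = init_h(cnn_feat)
c = torch.zeros_like(h)
word = torch.tensor([0])                 # <start> token id (assumed to be 0)
for _ in range(5):                       # generate a few words greedily
    h, c = lstm(embed(word), (h, c))     # ...all later steps see only h, c
    word = vocab_proj(h).argmax(dim=-1)  # predicted word feeds the next step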
7. Attention for Image Captioning
The CNN now outputs a grid of features f (L locations, each D-dimensional) instead of a single vector.
The first context vector c0 is the average of the feature vectors, and the first word y0 is the <start> token.
From the initial state h0, the model predicts attention weights a1 over the L locations and the first caption word y1.
8. Attention for Image Captioning
The visual features, weighted with the attention weights, give the next context vector c1.
The LSTM then receives c1 together with y1 (the word predicted at the previous timestep), updates its state to h1, and outputs the next attention weights a2 and word y2.
9. Attention for Image Captioning
The process repeats at every timestep: (c1, y1) → h1 → (a2, y2), then (c2, y2) → h2 → (a3, y3), and so on until the caption ends.
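A minimal sketch of this decoding loop, assuming a simplified linear scoring layer in place of the paper's attention MLP (all names, sizes, and the <start> id are illustrative):

import torch
import torch.nn as nn

L, D, E, H, V = 196, 512, 256, 512, 10000

feats = torch.randn(1, L, D)            # grid of CNN features f: L x D
embed = nn.Embedding(V, E)
lstm = nn.LSTMCell(E + D, H)            # input: previous word + context vector
score = nn.Linear(H + D, 1)             # simple stand-in for the attention MLP
vocab_proj = nn.Linear(H, V)

c = feats.mean(dim=1)                   # c0: first context vector is the average
word = torch.tensor([0])                # y0: <start> token id (assumed)
h = torch.zeros(1, H)
cell = torch.zeros(1, H)
for _ in range(5):
    h, cell = lstm(torch.cat([embed(word), c], dim=-1), (h, cell))
    # attention weights: one scalar per location, normalized with a softmax
    e = score(torch.cat([h.unsqueeze(1).expand(-1, L, -1), feats], dim=-1))
    a = torch.softmax(e, dim=1)          # (1, L, 1)
    c = (a * feats).sum(dim=1)           # next context = weighted visual features
    word = vocab_proj(h).argmax(dim=-1)  # predicted word y_t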
10. Attention for Image Captioning
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
(Figure: attention maps over the image as each caption word is generated.)
12. Attention for Image Captioning
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
Some outputs can probably be predicted without looking at the image...
14. Attention for Image Captioning
Can we focus on the image only when necessary?
15. Attention for Image Captioning
"Regular" spatial attention: the same decoder as in slides 7-9, attending to the image feature grid at every timestep.
16. Attention for Image Captioning
Attention with sentinel: the LSTM is modified to also output a 'non-visual' feature s_t that can be attended to, alongside the hidden state h_t.
Lu et al. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. CVPR 2017
17. Attention for Image Captioning
Lu et al. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. CVPR 2017
Attention weights indicate when it is more important to look at the image features, and when it is better to rely on the current LSTM state:
If sum(a[0:L]) > a[L] (more total weight on the L visual locations than on the sentinel): image features are needed for the final decision.
Else: the RNN state is enough to predict the next word.
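A toy sketch of this decision rule (the weights here are random placeholders; in the model they come from the attention softmax over the L locations plus the sentinel):

import torch

# a has L+1 entries: L spatial locations plus one sentinel weight at the end.
a = torch.softmax(torch.randn(50), dim=0)  # toy attention weights, L = 49

visual_weight = a[:-1].sum()   # total mass on the image feature grid
sentinel_weight = a[-1]        # mass on the LSTM's 'non-visual' sentinel

if visual_weight > sentinel_weight:
    print("image features are needed for the final decision")
else:
    print("RNN state is enough to predict the next word")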
18. Soft Attention
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
The CNN maps the image (H x W x 3) to a grid of features {a, b, c, d}, each D-dimensional.
From the RNN, a distribution over grid locations is predicted: p_a + p_b + p_c + p_d = 1.
Soft attention: summarize ALL locations into a single context vector z (D-dimensional):
z = p_a·a + p_b·b + p_c·c + p_d·d
The derivative dz/dp is nice! Train with gradient descent.
Slide credit: CS231n
19. Soft Attention
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
The same mechanism, viewed as a differentiable function: train with gradient descent.
Caveats:
- Still uses the whole input!
- Constrained to a fixed grid
Slide credit: CS231n
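The 2x2 grid example from the soft attention slides, written out as a minimal sketch (toy values; feats stacks the four feature vectors a, b, c, d):

import torch

D = 512
feats = torch.randn(4, D, requires_grad=True)  # a, b, c, d stacked
p = torch.softmax(torch.randn(4), dim=0)       # p_a + p_b + p_c + p_d = 1

z = (p.unsqueeze(1) * feats).sum(dim=0)        # z = p_a*a + p_b*b + p_c*c + p_d*d

# z is a differentiable function of both p and the features, so gradients
# flow back through the attention step during training:
z.sum().backward()
print(feats.grad.shape)  # torch.Size([4, 512])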
20. Hard Attention
Hard attention: sample a subset of the input.
Input image (H x W x 3) + box coordinates (xc, yc, w, h) → cropped and rescaled image (X x Y x 3).
Cropping at sampled coordinates is not a differentiable function, so it can't be trained with backprop :(
Other optimization strategies are needed, e.g. reinforcement learning.
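A minimal sketch of the sampling step (toy values), showing why backprop alone cannot train it:

import torch

feats = torch.randn(4, 512)                  # feature grid (as in the soft case)
p = torch.softmax(torch.randn(4), dim=0)     # distribution over locations

idx = torch.multinomial(p, num_samples=1)    # stochastic, discrete choice
z = feats[idx.item()]                        # only a subset of the input is used

# The sampling step has no gradient w.r.t. p, so backprop alone cannot train
# it; estimators like REINFORCE (score-function gradients) are used instead.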
21. Spatial Transformer Networks
Input image (H x W x 3) + box coordinates (xc, yc, w, h) → cropped and rescaled image (X x Y x 3) → CNN → "bird"
Jaderberg et al. Spatial Transformer Networks. NIPS 2015
The crop is not a differentiable function, so it can't be trained with backprop :(
Idea: make it differentiable, and train with backprop :)
22. Spatial Transformer Networks
Jaderberg et al. Spatial Transformer Networks. NIPS 2015
Can we make this function differentiable?
Idea: a function mapping pixel coordinates (xt, yt) of the output to pixel coordinates (xs, ys) of the input, repeated for all pixels in the output. The mapping is given by the box coordinates (translation + scale), which the network predicts in order to attend to the input.
Slide credit: CS231n
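A minimal sketch of the differentiable crop using PyTorch's affine_grid and grid_sample; the specific theta values (a half-size crop, shifted right and down) are made up for illustration:

import torch
import torch.nn.functional as F

img = torch.randn(1, 3, 224, 224)            # input image, H x W x 3 (as NCHW)

# theta encodes translation + scale in normalized [-1, 1] coordinates
theta = torch.tensor([[[0.5, 0.0, 0.25],
                       [0.0, 0.5, 0.25]]])   # (N, 2, 3)

# maps each output pixel to input coordinates, then samples bilinearly
grid = F.affine_grid(theta, size=(1, 3, 64, 64), align_corners=False)
crop = F.grid_sample(img, grid, align_corners=False)
print(crop.shape)  # torch.Size([1, 3, 64, 64])

# Both ops are differentiable w.r.t. theta, so a network can learn to
# predict the box coordinates with plain backprop.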
23. Spatial Transformer Networks
Jaderberg et al. Spatial Transformer Networks. NIPS 2015
A differentiable module: easy to incorporate in any network, anywhere!
Insert spatial transformers into a classification network and it learns to attend to and transform the input.
24. Spatial Transformer Networks
Jaderberg et al. Spatial Transformer Networks. NIPS 2015
Fine-grained classification
Also used as an alternative to RoI pooling in proposal-based detection & segmentation pipelines
25. Deformable Convolutions
Dai et al. Deformable Convolutional Networks. arXiv Mar 2017
Dynamic & learnable receptive field
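A minimal sketch using torchvision's DeformConv2d implementation of this operator (layer sizes are illustrative): a side branch predicts per-position sampling offsets for the main convolution, giving it a dynamic, learnable receptive field.

import torch
from torchvision.ops import DeformConv2d

x = torch.randn(1, 64, 32, 32)

# offsets: 2 values (dx, dy) per kernel tap (3x3 = 9 taps) per output position
offset_pred = torch.nn.Conv2d(64, 2 * 3 * 3, kernel_size=3, padding=1)
deform = DeformConv2d(64, 128, kernel_size=3, padding=1)

y = deform(x, offset_pred(x))
print(y.shape)  # torch.Size([1, 128, 32, 32])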
28. Attention Mechanism
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
The vector to be fed to the RNN at each timestep is a
weighted sum of all the annotation vectors.
29. Attention Mechanism
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
An attention weight (scalar) is predicted at each timestep for each annotation vector h_j with a simple fully connected neural network.
The network takes the annotation vector h_1 and the recurrent state z_i as input and outputs the attention weight a_1.
30. Attention Mechanism
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
The same network maps h_2 and z_i to a_2; its parameters are shared for all j.
31. Attention Mechanism
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
Once a relevance score (weight) is estimated for each word, they are normalized
with a softmax function so they sum up to 1.
32. Attention Mechanism
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
Finally, a context-aware representation c_{i+1} for the output word at timestep i can be defined as the attention-weighted sum of the annotation vectors: c_{i+1} = Σ_j a_j h_j
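A minimal sketch of slides 29-32 end to end (the sizes and the two-layer scorer are illustrative assumptions): one shared scoring network, a softmax, and a weighted sum.

import torch
import torch.nn as nn

T, H = 12, 256                           # source length, state size
annotations = torch.randn(T, H)          # h_j for each source word
z_i = torch.randn(H)                     # current decoder recurrent state

scorer = nn.Sequential(                  # shared for all j
    nn.Linear(2 * H, H), nn.Tanh(), nn.Linear(H, 1))

# one scalar relevance score per annotation vector h_j
e = scorer(torch.cat([annotations, z_i.expand(T, H)], dim=-1)).squeeze(-1)
a = torch.softmax(e, dim=0)              # normalized: the weights sum to 1

c_next = (a.unsqueeze(1) * annotations).sum(dim=0)  # c_{i+1} = sum_j a_j h_j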
33. Attention Mechanism
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
The model automatically finds the correspondence structure between two languages
(alignment).
(Edge thicknesses represent the attention weights found by the attention model)
35. Attention Models
Chan et al. Listen, Attend and Spell. ICASSP 2016
Source: distill.pub
Input: Audio features; Output: Text
Attend to different parts of the input to optimize a certain output
36. Attention for Image Captioning
Side note: attention can be computed with the previous or the current hidden state.
(Diagram: the same captioning decoder, but with attention weights a_t and context vectors c_t computed from the current hidden state h_t.)
37. Attention for Image Captioning
Attention with sentinel: the LSTM is modified to output a 'non-visual' feature s_t to attend to.
(Diagram: the same decoder, with sentinel vectors s1, s2, s3 output alongside the hidden states h1, h2, h3.)
38. Semantic Attention: Image Captioning
You et al. Image Captioning with Semantic Attention. CVPR 2016
39. Visual Attention: Saliency Detection
Kuen et al. Recurrent Attentional Networks for Saliency Detection. CVPR 2016
40. Visual Attention: Fixation Prediction
Cornia et al. Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model.