1. Question Image Co-attention by Low-Rank Bilinear
Model for Visual Question Answering
Jitendra Kumar Kushwaha
IIST, Thiruvananthapuram
Project Guide:
Dr. Sumitra S.
Associate Professor
Dept. of Mathematics, IIST
May 30, 2019
2. Overview
1 Introduction
2 Applications
3 Image Feature Extraction
4 Question Modeling
5 Joint Representation
Bilinear Model
Low-Rank Bilinear Model with Hadamard Product
6 Co-attention Mechanism
7 Attended visual and question feature
8 Answer Prediction
9 Results
10 Conclusion and Future Work
11 References
3. Introduction
Objective
The goal of this thesis is to develop a model that can jointly understand language and visual inputs.
The model takes as input an image and a natural language question
about the image and produces a natural language answer as the
output.
4. Introduction
Motivation
In neural network models, we do not know whether the model is making a sensible prediction or a random guess. Incorporating an attention mechanism gives us an estimate of what the model has learned.
Instead of considering only image attention, the co-attention mechanism allows us to consider both image and question attention.
In the co-attention mechanism, the image guides the question attention and vice versa.
6. Image Feature Extraction
Image Feature Extraction
The image model uses a CNN to obtain a representation of the image.
CNN architectures are used to extract the image feature map V from the raw image I.
The image feature is V = {v_1, v_2, . . . , v_N}, where v_n is the feature vector at spatial location n.
7. Image Feature Extraction
Pretrained Models
The feature V = CNN_vgg(I) is taken from the last convolutional layer, which retains the spatial information of the original image.
For an image rescaled to 3 × 448 × 448, the visual feature V output by the last convolutional layer of VGG-19 has dimension 512 × 14 × 14. Alternatively, ResNet-152 is used, whose output has dimension 2048 × 14 × 14.
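A minimal PyTorch sketch of this step, assuming torchvision's pretrained VGG-19 (the thesis pipeline itself may differ); only the 448 × 448 input size and the 512 × 14 × 14 output shape come from the slide above.

```python
import torch
import torchvision.models as models

# Sketch only (not the thesis code): extract the last-conv-layer feature map.
vgg = models.vgg19(pretrained=True).features.eval()   # convolutional layers only

image = torch.randn(1, 3, 448, 448)                   # a rescaled image I (dummy tensor here)
with torch.no_grad():
    V = vgg(image)                                    # feature map of size 1 x 512 x 14 x 14
V = V.flatten(2).transpose(1, 2)                      # 1 x 196 x 512: vectors v_1 ... v_N, N = 14*14
```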
8. Question Modeling
Question Modeling
The question is represented at three levels:
word-level
phrase-level
sentence-level
Each word in the question is converted into a one-hot encoded vector, whose size is the vocabulary size, with binary (0 and 1) entries.
9. Question Modeling
Question Modeling
One-hot encoded vectors are again embedded into a vector space to get Q^w = {q^w_1, q^w_2, . . . , q^w_T}.
The embedded word vectors represent the word-level feature of the
question.
Corresponding to every word in the question, there is a vector that
represents that word.
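A small sketch of the word-level embedding, assuming PyTorch's nn.Embedding; the vocabulary size and embedding dimension below are placeholder values, not taken from the thesis.

```python
import torch
import torch.nn as nn

# Sketch of the word-level embedding; sizes are assumed values.
vocab_size, embed_dim = 10000, 512
embedding = nn.Embedding(vocab_size, embed_dim)

question = torch.tensor([[12, 7, 256, 3]])   # word indices, equivalent to one-hot vectors
Q_w = embedding(question)                    # word-level feature Q^w of shape 1 x T x embed_dim
```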
10. Question Modeling
Phrase-level Feature
To compute the phrase-level features, a 1-D CNN is applied on the word-level feature vectors with the help of three filters: unigram, bigram, and trigram.
A 1-D CNN works in the same way as a 2-D CNN.
To obtain the phrase-level features, max-pooling is applied across the three filter outputs at each word location, as shown in the equation:
q^p_t = max( q̂^p_{1,t}, q̂^p_{2,t}, q̂^p_{3,t} ),   t ∈ {1, 2, . . . , T}
These three filters capture semantic meaning by grouping words; the resulting features are known as phrase-level features.
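A hedged sketch of this step with unigram, bigram, and trigram 1-D convolutions followed by max-pooling over the three outputs; all sizes are assumed values.

```python
import torch
import torch.nn as nn

# Sketch of phrase-level features; sizes are assumed, not from the thesis.
embed_dim, T = 512, 10
unigram = nn.Conv1d(embed_dim, embed_dim, kernel_size=1)
bigram  = nn.Conv1d(embed_dim, embed_dim, kernel_size=2, padding=1)
trigram = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, padding=1)

Q_w = torch.randn(1, T, embed_dim)                   # word-level features
x = Q_w.transpose(1, 2)                              # Conv1d expects (batch, channels, T)
u = torch.tanh(unigram(x))                           # length T
b = torch.tanh(bigram(x))[:, :, :T]                  # trim so every output has length T
t = torch.tanh(trigram(x))                           # length T
Q_p = torch.max(torch.stack([u, b, t]), dim=0).values.transpose(1, 2)   # q^p_t, shape 1 x T x embed_dim
```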
12. Question Modeling
Question Modeling
The LSTM embeds the phrase-level features q^p_t into sentence-level features.
Corresponding to every word position t, the LSTM hidden state provides the sentence-level feature vector at that position.
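A brief sketch of the sentence-level encoding with an LSTM; the 512-dimensional sizes are assumptions.

```python
import torch
import torch.nn as nn

# Sketch of the sentence-level encoding; sizes are assumed values.
lstm = nn.LSTM(input_size=512, hidden_size=512, batch_first=True)

Q_p = torch.randn(1, 10, 512)     # phrase-level features q^p_t, T = 10
Q_s, _ = lstm(Q_p)                # hidden state at each position t gives the sentence-level feature
```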
13. Joint Representation
Bilinear Model
The bilinear model provides a rich joint representation of two distinct input features.
The bilinear model uses a quadratic expansion of a linear transformation, considering every pair of features.
C_i = Σ_{j=1}^{N} Σ_{k=1}^{M} w_{ijk} x_j y_k = X^T W_i Y
The joint embedding C captures the semantic concepts of both input features (X and Y).
The number of weight parameters required for a joint embedding vector of size L is L × (N × M).
The model consists of a third-order weight tensor, which limits its applicability to computationally complex tasks.
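A short sketch of the full bilinear form c_i = x^T W_i y, with assumed sizes, to make the parameter count concrete (the weight tensor alone holds L × N × M entries).

```python
import torch

# Sketch of the full bilinear interaction; sizes are assumed values.
N, M, L = 512, 512, 1000
W = torch.randn(L, N, M)                     # third-order weight tensor: L * N * M parameters
x, y = torch.randn(N), torch.randn(M)
c = torch.einsum('n,inm,m->i', x, W, y)      # joint embedding c of size L
```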
14. Joint Representation
Low-Rank Bilinear Model with Hadamard Product
The low-rank bilinear method reduces the rank of the weight matrix W_i, so that fewer parameters are needed, which serves as regularization.
C_i = X^T W_i Y = X^T U_i V_i^T Y = 1^T ( U_i^T X ∘ V_i^T Y ),
where 1 denotes a column vector of ones.
Two third-order tensors are still needed for a feature vector whose elements are {C_i}.
The order of the weight tensors is reduced by one by replacing 1 with P ∈ ℝ^{d×c}.
U ∈ ℝ^{N×d} and V ∈ ℝ^{M×d} are redefined to get the joint embedding feature vector C ∈ ℝ^c:
C = P^T ( U^T X ∘ V^T Y )
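A corresponding sketch of the low-rank form C = P^T (U^T x ∘ V^T y), again with assumed sizes; the Hadamard product of the two projections replaces the full third-order weight tensor.

```python
import torch

# Sketch of the low-rank bilinear model with Hadamard product; sizes are assumed.
N, M, d, c = 512, 512, 1200, 1000
U, V, P = torch.randn(N, d), torch.randn(M, d), torch.randn(d, c)

x, y = torch.randn(N), torch.randn(M)
C = P.t() @ ((U.t() @ x) * (V.t() @ y))      # Hadamard product replaces the full weight tensor
```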
15. Joint Representation
Low-Rank Bi-Linear Model with Hadamard Product
This imposes a restriction on the rank of W_i to be at most d ≤ min(N, M).
This mechanism factors the three-dimensional weight tensor of the bilinear model into three two-dimensional weight matrices.
This enforces the weight tensor to be low-rank.
16. Co-attention Mechanism
Co-attention Mechanism
The attention mechanism produces a spatial map highlighting image
regions relevant to answering the question.
The attention models [Huijuan, 2016][Jin-Hwa Kim, 2017] focus on the problem of identifying where to look, i.e., visual attention.
This model argues that the problem of identifying which words to listen to, i.e., question attention, is equally important.
17. Co-attention Mechanism
Visual Attention
The visual attention distribution helps to obtain the attended visual features.
α_v = softmax( P^T_{α_v} ( σ((W^v_q)^T Q) ∘ σ((W^v_v)^T V) ) )
where P_{α_v} ∈ ℝ^{d×N}, σ is the hyperbolic tangent function, W^v_q ∈ ℝ^{T×1}, W^v_v ∈ ℝ^{N×1}, and α_v ∈ ℝ^N.
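A hedged sketch of this attention computation; all dimensions below are assumptions, and the question feature is broadcast over the N spatial locations before the Hadamard product, which is one plausible reading of the equation above.

```python
import torch

# Sketch of visual attention over N spatial locations; sizes are assumed.
N, dv, dq, d = 196, 2048, 512, 1200
V = torch.randn(N, dv)                       # visual features at N spatial locations
q = torch.randn(dq)                          # question feature vector

Wv, Wq, P = torch.randn(dv, d), torch.randn(dq, d), torch.randn(d, 1)
h = torch.tanh(V @ Wv) * torch.tanh(q @ Wq)              # N x d, question broadcast over rows
alpha_v = torch.softmax((h @ P).squeeze(-1), dim=0)      # attention distribution over the N regions
```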
18. Co-attention Mechanism
Question Attention
The question attention distribution helps to obtain the attended question features.
α_q = softmax( P^T_{α_q} ( σ((W^q_v)^T V) ∘ σ((W^q_q)^T Q) ) )
where P_{α_q} ∈ ℝ^{d×T}, σ is the hyperbolic tangent function, W^q_q ∈ ℝ^{T×1}, W^q_v ∈ ℝ^{N×1}, and α_q ∈ ℝ^T.
19. Attended visual and question feature
Attended visual and question feature
The attended question feature is a linear combination of the question attention weights and the question feature vectors.
The attended visual feature is a linear combination of the visual attention weights and the visual spatial-region vectors.
V̂ = Σ_{n=1}^{N} α_{v,n} V_n ,    Q̂ = Σ_{t=1}^{T} α_{q,t} Q_t
This gives a fine-grained representation of the image and the question.
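A small sketch of these weighted sums, with placeholder attention distributions and assumed feature sizes.

```python
import torch

# Sketch of the attended features as attention-weighted sums; sizes are assumed.
N, T, dv, dq = 196, 10, 2048, 512
V, Q = torch.randn(N, dv), torch.randn(T, dq)
alpha_v = torch.softmax(torch.randn(N), dim=0)   # placeholder visual attention
alpha_q = torch.softmax(torch.randn(T), dim=0)   # placeholder question attention

V_hat = (alpha_v.unsqueeze(1) * V).sum(dim=0)    # attended visual feature, size dv
Q_hat = (alpha_q.unsqueeze(1) * Q).sum(dim=0)    # attended question feature, size dq
```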
20. Answer Prediction
Answer Prediction
The VQA task is treated as a classification task.
Answer prediction is based on the co-attended question and visual
features.
p(a | V, Q; Θ) = softmax( P^T ( σ(W_q^T Q̂) ∘ σ(W_v^T V̂) ) )
â = arg max_{a ∈ Ω} p(a | V, Q; Θ)
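A sketch of the answer-prediction step under assumed sizes: the attended features are fused once more with a Hadamard product and classified over the candidate answer set Ω.

```python
import torch

# Sketch of answer prediction; all sizes are assumed values.
dv, dq, d, num_answers = 2048, 512, 1200, 2000
V_hat, Q_hat = torch.randn(dv), torch.randn(dq)

Wv, Wq, P = torch.randn(dv, d), torch.randn(dq, d), torch.randn(d, num_answers)
logits = (torch.tanh(Q_hat @ Wq) * torch.tanh(V_hat @ Wv)) @ P
p = torch.softmax(logits, dim=0)             # p(a | V, Q; Theta)
a_hat = p.argmax()                           # predicted answer index in Omega
```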
21. Answer Prediction
Experimental Setup
The size of the joint embedding of the visual and question features is d, which is the same as the rank d in the low-rank bilinear model. The size of the set of candidate answers is Ω. The decay rate and the dropout rate are α and p.
The RMSProp optimizer has been used with a base learning rate of 4e−4, decay rate α = 0.90, and correction term ε = 1e−8. The batch size is set to 100.
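A minimal sketch of this optimizer setup with PyTorch's RMSprop, using the reported hyperparameters; the model below is only a placeholder.

```python
import torch

# Sketch of the optimizer setup; `model` is a stand-in for the full VQA network.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.RMSprop(model.parameters(), lr=4e-4, alpha=0.90, eps=1e-8)
batch_size = 100
```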
23. Answer Prediction
Dataset
The VQA v2.0 dataset is the largest dataset for the VQA task.
The VQA v2.0 dataset comprises 248,349 questions for training, 121,512 for validation, and 244,302 for testing.
On the basis of answer type, the questions are divided into three
categories:
yes/no (binary)
number (count of objects)
other (one-word or multi-word answers)
Each question has 10 human-annotated free-response answers.
24. Answer Prediction
Evaluation Metric
The accuracy of a predicted answer a is evaluated as follows:
Accuracy(a) = min( Count(a) / 3 , 1 )
where Count(a) is the number of human-annotated (Amazon Mechanical Turk) answers that match the predicted answer a.
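A small sketch of this metric as a plain Python function, with a made-up example.

```python
# Sketch of the VQA accuracy: an answer is fully correct when at least 3 of the
# 10 human-annotated answers match it.
def vqa_accuracy(predicted, human_answers):
    count = sum(1 for ans in human_answers if ans == predicted)
    return min(count / 3.0, 1.0)

# Example: 2 of 10 annotators answered "dog" -> accuracy 0.67
print(vqa_accuracy("dog", ["dog", "dog", "cat", "puppy", "cat", "cat", "cat", "cat", "cat", "cat"]))
```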
25. Results
Results
Table: Assessment of Architecture on the VQA dataset
MODEL                 ALL    YES/NO  NUMBER  OTHER
With W-Att, ResNet    62.23  82.28   39.06   42.13
With P-Att, ResNet    67.07  84.50   40.37   43.71
With S-Att, ResNet    65.20  80.50   37.62   43.20
With P-Att, VGG       63.79  82.73   37.92   53.46
26. Results
Results
Table: Result on VQA v2.0 dataset and comparison with other models
MODEL                      ALL    YES/NO  NUMBER  OTHER
SMem [Huijuan, 2016]       58.24  80.80   37.53   46.32
SAN                        58.85  79.11   36.41   46.42
QRU [R. Li, 2016]          60.72  82.29   37.02   47.67
HieCoAtt [J. Lu, 2016]     62.06  79.95   38.22   51.95
MCB [Akira Fukui, 2016]    65.40  82.30   37.20   57.40
MLB [Jin-Hwa Kim, 2017]    65.84  83.84   37.87   56.76
With P-Att, ResNet         67.07  84.50   40.37   43.71
27. Conclusion and Future Work
Conclusion
In this thesis work, a VQA model has been proposed using a co-attention mechanism with a low-rank bilinear model (LBM).
The LBM gives a richer joint representation for determining semantic objects and concepts.
The co-attention mechanism exploits the natural symmetry between the image and the question.
The experimental results show better performance than the state of the art [Jin-Hwa Kim, 2017].
28. Conclusion and Future Work
Future Work
A VQA model can be developed to handle images and questions that require spatial reasoning.
Multiple low-rank bilinear models (LBM) can be applied, which enhances the representativeness of the co-attention mechanism.
Attention at the word-embedding module may capture the informative and semantic concepts of the question words.
29. References
References
Huijuan Xu and Kate Saenko. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering. In ECCV, pages 451–466. Springer, 2016.
R. Li and J. Jia. Visual Question Answering with Question Representation Update (QRU). In NIPS, 2016, pp. 4655–4663.
J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical Question-Image Co-Attention for Visual Question Answering. In NIPS, 2016, pp. 289–297.
Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.
Jin-Hwa Kim, Kyoung-Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Hadamard Product for Low-Rank Bilinear Pooling. In ICLR, 2017.
M. Opper and O. Winther. A Bayesian Approach to Online Learning. In On-Line Learning in Neural Networks. Cambridge University Press, 1999.