https://imatge.upc.edu/web/publications/visual-question-answering-20
This bachelor's thesis explores different deep learning techniques to solve the Visual Question Answering (VQA) task, whose aim is to answer questions about images. We study different Convolutional Neural Networks (CNNs) to extract visual representations from images: Kernelized CNN (KCNN), VGG-16, and Residual Networks (ResNet). We also analyze the impact of using pre-computed word embeddings trained on large datasets (GloVe embeddings). Moreover, we examine different techniques for joining representations from different modalities. This work was submitted to the second edition of the Visual Question Answering Challenge and obtained 43.48% accuracy.
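The pipeline the abstract describes can be sketched end to end with toy numpy arrays. This is only an illustrative sketch: the dimensions, the random placeholder weights, and the choice of an element-wise product merge are assumptions for demonstration, not the thesis's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical precomputed visual features for one image
# (stand-in for the activations of a CNN such as VGG-16 or ResNet).
img_feat = rng.standard_normal(512)

# Toy question representation: average of its word embeddings
# (stand-in for GloVe vectors; 5 tokens, 512-d each, illustrative).
word_embs = rng.standard_normal((5, 512))
q_feat = word_embs.mean(axis=0)

# Join the two modalities; here an element-wise product, one of
# several merge strategies the thesis compares.
joint = img_feat * q_feat

# Classify over a fixed answer vocabulary: linear layer + softmax.
n_answers = 1000
W = rng.standard_normal((n_answers, 512)) * 0.01
logits = W @ joint
probs = np.exp(logits - logits.max())
probs /= probs.sum()
answer_id = int(probs.argmax())  # index of the predicted answer
```

Treating VQA as classification over a fixed answer vocabulary, rather than free-form generation, is the usual simplification for this benchmark.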
1. Visual Question Answering 2.0
Slides by Francisco Roldán
BSc Thesis, UPC
12th July, 2017
Author: Francisco Roldán Sánchez
Advisors: Xavier Giró-i-Nieto, Santiago Pascual de la
Puente, Issey Masuda Mora
6. Why VQA?
● Multidisciplinary task.
● Models need to tackle different sub-tasks at once:
  - Natural Language Processing
  - Knowledge Representation
  - Computer Vision
7. VQA Challenge
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., & Parikh, D. (2015). VQA: Visual question answering. ICCV.
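The challenge scores open-ended answers against answers collected from 10 human annotators: a predicted answer gets credit in proportion to how many annotators gave it, capped at 1. A minimal sketch of the commonly used form of this metric (the official evaluation additionally averages over annotator subsets):

```python
def vqa_accuracy(predicted, human_answers):
    """VQA accuracy: min(#humans who gave this answer / 3, 1)."""
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

# Example: 10 annotators, 2 of whom said "yes".
humans = ["yes"] * 2 + ["no"] * 8
vqa_accuracy("yes", humans)  # -> 2/3 (partial credit)
vqa_accuracy("no", humans)   # -> 1.0 (8 matches, capped at 1)
```

An answer thus only needs agreement from 3 of the 10 annotators to count as fully correct.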
16. VQA Dataset 2.0
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., & Parikh, D. (2016). Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. arXiv preprint arXiv:1612.00837.
17. VQA Dataset 2.0 Population
            Train      Validation  Test
Images      82,783     40,504      81,434
Questions   443,757    214,354     447,793
Answers     4,437,570  2,143,540   4,477,930
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., & Parikh, D. (2016). Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. arXiv preprint arXiv:1612.00837.
20. UPC 2016 Model
          Yes/No  Number  Other  Overall
Accuracy  66.05   29.77   20.35  40.25
Masuda, I., de la Puente, S. P., & Giro-i-Nieto, X. (2016). Open-Ended Visual Question-Answering. arXiv preprint arXiv:1610.02692.
27. ResNet based Model
Selection of the merge operand: Concat + FC, Concat, Product, Sum

Yes/No  Number  Other  Overall
64.66   29.49   21.27  40.08
65.51   29.99   22.02  40.84
65.90   30.00   22.80  41.37
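The candidate merge operands can be illustrated on two feature vectors. A sketch with illustrative dimensions; the fully connected weights are random placeholders standing in for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(42)
d = 512
img = rng.standard_normal(d)   # visual feature vector
qst = rng.standard_normal(d)   # question feature vector

# Sum and element-wise product require equal dimensions, keep size d.
merged_sum = img + qst                     # shape (512,)
merged_prod = img * qst                    # shape (512,)

# Concatenation doubles the dimensionality...
merged_cat = np.concatenate([img, qst])    # shape (1024,)

# ...so "Concat + FC" follows it with a fully connected layer that
# projects back down to d (random placeholder weights, tanh activation).
W_fc = rng.standard_normal((d, 2 * d)) * 0.01
merged_cat_fc = np.tanh(W_fc @ merged_cat) # shape (512,)
```

The trade-off: sum and product add no parameters but force both modalities into the same space, while concatenation preserves both representations at the cost of a larger (or extra) layer downstream.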
32. VGG based Model
Deciding between GloVe or learnable embeddings:

Embeddings  Yes/No  Number  Other  Overall
GloVe       66.59   31.01   25.83  43.21
Learnable   67.10   31.54   25.46  43.30
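The difference between the two options is whether the embedding matrix is updated during training: pretrained GloVe vectors can be kept frozen as a fixed lookup table, while learnable embeddings start from scratch and receive gradient updates. A minimal sketch with toy vectors in place of the real GloVe files:

```python
import numpy as np

# Toy "pretrained" embedding table: vocabulary of 4 words, 3-d vectors.
# With real GloVe, these rows would be loaded from the published files.
vocab = {"is": 0, "the": 1, "cat": 2, "black": 3}
glove = np.array([[0.1, 0.2, 0.3],
                  [0.0, 0.1, 0.0],
                  [0.5, 0.4, 0.1],
                  [0.2, 0.9, 0.3]])

def embed(question, table):
    """Look up each in-vocabulary token's vector."""
    ids = [vocab[w] for w in question.split() if w in vocab]
    return table[ids]

q = embed("is the cat black", glove)   # shape (4, 3)

# Frozen: the table is used as-is at every training step.
# Learnable: the same table would be a trainable parameter, updated as
# e.g. table -= lr * grad after each step (sketch only, no grad here).
```

Freezing keeps the semantic structure learned from GloVe's large corpora intact, which helps when the task's own training data is comparatively small.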
40. Conclusions
Summary:
- GloVe embeddings work better when kept frozen.
- Spatial information is needed for the VQA task.
- Models tend to learn language biases.
Goals achieved:
- Improved last year's model performance by 3.05%.
- Participated in the VQA 2.0 Challenge, obtaining 43.48% accuracy.
- Explored many different deep learning techniques.
- Built reusable, modular software.
43. Visual Reasoning
Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C. L., & Girshick, R. (2016). CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. arXiv preprint arXiv:1612.06890.