https://imatge.upc.edu/web/publications/visual-question-answering-20
This bachelor's thesis explores different deep learning techniques to solve the Visual Question Answering (VQA) task, whose aim is to answer questions about images. We study different Convolutional Neural Networks (CNNs) to extract visual representations from images: Kernelized CNN (KCNN), VGG-16, and Residual Networks (ResNet). We also analyze the impact of using pre-computed word embeddings trained on large datasets (GloVe embeddings). Moreover, we examine different techniques for joining representations from different modalities. This work was submitted to the second edition of the Visual Question Answering Challenge and obtained 43.48% accuracy.
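The pipeline the abstract describes can be sketched end to end with toy numpy arrays. This is only an illustrative sketch: the dimensions, the random placeholder weights, and the choice of an element-wise product merge are assumptions for demonstration, not the thesis's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical precomputed visual features for one image
# (stand-in for the activations of a CNN such as VGG-16 or ResNet).
img_feat = rng.standard_normal(512)

# Toy question representation: average of its word embeddings
# (stand-in for GloVe vectors; 5 tokens, 512-d each, illustrative).
word_embs = rng.standard_normal((5, 512))
q_feat = word_embs.mean(axis=0)

# Join the two modalities; here an element-wise product, one of
# several merge strategies the thesis compares.
joint = img_feat * q_feat

# Classify over a fixed answer vocabulary: linear layer + softmax.
n_answers = 1000
W = rng.standard_normal((n_answers, 512)) * 0.01
logits = W @ joint
probs = np.exp(logits - logits.max())
probs /= probs.sum()
answer_id = int(probs.argmax())  # index of the predicted answer
```

Treating VQA as classification over a fixed answer vocabulary, rather than free-form generation, is the usual simplification for this benchmark.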
1. Visual Question Answering 2.0
Slides by Francisco Roldán
BSc Thesis, UPC
12th July, 2017
Author: Francisco Roldán Sánchez
Advisors: Xavier Giró-i-Nieto, Santiago Pascual de la
Puente, Issey Masuda Mora
6. Why VQA?
● Multidisciplinary task.
● Models need to tackle different sub-tasks at once:
  - Natural Language Processing
  - Knowledge Representation
  - Computer Vision
7. VQA Challenge
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., & Parikh, D. (2015). VQA: Visual question answering. ICCV.
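The challenge scores open-ended answers against answers collected from 10 human annotators: a predicted answer gets credit in proportion to how many annotators gave it, capped at 1. A minimal sketch of the commonly used form of this metric (the official evaluation additionally averages over annotator subsets):

```python
def vqa_accuracy(predicted, human_answers):
    """VQA accuracy: min(#humans who gave this answer / 3, 1)."""
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

# Example: 10 annotators, 2 of whom said "yes".
humans = ["yes"] * 2 + ["no"] * 8
vqa_accuracy("yes", humans)  # -> 2/3 (partial credit)
vqa_accuracy("no", humans)   # -> 1.0 (8 matches, capped at 1)
```

An answer thus only needs agreement from 3 of the 10 annotators to count as fully correct.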
16. VQA Dataset 2.0
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., & Parikh, D. (2016). Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. arXiv preprint arXiv:1612.00837.
17. VQA Dataset 2.0 Population
            Train      Validation  Test
Images      82,783     40,504      81,434
Questions   443,757    214,354     447,793
Answers     4,437,570  2,143,540   4,477,930
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., & Parikh, D. (2016). Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. arXiv preprint arXiv:1612.00837.
20. UPC 2016 Model
          Yes/No  Number  Other  Overall
Accuracy  66.05   29.77   20.35  40.25
Masuda, I., de la Puente, S. P., & Giro-i-Nieto, X. (2016). Open-Ended Visual Question-Answering. arXiv preprint arXiv:1610.02692.
27. ResNet based Model
Selection of the merge operand: Concat + FC, Concat, Product, Sum

Yes/No  Number  Other  Overall
64.66   29.49   21.27  40.08
65.51   29.99   22.02  40.84
65.90   30.00   22.80  41.37
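The candidate merge operands can be illustrated on two feature vectors. A sketch with illustrative dimensions; the fully connected weights are random placeholders standing in for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(42)
d = 512
img = rng.standard_normal(d)   # visual feature vector
qst = rng.standard_normal(d)   # question feature vector

# Sum and element-wise product require equal dimensions, keep size d.
merged_sum = img + qst                     # shape (512,)
merged_prod = img * qst                    # shape (512,)

# Concatenation doubles the dimensionality...
merged_cat = np.concatenate([img, qst])    # shape (1024,)

# ...so "Concat + FC" follows it with a fully connected layer that
# projects back down to d (random placeholder weights, tanh activation).
W_fc = rng.standard_normal((d, 2 * d)) * 0.01
merged_cat_fc = np.tanh(W_fc @ merged_cat) # shape (512,)
```

The trade-off: sum and product add no parameters but force both modalities into the same space, while concatenation preserves both representations at the cost of a larger (or extra) layer downstream.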
32. VGG based Model
Deciding between GloVe or learnable embeddings:

Embeddings  Yes/No  Number  Other  Overall
GloVe       66.59   31.01   25.83  43.21
Learnable   67.10   31.54   25.46  43.30
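The difference between the two options is whether the embedding matrix is updated during training: pretrained GloVe vectors can be kept frozen as a fixed lookup table, while learnable embeddings start from scratch and receive gradient updates. A minimal sketch with toy vectors in place of the real GloVe files:

```python
import numpy as np

# Toy "pretrained" embedding table: vocabulary of 4 words, 3-d vectors.
# With real GloVe, these rows would be loaded from the published files.
vocab = {"is": 0, "the": 1, "cat": 2, "black": 3}
glove = np.array([[0.1, 0.2, 0.3],
                  [0.0, 0.1, 0.0],
                  [0.5, 0.4, 0.1],
                  [0.2, 0.9, 0.3]])

def embed(question, table):
    """Look up each in-vocabulary token's vector."""
    ids = [vocab[w] for w in question.split() if w in vocab]
    return table[ids]

q = embed("is the cat black", glove)   # shape (4, 3)

# Frozen: the table is used as-is at every training step.
# Learnable: the same table would be a trainable parameter, updated as
# e.g. table -= lr * grad after each step (sketch only, no grad here).
```

Freezing keeps the semantic structure learned from GloVe's large corpora intact, which helps when the task's own training data is comparatively small.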
40. Conclusions
Summary:
- GloVe embeddings work better when kept frozen.
- Spatial information is needed for the VQA task.
- Models tend to learn language biases.
Goals achieved:
- Improved last year's model performance by 3.05%.
- Participated in the VQA 2.0 Challenge, obtaining 43.48% accuracy.
- Explored many different deep learning techniques.
- Built reusable, modular software.
43. Visual Reasoning
Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C. L., & Girshick, R. (2016). CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. arXiv preprint arXiv:1612.06890.