http://ixa2.si.ehu.es/deep_learning_seminar/
Deep neural networks have boosted the convergence of multimedia data analytics into a unified framework shared by practitioners in natural language and vision. Image captioning, visual question answering and multimodal translation are among the first applications of a new and exciting field that exploits the generalization properties of deep neural representations. This talk provides an overview of how vision and language problems are addressed with deep neural networks, and of the exciting challenges the research community is addressing nowadays.
One Perceptron to Rule Them All: Language and Vision
1. One Perceptron to Rule Them All:
Language and Vision
Xavier Giro-i-Nieto
xavier.giro@upc.edu
Associate Professor
Intelligent Data Science and Artificial
Intelligence Center (IDEAI)
Universitat Politècnica de Catalunya (UPC)
Barcelona Supercomputing Center (BSC)
Deep Learning
for Natural
Language
Processing
San Sebastian
5 July 2019
bit.ly/ixa-dlnlp-2019
xavier.giro@upc.edu
@DocXavi
3. 3
● 11 faculty members
● 12 PhD students
Research Group & Centers
https://imatge.upc.edu/
https://www.bsc.es/
● National computation center #1
● Supercomputer MareNostrum
● Emerging Technologies for
Artificial Intelligence Group,
directed by Prof. Jordi Torres.
https://ideai.upc.edu/
● Center founded in 2017
● 60 researchers
IDEAI (Intelligent Data Science and
Artificial Intelligence)
24. 24
Pooling Layer
Figure Credit: Ranzato
Pooling is a downsampling operation
along the spatial dimensions (width,
height).
● It progressively reduces the
spatial size of the
representation, which greatly
reduces computation.
● It provides invariance to small
local changes.
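As a toy illustration (not from the slides), max pooling over 2×2 windows with stride 2 can be sketched in NumPy; the feature-map values below are made up:

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Downsample a 2-D feature map by taking the max over each window."""
    h, w = x.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

fmap = np.array([[1., 3., 2., 0.],
                 [4., 6., 1., 1.],
                 [0., 2., 9., 8.],
                 [1., 1., 7., 5.]])
pooled = max_pool2d(fmap)  # 4x4 -> 2x2: the spatial size is halved
```

Moving the 6 anywhere inside its 2×2 window leaves `pooled` unchanged, which is the invariance to small local changes mentioned above.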
25. 25
Pooling Layer (criticism)
"The pooling operation
used in CNNs is a big
mistake and the fact that it
works so well is a disaster."
Geoffrey Hinton,
AMA reddit (2015).
Learn more:
Richard Zhang, "Making Convolutional Networks Shift-Invariant Again" (ICML 2019)
26. 26
Convolutional Neural Networks for Vision
LeNet-5: Several convolutional layers, combined with pooling layers, and followed by a
small number of fully connected layers
#LeNet-5 LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11), 2278-2324.
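To make the conv/pool stacking concrete, here is a small pure-Python sketch of how the 32×32 LeNet-5 input shrinks through two conv + pool stages before reaching the fully connected layers:

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Output spatial size of a convolution (or pooling) layer."""
    return (size + 2 * pad - kernel) // stride + 1

s = 32                  # LeNet-5 input: 32x32 grayscale digit
s = conv_out(s, 5)      # C1: 5x5 conv          -> 28x28
s = conv_out(s, 2, 2)   # S2: 2x2 pool, stride 2 -> 14x14
s = conv_out(s, 5)      # C3: 5x5 conv          -> 10x10
s = conv_out(s, 2, 2)   # S4: 2x2 pool, stride 2 -> 5x5
flat = 16 * s * s       # 16 maps of 5x5 = 400 inputs to the FC layers
```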
27. 27
ImageNet Challenge
● 1,000 object classes
(categories).
● Images:
● 1.2 M train
● 100k test.
Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. "Imagenet: A large-scale hierarchical image
database." CVPR 2009.
28. 28
ImageNet Challenge: 2012
Slide credit:
Rob Fergus (NYU)
AlexNet cut the top-5 error by 9.8 points with respect to the runner-up, which was based on SIFT + Fisher Vectors.
Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang et al. "Imagenet
large scale visual recognition challenge." International Journal of Computer Vision 115, no. 3 (2015): 211-252. [web]
29. 29
Image Encoding
A Krizhevsky, I Sutskever, GE Hinton, "Imagenet classification with deep convolutional neural networks", NIPS 2012
Cat
CNN FC
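The CNN → FC pipeline on this slide ends in a linear classifier over the 1,000 ImageNet classes. A minimal NumPy sketch with random, untrained (hypothetical) weights:

```python
import numpy as np

rng = np.random.default_rng(0)
feat = rng.standard_normal(4096)               # CNN image embedding (AlexNet fc7 size)
W = rng.standard_normal((1000, 4096)) * 0.01   # untrained classifier weights
b = np.zeros(1000)

logits = W @ feat + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()                           # softmax over 1,000 classes
pred = int(probs.argmax())                     # e.g. the "Cat" class after training
```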
31. 31
Video Encoding
Slide: VĂctor Campos (UPC 2018)
CNN CNN CNN...
Combination method
Combination is commonly
implemented as a small NN on
top of a pooling operation
(e.g. max, sum, average).
Drawback: pooling is not
aware of the temporal order!
Ng et al., Beyond short snippets: Deep networks for video classification, CVPR 2015
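The drawback stated on this slide — pooling is not aware of temporal order — is easy to see in a NumPy sketch with made-up per-frame features:

```python
import numpy as np

rng = np.random.default_rng(1)
frames = rng.standard_normal((8, 512))   # 8 frames x 512-d CNN features

clip_max = frames.max(axis=0)            # max pooling over time
clip_avg = frames.mean(axis=0)           # average pooling over time

# Reversing (or shuffling) the frames yields identical clip features:
reversed_frames = frames[::-1]
```

`reversed_frames.max(axis=0)` equals `clip_max`, so a video played backwards encodes to exactly the same vector.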
32. 32
Video Encoding
Slide: VĂctor Campos (UPC 2018)
Recurrent Neural Networks are
well suited for processing
sequences.
Drawback: RNNs are sequential
and cannot be parallelized.
Donahue et al., Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015
CNN CNN CNN...
RNN RNN RNN...
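In contrast, a recurrent encoder is order-aware but strictly sequential. A minimal vanilla-RNN sketch with random (hypothetical) weights:

```python
import numpy as np

def rnn_encode(frames, Wx, Wh):
    """Fold a sequence of frame features into one hidden state."""
    h = np.zeros(Wh.shape[0])
    for x in frames:                 # sequential: step t needs step t-1
        h = np.tanh(Wx @ x + Wh @ h)
    return h

rng = np.random.default_rng(2)
frames = rng.standard_normal((8, 64))     # 8 frames x 64-d features
Wx = rng.standard_normal((32, 64)) * 0.1
Wh = rng.standard_normal((32, 32)) * 0.1

h_fwd = rnn_encode(frames, Wx, Wh)
h_rev = rnn_encode(frames[::-1], Wx, Wh)  # unlike pooling, order changes the code
```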
41. 41
#ShowAndTell Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption
generator." CVPR 2015.
Image Captioning
43. 43
Captioning: Show, Attend & Tell
Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua
Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015
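The visual attention in Show, Attend & Tell re-weights a grid of CNN features at every decoding step. A hedged NumPy sketch of soft attention only (the grid size follows the paper's 14×14×512 features; weights and query are random):

```python
import numpy as np

def soft_attention(features, query):
    """Weight L spatial features by their dot-product score with the query."""
    scores = features @ query              # (L,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                   # attention weights sum to 1
    context = alpha @ features             # weighted sum -> one context vector
    return context, alpha

rng = np.random.default_rng(3)
feats = rng.standard_normal((196, 512))    # 14x14 grid of 512-d CNN features
query = rng.standard_normal(512)           # decoder hidden state (hypothetical)
context, alpha = soft_attention(feats, query)
```

Visualizing `alpha` reshaped to 14×14 gives the attention maps shown on these slides.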
44. 44
Captioning: Show, Attend & Tell
Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua
Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015
45. 45
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense
captioning." CVPR 2016
Dense Captioning
46. 46
XAVI: "man has
short hair", "man
with short hair"
AMAIA: "a woman
wearing a black
shirt"
BOTH: "two men
wearing black
glasses"
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense
captioning." CVPR 2016
Dense Captioning
47. Image Captioning for News
Ali Furkan Biten, Lluis Gomez, Marçal Rusiñol, Dimosthenis Karatzas, "Good News, Everyone! Context driven entity-aware
captioning for news images" CVPR 2019.
48. 48
Filtering Social Bias in Neural Models
#Equalizer Burns, Kaylee, Lisa Anne Hendricks, Trevor Darrell, and Anna Rohrbach. "Women also Snowboard: Overcoming
Bias in Captioning Models." ECCV 2018.
49. 49
Captioning: Dataset biases
#Equalizer Burns, Kaylee, Lisa Anne Hendricks, Trevor Darrell, and Anna Rohrbach. "Women also Snowboard: Overcoming
Bias in Captioning Models." ECCV 2018.
50. 50
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor
Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015. code
Captioning: Video
51. 51
(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. Hierarchical Recurrent Neural
Encoder for Video Representation with Application to Captioning, CVPR 2016.
(Diagram: a second-layer LSTM unit reads, over the image sequence from t = 1 to t = T, the hidden state that each first-layer chunk of data produces at t = T.)
Captioning: Video
53. 53
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level Lipreading."
(2016).
54. 54
Lip Reading
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level
Lipreading." (2016).
55. 55
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild."
CVPR 2017
56. 56
Lipreading: Watch, Listen, Attend & Spell
Audio
features
Image
features
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
57. 57
Lipreading: Watch, Listen, Attend & Spell
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
Attention over output
states from audio and
video is computed at
each timestep
59. 59
Grounded Captioning from Objects
Lu, Jiasen and Yang, Jianwei and Batra, Dhruv and Parikh, Devi, "Neural Baby Talk" CVPR 2018 [code]
60. Lu, Jiasen and Yang, Jianwei and Batra, Dhruv and Parikh, Devi, "Neural Baby Talk" CVPR 2018 [code]
Grounded Captioning from Objects
61. Akbari, Hassan, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, and Shih-Fu Chang. "Multi-level Multimodal
Common Semantic Space for Image-Phrase Grounding." CVPR 2019. [code]
Weak grounding w/o supervision
62. Akbari, Hassan, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, and Shih-Fu Chang. "Multi-level Multimodal
Common Semantic Space for Image-Phrase Grounding." CVPR 2019. [code]
Grounding with weak supervision
63. 63
Cornia, Marcella, Lorenzo Baraldi, and Rita Cucchiara. "Show, Control and Tell: A Framework for Generating Controllable and
Grounded Captions." CVPR 2019. [code]
Controlled Grounding
66. 66
Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial
text to image synthesis." ICML 2016.
Image Generation
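In this family of models the generator is conditioned on the sentence: a text embedding is concatenated to the noise vector, and the discriminator judges (image, text) pairs. A tiny NumPy sketch of that conditioning (all sizes are illustrative, not the paper's exact ones):

```python
import numpy as np

rng = np.random.default_rng(7)
text_emb = rng.standard_normal(128)   # sentence embedding of the caption
z = rng.standard_normal(100)          # noise vector sampled per image

gen_input = np.concatenate([z, text_emb])  # generator input: noise + text
# The discriminator also receives the text, so a generated image must
# both look realistic and match its caption.
```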
67. 67
Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial
text to image synthesis." ICML 2016. [code]
Image Synthesis
68. 68
Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial
text to image synthesis." ICML 2016. [code]
Image Generation
69. 69
#StackGAN Zhang, Han, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas.
"Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks." ICCV 2017. [code]
Image Synthesis
70. 70
#StackGAN Zhang, Han, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas.
"Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks." ICCV 2017. [code]
Image Synthesis
71. Justin Johnson, Agrim Gupta, Li Fei-Fei, "Image Generation from Scene Graphs" CVPR 2018
Image Generation via Scene Graphs
72. Justin Johnson, Agrim Gupta, Li Fei-Fei, "Image Generation from Scene Graphs" CVPR 2018
Image Synthesis via Scene Graphs
73. 73
#Text2Scene Tan, Fuwen, Song Feng, and Vicente Ordonez. "Text2Scene: Generating Compositional Scenes From Textual
Descriptions." CVPR 2019 [blog].
Image Generation by Composition
74. 74
#Text2Scene Tan, Fuwen, Song Feng, and Vicente Ordonez. "Text2Scene: Generating Compositional Scenes From Textual
Descriptions." CVPR 2019 [blog].
75. 75
#Text2Scene Tan, Fuwen, Song Feng, and Vicente Ordonez. "Text2Scene: Generating Compositional Scenes From Textual
Descriptions." CVPR 2019 [blog].
76. 76
#CRAFT Gupta, Tanmay, Dustin Schwenk, Ali Farhadi, Derek Hoiem, and Aniruddha Kembhavi. "Imagine this! scripts to
compositions to videos." ECCV 2018
77. 77
#CRAFT Gupta, Tanmay, Dustin Schwenk, Ali Farhadi, Derek Hoiem, and Aniruddha Kembhavi. "Imagine this! scripts to
compositions to videos." ECCV 2018
Video Generation by Composition
80. 80
#Mattnet Yu, Licheng, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L. Berg. "Mattnet: Modular
attention network for referring expression comprehension." CVPR 2018. [code]
Object from Referring Expressions
81. 81
Khoreva, Anna, Anna Rohrbach, and Bernt Schiele. "Video object segmentation with language referring expressions." ACCV
2018.
Video Object Grounding
86. 86
Visual Question Answering (VQA)
Masuda, Issey, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Open-Ended Visual
Question-Answering." ETSETB UPC TelecomBCN (2016).
Image
Question
Answer
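A common VQA baseline (a sketch, not the cited model): encode the image with a CNN and the question with an RNN, fuse the two embeddings, and classify over a fixed set of frequent answers. Sizes and weights below are made up:

```python
import numpy as np

rng = np.random.default_rng(4)
img_emb = rng.standard_normal(1024)   # CNN image embedding
qst_emb = rng.standard_normal(1024)   # RNN question embedding

fused = img_emb * qst_emb             # element-wise product fusion
W = rng.standard_normal((3000, 1024)) * 0.01  # 3,000 candidate answers
logits = W @ fused
answer_id = int(logits.argmax())      # index of the predicted answer
```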
87. 87
Visual Question Answering (VQA)
Francisco RoldĂĄn, Issey Masuda, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Visual
Question-Answering 2.0." ETSETB UPC TelecomBCN (2017).
88. 88
Noh, H., Seo, P. H., & Han, B. Image question answering using convolutional neural network with
dynamic parameter prediction. CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
Visual Question Answering (VQA)
89. 89
VQA: Dynamic Memory Networks
(Slides and Slidecast by Santi Pascual): Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for
Visual and Textual Question Answering." ICML 2016
90. 90
Grounded VQA
(Slides and Screencast by Issey Masuda): Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei."Visual7W: Grounded
Question Answering in Images." CVPR 2016.
91. 91
Visual Reasoning
#Clevr Johnson, Justin, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick.
"CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning." CVPR 2017
92. 92
Visual Reasoning
(Slides by Fran Roldan) Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Fei-Fei Li, Larry
Zitnick, Ross Girshick, "Inferring and Executing Programs for Visual Reasoning". ICCV 2017
Program Generator Execution Engine
95. 95
Hate Speech Detection in Memes
Benet Oriol, Cristian Canton, Xavier Giro-i-Nieto, "Hate Speech Detection in Memes". UPC TelecomBCN
2019.
Hate Speech Detection
96. 96
Visual Reasoning: Relation Networks
Santoro, Adam, David Raposo, David G. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy
Lillicrap. "A simple neural network module for relational reasoning." NIPS 2017.
Relation Networks concatenate all possible pairs of objects with an encoded question, and then find the
answer with an MLP.
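That pairing scheme can be sketched directly; here g and f are stand-in one-layer MLPs with random weights (the paper uses deeper ones):

```python
import numpy as np

def relation_network(objects, question, g, f):
    """Sum g over every ordered object pair (with the question), then apply f."""
    pair_sum = sum(g(np.concatenate([oi, oj, question]))
                   for oi in objects for oj in objects)
    return f(pair_sum)

rng = np.random.default_rng(5)
objects = rng.standard_normal((4, 8))     # 4 objects, 8-d each
question = rng.standard_normal(8)         # encoded question, 8-d
Wg = rng.standard_normal((16, 24)) * 0.1  # 24 = 8 + 8 + 8 concatenated
Wf = rng.standard_normal((10, 16)) * 0.1
g = lambda pair: np.maximum(Wg @ pair, 0.0)   # ReLU layer as a stand-in for g
f = lambda s: Wf @ s                          # answer logits

answer_logits = relation_network(objects, question, g, f)
```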
100. 100
Joint Representations (Embeddings)
Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "Devise: A deep
visual-semantic embedding model." NIPS 2013
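Joint embeddings like DeViSE are typically trained with a ranking objective that pulls matching image/text pairs together and pushes mismatched pairs apart. A minimal triplet-loss sketch with toy 2-d embeddings (not the paper's exact hinge formulation):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on the gap between matched and mismatched pair distances."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

img = np.array([1.0, 0.0])            # image embedding (toy)
txt_match = np.array([0.9, 0.1])      # its caption, already nearby
txt_other = np.array([0.0, 1.0])      # a caption of a different image

loss_good = triplet_loss(img, txt_match, txt_other)   # margin satisfied -> 0
loss_bad = triplet_loss(img, txt_other, txt_match)    # violated -> positive
```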
101. 101
Zero-shot learning
Socher, R., Ganjoo, M., Manning, C. D., & Ng, A., Zero-shot learning through cross-modal transfer. NIPS 2013 [slides] [code]
No images from "cat" in
the training set...
...but they can still be
recognised as "cats"
thanks to the
representations learned
from text.
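The zero-shot mechanism above amounts to nearest-neighbour search in the word-embedding space. A toy NumPy sketch, where the trained visual model is faked by adding small noise to the "cat" word vector (all embeddings here are random, hypothetical ones):

```python
import numpy as np

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(6)
word_vecs = {"cat": rng.standard_normal(300),    # stand-in word embeddings
             "dog": rng.standard_normal(300),
             "truck": rng.standard_normal(300)}

# Stand-in for a visual model that maps images near their label's word vector:
image_emb = word_vecs["cat"] + 0.1 * rng.standard_normal(300)

# No training images of cats are needed: classify by nearest word vector.
pred = max(word_vecs, key=lambda w: cosine(image_emb, word_vecs[w]))
```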
102. 102
Multimodal Retrieval
Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal Scene Networks."
CVPR 2016.
103. 103
Multimodal Retrieval
Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal Scene Networks."
CVPR 2016.
104. 104
Image and text retrieval with joint embeddings.
Joint Neural Embeddings
#pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio
Torralba, "Learning Cross-modal Embeddings for Cooking Recipes and Food Images". CVPR 2017
105. 105
#pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio
Torralba, "Learning Cross-modal Embeddings for Cooking Recipes and Food Images". CVPR 2017
Joint Neural Embeddings
106. 106
Joint Neural Embeddings
joint
embedding
LSTM Bidirectional LSTM
#pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio
Torralba, "Learning Cross-modal Embeddings for Cooking Recipes and Food Images". CVPR 2017
107. 107
Joint Neural Embeddings
● Constrained to database recipes
● Ingredients and Instructions are retrieved as a whole
● Prohibits user manipulation (ingredient replacements)
#pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio
Torralba, "Learning Cross-modal Embeddings for Cooking Recipes and Food Images". CVPR 2017
109. 109
Recipe Generation (not retrieval!)
Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe
Generation from Food Images." CVPR 2019.
110. 110
Recipe Generation (not retrieval!)
Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe
Generation from Food Images." CVPR 2019.
Title: Edamame corn salad
Ingredients
pepper, corn, onion, edamame, salt, vinegar, cilantro, avocado, oil
Instructions
- In a large bowl, combine edamame, corn, red onion, cilantro,
avocado, and red bell pepper.
- In a small bowl, whisk together olive oil, vinegar, salt, and
pepper.
- Pour dressing over edamame mixture and toss to coat.
- Cover and refrigerate for at least 1 hour before serving.
111. 111
Recipe Generation (not retrieval!)
Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe
Generation from Food Images." CVPR 2019.
According to human judgment, our proposed system is able to generate better recipes than the previous
retrieval method.
112. 112
Recipe Generation (data as the DL ingredient!)
Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe
Generation from Food Images." CVPR 2019.
Title: Spaghetti with spicy tomato sauce
Ingredients:
onion, tomato, chili, salt, noodles, pepper, spaghetti, clove, cumin, water
Instructions:
-In a large pot, combine the tomatoes, onion, garlic, chili powder, cumin, salt,
pepper, water and tomato sauce.
-Bring to a boil, then reduce heat and simmer for about 20 minutes.
-Meanwhile, cook the spaghetti according to package directions.
-Drain and set aside.
-When the spaghetti is done, drain and return to pot.
-Add the sauce and stir to combine.
-Serve with the shredded cheese and a dollop of sour cream.
122. Rowan Zellers, Yonatan Bisk, Ali Farhadi, Yejin Choi. From Recognition to Cognition: Visual Commonsense
Reasoning. CVPR 2019 (oral)
https://visualcommonsense.com/
123. 123
Ma, Chih-Yao, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, and Caiming Xiong.
"Self-Monitoring Navigation Agent via Auxiliary Progress Estimation." ICLR 2019. [code]
124. 124
Visual Question Answering
Gurari, Danna, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. "VizWiz
Grand Challenge: Answering Visual Questions from Blind People." arXiv preprint arXiv:1802.08218 (2018).
125. 125
Reasoning: MAC
Hudson, Drew A., and Christopher D. Manning. "Compositional attention networks for machine reasoning."
ICLR 2018.
126. 126
Navigation with Language and Vision
Fried, Daniel, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate
Saenko, Dan Klein, and Trevor Darrell. "Speaker-Follower Models for Vision-and-Language Navigation." arXiv preprint
arXiv:1806.02724 (2018).
127. 127
Translation
Harwath, David, Galen Chuang, and James Glass. "Vision as an Interlingua: Learning Multilingual Semantic
Embeddings of Untranscribed Speech." arXiv preprint arXiv:1804.03052 (2018).