One Perceptron to Rule Them All:
Language and Vision
Xavier Giro-i-Nieto
xavier.giro@upc.edu
Associate Professor
Intelligent Data Science and Artificial
Intelligence Center (IDEAI)
Universitat Politècnica de Catalunya (UPC)
Barcelona Supercomputing Center (BSC)
Deep Learning
for Natural
Language
Processing
San Sebastian
5 July 2019
bit.ly/ixa-dlnlp-2019
xavier.giro@upc.edu
@DocXavi
2
Xavier Giro-i-Nieto
Associate Professor at Universitat Politècnica de Catalunya (UPC)
Kaixo
IDEAI Center for
Intelligent Data Science
& Artificial Intelligence
3
● 11 faculty members
● 12 PhD students
Research Group & Centers
https://imatge.upc.edu/
https://www.bsc.es/
● National computation center #1
● Supercomputer MareNostrum
● Emerging Technologies for
Artificial Intelligence Group,
directed by Prof. Jordi Torres.
https://ideai.upc.edu/
● Center founded in 2017
● 60 researchers
IDEAI (Intelligent Data Science and
Artificial Intelligence)
4
Acknowledgments
Mariona
Carós
Janna
Escur
Benet
Oriol
Amaia
Salvador
Santiago
Pascual
Marta R.
Costa-jussà
Francisco
Roldan
Issey
Masuda
Ionut
Sorodoc
Carina
Silberer
Gemma
Boleda
Antonio
Bonafonte
José A. R.
Fonollosa
IDEAI Center for
Intelligent Data Science
& Artificial Intelligence
5
6
[course site]
bit.ly/mmm-docxavi
@DocXavi
7
Densely linked slides
8
Outline
1. Encoder-Decoder Architectures
2. Image and Video Encoding
3. Image Captioning & Grounding
4. Image Generation
5. Visual Question Answering / Reasoning
6. Joint Embeddings (+ recipe generation)
Text
Audio
9
Speech
Vision
Text
Audio
10
Speech
Vision
Text
Audio
11
Speech
Vision
12
13
Encoder
0
1
0
Cat
A Krizhevsky, I Sutskever, GE Hinton “Imagenet classification with deep convolutional neural networks” NIPS 2012
14
Slide concept: Perronnin, F., Tutorial on LSVR @ CVPR’14, Output embedding for LSVR
One-hot Representation
[1,0,0]
[0,1,0]
[0,0,1]
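A minimal sketch of the one-hot representation listed above, in plain Python. The three-class label set and its ordering are assumptions made for illustration; they are chosen so that "cat" maps to [0,1,0], as in the encoder slide.

```python
def one_hot(index, num_classes):
    """Return a list with a single 1 at `index` and 0 everywhere else."""
    if not 0 <= index < num_classes:
        raise ValueError("index out of range")
    return [1 if i == index else 0 for i in range(num_classes)]


# Hypothetical label set; the ordering is an assumption, not taken from the slides.
classes = ["dog", "cat", "bird"]
cat_vector = one_hot(classes.index("cat"), len(classes))
print(cat_vector)
```

Every pair of one-hot vectors is equally distant, so this output encoding carries no notion of similarity between classes.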
15
Encoder
Representation
16
Encoder
Representation
17
Decoder
Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative
adversarial networks." ICLR 2016. #DCGAN
0
1
0
Cat
Fig: Xudong Mao #DCGAN
18
Encoder Decoder
Representation
19
Encoder Decoder
Representation
20
Outline
1. Encoder-Decoder Architectures
2. Image and Video Encoding
3. Image Captioning & Grounding
4. Image Generation
5. Visual Question Answering / Reasoning
6. Joint Embeddings (+ recipe generation)
21
Encoder
Representation
22
Perceptron
Weights and bias are the parameters that define the behavior of the perceptron. They must be
learned during training.
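As a rough illustration of what those parameters do, the forward pass of a single perceptron can be sketched as a weighted sum plus a bias followed by a non-linearity. The sigmoid activation and the example values are assumptions; the slide does not specify them.

```python
import math


def perceptron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, squashed by a sigmoid (assumed activation)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameter values; in practice they are learned during training.
weights = [0.5, -0.25]
bias = 0.0
output = perceptron([1.0, 2.0], weights, bias)
print(output)
```

Changing `weights` or `bias` changes the output for the same input, which is why training amounts to searching for good values of these parameters.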
23
Convolutional Layers for Vision
Fully Connected layer (FC) Convolutional layer (Conv)
24
Pooling Layer
Figure Credit: Ranzato
Pooling is a downsample operation
along the spatial dimensions (width,
height)
● It progressively reduces the
spatial size of the
representation, which greatly
reduces computation.
● Provides invariance to small
local changes
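Both properties can be seen in a small sketch of 2x2 max pooling with stride 2 on a single-channel feature map. The window size, the stride and the toy values are assumptions for illustration.

```python
def max_pool_2x2(feature_map):
    """Max pooling with a 2x2 window and stride 2 over a 2D list of numbers."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - 1, 2):
        pooled_row = []
        for c in range(0, cols - 1, 2):
            window = [
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            ]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
# Same map, but the strongest activation of the top-left window moved by one pixel.
shifted_map = [
    [4, 3, 2, 0],
    [1, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(feature_map)
pooled_shifted = max_pool_2x2(shifted_map)
print(pooled, pooled_shifted)
```

The 4x4 map shrinks to 2x2, and the small shift inside one window leaves the pooled output unchanged.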
25
Pooling Layer (criticism)
"The pooling operation
used in CNNs is a big
mistake and the fact that it
works so well is a disaster."
Geoffrey Hinton,
AMA reddit (2015).
Learn more:
Richard Zhang, “Making Convolutional Networks Shift-Invariant Again” (ICML 2019)
26
Convolutional Neural Networks for Vision
LeNet-5: Several convolutional layers, combined with pooling layers, and followed by a
small number of fully connected layers
#LeNet-5 LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11), 2278-2324.
27
ImageNet Challenge
● 1,000 object classes
(categories).
● Images:
○ 1.2 M train
○ 100k test.
Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. "Imagenet: A large-scale hierarchical image
database." CVPR 2009.
28
ImageNet Challenge: 2012
Slide credit:
Rob Fergus (NYU)
-9.8%
Based on SIFT + Fisher Vectors
Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang et al. "Imagenet
large scale visual recognition challenge." International Journal of Computer Vision 115, no. 3 (2015): 211-252. [web]
29
Image Encoding
A Krizhevsky, I Sutskever, GE Hinton “Imagenet classification with deep convolutional neural networks” NIPS 2012
Cat
CNN FC
30
Encoder
Representation
31
Video Encoding
Slide: Víctor Campos (UPC 2018)
CNN CNN CNN...
Combination method
Combination is commonly
implemented as a small NN on
top of a pooling operation
(e.g. max, sum, average).
Drawback: pooling is not
aware of the temporal order!
Ng et al., Beyond short snippets: Deep networks for video classification, CVPR 2015
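The drawback above can be checked with a short sketch: average pooling over per-frame features gives the same result for a clip and for its time-reversed version. The two-dimensional frame features are made-up values standing in for CNN outputs.

```python
def average_pool(frame_features):
    """Average per-frame feature vectors over time into one clip-level vector."""
    num_frames = len(frame_features)
    return [sum(values) / num_frames for values in zip(*frame_features)]


# Hypothetical CNN features for three frames (one list per frame).
frames = [[1.0, 0.0], [0.0, 2.0], [2.0, 4.0]]
forward = average_pool(frames)
backward = average_pool(list(reversed(frames)))
print(forward, backward)
```

Because both orderings produce the same vector, any network placed on top of this pooling step cannot tell in which order the frames were shown.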
32
Video Encoding
Slide: Víctor Campos (UPC 2018)
Recurrent Neural Networks are
well suited for processing
sequences.
Drawback: RNNs process frames sequentially,
so they cannot be parallelized across time steps.
Donahue et al., Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015
CNN CNN CNN...
RNN RNN RNN...
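A toy sketch of the recurrent alternative, assuming a scalar feature per frame and a single tanh unit with hand-picked weights (none of these values come from the cited paper). Each step needs the previous hidden state, which is what makes the loop sequential and also sensitive to frame order.

```python
import math


def rnn_encode(frame_features, w_in, w_rec):
    """Run a one-unit tanh RNN over a sequence and return the last hidden state."""
    hidden = 0.0
    for x in frame_features:
        # The new state depends on the previous one, so steps cannot run in parallel.
        hidden = math.tanh(w_in * x + w_rec * hidden)
    return hidden


# Hypothetical scalar features for three frames and hand-picked weights.
frames = [1.0, 0.0, -1.0]
forward = rnn_encode(frames, w_in=1.0, w_rec=0.5)
backward = rnn_encode(list(reversed(frames)), w_in=1.0, w_rec=0.5)
print(forward, backward)
```

Unlike pooling, the final state differs between the clip and its reversed version.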
33
Learn more on visual encoding
34
Decoder
Representation
35
Image Decoding
CNN
Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative
adversarial networks." ICLR 2016. #DCGAN
36
Encoder Decoder
Representation
37
Image Encoding and Decoding
Noh, Hyeonwoo, Seunghoon Hong, and Bohyung Han. "Learning deconvolution network for semantic segmentation."
ICCV 2015.
“Regular” VGG “Upside down” VGG
38
Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. "Image-to-image translation with conditional adversarial
networks." CVPR 2017.
39
Outline
1. Encoder-Decoder Architectures
2. Image and Video Encoding
3. Image Captioning & Grounding
4. Image Generation
5. Visual Question Answering / Reasoning
6. Joint Embeddings (+ recipe generation)
40
Encoder Decoder
Representation
41
#ShowAndTell Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption
generator." CVPR 2015.
Image Captioning
42
Image Captioning
#DeepImageSent Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions."
CVPR 2015 (Slides by Marc Bolaños)
43
Captioning: Show, Attend & Tell
Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua
Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015
44
Captioning: Show, Attend & Tell
Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua
Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015
45
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense
captioning." CVPR 2016
Dense Captioning
46
XAVI: “man has
short hair”, “man
with short hair”
AMAIA: “a woman
wearing a black
shirt”, “
BOTH: “two men
wearing black
glasses”
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense
captioning." CVPR 2016
Dense Captioning
Image Captioning for News
Ali Furkan Biten, Lluis Gomez, Marçal Rusiñol, Dimosthenis Karatzas, “Good News, Everyone! Context driven entity-aware
captioning for news images” CVPR 2019.
48
Filtering Social Bias in Neural Models
#Equalizer Burns, Kaylee, Lisa Anne Hendricks, Trevor Darrell, and Anna Rohrbach. "Women also Snowboard: Overcoming
Bias in Captioning Models." ECCV 2018.
49
Captioning: Dataset biases
#Equalizer Burns, Kaylee, Lisa Anne Hendricks, Trevor Darrell, and Anna Rohrbach. "Women also Snowboard: Overcoming
Bias in Captioning Models." ECCV 2018.
50
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor
Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015. [code]
Captioning: Video
51
(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. Hierarchical Recurrent Neural
Encoder for Video Representation with Application to Captioning, CVPR 2016.
LSTM unit
(2nd layer)
Time
Image
t = 1 t = T
hidden state
at t = T
first chunk
of data
Captioning: Video
52
Sign Language Translation
Camgoz, Necati Cihan, et al. Neural Sign Language Translation. CVPR 2018.
53
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level Lipreading."
(2016).
54
Lip Reading
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level
Lipreading." (2016).
55
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild."
CVPR 2017
56
Lipreading: Watch, Listen, Attend & Spell
Audio
features
Image
features
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
57
Lipreading: Watch, Listen, Attend & Spell
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
Attention over output
states from audio and
video is computed at
each timestep
58
Lipreading
Afouras, Triantafyllos, Joon Son Chung, and Andrew Zisserman. "Deep Lip Reading: a comparison of models and an online
application." Interspeech 2018.
59
Grounded Captioning from Objects
Lu, Jiasen and Yang, Jianwei and Batra, Dhruv and Parikh, Devi “Neural Baby Talk” CVPR 2018 [code]
60
Lu, Jiasen and Yang, Jianwei and Batra, Dhruv and Parikh, Devi “Neural Baby Talk” CVPR 2018 [code]
Grounded Captioning from Objects
61
Akbari, Hassan, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, and Shih-Fu Chang. "Multi-level Multimodal
Common Semantic Space for Image-Phrase Grounding." CVPR 2019. [code]
Weak grounding w/o supervision
62
Akbari, Hassan, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, and Shih-Fu Chang. "Multi-level Multimodal
Common Semantic Space for Image-Phrase Grounding." CVPR 2019. [code]
Grounding with weak supervision
63
Cornia, Marcella, Lorenzo Baraldi, and Rita Cucchiara. "Show, Control and Tell: A Framework for Generating Controllable and
Grounded Captions." CVPR 2019. [code]
Controlled Grounding
64
Outline
1. Encoder-Decoder Architectures
2. Image and Video Encoding
3. Image Captioning & Grounding
4. Image Generation
5. Visual Question Answering / Reasoning
6. Joint Embeddings (+ recipe generation)
65
Encoder Decoder
Representation
66
Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial
text to image synthesis." ICML 2016.
Image Generation
67
Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial
text to image synthesis." ICML 2016. [code]
Image Synthesis
68
Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial
text to image synthesis." ICML 2016. [code]
Image Generation
69
#StackGAN Zhang, Han, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas.
"Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks." ICCV 2017. [code]
Image Synthesis
70
#StackGAN Zhang, Han, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas.
"Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks." ICCV 2017. [code]
Image Synthesis
71
Justin Johnson, Agrim Gupta, Li Fei-Fei, “Image Generation from Scene Graphs” CVPR 2018
Image Generation via Scene Graphs
72
Justin Johnson, Agrim Gupta, Li Fei-Fei, “Image Generation from Scene Graphs” CVPR 2018
Image Synthesis via Scene Graphs
73
#Text2Scene Tan, Fuwen, Song Feng, and Vicente Ordonez. "Text2Scene: Generating Compositional Scenes From Textual
Descriptions." CVPR 2019 [blog].
Image Generation by Composition
74
#Text2Scene Tan, Fuwen, Song Feng, and Vicente Ordonez. "Text2Scene: Generating Compositional Scenes From Textual
Descriptions." CVPR 2019 [blog].
75
#Text2Scene Tan, Fuwen, Song Feng, and Vicente Ordonez. "Text2Scene: Generating Compositional Scenes From Textual
Descriptions." CVPR 2019 [blog].
76
#CRAFT Gupta, Tanmay, Dustin Schwenk, Ali Farhadi, Derek Hoiem, and Aniruddha Kembhavi. "Imagine this! scripts to
compositions to videos." ECCV 2018
77
#CRAFT Gupta, Tanmay, Dustin Schwenk, Ali Farhadi, Derek Hoiem, and Aniruddha Kembhavi. "Imagine this! scripts to
compositions to videos." ECCV 2018
Video Generation by Composition
78
Outline
1. Encoder-Decoder Architectures
2. Image and Video Encoding
3. Image Captioning & Grounding
4. Image Generation
5. Visual Question Answering / Reasoning
6. Joint Embeddings (+ recipe generation)
79
Encoder
Decoder
Representation
Encoder
Representation
80
#Mattnet Yu, Licheng, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L. Berg. "Mattnet: Modular
attention network for referring expression comprehension." CVPR 2018. [code]
Object from Referring Expressions
81
Khoreva, Anna, Anna Rohrbach, and Bernt Schiele. "Video object segmentation with language referring expressions." ACCV
2018.
Video Object Grounding
82
Encoder
Decoder
Representation
Encoder
Representation
83
Visual Question Answering
Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. "VQA:
Visual question answering." CVPR 2015.
84
Visual Question Answering (VQA)
[z1, z2, …, zN]
[y1, y2, …, yM]
“Is economic growth decreasing?”
“Yes”
Encode
Encode
Decode
85
Extract visual features
Embedding
Merge
Predict answer
Question: What object is flying?
Answer: Kite
Visual Question Answering (VQA)
Masuda, Issey, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Open-Ended Visual
Question-Answering." ETSETB UPC TelecomBCN (2016).
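The pipeline on this slide (extract visual features, embed the question, merge, predict the answer) can be sketched as follows. This is only an illustration under assumptions: the merge is done by concatenation, the answer is picked by a linear classifier, and all vectors, weights and answer candidates are made up rather than taken from the cited work.

```python
def predict_answer(visual_features, question_embedding, answer_weights, answers):
    """Merge both modalities by concatenation and return the highest-scoring answer."""
    merged = visual_features + question_embedding  # list concatenation (assumed merge)
    scores = [sum(w * m for w, m in zip(row, merged)) for row in answer_weights]
    best = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best]


# Hypothetical toy values: 2-D visual features, 2-D question embedding.
visual_features = [1.0, 0.0]
question_embedding = [0.0, 1.0]
answers = ["kite", "dog"]
answer_weights = [
    [1.0, 0.0, 0.0, 1.0],  # row scoring "kite"
    [0.0, 1.0, 1.0, 0.0],  # row scoring "dog"
]
answer = predict_answer(visual_features, question_embedding, answer_weights, answers)
print(answer)
```

Treating answer prediction as classification over a fixed answer vocabulary is one common design; generating the answer word by word with a decoder is the alternative.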
86
Visual Question Answering (VQA)
Masuda, Issey, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Open-Ended Visual
Question-Answering." ETSETB UPC TelecomBCN (2016).
Image
Question
Answer
87
Visual Question Answering (VQA)
Francisco Roldán, Issey Masuda, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Visual
Question-Answering 2.0." ETSETB UPC TelecomBCN (2017).
88
Noh, H., Seo, P. H., & Han, B. Image question answering using convolutional neural network with
dynamic parameter prediction. CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
Visual Question Answering (VQA)
89
VQA: Dynamic Memory Networks
(Slides and Slidecast by Santi Pascual): Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for
Visual and Textual Question Answering." ICML 2016
90
Grounded VQA
(Slides and Screencast by Issey Masuda): Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei."Visual7W: Grounded
Question Answering in Images." CVPR 2016.
91
Visual Reasoning
#Clevr Johnson, Justin, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick.
"CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning." CVPR 2017
92
Visual Reasoning
(Slides by Fran Roldan) Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Fei-Fei Li, Larry
Zitnick, Ross Girshick, “Inferring and Executing Programs for Visual Reasoning”. ICCV 2017
Program Generator Execution Engine
93
Visual Dialog
Das, Abhishek, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. "Visual
Dialog." CVPR 2017 [Project]
94
Visual Dialog
95
Hate Speech Detection in Memes
Benet Oriol, Cristian Canton, Xavier Giro-i-Nieto, “Hate Speech Detection in Memes”. UPC TelecomBCN
2019.
Hate Speech Detection
96
Visual Reasoning: Relation Networks
Santoro, Adam, David Raposo, David G. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy
Lillicrap. "A simple neural network module for relational reasoning." NIPS 2017.
Relation Networks concatenate every possible pair of objects with an encoded question and then find the
answer with an MLP.
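A minimal sketch of that pairing step, assuming the formulation in which a function g scores each (object, object, question) triple, the scores are summed, and a second function f maps the sum to the answer. Here `g` and `f` are toy stand-ins for the MLPs, and the object and question vectors are made up.

```python
from itertools import product


def relation_network(objects, question, g, f):
    """Apply g to every ordered object pair joined with the question, sum, then apply f."""
    pair_outputs = [
        g(o_i + o_j + question)  # list concatenation of the two objects and the question
        for o_i, o_j in product(objects, repeat=2)
    ]
    summed = [sum(values) for values in zip(*pair_outputs)]
    return f(summed)


# Toy stand-ins for the two MLPs (assumptions, not learned networks).
def g(vector):
    return [sum(vector)]


def f(vector):
    return vector[0]


objects = [[1.0], [2.0]]   # hypothetical object features
question = [0.5]           # hypothetical encoded question
result = relation_network(objects, question, g, f)
print(result)
```

With n objects the module evaluates g on n² pairs, so the cost grows quadratically with the number of objects.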
97
Multimodal Machine Translation
Challenge on Multimodal Image Translation:
http://www.statmt.org/wmt17/multimodal-task.html#task1
98
Outline
1. Encoder-Decoder Architectures
2. Image and Video Encoding
3. Image Captioning & Grounding
4. Image Generation
5. Visual Question Answering / Reasoning
6. Joint Embeddings (+ recipe generation)
99
Encoder Encoder
Representation
100
Joint Representations (Embeddings)
Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "Devise: A deep
visual-semantic embedding model." NIPS 2013
101
Zero-shot learning
Socher, R., Ganjoo, M., Manning, C. D., & Ng, A., Zero-shot learning through cross-modal transfer. NIPS 2013 [slides] [code]
No images from “cat” in
the training set...
...but they can still be
recognised as “cats”
thanks to the
representations learned
from text.
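The idea can be sketched as a nearest-neighbour search in the word embedding space: an image is projected into that space and labeled with the closest word vector, even if no training image had that label. The 2-D embeddings and the projected image vector below are invented for illustration; real systems learn the projection and use much larger embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, word_embeddings):
    """Return the label whose word embedding is most similar to the image embedding."""
    return max(word_embeddings, key=lambda label: cosine(image_embedding, word_embeddings[label]))


# Hypothetical word vectors learned from text only; "cat" has no training images.
word_embeddings = {"dog": [1.0, 0.1], "cat": [0.8, 0.6], "truck": [-1.0, 0.2]}
projected_image = [0.7, 0.5]  # hypothetical image feature mapped into the word space
label = zero_shot_classify(projected_image, word_embeddings)
print(label)
```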
102
Multimodal Retrieval
Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal Scene Networks."
CVPR 2016.
103
Multimodal Retrieval
Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal Scene Networks."
CVPR 2016.
104
Image and text retrieval with joint embeddings.
Joint Neural Embeddings
#pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio
Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017
105
#pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio
Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017
Joint Neural Embeddings
106
Joint Neural Embeddings
joint
embedding
LSTM Bidirectional LSTM
#pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio
Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017
107
Joint Neural Embeddings
● Constrained to database recipes
● Ingredients and Instructions are retrieved as a whole
● Prohibits user manipulation (ingredient replacements)
#pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio
Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017
108
109
Recipe Generation (not retrieval!)
Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe
Generation from Food Images." CVPR 2019.
110
Recipe Generation (not retrieval!)
Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe
Generation from Food Images." CVPR 2019.
Title: Edamame corn salad
Ingredients
pepper, corn, onion, edamame, salt, vinegar, cilantro, avocado, oil
Instructions
- In a large bowl, combine edamame, corn, red onion, cilantro,
avocado, and red bell pepper.
- In a small bowl, whisk together olive oil, vinegar, salt, and
pepper.
- Pour dressing over edamame mixture and toss to coat.
- Cover and refrigerate for at least 1 hour before serving.
111
Recipe Generation (not retrieval!)
Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe
Generation from Food Images." CVPR 2019.
According to human judgment, our proposed system is able to generate better recipes than the previous
retrieval method.
112
Recipe Generation (data as the DL ingredient!)
Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe
Generation from Food Images." CVPR 2019.
Title: Spaghetti with spicy tomato sauce
Ingredients:
onion, tomato, chili, salt, noodles, pepper, spaghetti, clove, cumin, water
Instructions:
- In a large pot, combine the tomatoes, onion, garlic, chili powder, cumin, salt,
pepper, water and tomato sauce.
- Bring to a boil, then reduce heat and simmer for about 20 minutes.
- Meanwhile, cook the spaghetti according to package directions.
- Drain and set aside.
- When the spaghetti is done, drain and return to pot.
- Add the sauce and stir to combine.
- Serve with the shredded cheese and a dollop of sour cream.
113
@JonathanFly
114
Outline
1. Encoder-Decoder Architectures
2. Image and Video Encoding
3. Image Captioning & Grounding
4. Image Generation
5. Visual Question Answering / Reasoning
6. Joint Embeddings (+ recipe generation)
Recommended tool
Pythia for vision and language multimodal AI models by Facebook FAIR.
116
Deep Learning courses @ UPC TelecomBCN:
● MSc course [2017] [2018] [2019]
● BSc course [2018] [2019]
● 1st edition (2016)
● 2nd edition (2017)
● 3rd edition (2018)
● 4th edition (2019)
● 1st edition (2017)
● 2nd edition (2018)
● 3rd edition - NLP (2019)
Next edition: Autumn 2019 Central repo with slides & videos here
117
Deep Learning courses @ UPC TelecomBCN:
Central repo with slides & videos here
118
Multimodal DL with audio+speech
https://telecombcn-dl.github.io/2019-mmm-tutorial/
119
Deep Learning for Professionals @ UPC School
Next edition starts November 2019. Sign up here.
120
Community building
bcn.ai deeplearning.barcelona
121
Eskerrik asko
Victor
Campos
Amaia
Salvador
Amanda
Duarte
Dèlia
Fernández
Eduard
Ramon
Andreu
Girbau
Dani
Fojo
Oscar
Mañas
Santi
Pascual
Xavi
Giró
Miriam
Bellver
Janna
Escur
Carles
Ventura
Paula
Gómez
Benet
Oriol
Mariona
Carós
Jordi
Torres
Ferran
Marqués
bit.ly/ixa-dlnlp-2019
xavier.giro@upc.edu
@DocXavi
Rowan Zellers, Yonatan Bisk, Ali Farhadi, Yejin Choi. From Recognition to Cognition: Visual Commonsense
Reasoning. CVPR 2019 (oral)
https://visualcommonsense.com/
123
Ma, Chih-Yao, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, and Caiming Xiong.
"Self-Monitoring Navigation Agent via Auxiliary Progress Estimation." ICLR 2019. [code]
124
Visual Question Answering
Gurari, Danna, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. "VizWiz
Grand Challenge: Answering Visual Questions from Blind People." arXiv preprint arXiv:1802.08218 (2018).
125
Reasoning: MAC
Hudson, Drew A., and Christopher D. Manning. "Compositional attention networks for machine reasoning."
ICLR 2018.
126
Navigation with Language and Vision
Fried, Daniel, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate
Saenko, Dan Klein, and Trevor Darrell. "Speaker-Follower Models for Vision-and-Language Navigation." arXiv preprint
arXiv:1806.02724 (2018).
127
Translation
Harwath, David, Galen Chuang, and James Glass. "Vision as an Interlingua: Learning Multilingual Semantic
Embeddings of Untranscribed Speech." arXiv preprint arXiv:1804.03052 (2018).

Weitere Àhnliche Inhalte

Was ist angesagt?

Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...Universitat PolitĂšcnica de Catalunya
 
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019Universitat PolitĂšcnica de Catalunya
 
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...Universitat PolitĂšcnica de Catalunya
 
Advanced Deep Architectures (D2L6 Deep Learning for Speech and Language UPC 2...
Advanced Deep Architectures (D2L6 Deep Learning for Speech and Language UPC 2...Advanced Deep Architectures (D2L6 Deep Learning for Speech and Language UPC 2...
Advanced Deep Architectures (D2L6 Deep Learning for Speech and Language UPC 2...Universitat PolitĂšcnica de Catalunya
 
Video Saliency Prediction with Deep Neural Networks - Juan Jose Nieto - DCU 2019
Video Saliency Prediction with Deep Neural Networks - Juan Jose Nieto - DCU 2019Video Saliency Prediction with Deep Neural Networks - Juan Jose Nieto - DCU 2019
Video Saliency Prediction with Deep Neural Networks - Juan Jose Nieto - DCU 2019Universitat PolitĂšcnica de Catalunya
 
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)Universitat PolitĂšcnica de Catalunya
 
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...Universitat PolitĂšcnica de Catalunya
 
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)Universitat PolitĂšcnica de Catalunya
 
Learning with Videos (D4L4 2017 UPC Deep Learning for Computer Vision)
Learning with Videos  (D4L4 2017 UPC Deep Learning for Computer Vision)Learning with Videos  (D4L4 2017 UPC Deep Learning for Computer Vision)
Learning with Videos (D4L4 2017 UPC Deep Learning for Computer Vision)Universitat PolitĂšcnica de Catalunya
 
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...Universitat PolitĂšcnica de Catalunya
 
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...Universitat PolitĂšcnica de Catalunya
 

Was ist angesagt? (20)

Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
 
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
 
Deep Learning Representations for All (a.ka. the AI hype)
Deep Learning Representations for All (a.ka. the AI hype)Deep Learning Representations for All (a.ka. the AI hype)
Deep Learning Representations for All (a.ka. the AI hype)
 
Multimodal Deep Learning
Multimodal Deep LearningMultimodal Deep Learning
Multimodal Deep Learning
 
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
 
Video Analysis (D4L2 2017 UPC Deep Learning for Computer Vision)
Video Analysis (D4L2 2017 UPC Deep Learning for Computer Vision)Video Analysis (D4L2 2017 UPC Deep Learning for Computer Vision)
Video Analysis (D4L2 2017 UPC Deep Learning for Computer Vision)
 
Advanced Deep Architectures (D2L6 Deep Learning for Speech and Language UPC 2...
Advanced Deep Architectures (D2L6 Deep Learning for Speech and Language UPC 2...Advanced Deep Architectures (D2L6 Deep Learning for Speech and Language UPC 2...
Advanced Deep Architectures (D2L6 Deep Learning for Speech and Language UPC 2...
 
Neural Architectures for Video Encoding
Neural Architectures for Video EncodingNeural Architectures for Video Encoding
Neural Architectures for Video Encoding
 
Deep Learning for Video: Language (UPC 2018)
Deep Learning for Video: Language (UPC 2018)Deep Learning for Video: Language (UPC 2018)
Deep Learning for Video: Language (UPC 2018)
 
Video Saliency Prediction with Deep Neural Networks - Juan Jose Nieto - DCU 2019
Video Saliency Prediction with Deep Neural Networks - Juan Jose Nieto - DCU 2019Video Saliency Prediction with Deep Neural Networks - Juan Jose Nieto - DCU 2019
Video Saliency Prediction with Deep Neural Networks - Juan Jose Nieto - DCU 2019
 
Deep Learning for Video: Action Recognition (UPC 2018)
Deep Learning for Video: Action Recognition (UPC 2018)Deep Learning for Video: Action Recognition (UPC 2018)
Deep Learning for Video: Action Recognition (UPC 2018)
 
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
 
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
 
Deep Learning from Videos (UPC 2018)
Deep Learning from Videos (UPC 2018)Deep Learning from Videos (UPC 2018)
Deep Learning from Videos (UPC 2018)
 
Welcome (D1L1 2017 UPC Deep Learning for Computer Vision)
Welcome (D1L1 2017 UPC Deep Learning for Computer Vision)Welcome (D1L1 2017 UPC Deep Learning for Computer Vision)
Welcome (D1L1 2017 UPC Deep Learning for Computer Vision)
 
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
 
Learning with Videos (D4L4 2017 UPC Deep Learning for Computer Vision)
Learning with Videos  (D4L4 2017 UPC Deep Learning for Computer Vision)Learning with Videos  (D4L4 2017 UPC Deep Learning for Computer Vision)
Learning with Videos (D4L4 2017 UPC Deep Learning for Computer Vision)
 
Deep Learning for Video: Object Tracking (UPC 2018)
Deep Learning for Video: Object Tracking (UPC 2018)Deep Learning for Video: Object Tracking (UPC 2018)
Deep Learning for Video: Object Tracking (UPC 2018)
 
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...
 
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
 

Ähnlich wie One Perceptron to Rule Them All: Language and Vision

One Perceptron to Rule Them All: Language and Vision

  ‱ 1. One Perceptron to Rule Them All: Language and Vision. Xavier Giro-i-Nieto (xavier.giro@upc.edu, @DocXavi), Associate Professor, Intelligent Data Science and Artificial Intelligence Center (IDEAI), Universitat Politecnica de Catalunya (UPC), Barcelona Supercomputing Center (BSC). Deep Learning for Natural Language Processing, San Sebastian, 5 July 2019. bit.ly/ixa-dlnlp-2019
  ‱ 2. Xavier Giro-i-Nieto, Associate Professor at Universitat PolitĂšcnica de Catalunya (UPC). Kaixo. IDEAI Center for Intelligent Data Science & Artificial Intelligence
  ‱ 3. ● 11 faculty members ● 12 PhD students. Research Group & Centers: https://imatge.upc.edu/ https://www.bsc.es/ ● National computation center #1 ● Supercomputer MareNostrum ● Emerging Technologies for Artificial Intelligence Group, directed by Prof. Jordi Torres. https://ideai.upc.edu/ ● Center funded in 2017 ● 60 researchers. IDEAI (Intelligent Data Science and Artificial Intelligence)
  • 5. 5
  ‱ 8. Outline: 1. Encoder-Decoder Architectures 2. Image and Video Encoding 3. Image Captioning & Grounding 4. Image Generation 5. Visual Question Answering / Reasoning 6. Joint Embeddings (+ recipe generation)
  • 12. 12
  ‱ 13. Encoder. Output: 0 1 0 (Cat). A. Krizhevsky, I. Sutskever, G. E. Hinton, "ImageNet classification with deep convolutional neural networks", NIPS 2012
  ‱ 14. One-hot Representation: [1,0,0] [0,1,0] [0,0,1]. Slide concept: Perronnin, F., Tutorial on LSVR @ CVPR'14, Output embedding for LSVR
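The one-hot representation on this slide can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical three-class list (`dog`, `cat`, `bird`); the slides only show the vectors themselves and the label "Cat" for 0 1 0.

```python
# Hypothetical class list (an assumption): the slide only shows the vectors
# [1,0,0], [0,1,0], [0,0,1] and the label "Cat" for 0 1 0.
CLASSES = ["dog", "cat", "bird"]


def one_hot(label, classes=CLASSES):
    """Return a list with 1 at the index of `label` and 0 elsewhere."""
    index = classes.index(label)
    return [1 if i == index else 0 for i in range(len(classes))]


def decode(vector, classes=CLASSES):
    """Map a one-hot (or score) vector back to the class with the largest entry."""
    best = max(range(len(vector)), key=lambda i: vector[i])
    return classes[best]


print(one_hot("cat"))
print(decode([0, 1, 0]))
```

Every pair of distinct one-hot vectors is equally far apart, so this encoding carries no notion of similarity between classes.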
  ‱ 17. Decoder. Input: 0 1 0 (Cat). Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative adversarial networks." ICLR 2016. #DCGAN Fig: Xudong Mao
  ‱ 20. Outline: 1. Encoder-Decoder Architectures 2. Image and Video Encoding 3. Image Captioning & Grounding 4. Image Generation 5. Visual Question Answering / Reasoning 6. Joint Embeddings (+ recipe generation)
  ‱ 22. Perceptron: Weights and bias are the parameters that define its behavior. They must be learned during training.
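As a rough sketch of what "weights and bias define the behavior" means, the function below computes a single perceptron output. The sigmoid activation and the example values are assumptions for illustration; the slide does not specify them.

```python
import math


def perceptron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid.

    The sigmoid is an assumed activation; any other nonlinearity could be used.
    """
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative values only: training would adjust `weights` and `bias`.
output = perceptron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(output)
```

Changing only `weights` or `bias` changes the output for the same input, which is why these are the quantities learned during training.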
  ‱ 23. Convolutional Layers for Vision: Fully Connected layer (FC), Convolutional layer (Conv)
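One way to compare the two layer types is to count their learnable parameters. The sketch below assumes a hypothetical 32x32 RGB input, 100 output units or filters, and 3x3 kernels; none of these sizes come from the slides.

```python
def fc_params(num_inputs, num_outputs):
    """Fully connected layer: one weight per input-output pair, plus one bias per output."""
    return num_inputs * num_outputs + num_outputs


def conv_params(kernel_size, in_channels, out_channels):
    """Convolutional layer: one small kernel per (input channel, filter) pair, plus one bias per filter.

    The count does not depend on the image width or height because the
    kernel weights are shared across all spatial positions.
    """
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


# Assumed sizes for illustration: 32x32 RGB image, 100 units/filters, 3x3 kernels.
print(fc_params(32 * 32 * 3, 100))
print(conv_params(3, 3, 100))
```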
  ‱ 24. Pooling Layer: Pooling is a downsampling operation along the spatial dimensions (width, height). ● It progressively reduces the spatial size of the representation, which greatly reduces computation. ● It provides invariance to small local changes. Figure credit: Ranzato
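The two properties listed on this slide can be checked with a small sketch of 2x2 max pooling with stride 2. The window size, the stride, and the example feature maps are assumptions chosen for illustration.

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 over a 2D list (height and width assumed even)."""
    height, width = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, width, 2)
        ]
        for r in range(0, height, 2)
    ]


original = [
    [1, 2, 0, 0],
    [3, 4, 0, 1],
    [5, 0, 2, 2],
    [0, 0, 2, 3],
]
# Same map with the top-left maximum (4) moved by one pixel inside its window.
shifted = [
    [4, 2, 0, 0],
    [3, 1, 0, 1],
    [5, 0, 2, 2],
    [0, 0, 2, 3],
]

print(max_pool_2x2(original))
print(max_pool_2x2(shifted))
```

The 4x4 input becomes a 2x2 output, and the small shift inside one window leaves the pooled result unchanged.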
  • 25. Pooling Layer (critics) "The pooling operation used in CNNs is a big mistake and the fact that it works so well is a disaster." Geoffrey Hinton, AMA reddit (2015). Learn more: Richard Zhang, “Making Convolutional Networks Shift-Invariant Again” (ICML 2019)
  • 26. Convolutional Neural Networks for Vision LeNet-5: Several convolutional layers, combined with pooling layers, and followed by a small number of fully connected layers #LeNet-5 LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
  • 27. ImageNet Challenge ● 1,000 object classes (categories). ● Images: ○ 1.2 M train ○ 100k test. Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. "Imagenet: A large-scale hierarchical image database." CVPR 2009.
  • 28. ImageNet Challenge: 2012 Slide credit: Rob Fergus (NYU) -9.8% Based on SIFT + Fisher Vectors Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang et al. "Imagenet large scale visual recognition challenge." International Journal of Computer Vision 115, no. 3 (2015): 211-252. [web]
  • 29. Image Encoding A Krizhevsky, I Sutskever, GE Hinton “Imagenet classification with deep convolutional neural networks” NIPS 2012 Cat CNN FC
  • 31. Video Encoding Slide: Víctor Campos (UPC 2018) CNN CNN CNN... Combination method Combination is commonly implemented as a small NN on top of a pooling operation (e.g. max, sum, average). Drawback: pooling is not aware of the temporal order! Ng et al., Beyond short snippets: Deep networks for video classification, CVPR 2015
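The drawback noted above is easy to check: pooling per-frame features gives the same result whatever the frame order. The sketch below assumes average pooling and uses made-up two-dimensional frame features in place of real CNN outputs.

```python
def average_pool(frame_features):
    """Average the per-frame feature vectors into one clip-level vector."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(frame[d] for frame in frame_features) / n for d in range(dim)]


# Hypothetical CNN features for three consecutive frames.
frames = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
forward = average_pool(frames)
backward = average_pool(list(reversed(frames)))
print(forward, backward)
```

A clip played forwards and the same clip played backwards get the same descriptor, so any classifier placed on top cannot tell them apart.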
  • 32. Video Encoding Slide: Víctor Campos (UPC 2018) Recurrent Neural Networks are well suited for processing sequences. Drawback: RNNs are sequential, so their computation cannot be parallelized across time steps. Donahue et al., Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015 CNN CNN CNN... RNN RNN RNN...
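A minimal sketch of why a recurrent encoder is order-aware but sequential, assuming a plain tanh (Elman-style) recurrence rather than the LSTM used in the cited paper. All weights and frame features are hypothetical values chosen for illustration.

```python
import math


def rnn_encode(frame_features, w_in, w_rec, bias):
    """Run a simple tanh RNN over the frames and return the last hidden state."""
    hidden = [0.0] * len(bias)
    for x in frame_features:
        # Step t reads the hidden state of step t-1, so the loop cannot run in parallel.
        hidden = [
            math.tanh(
                sum(w_in[j][i] * x[i] for i in range(len(x)))
                + sum(w_rec[j][k] * hidden[k] for k in range(len(hidden)))
                + bias[j]
            )
            for j in range(len(bias))
        ]
    return hidden


# Hypothetical per-frame CNN features and hand-picked parameters.
frames = [[1.0, 0.0], [0.0, 1.0]]
w_in = [[0.5, -0.5], [0.3, 0.8]]
w_rec = [[0.1, 0.2], [-0.3, 0.4]]
bias = [0.0, 0.0]

forward = rnn_encode(frames, w_in, w_rec, bias)
backward = rnn_encode(list(reversed(frames)), w_in, w_rec, bias)
print(forward, backward)
```

Unlike pooling, reversing the frames changes the final state, at the cost of a loop in which every iteration waits for the previous one.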
  • 33. Learn more on visual encoding
  • 35. Image Decoding CNN Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative adversarial networks." ICLR 2016. #DCGAN
  • 37. Image Encoding and Decoding Noh, Hyeonwoo, Seunghoon Hong, and Bohyung Han. "Learning deconvolution network for semantic segmentation." ICCV 2015. “Regular” VGG “Upside down” VGG
  • 38. Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. "Image-to-image translation with conditional adversarial networks." CVPR 2017.
  • 39. Outline 1. Encoder-Decoder Architectures 2. Image and Video Encoding 3. Image Captioning & Grounding 4. Image Generation 5. Visual Question Answering / Reasoning 6. Joint Embeddings (+ recipe generation)
  • 41. #ShowAndTell Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015. Image Captioning
  • 42. Image Captioning #DeepImageSent Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015 (Slides by Marc Bolaños)
  • 43. Captioning: Show, Attend & Tell Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015
  • 44. Captioning: Show, Attend & Tell Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015
  • 45. Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense captioning." CVPR 2016 Dense Captioning
  • 46. XAVI: “man has short hair”, “man with short hair” AMAIA: “a woman wearing a black shirt” BOTH: “two men wearing black glasses” Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense captioning." CVPR 2016 Dense Captioning
  • 47. Image Captioning for News Ali Furkan Biten, Lluis Gomez, Marçal Rusiñol, Dimosthenis Karatzas, “Good News, Everyone! Context driven entity-aware captioning for news images” CVPR 2019.
  • 48. Filtering Social Bias in Neural Models #Equalizer Burns, Kaylee, Lisa Anne Hendricks, Trevor Darrell, and Anna Rohrbach. "Women also Snowboard: Overcoming Bias in Captioning Models." ECCV 2018.
  • 49. Captioning: Dataset biases #Equalizer Burns, Kaylee, Lisa Anne Hendricks, Trevor Darrell, and Anna Rohrbach. "Women also Snowboard: Overcoming Bias in Captioning Models." ECCV 2018.
  • 50. Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015. [code] Captioning: Video
  • 51. (Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning, CVPR 2016. LSTM unit (2nd layer) Time Image t = 1 t = T hidden state at t = T first chunk of data Captioning: Video
  • 52. Sign Language Translation Camgoz, Necati Cihan, et al. Neural Sign Language Translation. CVPR 2018.
  • 53. Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level Lipreading." (2016).
  • 54. Lip Reading Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level Lipreading." (2016).
  • 55. Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
  • 56. Lipreading: Watch, Listen, Attend & Spell Audio features Image features Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
  • 57. Lipreading: Watch, Listen, Attend & Spell Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017 Attention over output states from audio and video is computed at each timestep
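The per-timestep attention mentioned above can be sketched for a single decoder step. This is a simplified illustration, not the cited model: it assumes dot-product scoring and fuses the two context vectors by concatenation, and every vector below is a hypothetical value.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, states):
    """Return the attention-weighted sum of `states` and the attention weights."""
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    context = [sum(w * state[d] for w, state in zip(weights, states)) for d in range(dim)]
    return context, weights


# Hypothetical decoder state and encoder output states for one decoding timestep.
decoder_state = [1.0, 0.0]
video_states = [[1.0, 0.0], [0.0, 1.0]]
audio_states = [[0.0, 1.0], [2.0, 0.0], [0.0, 0.0]]

video_context, video_weights = attend(decoder_state, video_states)
audio_context, audio_weights = attend(decoder_state, audio_states)
fused_context = video_context + audio_context  # list concatenation of both contexts
print(video_weights, audio_weights, fused_context)
```

At the next timestep the decoder state changes, so both sets of weights are recomputed. The two streams may also have different lengths, as in this example.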
  • 58. Lipreading Afouras, Triantafyllos, Joon Son Chung, and Andrew Zisserman. "Deep Lip Reading: a comparison of models and an online application." Interspeech 2018.
  • 59. Grounded Captioning from Objects Lu, Jiasen and Yang, Jianwei and Batra, Dhruv and Parikh, Devi “Neural Baby Talk” CVPR 2018 [code]
  • 60. Lu, Jiasen and Yang, Jianwei and Batra, Dhruv and Parikh, Devi “Neural Baby Talk” CVPR 2018 [code] Grounded Captioning from Objects
  • 61. Akbari, Hassan, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, and Shih-Fu Chang. "Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding." CVPR 2019. [code] Weak grounding w/o supervision
  • 62. Akbari, Hassan, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, and Shih-Fu Chang. "Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding." CVPR 2019. [code] Grounding with weak supervision
  • 63. Cornia, Marcella, Lorenzo Baraldi, and Rita Cucchiara. "Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions." CVPR 2019. [code] Controlled Grounding
  • 64. Outline 1. Encoder-Decoder Architectures 2. Image and Video Encoding 3. Image Captioning & Grounding 4. Image Generation 5. Visual Question Answering / Reasoning 6. Joint Embeddings (+ recipe generation)
  • 66. Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial text to image synthesis." ICML 2016. Image Generation
  • 67. Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial text to image synthesis." ICML 2016. [code] Image Synthesis
  • 68. Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial text to image synthesis." ICML 2016. [code] Image Generation
  • 69. #StackGAN Zhang, Han, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas. "Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks." ICCV 2017. [code] Image Synthesis
  • 70. #StackGAN Zhang, Han, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas. "Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks." ICCV 2017. [code] Image Synthesis
  • 71. Justin Johnson, Agrim Gupta, Li Fei-Fei, “Image Generation from Scene Graphs” CVPR 2018 Image Generation via Scene Graphs
  • 72. Justin Johnson, Agrim Gupta, Li Fei-Fei, “Image Generation from Scene Graphs” CVPR 2018 Image Synthesis via Scene Graphs
  • 73. #Text2Scene Tan, Fuwen, Song Feng, and Vicente Ordonez. "Text2Scene: Generating Compositional Scenes From Textual Descriptions." CVPR 2019 [blog]. Image Generation by Composition
  • 74. #Text2Scene Tan, Fuwen, Song Feng, and Vicente Ordonez. "Text2Scene: Generating Compositional Scenes From Textual Descriptions." CVPR 2019 [blog].
  • 75. #Text2Scene Tan, Fuwen, Song Feng, and Vicente Ordonez. "Text2Scene: Generating Compositional Scenes From Textual Descriptions." CVPR 2019 [blog].
  • 76. #CRAFT Gupta, Tanmay, Dustin Schwenk, Ali Farhadi, Derek Hoiem, and Aniruddha Kembhavi. "Imagine this! scripts to compositions to videos." ECCV 2018
  • 77. #CRAFT Gupta, Tanmay, Dustin Schwenk, Ali Farhadi, Derek Hoiem, and Aniruddha Kembhavi. "Imagine this! scripts to compositions to videos." ECCV 2018 Video Generation by Composition
  • 78. Outline 1. Encoder-Decoder Architectures 2. Image and Video Encoding 3. Image Captioning & Grounding 4. Image Generation 5. Visual Question Answering / Reasoning 6. Joint Embeddings (+ recipe generation)
  • 80. #Mattnet Yu, Licheng, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L. Berg. "Mattnet: Modular attention network for referring expression comprehension." CVPR 2018. [code] Object from Referring Expressions
  • 81. Khoreva, Anna, Anna Rohrbach, and Bernt Schiele. "Video object segmentation with language referring expressions." ACCV 2018. Video Object Grounding
  • 83. Visual Question Answering Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. "VQA: Visual question answering." CVPR 2015.
  • 84. Visual Question Answering (VQA) [z1, z2, …, zN] [y1, y2, …, yM] “Is economic growth decreasing?” “Yes” Encode Encode Decode
  • 85. Extract visual features Embedding Predict answer Merge Question What object is flying? Answer Kite Visual Question Answering (VQA) Masuda, Issey, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Open-Ended Visual Question-Answering." ETSETB UPC TelecomBCN (2016).
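The four blocks on this slide (extract visual features, embed the question, merge, predict the answer) can be sketched as a toy pipeline. Everything concrete here is an assumption for illustration: the word vectors, the image features, merging by concatenation, and the linear classifier are hypothetical stand-ins, not the cited model.

```python
# Hypothetical 2-D word vectors and answer vocabulary.
WORD_VECTORS = {
    "what": [1.0, 0.0],
    "object": [0.0, 1.0],
    "is": [0.5, 0.5],
    "flying": [1.0, 1.0],
}
ANSWERS = ["kite", "dog"]


def embed_question(question):
    """Average the vectors of the known words in the question."""
    tokens = question.lower().replace("?", "").split()
    vectors = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not vectors:
        return [0.0, 0.0]
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(2)]


def merge(image_features, question_embedding):
    """Merge both modalities by concatenation (assumed merge operation)."""
    return image_features + question_embedding


def predict_answer(merged, classifier_weights):
    """Score every candidate answer with a linear layer and return the best one."""
    scores = [sum(w * x for w, x in zip(row, merged)) for row in classifier_weights]
    return ANSWERS[scores.index(max(scores))]


image_features = [0.9, 0.1]  # stand-in for the output of a CNN
classifier_weights = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.0, 0.0]]  # one row per answer

question_embedding = embed_question("What object is flying?")
merged = merge(image_features, question_embedding)
answer = predict_answer(merged, classifier_weights)
print(answer)
```

Treating the answer as a choice over a fixed vocabulary turns open-ended VQA into classification, which is a common simplification.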
  • 86. Visual Question Answering (VQA) Masuda, Issey, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Open-Ended Visual Question-Answering." ETSETB UPC TelecomBCN (2016). Image Question Answer
  • 87. Visual Question Answering (VQA) Francisco Roldán, Issey Masuda, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Visual Question-Answering 2.0." ETSETB UPC TelecomBCN (2017).
  • 88. Noh, H., Seo, P. H., & Han, B. Image question answering using convolutional neural network with dynamic parameter prediction. CVPR 2016 Dynamic Parameter Prediction Network (DPPnet) Visual Question Answering (VQA)
  • 89. VQA: Dynamic Memory Networks (Slides and Slidecast by Santi Pascual): Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016
  • 90. Grounded VQA (Slides and Screencast by Issey Masuda): Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.
  • 91. Visual Reasoning #Clevr Johnson, Justin, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. "CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning." CVPR 2017
  • 92. Visual Reasoning (Slides by Fran Roldan) Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Fei-Fei Li, Larry Zitnick, Ross Girshick, “Inferring and Executing Programs for Visual Reasoning”. ICCV 2017 Program Generator Execution Engine
  • 93. Visual Dialog Das, Abhishek, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. "Visual Dialog." CVPR 2017 [Project]
  • 95. Hate Speech Detection in Memes Benet Oriol, Cristian Canton, Xavier Giro-i-Nieto, “Hate Speech Detection in Memes”. UPC TelecomBCN 2019. Hate Speech Detection
  • 96. Visual Reasoning: Relation Networks Santoro, Adam, David Raposo, David G. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. "A simple neural network module for relational reasoning." NIPS 2017. Relation Networks concatenate all possible pairs of objects with an encoded question, and an MLP later finds the answer.
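A minimal sketch of the pairing step described above: every ordered pair of object vectors is concatenated with the question vector, passed through a shared function, and summed before a final layer produces the answer scores. Single linear layers stand in for the MLPs, and all objects, the question encoding, and the weights are hypothetical values.

```python
from itertools import product


def relu_layer(vector, weights):
    """One linear layer followed by ReLU (stand-in for the per-pair MLP)."""
    return [max(0.0, sum(w * x for w, x in zip(row, vector))) for row in weights]


def relation_network(objects, question, g_weights, f_weights):
    """Sum the per-pair relation vectors, then map the sum to answer scores."""
    pair_sum = None
    for obj_i, obj_j in product(objects, repeat=2):  # all ordered pairs
        relation = relu_layer(obj_i + obj_j + question, g_weights)
        if pair_sum is None:
            pair_sum = relation
        else:
            pair_sum = [a + b for a, b in zip(pair_sum, relation)]
    # Final linear layer (stand-in for the answer MLP).
    return [sum(w * x for w, x in zip(row, pair_sum)) for row in f_weights]


# Hypothetical 1-D object features, 1-D question encoding, and weights.
objects = [[1.0], [2.0], [3.0]]
question = [1.0]
g_weights = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
f_weights = [[1.0, 0.0], [0.0, 1.0]]

answer_scores = relation_network(objects, question, g_weights, f_weights)
print(answer_scores)
```

With three objects there are nine pairs, so the cost grows quadratically with the number of objects.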
  • 97. Multimodal Machine Translation Challenge on Multimodal Image Translation: http://www.statmt.org/wmt17/multimodal-task.html#task1
  • 98. Outline 1. Encoder-Decoder Architectures 2. Image and Video Encoding 3. Image Captioning & Grounding 4. Image Generation 5. Visual Question Answering / Reasoning 6. Joint Embeddings (+ recipe generation)
  • 100. Joint Representations (Embeddings) Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "Devise: A deep visual-semantic embedding model." NIPS 2013
  • 101. Zero-shot learning Socher, R., Ganjoo, M., Manning, C. D., & Ng, A., Zero-shot learning through cross-modal transfer. NIPS 2013 [slides] [code] No images of “cat” in the training set... ...but they can still be recognised as “cats” thanks to the representations learned from text.
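The zero-shot idea above can be sketched as a nearest-neighbour search in word-embedding space. The sketch assumes the image has already been projected into that space by a model trained only on the seen classes. The two-dimensional word vectors and the projected image are made-up values.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical word embeddings learned from text alone.
WORD_EMBEDDINGS = {
    "dog": [1.0, 0.1],
    "truck": [-1.0, 0.2],
    "cat": [0.8, 0.6],
}
SEEN_CLASSES = {"dog", "truck"}  # classes that had training images


def classify(projected_image):
    """Return the label whose word embedding is closest to the projected image."""
    return max(WORD_EMBEDDINGS, key=lambda label: cosine(projected_image, WORD_EMBEDDINGS[label]))


projected_cat_image = [0.7, 0.7]  # stand-in for a cat photo mapped into word space
prediction = classify(projected_cat_image)
print(prediction)
```

The label "cat" is reachable even though it has no training images, because its word vector sits near those of visually similar seen classes.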
  • 102. Multimodal Retrieval Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal Scene Networks." CVPR 2016.
  • 103. Multimodal Retrieval Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal Scene Networks." CVPR 2016.
  • 104. Image and text retrieval with joint embeddings. Joint Neural Embeddings #pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017
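Retrieval with a joint embedding reduces to ranking one modality by similarity to a query from the other. The sketch below assumes cosine similarity as the ranking score. The recipe names and all three-dimensional embeddings are hypothetical; in the cited work they would come from the trained image and recipe encoders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve(query, database, top_k=1):
    """Return the `top_k` database keys ranked by similarity to the query embedding."""
    ranked = sorted(database, key=lambda name: cosine(query, database[name]), reverse=True)
    return ranked[:top_k]


# Hypothetical recipe embeddings in the shared space.
recipe_embeddings = {
    "edamame corn salad": [0.9, 0.1, 0.2],
    "spaghetti with tomato sauce": [0.1, 0.9, 0.3],
    "chocolate cake": [0.0, 0.2, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]  # stand-in for an encoded food photo

best_match = retrieve(image_embedding, recipe_embeddings)
full_ranking = retrieve(image_embedding, recipe_embeddings, top_k=3)
print(best_match, full_ranking)
```

The same function works in the other direction: pass a recipe embedding as the query and a dictionary of image embeddings as the database.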
  • 105. #pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017 Joint Neural Embeddings
  • 106. Joint Neural Embeddings joint embedding LSTM Bidirectional LSTM #pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017
  • 107. Joint Neural Embeddings ● Constrained to database recipes ● Ingredients and Instructions are retrieved as a whole ● Prohibits user manipulation (ingredient replacements) #pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017
  • 109. Recipe Generation (not retrieval!) Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe Generation from Food Images." CVPR 2019.
  • 110. Recipe Generation (not retrieval!) Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe Generation from Food Images." CVPR 2019. Title: Edamame corn salad Ingredients: pepper, corn, onion, edamame, salt, vinegar, cilantro, avocado, oil Instructions: - In a large bowl, combine edamame, corn, red onion, cilantro, avocado, and red bell pepper. - In a small bowl, whisk together olive oil, vinegar, salt, and pepper. - Pour dressing over edamame mixture and toss to coat. - Cover and refrigerate for at least 1 hour before serving.
  • 111. Recipe Generation (not retrieval!) Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe Generation from Food Images." CVPR 2019. According to human judgment, our proposed system is able to generate better recipes than the previous retrieval method.
  • 112. Recipe Generation (data as the DL ingredient!) Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe Generation from Food Images." CVPR 2019. Title: Spaghetti with spicy tomato sauce Ingredients: onion, tomato, chili, salt, noodles, pepper, spaghetti, clove, cumin, water Instructions: - In a large pot, combine the tomatoes, onion, garlic, chili powder, cumin, salt, pepper, water and tomato sauce. - Bring to a boil, then reduce heat and simmer for about 20 minutes. - Meanwhile, cook the spaghetti according to package directions. - Drain and set aside. - When the spaghetti is done, drain and return to pot. - Add the sauce and stir to combine. - Serve with the shredded cheese and a dollop of sour cream.
  • 114. Outline 1. Encoder-Decoder Architectures 2. Image and Video Encoding 3. Image Captioning & Grounding 4. Image Generation 5. Visual Question Answering / Reasoning 6. Joint Embeddings (+ recipe generation)
  • 115. Recommended tool Pythia for vision and language multimodal AI models by Facebook FAIR.
  • 116. Deep Learning courses @ UPC TelecomBCN: ● MSc course [2017] [2018] [2019] ● BSc course [2018] [2019] ● 1st edition (2016) ● 2nd edition (2017) ● 3rd edition (2018) ● 4th edition (2019) ● 1st edition (2017) ● 2nd edition (2018) ● 3rd edition - NLP (2019) Next edition: Autumn 2019 Central repo with slides & videos here
  • 117. Deep Learning courses @ UPC TelecomBCN: Central repo with slides & videos here
  • 118. Multimodal DL with audio+speech https://telecombcn-dl.github.io/2019-mmm-tutorial/
  • 119. Deep Learning for Professionals @ UPC School Next edition starts November 2019. Sign up here.
  • 122. Rowan Zellers, Yonatan Bisk, Ali Farhadi, Yejin Choi. From Recognition to Cognition: Visual Commonsense Reasoning. CVPR 2019 (oral) https://visualcommonsense.com/
  • 123. Ma, Chih-Yao, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, and Caiming Xiong. "Self-Monitoring Navigation Agent via Auxiliary Progress Estimation." ICLR 2019. [code]
  • 124. Visual Question Answering Gurari, Danna, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. "VizWiz Grand Challenge: Answering Visual Questions from Blind People." arXiv preprint arXiv:1802.08218 (2018).
  • 125. Reasoning: MAC Hudson, Drew A., and Christopher D. Manning. "Compositional attention networks for machine reasoning." ICLR 2018.
  • 126. Navigation with Language and Vision Fried, Daniel, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. "Speaker-Follower Models for Vision-and-Language Navigation." arXiv preprint arXiv:1806.02724 (2018).
  • 127. Translation Harwath, David, Galen Chuang, and James Glass. "Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech." arXiv preprint arXiv:1804.03052 (2018).