These slides review our lab's research on applied deep learning since 2014, starting from our participation in TRECVID Instance Search, moving into video analysis with CNN+RNN architectures, and on to our current efforts in sign language translation and production.
4. Instance Search
Figure: Eva Mohedano, "Visual Search with Deep Learning". UPC 2018.
[Figure: a visual query ("this dog") and the expected retrieval outcome.]
5. Instance Search
#BoW Mohedano, Eva, Kevin McGuinness, Noel E. O'Connor, Amaia Salvador, Ferran Marques, and Xavier Giró-i-Nieto. "Bags of local convolutional features for scalable instance search." ICMR 2016. (best poster award)
[Figure: a query image and the top N retrieved images.]
6. Off the Shelf + Bag of CNN Words
#BoVW Mohedano, Eva, Kevin McGuinness, Noel E. O'Connor, Amaia Salvador, Ferran Marques, and Xavier Giró-i-Nieto. "Bags of local convolutional features for scalable instance search." ICMR 2016.
[Figure: local features from the conv5_1 layer of VGG16 (input resolution 336x256, feature maps 42x32) are quantized against 25K centroids into a 25K-D bag-of-words vector.]
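The encoding step above (assign each local CNN activation to its nearest centroid, then histogram) can be sketched in a few lines of numpy. This is a toy illustration: the paper uses conv5_1 VGG16 features and 25K centroids learned offline, while here the sizes are small, the data is random, and `bovw_encode` is a helper name of our own.

```python
import numpy as np

def bovw_encode(features, centroids):
    """Assign each local feature to its nearest centroid and build an
    L2-normalized bag-of-visual-words histogram."""
    # features: (num_locations, dim); centroids: (num_words, dim)
    d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)  # visual-word index per spatial location
    hist = np.bincount(words, minlength=len(centroids)).astype(float)
    return hist / (np.linalg.norm(hist) + 1e-12)

# Toy example: 42x32 = 1344 spatial locations, 8-D features, 16 "words"
rng = np.random.default_rng(0)
feats = rng.normal(size=(42 * 32, 8))
cents = rng.normal(size=(16, 8))
vec = bovw_encode(feats, cents)
```

The resulting sparse histogram is what makes the approach scalable: it can be indexed with a standard inverted file, exactly as in text retrieval.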
7. Off the Shelf + Bag of Visual Words
[Figure: query representations for Global Search (GS) and Local Search (LS).]
8. Off the Shelf + Bag of Visual Words
9. Salvador, Amaia, Xavier Giró-i-Nieto, Ferran Marqués, and Shin'ichi Satoh. "Faster R-CNN features for instance search." CVPRW 2016.
[Figure: the Faster R-CNN architecture: conv layers up to Conv5_3, a Region Proposal Network (RPN), RoI Pooling over the RPN proposals, and fully connected layers FC6, FC7, FC8 producing class probabilities. Conv5_3 yields the image representation; the RoI-pooled features yield the region representation (for reranking only).]
Fine-Tuning (FT) Faster R-CNN Features
Train an object detector for the query instances, using the query images as training data.
10. Fine-Tuning (FT) Faster R-CNN Features
[Figure: sum-pooling over Conv5_3 produces a D-dimensional image representation; max-pooling over the RoI-pooled features produces a D-dimensional region representation (for reranking).]
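The two pooling strategies on the slide can be sketched directly on an activation volume. A toy numpy illustration, where the 512-channel size and the RoI coordinates are made up for the example:

```python
import numpy as np

# Toy Conv5_3 activation volume: (channels D, height, width)
rng = np.random.default_rng(1)
act = rng.normal(size=(512, 14, 14))

# Image representation: sum-pool each channel over all spatial locations
image_desc = act.sum(axis=(1, 2))  # shape (512,)

# Region representation: max-pool each channel inside an RoI window,
# used only when reranking the top of the list
y0, y1, x0, x1 = 2, 8, 3, 10  # hypothetical RoI in feature-map coordinates
region_desc = act[:, y0:y1, x0:x1].max(axis=(1, 2))  # shape (512,)
```

Sum-pooling keeps a cheap global signature for the first ranking pass, while max-pooling over a localized window captures the strongest evidence for the instance inside a candidate region.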
11. Fine-Tuning (FT) Faster R-CNN Features
Spatial Reranking (R) over Object Proposals
[Figure: the query image is compared against the object proposals of each target image in the top-M ranking.]
12. Fine-Tuning (FT) Faster R-CNN Features
Query Expansion (QE) with top N results
[Figure: the query representation v = (v1, ..., vn) is compared with cosine similarity against the database representations v1 = (v11, ..., v1n), ..., vk = (vk1, ..., vkn) to produce a ranking list; the top N images (N = 5) are added to the query for a new search.]
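The matching-plus-expansion loop is simple enough to sketch end to end. Illustrative numpy code: `rank` and `query_expansion` are hypothetical helper names, and the random vectors stand in for real descriptors.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def rank(query, database):
    """Sort database indices by decreasing cosine similarity to the query."""
    sims = np.array([cosine(query, v) for v in database])
    return np.argsort(-sims)

def query_expansion(query, database, n=5):
    """Average the query with its top-n results and search again."""
    top = rank(query, database)[:n]
    expanded = np.vstack([query, database[top]]).mean(axis=0)
    return rank(expanded, database)

rng = np.random.default_rng(2)
db = rng.normal(size=(100, 64))
q = db[7] + 0.1 * rng.normal(size=64)  # a query close to database item 7
order = query_expansion(q, db, n=5)
```

Averaging in the top results pulls the query toward the cluster of true matches, which is why QE typically helps when the first pass is already reasonable.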
13. Fine-Tuning (FT) Faster R-CNN Features
R: Local Reranking; QE: Query Expansion; FT: Fine-tuned
[Results: ~10% gain (+QE+R); ~20% gain (FT); ~35% gain (FT+R+QE).]
14. Jiménez, Albert, Jose M. Alvarez, and Xavier Giró Nieto. "Class-weighted convolutional features for visual instance search." BMVC 2017.
Attention from Class Activation Maps
15. Attention from Class Activation Maps
[Figure: Class-Weighted Convolutional Features: a Global Average Pooling (GAP) classifier provides class weights w1, w2, w3, ..., wN; the CAM layer for a class (e.g. Class 1, "tennis ball", through Class N) weights the convolutional features, which are aggregated into a compact descriptor.]
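A minimal sketch of the class-weighted pooling idea, assuming CAMs are computed as the class-specific weighted sum of feature maps (the standard CAM formulation from a GAP classifier). Shapes are toy values and `class_weighted_descriptor` is an illustrative helper, not the paper's code.

```python
import numpy as np

def class_weighted_descriptor(conv_feats, fc_weights, class_idx):
    """Weight spatial conv features by one class's activation map (CAM),
    then sum-pool into a compact descriptor."""
    # conv_feats: (channels, H, W); fc_weights: (num_classes, channels)
    cam = np.tensordot(fc_weights[class_idx], conv_feats, axes=1)  # (H, W)
    cam = np.maximum(cam, 0)            # keep positive class evidence
    cam /= cam.max() + 1e-12            # normalize the attention map
    return (conv_feats * cam).sum(axis=(1, 2))  # (channels,)

rng = np.random.default_rng(3)
feats = rng.normal(size=(256, 7, 7))
w = rng.normal(size=(1000, 256))  # weights of the GAP -> softmax layer
desc = class_weighted_descriptor(feats, w, class_idx=42)
```

The appeal is that the attention comes for free from a pretrained classifier: no extra training is needed to focus the descriptor on class-relevant regions.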
16. Attention from Human Visual Saliency
Reyes, Cristian, Eva Mohedano, Kevin McGuinness, Noel E. O'Connor, and Xavier Giro-i-Nieto. "Where is my phone? Personal object retrieval from egocentric images." ACM Multimedia Workshops 2016.
[Figure: predicted human visual saliency over an egocentric image.]
17. Attention from Human Visual Saliency
#SalBoW Mohedano, Eva, Kevin McGuinness, Xavier Giró-i-Nieto, and Noel E. O'Connor. "Saliency weighted convolutional features for instance search." CBMI 2018.
18. Attention from Human Visual Saliency
[Figure: an unweighted 25K-D BoW vector vs. a saliency-weighted 25K-D BoW vector.]
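Saliency weighting changes one thing in the BoW construction: each local feature contributes its saliency value instead of a unit count. A toy numpy sketch, with random stand-ins for the word assignments and the predicted saliency map:

```python
import numpy as np

def saliency_weighted_bow(assignments, saliency, num_words):
    """Build a BoW histogram where each spatial location contributes its
    saliency value rather than a constant count of 1."""
    # assignments: (H*W,) visual-word index per location
    # saliency:    (H*W,) predicted saliency in [0, 1] per location
    hist = np.zeros(num_words)
    np.add.at(hist, assignments, saliency)  # unbuffered scatter-add
    return hist / (np.linalg.norm(hist) + 1e-12)

rng = np.random.default_rng(4)
words = rng.integers(0, 16, size=42 * 32)
sal = rng.random(42 * 32)
weighted = saliency_weighted_bow(words, sal, 16)
uniform = saliency_weighted_bow(words, np.ones(42 * 32), 16)  # plain BoW
```

Setting the per-location weights to 1 recovers the unweighted BoW, so the saliency model plugs in without touching the rest of the retrieval pipeline.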
19. Attention from Human Visual Saliency
#SalGAN Junting Pan, Cristian Canton, Kevin McGuinness, Noel E. O'Connor, Jordi Torres, Elisa Sayrol and Xavier Giro-i-Nieto. "SalGAN: Visual Saliency Prediction with Generative Adversarial Networks." CVPRW 2017.
[Figure: SalGAN's generator and discriminator.]
20. Attention from Human Visual Saliency
[Results: instance search with hand-crafted saliency models vs. deep-learning-based saliency models.]
22. Action Recognition
Alberto Montes, Amaia Salvador, Santiago Pascual, and Xavier Giro-i-Nieto. "Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks." NIPS Workshop 2016. (best poster award)
[Example predictions. Ground truth: Hopscotch; prediction: 0.848 Running a marathon, 0.023 Triple jump, 0.022 Javelin throw. Ground truth: Playing water polo; prediction: 0.765 Playing water polo, 0.202 Swimming, 0.007 Springboard diving.]
23. C3D + RNN: Action Recognition
25. C3D: Online Detection of Action Start
#ODAS Shou, Zheng, Junting Pan, Jonathan Chan, Kazuyuki Miyazawa, Hassan Mansour, Anthony Vetro, Xavier Giro-i-Nieto, and Shih-Fu Chang. "Online detection of action start in untrimmed, streaming videos." ECCV 2018.
26. Efficient C2D + RNN
[Figure: per-frame CNN features feed a recurrent network over time (CNN → RNN at each step).]
#SkipRNN Victor Campos, Brendan Jou, Xavier Giro-i-Nieto, Jordi Torres, and Shih-Fu Chang. "Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks." ICLR 2018.
27. Efficient C2D + RNN
[Figure: a Skip RNN unrolled over time. At each step the cell S either UPDATEs its state (s1 from x1, s3 from x3) or COPYs the previous state, skipping the input (x2 is skipped, so the state stays s1).]
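The COPY/UPDATE behavior can be sketched with a binary update gate. This is a simplification of the paper's mechanism, which learns the gate with a straight-through estimator; here the gate value is given directly and simply thresholded, and all names and shapes are illustrative.

```python
import numpy as np

def skip_rnn_step(state, x, update_prob, W, U, b):
    """One Skip RNN step: UPDATE the state when the binarized gate fires,
    otherwise COPY the previous state and skip the input entirely."""
    u = 1 if update_prob >= 0.5 else 0  # binarized update gate
    if u == 1:
        return np.tanh(W @ x + U @ state + b)  # UPDATE: s_t = S(s_{t-1}, x_t)
    return state                               # COPY: s_t = s_{t-1}

rng = np.random.default_rng(5)
W, U, b = rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), np.zeros(8)
s0 = np.zeros(8)
s1 = skip_rnn_step(s0, rng.normal(size=4), 0.9, W, U, b)  # gate fires: update
s2 = skip_rnn_step(s1, rng.normal(size=4), 0.2, W, U, b)  # gate closed: copy
```

The efficiency gain comes from the COPY branch: when the gate stays closed, neither the RNN update nor (in the full system) the frame's CNN features need to be computed.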
28. Efficient C2D + RNN
[Results: ~95% accuracy while skipping part of the input sequence (used vs. unused state updates).]
29. Efficient C2D + RNN
[Visualization of used vs. unused state updates.]
30. Efficient Visual Saliency Prediction
#SalEMA Linardos, Panagiotis, Eva Mohedano, Juan Jose Nieto, Noel E. O'Connor, Xavier Giro-i-Nieto, and Kevin McGuinness. "Simple vs complex temporal recurrences for video saliency prediction." BMVC 2019.
[Figure: SalCLSTM (convolutional LSTM recurrence) vs. SalEMA (exponential moving average recurrence).]
31. [Qualitative comparison of SalCLSTM and SalEMA.]
32. Efficient Visual Saliency Prediction
[Results comparing SalCLSTM and SalEMA.]
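SalEMA's "simple" recurrence is an exponential moving average, s_t = alpha * y_t + (1 - alpha) * s_{t-1}. A sketch of that recurrence, applied here to per-frame saliency maps for illustration; in the paper the EMA runs over an internal convolutional activation of the network, not over the outputs.

```python
import numpy as np

def ema_recurrence(frame_maps, alpha=0.1):
    """Exponential moving average over a sequence of per-frame maps:
    s_t = alpha * y_t + (1 - alpha) * s_{t-1}."""
    state = frame_maps[0]
    smoothed = [state]
    for y in frame_maps[1:]:
        state = alpha * y + (1 - alpha) * state
        smoothed.append(state)
    return smoothed

# Toy sequence of 3 constant 4x4 saliency maps
maps = [np.full((4, 4), 0.5) for _ in range(3)]
out = ema_recurrence(maps, alpha=0.1)
```

Compared with a convolutional LSTM, this recurrence has a single scalar parameter, which is the point of the paper's "simple vs. complex" comparison.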
48. Object Segmentation with Language
#RefVOS Bellver, Miriam, Carles Ventura, Carina Silberer, Ioannis Kazakos, Jordi Torres, and Xavier Giro-i-Nieto. "RefVOS: A closer look at referring expressions for video object segmentation." arXiv preprint arXiv:2010.00263 (2020).
49. Object Segmentation with Language
#SynthRef Kazakos, Ioannis, Carles Ventura, Miriam Bellver, Carina Silberer, and Xavier Giro-i-Nieto. "SynthRef: Generation of Synthetic Referring Expressions for Object Segmentation." NAACL ViGIL Workshop 2021.
[Results: accuracy on DAVIS 2017 train+val.]
51. Cross-modal Video Retrieval
Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal
Embeddings for Video and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018.
52. Cross-modal Video Retrieval
[Figure: a visual feature and an audio feature are matched in the joint embedding space to find the best match.]
53. Cross-modal Video Retrieval
[Figure: best-match retrieval between visual and audio features, second example.]
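Once both modalities are embedded in a shared space, retrieval reduces to nearest-neighbor search by cosine similarity. A toy sketch, where random vectors stand in for the learned visual and audio embeddings:

```python
import numpy as np

def best_match(query_feat, candidates):
    """Return the index of the candidate with highest cosine similarity
    to the query in the joint embedding space."""
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    return int(np.argmax(c @ q))

# Toy aligned pairs: each audio embedding lies near its visual counterpart
rng = np.random.default_rng(6)
visual = rng.normal(size=(10, 32))
audio = visual + 0.05 * rng.normal(size=(10, 32))
idx = best_match(visual[3], audio)  # retrieve the audio for clip 3
```

The same function works in both directions (audio query against visual candidates), which is what the two retrieval slides illustrate.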
55. Multimodal Steganography
Geleta, Margarita, Cristina Punti, Kevin McGuinness, Jordi Pons, Cristian Canton, and Xavier Giro-i-Nieto. "PixInWav: Residual Steganography for Hiding Pixels in Audio." CVPR Women in Computer Vision Workshop 2021.
56. Multimodal Steganography
57. Multimodal Steganography
Results produced by Teresa Domènech.
[Figure: the hidden image and the revealed image.]
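For intuition about hiding pixels in audio, here is the classic least-significant-bit (LSB) baseline. This is NOT PixInWav's method, which learns a residual with a neural network; it is only the simplest possible instance of the hide/reveal idea, with 16-bit quantization assumed.

```python
import numpy as np

def hide_lsb(audio, bits, levels=2 ** 16):
    """Embed one bit in the least significant bit of each quantized sample."""
    q = np.round((audio + 1) / 2 * (levels - 1)).astype(np.int64)
    q[: len(bits)] = (q[: len(bits)] & ~1) | bits  # overwrite the LSBs
    return q / (levels - 1) * 2 - 1                # back to [-1, 1] floats

def reveal_lsb(carrier, n, levels=2 ** 16):
    """Recover the first n hidden bits from the carrier signal."""
    q = np.round((carrier + 1) / 2 * (levels - 1)).astype(np.int64)
    return q[:n] & 1

rng = np.random.default_rng(7)
audio = np.clip(rng.normal(scale=0.3, size=1024), -1, 1)
bits = rng.integers(0, 2, size=64)       # e.g. binarized image pixels
carrier = hide_lsb(audio, bits)
recovered = reveal_lsb(carrier, 64)
```

LSB hiding is imperceptible but fragile (any resampling destroys it); learned residual methods like PixInWav trade capacity for robustness and perceptual quality.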
62. A crash course on Sign Language
Sign languages are NOT a one-to-one mapping from spoken languages.
[Figure: a look-up table cannot translate a spoken-language transcription ("Hi, I'm Amelia and I'm going to talk to you about how to remove gum from hair.") into a sign language video.]
63. Sign-to-Spoken Language Tasks
SL Translation: "Hi, I'm Amelia and I'm going to talk to you about how to remove gum from hair."
Continuous SL Recognition: HI, ME FS-AMELIA WILL EXPLAIN HOW REMOVE GUM FROM YOUR HAIR
Isolated SL Recognition: "I"
Finger-spelling: A, B, C, D...
(Video examples: GIPHY / SIGN WITH ROBERT)
65. Sign-Spoken Language Tasks
SL Translation: Sign Language (video) → Spoken Language (transcription)
SL Production: Spoken Language (transcription) → Sign Language (video)
"Hi, I'm Amelia and I'm going to talk to you about how to remove gum from hair."
66. Sign Language Translation & Production
[Figure: the chain of representations between the two languages: speech; spoken transcription ("Hi, I'm Amelia and I'm going to talk to you about how to remove gum from hair."); gloss transcription (HI, ME FS-AMELIA WILL EXPLAIN HOW REMOVE GUM FROM YOUR HAIR); sign transcription; segments; 2D poses; 3D poses; and video. Translation runs from video toward text, production from text toward video, and end-to-end systems skip the intermediate representations.]
68. Challenges in Computer Vision
Off-the-shelf pose detectors and generators struggle with hands.
69. Challenges in Computer Vision
Zhou, Yuxiao, Marc Habermann, Weipeng Xu, Ikhsanul Habibie, Christian Theobalt, and Feng Xu. "Monocular real-time hand shape and motion capture using multi-modal data." CVPR 2020.
70. Challenges in Computer Vision
Weinzaepfel, Philippe, Romain Brégier, Hadrien Combaluzier, Vincent Leroy, and Grégory Rogez. "DOPE: Distillation of part experts for whole-body 3D pose estimation in the wild." ECCV 2020.
71. Challenges in Computer Vision
Saunders, Ben, Necati Cihan Camgoz, and Richard Bowden. "Progressive transformers for end-to-end sign language production." ECCV 2020.
72. Challenges in Computer Vision
Ng, Evonne, Shiry Ginosar, Trevor Darrell, and Hanbyul Joo. "Body2Hands: Learning to infer 3D hands from conversational gesture body dynamics." CVPR 2021.
74. Challenges in NLP
🤔 Sign languages are (very) low-resource languages... in a (very) high-dimensional space (video).
75. Challenges in NLP
Figure: TensorFlow tutorial.
Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. "A neural probabilistic language model." Journal of Machine Learning Research 3 (2003): 1137-1155.
🤔 What are "language models" in sign language?
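As a point of reference for that question, the simplest possible "language model" is next-token probabilities estimated from counts. A sketch over hypothetical ASL gloss sequences (the corpus below is made up for illustration):

```python
from collections import Counter, defaultdict

def bigram_lm(corpus):
    """Estimate P(next_token | token) from bigram counts: the most basic
    instance of the language-model idea the slide refers to."""
    counts = defaultdict(Counter)
    for sent in corpus:
        for a, b in zip(sent, sent[1:]):
            counts[a][b] += 1
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

# Hypothetical gloss corpus
corpus = [
    ["ME", "NAME", "B-A-R-B-A-R-A"],
    ["HE", "TEACHER", "HE"],
    ["ME", "SAD", "ME"],
]
lm = bigram_lm(corpus)
```

The open question on the slide is what the tokens should even be for sign language, since glosses are a lossy, annotation-heavy stand-in for the visual signal.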
76. Challenges in NLP
How to transfer from large pre-trained ("foundation") models?
#GPT-3 Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Agarwal, S. "Language models are few-shot learners." NeurIPS 2020. (best paper award)
Source: [OpenAI API]
English: My name is Barbara. → ASL: ME NAME fs-B-A-R-B-A-R-A.
English: Is he a teacher? → ASL: HE TEACHER HE
English: Amir is tall. → ASL: fs-A-M-I-R, HE TALL HE
English: I'm not sad. → ASL: ME SAD ME 🤔
78. Challenges in Speech Translation
Jia, Ye, Michelle Tadmor Ramanovich, Tal Remez, and Roi Pomerantz. "Translatotron 2: Robust direct speech-to-speech translation." arXiv preprint arXiv:2107.08661 (2021).
[Figure: end-to-end speech-to-speech translation vs. end-to-end speech-to-video (sign language) translation. 🤔]
80. Challenges in Sign Language Analytics
Computer Vision, Speech, NLP, Training Data
Giro-i-Nieto, X. "Open Challenges in Sign Language Translation & Production". CMU VASC Seminar 2021.
81. Parallel Corpus
Fully supervised learning requires a large dataset of sentence pairs in the two languages to be translated.
Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." EMNLP 2014.
82. Sign Language Translation & Production
Duarte, A., Palaskar, S., Ventura, L., Ghadiyaram, D., DeHaan, K., Metze, F., ... & Giro-i-Nieto, X. "How2Sign: a large-scale multimodal dataset for continuous American Sign Language." CVPR 2021.
[Figure: How2Sign modalities: instructional videos with the speech signal; English transcription ("Hi, I'm Amelia and I'm going to talk to you about how to remove gum from hair."); gloss annotation (HI, ME FS-AMELIA WILL EXPLAIN HOW REMOVE GUM FROM YOUR HAIR); 2D body-face-hands keypoints estimated with OpenPose; and multi-view VGA and HD videos with 3D keypoint estimation (multi-view recordings only for a subset).]
83. Continuous Sign Language Datasets
84. Continuous Sign Language Datasets
[Figure: recording setups. Green Studio: multi-view RGB videos and RGB-D videos. Panoptic Studio: multi-view VGA and HD videos, with multi-view recordings only for a subset.]
Joo, H., Liu, H., Tan, L., Gui, L., Nabbe, B., Matthews, I., Kanade, T., Nobuhara, S., and Sheikh, Y. "Panoptic Studio: A massively multiview system for social motion capture." ICCV 2015.
85. Application: Human motion transfer
[Figure: 2D pose estimation (OpenPose) followed by GAN-based generation ("Everybody Dance Now").]
86. Application: Human motion transfer
Ventura, Lucas, Amanda Duarte, and Xavier Giró-i-Nieto. "Can everybody sign now? Exploring sign language video generation from 2D poses." ECCV 2020 SLRTP Workshop.
87. Application: Human motion transfer
[User study: "Choose one category." Classification accuracy for skeleton vs. GAN-generated videos.]
88. Application: Human motion transfer
[User study: "How well could you understand the video?" Mean Opinion Score for skeleton vs. GAN-generated videos.]
89. Application: Human motion transfer
[User study: "Translate the ASL signs into written English." Skeleton vs. GAN-generated videos.]
90. Challenges in Sign Language Analytics
Computer Vision, Speech, NLP, Training Data
92. Thank you
● @DocXavi
● xavier.giro@upc.edu
Eva Mohedano, Victor Campos, Miriam Bellver, Amaia Salvador, Andreu Girbau, Amanda Duarte, Carles Ventura, Laia Tarrés