A Survey on Cross-Modal Embedding
• @ymym3412
• nlpaper.challenge
• @ymas0315
• Cross-Modal Embedding
• Cross-Modal Retrieval
• Audio-Visual Embedding
Cross-Modal Embedding
• Cross-Modal Retrieval
◦ 3D
◦ Adversarial Training, Consistency Loss
• Audio-Visual Embedding
◦ Web
Cross-Modal Retrieval
• Text <-> Image (Wikipedia)
Image/Tag
• NUS-WIDE: Flickr, 18.6, Low-level
• Pascal VOC: Flickr, LabelMe, 9963
Image/Text
• Wikipedia
• Flickr-30k: Flickr, 30000, 16
• Recipe1M: 1M
Cross-Modal Retrieval
• Real-Valued Representation
• Binary Representation
• Unsupervised Method
• Pairwise based Method
• Supervised Method
Unsupervised Method
CCA
2
AutoEncoder
Ranking Method
Pairwise based Method
Shared Space
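A pairwise loss over a shared space can be sketched as follows. This is a minimal illustration, assuming both modalities have already been encoded into vectors of the same dimension; the contrastive form, the function names and the `margin` value are assumptions chosen for illustration, not something the slide specifies.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(img_vec, txt_vec, is_match, margin=1.0):
    """Pairwise loss on two embeddings that live in one shared space.

    Matching pairs are pulled together (squared distance); non-matching
    pairs are pushed apart until they are at least `margin` away.
    """
    d = euclidean(img_vec, txt_vec)
    if is_match:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Toy embeddings (hypothetical values).
img = [0.0, 1.0]
txt_match = [0.0, 0.9]
txt_other = [1.0, 0.0]
print(contrastive_loss(img, txt_match, True))    # small: the pair is close
print(contrastive_loss(img, txt_other, False))   # zero: already beyond the margin
```

Only the labels "this pair matches" and "this pair does not match" are needed, which is what separates pairwise methods from the fully supervised ones that use class labels.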
Supervised Method
• Localizing Moments in Video with Natural Language (ICCV2017)
◦ Global Context
• Attentive Moment Retrieval in Videos (SIGIR2018)
◦ Attention
First
3D
• Y2Seq2Seq: Cross-Modal Representation Learning for 3D Shape and Text by Joint Reconstruction and Prediction of View and Word Sequences (AAAI2019)
◦ 3D Cross-Modal Retrieval
◦ 3D
Adversarial Training
• Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval (CVPR2018)
◦ Adversarial Training
Adversarial Training
• Coupled CycleGAN: Unsupervised Hashing Network for Cross-Modal Retrieval (AAAI2019)
◦ 2 GAN
◦ Outer Cycle GAN
◦ Inner Cycle GAN
Consistency Loss
• Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models (CVPR2018)
◦ Decoder, Adversarial Training
◦ Adversarial Training
Consistency Loss
• Learning Cross-Modal Embeddings with Adversarial Networks for Cooking Recipes and Food Images (CVPR2019)
◦ Metric Learning, Adversarial Training, Consistency Loss
• ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
◦ Vision Language BERT
◦ Vision->Language / Language->Vision Attention (Co-Attention Transformer)
◦ Mask
◦ Vision/Language Encoder, BERT
Audio-Visual Embedding
Audio-Visual
◦ Audio, Visual ⇒ Alignment
Cross-modal retrieval
• Audio-Visual Embedding Network (AVE-Net)
◦ DNN
◦ Cross-modal, Intra-modal
nDCG@30 (higher is better)
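For readers unfamiliar with the metric, nDCG@k can be computed as in the sketch below. It assumes the linear-gain variant (relevance divided by a log2 rank discount) and that the input list holds a relevance score for every candidate in ranked order. The function names are illustrative, and the exact variant behind the numbers on the slide is not stated here.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=30):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance of retrieved items in ranked order (hypothetical values).
good_ranking = [3, 2, 1, 0]
poor_ranking = [0, 1, 2, 3]
print(ndcg_at_k(good_ranking), ndcg_at_k(poor_ranking))
```

A ranking that places the most relevant items first scores 1.0, and any other ordering of the same items scores lower, which is why higher is better.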
Audio-visual source separation
• Looking to Listen at the Cocktail Party
◦ https://www.youtube.com/watch?v=rVQVAPiJWKU
Sound source localization
• Learning to Localize Sound Source in Visual Scenes
◦ attention
◦ Attention supervised
Image/sound generation
• Speech2Face: Learning the Face Behind a Voice
◦ decoder
• YouTube-8M
• AudioSet
◦ 632 classes, 2,084,320 clips
• AVSpeech
◦ 29,000 ID
• Yahoo Flickr Creative Commons 100M (YFCC100M)
◦ 80 (100M)
◦ Flickr Creative Commons
• VoxCeleb1, 2
◦ YouTube 2000
• SoundNet: Learning Sound Representations from Unlabeled Video (NIPS2016)
◦ SVM
(Audio+Vision)
• Look, Listen and Learn (ICCV2017)
◦ visual, audio
◦ Audio-Visual Correspondence (AVC)
AVC Audio-visual
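The AVC task asks whether a video frame and an audio clip correspond, so its training pairs need no manual labels. The sketch below shows one way such pairs could be sampled. It assumes a hypothetical data layout (a list of dicts with index-aligned `frames` and `audio` lists) and a 50/50 split between positive and negative pairs; neither detail comes from the slide.

```python
import random


def sample_avc_pair(videos, rng):
    """Return (frame, audio, label) for the correspondence task.

    label 1: frame and audio come from the same video at the same time.
    label 0: the audio is drawn from a different video.
    """
    vid = rng.randrange(len(videos))
    t = rng.randrange(len(videos[vid]["frames"]))
    frame = videos[vid]["frames"][t]
    if rng.random() < 0.5:
        return frame, videos[vid]["audio"][t], 1
    # Negative pair: pick any other video and any of its audio clips.
    other = rng.choice([i for i in range(len(videos)) if i != vid])
    t_other = rng.randrange(len(videos[other]["audio"]))
    return frame, videos[other]["audio"][t_other], 0


# Toy data: each item is tagged with (video id, time step) instead of real media.
toy_videos = [
    {"frames": [(v, t) for t in range(3)], "audio": [(v, t) for t in range(3)]}
    for v in range(4)
]
print(sample_avc_pair(toy_videos, random.Random(0)))
```

A two-stream network trained to predict the label from such pairs has to learn features in which matching frames and sounds agree, which is the self-supervised signal.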
• Objects that Sound (ECCV2018)
◦ L3
◦ Cross-modal retrieval, Sound source localization
AVC
AVOL-Net
• Audio-Visual Scene Analysis with Self-Supervised Multisensory Features (ECCV2018)
◦ Action recognition
◦ Audio-visual source separation
Alignment
AVSS
• The Sound of Pixels (ECCV2018)
◦ PixelPlayer (http://sound-of-pixels.csail.mit.edu/)
◦ Mix-and-Separate
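The Mix-and-Separate idea builds its own supervision: audio from two videos is summed, and the model is trained to recover each source from the mixture. The sketch below shows only how such training targets could be constructed, using plain nested lists in place of real waveforms and magnitude spectrograms. The helper names are hypothetical, and sending ties to the first source is a simplification made here.

```python
def mix_waveforms(wave_a, wave_b):
    """Synthetic mixture: sample-wise sum of two equal-length waveforms."""
    return [a + b for a, b in zip(wave_a, wave_b)]


def binary_masks(spec_a, spec_b):
    """Target masks over time-frequency bins.

    A bin belongs to source A where its magnitude is at least that of
    source B (ties go to A in this sketch); source B gets the complement.
    """
    mask_a = [[1 if a >= b else 0 for a, b in zip(row_a, row_b)]
              for row_a, row_b in zip(spec_a, spec_b)]
    mask_b = [[1 - m for m in row] for row in mask_a]
    return mask_a, mask_b


# Toy magnitude "spectrograms" (hypothetical values).
spec_a = [[3, 0], [1, 2]]
spec_b = [[1, 4], [1, 5]]
mask_a, mask_b = binary_masks(spec_a, spec_b)
print(mask_a, mask_b)
```

Because the individual sources are known before mixing, the masks serve as ground truth without any manual annotation.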
• Speech2Face: Learning the Face Behind a Voice (CVPR2019)
◦ Encoder
• Talking Face Generation by Adversarially Disentangled Audio-Visual Representation (AAAI2019)
◦ disentangle
• Cross-Modal Embeddings: Image/Text Cross-Modal Retrieval, Audio/Vision Audio-Visual Embeddings
• Cross-Modal Retrieval: Image, Text, Video, 3D, Adversarial Training
• Audio-Visual
• Image/Text/Audio/Video Cross-Modal
Kaiye Wang, Qiyue Yin, Wei Wang, Shu Wu, Liang Wang: A Comprehensive Survey on Cross-modal Retrieval
T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, Y. Zheng: NUS-WIDE: A real-world web image database from National University of Singapore
Sung Ju Hwang, Kristen Grauman: Reading between the Lines: Object Localization Using Implicit Cues from Image Tags
Peter Young, Alice Lai, Micah Hodosh, Julia Hockenmaier: From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions
Hao Wang, Doyen Sahoo, Chenghao Liu, Ee-peng Lim, Steven C. H. Hoi: Learning Cross-Modal Embeddings with Adversarial Networks for Cooking Recipes and Food Images
J. Zhou, G. Ding, Y. Guo: Latent Semantic Sparse Hashing for Cross-Modal Similarity Search
Amaia Salvador, Nicholas Hynes, Yusuf Aytar, Javier Marin, Ferda Ofli, Ingmar Weber, Antonio Torralba: Learning Cross-modal Embeddings for Cooking Recipes and Food Images
Micael Carvalho, Rémi Cadène, David Picard, Laure Soulier, Nicolas Thome, Matthieu Cord: Cross-Modal Retrieval in the Cooking Context: Learning Semantic Text-Image Embeddings
Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, Sanja Fidler: VSE++: Improving Visual-Semantic Embeddings with Hard Negatives
Alexander Hermans, Lucas Beyer, Bastian Leibe: In Defense of the Triplet Loss for Person Re-Identification
Chao Li, Cheng Deng, Ning Li, Wei Liu, Xinbo Gao, Dacheng Tao: Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval
Nikhil Rasiwasia, Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Gert R.G. Lanckriet, Roger Levy, Nuno Vasconcelos: A New Approach to Cross-Modal Multimedia Retrieval
Ting Yao, Tao Mei, Chong-Wah Ngo: Learning Query and Image Similarities with Ranking Canonical Correlation Analysis
Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell: Localizing Moments in Video with Natural Language
Zhu Zhang, Zhijie Lin, Zhou Zhao, Zhenxin Xiao: Attentive Moment Retrieval in Videos
Zhizhong Han, Mingyang Shang, Xiyang Wang, Yu-Shen Liu, Matthias Zwicker: Y2Seq2Seq: Cross-Modal Representation Learning for 3D Shape and Text by Joint Reconstruction and Prediction of View and Word Sequences
Chao Li, Cheng Deng, Lei Wang, De Xie, Xianglong Liu: Coupled CycleGAN: Unsupervised Hashing Network for Cross-Modal Retrieval
Jiuxiang Gu, Jianfei Cai, Shafiq Joty, Li Niu, Gang Wang: Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models
Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Yusuf Aytar, Carl Vondrick, Antonio Torralba: SoundNet: Learning Sound Representations from Unlabeled Video
Relja Arandjelović, Andrew Zisserman: Objects that Sound
Relja Arandjelović, Andrew Zisserman: Look, Listen and Learn
Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, Antonio Torralba: The Sound of Pixels
Andrew Owens, Alexei A. Efros: Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
Tae-Hyun Oh, Tali Dekel, Changil Kim, Inbar Mosseri, William T. Freeman, Michael Rubinstein, Wojciech Matusik: Speech2Face: Learning the Face Behind a Voice
Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T. Freeman, Michael Rubinstein: Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, In So Kweon: Learning to Localize Sound Source in Visual Scenes
Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, Xiaogang Wang: Talking Face Generation by Adversarially Disentangled Audio-Visual Representation

Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.Nitya salvi
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRDelhi Call girls
 

A Survey on Cross-Modal Embedding

  • 1. A Survey on Cross-Modal Embedding
  • 3. ■ Cross-Modal Embedding ■ Cross-Modal Retrieval ■ Audio-Visual Embedding
  • 7. Cross-Modal Embedding ■ Cross-Modal Retrieval ◦ 3D ◦ Adversarial Training, Consistency Loss ■ Audio-Visual Embedding ◦ Web
  • 9. Cross-Modal Retrieval ■ Text <-> Image Wikipedia
  • 12. Cross-Modal Retrieval • Real-Valued Representation • Binary Representation • Unsupervised Method • Pairwise-based Method • Supervised Method
  • 16. ■ Localizing Moments in Video with Natural Language (ICCV 2017) ◦ Global Context
  • 17. ■ Attentive Moment Retrieval in Videos (SIGIR 2018) ◦ Attention First
  • 18. 3D ■ Y2Seq2Seq: Cross-Modal Representation Learning for 3D Shape and Text by Joint Reconstruction and Prediction of View and Word Sequences (AAAI 2019) ◦ 3D Cross-Modal Retrieval ◦ 3D
  • 19. Adversarial Training ■ Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval (CVPR 2018) ◦ Adversarial Training
  • 20. Adversarial Training ■ Coupled CycleGAN: Unsupervised Hashing Network for Cross-Modal Retrieval (AAAI 2019) ◦ 2 GANs ◦ Outer Cycle GAN ◦ Inner Cycle GAN
  • 21. Consistency Loss ■ Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models (CVPR 2018) ◦ Decoder Adversarial Training ◦ Adversarial Training
  • 22. Consistency Loss ■ Learning Cross-Modal Embeddings with Adversarial Networks for Cooking Recipes and Food Images (CVPR 2019) ◦ Metric Learning, Adversarial Training, Consistency Loss
  • 23. ■ ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks ◦ Vision-and-Language BERT ◦ Vision->Language and Language->Vision Attention (Co-Attention Transformer) ◦ Mask ◦ Vision/Language Encoder BERT
  • 25. Audio-Visual Embedding ◦ Audio Visual ⇒ Alignment
  • 26. Audio-Visual Cross-modal retrieval ■ Audio-Visual Embedding Network (AVE-Net) ◦ DNN ◦ Cross-modal Intra-modal
  • 27. Audio-Visual Cross-modal retrieval ■ Audio-Visual Embedding Network (AVE-Net) ◦ DNN ◦ Cross-modal Intra-modal nDCG@30 (Higher is better)
  • 28. Audio-Visual Audio-visual source separation ■ Looking to Listen at the Cocktail Party ◦ https://www.youtube.com/watch?v=rVQVAPiJWKU
  • 29. Audio-Visual Sound source localization ■ Learning to Localize Sound Source in Visual Scenes ◦ Attention ◦ Attention supervised
  • 30. Audio-Visual Image/sound generation ■ Speech2Face: Learning the Face Behind a Voice ◦ Decoder
  • 31. ■ YouTube-8M ■ AudioSet ◦ 632 2,084,320 ■ AVSpeech ◦ 29,000 ID ■ Yahoo Flickr Creative Commons 100M (YFCC100M) ◦ 80 (100M ) ◦ Flickr Creative Commons ■ VoxCeleb1, 2 ◦ YouTube 2000
  • 32. ■ SoundNet: Learning Sound Representations from Unlabeled Video (NIPS 2016) ◦ SVM (Audio+Vision)
  • 33. ■ Look, Listen and Learn (ICCV 2017) ◦ Visual, audio ◦ Audio-Visual Correspondence (AVC) ◦ AVC Audio-visual
  • 35. ■ Objects that Sound (ECCV 2018) ◦ L3 ◦ Cross-modal retrieval Sound source localization AVC AVOL-Net
  • 37. ■ Audio-Visual Scene Analysis with Self-Supervised Multisensory Features (ECCV 2018) ◦ Action recognition ◦ Audio-visual source separation Alignment
  • 38. AVSS ■ The Sound of Pixels (ECCV 2018) ◦ PixelPlayer (http://sound-of-pixels.csail.mit.edu/) ◦ Mix-and-Separate
  • 44. ■ Speech2Face: Learning the Face Behind a Voice (CVPR 2019) ◦ Encoder
  • 45. ■ Talking Face Generation by Adversarially Disentangled Audio-Visual Representation (AAAI 2019) ◦ Disentangle
  • 47. ■ Cross-Modal Embeddings: Image Text Cross-Modal Retrieval, Audio Vision Audio-Visual Embeddings ■ Cross-Modal Retrieval: Image Text Video 3D Adversarial Training ■ Audio-Visual ■ Image/Text/Audio/Video Cross-Modal
  • 48.
      ◦ Kaiye Wang, Qiyue Yin, Wei Wang, Shu Wu, Liang Wang: A Comprehensive Survey on Cross-modal Retrieval
      ◦ T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, Y. Zheng: NUS-WIDE: A real-world web image database from National University of Singapore
      ◦ Sung Ju Hwang, Kristen Grauman: Reading between the Lines: Object Localization Using Implicit Cues from Image Tags
      ◦ Peter Young, Alice Lai, Micah Hodosh, Julia Hockenmaier: From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions
      ◦ Hao Wang, Doyen Sahoo, Chenghao Liu, Ee-peng Lim, Steven C. H. Hoi: Learning Cross-Modal Embeddings with Adversarial Networks for Cooking Recipes and Food Images
      ◦ J. Zhou, G. Ding, Y. Guo: Latent Semantic Sparse Hashing for Cross-Modal Similarity Search
      ◦ Amaia Salvador, Nicholas Hynes, Yusuf Aytar, Javier Marin, Ferda Ofli, Ingmar Weber, Antonio Torralba: Learning Cross-modal Embeddings for Cooking Recipes and Food Images
      ◦ Micael Carvalho, Rémi Cadène, David Picard, Laure Soulier, Nicolas Thome, Matthieu Cord: Cross-Modal Retrieval in the Cooking Context: Learning Semantic Text-Image Embeddings
      ◦ Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, Sanja Fidler: VSE++: Improving Visual-Semantic Embeddings with Hard Negatives
      ◦ Alexander Hermans, Lucas Beyer, Bastian Leibe: In Defense of the Triplet Loss for Person Re-Identification
      ◦ Chao Li, Cheng Deng, Ning Li, Wei Liu, Xinbo Gao, Dacheng Tao: Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval
  • 49.
      ◦ Nikhil Rasiwasia, Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Gert R.G. Lanckriet, Roger Levy, Nuno Vasconcelos: A New Approach to Cross-Modal Multimedia Retrieval
      ◦ Ting Yao, Tao Mei, Chong-Wah Ngo: Learning Query and Image Similarities with Ranking Canonical Correlation Analysis
      ◦ Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell: Localizing Moments in Video with Natural Language
      ◦ Zhu Zhang, Zhijie Lin, Zhou Zhao, Zhenxin Xiao: Attentive Moment Retrieval in Videos
      ◦ Zhizhong Han, Mingyang Shang, Xiyang Wang, Yu-Shen Liu, Matthias Zwicker: Y2Seq2Seq: Cross-Modal Representation Learning for 3D Shape and Text by Joint Reconstruction and Prediction of View and Word Sequences
      ◦ Chao Li, Cheng Deng, Ning Li, Wei Liu, Xinbo Gao, Dacheng Tao: Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval
      ◦ Chao Li, Cheng Deng, Lei Wang, De Xie, Xianglong Liu: Coupled CycleGAN: Unsupervised Hashing Network for Cross-Modal Retrieval
      ◦ Jiuxiang Gu, Jianfei Cai, Shafiq Joty, Li Niu, Gang Wang: Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models
      ◦ Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
  • 50.
      ◦ Yusuf Aytar, Carl Vondrick, Antonio Torralba: SoundNet: Learning Sound Representations from Unlabeled Video
      ◦ Relja Arandjelović, Andrew Zisserman: Objects that Sound
      ◦ Relja Arandjelović, Andrew Zisserman: Look, Listen and Learn
      ◦ Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, Antonio Torralba: The Sound of Pixels
      ◦ Andrew Owens, Alexei A. Efros: Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
      ◦ Tae-Hyun Oh, Tali Dekel, Changil Kim, Inbar Mosseri, William T. Freeman, Michael Rubinstein, Wojciech Matusik: Speech2Face: Learning the Face Behind a Voice
      ◦ Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T. Freeman, Michael Rubinstein: Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
      ◦ Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, In So Kweon: Learning to Localize Sound Source in Visual Scenes
      ◦ Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, Xiaogang Wang: Talking Face Generation by Adversarially Disentangled Audio-Visual Representation
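
Note on slide 27: the AVE-Net retrieval results are reported as nDCG@30 (higher is better). The sketch below shows one common way to compute that metric, assuming linear gain with a log2 rank discount and assuming the ideal ranking is built from the retrieved list itself. The slides do not say which nDCG variant was used, and the function names `dcg_at_k` and `ndcg_at_k` are illustrative only.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items of a ranked list.

    Assumption: linear gain (rel) and log2(rank + 1) discount; other
    variants (e.g. 2**rel - 1 gain) exist and give different numbers.
    """
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k=30):
    """Normalised DCG: DCG of the given ranking divided by the ideal DCG.

    `relevances` holds the graded relevance of each retrieved item,
    in the order the system ranked them.
    """
    ideal = sorted(relevances, reverse=True)  # best possible ordering
    idcg = dcg_at_k(ideal, k)
    if idcg == 0:
        return 0.0  # no relevant items at all
    return dcg_at_k(relevances, k) / idcg


# Hypothetical relevance grades for two rankings of the same four items.
best_first = ndcg_at_k([3, 2, 1, 0], k=30)
worst_first = ndcg_at_k([0, 1, 2, 3], k=30)
print(best_first, worst_first)
```

A ranking that already places the most relevant items first scores 1.0, and any other ordering of the same items scores lower, which is why the slide marks the metric as "higher is better".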