18. 3D
■ Y2Seq2Seq: Cross-Modal Representation Learning for 3D Shape and Text by Joint Reconstruction and Prediction of View and Word Sequences (AAAI 2019)
◦ 3D shape ↔ text Cross-Modal Retrieval
◦ Represents a 3D shape as a view sequence and text as a word sequence, jointly reconstructing and predicting both
20. Adversarial Training
■ Coupled CycleGAN: Unsupervised Hashing Network for Cross-Modal Retrieval (AAAI 2019)
◦ Couples two cycle-consistent GANs for unsupervised cross-modal hashing
◦ Outer Cycle GAN learns a common representation across modalities
◦ Inner Cycle GAN generates binary hash codes from that representation
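The cycle-consistency term that both cycles rely on can be sketched as below; this is a minimal NumPy illustration, not the paper's networks — the "generators" F and G are fixed linear maps chosen so the cycle closes exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "generators" as linear maps between a 4-d image-feature space and a
# 4-d text-feature space. In Coupled CycleGAN these are learned networks;
# here G is the exact inverse of F just to show how the term is computed.
F = rng.standard_normal((4, 4))   # image features -> text features
G = np.linalg.inv(F)              # text features -> image features

def cycle_consistency_loss(x, fwd, bwd):
    """Mean L1 distance between x and bwd(fwd(x)) -- the reconstruction
    penalty that keeps a translation cycle close to the identity."""
    x_rec = bwd @ (fwd @ x)
    return float(np.abs(x - x_rec).mean())

x_img = rng.standard_normal(4)
print(cycle_consistency_loss(x_img, F, G))  # ~0: a perfect inverse closes the cycle
```

With learned generators this loss is minimized jointly with the adversarial terms, so neither modality's translation can drift away from its source.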
21. Consistency Loss
■ Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models (CVPR 2018)
◦ Adds generative Decoders on top of the cross-modal embedding to regenerate one modality from the other
◦ Trains them with Adversarial Training, penalizing an embedding unless it retains enough information to reconstruct the paired modality
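The consistency idea can be sketched as a simple feature-matching penalty; a minimal NumPy sketch, assuming cosine distance between a decoded feature and the ground-truth feature of the other modality (the vectors are illustrative toy values).

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def consistency_loss(decoded, target):
    """1 - cosine similarity: near 0 when the feature decoded from one
    modality matches the ground-truth feature of the other modality."""
    return 1.0 - cosine(decoded, target)

image_feat = np.array([1.0, 0.0, 0.0])
decoded_ok = np.array([0.9, 0.1, 0.0])   # text embedding decodes close to the image
decoded_bad = np.array([0.0, 1.0, 0.0])  # text embedding decodes to something unrelated

print(consistency_loss(decoded_ok, image_feat) < consistency_loss(decoded_bad, image_feat))  # True
```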
22. Consistency Loss
■ Learning Cross-Modal Embeddings with Adversarial Networks for Cooking Recipes and Food Images (CVPR 2019)
◦ Combines Metric Learning, Adversarial Training, and a Consistency Loss to align recipe and food-image embeddings
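The Metric Learning component is typically a triplet loss over paired recipe and image embeddings; a minimal NumPy sketch with illustrative toy vectors, not the paper's actual features.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge loss: zero once the positive is closer than the negative by `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

recipe = np.array([0.0, 1.0])        # anchor: recipe embedding
matched_img = np.array([0.1, 0.9])   # positive: its paired food image
other_img = np.array([1.0, 0.0])     # negative: image of a different dish

print(triplet_loss(recipe, matched_img, other_img))     # 0.0 -- already separated
print(triplet_loss(recipe, other_img, matched_img) > 0)  # True -- swapped pair is penalized
```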
45.
■ Talking Face Generation by Adversarially Disentangled Audio-Visual Representation (AAAI 2019)
◦ Generates a talking face from either audio or video input
◦ Adversarially disentangles the audio-visual representation into speech content and speaker identity
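One common way to express the adversarial disentanglement objective is a confusion loss on an identity classifier fed the content code; a minimal NumPy sketch of that idea, not the paper's exact formulation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def confusion_loss(identity_logits):
    """Cross-entropy against a uniform target over speaker identities.
    It is minimized when the identity classifier is maximally confused,
    i.e. when the speech-content code carries no identity information."""
    p = softmax(identity_logits)
    return float(-np.mean(np.log(p + 1e-12)))

leaky = np.array([5.0, 0.0, 0.0])   # content code still reveals the speaker
clean = np.zeros(3)                 # classifier reduced to a uniform guess

print(confusion_loss(clean) < confusion_loss(leaky))  # True
```

Training the encoder to minimize this term while the classifier minimizes its ordinary cross-entropy drives identity information out of the content code.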
47.
■ Covered Cross-Modal Embeddings: Image–Text Cross-Modal Retrieval and Audio-Visual Embeddings (Audio ↔ Vision)
■ Cross-Modal Retrieval extends from Image–Text to Video and 3D, with Adversarial Training used throughout
■ Audio-Visual embeddings support applications such as sound source localization, speech separation, and talking face generation
■ Cross-Modal learning across Image/Text/Audio/Video remains a direction for future work
48. Kaiye Wang, Qiyue Yin, Wei Wang, Shu Wu, Liang Wang: A Comprehensive Survey on Cross-modal Retrieval
T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng: NUS-WIDE: A real-world web image database from National
University of Singapore
Sung Ju Hwang, Kristen Grauman: Reading between the Lines: Object Localization Using Implicit Cues from Image
Tags
Peter Young, Alice Lai, Micah Hodosh, Julia Hockenmaier: From image descriptions to visual denotations: New
similarity metrics for semantic inference over event descriptions
Hao Wang, Doyen Sahoo, Chenghao Liu, Ee-peng Lim, Steven C. H. Hoi: Learning Cross-Modal Embeddings with
Adversarial Networks for Cooking Recipes and Food Images
J. Zhou, G. Ding, and Y. Guo: Latent Semantic Sparse Hashing for Cross-Modal Similarity Search
Amaia Salvador, Nicholas Hynes, Yusuf Aytar, Javier Marin, Ferda Ofli, Ingmar Weber, Antonio Torralba:
Learning Cross-modal Embeddings for Cooking Recipes and Food Images
Micael Carvalho, Rémi Cadène, David Picard, Laure Soulier, Nicolas Thome, Matthieu Cord: Cross-Modal Retrieval in
the Cooking Context: Learning Semantic Text-Image Embeddings
Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, Sanja Fidler: VSE++: Improving Visual-Semantic Embeddings with
Hard Negatives
Alexander Hermans, Lucas Beyer, Bastian Leibe: In Defense of the Triplet Loss for Person Re-Identification
Chao Li, Cheng Deng, Ning Li, Wei Liu, Xinbo Gao, Dacheng Tao: Self-Supervised Adversarial Hashing Networks for
Cross-Modal Retrieval
49. Nikhil Rasiwasia, Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle,
Gert R.G. Lanckriet, Roger Levy, Nuno Vasconcelos: A New Approach to Cross-Modal Multimedia Retrieval
Ting Yao, Tao Mei, Chong-Wah Ngo: Learning Query and Image Similarities with Ranking Canonical Correlation Analysis
Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell: Localizing
Moments in Video with Natural Language
Zhu Zhang, Zhijie Lin, Zhou Zhao and Zhenxin Xiao: Attentive Moment Retrieval in Videos
Zhizhong Han, Mingyang Shang, Xiyang Wang, Yu-Shen Liu, Matthias Zwicker: Y2Seq2Seq: Cross-Modal
Representation Learning for 3D Shape and Text by Joint Reconstruction and Prediction of View and Word Sequences
Chao Li, Cheng Deng, Lei Wang, De Xie, Xianglong Liu: Coupled CycleGAN: Unsupervised Hashing Network
for Cross-Modal Retrieval
Jiuxiang Gu, Jianfei Cai, Shafiq Joty, Li Niu, Gang Wang: Look, Imagine and Match: Improving Textual-Visual Cross-
Modal Retrieval with Generative Models
Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee: ViLBERT: Pretraining Task-Agnostic Visiolinguistic
Representations for Vision-and-Language Tasks
50. Yusuf Aytar, Carl Vondrick, Antonio Torralba: SoundNet: Learning Sound Representations from Unlabeled Video
Relja Arandjelović, Andrew Zisserman: Objects that Sound
Relja Arandjelović, Andrew Zisserman: Look, Listen and Learn
Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, Antonio Torralba: The Sound of
Pixels
Andrew Owens, Alexei A. Efros: Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
Tae-Hyun Oh, Tali Dekel, Changil Kim, Inbar Mosseri, William T. Freeman, Michael Rubinstein, Wojciech Matusik:
Speech2Face: Learning the Face Behind a Voice
Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T. Freeman, Michael
Rubinstein: Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, In So Kweon: Learning to Localize Sound Source in Visual
Scenes
Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, Xiaogang Wang:
Talking Face Generation by Adversarially Disentangled Audio-Visual Representation