https://telecombcn-dl.github.io/dlmm-2017-dcu/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks that had until now been addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand-new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
Transfer Learning
Owens et al. Ambient Sound Provides Supervision for Visual Learning. ECCV 2016
Audio features help learn semantically meaningful visual representation
Transfer Learning
Owens et al. Ambient Sound Provides Supervision for Visual Learning. ECCV 2016
Images are grouped into audio clusters; the task is to predict the audio cluster of a given image (a classification problem).
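As a rough illustration of this pseudo-labelling setup, the sketch below clusters audio features with k-means and trains a small image CNN to predict the cluster id of each frame; the architecture, feature sizes and data are synthetic stand-ins, not the authors' pipeline.

```python
# Audio clusters as pseudo-labels for visual learning (illustrative sketch only).
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

n_videos, n_clusters = 512, 30
audio_feats = np.random.randn(n_videos, 128).astype("float32")  # stand-in audio statistics
frames = torch.randn(n_videos, 3, 64, 64)                       # stand-in video frames

# 1) Group audio features into clusters; cluster ids become pseudo-labels.
pseudo_labels = torch.tensor(
    KMeans(n_clusters=n_clusters).fit_predict(audio_feats), dtype=torch.long
)

# 2) Train an image CNN to predict the audio cluster of its frame (plain classification).
cnn = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, n_clusters),
)
opt = torch.optim.Adam(cnn.parameters(), lr=1e-3)
loss = nn.CrossEntropyLoss()(cnn(frames), pseudo_labels)
loss.backward()
opt.step()
```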
Transfer Learning
Owens et al. Ambient Sound Provides Supervision for Visual Learning. ECCV 2016
Although the model was not trained with class labels, semantically meaningful units emerge.
Transfer Learning: SoundNet
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Pretrained visual ConvNets supervise the training of a model for sound representation
Training videos are unlabeled; the approach relies on ConvNets trained on labeled images.
Transfer Learning: SoundNet
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Kullback-Leibler Divergence: $D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$
Cross-Entropy: $H(P, Q) = -\sum_{x} P(x) \log Q(x) = H(P) + D_{KL}(P \,\|\, Q)$
where $P$ is the probability distribution predicted by the vision network and $Q$ is the probability distribution predicted by the sound network. Since the teacher distribution $P$ is fixed, minimizing the cross-entropy with respect to the sound network is equivalent to minimizing the KL divergence.
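A minimal sketch of this teacher-student objective (not the SoundNet implementation): a stand-in 1D-conv sound network is trained on synthetic data to match the output distribution of a frozen stand-in vision network by minimizing KL(P‖Q).

```python
# Teacher-student transfer from vision to sound via KL divergence (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

n_classes = 1000                         # illustrative number of teacher categories

vision_net = nn.Linear(2048, n_classes)  # stand-in for a pretrained visual ConvNet head
sound_net = nn.Sequential(               # stand-in for a raw-waveform sound ConvNet
    nn.Conv1d(1, 16, 64, stride=8), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(16, n_classes),
)

frames_feat = torch.randn(8, 2048)       # pooled visual features of 8 video clips
waveforms = torch.randn(8, 1, 22050)     # the corresponding raw audio

with torch.no_grad():
    p_vision = F.softmax(vision_net(frames_feat), dim=1)   # P: teacher distribution
log_q_sound = F.log_softmax(sound_net(waveforms), dim=1)   # log Q: student distribution

# KL(P || Q): gradients flow only into the sound network.
loss = F.kl_div(log_q_sound, p_vision, reduction="batchmean")
loss.backward()
```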
Recognizing Objects and Scenes from Sound
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
Hidden layers of SoundNet are used to train a standard SVM classifier that outperforms the state of the art.
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
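A minimal sketch of this evaluation protocol, assuming the hidden-layer activations have already been extracted (random matrices stand in for them here): the features are simply fed to a linear SVM.

```python
# Linear SVM on top of pre-extracted hidden-layer activations (illustrative sketch).
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

X_train = np.random.randn(200, 1024)      # hidden-layer activations per audio clip
y_train = np.random.randint(0, 10, 200)   # scene/object labels of the paired videos
X_test = np.random.randn(50, 1024)
y_test = np.random.randint(0, 10, 50)

svm = LinearSVC(C=1.0).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, svm.predict(X_test)))
```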
Visualization of the 1D filters over raw audio in conv1.
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
Visualization of the samples that most activate a neuron in a late layer (conv7).
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
Visualization of the video frames associated with the sounds that activate some of the last hidden units (conv7):
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
Hearing sounds that most activate a neuron in the sound network (conv7)
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
Hearing sounds that most activate a neuron in the sound network (conv5)
Sonorization
Owens et al. Visually indicated sounds. CVPR 2016
Learn to synthesize sounds from videos of people hitting objects with a drumstick.
Are features learned when training for sound prediction useful to identify the object’s material?
Sonorization
Owens et al. Visually indicated sounds. CVPR 2016
● Map video sequences to sequences of sound features.
● Sound features in the output are converted into a waveform by example-based synthesis, i.e. nearest-neighbor retrieval (see the sketch below).
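A minimal sketch of example-based synthesis on synthetic data: each predicted sound-feature vector is replaced by the waveform snippet of its nearest neighbour in a training database, and the retrieved snippets are concatenated.

```python
# Example-based synthesis via nearest-neighbor retrieval (illustrative sketch).
import numpy as np
from sklearn.neighbors import NearestNeighbors

feat_dim, snippet_len = 42, 735                       # illustrative sizes
train_feats = np.random.randn(1000, feat_dim)         # sound features of training examples
train_snippets = np.random.randn(1000, snippet_len)   # matching waveform snippets

predicted_feats = np.random.randn(30, feat_dim)       # features predicted from a new video

nn_index = NearestNeighbors(n_neighbors=1).fit(train_feats)
_, idx = nn_index.kneighbors(predicted_feats)
waveform = np.concatenate([train_snippets[i[0]] for i in idx])  # naive concatenation
```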
Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017
Speech Generation from Video
Pipeline: frame from a silent video → CNN (VGG) → audio feature → post-hoc synthesis
Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017
Speech Generation from Video
Waveform processing: downsample to 8 kHz → split into 40 ms chunks (50% overlap) → LPC analysis → 9-dim LPC features
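A minimal sketch of this processing chain using librosa on a synthetic signal; the exact LPC settings of the paper may differ.

```python
# Downsample, frame into overlapping 40 ms chunks, and compute LPC per chunk (illustrative sketch).
import numpy as np
import librosa

sr_in, sr_out = 44100, 8000
y = np.random.randn(sr_in * 2).astype("float32")             # stand-in 2 s recording
y8k = librosa.resample(y, orig_sr=sr_in, target_sr=sr_out)   # downsample to 8 kHz

frame_len = int(0.040 * sr_out)   # 40 ms -> 320 samples
hop = frame_len // 2              # 50% overlap

frames = [y8k[i:i + frame_len] for i in range(0, len(y8k) - frame_len + 1, hop)]
# librosa.lpc(order=k) returns k+1 coefficients (the leading one is always 1),
# so order=9 yields 9 informative coefficients per chunk.
lpc_feats = np.stack([librosa.lpc(f, order=9)[1:] for f in frames])
```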
Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017
Speech Generation from Video
Waveform processing (2 LSP features per frame)
Speech Generation from Video
Cooke et al. An audio-visual corpus for speech perception and automatic speech recognition. Journal of the Acoustical Society of America, 2006
GRID Dataset
Speech Generation from Video
Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017
Assael et al. LipNet: Sentence-level Lipreading. arXiv Nov 2016
Lip Reading: LipNet
Assael et al. LipNet: Sentence-level Lipreading. arXiv Nov 2016
Lip Reading: LipNet
Input (frames) and output (sentence) sequences are not aligned
Graves et al. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 2006
Lip Reading: LipNet
CTC Loss: Connectionist temporal classification
● Avoids the need for alignment between the input and output sequences by predicting an additional blank token “_”.
● Before computing the loss, repeated tokens are merged and blank tokens are removed.
● “a _ a b _” == “a a _ a b b” == “a a b” (see the sketch below)
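A minimal sketch of the collapse rule and of a CTC loss call with PyTorch's nn.CTCLoss; sizes and targets are synthetic.

```python
# CTC collapse rule and a CTC loss call (illustrative sketch).
import torch
import torch.nn as nn

def ctc_collapse(tokens, blank="_"):
    # Merge repeated tokens first, then drop the blank symbol.
    out, prev = [], None
    for t in tokens:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

print(ctc_collapse("a _ a b _".split()))    # ['a', 'a', 'b']
print(ctc_collapse("a a _ a b b".split()))  # ['a', 'a', 'b']

# Training with nn.CTCLoss: log-probabilities are (T, N, C), blank index is 0.
T, N, C, S = 50, 4, 28, 10                  # illustrative sizes
log_probs = torch.randn(T, N, C).log_softmax(dim=2)
targets = torch.randint(1, C, (N, S))
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           input_lengths=torch.full((N,), T),
                           target_lengths=torch.full((N,), S))
```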
Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
Lip Reading: Watch, Listen, Attend & Spell
Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
Lip Reading: Watch, Listen, Attend & Spell
Audio features and image features are extracted from the two input streams.
Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
Lip Reading: Watch, Listen, Attend & Spell
Attention over the output states from audio and video is computed at each timestep.
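A minimal sketch of attending over the states of both encoders with plain dot-product attention; this illustrates the idea rather than the exact attention mechanism of the paper.

```python
# Per-step attention over audio and video encoder states (illustrative sketch).
import torch
import torch.nn.functional as F

d = 256
audio_states = torch.randn(1, 60, d)   # (batch, audio timesteps, dim)
video_states = torch.randn(1, 25, d)   # (batch, video timesteps, dim)
decoder_state = torch.randn(1, d)      # current speller/decoder state

def attend(query, states):
    scores = torch.bmm(states, query.unsqueeze(2)).squeeze(2)  # (batch, timesteps)
    weights = F.softmax(scores, dim=1)
    return torch.bmm(weights.unsqueeze(1), states).squeeze(1)  # (batch, dim)

audio_ctx = attend(decoder_state, audio_states)
video_ctx = attend(decoder_state, video_states)
context = torch.cat([audio_ctx, video_ctx], dim=1)  # fed to the character decoder
```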
Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
Lip Reading: Watch, Listen, Attend & Spell
Lip Reading: Watch, Listen, Attend & Spell
Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
LipNet
Lip Reading: Watch, Listen, Attend & Spell
Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
Cross-Modality
Ngiam et al. Multimodal Deep Learning. ICML 2011
A denoising autoencoder is trained to reconstruct both modalities given only one of them, or both, as input.
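A minimal sketch of the idea, assuming simple fully connected encoders/decoders and synthetic features: one modality is zeroed at the input and both modalities must be reconstructed from the shared layer.

```python
# Bimodal denoising autoencoder with modality dropout (illustrative sketch).
import torch
import torch.nn as nn

d_audio, d_video, d_shared = 100, 300, 128

encoder = nn.Sequential(nn.Linear(d_audio + d_video, d_shared), nn.ReLU())
decoder = nn.Linear(d_shared, d_audio + d_video)

audio = torch.randn(32, d_audio)
video = torch.randn(32, d_video)
target = torch.cat([audio, video], dim=1)

# "Denoising": drop one modality at the input (here, the audio half is zeroed out).
corrupted = torch.cat([torch.zeros_like(audio), video], dim=1)

shared = encoder(corrupted)                   # shared multimodal representation
loss = nn.MSELoss()(decoder(shared), target)  # reconstruct BOTH modalities
loss.backward()
```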
Cross-Modality
Ngiam et al. Multimodal Deep Learning. ICML 2011
Video classification: add a linear SVM on top of the learned features.
● Combining audio & video as inputs gives a relative improvement (on AVLetters).
● Using video as input to predict both audio & video gives state-of-the-art results on both.
Cross-Modality
Ngiam et al. Multimodal Deep Learning. ICML 2011
Audio classification: add a linear SVM on top of the learned features.
● The performance of cross-modal features is below the state of the art.
● Still, remarkable performance is obtained when using video as input to classify audio.
Shared Learning
Ngiam et al. Multimodal Deep Learning. ICML 2011
First, learn a shared representation using the two modalities as input
Shared Learning
Ngiam et al. Multimodal Deep Learning. ICML 2011
Then, train linear classifiers on top of shared representations (previously trained on audio/video
data) using only one of the two modalities as input. Testing is performed with the other modality.
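A minimal sketch of this protocol with synthetic shared features: a linear classifier is fit on representations computed from audio and evaluated on representations computed from video.

```python
# "Learn with one modality, test with the other" on shared features (illustrative sketch).
import numpy as np
from sklearn.svm import LinearSVC

shared_from_audio = np.random.randn(200, 128)  # shared features when only audio is fed
shared_from_video = np.random.randn(80, 128)   # shared features when only video is fed
y_train = np.random.randint(0, 10, 200)
y_test = np.random.randint(0, 10, 80)

clf = LinearSVC().fit(shared_from_audio, y_train)          # train on one modality...
print((clf.predict(shared_from_video) == y_test).mean())   # ...test with the other
```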