Amaia Salvador
amaia.salvador@upc.edu
PhD Candidate
Universitat Politècnica de Catalunya
DEEP
LEARNING
WORKSHOP
Dublin City University
27-28 April 2017
Audio & Vision
D2 L9
2
Multimedia
Text
Audio
Vision
3
Audio & Vision
● Transfer Learning
● Sonorization
● Speech generation
● Lip Reading
5
Transfer Learning
Owens et al. Ambient Sound Provides Supervision for Visual Learning. ECCV 2016
Audio features help to learn semantically meaningful visual representations
6
Transfer Learning
Owens et al. Ambient Sound Provides Supervision for Visual Learning. ECCV 2016
Images are grouped into audio clusters.
The task is to predict the audio cluster for a given image (classification)
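A minimal sketch of the pseudo-labelling idea on this slide, assuming plain k-means over per-clip audio feature vectors. The feature values, the number of clusters and the helper name `kmeans_labels` are illustrative and are not taken from the paper. The resulting cluster index is the target an image classifier would be trained to predict.

```python
import numpy as np

def kmeans_labels(features, k, n_iter=20, seed=0):
    """Assign each row of `features` to one of k clusters (Lloyd's k-means)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points (fancy indexing returns a copy)
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every sample to every center
        d = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels

# Hypothetical audio statistics for 6 video clips (two well-separated groups)
audio_feats = np.array([[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
                        [5.0, 5.1], [5.1, 5.0], [4.9, 5.0]])
pseudo_labels = kmeans_labels(audio_feats, k=2)
```

Each frame taken from clip `i` would then be paired with `pseudo_labels[i]` as its classification target.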
7
Transfer Learning
Owens et al. Ambient Sound Provides Supervision for Visual Learning. ECCV 2016
Although the model was not trained with class labels, units with semantic meaning emerge
8
Transfer Learning: SoundNet
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Pretrained visual ConvNets supervise the training of a model for sound representation
9
The training videos are unlabeled. The method relies on ConvNets trained on labeled images.
Transfer Learning: SoundNet
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
10
Kullback-Leibler Divergence: $D_{KL}(g \,\|\, f) = \sum_i g_i \log \frac{g_i}{f_i}$
Cross-Entropy: $H(g, f) = -\sum_i g_i \log f_i$
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
$g$: prob. distribution of vision network
$f$: prob. distribution of sound network
Transfer Learning: SoundNet
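The two quantities named on this slide can be computed as in the sketch below. The logits are made-up numbers, and the function names are illustrative; the sketch only assumes that the vision network's output distribution acts as the target and the sound network's as the prediction.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i)."""
    q = np.clip(q, eps, 1.0)
    return float(-np.sum(p * np.log(q)))

vision_probs = softmax(np.array([2.0, 0.5, -1.0]))  # hypothetical teacher output
sound_probs = softmax(np.array([1.0, 1.0, 0.0]))    # hypothetical student output
loss = kl_divergence(vision_probs, sound_probs)
```

Because H(p, q) = H(p) + D_KL(p || q) and the entropy H(p) of the fixed vision network does not depend on the sound network, minimizing either quantity yields the same gradients for the sound network.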
11
Recognizing Objects and Scenes from Sound
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
12
Recognizing Objects and Scenes from Sound
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
13
Features from the hidden layers of SoundNet are used to train a standard SVM classifier that outperforms the state of the art.
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
14
Visualization of the 1D filters over raw audio in conv1.
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
15
Visualization of the 1D filters over raw audio in conv1.
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
16
Visualize the samples that most strongly activate a neuron in a late layer (conv7)
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
17
Visualization of the video frames associated with the sounds that activate some of the last hidden units (conv7):
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
18
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
Hearing sounds that most activate a neuron in the sound network (conv7)
19
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
Hearing sounds that most activate a neuron in the sound network (conv5)
20
Audio & Vision
● Transfer Learning
● Sonorization
● Speech generation
● Lip Reading
21
Sonorization
Owens et al. Visually indicated sounds. CVPR 2016
Learn to synthesize sounds from videos of people hitting objects with a drumstick.
Are the features learned when training for sound prediction useful for identifying the object's material?
22
Sonorization
Owens et al. Visually indicated sounds. CVPR 2016
The Greatest Hits Dataset
23
Sonorization
Owens et al. Visually indicated sounds. CVPR 2016
The Greatest Hits Dataset
24
Sonorization
Owens et al. Visually indicated sounds. CVPR 2016
● Map video sequences to sequences of sound features
● Sound features in the output are converted into a waveform by example-based synthesis (i.e. nearest neighbor retrieval).
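Example-based synthesis by nearest neighbor retrieval can be sketched as follows. The feature bank, the waveforms and the function name `retrieve_waveform` are hypothetical, and Euclidean distance is an assumption rather than the paper's stated metric.

```python
import numpy as np

def retrieve_waveform(predicted_feat, bank_feats, bank_waves):
    """Return the stored waveform whose feature vector is closest (L2) to the prediction."""
    dists = np.linalg.norm(bank_feats - predicted_feat, axis=1)
    return bank_waves[int(dists.argmin())]

# Hypothetical bank of 3 training sounds: one feature vector and one waveform each
bank_feats = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
bank_waves = [np.zeros(4), np.ones(4), np.full(4, 2.0)]

# A predicted sound feature closest to the second bank entry
wave = retrieve_waveform(np.array([0.9, 0.1]), bank_feats, bank_waves)
```

The network never has to generate raw audio samples: it only has to predict a feature vector good enough to select a plausible recorded sound.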
25
Sonorization
Owens et al. Visually indicated sounds. CVPR 2016
26
Sonorization
Owens et al. Visually indicated sounds. CVPR 2016
27
Sonorization
Owens et al. Visually indicated sounds. CVPR 2016
28
Audio & Vision
● Transfer Learning
● Sonorization
● Speech generation
● Lip Reading
29
Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017
Speech Generation from Video
Frame from a silent video -> CNN (VGG) -> Audio feature -> Post-hoc synthesis
30
Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017
Speech Generation from Video
Waveform processing: Downsample to 8 kHz -> Split into 40 ms chunks (0.5 overlap) -> LPC analysis -> 9-dim LPC features
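The waveform processing chain on this slide can be sketched as below, assuming the audio is already sampled at 8 kHz. The autocorrelation-method LPC solver and the choice `order=9` (picked only so that each frame yields nine coefficients) are assumptions; the slide does not give the paper's exact configuration.

```python
import numpy as np

def frame_signal(x, sr=8000, frame_ms=40, overlap=0.5):
    """Split a 1-D signal into fixed-length frames with fractional overlap."""
    frame_len = int(sr * frame_ms / 1000)   # 320 samples at 8 kHz
    hop = int(frame_len * (1 - overlap))    # 160 samples for 0.5 overlap
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def lpc(frame, order):
    """Autocorrelation-method LPC: solve the normal equations R a = r."""
    r = np.array([frame[:len(frame) - k] @ frame[k:] for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += 1e-9 * np.eye(order)               # small ridge for numerical safety
    return np.linalg.solve(R, r[1:])

rng = np.random.default_rng(0)
signal = rng.standard_normal(8000)          # stand-in for 1 s of 8 kHz audio
frames = frame_signal(signal)
features = np.array([lpc(f, order=9) for f in frames])  # order 9 is an assumption
```

One second of audio gives 49 overlapping frames here, each summarized by a short coefficient vector that a network can regress from video frames.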
31
Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017
Speech Generation from Video
Waveform processing (2 LSP features/ frame)
32
Speech Generation from Video
Cooke et al. An audio-visual corpus for speech perception and automatic speech recognition. Journal of the Acoustical Society of America, 2006
GRID Dataset
33
Speech Generation from Video
Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017
34
Audio & Vision
● Transfer Learning
● Sonorization
● Speech generation
● Lip Reading
35
Assael et al. LipNet: Sentence-level Lipreading. arXiv Nov 2016
Lip Reading: LipNet
36
Assael et al. LipNet: Sentence-level Lipreading. arXiv Nov 2016
Lip Reading: LipNet
Input (frames) and output (sentence) sequences are not aligned
37
Graves et al. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 2006
Lip Reading: LipNet
CTC Loss: Connectionist temporal classification
● Avoids the need for alignment between the input and output sequences by predicting an additional blank token “_”
● To map an alignment to its output sequence, adjacent repeated tokens are merged and then blank tokens are removed
● “a _ a b _” == “_ a a _ _ a b b” == “a a b”
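The collapsing rule behind CTC can be sketched in a few lines: merge adjacent repeated tokens first, then drop the blanks. The function name `ctc_collapse` and the second example sequence are illustrative. This shows only the many-to-one mapping from alignments to outputs, not the loss itself, which sums the probability of every alignment that collapses to the target.

```python
from itertools import groupby

BLANK = "_"

def ctc_collapse(tokens):
    """Merge adjacent repeats, then remove blank tokens."""
    merged = [tok for tok, _ in groupby(tokens)]
    return [tok for tok in merged if tok != BLANK]

a = ctc_collapse("a _ a b _".split())
b = ctc_collapse("h h _ e _ l l _ l o o".split())
```

The blank between the two `l` runs is what lets the output keep a genuine double letter: without it, the repeats would merge into a single `l`.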
38
Lip Reading: LipNet
Assael et al. LipNet: Sentence-level Lipreading. arXiv Nov 2016
39
Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
Lip Reading: Watch, Listen, Attend & Spell
40
Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
Lip Reading: Watch, Listen, Attend & Spell
41
Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
Lip Reading: Watch, Listen, Attend & Spell
Inputs: audio features and image features
42
Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
Lip Reading: Watch, Listen, Attend & Spell
Attention over output states from audio and video is computed at each timestep
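A minimal sketch of one decoding step with separate attention over the video and audio encoder states. Dot-product scoring, the dimensions and all variable names are assumptions made for illustration; the paper's scoring function may differ.

```python
import numpy as np

def attend(query, states):
    """Dot-product attention: softmax over scores, then a weighted sum of states."""
    scores = states @ query
    weights = np.exp(scores - scores.max())   # stable softmax
    weights /= weights.sum()
    return weights @ states, weights

rng = np.random.default_rng(0)
video_states = rng.standard_normal((5, 8))    # 5 video encoder steps, dim 8 (made up)
audio_states = rng.standard_normal((7, 8))    # 7 audio encoder steps, dim 8 (made up)
decoder_state = rng.standard_normal(8)        # decoder state at the current timestep

video_ctx, video_w = attend(decoder_state, video_states)
audio_ctx, audio_w = attend(decoder_state, audio_states)
fused = np.concatenate([video_ctx, audio_ctx])  # fed to the output layer
```

Keeping one attention per modality lets the two input streams have different lengths and frame rates, since each context vector is a weighted average over its own sequence.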
43
Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
Lip Reading: Watch, Listen, Attend & Spell
44
Lip Reading: Watch, Listen, Attend & Spell
Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
LipNet
45
Lip Reading: Watch, Listen, Attend & Spell
Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
46
Summary
● Transfer learning
● Sonorization
● Speech generation
● Lip Reading
47
Questions?
48
Multimodal Learning
Ngiam et al. Multimodal Deep Learning. ICML 2011
49
Multimodal Learning
Ngiam et al. Multimodal Deep Learning. ICML 2011
50
Cross-Modality
Ngiam et al. Multimodal Deep Learning. ICML 2011
A denoising autoencoder is required to reconstruct both modalities given only one, or both, in the input
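One way to build training examples for such a denoising autoencoder is to zero out one modality in the input while always keeping both modalities in the target. The sketch below assumes zero-filling as the corruption and uses made-up feature vectors; `make_training_pairs` is a hypothetical helper.

```python
import numpy as np

def make_training_pairs(audio, video):
    """Return (input, target) pairs; the target always contains both modalities."""
    target = np.concatenate([audio, video])
    inputs = [
        np.concatenate([audio, video]),                   # both modalities present
        np.concatenate([audio, np.zeros_like(video)]),    # audio only
        np.concatenate([np.zeros_like(audio), video]),    # video only
    ]
    return [(x, target) for x in inputs]

audio_feat = np.array([0.2, 0.7])        # hypothetical audio features
video_feat = np.array([1.0, 0.5, 0.3])   # hypothetical video features
pairs = make_training_pairs(audio_feat, video_feat)
```

Training on all three variants forces the hidden layer to encode information that lets it infer the missing modality from the one that is present.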
51
Cross-Modality
Ngiam et al. Multimodal Deep Learning. ICML 2011
Video classification: Add Linear SVM on top of learned features
● Combination of audio & video as inputs gives relative improvement (on AVLetters)
● Using video as input to predict both audio & video gives SotA results in both
52
Cross-Modality
Ngiam et al. Multimodal Deep Learning. ICML 2011
Audio classification: Add Linear SVM on top of learned features
● Performance of cross-modal features below SotA
● Still, remarkable performance when using video as input to classify audio
53
Multimodal Learning
Ngiam et al. Multimodal Deep Learning. ICML 2011
54
Shared Learning
Ngiam et al. Multimodal Deep Learning. ICML 2011
First, learn a shared representation using the two modalities as input
55
Shared Learning
Ngiam et al. Multimodal Deep Learning. ICML 2011
Then, train linear classifiers on top of shared representations (previously trained on audio/video
data) using only one of the two modalities as input. Testing is performed with the other modality.
56
Shared Learning
Ngiam et al. Multimodal Deep Learning. ICML 2011
Train Linear classifiers on top of shared representations (previously trained on audio/video data)
using only one of the two modalities as input. Testing is performed with the other modality.

 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 

Audio and Vision (D2L9 Insight@DCU Machine Learning Workshop 2017)

  • 1. Amaia Salvador amaia.salvador@upc.edu PhD Candidate Universitat Politècnica de Catalunya DEEP LEARNING WORKSHOP Dublin City University 27-28 April 2017 Audio & Vision D2 L9
  • 3. 3 Audio & Vision ● Transfer Learning ● Sonorization ● Speech generation ● Lip Reading
  • 4. 4 Audio & Vision ● Transfer Learning ● Sonorization ● Speech generation ● Lip Reading
  • 5. 5 Transfer Learning Owens et al. Ambient Sound Provides Supervision for Visual Learning. ECCV 2016 Audio features help learn semantically meaningful visual representation
  • 6. 6 Transfer Learning Owens et al. Ambient Sound Provides Supervision for Visual Learning. ECCV 2016 Images are grouped into audio clusters. The task is to predict the audio cluster for a given image (classification)
  • 7. 7 Transfer Learning Owens et al. Ambient Sound Provides Supervision for Visual Learning. ECCV 2016 Although the model was not trained with class labels, units with semantic meaning emerge
  • 8. 8 Transfer Learning: SoundNet Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016 Pretrained visual ConvNets supervise the training of a model for sound representation
  • 9. 9 Videos for training are unlabeled. Relies on Convnets trained on labeled images. Transfer Learning: SoundNet Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
  • 10. 10 Kullback-Leibler Divergence: Cross-Entropy: Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016 prob. distribution of vision network prob. distribution of sound network Transfer Learning: SoundNet
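The two formulas on slide 10 did not survive extraction. As a reminder of the standard textbook definitions (not a reconstruction of the slide itself), with p the probability distribution of the vision network and q that of the sound network: D_KL(p || q) = sum_i p_i log(p_i / q_i) and H(p, q) = -sum_i p_i log(q_i). The two differ only by the entropy H(p), which does not depend on the sound network, so minimizing either with respect to q gives the same optimum. The sketch below checks this numerically; the function names and the example distributions are illustrative assumptions.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """H(p) = -sum_i p_i log p_i."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Hypothetical class posteriors: teacher (vision) and student (sound).
p_vision = [0.7, 0.2, 0.1]
q_sound = [0.5, 0.3, 0.2]

kl = kl_divergence(p_vision, q_sound)
ce = cross_entropy(p_vision, q_sound)
# Identity: H(p, q) = H(p) + D_KL(p || q)
gap = abs(ce - (entropy(p_vision) + kl))
```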
  • 11. 11 Recognizing Objects and Scenes from Sound Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016 Transfer Learning: SoundNet
  • 12. 12 Recognizing Objects and Scenes from Sound Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016 Transfer Learning: SoundNet
  • 13. 13 Features from SoundNet's hidden layers are used to train a standard SVM classifier that outperforms the state of the art. Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016 Transfer Learning: SoundNet
  • 14. 14 Visualization of the 1D filters over raw audio in conv1. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Transfer Learning: SoundNet
  • 15. 15 Visualization of the 1D filters over raw audio in conv1. Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016 Transfer Learning: SoundNet
  • 16. 16 Visualize samples that most strongly activate a neuron in a late layer (conv7) Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016 Transfer Learning: SoundNet
  • 17. 17 Visualization of the video frames associated with the sounds that activate some of the last hidden units (conv7): Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016 Transfer Learning: SoundNet
  • 18. 18 Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016 Transfer Learning: SoundNet Hearing sounds that most activate a neuron in the sound network (conv7)
  • 19. 19 Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016 Transfer Learning: SoundNet Hearing sounds that most activate a neuron in the sound network (conv5)
  • 20. 20 Audio & Vision ● Transfer Learning ● Sonorization ● Speech generation ● Lip Reading
  • 21. 21 Sonorization Owens et al. Visually indicated sounds. CVPR 2016 Learn synthesized sounds from videos of people hitting objects with a drumstick. Are features learned when training for sound prediction useful to identify the object’s material?
  • 22. 22 Sonorization Owens et al. Visually indicated sounds. CVPR 2016 The Greatest Hits Dataset
  • 23. 23 Sonorization Owens et al. Visually indicated sounds. CVPR 2016 The Greatest Hits Dataset
  • 24. 24 Sonorization Owens et al. Visually indicated sounds. CVPR 2016 ● Map video sequences to sequences of sound features ● Sound features in the output are converted into waveform by example-based synthesis (i.e. nearest neighbor retrieval).
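The example-based synthesis step on slide 24 can be illustrated as a nearest-neighbor lookup: the predicted sound feature is matched against a database of features with known waveforms, and the closest entry's waveform is returned. This is a minimal sketch under assumptions; the Euclidean distance, the `nearest_example` helper, and the toy database are hypothetical and not taken from the paper.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_example(predicted, database):
    """Return the waveform id whose stored feature is closest to `predicted`.

    `database` is a list of (feature_vector, waveform_id) pairs.
    """
    best = min(database, key=lambda item: squared_distance(predicted, item[0]))
    return best[1]

# Toy database of sound features with associated (hypothetical) clips.
database = [
    ([0.0, 1.0], "hit_wood.wav"),
    ([1.0, 0.0], "hit_metal.wav"),
    ([0.5, 0.5], "hit_plastic.wav"),
]

match = nearest_example([0.9, 0.2], database)
```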
  • 25. 25 Sonorization Owens et al. Visually indicated sounds. CVPR 2016
  • 26. 26 Sonorization Owens et al. Visually indicated sounds. CVPR 2016
  • 27. 27 Sonorization Owens et al. Visually indicated sounds. CVPR 2016
  • 28. 28 Audio & Vision ● Transfer Learning ● Sonorization ● Speech generation ● Lip Reading
  • 29. 29 Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017 Speech Generation from Video CNN (VGG) Frame from a silent video Audio feature Post-hoc synthesis
  • 30. 30 Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017 Speech Generation from Video Waveform processing: downsample to 8 kHz → split into 40 ms chunks (0.5 overlap) → LPC analysis → 9-dim LPC features
  • 31. 31 Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017 Speech Generation from Video Waveform processing (2 LSP features/frame)
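The chunking step of the waveform processing on slide 30 can be sketched as follows. At 8 kHz, a 40 ms chunk holds 320 samples, and a 0.5 overlap gives a hop of 160 samples. The `frame_signal` helper is a hypothetical name, and the LPC analysis that follows in the pipeline is deliberately left out.

```python
def frame_signal(samples, sample_rate=8000, frame_ms=40, overlap=0.5):
    """Split a 1-D signal into fixed-length, overlapping chunks.

    Trailing samples that do not fill a whole chunk are dropped.
    """
    frame_len = int(sample_rate * frame_ms / 1000)  # 320 samples at 8 kHz
    hop = int(frame_len * (1 - overlap))            # 160 samples for 0.5 overlap
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

# One second of a dummy signal already downsampled to 8 kHz.
signal = list(range(8000))
frames = frame_signal(signal)
```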
  • 32. 32 Speech Generation from Video Cooke et al. An audio-visual corpus for speech perception and automatic speech recognition. Journal of the Acoustical Society of America, 2006 GRID Dataset
  • 33. 33 Speech Generation from Video Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017
  • 34. 34 Audio & Vision ● Transfer Learning ● Sonorization ● Speech generation ● Lip Reading
  • 35. 35 Assael et al. LipNet: Sentence-level Lipreading. arXiv Nov 2016 Lip Reading: LipNet
  • 36. 36 Assael et al. LipNet: Sentence-level Lipreading. arXiv Nov 2016 Lip Reading: LipNet Input (frames) and output (sentence) sequences are not aligned
  • 37. 37 Graves et al. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 2006 Lip Reading: LipNet CTC Loss: Connectionist temporal classification ● Avoids the need for alignment between input and output sequences by predicting an additional “_” blank token ● Before computing the loss, repeated tokens are merged and then blank tokens are removed ● “a _ a b _” == “_ a a _ _ a b b” == “a a b”
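The collapsing rule used by CTC can be sketched in a few lines: first merge runs of identical consecutive tokens, then drop the blank token. The order matters, because a blank between two identical tokens is what keeps them from being merged. The `ctc_collapse` name is a hypothetical helper, and this shows only the many-to-one mapping from frame-level paths to labels, not the loss computation itself.

```python
from itertools import groupby

def ctc_collapse(tokens, blank="_"):
    """Map a frame-level token path to its output label sequence."""
    # Step 1: merge consecutive repeats, e.g. "aa" -> "a".
    merged = [token for token, _ in groupby(tokens)]
    # Step 2: remove blanks, which only served as separators.
    return [token for token in merged if token != blank]

# Two different frame-level paths that yield the same labels.
path_short = ctc_collapse("a_ab_")
path_long = ctc_collapse("_aa__abb")
```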
  • 38. 38 Lip Reading: LipNet Assael et al. LipNet: Sentence-level Lipreading. arXiv Nov 2016
  • 39. 39 Chung et al. Lip reading sentences in the wild. arXiv Nov 2016 Lip Reading: Watch, Listen, Attend & Spell
  • 40. 40 Chung et al. Lip reading sentences in the wild. arXiv Nov 2016 Lip Reading: Watch, Listen, Attend & Spell
  • 41. 41 Chung et al. Lip reading sentences in the wild. arXiv Nov 2016 Lip Reading: Watch, Listen, Attend & Spell Audio features Image features
  • 42. 42 Chung et al. Lip reading sentences in the wild. arXiv Nov 2016 Lip Reading: Watch, Listen, Attend & Spell Attention over output states from audio and video is computed at each timestep
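As a rough illustration of attending over both modalities at one decoding timestep, the sketch below computes a separate attention distribution over the video and the audio encoder states, then joins the two context vectors. Dot-product scoring, fusion by concatenation, and the function names are all assumptions made for this example; the scoring function used in the paper may differ.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, states):
    """Return (context, weights) for one query over a list of state vectors."""
    # Assumed scoring: dot product between the decoder query and each state.
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    context = [sum(w * state[d] for w, state in zip(weights, states))
               for d in range(dim)]
    return context, weights

def dual_context(query, video_states, audio_states):
    """Attend over each modality separately and concatenate the contexts."""
    video_context, _ = attend(query, video_states)
    audio_context, _ = attend(query, audio_states)
    return video_context + audio_context

# Toy encoder outputs: 3 video steps and 2 audio steps, 2 dimensions each.
video_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_states = [[0.5, 0.5], [2.0, 0.0]]
query = [1.0, 0.0]

context = dual_context(query, video_states, audio_states)
_, video_weights = attend(query, video_states)
```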
  • 43. 43 Chung et al. Lip reading sentences in the wild. arXiv Nov 2016 Lip Reading: Watch, Listen, Attend & Spell
  • 44. 44 Lip Reading: Watch, Listen, Attend & Spell Chung et al. Lip reading sentences in the wild. arXiv Nov 2016 LipNet
  • 45. 45 Lip Reading: Watch, Listen, Attend & Spell Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
  • 46. 46 Summary ● Transfer learning ● Sonorization ● Speech generation ● Lip Reading
  • 48. 48 Multimodal Learning Ngiam et al. Multimodal Deep Learning. ICML 2011
  • 49. 49 Multimodal Learning Ngiam et al. Multimodal Deep Learning. ICML 2011
  • 50. 50 Cross-Modality Ngiam et al. Multimodal Deep Learning. ICML 2011 A denoising autoencoder is required to reconstruct both modalities given only one, or both, as input
  • 51. 51 Cross-Modality Ngiam et al. Multimodal Deep Learning. ICML 2011 Video classification: Add Linear SVM on top of learned features ● Combining audio and video as inputs gives a relative improvement (on AVLetters) ● Using video as input to predict both audio and video gives SotA results in both
  • 52. 52 Cross-Modality Ngiam et al. Multimodal Deep Learning. ICML 2011 Audio classification: Add Linear SVM on top of learned features ● Performance of cross-modal features below SotA ● Still, remarkable performance when using video as input to classify audio
  • 53. 53 Multimodal Learning Ngiam et al. Multimodal Deep Learning. ICML 2011
  • 54. 54 Shared Learning Ngiam et al. Multimodal Deep Learning. ICML 2011 First, learn a shared representation using the two modalities as input
  • 55. 55 Shared Learning Ngiam et al. Multimodal Deep Learning. ICML 2011 Then, train linear classifiers on top of shared representations (previously trained on audio/video data) using only one of the two modalities as input. Testing is performed with the other modality.
  • 56. 56 Shared Learning Ngiam et al. Multimodal Deep Learning. ICML 2011 Train Linear classifiers on top of shared representations (previously trained on audio/video data) using only one of the two modalities as input. Testing is performed with the other modality.