MULTIMODAL DEEP LEARNING

Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam,
Honglak Lee, Andrew Y. Ng

Computer Science Department, Stanford University
Department of Music, Stanford University
Computer Science & Engineering Division, University of Michigan, Ann Arbor
MCGURK EFFECT

In speech recognition, people are known to integrate audio-visual
information in order to understand speech.

This was first exemplified by the McGurk effect, where a visual /ga/
paired with an audio /ba/ is perceived as /da/ by most subjects.
AUDIO-VISUAL SPEECH RECOGNITION
FEATURE CHALLENGE




[Diagram: audio/video features fed into a classifier (e.g. SVM)]
REPRESENTING LIPS

• Can we learn better representations for
  audio/visual speech recognition?

• How can multimodal data (multiple
  sources of input) be used to find better
  features?
UNSUPERVISED FEATURE LEARNING
UNSUPERVISED FEATURE LEARNING
MULTIMODAL FEATURES
CROSS-MODALITY FEATURE LEARNING
FEATURE LEARNING MODELS
BACKGROUND


   Sparse Restricted Boltzmann Machines
    (RBMs)
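A sparse RBM of the kind referenced here can be sketched in a few lines of numpy: contrastive-divergence (CD-1) updates plus a penalty that nudges the mean hidden activation toward a small target. All sizes and hyperparameters below are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SparseRBM:
    """Bernoulli-Bernoulli RBM trained with CD-1 plus a sparsity
    penalty pushing mean hidden activation toward a small target."""

    def __init__(self, n_visible, n_hidden, lr=0.1,
                 sparsity_target=0.05, sparsity_cost=0.1):
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)
        self.b_h = np.zeros(n_hidden)
        self.lr = lr
        self.rho = sparsity_target
        self.s_cost = sparsity_cost

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def cd1_step(self, v0):
        # Positive phase
        h0 = self.hidden_probs(v0)
        h0_sample = (rng.random(h0.shape) < h0).astype(float)
        # Negative phase: one Gibbs step
        v1 = sigmoid(h0_sample @ self.W.T + self.b_v)
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b_v += self.lr * (v0 - v1).mean(axis=0)
        # Sparsity term: nudge mean hidden activation toward rho
        self.b_h += self.lr * ((h0 - h1).mean(axis=0)
                               + self.s_cost * (self.rho - h0.mean(axis=0)))

data = (rng.random((64, 20)) < 0.3).astype(float)   # toy binary data
rbm = SparseRBM(n_visible=20, n_hidden=10)
for _ in range(50):
    rbm.cd1_step(data)
```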
FEATURE LEARNING WITH AUTOENCODERS

[Diagram: two separate autoencoders, one reconstructing the audio input and one the video input, each through its own hidden layers]
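A minimal single-modality autoencoder of this shape (one sigmoid hidden layer, linear reconstruction, tied weights) can be sketched in numpy; the data here is random and merely stands in for audio or video frames.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Random data standing in for audio or video frames (rows = examples)
X = rng.standard_normal((100, 16))
n_in, n_hid = X.shape[1], 8

W = 0.1 * rng.standard_normal((n_in, n_hid))   # tied encoder/decoder weights
b_h, b_o = np.zeros(n_hid), np.zeros(n_in)
lr = 0.05

mse0 = ((sigmoid(X @ W + b_h) @ W.T + b_o - X) ** 2).mean()
for _ in range(200):
    H = sigmoid(X @ W + b_h)       # encode
    X_hat = H @ W.T + b_o          # decode (linear output)
    err = X_hat - X
    dH = (err @ W) * H * (1 - H)   # backprop through the sigmoid
    gW = X.T @ dH + err.T @ H      # W appears in both encoder and decoder
    W -= lr * gW / len(X)
    b_h -= lr * dH.mean(axis=0)
    b_o -= lr * err.mean(axis=0)
mse = ((sigmoid(X @ W + b_h) @ W.T + b_o - X) ** 2).mean()
```

The reconstruction error `mse` drops below its initial value `mse0` after training.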
BIMODAL AUTOENCODER

[Diagram: a single hidden representation over concatenated audio and video input reconstructs both modalities]
SHALLOW LEARNING
[Diagram: a single layer of hidden units connected directly to the concatenated video and audio input]

• Mostly unimodal features learned
BIMODAL AUTOENCODER

[Diagram: from video input alone, the hidden representation reconstructs both audio and video]

Cross-modality learning: learn better video features by using audio as a cue.
CROSS-MODALITY DEEP AUTOENCODER
[Diagram: a deep autoencoder maps video input through several hidden layers to a learned representation that reconstructs both audio and video]
CROSS-MODALITY DEEP AUTOENCODER
[Diagram: the same deep autoencoder with audio as the sole input]
BIMODAL DEEP AUTOENCODERS
[Diagram: a bimodal deep autoencoder with modality-specific layers, audio units roughly corresponding to “phonemes” and video units to “visemes” (mouth shapes), feeding a shared representation that reconstructs both modalities]
BIMODAL DEEP AUTOENCODERS
[Diagram: the same network with only the video pathway (“visemes”) active]
BIMODAL DEEP AUTOENCODERS
[Diagram: the same network with only the audio pathway (“phonemes”) active]
TRAINING BIMODAL DEEP AUTOENCODER
[Diagram: three training conditions, each reconstructing both audio and video through the shared representation: both modalities present, audio only, and video only]

• Train a single model to perform all 3 tasks

• Similar in spirit to denoising autoencoders
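The three-task training scheme above can be sketched as a single shared network trained in turn with both modalities, audio only, and video only as input, while the reconstruction target is always the full audio+video vector. A minimal numpy sketch, with purely illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy stand-ins for audio and video feature vectors (sizes illustrative)
A = rng.standard_normal((120, 10))
V = rng.standard_normal((120, 12))
X = np.hstack([A, V])                  # full bimodal input
n_in, n_hid = X.shape[1], 16

W = 0.1 * rng.standard_normal((n_in, n_hid))   # encoder weights
U = 0.1 * rng.standard_normal((n_hid, n_in))   # decoder reconstructs BOTH modalities
b_h, b_o = np.zeros(n_hid), np.zeros(n_in)
lr = 0.05

# The three training conditions: both modalities, audio only, video only.
masks = [np.ones(n_in),
         np.r_[np.ones(10), np.zeros(12)],    # video zeroed out
         np.r_[np.zeros(10), np.ones(12)]]    # audio zeroed out

mse0 = ((sigmoid(X @ W + b_h) @ U + b_o - X) ** 2).mean()
for _ in range(150):
    for m in masks:
        Xin = X * m                    # corrupt: drop one modality
        H = sigmoid(Xin @ W + b_h)
        X_hat = H @ U + b_o
        err = X_hat - X                # target is always the FULL input
        dH = (err @ U.T) * H * (1 - H)
        U -= lr * (H.T @ err) / len(X)
        W -= lr * (Xin.T @ dH) / len(X)
        b_o -= lr * err.mean(axis=0)
        b_h -= lr * dH.mean(axis=0)
mse = ((sigmoid(X @ W + b_h) @ U + b_o - X) ** 2).mean()
```

Zeroing one modality while still demanding reconstruction of both is exactly what makes this "similar in spirit to denoising autoencoders".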
EVALUATIONS
VISUALIZATIONS OF LEARNED FEATURES



[Figure: audio (spectrogram) and video features learned over 100 ms windows, visualized at 0 ms, 33 ms, 67 ms, and 100 ms]
LEARNING SETTINGS

   We will consider the learning settings
    shown in Figure 1.
LIP-READING WITH AVLETTERS

AVLetters:
 • 26-way letter classification
 • 10 speakers
 • 60×80-pixel lip regions

Cross-modality learning

[Diagram: cross-modality deep autoencoder with video input]

 Feature Learning | Supervised Learning | Testing
 Audio + Video    | Video               | Video
LIP-READING WITH AVLETTERS
 Feature Representation                               | Classification Accuracy
 Multiscale Spatial Analysis (Matthews et al., 2002)  | 44.6%
 Local Binary Pattern (Zhao & Barnard, 2009)          | 58.5%
 Video-Only Learning (single-modality learning)       | 54.2%
 Our Features (cross-modality learning)               | 64.4%
LIP-READING WITH CUAVE

CUAVE:
 • 10-way digit classification
 • 36 speakers

Cross-modality learning

[Diagram: cross-modality deep autoencoder with video input]

 Feature Learning | Supervised Learning | Testing
 Audio + Video    | Video               | Video
LIP-READING WITH CUAVE
 Feature Representation                             | Classification Accuracy
 Baseline Preprocessed Video                        | 58.5%
 Video-Only Learning (single-modality learning)     | 65.4%
 Our Features (cross-modality learning)             | 68.7%
 Discrete Cosine Transform (Gurban & Thiran, 2009)  | 64.0%
 Visemic AAM (Papandreou et al., 2009)              | 83.0%
MULTIMODAL RECOGNITION
[Diagram: bimodal deep autoencoder with a shared representation over audio and video]

CUAVE:
 • 10-way digit classification
 • 36 speakers

Evaluate in clean and noisy audio scenarios
 • In the clean audio scenario, audio alone performs extremely well

 Feature Learning | Supervised Learning | Testing
 Audio + Video    | Audio + Video       | Audio + Video
MULTIMODAL RECOGNITION
 Feature Representation                           | Classification Accuracy (noisy audio at 0 dB SNR)
 Audio Features (RBM)                             | 75.8%
 Our Best Video Features                          | 68.7%
 Bimodal Deep Autoencoder                         | 77.3%
 Bimodal Deep Autoencoder + Audio Features (RBM)  | 82.2%
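The best row amounts to concatenating the two feature sets before a standard classifier. A toy sketch of that step, with random stand-in features and a nearest-class-mean classifier in place of an SVM (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: deep-autoencoder features and audio RBM features
# for a 10-way digit task (labels 0-9); shapes are illustrative.
y = rng.integers(0, 10, size=200)
ae_feats = rng.standard_normal((200, 32)) + 0.5 * y[:, None]
rbm_feats = rng.standard_normal((200, 16)) + 0.5 * y[:, None]

# "Bimodal Deep Autoencoder + Audio Features (RBM)" corresponds to
# concatenating both feature sets before classification.
X = np.hstack([ae_feats, rbm_feats])

# Nearest-class-mean classifier as a minimal stand-in for an SVM
centroids = np.stack([X[y == c].mean(axis=0) for c in range(10)])
dists = ((X[:, None, :] - centroids[None]) ** 2).sum(axis=-1)
pred = np.argmin(dists, axis=1)
accuracy = (pred == y).mean()
```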
SHARED REPRESENTATION EVALUATION
 Feature Learning | Supervised Learning | Testing
 Audio + Video    | Audio               | Video

[Diagram: a linear classifier is trained on the shared representation computed from audio and tested on the shared representation computed from video]
SHARED REPRESENTATION EVALUATION
Method: learned features + Canonical Correlation Analysis (CCA)

 Feature Learning | Supervised Learning | Testing | Accuracy
 Audio + Video    | Audio               | Video   | 57.3%
 Audio + Video    | Video               | Audio   | 91.7%

[Diagram: a linear classifier is trained on the shared representation of one modality and tested on the shared representation of the other]
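The "learned features + CCA" step maps audio-derived and video-derived representations into a common space where they are maximally correlated. A pure-numpy CCA sketch (the data, shapes, and latent-signal setup below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for shared-representation features computed from
# audio (Xa) and video (Xv); a common latent signal Z links them.
Z = rng.standard_normal((300, 5))
Xa = Z @ rng.standard_normal((5, 20)) + 0.1 * rng.standard_normal((300, 20))
Xv = Z @ rng.standard_normal((5, 30)) + 0.1 * rng.standard_normal((300, 30))

def cca(X, Y, k, eps=1e-8):
    """CCA via SVD of the whitened cross-covariance; returns the
    projected views and the top-k canonical correlations."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Cxx = X.T @ X / len(X) + eps * np.eye(X.shape[1])
    Cyy = Y.T @ Y / len(Y) + eps * np.eye(Y.shape[1])
    Cxy = X.T @ Y / len(X)

    def inv_sqrt(C):                      # inverse matrix square root
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    M = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(M)
    Wx = inv_sqrt(Cxx) @ U[:, :k]         # audio-side projection
    Wy = inv_sqrt(Cyy) @ Vt[:k].T         # video-side projection
    return X @ Wx, Y @ Wy, s[:k]

Za, Zv, corrs = cca(Xa, Xv, k=5)
```

With a strong shared latent signal, the top canonical correlations come out close to 1, which is what makes training on one modality and testing on the other feasible.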
MCGURK EFFECT
A visual /ga/ combined with an audio /ba/ is often
perceived as /da/.


                             Model Predictions
 Audio Input | Video Input | /ga/  | /ba/  | /da/
 /ga/        | /ga/        | 82.6% | 2.2%  | 15.2%
 /ba/        | /ba/        | 4.4%  | 89.1% | 6.5%
 /ba/        | /ga/        | 28.3% | 13.0% | 58.7%
CONCLUSION
• Applied deep autoencoders to discover features in multimodal data

• Cross-modality learning: obtained better video features (for lip-reading) by using audio as a cue

• Multimodal feature learning: learned representations that relate audio and video data

[Diagrams: the cross-modality deep autoencoder (video input only) and the bimodal deep autoencoder (shared audio-video representation)]
THANK YOU FOR YOUR ATTENTION!

Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 

Kürzlich hochgeladen (20)

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 

Multimodal deep learning

  • 2. MULTIMODAL DEEP LEARNING  Jiquan Ngiam  Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, Andrew Y. Ng  Computer Science Department, Stanford University  Department of Music, Stanford University  Computer Science & Engineering Division, University of Michigan, Ann Arbor
  • 3. MCGURK EFFECT  In speech recognition, people are known to integrate audio-visual information in order to understand speech.  This was first exemplified in the McGurk effect where a visual /ga/ with a voiced /ba/ is perceived as /da/ by most subjects.
  • 5. FEATURE CHALLENGE Classifier (e.g. SVM)
  • 6. REPRESENTING LIPS • Can we learn better representations for audio/visual speech recognition? • How can multimodal data (multiple sources of input) be used to find better features?
  • 12. BACKGROUND  Sparse Restricted Boltzmann Machines (RBMs)
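As a rough, hedged illustration of the sparse RBM building block named on this slide (not the paper's exact model, data, or hyperparameters), here is a minimal Bernoulli-Bernoulli RBM trained with one-step contrastive divergence (CD-1) plus a simple sparsity penalty that nudges the mean hidden activation toward a target. All sizes and constants are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Binary toy data: 100 examples, 20 visible units.
V0 = (rng.random((100, 20)) > 0.5).astype(float)

n_hidden = 15
W = rng.normal(scale=0.01, size=(20, n_hidden))
a = np.zeros(20)        # visible biases
b = np.zeros(n_hidden)  # hidden biases
lr, sparsity_target, sparsity_cost = 0.1, 0.1, 0.5

for _ in range(50):
    # Contrastive divergence with one Gibbs step (CD-1).
    Ph0 = sigmoid(V0 @ W + b)                       # hidden probabilities given data
    H0 = (rng.random(Ph0.shape) < Ph0).astype(float)  # sampled hidden states
    Pv1 = sigmoid(H0 @ W.T + a)                     # "reconstruction" of the visibles
    Ph1 = sigmoid(Pv1 @ W + b)                      # hidden probabilities given recon
    n = V0.shape[0]
    W += lr * ((V0.T @ Ph0) - (Pv1.T @ Ph1)) / n    # positive minus negative statistics
    a += lr * (V0 - Pv1).mean(axis=0)
    b += lr * (Ph0 - Ph1).mean(axis=0)
    # Sparsity: push hidden biases toward a low mean activation.
    b -= lr * sparsity_cost * (Ph0.mean(axis=0) - sparsity_target)

mean_act = sigmoid(V0 @ W + b).mean()
print(mean_act)  # pulled below 0.5, toward the sparsity target
```

The sparsity term is the simplest common variant (a penalty on the deviation of the mean hidden activation from a target); the paper may use a different regularizer.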
  • 13. FEATURE LEARNING WITH AUTOENCODERS [diagram: two separate autoencoders, one mapping the audio input to an audio reconstruction and one mapping the video input to a video reconstruction]
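Before the bimodal variants, the basic single-modality autoencoder on this slide can be sketched in a few lines. This is a toy illustration with made-up data and a tied-weight encoder/decoder trained by plain gradient descent, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy "audio" data: 50 examples, 20 dimensions.
X = rng.random((50, 20))

# Tied weights: hidden = sigmoid(X W + b), reconstruction = hidden W^T + c.
n_hidden = 8
W = rng.normal(scale=0.1, size=(20, n_hidden))
b = np.zeros(n_hidden)
c = np.zeros(20)

def forward(X):
    H = sigmoid(X @ W + b)
    return H, H @ W.T + c

def loss(X):
    return np.mean((forward(X)[1] - X) ** 2)

lr = 0.5
loss_before = loss(X)
for _ in range(200):
    H, R = forward(X)
    dR = 2.0 * (R - X) / X.size      # d loss / d reconstruction
    dH = (dR @ W) * H * (1 - H)      # backprop through the sigmoid
    gW = X.T @ dH + dR.T @ H         # tied weights: gradient from both paths
    W -= lr * gW
    b -= lr * dH.sum(axis=0)
    c -= lr * dR.sum(axis=0)
loss_after = loss(X)
print(loss_before, loss_after)  # reconstruction error drops with training
```

After training, the hidden activations `H` would serve as the learned features fed to a classifier.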
  • 14. BIMODAL AUTOENCODER [diagram: a single hidden representation over both the audio and video inputs, producing both an audio reconstruction and a video reconstruction]
  • 15. SHALLOW LEARNING [diagram: one layer of hidden units connected directly to the video and audio inputs] • Mostly unimodal features learned
  • 16. BIMODAL AUTOENCODER [diagram: hidden representation computed from the video input alone, reconstructing both audio and video] Cross-modality learning: learn better video features by using audio as a cue
  • 17. CROSS-MODALITY DEEP AUTOENCODER [diagram: deep autoencoder that takes only the video input, passes it through several layers to a learned representation, and reconstructs both audio and video]
  • 18. CROSS-MODALITY DEEP AUTOENCODER [diagram: the same architecture taking only the audio input]
  • 19. BIMODAL DEEP AUTOENCODERS [diagram: audio and video inputs pass through modality-specific layers ("phonemes" for audio, "visemes", i.e. mouth shapes, for video) into a shared representation, which reconstructs both modalities]
  • 20. BIMODAL DEEP AUTOENCODERS [diagram: the same model with only the video input ("visemes" / mouth-shapes pathway) active]
  • 21. BIMODAL DEEP AUTOENCODERS [diagram: the same model with only the audio input ("phonemes" pathway) active]
  • 22. TRAINING BIMODAL DEEP AUTOENCODER [diagram: three training configurations sharing one shared representation — both modalities present, audio only, and video only — each reconstructing both the audio and the video] • Train a single model to perform all 3 tasks • Similar in spirit to denoising autoencoders
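The three-configuration training scheme described on this slide (reconstruct both modalities from both inputs, from audio alone, and from video alone) can be sketched as follows. Everything here — layer sizes, data, learning rate, and the single shared hidden layer standing in for the deep network — is illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_audio, n_video, n_hidden = 12, 16, 10
A = rng.random((40, n_audio))     # toy audio features
V = rng.random((40, n_video))     # toy video features
target = np.hstack([A, V])        # always reconstruct BOTH modalities

# Three training inputs, as on the slide: both modalities present,
# audio with video zeroed out, and video with audio zeroed out.
variants = [
    target.copy(),
    np.hstack([A, np.zeros_like(V)]),
    np.hstack([np.zeros_like(A), V]),
]

W = rng.normal(scale=0.1, size=(n_audio + n_video, n_hidden))
b = np.zeros(n_hidden)
c = np.zeros(n_audio + n_video)

def recon(X):
    H = sigmoid(X @ W + b)        # shared representation
    return H, H @ W.T + c

def total_loss():
    return sum(np.mean((recon(X)[1] - target) ** 2) for X in variants)

before = total_loss()
for _ in range(300):
    for X in variants:            # one gradient step per configuration
        H, R = recon(X)
        dR = 2.0 * (R - target) / target.size
        dH = (dR @ W) * H * (1 - H)
        W -= 0.5 * (X.T @ dH + dR.T @ H)
        b -= 0.5 * dH.sum(axis=0)
        c -= 0.5 * dR.sum(axis=0)
after = total_loss()
print(before, after)  # combined reconstruction error drops
```

Zeroing a modality at the input while still demanding its reconstruction is what makes this "similar in spirit to denoising autoencoders."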
  • 24. VISUALIZATIONS OF LEARNED FEATURES Audio (spectrogram) and video features learned over 100 ms windows (frames shown at 0 ms, 33 ms, 67 ms, 100 ms)
  • 25. LEARNING SETTINGS  We will consider the learning settings shown in Figure 1.
  • 26. LIP-READING WITH AVLETTERS  AVLetters: 26-way letter classification, 10 speakers, 60x80-pixel lip regions; cross-modality learning. [diagram: cross-modality deep autoencoder with video input and a learned representation] Setting: feature learning on audio + video, supervised learning on video, testing on video.
  • 27. LIP-READING WITH AVLETTERS Feature representation and classification accuracy: Multiscale Spatial Analysis (Matthews et al., 2002): 44.6%; Local Binary Pattern (Zhao & Barnard, 2009): 58.5%
  • 28. LIP-READING WITH AVLETTERS Feature representation and classification accuracy: Multiscale Spatial Analysis (Matthews et al., 2002): 44.6%; Local Binary Pattern (Zhao & Barnard, 2009): 58.5%; Video-Only Learning (single-modality learning): 54.2%
  • 29. LIP-READING WITH AVLETTERS Feature representation and classification accuracy: Multiscale Spatial Analysis (Matthews et al., 2002): 44.6%; Local Binary Pattern (Zhao & Barnard, 2009): 58.5%; Video-Only Learning (single-modality learning): 54.2%; Our Features (cross-modality learning): 64.4%
  • 30. LIP-READING WITH CUAVE  CUAVE: 10-way digit classification, 36 speakers; cross-modality learning. [diagram: cross-modality deep autoencoder with video input and a learned representation] Setting: feature learning on audio + video, supervised learning on video, testing on video.
  • 31. LIP-READING WITH CUAVE Feature representation and classification accuracy: Baseline Preprocessed Video: 58.5%; Video-Only Learning (single-modality learning): 65.4%
  • 32. LIP-READING WITH CUAVE Feature representation and classification accuracy: Baseline Preprocessed Video: 58.5%; Video-Only Learning (single-modality learning): 65.4%; Our Features (cross-modality learning): 68.7%
  • 33. LIP-READING WITH CUAVE Feature representation and classification accuracy: Baseline Preprocessed Video: 58.5%; Video-Only Learning (single-modality learning): 65.4%; Our Features (cross-modality learning): 68.7%; Discrete Cosine Transform (Gurban & Thiran, 2009): 64.0%; Visemic AAM (Papandreou et al., 2009): 83.0%
  • 34. MULTIMODAL RECOGNITION  CUAVE: 10-way digit classification, 36 speakers. [diagram: bimodal deep autoencoder with a shared representation over both the audio and video inputs]  We evaluate in clean and noisy audio scenarios; in the clean audio scenario, audio alone performs extremely well. Setting: feature learning on audio + video, supervised learning on audio + video, testing on audio + video.
  • 35. MULTIMODAL RECOGNITION Classification accuracy with noisy audio at 0 dB SNR: Audio Features (RBM): 75.8%; Our Best Video Features: 68.7%
  • 36. MULTIMODAL RECOGNITION Classification accuracy with noisy audio at 0 dB SNR: Audio Features (RBM): 75.8%; Our Best Video Features: 68.7%; Bimodal Deep Autoencoder: 77.3%
  • 37. MULTIMODAL RECOGNITION Classification accuracy with noisy audio at 0 dB SNR: Audio Features (RBM): 75.8%; Our Best Video Features: 68.7%; Bimodal Deep Autoencoder: 77.3%; Bimodal Deep Autoencoder + Audio Features (RBM): 82.2%
  • 38. SHARED REPRESENTATION EVALUATION Setting: feature learning on audio + video, supervised learning on one modality, testing on the other. [diagram: a linear classifier is trained on the shared representation computed from the training modality (audio) and tested on the shared representation computed from the other modality (video)]
  • 39. SHARED REPRESENTATION EVALUATION  Method: learned features + Canonical Correlation Analysis. With feature learning on audio + video: supervised learning on audio, testing on video: 57.3%; supervised learning on video, testing on audio: 91.7%.
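The Canonical Correlation Analysis step named on this slide can be sketched with a standard SVD-based recipe: whiten each view, then the singular values of the cross-covariance of the whitened views are the canonical correlations. The data below is synthetic (two views sharing a common latent signal), purely to illustrate the computation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "views" of the same 60 examples (standing in for learned audio
# and video features), sharing a 3-dimensional latent signal.
z = rng.normal(size=(60, 3))
X = z @ rng.normal(size=(3, 8)) + 0.1 * rng.normal(size=(60, 8))
Y = z @ rng.normal(size=(3, 6)) + 0.1 * rng.normal(size=(60, 6))

def cca_correlations(X, Y):
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)

    def whiten(M):
        # Thin SVD: the left singular vectors give an orthonormal
        # (whitened) representation of the examples.
        U, _, _ = np.linalg.svd(M, full_matrices=False)
        return U

    # Singular values of the whitened cross-covariance = canonical correlations.
    return np.linalg.svd(whiten(X).T @ whiten(Y), compute_uv=False)

rho = cca_correlations(X, Y)
print(rho[:3])  # leading canonical correlations, large for the shared latent
```

Because both views here are driven by the same latent signal, the leading canonical correlations come out close to 1, mirroring how a shared audio/video representation would score.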
  • 40. MCGURK EFFECT A visual /ga/ combined with an audio /ba/ is often perceived as /da/. Model predictions (audio input, video input → % /ga/, % /ba/, % /da/): audio /ga/, video /ga/ → 82.6%, 2.2%, 15.2%; audio /ba/, video /ba/ → 4.4%, 89.1%, 6.5%
  • 41. MCGURK EFFECT A visual /ga/ combined with an audio /ba/ is often perceived as /da/. Model predictions (audio input, video input → % /ga/, % /ba/, % /da/): audio /ga/, video /ga/ → 82.6%, 2.2%, 15.2%; audio /ba/, video /ba/ → 4.4%, 89.1%, 6.5%; audio /ba/, video /ga/ → 28.3%, 13.0%, 58.7%
  • 42. CONCLUSION  Applied deep autoencoders to discover features in multimodal data.  Cross-modality learning: we obtained better video features (for lip-reading) by using audio as a cue.  Multimodal feature learning: learned representations that relate across the audio and video data.

Editor's notes

  1. In this work, I'm going to talk about audio-visual speech recognition and how we can apply deep learning to this multimodal setting. For example, given a small speech segment with a video of a person saying letters, can we determine which letter was said from two sources of data — the images of the lips and the audio — and how do we integrate them? Multimodal learning involves relating information from multiple sources. For example, images and 3-D depth scans are correlated at first order, as depth discontinuities often manifest as strong edges in images. Conversely, audio and visual data for speech recognition have non-linear correlations at a "mid-level", as phonemes or visemes; it is difficult to relate raw pixels to audio waveforms or spectrograms. In this paper, we are interested in modeling "mid-level" relationships, so we use audio-visual speech classification to validate our methods. In particular, we focus on learning representations for speech audio that are coupled with videos of the lips.
  2. So how do we solve this problem? A common machine learning pipeline goes like this: we take the inputs, extract some features, and feed them into our standard ML toolbox (e.g., a classifier). The hardest part is really the features — how we represent the audio and video data for use in our classifier. While for audio the speech community has developed many features, such as MFCCs, which work really well, it is not obvious what features we should use for lips.
  3. So what do state-of-the-art features look like? Engineering these features took a long time. To that end, we address two questions in this work. [click] Furthermore, what is interesting in this problem is the depth question: audio and video features are only related at a deep level.
  4. Concretely, our task is to convert sequences of lip images into a vector of numbersSimilarly, for the audio
  5. Now that we have multimodal data, one easy option is to simply concatenate the features. However, simple concatenation fails to model the interactions between the modalities. This is a very limited view of multimodal features; instead, what we would like to do [click] is to
  6. Find better ways to relate the audio and visual inputs and get features that arise out of relating them together
  7. Next I'm going to describe a different feature-learning setting. Suppose that at test time only the lip images are available and you do not get the audio signal, but at training time you have both audio and video. Can the audio at training time help you do better at test time, even though you don't have audio at test time? (Otherwise lip-reading is not well defined.) But there are more settings to consider! If our task is only to do lip reading — visual speech recognition — an interesting question to ask is: can we improve our lip-reading features if we had audio data?
  8. Let's step back a bit and take a similar but related approach to the problem. What if we learn an autoencoder? But this still has the problem! But wait — now we can do something interesting.
  9. So there are different versions of these shallow models, and if you train a model of this form, this is what one usually gets. If you look at the hidden units, it turns out there are hidden units that respond to X only and to Y only. So why doesn't this work? We think there are two reasons. In the shallow models, we're trying to relate pixel values directly to the values in the audio spectrogram; instead, what we expect is for mid-level video features, such as mouth motions, to inform us about the audio content. It turns out the model learns many unimodal units — the figure shows the connectivity. Two reasons are possible here: 1) the model is unable to do it (no incentive), and 2) we're actually trying to relate pixel values to values in the audio spectrogram, which is really difficult — for example, we do not expect a change in one pixel value to inform us about how the audio pitch is changing. Thus, the relations across the modalities are deep, and we really need a deep model to capture them. Review: 1) no incentive and 2) deep.
  10. But this still has the problem! But wait — now we can do something interesting. This model will be trained on clips with both audio and video.
  11. However, the connections between audio and video are (arguably) deep rather than shallow, so ideally we want to extract mid-level features before trying to connect them together. Since audio is really good for speech recognition, the model is going to learn representations that can reconstruct audio, and thus hopefully be good for speech recognition as well.
  12. But what we would like is not to have to train many versions of this model. It turns out that you can unify the separate models together.
  13. [pause] The second model we present is the bimodal deep autoencoder. What we want this bimodal deep AE to do is learn representations that relate both the audio and video data. Concretely, we want it to learn representations that are robust to the input modality.
  14. The features correspond to mouth motions and are also paired up with the audio spectrogram. The features are generic and are not speaker-specific.
  15. Explain in phases!
  16. Explain in phases
  17. Explain in phases