SECOND DEFENSE
VIDEO CAPTIONING AND LIP READING
WORK PROPOSED FOR MAJOR PROJECT
Objectives
• Train a neural network using RNNs (including LSTMs) and transfer learning to detect objects (lip movement, in this case), and link the detections with Natural Language Processing.
• Create a tool capable of detecting the objects in a video and describing its events.
• If a human face and lip movement are detected, use AI techniques to read the lips and convert what is being said to text.
Application
• Better search algorithms: if each video can be described automatically, search algorithms can return finer, more accurate results.
• Recommendation systems: videos can be clustered by similarity if their contents can be described automatically.
• Automated lipreading of speakers with damaged vocal tracts, biometric person identification, multi-talker simultaneous speech decoding, etc.
METHODOLOGY
• The project follows a three-step detection mechanism, and neural networks are used at every stage.
[Flowchart, stages labelled 1 to 3: Video converted into image frames → Detection of human lips → YES: Lip Reading / NO: Description of video contents → Caption Generation]
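The routing in the flowchart can be sketched as a small dispatcher. This is a minimal illustration, not the project's code: the function name `process_video`, the three callables it takes, and the returned dictionary layout are all hypothetical, and the YES/NO wiring is one reading of the flowchart.

```python
from typing import Callable, Dict, List, Sequence

Frame = List[List[int]]  # hypothetical stand-in for one decoded image frame


def process_video(
    frames: Sequence[Frame],
    lips_detected: Callable[[Sequence[Frame]], bool],
    read_lips: Callable[[Sequence[Frame]], str],
    caption: Callable[[Sequence[Frame]], str],
) -> Dict[str, str]:
    """Route decoded frames to lip reading or to captioning.

    Each callable stands in for one neural-network stage; none of them
    is implemented here.
    """
    if not frames:
        raise ValueError("video produced no frames")
    if lips_detected(frames):
        # YES branch: a speaking face was found, so transcribe the lips.
        return {"branch": "lip_reading", "text": read_lips(frames)}
    # NO branch: describe the visual contents instead.
    return {"branch": "captioning", "text": caption(frames)}


# Stub stages, for illustration only.
dummy_frames = [[[0, 0], [0, 0]]]
result = process_video(
    dummy_frames,
    lips_detected=lambda f: False,
    read_lips=lambda f: "transcript placeholder",
    caption=lambda f: "caption placeholder",
)
```

Passing the stages in as callables keeps the routing logic independent of whichever models end up filling each stage.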
LIP MOVEMENT DETECTION
• A simple RNN-based detector determines whether someone is speaking by watching their lip movements over 1 second of video (i.e. a sequence of 25 video frames). The detector can be run in real time on a video file, or on the output of a webcam, by using a sliding-window technique.
• The model contains:
• Two stacked RNN layers.
• Each layer is composed of 64 unidirectional, simple RNN cells.
• A dropout of 0.5 is applied to the output of the second RNN layer before it is fed to the final softmax classification layer.
• Datasets that can be used: GRID, AMFED, DISFA, HMDB, Cohn-Kanade
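The detector described above can be sketched in plain Python to make the data flow concrete. This is an illustrative forward pass only, under several assumptions that are not in the slides: each frame is already reduced to an 8-value lip feature vector (`FEATURES`), there are two output classes (`CLASSES`), and the weights are random rather than trained. The 25-frame window, the two 64-cell unidirectional layers and the softmax output follow the slide.

```python
import math
import random
from collections import deque

FRAMES_PER_WINDOW = 25  # 1 second of video, as stated in the slide
HIDDEN_UNITS = 64       # cells per RNN layer, as stated in the slide
FEATURES = 8            # assumption: size of the per-frame lip feature vector
CLASSES = 2             # assumption: 0 = silent, 1 = speaking


def rand_matrix(rows, cols, rng):
    """Small random weights; a trained model would load learned values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def softmax(z):
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


class SimpleRNNLayer:
    """Unidirectional simple RNN: h_t = tanh(W x_t + U h_{t-1} + b)."""

    def __init__(self, n_in, n_hidden, rng):
        self.w = rand_matrix(n_hidden, n_in, rng)
        self.u = rand_matrix(n_hidden, n_hidden, rng)
        self.b = [0.0] * n_hidden

    def forward(self, sequence):
        h = [0.0] * len(self.b)
        outputs = []
        for x in sequence:
            wx, uh = matvec(self.w, x), matvec(self.u, h)
            h = [math.tanh(a + c + b) for a, c, b in zip(wx, uh, self.b)]
            outputs.append(h)
        return outputs


class SpeakingDetector:
    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.rnn1 = SimpleRNNLayer(FEATURES, HIDDEN_UNITS, rng)
        self.rnn2 = SimpleRNNLayer(HIDDEN_UNITS, HIDDEN_UNITS, rng)
        self.out = rand_matrix(CLASSES, HIDDEN_UNITS, rng)

    def predict(self, window):
        """Class probabilities for one window of 25 feature vectors."""
        last = self.rnn2.forward(self.rnn1.forward(window))[-1]
        # Dropout (rate 0.5) is active only during training;
        # at inference it leaves the activations unchanged.
        return softmax(matvec(self.out, last))


def sliding_predictions(frame_features, detector):
    """Yield one prediction per new frame once 25 frames are buffered."""
    window = deque(maxlen=FRAMES_PER_WINDOW)
    for feats in frame_features:
        window.append(feats)  # the oldest frame drops out automatically
        if len(window) == FRAMES_PER_WINDOW:
            yield detector.predict(list(window))


# 30 all-zero frames give 30 - 25 + 1 = 6 overlapping windows.
stream = [[0.0] * FEATURES for _ in range(30)]
preds = list(sliding_predictions(stream, SpeakingDetector()))
```

The `deque` with `maxlen=25` is what makes the window slide: each new frame pushes out the oldest one, so a webcam stream can be scored frame by frame without re-buffering.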
VIDEO CAPTIONING
• Dataset that can be used: MSVD
• This dataset contains 1450 short YouTube clips that have been manually labelled for training, and 100 videos for testing.
• Each video has been assigned a unique ID, and each ID has about 15–20 captions.
• Model used for feature extraction: VGG16 (chosen to keep the number of parameters to train low)
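The one-ID-to-many-captions layout described above can be illustrated with a short sketch. The record format (a list of `(video_id, caption)` pairs), the function names, the `<bos>`/`<eos>` markers and the sample captions are all made up for illustration; they are not taken from MSVD or from the project.

```python
from collections import defaultdict


def group_captions(records):
    """Map each video ID to the list of captions written for it."""
    by_id = defaultdict(list)
    for video_id, caption in records:
        by_id[video_id].append(caption.strip().lower())
    return dict(by_id)


def training_pairs(by_id, bos="<bos>", eos="<eos>"):
    """Build one (video_id, tokens) pair per caption.

    A clip with 15 to 20 captions therefore contributes 15 to 20
    training examples that all share the same video features.
    """
    pairs = []
    for video_id, captions in by_id.items():
        for caption in captions:
            pairs.append((video_id, [bos, *caption.split(), eos]))
    return pairs


# Invented sample records, only to show the shape of the data.
records = [
    ("vid_001", "A man is playing a guitar"),
    ("vid_001", "Someone plays guitar"),
    ("vid_002", "A cat is jumping"),
]
by_id = group_captions(records)
pairs = training_pairs(by_id)
```

Keying everything by video ID means the frame features from the feature extractor need to be computed only once per clip and can then be reused for every caption of that clip.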
LIP READING
• Dataset used: GRID corpus
• GRID is a large multi-talker audio-visual sentence corpus built to support joint studies in speech perception. In brief, the corpus consists of high-quality audio and video (facial) recordings of 1000 sentences spoken by each of 34 talkers (18 male, 16 female). Sentences are of the form "put red at G9 now".
• A sequence of T frames is used as input and is processed by 3 layers of STCNN (spatiotemporal CNN), each followed by a spatial max-pooling layer. The extracted features are processed by 2 Bi-GRUs; each time step of the GRU output is processed by a linear layer and a softmax. This end-to-end model is trained with CTC (connectionist temporal classification).
LipNet architecture
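Because the model emits one softmax distribution per time step and is trained with CTC, its output has to be collapsed into text. The sketch below shows greedy (best-path) CTC decoding, which is the simplest option and not necessarily the decoder this project uses. The `_` blank symbol, the four-symbol alphabet and the probability values are assumptions chosen for illustration.

```python
from itertools import groupby

BLANK = "_"  # assumption: symbol used here for the CTC blank label


def ctc_greedy_decode(frame_probs, alphabet):
    """Best-path CTC decoding.

    1. Pick the most likely symbol at every time step.
    2. Merge runs of the same symbol.
    3. Drop the blanks.
    """
    best = [
        alphabet[max(range(len(probs)), key=probs.__getitem__)]
        for probs in frame_probs
    ]
    collapsed = [symbol for symbol, _ in groupby(best)]
    return "".join(s for s in collapsed if s != BLANK)


alphabet = [BLANK, "p", "u", "t"]
frame_probs = [
    [0.10, 0.70, 0.10, 0.10],  # p
    [0.10, 0.60, 0.20, 0.10],  # p (repeat, merged in step 2)
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.70, 0.10],  # u
    [0.10, 0.10, 0.70, 0.10],  # u (repeat, merged in step 2)
    [0.10, 0.10, 0.10, 0.70],  # t
    [0.90, 0.00, 0.05, 0.05],  # blank
]
decoded = ctc_greedy_decode(frame_probs, alphabet)  # "put"
```

The blank is what lets CTC represent genuine double letters: a blank between two identical symbols keeps them from being merged in step 2.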
LIP READING APPLICATION
[Image: input video]
VIDEO CAPTIONING APPLICATION
[Image: input video]
THANK YOU
YASHIKA CHUGH
40214803118
