2. WORK PROPOSED FOR MAJOR PROJECT
Objectives
• Train a neural network using LSTMs, RNNs, and transfer learning for object detection (lip movement in this case) and link it with Natural Language Processing.
• Create a powerful tool capable of detecting objects and describing the events of a video.
• If a human face with lip movement is detected, use AI techniques to read the lips and convert what is being said to text.
Applications
• Better search algorithms: if each video can be automatically described, search algorithms will return finer, more accurate results.
• Recommendation systems: videos can easily be clustered by similarity if their contents can be automatically described.
• Automated lipreading of speakers with damaged vocal tracts, biometric person identification, multi-talker simultaneous speech decoding, etc.
3. METHODOLOGY
The project follows a three-step detection pipeline, and neural networks are used at every stage.
[Flowchart: (1) video converted into image frames → (2) detection of human lips → if YES, (3) lip reading; if NO, description of video contents → caption generation]
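As a minimal sketch of step 1 (converting a video into image frames), the Python snippet below uses OpenCV. The file name "input.mp4" and the frame stride are illustrative assumptions, not part of the proposal.

import cv2

def video_to_frames(path, stride=1):
    """Yield every `stride`-th frame of the video as a BGR NumPy array."""
    cap = cv2.VideoCapture(path)
    i = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % stride == 0:
            yield frame
        i += 1
    cap.release()

# Example: collect all frames of a (hypothetical) input file
frames = list(video_to_frames("input.mp4"))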
4. LIP MOVEMENT DETECTION
• A simple RNN-based detector determines whether someone is speaking by watching their lip movements over 1 second of video (i.e., a sequence of 25 video frames). The detector can run in real time on a video file or on webcam output using a sliding-window technique.
• This model contains:
• Two stacked RNN layers, each composed of 64 non-bidirectional simple RNN cells.
• A dropout of 0.5 applied to the output of the second RNN layer before it is fed to the final softmax classification layer (a minimal sketch of this detector appears after the dataset list below).
• Datasets that can be used: GRID, AMFED, DISFA, HMDB, Cohn-Kanade.
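Below is a minimal Keras sketch of the detector described above (two stacked 64-cell simple RNN layers, 0.5 dropout, softmax output). The per-frame feature size and the binary speaking/not-speaking output are assumptions; the per-frame input representation is not specified above.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dropout, Dense

FRAMES = 25     # 1 second of video at 25 fps, per the description above
FEATURES = 40   # assumed size of the per-frame lip feature vector

model = Sequential([
    # Two stacked layers of 64 non-bidirectional simple RNN cells
    SimpleRNN(64, return_sequences=True, input_shape=(FRAMES, FEATURES)),
    SimpleRNN(64),
    # Dropout of 0.5 on the second RNN layer's output
    Dropout(0.5),
    # Final softmax classification layer: speaking vs. not speaking
    Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

At inference time, the sliding-window technique mentioned above amounts to re-running this model on each successive 25-frame window of the video.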
5. VIDEO CAPTIONING
• Dataset that can be used: MSVD
• This dataset contains 1450 short YouTube clips that have been manually labelled for training and 100 videos for testing.
• Each video has been assigned a unique ID, and each ID has about 15–20 captions.
• Model used for feature extraction: VGG16 (chosen because it has fewer training parameters); a minimal extraction sketch appears below.
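The snippet below is a minimal sketch of per-frame feature extraction with VGG16 in Keras. It assumes frames are resized to 224x224 and that the 4096-d fc2 activations serve as the per-frame feature; the choice of layer is an assumption, not specified above.

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model

base = VGG16(weights="imagenet")          # full network with classifier head
# Use the 4096-d fc2 activations as the per-frame feature vector
extractor = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

def frame_features(frames):
    """frames: array of shape (n, 224, 224, 3), RGB, values 0-255."""
    x = preprocess_input(np.asarray(frames, dtype="float32"))
    return extractor.predict(x)           # shape (n, 4096)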
6. LIP READING
• Dataset used: GRID CORPUS
• GRID is a large multitalker audio-visual sentence corpus designed to support joint studies in speech perception. In brief, the corpus consists of high-quality audio and video (facial) recordings of 1000 sentences spoken by each of 34 talkers (18 male, 16 female). Sentences are of the form "put red at G9 now".
• A sequence of T frames is used as input and is processed by 3 layers of STCNN, each followed by a spatial max-pooling layer. The extracted features are processed by 2 Bi-GRUs; each time-step of the GRU output is processed by a linear layer and a softmax. This end-to-end model is trained with CTC; a minimal sketch follows the figure below.
[Figure: LipNet architecture]
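Below is a minimal Keras sketch of the LipNet-style pipeline described above (three STCNN + spatial max-pool blocks, two Bi-GRUs, a per-time-step linear layer with softmax, trained with CTC). The filter counts, GRU width, and 75-frame 50x100 input follow the LipNet paper where known; any remaining specifics are assumptions.

from tensorflow.keras import layers, Model

T, H, W, C = 75, 50, 100, 3   # 75 frames of 50x100 RGB lip crops (GRID)
VOCAB = 28                    # assumed: 26 letters + space + CTC blank

inp = layers.Input(shape=(T, H, W, C))
x = inp
# Three spatiotemporal convolution (STCNN) blocks, each followed by
# spatial-only max pooling (the temporal axis is preserved)
for filters in (32, 64, 96):
    x = layers.Conv3D(filters, (3, 5, 5), padding="same", activation="relu")(x)
    x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)
# Flatten the spatial dimensions of each time-step
x = layers.TimeDistributed(layers.Flatten())(x)
# Two bidirectional GRUs over the frame sequence
x = layers.Bidirectional(layers.GRU(256, return_sequences=True))(x)
x = layers.Bidirectional(layers.GRU(256, return_sequences=True))(x)
# Per-time-step linear layer + softmax over the character vocabulary
out = layers.Dense(VOCAB, activation="softmax")(x)

model = Model(inp, out)   # train end-to-end with a CTC loss (e.g. tf.nn.ctc_loss)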