Audio-Visual Speech Processing presentation at the IEEE conference on Advanced Technologies for Signal and Image Processing in Sousse, Tunisia, March 18th, 2014
1. Audio-Visual Speech Processing
Gérard Chollet
with Meriem Bendris, Hervé Bredin, Thomas Hueber,
Walid Karam, Rémi Landais, Patrick Perrot,
Eduardo Sanchez-Soto, Leila Zouari
ATSIP, Sousse, March 18th 2014
2. Page 2 ATSIP, Sousse, May 18th, 2014
Some motivations…
- A talking face is more intelligible, expressive, recognisable and attractive than acoustic speech alone.
- The combined use of facial and speech information improves identity verification and robustness to forgeries.
- Multi-stream models of the synchrony of visual and acoustic information have applications in the analysis, coding, recognition and synthesis of talking faces.
- SmartPhones, VisioPhones, WebPhones, SecurePhones, visioconferences and virtual-reality worlds are gaining popularity.
3. Some topics under study…
- Audio-visual speech recognition
  - Automatic "lip-reading"
- Audio-visual speaker verification
  - Detection of forgeries
- Speech-driven animation of the face
  - Could we look and sound like somebody else?
- Speaker indexing
  - "Who is talking in a video sequence?"
- OUISPER: a silent speech interface
  - Corpus-based synthesis from tongue and lips
4. Audio-Visual Speech Recognition
[Diagram: feature extraction feeds a decoder that combines acoustic models, a dictionary and a grammar]
5. Video Mike (IBM, 2004)
6. Audio processing
- Features extraction
- Digits detection
- Digits recognition:
  - Acoustic parameters: MFCC
  - Context-independent HMMs
  - Decoding: time-synchronous algorithm
- Sound effect
  - Noise: babble
- Recognition experiments
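As a concrete illustration of the acoustic front-end named above, here is a minimal numpy-only sketch of MFCC extraction (framing, power spectrum, mel filterbank, log, DCT). The function name `mfcc` and all parameter values are illustrative defaults, not taken from the system in the slides:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Minimal MFCC front-end: framing, power spectrum, mel filterbank, log, DCT."""
    # Frame the signal with a Hamming window
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frames.append(signal[start:start + n_fft] * np.hamming(n_fft))
    frames = np.array(frames)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2  # power spectrum per frame

    # Triangular mel filterbank
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)

    # DCT-II decorrelates the log filterbank energies; keep the first n_ceps
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return log_energy @ dct.T
```

Each row of the returned matrix is one frame's cepstral vector; a real recognizer would append delta and energy features before HMM decoding.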
7. Video processing
- Video extraction
- Lips localisation
- Images interpolation (same frequency as speech)
- Features extraction
  - DCT and DCT2 (DCT + LDA)
  - Projections: PRO and PRO2 (PRO + LDA)
- Recognition experiments
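The DCT-based visual features above can be sketched as a 2-D DCT of the lip region of interest, keeping only low-frequency coefficients. This is a minimal illustration; the function names and the `keep` parameter are assumptions, and the LDA stage of DCT2 is omitted:

```python
import numpy as np

def dct2(block):
    """Separable 2-D DCT-II of a square image block via matrix products."""
    n = block.shape[0]
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    C = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    return C @ block @ C.T

def lip_features(roi, keep=6):
    """Keep the top-left keep x keep block of low-frequency DCT coefficients."""
    coeffs = dct2(roi.astype(float))
    return coeffs[:keep, :keep].ravel()
```

The low-frequency block captures the coarse mouth shape while discarding pixel-level detail, which is why DCT features are a common baseline for lip-reading.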
8. Fusion techniques
- Parameter fusion:
  - Concatenation
  - Dimensionality reduction: Linear Discriminant Analysis (LDA)
  - Modelling: classical single-stream HMM
- Score fusion: multi-stream HMM
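Score fusion in a multi-stream HMM amounts to a weighted sum of the per-stream log-likelihoods. A minimal sketch, with hypothetical function names and an illustrative stream weight `lam`:

```python
import numpy as np

def fuse_stream_scores(log_lik_audio, log_lik_video, lam=0.7):
    """Multi-stream combination: weighted sum of per-stream log-likelihoods.
    lam is the audio stream exponent; it is typically lowered as the SNR drops."""
    return lam * np.asarray(log_lik_audio) + (1 - lam) * np.asarray(log_lik_video)

def decode(word_scores_audio, word_scores_video, lam=0.7):
    """Pick the word whose fused score is highest."""
    fused = fuse_stream_scores(word_scores_audio, word_scores_video, lam)
    return int(np.argmax(fused))
```

Lowering `lam` in noise shifts the decision toward the video stream, which is exactly the behaviour the accuracy-vs-SNR results on the next slide illustrate.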
9. Experimental results: parameter fusion
[Chart: word accuracy (%) vs. SNR (dB, -15 to +10) for speech only, video only (PRO2, DCT2) and AV fusion (PRO2, DCT2)]
10. Experimental results: score fusion at -5 dB
[Bar chart: word accuracy (%) between 42 and 52 for speech only, AV:PRO, AV:PRO2, AV:DCT and AV:DCT2]
11. Audiovisual identity verification
- Fusion of face and speech for identity verification
- Detection of possible forgeries
- Compulsory? For:
  - Homeland/corporate security: restricted access, …
  - Secure computer login
  - Secure on-line signing of contracts
12. Talking-face and 2D face sequence database
- Data: video sequences (.avi) in which a short phrase in English is pronounced / duration ≈ 10 s (actual speech duration ≈ 2 s)
- Audio-video data used for talking-face evaluations
- Same sequences used for 2D-face-from-video evaluations
- 430 subjects each pronounced 4 phrases:
  - from a set of 430 English phrases
  - 2 indoor video files acquired during the first session
  - 2 outdoor video files acquired during the second session
  - realistic forgeries created a posteriori
13. Audio-Visual Speech Features
- Visual: raw pixel values, DCT transform, shape-related features, many others …
- Audio: raw amplitude, "classical" MFCC coefficients, many others
14. Audio-Visual Subspaces
- Reduced audiovisual subspace: Principal Component and Linear Discriminant Analysis
- Correlated audio and visual subspaces: Co-inertia and Canonical Correlation Analysis
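The correlated-subspace idea can be sketched with Canonical Correlation Analysis: find one direction per modality so that the projected audio and visual features are maximally correlated. A minimal numpy implementation, with a hypothetical function name and a small regularizer `reg` added for numerical stability:

```python
import numpy as np

def cca_first_pair(X, Y, reg=1e-6):
    """First canonical pair between feature matrices X (n x p) and Y (n x q).
    Returns the two projection directions and the canonical correlation."""
    Xc = X - X.mean(0)
    Yc = Y - Y.mean(0)
    n = len(X)
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    # Whiten each modality, then SVD the whitened cross-covariance
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy.T)
    a = Wx.T @ U[:, 0]   # e.g. audio direction
    b = Wy.T @ Vt[0]     # e.g. visual direction
    return a, b, s[0]
```

Co-inertia analysis follows the same recipe but maximizes covariance rather than correlation (no whitening of the two blocks).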
16. Application to indexation
- High-level requests
  - "Find videos where John Doe is speaking"
  - "Find dialogues between Mr X and Mrs Y"
  - "Locate the singer in this music video"
[Diagram: correlation between raw audio energy and raw pixel values]
17. Who is speaking?
- Face tracking
- Correlation between
  - the pixels of each face
  - the raw audio energy
- Find maximum synchrony (green: current speaker)
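The maximum-synchrony rule above can be sketched as follows: compute the normalised correlation between the audio energy track and each face's pixel-activity track, and pick the face with the highest score. The function name and input layout are assumptions for illustration:

```python
import numpy as np

def current_speaker(audio_energy, face_pixel_activity):
    """Pick the face whose pixel activity correlates best with the audio energy.
    face_pixel_activity: array of shape (n_faces, n_frames)."""
    e = audio_energy - audio_energy.mean()
    scores = []
    for face in face_pixel_activity:
        f = face - face.mean()
        denom = np.linalg.norm(e) * np.linalg.norm(f) + 1e-10
        scores.append(float(e @ f / denom))
    return int(np.argmax(scores)), scores
```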
18. How to Perform "Talking-Face" Authentication?
[Diagram: face recognition says OK, speaker verification says OK, so score fusion says OK]
- What if… the access is a deliberate imposture?
19. Biometrics
- Identity verification with talking faces
  - Speaker verification
  - Face recognition
- What if? [Diagram: the face says OK and the voice says OK, yet the access should be rejected]
20. Identity Verification
- Enrolment of client λ yields a model for client λ (Co-Inertia Analysis)
- A person ε pretending to be client λ is accepted if the score exceeds a threshold, rejected otherwise
- Equal Error Rate: 30 %
21. Replay Attack Detection
- Training: a synchrony model is learned for the client (Co-IA / CCA)
- Test: the access is accepted if it matches the synchrony model, rejected otherwise
22. Replay Attack Detection
- Genuine synchronized video vs. audio replay attack: the lips do not match the audio perfectly
- Equal Error Rate: 14 %
23. Example of Replay Attacks
24. Alignment by maximum correlation
[Plot: correlation as a function of the audio/video delay (-5 to +5 frames), for delayed video and delayed audio]
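Alignment by maximum correlation can be sketched as a search over candidate delays: shift one feature track against the other and keep the lag with the highest normalised correlation. The function name and the `max_lag` range are illustrative:

```python
import numpy as np

def best_lag(audio_feat, video_feat, max_lag=5):
    """Signed offset (in frames) between two 1-D feature tracks that
    maximises their normalised correlation."""
    best, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, v = audio_feat[lag:], video_feat[:len(video_feat) - lag]
        else:
            a, v = audio_feat[:lag], video_feat[-lag:]
        a = a - a.mean()
        v = v - v.mean()
        score = float(a @ v / (np.linalg.norm(a) * np.linalg.norm(v) + 1e-10))
        if score > best_score:
            best, best_score = lag, score
    return best
```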
25. Audiovisual identity verification
- Available features from the video:
  - Face: face features (lips, eyes) → face modality
  - Speech → speech modality
  - Speech synchrony → synchrony modality
26. Audiovisual identity verification
- Face modality
  - Detection:
    - Generative models (MPT toolbox)
    - Temporal median filtering
    - Eye detection within faces
  - Normalization: geometry + illumination
27. Audiovisual identity verification
- Face modality: two verification strategies within a single comparison framework
  - Global = eigenfaces:
    - Compute a set of directions (eigenfaces) defining a projection space
    - Two faces are compared via their projections onto the eigenface space
    - Learning data: BIOMET (130 persons) + BANCA (30 persons)
29. Audiovisual identity verification
- Face modality: SVD-based matching method
  - Compares two videos V1 and V2
  - Exclusive principle: one-to-one correspondences between faces (global) or descriptors (local)
  - Principle: compute a proximity matrix between faces or descriptors, then extract good pairings (made easy by the SVD)
  - Scores: one matching score between global representations, one between local representations
31. Audiovisual identity verification
- Speech modality: GMM-based approach
  - One world model
  - Each speaker model is derived from the world model by MAP adaptation
  - Speech verification score: derived from a likelihood ratio
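The likelihood-ratio score can be sketched with single diagonal Gaussians standing in for the full GMMs (a deliberate simplification; real systems use mixtures of hundreds of components). Function names and the shared-variance assumption are illustrative:

```python
import numpy as np

def gauss_loglik(x, mean, var):
    """Frame log-likelihood under a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)

def verification_score(frames, client_mean, world_mean, var):
    """Average log-likelihood ratio between the (MAP-adapted) client model and
    the world model; the access is accepted when it exceeds a threshold."""
    llr = gauss_loglik(frames, client_mean, var) - gauss_loglik(frames, world_mean, var)
    return float(np.mean(llr))
```

A positive score means the frames fit the client model better than the world model; the decision threshold is tuned on development data.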
32. Audiovisual identity verification
- Synchrony modality
  - Principle: the synchrony between lips and speech carries identity information
  - Process:
    - Compute a synchrony model (Co-inertia analysis) for each person, based on DCT (visual signal) and MFCC (speech signal) features
    - Compare the test sample with the synchrony model
33. Audiovisual identity verification
- Experiments on the BANCA database:
  - 52 persons divided into two groups (G1 and G2)
  - 3 recording conditions
  - 8 recordings per person (4 client accesses, 4 impostor accesses)
  - Evaluation based on the P protocol: 234 client accesses and 312 impostor accesses
- Scores:
  - 4 scores per access (PCA face, SIFT face, speech, synchrony)
  - Score fusion based on an RBF-SVM: the hyperplane is learned on G1 and tested on G2, and conversely
35. SecurePhone
- A technical solution that improves security
- Biometric recognition using VOICE, FACE and SIGNATURE
- An electronic signature is used to secure information exchange
36. Biometrics in SecurePhone
[Diagram: face, voice and written signature are each pre-processed and modelled; the scores are fused and access is granted or denied]
37. The BioSecure Multimodal Evaluation Campaign
- Launched in April 2007
- Many modalities, including "video sequences" and "talking faces"
- Development data and reference systems available
- Evaluations on the sequestered BioSecure database (1000 clients)
- Debriefing workshop
- More info: http://www.int-evry.fr/biometrics/BMEC2007/index.php
38. Audio-visual forgery scenarios
- Low-effort
  - "Paparazzi" scenario: the impostor owns a picture of the face and a recording of the voice of the target
  - "Big Brother" scenario: the impostor owns a video of the face and a recording of the voice of the target
- High-effort
  - "Imitator" scenario: the impostor owns a recording of the voice of the target and transforms his own voice to sound like the target
  - "Playback" scenario: the impostor owns a picture of the face of the target and animates it according to his own face motion
  - "Ventriloquist" scenario: combines the two previous ones
39. Detection of imposture
- Face modality: ACCEPTED
- Voice modality: ACCEPTED
- Synchronisation: DENIED
40. Talking-face forgeries @ BMEC: audio replay attack
- Assumptions:
  - The forger has recorded speech data from the genuine user in outdoor (test) conditions
  - The forger replays the audio and uses his own face in front of the sensor
[Images: stolen wave; audio replay + forger's face; audio replay + "random" face]
41. Talking-face forgeries @ BMEC: face-animation replay attack (CrazyTalk + TTS)
- Assumptions:
  - The forger has stolen a picture
  - The forger uses face-animation software and TTS (male or female voice)
  - The forger plays back the animation to the sensor
[Images: stolen picture; contour detection; generated .avi]
42. Talking-face forgeries @ BMEC: picture presentation + TTS
- Assumptions:
  - The forger has stolen a picture and printed it
  - The forger presents the picture to the sensor and uses TTS (same wave as for the face-animation forgery)
[Images: stolen picture; presented picture]
43. Systems with fusion of (face, speech)
[Diagram: the video sequence is split into frames and a speech signal; face verification yields a face score, speaker verification a speech score, and the two are combined into a fusion score]
44. Voice Conversion methods
- GMM conversion
  - Training of a joint Gaussian model
    - parallel corpus of aligned sentences from both the source and target voices
    - MFCC on HNM (Harmonic plus Noise Model) parameterization
  - Speech synthesis from the Gaussian model
    - inversion of the MFCC
    - pitch correction
- ALISP conversion
  - Very-low-bit-rate speech compression method (500 bps)
    - originally developed by TELECOM ParisTech
  - Indexed segment dictionary (of the target voice)
  - HNM parameterization
45. Voice conversion techniques
- Definition: the process of making one person's voice (the "source") sound like another person's voice (the "target")
[Diagram: "My name is John" uttered by the source is converted to the target's voice]
46. Principle of ALISP
[Coder diagram: input speech undergoes spectral and prosodic analysis; segmental units are selected from a dictionary of representative segments, producing a segment index and prosodic parameters; the decoder performs concatenative HNM synthesis from the same dictionary to produce the output speech]
47. Details of Encoding
[Diagram: speech undergoes spectral and prosodic analysis; HMM recognition against a dictionary of HMM models of ALISP classes yields the index of the ALISP class; DTW then selects the best representative unit of that class (synthesis units A1 … A8 for HMM A), giving the synthesis-unit index; prosodic encoding yields pitch, energy and duration]
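The DTW-based selection step can be sketched as follows: compute the dynamic-time-warping distance between the input segment and each stored representative unit of the recognized class, and keep the closest one. Function names are illustrative, and the unit-step DTW below is the textbook variant rather than the system's exact recipe:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic-time-warping distance between feature sequences (n x d), (m x d)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

def select_unit(segment, units):
    """Index of the stored representative unit closest to the segment under DTW."""
    return int(np.argmin([dtw_distance(segment, u) for u in units]))
```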
48. Details of Decoding
[Diagram: the ALISP class index and the synthesis-unit index select a stored unit (A1 … A8); the prosodic parameters drive the concatenative synthesis of the output speech]
49. Principle of ALISP conversion
- Learning step (one hour of target voice):
  - parametric analysis: MFCC
  - segmentation based on temporal decomposition and vector quantization
  - stochastic modelling based on HMMs
  - creation of representative units
- Conversion step:
  - parametric analysis: MFCC
  - HMM recognition
  - selection of the representative segment → DTW
- Synthesis step:
  - concatenation of the representative segments
  - HNM synthesis
50. Voice conversion using ALISP: results
[Audio examples from the NIST and BREF databases: source, result and target, for female-to-female and female-to-male conversions]
51. Demonstration of Voice Conversion
[Audio examples: impostor voice; voice converted with GMM; with ALISP; with ALISP+GMM; target voice]
52. 3D reconstruction
- 3D face modelling from a front and a profile shot
- Animated face
- https://picoforge.int-evry.fr/cgi-bin/twiki/view/Myblog3d/Web/Demos
53. Face Transformation
- Control-point selection (Figure 1)
- Image segmentation (Figure 2: division of an image)
- Linear transformation between the source and target images
- Blending step
54. Face Transformation
- Localisation of control points → warping → blending (source to target)
- Warping maps each source point X to X′ = f(X)
- Blending: p = αp + (1 − α)p′
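The blending formula above is a per-pixel cross-dissolve. A minimal sketch, with an illustrative function name; `source_pix` and `target_pix` stand for the source pixels and the warped target pixels:

```python
import numpy as np

def blend(source_pix, target_pix, alpha):
    """Cross-dissolve used in the blending step: p = alpha*p + (1 - alpha)*p'.
    alpha = 1 keeps the source, alpha = 0 keeps the (warped) target."""
    source_pix = np.asarray(source_pix, dtype=float)
    target_pix = np.asarray(target_pix, dtype=float)
    return alpha * source_pix + (1 - alpha) * target_pix
```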
55. Face transformation (IBM)
56. Ouisper1 - Silent Speech Interface
- Sensor-based system allowing speech communication via standard articulators, but without glottal activity
- Two distinct types of application:
  - an alternative to tracheo-oesophageal speech (TES) for persons having undergone a tracheotomy
  - a "silent telephone" for use in situations where quiet must be maintained, or for communication in very noisy environments
- Speech synthesis from ultrasound and optical imagery of the tongue and lips
1) Oral Ultrasound synthetIc SPEech souRce
57. Ouisper - System Overview
[Diagram. Training: ultrasound video of the vocal tract and optical video of the speaker's lips undergo visual feature extraction; together with the recorded audio and its text alignment they form an audio-visual speech corpus. Test: the visual data is decoded by a visual speech recognizer into N-best phonetic or ALISP targets, followed by visual unit selection and audio unit concatenation.]
58. Ouisper - Training Data
59. Ouisper - Video Stream Coding
- Build a subset of typical frames, perform PCA, and code new frames by their projections onto the resulting eigenvectors
T. Hueber, G. Aversano, G. Chollet, B. Denby, G. Dreyfus, Y. Oussar, P. Roussel, M. Stone, "EigenTongue Feature Extraction for an Ultrasound-based Silent Speech Interface," IEEE International Conference on Acoustics, Speech and Signal Processing, Honolulu, Hawaii, USA, 2007.
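The eigenframe coding described above can be sketched with PCA via the SVD: learn principal directions from a set of typical frames, then represent each new frame by its projection coefficients. Function names are illustrative:

```python
import numpy as np

def fit_eigenframes(frames, k):
    """PCA on a set of typical frames (each flattened to a vector).
    Returns the mean frame and the top-k eigenvectors (as rows)."""
    X = frames.reshape(len(frames), -1).astype(float)
    mean = X.mean(0)
    # SVD of the centred data gives the principal directions
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def code_frame(frame, mean, eigenframes):
    """Code a new frame by its projections onto the eigenvectors."""
    return eigenframes @ (frame.ravel().astype(float) - mean)
```

Keeping only a few coefficients per frame turns each high-dimensional ultrasound image into a compact feature vector for the visual speech recognizer.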
60. Ouisper - Audio Stream Coding
- ALISP segmentation:
  - detection of quasi-stationary parts in the parametric representation of speech
  - assignment of segments to classes using unsupervised classification techniques
- Phonetic segmentation:
  - forced alignment of the speech with the text
  - requires a relevant and correct phonetic transcription of the uttered signal
- Corpus-based synthesis requires a preliminary segmental description of the signal
61. Audiovisual dictionary building
- Visual and acoustic data are synchronously recorded
- The audio segmentation is used to bootstrap the visual speech recognizer
- An HMM model is trained for each phonetic class (e.g. /e-r/, /a-j/, /u-th/) to build the audiovisual dictionary
62. Visuo-acoustic decoding
- Visual speech recognition
  - Train an HMM model for each visual class
    - using multistream-based learning techniques
  - Perform a "visuo-phonetic" decoding step
    - use an N-best list
    - introduce linguistic constraints: language model, dictionary, multigrams
- Corpus-based speech synthesis
  - Combine probabilistic and data-driven approaches in the audiovisual unit-selection step
63. Speech recognition from video-only data
- Ref: ow p ax n y uh r b uh k t uw dh ax f er s t p ey jh ("Open your book to the first page")
- Rec: ax w ih y uh r b uh k sh uw dh ax v er s p ey jh ("A wear your book shoe the verse page")
- Corpus-based synthesis driven by the predicted phonetic lattice is currently under study.
64. Ouisper - Conclusion
- More information: http://www.neurones.espci.fr/ouisper/
- Contacts: gerard.chollet@enst.fr, denby@ieee.org, hueber@ieee.org
65. Audio-Visual Speech Processing: Conclusions and Perspectives
- A talking face is more intelligible, expressive, recognisable and attractive than acoustic speech alone.
- The combined use of facial and speech information improves identity verification and robustness to forgeries.
- Multi-stream models of the synchrony of visual and acoustic information have applications in the analysis, coding, recognition and synthesis of talking faces.