Audio-Visual Speech Processing presentation at the IEEE conference on Advanced Technologies for Signal and Image Processing in Sousse, Tunisia, March 18th, 2014
1. Audio-Visual Speech Processing
Gérard Chollet
with Meriem Bendris, Hervé Bredin, Thomas Hueber,
Walid Karam, Rémi Landais, Patrick Perrot,
Eduardo Sanchez-Soto, Leila Zouari
ATSIP, Sousse, March 18th 2014
2. Page 2 ATSIP, Sousse, May 18th, 2014
Some motivations…
- A talking face is more intelligible, expressive, recognisable and attractive than acoustic speech alone.
- The combined use of facial and speech information improves identity verification and robustness to forgeries.
- Multi-stream models of the synchrony of visual and acoustic information have applications in the analysis, coding, recognition and synthesis of talking faces.
- SmartPhones, VisioPhones, WebPhones, SecurePhones, visioconferences and virtual-reality worlds are gaining popularity.
3. Some topics under study…
- Audio-visual speech recognition
  - Automatic "lip-reading"
- Audio-visual speaker verification
  - Detection of forgeries
- Speech-driven animation of the face
  - Could we look and sound like somebody else?
- Speaker indexing
  - "Who is talking in a video sequence?"
- OUISPER: a silent speech interface
  - Corpus-based synthesis from tongue and lips
4. Audio-Visual Speech Recognition
[Diagram: feature extraction feeds a decoder that combines acoustic models, a dictionary and a grammar]
5. Video Mike (IBM, 2004)
6. Audio processing
- Features extraction
- Digits detection
- Digits recognition:
  - Acoustic parameters: MFCC
  - Context-independent HMMs
  - Decoding: time-synchronous algorithm
- Sound effect
  - Noise: babble
- Recognition experiments
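As a concrete illustration of the acoustic front-end named above, here is a minimal numpy-only sketch of MFCC extraction (framing, power spectrum, mel filterbank, log, DCT). The function name `mfcc` and all parameter values are illustrative defaults, not taken from the system in the slides:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Minimal MFCC front-end: framing, power spectrum, mel filterbank, log, DCT."""
    # Frame the signal with a Hamming window
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frames.append(signal[start:start + n_fft] * np.hamming(n_fft))
    frames = np.array(frames)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2  # power spectrum per frame

    # Triangular mel filterbank
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)

    # DCT-II decorrelates the log filterbank energies; keep the first n_ceps
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return log_energy @ dct.T
```

Each row of the returned matrix is one frame's cepstral vector; a real recognizer would append delta and energy features before HMM decoding.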
7. Video processing
- Video extraction
- Lips localisation
- Images interpolation (same frequency as speech)
- Features extraction
  - DCT and DCT2 (DCT + LDA)
  - Projections: PRO and PRO2 (PRO + LDA)
- Recognition experiments
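The DCT-based visual features above can be sketched as a 2-D DCT of the lip region of interest, keeping only low-frequency coefficients. This is a minimal illustration; the function names and the `keep` parameter are assumptions, and the LDA stage of DCT2 is omitted:

```python
import numpy as np

def dct2(block):
    """Separable 2-D DCT-II of a square image block via matrix products."""
    n = block.shape[0]
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    C = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    return C @ block @ C.T

def lip_features(roi, keep=6):
    """Keep the top-left keep x keep block of low-frequency DCT coefficients."""
    coeffs = dct2(roi.astype(float))
    return coeffs[:keep, :keep].ravel()
```

The low-frequency block captures the coarse mouth shape while discarding pixel-level detail, which is why DCT features are a common baseline for lip-reading.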
8. Fusion techniques
- Parameter fusion:
  - Concatenation
  - Dimensionality reduction: Linear Discriminant Analysis (LDA)
  - Modelling: classical single-stream HMM
- Score fusion: multi-stream HMM
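Score fusion in a multi-stream HMM amounts to a weighted sum of the per-stream log-likelihoods. A minimal sketch, with hypothetical function names and an illustrative stream weight `lam`:

```python
import numpy as np

def fuse_stream_scores(log_lik_audio, log_lik_video, lam=0.7):
    """Multi-stream combination: weighted sum of per-stream log-likelihoods.
    lam is the audio stream exponent; it is typically lowered as the SNR drops."""
    return lam * np.asarray(log_lik_audio) + (1 - lam) * np.asarray(log_lik_video)

def decode(word_scores_audio, word_scores_video, lam=0.7):
    """Pick the word whose fused score is highest."""
    fused = fuse_stream_scores(word_scores_audio, word_scores_video, lam)
    return int(np.argmax(fused))
```

Lowering `lam` in noise shifts the decision toward the video stream, which is exactly the behaviour the accuracy-vs-SNR results on the next slide illustrate.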
9. Experimental results: parameter fusion
[Chart: word accuracy (%) vs. SNR (dB, -15 to +10) for speech only, video only (PRO2, DCT2) and AV fusion (PRO2, DCT2)]
10. Experimental results: score fusion at -5 dB
[Bar chart: word accuracy (%) between 42 and 52 for speech only, AV:PRO, AV:PRO2, AV:DCT and AV:DCT2]
11. Audiovisual identity verification
- Fusion of face and speech for identity verification
- Detection of possible forgeries
- Compulsory? For:
  - Homeland/corporate security: restricted access, …
  - Secure computer login
  - Secure on-line signing of contracts
12. Talking-face and 2D face sequence database
- Data: video sequences (.avi) in which a short phrase in English is pronounced / duration ≈ 10 s (actual speech duration ≈ 2 s)
- Audio-video data used for talking-face evaluations
- Same sequences used for 2D-face-from-video evaluations
- 430 subjects each pronounced 4 phrases:
  - from a set of 430 English phrases
  - 2 indoor video files acquired during the first session
  - 2 outdoor video files acquired during the second session
  - realistic forgeries created a posteriori
13. Audio-Visual Speech Features
- Visual: raw pixel values, DCT transform, shape-related features, many others …
- Audio: raw amplitude, "classical" MFCC coefficients, many others
14. Audio-Visual Subspaces
- Reduced audiovisual subspace: Principal Component and Linear Discriminant Analysis
- Correlated audio and visual subspaces: Co-inertia and Canonical Correlation Analysis
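The correlated-subspace idea can be sketched with Canonical Correlation Analysis: find one direction per modality so that the projected audio and visual features are maximally correlated. A minimal numpy implementation, with a hypothetical function name and a small regularizer `reg` added for numerical stability:

```python
import numpy as np

def cca_first_pair(X, Y, reg=1e-6):
    """First canonical pair between feature matrices X (n x p) and Y (n x q).
    Returns the two projection directions and the canonical correlation."""
    Xc = X - X.mean(0)
    Yc = Y - Y.mean(0)
    n = len(X)
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    # Whiten each modality, then SVD the whitened cross-covariance
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy.T)
    a = Wx.T @ U[:, 0]   # e.g. audio direction
    b = Wy.T @ Vt[0]     # e.g. visual direction
    return a, b, s[0]
```

Co-inertia analysis follows the same recipe but maximizes covariance rather than correlation (no whitening of the two blocks).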
16. Application to indexation
- High-level requests
  - "Find videos where John Doe is speaking"
  - "Find dialogues between Mr X and Mrs Y"
  - "Locate the singer in this music video"
[Diagram: correlation between raw audio energy and raw pixel values]
17. Who is speaking?
- Face tracking
- Correlation between
  - the pixels of each face
  - the raw audio energy
- Find maximum synchrony (green: current speaker)
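The maximum-synchrony rule above can be sketched as follows: compute the normalised correlation between the audio energy track and each face's pixel-activity track, and pick the face with the highest score. The function name and input layout are assumptions for illustration:

```python
import numpy as np

def current_speaker(audio_energy, face_pixel_activity):
    """Pick the face whose pixel activity correlates best with the audio energy.
    face_pixel_activity: array of shape (n_faces, n_frames)."""
    e = audio_energy - audio_energy.mean()
    scores = []
    for face in face_pixel_activity:
        f = face - face.mean()
        denom = np.linalg.norm(e) * np.linalg.norm(f) + 1e-10
        scores.append(float(e @ f / denom))
    return int(np.argmax(scores)), scores
```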
18. How to Perform "Talking-Face" Authentication?
[Diagram: face recognition says OK, speaker verification says OK, so score fusion says OK]
- What if… the access is a deliberate imposture?
19. Biometrics
- Identity verification with talking faces
  - Speaker verification
  - Face recognition
- What if? [Diagram: the face says OK and the voice says OK, yet the access should be rejected]
20. Identity Verification
- Enrolment of client λ yields a model for client λ (Co-Inertia Analysis)
- A person ε pretending to be client λ is accepted if the score exceeds a threshold, rejected otherwise
- Equal Error Rate: 30 %
21. Replay Attack Detection
- Training: a synchrony model is learned for the client (Co-IA / CCA)
- Test: the access is accepted if it matches the synchrony model, rejected otherwise
22. Replay Attack Detection
- Genuine synchronized video vs. audio replay attack: the lips do not match the audio perfectly
- Equal Error Rate: 14 %
23. Example of Replay Attacks
24. Alignment by maximum correlation
[Plot: correlation as a function of the audio/video delay (-5 to +5 frames), for delayed video and delayed audio]
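Alignment by maximum correlation can be sketched as a search over candidate delays: shift one feature track against the other and keep the lag with the highest normalised correlation. The function name and the `max_lag` range are illustrative:

```python
import numpy as np

def best_lag(audio_feat, video_feat, max_lag=5):
    """Signed offset (in frames) between two 1-D feature tracks that
    maximises their normalised correlation."""
    best, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, v = audio_feat[lag:], video_feat[:len(video_feat) - lag]
        else:
            a, v = audio_feat[:lag], video_feat[-lag:]
        a = a - a.mean()
        v = v - v.mean()
        score = float(a @ v / (np.linalg.norm(a) * np.linalg.norm(v) + 1e-10))
        if score > best_score:
            best, best_score = lag, score
    return best
```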
25. Audiovisual identity verification
- Available features from the video:
  - Face: face features (lips, eyes) → face modality
  - Speech → speech modality
  - Speech synchrony → synchrony modality
26. Audiovisual identity verification
- Face modality
  - Detection:
    - Generative models (MPT toolbox)
    - Temporal median filtering
    - Eye detection within faces
  - Normalization: geometry + illumination
27. Audiovisual identity verification
- Face modality: two verification strategies within a single comparison framework
  - Global = eigenfaces:
    - Compute a set of directions (eigenfaces) defining a projection space
    - Two faces are compared via their projections onto the eigenface space
    - Learning data: BIOMET (130 persons) + BANCA (30 persons)
29. Audiovisual identity verification
- Face modality: SVD-based matching method
  - Compares two videos V1 and V2
  - Exclusive principle: one-to-one correspondences between faces (global) or descriptors (local)
  - Principle: compute a proximity matrix between faces or descriptors, then extract good pairings (made easy by the SVD)
  - Scores: one matching score between global representations, one between local representations
31. Audiovisual identity verification
- Speech modality: GMM-based approach
  - One world model
  - Each speaker model is derived from the world model by MAP adaptation
  - Speech verification score: derived from a likelihood ratio
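The likelihood-ratio score can be sketched with single diagonal Gaussians standing in for the full GMMs (a deliberate simplification; real systems use mixtures of hundreds of components). Function names and the shared-variance assumption are illustrative:

```python
import numpy as np

def gauss_loglik(x, mean, var):
    """Frame log-likelihood under a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)

def verification_score(frames, client_mean, world_mean, var):
    """Average log-likelihood ratio between the (MAP-adapted) client model and
    the world model; the access is accepted when it exceeds a threshold."""
    llr = gauss_loglik(frames, client_mean, var) - gauss_loglik(frames, world_mean, var)
    return float(np.mean(llr))
```

A positive score means the frames fit the client model better than the world model; the decision threshold is tuned on development data.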
32. Audiovisual identity verification
- Synchrony modality
  - Principle: the synchrony between lips and speech carries identity information
  - Process:
    - Compute a synchrony model (Co-inertia analysis) for each person, based on DCT (visual signal) and MFCC (speech signal) features
    - Compare the test sample with the synchrony model
33. Audiovisual identity verification
- Experiments on the BANCA database:
  - 52 persons divided into two groups (G1 and G2)
  - 3 recording conditions
  - 8 recordings per person (4 client accesses, 4 impostor accesses)
  - Evaluation based on the P protocol: 234 client accesses and 312 impostor accesses
- Scores:
  - 4 scores per access (PCA face, SIFT face, speech, synchrony)
  - Score fusion based on an RBF-SVM: the hyperplane is learned on G1 and tested on G2, and conversely
35. SecurePhone
- A technical solution that improves security
- Biometric recognition using VOICE, FACE and SIGNATURE
- An electronic signature is used to secure information exchange
36. Biometrics in SecurePhone
[Diagram: face, voice and written signature are each pre-processed and modelled; the scores are fused and access is granted or denied]
37. The BioSecure Multimodal Evaluation Campaign
- Launched in April 2007
- Many modalities, including "video sequences" and "talking faces"
- Development data and reference systems available
- Evaluations on the sequestered BioSecure database (1000 clients)
- Debriefing workshop
- More info: http://www.int-evry.fr/biometrics/BMEC2007/index.php
38. Audio-visual forgery scenarios
- Low-effort
  - "Paparazzi" scenario: the impostor owns a picture of the face and a recording of the voice of the target
  - "Big Brother" scenario: the impostor owns a video of the face and a recording of the voice of the target
- High-effort
  - "Imitator" scenario: the impostor owns a recording of the voice of the target and transforms his own voice to sound like the target
  - "Playback" scenario: the impostor owns a picture of the face of the target and animates it according to his own face motion
  - "Ventriloquist" scenario: combines the two previous ones
39. Detection of imposture
- Face modality: ACCEPTED
- Voice modality: ACCEPTED
- Synchronisation: DENIED
40. Talking-face forgeries @ BMEC: audio replay attack
- Assumptions:
  - The forger has recorded speech data from the genuine user in outdoor (test) conditions
  - The forger replays the audio and uses his own face in front of the sensor
[Images: stolen wave; audio replay + forger's face; audio replay + "random" face]
41. Talking-face forgeries @ BMEC: face-animation replay attack (CrazyTalk + TTS)
- Assumptions:
  - The forger has stolen a picture
  - The forger uses face-animation software and TTS (male or female voice)
  - The forger plays back the animation to the sensor
[Images: stolen picture; contour detection; generated .avi]
42. Talking-face forgeries @ BMEC: picture presentation + TTS
- Assumptions:
  - The forger has stolen a picture and printed it
  - The forger presents the picture to the sensor and uses TTS (same wave as for the face-animation forgery)
[Images: stolen picture; presented picture]
43. Systems with fusion of (face, speech)
[Diagram: the video sequence is split into frames and a speech signal; face verification yields a face score, speaker verification a speech score, and the two are combined into a fusion score]
44. Voice Conversion methods
- GMM conversion
  - Training of a joint Gaussian model
    - parallel corpus of aligned sentences from both the source and target voices
    - MFCC on HNM (Harmonic plus Noise Model) parameterization
  - Speech synthesis from the Gaussian model
    - inversion of the MFCC
    - pitch correction
- ALISP conversion
  - Very-low-bit-rate speech compression method (500 bps)
    - originally developed by TELECOM ParisTech
  - Indexed segment dictionary (of the target voice)
  - HNM parameterization
45. Voice conversion techniques
- Definition: the process of making one person's voice (the "source") sound like another person's voice (the "target")
[Diagram: "My name is John" uttered by the source is converted to the target's voice]
46. Principle of ALISP
[Coder diagram: input speech undergoes spectral and prosodic analysis; segmental units are selected from a dictionary of representative segments, producing a segment index and prosodic parameters; the decoder performs concatenative HNM synthesis from the same dictionary to produce the output speech]
47. Details of Encoding
[Diagram: speech undergoes spectral and prosodic analysis; HMM recognition against a dictionary of HMM models of ALISP classes yields the index of the ALISP class; DTW then selects the best representative unit of that class (synthesis units A1 … A8 for HMM A), giving the synthesis-unit index; prosodic encoding yields pitch, energy and duration]
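The DTW-based selection step can be sketched as follows: compute the dynamic-time-warping distance between the input segment and each stored representative unit of the recognized class, and keep the closest one. Function names are illustrative, and the unit-step DTW below is the textbook variant rather than the system's exact recipe:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic-time-warping distance between feature sequences (n x d), (m x d)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

def select_unit(segment, units):
    """Index of the stored representative unit closest to the segment under DTW."""
    return int(np.argmin([dtw_distance(segment, u) for u in units]))
```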
48. Details of Decoding
[Diagram: the ALISP class index and the synthesis-unit index select a stored unit (A1 … A8); the prosodic parameters drive the concatenative synthesis of the output speech]
49. Principle of ALISP conversion
- Learning step (one hour of target voice):
  - parametric analysis: MFCC
  - segmentation based on temporal decomposition and vector quantization
  - stochastic modelling based on HMMs
  - creation of representative units
- Conversion step:
  - parametric analysis: MFCC
  - HMM recognition
  - selection of the representative segment → DTW
- Synthesis step:
  - concatenation of the representative segments
  - HNM synthesis
50. Voice conversion using ALISP: results
[Audio examples from the NIST and BREF databases: source, result and target, for female-to-female and female-to-male conversions]
51. Demonstration of Voice Conversion
[Audio examples: impostor voice; voice converted with GMM; with ALISP; with ALISP+GMM; target voice]
52. 3D reconstruction
- 3D face modelling from a front and a profile shot
- Animated face
- https://picoforge.int-evry.fr/cgi-bin/twiki/view/Myblog3d/Web/Demos
53. Face Transformation
- Control-point selection (Figure 1)
- Image segmentation (Figure 2: division of an image)
- Linear transformation between the source and target images
- Blending step
54. Face Transformation
- Localisation of control points → warping → blending (source to target)
- Warping maps each source point X to X′ = f(X)
- Blending: p = αp + (1 − α)p′
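The blending formula above is a per-pixel cross-dissolve. A minimal sketch, with an illustrative function name; `source_pix` and `target_pix` stand for the source pixels and the warped target pixels:

```python
import numpy as np

def blend(source_pix, target_pix, alpha):
    """Cross-dissolve used in the blending step: p = alpha*p + (1 - alpha)*p'.
    alpha = 1 keeps the source, alpha = 0 keeps the (warped) target."""
    source_pix = np.asarray(source_pix, dtype=float)
    target_pix = np.asarray(target_pix, dtype=float)
    return alpha * source_pix + (1 - alpha) * target_pix
```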
55. Face transformation (IBM)
56. Ouisper1 - Silent Speech Interface
- Sensor-based system allowing speech communication via standard articulators, but without glottal activity
- Two distinct types of application:
  - an alternative to tracheo-oesophageal speech (TES) for persons having undergone a tracheotomy
  - a "silent telephone" for use in situations where quiet must be maintained, or for communication in very noisy environments
- Speech synthesis from ultrasound and optical imagery of the tongue and lips
1) Oral Ultrasound synthetIc SPEech souRce
57. Ouisper - System Overview
[Diagram. Training: ultrasound video of the vocal tract and optical video of the speaker's lips undergo visual feature extraction; together with the recorded audio and its text alignment they form an audio-visual speech corpus. Test: the visual data is decoded by a visual speech recognizer into N-best phonetic or ALISP targets, followed by visual unit selection and audio unit concatenation.]
58. Ouisper - Training Data
59. Ouisper - Video Stream Coding
- Build a subset of typical frames, perform PCA, and code new frames by their projections onto the resulting eigenvectors
T. Hueber, G. Aversano, G. Chollet, B. Denby, G. Dreyfus, Y. Oussar, P. Roussel, M. Stone, "EigenTongue Feature Extraction for an Ultrasound-based Silent Speech Interface," IEEE International Conference on Acoustics, Speech and Signal Processing, Honolulu, Hawaii, USA, 2007.
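The eigenframe coding described above can be sketched with PCA via the SVD: learn principal directions from a set of typical frames, then represent each new frame by its projection coefficients. Function names are illustrative:

```python
import numpy as np

def fit_eigenframes(frames, k):
    """PCA on a set of typical frames (each flattened to a vector).
    Returns the mean frame and the top-k eigenvectors (as rows)."""
    X = frames.reshape(len(frames), -1).astype(float)
    mean = X.mean(0)
    # SVD of the centred data gives the principal directions
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def code_frame(frame, mean, eigenframes):
    """Code a new frame by its projections onto the eigenvectors."""
    return eigenframes @ (frame.ravel().astype(float) - mean)
```

Keeping only a few coefficients per frame turns each high-dimensional ultrasound image into a compact feature vector for the visual speech recognizer.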
60. Ouisper - Audio Stream Coding
- ALISP segmentation:
  - detection of quasi-stationary parts in the parametric representation of speech
  - assignment of segments to classes using unsupervised classification techniques
- Phonetic segmentation:
  - forced alignment of the speech with the text
  - requires a relevant and correct phonetic transcription of the uttered signal
- Corpus-based synthesis requires a preliminary segmental description of the signal
61. Audiovisual dictionary building
- Visual and acoustic data are synchronously recorded
- The audio segmentation is used to bootstrap the visual speech recognizer
- An HMM model is trained for each phonetic class (e.g. /e-r/, /a-j/, /u-th/) to build the audiovisual dictionary
62. Visuo-acoustic decoding
- Visual speech recognition
  - Train an HMM model for each visual class
    - using multistream-based learning techniques
  - Perform a "visuo-phonetic" decoding step
    - use an N-best list
    - introduce linguistic constraints: language model, dictionary, multigrams
- Corpus-based speech synthesis
  - Combine probabilistic and data-driven approaches in the audiovisual unit-selection step
63. Speech recognition from video-only data
- Ref: ow p ax n y uh r b uh k t uw dh ax f er s t p ey jh ("Open your book to the first page")
- Rec: ax w ih y uh r b uh k sh uw dh ax v er s p ey jh ("A wear your book shoe the verse page")
- Corpus-based synthesis driven by the predicted phonetic lattice is currently under study.
64. Ouisper - Conclusion
- More information: http://www.neurones.espci.fr/ouisper/
- Contacts: gerard.chollet@enst.fr, denby@ieee.org, hueber@ieee.org
65. Audio-Visual Speech Processing: Conclusions and Perspectives
- A talking face is more intelligible, expressive, recognisable and attractive than acoustic speech alone.
- The combined use of facial and speech information improves identity verification and robustness to forgeries.
- Multi-stream models of the synchrony of visual and acoustic information have applications in the analysis, coding, recognition and synthesis of talking faces.