“Development of Some Techniques
For Text-Independent Speaker
Recognition From Audio Signals”
By
Bidhan Barai
Under the guidance of
Dr. Nibaran Das and Dr. Subhadip Basu
Assistant Professors of Computer Science & Engineering
Jadavpur University
Kolkata – 700 032
Overview
● Introduction
● Types of Speaker Recognition
● Principles of Automatic Speaker Recognition (ASR)
● Steps of Speaker Recognition:
1> Voice Recording
2> Feature Extraction
3> Modeling
4> Pattern Matching
5> Decision (accept / reject) (for Verification)
● Conclusion
● References
Introduction
● Speaker recognition is the identification of a person from
characteristics of voices (voice biometrics). It is also
called voice recognition. There is a difference between
speaker recognition (recognizing who is speaking) and
speech recognition (recognizing what is being said).
● In addition, there is a difference between the act of
authentication (commonly referred to as speaker
verification or speaker authentication) and identification.
Types of Speaker Identification
● Text-Dependent:
If the text must be the same for enrollment and
verification, this is called text-dependent recognition.
In a text-dependent system, prompts can either be
common across all speakers (e.g., a common pass
phrase) or unique to each speaker.
● Text-Independent:
Text-independent systems are most often used for
speaker identification, as they require very little, if any,
cooperation by the speaker. In this case the text used
during enrollment and testing is different.
Types of Speaker Identification
● Closed-Set: it is assumed that the speaker is in the database.
In closed-set identification, the audio of the test speaker is
compared against all the available speaker models and the
speaker ID of the model with the closest match is returned.
The result is the best-matched speaker.
● Open-Set: the speaker may not be in the database.
Open-set identification may be viewed as a combination of
closed-set identification and speaker verification. The result
can be a speaker ID or a no-match result.
Principles of Automatic Speaker
Recognition
● Speaker recognition can be classified into identification
and verification.
● Speaker identification is the process of determining which
registered speaker provides a given utterance.
● Speaker verification, on the other hand, is the process of
accepting or rejecting the identity claim of a speaker.
● The following figures show the basic structures of speaker
identification and verification systems. The system that we
will describe is classified as a text-independent speaker
identification system, since its task is to identify the person
who speaks regardless of what is being said.
Principles of Automatic Speaker
Recognition ... Contd.
Figure 1: Block Diagram of Speaker Recognition System
Principles of Automatic Speaker
Recognition ... Contd.
● Speaker identification
Figure 2: Input speech → Feature Extraction → Similarity against the
reference models of Speaker #1 … Speaker #N → Maximum Selection →
Identification Result (Speaker ID)
Principles of Automatic Speaker
Recognition ... Contd.
● Speaker verification
Figure 3: Input speech → Feature Extraction → Similarity against the
reference model of the claimed Speaker ID (#M) → Decision against a
Threshold → Verification Result (Accept/Reject)
Principles of Automatic Speaker
Recognition ... Contd.
● All speaker recognition systems have to serve two
distinct phases.
The first one is referred to as the enrolment or training phase,
while the second one is referred to as the operational or
testing phase.
● In the training phase, each registered speaker has to provide
samples of their speech so that the system can build or train
a reference model for that speaker. In the case of speaker
verification systems, a speaker-specific threshold is also
computed from the training samples.
● In the testing phase, the input speech is matched with the
stored reference model(s) and a recognition decision is
made.
Steps of Speaker Recognition
1> Voice Recording
2> Feature Extraction
3> Modeling
4> Pattern Matching
5> Decision (accept / reject) (for Verification)
Step 1: Voice Recording
● The speech input is typically recorded at a sampling
rate above 10000 Hz (10 kHz).
● This sampling frequency is chosen to minimize the
effects of aliasing in the analog-to-digital conversion.
Signals sampled at this rate can capture all frequencies up
to 5 kHz, which covers most of the energy of sounds
generated by humans.
● This sampling rate (10 kHz) is determined by the
Nyquist sampling theorem.
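To make the recording step concrete, here is a minimal Python sketch for loading a recorded utterance with NumPy/SciPy; the file name speaker01.wav is a hypothetical placeholder, not a file from this project.

```python
# A minimal sketch of loading a recording for ASR, assuming a hypothetical
# file "speaker01.wav"; SciPy and NumPy are the only dependencies.
import numpy as np
from scipy.io import wavfile

rate, signal = wavfile.read("speaker01.wav")   # rate in Hz, samples as an array
signal = signal.astype(np.float64)
if signal.ndim == 2:                           # stereo -> mono by averaging channels
    signal = signal.mean(axis=1)

# By Nyquist, a sampling rate of `rate` Hz can represent content up to rate/2 Hz.
print(f"sampling rate = {rate} Hz, usable bandwidth = {rate / 2:.0f} Hz")
```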
Step 2: Speech Feature
Extraction
● The purpose of this module is to convert the speech
waveform, using digital signal processing (DSP) tools, to a set
of features (at a considerably lower information rate) for
further analysis. This is often referred to as the
signal-processing front end.
● The speech signal is a slowly time-varying signal (it is called
quasi-stationary). When examined over a sufficiently short
period of time (between 5 and 100 msec), its characteristics
are fairly stationary. However, over long periods of time (on
the order of 1/5 second or more) the signal characteristics
change to reflect the different speech sounds being spoken.
● Therefore, short-time spectral analysis is the most common
way to characterize the speech signal.
Speech Feature Extraction...Contd
Examples of Speech Signals:
A wide range of possibilities exists for parametrically representing
the speech signal for the speaker recognition task, such as Linear
Predictive Coding (LPC), Mel-Frequency Cepstrum Coefficients
(MFCC), Gammatone Frequency Cepstral Coefficients (GFCC),
Group Delay Features (GDF) and others. MFCC is perhaps the best
known and most popular, and will be described in this project.
Figure 4 Figure 5
Speech Feature Extraction...Contd
● Mel-frequency Cepstrum Coefficients Processor:
A block diagram of the structure of an MFCC
processor is given in Figure 6.
Figure 6
Speech Feature Extraction...Contd
● Steps of extracting Feature from Speech Signal:
1> Pre-emphasis
2> Frame Blocking
3> Windowing
4> Fast Fourier Transform (FFT)
5> Mel-frequency Wrapping
6> Cepstrum: Logarithmic Compression and Discrete
Cosine Transform (DCT)
Speech Feature Extraction...Contd
● Pre-emphasis: In speech processing, the original
signal usually has too much low-frequency energy,
and processing the signal to emphasize the higher
frequency energy is necessary. To perform
pre-emphasis, we choose some value α between 0.9
and 1. Then each value in the signal is re-evaluated
using the formula:
y[n] = x[n] − α·x[n−1], where 0.9 < α < 1
This is effectively a first-order high-pass filter.
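A minimal sketch of this filter in Python/NumPy; the default α = 0.97 is a commonly used value within the stated range, not one prescribed by these slides.

```python
import numpy as np

def pre_emphasis(x: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """Apply y[n] = x[n] - alpha * x[n-1]; the first sample is passed through."""
    return np.append(x[0], x[1:] - alpha * x[:-1])
```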
Speech Feature Extraction...Contd
Figure 7
Speech Feature Extraction...Contd
● Frame Blocking: The input speech signal is segmented into frames of
20~30 ms with optional overlap of 1/3~1/2 of the frame size. Usually the
frame size (in terms of sample points) is equal to a power of two in order to
facilitate the use of the FFT. If this is not the case, we need to zero-pad the
frame to the nearest power-of-two length.
● Windowing: Each frame has to be multiplied with a Hamming window in
order to keep the continuity of the first and the last points in the frame. If
the signal in a frame is denoted by s(n), n = 0,…,N−1, then the signal after
Hamming windowing is s(n)·w(n), where w(n) is the Hamming window
defined by:
w(n, α) = (1 − α) − α·cos(2πn / (N − 1)), 0 ≤ n ≤ N−1
Different values of α correspond to different curves for the Hamming
windows shown next.
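The two steps above can be sketched together in Python; the 20 ms frame and 10 ms hop (an overlap of 1/2 of the frame size) are illustrative choices within the ranges quoted above.

```python
import numpy as np

def frame_and_window(signal: np.ndarray, rate: int,
                     frame_ms: float = 20.0, hop_ms: float = 10.0) -> np.ndarray:
    """Segment the signal into overlapping frames and apply a Hamming window.

    Assumes len(signal) >= one frame length.
    """
    frame_len = int(rate * frame_ms / 1000)          # e.g. 320 samples at 16 kHz
    hop = int(rate * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # Hamming window with alpha = 0.46: w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    return frames * np.hamming(frame_len)
```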
Speech Feature Extraction...Contd
Figure 8
Speech Feature Extraction...Contd
Figure 9
Speech Feature Extraction...Contd
● Fast Fourier Transform (FFT): The Discrete
Fourier Transform (DFT) of a discrete-time signal
x(nT) is given by:
X(k) = Σ_{n=0}^{N−1} x[n]·e^{−j2πnk/N}, k = 0, 1, …, N−1
where x(nT) = x[n].
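As a sanity check, the definition can be evaluated directly and compared against NumPy's FFT; this naive O(N²) version is purely illustrative.

```python
import numpy as np

def naive_dft(x: np.ndarray) -> np.ndarray:
    """Evaluate X(k) = sum_n x[n] * exp(-j*2*pi*n*k/N) directly (O(N^2))."""
    N = len(x)
    n = np.arange(N)
    W = np.exp(-2j * np.pi * np.outer(n, n) / N)   # W[k, n] = e^{-j 2 pi k n / N}
    return W @ x

x = np.random.randn(128)
assert np.allclose(naive_dft(x), np.fft.fft(x))    # agrees with the library FFT
```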
Speech Feature Extraction...Contd
● If we let W_N = e^{−j2π/N}, then:
X(k) = Σ_{n=0}^{N−1} x[n]·W_N^{nk}
Figure 10: A sampled signal (amplitude vs. sample index) and its
magnitude spectrum (magnitude vs. normalised frequency)
Speech Feature Extraction...Contd
● x[n] = x[0], x[1], …, x[N−1]
X(k) = Σ_{n=0}^{N−1} x[n]·W_N^{nk}, 0 ≤ k ≤ N−1   [1]
● Let us divide the sequence x[n] into even and odd
sequences:
x[2n] = x[0], x[2], …, x[N−2]
x[2n+1] = x[1], x[3], …, x[N−1]
Speech Feature Extraction...Contd
● Equation 1 can be rewritten as:
X(k) = Σ_{n=0}^{N/2−1} x[2n]·W_N^{2nk} + Σ_{n=0}^{N/2−1} x[2n+1]·W_N^{(2n+1)k}   [2]
Since:
W_N^{2nk} = e^{−j(2π/N)·2nk} = e^{−j(2π/(N/2))·nk} = W_{N/2}^{nk}
and
W_N^{(2n+1)k} = W_N^{k} · W_{N/2}^{nk}
Then:
X(k) = Σ_{n=0}^{N/2−1} x[2n]·W_{N/2}^{nk} + W_N^{k} Σ_{n=0}^{N/2−1} x[2n+1]·W_{N/2}^{nk} = Y(k) + W_N^{k}·Z(k)
Speech Feature Extraction...Contd
● The result is that an N-point DFT can be divided into
two N/2-point DFTs:
X(k) = Σ_{n=0}^{N−1} x[n]·W_N^{nk}, 0 ≤ k ≤ N−1   (N-point DFT)
● where Y(k) and Z(k) are the two N/2-point DFTs
operating on the even and odd samples respectively:
X(k) = Σ_{n=0}^{N/2−1} x1[n]·W_{N/2}^{nk} + W_N^{k} Σ_{n=0}^{N/2−1} x2[n]·W_{N/2}^{nk} = Y(k) + W_N^{k}·Z(k)   (two N/2-point DFTs)
Speech Feature Extraction...Contd
● Periodicity and symmetry of W can be exploited to
simplify the DFT further:
X(k) = Σ_{n=0}^{N/2−1} x1[n]·W_{N/2}^{nk} + W_N^{k} Σ_{n=0}^{N/2−1} x2[n]·W_{N/2}^{nk}
X(k + N/2) = Σ_{n=0}^{N/2−1} x1[n]·W_{N/2}^{n(k+N/2)} + W_N^{k+N/2} Σ_{n=0}^{N/2−1} x2[n]·W_{N/2}^{n(k+N/2)}   [3]
Symmetry:
W_N^{k+N/2} = e^{−j(2π/N)k}·e^{−j(2π/N)(N/2)} = e^{−j(2π/N)k}·e^{−jπ} = −e^{−j(2π/N)k} = −W_N^{k}
Periodicity:
W_{N/2}^{k+N/2} = e^{−j(2π/(N/2))k}·e^{−j(2π/(N/2))(N/2)} = e^{−j(2π/(N/2))k} = W_{N/2}^{k}
Speech Feature Extraction...Contd
● Finally, by exploiting the symmetry and periodicity,
Equation 3 can be written as:
X(k + N/2) = Σ_{n=0}^{N/2−1} x1[n]·W_{N/2}^{nk} − W_N^{k} Σ_{n=0}^{N/2−1} x2[n]·W_{N/2}^{nk} = Y(k) − W_N^{k}·Z(k)   [4]
● Hence the complete equations for computing the FFT are:
X(k) = Y(k) + W_N^{k}·Z(k), k = 0, …, N/2 − 1
X(k + N/2) = Y(k) − W_N^{k}·Z(k), k = 0, …, N/2 − 1
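Equations [1]–[4] translate almost line-for-line into a recursive radix-2 FFT; this sketch assumes the input length is a power of two, as the frame-blocking slide requires.

```python
import numpy as np

def fft_radix2(x: np.ndarray) -> np.ndarray:
    """Recursive radix-2 FFT; len(x) must be a power of two (zero-pad otherwise)."""
    N = len(x)
    if N == 1:
        return x.astype(complex)
    Y = fft_radix2(x[0::2])                            # N/2-point DFT of even samples
    Z = fft_radix2(x[1::2])                            # N/2-point DFT of odd samples
    W = np.exp(-2j * np.pi * np.arange(N // 2) / N)    # twiddle factors W_N^k
    return np.concatenate([Y + W * Z,                  # X(k)       = Y(k) + W^k Z(k)
                           Y - W * Z])                 # X(k + N/2) = Y(k) - W^k Z(k)

x = np.random.randn(64)
assert np.allclose(fft_radix2(x), np.fft.fft(x))
```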
Speech Feature Extraction...Contd
● Schematic Diagram for FFT: Radix-2 butterfly diagram
[Diagram: the even samples x[0], x[2], x[4], …, x[N−2] feed one N/2-point
DFT producing y[0], …, y[N/2−1]; the odd samples x[1], x[3], x[5], …, x[N−1]
feed another N/2-point DFT producing z[0], …, z[N/2−1]. The butterflies
combine them as X[k] = y[k] + W^k·z[k] and X[k+N/2] = y[k] − W^k·z[k].]
Speech Feature Extraction...Contd
● Mel-frequency Wrapping: Psychophysical studies
have shown that human perception of the frequency
content of sounds does not follow a linear scale. That
research has led to the concept of subjective
frequency: for each sound with an actual
frequency f, measured in Hz, a subjective frequency
is measured on a scale called the "Mel scale".
The Mel frequency can be approximated by:
Mel(f) = 2595·log10(1 + f/700)
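A direct translation of this approximation, together with its inverse (needed later to place filter-bank edges); the base-10 logarithm is assumed, as is standard for this formula.

```python
import numpy as np

def hz_to_mel(f):
    """Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse mapping: f = 700 * (10^(m / 2595) - 1)."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

print(hz_to_mel(1000.0))   # ~1000 mel: the scale is roughly linear below 1 kHz
```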
Speech Feature Extraction...Contd
Mel Frequency Plot:
Figure 11
Speech Feature Extraction...Contd
● In the Mel-frequency scale, there is linear frequency
spacing below 1000 Hz and logarithmic spacing above
1000 Hz.
● Triangular Filter Bank: The human ear acts essentially
like a bank of overlapping band-pass filters, and human
perception is based on the Mel scale. Thus, the approach to
simulating human perception is to build a filter bank
with bandwidths given by the Mel scale, pass the
magnitudes of the spectra through these filters, and
obtain the Mel-frequency spectrum.
Speech Feature Extraction...Contd
● Equally spaced Mel values:
● We define a triangular filter bank with M filters (m = 1, 2, …, M), where
H_m[k] is the magnitude (frequency response) of the m-th filter, given by:
H_m(k) = 0                                 for k < f(m−1)
H_m(k) = (k − f(m−1)) / (f(m) − f(m−1))    for f(m−1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) − k) / (f(m+1) − f(m))    for f(m) ≤ k ≤ f(m+1)
H_m(k) = 0                                 for k > f(m+1)
Speech Feature Extraction...Contd
● Mel Filter Bank:
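A sketch of constructing such a filter bank, assuming the hz_to_mel/mel_to_hz helpers above; mapping the equally spaced mel points back to FFT bin indices f(m) follows the common textbook recipe and is an illustrative choice, not the slides' exact procedure.

```python
import numpy as np

def mel_filter_bank(n_filters: int, n_fft: int, rate: int,
                    f_low: float = 0.0, f_high: float = None) -> np.ndarray:
    """Build M triangular filters H_m[k] over the first n_fft//2 + 1 FFT bins."""
    if f_high is None:
        f_high = rate / 2.0
    # M + 2 points equally spaced on the mel scale, mapped back to bin indices f(m)
    mels = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    f = np.floor((n_fft + 1) * mel_to_hz(mels) / rate).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        for k in range(f[m - 1], f[m]):      # rising edge: (k - f(m-1)) / (f(m) - f(m-1))
            H[m - 1, k] = (k - f[m - 1]) / (f[m] - f[m - 1])
        for k in range(f[m], f[m + 1]):      # falling edge: (f(m+1) - k) / (f(m+1) - f(m))
            H[m - 1, k] = (f[m + 1] - k) / (f[m + 1] - f[m])
    return H
```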
Speech Feature Extraction...Contd
● Given the FFT of the input signal x[n]:
X[k] = Σ_{n=0}^{N−1} x[n]·e^{−j2πnk/N}, 0 ≤ k ≤ N−1
● The values of the FFT are weighted by the triangular filters.
The result is called the Mel-frequency power spectrum,
which is defined as:
S[m] = Σ_{k=1}^{N} |X_a[k]|²·H_m[k], 0 < m ≤ M
where |X_a[k]|² is called the power spectrum.
Speech Feature Extraction...Contd
● Schematic diagram of Filter Bank Energy:
● Finally, a discrete cosine transform (DCT) of the
logarithm of S[m] is computed to form the MFCCs as:
mfcc[i] = Σ_{m=1}^{M} log(S[m])·cos[i·(m − 1/2)·π/M], i = 1, 2, …, L
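Combining the last few slides, a sketch that takes windowed frames and a filter bank H (as built in the earlier sketches) through the power spectrum, the log, and the DCT; keeping the first 13 coefficients is a conventional choice (an assumption, not mandated here).

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(frames: np.ndarray, H: np.ndarray, n_ceps: int = 13) -> np.ndarray:
    """Windowed frames -> power spectrum |X_a[k]|^2 -> S[m] -> log -> DCT."""
    n_fft = 2 * (H.shape[1] - 1)
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2   # |X_a[k]|^2 per frame
    S = np.maximum(power @ H.T, 1e-10)                  # S[m], floored to avoid log(0)
    return dct(np.log(S), type=2, axis=1, norm="ortho")[:, :n_ceps]
```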
Step 3: Modeling
●
State-of-the-Art Modeling Techniques:
1> Gaussian Mixture Model (GMM)
2> Hidden Markov Model (HMM)
GMM
● A mixture model is a probabilistic model which assumes
that the underlying data belong to a mixture
distribution.
The Gaussian is the characteristic symmetric
"bell curve".
GMM...Contd
● Mathematical Description of GMM:
p(x) = Σ_{i=1}^{n} w_i·p_i(x)
where p(x) = mixed density function
w_i = mixture weight (mixture coefficient)
p_i(x) = component density function
GMM...Contd
●
Image showing Best fit Gaussian Curve:
GMM...Contd
● Hence the component density function is:
p_i(x) = N(x | μ_i, Σ_i)
● The description of the GMM becomes:
p(x) = Σ_{i=1}^{n} w_i·N(x | μ_i, Σ_i)
where the μ_i's are the means and the Σ_i's are the covariance
matrices of the individual components (probability density functions).
[Figure: five weighted Gaussian components G1,w1 … G5,w5 summing to the mixture]
GMM...Contd
● The Gaussian (normal) density function, in which each of
the mixture components is a Gaussian distribution, each
with its own mean and covariance parameters, is the most
common mixture distribution.
The feature vectors are assumed to follow the Gaussian
distribution; hence X is distributed normally:
X ∼ N(x | μ, Σ)   (multivariate normal distribution)
where μ = mean vector
Σ = covariance matrix
GMM...Contd
● The GMM for a speaker is denoted by:
λ = {w_i, μ_i, Σ_i}, where i = 1, 2, …, M
Here a speaker is represented by a mixture of M
Gaussian components.
● The Gaussian mixture density is:
p(x⃗ | λ) = Σ_{i=1}^{M} w_i·p_i(x⃗)
where x⃗ is a D-dimensional random vector (variable).
GMM...Contd
● The component density is given by:
p_i(x⃗) = (1 / ((2π)^{D/2}·|Σ_i|^{1/2})) · exp{ −(1/2)·(x⃗ − μ_i)^T Σ_i^{−1} (x⃗ − μ_i) }
● The schematic diagram of the GMM of a speaker is given
below:
[Diagram: the input vector x⃗ feeds component densities p_1(·), …, p_M(·)
with parameters (μ_1, Σ_1), …, (μ_M, Σ_M); their outputs, weighted by
w_1, …, w_M, are summed (Σ) to give p(x⃗ | λ).]
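A sketch of evaluating this mixture density for a sequence of feature vectors using SciPy's multivariate normal; summing the log of the per-frame densities anticipates the frame-independence assumption used later in pattern matching.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, weights, means, covs):
    """Total log p(x|lambda) over all frames in X (shape T x D) for one model."""
    # p(x) = sum_i w_i * N(x | mu_i, Sigma_i), evaluated frame by frame
    density = sum(w * multivariate_normal.pdf(X, mean=mu, cov=cov)
                  for w, mu, cov in zip(weights, means, covs))
    return float(np.sum(np.log(density)))
```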
Model Parameter Estimation
● To create a GMM we are required to find the
numerical values of the model parameters w_i, μ_i and Σ_i.
● To obtain an optimum model representing each
speaker we need to calculate a good estimate of
the GMM parameters. To do that, a very efficient
method is the Maximum-Likelihood Estimation (MLE)
approach. For speaker identification, each speaker is
represented by a GMM and is referred to by his/her
model. In this regard, the EM (Expectation-Maximization)
algorithm is a very useful tool for finding the optimum
model parameters by the MLE approach.
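In practice the EM fit can be delegated to a library; a sketch using scikit-learn's GaussianMixture, where 32 components and diagonal covariances are illustrative assumptions rather than values fixed by this work.

```python
from sklearn.mixture import GaussianMixture

def train_speaker_model(features, n_components: int = 32) -> GaussianMixture:
    """Fit a GMM to one speaker's MFCC features (T x D) by EM, i.e. MLE."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",   # diagonal covariances, common in ASR
                          max_iter=200, random_state=0)
    gmm.fit(features)                               # EM estimates w_i, mu_i, Sigma_i
    return gmm
```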
Step 4: Pattern Matching:
Classification
● In this stage, a series of input vectors is compared,
and a decision is made as to which of the speakers in
the set is the most likely to have spoken the test data.
The input to the classification system is denoted as:
X = {x⃗_1, x⃗_2, x⃗_3, …, x⃗_T}
● Using the model of each speaker and the unknown
vectors x⃗_t, the fitness values are calculated with the
help of the posterior probability. We assign the vectors
to the speaker whose model gives the maximum fitness value.
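A closed-set identification sketch built on the models trained above: score the test vectors against every speaker's GMM and return the best-scoring speaker, i.e. the maximum selection of Figure 2.

```python
def identify_speaker(test_features, speaker_models: dict) -> str:
    """Closed-set identification: return the speaker whose GMM scores highest."""
    # GaussianMixture.score gives the mean per-frame log-likelihood;
    # multiplying by T yields the total log-likelihood of the utterance.
    scores = {speaker: model.score(test_features) * len(test_features)
              for speaker, model in speaker_models.items()}
    return max(scores, key=scores.get)
```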
Conclusion...Contd
● Modifications can be made in the following
areas:
1> Feature Extraction
2> MFCC Feature
3> Filter Bank
4> Modeling Techniques
5> Pattern Matching
Conclusion...Contd
● Feature Extraction: In the MFCC feature the phase
information is not taken into account; only the magnitude is
considered. So, using phase information along with the MFCC
feature, new feature vectors can be derived.
● Pattern Matching: In the pattern matching step it is assumed
that the feature vectors of the unknown speaker are
independent. With this assumption the posterior probability is
calculated. But we can use some orthogonal transformation
to transform the set of vectors into a new set of orthogonal
vectors. Hence, after the transformation the vectors
become independent, and then we can proceed as before.
References
●
[1] Molau, S., Pitz, M., Schlüter, R. & Ney, H. (2001), Computing Mel-Frequency
Cepstral Coefficients on the Power Spectrum, IEEE International Conference on
Acoustics, Speech and Signal Processing, Germany, 2001: 73-76.
●
[2] Huang, X., Acero, A. & Hon, H. (2001), Spoken Language Processing - A Guide to
Theory, Algorithm, and System Development, Prentice Hall PTR, New Jersey.
●
[3] Homayoon Beigi, (2011), Fundamentals of Speaker Recognition, Springer.
●
[4] Daniel J. Mashao, Marshalleno Skosan, Combining classifier decisions for robust
speaker identification, ELSEVIER, 2006.
●
[5] W.M. Campbell , J.P. Campbell, D.A. Reynolds, E. Singer, P.A. Torres-Carrasquillo,
Support vector machines for speaker and language recognition, ELSEVIER, 2006.
●
[6] Seiichi Nakagawa, Kouhei Asakawa, Longbiao Wang, Speaker Recognition by
Combining MFCC and Phase Information, INTERSPEECH 2007.
●
[7] Nilsson, M. & Ejnarsson, M, Speech Recognition Using Hidden Markov Model
Performance Evaluation in Noisy Environment, Blekinge Institute of Technology
Sweden, 2002.
References...Contd
●
[8] Stevens, S. S. & Volkman, J. (1940), The Relation of the Pitch to
Frequency, Journal of Psychology, 1940(53): 329.
●
[9] A. Jain, A. Ross, and S. Prabhakar, “An introduction to biometric
recognition,” IEEE Trans. Circuits Systems Video Technol., vol. 14, no. 1, pp.
4–20, 2004.
●
[10] D. Reynolds, “An overview of automatic speaker recognition
technology,” in Proc. IEEE Int. Conf. Acoustics Speech Signal Processing
(ICASSP), 2002, vol. 4, pp. 4072–4075.
●
[11] S. Furui, “Cepstral analysis technique for automatic speaker
verification,” IEEE Trans. Acoustics Speech Signal Process., vol. 29, no. 2, pp.
254–272, 1981.
●
[12] D. Reynolds and R. Rose, “Robust text-independent speaker
identification using Gaussian mixture speaker models,” IEEE Trans. Speech
Audio Process., vol. 3, no. 1, pp. 72–83, 1995.
●
[13] D. Reynolds, “Speaker identification and verification using Gaussian
mixture speaker models,” Speech Commun., vol. 17, no. 1–2, pp. 91–108,
1995.
References...Contd
●
[14] Man-Wai Mak , Wei Rao, Utterance partitioning with acoustic vector
resampling for GMM–SVM speaker verification, ELSEVIER, 2011.
●
[15] Md. Sahidullah, Goutam Saha, Design, analysis and experimental
evaluation of block based transformation in MFCC computation for speaker
recognition, ELSEVIER, 2011.
●
[16] Qi Li, and Yan Huang, An Auditory-Based Feature Extraction Algorithm for
Robust Speaker Identification Under Mismatched Conditions , IEEE
TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19,
NO. 6, AUGUST 2011.
●
[17] Alfredo Maesa, Fabio Garzia, Michele Scarpiniti, Roberto Cusani, Text
Independent Automatic Speaker Recognition System Using Mel-Frequency
Cepstrum Coefficient and Gaussian Mixture Models, Journal of Information
Security, 2012.
●
[18] Ming Li, Kyu J. Han, Shrikanth Narayanan, Automatic speaker age and
gender recognition using acoustic and prosodic level information fusion,
ELSEVIER, 2013.
Thank You