Text-Independent Speaker Verification
                                                          Cody A. Ray
                                                        Drexel University
                                                      codyaray@drexel.edu


   Abstract—This paper provides an introduction to the task
of speaker recognition, and describes a not-so-novel speaker
recognition system based upon a minimum-distance classification
scheme. We describe both the theory and practical details for a
reference implementation. Furthermore, we discuss an advanced
technique for classification based upon Gaussian Mixture Models
(GMM). Finally, we discuss the results of a set of experiments
performed using our reference implementation.

                      I. INTRODUCTION
   The objective of this project was to develop a basic speaker
recognition system to demonstrate an understanding of the
subjects covered in the course Processing of the Human Voice.
Speaker recognition systems can generally be classified as
either identification or verification. In speaker identification,
the challenge is to decide which voice model from a known set
of voice models best characterizes a speaker. In the different
task of speaker verification, the goal is to decide whether
a speaker corresponds to a particular known voice or to
some other unknown voice. In either case, the problem can
be further divided into text-dependent and text-independent
subproblems, depending on whether we may rely upon the
same utterance being used for both training and testing. This
loose classification scheme is shown in Figure 1. For the
purpose of this project, we focused on the task of text-independent
speaker verification.

           Fig. 1.   Speaker Recognition System Classification

                      II. TERMINOLOGY
  •   A background speaker is an imposter speaker.
  •   A claimant is a speaker known to the system who is
      correctly claiming his/her identity.
  •   A false negative is an error where a claimant is rejected
      as an imposter.
  •   A false positive is an error where an imposter is accepted
      as a claimant.
  •   Speaker identification decides which voice model from
      a known set of voice models best characterizes a speaker.
  •   Speaker verification decides whether a speaker cor-
      responds to a particular known voice or some other
      unknown voice.
  •   A target speaker is a known speaker.

                    III. SYSTEM OVERVIEW
   Speaker recognition systems must first build a model of
the voice of each target speaker, as well as a model of a
collection of background speakers, using speaker-dependent
features extracted from the speech waveform. This is referred
to as the training stage, and the associated speech data used
to build the speaker model is called training data. During the
recognition or testing stage, the features measured from the
waveform of a test utterance, i.e., the test data of a speaker,
are matched (in some sense) against speaker models obtained
during training. An overview of the components of a general
speaker recognition system is given in Figure 2.
   As with any biometric (pattern) recognition system, the
speaker recognition system consists of two core modules:
feature extraction and feature matching. In Section IV, we
provide an overview of various dimensions used for speaker
analysis as well as describe the features we selected in more
depth. Section V continues with mathematical techniques used
in the matching process.

               Fig. 2.   General Speaker Recognition System

                    IV. FEATURE EXTRACTION
   Although it is possible to extract a number of features
from sampled audio, including both spectral and non-spectral
features, for the purpose of this project we characterize a
speaker's voice attributes exclusively through spectral features.
In selecting acoustic spectral features, we want our feature
set to reflect the unique characteristics of a speaker. For this
purpose, we use the magnitude component of the short-time
Fourier transform (STFT) as a basis. As the phase is difficult to
measure and susceptible to channel distortion, it is discarded.
   We first compute the discrete STFT and then weight it by a
series of filter frequency responses that roughly match those of
the auditory critical band filters. To approximate the auditory
critical bands, we use triangular mel-scale filter bands, as
shown in Figure 3.

        X(n, \omega_k) = \sum_{m=-\infty}^{\infty} x[m] w[n-m] e^{-j\omega_k m}

   Next, we compute the energy in the mel-scale weighted
STFT and normalize the energy in each frame so as to give
equal energy for a flat input spectrum.

        E_{mel}(n, l) = \frac{1}{A_l} \sum_{k=L_l}^{U_l} |V_l(\omega_k) X(n, \omega_k)|^2,

where

        A_l = \sum_{k=L_l}^{U_l} |V_l(\omega_k)|^2

   Finally, we compute the mel-cepstrum for each frame using
the even property of the real cepstrum to rewrite the inverse
transform as the discrete cosine transform. We then extract
the mel-frequency cepstrum coefficients (MFCC) for use as
our feature vector.

        C_{mel}[n, m] = \frac{1}{R} \sum_{l=0}^{R-1} \log\{E_{mel}(n, l)\} \cos\!\left(\frac{2\pi}{R} l m\right)

   We repeat this feature extraction process for both training
and testing data to produce C_{mel}^{tr} and C_{mel}^{ts}, respectively.

           Fig. 3.   Idealized Triangular Mel-Scale Filter Bank

                     V. FEATURE MATCHING
   A survey of the literature has revealed numerous approaches
based upon minimum-distance classification, dynamic time-warping,
vector quantization, hidden Markov models, Gaussian mixture
models, and artificial neural networks. For this project, we
chose to implement a minimum-distance classifier and, as time
permitted, to improve upon this with Gaussian mixture model
based matching.

A. Minimum-Distance Classification
   The concept of minimum-distance classification is simple:
we calculate a feature vector for each new test case and
measure how far it is from the stored training data using
some distance metric. We then select a threshold distance to
determine at which point we consider the speaker verification
to have been successful, or equivalently, the speaker to have
been recognized. As we will later see, this threshold determines
the tradeoff between the number of false negatives and false
positives.
   Specifically, we compute a feature vector based upon the
averages of the MFCCs for the test and training data sets.

        \bar{C}_{mel}^{tr}[n] = \frac{1}{M} \sum_{m=1}^{M} C_{mel}^{tr}[mL, n]

        \bar{C}_{mel}^{ts}[n] = \frac{1}{M} \sum_{m=1}^{M} C_{mel}^{ts}[mL, n]

   As a distance measure, we then use the mean-squared
difference between the average testing and training feature
vectors.

        D = \frac{1}{R-1} \sum_{n=1}^{R-1} \left(\bar{C}_{mel}^{ts}[n] - \bar{C}_{mel}^{tr}[n]\right)^2

   If this distance is less than the given threshold, the speaker
has been verified.

B. Gaussian Mixture Models
   Recognizing that speech production is inherently non-deterministic
(due to subtle variations in vocal tract shape and
glottal flow), we represent a speaker probabilistically through a
multivariate Gaussian probability density function (pdf). This
is a multi-dimensional structure in which we can think of
each statistical variable as a state corresponding to a single
acoustic sound class, whether at a broad level, such as quasi-periodic,
noise-like, and impulse-like, or at a very fine level, such
as individual phonemes. The Gaussian pdf of a feature vector
x for the ith state is written as

        b_i(x) = \frac{1}{(2\pi)^{R/2} |\Sigma_i|^{1/2}} e^{-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)}

where \mu_i is the state mean vector, \Sigma_i is the state covariance
matrix, and R is the dimension of the feature vector.1

   1 Using MFCC-based feature vectors, this corresponds to the number of
MFCC coefficients.
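The feature-extraction and minimum-distance steps above can be sketched end to end. The following is an illustrative Python/NumPy version, not the paper's MATLAB/Voicebox reference implementation; the sample rate, filterbank size, and synthetic "voices" are arbitrary choices for the example, and the DCT is written directly from the C_mel equation above with R equal to the number of mel filters.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular mel-scale filters V_l over the positive FFT bins."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Filter edge frequencies, equally spaced on the mel scale
    edges = imel(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for l in range(n_filters):
        lo, mid, hi = bins[l], bins[l + 1], bins[l + 2]
        for k in range(lo, mid):
            fb[l, k] = (k - lo) / max(mid - lo, 1)   # rising edge
        for k in range(mid, hi):
            fb[l, k] = (hi - k) / max(hi - mid, 1)   # falling edge
    return fb

def mfcc_mean(x, fs=8000, frame=256, hop=128, n_filters=20, n_coeffs=12):
    """Average MFCC vector: STFT magnitude -> mel energies -> log -> DCT."""
    fb = mel_filterbank(n_filters, frame, fs)
    win = np.hamming(frame)
    frames = [x[i:i + frame] * win for i in range(0, len(x) - frame + 1, hop)]
    mags = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # |X(n, w_k)|^2
    emel = mags @ fb.T + 1e-10                           # E_mel(n, l), kept positive
    # C_mel[n, m] = (1/R) sum_l log{E_mel(n, l)} cos(2*pi*l*m/R), m = 1..n_coeffs
    m = np.arange(1, n_coeffs + 1)
    l = np.arange(n_filters)
    basis = np.cos(2 * np.pi * np.outer(m, l) / n_filters)
    c = (np.log(emel) @ basis.T) / n_filters
    return c.mean(axis=0)                                # average over frames

def distance(c_tr, c_ts):
    """Mean-squared difference D between average feature vectors."""
    return np.mean((c_ts - c_tr) ** 2)

# Toy usage: the same synthetic "voice" should be closer than a different one.
rng = np.random.default_rng(0)
t = np.arange(8000) / 8000.0
voice_a = np.sin(2 * np.pi * 150 * t) + 0.3 * np.sin(2 * np.pi * 900 * t)
voice_b = np.sin(2 * np.pi * 250 * t) + 0.3 * np.sin(2 * np.pi * 2500 * t)
d_same = distance(mfcc_mean(voice_a),
                  mfcc_mean(voice_a + 0.01 * rng.standard_normal(8000)))
d_diff = distance(mfcc_mean(voice_a), mfcc_mean(voice_b))
print(d_same < d_diff)  # the matched "speaker" is nearer in feature space
```

A verification decision then reduces to comparing `distance(...)` against the chosen threshold, exactly as described for the reference system.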
                                                           TABLE I
                                            DISTANCES BETWEEN TRAINING AND TESTING DATA

                                            Test1      Test2    Test3    Test4    Test5      Test6      Test7     Test8
                                Train1     0.1192      0.1945   0.2151   0.2184   0.5364    0.3823     0.4963     0.4538
                                Train2     0.0724      0.0378   0.0406   0.0783   0.4035    0.3177     0.4125     0.3986
                                Train3     0.1672      0.1311   0.1042   0.0969   0.3382    0.2121     0.3597     0.2847
                                Train4     0.1482      0.1412   0.1363   0.0817   0.3268    0.3211     0.3154     0.3282
                                Train5     0.1882      0.1928   0.2237   0.1466   0.1044    0.0709     0.1382     0.1299
                                Train6     0.3012      0.3521   0.3208   0.3112   0.3023    0.0958     0.2755     0.2094
                                Train7     0.2743      0.2973   0.3252   0.2517   0.1618    0.1318     0.0724     0.1427
                                Train8     0.3589      0.3600   0.3381   0.2186   0.1976    0.1133     0.2487     0.0585
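The threshold behavior reported in Section VII can be checked directly against Table I. This short Python snippet (an illustration, not part of the MATLAB system) counts false negatives as diagonal distances at or above the threshold and false positives as off-diagonal distances below it:

```python
import numpy as np

# Distance matrix from Table I: rows = training speakers, cols = test speakers.
D = np.array([
    [0.1192, 0.1945, 0.2151, 0.2184, 0.5364, 0.3823, 0.4963, 0.4538],
    [0.0724, 0.0378, 0.0406, 0.0783, 0.4035, 0.3177, 0.4125, 0.3986],
    [0.1672, 0.1311, 0.1042, 0.0969, 0.3382, 0.2121, 0.3597, 0.2847],
    [0.1482, 0.1412, 0.1363, 0.0817, 0.3268, 0.3211, 0.3154, 0.3282],
    [0.1882, 0.1928, 0.2237, 0.1466, 0.1044, 0.0709, 0.1382, 0.1299],
    [0.3012, 0.3521, 0.3208, 0.3112, 0.3023, 0.0958, 0.2755, 0.2094],
    [0.2743, 0.2973, 0.3252, 0.2517, 0.1618, 0.1318, 0.0724, 0.1427],
    [0.3589, 0.3600, 0.3381, 0.2186, 0.1976, 0.1133, 0.2487, 0.0585],
])

def errors(D, threshold):
    """A claimant is accepted when D < threshold; diagonal cells are the true claimants."""
    accept = D < threshold
    false_negatives = int(np.sum(~np.diag(accept)))                   # claimants rejected
    false_positives = int(np.sum(accept) - np.sum(np.diag(accept)))   # imposters accepted
    return false_negatives, false_positives

print(errors(D, 0.12))  # -> (0, 6): no false negatives, six false positives
print(errors(D, 0.11))  # -> (1, 5): one false negative, five false positives
```

Both outputs agree with the error counts reported in Section VII.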



The speaker model \lambda then represents the set of GMM mean,
covariance, and weight parameters.

        \lambda = \{\mu_i, \Sigma_i, p_i\}

   The probability of a feature vector being in any one of I
states for a particular speaker model \lambda is represented by the
union, or mixture, of the different Gaussian pdfs:

        p(x|\lambda) = \sum_{i=1}^{I} p_i b_i(x)

where p_i are the component mixture weights and b_i(x) are the
mixture densities.
   In speaker verification, we must decide whether a test
utterance belongs to the target speaker. Formally, we are
making a binary decision (yes/no) on two hypotheses: whether
the test utterance belongs to the target speaker, hypothesis H_1,
or whether it comes from an imposter, hypothesis H_2.
   Suppose that we have already computed the GMM of the
target speaker and the GMM for a collection of background
speakers2; we then apply a likelihood ratio test to decide
between H_1 and H_2. This test is the ratio between the
probability that the collection of feature vectors X =
\{x_0, x_1, \ldots, x_{M-1}\} is from the claimed speaker, P(\lambda_C|X),
and the probability that the collection of feature vectors X
is not from the claimed speaker, P(\bar{\lambda}_C|X), i.e., from the
background. Using Bayes' rule, we can write this ratio as

        \frac{P(\lambda_C|X)}{P(\bar{\lambda}_C|X)} = \frac{p(X|\lambda_C) P(\lambda_C) / P(X)}{p(X|\bar{\lambda}_C) P(\bar{\lambda}_C) / P(X)}

where P(X) denotes the probability of the vector stream X.
Discarding the constant probability terms and applying the
logarithm, we have the log-likelihood ratio

        \Lambda(X) = \log[p(X|\lambda_C)] - \log[p(X|\bar{\lambda}_C)]

which we compare with a threshold to accept or reject whether
the utterance belongs to the claimed speaker. If \Lambda(X) \geq \theta,
then the speaker has been verified.

   2 One common model for generating a background pdf p(X|\bar{\lambda}_C) is through
models of a variety of background (imposter) speakers. In our development,
we used speakers from the TIMIT database for this purpose.

                      VI. IMPLEMENTATION
   The reference speaker recognition system was implemented
in MATLAB using training data and test data stored in WAV
files. There are tools included in MATLAB and publicly
available libraries to aid in creating this system. For reading
in the data sets, we used MATLAB's wavread function.
For feature extraction, we used the melcepst function
from Voicebox, a MATLAB toolbox. We used twelve MFCC
coefficients (skipping the 0th-order coefficient), computed over
256-sample Hamming-windowed frames with a 128-sample
increment. We used custom matching and testing routines based
upon minimum-distance classification as described above. For
the Gaussian mixture models, we used T. N. Vikram's GMM
library, based upon the text Algorithm Collections for Digital
Signal Processing Applications Using Matlab by E. S. Gopi.

                  VII. EXPERIMENTAL RESULTS
   We performed a series of experiments to determine the
accuracy of the reference implementation for text-independent
speaker verification. We selected eight speakers from the
TIMIT database, including four males and four females, each
saying two sentences. The sentences used for this experiment
were "don't ask me to carry an oily rag like that" and "she had
your dark suit in greasy wash water all year". We used the
sentence referring to an "oily rag" as training data, and the
sentence referring to a "dark suit" as testing data.3

   3 The allocation of one sentence for training and the other for testing does
not affect the results of the experiment. Due to the symmetry of the distance
metric used for computation, if we had switched the sentences used for
training and testing, the results of Table I would simply be transposed.

   The results of the experiment are shown in Table I. This
table depicts the "distance" between the training and testing
data sets as a Cartesian product. For example, the value in
cell (1, 3) (in row-major order) corresponds to the distance
between speaker1's training data and speaker3's testing
data. Note that we are trying to minimize the diagonal, that
is, the distance between any individual speaker's testing and
training data, while maximizing the non-diagonal cells in a
text-independent manner.
   For these data, we empirically selected an initial threshold
value of 0.12 to avoid any false negatives. With this threshold,
however, the system resulted in six false positives, for an
accuracy of 91%. Decrementing the threshold to 0.11 results in
one false negative and five false positives, while maintaining
an accuracy of 91%. For these data, further variation of the
threshold cannot increase the accuracy; however, it clearly
provides a tradeoff between false negatives and false positives.
Due to this tradeoff, the most desirable threshold is application
dependent and must be determined on a case-by-case basis.
   It is informative to note that this is a challenging experiment
for a text-independent speaker recognition system, given the
small amount of data used for training as well as the presence
of multiple differing acoustic sound classes in the sentences;
e.g., the phoneme "sh" appears twice in the testing sentence
but is absent from the training sentence. Note that even
with these difficulties, the reference implementation performs
decently; we expect this is due to the effect of averaging the
mel-cepstral features. Furthermore, we expect that, contrary to
conventional wisdom, the minimum-distance classifier might
perform better than the GMM when confronted with differing
acoustic classes in the testing and training data sets during
text-independent speaker recognition.
   As a final comment, note that from these results it is easy
to see why an MFCC-based minimum-distance classifier system
should never be used for text-independent authentication
systems: there is no way to eliminate false positives while
maintaining a high degree of accuracy.

                     VIII. FUTURE WORK
   Although it wasn't in the stated proposal for this project, we
found the task of comparing the minimum-distance classifier
with one based upon a Gaussian mixture model intriguing.
We found that the minimum-distance classifier was easily
implemented, and used the remaining time to explore the use
of GMMs for speaker recognition. However, we did not find
sufficient remaining time to complete the implementation of
the GMM-based speaker recognition system and repeat the
experiment. We have integrated an existing GMM library into
the project framework to compute the means, covariances, and
weights of each state of the target speaker model; however,
it remains to implement the log-likelihood ratio classifier.
Given the strong parallels between the completed tasks and
those remaining, i.e., threshold-based classification and
multivariate Gaussian pdf computation, it should be simple
to compute a background model and repeat the above experiment
with the new GMM-based reference implementation. During the
analysis, it would be insightful to compare the minimum-distance
and GMM classification schemes, particularly focusing on
performance with regard to differing acoustic makeup of the
test and training data for text-independent speaker recognition.

                      IX. CONCLUSION
   We have shown that minimum-distance classification for
text-independent speaker recognition performs moderately
well, though there is obvious room for improvement. Through
a simple experiment, we have clearly demonstrated the tradeoff
between false negatives and false positives in selecting a
threshold, and noted that the most desirable threshold is
an application-dependent parameter. From our results, we have
also hypothesized that the minimum-distance classifier might
outperform a GMM classifier on acoustically diverse test and
training data sets, though this remains to be seen.
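The remaining GMM step, the log-likelihood ratio classifier of Section V, can be sketched as follows. This is an illustrative Python/NumPy version rather than the MATLAB library mentioned above, and the 2-state, 2-dimensional GMM parameters are hypothetical hand-set values; a real system would estimate them from training data (e.g., via expectation-maximization).

```python
import numpy as np

def log_gmm_pdf(X, weights, means, variances):
    """log p(x|lambda) for each row of X under a diagonal-covariance GMM."""
    logps = []
    for p, mu, var in zip(weights, means, variances):
        # Log of the multivariate Gaussian b_i(x) with diagonal covariance
        quad = np.sum((X - mu) ** 2 / var, axis=1)
        logdet = np.sum(np.log(var))
        d = X.shape[1]
        logps.append(np.log(p) - 0.5 * (d * np.log(2 * np.pi) + logdet + quad))
    return np.logaddexp.reduce(logps, axis=0)   # log sum_i p_i b_i(x)

def llr(X, target, background):
    """Lambda(X) = log p(X|lambda_C) - log p(X|lambda_C-bar), summed over frames."""
    return np.sum(log_gmm_pdf(X, *target)) - np.sum(log_gmm_pdf(X, *background))

# Hypothetical models: the target GMM sits near the test vectors,
# the background GMM far away.
target = ([0.5, 0.5],
          np.array([[0.0, 0.0], [1.0, 1.0]]),
          np.array([[1.0, 1.0], [1.0, 1.0]]))
background = ([0.5, 0.5],
              np.array([[5.0, 5.0], [6.0, 6.0]]),
              np.array([[1.0, 1.0], [1.0, 1.0]]))

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2))              # feature vectors drawn near the target
accepted = llr(X, target, background) >= 0.0  # threshold theta = 0
print(accepted)  # the utterance is attributed to the target speaker
```

Swapping the minimum-distance threshold test for this ratio test is the only change needed to the experiment framework, which is why the remaining work should be straightforward.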

sEMG biofeedback in dysphagia
sEMG biofeedback in dysphagiasEMG biofeedback in dysphagia
sEMG biofeedback in dysphagia
 
R Code for EM Algorithm
R Code for EM AlgorithmR Code for EM Algorithm
R Code for EM Algorithm
 
Kaldi-voice: Your personal speech recognition server using open source code
Kaldi-voice: Your personal speech recognition server using open source codeKaldi-voice: Your personal speech recognition server using open source code
Kaldi-voice: Your personal speech recognition server using open source code
 
PRESENTATION LAB DSP.Analysis & classification of EMG signal - DSP LAB
PRESENTATION LAB DSP.Analysis & classification of EMG signal - DSP LABPRESENTATION LAB DSP.Analysis & classification of EMG signal - DSP LAB
PRESENTATION LAB DSP.Analysis & classification of EMG signal - DSP LAB
 
Dsp lab report- Analysis and classification of EMG signal using MATLAB.
Dsp lab report- Analysis and classification of EMG signal using MATLAB.Dsp lab report- Analysis and classification of EMG signal using MATLAB.
Dsp lab report- Analysis and classification of EMG signal using MATLAB.
 
EMG electromayogram
EMG electromayogramEMG electromayogram
EMG electromayogram
 
A Survey on Speaker Recognition System
A Survey on Speaker Recognition SystemA Survey on Speaker Recognition System
A Survey on Speaker Recognition System
 
Emg biofeedback
Emg biofeedbackEmg biofeedback
Emg biofeedback
 
GMM
GMMGMM
GMM
 
Electromyogram
ElectromyogramElectromyogram
Electromyogram
 
Applications of Emotions Recognition
Applications of Emotions RecognitionApplications of Emotions Recognition
Applications of Emotions Recognition
 

Similar to Text-Independent Speaker Verification Report

Adaptive equalization
Adaptive equalizationAdaptive equalization
Adaptive equalizationKamal Bhatt
 
ML_Unit_2_Part_A
ML_Unit_2_Part_AML_Unit_2_Part_A
ML_Unit_2_Part_ASrimatre K
 
Designing an Efficient Multimodal Biometric System using Palmprint and Speech...
Designing an Efficient Multimodal Biometric System using Palmprint and Speech...Designing an Efficient Multimodal Biometric System using Palmprint and Speech...
Designing an Efficient Multimodal Biometric System using Palmprint and Speech...IDES Editor
 
Fundamentals of Machine Learning.pptx
Fundamentals of Machine Learning.pptxFundamentals of Machine Learning.pptx
Fundamentals of Machine Learning.pptxWiamFADEL
 
A Novel Method for Speaker Independent Recognition Based on Hidden Markov Model
A Novel Method for Speaker Independent Recognition Based on Hidden Markov ModelA Novel Method for Speaker Independent Recognition Based on Hidden Markov Model
A Novel Method for Speaker Independent Recognition Based on Hidden Markov ModelIDES Editor
 
An Algorithm For Vector Quantizer Design
An Algorithm For Vector Quantizer DesignAn Algorithm For Vector Quantizer Design
An Algorithm For Vector Quantizer DesignAngie Miller
 
A Text-Independent Speaker Identification System based on The Zak Transform
A Text-Independent Speaker Identification System based on The Zak TransformA Text-Independent Speaker Identification System based on The Zak Transform
A Text-Independent Speaker Identification System based on The Zak TransformCSCJournals
 
Emotion Recognition Based On Audio Speech
Emotion Recognition Based On Audio SpeechEmotion Recognition Based On Audio Speech
Emotion Recognition Based On Audio SpeechIOSR Journals
 
Speaker Identification From Youtube Obtained Data
Speaker Identification From Youtube Obtained DataSpeaker Identification From Youtube Obtained Data
Speaker Identification From Youtube Obtained Datasipij
 
Biomedical Signals Classification With Transformer Based Model.pptx
Biomedical Signals Classification With Transformer Based Model.pptxBiomedical Signals Classification With Transformer Based Model.pptx
Biomedical Signals Classification With Transformer Based Model.pptxSandeep Kumar
 
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition TechniqueA Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition TechniqueCSCJournals
 
Artificial Neural Networks 1
Artificial Neural Networks 1Artificial Neural Networks 1
Artificial Neural Networks 1swapnac12
 
Wavelet Based Image Compression Using FPGA
Wavelet Based Image Compression Using FPGAWavelet Based Image Compression Using FPGA
Wavelet Based Image Compression Using FPGADr. Mohieddin Moradi
 
SYNTHETICAL ENLARGEMENT OF MFCC BASED TRAINING SETS FOR EMOTION RECOGNITION
SYNTHETICAL ENLARGEMENT OF MFCC BASED TRAINING SETS FOR EMOTION RECOGNITIONSYNTHETICAL ENLARGEMENT OF MFCC BASED TRAINING SETS FOR EMOTION RECOGNITION
SYNTHETICAL ENLARGEMENT OF MFCC BASED TRAINING SETS FOR EMOTION RECOGNITIONcscpconf
 
Blind Estimation of Carrier Frequency Offset in Multicarrier Communication Sy...
Blind Estimation of Carrier Frequency Offset in Multicarrier Communication Sy...Blind Estimation of Carrier Frequency Offset in Multicarrier Communication Sy...
Blind Estimation of Carrier Frequency Offset in Multicarrier Communication Sy...IDES Editor
 

Similar to Text-Independent Speaker Verification Report (20)

Lab manual
Lab manualLab manual
Lab manual
 
Adaptive equalization
Adaptive equalizationAdaptive equalization
Adaptive equalization
 
ML_Unit_2_Part_A
ML_Unit_2_Part_AML_Unit_2_Part_A
ML_Unit_2_Part_A
 
Es25893896
Es25893896Es25893896
Es25893896
 
D111823
D111823D111823
D111823
 
Designing an Efficient Multimodal Biometric System using Palmprint and Speech...
Designing an Efficient Multimodal Biometric System using Palmprint and Speech...Designing an Efficient Multimodal Biometric System using Palmprint and Speech...
Designing an Efficient Multimodal Biometric System using Palmprint and Speech...
 
Fundamentals of Machine Learning.pptx
Fundamentals of Machine Learning.pptxFundamentals of Machine Learning.pptx
Fundamentals of Machine Learning.pptx
 
A Novel Method for Speaker Independent Recognition Based on Hidden Markov Model
A Novel Method for Speaker Independent Recognition Based on Hidden Markov ModelA Novel Method for Speaker Independent Recognition Based on Hidden Markov Model
A Novel Method for Speaker Independent Recognition Based on Hidden Markov Model
 
An Algorithm For Vector Quantizer Design
An Algorithm For Vector Quantizer DesignAn Algorithm For Vector Quantizer Design
An Algorithm For Vector Quantizer Design
 
A Text-Independent Speaker Identification System based on The Zak Transform
A Text-Independent Speaker Identification System based on The Zak TransformA Text-Independent Speaker Identification System based on The Zak Transform
A Text-Independent Speaker Identification System based on The Zak Transform
 
Emotion Recognition Based On Audio Speech
Emotion Recognition Based On Audio SpeechEmotion Recognition Based On Audio Speech
Emotion Recognition Based On Audio Speech
 
Speaker Identification From Youtube Obtained Data
Speaker Identification From Youtube Obtained DataSpeaker Identification From Youtube Obtained Data
Speaker Identification From Youtube Obtained Data
 
Biomedical Signals Classification With Transformer Based Model.pptx
Biomedical Signals Classification With Transformer Based Model.pptxBiomedical Signals Classification With Transformer Based Model.pptx
Biomedical Signals Classification With Transformer Based Model.pptx
 
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition TechniqueA Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique
 
Artificial Neural Networks 1
Artificial Neural Networks 1Artificial Neural Networks 1
Artificial Neural Networks 1
 
Wavelet Based Image Compression Using FPGA
Wavelet Based Image Compression Using FPGAWavelet Based Image Compression Using FPGA
Wavelet Based Image Compression Using FPGA
 
lecture_01.ppt
lecture_01.pptlecture_01.ppt
lecture_01.ppt
 
SYNTHETICAL ENLARGEMENT OF MFCC BASED TRAINING SETS FOR EMOTION RECOGNITION
SYNTHETICAL ENLARGEMENT OF MFCC BASED TRAINING SETS FOR EMOTION RECOGNITIONSYNTHETICAL ENLARGEMENT OF MFCC BASED TRAINING SETS FOR EMOTION RECOGNITION
SYNTHETICAL ENLARGEMENT OF MFCC BASED TRAINING SETS FOR EMOTION RECOGNITION
 
Blind Estimation of Carrier Frequency Offset in Multicarrier Communication Sy...
Blind Estimation of Carrier Frequency Offset in Multicarrier Communication Sy...Blind Estimation of Carrier Frequency Offset in Multicarrier Communication Sy...
Blind Estimation of Carrier Frequency Offset in Multicarrier Communication Sy...
 
1607.01152.pdf
1607.01152.pdf1607.01152.pdf
1607.01152.pdf
 

More from Cody Ray

Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDBBuilding a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDBCody Ray
 
Cody A. Ray DEXA Report 3/21/2013
Cody A. Ray DEXA Report 3/21/2013Cody A. Ray DEXA Report 3/21/2013
Cody A. Ray DEXA Report 3/21/2013Cody Ray
 
Cognitive Modeling & Intelligent Tutors
Cognitive Modeling & Intelligent TutorsCognitive Modeling & Intelligent Tutors
Cognitive Modeling & Intelligent TutorsCody Ray
 
Robotics: Modelling, Planning and Control
Robotics: Modelling, Planning and ControlRobotics: Modelling, Planning and Control
Robotics: Modelling, Planning and ControlCody Ray
 
Psychoacoustic Approaches to Audio Steganography
Psychoacoustic Approaches to Audio SteganographyPsychoacoustic Approaches to Audio Steganography
Psychoacoustic Approaches to Audio SteganographyCody Ray
 
Psychoacoustic Approaches to Audio Steganography Report
Psychoacoustic Approaches to Audio Steganography Report Psychoacoustic Approaches to Audio Steganography Report
Psychoacoustic Approaches to Audio Steganography Report Cody Ray
 
Image Printing Based on Halftoning
Image Printing Based on HalftoningImage Printing Based on Halftoning
Image Printing Based on HalftoningCody Ray
 
Object Recognition: Fourier Descriptors and Minimum-Distance Classification
Object Recognition: Fourier Descriptors and Minimum-Distance ClassificationObject Recognition: Fourier Descriptors and Minimum-Distance Classification
Object Recognition: Fourier Descriptors and Minimum-Distance ClassificationCody Ray
 

More from Cody Ray (8)

Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDBBuilding a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
 
Cody A. Ray DEXA Report 3/21/2013
Cody A. Ray DEXA Report 3/21/2013Cody A. Ray DEXA Report 3/21/2013
Cody A. Ray DEXA Report 3/21/2013
 
Cognitive Modeling & Intelligent Tutors
Cognitive Modeling & Intelligent TutorsCognitive Modeling & Intelligent Tutors
Cognitive Modeling & Intelligent Tutors
 
Robotics: Modelling, Planning and Control
Robotics: Modelling, Planning and ControlRobotics: Modelling, Planning and Control
Robotics: Modelling, Planning and Control
 
Psychoacoustic Approaches to Audio Steganography
Psychoacoustic Approaches to Audio SteganographyPsychoacoustic Approaches to Audio Steganography
Psychoacoustic Approaches to Audio Steganography
 
Psychoacoustic Approaches to Audio Steganography Report
Psychoacoustic Approaches to Audio Steganography Report Psychoacoustic Approaches to Audio Steganography Report
Psychoacoustic Approaches to Audio Steganography Report
 
Image Printing Based on Halftoning
Image Printing Based on HalftoningImage Printing Based on Halftoning
Image Printing Based on Halftoning
 
Object Recognition: Fourier Descriptors and Minimum-Distance Classification
Object Recognition: Fourier Descriptors and Minimum-Distance ClassificationObject Recognition: Fourier Descriptors and Minimum-Distance Classification
Object Recognition: Fourier Descriptors and Minimum-Distance Classification
 

Recently uploaded

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 

Recently uploaded (20)

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 

Text-Independent Speaker Verification Report

loose classification scheme is shown in Figure 1.
For the purpose of this project, we focused on the task of text-independent speaker verification.

II. TERMINOLOGY

• A background speaker is an imposter speaker.
• A claimant is a speaker known to the system who is correctly claiming his/her identity.
• A false negative is an error where a claimant is rejected as an imposter.
• A false positive is an error where an imposter is accepted as a claimant.
• Speaker identification decides which voice model from a known set of voice models best characterizes a speaker.
• Speaker verification decides whether a speaker corresponds to a particular known voice or some other unknown voice.
• A target speaker is a known speaker.

III. SYSTEM OVERVIEW

Speaker recognition systems must first build a model of the voice of each target speaker, as well as a model of a collection of background speakers, using speaker-dependent features extracted from the speech waveform. This is referred to as the training stage, and the associated speech data used to build the speaker model is called training data. During the recognition or testing stage, the features measured from the waveform of a test utterance, i.e., the test data of a speaker, are matched (in some sense) against speaker models obtained during training. An overview of the components of a general speaker recognition system is given in Figure 2.

As with any biometric (pattern) recognition system, the speaker recognition system consists of two core modules: feature extraction and feature matching. In Section IV, we provide an overview of various dimensions used for speaker analysis as well as describe the features we selected in more depth. Section V continues with mathematical techniques used in the matching process.

Fig. 2. General Speaker Recognition System

IV. FEATURE EXTRACTION

Although it's possible to extract a number of features from sampled audio, including both spectral and non-spectral features, for the purpose of this project we characterize a speaker's voice attributes exclusively through spectral features. In selecting acoustic spectral features, we want our feature set to reflect the unique characteristics of a speaker. For this purpose, we use the magnitude component of the short-time
Fourier transform (STFT) as a basis. As the phase is difficult to measure and susceptible to channel distortion, it is discarded. We first compute the discrete STFT and then weight it by a series of filter frequency responses that roughly match those of the auditory critical band filters. To approximate the auditory critical bands, we use triangular mel-scale filter bands, as shown in Figure 3.

X(n, \omega_k) = \sum_{m=-\infty}^{\infty} x[m] w[n-m] e^{-j\omega_k m}

Next, we compute the energy in the mel-scale weighted STFT and normalize the energy in each frame so as to give equal energy for a flat input spectrum.

E_{mel}(n, l) = \frac{1}{A_l} \sum_{k=L_l}^{U_l} |V_l(\omega_k) X(n, \omega_k)|^2, where A_l = \sum_{k=L_l}^{U_l} |V_l(\omega_k)|^2

Finally, we compute the mel-cepstrum for each frame using the even property of the real cepstrum to rewrite the inverse transform as the discrete cosine transform. We then extract the mel-frequency cepstrum coefficients (MFCC) for use as our feature vector.

C_{mel}[n, m] = \frac{1}{R} \sum_{l=0}^{R-1} \log\{E_{mel}(n, l)\} \cos\left(\frac{2\pi}{R} lm\right)

We repeat this feature extraction process for both training and testing data to produce C^{tr}_{mel} and C^{ts}_{mel}, respectively.

Fig. 3. Idealized Triangular Mel-Scale Filter Bank

V. FEATURE MATCHING

A survey of the literature has revealed numerous approaches based upon minimum-distance classification, dynamic time-warping, vector quantization, hidden Markov models, Gaussian mixture models, or artificial neural networks. For this project, we chose to implement a minimum-distance classifier and, as time permits, improve upon this with Gaussian mixture model based matching.

A. Minimum-Distance Classification

The concept of minimum-distance classification is simple: we calculate a feature vector for each new test case, and measure how far it is from the stored training data using some metric for distance computation. We then select a threshold distance to determine at which point we consider the speaker verification to have been successful, or equivalently, the speaker to have been recognized. As we will later see, this threshold will determine the tradeoff between the number of false negatives and false positives.

Specifically, we compute a feature vector based upon the averages of the MFCCs for the test and training data sets.

\bar{C}^{tr}_{mel}[n] = \frac{1}{M} \sum_{m=1}^{M} C^{tr}_{mel}[mL, n]

\bar{C}^{ts}_{mel}[n] = \frac{1}{M} \sum_{m=1}^{M} C^{ts}_{mel}[mL, n]

As a distance measure, we then use the mean-squared difference between the average testing and training feature vectors.

D = \frac{1}{R-1} \sum_{n=1}^{R-1} \left(\bar{C}^{ts}_{mel}[n] - \bar{C}^{tr}_{mel}[n]\right)^2

If this distance is less than the given threshold, the speaker has been verified.

B. Gaussian Mixture Models

Recognizing that speech production is inherently non-deterministic (due to subtle variations in vocal tract shape and glottal flow), we represent a speaker probabilistically through a multivariate Gaussian probability density function (pdf). This is a multi-dimensional structure in which we can think of each statistical variable as a state corresponding to a single acoustic sound class, whether at a broad level, such as quasi-periodic, noise-like, and impulse-like, or at a very fine level, such as individual phonemes. The Gaussian pdf of a feature vector x for the ith state is written as

b_i(x) = \frac{1}{(2\pi)^{R/2} |\Sigma_i|^{1/2}} e^{-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)}

where \mu_i is the state mean vector, \Sigma_i is the state covariance matrix, and R is the dimension of the feature vector.¹

¹ Using MFCC-based feature vectors, this corresponds to the number of MFCC coefficients.
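The minimum-distance decision rule above reduces to a few lines of code. The following is an illustrative sketch in Python (the reference system itself was written in MATLAB, so all function names here are our own); it assumes the per-frame MFCC vectors have already been extracted:

```python
def average_features(frames):
    """Average the per-frame MFCC vectors into a single feature vector
    (the C-bar vector in the equations above)."""
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    return [sum(f[i] for f in frames) / n_frames for i in range(n_coeffs)]

def distance(train_avg, test_avg):
    """Mean-squared difference between the average training and testing
    feature vectors (the distance D above, normalized by vector length)."""
    r = len(train_avg)
    return sum((ts - tr) ** 2 for ts, tr in zip(test_avg, train_avg)) / r

def verify_min_distance(train_frames, test_frames, threshold=0.12):
    """Accept the claimed speaker if the distance falls below the threshold.
    Returns (accepted, distance)."""
    d = distance(average_features(train_frames),
                 average_features(test_frames))
    return d < threshold, d
```

The default threshold of 0.12 mirrors the value selected empirically in the experiments below; in practice it would be tuned on held-out data to trade off false negatives against false positives.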
The speaker model λ then represents the set of GMM mean, covariance, and weight parameters:

\[ \lambda = \{\mu_i, \Sigma_i, p_i\} \]

The probability of a feature vector being in any one of I states for a particular speaker model λ is represented by the union, or mixture, of the different Gaussian pdfs:

\[ p(x|\lambda) = \sum_{i=1}^{I} p_i b_i(x) \]

where p_i are the component mixture weights and b_i(x) are the mixture densities.

In speaker verification, we must decide whether a test utterance belongs to the target speaker. Formally, we are making a binary decision (yes/no) on two hypotheses: whether the test utterance belongs to the target speaker, hypothesis H_1, or whether it comes from an imposter, hypothesis H_2. Suppose that we have already computed the GMM of the target speaker and the GMM for a collection of background speakers²; we then determine the likelihood ratio test to decide between H_1 and H_2. This test is the ratio between the probability that the collection of feature vectors X = \{x_0, x_1, \ldots, x_{M-1}\} is from the claimed speaker, P(\lambda_C|X), and the probability that the collection of feature vectors X is not from the claimed speaker, P(\lambda_{\bar{C}}|X), i.e., from the background. Using Bayes' rule, we can write this ratio as

\[ \frac{P(\lambda_C|X)}{P(\lambda_{\bar{C}}|X)} = \frac{p(X|\lambda_C) P(\lambda_C) / P(X)}{p(X|\lambda_{\bar{C}}) P(\lambda_{\bar{C}}) / P(X)} \]

where P(X) denotes the probability of the vector stream X. Discarding the constant probability terms and applying the logarithm, we have the log-likelihood ratio

\[ \Lambda(X) = \log[p(X|\lambda_C)] - \log[p(X|\lambda_{\bar{C}})] \]

that we compare with a threshold to accept or reject whether the utterance belongs to the claimed speaker. If \Lambda(X) \geq \theta, then the speaker has been verified.

²One common model for generating a background pdf p(X|\lambda_{\bar{C}}) is through models of a variety of background (imposter) speakers. In our development, we used speakers from the TIMIT database for this purpose.

VI. IMPLEMENTATION

The reference speaker recognition system was implemented in MATLAB using training data and test data stored in WAV files. There are tools included in MATLAB and publicly available libraries to aid in creating this system. For reading in the data sets, we used MATLAB's wavread function. For feature extraction, we used the melcepst function from Voicebox, a MATLAB toolbox. We used twelve MFCC coefficients (skipping the 0th-order coefficient) with 256-sample frames and a 128-sample-increment Hamming window. We used custom matching and testing routines based upon minimum-distance classification as described above. For the Gaussian Mixture Models, we used T. N. Vikram's GMM library, based upon the text Algorithm Collections for Digital Signal Processing Applications Using Matlab by E. S. Gopi.

VII. EXPERIMENTAL RESULTS

We performed a series of experiments to determine the accuracy of the reference implementation for text-independent speaker verification. We selected eight speakers from the TIMIT database, including four males and four females, each saying two sentences. The sentences used for this experiment were "don't ask me to carry an oily rag like that" and "she had your dark suit in greasy wash water all year". We used the sentence referring to an "oily rag" as training data, and the sentence referring to a "dark suit" as testing data.³

³The allocation of one sentence for training and the other for testing does not affect the results of the experiment. Due to the symmetry of the distance metric used for computation, if we had switched the sentences used for training and testing, the results of Table I would simply be transposed.

The results of the experiment are shown in Table I. This table depicts the "distance" between the training and testing data sets as a Cartesian product. For example, the value in cell (1, 3) (in row-major order) corresponds to the distance between speaker1's training data and speaker3's testing data. Note that we're trying to minimize the diagonal, that is, the distance between any individual speaker's testing and training data, while maximizing the non-diagonal cells in a text-independent manner.

TABLE I
DISTANCES BETWEEN TRAINING AND TESTING DATA

         Test1   Test2   Test3   Test4   Test5   Test6   Test7   Test8
Train1   0.1192  0.1945  0.2151  0.2184  0.5364  0.3823  0.4963  0.4538
Train2   0.0724  0.0378  0.0406  0.0783  0.4035  0.3177  0.4125  0.3986
Train3   0.1672  0.1311  0.1042  0.0969  0.3382  0.2121  0.3597  0.2847
Train4   0.1482  0.1412  0.1363  0.0817  0.3268  0.3211  0.3154  0.3282
Train5   0.1882  0.1928  0.2237  0.1466  0.1044  0.0709  0.1382  0.1299
Train6   0.3012  0.3521  0.3208  0.3112  0.3023  0.0958  0.2755  0.2094
Train7   0.2743  0.2973  0.3252  0.2517  0.1618  0.1318  0.0724  0.1427
Train8   0.3589  0.3600  0.3381  0.2186  0.1976  0.1133  0.2487  0.0585

For these data, we empirically selected an initial threshold value of 0.12 to avoid any false negatives. With this threshold, however, the system resulted in six false positives, for an accuracy of 91%. Decrementing the threshold to 0.11 results in
one false negative and five false positives, while maintaining an accuracy of 91%. For these data, further variation of the threshold cannot increase the accuracy; however, it clearly provides a tradeoff between false negatives and false positives. Due to this tradeoff, the most desirable threshold is application dependent and must be determined on a case-by-case basis.

It is informative to note that this is a challenging experiment for a text-independent speaker recognition system, given the small amount of data used for training as well as the presence of multiple differing acoustic sound classes in the sentences, e.g., the phoneme "sh" appears twice in the testing sentence but is absent from the training sentence. Note that even with these difficulties, the reference implementation performs decently; we expect this is due to the effect of averaging the mel-cepstral features. Furthermore, we expect that, contrary to conventional wisdom, the minimum-distance classifier might perform better than GMM when confronted with differing acoustic classes in the testing and training data sets during text-independent speaker recognition.

As a final comment, note that from these results, it is easy to see why an MFCC-based minimum-distance classifier system should never be used for text-independent authentication systems: there is no way to eliminate false positives while maintaining a high degree of accuracy!

VIII. FUTURE WORK

Although it wasn't in the stated proposal for this project, we found the task of comparing the minimum-distance classifier with one based upon a Gaussian Mixture Model intriguing. We found that the minimum-distance classifier was easily implemented, and used the remaining time to explore the use of GMMs for speaker recognition. However, we did not find sufficient remaining time to complete the implementation of the GMM-based speaker recognition system and repeat the experiment. We've integrated an existing GMM library into the project framework to compute the means, covariances, and weights of each state of the target speaker model; however, it remains to implement the log-likelihood ratio classifier. Given the strong parallels between the remaining tasks and those already completed, i.e., threshold-based classification and multivariate Gaussian pdf computation, it should be simple to compute a background model and repeat the above experiment with the new GMM-based reference implementation. During the analysis, it would be insightful to compare the minimum-distance and GMM classification schemes, particularly focusing on performance with regard to the differing acoustic makeup of the test and training data for text-independent speaker recognition.

IX. CONCLUSION

We have shown that minimum-distance classification for text-independent speaker recognition performs moderately well, though there is obvious room for improvement. Through a simple experiment, we've clearly demonstrated the tradeoff between false negatives and false positives in selecting a threshold, and noted that the most desirable threshold is an application-dependent parameter. From our results, we've also hypothesized that the minimum-distance classifier might outperform a GMM classifier on acoustically diverse test and training data sets, though this remains to be seen.