Text-Independent Speaker Verification
Cody A. Ray
Drexel University
codyaray@drexel.edu
Abstract—This paper provides an introduction to the task
of speaker recognition, and describes a not-so-novel speaker
recognition system based upon a minimum-distance classification
scheme. We describe both the theory and practical details for a
reference implementation. Furthermore, we discuss an advanced
technique for classification based upon Gaussian Mixture Models
(GMM). Finally, we discuss the results of a set of experiments
performed using our reference implementation.
I. INTRODUCTION
The objective of this project was to develop a basic speaker
recognition system to demonstrate an understanding of the
subjects covered in the course Processing of the Human Voice.
Speaker recognition systems can generally be classified as either identification or verification. In speaker identification, the challenge is to decide which voice model from a known set of voice models best characterizes a speaker. In the different task of speaker verification, the goal is to decide whether a speaker corresponds to a particular known voice or to some other unknown voice. In either case, the problem can be further divided into text-dependent and text-independent subproblems, depending on whether we may rely upon the same utterance being used for both training and testing. This loose classification scheme is shown in Figure 1. For the purpose of this project, we focused on the task of text-independent speaker verification.

Fig. 1. Speaker Recognition System Classification

II. TERMINOLOGY

• A background speaker is an imposter speaker.
• A claimant is a speaker known to the system who is correctly claiming his/her identity.
• A false negative is an error where a claimant is rejected as an imposter.
• A false positive is an error where an imposter is accepted as a claimant.
• Speaker identification decides which voice model from a known set of voice models best characterizes a speaker.
• Speaker verification decides whether a speaker corresponds to a particular known voice or some other unknown voice.
• A target speaker is a known speaker.

III. SYSTEM OVERVIEW

Speaker recognition systems must first build a model of the voice of each target speaker, as well as a model of a collection of background speakers, using speaker-dependent features extracted from the speech waveform. This is referred to as the training stage, and the associated speech data used to build the speaker model is called training data. During the recognition or testing stage, the features measured from the waveform of a test utterance, i.e., the test data of a speaker, are matched (in some sense) against the speaker models obtained during training. An overview of the components of a general speaker recognition system is given in Figure 2.

As with any biometric (pattern) recognition system, the speaker recognition system consists of two core modules: feature extraction and feature matching. In Section IV, we provide an overview of various dimensions used for speaker analysis and describe the features we selected in more depth. Section V continues with the mathematical techniques used in the matching process.

Fig. 2. General Speaker Recognition System

IV. FEATURE EXTRACTION

Although it is possible to extract a number of features from sampled audio, including both spectral and non-spectral features, for the purpose of this project we characterize a speaker's voice attributes exclusively through spectral features. In selecting acoustic spectral features, we want our feature set to reflect the unique characteristics of a speaker. For this purpose, we use the magnitude component of the short-time Fourier transform (STFT) as a basis.
As the phase is difficult to measure and susceptible to channel distortion, it is discarded. We first compute the discrete STFT and then weight it by a series of filter frequency responses that roughly match those of the auditory critical band filters. To approximate the auditory critical bands, we use triangular mel-scale filters, as shown in Figure 3.

$$X(n, \omega_k) = \sum_{m=-\infty}^{\infty} x[m]\, w[n-m]\, e^{-j \omega_k m}$$

Next, we compute the energy in the mel-scale weighted STFT and normalize the energy in each frame so as to give equal energy for a flat input spectrum:

$$E_{\mathrm{mel}}(n, l) = \frac{1}{A_l} \sum_{k=L_l}^{U_l} \left| V_l(\omega_k)\, X(n, \omega_k) \right|^2,$$

where

$$A_l = \sum_{k=L_l}^{U_l} \left| V_l(\omega_k) \right|^2$$

Finally, we compute the mel-cepstrum for each frame, using the even property of the real cepstrum to rewrite the inverse transform as the discrete cosine transform:

$$C_{\mathrm{mel}}[n, m] = \frac{1}{R} \sum_{l=0}^{R-1} \log\{ E_{\mathrm{mel}}(n, l) \} \cos\left( \frac{2\pi}{R}\, l m \right)$$

We then extract the mel-frequency cepstrum coefficients (MFCC) for use as our feature vector. We repeat this feature extraction process for both the training and testing data to produce $C_{\mathrm{mel}}^{tr}$ and $C_{\mathrm{mel}}^{ts}$, respectively.

Fig. 3. Idealized Triangular Mel-Scale Filter Bank
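To make the pipeline concrete, the following is a minimal MATLAB sketch of this computation, not the reference implementation itself: the filterbank matrix V, the frame length N, and the hop size are assumed inputs, and the cosine basis implements the transform exactly as written above.

% Minimal sketch of the mel-cepstrum computation described above.
% x is the speech waveform (column vector); V is an assumed R x (N/2+1)
% matrix of triangular mel filter gains |V_l(w_k)|; N is the frame
% length and hop is the frame increment.
function C = melcepstrum_sketch(x, V, N, hop)
  R = size(V, 1);
  w = 0.54 - 0.46*cos(2*pi*(0:N-1)'/(N-1));  % Hamming window w[n]
  A = sum(V.^2, 2);                          % A_l = sum_k |V_l(w_k)|^2
  B = cos((2*pi/R) * (0:R-1)' * (0:R-1));    % cosine basis of the inverse transform
  nFrames = floor((length(x) - N)/hop) + 1;
  C = zeros(nFrames, R);
  for m = 1:nFrames
    frame = x((m-1)*hop + (1:N)) .* w;       % windowed frame
    P = abs(fft(frame)).^2;                  % |X(n, w_k)|^2; the phase is discarded
    Emel = (V.^2 * P(1:N/2+1)) ./ A;         % normalized mel energies E_mel(n, l)
    C(m, :) = (B * log(Emel))' / R;          % mel-cepstrum C_mel[n, m] for this frame
  end
end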
V. FEATURE MATCHING

A survey of the literature revealed numerous approaches based upon minimum-distance classification, dynamic time-warping, vector quantization, hidden Markov models, Gaussian mixture models, and artificial neural networks. For this project, we chose to implement a minimum-distance classifier and, as time permitted, to improve upon this with Gaussian mixture model based matching.

A. Minimum-Distance Classification

The concept of minimum-distance classification is simple: we calculate a feature vector for each new test case and measure how far it is from the stored training data using some distance metric. We then select a threshold distance to determine the point at which we consider the speaker verification to have been successful or, equivalently, the speaker to have been recognized. As we will see later, this threshold determines the tradeoff between the number of false negatives and false positives.

Specifically, we compute a feature vector based upon the averages of the MFCCs for the test and training data sets:

$$\bar{C}_{\mathrm{mel}}^{tr}[n] = \frac{1}{M} \sum_{m=1}^{M} C_{\mathrm{mel}}^{tr}[mL, n]$$

$$\bar{C}_{\mathrm{mel}}^{ts}[n] = \frac{1}{M} \sum_{m=1}^{M} C_{\mathrm{mel}}^{ts}[mL, n]$$

As a distance measure, we then use the mean-squared difference between the average testing and training feature vectors:

$$D = \frac{1}{R-1} \sum_{n=1}^{R-1} \left( \bar{C}_{\mathrm{mel}}^{ts}[n] - \bar{C}_{\mathrm{mel}}^{tr}[n] \right)^2$$

If this distance is less than the given threshold, the speaker has been verified.
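In MATLAB, this decision rule reduces to a few lines. A minimal sketch, assuming Ctr and Cts hold the per-frame MFCC matrices (one row per frame) and theta is the chosen threshold:

% Minimal sketch of the minimum-distance decision rule.
Cbar_tr = mean(Ctr, 1);            % average training feature vector
Cbar_ts = mean(Cts, 1);            % average testing feature vector
D = mean((Cbar_ts - Cbar_tr).^2);  % mean-squared difference between the averages
verified = (D < theta);            % accept the claim iff D is below the threshold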
B. Gaussian Mixture Models

Recognizing that speech production is inherently non-deterministic (due to subtle variations in vocal tract shape and glottal flow), we represent a speaker probabilistically through a multivariate Gaussian probability density function (pdf). This is a multi-dimensional structure in which we can think of each statistical variable as a state corresponding to a single acoustic sound class, whether at a broad level, such as quasi-periodic, noise-like, and impulse-like, or at a very fine level, such as individual phonemes. The Gaussian pdf of a feature vector x for the ith state is written as

$$b_i(x) = \frac{1}{(2\pi)^{R/2}\, |\Sigma_i|^{1/2}}\, e^{-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)}$$

where $\mu_i$ is the state mean vector, $\Sigma_i$ is the state covariance matrix, and $R$ is the dimension of the feature vector.¹

¹ Using MFCC-based feature vectors, this corresponds to the number of MFCC coefficients.
TABLE I
DISTANCES BETWEEN TRAINING AND TESTING DATA
Test1 Test2 Test3 Test4 Test5 Test6 Test7 Test8
Train1 0.1192 0.1945 0.2151 0.2184 0.5364 0.3823 0.4963 0.4538
Train2 0.0724 0.0378 0.0406 0.0783 0.4035 0.3177 0.4125 0.3986
Train3 0.1672 0.1311 0.1042 0.0969 0.3382 0.2121 0.3597 0.2847
Train4 0.1482 0.1412 0.1363 0.0817 0.3268 0.3211 0.3154 0.3282
Train5 0.1882 0.1928 0.2237 0.1466 0.1044 0.0709 0.1382 0.1299
Train6 0.3012 0.3521 0.3208 0.3112 0.3023 0.0958 0.2755 0.2094
Train7 0.2743 0.2973 0.3252 0.2517 0.1618 0.1318 0.0724 0.1427
Train8 0.3589 0.3600 0.3381 0.2186 0.1976 0.1133 0.2487 0.0585
The speaker model λ then represents the set of GMM mean, covariance, and weight parameters:

$$\lambda = \{\mu_i, \Sigma_i, p_i\}$$

The probability of a feature vector being in any one of $I$ states for a particular speaker model λ is represented by the union, or mixture, of different Gaussian pdfs:

$$p(x|\lambda) = \sum_{i=1}^{I} p_i\, b_i(x)$$

where $p_i$ are the component mixture weights and $b_i(x)$ are the mixture densities.
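As an illustration, p(x|λ) can be evaluated directly from these definitions. The MATLAB sketch below assumes the model is stored as a mean matrix, a covariance array, and a weight vector; it is not the API of the GMM library described in Section VI.

% Minimal sketch: evaluate the mixture likelihood p(x|lambda) for a
% single feature vector x. mu is R x I (state means), Sigma is
% R x R x I (state covariances), and p is 1 x I (mixture weights).
function px = gmm_likelihood(x, mu, Sigma, p)
  R = numel(x);
  px = 0;
  for i = 1:numel(p)
    d = x(:) - mu(:, i);
    bi = exp(-0.5 * d' * (Sigma(:,:,i) \ d)) ...
         / sqrt((2*pi)^R * det(Sigma(:,:,i)));  % state pdf b_i(x)
    px = px + p(i) * bi;                        % weighted sum over the I states
  end
end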
In speaker verification, we must decide whether a test utterance belongs to the target speaker. Formally, we are making a binary decision (yes/no) on two hypotheses: whether the test utterance belongs to the target speaker, hypothesis $H_1$, or whether it comes from an imposter, hypothesis $H_2$.

Suppose that we have already computed the GMM of the target speaker and the GMM for a collection of background speakers²; we then determine the likelihood ratio test to decide between $H_1$ and $H_2$. This test is the ratio between the probability that the collection of feature vectors $X = \{x_0, x_1, \ldots, x_{M-1}\}$ is from the claimed speaker, $P(\lambda_C|X)$, and the probability that the collection of feature vectors $X$ is not from the claimed speaker, $P(\bar{\lambda}_C|X)$, i.e., from the background. Using Bayes' rule, we can write this ratio as

$$\frac{P(\lambda_C|X)}{P(\bar{\lambda}_C|X)} = \frac{p(X|\lambda_C)\, P(\lambda_C) / P(X)}{p(X|\bar{\lambda}_C)\, P(\bar{\lambda}_C) / P(X)}$$

where $P(X)$ denotes the probability of the vector stream $X$. Discarding the constant probability terms and applying the logarithm, we have the log-likelihood ratio

$$\Lambda(X) = \log[p(X|\lambda_C)] - \log[p(X|\bar{\lambda}_C)]$$

that we compare with a threshold to accept or reject whether the utterance belongs to the claimed speaker. If $\Lambda(X) \geq \theta$, then the speaker has been verified.

² One common model for generating a background pdf $p(X|\bar{\lambda}_C)$ is through models of a variety of background (imposter) speakers. In our development, we used speakers from the TIMIT database for this purpose.
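Operationally, the test might be applied as in the sketch below, which assumes the feature vectors are treated as independent (so that log p(X|λ) is a sum of per-frame log-likelihoods) and reuses the gmm_likelihood helper sketched earlier; lamC and lamB are assumed structs holding the claimant and background model parameters.

% Minimal sketch of the log-likelihood ratio test over a collection X
% (M x R, one feature vector per row) of test feature vectors.
logpC = 0;
logpB = 0;
for m = 1:size(X, 1)
  logpC = logpC + log(gmm_likelihood(X(m,:)', lamC.mu, lamC.Sigma, lamC.p));
  logpB = logpB + log(gmm_likelihood(X(m,:)', lamB.mu, lamB.Sigma, lamB.p));
end
Lambda = logpC - logpB;        % log-likelihood ratio Lambda(X)
verified = (Lambda >= theta);  % accept iff the ratio meets the threshold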
VI. IMPLEMENTATION

The reference speaker recognition system was implemented in MATLAB using training data and test data stored in WAV files. There are tools included in MATLAB, as well as publicly available libraries, to aid in creating this system. For reading in the data sets, we used MATLAB's wavread function. For feature extraction, we used the melcepst function from Voicebox, a MATLAB toolbox. We used twelve MFCC coefficients (skipping the 0th-order coefficient) with 256-sample Hamming-windowed frames and a 128-sample increment. We used custom matching and testing routines based upon minimum-distance classification as described above. For the Gaussian Mixture Models, we used T. N. Vikram's GMM library, based upon the text Algorithm Collections for Digital Signal Processing Applications Using Matlab by E. S. Gopi.
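A sketch of the resulting front end with the parameters stated above; the melcepst argument order (signal, sample rate, window mode, coefficient count, filter count, frame length, frame increment) follows the Voicebox documentation, 'M' selects a Hamming window, the filter count uses Voicebox's documented default, and the file name is a placeholder:

% Sketch of the feature-extraction front end (MATLAB + Voicebox).
[x, fs] = wavread('speaker1_train.wav');      % placeholder file name
nc = 12;                                      % twelve MFCCs; the 0th coefficient is skipped by default
nf = floor(3*log(fs));                        % Voicebox's default mel filter count
C  = melcepst(x, fs, 'M', nc, nf, 256, 128);  % 256-sample Hamming frames, 128-sample increment
Cbar = mean(C, 1);                            % averaged feature vector for matching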
text-independent manner.
Λ(X) = log[p(X|λC )] − log[p(X|λC )] For these data, we empirically selected an initial threshold
value of 0.12 to avoid false any negatives. With this threshold,
that we compare with a threshold to accept or reject whether
however, the system resulted in six false positives, for an
the utterance belongs to the claimed speaker. If Λ(X) ≥ θ,
accuracy of 91%. Decrementing the threshold to 0.11 results in
then the speaker has been verified.
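Table I is simply this distance computation applied over all train/test pairs. A sketch, with CbarTr and CbarTs as assumed cell arrays holding each speaker's averaged training and testing feature vectors:

% Minimal sketch: build the 8 x 8 distance matrix of Table I.
S = 8;
D = zeros(S, S);
for i = 1:S
  for j = 1:S
    % rows index training speakers, columns index testing speakers
    D(i, j) = mean((CbarTs{j} - CbarTr{i}).^2);
  end
end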
For these data, we empirically selected an initial threshold value of 0.12 to avoid any false negatives. With this threshold, however, the system produced six false positives, for an accuracy of 91%.
Decrementing the threshold to 0.11 results in one false negative and five false positives, while maintaining an accuracy of 91%. For these data, further variation of the threshold cannot increase the accuracy; however, it clearly provides a tradeoff between false negatives and false positives. Due to this tradeoff, the most desirable threshold is application dependent and must be determined on a case-by-case basis.
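This tradeoff can be inspected directly from the distance matrix. A minimal sketch that counts both error types for a candidate threshold theta (with D as sketched above; at theta = 0.12 it reproduces the zero false negatives and six false positives reported here):

% Minimal sketch: error counts for a candidate threshold over Table I.
accept   = (D < theta);                          % verification decisions
falseNeg = sum(diag(accept) == 0);               % claimants rejected (diagonal)
falsePos = sum(accept(:)) - sum(diag(accept));   % imposters accepted (off-diagonal)
accuracy = 1 - (falseNeg + falsePos) / numel(D); % fraction of correct decisions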
It is informative to note that this is a challenging experiment for a text-independent speaker recognition system, given the small amount of data used for training as well as the presence of multiple differing acoustic sound classes in the sentences; e.g., the phoneme "sh" appears twice in the testing sentence but is absent from the training sentence. Note that even with these difficulties, the reference implementation performs decently; we expect this is due to the effect of averaging the mel-cepstral features. Furthermore, we expect that, contrary to conventional wisdom, the minimum-distance classifier might perform better than a GMM when confronted with differing acoustic classes in the testing and training data sets during text-independent speaker recognition.

As a final comment, note that from these results it is easy to see why an MFCC-based minimum-distance classifier system should never be used for text-independent authentication systems. There is no way to eliminate false positives while maintaining a high degree of accuracy!

VIII. FUTURE WORK

Although it wasn't in the stated proposal for this project, we found the task of comparing the minimum-distance classifier with one based upon a Gaussian Mixture Model intriguing. We found that the minimum-distance classifier was easily implemented, and we used the remaining time to explore the use of GMMs for speaker recognition. However, we did not find sufficient remaining time to complete the implementation of the GMM-based speaker recognition system and repeat the experiment. We've integrated an existing GMM library into the project framework to compute the means, covariances, and weights of each state of the target speaker model; however, it remains to implement the log-likelihood ratio classifier. Given the strong parallels between the remaining tasks and those already completed, i.e., threshold-based classification and multivariate Gaussian pdf computation, it should be simple to compute a background model and repeat the above experiment with the new GMM-based reference implementation. During that analysis, it would be insightful to compare the minimum-distance and GMM classification schemes, particularly focusing on performance with regard to differing acoustic makeup of the test and training data for text-independent speaker recognition.

IX. CONCLUSION

We have shown that minimum-distance classification for text-independent speaker recognition performs moderately well, though there is obvious room for improvement. Through a simple experiment, we've clearly demonstrated the tradeoff between false negatives and false positives in selecting a threshold, and noted that the most desirable threshold is an application-dependent parameter. From our results, we've also hypothesized that the minimum-distance classifier might outperform a GMM classifier on acoustically diverse test and training data sets, though this remains to be seen.