Like principal component analysis (PCA), the minimum noise fraction (MNF) transform can be used to determine the inherent dimensionality of the imagery data. This transformation segregates noise in the data and reduces the computational requirements for subsequent processing [5]. Figure 2 shows the 2D scatter plot of the first two MNF components of the original Cuprite data.

On similar grounds, multidimensional data such as the HSI can be modeled by a multidimensional Gaussian mixture (GM) [1]. A GM, written as a PDF for z ∈ R^P, is given by
p(z) = \sum_{i=1}^{L} \alpha_i \, N(z, \mu_i, \Sigma_i)

where

N(z, \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{P/2} |\Sigma_i|^{1/2}} \, e^{-\frac{1}{2}(z-\mu_i)^T \Sigma_i^{-1} (z-\mu_i)}.
Here L is the number of mixture components and P the
number of spectral channels (bands). The GM parameters
are denoted by λ = {αi , ÎŒi , ÎŁi }. These parameters are
estimated using maximum likelihood (ML) by means of the
expectation-maximization (EM) algorithm.
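As an illustrative sketch of the ML/EM estimation of λ, the update equations can be coded directly in NumPy; the synthetic 2-D data, fixed component count, and iteration budget below are our stand-ins for the actual MNF-reduced HSI pixels, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic stand-in for MNF-reduced pixels: two well-separated clusters
z = np.vstack([rng.normal(0.0, 1.0, (200, 2)),
               rng.normal(5.0, 1.0, (200, 2))])
N, P = z.shape
L = 2  # number of mixture components (assumed known here)

# initialize lambda = {alpha_i, mu_i, Sigma_i}
alpha = np.full(L, 1.0 / L)
mu = np.vstack([z[0], z[-1]]).astype(float)   # one seed from each end
Sigma = np.stack([np.cov(z.T)] * L)

def component_pdf(z, mu_i, Sigma_i):
    """N(z, mu_i, Sigma_i) evaluated at every row of z."""
    d = z - mu_i
    inv = np.linalg.inv(Sigma_i)
    norm = (2 * np.pi) ** (P / 2) * np.sqrt(np.linalg.det(Sigma_i))
    return np.exp(-0.5 * np.einsum('np,pq,nq->n', d, inv, d)) / norm

for _ in range(50):  # EM iterations
    # E-step: responsibilities w[n, i]
    dens = np.stack([alpha[i] * component_pdf(z, mu[i], Sigma[i])
                     for i in range(L)], axis=1)
    w = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate alpha, mu, Sigma from the responsibilities
    Nk = w.sum(axis=0)
    alpha = Nk / N
    mu = (w.T @ z) / Nk[:, None]
    for i in range(L):
        d = z - mu[i]
        Sigma[i] = (w[:, i, None] * d).T @ d / Nk[i]
```

After convergence, the component means sit on the two cluster centers and the mixing weights sum to one.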
Figure 2. 2D scatter plot of the data using the first two MNF bands (MNF Band 1 vs. MNF Band 2).
3 Dynamic Component Allocation
In spite of the good mathematical tractability of GMMs, there are challenges in training a GM with a local algorithm like EM. First of all, the true number of mixture components is usually unknown, and not knowing the true number of mixing components is a major learning problem for a mixture classifier using EM [4]. The solution to this problem is a dynamic algorithm for Gaussian mixture density estimation that can effectively add and remove kernel components to adequately characterize the input data. This methodology also increases the chances of escaping one of the many local maxima of the likelihood function. The solution to component initialization is based on a greedy EM approach, which begins the GM training with a single component [6]. Components, or modes, are then added in a sequential manner until the likelihood stops increasing or the incrementally computed mixture is almost as good as any mixture of that form. This incremental mixture density function uses a combination of global and local search each time a new kernel component is added to the mixture. We shall now describe in detail three operations on the GMM: merging, splitting, and pruning.
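The greedy, incremental training loop described above can be sketched as follows for 1-D data; the seeding rule, candidate weight, and stopping tolerance are our illustrative choices, not the exact procedure of [6]:

```python
import numpy as np

def mixture_density(z, alpha, mu, var):
    """1-D GM density at each sample; alpha, mu, var are 1-D arrays."""
    comp = alpha * np.exp(-0.5 * (z[:, None] - mu) ** 2 / var) \
           / np.sqrt(2 * np.pi * var)
    return comp.sum(axis=1)

def em(z, alpha, mu, var, iters=40):
    """Plain EM updates for a 1-D Gaussian mixture (local search)."""
    for _ in range(iters):
        comp = alpha * np.exp(-0.5 * (z[:, None] - mu) ** 2 / var) \
               / np.sqrt(2 * np.pi * var)
        w = comp / comp.sum(axis=1, keepdims=True)
        Nk = w.sum(axis=0)
        alpha = Nk / len(z)
        mu = (w * z[:, None]).sum(axis=0) / Nk
        var = (w * (z[:, None] - mu) ** 2).sum(axis=0) / Nk
    return alpha, mu, var

def greedy_em(z, max_modes=6, tol=1e-3):
    """Start with one mode; add modes until the likelihood stops improving."""
    alpha, mu, var = np.array([1.0]), np.array([z.mean()]), np.array([z.var()])
    best = np.log(mixture_density(z, alpha, mu, var)).sum()
    while len(mu) < max_modes:
        # global search: seed a candidate where the current mixture fits worst
        seed = z[np.argmin(mixture_density(z, alpha, mu, var))]
        a = np.append(alpha * 0.8, 0.2)       # candidate weight is a choice
        m = np.append(mu, seed)
        v = np.append(var, z.var())
        a, m, v = em(z, a, m, v)              # local search from the new start
        ll = np.log(mixture_density(z, a, m, v)).sum()
        if ll - best <= tol:
            break  # no real improvement: stop adding components
        alpha, mu, var, best = a, m, v, ll
    return alpha, mu, var
```

On clearly bimodal data this loop grows the mixture past one component and places modes near the true cluster centers.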
Figure 1. The scene is a 1995 AVIRIS image of the Cuprite field in Nevada with the training regions overlaid.
Figure 1 shows the data sets used in our experiments, which belong to the 1995 Cuprite field scene in Nevada. The training regions in the HSI data are identified heuristically from mineral maps provided by Clark et al. [2]. The remote sensing data sets that we have used in our experiments come from an Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor image. AVIRIS is a unique optical sensor that delivers calibrated images of the upwelling spectral radiance in 224 contiguous spectral channels (bands) with wavelengths from 0.4 to 2.5 μm. AVIRIS is flown across the US, Canada, and Europe.
Since HSI imagery is highly correlated in the spectral direction, the minimum noise fraction (MNF) transform is a natural choice for dimensionality reduction. This transform is also called the noise-adjusted principal component (NAPC) transformation.
3.1 Merging of Modes
Merging is one of the processes in this proposed training scheme, wherein a single mode is created from two nearly identical ones. The closeness between two mixture modes is measured by a metric d. For example, consider two PDFs p_1(x) and p_2(x). Let there be a collection of points near the central peak of p_1(x), represented by x_i ∈ X_1, and another set of points near the central peak of p_2(x), denoted by x_i ∈ X_2. The closeness metric d is then given by

d = \log\left( \frac{\sum_{x_i \in X_1} p_1(x_i)}{\sum_{x_i \in X_1} p_2(x_i)} \cdot \frac{\sum_{x_i \in X_2} p_2(x_i)}{\sum_{x_i \in X_2} p_1(x_i)} \right)
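Assuming the reconstructed form of d above, with the probe sets X_1 and X_2 taken at the mode centers plus/minus one standard deviation along each principal axis, the metric can be computed as in this sketch (all helper names are ours, not part of the original implementation):

```python
import numpy as np

def gauss_pdf(X, mu, Sigma):
    """Gaussian density evaluated at each row of X."""
    d = X - mu
    inv = np.linalg.inv(Sigma)
    norm = (2 * np.pi) ** (len(mu) / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * np.einsum('np,pq,nq->n', d, inv, d)) / norm

def probe_points(mu, Sigma):
    """Mode center plus/minus one std. dev. along each principal axis (SVD)."""
    U, s, _ = np.linalg.svd(Sigma)
    pts = [mu]
    for j in range(len(mu)):
        step = np.sqrt(s[j]) * U[:, j]
        pts.extend([mu + step, mu - step])
    return np.array(pts)

def closeness(mu1, S1, mu2, S2):
    """Metric d: zero when the two modes coincide, positive otherwise."""
    X1 = probe_points(mu1, S1)
    X2 = probe_points(mu2, S2)
    return np.log(gauss_pdf(X1, mu1, S1).sum() / gauss_pdf(X1, mu2, S2).sum()
                  * gauss_pdf(X2, mu2, S2).sum() / gauss_pdf(X2, mu1, S1).sum())
```

Two identical modes give d = 0 exactly; separating the means drives d above zero, which is what the merge threshold tests.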
The weighted kurtosis of each mode directly determines the number of mixture components. This kurtosis measure is given by

K_i = \sum_{n=1}^{N} w_{n,i} \left( \frac{z_n - \mu_i}{\sqrt{\Sigma_i}} \right)^4 - 3 \qquad (4)

where

w_{n,i} = \frac{N(z_n, \mu_i, \Sigma_i)}{\sum_{n=1}^{N} N(z_n, \mu_i, \Sigma_i)}.

Therefore, if |K_i| is too high for any component (mode) i, that mode is split into two. This can be extended to higher dimensions by considering the skew in addition to the kurtosis, with each data sample z_n projected onto the j-th principal axis of \Sigma_i in turn. Let z_{n,i}^j = (z_n - \mu_i)^T V_i^j, where V_i^j is the j-th column of V, obtained from the SVD of \Sigma_i.
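A minimal numeric illustration of the split signal in eq. (4), taking w_{n,i} = 1 for every sample of a single fitted mode (so the measure reduces to plain excess kurtosis); the data below are synthetic and purely illustrative:

```python
import numpy as np

def weighted_kurtosis(z, w, mu, var):
    """Eq. (4): weighted excess kurtosis of one 1-D mode."""
    w = w / w.sum()
    return np.sum(w * ((z - mu) / np.sqrt(var)) ** 4) - 3.0

rng = np.random.default_rng(1)
gauss = rng.normal(0.0, 1.0, 4000)                 # one true mode
bimodal = np.concatenate([rng.normal(-3.0, 1.0, 2000),
                          rng.normal(3.0, 1.0, 2000)])  # two true modes

w = np.ones(4000)  # single-mode case: every responsibility is 1
k1 = weighted_kurtosis(gauss, w, gauss.mean(), gauss.var())
k2 = weighted_kurtosis(bimodal, w, bimodal.mean(), bimodal.var())
```

A well-fit single mode gives |K| near zero, while a single Gaussian forced onto bimodal data gives |K| well above any reasonable split threshold.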
Notice that this metric is zero when p_1(x) = p_2(x) and greater than zero when p_1(x) ≠ p_2(x). A pre-determined threshold is set to determine whether the modes are too close to each other. Since we assume that p_1(x) and p_2(x) are just two Gaussian modes, it is easy to know where some good points for X_1 and X_2 are: we choose the means (centers) and then go one standard deviation in each direction along all the principal axes. The principal axes are found by SVD of R (the Cholesky factor of the covariance matrix).

If the two modes are found to be too close, they are merged, forming a weighted sum of the two modes (weighted by α_1, α_2). The mean for the newly merged mode is
\mu = \frac{\alpha_1 \mu_1 + \alpha_2 \mu_2}{\alpha_1 + \alpha_2} \qquad (5)

Here μ_1 and μ_2 are the means of the components before merging and μ is the resultant mean after merging the two components. The proper way to form a weighted combination of the covariances is not simply a weighted sum of the covariances, which does not take into account the separation of the means; one needs a more careful construction. Consider the Cholesky decomposition of the covariance matrix, \Sigma = R^T R. It is possible to regard the rows of \sqrt{P} R as samples of P-dimensional vectors whose covariance is \Sigma, where P is the dimension: the sample covariance is \frac{1}{P} (\sqrt{P})^2 R^T R = \Sigma. Now, given the two modes to merge, we regard \sqrt{P} R_1 and \sqrt{P} R_2 as two populations to be joined. The sample covariance of the collection of rows is the desired covariance; but this would assign equal weight to the two populations. To weight them with their respective weights, we multiply them by \frac{\alpha_1}{\alpha_1+\alpha_2} and \frac{\alpha_2}{\alpha_1+\alpha_2}. Before they can be joined, however, they must be shifted so that they are re-referenced to the new central mean.
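The construction above amounts to a moment-matching merge; the following sketch computes the merged weight, mean, and covariance directly (a compact equivalent, not necessarily the exact implementation used here):

```python
import numpy as np

def merge_modes(a1, mu1, S1, a2, mu2, S2):
    """Merge two weighted Gaussian modes into one.

    The merged covariance includes the spread of the two means about the
    new center, which a plain weighted sum of S1 and S2 would miss.
    """
    a = a1 + a2
    w1, w2 = a1 / a, a2 / a
    mu = w1 * mu1 + w2 * mu2            # eq. (5)
    d1, d2 = mu1 - mu, mu2 - mu         # re-reference to the new central mean
    S = w1 * (S1 + np.outer(d1, d1)) + w2 * (S2 + np.outer(d2, d2))
    return a, mu, S
```

Merging two identical modes returns them unchanged (up to the summed weight), while merging two separated modes inflates the covariance along the line joining the means.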
For each principal axis j, the per-axis weighted kurtosis and skew are

K_{i,j} = \frac{\sum_{n=1}^{N} w_{n,i} \, (z_{n,i}^j / s_i)^4}{\sum_{n=1}^{N} w_{n,i}} - 3

\tau_{i,j} = \frac{\sum_{n=1}^{N} w_{n,i} \, (z_{n,i}^j / s_i)^3}{\sum_{n=1}^{N} w_{n,i}}

m_{i,j} = |K_{i,j}| + |\tau_{i,j}|

where

s_i^2 = \frac{\sum_{n=1}^{N} w_{n,i} \, (z_{n,i}^j)^2}{\sum_{n=1}^{N} w_{n,i}}.

Now, if m_{i,j} exceeds a pre-set threshold for any j, mode i is split. The split is performed by creating two new modes at \mu = \mu_i + v_{i,j} S_{i,j} and \mu = \mu_i - v_{i,j} S_{i,j}, where S_{i,j} is the j-th singular value of \Sigma_i and v_{i,j} the corresponding singular vector. The same covariance \Sigma_i is used for each new mode. The decision to split also depends on the mixing proportion \alpha_i: splitting does not take place if \alpha_i is too small. An optional threshold parameter allows control over splitting; a higher threshold makes a split less likely.
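A sketch of the split step itself; the even division of the parent weight between the two children is our assumption, since the text does not specify how α_i is divided:

```python
import numpy as np

def split_mode(alpha_i, mu_i, Sigma_i, j):
    """Split mode i along principal axis j: children at mu_i +/- v_{i,j} S_{i,j}."""
    U, s, _ = np.linalg.svd(Sigma_i)
    step = U[:, j] * s[j]          # singular vector times singular value
    half = alpha_i / 2.0           # even weight split is an assumption here
    # both children keep the parent covariance, as stated in the text
    return [(half, mu_i + step, Sigma_i.copy()),
            (half, mu_i - step, Sigma_i.copy())]
```

For a diagonal covariance diag(4, 1), splitting along axis 0 places the two children 2 × 4 apart on that axis while preserving the total weight.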
3.3 Pruning of Modes

When the number of components becomes too high, weak modes are pruned as their mixing weights α_i fall. This procedure ensures removal of weak modes from the overall mixture: a weak mode is identified by comparing α_i against a threshold, and once identified it is removed, after which the remaining α_i are re-normalized so that \sum_i \alpha_i = 1. It is equally important that the algorithm does not annihilate many moderately weak modes all at once; this is achieved by setting up two input threshold values.
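One possible reading of the two-threshold rule, sketched below: modes under a hard floor always die, while among the moderately weak modes only the single weakest is removed per pass, so the mixture is thinned gradually (the threshold values are illustrative):

```python
import numpy as np

def prune_modes(alpha, hard=0.01, soft=0.05):
    """Remove weak modes and renormalize so the weights sum to one.

    Modes below `hard` are always removed; of the modes between `hard`
    and `soft`, only the single weakest is removed per call, so many
    moderately weak modes are not annihilated at once.
    """
    alpha = np.asarray(alpha, dtype=float)
    keep = alpha >= hard
    weak = np.where(keep & (alpha < soft))[0]
    if len(weak) > 0:
        keep[weak[np.argmin(alpha[weak])]] = False
    out = alpha[keep]
    return out / out.sum(), keep
```

Calling the function repeatedly removes moderately weak modes one at a time rather than all in a single sweep.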
3.2 Splitting of Components

On the other hand, if the number of components is too low, then existing components are split in order to increase the total number of components. Vlassis et al. [6] define a method to monitor the weighted kurtosis of each mode, which determines whether that mode should be split.
4 Minimum-Message Length Criterion

Let us now consider the second mixture learning technique, based on the minimum message length (MML) criterion. This method is also known as the Figueiredo-Jain algorithm [13]. Applying the MML criterion to mixture models leads to the following objective function
\Lambda(\lambda, Z) = \frac{V}{2} \sum_{\{c:\,\alpha_c>0\}} \log\left(\frac{N \alpha_c}{12}\right) + \frac{C_{nz}}{2} \log\frac{N}{12} + \frac{C_{nz}(V+1)}{2} - \log L(Z, \lambda)

where N is the number of training points, V is the number of free parameters specifying a component, C_{nz} is the number of components with non-zero weight in the mixture (α_c > 0), λ is the parameter list of the GMM, i.e. {α_1, μ_1, Σ_1, …, α_C, μ_C, Σ_C}, and the last term, log L(Z, λ), is the log-likelihood of the training data given the distribution parameters λ.

The EM algorithm can be used to minimize the above equation with fixed C_{nz}. This leads to an M-step with the component weight update formula

\alpha_c^{i+1} = \frac{\max\left\{0, \left(\sum_{n=1}^{N} w_{n,c}\right) - \frac{V}{2}\right\}}{\sum_{j=1}^{C} \max\left\{0, \left(\sum_{n=1}^{N} w_{n,j}\right) - \frac{V}{2}\right\}}

Figure 3. Intermediate learning step of the GMM before achieving convergence using the MML criterion (MNF Band 1 vs. MNF Band 2).
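The MML objective can be evaluated directly from the mixture weights; this sketch assumes λ has already been fitted and takes the log-likelihood as an input:

```python
import numpy as np

def mml_objective(alpha, log_lik, N, V):
    """MML objective: only components with non-zero weight contribute."""
    alpha = np.asarray(alpha, dtype=float)
    nz = alpha[alpha > 0]
    Cnz = len(nz)
    return (V / 2.0) * np.log(N * nz / 12.0).sum() \
           + (Cnz / 2.0) * np.log(N / 12.0) \
           + Cnz * (V + 1) / 2.0 - log_lik
```

A component whose weight has been driven to zero drops out of both the sum and C_{nz}, so annihilated components no longer pay any description-length cost.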
Finally, once the number of modes settles out, Q stops increasing and convergence is achieved. Hence, the DCA technique for GMM approximation of HSI observations demonstrates that the combination of covariance constraints, mode pruning, merging, and splitting can result in a good PDF approximation of the HSI mixture models.
Figure 4. Classification using the MML learning criterion (MNF Band 1 vs. MNF Band 2; classes shown: Outliers, Class 1, Class 2, Class 3, Miss-classification).
This formula contains an explicit rule for annihilating components by setting their weights to zero. The other distribution parameters are updated as in the previous method. Figure 3 shows an intermediate step of mixture learning using the MML criterion.
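The annihilation rule can be sketched directly from the weight update formula; the responsibilities w and the per-component parameter count V are the inputs:

```python
import numpy as np

def mml_weight_update(w, V):
    """M-step weight update with explicit component annihilation.

    w: (N, C) array of responsibilities; V: free parameters per component.
    Components whose effective sample count falls below V/2 get weight 0.
    """
    raw = np.maximum(0.0, w.sum(axis=0) - V / 2.0)
    return raw / raw.sum()
```

Components that cannot "pay for" their V parameters out of their effective sample count are zeroed in one step, and the surviving weights are renormalized.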
5 Classification Results

Having described the two learning criteria, we now evaluate their performance using a simple Bayesian classifier. Figures 4 and 5 depict the classification of the three HSI mixture classes based on the two learning criteria. Classifications, outliers, and miss-classifications are shown in these figures for the two cases. Table 1 shows the classification performance comparison of the DCA learning method vs. the MML, tabulating the classification and miss-classification rates of the two learning methodologies as a function of the amount of training data utilized.

Figure 5. Classification using the DCA learning criterion (MNF Band 1 vs. MNF Band 2; classes shown: Outliers, Class 1, Class 2, Class 3, Miss-classification).
Table 1. Classification performance comparison between the DCA and MML learning criteria.

Training Data    Classification (%)     Miss-Class (# of pixels)
Utilized (%)     MML        DCA         MML      DCA
55               68.7691    69.8148     1087     1054
60               73.5235    74.6569     957      970
65               72.6667    73.5882     647      606
70               74.54      75.1457     1230     866
75               75.1541    74.3254     857      752
80               75.6765    75.5392     404      468

6 Conclusions

In this paper, we evaluated the performance of two PDF mixture learning criteria for the reduced-dimensionality material classification of hyperspectral remote sensing data. The results show that the two methods have nearly identical classification performance. The outcome of this paper presents a possible integration of advanced data analysis and modeling tools to scientists, advancing the state of the practice in the utilization of satellite image data for various types of Earth System Science studies.

References

[1] D. G. Manolakis and G. Shaw, "Detection algorithms for hyperspectral imaging applications," IEEE Signal Processing Magazine, vol. 19, no. 1, January 2002.

[2] R. N. Clark, A. J. Gallagher, and G. A. Swayze, "Material absorption band depth mapping of imaging spectrometer data using a complete band shape least-squares fit with library reference spectra," Proceedings of the Second Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) Workshop, JPL Publication 90-54, pp. 176-186, 1990.

[3] G. J. McLachlan and K. E. Basford, Mixture Models: Inference and Applications to Clustering. New York: M. Dekker, 1988.

[4] G. J. McLachlan and D. Peel, Finite Mixture Models. New York: Wiley, 2000.

[5] A. A. Green, M. Berman, P. Switzer, and M. D. Craig, "A transformation for ordering multispectral data in terms of image quality with implications for noise removal," IEEE Transactions on Geoscience and Remote Sensing, vol. 26, no. 1, pp. 65-74, 1988.

[6] N. Vlassis and A. Likas, "A kurtosis-based dynamic approach to Gaussian mixture modeling," IEEE Transactions on Systems, Man and Cybernetics, vol. 29, pp. 393-399, 1999.

[7] N. Vlassis and A. Likas, "A greedy EM algorithm for Gaussian mixture learning," Neural Processing Letters, vol. 15, pp. 77-87, 2002.

[8] G. F. Hughes, "On the mean accuracy of statistical pattern recognizers," IEEE Transactions on Information Theory, vol. 14, no. 1, pp. 55-63, 1968.

[9] D. A. Landgrebe, Signal Theory Methods in Multispectral Remote Sensing. Wiley Inter-Science, 2003.

[10] B. S. Everitt and D. J. Hand, Finite Mixture Distributions. London: Chapman and Hall, 1981.

[11] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2001.

[12] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1-38, 1977.

[13] M. A. T. Figueiredo and A. K. Jain, "Unsupervised learning of finite mixture models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 381-396, March 2002.

[14] A. K. Jain and R. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice Hall, 1988.

[15] B. H. Juang, "Maximum likelihood estimation for mixture multivariate stochastic observations of Markov chains," AT&T Technical Journal, vol. 64, no. 6, pp. 1235-1249, 1985.

[16] L. Liporace, "Maximum likelihood estimation for multivariate observations of Markov sources," IEEE Transactions on Information Theory, vol. 28, no. 5, pp. 729-734, 1982.

[17] A. K. Jain, R. Duin, and J. Mao, "Statistical pattern recognition: a review," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4-38, 2000.