Robust Fuzzy n-Means Clustering
A Research Paper
Presented to
the Faculty of the Division of Mathematical Sciences
Midwestern State University
In Partial Fulfillment
of the Requirements for the Degree
Master of Science
by
Thomas G. Aranda
October 2000
Abstract
Clustering is a data segmentation method with a wide range of applications, including
pattern recognition, document classification and data mining. This paper focuses on the problem
of unsupervised clustering when the optimal number of clusters is not known, and presents an
algorithm that can determine the ideal number of clusters while remaining robust to the
influence of outliers. A modification of the Robust Fuzzy c-Means Clustering Algorithm
(RFCM) was developed. This modification retains the robustness (ability to ignore outliers) of
RFCM, yet it does not increase the complexity of the algorithm. The resulting Robust Fuzzy
n-Means Clustering Algorithm (RFNM) produces a good partition without a priori knowledge of
the optimal number of clusters.
Introduction
This research is motivated by the requirement to segment large images in real time
without prior knowledge of the image’s structure. The ultimate goal is to identify and classify
sections of data into categories. For example, given an aerial photograph, the computer should
be able to distinguish between grass, concrete, water and asphalt. One technique for segmenting
data in this way is called clustering.
Many methods for clustering are in use, including Validity-Guided Clustering [1] and
c-Means Clustering [4]. However, these algorithms have two drawbacks. First, they are very
susceptible to the presence of outliers in some data sets; consequently, they do not identify the
clusters properly. Some algorithms solve this problem by using robust centering statistics.
Second, these algorithms require the user to input the desired number of clusters. Often, the
correct number of clusters is not known prior to execution. Therefore, it would be beneficial
to develop an algorithm that does not require such knowledge. This paper presents such an
algorithm.
Data Segmentation via Clustering
Background on Clustering
The classification of objects into categories is the subject of cluster analysis. It plays a
large role in pattern recognition. However, it has many other applications such as the
classification of documents in a database, the development of social demographics, data mining
and the construction of taxonomies in biology.
Ultimately, clustering attempts to identify groups of similar data. Given a set of data X,
the problem of clustering is to find several cluster centers that properly characterize relevant
classes of X [10]. For example, a good clustering of an image by color would identify the
various shades of red as one cluster, the blues as another cluster, etc. A clustering of a 3D set of
points over a Euclidean space would find groups (clusters) of points that are close together.
After the cluster centers are identified, the data set X is partitioned by labeling each data element
with the exemplar (cluster center) closest to it.
In 1967 Ball and Hall introduced the ISODATA process [2]. This technique, which is
also called Hard c-Means Clustering (HCM), is one of the most popular clustering methods [4].
However, the user is required to input the desired number of clusters. It uses an alternating
optimization (AO) technique to minimize an objective function. The definition of the objective
function and the AO technique can be found in [7]. One problem with HCM is that it tends to
get caught in local minima [7]. In other words, it does not find the global minimum of the
objective function and therefore does not properly identify the cluster centers.
Zadeh introduced fuzzy set theory in 1965 as a way to represent the vagueness of
everyday life [3]. In a nutshell, fuzzy set theory allows data elements to belong to a set in
varying degrees. Each element has a membership value $u \in [0, 1]$ that represents the degree to
which the data element belongs to that set. In other words, data elements can have a partial
membership in a set. This fuzziness allows one to mathematically represent vague concepts such
as “pretty soon” or “very far.”
Dunn applied fuzzy set theory to the ISODATA clustering process in 1973 [7]. His
method, called Fuzzy c-Means Clustering (FCM), allows data elements to belong to several
clusters in varying degrees. For example, a data element can have a 30% membership in one
cluster and a 70% membership in a second cluster, instead of discretely belonging to one cluster
or the other. Consider the clustering by color example: a dark violet could partially belong to the
red cluster and partially belong to the blue cluster.
Fuzzy c-Means Clustering (FCM) uses an alternating optimization (AO) technique that is
very similar to HCM. After the algorithm finishes execution and the cluster centers are
identified, the clusters are “defuzzified” by discretely assigning each data element to the cluster
in which it has the highest membership. If a light orange color had a 45% membership in red, a
52% membership in yellow and a 3% membership in blue, then the color would be assigned to
the yellow cluster. Experiments have shown that the fuzzy clustering method is less likely to be
trapped in a local minimum [7] and, therefore, avoids one disadvantage of HCM.
FCM typically produces better results than HCM, but it is susceptible to the influence of
outliers: extraneous data elements that are very far away from the cluster centers. Outliers may
be the result of errors in the data, or they could be real information, such as a highly reflective
piece of aluminum foil appearing in a radar image of a grass field. Regardless of what the
outliers are, their presence often disrupts the clustering process.
Kersten’s Fuzzy c-Medians Clustering Algorithm (FCMED), which uses the fuzzy
median as its centering statistic, is more robust than FCM [8]. In other words, it is more resistant
to the influence of outliers. However, its time complexity of $O(cpN \lg N)$ and space complexity
of $O(N)$ make it very slow [8]. Conversely, Choi and Krishnapuram's Robust Fuzzy c-Means
Algorithm (RFCM) solves the outlier problem in linear time [6]. Kersten's implementation of
RFCM uses Huber’s weighting functions to reduce the influence of outliers [9]. Experiments
have shown RFCM to be very robust.
One disadvantage of RFCM is that it requires the user to input the correct number of
clusters. Often, the user does not know enough about the structure of the data to provide
such information. This is especially true in data mining applications. The research described in
this paper developed a new algorithm, Robust Fuzzy n-Means (RFNM), which is robust to
outliers and capable of determining the proper number of clusters. This algorithm is a
modification of FCM and RFCM. In order to provide the reader with a complete understanding
of the new RFNM algorithm, this paper will describe its parent algorithms in detail.
Fuzzy c-Means Clustering
Fuzzy c-means clustering (FCM) is defined well by [4]. Consider N data samples
forming the data set denoted by $X = \{x_1, x_2, \ldots, x_N\}$. Assume there are c clusters and
$u_{ik} = u_i(x_k) \in [0, 1]$ is the membership of the k-th sample $x_k$ in the i-th cluster $v_i$, where
$v = \{v_1, v_2, \ldots, v_c\}$ is the set of exemplars (cluster centers) and U is the membership matrix.
Normally, a cluster center refers to an actual pattern in the data and an exemplar refers to a
pattern identified by the algorithm. However, these terms will be used interchangeably in this
paper. The membership values of each data element $x_k$ satisfy the requirement that

$$\sum_{i=1}^{c} u_{ik} = 1 \qquad (1)$$

for all $k \in \aleph_N$. In other words, all of a particular data element's membership values must add up
to one. In addition, each cluster must contain some, but not all, of the data points' membership.
Defined mathematically, this means that for every $i \in \aleph_c$

$$0 < \sum_{k=1}^{N} u_{ik} < N. \qquad (2)$$
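Constraints (1) and (2) can be checked mechanically. The following is a minimal NumPy sketch; the function name and the representation of U as a c-by-N array are illustrative assumptions, not part of the paper:

```python
import numpy as np

def is_valid_partition(U):
    """Check constraints (1) and (2) on a c-by-N membership matrix U."""
    N = U.shape[1]
    # Equation (1): each column (data element) must sum to one.
    cols_ok = np.allclose(U.sum(axis=0), 1.0)
    # Equation (2): each row (cluster) holds some, but not all, membership.
    rows = U.sum(axis=1)
    rows_ok = bool(np.all((rows > 0) & (rows < N)))
    return cols_ok and rows_ok
```

For example, a matrix with an empty cluster (a row of zeros) fails the test, as does one whose columns do not sum to one.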
The goal of the FCM algorithm is to minimize the objective function

$$J(U, v) = \sum_{k=1}^{N} \sum_{i=1}^{c} u_{ik}^{m_c}\, d_{ik}^2, \qquad (3)$$

where $d_{ik} = \|v_i - x_k\|_2$ (the Euclidean distance between the exemplar and the data element). The
power $m_c$ of the membership function is called the weighting exponent. It expresses the
“fuzziness” of the algorithm. Setting $m_c = 1$ and only allowing discrete membership values will
convert the fuzzy algorithm into traditional HCM [9].
The objective function (3) is the weighted square error of the exemplars. The closer data
elements are to their respective cluster centers, the lower the value of the function will be.
Furthermore, the number of exemplars c will have an effect on the value of $J(U, v)$. Increasing
the number of exemplars will lower the value of the objective function. In an extreme case, when
the number of clusters equals the number of data elements $(c = N)$, the objective function will go
to zero. Although using a large number of clusters will reduce the value of $J(U, v)$, it is more
important to choose a value of c that represents the actual number of clusters in the data.
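The objective function (3) is straightforward to evaluate directly. The following is a minimal NumPy sketch, assuming X is an N-by-p data matrix, V a c-by-p exemplar matrix and U a c-by-N membership matrix; the names are illustrative only:

```python
import numpy as np

def fcm_objective(X, V, U, m_c=2.0):
    """Weighted square error J(U, v) of equation (3).

    X : (N, p) data matrix, V : (c, p) exemplars,
    U : (c, N) membership matrix, m_c : weighting exponent.
    """
    # d2[i, k] = squared Euclidean distance between exemplar i and sample k
    d2 = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return float((U ** m_c * d2).sum())
```

As the text notes, placing one exemplar on every data point ($c = N$) drives this value to zero, which is why a small objective value alone does not indicate a good partition.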
Fuzzy c-Means Clustering is more effective than Hard c-Means because the objective
function is less likely to get caught in a local minimum [7]. Furthermore, it runs in $O(cN)$ time
and $O(c)$ space. However, it is susceptible to outliers [9]. The robust algorithm presented in the
next section addresses this problem.
Robust Fuzzy c-Means Clustering
Real world data sets often contain outliers. These extraneous data elements are usually
very far away from the larger cluster centers. Consider a data set with two large well-defined
clusters and one small outlying cluster that is very far away from the other two. Due to the
$d_{ik}^2$ term in $J(U, v)$ (3), the distance of a data point from its exemplar will have a quadratic effect on
the value of the objective function. Since FCM attempts to minimize $J(U, v)$, it will attempt to
reduce the impact of the outliers' large $d_{ik}^2$ values by placing an exemplar over the outlying
cluster. This minimizes the objective function, but does not correctly identify the larger cluster
centers.
Kersten’s implementation of Choi and Krishnapuram’s Robust Fuzzy c-Means Clustering
Algorithm (RFCM) takes steps to solve this problem [9]. Huber’s m-estimator is used to reduce
the influence of outliers. Huber’s function ρ is defined as:
( )
>−
≤
=
1,
1,
2
1
2
2
1
xifx
xifx
xρ . (4)
The $d_{ik}^2$ term is replaced with $\rho(d_{ik}/\gamma)$ where γ is a scaling constant. As a result, the influence
of the distance between cluster centers and data elements is quadratic when the data element is
close to the exemplar and linear when the data element is far away from the exemplar. The
objective function to be minimized becomes:

$$J(U, v) = \sum_{k=1}^{N} \sum_{i=1}^{c} u_{ik}^{m_c}\, \rho(d_{ik}/\gamma). \qquad (5)$$
The membership values of each element are given by:

$$u_{ik} = \left[ \sum_{j=1}^{c} \left( \frac{\rho(d_{ik}/\gamma)}{\rho(d_{jk}/\gamma)} \right)^{\frac{1}{m_c - 1}} \right]^{-1}. \qquad (6)$$

Using this function, the membership of a data element $x_k$ in cluster $v_i$ is assigned in inverse
proportion to the distance between $x_k$ and $v_i$. In other words, the data element will have a larger
membership in clusters that are closer to it.
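Equation (6) vectorizes naturally. The following NumPy sketch computes the full membership matrix; the function names are illustrative, and a small epsilon is added to guard against a zero distance (the paper's algorithm handles that case separately in Step 3):

```python
import numpy as np

def huber_rho(x):
    # Huber's rho of equation (4): quadratic near zero, linear in the tails.
    ax = np.abs(x)
    return np.where(ax <= 1.0, 0.5 * x ** 2, ax - 0.5)

def memberships(X, V, gamma, m_c=1.75, eps=1e-12):
    """Membership matrix U of equation (6); row i holds cluster i."""
    # d[i, k] = Euclidean distance between exemplar i and sample k
    d = np.sqrt(((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    r = huber_rho(d / gamma) + eps
    expo = 1.0 / (m_c - 1.0)
    # u_ik = 1 / sum_j (rho_ik / rho_jk) ** (1 / (m_c - 1))
    return 1.0 / ((r[:, None, :] / r[None, :, :]) ** expo).sum(axis=1)
```

By construction each column of the result sums to one, satisfying constraint (1).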
The center of a cluster is computed by determining the average value of all the points in
that cluster. Since a point's membership in a cluster is fuzzy, the mean must be adjusted by the
membership values $u_{ik}$. Therefore, the locations of the exemplars are computed by using the
weighted mean given by:

$$v_i = \frac{\sum_{k=1}^{N} u_{ik}^{m_c}\, w(d_{ik}/\gamma)\, x_k}{\sum_{k=1}^{N} u_{ik}^{m_c}\, w(d_{ik}/\gamma)} \qquad (7)$$

where Huber's weighting function $w(x) = \rho'(x)/x$. In this case

$$w(x) = \begin{cases} 1, & \text{if } |x| \le 1 \\ 1/|x|, & \text{if } |x| > 1 \end{cases} \qquad (8)$$

Huber's w function has the effect of reducing the influence of data points that are far away from
the cluster centers, thereby making the algorithm robust to outliers.
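The weighted-mean update (7) with the weighting function (8) can be sketched as follows; names and array shapes are illustrative assumptions:

```python
import numpy as np

def huber_w(x):
    # Huber's weighting function of equation (8):
    # unit weight for |x| <= 1, downweighted by 1/|x| beyond that.
    return 1.0 / np.maximum(np.abs(x), 1.0)

def update_exemplars(X, U, V, gamma, m_c=1.75):
    """Exemplar update of equation (7) for all clusters at once.

    X : (N, p) data, U : (c, N) memberships, V : (c, p) current exemplars.
    """
    d = np.sqrt(((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    w = U ** m_c * huber_w(d / gamma)   # combined weight per (cluster, sample)
    return w @ X / w.sum(axis=1, keepdims=True)
```

When every sample lies within γ of its exemplar, the Huber weights are all one and the update reduces to the ordinary fuzzy weighted mean.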
In order for the ρ and w functions to work properly, all distances must be adjusted by a
scaling constant γ [9]. The experiments in this paper use the median absolute deviation about
the median (MAD) [11] to compute γ. The MAD is a robust estimator similar to the standard
deviation. All distances are divided by three times the MAD before Huber's functions are
applied, i.e. $\gamma = 3 \cdot \mathrm{MAD}$. As a result, when ρ is applied, data points have a quadratic influence
when they are $3 \cdot \mathrm{MAD}$ or less from the exemplar and a linear influence when they are greater than
$3 \cdot \mathrm{MAD}$ away. One should note that computing the MAD takes $O(N \lg N)$ time (on average)
and $O(N)$ space. The normalization of the data using an estimator like the MAD is crucial to
making the algorithm run properly.
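The scaling constant is easy to compute. In the sketch below the MAD is taken over the exemplar-to-sample distances; the text does not spell out the exact operand, so that choice, like the function names, is an assumption:

```python
import numpy as np

def mad(a):
    """Median absolute deviation about the median."""
    a = np.asarray(a, dtype=float)
    return np.median(np.abs(a - np.median(a)))

def scaling_constant(distances):
    # gamma = 3 * MAD, as used in the paper's experiments.
    return 3.0 * mad(distances)
```

Distances are then divided by this γ before ρ or w is applied, so points within $3 \cdot \mathrm{MAD}$ of an exemplar fall in the quadratic region of Huber's function.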
Except for the calculation of the scaling constant and the application of Huber’s
functions, RFCM is identical to FCM. However, RFCM is not as susceptible to the influence of
outliers [9].
Determining the Number of Clusters
Robust Fuzzy n-Means Clustering
One problem with RFCM is that the user must input the desired number of clusters.
Quite often the optimal number of clusters is not known prior to execution. The Robust Fuzzy
n-Means Algorithm (RFNM) presented in this paper retains the robustness of RFCM, yet does not
require a priori knowledge of the proper number of clusters.
RFNM requires the user to provide a maximum number $c_m$ of clusters. The algorithm
begins by executing the RFCM algorithm with $c_m$ clusters. During every iteration, cluster
centers that are close together are considered for merging. Several methods for merging have
been explored, including Validity-Guided Clustering described in [1] and Competitive Clustering
described in [5]. However, the merging criteria should be robust and efficient.
Merging Criterion
If two clusters are “close” together they should be merged. Two clusters are close if the
distance between their centers is small compared to their compactness. The notion of
compactness [12] is the weighted mean square deviation of the cluster. It can be thought of as
the average “radius” squared. The compactness of a cluster is defined in terms of its variation
and cardinality.
The variation of a cluster is a measure of the cluster’s dispersion. One can think of it as
the fuzzy variance. Formally, the variation is defined by [12]:

$$\sigma_i = \sum_{k=1}^{N} u_{ik}^{m_c}\, d_{ik}^2. \qquad (9)$$
The fuzzy cardinality of a cluster is a measure of the cluster’s size. The more data
elements that belong to the cluster the larger the cluster’s cardinality will be. Often, the fuzzy
cardinality is used as a divisor when calculating the fuzzy mean. Formally the fuzzy cardinality
is defined by [12]:
$$n_i = \sum_{k=1}^{N} u_{ik}. \qquad (10)$$
The compactness of a cluster is the ratio of its variation and cardinality [12]:

$$\pi_i = \frac{\sigma_i}{n_i}. \qquad (11)$$
To make the compactness formula robust to outliers, Huber's ρ function (4) is inserted into the
equation. Finally, the cardinality of the cluster must take the weighting exponent $m_c$ into
account. Therefore, the robust compactness of a cluster $v_i$ is defined as

$$\pi_i = \frac{\sum_{k=1}^{N} u_{ik}^{m_c}\, \rho(d_{ik}/\gamma)}{\sum_{k=1}^{N} u_{ik}^{m_c}}. \qquad (12)$$
RFNM uses a modified version of separation [12] to measure how far apart clusters are.
Formally, the separation between two clusters $v_q$ and $v_r$ is defined as the Euclidean distance
between the clusters' centers:

$$s_{qr} = \|v_q - v_r\|_2. \qquad (13)$$
The merging criterion uses a merge ratio, which is similar to the validity index defined in
[12]. The merge ratio will be small when exemplars are close together relative to their
compactness. Formally, it is the ratio of the separation squared over the compactness:

$$\omega_{qr} = \frac{s_{qr}^2}{\pi_q}. \qquad (14)$$

Once again, to make the formula robust, Huber's function is substituted:

$$\omega_{qr} = \frac{\rho(s_{qr}/\gamma)}{\pi_q}. \qquad (15)$$
During every iteration of RFCM, the merge ratio $\omega_{qr}$ is calculated for every cluster
$v_q \in v$ and $v_r \in v$. If $\omega_{qr} \le \alpha$, where α is some constant, then the clusters centered at $v_q$ and
$v_r$ are merged. Choosing a value of $\alpha < 1$ means that in order for two clusters to be merged, the
distance between the clusters' centers must be less than the compactness (radius) of the clusters.
Experimentally, values of $\alpha \in [0.1, 0.3]$ work well.
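The merging criterion can be sketched as follows. The helper names are illustrative; `U_i` and `d_i` denote the membership row and distance row of one cluster:

```python
import numpy as np

def huber_rho(x):
    # Huber's rho of equation (4).
    ax = np.abs(x)
    return np.where(ax <= 1.0, 0.5 * x ** 2, ax - 0.5)

def robust_compactness(U_i, d_i, gamma, m_c=1.75):
    """pi_i of equation (12) for a single cluster."""
    um = U_i ** m_c
    return float((um * huber_rho(d_i / gamma)).sum() / um.sum())

def merge_ratio(v_q, v_r, pi_q, gamma):
    """omega_qr of equation (15): robust separation over compactness."""
    s_qr = np.linalg.norm(v_q - v_r)          # separation, equation (13)
    return float(huber_rho(np.array(s_qr / gamma))) / pi_q

def should_merge(v_q, v_r, pi_q, gamma, alpha=0.2):
    return merge_ratio(v_q, v_r, pi_q, gamma) <= alpha
```

Two centers that sit close together relative to the cluster's compactness yield a small ω and are flagged for merging; widely separated centers are left alone.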
Merging Mechanics
Once the decision is made to join two clusters, they must be combined in a meaningful
way. The new exemplar should exist on a line segment that runs between the two old exemplars.
The new center will be placed closer to the cluster with the larger fuzzy cardinality.
The placement of the new exemplar is accomplished by using a parameter p:

$$p = \frac{n_q}{n_q + n_r} \qquad (16)$$

where $n_q$ and $n_r$ are the fuzzy cardinalities of the two clusters to be merged. The location of the new
exemplar is calculated using a combination formula:

$$v_n = p\, v_q + (1 - p)\, v_r \qquad (17)$$

where $v_n$ is the center of the new cluster. The old exemplars are removed from v and replaced
with the new center $v_n$. The next iteration of the algorithm will compute the membership values
of X in the new cluster.
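The merging mechanics reduce to two lines of arithmetic; the function name below is an illustrative assumption:

```python
import numpy as np

def merge_exemplars(v_q, v_r, n_q, n_r):
    """Combine two exemplars per equations (16) and (17).

    The new center lies on the segment between v_q and v_r,
    closer to the cluster with the larger fuzzy cardinality.
    """
    p = n_q / (n_q + n_r)              # equation (16)
    return p * v_q + (1.0 - p) * v_r   # equation (17)
```

With equal cardinalities the new center is the midpoint; a cluster with three times the cardinality of its partner pulls the new center three-quarters of the way toward itself.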
The RFNM Algorithm
The Robust Fuzzy n-Means algorithm is based on the FCM algorithm described in [10].
It uses an alternating optimization (AO) technique for minimizing the objective function (5).
FCM has been modified to be robust and unsupervised. The $c_m$ exemplars begin at locations
determined by the user. During execution, these exemplars gravitate toward the data set's “true”
cluster centers. Some exemplars may merge along the way. Ideally, the algorithm will terminate
with exactly one exemplar positioned near the center of each cluster. The user provides the
following input:

$c_m \in \aleph_\infty$: initial (maximum) number of clusters
$m_c \in [1, \infty)$: weighting exponent
$\alpha \in (0, 1)$: merging criterion constant
$\epsilon \in (0, \infty)$: stopping constant (small positive number)
$\gamma \in (0, \infty)$: scaling constant (example: three times the MAD)
$v = \{v_1, v_2, \ldots, v_{c_m}\}$: initial placement of the exemplars (cluster centers)
Algorithm: $\mathrm{RFNM}(c_m, m_c, \alpha, \epsilon, \gamma, v)$

Step 1. Let $c = c_m$.
Step 2. Let $s_1, s_2, \ldots, s_c$ equal $v_1, v_2, \ldots, v_c$ respectively.
Step 3. Calculate the new membership matrix U by the following procedure: for each
$x_k \in X$, if $\|x_k - v_i\|^2 > 0$ for all $i \in \aleph_c$, then compute $u_{ik}$ using equation (6). If
$\|x_k - v_i\|^2 = 0$ for some $i \in I \subseteq \aleph_c$, then define $u_{ik}$ for $i \in I$ by any nonnegative
real numbers satisfying $\sum_{i \in I} u_{ik} = 1$ and define $u_{ik} = 0$ for $i \in \aleph_c - I$.
Step 4. Merge clusters that are close together. For every $v_q \in v$ and $v_r \in v$ with $q \ne r$,
do the following: calculate $\omega_{qr}$ using equation (15); if $\omega_{qr} \le \alpha$ then compute
$v_n$ using equations (16) and (17); let $v_q = v_n$; remove $v_r$ from v and decrement
c by 1. NOTE: Any cluster can only be merged once per iteration.
Step 5. Calculate the c cluster centers $v_1, v_2, \ldots, v_c$ using equation (7) and the given value
of $m_c$.
Step 6. If a merge took place in Step 4, then return to Step 2. Otherwise, if
$\max_{i \in \aleph_c} \|v_i - s_i\| \le \epsilon$, then stop. Otherwise, return to Step 2.
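For readers who wish to experiment, the steps above can be condensed into a short NumPy sketch. This is an illustrative implementation, not the author's code: the names are invented, and after a merge it simply recomputes memberships before the next center update (a simplification of Steps 4 through 6). Equation references are given in the comments.

```python
import numpy as np

def huber_rho(x):
    # Equation (4): quadratic near zero, linear in the tails.
    ax = np.abs(x)
    return np.where(ax <= 1.0, 0.5 * x ** 2, ax - 0.5)

def huber_w(x):
    # Equation (8): unit weight near zero, 1/|x| in the tails.
    return 1.0 / np.maximum(np.abs(x), 1.0)

def rfnm(X, V, gamma, m_c=1.75, alpha=0.2, eps=1e-4, max_iter=100):
    """Condensed RFNM loop; V holds the c_m initial exemplars."""
    X, V = np.asarray(X, float), np.asarray(V, float).copy()
    for _ in range(max_iter):
        S = V.copy()                                               # Step 2
        # Step 3: membership matrix U from equation (6).
        d = np.sqrt(((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
        r = huber_rho(d / gamma) + 1e-12
        U = 1.0 / ((r[:, None, :] / r[None, :, :]) ** (1.0 / (m_c - 1.0))).sum(axis=1)
        um = U ** m_c
        # Step 4: merge close exemplars; each cluster merges at most once.
        pi = (um * huber_rho(d / gamma)).sum(axis=1) / um.sum(axis=1)  # eq (12)
        n = U.sum(axis=1)                                              # eq (10)
        keep = np.ones(len(V), dtype=bool)
        touched = np.zeros(len(V), dtype=bool)
        merged = False
        for q in range(len(V)):
            for rr in range(q + 1, len(V)):
                if touched[q] or touched[rr]:
                    continue
                s_qr = np.linalg.norm(V[q] - V[rr])                    # eq (13)
                if float(huber_rho(np.array(s_qr / gamma))) / pi[q] <= alpha:  # eq (15)
                    p = n[q] / (n[q] + n[rr])                          # eq (16)
                    V[q] = p * V[q] + (1.0 - p) * V[rr]                # eq (17)
                    keep[rr] = False
                    touched[q] = touched[rr] = True
                    merged = True
        if merged:
            V = V[keep]
            continue  # recompute memberships for the merged exemplars
        # Step 5: exemplar update from equation (7).
        w = um * huber_w(d / gamma)
        V = w @ X / w.sum(axis=1, keepdims=True)
        if np.max(np.linalg.norm(V - S, axis=1)) <= eps:               # Step 6
            break
    return V
```

On a single symmetric cluster with two nearby starting exemplars, the pair merges and the survivor settles at the cluster mean, mirroring the behavior described in the testing section.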
On average, this algorithm has linear time complexity. Steps 2, 5 and 6 have a total
maximum running time of $c_m(a + b + c)$ where a, b and c are constants. The maximum running
time of Step 3 is $k \cdot c_m \cdot N$ and Step 4 will run in $l \cdot c_m^2$ time (worst case), where k and l are
constants. Thus, the total running time of this algorithm has an upper bound of
$t\left[l \cdot c_m^2 + k \cdot c_m \cdot N + c_m(a + b + c)\right]$ where t is the number of iterations of the algorithm. In most
cases the size of the data set N will be significantly larger than $c_m$ and t. Therefore, the N term
will overwhelm the $c_m^2$ and t terms, yielding a running time complexity of $O(c_m N)$.
The memory overhead of this algorithm is also linear. Storing the exemplar vectors v and
s requires $O(c_m)$ space in the worst case. The membership matrix U requires $O(c_m \cdot N)$ space.
The memory required for the data set X is not considered because it is not overhead. The total
memory overhead is $O(c_m N)$. Often, the data set to be clustered is very large. Therefore,
storing something of size N will cost much memory. However, clever coding and some slight
modifications will allow the algorithm to run in $O(c_m)$ space. The following modifications
compute the membership matrix U, the exemplars v and the merge ratio ω on the fly without storing
U in memory:
Algorithm: $\mathrm{FastRFNM}(c_m, m_c, \alpha, \epsilon, \gamma, v)$

Step 1. Let $c = c_m$.
Step 2. Let $s_1, s_2, \ldots, s_c$ equal $v_1, v_2, \ldots, v_c$ respectively.
Step 3. Calculate, but do not store, the new membership matrix U by the following
procedure: for each $x_k \in X$, if $\|x_k - v_i\|^2 > 0$ for all $i \in \aleph_c$, then compute $u_{ik}$
using equation (6). If $\|x_k - v_i\|^2 = 0$ for some $i \in I \subseteq \aleph_c$, then define $u_{ik}$ for
$i \in I$ by any nonnegative real numbers satisfying $\sum_{i \in I} u_{ik} = 1$ and define $u_{ik} = 0$
for $i \in \aleph_c - I$. Simultaneously, calculate and keep the running sums used in
equations (7), (15) and (16).
Step 4. For every $v_q \in v$ and $v_r \in v$ with $q \ne r$, do the following: calculate $\omega_{qr}$ using
equation (15) and the running sums from Step 3; if $\omega_{qr} \le \alpha$ then compute $v_n$
using equations (16) and (17) and the running sums from Step 3; let $v_q = v_n$;
remove $v_r$ from v and decrement c by 1. If any two clusters were merged, then
return to Step 3.
Step 5. Calculate the c cluster centers $v_1, v_2, \ldots, v_c$ using equation (7), the given value of
$m_c$ and the running sums from Step 3.
Step 6. If a merge took place in Step 4, then return to Step 2. Otherwise, if
$\max_{i \in \aleph_c} \|v_i - s_i\| \le \epsilon$, then stop. Otherwise, return to Step 2.
This “fast” version of RFNM actually has the same time complexity as the normal
algorithm. However, it has a much lower memory complexity. Since it does not store the
membership matrix U, the N term can be dropped from the space complexity. Consequently, the
memory overhead is $O(c_m)$, which represents only storing the exemplar vectors. On average,
this “fast” algorithm will execute quicker because its lower memory overhead reduces the risk of
page faults. RFNM provides robust unsupervised learning with a linear running time and low
memory overhead. This makes it ideally suited for real-time data processing applications.
Testing
Exemplar Placement (Gaussian Tests 1 and 2)
The RFNM algorithm described in the previous section was tested using a five-dimensional
Gaussian scatter of random data. The test data has two cluster centers equidistant
from the origin. The first two tests start with six exemplars $(c_m = 6)$, $m_c = 1.75$ and $\alpha = 0.3$.
The positioning of the initial exemplars is critical. Figure 1 shows a 2D plot of the movement of
the exemplars.

Figure 1. Gaussian Test 1, $c_m = 6$.

The “true” cluster centers, which are computed using the sample mean, exist at
$(-1.0, 0, 0, 0, 0)$ and $(1.0, 0, 0, 0, 0)$. In Figure 1, two exemplars (labeled A and B) merge together at
point C and then converge to approximately $(-1.3, -0.1, 0, 0, 0)$. Two more exemplars (D and E)
merge together at F and converge to $\approx (1.4, 0.1, 0, 0, 0)$. The last two exemplars (G and H) were
initialized on the y-axis. They merge and converge very close to the origin (point I). The test
data is almost symmetric. Since the middle exemplars (G and H) started equidistant from two
nearly symmetric clusters, they were never drawn to one cluster or the other. In this case the
algorithm does not converge properly.
Figure 2. Gaussian Test 2, $c_m = 6$.

The second test uses the same data set and starting parameters except two of the initial
exemplars (Figure 2: D and G) are offset $+0.5$ along the x-axis. Figure 2 plots the movement of
the exemplars. Notice the algorithm converges to the desired two cluster centers $(\pm 1.0, 0, 0, 0, 0)$.
Exemplars A and B merge together into exemplar C, which then converges to the cluster center
at $(-1, 0, 0, 0, 0)$.

The exemplar trace on the right side of the y-axis is more interesting. Exemplars D and E
merge together and become F. Exemplar G then merges with F to become H. Finally, H and I
merge into exemplar J, which then converges to the cluster center at $(1, 0, 0, 0, 0)$. Since the
middle exemplars (D and G) were initialized slightly closer to the right-hand cluster, they were
drawn toward that cluster's center. By comparing the results of test 1 and test 2, one can see that
the initial placement of the exemplars can change the results dramatically.
Robustness Testing (Cauchy Test 1)
A second set of two-dimensional test data was randomly generated using a Cauchy
distribution. This data set has two well-defined main clusters, but also has several outliers that
are very far away from the main clusters. The presence of the outliers obviously increases the
compactness value of the clusters (makes them less compact). Consequently, the exemplars tend to
merge very quickly. To compensate for this, lower values of $m_c = 1.25$ and $\alpha = 0.2$ were
chosen. Additionally, the initial exemplars were started a little further away from the origin. To
reduce the influence of outliers, the sample median is used to determine the “true” cluster
centers: approximately $(\pm 1.3, 0)$. Figure 3 shows a trace of the six exemplars.

The merging sequence in Figure 3 is very similar to the previous test. Exemplars A and
B merge into C and converge to $(-1.8, 0)$. On the right side of the y-axis, exemplars D and E
merge into F. Next, G and F merge into H. Finally, H and I merge into J and converge to
$(1.5, 0)$. Notice, the algorithm converges near the two desired cluster centers of $(\pm 1.3, 0)$.
However, it does not converge exactly because the outliers still have some influence on the
exemplars. The total error is 0.7.
Figure 3. Cauchy Test 1 (RFNM), $c_m = 6$.
Additionally, the proper choice of α is very important. Choosing a merge ratio threshold that is
too low $(\alpha < 0.1)$ will cause the exemplars to not merge. Conversely, setting the threshold too
high $(\alpha > 0.4)$ will cause all of the exemplars to merge together into one cluster center. In both
cases, the true cluster centers are never found. Thus, choosing a good merge ratio is crucial.
Robust n-Means vs. c-Means Clustering (Cauchy Test 1)
For comparison purposes, a standard RFCM algorithm $(\alpha = 0)$ was run on the same set of
data with the same parameters. Figure 4 shows the trace. Exemplar A converges to $(1.5, 0)$, and
exemplar B converges to $(-1.8, 0)$. In other words, the two exemplars in this example converge
to the same cluster centers as the exemplars in the previous test. Clearly, the RFNM algorithm
performs as well as RFCM.
Figure 4. Cauchy Test 1 (RFCM), $c_m = c = 2$.
The values of the objective functions (5) for both methods were plotted against time (see
Figure 5). Both methods converge to the same value $(\approx 830)$ within the same number of
iterations. Notice the RFNM algorithm (left), with an initial $c_m = 6$ and final $c = 2$, yields an
increasing value of $J(U, v)$. This is because reducing the number of clusters actually causes an
increase in the objective function. However, RFNM still reaches the optimal solution without
requiring the user to input the desired number of clusters.
Catching Outliers (Cauchy Test 2)
The previous examples start with the exemplars near the cluster centers. In the next
example, eight exemplars $(c_m = 8)$ begin far away from the origin. Once again, the Cauchy data
set is used. Figure 6 shows a close-up of the exemplar trace. Exemplars A and B merge into C.
At the same time exemplars D and E merge into F. Finally, C and F merge into G and converge
to $(-1.5, 0)$. Exemplar H converges to $(1.5, 0)$ without merging. The total error is 0.4, and the
final number of clusters is $c = 5$.
Figure 5. RFNM vs. RFCM (Cauchy Test 1).
Figure 7 shows a trace of the same test on a larger scale. One can see exemplars A, B, D,
E and H move toward the main clusters, merge and converge to the cluster centers (see Figure 6).
Furthermore, exemplars I, J and K converge near clusters of outliers: approximately located at
$(-10, 1)$, $(0, 25)$ and $(-1, -17)$ respectively. These outlying clusters have very low fuzzy
cardinalities (less than 10% of the main clusters). In the final analysis, one could classify
exemplars with low cardinalities as clusters of outliers. Depending on the application, it may be
useful to discover outlying clusters. Otherwise, they can be ignored and removed from the final
partition.
The use of Huber’s functions in RFNM reduces the influence of outliers, but it does not
22. 20
eliminate their influence. However, notice that the exemplars in Cauchy test 2 (Figure 6) are
closer to the desired centers of ( )0,3.1± than the exemplars in Cauchy test 1 (Figure 3). In fact,
test 2 yields improvement of 0.3 in total error over test 1. This is because the second test placed
exemplars near the outliers (see Figure 7), which has the effect of reducing the influence of those
outliers on the two main clusters. As a result, the true centers of the main cluster are identified
with greater accuracy. Furthermore, the final value of the objective function is lower:
approximately 611 as opposed to 830 from test 1. Of course, the larger number of exemplars
( )5=c in test 2 accounts for much of this decrease. Figure 8 shows a plot of the objective
function. A good initial placement of the exemplars will improve the robustness of the algorithm
Figure 5. Cauchy Test 2 (Zoomed-In), 8=mc .
23. 21
and yield better results.
Figure 7. Cauchy Test 2 (Zoomed-Out), $c_m = 8$.
Conclusion
The goal of this paper is to provide a robust algorithm that will find an optimal partition
without knowing the proper number of clusters. Ideally, the user should be able to partition a
data set without any a priori knowledge of the data’s structure. Robust Fuzzy n-Means
Clustering provides a good start toward this goal.
Experiments with Gaussian data have demonstrated that RFNM can accurately find the
desired number of clusters and their centers. Furthermore, the first Cauchy test has shown that
RFNM provides results which are identical to the results reached by RFCM. Thus, RFNM is as
accurate and robust as RFCM, yet it does not increase the time complexity. Finally, clever
initialization of the exemplars allows RFNM to identify outlying clusters (Cauchy test 2). This
in turn improves the accuracy of the final results. Clearly, RFNM provides robust accurate
results without requiring prior knowledge of the data’s structure.
Figure 8. Cauchy Test 2 (Objective Function).
Although RFNM is an improvement over other algorithms, it does have some
shortcomings. First, it is not completely unsupervised, because the user's choices of $m_c$ and α
will have significant effects on the results. Data sets with outliers, for example, require lower
values of $m_c$ and α than sets with compact, well-separated clusters. Future research should
examine ways of preprocessing the target data in order to determine the ideal clustering
parameters so that the entire process can be fully automated.
Second, the initial positioning of the exemplars is crucial to getting optimal results.
Placing the exemplars exactly between two cluster centers, for example, may cause those
exemplars to not converge. One possible solution is to place the initial exemplars very far away
from the cluster centers. This will allow the exemplars to compete equally for cardinality. In
other words, one exemplar will not have an advantage simply because it was initially placed
close to a cluster of points. However, if the exemplars are initialized too far away from the
cluster centers, then the main clusters and the outliers will have equal influence. As a result, the
exemplars may skip over the outliers altogether. Determining an automated, yet reliable way of
initializing the exemplars would be very beneficial. Future research in this area should be
considered.
Third, the preprocessing requirements of RFNM can be costly. The experiments in this
paper use the MAD to compute the scaling constant γ. This operation takes $O(N \lg N)$ time and
uses $O(N)$ space. Research into more efficient preprocessing techniques may be useful.
Robust Fuzzy n-Means Clustering has a wide range of applications in image and data
processing. It requires less user supervision than many other algorithms, but it is not completely
unsupervised. However, in several situations the RFNM algorithm provides a good solution in
linear time.
References
1. Bensaid, A., Hall, L., Bezdek, J., Clarke L., Silbiger, M., Arrington, J. and Murtagh, R.,
Validity-guided (re)clustering with applications to image segmentation, IEEE Transactions
on Fuzzy Systems, vol. 4, no. 2 (May 1996), 112-123.
2. Bezdek, J., A convergence theorem for the fuzzy ISODATA clustering algorithms, IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 2, no. 1 (January 1980), 1-8.
3. Bezdek, J., Fuzzy models—what are they and why? IEEE Transactions on Fuzzy Systems,
vol. 1, no 1 (February 1993), 1-5.
4. Bezdek, J. and Pal, S. Fuzzy Models for Pattern Recognition: Methods That Search for
Structures in Data, IEEE Press, New York, NY, 1992.
5. Boujemaa, N., Generalized competitive clustering for image segmentation, In Proceedings of
the 19th International Conference of the North American Fuzzy Information Processing
Society – NAFIPS (July 13-15, 2000, Atlanta, GA), NAFIPS/IEEE, 2000, 133-137.
6. Choi, Y. and Krishnapuram, R., Fuzzy and robust formulations of maximum-likelihood-
based Gaussian mixture decomposition, In Proceedings of the Fifth IEEE International
Conference on Fuzzy Systems (September 8-11, 1996, New Orleans, LA), IEEE Neural
Networks Council, 1996, 1899-1905.
7. Dunn, J., A fuzzy relative of the ISODATA process and its use in detecting compact well-
separated clusters, Journal of Cybernetics, vol. 3, no. 3 (1973) 32-57.
8. Kersten, P., Fuzzy order statistics and their application to fuzzy clustering, IEEE
Transactions on Fuzzy Systems, vol. 7, no. 6 (December 1999) 708-712.
9. Kersten, P., Lee, R., Verdi, J., Carvalho, R. and Yankovich, S., Segmenting SAR images using
fuzzy clustering, In Proceedings of the 19th International Conference of the North American
Fuzzy Information Processing Society – NAFIPS (July 13-15, 2000, Atlanta, GA),
NAFIPS/IEEE, 2000, 105-108.
10. Klir, G. and Yuan, B. Fuzzy Sets and Fuzzy Logic: Theory and Applications, Prentice Hall P
T R, Upper Saddle River, NJ, 1995.
11. Randles, R. and Wolfe, D., Introduction to the Theory of Nonparametric Statistics, John
Wiley & Sons, Inc., New York, NY, 1979.
12. Xie, X.L. and Beni, G., A validity measure for fuzzy clustering, IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 13, no. 8 (August 1991), 841-847.