2. CONTENTS
Abstract
Introduction
Basic idea of k-means algorithm
K-Means example
Leader-Follower to find the value of k
Data set used for implementing the research paper
Implementation of modules
References
3. ABSTRACT
Nowadays millions of databases are used in business management, government,
science, engineering, and many other applications, and they keep growing
rapidly.
This explosive growth in data and databases has generated an urgent need for
new techniques to remove outliers for effective data mining. In this
project we propose a clustering-based outlier detection algorithm for
effective data mining.
4. INTRODUCTION
Data mining is the process of extracting information, patterns, and
knowledge from large quantities of data.
CLUSTER: a collection of data objects that are
similar to one another within the same cluster, and
dissimilar to the objects in other clusters.
Clustering plays a major role in pattern recognition, image
analysis, and market and business research.
5. K-MEANS CLUSTERING ALGORITHM
Step 1: Randomly select k data objects from data set D as the initial
cluster centers.
Step 2: Repeat Steps 3 and 4 until no new cluster centers are found.
Step 3: Calculate the distance between each data object di
(1<=i<=n) and every cluster center cj (1<=j<=k), and assign data
object di to the nearest cluster.
Step 4: For each cluster j (1<=j<=k), recalculate the cluster center as the
mean of its members.
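The steps above can be sketched in Python (a minimal sketch assuming 2-D points; the function name and signature are illustrative, not from the paper):

```python
import random

def kmeans(data, k, max_iter=100):
    """Basic k-means on 2-D points: random initial centers, then
    assign-and-recompute until the centers stop moving (Steps 1-4)."""
    centers = random.sample(data, k)  # Step 1: k random data objects as centers
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest center (squared distance
        # gives the same nearest center as Euclidean distance)
        clusters = [[] for _ in range(k)]
        for x, y in data:
            j = min(range(k),
                    key=lambda c: (x - centers[c][0])**2 + (y - centers[c][1])**2)
            clusters[j].append((x, y))
        # Step 4: recompute each center as the mean of its cluster
        new_centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # Step 2: stop when no new centers are found
            break
        centers = new_centers
    return centers, clusters
```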
7. LEADER-FOLLOWER
The algorithm is as follows:
1. Initialize the input data set.
2. Specify the threshold distance.
3. For each data point, find the closest cluster centre.
4. If the distance from that cluster centre is above the threshold, create a
new cluster.
5. Else, add the point as an instance of that cluster.
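A minimal Python sketch of this procedure (assuming 2-D points and that each cluster keeps its founding point as its centre; names are illustrative):

```python
def leader_follower(data, threshold):
    """Leader-follower clustering: each point either joins the nearest
    existing cluster (if within threshold) or founds a new one."""
    centers = [data[0]]              # the first point founds the first cluster
    clusters = [[data[0]]]
    for x, y in data[1:]:
        # step 3: find the closest cluster centre
        dists = [((x - cx)**2 + (y - cy)**2) ** 0.5 for cx, cy in centers]
        j = dists.index(min(dists))
        if dists[j] > threshold:     # step 4: too far -> create a new cluster
            centers.append((x, y))
            clusters.append([(x, y)])
        else:                        # step 5: close enough -> add as an instance
            clusters[j].append((x, y))
    return clusters
```

The number of clusters produced, k, is then used as the input to k-means.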
8. PROPOSED ALGORITHM
Input: Data set D = {d1, d2, ..., dn}, where di = data point and n = number
of data points; cluster centres C = {c1, c2, ..., ck}, where cj = cluster
centre and k = number of cluster centres.
Output: The cluster Cn (n = 1, 2, ..., k) that is the outlier cluster.
Step 1: Calculate the distance between each data point di and the k cluster
centers cj; the Euclidean distance is most commonly used.
Step 2: For each data object di, find the closest centroid cj and assign di to
the cluster with the nearest centroid cj.
Step 3: Repeat Steps 4-6 until a convergence criterion is met, i.e. no new
centroids are found.
Step 4: For each data point di, compute its distance from the centroid cj of
its present nearest cluster.
Step 5: If the calculated distance is less than or equal to the previously
calculated distance, the data point stays in its previous cluster.
Step 6: Else, calculate the distance of the data point to each of the new
cluster centers and assign the data point to the nearest cluster based on
those distances.
Step 7: Calculate the number of data points (tk) in each cluster.
Step 8: Calculate the maximum number of data points (tmax) of the k clusters,
where tmax = max(t1, t2, ..., tk).
Step 9: Calculate the minimum number of data points (tmin) of the k clusters,
where tmin = min(t1, t2, ..., tk).
Step 10: tcritical = (tmax + tmin) / 2
Step 11: For each n = 1 to k, compare tn to tcritical.
Step 12: If tn is less than tcritical, then the cluster Cn is an
outlier cluster.
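Steps 7-12 (flagging small clusters as outlier clusters) can be sketched as follows; the function name is illustrative:

```python
def outlier_clusters(clusters):
    """Steps 7-12: a cluster is an outlier cluster when its size t_n
    falls below t_critical = (t_max + t_min) / 2."""
    sizes = [len(c) for c in clusters]           # Step 7: t_k per cluster
    t_critical = (max(sizes) + min(sizes)) / 2   # Steps 8-10
    # Steps 11-12: compare each t_n with t_critical
    return [c for c, t in zip(clusters, sizes) if t < t_critical]
```

For example, with cluster sizes 18, 1, and 1 (as in the data set below), t_critical = (18 + 1) / 2 = 9.5, so the two single-point clusters are flagged.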
10. DATA-SET
We use the following data set D of 20 points to implement this paper:
[Table: the 20 data points, with columns X, Y, and DISTANCE]
11. IMPLEMENTATION
MODULE 1: LEADER-FOLLOWER TECHNIQUE (TO FIND
THE EXACT VALUE OF k)
STEP 1: Calculate the threshold value.
First we calculate the distance between every data point in the given data
set using the Euclidean distance formula. Then we add all the distances and
divide the sum by the total number of data points, so we get:
Threshold value = (sum of distances between the data points)
/ (total number of data points in the given data set)
= 49.1 / 20 = 2.455
After that we compare each calculated distance against the threshold:
If the distance from the cluster centre is above the threshold,
then create a new cluster.
Else,
add the point as an instance of the cluster.
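The threshold computation can be sketched as below. Note an assumption: the paper's 49.1 is taken to be the sum of one precomputed distance per data point (the DISTANCE column of the data set), since 49.1 / 20 = 2.455; the function name is illustrative:

```python
def leader_follower_threshold(distances, n_points):
    """Threshold = (sum of the per-point distances) / (number of data points).
    Assumes `distances` holds one precomputed distance per data point,
    matching the paper's 49.1 / 20 = 2.455."""
    return sum(distances) / n_points
```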
12. By applying the leader-follower technique, we get three clusters c1, c2,
and c3 with 18, 1, and 1 points respectively, as shown in the figure:
[Figure: scatter plot of the data set showing cluster c1 (18 points),
cluster c2 (1 point), and cluster c3 (1 point)]
13. MODULE 2: k-MEANS ALGORITHM (STEPS 1 TO 6)
This algorithm comprises the following steps:
From Step 1 to Step 6, we calculate the distance between each cluster
mean and the data points, then assign each object to the cluster
(based on distance) to which it is most similar.
Then we update the cluster means and repeat the above procedure
iteratively until there is no change in the structure of the clusters.
Generally, for calculating the distance we use the
EUCLIDEAN DISTANCE, defined as:
D(i,j) = sqrt((xi1 - xj1)^2 + (xi2 - xj2)^2 + ... + (xin - xjn)^2)
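In Python, the Euclidean distance between two n-dimensional points can be written as (illustrative helper, not from the paper):

```python
def euclidean(p, q):
    """Euclidean distance D(i, j) between two n-dimensional points:
    the square root of the sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
```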
16. Finally we get cluster c1 with 9 points, c2 with 9 points, and c3
with 2 points, as shown in the figure:
[Figure: scatter plot showing cluster c1 (9 points), cluster c2 (9 points),
and cluster c3 (2 points)]
17. REFERENCES
1. Rui Xu and Donald Wunsch, "Survey of clustering algorithms," IEEE
Transactions on Neural Networks, vol. 16, no. 3, May 2005, pp. 645-678.
2. Mu-Chun Su and Chien-Hsing Chou, "A modified version of the K-means
algorithm with a distance based on cluster symmetry," IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 23, no. 6, June 2001,
pp. 674-680.