2. CONTENTS
Abstract
Introduction
Basic idea of k-means algorithm
K-Means example
Leader-Follower to find the value of k
Data set used for implementing the research paper
Implementation of modules
References
3. ABSTRACT
Nowadays millions of databases are used in business management, government,
science, engineering, and many other applications, and they keep growing
rapidly.
This explosive growth in data and databases has generated an urgent need for
new techniques to remove outliers for effective data mining. In this
project we propose a clustering-based outlier detection algorithm for
effective data mining.
4. INTRODUCTION
Data mining is the process of extracting information, patterns, and
knowledge from large quantities of data.
CLUSTER: a collection of data objects that are
similar to one another within the same cluster, and
dissimilar to the objects in other clusters.
Clustering plays a major role in pattern recognition, image
analysis, and market and business research.
5. K-MEANS CLUSTERING ALGORITHM
Step 1: Randomly select k data objects from data set D as the initial
cluster centers.
Step 2: Repeat Steps 3 and 4 until no new cluster centers are found.
Step 3: Calculate the distance between each data object di
(1<=i<=n) and every cluster center cj (1<=j<=k), and assign data
object di to the nearest cluster.
Step 4: For each cluster j (1<=j<=k), recalculate the cluster center as the
mean of its members.
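The steps above can be sketched in Python (a minimal sketch assuming 2-D points; the function name and signature are illustrative, not from the paper):

```python
import random

def kmeans(data, k, max_iter=100):
    """Basic k-means on 2-D points: random initial centers, then
    assign-and-recompute until the centers stop moving (Steps 1-4)."""
    centers = random.sample(data, k)  # Step 1: k random data objects as centers
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest center (squared distance
        # gives the same nearest center as Euclidean distance)
        clusters = [[] for _ in range(k)]
        for x, y in data:
            j = min(range(k),
                    key=lambda c: (x - centers[c][0])**2 + (y - centers[c][1])**2)
            clusters[j].append((x, y))
        # Step 4: recompute each center as the mean of its cluster
        new_centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # Step 2: stop when no new centers are found
            break
        centers = new_centers
    return centers, clusters
```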
7. LEADER-FOLLOWER
The algorithm is as follows:
1. Initialize the input data set.
2. Specify the threshold distance.
3. For each data point, find the closest cluster centre.
4. If the distance from that cluster centre is above the threshold, create a
new cluster.
5. Else, add the point as an instance of that cluster.
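A minimal Python sketch of this procedure (assuming 2-D points and that each cluster keeps its founding point as its centre; names are illustrative):

```python
def leader_follower(data, threshold):
    """Leader-follower clustering: each point either joins the nearest
    existing cluster (if within threshold) or founds a new one."""
    centers = [data[0]]              # the first point founds the first cluster
    clusters = [[data[0]]]
    for x, y in data[1:]:
        # step 3: find the closest cluster centre
        dists = [((x - cx)**2 + (y - cy)**2) ** 0.5 for cx, cy in centers]
        j = dists.index(min(dists))
        if dists[j] > threshold:     # step 4: too far -> create a new cluster
            centers.append((x, y))
            clusters.append([(x, y)])
        else:                        # step 5: close enough -> add as an instance
            clusters[j].append((x, y))
    return clusters
```

The number of clusters produced, k, is then used as the input to k-means.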
8. PROPOSED ALGORITHM
Input: Data set D = {d1, d2, ..., dn}, where di = data point and n = number
of data points; cluster centres C = {c1, c2, ..., ck}, where cj = cluster
centre and k = number of cluster centres.
Output: The cluster Cn (n = 1, 2, ..., k) that is the outlier cluster.
Step 1: Calculate the distance between each data point di and the k cluster
centers cj; the Euclidean distance is most commonly used.
Step 2: For each data object di, find the closest centroid cj and assign di to
the cluster with the nearest centroid cj.
Step 3: Repeat Steps 4-6 until a convergence criterion is met, i.e. no new
centroids are found.
Step 4: For each data point di, compute its distance from the centroid cj of
its present nearest cluster.
Step 5: If the calculated distance is less than or equal to the previously
calculated distance, the data point stays in its previous cluster.
Step 6: Else, calculate the distance of the data point to each of the new
cluster centers and assign the data point to the nearest cluster based on
those distances.
Step 7: Calculate the number of data points (tk) in each cluster.
Step 8: Calculate the maximum number of data points (tmax) of the k clusters,
where tmax = max(t1, t2, ..., tk).
Step 9: Calculate the minimum number of data points (tmin) of the k clusters,
where tmin = min(t1, t2, ..., tk).
Step 10: tcritical = (tmax + tmin) / 2
Step 11: For each n = 1 to k, compare tn to tcritical.
Step 12: If tn is less than tcritical, then the cluster Cn is an
outlier cluster.
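Steps 7-12 (flagging small clusters as outlier clusters) can be sketched as follows; the function name is illustrative:

```python
def outlier_clusters(clusters):
    """Steps 7-12: a cluster is an outlier cluster when its size t_n
    falls below t_critical = (t_max + t_min) / 2."""
    sizes = [len(c) for c in clusters]           # Step 7: t_k per cluster
    t_critical = (max(sizes) + min(sizes)) / 2   # Steps 8-10
    # Steps 11-12: compare each t_n with t_critical
    return [c for c, t in zip(clusters, sizes) if t < t_critical]
```

For example, with cluster sizes 18, 1, and 1 (as in the data set below), t_critical = (18 + 1) / 2 = 9.5, so the two single-point clusters are flagged.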
10. DATA-SET
We use the following data set D of 20 points to implement this paper:
[Table: the 20 data points, with columns X, Y, and DISTANCE]
11. IMPLEMENTATION
MODULE 1: LEADER-FOLLOWER TECHNIQUE (TO FIND
THE EXACT VALUE OF k)
STEP 1: Calculate the threshold value.
First we calculate the distance between every data point in the given data
set using the Euclidean distance formula. Then we add all the distances and
divide the sum by the total number of data points, so we get:
Threshold value = (sum of distances between the data points)
/ (total number of data points in the given data set)
= 49.1 / 20 = 2.455
After that we compare each calculated distance against the threshold:
If the distance from the cluster centre is above the threshold,
then create a new cluster.
Else,
add the point as an instance of the cluster.
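The threshold computation can be sketched as below. Note an assumption: the paper's 49.1 is taken to be the sum of one precomputed distance per data point (the DISTANCE column of the data set), since 49.1 / 20 = 2.455; the function name is illustrative:

```python
def leader_follower_threshold(distances, n_points):
    """Threshold = (sum of the per-point distances) / (number of data points).
    Assumes `distances` holds one precomputed distance per data point,
    matching the paper's 49.1 / 20 = 2.455."""
    return sum(distances) / n_points
```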
12. By applying the leader-follower technique, we get three clusters c1, c2,
and c3 with 18, 1, and 1 points respectively, as shown in the figure:
[Figure: scatter plot of the data set showing cluster c1 (18 points),
cluster c2 (1 point), and cluster c3 (1 point)]
13. MODULE 2: k-MEANS ALGORITHM (STEPS 1 TO 6)
This algorithm comprises the following steps:
From Step 1 to Step 6, we calculate the distance between each cluster
mean and the data points, then assign each object to the cluster
(based on distance) to which it is most similar.
Then we update the cluster means and repeat the above procedure
iteratively until there is no change in the structure of the clusters.
Generally, for calculating the distance we use the
EUCLIDEAN DISTANCE, defined as:
D(i,j) = sqrt((xi1 - xj1)^2 + (xi2 - xj2)^2 + ... + (xin - xjn)^2)
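In Python, the Euclidean distance between two n-dimensional points can be written as (illustrative helper, not from the paper):

```python
def euclidean(p, q):
    """Euclidean distance D(i, j) between two n-dimensional points:
    the square root of the sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
```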
16. Finally we get cluster c1 with 9 points, c2 with 9 points, and c3
with 2 points, as shown in the figure:
[Figure: scatter plot showing cluster c1 (9 points), cluster c2 (9 points),
and cluster c3 (2 points)]
17. REFERENCES
1. Rui Xu and Donald Wunsch, "Survey of clustering algorithms," IEEE
Transactions on Neural Networks, vol. 16, no. 3, May 2005, pp. 645-678.
2. Mu-Chun Su and Chien-Hsing Chou, "A modified version of the K-means
algorithm with a distance based on cluster symmetry," IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 23, no. 6, June 2001,
pp. 674-680.