DATA MINING CLASSIFICATION ANALYSIS
CONTENTS
• Abstract
• Introduction
• Basic idea of k-means algorithm
• K-Means example
• Leader-Follower to find the value of k
• Data set used for implementing the research paper
• Implementation of modules
• References
ABSTRACT
Millions of databases are now used in business management, government,
scientific engineering and many other applications, and their number keeps
growing rapidly.
This explosive growth in data and databases has created an urgent need for
new techniques that remove outliers so that data mining remains effective. In
this project we propose a clustering-based outlier detection algorithm for
effective data mining.
INTRODUCTION
• Data mining is the process of extracting information, patterns, etc.
from large quantities of data.
• CLUSTER: a collection of data objects that are
  - similar to one another within the same cluster
  - dissimilar to the objects in other clusters
Clustering plays a major role in pattern recognition, image
analysis, market and business research, etc.
K-MEANS CLUSTERING ALGORITHM
• Step 1: Randomly select k data objects from dataset D as the initial
cluster centers.
• Step 2: Repeat Steps 3 and 4 until no new cluster centers are found.
• Step 3: Calculate the distance between each data object di
(1<=i<=n) and all k cluster centers cj (1<=j<=k), and assign data
object di to the nearest cluster.
• Step 4: For each cluster j (1<=j<=k), recalculate the cluster center.
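The steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `k_means`, the `max_iter` and `seed` parameters, the choice to stop when assignments no longer change, and the handling of an empty cluster (its center is left unchanged) are all assumptions made for the example.

```python
import math
import random


def k_means(points, k, max_iter=100, seed=0):
    """Cluster a list of coordinate tuples into k groups (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 1: pick k data objects at random as the initial cluster centers.
    centers = rng.sample(points, k)
    assignment = None
    # Step 2: repeat Steps 3 and 4 until nothing changes (or max_iter is hit).
    for _ in range(max_iter):
        # Step 3: assign every data object to its nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the centers are stable too
        assignment = new_assignment
        # Step 4: recalculate each cluster center as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment


centers, labels = k_means(
    [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], k=2
)
```

Because the initial centers are chosen at random, different seeds can give different cluster numbering, and on harder data sets different final clusters.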
LEADER-FOLLOWER
The algorithm is as follows:
1. Initialize the input data set.
2. Specify the threshold distance.
3. For each data point, find the closest cluster centre.
4. If the distance from that cluster centre is above the threshold, create a
new cluster.
5. Else, add the point as an instance of that cluster.
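A short Python sketch of this procedure is given below. The function name `leader_follower` is hypothetical, and one detail is an assumption rather than something the slides state: after a point joins a cluster, the cluster centre is updated to the mean of its members. A variant that keeps the first point (the leader) as a fixed centre would be equally consistent with the steps listed.

```python
import math


def leader_follower(points, threshold):
    """Single-pass clustering: the number of clusters k is found, not given."""
    centres = []   # one centre per cluster
    clusters = []  # the member points of each cluster
    for p in points:
        if centres:
            # Step 3: find the closest existing cluster centre.
            j = min(range(len(centres)), key=lambda i: math.dist(p, centres[i]))
            # Step 5: within the threshold, so add the point to that cluster.
            if math.dist(p, centres[j]) <= threshold:
                clusters[j].append(p)
                n = len(clusters[j])
                # Assumption: the centre follows the mean of its members.
                centres[j] = tuple(sum(c) / n for c in zip(*clusters[j]))
                continue
        # Step 4: too far from every centre (or no centre yet): new cluster.
        centres.append(tuple(p))
        clusters.append([p])
    return centres, clusters


centres, clusters = leader_follower([(0, 0), (0.5, 0), (10, 10)], threshold=2)
k = len(centres)  # value of k that could then be passed to k-means
```

The result depends on the order in which the points are visited, so the same data set can yield a different k if it is shuffled.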
PROPOSED ALGORITHM
Input: Data set D = {d1, d2, ..., dn}, where di = data points and n = no. of data points.
Cluster centres C = {c1, c2, ..., ck}, where ci = cluster centre and k = no. of cluster
centres.
Output: Clusters Cn (n = 1, 2, ..., k), with the outlier cluster identified.
Step 1: Calculate the distance between each data point di and each of the k cluster
centres cj; the Euclidean distance is usually preferred.
Step 2: For each data object di, find the closest centroid cj and assign di to
the cluster with the nearest centroid cj.
Step 3: Repeat Steps 4-6 until a convergence criterion is met, that is, until
no new centroids are found.
Step 4: For each data point di, compute its distance from the centroid cj of
its present nearest cluster.
Step 5: If the calculated distance is less than or equal to the previously
calculated distance, the data point stays in its previous cluster.
Step 6: Else, calculate the distance of the data point to each of the new
cluster centres and assign the data point to the nearest cluster based on
these distances.
Step 7: Calculate the no. of data points (tk) in each cluster.
Step 8: Calculate the maximum no. of data points (tmax) over the k clusters, where
tmax = max( t1, t2, ..., tk ).
Step 9: Calculate the minimum no. of data points (tmin) over the k clusters, where
tmin = min( t1, t2, ..., tk ).
Step 10: tcritical = (tmax + tmin) / 2
Step 11: For each n = 1 to k, compare tn with tcritical.
Step 12: If tn is less than tcritical, then the cluster Cn is the
outlier cluster.
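Steps 7 to 12 can be illustrated with a small Python sketch. It assumes that Step 10 defines tcritical as the midpoint of tmax and tmin, i.e. (tmax + tmin) / 2; the function name `outlier_clusters` and the dictionary layout are hypothetical. The example sizes (9, 9 and 2 points) are the final cluster sizes reported in the implementation section.

```python
def outlier_clusters(cluster_sizes):
    """Return (t_critical, names of clusters whose size is below it)."""
    # Steps 8 and 9: largest and smallest cluster size.
    t_max = max(cluster_sizes.values())
    t_min = min(cluster_sizes.values())
    # Step 10 (assumed reading): midpoint of the two extremes.
    t_critical = (t_max + t_min) / 2
    # Steps 11 and 12: clusters smaller than t_critical are flagged.
    flagged = [name for name, t in cluster_sizes.items() if t < t_critical]
    return t_critical, flagged


t_critical, flagged = outlier_clusters({"c1": 9, "c2": 9, "c3": 2})
print(t_critical, flagged)  # 5.5 ['c3']
```

Note that when all clusters have the same size, tmax equals tmin and no cluster falls below tcritical, so nothing is flagged.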
DATA-SET
We use the following data set D to implement this paper:-
X     Y
0     1
0     3
0.6   1.6
0.6   2.4
1     2
1.4   1.6
1.4   2.4
2     1
2     3
4     1
4     3
4.6   1.6
4.6   2.4
5     2
5.4   1.6
5.4   2.4
6     1
6     3
7.5   1.8
7.5   2.1
DISTANCE (values as listed; row order not recoverable): 0.2, 1, 1, 1, 1, 1, 1, 1.8, 2, 2.2, 3, 3, 3, 3, 3, 3, 3.8, 5, 5.4, 5.7
IMPLEMENTATION
MODULE 1: LEADER-FOLLOWER TECHNIQUE (TO FIND
THE EXACT VALUE OF k)
STEP 1: Calculate the threshold value as follows.
First, we calculate the distances between the data points in the given data
set using the EUCLIDEAN DISTANCE formula. Then we add all the
distances and divide the sum by the total no. of data points. So we get:
Threshold value = (sum of the distances between the data points in the data set)
/ (total no. of data points in the given data set)
= 49.1 / 20 = 2.455
After that, we compare each point's distance from the closest cluster centre
with the threshold:
If the distance from the cluster centre is above the threshold
then create a new cluster.
Else
add the point as an instance of that cluster.
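The threshold arithmetic can be checked with a few lines of Python. The function name `threshold_value` is hypothetical; the inputs 49.1 (summed distances) and 20 (number of data points) are the figures quoted above.

```python
def threshold_value(total_distance, n_points):
    """Threshold = summed distances / number of data points."""
    if n_points <= 0:
        raise ValueError("n_points must be positive")
    return total_distance / n_points


threshold = threshold_value(49.1, 20)
print(round(threshold, 3))  # 2.455
```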
By applying the leader-follower technique, we get three clusters c1, c2 and
c3 with 18, 1 and 1 points respectively, as shown in the figure:
[Figure: plot of the data set with Cluster c1, Cluster c2 and Cluster c3 marked]
MODULE 2: k-means algorithm
STEPS 1 TO 6:
In Steps 1 to 6, we calculate the distance between each cluster
mean and the data points, then assign each object to the
cluster to which it is most similar (based on the distance).
We then update the cluster means and repeat the above procedure
iteratively until the structure of the clusters no longer changes.
For calculating the distance, we generally use the
EUCLIDEAN DISTANCE, defined as:
D(i,j) = sqrt( (xi1-xj1)^2 + (xi2-xj2)^2 + ... + (xin-xjn)^2 )
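As a worked illustration of the Euclidean distance, the Python sketch below computes it for points of any dimension n. The function name `euclidean_distance` is an assumption made for the example; the sample pair is the cluster center (3,1.8) and the data point (5,2) from the first c1 table.

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared coordinate differences."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


d = euclidean_distance((3, 1.8), (5, 2))
print(round(d, 1))  # 2.0
```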
FOR CLUSTER c1
i    CLUSTER CENTER   DATA POINT
1    (3,1.8)          (5,2)
2    (3,1.8)          (0,1)
3    (3,1.8)          (1,2)
4    (3,1.8)          (1.4,1.6)
5    (3,1.8)          (1.4,2.4)
6    (3,1.8)          (2,3)
7    (3,1.8)          (4,1)
8    (3,1.8)          (4,3)
9    (3,1.8)          (4.6,2.4)
10   (3,1.8)          (4.6,1.6)
11   (3,1.8)          (5.4,1.6)
12   (3,1.8)          (5.4,2.4)
13   (3,1.8)          (6,1)
14   (3,1.8)          (2,1)
15   (3,1.8)          (0,3)
16   (3,1.8)          (0.6,1.6)
17   (3,1.8)          (0.6,2.4)
18   (3,1.8)          (6,3)
DISTANCE (values as listed; row order not recoverable): 1.6, 1.6, 1.7, 1.2, 1.5, 1.2, 1.5, 1.6, 1.7, 2.0, 2.4, 2.4, 3.1, 3.2, 3.2, 2.4, 2.0, 3.1

FOR CLUSTER c2
i    CLUSTER CENTER   DATA POINT   DISTANCE
1    (7.5,2.1)        (7.5,2.1)    0

FOR CLUSTER c3
i    CLUSTER CENTER   DATA POINT   DISTANCE
1    (7.5,1.8)        (7.5,1.8)    0
Since the points in clusters c2 and c3 are similar, they form one
cluster. In cluster c1, some points are similar to one another and
some are different, so the similar points are kept in the same
cluster and the different ones are moved to another cluster. We then
update the means and follow this procedure iteratively. Finally, we
get the following tables:
FOR CLUSTER c1
i    CLUSTER CENTER   DATA POINT
1    (0.7,2)          (0,1)
2    (0.7,2)          (0,3)
3    (0.7,2)          (0.6,1.6)
4    (0.7,2)          (1,2)
5    (0.7,2)          (1.4,1.6)
6    (0.7,2)          (1.4,2.4)
7    (0.7,2)          (2,3)
8    (0.7,2)          (2,1)
9    (0.7,2)          (0.6,2.4)
DISTANCE (values as listed; row order not recoverable): 1.2, 1.2, 1.3, 1.2, 1.2, 1.3, 1.2, 1.2

FOR CLUSTER c2
i    CLUSTER CENTER   DATA POINT
1    (5,2)            (4,1)
2    (5,2)            (4,3)
3    (5,2)            (4.6,1.6)
4    (5,2)            (4.6,2.4)
5    (5,2)            (5,2)
6    (5,2)            (5.4,1.6)
7    (5,2)            (5.4,2.4)
8    (5,2)            (6,1)
9    (5,2)            (6,3)
DISTANCE (values as listed; row order not recoverable): 3.2, 3.2, 2.4, 2.4, 2.4, 2.6, 2.6, 2.8, 2.9

FOR CLUSTER c3
i    CLUSTER CENTER   DATA POINT
1    (7.5,2)          (7.5,2.1)
2    (7.5,2)          (7.5,1.8)
DISTANCE (values as listed; row order not recoverable): 0.1
Two further listed DISTANCE values, 0.2 and 1.5, cannot be assigned to a table.
As the distance tables show, no further modifications are possible in
the clusters c1, c2 and c3.
Finally, we get cluster c1 with 9 points, c2 with 9 points and c3
with 2 points, as shown in the figure:
[Figure: plot of the data set with Cluster c1, Cluster c2 and Cluster c3 marked]
REFERENCES
1. Rui Xu and Donald Wunsch, “Survey of clustering algorithms,” IEEE Transactions on
Neural Networks, vol. 16, no. 3, May 2005, pp. 645-678.
2. Mu-Chun Su and Chien-Hsing Chou, “A modified version of the K-means algorithm
with a distance based on cluster symmetry,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. 23, no. 6, June 2001, pp. 674-680.
