IOSR Journal of Computer Engineering (IOSRJCE)
ISSN: 2278-0661, ISBN: 2278-8727 Volume 6, Issue 2 (Sep-Oct. 2012), PP 09-13
www.iosrjournals.org
Hybrid Algorithm for Clustering Mixed Data Sets
V.N. Prasad Pinisetty¹, Ramesh Valaboju², N. Raghava Rao³
¹Department of CSE, DRK College of Engineering & Technology, Hyderabad, Andhra Pradesh, India
²Department of CS, DRK Institute of Science & Technology, Hyderabad, Andhra Pradesh, India (Associate Professor)
³Department of IT, DRK Institute of Science & Technology, Ranga Reddy, Hyderabad, Andhra Pradesh, India
Abstract: Clustering is a data mining technique used to group similar objects into meaningful classes known as clusters. Objects within a cluster have maximum similarity, while objects across clusters have minimum or no similarity. This kind of partitioning of objects into groups has many real-time applications, such as pattern recognition and machine learning. In this paper we review a clustering algorithm based on genetic K-means [1] and compare it with GKMODE and IGKA. The algorithm works well for both numeric and discrete values, whereas existing genetic K-means algorithms are limited to clustering numeric data. The algorithm of [1] overcomes this problem and provides a better way of characterizing clusters. The empirical results show that the reviewed algorithm performs better than GKMODE and IGKA. We also make observations and recommendations on the reviewed algorithm, GKMODE, and IGKA that can help in future enhancements.
Index Terms – Genetic algorithm, data mining, clustering, mixed data
I. Introduction
Classifying a group of data objects into different classes is an essential operation in data mining, and many applications are associated with it, including dissection [2], segmentation and aggregation, and classification and machine learning. In general, a data set in an m-dimensional space may contain k distinct clusters, where each data point within a cluster is highly similar to the other objects in the same cluster and dissimilar to the objects of other clusters. This is the phenomenon that takes place when clustering objects. As discussed in [1], clustering addresses three problems: once a similarity measure is defined, distances between elements can be compared; a clustering algorithm groups the most similar objects into a specific cluster; and a description can be derived that characterizes the objects of a cluster. Many traditional clustering algorithms use the Euclidean distance (ED) to judge the distance between the objects to be clustered [3], [4]. These algorithms work well when the attributes of the data set are numeric, but many of them, including ED-based ones, fail to measure the distance between objects when the attributes are categorical or mixed in nature.
K-means is another algorithm that has traditionally been used to cluster numeric values. It is widely used in the data mining domain for various kinds of applications and is still among the top 10 clustering algorithms [7]. It has many variations to meet changing requirements, but it is traditionally applied to numeric data. As a matter of fact, the data available over the Internet and in academia and industry is abundant in discrete values [5]. Such data comes from sectors such as banking, health care, education, web logs, and biology. The datasets from these domains essentially have mixed attributes; that is, a data set typically has some numeric and some categorical values. Clustering such data into meaningful groups of objects is a challenging job. The ED-based and K-means clustering algorithms discussed above are inadequate in their original form for measuring distances and clustering such data containing mixed attributes into meaningful classes. This is the motivation for the invention of new clustering algorithms that can handle numeric, discrete, or mixed values [6].
Handling mixed content in clustering is an area that needs further research. There are certain strategies that can be used to work with mixed data. One strategy is to convert categorical data into numerical data, either as pre-processing prior to applying a clustering algorithm or as part of the algorithm; however, this is difficult because there are instances where the conversion does not yield meaningful numeric data and does not make sense. Another strategy is the reverse process: convert numerical data into categorical data and apply any known categorical clustering algorithm. However, the discretization may result in a loss of information. To overcome these problems we review a genetic K-means clustering algorithm [1] which can efficiently group large datasets with categorical and numeric values, as such databases are frequently used in real-time data mining applications.
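To make the two strategies concrete, the sketch below shows, in Java, a one-hot encoding of a categorical value and an equal-width discretization of a numeric value. The attribute domain, value range, and number of bins are illustrative assumptions, not taken from [1].

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the two pre-processing strategies discussed above; the attribute
// names, domains, and bin counts are illustrative assumptions only.
public class MixedDataEncoding {

    // Strategy 1: categorical -> numeric via one-hot encoding.
    static double[] oneHot(String value, List<String> domain) {
        double[] encoded = new double[domain.size()];
        int idx = domain.indexOf(value);
        if (idx >= 0) encoded[idx] = 1.0;
        return encoded;
    }

    // Strategy 2: numeric -> categorical via equal-width binning (discretization).
    static int discretize(double value, double min, double max, int bins) {
        if (value >= max) return bins - 1;               // clamp the upper edge
        double width = (max - min) / bins;
        return (int) Math.floor((value - min) / width);
    }

    public static void main(String[] args) {
        List<String> chestPainTypes = Arrays.asList("typical", "atypical", "non-anginal", "asymptomatic");
        System.out.println(Arrays.toString(oneHot("atypical", chestPainTypes))); // [0.0, 1.0, 0.0, 0.0]
        System.out.println(discretize(54.0, 29.0, 77.0, 4));                     // age 54 falls into bin 2
    }
}
```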
Our contributions in this paper are a review of GKMODE, IGKA, and the algorithm of [1] as clustering algorithms that work on numeric and categorical values, and observations that can help improve existing clustering algorithms of this kind.
II. Related Work
Genetic algorithms are evolutionary algorithms used to solve many real-time applications. They are stochastic in nature and can be applied to many real-world problems. They can perform parallel search over complex search spaces. Evolution strategies, evolutionary programming, and genetic algorithms are all members of the family of evolutionary algorithms. Genetic algorithms are used in complex games and applications where certain values have to be determined dynamically at runtime. For instance, in the Mastermind game some values are hidden and the player has to guess them; correctly guessed values are indicated, and the remaining values must be guessed again, possibly over several rounds, until the hidden values are found. The same task can be carried out by a genetic algorithm that makes use of an initial population, genetic operators, and iterations, and finally converges to the correct values. Thus genetic algorithms are used in many practical real-world applications.
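As a rough illustration of the loop just described (initial population, operators, iterations), the following Java skeleton sketches a generic genetic algorithm. The solution type, fitness function, mutation operator, and parameter values are placeholders and are not the specific design of GKA or of [1].

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.ToDoubleFunction;
import java.util.function.UnaryOperator;

// Generic GA skeleton: evolve a population by repeated proportional selection
// and mutation for a fixed number of generations, tracking the best solution.
public class GeneticLoop {

    static <S> S run(List<S> population, ToDoubleFunction<S> fitness,
                     UnaryOperator<S> mutate, int generations, Random rng) {
        S best = population.get(0);
        for (int g = 0; g < generations; g++) {
            double total = 0.0;
            for (S s : population) total += fitness.applyAsDouble(s);   // fitness assumed > 0
            List<S> next = new ArrayList<>();
            for (int i = 0; i < population.size(); i++) {
                // proportional (roulette-wheel) selection
                double r = rng.nextDouble() * total, cum = 0.0;
                S picked = population.get(population.size() - 1);
                for (S s : population) {
                    cum += fitness.applyAsDouble(s);
                    if (r <= cum) { picked = s; break; }
                }
                S child = mutate.apply(picked);                          // mutation operator
                next.add(child);
                if (fitness.applyAsDouble(child) > fitness.applyAsDouble(best)) best = child;
            }
            population = next;
        }
        return best;
    }

    public static void main(String[] args) {
        Random rng = new Random(1);
        // Toy problem: find x = 7 by maximising 1 / (1 + |x - 7|).
        List<Integer> pop = new ArrayList<>(List.of(0, 3, 12, 20));
        Integer best = run(pop, x -> 1.0 / (1.0 + Math.abs(x - 7)),
                           x -> x + rng.nextInt(3) - 1, 200, rng);
        System.out.println(best);   // converges to (or very near) 7
    }
}
```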
Genetic algorithms were originally proposed by Holland. They have been applied to many problems, such as function optimization, and have proved to be good at finding near-optimal solutions. Genetic algorithms are domain independent and robust for evolutionary problems, which has motivated many researchers to work on them. The genetic K-means algorithm (GKA) was proposed by Krishna and Murty [5]; it combines the K-means algorithm with a genetic algorithm and, for this reason, converges to the global optimum faster than either a K-means algorithm or a genetic algorithm alone.
FGKA (Fast Genetic K-means Algorithm) was proposed by Lu et al. [8]. It has several enhancements over GKA, including evaluation with the total within-cluster variation (TWCV), simplification of the mutation operator, and avoidance of the illegal-string elimination overhead. As a result, FGKA is faster than GKA. Nevertheless, FGKA has a significant disadvantage: when the mutation probability is small, it runs into problems and the algorithm becomes more expensive. To overcome this issue, Lu et al. proposed IGKA (Incremental Genetic K-means Algorithm), which outperforms FGKA; the incremental nature of IGKA provides this performance benefit. From this perspective, a hybrid algorithm combining FGKA and IGKA was proposed, which performs exceptionally well when the mutation probability is large. Many GA-based algorithms are based on K-means and work only for numeric data, while another genetic algorithm, GKMODE, works only for discrete data. Therefore, in [1] a genetic K-means algorithm is proposed that works for both numeric and categorical data. This paper reviews [1] and related algorithms, evaluates [1] with a prototype application, and compares its results with those of GKMODE and IGKA. In addition, it provides observations that help enhance genetic clustering algorithms that work for both discrete and numeric data.
III. Genetic K-Means Algorithm
The algorithm proposed in [1] is a clustering algorithm that works for both numeric and categorical data. It is reviewed and experimented with in this paper and then compared with other such algorithms, namely GKMODE and IGKA. A set of observations is then made that helps in further enhancing the evaluation and improvement of these algorithms.
Objective Function
For clustering with the genetic K-means algorithm there are N genes corresponding to N patterns, and a D-dimensional vector is associated with each pattern. The goal of the algorithm [1] is to partition the N patterns into a user-defined number of clusters K. The TWCV (total within-cluster variation), as defined in [6], is

TWCV = \sum_{n=1}^{N} \sum_{d=1}^{D} X_{nd}^{2} - \sum_{k=1}^{K} \frac{1}{Z_{k}} \sum_{d=1}^{D} SF_{kd}^{2}

where X_{nd} is the d-th feature of pattern n, Z_k is the number of patterns in cluster k, and SF_{kd} is the sum of the d-th feature over the patterns in cluster k.
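A minimal sketch of this computation in Java is given below, assuming a purely numeric N x D data matrix and a cluster-assignment array. The variable names and the small example in main are illustrative and are not taken from the paper.

```java
// Minimal sketch of the TWCV formula above. X is the N x D numeric data matrix
// and assignment[n] is the cluster index of pattern n.
public class Twcv {

    static double twcv(double[][] X, int[] assignment, int k) {
        int n = X.length, d = X[0].length;
        double sumSquares = 0.0;                    // sum over n, d of X_nd^2
        double[] clusterSize = new double[k];       // Z_k
        double[][] featureSum = new double[k][d];   // SF_kd

        for (int i = 0; i < n; i++) {
            int c = assignment[i];
            clusterSize[c]++;
            for (int j = 0; j < d; j++) {
                sumSquares += X[i][j] * X[i][j];
                featureSum[c][j] += X[i][j];
            }
        }
        double centroidTerm = 0.0;                  // sum over k of (1/Z_k) * sum over d of SF_kd^2
        for (int c = 0; c < k; c++) {
            if (clusterSize[c] == 0) continue;      // empty cluster contributes nothing
            for (int j = 0; j < d; j++) {
                centroidTerm += featureSum[c][j] * featureSum[c][j] / clusterSize[c];
            }
        }
        return sumSquares - centroidTerm;
    }

    public static void main(String[] args) {
        double[][] X = { {1, 2}, {1, 3}, {8, 8} };
        int[] a = { 0, 0, 1 };
        System.out.println(twcv(X, a, 2)); // 0.5: only cluster 0 has within-cluster spread
    }
}
```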
The closest cluster center is found with the distance

V(d_i, C_j) = \sum_{t=1}^{m_r} \bigl(w_t (d_{it}^{r} - C_{jt}^{r})\bigr)^{2} + \sum_{t=1}^{m_c} \Omega(d_{it}^{c}, C_{jt}^{c})^{2}

where m_r and m_c are the numbers of numeric and categorical attributes, d_{it}^{r} and d_{it}^{c} are the numeric and categorical attribute values of object d_i, C_{jt}^{r} and C_{jt}^{c} are the corresponding components of cluster center C_j, w_t is the weight of the t-th numeric attribute, and \Omega is the distance between two categorical values.
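The sketch below illustrates this mixed distance in Java. The numeric term follows the weighted squared difference above; for the categorical term, the paper's Ω is a data-driven dissimilarity in the spirit of Ahmad and Dey [2], which is replaced here by simple matching (0 if the values agree, 1 otherwise) purely for illustration. All attribute values and weights in main are made up.

```java
// Hedged sketch of the mixed distance: weighted squared differences on the
// numeric attributes plus a categorical term. Simple matching stands in for
// the paper's learned dissimilarity Omega, for illustration only.
public class MixedDistance {

    static double distance(double[] numObj, String[] catObj,
                           double[] numCentre, String[] catCentre,
                           double[] weights) {
        double d = 0.0;
        for (int t = 0; t < numObj.length; t++) {            // numeric part
            double diff = weights[t] * (numObj[t] - numCentre[t]);
            d += diff * diff;
        }
        for (int t = 0; t < catObj.length; t++) {             // categorical part
            double omega = catObj[t].equals(catCentre[t]) ? 0.0 : 1.0;
            d += omega * omega;
        }
        return d;
    }

    public static void main(String[] args) {
        double[] num = { 54.0, 130.0 };
        String[] cat = { "male", "typical" };
        double[] centreNum = { 50.0, 128.0 };
        String[] centreCat = { "male", "atypical" };
        double[] w = { 0.1, 0.05 };
        System.out.println(distance(num, cat, centreNum, centreCat, w)); // ~1.17 = 0.16 + 0.01 + 1.0
    }
}
```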
Selection Operator
Proportional selection is used for the selection process. The population of the next generation is determined by Z independent random experiments; each experiment randomly selects a solution from the current population according to the probability distribution

P_z = \frac{F(S_z)}{\sum_{z'=1}^{Z} F(S_{z'})}, \qquad z = 1, \dots, Z.

The fitness value of a solution is computed as

F(S_z) = \begin{cases} 1.5\,V_{\max} - V(S_z) & \text{if } S_z \text{ is legal} \\ e(S_z) \times F_{\min} & \text{otherwise.} \end{cases}
Hybrid Algorithm for Clustering Mixed Data Sets
www.iosrjournals.org 11 | Page
Here V_max and V_min denote the maximum and minimum values of V(S_z) in the current population; following FGKA, e(S_z) is the legality ratio of S_z and F_min is the smallest fitness value of the legal solutions in the current population.
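A hedged sketch of this selection step follows: roulette-wheel selection over precomputed fitness values F(S_z). In a full implementation the fitness would be computed with the rule above; here the fitness values, population size, and random seed are illustrative.

```java
import java.util.Random;

// Sketch of proportional (roulette-wheel) selection: the next generation is
// built by Z independent draws, each picking solution z with probability
// F(S_z) / sum of all fitness values.
public class ProportionalSelection {

    static int[] select(double[] fitness, Random rng) {
        double total = 0.0;
        for (double f : fitness) total += f;

        int z = fitness.length;
        int[] nextGeneration = new int[z];          // indices of the selected solutions
        for (int i = 0; i < z; i++) {               // Z independent random experiments
            double r = rng.nextDouble() * total;
            double cumulative = 0.0;
            for (int s = 0; s < z; s++) {
                cumulative += fitness[s];
                if (r <= cumulative) { nextGeneration[i] = s; break; }
            }
        }
        return nextGeneration;
    }

    public static void main(String[] args) {
        double[] fitness = { 4.0, 1.0, 3.0, 2.0 };  // F(S_z) for a population of 4
        System.out.println(java.util.Arrays.toString(select(fitness, new Random(42))));
    }
}
```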
Mutation Operator
The mutation operator shakes the algorithm out of a local optimum so that it can move towards the global optimum. For this purpose, a new cluster label for pattern X_n is drawn from the probability distribution

p_k = \frac{1.5\,d_{\max}(X_n) - d(X_n, c_k) + 0.5}{\sum_{k'=1}^{K} \bigl(1.5\,d_{\max}(X_n) - d(X_n, c_{k'}) + 0.5\bigr)}

where d(X_n, c_k) is the distance between X_n and the centroid c_k of the k-th cluster and d_{\max}(X_n) = \max_k d(X_n, c_k).
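The following sketch draws a new cluster label for one pattern from the distribution p_k above, assuming the distances d(X_n, c_k) to all K centroids have already been computed with the mixed distance. The distance values in main are illustrative.

```java
import java.util.Random;

// Sketch of the mutation operator: a new label is sampled from p_k, which
// favours nearby centroids but keeps every cluster reachable.
public class MutationOperator {

    static int mutateLabel(double[] distToCentroids, Random rng) {
        int k = distToCentroids.length;
        double dMax = 0.0;
        for (double d : distToCentroids) dMax = Math.max(dMax, d);

        double[] bias = new double[k];
        double total = 0.0;
        for (int j = 0; j < k; j++) {                    // 1.5*dMax - d(X_n, c_j) + 0.5
            bias[j] = 1.5 * dMax - distToCentroids[j] + 0.5;
            total += bias[j];
        }
        double r = rng.nextDouble() * total;             // sample from p_k = bias_k / sum of biases
        double cumulative = 0.0;
        for (int j = 0; j < k; j++) {
            cumulative += bias[j];
            if (r <= cumulative) return j;
        }
        return k - 1;
    }

    public static void main(String[] args) {
        double[] d = { 0.4, 2.0, 5.0 };                  // the pattern is closest to cluster 0
        System.out.println(mutateLabel(d, new Random(7)));
    }
}
```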
K-Means Operator
The K-means operator is introduced in order to speed up convergence. Given a solution encoded by a_1 … a_N, each a_n is replaced simultaneously, for n = 1, …, N, by a'_n, the number of the cluster whose centroid is closest to X_n in Euclidean distance. To deal with illegal strings, d(X_n, c_k) = +∞ is defined when the k-th cluster is empty; this convention ensures that no pattern is reassigned to an empty cluster.
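A minimal sketch of the K-means operator follows: every pattern label is replaced by the index of the nearest non-empty centroid, using the d = +∞ convention for empty clusters. Plain Euclidean distance is used here for brevity, and the centroids and data in main are illustrative.

```java
// Sketch of the K-means operator: reassign each pattern to its nearest
// centroid; empty clusters never attract patterns (the d = +infinity rule).
public class KMeansOperator {

    static int[] reassign(double[][] X, double[][] centroids, boolean[] empty) {
        int n = X.length, k = centroids.length;
        int[] labels = new int[n];
        for (int i = 0; i < n; i++) {
            double best = Double.POSITIVE_INFINITY;
            for (int c = 0; c < k; c++) {
                double d = empty[c] ? Double.POSITIVE_INFINITY
                                    : squaredEuclidean(X[i], centroids[c]);
                if (d < best) { best = d; labels[i] = c; }
            }
        }
        return labels;
    }

    static double squaredEuclidean(double[] a, double[] b) {
        double s = 0.0;
        for (int j = 0; j < a.length; j++) s += (a[j] - b[j]) * (a[j] - b[j]);
        return s;
    }

    public static void main(String[] args) {
        double[][] X = { {1, 1}, {9, 9}, {8, 9} };
        double[][] centroids = { {1, 1}, {9, 9} };
        boolean[] empty = { false, false };
        System.out.println(java.util.Arrays.toString(reassign(X, centroids, empty))); // [0, 1, 1]
    }
}
```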
IV. Experiments And Results
We experimented with the genetic K-means algorithm [1] using a prototype application built in the Java programming language. The environment used to develop the prototype includes JSE 6.0 (Java Standard Edition) and the NetBeans IDE, running on a PC with 2 GB of RAM and a 2.9 GHz processor under Windows 7. The experiments use real-world datasets such as Vote, Iris, and Heart Disease, obtained from the KDD Cup datasets and the UCI repository. To judge the results accurately, the clusters produced by the genetic K-means algorithm are compared with the ground-truth class labels, and comparatively good results were found. The experiments and results for the Heart Disease dataset are presented in this paper.
The Heart Disease dataset used here contains five numeric and seven categorical attributes. It contains data for healthy people and heart patients, with 164 and 139 instances respectively. The experiments were run with the generation size and population size set to 100 and 50, respectively. The algorithm was run over 100 times and the average values are presented in Table 1. The mutation probability ranges from 0.001 to 0.1.
Cluster No.   Normal   Heart Patient
1             130      29
2             34       110
Table 1 – Clusters obtained from the Heart Disease dataset
The result of the genetic K-means algorithm is presented in Table 1. The results show that the average number of data elements not assigned to their expected cluster is 46, with a standard deviation of 3 for the error.
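For reference, the sketch below shows one way such a comparison against ground truth can be carried out: cluster labels are cross-tabulated against the known classes and the off-diagonal counts, under the better of the two possible cluster-to-class mappings for a two-cluster result, give the number of misplaced elements. The labels in main are toy values, not the Heart Disease results.

```java
// Sketch of a ground-truth comparison for a two-cluster result: build the
// cluster-vs-class contingency table and count the elements outside the
// better of the two possible cluster-to-class mappings.
public class GroundTruthComparison {

    static int misplaced(int[] clusterLabels, int[] classLabels) {
        int[][] table = new int[2][2];                    // rows: cluster, cols: class
        for (int i = 0; i < clusterLabels.length; i++) {
            table[clusterLabels[i]][classLabels[i]]++;
        }
        // mapping A: cluster0->class0, cluster1->class1; mapping B: the swap
        int errorsA = table[0][1] + table[1][0];
        int errorsB = table[0][0] + table[1][1];
        return Math.min(errorsA, errorsB);
    }

    public static void main(String[] args) {
        int[] clusters = { 0, 0, 0, 1, 1, 1, 0, 1 };
        int[] classes  = { 0, 0, 1, 1, 1, 1, 0, 0 };
        System.out.println(misplaced(clusters, classes)); // 2 elements misplaced
    }
}
```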
The clustering accuracy is assessed using the average convergence and the corresponding objective-function value over many generations for different mutation probabilities. In both cases, the reviewed algorithm [1] converges faster than GKMODE and IGKA; the globally optimal clustering is reached in just five generations. The convergence results are presented in Fig. 1.
Fig. 1 – Average optimal corrected Rand index changes [1]
We also review the comparison of the algorithm of [1] with GKMODE and IGKA on the Heart Disease dataset, with the population size set to 50 and the mutation probability set to 0.0001.
Fig. 2 – The performance comparison of [1], GKMODE and IGKA [1]
As can be seen in Fig. 2, the genetic K-means algorithm performs better than GKMODE and IGKA in terms of the time taken for a given number of iterations.
V. Observations And Recommendations
Given the importance and usefulness of clustering data objects with mixed content, that is, numeric and categorical attributes, we have studied several algorithms, including [1], GKMODE, and IGKA. We have also built a prototype application that facilitates evaluating the performance of the genetic K-means algorithm [1] and comparing it with the other algorithms mentioned. Based on the experiments, we present the following observations.
• Currently, the performance comparison is made only in terms of the time taken for a given number of iterations. Time alone may not be sufficient for a complete analysis of an algorithm's performance. Introducing additional measures, such as the number of computations performed and the memory consumed by each algorithm, would give a more thorough analysis of the algorithm's performance when compared with GKMODE and IGKA. These measures are useful for cost analysis, along with time.
• The algorithms can be extended to run on various benchmark datasets, and the comparison results can be analyzed and reported to provide more meaningful insights into the study (e.g., the clusters formed by the algorithm can be shown in a graphical report for a better representation of the datasets).
VI. Conclusion
Real-world databases commonly contain both numeric and discrete values. Numeric values can be used directly in mathematical operations, whereas categorical values cannot. This is the reason many clustering algorithms fail on large datasets that contain attributes of mixed type; the popular Euclidean distance (ED) and K-means clustering algorithms, in their original form, fail to work with mixed content. This is the motivation behind this work. In this paper we review algorithms and strategies such as [1], GKMODE, and IGKA with respect to their performance in clustering mixed data objects. The clustering algorithm proposed in [1] has been tested with a prototype application and compared with the others; this genetic K-means algorithm [1] is more efficient than GKMODE and IGKA. It uses an enhanced cost function and a modified cluster-center definition in order to capture cluster characteristics effectively and efficiently. We also reviewed additional features found in [1], such as minimizing or avoiding the string-elimination overhead, the simplification of the mutation operator, and the use of TWCV. Our first observation is that the performance comparison of the genetic K-means clustering algorithm with GKMODE and IGKA can also be made in terms of the number of computations performed and the memory consumed, for cost analysis, in addition to the existing analysis of time. Our second observation is that the algorithms can be extended to run on various benchmark datasets, and the comparison results can be analyzed and reported to provide more meaningful insights into the study. These two observations pave the way for future work on clustering data objects possessing both categorical and numeric attributes.
Acknowledgements
The authors thank R. V. Krishnaiah, Principal, DRK Institute of Science & Technology, and K. Praveen, Head of the IT Department, for their valuable support.
References
[1] D. K. Roy and L. K. Sharma (2010), 'Genetic K-Means Clustering Algorithm for Mixed Numeric and Categorical Data Sets', International Journal of Artificial Intelligence & Applications (IJAIA), vol. 1, no. 2.
[2] A. Ahmad and L. Dey (2007), 'A k-mean clustering algorithm for mixed numeric and categorical data', Data and Knowledge Engineering, Elsevier, vol. 63, pp. 503-527.
[3] G. Gan, Z. Yang, and J. Wu (2005), 'A Genetic k-Modes Algorithm for Clustering Categorical Data', ADMA, LNAI 3584, pp. 195-202.
[4] J. Z. Huang, M. K. Ng, H. Rong, and Z. Li (2005), 'Automated variable weighting in k-means type clustering', IEEE Transactions on PAMI, vol. 27, no. 5.
[5] K. Krishna and M. Murty (1999), 'Genetic K-Means Algorithm', IEEE Transactions on Systems, Man, and Cybernetics, vol. 29, no. 3, pp. 433-439.
[6] A. Jain, M. Murty and P. Flynn (1999), 'Data clustering: A review', ACM Computing Surveys, vol. 31, no. 3, pp. 264-323.
[7] A. Chaturvedi, P. Green and J. Carroll (2001), 'k-modes clustering', Journal of Classification, vol. 18, pp. 35-55.

More Related Content

What's hot

84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1bPRAWEEN KUMAR
 
Building a Classifier Employing Prism Algorithm with Fuzzy Logic
Building a Classifier Employing Prism Algorithm with Fuzzy LogicBuilding a Classifier Employing Prism Algorithm with Fuzzy Logic
Building a Classifier Employing Prism Algorithm with Fuzzy LogicIJDKP
 
An Improved Differential Evolution Algorithm for Data Stream Clustering
An Improved Differential Evolution Algorithm for Data Stream ClusteringAn Improved Differential Evolution Algorithm for Data Stream Clustering
An Improved Differential Evolution Algorithm for Data Stream ClusteringIJECEIAES
 
Performance Analysis of Genetic Algorithm as a Stochastic Optimization Tool i...
Performance Analysis of Genetic Algorithm as a Stochastic Optimization Tool i...Performance Analysis of Genetic Algorithm as a Stochastic Optimization Tool i...
Performance Analysis of Genetic Algorithm as a Stochastic Optimization Tool i...paperpublications3
 
Literature Survey: Clustering Technique
Literature Survey: Clustering TechniqueLiterature Survey: Clustering Technique
Literature Survey: Clustering TechniqueEditor IJCATR
 
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...IJDKP
 
The pertinent single-attribute-based classifier for small datasets classific...
The pertinent single-attribute-based classifier  for small datasets classific...The pertinent single-attribute-based classifier  for small datasets classific...
The pertinent single-attribute-based classifier for small datasets classific...IJECEIAES
 
A Survey on Fuzzy Association Rule Mining Methodologies
A Survey on Fuzzy Association Rule Mining MethodologiesA Survey on Fuzzy Association Rule Mining Methodologies
A Survey on Fuzzy Association Rule Mining MethodologiesIOSR Journals
 
Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Editor IJARCET
 
A Preference Model on Adaptive Affinity Propagation
A Preference Model on Adaptive Affinity PropagationA Preference Model on Adaptive Affinity Propagation
A Preference Model on Adaptive Affinity PropagationIJECEIAES
 
Document Classification Using Expectation Maximization with Semi Supervised L...
Document Classification Using Expectation Maximization with Semi Supervised L...Document Classification Using Expectation Maximization with Semi Supervised L...
Document Classification Using Expectation Maximization with Semi Supervised L...ijsc
 
An efficient algorithm for privacy
An efficient algorithm for privacyAn efficient algorithm for privacy
An efficient algorithm for privacyIJDKP
 
OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...
OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...
OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...csandit
 
Paper-Allstate-Claim-Severity
Paper-Allstate-Claim-SeverityPaper-Allstate-Claim-Severity
Paper-Allstate-Claim-SeverityGon-soo Moon
 
IRJET- Data Mining - Secure Keyword Manager
IRJET- Data Mining - Secure Keyword ManagerIRJET- Data Mining - Secure Keyword Manager
IRJET- Data Mining - Secure Keyword ManagerIRJET Journal
 
Incremental learning from unbalanced data with concept class, concept drift a...
Incremental learning from unbalanced data with concept class, concept drift a...Incremental learning from unbalanced data with concept class, concept drift a...
Incremental learning from unbalanced data with concept class, concept drift a...IJDKP
 
A Survey Ondecision Tree Learning Algorithms for Knowledge Discovery
A Survey Ondecision Tree Learning Algorithms for Knowledge DiscoveryA Survey Ondecision Tree Learning Algorithms for Knowledge Discovery
A Survey Ondecision Tree Learning Algorithms for Knowledge DiscoveryIJERA Editor
 

What's hot (19)

84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b
 
Building a Classifier Employing Prism Algorithm with Fuzzy Logic
Building a Classifier Employing Prism Algorithm with Fuzzy LogicBuilding a Classifier Employing Prism Algorithm with Fuzzy Logic
Building a Classifier Employing Prism Algorithm with Fuzzy Logic
 
An Improved Differential Evolution Algorithm for Data Stream Clustering
An Improved Differential Evolution Algorithm for Data Stream ClusteringAn Improved Differential Evolution Algorithm for Data Stream Clustering
An Improved Differential Evolution Algorithm for Data Stream Clustering
 
Performance Analysis of Genetic Algorithm as a Stochastic Optimization Tool i...
Performance Analysis of Genetic Algorithm as a Stochastic Optimization Tool i...Performance Analysis of Genetic Algorithm as a Stochastic Optimization Tool i...
Performance Analysis of Genetic Algorithm as a Stochastic Optimization Tool i...
 
Ijetcas14 338
Ijetcas14 338Ijetcas14 338
Ijetcas14 338
 
Literature Survey: Clustering Technique
Literature Survey: Clustering TechniqueLiterature Survey: Clustering Technique
Literature Survey: Clustering Technique
 
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...
 
The pertinent single-attribute-based classifier for small datasets classific...
The pertinent single-attribute-based classifier  for small datasets classific...The pertinent single-attribute-based classifier  for small datasets classific...
The pertinent single-attribute-based classifier for small datasets classific...
 
A Survey on Fuzzy Association Rule Mining Methodologies
A Survey on Fuzzy Association Rule Mining MethodologiesA Survey on Fuzzy Association Rule Mining Methodologies
A Survey on Fuzzy Association Rule Mining Methodologies
 
Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973
 
A Preference Model on Adaptive Affinity Propagation
A Preference Model on Adaptive Affinity PropagationA Preference Model on Adaptive Affinity Propagation
A Preference Model on Adaptive Affinity Propagation
 
Document Classification Using Expectation Maximization with Semi Supervised L...
Document Classification Using Expectation Maximization with Semi Supervised L...Document Classification Using Expectation Maximization with Semi Supervised L...
Document Classification Using Expectation Maximization with Semi Supervised L...
 
An efficient algorithm for privacy
An efficient algorithm for privacyAn efficient algorithm for privacy
An efficient algorithm for privacy
 
EXECUTION OF ASSOCIATION RULE MINING WITH DATA GRIDS IN WEKA 3.8
EXECUTION OF ASSOCIATION RULE MINING WITH DATA GRIDS IN WEKA 3.8EXECUTION OF ASSOCIATION RULE MINING WITH DATA GRIDS IN WEKA 3.8
EXECUTION OF ASSOCIATION RULE MINING WITH DATA GRIDS IN WEKA 3.8
 
OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...
OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...
OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...
 
Paper-Allstate-Claim-Severity
Paper-Allstate-Claim-SeverityPaper-Allstate-Claim-Severity
Paper-Allstate-Claim-Severity
 
IRJET- Data Mining - Secure Keyword Manager
IRJET- Data Mining - Secure Keyword ManagerIRJET- Data Mining - Secure Keyword Manager
IRJET- Data Mining - Secure Keyword Manager
 
Incremental learning from unbalanced data with concept class, concept drift a...
Incremental learning from unbalanced data with concept class, concept drift a...Incremental learning from unbalanced data with concept class, concept drift a...
Incremental learning from unbalanced data with concept class, concept drift a...
 
A Survey Ondecision Tree Learning Algorithms for Knowledge Discovery
A Survey Ondecision Tree Learning Algorithms for Knowledge DiscoveryA Survey Ondecision Tree Learning Algorithms for Knowledge Discovery
A Survey Ondecision Tree Learning Algorithms for Knowledge Discovery
 

Viewers also liked

5 tips for perfect presentations
5 tips for perfect presentations5 tips for perfect presentations
5 tips for perfect presentationsBernd Allgeier
 
Feb 2015 ndta flyer v2
Feb 2015 ndta flyer v2Feb 2015 ndta flyer v2
Feb 2015 ndta flyer v2watersb
 
Genetic Variability, Heritability for Late leaf Spot tolerance and Productivi...
Genetic Variability, Heritability for Late leaf Spot tolerance and Productivi...Genetic Variability, Heritability for Late leaf Spot tolerance and Productivi...
Genetic Variability, Heritability for Late leaf Spot tolerance and Productivi...IOSR Journals
 
Efficient Doubletree: An Algorithm for Large-Scale Topology Discovery
Efficient Doubletree: An Algorithm for Large-Scale Topology DiscoveryEfficient Doubletree: An Algorithm for Large-Scale Topology Discovery
Efficient Doubletree: An Algorithm for Large-Scale Topology DiscoveryIOSR Journals
 
From Changes to Dynamics: Dynamics Analysis of Linked Open Data Sources
From Changes to Dynamics: Dynamics Analysis of Linked Open Data Sources From Changes to Dynamics: Dynamics Analysis of Linked Open Data Sources
From Changes to Dynamics: Dynamics Analysis of Linked Open Data Sources Thomas Gottron
 
The Impact Of Positive Thinking On Your Internet Business
The Impact Of Positive Thinking On Your Internet BusinessThe Impact Of Positive Thinking On Your Internet Business
The Impact Of Positive Thinking On Your Internet Businessbelieve52
 
Proyecto ingles
Proyecto inglesProyecto ingles
Proyecto inglesleidy
 
Quick Identification of Stego Signatures in Images Using Suspicion Value (Sp...
Quick Identification of Stego Signatures in Images Using  Suspicion Value (Sp...Quick Identification of Stego Signatures in Images Using  Suspicion Value (Sp...
Quick Identification of Stego Signatures in Images Using Suspicion Value (Sp...IOSR Journals
 
Sistema de Comando de incidentes, Como organizar recursos
Sistema de Comando de incidentes, Como organizar recursosSistema de Comando de incidentes, Como organizar recursos
Sistema de Comando de incidentes, Como organizar recursosYeison Ramirez
 
Performance analysis of Fuzzy logic based speed control of DC motor
Performance analysis of Fuzzy logic based speed control of DC motorPerformance analysis of Fuzzy logic based speed control of DC motor
Performance analysis of Fuzzy logic based speed control of DC motorIOSR Journals
 

Viewers also liked (20)

F0344451
F0344451F0344451
F0344451
 
F0443041
F0443041F0443041
F0443041
 
5 tips for perfect presentations
5 tips for perfect presentations5 tips for perfect presentations
5 tips for perfect presentations
 
Feb 2015 ndta flyer v2
Feb 2015 ndta flyer v2Feb 2015 ndta flyer v2
Feb 2015 ndta flyer v2
 
H0365158
H0365158H0365158
H0365158
 
Genetic Variability, Heritability for Late leaf Spot tolerance and Productivi...
Genetic Variability, Heritability for Late leaf Spot tolerance and Productivi...Genetic Variability, Heritability for Late leaf Spot tolerance and Productivi...
Genetic Variability, Heritability for Late leaf Spot tolerance and Productivi...
 
Efficient Doubletree: An Algorithm for Large-Scale Topology Discovery
Efficient Doubletree: An Algorithm for Large-Scale Topology DiscoveryEfficient Doubletree: An Algorithm for Large-Scale Topology Discovery
Efficient Doubletree: An Algorithm for Large-Scale Topology Discovery
 
Vis van bij ons
Vis van bij onsVis van bij ons
Vis van bij ons
 
B0160709
B0160709B0160709
B0160709
 
F0443041
F0443041F0443041
F0443041
 
I0925259
I0925259I0925259
I0925259
 
From Changes to Dynamics: Dynamics Analysis of Linked Open Data Sources
From Changes to Dynamics: Dynamics Analysis of Linked Open Data Sources From Changes to Dynamics: Dynamics Analysis of Linked Open Data Sources
From Changes to Dynamics: Dynamics Analysis of Linked Open Data Sources
 
C0421318
C0421318C0421318
C0421318
 
The Impact Of Positive Thinking On Your Internet Business
The Impact Of Positive Thinking On Your Internet BusinessThe Impact Of Positive Thinking On Your Internet Business
The Impact Of Positive Thinking On Your Internet Business
 
Proyecto ingles
Proyecto inglesProyecto ingles
Proyecto ingles
 
E0421931
E0421931E0421931
E0421931
 
Quick Identification of Stego Signatures in Images Using Suspicion Value (Sp...
Quick Identification of Stego Signatures in Images Using  Suspicion Value (Sp...Quick Identification of Stego Signatures in Images Using  Suspicion Value (Sp...
Quick Identification of Stego Signatures in Images Using Suspicion Value (Sp...
 
Sistema de Comando de incidentes, Como organizar recursos
Sistema de Comando de incidentes, Como organizar recursosSistema de Comando de incidentes, Como organizar recursos
Sistema de Comando de incidentes, Como organizar recursos
 
D0961927
D0961927D0961927
D0961927
 
Performance analysis of Fuzzy logic based speed control of DC motor
Performance analysis of Fuzzy logic based speed control of DC motorPerformance analysis of Fuzzy logic based speed control of DC motor
Performance analysis of Fuzzy logic based speed control of DC motor
 

Similar to Hybrid Algorithm for Clustering Mixed Data Sets

Particle Swarm Optimization based K-Prototype Clustering Algorithm
Particle Swarm Optimization based K-Prototype Clustering Algorithm Particle Swarm Optimization based K-Prototype Clustering Algorithm
Particle Swarm Optimization based K-Prototype Clustering Algorithm iosrjce
 
A Novel Approach for Clustering Big Data based on MapReduce
A Novel Approach for Clustering Big Data based on MapReduce A Novel Approach for Clustering Big Data based on MapReduce
A Novel Approach for Clustering Big Data based on MapReduce IJECEIAES
 
Ensemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes ClusteringEnsemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes ClusteringIJERD Editor
 
Estimating project development effort using clustered regression approach
Estimating project development effort using clustered regression approachEstimating project development effort using clustered regression approach
Estimating project development effort using clustered regression approachcsandit
 
ESTIMATING PROJECT DEVELOPMENT EFFORT USING CLUSTERED REGRESSION APPROACH
ESTIMATING PROJECT DEVELOPMENT EFFORT USING CLUSTERED REGRESSION APPROACHESTIMATING PROJECT DEVELOPMENT EFFORT USING CLUSTERED REGRESSION APPROACH
ESTIMATING PROJECT DEVELOPMENT EFFORT USING CLUSTERED REGRESSION APPROACHcscpconf
 
Clustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity MeasureClustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity MeasureIOSR Journals
 
A Comparative Study Of Various Clustering Algorithms In Data Mining
A Comparative Study Of Various Clustering Algorithms In Data MiningA Comparative Study Of Various Clustering Algorithms In Data Mining
A Comparative Study Of Various Clustering Algorithms In Data MiningNatasha Grant
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsDinusha Dilanka
 
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringA Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringIJMER
 
(2016)application of parallel glowworm swarm optimization algorithm for data ...
(2016)application of parallel glowworm swarm optimization algorithm for data ...(2016)application of parallel glowworm swarm optimization algorithm for data ...
(2016)application of parallel glowworm swarm optimization algorithm for data ...Akram Pasha
 
The improved k means with particle swarm optimization
The improved k means with particle swarm optimizationThe improved k means with particle swarm optimization
The improved k means with particle swarm optimizationAlexander Decker
 
Textual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative AnalysisTextual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative AnalysisEditor IJMTER
 
Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Alexander Decker
 
Final Report
Final ReportFinal Report
Final Reportimu409
 
Web Based Fuzzy Clustering Analysis
Web Based Fuzzy Clustering AnalysisWeb Based Fuzzy Clustering Analysis
Web Based Fuzzy Clustering Analysisinventy
 

Similar to Hybrid Algorithm for Clustering Mixed Data Sets (20)

Particle Swarm Optimization based K-Prototype Clustering Algorithm
Particle Swarm Optimization based K-Prototype Clustering Algorithm Particle Swarm Optimization based K-Prototype Clustering Algorithm
Particle Swarm Optimization based K-Prototype Clustering Algorithm
 
I017235662
I017235662I017235662
I017235662
 
A Novel Approach for Clustering Big Data based on MapReduce
A Novel Approach for Clustering Big Data based on MapReduce A Novel Approach for Clustering Big Data based on MapReduce
A Novel Approach for Clustering Big Data based on MapReduce
 
Ensemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes ClusteringEnsemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes Clustering
 
Estimating project development effort using clustered regression approach
Estimating project development effort using clustered regression approachEstimating project development effort using clustered regression approach
Estimating project development effort using clustered regression approach
 
ESTIMATING PROJECT DEVELOPMENT EFFORT USING CLUSTERED REGRESSION APPROACH
ESTIMATING PROJECT DEVELOPMENT EFFORT USING CLUSTERED REGRESSION APPROACHESTIMATING PROJECT DEVELOPMENT EFFORT USING CLUSTERED REGRESSION APPROACH
ESTIMATING PROJECT DEVELOPMENT EFFORT USING CLUSTERED REGRESSION APPROACH
 
Clustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity MeasureClustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity Measure
 
F04463437
F04463437F04463437
F04463437
 
A Comparative Study Of Various Clustering Algorithms In Data Mining
A Comparative Study Of Various Clustering Algorithms In Data MiningA Comparative Study Of Various Clustering Algorithms In Data Mining
A Comparative Study Of Various Clustering Algorithms In Data Mining
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning Algorithms
 
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringA Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
 
(2016)application of parallel glowworm swarm optimization algorithm for data ...
(2016)application of parallel glowworm swarm optimization algorithm for data ...(2016)application of parallel glowworm swarm optimization algorithm for data ...
(2016)application of parallel glowworm swarm optimization algorithm for data ...
 
The improved k means with particle swarm optimization
The improved k means with particle swarm optimizationThe improved k means with particle swarm optimization
The improved k means with particle swarm optimization
 
Big Data Clustering Model based on Fuzzy Gaussian
Big Data Clustering Model based on Fuzzy GaussianBig Data Clustering Model based on Fuzzy Gaussian
Big Data Clustering Model based on Fuzzy Gaussian
 
Textual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative AnalysisTextual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative Analysis
 
Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)
 
A046010107
A046010107A046010107
A046010107
 
Final Report
Final ReportFinal Report
Final Report
 
Web Based Fuzzy Clustering Analysis
Web Based Fuzzy Clustering AnalysisWeb Based Fuzzy Clustering Analysis
Web Based Fuzzy Clustering Analysis
 
Lx3520322036
Lx3520322036Lx3520322036
Lx3520322036
 

More from IOSR Journals (20)

A011140104
A011140104A011140104
A011140104
 
M0111397100
M0111397100M0111397100
M0111397100
 
L011138596
L011138596L011138596
L011138596
 
K011138084
K011138084K011138084
K011138084
 
J011137479
J011137479J011137479
J011137479
 
I011136673
I011136673I011136673
I011136673
 
G011134454
G011134454G011134454
G011134454
 
H011135565
H011135565H011135565
H011135565
 
F011134043
F011134043F011134043
F011134043
 
E011133639
E011133639E011133639
E011133639
 
D011132635
D011132635D011132635
D011132635
 
C011131925
C011131925C011131925
C011131925
 
B011130918
B011130918B011130918
B011130918
 
A011130108
A011130108A011130108
A011130108
 
I011125160
I011125160I011125160
I011125160
 
H011124050
H011124050H011124050
H011124050
 
G011123539
G011123539G011123539
G011123539
 
F011123134
F011123134F011123134
F011123134
 
E011122530
E011122530E011122530
E011122530
 
D011121524
D011121524D011121524
D011121524
 

Recently uploaded

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 

Recently uploaded (20)

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 

Hybrid Algorithm for Clustering Mixed Data Sets

  • 1. IOSR Journal of Computer Engineering (IOSRJCE) ISSN: 2278-0661, ISBN: 2278-8727 Volume 6, Issue 2 (Sep-Oct. 2012), PP 09-13 www.iosrjournals.org www.iosrjournals.org 9 | Page Hybrid Algorithm for Clustering Mixed Data Sets V.N. Prasad Pinisetty1 , Ramesh Valaboju2 , N. Raghava Rao3 1 Department of CSE, DRK College Of Engineering& Technology, Hyderabad, Andhra Pradesh, India 2 Department of CS, DRK Institute Of Science & Technology, Hyderabad, Andhra Pradesh, India Associate Professor 3 Department of IT, DRK Institute Of Science & Technology, Ranga Reddy, Hyderabad, Andhra Pradesh, India Abstract: Clustering is one of the data mining techniques used to group similar objects into different meaningful classes known as clusters. Objects in each cluster have maximum similarity while the objects across the clusters have minimum or no similarity. This kind of partitioning of objects into various groups has many real time applications such as pattern recognition, machine learning and so on. In this paper we review a clustering algorithm based on genetic K-means [1] and compare it with GKMODE and IGKA. The algorithm works well for both numeric and discrete values. The existing genetic K-means algorithms have limitation as they can cluster only numeric data. The algorithm [1] overcomes this problem and provides a better way of characterization of clusters. The empirical results revealed that the performance of the proposed algorithm has been improved. We also make observations and recommendations on the proposed algorithm, GKMODE, and IGKA that help in future enhancements. Index Terms – Generic algorithm, data mining, clustering, mixed data I. Introduction Classifying a group of data objects into different classes is very essential operation in data mining. There are many applications associated with this. They include dissection [2], segmentation and aggregation, classification and machine learning. Generally the data set containing m-dimensional space might have possibility of k unique clusters and each data point within cluster is highly similar to the other objects in the same cluster. The data points when compared with objects of other clusters are not similar or dissimilar. This is the phenomenon that takes place in clustering objects. As discussed in [1] three of the problems that can be solved using clustering technique are when a similarity is measured, it is possible to compare distance between elements; a clustering algorithm helps in obtaining most similar objects into a specific cluster; a description can be derived that can characterize the objects of a cluster in a specific manner. In many traditional clustering algorithms ED (Euclidean Distance) measure is used to judge the distance between objects to be clustered [3], [4]. These algorithms work fine when attributes of data set are of numeric types. Many such algorithms including ED fail to measure distance between objects if the type of attributes is categorical or mixed in nature. K-means is another algorithm that has traditionally been used to cluster numeric values. It is widely used in the data mining domain for various kinds of applications. Still it is in the top 10 clustering algorithms [7]. It has got many variations to meet the changing requirements. It is also traditionally used to work with numeric data though it has variations. As a matter of fact, data available over Internet or academia and industries is abundant in discrete values [5]. 
Such data is from sectors like banking, health care, education, web-logs, biological and so on. The datasets from these domains are essentially having mixed attributes. It does mean that each and every data set is having some numeric and some categorical values. Clustering such data into meaningful groups of objects is a challenging job. The clustering algorithms ED and K-means that have been discussed are inadequate in their original form to measure distances and cluster such data containing mixed attributes into meaningful classes. This is the motivation for the invention of new clustering algorithms that can handle numeric, discrete or mixed values [6]. Handling mixed content with respect to clustering technique is the area that needs further research. There are certain strategies that can be used to work with mixed data. One of the strategies is the conversion of categorical data into numerical data as pre processing prior to the application of clustering algorithm or as part of algorithm; however, it is difficult to do this because, there are instances where the conversion does not given meaningful numeric data and does not make sense. Another strategy is the reverse process as it is quite opposite to the first strategy. The second strategy allows us to convert numerical data into categorical data and apply any known categorical clustering algorithm. However, the discretization may result in the loss of information. To overcome these problems we have used and reviewed a genetic K-means clustering algorithm [1] which can efficiently group huge datasets with categorical and numeric values as such databases are used in real time data mining applications frequently.
  • 2. Hybrid Algorithm for Clustering Mixed Data Sets www.iosrjournals.org 10 | Page Our contributions in this paper include the review of algorithms such as GKMODE, IGKA and [1] as clustering algorithms that work on numeric and categorical values and observations that can help improve the existing clustering algorithms of that kind. II. Related Work Genetic algorithms are basically evolutionary algorithms that are used to solve many real time applications. They are stochastic in nature and can be used to solve many real world problems [5]. They can perform parallel operations on complex search spaces. Evolution strategies, evolutionary programming and genetic algorithms are all part of evolutionary algorithms. Genetic algorithms are used in complex games and applications where certain values are to be evaluated dynamically at runtime. For instance in Master Mind game some values are hidden and the player has to guess values every time. The correctly guessed values are indicated and the other values are to be guessed. This guessing may take place several times until the hidden values are correctly guessed. The same thing can be achieved by using genetic algorithm that makes use of initial population, operators and iterations. Finally it converges into guessing correct values. Thus genetic algorithms are practically used in many real world applications. Holland [5] originally proposed genetic algorithms. They have been applied to many problems such as function organization problems. They are proved to be good for finding near optimal solutions. Genetic algorithms are domain independent and robust in nature for evolutionary problems. This has motivated many researchers to work on genetic algorithms. Genetic K-means algorithm was proposed by Krishna and Murthy which is the combination of K-means algorithm and also a genetic algorithm. For this reason this algorithm converges to the global optimum faster than only K-means algorithms or only genetic algorithms. FGKA (Fast Genetic K-means Cluster Technique) was proposed by Lu et al. (8). It has many enhancements over GKA that includes evaluation with Total Within-Cluster Variation (TWCA). The enhancements include mutation operator simplification and avoiding illegal string elimination overhead. Moreover FGKA is faster than other algorithms such as GKA. Nevertheless, the FGKA algorithm has a significant disadvantage. When the mutation probability is small, it gives problems besides making the algorithm more expensive. In order to overcome this issue Lu et al. proposed IGKA (Incremental Genetic K- means Algorithm) which exceeds the FGKA in performance. The incremental nature of the IGKA provides such performance benefits. When compared with FGA, IGKA performs well. From this perspective the a hybrid algorithm is proposed that is the combination of FGKA and IGKA and performs exceptionally well when the mutation probability is more. Many GA based algorithms are based on K-means and they work only for numeric data. Another generic algorithm known as GKMODE works only for discrete data. Therefore in [1] a Genetic K- Means algorithm is proposed that works for both numeric and categorical data. This paper reviews [1] and related algorithms, evaluate [1] with a prototype application and compare its results with that of GKMODE and IGKA. Besides, this paper also provides observations that help enhance the genetic clustering algorithms that work for both discrete and numeric data. III. 
III. Genetic K-Means Algorithm
The algorithm proposed in [1] is a clustering algorithm that works for both numeric and categorical data. It is reviewed and experimented with in this paper and then compared with two other such algorithms, GKMODE and IGKA. Afterwards a set of observations is made that helps in further enhancing the evaluation and improvement of these algorithms.

Objective Function
For clustering with the genetic K-means algorithm there are N genes and corresponding N patterns, and a D-dimensional vector is associated with each pattern. The goal of algorithm [1] is to partition the N patterns into a user-defined number of clusters K. The TWCV, as defined in [6], is

TWCV = \sum_{n=1}^{N} \sum_{d=1}^{D} X_{nd}^{2} - \sum_{k=1}^{K} \frac{1}{Z_k} \sum_{d=1}^{D} SF_{kd}^{2}

where Z_k is the number of patterns in the k-th cluster and SF_{kd} is the sum of the d-th feature values of the patterns in the k-th cluster. The distance of a pattern d_i from a cluster centre C_j, used to determine the closest cluster centre, is computed as

V(d_i, C_j) = \sum_{t=1}^{m_r} \bigl( w_t (d_{it}^{r} - C_{jt}^{r}) \bigr)^{2} + \sum_{t=1}^{m_c} \Omega(d_{it}^{c}, C_{jt}^{c})^{2}

where m_r and m_c are the numbers of numeric and categorical attributes, w_t is the weight of the t-th numeric attribute, and \Omega is the dissimilarity between categorical values.

Selection Operator
Proportional selection is used for the selection process. The population of the next generation is determined by Z independent random experiments; each experiment randomly selects a solution from the current population according to the probability distribution

P_z = \frac{F(S_z)}{\sum_{z'=1}^{Z} F(S_{z'})}, \quad z = 1, \ldots, Z

The fitness value of a solution is computed as

F(S_z) = \begin{cases} 1.5 \times V_{max} - V(S_z), & \text{if } S_z \text{ is legal} \\ e(S_z) \times F_{min}, & \text{otherwise} \end{cases}
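To make the mixed-attribute distance above concrete, the following is a minimal, illustrative Java sketch. The per-attribute weights w_t follow the formula, but the simple 0/1 mismatch used for the categorical part is only a stand-in for the Ω dissimilarity of [1], and all class, method and variable names are ours rather than the authors'.

```java
// Illustrative sketch of the mixed numeric/categorical distance described above.
// The 0/1 mismatch below is only a stand-in for the Omega dissimilarity of [1].
public class MixedDistance {

    // Squared distance between a pattern and a cluster centre with numeric and categorical parts.
    static double distance(double[] numeric, int[] categorical,
                           double[] centreNumeric, int[] centreCategorical,
                           double[] weights) {
        double v = 0.0;
        for (int t = 0; t < numeric.length; t++) {
            double diff = weights[t] * (numeric[t] - centreNumeric[t]);
            v += diff * diff;                                      // numeric contribution
        }
        for (int t = 0; t < categorical.length; t++) {
            if (categorical[t] != centreCategorical[t]) v += 1.0;  // categorical mismatch
        }
        return v;
    }

    // Index of the centre closest to the given pattern, as used when assigning patterns to clusters.
    static int closestCentre(double[] numeric, int[] categorical,
                             double[][] centresNumeric, int[][] centresCategorical,
                             double[] weights) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int j = 0; j < centresNumeric.length; j++) {
            double d = distance(numeric, categorical,
                                centresNumeric[j], centresCategorical[j], weights);
            if (d < bestDist) { bestDist = d; best = j; }
        }
        return best;
    }

    public static void main(String[] args) {
        // Two toy centres: {age, cholesterol} as the numeric part, {sex, chest-pain type} as the categorical part.
        double[][] centresNum = {{45.0, 210.0}, {60.0, 280.0}};
        int[][] centresCat = {{0, 1}, {1, 2}};
        double[] weights = {1.0, 0.05};          // down-weight the large-valued attribute
        double[] patternNum = {58.0, 275.0};
        int[] patternCat = {1, 2};
        System.out.println("closest centre = "
                + closestCentre(patternNum, patternCat, centresNum, centresCat, weights));
    }
}
```

For the toy pattern in main, the second centre is closest because both categorical values match and the weighted numeric differences are small.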
Here V_{max} and V_{min} denote the maximum and minimum values of V over the current population, respectively.

Mutation Operator
The mutation operator shakes the algorithm out of local optima so that it can move towards the global optimum. For this purpose the probability distribution is computed as

p_k = \frac{1.5 \times d_{max}(X_n) - d(X_n, c_k) + 0.5}{\sum_{k'=1}^{K} \bigl( 1.5 \times d_{max}(X_n) - d(X_n, c_{k'}) + 0.5 \bigr)}

where d(X_n, c_k) is the distance between pattern X_n and the centroid c_k of the k-th cluster and d_{max}(X_n) = \max_k d(X_n, c_k).

K-Means Operator
The K-means operator is introduced in order to speed up convergence. Given the solution encoded by a_1 ... a_N, each a_n (n = 1, ..., N) is simultaneously replaced by a_n', the number of the cluster whose centroid is closest to X_n in Euclidean distance. To deal with illegal strings, d(X_n, c_k) = +\infty is defined when the k-th cluster is empty. A new definition, d(X_n, c_k) = 0 when the k-th cluster is empty, is also given; it is motivated by the fact that it avoids reassigning all patterns to empty clusters.

IV. Experiments And Results
We experimented with the genetic K-means algorithm [1] using a prototype application built in the Java programming language. The environment used to develop the prototype includes JSE 6.0 (Java Standard Edition) and the NetBeans IDE, running on a PC with 2 GB of RAM, a 2.9 GHz processor and the Windows 7 OS. The experiments use real-world datasets such as Vote, Iris and Heart Disease obtained from the KDD Cup datasets and the UCI repository. To judge the results accurately, we compare the clusters produced by the genetic K-means algorithm with the ground-truth class labels, and comparatively good results were found. The experiments and results obtained with the Heart Disease dataset are presented in this paper. The Heart Disease dataset used here contains five numeric and seven categorical attributes, with data pertaining to healthy people and heart patients (164 and 139 instances respectively). The experiments are made with the generation size and population size set to 100 and 50 respectively. The algorithm is run over 100 times and the average values are presented in Table 1. The mutation probability ranges from 0.001 to 0.1.

Cluster No. | Normal | Heart Patient
     1      |  130   |      29
     2      |   34   |     110
Table 1 – Clusters obtained from the Heart Disease dataset

The result of the genetic K-means algorithm is presented in Table 1. The results show that the average number of data elements not assigned to the expected cluster centre is 46, with a standard deviation of 3 for the error. The clustering accuracy is assessed through the average convergence and the corresponding objective function value over many generations for different mutation probabilities. As per the observations, in both cases the algorithm of [1] converges faster than the other algorithms, GKMODE and IGKA, reaching the globally optimal clustering in just five generations. The convergence results are presented in Fig. 1.
Fig. 1 – Average optimal corrected Rand index changes [1]
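As an illustrative sketch (not part of the prototype of [1]) of how a single run such as the one in Table 1 can be scored against the ground-truth labels, the Java fragment below takes the majority class of each cluster as its label and counts the remaining patterns as errors; note that it uses the single-run counts of Table 1 rather than the 100-run averages quoted above.

```java
// Illustrative sketch: scoring a clustering against ground truth from a confusion table like Table 1.
public class ClusterEvaluation {

    public static void main(String[] args) {
        // rows = clusters found by the algorithm, columns = {normal, heart patient},
        // using the counts reported in Table 1.
        int[][] confusion = {
            {130, 29},   // cluster 1
            {34, 110}    // cluster 2
        };
        int total = 0, correct = 0;
        for (int[] row : confusion) {
            int rowMax = 0, rowSum = 0;
            for (int count : row) {
                rowSum += count;
                if (count > rowMax) rowMax = count;   // majority class of the cluster
            }
            total += rowSum;
            correct += rowMax;
        }
        System.out.println("misclassified = " + (total - correct));      // 63 for this run
        System.out.printf("accuracy = %.3f%n", correct / (double) total); // about 0.79
    }
}
```

For the counts in Table 1 this sketch reports 63 misclassified patterns and an accuracy of about 79% for that single run.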
We also review the comparison of algorithm [1] with GKMODE and IGKA on the Heart Disease dataset, with the population size set to 50 and the mutation probability set to 0.0001.
Fig. 2 – Performance comparison of [1], GKMODE and IGKA [1]
As can be seen in Fig. 2, the genetic K-means algorithm shows better performance than GKMODE and IGKA in terms of the time taken for a given number of iterations.

V. Observations And Recommendations
Keeping in mind the importance and usefulness of clustering data objects with mixed content, i.e. numeric and categorical attributes, we have studied several algorithms including [1], GKMODE and IGKA. We have also built a prototype application that facilitates evaluating the performance of the genetic K-means algorithm [1] and comparing it with the other algorithms mentioned. Based on the experiments, we present the following observations.
• Currently the performance comparison of the genetic K-means algorithm is made only in terms of the time taken for a given number of iterations. Time alone may not be sufficient for a complete analysis of the performance of the algorithm. Introducing additional measures, such as the number of computations performed and the memory consumed by each algorithm, would give a more thorough analysis of its performance relative to the GKMODE and IGKA algorithms. These measures are useful for cost analysis alongside time.
• The algorithms can be enhanced to adapt to various benchmark datasets, and the comparison results can be analyzed and reported to provide more meaningful insights into the study (e.g., the clusters formed by the algorithm can be shown in a graphical report for a better representation of the data sets).

VI. Conclusion
Real-world databases almost always contain both numeric and discrete values. Numeric values can be used directly in mathematical operations, whereas categorical values cannot. This is the reason many clustering algorithms fail when working with large datasets that contain attributes of mixed type. The popular Euclidean distance (ED) measure and the K-means clustering algorithm in their original form fail to work with mixed content, and this is the motivation behind taking up this work. In this paper we review several algorithms and strategies, namely [1], GKMODE and IGKA, with respect to their performance in clustering mixed data objects. The clustering algorithm proposed in [1] has been tested with a prototype application and compared with the others; this genetic K-means algorithm [1] is more efficient than GKMODE and IGKA. An enhanced cost function and a modified cluster centre are used in the genetic K-means algorithm in order to capture cluster characteristics effectively and efficiently. We also reviewed additional features of [1] such as minimizing or avoiding the string elimination overhead, simplification of the mutation operator and the use of TWCVs. The observations made here are that the performance comparison can also be made in terms of the number of computations performed and the memory consumed (in addition to the existing analysis of time) by the genetic K-means clustering algorithm when compared with the GKMODE and IGKA algorithms, and that the algorithms can be enhanced to adapt to various benchmark datasets, with the comparison results analyzed and reported to provide more meaningful insights into the study.
These two observations pave the way for future work on clustering data objects possessing both categorical and numeric attributes.

Acknowledgements
The authors thank R. V. Krishnaiah, Principal, DRK Institute of Science & Technology, and K. Praveen, Head of the IT Department, for their valuable support.
References
[1] D. K. Roy and L. K. Sharma (2010), 'Genetic K-Means Clustering Algorithm for Mixed Numeric and Categorical Data Sets', International Journal of Artificial Intelligence & Applications (IJAIA), vol. 1, no. 2.
[2] A. Ahmad and L. Dey (2007), 'A k-mean clustering algorithm for mixed numeric and categorical data', Data and Knowledge Engineering, Elsevier, vol. 63, pp. 503-527.
[3] G. Gan, Z. Yang and J. Wu (2005), 'A Genetic k-Modes Algorithm for Clustering Categorical Data', ADMA, LNAI 3584, pp. 195-202.
[4] J. Z. Huang, M. K. Ng, H. Rong and Z. Li (2005), 'Automated variable weighting in k-means type clustering', IEEE Transactions on PAMI, vol. 27, no. 5.
[5] K. Krishna and M. Murty (1999), 'Genetic K-Means Algorithm', IEEE Transactions on Systems, Man, and Cybernetics, vol. 29, no. 3, pp. 433-439.
[6] A. Jain, M. Murty and P. Flynn (1999), 'Data clustering: A review', ACM Computing Surveys, vol. 31, no. 3, pp. 264-323.
[7] A. Chaturvedi, P. Green and J. Carroll (2001), 'k-modes clustering', Journal of Classification, vol. 18, pp. 35-55.