SlideShare a Scribd company logo
1 of 4
Download to read offline
Preeti Kumari Int. Journal of Engineering Research and Applications www.ijera.com
ISSN : 2248-9622, Vol. 4, Issue 6( Version 2), June 2014, pp.48-51
www.ijera.com 48 | P a g e
Selection of Significant Features Using Decision Tree Classifiers
Preeti Kumari1
, K. Rajeswari2
,
1
M.E Student, PCCOE, Pune
2
Ph.D Research Scholar, SASTRA and Associate professor, PCCOE, Pune, India
Abstract
Data Mining refers to extraction or mining knowledge from huge volume of data. Classification is an important
data mining technique with various applications. It classifies data of various kinds and used to classify each item
in a set of data into one of predefined set of classes or groups. This work has been carried out to select the most
significant attributes from a dataset using Decision tree classifier j48.This Decision tree classification algorithm
can be efficiently used in selecting the most significant feature of a dataset and hence we can reduce the
dimensionality of a dataset and yet obtain a very good accuracy. In this paper we have selected two datasets
from university of California, Irvine website, of weather and vote and applied j48 algorithm on it. After each
iteration, one attribute is removed and its accuracy has been found .We have compared the accuracy of dataset
with different attribute count and then selected the most significant features of dataset.
I. Literature review
1.1 Data mining
Data mining (sometimes called data or
knowledge discovery) is the process of analyzing
data from different perspectives and summarizing it
into useful information - information that can be used
to increase revenue, cuts costs, or both. Data mining
is a tool for analyzing data allowing users to analyze
data from many different dimensions or angles,
categorize it, and summarize the relationships
identified. Technically, data mining is the process of
finding correlations or patterns among dozens of
fields in large relational databases.
Classification is a data mining function that
assigns items in a collection to target categories or
classes. The goal of classification is to accurately
predict the target class for each case in the data. For
example, a classification model could be used to
identify loan applicants as low, medium, or high
credit risks. Classification models are tested by
comparing the predicted values to known target
values in a set of test data. The historical data for a
classification project is typically divided into two
data sets: one for building the model; the other for
testing the model to classify a given instance.
1.2 Feature selection
Feature selection is a term commonly used in
data mining to describe the tools and techniques
available for reducing inputs to a manageable size for
processing and analysis. Feature selection implies
not only cardinality reduction, which means imposing
an arbitrary or predefined cutoff on the number of
attributes that can be considered when building a
model, but also the choice of attributes, meaning that
either the analyst or the modeling tool actively selects
or discards attributes based on their usefulness for
analysis. The ability to apply feature selection is
critical for effective analysis, because datasets
frequently contain far more information than is
needed to build the model. For example, a dataset
might contain 500 columns that describe the
characteristics of customers, but if the data in some
of the columns is very sparse you would gain very
little benefit from adding them to the model. If you
keep the unneeded columns while building the
model, more CPU and memory are required during
the training process, and more storage space is
required for the completed model. Selection of
significant features of a dataset is an important task in
any application.
1.3 Decision tree classifier J48
J48 are the improved versions of C4.5 algorithms
or can be called as optimized implementation of the
C4.5. The output of J48 is the Decision tree. A
Decision tree is similar to the tree structure having
root node, intermediate nodes and leaf node. Each
node in the tree consist a decision and that decision
leads to our result. Decision tree divide the input
space of a data set into mutually exclusive areas, each
area having a label, a value or an action to describe
its data points. Splitting criterion is used to calculate
which attribute is the best to split that portion tree of
the training data that reaches a particular node.
II. Methodology
In Weka datasets should be formatted to the
ARFF format. The Weka Explorer will use these
automatically if it does not recognize a given file as
an ARFF file, the Preprocess panel has facilities for
importing data from a database, and for
preprocessing this data using a filtering algorithm.
These filters can be used to transform the data and
RESEARCH ARTICLE OPEN ACCESS
Preeti Kumari Int. Journal of Engineering Research and Applications www.ijera.com
ISSN : 2248-9622, Vol. 4, Issue 6( Version 2), June 2014, pp.48-51
www.ijera.com 49 | P a g e
make it possible to delete instances and attributes
according to specific criteria Datasets
lassify tab is used for the classification purpose. We
have selected the decision tree classifier J48 to obtain
the results.
Step 1: Take the input dataset descretize it properly.
Step 2: Apply the decision tree classifier algorithm
J48 on the whole data set and note the accuracy given
by it.
Step 3: Reduce the dimension of the given dataset by
eliminating one or more attribute pair and note the
accuracy of dataset after applying J48 on it.
Step 4: Repeat step 4 for all combination of
attributes.
Step 5: Compare the different accuracy provided by
the dataset of different attribute taken together and
identify the significant feature of the dataset
III. Results and Discussion
1) Weather dataset
Preeti Kumari Int. Journal of Engineering Research and Applications www.ijera.com
ISSN : 2248-9622, Vol. 4, Issue 6( Version 2), June 2014, pp.48-51
www.ijera.com 50 | P a g e
Best attribute: Temperature
Here we have selected temperature is the best attribute which alone gives the accuracy of 64.28 %
which is far better than the accuracy provided by the four attribute taken together 50%
2) Vote dataset
Best attribute: physician-free-freeze
For the vote dataset, which consist of 16 attributes which together gives accuracy of 96.32%, after
applying the algorithm we have found out that even single attribute physician-free-freeze gives the
accuracy of 95.63%, which is very efficient. Hence for this dataset it can be considered as the best
attribute.
IV. Conclusion
Thus in this paper we have tried to select best
significant feature of data set using repeatedly
applying the decision tree classification algorithm
J48 over the dataset and found that for weather
dataset temperature is the best attribute which alone
gives the accuracy of 64.28 % which is far better than
the accuracy provided by the four attribute taken
together 50% .For the vote dataset, which consist of
16 attributes which together gives accuracy of
96.32%, after applying the algorithm we have found
out that even single attribute physician-free-freeze
gives the accuracy of 95.63%, which is very efficient.
Hence for this dataset it can be considered as the best
attribute. Future work will focus on analysis on the
feature selected with various other techniques like
principle component analysis, genetic algorithms.
Preeti Kumari Int. Journal of Engineering Research and Applications www.ijera.com
ISSN : 2248-9622, Vol. 4, Issue 6( Version 2), June 2014, pp.48-51
www.ijera.com 51 | P a g e
References
[1]. Olivier Henchiri , Nathalie Japkowicz. “A
Feature Selection and Evaluation Scheme
for Computer Virus Detection, Proceedings
of the Sixth International Conference on
Data Mining “, (ICDM'06).
[2] M. Dash, H. Liu, Feature Selection for
Classification, Intelligent Data Analysis 1
(1997) 131-156.
[3] Frank,A. & Asuncion ,A. (2010).UCI
Machine Learning Repository
[http://archive.ics.uci.edu/ml].Irvine,CA:
University of California, School of
Information and Computer Science.
[4] Data mining book – kamber
[5] V.Vaithiyanathan, K.Rajeswari, Swati
Tonge,, “ Improved Apriori algorithm based
on Selection Criterion”, IEEE Conference
ICCIC Dec 18-20, 2012, Coimbatore.
Catalog Number: CFP1220J-ART ISBN:
978-1-4673-1344-5
[6] Machine Learning Group at the University
of Waikato:http://www.cs.waikato.ac.nz/ml
/weka/

More Related Content

What's hot

Selecting the correct Data Mining Method: Classification & InDaMiTe-R
Selecting the correct Data Mining Method: Classification & InDaMiTe-RSelecting the correct Data Mining Method: Classification & InDaMiTe-R
Selecting the correct Data Mining Method: Classification & InDaMiTe-RIOSR Journals
 
11.software modules clustering an effective approach for reusability
11.software modules clustering an effective approach for  reusability11.software modules clustering an effective approach for  reusability
11.software modules clustering an effective approach for reusabilityAlexander Decker
 
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEYCLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEYEditor IJMTER
 
Introduction to feature subset selection method
Introduction to feature subset selection methodIntroduction to feature subset selection method
Introduction to feature subset selection methodIJSRD
 
Classification By Clustering Based On Adjusted Cluster
Classification By Clustering Based On Adjusted ClusterClassification By Clustering Based On Adjusted Cluster
Classification By Clustering Based On Adjusted ClusterIOSR Journals
 
A Survey on Constellation Based Attribute Selection Method for High Dimension...
A Survey on Constellation Based Attribute Selection Method for High Dimension...A Survey on Constellation Based Attribute Selection Method for High Dimension...
A Survey on Constellation Based Attribute Selection Method for High Dimension...IJERA Editor
 
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary AlgorithmAutomatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithmaciijournal
 
The pertinent single-attribute-based classifier for small datasets classific...
The pertinent single-attribute-based classifier  for small datasets classific...The pertinent single-attribute-based classifier  for small datasets classific...
The pertinent single-attribute-based classifier for small datasets classific...IJECEIAES
 
Effective data mining for proper
Effective data mining for properEffective data mining for proper
Effective data mining for properIJDKP
 
Comparative study of various supervisedclassification methodsforanalysing def...
Comparative study of various supervisedclassification methodsforanalysing def...Comparative study of various supervisedclassification methodsforanalysing def...
Comparative study of various supervisedclassification methodsforanalysing def...eSAT Publishing House
 
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSEA CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSEIJDKP
 
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary AlgorithmAutomatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithmaciijournal
 
A statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environmentA statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environmentIJDKP
 
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARECLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CAREijistjournal
 
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...theijes
 
An efficient feature selection in
An efficient feature selection inAn efficient feature selection in
An efficient feature selection incsandit
 
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETSA HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETSEditor IJCATR
 
A Threshold fuzzy entropy based feature selection method applied in various b...
A Threshold fuzzy entropy based feature selection method applied in various b...A Threshold fuzzy entropy based feature selection method applied in various b...
A Threshold fuzzy entropy based feature selection method applied in various b...IJMER
 

What's hot (20)

Selecting the correct Data Mining Method: Classification & InDaMiTe-R
Selecting the correct Data Mining Method: Classification & InDaMiTe-RSelecting the correct Data Mining Method: Classification & InDaMiTe-R
Selecting the correct Data Mining Method: Classification & InDaMiTe-R
 
11.software modules clustering an effective approach for reusability
11.software modules clustering an effective approach for  reusability11.software modules clustering an effective approach for  reusability
11.software modules clustering an effective approach for reusability
 
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEYCLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
 
Introduction to feature subset selection method
Introduction to feature subset selection methodIntroduction to feature subset selection method
Introduction to feature subset selection method
 
Classification By Clustering Based On Adjusted Cluster
Classification By Clustering Based On Adjusted ClusterClassification By Clustering Based On Adjusted Cluster
Classification By Clustering Based On Adjusted Cluster
 
A Survey on Constellation Based Attribute Selection Method for High Dimension...
A Survey on Constellation Based Attribute Selection Method for High Dimension...A Survey on Constellation Based Attribute Selection Method for High Dimension...
A Survey on Constellation Based Attribute Selection Method for High Dimension...
 
IJET-V2I6P32
IJET-V2I6P32IJET-V2I6P32
IJET-V2I6P32
 
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary AlgorithmAutomatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
 
M43016571
M43016571M43016571
M43016571
 
The pertinent single-attribute-based classifier for small datasets classific...
The pertinent single-attribute-based classifier  for small datasets classific...The pertinent single-attribute-based classifier  for small datasets classific...
The pertinent single-attribute-based classifier for small datasets classific...
 
Effective data mining for proper
Effective data mining for properEffective data mining for proper
Effective data mining for proper
 
Comparative study of various supervisedclassification methodsforanalysing def...
Comparative study of various supervisedclassification methodsforanalysing def...Comparative study of various supervisedclassification methodsforanalysing def...
Comparative study of various supervisedclassification methodsforanalysing def...
 
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSEA CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE
 
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary AlgorithmAutomatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
 
A statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environmentA statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environment
 
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARECLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
 
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...
 
An efficient feature selection in
An efficient feature selection inAn efficient feature selection in
An efficient feature selection in
 
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETSA HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
 
A Threshold fuzzy entropy based feature selection method applied in various b...
A Threshold fuzzy entropy based feature selection method applied in various b...A Threshold fuzzy entropy based feature selection method applied in various b...
A Threshold fuzzy entropy based feature selection method applied in various b...
 

Viewers also liked

Apresentação Casulo
Apresentação CasuloApresentação Casulo
Apresentação CasuloYaraTerra
 
Inter Biz # Walmart
Inter Biz # WalmartInter Biz # Walmart
Inter Biz # Walmartmonsadako
 
Aula roteiro 2011.1 - 1o.per. - 2a.parte (alunos)
Aula roteiro   2011.1 - 1o.per. - 2a.parte (alunos)Aula roteiro   2011.1 - 1o.per. - 2a.parte (alunos)
Aula roteiro 2011.1 - 1o.per. - 2a.parte (alunos)informaticafdv
 
Will Gallows & o Troll Barriga de Serpente
Will Gallows & o Troll Barriga de SerpenteWill Gallows & o Troll Barriga de Serpente
Will Gallows & o Troll Barriga de SerpenteNanda Soares
 
Arch Linux: Uma distribuição leve e simples - Érico Nunes
Arch Linux: Uma distribuição leve e simples - Érico NunesArch Linux: Uma distribuição leve e simples - Érico Nunes
Arch Linux: Uma distribuição leve e simples - Érico NunesTchelinux
 
Gisela Espinosa. Coloquio Regiones, 2010
Gisela Espinosa. Coloquio Regiones, 2010Gisela Espinosa. Coloquio Regiones, 2010
Gisela Espinosa. Coloquio Regiones, 2010Pro Regiones
 
As cores do mundo
As cores do mundoAs cores do mundo
As cores do mundovaltermn
 
Derecho politico limitado
Derecho politico limitadoDerecho politico limitado
Derecho politico limitadoHugo Araujo
 

Viewers also liked (20)

Apresentação Casulo
Apresentação CasuloApresentação Casulo
Apresentação Casulo
 
Inter Biz # Walmart
Inter Biz # WalmartInter Biz # Walmart
Inter Biz # Walmart
 
Classroom
ClassroomClassroom
Classroom
 
Empleate y ocupate
Empleate y ocupateEmpleate y ocupate
Empleate y ocupate
 
Case de Sucesso Symnetics: Brasil Telecom
Case de Sucesso Symnetics: Brasil TelecomCase de Sucesso Symnetics: Brasil Telecom
Case de Sucesso Symnetics: Brasil Telecom
 
Aula roteiro 2011.1 - 1o.per. - 2a.parte (alunos)
Aula roteiro   2011.1 - 1o.per. - 2a.parte (alunos)Aula roteiro   2011.1 - 1o.per. - 2a.parte (alunos)
Aula roteiro 2011.1 - 1o.per. - 2a.parte (alunos)
 
Case de Sucesso Symnetics: Petrobras
Case de Sucesso Symnetics: PetrobrasCase de Sucesso Symnetics: Petrobras
Case de Sucesso Symnetics: Petrobras
 
18 cartões anjos azuis
18 cartões anjos azuis18 cartões anjos azuis
18 cartões anjos azuis
 
Webquest
WebquestWebquest
Webquest
 
Will Gallows & o Troll Barriga de Serpente
Will Gallows & o Troll Barriga de SerpenteWill Gallows & o Troll Barriga de Serpente
Will Gallows & o Troll Barriga de Serpente
 
Lesson5
Lesson5Lesson5
Lesson5
 
Arch Linux: Uma distribuição leve e simples - Érico Nunes
Arch Linux: Uma distribuição leve e simples - Érico NunesArch Linux: Uma distribuição leve e simples - Érico Nunes
Arch Linux: Uma distribuição leve e simples - Érico Nunes
 
Gisela Espinosa. Coloquio Regiones, 2010
Gisela Espinosa. Coloquio Regiones, 2010Gisela Espinosa. Coloquio Regiones, 2010
Gisela Espinosa. Coloquio Regiones, 2010
 
2 festa junina 2013
2 festa junina 20132 festa junina 2013
2 festa junina 2013
 
As cores do mundo
As cores do mundoAs cores do mundo
As cores do mundo
 
Wethebiocloud.programa.2013
Wethebiocloud.programa.2013Wethebiocloud.programa.2013
Wethebiocloud.programa.2013
 
El comercio electrónico.odt
El comercio electrónico.odtEl comercio electrónico.odt
El comercio electrónico.odt
 
Derecho politico limitado
Derecho politico limitadoDerecho politico limitado
Derecho politico limitado
 
Gestão para a organização da copa 2014
Gestão para a organização da copa 2014Gestão para a organização da copa 2014
Gestão para a organização da copa 2014
 
Mecanismos de estímulo às pessoas para inovação
Mecanismos de estímulo às pessoas para inovaçãoMecanismos de estímulo às pessoas para inovação
Mecanismos de estímulo às pessoas para inovação
 

Similar to G046024851

AN EFFICIENT FEATURE SELECTION IN CLASSIFICATION OF AUDIO FILES
AN EFFICIENT FEATURE SELECTION IN CLASSIFICATION OF AUDIO FILESAN EFFICIENT FEATURE SELECTION IN CLASSIFICATION OF AUDIO FILES
AN EFFICIENT FEATURE SELECTION IN CLASSIFICATION OF AUDIO FILEScscpconf
 
Study on Relavance Feature Selection Methods
Study on Relavance Feature Selection MethodsStudy on Relavance Feature Selection Methods
Study on Relavance Feature Selection MethodsIRJET Journal
 
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...IRJET Journal
 
Clustering of Big Data Using Different Data-Mining Techniques
Clustering of Big Data Using Different Data-Mining TechniquesClustering of Big Data Using Different Data-Mining Techniques
Clustering of Big Data Using Different Data-Mining TechniquesIRJET Journal
 
IRJET- Deep Learning Model to Predict Hardware Performance
IRJET- Deep Learning Model to Predict Hardware PerformanceIRJET- Deep Learning Model to Predict Hardware Performance
IRJET- Deep Learning Model to Predict Hardware PerformanceIRJET Journal
 
IRJET- Analysis of PV Fed Vector Controlled Induction Motor Drive
IRJET- Analysis of PV Fed Vector Controlled Induction Motor DriveIRJET- Analysis of PV Fed Vector Controlled Induction Motor Drive
IRJET- Analysis of PV Fed Vector Controlled Induction Motor DriveIRJET Journal
 
Distributed Digital Artifacts on the Semantic Web
Distributed Digital Artifacts on the Semantic WebDistributed Digital Artifacts on the Semantic Web
Distributed Digital Artifacts on the Semantic WebEditor IJCATR
 
Feature Subset Selection for High Dimensional Data using Clustering Techniques
Feature Subset Selection for High Dimensional Data using Clustering TechniquesFeature Subset Selection for High Dimensional Data using Clustering Techniques
Feature Subset Selection for High Dimensional Data using Clustering TechniquesIRJET Journal
 
Efficient classification of big data using vfdt (very fast decision tree)
Efficient classification of big data using vfdt (very fast decision tree)Efficient classification of big data using vfdt (very fast decision tree)
Efficient classification of big data using vfdt (very fast decision tree)eSAT Journals
 
IRJET-Scaling Distributed Associative Classifier using Big Data
IRJET-Scaling Distributed Associative Classifier using Big DataIRJET-Scaling Distributed Associative Classifier using Big Data
IRJET-Scaling Distributed Associative Classifier using Big DataIRJET Journal
 
Data Mining Module 2 Business Analytics.
Data Mining Module 2 Business Analytics.Data Mining Module 2 Business Analytics.
Data Mining Module 2 Business Analytics.Jayanti Pande
 
A Firefly based improved clustering algorithm
A Firefly based improved clustering algorithmA Firefly based improved clustering algorithm
A Firefly based improved clustering algorithmIRJET Journal
 
Review of Existing Methods in K-means Clustering Algorithm
Review of Existing Methods in K-means Clustering AlgorithmReview of Existing Methods in K-means Clustering Algorithm
Review of Existing Methods in K-means Clustering AlgorithmIRJET Journal
 
IRJET- Study and Evaluation of Classification Algorithms in Data Mining
IRJET- Study and Evaluation of Classification Algorithms in Data MiningIRJET- Study and Evaluation of Classification Algorithms in Data Mining
IRJET- Study and Evaluation of Classification Algorithms in Data MiningIRJET Journal
 
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINEA NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINEaciijournal
 
Advanced Computational Intelligence: An International Journal (ACII)
Advanced Computational Intelligence: An International Journal (ACII)Advanced Computational Intelligence: An International Journal (ACII)
Advanced Computational Intelligence: An International Journal (ACII)aciijournal
 
An experimental study on hypothyroid using rotation forest
An experimental study on hypothyroid using rotation forestAn experimental study on hypothyroid using rotation forest
An experimental study on hypothyroid using rotation forestIJDKP
 
Predicting students' performance using id3 and c4.5 classification algorithms
Predicting students' performance using id3 and c4.5 classification algorithmsPredicting students' performance using id3 and c4.5 classification algorithms
Predicting students' performance using id3 and c4.5 classification algorithmsIJDKP
 

Similar to G046024851 (20)

AN EFFICIENT FEATURE SELECTION IN CLASSIFICATION OF AUDIO FILES
AN EFFICIENT FEATURE SELECTION IN CLASSIFICATION OF AUDIO FILESAN EFFICIENT FEATURE SELECTION IN CLASSIFICATION OF AUDIO FILES
AN EFFICIENT FEATURE SELECTION IN CLASSIFICATION OF AUDIO FILES
 
Ijetcas14 338
Ijetcas14 338Ijetcas14 338
Ijetcas14 338
 
Study on Relavance Feature Selection Methods
Study on Relavance Feature Selection MethodsStudy on Relavance Feature Selection Methods
Study on Relavance Feature Selection Methods
 
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
 
Clustering of Big Data Using Different Data-Mining Techniques
Clustering of Big Data Using Different Data-Mining TechniquesClustering of Big Data Using Different Data-Mining Techniques
Clustering of Big Data Using Different Data-Mining Techniques
 
IRJET- Deep Learning Model to Predict Hardware Performance
IRJET- Deep Learning Model to Predict Hardware PerformanceIRJET- Deep Learning Model to Predict Hardware Performance
IRJET- Deep Learning Model to Predict Hardware Performance
 
IRJET- Analysis of PV Fed Vector Controlled Induction Motor Drive
IRJET- Analysis of PV Fed Vector Controlled Induction Motor DriveIRJET- Analysis of PV Fed Vector Controlled Induction Motor Drive
IRJET- Analysis of PV Fed Vector Controlled Induction Motor Drive
 
Distributed Digital Artifacts on the Semantic Web
Distributed Digital Artifacts on the Semantic WebDistributed Digital Artifacts on the Semantic Web
Distributed Digital Artifacts on the Semantic Web
 
Feature Subset Selection for High Dimensional Data using Clustering Techniques
Feature Subset Selection for High Dimensional Data using Clustering TechniquesFeature Subset Selection for High Dimensional Data using Clustering Techniques
Feature Subset Selection for High Dimensional Data using Clustering Techniques
 
Efficient classification of big data using vfdt (very fast decision tree)
Efficient classification of big data using vfdt (very fast decision tree)Efficient classification of big data using vfdt (very fast decision tree)
Efficient classification of big data using vfdt (very fast decision tree)
 
IRJET-Scaling Distributed Associative Classifier using Big Data
IRJET-Scaling Distributed Associative Classifier using Big DataIRJET-Scaling Distributed Associative Classifier using Big Data
IRJET-Scaling Distributed Associative Classifier using Big Data
 
Data Mining Module 2 Business Analytics.
Data Mining Module 2 Business Analytics.Data Mining Module 2 Business Analytics.
Data Mining Module 2 Business Analytics.
 
A Firefly based improved clustering algorithm
A Firefly based improved clustering algorithmA Firefly based improved clustering algorithm
A Firefly based improved clustering algorithm
 
Review of Existing Methods in K-means Clustering Algorithm
Review of Existing Methods in K-means Clustering AlgorithmReview of Existing Methods in K-means Clustering Algorithm
Review of Existing Methods in K-means Clustering Algorithm
 
IRJET- Study and Evaluation of Classification Algorithms in Data Mining
IRJET- Study and Evaluation of Classification Algorithms in Data MiningIRJET- Study and Evaluation of Classification Algorithms in Data Mining
IRJET- Study and Evaluation of Classification Algorithms in Data Mining
 
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINEA NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE
 
Advanced Computational Intelligence: An International Journal (ACII)
Advanced Computational Intelligence: An International Journal (ACII)Advanced Computational Intelligence: An International Journal (ACII)
Advanced Computational Intelligence: An International Journal (ACII)
 
An experimental study on hypothyroid using rotation forest
An experimental study on hypothyroid using rotation forestAn experimental study on hypothyroid using rotation forest
An experimental study on hypothyroid using rotation forest
 
Predicting students' performance using id3 and c4.5 classification algorithms
Predicting students' performance using id3 and c4.5 classification algorithmsPredicting students' performance using id3 and c4.5 classification algorithms
Predicting students' performance using id3 and c4.5 classification algorithms
 
Hx3115011506
Hx3115011506Hx3115011506
Hx3115011506
 

Recently uploaded

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 

Recently uploaded (20)

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 

G046024851

  • 1. Preeti Kumari Int. Journal of Engineering Research and Applications www.ijera.com ISSN : 2248-9622, Vol. 4, Issue 6( Version 2), June 2014, pp.48-51 www.ijera.com 48 | P a g e Selection of Significant Features Using Decision Tree Classifiers Preeti Kumari1 , K. Rajeswari2 , 1 M.E Student, PCCOE, Pune 2 Ph.D Research Scholar, SASTRA and Associate professor, PCCOE, Pune, India Abstract Data Mining refers to extraction or mining knowledge from huge volume of data. Classification is an important data mining technique with various applications. It classifies data of various kinds and used to classify each item in a set of data into one of predefined set of classes or groups. This work has been carried out to select the most significant attributes from a dataset using Decision tree classifier j48.This Decision tree classification algorithm can be efficiently used in selecting the most significant feature of a dataset and hence we can reduce the dimensionality of a dataset and yet obtain a very good accuracy. In this paper we have selected two datasets from university of California, Irvine website, of weather and vote and applied j48 algorithm on it. After each iteration, one attribute is removed and its accuracy has been found .We have compared the accuracy of dataset with different attribute count and then selected the most significant features of dataset. I. Literature review 1.1 Data mining Data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cuts costs, or both. Data mining is a tool for analyzing data allowing users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. Classification is a data mining function that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. For example, a classification model could be used to identify loan applicants as low, medium, or high credit risks. Classification models are tested by comparing the predicted values to known target values in a set of test data. The historical data for a classification project is typically divided into two data sets: one for building the model; the other for testing the model to classify a given instance. 1.2 Feature selection Feature selection is a term commonly used in data mining to describe the tools and techniques available for reducing inputs to a manageable size for processing and analysis. Feature selection implies not only cardinality reduction, which means imposing an arbitrary or predefined cutoff on the number of attributes that can be considered when building a model, but also the choice of attributes, meaning that either the analyst or the modeling tool actively selects or discards attributes based on their usefulness for analysis. The ability to apply feature selection is critical for effective analysis, because datasets frequently contain far more information than is needed to build the model. For example, a dataset might contain 500 columns that describe the characteristics of customers, but if the data in some of the columns is very sparse you would gain very little benefit from adding them to the model. If you keep the unneeded columns while building the model, more CPU and memory are required during the training process, and more storage space is required for the completed model. Selection of significant features of a dataset is an important task in any application. 1.3 Decision tree classifier J48 J48 are the improved versions of C4.5 algorithms or can be called as optimized implementation of the C4.5. The output of J48 is the Decision tree. A Decision tree is similar to the tree structure having root node, intermediate nodes and leaf node. Each node in the tree consist a decision and that decision leads to our result. Decision tree divide the input space of a data set into mutually exclusive areas, each area having a label, a value or an action to describe its data points. Splitting criterion is used to calculate which attribute is the best to split that portion tree of the training data that reaches a particular node. II. Methodology In Weka datasets should be formatted to the ARFF format. The Weka Explorer will use these automatically if it does not recognize a given file as an ARFF file, the Preprocess panel has facilities for importing data from a database, and for preprocessing this data using a filtering algorithm. These filters can be used to transform the data and RESEARCH ARTICLE OPEN ACCESS
  • 2. Preeti Kumari Int. Journal of Engineering Research and Applications www.ijera.com ISSN : 2248-9622, Vol. 4, Issue 6( Version 2), June 2014, pp.48-51 www.ijera.com 49 | P a g e make it possible to delete instances and attributes according to specific criteria Datasets lassify tab is used for the classification purpose. We have selected the decision tree classifier J48 to obtain the results. Step 1: Take the input dataset descretize it properly. Step 2: Apply the decision tree classifier algorithm J48 on the whole data set and note the accuracy given by it. Step 3: Reduce the dimension of the given dataset by eliminating one or more attribute pair and note the accuracy of dataset after applying J48 on it. Step 4: Repeat step 4 for all combination of attributes. Step 5: Compare the different accuracy provided by the dataset of different attribute taken together and identify the significant feature of the dataset III. Results and Discussion 1) Weather dataset
  • 3. Preeti Kumari Int. Journal of Engineering Research and Applications www.ijera.com ISSN : 2248-9622, Vol. 4, Issue 6( Version 2), June 2014, pp.48-51 www.ijera.com 50 | P a g e Best attribute: Temperature Here we have selected temperature is the best attribute which alone gives the accuracy of 64.28 % which is far better than the accuracy provided by the four attribute taken together 50% 2) Vote dataset Best attribute: physician-free-freeze For the vote dataset, which consist of 16 attributes which together gives accuracy of 96.32%, after applying the algorithm we have found out that even single attribute physician-free-freeze gives the accuracy of 95.63%, which is very efficient. Hence for this dataset it can be considered as the best attribute. IV. Conclusion Thus in this paper we have tried to select best significant feature of data set using repeatedly applying the decision tree classification algorithm J48 over the dataset and found that for weather dataset temperature is the best attribute which alone gives the accuracy of 64.28 % which is far better than the accuracy provided by the four attribute taken together 50% .For the vote dataset, which consist of 16 attributes which together gives accuracy of 96.32%, after applying the algorithm we have found out that even single attribute physician-free-freeze gives the accuracy of 95.63%, which is very efficient. Hence for this dataset it can be considered as the best attribute. Future work will focus on analysis on the feature selected with various other techniques like principle component analysis, genetic algorithms.
  • 4. Preeti Kumari Int. Journal of Engineering Research and Applications www.ijera.com ISSN : 2248-9622, Vol. 4, Issue 6( Version 2), June 2014, pp.48-51 www.ijera.com 51 | P a g e References [1]. Olivier Henchiri , Nathalie Japkowicz. “A Feature Selection and Evaluation Scheme for Computer Virus Detection, Proceedings of the Sixth International Conference on Data Mining “, (ICDM'06). [2] M. Dash, H. Liu, Feature Selection for Classification, Intelligent Data Analysis 1 (1997) 131-156. [3] Frank,A. & Asuncion ,A. (2010).UCI Machine Learning Repository [http://archive.ics.uci.edu/ml].Irvine,CA: University of California, School of Information and Computer Science. [4] Data mining book – kamber [5] V.Vaithiyanathan, K.Rajeswari, Swati Tonge,, “ Improved Apriori algorithm based on Selection Criterion”, IEEE Conference ICCIC Dec 18-20, 2012, Coimbatore. Catalog Number: CFP1220J-ART ISBN: 978-1-4673-1344-5 [6] Machine Learning Group at the University of Waikato:http://www.cs.waikato.ac.nz/ml /weka/