Data mining techniques using WEKA




Submitted by:

Shashidhar Shenoy N (10BM60083)
MBA, 2nd Year, Vinod Gupta School of Management,
IIT Kharagpur

As part of the course “IT for Business Intelligence”
Introduction to Weka
Weka stands for ‘Waikato Environment for Knowledge Analysis’ and is free, open-source software
developed at the University of Waikato, New Zealand. It is a very popular suite of software for
machine learning, containing a collection of visualization tools and algorithms for data analysis and
predictive modeling, together with graphical user interfaces for easy access to this functionality.

Although it is not as sophisticated as some other statistical packages, Weka is popular because it
is not only free of charge but also open source, which means that new algorithms can be
implemented by taking the existing algorithms and modifying them as needed.

Weka can perform a wide variety of operations on data. Some of the important operations
that can be carried out using the Weka suite are:

      Classification of data
      Regression analysis and prediction
      Clustering of data
      Associating data

This document gives a quick guide to carrying out some of these operations.


Quick note on the data used in the guide
Unless meaningfully interpreted, any data is meaningless. Most machine learning software will
accept any data in the specified format without understanding why it is
used. Thus, the onus lies on the user of the software to choose proper data and feed it to the
software to derive meaningful insights from it.

Rather than using the pre-built examples that ship with the Weka suite, this guide uses freely
available data from the internet. A good place to find data sets that can be turned into .arff files
is the UCI Machine Learning Repository. The ‘About’ page on their website says:

“The UCI Machine Learning Repository is a collection of databases, domain theories, and data
generators that are used by the machine learning community for the empirical analysis of machine
learning algorithms. The archive was created as an ftp archive in 1987 by David Aha and fellow
graduate students at UC Irvine. Since that time, it has been widely used by students, educators, and
researchers all over the world as a primary source of machine learning data sets. As an indication of
the impact of the archive, it has been cited over 1000 times, making it one of the top 100 most cited
"papers" in all of computer science”

For the demonstrations, two of these data sets have been used. Regression uses the ‘Auto MPG’
data set, while classification uses the ‘Contraceptive Method Choice’ data set. More details on the
data and its attributes are given in the subsequent sections.




VGSoM, IIT Kharagpur                                                                          Page 2
Regression using Weka

Simple regression involving two variables
Regression involves building a model to predict the dependent variable based on one or more
independent variables. A simple example of regression would be to predict the body weight of a
mammal given the brain weight. Here, the body weight is the dependent variable and brain weight
is the independent variable:




                                    Figure 1: Brain weight v Body weight
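As a rough illustration of what a simple linear regression computes, the sketch below fits a straight line by ordinary least squares. The numbers are made-up placeholder values, not the data plotted in Figure 1, and the function name `fit_line` is hypothetical (it is not part of Weka).

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel in the ratio).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Placeholder values (hypothetical): independent and dependent variable.
brain = [1.0, 2.0, 3.0, 4.0]
body = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(brain, body)
predicted = slope * 5.0 + intercept  # prediction for an unseen brain weight of 5.0
print(slope, intercept, predicted)
```

The fitted line can then be used to predict the dependent variable for any new value of the independent variable, which is what Weka reports as the regression model.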

The data is imported into Weka in its native ARFF (Attribute-Relation File Format) format. Weka also
supports importing the ubiquitous .csv format. A file is opened by clicking on ‘Explorer’ in the Weka GUI
Chooser and then going to ‘Open file...’ under the ‘Preprocess’ tab.
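For readers who have not seen an ARFF file, the sketch below builds a minimal one from rows held in memory. The `@relation`, `@attribute` and `@data` keywords belong to the ARFF format; the relation name, attribute names and values are placeholders chosen for illustration, and `to_arff` is a hypothetical helper, not a Weka function.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text from a relation name, (name, type) pairs and data rows."""
    lines = [f"@relation {relation}", ""]
    for name, arff_type in attributes:
        # Names without spaces need no quoting in ARFF.
        lines.append(f"@attribute {name} {arff_type}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


arff_text = to_arff(
    "mammals",  # placeholder relation name
    [("brain_weight", "numeric"), ("body_weight", "numeric")],
    [(1.0, 3.0), (2.0, 5.0)],  # placeholder rows
)
print(arff_text)
```

Saving this text with an .arff extension gives a file that can be opened from the ‘Preprocess’ tab.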




                                    Figure 2: Opening a file in Weka Suite



Once the file is loaded, a variety of pre-processing operations can be carried out on the data. The
data can also be edited using the ‘Edit’ option. The left section of the Explorer window lists all of the
columns in the data (Attributes) and the number of rows of data supplied (Instances). When a column
is selected, the right section of the Explorer window gives information about the data in
that column of the data set. The data can also be examined visually by
clicking the ‘Visualize All’ button.

The next step is to perform the regression analysis. For this, we go to the ‘Classify’ tab and
click on the ‘Choose’ button. Since we are running a simple linear regression, we need to go to
classifiers -> functions -> SimpleLinearRegression and click on it. Once this is done, we need to supply
the test options for building the regression model. The following options are available:

      Use training set. The classifier is evaluated on how well it predicts the class of the instances
       it was trained on.
      Supplied test set. The classifier is evaluated on how well it predicts the class of a set of
       instances loaded from a file. Clicking the Set... button brings up a dialog allowing you to
       choose the file to test on.
      Cross-validation. The classifier is evaluated by cross-validation, using the number of folds
       that are entered in the Folds text field.
      Percentage split. The classifier is evaluated on how well it predicts a certain percentage of
       the data which is held out for testing. The amount of data held out depends on the value
       entered in the % field.
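The difference between the percentage split and cross-validation options can be sketched on row indices alone. This is only an illustration of the idea; Weka's own implementation also deals with shuffling and other details. The function names are hypothetical, and the sketch assumes that the percentage denotes the share of rows used for training, with the remainder held out.

```python
def percentage_split(n_rows, train_percent):
    """Split row indices into a leading training part and a held-out test part."""
    n_train = round(n_rows * train_percent / 100)
    indices = list(range(n_rows))
    return indices[:n_train], indices[n_train:]


def k_fold(n_rows, folds):
    """Yield (train, test) index lists so that every row is tested exactly once."""
    indices = list(range(n_rows))
    for fold in range(folds):
        test = indices[fold::folds]  # every folds-th row, starting at this fold
        train = [i for i in indices if i % folds != fold]
        yield train, test


train_rows, test_rows = percentage_split(10, 80)
print(len(train_rows), len(test_rows))

for train_rows, test_rows in k_fold(10, 5):
    print(test_rows)
```

With ‘Use training set’, by contrast, the same rows serve as both training and test data, which is why that option tends to give an optimistic estimate.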

Choose one of these for the model, make sure that the dependent variable shown in the field below
is ‘body weight (kg)’ and click on ‘Start’. This is the output we get:




                                   Figure 3: Output of simple regression




The output gives the model summary and the details of the regression. Thus, a simple linear regression
model has been built using the Weka suite.

Multiple Linear regression with many variables
In multiple regression, there is one dependent variable which depends on many independent
variables. Many real-world situations call for multiple regression models, where one variable
depends on several others. Here, we use a well-known example data set to demonstrate regression
using Weka.

Data used for multiple regression
This data set is taken from the UCI Machine Learning Repository and is used to regress automobile
mileage against certain basic attributes of the car. The data can be downloaded from the URL
<http://archive.ics.uci.edu/ml/datasets/Auto+MPG>, and a corresponding ARFF file must then be created.
The aim is to create a regression model that predicts the miles per gallon (MPG)
of a car based on several of its attributes (the data covers 1970 to 1982). The model includes
these possible attributes of the car: cylinders, displacement, horsepower, weight, acceleration,
model year, origin, and car make. The data set has 398 rows of data.

Data Set Characteristics:            Multivariate                  Number of Instances:       398

Attribute Characteristics:           Categorical, Real             Number of Attributes:      8

Associated Tasks:                    Regression                    Missing Values?            Yes

Note: 8 instances of the variable horsepower are removed because they have unknown values.

This data set is loaded into the Weka suite using the ‘Open file...’ option as explained before. This is
how the window looks when the data is imported.




                                      Figure 4: Imported data in Weka



The first seven attributes are all independent variables, while the eighth one, i.e. CLASS, is the
dependent variable for which we try to build a predictive model. Before doing so, we can use as
many visualizations of the data as necessary to see the relevant information in each attribute.




                                     Figure 5: Visualize the data in Weka

The next step is to perform the regression. Go to the ‘Classify’ tab, click the ‘Choose’ button and go to
classifiers -> functions -> LinearRegression. Once this is done, we need to supply the test options for
building the regression model, in the same manner as for simple linear regression. We
initially choose a ‘Percentage split’ of 80%, so that part of the data is held out for testing, and see
the output:




                                  Figure 6: Run information shown by Weka




Figure 7: The regression model output by Weka




                                    Figure 8: Regression model details

This model might appear complex to beginners, but it is not. For example, the first line of the
regression model, -2.2744 * cylinders=6,3,5,4, means that if the car has six cylinders (or three, five
or four), you would place a 1 in this column, and if it has eight cylinders, you would place a 0. We
could run the model on a test instance, compare the output with the expected result and calculate
the error.

Example data:
data = 8,390,190,3850,8.5,70,1,15
class (aka MPG) =
     -2.2744    *   0 +
     -4.4421    *   0 +
      6.74      *   0 +
      0.012     *   390 +
     -0.0359    *   190 +
     -0.0056    *   3850 +
      1.6184    *   0 +
      1.8307    *   0 +
      1.8958    *   0 +
      1.7754    *   0 +
      1.167     *   0 +
      1.2522    *   0 +
      2.1363    *   0 +
     37.9165

Expected Value = 15 mpg
Regression Model Output = 14.2 mpg
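The arithmetic of this example can be checked with a short script. The coefficients and the data row are copied from the worked example; the variable names are illustrative, and the sketch assumes that the row follows the attribute order listed earlier (cylinders, displacement, horsepower, weight, acceleration, model year, origin, class).

```python
# Data row from the example above.
row = [8, 390, 190, 3850, 8.5, 70, 1, 15]
displacement, horsepower, weight, expected_mpg = row[1], row[2], row[3], row[7]

# Coefficients of the ten indicator (0/1) terms; as in the worked example,
# every indicator is 0 for this row, so these terms contribute nothing.
indicator_coefficients = [-2.2744, -4.4421, 6.74, 1.6184, 1.8307,
                          1.8958, 1.7754, 1.167, 1.2522, 2.1363]
indicators = [0] * len(indicator_coefficients)

intercept = 37.9165
predicted_mpg = (
    sum(c * flag for c, flag in zip(indicator_coefficients, indicators))
    + 0.012 * displacement
    - 0.0359 * horsepower
    - 0.0056 * weight
    + intercept
)

print(round(predicted_mpg, 1), expected_mpg)
```

Only the three numeric terms and the constant contribute for this car, and the difference between the expected value and the model output is the error for this single instance.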



So, we see that the regression model output is close to the expected value, and thus we have a
simple predictive model to start from. We could continue to refine this model to improve its
accuracy. We can also use visualization to plot each independent variable against the
dependent one and see how they vary together. A sample plot of horsepower versus ‘Miles per
gallon’ is shown. The relationship is inverse: mileage falls as horsepower rises.




                                  Figure 9: Visualizing the regression output




Classification using Weka
In classification, different attributes of an item are analysed to assign the item to one of the
predefined classes. For example, a cricket player can be classified as batsman, bowler, wicketkeeper
or all-rounder depending on attributes like ‘Can bat?’, ‘Can bowl?’ etc.

TrainSet: The training set is the data used to train the software. Here, the classification has
already been made based on a few attributes. The machine observes the patterns and tries to create a
rule which explains how the training set data is classified. If the model built by the
machine in the first instance is not reliable, more sophisticated algorithms can be used to make the
model more robust.

TestSet: The test set is the actual data for which the classification has not yet been made. Once the
training set has been used to build a satisfactory model, we can feed in the test set and get the
classification of its instances.

Data used for classification
The data used is the ‘Contraceptive Method Choice’ data set, available from the UCI Machine
Learning Repository. It can be downloaded from the following URL:

<http://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice>

The samples are married women who were either not pregnant or did not know if they were at the
time of interview. The problem is to predict the current contraceptive method choice (no use, long-
term methods, or short-term methods) of a woman based on her demographic and socio-economic
characteristics. Attributes such as ‘Wife’s age’, ‘Wife’s education’, ‘Husband’s education’ and
‘Number of children ever born’ are used to predict the current contraceptive method choice.

Data Set Characteristics:            Multivariate                  Number of Instances:       1473

Attribute Characteristics:           Categorical, Integer          Number of Attributes:      9

Associated Tasks:                    Classification                Missing Values?            No



Use the ‘Open file...’ option to import the ARFF file into the Weka suite as described before. The tenth
attribute, i.e. the ‘contraceptive method used’, is the predicted variable, and the data looks like this:




                                  Figure 10: CMC data imported into Weka




Next, go to the ‘Classify’ tab and use the ZeroR algorithm to run the classification model. ZeroR is the
most basic classification model: it does nothing but assign every instance to a single class, the most
frequent one in the training data. We ask Weka to run the model using the entire training set without
splitting it into training and test sets. This is done by choosing ‘Use training set’ under ‘Test options’,
as explained in the case of regression before. As expected, the model is inaccurate. This is the output
produced by Weka:




                             Figure 11: Classification Output using ZeroR algorithm




Of particular importance is the confusion matrix, which shows the correctly and incorrectly
classified instances. Here, we see that all samples have been classified as ‘a’, so the 333 samples
which should have been ‘b’ and the 511 samples which should have been ‘c’ are
incorrectly classified as ‘a’. Thus, the accuracy of the model is only about 42.7% (629 out of 1473
samples).
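The ZeroR figure can be reproduced from the class counts alone. The sketch below uses only the counts read off the confusion matrix (629, 333 and 511 instances of classes ‘a’, ‘b’ and ‘c’); the function name `zero_r` is hypothetical and this is not Weka's own code.

```python
from collections import Counter


def zero_r(labels):
    """Return the most frequent label, which a majority-class baseline predicts for every instance."""
    return Counter(labels).most_common(1)[0][0]


# Class labels reconstructed from the counts in the confusion matrix.
labels = ["a"] * 629 + ["b"] * 333 + ["c"] * 511

prediction = zero_r(labels)
correct = sum(1 for label in labels if label == prediction)
accuracy = correct / len(labels)

print(prediction, correct, len(labels), round(accuracy * 100, 1))
```

Because the baseline ignores every attribute, its accuracy equals the share of the largest class, which makes it a useful floor against which other classifiers can be compared.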

We could now move to more accurate algorithms like NaiveBayes or NaiveBayesUpdateable to
improve the accuracy of the predictions. Here is the output of the NaiveBayes simple classification
scheme:




Figure 12: Classification output using Naive Bayes algorithm

Here we see that the accuracy of this model, although still below acceptable limits, has improved over
the previous model. Thus, we can make the predictions more accurate by using better
algorithms.
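To give a feel for how a Naive Bayes classifier reaches its decision, here is a minimal sketch for categorical attributes with add-one smoothing. It is not Weka's NaiveBayes implementation; the function names are hypothetical, and the toy data is made up, borrowing the cricket example from the start of this section.

```python
from collections import Counter, defaultdict
import math


def train(rows, labels):
    """Count class frequencies and, per class and attribute, the value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    for row, label in zip(rows, labels):
        for index, value in enumerate(row):
            value_counts[(label, index)][value] += 1
    return class_counts, value_counts


def predict(model, row, n_values=2):
    """Return the class with the highest log-probability (add-one smoothing)."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # class prior
        for index, value in enumerate(row):
            seen = value_counts[(label, index)][value]
            # n_values is the number of distinct values per attribute (yes/no here).
            score += math.log((seen + 1) / (count + n_values))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data (made up): attributes are (can bat?, can bowl?).
rows = [("yes", "no"), ("yes", "no"), ("no", "yes"), ("no", "yes"), ("yes", "yes")]
labels = ["batsman", "batsman", "bowler", "bowler", "all-rounder"]

model = train(rows, labels)
print(predict(model, ("yes", "no")))
```

Unlike ZeroR, the prediction here depends on the attribute values of each instance, which is why such a classifier can do better than the majority-class baseline.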

Various visualization schemes are available to help visualize the independent and dependent
variables.


Conclusion
In this term paper, two simple techniques which can be used to get started with Weka, regression
and classification, are presented. In regression, we have demonstrated how Weka can be used to
build a regression model with one dependent variable and many independent variables. The live
example used was automobile miles per gallon predicted from many independent attributes of a car.
In classification, we have demonstrated how Weka can be trained to classify a given data set
based on observations in a training set. The live data used was the choice of contraceptive method
based on a number of demographic factors.

Though the outputs here are not striking, the real power of Weka lies in the fact that the algorithms
can be trained to produce better results. Since the source code is open, anyone can
download it and modify the existing algorithms to
produce more accurate ones. Hence, Weka is used by many researchers in their studies.



References
   1.   Weka reference manual pdf available at their website
   2.   http://www.cs.waikato.ac.nz/ml/weka/
   3.   http://archive.ics.uci.edu/ml/datasets.html
   4.   http://www.ibm.com/developerworks/opensource/library/os-weka1/index.html#N100F6





SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
CĂłdigo Creativo y Arte de Software | Unidad 1
CĂłdigo Creativo y Arte de Software | Unidad 1CĂłdigo Creativo y Arte de Software | Unidad 1
CĂłdigo Creativo y Arte de Software | Unidad 1
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
 

Data Mining using Weka

  • 1. Data mining techniques using WEKA Submitted by: Shashidhar Shenoy N (10BM60083) MBA, 2nd Year, Vinod Gupta School of Management, IIT Kharagpur As part of the course “IT for Business Intelligence”
  • 2. Introduction to Weka
    Weka stands for ‘Waikato Environment for Knowledge Analysis’ and is free, open-source software developed at the University of Waikato, New Zealand. It is a very popular suite of machine learning software, containing a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to this functionality.
    Although not as sophisticated as some other statistical packages, Weka is popular because it is not only free of charge but also open source, which means that new algorithms can be implemented by adapting the existing ones.
    Weka can be used to carry out a wide variety of operations on data. Some of the important ones are:
      • Classification of data
      • Regression analysis and prediction
      • Clustering of data
      • Association of data
    This document is a quick guide to carrying out some of these operations.
    Quick note on the data used in the guide
    Unless meaningfully interpreted, any data is meaningless. Most machine learning software will accept any data in the specified format without understanding why it is used. Thus, the onus lies on the user of the software to choose proper data and feed it to the software to derive meaningful insights from it.
    Rather than using the examples bundled with the Weka suite, this guide uses freely available data from the internet. A good place to find such data is the Machine Learning Repository at UCI. The about page on their website says:
    “The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an ftp archive in 1987 by David Aha and fellow graduate students at UC Irvine. Since that time, it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning data sets. As an indication of the impact of the archive, it has been cited over 1000 times, making it one of the top 100 most cited "papers" in all of computer science”.
    Two of the repository's data sets are used for the demonstrations. Regression uses the Auto MPG data set, while classification uses the Contraceptive Method Choice data set. More details on the data and its attributes are given in the subsequent sections.
    VGSoM, IIT Kharagpur Page 2
  • 3. Regression using Weka
    Simple regression involving two variables
    Regression involves building a model to predict the dependent variable from one or more independent variables. A simple example is predicting the body weight of a mammal from its brain weight. Here, body weight is the dependent variable and brain weight is the independent variable.
    Figure 1: Brain weight v Body weight
    The data is imported into Weka in its native ARFF (Attribute-Relation File Format) format. Weka can also import the ubiquitous .csv format. To load a file, click ‘Explorer’ in the Weka GUI Chooser and then choose ‘Open file...’ under the Preprocess tab.
    Figure 2: Opening a file in Weka Suite
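To make the idea of a two-variable regression concrete, here is a minimal Python sketch of an ordinary least-squares fit. It is not Weka's implementation, the function name `fit_simple_linear_regression` is hypothetical, and the sample points are made up for illustration (they are not the brain weight and body weight data).

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points that lie exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_simple_linear_regression(xs, ys)
print(slope, intercept)
```

With real data the points will not lie exactly on a line, and the fitted slope and intercept are the values that minimise the squared prediction error.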
  • 4. Once the file is loaded, a variety of pre-processing operations can be performed on the data. The data can also be edited using the ‘Edit’ option. The left section of the Explorer window lists all of the columns in the data (Attributes) and the number of rows of data supplied (Instances). Selecting a column shows information about the data in that column in the right section of the window. The data can also be examined visually by clicking the ‘Visualize All’ button.
    The next step is to perform the regression analysis. For this, go to the ‘Classify’ tab and click the ‘Choose’ button. Since we are running a simple linear regression, select classifiers -> functions -> SimpleLinearRegression. Once this is done, we need to supply the test options for building the regression model. The following options are available:
      • Use training set. The classifier is evaluated on how well it predicts the class of the instances it was trained on.
      • Supplied test set. The classifier is evaluated on how well it predicts the class of a set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing you to choose the file to test on.
      • Cross-validation. The classifier is evaluated by cross-validation, using the number of folds that are entered in the Folds text field.
      • Percentage split. The classifier is evaluated on how well it predicts a certain percentage of the data which is held out for testing. The amount of data held out depends on the value entered in the % field.
    Choose one of these for the model, make sure that the dependent variable is shown in the field below as ‘body weight (kg)’, and click Start. This is the output we get:
    Figure 3: Output of simple regression
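The difference between the percentage split and cross-validation options can be sketched in a few lines of Python. This is only an illustration of the two ideas, not Weka's own procedure (which, for example, handles randomisation in its own way). The function names are hypothetical, and the sketch assumes the percentage is the share of the data used for training, with the rest held out for testing.

```python
import random


def percentage_split(instances, train_percent, seed=1):
    """Shuffle the instances and split them into (train, test) lists."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]


def cross_validation_folds(instances, folds):
    """Return a list of (train, test) pairs; each instance is tested exactly once."""
    items = list(instances)
    result = []
    for k in range(folds):
        # Every folds-th instance, starting at position k, forms the test part.
        test = items[k::folds]
        train = [item for i, item in enumerate(items) if i % folds != k]
        result.append((train, test))
    return result


data = list(range(10))  # stand-in for ten instances
train, test = percentage_split(data, 80)
folds = cross_validation_folds(data, 5)
print(len(train), len(test))
print([len(test_part) for _, test_part in folds])
```

A percentage split evaluates the model once on a single held-out part, whereas cross-validation repeats the evaluation so that every instance is used for testing exactly once.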
  • 5. The output gives the model summary and the details of the regression. Thus, a simple linear regression model has been built using the Weka suite.
    Multiple linear regression with many variables
    In multiple regression, there is one dependent variable which depends on many independent variables. Many real-world situations are multiple regression problems, where one variable depends on several others. Here, we use a well-known example data set to demonstrate regression using Weka.
    Data used for multiple regression
    This data set is taken from UCI's machine learning repository and relates automobile mileage to certain basic attributes of the car. The data can be downloaded from the URL <http://archive.ics.uci.edu/ml/datasets/Auto+MPG>, and a corresponding ARFF file then has to be created. The aim is to build a regression model that predicts the miles per gallon (MPG) of a car from several of its attributes (the data covers 1970 to 1982). The possible attributes of the car are: cylinders, displacement, horsepower, weight, acceleration, model year, origin, and car make. The data set has 398 rows of data.
      • Data Set Characteristics: Multivariate
      • Attribute Characteristics: Categorical, Real
      • Associated Tasks: Regression
      • Number of Instances: 398
      • Number of Attributes: 8
      • Missing Values? Yes (8 instances of the variable horsepower are removed because they have unknown value)
    This data set is loaded into the Weka suite using the ‘Open file...’ option as explained before. This is how the window looks when the data is imported.
    Figure 4: Imported data in Weka
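Creating the ARFF file from the downloaded data can be scripted. The sketch below writes a minimal ARFF text with an `@relation` line, one `@attribute` line per column and an `@data` section, using `?` for missing values. The helper `to_arff`, the relation name and the attribute names and types are assumptions made for illustration. The first data row is the example instance used later in this guide, and the second row is made up to show a missing horsepower value.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text.

    attributes: list of (name, kind), where kind is the string 'numeric'
    or a list of nominal labels. A value of None is written as '?'.
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            lines.append("@attribute %s %s" % (name, kind))
        else:
            labels = ",".join(str(label) for label in kind)
            lines.append("@attribute %s {%s}" % (name, labels))
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if value is None else str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical attribute declarations for the Auto MPG columns.
attributes = [
    ("cylinders", [8, 4, 6, 3, 5]),
    ("displacement", "numeric"),
    ("horsepower", "numeric"),
    ("weight", "numeric"),
    ("acceleration", "numeric"),
    ("model", "numeric"),
    ("origin", [1, 2, 3]),
    ("class", "numeric"),
]
rows = [
    [8, 390, 190, 3850, 8.5, 70, 1, 15],
    [4, 100, None, 2500, 15.0, 75, 1, 25],  # made-up row with unknown horsepower
]
arff_text = to_arff("autoMpg", attributes, rows)
print(arff_text)
```

Whether an attribute such as the model year is declared as numeric or nominal changes how a learning algorithm treats it, so the declarations should be chosen deliberately rather than copied from this sketch.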
  • 6. The first seven attributes are all independent variables, while the eighth one, i.e. CLASS, is the dependent variable for which we try to build a predictive model. Before doing so, we can use as many visualizations of the data as necessary to see the relevant information in each attribute.
    Figure 5: Visualize the data in Weka
    The next step is to perform the regression. Go to the Classify tab, click the Choose button and select classifiers -> functions -> LinearRegression. Once this is done, we need to supply the test options for building the regression model, in the same manner as for simple linear regression. We initially choose a ‘Percentage split’ of 80%, so that 80% of the data is used for training and the remaining 20% is held out for testing, and look at the output:
    Figure 6: Run information shown by Weka
  • 7. Figure 7: The regression model output by Weka
    Figure 8: Regression model details
    This model might appear complex to beginners, but it is not. For example, the first term of the regression model, -2.2744 * cylinders=6,3,5,4, means that if the car has six cylinders (or three, five or four), you place a 1 in this column, and if it has eight cylinders, you place a 0. We can apply the model to a test instance, compare the output with the expected result, and calculate the error.
    Example data: data = 8,390,190,3850,8.5,70,1,15
    class (aka MPG) = -2.2744 * 0 + -4.4421 * 0 + 6.74 * 0 + 0.012 * 390 + -0.0359 * 190 + -0.0056 * 3850 + 1.6184 * 0 + 1.8307 * 0 + 1.8958 * 0 + 1.7754 * 0 + 1.167 * 0 + 1.2522 * 0 + 2.1363 * 0 + 37.9165
  • 8. Expected Value = 15 mpg
    Regression Model Output = 14.2 mpg
    So, we see that the regression model output is close to the expected value, and thus we have a workable first predictive model. We could continue to refine this model to improve its accuracy.
    We can also use visualization to plot each of the independent variables against the dependent one and see how they vary together. A sample plot of horsepower versus ‘Miles per gallon’ is shown. The relationship is inverse: mileage falls as horsepower rises.
    Figure 9: Visualizing the regression output
    Classification using Weka
    In classification, different attributes of an item are analysed to assign the item to one of a set of predefined classes. For example, a cricket player can be classified as a batsman, bowler, wicket keeper or all-rounder depending on attributes like ‘Can bat?’, ‘Can bowl?’, etc.
    TrainSet: The training set is the data used to train the software. Here, the classification has already been made for each instance. The machine observes the patterns and tries to create a rule which explains how the training data is classified. If the model built in the first instance is not reliable, more sophisticated algorithms can be used to make the model more robust.
    TestSet: The test set is the actual data for which the classification has not yet been made. Once the training set has been used to build a satisfactory model, we can feed in the test set and obtain its classification.
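The worked regression example for the car with data 8,390,190,3850,8.5,70,1,15 can be checked with a few lines of Python. The coefficient and input pairs below are copied from the formula in the text. The variable names are my own, and the comments naming displacement, horsepower and weight follow the attribute order given for the data set.

```python
# (coefficient, input value) pairs copied from the worked example in the text.
terms = [
    (-2.2744, 0),     # indicator cylinders=6,3,5,4: 0 because the car has 8 cylinders
    (-4.4421, 0),     # remaining indicator terms are all 0 for this car
    (6.74, 0),
    (0.012, 390),     # displacement
    (-0.0359, 190),   # horsepower
    (-0.0056, 3850),  # weight
    (1.6184, 0),
    (1.8307, 0),
    (1.8958, 0),
    (1.7754, 0),
    (1.167, 0),
    (1.2522, 0),
    (2.1363, 0),
]
intercept = 37.9165

# A linear model is a weighted sum of the inputs plus a constant.
predicted_mpg = sum(coefficient * value for coefficient, value in terms) + intercept
expected_mpg = 15
error = abs(expected_mpg - predicted_mpg)
print(round(predicted_mpg, 1), round(error, 2))
```

Only the three numeric terms and the constant contribute for this car, which is why the calculation reduces to 0.012 * 390 - 0.0359 * 190 - 0.0056 * 3850 + 37.9165.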
  • 9. Data used for classification
    The data used is the ‘Contraceptive Method Choice’ data set available from UCI's machine learning repository, which can be downloaded from the following URL: <http://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice>
    The samples are married women who were either not pregnant or did not know if they were at the time of interview. The problem is to predict the current contraceptive method choice (no use, long-term methods, or short-term methods) of a woman based on her demographic and socio-economic characteristics. Attributes such as ‘Wife's age’, ‘Wife's education’, ‘Husband's education’ and ‘Number of children ever born’ are used to predict the current contraceptive method choice.
      • Data Set Characteristics: Multivariate
      • Attribute Characteristics: Categorical, Integer
      • Associated Tasks: Classification
      • Number of Instances: 1473
      • Number of Attributes: 9
      • Missing Values? No
    Use the ‘Open file...’ option to import the ARFF file into the Weka suite as described before. The tenth attribute, i.e. the ‘contraceptive method used’, is the predicted variable, and the data looks like this:
    Figure 10: CMC data imported into Weka
  • 10. Next, go to the Classify tab and use the ZeroR algorithm to run the classification model. ZeroR is the most basic classification model: it ignores the attributes and assigns every instance to a single class, the most frequent one in the training data. We ask Weka to run the model on the entire training set without splitting it into test and training sets. This is done by choosing ‘Use training set’ under ‘Test options’, as explained for regression before. As expected, the model is inaccurate. This is the output from Weka:
    Figure 11: Classification Output using ZeroR algorithm
    Of particular importance is the confusion matrix, which shows the correctly and incorrectly classified instances. Here, we see that all samples have been classified as ‘a’, so the 333 samples which should have been ‘b’ and the 511 samples which should have been ‘c’ are incorrectly classified as ‘a’. Thus, the accuracy of the model is only about 42.7% (629 out of 1473 samples).
    We could now try more accurate algorithms such as NaiveBayes or NaiveBayesUpdateable to improve the accuracy of the predictions. Here is the output of the NaiveBayes simple classification scheme:
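The ZeroR result above is easy to reproduce by hand. The following Python sketch is an illustration of the majority-class idea, not Weka's code. It builds a label list with the class counts reported in the text (629 ‘a’, 333 ‘b’, 511 ‘c’), predicts the majority class for every instance, and derives the confusion matrix and accuracy. The function name `zero_r` is hypothetical.

```python
from collections import Counter


def zero_r(train_labels):
    """Return the most frequent class label; the attributes are ignored entirely."""
    return Counter(train_labels).most_common(1)[0][0]


# Class counts taken from the confusion matrix described in the text.
labels = ["a"] * 629 + ["b"] * 333 + ["c"] * 511
majority = zero_r(labels)
predictions = [majority] * len(labels)

# confusion[actual][predicted] counts how often each pairing occurs.
classes = sorted(set(labels))
confusion = {actual: {predicted: 0 for predicted in classes} for actual in classes}
for actual, predicted in zip(labels, predictions):
    confusion[actual][predicted] += 1

# Accuracy is the share of instances on the diagonal of the confusion matrix.
accuracy = sum(confusion[c][c] for c in classes) / len(labels)
print(majority, confusion, round(accuracy * 100, 1))
```

Because ZeroR never looks at the attributes, its accuracy is simply the share of the largest class, which makes it a useful baseline against which to judge other classifiers.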
  • 11. Figure 12: Classification output using Naive Bayes algorithm
    Here we see that the accuracy of this model, although still below acceptable limits, has improved over the previous model. Thus, we can make the predictions more accurate by using better algorithms. Various visualization schemes are available to help visualize the independent and dependent variables.
    Conclusion
    In this term paper, two simple techniques for getting started with Weka, regression and classification, are presented. In regression, we have demonstrated how Weka can be used to build a regression model with one dependent variable and many independent variables. The live example used was the prediction of automobile miles per gallon from several independent attributes of a car. In classification, we have demonstrated how Weka can be trained to classify a given data set based on observations in a training set. The live data used was the choice of contraceptive method based on a number of demographic factors.
    Though the outputs here are not striking, the real power of Weka lies in the fact that the algorithms can be trained to produce better results. Since the source code is open to everyone, anyone can download it and modify the existing algorithms to produce more accurate ones. Hence, Weka is used by many researchers in their studies.