Data mining - Weka
                  Submitted as a part of the course
                    ‘IT for Business Intelligence’

                                      Ramya Krishna P
                                        10BM60056
                                         4/19/2012




This paper introduces Weka briefly and proceeds to demonstrate application of two data mining
techniques – association rules and regression.
Table of Contents
Weka – Introduction
   Requirements
Getting started
Data sets
Association rules
   Business application
   Data set
   Preprocess
   Associate
Regression
   Business applications
   Data set
   Preprocess
   Linear regression
   Non-numeric input variables
References
Weka – Introduction
Weka is a data mining tool that bundles a collection of machine learning algorithms. It supports
classification, regression, clustering, association rule mining and visualization. It is open source
software.

Requirements
Recent versions of Weka, i.e., Weka 3.7.x, need Java 1.6 to be installed on your system. I have used
Weka 3.7.5 for this small tutorial. The latest and other editions of Weka can be downloaded from the
Weka website (reference 1).


Getting started
You can run Weka from the command prompt or through its GUI. We use the GUI. This is how it
looks.




For all our purposes, the ‘Explorer’ application is sufficient. On clicking ‘Explorer’, we get




To load a data set into Weka, choose ‘Open file’ under the ‘Preprocess’ tab. First, a short note about
data sets.


Data sets
The default format of a Weka data set is .arff (Attribute-Relation File Format), which is an ASCII text
file. A snapshot of an .arff file looks like this.
So, you can either prepare your data in this form or, if you have a spreadsheet (.xls or .xlsx), save
your data in .csv format.
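
To make the structure of an .arff file concrete in text form, here is a minimal Python sketch that builds a tiny one: a header with @relation and @attribute lines, followed by a @data section with one comma-separated record per line. The relation name, attribute names and rows are made up for illustration and are not taken from any data set used in this paper.

```python
# Minimal sketch: build the text of a small .arff file.
# The relation name, attributes and rows below are hypothetical.

def to_arff(relation, attributes, rows):
    """Return ARFF text.

    attributes: list of (name, type) pairs, where type is either the
    string "numeric" or a list of allowed nominal values.
    rows: list of tuples, one value per attribute.
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if isinstance(kind, list):
            # Nominal attribute: the allowed values are listed in braces.
            lines.append("@attribute %s {%s}" % (name, ",".join(kind)))
        else:
            lines.append("@attribute %s %s" % (name, kind))
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines) + "\n"


attributes = [
    ("bread", ["true", "false"]),
    ("milk", ["true", "false"]),
    ("amount", "numeric"),
]
rows = [("true", "false", 12.5), ("false", "true", 3.0)]
arff_text = to_arff("toy_baskets", attributes, rows)
print(arff_text)
```

Writing `arff_text` to a file with the .arff extension should give something that can be opened from ‘Open file’; attribute names containing spaces would additionally need quoting, which this sketch does not handle.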

Now, on clicking ‘Open file’, select the .csv version of your data and click ‘Open’.

I will proceed with the rest of the tutorial through examples.


Association rules
Association rule mining is a method for discovering relations between variables in a data set. From
these relations we derive rules that meet a certain level of support and confidence. Such rules can
sometimes be of great business value. One typical business application of association rules is
‘Market basket analysis’.

Business application
The market-basket problem assumes we have a large number of items, e.g., bread and milk. Customers
fill their market baskets with some subset of the items, and we get to know which items people buy
together, even if we don't know who they are. An association rule of the form
{X1, X2, . . . , Xn} -> Y

says that in baskets containing X1, X2, …, Xn we have a good chance of finding Y. So, the next time a
retailer is stocking up on X1, X2, …, Xn, he might also stock up on ‘Y’ based on our prediction. Now,
without going too much into the theory, let us look at our data set.
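
Support and confidence can be made concrete with a few lines of Python. The sketch below uses the usual textbook definitions (support is the fraction of baskets containing all the items, confidence is the fraction of baskets containing the left-hand side that also contain Y). The four baskets are hypothetical and only serve to illustrate the arithmetic.

```python
# Minimal sketch: support and confidence of a rule {X1, ..., Xn} -> Y.
# The baskets below are hypothetical, not the data set used in this paper.

def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for b in baskets if items <= b) / len(baskets)


def confidence(baskets, lhs, rhs):
    """Of the baskets containing `lhs`, the fraction that also contain `rhs`."""
    return support(baskets, set(lhs) | {rhs}) / support(baskets, lhs)


baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]
rule_support = support(baskets, {"bread", "milk"})       # {bread} -> milk
rule_confidence = confidence(baskets, {"bread"}, "milk")
print(rule_support, rule_confidence)
```

Here two of the four baskets contain both bread and milk, and two of the three baskets with bread also contain milk, so the rule {bread} -> milk has support 2/4 and confidence 2/3.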

Data set
The format of my data set is like this

TID1         ID2    ID5    ID6
TID2         ID3    ID4    ID6    ID7    ID9
TID3         ID4    ID5
TID4         ID1    ID4    ID5    ID7    ID9    ID10

...

where the first column gives the transaction ID and the rest of each row lists the products
purchased in that transaction. Unfortunately, Weka cannot accept the data set in this form, because
the rows are of unequal lengths. Both .arff and .csv require each data record to have the same
number of fields.

To change the data format, create one attribute per item and use "true" and "false" as the field
values in each row to record whether the item was bought. We can't use 0 and 1 because Apriori (the
algorithm we will be using) does not work on numeric attributes; it only works on nominal values.
The data now looks like

TID, ID1, ID2, ID3, ID4, ID5, ID6, ID7, ID8, ID9, ID10

1, false, true, false, false, true, true, false, false, false, false
2, false, false, true, true, false, true, true, false, true, false
3, false, false, false, true, true, false, false, false, false, false
4, true, false, false, true, true, false, true, false, true, true
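
The conversion described above can be sketched in Python as follows. This is only one possible way to do it. The four transactions are the ones listed earlier; the in-memory buffer stands in for a real output file, and the assumption that the item universe is exactly ID1 to ID10 is taken from the header row shown above.

```python
# Minimal sketch: turn variable-length transactions into fixed-width
# true/false rows, one column per item.
import csv
import io

transactions = {
    1: ["ID2", "ID5", "ID6"],
    2: ["ID3", "ID4", "ID6", "ID7", "ID9"],
    3: ["ID4", "ID5"],
    4: ["ID1", "ID4", "ID5", "ID7", "ID9", "ID10"],
}
items = ["ID%d" % i for i in range(1, 11)]  # ID1 .. ID10 (assumed item universe)

out = io.StringIO()  # stands in for open(path, "w", newline="")
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["TID"] + items)
for tid, bought in sorted(transactions.items()):
    # One nominal true/false value per item, so every row has the same length.
    flags = ["true" if item in bought else "false" for item in items]
    writer.writerow([tid] + flags)

csv_text = out.getvalue()
print(csv_text)
```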

Now, I have a sample data set (downloaded from the URL in reference 4) which is, thankfully, already
in .csv form. This is a huge data set with 300+ products and 1300+ rows. When you try to run this in
Weka, you get an error that the heap size is not sufficient. You can change the heap size by changing
the value of ‘maxheap’ in the RunWeka.ini configuration file. However, even with a heap size of 1 GB,
this data set is too large to run. So, I have cropped the data set to about 20 attributes and
400 rows. A snapshot of the data set looks like this.
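
Cropping can be done in a spreadsheet, but it can also be scripted. The Python sketch below is one hypothetical way to do it, not necessarily how the file used here was produced: it keeps the header, the first `max_columns` fields and the first `max_rows` records. Keeping the first columns and rows (rather than a random sample) is an assumption made for simplicity, and the tiny in-memory CSV is a stand-in for the real file.

```python
# Minimal sketch: keep only the first N columns and first M data rows of a CSV.
import csv
import io


def crop_csv(src, dst, max_columns, max_rows):
    """Copy the header plus up to `max_rows` records, truncated to `max_columns` fields."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    for line_number, record in enumerate(reader):
        if line_number > max_rows:  # line 0 is the header
            break
        writer.writerow(record[:max_columns])


# Tiny in-memory stand-in for the real file (hypothetical contents).
src = io.StringIO("a,b,c\ntrue,false,true\nfalse,false,true\ntrue,true,true\n")
dst = io.StringIO()
crop_csv(src, dst, max_columns=2, max_rows=2)
cropped = dst.getvalue()
print(cropped)
```

For the data set in this paper the call would use something like `max_columns=20` and `max_rows=400`, with `src` and `dst` opened as real files.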
Preprocess
Once you choose this file under ‘Open file’, this is how it looks.
Weka lists all the attributes present in the data set. It also provides visualizations of the data and
other statistics. For example, we can see that ‘fat free hamburger’ is true only 41 times out of 400.
Now, we can select the attributes we want for our analysis one by one, check ‘All’, or select ‘Pattern’
and type a Perl-style regular expression to choose the attributes whose names match it. We check
‘All’. Then we go to the ‘Associate’ tab.

Associate
In the ‘Associate’ tab, click ‘Choose’. Out of the algorithms listed, we select Apriori. Clicking the
text box beside ‘Choose’ (i.e., on Apriori) lists the various parameters used in Apriori.
We can change these parameters as required. To find out what each parameter stands for, click
on ‘More’. After changing the parameters, click on ‘OK’.

Now, click on ‘Start’ to start building the model. Depending on the size of the data set, this takes a
while; meanwhile, the Weka bird in the window keeps moving from side to side.

A part of the output is shown here.




Since we have set ‘numRules’ to 10, only the 10 best rules are shown here. The first rule is

 Plain English Muffins= false 396 ==> 40 Watt Lightbulb= false 396        <conf:(1)> lift:(1.01) lev:(0) [1] conv:(1.98)

That is, people who do not buy Plain English Muffins do not buy a 40 Watt Lightbulb either. Each rule
is also listed with its confidence, lift, leverage and conviction (an explanation of each can be found
under ‘More’, shown above).
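
The numbers attached to the rule can be checked by hand. The Python sketch below uses the standard textbook definitions of confidence, lift and leverage; I am assuming these match what Weka reports. The counts 396 and 400 come from the output and the data set size, but the total number of rows with 40 Watt Lightbulb = false is not printed in the rule, so the value 398 is an assumption, chosen because it is consistent with the reported lift of 1.01.

```python
# Minimal sketch: confidence, lift and leverage from raw counts.
total = 400            # rows in the cropped data set
premise = 396          # rows with Plain English Muffins = false
both = 396             # rows where premise and consequence both hold
consequence = 398      # ASSUMED: rows with 40 Watt Lightbulb = false (not in the output)

# Confidence: how often the consequence holds when the premise holds.
confidence = both / premise
# Lift: confidence relative to how common the consequence is anyway.
lift = confidence / (consequence / total)
# Leverage: observed joint frequency minus the frequency expected under independence.
leverage = both / total - (premise / total) * (consequence / total)

print(round(confidence, 2), round(lift, 2), round(leverage, 2))
```

A lift this close to 1 and a leverage this close to 0 say that the rule is barely more informative than the base rate: almost nobody in the data buys either product, so the two "false" values go together mostly by default.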

The model can be re-run with different parameters, and each result is kept under the ‘Result
list’. The results can also be saved for later.


Regression
Regression, as one knows, models the relation between a dependent variable and one or more
independent variables. As there is not much need to explain regression, we jump into the process.

Business applications
Before we start with the tutorial, here are some areas where regression can be used:
         Trend line analysis - to show the movement of financial or product attributes over time. Stock
         prices and oil prices can be analyzed using trend lines.
         Risk analysis for investments - the capital asset pricing model was developed using linear
         regression analysis.
         Sales or market forecasts - multivariate regression is a good method to forecast sales volumes or
         market shares.
         Total quality control - quality control methods frequently use linear regression to analyze key
         product specifications and other measurable parameters of a product or organization (e.g.,
         customer complaints over time).
         Human Resources - to predict the demographics and types of future work forces for large
         companies.

Data set
I have used a data set provided on the Weka website for this. A number of data sets for different
techniques can be found there (reference 2).

The data set I am using is ‘strike.arff’, extracted from ‘numeric.jar’. The data consists of days lost
due to industrial disputes per 1000 wage and salary earners in 18 OECD countries from 1951 to 1985.
The independent variables are


    1.   country code
    2.   year
    3.   unemployment
    4.   inflation
    5.   parliamentary representation of social democratic and labor parties and
    6.   a time-invariant measure of union centralization.


If your data is not in .csv or .arff, it needs to be preprocessed as explained above.

Preprocess
After loading the data into Weka, it looks like this.
For each numeric attribute, Weka gives statistics such as the mean, maximum, minimum and standard
deviation.

On clicking ‘Visualize All’, the graphs of all variables are shown.
We check ‘All’ to select all variables and then click on the ‘Classify’ tab.

Linear regression
We click ‘choose’ under Classifier and select ‘Linear Regression’ as shown.




Click on the box beside ‘Choose’ to set the parameters for Linear Regression.
Then, click on ‘OK’. Now, we have to tell Weka which data to use for evaluating the model. Apart from
‘Use training set’, which evaluates the model on the same data we have loaded, we have 3 more
choices: ‘Supplied test set’, where we supply a separate set of data on which the model is tested;
‘Cross-validation’, where Weka repeatedly builds the model on all but one subset (fold) of the data,
tests it on the held-out fold and averages the results; and ‘Percentage split’, where Weka builds the
model on a given percentage of the data and tests it on the remainder. For this example, we choose
‘Use training set’.
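
To illustrate how the percentage split and cross-validation options divide the rows between building and testing the model, here is a minimal Python sketch over row indices. It is not Weka's implementation: it keeps the rows in their original order and does no shuffling, and the function names are hypothetical.

```python
# Minimal sketch: percentage split and k-fold partitions over row indices.

def percentage_split(n_rows, train_percent):
    """Return (train_indices, test_indices), keeping the original row order."""
    cut = round(n_rows * train_percent / 100)
    indices = list(range(n_rows))
    return indices[:cut], indices[cut:]


def k_folds(n_rows, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_rows))
    for fold in range(k):
        test = indices[fold::k]            # every k-th row, starting at `fold`
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test


train, test = percentage_split(10, 66)
folds = list(k_folds(10, 5))
print(train, test)
print(folds[0])
```

In cross-validation every row is used for testing exactly once, which is why its error estimate is usually more trustworthy than the one obtained from ‘Use training set’.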

By default, Weka takes the last attribute as the dependent attribute. If that is not right for the
data, we pick the required variable from the drop-down. We choose ‘volume’ as the dependent variable
and click on ‘Start’.

A part of the output is shown below.
The first line of the model is

175.7183 * country=5,3,13,17,7,1,18,6,9,4,10

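It means that if the country code is one of the listed values (for example 5), you put a ‘1’ in place
of the country expression, so the term contributes 175.7183; if the country code is not listed (for
example 8), you put a ‘0’ and the term contributes nothing.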
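
The way this term enters the calculation can be written out as a small Python sketch. The coefficient and the list of country codes are copied from the line of the model above; the function name is hypothetical, and the remaining terms of the equation are left out.

```python
# Minimal sketch: contribution of the first term of the regression equation.
COEFFICIENT = 175.7183
COUNTRIES_IN_TERM = {5, 3, 13, 17, 7, 1, 18, 6, 9, 4, 10}


def country_term(country_code):
    """Return the coefficient if the country is in the listed set, else 0."""
    indicator = 1 if country_code in COUNTRIES_IN_TERM else 0
    return COEFFICIENT * indicator


print(country_term(5), country_term(8))
```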

By default, Weka employs attribute selection, which means it may not include all of the attributes in
the regression equation. Hence not all the independent variables appear in the above model. To
turn off attribute selection, we change the ‘attributeSelectionMethod’ parameter to "No attribute
selection" and run the model again.

Now the model is as follows
Non-numeric input variables
If we have a binary non-numeric input variable (yes/no or true/false), we can convert the two values
to 0 and 1.

However, there are also techniques that handle both numeric and non-numeric (categorical) attributes.

    1. One way is to build a decision tree and have each classification be a numeric value that is the
       average of the values for the training examples in that subgroup - the result is called a
       regression tree
    2. Another option is to have a separate regression equation for each classification in the tree –
       based on the training examples in that subgroup – this is called a model tree.
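
The first option can be illustrated with the smallest possible case: a "tree" with a single split on one categorical attribute, where each leaf predicts the average target value of the training examples that fall into it. The Python sketch below uses hypothetical data and is not how Weka builds its trees; a real regression tree would also choose which attributes to split on.

```python
# Minimal sketch: a one-level "regression tree" on a single categorical attribute.
from collections import defaultdict

# Hypothetical training examples: (category, numeric target).
training = [
    ("north", 10.0),
    ("north", 14.0),
    ("south", 3.0),
    ("south", 5.0),
    ("south", 7.0),
]

groups = defaultdict(list)
for category, target in training:
    groups[category].append(target)

# Each leaf predicts the mean target of the training examples that reach it.
leaf_prediction = {
    category: sum(values) / len(values) for category, values in groups.items()
}


def predict(category):
    """Look up the leaf for this category and return its average."""
    return leaf_prediction[category]


print(predict("north"), predict("south"))
```

A model tree would replace each leaf's single average with a separate regression equation fitted to the examples in that leaf.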
References
  1. http://www.cs.waikato.ac.nz/ml/weka/
  2. http://www.cs.waikato.ac.nz/ml/weka/index_datasets.html
  3. http://inf.abdn.ac.uk/~hnguyen/teaching/CS5553/prac05.php
  4. http://inf.abdn.ac.uk/~hnguyen/teaching/CS5553/marketbasket.csv
  5. "The WEKA Data Mining Software: An Update" by Mark Hall, Eibe Frank, Geoffrey Holmes,
     Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten
  6. http://www.ehow.com/about_6160819_application-regression-analysis-business.html
  7. http://www.ibm.com/developerworks/opensource/library/os-weka1/index.html
  8. http://cs-people.bu.edu/dgs/courses/cs105/lectures/data_mining_estimation.pdf

Weitere ähnliche Inhalte

Was ist angesagt?

Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
butest
 
Decision Tree In R | Decision Tree Algorithm | Data Science Tutorial | Machin...
Decision Tree In R | Decision Tree Algorithm | Data Science Tutorial | Machin...Decision Tree In R | Decision Tree Algorithm | Data Science Tutorial | Machin...
Decision Tree In R | Decision Tree Algorithm | Data Science Tutorial | Machin...
Simplilearn
 

Was ist angesagt? (20)

Image captioning
Image captioningImage captioning
Image captioning
 
ANOMALY DETECTION IN INTELLIGENT TRANSPORTATION SYSTEM using real-time video...
 ANOMALY DETECTION IN INTELLIGENT TRANSPORTATION SYSTEM using real-time video... ANOMALY DETECTION IN INTELLIGENT TRANSPORTATION SYSTEM using real-time video...
ANOMALY DETECTION IN INTELLIGENT TRANSPORTATION SYSTEM using real-time video...
 
Sentiment Analysis
Sentiment AnalysisSentiment Analysis
Sentiment Analysis
 
The Mathematics of Neural Networks
The Mathematics of Neural NetworksThe Mathematics of Neural Networks
The Mathematics of Neural Networks
 
Big Data Analytics for Cyber Security: A Quick Overview
Big Data Analytics for Cyber Security: A Quick OverviewBig Data Analytics for Cyber Security: A Quick Overview
Big Data Analytics for Cyber Security: A Quick Overview
 
Introduction to FAIR Risk Methodology – Global CISO Forum 2019 – Donna Gall...
Introduction to FAIR Risk Methodology – Global CISO Forum 2019  –  Donna Gall...Introduction to FAIR Risk Methodology – Global CISO Forum 2019  –  Donna Gall...
Introduction to FAIR Risk Methodology – Global CISO Forum 2019 – Donna Gall...
 
Practical sentiment analysis
Practical sentiment analysisPractical sentiment analysis
Practical sentiment analysis
 
Reinforcement Learning In AI Powerpoint Presentation Slide Templates Complete...
Reinforcement Learning In AI Powerpoint Presentation Slide Templates Complete...Reinforcement Learning In AI Powerpoint Presentation Slide Templates Complete...
Reinforcement Learning In AI Powerpoint Presentation Slide Templates Complete...
 
Image Captioning Generator using Deep Machine Learning
Image Captioning Generator using Deep Machine LearningImage Captioning Generator using Deep Machine Learning
Image Captioning Generator using Deep Machine Learning
 
SIEM in NIST Cyber Security Framework
SIEM in NIST Cyber Security FrameworkSIEM in NIST Cyber Security Framework
SIEM in NIST Cyber Security Framework
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Decision Tree In R | Decision Tree Algorithm | Data Science Tutorial | Machin...
Decision Tree In R | Decision Tree Algorithm | Data Science Tutorial | Machin...Decision Tree In R | Decision Tree Algorithm | Data Science Tutorial | Machin...
Decision Tree In R | Decision Tree Algorithm | Data Science Tutorial | Machin...
 
Scaling threat detection and response in AWS - SDD312-R - AWS re:Inforce 2019
Scaling threat detection and response in AWS - SDD312-R - AWS re:Inforce 2019 Scaling threat detection and response in AWS - SDD312-R - AWS re:Inforce 2019
Scaling threat detection and response in AWS - SDD312-R - AWS re:Inforce 2019
 
Explainable AI
Explainable AIExplainable AI
Explainable AI
 
An introduction to reinforcement learning
An introduction to  reinforcement learningAn introduction to  reinforcement learning
An introduction to reinforcement learning
 
Reinforcement learning
Reinforcement learning Reinforcement learning
Reinforcement learning
 
Image Caption Generation using Convolutional Neural Network and LSTM
Image Caption Generation using Convolutional Neural Network and LSTMImage Caption Generation using Convolutional Neural Network and LSTM
Image Caption Generation using Convolutional Neural Network and LSTM
 
Convolution Neural Network (CNN)
Convolution Neural Network (CNN)Convolution Neural Network (CNN)
Convolution Neural Network (CNN)
 
How to assess and manage cyber risk
How to assess and manage cyber riskHow to assess and manage cyber risk
How to assess and manage cyber risk
 
Reinforcement Learning 4. Dynamic Programming
Reinforcement Learning 4. Dynamic ProgrammingReinforcement Learning 4. Dynamic Programming
Reinforcement Learning 4. Dynamic Programming
 

Andere mochten auch

DATA MINING TOOL- ORANGE
DATA MINING TOOL- ORANGEDATA MINING TOOL- ORANGE
DATA MINING TOOL- ORANGE
Neeraj Goswami
 

Andere mochten auch (12)

Many eyes @Vgsom
Many eyes @VgsomMany eyes @Vgsom
Many eyes @Vgsom
 
Some Thoughts on Learning Analytics and Educational Data Mining
Some Thoughts on Learning Analytics and Educational Data MiningSome Thoughts on Learning Analytics and Educational Data Mining
Some Thoughts on Learning Analytics and Educational Data Mining
 
Data Mining Project for student academic specialization and performance
Data Mining Project for student academic specialization and performanceData Mining Project for student academic specialization and performance
Data Mining Project for student academic specialization and performance
 
Students academic performance using clustering technique
Students academic performance using clustering techniqueStudents academic performance using clustering technique
Students academic performance using clustering technique
 
Grand challenges for the Educational Data Mining and Learning Sciences Commun...
Grand challenges for the Educational Data Mining and Learning Sciences Commun...Grand challenges for the Educational Data Mining and Learning Sciences Commun...
Grand challenges for the Educational Data Mining and Learning Sciences Commun...
 
Predicting Student Performance in Solving Parameterized Exercises
Predicting Student Performance in Solving Parameterized ExercisesPredicting Student Performance in Solving Parameterized Exercises
Predicting Student Performance in Solving Parameterized Exercises
 
USING LEARNING ANALYTICS TO PREDICT STUDENTS’ PERFORMANCE IN MOODLE LMS
USING LEARNING ANALYTICS TO PREDICT STUDENTS’ PERFORMANCE IN MOODLE LMSUSING LEARNING ANALYTICS TO PREDICT STUDENTS’ PERFORMANCE IN MOODLE LMS
USING LEARNING ANALYTICS TO PREDICT STUDENTS’ PERFORMANCE IN MOODLE LMS
 
Big Data in Education
Big Data in EducationBig Data in Education
Big Data in Education
 
DATA MINING TOOL- ORANGE
DATA MINING TOOL- ORANGEDATA MINING TOOL- ORANGE
DATA MINING TOOL- ORANGE
 
Information security in big data -privacy and data mining
Information security in big data -privacy and data miningInformation security in big data -privacy and data mining
Information security in big data -privacy and data mining
 
Data mining tools (R , WEKA, RAPID MINER, ORANGE)
Data mining tools (R , WEKA, RAPID MINER, ORANGE)Data mining tools (R , WEKA, RAPID MINER, ORANGE)
Data mining tools (R , WEKA, RAPID MINER, ORANGE)
 
Learning Analytics in Education: Using Student’s Big Data to Improve Teaching
Learning Analytics in Education:  Using Student’s Big Data to Improve TeachingLearning Analytics in Education:  Using Student’s Big Data to Improve Teaching
Learning Analytics in Education: Using Student’s Big Data to Improve Teaching
 

Ähnlich wie Data Mining _ Weka

A Data Warehouse And Business Intelligence Application
A Data Warehouse And Business Intelligence ApplicationA Data Warehouse And Business Intelligence Application
A Data Warehouse And Business Intelligence Application
Kate Subramanian
 
Yelp Project Report
Yelp Project ReportYelp Project Report
Yelp Project Report
Mihir Bhatt
 
Introduction to Dimesional Modelling
Introduction to Dimesional ModellingIntroduction to Dimesional Modelling
Introduction to Dimesional Modelling
Ashish Chandwani
 

Ähnlich wie Data Mining _ Weka (20)

Dwbi Project
Dwbi ProjectDwbi Project
Dwbi Project
 
HP Vertica Architecture Gives Massive Performance Boost to Toughest BI Querie...
HP Vertica Architecture Gives Massive Performance Boost to Toughest BI Querie...HP Vertica Architecture Gives Massive Performance Boost to Toughest BI Querie...
HP Vertica Architecture Gives Massive Performance Boost to Toughest BI Querie...
 
A Data Warehouse And Business Intelligence Application
A Data Warehouse And Business Intelligence ApplicationA Data Warehouse And Business Intelligence Application
A Data Warehouse And Business Intelligence Application
 
ITB Term Paper - 10BM60066
ITB Term Paper - 10BM60066ITB Term Paper - 10BM60066
ITB Term Paper - 10BM60066
 
Yelp Project Report
Yelp Project ReportYelp Project Report
Yelp Project Report
 
Data Mining Apriori Algorithm Implementation using R
Data Mining Apriori Algorithm Implementation using RData Mining Apriori Algorithm Implementation using R
Data Mining Apriori Algorithm Implementation using R
 
Data warehousing and business intelligence project report
Data warehousing and business intelligence project reportData warehousing and business intelligence project report
Data warehousing and business intelligence project report
 
Data Mining using Weka
Data Mining using WekaData Mining using Weka
Data Mining using Weka
 
IBM Cognos tutorial - ABC LEARN
IBM Cognos tutorial - ABC LEARNIBM Cognos tutorial - ABC LEARN
IBM Cognos tutorial - ABC LEARN
 
BIG MART SALES.pptx
BIG MART SALES.pptxBIG MART SALES.pptx
BIG MART SALES.pptx
 
BIG MART SALES PRIDICTION PROJECT.pptx
BIG MART SALES PRIDICTION PROJECT.pptxBIG MART SALES PRIDICTION PROJECT.pptx
BIG MART SALES PRIDICTION PROJECT.pptx
 
Using Safyr to navigate and analyse SAP data model demonstration screen shots
Using Safyr to navigate and analyse SAP data model demonstration screen shotsUsing Safyr to navigate and analyse SAP data model demonstration screen shots
Using Safyr to navigate and analyse SAP data model demonstration screen shots
 
Star schema
Star schemaStar schema
Star schema
 
Excel Datamining Addin Advanced
Excel Datamining Addin AdvancedExcel Datamining Addin Advanced
Excel Datamining Addin Advanced
 
Excel Datamining Addin Advanced
Excel Datamining Addin AdvancedExcel Datamining Addin Advanced
Excel Datamining Addin Advanced
 
Association Rule based Recommendation System using Big Data
Association Rule based Recommendation System using Big DataAssociation Rule based Recommendation System using Big Data
Association Rule based Recommendation System using Big Data
 
Sql Server 2008 Portfolio for Vera Vaitsiuk.
Sql Server 2008 Portfolio for Vera Vaitsiuk.Sql Server 2008 Portfolio for Vera Vaitsiuk.
Sql Server 2008 Portfolio for Vera Vaitsiuk.
 
Introduction to Dimesional Modelling
Introduction to Dimesional ModellingIntroduction to Dimesional Modelling
Introduction to Dimesional Modelling
 
DataTables
DataTablesDataTables
DataTables
 
A guide to preparing your data for tableau
A guide to preparing your data for tableauA guide to preparing your data for tableau
A guide to preparing your data for tableau
 

Kürzlich hochgeladen

Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
KarakKing
 

Kürzlich hochgeladen (20)

ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structure
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
 
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptxCOMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 

Data Mining _ Weka

  • 1. Data mining - Weka Submitted as a part of the course ‘IT for Business Intelligence’ Ramya Krishna P 10BM60056 4/19/2012 This paper introduces Weka briefly and proceeds to demonstrate application of two data mining techniques – association rules and regression.
  • 2. Table of Contents Weka – Introduction ..................................................................................................................................... 1 Requirements............................................................................................................................................ 1 Getting started .............................................................................................................................................. 1 Data sets ....................................................................................................................................................... 2 Association rules ........................................................................................................................................... 3 Business application.................................................................................................................................. 3 Data set ..................................................................................................................................................... 4 Preprocess................................................................................................................................................. 5 Associate ................................................................................................................................................... 6 Regression ..................................................................................................................................................... 7 Business applications ................................................................................................................................ 7 Data set ..................................................................................................................................................... 
8 Preprocess................................................................................................................................................. 8 Linear regression ..................................................................................................................................... 10 Non-numeric input variables .................................................................................................................. 13 References .................................................................................................................................................. 14
  • 3. Weka – Introduction
    Weka is a rich tool for data mining. It is a collection of machine learning algorithms that supports classification, regression, clustering, association rule mining and visualization. It is open source software.
    Requirements
    The latest versions of Weka, i.e., Weka 3.7.x, need Java 1.6 to be installed on your system. I have used Weka 3.7.5 for this small tutorial. The latest and other editions of Weka can be downloaded from the Weka website (reference 1).
    Getting started
    You can run Weka from the command prompt or through the GUI. We use the GUI; this is how it looks. For all our purposes, the ‘Explorer’ application is sufficient. Clicking ‘Explorer’ opens the following window.
  • 4. To load a data set into Weka, choose ‘Open file’ under the ‘Preprocess’ tab. Now a short note about data sets.
    Data sets
    The default format of a Weka data set is .arff (Attribute-Relation File Format), which is an ASCII text file. A snapshot of an .arff file looks like this.
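Because ARFF is plain text, a small file can be written by hand or generated by a script. The Python sketch below builds a minimal ARFF document with a header section (`@relation`, `@attribute`) followed by a `@data` section. The relation name `baskets`, the attribute names and the two data rows are hypothetical and only illustrate the layout.

```python
def to_arff(relation, attributes, rows):
    """Return the text of a minimal ARFF file.

    attributes: list of (name, kind) pairs, where kind is either the string
                "numeric" or a list of the allowed nominal values.
    rows:       list of value lists, one value per attribute.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            # Numeric (or other built-in) attribute type
            lines.append(f"@attribute {name} {kind}")
        else:
            # Nominal attribute: allowed values listed inside braces
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical example: two nominal attributes and one numeric attribute.
attributes = [
    ("bread", ["true", "false"]),
    ("milk", ["true", "false"]),
    ("amount", "numeric"),
]
rows = [["true", "false", 12.5], ["false", "true", 3.0]]

arff_text = to_arff("baskets", attributes, rows)
print(arff_text)
```

Saving `arff_text` to a file with the .arff extension should give something that ‘Open file’ can load, assuming the attribute names contain no spaces or other characters that need quoting.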
  • 5. So, you can either prepare your data in this form or, if you have a spreadsheet (.xls or .xlsx), save it in .csv format. Then click ‘Open file’, select the .csv version of your data and click ‘Open’. I will proceed with the rest of the tutorial through examples.
    Association rules
    Association rule mining is a method for discovering relationships between variables in a data set. From these relationships we derive rules that meet a chosen minimum level of support and confidence. Such rules can sometimes be of great business value. One typical business application of association rules is ‘Market basket analysis’.
    Business application
    The market-basket problem assumes we have a large number of items, e.g., bread, milk. Customers fill their market baskets with some subset of the items, and we get to know which items people buy together, even if we don't know who they are. By developing association rules of the form,
  • 6. {X1, X2, ..., Xn} -> Y
    we can say that a basket containing X1, X2, ..., Xn has a good chance of also containing Y. So, the next time a retailer stocks up on X1, X2, ..., Xn, he might also stock up on ‘Y’ based on our prediction. Now, without going too much into the theory, let us see our data set.
    Data set
    The format of my data set is like this
    TID1 ID2 ID5 ID6
    TID2 ID3 ID4 ID6 ID7 ID9
    TID3 ID4 ID5
    TID4 ID1 ID4 ID5 ID7 ID9 ID10
    ...
    where the first column gives the transaction id and the rest of each row lists the products purchased in that transaction. Unfortunately, Weka cannot accept the data set in this form because the rows are of unequal lengths. Both .arff and .csv require each data record to have the same number of fields. To change the data format, create one attribute per "item" and use "true" and "false" as the field values in each data row. We can't use 0 and 1 because Apriori (the algorithm we will be using) does not work on numeric attributes; it only works on nominal values. The data now looks like
    TID, ID1, ID2, ID3, ID4, ID5, ID6, ID7, ID8, ID9, ID10
    1, false, true, false, false, true, true, false, false, false, false
    2, false, false, true, true, false, true, true, false, true, false
    3, false, false, false, true, true, false, false, false, false, false
    4, true, false, false, true, true, false, true, false, true, true
    I have a sample data set (downloaded from reference 4) which is, thankfully, already in .csv form. This is a huge data set with 300+ products and 1300+ rows. When you try to run this in Weka, you get an error that the heap size is not sufficient. You can increase the heap size by changing the ‘maxheap’ value in Weka's RunWeka.ini configuration file. However, even with a heap size of 1 GB, this data set is too large to run. So, I have cropped the data set to about 20 attributes and 400 rows. A snapshot of the data set is like this.
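The conversion from variable-length transactions to one true/false column per item, as described above, can be sketched in Python as follows. The four transactions are the ones listed in the text (with transaction 3 taken as ID4 and ID5, as in the transaction list). The function name `to_csv` and the choice of exactly ten item columns are assumptions made for this example.

```python
import csv
import io

# The four example transactions: transaction id -> items purchased.
transactions = {
    1: ["ID2", "ID5", "ID6"],
    2: ["ID3", "ID4", "ID6", "ID7", "ID9"],
    3: ["ID4", "ID5"],
    4: ["ID1", "ID4", "ID5", "ID7", "ID9", "ID10"],
}

# One column per item, ID1 .. ID10.
items = [f"ID{i}" for i in range(1, 11)]


def to_csv(transactions, items):
    """Return CSV text with one nominal true/false field per item."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["TID"] + items)
    for tid, bought in transactions.items():
        bought = set(bought)
        # Every row gets the same number of fields, as Weka requires.
        writer.writerow([tid] + ["true" if item in bought else "false" for item in items])
    return buffer.getvalue()


csv_text = to_csv(transactions, items)
print(csv_text)
```

Writing the strings "true" and "false" rather than 1 and 0 keeps the attributes nominal when the file is loaded, which is what Apriori needs.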
  • 7. Preprocess
    Once you choose this file under ‘Open file’, this is how it looks.
  • 8. Weka lists all the attributes present in the data set. It also provides visualizations of the data and other statistics. For example, we can see that ‘fat free hamburger’ is true only 41 times out of 400. We can select the attributes we want for our analysis one by one, check ‘All’, or click ‘Pattern’ and type a Perl regular expression to select the attributes whose names match it. We check ‘All’.
    Associate
    We go to the ‘Associate’ tab and click ‘Choose’. Out of the algorithms listed, we select Apriori. Clicking the text box beside ‘Choose’ (i.e., on Apriori) lists the various parameters used by Apriori.
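Apriori ranks candidate rules using measures such as support and confidence, the two quantities mentioned in the introduction to association rules. As a rough sketch of the standard textbook definitions (not Weka's internal code), the Python snippet below computes support, confidence, lift and leverage for one rule over a toy list of baskets. The basket contents and the rule {bread} -> {milk} are made up for illustration.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Compute textbook rule measures for antecedent -> consequent."""
    n = len(baskets)
    antecedent, consequent = set(antecedent), set(consequent)
    # Count baskets containing the antecedent, the consequent, and both.
    n_x = sum(1 for basket in baskets if antecedent <= basket)
    n_y = sum(1 for basket in baskets if consequent <= basket)
    n_xy = sum(1 for basket in baskets if (antecedent | consequent) <= basket)

    support = n_xy / n                                  # P(X and Y)
    confidence = n_xy / n_x if n_x else 0.0             # P(Y | X)
    lift = confidence / (n_y / n) if n_y else 0.0       # P(Y | X) / P(Y)
    leverage = support - (n_x / n) * (n_y / n)          # P(X and Y) - P(X) P(Y)
    return support, confidence, lift, leverage


# Hypothetical toy data: five baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread"},
    {"milk"},
    {"eggs"},
]

support, confidence, lift, leverage = rule_metrics(baskets, {"bread"}, {"milk"})
print(f"support={support:.2f} confidence={confidence:.2f} "
      f"lift={lift:.2f} leverage={leverage:.2f}")
```

A lift above 1 suggests the two sides of the rule occur together more often than they would if they were independent; a lift close to 1 suggests the rule says little beyond how common the consequent already is.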
  • 9. We can change these parameters as required. To know what each parameter stands for, click on ‘More’. After changing the parameters, click on ‘OK’. Now, click on ‘Start’ to start building the model. Depending on the size of the data set, this takes a while; meanwhile, the Weka bird moves back and forth. A part of the output is shown here. Since we have set ‘numRules’ to 10, only the 10 best rules are shown. The first rule is
    Plain English Muffins= false 396 ==> 40 Watt Lightbulb= false 396 <conf:(1)> lift:(1.01) lev:(0) [1] conv:(1.98)
    That is, all 396 transactions that do not contain Plain English Muffins also do not contain a 40 Watt Lightbulb. The output also gives the confidence, lift, leverage and conviction of each rule (an explanation of each can be found under ‘More’, shown above). The model can be rerun with different parameters, and each result can be seen under the ‘Result list’. The results can also be saved for later.
    Regression
    Regression, as one knows, models the relation between a dependent variable and one or more independent variables. As there is not much need to explain regression, we jump into the process.
    Business applications
    Before we start with the tutorial, here are some areas where regression can be used
  • 10. Trend line analysis - to show the movement of financial or product attributes over time. Stock prices and oil prices can be analyzed using trend lines.
    Risk analysis for investments - the capital asset pricing model was developed using linear regression analysis.
    Sales or market forecasts - multivariate regression is a good method to forecast sales volumes or market shares.
    Total quality control - quality control methods frequently use linear regression to analyze key product specifications and other measurable parameters of a product or organization (for example, customer complaints over time).
    Human Resources - to predict the demographics and types of future work forces for large companies.
    Data set
    I have used a data set provided on the Weka website, where a number of data sets for different techniques can be found (reference 2). The data set I am using is ‘strike.arff’, extracted from ‘numeric.jar’. The data consists of days lost due to industrial disputes per 1000 wage and salary earners in 18 OECD countries from 1951 to 1985. The independent variables are
    1. country code
    2. year
    3. unemployment
    4. inflation
    5. parliamentary representation of social democratic and labor parties
    6. a time-invariant measure of union centralization
    If your data is not in .csv or .arff, it needs to be preprocessed as explained above.
    Preprocess
    After loading the data into Weka, it looks like this.
  • 11. For each numeric attribute, Weka gives statistics such as the mean, maximum, minimum and standard deviation. On clicking ‘Visualize all’, the graphs of all variables are shown.
  • 12. We check ‘All’ to select all variables and then click on the ‘Classify’ tab.
    Linear regression
    We click ‘Choose’ under Classifier and select ‘Linear Regression’ as shown. Click on the box beside ‘Choose’ to set the parameters for linear regression.
  • 13. Then, click on ‘OK’. Now, we have to tell Weka how to evaluate the model. Weka offers four test options: Use training set, where the model is evaluated on the same data it was built from; Supplied test set, where we supply a separate data set on which the model is evaluated; Cross-validation, where Weka splits the data into folds, repeatedly builds a model on all but one fold and tests it on the held-out fold, and averages the results; and Percentage split, where Weka builds the model on a given percentage of the data and tests it on the remainder. For this example, we choose Use training set. By default, Weka takes the last attribute as the dependent attribute. If that is not correct for the data, we select the required variable from the drop-down. We choose ‘volume’ as the dependent variable and click on ‘Start’. A part of the output is shown below.
  • 14. The first line of the model is
    175.7183 * country=5,3,13,17,7,1,18,6,9,4,10
    This term is an indicator: if the country code is one of the listed values (for example 5), you put a ‘1’ in the equation, and if it is not (for example 8), you put a ‘0’. By default, Weka employs attribute selection, which means it may not include all of the attributes in the regression equation. Hence not all of the independent variables appear in the above model. To turn off attribute selection, we change the ‘attributeSelectionMethod’ parameter to "No attribute selection" and run the model again. Now the model is as follows
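As an aside on the first line of the earlier model: the term can be read as an indicator (0/1) variable multiplied by its coefficient. A minimal Python sketch of that reading follows, using only the coefficient and country codes quoted in the text. The function name `country_term` is hypothetical, and the remaining terms of the regression equation are not reproduced.

```python
# Coefficient and country codes copied from the first line of the model output.
COEFFICIENT = 175.7183
COUNTRIES_IN_TERM = {5, 3, 13, 17, 7, 1, 18, 6, 9, 4, 10}


def country_term(country_code):
    """Contribution of the country term to the predicted value."""
    # The indicator is 1 when the country code is in the listed set, else 0.
    indicator = 1 if country_code in COUNTRIES_IN_TERM else 0
    return COEFFICIENT * indicator


print(country_term(5))  # country 5 is in the list, so the term contributes
print(country_term(8))  # country 8 is not in the list, so the term is zero
```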
  • 15. Non-numeric input variables
    If we have a binary input attribute (yes/no or true/false), we can convert the two values to 0 and 1. More generally, there are techniques that handle both numeric and non-numeric (categorical) attributes.
    1. One way is to build a decision tree and have each classification be a numeric value that is the average of the values for the training examples in that subgroup. The result is called a regression tree.
    2. Another option is to have a separate regression equation for each classification in the tree, based on the training examples in that subgroup. This is called a model tree.
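To illustrate the first option in its simplest form, here is a minimal Python sketch with made-up training data: the examples are grouped by the value of one categorical attribute, and each group predicts the mean of its training targets. A real regression tree learner would also choose which attribute to split on and could split repeatedly; neither step is shown here, and the function name `fit_leaf_means` is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fit_leaf_means(examples):
    """Map each category to the mean target of its training examples."""
    groups = defaultdict(list)
    for category, target in examples:
        groups[category].append(target)
    return {category: mean(values) for category, values in groups.items()}


# Hypothetical training data: (categorical input, numeric target).
training = [
    ("yes", 10.0),
    ("yes", 14.0),
    ("no", 3.0),
    ("no", 5.0),
    ("no", 7.0),
]

leaf_means = fit_leaf_means(training)
print(leaf_means)

# Predicting for a new example is a lookup of its subgroup's mean.
prediction = leaf_means["yes"]
```

A model tree would replace each stored mean with a regression equation fitted to the training examples of that subgroup.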
  • 16. References
    1. http://www.cs.waikato.ac.nz/ml/weka/
    2. http://www.cs.waikato.ac.nz/ml/weka/index_datasets.html
    3. http://inf.abdn.ac.uk/~hnguyen/teaching/CS5553/prac05.php
    4. http://inf.abdn.ac.uk/~hnguyen/teaching/CS5553/marketbasket.csv
    5. "The WEKA Data Mining Software: An Update" by Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten
    6. http://www.ehow.com/about_6160819_application-regression-analysis-business.html
    7. http://www.ibm.com/developerworks/opensource/library/os-weka1/index.html
    8. http://cs-people.bu.edu/dgs/courses/cs105/lectures/data_mining_estimation.pdf