Data mining - Weka
Submitted as a part of the course
‘IT for Business Intelligence’
Ramya Krishna P
10BM60056
4/19/2012
This paper introduces Weka briefly and proceeds to demonstrate application of two data mining
techniques – association rules and regression.
Table of Contents
Weka – Introduction
  Requirements
Getting started
Data sets
Association rules
  Business application
  Data set
  Preprocess
  Associate
Regression
  Business applications
  Data set
  Preprocess
  Linear regression
  Non-numeric input variables
References
Weka – Introduction
Weka is an open-source data mining tool. It is a collection of machine learning algorithms that lets
us do classification, regression, clustering, association rule mining and visualization.
Requirements
The latest versions of Weka, i.e., Weka 3.7.x, need Java 1.6 to be installed on your system. I have used
Weka 3.7.5 for this short tutorial. The latest and other editions of Weka can be downloaded from the
Weka website (see reference 1).
Getting started
You can run Weka from the command prompt or through the GUI. We use the GUI. On launch, Weka shows a
small window listing its applications.
For all our purposes, the ‘Explorer’ application is sufficient. Clicking ‘Explorer’ opens the Explorer window.
To load a data set into Weka, choose ‘Open file’ under the ‘Preprocess’ tab. First, a short note about data
sets.
Data sets
The default format of a Weka data set is .arff (Attribute-Relation File Format), which is an ASCII text file. A
snapshot of an .arff file looks like this.
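For reference, the general layout of an .arff file (a header with one @relation line and one @attribute line per column, followed by an @data section with comma-separated records) can be sketched in a few lines of Python. This is only an illustration of the layout: the relation name, the attribute names and the two data rows are hypothetical and do not come from any data set used in this tutorial.

```python
def to_arff(relation, attributes, rows):
    """Render a minimal ARFF document as a string.

    attributes is a list of (name, kind) pairs, where kind is either the
    string "numeric" or a list of the allowed nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            # Numeric (or other built-in) attribute type
            lines.append(f"@attribute {name} {kind}")
        else:
            # Nominal attribute: allowed values are listed in braces
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical example: two nominal attributes and one numeric attribute
arff_text = to_arff(
    "baskets",
    [("bread", ["true", "false"]), ("milk", ["true", "false"]), ("amount", "numeric")],
    [("true", "false", 12.5), ("false", "true", 3.0)],
)
print(arff_text)
```

Every record in the @data section has exactly one value per declared attribute, which is why rows of unequal length cannot be loaded directly.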
So, you can either prepare your data in this form or, if you have a spreadsheet (.xls or .xlsx), save
it in .csv format.
Now, on clicking ‘Open file’, select the .csv file containing your data and click ‘Open’.
I will proceed with the rest of the tutorial through examples.
Association rules
Association rule mining is a method for discovering relations between variables in a data set. From
these relations we derive rules that meet a chosen minimum level of support and confidence. Support is
the fraction of records in which the whole rule holds, and confidence is the fraction of records matching
the left-hand side of the rule that also match its right-hand side. Such rules can be of great business
value. One typical business application of association rules is ‘Market basket analysis’.
Business application
The market-basket problem assumes we have some large number of items, e.g., bread and milk. Customers
fill their market baskets with some subset of the items, and we get to know what items people buy
together, even if we don't know who they are. We look for association rules of the form
{X1, X2, ..., Xn} -> Y
meaning that if a basket contains all of X1, X2, ..., Xn, we have a good chance of finding Y in it as
well. So, the next time a retailer stocks up on X1, X2, ..., Xn, he might also stock up on Y based on
this prediction. Now, without going too much into the theory, let us look at our data set.
Data set
The format of my data set is like this
TID1 ID2 ID5 ID6
TID2 ID3 ID4 ID6 ID7 ID9
TID3 ID4 ID5
TID4 ID1 ID4 ID5 ID7 ID9 ID10
...
where the first column gives the transaction id and the rest of each row lists the products
purchased in that transaction. Unfortunately, Weka cannot accept the data set in this form because
the rows are of unequal lengths. Both .arff and .csv require each data record to
have the same number of fields.
To change the data format, create one attribute per "item" and use "true" and "false" field values
in the data row corresponding to the item. We can't use 0 and 1 because Apriori (the algorithm we will
be using) does not work on numeric attributes; it only works on nominal values. The data now looks
like
TID, ID1, ID2, ID3, ID4, ID5, ID6, ID7, ID8, ID9, ID10
1, false, true, false, false, true, true, false, false, false, false
2, false, false, true, true, false, true, true, false, true, false
3, false, false, false, true, true, false, false, false, false, false
4, true, false, false, true, true, false, true, false, true, true
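The conversion described above is mechanical, so it is easy to script. The following is a minimal Python sketch, not part of Weka, that turns the four sample transactions listed earlier into true/false rows of equal length and writes them as CSV text. The function name to_nominal_rows is hypothetical, and the sketch assumes the full item list (ID1 to ID10) is known in advance.

```python
import csv
import io

# The four sample transactions: transaction id -> items purchased
transactions = {
    1: ["ID2", "ID5", "ID6"],
    2: ["ID3", "ID4", "ID6", "ID7", "ID9"],
    3: ["ID4", "ID5"],
    4: ["ID1", "ID4", "ID5", "ID7", "ID9", "ID10"],
}
# One attribute per item, in a fixed column order
items = [f"ID{i}" for i in range(1, 11)]


def to_nominal_rows(transactions, items):
    """Return one equal-length row per transaction with 'true'/'false' per item."""
    rows = []
    for tid, basket in transactions.items():
        present = set(basket)
        rows.append([tid] + ["true" if item in present else "false" for item in items])
    return rows


rows = to_nominal_rows(transactions, items)

# Write the header and rows as CSV text (an in-memory buffer stands in for a file)
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["TID"] + items)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing the strings "true" and "false" rather than 1 and 0 is what keeps the columns nominal when the file is loaded.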
Now, I have a sample data set (marketbasket.csv, see reference 4) which is, thankfully, already in
.csv form. This is a huge data set with 300+ products and 1300+ rows. When you try to run it in
Weka, you get an error that the heap size is not sufficient. You can change the heap size by changing the
value of ‘maxheap’ in the RunWeka.ini file (the RunWeka configuration file). However, even with a heap
size of 1 GB, this data set is too large to run. So, I have cropped the data set to about 20 attributes and
400 rows. A snapshot of the data set is like this.
Weka lists all the attributes present in the data set. It also provides visualizations of the data and
other statistics. For example, we can see that ‘fat free hamburger’ is true only 41 times out of 400. We
can now select the attributes we want for our analysis one by one, check ‘All’ to select them all, or
click ‘Pattern’ and type a Perl-style regular expression to select the attributes whose names match it.
We check ‘All’ and then go to the ‘Associate’ tab.
Associate
In the ‘Associate’ tab, click ‘Choose’. Out of the algorithms listed, we select Apriori. Clicking
the text box beside ‘Choose’ (i.e., on Apriori) lists the various parameters used by Apriori.
We can change these parameters as required. To know what each parameter stands for, click
on ‘More’. After changing the parameters, click on ‘OK’.
Now, click on ‘Start’ to start building the model. Depending on the size of the data set, this takes a
while; meanwhile, the Weka bird icon moves from side to side to show that the model is being built.
A part of the output is shown here.
Since we have set ‘numRules’ to 10, only the 10 best rules are shown here. The first rule is
Plain English Muffins=false 396 ==> 40 Watt Lightbulb=false 396 <conf:(1)> lift:(1.01) lev:(0) [1] conv:(1.98)
That is, in all 396 transactions without Plain English Muffins, the 40 Watt Lightbulb was not bought
either. A lift this close to 1 means the two sides of the rule are nearly independent, so the rule
carries little information. Each rule is also reported with its confidence, lift, leverage and
conviction (an explanation of each can be found under ‘More’, mentioned above).
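To make the rule measures concrete, here is a minimal Python sketch that computes support, confidence, lift and leverage for one rule over the four sample transactions from the ‘Data set’ section. It uses the usual textbook definitions; Weka's own implementation and rounding may differ, and conviction is left out. The function name rule_metrics is hypothetical, and the sketch assumes the premise occurs in at least one transaction.

```python
# The four sample transactions, as sets of item ids
transactions = [
    {"ID2", "ID5", "ID6"},
    {"ID3", "ID4", "ID6", "ID7", "ID9"},
    {"ID4", "ID5"},
    {"ID1", "ID4", "ID5", "ID7", "ID9", "ID10"},
]


def rule_metrics(transactions, premise, consequence):
    """Textbook measures for the rule premise -> consequence."""
    n = len(transactions)
    n_premise = sum(1 for t in transactions if premise <= t)
    n_consequence = sum(1 for t in transactions if consequence <= t)
    n_both = sum(1 for t in transactions if (premise | consequence) <= t)

    support = n_both / n                      # P(X and Y)
    confidence = n_both / n_premise           # P(Y | X)
    lift = confidence / (n_consequence / n)   # P(Y | X) / P(Y)
    leverage = support - (n_premise / n) * (n_consequence / n)
    return {
        "support": support,
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
    }


metrics = rule_metrics(transactions, {"ID4"}, {"ID5"})
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For the rule {ID4} -> ID5, three transactions contain ID4 and two of them also contain ID5, so the confidence is 2/3. Because ID5 appears in three of the four transactions anyway, the lift is below 1 and the leverage is negative.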
The model can be rerun with different parameters, and the output of each run can be seen under the ‘Result
list’. The results can also be saved for later.
Regression
Regression models the relation between a dependent variable and one or more independent
variables. As the technique is widely known, we jump straight into the process.
Business applications
Before we start with the tutorial, here are some areas where regression can be used:
Trend line analysis - to show the movement of financial or product attributes over time. Stock
prices and oil prices can be analyzed using trend lines.
Risk analysis for investments - the beta coefficient in the capital asset pricing model is estimated
using linear regression analysis.
Sales or market forecasts - multiple regression is a good method to forecast sales volumes or
market shares.
Total quality control - quality control methods frequently use linear regression to analyze key
product specifications and other measurable parameters of a product or organization (e.g.,
customer complaints over time).
Human resources - to predict the demographics and types of future work forces for large
companies.
Data set
I have used a data set provided on the Weka website for this. A number of data sets for different
techniques can be found on the Weka data sets page (see reference 2).
The data set I am using is ‘strike.arff’, extracted from ‘numeric.jar’. The data consists of days lost due to
industrial disputes per 1000 wage and salary earners in 18 OECD countries from 1951 to 1985. The
independent variables are
1. country code
2. year
3. unemployment
4. inflation
5. parliamentary representation of social democratic and labor parties and
6. a time-invariant measure of union centralization.
If your data is not in .csv or .arff, it needs to be preprocessed as explained above.
Preprocess
After loading the data into Weka, it looks like this.
For each numerical attribute, Weka gives statistics such as the mean, maximum, minimum and standard deviation.
On clicking ‘Visualize All’, the graphs of all variables are shown.
We check ‘All’ to select all variables and then click on the ‘Classify’ tab.
Linear regression
We click ‘Choose’ under Classifier and select ‘Linear Regression’ as shown.
Click on the box beside ‘Choose’ to set the parameters for Linear Regression.
Then, click on ‘OK’. Now, we have to tell Weka how to test the model. There are four test options:
Use training set, which evaluates the model on the same data it was built from; Supplied test set,
where we supply a separate data set on which the model is evaluated; Cross-validation, which repeatedly
builds the model on part of the data and tests it on the held-out remainder, averaging the results over
the folds; and Percentage split, where Weka builds the model on a given percentage of the data and tests
it on the rest. For this example, we choose ‘Use training set’.
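The Cross-validation and Percentage split options both hold back part of the data so that the model is judged on records it has not seen. The Python sketch below illustrates only that idea of splitting. It does not reproduce Weka's own splitting logic, and the function names percentage_split and k_folds, the seed and the sample rows are all hypothetical.

```python
import random


def percentage_split(rows, train_percent, seed=1):
    """Shuffle the rows and split them into (train, test) by percentage."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]


def k_folds(rows, k):
    """Yield (train, test) pairs; each row lands in exactly one test fold."""
    for i in range(k):
        test = rows[i::k]
        train = [row for j, row in enumerate(rows) if j % k != i]
        yield train, test


rows = list(range(10))  # stand-in for ten data records

train, test = percentage_split(rows, 66)
print(len(train), "training rows,", len(test), "test rows")

folds = list(k_folds(rows, 5))
for train_part, test_part in folds:
    print("test fold:", test_part)
```

In cross-validation the error measures are averaged over the folds, while with a percentage split there is a single held-out test set.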
By default, Weka takes the last attribute as the dependent attribute. If that is not the right one for
the data, we choose the required attribute from the drop-down. We choose ‘volume’ as
the dependent variable and click on ‘Start’.
A part of the output is shown below.
The first line of the model is
175.7183 * country=5,3,13,17,7,1,18,6,9,4,10
The term after the coefficient is an indicator. If the country code is one of the listed values (e.g., 5),
you put a ‘1’ in the equation for this term, and if it is not (e.g., 8), you put a ‘0’.
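The way this term enters the equation can be written out as a small Python sketch. The coefficient and the list of country codes are taken from the model line above; the function name country_term is hypothetical, and only this one term of the model is shown.

```python
# Coefficient and country codes from the first line of the model
COEFFICIENT = 175.7183
COUNTRY_GROUP = {5, 3, 13, 17, 7, 1, 18, 6, 9, 4, 10}


def country_term(country_code):
    """Contribution of the country indicator term to the prediction."""
    indicator = 1 if country_code in COUNTRY_GROUP else 0
    return COEFFICIENT * indicator


print(country_term(5))  # country 5 is in the list, so the term contributes
print(country_term(8))  # country 8 is not in the list, so the term is zero
```

The full prediction is the sum of all such terms in the model plus the constant.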
By default, Weka employs attribute selection, which means it may not include all of the attributes in the
regression equation. Hence not all the independent variables appear in the above model. To
turn off attribute selection, we change the ‘attributeSelectionMethod’ parameter to "No attribute
selection" and run the model again.
Now the model is as follows
Non-numeric input variables
If we have a binary input attribute (yes/no or true/false), we can
convert the two values to 0 and 1.
However, there are also techniques that handle both numeric and non-numeric (categorical) attributes.
1. One way is to build a decision tree and have each classification be a numeric value that is the
average of the values for the training examples in that subgroup - the result is called a
regression tree
2. Another option is to have a separate regression equation for each classification in the tree –
based on the training examples in that subgroup – this is called a model tree.
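The difference between the two options can be illustrated with a minimal Python sketch. It assumes the tree has already split the data into subgroups on one categorical attribute, so only the leaf predictions are shown. The training examples, the group labels and the function names are hypothetical, and the per-leaf equation is a simple least-squares line in one variable.

```python
from statistics import mean

# Hypothetical training examples: (subgroup reached in the tree, x, y)
examples = [
    ("a", 1.0, 2.0), ("a", 2.0, 4.0), ("a", 3.0, 6.0),
    ("b", 1.0, 10.0), ("b", 2.0, 11.0), ("b", 3.0, 12.0),
]


def group_points(examples):
    """Collect the (x, y) points that fall into each subgroup."""
    groups = {}
    for group, x, y in examples:
        groups.setdefault(group, []).append((x, y))
    return groups


def regression_tree_leaves(examples):
    """Regression tree: each leaf predicts the average y of its subgroup."""
    return {g: mean(y for _, y in pts) for g, pts in group_points(examples).items()}


def fit_line(points):
    """Least-squares line y = slope * x + intercept for one subgroup."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in points) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


def model_tree_leaves(examples):
    """Model tree: each leaf holds its own regression equation."""
    return {g: fit_line(pts) for g, pts in group_points(examples).items()}


averages = regression_tree_leaves(examples)
lines = model_tree_leaves(examples)
print(averages)
print(lines)
```

A regression tree gives the same prediction for every example in a subgroup, while a model tree still lets the prediction vary with the numeric inputs inside each subgroup.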
References
1. http://www.cs.waikato.ac.nz/ml/weka/
2. http://www.cs.waikato.ac.nz/ml/weka/index_datasets.html
3. http://inf.abdn.ac.uk/~hnguyen/teaching/CS5553/prac05.php
4. http://inf.abdn.ac.uk/~hnguyen/teaching/CS5553/marketbasket.csv
5. "The WEKA Data Mining Software: An Update" by Mark Hall, Eibe Frank, Geoffrey Holmes,
Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten
6. http://www.ehow.com/about_6160819_application-regression-analysis-business.html
7. http://www.ibm.com/developerworks/opensource/library/os-weka1/index.html
8. http://cs-people.bu.edu/dgs/courses/cs105/lectures/data_mining_estimation.pdf