WEKA
IT For Business Intelligence

Ishan Awadhesh
10BM60033 • Term Paper • 19 April 2012




Vinod Gupta School of Management, IIT Kharagpur
Table of Contents

WEKA

Data Used

Classification Analysis

Cluster Analysis

Other Applications of Weka

References




WEKA
Waikato Environment for Knowledge Analysis

DATA MINING TECHNIQUES

WEKA is a collection of state-of-the-art machine learning algorithms and data preprocessing
tools written in Java, developed at the University of Waikato, New Zealand. It is free software
that runs on almost any platform and is available under the GNU General Public License. It
has a wide range of applications in various data mining techniques. It provides extensive
support for the entire process of experimental data mining, including preparing the input
data, evaluating learning schemes statistically, and visualizing the input data and the result of
learning. The WEKA workbench includes methods for the main data mining problems:
regression, classification, clustering, association rule mining, and attribute selection. It can
be used in either of the following two interfaces –
• Command Line Interface (CLI)
• Graphical User Interface (GUI)



The WEKA GUI Chooser appears like this –




The buttons can be used to start the following applications –
        • Explorer: An environment for exploring data with WEKA. It gives access to all the
        facilities through menu selection and form filling.
        • Experimenter: Helps answer the question: which methods and parameter values
        work best for the given problem?
        • KnowledgeFlow: Offers the same functions as the Explorer and also supports
        incremental learning. It allows designing configurations for streamed data processing.
        Incremental algorithms can be used to process very large datasets.
        • Simple CLI: Provides a simple command-line interface for directly executing
        WEKA commands.


This term paper will demonstrate the following two data mining techniques using WEKA:
• Classification
• Clustering (Simple K Means)




Data Used
The data used in this paper is bank customer data, available in comma-separated values (CSV) format.




The data contains the following fields:
id - a unique identification number
age - age of the customer in years (numeric)
sex - MALE / FEMALE
region - inner_city / rural / suburban / town
income - income of the customer (numeric)
married - is the customer married (YES/NO)
children - number of children (numeric)
car - does the customer own a car (YES/NO)
save_acct - does the customer have a savings account (YES/NO)
current_acct - does the customer have a current account (YES/NO)
mortgage - does the customer have a mortgage (YES/NO)
pep - did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO)
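
To make the field list concrete, here is a minimal Python sketch of how a file with this layout could be parsed outside WEKA. The three sample rows are hypothetical values invented for illustration, not records from the actual bank file, and `load_bank_rows` is an illustrative helper name rather than part of any tool used in this paper.

```python
import csv
import io

# Hypothetical three-row excerpt using the column layout listed above.
# The values are made up for illustration; they are not from the real file.
SAMPLE_CSV = """id,age,sex,region,income,married,children,car,save_acct,current_acct,mortgage,pep
ID00001,48,FEMALE,inner_city,17546.0,NO,1,NO,NO,NO,NO,YES
ID00002,40,MALE,town,30085.1,YES,3,YES,NO,YES,YES,NO
ID00003,51,FEMALE,inner_city,16575.4,YES,0,YES,YES,YES,NO,NO
"""

# Fields described as numeric in the field list.
NUMERIC_FIELDS = ("age", "income", "children")


def load_bank_rows(text):
    """Parse CSV text into dicts, converting the numeric fields to float."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        for name in NUMERIC_FIELDS:
            record[name] = float(record[name])
        rows.append(record)
    return rows


rows = load_bank_rows(SAMPLE_CSV)
buyers = [r["id"] for r in rows if r["pep"] == "YES"]
print(len(rows), "rows; PEP buyers:", buyers)
```

The remaining fields stay as strings, which matches how they are described above (YES/NO flags and named categories).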




Classification Analysis

Question

"How likely is person X to buy the new Personal Equity Plan (PEP)?" By creating a classification
tree (a decision tree), the data can be mined to determine the likelihood that this person will buy
a new PEP. Possible nodes on the tree would be children, income level, and marital status. The
attributes of this person can be run against the decision tree to determine the likelihood of
that person purchasing the Personal Equity Plan.

Load the data file Bank_Data.CSV into WEKA. This file contains 900 records of current
customers of the bank.
We need to divide up our records so that some data instances are used to create the model and
some are used to test the model, to ensure that we didn't overfit it.
Your screen should look like Figure 1 after loading the data.


Figure 1. Bank Data Classification in Weka
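
This paper validates the model with a separately supplied test file. For readers who have only one file, a common alternative is a random holdout split. The sketch below is an illustration under that assumption; `holdout_split`, the 30% test fraction, and the seed are hypothetical choices, not settings taken from this paper.

```python
import random


def holdout_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and split them into (train, test) lists."""
    shuffled = list(rows)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in records; in practice these would be the parsed customer rows.
records = list(range(10))
train, test = holdout_split(records)
print(len(train), "training rows,", len(test), "test rows")
```

Keeping the test rows out of model building is what allows the later accuracy comparison to say anything about overfitting.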




We select the Classify tab, then the trees node, and then the J48 leaf.




Figure 2. Bank Data Classification Algorithm
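
J48 is WEKA's implementation of the C4.5 decision tree learner, which chooses splits using a criterion built on information gain. The sketch below illustrates plain information gain only (not C4.5's gain ratio, pruning, or numeric-attribute handling), on four hypothetical toy rows that are not drawn from the bank data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target="pep"):
    """Entropy reduction obtained by splitting the rows on one nominal attribute."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical toy rows: here "married" separates buyers perfectly, "car" not at all.
toy = [
    {"married": "YES", "car": "YES", "pep": "NO"},
    {"married": "YES", "car": "NO", "pep": "NO"},
    {"married": "NO", "car": "YES", "pep": "YES"},
    {"married": "NO", "car": "NO", "pep": "YES"},
]

for name in ("married", "car"):
    print(name, round(information_gain(toy, name), 4))
```

An attribute with higher gain is a better candidate for a node near the top of the tree, which is why attributes such as children, income, and marital status can end up as early splits.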




At this point, we are ready to create our model in WEKA. Ensure that Use training set is
selected so we use the data set we just loaded to create our model. Click Start and let WEKA
run. The output from this model should look like the results in Listing 1.




Listing 1. Output from WEKA’s classification model




What do these numbers mean?
Correctly Classified Instances: 92.3333%
Incorrectly Classified Instances: 7.6667%
False Positives: 29
False Negatives: 17
Based on the accuracy rate of 92.3333%, this appears to be a reasonably good model for predicting
whether a new customer will buy a Personal Equity Plan, although this figure is measured on
the same data that was used to build the model.
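
The percentages and error counts above are linked by simple arithmetic, sketched below. The total of 600 evaluated instances is an assumption inferred from the reported figures (29 + 17 = 46 errors corresponds to 7.6667% only when the total is 600); it is not stated in the WEKA output quoted here.

```python
def accuracy_summary(false_positives, false_negatives, total):
    """Derive correct/incorrect counts and percentages from the error counts."""
    errors = false_positives + false_negatives
    correct = total - errors
    return {
        "correct": correct,
        "errors": errors,
        "accuracy_pct": 100.0 * correct / total,
        "error_pct": 100.0 * errors / total,
    }


# total=600 is an inferred assumption, see the note above.
summary = accuracy_summary(false_positives=29, false_negatives=17, total=600)
print(summary["correct"], "correct,", summary["errors"], "errors")
print(round(summary["accuracy_pct"], 4), round(summary["error_pct"], 4))
```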



You can see the tree by right-clicking on the model you just created, in the result list. On the
pop-up menu, select Visualize tree. You'll see the classification tree we just created,
although in this example, the visual tree doesn't offer much help.


Figure 3. Classification Tree Visualization




There's one final step to validating our classification tree, which is to run a test set through
the model and check that the model's accuracy on the test set isn't too different from its
accuracy on the training set. To do this, in Test options, select the Supplied test set radio
button and click Set. Choose the file bmw-test.arff, which contains 1,500 records that were
not in the training set we used to create the model. When we click Start this time, WEKA will
run this test data set through the model we already created and report how the model did.
The output is shown below.




Listing 2. Output from WEKA’s classification model of Test Data




Comparing the "Correctly Classified Instances" from this test set (90.5 percent) with the
"Correctly Classified Instances" from the training set (92.3333 percent), we see that the
accuracy of the model is pretty close, which indicates that the model will not break down with
unknown data, or when future data is applied to it.
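
The comparison above can be expressed as a small check. This is only a sketch: the 5-point tolerance is a hypothetical cut-off chosen for illustration, since the paper describes the two accuracies as "pretty close" without defining a threshold.

```python
def accuracy_gap(train_pct, test_pct):
    """Return how many percentage points accuracy drops from training to test."""
    return train_pct - test_pct


def generalizes(train_pct, test_pct, tolerance_pct=5.0):
    # tolerance_pct is a hypothetical cut-off; the paper does not define one.
    return abs(accuracy_gap(train_pct, test_pct)) <= tolerance_pct


gap = accuracy_gap(92.3333, 90.5)
print(f"gap = {gap:.4f} percentage points; within tolerance: {generalizes(92.3333, 90.5)}")
```

A large gap would suggest the tree has memorized the training data rather than learned patterns that carry over to new customers.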




Cluster Analysis

Question: "Which age groups are more likely to buy a Personal Equity Plan?" The data can be
mined to compare the ages of past PEP purchasers. From this data, it could be found
whether certain age groups (22-30 year olds, for example) have a higher propensity to go
for a PEP. The data, when mined, will tend to cluster around certain age groups,
allowing the user to quickly determine patterns in the data.


Load the data file Bank_data.CSV into WEKA using the same steps we used to load data into
the Preprocess tab. Take a few minutes to look around the data in this tab. Look at the
columns, the attribute data, the distribution of the columns, etc. Your screen should look like
Figure 4 after loading the data.


Figure 4. Bank cluster data in Weka




With this data set, we are looking to create clusters, so instead of clicking on the Classify tab,
click on the Cluster tab. Click Choose and select SimpleKMeans from the choices that appear
(this will be our preferred method of clustering for this paper).


Finally, we want to adjust the attributes of our cluster algorithm by clicking SimpleKMeans.
The only attribute of the algorithm we are interested in adjusting here is the numClusters
field, which sets how many clusters to create. Let's change the default value of 2 to
5 for now, but keep these steps in mind if you later want to adjust the number of clusters
created. Your WEKA Explorer should look like Figure 5 at this point. Click OK to accept
these values.


Figure 5. Cluster Attributes
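
To show what the number-of-clusters setting controls, here is a minimal k-means sketch in plain Python. It is a simplified illustration of the general algorithm, not WEKA's SimpleKMeans: it handles numeric attributes only, and the sample points (age, children), the seed, and the function names are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, num_clusters, iterations=100, seed=10):
    """Return (centroids, assignments) after iterating until assignments stop changing."""
    rng = random.Random(seed)
    # Start from randomly chosen data points as the initial centroids.
    centroids = rng.sample(points, num_clusters)
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(num_clusters), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(num_clusters):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Hypothetical (age, children) pairs forming two obvious groups.
points = [(25, 1), (27, 0), (26, 2), (60, 3), (62, 2), (61, 4)]
centroids, assignments = k_means(points, num_clusters=2)
print(assignments)
```

Raising `num_clusters` splits the data into finer groups, which is the effect of changing numClusters from 2 to 5 (and later 10) in the Explorer.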




At this point, we are ready to run the clustering algorithm. Remember that 100 rows of data
with five data clusters would likely take a few hours of computation with a spreadsheet, but
WEKA can spit out the answer in less than a second. Your output should look like Listing 3.




Listing 3. Cluster Output with 5 clusters




Listing 4. Cluster Output with 10 Clusters




Clusters


One thing that is clear from the clusters is that male behavior is concentrated in only 2-3
groups, while female behavior is spread across 7 clusters, so preparing an
offering for a specific


Description of Clusters


Cluster 0: This group consists of unmarried, mid-income females in their early 40’s
who live in rural areas. On average they have two children. They have no car and no personal
equity plan, but they do have savings and current accounts.


Cluster 1: This group consists of married, high-income females in their late 40’s who
live in rural areas. On average they have two children. They have no car and no personal
equity plan, but they do have savings and current accounts.


Cluster 2: This group consists of married, low-income females in their early 40’s who
live in the inner city. On average they have one child. They have no car and no savings account,
but they do have a current account and a personal equity plan.



Cluster 3: This group consists of married, low-income females in their early 30’s who
live in town. On average they have one or two children. They have no car, no savings account,
and no personal equity plan, but they do have a current account.


Cluster 4: This group consists of married, mid-income males in their late 30’s who
live in the inner city. On average they have one child or none, and no savings account, but they do
have a personal equity plan and savings and current accounts.


Cluster 5: This group consists of unmarried, high-income males in their early 40’s
who live in town. On average they have one child or none. They have a car, a personal equity
plan, and savings and current accounts.


Cluster 6: This group consists of married, mid-income females in their early 40’s
who live in the inner city. They mostly have no children. They have no savings account
and no personal equity plan, but they do have a current account.


Cluster 7: This group consists of unmarried, high-income females in their mid 40’s
who live in the inner city. On average they have one or two children. They have no car and no
personal equity plan, but they do have savings and current accounts.


Cluster 8: This group consists of unmarried, high-income females in their mid 40’s
who live in town. On average they have one child or none. They have no personal equity plan,
but they do have a car and savings and current accounts.


Cluster 9: This group consists of married, mid-income males in their early 40’s who
live in the inner city. On average they have one or two children. They have no car, no personal
equity plan, and no current account, but they do have a savings account.




One other interesting way to examine the data in these clusters is to inspect it visually. To do
this, right-click in the Result List section of the Cluster tab. One of the options
in this pop-up menu is Visualize Cluster Assignments. A window will pop up that lets you
play with the results and see them visually. For this example, change the X axis to income
(Num), the Y axis to children (Num), and the Color to Cluster (Nom). This will show us in a
chart how the clusters are grouped in terms of income and number of children. Also, turn up the
"Jitter" to about three-fourths of the way to the maximum, which will artificially scatter the plot
points to allow us to see them more easily.


Figure 6. Cluster Visual Inspection
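
The idea behind jitter can be sketched as adding a small random offset to each plotted value so that overlapping points separate. This is a generic illustration under assumed parameters; it does not reproduce WEKA's own jitter formula, and the `jitter` helper, its `amount` scale, and the sample values are hypothetical.

```python
import random


def jitter(values, amount, seed=0):
    """Offset each value by uniform noise scaled to the data range (display only)."""
    rng = random.Random(seed)
    # Scale noise by the range so the effect is similar for any axis; avoid a zero span.
    span = (max(values) - min(values)) or 1.0
    return [v + rng.uniform(-amount, amount) * span for v in values]


# Number of children takes only a few distinct values, so points overlap heavily.
children = [0, 0, 1, 1, 2, 2, 3]
jittered = jitter(children, amount=0.05)
print([round(v, 3) for v in jittered])
```

The offsets are for display only; the underlying data and cluster assignments are unchanged.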




Other Applications of Weka

• DISCRETIZATION

• REGRESSION

• NEAREST NEIGHBOR




References
https://www.ibm.com/developerworks/opensource/library/os-weka2/
http://maya.cs.depaul.edu/classes/ect584/weka/preprocess.html
http://www.cs.waikato.ac.nz/~ml/weka/




Vinod Gupta School of Management, IIT Kharagpur
                  17

Weitere ähnliche Inhalte

Was ist angesagt?

Weka presentation
Weka presentationWeka presentation
Weka presentationSaeed Iqbal
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingHoang Nguyen
 
WEKA Tutorial
WEKA TutorialWEKA Tutorial
WEKA Tutorialbutest
 
Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...
Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...
Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...Praxitelis Nikolaos Kouroupetroglou
 
Laptop Price Prediction system
Laptop Price Prediction systemLaptop Price Prediction system
Laptop Price Prediction systemMDRIAZHASAN
 
Introduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersIntroduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersFunctional Imperative
 
Association Analysis in Data Mining
Association Analysis in Data MiningAssociation Analysis in Data Mining
Association Analysis in Data MiningKamal Acharya
 
DATASCIENCE vs BUSINESS INTELLIGENCE.pptx
DATASCIENCE vs BUSINESS INTELLIGENCE.pptxDATASCIENCE vs BUSINESS INTELLIGENCE.pptx
DATASCIENCE vs BUSINESS INTELLIGENCE.pptxOTA13NayabNakhwa
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learningParas Kohli
 
ResNet basics (Deep Residual Network for Image Recognition)
ResNet basics (Deep Residual Network for Image Recognition)ResNet basics (Deep Residual Network for Image Recognition)
ResNet basics (Deep Residual Network for Image Recognition)Sanjay Saha
 
Convolutional Neural Networks
Convolutional Neural NetworksConvolutional Neural Networks
Convolutional Neural NetworksAshray Bhandare
 
Heart disease prediction system
Heart disease prediction systemHeart disease prediction system
Heart disease prediction systemSWAMI06
 
Convolutional Neural Network - CNN | How CNN Works | Deep Learning Course | S...
Convolutional Neural Network - CNN | How CNN Works | Deep Learning Course | S...Convolutional Neural Network - CNN | How CNN Works | Deep Learning Course | S...
Convolutional Neural Network - CNN | How CNN Works | Deep Learning Course | S...Simplilearn
 
What Is Random Forest Classification And How Can It Help Your Business?
What Is Random Forest Classification And How Can It Help Your Business?What Is Random Forest Classification And How Can It Help Your Business?
What Is Random Forest Classification And How Can It Help Your Business?Smarten Augmented Analytics
 
Model selection and cross validation techniques
Model selection and cross validation techniquesModel selection and cross validation techniques
Model selection and cross validation techniquesVenkata Reddy Konasani
 
MACHINE LEARNING - GENETIC ALGORITHM
MACHINE LEARNING - GENETIC ALGORITHMMACHINE LEARNING - GENETIC ALGORITHM
MACHINE LEARNING - GENETIC ALGORITHMPuneet Kulyana
 

Was ist angesagt? (20)

Machine Learning and Data Mining
Machine Learning and Data MiningMachine Learning and Data Mining
Machine Learning and Data Mining
 
Weka presentation
Weka presentationWeka presentation
Weka presentation
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
WEKA Tutorial
WEKA TutorialWEKA Tutorial
WEKA Tutorial
 
Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...
Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...
Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...
 
Laptop Price Prediction system
Laptop Price Prediction systemLaptop Price Prediction system
Laptop Price Prediction system
 
Introduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersIntroduction to Machine Learning Classifiers
Introduction to Machine Learning Classifiers
 
Association Analysis in Data Mining
Association Analysis in Data MiningAssociation Analysis in Data Mining
Association Analysis in Data Mining
 
DATASCIENCE vs BUSINESS INTELLIGENCE.pptx
DATASCIENCE vs BUSINESS INTELLIGENCE.pptxDATASCIENCE vs BUSINESS INTELLIGENCE.pptx
DATASCIENCE vs BUSINESS INTELLIGENCE.pptx
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learning
 
Search methods
Search methodsSearch methods
Search methods
 
ResNet basics (Deep Residual Network for Image Recognition)
ResNet basics (Deep Residual Network for Image Recognition)ResNet basics (Deep Residual Network for Image Recognition)
ResNet basics (Deep Residual Network for Image Recognition)
 
Content based filtering
Content based filteringContent based filtering
Content based filtering
 
Cnn
CnnCnn
Cnn
 
Convolutional Neural Networks
Convolutional Neural NetworksConvolutional Neural Networks
Convolutional Neural Networks
 
Heart disease prediction system
Heart disease prediction systemHeart disease prediction system
Heart disease prediction system
 
Convolutional Neural Network - CNN | How CNN Works | Deep Learning Course | S...
Convolutional Neural Network - CNN | How CNN Works | Deep Learning Course | S...Convolutional Neural Network - CNN | How CNN Works | Deep Learning Course | S...
Convolutional Neural Network - CNN | How CNN Works | Deep Learning Course | S...
 
What Is Random Forest Classification And How Can It Help Your Business?
What Is Random Forest Classification And How Can It Help Your Business?What Is Random Forest Classification And How Can It Help Your Business?
What Is Random Forest Classification And How Can It Help Your Business?
 
Model selection and cross validation techniques
Model selection and cross validation techniquesModel selection and cross validation techniques
Model selection and cross validation techniques
 
MACHINE LEARNING - GENETIC ALGORITHM
MACHINE LEARNING - GENETIC ALGORITHMMACHINE LEARNING - GENETIC ALGORITHM
MACHINE LEARNING - GENETIC ALGORITHM
 

Ähnlich wie Classification and Clustering Analysis using Weka

Weka_Manual_Sagar
Weka_Manual_SagarWeka_Manual_Sagar
Weka_Manual_SagarSagar Kumar
 
Barga Data Science lecture 10
Barga Data Science lecture 10Barga Data Science lecture 10
Barga Data Science lecture 10Roger Barga
 
Feature extraction for classifying students based on theirac ademic performance
Feature extraction for classifying students based on theirac ademic performanceFeature extraction for classifying students based on theirac ademic performance
Feature extraction for classifying students based on theirac ademic performanceVenkat Projects
 
Post Graduate Admission Prediction System
Post Graduate Admission Prediction SystemPost Graduate Admission Prediction System
Post Graduate Admission Prediction SystemIRJET Journal
 
AI TESTING: ENSURING A GOOD DATA SPLIT BETWEEN DATA SETS (TRAINING AND TEST) ...
AI TESTING: ENSURING A GOOD DATA SPLIT BETWEEN DATA SETS (TRAINING AND TEST) ...AI TESTING: ENSURING A GOOD DATA SPLIT BETWEEN DATA SETS (TRAINING AND TEST) ...
AI TESTING: ENSURING A GOOD DATA SPLIT BETWEEN DATA SETS (TRAINING AND TEST) ...ijsc
 
Loan Analysis Predicting Defaulters
Loan Analysis Predicting DefaultersLoan Analysis Predicting Defaulters
Loan Analysis Predicting DefaultersIRJET Journal
 
IRJET- Analysis of Brand Value Prediction based on Social Media Data
IRJET-  	  Analysis of Brand Value Prediction based on Social Media DataIRJET-  	  Analysis of Brand Value Prediction based on Social Media Data
IRJET- Analysis of Brand Value Prediction based on Social Media DataIRJET Journal
 
data-science-lifecycle-ebook.pdf
data-science-lifecycle-ebook.pdfdata-science-lifecycle-ebook.pdf
data-science-lifecycle-ebook.pdfDanilo Cardona
 
A Comparative Study on Identical Face Classification using Machine Learning
A Comparative Study on Identical Face Classification using Machine LearningA Comparative Study on Identical Face Classification using Machine Learning
A Comparative Study on Identical Face Classification using Machine LearningIRJET Journal
 
BIG MART SALES PREDICTION USING MACHINE LEARNING
BIG MART SALES PREDICTION USING MACHINE LEARNINGBIG MART SALES PREDICTION USING MACHINE LEARNING
BIG MART SALES PREDICTION USING MACHINE LEARNINGIRJET Journal
 
STOCK MARKET ANALYZING AND PREDICTION USING MACHINE LEARNING TECHNIQUES
STOCK MARKET ANALYZING AND PREDICTION USING MACHINE LEARNING TECHNIQUESSTOCK MARKET ANALYZING AND PREDICTION USING MACHINE LEARNING TECHNIQUES
STOCK MARKET ANALYZING AND PREDICTION USING MACHINE LEARNING TECHNIQUESIRJET Journal
 
LOAN APPROVAL PRDICTION SYSTEM USING MACHINE LEARNING.
LOAN APPROVAL PRDICTION SYSTEM USING MACHINE LEARNING.LOAN APPROVAL PRDICTION SYSTEM USING MACHINE LEARNING.
LOAN APPROVAL PRDICTION SYSTEM USING MACHINE LEARNING.Souma Maiti
 
Introduction to Machine Learning and Data Science using the Autonomous databa...
Introduction to Machine Learning and Data Science using the Autonomous databa...Introduction to Machine Learning and Data Science using the Autonomous databa...
Introduction to Machine Learning and Data Science using the Autonomous databa...Sandesh Rao
 
The 8 Step Data Mining Process
The 8 Step Data Mining ProcessThe 8 Step Data Mining Process
The 8 Step Data Mining ProcessMarc Berman
 
Machine Learning in Autonomous Data Warehouse
 Machine Learning in Autonomous Data Warehouse Machine Learning in Autonomous Data Warehouse
Machine Learning in Autonomous Data WarehouseSandesh Rao
 
IRJET- Sentimental Analysis of Product Reviews for E-Commerce Websites
IRJET- Sentimental Analysis of Product Reviews for E-Commerce WebsitesIRJET- Sentimental Analysis of Product Reviews for E-Commerce Websites
IRJET- Sentimental Analysis of Product Reviews for E-Commerce WebsitesIRJET Journal
 
Introduction to Machine Learning and Data Science using Autonomous Database ...
Introduction to Machine Learning and Data Science using Autonomous Database  ...Introduction to Machine Learning and Data Science using Autonomous Database  ...
Introduction to Machine Learning and Data Science using Autonomous Database ...Sandesh Rao
 
Andrew NG machine learning
Andrew NG machine learningAndrew NG machine learning
Andrew NG machine learningShareDocView.com
 
Loan Eligibility Checker
Loan Eligibility CheckerLoan Eligibility Checker
Loan Eligibility CheckerKiranVodela
 

Ähnlich wie Classification and Clustering Analysis using Weka (20)

Weka_Manual_Sagar
Weka_Manual_SagarWeka_Manual_Sagar
Weka_Manual_Sagar
 
Barga Data Science lecture 10
Barga Data Science lecture 10Barga Data Science lecture 10
Barga Data Science lecture 10
 
Feature extraction for classifying students based on theirac ademic performance
Feature extraction for classifying students based on theirac ademic performanceFeature extraction for classifying students based on theirac ademic performance
Feature extraction for classifying students based on theirac ademic performance
 
Post Graduate Admission Prediction System
Post Graduate Admission Prediction SystemPost Graduate Admission Prediction System
Post Graduate Admission Prediction System
 
AI TESTING: ENSURING A GOOD DATA SPLIT BETWEEN DATA SETS (TRAINING AND TEST) ...
AI TESTING: ENSURING A GOOD DATA SPLIT BETWEEN DATA SETS (TRAINING AND TEST) ...AI TESTING: ENSURING A GOOD DATA SPLIT BETWEEN DATA SETS (TRAINING AND TEST) ...
AI TESTING: ENSURING A GOOD DATA SPLIT BETWEEN DATA SETS (TRAINING AND TEST) ...
 
Loan Analysis Predicting Defaulters
Loan Analysis Predicting DefaultersLoan Analysis Predicting Defaulters
Loan Analysis Predicting Defaulters
 
IRJET- Analysis of Brand Value Prediction based on Social Media Data
IRJET-  	  Analysis of Brand Value Prediction based on Social Media DataIRJET-  	  Analysis of Brand Value Prediction based on Social Media Data
IRJET- Analysis of Brand Value Prediction based on Social Media Data
 
Data Mining GUI Tools with Demo
Data Mining GUI Tools with DemoData Mining GUI Tools with Demo
Data Mining GUI Tools with Demo
 
data-science-lifecycle-ebook.pdf
data-science-lifecycle-ebook.pdfdata-science-lifecycle-ebook.pdf
data-science-lifecycle-ebook.pdf
 
A Comparative Study on Identical Face Classification using Machine Learning
A Comparative Study on Identical Face Classification using Machine LearningA Comparative Study on Identical Face Classification using Machine Learning
A Comparative Study on Identical Face Classification using Machine Learning
 
BIG MART SALES PREDICTION USING MACHINE LEARNING
BIG MART SALES PREDICTION USING MACHINE LEARNINGBIG MART SALES PREDICTION USING MACHINE LEARNING
BIG MART SALES PREDICTION USING MACHINE LEARNING
 
STOCK MARKET ANALYZING AND PREDICTION USING MACHINE LEARNING TECHNIQUES
STOCK MARKET ANALYZING AND PREDICTION USING MACHINE LEARNING TECHNIQUESSTOCK MARKET ANALYZING AND PREDICTION USING MACHINE LEARNING TECHNIQUES
STOCK MARKET ANALYZING AND PREDICTION USING MACHINE LEARNING TECHNIQUES
 
LOAN APPROVAL PRDICTION SYSTEM USING MACHINE LEARNING.
LOAN APPROVAL PRDICTION SYSTEM USING MACHINE LEARNING.LOAN APPROVAL PRDICTION SYSTEM USING MACHINE LEARNING.
LOAN APPROVAL PRDICTION SYSTEM USING MACHINE LEARNING.
 
Introduction to Machine Learning and Data Science using the Autonomous databa...
Introduction to Machine Learning and Data Science using the Autonomous databa...Introduction to Machine Learning and Data Science using the Autonomous databa...
Introduction to Machine Learning and Data Science using the Autonomous databa...
 
The 8 Step Data Mining Process
The 8 Step Data Mining ProcessThe 8 Step Data Mining Process
The 8 Step Data Mining Process
 
Machine Learning in Autonomous Data Warehouse
 Machine Learning in Autonomous Data Warehouse Machine Learning in Autonomous Data Warehouse
Machine Learning in Autonomous Data Warehouse
 
IRJET- Sentimental Analysis of Product Reviews for E-Commerce Websites
IRJET- Sentimental Analysis of Product Reviews for E-Commerce WebsitesIRJET- Sentimental Analysis of Product Reviews for E-Commerce Websites
IRJET- Sentimental Analysis of Product Reviews for E-Commerce Websites
 
Introduction to Machine Learning and Data Science using Autonomous Database ...
Introduction to Machine Learning and Data Science using Autonomous Database  ...Introduction to Machine Learning and Data Science using Autonomous Database  ...
Introduction to Machine Learning and Data Science using Autonomous Database ...
 
Andrew NG machine learning
Andrew NG machine learningAndrew NG machine learning
Andrew NG machine learning
 
Loan Eligibility Checker
Loan Eligibility CheckerLoan Eligibility Checker
Loan Eligibility Checker
 

Kürzlich hochgeladen

How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxCeline George
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxmarlenawright1
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...Poonam Aher Patil
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxDr. Sarita Anand
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsKarakKing
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentationcamerronhm
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxPooja Bhuva
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxJisc
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
How to Add New Custom Addons Path in Odoo 17
Classification and Clustering Analysis using Weka

  • 1. WEKA IT For Business Intelligence Ishan Awadhesh 10BM60033 • Term Paper • 19 April 2012 Vinod Gupta School of Management, IIT Kharagpur 1
  • 2. Table of Contents
    WEKA (3)
    Data Used (5)
    Classification Analysis (6)
    Cluster Analysis (11)
    Other Applications of Weka (17)
    References (17)
  • 3. WEKA (Waikato Environment for Knowledge Analysis)
    DATA MINING TECHNIQUES
    WEKA is a collection of state-of-the-art machine learning algorithms and data preprocessing tools written in Java, developed at the University of Waikato, New Zealand. It is free software that runs on almost any platform and is available under the GNU General Public License. It supports the entire process of experimental data mining: preparing the input data, evaluating learning schemes statistically, and visualizing both the input data and the result of learning. The WEKA workbench includes methods for the main data mining problems: regression, classification, clustering, association rule mining, and attribute selection. It can be used through either of the following two interfaces:
      • Command Line Interface (CLI)
      • Graphical User Interface (GUI)
    [Screenshot: the WEKA GUI Chooser]
  • 4. The buttons can be used to start the following applications:
      • Explorer: an environment for exploring data with WEKA. It gives access to all the facilities using menu selection and form filling.
      • Experimenter: helps answer the question "Which methods and parameter values work best for the given problem?"
      • KnowledgeFlow: offers the same functions as the Explorer and also supports incremental learning. It allows designing configurations for streamed data processing, so incremental algorithms can be used to process very large datasets.
      • Simple CLI: provides a simple command-line interface for directly executing WEKA commands.
    This term paper demonstrates the following two data mining techniques using WEKA:
      • Classification
      • Clustering (Simple K Means)
  • 5. Data Used
    The data used in this paper is bank data in comma-separated values (CSV) format. It contains the following fields:
      • id: a unique identification number
      • age: age of the customer in years (numeric)
      • sex: MALE / FEMALE
      • region: inner_city / rural / suburban / town
      • income: income of the customer (numeric)
      • married: is the customer married (YES/NO)
      • children: number of children (numeric)
      • car: does the customer own a car (YES/NO)
      • save_acct: does the customer have a savings account (YES/NO)
      • current_acct: does the customer have a current account (YES/NO)
      • mortgage: does the customer have a mortgage (YES/NO)
      • pep: did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO)
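To make the field list above concrete, here is a minimal Python sketch of how such a file could be read and typed outside WEKA. The two sample rows, the `parse_row` helper, and the choice to map YES/NO to booleans are assumptions made for illustration; they are not taken from the paper's data file.

```python
import csv
import io

# Hypothetical sample rows; the values are invented for illustration only.
SAMPLE_CSV = """id,age,sex,region,income,married,children,car,save_acct,current_acct,mortgage,pep
ID00001,48,FEMALE,inner_city,17546.0,NO,1,NO,NO,NO,NO,YES
ID00002,40,MALE,town,30085.1,YES,3,YES,NO,YES,YES,NO
"""

NUMERIC_FIELDS = {"age": int, "income": float, "children": int}
YES_NO_FIELDS = ("married", "car", "save_acct", "current_acct", "mortgage", "pep")


def parse_row(row):
    """Convert one CSV row (a dict of strings) into typed Python values."""
    record = dict(row)
    for name, cast in NUMERIC_FIELDS.items():
        record[name] = cast(row[name])
    for name in YES_NO_FIELDS:
        record[name] = row[name] == "YES"
    return record


records = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(records[0]["age"], records[0]["pep"])
```

The nominal fields (sex, region) stay as strings, which mirrors the numeric/nominal distinction the field list draws.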
  • 6. Classification Analysis
    Question: "How likely is person X to buy the new Personal Equity Plan (PEP)?"
    By creating a classification tree (a decision tree), the data can be mined to estimate how likely this person is to buy a new PEP. Possible nodes on the tree are number of children, income level, and marital status. The person's attributes are then run through the decision tree to determine the likelihood of a purchase.
    Load the data file Bank_Data.CSV into WEKA. This file contains 900 records of the bank's present customers. The records need to be divided so that some data instances are used to create the model and others are used to test it, to make sure the model is not overfitted. Your screen should look like Figure 1 after loading the data.
    Figure 1. Bank Data Classification in Weka
    Select the Classify tab, then the trees node, then the J48 leaf.
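The idea of holding some records back for testing can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper does not say what split ratio was used, so the 80/20 ratio, the fixed seed, and the `holdout_split` name are all hypothetical.

```python
import random


def holdout_split(records, train_fraction=0.8, seed=42):
    """Shuffle a copy of the records and split it into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(900))  # stand-ins for the 900 customer records
train, test = holdout_split(rows)
print(len(train), len(test))
```

Shuffling before cutting matters when the file is sorted in some way (by id or region, for example), since an unshuffled split could leave the test set unrepresentative.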
  • 7. Figure 2. Bank Data Classification Algorithm
    At this point, we are ready to create our model in WEKA. Ensure that Use training set is selected so that the data set we just loaded is used to create the model. Click Start and let WEKA run. The output from this model should look like the results in Listing 1.
  • 8. Listing 1. Output from WEKA's classification model
    What do these numbers mean?
      • Correctly Classified Instances: 92.3333%
      • Incorrectly Classified Instances: 7.6667%
      • False Positives: 29
      • False Negatives: 17
    Based on our accuracy rate of 92.3333%, we can say that this is a pretty good model for predicting whether a new customer will buy a Personal Equity Plan.
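The relationship between these four numbers can be checked with a little arithmetic. The sketch below assumes that the false positives and false negatives together account for every misclassified instance (true for a two-class problem such as pep YES/NO). The total it derives is an inference from the reported percentages, not a figure stated in the paper.

```python
false_positives = 29
false_negatives = 17
error_rate = 7.6667 / 100  # "Incorrectly Classified Instances"

# Every misclassified instance is either a false positive or a false negative.
errors = false_positives + false_negatives

# Number of evaluated instances implied by the reported error rate.
implied_total = round(errors / error_rate)

# Accuracy recomputed from the counts.
accuracy = (implied_total - errors) / implied_total

print(errors, implied_total, round(accuracy * 100, 4))
```

Under that assumption, the reported percentages correspond to 46 errors out of roughly 600 evaluated instances.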
  • 9. You can see the tree by right-clicking the model you just created in the result list and selecting Visualize tree from the pop-up menu. This shows the classification tree we just created, although in this example the visual tree doesn't offer much help.
    Figure 3. Classification Tree Visualization
    There is one final step in validating our classification tree: running our test set through the model and making sure that the model's accuracy on the test set isn't too different from its accuracy on the training set. To do this, in Test options, select the Supplied test set radio button and click Set. Choose the file bmw-test.arff, which contains 1,500 records that were not in the training set we used to create the model. When we click Start this time, WEKA runs this test data set through the model we already created and reports how the model did. Below is the output.
  • 10. Listing 2. Output from WEKA's classification model on the test data
    Comparing the "Correctly Classified Instances" from this test set (90.5 percent) with the "Correctly Classified Instances" from the training set (92.3333 percent), we see that the two accuracies are close. This indicates that the model will not break down on unknown data or when future data is applied to it.
  • 11. Cluster Analysis
    Question: "Which age groups are more likely to buy a Personal Equity Plan?"
    The data can be mined to compare the ages of past PEP purchasers. From this data, it can be found whether certain age groups (22-30 year olds, for example) have a higher propensity to go for a PEP. The data, when mined, will tend to cluster around certain age groups, allowing the user to quickly spot patterns in the data.
    Load the data file Bank_data.CSV into WEKA using the same steps we used to load data into the Preprocess tab. Take a few minutes to look around the data in this tab: the columns, the attribute data, the distribution of the columns, and so on. Your screen should look like Figure 4 after loading the data.
    Figure 4. Bank cluster data in Weka
    With this data set, we are looking to create clusters, so instead of clicking the Classify tab, click the Cluster tab. Click Choose and select SimpleKMeans from the choices that appear (this will be our preferred method of clustering for this article).
  • 12. Finally, we want to adjust the attributes of our cluster algorithm by clicking SimpleKMeans. The only attribute of the algorithm we are interested in adjusting here is the numClusters field, which sets how many clusters we want to create. Let's change the default value of 2 to 5 for now, but keep these steps in mind if you want to adjust the number of clusters later. Your WEKA Explorer should look like Figure 5 at this point. Click OK to accept these values.
    Figure 5. Cluster Attributes
    At this point, we are ready to run the clustering algorithm. Remember that 100 rows of data with five data clusters would likely take a few hours of computation with a spreadsheet, but WEKA can produce the answer in less than a second. Your output should look like Listing 3.
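To show what the numClusters setting controls, here is a minimal pure-Python sketch of the basic k-means procedure (Lloyd's algorithm), where `k` plays the role of numClusters. It is not WEKA's SimpleKMeans implementation, which also handles nominal attributes and has its own initialization and distance options. The `kmeans` function and the sample (age, income in thousands) points are hypothetical.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm on tuples of numbers; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Invented (age, income in thousands) pairs forming two obvious groups.
points = [(25, 20), (27, 22), (24, 19), (45, 60), (47, 62), (44, 58)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

Raising `k` splits the data into finer groups, which is the effect of changing numClusters from 2 to 5 in the dialog.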
  • 13. Listing 3. Cluster Output with 5 Clusters
  • 14. Listing 4. Cluster Output with 10 Clusters
    One thing that is clear from the clusters is that the behavior of males is concentrated in only 2-3 groups, while the behavior of females is spread across 7 clusters, so preparing an offering for a specific …
    Description of Clusters
      • Cluster 0: unmarried, mid-income females in their early 40s who live in rural areas. They have two children on average, no car, and no personal equity plan, but they do have savings and current accounts.
      • Cluster 1: married, high-income females in their late 40s who live in rural areas. They have two children on average, no car, and no personal equity plan, but they do have savings and current accounts.
      • Cluster 2: married, low-income females in their early 40s who live in the inner city. They have one child on average, no car, and no savings account, but they do have a current account and a personal equity plan.
  • 15. Description of Clusters (continued)
      • Cluster 3: married, low-income females in their early 30s who live in a town. They have one or two children on average, no car, no savings account, and no personal equity plan, but they do have a current account.
      • Cluster 4: married, mid-income males in their late 30s who live in the inner city. They have one child or none on average and no savings account, but they do have a personal equity plan and savings and current accounts.
      • Cluster 5: unmarried, high-income males in their early 40s who live in a town. They have one child or none on average, and they have a car, a personal equity plan, and savings and current accounts.
      • Cluster 6: married, mid-income females in their early 40s who live in the inner city. They mostly have no children, no savings account, and no personal equity plan, but they do have a current account.
      • Cluster 7: unmarried, high-income females in their mid 40s who live in the inner city. They have one or two children on average, no car, and no personal equity plan, but they do have savings and current accounts.
      • Cluster 8: unmarried, high-income females in their mid 40s who live in a town. They have one child or none on average and no personal equity plan, but they do have a car and savings and current accounts.
      • Cluster 9: married, mid-income males in their early 40s who live in the inner city. They have one or two children on average, no car, no personal equity plan, and no current account, but they do have a savings account.
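Descriptions like these can be read off a per-cluster summary. The Python sketch below assumes such a summary takes the mean of each numeric field and the most common value of each nominal field. The sample rows and the `describe_cluster` helper are invented for illustration and do not reproduce the paper's clusters.

```python
from collections import Counter
from statistics import mean


def describe_cluster(rows):
    """Summarise a cluster: mean of numeric fields, most common value of the others."""
    summary = {}
    for field in rows[0]:
        values = [row[field] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            summary[field] = mean(values)
        else:
            summary[field] = Counter(values).most_common(1)[0][0]
    return summary


# Hypothetical members of one cluster; values invented for illustration.
cluster_rows = [
    {"age": 41, "sex": "FEMALE", "region": "rural", "children": 2, "pep": "NO"},
    {"age": 43, "sex": "FEMALE", "region": "rural", "children": 2, "pep": "NO"},
    {"age": 42, "sex": "FEMALE", "region": "town", "children": 2, "pep": "YES"},
]
print(describe_cluster(cluster_rows))
```

A summary of this shape would read as "females in their early 40s, mostly rural, two children on average, no PEP", the same style as the descriptions above.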
  • 16. One other interesting way to examine the data in these clusters is to inspect it visually. To do this, right-click in the Result List section of the Cluster tab. One of the options in the pop-up menu is Visualize Cluster Assignments. A window will pop up that lets you play with the results and see them visually. For this example, change the X axis to income (Num), the Y axis to children (Num), and the Color to Cluster (Nom). The chart then shows how the clusters are grouped in terms of income and number of children. Also, turn up the "Jitter" to about three-fourths of its maximum, which artificially scatters the plot points so that overlapping points are easier to see.
    Figure 6. Cluster Visual Inspection
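The effect of the Jitter slider can be illustrated with a short Python sketch: a small random offset is added to each plotted value so that points sharing the same coordinates no longer sit exactly on top of each other. This shows the general idea only; how WEKA scales its jitter is not covered here, and the `jitter` function, the uniform noise, and the sample values are assumptions.

```python
import random


def jitter(values, amount, seed=0):
    """Return a copy of values with uniform noise in [-amount, +amount] added."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]


children = [0, 0, 1, 1, 1, 2, 2, 3]  # hypothetical y-axis values
spread = jitter(children, amount=0.3)
print(spread)
```

Because children takes only a few integer values, many customers share the same y coordinate; jittering is a display aid and does not change the underlying data or the cluster assignments.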
  • 17. Other Applications of Weka
      • Discretization
      • Regression
      • Nearest neighbor
    References
      • https://www.ibm.com/developerworks/opensource/library/os-weka2/
      • http://maya.cs.depaul.edu/classes/ect584/weka/preprocess.html
      • http://www.cs.waikato.ac.nz/~ml/weka/