Term paper on Data mining
How to use Weka for data analysis
Submitted by: Shubham Gupta (10BM60085)

Vinod Gupta School of Management
The first technique we apply in Weka is classification. The data below describe the financial
situation in Japan and cover the years 1970-2009. The columns represent:

    1)   BROAD: Broad money supplied in the economy
    2)   DOMC: Domestic consumption
    3)   PSC: Payment securities
    4)   CLAIMS: Represents the claims on the government.
    5)   TOTRES: Total Reserves
    6)   GDP: Gross domestic product
    7)   LIQLB: Liquid Liability

We want a decision tree that tells us which values of the independent variables lead to which
result. For example, if we knew that DOMC > 140 and PSC > 150.3 always gave a GDP greater than, say,
3 trillion yen, that rule would help us make better decisions. We perform this analysis to generate
a decision tree that yields such rules.
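
A candidate rule of this kind can be checked against the data before it is trusted. The sketch below is only an illustration: the thresholds are the hypothetical ones from the example above, the function name `rule_violations` is our own, and the four rows are copied from the data table (YEAR, DOMC, PSC, GDP as listed there).

```python
# Rows copied from the data table: (YEAR, DOMC, PSC, GDP as listed in the table).
ROWS = [
    (1985, 220.09, 149.90, 1_364_160_000_000),
    (1986, 230.23, 156.30, 2_020_890_000_000),
    (1990, 259.15, 194.81, 3_058_040_000_000),
    (1995, 287.13, 203.90, 5_264_380_000_000),
]


def rule_violations(rows, domc_min=140.0, psc_min=150.3, gdp_min=3e12):
    """Return the years where the rule's condition holds but GDP is not above gdp_min.

    The default thresholds are the hypothetical ones from the example in the text.
    """
    violations = []
    for year, domc, psc, gdp in rows:
        condition_holds = domc > domc_min and psc > psc_min
        if condition_holds and gdp <= gdp_min:
            violations.append(year)
    return violations


print(rule_violations(ROWS))
```

On these four rows the check reports 1986, where both conditions hold but GDP is still below 3 trillion. The example rule is therefore illustrative only, which is why we let Weka derive the split points from the data.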


  YEAR       BROAD     CLAIMS     DOMC       PSC         TOTRES        LIQLB           GDP
  1970       83.65      61.88     134.25    111.75     4876114550     104.73      205,995,000,000
  1971       106.70     21.37     147.59    123.72    15469150615     118.21      232,681,000,000
  1972       116.14     23.17     160.29    133.47    18932675966     129.03      308,137,000,000
  1973       116.02     19.84     157.87    132.20    13723930639     126.07      418,640,000,000
  1974       113.08     13.72     154.00    126.49    16551248298     120.50      464,705,000,000
  1975       118.31     13.02     164.40    129.96    14910849997     127.56      505,317,000,000
  1976       122.40     12.09     169.96    130.63    18590784646     131.20      567,926,000,000
  1977       125.82      8.76     172.45    128.49    25907710023     133.90      698,968,000,000
  1978       130.36      8.56     178.29    127.71    37824744320     139.12      982,078,000,000
  1979       135.51      8.19     183.05    129.23    31926244737     142.67    1,022,190,000,000
  1980       137.95      8.09     188.44    131.29    38918848626     144.30    1,071,000,000,000
  1981       142.13      8.04     194.09    134.10    37839039769     150.03    1,183,790,000,000
  1982       149.54      7.67     203.99    139.59    34403732201     156.18    1,100,410,000,000
  1983       156.55      6.72     213.12    145.03    33844549531     162.92    1,200,190,000,000
  1984       159.31      6.69     217.77    147.43    33898638541     165.34    1,275,560,000,000
  1985       160.68      7.66     220.09    149.90    34641202378     167.41    1,364,160,000,000
  1986       167.30      7.67     230.23    156.30    51727320082     174.65    2,020,890,000,000
  1987       175.85     12.27     243.85    173.48    92701641597     183.77    2,448,670,000,000
  1988       178.70     10.66     251.68    182.52    1.06668E+11     186.47    2,971,030,000,000
  1989       182.62     10.13     258.13    190.28    93672771034     192.14    2,972,670,000,000
  1990       184.06      8.46     259.15    194.81    87828362969     190.16    3,058,040,000,000
  1991       184.35      5.20     257.54    195.40    80625855126     189.32    3,484,770,000,000
  1992       187.89      4.16     265.33    199.63    79696644593     190.93    3,796,110,000,000
  1993       193.97      1.33     274.00    202.14    1.07989E+11     198.16    4,350,010,000,000
  1994       200.35      1.88     281.02    204.58    1.35146E+11     204.45    4,778,990,000,000
  1995       205.79      1.26     287.13    203.90     1.9262E+11     209.90    5,264,380,000,000
  1996       209.72      1.81     292.42    205.21    2.25594E+11     213.63    4,642,540,000,000
  1997       215.31      6.47     276.47    217.76    2.26679E+11     221.38    4,261,840,000,000
  1998       229.64      1.80     298.40    228.01    2.22443E+11     233.17    3,857,030,000,000
  1999       239.91     -1.20     309.92    231.08    2.93948E+11     243.22    4,368,730,000,000
  2000       242.24     -1.58     308.91    222.28    3.61639E+11     243.84    4,667,450,000,000
  2001       225.31    -33.25     299.43    193.01    4.01958E+11     187.41    4,095,480,000,000
  2002       207.79     -4.32     299.16    182.40    4.69618E+11     190.79    3,918,340,000,000
  2003       209.70     -1.99     307.26    180.71    6.73554E+11     191.84    4,229,100,000,000
  2004       207.51     -1.10     303.48    174.12    8.44667E+11     189.79    4,605,920,000,000
  2005       207.24      1.79     312.85    182.87    8.46896E+11     189.30    4,552,200,000,000
  2006       204.73     -0.14     304.96    179.99    8.95321E+11     186.06    4,362,590,000,000
  2007       201.50      0.16     294.31    172.56    9.73297E+11     184.17    4,377,940,000,000
  2008       207.14      0.76     295.42    165.48    1.03076E+12     189.52    4,879,860,000,000
  2009       223.76     -1.12     320.53    171.00    1.04899E+12     206.13    5,032,980,000,000


Loading data in Weka is quite easy. Just click on the open file option and give the location of the file.




Figure 1: Loading data in Weka
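
Weka opens ARFF files directly, so one way to prepare the table for loading is to write it out in that format. The following is a minimal sketch, not the procedure used for this paper: the relation name `japan_finance` and the helper `to_arff` are our own, and only two rows of the table are included to keep the example short.

```python
import io

ATTRIBUTES = ["YEAR", "BROAD", "CLAIMS", "DOMC", "PSC", "TOTRES", "LIQLB"]

# Two rows copied from the data table; a full file would list all 40 years.
ROWS = [
    (1970, 83.65, 61.88, 134.25, 111.75, 4876114550, 104.73),
    (1971, 106.70, 21.37, 147.59, 123.72, 15469150615, 118.21),
]


def to_arff(relation, attributes, rows):
    """Render purely numeric rows as ARFF text (header section, then @data)."""
    out = io.StringIO()
    out.write(f"@relation {relation}\n\n")
    for name in attributes:
        out.write(f"@attribute {name} numeric\n")
    out.write("\n@data\n")
    for row in rows:
        out.write(",".join(str(value) for value in row) + "\n")
    return out.getvalue()


arff_text = to_arff("japan_finance", ATTRIBUTES, ROWS)
print(arff_text)
```

Saving `arff_text` to a file with the extension .arff gives a file that the open file option can read. Writing only the populated rows also keeps empty rows out of the data set.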

Weka is used to classify the above data to find out how these economic factors should be modified
or fixed so as to get 11% growth over the previous year's GDP.

Figure 2: Where to select the tree technique in Weka

The following shows the output of running the above data in Weka. The classifier used to create the
required decision tree is M5P. Weka's M5P is a rational reconstruction of M5 with some
enhancements; it builds on M5Base, which implements the base routines for generating M5 model trees
and rules. The original M5 algorithm was invented by R. Quinlan, and Yong Wang made improvements.
M5P (where the P stands for 'prime') generates M5 model trees using the M5' algorithm, which was
introduced in Wang & Witten (1997) and enhances the original M5 algorithm by Quinlan (1992). The
output of the analysis is shown below:

=== Run information ===

Scheme:     weka.classifiers.trees.M5P -M 4.0
Relation:   Copy of Data_Rudra-weka.filters.unsupervised.attribute.Remove-R1
Instances:  945
Attributes: 6
            BROAD, CLAIMS, DOMC, PSC, TOTRES, LIQLB
Test mode:  10-fold cross-validation

=== Classifier model (full training set) ===

M5 pruned model tree:
(Using smoothed linear models)

BROAD <= 153.045 : LM1 (13/5.644%)
BROAD > 153.045 :
| PSC <= 203.02 :
| | BROAD <= 177.275 : LM2 (5/0.653%)
| | BROAD > 177.275 :
| | | TOTRES <= 871108500000 : LM3 (11/8.309%)
| | | TOTRES > 871108500000 : LM4 (4/1.446%)
| PSC > 203.02 : LM5 (7/2.741%)

LM num: 1
LIQLB = 0.7447 * BROAD + 0.1474 * PSC - 0 * TOTRES + 22.3168

LM num: 2
LIQLB = 0.586 * BROAD + 0.2788 * PSC - 0 * TOTRES + 30.3097

LM num: 3
LIQLB = 0.4606 * BROAD + 0.2504 * PSC - 0 * TOTRES + 58.87

LM num: 4
LIQLB = 0.4996 * BROAD + 0.2504 * PSC - 0 * TOTRES + 50.7563

LM num: 5
LIQLB = 0.7016 * BROAD + 0.2497 * PSC - 0 * TOTRES + 15.2517

Number of Rules: 5

Time taken to build model: 0.08 seconds

=== Cross-validation ===
=== Summary ===

Correlation coefficient              0.9882
Mean absolute error                  3.412
Root mean squared error              5.4145
Relative absolute error             11.529 %
Root relative squared error         15.1993 %
Total Number of Instances           40
Ignored Class Unknown Instances    905
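
The error measures in the summary can be computed from pairs of actual and predicted values. The sketch below uses the standard definitions and takes the mean of the actual values as the baseline for the two relative measures. That baseline is an assumption on our part; Weka's own baseline may differ slightly, for example when it is computed on the training folds only. The numbers are toy values, not taken from our data, and the function name `regression_summary` is our own.

```python
import math


def regression_summary(actual, predicted):
    """Compute MAE, RMSE and the two relative errors (in percent)."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = [abs(p - a) for a, p in zip(actual, predicted)]
    sq_err = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    # Baseline predictor: always predict the mean of the actual values.
    base_abs = sum(abs(a - mean_actual) for a in actual)
    base_sq = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "MAE": sum(abs_err) / n,
        "RMSE": math.sqrt(sum(sq_err) / n),
        "RAE_percent": 100 * sum(abs_err) / base_abs,
        "RRSE_percent": 100 * math.sqrt(sum(sq_err) / base_sq),
    }


# Toy values for illustration only.
ACTUAL = [1.0, 2.0, 3.0, 4.0]
PREDICTED = [1.0, 2.0, 3.0, 6.0]
summary = regression_summary(ACTUAL, PREDICTED)
print(summary)
```

A relative error below 100% means the model does better than always predicting the mean, which is how the 11.529% and 15.1993% figures in the summary should be read.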




Interpretation of the Results:

Based on the data above, the M5 algorithm generates a model tree with five linear models (LM) at
its leaves. The first split is on broad money: if BROAD is less than or equal to 153.045, we use
linear model 1 (LM1) to estimate liquid liabilities (LIQLB). If BROAD > 153.045, we check PSC and
move down the tree, choosing the model at the matching leaf to estimate LIQLB, as shown in the
output above. GDP is not among the six attributes of this run, so relating the estimate to GDP is a
separate step. Note also that only 40 of the 945 instances carry a class value; the other 905 are
ignored in the cross-validation.
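
The way a prediction is read off the tree can be sketched in a few lines. The split points and coefficients below are copied from the Weka output above; the function names `m5p_leaf` and `predict_liqlb` are our own. The TOTRES coefficient is printed as 0, so the sketch drops that term. Since the printed value is rounded and the models are smoothed, the result is only an approximation of what Weka itself would predict.

```python
# (BROAD coefficient, PSC coefficient, intercept) for LM1..LM5, as printed by Weka.
LINEAR_MODELS = {
    1: (0.7447, 0.1474, 22.3168),
    2: (0.586, 0.2788, 30.3097),
    3: (0.4606, 0.2504, 58.87),
    4: (0.4996, 0.2504, 50.7563),
    5: (0.7016, 0.2497, 15.2517),
}


def m5p_leaf(broad, psc, totres):
    """Follow the printed tree and return the number of the linear model to use."""
    if broad <= 153.045:
        return 1
    if psc > 203.02:
        return 5
    if broad <= 177.275:
        return 2
    return 3 if totres <= 871108500000 else 4


def predict_liqlb(broad, psc, totres):
    """Estimate LIQLB with the leaf's linear model (TOTRES term treated as zero)."""
    coef_broad, coef_psc, intercept = LINEAR_MODELS[m5p_leaf(broad, psc, totres)]
    return coef_broad * broad + coef_psc * psc + intercept


# 1970 row of the data table: BROAD 83.65, PSC 111.75, TOTRES 4876114550.
print(m5p_leaf(83.65, 111.75, 4876114550), predict_liqlb(83.65, 111.75, 4876114550))
```

For the 1970 row the tree selects LM1 and the estimate is about 101.08, against a recorded LIQLB of 104.73.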



Linear Regression with Weka
The second technique is linear regression in Weka on the same data. When the outcome, or class, is
numeric and all the attributes are numeric, linear regression is a natural technique to consider.
The idea is to express the class as a linear combination of the attributes, with weights estimated
from the training data. The previous technique fitted five linear models to different regions of
the same data; in the cross-validation results reported in this paper, that M5P tree has a somewhat
lower error than the single linear model fitted here (mean absolute error 3.412 versus 4.8731).
From the same data we can also find a single linear regression equation relating the parameters. To
run the regression, go to the Classify tab in Weka and choose LinearRegression from the functions
group, as shown.




Figure 3: Where to find linear regression in Weka

The above analysis generates the following output:

=== Run information ===

Scheme:     weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation:   Copy of Data_Rudra
Instances:  945
Attributes: 7
            YEAR
            BROAD
            CLAIMS
            DOMC
            PSC
            TOTRES
            LIQLB
Test mode:  10-fold cross-validation

=== Classifier model (full training set) ===

Linear Regression Model

LIQLB =

      1.2523 * BROAD +
      0.6062 * CLAIMS +
     -0.1407 * DOMC +
      0      * TOTRES +
     -6.9705

Time taken to build model: 0.2 seconds

=== Cross-validation ===
=== Summary ===

Correlation coefficient              0.9738
Mean absolute error                  4.8731
Root mean squared error              8.0404
Relative absolute error             16.4661 %
Root relative squared error         22.5707 %
Total Number of Instances           40
Ignored Class Unknown Instances    905


The above analysis gives us a linear mathematical relationship between the variables. The value of
the dependent variable (LIQLB) can be computed once the values of the independent variables are
known. The equation also shows how the variables are related: a positive coefficient means that
LIQLB rises with that variable, and a negative coefficient (as for DOMC) means that it falls. To see
the same relationships in pictorial form, go to the Visualize tab in the Weka Explorer. This is
shown in the figure below.
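
Applying the regression equation to one row makes the role of the coefficients concrete. The sketch below uses the coefficients from the Weka output above; the function name `predict_liqlb_lr` is our own. The TOTRES coefficient is printed as 0 and is therefore left out, which is an approximation.

```python
def predict_liqlb_lr(broad, claims, domc):
    """Evaluate the printed regression equation (TOTRES term treated as zero)."""
    return 1.2523 * broad + 0.6062 * claims - 0.1407 * domc - 6.9705


# 1970 row of the data table: BROAD 83.65, CLAIMS 61.88, DOMC 134.25, LIQLB 104.73.
estimate = predict_liqlb_lr(83.65, 61.88, 134.25)
residual = 104.73 - estimate
print(round(estimate, 2), round(residual, 2))
```

Raising DOMC by one unit while holding the other variables fixed lowers the estimate by 0.1407, which is what the negative coefficient expresses.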

CLUSTERING IN WEKA
Clustering is a technique that groups similar instances (rows) in terms of Euclidean distance. We
used the SimpleKMeans clustering algorithm to analyze clusters in our initial data. SimpleKMeans
clusters data using k-means, with the number of clusters given as a parameter (-N 2 in the run
below); it does not choose the number of clusters itself by cross-validation. The output of
SimpleKMeans for the above data is shown below. The result is a table whose rows are the attribute
names and whose columns correspond to the cluster centroids; an additional column at the beginning
shows the entire data set. The number of instances in each cluster appears in parentheses at the top
of its column. Each table entry is either the mean or the mode of the corresponding attribute for
the cluster in that column. The bottom of the output shows the result of applying the learned
cluster model. In this case, it assigned each training instance to one of the clusters, giving the
same counts as the parenthetical numbers at the top of each column. An alternative is to use a
separate test set or a percentage split of the training data, in which case the figures would be
different. This technique could also be used with data from other countries in addition to the
present data, which is for Japan.
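
The k-means procedure itself is short enough to sketch. The version below is a simplified illustration, not Weka's implementation: it starts from the first k points instead of a seeded random choice, it does not normalize the attributes, and the function name `kmeans` is our own. The six points are (BROAD, LIQLB) pairs copied from the data table for 1970-1972 and 1998-2000.

```python
import math

# (BROAD, LIQLB) pairs from the data table: 1970, 1971, 1972, 1998, 1999, 2000.
POINTS = [
    (83.65, 104.73),
    (106.70, 118.21),
    (116.14, 129.03),
    (229.64, 233.17),
    (239.91, 243.22),
    (242.24, 243.84),
]


def kmeans(points, k, max_iter=500):
    """Plain k-means with Euclidean distance; returns (centroids, assignments)."""
    centroids = [points[i] for i in range(k)]  # assumption: first k points as seeds
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the procedure has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignments


centroids, assignments = kmeans(POINTS, 2)
print(assignments)
print(centroids)
```

On these six points the early years and the late years end up in separate clusters, and each centroid is the mean of its three members, which is the kind of entry the centroid table in the Weka output reports.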
=== Run information ===

Scheme:     weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation:   Copy of Data_Rudra
Instances:  945
Attributes: 7
            YEAR
            BROAD
            CLAIMS
            DOMC
            PSC
            TOTRES
            LIQLB
Test mode:  evaluate on training data

=== Model and evaluation on training set ===

kMeans
======

Number of iterations: 5
Within cluster sum of squared errors: 12.988387913678944
Missing values globally replaced with mean/mode

Cluster centroids:
                                      Cluster#
Attribute          Full Data                  0               1
                       (945)              (929)            (16)
===============================================================
YEAR                  1989.5          1989.2933          2001.5
BROAD               174.1633           173.4625        214.8525
CLAIMS                6.6645             6.8103         -1.7981
DOMC                242.2808           241.2956        299.4794
PSC                 168.2627           167.8077         194.685
TOTRES     248907476505.9463  243675387834.3592    552695625000
LIQLB               175.2342           174.7166        205.2875

Time taken to build model (full training data) : 0.14 seconds

=== Model and evaluation on training set ===

Clustered Instances

0      929 (98%)
1       16 ( 2%)


We can also visualize the clusters formed. Right-click the entry in the result list and select the
option to visualize cluster assignments. We get the following output:
DATA MINING WITH WEKA

Weitere ähnliche Inhalte

Was ist angesagt?

traffic noise
traffic noisetraffic noise
traffic noise
dont
 
Economic Feasibility Study for Highway 640
Economic Feasibility Study for Highway 640Economic Feasibility Study for Highway 640
Economic Feasibility Study for Highway 640
Sadegh Tabatabaei
 
Syllabus of Transportation Engineering
Syllabus of Transportation EngineeringSyllabus of Transportation Engineering
Syllabus of Transportation Engineering
mhawarey
 
Queuing theory and traffic flow analysis
Queuing theory and traffic flow analysisQueuing theory and traffic flow analysis
Queuing theory and traffic flow analysis
Reymond Dy
 
Transportation Planning & Management
Transportation Planning & ManagementTransportation Planning & Management
Transportation Planning & Management
Living Online
 

Was ist angesagt? (20)

10 Capacity and LOS Analysis for Freeway (Traffic Engineering هندسة المرور & ...
10 Capacity and LOS Analysis for Freeway (Traffic Engineering هندسة المرور & ...10 Capacity and LOS Analysis for Freeway (Traffic Engineering هندسة المرور & ...
10 Capacity and LOS Analysis for Freeway (Traffic Engineering هندسة المرور & ...
 
Traffic signal design (1)
Traffic signal design (1)Traffic signal design (1)
Traffic signal design (1)
 
Chapter 6 Fundamentals of traffic flow
Chapter 6 Fundamentals of traffic flowChapter 6 Fundamentals of traffic flow
Chapter 6 Fundamentals of traffic flow
 
Freeway LOS (Transportation Engineering)
Freeway LOS (Transportation Engineering)Freeway LOS (Transportation Engineering)
Freeway LOS (Transportation Engineering)
 
Mathematical Understanding in Traffic Flow Modelling
Mathematical Understanding in Traffic Flow ModellingMathematical Understanding in Traffic Flow Modelling
Mathematical Understanding in Traffic Flow Modelling
 
07 Speed, Travel Time & Delay Studies (Traffic Engineering هندسة المرور & Pro...
07 Speed, Travel Time & Delay Studies (Traffic Engineering هندسة المرور & Pro...07 Speed, Travel Time & Delay Studies (Traffic Engineering هندسة المرور & Pro...
07 Speed, Travel Time & Delay Studies (Traffic Engineering هندسة المرور & Pro...
 
traffic noise
traffic noisetraffic noise
traffic noise
 
Transportation Engineering I
Transportation Engineering ITransportation Engineering I
Transportation Engineering I
 
Economic Feasibility Study for Highway 640
Economic Feasibility Study for Highway 640Economic Feasibility Study for Highway 640
Economic Feasibility Study for Highway 640
 
Simulation of Traffic Flow - Density
Simulation of Traffic Flow - DensitySimulation of Traffic Flow - Density
Simulation of Traffic Flow - Density
 
Syllabus of Transportation Engineering
Syllabus of Transportation EngineeringSyllabus of Transportation Engineering
Syllabus of Transportation Engineering
 
Set theory
Set theorySet theory
Set theory
 
Lec 10 Traffic Stream Models (Transportation Engineering Dr.Lina Shbeeb)
Lec 10 Traffic Stream Models (Transportation Engineering Dr.Lina Shbeeb)Lec 10 Traffic Stream Models (Transportation Engineering Dr.Lina Shbeeb)
Lec 10 Traffic Stream Models (Transportation Engineering Dr.Lina Shbeeb)
 
Traffic stream models
Traffic stream models Traffic stream models
Traffic stream models
 
Design principles of traffic signal
Design principles of traffic signalDesign principles of traffic signal
Design principles of traffic signal
 
Visualisasi pisah ragaman
Visualisasi pisah ragamanVisualisasi pisah ragaman
Visualisasi pisah ragaman
 
Queuing theory and traffic flow analysis
Queuing theory and traffic flow analysisQueuing theory and traffic flow analysis
Queuing theory and traffic flow analysis
 
Transportation Planning & Management
Transportation Planning & ManagementTransportation Planning & Management
Transportation Planning & Management
 
Vertical Alignment
Vertical Alignment Vertical Alignment
Vertical Alignment
 
Level of Service(LOS) of a road with Calculation Method
Level of Service(LOS) of a road with Calculation MethodLevel of Service(LOS) of a road with Calculation Method
Level of Service(LOS) of a road with Calculation Method
 

Ähnlich wie DATA MINING WITH WEKA

Linear regression an 80 year study of the dow jones industrial average
Linear regression  an 80 year study of the dow jones industrial averageLinear regression  an 80 year study of the dow jones industrial average
Linear regression an 80 year study of the dow jones industrial average
TehyaSingleton
 
Linear regression an 80 year study of the dow jones industrial average
Linear regression  an 80 year study of the dow jones industrial averageLinear regression  an 80 year study of the dow jones industrial average
Linear regression an 80 year study of the dow jones industrial average
TehyaSingleton
 
Fundamental Equity Analysis - QMS Advisors HDAX FlexIndex 110
Fundamental Equity Analysis - QMS Advisors HDAX FlexIndex 110Fundamental Equity Analysis - QMS Advisors HDAX FlexIndex 110
Fundamental Equity Analysis - QMS Advisors HDAX FlexIndex 110
BCV
 
Fedegan_Estadisticas_Boletin_Comercio_Exterior_Leche_Importacion
Fedegan_Estadisticas_Boletin_Comercio_Exterior_Leche_ImportacionFedegan_Estadisticas_Boletin_Comercio_Exterior_Leche_Importacion
Fedegan_Estadisticas_Boletin_Comercio_Exterior_Leche_Importacion
Fedegan
 
WRI Operating Statement Detail
WRI Operating Statement DetailWRI Operating Statement Detail
WRI Operating Statement Detail
Scott Pickering
 

Ähnlich wie DATA MINING WITH WEKA (20)

Linear regression an 80 year study of the dow jones industrial average
Linear regression  an 80 year study of the dow jones industrial averageLinear regression  an 80 year study of the dow jones industrial average
Linear regression an 80 year study of the dow jones industrial average
 
Linear regression an 80 year study of the dow jones industrial average
Linear regression  an 80 year study of the dow jones industrial averageLinear regression  an 80 year study of the dow jones industrial average
Linear regression an 80 year study of the dow jones industrial average
 
Excel Model for Banking
Excel Model for Banking Excel Model for Banking
Excel Model for Banking
 
Moyno pump 2000 dimensions g2
Moyno pump 2000 dimensions g2Moyno pump 2000 dimensions g2
Moyno pump 2000 dimensions g2
 
Eco dev final
Eco dev finalEco dev final
Eco dev final
 
Rio cojedes total mediciones
Rio cojedes total medicionesRio cojedes total mediciones
Rio cojedes total mediciones
 
cobb500 broiler performance nutrition supplement 2022
cobb500 broiler performance nutrition supplement 2022 cobb500 broiler performance nutrition supplement 2022
cobb500 broiler performance nutrition supplement 2022
 
Fundamental Equity Analysis - QMS Advisors HDAX FlexIndex 110
Fundamental Equity Analysis - QMS Advisors HDAX FlexIndex 110Fundamental Equity Analysis - QMS Advisors HDAX FlexIndex 110
Fundamental Equity Analysis - QMS Advisors HDAX FlexIndex 110
 
Fedegan_Estadisticas_Boletin_Comercio_Exterior_Leche_Importacion
Fedegan_Estadisticas_Boletin_Comercio_Exterior_Leche_ImportacionFedegan_Estadisticas_Boletin_Comercio_Exterior_Leche_Importacion
Fedegan_Estadisticas_Boletin_Comercio_Exterior_Leche_Importacion
 
recipes
recipesrecipes
recipes
 
Baja wf
Baja wfBaja wf
Baja wf
 
8 fv&amp;pv tables
8 fv&amp;pv tables8 fv&amp;pv tables
8 fv&amp;pv tables
 
Present Value and Future Value Tables
Present Value and Future Value TablesPresent Value and Future Value Tables
Present Value and Future Value Tables
 
Futurevaluetables
FuturevaluetablesFuturevaluetables
Futurevaluetables
 
Futurevaluetables
FuturevaluetablesFuturevaluetables
Futurevaluetables
 
WRI Operating Statement Detail
WRI Operating Statement DetailWRI Operating Statement Detail
WRI Operating Statement Detail
 
Hyundai hdf18 3 forklift truck service repair manual
Hyundai hdf18 3 forklift truck service repair manualHyundai hdf18 3 forklift truck service repair manual
Hyundai hdf18 3 forklift truck service repair manual
 
Hyundai hdf18 3 forklift truck service repair manual
Hyundai hdf18 3 forklift truck service repair manualHyundai hdf18 3 forklift truck service repair manual
Hyundai hdf18 3 forklift truck service repair manual
 
Hyundai hdf15 3 forklift truck service repair manual
Hyundai hdf15 3 forklift truck service repair manualHyundai hdf15 3 forklift truck service repair manual
Hyundai hdf15 3 forklift truck service repair manual
 
Hyundai hdf18 3 forklift truck service repair manual
Hyundai hdf18 3 forklift truck service repair manualHyundai hdf18 3 forklift truck service repair manual
Hyundai hdf18 3 forklift truck service repair manual
 

Mehr von Shubham Gupta (6)

Marketing great dakota bank case - Harward Business School
Marketing   great dakota bank case - Harward Business SchoolMarketing   great dakota bank case - Harward Business School
Marketing great dakota bank case - Harward Business School
 
Understanding Customer Value - Marketing through 4P's and SAVE
Understanding Customer Value - Marketing through 4P's and SAVEUnderstanding Customer Value - Marketing through 4P's and SAVE
Understanding Customer Value - Marketing through 4P's and SAVE
 
Segmentation, Targeting and Positioning at an Election
Segmentation, Targeting and Positioning at an ElectionSegmentation, Targeting and Positioning at an Election
Segmentation, Targeting and Positioning at an Election
 
The bose corporation: JIT II case solution
The bose corporation: JIT II case solutionThe bose corporation: JIT II case solution
The bose corporation: JIT II case solution
 
Impure data analytics & visualization tool
Impure data analytics & visualization toolImpure data analytics & visualization tool
Impure data analytics & visualization tool
 
Impure data analytics & visualization tool
Impure data analytics & visualization toolImpure data analytics & visualization tool
Impure data analytics & visualization tool
 

Kürzlich hochgeladen

The Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai Kuwait
The Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai KuwaitThe Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai Kuwait
The Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai Kuwait
daisycvs
 
Challenges and Opportunities: A Qualitative Study on Tax Compliance in Pakistan
Challenges and Opportunities: A Qualitative Study on Tax Compliance in PakistanChallenges and Opportunities: A Qualitative Study on Tax Compliance in Pakistan
Challenges and Opportunities: A Qualitative Study on Tax Compliance in Pakistan
vineshkumarsajnani12
 
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
daisycvs
 

Kürzlich hochgeladen (20)

The Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai Kuwait
The Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai KuwaitThe Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai Kuwait
The Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai Kuwait
 
WheelTug Short Pitch Deck 2024 | Byond Insights
WheelTug Short Pitch Deck 2024 | Byond InsightsWheelTug Short Pitch Deck 2024 | Byond Insights
WheelTug Short Pitch Deck 2024 | Byond Insights
 
Buy gmail accounts.pdf buy Old Gmail Accounts
Buy gmail accounts.pdf buy Old Gmail AccountsBuy gmail accounts.pdf buy Old Gmail Accounts
Buy gmail accounts.pdf buy Old Gmail Accounts
 
joint cost.pptx COST ACCOUNTING Sixteenth Edition ...
joint cost.pptx  COST ACCOUNTING  Sixteenth Edition                          ...joint cost.pptx  COST ACCOUNTING  Sixteenth Edition                          ...
joint cost.pptx COST ACCOUNTING Sixteenth Edition ...
 
Challenges and Opportunities: A Qualitative Study on Tax Compliance in Pakistan
Challenges and Opportunities: A Qualitative Study on Tax Compliance in PakistanChallenges and Opportunities: A Qualitative Study on Tax Compliance in Pakistan
Challenges and Opportunities: A Qualitative Study on Tax Compliance in Pakistan
 
Cannabis Legalization World Map: 2024 Updated
Cannabis Legalization World Map: 2024 UpdatedCannabis Legalization World Map: 2024 Updated
Cannabis Legalization World Map: 2024 Updated
 
How to Get Started in Social Media for Art League City
How to Get Started in Social Media for Art League CityHow to Get Started in Social Media for Art League City
How to Get Started in Social Media for Art League City
 
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
 
Ooty Call Gril 80022//12248 Only For Sex And High Profile Best Gril Sex Avail...
Ooty Call Gril 80022//12248 Only For Sex And High Profile Best Gril Sex Avail...Ooty Call Gril 80022//12248 Only For Sex And High Profile Best Gril Sex Avail...
Ooty Call Gril 80022//12248 Only For Sex And High Profile Best Gril Sex Avail...
 
Dr. Admir Softic_ presentation_Green Club_ENG.pdf
Dr. Admir Softic_ presentation_Green Club_ENG.pdfDr. Admir Softic_ presentation_Green Club_ENG.pdf
Dr. Admir Softic_ presentation_Green Club_ENG.pdf
 
Nashik Call Girl Just Call 7091819311 Top Class Call Girl Service Available
Nashik Call Girl Just Call 7091819311 Top Class Call Girl Service AvailableNashik Call Girl Just Call 7091819311 Top Class Call Girl Service Available
Nashik Call Girl Just Call 7091819311 Top Class Call Girl Service Available
 
Durg CALL GIRL ❤ 82729*64427❤ CALL GIRLS IN durg ESCORTS
Durg CALL GIRL ❤ 82729*64427❤ CALL GIRLS IN durg ESCORTSDurg CALL GIRL ❤ 82729*64427❤ CALL GIRLS IN durg ESCORTS
Durg CALL GIRL ❤ 82729*64427❤ CALL GIRLS IN durg ESCORTS
 
CROSS CULTURAL NEGOTIATION BY PANMISEM NS
CROSS CULTURAL NEGOTIATION BY PANMISEM NSCROSS CULTURAL NEGOTIATION BY PANMISEM NS
CROSS CULTURAL NEGOTIATION BY PANMISEM NS
 
Horngren’s Cost Accounting A Managerial Emphasis, Canadian 9th edition soluti...
Horngren’s Cost Accounting A Managerial Emphasis, Canadian 9th edition soluti...Horngren’s Cost Accounting A Managerial Emphasis, Canadian 9th edition soluti...
Horngren’s Cost Accounting A Managerial Emphasis, Canadian 9th edition soluti...
 
Escorts in Nungambakkam Phone 8250092165 Enjoy 24/7 Escort Service Enjoy Your...
Escorts in Nungambakkam Phone 8250092165 Enjoy 24/7 Escort Service Enjoy Your...Escorts in Nungambakkam Phone 8250092165 Enjoy 24/7 Escort Service Enjoy Your...
Escorts in Nungambakkam Phone 8250092165 Enjoy 24/7 Escort Service Enjoy Your...
 
Marel Q1 2024 Investor Presentation from May 8, 2024
Marel Q1 2024 Investor Presentation from May 8, 2024Marel Q1 2024 Investor Presentation from May 8, 2024
Marel Q1 2024 Investor Presentation from May 8, 2024
 
GUWAHATI 💋 Call Girl 9827461493 Call Girls in Escort service book now
GUWAHATI 💋 Call Girl 9827461493 Call Girls in  Escort service book nowGUWAHATI 💋 Call Girl 9827461493 Call Girls in  Escort service book now
GUWAHATI 💋 Call Girl 9827461493 Call Girls in Escort service book now
 
Berhampur CALL GIRL❤7091819311❤CALL GIRLS IN ESCORT SERVICE WE ARE PROVIDING
Berhampur CALL GIRL❤7091819311❤CALL GIRLS IN ESCORT SERVICE WE ARE PROVIDINGBerhampur CALL GIRL❤7091819311❤CALL GIRLS IN ESCORT SERVICE WE ARE PROVIDING
Berhampur CALL GIRL❤7091819311❤CALL GIRLS IN ESCORT SERVICE WE ARE PROVIDING
 
Lucknow Housewife Escorts by Sexy Bhabhi Service 8250092165
Lucknow Housewife Escorts  by Sexy Bhabhi Service 8250092165Lucknow Housewife Escorts  by Sexy Bhabhi Service 8250092165
Lucknow Housewife Escorts by Sexy Bhabhi Service 8250092165
 
Call 7737669865 Vadodara Call Girls Service at your Door Step Available All Time
Call 7737669865 Vadodara Call Girls Service at your Door Step Available All TimeCall 7737669865 Vadodara Call Girls Service at your Door Step Available All Time
Call 7737669865 Vadodara Call Girls Service at your Door Step Available All Time
 

DATA MINING WITH WEKA

  • 1. Term paper on Data mining How to use Weka for data analysis Submitted by: Shubham Gupta (10BM60085) Vinod Gupta School of Management
  • 2. The first technique that we would do on weka is classification. The data below shows the financial situation in Japan. The data has been collected from 1970-2009. The columns represent: 1) BROAD: Broad money supplied in the economy 2) DOMC: Domestic consumption 3) PSC: Payment securities 4) CLAIMS: Represents the claims on the government. 5) TOTRES: Total Reserves 6) GDP: Gross domestic product 7) LIQLB: Liquid Liability We want to get a decision tree that would help us decide what values of independent variable may result in what final rule or result. For example if we know that for a DOMC> 140 and PSC> 150.3 we would always get say GDP of greater than 3 trillion yen, then it would help us in making our decisions better. Hence to get such rules we perform this analysis to generate a decision tree. YEAR BROAD CLAIMS DOMC PSC TOTRES LIQLB GDP 1970 83.65 61.88 134.25 111.75 4876114550 104.73 205,995,000,000 1971 106.70 21.37 147.59 123.72 15469150615 118.21 232,681,000,000 1972 116.14 23.17 160.29 133.47 18932675966 129.03 308,137,000,000 1973 116.02 19.84 157.87 132.20 13723930639 126.07 418,640,000,000 1974 113.08 13.72 154.00 126.49 16551248298 120.50 464,705,000,000 1975 118.31 13.02 164.40 129.96 14910849997 127.56 505,317,000,000 1976 122.40 12.09 169.96 130.63 18590784646 131.20 567,926,000,000 1977 125.82 8.76 172.45 128.49 25907710023 133.90 698,968,000,000 1978 130.36 8.56 178.29 127.71 37824744320 139.12 982,078,000,000 1979 135.51 8.19 183.05 129.23 31926244737 142.67 1,022,190,000,000 1980 137.95 8.09 188.44 131.29 38918848626 144.30 1,071,000,000,000 1981 142.13 8.04 194.09 134.10 37839039769 150.03 1,183,790,000,000 1982 149.54 7.67 203.99 139.59 34403732201 156.18 1,100,410,000,000 1983 156.55 6.72 213.12 145.03 33844549531 162.92 1,200,190,000,000 1984 159.31 6.69 217.77 147.43 33898638541 165.34 1,275,560,000,000 1985 160.68 7.66 220.09 149.90 34641202378 167.41 1,364,160,000,000 1986 167.30 7.67 230.23 156.30 51727320082 174.65 
2,020,890,000,000 1987 175.85 12.27 243.85 173.48 92701641597 183.77 2,448,670,000,000 1988 178.70 10.66 251.68 182.52 1.06668E+11 186.47 2,971,030,000,000 1989 182.62 10.13 258.13 190.28 93672771034 192.14 2,972,670,000,000 1990 184.06 8.46 259.15 194.81 87828362969 190.16 3,058,040,000,000 1991 184.35 5.20 257.54 195.40 80625855126 189.32 3,484,770,000,000 1992 187.89 4.16 265.33 199.63 79696644593 190.93 3,796,110,000,000 1993 193.97 1.33 274.00 202.14 1.07989E+11 198.16 4,350,010,000,000 1994 200.35 1.88 281.02 204.58 1.35146E+11 204.45 4,778,990,000,000
  • 3. 1995 205.79 1.26 287.13 203.90 1.9262E+11 209.90 5,264,380,000,000 1996 209.72 1.81 292.42 205.21 2.25594E+11 213.63 4,642,540,000,000 1997 215.31 6.47 276.47 217.76 2.26679E+11 221.38 4,261,840,000,000 1998 229.64 1.80 298.40 228.01 2.22443E+11 233.17 3,857,030,000,000 1999 239.91 -1.20 309.92 231.08 2.93948E+11 243.22 4,368,730,000,000 2000 242.24 -1.58 308.91 222.28 3.61639E+11 243.84 4,667,450,000,000 2001 225.31 -33.25 299.43 193.01 4.01958E+11 187.41 4,095,480,000,000 2002 207.79 -4.32 299.16 182.40 4.69618E+11 190.79 3,918,340,000,000 2003 209.70 -1.99 307.26 180.71 6.73554E+11 191.84 4,229,100,000,000 2004 207.51 -1.10 303.48 174.12 8.44667E+11 189.79 4,605,920,000,000 2005 207.24 1.79 312.85 182.87 8.46896E+11 189.30 4,552,200,000,000 2006 204.73 -0.14 304.96 179.99 8.95321E+11 186.06 4,362,590,000,000 2007 201.50 0.16 294.31 172.56 9.73297E+11 184.17 4,377,940,000,000 2008 207.14 0.76 295.42 165.48 1.03076E+12 189.52 4,879,860,000,000 2009 223.76 -1.12 320.53 171.00 1.04899E+12 206.13 5,032,980,000,000 Loading data in Weka is quite easy. Just click on the open file option and give the location of the file. Figure 1 Shows how to load data in Weka Weka software is used to classify the above data to find out how these economical factors be modified or fixed so as to get an 11% growth in the previous year’s GDP
  • 4. Figure 2: Where to select the tree technique in Weka

The following shows the output obtained by running the above data in Weka. The classifier used to create the required decision tree is M5P. Because the class attribute (LIQLB) is numeric, M5P produces a model tree with a linear model at each leaf.

Weka's M5P algorithm is a rational reconstruction of M5 with some enhancements. It builds on M5Base, which implements the base routines for generating M5 model trees and rules. The original M5 algorithm was invented by R. Quinlan, and Yong Wang made improvements to it. M5P (where the P stands for 'prime') generates M5 model trees using the M5' algorithm, which was introduced in Wang & Witten (1997) and enhances the original M5 algorithm by Quinlan (1992).

The output of the analysis is shown below:

=== Run information ===

Scheme:       weka.classifiers.trees.M5P -M 4.0
Relation:     Copy of Data_Rudra-weka.filters.unsupervised.attribute.Remove-R1
Instances:    945
Attributes:   6
              BROAD
              CLAIMS
              DOMC
              PSC
              TOTRES
              LIQLB
Test mode:    10-fold cross-validation
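The test mode above, 10-fold cross-validation, can be sketched as follows. This is a simplified illustration in plain Python, assuming the 40 instances with a known class reported in the cross-validation summary; Weka shuffles the data with a random seed before splitting, which this sketch omits. The function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_instances, k=10):
    """Split instance indices 0..n-1 into k (train, test) pairs.

    Each instance appears in exactly one test fold; the model is trained on
    the other k-1 folds and evaluated on the held-out fold.
    """
    folds = [list(range(start, n_instances, k)) for start in range(k)]
    splits = []
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n_instances) if i not in held_out]
        splits.append((train, test))
    return splits


splits = k_fold_indices(40, k=10)
for fold_number, (train, test) in enumerate(splits, start=1):
    print(f"fold {fold_number}: {len(train)} training instances, test on {test}")
```

The error figures in the cross-validation summary are computed over the predictions made on the held-out folds, so they estimate performance on data the model has not seen.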
  • 5. === Classifier model (full training set) ===

M5 pruned model tree:
(Using smoothed linear models)

BROAD <= 153.045 : LM1 (13/5.644%)
BROAD >  153.045 :
|   PSC <= 203.02 :
|   |   BROAD <= 177.275 : LM2 (5/0.653%)
|   |   BROAD >  177.275 :
|   |   |   TOTRES <= 871108500000 : LM3 (11/8.309%)
|   |   |   TOTRES >  871108500000 : LM4 (4/1.446%)
|   PSC >  203.02 : LM5 (7/2.741%)

LM num: 1
LIQLB = 0.7447 * BROAD + 0.1474 * PSC - 0 * TOTRES + 22.3168

LM num: 2
LIQLB = 0.586 * BROAD + 0.2788 * PSC - 0 * TOTRES + 30.3097

LM num: 3
LIQLB = 0.4606 * BROAD + 0.2504 * PSC - 0 * TOTRES + 58.87

LM num: 4
LIQLB = 0.4996 * BROAD + 0.2504 * PSC - 0 * TOTRES + 50.7563

LM num: 5
LIQLB = 0.7016 * BROAD + 0.2497 * PSC - 0 * TOTRES + 15.2517

Number of Rules: 5

Time taken to build model: 0.08 seconds
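The model tree above can be read as a small program: walk the splits to pick a leaf, then apply that leaf's linear model. The Python sketch below does this with the thresholds and coefficients copied from the output. It rests on one loud assumption: the TOTRES coefficient is printed as 0 after rounding, so the TOTRES term is dropped here, and because TOTRES values are very large the omitted term may not be negligible. The estimates are therefore approximate. The names `LINEAR_MODELS`, `select_leaf` and `estimate_liqlb` are illustrative, not Weka API.

```python
# (intercept, BROAD coefficient, PSC coefficient) per leaf, copied from the output.
LINEAR_MODELS = {
    "LM1": (22.3168, 0.7447, 0.1474),
    "LM2": (30.3097, 0.586, 0.2788),
    "LM3": (58.87, 0.4606, 0.2504),
    "LM4": (50.7563, 0.4996, 0.2504),
    "LM5": (15.2517, 0.7016, 0.2497),
}


def select_leaf(broad, psc, totres):
    """Follow the splits of the pruned model tree and return the leaf name."""
    if broad <= 153.045:
        return "LM1"
    if psc > 203.02:
        return "LM5"
    if broad <= 177.275:
        return "LM2"
    return "LM3" if totres <= 871108500000 else "LM4"


def estimate_liqlb(broad, psc, totres):
    """Approximate LIQLB using the leaf's linear model.

    Assumption: the TOTRES term is omitted because its coefficient is
    printed as 0 (rounded) in the Weka output.
    """
    intercept, w_broad, w_psc = LINEAR_MODELS[select_leaf(broad, psc, totres)]
    return intercept + w_broad * broad + w_psc * psc


# 1970 row of the table: BROAD 83.65, PSC 111.75, TOTRES 4876114550.
print(select_leaf(83.65, 111.75, 4876114550))
print(round(estimate_liqlb(83.65, 111.75, 4876114550), 2))
```

For the 1970 row the tree selects LM1, and the estimate can be compared with the LIQLB value of 104.73 recorded in the table.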
  • 6. === Cross-validation ===
=== Summary ===

Correlation coefficient                  0.9882
Mean absolute error                      3.412
Root mean squared error                  5.4145
Relative absolute error                 11.529  %
Root relative squared error             15.1993 %
Total Number of Instances               40
Ignored Class Unknown Instances         905

Interpretation of the Results:

Based on the data above, the M5 algorithm generates a model tree made up of five linear models (LM1 to LM5). The tree first splits on broad money in the economy: if BROAD is less than or equal to 153.045, we follow linear model 1 (LM1) to estimate liquidity in the economy (LIQLB). If BROAD > 153.045, we check PSC and move down the tree, choosing the corresponding model to estimate liquidity, as shown in the output above. Note that the class attribute predicted by these models is LIQLB; GDP is not among the six attributes used in this run, so GDP values have to be related to the estimated liquidity separately.

  • 7. Linear Regression with Weka

The second technique is to conduct linear regression through Weka on the same data. When the outcome, or class, is numeric and all the attributes are numeric, linear regression is a natural technique to consider. In the previous technique M5P fitted five linear models to different regions of the same data; here a single linear model is fitted to all of it. On this data the single linear model performs somewhat worse than M5P under cross-validation (root mean squared error 8.0404 against 5.4145). The idea is to express the class as a linear combination of the attributes, with weights estimated from the training data. From the same data we can also find a linear regression equation relating the various parameters that determine GDP.

To run the regression, go to the Classify tab in Weka and choose LinearRegression from the functions group, as shown.

Figure 3: Where to find linear regression in Weka

The following output is generated by the above analysis:

=== Run information ===

Scheme:       weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation:     Copy of Data_Rudra
Instances:    945
Attributes:   7
              YEAR
              BROAD
  • 8.
              CLAIMS
              DOMC
              PSC
              TOTRES
              LIQLB
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

Linear Regression Model

LIQLB =
      1.2523 * BROAD +
      0.6062 * CLAIMS +
     -0.1407 * DOMC +
      0      * TOTRES +
     -6.9705

Time taken to build model: 0.2 seconds

=== Cross-validation ===
=== Summary ===

Correlation coefficient                  0.9738
Mean absolute error                      4.8731
Root mean squared error                  8.0404
Relative absolute error                 16.4661 %
Root relative squared error             22.5707 %
Total Number of Instances               40
Ignored Class Unknown Instances         905

The above analysis gives us a linear mathematical relationship between the variables. The value of the dependent variable (LIQLB) can be found once the values of the independent variables are known. The equation also tells us how the variables are related: a negative coefficient indicates an inverse relationship (LIQLB falls as that variable rises, with the others held fixed), and a positive coefficient indicates a direct one.

To see the same relation in pictorial form, simply go to the Visualize tab in the Weka Explorer. The same is shown in the figure below.
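As a worked illustration of the regression equation, the Python sketch below plugs one row of the table into the model and reads off the direction of each relationship from the sign of its coefficient. The coefficients are copied from the output above. As an explicit assumption, the TOTRES term is left out because its coefficient is printed as 0 after rounding, so the result is only approximate, and the gap may be larger for years with very large TOTRES. The names `COEFFICIENTS`, `estimate_liqlb_lr` and `direction` are illustrative.

```python
# Coefficients copied from the Linear Regression Model output.
COEFFICIENTS = {"BROAD": 1.2523, "CLAIMS": 0.6062, "DOMC": -0.1407}
INTERCEPT = -6.9705


def estimate_liqlb_lr(row):
    """Approximate LIQLB as intercept + sum(weight * attribute value).

    Assumption: the TOTRES term is omitted (coefficient printed as 0).
    """
    return INTERCEPT + sum(weight * row[name] for name, weight in COEFFICIENTS.items())


def direction(weight):
    """Describe the relationship implied by the sign of a coefficient."""
    if weight > 0:
        return "direct"
    if weight < 0:
        return "inverse"
    return "none"


# 1980 row of the table.
row_1980 = {"BROAD": 137.95, "CLAIMS": 8.09, "DOMC": 188.44}
print(round(estimate_liqlb_lr(row_1980), 2))

for name, weight in COEFFICIENTS.items():
    print(f"{name}: {direction(weight)} relationship with LIQLB")
```

The estimate for 1980 can be compared with the recorded LIQLB of 144.30 for that year.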
  • 9. CLUSTERING IN WEKA

Clustering is a technique that groups similar instances (rows), with similarity measured here by Euclidean distance. We used the SimpleKMeans clustering algorithm to analyze clusters in our initial data. SimpleKMeans clusters data using k-means, with the number of clusters specified by the user (2 in this run, the -N 2 option in the scheme). In Weka's EM clusterer, by contrast, the number of clusters can be specified by the user or the algorithm can decide using cross-validation, in which case the number of folds is fixed at 10.

The output of SimpleKMeans for the above data is shown below. The result is shown as a table whose rows are attribute names and whose columns correspond to the cluster centroids; an additional column at the beginning shows the entire data set. The number of instances in each cluster appears in parentheses at the top of its column. Each table entry is either the mean or the mode of the corresponding attribute for the cluster in that column.

The bottom of the output shows the result of applying the learned cluster model. In this case, it assigned each training instance to one of the clusters, showing the same result as the parenthetical numbers at the top of each column. An alternative is to use a separate test set or a percentage split of the training data, in which case the figures would be different.

This technique could also be used with data from other countries, in addition to the present data, which is taken for Japan.
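The k-means procedure described above can be sketched in a few lines of Python. This is a simplified illustration, not Weka's SimpleKMeans: it uses only two attributes (BROAD and LIQLB) for six years taken from the table, starts from two hand-picked initial centroids instead of a seeded random choice, and skips the attribute normalisation that Weka applies by default. The names `k_means` and `within_cluster_sse` are hypothetical.

```python
import math

# (BROAD, LIQLB) for 1970, 1975, 1980, 1995, 2000 and 2009, copied from the table.
points = [
    (83.65, 104.73),
    (118.31, 127.56),
    (137.95, 144.30),
    (205.79, 209.90),
    (242.24, 243.84),
    (223.76, 206.13),
]


def k_means(points, initial_centroids, max_iterations=500):
    """Plain k-means with Euclidean distance; returns (labels, centroids)."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


def within_cluster_sse(points, labels, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(math.dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))


# Assumption: the first and last points serve as initial centroids.
labels, centroids = k_means(points, [points[0], points[-1]])
print(labels)
print(centroids)
print(within_cluster_sse(points, labels, centroids))
```

Because the sketch works on raw values while Weka normalises attributes, its within-cluster sum of squared errors is not comparable with the figure Weka reports.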
  • 10. === Run information ===

Scheme:       weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation:     Copy of Data_Rudra
Instances:    945
Attributes:   7
              YEAR
              BROAD
              CLAIMS
              DOMC
              PSC
              TOTRES
              LIQLB
Test mode:    evaluate on training data
  • 11. === Model and evaluation on training set ===

kMeans
======

Number of iterations: 5
Within cluster sum of squared errors: 12.988387913678944
Missing values globally replaced with mean/mode

Cluster centroids:
                                       Cluster#
Attribute           Full Data                  0                  1
                        (945)              (929)               (16)
=================================================================
YEAR                   1989.5          1989.2933             2001.5
BROAD                174.1633           173.4625           214.8525
CLAIMS                 6.6645             6.8103            -1.7981
DOMC                 242.2808           241.2956           299.4794
PSC                  168.2627           167.8077            194.685
TOTRES      248907476505.9463  243675387834.3592       552695625000
LIQLB                175.2342           174.7166           205.2875

Time taken to build model (full training data) : 0.14 seconds

=== Model and evaluation on training set ===

Clustered Instances

0      929 ( 98%)
1       16 (  2%)

The relation contains 945 instances although the table holds only 40 years of data, and the missing values were replaced with the attribute means. The centroid values are consistent with cluster 1 containing the 16 years from 1994 to 2009 (mean year 2001.5) and cluster 0 containing the remaining 24 years together with the 905 empty rows.

We can also visualize the clusters formed. Right-click on the entry in the result list and select Visualize cluster assignments. We get the following output: