1. SUBMITTED BY:
ARUN KUMAR DASH
PARAS SHAH
DIVYA RAJASRI TADI
NIREESHA MANDALA
DATA MINING FINAL PROJECT
FALL 2014
INTRODUCTION
Data mining aims to extract patterns and knowledge from a data set and convert
them into an understandable structure that can be reused later. The data is not
merely analyzed: it is pre-processed, complexities in the data are taken into
account, and the structures discovered are post-processed. Data mining is
applied to large quantities of data in order to uncover previously unknown
patterns, such as groups of similar records (cluster analysis) or exceptional
records (anomaly detection). These patterns can then be used in machine
learning and predictive analytics. Steps such as data collection, data
preparation, result interpretation and reporting, though carried out alongside
data mining, belong to the broader knowledge discovery in databases (KDD)
process.
Data mining is used in almost every field: business, science and engineering,
medicine, and visual, music, sensor, temporal and spatial data, to name a few.
The ways in which data mining is used can, in some contexts, raise questions
of privacy, legality and ethics, because the data preparation it requires can
uncover patterns that compromise confidentiality. This happens mainly during
data aggregation, where data from several sources is combined for analysis,
which can affect the privacy of individual records.
We used a census data set; the prediction task is to determine whether a
person's income exceeds $50K a year.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative,
Unmarried.
DATA PRE-PROCESSING:
The first step in data mining is to choose a data set with a large number of
records, so that unknown patterns can be identified, but not one so large that
finding a pattern takes excessive time; selecting an appropriate data set is
therefore important. Data sets are usually drawn from a data warehouse. The
main aim of data pre-processing is to evaluate multivariate data prior to
mining: the data set is cleaned to remove records containing noise and records
with missing values.
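As a rough illustration of this cleaning step, the sketch below drops records that contain missing values, which the census data marks with "?" (the rows shown are made-up examples in that format):

```python
# Minimal sketch: drop census records containing missing values ("?").
# The rows below are hypothetical examples, not taken from the data set.
rows = [
    ["39", "State-gov", "Bachelors", "40", "<=50K"],
    ["28", "?", "HS-grad", "38", "<=50K"],       # missing workclass
    ["51", "Private", "Masters", "50", ">50K"],
]

clean = [r for r in rows if "?" not in r]
print(len(clean))   # number of complete records kept
```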
DATA MINING:
It is broadly divided into six categories of tasks:
Anomaly detection:
This phase identifies records that are unusual, such as outliers, deviations
or errors; these records then have to be investigated further.
Association rule learning:
In this phase the model searches for relationships among the different
variables. The classic example is the beer-and-diapers rule: on Fridays,
customers who buy diapers also tend to buy beer. Based on this
association, the store can place the two items together to increase sales.
This task is also known as market basket analysis or dependency modelling.
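The support and confidence behind such a rule can be computed directly; the sketch below does so for a hypothetical rule {diapers} -> {beer} over a handful of made-up transactions:

```python
# Toy transactions (made-up data) for illustrating support and confidence.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# confidence({diapers} -> {beer}) = support(both) / support(antecedent)
conf = support({"diapers", "beer"}) / support({"diapers"})
print(support({"diapers", "beer"}), conf)
```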
Clustering:
This phase discovers groups of records that are similar in some respect,
without using structures known to exist in the data beforehand.
Classification:
This phase generalizes known structures and applies them to new data, for
example an e-mail program that classifies incoming messages as spam or
genuine.
Regression:
This phase tries to find a function that models the data with the least
error.
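As a minimal illustration, the sketch below fits a one-variable least-squares line, i.e. the straight line with the smallest sum of squared errors, to a few made-up points:

```python
# One-variable least-squares fit, implemented directly (toy data).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# slope and intercept minimizing the sum of squared errors
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x
print(round(slope, 3), round(intercept, 3))
```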
Summarization:
This phase provides a more compact representation of the data set, including
visualization and report generation.
Validation of results:
Data mining can also produce results that appear significant but cannot be
used to predict future behavior and cannot be reproduced on a new sample of
data; such results usually stem from improper statistical hypothesis testing.
A simple
version of this problem is known as overfitting, where the algorithm finds
patterns in the training set that are not present in the general data set.
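A common guard against this is holdout evaluation: part of the data is held back so that patterns found during training can be checked on records the model never saw. A minimal sketch of a 66%/34% split (the records here are stand-ins):

```python
import random

# Holdout split sketch: 66% train / 34% test, with stand-in records.
records = list(range(100))
random.seed(10)            # fixed seed so the split is repeatable
random.shuffle(records)

cut = int(len(records) * 0.66)
train, holdout = records[:cut], records[cut:]
print(len(train), len(holdout))
```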
WEKA (WAIKATO ENVIRONMENT FOR
KNOWLEDGE ANALYSIS)
Weka consists of a collection of visualization tools and algorithms for data
analysis and predictive modeling, together with graphical user interfaces for
easy access to this functionality. It is used mainly for teaching and
research. Weka is:
- freely available under the GNU General Public License;
- portable, since it is implemented entirely in Java and runs on almost any
computing platform;
- an extensive collection of data preprocessing and modeling techniques;
- easy to use thanks to its GUI.
Weka supports various data mining tasks like data preprocessing, clustering,
classification, regression, visualization, and feature selection. Weka's
techniques are predicated on the assumption that data is available as a single
flat file or relation, where each data point is expressed by a fixed number of
attributes. Weka provides access to SQL databases using Java Database
Connectivity and can process the result returned by a database query.
NAÏVE BAYES SIMPLE
All attributes: We executed Naïve Bayes Simple on all 15 attributes in our
data set and obtained the following results.
The accuracy of the predictive model is 83.69%.
Few attributes: We considered 6 attributes, namely age, workclass, education,
education-num, hours-per-week and salary, and obtained the following results.
The accuracy of the predictive model is 79.20%.
The following results are obtained on removing the education attribute
from the data set.
The accuracy of the predictive model is now 80%, compared with 79.20% when
the education attribute was present. This suggests that education adds little
information here, most likely because it is redundant given education-num,
which encodes the same education level numerically; removing it yields a
slightly more accurate model.
=== Run information ===
Scheme: weka.classifiers.bayes.NaiveBayesSimple
Relation: income-weka.filters.unsupervised.attribute.Remove-R6,8-12,14-
weka.filters.unsupervised.attribute.Remove-R3-
weka.filters.unsupervised.attribute.Remove-R5-
weka.filters.unsupervised.attribute.Remove-R3
Instances: 32561
Attributes: 5
age
workclass
education-num
hours-per-week
salary
Test mode: split 66.0% train, remainder test
=== Classifier model (full training set) ===
Naive Bayes (simple)
Class <=50K: P(C) = 0.75917452
Attribute age
Mean: 36.78373786 Standard Deviation: 14.02008849
Attribute workclass
State-gov   Self-emp-not-inc   Private   Federal-gov   Local-gov   ?   Self-emp-inc   Without-pay   Never-worked
0.03825468  0.07351692  0.71713373  0.02385863  0.05972745  0.06656153  0.02001698  0.00060658  0.00032351
Attribute education-num
Mean: 9.59506472 Standard Deviation: 2.43614679
Attribute hours-per-week
Mean: 38.84021036 Standard Deviation: 12.31899464
Class >50K: P(C) = 0.24082548
Attribute age
Mean: 44.24984058 Standard Deviation: 10.51902772
Attribute workclass
State-gov   Self-emp-not-inc   Private   Federal-gov   Local-gov   ?   Self-emp-inc   Without-pay   Never-worked
0.04509554  0.09235669  0.63235669  0.04738854  0.07872611  0.0244586  0.07936306  0.00012739  0.00012739
Attribute education-num
Mean: 11.61165668 Standard Deviation: 2.38512863
Attribute hours-per-week
Mean: 45.4730264 Standard Deviation: 11.01297093
Time taken to build model: 0.06 seconds
=== Evaluation on test split ===
=== Summary ===
Correctly Classified Instances 8857 80.0018 %
Incorrectly Classified Instances 2214 19.9982 %
Kappa statistic 0.3666
Mean absolute error 0.2737
Root mean squared error 0.3704
Relative absolute error 75.1356 %
Root relative squared error 87.3416 %
Total Number of Instances 11071
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.923 0.601 0.833 0.923 0.876 0.82 <=50K
0.399 0.077 0.615 0.399 0.484 0.82 >50K
Weighted Avg. 0.8 0.478 0.782 0.8 0.784 0.82
=== Confusion Matrix ===
a b <-- classified as
7820 650 | a = <=50K
1564 1037 | b = >50K
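The arithmetic behind this classifier can be reproduced from the per-class priors, means and standard deviations printed in the model above. The sketch below does so using only the three numeric attributes (the nominal workclass attribute is omitted for brevity, and the example person is hypothetical):

```python
import math

# Per-class priors, means and standard deviations copied from the Weka
# NaiveBayesSimple output above (numeric attributes only).
model = {
    "<=50K": {"prior": 0.75917452,
              "age": (36.78373786, 14.02008849),
              "education-num": (9.59506472, 2.43614679),
              "hours-per-week": (38.84021036, 12.31899464)},
    ">50K":  {"prior": 0.24082548,
              "age": (44.24984058, 10.51902772),
              "education-num": (11.61165668, 2.38512863),
              "hours-per-week": (45.4730264, 11.01297093)},
}

def gaussian(x, mean, sd):
    """Normal density used by naive Bayes for numeric attributes."""
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

def classify(person):
    # score(class) = P(class) * product of per-attribute densities
    scores = {}
    for cls, params in model.items():
        score = params["prior"]
        for attr, value in person.items():
            mean, sd = params[attr]
            score *= gaussian(value, mean, sd)
        scores[cls] = score
    return max(scores, key=scores.get)

# A hypothetical 45-year-old with education-num 13 working 50 hours a week:
print(classify({"age": 45, "education-num": 13, "hours-per-week": 50}))
```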
APRIORI:
Apriori is an algorithm for frequent item set mining and association rule
learning over transactional databases. Apriori proceeds by identifying the
frequent individual items in the database and extending them to larger and
larger item sets as long as those item sets appear sufficiently often in the
database.
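A minimal sketch of this level-wise search over a few made-up transactions (the 0.5 minimum support is an arbitrary choice for the example):

```python
from itertools import combinations

# Toy transactions (made-up data) for the Apriori level-wise search.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk", "chips"},
    {"beer", "chips"},
]
min_support = 0.5

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Level 1: frequent single items.
items = sorted({i for t in transactions for i in t})
frequent = [frozenset({i}) for i in items if support({i}) >= min_support]
level = frequent
while level:
    # Join step: extend frequent k-itemsets to candidate (k+1)-itemsets,
    # then keep only the candidates that are still frequent.
    candidates = {a | b for a, b in combinations(level, 2)
                  if len(a | b) == len(a) + 1}
    level = [c for c in candidates if support(c) >= min_support]
    frequent += level

print(sorted(tuple(sorted(s)) for s in frequent))
```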
TOTAL NUMBER OF RULES
OUTPUT:
=== Run information ===
Scheme: weka.associations.Apriori -I -R -N 200 -T 0 -C 0.1 -D 0.05 -U 1.0 -M 0.1 -S
1.0 -V -c -1
Relation: income-weka.filters.unsupervised.attribute.Remove-R6,8-12,14-
weka.filters.unsupervised.attribute.Remove-R3-
weka.filters.unsupervised.attribute.Remove-R5-
weka.filters.unsupervised.attribute.Remove-R3-
weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-last
Instances: 32561
Attributes: 5
age
workclass
education-num
hours-per-week
salary
=== Associator model (full training set) ===
Apriori
=======
Minimum support: 0.1 (3256 instances)
Minimum metric <confidence>: 0.1
Significance level: 1
Number of cycles performed: 14
Generated sets of large itemsets:
Size of set of large itemsets L(1): 12
Large Itemsets L(1):
age='(-inf-24.3]' 5570
age='(24.3-31.6]' 5890
age='(31.6-38.9]' 6048
age='(38.9-46.2]' 6163
age='(46.2-53.5]' 3967
workclass= Private 22696
education-num='(8.5-10]' 17792
education-num='(11.5-13]' 6422
hours-per-week='(30.4-40.2]' 17735
hours-per-week='(40.2-50]' 5938
salary= <=50K 24720
salary= >50K 7841
J48:
The decision trees generated by J48 can be used for classification. J48
exploits the fact that each attribute of the data can be used to make a
decision by splitting the data into smaller subsets.
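The splitting criterion behind such trees is information gain: pick the attribute whose split most reduces the entropy of the class labels. A small sketch over made-up records:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

# Hypothetical records: (workclass, salary label)
records = [("Private", "<=50K"), ("Private", "<=50K"), ("Private", ">50K"),
           ("Self-emp", ">50K"), ("Self-emp", ">50K"), ("Gov", "<=50K")]

labels = [y for _, y in records]
before = entropy(labels)
# Expected entropy after splitting on workclass, weighted by subset size.
after = 0.0
for v in {w for w, _ in records}:
    subset = [y for w, y in records if w == v]
    after += len(subset) / len(records) * entropy(subset)
gain = before - after
print(round(gain, 3))
```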
WITH ALL ATTRIBUTES
K-MEANS CLUSTERING:
K-means clustering is a method of vector quantization, originally from signal
processing, that is popular for cluster analysis in data mining. k-means
clustering aims to partition n observations into k clusters in which each
observation belongs to the cluster with the nearest mean, serving as a
prototype of the cluster.
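A minimal one-dimensional sketch of the k-means loop described above, on a few made-up ages with k = 2:

```python
# 1-D k-means sketch: assign each point to the nearest mean, recompute means.
points = [25.0, 26.0, 24.0, 48.0, 50.0, 49.0]   # e.g. ages (made-up)
means = [points[0], points[3]]                   # simple initial prototypes

for _ in range(10):                              # iterate until stable
    clusters = [[], []]
    for p in points:
        nearest = min(range(len(means)), key=lambda i: abs(p - means[i]))
        clusters[nearest].append(p)
    means = [sum(c) / len(c) for c in clusters]

print(means)
```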
=== Run information ===
Scheme: weka.clusterers.SimpleKMeans -N 5 -A "weka.core.EuclideanDistance -R first-last" -I
500 -S 10
Relation: income-weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-last
Instances: 32561
Attributes: 15
age
workclass
fnlwgt
education
education-num
marital-status
occupation
relationship
race
sex
capital-gain
capital-loss
hours-per-week
native-country
salary
Test mode: split 66% train, remainder test
=== Model and evaluation on training set ===
kMeans
======
Number of iterations: 4
Within cluster sum of squared errors: 146940.0
Time taken to build model (full training data) : 3.13 seconds
=== Model and evaluation on test split ===
kMeans
======
Number of iterations: 4
Within cluster sum of squared errors: 103681.0
Missing values globally replaced with mean/mode
Cluster centroids:
                      Cluster#
Attribute        Full Data            0                    1                  2                  3                  4
                 (21490)              (11082)              (3995)             (3024)             (2584)             (805)
====================================================================================================================================
age              '(38.9-46.2]'        '(31.6-38.9]'        '(-inf-24.3]'      '(24.3-31.6]'      '(24.3-31.6]'      '(-inf-24.3]'
workclass        Private              Private              Private            Private            Private            Private
fnlwgt           '(159527-306769]'    '(159527-306769]'    '(159527-306769]'  '(159527-306769]'  '(159527-306769]'  '(159527-306769]'
education        HS-grad              HS-grad              Some-college       HS-grad            Bachelors          Some-college
education-num    '(8.5-10]'           '(8.5-10]'           '(8.5-10]'         '(8.5-10]'         '(11.5-13]'        '(8.5-10]'
marital-status   Married-civ-spouse   Married-civ-spouse   Never-married      Never-married      Never-married      Never-married
occupation       Craft-repair         Craft-repair         Other-service      Adm-clerical       Prof-specialty     Other-service
relationship     Husband              Husband              Own-child          Not-in-family      Not-in-family      Not-in-family
race             White                White                White              White              White              White
sex              Male                 Male                 Female             Female             Female             Female
capital-gain     '(-inf-9999.9]'      '(-inf-9999.9]'      '(-inf-9999.9]'    '(-inf-9999.9]'    '(-inf-9999.9]'    '(-inf-9999.9]'
capital-loss     '(-inf-435.6]'       '(-inf-435.6]'       '(-inf-435.6]'     '(-inf-435.6]'     '(-inf-435.6]'     '(-inf-435.6]'
hours-per-week   '(30.4-40.2]'        '(30.4-40.2]'        '(30.4-40.2]'      '(30.4-40.2]'      '(30.4-40.2]'      '(20.6-30.4]'
native-country   United-States        United-States        United-States      United-States      United-States      United-States
salary           <=50K                <=50K                <=50K              <=50K              <=50K              <=50K
Time taken to build model (percentage split) : 1.99 seconds
Clustered Instances
0 5618 ( 51%)
1 2119 ( 19%)
2 1569 ( 14%)
3 1358 ( 12%)
4 407 ( 4%)
SAS ENTERPRISE MINER
Cluster Analysis:
This analysis attempts to find natural groupings of observations in the data,
based on a set of input variables. After grouping the observations into clusters,
you can use the input variables to try to characterize each group. Once the
clusters have been identified and interpreted, you can decide whether to treat
each cluster independently. Clustering can be formulated as a multi-objective
optimization problem: the appropriate algorithm and parameter settings (such
as the distance function, a density threshold, or the number of expected
clusters) depend on the individual data set and the intended use of the
results. Cluster analysis as such is not an automatic task but an iterative
process of knowledge discovery, or interactive multi-objective optimization,
that involves trial and error; it is often necessary to modify the data
preprocessing and model parameters until the result has the desired
properties.
For this data set we built clusters to group similar records, changing the
following properties of the cluster node:
Cluster variable role = Segment
Specification Method = User Specify
Maximum Number of Clusters = 5
From the clustering segments (pie-chart), we can observe that cluster 1
contains a large part of the data set.
SAS DECISION TREES
Node Rules:
*------------------------------------------------------------*
Node = 11
*------------------------------------------------------------*
if Relationship IS ONE OF: NOT-IN-FAMILY, OWN-CHILD, UNMARRIED, OTHER-RELATIVE or MISSING
AND Hours-Per-Week >= 35.5 or MISSING
AND Capital-Gain >= 7073.59
then
Tree Node Identifier = 11
Number of Observations = 285
Predicted: Salary=>50K = 0.99
Predicted: Salary=<=50K = 0.01
*------------------------------------------------------------*
Node = 13
*------------------------------------------------------------*
if Relationship IS ONE OF: HUSBAND, WIFE
AND Education-Num < 12.5 or MISSING
AND Capital-Gain >= 5095.5
then
Tree Node Identifier = 13
Number of Observations = 522
Predicted: Salary=>50K = 0.98
Predicted: Salary=<=50K = 0.02
*------------------------------------------------------------*
Node = 15
*------------------------------------------------------------*
if Relationship IS ONE OF: HUSBAND, WIFE
AND Education-Num >= 12.5
AND Capital-Gain >= 5095.5
then
Tree Node Identifier = 15
Number of Observations = 678
Predicted: Salary=>50K = 1.00
Predicted: Salary=<=50K = 0.00
*------------------------------------------------------------*
Node = 161
*------------------------------------------------------------*
if Workclass IS ONE OF: STATE-GOV, SELF-EMP-NOT-INC
AND Relationship IS ONE OF: HUSBAND, WIFE
AND Occupation IS ONE OF: ADM-CLERICAL, EXEC-MANAGERIAL, PROF-SPECIALTY, SALES, TECH-SUPPORT,
PROTECTIVE-SERV
AND Hours-Per-Week >= 37.5 or MISSING
AND Education-Num < 12.5 AND Education-Num >= 9.5 or MISSING
AND Capital-Loss < 1512 or MISSING
AND Capital-Gain < 5095.5 or MISSING
AND Age >= 33.5 or MISSING
then
Tree Node Identifier = 161
Number of Observations = 157
Predicted: Salary=>50K = 0.39
Predicted: Salary=<=50K = 0.61
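These printed rules translate almost mechanically into code. The sketch below covers only nodes 11, 13 and 15 from the output above, ignores the "or MISSING" branches, and returns None for records no listed rule covers:

```python
# Nodes 11, 13 and 15 of the SAS decision tree, transcribed as a function.
# "or MISSING" handling is omitted for brevity.
def tree_node(relationship, education_num, hours_per_week, capital_gain):
    if relationship in {"NOT-IN-FAMILY", "OWN-CHILD", "UNMARRIED", "OTHER-RELATIVE"} \
            and hours_per_week >= 35.5 and capital_gain >= 7073.59:
        return 11       # predicted P(salary > 50K) = 0.99
    if relationship in {"HUSBAND", "WIFE"} and capital_gain >= 5095.5:
        if education_num < 12.5:
            return 13   # predicted P(salary > 50K) = 0.98
        return 15       # predicted P(salary > 50K) = 1.00
    return None         # record not covered by the nodes shown here

print(tree_node("HUSBAND", education_num=10, hours_per_week=40, capital_gain=6000))
```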
Fit Statistics
Target=Salary Target Label=' '
Statistic   Statistics Label             Train
_NOBS_      Sum of Frequencies           32561.00
_MISC_      Misclassification Rate       0.14
_MAX_       Maximum Absolute Error       1.00
_SSE_       Sum of Squared Errors        6368.80
_ASE_       Average Squared Error        0.10
_RASE_      Root Average Squared Error   0.31
_DIV_       Divisor for ASE              65122.00
_DFT_       Total Degrees of Freedom     32561.00
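These statistics are internally consistent: the average squared error is the sum of squared errors divided by the ASE divisor, and the root average squared error is its square root. A quick check:

```python
import math

# Values copied from the fit statistics table above.
sse = 6368.80
divisor = 65122.00

ase = sse / divisor          # average squared error
rase = math.sqrt(ase)        # root average squared error
print(round(ase, 2), round(rase, 2))
```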
CONCLUSION
1. If the age is around 24 years and the education level is 11th grade, 12th
grade or some college, the income is likely to be less than $50K.
2. If the age is around 24 years and the person works for a private firm, the
income is likely to be less than $50K.
3. If the age is around 24 years, the income is likely to be less than $50K.
4. 90% of females have a salary of at most $50K, whereas only 60% of males do.
5. 95% of the population in the Other-service occupation category have a
salary of at most $50K.
6. 95% of the population in the 23-to-24 age group have a salary of at most
$50K.
We made a couple of runs of the J48 classifier and found the following five
attributes to be the most useful in predicting a person's income:
education-num, age, workclass, hours-per-week, and the class attribute salary.
These columns provide an accuracy of 80% in predicting the income.