SUBMITTED BY:
ARUN KUMAR DASH
PARAS SHAH
DIVYA RAJASRI TADI
NIREESHA MANDALA
DATA MINING FINAL PROJECT
DATA MINING FINAL PROJECT FALL 2014
INTRODUCTION
Data mining aims to extract patterns and knowledge from a data set and to
convert them into an easy-to-understand structure that can be reused later.
The data is not only analyzed: it is pre-processed, its complexity is taken
into account, and the structures discovered are post-processed. Data mining is
performed on large quantities of data to extract previously unknown patterns,
such as groups of similar records (cluster analysis) or exceptional records
(anomaly detection). These patterns can then be used in machine learning and
predictive analysis. Steps such as data collection, data preparation, result
interpretation and reporting, though part of a data mining project, belong to
the overall knowledge discovery in databases (KDD) process.
Data mining is used in almost every field, including business, science and
engineering, medicine, and the analysis of visual, music, sensor, temporal and
spatial data. In some cases and contexts, the ways in which data mining is
used raise questions of privacy, legality and ethics. Data mining requires
data preparation, which can uncover patterns that compromise confidentiality.
This mainly happens during data aggregation, where data is combined for
analysis in a way that can affect the privacy of the individuals it describes.
We used a census data set. The prediction task is to determine whether a
person makes over $50K a year.
Number of instances : 32561
Number of attributes : 15
age, fnlwgt, workclass, education, education-num, marital-status, occupation,
relationship, race, sex, capital-gain, capital-loss, hours-per-week,
native-country, salary
Attribute information:
salary: >50K, <=50K.
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov,
State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm,
Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th,
Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated,
Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial,
Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-
fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative,
Unmarried.
DATA PRE-PROCESSING:
The first step in data mining is to choose a data set with enough records for
unknown patterns to be identified. The data set should not be so large that
finding a pattern takes too long, so selecting an appropriate data set is very
important. Data sets are usually chosen from a data warehouse. The main aim of
data pre-processing is to evaluate multivariate data sets prior to data
mining. The data set is cleaned to remove records containing noise and records
with missing data.
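The cleaning step can be sketched as follows. This is a minimal illustration, not the procedure used for this project: it assumes the data is held as a list of rows and that a missing value is marked with "?", the marker that appears among the workclass values in the Weka output of this report. The function name and the sample rows are hypothetical.

```python
def drop_incomplete(rows, missing_marker="?"):
    """Keep only the rows in which no field equals the missing-value marker."""
    return [
        row for row in rows
        if all(field.strip() != missing_marker for field in row)
    ]


# Hypothetical sample rows: age, workclass, education-num, hours-per-week, salary
sample = [
    ["39", "State-gov", "13", "40", "<=50K"],
    ["54", "?", "10", "60", ">50K"],
    ["28", "Private", "13", "40", "<=50K"],
]

cleaned = drop_incomplete(sample)
print(len(sample), "rows before cleaning,", len(cleaned), "rows after")
```

Dropping rows is the simplest option. Replacing a missing value with the most frequent value of its attribute is a common alternative when too many rows would be lost.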
DATA MINING:
Data mining is broadly divided into six categories of tasks:
Anomaly detection:
This task identifies unusual records, such as outliers, deviations or errors,
which then need to be investigated further.
Association rule learning:
This task searches for relationships among the different variables. A common
example is the beer and diapers case, in which customers who shop on Fridays
buy beer along with diapers. Based on this association, the store can place
the two items together to increase sales. This is also known as market basket
analysis or dependency modelling.
Clustering:
This task discovers groups of records that are similar in some respect,
without using previously known structures in the data.
Classification:
This task generalizes known structure so that it can be applied to new data.
An example is an e-mail program that attempts to classify e-mails as spam or
genuine.
Regression:
This task tries to find a function that models the data with the least
amount of error.
Summarization:
This task provides a more compact representation of the data set, including
visualization and report generation.
Validation of results:
Data mining can produce results that appear significant but do not predict
future behavior and cannot be reproduced on a new sample of data. Such results
often stem from improper statistical hypothesis testing. A simple version of
this problem is overfitting, where a data mining algorithm finds patterns in
the training set that are not present in the general data set.
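One common guard against overfitting is to hold back part of the data and evaluate the model only on records it has not seen. The sketch below shows such a percentage split. The function name, the seed and the rounding rule are assumptions made for illustration. With this rounding, a 66% split of 32561 rows leaves 11071 rows for testing, which is the number of test instances in the evaluation reported later in this document.

```python
import random


def percentage_split(rows, train_percent=66.0, seed=1):
    """Shuffle the rows and split them into a training part and a test part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = round(len(shuffled) * train_percent / 100)
    return shuffled[:n_train], shuffled[n_train:]


# Row indices stand in for the 32561 census records.
train, test = percentage_split(range(32561))
print(len(train), "training rows,", len(test), "test rows")
```

The model is then built on `train` only and its accuracy is measured on `test`.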
WEKA (WAIKATO ENVIRONMENT FOR
KNOWLEDGE ANALYSIS)
Weka is a collection of visualization tools and algorithms for data analysis
and predictive modeling, together with graphical user interfaces for easy
access to this functionality. It is mainly used for teaching and research.
Weka is:
• Freely available under the GNU General Public License
• Portable, as it is implemented entirely in Java and runs on almost any
computing platform
• An extensive collection of data preprocessing and modeling techniques
• Easy to use, thanks to its GUI
Weka supports various data mining tasks such as data preprocessing,
clustering, classification, regression, visualization, and feature selection.
Weka's techniques assume that the data is available as a single flat file or
relation, where each data point is described by a fixed number of attributes.
Weka provides access to SQL databases through Java Database Connectivity
(JDBC) and can process the result returned by a database query.
NAÏVE BAYES SIMPLE
• All attributes: We executed Naïve Bayes Simple on all 15 attributes in our
data set and obtained the following results.
The accuracy of the predictive model is 83.69%.
• Few attributes: We considered six attributes, namely age, workclass,
education, education-num, hours-per-week and salary, and obtained the
following results.
The accuracy of the predictive model is 79.20%.
• The following results were obtained after removing the education attribute
from the data set.
The accuracy of the predictive model is now 80.00%, compared to 79.20% when
the education attribute was present. This suggests that education contributes
little to this model once education-num is included, and that removing it
yields a slightly more accurate result.
=== Run information ===
Scheme:weka.classifiers.bayes.NaiveBayesSimple
Relation: income-weka.filters.unsupervised.attribute.Remove-R6,8-12,14-
weka.filters.unsupervised.attribute.Remove-R3-
weka.filters.unsupervised.attribute.Remove-R5-
weka.filters.unsupervised.attribute.Remove-R3
Instances: 32561
Attributes: 5
age
workclass
education-num
hours-per-week
salary
Test mode:split 66.0% train, remainder test
=== Classifier model (full training set) ===
Naive Bayes (simple)
Class <=50K: P(C) = 0.75917452
Attribute age
Mean: 36.78373786 Standard Deviation: 14.02008849
Attribute workclass
State-gov          0.03825468
Self-emp-not-inc   0.07351692
Private            0.71713373
Federal-gov        0.02385863
Local-gov          0.05972745
?                  0.06656153
Self-emp-inc       0.02001698
Without-pay        0.00060658
Never-worked       0.00032351
Attribute education-num
Mean: 9.59506472 Standard Deviation: 2.43614679
Attribute hours-per-week
Mean: 38.84021036 Standard Deviation: 12.31899464
Class >50K: P(C) = 0.24082548
Attribute age
Mean: 44.24984058 Standard Deviation: 10.51902772
Attribute workclass
State-gov          0.04509554
Self-emp-not-inc   0.09235669
Private            0.63235669
Federal-gov        0.04738854
Local-gov          0.07872611
?                  0.0244586
Self-emp-inc       0.07936306
Without-pay        0.00012739
Never-worked       0.00012739
Attribute education-num
Mean: 11.61165668 Standard Deviation: 2.38512863
Attribute hours-per-week
Mean: 45.4730264 Standard Deviation: 11.01297093
Time taken to build model: 0.06 seconds
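The model printed above can be turned into a prediction by hand. The sketch below is an illustration under assumptions, not Weka's own code: it treats each numeric attribute as normally distributed with the mean and standard deviation listed for the class, multiplies the class prior by the per-attribute likelihoods (done in log space), and normalizes the two scores. Only some workclass values are copied from the output, and the two records being classified are hypothetical.

```python
import math

# Parameters copied from the classifier model printed above.
MODEL = {
    "<=50K": {
        "prior": 0.75917452,
        "numeric": {
            "age": (36.78373786, 14.02008849),
            "education-num": (9.59506472, 2.43614679),
            "hours-per-week": (38.84021036, 12.31899464),
        },
        "workclass": {"Private": 0.71713373, "Self-emp-inc": 0.02001698,
                      "Federal-gov": 0.02385863},
    },
    ">50K": {
        "prior": 0.24082548,
        "numeric": {
            "age": (44.24984058, 10.51902772),
            "education-num": (11.61165668, 2.38512863),
            "hours-per-week": (45.4730264, 11.01297093),
        },
        "workclass": {"Private": 0.63235669, "Self-emp-inc": 0.07936306,
                      "Federal-gov": 0.04738854},
    },
}


def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior(record):
    """Return P(class | record) for both classes under the naive Bayes assumption."""
    log_scores = {}
    for label, params in MODEL.items():
        score = math.log(params["prior"])
        score += math.log(params["workclass"][record["workclass"]])
        for name, (mean, std) in params["numeric"].items():
            score += math.log(normal_pdf(record[name], mean, std))
        log_scores[label] = score
    top = max(log_scores.values())  # subtract the maximum for numerical stability
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


# Two hypothetical records, not taken from the data set.
young = {"age": 20, "workclass": "Private", "education-num": 9, "hours-per-week": 30}
senior = {"age": 50, "workclass": "Private", "education-num": 14, "hours-per-week": 55}
print(posterior(young))
print(posterior(senior))
```

The first record falls in the <=50K class and the second in the >50K class, which follows from how far each value lies from the class means listed in the model.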
=== Evaluation on test split ===
=== Summary ===
Correctly Classified Instances 8857 80.0018 %
Incorrectly Classified Instances 2214 19.9982 %
Kappa statistic 0.3666
Mean absolute error 0.2737
Root mean squared error 0.3704
Relative absolute error 75.1356 %
Root relative squared error 87.3416 %
Total Number of Instances 11071
=== Detailed Accuracy By Class ===
               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
               0.923     0.601     0.833       0.923    0.876       0.82       <=50K
               0.399     0.077     0.615       0.399    0.484       0.82       >50K
Weighted Avg.  0.8       0.478     0.782       0.8      0.784       0.82
=== Confusion Matrix ===
a b <-- classified as
7820 650 | a = <=50K
1564 1037 | b = >50K
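The summary figures can be recomputed from the confusion matrix alone. The sketch below uses the standard definitions of accuracy, precision, recall, F-measure and Cohen's kappa. The variable and function names are chosen for illustration.

```python
# Rows are the actual classes, columns the predicted classes,
# in the order <=50K, >50K (copied from the confusion matrix above).
matrix = [[7820, 650],
          [1564, 1037]]
labels = ["<=50K", ">50K"]

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(2)) / total


def precision(i):
    """Correct predictions of class i divided by all predictions of class i."""
    return matrix[i][i] / sum(matrix[r][i] for r in range(2))


def recall(i):
    """Correct predictions of class i divided by all actual members of class i."""
    return matrix[i][i] / sum(matrix[i])


def f_measure(i):
    """Harmonic mean of precision and recall for class i."""
    p, r = precision(i), recall(i)
    return 2 * p * r / (p + r)


# Agreement expected by chance, from the row and column totals.
expected = sum(
    sum(matrix[i]) * sum(matrix[r][i] for r in range(2)) for i in range(2)
) / total ** 2
kappa = (accuracy - expected) / (1 - expected)

print(f"accuracy {accuracy:.6f}  kappa {kappa:.4f}")
for i, label in enumerate(labels):
    print(f"{label}: precision {precision(i):.3f} "
          f"recall {recall(i):.3f} F-measure {f_measure(i):.3f}")
```

The recall of 0.399 for the >50K class shows that the model misses most high earners even though the overall accuracy is about 80%, because the <=50K class makes up roughly three quarters of the data.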
APRIORI:
Apriori is an algorithm for frequent item set mining and association rule
learning over transactional databases. Apriori proceeds by identifying the
frequent individual items in the database and extending them to larger and
larger item sets as long as those item sets appear sufficiently often in the
database.
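The level-wise search described above can be sketched in a few lines. This is a minimal illustration on hypothetical shopping baskets, not Weka's implementation: it counts single items first, then builds candidates of size k only from frequent item sets of size k-1, and stops when no candidate is frequent. The minimum support is given here as an absolute count.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {item set: count} for every item set in at least min_count transactions."""
    counts = {}
    for transaction in transactions:
        for item in transaction:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    size = 2
    while frequent:
        # Join frequent (size-1)-item sets; keep a candidate only if all of
        # its (size-1)-subsets are frequent (the Apriori pruning step).
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == size and all(
                frozenset(subset) in frequent
                for subset in combinations(union, size - 1)
            ):
                candidates.add(union)
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        size += 1
    return result


# Hypothetical baskets, for illustration only.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "milk"},
    {"diapers", "milk"},
    {"beer", "diapers", "milk"},
]
result = apriori(baskets, min_count=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Association rules are then derived from the frequent item sets by checking, for each split of an item set into premise and consequence, whether the confidence reaches the chosen minimum.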
• TOTAL NUMBER OF RULES
OUTPUT:
=== Run information ===
Scheme: weka.associations.Apriori -I -R -N 200 -T 0 -C 0.1 -D 0.05 -U 1.0 -M 0.1 -S
1.0 -V -c -1
Relation: income-weka.filters.unsupervised.attribute.Remove-R6,8-12,14-
weka.filters.unsupervised.attribute.Remove-R3-
weka.filters.unsupervised.attribute.Remove-R5-
weka.filters.unsupervised.attribute.Remove-R3-
weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-last
Instances: 32561
Attributes: 5
age
workclass
education-num
hours-per-week
salary
=== Associator model (full training set) ===
Apriori
=======
Minimum support: 0.1 (3256 instances)
Minimum metric <confidence>: 0.1
Significance level: 1
Number of cycles performed: 14
Generated sets of large itemsets:
Size of set of large itemsetsL(1): 12
Large ItemsetsL(1):
age='(-inf-24.3]' 5570
age='(24.3-31.6]' 5890
age='(31.6-38.9]' 6048
age='(38.9-46.2]' 6163
age='(46.2-53.5]' 3967
workclass= Private 22696
education-num='(8.5-10]' 17792
education-num='(11.5-13]' 6422
hours-per-week='(30.4-40.2]' 17735
hours-per-week='(40.2-50]' 5938
salary= <=50K 24720
salary= >50K 7841
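The interval labels in the item sets, such as age='(-inf-24.3]', are consistent with equal-width binning into 10 bins, as requested by the -B10 option of the Discretize filter in the relation name. The sketch below reproduces the age cut points under the assumption that ages in the data range from 17 to 90. This range is not stated in this report, and the function names are hypothetical.

```python
def equal_width_cut_points(minimum, maximum, bins=10):
    """Return the bins-1 interior cut points of an equal-width binning."""
    width = (maximum - minimum) / bins
    return [round(minimum + i * width, 6) for i in range(1, bins)]


def bin_label(value, cuts):
    """Return an interval label in the style used in the output above."""
    lower = "-inf"
    for cut in cuts:
        if value <= cut:
            return f"({lower}-{cut:g}]"
        lower = f"{cut:g}"
    return f"({lower}-inf)"


# Assumed range of the age attribute (not given in the report).
age_cuts = equal_width_cut_points(17, 90, bins=10)
print(age_cuts[:5])
print(bin_label(20, age_cuts), bin_label(30, age_cuts))
```

Equal-width bins are simple but can be very unevenly filled, which is why only some of the ten age intervals reach the minimum support and appear in the item set list.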
Size of set of large itemsetsL(2): 25
Large ItemsetsL(2):
age='(-inf-24.3]' workclass= Private 4404
age='(-inf-24.3]' education-num='(8.5-10]' 3637
age='(-inf-24.3]' salary= <=50K 5509
age='(24.3-31.6]' workclass= Private 4617
age='(24.3-31.6]' hours-per-week='(30.4-40.2]' 3517
age='(24.3-31.6]' salary= <=50K 5086
age='(31.6-38.9]' workclass= Private 4426
age='(31.6-38.9]' education-num='(8.5-10]' 3283
age='(31.6-38.9]' hours-per-week='(30.4-40.2]' 3371
age='(31.6-38.9]' salary= <=50K 4371
age='(38.9-46.2]' workclass= Private 4127
age='(38.9-46.2]' hours-per-week='(30.4-40.2]' 3481
age='(38.9-46.2]' salary= <=50K 3934
workclass= Private education-num='(8.5-10]' 12874
workclass= Private education-num='(11.5-13]' 4280
workclass= Private hours-per-week='(30.4-40.2]' 12849
workclass= Private hours-per-week='(40.2-50]' 4270
workclass= Private salary= <=50K 17733
workclass= Private salary= >50K 4963
education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 10177
education-num='(8.5-10]' salary= <=50K 14730
education-num='(11.5-13]' salary= <=50K 3936
hours-per-week='(30.4-40.2]' salary= <=50K 14103
hours-per-week='(30.4-40.2]' salary= >50K 3632
hours-per-week='(40.2-50]' salary= <=50K 3586
Size of set of large itemsetsL(3): 7
Large ItemsetsL(3):
age='(-inf-24.3]' workclass= Private salary= <=50K 4360
age='(-inf-24.3]' education-num='(8.5-10]' salary= <=50K 3605
age='(24.3-31.6]' workclass= Private salary= <=50K 4023
workclass= Private education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 7597
workclass= Private education-num='(8.5-10]' salary= <=50K 10832
workclass= Private hours-per-week='(30.4-40.2]' salary= <=50K 10524
education-num='(8.5-10]' hours-per-week='(30.4-40.2]' salary= <=50K 8574
Size of set of large itemsetsL(4): 1
Large ItemsetsL(4):
workclass= Private education-num='(8.5-10]' hours-per-week='(30.4-40.2]' salary= <=50K 6513
BEST RULES FOUND:
1. age='(-inf-24.3]' education-num='(8.5-10]' 3637 ==> salary= <=50K 3605 conf:(0.99)
lift:(1.31) lev:(0.03) [843] conv:(26.54)
2. age='(-inf-24.3]' workclass= Private 4404 ==> salary= <=50K 4360 conf:(0.99) lift:(1.3)
lev:(0.03) [1016] conv:(23.57)
3. age='(-inf-24.3]' 5570 ==> salary= <=50K 5509 conf:(0.99) lift:(1.3) lev:(0.04) [1280]
conv:(21.63)
4. age='(24.3-31.6]' workclass= Private 4617 ==> salary= <=50K 4023 conf:(0.87) lift:(1.15)
lev:(0.02) [517] conv:(1.87)
5. age='(24.3-31.6]' 5890 ==> salary= <=50K 5086 conf:(0.86) lift:(1.14) lev:(0.02) [614]
conv:(1.76)
6. workclass= Private education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 7597 ==>
salary= <=50K 6513 conf:(0.86) lift:(1.13) lev:(0.02) [745] conv:(1.69)
7. education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 10177 ==> salary= <=50K 8574
conf:(0.84) lift:(1.11) lev:(0.03) [847] conv:(1.53)
8. workclass= Private education-num='(8.5-10]' 12874 ==> salary= <=50K 10832 conf:(0.84)
lift:(1.11) lev:(0.03) [1058] conv:(1.52)
9. education-num='(8.5-10]' 17792 ==> salary= <=50K 14730 conf:(0.83) lift:(1.09) lev:(0.04)
[1222] conv:(1.4)
10. workclass= Private hours-per-week='(30.4-40.2]' 12849 ==> salary= <=50K 10524
conf:(0.82) lift:(1.08) lev:(0.02) [769] conv:(1.33)
11. hours-per-week='(30.4-40.2]' 17735 ==> salary= <=50K 14103 conf:(0.8) lift:(1.05)
lev:(0.02) [638] conv:(1.18)
12. age='(-inf-24.3]' salary= <=50K 5509 ==>workclass= Private 4360 conf:(0.79) lift:(1.14)
lev:(0.02) [520] conv:(1.45)
13. age='(24.3-31.6]' salary= <=50K 5086 ==>workclass= Private 4023 conf:(0.79) lift:(1.13)
lev:(0.01) [477] conv:(1.45)
14. age='(-inf-24.3]' 5570 ==>workclass= Private 4404 conf:(0.79) lift:(1.13) lev:(0.02) [521]
conv:(1.45)
15. age='(24.3-31.6]' 5890 ==>workclass= Private 4617 conf:(0.78) lift:(1.12) lev:(0.02) [511]
conv:(1.4)
16. age='(-inf-24.3]' 5570 ==>workclass= Private salary= <=50K 4360 conf:(0.78) lift:(1.44)
lev:(0.04) [1326] conv:(2.09)
17. workclass= Private 22696 ==> salary= <=50K 17733 conf:(0.78) lift:(1.03) lev:(0.02)
[502] conv:(1.1)
18. education-num='(8.5-10]' hours-per-week='(30.4-40.2]' salary= <=50K 8574
==>workclass= Private 6513 conf:(0.76) lift:(1.09) lev:(0.02) [536] conv:(1.26)
19. education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 10177 ==>workclass= Private 7597
conf:(0.75) lift:(1.07) lev:(0.02) [503] conv:(1.19)
20. hours-per-week='(30.4-40.2]' salary= <=50K 14103 ==>workclass= Private 10524
conf:(0.75) lift:(1.07) lev:(0.02) [693] conv:(1.19)
21. education-num='(8.5-10]' salary= <=50K 14730 ==>workclass= Private 10832 conf:(0.74)
lift:(1.06) lev:(0.02) [564] conv:(1.14)
22. age='(31.6-38.9]' 6048 ==>workclass= Private 4426 conf:(0.73) lift:(1.05) lev:(0.01) [210]
conv:(1.13)
23. hours-per-week='(30.4-40.2]' 17735 ==>workclass= Private 12849 conf:(0.72) lift:(1.04)
lev:(0.01) [487] conv:(1.1)
24. education-num='(8.5-10]' 17792 ==>workclass= Private 12874 conf:(0.72) lift:(1.04)
lev:(0.01) [472] conv:(1.1)
25. age='(31.6-38.9]' 6048 ==> salary= <=50K 4371 conf:(0.72) lift:(0.95) lev:(-0.01) [-220]
conv:(0.87)
26. hours-per-week='(40.2-50]' 5938 ==>workclass= Private 4270 conf:(0.72) lift:(1.03)
lev:(0) [131] conv:(1.08)
27. salary= <=50K 24720 ==>workclass= Private 17733 conf:(0.72) lift:(1.03) lev:(0.02) [502]
conv:(1.07)
28. age='(24.3-31.6]' 5890 ==>workclass= Private salary= <=50K 4023 conf:(0.68) lift:(1.25)
lev:(0.03) [815] conv:(1.44)
29. age='(38.9-46.2]' 6163 ==>workclass= Private 4127 conf:(0.67) lift:(0.96) lev:(-0.01)
[-168] conv:(0.92)
30. education-num='(11.5-13]' 6422 ==>workclass= Private 4280 conf:(0.67) lift:(0.96)
lev:(-0.01) [-196] conv:(0.91)
31. age='(-inf-24.3]' salary= <=50K 5509 ==> education-num='(8.5-10]' 3605 conf:(0.65)
lift:(1.2) lev:(0.02) [594] conv:(1.31)
32. age='(-inf-24.3]' 5570 ==> education-num='(8.5-10]' 3637 conf:(0.65) lift:(1.19) lev:(0.02)
[593] conv:(1.31)
33. age='(-inf-24.3]' 5570 ==> education-num='(8.5-10]' salary= <=50K 3605 conf:(0.65)
lift:(1.43) lev:(0.03) [1085] conv:(1.55)
34. education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 10177 ==>workclass= Private
salary= <=50K 6513 conf:(0.64) lift:(1.18) lev:(0.03) [970] conv:(1.26)
35. age='(38.9-46.2]' 6163 ==> salary= <=50K 3934 conf:(0.64) lift:(0.84) lev:(-0.02) [-744]
conv:(0.67)
36. salary= >50K 7841 ==>workclass= Private 4963 conf:(0.63) lift:(0.91) lev:(-0.02) [-502]
conv:(0.83)
37. workclass= Private hours-per-week='(30.4-40.2]' salary= <=50K 10524 ==> education-
num='(8.5-10]' 6513 conf:(0.62) lift:(1.13) lev:(0.02) [762] conv:(1.19)
38. education-num='(11.5-13]' 6422 ==> salary= <=50K 3936 conf:(0.61) lift:(0.81) lev:(-0.03)
[-939] conv:(0.62)
39. workclass= Private salary= <=50K 17733 ==> education-num='(8.5-10]' 10832
conf:(0.61) lift:(1.12) lev:(0.04) [1142] conv:(1.17)
40. education-num='(8.5-10]' 17792 ==>workclass= Private salary= <=50K 10832 conf:(0.61)
lift:(1.12) lev:(0.04) [1142] conv:(1.16)
41. hours-per-week='(30.4-40.2]' salary= <=50K 14103 ==> education-num='(8.5-10]' 8574
conf:(0.61) lift:(1.11) lev:(0.03) [867] conv:(1.16)
42. hours-per-week='(40.2-50]' 5938 ==> salary= <=50K 3586 conf:(0.6) lift:(0.8) lev:(-0.03)
[-922] conv:(0.61)
43. workclass= Private education-num='(8.5-10]' salary= <=50K 10832 ==> hours-per-
week='(30.4-40.2]' 6513 conf:(0.6) lift:(1.1) lev:(0.02) [613] conv:(1.14)
44. age='(24.3-31.6]' 5890 ==> hours-per-week='(30.4-40.2]' 3517 conf:(0.6) lift:(1.1)
lev:(0.01) [308] conv:(1.13)
45. salary= <=50K 24720 ==> education-num='(8.5-10]' 14730 conf:(0.6) lift:(1.09) lev:(0.04)
[1222] conv:(1.12)
46. workclass= Private salary= <=50K 17733 ==> hours-per-week='(30.4-40.2]' 10524
conf:(0.59) lift:(1.09) lev:(0.03) [865] conv:(1.12)
47. hours-per-week='(30.4-40.2]' 17735 ==>workclass= Private salary= <=50K 10524
conf:(0.59) lift:(1.09) lev:(0.03) [865] conv:(1.12)
48. workclass= Private hours-per-week='(30.4-40.2]' 12849 ==> education-num='(8.5-10]'
7597 conf:(0.59) lift:(1.08) lev:(0.02) [576] conv:(1.11)
49. workclass= Private education-num='(8.5-10]' 12874 ==> hours-per-week='(30.4-40.2]'
7597 conf:(0.59) lift:(1.08) lev:(0.02) [584] conv:(1.11)
50. education-num='(8.5-10]' salary= <=50K 14730 ==> hours-per-week='(30.4-40.2]' 8574
conf:(0.58) lift:(1.07) lev:(0.02) [551] conv:(1.09)
51. hours-per-week='(30.4-40.2]' 17735 ==> education-num='(8.5-10]' 10177 conf:(0.57)
lift:(1.05) lev:(0.01) [486] conv:(1.06)
52. education-num='(8.5-10]' 17792 ==> hours-per-week='(30.4-40.2]' 10177 conf:(0.57)
lift:(1.05) lev:(0.01) [486] conv:(1.06)
53. salary= <=50K 24720 ==> hours-per-week='(30.4-40.2]' 14103 conf:(0.57) lift:(1.05)
lev:(0.02) [638] conv:(1.06)
54. workclass= Private 22696 ==> education-num='(8.5-10]' 12874 conf:(0.57) lift:(1.04)
lev:(0.01) [472] conv:(1.05)
55. workclass= Private 22696 ==> hours-per-week='(30.4-40.2]' 12849 conf:(0.57) lift:(1.04)
lev:(0.01) [487] conv:(1.05)
56. age='(38.9-46.2]' 6163 ==> hours-per-week='(30.4-40.2]' 3481 conf:(0.56) lift:(1.04)
lev:(0) [124] conv:(1.05)
57. age='(31.6-38.9]' 6048 ==> hours-per-week='(30.4-40.2]' 3371 conf:(0.56) lift:(1.02)
lev:(0) [76] conv:(1.03)
58. age='(31.6-38.9]' 6048 ==> education-num='(8.5-10]' 3283 conf:(0.54) lift:(0.99) lev:(0)
[-21] conv:(0.99)
59. workclass= Private hours-per-week='(30.4-40.2]' 12849 ==> education-num='(8.5-10]'
salary= <=50K 6513 conf:(0.51) lift:(1.12) lev:(0.02) [700] conv:(1.11)
60. workclass= Private education-num='(8.5-10]' 12874 ==> hours-per-week='(30.4-40.2]'
salary= <=50K 6513 conf:(0.51) lift:(1.17) lev:(0.03) [936] conv:(1.15)
61. hours-per-week='(30.4-40.2]' 17735 ==> education-num='(8.5-10]' salary= <=50K 8574
conf:(0.48) lift:(1.07) lev:(0.02) [551] conv:(1.06)
62. education-num='(8.5-10]' 17792 ==> hours-per-week='(30.4-40.2]' salary= <=50K 8574
conf:(0.48) lift:(1.11) lev:(0.03) [867] conv:(1.09)
63. workclass= Private 22696 ==> education-num='(8.5-10]' salary= <=50K 10832
conf:(0.48) lift:(1.06) lev:(0.02) [564] conv:(1.05)
64. workclass= Private 22696 ==> hours-per-week='(30.4-40.2]' salary= <=50K 10524
conf:(0.46) lift:(1.07) lev:(0.02) [693] conv:(1.06)
65. salary= >50K 7841 ==> hours-per-week='(30.4-40.2]' 3632 conf:(0.46) lift:(0.85)
lev:(-0.02) [-638] conv:(0.85)
66. hours-per-week='(30.4-40.2]' salary= <=50K 14103 ==>workclass= Private education-
num='(8.5-10]' 6513 conf:(0.46) lift:(1.17) lev:(0.03) [936] conv:(1.12)
67. education-num='(8.5-10]' salary= <=50K 14730 ==>workclass= Private hours-per-
week='(30.4-40.2]' 6513 conf:(0.44) lift:(1.12) lev:(0.02) [700] conv:(1.09)
68. salary= <=50K 24720 ==>workclass= Private education-num='(8.5-10]' 10832 conf:(0.44)
lift:(1.11) lev:(0.03) [1058] conv:(1.08)
69. hours-per-week='(30.4-40.2]' 17735 ==>workclass= Private education-num='(8.5-10]' 7597
conf:(0.43) lift:(1.08) lev:(0.02) [584] conv:(1.06)
70. education-num='(8.5-10]' 17792 ==>workclass= Private hours-per-week='(30.4-40.2]' 7597
conf:(0.43) lift:(1.08) lev:(0.02) [576] conv:(1.06)
71. salary= <=50K 24720 ==>workclass= Private hours-per-week='(30.4-40.2]' 10524
conf:(0.43) lift:(1.08) lev:(0.02) [769] conv:(1.05)
72. workclass= Private salary= <=50K 17733 ==> education-num='(8.5-10]' hours-per-
week='(30.4-40.2]' 6513 conf:(0.37) lift:(1.18) lev:(0.03) [970] conv:(1.09)
73. hours-per-week='(30.4-40.2]' 17735 ==>workclass= Private education-num='(8.5-10]'
salary= <=50K 6513 conf:(0.37) lift:(1.1) lev:(0.02) [613] conv:(1.05)
74. education-num='(8.5-10]' 17792 ==>workclass= Private hours-per-week='(30.4-40.2]'
salary= <=50K 6513 conf:(0.37) lift:(1.13) lev:(0.02) [762] conv:(1.07)
75. salary= <=50K 24720 ==> education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 8574
conf:(0.35) lift:(1.11) lev:(0.03) [847] conv:(1.05)
76. workclass= Private 22696 ==> education-num='(8.5-10]' hours-per-week='(30.4-40.2]'
7597 conf:(0.33) lift:(1.07) lev:(0.02) [503] conv:(1.03)
77. workclass= Private 22696 ==> education-num='(8.5-10]' hours-per-week='(30.4-40.2]'
salary= <=50K 6513 conf:(0.29) lift:(1.09) lev:(0.02) [536] conv:(1.03)
78. salary= <=50K 24720 ==>workclass= Private education-num='(8.5-10]' hours-per-
week='(30.4-40.2]' 6513 conf:(0.26) lift:(1.13) lev:(0.02) [745] conv:(1.04)
79. workclass= Private salary= <=50K 17733 ==> age='(-inf-24.3]' 4360 conf:(0.25) lift:(1.44)
lev:(0.04) [1326] conv:(1.1)
80. education-num='(8.5-10]' salary= <=50K 14730 ==> age='(-inf-24.3]' 3605 conf:(0.24)
lift:(1.43) lev:(0.03) [1085] conv:(1.1)
81. workclass= Private salary= <=50K 17733 ==> age='(24.3-31.6]' 4023 conf:(0.23)
lift:(1.25) lev:(0.03) [815] conv:(1.06)
82. salary= <=50K 24720 ==> age='(-inf-24.3]' 5509 conf:(0.22) lift:(1.3) lev:(0.04) [1280]
conv:(1.07)
83. workclass= Private 22696 ==> salary= >50K 4963 conf:(0.22) lift:(0.91) lev:(-0.02) [-502]
conv:(0.97)
84. salary= <=50K 24720 ==> age='(24.3-31.6]' 5086 conf:(0.21) lift:(1.14) lev:(0.02) [614]
conv:(1.03)
85. hours-per-week='(30.4-40.2]' 17735 ==> salary= >50K 3632 conf:(0.2) lift:(0.85)
lev:(-0.02) [-638] conv:(0.95)
86. education-num='(8.5-10]' 17792 ==> age='(-inf-24.3]' 3637 conf:(0.2) lift:(1.19) lev:(0.02)
[593] conv:(1.04)
87. workclass= Private 22696 ==> age='(24.3-31.6]' 4617 conf:(0.2) lift:(1.12) lev:(0.02) [511]
conv:(1.03)
88. education-num='(8.5-10]' 17792 ==> age='(-inf-24.3]' salary= <=50K 3605 conf:(0.2)
lift:(1.2) lev:(0.02) [594] conv:(1.04)
89. hours-per-week='(30.4-40.2]' 17735 ==> age='(24.3-31.6]' 3517 conf:(0.2) lift:(1.1)
lev:(0.01) [308] conv:(1.02)
90. hours-per-week='(30.4-40.2]' 17735 ==> age='(38.9-46.2]' 3481 conf:(0.2) lift:(1.04)
lev:(0) [124] conv:(1.01)
91. workclass= Private 22696 ==> age='(31.6-38.9]' 4426 conf:(0.2) lift:(1.05) lev:(0.01) [210]
conv:(1.01)
92. workclass= Private 22696 ==> age='(-inf-24.3]' 4404 conf:(0.19) lift:(1.13) lev:(0.02) [521]
conv:(1.03)
93. workclass= Private 22696 ==> age='(-inf-24.3]' salary= <=50K 4360 conf:(0.19) lift:(1.14)
lev:(0.02) [520] conv:(1.03)
94. hours-per-week='(30.4-40.2]' 17735 ==> age='(31.6-38.9]' 3371 conf:(0.19) lift:(1.02)
lev:(0) [76] conv:(1.01)
95. workclass= Private 22696 ==> education-num='(11.5-13]' 4280 conf:(0.19) lift:(0.96)
lev:(-0.01) [-196] conv:(0.99)
96. workclass= Private 22696 ==> hours-per-week='(40.2-50]' 4270 conf:(0.19) lift:(1.03)
lev:(0) [131] conv:(1.01)
97. education-num='(8.5-10]' 17792 ==> age='(31.6-38.9]' 3283 conf:(0.18) lift:(0.99) lev:(0)
[-21] conv:(1)
98. workclass= Private 22696 ==> age='(38.9-46.2]' 4127 conf:(0.18) lift:(0.96) lev:(-0.01)
[-168] conv:(0.99)
99. workclass= Private 22696 ==> age='(24.3-31.6]' salary= <=50K 4023 conf:(0.18)
lift:(1.13) lev:(0.01) [477] conv:(1.03)
100. salary= <=50K 24720 ==> age='(31.6-38.9]' 4371 conf:(0.18) lift:(0.95) lev:(-0.01) [-220]
conv:(0.99)
101. salary= <=50K 24720 ==> age='(-inf-24.3]' workclass= Private 4360 conf:(0.18) lift:(1.3)
lev:(0.03) [1016] conv:(1.05)
102. salary= <=50K 24720 ==> age='(24.3-31.6]' workclass= Private 4023 conf:(0.16)
lift:(1.15) lev:(0.02) [517] conv:(1.02)
103. salary= <=50K 24720 ==> education-num='(11.5-13]' 3936 conf:(0.16) lift:(0.81)
lev:(-0.03) [-939] conv:(0.95)
104. salary= <=50K 24720 ==> age='(38.9-46.2]' 3934 conf:(0.16) lift:(0.84) lev:(-0.02) [-744]
conv:(0.96)
105. salary= <=50K 24720 ==> age='(-inf-24.3]' education-num='(8.5-10]' 3605 conf:(0.15)
lift:(1.31) lev:(0.03) [843] conv:(1.04)
106. salary= <=50K 24720 ==> hours-per-week='(40.2-50]' 3586 conf:(0.15) lift:(0.8)
lev:(-0.03) [-922] conv:(0.96)
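The four figures printed after each rule can be recomputed from the item set counts. The sketch below does this for rule 1, using the counts in the output above: 32561 instances, 3637 matching the premise, 24720 with salary <=50K, and 3605 matching both. Confidence, lift and leverage follow their standard definitions. For conviction, the printed values are reproduced when 1 is added to the number of counterexamples; this is inferred from the numbers in this report and should be treated as an assumption about how the figure was computed. The function name is hypothetical.

```python
def rule_metrics(n_total, n_premise, n_consequence, n_both):
    """Compute confidence, lift, leverage and conviction for premise ==> consequence."""
    confidence = n_both / n_premise
    lift = confidence / (n_consequence / n_total)
    leverage = n_both / n_total - (n_premise / n_total) * (n_consequence / n_total)
    counterexamples = n_premise - n_both
    # Assumption: 1 is added to the counterexample count, which matches the
    # printed values and avoids division by zero for rules with no exceptions.
    conviction = (n_premise * (n_total - n_consequence)
                  / (n_total * (counterexamples + 1)))
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "leverage_count": int(leverage * n_total),
        "conviction": conviction,
    }


# Rule 1: age='(-inf-24.3]' education-num='(8.5-10]' ==> salary= <=50K
rule1 = rule_metrics(n_total=32561, n_premise=3637, n_consequence=24720, n_both=3605)
for name, value in rule1.items():
    print(name, round(value, 2))
```

A lift close to 1, as for most rules in the list, means the premise barely changes the probability of the consequence. Rules with lift below 1 and negative leverage, such as rule 25, describe premises that make the consequence less likely than average.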
• CHANGED minMetric = 0.9 AND lowerBoundMinSupport = 0.1
=== Run information ===
Scheme: weka.associations.Apriori -I -R -N 200 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S 1.0 -V
-c -1
Relation: income-weka.filters.unsupervised.attribute.Remove-R6,8-12,14-
weka.filters.unsupervised.attribute.Remove-R3-weka.filters.unsupervised.attribute.Remove-R5-
weka.filters.unsupervised.attribute.Remove-R3-weka.filters.unsupervised.attribute.Discretize-
B10-M-1.0-Rfirst-last
Instances: 32561
Attributes: 5
age
workclass
education-num
hours-per-week
salary
=== Associator model (full training set) ===
Apriori
=======
Minimum support: 0.1 (3256 instances)
Minimum metric <confidence>: 0.9
Significance level: 1
Number of cycles performed: 14
Generated sets of large itemsets:
Size of set of large itemsetsL(1): 12
Large ItemsetsL(1):
age='(-inf-24.3]' 5570
age='(24.3-31.6]' 5890
age='(31.6-38.9]' 6048
age='(38.9-46.2]' 6163
age='(46.2-53.5]' 3967
workclass= Private 22696
education-num='(8.5-10]' 17792
education-num='(11.5-13]' 6422
hours-per-week='(30.4-40.2]' 17735
hours-per-week='(40.2-50]' 5938
salary= <=50K 24720
salary= >50K 7841
Size of set of large itemsetsL(2): 25
Large ItemsetsL(2):
age='(-inf-24.3]' workclass= Private 4404
age='(-inf-24.3]' education-num='(8.5-10]' 3637
age='(-inf-24.3]' salary= <=50K 5509
age='(24.3-31.6]' workclass= Private 4617
age='(24.3-31.6]' hours-per-week='(30.4-40.2]' 3517
age='(24.3-31.6]' salary= <=50K 5086
age='(31.6-38.9]' workclass= Private 4426
age='(31.6-38.9]' education-num='(8.5-10]' 3283
age='(31.6-38.9]' hours-per-week='(30.4-40.2]' 3371
age='(31.6-38.9]' salary= <=50K 4371
age='(38.9-46.2]' workclass= Private 4127
age='(38.9-46.2]' hours-per-week='(30.4-40.2]' 3481
age='(38.9-46.2]' salary= <=50K 3934
workclass= Private education-num='(8.5-10]' 12874
workclass= Private education-num='(11.5-13]' 4280
workclass= Private hours-per-week='(30.4-40.2]' 12849
workclass= Private hours-per-week='(40.2-50]' 4270
workclass= Private salary= <=50K 17733
workclass= Private salary= >50K 4963
education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 10177
education-num='(8.5-10]' salary= <=50K 14730
education-num='(11.5-13]' salary= <=50K 3936
hours-per-week='(30.4-40.2]' salary= <=50K 14103
hours-per-week='(30.4-40.2]' salary= >50K 3632
hours-per-week='(40.2-50]' salary= <=50K 3586
Size of set of large itemsetsL(3): 7
Large ItemsetsL(3):
age='(-inf-24.3]' workclass= Private salary= <=50K 4360
age='(-inf-24.3]' education-num='(8.5-10]' salary= <=50K 3605
age='(24.3-31.6]' workclass= Private salary= <=50K 4023
workclass= Private education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 7597
workclass= Private education-num='(8.5-10]' salary= <=50K 10832
workclass= Private hours-per-week='(30.4-40.2]' salary= <=50K 10524
education-num='(8.5-10]' hours-per-week='(30.4-40.2]' salary= <=50K 8574
Size of set of large itemsetsL(4): 1
Large ItemsetsL(4):
workclass= Private education-num='(8.5-10]' hours-per-week='(30.4-40.2]' salary= <=50K 6513
Best rules found:
1. age='(-inf-24.3]' education-num='(8.5-10]' 3637 ==> salary= <=50K 3605 conf:(0.99)
lift:(1.31) lev:(0.03) [843] conv:(26.54)
2. age='(-inf-24.3]' workclass= Private 4404 ==> salary= <=50K 4360 conf:(0.99) lift:(1.3)
lev:(0.03) [1016] conv:(23.57)
3. age='(-inf-24.3]' 5570 ==> salary= <=50K 5509 conf:(0.99) lift:(1.3) lev:(0.04) [1280]
conv:(21.63)
• CHANGED lowerBoundMinSupport = 0.3 AND minMetric = 0.65
=== Run information ===
Scheme: weka.associations.Apriori -I -R -N 200 -T 0 -C 0.65 -D 0.05 -U 1.0 -M 0.3 -S 1.0 -V
-c -1
Relation: income-weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-last-
weka.filters.unsupervised.attribute.Remove-R3-4,6-12,14-
weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-last
Instances: 32561
Attributes: 5
age
workclass
education-num
hours-per-week
salary
=== Associator model (full training set) ===
Apriori
=======
Minimum support: 0.3 (9768 instances)
Minimum metric <confidence>: 0.65
Significance level: 1
Number of cycles performed: 10
Generated sets of large itemsets:
Size of set of large itemsetsL(1): 4
Large ItemsetsL(1):
workclass= Private 22696
education-num='(8.5-10]' 17792
hours-per-week='(30.4-40.2]' 17735
salary= <=50K 24720
Size of set of large itemsetsL(2): 6
Large ItemsetsL(2):
workclass= Private education-num='(8.5-10]' 12874
workclass= Private hours-per-week='(30.4-40.2]' 12849
workclass= Private salary= <=50K 17733
education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 10177
education-num='(8.5-10]' salary= <=50K 14730
hours-per-week='(30.4-40.2]' salary= <=50K 14103
Size of set of large itemsetsL(3): 2
Large ItemsetsL(3):
workclass= Private education-num='(8.5-10]' salary= <=50K 10832
workclass= Private hours-per-week='(30.4-40.2]' salary= <=50K 10524
Best rules found:
1. workclass= Private education-num='(8.5-10]' 12874 ==> salary= <=50K 10832 conf:(0.84) lift:(1.11) lev:(0.03) [1058] conv:(1.52)
2. education-num='(8.5-10]' 17792 ==> salary= <=50K 14730 conf:(0.83) lift:(1.09) lev:(0.04) [1222] conv:(1.4)
3. workclass= Private hours-per-week='(30.4-40.2]' 12849 ==> salary= <=50K 10524 conf:(0.82) lift:(1.08) lev:(0.02) [769] conv:(1.33)
4. hours-per-week='(30.4-40.2]' 17735 ==> salary= <=50K 14103 conf:(0.8) lift:(1.05) lev:(0.02) [638] conv:(1.18)
5. workclass= Private 22696 ==> salary= <=50K 17733 conf:(0.78) lift:(1.03) lev:(0.02) [502] conv:(1.1)
6. hours-per-week='(30.4-40.2]' salary= <=50K 14103 ==> workclass= Private 10524 conf:(0.75) lift:(1.07) lev:(0.02) [693] conv:(1.19)
7. education-num='(8.5-10]' salary= <=50K 14730 ==> workclass= Private 10832 conf:(0.74) lift:(1.06) lev:(0.02) [564] conv:(1.14)
8. hours-per-week='(30.4-40.2]' 17735 ==> workclass= Private 12849 conf:(0.72) lift:(1.04) lev:(0.01) [487] conv:(1.1)
9. education-num='(8.5-10]' 17792 ==> workclass= Private 12874 conf:(0.72) lift:(1.04) lev:(0.01) [472] conv:(1.1)
10. salary= <=50K 24720 ==> workclass= Private 17733 conf:(0.72) lift:(1.03) lev:(0.02) [502] conv:(1.07)
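The figures reported for each rule can be checked by hand from the itemset counts. The following is a minimal sketch for rule 1, assuming the usual textbook definitions of confidence, lift, leverage and conviction. Weka's internal computation may differ in small details, and the function name rule_metrics is our own.

```python
# Counts copied from the Apriori output for rule 1.
N = 32561           # total instances
PREMISE = 12874     # workclass= Private AND education-num='(8.5-10]'
CONSEQUENT = 24720  # salary= <=50K
BOTH = 10832        # premise and consequent together

def rule_metrics(n, premise, consequent, both):
    """Return (confidence, lift, leverage, conviction), textbook definitions."""
    p_a = premise / n
    p_b = consequent / n
    p_ab = both / n
    confidence = p_ab / p_a               # P(B | A)
    lift = confidence / p_b               # how much A raises the chance of B
    leverage = p_ab - p_a * p_b           # excess joint support over independence
    conviction = (1 - p_b) / (1 - confidence)
    return confidence, lift, leverage, conviction

conf, lift, lev, conv = rule_metrics(N, PREMISE, CONSEQUENT, BOTH)
print(f"conf:({conf:.2f}) lift:({lift:.2f}) lev:({lev:.2f}) [{int(lev * N)}] conv:({conv:.2f})")
# prints: conf:(0.84) lift:(1.11) lev:(0.03) [1058] conv:(1.52)
```

A lift only slightly above 1 means the premise raises the chance of salary <=50K only a little above its base rate of about 76%.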
J48:
J48 is Weka's implementation of the C4.5 decision tree algorithm, and the trees
it generates can be used for classification. At each node it chooses an
attribute and uses its values to split the data into smaller subsets.
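To illustrate how a split is scored, here is a minimal sketch that computes the information gain of splitting on "workclass = Private" versus all other work classes. The class counts are derived from the itemset counts in the Apriori output above. C4.5 ranks splits by gain ratio, a normalized form of this quantity, so this sketch shows the idea and is not J48's exact criterion. The function names are our own.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a tuple of class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Class counts as (<=50K, >50K).
parent = (24720, 32561 - 24720)               # whole data set
private = (17733, 22696 - 17733)              # workclass = Private
other = (24720 - 17733,                       # every other workclass value
         (32561 - 22696) - (24720 - 17733))

gain = information_gain(parent, [private, other])
print(f"parent entropy = {entropy(parent):.3f} bits, gain = {gain:.4f} bits")
```

The attribute whose split gives the largest reduction in entropy is the most useful one to test at that node.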
 WITH ALL ATTRIBUTES
 SELECTED ATTRIBUTES
K-MEANS CLUSTERING:
K-means clustering is a method of vector quantization, originally from signal
processing, that is popular for cluster analysis in data mining. It aims to
partition n observations into k clusters so that each observation belongs to
the cluster with the nearest mean, which serves as a prototype of the
cluster.
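The assign-and-update loop described above can be sketched in a few lines. This is a minimal illustration of the standard (Lloyd's) k-means iteration on made-up numeric points, assuming squared Euclidean distance and a naive initialization from the first k points. It is not Weka's SimpleKMeans implementation, and the sample points are hypothetical.

```python
def kmeans(points, k, iterations=100):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    centroids = [points[i] for i in range(k)]  # naive init (assumption)
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the means are stable
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment

# Hypothetical toy data: two visibly separate groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, assignment = kmeans(points, k=2)
print(assignment)  # prints: [0, 1, 0, 1, 0, 1]
```

In the run below every attribute is discretized, so the centroid rows list one category value per attribute for each cluster, not a numeric mean.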
=== Run information ===
Scheme: weka.clusterers.SimpleKMeans -N 5 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation: income-weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-last
Instances: 32561
Attributes: 15
age
workclass
fnlwgt
education
education-num
martial-status
occupation
relationship
race
sex
capital-gain
capital-loss
hours-per-week
native-country
salary
Test mode:split 66% train, remainder test
=== Model and evaluation on training set ===
kMeans
======
Number of iterations: 4
Within cluster sum of squared errors: 146940.0
Missing values globally replaced with mean/mode
Cluster centroids:
                                    Cluster#
Attribute       Full Data           0                   1                   2                   3                   4
                (32561)             (7137)              (7920)              (9177)              (4102)              (4225)
======================================================================================================================================
age             '(38.9-46.2]'       '(46.2-53.5]'       '(-inf-24.3]'       '(38.9-46.2]'       '(24.3-31.6]'       '(24.3-31.6]'
workclass       Private             Private             Private             Private             Private             Private
fnlwgt          '(159527-306769]'   '(-inf-159527]'     '(-inf-159527]'     '(159527-306769]'   '(159527-306769]'   '(159527-306769]'
education       HS-grad             Bachelors           Some-college        HS-grad             Bachelors           HS-grad
education-num   '(8.5-10]'          '(11.5-13]'         '(8.5-10]'          '(8.5-10]'          '(11.5-13]'         '(8.5-10]'
martial-status  Married-civ-spouse  Married-civ-spouse  Never-married       Married-civ-spouse  Never-married       Never-married
occupation      Prof-specialty      Prof-specialty      Other-service       Craft-repair        Prof-specialty      Adm-clerical
relationship    Husband             Husband             Own-child           Husband             Not-in-family       Unmarried
race            White               White               White               White               White               White
sex             Male                Male                Male                Male                Female              Female
capital-gain    '(-inf-9999.9]'     '(-inf-9999.9]'     '(-inf-9999.9]'     '(-inf-9999.9]'     '(-inf-9999.9]'     '(-inf-9999.9]'
capital-loss    '(-inf-435.6]'      '(-inf-435.6]'      '(-inf-435.6]'      '(-inf-435.6]'      '(-inf-435.6]'      '(-inf-435.6]'
hours-per-week  '(30.4-40.2]'       '(30.4-40.2]'       '(30.4-40.2]'       '(30.4-40.2]'       '(30.4-40.2]'       '(30.4-40.2]'
native-country  United-States       United-States       United-States       United-States       United-States       United-States
salary          <=50K               >50K                <=50K               <=50K               <=50K               <=50K
Time taken to build model (full training data) : 3.13 seconds
=== Model and evaluation on test split ===
kMeans
======
Number of iterations: 4
Within cluster sum of squared errors: 103681.0
Missing values globally replaced with mean/mode
Cluster centroids:
                                    Cluster#
Attribute       Full Data           0                   1                   2                   3                   4
                (21490)             (11082)             (3995)              (3024)              (2584)              (805)
======================================================================================================================================
age             '(38.9-46.2]'       '(31.6-38.9]'       '(-inf-24.3]'       '(24.3-31.6]'       '(24.3-31.6]'       '(-inf-24.3]'
workclass       Private             Private             Private             Private             Private             Private
fnlwgt          '(159527-306769]'   '(159527-306769]'   '(159527-306769]'   '(159527-306769]'   '(159527-306769]'   '(159527-306769]'
education       HS-grad             HS-grad             Some-college        HS-grad             Bachelors           Some-college
education-num   '(8.5-10]'          '(8.5-10]'          '(8.5-10]'          '(8.5-10]'          '(11.5-13]'         '(8.5-10]'
martial-status  Married-civ-spouse  Married-civ-spouse  Never-married       Never-married       Never-married       Never-married
occupation      Craft-repair        Craft-repair        Other-service       Adm-clerical        Prof-specialty      Other-service
relationship    Husband             Husband             Own-child           Not-in-family       Not-in-family       Not-in-family
race            White               White               White               White               White               White
sex             Male                Male                Female              Female              Female              Female
capital-gain    '(-inf-9999.9]'     '(-inf-9999.9]'     '(-inf-9999.9]'     '(-inf-9999.9]'     '(-inf-9999.9]'     '(-inf-9999.9]'
capital-loss    '(-inf-435.6]'      '(-inf-435.6]'      '(-inf-435.6]'      '(-inf-435.6]'      '(-inf-435.6]'      '(-inf-435.6]'
hours-per-week  '(30.4-40.2]'       '(30.4-40.2]'       '(30.4-40.2]'       '(30.4-40.2]'       '(30.4-40.2]'       '(20.6-30.4]'
native-country  United-States       United-States       United-States       United-States       United-States       United-States
salary          <=50K               <=50K               <=50K               <=50K               <=50K               <=50K
Time taken to build model (percentage split) : 1.99 seconds
Clustered Instances
0 5618 ( 51%)
1 2119 ( 19%)
2 1569 ( 14%)
3 1358 ( 12%)
4 407 ( 4%)
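As a quick consistency check, the cluster sizes above should add up to the 34% of instances held out for testing. The sketch below assumes the 66% training share is rounded to the nearest whole instance, which matches the 21490 instances reported for the test-split model.

```python
total_instances = 32561
train_size = round(total_instances * 0.66)     # instances used to build the model
test_size = total_instances - train_size       # instances that get clustered

cluster_sizes = [5618, 2119, 1569, 1358, 407]  # from "Clustered Instances"
shares = [round(100 * size / test_size) for size in cluster_sizes]

print(train_size, test_size, sum(cluster_sizes), shares)
# prints: 21490 11071 11071 [51, 19, 14, 12, 4]
```

Cluster 0 alone holds about half of the held-out instances, while cluster 4 holds only about 4%.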
SAS ENTERPRISE MINER
 Cluster Analysis:
This analysis attempts to find natural groupings of observations in the data,
based on a set of input variables. After grouping the observations into clusters,
you can use the input variables to try to characterize each group. When the
clusters have been identified and interpreted, you can decide whether to treat
each cluster independently. Clustering can be formulated as a multi-objective
optimization problem. The appropriate clustering algorithm and parameter
settings (including values such as the distance function to use, a density
threshold or the number of expected clusters) depend on the individual data set
and the intended use of the results. Cluster analysis as such is not an
automatic task, but an iterative process of knowledge discovery or interactive
multi-objective optimization that involves trial and error. It will often be
necessary to modify data preprocessing and model parameters until the result
achieves the desired properties.
We built clusters to group similar records in our data set, and changed the
following properties of the cluster:
Cluster variable role = Segment
Specification Method = User Specify
Maximum Number of Clusters = 5
From the clustering segments (pie-chart), we can observe that cluster 1
contains a large part of the data set.
SAS DECISION TREES
Node Rules:
*------------------------------------------------------------*
Node = 11
*------------------------------------------------------------*
if Relationship IS ONE OF: NOT-IN-FAMILY, OWN-CHILD, UNMARRIED, OTHER-RELATIVE or MISSING
AND Hours-Per-Week >= 35.5 or MISSING
AND Capital-Gain >= 7073.59
then
Tree Node Identifier = 11
Number of Observations = 285
Predicted: Salary=>50K = 0.99
Predicted: Salary=<=50K = 0.01
*------------------------------------------------------------*
Node = 13
*------------------------------------------------------------*
if Relationship IS ONE OF: HUSBAND, WIFE
AND Education-Num < 12.5 or MISSING
AND Capital-Gain >= 5095.5
then
Tree Node Identifier = 13
Number of Observations = 522
Predicted: Salary=>50K = 0.98
Predicted: Salary=<=50K = 0.02
*------------------------------------------------------------*
Node = 15
*------------------------------------------------------------*
if Relationship IS ONE OF: HUSBAND, WIFE
AND Education-Num >= 12.5
AND Capital-Gain >= 5095.5
then
Tree Node Identifier = 15
Number of Observations = 678
Predicted: Salary=>50K = 1.00
Predicted: Salary=<=50K = 0.00
*------------------------------------------------------------*
Node = 161
*------------------------------------------------------------*
if Workclass IS ONE OF: STATE-GOV, SELF-EMP-NOT-INC
AND Relationship IS ONE OF: HUSBAND, WIFE
AND Occupation IS ONE OF: ADM-CLERICAL, EXEC-MANAGERIAL, PROF-SPECIALTY, SALES, TECH-SUPPORT, PROTECTIVE-SERV
AND Hours-Per-Week >= 37.5 or MISSING
AND Education-Num < 12.5 AND Education-Num >= 9.5 or MISSING
AND Capital-Loss < 1512 or MISSING
AND Capital-Gain < 5095.5 or MISSING
AND Age >= 33.5 or MISSING
then
Tree Node Identifier = 161
Number of Observations = 157
Predicted: Salary=>50K = 0.39
Predicted: Salary=<=50K = 0.61
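Node rules like these translate directly into code. Below is a minimal sketch that encodes the first three leaves (nodes 11, 13 and 15) as a Python function. The dictionary keys, helper names and the use of None for a missing value are our own assumptions for illustration and are not part of the SAS output.

```python
UNPARTNERED = {"NOT-IN-FAMILY", "OWN-CHILD", "UNMARRIED", "OTHER-RELATIVE"}
PARTNERED = {"HUSBAND", "WIFE"}

def at_least(value, threshold, missing_ok=False):
    """True if value >= threshold; a missing value (None) follows missing_ok."""
    return missing_ok if value is None else value >= threshold

def below(value, threshold, missing_ok=False):
    """True if value < threshold; a missing value (None) follows missing_ok."""
    return missing_ok if value is None else value < threshold

def leaf_for(record):
    """Return (node id, predicted P(salary >50K)) for nodes 11, 13, 15, else None."""
    rel = record.get("relationship")
    gain = record.get("capital_gain")
    edu = record.get("education_num")
    hours = record.get("hours_per_week")
    # Node 11: relationship in the first group "or MISSING", long hours, high gain.
    if ((rel is None or rel in UNPARTNERED)
            and at_least(hours, 35.5, missing_ok=True)
            and at_least(gain, 7073.59)):
        return 11, 0.99
    # Node 13: husband or wife, Education-Num < 12.5 "or MISSING".
    if rel in PARTNERED and below(edu, 12.5, missing_ok=True) and at_least(gain, 5095.5):
        return 13, 0.98
    # Node 15: husband or wife, Education-Num >= 12.5.
    if rel in PARTNERED and at_least(edu, 12.5) and at_least(gain, 5095.5):
        return 15, 1.00
    return None  # the record falls into a leaf not encoded here

example = {"relationship": "HUSBAND", "education_num": 13,
           "capital_gain": 6000, "hours_per_week": 40}
print(leaf_for(example))  # prints: (15, 1.0)
```

All three of these leaves require a large capital gain, which is why their predicted probability of salary >50K is close to 1.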
Fit Statistics
Target=Salary Target Label=' '
Fit Statistics   Statistics Label                   Train
_NOBS_           Sum of Frequencies              32561.00
_MISC_           Misclassification Rate              0.14
_MAX_            Maximum Absolute Error              1.00
_SSE_            Sum of Squared Errors            6368.80
_ASE_            Average Squared Error               0.10
_RASE_           Root Average Squared Error          0.31
_DIV_            Divisor for ASE                 65122.00
_DFT_            Total Degrees of Freedom        32561.00
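The error statistics in this table are related by simple arithmetic, which can be checked from the reported values. In the sketch below, reading the divisor as the number of observations times the two target levels is our interpretation of the table, not something the SAS output states.

```python
from math import sqrt

nobs = 32561.00     # _NOBS_  Sum of Frequencies
sse = 6368.80       # _SSE_   Sum of Squared Errors
divisor = 65122.00  # _DIV_   Divisor for ASE

ase = sse / divisor  # Average Squared Error
rase = sqrt(ase)     # Root Average Squared Error

print(divisor == 2 * nobs, round(ase, 2), round(rase, 2))
# prints: True 0.1 0.31
```

The rounded results agree with the _ASE_ and _RASE_ rows above.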
CONCLUSION
1. If a person is around 24 years old and the education level is 11th grade,
12th grade or some college, the income is likely to be less than 50K.
2. If a person is around 24 years old and works for a private firm, the income
is likely to be less than 50K.
3. If a person is around 24 years old, the income is likely to be less than
50K.
4. 90% of females have a salary <=50K, whereas 60% of males have a salary
<=50K.
5. 95% of the population in the Other-service occupation category have a
salary <=50K.
6. 95% of the population in the 23 to 24 age group have a salary <=50K.
We made a couple of runs of the J48 classifier and found the following 5
attributes to be important in predicting a person's income:
education number, age, salary (the class attribute), work class and
hours-per-week. These columns provide an accuracy of 80% in predicting the
income.

DM PROJECT

  • 1. SUBMITTED BY: ARUN KUMAR DASH PARAS SHAH DIVYA RAJASRI TADI NIREESHA MANDALA DATA MINING FINAL PROJECT
  • 2. DATA MINING FINAL PROJECT FALL 2014 ~ 1 ~ INTRODUCTION Data mining aims at extracting patterns and knowledge from a particular data set and converting it into an easy to understand framework that can be used in future as well. Data is not only analyzed, it is pre-processed, complexities in the data are taken into consideration and the structures discovered are post- processed. Data mining is done on large quantities of data in order to extract unknown patterns like groups of data records known as cluster analysis or to identify exceptional records known as anomaly detection. These patterns can then be used in machine learning and predictive analysis. All the steps like data collection, data preparation, result interpretation and reporting though a part of data mining belong to the knowledge discovery in databases process. Data mining is being used in almost every field from business, science and engineering, medical, visual, music, sensor, temporal and spatial to name a few. The ways in which data mining can be used can in some cases and contexts raise questions regarding privacy, legality, and ethics. As data mining needs data preparation which can uncover patterns that can compromise confidentiality. And this mainly happens during data aggregation where data is combined together for analysis purpose which might affect the privacy of individual data. We used a census data set to predict whether income exceeds $50K/yr based on census data. Prediction task is to determine whether a person makes over 50K a year.
  • 3. DATA MINING FINAL PROJECT FALL 2014 ~ 2 ~ Number of instances : 32561 Number of attributes : 15 Age, fnlwgt, workclass, education, education-num, marital-status,occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week,native-country, salary Attribute Information: Listing of attributes: Salary: >50K, <=50K. age: continuous. workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. fnlwgt: continuous. education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. education-num: continuous. marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming- fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  • 4. DATA MINING FINAL PROJECT FALL 2014 ~ 3 ~ relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. DATA PRE-PROCESSING: The first step in data mining is to choose a data set with large number of records so that unknown patterns can be identified. The data set should not be too large that it consumes a lot of time to come up with a pattern. So, basically selecting an appropriate data set is very important. Data sets are usually chosen from a data warehouse. The main of data pre-processing is to evaluate multivariate data sets prior to data mining. The data set is cleaned to remove the records containing noise and the ones with missing data. DATA MINING: It is broadly segregated into six categories of tasks- Anomaly detection This phase involves identifying records that are unusual like outliers, any deviations or errors. These data records have to be further investigated. Association rule learning This phase is the one where the model searches for any relationships among the different variables. The very common example of it can be the beer diaper, where on Fridays customers buy beer along with diapers. Based on this
  • 5. DATA MINING FINAL PROJECT FALL 2014 ~ 4 ~ association, the store can put both these items together to increase the sales which is also known as market based analysis or dependency modelling. Clustering: Determining groups in the data which are similar in some aspect without the use of previously known structures in data. Classification: This phase generalizes the known structures to apply the new data. As an example an e-mail program that attempts to classify e-mails as spam or genuine. Regression: This phase tries to phase a function which models the data with the least amount of errors. Summarization: This phase provides a more compact view of the data set including the visualization and report generation. Validation of results: Data mining can also give results which seem significant but cannot be used to predict future behaviors and cannot be reproduced on a new sample of data. These results are derived from improper statistical hypothesis testing. A simple
  • 6. DATA MINING FINAL PROJECT FALL 2014 ~ 5 ~ version of this problem is known as over fitting, where data mining algorithms find patterns in the training set which are not present in the general data set. WEKA (WAIKATO ENVIRONMENT FOR KNOWLEDGE ANALYSIS) Weka consists of a collection of visualization tools and algorithms for data analysis and predictive modeling, along with graphical user interfaces for easy access to this functionality. It is basically used for educational purposes and research. Weka is: Freely available General Public License Portable as it completely implemented in Java and can be run on any computing platform An Extensive collection of data preprocessing and modeling techniques Has a GUI that makes it easy to use Weka supports various data mining tasks like data preprocessing, clustering, classification, regression, visualization, and feature selection. Weka's techniques are predicated on the assumption that data is available as a single flat file or relation, where each data point is expressed by a fixed number of attributes. Weka provides access to SQL databases using Java Database Connectivity and can process the result returned by a database query.
  • 7. NAÏVE BAYES SIMPLE
    All attributes: We ran Naïve Bayes Simple on all 15 attributes in our data set and obtained the following results.
  • 8. The accuracy of the predictive model is 83.69%.
  • 9. Few attributes: We considered six attributes, namely age, workclass, education, education-num, hours-per-week and salary, and obtained the following results.
  • 10. The accuracy of the predictive model is 79.20%.
  • 11. The following results are obtained on removing the education attribute from the data set.
  • 12. The accuracy of the predictive model is now 80.00%, compared to 79.20% when the education attribute was present. This suggests that education contributes little here, plausibly because education-num already encodes the same information numerically, and that removing it from the data set yields a slightly more accurate model.
  • 13. === Run information ===
    Scheme:       weka.classifiers.bayes.NaiveBayesSimple
    Relation:     income-weka.filters.unsupervised.attribute.Remove-R6,8-12,14-weka.filters.unsupervised.attribute.Remove-R3-weka.filters.unsupervised.attribute.Remove-R5-weka.filters.unsupervised.attribute.Remove-R3
    Instances:    32561
    Attributes:   5
                  age
                  workclass
                  education-num
                  hours-per-week
                  salary
    Test mode:    split 66.0% train, remainder test
    === Classifier model (full training set) ===
    Naive Bayes (simple)
    Class <=50K: P(C) = 0.75917452
    Attribute age
        Mean: 36.78373786    Standard Deviation: 14.02008849
    Attribute workclass
        State-gov           0.03825468
        Self-emp-not-inc    0.07351692
        Private             0.71713373
        Federal-gov         0.02385863
        Local-gov           0.05972745
        ?                   0.06656153
        Self-emp-inc        0.02001698
        Without-pay         0.00060658
        Never-worked        0.00032351
  • 14. Attribute education-num
        Mean: 9.59506472     Standard Deviation: 2.43614679
    Attribute hours-per-week
        Mean: 38.84021036    Standard Deviation: 12.31899464
    Class >50K: P(C) = 0.24082548
    Attribute age
        Mean: 44.24984058    Standard Deviation: 10.51902772
    Attribute workclass
        State-gov           0.04509554
        Self-emp-not-inc    0.09235669
        Private             0.63235669
        Federal-gov         0.04738854
        Local-gov           0.07872611
        ?                   0.0244586
        Self-emp-inc        0.07936306
        Without-pay         0.00012739
        Never-worked        0.00012739
    Attribute education-num
        Mean: 11.61165668    Standard Deviation: 2.38512863
    Attribute hours-per-week
        Mean: 45.4730264     Standard Deviation: 11.01297093
    Time taken to build model: 0.06 seconds
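To illustrate how a model like the one above turns its parameters into a prediction, here is a minimal sketch of the Naive Bayes decision rule with normal densities for the numeric attributes. The priors, means and standard deviations are copied from the Weka output; the function names are hypothetical, and the nominal workclass attribute is left out for brevity, so the numbers will not match Weka's own predictions exactly.

```python
import math

# Parameters copied from the Weka output above (numeric attributes only;
# omitting workclass is a simplification made for this sketch).
PRIORS = {"<=50K": 0.75917452, ">50K": 0.24082548}
GAUSSIANS = {
    "<=50K": {
        "age": (36.78373786, 14.02008849),
        "education-num": (9.59506472, 2.43614679),
        "hours-per-week": (38.84021036, 12.31899464),
    },
    ">50K": {
        "age": (44.24984058, 10.51902772),
        "education-num": (11.61165668, 2.38512863),
        "hours-per-week": (45.4730264, 11.01297093),
    },
}


def log_normal_pdf(x, mean, std):
    """Log of the normal density with the given mean and standard deviation."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def posterior(record):
    """Return P(class | record), treating attributes as independent given the class."""
    log_scores = {}
    for label, prior in PRIORS.items():
        score = math.log(prior)
        for attribute, value in record.items():
            mean, std = GAUSSIANS[label][attribute]
            score += log_normal_pdf(value, mean, std)
        log_scores[label] = score
    # Normalize in a numerically stable way.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


def predict(record):
    """Return the class with the highest posterior probability."""
    probs = posterior(record)
    return max(probs, key=probs.get)


print(predict({"age": 50, "education-num": 14, "hours-per-week": 50}))
print(predict({"age": 22, "education-num": 9, "hours-per-week": 30}))
```

Working in log space avoids underflow when many small densities are multiplied, which is why the sketch sums logarithms and only exponentiates at the end.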
  • 15. === Evaluation on test split ===
    === Summary ===
    Correctly Classified Instances        8857        80.0018 %
    Incorrectly Classified Instances      2214        19.9982 %
    Kappa statistic                       0.3666
    Mean absolute error                   0.2737
    Root mean squared error               0.3704
    Relative absolute error               75.1356 %
    Root relative squared error           87.3416 %
    Total Number of Instances             11071
    === Detailed Accuracy By Class ===
                     TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                     0.923     0.601     0.833       0.923    0.876       0.82       <=50K
                     0.399     0.077     0.615       0.399    0.484       0.82       >50K
    Weighted Avg.    0.8       0.478     0.782       0.8      0.784       0.82
    === Confusion Matrix ===
        a      b   <-- classified as
     7820    650 |  a = <=50K
     1564   1037 |  b = >50K
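The summary figures above follow directly from the confusion matrix. The sketch below recomputes accuracy, per-class precision and recall, and the kappa statistic from the four counts in the matrix; the function names are our own and are not part of Weka.

```python
# Confusion matrix from the Weka output: rows are actual classes,
# columns are predicted classes (index 0 is <=50K, index 1 is >50K).
matrix = [
    [7820, 650],   # actual <=50K
    [1564, 1037],  # actual >50K
]


def accuracy(m):
    """Fraction of instances on the diagonal."""
    total = sum(sum(row) for row in m)
    correct = sum(m[i][i] for i in range(len(m)))
    return correct / total


def precision(m, k):
    """Correct predictions of class k divided by all predictions of class k."""
    return m[k][k] / sum(row[k] for row in m)


def recall(m, k):
    """Correct predictions of class k divided by all actual members of class k."""
    return m[k][k] / sum(m[k])


def kappa(m):
    """Cohen's kappa: agreement beyond what the class marginals give by chance."""
    total = sum(sum(row) for row in m)
    observed = accuracy(m)
    expected = sum(
        sum(m[k]) * sum(row[k] for row in m) for k in range(len(m))
    ) / total ** 2
    return (observed - expected) / (1 - expected)


print(f"accuracy  = {accuracy(matrix):.4f}")
print(f"precision = {precision(matrix, 0):.3f} (<=50K), {precision(matrix, 1):.3f} (>50K)")
print(f"recall    = {recall(matrix, 0):.3f} (<=50K), {recall(matrix, 1):.3f} (>50K)")
print(f"kappa     = {kappa(matrix):.4f}")
```

The low recall for the >50K class (about 0.40) shows that the 80% accuracy is driven largely by the majority class, which is also what the modest kappa value reflects.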
  • 16. APRIORI: Apriori is an algorithm for frequent itemset mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger itemsets as long as those itemsets appear sufficiently often in the database.
    TOTAL NUMBER OF RULES
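The level-wise search described above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not Weka's implementation; the function name, the toy transactions and the support threshold are all made up for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    min_count = min_support * len(transactions)

    # Level 1: count the individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join step: combine frequent (k-1)-itemsets into candidates of size k.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset of a candidate must itself be frequent.
            if all(frozenset(sub) in frequent for sub in combinations(union, k - 1)):
                candidates.add(union)
        # Count the surviving candidates against the transactions.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result


# Toy market-basket data (hypothetical).
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]
for itemset, count in sorted(apriori(baskets, 0.5).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step is what keeps the search tractable: an itemset can only be frequent if all of its subsets are frequent, so most candidates are discarded before any counting is done.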
  • 17. OUTPUT:
    === Run information ===
    Scheme:       weka.associations.Apriori -I -R -N 200 -T 0 -C 0.1 -D 0.05 -U 1.0 -M 0.1 -S 1.0 -V -c -1
    Relation:     income-weka.filters.unsupervised.attribute.Remove-R6,8-12,14-weka.filters.unsupervised.attribute.Remove-R3-weka.filters.unsupervised.attribute.Remove-R5-weka.filters.unsupervised.attribute.Remove-R3-weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-last
    Instances:    32561
    Attributes:   5
                  age
                  workclass
                  education-num
                  hours-per-week
                  salary
    === Associator model (full training set) ===
    Apriori
    =======
    Minimum support: 0.1 (3256 instances)
    Minimum metric <confidence>: 0.1
    Significance level: 1
    Number of cycles performed: 14
  • 18. Generated sets of large itemsets:
    Size of set of large itemsets L(1): 12
    Large Itemsets L(1):
    age='(-inf-24.3]' 5570
    age='(24.3-31.6]' 5890
    age='(31.6-38.9]' 6048
    age='(38.9-46.2]' 6163
    age='(46.2-53.5]' 3967
    workclass= Private 22696
    education-num='(8.5-10]' 17792
    education-num='(11.5-13]' 6422
    hours-per-week='(30.4-40.2]' 17735
    hours-per-week='(40.2-50]' 5938
    salary= <=50K 24720
    salary= >50K 7841
  • 19. Size of set of large itemsets L(2): 25
    Large Itemsets L(2):
    age='(-inf-24.3]' workclass= Private 4404
    age='(-inf-24.3]' education-num='(8.5-10]' 3637
    age='(-inf-24.3]' salary= <=50K 5509
    age='(24.3-31.6]' workclass= Private 4617
    age='(24.3-31.6]' hours-per-week='(30.4-40.2]' 3517
    age='(24.3-31.6]' salary= <=50K 5086
    age='(31.6-38.9]' workclass= Private 4426
    age='(31.6-38.9]' education-num='(8.5-10]' 3283
    age='(31.6-38.9]' hours-per-week='(30.4-40.2]' 3371
    age='(31.6-38.9]' salary= <=50K 4371
    age='(38.9-46.2]' workclass= Private 4127
    age='(38.9-46.2]' hours-per-week='(30.4-40.2]' 3481
    age='(38.9-46.2]' salary= <=50K 3934
    workclass= Private education-num='(8.5-10]' 12874
    workclass= Private education-num='(11.5-13]' 4280
    workclass= Private hours-per-week='(30.4-40.2]' 12849
    workclass= Private hours-per-week='(40.2-50]' 4270
    workclass= Private salary= <=50K 17733
    workclass= Private salary= >50K 4963
    education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 10177
    education-num='(8.5-10]' salary= <=50K 14730
    education-num='(11.5-13]' salary= <=50K 3936
    hours-per-week='(30.4-40.2]' salary= <=50K 14103
    hours-per-week='(30.4-40.2]' salary= >50K 3632
    hours-per-week='(40.2-50]' salary= <=50K 3586
  • 20. Size of set of large itemsets L(3): 7
    Large Itemsets L(3):
    age='(-inf-24.3]' workclass= Private salary= <=50K 4360
    age='(-inf-24.3]' education-num='(8.5-10]' salary= <=50K 3605
    age='(24.3-31.6]' workclass= Private salary= <=50K 4023
    workclass= Private education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 7597
    workclass= Private education-num='(8.5-10]' salary= <=50K 10832
    workclass= Private hours-per-week='(30.4-40.2]' salary= <=50K 10524
    education-num='(8.5-10]' hours-per-week='(30.4-40.2]' salary= <=50K 8574
    Size of set of large itemsets L(4): 1
    Large Itemsets L(4):
    workclass= Private education-num='(8.5-10]' hours-per-week='(30.4-40.2]' salary= <=50K 6513
  • 21. DATA MINING FINAL PROJECT FALL 2014 ~ 20 ~ BEST RULES FOUND: 1. age='(-inf-24.3]' education-num='(8.5-10]' 3637 ==> salary= <=50K 3605 conf:(0.99) lift:(1.31) lev:(0.03) [843] conv:(26.54) 2. age='(-inf-24.3]' workclass= Private 4404 ==> salary= <=50K 4360 conf:(0.99) lift:(1.3) lev:(0.03) [1016] conv:(23.57) 3. age='(-inf-24.3]' 5570 ==> salary= <=50K 5509 conf:(0.99) lift:(1.3) lev:(0.04) [1280] conv:(21.63) 4. age='(24.3-31.6]' workclass= Private 4617 ==> salary= <=50K 4023 conf:(0.87) lift:(1.15) lev:(0.02) [517] conv:(1.87) 5. age='(24.3-31.6]' 5890 ==> salary= <=50K 5086 conf:(0.86) lift:(1.14) lev:(0.02) [614] conv:(1.76) 6. workclass= Private education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 7597 ==> salary= <=50K 6513 conf:(0.86) lift:(1.13) lev:(0.02) [745] conv:(1.69) 7. education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 10177 ==> salary= <=50K 8574 conf:(0.84) lift:(1.11) lev:(0.03) [847] conv:(1.53) 8. workclass= Private education-num='(8.5-10]' 12874 ==> salary= <=50K 10832 conf:(0.84) lift:(1.11) lev:(0.03) [1058] conv:(1.52) 9. education-num='(8.5-10]' 17792 ==> salary= <=50K 14730 conf:(0.83) lift:(1.09) lev:(0.04) [1222] conv:(1.4) 10. workclass= Private hours-per-week='(30.4-40.2]' 12849 ==> salary= <=50K 10524 conf:(0.82) lift:(1.08) lev:(0.02) [769] conv:(1.33) 11. hours-per-week='(30.4-40.2]' 17735 ==> salary= <=50K 14103 conf:(0.8) lift:(1.05) lev:(0.02) [638] conv:(1.18) 12. age='(-inf-24.3]' salary= <=50K 5509 ==>workclass= Private 4360 conf:(0.79) lift:(1.14) lev:(0.02) [520] conv:(1.45) 13. age='(24.3-31.6]' salary= <=50K 5086 ==>workclass= Private 4023 conf:(0.79) lift:(1.13) lev:(0.01) [477] conv:(1.45) 14. age='(-inf-24.3]' 5570 ==>workclass= Private 4404 conf:(0.79) lift:(1.13) lev:(0.02) [521] conv:(1.45)
  • 22. DATA MINING FINAL PROJECT FALL 2014 ~ 21 ~ 15. age='(24.3-31.6]' 5890 ==>workclass= Private 4617 conf:(0.78) lift:(1.12) lev:(0.02) [511] conv:(1.4) 16. age='(-inf-24.3]' 5570 ==>workclass= Private salary= <=50K 4360 conf:(0.78) lift:(1.44) lev:(0.04) [1326] conv:(2.09) 17. workclass= Private 22696 ==> salary= <=50K 17733 conf:(0.78) lift:(1.03) lev:(0.02) [502] conv:(1.1) 18. education-num='(8.5-10]' hours-per-week='(30.4-40.2]' salary= <=50K 8574 ==>workclass= Private 6513 conf:(0.76) lift:(1.09) lev:(0.02) [536] conv:(1.26) 19. education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 10177 ==>workclass= Private 7597 conf:(0.75) lift:(1.07) lev:(0.02) [503] conv:(1.19) 20. hours-per-week='(30.4-40.2]' salary= <=50K 14103 ==>workclass= Private 10524 conf:(0.75) lift:(1.07) lev:(0.02) [693] conv:(1.19) 21. education-num='(8.5-10]' salary= <=50K 14730 ==>workclass= Private 10832 conf:(0.74) lift:(1.06) lev:(0.02) [564] conv:(1.14) 22. age='(31.6-38.9]' 6048 ==>workclass= Private 4426 conf:(0.73) lift:(1.05) lev:(0.01) [210] conv:(1.13) 23. hours-per-week='(30.4-40.2]' 17735 ==>workclass= Private 12849 conf:(0.72) lift:(1.04) lev:(0.01) [487] conv:(1.1) 24. education-num='(8.5-10]' 17792 ==>workclass= Private 12874 conf:(0.72) lift:(1.04) lev:(0.01) [472] conv:(1.1) 25. age='(31.6-38.9]' 6048 ==> salary= <=50K 4371 conf:(0.72) lift:(0.95) lev:(-0.01) [-220] conv:(0.87) 26. hours-per-week='(40.2-50]' 5938 ==>workclass= Private 4270 conf:(0.72) lift:(1.03) lev:(0) [131] conv:(1.08) 27. salary= <=50K 24720 ==>workclass= Private 17733 conf:(0.72) lift:(1.03) lev:(0.02) [502] conv:(1.07) 28. age='(24.3-31.6]' 5890 ==>workclass= Private salary= <=50K 4023 conf:(0.68) lift:(1.25) lev:(0.03) [815] conv:(1.44)
  • 23. DATA MINING FINAL PROJECT FALL 2014 ~ 22 ~ 29. age='(38.9-46.2]' 6163 ==>workclass= Private 4127 conf:(0.67) lift:(0.96) lev:(-0.01) [- 168] conv:(0.92) 30. education-num='(11.5-13]' 6422 ==>workclass= Private 4280 conf:(0.67) lift:(0.96) lev:(- 0.01) [-196] conv:(0.91) 31. age='(-inf-24.3]' salary= <=50K 5509 ==> education-num='(8.5-10]' 3605 conf:(0.65) lift:(1.2) lev:(0.02) [594] conv:(1.31) 32. age='(-inf-24.3]' 5570 ==> education-num='(8.5-10]' 3637 conf:(0.65) lift:(1.19) lev:(0.02) [593] conv:(1.31) 33. age='(-inf-24.3]' 5570 ==> education-num='(8.5-10]' salary= <=50K 3605 conf:(0.65) lift:(1.43) lev:(0.03) [1085] conv:(1.55) 34. education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 10177 ==>workclass= Private salary= <=50K 6513 conf:(0.64) lift:(1.18) lev:(0.03) [970] conv:(1.26) 35. age='(38.9-46.2]' 6163 ==> salary= <=50K 3934 conf:(0.64) lift:(0.84) lev:(-0.02) [-744] conv:(0.67) 36. salary= >50K 7841 ==>workclass= Private 4963 conf:(0.63) lift:(0.91) lev:(-0.02) [-502] conv:(0.83) 37. workclass= Private hours-per-week='(30.4-40.2]' salary= <=50K 10524 ==> education- num='(8.5-10]' 6513 conf:(0.62) lift:(1.13) lev:(0.02) [762] conv:(1.19) 38. education-num='(11.5-13]' 6422 ==> salary= <=50K 3936 conf:(0.61) lift:(0.81) lev:(-0.03) [-939] conv:(0.62) 39. workclass= Private salary= <=50K 17733 ==> education-num='(8.5-10]' 10832 conf:(0.61) lift:(1.12) lev:(0.04) [1142] conv:(1.17) 40. education-num='(8.5-10]' 17792 ==>workclass= Private salary= <=50K 10832 conf:(0.61) lift:(1.12) lev:(0.04) [1142] conv:(1.16) 41. hours-per-week='(30.4-40.2]' salary= <=50K 14103 ==> education-num='(8.5-10]' 8574 conf:(0.61) lift:(1.11) lev:(0.03) [867] conv:(1.16) 42. hours-per-week='(40.2-50]' 5938 ==> salary= <=50K 3586 conf:(0.6) lift:(0.8) lev:(-0.03) [-922] conv:(0.61)
  • 24. DATA MINING FINAL PROJECT FALL 2014 ~ 23 ~ 43. workclass= Private education-num='(8.5-10]' salary= <=50K 10832 ==> hours-per- week='(30.4-40.2]' 6513 conf:(0.6) lift:(1.1) lev:(0.02) [613] conv:(1.14) 44. age='(24.3-31.6]' 5890 ==> hours-per-week='(30.4-40.2]' 3517 conf:(0.6) lift:(1.1) lev:(0.01) [308] conv:(1.13) 45. salary= <=50K 24720 ==> education-num='(8.5-10]' 14730 conf:(0.6) lift:(1.09) lev:(0.04) [1222] conv:(1.12) 46. workclass= Private salary= <=50K 17733 ==> hours-per-week='(30.4-40.2]' 10524 conf:(0.59) lift:(1.09) lev:(0.03) [865] conv:(1.12) 47. hours-per-week='(30.4-40.2]' 17735 ==>workclass= Private salary= <=50K 10524 conf:(0.59) lift:(1.09) lev:(0.03) [865] conv:(1.12) 48. workclass= Private hours-per-week='(30.4-40.2]' 12849 ==> education-num='(8.5-10]' 7597 conf:(0.59) lift:(1.08) lev:(0.02) [576] conv:(1.11) 49. workclass= Private education-num='(8.5-10]' 12874 ==> hours-per-week='(30.4-40.2]' 7597 conf:(0.59) lift:(1.08) lev:(0.02) [584] conv:(1.11) 50. education-num='(8.5-10]' salary= <=50K 14730 ==> hours-per-week='(30.4-40.2]' 8574 conf:(0.58) lift:(1.07) lev:(0.02) [551] conv:(1.09) 51. hours-per-week='(30.4-40.2]' 17735 ==> education-num='(8.5-10]' 10177 conf:(0.57) lift:(1.05) lev:(0.01) [486] conv:(1.06) 52. education-num='(8.5-10]' 17792 ==> hours-per-week='(30.4-40.2]' 10177 conf:(0.57) lift:(1.05) lev:(0.01) [486] conv:(1.06) 53. salary= <=50K 24720 ==> hours-per-week='(30.4-40.2]' 14103 conf:(0.57) lift:(1.05) lev:(0.02) [638] conv:(1.06) 54. workclass= Private 22696 ==> education-num='(8.5-10]' 12874 conf:(0.57) lift:(1.04) lev:(0.01) [472] conv:(1.05) 55. workclass= Private 22696 ==> hours-per-week='(30.4-40.2]' 12849 conf:(0.57) lift:(1.04) lev:(0.01) [487] conv:(1.05) 56. age='(38.9-46.2]' 6163 ==> hours-per-week='(30.4-40.2]' 3481 conf:(0.56) lift:(1.04) lev:(0) [124] conv:(1.05)
  • 25. DATA MINING FINAL PROJECT FALL 2014 ~ 24 ~ 57. age='(31.6-38.9]' 6048 ==> hours-per-week='(30.4-40.2]' 3371 conf:(0.56) lift:(1.02) lev:(0) [76] conv:(1.03) 58. age='(31.6-38.9]' 6048 ==> education-num='(8.5-10]' 3283 conf:(0.54) lift:(0.99) lev:(0) [- 21] conv:(0.99) 59. workclass= Private hours-per-week='(30.4-40.2]' 12849 ==> education-num='(8.5-10]' salary= <=50K 6513 conf:(0.51) lift:(1.12) lev:(0.02) [700] conv:(1.11) 60. workclass= Private education-num='(8.5-10]' 12874 ==> hours-per-week='(30.4-40.2]' salary= <=50K 6513 conf:(0.51) lift:(1.17) lev:(0.03) [936] conv:(1.15) 61. hours-per-week='(30.4-40.2]' 17735 ==> education-num='(8.5-10]' salary= <=50K 8574 conf:(0.48) lift:(1.07) lev:(0.02) [551] conv:(1.06) 62. education-num='(8.5-10]' 17792 ==> hours-per-week='(30.4-40.2]' salary= <=50K 8574 conf:(0.48) lift:(1.11) lev:(0.03) [867] conv:(1.09) 63. workclass= Private 22696 ==> education-num='(8.5-10]' salary= <=50K 10832 conf:(0.48) lift:(1.06) lev:(0.02) [564] conv:(1.05) 64. workclass= Private 22696 ==> hours-per-week='(30.4-40.2]' salary= <=50K 10524 conf:(0.46) lift:(1.07) lev:(0.02) [693] conv:(1.06) 65. salary= >50K 7841 ==> hours-per-week='(30.4-40.2]' 3632 conf:(0.46) lift:(0.85) lev:(- 0.02) [-638] conv:(0.85) 66. hours-per-week='(30.4-40.2]' salary= <=50K 14103 ==>workclass= Private education- num='(8.5-10]' 6513 conf:(0.46) lift:(1.17) lev:(0.03) [936] conv:(1.12) 67. education-num='(8.5-10]' salary= <=50K 14730 ==>workclass= Private hours-per- week='(30.4-40.2]' 6513 conf:(0.44) lift:(1.12) lev:(0.02) [700] conv:(1.09) 68. salary= <=50K 24720 ==>workclass= Private education-num='(8.5-10]' 10832 conf:(0.44) lift:(1.11) lev:(0.03) [1058] conv:(1.08) 69. hours-per-week='(30.4-40.2]' 17735 ==>workclass= Private education-num='(8.5-10]' 7597 conf:(0.43) lift:(1.08) lev:(0.02) [584] conv:(1.06) 70. education-num='(8.5-10]' 17792 ==>workclass= Private hours-per-week='(30.4-40.2]' 7597 conf:(0.43) lift:(1.08) lev:(0.02) [576] conv:(1.06)
  • 26. DATA MINING FINAL PROJECT FALL 2014 ~ 25 ~ 71. salary= <=50K 24720 ==>workclass= Private hours-per-week='(30.4-40.2]' 10524 conf:(0.43) lift:(1.08) lev:(0.02) [769] conv:(1.05) 72. workclass= Private salary= <=50K 17733 ==> education-num='(8.5-10]' hours-per- week='(30.4-40.2]' 6513 conf:(0.37) lift:(1.18) lev:(0.03) [970] conv:(1.09) 73. hours-per-week='(30.4-40.2]' 17735 ==>workclass= Private education-num='(8.5-10]' salary= <=50K 6513 conf:(0.37) lift:(1.1) lev:(0.02) [613] conv:(1.05) 74. education-num='(8.5-10]' 17792 ==>workclass= Private hours-per-week='(30.4-40.2]' salary= <=50K 6513 conf:(0.37) lift:(1.13) lev:(0.02) [762] conv:(1.07) 75. salary= <=50K 24720 ==> education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 8574 conf:(0.35) lift:(1.11) lev:(0.03) [847] conv:(1.05) 76. workclass= Private 22696 ==> education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 7597 conf:(0.33) lift:(1.07) lev:(0.02) [503] conv:(1.03) 77. workclass= Private 22696 ==> education-num='(8.5-10]' hours-per-week='(30.4-40.2]' salary= <=50K 6513 conf:(0.29) lift:(1.09) lev:(0.02) [536] conv:(1.03) 78. salary= <=50K 24720 ==>workclass= Private education-num='(8.5-10]' hours-per- week='(30.4-40.2]' 6513 conf:(0.26) lift:(1.13) lev:(0.02) [745] conv:(1.04) 79. workclass= Private salary= <=50K 17733 ==> age='(-inf-24.3]' 4360 conf:(0.25) lift:(1.44) lev:(0.04) [1326] conv:(1.1) 80. education-num='(8.5-10]' salary= <=50K 14730 ==> age='(-inf-24.3]' 3605 conf:(0.24) lift:(1.43) lev:(0.03) [1085] conv:(1.1) 81. workclass= Private salary= <=50K 17733 ==> age='(24.3-31.6]' 4023 conf:(0.23) lift:(1.25) lev:(0.03) [815] conv:(1.06) 82. salary= <=50K 24720 ==> age='(-inf-24.3]' 5509 conf:(0.22) lift:(1.3) lev:(0.04) [1280] conv:(1.07) 83. workclass= Private 22696 ==> salary= >50K 4963 conf:(0.22) lift:(0.91) lev:(-0.02) [-502] conv:(0.97)
  • 27. DATA MINING FINAL PROJECT FALL 2014 ~ 26 ~ 84. salary= <=50K 24720 ==> age='(24.3-31.6]' 5086 conf:(0.21) lift:(1.14) lev:(0.02) [614] conv:(1.03) 85. hours-per-week='(30.4-40.2]' 17735 ==> salary= >50K 3632 conf:(0.2) lift:(0.85) lev:(- 0.02) [-638] conv:(0.95) 86. education-num='(8.5-10]' 17792 ==> age='(-inf-24.3]' 3637 conf:(0.2) lift:(1.19) lev:(0.02) [593] conv:(1.04) 87. workclass= Private 22696 ==> age='(24.3-31.6]' 4617 conf:(0.2) lift:(1.12) lev:(0.02) [511] conv:(1.03) 88. education-num='(8.5-10]' 17792 ==> age='(-inf-24.3]' salary= <=50K 3605 conf:(0.2) lift:(1.2) lev:(0.02) [594] conv:(1.04) 89. hours-per-week='(30.4-40.2]' 17735 ==> age='(24.3-31.6]' 3517 conf:(0.2) lift:(1.1) lev:(0.01) [308] conv:(1.02) 90. hours-per-week='(30.4-40.2]' 17735 ==> age='(38.9-46.2]' 3481 conf:(0.2) lift:(1.04) lev:(0) [124] conv:(1.01) 91. workclass= Private 22696 ==> age='(31.6-38.9]' 4426 conf:(0.2) lift:(1.05) lev:(0.01) [210] conv:(1.01) 92. workclass= Private 22696 ==> age='(-inf-24.3]' 4404 conf:(0.19) lift:(1.13) lev:(0.02) [521] conv:(1.03) 93. workclass= Private 22696 ==> age='(-inf-24.3]' salary= <=50K 4360 conf:(0.19) lift:(1.14) lev:(0.02) [520] conv:(1.03) 94. hours-per-week='(30.4-40.2]' 17735 ==> age='(31.6-38.9]' 3371 conf:(0.19) lift:(1.02) lev:(0) [76] conv:(1.01) 95. workclass= Private 22696 ==> education-num='(11.5-13]' 4280 conf:(0.19) lift:(0.96) lev:(-0.01) [-196] conv:(0.99) 96. workclass= Private 22696 ==> hours-per-week='(40.2-50]' 4270 conf:(0.19) lift:(1.03) lev:(0) [131] conv:(1.01)
  • 28. DATA MINING FINAL PROJECT FALL 2014 ~ 27 ~ 97. education-num='(8.5-10]' 17792 ==> age='(31.6-38.9]' 3283 conf:(0.18) lift:(0.99) lev:(0) [- 21] conv:(1) 98. workclass= Private 22696 ==> age='(38.9-46.2]' 4127 conf:(0.18) lift:(0.96) lev:(-0.01) [- 168] conv:(0.99) 99. workclass= Private 22696 ==> age='(24.3-31.6]' salary= <=50K 4023 conf:(0.18) lift:(1.13) lev:(0.01) [477] conv:(1.03) 100. salary= <=50K 24720 ==> age='(31.6-38.9]' 4371 conf:(0.18) lift:(0.95) lev:(-0.01) [-220] conv:(0.99) 101. salary= <=50K 24720 ==> age='(-inf-24.3]' workclass= Private 4360 conf:(0.18) lift:(1.3) lev:(0.03) [1016] conv:(1.05) 102. salary= <=50K 24720 ==> age='(24.3-31.6]' workclass= Private 4023 conf:(0.16) lift:(1.15) lev:(0.02) [517] conv:(1.02) 103. salary= <=50K 24720 ==> education-num='(11.5-13]' 3936 conf:(0.16) lift:(0.81) lev:(- 0.03) [-939] conv:(0.95) 104. salary= <=50K 24720 ==> age='(38.9-46.2]' 3934 conf:(0.16) lift:(0.84) lev:(-0.02) [-744] conv:(0.96) 105. salary= <=50K 24720 ==> age='(-inf-24.3]' education-num='(8.5-10]' 3605 conf:(0.15) lift:(1.31) lev:(0.03) [843] conv:(1.04) 106. salary= <=50K 24720 ==> hours-per-week='(40.2-50]' 3586 conf:(0.15) lift:(0.8) lev:(- 0.03) [-922] conv:(0.96)
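Each rule above is printed with four metrics: confidence, lift, leverage (with the corresponding instance count in square brackets) and conviction. The sketch below recomputes them for rule 1 from the itemset counts. The formulas for confidence, lift and leverage are the standard ones; the `+ 1` in the conviction denominator is an assumption inferred from the printed values, since it is what reproduces 26.54 for rule 1, and should not be read as a documented Weka formula. The function name is hypothetical.

```python
def rule_metrics(premise, consequence, both, total):
    """Metrics for a rule A ==> B given instance counts.

    premise:     instances matching A
    consequence: instances matching B
    both:        instances matching A and B
    total:       all instances
    """
    confidence = both / premise
    lift = confidence / (consequence / total)
    leverage = both / total - (premise / total) * (consequence / total)
    # Assumption: the +1 smoothing is inferred from the printed output.
    conviction = premise * (total - consequence) / ((premise - both + 1) * total)
    return confidence, lift, leverage, conviction


# Rule 1: age='(-inf-24.3]' education-num='(8.5-10]' 3637 ==> salary= <=50K 3605
conf, lift, lev, conv = rule_metrics(premise=3637, consequence=24720, both=3605, total=32561)
print(f"conf:({conf:.2f}) lift:({lift:.2f}) lev:({lev:.2f}) [{int(lev * 32561)}] conv:({conv:.2f})")
```

A lift above 1 means the premise makes the consequence more likely than its base rate; rules with lift below 1 and negative leverage, such as rule 25, describe groups in which the consequence is less common than in the data set as a whole.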
  • 29. Changed minMetric = 0.9 and lowerBoundMinSupport = 0.1
  • 30. DATA MINING FINAL PROJECT FALL 2014 ~ 29 ~ === Run information === Scheme: weka.associations.Apriori -I -R -N 200 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S 1.0 -V -c -1 Relation: income-weka.filters.unsupervised.attribute.Remove-R6,8-12,14- weka.filters.unsupervised.attribute.Remove-R3-weka.filters.unsupervised.attribute.Remove-R5- weka.filters.unsupervised.attribute.Remove-R3-weka.filters.unsupervised.attribute.Discretize- B10-M-1.0-Rfirst-last Instances: 32561 Attributes: 5 age workclass education-num hours-per-week salary === Associator model (full training set) === Apriori ======= Minimum support: 0.1 (3256 instances) Minimum metric <confidence>: 0.9 Significance level: 1 Number of cycles performed: 14 Generated sets of large itemsets: Size of set of large itemsetsL(1): 12 Large ItemsetsL(1): age='(-inf-24.3]' 5570 age='(24.3-31.6]' 5890 age='(31.6-38.9]' 6048 age='(38.9-46.2]' 6163 age='(46.2-53.5]' 3967
  • 31. DATA MINING FINAL PROJECT FALL 2014 ~ 30 ~ workclass= Private 22696 education-num='(8.5-10]' 17792 education-num='(11.5-13]' 6422 hours-per-week='(30.4-40.2]' 17735 hours-per-week='(40.2-50]' 5938 salary= <=50K 24720 salary= >50K 7841 Size of set of large itemsetsL(2): 25 Large ItemsetsL(2): age='(-inf-24.3]' workclass= Private 4404 age='(-inf-24.3]' education-num='(8.5-10]' 3637 age='(-inf-24.3]' salary= <=50K 5509 age='(24.3-31.6]' workclass= Private 4617 age='(24.3-31.6]' hours-per-week='(30.4-40.2]' 3517 age='(24.3-31.6]' salary= <=50K 5086 age='(31.6-38.9]' workclass= Private 4426 age='(31.6-38.9]' education-num='(8.5-10]' 3283 age='(31.6-38.9]' hours-per-week='(30.4-40.2]' 3371 age='(31.6-38.9]' salary= <=50K 4371 age='(38.9-46.2]' workclass= Private 4127 age='(38.9-46.2]' hours-per-week='(30.4-40.2]' 3481 age='(38.9-46.2]' salary= <=50K 3934 workclass= Private education-num='(8.5-10]' 12874 workclass= Private education-num='(11.5-13]' 4280 workclass= Private hours-per-week='(30.4-40.2]' 12849 workclass= Private hours-per-week='(40.2-50]' 4270 workclass= Private salary= <=50K 17733 workclass= Private salary= >50K 4963
  • 32. DATA MINING FINAL PROJECT FALL 2014 ~ 31 ~ education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 10177 education-num='(8.5-10]' salary= <=50K 14730 education-num='(11.5-13]' salary= <=50K 3936 hours-per-week='(30.4-40.2]' salary= <=50K 14103 hours-per-week='(30.4-40.2]' salary= >50K 3632 hours-per-week='(40.2-50]' salary= <=50K 3586 Size of set of large itemsetsL(3): 7 Large ItemsetsL(3): age='(-inf-24.3]' workclass= Private salary= <=50K 4360 age='(-inf-24.3]' education-num='(8.5-10]' salary= <=50K 3605 age='(24.3-31.6]' workclass= Private salary= <=50K 4023 workclass= Private education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 7597 workclass= Private education-num='(8.5-10]' salary= <=50K 10832 workclass= Private hours-per-week='(30.4-40.2]' salary= <=50K 10524 education-num='(8.5-10]' hours-per-week='(30.4-40.2]' salary= <=50K 8574 Size of set of large itemsetsL(4): 1 Large ItemsetsL(4): workclass= Private education-num='(8.5-10]' hours-per-week='(30.4-40.2]' salary= <=50K 6513 Best rules found: 1. age='(-inf-24.3]' education-num='(8.5-10]' 3637 ==> salary= <=50K 3605 conf:(0.99) lift:(1.31) lev:(0.03) [843] conv:(26.54) 2. age='(-inf-24.3]' workclass= Private 4404 ==> salary= <=50K 4360 conf:(0.99) lift:(1.3) lev:(0.03) [1016] conv:(23.57) 3. age='(-inf-24.3]' 5570 ==> salary= <=50K 5509 conf:(0.99) lift:(1.3) lev:(0.04) [1280] conv:(21.63)
  • 33. Changed lowerBoundMinSupport = 0.3 and minMetric = 0.65
  • 34. === Run information ===
    Scheme:       weka.associations.Apriori -I -R -N 200 -T 0 -C 0.65 -D 0.05 -U 1.0 -M 0.3 -S 1.0 -V -c -1
    Relation:     income-weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-last-weka.filters.unsupervised.attribute.Remove-R3-4,6-12,14-weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-last
    Instances:    32561
    Attributes:   5
                  age
                  workclass
                  education-num
                  hours-per-week
                  salary
    === Associator model (full training set) ===
    Apriori
    =======
    Minimum support: 0.3 (9768 instances)
    Minimum metric <confidence>: 0.65
    Significance level: 1
    Number of cycles performed: 10
    Generated sets of large itemsets:
    Size of set of large itemsets L(1): 4
    Large Itemsets L(1):
    workclass= Private 22696
    education-num='(8.5-10]' 17792
    hours-per-week='(30.4-40.2]' 17735
    salary= <=50K 24720
    Size of set of large itemsets L(2): 6
    Large Itemsets L(2):
    workclass= Private education-num='(8.5-10]' 12874
  • 35. workclass= Private hours-per-week='(30.4-40.2]' 12849
    workclass= Private salary= <=50K 17733
    education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 10177
    education-num='(8.5-10]' salary= <=50K 14730
    hours-per-week='(30.4-40.2]' salary= <=50K 14103
    Size of set of large itemsets L(3): 2
    Large Itemsets L(3):
    workclass= Private education-num='(8.5-10]' salary= <=50K 10832
    workclass= Private hours-per-week='(30.4-40.2]' salary= <=50K 10524
    Best rules found:
    1. workclass= Private education-num='(8.5-10]' 12874 ==> salary= <=50K 10832 conf:(0.84) lift:(1.11) lev:(0.03) [1058] conv:(1.52)
    2. education-num='(8.5-10]' 17792 ==> salary= <=50K 14730 conf:(0.83) lift:(1.09) lev:(0.04) [1222] conv:(1.4)
    3. workclass= Private hours-per-week='(30.4-40.2]' 12849 ==> salary= <=50K 10524 conf:(0.82) lift:(1.08) lev:(0.02) [769] conv:(1.33)
    4. hours-per-week='(30.4-40.2]' 17735 ==> salary= <=50K 14103 conf:(0.8) lift:(1.05) lev:(0.02) [638] conv:(1.18)
    5. workclass= Private 22696 ==> salary= <=50K 17733 conf:(0.78) lift:(1.03) lev:(0.02) [502] conv:(1.1)
    6. hours-per-week='(30.4-40.2]' salary= <=50K 14103 ==> workclass= Private 10524 conf:(0.75) lift:(1.07) lev:(0.02) [693] conv:(1.19)
    7. education-num='(8.5-10]' salary= <=50K 14730 ==> workclass= Private 10832 conf:(0.74) lift:(1.06) lev:(0.02) [564] conv:(1.14)
    8. hours-per-week='(30.4-40.2]' 17735 ==> workclass= Private 12849 conf:(0.72) lift:(1.04) lev:(0.01) [487] conv:(1.1)
    9. education-num='(8.5-10]' 17792 ==> workclass= Private 12874 conf:(0.72) lift:(1.04) lev:(0.01) [472] conv:(1.1)
    10. salary= <=50K 24720 ==> workclass= Private 17733 conf:(0.72) lift:(1.03) lev:(0.02) [502] conv:(1.07)
  • 36. J48: J48, Weka's implementation of the C4.5 algorithm, generates decision trees that can be used for classification. It relies on the fact that each attribute of the data can be used to make a decision by splitting the data into smaller subsets.
    WITH ALL ATTRIBUTES
  • 37. SELECTED ATTRIBUTES
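To make the idea of splitting the data into smaller subsets concrete, the sketch below scores one candidate split with entropy, information gain and gain ratio, the criteria used by C4.5-style learners. The class counts are taken from the itemset counts reported in the Apriori section (24720 instances with salary <=50K, 7841 with >50K, and 5509 of the 5570 instances aged 24.3 or younger earning <=50K). The age threshold comes from the discretization used there, so this is an illustration only and not a split that J48 necessarily chose; the function names are our own.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Reduction in entropy obtained by splitting parent into children."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


def gain_ratio(parent, children):
    """Information gain normalized by the entropy of the split sizes."""
    split_info = entropy([sum(child) for child in children])
    return information_gain(parent, children) / split_info


# Class counts as [<=50K, >50K].
parent = [24720, 7841]
children = [
    [5509, 5570 - 5509],                      # age <= 24.3
    [24720 - 5509, 7841 - (5570 - 5509)],     # age > 24.3
]

print(f"entropy before split: {entropy(parent):.4f}")
print(f"information gain:     {information_gain(parent, children):.4f}")
print(f"gain ratio:           {gain_ratio(parent, children):.4f}")
```

A tree learner of this kind evaluates many such candidate splits at each node and recurses on the resulting subsets until they are sufficiently pure or too small to split further.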
  • 39. K-MEANS CLUSTERING: k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. It aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
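The assign-then-update loop behind this definition can be sketched as follows. This is a minimal pure-Python version of the standard (Lloyd's) procedure on numeric points, not Weka's SimpleKMeans; the function names, default parameters and toy points are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=100, seed=10):
    """Partition points into k clusters; returns (centroids, assignment)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment


# Two visibly separate groups of toy points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignment = k_means(points, k=2)
print(centroids)
print(assignment)
```

The result depends on the initial centroids, which is why clustering tools typically expose a random seed and why it is common to rerun the algorithm with several seeds and compare the within-cluster sum of squared errors.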
  • 40. === Run information ===
    Scheme:       weka.clusterers.SimpleKMeans -N 5 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
    Relation:     income-weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-last
    Instances:    32561
    Attributes:   15
                  age
                  workclass
                  fnlwgt
                  education
                  education-num
                  martial-status
                  occupation
                  relationship
                  race
                  sex
                  capital-gain
                  capital-loss
                  hours-per-week
                  native-country
                  salary
    Test mode:    split 66% train, remainder test
    === Model and evaluation on training set ===
    kMeans
    ======
    Number of iterations: 4
    Within cluster sum of squared errors: 146940.0
  • 41. Missing values globally replaced with mean/mode
    Cluster centroids:
    Attribute        Full Data            Cluster 0            Cluster 1        Cluster 2            Cluster 3            Cluster 4
                     (32561)              (7137)               (7920)           (9177)               (4102)               (4225)
    age              '(38.9-46.2]'        '(46.2-53.5]'        '(-inf-24.3]'    '(38.9-46.2]'        '(24.3-31.6]'        '(24.3-31.6]'
    workclass        Private              Private              Private          Private              Private              Private
    fnlwgt           '(159527-306769]'    '(-inf-159527]'      '(-inf-159527]'  '(159527-306769]'    '(159527-306769]'    '(159527-306769]'
    education        HS-grad              Bachelors            Some-college     HS-grad              Bachelors            HS-grad
    education-num    '(8.5-10]'           '(11.5-13]'          '(8.5-10]'       '(8.5-10]'           '(11.5-13]'          '(8.5-10]'
    martial-status   Married-civ-spouse   Married-civ-spouse   Never-married    Married-civ-spouse   Never-married        Never-married
    occupation       Prof-specialty       Prof-specialty       Other-service    Craft-repair         Prof-specialty       Adm-clerical
    relationship     Husband              Husband              Own-child        Husband              Not-in-family        Unmarried
    race             White                White                White            White                White                White
    sex              Male                 Male                 Male             Male                 Female               Female
    capital-gain     '(-inf-9999.9]'      '(-inf-9999.9]'      '(-inf-9999.9]'  '(-inf-9999.9]'      '(-inf-9999.9]'      '(-inf-9999.9]'
    capital-loss     '(-inf-435.6]'       '(-inf-435.6]'       '(-inf-435.6]'   '(-inf-435.6]'       '(-inf-435.6]'       '(-inf-435.6]'
    hours-per-week   '(30.4-40.2]'        '(30.4-40.2]'        '(30.4-40.2]'    '(30.4-40.2]'        '(30.4-40.2]'        '(30.4-40.2]'
    native-country   United-States        United-States        United-States    United-States        United-States        United-States
    salary           <=50K                >50K                 <=50K            <=50K                <=50K                <=50K
  • 42. Time taken to build model (full training data) : 3.13 seconds
    === Model and evaluation on test split ===
    kMeans
    ======
    Number of iterations: 4
    Within cluster sum of squared errors: 103681.0
    Missing values globally replaced with mean/mode
    Cluster centroids:
    Attribute        Full Data            Cluster 0            Cluster 1          Cluster 2          Cluster 3          Cluster 4
                     (21490)              (11082)              (3995)             (3024)             (2584)             (805)
    age              '(38.9-46.2]'        '(31.6-38.9]'        '(-inf-24.3]'      '(24.3-31.6]'      '(24.3-31.6]'      '(-inf-24.3]'
    workclass        Private              Private              Private            Private            Private            Private
    fnlwgt           '(159527-306769]'    '(159527-306769]'    '(159527-306769]'  '(159527-306769]'  '(159527-306769]'  '(159527-306769]'
    education        HS-grad              HS-grad              Some-college       HS-grad            Bachelors          Some-college
    education-num    '(8.5-10]'           '(8.5-10]'           '(8.5-10]'         '(8.5-10]'         '(11.5-13]'        '(8.5-10]'
    martial-status   Married-civ-spouse   Married-civ-spouse   Never-married      Never-married      Never-married      Never-married
    occupation       Craft-repair         Craft-repair         Other-service      Adm-clerical       Prof-specialty     Other-service
    relationship     Husband              Husband              Own-child          Not-in-family      Not-in-family      Not-in-family
    race             White                White                White              White              White              White
    sex              Male                 Male                 Female             Female             Female             Female
    capital-gain     '(-inf-9999.9]'      '(-inf-9999.9]'      '(-inf-9999.9]'    '(-inf-9999.9]'    '(-inf-9999.9]'    '(-inf-9999.9]'
    capital-loss     '(-inf-435.6]'       '(-inf-435.6]'       '(-inf-435.6]'     '(-inf-435.6]'     '(-inf-435.6]'     '(-inf-435.6]'
    hours-per-week   '(30.4-40.2]'        '(30.4-40.2]'        '(30.4-40.2]'      '(30.4-40.2]'      '(30.4-40.2]'      '(20.6-30.4]'
    native-country   United-States        United-States        United-States      United-States      United-States      United-States
    salary           <=50K                <=50K                <=50K              <=50K              <=50K              <=50K
  • 43. Time taken to build model (percentage split) : 1.99 seconds
    Clustered Instances
    0    5618 ( 51%)
    1    2119 ( 19%)
    2    1569 ( 14%)
    3    1358 ( 12%)
    4     407 (  4%)
  • 44. SAS ENTERPRISE MINER
    Cluster Analysis: This analysis attempts to find natural groupings of observations in the data, based on a set of input variables. After grouping the observations into clusters, you can use the input variables to try to characterize each group. When the clusters have been identified and interpreted, you can decide whether to treat each cluster independently. Clustering can be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings (including values such as the distance function to use, a density threshold or the number of expected clusters) depend on the individual data set and the intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and error. It will often be necessary to modify data preprocessing and model parameters until the result achieves the desired properties.
    For this data set we built clusters to group similar records, changing the following properties of the Cluster node:
    Cluster variable role = Segment
    Specification Method = User Specify
    Maximum Number of Clusters = 5
  • 46. From the clustering segments (pie chart), we can observe that cluster 1 contains a large part of the data set.
  • 47. SAS DECISION TREES
    Node Rules:
    *------------------------------------------------------------*
     Node = 11
    *------------------------------------------------------------*
    if Relationship IS ONE OF: NOT-IN-FAMILY, OWN-CHILD, UNMARRIED, OTHER-RELATIVE or MISSING
    AND Hours-Per-Week >= 35.5 or MISSING
    AND Capital-Gain >= 7073.59
    then
     Tree Node Identifier = 11
     Number of Observations = 285
     Predicted: Salary=>50K = 0.99
     Predicted: Salary=<=50K = 0.01
    *------------------------------------------------------------*
     Node = 13
    *------------------------------------------------------------*
    if Relationship IS ONE OF: HUSBAND, WIFE
    AND Education-Num < 12.5 or MISSING
    AND Capital-Gain >= 5095.5
    then
     Tree Node Identifier = 13
     Number of Observations = 522
     Predicted: Salary=>50K = 0.98
     Predicted: Salary=<=50K = 0.02
  • 48. *------------------------------------------------------------*
     Node = 15
    *------------------------------------------------------------*
    if Relationship IS ONE OF: HUSBAND, WIFE
    AND Education-Num >= 12.5
    AND Capital-Gain >= 5095.5
    then
     Tree Node Identifier = 15
     Number of Observations = 678
     Predicted: Salary=>50K = 1.00
     Predicted: Salary=<=50K = 0.00
    *------------------------------------------------------------*
     Node = 161
    *------------------------------------------------------------*
    if Workclass IS ONE OF: STATE-GOV, SELF-EMP-NOT-INC
    AND Relationship IS ONE OF: HUSBAND, WIFE
    AND Occupation IS ONE OF: ADM-CLERICAL, EXEC-MANAGERIAL, PROF-SPECIALTY, SALES, TECH-SUPPORT, PROTECTIVE-SERV
    AND Hours-Per-Week >= 37.5 or MISSING
    AND Education-Num < 12.5 AND Education-Num >= 9.5 or MISSING
    AND Capital-Loss < 1512 or MISSING
    AND Capital-Gain < 5095.5 or MISSING
    AND Age >= 33.5 or MISSING
    then
     Tree Node Identifier = 161
     Number of Observations = 157
     Predicted: Salary=>50K = 0.39
     Predicted: Salary=<=50K = 0.61
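The node rules above read as nested if/then conditions, where "or MISSING" sends records with a missing value down that branch. As a rough sketch, nodes 11, 13 and 15 can be written as a Python function. The function name and dictionary keys are hypothetical, `None` stands for a missing value, and node 161 and all other leaves of the tree are left out.

```python
# Relationship values that lead towards node 11 (copied from the rule text).
NON_SPOUSE = {"NOT-IN-FAMILY", "OWN-CHILD", "UNMARRIED", "OTHER-RELATIVE"}
SPOUSE = {"HUSBAND", "WIFE"}

def match_leaf(record):
    """Return (node id, predicted P(salary > 50K)) for nodes 11, 13, 15, else None."""
    rel = record.get("relationship")
    gain = record.get("capital_gain")
    hours = record.get("hours_per_week")
    edu = record.get("education_num")

    # Node 11: non-spouse (or missing) relationship, long hours, large capital gain.
    if rel is None or rel in NON_SPOUSE:
        hours_ok = hours is None or hours >= 35.5  # "or MISSING" branch
        if hours_ok and gain is not None and gain >= 7073.59:
            return 11, 0.99
        return None

    # Nodes 13 and 15: husband or wife with capital gain of at least 5095.5.
    if rel in SPOUSE and gain is not None and gain >= 5095.5:
        if edu is None or edu < 12.5:  # "or MISSING" branch
            return 13, 0.98
        return 15, 1.00

    return None  # record falls into a leaf that is not encoded here

print(match_leaf({"relationship": "WIFE", "capital_gain": 6000, "education_num": 13}))
```

All three encoded leaves predict a salary above 50K with probability of at least 0.98, and all three require a large capital gain.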
  • 50. Fit Statistics
    Target=Salary Target Label=' '

    Fit Statistics  Statistics Label                Train
    _NOBS_          Sum of Frequencies              32561.00
    _MISC_          Misclassification Rate          0.14
    _MAX_           Maximum Absolute Error          1.00
    _SSE_           Sum of Squared Errors           6368.80
    _ASE_           Average Squared Error           0.10
    _RASE_          Root Average Squared Error      0.31
    _DIV_           Divisor for ASE                 65122.00
    _DFT_           Total Degrees of Freedom        32561.00
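Several of the fit statistics are derived from one another, which can be checked with a short Python sketch. The inputs are copied from the table above. That the divisor equals the number of observations times the number of target levels is an assumption here, although it matches the reported value (2 × 32561 = 65122).

```python
import math

n_obs = 32561         # _NOBS_ from the table above
sse = 6368.80         # _SSE_ from the table above
n_target_levels = 2   # assumption: Salary has two levels, >50K and <=50K

divisor = n_obs * n_target_levels  # should reproduce _DIV_
ase = sse / divisor                # average squared error, _ASE_
rase = math.sqrt(ase)              # root average squared error, _RASE_

print(f"DIV={divisor}  ASE={ase:.2f}  RASE={rase:.2f}")
```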
  • 52. CONCLUSION
    1. If the age is around 24 years and the education level is 11th grade, 12th grade or some college, the income is likely to be less than 50K.
    2. If the age is around 24 years and the person works for a private firm, the income is likely to be less than 50K.
    3. If the age is around 24 years, the income is likely to be less than 50K.
    4. 90% of females have a salary <=50K, whereas 60% of males have a salary <=50K.
    5. 95% of the population in the other-service occupation category have a salary <=50K.
    6. 95% of the population in the age group of 23 to 24 have a salary <=50K.
    We made a couple of runs of the J48 classifier and found the following 5 attributes to be important in predicting a person's income: education number, age, salary, work class, hours-per-week. These columns provide an accuracy of 80% in predicting the income.