1. SUBMITTED BY:
ARUN KUMAR DASH
PARAS SHAH
DIVYA RAJASRI TADI
NIREESHA MANDALA
DATA MINING FINAL PROJECT
FALL 2014
INTRODUCTION
Data mining aims to extract patterns and knowledge from a data set and convert
them into an understandable structure that can be reused later. The data is not
merely analyzed: it is pre-processed, complexities in the data are taken into
account, and the structures discovered are post-processed. Data mining is
applied to large quantities of data in order to uncover previously unknown
patterns, such as groups of similar records (cluster analysis) or exceptional
records (anomaly detection). These patterns can then be used in machine
learning and predictive analytics. Steps such as data collection, data
preparation, result interpretation and reporting, though carried out alongside
data mining, belong to the broader knowledge discovery in databases (KDD)
process.
Data mining is used in almost every field: business, science and engineering,
medicine, and visual, music, sensor, temporal and spatial data, to name a few.
The ways in which data mining is used can, in some contexts, raise questions
of privacy, legality and ethics, because the data preparation it requires can
uncover patterns that compromise confidentiality. This happens mainly during
data aggregation, where data from several sources is combined for analysis,
which can affect the privacy of individual records.
We used a census data set; the prediction task is to determine whether a
person's income exceeds $50K a year.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative,
Unmarried.
DATA PRE-PROCESSING:
The first step in data mining is to choose a data set with a large number of
records, so that unknown patterns can be identified, but not one so large that
finding a pattern takes excessive time; selecting an appropriate data set is
therefore important. Data sets are usually drawn from a data warehouse. The
main aim of data pre-processing is to evaluate multivariate data prior to
mining: the data set is cleaned to remove records containing noise and records
with missing values.
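As a rough illustration of this cleaning step, the sketch below drops records that contain missing values, which the census data marks with "?" (the rows shown are made-up examples in that format):

```python
# Minimal sketch: drop census records containing missing values ("?").
# The rows below are hypothetical examples, not taken from the data set.
rows = [
    ["39", "State-gov", "Bachelors", "40", "<=50K"],
    ["28", "?", "HS-grad", "38", "<=50K"],       # missing workclass
    ["51", "Private", "Masters", "50", ">50K"],
]

clean = [r for r in rows if "?" not in r]
print(len(clean))   # number of complete records kept
```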
DATA MINING:
It is broadly divided into six categories of tasks:
Anomaly detection:
This phase identifies records that are unusual, such as outliers, deviations
or errors; these records then have to be investigated further.
Association rule learning:
In this phase the model searches for relationships among the different
variables. The classic example is the beer-and-diapers rule: on Fridays,
customers who buy diapers also tend to buy beer. Based on this
association, the store can place the two items together to increase sales.
This task is also known as market basket analysis or dependency modelling.
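The support and confidence behind such a rule can be computed directly; the sketch below does so for a hypothetical rule {diapers} -> {beer} over a handful of made-up transactions:

```python
# Toy transactions (made-up data) for illustrating support and confidence.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# confidence({diapers} -> {beer}) = support(both) / support(antecedent)
conf = support({"diapers", "beer"}) / support({"diapers"})
print(support({"diapers", "beer"}), conf)
```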
Clustering:
This phase discovers groups of records that are similar in some respect,
without using structures known to exist in the data beforehand.
Classification:
This phase generalizes known structures and applies them to new data, for
example an e-mail program that classifies incoming messages as spam or
genuine.
Regression:
This phase tries to find a function that models the data with the least
error.
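As a minimal illustration, the sketch below fits a one-variable least-squares line, i.e. the straight line with the smallest sum of squared errors, to a few made-up points:

```python
# One-variable least-squares fit, implemented directly (toy data).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# slope and intercept minimizing the sum of squared errors
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x
print(round(slope, 3), round(intercept, 3))
```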
Summarization:
This phase provides a more compact representation of the data set, including
visualization and report generation.
Validation of results:
Data mining can also produce results that appear significant but cannot be
used to predict future behavior and cannot be reproduced on a new sample of
data; such results usually stem from improper statistical hypothesis testing.
A simple
version of this problem is known as overfitting, where the algorithm finds
patterns in the training set that are not present in the general data set.
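A common guard against this is holdout evaluation: part of the data is held back so that patterns found during training can be checked on records the model never saw. A minimal sketch of a 66%/34% split (the records here are stand-ins):

```python
import random

# Holdout split sketch: 66% train / 34% test, with stand-in records.
records = list(range(100))
random.seed(10)            # fixed seed so the split is repeatable
random.shuffle(records)

cut = int(len(records) * 0.66)
train, holdout = records[:cut], records[cut:]
print(len(train), len(holdout))
```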
WEKA (WAIKATO ENVIRONMENT FOR
KNOWLEDGE ANALYSIS)
Weka consists of a collection of visualization tools and algorithms for data
analysis and predictive modeling, together with graphical user interfaces for
easy access to this functionality. It is used mainly for teaching and
research. Weka is:
- freely available under the GNU General Public License;
- portable, since it is implemented entirely in Java and runs on almost any
computing platform;
- an extensive collection of data preprocessing and modeling techniques;
- easy to use thanks to its GUI.
Weka supports various data mining tasks like data preprocessing, clustering,
classification, regression, visualization, and feature selection. Weka's
techniques are predicated on the assumption that data is available as a single
flat file or relation, where each data point is expressed by a fixed number of
attributes. Weka provides access to SQL databases using Java Database
Connectivity and can process the result returned by a database query.
NAÏVE BAYES SIMPLE
All attributes: We executed Naïve Bayes Simple on all 15 attributes in our
data set and obtained the following results.
The accuracy of the predictive model is 83.69%.
Few attributes: We considered 6 attributes, namely age, workclass, education,
education-num, hours-per-week and salary, and obtained the following results.
The accuracy of the predictive model is 79.20%.
The following results are obtained on removing the education attribute
from the data set.
The accuracy of the predictive model is now 80%, compared with 79.20% when
the education attribute was present. This suggests that education adds little
information here, most likely because it is redundant given education-num,
which encodes the same education level numerically; removing it yields a
slightly more accurate model.
=== Run information ===
Scheme: weka.classifiers.bayes.NaiveBayesSimple
Relation: income-weka.filters.unsupervised.attribute.Remove-R6,8-12,14-
weka.filters.unsupervised.attribute.Remove-R3-
weka.filters.unsupervised.attribute.Remove-R5-
weka.filters.unsupervised.attribute.Remove-R3
Instances: 32561
Attributes: 5
age
workclass
education-num
hours-per-week
salary
Test mode: split 66.0% train, remainder test
=== Classifier model (full training set) ===
Naive Bayes (simple)
Class <=50K: P(C) = 0.75917452
Attribute age
Mean: 36.78373786 Standard Deviation: 14.02008849
Attribute workclass
State-gov   Self-emp-not-inc   Private   Federal-gov   Local-gov   ?   Self-emp-inc   Without-pay   Never-worked
0.03825468  0.07351692  0.71713373  0.02385863  0.05972745  0.06656153  0.02001698  0.00060658  0.00032351
Attribute education-num
Mean: 9.59506472 Standard Deviation: 2.43614679
Attribute hours-per-week
Mean: 38.84021036 Standard Deviation: 12.31899464
Class >50K: P(C) = 0.24082548
Attribute age
Mean: 44.24984058 Standard Deviation: 10.51902772
Attribute workclass
State-gov   Self-emp-not-inc   Private   Federal-gov   Local-gov   ?   Self-emp-inc   Without-pay   Never-worked
0.04509554  0.09235669  0.63235669  0.04738854  0.07872611  0.0244586  0.07936306  0.00012739  0.00012739
Attribute education-num
Mean: 11.61165668 Standard Deviation: 2.38512863
Attribute hours-per-week
Mean: 45.4730264 Standard Deviation: 11.01297093
Time taken to build model: 0.06 seconds
=== Evaluation on test split ===
=== Summary ===
Correctly Classified Instances 8857 80.0018 %
Incorrectly Classified Instances 2214 19.9982 %
Kappa statistic 0.3666
Mean absolute error 0.2737
Root mean squared error 0.3704
Relative absolute error 75.1356 %
Root relative squared error 87.3416 %
Total Number of Instances 11071
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.923 0.601 0.833 0.923 0.876 0.82 <=50K
0.399 0.077 0.615 0.399 0.484 0.82 >50K
Weighted Avg. 0.8 0.478 0.782 0.8 0.784 0.82
=== Confusion Matrix ===
a b <-- classified as
7820 650 | a = <=50K
1564 1037 | b = >50K
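The arithmetic behind this classifier can be reproduced from the per-class priors, means and standard deviations printed in the model above. The sketch below does so using only the three numeric attributes (the nominal workclass attribute is omitted for brevity, and the example person is hypothetical):

```python
import math

# Per-class priors, means and standard deviations copied from the Weka
# NaiveBayesSimple output above (numeric attributes only).
model = {
    "<=50K": {"prior": 0.75917452,
              "age": (36.78373786, 14.02008849),
              "education-num": (9.59506472, 2.43614679),
              "hours-per-week": (38.84021036, 12.31899464)},
    ">50K":  {"prior": 0.24082548,
              "age": (44.24984058, 10.51902772),
              "education-num": (11.61165668, 2.38512863),
              "hours-per-week": (45.4730264, 11.01297093)},
}

def gaussian(x, mean, sd):
    """Normal density used by naive Bayes for numeric attributes."""
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

def classify(person):
    # score(class) = P(class) * product of per-attribute densities
    scores = {}
    for cls, params in model.items():
        score = params["prior"]
        for attr, value in person.items():
            mean, sd = params[attr]
            score *= gaussian(value, mean, sd)
        scores[cls] = score
    return max(scores, key=scores.get)

# A hypothetical 45-year-old with education-num 13 working 50 hours a week:
print(classify({"age": 45, "education-num": 13, "hours-per-week": 50}))
```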
APRIORI:
Apriori is an algorithm for frequent item set mining and association rule
learning over transactional databases. Apriori proceeds by identifying the
frequent individual items in the database and extending them to larger and
larger item sets as long as those item sets appear sufficiently often in the
database.
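A minimal sketch of this level-wise search over a few made-up transactions (the 0.5 minimum support is an arbitrary choice for the example):

```python
from itertools import combinations

# Toy transactions (made-up data) for the Apriori level-wise search.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk", "chips"},
    {"beer", "chips"},
]
min_support = 0.5

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Level 1: frequent single items.
items = sorted({i for t in transactions for i in t})
frequent = [frozenset({i}) for i in items if support({i}) >= min_support]
level = frequent
while level:
    # Join step: extend frequent k-itemsets to candidate (k+1)-itemsets,
    # then keep only the candidates that are still frequent.
    candidates = {a | b for a, b in combinations(level, 2)
                  if len(a | b) == len(a) + 1}
    level = [c for c in candidates if support(c) >= min_support]
    frequent += level

print(sorted(tuple(sorted(s)) for s in frequent))
```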
TOTAL NUMBER OF RULES
OUTPUT:
=== Run information ===
Scheme: weka.associations.Apriori -I -R -N 200 -T 0 -C 0.1 -D 0.05 -U 1.0 -M 0.1 -S
1.0 -V -c -1
Relation: income-weka.filters.unsupervised.attribute.Remove-R6,8-12,14-
weka.filters.unsupervised.attribute.Remove-R3-
weka.filters.unsupervised.attribute.Remove-R5-
weka.filters.unsupervised.attribute.Remove-R3-
weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-last
Instances: 32561
Attributes: 5
age
workclass
education-num
hours-per-week
salary
=== Associator model (full training set) ===
Apriori
=======
Minimum support: 0.1 (3256 instances)
Minimum metric <confidence>: 0.1
Significance level: 1
Number of cycles performed: 14
Generated sets of large itemsets:
Size of set of large itemsets L(1): 12
Large Itemsets L(1):
age='(-inf-24.3]' 5570
age='(24.3-31.6]' 5890
age='(31.6-38.9]' 6048
age='(38.9-46.2]' 6163
age='(46.2-53.5]' 3967
workclass= Private 22696
education-num='(8.5-10]' 17792
education-num='(11.5-13]' 6422
hours-per-week='(30.4-40.2]' 17735
hours-per-week='(40.2-50]' 5938
salary= <=50K 24720
salary= >50K 7841
J48:
The decision trees generated by J48 can be used for classification. J48
exploits the fact that each attribute of the data can be used to make a
decision by splitting the data into smaller subsets.
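The splitting criterion behind such trees is information gain: pick the attribute whose split most reduces the entropy of the class labels. A small sketch over made-up records:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

# Hypothetical records: (workclass, salary label)
records = [("Private", "<=50K"), ("Private", "<=50K"), ("Private", ">50K"),
           ("Self-emp", ">50K"), ("Self-emp", ">50K"), ("Gov", "<=50K")]

labels = [y for _, y in records]
before = entropy(labels)
# Expected entropy after splitting on workclass, weighted by subset size.
after = 0.0
for v in {w for w, _ in records}:
    subset = [y for w, y in records if w == v]
    after += len(subset) / len(records) * entropy(subset)
gain = before - after
print(round(gain, 3))
```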
WITH ALL ATTRIBUTES
K-MEANS CLUSTERING:
K-means clustering is a method of vector quantization, originally from signal
processing, that is popular for cluster analysis in data mining. k-means
clustering aims to partition n observations into k clusters in which each
observation belongs to the cluster with the nearest mean, serving as a
prototype of the cluster.
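A minimal one-dimensional sketch of the k-means loop described above, on a few made-up ages with k = 2:

```python
# 1-D k-means sketch: assign each point to the nearest mean, recompute means.
points = [25.0, 26.0, 24.0, 48.0, 50.0, 49.0]   # e.g. ages (made-up)
means = [points[0], points[3]]                   # simple initial prototypes

for _ in range(10):                              # iterate until stable
    clusters = [[], []]
    for p in points:
        nearest = min(range(len(means)), key=lambda i: abs(p - means[i]))
        clusters[nearest].append(p)
    means = [sum(c) / len(c) for c in clusters]

print(means)
```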
=== Run information ===
Scheme: weka.clusterers.SimpleKMeans -N 5 -A "weka.core.EuclideanDistance -R first-last" -I
500 -S 10
Relation: income-weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-last
Instances: 32561
Attributes: 15
age
workclass
fnlwgt
education
education-num
marital-status
occupation
relationship
race
sex
capital-gain
capital-loss
hours-per-week
native-country
salary
Test mode: split 66% train, remainder test
=== Model and evaluation on training set ===
kMeans
======
Number of iterations: 4
Within cluster sum of squared errors: 146940.0
Time taken to build model (full training data) : 3.13 seconds
=== Model and evaluation on test split ===
kMeans
======
Number of iterations: 4
Within cluster sum of squared errors: 103681.0
Missing values globally replaced with mean/mode
Cluster centroids:
                      Cluster#
Attribute        Full Data            0                    1                  2                  3                  4
                 (21490)              (11082)              (3995)             (3024)             (2584)             (805)
====================================================================================================================================
age              '(38.9-46.2]'        '(31.6-38.9]'        '(-inf-24.3]'      '(24.3-31.6]'      '(24.3-31.6]'      '(-inf-24.3]'
workclass        Private              Private              Private            Private            Private            Private
fnlwgt           '(159527-306769]'    '(159527-306769]'    '(159527-306769]'  '(159527-306769]'  '(159527-306769]'  '(159527-306769]'
education        HS-grad              HS-grad              Some-college       HS-grad            Bachelors          Some-college
education-num    '(8.5-10]'           '(8.5-10]'           '(8.5-10]'         '(8.5-10]'         '(11.5-13]'        '(8.5-10]'
marital-status   Married-civ-spouse   Married-civ-spouse   Never-married      Never-married      Never-married      Never-married
occupation       Craft-repair         Craft-repair         Other-service      Adm-clerical       Prof-specialty     Other-service
relationship     Husband              Husband              Own-child          Not-in-family      Not-in-family      Not-in-family
race             White                White                White              White              White              White
sex              Male                 Male                 Female             Female             Female             Female
capital-gain     '(-inf-9999.9]'      '(-inf-9999.9]'      '(-inf-9999.9]'    '(-inf-9999.9]'    '(-inf-9999.9]'    '(-inf-9999.9]'
capital-loss     '(-inf-435.6]'       '(-inf-435.6]'       '(-inf-435.6]'     '(-inf-435.6]'     '(-inf-435.6]'     '(-inf-435.6]'
hours-per-week   '(30.4-40.2]'        '(30.4-40.2]'        '(30.4-40.2]'      '(30.4-40.2]'      '(30.4-40.2]'      '(20.6-30.4]'
native-country   United-States        United-States        United-States      United-States      United-States      United-States
salary           <=50K                <=50K                <=50K              <=50K              <=50K              <=50K
Time taken to build model (percentage split) : 1.99 seconds
Clustered Instances
0 5618 ( 51%)
1 2119 ( 19%)
2 1569 ( 14%)
3 1358 ( 12%)
4 407 ( 4%)
SAS ENTERPRISE MINER
Cluster Analysis:
This analysis attempts to find natural groupings of observations in the data,
based on a set of input variables. After grouping the observations into clusters,
you can use the input variables to try to characterize each group. Once the
clusters have been identified and interpreted, you can decide whether to treat
each cluster independently. Clustering can be formulated as a multi-objective
optimization problem: the appropriate algorithm and parameter settings (such
as the distance function, a density threshold, or the number of expected
clusters) depend on the individual data set and the intended use of the
results. Cluster analysis as such is not an automatic task but an iterative
process of knowledge discovery, or interactive multi-objective optimization,
that involves trial and error; it is often necessary to modify the data
preprocessing and model parameters until the result has the desired
properties.
For this data set we built clusters to group similar records, changing the
following properties of the cluster node:
Cluster variable role = Segment
Specification Method = User Specify
Maximum Number of Clusters = 5
From the clustering segments (pie-chart), we can observe that cluster 1
contains a large part of the data set.
SAS DECISION TREES
Node Rules:
*------------------------------------------------------------*
Node = 11
*------------------------------------------------------------*
if Relationship IS ONE OF: NOT-IN-FAMILY, OWN-CHILD, UNMARRIED, OTHER-RELATIVE or MISSING
AND Hours-Per-Week >= 35.5 or MISSING
AND Capital-Gain >= 7073.59
then
Tree Node Identifier = 11
Number of Observations = 285
Predicted: Salary=>50K = 0.99
Predicted: Salary=<=50K = 0.01
*------------------------------------------------------------*
Node = 13
*------------------------------------------------------------*
if Relationship IS ONE OF: HUSBAND, WIFE
AND Education-Num < 12.5 or MISSING
AND Capital-Gain >= 5095.5
then
Tree Node Identifier = 13
Number of Observations = 522
Predicted: Salary=>50K = 0.98
Predicted: Salary=<=50K = 0.02
*------------------------------------------------------------*
Node = 15
*------------------------------------------------------------*
if Relationship IS ONE OF: HUSBAND, WIFE
AND Education-Num >= 12.5
AND Capital-Gain >= 5095.5
then
Tree Node Identifier = 15
Number of Observations = 678
Predicted: Salary=>50K = 1.00
Predicted: Salary=<=50K = 0.00
*------------------------------------------------------------*
Node = 161
*------------------------------------------------------------*
if Workclass IS ONE OF: STATE-GOV, SELF-EMP-NOT-INC
AND Relationship IS ONE OF: HUSBAND, WIFE
AND Occupation IS ONE OF: ADM-CLERICAL, EXEC-MANAGERIAL, PROF-SPECIALTY, SALES, TECH-SUPPORT,
PROTECTIVE-SERV
AND Hours-Per-Week >= 37.5 or MISSING
AND Education-Num < 12.5 AND Education-Num >= 9.5 or MISSING
AND Capital-Loss < 1512 or MISSING
AND Capital-Gain < 5095.5 or MISSING
AND Age >= 33.5 or MISSING
then
Tree Node Identifier = 161
Number of Observations = 157
Predicted: Salary=>50K = 0.39
Predicted: Salary=<=50K = 0.61
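These printed rules translate almost mechanically into code. The sketch below covers only nodes 11, 13 and 15 from the output above, ignores the "or MISSING" branches, and returns None for records no listed rule covers:

```python
# Nodes 11, 13 and 15 of the SAS decision tree, transcribed as a function.
# "or MISSING" handling is omitted for brevity.
def tree_node(relationship, education_num, hours_per_week, capital_gain):
    if relationship in {"NOT-IN-FAMILY", "OWN-CHILD", "UNMARRIED", "OTHER-RELATIVE"} \
            and hours_per_week >= 35.5 and capital_gain >= 7073.59:
        return 11       # predicted P(salary > 50K) = 0.99
    if relationship in {"HUSBAND", "WIFE"} and capital_gain >= 5095.5:
        if education_num < 12.5:
            return 13   # predicted P(salary > 50K) = 0.98
        return 15       # predicted P(salary > 50K) = 1.00
    return None         # record not covered by the nodes shown here

print(tree_node("HUSBAND", education_num=10, hours_per_week=40, capital_gain=6000))
```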
Fit Statistics
Target=Salary Target Label=' '
Statistic   Statistics Label             Train
_NOBS_      Sum of Frequencies           32561.00
_MISC_      Misclassification Rate       0.14
_MAX_       Maximum Absolute Error       1.00
_SSE_       Sum of Squared Errors        6368.80
_ASE_       Average Squared Error        0.10
_RASE_      Root Average Squared Error   0.31
_DIV_       Divisor for ASE              65122.00
_DFT_       Total Degrees of Freedom     32561.00
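These statistics are internally consistent: the average squared error is the sum of squared errors divided by the ASE divisor, and the root average squared error is its square root. A quick check:

```python
import math

# Values copied from the fit statistics table above.
sse = 6368.80
divisor = 65122.00

ase = sse / divisor          # average squared error
rase = math.sqrt(ase)        # root average squared error
print(round(ase, 2), round(rase, 2))
```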
CONCLUSION
1. If the age is around 24 years and the education level is 11th grade, 12th
grade or some college, the income is likely to be less than $50K.
2. If the age is around 24 years and the person works for a private firm, the
income is likely to be less than $50K.
3. If the age is around 24 years, the income is likely to be less than $50K.
4. 90% of females have a salary of at most $50K, whereas only 60% of males do.
5. 95% of the population in the Other-service occupation category have a
salary of at most $50K.
6. 95% of the population in the 23-to-24 age group have a salary of at most
$50K.
We made a couple of runs of the J48 classifier and found the following five
attributes to be the most useful in predicting a person's income:
education-num, age, workclass, hours-per-week, and the class attribute salary.
These columns provide an accuracy of 80% in predicting the income.