SlideShare ist ein Scribd-Unternehmen logo
1 von 5
Downloaden Sie, um offline zu lesen
IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
__________________________________________________________________________________________
Volume: 02 Issue: 09 | Sep-2013, Available @ http://www.ijret.org 106
PRIVACY PRESERVING DATA MINING IN FOUR GROUP
RANDOMIZED RESPONSE TECHNIQUE USING ID3 AND CART
ALGORITHM
Monika Soni1
, Vishal Shrivastava2
1
M. Tech. Scholar, 2
Associate Professor, Arya College of Engineering and IT, Rajasthan, India,
12.monika@gmail.com, vishal500371@yahoo.co.in
Abstract
Data mining is a process in which data collected from different sources is analyzed for useful information. Data mining is also known
as knowledge discovery in database (KDD). Privacy and accuracy are the important issues in data mining when data is shared. Most
of the methods use random permutation techniques to mask the data, for preserving the privacy of sensitive data. Randomize response
techniques were developed for the purpose of protecting surveys privacy and avoiding biased answers. The proposed work thesis is to
enhance the privacy level in RR technique using four group schemes. First according to the algorithm random attributes a, b, c, d
were considered, Then the randomization have been performed on every dataset according to the values of theta. Then ID3 and CART
algorithm are applied on the randomized data. The result shows that by increasing the group, the privacy level will increase. This
work shows that as compared with three group scheme with four groups scheme the accuracy decreases 6% but the privacy increases
65%.
-----------------------------------------------------------------------***-----------------------------------------------------------------------
1. INTRODUCTION OF PROPOSED APPROACH
This work uses ID3 and CART algorithm to enhance the
privacy of the secret data. The problem with the previous work
for three groups of data sets using ID3 algorithm was that it
was not checking the group performance at every step and the
privacy level was not very high. The proposed work increases
the level of privacy by using ID3 and CART algorithms.
Previous work was giving an overall result whereas this work
is implementing it in step by step manner.
1.1 The Basic idea of ID3 Algorithm
ID3 algorithm uses information entropy theory to select
attribute values with maximum information gain in the current
sample sets as the test attribute. The division of the sample
sets is based on the value of the test properties, the numbers of
test attributes decide the number of subsample sets; at the
same time, new leaf nodes grow out of corresponding nodes of
the sample set on the decision tree. ID3 algorithm is given
below:
ID3(S, AL)
Step 1. Create a node V.
Step 2. If S consists of samples with all the same class
C then return V as a leaf node labeled with class
C.
Step 3. If AL is empty, then return V as a leaf-node
with the majority class in y.
Step 4. Select test attribute (TA) among the AL with
the highest information gain.
Step 5. Label node V with TA.
Step 6. For each known value ai of TA
a) Grow a branch from node V for the condition
TA=ai
b) Let si be the set of samples in S for which
TA=ai.
c) If si empty then attach a leaf labeled with the
majority class in S.
d) Else attach the node returned by ID3(si,AL-TA).
1.2 CART Algorithm
Classification and Regression Trees is a classification method
which uses historical data to construct so-called decision trees.
Decision trees are then used to classify new data. In order to
use CART the number of classes should be known a priori.
Decision trees are represented by a set of questions which
splits the learning sample into smaller and smaller parts.
CART asks only yes/no questions. A possible question could
be: ”Is age greater than 50?” or ”Is sex male?”. CART
algorithm will search for all possible variables and all possible
values in order to find the best split – the question that splits
the data into two parts with maximum homogeneity. The
process is then repeated for each of the resulting data
fragments.
Step 1: Find each predictor’s best split.
Step 2: Find the node’s best split.
Step 3: Split the node using its best split found in step 2
if the stopping rules are not satisfied.
IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
__________________________________________________________________________________________
Volume: 02 Issue: 09 | Sep-2013, Available @ http://www.ijret.org 107
2. RANDOMIZED RESPONSE TECHNIQUES
Randomized Response (RR) techniques were developed for
the purpose of protecting survey’s privacy and avoiding
answer bias mainly. The RR technique was designed to reduce
both response bias and non-response bias, in surveys which
ask sensitive questions. It uses probability theory to protect the
privacy of an individual’s response, and has been used
successfully in several sensitive research areas, such as
abortion, drugs and assault. The basic idea of RR is to
scramble the data in such a way that the real status of the
respondent cannot be identified.
In Related-Question Model, instead of asking each respondent
whether he/she has attribute 0, the interviewer asks each
respondent two related questions, the answers to which are
opposite to each other. For example, the questions could be
like the following. If the statement is correct, the respondent
answers “yes”; otherwise he/she answers “no”. Similar as
described in.
 I have the sensitive attribute A.
 I do not have the sensitive attribute A.
Respondents use a randomizing device to decide which
question to answer, without letting the interviewer know
which question is answered. The probability of choosing the
first question is Ɵ, and the probability of choosing the second
question is 1-Ɵ.Although the interviewer learns the responses
(e.g., “yes” or “no”), he/she does not know which question
was answered by the respondents. In this way the respondents’
privacy is preserved. Since the interviewer’s wants to know
the answer to the first question, and the answer to the second
question is exactly the opposite to the answer for the first one,
if the respondent chooses to answer the first question, then it
say that he/she is telling the truth; if the respondent chooses to
answer the second question, then it say that he/she is telling a
lie. To estimate the percentage of people who has the attribute
A. The following equations can be used:
𝑃∗
𝐴 = 𝑦𝑒𝑠 = 𝑃 𝐴 = 𝑦𝑒𝑠 . Ɵ + P A = no . (1 − 𝜃Ɵ)
𝑃∗
𝐴 = 𝑛𝑜 = 𝑃 𝐴 = 𝑛𝑜 . Ɵ + P A = yes . (1 − Ɵ𝜃) (4)
Where P*(A=yes) (resp. p*(A=no)) is the proportion of the
“yes” (resp. “no”) responses obtained from the survey data,
and P (A=yes) (resp. P (A=no)) is the estimated proportion of
the “yes” (resp. “no”) responses to the sensitive questions.
2.1 One-Group Scheme
In the one-group scheme, all the attributes are put in the same
group, and all the attributes are either reversed together or
keeping the same values. In other words, when sending the
private data to the central database, users either tell the truth
about all their answers to the sensitive questions or tell the lie
about all their answers. The probability for the first event is θ
and the probability for the second event is (1 − θ).
The general model for the one-group scheme is described in
the following:
𝑃∗
𝐸 = 𝑃 𝐸 . 𝜃 + 𝑃(𝐸). (1 − 𝜃)
P∗
E = P E . θ + P(E). (1 − θ)
Using the matrix form, let M1 denote the co efficiency matrix
of the above equations, the Matrix
p∗
(E)
P∗
(E)
=M1
P(E)
P(E)
, where M1 =
θ (1 − θ)
1 − θ θ
2.2 Two-Group Scheme
In the one-group scheme, if the interviewer somehow knows
whether the respondents tell a truth or a lie for one attribute,
he/she can immediately obtain all the true values of a
respondent’s response for all other attributes. To improve
data’s privacy level, data providers divide all the attributes
into two groups.
Then it has the following equation:
P∗
E1 E2 = P E1 E2 . θ2
+ P E1E2 . θ 1 − θ
+P E1E2 . θ 1 − θ + P E1E2 . (1 − θ)2
There are four unknown variables in the above equation:
(P E1 E2 , P E1E2 , P E1E2 , P E1E2 ) (9)
To solve the above equation, three more equations are needed
, derive them using the similar method.
P∗
E1 E2
P∗
(E1 E2
P∗
E1E2
P∗
(E1E2)
= M2 .
P E1 E2
P E1E2
P E1E2
P E1E2
The final equations are described in the following
Where M2 =
θ2
θ 1 − θ
θ 1 − θ θ2
θ 1 − θ 1 − θ 2
1 − θ 2
θ 1 − θ
θ 1 − θ 1 − θ 2
1 − θ 2
θ 1 − θ
θ2
θ 1 − θ
θ 1 − θ θ2
2.3 Three-Group Scheme
To further preserve the data’s privacy, the attributes are
partitioned into three groups, and disguised each group
independently. The model can be derived using the similar
way as it was done for the two-group model. The model for
the three-group scheme is as follows:
IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
__________________________________________________________________________________________
Volume: 02 Issue: 09 | Sep-2013, Available @ http://www.ijret.org 108
P∗
(E1 E2 E3 )
P∗
(E1E2E3 )
P∗
E1E2E3
P∗
(E1E2E3)
P∗
(E1E2E3)
P∗
(E1E2E3)
P∗
(E1E2E3)
P∗
(E1E2E3)
= M3 =
P(E1 E2 E3 )
P(E1E2E3 )
P E1E2E3
P(E1E2E3)
P(E1E2E3)
P(E1E2E3)
P(E1E2E3)
P(E1E2E3)
M3 =
θ3
θ2
(1 − θ) θ2
(1 − θ) θ(1 − θ)2
θ2
(1 − θ) θ(1 − θ)2
θ(1 − θ)2
(1 − θ)3
θ2
(1 − θ) θ3
θ(1 − θ)2
θ2
(1 − θ) θ(1 − θ)2
θ(1 − θ)2
(1 − θ)3
θ(1 − θ)2
θ2
(1 − θ) θ(1 − θ)2
θ3
θ2
(1 − θ) θ(1 − θ)2
(1 − θ)3
θ2
(1 − θ) θ(1 − θ)2
θ(1 − θ)2
θ2
(1 − θ) θ2
(1 − θ) θ3
(1 − θ)3
θ(1 − θ)2
θ(1 − θ)2
θ2
(1 − θ)
θ2
(1 − θ) θ(1 − θ)2
θ(1 − θ)2
(1 − θ)3
θ3
θ2
(1 − θ) θ2
(1 − θ) θ(1 − θ)2
θ(1 − θ)2
θ2
(1 − θ) (1 − θ)3
θ(1 − θ)2
θ2
(1 − θ) θ3
θ(1 − θ)2
θ2
(1 − θ)
θ(1 − θ)2
(1 − θ)3
θ2
(1 − θ) θ(1 − θ)2
θ2
(1 − θ) θ(1 − θ)2
θ3
θ2
(1 − θ)
(1 − θ)3
θ(1 − θ)2
θ(1 − θ)2
θ2
(1 − θ) θ(1 − θ)2
θ2
(1 − θ) θ2
(1 − θ) θ3
(13)
Where M3the coefficiency matrix and similar techniques can
be employed to extend the above schemes to four-group
scheme
2.4 Four group scheme
To further preserve privacy the attributes can be partition into
four groups, and disguised each group independently. The
model for the four group scheme is as follows:
P∗
(E1E2E3E4)
P∗
(E1E2E3E4)
P∗
(E1E2E3E4)
P∗
(E1E2E3E4)
P∗
(E1E2E3E4)
P∗
(E1E2E3E4)
P∗
(E1E2E3E4)
P∗
(E1E2E3E4)
P∗
(E1E2E3E4)
P∗
(E1E2E3E4)
P∗
(E1E2E3E4)
P∗
(E1E2E3E4)
P∗
(E1E2E3E4)
P∗
(E1E2E3E4)
P∗
(E1E2E3E4)
P∗
(E1E2E3E4)
P∗
(E1E2E3E4)
=M4
P(E1E2E3E4)
P(E1E2E3E4)
P(E1E2E3E4)
P(E1E2E3E4)
P(E1E2E3E4)
P(E1E2E3E4)
P(E1E2E3E4)
P(E1E2E3E4)
P(E1E2E3E4)
P(E1E2E3E4)
P(E1E2E3E4)
P(E1E2E3E4)
P(E1E2E3E4)
P(E1E2E3E4)
P(E1E2E3E4)
P(E1E2E3E4)
P(E1E2E3E4)
Where M4 is the co efficiency matrix.
3. PROCESSING THE STEPS ARE GIVEN
BELOW FOR IMPLEMENTING THE WORK:
1. Since it is considered that the dataset contains only
binary data. First transformed the non binary data
into binary data. For this there are many methods
available in MATLAB.
2. Divided the data randomly in groups as follows:
 Four random attributes are taken namely a, b, c
and d.
 The dataset are grouped in to 1, 2, 3 and 4 level
according to the following manner
group1=all data attributes are considered
as single group and performed the above
operations
group2=data set divided into ab and cd and performed the
randomization and id3 algorithm group3=data set divided into
ab, c, d group4=data set divided into a, b, c, d
Following steps are used on each dataset and on each group
for different values of Ө. Here related question model of
randomized response technique is used.
Ө = [0.0, 0.1, 0.2, 0.3, 0.45, 0.51, 0.55, 0.6, 0.7, 0.8, 0.9 1];
For randomization it generates a random number r from 0 to 1
using uniform distribution.
a. Randomization
For one-group scheme, a disguised data set G is
created. For each record in the training data set D, A
random number r from 0 to 1 is generated using
uniform distribution.
IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
__________________________________________________________________________________________
Volume: 02 Issue: 09 | Sep-2013, Available @ http://www.ijret.org 109
b. Build the decision tree
For building the decision tree CART and ID3
algorithms are used with disguised data set G.
c. Testing
Training dataset is used for testing. Test with the
original data set.
d. Repeat the steps a to c for 50 times. Then compute
the mean and variance.
On after the calculation of the gain, it is checked against the
original data to find out how the accuracy and privacy is
affected
4. RESULT ANALYSIS
For each group and each data set mean and variance are
calculated. Mean value of dataset are calculated for finding
the accuracy for different groups of different dataset.
4.1 Accuracy between Three Groups and Four
Groups
The figure 1 shows the accuracy using three group scheme and
figure 2 shows the accuracy using four group schemes. The
result shows that by using three groups scheme the accuracy is
25.2% but the accuracy will decrease up to 19.2% while using
the four group scheme. The accuracy will decrease 6% when
the data is divided from three groups to four groups. When
this work is compared with previous work then the results
shows that the accuracy is negligible decreased but the privacy
of datasets are increased which is described in next section
.
Figure 1: Accuracy in percentage using three groups scheme
Figure 2: Accuracy in percentage using four groups scheme
4.2 Privacy between Three Groups and Four Groups
The figure 3 shows the privacy using three group scheme and
figure 4 shows the privacy using four group schemes. The
result shows that by using three groups scheme the privacy is
29% and the privacy will increase up to 94% while using the
four group scheme. The privacy will increase 65% when the
data is divided from three groups to four groups. When this
work is compared with previous work then the results shows
that the privacy is increased at very high level and the
decrease in accuracy is negligible.
Figure 3: Privacy using three group schemes
Figure 4: Privacy using four group schemes
IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
__________________________________________________________________________________________
Volume: 02 Issue: 09 | Sep-2013, Available @ http://www.ijret.org 110
CONCLUSIONS
This thesis gives a different approach for enhancing the
privacy of the sensitive data in datamining. In this thesis, the
four group randomized response technique is used in privacy
preserving algorithm. To support the work CART and ID3
algorithm are used. In this experiment first applied
randomized response techniques on one, two, three and four
groups. The ID3 and CART algorithm are applied on the
randomized data. This work shows that as compared with
three group scheme with four groups scheme the accuracy
decreases 6% but the privacy increases 65%.
By the experiment it is concluded that with increasing the
level of grouping the privacy of the dataset will be enhanced
and the accuracy level decreases but it is negligible.
REFERENCES
[1] Carlos N. Bouza1, Carmelo Herrera, Pasha G.
Mitra,”A Review Of Randomized Responses
Procedures The Qualitative Variable Case”, Revista
Investigación Operacional VOL., 31 , No. 3, 240-247
2010
[2] Jasbir Malik, Rajkumar, “A Hybrid Approach Using
C Mean and CART for Classification in Data
Mining”, IJCSMS, Vol. 12, Issue 03, Sept 2012
[3] Zhouxuan Teng, Wenliang Du,”A Hybrid Multi-
Group Privacy-Preserving Approach for Building
Decision Trees”,
[4] Zhijun Zhan, Wenliang Du, “Privacy-Preserving Data
Mining Using Multi-Group Randomized Response
Techniques” 2010
[5] Raj Kumar, Dr. Rajesh Verma, “Classification
Algorithms for Data Mining: A Survey”, IJIET, Vol.
1 Issue 2 August 2012”
[6] Monika Soni, Vishal Srivastava, “Privacy Preserving
Data Mining: Comparion of Three Groups and Four
Groups Randomized Response Techniques”
IJRITCC, 1 Issue 7 Volume.
BIOGRAPHIES:
Mrs. Monika Soni Pursuing M. Tech. in
Computer Science She has published many
national and international research papers.
She has written 3 books for engineering and
engineering diploma.
Mr. Vishal Shrivastava working as
Assistant Professor in Arya College & IT
He has published many national and
international research papers. He has very
depth knowledge of his research areas.

Weitere ähnliche Inhalte

Was ist angesagt?

IJCSI-10-6-1-288-292
IJCSI-10-6-1-288-292IJCSI-10-6-1-288-292
IJCSI-10-6-1-288-292
HARDIK SINGH
 

Was ist angesagt? (16)

A Fuzzy Logic Intelligent Agent for Information Extraction
A Fuzzy Logic Intelligent Agent for Information ExtractionA Fuzzy Logic Intelligent Agent for Information Extraction
A Fuzzy Logic Intelligent Agent for Information Extraction
 
Parametric Sensitivity Analysis of a Mathematical Model of Two Interacting Po...
Parametric Sensitivity Analysis of a Mathematical Model of Two Interacting Po...Parametric Sensitivity Analysis of a Mathematical Model of Two Interacting Po...
Parametric Sensitivity Analysis of a Mathematical Model of Two Interacting Po...
 
IRJET- Machine Learning: Survey, Types and Challenges
IRJET- Machine Learning: Survey, Types and ChallengesIRJET- Machine Learning: Survey, Types and Challenges
IRJET- Machine Learning: Survey, Types and Challenges
 
K044065257
K044065257K044065257
K044065257
 
Repurposing Classification & Regression Trees for Causal Research with High-D...
Repurposing Classification & Regression Trees for Causal Research with High-D...Repurposing Classification & Regression Trees for Causal Research with High-D...
Repurposing Classification & Regression Trees for Causal Research with High-D...
 
Saif_CCECE2007_full_paper_submitted
Saif_CCECE2007_full_paper_submittedSaif_CCECE2007_full_paper_submitted
Saif_CCECE2007_full_paper_submitted
 
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
 
Repurposing predictive tools for causal research
Repurposing predictive tools for causal researchRepurposing predictive tools for causal research
Repurposing predictive tools for causal research
 
Avoid Overfitting with Regularization
Avoid Overfitting with RegularizationAvoid Overfitting with Regularization
Avoid Overfitting with Regularization
 
To Explain, To Predict, or To Describe?
To Explain, To Predict, or To Describe?To Explain, To Predict, or To Describe?
To Explain, To Predict, or To Describe?
 
Implementation of Improved ID3 Algorithm to Obtain more Optimal Decision Tree.
Implementation of Improved ID3 Algorithm to Obtain more Optimal Decision Tree.Implementation of Improved ID3 Algorithm to Obtain more Optimal Decision Tree.
Implementation of Improved ID3 Algorithm to Obtain more Optimal Decision Tree.
 
25 16285 a hybrid ijeecs 012 edit septian
25 16285 a hybrid ijeecs 012 edit septian25 16285 a hybrid ijeecs 012 edit septian
25 16285 a hybrid ijeecs 012 edit septian
 
E1802023741
E1802023741E1802023741
E1802023741
 
IRJET- Missing Data Imputation by Evidence Chain
IRJET- Missing Data Imputation by Evidence ChainIRJET- Missing Data Imputation by Evidence Chain
IRJET- Missing Data Imputation by Evidence Chain
 
IJCSI-10-6-1-288-292
IJCSI-10-6-1-288-292IJCSI-10-6-1-288-292
IJCSI-10-6-1-288-292
 
COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...
COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...
COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...
 

Ähnlich wie Privacy preserving data mining in four group randomized response technique using id3 and cart algorithm

A Preference Model on Adaptive Affinity Propagation
A Preference Model on Adaptive Affinity PropagationA Preference Model on Adaptive Affinity Propagation
A Preference Model on Adaptive Affinity Propagation
IJECEIAES
 

Ähnlich wie Privacy preserving data mining in four group randomized response technique using id3 and cart algorithm (20)

Perfomance Comparison of Decsion Tree Algorithms to Findout the Reason for St...
Perfomance Comparison of Decsion Tree Algorithms to Findout the Reason for St...Perfomance Comparison of Decsion Tree Algorithms to Findout the Reason for St...
Perfomance Comparison of Decsion Tree Algorithms to Findout the Reason for St...
 
Enhanced ID3 algorithm based on the weightage of the Attribute
Enhanced ID3 algorithm based on the weightage of the AttributeEnhanced ID3 algorithm based on the weightage of the Attribute
Enhanced ID3 algorithm based on the weightage of the Attribute
 
In data streams using classification and clustering different techniques to f...
In data streams using classification and clustering different techniques to f...In data streams using classification and clustering different techniques to f...
In data streams using classification and clustering different techniques to f...
 
In data streams using classification and clustering
In data streams using classification and clusteringIn data streams using classification and clustering
In data streams using classification and clustering
 
Classification of Machine Learning Algorithms
Classification of Machine Learning AlgorithmsClassification of Machine Learning Algorithms
Classification of Machine Learning Algorithms
 
Data mining techniques
Data mining techniquesData mining techniques
Data mining techniques
 
Ijcatr04041015
Ijcatr04041015Ijcatr04041015
Ijcatr04041015
 
CCC-Bicluster Analysis for Time Series Gene Expression Data
CCC-Bicluster Analysis for Time Series Gene Expression DataCCC-Bicluster Analysis for Time Series Gene Expression Data
CCC-Bicluster Analysis for Time Series Gene Expression Data
 
A Preference Model on Adaptive Affinity Propagation
A Preference Model on Adaptive Affinity PropagationA Preference Model on Adaptive Affinity Propagation
A Preference Model on Adaptive Affinity Propagation
 
Data mining techniques a survey paper
Data mining techniques a survey paperData mining techniques a survey paper
Data mining techniques a survey paper
 
Machine learning and decision trees
Machine learning and decision treesMachine learning and decision trees
Machine learning and decision trees
 
An Heterogeneous Population-Based Genetic Algorithm for Data Clustering
An Heterogeneous Population-Based Genetic Algorithm for Data ClusteringAn Heterogeneous Population-Based Genetic Algorithm for Data Clustering
An Heterogeneous Population-Based Genetic Algorithm for Data Clustering
 
Efficient classification of big data using vfdt (very fast decision tree)
Efficient classification of big data using vfdt (very fast decision tree)Efficient classification of big data using vfdt (very fast decision tree)
Efficient classification of big data using vfdt (very fast decision tree)
 
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARECLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
 
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARECLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
 
Sampling
 Sampling Sampling
Sampling
 
Comparative study of various supervisedclassification methodsforanalysing def...
Comparative study of various supervisedclassification methodsforanalysing def...Comparative study of various supervisedclassification methodsforanalysing def...
Comparative study of various supervisedclassification methodsforanalysing def...
 
130509
130509130509
130509
 
130509
130509130509
130509
 
Depression Detection Using Various Machine Learning Classifiers
Depression Detection Using Various Machine Learning ClassifiersDepression Detection Using Various Machine Learning Classifiers
Depression Detection Using Various Machine Learning Classifiers
 

Mehr von eSAT Journals

Material management in construction – a case study
Material management in construction – a case studyMaterial management in construction – a case study
Material management in construction – a case study
eSAT Journals
 
Laboratory studies of dense bituminous mixes ii with reclaimed asphalt materials
Laboratory studies of dense bituminous mixes ii with reclaimed asphalt materialsLaboratory studies of dense bituminous mixes ii with reclaimed asphalt materials
Laboratory studies of dense bituminous mixes ii with reclaimed asphalt materials
eSAT Journals
 
Geographical information system (gis) for water resources management
Geographical information system (gis) for water resources managementGeographical information system (gis) for water resources management
Geographical information system (gis) for water resources management
eSAT Journals
 
Estimation of surface runoff in nallur amanikere watershed using scs cn method
Estimation of surface runoff in nallur amanikere watershed using scs cn methodEstimation of surface runoff in nallur amanikere watershed using scs cn method
Estimation of surface runoff in nallur amanikere watershed using scs cn method
eSAT Journals
 
Effect of variation of plastic hinge length on the results of non linear anal...
Effect of variation of plastic hinge length on the results of non linear anal...Effect of variation of plastic hinge length on the results of non linear anal...
Effect of variation of plastic hinge length on the results of non linear anal...
eSAT Journals
 

Mehr von eSAT Journals (20)

Mechanical properties of hybrid fiber reinforced concrete for pavements
Mechanical properties of hybrid fiber reinforced concrete for pavementsMechanical properties of hybrid fiber reinforced concrete for pavements
Mechanical properties of hybrid fiber reinforced concrete for pavements
 
Material management in construction – a case study
Material management in construction – a case studyMaterial management in construction – a case study
Material management in construction – a case study
 
Managing drought short term strategies in semi arid regions a case study
Managing drought    short term strategies in semi arid regions  a case studyManaging drought    short term strategies in semi arid regions  a case study
Managing drought short term strategies in semi arid regions a case study
 
Life cycle cost analysis of overlay for an urban road in bangalore
Life cycle cost analysis of overlay for an urban road in bangaloreLife cycle cost analysis of overlay for an urban road in bangalore
Life cycle cost analysis of overlay for an urban road in bangalore
 
Laboratory studies of dense bituminous mixes ii with reclaimed asphalt materials
Laboratory studies of dense bituminous mixes ii with reclaimed asphalt materialsLaboratory studies of dense bituminous mixes ii with reclaimed asphalt materials
Laboratory studies of dense bituminous mixes ii with reclaimed asphalt materials
 
Laboratory investigation of expansive soil stabilized with natural inorganic ...
Laboratory investigation of expansive soil stabilized with natural inorganic ...Laboratory investigation of expansive soil stabilized with natural inorganic ...
Laboratory investigation of expansive soil stabilized with natural inorganic ...
 
Influence of reinforcement on the behavior of hollow concrete block masonry p...
Influence of reinforcement on the behavior of hollow concrete block masonry p...Influence of reinforcement on the behavior of hollow concrete block masonry p...
Influence of reinforcement on the behavior of hollow concrete block masonry p...
 
Influence of compaction energy on soil stabilized with chemical stabilizer
Influence of compaction energy on soil stabilized with chemical stabilizerInfluence of compaction energy on soil stabilized with chemical stabilizer
Influence of compaction energy on soil stabilized with chemical stabilizer
 
Geographical information system (gis) for water resources management
Geographical information system (gis) for water resources managementGeographical information system (gis) for water resources management
Geographical information system (gis) for water resources management
 
Forest type mapping of bidar forest division, karnataka using geoinformatics ...
Forest type mapping of bidar forest division, karnataka using geoinformatics ...Forest type mapping of bidar forest division, karnataka using geoinformatics ...
Forest type mapping of bidar forest division, karnataka using geoinformatics ...
 
Factors influencing compressive strength of geopolymer concrete
Factors influencing compressive strength of geopolymer concreteFactors influencing compressive strength of geopolymer concrete
Factors influencing compressive strength of geopolymer concrete
 
Experimental investigation on circular hollow steel columns in filled with li...
Experimental investigation on circular hollow steel columns in filled with li...Experimental investigation on circular hollow steel columns in filled with li...
Experimental investigation on circular hollow steel columns in filled with li...
 
Experimental behavior of circular hsscfrc filled steel tubular columns under ...
Experimental behavior of circular hsscfrc filled steel tubular columns under ...Experimental behavior of circular hsscfrc filled steel tubular columns under ...
Experimental behavior of circular hsscfrc filled steel tubular columns under ...
 
Evaluation of punching shear in flat slabs
Evaluation of punching shear in flat slabsEvaluation of punching shear in flat slabs
Evaluation of punching shear in flat slabs
 
Evaluation of performance of intake tower dam for recent earthquake in india
Evaluation of performance of intake tower dam for recent earthquake in indiaEvaluation of performance of intake tower dam for recent earthquake in india
Evaluation of performance of intake tower dam for recent earthquake in india
 
Evaluation of operational efficiency of urban road network using travel time ...
Evaluation of operational efficiency of urban road network using travel time ...Evaluation of operational efficiency of urban road network using travel time ...
Evaluation of operational efficiency of urban road network using travel time ...
 
Estimation of surface runoff in nallur amanikere watershed using scs cn method
Estimation of surface runoff in nallur amanikere watershed using scs cn methodEstimation of surface runoff in nallur amanikere watershed using scs cn method
Estimation of surface runoff in nallur amanikere watershed using scs cn method
 
Estimation of morphometric parameters and runoff using rs & gis techniques
Estimation of morphometric parameters and runoff using rs & gis techniquesEstimation of morphometric parameters and runoff using rs & gis techniques
Estimation of morphometric parameters and runoff using rs & gis techniques
 
Effect of variation of plastic hinge length on the results of non linear anal...
Effect of variation of plastic hinge length on the results of non linear anal...Effect of variation of plastic hinge length on the results of non linear anal...
Effect of variation of plastic hinge length on the results of non linear anal...
 
Effect of use of recycled materials on indirect tensile strength of asphalt c...
Effect of use of recycled materials on indirect tensile strength of asphalt c...Effect of use of recycled materials on indirect tensile strength of asphalt c...
Effect of use of recycled materials on indirect tensile strength of asphalt c...
 

Kürzlich hochgeladen

DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakes
MayuraD1
 
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
Health
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
Neometrix_Engineering_Pvt_Ltd
 

Kürzlich hochgeladen (20)

Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
 
Bridge Jacking Design Sample Calculation.pptx
Bridge Jacking Design Sample Calculation.pptxBridge Jacking Design Sample Calculation.pptx
Bridge Jacking Design Sample Calculation.pptx
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network Devices
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakes
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 
kiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadkiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal load
 
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
 
Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdf
 
2016EF22_0 solar project report rooftop projects
2016EF22_0 solar project report rooftop projects2016EF22_0 solar project report rooftop projects
2016EF22_0 solar project report rooftop projects
 
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
 
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 

Privacy preserving data mining in four group randomized response technique using id3 and cart algorithm

  • 1. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308 __________________________________________________________________________________________ Volume: 02 Issue: 09 | Sep-2013, Available @ http://www.ijret.org 106 PRIVACY PRESERVING DATA MINING IN FOUR GROUP RANDOMIZED RESPONSE TECHNIQUE USING ID3 AND CART ALGORITHM Monika Soni1 , Vishal Shrivastava2 1 M. Tech. Scholar, 2 Associate Professor, Arya College of Engineering and IT, Rajasthan, India, 12.monika@gmail.com, vishal500371@yahoo.co.in Abstract Data mining is a process in which data collected from different sources is analyzed for useful information. Data mining is also known as knowledge discovery in database (KDD). Privacy and accuracy are the important issues in data mining when data is shared. Most of the methods use random permutation techniques to mask the data, for preserving the privacy of sensitive data. Randomize response techniques were developed for the purpose of protecting surveys privacy and avoiding biased answers. The proposed work thesis is to enhance the privacy level in RR technique using four group schemes. First according to the algorithm random attributes a, b, c, d were considered, Then the randomization have been performed on every dataset according to the values of theta. Then ID3 and CART algorithm are applied on the randomized data. The result shows that by increasing the group, the privacy level will increase. This work shows that as compared with three group scheme with four groups scheme the accuracy decreases 6% but the privacy increases 65%. -----------------------------------------------------------------------***----------------------------------------------------------------------- 1. INTRODUCTION OF PROPOSED APPROACH This work uses ID3 and CART algorithm to enhance the privacy of the secret data. The problem with the previous work for three groups of data sets using ID3 algorithm was that it was not checking the group performance at every step and the privacy level was not very high. The proposed work increases the level of privacy by using ID3 and CART algorithms. Previous work was giving an overall result whereas this work is implementing it in step by step manner. 1.1 The Basic idea of ID3 Algorithm ID3 algorithm uses information entropy theory to select attribute values with maximum information gain in the current sample sets as the test attribute. The division of the sample sets is based on the value of the test properties, the numbers of test attributes decide the number of subsample sets; at the same time, new leaf nodes grow out of corresponding nodes of the sample set on the decision tree. ID3 algorithm is given below: ID3(S, AL) Step 1. Create a node V. Step 2. If S consists of samples with all the same class C then return V as a leaf node labeled with class C. Step 3. If AL is empty, then return V as a leaf-node with the majority class in y. Step 4. Select test attribute (TA) among the AL with the highest information gain. Step 5. Label node V with TA. Step 6. For each known value ai of TA a) Grow a branch from node V for the condition TA=ai b) Let si be the set of samples in S for which TA=ai. c) If si empty then attach a leaf labeled with the majority class in S. d) Else attach the node returned by ID3(si,AL-TA). 1.2 CART Algorithm Classification and Regression Trees is a classification method which uses historical data to construct so-called decision trees. Decision trees are then used to classify new data. In order to use CART the number of classes should be known a priori. Decision trees are represented by a set of questions which splits the learning sample into smaller and smaller parts. CART asks only yes/no questions. A possible question could be: ”Is age greater than 50?” or ”Is sex male?”. CART algorithm will search for all possible variables and all possible values in order to find the best split – the question that splits the data into two parts with maximum homogeneity. The process is then repeated for each of the resulting data fragments. Step 1: Find each predictor’s best split. Step 2: Find the node’s best split. Step 3: Split the node using its best split found in step 2 if the stopping rules are not satisfied.
  • 2. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308 __________________________________________________________________________________________ Volume: 02 Issue: 09 | Sep-2013, Available @ http://www.ijret.org 107 2. RANDOMIZED RESPONSE TECHNIQUES Randomized Response (RR) techniques were developed for the purpose of protecting survey’s privacy and avoiding answer bias mainly. The RR technique was designed to reduce both response bias and non-response bias, in surveys which ask sensitive questions. It uses probability theory to protect the privacy of an individual’s response, and has been used successfully in several sensitive research areas, such as abortion, drugs and assault. The basic idea of RR is to scramble the data in such a way that the real status of the respondent cannot be identified. In Related-Question Model, instead of asking each respondent whether he/she has attribute 0, the interviewer asks each respondent two related questions, the answers to which are opposite to each other. For example, the questions could be like the following. If the statement is correct, the respondent answers “yes”; otherwise he/she answers “no”. Similar as described in.  I have the sensitive attribute A.  I do not have the sensitive attribute A. Respondents use a randomizing device to decide which question to answer, without letting the interviewer know which question is answered. The probability of choosing the first question is Ɵ, and the probability of choosing the second question is 1-Ɵ.Although the interviewer learns the responses (e.g., “yes” or “no”), he/she does not know which question was answered by the respondents. In this way the respondents’ privacy is preserved. Since the interviewer’s wants to know the answer to the first question, and the answer to the second question is exactly the opposite to the answer for the first one, if the respondent chooses to answer the first question, then it say that he/she is telling the truth; if the respondent chooses to answer the second question, then it say that he/she is telling a lie. To estimate the percentage of people who has the attribute A. The following equations can be used: 𝑃∗ 𝐴 = 𝑦𝑒𝑠 = 𝑃 𝐴 = 𝑦𝑒𝑠 . Ɵ + P A = no . (1 − 𝜃Ɵ) 𝑃∗ 𝐴 = 𝑛𝑜 = 𝑃 𝐴 = 𝑛𝑜 . Ɵ + P A = yes . (1 − Ɵ𝜃) (4) Where P*(A=yes) (resp. p*(A=no)) is the proportion of the “yes” (resp. “no”) responses obtained from the survey data, and P (A=yes) (resp. P (A=no)) is the estimated proportion of the “yes” (resp. “no”) responses to the sensitive questions. 2.1 One-Group Scheme In the one-group scheme, all the attributes are put in the same group, and all the attributes are either reversed together or keeping the same values. In other words, when sending the private data to the central database, users either tell the truth about all their answers to the sensitive questions or tell the lie about all their answers. The probability for the first event is θ and the probability for the second event is (1 − θ). The general model for the one-group scheme is described in the following: 𝑃∗ 𝐸 = 𝑃 𝐸 . 𝜃 + 𝑃(𝐸). (1 − 𝜃) P∗ E = P E . θ + P(E). (1 − θ) Using the matrix form, let M1 denote the co efficiency matrix of the above equations, the Matrix p∗ (E) P∗ (E) =M1 P(E) P(E) , where M1 = θ (1 − θ) 1 − θ θ 2.2 Two-Group Scheme In the one-group scheme, if the interviewer somehow knows whether the respondents tell a truth or a lie for one attribute, he/she can immediately obtain all the true values of a respondent’s response for all other attributes. To improve data’s privacy level, data providers divide all the attributes into two groups. Then it has the following equation: P∗ E1 E2 = P E1 E2 . θ2 + P E1E2 . θ 1 − θ +P E1E2 . θ 1 − θ + P E1E2 . (1 − θ)2 There are four unknown variables in the above equation: (P E1 E2 , P E1E2 , P E1E2 , P E1E2 ) (9) To solve the above equation, three more equations are needed , derive them using the similar method. P∗ E1 E2 P∗ (E1 E2 P∗ E1E2 P∗ (E1E2) = M2 . P E1 E2 P E1E2 P E1E2 P E1E2 The final equations are described in the following Where M2 = θ2 θ 1 − θ θ 1 − θ θ2 θ 1 − θ 1 − θ 2 1 − θ 2 θ 1 − θ θ 1 − θ 1 − θ 2 1 − θ 2 θ 1 − θ θ2 θ 1 − θ θ 1 − θ θ2 2.3 Three-Group Scheme To further preserve the data’s privacy, the attributes are partitioned into three groups, and disguised each group independently. The model can be derived using the similar way as it was done for the two-group model. The model for the three-group scheme is as follows:
  • 3. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308 __________________________________________________________________________________________ Volume: 02 Issue: 09 | Sep-2013, Available @ http://www.ijret.org 108 P∗ (E1 E2 E3 ) P∗ (E1E2E3 ) P∗ E1E2E3 P∗ (E1E2E3) P∗ (E1E2E3) P∗ (E1E2E3) P∗ (E1E2E3) P∗ (E1E2E3) = M3 = P(E1 E2 E3 ) P(E1E2E3 ) P E1E2E3 P(E1E2E3) P(E1E2E3) P(E1E2E3) P(E1E2E3) P(E1E2E3) M3 = θ3 θ2 (1 − θ) θ2 (1 − θ) θ(1 − θ)2 θ2 (1 − θ) θ(1 − θ)2 θ(1 − θ)2 (1 − θ)3 θ2 (1 − θ) θ3 θ(1 − θ)2 θ2 (1 − θ) θ(1 − θ)2 θ(1 − θ)2 (1 − θ)3 θ(1 − θ)2 θ2 (1 − θ) θ(1 − θ)2 θ3 θ2 (1 − θ) θ(1 − θ)2 (1 − θ)3 θ2 (1 − θ) θ(1 − θ)2 θ(1 − θ)2 θ2 (1 − θ) θ2 (1 − θ) θ3 (1 − θ)3 θ(1 − θ)2 θ(1 − θ)2 θ2 (1 − θ) θ2 (1 − θ) θ(1 − θ)2 θ(1 − θ)2 (1 − θ)3 θ3 θ2 (1 − θ) θ2 (1 − θ) θ(1 − θ)2 θ(1 − θ)2 θ2 (1 − θ) (1 − θ)3 θ(1 − θ)2 θ2 (1 − θ) θ3 θ(1 − θ)2 θ2 (1 − θ) θ(1 − θ)2 (1 − θ)3 θ2 (1 − θ) θ(1 − θ)2 θ2 (1 − θ) θ(1 − θ)2 θ3 θ2 (1 − θ) (1 − θ)3 θ(1 − θ)2 θ(1 − θ)2 θ2 (1 − θ) θ(1 − θ)2 θ2 (1 − θ) θ2 (1 − θ) θ3 (13) Where M3the coefficiency matrix and similar techniques can be employed to extend the above schemes to four-group scheme 2.4 Four group scheme To further preserve privacy the attributes can be partition into four groups, and disguised each group independently. The model for the four group scheme is as follows: P∗ (E1E2E3E4) P∗ (E1E2E3E4) P∗ (E1E2E3E4) P∗ (E1E2E3E4) P∗ (E1E2E3E4) P∗ (E1E2E3E4) P∗ (E1E2E3E4) P∗ (E1E2E3E4) P∗ (E1E2E3E4) P∗ (E1E2E3E4) P∗ (E1E2E3E4) P∗ (E1E2E3E4) P∗ (E1E2E3E4) P∗ (E1E2E3E4) P∗ (E1E2E3E4) P∗ (E1E2E3E4) P∗ (E1E2E3E4) =M4 P(E1E2E3E4) P(E1E2E3E4) P(E1E2E3E4) P(E1E2E3E4) P(E1E2E3E4) P(E1E2E3E4) P(E1E2E3E4) P(E1E2E3E4) P(E1E2E3E4) P(E1E2E3E4) P(E1E2E3E4) P(E1E2E3E4) P(E1E2E3E4) P(E1E2E3E4) P(E1E2E3E4) P(E1E2E3E4) P(E1E2E3E4) Where M4 is the co efficiency matrix. 3. PROCESSING THE STEPS ARE GIVEN BELOW FOR IMPLEMENTING THE WORK: 1. Since it is considered that the dataset contains only binary data. First transformed the non binary data into binary data. For this there are many methods available in MATLAB. 2. Divided the data randomly in groups as follows:  Four random attributes are taken namely a, b, c and d.  The dataset are grouped in to 1, 2, 3 and 4 level according to the following manner group1=all data attributes are considered as single group and performed the above operations group2=data set divided into ab and cd and performed the randomization and id3 algorithm group3=data set divided into ab, c, d group4=data set divided into a, b, c, d Following steps are used on each dataset and on each group for different values of Ө. Here related question model of randomized response technique is used. Ө = [0.0, 0.1, 0.2, 0.3, 0.45, 0.51, 0.55, 0.6, 0.7, 0.8, 0.9 1]; For randomization it generates a random number r from 0 to 1 using uniform distribution. a. Randomization For one-group scheme, a disguised data set G is created. For each record in the training data set D, A random number r from 0 to 1 is generated using uniform distribution.
  • 4. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308 __________________________________________________________________________________________ Volume: 02 Issue: 09 | Sep-2013, Available @ http://www.ijret.org 109 b. Build the decision tree For building the decision tree CART and ID3 algorithms are used with disguised data set G. c. Testing Training dataset is used for testing. Test with the original data set. d. Repeat the steps a to c for 50 times. Then compute the mean and variance. On after the calculation of the gain, it is checked against the original data to find out how the accuracy and privacy is affected 4. RESULT ANALYSIS For each group and each data set mean and variance are calculated. Mean value of dataset are calculated for finding the accuracy for different groups of different dataset. 4.1 Accuracy between Three Groups and Four Groups The figure 1 shows the accuracy using three group scheme and figure 2 shows the accuracy using four group schemes. The result shows that by using three groups scheme the accuracy is 25.2% but the accuracy will decrease up to 19.2% while using the four group scheme. The accuracy will decrease 6% when the data is divided from three groups to four groups. When this work is compared with previous work then the results shows that the accuracy is negligible decreased but the privacy of datasets are increased which is described in next section . Figure 1: Accuracy in percentage using three groups scheme Figure 2: Accuracy in percentage using four groups scheme 4.2 Privacy between Three Groups and Four Groups The figure 3 shows the privacy using three group scheme and figure 4 shows the privacy using four group schemes. The result shows that by using three groups scheme the privacy is 29% and the privacy will increase up to 94% while using the four group scheme. The privacy will increase 65% when the data is divided from three groups to four groups. When this work is compared with previous work then the results shows that the privacy is increased at very high level and the decrease in accuracy is negligible. Figure 3: Privacy using three group schemes Figure 4: Privacy using four group schemes
  • 5. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308 __________________________________________________________________________________________ Volume: 02 Issue: 09 | Sep-2013, Available @ http://www.ijret.org 110 CONCLUSIONS This thesis gives a different approach for enhancing the privacy of the sensitive data in datamining. In this thesis, the four group randomized response technique is used in privacy preserving algorithm. To support the work CART and ID3 algorithm are used. In this experiment first applied randomized response techniques on one, two, three and four groups. The ID3 and CART algorithm are applied on the randomized data. This work shows that as compared with three group scheme with four groups scheme the accuracy decreases 6% but the privacy increases 65%. By the experiment it is concluded that with increasing the level of grouping the privacy of the dataset will be enhanced and the accuracy level decreases but it is negligible. REFERENCES [1] Carlos N. Bouza1, Carmelo Herrera, Pasha G. Mitra,”A Review Of Randomized Responses Procedures The Qualitative Variable Case”, Revista Investigación Operacional VOL., 31 , No. 3, 240-247 2010 [2] Jasbir Malik, Rajkumar, “A Hybrid Approach Using C Mean and CART for Classification in Data Mining”, IJCSMS, Vol. 12, Issue 03, Sept 2012 [3] Zhouxuan Teng, Wenliang Du,”A Hybrid Multi- Group Privacy-Preserving Approach for Building Decision Trees”, [4] Zhijun Zhan, Wenliang Du, “Privacy-Preserving Data Mining Using Multi-Group Randomized Response Techniques” 2010 [5] Raj Kumar, Dr. Rajesh Verma, “Classification Algorithms for Data Mining: A Survey”, IJIET, Vol. 1 Issue 2 August 2012” [6] Monika Soni, Vishal Srivastava, “Privacy Preserving Data Mining: Comparion of Three Groups and Four Groups Randomized Response Techniques” IJRITCC, 1 Issue 7 Volume. BIOGRAPHIES: Mrs. Monika Soni Pursuing M. Tech. in Computer Science She has published many national and international research papers. She has written 3 books for engineering and engineering diploma. Mr. Vishal Shrivastava working as Assistant Professor in Arya College & IT He has published many national and international research papers. He has very depth knowledge of his research areas.