1. Towards A Differential Privacy and Utility Preserving Machine Learning Classifier
Kato Mivule, Claude Turner, and Soo-Yeon Ji
Computer Science Department
Bowie State University
Complex Adaptive Systems 2012 – Washington DC USA, November 14-16
2. Outline
Introduction
Related work
Essential Terms
Methodology
Results
Conclusion
3. Introduction
Entities transact in ‘big data’ containing personally identifiable
information (PII).
Organizations are bound by federal and state law to ensure data privacy.
In the process of achieving privacy, the utility of privatized datasets
diminishes.
Achieving balance between privacy and utility is an ongoing problem.
Therefore, we investigate a differential privacy preserving machine
learning classification approach that seeks an acceptable level of
utility.
4. Related Work
There is a growing interest in investigating privacy preserving data mining
solutions that provide a balance between data privacy and utility.
Kifer and Gehrke (2006) did a broad study of enhanced data utility in
privacy preserving data publishing by using statistical approaches.
Wong (2007) described how achieving global optimal privacy while
maintaining utility is an NP-hard problem.
Krause and Horvitz (2010) noted that finding trade-offs between privacy
and utility remains an NP-hard problem.
Muralidhar and Sarathy (2011) showed that differential privacy provides
strong privacy guarantees but utility is still a problem due to noise levels.
Finding the optimal balance between privacy and utility remains a
challenge, even with differential privacy.
5. Data Utility versus Privacy
Data utility is the extent to which a published dataset is useful to
its consumers.
In the course of a data privacy process, original data will lose statistical
value despite privacy guarantees.
Image Source: Kenneth Corbin/Internet News.
6. Objective
Achieving an optimal balance between data privacy and utility
remains an ongoing challenge.
Such optimality is highly desired and remains our investigation goal.
Image Source: Wikipedia, on Confidentiality.
7. Ensemble classification
A machine learning process in which a collection of independently
trained classifiers is merged to achieve better prediction.
For example, individually trained decision trees can be joined to make
more accurate predictions.
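As a minimal illustration of the idea, independently trained classifiers can be merged by majority vote; the decision stumps and donation values below are hypothetical, not the paper's classifiers (which were built in MATLAB):

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Combine independently trained classifiers by majority vote."""
    votes = [clf(x) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Three hypothetical decision stumps over a donation amount.
stumps = [
    lambda amt: "low" if amt < 50 else "high",
    lambda amt: "low" if amt < 45 else "high",
    lambda amt: "low" if amt < 90 else "high",
]

label = majority_vote(stumps, 60)   # two of three stumps vote "high"
```

Merging many such weak, independently trained classifiers is what the ensemble methods on the next slides build on.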
8. AdaBoost Ensemble – Adaptive Boosting
Proposed by Freund and Schapire (1995), AdaBoost runs several iterations, adding weak
learners to create a strong learner and adjusting weights to focus on data
misclassified in earlier iterations.
Classification Error in AdaBoost Ensemble is computed as below:
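The equation on this slide did not survive extraction; a standard formulation of the weighted classification error of weak learner $h_t$ at round $t$ (with normalized example weights $w_i^{(t)}$), following Freund and Schapire, is:

```latex
\epsilon_t = \sum_{i=1}^{N} w_i^{(t)} \, \mathbb{1}\!\left[ h_t(x_i) \neq y_i \right],
\qquad
\alpha_t = \frac{1}{2} \ln\!\left( \frac{1 - \epsilon_t}{\epsilon_t} \right)
```

where $\alpha_t$ is the weight assigned to the weak learner in the final ensemble.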
9. AdaBoost Ensemble (Cont’d)
AdaBoost Ensemble computes as follows:
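The slide's formulas did not survive extraction; below is a minimal sketch of the standard AdaBoost loop with 1-D threshold stumps. The data, function names, and round count are hypothetical (the paper's implementation is in MATLAB):

```python
import math

def train_adaboost(xs, ys, rounds=5):
    """Minimal AdaBoost with 1-D threshold stumps; labels in {-1, +1}."""
    n = len(xs)
    w = [1.0 / n] * n                       # uniform initial weights
    ensemble = []                           # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        best = None
        # Pick the stump (threshold, polarity) with least weighted error.
        for thr in xs:
            for pol in (1, -1):
                preds = [pol if x >= thr else -pol for x in xs]
                err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, thr, pol, preds)
        err, thr, pol, preds = best
        err = max(err, 1e-10)               # avoid log(0) on perfect stumps
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, pol))
        # Reweight: boost the weight of misclassified points.
        w = [wi * math.exp(-alpha * y * p) for wi, y, p in zip(w, ys, preds)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all weak learners."""
    s = sum(a * (pol if x >= thr else -pol) for a, thr, pol in ensemble)
    return 1 if s >= 0 else -1

xs = [10, 20, 30, 60, 70, 90]               # made-up donation amounts
ys = [-1, -1, -1, 1, 1, 1]                  # -1 = low income, +1 = high income
model = train_adaboost(xs, ys)
```

Each round reweights the data so the next weak learner concentrates on the previously misclassified examples, which is the adaptive step the slide describes.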
12. Methodology (Cont’d)
We utilized a publicly available Barack Obama 2008 campaign donations dataset.
The dataset contained 17,695 records of original, unperturbed data.
Two attributes, donation amount and income status, are used to classify the data
into three classes.
The three classes are low income, middle income, and high income, for donations of
$1 to $49, $50 to $80, and $81 and above, respectively.
To validate our approach, each dataset (original and privatized) was split
50 percent for training and 50 percent for testing.
An Oracle database is queried via the MATLAB ODBC connector; MATLAB is used for
the differential privacy and machine learning classification steps.
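The Laplace perturbation step of the pipeline can be sketched in Python (the paper uses MATLAB); `laplace_noise`, `perturb`, and the sample values are hypothetical illustrations, with the noise scale set to sensitivity/ε as in standard differential privacy:

```python
import math
import random

def laplace_noise(scale):
    """Sample from Laplace(0, scale) by inverse transform sampling."""
    u = random.random() - 0.5               # uniform on [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def perturb(values, epsilon, sensitivity=1.0):
    """Add Laplace(sensitivity / epsilon) noise to each record."""
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale) for v in values]

random.seed(0)                              # reproducible sketch
donations = [25.0, 60.0, 95.0]              # made-up donation amounts
private = perturb(donations, epsilon=0.5)
```

Smaller ε means a larger noise scale and stronger privacy, which is the trade-off examined in the results that follow.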
13. Results
Essential statistical traits of the original and differentially private datasets,
a necessary requirement for publishing privatized datasets, are preserved.
In our results, the mean, standard deviation, and variance of the original
and differentially private datasets remained the same.
14. Results (Cont’d)
There is a strong positive covariance of 1060.8 between the two datasets,
indicating that they tend to increase together.
15. Results (Cont’d)
There is almost no correlation (0.0054) between the original and
differentially privatized datasets.
This indicates some privacy assurance: an attacker with access only to the
privatized dataset would find it difficult to correctly infer the alterations.
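The covariance and correlation statistics reported on these slides can be reproduced with a short script; the values below are illustrative, not the donations data:

```python
import math

def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance normalized by both standard deviations."""
    sx = math.sqrt(covariance(xs, xs))
    sy = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)

original = [10.0, 20.0, 30.0, 40.0]     # illustrative values
perturbed = [12.0, 45.0, 18.0, 50.0]    # an illustrative noisy copy
cov = covariance(original, perturbed)
r = correlation(original, perturbed)
```

Because correlation divides the covariance by both standard deviations, heavy perturbation can leave a positive covariance while driving the correlation down, which is the pattern the slides report.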
16. Results (Cont’d)
After applying differential privacy, AdaBoost ensemble classification is
performed.
The classification outcome for the donors’ dataset was Low, Middle, and High
income, for donations of 0 to 50, 51 to 80, and 81 to 100, respectively.
The same classification outcome is used for the perturbed dataset to
investigate whether the classifier would categorize it correctly.
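The class assignment described above can be written as a small helper; `income_class` is a hypothetical sketch using the thresholds given on this slide:

```python
def income_class(donation):
    """Map a donation amount to Low/Middle/High income
    using the slide's thresholds (0-50, 51-80, 81+)."""
    if donation <= 50:
        return "Low"
    if donation <= 80:
        return "Middle"
    return "High"
```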
17. Results (Cont’d)
The training dataset from the original data showed that the classification
error dropped from 0.25 to 0 as weak decision tree learners were added.
On the training dataset of the differentially private data, however, the
classification error dropped only from 0.588 to 0.58.
18. Results (Cont’d)
When the same procedure is applied to the test dataset of the original data,
the classification error dropped from 0.03 to 0.
However, when the procedure was performed on the differentially private data,
the error rate did not change even with an increased number of weak decision
tree learners.
19. Conclusion
In this study, we found that while differential privacy may guarantee strong
confidentiality, providing data utility remains a challenge.
However, this study is instructive in a variety of ways:
The level of Laplace noise does affect the classification error.
Increasing the number of weak learners has little effect on the privatized data.
Adjusting the Laplace noise parameter, ε, is essential for further study.
However, accurate classification means loss of privacy.
Tradeoffs must be made between privacy and utility.
We plan on investigating optimization approaches for such tradeoffs.
20. Questions?
Contact:
Kato Mivule: kmivule@gmail.com
Thank You.