1. Towards A Differential Privacy and Utility Preserving Machine Learning Classifier
Kato Mivule, Claude Turner, and Soo-Yeon Ji
Computer Science Department
Bowie State University
Complex Adaptive Systems 2012 – Washington DC USA, November 14-16
2. Outline
Introduction
Related work
Essential Terms
Methodology
Results
Conclusion
3. Introduction
Entities transact in ‘big data’ containing personally identifiable
information (PII).
Organizations are bound by federal and state law to ensure data privacy.
In the process of achieving privacy, the utility of privatized datasets
diminishes.
Achieving balance between privacy and utility is an ongoing problem.
Therefore, we investigate a differential privacy preserving machine
learning classification approach that seeks an acceptable level of
utility.
4. Related Work
There is a growing interest in investigating privacy preserving data mining
solutions that provide a balance between data privacy and utility.
Kifer and Gehrke (2006) did a broad study of enhanced data utility in
privacy preserving data publishing by using statistical approaches.
Wong (2007) described how achieving global optimal privacy while
maintaining utility is an NP-hard problem.
Krause and Horvitz (2010) noted that finding trade-offs between privacy
and utility remains an NP-hard problem.
Muralidhar and Sarathy (2011) showed that differential privacy provides
strong privacy guarantees but utility is still a problem due to noise levels.
Finding the optimal balance between privacy and utility remains a
challenge, even with differential privacy.
5. Data Utility versus Privacy
Data utility is the extent to which a published dataset is useful to
its consumers.
In the course of a data privacy process, original data will lose statistical
value despite privacy guarantees.
Image Source: Kenneth Corbin/Internet News.
6. Objective
Achieving an optimal balance between data privacy and utility
remains an ongoing challenge.
Such optimality is highly desired and remains our investigation goal.
Image Source: Wikipedia, on Confidentiality.
7. Ensemble classification
A machine learning process in which a collection of independently
trained classifiers is merged to achieve better prediction.
For example, individually trained decision trees can be joined to make
more accurate predictions.
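As a minimal illustration of the idea, independently trained classifiers can be merged by majority vote; the decision stumps and donation values below are hypothetical, not the paper's classifiers (which were built in MATLAB):

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Combine independently trained classifiers by majority vote."""
    votes = [clf(x) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Three hypothetical decision stumps over a donation amount.
stumps = [
    lambda amt: "low" if amt < 50 else "high",
    lambda amt: "low" if amt < 45 else "high",
    lambda amt: "low" if amt < 90 else "high",
]

label = majority_vote(stumps, 60)   # two of three stumps vote "high"
```

Merging many such weak, independently trained classifiers is what the ensemble methods on the next slides build on.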
8. AdaBoost Ensemble – Adaptive Boosting
Proposed by Freund and Schapire (1995), AdaBoost runs several iterations, adding weak
learners to create a strong learner and adjusting weights to focus on data
misclassified in earlier iterations.
Classification Error in AdaBoost Ensemble is computed as below:
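The equation on this slide did not survive extraction; a standard formulation of the weighted classification error of weak learner $h_t$ at round $t$ (with normalized example weights $w_i^{(t)}$), following Freund and Schapire, is:

```latex
\epsilon_t = \sum_{i=1}^{N} w_i^{(t)} \, \mathbb{1}\!\left[ h_t(x_i) \neq y_i \right],
\qquad
\alpha_t = \frac{1}{2} \ln\!\left( \frac{1 - \epsilon_t}{\epsilon_t} \right)
```

where $\alpha_t$ is the weight assigned to the weak learner in the final ensemble.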
9. AdaBoost Ensemble (Cont’d)
AdaBoost Ensemble computes as follows:
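The slide's formulas did not survive extraction; below is a minimal sketch of the standard AdaBoost loop with 1-D threshold stumps. The data, function names, and round count are hypothetical (the paper's implementation is in MATLAB):

```python
import math

def train_adaboost(xs, ys, rounds=5):
    """Minimal AdaBoost with 1-D threshold stumps; labels in {-1, +1}."""
    n = len(xs)
    w = [1.0 / n] * n                       # uniform initial weights
    ensemble = []                           # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        best = None
        # Pick the stump (threshold, polarity) with least weighted error.
        for thr in xs:
            for pol in (1, -1):
                preds = [pol if x >= thr else -pol for x in xs]
                err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, thr, pol, preds)
        err, thr, pol, preds = best
        err = max(err, 1e-10)               # avoid log(0) on perfect stumps
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, pol))
        # Reweight: boost the weight of misclassified points.
        w = [wi * math.exp(-alpha * y * p) for wi, y, p in zip(w, ys, preds)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all weak learners."""
    s = sum(a * (pol if x >= thr else -pol) for a, thr, pol in ensemble)
    return 1 if s >= 0 else -1

xs = [10, 20, 30, 60, 70, 90]               # made-up donation amounts
ys = [-1, -1, -1, 1, 1, 1]                  # -1 = low income, +1 = high income
model = train_adaboost(xs, ys)
```

Each round reweights the data so the next weak learner concentrates on the previously misclassified examples, which is the adaptive step the slide describes.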
12. Methodology (Cont’d)
We utilized a publicly available Barack Obama 2008 campaign donations dataset.
The dataset contained 17,695 records of original, unperturbed data.
Two attributes, donation amount and income status, are used to classify the data
into three classes.
The three classes are low income, middle income, and high income, for donations of
$1 to $49, $50 to $80, and $81 and above, respectively.
To validate our approach, each dataset (original and privatized) was split
50 percent for training and 50 percent for testing.
An Oracle database is queried via the MATLAB ODBC connector; MATLAB is used for
the differential privacy and machine learning classification steps.
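The Laplace perturbation step of the pipeline can be sketched in Python (the paper uses MATLAB); `laplace_noise`, `perturb`, and the sample values are hypothetical illustrations, with the noise scale set to sensitivity/ε as in standard differential privacy:

```python
import math
import random

def laplace_noise(scale):
    """Sample from Laplace(0, scale) by inverse transform sampling."""
    u = random.random() - 0.5               # uniform on [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def perturb(values, epsilon, sensitivity=1.0):
    """Add Laplace(sensitivity / epsilon) noise to each record."""
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale) for v in values]

random.seed(0)                              # reproducible sketch
donations = [25.0, 60.0, 95.0]              # made-up donation amounts
private = perturb(donations, epsilon=0.5)
```

Smaller ε means a larger noise scale and stronger privacy, which is the trade-off examined in the results that follow.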
13. Results
Essential statistical traits of the original and differentially private datasets,
a necessary requirement for publishing privatized datasets, are preserved.
In our results, the mean, standard deviation, and variance of the original
and differentially private datasets remained the same.
14. Results (Cont’d)
There is a strong positive covariance of 1060.8 between the two datasets,
indicating that they tend to increase together.
15. Results (Cont’d)
There is almost no correlation (0.0054) between the original and
differentially privatized datasets.
This indicates some privacy assurance: an attacker with access only to the
privatized dataset would find it difficult to correctly infer the alterations.
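The covariance and correlation statistics reported on these slides can be reproduced with a short script; the values below are illustrative, not the donations data:

```python
import math

def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance normalized by both standard deviations."""
    sx = math.sqrt(covariance(xs, xs))
    sy = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)

original = [10.0, 20.0, 30.0, 40.0]     # illustrative values
perturbed = [12.0, 45.0, 18.0, 50.0]    # an illustrative noisy copy
cov = covariance(original, perturbed)
r = correlation(original, perturbed)
```

Because correlation divides the covariance by both standard deviations, heavy perturbation can leave a positive covariance while driving the correlation down, which is the pattern the slides report.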
16. Results (Cont’d)
After applying differential privacy, AdaBoost ensemble classification is
performed.
The classification outcome for the donors’ dataset was Low, Middle, and High
income, for donations of 0 to 50, 51 to 80, and 81 to 100, respectively.
The same classification outcome is used for the perturbed dataset to
investigate whether the classifier would categorize it correctly.
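The class assignment described above can be written as a small helper; `income_class` is a hypothetical sketch using the thresholds given on this slide:

```python
def income_class(donation):
    """Map a donation amount to Low/Middle/High income
    using the slide's thresholds (0-50, 51-80, 81+)."""
    if donation <= 50:
        return "Low"
    if donation <= 80:
        return "Middle"
    return "High"
```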
17. Results (Cont’d)
The training dataset from the original data showed that the classification
error dropped from 0.25 to 0 as weak decision tree learners were added.
On the training dataset of the differentially private data, however, the
classification error dropped only from 0.588 to 0.58.
18. Results (Cont’d)
When the same procedure is applied to the test dataset of the original data,
the classification error dropped from 0.03 to 0.
However, when the procedure was performed on the differentially private data,
the error rate did not change even with an increased number of weak decision
tree learners.
19. Conclusion
In this study, we found that while differential privacy may guarantee strong
confidentiality, providing data utility remains a challenge.
However, this study is instructive in a variety of ways:
The level of Laplace noise does affect the classification error.
Increasing the number of weak learners has little effect on the privatized data.
Adjusting the Laplace noise parameter, ε, is essential for further study.
However, accurate classification means loss of privacy.
Tradeoffs must be made between privacy and utility.
We plan on investigating optimization approaches for such tradeoffs.
20. Questions?
Contact:
Kato Mivule: kmivule@gmail.com
Thank You.