This document discusses case studies using differential privacy to analyze sensitive data. It describes analyzing Windows Live user data to study web analytics and customer churn. Clinical researchers' perspectives on differential privacy were also examined. Researchers wanted unaffected statistics and the ability to access original data if needed. Future collaboration with OHSU aims to develop a healthcare template for applying differential privacy.
2. Case Studies
Quantitative Case Study:
Windows Live / MSN Web Analytics data
Qualitative Case Study:
Clinical Physicians Perspective
Future Study
OHSU/CORI data set: applying differential privacy in a healthcare setting
3. Sanitization Concept
Mask individuals within the data by creating a sanitization
point between the user interface and the data.
The magnitude of the noise is given by the theorem. If many
queries f₁, f₂, … are to be made, noise proportional to Σᵢ Δfᵢ
suffices. For many sequences, we can often use less noise
than Σᵢ Δfᵢ. Note that ΔHistogram = 1, independent of the
number of cells.
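As a sketch of the composition rule above, assuming the Laplace mechanism with a privacy parameter ε (the slides do not name a specific mechanism or parameter):

```python
def composed_noise_scale(sensitivities, epsilon):
    """Basic composition: to answer queries f1..fk under one overall
    privacy budget epsilon, noise with scale proportional to the
    summed sensitivities suffices."""
    return sum(sensitivities) / epsilon

# A histogram has sensitivity 1 regardless of its number of cells,
# so k histogram-style queries need scale k / epsilon.
scale = composed_noise_scale([1, 1, 1], epsilon=1.0)
```

As the slide notes, this bound is often loose: for many query sequences less noise than Σᵢ Δfᵢ is enough.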
6/3/2016
4. Generating the noise
To generate the noise, a pseudo-random number
generator will create a stream of numbers, e.g.:
0 0 1 1 1 … 1 0 0 0 0 1
The resulting translation of this stream is:
- . 2 + 1 … + . . . . 6
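The exact bit-stream translation above is not fully specified in the deck; as one illustrative way to turn a pseudo-random stream into noise, a uniform draw can be mapped to a Laplace variate by inverting the Laplace CDF (the distribution choice here is an assumption):

```python
import math
import random

def uniform_to_laplace(u, scale):
    """Map a uniform draw u in (0, 1) to a Laplace(0, scale) variate
    by inverting the Laplace CDF."""
    v = u - 0.5
    return -scale * math.copysign(1.0, v) * math.log(1.0 - 2.0 * abs(v))

# Translate a pseudo-random stream into a stream of noise values.
rng = random.Random(42)  # seeded only for reproducibility
noise_stream = [uniform_to_laplace(rng.random(), 1.0) for _ in range(5)]
```

Per the editor's notes at the end of the deck, fresh randomness should be drawn for every query.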
5. Adding noise
Original result set:
Category  Value
A         36
B         22
…         …
N         102

After adding noise:
Category  Value
A         34
B         23
…         …
N         108
• The stream of numbers above is applied to the result set.
• While masking the individuals, it allows accurate percentages and trending.
• Presuming the noise magnitude is small (i.e., a small error), the numbers themselves remain accurate within an acceptable margin.
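Using the slide's own before-and-after values, a quick check that percentages survive the noise:

```python
original = {"A": 36, "B": 22, "N": 102}   # values from the slide
noisy    = {"A": 34, "B": 23, "N": 108}   # the same cells after noise

def shares(counts):
    """Each category's fraction of the total."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

# No cell is exact, yet every category's share moves by less than
# two percentage points.
drift = {k: abs(shares(original)[k] - shares(noisy)[k]) for k in original}
```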
6. Windows Live User Data
Our initial case study is based on Windows Live
user data:
550 million Passport users
Passport has web site visitor self-reported data: gender, birth
date, occupation, country, zip code, etc.
Web data has: IP address, pages viewed, page view duration,
browser, operating system, etc.
Created two groups to examine the acceptability /
applicability of differential privacy within the WL
reporting context:
WL Sampled Users Web Analytics
Customer Churn Analytics
8. Sampled Users Web Analytics Group
New solution built on top of an existing Windows
Live web analytics solution to provide a sample
specific to Passport users.
Built on top of an OLAP database to allow analysts
to view the data from multiple dimensions.
Built as well to showcase the privacy preserving
histogram for various teams including Channels,
Search, and Money.
9. Web Analytics Group Feedback
Country-level report:
Country        Visitors
United States  202
Canada         31

Country-by-gender report:
Country        Gender  Visitors
United States  Female  128
               Male    75
               Total   203
Canada         Female  15
               Male    15
               Total   30
Feedback was negative because customers
could not accept any amount of error.
This group had been using reporting
systems for over two years that had
perceived accuracy issues.
They were adamant that all of the totals
match; the mismatch between the two reports
(e.g. 203 vs. 202 United States visitors) was
not acceptable even though this data was
not used for financial reconciliation.
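The mismatch that bothered this group can be reproduced from the reported numbers themselves: cells noised independently need not sum to the independently noised total.

```python
# Country-level report (noised independently).
country_visitors = {"United States": 202, "Canada": 31}

# Country-by-gender report (each cell noised independently).
gender_visitors = {
    ("United States", "Female"): 128,
    ("United States", "Male"):   75,
    ("Canada", "Female"):        15,
    ("Canada", "Male"):          15,
}

def rollup(country):
    """Sum the gender cells for one country."""
    return sum(v for (c, _), v in gender_visitors.items() if c == country)

# 128 + 75 = 203, but the country report says 202: off by one, by design.
mismatch_us = rollup("United States") - country_visitors["United States"]
```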
10. Customer Churn Analysis Group
This reporting solution provided an OLAP cube, based on an
existing targeted marketing system, to allow analysts to
understand how services (Messenger, Mail, Search, Spaces,
etc.) are being used.
A key difference between the groups is that this group did not
have access to any reporting (though it was requested for
many months).
Within a few weeks of their initial request, CCA customers
received a working beta with which they were able to interact,
validate, and provide feedback on the precision and accuracy
of the data.
11. Discussion
The collaborative effort led to the customer
trusting the data, a key difference in comparison to
the first group.
Because of this trust, the small amount of error
introduced into the system to ensure customer
privacy was well within a tolerable error margin.
The CCA group works in direct marketing and hence had
to deal with customer privacy more regularly.
12. An important component of the
acceptance of privacy algorithms is
the users’ trust of the data.
13. Clinical Researchers’ Perceptions
A pilot qualitative study on the perceptions of clinical
researchers was recently completed.
It noted six themes in three categories:
Unaffected Statistics
Understanding the privacy algorithms
Can get back to the original data
Understanding the purpose of the privacy algorithms
Management ROI
Protecting Patient Privacy
14. Unaffected Statistics
The most important point: there is no point in applying
privacy if we get faulty statistics.
The primary concern is that healthcare studies involve a
smaller number of patients than studies in other domains.
We plan to provide, in the near future, a healthcare
template for the use of these algorithms.
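A back-of-the-envelope sketch of why small cohorts are the primary concern (the noise scale of 2 is a hypothetical value, not from the study):

```python
def typical_relative_error(count, noise_scale):
    """The expected absolute value of Laplace(0, b) noise is b,
    so a rough relative-error estimate is b / count."""
    return noise_scale / count

# The same noise scale that is invisible at web scale can swamp
# a small clinical cohort.
web_error    = typical_relative_error(1_000_000, 2.0)  # tiny fraction
cohort_error = typical_relative_error(20, 2.0)         # ten percent
```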
15. Understanding the privacy algorithms
As we have done in these slides, we have described
the mathematics behind these algorithms only
briefly.
But most clinical researchers are willing to accept the
science behind them without necessarily
understanding them.
While this is good, it does pose the risk that one
will implement them without understanding them, and
thereby incorrectly guarantee the privacy of patients.
16. Can get back to the original data
It is very important to get back to the original data set
if so required.
Many existing privacy algorithms perturb the data itself,
so while they guarantee the privacy of an individual, it is
impossible to get back to that individual.
Healthcare research always requires the ability to get
back to the original data to potentially inform
patients of new outcomes.
The privacy preserving data analysis approach here
will allow this ability.
17. Understanding the purpose of the privacy algorithms
Most educated healthcare professionals understand
the issues, and providing case studies such as the
Governor Weld re-identification case makes this more apparent.
But we will still want to provide well-worded text
and/or confidence intervals below a chart or report
that has privacy algorithms applied.
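One concrete form such accompanying text could take, assuming Laplace noise with a known scale (a sketch; the deck does not commit to a specific mechanism):

```python
import math

def laplace_interval(noisy_value, noise_scale, confidence=0.95):
    """Two-sided interval that covers the true count with the given
    probability, when the added noise is Laplace(0, noise_scale)."""
    alpha = 1.0 - confidence
    half_width = noise_scale * math.log(1.0 / alpha)
    return (noisy_value - half_width, noisy_value + half_width)

low, high = laplace_interval(202, noise_scale=2.0)
# A caption under the chart might then read:
# "202 visitors (95% interval: roughly 196 to 208)"
```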
18. Management ROI
We should be limiting the number of users who need
access to the full data. So is there a good return-on-
investment in providing this extra step if you can
securely authorize the right people to access this
data?
This is where standards from IRB, privacy & security
steering committees, and the government get
involved.
Most importantly: the ability to share data.
19. Protecting Patient Privacy
For us to be able to analyze and mine
medical data so we can help patients
as well as lower the costs of
healthcare, we must first ensure
patient privacy.
20. Future Collaboration
As noted above, we are currently working with OHSU
to build a template for the application of these
privacy algorithms to healthcare.
For more information and/or interest in participating
in future application research, please email Denny
Lee at dennyl@microsoft.com.
21. Thanks
Thanks to Sally Allwardt for helping implement the
privacy preserving histogram algorithm used in this
case study.
Thanks to Kristina Behr, Lead Marketing Manager, for
all of her help and feedback with this case study.
22. Practical Privacy: The SuLQ Framework
Reference paper “Practical Privacy: The SuLQ
Framework”
Conceptually, this application of privacy can be
applied to:
Principal component analysis
k means clustering
ID3 algorithm
Perceptron algorithm
Apparently, all algorithms in the statistical query learning
model.
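A minimal sketch of the primitive these algorithms share in the SuLQ framework: the analyst only ever sees an aggregate plus fresh noise, never a raw row (the Laplace sampler and the parameters below are illustrative assumptions):

```python
import math
import random

def noisy_count(rows, predicate, noise_scale, rng):
    """SuLQ-style primitive: a count of rows satisfying a predicate,
    released only with fresh noise added."""
    true_count = sum(1 for r in rows if predicate(r))
    u = rng.random() - 0.5
    noise = -noise_scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

# PCA, k-means, ID3, and the perceptron can all be phrased as
# sequences of such noisy aggregates over the data.
rng = random.Random(7)
answer = noisy_count(range(100), lambda r: r % 2 == 0, 2.0, rng)
```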
Editor's notes
This is based on the work of Cynthia Dwork and Frank McSherry from Microsoft Research (MSR)
A carefully detailed algorithm is definitely important, and something we have and can show folks. Aside from the addition of noise, the main snafus are a) how much noise and b) where did the randomness come from? Both are fun and exciting questions that you could have neat policy answers to, but the safe answers are: a) a standard deviation equal to the total number of queries and b) fresh randomness for every query. If they don't want to tell you the number of queries up front, then the standard deviation can be proportional to the square of the queries asked so far.
By doing this, the algorithm will be able to address all such attacks. Consequently, for each person, the increase in the probability of their being attacked (or anyone else, for that matter) due to the contribution of their data is nominal. The example given is foiled for two reasons: a) the addition of noise will (formally) complicate the polynomial reconstruction, and b) the number of queries is limited by the degree of privacy guaranteed, and N is generally going to be far too many queries.
The distribution used to create this noise can be Gaussian, because this often works. But in order to handle all situations, we should use other distributions that provide more noise and/or are more complicated, such as the Laplace (exponential) distribution noted in the previous slide.
Windows Live User Data Application
Windows Live can use the above data to provide customizable experiences for their users and understand how visitors are using these services.
Microsoft is able to offer services like Search and Messenger at no charge to the consumer because the services are ad-funded, including ads that are targeted to be more relevant to the consumer.
As the data is accumulated, it becomes easier to segment the population and potentially better identify individual users without directly using personally identifiable information.
Potential Issues
As noted above, the Windows Live user data has enough specifics to allow us to identify a web site visitor even through the aggregations.
We need to worry about standard privacy issues:
Identity theft
Fraud
Bad press (e.g. AOL releasing search queries, which ended up revealing its users)
If user expectations about privacy are not satisfied, consumers may no longer trust the services that we are so willing to provide.
For example, reviewing the country Afghanistan, the “Unknown” value is 121561 in one case and 121599 in another. Because of the random noise, we do not know what the “real” value is.