Preserving Privacy and Utility in Text Data Analysis
Tom Diethe, Oluwaseyi Feyisetan, Thomas Drake, Borja Balle
{sey,tdiethe,draket}@amazon.com
borja.balle@gmail.com
PrivateNLP Workshop, WSDM
February 7 2020
Outline
1 Alexa AI
2 Algorithmic Privacy
3 Privacy for Text
4 Differential Privacy in Euclidean Spaces
5 Differential Privacy in Hyperbolic Spaces
6 Optimizing the Privacy Utility Trade-off
7 Summary
Alexa AI
What is Alexa?
A cloud-based voice service that can help you with tasks, entertainment, general information, shopping, and more
The more you talk to Alexa, the more Alexa adapts to your speech patterns, vocabulary, and personal preferences
How do we ...
create robust and efficient AI systems?
maintain the privacy of customer data?
Failure Modes
Unintentional failures: ML system produces a formally correct but completely unsafe outcome
Outliers/anomalies
Dataset shift
Limited memory
Intentional failures: failure is caused by an active adversary attempting to subvert the system to attain her goals, such as to:
misclassify the result
infer private training data
steal the underlying algorithm
Algorithmic Privacy
A first attempt: Can't I just anonymize my data?
k-anonymity: the information for each person cannot be distinguished from that of at least k − 1 other individuals whose information also appears in the release
Suppose a company is audited for salary discrimination
The auditor can see salaries by gender, age and nationality for each department and office
If the auditor has a friend, an ex, or a date working for the company, she will learn the salary of that person
Reducing data granularity reduces the risk, but also reduces accuracy (fidelity in this case):

Office   Dept.   Salary    D.O.B.      Nationality   Gender
London   IT      £#####    May 1985    Portuguese    Female
UK       IT      £#####    1980-1985   -             Female

Still presents a risk of re-identification! If there are 10 females born between 1980 and 1985 in the whole of the UK IT department, 9 of them could conspire to learn the salary of the 10th
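To make the k-anonymity condition concrete, here is a minimal sketch (with an entirely made-up table and column names) that checks whether every quasi-identifier group contains at least k records:

```python
import pandas as pd

# Toy salary table; all column names and values are illustrative only.
df = pd.DataFrame({
    "office":      ["London", "London", "Leeds", "Leeds"],
    "dept":        ["IT", "IT", "HR", "HR"],
    "dob_year":    [1985, 1983, 1979, 1990],
    "nationality": ["Portuguese", "British", "British", "Irish"],
    "gender":      ["F", "F", "F", "F"],
    "salary":      [52000, 48000, 39000, 41000],
})

def is_k_anonymous(table, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    group_sizes = table.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

# Full granularity fails even for k = 2 ...
print(is_k_anonymous(df, ["office", "dept", "dob_year", "nationality", "gender"], k=2))
# ... coarsening the quasi-identifiers passes the check, but as noted above,
# small groups can still collude to learn the remaining member's salary.
print(is_k_anonymous(df, ["dept", "gender"], k=2))
```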
Anonymized Data Isn't
Example 1: In the mid-1990s, the Massachusetts "Group Insurance Commission" released "anonymized" data on state employees that showed every hospital visit
The goal was to help researchers. All obvious identifiers such as name, address, and social security number were removed
MIT PhD student Latanya Sweeney decided to attempt to reverse the anonymization and requested a copy of the data
Reidentification
William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers. Sweeney started hunting for the Governor's hospital records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts, population 54,000 and 7 ZIP codes. For $20, she purchased the complete voter rolls from the city of Cambridge, containing the name, address, ZIP code, birth date, and gender of every voter. Crossing this with the GIC records, Sweeney found Governor Weld with ease: only 6 people shared his birth date, only 3 of them were men, and of them, only he lived in his ZIP code. Sweeney sent the Governor's health records (including diagnoses and prescriptions) to his office.
Anonymized Data Isn't
Example 2: In 2006, Netflix released data pertaining to how 500,000 of its users rated movies over a six-year period
Netflix "anonymized" the data before releasing it by removing usernames, but assigned unique identification numbers to users in order to allow for continuous tracking of user ratings and trends
Reidentification
Researchers used this information to uniquely identify individual Netflix users by crossing the data with the public IMDB database. According to the study, if a person has information about when and how a user rated six movies, that person can identify 99% of people in the Netflix database.
Differential Privacy
A randomised mechanism M : X → Y is ε-differentially private if for all neighbouring inputs x ≃ x′ (i.e. ‖x − x′‖₁ = 1) and for all sets of outputs E ⊆ Y we have

P[M(x) ∈ E] ≤ e^ε · P[M(x′) ∈ E]

[Figure: output distributions M(D) and M(D′) on neighbouring datasets; the ratio of their densities is bounded by e^ε]
Mechanisms:
Randomised response −→ plausible deniability
Laplace mechanism: e.g. μ̃ = μ + ξ, ξ ∼ Lap(1/(εn))
Output perturbation
...
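As an aside on the Laplace mechanism listed above, a minimal sketch (values are for illustration; the data are assumed to lie in [0, 1] so the mean has sensitivity 1/n and Laplace noise with scale 1/(εn) gives ε-DP):

```python
import numpy as np

def private_mean(values, epsilon, rng=None):
    """ε-DP mean of values assumed to lie in [0, 1]; the mean has sensitivity 1/n."""
    rng = rng or np.random.default_rng()
    x = np.clip(np.asarray(values, dtype=float), 0.0, 1.0)
    scale = 1.0 / (epsilon * len(x))      # Laplace scale = sensitivity / ε
    return float(x.mean() + rng.laplace(loc=0.0, scale=scale))

data = [0.2, 0.4, 0.9, 0.7, 0.1]
print(private_mean(data, epsilon=1.0))
```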
Randomized Response [Warner '65]
Say you want to release a bit x ∈ {Yes, No}. Do the following:
1 flip a coin
2 if tails, respond truthfully with x
3 if heads, flip a second coin and respond "Yes" if heads; respond "No" if tails
Claim: the above algorithm satisfies (log 3)-differential privacy

Pr[Response = Yes | x = Yes] / Pr[Response = Yes | x = No]
  = (1/2 × 1 + 1/2 × 1/2) / (1/2 × 0 + 1/2 × 1/2)
  = (3/4) / (1/4)
  = 3, i.e. e^ε = 3 and ε = log 3

The same holds for Pr[Response = No | x = Yes] / Pr[Response = No | x = No].
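The mechanism is easy to simulate; a short sketch (synthetic population, with Yes/No encoded as booleans) showing that the aggregate rate is still recoverable while each individual reply satisfies (log 3)-DP:

```python
import random

def randomized_response(x):
    """Warner's randomized response: answer truthfully on tails, otherwise flip a second coin."""
    if random.random() < 0.5:          # first coin: tails -> truthful answer
        return x
    return random.random() < 0.5       # first coin: heads -> second coin decides

def estimate_true_rate(responses):
    """P(report Yes) = 1/2 * p_true + 1/4, so invert: p_true = 2 * (p_obs - 1/4)."""
    p_obs = sum(responses) / len(responses)
    return 2 * (p_obs - 0.25)

# 100k respondents, 30% of whom truly hold the sensitive attribute.
truth = [random.random() < 0.3 for _ in range(100_000)]
reports = [randomized_response(x) for x in truth]
print(estimate_true_rate(reports))     # close to 0.3, yet each reply is (log 3)-DP
```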
Important Properties
Robustness to post-processing: if M is (ε, δ)-DP, then f(M) is (ε, δ)-DP
Composition: if M₁, . . . , Mₙ are (εᵢ, δᵢ)-DP, then g(M₁, . . . , Mₙ) is (Σᵢ εᵢ, Σᵢ δᵢ)-DP
Protects against arbitrary side knowledge
Privacy for Text
User-AI system interaction via natural language
User's goal: meet some specific need with respect to an issued query x
Agent's goal: satisfy the user's request
Privacy violation: occurs when x is used to make personal inference, e.g. unrestricted PII present
Mechanism: modify the query to protect privacy whilst preserving semantics
Our approach: Metric Differential Privacy
Desired Functionality

Intent              Query x                           Modified Query x′
GetWeather          Will it be colder in Cleveland    Will it be colder in Ohio
PlayMusic           Play Cantopop on lastfm           Play C-pop on lastfm
BookRestaurant      Book a restaurant in Milladore    Book a restaurant in Wood County
SearchCreativeWork  I want to watch Manthan film      I want to watch Hindi film
Word Embeddings
Mapping from words into vectors of real numbers (many ways to do this!)
e.g. neural network based models (e.g. Word2Vec, GloVe, fastText)
Defines a mapping φ : W → ℝⁿ
Nearest neighbours are often synonyms
Metric Differential Privacy
Recall the definition of DP ...

P[M(x) ∈ E] ≤ e^ε · P[M(x′) ∈ E]   for x, x′ ∈ X s.t. ‖x − x′‖₁ = 1

This can be rewritten into a single equation as:

P[M(x) ∈ E] / P[M(x′) ∈ E] ≤ e^(ε‖x−x′‖₁)

Metric differential privacy generalises this to use any valid metric d(x, x′):

P[M(x) ∈ E] / P[M(x′) ∈ E] ≤ e^(ε·d(x,x′))

(easy to see that standard DP is metric DP with d(x, x′) = ‖x − x′‖₁)
Privacy in the Space of Word Embeddings [Feyisetan 2019, Feyisetan 2020]
Given:
w ∈ W: word to be "privatised" from word space W (dictionary)
φ : W → Z: embedding function from word space to embedding space Z (e.g. ℝⁿ)
v = φ(w): corresponding word vector
d : Z × Z → ℝ: distance function in embedding space
Ω(ε): the DP noise sampling distribution (e.g. Ωᵢ(ε) = Lap(1/ε), i = 1, ..., n for ℝⁿ)
Metric DP Mechanism for word embeddings
1 Perturb the word vector: v′ = v + ξ where ξ ∼ Ω(ε)
2 The new vector v′ will not be a word (a.s.)
3 Project back to W: w′ = arg min_{w∈W} d(v′, φ(w)), return w′
What do we need?
d satisfies the axioms of a metric (non-negativity, identity of indiscernibles, symmetry, triangle inequality)
A way to sample using Ω(ε) in the metric space that respects d and gives us ε-metric DP
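A minimal sketch of the mechanism above, with a toy three-word vocabulary; per-coordinate Laplace noise stands in for the calibrated noise distribution Ω(ε) (the published mechanisms sample noise matched to the chosen metric), and the vocabulary, embeddings and scale are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and 2-D embeddings (illustrative values only).
vocab = ["cleveland", "ohio", "akron"]
emb = {"cleveland": np.array([0.9, 0.1]),
       "ohio":      np.array([0.8, 0.2]),
       "akron":     np.array([0.1, 0.9])}

def privatise_word(word, epsilon):
    """Perturb the word's vector, then project back to the nearest word in the vocabulary."""
    v = emb[word]
    noise = rng.laplace(scale=1.0 / epsilon, size=v.shape)   # stand-in for Omega(eps)
    v_noisy = v + noise
    # Step 3: nearest neighbour under the embedding-space metric (Euclidean here).
    return min(vocab, key=lambda w: np.linalg.norm(v_noisy - emb[w]))

print([privatise_word("cleveland", epsilon=5.0) for _ in range(5)])
```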
Differential Privacy in the Space of Euclidean Word Embeddings
Adding noise to a location always produces a valid location — a point somewhere on the earth's surface
Adding noise to a word embedding produces a new point in the embedding space, but it is almost surely not the location of a valid word embedding
We perform approximate nearest neighbour search to find the nearest valid embedding
The nearest valid embedding could be the original word itself: in that case, the original word is returned
Practical Considerations
To help choose ε, we define:
Uncertainty statistics for the adversary over the outputs
Indistinguishability statistics: plausible deniability
Find a radius of high protection: a guarantee on the likelihood of changing any word in the embedding vocabulary
Euclidean Experiments: Setup

Dataset              IMDb                   Enron                    InsuranceQA
Task type            Sentiment analysis     Author identification    Question answering
Evaluation metric    accuracy               accuracy                 MAP, MRR
Training set size    25,000                 8,517                    12,887
Test set size        25,000                 850                      1,800
Total word count     5,958,157              307,639                  92,095
Vocabulary size      79,428                 15,570                   2,745
Sentence length      μ = 42.27, σ = 34.38   μ = 30.68, σ = 31.54     μ = 7.15, σ = 2.06

Scenario 1: Train time protection. Little access to public data (10%), but abundant access to private training data (90%); model training is done on the combined dataset (i.e. public subset + perturbed private subset)
Scenario 2: Test time protection. Models trained on the complete training set; evaluation on privatized versions of the test sets
We used 300-D GloVe word embeddings with biLSTM models
Results
IMDb reviews – Accuracy vs baseline for different values of ε
[Figure: accuracy at training time and at test time for ε from 200 to 1000, each compared against the non-private baseline]
Results
Enron emails – Accuracy vs baseline for different values of ε
[Figure: accuracy at training time and at test time for ε from 200 to 1000, each compared against the non-private baseline]
Results
InsuranceQA – MAP/MRR scores for different values of ε on the dev set
[Figure: MAP and MRR on the dev set at training time and at test time for ε from 200 to 1000, compared against the MAP and MRR baselines]
Privacy Evaluation
In the previous experiments, we didn't explicitly evaluate privacy
Problem: ε is an arbitrary number that is hard to interpret
This is especially true in metric DP, since ε is on a different scale
As we have seen, there are empirical ways to calibrate ε according to statistics of the word embeddings
But how do we convince stakeholders that the privacy guarantees are holding, and there are no bugs?
Solution: machine auditors – machine learning algorithms designed to mount different types of privacy attacks on the data
Machine Auditors
Probabilistic record linkage auditing attack
Objective: link a user in a public dataset to a user in a (leaked) private dataset.
Attack simulation: simulate public and "leaked" datasets by randomly splitting an initial dataset. The attack takes advantage of rare words and queries issued by users. A vector of word counts can be extracted from user queries and used to perform the linkage.
Assumptions: the attacker is able to narrow the attack set (using side knowledge)
Evaluation: how many accurate links can the attacker reconstruct?
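To see how such a linkage attack works mechanically, here is a toy simulation (entirely synthetic query logs, bag-of-words count vectors, cosine matching); rare, user-specific terms are what make the linkage easy:

```python
import numpy as np
from collections import Counter

# Entirely synthetic query logs; rare, user-specific words drive the linkage.
users = {
    "alice": ["weather in milladore", "book a table in milladore", "directions to milladore"],
    "bob":   ["play cantopop on lastfm", "play more cantopop", "what is cantopop"],
    "carol": ["weather in seattle", "traffic to seattle", "seattle ferry schedule"],
}

vocab = sorted({w for queries in users.values() for q in queries for w in q.split()})

def count_vector(queries):
    counts = Counter(w for q in queries for w in q.split())
    return np.array([counts[w] for w in vocab], dtype=float)

# "Public" release = first two queries per user; "leaked" dataset = the remaining query.
public = {u: count_vector(qs[:2]) for u, qs in users.items()}
leaked = {u: count_vector(qs[2:]) for u, qs in users.items()}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# The attacker links each leaked profile to the most similar public profile.
correct = sum(
    max(public, key=lambda p: cosine(vec, public[p])) == u
    for u, vec in leaked.items()
)
print(f"{correct}/{len(users)} users correctly linked")   # 3/3 on this toy example
```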
Machine Auditors
Membership auditing attack [Shokri et al '17, Song & Shmatikov '18]
Objective: identify whether an individual's data (queries) were used in the training set of an ML model.
Attack simulation: train an ML model on queries from m users. Train "shadow" models using data from a different set of n users. The attack model is a classifier built using the output of the shadow models.
Assumptions: the attacker is able to narrow the attack set (using side knowledge)
Evaluation: can the attacker correctly detect the m users inside and outside the model's dataset?
Hyperbolic Spaces
[Figures: (a) Projection of a point in the Lorentz model Hⁿ to the Poincaré model; (b) WebIsADb is-a relationships in the GloVe vocabulary on the B² Poincaré disk]
Continuous analogue of a tree structure
Natural language captures hypernymy and hyponymy −→ embeddings require fewer dimensions
Use models of hyperbolic space - projections into Euclidean space
Hyperbolic Differential Privacy
Distances in the n-dimensional Poincaré ball are given by:

d_Bⁿ(u, v) = arcosh( 1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²)) )

Claim: d_Bⁿ(u, v) is a valid metric. Proof (via the Lorentz model) in the paper
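A small numeric check of the Poincaré-ball distance above (points are assumed to be NumPy arrays strictly inside the unit ball):

```python
import numpy as np

def poincare_distance(u, v):
    """Distance in the Poincaré ball: arcosh(1 + 2*||u-v||^2 / ((1-||u||^2)(1-||v||^2)))."""
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq / denom))

u = np.array([0.1, 0.2])
v = np.array([0.4, -0.3])
print(poincare_distance(u, v))    # symmetric, positive
print(poincare_distance(u, u))    # -> 0.0
```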
Hyperbolic Noise
Recall that for Euclidean metric DP we use Laplacian noise to achieve ε-mDP, i.e.:

ξ ∼ Lap(1/ε)ⁿ

We derive the hyperbolic Laplace distribution:

p(x | μ = 0, ε) = (1 + ε) / (2 · ₂F₁(1, ε; 2 + ε; −1)) · (−2/(x − 1) − 1)^(−ε)

where ₂F₁(a, b; c; z) is the hypergeometric function
For sampling, we developed a Lorentzian Metropolis-Hastings sampler (see paper)
[Figure: samples from the hyperbolic Laplace distribution on the Poincaré disk]
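This is not the Lorentzian sampler from the paper, but a generic random-walk Metropolis-Hastings sketch of the same idea: draw a point of the Poincaré disk with density proportional to exp(−ε · d_B(v, x)) around a private embedding v, rejecting proposals that leave the unit ball; step size and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def poincare_distance(u, v):
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq / denom)

def sample_hyperbolic_laplace(center, epsilon, n_steps=2000, step=0.05):
    """Random-walk Metropolis-Hastings targeting p(x) proportional to exp(-epsilon * d_B(center, x))."""
    x = center.copy()
    log_p = lambda z: -epsilon * poincare_distance(center, z)
    for _ in range(n_steps):
        proposal = x + rng.normal(scale=step, size=x.shape)
        if np.sum(proposal ** 2) >= 1.0:          # density is zero outside the unit ball
            continue
        if np.log(rng.uniform()) < log_p(proposal) - log_p(x):
            x = proposal
    return x

v = np.array([0.3, 0.1])                          # embedding of the private word (toy value)
print(sample_hyperbolic_laplace(v, epsilon=2.0))
```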
Hyperbolic Privacy Experiments 1
Task: obfuscation vs. Koppel's authorship attribution algorithm
Datasets: PAN@CLEF tasks; correct author predictions (lower is better)

        Pan-11            Pan-12
ε       small     large   set-A   set-C   set-D   set-I
0.5     36        72      4       3       2       5
1       35        73      3       3       2       5
2       40        78      4       3       2       5
8       65        116     4       5       4       5
∞       147       259     6       6       6       12
Hyperbolic Privacy Experiments 2
Task: expected privacy vs Euclidean baseline
Datasets: 100/200/300-d GloVe embeddings

                          expected value of Nw
ε       worst-case Nw     hyp-100    euc-100    euc-200    euc-300
0.125   134               1.25       38.54      39.66      39.88
0.5     148               1.62       42.48      43.62      43.44
1       172               2.07       48.80      50.26      53.82
2       297               3.92       92.42      93.75      90.90
8       960               140.67     602.21     613.11     587.68

Privacy comparisons (lower Nw is better)
Hyperbolic Utility Experiments
5 classification tasks: sentiment ×2, product reviews, opinion polarity, question-type
3 natural language tasks: NL inference, paraphrase detection, semantic textual similarity
baselines: utility results baselined using SentEval against random replacement

                    hyp-100d                              original
dataset   random    ε = 0.125   ε = 1       ε = 8        InferSent   SkipThought   fastText
MR        58.19     58.38       63.56       74.52        81.10       79.40         78.20
CR        77.48     83.21**     83.92**     85.19**      86.30       83.1          80.20
MPQA      84.27     88.53*      88.62*      88.98*       90.20       89.30         88.00
SST-5     30.81     41.76       42.40       42.53        46.30       −             45.10
TREC-6    75.20     82.40       82.40       84.20*       88.20       88.40         83.40
SICK-E    79.20     81.00**     82.38**     82.34**      86.10       79.5          78.9
MRPC      69.86     74.78*      75.07*      75.01*       76.20       −             74.40
STS14     0.17/0.16 0.44/0.45   0.45/0.46*  0.52/0.53*   0.68/0.65   0.44/0.45     0.65/0.63

Accuracy scores on classification tasks. * indicates results better than 1 baseline, ** better than 2 baselines
Optimizing the Privacy Utility Trade-off
[Figure: a balance between UTILITY and PRIVACY]
Example: Differentially Private SGD
Algorithm 1: Differentially Private SGD
Input: dataset z = (z₁, . . . , zₙ)
Hyperparameters: learning rate η, mini-batch size m, number of epochs T, noise variance σ², clipping norm L
Initialize w ← 0
for t ∈ [T] do
    for k ∈ [n/m] do
        Sample S ⊂ [n] with |S| = m uniformly at random
        Let g ← (1/m) Σ_{j∈S} clip_L(∇ℓ(zⱼ, w)) + (2L/m) · N(0, σ²I)
        Update w ← w − ηg
return w
5+ hyper-parameters affecting both privacy and utility
For deep learning applications we only have empirical utility (not analytic)
How do we find the hyperparameters that give us an optimal trade-off?
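A minimal NumPy sketch of the DP-SGD loop above for a logistic-regression loss (data and hyper-parameter values are illustrative; a real implementation would also track the (ε, δ) spent with a privacy accountant):

```python
import numpy as np

rng = np.random.default_rng(0)

def clip(g, L):
    """Clip a per-example gradient to l2 norm at most L."""
    norm = np.linalg.norm(g)
    return g * min(1.0, L / (norm + 1e-12))

def dp_sgd(X, y, eta=0.1, m=16, T=20, sigma=1.0, L=1.0):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        for _ in range(n // m):
            S = rng.choice(n, size=m, replace=False)
            # Per-example logistic-loss gradients (sigmoid(x.w) - y) * x, clipped to norm L.
            grads = [clip((1 / (1 + np.exp(-X[j] @ w)) - y[j]) * X[j], L) for j in S]
            noise = (2 * L / m) * rng.normal(0.0, sigma, size=d)
            w -= eta * (np.mean(grads, axis=0) + noise)
    return w

# Toy data: two Gaussian blobs with binary labels.
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
print(dp_sgd(X, y))
```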
The Privacy-Utility Pareto Front
[Figure: hyper-parameter configurations mapped into the (privacy loss, error) plane; the Pareto-optimal points trace out the privacy-utility Pareto front]
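Given a set of already-evaluated hyper-parameter configurations, the Pareto front itself is easy to extract; a short sketch (the (privacy loss, error) pairs below are made up):

```python
import numpy as np

def pareto_front(points):
    """Keep the (privacy loss, error) points not dominated by any other point;
    a point is dominated if some other point is no worse in both objectives and
    strictly better in at least one (both objectives are minimised)."""
    pts = np.asarray(points, dtype=float)
    front = []
    for p in pts:
        dominated = np.any(np.all(pts <= p, axis=1) & np.any(pts < p, axis=1))
        if not dominated:
            front.append(tuple(p))
    return front

# Each pair: (privacy loss epsilon, classification error) for one hyper-parameter setting.
evaluations = [(0.5, 0.40), (1.0, 0.25), (2.0, 0.24), (4.0, 0.10), (8.0, 0.12)]
print(pareto_front(evaluations))   # (8.0, 0.12) is dominated by (4.0, 0.10) and drops out
```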
Bayesian Optimization
Gradient-free optimization for black-box functions
Widely used in applications (HPO in ML, scheduling & planning, experimental design ...)
In multi-objective problems, BO aims to learn the Pareto front with a minimal number of
evaluations.
DPareto
Repeat:
1 For each objective (privacy, utility):
    1 Fit a surrogate model (Gaussian process (GP)) using the available dataset
    2 Calculate the predictive distribution using the GP mean and variance functions
2 Use the posterior of the surrogate models to form an acquisition function
3 Collect the next point at the estimated global maximum of the acquisition function
until budget exhausted
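This is not the acquisition strategy from the DPareto paper, but a compact sketch in the same spirit: one GP surrogate per iteration on a ParEGO-style random scalarization of the two objectives, optimised by expected improvement. The evaluate function is a synthetic stand-in for actually training a DP model; all names and values are illustrative.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

def evaluate(x):
    """Stand-in for a full DP training run at hyper-parameter x: returns (privacy loss, error)."""
    return np.array([10.0 * x, 0.3 + 0.5 * np.exp(-5.0 * x) + 0.01 * rng.normal()])

X = rng.uniform(0.05, 1.0, size=(5, 1))                  # initial hyper-parameter designs
Y = np.array([evaluate(x[0]) for x in X])                # columns: privacy loss, error

candidates = np.linspace(0.05, 1.0, 200).reshape(-1, 1)
for _ in range(15):
    w = rng.dirichlet([1.0, 1.0])                        # random scalarization of the objectives
    y_scalar = (Y * w).sum(axis=1)
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y_scalar)
    mu, std = gp.predict(candidates, return_std=True)
    best = y_scalar.min()
    z = (best - mu) / np.maximum(std, 1e-9)
    ei = (best - mu) * norm.cdf(z) + std * norm.pdf(z)   # expected improvement (minimisation)
    x_next = candidates[np.argmax(ei)]
    X = np.vstack([X, x_next])
    Y = np.vstack([Y, evaluate(x_next[0])])

print(Y)   # all evaluated (privacy loss, error) pairs; the Pareto front is read off these
```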
DPareto vs Random Sampling
[Figures: (left) Pareto-front hypervolume vs. number of sampled points for MLP1 and MLP2, comparing random sampling (RS) with Bayesian optimization (BO); (centre) MLP2 Pareto fronts in the (ε, classification error) plane for the initial design, +256 RS points and +256 BO points; (right) LogReg+SGD samples, 1500 RS vs. 256 BO]
Summary: Privacy Enhancing Technologies
Privacy
Privacy risks can be counter-intuitive and tricky to formalize
High-dimensional data and side knowledge make privacy hard
Semantic guarantees (e.g. DP) behave better than syntactic ones (e.g. k-anonymization)
Differential privacy is a mature privacy enhancing technology
Metric DP provides local plausible deniability; accuracy can be good even in cases with an infinite number of outcomes
Empirical privacy-utility trade-off evaluation enables application-specific decisions
Bayesian optimization provides a computationally efficient method to recover the Pareto front (especially with a large number of hyper-parameters)
Questions?
tdiethe@amazon.com
Weitere ähnliche Inhalte

Ähnlich wie Preserving Privacy and Utility in Text Data Analysis

DataTags: Sharing Privacy Sensitive Data by Latanya Sweeney
DataTags: Sharing Privacy Sensitive Data by Latanya SweeneyDataTags: Sharing Privacy Sensitive Data by Latanya Sweeney
DataTags: Sharing Privacy Sensitive Data by Latanya Sweeneydatascienceiqss
 
Using Apache Spark and Differential Privacy for Protecting the Privacy of the...
Using Apache Spark and Differential Privacy for Protecting the Privacy of the...Using Apache Spark and Differential Privacy for Protecting the Privacy of the...
Using Apache Spark and Differential Privacy for Protecting the Privacy of the...Databricks
 
Strata Conference NY: The Accidental Chief Privacy Officer
Strata Conference NY: The Accidental Chief Privacy OfficerStrata Conference NY: The Accidental Chief Privacy Officer
Strata Conference NY: The Accidental Chief Privacy OfficerJim Adler
 
Data Mining Challenges
Data Mining ChallengesData Mining Challenges
Data Mining ChallengesRepustate
 
2020 Data Breach Investigations Report (DBIR)
2020 Data Breach Investigations Report (DBIR)2020 Data Breach Investigations Report (DBIR)
2020 Data Breach Investigations Report (DBIR)- Mark - Fullbright
 
Behavioral Big Data & Healthcare Research: Talk at WiDS Taipei
Behavioral Big Data & Healthcare Research: Talk at WiDS TaipeiBehavioral Big Data & Healthcare Research: Talk at WiDS Taipei
Behavioral Big Data & Healthcare Research: Talk at WiDS TaipeiGalit Shmueli
 
1 tenea lewissocw 6301methodological approach
1 tenea lewissocw 6301methodological approach1 tenea lewissocw 6301methodological approach
1 tenea lewissocw 6301methodological approachlicservernoida
 
Reuters/Ipsos Core Political Survey: Presidential Approval Tracker (02/12/2020)
Reuters/Ipsos Core Political Survey: Presidential Approval Tracker  (02/12/2020)Reuters/Ipsos Core Political Survey: Presidential Approval Tracker  (02/12/2020)
Reuters/Ipsos Core Political Survey: Presidential Approval Tracker (02/12/2020)Ipsos Public Affairs
 
Wolfram Data Summit: Data Feast, Privacy Famine: What Is a Healthy Data Diet?
Wolfram Data Summit: Data Feast, Privacy Famine: What Is a Healthy Data Diet?Wolfram Data Summit: Data Feast, Privacy Famine: What Is a Healthy Data Diet?
Wolfram Data Summit: Data Feast, Privacy Famine: What Is a Healthy Data Diet?Jim Adler
 
IE_expressyourself_EssayH
IE_expressyourself_EssayHIE_expressyourself_EssayH
IE_expressyourself_EssayHjk6653284
 
Data collection for cultural project
Data collection for cultural projectData collection for cultural project
Data collection for cultural projectDanilo Supino
 
Carpe Datum! Who knows who you are?
Carpe Datum! Who knows who you are?Carpe Datum! Who knows who you are?
Carpe Datum! Who knows who you are?Kuliza Technologies
 
1. Data Science overview - part1.pptx
1. Data Science overview - part1.pptx1. Data Science overview - part1.pptx
1. Data Science overview - part1.pptxRahulTr22
 
Reuters/Ipsos Core Political Survey: Presidential Approval Tracker (03/04/2020)
Reuters/Ipsos Core Political Survey: Presidential Approval Tracker (03/04/2020)Reuters/Ipsos Core Political Survey: Presidential Approval Tracker (03/04/2020)
Reuters/Ipsos Core Political Survey: Presidential Approval Tracker (03/04/2020)Ipsos Public Affairs
 
AI Challenges for Non-Profits, Small Business and Government
AI Challenges for Non-Profits, Small Business and GovernmentAI Challenges for Non-Profits, Small Business and Government
AI Challenges for Non-Profits, Small Business and GovernmentMichael Bryan
 
Data Con LA 2019 - Applied Privacy Engineering Study on SEER database by Ken ...
Data Con LA 2019 - Applied Privacy Engineering Study on SEER database by Ken ...Data Con LA 2019 - Applied Privacy Engineering Study on SEER database by Ken ...
Data Con LA 2019 - Applied Privacy Engineering Study on SEER database by Ken ...Data Con LA
 

Ähnlich wie Preserving Privacy and Utility in Text Data Analysis (20)

DataTags: Sharing Privacy Sensitive Data by Latanya Sweeney
DataTags: Sharing Privacy Sensitive Data by Latanya SweeneyDataTags: Sharing Privacy Sensitive Data by Latanya Sweeney
DataTags: Sharing Privacy Sensitive Data by Latanya Sweeney
 
Using Apache Spark and Differential Privacy for Protecting the Privacy of the...
Using Apache Spark and Differential Privacy for Protecting the Privacy of the...Using Apache Spark and Differential Privacy for Protecting the Privacy of the...
Using Apache Spark and Differential Privacy for Protecting the Privacy of the...
 
Explainability for NLP
Explainability for NLPExplainability for NLP
Explainability for NLP
 
Strata Conference NY: The Accidental Chief Privacy Officer
Strata Conference NY: The Accidental Chief Privacy OfficerStrata Conference NY: The Accidental Chief Privacy Officer
Strata Conference NY: The Accidental Chief Privacy Officer
 
Data Mining Challenges
Data Mining ChallengesData Mining Challenges
Data Mining Challenges
 
2020 Data Breach Investigations Report (DBIR)
2020 Data Breach Investigations Report (DBIR)2020 Data Breach Investigations Report (DBIR)
2020 Data Breach Investigations Report (DBIR)
 
Behavioral Big Data & Healthcare Research: Talk at WiDS Taipei
Behavioral Big Data & Healthcare Research: Talk at WiDS TaipeiBehavioral Big Data & Healthcare Research: Talk at WiDS Taipei
Behavioral Big Data & Healthcare Research: Talk at WiDS Taipei
 
1 tenea lewissocw 6301methodological approach
1 tenea lewissocw 6301methodological approach1 tenea lewissocw 6301methodological approach
1 tenea lewissocw 6301methodological approach
 
Reuters/Ipsos Core Political Survey: Presidential Approval Tracker (02/12/2020)
Reuters/Ipsos Core Political Survey: Presidential Approval Tracker  (02/12/2020)Reuters/Ipsos Core Political Survey: Presidential Approval Tracker  (02/12/2020)
Reuters/Ipsos Core Political Survey: Presidential Approval Tracker (02/12/2020)
 
Data Coordinator Guidebook
Data Coordinator GuidebookData Coordinator Guidebook
Data Coordinator Guidebook
 
Wolfram Data Summit: Data Feast, Privacy Famine: What Is a Healthy Data Diet?
Wolfram Data Summit: Data Feast, Privacy Famine: What Is a Healthy Data Diet?Wolfram Data Summit: Data Feast, Privacy Famine: What Is a Healthy Data Diet?
Wolfram Data Summit: Data Feast, Privacy Famine: What Is a Healthy Data Diet?
 
IE_expressyourself_EssayH
IE_expressyourself_EssayHIE_expressyourself_EssayH
IE_expressyourself_EssayH
 
Data collection for cultural project
Data collection for cultural projectData collection for cultural project
Data collection for cultural project
 
Carpe Datum! Who knows who you are?
Carpe Datum! Who knows who you are?Carpe Datum! Who knows who you are?
Carpe Datum! Who knows who you are?
 
Umhoefer: Data-driven enterprise - handout
Umhoefer: Data-driven enterprise - handoutUmhoefer: Data-driven enterprise - handout
Umhoefer: Data-driven enterprise - handout
 
SOC2002 Lecture 6
SOC2002 Lecture 6SOC2002 Lecture 6
SOC2002 Lecture 6
 
1. Data Science overview - part1.pptx
1. Data Science overview - part1.pptx1. Data Science overview - part1.pptx
1. Data Science overview - part1.pptx
 
Reuters/Ipsos Core Political Survey: Presidential Approval Tracker (03/04/2020)
Reuters/Ipsos Core Political Survey: Presidential Approval Tracker (03/04/2020)Reuters/Ipsos Core Political Survey: Presidential Approval Tracker (03/04/2020)
Reuters/Ipsos Core Political Survey: Presidential Approval Tracker (03/04/2020)
 
AI Challenges for Non-Profits, Small Business and Government
AI Challenges for Non-Profits, Small Business and GovernmentAI Challenges for Non-Profits, Small Business and Government
AI Challenges for Non-Profits, Small Business and Government
 
Data Con LA 2019 - Applied Privacy Engineering Study on SEER database by Ken ...
Data Con LA 2019 - Applied Privacy Engineering Study on SEER database by Ken ...Data Con LA 2019 - Applied Privacy Engineering Study on SEER database by Ken ...
Data Con LA 2019 - Applied Privacy Engineering Study on SEER database by Ken ...
 

Kürzlich hochgeladen

Servosystem Theory / Cybernetic Theory by Petrovic
Servosystem Theory / Cybernetic Theory by PetrovicServosystem Theory / Cybernetic Theory by Petrovic
Servosystem Theory / Cybernetic Theory by PetrovicAditi Jain
 
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptxECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptxmaryFF1
 
User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)Columbia Weather Systems
 
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPirithiRaju
 
Microteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical EngineeringMicroteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical EngineeringPrajakta Shinde
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)riyaescorts54
 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxJorenAcuavera1
 
User Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationUser Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationColumbia Weather Systems
 
Observational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsObservational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsSérgio Sacani
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxpriyankatabhane
 
Thermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxThermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxuniversity
 
Radiation physics in Dental Radiology...
Radiation physics in Dental Radiology...Radiation physics in Dental Radiology...
Radiation physics in Dental Radiology...navyadasi1992
 
PROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and VerticalPROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and VerticalMAESTRELLAMesa2
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024AyushiRastogi48
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.PraveenaKalaiselvan1
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Preserving Privacy and Utility in Text Data Analysis

  • 1. Preserving Privacy and Utility in Text Data Analysis Tom Diethe, Oluwaseyi Feyisetan, Thomas Drake, Borja Balle {sey,tdiethe,draket}@amazon.com borja.balle@gmail.com PrivateNLP Workshop, WSDM February 7 2020
  • 2. Outline 1 Alexa AI 2 Algorithmic Privacy 3 Privacy for Text 4 Differential Privacy in Euclidean Spaces 5 Differential Privacy in Hyperbolic Spaces 6 Optimizing the Privacy Utility Trade-off 7 Summary Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 1 / 41
  • 5. Alexa AI What is Alexa? A cloud-based voice service that can help you with tasks, entertainment, general information, shopping, and more The more you talk to Alexa, the more Alexa adapts to your speech patterns, vocabulary, and personal preferences How do we ... create robust and efficient AI systems? maintain the privacy of customer data? Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 3 / 41
  • 6. Failure Modes Unintentional failures: ML system produces a formally correct but completely unsafe outcome Outliers/anomalies Dataset shift Limited memory Intentional failures: failure is caused by an active adversary attempting to subvert the system to attain her goals, such as to: misclassify the result infer private training data steal the underlying algorithm Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 4 / 41
  • 8. A first attempt: Can’t I just anonymize my data? k-anonymity: the information for each person cannot be distinguished from that of at least k − 1 other individuals whose information also appears in the release Suppose a company is audited for salary discrimination The auditor can see salaries by gender, age and nationality for each department and office If the auditor has a friend, an ex or a date working for the company, she will learn the salary of that person Reducing data granularity reduces the risk, but also reduces accuracy (fidelity in this case) Office: London | Dept.: IT | Salary: £##### | D.O.B.: May 1985 | Nationality: Portuguese | Gender: Female This still presents a risk of re-identification! If there are 10 females born between 80-85 in the whole of UK’s IT department, 9 of them could conspire to learn the salary of the 10th one Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 6 / 41
  • 10. A first attempt: Can’t I just anonymize my data? k-anonymity: the information for each person cannot be distinguished from that of at least k − 1 other individuals whose information also appears in the release Suppose a company is audited for salary discrimination The auditor can see salaries by gender, age and nationality for each department and office If the auditor has a friend, an ex or a date working for the company, she will learn the salary of that person Reducing data granularity reduces the risk, but also reduces accuracy (fidelity in this case) Office: UK | Dept.: IT | Salary: £##### | D.O.B.: 1980-1985 | Nationality: – | Gender: Female This still presents a risk of re-identification! If there are 10 females born between 80-85 in the whole of UK’s IT department, 9 of them could conspire to learn the salary of the 10th one Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 6 / 41
  • 11. Anonymized Data Isn’t Example 1: Mid-1990s: the Massachusetts “Group Insurance Commission” released “anonymized” data on state employees that showed every hospital visit The goal was to help researchers; all obvious identifiers such as name, address, and social security number were removed MIT PhD student Latanya Sweeney decided to attempt to reverse the anonymization and requested a copy of the data Reidentification: William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers. Sweeney started hunting for the Governor’s hospital records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts, population 54,000 with 7 ZIP codes. For $20, she purchased the complete voter rolls from the city of Cambridge, containing the name, address, ZIP code, birth date, and gender of every voter. Crossing this with the GIC records, Sweeney found Governor Weld with ease: only 6 people shared his birth date, only 3 of them were men, and of those, only he lived in his ZIP code. Sweeney sent the Governor’s health records (including diagnoses and prescriptions) to his office. Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 7 / 41
  • 13. Anonymized Data Isn’t Example 2: In 2006, Netflix released data pertaining to how 500,000 of its users rated movies over a six-year period Netflix “anonymized” the data before releasing it by removing usernames, but assigned unique identification numbers to users in order to allow for continuous tracking of user ratings and trends Reidentification Researchers used this information to uniquely identify individual Netflix users by crossing the data with the public IMDB database. According to the study, if a person has information about when and how a user rated six movies, that person can identify 99% of people in the Netflix database. Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 8 / 41
  • 16. Differential Privacy A randomised mechanism M : X → Y is ε-differentially private if for all neighbouring inputs x ∼ x′ (i.e. ‖x − x′‖₁ = 1) and for all sets of outputs E ⊆ Y we have P[M(x) ∈ E] ≤ e^ε · P[M(x′) ∈ E] [Figure: output distributions M(D) and M(D′), with their ratio bounded by e^ε] Mechanisms: Randomised response −→ plausible deniability Laplace mechanism: e.g. µ̃ = µ + ξ, ξ ∼ Lap(1/(nε)) Output perturbation ... Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 9 / 41
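To make the Laplace mechanism above concrete, here is a minimal sketch (ours, not from the slides) of releasing the mean of n values in [0, 1]: the mean has sensitivity 1/n, so adding Lap(1/(nε)) noise gives ε-DP.

    import numpy as np

    def private_mean(values, epsilon):
        # Mean of values clipped to [0, 1]; the sensitivity of the mean is 1/n.
        x = np.clip(np.asarray(values, dtype=float), 0.0, 1.0)
        n = len(x)
        scale = 1.0 / (n * epsilon)  # Laplace scale = sensitivity / epsilon
        return x.mean() + np.random.laplace(loc=0.0, scale=scale)

    # Example: the noisy mean concentrates around the true mean as n grows.
    print(private_mean(np.random.rand(10000), epsilon=0.5))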
  • 18. Randomized Response [Warner ’65] Say you want to release a bit x ∈ {Yes, No}. Do the following: 1 flip a coin 2 if tails, respond truthfully with x 3 if heads, flip a second coin and respond “Yes” if heads; respond “No” if tails Claim: the above algorithm satisfies (log 3)-differential privacy: Pr[Response = Yes | x = Yes] / Pr[Response = Yes | x = No] = (1/2 × 1 + 1/2 × 1/2) / (1/2 × 0 + 1/2 × 1/2) = (3/4) / (1/4) = 3 =⇒ e^ε = 3 The same holds for Pr[Response = No | x = Yes] / Pr[Response = No | x = No]. Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 10 / 41
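A minimal sketch (ours, not from the slides) of the randomized-response mechanism described above; with fair coins the output probability ratio is bounded by 3, matching the (log 3)-DP claim.

    import random

    def randomized_response(x: bool) -> bool:
        # First coin: tails -> answer truthfully.
        if random.random() < 0.5:
            return x
        # Heads: answer with a second fair coin, independently of x.
        return random.random() < 0.5

    # Debiasing aggregate statistics: E[observed "Yes" rate] = 0.5 * p + 0.25,
    # so an unbiased estimate of the true rate p is 2 * observed - 0.5.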
  • 19. Important Properties Robustness to post-processing: if M is (ε, δ)-DP, then f(M) is (ε, δ)-DP Composition: if M1, . . . , Mn are (εi, δi)-DP, then g(M1, . . . , Mn) is (Σⁿᵢ₌₁ εi, Σⁿᵢ₌₁ δi)-DP Protects against arbitrary side knowledge Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 11 / 41
  • 21. User-AI system interaction via natural language User’s goal: meet some specific need with respect to an issued query x Agent’s goal: satisfy the user’s request Privacy violation: occurs when x is used to make personal inferences, e.g. when unrestricted PII is present Mechanism: modify the query to protect privacy whilst preserving semantics Our approach: Metric Differential Privacy Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 13 / 41
  • 27. Desired Functionality
  Intent | Query x | Modified query
  GetWeather | Will it be colder in Cleveland | Will it be colder in Ohio
  PlayMusic | Play Cantopop on lastfm | Play C-pop on lastfm
  BookRestaurant | Book a restaurant in Milladore | Book a restaurant in Wood County
  SearchCreativeWork | I want to watch Manthan film | I want to watch Hindi film
  Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 14 / 41
  • 28. Word Embeddings Mapping from words into vectors of real numbers (many ways to do this!) e.g. neural network based models (e.g. Word2Vec, GloVe, fastText) Defines a mapping φ : W → Rⁿ Nearest neighbours are often synonyms Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 15 / 41
  • 29. Metric Differential Privacy Recall the definition of DP ... P[M(x) ∈ E] ≤ e^ε · P[M(x′) ∈ E] for x, x′ ∈ X s.t. ‖x − x′‖₁ = 1 This can be rewritten as a single inequality: P[M(x) ∈ E] / P[M(x′) ∈ E] ≤ e^(ε‖x−x′‖₁) Metric differential privacy generalises this to use any valid metric d(x, x′): P[M(x) ∈ E] / P[M(x′) ∈ E] ≤ e^(ε·d(x,x′)) (easy to see that standard DP is metric DP with d(x, x′) = ‖x − x′‖₁) Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 16 / 41
  • 32. Privacy in the Space of Word Embeddings [Feyisetan 2019, Feyisetan 2020] Given: w ∈ W: word to be “privatised” from word space W (dictionary) φ : W → Z: embedding function from word space to embedding space Z (e.g. Rⁿ) v = φ(w): corresponding word vector d : Z × Z → R: distance function in embedding space Ω(ε): the DP noise sampling distribution (e.g. Ωᵢ(ε) = Lap(1/(nε)), i = 1, ..., n for Rⁿ) Metric DP Mechanism for word embeddings 1 Perturb the word vector: v′ = v + ξ where ξ ∼ Ω(ε) 2 The new vector v′ will (a.s.) not be the embedding of any word 3 Project back to W: w′ = arg min over w ∈ W of d(v′, φ(w)), return w′ What do we need? d satisfies the axioms of a metric (nonnegativity, identity of indiscernibles, symmetry, triangle inequality) A way to sample using Ω in the metric space that respects d and gives us ε-metric DP Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 17 / 41
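A minimal sketch of the perturb-then-project mechanism above, assuming a small in-memory dict from words to vectors and exact nearest-neighbour search (real systems would use approximate search). The per-coordinate Laplace noise with scale 1/ε used here is a simplified stand-in for the calibrated multivariate distribution Ω(ε) analysed in the papers.

    import numpy as np

    def privatize_word(word, embedding, epsilon, rng=np.random.default_rng()):
        # embedding: dict mapping word -> np.ndarray (all vectors of the same dimension).
        v = embedding[word]
        # Simplified noise model: independent Laplace per coordinate, scale 1/epsilon.
        noise = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=v.shape)
        v_noisy = v + noise
        # Project back to the vocabulary: nearest embedding under Euclidean distance.
        words = list(embedding)
        dists = [np.linalg.norm(v_noisy - embedding[w]) for w in words]
        return words[int(np.argmin(dists))]

The returned word may be the original word itself when the noise is small, which is exactly the behaviour described on the Euclidean-mechanism slide below.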
  • 36. Differential Privacy in the Space of Euclidean Word Embeddings Adding noise to a location always produces a valid location — a point somewhere on the earth’s surface Adding noise to a word embedding produces a new point in the embedding space, but it is almost surely not the embedding of a valid word We perform approximate nearest neighbour search to find the nearest valid embedding The nearest valid embedding could be the original word itself: in that case, the original word is returned Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 19 / 41
  • 37. Practical Considerations To help choose ε, we define: Uncertainty statistics for the adversary over the outputs Indistinguishability statistics: plausible deniability Find a radius of high protection: a guarantee on the likelihood of changing any word in the embedding vocabulary Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 20 / 41
  • 38. Euclidean Experiments: Setup
  Dataset: IMDb | Enron | InsuranceQA
  Task type: Sentiment analysis | Author identification | Question answering
  Evaluation metric: accuracy | accuracy | MAP, MRR
  Training set size: 25,000 | 8,517 | 12,887
  Test set size: 25,000 | 850 | 1,800
  Total word count: 5,958,157 | 307,639 | 92,095
  Vocabulary size: 79,428 | 15,570 | 2,745
  Sentence length: µ = 42.27, σ = 34.38 | µ = 30.68, σ = 31.54 | µ = 7.15, σ = 2.06
  Scenario 1: train time protection — little access to public data (10%), but abundant access to private training data (90%); model training is done on the combined dataset (i.e. public subset + perturbed private subset) Scenario 2: test time protection — models trained on the complete training set; evaluation on privatized versions of the test sets We used 300-D GloVe word embeddings with biLSTM models Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 21 / 41
  • 39. Results IMDb reviews – Accuracy vs baseline for different values of ε [Figure: accuracy as a function of ε at training time and at test time, compared with the non-private baseline] Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 22 / 41
  • 40. Results Enron emails – Accuracy vs baseline for different values of ε [Figure: accuracy as a function of ε at training time and at test time, compared with the non-private baseline] Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 22 / 41
  • 41. Results InsuranceQA – MAP/MRR scores for different values of ε on the dev set [Figure: MAP and MRR on the dev set as a function of ε at training time and at test time, compared with the MAP/MRR baselines] Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 22 / 41
  • 42. Privacy Evaluation In the previous experiments, we didn’t explicitly evaluate privacy Problem: ε is an arbitrary number that is hard to interpret This is especially true in metric DP, since ε is on a different scale As we have seen, there are empirical ways to calibrate ε according to statistics of the word embeddings But how do we convince stakeholders that the privacy guarantees are holding, and that there are no bugs? Solution: machine auditors – machine learning algorithms designed to simulate different types of privacy attacks on the data Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 23 / 41
  • 48. Machine Auditors Probabilistic record linkage auditing attack Objective: link a user in a public dataset to a user in a (leaked) private dataset. Attack simulation: simulate public and “leaked” datasets by randomly splitting an initial dataset. The attack takes advantage of rare words and queries issued by users. A vector of word counts can be extracted from each user’s queries and used to perform the linkage. Assumptions: the attacker is able to narrow the attack set (using side knowledge) Evaluation: how many accurate links can the attacker reconstruct? Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 24 / 41
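As a rough illustration of such a linkage auditor (our sketch, not the auditor described on the slide), assume each dataset maps a user id to the concatenation of that user's queries; users are then linked by cosine similarity of TF-IDF word-count vectors, which up-weights the rare words the attack exploits.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def link_users(public_texts: dict, leaked_texts: dict):
        # Fit a shared TF-IDF vocabulary, then match each leaked user to the
        # most similar public user by cosine similarity of their query text.
        pub_ids, leak_ids = list(public_texts), list(leaked_texts)
        vec = TfidfVectorizer()
        X = vec.fit_transform([public_texts[u] for u in pub_ids] +
                              [leaked_texts[u] for u in leak_ids])
        pub, leak = X[:len(pub_ids)], X[len(pub_ids):]
        sims = cosine_similarity(leak, pub)
        return {leak_ids[i]: pub_ids[sims[i].argmax()] for i in range(len(leak_ids))}

The auditor's metric is then simply the fraction of returned links that are correct.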
  • 49. Machine Auditors Membership auditing attack [Shokri et al ’17, Song & Shmatikov ’18] Objective: identify whether an individual’s data (queries) were used in the training set of an ML model. Attack simulation: train an ML model on queries from m users. Train “shadow” models using data from a different set of n users. The attack model is a classifier built using the output of the shadow models Assumptions: the attacker is able to narrow the attack set (using side knowledge) Evaluation: can the attacker correctly detect whether each of the m users is inside or outside the model’s training set? Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 24 / 41
  • 51. Hyperbolic Spaces [Figure: (a) projection of a point in the Lorentz model Hⁿ to the Poincaré model; (b) WebIsADb is-a relationships in the GloVe vocabulary on the B² Poincaré disk] Continuous analog of a tree structure Natural language captures hypernymy and hyponymy −→ embeddings require fewer dimensions Use models of hyperbolic space – projections into Euclidean space Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 26 / 41
  • 52. Hyperbolic Differential Privacy Distances in the n-dimensional Poincaré ball are given by: d_Bⁿ(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))) Claim: d_Bⁿ(u, v) is a valid metric. Proof (via the Lorentzian model) in the paper Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 27 / 41
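The distance above translates directly into code; a small sketch of ours using numpy, assuming both points lie strictly inside the unit ball:

    import numpy as np

    def poincare_distance(u, v):
        # d(u, v) = arcosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
        u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
        num = 2.0 * np.sum((u - v) ** 2)
        den = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
        return np.arccosh(1.0 + num / den)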
  • 53. Hyperbolic Noise Recall that for Euclidean metric DP we use Laplace noise to achieve ε-mDP, i.e. ξ ∼ Lap(1/(nε)) We derive a hyperbolic analogue of the Laplace distribution, p(x | µ = 0, ε), whose normalising constant involves the Gauss hypergeometric function 2F1(1, ε; 2 + ε; −1) (full density in the paper) For sampling, we developed a Lorentzian Metropolis–Hastings sampler (see paper) [Figure: samples from the hyperbolic noise distribution on the Poincaré disk] Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 28 / 41
  • 56. Hyperbolic Privacy Experiments 1 Task: obfuscation vs. Koppel’s authorship attribution algorithm Datasets: the PAN@CLEF tasks; correct author predictions (lower is better)
  ε | Pan-11 small | Pan-11 large | Pan-12 set-A | Pan-12 set-C | Pan-12 set-D | Pan-12 set-I
  0.5 | 36 | 72 | 4 | 3 | 2 | 5
  1 | 35 | 73 | 3 | 3 | 2 | 5
  2 | 40 | 78 | 4 | 3 | 2 | 5
  8 | 65 | 116 | 4 | 5 | 4 | 5
  ∞ | 147 | 259 | 6 | 6 | 6 | 12
  Correct author predictions (lower is better) Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 29 / 41
  • 57. Hyperbolic Privacy Experiments 2 Task: expected privacy vs Euclidean baseline Datasets: 100/200/300d GloVe embeddings expected value Nw ε worst-case Nw hyp-100 euc-100 euc-200 euc-300 0.125 134 1.25 38.54 39.66 39.88 0.5 148 1.62 42.48 43.62 43.44 1 172 2.07 48.80 50.26 53.82 2 297 3.92 92.42 93.75 90.90 8 960 140.67 602.21 613.11 587.68 Privacy comparisons (lower Nw is better) Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 30 / 41
  • 58. Hyperbolic Utility Experiments 5 classification tasks: sentiment x2, product reviews, opinion polarity, question-type 3 natural language tasks: NL inference, paraphrase detection, semantic textual similarity baselines: utility results baselined using SentEval against random replacement
  Task | random | hyp-100d ε = 0.125 | hyp-100d ε = 1 | hyp-100d ε = 8 | InferSent (original dataset) | SkipThought (original dataset) | fastText (original dataset)
  MR | 58.19 | 58.38 | 63.56 | 74.52 | 81.10 | 79.40 | 78.20
  CR | 77.48 | 83.21∗∗ | 83.92∗∗ | 85.19∗∗ | 86.30 | 83.1 | 80.20
  MPQA | 84.27 | 88.53∗ | 88.62∗ | 88.98∗ | 90.20 | 89.30 | 88.00
  SST-5 | 30.81 | 41.76 | 42.40 | 42.53 | 46.30 | − | 45.10
  TREC-6 | 75.20 | 82.40 | 82.40 | 84.20∗ | 88.20 | 88.40 | 83.40
  SICK-E | 79.20 | 81.00∗∗ | 82.38∗∗ | 82.34∗∗ | 86.10 | 79.5 | 78.9
  MRPC | 69.86 | 74.78∗ | 75.07∗ | 75.01∗ | 76.20 | − | 74.40
  STS14 | 0.17/0.16 | 0.44/0.45 | 0.45/0.46∗ | 0.52/0.53∗ | 0.68/0.65 | 0.44/0.45 | 0.65/0.63
  Accuracy scores on classification tasks. ∗ indicates results better than 1 baseline, ∗∗ better than 2 baselines Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 31 / 41
  • 60. [Figure: the trade-off between UTILITY and PRIVACY] Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 33 / 41
  • 61. Example: Differentially Private SGD
  Algorithm 1: Differentially Private SGD
  Input: dataset z = (z1, . . . , zn)
  Hyperparameters: learning rate η, mini-batch size m, number of epochs T, noise variance σ², clipping norm L
  Initialize w ← 0
  for t ∈ [T] do
    for k ∈ [n/m] do
      Sample S ⊂ [n] with |S| = m uniformly at random
      Let g ← (1/m) Σ_{j∈S} clip_L(∇ℓ(zj, w)) + (2L/m) · N(0, σ²I)
      Update w ← w − ηg
  return w
  5+ hyper-parameters affecting both privacy and utility For deep learning applications we only have empirical utility (not analytic) How do we find the hyperparameters that give us an optimal trade-off? Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 34 / 41
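A compact numpy sketch of Algorithm 1 (ours, for illustration) with a caller-supplied per-example gradient function grad_fn, which is a hypothetical helper not defined on the slides; it reproduces the clip-then-noise update but omits the privacy accounting needed to convert (m, T, σ) into an (ε, δ) guarantee.

    import numpy as np

    def dp_sgd(data, grad_fn, dim, eta=0.1, m=64, T=10, sigma=1.0, L=1.0,
               rng=np.random.default_rng()):
        # grad_fn(z, w) must return the per-example gradient of the loss at w.
        n = len(data)
        w = np.zeros(dim)
        for _ in range(T):
            for _ in range(n // m):
                S = rng.choice(n, size=m, replace=False)
                clipped = []
                for j in S:
                    g_j = grad_fn(data[j], w)
                    # Clip each per-example gradient to L2 norm at most L.
                    g_j = g_j * min(1.0, L / (np.linalg.norm(g_j) + 1e-12))
                    clipped.append(g_j)
                # Average, then add Gaussian noise scaled as in the listing.
                g = np.mean(clipped, axis=0) + (2 * L / m) * rng.normal(0.0, sigma, size=dim)
                w = w - eta * g
        return w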
  • 62. The Privacy-Utility Pareto Front [Figure: points in hyper-parameter space mapped to (privacy loss, error); the Pareto-optimal points form the front — a sketch for extracting such a front follows below] Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 35 / 41
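To make “Pareto-optimal points” concrete, here is a small helper of ours that extracts the non-dominated points from a set of evaluated hyper-parameter settings, where both privacy loss and error are to be minimised:

    def pareto_front(points):
        # points: list of (privacy_loss, error) pairs, both to be minimised.
        # Returns the subset not dominated by any other point.
        front = []
        for i, p in enumerate(points):
            dominated = any(
                q[0] <= p[0] and q[1] <= p[1] and q != p
                for j, q in enumerate(points) if j != i
            )
            if not dominated:
                front.append(p)
        return front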
  • 71. Bayesian Optimization Gradient-free optimization for black-box functions Widely used in applications (HPO in ML, scheduling & planning, experimental design ...) In multi-objective problems, BO aims to learn the Pareto front with a minimal number of evaluations. Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 36 / 41
  • 81. DPareto
  Repeat until the budget is exhausted:
  1 For each objective (privacy, utility):
    1.1 Fit a surrogate model (a Gaussian process, GP) using the available dataset
    1.2 Calculate the predictive distribution using the GP mean and variance functions
  2 Use the posteriors of the surrogate models to form an acquisition function
  3 Collect the next point at the estimated global maximum of the acquisition function
  Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 37 / 41
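A simplified sketch of one iteration of this loop (ours, not the DPareto implementation): two independent GP surrogates, one per objective, combined through a random-scalarisation lower-confidence-bound acquisition over a candidate pool, standing in for the hypervolume-based acquisition used in the paper.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    def suggest_next(X, y_priv, y_err, candidates, beta=1.0, rng=np.random.default_rng()):
        # X: evaluated hyper-parameters (n x d); y_priv, y_err: their two objectives.
        # candidates: pool of unevaluated hyper-parameters to score (k x d).
        acq = np.zeros(len(candidates))
        weights = rng.dirichlet([1.0, 1.0])  # random scalarisation of the two objectives
        for w, y in zip(weights, (y_priv, y_err)):
            gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
            mu, sd = gp.predict(candidates, return_std=True)
            acq += w * (mu - beta * sd)      # lower confidence bound (both objectives minimised)
        return candidates[int(np.argmin(acq))]  # most promising point to evaluate next

In practice the suggested setting would be evaluated (e.g. train with DP-SGD, record ε and validation error), appended to the dataset, and the loop repeated.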
  • 82. DPareto vs Random Sampling [Figure: hypervolume evolution of the Pareto front vs number of sampled points for MLP1 and MLP2 under random sampling (RS) and Bayesian optimization (BO); Pareto fronts (classification error vs ε) for MLP2 (initial, +256 RS, +256 BO) and for LogReg+SGD (1500 RS vs 256 BO samples)] Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 38 / 41
  • 84. Summary: Privacy Enhancing Technologies Privacy risks can be counter-intuitive and tricky to formalize High-dimensional data and side knowledge make privacy hard Semantic guarantees (e.g. DP) behave better than syntactic ones (e.g. k-anonymization) Differential privacy is a mature privacy enhancing technology Metric DP provides local plausible deniability; accuracy can be good even in cases with an infinite number of outcomes Empirical privacy-utility trade-off evaluation enables application-specific decisions Bayesian optimization provides a computationally efficient method to recover the Pareto front (esp. with a large number of hyper-parameters) Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 40 / 41
  • 85. Questions? tdiethe@amazon.com Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 41 / 41