Towards Statistical Queries over Distributed Private User Data
1. Towards Statistical Queries over
Distributed Private User Data
R.Chen, A.Reznichenko, P.Francis – MPI-SWS, Germany
J.Gehrke – Cornell University, USA
Serafeim Chatzopoulos
M1258
schatz@di.uoa.gr
MDE519 – Distributed Systems
Instructor: Mema Roussopoulou
May 31, 2013
2. User Privacy
User Data is exposed to organizations in many
ways.
Users are aware of their data being exposed.
Make a purchase in an online store.
Update a profile on a social network.
Users are unaware of their data exposure.
Third party trackers.
Smart phone Apps.
3. The “user-owned and operated” principle
Personal data should be stored on a local host or a
cloud device under the user's control and released only
in a controlled, limited, or noisy fashion.
Users must have the exclusive control of
their own data and must be able to share
data selectively or voluntarily.
4. Motivation and Problem
Distributed private user data is important.
Analysts could use such data to
understand users' behaviors
discover their statistical patterns
evaluate proposed enhancements.
How to make statistical queries over such distributed
private user data while still preserving privacy?
5. Related Work
Anonymization
Removes well-known personally identifiable
information (PII).
Randomization
Adds random distortion values to user data.
k-anonymity, l-diversity, t-closeness
Differential Privacy
6. Differential Privacy
Differential privacy adds noise to the output of a
computation (i.e., answer of query).
Hides the presence or absence of a record in the
dataset.
Makes no assumption about the adversary.
Some form of distributed differential privacy is
required…
7. Prior Distributed Differential Privacy Designs
First design has a per-user computational load of
O(U).
Dwork et al. EUROCRYPT '06
Poor scalability
Following designs reduce per-user computational
load to O(1) by using expensive secret sharing
protocols.
Rastogi and Nath, SIGMOD '10 – Shi et al. NDSS '11
Do not tolerate churn
Recent designs introduce two honest-but-curious
servers to collaboratively compute the query result.
Gotz and Nath, MSR-TR '11
Even a single malicious user can substantially distort
the query result.
8. Practical Distributed Differential Privacy System
(PDDP)
Goals:
The differential privacy guarantee is always maintained for
every honest client.
A tight bound is placed on the extent to which a malicious user
can distort query results.
The maximum absolute distortion in the final result is bounded
by the number of malicious users.
Operates at a large scale.
Millions of users.
Tolerates churn.
Churn does not prevent results from being produced.
9. PDDP Components
Analyst
Makes queries to the system
and collects answers.
Proxy
Adds differentially private noise
to clients' answers to preserve
privacy.
Clients
Locally maintain their own data
and answer queries.
10. Security Assumptions (1/2)
General Assumptions
Clients have the correct public keys for analyst and the
proxy.
Analyst and the proxy have the correct public keys for
each other.
Corresponding private keys are kept secure.
Analyst is potentially malicious (violating users'
privacy)
Collude with other analysts.
Pretend to be multiple distinct analysts.
Take control of clients and use PDDP protocol to reveal
info.
Publish its collected answers.
Intercept and modify all messages.
11. Security Assumptions (2/2)
Proxy is honest but curious (HbC)
Follows the specified protocol.
Tries to exploit additional info that can be learned in so
doing.
Does not collude with other components.
Clients are potentially malicious (distorting the
statistical results learned by analysts)
Have churn characteristics.
Limited resources for computation and data transmission.
Generate false or illegitimate answers.
Act as Sybils.
12. PDDP Key insights – Binary answer
How to limit query result distortion?
Split the answer's value range into buckets.
Enforce a binary answer in each bucket.
Goldwasser-Micali (GM) bit-cryptosystem.
Example:
Query: "SELECT age FROM info WHERE gender='m'"
4 buckets: 0~12, 13~20, 21~59, and ≥60.
Answers: '1' or '0' per bucket
Malicious clients cannot substantially distort the query
result.
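The bucketing step can be sketched as follows. This is a minimal illustration, not the authors' client code: the function name `bucket_answer` and the `BUCKETS` constant are hypothetical, and only the bucket boundaries come from the example above.

```python
from typing import List, Optional, Tuple

# Bucket boundaries from the example: 0~12, 13~20, 21~59, and >=60.
# An upper bound of None means "no upper limit".
BUCKETS: List[Tuple[int, Optional[int]]] = [(0, 12), (13, 20), (21, 59), (60, None)]


def bucket_answer(age: int, buckets=BUCKETS) -> List[int]:
    """Return one bit per bucket: 1 if the age falls inside it, else 0."""
    return [
        1 if age >= low and (high is None or age <= high) else 0
        for low, high in buckets
    ]


if __name__ == "__main__":
    print(bucket_answer(34))  # a 34-year-old lands in the 21~59 bucket
```

Because every client contributes at most one bit per bucket, a malicious client can shift any bucket count by at most 1, which is where the bound on distortion comes from.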
13. PDDP Key insights – Blind noise
How to achieve differential privacy?
Honest-but-curious proxy
Generates additional binary answers in each bucket as
differentially private noise.
If the analyst publishes the final noisy result,
the proxy knows the noise it added and
can subtract it from the published result to obtain a noise-free
result.
Solution: Proxy can only blindly add noise!
Proxy knows that the added noise is enough to achieve
differential privacy
Proxy does not know the exact noise added.
14. PDDP Workflow – Step 1
Query Initialization
Analyst first issues
a query to the
Proxy.
Message consists of 4 items:
Query: SELECT age FROM info WHERE gender='m'
Buckets: 0∼12, 13∼20, 21∼59 and ≥60.
# clients queried (c): 1000
DP parameter (ε): 1.0
Controls tradeoff between accuracy of computation and strength of
its privacy guarantee.
15. PDDP Workflow – Step 2
Query Forwarding
Select clients and
send them the
query.
Proxy:
rejects the query if c is too low or too high.
rejects the query if ε exceeds the maximum privacy level allowed.
selects c unique clients and sends them the query under one of
the following policies:
Select c clients randomly and wait for them to connect.
Select the first c clients that connect.
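The proxy's admission checks and the two selection policies might look like the sketch below. All names and the limit values (`MIN_CLIENTS`, `MAX_CLIENTS`, `MAX_EPSILON`) are hypothetical; the slides give no concrete thresholds.

```python
import random
from typing import List, Sequence

# Hypothetical policy limits; the slides do not state actual values.
MIN_CLIENTS, MAX_CLIENTS, MAX_EPSILON = 10, 1_000_000, 5.0


def admit_query(c: int, epsilon: float) -> bool:
    """Reject queries whose c is out of range or whose epsilon is too large."""
    return MIN_CLIENTS <= c <= MAX_CLIENTS and epsilon <= MAX_EPSILON


def select_random(clients: Sequence[str], c: int, rng: random.Random) -> List[str]:
    """Policy 1: pick c distinct clients at random, then wait for them to connect."""
    return rng.sample(list(clients), c)


def select_first_connected(connections: Sequence[str], c: int) -> List[str]:
    """Policy 2: take the first c distinct clients in connection order."""
    seen: List[str] = []
    for client in connections:
        if client not in seen:
            seen.append(client)
        if len(seen) == c:
            break
    return seen
```

The second policy finishes sooner under churn, since it never waits for a specific client that may be offline.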
16. PDDP Workflow – Step 3 (1/2)
Client Response
Clients execute
the query and
send answers.
Each client executes the query over its local data and produces
an answer:
'1' or '0' per bucket.
More than one bucket may contain a '1'.
Each per-bucket answer value is individually encrypted with the
analyst's public key (GM cryptosystem).
17. PDDP Workflow – Step 3 (2/2)
Goldwasser-Micali (GM) cryptosystem
Single-bit cryptosystem
Enforces binary answer in each bucket.
Very Efficient
XOR – homomorphic
E(a) * E(b) = E(a XOR b)
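A toy illustration of the XOR homomorphism, assuming textbook Goldwasser-Micali with deliberately tiny primes (`P`, `Q` are illustrative values, far too small for real use, and all function names are hypothetical):

```python
import math
import secrets

# Toy primes, both congruent to 3 mod 4, so X = N - 1 is a quadratic
# non-residue modulo each prime. Real keys use primes of 512+ bits.
P, Q = 499, 547
N = P * Q   # public modulus
X = N - 1   # public non-residue


def encrypt(bit: int) -> int:
    """GM encryption: c = r^2 * X^bit mod N for a random r coprime to N."""
    while True:
        r = secrets.randbelow(N - 2) + 2
        if math.gcd(r, N) == 1:
            break
    return (r * r * pow(X, bit, N)) % N


def decrypt(c: int) -> int:
    """Return 0 if c is a quadratic residue mod P (Euler's criterion), else 1."""
    return 0 if pow(c, (P - 1) // 2, P) == 1 else 1


def xor_ciphertexts(c1: int, c2: int) -> int:
    """Multiplying ciphertexts mod N yields an encryption of the XOR of the bits."""
    return (c1 * c2) % N
```

Since a ciphertext can only ever decrypt to a single bit, the cryptosystem itself enforces the binary answer per bucket.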
18. PDDP Workflow – Step 4
Blind noise
addition
The proxy maintains a pool of additional binary
answers called coins and adds them as noise to
each bucket.
Coins must be unbiased.
Coins are encrypted with the analyst's public key.
n coins must be added to each bucket, where n depends on c and ε.
How to generate coins blindly?
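The slides do not spell out how n is computed. The sketch below assumes the expression n = ⌊64 ln(2c) / ε²⌋ + 1, recalled from the PDDP paper and worth verifying against it; it does reproduce the 16 coins per bucket quoted for c = 250, ε = 5.0 in the deployment experiment. The function name is hypothetical.

```python
import math


def coins_per_bucket(c: int, epsilon: float) -> int:
    """Number of coins n per bucket (formula is an assumption, see lead-in)."""
    return math.floor(64 * math.log(2 * c) / epsilon ** 2) + 1


if __name__ == "__main__":
    print(coins_per_bucket(250, 5.0))   # deployment parameters
    print(coins_per_bucket(1000, 1.0))  # parameters from the query example
```

Under this expression n grows only logarithmically with c but quadratically as ε shrinks, so stronger privacy is what drives the noise volume.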
19. Coin pool generation
Straightforward approaches
Proxy generates coins
Curious proxy could know noise-free result
Clients generate coins
Malicious clients could generate biased coins
20. Collaborative coin generation
Paper's approach
Each online client periodically generates an encrypted
unbiased coin E(oc) and sends it to the proxy
The proxy receives the coin and verifies the legitimacy of the
coin.
The proxy blindly re-flips the coin E(oc) by multiplying it,
modulo m, with a locally generated unbiased coin E(op):
E(oc) * E(op) mod m = E(oc XOR op),
where m is part of the analyst’s public key
The proxy stores the unbiased coin in the locally maintained
pool.
The proxy does not know the actual value of the resulting unbiased
coin.
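A sketch of the proxy-side re-flip, under stated assumptions: toy GM parameters (`P`, `Q` are tiny illustrative primes), hypothetical function names, and a legitimacy check based on the Jacobi symbol, which the evaluation slides mention the proxy uses on received coins.

```python
import math
import secrets

# Toy analyst key (assumption): primes congruent to 3 mod 4, so X = N - 1
# is a valid GM non-residue. The proxy only ever uses the public N and X.
P, Q = 499, 547
N, X = P * Q, P * Q - 1


def jacobi(a: int, n: int) -> int:
    """Jacobi symbol (a/n) for odd n > 0, computed without factoring n."""
    a %= n
    result = 1
    while a:
        while a % 2 == 0:
            a //= 2
            if n % 8 in (3, 5):
                result = -result
        a, n = n, a
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0


def gm_encrypt(bit: int) -> int:
    """Encrypt one bit under the analyst's public key."""
    while True:
        r = secrets.randbelow(N - 2) + 2
        if math.gcd(r, N) == 1:
            return (r * r * pow(X, bit, N)) % N


def xor_in(client_coin: int, proxy_bit: int) -> int:
    """E(oc) * E(op) mod N = E(oc XOR op)."""
    return (client_coin * gm_encrypt(proxy_bit)) % N


def reflip(client_coin: int) -> int:
    """Reject malformed coins, then XOR in a fresh unbiased proxy coin."""
    if jacobi(client_coin, N) != 1:
        raise ValueError("not a well-formed GM ciphertext")
    return xor_in(client_coin, secrets.randbelow(2))
```

The proxy knows its own bit but not the client's, so it cannot tell the value of the stored coin; and because the proxy's bit is uniform and independent, the XOR is unbiased even if a malicious client submitted a biased coin.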
21. PDDP Workflow – Step 5
Noisy answers to
analyst
Each bucket holds the clients' answers plus coins (noise).
After a random delay, the proxy shuffles the c + n values.
Prevents identification of a client based on the vector of '1' and '0' in its answer.
Finally, analyst
decrypts with its private key all encrypted binary values.
sums the plaintext values obtained.
obtains the noisy answer for the clients that fall within each bucket.
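The shuffle-and-sum step can be simulated on plaintext bits. This sketch skips encryption entirely and uses a hypothetical function name; subtracting n/2 reflects the analyst's adjustment for the expected coin total.

```python
import random
from typing import List


def noisy_bucket_count(client_bits: List[int], coin_bits: List[int],
                       rng: random.Random) -> float:
    """Mix answers with coins, sum them, and remove the expected coin total n/2."""
    values = client_bits + coin_bits
    rng.shuffle(values)  # proxy: hide which values are answers and which are coins
    return sum(values) - len(coin_bits) / 2  # analyst: subtract the expected noise


if __name__ == "__main__":
    rng = random.Random(7)
    answers = [1] * 30 + [0] * 70                 # 30 of 100 clients fall in this bucket
    coins = [rng.randint(0, 1) for _ in range(16)]
    print(noisy_bucket_count(answers, coins, rng))
```

The shuffle does not change the sum; it only removes the link between positions in the list and individual clients.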
22. Practical Considerations (1/2)
Utility of aggregate result
Depends on the amount of added noise.
The sum of the n coins added by the proxy follows a binomial
distribution with mean n/2 and variance n/4. After the analyst subtracts
the mean n/2, the noise is approximately normal, N(0, n/4).
Example :
c = 10^6, ε = 1.0
Assuming normally distributed noise in each bucket:
68% probability that the noisy answer is within ±15.24 of the true answer
95% probability that the noisy answer is within ±30.48 of the true answer
99.7% probability that the noisy answer is within ±45.72 of the true answer
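The three bounds are one, two, and three standard deviations of N(0, n/4). A small check, assuming n = 929 coins for this example (an inferred value: it is the n for which sqrt(n/4) matches the 15.24 quoted above); the function name is hypothetical.

```python
import math
from typing import Tuple


def noise_bounds(n: int) -> Tuple[float, float, float]:
    """Return (1 sigma, 2 sigma, 3 sigma) for coin noise with variance n/4."""
    sigma = math.sqrt(n / 4)
    return sigma, 2 * sigma, 3 * sigma


if __name__ == "__main__":
    print([round(b, 2) for b in noise_bounds(929)])  # assumed n for c = 10^6, eps = 1.0
    print(noise_bounds(16))                          # 16 coins per bucket
```

With a million clients, an error of a few tens per bucket is negligible relative to the counts being measured.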
23. Practical Considerations (2/2)
Non-numeric Queries
Map query into a numeric query.
Example:
“Which website do you visit most often?”
Map each website the analyst wishes to learn into a numeric
value.
Large number of buckets – limit the answer to 5000 buckets.
Sybils
The design is susceptible to Sybil attacks (a single client can
masquerade as multiple clients).
The proxy can limit the number of clients selected from a single IP
address for a given query.
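The per-IP limit could be applied during client selection roughly as follows. This is a sketch with hypothetical names; the slides do not say what limit the proxy uses.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def limit_per_ip(candidates: Iterable[Tuple[str, str]], max_per_ip: int) -> List[str]:
    """Keep at most max_per_ip client IDs per IP address, in arrival order."""
    used: Counter = Counter()
    selected: List[str] = []
    for client_id, ip in candidates:
        if used[ip] < max_per_ip:
            used[ip] += 1
            selected.append(client_id)
    return selected
```

This raises the cost of a Sybil attack without eliminating it, since an attacker with many IP addresses is not constrained, and honest clients behind a shared NAT may be under-sampled.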
24. Implementation and Deployment (1/2)
Client
Firefox add-on
9600 lines of Java code
Information is stored in local SQLite storage
Web browsing activities
Certain online shopping activities
Certain ad interactions
Can be extended to capture any online activity
Every 5 min connects to the proxy to retrieve pending queries,
return answers and periodically generated coins.
25. Implementation and Deployment (2/2)
Proxy
Web service on Tomcat 6.0.33
3600 lines of code
Proxy state in MySQL database.
Analyst
800 lines of code
Deployment
Correctness verified on a set of local machines.
600+ real clients
26. Comparison: “Paillier-based” design
Honest-but-Curious Proxy
Paillier Cryptosystem
Additive homomorphism
Proxy can directly sum up all clients' encrypted binary
answers to get the encrypted sum of each bucket.
A single malicious client can substantially distort the
result.
Zero-knowledge proofs (ZKPs) are used to ensure that
encrypted answers are '1' or '0'.
Proxy knows exactly how much noise has been
added.
27. Evaluation (1/5)
Client Performance
Clients encrypt a binary value for each bucket.
GM cryptosystem
Paillier cryptosystem
28. Evaluation (2/5)
Proxy - Analyst Performance
Proxy
PDDP
One encryption and one homomorphic XOR for one unbiased coin.
Jacobi symbol checking on received coins and answer values
(faster than a decryption).
Paillier-based
One ZKP for each client answer in each bucket.
Homomorphically sum up all clients' answers per bucket.
Add noise to each per-bucket total sum.
29. Evaluation (3/5)
Proxy - Analyst Performance
Analyst
PDDP
Decrypt all encrypted values in each bucket.
Paillier-based
Decrypt one encrypted value in each bucket
30. Evaluation (4/5)
Bandwidth overhead
In both systems, a client transmits an encrypted answer to
each bucket.
In PDDP, a client also periodically transmits a generated coin to the
proxy.
In Paillier-based, a client transmits a ZKP for each bucket.
Storage overhead
In PDDP, the proxy stores all clients' answer values for each
bucket plus the required number of coins.
In Paillier-based, proxy stores only one answer value per
bucket.
31. Evaluation (5/5)
Querying the client deployment
Parameters
c = 250 (out of 600 clients)
ε = 5.0
clients are selected as they connect until 250 unique clients are queried or
24 hours elapse.
These parameters result in 16 coins per bucket.
This ensures that a per-bucket aggregate answer is within ±2, ±4, and ±6
of the noise-free answer with probability 68%, 95%, and 99.7%, respectively.
32. Future Work
Support of statistical learning algorithms
Scalability of non-numeric queries
Bloom filters – map a large number of possible answers in
a small number of buckets.
Gather statistical data for a large-scale experiment.
Weaken proxy trust requirements.
Use of trusted hardware (TPM)
General: measure the actual privacy loss for
differential privacy.
33. Conclusion
PDDP: Practical Distributed Differential Privacy
System
Scales well
Tolerates churn
Places a tight bound on malicious users' capability.
Key insights
Binary answer in each bucket
Blind noise addition