2. Schema Reuse
[Figure: users query schema repositories (schema.org, factual.com) and contribute schemas to them. Traditional approach: the output shows all original schemas. Our approach: the output shows an anonymized (unified) schema.]
DASFAA Security, privacy & trust DASFAA | 04.2014 2
3. Motivation
• Schema reuse offers many benefits:
– Reduce development complexity:
• New schemas require only small modifications: copy and adapt existing schemas
• Large repositories exist: schema.org, freebase.com, factual.com, niem.gov
– Increase interoperability:
• Share common standards
• But privacy needs to be considered:
– Leaked schema information enables potential attacks (e.g., SQL injection)
– Maintain competitiveness: some parts of schemas are a source of
revenue and business strategy
4. Challenges
• How to define privacy constraints?
• How to define an anonymized schema
from multiple schemas?
• How to define a utility function for a
certain anonymized schema?
• How to find an anonymized schema
that satisfies privacy constraints and
maximizes the utility function?
[Figure: contributors supply schemas and privacy constraints; a query returns an anonymized (unified) schema.]
5. Challenge 1 – Define privacy constraints
• Need to identify two elements
– Sensitive information
• Attributes
– Privacy requirement
• Prevent leaking provenance of sensitive attributes
• Use presence constraint:
A presence constraint $\gamma$ is a triple $\langle s, D, \theta \rangle$, where $s$ is a schema, $D$ is a
set of attributes, and $\theta$ is a specified threshold. An anonymized schema $\hat{S}$
satisfies the presence constraint $\gamma$ if $\Pr(D \in s \mid \hat{S}) \le \theta$.
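A presence constraint maps naturally onto a small record type. The sketch below is illustrative only: the names `PresenceConstraint` and `satisfies` are hypothetical, and because this slide does not say how $\Pr(D \in s \mid \hat{S})$ is estimated, the probability is simply passed in as a number.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class PresenceConstraint:
    """Triple <s, D, theta>: schema, sensitive attributes, threshold."""
    schema: str
    attributes: FrozenSet[str]
    threshold: float


def satisfies(constraint: PresenceConstraint, presence_probability: float) -> bool:
    """True if the estimated Pr(D in s | anonymized schema) is within theta.

    Assumption: the caller supplies the probability estimate.
    """
    return presence_probability <= constraint.threshold


# Hypothetical constraint: attribute CC of schema s1, threshold 0.5.
gamma = PresenceConstraint("s1", frozenset({"CC"}), threshold=0.5)
print(satisfies(gamma, 0.4))  # True
print(satisfies(gamma, 0.6))  # False
```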
6. Challenge 2 – Define anonymized schema
• How to define an “anonymized
schema” given a set of schemas?
– Enough information to understand,
but not overwhelming
• An anonymized schema contains a
set of “abstract” attributes
– An abstract attribute is a set of similar
attributes
[Figure: original schemas with attributes Name, Num, CC, and Holder are merged into an anonymized schema made of the abstract attributes {Name, Holder} and {CC, Num}.]
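The grouping of similar attributes into abstract attributes can be sketched as follows. This is a minimal illustration, not the authors' method: it assumes the pairs of similar attributes are already known (for instance from a schema matcher, which is not described here), and the function name `abstract_attributes` is hypothetical.

```python
def abstract_attributes(attributes, similar_pairs):
    """Group attributes into abstract attributes (sets of similar attributes).

    Assumption: `similar_pairs` lists attribute pairs judged similar;
    similarity is treated as transitive (union-find grouping).
    """
    parent = {a: a for a in attributes}

    def find(a):
        # Follow parent links to the group representative (with path halving).
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a, b in similar_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for a in attributes:
        groups.setdefault(find(a), set()).add(a)
    return {frozenset(g) for g in groups.values()}


anonymized = abstract_attributes(
    ["Name", "Num", "CC", "Holder"],
    [("Name", "Holder"), ("CC", "Num")],
)
print(sorted(sorted(g) for g in anonymized))
# [['CC', 'Num'], ['Holder', 'Name']]
```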
7. Challenge 3 – Define utility function
• How to define a utility function for a
given “anonymized schema”?
– Importance: sum of popularity of
attributes
• A schema that contains more popular
attributes is better
• An attribute that appears in more schemas is
more popular
– Completeness: number of abstract
attributes
• The more abstract attributes, the better
Let $\Sigma$ be the set of all possible
anonymized schemas. The utility
function $u: \Sigma \to \mathbb{R}$ measures the
amount of information in each
anonymized schema.
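Utility function:
$u(\hat{S}) = \mathit{importance}(\hat{S}) + \mathit{weight} \cdot \mathit{completeness}(\hat{S})$
[Figure: candidate anonymized schemas S1, S2, S3, drawn from the abstract attributes {Holder}, {CC}, {Name, Holder}, and {CC, Num}, compared by importance and completeness.]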
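The two ingredients of the utility function can be made concrete with a short sketch. Several details are assumptions rather than facts from the slides: popularity is taken to be the number of schemas containing an attribute, importance and completeness are combined additively with a `weight` parameter, and the three original schemas used as input are hypothetical.

```python
from collections import Counter


def popularity(schemas):
    """Number of schemas in which each attribute appears (assumed definition)."""
    counts = Counter()
    for attrs in schemas:
        counts.update(set(attrs))
    return counts


def importance(anonymized, pop):
    """Sum of the popularity of all attributes kept in the anonymized schema."""
    return sum(pop[a] for abstract in anonymized for a in abstract)


def completeness(anonymized):
    """Number of abstract attributes."""
    return len(anonymized)


def utility(anonymized, pop, weight=1.0):
    """Assumed combination: importance + weight * completeness."""
    return importance(anonymized, pop) + weight * completeness(anonymized)


# Hypothetical original schemas.
schemas = [{"Name", "Num"}, {"Name", "CC"}, {"Holder", "CC"}]
pop = popularity(schemas)

candidates = [
    [{"Holder"}],
    [{"Holder"}, {"CC"}],
    [{"Name", "Holder"}, {"CC", "Num"}],
]
for cand in candidates:
    print(utility(cand, pop))
# 2.0
# 5.0
# 8.0
```

Under these assumptions the candidate with more abstract attributes and more popular members scores highest, which matches the two bullet points on importance and completeness.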
8. Challenge 4 – Optimization problem (1)
Maximizing Anonymized Schema
Given a schema group $S$ and a set of privacy constraints $\Gamma$, construct
an anonymized schema $S^*$ such that $S^*$ satisfies all constraints in $\Gamma$ and
has the highest utility value.
• NP‐Hard problem
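Using the symbols introduced on the earlier slides ($\Sigma$ for the set of all possible anonymized schemas, $u$ for the utility function, $\Gamma$ for the privacy constraints), the problem can be restated compactly as a constrained maximization. This is only a restatement of the text above, not an additional formulation:

```latex
S^{*} \;=\; \arg\max_{\hat{S} \in \Sigma} \; u(\hat{S})
\quad \text{subject to} \quad
\hat{S} \text{ satisfies } \gamma \;\; \text{for every } \gamma \in \Gamma
```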
…
9. Challenge 4 – Optimization problem (2)
• Problem modeling
– Schema group: Affinity matrix
– Anonymized schema: Affinity instance
• Affinity instance is an affinity matrix with some empty cells
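[Figure: three schemas $s_1$ = (a1, a2), $s_2$ = (b1, b2), $s_3$ = (c1, c2) form the affinity matrix with rows (a1, b1, c1) and (a2, b2, c2). Two example affinity instances: rows (a1, b1, empty) and (a2, b2, c2) give the anonymized schema {a1, b1}, {a2, b2, c2}; rows (a1, b1, c1) and (empty, b2, empty) give the anonymized schema {a1, b1, c1}, {b2}.]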
Need to find an affinity instance satisfying privacy constraints and having
highest utility value
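The relationship between an affinity matrix, an affinity instance, and the resulting anonymized schema can be sketched in a few lines. The representation is an assumption made for illustration: rows hold similar attributes, columns correspond to schemas, and an empty cell is modeled as `None`. The function names are hypothetical.

```python
AFFINITY_MATRIX = [
    ["a1", "b1", "c1"],
    ["a2", "b2", "c2"],
]


def is_instance_of(instance, matrix):
    """An affinity instance equals the matrix except that cells may be empty."""
    return len(instance) == len(matrix) and all(
        len(row) == len(ref) and all(c is None or c == r for c, r in zip(row, ref))
        for row, ref in zip(instance, matrix)
    )


def to_anonymized_schema(instance):
    """Each non-empty row becomes one abstract attribute."""
    schema = []
    for row in instance:
        abstract = frozenset(cell for cell in row if cell is not None)
        if abstract:
            schema.append(abstract)
    return schema


instance_1 = [["a1", "b1", None], ["a2", "b2", "c2"]]
instance_2 = [["a1", "b1", "c1"], [None, "b2", None]]

for inst in (instance_1, instance_2):
    assert is_instance_of(inst, AFFINITY_MATRIX)
    print([sorted(a) for a in to_anonymized_schema(inst)])
# [['a1', 'b1'], ['a2', 'b2', 'c2']]
# [['a1', 'b1', 'c1'], ['b2']]
```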
10. Challenge 4 – Optimization problem (4)
• Overall solution:
– Meta‐heuristic with 2 steps
• Greedy algorithm: find a feasible solution
• Randomized local search: search for an optimal solution
– Improve performance
• Divide and conquer: partition the set of constraints into independent sets
and satisfy each set independently
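The two-step meta-heuristic can be illustrated with a generic sketch. It is not the authors' algorithm: the move set (drop one random cell, then refill in random order), the acceptance rule, and the toy `feasible` and `utility` functions are all assumptions chosen to keep the example small. Cells stand in for affinity-matrix cells, and `feasible` stands in for the privacy constraints.

```python
import random


def greedy(cells, feasible, utility):
    """Step 1: build a feasible instance by adding cells one at a time."""
    kept = set()
    # Try the individually most useful cells first.
    for cell in sorted(cells, key=lambda c: utility({c}), reverse=True):
        if feasible(kept | {cell}):
            kept.add(cell)
    return kept


def local_search(cells, start, feasible, utility, steps=200, seed=0):
    """Step 2: randomized moves; accept only strict utility improvements."""
    rng = random.Random(seed)
    best = set(start)
    for _ in range(steps):
        candidate = set(best)
        if candidate:
            # Drop one random cell to escape the current local optimum.
            candidate.discard(rng.choice(sorted(candidate)))
        refill = list(cells)
        rng.shuffle(refill)
        for cell in refill:
            if cell not in candidate and feasible(candidate | {cell}):
                candidate.add(cell)
        if utility(candidate) > utility(best):
            best = candidate
    return best


# Toy instance (hypothetical): a1 may not be revealed together with b1 or c1.
WEIGHTS = {"a1": 3, "b1": 2, "c1": 2}
FORBIDDEN = [{"a1", "b1"}, {"a1", "c1"}]


def feasible(kept):
    return not any(pair <= kept for pair in FORBIDDEN)


def utility(kept):
    return sum(WEIGHTS[c] for c in kept)


cells = sorted(WEIGHTS)
first = greedy(cells, feasible, utility)
best = local_search(cells, first, feasible, utility)
print(sorted(first), utility(first))
print(sorted(best), utility(best))
```

In this toy instance the greedy step commits to the single most valuable cell, and the local search then trades it for two cheaper cells with a higher combined utility. That is the kind of improvement a second, randomized step is meant to find.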
11. Experiments - Setting
Datasets:
• Real data: 117 schemas
• Synthetic data: vary the number of schemas and the number of attributes
Evaluation Metrics:
– Utility loss: measures the reduction in utility caused by the
privacy constraints
• $\Delta_u = \frac{u_\emptyset - u_\Gamma}{u_\emptyset}$, where $u_\emptyset$ is the utility without constraints and $u_\Gamma$ is the utility with a
set of constraints $\Gamma$
– Privacy loss: measures the disagreement between the actual
privacy $P = \{p_i\}$ and the expected privacy $\Theta = \{\theta_i\}$
• $\Delta_p = KL(P \parallel \Theta) = \sum_i p_i \log \frac{p_i}{\theta_i}$
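Both evaluation metrics are short enough to compute directly. The sketch below assumes the natural logarithm (the slides do not state the base) and treats terms with $p_i = 0$ as contributing zero; the function names are hypothetical.

```python
import math


def utility_loss(u_unconstrained, u_constrained):
    """Relative utility reduction: (u_empty - u_gamma) / u_empty."""
    return (u_unconstrained - u_constrained) / u_unconstrained


def privacy_loss(actual, expected):
    """KL-style divergence: sum_i p_i * log(p_i / theta_i).

    Assumptions: natural log; terms with p_i == 0 contribute 0.
    """
    return sum(p * math.log(p / t) for p, t in zip(actual, expected) if p > 0)


print(utility_loss(8.0, 6.0))                  # 0.25
print(privacy_loss([0.5, 0.25], [0.5, 0.25]))  # 0.0
```

When the actual privacy values match the expected thresholds exactly, the privacy loss is zero; the further they diverge, the larger its magnitude.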
12. Experiments – Computation Time
• With 100 schemas, 50 attributes, and 1500 constraints,
the running time is about 6 s
[Figure: computation time (log2 of msec.)]
13. Experiment – Privacy & Utility
• Validate the trade‐off between privacy and utility
• Evaluation procedure
– Relax constraint: increase the privacy threshold $\theta$ to $(1 + r)\theta$, where $r$ is the relaxing ratio
• Observation
– The stricter the privacy you enforce, the greater the utility loss.
Both utility loss and privacy loss
are normalized to [0,1]:
$\Delta_u = \frac{\Delta_u - \min \Delta_u}{\max \Delta_u - \min \Delta_u}$
$\Delta_p = \frac{\Delta_p - \min \Delta_p}{\max \Delta_p - \min \Delta_p}$
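The evaluation procedure combines two simple computations: relaxing a threshold by a ratio $r$ and min-max normalizing a series of loss values. A minimal sketch follows. The function names are hypothetical, and mapping a constant series to zeros is an assumption made only to avoid division by zero.

```python
def relax(theta, r):
    """Relaxed privacy threshold (1 + r) * theta."""
    return (1 + r) * theta


def min_max_normalize(values):
    """Rescale values to [0, 1] with (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant series is mapped to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(min_max_normalize([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```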
14. Conclusion
Introduced schema reuse with privacy constraints
Defined privacy constraints
Defined an anonymized schema from multiple schemas
Defined a utility function for a certain anonymized schema
Constructed an anonymized schema that satisfies privacy
constraints and maximizes the utility function