Presentation at SeWeBMeDA 2018 (June 3, 2018), ESWC, Crete: schema extraction enables the formulation of data selection and integration queries without direct data access.
Schema Extraction for Privacy Preserving Processing of Sensitive Data
1. Schema Extraction for Privacy Preserving Processing of Sensitive Data
Enabling Privacy Maintaining Processing of Sensitive Data
Lars Gleim, RWTH Aachen University, Germany
SeWeBMeDA - June 3, 2018, Crete
3. Lack of Data Reuse
Medical Research Data
(as well as many other privacy sensitive data)
¬F is hard to discover due to data publishing restrictions
¬A often not directly shareable with or inspectable by researchers (legally)
¬I is often hard to describe with static (standard) information models and thus uses custom vocabularies & information models (or extensions)
¬R typically has no proper structural metadata
➜ is not reusable
4. Context
Personal Health Train (PHT)
Key concept:
● data do not travel, algorithms do
● processing at location of origin
Published (PoC) implementations:
1. Jochems et al.: Distributed learning: Developing a predictive model based on data from multiple hospitals without data leaving the hospital – A real life proof of concept. Radiotherapy and Oncology 121(3), 459–467 (2016), http://dx.doi.org/10.1016/j.radonc.2016.10.002
2. Deist et al.: Infrastructure and distributed learning methodology for privacy-preserving multi-centric rapid learning health care: euroCAT. Clinical and Translational Radiation Oncology 4, 24–31 (2017), http://linkinghub.elsevier.com/retrieve/pii/S2405630816300271
5. Data Stations
● SPARQL/RDF Data Integration Engine
○ aggregates associated Data Banks
○ exposes data as standard RDF
○ evaluates Train’s SPARQL query
● Secure Computation Environment
○ executes the data processing algorithm in a secure enclave
➜ Requires a priori agreement upon a shared information model & encoding
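The division of labour described above (the station evaluates the train's selection query, then runs the train's algorithm locally) can be sketched roughly as follows. This is a minimal illustration only: the `Station` class, the record layout, and the `mean_age` algorithm are hypothetical, and plain Python dictionaries stand in for the RDF data and SPARQL query a real Data Station would use.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Station:
    """Hypothetical Data Station: holds records that never leave it."""
    name: str
    _records: list

    def run(self, select: Callable[[dict], bool],
            algorithm: Callable[[list], float]) -> float:
        # Evaluate the train's selection locally (stand-in for a SPARQL query).
        selected = [r for r in self._records if select(r)]
        # Only the algorithm's result is returned, never the records.
        return algorithm(selected)


def mean_age(records: list) -> float:
    """Hypothetical processing algorithm carried by the train."""
    return mean(r["age"] for r in records)


# Toy data, invented purely for illustration.
stations = [
    Station("hospital_a", [{"age": 60, "smoker": True},
                           {"age": 64, "smoker": True},
                           {"age": 40, "smoker": False}]),
    Station("hospital_b", [{"age": 50, "smoker": True},
                           {"age": 70, "smoker": True}]),
]

# The "train" visits each station with the same selection and algorithm.
results = {s.name: s.run(lambda r: r["smoker"], mean_age) for s in stations}
```

Note that the selection function already assumes the field names `age` and `smoker`, which is the a priori agreement on an information model that the slide points out.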
6. Observation
Data Schema
(a formal description of the structure of the data)
● is typically not privacy sensitive
● is easily publishable
● even without direct data access enables
○ design of data selection and integration queries
○ development of processing algorithms
8. Data Stations
● SPARQL/RDF Data Integration Engine
○ aggregates associated Data Banks
○ exposes data as standard RDF
○ evaluates Train’s SPARQL query
● Secure Computation Environment
○ executes data processing algorithm in a secure enclave
● Schema Introspection Endpoint
○ exposes structural description of data
9. Schema Introspection Endpoint
● provides a formal description of the structure of the available data
● is (semi-)publicly accessible
● enables formulation of ad-hoc data selection & integration queries
➜ How to efficiently publish and maintain the schema?
10. RDFS+ Schema
RDFS+ (RDFS plus a little bit of OWL)
● Classes, properties, relations between them, hierarchies, equivalences
○ e.g. rdfs:subClassOf, rdfs:subPropertyOf, owl:sameAs, owl:equivalentClass and owl:equivalentProperty
● inference is typically supported by RDF triple stores (references 1 and 2)
1. http://docs.openlinksw.com/virtuoso/rdfsparqlruleintro/
2. https://wiki.blazegraph.com/wiki/index.php/InferenceAndTruthMaintenance
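To make the kind of inference meant here concrete, the following is a small sketch of rdfs:subClassOf reasoning over an in-memory set of triples. It is not how a triple store implements inference; the `ex:` names are hypothetical, and prefixed names are kept as plain strings for brevity.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"

# Toy triples; ex:Patient and ex:alice are invented for illustration.
triples = {
    ("ex:Patient", SUBCLASS, "foaf:Person"),
    ("foaf:Person", SUBCLASS, "foaf:Agent"),
    ("ex:alice", TYPE, "ex:Patient"),
}


def superclasses(cls, triples):
    """All classes reachable from cls via rdfs:subClassOf (transitively)."""
    found, stack = set(), [cls]
    while stack:
        current = stack.pop()
        for s, p, o in triples:
            if s == current and p == SUBCLASS and o not in found:
                found.add(o)
                stack.append(o)
    return found


def inferred_types(resource, triples):
    """Asserted types of resource plus all of their superclasses."""
    direct = {o for s, p, o in triples if s == resource and p == TYPE}
    inferred = set(direct)
    for cls in direct:
        inferred |= superclasses(cls, triples)
    return inferred


print(sorted(inferred_types("ex:alice", triples)))
```

With inference of this kind available, a query for all instances of foaf:Agent would also match ex:alice, although only the ex:Patient type is asserted.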
11. Automatic Schema Extraction
RDFS+ schema extraction using a simple SPARQL query
A. Relying upon proper RDFS+ inference of SPARQL endpoint
B. Using the SPARQL 1.1 Property Paths feature
Extracts the subset of vocabularies & ontologies instantiated in the data
Two-step extract & publish approach enables air-gapped deployments
Alternatively: Deployment as a virtual view of the data store
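The extraction query itself is not reproduced in this transcript. As a rough sketch of the idea under variant B, a property-path query might look like the string below, and the pure-Python function mimics the same logic on toy data: keep only those vocabulary statements whose subject is a class or property actually used in the data, or an ancestor of one. The query, the function name `extract_schema`, and all data are assumptions for illustration, not the presenter's implementation.

```python
# Illustrative only (not executed here): subclass axioms of all instantiated
# classes and their ancestors, using a SPARQL 1.1 property path.
SPARQL_SKETCH = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT { ?class rdfs:subClassOf ?super }
WHERE {
  ?instance a/rdfs:subClassOf* ?class .
  ?class rdfs:subClassOf ?super .
}
"""

TYPE, SUBCLASS, SUBPROP = "rdf:type", "rdfs:subClassOf", "rdfs:subPropertyOf"
SCHEMA_PREDICATES = {SUBCLASS, SUBPROP, "rdfs:domain", "rdfs:range",
                     "owl:equivalentClass", "owl:equivalentProperty"}


def extract_schema(data, vocabulary):
    """Vocabulary triples about terms instantiated in data, or their ancestors."""
    frontier = {o for s, p, o in data if p == TYPE}   # instantiated classes
    frontier |= {p for s, p, o in data if p != TYPE}  # used properties
    schema = set()
    while frontier:
        term = frontier.pop()
        for triple in vocabulary:
            s, p, o = triple
            if s == term and p in SCHEMA_PREDICATES and triple not in schema:
                schema.add(triple)
                if p in (SUBCLASS, SUBPROP):
                    frontier.add(o)  # follow the hierarchy upwards
    return schema


# Toy instance data and a toy vocabulary excerpt loosely modelled on FOAF.
data = {("ex:alice", TYPE, "foaf:Person"), ("ex:alice", "foaf:name", "Alice")}
vocabulary = {
    ("foaf:Person", SUBCLASS, "foaf:Agent"),
    ("foaf:name", SUBPROP, "rdfs:label"),
    ("foaf:Organization", SUBCLASS, "foaf:Agent"),  # never instantiated
    ("foaf:mbox", "rdfs:domain", "foaf:Agent"),     # never used
}

schema = extract_schema(data, vocabulary)
```

In this toy case two of the four vocabulary triples survive, because foaf:Organization and foaf:mbox do not occur in the data. The resulting set could then be published separately from the data store, in line with the two-step extract & publish approach.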
13. Experimental Validation
Personal contacts described using foaf & schema.org vocabularies
➜ 96.7% reduction in schema triple count (vs. full foaf & schema.org)
➜ allows for focused query design based on only relevant schema, reducing cognitive & computational load during introspection
➜ public schema enables data discovery & integration
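The reduction figure is presumably the relative decrease in triple count between the full vocabularies and the extracted schema. A minimal sketch of that calculation follows; the function name and the counts in the example are hypothetical, since the slides report only the percentage and not the underlying triple counts.

```python
def schema_reduction(full_count: int, extracted_count: int) -> float:
    """Relative reduction in schema triples, as a fraction between 0 and 1."""
    if full_count <= 0:
        raise ValueError("full_count must be positive")
    return 1 - extracted_count / full_count


# Hypothetical counts, chosen only to illustrate the formula.
print(f"{schema_reduction(full_count=200, extracted_count=50):.1%}")
```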
14. Schema-based Privacy Preserving Processing
F facilitates data discovery through schema publishing
A alleviates sharing hurdles by sharing query & algorithm instead of data
I provides a publicly accessible semantic description of the available data
R ➜ facilitates reuse
Step towards FAIR Data