Schema Extraction for Privacy Preserving Processing of Sensitive Data

Enabling Privacy-Maintaining Processing of Sensitive Data
Lars Gleim, RWTH Aachen University, Germany
SeWeBMeDA - June 3, 2018, Crete

Team
Tübingen University
Oliver Kohlbacher
Holger Stenzhorn
Lukas Zimmermann
Fraunhofer FIT / RWTH Aachen University
Stefan Decker
Oya Beyan
Lars Gleim
Md. Rezaul Karim

Lack of Data Reuse
Medical Research Data
(as well as many other privacy sensitive data)
¬F (not findable): hard to discover due to data publishing restrictions
¬A (not accessible): often not legally shareable with, or inspectable by, researchers
¬I (not interoperable): often hard to describe with static (standard) information models and thus uses custom vocabularies & information models (or extensions)
¬R (not reusable): typically has no proper structural metadata
➜ is not reusable

Context
Personal Health Train (PHT)
Key concept:
● data do not travel, algorithms do
● processing at location of origin
Published (PoC) implementations:
1. Jochems et al.: Distributed learning: Developing a predictive model based on data from multiple hospitals without data leaving the hospital – A real life proof of concept. Radiotherapy and Oncology 121(3), 459–467 (2016), http://dx.doi.org/10.1016/j.radonc.2016.10.002
2. Deist et al.: Infrastructure and distributed learning methodology for privacy-preserving multi-centric rapid learning health care: euroCAT. Clinical and Translational Radiation Oncology 4, 24–31 (2017), http://linkinghub.elsevier.com/retrieve/pii/S2405630816300271

Data Stations
● SPARQL/RDF Data Integration Engine
○ aggregates associated Data Banks
○ exposes data as standard RDF
○ evaluates Train’s SPARQL query
● Secure Computation Environment
○ executes data processing algorithm in a secure enclave
➜ Requires a priori agreement on a shared information model & encoding
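The "data do not travel, algorithms do" principle can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the PHT implementation: `Station`, `mean_age_train`, `dispatch`, and the `age` field are hypothetical names, the records are toy values, and plain Python lists stand in for a station's RDF store and secure enclave.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]
Aggregate = Tuple[float, int]


@dataclass
class Station:
    """Hypothetical data station: holds records that never leave it."""
    name: str
    _records: List[Record]

    def run(self, train: Callable[[List[Record]], Aggregate]) -> Aggregate:
        # The algorithm (the "train") is executed where the data lives;
        # only its aggregate result is handed back to the caller.
        return train(self._records)


def mean_age_train(records: List[Record]) -> Aggregate:
    """Train payload: returns a partial sum and a count, never raw records."""
    ages = [r["age"] for r in records]
    return sum(ages), len(ages)


def dispatch(stations: List[Station]) -> float:
    """Send the train to every station and combine the partial aggregates."""
    total, count = 0.0, 0
    for station in stations:
        partial_sum, partial_count = station.run(mean_age_train)
        total += partial_sum
        count += partial_count
    return total / count


# Toy stations with made-up values, for illustration only.
stations = [
    Station("hospital_a", [{"age": 40.0}, {"age": 60.0}]),
    Station("hospital_b", [{"age": 50.0}]),
]
overall_mean = dispatch(stations)
```

Note that the train hard-codes the field name `age`: every station must already agree on that name and its encoding, which is the kind of prior agreement this slide points to.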

Observation
Data Schema
(a formal description of the structure of the data)
● is typically not privacy sensitive
● is easily publishable
● enables, even without direct data access:
○ design of data selection and integration queries
○ development of processing algorithms
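To illustrate how a published schema alone might support query design, the sketch below drafts a SPARQL SELECT query from a schema excerpt without touching any records. The `SCHEMA` dictionary layout and the helpers `local_name` and `build_select_query` are assumptions made for this example; only the FOAF term IRIs are real.

```python
from typing import Dict, List

# Hypothetical excerpt of a published schema:
# class IRI -> IRIs of properties used with instances of that class.
SCHEMA: Dict[str, List[str]] = {
    "http://xmlns.com/foaf/0.1/Person": [
        "http://xmlns.com/foaf/0.1/name",
        "http://xmlns.com/foaf/0.1/mbox",
    ],
}


def local_name(iri: str) -> str:
    """Derive a SPARQL variable name from the last segment of an IRI."""
    return iri.rstrip("/#").rsplit("/", 1)[-1].rsplit("#", 1)[-1]


def build_select_query(schema: Dict[str, List[str]], class_iri: str) -> str:
    """Draft a SELECT query for all listed properties of one class."""
    props = schema[class_iri]
    variables = " ".join(f"?{local_name(p)}" for p in props)
    patterns = "\n".join(f"  ?s <{p}> ?{local_name(p)} ." for p in props)
    return (
        f"SELECT ?s {variables}\n"
        "WHERE {\n"
        f"  ?s a <{class_iri}> .\n"
        f"{patterns}\n"
        "}"
    )


query = build_select_query(SCHEMA, "http://xmlns.com/foaf/0.1/Person")
```

The resulting query string could then be shipped to a data station for evaluation, so that the researcher never needs direct access to the underlying data.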

Data Stations
● SPARQL/RDF Data Integration Engine
○ aggregates associated Data Banks
○ exposes data as standard RDF
○ evaluates Train’s SPARQL query
● Secure Computation Environment
○ executes data processing algorithm in a secure enclave
● Schema Introspection Endpoint
○ exposes structural description of data

Schema Introspection Endpoint
● provides a formal description of the structure of the available data
● is (semi-)publicly accessible
● enables formulation of ad-hoc data selection & integration queries
➜ How to efficiently publish and maintain the schema?

RDFS+ Schema
RDFS+ (RDFS plus a little bit of OWL)
● Classes, properties, relations between them, hierarchies, equivalences
○ e.g. rdfs:subClassOf, rdfs:subPropertyOf, owl:sameAs, owl:equivalentClass and owl:equivalentProperty
● inference typically supported by RDF triple stores [1, 2]
1. http://docs.openlinksw.com/virtuoso/rdfsparqlruleintro/
2. https://wiki.blazegraph.com/wiki/index.php/InferenceAndTruthMaintenance
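As a rough idea of what such inference does, the sketch below applies two RDFS entailment rules (rdfs9, type inheritance, and rdfs11, subclass transitivity) to a set of triples until nothing new is derived. It is a naive fixed-point loop for illustration, not how a triple store implements reasoning; `infer_types`, the prefixed-name strings, and the `ex:` terms are assumptions.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def infer_types(triples: Set[Triple]) -> Set[Triple]:
    """Apply the RDFS rules rdfs9 and rdfs11 until no new triples appear."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        new: Set[Triple] = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                # Only pairs (s p o), (o rdfs:subClassOf o2) are relevant.
                if p2 != SUBCLASS or o != s2:
                    continue
                if p == SUBCLASS:
                    new.add((s, SUBCLASS, o2))  # rdfs11: transitivity
                elif p == RDF_TYPE:
                    new.add((s, RDF_TYPE, o2))  # rdfs9: type inheritance
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred


# ex:Patient and ex:alice are hypothetical; foaf:Person is a subclass
# of foaf:Agent in the FOAF vocabulary.
facts: Set[Triple] = {
    ("ex:Patient", SUBCLASS, "foaf:Person"),
    ("foaf:Person", SUBCLASS, "foaf:Agent"),
    ("ex:alice", RDF_TYPE, "ex:Patient"),
}
closure = infer_types(facts)
```

With inference enabled, a query for all instances of foaf:Agent would also match ex:alice, even though that type was never asserted explicitly.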

Automatic Schema Extraction
RDFS+ schema extraction using a simple SPARQL query, either
A. Relying on proper RDFS+ inference by the SPARQL endpoint, or
B. Using the SPARQL 1.1 Property Paths feature
Extracts the subset of vocabularies & ontologies instantiated in the data
Two-step extract & publish approach enables air-gapped deployments
Alternatively: Deployment as a virtual view of the data store
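The extraction idea can be sketched as follows. The slides do not show the actual query, so both the SPARQL string and the Python function below are assumptions: `CLASS_QUERY` is one possible property-path formulation (variant B) covering classes only, and `extract_schema` is a hypothetical pure-Python analogue that also follows rdfs:subPropertyOf. Triples are modelled as plain tuples of prefixed names rather than a real RDF store.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "rdf:type"
HIERARCHY = {"rdfs:subClassOf", "rdfs:subPropertyOf"}

# Illustrative SPARQL 1.1 analogue for classes only (an assumption,
# not the authors' query; it is not executed in this sketch).
CLASS_QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT { ?c ?p ?o }
WHERE { ?s a/rdfs:subClassOf* ?c . ?c ?p ?o }
"""


def extract_schema(data: Set[Triple], vocabulary: Set[Triple]) -> Set[Triple]:
    """Return the vocabulary triples describing terms instantiated in data."""
    # Terms used directly: classes via rdf:type, properties as predicates.
    frontier = {o for _, p, o in data if p == RDF_TYPE}
    frontier |= {p for _, p, _ in data}
    reached: Set[str] = set()
    # Walk hierarchy links upwards, as a subClassOf* path would.
    while frontier:
        term = frontier.pop()
        if term in reached:
            continue
        reached.add(term)
        frontier |= {o for s, p, o in vocabulary if s == term and p in HIERARCHY}
    return {t for t in vocabulary if t[0] in reached}


# Small FOAF excerpt as the full vocabulary; ex:alice is made-up instance data.
vocabulary: Set[Triple] = {
    ("foaf:Person", "rdfs:subClassOf", "foaf:Agent"),
    ("foaf:Organization", "rdfs:subClassOf", "foaf:Agent"),
    ("foaf:Agent", "rdfs:label", "Agent"),
    ("foaf:name", "rdfs:subPropertyOf", "rdfs:label"),
    ("foaf:homepage", "rdfs:subPropertyOf", "foaf:page"),
}
data: Set[Triple] = {
    ("ex:alice", RDF_TYPE, "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
}
schema = extract_schema(data, vocabulary)
```

In this toy case the extracted schema keeps the triples about foaf:Person, foaf:Agent, and foaf:name and drops foaf:Organization and foaf:homepage, because nothing in the data instantiates them. The result contains no instance data, so it could be written to a file and published separately, in line with the two-step extract & publish approach.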

Schema-based Privacy Preserving Processing

Experimental Validation
Personal contacts described using foaf & schema.org vocabularies
➜ 96.7% reduction in schema triple count (vs. full foaf & schema.org)
➜ allows for focused query design based on only relevant schema, reducing cognitive & computational load during introspection
➜ public schema enables data discovery & integration

Schema-based Privacy Preserving Processing
F facilitates data discovery through schema publishing
A alleviates sharing hurdles by sharing query & algorithm instead of data
I provides a publicly accessible semantic description of the available data
R ➜ facilitates reuse
Step towards FAIR Data

Thank you very much for your attention
Questions?
