Data processing is increasingly the subject of various internal and external regulations, such as GDPR which has recently come into effect. Instead of assuming that such processes avail of data sources (such as files and relational databases), we approach the problem in a more abstract manner and view these processes as taking datasets as input. These datasets are then created by pulling data from various data sources. Taking a W3C Recommendation for prescribing the structure of and for describing datasets, we investigate an extension of that vocabulary for the generation of executable R2RML mappings. This results in a top-down approach where one prescribes the dataset to be used by a data process and where to find the data, and where that prescription is subsequently used to retrieve the data for the creation of the dataset “just in time”. We argue that this approach to the generation of an R2RML mapping from a dataset description is the first step towards policy-aware mappings, where the generation takes into account regulations to generate mappings that are compliant. In this paper, we describe how one can obtain an R2RML mapping from a data structure definition in a declarative manner using SPARQL CONSTRUCT queries, and demonstrate it using a running example. Some of the more technical aspects are also described.
Reference: Christophe Debruyne, Dave Lewis, Declan O'Sullivan: Generating Executable Mappings from RDF Data Cube Data Structure Definitions. OTM Conferences (2) 2018: 333-350
1. Generating Executable Mappings from RDF Data Cube Data Structure Definitions
Christophe Debruyne, Dave Lewis, Declan O’Sullivan
Trinity College Dublin
2018-10-23 @ ODBASE
The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.
www.adaptcentre.ie
2. Introduction
• Data processing is increasingly the subject of various
internal and external regulations – e.g., GDPR.
• Datasets are created and used for a particular purpose.
E.g., sending newsletters or using the purchase history of
users to suggest recommendations. In the context of GDPR,
these purposes require a user’s informed consent.
• Can we generate datasets for a particular purpose “just in
time” that comply with informed consent?
3. Introduction
• R2RML is a convenient way to transform (relational) non-
RDF data into RDF to create these datasets.
• One can create mappings from databases to vocabularies,
ontologies, etc. for data processing activities.
• We, however, chose to adopt the RDF Data Cube
Vocabulary (QB) for representing datasets.
4. Introduction
• QB is an ontology for multi-dimensional datasets.
A Data Structure Definition prescribes how a Dataset and
its Observations are structured. An Observation is identified
by Dimensions and captures a value for a Measure.
• QB’s foundations are rooted in a schema for statistical
datasets and the ontology may seem complicated, but the
RDF vocabulary is useful for other types of datasets as well.
• Our choice was also influenced by projects in the health
domain where statistical processing of data is key*
*AVERT project: https://www.tcd.ie/medicine/thkc/avert/index.php/
5. Research Question
• From: “Can we generate datasets for a particular purpose
‘just in time’ that comply with informed consent?”
• To: “If we have a DSD for a particular purpose, how can we
create an executable R2RML mapping to generate a dataset
that complies with that DSD’s structure?”
• A solution could subsequently be extended to take policies
into account so as to generate mappings that are
compliant. In other words: “policy-aware”. To be reported.
6. Approach
• R2DQB – pronounced R-2-D-cube
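• Overview of the approach (figure):
1. Data Structure Definitions (Dimensions, Measures, Attributes) are extended with references to tables, references to columns, transformation functions, …
2. The Mapping Engine generates an R2RML Mapping.
3. The R2RML Processor executes the mapping to produce a Data Cube Dataset.
4. Validation of the Data Cube Dataset according to the Data Structure Definition.
5. Provenance Information is captured.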
7. Approach
Step 1: annotating DSDs
• May be done in a separate graph (separation of concerns)
• We chose to reuse R2RML to assess the feasibility in this
study. A bespoke vocabulary may be considered in the
future.
(example from the RDF Data Cube Vocabulary Recommendation)
8. The DSD (prefixes omitted for brevity)
@base <http://www.example.org/> .
<#refPeriod> a rdf:Property, qb:DimensionProperty ;
    rdfs:subPropertyOf sdmx-dimension:refPeriod .
<#refArea> a rdf:Property, qb:DimensionProperty ;
    rdfs:subPropertyOf sdmx-dimension:refArea .
<#lifeExpectancy> a rdf:Property, qb:MeasureProperty ;
    rdfs:subPropertyOf sdmx-measure:obsValue ;
    rdfs:range xsd:decimal .
sdmx-dimension:sex a rdf:Property, qb:DimensionProperty .
<#dsd-le> a qb:DataStructureDefinition ;
    # The dimensions
    qb:component [ qb:dimension <#refArea> ] ;
    qb:component [ qb:dimension <#refPeriod> ] ;
    qb:component [ qb:dimension sdmx-dimension:sex ] ;
    # The measure(s)
    qb:component [ qb:measure <#lifeExpectancy> ] .
The annotations (prefixes omitted for brevity)
@base <http://www.example.org/> .
<#refPeriod> rr:column "period" .
<#refArea> rr:column "area" .
<#lifeExpectancy> rr:column "lifeexpectancy" .
sdmx-dimension:sex rr:column "sex" .
<#dsd-le> rr:tableName "statssimple" .
9. Approach
Step 2: Generating the R2RML mapping
• Adopting a declarative approach with SPARQL CONSTRUCT
queries:
1. Generating a triples map for each DSD
2. Generating a subject map for each DSD and a predicate
object map for linking observations to dataset
Subject map is based on dimensions, as
observations are identified by those.
3. Generating predicate object maps from measures
4. Generating predicate object maps from dimensions
5. Generating a link between dataset and DSD
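As an illustration, the five generation steps applied to the running example could yield a mapping along the following lines. This is a sketch, not the output of the tool: the triples map names, the pam: namespace IRI, the dataset IRI, the order of the columns in the template, the use of rdfs:range as rr:datatype, and the separate triples map that links dataset and DSD are all assumptions.

```turtle
@base <http://www.example.org/> .
@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix qb:  <http://purl.org/linked-data/cube#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#> .
@prefix pam: <http://www.example.org/pam#> .  # placeholder IRI (assumption)

# 1. One triples map per DSD (the name <#tm-le> is hypothetical)
<#tm-le> a rr:TriplesMap ;
    pam:correspondsWith <#dsd-le> ;
    rr:logicalTable [ rr:tableName "statssimple" ] ;
    # 2. Subject map based on the dimensions, plus the link to the dataset
    rr:subjectMap [
        rr:class qb:Observation ;
        rr:termType rr:BlankNode ;
        rr:template "{area}-{period}-{sex}"  # column order is an assumption
    ] ;
    rr:predicateObjectMap [
        rr:predicate qb:dataSet ;
        rr:object <statssimple>              # dataset IRI is an assumption
    ] ;
    # 3. Predicate object maps from measures
    rr:predicateObjectMap [
        rr:predicate <#lifeExpectancy> ;
        rr:objectMap [ rr:column "lifeexpectancy" ; rr:datatype xsd:decimal ]
    ] ;
    # 4. Predicate object maps from dimensions
    rr:predicateObjectMap [ rr:predicate <#refArea> ;
        rr:objectMap [ rr:column "area" ] ] ;
    rr:predicateObjectMap [ rr:predicate <#refPeriod> ;
        rr:objectMap [ rr:column "period" ] ] ;
    rr:predicateObjectMap [ rr:predicate sdmx-dimension:sex ;
        rr:objectMap [ rr:column "sex" ] ] .

# 5. Link between dataset and DSD (modelled here as a second triples map)
<#tm-le-dataset> a rr:TriplesMap ;
    rr:logicalTable [ rr:tableName "statssimple" ] ;
    rr:subjectMap [ rr:constant <statssimple> ; rr:class qb:DataSet ] ;
    rr:predicateObjectMap [ rr:predicate qb:structure ; rr:object <#dsd-le> ] .
```

Every term in this sketch is plain R2RML except pam:correspondsWith, so a mapping of this shape should be executable by any R2RML processor as long as no transformation functions are needed.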
10. Constructing a subject map for observations and a predicate object map for linking observations to a dataset
CONSTRUCT {
  ?tm rr:subjectMap [
    rr:class qb:Observation ;
    rr:termType rr:BlankNode ;
    rr:template ?x ;
  ] .
  ?tm rr:predicateObjectMap [
    rr:predicate qb:dataSet ;
    rr:object ?ds ;
  ] .
} WHERE {
  ?tm pam:correspondsWith ?dsd ;
      rr:logicalTable [ rr:tableName ?t ] .
  # The dataset IRI is derived from the table name
  BIND(IRI(?t) AS ?ds)
  {
    # Build the template "{c1}-{c2}-..." from the columns of the DSD's components;
    # ?dsd is projected so that each template joins with its own triples map
    SELECT ?dsd
      (CONCAT("{", GROUP_CONCAT(?c; SEPARATOR="}-{"), "}") AS ?x) {
      ?dsd qb:component ?component .
      { ?component qb:dimension [ rr:column ?c ] }
      UNION
      # OMITTED FOR CLARITY (SEE PAPER)
    } GROUP BY ?dsd
  }
}
All queries can be found in the paper.
12. Approach
Step 3: Executing the R2RML Mapping – straightforward
We used our own implementation of R2RML, called R2RML-F,
which extends the specification with JavaScript functions.
Step 4: Validating the generated RDF
Using the integrity constraints specified by the RDF Data Cube
Vocabulary Recommendation
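To make Steps 3 and 4 more concrete, suppose the table statssimple held a single row with area 'Cardiff', period '2004-2006', sex 'M' and lifeexpectancy 78.7. These values are hypothetical. Executing a generated mapping could then produce a graph like the sketch below, in which the blank node label and the dataset IRI are assumptions.

```turtle
@base <http://www.example.org/> .
@prefix qb:  <http://purl.org/linked-data/cube#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#> .

# The dataset and its link to the DSD
<statssimple> a qb:DataSet ;
    qb:structure <#dsd-le> .

# One observation per row; the blank node label is chosen by the processor
_:obs1 a qb:Observation ;
    qb:dataSet <statssimple> ;
    <#refArea> "Cardiff" ;
    <#refPeriod> "2004-2006" ;
    sdmx-dimension:sex "M" ;
    <#lifeExpectancy> "78.7"^^xsd:decimal .
```

A graph of this shape is what the integrity constraints of the Recommendation check, for instance that every observation has exactly one qb:dataSet (IC-1), that every dataset has exactly one qb:structure (IC-2), and that every observation has a value for each dimension (IC-11).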
13. Approach
Step 5: Provenance Information
Keep track of activities and intermediate results with PROV-O.
This will become key for a posteriori compliance analysis in
future work.
Figure: the pam classes (pam:DSD_Document, pam:R2RML_Mapping, pam:Validation_Report; pam:Generate_Mapping, pam:Execute_Mapping, pam:Validate_Dataset; pam:Mapping_Generator, pam:R2RML_Processor, pam:Validator) arranged in a hierarchy under owl:Thing, prov:Entity, prov:Activity, prov:Agent and prov:SoftwareAgent.
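A provenance trace for one run might look like the following sketch. The pam: class names are taken from the slide; the pam: namespace IRI, the instance names, the typing of each pam: class as a PROV-O entity, activity or software agent (inferred from the class names), and the choice of PROV-O properties are assumptions.

```turtle
@base <http://www.example.org/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix pam:  <http://www.example.org/pam#> .  # placeholder IRI (assumption)

# Generating the mapping from the annotated DSD
<#dsd-doc> a pam:DSD_Document, prov:Entity .
<#generator> a pam:Mapping_Generator, prov:SoftwareAgent .
<#generate-1> a pam:Generate_Mapping, prov:Activity ;
    prov:used <#dsd-doc> ;
    prov:wasAssociatedWith <#generator> .
<#mapping-1> a pam:R2RML_Mapping, prov:Entity ;
    prov:wasGeneratedBy <#generate-1> ;
    prov:wasDerivedFrom <#dsd-doc> .

# Executing the mapping to obtain the dataset
<#processor> a pam:R2RML_Processor, prov:SoftwareAgent .
<#execute-1> a pam:Execute_Mapping, prov:Activity ;
    prov:used <#mapping-1> ;
    prov:wasAssociatedWith <#processor> .
<#dataset-1> a prov:Entity ;
    prov:wasGeneratedBy <#execute-1> .

# Validating the dataset and keeping the report
<#validator> a pam:Validator, prov:SoftwareAgent .
<#validate-1> a pam:Validate_Dataset, prov:Activity ;
    prov:used <#dataset-1> ;
    prov:wasAssociatedWith <#validator> .
<#report-1> a pam:Validation_Report, prov:Entity ;
    prov:wasGeneratedBy <#validate-1> .
```

Following the prov:wasGeneratedBy and prov:used links backwards from a dataset leads to the mapping and the DSD it came from, which is the kind of trail an a posteriori compliance analysis would need.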
14. Features
Mapping values onto URIs, and
Inclusion of data transformation functions
• Mapping languages such as D2RQ had so-called translation tables,
which mapped elements of one set to elements of another. Ideal
for mapping values to IRIs. R2RML has no such functionality.
That is why we chose to adopt R2RML-F, where such
“translation tables” can be written as a JavaScript function.
• R2RML-F also allows transformation functions to be written
when the underlying database technology has no support for
them.
Possibility to interlink with external datasets provided by R2RML
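A “translation table” for the sex dimension could be sketched as below. The rrf: namespace and the property names (rrf:functionCall, rrf:function, rrf:parameterBindings, rrf:functionName, rrf:functionBody) are written as we recall them from the R2RML-F publication and should be checked against the processor's documentation. The names <#pom-sex>, <#SexToIRI> and sexToIRI, the column values "M" and "F", and the combination with rr:termType rr:IRI are assumptions.

```turtle
@base <http://www.example.org/> .
@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix rrf: <http://kdeg.scss.tcd.ie/ns/rrf#> .
@prefix sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#> .

# Hypothetical predicate object map (to be attached to a triples map)
# that maps the value of column "sex" onto an IRI
<#pom-sex> rr:predicate sdmx-dimension:sex ;
    rr:objectMap [
        rr:termType rr:IRI ;
        rrf:functionCall [
            rrf:function <#SexToIRI> ;
            rrf:parameterBindings ( [ rr:column "sex" ] )
        ]
    ] .

# The "translation table" written as a JavaScript function
<#SexToIRI>
    rrf:functionName "sexToIRI" ;
    rrf:functionBody """
        function sexToIRI(value) {
            var table = {
                "M": "http://purl.org/linked-data/sdmx/2009/code#sex-M",
                "F": "http://purl.org/linked-data/sdmx/2009/code#sex-F"
            };
            return table[value];
        }
    """ .
```

Values missing from the table make the function return undefined. How a processor treats that case is not covered by this sketch, so a real function would need an explicit fallback.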
15. Related Work
Work on the generation of R2RML mappings is, to the best of our
knowledge, limited.
• Skjaeveland et al. 2015 proposed a method to generate an
ontology, rules and a mapping from one description
• TabLinker and CSV2DataCube are two tools for generating
QB graphs from Excel files (in a certain format) and CSV
data respectively
• The Open Cube Toolkit has a built-in R2RML compliant D2R
server, but it relies on a bespoke XML that maps source and
DSD.
16. Conclusions
• We argued that datasets are used for a purpose and that
datasets should be built to suit that purpose, including
any policies they should comply with.
• Before we can do the latter, we investigated the former by
trying to answer the question: “Can we generate an R2RML
mapping from a data structure definition?”
• The answer is yes and we presented the R2DQB approach
showing how. We strived for a declarative approach using
SPARQL CONSTRUCT queries. A demonstration of the
approach is presented in the paper.
17. Future work
Tackling the problem of policy-aware mapping, which would
complement research on post-hoc compliance analysis (e.g.,
Harsh et al. 2017). To be reported.
The Metadata Vocabulary for Tabular Data (W3C Rec.). A
vocabulary for describing the “schemas” of tabular data,
including constraints. This might be another representation
worth considering (future work)