The document describes a Metadata Quality Assessment Framework (MQAF) API that can validate JSON, XML, CSV, and MARC data against SHACL-like constraints. The MQAF API implements a subset of SHACL tests to validate data elements, including tests for data types, lengths, patterns, logical rules and more. It provides a Java API and configuration files to define validation rules for different data formats and schemas in an abstracted way.
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
1. Validating JSON, XML and CSV data with SHACL-like constraints
Péter Király, GWDG (Göttingen)
pkiraly@gwdg.de
Deutsche Initiative für Netzwerkinformation e.V.
Kompetenzzentrum Interoperable Metadaten (KIM) Workshop
2022-05-02
https://github.com/pkiraly/metadata-qa-api
2. Shapes Constraint Language (SHACL)
a language for validating RDF graphs against a set of conditions (expressed as RDF graphs)
ex:PersonShape
    a sh:NodeShape ;
    sh:targetClass ex:Person ;    # checks persons
    sh:property [
        sh:path ex:ssn ;          # checks social security nr.
        sh:maxCount 1 ;
        sh:datatype xsd:string ;
        sh:pattern "^\\d{3}-\\d{2}-\\d{4}$" ;
    ] .
3. Metadata Quality Assessment Framework (MQAF) API
★ an open source software for metadata quality assessment
★ quality dimensions: completeness, multilinguality, uniqueness, etc.
★ extensions: Europeana, MARC, Deutsche Digitale Bibliothek
★ Java API + command line interface (in progress)
★ reads XML, JSON, CSV, MARC
★ highly configurable
★ adaptable to different metadata schemas
4. RDF agnostic SHACL tests*
Cardinality: minCount <number>, maxCount <number>
Value range: minExclusive <number>, minInclusive <number>, maxExclusive <number>, maxInclusive <number>
String: minLength <number>, maxLength <number>, hasValue <String>, in [String1, ..., StringN], pattern <regular expression>, minWords <number>, maxWords <number>
Comparison of properties: equals <field label>, disjoint <field label>, lessThan <field label>, lessThanOrEquals <field label>
Logical operators: and [<rule1>, ..., <ruleN>], or [<rule1>, ..., <ruleN>], not [<rule1>, ..., <ruleN>]
* a subset of SHACL
5. MQAF API’s SHACL tests
Cardinality: minCount <number>, maxCount <number>
Value range: minExclusive <number>, minInclusive <number>, maxExclusive <number>, maxInclusive <number>
String: minLength <number>, maxLength <number>, hasValue <String>, in [String1, ..., StringN], pattern <regular expression>, minWords <number>, maxWords <number>
Comparison of properties: equals <field label>, disjoint <field label>, lessThan <field label>, lessThanOrEquals <field label>
Logical operators: and [<rule1>, ..., <ruleN>], or [<rule1>, ..., <ruleN>], not [<rule1>, ..., <ruleN>]
Extras: contentType [type1, ..., typeN], unique <boolean>, dependencies [id1, id2, ..., idN], dimension [criteria...] (min/max + Width/Height/Shortside/Longside)
Properties: id, description, failureScore, successScore, hidden, skip
6. abstracting the address of a data element
XML, JSON, CSV and MARC21 all have addressable data elements (branches). Each format has its own addressing language:
XML: XPath
JSON: JSONPath
CSV: column names
MARC21: MARCSpec
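7. abstracting data element retrieval
XML, JSON, CSV and MARC21 records pass through a data element selector, which produces a uniform data structure.
The selector looks up addresses in the schema definition:
selector: "May I get the title?"
schema definition: "Title's address is //head/title"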
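The lookup described above can be sketched in a few lines of Java. This is only an illustration of the idea, not the MQAF API: the class `AddressBook` and its methods `register` and `addressOf` are hypothetical names invented for this sketch. The CSV address "title" is an assumed column name.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: a schema maps field labels to format-specific addresses. */
class AddressBook {
    // label -> address (an XPath, a JSONPath, a column name or a MARCSpec expression)
    private final Map<String, String> addresses = new LinkedHashMap<>();

    /** Registers the address of a data element under a format-independent label. */
    AddressBook register(String label, String address) {
        addresses.put(label, address);
        return this;
    }

    /** Answers "May I get the title?" with the address to evaluate, if one is known. */
    Optional<String> addressOf(String label) {
        return Optional.ofNullable(addresses.get(label));
    }

    public static void main(String[] args) {
        // The same label resolves to a different address in each format.
        AddressBook xmlSchema = new AddressBook().register("title", "//head/title");
        AddressBook csvSchema = new AddressBook().register("title", "title");
        System.out.println(xmlSchema.addressOf("title").orElse("unknown"));
        System.out.println(csvSchema.addressOf("title").orElse("unknown"));
    }
}
```

The point of the design is that validation rules refer only to the label, so the same rule set can be reused when the underlying format changes.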
8. schema definition
Java API
Schema schema = new BaseSchema()
    .setFormat(Format.CSV)
    .addField(
        new JsonBranch("title", "title")
            .setRule(
                new Rule()
                    .withDisjoint("description")))
    .addField(
        new JsonBranch("url", "url")
            .setExtractable(true)
            .setRule(
                new Rule()
                    .withMinCount(1)
                    .withMaxCount(1)
                    .withPattern("^https?://.*$")));
YAML configuration file
format: csv
fields:
  - name: title
    rules:
      disjoint: description
  - name: url
    extractable: true
    rules:
      minCount: 1
      maxCount: 1
      pattern: ^https?://.*$
JSON configuration file
{
  "format": "csv",
  "fields": [
    {
      "name": "title",
      "rules": [
        {"disjoint": "description"}
      ]
    },
    {
      "name": "url",
      "extractable": true,
      "rules": [
        {
          "minCount": 1,
          "maxCount": 1,
          "pattern": "^https?://.*$"
        }
      ]
    }
  ]
}
9. one and only one data element instance
- name: about
  path: $.['about']
  rules:
    - minCount: 1
    - maxCount: 1
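What the two cardinality rules check can be sketched as follows. This is a minimal illustration under the assumption that a data element's instances have already been extracted into a list; the class `Cardinality` and its method names are hypothetical and not part of the MQAF API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of the cardinality checks on an extracted list of instances. */
class Cardinality {
    /** Passes if the element occurs at least min times. */
    static boolean minCount(List<?> instances, int min) {
        return instances.size() >= min;
    }

    /** Passes if the element occurs at most max times. */
    static boolean maxCount(List<?> instances, int max) {
        return instances.size() <= max;
    }

    /** minCount: 1 combined with maxCount: 1, i.e. one and only one instance. */
    static boolean exactlyOne(List<?> instances) {
        return minCount(instances, 1) && maxCount(instances, 1);
    }

    public static void main(String[] args) {
        System.out.println(exactlyOne(Arrays.asList("a single value")));
        System.out.println(exactlyOne(Collections.emptyList()));
        System.out.println(exactlyOne(Arrays.asList("first", "second")));
    }
}
```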
11. string constraints / length
# length(about) >= 1
- name: about
  path: $.['about']
  rules:
    - minLength: 1
# 3 <= length(about) <= 5
- name: about
  path: $.['about']
  rules:
    - and:
        - minLength: 3
        - maxLength: 5
12. string constraints / fixed values
# status == "published"
- name: status
  path: $.['status']
  rules:
    - hasValue: published
# type == "dataverse" or type == "dataset" or type == "file"
- name: type
  path: $.['type']
  rules:
    - in: [dataverse, dataset, file]
13. string constraints / pattern
# thumbnail is an image or PDF file
- name: thumbnail
  path: oai:record/dc:identifier[@type='binary']
  rules:
    - pattern: ^https?://.*\.(jpe?g|pdf|png|tiff?|gif)$
14. string constraints / number of words
# nr_words(about) >= 2
- name: about
  path: $.['about']
  rules:
    - minWords: 2
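The string constraints on the preceding slides can be read as simple predicates over a single value. The sketch below is not taken from the MQAF code base; the class `StringRules` and its method names are hypothetical. Two assumptions are made: the pattern test uses unanchored "find" matching (as SHACL's sh:pattern does), and words are counted by splitting on whitespace.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of the string constraints as plain predicates. */
class StringRules {
    static boolean minLength(String value, int min) {
        return value.length() >= min;
    }

    static boolean maxLength(String value, int max) {
        return value.length() <= max;
    }

    /** hasValue: the value equals one fixed string. */
    static boolean hasValue(String value, String expected) {
        return value.equals(expected);
    }

    /** in: the value is one of an enumerated list. */
    static boolean in(String value, List<String> allowed) {
        return allowed.contains(value);
    }

    /** pattern: assumed to be an unanchored match, so ^ and $ must be written explicitly. */
    static boolean pattern(String value, String regex) {
        return Pattern.compile(regex).matcher(value).find();
    }

    /** minWords: assumed to count whitespace-separated tokens. */
    static boolean minWords(String value, int min) {
        String trimmed = value.trim();
        int words = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        return words >= min;
    }

    public static void main(String[] args) {
        List<String> types = Arrays.asList("dataverse", "dataset", "file");
        System.out.println(minLength("abcd", 3) && maxLength("abcd", 5));
        System.out.println(in("dataset", types));
        System.out.println(pattern("https://example.org/record", "^https?://.*$"));
        System.out.println(minWords("two words", 2));
    }
}
```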
15. comparisons of data elements
# id == isbn
fields:
  - name: id
    path: $.['id']
    rules:
      - equals: isbn
  - name: isbn
    path: $.['isbn']
# title != description
fields:
  - name: title
    path: $.['title']
    rules:
      - disjoint: description
  - name: description
    path: $.['description']
# startingPage <= endingPage
  - name: startingPage
    path: startingPage
    rules:
      - lessThanOrEquals: endingPage
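The three comparisons above can be sketched over the extracted values of two fields. The sketch follows the way the SHACL specification defines the corresponding property pair constraints (value sets are equal, value sets share no member, every value of one is at most every value of the other); whether MQAF treats multi-valued fields exactly this way is an assumption. The class `PropertyComparisons` and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;

/** Hypothetical sketch of comparisons between the values of two data elements. */
class PropertyComparisons {
    /** equals: both fields carry the same set of values. */
    static boolean equalsRule(List<String> a, List<String> b) {
        return new HashSet<>(a).equals(new HashSet<>(b));
    }

    /** disjoint: the two fields share no value. */
    static boolean disjoint(List<String> a, List<String> b) {
        return Collections.disjoint(a, b);
    }

    /** lessThanOrEquals: every value of a is <= every value of b. */
    static boolean lessThanOrEquals(List<Integer> a, List<Integer> b) {
        for (int x : a) {
            for (int y : b) {
                if (x > y) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(disjoint(Arrays.asList("A title"), Arrays.asList("A description")));
        System.out.println(lessThanOrEquals(Arrays.asList(10), Arrays.asList(25)));
    }
}
```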
18. logical operations
# and
- name: id
  path: oai:record/dc:identifier
  rules:
    - and:
        - minCount: 1
        - maxCount: 1
        - minLength: 1
# or
- name: thumbnail
  path: oai:record/dc:identifier
  rules:
    - or:
        - pattern: ^.*\.(jpe?g|png)$
        - contentType:
            - image/jpeg
            - image/png
# not
- name: title
  path: $.['title']
  rules:
    - not:
        - equals: description
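One way to think about the logical operators is as combinators over child rules. The sketch below uses `java.util.function.Predicate` and is not the MQAF implementation; the class `LogicalRules` is hypothetical. In particular, treating `not` with several child rules as "none of them passes" is an assumption, since the slides only show it with a single child.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch: logical operators as combinators over child rules. */
class LogicalRules {
    /** and: every child rule must pass. */
    static <T> Predicate<T> and(List<Predicate<T>> rules) {
        return value -> rules.stream().allMatch(rule -> rule.test(value));
    }

    /** or: at least one child rule must pass. */
    static <T> Predicate<T> or(List<Predicate<T>> rules) {
        return value -> rules.stream().anyMatch(rule -> rule.test(value));
    }

    /** not: assumed to pass only when none of the child rules passes. */
    static <T> Predicate<T> not(List<Predicate<T>> rules) {
        return value -> rules.stream().noneMatch(rule -> rule.test(value));
    }

    public static void main(String[] args) {
        Predicate<String> nonEmpty = s -> !s.isEmpty();
        Predicate<String> atMostFive = s -> s.length() <= 5;
        Predicate<String> both = and(Arrays.asList(nonEmpty, atMostFive));
        System.out.println(both.test("abc"));
        System.out.println(both.test("abcdefgh"));
    }
}
```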
21. extras
# content type
- name: thumbnail
  path: oai:record/dc:identifier[@type='binary']
  rules:
    - contentType: [image/jpeg, image/png, …]
# unique value
- name: id
  path: oai:record/dc:identifier
  rules:
    - unique: true
# run only if the other tests have passed
- name: url
  path: oai:record/dc:identifier[@type='URL']
  rules:
    - id: Q-4.4
      description: Both a media file and a link to an object are referenced in context.
      dependencies: [Q-3.0, Q-4.0]
# image dimensions
- name: thumbnail
  path: oai:record/dc:identifier[@type='binary']
  rules:
    - id: 3.1
      dimension:
        minWidth: 200
        minHeight: 200
25. other properties
id: identifier, used in the output and in internal references
description: explains what the rule checks
failureScore: a numerical score assigned if the test fails
successScore: a numerical score assigned if the test passes
hidden: run the test, but hide it from the output
skip: do not run the test for now (for debugging purposes)
26. raw output
★ for each test:
  ○ status: PASSED, FAILED, NA (if the data element is not available)
  ○ score: successScore if the test passed, failureScore if it failed, otherwise 0
★ total score
The output can be CSV, JSON or Java objects (configurable)
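The scoring described above can be sketched as follows. The slide does not say how the total is computed, so summing the per-rule scores is an assumption, and the class names `RawOutput` and `RuleResult` are hypothetical rather than taken from the MQAF API. The rule ids and score values in `main` are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of per-rule status and score, plus an assumed total. */
class RawOutput {
    enum Status { PASSED, FAILED, NA }

    static final class RuleResult {
        final String id;
        final Status status;
        final int successScore;
        final int failureScore;

        RuleResult(String id, Status status, int successScore, int failureScore) {
            this.id = id;
            this.status = status;
            this.successScore = successScore;
            this.failureScore = failureScore;
        }

        /** successScore if passed, failureScore if failed, 0 if the element is not available. */
        int score() {
            switch (status) {
                case PASSED: return successScore;
                case FAILED: return failureScore;
                default: return 0;
            }
        }
    }

    /** Assumption: the total score is the sum of the individual rule scores. */
    static int totalScore(List<RuleResult> results) {
        return results.stream().mapToInt(RuleResult::score).sum();
    }

    public static void main(String[] args) {
        List<RuleResult> results = Arrays.asList(
            new RuleResult("rule-1", Status.PASSED, 2, -1),
            new RuleResult("rule-2", Status.FAILED, 1, -3),
            new RuleResult("rule-3", Status.NA, 5, -5));
        for (RuleResult r : results) {
            System.out.println(r.id + "," + r.status + "," + r.score());
        }
        System.out.println("total," + totalScore(results));
    }
}
```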
27. visualization for metadata managers / single record
30. workflow
1. ingest
2. measure records
3. aggregate
4. report
5. evaluate with experts
[Diagram labels: catalogue, quality assessment tool, improve records]
31. research partners
early adopters and contributors
★ Miel Vander Sande (meemoo, Belgium)
★ Richard Palmer (Victoria and Albert Museum, Great Britain)
Deutsche Digitale Bibliothek
★ Francesca Schulze
★ Cosmina Berta
★ Stefanie Rühle
★ Claudia Effenberger
★ Letitia-Venetia Mölck
special thanks
★ Juliane Stiller