The importance of metadata for datasets: The DCAT-AP European standard
1. The importance of metadata for
datasets: the DCAT-AP European
standard
Giorgia Lodi –
giorgia.lodi@gmail.com
2. Summary
• The importance of metadata in (open) data
management
• The DCAT standard and the European DCAT-AP version 1
and its new evolution
• Focus on the Italian extension of DCAT-AP, named
DCAT-AP_IT
3. Data vs metadata
DATA
Physical representation of facts, atomic events, objective
phenomena, information suitable for communication,
interpretation and processing by human beings or automatic
means
METADATA
Data that defines other data (they are NOT the data itself!).
Examples: bibliographic reference, the author of a document,
the date of last modification of a dataset
4. Additional definitions
ONTOLOGY: a formal and explicit specification of a shared
representation (conceptualization) of a knowledge domain,
defined on the basis of specific requirements
CONTROLLED VOCABULARY: a set of predefined and authoritative
standard terms and codes, pre-selected for the purpose of
indexing and retrieving information
DESCRIPTIVE METADATA: identify and describe digital objects
5. The importance of metadata
• They allow a better understanding of the data they describe
• Facilitate the discoverability of the data
• If they are defined using standard and shared ontologies
and controlled vocabularies they facilitate:
• Information exchange
• Interoperability
• Reuse and valorisation of public information
6. FAIR principles
FINDABLE
The first step in (re)using data is to find them. Metadata and data should be easy to find for
both humans and computers. Machine-readable metadata are essential for automatic
discovery of datasets and services
ACCESSIBLE
Once users find the required data, they need to know how the data can be accessed
INTEROPERABLE
The data usually need to be integrated with other data. In addition, the data need to
interoperate with applications or workflows for analysis, storage, and processing
REUSABLE
Metadata and data should be well-described so that they can be replicated and/or
combined in different settings
7. European Directive 1/2
ARTICLE 5 – AVAILABLE FORMATS
“…public sector bodies and public undertakings shall make
their documents available in any pre-existing format or
language and, where possible and appropriate, by electronic
means, in formats that are open, machine-readable,
accessible, findable and re-usable, together with their
metadata. Both the format and the metadata shall, where
possible, comply with formal open standards”
8. European Directive 2/2
ARTICLE 9 – PRACTICAL ARRANGEMENTS
“Member States shall make practical arrangements facilitating
the search for documents available for re-use, such as asset
lists of main documents with relevant metadata, accessible
where possible and appropriate online and in machine-
readable format, and portal sites that are linked to the
asset lists. Where possible, Member States shall facilitate the
cross-linguistic search for documents, in particular by
enabling metadata aggregation at Union level”
9. Italian legislation – open data definition
AVAILABLE (LEGAL REQUIREMENT): available under the terms of an
open licence allowing re-use, also for commercial purposes, in
disaggregated form
ACCESSIBLE (TECHNOLOGICAL REQUIREMENT): by machines, in an
open format and with associated metadata
FREE OF CHARGE (ECONOMIC REQUIREMENT): free of charge or at
marginal costs incurred for reproduction, making available and
dissemination
10. Metadata: European and Italian scenarios
• We still observe different levels of quality for metadata
• There are still different platforms being used for cataloguing
data based on metadata
• CKAN, DKAN, Socrata, Linked-data based platforms and
proprietary infrastructures
• There are still different thematic classifications for datasets
• There are still different ways to specify licenses
HOWEVER
12. Common model for metadata specification
A common European data model for metadata
helps overcome the previously mentioned
obstacles
The data model offers a harmonized and shared way
to specify metadata for datasets, with a focus on
information that is particularly relevant for re-users
14. DCAT-AP specifications
DCAT
Data CATalog vocabulary – Web standard based on the RDF
standard. It provides a data model for the descriptions of datasets
(not only open datasets) as stored in catalogs
DCAT-AP
European Data CATalog vocabulary – Application Profile –
based on RDF, it is a set of constraints added to the DCAT
specification that facilitates data and metadata exchange
National profiles (NO, DE, CH, IT, …)
National Data CATalog vocabulary – Application
Profiles – defined by the different Member States that
adhere to the European DCAT-AP initiative
Based on the RDF standard, they typically include
additional constraints or properties while maintaining
compliance with DCAT-AP
15. DCAT-AP extensions for specific types of data
• GeoDCAT-AP
- Facilitate the metadata exchange between geospatial catalogs and
data catalogs in general
• StatDCAT-AP
- Extends the DCAT-AP specification with a small number of elements
that are relevant in order to describe statistical datasets
- Facilitate the metadata exchange between statistical catalogs and data
catalogs in general
Useful extensions to enable interoperability among different data catalogs
17. DCAT vocabulary
• There are two versions of it
o Version 1.0 of 2014
o Version 2.0 of 2019
• Both versions directly reuse standard and well-known
vocabularies such as FOAF, Dublin Core and SKOS
18. DCAT version 1
• It is the latest Web recommendation
• It includes four main concepts (classes) for describing data in catalogs
o Catalog: a collection of metadata about datasets
o Catalog Record: represents a metadata item in the catalog, primarily
concerning the registration information, such as who added the item and
when
o Dataset: a collection of data, published or curated by a single agent, and
available for access or download in one or more serializations or formats
o Distribution: represents different formats of the dataset or different
endpoints. Examples of distributions include a downloadable CSV file, an
API or an RSS feed
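A minimal sketch of how the four classes relate, written as plain Python that emits Turtle text. The dcat:, dct: and foaf: prefixes are the standard namespaces; every example.org URI, every title, and the helper name `turtle` are invented for illustration.

```python
# All example.org URIs and literal values below are invented for illustration.
PREFIXES = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

# subject -> list of (predicate, object); terms are already in Turtle syntax
GRAPH = {
    "<http://example.org/catalog>": [
        ("a", "dcat:Catalog"),
        ("dct:title", '"Example catalog"@en'),
        ("dcat:record", "<http://example.org/record/1>"),
        ("dcat:dataset", "<http://example.org/dataset/1>"),
    ],
    "<http://example.org/record/1>": [
        ("a", "dcat:CatalogRecord"),
        ("foaf:primaryTopic", "<http://example.org/dataset/1>"),
    ],
    "<http://example.org/dataset/1>": [
        ("a", "dcat:Dataset"),
        ("dct:title", '"Example dataset"@en'),
        ("dcat:distribution", "<http://example.org/dataset/1/csv>"),
    ],
    "<http://example.org/dataset/1/csv>": [
        ("a", "dcat:Distribution"),
        ("dcat:downloadURL", "<http://example.org/files/dataset-1.csv>"),
    ],
}


def turtle(graph, prefixes):
    """Serialize the subject -> [(predicate, object)] mapping as Turtle text."""
    lines = [f"@prefix {name}: <{iri}> ." for name, iri in prefixes.items()]
    for subject, pairs in graph.items():
        body = " ;\n    ".join(f"{p} {o}" for p, o in pairs)
        lines.append(f"\n{subject}\n    {body} .")
    return "\n".join(lines)


print(turtle(GRAPH, PREFIXES))
```

The catalog points to its datasets and records, the record points back to the dataset it registers, and the dataset points to each of its distributions.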
20. DCAT version 1 – conformance 1/2
• Based on the standard, we say that a catalog is compliant with DCAT if:
o It is organized in datasets and distributions
o There exists an RDF description of the catalog (independently of the
specific RDF serialization used to represent it)
o The contents of all metadata fields that are held in the catalog, and
that contain data about the catalog itself and its dataset and
distributions, are included in this RDF description
o All classes and properties are consistent with the semantics of the
specification
o Additional non-DCAT properties may also be specified
21. DCAT version 1 – conformance 2/2
DCAT PROFILE
A DCAT profile can be defined. A profile adds additional constraints to
DCAT. The constraints can be
o A minimum set of metadata fields that are mandatory (in contrast to
the open world assumption of DCAT vocabulary)
o Classes and properties for additional metadata fields that are not
covered in DCAT
o Controlled vocabularies or URI sets as acceptable values for some
properties (e.g., language, themes, etc.)
o Requirements for specific access mechanisms (RDF syntaxes,
protocols) to the catalog’s RDF description
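The first and third kinds of constraint (mandatory fields and controlled vocabularies) can be sketched as a small check over metadata records. The profile contents below, including the field names and the allowed language codes, are invented for illustration and are not taken from any real profile.

```python
# Hypothetical profile: field names and allowed values are illustrative only.
PROFILE = {
    "mandatory": {"title", "description"},
    "controlled": {"language": {"en", "it", "de"}},
}


def check(record, profile):
    """Return a sorted list of human-readable violations of the profile."""
    problems = [
        f"missing mandatory field: {field}"
        for field in profile["mandatory"]
        if not record.get(field)
    ]
    for field, allowed in profile["controlled"].items():
        value = record.get(field)
        # A controlled field may be absent; if present it must use an allowed term.
        if value is not None and value not in allowed:
            problems.append(f"value {value!r} not allowed for {field}")
    return sorted(problems)


print(check({"title": "Bus stops", "language": "fr"}, PROFILE))
```

Plain DCAT would accept the record in the example, because under the open world assumption a missing description is simply unknown; the profile turns that gap into an error.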
23. European DCAT-Application Profile
• Born in 2013, DCAT-AP is a specification based on DCAT that aims at meeting
specific application needs of data portals in Europe while providing semantic
interoperability with other applications
• It provides a common specification for describing public sector datasets in
Europe to enable the exchange of descriptions of datasets among data portals
• It allows:
o Data catalogs to describe their dataset collections using a standardised
description, while keeping their own system for documenting and storing
them
o Content aggregators, such as the European data portal or national data
portals, to aggregate such descriptions into a single point of access.
o Data consumers to more easily find datasets through a single point of
access
https://d1jdzavdzee8nu.cloudfront.net/sites/default/files/distribution/access_url/2019-05/e3f7bcdf-eaad-4741-9bf6-dc61327f4eea/DCAT_AP_1.2.1.pdf
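The aggregation role described on this slide can be sketched as a harvester that merges dataset descriptions from several catalogs into one index. The catalog names, the dataset records, the use of an "identifier" key, and the function name `aggregate` are all invented for illustration; real aggregators harvest RDF and apply more elaborate de-duplication.

```python
def aggregate(catalogs):
    """Merge dataset descriptions from several catalogs into one index.

    `catalogs` maps a catalog name to a list of dataset descriptions (dicts
    with at least an "identifier"). The first description seen for an
    identifier is kept; later catalogs are only recorded as extra sources.
    """
    index = {}
    for catalog_name, datasets in catalogs.items():
        for dataset in datasets:
            entry = index.setdefault(
                dataset["identifier"], {**dataset, "sources": []}
            )
            entry["sources"].append(catalog_name)
    return index


# Invented sample data: two catalogs that share one dataset.
CATALOGS = {
    "national": [{"identifier": "d1", "title": "Air quality"}],
    "regional": [
        {"identifier": "d1", "title": "Air quality (copy)"},
        {"identifier": "d2", "title": "Bus stops"},
    ],
}

for identifier, entry in aggregate(CATALOGS).items():
    print(identifier, entry["title"], entry["sources"])
```

The sketch only works because both catalogs describe their datasets with the same keys, which is the point of a shared application profile.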
25. Mandatory elements of DCAT-AP – catalog
Catalog class – Mandatory
The Catalog is described with the following
mandatory properties:
• title → example “Open Data Catalog of the
University of Bologna”
• description → short description of the content of
the Catalog
• publisher → who makes the catalog available
• dataset → list of all dataset objects that are
included in the catalog
Recommended
issued and modified → dates on which the catalog was
released and last modified, respectively
26. Mandatory elements of DCAT-AP – dataset
Dataset class – Mandatory
The Dataset is described with the
following mandatory properties:
• title → represents in short the
content of the dataset
• description → description of
the content of the dataset
All the remaining properties are
either recommended (i.e., contact point,
distribution, keyword/tag,
publisher, theme) or optional
(e.g., conforms to, accrual
periodicity, has version, identifier,
language, landing page, spatial
and temporal coverage, etc.)
27. Some recommended elements of DCAT-AP –
distribution
Distribution class – Recommended
If a Distribution is specified, the following property must be provided:
• Access URL → URL that gives access to a Distribution of the Dataset
All the other properties are either recommended (i.e., licence, format, description) or optional (e.g., byte
size, download URL, language, title, modified, etc.)
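The mandatory properties listed for the three DCAT-AP classes can be checked with a few lines of code. This is a minimal sketch over plain dictionaries: the key spellings (for example `accessURL`), the sample records, and the helper name `missing` are assumptions made for illustration, not part of the specification.

```python
# Mandatory properties per class, as summarized on the slides above.
MANDATORY = {
    "Catalog": ["title", "description", "publisher", "dataset"],
    "Dataset": ["title", "description"],
    "Distribution": ["accessURL"],
}


def missing(kind, record):
    """Return the mandatory properties of class `kind` that are absent or empty."""
    return [prop for prop in MANDATORY[kind] if not record.get(prop)]


catalog = {
    "title": "Open Data Catalog of the University of Bologna",
    "description": "Datasets published by the university",  # invented text
    "publisher": "University of Bologna",
    "dataset": [{"title": "Enrolments"}],  # invented dataset, no description
}

print(missing("Catalog", catalog))
print(missing("Dataset", catalog["dataset"][0]))
```

A Distribution has a single mandatory property, so `missing("Distribution", {})` reports only `accessURL`.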
29. DCAT-AP_IT
[ITA] Technical guidelines for data catalogs
Available online
https://docs.italia.it/italia/daf/linee-guida-cataloghi-dati-dcat-ap-it/it/stabile/
30. DCAT-AP_IT
• It reuses ontologies already available at the state of the art (e.g.,
Dublin-Core, FOAF, etc.) in order to guarantee interoperability with the
European application profile
• It extends some of the core concepts of DCAT and DCAT-AP in order
to define additional constraints and properties
• It does not use some concepts (classes) and properties defined as
optional in DCAT-AP
• Three core classes
o Catalog – A collection of metadata that describe datasets
o Dataset – a collection of data, published or curated by a single agent, and
available for access or download in one or more serializations or formats
o Distribution – a specific available form of a dataset
• An OWL ontology has been defined in order to describe the profile
31. Mandatory elements of DCAT-AP_IT - Catalog
Catalog class – Mandatory (subclass of
dcat:Catalog)
The Catalog is described with the
following mandatory properties:
• title → example “Open Data Catalog
of the University of Bologna”
• description → short description of
the content of the Catalog
• publisher → who makes the catalog
available
• modified → the date of last
modification of the catalog
• dataset → list of all dataset objects
that are included in the catalog
32. Mandatory elements of DCAT-AP_IT - Dataset
Dataset class – Mandatory (subclass of
dcat:Dataset)
The Dataset is described with the following
mandatory properties:
• identifier → example “unibo:D.1”
• title → describes its content in short
• description → description of the content
• modified → date of last modification
• theme → use of the controlled vocabulary
defined at the European level named Data
theme (13 themes that can be associated
with the dataset)
• rightsHolder → who owns the rights on the
dataset (publisher is recommended and
creator is optional)
• accrual periodicity → the frequency of
update of the dataset. Use of the European
controlled vocabulary Frequencies
• distribution → mandatory property if the
dataset is open
33. Mandatory elements of DCAT-AP_IT - Distribution
Distribution class – Mandatory if the dataset is
open (subclass of dcat:Distribution)
The class is described by the following mandatory
properties:
• format → use of the European controlled
vocabulary File Type
• license → use of the Italian controlled
vocabulary Licences
(https://w3id.org/italia/controlled-vocabulary/licences)
• description → describes the content of the
distribution
• access URL → a URL of a web page
through which it is possible to get access to
the dataset
downloadURL is optional but it may be useful to
specify it
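The DCAT-AP_IT rules above add a conditional requirement: a distribution is mandatory only when the dataset is open. A minimal sketch of that logic follows. The dictionary keys, the `is_open` flag, and the sample values are assumptions for illustration; in particular "EDUC", "ANNUAL" and "CSV" are shortened stand-ins for the full controlled-vocabulary URIs, and the licence value is a placeholder.

```python
DATASET_MANDATORY = ["identifier", "title", "description", "modified",
                     "theme", "rightsHolder", "accrualPeriodicity"]
DISTRIBUTION_MANDATORY = ["format", "license", "description", "accessURL"]


def validate_dataset(dataset, is_open):
    """List violations of the DCAT-AP_IT mandatory rules summarized above."""
    errors = [f"dataset: missing {prop}"
              for prop in DATASET_MANDATORY if not dataset.get(prop)]
    distributions = dataset.get("distribution", [])
    # The distribution becomes mandatory only for open datasets.
    if is_open and not distributions:
        errors.append("dataset: open dataset needs at least one distribution")
    for i, dist in enumerate(distributions):
        errors += [f"distribution[{i}]: missing {prop}"
                   for prop in DISTRIBUTION_MANDATORY if not dist.get(prop)]
    return errors


example = {
    "identifier": "unibo:D.1",
    "title": "Enrolments",                  # invented
    "description": "Enrolments per year",   # invented
    "modified": "2019-10-01",
    "theme": "EDUC",                        # stand-in for a Data theme URI
    "rightsHolder": "University of Bologna",
    "accrualPeriodicity": "ANNUAL",         # stand-in for a Frequencies URI
    "distribution": [{
        "format": "CSV",                    # stand-in for a File Type URI
        "license": "placeholder-licence",   # stand-in for a Licences URI
        "description": "CSV export",
        "accessURL": "http://example.org/enrolments",  # invented URL
    }],
}

print(validate_dataset(example, is_open=True))
```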
38. Spatial coverage – GeoDCAT-AP
• Only a minimal part of the GeoDCAT-AP
extension is used in the current
DCAT-AP_IT to connect the profile with
the geospatial world
• Italian guidelines support the implementation
of the full GeoDCAT-AP specification
https://geodati.gov.it/geoportale/images/struttura/documenti/GeoDCAT-AP_IT-v1.0.pdf
40. DCAT version 2
• It is the new candidate recommendation as of the beginning of October
2019
• It changes the original version 1 in order to reflect years of practical use
cases and introduce important elements that characterize data in
catalogs e.g.,
o data resources
o relationships between data resources
o some geospatial elements
o APIs or data services
https://www.w3.org/TR/vocab-dcat-2/
41. DCAT version 2 – 1/2
NOVEL ELEMENTS
• 3 new concepts (classes)
o Resource: represents a dataset, a data
service or any other resource that may be
described by a metadata record in a catalog.
It is not used directly but it is the parent class
for Catalog, Dataset and Data Service
o Data Service: A data service is a collection of
operations accessible through an interface
(API) that provide access to one or more
datasets or data processing functions
o Relationship: An association class for
attaching additional information to a
relationship between DCAT Resources → to
be verified in practice
42. DCAT version 2 – 2/2
NOVEL ELEMENTS
• Revision of the definitions of
o Catalog: collections of metadata about
datasets or data services
o Distribution: represents an accessible form
of a dataset such as a downloadable file
• New elements for dealing with Time and Space
for Dataset and Distribution
• Licence metadata can be specified for Dataset as well
as for Distribution
• Possibility to specify compressed formats for
Distribution (e.g., zip and tar.gz) by also indicating
the format included in the compression
• Possibility to specify roles and relationships
among data resources
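A compressed distribution can be sketched as follows. The property names dcat:mediaType and dcat:compressFormat come from the DCAT 2 vocabulary rather than from the slide itself; the example.org URIs and the function name `compressed_distribution` are invented for illustration.

```python
# IANA media-type URIs identify the inner format and the compression.
IANA = "https://www.iana.org/assignments/media-types/"


def compressed_distribution(uri, download_url, media_type, compress_format):
    """Render a dcat:Distribution whose file is compressed.

    dcat:mediaType names the format of the data inside the archive,
    dcat:compressFormat names the compression applied to it.
    """
    return (
        "@prefix dcat: <http://www.w3.org/ns/dcat#> .\n\n"
        f"<{uri}>\n"
        "    a dcat:Distribution ;\n"
        f"    dcat:downloadURL <{download_url}> ;\n"
        f"    dcat:mediaType <{IANA}{media_type}> ;\n"
        f"    dcat:compressFormat <{IANA}{compress_format}> .\n"
    )


print(compressed_distribution(
    "http://example.org/dataset/1/csv-gz",        # invented URI
    "http://example.org/files/dataset-1.csv.gz",  # invented URL
    "text/csv",
    "application/gzip",
))
```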
43. DCAT version 2 – relationship
• The class Relationship is used to characterize a
relationship between datasets, and potentially
other resources, where the nature of the
relationship is known but is not adequately
characterized by the standard Dublin Core and
PROV-O properties
• The property hadRole defines the function of an
agent with respect to another entity or resource
o May be used in a qualified-attribution to
specify the role of an Agent with respect to
an Entity
o The use of a controlled vocabulary for roles
is recommended
o A new way to specify roles for resources
(datasets, catalogs, data services)
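A qualified relationship between two datasets can be sketched as below. The terms dcat:qualifiedRelation, dcat:Relationship, dcat:hadRole and dct:relation are defined by DCAT 2 and Dublin Core; the dataset URIs, the role URI, and the function name `qualified_relation` are invented, since DCAT leaves the choice of role vocabulary to the publisher.

```python
def qualified_relation(source, target, role):
    """Render a dcat:Relationship that links `source` to `target` with a role."""
    return (
        "@prefix dcat: <http://www.w3.org/ns/dcat#> .\n"
        "@prefix dct: <http://purl.org/dc/terms/> .\n\n"
        f"<{source}> dcat:qualifiedRelation [\n"
        "    a dcat:Relationship ;\n"
        f"    dct:relation <{target}> ;\n"
        f"    dcat:hadRole <{role}>\n"
        "] .\n"
    )


print(qualified_relation(
    "http://example.org/dataset/2",         # invented: the described dataset
    "http://example.org/dataset/1",         # invented: the related dataset
    "http://example.org/roles/derivedFrom"  # invented role from a local vocabulary
))
```

The blank node carries both the target and the role, which a plain dct:relation triple could not express.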
44. DCAT version 2
• A data service typically provides selection, extraction,
combination, processing or transformation operations
over datasets that might be hosted locally or remote to
the service.
o The result of any request to a data service is a
representation of a part or all of a dataset or catalog
o Examples: a data discovery service, data
transformation services, such as coordinate
transformation services, re-sampling and
interpolation services, and various data processing
services, including simulation and modelling
services
o Three main properties characterize the data service:
endpointURL, endpointDescription, servesDataset
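The three properties named above can be sketched in a small helper that renders a data service description as Turtle. All example.org URIs and the function name `data_service_turtle` are invented for illustration.

```python
def data_service_turtle(service_uri, endpoint_url, endpoint_description, datasets):
    """Render a dcat:DataService with its three characteristic properties."""
    served = " , ".join(f"<{d}>" for d in datasets)
    return (
        "@prefix dcat: <http://www.w3.org/ns/dcat#> .\n\n"
        f"<{service_uri}>\n"
        "    a dcat:DataService ;\n"
        f"    dcat:endpointURL <{endpoint_url}> ;\n"
        f"    dcat:endpointDescription <{endpoint_description}> ;\n"
        f"    dcat:servesDataset {served} .\n"
    )


print(data_service_turtle(
    "http://example.org/service/api",        # invented service URI
    "http://example.org/api/v1",             # invented root of the API
    "http://example.org/api/v1/openapi.json",  # invented machine-readable description
    ["http://example.org/dataset/1", "http://example.org/dataset/2"],
))
```

The endpoint URL is where requests are sent, the endpoint description documents the available operations, and servesDataset links the service back to the catalogued datasets.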
45. DCAT-AP version 2.0
• It is currently under development
• It is in public review until the 4th of November; contributions are
discussed in the related GitHub repository
• Two types of changes have been applied:
• Changes based on the feedback on the usage of version
1.2.1
• Changes that adapt the profile to the new DCAT
https://joinup.ec.europa.eu/solution/dcat-application-profile-data-portals-europe/release/200
47. DCAT-AP version 2.0 - catalog
Novel elements
• Possibility to specify catalogs of catalogs
• Possibility to specify catalogs of data services
• Possibility to specify a creator
• Spatial coverage becomes recommended
All the rest remains unchanged
48. DCAT-AP version 2.0 - dataset
Novel elements
• Temporal and spatial coverage
become recommended (some new
properties have been added for these
two concepts)
• Introduction of a set of optional
properties that derive from the new
DCAT (e.g., provenance, relationship,
creator)
All the rest remains unchanged
49. DCAT-AP version 2.0 - distribution
Novel elements
• New properties for compressed
formats
• Temporal and spatial resolution
• A new property named
availability that takes one of the
following values: temporary,
experimental, available, stable
• New properties to link the
Distribution to a policy (rights)
and accessService to connect
the Distribution to Data services
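The closed set of availability values lends itself to an enumeration. This is a minimal sketch: the class and function names are invented, and the values are modelled as plain strings, whereas a real implementation would use the URIs of the corresponding controlled vocabulary.

```python
from enum import Enum


class Availability(Enum):
    """The four availability values listed on the slide."""
    TEMPORARY = "temporary"
    EXPERIMENTAL = "experimental"
    AVAILABLE = "available"
    STABLE = "stable"


def parse_availability(value):
    """Map a free-text value to an Availability member, or raise ValueError."""
    return Availability(value.strip().lower())


print(parse_availability(" Stable "))
```

Any value outside the four listed ones, such as "archived", raises a ValueError instead of being silently stored.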
50. Conclusions
• In the public sector an increasing number of Public
Administrations are adopting DCAT-AP(_IT)
• However, important changes are to be taken into account
• Challenge: how to rapidly adapt the current metadata
ecosystem in order to implement the new changes that
were introduced in DCAT and DCAT-AP?