How to describe a dataset.
Interoperability issues
Valeria Pesce
Global Forum on Agricultural Research
Definition of “dataset”
The term “dataset” has been defined in several ways, all of which
further specify or extend the basic concept of “a collection of data”.
Definition given by the W3C Government Linked Data Working Group:
A dataset is “a collection of data, published or curated by a
single source, and available for access or download in one or
more formats”
The “instances” of the dataset “available for access or
download in one or more formats” are called
“distributions”. A dataset can have many distributions.
Examples of distributions include a downloadable CSV
file, an API or an RSS feed.
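The dataset/distribution relationship described above can be sketched as a small data model. This is only an illustration: the class names, field names and example.org URLs are assumptions made for the example and are not taken from DCAT or from the original slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Distribution:
    """One concrete way of accessing a dataset (file, API, feed)."""
    access_url: str   # hypothetical example URL
    media_type: str   # e.g. "text/csv"


@dataclass
class Dataset:
    """A collection of data published or curated by a single source."""
    title: str
    publisher: str
    distributions: List[Distribution] = field(default_factory=list)


# Hypothetical dataset exposed through three distributions.
rainfall = Dataset(
    title="Monthly rainfall",
    publisher="Example Institute",
    distributions=[
        Distribution("http://example.org/rainfall.csv", "text/csv"),
        Distribution("http://example.org/api/rainfall", "application/json"),
        Distribution("http://example.org/rainfall.rss", "application/rss+xml"),
    ],
)

# The same dataset, listed by the formats it is available in.
media_types = [d.media_type for d in rainfall.distributions]
```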
Definition of “interoperability”
“Data interoperability is a feature of datasets -
and of information services that give access to
datasets - whereby data can easily be retrieved,
processed, re-used, and re-packaged
(“operated”) by other systems.”
Interim Proceedings of International Expert Consultation on “Building the CIARD
Framework for Data and Information Sharing”, CIARD (2011)
The “other systems” are software applications, so
datasets have to be machine-readable.
What applications need
Besides information common to any type of resource (name, author /
owner, date…), applications have to find enough metadata about
datasets to understand:
1. the specific coverage of the dataset (type of data, thematic
coverage, geographic coverage)
2. the necessary technical specifications to retrieve and parse a
distribution of the dataset (format, protocol etc.)
3. the conditions for re-use (rights, licenses)
4. the “dimensions” covered by the dataset (e.g. temperature,
time, salinity, gene, coordinates)
5. the semantics of the dimensions (units of measure, time
granularity, syntax, reference taxonomies)
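The five needs listed above can be read as a checklist that an application runs against a dataset description. The sketch below assumes a flat dictionary record; the group names and field names are hypothetical and do not come from any of the vocabularies discussed here.

```python
# Hypothetical field names, grouped by the five needs listed above.
REQUIRED_GROUPS = {
    "coverage": ["data_type", "theme", "spatial"],
    "technical": ["format", "protocol"],
    "reuse": ["license"],
    "dimensions": ["dimensions"],
    "semantics": ["dimension_semantics"],
}


def missing_metadata(record: dict) -> dict:
    """Return, per group, the fields that are absent or empty in `record`."""
    gaps = {}
    for group, fields in REQUIRED_GROUPS.items():
        absent = [name for name in fields if not record.get(name)]
        if absent:
            gaps[group] = absent
    return gaps


# Hypothetical record that lacks a protocol and any dimension semantics.
record = {
    "data_type": "observations",
    "theme": "soil salinity",
    "spatial": "Kenya",
    "format": "text/csv",
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "dimensions": ["time", "salinity"],
}

gaps = missing_metadata(record)
```

With this record, `gaps` reports the missing protocol and the missing dimension semantics, which are the two areas the following slides identify as weakly covered by existing vocabularies.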
Partial answers in existing vocabularies
• DCAT vocabulary
– RDF vocabulary for describing any dataset
– Datasets can be standalone or part of a “catalog”
– Datasets are accessible through several “distributions”
– “Other, complementary vocabularies may be used together with DCAT to provide
more detailed format-specific information. For example, properties from the VoID
vocabulary can be used if that dataset is in RDF format.”
• VOID vocabulary
– RDF vocabulary for expressing metadata about RDF datasets
• (SDMX ) DataCube vocabulary
– RDF vocabulary for describing statistical datasets
– Useful for attaching metadata about the “data structure” to any dataset that
doesn’t follow a known published standard
Coverage of a dataset
• This can be handled by common Dublin Core properties like subject and
coverage.
• DCAT re-uses these DC properties.
Issue 1: No specific property for the type of data covered in a dataset
The values of these properties have to be understood by machines:
- The value should be standardized, possibly a URI
- The URI should be de-referenceable to a thing
- The thing should be part of an authority list / taxonomy
Issue 2: There is no authority vocabulary for types of data
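The three conditions on property values (standardized, de-referenceable, part of an authority list) could be checked roughly as follows. This is a sketch under stated assumptions: the authority list is a hypothetical in-memory set of example.org URIs, and the code only inspects the URI syntax; it does not perform an HTTP request to verify that the URI actually resolves.

```python
from urllib.parse import urlparse

# Hypothetical authority list: URIs of concepts in a controlled vocabulary.
AUTHORITY = {
    "http://example.org/datatypes/observations",
    "http://example.org/datatypes/bibliographic",
}


def check_value(value: str) -> str:
    """Classify a property value by how usable it is for a machine."""
    parsed = urlparse(value)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return "free text"  # not an HTTP URI a client could look up
    if value not in AUTHORITY:
        return "URI, not in authority list"
    return "URI in authority list"


results = [
    check_value("soil data"),
    check_value("http://example.org/other"),
    check_value("http://example.org/datatypes/observations"),
]
```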
Conditions for re-use
• DCAT re-uses the license DC property at the level of
distributions
• DCAT re-uses the rights DC property at both the level
of dataset and the level of distribution
dc:license > dc:LicenseDocument
dc:rights > dc:RightsStatement
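As an illustration of where these properties attach, the sketch below builds a handful of triples and expands them to full URIs. The `dcat:` and `dct:` namespaces are the published ones (the slides write the Dublin Core terms prefix as `dc:`); the `ex:` namespace, the resource names and the `expand` helper are hypothetical.

```python
PREFIXES = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "ex": "http://example.org/",  # hypothetical namespace for the example
}


def expand(curie: str) -> str:
    """Turn a prefixed name such as 'dct:license' into a bracketed full URI."""
    prefix, local = curie.split(":", 1)
    return f"<{PREFIXES[prefix]}{local}>"


# Rights on both dataset and distribution; license on the distribution only.
triples = [
    ("ex:rainfall", "dcat:distribution", "ex:rainfall-csv"),
    ("ex:rainfall", "dct:rights", "ex:rights-statement"),
    ("ex:rainfall-csv", "dct:rights", "ex:rights-statement"),
    ("ex:rainfall-csv", "dct:license", "ex:licence-doc"),
]

# One N-Triples-style line per statement.
ntriples = [" ".join(expand(term) for term in triple) + " ." for triple in triples]
```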
W3C DCAT > DCAT AP
DCAT core
Technical properties
The necessary technical specifications to retrieve and
parse a distribution of a dataset (format, protocol etc.)
• DCAT re-uses the DC format property;
Issue 1: No property for protocol
The values of these properties have to be understood by
machines, so they should possibly be URIs.
Issue 2: No comprehensive RDF authority lists for these
values (partial: DC Types; non-RDF: IANA types)
VOID
VOID can help with the protocol metadata but only for
RDF datasets:
- Property for data dump: dataDump
- Property for SPARQL endpoint: sparqlEndpoint
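A client could use these two VoID properties to decide how to reach an RDF dataset. The sketch below assumes the description has already been parsed into a mapping from property URI to value; that mapping, the `access_points` helper and the example.org URLs are assumptions for illustration. The namespace is the published VoID namespace.

```python
VOID = "http://rdfs.org/ns/void#"


def access_points(description: dict) -> dict:
    """Pick the VoID access properties out of a property-to-value mapping."""
    return {
        "sparql": description.get(VOID + "sparqlEndpoint"),
        "dump": description.get(VOID + "dataDump"),
    }


# Hypothetical RDF dataset described with VoID.
rdf_description = {
    VOID + "sparqlEndpoint": "http://example.org/sparql",
    VOID + "dataDump": "http://example.org/dump.nt.gz",
}

# Hypothetical non-RDF dataset: VoID has nothing to say about its protocol.
csv_description = {}

rdf_points = access_points(rdf_description)
csv_points = access_points(csv_description)
```

The empty result for the non-RDF description illustrates the gap noted above: VoID covers protocol metadata only when the dataset is RDF.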
“Dimensions” and their semantics
DCAT does not describe the dimensions of a dataset,
except for a reference to a standard if the dataset
dimensions can be defined by a formalized standard
(e.g. an XML schema or an RDF vocabulary or an ISO
standard)
dc:conformsTo > dc:Standard
Statistical vocabularies can help
with the description of the dimensions
SDMX: data structure and dimensions
SDMX: Statistical Data and Metadata Exchange
The data structure definition is a description of all the metadata needed to
understand the data set structure.
This includes:
• identification of the dimensions (Dimension) according to standard
statistical terminology,
• the key structure (KeyDescriptor),
• the code-lists (CodeList) that enumerate valid values for each dimension
• coded attribute (CodedAttribute), information about whether attributes
are required or optional and coded or free text.
Given the metadata in the data structure definition, all of the data in the
data set becomes meaningful.
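To make the last point concrete, the following sketch shows how a data structure definition lets a program check an observation. It is a simplified stand-in, not the SDMX information model: the dimension names, code values and attribute names are chosen only for illustration, and the dictionary layout is an assumption.

```python
# Hypothetical code lists: the valid values for each dimension.
CODE_LISTS = {
    "REF_AREA": {"KE", "IT", "BR"},
    "FREQ": {"A", "M"},
}

# Hypothetical attributes, flagged as required (True) or optional (False).
ATTRIBUTES = {"UNIT_MEASURE": True, "OBS_STATUS": False}


def validate(observation: dict) -> list:
    """Return the problems found when checking an observation against the structure."""
    errors = []
    for dimension, codes in CODE_LISTS.items():
        if observation.get(dimension) not in codes:
            errors.append(f"invalid or missing {dimension}")
    for attribute, required in ATTRIBUTES.items():
        if required and attribute not in observation:
            errors.append(f"missing required attribute {attribute}")
    return errors


good = validate({"REF_AREA": "KE", "FREQ": "A", "UNIT_MEASURE": "mm"})
bad = validate({"REF_AREA": "XX", "FREQ": "A"})
```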
DataCube: simplified SDMX in RDF
Reference to a concept scheme
“Semantic role” of the property
Combining different vocabularies
Catalog: the directory
• DATASET (DCAT model): Name, URL, Owner, Content type, Topic(s), Language, Metadata set(s), Data structure, Distribution(s), […]
• DISTRIBUTION (DCAT model): Name, Protocol, Endpoint URL, Media type, Format, Size
• DATA STRUCTURE (DataCube model): Dimensions, Attributes, Measures, Value lists
• RDF dataset info (VOID properties): Vocabulary(ies), SPARQL endpoint, Data dump, Serialization format, Number of triples
If one or more known published metadata sets are used, just fill “metadata set(s)”;
otherwise link to a “data structure” with custom “dimensions”.
If the media type is RDF or a SPARQL response, add the RDF dataset info.
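The two conditional rules in this combined model can be sketched as a small decision function. The field names (`metadata_sets`, `media_type`) and the helper itself are hypothetical; the media types listed are registered RDF and SPARQL result types, but the list is not exhaustive.

```python
# Media types treated as "RDF or SPARQL response" for this sketch.
RDF_MEDIA_TYPES = {
    "application/rdf+xml",
    "text/turtle",
    "application/n-triples",
    "application/sparql-results+xml",
    "application/sparql-results+json",
}


def extra_descriptions(dataset: dict, distribution: dict) -> list:
    """List the additional description blocks a record needs."""
    needed = []
    # No known published metadata set: describe the structure explicitly.
    if not dataset.get("metadata_sets"):
        needed.append("data structure")
    # RDF distribution: add the VOID-based RDF dataset info.
    if distribution.get("media_type") in RDF_MEDIA_TYPES:
        needed.append("RDF dataset info")
    return needed


csv_case = extra_descriptions({"metadata_sets": []}, {"media_type": "text/csv"})
rdf_case = extra_descriptions(
    {"metadata_sets": ["Dublin Core"]}, {"media_type": "text/turtle"}
)
```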
Tools for managing dataset metadata
• CKAN maintained by the Open Knowledge Foundation
Uses most of DCAT. Doesn’t describe dimensions.
Also provides a global dataset hub called the Datahub
• Dataverse created by Harvard University
Uses a custom vocabulary. Doesn’t describe dimensions.
• Commercial solutions
• Repositories and catalogs:
OpenAIRE, DataCite (using re3data to search repositories) and Dryad
use their own vocabularies.
• CIARD RING
Uses full DCAT AP with some extended properties (protocol, data
type) and local taxonomies with URIs mapped when possible to
authorities.
Next steps: adding DataCube properties for dimensions.
Major outstanding issues
• Some missing properties in existing vocabularies:
→ approach vocabulary owners OR extend vocabularies
• Missing vocabularies for protocols, formats
→ approach standardizing bodies?
→ perhaps specific dataset formats?
• Need for more standardized semantics for
dimensions:
→ Joint discussions with the RDA Data Type Registries WG?
• Lack of interoperability metadata in existing tools
References
• W3C DCAT: http://www.w3.org/TR/vocab-dcat/
• DCAT AP: https://joinup.ec.europa.eu/asset/dcat_application_profile/asset_release/dcat-application-profile-data-portals-europe-final
• DataCube: http://purl.org/linked-data/cube#
• VOID: http://rdfs.org/ns/void-guide
• VIVO Datastar: http://sourceforge.net/projects/vivo/files/Datastar%20ontology/
• CERIF for datasets: https://cerif4datasets.wordpress.com/c4d-deliverables/
• CKAN: http://ckan.org/
• Datahub: http://datahub.io/
• DataCite: http://search.datacite.org/ui?q=subject%3Aagriculture
• Re3data: http://www.re3data.org
• Dryad: http://datadryad.org/
• OpenAIRE: https://www.openaire.eu/
Thank you
Valeria Pesce
Global Forum on Agricultural Research
