HDL - Towards A Harmonized Dataset Model for Open Data Portals

The Open Data movement triggered an unprecedented amount of data published in a wide range of domains. Governments and corporations around the world are encouraged to publish, share, use and integrate Open Data. There are many areas where one can see the added value of Open Data, from transparency and self-empowerment to improving efficiency, effectiveness and decision making. This growing amount of data requires rich metadata in order to reach its full potential. This metadata enables dataset discovery, understanding, integration and maintenance. Data portals, which are considered to be datasets' access points, offer metadata represented in different and heterogeneous models. In this paper, we first conduct a unique and comprehensive survey of seven metadata models: CKAN, DKAN, Public Open Data, Socrata, VoID, DCAT and Schema.org. Next, we propose a Harmonized Dataset modeL (HDL) based on this survey. We describe use cases that show the benefits of providing rich metadata to enable dataset discovery, search and spam detection.

  1. HDL: Towards a Harmonized Dataset Model for Open Data Portals
     Ahmad Assaf, Raphaël Troncy and Aline Senart (@ahmadaassaf)
     PROFILES 15 – 2nd International Workshop on Dataset PROFIling & fEderated Search for Linked Data, 1st June 2015
  2. Open Data / Linked Open Data
     • Open Data (OD) is data that can be easily discovered, accessed, reused and redistributed by anyone [Davies et al. 2014].
     • Open Data should be placed in the public domain under liberal terms of use and made available in electronic formats that are non-proprietary and machine readable.
     • Linked Open Data (LOD) refers to semantically rich, linked and machine-readable open data.
     • Open Data has major benefits for citizens, businesses, societies and governments.
  3. Metadata
     • Metadata is structured information that describes, explains, locates or otherwise makes it easier to retrieve, use or manage information resources.
     • Metadata supports data discovery, exploration and reuse; organization and identification; and archiving and preservation.
  4. Data Portals / Data Management Systems
     • Data portals (catalogs) are the entry points for discovering published datasets.
     • A data portal is a curated collection of dataset metadata providing a set of discovery and integration services.
     • Data portals can be public, like datahub.io or publicdata.eu, or private, like enigma.io or quandl.com.
     • Portals are built on top of Data Management Systems (DMS) like CKAN, DKAN and Socrata.
  5. Why a Harmonized Model?
     • Exploring/discovering datasets for (re)use
     • Defining a "minimal" set of information needed to build a "profile"
     • Building tools that will automatically generate/validate metadata models
  6. Dataset Models - DCAT
     • The Data Catalog Vocabulary (DCAT)✝ is a W3C recommendation to facilitate interoperability between data catalogs on the web.
     • DCAT is an RDF vocabulary with three main classes: dcat:Catalog, dcat:Dataset and dcat:Distribution.
     • DCAT profiles are extensions built upon DCAT:
       - DCAT-AP✝✝ defines a minimal set of properties that should be included in a dataset profile by specifying mandatory and optional properties.
       - The Asset Description Metadata Schema (ADMS)✝✝✝ is used to semantically describe assets (code lists, taxonomies, vocabularies).
     ✝ http://w3.org/TR/vocab-dcat/
     ✝✝ https://joinup.ec.europa.eu/asset/dcat_application_profile/description
     ✝✝✝ http://www.w3.org/TR/vocab-adms/
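To make the class structure concrete, here is a minimal sketch (not taken from the slides) that builds a dcat:Dataset with one dcat:Distribution using Python and rdflib; the example.org URIs, titles and file are invented for illustration.

    # Hypothetical DCAT description of a dataset and its distribution, built with rdflib.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DCAT, DCTERMS, RDF

    EX = Namespace("http://example.org/")          # invented namespace

    g = Graph()
    dataset = EX["dataset/air-quality"]            # invented dataset URI
    distribution = EX["dataset/air-quality/csv"]   # invented distribution URI

    # dcat:Dataset with basic descriptive metadata
    g.add((dataset, RDF.type, DCAT.Dataset))
    g.add((dataset, DCTERMS.title, Literal("Air quality measurements")))
    g.add((dataset, DCTERMS.description, Literal("Hourly air quality readings.")))
    g.add((dataset, DCAT.distribution, distribution))

    # dcat:Distribution pointing to the downloadable file
    g.add((distribution, RDF.type, DCAT.Distribution))
    g.add((distribution, DCAT.downloadURL, URIRef("http://example.org/air-quality.csv")))
    g.add((distribution, DCAT.mediaType, Literal("text/csv")))

    print(g.serialize(format="turtle"))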
  7. Dataset Models - VoID✝
     • VoID is an RDF vocabulary for describing interlinked datasets.
     • In addition to describing datasets, VoID describes the links between datasets.
     • VoID defines two main classes, void:Dataset and void:Linkset, plus the void:subset property for relating a dataset to its subsets.
     • A linkset in VoID is a subclass of a dataset, used for storing the triples that express the interlinking relationship between datasets.
     ✝ http://www.w3.org/TR/void/
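A hedged sketch of how such a linkset can be described follows: the two datasets, the namespace and the link count are invented; only the VoID and OWL terms come from the vocabularies themselves.

    # Hypothetical void:Linkset recording owl:sameAs links between two datasets.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import OWL, RDF, VOID, XSD

    EX = Namespace("http://example.org/void/")   # invented namespace

    g = Graph()
    ds_a, ds_b, linkset = EX.datasetA, EX.datasetB, EX.linksAB

    g.add((ds_a, RDF.type, VOID.Dataset))
    g.add((ds_b, RDF.type, VOID.Dataset))

    # The linkset is itself a dataset that holds only the interlinking triples.
    g.add((linkset, RDF.type, VOID.Linkset))
    g.add((linkset, VOID.subjectsTarget, ds_a))
    g.add((linkset, VOID.objectsTarget, ds_b))
    g.add((linkset, VOID.linkPredicate, OWL.sameAs))
    g.add((linkset, VOID.triples, Literal(12000, datatype=XSD.integer)))  # invented count

    print(g.serialize(format="turtle"))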
  8. Dataset Models - CKAN✝ / DKAN✝✝
     • The data model describes a set of entities (dataset, resource, group, tag).
     • Additional information can be added via arbitrary "extra" key/value fields.
     • The core metadata is represented as a JSON file.
     • CKAN supports Linked Data and RDF by providing a complete and functional mapping of its model to Linked Data formats.
     • CKAN supports descriptions of vocabularies.
     • DKAN is a Drupal-based DMS.
     ✝ http://ckan.org/
     ✝✝ http://demo.getdkan.com/
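Since CKAN exposes this metadata, including the "extras" fields, through its Action API, a short sketch of reading it is given below; the portal URL and dataset id are placeholders, and the field access assumes the standard package_show response layout.

    import requests

    CKAN_PORTAL = "https://demo.ckan.org"   # placeholder portal URL
    DATASET_ID = "sample-dataset"           # placeholder dataset id

    resp = requests.get(
        f"{CKAN_PORTAL}/api/3/action/package_show",
        params={"id": DATASET_ID},
        timeout=30,
    )
    resp.raise_for_status()
    package = resp.json()["result"]

    # Core entities of the CKAN model: dataset, resources, tags ...
    print(package["title"])
    print([resource["format"] for resource in package.get("resources", [])])
    print([tag["name"] for tag in package.get("tags", [])])

    # ... plus free-form metadata carried in the "extras" list of key/value pairs.
    for extra in package.get("extras", []):
        print(extra["key"], "=", extra["value"])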
  9. Dataset Models - Continued
     Project Open Data (POD)✝✝✝
     • An online collection of best practices and case studies to help data publishers.
     • The POD data model is based on DCAT.
     • Similarly to DCAT-AP, POD defines three types of metadata elements: Required, Required-If and Expanded (optional).
     • Metadata can be extended using elements from the "Expanded" fields.
     Socrata✝
     • A commercial platform to streamline data publishing, management, analysis and reuse.
     • The model is designed specifically to represent tabular data.
     • It covers a basic set of metadata properties and has good support for geospatial data.
     Schema.org✝✝
     • A collection of schemas used to mark up HTML pages with structured data.
     • Covers many domains; we are interested in the Dataset schema, although we also use properties from other schemas (organizations, authors, etc.).
     ✝ http://socrata.com/
     ✝✝ http://schema.org/
     ✝✝✝ https://project-open-data.cio.gov/
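For Schema.org, dataset descriptions are typically embedded in pages as JSON-LD; the sketch below emits such markup from Python, with an invented dataset, organization and download URL.

    import json

    # Hypothetical Schema.org Dataset markup, serialized as JSON-LD.
    dataset_jsonld = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": "Air quality measurements",               # invented example
        "description": "Hourly air quality readings published as open data.",
        "creator": {"@type": "Organization", "name": "Example City Council"},
        "distribution": [{
            "@type": "DataDownload",
            "contentUrl": "http://example.org/air-quality.csv",
            "encodingFormat": "text/csv",
        }],
    }

    print(json.dumps(dataset_jsonld, indent=2))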
  10. Ballmer effect, anyone? (https://xkcd.com/323/)
  11. Metadata Classification - Information Groups
     • Organization: clustering or curation solely based on associations with specific administration parties.
     • Resource: the actual raw data that can be downloaded or accessed directly, e.g. JSON, CSV, a SPARQL endpoint.
     • Tag: descriptive knowledge about the dataset contents and structure; this can range from simple textual tags to semantically rich controlled terms.
     • Group: organizational units that share common semantics; they can be seen as a cluster or curation based on shared themes/categories.
  12. Metadata Classification - Information Types (dataset metadata)
     • General information: title, description, id
     • Ownership information: author, maintainer_email
     • Provenance information: version, creation_date, update_date
     • Access information: URL, license_title, license_id
     • Geospatial information: bbox, layers
     • Temporal information: coverage_from, coverage_to
     • Statistical information: max_value, uniques, average
     • Quality information: rating, availability, freshness
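One way such a classification can drive tooling is sketched below: the field names are taken from the slide above, while the dictionary layout and the helper function are assumptions for illustration, not part of HDL itself.

    # Field names come from the slide; the structure and helper are illustrative only.
    HDL_INFORMATION_TYPES = {
        "general":     ["title", "description", "id"],
        "ownership":   ["author", "maintainer_email"],
        "provenance":  ["version", "creation_date", "update_date"],
        "access":      ["url", "license_title", "license_id"],
        "geospatial":  ["bbox", "layers"],
        "temporal":    ["coverage_from", "coverage_to"],
        "statistical": ["max_value", "uniques", "average"],
        "quality":     ["rating", "availability", "freshness"],
    }

    def missing_fields(metadata: dict, info_type: str) -> list:
        """Return the fields of one information type that are absent or empty."""
        return [field for field in HDL_INFORMATION_TYPES[info_type]
                if not metadata.get(field)]

    # Example: a dataset whose ownership information has not been filled in.
    print(missing_fields({"title": "Air quality", "id": "ds-42"}, "ownership"))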
  13. Harmonization Process
     • Examine the model or vocabulary specification and documentation.
     • Examine existing datasets using these models.
     • Examine the source code of the DMS.
     Step 1: map the information groups [resource, tag, group, organization].
     Step 2: map the information types [general, ownership, provenance, etc.].
  14. Mapping Information Types (example: maintainer email)
     • CKAN: maintainer_email
     • DKAN: maintainer_email
     • POD: ContactPoint -> hasEmail
     • Schema.org: CreativeWork:producer -> Person:email
     • VoID: void:Dataset -> dct:creator -> foaf:Person:givenName
     • DCAT: dcat:Dataset -> dct:creator -> foaf:Person:givenName
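A sketch of how this mapping could be encoded and applied is shown next; the per-model paths follow the table above, while the flattened key names and the resolve helper are assumptions for illustration.

    # Model-specific paths for one HDL property (maintainer email), per the slide.
    MAINTAINER_EMAIL_PATHS = {
        "ckan":       ["maintainer_email"],
        "dkan":       ["maintainer_email"],
        "pod":        ["contactPoint", "hasEmail"],
        "schema.org": ["producer", "email"],
        "void":       ["creator", "givenName"],   # dct:creator -> foaf:Person
        "dcat":       ["creator", "givenName"],
    }

    def resolve(metadata: dict, path: list):
        """Walk a nested metadata dictionary along a model-specific path."""
        value = metadata
        for key in path:
            if not isinstance(value, dict):
                return None
            value = value.get(key)
        return value

    # Example with a hypothetical POD-style record.
    pod_record = {"contactPoint": {"hasEmail": "mailto:owner@example.org"}}
    print(resolve(pod_record, MAINTAINER_EMAIL_PATHS["pod"]))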
  15. Extra Information
     • Examining the models, we noticed an abundance of information filled in "extras" fields.
     • Using Roomba, we generated aggregation reports to inspect those extras on the LOD Cloud✝ and OpenAfrica✝✝ groups:
       - extras>value:extras>name (extra field names and values)
       - resources>resource_type:resources>name (types describing resources)
     • 53% of the datasets in OpenAfrica have additional geospatial information attached (spatial-reference-system, spatial harvester, bbox-east-long, bbox-north-long, bbox-south-long, bbox-west-long).
     • 16% of the datasets have additional provenance and ownership information (frequency-of-update, dataset-reference-date).
     ✝ http://datahub.io/group/lodcloud
     ✝✝ http://africaopendata.org/
     https://github.com/ahmadassaf/opendata-checker/tree/master/model
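This is not Roomba itself, but a rough sketch of the same kind of aggregation: counting which "extras" keys occur across the datasets of one CKAN group. The portal URL, the group name and the availability of the Action API on that portal are assumptions.

    from collections import Counter

    import requests

    PORTAL = "https://datahub.io"   # assumed to expose the CKAN Action API
    GROUP = "lodcloud"              # group referenced on the slide

    resp = requests.get(
        f"{PORTAL}/api/3/action/package_search",
        params={"fq": f"groups:{GROUP}", "rows": 1000},
        timeout=60,
    )
    resp.raise_for_status()
    datasets = resp.json()["result"]["results"]

    # Count how often each "extras" key appears across the group's datasets.
    extras_keys = Counter(
        extra["key"] for dataset in datasets for extra in dataset.get("extras", [])
    )
    for key, count in extras_keys.most_common(10):
        print(f"{count / max(len(datasets), 1):6.1%}  {key}")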
  16. https://xkcd.com/927/
  17. Questions?
     Ahmad Assaf
     http://ahmadassaf.com/
     @ahmadaassaf
     http://github.com/ahmadassaf

Editor's notes

  • An asset is something that can be opened and read using familiar desktop software, as opposed to raw data that needs to be processed.
  • The interlinking is modelled by a linkset (void:Linkset). A linkset in voiD is a subclass of a dataset, used for storing triples to express the interlinking relationship between datasets. In each interlinking triple, the subject is a resource hosted in one dataset and the object is a resource hosted in another dataset. This modelling enables a flexible and powerful way to talk in great detail about the interlinking between two datasets, such as how many links there exist, which kind of links (e.g. owl:sameAs or foaf:knows) are present, or stating who claims these statements.
