Journal of Library and Information Science 43(1):7 – 46 (April 2017)
Reuse of Structured Data: Semantics,
Linkage, and Realization
Andrea Wei-Ching Huang
Project Manager (Research)
Institute of Information Science, Academia Sinica, Taiwan
E-mail: andreahg@iis.sinica.edu.tw
Cheng-Jen Lee
Research Assistant
Institute of Information Science, Academia Sinica, Taiwan
E-mail: cjlee@iis.sinica.edu.tw
Tyng-Ruey Chuang
Associate Research Fellow
Institute of Information Science, Academia Sinica, Taiwan
E-mail: trc@iis.sinica.edu.tw
Keywords: CKAN; Data Provenance; Data Quality; Knowledge Base; Linked Open
Data (LOD); Ontology; Semantic Representation
【Abstract】
To increase the reuse value of existing datasets, it is becoming general practice to add
semantic links among the records in a dataset, and to link these records to external resources. The
enriched datasets are published on the Web for both humans and machines to consume and re-purpose.
In this paper, we make use of publicly available structured records from a digital archive catalogue, and
we demonstrate a principled approach to converting the records into semantically rich and interlinked
resources for all to reuse. While exploring the various issues involved in reusing and
re-purposing existing datasets, we review recent progress in the field of Linked Open Data (LOD),
and examine twelve well-known knowledge bases built with a Linked Data approach. We also discuss
the general issues of data quality, metadata vocabularies, and data provenance. The concrete outcomes
of this research are the following: (1) a website, data.odw.tw, that hosts more than 840,000
semantically enriched catalogue records across multiple subject areas, (2) a lightweight ontology,
voc4odw, for describing data reuse and provenance, among others, and (3) a set of open source
software tools available to all to perform the kind of data conversion and enrichment we did in this
research. We have used and extended CKAN (The Comprehensive Knowledge Archive Network) as a
platform to host and publish Linked Data. Our extensions to CKAN are open sourced as well. As the
records we have drawn from the original catalogue are released under Creative Commons licenses, the
semantically enriched resources we now re-publish on the Web are free for all to reuse as well.
DOI: 10.6245/JLIS.2017.431/722

【Long Abstract】
Introduction
In order to enhance the reuse value of existing datasets, it is now becoming a general practice to add
semantic links among the records in a dataset, and to link these records to external resources. The
enriched datasets are published on the Web for both humans and machines to consume and
re-purpose. In this paper, we make use of publicly available structured records from a digital archive
catalogue, and we demonstrate a principled approach to converting the records into semantically rich and
interlinked resources for all to reuse. While exploring the various issues involved in the process of
reusing and re-purposing existing datasets, we review the recent progress in the field of Linked Open
Data (LOD), and examine twelve well-known knowledge bases built with a Linked Data approach. We
also discuss the general issues of data quality, metadata vocabularies, and data provenance.
The concrete outcomes of this research are the following: (1) a website that hosts more than
840,000 semantically enriched catalogue records across multiple subject areas, (2) a lightweight
ontology voc4odw for describing data reuse and provenance, among others, and (3) a set of open source
software tools available to all to perform the kind of data conversion and enrichment we did in this
research. We have used and extended CKAN (The Comprehensive Knowledge Archive Network) as a
platform to host and publish Linked Data. Our extensions to CKAN are open sourced as well. As the
records we have drawn from the original catalogue are released under Creative Commons licenses,
the semantically enriched resources we now re-publish on the Web are free for all to reuse as well.
Review of Twelve Knowledge Bases
We begin by examining twelve knowledge bases built with a Linked Data approach. Five of them
are built by domain knowledge experts (OpenCyc, Getty Art & Architecture Thesaurus, Getty Thesaurus
of Geographic Names, and Ordnance Survey), six of them are collaborative databases (Freebase, YAGO,
DBpedia, Wikidata, LinkedGeoData, GeoNames), and the last one is about ecological observations
based on expert and community collaborations (Encyclopedia of Life). We further compare datasets
about geospatial entities with controlled vocabularies: Getty TGN, Open Names (Ordnance Survey),
DBpediaPlace, LinkedGeoData, and GeoNames.
To make good reuse of structured data, one first needs to deal with the problem of data quality.
Currently there exist different evaluation criteria, with various techniques for measuring the quality of
information, data, metadata, and Linked Data. We review four papers on data quality and systematically
compare their evaluation criteria. Moreover, data provenance --- contextual metadata about the source
and use of data --- has proven to be fundamental for assessing authenticity, enabling trust, and allowing
reproducibility. Thus, we examine key mechanisms of data provenance before we move forward to
discussing LOD applications.
Practices
We then make use of structured records from a digital archive catalogue, and convert the records into
semantically rich and interlinked resources on the Web. This is realized as a unified Linked Data
catalogue for several digital archive collections. Our work results in an LOD catalogue available to the public at
the website <http://data.odw.tw>. The following five parts are involved in realizing this website. A catalogue
record about the species Pleione formosana is used throughout the paper as an example to demonstrate
the way we model, convert, and represent the semantics of a structured record.
Part 1: Exploring data reuse relations in a shared context -- We review our previous research about the
Relation for Reuse Ontology (R4R). In particular, we provide mechanisms for reusing articles, data, and
code with some flexibility in encoding provenance and license information.
Part 2: Comparing two different data conversion approaches to providing LOD for an archive
catalogue -- We show two different scenarios: (1) The LOD catalogue is converted directly from a
relational database, and (2) the LOD catalogue is generated from a series of format conversions --- from
XML to CSV, and then to RDF.
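
The second scenario, a chain of format conversions, can be sketched roughly as follows. This is a minimal illustration only, not the authors' tools: the XML element names, the `http://example.org/record/` URI base, and the choice of Dublin Core terms as predicates are all assumptions made for the example, and N-Triples is used as the RDF output because it is simple to write by hand.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical source XML; element names and values are illustrative only.
XML_SOURCE = """<records>
  <record id="r1"><title>Sample orchid record</title><description>Specimen photo</description></record>
  <record id="r2"><title>A "quoted" title</title><description></description></record>
</records>"""

BASE = "http://example.org/record/"   # hypothetical URI base for records
DCT = "http://purl.org/dc/terms/"     # Dublin Core terms namespace
FIELDS = {"title": DCT + "title", "description": DCT + "description"}


def xml_to_csv(xml_text):
    """Flatten each <record> element into one CSV row."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["id", *FIELDS], lineterminator="\n")
    writer.writeheader()
    for rec in root.iter("record"):
        row = {"id": rec.get("id")}
        for name in FIELDS:
            row[name] = (rec.findtext(name) or "").strip()
        writer.writerow(row)
    return out.getvalue()


def escape_literal(value):
    """Escape the characters that N-Triples string literals require."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def csv_to_ntriples(csv_text):
    """Emit one triple per non-empty cell; empty cells produce no triple."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}{row['id']}>"
        for name, predicate in FIELDS.items():
            if row[name]:
                lines.append(f'{subject} <{predicate}> "{escape_literal(row[name])}" .')
    return "\n".join(lines)


csv_text = xml_to_csv(XML_SOURCE)
ntriples = csv_to_ntriples(csv_text)
print(ntriples)
```

Keeping CSV as an intermediate step means the tabular data can be inspected and cleaned before any RDF is generated, at the cost of one more conversion to maintain.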
Part 3: Data profiling, cleaning and mapping -- We demonstrate format conversion processes, and we
discuss the pros and cons of various ways of handling broken links in source datasets. In addition, we
mapped and linked catalogue records to three external knowledge bases: GeoNames, Wikidata, and
Encyclopedia of Life.
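
As a rough illustration of the trade-off in handling broken links, the sketch below contrasts two possible policies: dropping a bad link, or keeping it and flagging it. The record layout, the `link` field name, and both policies are assumptions for the example rather than the approach taken in the paper, and the check is purely syntactic; it does not test whether a URL actually resolves.

```python
from urllib.parse import urlparse

# Hypothetical records; the "link" field name is an assumption.
RECORDS = [
    {"id": "r1", "link": "http://example.org/img/1.jpg"},
    {"id": "r2", "link": "htp:/broken"},
    {"id": "r3", "link": ""},
]


def is_well_formed(url):
    """Syntactic check only: an http(s) scheme and a non-empty host."""
    try:
        parts = urlparse(url.strip())
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def clean_links(records, policy="flag"):
    """Return cleaned copies of the records; the input is left untouched.

    policy="drop": remove malformed links (simpler output, loses information).
    policy="flag": keep every link and record whether it looked valid.
    """
    cleaned = []
    for rec in records:
        rec = dict(rec)
        ok = is_well_formed(rec.get("link", ""))
        if policy == "drop":
            if not ok:
                rec.pop("link", None)
        elif policy == "flag":
            rec["link_ok"] = ok
        else:
            raise ValueError(f"unknown policy: {policy}")
        cleaned.append(rec)
    return cleaned


flagged = clean_links(RECORDS, "flag")
dropped = clean_links(RECORDS, "drop")
print([r["link_ok"] for r in flagged])
```

Dropping keeps the published data tidy, while flagging preserves the original value so that a link can be repaired later.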
Part 4: Using CKAN (The Comprehensive Knowledge Archive Network) as a Linked Data platform --
We briefly introduce CKAN, an open source web-based data portal software package for curating and
publishing datasets. CKAN provides data preview, search, and discovery, especially with regard to
geospatial datasets. We built several extensions to CKAN in order to deposit, publish, browse, and
search Linked Data. Various Linked Data representations of a catalogue record --- Turtle, RDF/XML,
and JSON-LD --- can all be downloaded and reused.
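
To show what it means for one record to have several Linked Data representations, the sketch below writes the same single statement as JSON-LD and as Turtle. The record URI, the title value, and the use of `dct:title` are assumptions for illustration; the actual records on the site use their own identifiers and vocabularies, and a real export would normally rely on an RDF library rather than string formatting.

```python
import json

DCT = "http://purl.org/dc/terms/"  # Dublin Core terms namespace

# Hypothetical record; the URI and title are illustrative only.
record = {"id": "http://example.org/record/r1", "title": "Sample orchid record"}


def to_jsonld(rec):
    """Build a JSON-LD document with a prefix declared in @context."""
    return {
        "@context": {"dct": DCT},
        "@id": rec["id"],
        "dct:title": rec["title"],
    }


def escape_literal(value):
    """Escape characters that must not appear raw in a Turtle string."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def to_turtle(rec):
    """Serialize the same statement as Turtle with a matching prefix."""
    return (
        f"@prefix dct: <{DCT}> .\n\n"
        f'<{rec["id"]}> dct:title "{escape_literal(rec["title"])}" .\n'
    )


jsonld_text = json.dumps(to_jsonld(record), indent=2, ensure_ascii=False)
turtle_text = to_turtle(record)
print(jsonld_text)
print(turtle_text)
```

Both outputs describe the same triple; the formats differ only in syntax, which is why a consumer can pick whichever one suits its tooling.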
Part 5: Designing ontologies for data representation and reuse -- We design an ontology, voc4odw,
which includes the following three modules:
(1) The Core Model. It comprises a data model and a conceptual model. The data model represents
key data structures and relations. It is a framework to illustrate data source, derivation, and provenance.
The conceptual model incorporates SKOS (Simple Knowledge Organization System); it also connects
to key event concepts. The conceptual model allows for data contextualization using common and
domain knowledge vocabularies.
(2) The Curation Model. It is responsible for disclosing the identification, classification, and publication
of structured records at a curation platform, such as the classification of themes, the assignment of
data identifiers, and the publication of datasets.
(3) A vocabulary voaf:Vocabulary. It is defined as "A vocabulary used in the Linked Data cloud", from
the Vocabulary of a Friend <http://purl.org/vocommons/voaf>. This module relates the Core
Model to external common vocabularies. Some hierarchy relations between different external
vocabularies can be traced with this vocabulary.
【Romanization of Chinese references is offered in the paper.】

Weitere ähnliche Inhalte

Was ist angesagt?

20130622 okfn hackathon t2
20130622 okfn hackathon t220130622 okfn hackathon t2
20130622 okfn hackathon t2
Seonho Kim
 
OpenTox - an open community and framework supporting predictive toxicology an...
OpenTox - an open community and framework supporting predictive toxicology an...OpenTox - an open community and framework supporting predictive toxicology an...
OpenTox - an open community and framework supporting predictive toxicology an...
Barry Hardy
 
Towards an Infrastructure for Mining Scientific Publications
Towards an Infrastructure for Mining Scientific PublicationsTowards an Infrastructure for Mining Scientific Publications
Towards an Infrastructure for Mining Scientific Publications
petrknoth
 

Was ist angesagt? (20)

Dataset description: DCAT and other vocabularies
Dataset description: DCAT and other vocabulariesDataset description: DCAT and other vocabularies
Dataset description: DCAT and other vocabularies
 
Linked Data
Linked DataLinked Data
Linked Data
 
The Materials Data Facility: A Distributed Model for the Materials Data Commu...
The Materials Data Facility: A Distributed Model for the Materials Data Commu...The Materials Data Facility: A Distributed Model for the Materials Data Commu...
The Materials Data Facility: A Distributed Model for the Materials Data Commu...
 
AAT LOD Microthesauri
AAT LOD MicrothesauriAAT LOD Microthesauri
AAT LOD Microthesauri
 
How to describe a dataset. Interoperability issues
How to describe a dataset. Interoperability issuesHow to describe a dataset. Interoperability issues
How to describe a dataset. Interoperability issues
 
Knowledge organization
Knowledge organizationKnowledge organization
Knowledge organization
 
TripFS presentation at ldow 2010
TripFS presentation at ldow 2010TripFS presentation at ldow 2010
TripFS presentation at ldow 2010
 
mx & dbs
mx & dbsmx & dbs
mx & dbs
 
20130622 okfn hackathon t2
20130622 okfn hackathon t220130622 okfn hackathon t2
20130622 okfn hackathon t2
 
Linked data HHS 2015
Linked data HHS 2015Linked data HHS 2015
Linked data HHS 2015
 
Metadata standards
Metadata standardsMetadata standards
Metadata standards
 
Www2012 tutorial content_aggregation
Www2012 tutorial content_aggregationWww2012 tutorial content_aggregation
Www2012 tutorial content_aggregation
 
NISO/NFAIS Joint Virtual Conference: Connecting the Library to the Wider Worl...
NISO/NFAIS Joint Virtual Conference: Connecting the Library to the Wider Worl...NISO/NFAIS Joint Virtual Conference: Connecting the Library to the Wider Worl...
NISO/NFAIS Joint Virtual Conference: Connecting the Library to the Wider Worl...
 
OpenTox - an open community and framework supporting predictive toxicology an...
OpenTox - an open community and framework supporting predictive toxicology an...OpenTox - an open community and framework supporting predictive toxicology an...
OpenTox - an open community and framework supporting predictive toxicology an...
 
2015 07-tuto3-mining hin
2015 07-tuto3-mining hin2015 07-tuto3-mining hin
2015 07-tuto3-mining hin
 
Role of Semantic Web in Health Informatics
Role of Semantic Web in Health InformaticsRole of Semantic Web in Health Informatics
Role of Semantic Web in Health Informatics
 
eROSA Stakeholder WS1: Data discovery through federated dataset catalogues
eROSA Stakeholder WS1: Data discovery through federated dataset catalogueseROSA Stakeholder WS1: Data discovery through federated dataset catalogues
eROSA Stakeholder WS1: Data discovery through federated dataset catalogues
 
Research Data Sharing: A Basic Framework
Research Data Sharing: A Basic FrameworkResearch Data Sharing: A Basic Framework
Research Data Sharing: A Basic Framework
 
DataTags, The Tags Toolset, and Dataverse Integration
DataTags, The Tags Toolset, and Dataverse IntegrationDataTags, The Tags Toolset, and Dataverse Integration
DataTags, The Tags Toolset, and Dataverse Integration
 
Towards an Infrastructure for Mining Scientific Publications
Towards an Infrastructure for Mining Scientific PublicationsTowards an Infrastructure for Mining Scientific Publications
Towards an Infrastructure for Mining Scientific Publications
 

Ähnlich wie Reuse of Structured Data: Semantics, Linkage, and Realization

Sem facet paper
Sem facet paperSem facet paper
Sem facet paper
DBOnto
 

Ähnlich wie Reuse of Structured Data: Semantics, Linkage, and Realization (20)

Academic Linkage A Linkage Platform For Large Volumes Of Academic Information
Academic Linkage  A Linkage Platform For Large Volumes Of Academic InformationAcademic Linkage  A Linkage Platform For Large Volumes Of Academic Information
Academic Linkage A Linkage Platform For Large Volumes Of Academic Information
 
Metadata as Linked Data for Research Data Repositories
Metadata as Linked Data for Research Data RepositoriesMetadata as Linked Data for Research Data Repositories
Metadata as Linked Data for Research Data Repositories
 
ChemConnect: Poster for European Combustion Meeting 2017
ChemConnect: Poster for European Combustion Meeting 2017ChemConnect: Poster for European Combustion Meeting 2017
ChemConnect: Poster for European Combustion Meeting 2017
 
A Clean Slate?
A Clean Slate?A Clean Slate?
A Clean Slate?
 
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
 
Next-Generation Search Engines for Information Retrieval
Next-Generation Search Engines for Information RetrievalNext-Generation Search Engines for Information Retrieval
Next-Generation Search Engines for Information Retrieval
 
from local/regional OER Silos towards an OER Global Dataspace
from local/regional OER Silos towards an OER Global Dataspacefrom local/regional OER Silos towards an OER Global Dataspace
from local/regional OER Silos towards an OER Global Dataspace
 
Charleston 2012 - The Future of Serials in a Linked Data World
Charleston 2012 - The Future of Serials in a Linked Data WorldCharleston 2012 - The Future of Serials in a Linked Data World
Charleston 2012 - The Future of Serials in a Linked Data World
 
LKG Editor Dev
LKG Editor DevLKG Editor Dev
LKG Editor Dev
 
Automatically converting tabular data to
Automatically converting tabular data toAutomatically converting tabular data to
Automatically converting tabular data to
 
Exploration of a Data Landscape using a Collaborative Linked Data Framework.
Exploration of a Data Landscape using a Collaborative Linked Data Framework.Exploration of a Data Landscape using a Collaborative Linked Data Framework.
Exploration of a Data Landscape using a Collaborative Linked Data Framework.
 
Linked Data to Improve the OER Experience
Linked Data to Improve the OER ExperienceLinked Data to Improve the OER Experience
Linked Data to Improve the OER Experience
 
Poster RDAP13: Research Data in eCommons @ Cornell: Present and Future
Poster RDAP13: Research Data in eCommons @ Cornell: Present and FuturePoster RDAP13: Research Data in eCommons @ Cornell: Present and Future
Poster RDAP13: Research Data in eCommons @ Cornell: Present and Future
 
Open Data and Institutional Repositories
Open Data and Institutional RepositoriesOpen Data and Institutional Repositories
Open Data and Institutional Repositories
 
Linked Data: Why Bother?
Linked Data:  Why Bother?Linked Data:  Why Bother?
Linked Data: Why Bother?
 
Relations for Reusing (R4R) in A Shared Context: An Exploration on Research P...
Relations for Reusing (R4R) in A Shared Context: An Exploration on Research P...Relations for Reusing (R4R) in A Shared Context: An Exploration on Research P...
Relations for Reusing (R4R) in A Shared Context: An Exploration on Research P...
 
2011linked science4mccuskermcguinnessfinal
2011linked science4mccuskermcguinnessfinal2011linked science4mccuskermcguinnessfinal
2011linked science4mccuskermcguinnessfinal
 
Sem facet paper
Sem facet paperSem facet paper
Sem facet paper
 
SemFacet paper
SemFacet paperSemFacet paper
SemFacet paper
 
New old(1)
New old(1)New old(1)
New old(1)
 

Mehr von andrea huang

A preliminary study on Wikipedia Dbpdeia and Wikidata
A preliminary study on Wikipedia Dbpdeia and WikidataA preliminary study on Wikipedia Dbpdeia and Wikidata
A preliminary study on Wikipedia Dbpdeia and Wikidata
andrea huang
 
20130805 Activating Linked Open Data in Libraries Archives and Museums
20130805 Activating Linked Open Data in Libraries Archives and Museums20130805 Activating Linked Open Data in Libraries Archives and Museums
20130805 Activating Linked Open Data in Libraries Archives and Museums
andrea huang
 
060817 Participation Collaboration Mapping
060817 Participation Collaboration Mapping060817 Participation Collaboration Mapping
060817 Participation Collaboration Mapping
andrea huang
 
070928 Collaborative Geospatial Mapping And Data Authorization
070928 Collaborative Geospatial Mapping And Data Authorization070928 Collaborative Geospatial Mapping And Data Authorization
070928 Collaborative Geospatial Mapping And Data Authorization
andrea huang
 
041018 Community Gis
041018 Community Gis041018 Community Gis
041018 Community Gis
andrea huang
 
051102 Online Community Mapping
051102 Online Community Mapping051102 Online Community Mapping
051102 Online Community Mapping
andrea huang
 
051207 Commonsense Geography Meets Web Technology
051207 Commonsense Geography Meets Web Technology 051207 Commonsense Geography Meets Web Technology
051207 Commonsense Geography Meets Web Technology
andrea huang
 

Mehr von andrea huang (14)

結構資料的再次使用:語意、連結與實作
結構資料的再次使用:語意、連結與實作結構資料的再次使用:語意、連結與實作
結構資料的再次使用:語意、連結與實作
 
20161004 “Open Data Web” – A Linked Open Data Repository Built with CKAN
20161004 “Open Data Web” – A Linked Open Data Repository Built with CKAN20161004 “Open Data Web” – A Linked Open Data Repository Built with CKAN
20161004 “Open Data Web” – A Linked Open Data Repository Built with CKAN
 
20160818 Semantics and Linkage of Archived Catalogs
20160818 Semantics and Linkage of Archived Catalogs20160818 Semantics and Linkage of Archived Catalogs
20160818 Semantics and Linkage of Archived Catalogs
 
20160602 典藏目錄的語意與連結
20160602 典藏目錄的語意與連結20160602 典藏目錄的語意與連結
20160602 典藏目錄的語意與連結
 
How to clean data less through Linked (Open Data) approach?
How to clean data less through Linked (Open Data) approach?How to clean data less through Linked (Open Data) approach?
How to clean data less through Linked (Open Data) approach?
 
A preliminary study on Wikipedia Dbpdeia and Wikidata
A preliminary study on Wikipedia Dbpdeia and WikidataA preliminary study on Wikipedia Dbpdeia and Wikidata
A preliminary study on Wikipedia Dbpdeia and Wikidata
 
20130805 Activating Linked Open Data in Libraries Archives and Museums
20130805 Activating Linked Open Data in Libraries Archives and Museums20130805 Activating Linked Open Data in Libraries Archives and Museums
20130805 Activating Linked Open Data in Libraries Archives and Museums
 
101203 An event ontology for crisis-disaster information
101203 An event ontology for crisis-disaster information101203 An event ontology for crisis-disaster information
101203 An event ontology for crisis-disaster information
 
081016 Social Tagging, Online Communication, and Peircean Semiotics
081016 Social Tagging, Online Communication, and Peircean Semiotics081016 Social Tagging, Online Communication, and Peircean Semiotics
081016 Social Tagging, Online Communication, and Peircean Semiotics
 
060817 Participation Collaboration Mapping
060817 Participation Collaboration Mapping060817 Participation Collaboration Mapping
060817 Participation Collaboration Mapping
 
070928 Collaborative Geospatial Mapping And Data Authorization
070928 Collaborative Geospatial Mapping And Data Authorization070928 Collaborative Geospatial Mapping And Data Authorization
070928 Collaborative Geospatial Mapping And Data Authorization
 
041018 Community Gis
041018 Community Gis041018 Community Gis
041018 Community Gis
 
051102 Online Community Mapping
051102 Online Community Mapping051102 Online Community Mapping
051102 Online Community Mapping
 
051207 Commonsense Geography Meets Web Technology
051207 Commonsense Geography Meets Web Technology 051207 Commonsense Geography Meets Web Technology
051207 Commonsense Geography Meets Web Technology
 

Kürzlich hochgeladen

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Kürzlich hochgeladen (20)

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

Reuse of Structured Data: Semantics, Linkage, and Realization

  • 1. 圖書館學與資訊科學 43(1):7 – 46(民一○六年四月) 43 Reuse of Structured Data: Semantics, Linkage, and Realization Andrea Wei-Ching Huang Project Manager (Research) Institute of Information Science, Academia Sinica, Taiwan E-mail: andreahg@iis.sinica.edu.tw Cheng-Jen Lee Research Assistant Institute of Information Science, Academia Sinica, Taiwan E-mail: cjlee@iis.sinica.edu.tw Tyng-Ruey Chuang Associate Research Fellow Institute of Information Science, Academia Sinica, Taiwan E-mail: trc@iis.sinica.edu.tw Keywords: CKAN; Data Provenance; Data Quality; Knowledge Base; Linked Open Data (LOD); Ontology; Semantic Representation 【Abstract】 In order to increase the reuse value of existing datasets, it is now becoming a general practice to add  semantic links among the records in a dataset, and to link these records to external resources. The  enriched datasets are published on the web for both human and machine to consume and re‐purpose.  In this paper, we make use of publicly available structured records from a digital archive catalogue, and  we demonstrate a principled approach to converting the records into semantically rich and interlinked  resources for all to reuse. While exploring the various issues involved in the process of reusing and  re‐purposing existing datasets, we review the recent progress in the field of Linked Open Data (LOD),  and examine twelve well‐known knowledge bases built with a Linked Data approach. We also discuss  the general issues of data quality, metadata vocabularies, and data provenance. The concrete outcome  of  this  research  work  is  the  following:  (1)  a  website  data.odw.tw  that  hosts  more  than  840,000  semantically  enriched  catalogue  records  across  multiple  subject  areas,  (2)  a  lightweight  ontology  voc4odw  for  describing  data  reuse  and  provenance,  among  others,  and  (3)  a  set  of  open  source  DOI: 10.6245/JLIS.2017.431/722
  • 2. 44 Journal of Library and Information Science 43(1):7 – 46(April, 2017) software tools available to all to perform the kind of data conversion and enrichment we did in this  research. We have used and extended CKAN (The Comprehensive Knowledge Archive Network) as a  platform  to  host  and  publish  Linked  Data.  Our  extensions  to  CKAN  is  open  sourced  as  well.  As  the  records we drawn from the originally catalogue are released under the Creative Commons licenses, the  semantically enriched resources we now re‐publish on the Web are free for all to reuse as well.    【Long Abstract】 Introduction In order to enhance the reuse value of existing datasets, it is now becoming a general practice to add semantic links among the records in a dataset, and to link these records to external resources. The enriched datasets are published on the Web for both the human and the machine to consume and re-purpose. In the paper, we make use of publicly available structured records from a digital archive catalogue, and we demonstrate a principled approach to converting the records into semantically rich and interlinked resources for all to reuse. While exploring the various issues involved in the process of reusing and re-purposing existing datasets, we review the recent progress in the field of Linked Open Data (LOD), and examine twelve well-known knowledge bases built with a Linked Data approach. We also discuss the general issues of data quality, metadata vocabularies, and data provenance. The concrete outcome of this research work is the following: (1) a website that hosts more than 840,000 semantically enriched catalogue records across multiple subject areas, (2) a lightweight ontology voc4odw for describing data reuse and provenance, among others, and (3) a set of open source software tools available to all to perform the kind of data conversion and enrichment we did in this research. 
We have used and extended CKAN (The Comprehensive Knowledge Archive Network) as a platform to host and publish Linked Data. Our extensions to CKAN is open sourced as well. As the records we have drawn from the originally catalogue are released under the Creative Commons licenses, the semantically enriched resources we now re-publish on the Web are free for all to reuse as well. Review of Twelve Knowledge Bases We begin by first examine twelve knowledge bases built with a Linked Data approach. Five of them are built by domain knowledge experts (OpenCyc, Getty Art & Architecture Thesaurus, Getty Thesaurus of Geographic Names, and Ordnance Survey), six of them are collaborative databases (Freebase, YAGO, DBpedia, Wikidata, LinkedGeoData, GeoNames), and the last one is about ecological observations based on expert and community collaborations (Encyclopedia of Life). We further compare datasets
  • 3. 圖書館學與資訊科學 43(1):7 – 46(民一○六年四月) 45 about geospatial entities with controlled vocabularies: Getty TGN, Open Names (Ordnance Survey), DBpediaPlace, LinkedGeoData, and GeoNames. To make good reuse of structured data, ones need to first deal with the problem of data quality. Currently there exist different evaluation criteria, with various techniques for measuring the quality of information, data, metadata, and Linked Data. We review four papers on data quality and systematically compare their evaluation criteria. Moreover, data provenance --- contextual metadata about the source and use of data --- has proven to be fundamental for assessing authenticity, enabling trust, and allowing reproducibility. Thus, we examine key mechanisms of data provenance before we move forward to discussing LOD applications. Practices We then make use of structured records from a digital archive catalogue, and convert the records into semantically rich and interlinked resources on the Web. This is realized as a unified Linked Data catalogue to several digital archive collections. Our work results in a LOD catalogue available to the public at the website <http://data.odw.tw>. The following five parts are involved in realizing this website. A catalogue record, about a species of Pleione Formosana, is used throughout in the paper as an example to demonstrate the way we model, convert, and represent the semantics of a structured record. Part 1: Exploring data reuse relations in a shared context -- We review our previous research about the Relation for Reuse Ontology (R4R). In particular, we provide mechanisms for reusing article, data, and code with some flexibility of encoding provenance and license information. 
Part 2: Comparing two different data conversion approaches to providing LOD for an archive catalogue -- We show two different scenarios: (1) the LOD catalogue is converted directly from a relational database, and (2) the LOD catalogue is generated from a series of format conversions, from XML to CSV and then to RDF.

Part 3: Data profiling, cleaning and mapping -- We demonstrate format conversion processes, and we discuss the pros and cons of various ways of handling broken links in source datasets. In addition, we map and link catalogue records to three external knowledge bases: GeoNames, Wikidata, and Encyclopedia of Life.

Part 4: Using CKAN (The Comprehensive Knowledge Archive Network) as a Linked Data platform -- We briefly introduce CKAN, an open source web-based data portal software package for curating and publishing datasets. CKAN provides data preview, search, and discovery, especially with regard to geospatial datasets. We built several extensions to CKAN in order to deposit, publish, browse, and search Linked Data. Various Linked Data representations of a catalogue record (Turtle, RDF/XML, and JSON-LD) can all be downloaded and reused.

Part 5: Designing ontologies for data representation and reuse -- We design an ontology, voc4odw, which includes the following three modules:

(1) The Core Model. It comprises a data model and a conceptual model. The data model represents key data structures and relations. It is a framework to illustrate data source, derivation, and provenance. The conceptual model incorporates SKOS (Simple Knowledge Organization System); it also connects to key event concepts. The conceptual model allows for data contextualization using common and domain knowledge vocabularies.

(2) The Curation Model. It is responsible for disclosing the identification, classification, and publication of structured records at a curation platform, such as the classification of themes, the assignment of data identifiers, and the publication of datasets.

(3) A vocabulary voaf:Vocabulary. It is defined as "A vocabulary used in the Linked Data cloud", from the Vocabulary of a Friend <http://purl.org/vocommons/voaf>. This module relates the Core Model to external common vocabularies. Some hierarchical relations between different external vocabularies can be traced with this vocabulary.

【Romanization of Chinese references is offered in the paper.】
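The second conversion scenario in Part 2 (XML to CSV, and then to RDF) can be illustrated with a minimal sketch. This is not the authors' pipeline: the XML element names, the sample records, the `http://example.org/record/` namespace, and the choice of Dublin Core terms as predicates are all assumptions made for illustration. The output is serialized as N-Triples (a line-based subset of Turtle) using only the Python standard library.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical catalogue export; element names and values are illustrative only.
XML_SOURCE = """<records>
  <record id="r001">
    <title>Pleione formosana</title>
    <description>Orchid specimen</description>
  </record>
  <record id="r002">
    <title>Untitled specimen "A"</title>
    <description></description>
  </record>
</records>"""

BASE = "http://example.org/record/"   # placeholder namespace (assumption)
DCT = "http://purl.org/dc/terms/"     # Dublin Core terms namespace
# Assumed mapping from source fields to RDF predicates.
FIELDS = {"title": DCT + "title", "description": DCT + "description"}


def xml_to_csv(xml_text):
    """Step 1: flatten each <record> element into one CSV row."""
    root = ET.fromstring(xml_text)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["id", *FIELDS])
    writer.writeheader()
    for rec in root.iter("record"):
        row = {"id": rec.get("id")}
        for name in FIELDS:
            row[name] = (rec.findtext(name) or "").strip()
        writer.writerow(row)
    return buf.getvalue()


def escape_literal(value):
    """Escape backslashes, quotes, and line breaks for an N-Triples literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def csv_to_ntriples(csv_text):
    """Step 2: emit one triple per non-empty CSV cell."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}{row['id']}>"
        for name, predicate in FIELDS.items():
            if row[name]:  # skip empty fields rather than emit empty literals
                literal = escape_literal(row[name])
                lines.append(f'{subject} <{predicate}> "{literal}" .')
    return "\n".join(lines)


csv_text = xml_to_csv(XML_SOURCE)
ntriples = csv_to_ntriples(csv_text)
print(ntriples)
```

Keeping CSV as an intermediate format makes the data easy to profile and clean in tabular tools before any RDF is generated, which is one plausible reason to prefer this route over a direct conversion from a relational database.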