1. A distributed network of digital
heritage information
Enno Meijers
Semantics Conference – Amsterdam – 12 September 2017
2. Contents
• Introduction to Dutch Digital Heritage Network
• Problems with the current infrastructure
• Strategies for improvement
• Building the distributed network
4. The Digital Heritage Network (NDE) aims to increase the social value of the heritage information maintained by libraries, archives, museums and other cultural heritage institutions. It is a long-term cooperation between the government and the institutions at national, regional and local level. It's about information and people!
5. Three-layered approach for
improving the sustainability,
the usability and the visibility
of digital heritage information.
6. In general - discovery of the “deep web”
• Institutional repositories, collection management systems
• Millions of ‘invisible’ datasets: publications, research data, heritage collections
• Poor coverage by regular search engines
• Metadata is key, describing physical materials or (licensed) digital content
• Dutch cultural heritage sector: 1500 institutions, >>1500 collections
• Demand for cross-institutional, cross-domain discovery
• Many specialized portals giving access to different views
11. Evaluating current approach
Positive results so far:
• many data sources available through OAI-PMH protocol
• powerful and smart protocol for metadata synchronization
• opened up data silos
• created the need for aligning data models
• made cross-collection and cross-domain discovery possible (e.g. Europeana)
But there are two problem areas:
• semantic alignment
• data integration
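The synchronization loop behind OAI-PMH can be sketched in a few lines. The snippet below is a sketch (the base URL is hypothetical, and no network call is made): it builds a ListRecords request and extracts the resumptionToken that drives paging through a repository.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def list_records_url(base_url, metadata_prefix="oai_dc", token=None):
    """Build an OAI-PMH ListRecords request URL.

    When a resumptionToken is present it must be the only argument
    besides the verb, per the OAI-PMH 2.0 specification.
    """
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"

def resumption_token(response_xml):
    """Extract the resumptionToken from a ListRecords response, or None."""
    root = ET.fromstring(response_xml)
    el = root.find(f"{OAI_NS}ListRecords/{OAI_NS}resumptionToken")
    if el is not None and el.text:
        return el.text.strip()
    return None
```

A harvester repeats the request with the returned token until the response carries no token, which is exactly the "hard work" of keeping a copy in sync with every source.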
12. Problem #1: Poor semantic alignment
Not enough semantic alignment in the data sources:
• lack sustainable URIs and shared identifiers
• no shared terminology sources available
• no provisions for linking between sources
• implementations lack support for multiple data models
• data is ‘flattened’ to a common data model (EDM, Dublin Core)
• loss of meaning due to transformation
=> poor capabilities for cross-collection, cross-domain discovery
=> cleaning, aligning and enriching is needed after harvesting
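The flattening problem can be made concrete with a toy mapping (the record and field names below are illustrative, not a real EDM or Dublin Core crosswalk):

```python
def flatten_to_dc(record):
    """Flatten a richer description to simple Dublin Core-like fields.

    Domain-specific distinctions (depicted subject vs. material, typed
    agent roles) collapse into generic fields, so the original meaning
    cannot be recovered after harvesting.
    """
    flat = {"title": record.get("title")}
    # Both the depicted concept and the material end up in dc:subject.
    flat["subject"] = record.get("depicts", []) + record.get("material", [])
    # Painter, printer and donor roles all collapse into dc:creator.
    flat["creator"] = [a["name"] for a in record.get("agents", [])]
    return flat

rich = {
    "title": "View of Delft",
    "depicts": ["windmill"],
    "material": ["oil paint"],
    "agents": [{"name": "Johannes Vermeer", "role": "painter"}],
}
```

After flattening, 'windmill' and 'oil paint' are indistinguishable subjects, and the painter role is gone: the transformation is lossy by construction.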
13. Problem #2: Inefficient data integration
Physical data integration based on OAI-PMH (= copying)
• synchronizing with the sources is hard work
• ownership, licensing, provenance, control over access are difficult topics
• no feedback loop to the data source (usage, cleaning, enrichments)
• data source owner and end user are disconnected
• centralized model leads to scalability problems
• OAI-PMH is not a web-centric protocol
See also:
Miel Vander Sande et al., "Towards sustainable publishing and querying of distributed Linked Data archives", Journal of Documentation (2017)
Herbert Van de Sompel, "Reminiscing About 15 Years of Interoperability Efforts", D-Lib Magazine, December 2015
15. • build portals as views based on a common data layer
• minimize the intermediate layers
• refer to the source instead of copying
• support decentralized discovery
• maximize the usability of data at the source
• develop a sustainable, ‘webcentric’ solution
• use HTTP, RDF and RESTful APIs as building blocks
=> implement the Linked Data principles
Inspired by the work of Ruben Verborgh, Herbert Van de Sompel and colleagues.
See for example: Miel Vander Sande et al., "Towards sustainable publishing and querying of distributed Linked Data archives", Journal of Documentation (2017)
Design principles for a discovery infrastructure
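One concrete aspect of a 'webcentric' solution is HTTP content negotiation: the same URI serves Turtle or JSON-LD to machines and HTML to browsers. The matcher below is a simplified sketch, not a full RFC 7231 implementation:

```python
def preferred_rdf_format(accept_header):
    """Pick the best-supported representation from an Accept header.

    Falls back to HTML for browsers; quality values (q=) are honoured
    in a simplified way (no wildcards, no specificity ordering).
    """
    supported = {"text/turtle": "turtle",
                 "application/ld+json": "json-ld",
                 "text/html": "html"}
    best, best_q = "html", 0.0
    for part in accept_header.split(","):
        bits = part.strip().split(";")
        mime = bits[0].strip()
        q = 1.0
        for param in bits[1:]:
            if param.strip().startswith("q="):
                try:
                    q = float(param.strip()[2:])
                except ValueError:
                    q = 0.0
        if mime in supported and q > best_q:
            best, best_q = supported[mime], q
    return best
```

A RESTful Linked Data endpoint would call this on every request and serialize the resource accordingly, so one sustainable URI serves both people and machines.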
16. At the data source level:
• use sustainable URIs to identify the resources
• use formal definitions for persons, places, concepts, events (API)
• use domain vocabularies / data models to describe the data
• add support for cross-domain discovery (EDM, Schema.org,...)
At the network level:
• create a ‘network of terms’ for shared entities
• provide tools for aligning and linking
• create alignments and links between different terminology sources
• provide easy access for collection management systems (API)
Implementing Linked Data principles
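A 'network of terms' can be pictured as a lookup service that resolves a label to shared entity URIs and follows alignments between terminology sources. All URIs and data below are purely illustrative:

```python
# Toy 'network of terms': labels from several terminology sources
# resolving to shared entity URIs (all URIs are hypothetical).
TERMS = {
    "windmill": ["http://example.org/aat/windmill"],
    "windmolen": ["http://example.org/aat/windmill"],
}
# skos:exactMatch-style alignments between terminology sources.
ALIGNMENTS = {
    "http://example.org/aat/windmill": ["http://example.org/cht/windmolens"],
}

def lookup(label):
    """Resolve a label to entity URIs, following cross-source alignments."""
    uris = list(TERMS.get(label.lower(), []))
    for uri in list(uris):
        uris.extend(ALIGNMENTS.get(uri, []))
    return uris
```

A collection management system would call such an API while cataloguing, so that object descriptions point at shared, aligned entities instead of free-text strings.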
21. The Semantic Web is still a dream… #1
So discovery of Linked Data requires registering datasets?!
22. A tiny example...suppose a resource is defined as:
museum_X:object1
a nde:painting ;
dcterms:subject aat:windmill .
For ‘browsable Linked Data’ you should(!) add the inverse relation [1],[2]:
aat:windmill
a skos:Concept ;
skos:prefLabel "Windmolen"@nl ;
dcterms:isSubjectOf museum_X:object1 . # “backlinks”
=> a Linked Data integration problem…
The Semantic Web is still a dream… #2
[1]: Tim Berners-Lee on 'browsable linked data' (2006)
[2]: Tom Heath and Christian Bizer on ‘Incoming Links’ (2011)
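Generating such backlinks is mechanical once the inverse properties are known. A minimal sketch over triples represented as tuples (it reuses the slide's illustrative dcterms:isSubjectOf, which is not an official DCMI term):

```python
# Inverse predicates whose backlinks we want to publish.
# dcterms:isSubjectOf follows the slide's example and is illustrative,
# not part of the official DCMI vocabulary.
INVERSES = {"dcterms:subject": "dcterms:isSubjectOf"}

def backlinks(triples):
    """Generate inverse ('backlink') triples for browsable Linked Data."""
    return [(o, INVERSES[p], s) for (s, p, o) in triples if p in INVERSES]

data = [("museum_X:object1", "rdf:type", "nde:painting"),
        ("museum_X:object1", "dcterms:subject", "aat:windmill")]
```

The integration problem is that these backlinks belong logically to the term source (aat:windmill), which usually has no knowledge of the museum's data, so some party in the network has to compute and host them.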
23. 1. Semantic integration only:
• just implement schema.org, let search engines ‘infer’ the relations
• is the data interesting enough for Google?
• what about special thematic or regional views? how about reuse?
• can we reuse the results of the integration? (NO!)
2. Physical integration:
• aggregate all the related Linked Data sources
• build large triplestore and infer the relations
• but like OAI-PMH, based on copying data
Special case: LOD Laundromat
Comparing Linked Data integration approaches…
24. ‘Traditional’ solutions to federated querying are not feasible:
- publishing Linked Data in triplestores is hard for small data providers
- service is vulnerable because of rich functionality
- federated querying over SPARQL endpoints performs poorly
Follow the Linked Data Fragments approach:
- Linked Data available through Triple Pattern Fragments interface
- easier to implement for data providers
- federated querying is supported, even SPARQL
- more complexity at network level is acceptable
- even support for time-based versions (Memento)
3. Virtual integration by federated approach
See also: Miel Vander Sande et al., "Towards sustainable publishing and querying of distributed Linked Data archives", Journal of Documentation (2017)
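The core of the Triple Pattern Fragments idea fits in a few lines: the server answers only single triple patterns, paged, with a total count that client-side engines use for query planning. This is a sketch; real TPF responses also carry hypermedia controls and metadata:

```python
def fragment(triples, s=None, p=None, o=None, page=1, page_size=100):
    """Answer one triple pattern over an in-memory store, paged.

    None acts as a variable. Returning the total match count lets a
    client-side SPARQL engine order joins cheaply; the server itself
    stays simple and cacheable.
    """
    matches = [t for t in triples
               if (s is None or t[0] == s)
               and (p is None or t[1] == p)
               and (o is None or t[2] == o)]
    start = (page - 1) * page_size
    return {"total": len(matches),
            "triples": matches[start:start + page_size]}

store = [("museum_X:object1", "dcterms:subject", "aat:windmill"),
         ("museum_X:object2", "dcterms:subject", "aat:windmill"),
         ("museum_X:object1", "rdf:type", "nde:painting")]
```

Because the interface is this minimal, even small data providers can host it, while the complexity of full SPARQL moves to the client or to the network level.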
25. Use the backlinks to support the discovery process:
See also: Miel Vander Sande et al., "Hypermedia-Based Discovery for Source Selection Using Low-Cost Linked Data Interfaces", IJSWIS 12(3), 79-110 (2016)
More advanced:
data source profiling
or dataset summaries
Federated querying needs source selection…
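Source selection with backlinks amounts to an index from shared entities to the datasets that mention them; a federated engine then queries only those sources instead of every registered endpoint (dataset names below are illustrative):

```python
def select_sources(entity_uri, backlink_index):
    """Pick only the datasets known to mention a shared entity.

    backlink_index maps an entity URI to the datasets holding triples
    about it, e.g. built from the knowledge graph of backlinks.
    """
    return backlink_index.get(entity_uri, [])

# Illustrative index, as a registry/knowledge graph might provide it.
index = {"aat:windmill": ["museum_X", "archive_Y"]}
```

Data source profiling or dataset summaries refine the same idea: richer descriptions of what each source contains allow better selection than plain backlinks alone.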
26. To make discovery of Linked Data work:
1. Register organizations and datasets
2. Build a knowledge graph with backlinks for resource discovery
Implementations will depend on capabilities of cultural heritage institutions
28. Strategy for our distributed network
1. build a knowledge graph for Dutch digital heritage entities
2. improve the usability of the data source:
- align object descriptions with shared entities
- publish data as Linked Data
3. build a discovery infrastructure:
- register organizations and datasets in a registry
- build knowledge graph to support discovery (backlinks)
4. implement virtual data integration technology:
- use registry and knowledge graph for selecting the resources
- support federated querying (or selective aggregation)
(these steps address the two problem areas: semantic alignment and data integration)
30. Roadmap
• Start with the existing (OAI-PMH) based infrastructure
• Build registry for organizations and datasets
• Build network of terms to provide shared entities for discovery
• Upgrade object descriptions with URIs
• Make aggregators Linked Data compliant
• Build knowledge graph with backlinks for discovery
• Support federated querying (or selective harvesting)
• Make collection management systems Linked Data compliant
• Transform aggregators to service portals for discovery
31. Thank you for your attention!
please share your thoughts with us...
email: enno.meijers at kb.nl
twitter, slideshare: ennomeijers
https://github.com/netwerk-digitaal-erfgoed