Looks at hyperlinks from the perspective of a managed collection of resources for which link persistence/integrity is considered a quality of service concern. Distinguishes between links into other managed collections and to the web at large. Considers link rot and content drift.
1. @hvdsomp
Thor Conference, Rome, Italy, November 15 2017
Herbert Van de Sompel
Los Alamos National Laboratory
@hvdsomp
Achieving Link Integrity for Managed Collections
Photo by Eric Sieverts
11. @hvdsomp
Content Drift
http://icecube.wisc.edu/ on May 8 2009 (left) and August 27 2009 (right)
12. @hvdsomp
No Content Drift
http://www.ifa.hawaii.edu/~cowie/k_table.html on June 9 1997 (left) and March 2016 (right)
14. @hvdsomp
The Web, All Hyperlinks Subject to Reference Rot
• Reference Rot hinders our ability to follow links as they were
intended when they were put in place:
• Link rot: A link stops working altogether
• Content drift: The linked content changes over time and may
eventually no longer be representative of the content that was
originally linked
15. @hvdsomp
Creating Pockets of Persistence
• How to maintain the integrity of links?
• This challenge exists for the entire web. Some communities with
well-managed collections care about addressing it because they
consider it a Quality of Service issue:
• Scholarly communication
• Cultural heritage
• Legal publications
• Government communication
• Journalism
• Wikipedia
• …
• What can these communities do to create Pockets of Persistence?
21. @hvdsomp
PubMed Central Corpus
                                          PMC
Articles published 1997-2012
  Total                                   479,194
  With links to articles                  240,857
  With links to web-at-large resources    156,160
Links
  To articles                             744,678
  To web-at-large resources               480,853
22. @hvdsomp
Links to Articles & to Web At Large Resources - PMC
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0115253
30. @hvdsomp
Herbert Van de Sompel, Martin Klein, and Shawn Jones (2016) Persistent URIs Must Be Used
to Be Persistent. In: WWW2016. http://arxiv.org/abs/1602.09102
31. @hvdsomp
A Disconcerting Observation
• When classifying links extracted from PMC as linking to articles, we
assumed that filtering on http://dx.doi.org/* would do the trick
• But we found a lot of publisher URLs, e.g. http://link.springer.com/article/*
• For example:
• http://link.springer.com/article/10.1007%2Fs00799-014-0108-0
• Instead of:
• http://dx.doi.org/10.1007/s00799-014-0108-0
• We used CrossRef’s Reverse Domain Lookup to classify these
extracted links as linking to articles
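The classification step described on this slide can be sketched roughly as follows. This is a minimal illustration, not the method used in the study: the set `ARTICLE_DOMAINS` is a hypothetical stand-in for the result of a publisher-domain lookup (the slides used a CrossRef service whose interface is not shown here), and both function names are made up for this example.

```python
from typing import Optional
from urllib.parse import unquote, urlsplit

# Host used by DOI-based links, as named on the slide.
DOI_HOSTS = {"dx.doi.org"}

# Hypothetical stand-in for a lookup of known publisher domains.
ARTICLE_DOMAINS = {"link.springer.com"}


def classify_link(url: str) -> str:
    """Return 'article' or 'web-at-large' for an extracted link."""
    host = (urlsplit(url).hostname or "").lower()
    if host in DOI_HOSTS or host in ARTICLE_DOMAINS:
        return "article"
    return "web-at-large"


def doi_from_publisher_url(url: str) -> Optional[str]:
    """Recover a DOI that is visibly embedded in a URL path, if any."""
    path = unquote(urlsplit(url).path)  # turns %2F back into '/'
    start = path.find("10.")
    return path[start:] if start != -1 else None
```

With these assumptions, both the DOI link and the Springer link from the slide are classified as article links, and the DOI can be recovered from the Springer URL because it is percent-encoded in the path.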
32. @hvdsomp
URI References - PMC
Herbert Van de Sompel, Martin Klein, and Shawn Jones (2016) Persistent URIs Must Be Used
to Be Persistent. In: WWW2016. http://arxiv.org/abs/1602.09102
33. @hvdsomp
Cartoon by Patrick Hochstenbach
http://signposting.org
<Intermezzo – Signposting the Scholarly Web>
34. @hvdsomp
Signposting the Scholarly Web
• Proposal:
Use typed links to address some long-standing problems regarding
scholarly resources on the web, by interlinking them using
appropriate relation types
• Focus on a limited set of patterns to support uniformly:
• Conveying a Persistent Identifier
• Expressing the web boundary of a scholarly resource
• Making bibliographic metadata discoverable
• Conveying an Author Identifier
• Conveying a license that applies to a resource
• Conveying a resource type
35. @hvdsomp
HTTP Links
Mark Nottingham (2017) RFC8288: Web Linking
http://tools.ietf.org/rfc/rfc8288.txt
38. @hvdsomp
HTTP Links Are Used
curl -I http://dbpedia.org/data/Reykjavik
HTTP/1.1 200 OK
Date: Thu, 27 Oct 2016 04:43:28 GMT
Content-Type: application/rdf+xml; charset=UTF-8
Content-Length: 1210
Link:
  <http://creativecommons.org/licenses/by-sa/3.0>; rel="license",
  <http://dbpedia.org/data/Reykjavik>; rel="alternate"; type="text/n3",
  <http://dbpedia.org/resource/Reykjavik>; rel="describes",
  <http://mementoarchive.lanl.gov/dbpedia/timegate/http://dbpedia.org/data/Reykjavik>; rel="timegate"
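A client can turn a Link header like the one above into typed links with a few lines of code. The following is a minimal sketch, assuming each link carries a single relation type and that quoted parameter values contain no escaped quotes; it is not a complete RFC 8288 parser, and `parse_link_header` is a name chosen for this example.

```python
import re
from typing import Dict, List

# One link: <target> followed by zero or more ;name=value parameters.
LINK_RE = re.compile(
    r'<([^>]*)>((?:\s*;\s*[\w*-]+\s*=\s*(?:"[^"]*"|[^;,\s]+))*)'
)
# One parameter: ;name="quoted value" or ;name=token
PARAM_RE = re.compile(r';\s*([\w*-]+)\s*=\s*(?:"([^"]*)"|([^;,\s]+))')


def parse_link_header(value: str) -> List[Dict[str, str]]:
    """Split a Link header value into dicts with 'target' plus parameters."""
    links = []
    for target, params in LINK_RE.findall(value):
        link = {"target": target}
        for name, quoted, bare in PARAM_RE.findall(params):
            link[name.lower()] = quoted or bare
        links.append(link)
    return links


HEADER = (
    '<http://creativecommons.org/licenses/by-sa/3.0>; rel="license", '
    '<http://dbpedia.org/data/Reykjavik>; rel="alternate"; type="text/n3", '
    '<http://dbpedia.org/resource/Reykjavik>; rel="describes", '
    '<http://mementoarchive.lanl.gov/dbpedia/timegate/'
    'http://dbpedia.org/data/Reykjavik>; rel="timegate"'
)

# Index the targets by relation type for easy lookup.
by_rel = {link["rel"]: link["target"] for link in parse_link_header(HEADER)}
```

Looking up `by_rel["license"]` or `by_rel["timegate"]` then yields the corresponding target URI from the response shown on the slide.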
39. @hvdsomp
For PIDs: Use cite-as Relation Type
Van de Sompel, H., Nelson, M., Bilder, G., Kunze, J., and Warner, S. (2017) “cite-as”: A Link Relation
to Convey a Preferred URI for Referencing. https://datatracker.ietf.org/doc/draft-vandesompel-citeas/
41. @hvdsomp
For PIDs: Use cite-as Relation Type
• The target URI (PID) of the cite-as link can be picked up by
applications, e.g.:
• reference managers can pick up the PID of an object when the
user saves it while on the landing page or on one of its
constituent resources
• publication pipelines can pick up the PID by looking up (HTTP
HEAD) URIs referenced in a paper to determine whether a PID
exists for them
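The publication-pipeline case can be sketched as an HTTP HEAD request followed by a search for a cite-as link in the response. This is an illustration under assumptions: the server is assumed to expose the link in an HTTP Link header, and the function names are invented for this example.

```python
import re
import urllib.request
from typing import Optional

# Matches <target> whose rel parameter contains the cite-as relation type.
CITE_AS_RE = re.compile(
    r'<([^>]*)>[^,<]*;\s*rel\s*=\s*"?[^",;]*\bcite-as\b', re.IGNORECASE
)


def cite_as_from_link_header(value: str) -> Optional[str]:
    """Return the target URI of the first cite-as link, or None."""
    match = CITE_AS_RE.search(value)
    return match.group(1) if match else None


def lookup_pid(url: str, timeout: float = 10.0) -> Optional[str]:
    """Issue an HTTP HEAD request and look for a cite-as link in the response."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        values = response.headers.get_all("Link") or []
    return cite_as_from_link_header(", ".join(values))
```

A pipeline could call `lookup_pid` for every URI referenced in a submitted paper and, where a PID is returned, suggest it as the URI to use in the reference.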
42. @hvdsomp
Cartoon by Patrick Hochstenbach
http://signposting.org
</Intermezzo – Signposting the Scholarly Web>
44. @hvdsomp
PID Alternative - When B Moves to C: HTTP Redirect from B to C
• Custodian of C needs to hold on to domain of B
• Custodian of C needs to establish redirection patterns; often those
are rather simple rules
• No problem with establishing links to PID(B); the URI in the browser
address bar (initially B, later C) is just fine
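A "rather simple rule" for redirecting from B to C might look like the sketch below. The hosts and path layout are hypothetical, chosen only to illustrate the idea of one pattern covering many moved resources.

```python
import re
from typing import Optional

# Hypothetical rule: papers on the old host B map to articles on the new host C.
REDIRECT_RULES = [
    (
        re.compile(r"^http://old\.example\.org/papers/(\d+)\.html$"),
        r"https://new.example.org/articles/\1",
    ),
]


def redirect_target(url: str) -> Optional[str]:
    """Return the new location for a moved URL, or None if no rule applies."""
    for pattern, template in REDIRECT_RULES:
        if pattern.match(url):
            return pattern.sub(template, url)
    return None
```

A server still answering for the domain of B would send the returned value in the Location header of an HTTP redirect response.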
47. @hvdsomp
Content Drift Occurs when B Changes over Time
• Is not really considered an issue because:
• the objects that receive PIDs were typically static, e.g. scientific
papers
• when a (substantially) new version of an object is published,
typically a new PID is assigned
• But:
• how to verify that the retrieved version of an object is indeed the
referenced version of the object?
• Requires:
• archiving objects in trusted archive(s)
• ability to retrieve objects from the archive(s)
48. @hvdsomp
Archived Articles
David Rosenthal (2013) Patio Perspectives at ANADP II: Preserving the Other Half
http://blog.dshr.org/2013/11/patio-perspectives-at-anadp-ii.html
Too few
Too low risk
49. @hvdsomp
How to Audit Whether a PID-identified Object is Archived
http://thekeepers.org
Journal, Volume, Issue centric
Global audit by DOI?
50. @hvdsomp
Contrast: All Web-Archived Versions of David’s Blog Post
Global audit by HTTP URI
Uses Memento infrastructure
http://timetravel.mementoweb.org
52. @hvdsomp
Scholarly Context Adrift
Shawn Jones, Herbert Van de Sompel, et al. (2016) Scholarly context adrift. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0167475
56. @hvdsomp
Text Similarity Measures
• Compute aggregate text similarity scores (values between 0 and 100)
for:
• Simhash
• Jaccard
• Sørensen-Dice
• Cosine
• If the aggregate score is 100, we decide that the Pre/Post
Mementos are representative
• We find 137K URI references out of 480K that have representative
Mementos
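Three of the four measures listed above can be sketched on word tokens as follows. This is a simplified illustration, not the study's implementation: Simhash is omitted, the tokenization is an assumption, and the aggregate is assumed here to be the mean of the individual scores, which equals 100 only when every measure equals 100.

```python
import math
import re
from collections import Counter
from typing import List


def tokens(text: str) -> List[str]:
    """Lowercased word tokens; an assumed, deliberately simple tokenization."""
    return re.findall(r"\w+", text.lower())


def jaccard(a: str, b: str) -> float:
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 100.0
    return 100.0 * len(sa & sb) / len(sa | sb)


def sorensen_dice(a: str, b: str) -> float:
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 100.0
    return 100.0 * 2 * len(sa & sb) / (len(sa) + len(sb))


def cosine(a: str, b: str) -> float:
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    if not ca and not cb:
        return 100.0
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(
        sum(v * v for v in ca.values()) * sum(v * v for v in cb.values())
    )
    return 100.0 * dot / norm if norm else 0.0


def aggregate(a: str, b: str) -> float:
    """Assumed aggregate: the mean of the individual 0-100 scores."""
    scores = [jaccard(a, b), sorensen_dice(a, b), cosine(a, b)]
    return sum(scores) / len(scores)


def is_representative(pre: str, post: str) -> bool:
    """True when the Pre and Post Memento texts score 100 on the aggregate."""
    return math.isclose(aggregate(pre, post), 100.0, abs_tol=1e-9)
```

Requiring a score of 100 is a strict test: under these measures the Pre and Post Mementos must contain the same words with the same frequencies.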
59. @hvdsomp
Content Drift - PMC
Shawn Jones, Herbert Van de Sompel, et al. (2016) Scholarly context adrift. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0167475
60. @hvdsomp
Reference Rot for Links to Web at Large is Severe
• Link Rot and Content Drift are severe
• Cannot retrieve originally linked content from the live web
• Can potentially retrieve originally linked content from web archives
• But the archival coverage is too poor, a result of incidental
archiving
61. @hvdsomp
URI References without Representative Mementos - PMC
Shawn Jones, Herbert Van de Sompel, et al. (2016) Scholarly context adrift. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0167475
62. @hvdsomp
Impact of Archival Gap on Links from Managed Collections
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0115253
Links from Managed Collections to Domains. Grey: Linked Content not Archived
65. @hvdsomp
Taking Snapshots of B: Automation is Key
• Web archive APIs for on-demand archiving
• perma.cc, Internet Archive, archive.is, webcitation
• Amber for WordPress & Drupal archives resources linked in a page
• http://amberlink.org/
• Hiberlink’s experimental Zotero extension archives bookmarked
URLs
• http://hiberlink.org/zotero.html
• Hiberlink’s experimental HiberActive archives all URLs referenced in
a newly submitted paper
• https://www.slideshare.net/martinklein0815/hiberactive
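Automated snapshotting of every URL referenced in a text can be sketched as below. This is not the HiberActive implementation. It assumes an archive that accepts on-demand capture requests of the form `https://web.archive.org/save/<URL>`; that endpoint pattern, its response behavior, and the function names are assumptions to verify against the archive's current documentation.

```python
import re
import urllib.error
import urllib.request
from typing import Dict, List

# Assumed on-demand archiving endpoint; check the archive's documentation.
SAVE_ENDPOINT = "https://web.archive.org/save/"

URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text: str) -> List[str]:
    """Find http(s) URLs in text, de-duplicated, in order of first appearance."""
    seen: Dict[str, None] = {}
    for match in URL_RE.findall(text):
        seen.setdefault(match.rstrip(".,;"), None)
    return list(seen)


def save_request_url(url: str) -> str:
    """Build the capture request URL for one referenced URL."""
    return SAVE_ENDPOINT + url


def archive_all(text: str, timeout: float = 60.0) -> Dict[str, int]:
    """Request a snapshot of every URL in text; map each URL to an HTTP status."""
    results = {}
    for url in extract_urls(text):
        try:
            with urllib.request.urlopen(save_request_url(url), timeout=timeout) as resp:
                results[url] = resp.status
        except urllib.error.HTTPError as error:
            results[url] = error.code
    return results
```

Running `archive_all` at submission time, rather than years later, is what closes the gap between when a link is created and when its target is first captured.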
67. @hvdsomp
Custodian of A Links to Snapshot of B
• Typical practice for linking to snapshots:
<a href="URL of snapshot of B">
• Problems with this practice:
o Impossible to visit the original URI, if desired
o Requires the permanent existence/uptime of the archive that
holds the snapshot
- One link rot problem replaced by another
http://robustlinks.mementoweb.org/about/
68. @hvdsomp
Permanent Existence/Uptime of Archives?
Capture of http://webcitation.org dated July 17 2013
https://archive.today/eAETp
69. @hvdsomp
Permanent Existence/Uptime of Archives?
Remnant of discontinued web archive http://mummify.it captured on February 14 2014
https://web.archive.org/web/20140214233752/https://www.mummify.it/
70. @hvdsomp
Permanent Existence/Uptime of Archives?
http://www.themoscowtimes.com/news/article/russia-bans-wayback-machine-internet-archive-over-islamic-state-video/510074.html
71. @hvdsomp
Permanent Existence/Uptime of Archives?
http://web.archive.org/web/20121101043952/http://vogin.nl on March 6 2017 at 15:59 CET
72. @hvdsomp
Custodian of A Links to Snapshot of B, Decorates the Link
• Desired practice for linking to captures is to decorate the link so it
provides a variety of options:
<a href="URL of snapshot of B"
data-originalurl="B"
data-versiondate="datetime of snapshot of B">
• Supports:
o Revisiting the original URL
o Finding snapshots in any web archive (via original URL)
o Finding a temporally appropriate snapshot in any web archive
(via original URL & snapshot datetime)
o Automatically accessing a temporally appropriate snapshot in
any web archive (Memento protocol using original URL &
snapshot datetime)
http://robustlinks.mementoweb.org/spec/
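The link decoration, and the Memento request it makes possible, can be sketched as follows. Assumptions in this example: the version date is written as an ISO 8601 date, the TimeGate URL pattern of the Time Travel service is `http://timetravel.mementoweb.org/timegate/<original URL>`, and the function names are illustrative. The Accept-Datetime request header is the one defined by the Memento protocol (RFC 7089).

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from html import escape
from typing import Dict, Tuple

# Assumed TimeGate base URL of the Memento Time Travel service.
TIMEGATE_BASE = "http://timetravel.mementoweb.org/timegate/"


def robust_link(snapshot_url: str, original_url: str,
                version_date: datetime, text: str) -> str:
    """Build an anchor that carries snapshot URL, original URL, and date."""
    return '<a href="{}" data-originalurl="{}" data-versiondate="{}">{}</a>'.format(
        escape(snapshot_url, quote=True),
        escape(original_url, quote=True),
        version_date.date().isoformat(),
        escape(text),
    )


def timegate_request(original_url: str,
                     version_date: datetime) -> Tuple[str, Dict[str, str]]:
    """Return the URL and headers for datetime negotiation with a TimeGate.

    version_date must be timezone-aware.
    """
    stamp = format_datetime(version_date.astimezone(timezone.utc), usegmt=True)
    return TIMEGATE_BASE + original_url, {"Accept-Datetime": stamp}


# Values taken from the mummify.it capture cited earlier in the slides.
captured = datetime(2014, 2, 14, 23, 37, 52, tzinfo=timezone.utc)
link = robust_link(
    "https://web.archive.org/web/20140214233752/https://www.mummify.it/",
    "https://www.mummify.it/",
    captured,
    "mummify.it",
)
url, headers = timegate_request("https://www.mummify.it/", captured)
```

Because the original URL and the date travel with the link, a client can still ask any Memento-compliant archive for a temporally appropriate snapshot even if the archive named in `href` disappears.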
73. @hvdsomp
Robust Links: Link Decoration in Action
See Robust Links at work in: Van de Sompel H. & Nelson, M.L. (2015)
Reminiscing about 15 years of interoperability efforts. D-Lib Magazine.
https://doi.org/10.1045/november2015-vandesompel
JavaScript makes the link decorations actionable
Robust Links JavaScript
https://github.com/mementoweb/robustlinks
75. @hvdsomp
Takeaways
• When it comes to links to
managed collections, the
custodian of the linking collection
relies on the custodians of the
linked collections to preserve link
integrity.
• PIDs, HTTP redirects are
managed by the custodian of
linked collections.
76. @hvdsomp
Takeaways
• When it comes to links to web at
large resources, the custodian of a
linking collection cannot rely on the
custodians of those linked
resources to maintain link integrity.
• Creation of Mementos, Robust
Links is managed by the custodian
of the collection that links to web at
large resources.
Editor's Notes
Previously, archival status (14-day window) as proxy