The document discusses the infrastructure for collaborating web archives. It describes Memento, which interconnects current and archived versions of web resources across distributed systems. An aggregator is proposed that provides timegates and timemaps to access versions from multiple archives. APIs and services are presented to allow applications to retrieve and reconstruct archived versions from the aggregator. While challenges exist in polling many archives efficiently, usage statistics show the time travel infrastructure sees millions of requests monthly.
Collaborating web archives - Herbert Van de Sompel
1. Herbert Van de Sompel
Een web van webarchieven, Hilversum, Nederland, 17 Nov 2016
Herbert Van de Sompel
LANL & DANS
@hvdsomp
http://mementoweb.org/about/
http://timetravel.mementoweb.org
Infrastructure for Collaborating Web Archives
2. Herbert Van de Sompel
Outline
• Having Many Web Archives is a Good Thing ™
• Web Archive Interoperability
• Memento
• Towards Increased Interoperability
• Infrastructure for Web Archive Collaboration
• Aggregator
• Aggregator Services
• Aggregator APIs
• If You Build It Will They Come?
3. Herbert Van de Sompel
Having Many Web Archives is a Good Thing ™
Capture of http://webcitation.org dated July 17 2013
https://archive.today/eAETp
4. Herbert Van de Sompel
Having Many Web Archives is a Good Thing ™
Remnant of discontinued web archive http://mummify.it captured on February 14 2014
https://web.archive.org/web/20140214233752/https://www.mummify.it/
5. Herbert Van de Sompel
Having Many Web Archives is a Good Thing ™
Capture of http://webcitation.org dated August 6 2014
6. Herbert Van de Sompel
Having Many Web Archives is a Good Thing ™
http://arstechnica.com/business/2013/11/fire-at-internet-archive-destroys-equipment-and-materials-but-data-safe/
7. Herbert Van de Sompel
Having Many Web Archives is a Good Thing ™
http://www.themoscowtimes.com/news/article/russia-bans-wayback-machine-internet-archive-over-islamic-state-video/510074.html
8. Herbert Van de Sompel
Having Many Web Archives is a Good Thing ™
http://www.independent.co.uk/news/uk/politics/tories-deleted-past-broken-promises-from-party-website-8937435.html
Speeches not accessible in IA; available in other web archives
9. Herbert Van de Sompel
Having Many Web Archives is a Good Thing ™
Captures of http://vk.com/strelkov_info
17 July 2014 15:22:22: http://web.archive.org/web/20140717152222/http://vk.com/strelkov_info
17 July 2014 17:06:51: https://archive.today/XFFAj
Claim of responsibility for downing what Strelkov thought to be a Ukrainian military transport plane, but was MH17, removed
10. Herbert Van de Sompel
But Even a Better Thing if They Collaborate
Julien Masanes's vision of a global grid of web archives:
Such a grid should link Web archives so that they together form
one global navigation space like the live Web itself. This is only
possible if they are structured in a way close enough to the original
Web and if they are openly accessible.
J. Masanes. Web Archiving. Springer-Verlag, 2006
12. Herbert Van de Sompel
2009
• Memento observation:
  • Web resources exist in the eternal now.
  • Prior versions of resources exist in web archives and resource versioning systems.
  • The current resource and its prior versions live disconnected lives.
  • How to interconnect current and prior versions of resources across distributed web servers, web archives, resource versioning systems?
Herbert Van de Sompel, Michael L. Nelson, and Robert Sanderson (2013) RFC7089 Memento
http://mementoweb.org/guide/rfc/
Memento Did Just That. And More.
13. Herbert Van de Sompel
Original Resource and Mementos
14. Herbert Van de Sompel
Bridge from Present to Past
16. Herbert Van de Sompel
Bridge from Past to Present
17. Herbert Van de Sompel
Memento: Access Versions via the Original URI and a Datetime
[Figure: screenshots labeled "Today", "Select Date", "Nov 17 2014", "Apr 1 2014", "archive.is"]
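The datetime negotiation named in this slide title can be sketched in Python. Under RFC 7089, a client asks a TimeGate for a version of the original URI by sending the desired datetime in an Accept-Datetime request header. The sketch below only builds the request pieces and sends nothing; the function name `build_timegate_request` is hypothetical, and the TimeGate URI prefix is an assumption based on the `/timegate/` interface of the Time Travel service.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumed TimeGate prefix; the deck itself only names the /timegate/ interface.
TIMEGATE_PREFIX = "http://timetravel.mementoweb.org/timegate/"


def build_timegate_request(original_uri: str, when: datetime) -> tuple:
    """Return (timegate_uri, headers) for a Memento datetime-negotiation request."""
    # Accept-Datetime uses the RFC 1123 date format and is always expressed in GMT.
    accept_datetime = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return TIMEGATE_PREFIX + original_uri, {"Accept-Datetime": accept_datetime}


uri, headers = build_timegate_request(
    "http://www.stanford.edu/", datetime(2014, 4, 1, tzinfo=timezone.utc)
)
print(uri)
print(headers)
```

A TimeGate that supports the protocol would typically answer such a request with a redirect to the Memento closest to the requested datetime.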
18. Herbert Van de Sompel
Memento for Chrome
http://bit.ly/memento-for-chrome
19. Herbert Van de Sompel
Tools for Server-Side Memento Support
• OpenWayback
• pywb
• Memento TimeGate server
• Bridge between a homegrown versioning API and the Memento protocol
• MediaWiki Memento extensions
• Linked Data Fragments server
Memento Tools
http://mementoweb.org/tools/
20. Herbert Van de Sompel
Can’t Please Everyone
An anonymous reviewer of our submission for WWW 2010:
Is there any statistics to show that many or a good number of Web
users would like to get obsolete data or resources?
Herbert Van de Sompel, Michael L. Nelson, et al. (2009) Memento: Time Travel for the Web
http://arxiv.org/abs/0911.1112
22. Herbert Van de Sompel
Raw Mementos
Shawn Jones (2016) Mementos in the Raw, Take Two
http://ws-dl.blogspot.nl/2016/08/2016-08-15-mementos-in-raw-take-two.html
24. Herbert Van de Sompel
Verifying Authenticity of Mementos
Ongoing research at Old Dominion University & LANL
26. Herbert Van de Sompel
Original Resource Provides TimeGate Link
• Resource Version Control Systems
• Servers with a dedicated web archive
• Servers with a preference for a specific web archive
27. Herbert Van de Sompel
Original Resource Provides No TimeGate Link – Client Intelligence
28. Herbert Van de Sompel
Memento Aggregator
29. Herbert Van de Sompel
LANL Memento Aggregator
• Official service of the LANL Research Library
• Currently covers 23 archives (web and linked data): archive.today, Archive-It, Bibliotheca Alexandrina Web Archive, DBpedia archive, DBpedia Triple Pattern Fragments archive, Canadian Government Web Archive, Croatian Web Archive, Estonian Web Archive, Icelandic web archive, Internet Archive, Library of Congress Web Archive, NARA Web Archive, National Library of Ireland Web Archive, perma.cc, Portuguese Web Archive, PRONI Web Archive, Slovenian Web Archive, Stanford Web Archive, UK Government Web Archive, UK Parliament's Web Archive, UK Web Archive, Web Archive Singapore, WebCite
• LANL Aggregator software not available, but see MemGator
Archives covered by LANL Memento Aggregator: http://mementoweb.org/depot/
MemGator: https://github.com/oduwsdl/memgator
30. Herbert Van de Sompel
Memento Aggregator Challenges
• Polling of many distributed archives:
  • Slow
  • Load on aggregator and archives
• Approaches:
  • Batch collecting and caching of archival coverage of popular URIs in all archives
  • Summarization of archives (based on CDX files and/or search)
  • Machine Learning of URI patterns for archives
Sawood Alam, Michael L. Nelson, et al. (2016) Web archive profiling through fulltext search
https://doi.org/10.1007/978-3-319-43997-6_10
Sawood Alam, Michael L. Nelson, et al. (2016) Web archive profiling through CDX summarization
https://doi.org/10.1007/s00799-016-0184-4
Nicholas Bornand, Herbert Van de Sompel, et al. (2016) Routing Memento Requests Using Binary Classifiers
https://doi.org/10.1145/2910896.2910899
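The polling problem on this slide can be illustrated with a small Python sketch. It is not the LANL Aggregator or MemGator implementation: the archives are hypothetical in-memory dictionaries, `lookup` stands in for a remote TimeMap request, and the names `ARCHIVES`, `lookup` and `closest_memento` are invented for illustration. It shows two of the ideas above in miniature: polling archives in parallel to limit latency, and caching lookups to limit load.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime
from functools import lru_cache

# Hypothetical stand-ins for remote archives: original URI -> capture datetimes.
ARCHIVES = {
    "archive-a": {"http://example.org/": (datetime(2013, 7, 17), datetime(2014, 8, 6))},
    "archive-b": {"http://example.org/": (datetime(2014, 2, 14),)},
    "archive-c": {},
}


@lru_cache(maxsize=1024)
def lookup(archive: str, uri: str) -> tuple:
    """Stand-in for one remote TimeMap request; repeated lookups hit the cache."""
    return tuple((archive, captured) for captured in ARCHIVES[archive].get(uri, ()))


def closest_memento(uri: str, target: datetime):
    """Poll all archives in parallel; return the (archive, datetime) nearest to target."""
    with ThreadPoolExecutor(max_workers=len(ARCHIVES)) as pool:
        per_archive = list(pool.map(lambda name: lookup(name, uri), ARCHIVES))
    candidates = [hit for hits in per_archive for hit in hits]
    if not candidates:
        return None  # no archive holds a capture of this URI
    return min(candidates, key=lambda hit: abs(hit[1] - target))


print(closest_memento("http://example.org/", datetime(2014, 1, 1)))
```

Parallel polling shortens the wait but still sends one request to every archive, including those with no holdings (`archive-c` here), which is the cost that archive summarization and URI-pattern classification aim to reduce.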
32. Herbert Van de Sompel
Basic Aggregator Services
• Exposes TimeGates and TimeMaps that reach across all web archives covered by the Aggregator
33. Herbert Van de Sompel
Time Travel Services
http://timetravel.mementoweb.org/
34. Herbert Van de Sompel
Time Travel Find
http://timetravel.mementoweb.org/list/20120428045424/http://www.stanford.edu/
36. Herbert Van de Sompel
Time Travel Reconstruct
http://timetravel.mementoweb.org/reconstruct/20120428045424/http://www.stanford.edu/
39. Herbert Van de Sompel
Time Travel APIs
http://timetravel.mementoweb.org/guide/api/
40. Herbert Van de Sompel
URI that Redirects to a Memento
http://timetravel.mementoweb.org/memento/20120428045424/http://www.stanford.edu/
41. Herbert Van de Sompel
URI that Redirects to a JSON Description of a Memento
http://timetravel.mementoweb.org/api/json/20100428103432/http://stanford.edu
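The Time Travel URIs shown on these slides share one pattern: a service path, a 14-digit datetime (YYYYMMDDhhmmss), and the original URI. A minimal Python sketch of that pattern follows; the function name `time_travel_uri` and the short service keys are hypothetical, while the paths are the ones that appear in the deck.

```python
from datetime import datetime

BASE = "http://timetravel.mementoweb.org"

# Service paths as they appear in the URIs on the slides.
DATED_SERVICES = {
    "list": "/list/",                # Time Travel Find
    "reconstruct": "/reconstruct/",  # Time Travel Reconstruct
    "memento": "/memento/",          # URI that redirects to a Memento
    "json": "/api/json/",            # JSON description of a Memento
}


def time_travel_uri(service: str, original_uri: str, when: datetime) -> str:
    """Build a Time Travel URI: service path + 14-digit datetime + original URI."""
    stamp = when.strftime("%Y%m%d%H%M%S")
    return f"{BASE}{DATED_SERVICES[service]}{stamp}/{original_uri}"


print(time_travel_uri("memento", "http://www.stanford.edu/", datetime(2012, 4, 28, 4, 54, 24)))
print(time_travel_uri("json", "http://stanford.edu", datetime(2010, 4, 28, 10, 34, 32)))
```

The two calls reproduce the URIs shown on the two API slides above.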
42. Herbert Van de Sompel
JSON Format for TimeMaps
http://mementoweb.org/guide/timemap-json/
43. Herbert Van de Sompel
DIY TimeMap - Index TimeMap Lists Potential TimeMap URIs
http://timetravel.mementoweb.org/timemap/json/http://stanford.edu
SPEED
44. Herbert Van de Sompel
WDI TimeMap – Index TimeMap with Full Coverage
http://labs.mementoweb.org/timemap/link/http://stanford.edu
COVERAGE
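A client that retrieves a TimeMap in link format, such as the one behind the `/timemap/link/` URI above, has to extract the Memento URIs and their datetimes. The Python sketch below shows one way to do that for the `application/link-format` serialization defined in RFC 7089. The sample TimeMap is hypothetical (the host `archive.example` is invented), the function name `parse_mementos` is an assumption, and the regular expressions cover only the simple quoted-attribute form used here, not every variant the format allows.

```python
import re
from email.utils import parsedate_to_datetime

# Hypothetical TimeMap body in application/link-format; not real archive data.
SAMPLE_TIMEMAP = """\
<http://stanford.edu>; rel="original",
<http://archive.example/20120428045424/http://stanford.edu>; rel="first memento"; datetime="Sat, 28 Apr 2012 04:54:24 GMT",
<http://archive.example/20140815000000/http://stanford.edu>; rel="last memento"; datetime="Fri, 15 Aug 2014 00:00:00 GMT"
"""

# One link: <uri> followed by zero or more ;name="value" attributes.
LINK = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM = re.compile(r'([\w-]+)="([^"]*)"')


def parse_mementos(timemap: str) -> list:
    """Return (datetime, uri) pairs for every link whose rel includes 'memento'."""
    mementos = []
    for uri, raw_params in LINK.findall(timemap):
        params = dict(PARAM.findall(raw_params))
        # rel is a space-separated list, e.g. "first memento".
        if "memento" in params.get("rel", "").split() and "datetime" in params:
            mementos.append((parsedate_to_datetime(params["datetime"]), uri))
    return sorted(mementos)


for captured, uri in parse_mementos(SAMPLE_TIMEMAP):
    print(captured.isoformat(), uri)
```

Matching attribute values as quoted strings matters because the datetime values themselves contain commas, so splitting the TimeMap on commas would break the entries apart.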
45. Herbert Van de Sompel
Time Travel Archive Registry
http://labs.mementoweb.org/aggregator_config/archivelist.xml
47. Herbert Van de Sompel
Time Travel Infrastructure Use, October 2016
TimeTravel Interface    Use
/api/                   1,404,985
/timegate/              54,007
/list/                  744,484
/memento/               1,563,278
48. Herbert Van de Sompel
oldweb.today
http://oldweb.today/nsmac4/20001115150435/http://www.stanford.edu
49. Herbert Van de Sompel
arquivo.pt
http://arquivo.pt/wayback/20120127040929/http://stanford.edu/
Link to Reconstruct
50. Herbert Van de Sompel
TimeTravel Reconstruct
http://timetravel.mementoweb.org/reconstruct/20120127040929/http://stanford.edu/
51. Herbert Van de Sompel
British Library Memento Service
http://www.webarchive.org.uk/mementos/search/http://www.stanford.edu
52. Herbert Van de Sompel
#icanhazmemento
http://ws-dl.blogspot.nl/2015/07/2015-07-22-i-can-haz-memento.html
53. Herbert Van de Sompel
#icanhazmemento
http://timetravel.mementoweb.org/list/20161116101831/http://signposting.org/adopters
54. Herbert Van de Sompel
Robust Links
• Decorate links to allow retrieving Mementos subject to link date or from a specific archive
• In combination with the Time Travel API, this yields links - provided client or server side - that circumvent link rot and content drift
Robust Links Specification
http://robustlinks.mementoweb.org/spec/
DO:
<a href="http://archive.is/FAy6o" data-originalurl="http://www.stanford.edu" data-versiondate="2014-08-15">
DO:
<a href="http://www.stanford.edu" data-versiondate="2014-08-15">
DON'T:
<a href="http://archive.is/FAy6o">
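How link decoration combines with the Time Travel API can be sketched in Python. This is not robustify.js or robustlinks.js, only an illustration: it scans HTML for anchors that carry `data-versiondate` and derives a Time Travel `/memento/` URI for each, using `data-originalurl` when present and `href` otherwise. The names `RobustLinkParser` and `memento_fallbacks` are hypothetical, and the sketch assumes the version date is written as YYYY-MM-DD and pads it to the 14-digit form used elsewhere in this deck.

```python
from html.parser import HTMLParser

# Assumed fallback prefix: the /memento/ redirect URI pattern of the Time Travel service.
MEMENTO_PREFIX = "http://timetravel.mementoweb.org/memento/"


class RobustLinkParser(HTMLParser):
    """Collect a Time Travel URI for every anchor decorated with data-versiondate."""

    def __init__(self):
        super().__init__()
        self.fallbacks = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        version_date = attributes.get("data-versiondate")
        # data-originalurl wins over href, which may already point into an archive.
        original = attributes.get("data-originalurl") or attributes.get("href")
        if tag != "a" or not version_date or not original:
            return  # undecorated link: no date, so no datetime to aim for
        # Assumes YYYY-MM-DD; pad to YYYYMMDDhhmmss with midnight.
        stamp = version_date.replace("-", "").ljust(14, "0")
        self.fallbacks.append(f"{MEMENTO_PREFIX}{stamp}/{original}")


def memento_fallbacks(html: str) -> list:
    parser = RobustLinkParser()
    parser.feed(html)
    parser.close()
    return parser.fallbacks


PAGE = (
    '<a href="http://archive.is/FAy6o" data-originalurl="http://www.stanford.edu" '
    'data-versiondate="2014-08-15">snapshot</a>'
    '<a href="http://www.stanford.edu" data-versiondate="2014-08-15">original</a>'
    '<a href="http://archive.is/FAy6o">undecorated</a>'
)
print(memento_fallbacks(PAGE))
```

The undecorated third link yields nothing, which is the point of the DON'T case: without the original URI and a date, a client has no way to look for the resource in another archive if that one snapshot disappears.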
55. Herbert Van de Sompel
Robust Links – robustify.js
Rene Voorburg (2014) robustify.js
https://github.com/renevoorburg/robustify.js
56. Herbert Van de Sompel
Robust Links – robustlinks.js
Herbert Van de Sompel and Michael L. Nelson (2015) Reminiscing about 15 years of interoperability efforts.
https://dx.doi.org/10.1045/november2015-vandesompel