The Memento Protocol and
Research Issues With Web Archiving
Michael L. Nelson
Old Dominion University
Web Science & Digital Libraries Research Group
www.cs.odu.edu/~mln/
@phonedude_mln
With:
Los Alamos National Laboratory: Herbert Van de Sompel
ODU: Michele C. Weigle, Hany SalahEldeen, Matthias Prellwitz, Justin Brunelle,
Mat Kelly, Ahmed AlSum, Scott Ainsworth
University of Virginia Colloquium
2016-09-12
http://web.archive.org/web/*/http://www.library.virginia.edu/
also: http://whatdiditlooklike.mementoweb.org/tagged/library.virginia.edu
Memento wants to make it easy
to access the Web of the Past.
Memento achieves this by technically integrating
the present Web and the past Web, by introducing
a uniform version access capability for the Web.
Content Management Systems:
• Designed to be aware of all
versions of a resource;
• Self-contained;
• Variety of proprietary version
mechanisms;
• Versions interlinked using
proprietary mechanisms.
World Wide Web:
• Designed to forget about prior
versions of a resource;
• Distributed.
There are resource versions on
the Web:
• Content Management
Systems;
• Web Archives;
• Transactional archives;
• Search engine caches.
But the Web architecture has no
way to deal with them:
• Cannot talk about a resource
as it used to exist;
• Cannot access a prior version
knowing the current one;
• Cannot access the current
version knowing a prior one;
Current approaches are ad hoc
and localized.
Memento:
• Looks at the Web as a
Content Management
System;
• Introduces the uniform
capability to access versions
on the Web;
• Does not build new archives
but leverages all systems that
host versions: Web archives,
Content Management
Systems, Software Version
Systems, etc.
Memento’s version access
approach:
• Is distributed: versions may
exist on several servers;
• Uses datetime as a global
version indicator;
• Is based on the primitives of
the Web: resource, resource
state, representation, content
negotiation, link.
Since Memento’s access approach is distributed,
and is based on Web primitives, it scales like the Web.
Memento’s core components:
• Ability to speak about a
resource as it existed in the
past;
• A bridge between present
and past: link and content
negotiation;
• A bridge between past and
present: link.
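As a rough sketch of the two bridges: RFC 7089 (linked later in these slides) has the client send an Accept-Datetime request header for datetime negotiation, and has servers advertise related resources in Link headers with relation types such as original, timegate, timemap and memento. The Python below only formats and parses header values; the function names and the sample header (including the archive.example host) are illustrative assumptions, and the parser is a simplification of the full Link header syntax.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime

def accept_datetime(dt):
    """Format an aware datetime as an Accept-Datetime header value (RFC 1123, GMT)."""
    # Note: a naive datetime would be interpreted as local time by astimezone().
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)

# One link-value: <uri> followed by ;name="value" or ;name=value parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^;,\s]*))*)')
PARAM_RE = re.compile(r'([\w-]+)\s*=\s*(?:"([^"]*)"|([^;,\s]*))')

def parse_link_header(value):
    """Return {rel: [(target_uri, datetime_or_None), ...]} for a Link header value."""
    links = {}
    for uri, params in LINK_RE.findall(value):
        attrs = {name.lower(): (quoted or bare)
                 for name, quoted, bare in PARAM_RE.findall(params)}
        # A single link may carry several space-separated relation types.
        for rel in attrs.get("rel", "").split():
            links.setdefault(rel, []).append((uri, attrs.get("datetime")))
    return links

# Hypothetical header, shaped like what a memento might return.
sample = ('<http://www.library.virginia.edu/>; rel="original", '
          '<http://archive.example/web/20060912144251/http://www.library.virginia.edu/>; '
          'rel="memento"; datetime="Tue, 12 Sep 2006 14:42:51 GMT"')

print(accept_datetime(datetime(2006, 9, 12, 14, 42, 51, tzinfo=timezone.utc)))
print(parse_link_header(sample))
```

The "original" link is the bridge from past to present; a "timegate" link plus Accept-Datetime is the bridge from present to past.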
original resource and versions
bridge from present to past
bridge from past to present
Memento Framework
original resource gone
original resource’s server gone
original resource provides no link
Integrating Multiple Archives
more info: http://mementoweb.org/
https://tools.ietf.org/html/rfc7089
Memento wants to make it easy
to access the Web of the Past…
http://bit.ly/memento-for-chrome
http://timetravel.mementoweb.org/list/20060912144251/http://www.library.virginia.edu/
www.library.virginia.edu
in 4 different web archives
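The URIs shown in these slides embed a 14-digit datetime (YYYYMMDDhhmmss) followed by the original URI. A minimal sketch of building them, assuming only the URI patterns visible in this deck; the function names are made up for illustration and no request is sent.

```python
from datetime import datetime

def timestamp14(dt):
    """Render a datetime as the 14-digit YYYYMMDDhhmmss form used in the URIs."""
    return dt.strftime("%Y%m%d%H%M%S")

def timetravel_list_uri(original_uri, dt):
    """Time Travel 'list' URI, following the pattern shown on the slide."""
    return f"http://timetravel.mementoweb.org/list/{timestamp14(dt)}/{original_uri}"

def wayback_uri(original_uri, dt):
    """Internet Archive Wayback URI, following the pattern used in this deck."""
    return f"http://web.archive.org/web/{timestamp14(dt)}/{original_uri}"

when = datetime(2006, 9, 12, 14, 42, 51)
print(timetravel_list_uri("http://www.library.virginia.edu/", when))
print(wayback_uri("http://www.library.virginia.edu/", when))
```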
Long Tail of Archives
Archive.is
Using Only Top-k Archives
for URI Lookup Yields Good Results
Even when there are 100s of archives, we only need to talk to a few.
see: http://www.cs.odu.edu/~mln/pubs/tpdl-2013/paper_134.pdf
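One way to picture the top-k idea: rank archives by how often they held a memento in past lookups, then query only the first k. This is an illustrative sketch, not the method of the cited paper; the archive names and the lookup log are hypothetical.

```python
from collections import Counter

def top_k_archives(lookup_log, k):
    """Return the k archives with the most past hits.

    lookup_log is an iterable of (archive_name, had_memento) observations.
    """
    hits = Counter(name for name, had_memento in lookup_log if had_memento)
    return [name for name, _count in hits.most_common(k)]

# Hypothetical history of earlier URI lookups.
log = [
    ("archive-a", True), ("archive-b", False), ("archive-a", True),
    ("archive-c", True), ("archive-b", True), ("archive-a", True),
    ("archive-c", True),
]
print(top_k_archives(log, 2))
```

With a long tail of archives, most hits come from a handful of them, so a small k keeps lookup cost low at the price of occasionally missing a memento held only by a small archive.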
OK, so
More Archives == More Better
But why care about archiving at all?
Why Care About The Past?
From an anonymous WWW 2010 reviewer about our
Memento paper (emphasis mine):
"Is there any statistics to show that many or a good number of Web
users would like to get obsolete data or resources? "
one answer: replay of contemporary pages >> summary pages
http://www.slideshare.net/phonedude/why-careaboutthepast
http://www.nytimes.com/2013/06/19/books/seven-american-deaths-and-disasters-transcribes-the-news.html
A YouTube video of a TV show where celebrities
read “mean” Tweets about themselves
Our social discourse is dominated by the web. Q.E.D.
https://www.youtube.com/watch?v=LABGimhsEys
Our scholarly record is in jeopardy…
http://dx.doi.org/10.1371/journal.pone.0115253
See also: http://blog.dshr.org/2015/02/the-evanescent-web.html
As is our legal record…
http://www.nytimes.com/2013/09/24/us/politics/in-supreme-court-opinions-clicks-that-lead-nowhere.html
See also: http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/
http://ssnat.com/
And our popular culture as well…
http://f-measure.blogspot.com/2009/02/pink-floyd-hour-with-pink-floyd-kqed-lp.html
Half-Life of Popular Music YouTube Videos
[Figure: two plots over three datasets (Top 40 US Singles Charts, Music Blogs @ blogspot.com, The 500 Greatest Songs).
Plot 1: half life, linear regression, x-axis in months (0 to 18).
Plot 2: median absolute deviation, x-axis in weeks (0 to 10).]
Matthias Prellwitz, Michael L. Nelson,
Music Video Redundancy and Half-Life in YouTube,
Proceedings of TPDL 2011
http://www.cs.odu.edu/~mln/pubs/tpdl-2011/tpdl-2011-prellwitz.pdf
Individual URLs die, but new versions arise
So we won’t lose every copy of “Shake It Off”…
What about the grist of history?
http://www.bbc.com/future/story/20120927-the-decaying-web
On January 28 2011, three days into the fierce protests that would
eventually oust the Egyptian president Hosni Mubarak, a Twitter
user called Farrah posted a link to a picture that supposedly showed
an armed man as he ran on a “rooftop during clashes between police
and protesters in Suez”. I say supposedly, because both the tweet
and the picture it linked to no longer exist. Instead they have
been replaced with error messages that claim the message – and its
contents – “doesn’t exist”.
Missing Tweet & Pic
https://twitter.com/Farrah3m/status/31727870736859137 http://twitpic.com/3uvo6z
http://ws-dl.blogspot.com/2013/05/2013-05-07-who-is-archiving-your-tweets.html
In May 2013, both are “archived” by topsy.com
In February 2015, they’re completely missing.
http://topsy.com/http://twitpic.com/3uvo6z
In 2016, redirecting…
http://topsy.com/http://twitpic.com/3uvo6z
…to a random (?) apple.com page
http://topsy.com/http://twitpic.com/3uvo6z
No Server == No HTTP Event == Nothing to Archive
http://topsy.com/http://twitpic.com/3uvo6z
Hany M. SalahEldeen, Michael L. Nelson, Losing My Revolution: How Many Resources Shared on Social Media Have Been
Lost?, Proceedings of TPDL 2012. http://arxiv.org/abs/1209.3026
Hany SalahEldeen, Michael L. Nelson, Resurrecting My Revolution: Using Social Link Neighborhood in Bringing Context to the
Disappearing Web, Proceedings of TPDL 2013. http://arxiv.org/abs/1309.2648
Missing: 11% year 1, 7%/year afterwards
Archived: 7% year 1, 15%/year afterwards
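The two rates above can be turned into back-of-the-envelope arithmetic. The sketch below reads them as a piecewise-linear model (a first-year rate, then a constant yearly rate, capped at 100%); that reading, and the linear interpolation inside the first year, are assumptions made for illustration. The cited papers give the fitted model.

```python
def _piecewise(years, first_year, per_year_after):
    """First-year rate, then a constant yearly rate, clamped to [0, 1]."""
    if years <= 0:
        return 0.0
    if years < 1:
        return first_year * years  # assumed: linear within the first year
    return min(1.0, first_year + per_year_after * (years - 1))

def fraction_missing(years):
    """Estimated share of shared resources missing after `years` years."""
    return _piecewise(years, 0.11, 0.07)

def fraction_archived(years):
    """Estimated share of shared resources archived after `years` years."""
    return _piecewise(years, 0.07, 0.15)

for n in (1, 2, 3):
    print(n, round(fraction_missing(n), 2), round(fraction_archived(n), 2))
```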
Why we need multiple,
independent archives…
A single archive is vulnerable
http://www.bbc.com/news/uk-politics-24924185
http://ws-dl.blogspot.com/2013/11/2013-11-21-conservative-party-speeches.html
Houston, Tranquility Base Here. The Eagle has landed.
see also: http://ws-dl.blogspot.com/2013/03/2013-03-22-ntrs-web-archives-and-why-we.html
http://ws-dl.blogspot.com/2013/06/2013-06-18-ntrs-memento-and-handles.html
http://www.theguardian.com/technology/2015/feb/19/google-acknowledges-some-people-want-right-to-be-forgotten
$ curl -I "http://www.thedailybeast.com/articles/2016/08/11/i-got-three-grindr-dates-in-an-hour-in-the-olympic-village.html"
HTTP/1.1 301 Moved Permanently
Access-Control-Allow-Origin: *
Age: 0
Cache-Control: max-age=60
Content-Type: text/html; charset=iso-8859-1
Date: Thu, 18 Aug 2016 01:13:46 GMT
Location: http://www.thedailybeast.com/articles/2016/08/11/a-note-from-the-editors.html
RealAge: 0
Server: Apache
Vary: Accept-Encoding, User-Agent
Via: 1.1 varnish
X-BackEnd: default
X-Cache: MISS
X-Cacheable: YES
X-Restarts: 0
X-UA-Device: pc
X-Varnish: 995407903
Connection: keep-alive
http://www.usnews.com/news/articles/2016-08-17/wayback-machine-wont-censor-archive-for-taste-director-says-after-olympics-article-scrubbed
But who pays for those extra archives?
1TB endowment = ~$4700: http://blog.dshr.org/2011/02/paying-for-long-term-storage.html
see also: http://blog.dshr.org/2011/01/memento-marketplace-for-archiving.html
Archives aren’t magic web sites
They’re just web sites.
If you used Mummify, you’re now left with a bunch of defunct, shortened links like:
https://mummify.it/XbmcMfE3
Don’t Throw Away the Original URL –
Use Robust Links!
<a href="http://www.w3.org/"
data-versionurl="https://archive.today/r7cov"
data-versiondate="2015-01-21">
my robust link to the live web</a>
<a href="https://archive.today/r7cov"
data-originalurl="http://www.w3.org/"
data-versiondate="2015-01-21">
my robust link to an archived version</a>
<!DOCTYPE html>
<html lang="en" itemscope itemtype="http://schema.org/WebPage"
itemid="http://robustlinks.mementoweb.org/spec/">
<head>
<meta charset="utf-8" />
<meta itemprop="dateModified" content="2015-02-02">
<meta itemprop="datePublished" content="2015-01-23">
<title>Page Level Metadata Is The Least You Can Do</title>
More examples / scenarios at: http://robustlinks.mementoweb.org/spec/
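A small sketch of generating the two anchor forms shown above. The attribute names (data-versionurl, data-originalurl, data-versiondate) come from the slide; the function name and its link_to parameter are hypothetical.

```python
from html import escape

def robust_link(original_url, version_url, version_date, text, link_to="original"):
    """Build an <a> element that keeps both the original and the archived URL.

    link_to="original" puts the live URL in href; link_to="version" puts the
    archived URL in href. The other URL goes in a data- attribute.
    """
    if link_to == "original":
        href, other_name, other = original_url, "data-versionurl", version_url
    elif link_to == "version":
        href, other_name, other = version_url, "data-originalurl", original_url
    else:
        raise ValueError("link_to must be 'original' or 'version'")
    return (f'<a href="{escape(href)}" {other_name}="{escape(other)}" '
            f'data-versiondate="{escape(version_date)}">{escape(text)}</a>')

print(robust_link("http://www.w3.org/", "https://archive.today/r7cov",
                  "2015-01-21", "my robust link to the live web"))
```

Either way the original URL survives in the markup, so the link can still be resolved through other archives if the one named in it goes away.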
Economics Working Against Archives
“In the paper world in order to monetize their content the
copyright owner had to maximize the number of copies
of it. In the Web world, in order to monetize their content
the copyright owner has to minimize the number of copies.
Thus the fundamental economic motivation for Web
content militates against its preservation in the ways
that Herbert and I would like.”
--David Rosenthal
http://blog.dshr.org/2015/02/the-evanescent-web.html
“We’ll use the cloud!”
https://www.chriswatterston.com/blog/my-there-no-cloud-sticker
"...when all costs are taken in to account,
cloud storage is not cheaper for long-term preservation than doing it yourself
once you get to a reasonable scale."
http://blog.dshr.org/2014/11/talk-costs-why-do-we-care.html
Historicity of Web Archives
Malaysia Airlines Flight 17 (MH17)
http://web.archive.org/web/20140717152222/http://vk.com/strelkov_info
http://www.csmonitor.com/World/Europe/2014/0717/Web-evidence-points-to-pro-Russia-rebels-in-downing-of-MH17-video
http://www.newyorker.com/magazine/2015/01/26/cobweb
(not really archived as well as you think)
Ed and I Discuss Who Has What…
https://twitter.com/phonedude_mln/status/490171976389238784
Remember MH17?
https://twitter.com/phonedude_mln/status/490171976389238784
Alex is now 404.
Would multiple archives have convinced him?
https://twitter.com/quicknquiet
Do we really have
“a perfect tool to produce `evidence’ of any kind”?
@gary4205 mansplains to @AstroKatie
https://twitter.com/AstroKatie/status/765344020184739840
see also: http://www.someecards.com/news/so-that-happened/mansplain-astronaught-jessica-meir-twitter/
But can you prove he didn’t say this?
Or that she didn’t say this?
(remember: black hats can use tools created by white hats)
Assessing the Quality of Web Archiving
"Hooray! It's in the archive!"
(current)
vs.
"How well was it archived?"
(the question we should be asking)
Temporal Drift
August 27, 2005
11:16 a.m. EDT
link
Temporal Drift: Now 3 Hours in the Past
August 27, 2005
11:16 a.m. EDT
link
August 27, 2005
8:00 a.m. EDT
link
Temporal Drift: Now 17 Days in the Future
August 27, 2005
11:16 a.m. EDT
link
August 27, 2005
8:00 a.m. EDT
link
September 13, 2005
8:12 a.m. EDT
link
Temporal Drift: Now 23 (or 6) Days in the Future
August 27, 2005
11:16 a.m. EDT
link
August 27, 2005
8:00 a.m. EDT
link
September 13, 2005
8:12 a.m. EDT
link
September 19, 2005
8:25 a.m. EDT
link
10+ clicks in the archive results in median drift of ~45 days (standard UI)
or ~15 days with Memento. ~2% of the sessions have drift of > 1 year.
see: http://www.cs.odu.edu/~mln/pubs/jcdl-2013/jcdl93-ainsworth.pdf
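The drift in the walk-through above is just the difference between the datetime the user asked for and the datetime of the memento actually reached at each click. A minimal sketch using the four datetimes from these slides (all EDT, so naive datetimes are enough here); the variable and function names are illustrative.

```python
from datetime import datetime

target = datetime(2005, 8, 27, 11, 16)   # datetime of the first memento
clicks = [
    datetime(2005, 8, 27, 8, 0),         # after 1 click
    datetime(2005, 9, 13, 8, 12),        # after 2 clicks
    datetime(2005, 9, 19, 8, 25),        # after 3 clicks
]

def drift_from_target(target, memento_datetimes):
    """Signed drift of each memento relative to the original target datetime."""
    return [m - target for m in memento_datetimes]

def drift_per_step(memento_datetimes):
    """Drift of each memento relative to the one visited just before it."""
    return [b - a for a, b in zip(memento_datetimes, memento_datetimes[1:])]

for d in drift_from_target(target, clicks):
    print(round(d.total_seconds() / 86400, 2), "days")
print(drift_per_step(clicks))
```

Measuring against the original target gives roughly 3 hours in the past, then about 17 and 23 days in the future; measuring the last step against the previous memento gives the "(or 6)" days in the slide title.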
Sometimes the Live Web
"Leaks" Into the Archive…
see: http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html
Sept 3, 2008
2012
Not All Mementos Are Created Equal:
Measuring The Impact Of Missing Resources
JCDL 2014
http://www.cs.odu.edu/~mln/pubs/jcdl-2014/jcdl-2014-brunelle-damage.pdf
live web: M = 0.17, D = 0.09
missing main: M = 0.24, D = 0.41
missing logo + navigation: M = 0.29, D = 0.36
Synthetic Damage:
Removing Images From xkcd.com
damage (D) differs from % missing (M)!
Was the missing resource important?
<img> and <embed> can leave hints about size and centrality.
For CSS, we look at the distribution of background color in the page, divided into vertical thirds.
Weights from Turker Assessment of Damage
first: establish that Turkers can determine damaged vs. undamaged pages (81% of the time)
second: find weights that match Turkers' rankings of (real) differently damaged versions of the same page
Good News:
Although %Missing (M) is steady/increasing,
weighted Damage (D) is decreasing
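To see why damage (D) can differ from percent missing (M), here is a toy calculation. The per-resource weights are hypothetical stand-ins for importance; in the JCDL 2014 paper the weights come from size, centrality and the Turker assessment, none of which is reproduced here.

```python
def percent_missing(resources):
    """M: share of embedded resources that are missing, unweighted."""
    return sum(1 for r in resources if r["missing"]) / len(resources)

def weighted_damage(resources):
    """D: share of total importance weight carried by the missing resources."""
    total = sum(r["weight"] for r in resources)
    lost = sum(r["weight"] for r in resources if r["missing"])
    return lost / total

# Hypothetical page: one large central image and three small icons.
missing_main = [
    {"name": "comic.png", "weight": 70, "missing": True},
    {"name": "icon1.png", "weight": 10, "missing": False},
    {"name": "icon2.png", "weight": 10, "missing": False},
    {"name": "icon3.png", "weight": 10, "missing": False},
]
missing_icons = [
    {"name": "comic.png", "weight": 70, "missing": False},
    {"name": "icon1.png", "weight": 10, "missing": True},
    {"name": "icon2.png", "weight": 10, "missing": True},
    {"name": "icon3.png", "weight": 10, "missing": False},
]
print(percent_missing(missing_main), weighted_damage(missing_main))
print(percent_missing(missing_icons), weighted_damage(missing_icons))
```

In this toy case, losing the one central image gives a lower M but a much higher D than losing two icons.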
A Framework for Evaluation of
Composite Memento Temporal Coherence
Hypertext 2015
http://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html
As Presented by IA
http://web.archive.org/web/20041209190926/http://www.wunderground.org/cgi-bin/findWeather/getForecast?query=50593 (now 404, but that's a different story…)
Not Everything Is 20041209190926
+ 9 months
1 in 20 pages complete; 1 in 5 have violations
Description                   Closest   Closest   Bracket   Bracket
                              Single    Multi-    Single    Multi-
                              Archive   Archive   Archive   Archive
Completeness
  Mean complete               76.1%     80.2%     76.2%     80.3%
  Mean missing                23.9%     19.8%     23.8%     19.7%
Temporal Coherence
  Mean prima facie coherent   41.0%     40.9%     54.7%     54.6%
  Mean possibly coherent      27.3%     27.3%     12.8%     14.2%
  Mean probably violative     2.5%      5.3%      2.5%      5.3%
  Mean prima facie violative  5.3%      5.3%      6.2%      6.2%
At least 5% of pages can be shown to be temporal violations
http://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html
Closing Observations
Wrong Metaphor for Web Archives
Web Archives Are Not Destinations
This is a destination. This is not a destination.
Memento is about linking the past and present web
Possible Metaphor for Viewing Past & Present?
Turn Archiving Into A Social Activity…
see also: http://xkcd.com/1034/,
Marshall & Shipman, JCDL 2011
…But Don't Use the "A" Word
Ed: Are there any zombies out there?
Shaun: Don't say that!
Ed: What?
Shaun: That.
Ed: What?
Shaun: That. The Z word. Don't say it.
Ed: Why not?
Shaun: Because it's ridiculous!
— Shaun of the Dead
Pinterest: Anonymous Mementos
http://media-cache-ec3.pinterest.com/upload/47639708527755289_AhxhItiQ_c.jpg
is a memento of:
http://3.bp.blogspot.com/_d0vByWRfhvU/S_Ygk_oX4xI/AAAAAAAACCQ/LXgC3S0KYEo/s400/_MG_8091.jpg
but there is no machine-readable indication of this relationship
repins are by-reference
When all else fails, justify project with:
“web archiving is Big Data”
Backup Slides
Archiving your internal stuff:
Transactional Archiving
https://mementoweb.github.io/SiteStory/
Never miss an update;
archive your site as it is
being viewed by users.
Archiving your internal stuff:
Heritrix & Wayback
Crawling your intranet: http://www.dlib.org/dlib/january16/brunelle/01brunelle.html
Crawling JS “stuff” will take 5X more storage: http://arxiv.org/abs/1601.05142
mementos of Mitre Intranet “MiiTube” – Complete With Javascript leakage
JavaScript == the new deep web;
use ResourceSync to make sure your URIs are exposed
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="up"
         href="http://example.com/dataset1/capabilitylist.xml"/>
  <rs:md capability="resourcelist"
         at="2013-01-03T09:00:00Z"
         completed="2013-01-03T09:01:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
           length="8876"
           type="text/html"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T14:00:00Z</lastmod>
    <rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e sha-256:854f61290e2e197a11bc91063afce22e43f8ccc655237050ace766adc68dc784"
           length="14599"
           type="application/pdf"/>
  </url>
</urlset>
(AKA “Fancy SiteMaps”)
http://www.openarchives.org/rs/
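Because a ResourceSync resource list is a Sitemap with extra rs: elements, a crawler can read it with any XML parser. A minimal sketch using Python's standard library, assuming a document shaped like the example above; the function name and the returned dictionary layout are illustrative choices.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

def list_resources(xml_text):
    """Return one dict per <url> entry: loc, lastmod, hashes, length."""
    root = ET.fromstring(xml_text)
    resources = []
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)
        has_md = md is not None  # Element truthiness depends on children, so test for None
        length = md.get("length") if has_md else None
        resources.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            # The hash attribute is a whitespace-separated list of digests.
            "hashes": md.get("hash", "").split() if has_md else [],
            "length": int(length) if length else None,
        })
    return resources

sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" length="8876" type="text/html"/>
  </url>
</urlset>"""

for resource in list_resources(sample):
    print(resource["loc"], resource["lastmod"], resource["length"])
```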
timetravel.mementoweb.org
http://timetravel.mementoweb.org/list/20140525002314/http://www.bbc.co.uk/
e.g., bbc.co.uk in six different archives…
Segal’s Law
A man with a watch knows what time it is.
A man with two watches is never sure.
How to resolve conflicting archives?
Personalization, GeoIP, mobile vs. desktop, etc.
means “the” page rarely exists, only “a” page.
Mat Kelly, Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson,
A Method for Identifying Personalized Representations in Web Archives,
D-Lib Magazine, 19(11/12), 2013.
http://www.dlib.org/dlib/november13/kelly/11kelly.html
Thoughtful analysis: http://blog.dshr.org/2015/02/vint-cerfs-talk-at-aaas.html
Snarky analysis: http://ws-dl.blogspot.com/2015/02/2015-02-17-reactions-to-vint-cerfs.html
Why Care About The Past?
From an anonymous WWW 2010 reviewer about our
Memento paper (emphasis mine):
"Is there any statistics to show that many or a good number of Web
users would like to get obsolete data or resources? "
one answer: replay of contemporary pages >> summary pages
http://www.slideshare.net/phonedude/why-careaboutthepast
http://www.nytimes.com/2013/06/19/books/seven-american-deaths-and-disasters-transcribes-the-news.html
vs.
Archiving Moves At Hurricane Speed,
Most News Stories Move Faster
Most of the Story,
at Least as Conveyed by cnn.com,
is Missing…
in this case, you can reconstruct the events with
http://en.wikipedia.org/wiki/Virginia_Tech_massacre_timeline
How Much of The Web Is Archived?
Public Archives, ca. Late 2010 / Early 2011
Three categories of archives
• Internet Archive
• Search engine
• Other archives
UK US
See also: http://arxiv.org/abs/1212.6177
1000 URIs Ordered by First Observation Date
See also: http://ws-dl.blogspot.com/2011/06/2011-06-23-how-much-of-web-is-archived.html
see also: http://ws-dl.blogspot.com/2013/04/2013-04-19-carbon-dating-web.html
How Much of the Web is Archived?
It Depends on Which Web…
Including SE cache   Excluding SE cache
90%                  79%
97%                  68%
35%                  16%
88%                  19%
Changes since 2011: no more free SE APIs;
greatly reduced IA quarantine period; 15 public web archives
2013: 95%, 92%, 23%, 26%
Quis Archiviet Ipsos Archives?
(thanks to webmaster@archive.is for this example)
% curl -I http://lenta.ru/articles/2013/04/02/mat/
HTTP/1.1 302 Found
Server: nginx
Date: Tue, 03 Sep 2013 00:15:14 GMT
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Status: 302 Found
Location: http://lenta.ru/f_words/
X-UA-Compatible: IE=Edge,chrome=1
Cache-Control: no-cache
X-Request-Id: bd7caae039d6312c0542cb4ad62f3847
X-Runtime: 0.005474
X-Rack-Cache: miss
current page for: http://lenta.ru/articles/2013/04/02/mat/
archive.org version of: http://lenta.ru/articles/2013/04/02/mat/
peeep.us archived version of archive.org version
archive.is archived version of peeep.us version of archive.org version

Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesMichael Nelson
 
Herbert Van De Sompel - Time Travel for the Web
Herbert Van De Sompel - Time Travel for the WebHerbert Van De Sompel - Time Travel for the Web
Herbert Van De Sompel - Time Travel for the WebiMinds conference
 
Socialmediaandweb2.0
Socialmediaandweb2.0Socialmediaandweb2.0
Socialmediaandweb2.0shwetanema
 
Socialmediaandweb2.0
Socialmediaandweb2.0Socialmediaandweb2.0
Socialmediaandweb2.0shwetanema
 
Library 2.0 and Web 2.0
Library 2.0 and Web 2.0Library 2.0 and Web 2.0
Library 2.0 and Web 2.0snackeru
 
HBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web Archiving
HBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web ArchivingHBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web Archiving
HBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web ArchivingHBaseCon
 
Lecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebLecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebMarina Santini
 
Linked Open Data for Archives
Linked Open Data for ArchivesLinked Open Data for Archives
Linked Open Data for ArchivesCliff Landis
 
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...Artefactual Systems - AtoM
 
Mashups & Data Visualizations: The New Breed of Web Applications
Mashups & Data Visualizations: The New Breed of Web ApplicationsMashups & Data Visualizations: The New Breed of Web Applications
Mashups & Data Visualizations: The New Breed of Web ApplicationsDarlene Fichter
 
Must Know Web 20 For Nyscate 2010
Must Know Web 20 For Nyscate 2010Must Know Web 20 For Nyscate 2010
Must Know Web 20 For Nyscate 2010Karen Brooks
 
Web 2.0 Setting The Stage For Extending Our Reach: Resource Guide
Web 2.0 Setting The Stage For Extending Our Reach: Resource GuideWeb 2.0 Setting The Stage For Extending Our Reach: Resource Guide
Web 2.0 Setting The Stage For Extending Our Reach: Resource Guidekennbicknell
 
Web20 Intro Naj Shaik
Web20 Intro Naj ShaikWeb20 Intro Naj Shaik
Web20 Intro Naj ShaikKaren Vignare
 
Reflections on 10 years of the Institutional Web
Reflections on 10 years of the Institutional WebReflections on 10 years of the Institutional Web
Reflections on 10 years of the Institutional Weblisbk
 
Social Graphs and Semantic Analytics
Social Graphs and Semantic AnalyticsSocial Graphs and Semantic Analytics
Social Graphs and Semantic AnalyticsColin Bell
 

Similar to The Memento Protocol and Research Issues With Web Archiving (20)

Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
 
Herbert Van De Sompel - Time Travel for the Web
Herbert Van De Sompel - Time Travel for the WebHerbert Van De Sompel - Time Travel for the Web
Herbert Van De Sompel - Time Travel for the Web
 
Socialmediaandweb2.0
Socialmediaandweb2.0Socialmediaandweb2.0
Socialmediaandweb2.0
 
Socialmediaandweb2.0
Socialmediaandweb2.0Socialmediaandweb2.0
Socialmediaandweb2.0
 
Library 2.0 and Web 2.0
Library 2.0 and Web 2.0Library 2.0 and Web 2.0
Library 2.0 and Web 2.0
 
HBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web Archiving
HBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web ArchivingHBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web Archiving
HBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web Archiving
 
Lecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebLecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic Web
 
Mapping the Dutch Blogosphere
Mapping the Dutch BlogosphereMapping the Dutch Blogosphere
Mapping the Dutch Blogosphere
 
Internet Mashups
Internet MashupsInternet Mashups
Internet Mashups
 
Web 2.0 Kid Style
Web 2.0 Kid StyleWeb 2.0 Kid Style
Web 2.0 Kid Style
 
Linked Open Data for Archives
Linked Open Data for ArchivesLinked Open Data for Archives
Linked Open Data for Archives
 
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
 
Ssworkgroup
SsworkgroupSsworkgroup
Ssworkgroup
 
Mashups & Data Visualizations: The New Breed of Web Applications
Mashups & Data Visualizations: The New Breed of Web ApplicationsMashups & Data Visualizations: The New Breed of Web Applications
Mashups & Data Visualizations: The New Breed of Web Applications
 
Must Know Web 20 For Nyscate 2010
Must Know Web 20 For Nyscate 2010Must Know Web 20 For Nyscate 2010
Must Know Web 20 For Nyscate 2010
 
Web 2.0 Setting The Stage For Extending Our Reach: Resource Guide
Web 2.0 Setting The Stage For Extending Our Reach: Resource GuideWeb 2.0 Setting The Stage For Extending Our Reach: Resource Guide
Web 2.0 Setting The Stage For Extending Our Reach: Resource Guide
 
Flourish2011
Flourish2011Flourish2011
Flourish2011
 
Web20 Intro Naj Shaik
Web20 Intro Naj ShaikWeb20 Intro Naj Shaik
Web20 Intro Naj Shaik
 
Reflections on 10 years of the Institutional Web
Reflections on 10 years of the Institutional WebReflections on 10 years of the Institutional Web
Reflections on 10 years of the Institutional Web
 
Social Graphs and Semantic Analytics
Social Graphs and Semantic AnalyticsSocial Graphs and Semantic Analytics
Social Graphs and Semantic Analytics
 

More from Michael Nelson

Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035Michael Nelson
 
Uncertainty in replaying archived Twitter pages
Uncertainty in replaying archived Twitter pagesUncertainty in replaying archived Twitter pages
Uncertainty in replaying archived Twitter pagesMichael Nelson
 
Web Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsWeb Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsMichael Nelson
 
Web Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsWeb Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsMichael Nelson
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesMichael Nelson
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Michael Nelson
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Michael Nelson
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Michael Nelson
 

More from Michael Nelson (8)

Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035
 
Uncertainty in replaying archived Twitter pages
Uncertainty in replaying archived Twitter pagesUncertainty in replaying archived Twitter pages
Uncertainty in replaying archived Twitter pages
 
Web Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsWeb Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed Originals
 
Web Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsWeb Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed Originals
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
 

Recently uploaded

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 

Recently uploaded (20)

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 

The Memento Protocol and Research Issues With Web Archiving

  • 1. The Memento Protocol and Research Issues With Web Archiving Michael L. Nelson Old Dominion University Web Science & Digital Libraries Research Group www.cs.odu.edu/~mln/ @phonedude_mln With: Los Alamos National Laboratory: Herbert Van de Sompel ODU: Michele C. Weigle, Hany SalahEldeen, Matthias Prellwitz, Justin Brunelle, Mat Kelly, Ahmed AlSum, Scott Ainsworth University of Virginia Colloquium 2016-09-12
  • 2.
  • 4.
  • 5.
  • 6. Memento wants to make it easy to access the Web of the Past.
  • 7. Memento achieves this by technically integrating the present Web and the past Web, by introducing a uniform version access capability for the Web.
  • 8. Content Management Systems: • Designed to be aware of all versions of a resource; • Self-contained; • Variety of proprietary version mechanisms; • Versions interlinked using proprietary mechanisms.
  • 9. World Wide Web: • Designed to forget about prior versions of a resource; • Distributed.
  • 10. There are resource versions on the Web: • Content Management Systems; • Web Archives; • Transactional archives; • Search engine caches.
  • 11. But the Web architecture has no way to deal with them: • Cannot talk about a resource as it used to exist; • Cannot access a prior version knowing the current one; • Cannot access the current version knowing a prior one; Current approaches are ad hoc and localized.
  • 12. Memento: • Looks at the Web as a Content Management System; • Introduces the uniform capability to access versions on the Web; • Does not build new archives but leverages all systems that host versions: Web archives, Content Management Systems, Software Version Systems, etc.
  • 13. Memento’s version access approach: • Is distributed: versions may exist on several servers; • Uses datetime as a global version indicator; • Is based on the primitives of the Web: resource, resource state, representation, content negotiation, link.
  • 14. Since Memento’s access approach is distributed, and is based on Web primitives, it scales like the Web.
  • 15. Memento’s core components: • Ability to speak about a resource as it existed in the past; • A bridge between present and past: link and content negotiation; • A bridge between past and present: link.
  • 16. original resource and versions
  • 17. bridge from present to past
  • 18. bridge from past to present
  • 23. Integrating Multiple Archives more info: http://mementoweb.org/ https://tools.ietf.org/html/rfc7089
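The two bridges can be sketched in a few lines. RFC 7089 defines an Accept-Datetime request header (datetime negotiation, present to past) and Link relations such as "original", "timegate", "timemap" and "memento" (links in both directions). The following is a minimal sketch, not a full client: the helper names, the example.org / example.net URIs, and the simplified Link parsing are assumptions made for illustration.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime(dt: datetime) -> str:
    """Format a timezone-aware datetime as an Accept-Datetime value (always GMT)."""
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)


# Simplified Link header grammar: <uri> followed by ;name=value parameters.
# Quoted values may contain commas (as the datetime attribute does).
_LINK = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+=(?:"[^"]*"|[^;,\s]+))*)')
_PARAM = re.compile(r'([\w-]+)=(?:"([^"]*)"|([^;,\s]+))')


def parse_link_header(value: str) -> dict:
    """Group the URIs found in a Link header by relation type."""
    by_rel = {}
    for uri, params in _LINK.findall(value):
        attrs = {name.lower(): (quoted or bare)
                 for name, quoted, bare in _PARAM.findall(params)}
        for rel in attrs.get("rel", "").split():
            by_rel.setdefault(rel, []).append(uri)
    return by_rel


# Hypothetical Link header, of the kind an archive could send with a memento.
example = (
    '<http://a.example.org/>; rel="original", '
    '<http://archive.example.net/timegate/http://a.example.org/>; rel="timegate", '
    '<http://archive.example.net/web/20160901000000/http://a.example.org/>; '
    'rel="memento"; datetime="Thu, 01 Sep 2016 00:00:00 GMT"'
)

headers = {"Accept-Datetime": accept_datetime(datetime(2016, 9, 12, tzinfo=timezone.utc))}
links = parse_link_header(example)
```

A client would send `headers` to the TimeGate URI to reach the past, and follow `links["original"]` from a memento to get back to the present.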
  • 24. Memento wants to make it easy to access the Web of the Past…
  • 25.
  • 27.
  • 28.
  • 30. Long Tail of Archives Archive.is
  • 31. Using Only Top-k Archives for URI Lookup Yields Good Results Even when there are 100s of archives, we only need to talk to a few. see: http://www.cs.odu.edu/~mln/pubs/tpdl-2013/paper_134.pdf
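As a rough illustration of the top-k idea: rank the archives by how often they have held a requested URI in the past, then query only the first k. This is a sketch under stated assumptions, not the method of the cited paper; the archive names, the hit rates, and the `query` callback (a stand-in for a TimeMap request) are all hypothetical.

```python
def top_k_archives(hit_rate, k):
    """Return the k archive names with the highest past hit rate."""
    return sorted(hit_rate, key=hit_rate.get, reverse=True)[:k]


def lookup(uri, archives, query):
    """Collect mementos for uri, asking only the given archives.

    query(archive, uri) must return a list of memento URIs.
    """
    found = []
    for archive in archives:
        found.extend(query(archive, uri))
    return found


# Hypothetical fraction of past lookups each archive could answer.
hit_rate = {"archive-a": 0.80, "archive-b": 0.35, "archive-c": 0.05, "archive-d": 0.01}

# Hypothetical holdings, used here instead of real network requests.
holdings = {
    "archive-a": {"http://example.org/": ["http://archive-a.example/1"]},
    "archive-c": {"http://example.org/": ["http://archive-c.example/9"]},
}


def fake_query(archive, uri):
    return holdings.get(archive, {}).get(uri, [])


chosen = top_k_archives(hit_rate, k=2)
mementos = lookup("http://example.org/", chosen, fake_query)
```

The trade-off is visible in the toy data: two requests instead of four, at the price of missing the copy held only by the low-ranked archive.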
  • 32. OK, so More Archives == More Better But why care about archiving at all?
  • 33. Why Care About The Past? From an anonymous WWW 2010 reviewer about our Memento paper (emphasis mine): "Is there any statistics to show that many or a good number of Web users would like to get obsolete data or resources? " one answer: replay of contemporary pages >> summary pages http://www.slideshare.net/phonedude/why-careaboutthepast http://www.nytimes.com/2013/06/19/books/seven-american-deaths-and-disasters-transcribes-the-news.html
  • 34. A YouTube video of a TV show where celebrities read “mean” Tweets about themselves Our social discourse is dominated by the web. Q.E.D. https://www.youtube.com/watch?v=LABGimhsEys
  • 35. Our scholarly record is in jeopardy… http://dx.doi.org/10.1371/journal.pone.0115253 See also: http://blog.dshr.org/2015/02/the-evanescent-web.html
  • 36. As is our legal record… http://www.nytimes.com/2013/09/24/us/politics/in-supreme-court-opinions-clicks-that-lead-nowhere.html See also: http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/ http://ssnat.com/
  • 37. And our popular culture as well… http://f-measure.blogspot.com/2009/02/pink-floyd-hour-with-pink-floyd-kqed-lp.html
  • 38. Half-Life of Popular Music YouTube Videos [Two plots: a linear regression over months 0-18 with the half-life marked, and median absolute deviation over weeks 0-10. Datasets: Top 40 US Singles Charts, Music Blogs @ blogspot.com, The 500 Greatest Songs] Matthias Prellwitz, Michael L. Nelson, Music Video Redundancy and Half-Life in YouTube, Proceedings of TPDL 2011 http://www.cs.odu.edu/~mln/pubs/tpdl-2011/tpdl-2011-prellwitz.pdf Individual URLs die, but new versions arise
  • 39. So we won’t lose every copy of “Shake It Off”… What about the grist of history?
  • 40. http://www.bbc.com/future/story/20120927-the-decaying-web On January 28 2011, three days into the fierce protests that would eventually oust the Egyptian president Hosni Mubarak, a Twitter user called Farrah posted a link to a picture that supposedly showed an armed man as he ran on a “rooftop during clashes between police and protesters in Suez”. I say supposedly, because both the tweet and the picture it linked to no longer exist. Instead they have been replaced with error messages that claim the message – and its contents – “doesn’t exist”.
  • 41. Missing Tweet & Pic https://twitter.com/Farrah3m/status/31727870736859137 http://twitpic.com/3uvo6z http://ws-dl.blogspot.com/2013/05/2013-05-07-who-is-archiving-your-tweets.html
  • 42. In May 2013, both are “archived” by topsy.com
  • 43. In February 2015, they’re completely missing. http://topsy.com/http://twitpic.com/3uvo6z
  • 45. …to a random (?) apple.com page http://topsy.com/http://twitpic.com/3uvo6z
  • 46. No Server == No HTTP Event == Nothing to Archive http://topsy.com/http://twitpic.com/3uvo6z
  • 47. Hany M. SalahEldeen, Michael L. Nelson, Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?, Proceedings of TPDL 2012. http://arxiv.org/abs/1209.3026 Hany SalahEldeen, Michael L. Nelson, Resurrecting My Revolution: Using Social Link Neighborhood in Bringing Context to the Disappearing Web, Proceedings of TPDL 2013. http://arxiv.org/abs/1309.2648 Missing: 11% year 1, 7%/year afterwards Archived: 7% year 1, 15%/year afterwards
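To make the rates concrete, the figures on this slide can be read as a simple linear model: a first-year percentage plus a constant yearly increment afterwards. That linear reading, the cap at 100%, and the function name are assumptions for illustration; the papers cited above are the authority on the actual fit.

```python
def percent_after(years, first_year, per_year_after):
    """Percentage reached after `years` (>= 1), capped at 100.

    Linear reading of the slide: first_year % in year 1,
    then per_year_after % for every additional year.
    """
    if years < 1:
        raise ValueError("the slide only gives figures from year 1 onward")
    return min(100.0, first_year + per_year_after * (years - 1))


# Figures from the slide.
def missing(years):
    return percent_after(years, first_year=11.0, per_year_after=7.0)


def archived(years):
    return percent_after(years, first_year=7.0, per_year_after=15.0)


for t in (1, 2, 3):
    print(f"year {t}: missing {missing(t):.0f}%, archived {archived(t):.0f}%")
```

Under this reading, three years after being shared about a quarter of the resources are missing from the live web.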
  • 48. Why we need multiple, independent archives…
  • 49. A single archive is vulnerable http://www.bbc.com/news/uk-politics-24924185 http://ws-dl.blogspot.com/2013/11/2013-11-21-conservative-party-speeches.html
  • 50. Houston, Tranquility Base Here. The Eagle has landed. see also: http://ws-dl.blogspot.com/2013/03/2013-03-22-ntrs-web-archives-and-why-we.html http://ws-dl.blogspot.com/2013/06/2013-06-18-ntrs-memento-and-handles.html
  • 52. $ curl -I "http://www.thedailybeast.com/articles/2016/08/11/i-got-three-grindr-dates-in-an-hour-in-the-olympic-village.html"
    HTTP/1.1 301 Moved Permanently
    Access-Control-Allow-Origin: *
    Age: 0
    Cache-Control: max-age=60
    Content-Type: text/html; charset=iso-8859-1
    Date: Thu, 18 Aug 2016 01:13:46 GMT
    Location: http://www.thedailybeast.com/articles/2016/08/11/a-note-from-the-editors.html
    RealAge: 0
    Server: Apache
    Vary: Accept-Encoding, User-Agent
    Via: 1.1 varnish
    X-BackEnd: default
    X-Cache: MISS
    X-Cacheable: YES
    X-Restarts: 0
    X-UA-Device: pc
    X-Varnish: 995407903
    Connection: keep-alive
    http://www.usnews.com/news/articles/2016-08-17/wayback-machine-wont-censor-archive-for-taste-director-says-after-olympics-article-scrubbed
  • 53. But who pays for those extra archives? 1TB endowment = ~$4700: http://blog.dshr.org/2011/02/paying-for-long-term-storage.html see also: http://blog.dshr.org/2011/01/memento-marketplace-for-archiving.html
  • 54. Archives aren’t magic web sites They’re just web sites. If you used Mummify, you’re now left with a bunch of defunct, shortened links like: https://mummify.it/XbmcMfE3
  • 55. Don’t Throw Away the Original URL – Use Robust Links!
    <a href="http://www.w3.org/" data-versionurl="https://archive.today/r7cov" data-versiondate="2015-01-21">my robust link to the live web</a>
    <a href="https://archive.today/r7cov" data-originalurl="http://www.w3.org/" data-versiondate="2015-01-21">my robust link to an archived version</a>
    <!DOCTYPE html>
    <html lang="en" itemscope itemtype="http://schema.org/WebPage" itemid="http://robustlinks.mementoweb.org/spec/">
    <head>
    <meta charset="utf-8" />
    <meta itemprop="dateModified" content="2015-02-02">
    <meta itemprop="datePublished" content="2015-01-23">
    <title>Page Level Metadata Is The Least You Can Do</title>
    More examples / scenarios at: http://robustlinks.mementoweb.org/spec/
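The two link forms on this slide differ only in which URL goes into href and which goes into a data- attribute. A small generator makes that symmetry explicit. The attribute names (data-versionurl, data-originalurl, data-versiondate) come from the slide; the function name and its `link_to` parameter are hypothetical.

```python
from html import escape


def robust_link(original_url, version_url, version_date, text, link_to="original"):
    """Build an <a> element that keeps both the original and the archived URL.

    link_to="original": href is the live page, the snapshot goes in data-versionurl.
    link_to="version":  href is the snapshot, the live page goes in data-originalurl.
    """
    if link_to == "original":
        href, attr, other = original_url, "data-versionurl", version_url
    elif link_to == "version":
        href, attr, other = version_url, "data-originalurl", original_url
    else:
        raise ValueError("link_to must be 'original' or 'version'")
    return '<a href="{}" {}="{}" data-versiondate="{}">{}</a>'.format(
        escape(href, quote=True),
        attr,
        escape(other, quote=True),
        escape(version_date, quote=True),
        escape(text),
    )


live = robust_link("http://www.w3.org/", "https://archive.today/r7cov",
                   "2015-01-21", "my robust link to the live web")
snapshot = robust_link("http://www.w3.org/", "https://archive.today/r7cov",
                       "2015-01-21", "my robust link to an archived version",
                       link_to="version")
```

Either way the original URL survives, so a reader (or a script) can still look for other archived copies if the one archive goes away.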
  • 56. Economics Working Against Archives “In the paper world in order to monetize their content the copyright owner had to maximize the number of copies of it. In the Web world, in order to monetize their content the copyright owner has to minimize the number of copies. Thus the fundamental economic motivation for Web content militates against its preservation in the ways that Herbert and I would like.” --David Rosenthal http://blog.dshr.org/2015/02/the-evanescent-web.html
  • 57. “We’ll use the cloud!”
  • 58. https://www.chriswatterston.com/blog/my-there-no-cloud-sticker "...when all costs are taken in to account, cloud storage is not cheaper for long-term preservation than doing it yourself once you get to a reasonable scale.” http://blog.dshr.org/2014/11/talk-costs-why-do-we-care.html
  • 59. Historicity of Web Archives
  • 60. Malaysia Airlines Flight 17 (MH17) http://web.archive.org/web/20140717152222/http://vk.com/strelkov_info http://www.csmonitor.com/World/Europe/2014/0717/Web-evidence-points-to-pro-Russia-rebels-in-downing-of-MH17-video http://www.newyorker.com/magazine/2015/01/26/cobweb
  • 61.
  • 62. (not really archived as well as you think)
  • 63. Ed and I Discuss Who Has What… https://twitter.com/phonedude_mln/status/490171976389238784
  • 65. Alex is now 404. Would multiple archives have convinced him? https://twitter.com/quicknquiet
  • 66. Do we really have “a perfect tool to produce `evidence’ of any kind”?
  • 67. @gary4205 mansplains to @AstroKatie https://twitter.com/AstroKatie/status/765344020184739840 see also: http://www.someecards.com/news/so-that-happened/mansplain-astronaught-jessica-meir-twitter/
  • 68. But can you prove he didn’t say this?
  • 69. Or that she didn’t say this? (remember: black hats can use tools created by white hats)
  • 70. Assessing the Quality of Web Archiving current: "Hooray! It's in the archive!" vs. the question we should be asking: "How well was it archived?"
  • 71. Temporal Drift August 27, 2005 11:16 a.m. EDT link
  • 72. Temporal Drift: Now 3 Hours in the Past August 27, 2005 11:16 a.m. EDT link August 27, 2005 8:00 a.m. EDT link
  • 73. Temporal Drift: Now 17 Days in the Future August 27, 2005 11:16 a.m. EDT link August 27, 2005 8:00 a.m. EDT link September 13, 2005 8:12 a.m. EDT link
  • 74. Temporal Drift: Now 23 (or 6) Days in the Future August 27, 2005 11:16 a.m. EDT link August 27, 2005 8:00 a.m. EDT link September 13, 2005 8:12 a.m. EDT link September 19, 2005 8:25 a.m. EDT link 10+ clicks in the archive results in median drift of ~45 days (standard UI) or ~15 days with Memento. ~2% of the sessions have drift of > 1 year. see: http://www.cs.odu.edu/~mln/pubs/jcdl-2013/jcdl93-ainsworth.pdf
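The drift figures in the slide titles follow directly from the four datetimes shown. A small worked check, assuming all four times are in the same zone (EDT, as labeled) so naive datetimes suffice; the variable names are illustrative:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Datetime the session started at (slide 71).
target = datetime.strptime("2005-08-27 11:16", FMT)

# Datetimes of the pages reached by successive clicks (slides 72-74).
clicks = [
    datetime.strptime("2005-08-27 08:00", FMT),
    datetime.strptime("2005-09-13 08:12", FMT),
    datetime.strptime("2005-09-19 08:25", FMT),
]


def drift_days(reference, observed):
    """Signed drift in days; negative means the observed page is older."""
    return (observed - reference).total_seconds() / 86400


drift_from_target = [drift_days(target, c) for c in clicks]
last_step = drift_days(clicks[1], clicks[2])

for c, d in zip(clicks, drift_from_target):
    print(f"{c:%Y-%m-%d %H:%M}  drift from target: {d:+.2f} days")
print(f"last click relative to the previous page: {last_step:+.2f} days")
```

The final slide's "23 (or 6)" is the same page measured two ways: against the original target datetime, or against the page visited just before it.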
  • 75. Sometimes the Live Web "Leaks" Into the Archive…
  • 77. Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources JCDL 2014 http://www.cs.odu.edu/~mln/pubs/jcdl-2014/jcdl-2014-brunelle-damage.pdf
  • 78. Synthetic Damage: Removing Images From xkcd.com M = 0.17, D = 0.09 (live web); M = 0.24, D = 0.41 (missing main); M = 0.29, D = 0.36 (missing logo + navigation). Damage (D) differs from % missing (M)!
  • 79. Was missing resource important? <img> and <embed> can leave hints about size and centrality. For CSS, we look at the distribution of background color in page divided into vertical thirds.
  • 80. Weights from Turker Assessment of Damage first: establish that Turkers can determine damaged vs. undamaged pages (81% of the time) second: find weights that match Turkers' rankings of (real) differently damaged versions of the same page
  • 81. Good News: Although %Missing (M) is steady/increasing, weighted Damage (D) is decreasing
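Why D and M can move in opposite directions is easy to see with a toy page. This is only an illustrative sketch: the weights below are made up for the example and are not the weights derived in the JCDL 2014 paper, and damage is reduced here to "share of total weight that is missing".

```python
def percent_missing(resources):
    """M: fraction of embedded resources that are missing, all counted equally."""
    return sum(1 for r in resources if r["missing"]) / len(resources)


def weighted_damage(resources):
    """D (simplified): fraction of total importance weight that is missing."""
    total = sum(r["weight"] for r in resources)
    lost = sum(r["weight"] for r in resources if r["missing"])
    return lost / total


def page(missing_names):
    # Hypothetical page: one large central image and four small icons.
    names_and_weights = [("main", 6), ("icon1", 1), ("icon2", 1),
                         ("icon3", 1), ("icon4", 1)]
    return [
        {"name": n, "weight": w, "missing": n in missing_names}
        for n, w in names_and_weights
    ]


missing_main = page({"main"})
missing_icons = page({"icon1", "icon2"})

print("missing main image:", percent_missing(missing_main), weighted_damage(missing_main))
print("missing two icons: ", percent_missing(missing_icons), weighted_damage(missing_icons))
```

Losing the single main image gives a lower M but a higher D than losing two icons, which mirrors the xkcd.com example where "missing main" has a lower M but higher D than "missing logo + navigation".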
  • 82. A Framework for Evaluation of Composite Memento Temporal Coherence Hypertext 2015 http://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html
  • 83. As Presented by IA http://web.archive.org/web/20041209190926/http://www.wunderground.org/cgi-bin/findWeather/getForecast?query=50593 (now 404, but that's a different story…)
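The Internet Archive URI above encodes both the capture datetime and the original URI. A minimal sketch of splitting them apart, assuming the common `web.archive.org/web/<14-digit timestamp>/<original URI>` layout; the function name is illustrative:

```python
import re
from datetime import datetime

# 14-digit timestamp (YYYYMMDDhhmmss) followed by the original URI.
WAYBACK = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def split_wayback_uri(uri):
    """Return (capture datetime, original URI) for a Wayback-style URI."""
    m = WAYBACK.match(uri)
    if m is None:
        raise ValueError(f"not a 14-digit Wayback URI: {uri}")
    return datetime.strptime(m.group(1), "%Y%m%d%H%M%S"), m.group(2)


captured, original = split_wayback_uri(
    "http://web.archive.org/web/20041209190926/"
    "http://www.wunderground.org/cgi-bin/findWeather/getForecast?query=50593"
)
print(captured.isoformat(), original)
```

The timestamp describes the root page only; embedded images and stylesheets each have their own capture datetimes.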
  • 84. Not Everything Is 20041209190926 + 9 months
  • 85. 1 in 20 pages complete; 1 in 5 have violations
      Description                 | Closest Single Archive | Closest Multi-Archive | Bracket Single Archive | Bracket Multi-Archive
      Completeness
      Mean complete               | 76.1%                  | 80.2%                 | 76.2%                  | 80.3%
      Mean missing                | 23.9%                  | 19.8%                 | 23.8%                  | 19.7%
      Temporal Coherence
      Mean prima facie coherent   | 41.0%                  | 40.9%                 | 54.7%                  | 54.6%
      Mean possibly coherent      | 27.3%                  | 27.3%                 | 12.8%                  | 14.2%
      Mean probably violative     | 2.5%                   | 5.3%                  | 2.5%                   | 5.3%
      Mean prima facie violative  | 5.3%                   | 5.3%                  | 6.2%                   | 6.2%
      At least 5% of pages can be shown to be temporal violations http://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html
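The four coherence categories named on the slide can be approximated from three datetimes: when the root page was captured, when an embedded resource was captured, and the embedded resource's Last-Modified value if the archive recorded one. The rules below are a simplified reading and should be treated as an assumption; the Hypertext 2015 paper has the exact definitions. The example dates other than the root capture time are hypothetical.

```python
from datetime import datetime, timedelta


def classify(root_dt, embedded_dt, last_modified=None):
    """Simplified coherence state of one embedded memento relative to its root."""
    if embedded_dt == root_dt:
        return "prima facie coherent"
    if embedded_dt < root_dt:
        # Captured before the root: it may have changed before the root capture.
        return "possibly coherent"
    # From here on the embedded resource was captured after the root.
    if last_modified is None:
        return "probably violative"
    if last_modified <= root_dt:
        # Last-Modified and capture time bracket the root capture.
        return "prima facie coherent"
    return "prima facie violative"


root = datetime(2004, 12, 9, 19, 9, 26)  # root capture time from slide 83
later = root + timedelta(days=270)       # hypothetical capture about 9 months on

examples = {
    "bracketed": classify(root, later, last_modified=root - timedelta(days=30)),
    "captured earlier": classify(root, root - timedelta(days=2)),
    "later, no Last-Modified": classify(root, later),
    "later, modified after root": classify(root, later,
                                           last_modified=root + timedelta(days=10)),
}
for label, state in examples.items():
    print(f"{label}: {state}")
```

Under this reading, the bracket pattern is the only way a late capture can be shown coherent.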
  • 87. Wrong Metaphor for Web Archives
  • 88. Web Archives Are Not Destinations This is a destination. This is not a destination. Memento is about linking the past and present web
  • 89. Possible Metaphor for Viewing Past & Present?
  • 90. Turn Archiving Into A Social Activity… see also: http://xkcd.com/1034/, Marshall & Shipman, JCDL 2011
  • 91. …But Don't Use the "A" Word Ed: Are there any zombies out there? Shaun: Don't say that! Ed: What? Shaun: That. Ed: What? Shaun: That. The Z word. Don't say it. Ed: Why not? Shaun: Because it's ridiculous! — Shaun of the Dead
  • 92. Pinterest: Anonymous Mementos http://media-cache-ec3.pinterest.com/upload/47639708527755289_AhxhItiQ_c.jpg is a memento of: http://3.bp.blogspot.com/_d0vByWRfhvU/S_Ygk_oX4xI/AAAAAAAACCQ/LXgC3S0KYEo/s400/_MG_8091.jpg but there is no machine-readable indication of this relationship repins are by-reference
  • 93. When all else fails, justify project with: “web archiving is Big Data”
  • 95. Archiving your internal stuff: Transactional Archiving https://mementoweb.github.io/SiteStory/ Never miss an update; archive your site as it is being viewed by users.
  • 96. Archiving your internal stuff: Heritrix & Wayback Crawling your intranet: http://www.dlib.org/dlib/january16/brunelle/01brunelle.html Crawling JS “stuff” will take 5X more storage: http://arxiv.org/abs/1601.05142 mementos of Mitre Intranet “MiiTube” – Complete With Javascript leakage
  • 97. JavaScript == the new deep web; use ResourceSync to make sure your URIs are exposed <?xml version="1.0" encoding="UTF-8"?> <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:ln rel="up" href="http://example.com/dataset1/capabilitylist.xml"/> <rs:md capability="resourcelist" at="2013-01-03T09:00:00Z" completed="2013-01-03T09:01:00Z"/> <url> <loc>http://example.com/res1</loc> <lastmod>2013-01-02T13:00:00Z</lastmod> <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" length="8876" type="text/html"/> </url> <url> <loc>http://example.com/res2</loc> <lastmod>2013-01-02T14:00:00Z</lastmod> <rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e sha-256:854f61290e2e197a11bc91063afce22e43f8ccc655237050ace766adc68dc784" length="14599" type="application/pdf"/> </url> </urlset> (AKA “Fancy SiteMaps”) http://www.openarchives.org/rs/
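Because a ResourceSync Resource List is a Sitemap with extra `rs:` elements, a consumer can read it with any namespace-aware XML parser. A minimal sketch using Python's standard library, with the slide's sample document embedded (XML declaration and the second hash omitted for brevity); the function name is illustrative:

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# Sample Resource List from the slide, trimmed.
RESOURCE_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="up" href="http://example.com/dataset1/capabilitylist.xml"/>
  <rs:md capability="resourcelist" at="2013-01-03T09:00:00Z"
         completed="2013-01-03T09:01:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" length="8876" type="text/html"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T14:00:00Z</lastmod>
    <rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e" length="14599"
           type="application/pdf"/>
  </url>
</urlset>"""


def list_resources(xml_text):
    """Return one dict per <url> entry with its URI and rs:md metadata."""
    root = ET.fromstring(xml_text)
    resources = []
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)
        resources.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "length": int(md.get("length")) if md is not None else None,
            "type": md.get("type") if md is not None else None,
        })
    return resources


resources = list_resources(RESOURCE_LIST)
for r in resources:
    print(r["loc"], r["lastmod"], r["type"], r["length"])
```

A crawler that cannot execute JavaScript can still discover every URI listed this way.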
  • 99. Seagal’s Law A man with a watch knows what time it is. A man with two watches is never sure. How to resolve conflicting archives? Personalization, GeoIP, mobile vs. desktop, etc. means “the” page rarely exists, only “a” page. Mat Kelly, Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson, A Method for Identifying Personalized Representations in Web Archives, D-Lib Magazine, 19(11/12), 2013. http://www.dlib.org/dlib/november13/kelly/11kelly.html
  • 100. Thoughtful analysis: http://blog.dshr.org/2015/02/vint-cerfs-talk-at-aaas.html Snarky analysis: http://ws-dl.blogspot.com/2015/02/2015-02-17-reactions-to-vint-cerfs.html
  • 101. Why Care About The Past? From an anonymous WWW 2010 reviewer about our Memento paper (emphasis mine): "Is there any statistics to show that many or a good number of Web users would like to get obsolete data or resources? " one answer: replay of contemporary pages >> summary pages http://www.slideshare.net/phonedude/why-careaboutthepast http://www.nytimes.com/2013/06/19/books/seven-american-deaths-and-disasters-transcribes-the-news.html
  • 102.
  • 103. vs.
  • 104.
  • 116. Archiving Moves At Hurricane Speed, Most News Stories Move Faster
  • 120. Most of the Story, at Least as Conveyed by cnn.com, is Missing… in this case, you can reconstruct the events with http://en.wikipedia.org/wiki/Virginia_Tech_massacre_timeline
  • 121. How Much of The Web Is Archived?
  • 122. Public Archives, ca. Late 2010 / Early 2011 Three categories of archives • Internet Archive • Search engine • Other archives UK US See also: http://arxiv.org/abs/1212.6177
  • 123. 1000 URIs Ordered by First Observation Date See also: http://ws-dl.blogspot.com/2011/06/2011-06-23-how-much-of-web-is-archived.html
  • 125. How Much of the Web is Archived? It Depends on Which Web… Including SE cache Excluding SE Cache 90% 79% 97% 68% 35% 16% 88% 19% Changes since 2011: no more free SE APIs; greatly reduced IA quarantine period; 15 public web archives 2013 95% 92% 23% 26%
  • 126. Quis Archiviet Ipsos Archives? (thanks to webmaster@archive.is for this example)
  • 127. % curl -I http://lenta.ru/articles/2013/04/02/mat/ HTTP/1.1 302 Found Server: nginx Date: Tue, 03 Sep 2013 00:15:14 GMT Content-Type: text/html; charset=utf-8 Connection: keep-alive Status: 302 Found Location: http://lenta.ru/f_words/ X-UA-Compatible: IE=Edge,chrome=1 Cache-Control: no-cache X-Request-Id: bd7caae039d6312c0542cb4ad62f3847 X-Runtime: 0.005474 X-Rack-Cache: miss current page for: http://lenta.ru/articles/2013/04/02/mat/
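The `curl -I` output above shows the live URI answering with a redirect rather than the article. A small sketch of pulling the status code and redirect target out of such a response head, working on the text itself so it runs without network access; the function name is illustrative and only a subset of the slide's headers is reproduced:

```python
REDIRECT_CODES = {301, 302, 303, 307, 308}

# Subset of the response head shown on the slide.
RAW_HEAD = """HTTP/1.1 302 Found
Server: nginx
Date: Tue, 03 Sep 2013 00:15:14 GMT
Content-Type: text/html; charset=utf-8
Location: http://lenta.ru/f_words/
Cache-Control: no-cache
"""


def parse_head(raw):
    """Return (status code, headers dict with lowercased names)."""
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    _version, code, _reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, so values containing ':' stay intact.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), headers


status, headers = parse_head(RAW_HEAD)
redirect_target = headers.get("location") if status in REDIRECT_CODES else None
print(status, redirect_target)
```

Recording the status code and `Location` alongside an archived copy makes it possible to tell later whether a capture is the requested page or wherever the server redirected to that day.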
  • 128. archive.org version of: http://lenta.ru/articles/2013/04/02/mat/
  • 129. peeep.us archived version of archive.org version
  • 130. archive.is archived version of peeep.us version of archive.org version