A Framework for Verifying the Fixity of Archived Web Resources
1. PhD Dissertation Defense for:
Mohamed Aturban
Advisor:
Michele C. Weigle
Committee Members:
Michele C. Weigle, Michael L. Nelson, Jian Wu,
Sampath Jayarathna, and M'Hammed Abdous
A Framework for Verifying the Fixity
of Archived Web Resources
Department of Computer Science
Norfolk, Virginia 23529 USA
July 23, 2020
2. Outline
Introduction and Motivation
Research Questions
Background
Sampling of Related work
Changes in the Playback of Mementos
The Fixity Information Dissemination Framework
Contributions, Future Work, and Conclusions
@maturban1 • @WebSciDL
PhD Defense: Mohamed Aturban
July 23, 2020
3. This is what www.cnn.com looks like today
4. The Internet Archive (IA) allows
us to view previous versions
(mementos) of that page
• IA is the world’s largest public
web archive
• It holds hundreds of billions of
archived web pages
https://web.archive.org/web/20130401000000*/http://www.cnn.com/
5. The CNN archived page from May 30, 2013
• Replaying this memento in 2018
• There was a thunderstorm in Atlanta, GA on May 30, 2013
6. When reloading (#1) the memento in the browser,
the weather icon changed to “cloudy”
7. When reloading (#2) the memento in the browser,
the weather icon changed to “partly sunny”
8. When reloading (#3) the memento in the browser,
the weather icon changed to “partly sunny”
9. Replaying the same memento multiple times
does not always produce the same results!
• The changes in the playback of this memento are caused by JavaScript (JS) being executed on the client side (i.e., in the browser)
• In this example, each time the JS is executed, it randomly loads one of the weather icons
10. Textbooks vs. archived pages
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR4FM1VszineUIBCFEQchQTnaZWwKJE7BoUU1u1h3fmrbLdpWl8
A book in a library vs. replayed mementos
11. This is what climate.nasa.gov/vital-signs/carbon-dioxide/
looks like today
12. This is what it looked like in July 2018
https://web.archive.org/web/20180726025537/https://climate.nasa.gov/vital-signs/carbon-dioxide/
A memento created by
the Internet Archive in
July 2018. It is replayed
now (2019).
13. The page in other web archives
Archive URI pattern (number of mementos):
• web.archive.org/web/*/climate.nasa.gov/vital-signs/carbon-dioxide (4,870)
• archive.is/climate.nasa.gov/vital-signs/carbon-dioxide/ (13)
• wayback.archive-it.org/all/*/climate.nasa.gov/vital-signs/carbon-dioxide/ (91)
• perma-archives.org/warc/*/climate.nasa.gov/vital-signs/carbon-dioxide/ (4)
• arquivo.pt/wayback/*/climate.nasa.gov/vital-signs/carbon-dioxide (3)
• webarchive.loc.gov/all/*/climate.nasa.gov/vital-signs/carbon-dioxide (5)
Typical archive URI construction:
archive.example.org/archive-collection/climate.nasa.gov/vital-signs/carbon-dioxide
For a full list of public web archives, see: http://labs.mementoweb.org/aggregator_config/archivelist.xml
14. What if we checked these archives?
What if they all agree?
Would you trust the results?
breitbart.com/wayback/*/climate.nasa.gov/vital-signs/carbon-dioxide/
infowars.com/web/*/climate.nasa.gov/vital-signs/carbon-dioxide/
MichaelsEvilWayback.com/web/*/climate.nasa.gov/vital-signs/carbon-dioxide/
InternetResearchAgency.ru/climate.nasa.gov/vital-signs/carbon-dioxide/
15. The web page was archived in July 2017 by
Michael’sEvilWayback
Which one is the real memento?
Replayed in August 2017 vs. replayed in October 2017
16. It is important to verify fixity of archived resources
Evidentiary purposes in court cases
• Marten Transport v. PlattForm Advertising
https://www.bloomberglaw.com/public/desktop/document/Marten_Transp_Ltd_v_PlattForm_Adver_Inc_No_142464JWL_2016_BL_1371?1462657373
• Telewizja Polska USA, Inc. v. Echostar Satellite Corp
https://casetext.com/case/telewizja-polska-usa-4
• St. Luke’s Cataract & Laser Institute v. James C. Sanderson
https://caselaw.findlaw.com/us-11th-circuit/1351498.html
Preserving fake news and important news articles
• H. Allcott and M. Gentzkow, “Social media and fake news in the 2016 election,” Journal of Economic Perspectives, vol. 31, no. 2, pp. 211–36, 2017.
https://web.stanford.edu/~gentzkow/research/fakenews.pdf
• M. Rosenberg, “Trump Adviser Has Pushed Clinton Conspiracy Theories,” The New York Times, 2016
https://www.nytimes.com/2016/12/05/us/politics/-michael-flynn-trump-fake-news-clinton.html
Providing information about certain incidents or crimes
• A. Bright, “Web evidence points to pro-Russia rebels in downing of MH17,” The Christian Science Monitor, 2014
https://www.csmonitor.com/World/Europe/2014/0717/Web-evidence-points-to-pro-Russia-rebels-in-downing-of-MH17
• https://www.newyorker.com/magazine/2015/01/26/cobweb
Preserving federal and government data
• The Data Refuge project is an attempt to preserve federal climate and environmental data
https://www.datarefuge.org
• The End of Term Web Archive preserves U.S. Government websites around every new presidential election
http://eotarchive.cdlib.org
17. A disclaimer from the Internet Archive stating that the archive
is not responsible for the reliability of the archived resources
https://archive.org/about/terms.php
18. Web pages change on the live web
[Figure: timeline of the live web at May 2016, April 2017, and April 2018]
19. Archives make copies of web pages
[Figure: timeline of the live web and its archived copies at May 2016, April 2017, and April 2018]
20. Do archived pages change?
[Figure: timeline of the live web, the archive, and replay at May 2016, April 2017, and April 2018]
21. Do archived pages change?
When replaying the archived page at different
points in time, will we get the same content?
[Figure: timeline of the live web, the archive, and replay at May 2016, April 2017, and April 2018]
27. Do archived pages change?
Our study shows that we are not always
presented with the same archived content!
[Figure: timeline of the live web, the archive, and replay at May 2016, April 2017, and April 2018]
28. Outline
Introduction and Motivation
Research Questions
Background
Sampling of Related work
Changes in the Playback of Mementos
The Fixity Information Dissemination Framework
Contributions, Future Work, and Conclusions
31. RQ1: Can we identify and quantify the types of changes on the playback
of mementos that prevent generating repeatable fixity information?
RQ2: Given the types of changes identified in the playback of mementos,
what steps/guidelines should we follow in order to generate repeatable
fixity information (defining an archive-aware fixity-based approach)?
RQ3: How can we store and retrieve fixity information independently from
the web archives from which the associated mementos are preserved?
Research questions
32. Outline
Introduction and Motivation
Research Questions
Background
Sampling of Related work
Changes in the Playback of Mementos
The Fixity Information Dissemination Framework
Contributions, Future Work, and Conclusions
33. Generating cryptographic hash values (fixity information)
• Common hash algorithms (e.g., MD5, SHA256):
A small change in the input → a large change in the output
SHA256: 9801 1510 87e1 6d6b ddb9 e6b0 09fd b723 abe5 1fea b548 0914 a130 6325 5ae4 6caa
SHA256: 5d4d b590 605c 9023 000d 6622 6004 534f e84a 5549 d535 f91e cdf4 4952 5c1a 37cf
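The avalanche behavior of a cryptographic hash can be illustrated with a minimal Python sketch. The two input strings are made up for illustration (they are not the inputs used on the slide), and only `hashlib` from the standard library is used.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of a UTF-8 string as 64 hex characters."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions in which two equal-length hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# Hypothetical inputs that differ in a single character.
digest_a = sha256_hex("CO2 concentration: 410 ppm")
digest_b = sha256_hex("CO2 concentration: 411 ppm")

print(digest_a)
print(digest_b)
print(differing_bits(digest_a, digest_b), "of 256 bits differ")
```

For a well-behaved hash function, roughly half of the 256 bits are expected to differ even though the inputs differ by one character.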
34. SimHash:
A small change in the input → a small change in the output
'Klein et al. conducted a study on over one
million references from scientific articles
and found that 20% articles suffers from
Reference Rot, referring to links to web
resources that no longer exist or that have
significantly modified content.'
SimHash
668c8cccd966a785
https://github.com/leonsim/simhash
'Klein et al. conducted a study on over one
million references from scientific articles
and found that 30% articles suffers from
Reference Rot, referring to links to web
resources that no longer exist or that have
significantly modified content.'
SimHash
668c8cced966a785
M. Klein, H. Van de Sompel, R. Sanderson, H. Shankar, L. Balakireva, K. Zhou, and R. Tobin, “Scholarly context not found: One in five articles suffers from reference rot,” PloS one, vol. 9, no. 12, 2014. e115253.
http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html
• We can use SimHash to compare text-based files and pHash to compare images
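To make the contrast with SHA256 concrete, here is a minimal word-level SimHash sketch in Python. It is an illustrative implementation written for this example, not the code of the linked simhash library, so the fingerprints it produces will not match the values shown on the slide. Tokenizing on whitespace and using MD5 as the per-token hash are assumptions.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Compute a simple word-level SimHash fingerprint of `text`."""
    weights = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


TEMPLATE = (
    "Klein et al. conducted a study on over one million references from "
    "scientific articles and found that {pct} articles suffers from "
    "Reference Rot, referring to links to web resources that no longer "
    "exist or that have significantly modified content."
)
fp_20 = simhash(TEMPLATE.format(pct="20%"))
fp_30 = simhash(TEMPLATE.format(pct="30%"))

print(format(fp_20, "016x"))
print(format(fp_30, "016x"))
print("Hamming distance:", hamming_distance(fp_20, fp_30))
```

Because only one of the tokens changes, only a few of the 64 bits are expected to flip, which is what makes the fingerprint usable as a similarity measure.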
35. An example of a binary hash tree (or Merkle tree)
https://brilliant.org/wiki/merkle-tree/
• A leaf node = the hash of a block of data
• A non-leaf node = the hash of its children
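The two rules above can be sketched in a few lines of Python. This is a minimal illustration, assuming SHA-256 as the hash, concatenation of the two child hashes as the parent input, and duplication of the last node when a level has an odd number of nodes; real systems differ in these details.

```python
import hashlib
from typing import List


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(blocks: List[bytes]) -> str:
    """Return the hex root of a binary hash tree built over `blocks`."""
    if not blocks:
        raise ValueError("at least one block is required")
    # Leaf nodes: the hash of each block of data.
    level = [sha256(block) for block in blocks]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: duplicate the last node
        # Non-leaf nodes: the hash of the two children.
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0].hex()


print(merkle_root([b"block-1", b"block-2", b"block-3", b"block-4"]))
```

Changing any single block changes its leaf hash and therefore every hash on the path up to the root.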
36. Generate hashes on a web page
• Compute a hash value on the downloaded HTML content
$ curl -s https://climate.nasa.gov/vital-signs/carbon-dioxide/ | shasum -a 256
17710fd38d908a3cd124510f26adaec67e57e3f1d3aec1209c4ad4efbe2c035d
The curl command downloads the page; shasum computes the SHA256 hash.
37. Verifying the fixity of a web page
• Compare the current hash with the previously calculated hash (the fixity information)
August 2017: HTML content is downloaded
SHA256 hash: e834 c71a efda 284f e03a 4eed 4e8c b78e a581 537b a888 4aec ec29 bd2d 66cb f521
September 2017: The archived page has been tampered with by changing the value of CO2
October 2017: HTML content is downloaded
SHA256 hash: fc90 88b3 a614 a588 40bd 5387 d93c 16be 824c d2bb b3fa b173 f93f a57d 241a 3790
Hashes are NOT identical → the page has changed!
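The download-hash-compare procedure can be sketched in Python as follows. The function names are hypothetical, and `fetch` needs network access, so the usage below hashes an in-memory byte string instead of a live page.

```python
import hashlib
import urllib.request


def sha256_of(content: bytes) -> str:
    """SHA-256 of the downloaded bytes, as lowercase hex."""
    return hashlib.sha256(content).hexdigest()


def fetch(uri: str) -> bytes:
    """Download the raw bytes of a URI (requires network access)."""
    with urllib.request.urlopen(uri) as response:
        return response.read()


def fixity_unchanged(content: bytes, recorded_hash: str) -> bool:
    """Compare the current hash with a previously recorded hash."""
    normalized = recorded_hash.replace(" ", "").lower()
    return sha256_of(content) == normalized


# Hypothetical page content, recorded at an earlier point in time.
original = b"<html><body>CO2: 400 ppm</body></html>"
recorded = sha256_of(original)

tampered = b"<html><body>CO2: 300 ppm</body></html>"
print(fixity_unchanged(original, recorded))   # True
print(fixity_unchanged(tampered, recorded))   # False
```

In practice `content` would come from `fetch(urim)`, and the recorded hash would have to be stored somewhere the archive itself cannot silently change.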
38. Verifying the fixity of a web page
Hashes are NOT identical → the page has changed!
• Compare the current hash with the previously calculated hash
- Users of web archives cannot easily verify the fixity of mementos.
- Most web archives do not provide access to fixity information.
- Even if fixity information is available, it is not from an independent archive or service.
39. What if an image has changed?
• Computing hashes on only the HTML content will NOT detect changes in embedded images
40. Potential solution: include all resources in hash calculation
A composite memento consists of:
• 201 images
• 19 JavaScript files
• 3 CSS files
• Base HTML file
All of these resources are combined into a single aggregated hash value.
It turns out that it is hard to get repeatable hashes on composite mementos.
• www.gwern.net/Timestamping (Existing tools for generating a hash value on a composite archived page)
• https://web.archive.org/web/20170717184643/https://climate.nasa.gov/vital-signs/carbon-dioxide/
• https://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html
• http://ws-dl.blogspot.com/2017/12/2017-12-11-difficulties-in-timestamping.html
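One possible way to compute a single aggregated hash over a composite memento is sketched below in Python. The aggregation rule (hash each resource, sort by URI, then hash the resulting list) is an assumption chosen for illustration and is not necessarily the rule used in the dissertation; the URIs and bodies are made up.

```python
import hashlib
from typing import Dict


def aggregated_hash(resources: Dict[str, bytes]) -> str:
    """Combine per-resource SHA-256 hashes into one aggregated hash."""
    outer = hashlib.sha256()
    for uri in sorted(resources):  # sort so that ordering does not matter
        inner = hashlib.sha256(resources[uri]).hexdigest()
        outer.update(f"{uri} {inner}\n".encode("utf-8"))
    return outer.hexdigest()


# Hypothetical composite memento: base HTML plus embedded resources.
composite = {
    "https://archive.example.org/web/2017/page.html": b"<html>...</html>",
    "https://archive.example.org/web/2017/style.css": b"body { margin: 0 }",
    "https://archive.example.org/web/2017/chart.png": b"\x89PNG original",
}
changed_image = dict(composite)
changed_image["https://archive.example.org/web/2017/chart.png"] = b"\x89PNG altered"

print(aggregated_hash(composite))
print(aggregated_hash(changed_image))
```

Unlike a hash over the base HTML alone, this value changes when an embedded image changes, but it also changes whenever the set of resources differs between replays.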
41. Archives add banners
• To convey information like the number of mementos and inform users that
what they are viewing is from the archive
• Banners change → different hashes
Replayed in 2016 (43 mementos) vs. replayed in 2017 (49 mementos)
http://webarchive.proni.gov.uk/20150826163149/http://www.ulster.ac.uk
42. Archives transform original content to appropriately
replay mementos in a user’s browser
• Add banners
• Rewrite links to point to the archive, not to
the live web
• Add HTML tags to convey metadata
• Archives use one of the Wayback Machine’s
implementations to replay mementos
• https://archive.org/web/
• https://github.com/iipc/openwayback/wiki
• https://github.com/ikreymer/
PyWb
43. Rewriting original content by archives’ replay tools
An image
A CSS file
The page is captured by the Internet Archive:
https://web.archive.org/web/20190725212938/https://maturban.github.io/playground/index.html
48. We need an archive-aware hashing
function suitable for mementos
[Figure: can an archive yield a repeatable hash value? Obstacles shown: JavaScript, transformation, and security (e.g., Michael’sEvilWayback)]
49. TimeMap
• Defined by the Memento framework (an Internet RFC)
• A TimeMap for an Original Resource is defined “as a resource from which a list of URIs of Mementos of the Original Resource is available.”
[Figure: timeline of the live web and the archive for https://climate.nasa.gov/vital-signs/carbon-dioxide/ at May 2016, April 2017, and April 2018]
50. TimeMap
The TimeMap of the resource climate.nasa.gov/vital-signs/carbon-dioxide/ has three mementos
[Figure: timeline of the live web and the archive at May 2016, April 2017, and April 2018]
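A TimeMap is commonly serialized in the link format, where each entry has a URI in angle brackets followed by attributes such as `rel` and `datetime`. The Python sketch below extracts the memento URIs from such a TimeMap. The sample TimeMap, including the `archive.example.org` host and the exact capture times, is hypothetical and only mirrors the three-memento example above.

```python
import re
from typing import List

# One link entry: <URI> followed by its attributes (up to the next "<").
LINK_PATTERN = re.compile(r'<([^>]+)>\s*;([^<]*)')


def memento_uris(timemap: str) -> List[str]:
    """Return the URIs whose rel attribute contains the token 'memento'."""
    uris = []
    for uri, params in LINK_PATTERN.findall(timemap):
        rel = re.search(r'rel="([^"]*)"', params)
        if rel and "memento" in rel.group(1).split():
            uris.append(uri)
    return uris


ORIGINAL = "https://climate.nasa.gov/vital-signs/carbon-dioxide/"
SAMPLE_TIMEMAP = (
    f'<{ORIGINAL}>; rel="original",\n'
    f'<https://archive.example.org/timemap/link/{ORIGINAL}>; rel="self"; '
    'type="application/link-format",\n'
    f'<https://archive.example.org/web/20160501000000/{ORIGINAL}>; '
    'rel="first memento"; datetime="Sun, 01 May 2016 00:00:00 GMT",\n'
    f'<https://archive.example.org/web/20170401000000/{ORIGINAL}>; '
    'rel="memento"; datetime="Sat, 01 Apr 2017 00:00:00 GMT",\n'
    f'<https://archive.example.org/web/20180401000000/{ORIGINAL}>; '
    'rel="last memento"; datetime="Sun, 01 Apr 2018 00:00:00 GMT"\n'
)

for uri in memento_uris(SAMPLE_TIMEMAP):
    print(uri)
```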
51. Memento aggregators
• Aggregate TimeMaps, of an Original Resource, from multiple archives into a
single TimeMap
• LANL Memento aggregator
⁃ http://mementoweb.org/depot/
• MemGator
⁃ https://github.com/oduwsdl/MemGator
52. Downloading the TimeMap of climate.nasa.gov/vital-signs/carbon-dioxide/ using MemGator
Number of mementos per archive:
• web.archive.org: 4,870
• archive.is: 13
• wayback.archive-it.org: 91
• perma-archives.org: 4
• arquivo.pt: 3
• webarchive.loc.gov: 5
http://timetravel.mementoweb.org/timemap/link/climate.nasa.gov/vital-signs/carbon-dioxide/
53. Outline
Introduction and Motivation
Research Questions
Background
Sampling of Related work
Changes in the Playback of Mementos
The Fixity Information Dissemination Framework
Contributions, Future Work, and Conclusions
54. Sampling of Related Work
Categories: trusted timestamping; security; standards and other systems; identity derived from content
• TRAC (2007), establishing trusted archives: not for playback
• Lerner et al. (2017), vulnerabilities: discovered four vulnerabilities in the Internet Archive’s Wayback Machine
• J. Cushman et al. (2017), more potential threats: demonstrate potential threats in web archives
• Rosenthal et al. (2005), threats: described several threats against digital preservation systems
• Juan Benet (2017), Multihash: self-identifying hashes for IPFS
• OriginStamp, Gipp (2015, 2016), trusted timestamps in Blockchain: not suitable for composite mementos
• T. Kuhn et al. (2014), Trusty URI: a URI that contains a hash value of the content it identifies
• P. Maniatis et al. (2005), distributed copies of archived resources (LOCKSS): the scope and content are narrowly defined
• opentimestamps.org (2017), OpenTimestamps: not suitable for composite mementos
• chainpoint.org (2017), Chainpoint: not suitable for composite mementos
• Collomosse et al. (2018), ARCHANGEL: for mementos, but not suitable for composite mementos
• Hamano et al. (2005), Git, distributed version control: uses hash values to create commit identifiers
• Web archives, such as webcitation.org and archive.is, use hash values in URIs to identify mementos
• Brunelle (2010), live web leakage in archives: describes how live web leakage changes the representation of mementos
• Rosenthal et al. (2005), requirements for establishing trusted digital preservation systems: not for playback
• OAIS (2012), Reference Model for an Open Archival Information System (OAIS): not for playback
55. Outline
Introduction and Motivation
Research Questions
Background
Sampling of Related work
Changes in the Playback of Mementos
The Fixity Information Dissemination Framework
Contributions, Future Work, and Conclusions
56. RQ1: Can we identify and quantify the types of changes on the playback
of mementos that prevent generating repeatable fixity information?
Identifying and quantifying changes on the playback of mementos
1. Collect a dataset of mementos
2. Download rewritten/raw composite mementos
3. Generate aggregated hash values
4. Identify changes
5. Present results
(Downloads repeated 39 times)
M. Aturban, M. L. Nelson, and M. C. Weigle, “It is hard to compute fixity on archived web pages,” in Proceedings of the Workshop on Web Archiving and Digital Libraries (WADL) held in conjunction with the 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2018.
57. Collecting 16,627 Mementos (Step 1: Collect a dataset of mementos)
M. Aturban, M. L. Nelson, M. C. Weigle, M. Klein, and H. Van de Sompel, “Collecting 16K archived web pages from 17 public web archives,” Tech. Rep. arXiv:1905.03836, May 2019.
Sources of URI-Rs:
• The HTTP Archive: httparchive.org
• The Web Archives for Historical Research: uwaterloo.ca/web-archive-group/
• Not all mementos are created equal: measuring the impact of missing resources, J. Brunelle et al. (DOI: doi.org/10.1007/s0079)
• The Moz Top 500 Websites: moz.com/top500
59. Step 2: Download rewritten/raw composite mementos
Extract all URI-Ms by reading WARC records (rewritten.warc) using the tool warcio
https://github.com/webrecorder/warcio
60. Step 2: Download rewritten/raw composite mementos
Requesting the raw mementos (raw.warc) using id_
✓ = 200 OK (or archival 4xx/5xx)
x = Archive-specific resources
x = 3xx Redirect
[Figure: list of requested resources, each marked ✓ or x]
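Requesting the raw version of a memento with `id_` amounts to placing that modifier directly after the 14-digit timestamp in the URI-M. The Python sketch below shows one way to derive the raw URI-M from a rewritten one. The helper name is hypothetical, and whether a particular archive honors `id_` is an assumption that has to be checked per archive.

```python
import re

# A 14-digit timestamp path segment, optionally followed by an existing
# two-letter modifier such as im_ or id_.
TIMESTAMP = re.compile(r"/(\d{14})(?:[a-z]{2}_)?/")


def to_raw_urim(urim: str) -> str:
    """Return the URI-M with id_ placed after the first timestamp."""
    return TIMESTAMP.sub(lambda m: f"/{m.group(1)}id_/", urim, count=1)


rewritten = ("https://web.archive.org/web/20190725212938/"
             "https://maturban.github.io/playground/index.html")
print(to_raw_urim(rewritten))
```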
62. Identifying types of changes on the playback of mementos (Step 4: Identify changes)
Set:
One or more resources in the set comprising a composite memento has changed
Status code:
The HTTP status code of one or more resources comprising a composite memento has changed
HTTP Headers:
One or more HTTP response headers, that we do not expect to change, has changed
Representation:
The returned HTTP entity body of one or more resources comprising a composite memento has changed
URI-M:
One or more resources in the set comprising a composite memento redirects to a different memento with a different Memento-DateTime
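The five change types can be expressed as a comparison between two downloads of the same composite memento. The Python sketch below is a simplified illustration: the `Resource` record, the choice of which headers count as "not expected to change", and the sample data are all assumptions made for this example rather than the study's actual tooling.

```python
from typing import Dict, NamedTuple, Set, Tuple


class Resource(NamedTuple):
    status: int                # HTTP status code
    headers: Dict[str, str]    # response headers (lowercase names)
    body_hash: str             # hash of the HTTP entity body
    final_urim: str            # URI-M returned after following redirects


def change_types(before: Dict[str, Resource],
                 after: Dict[str, Resource],
                 stable_headers: Tuple[str, ...] = ("content-type",)) -> Set[str]:
    """Return the types of changes observed between two downloads."""
    changes = set()
    if set(before) != set(after):
        changes.add("Set")
    for urim in set(before) & set(after):
        old, new = before[urim], after[urim]
        if old.status != new.status:
            changes.add("Status code")
        if any(old.headers.get(h) != new.headers.get(h) for h in stable_headers):
            changes.add("HTTP Headers")
        if old.body_hash != new.body_hash:
            changes.add("Representation")
        if old.final_urim != new.final_urim:
            changes.add("URI-M")
    return changes


jpeg = {"content-type": "image/jpeg"}
download_1 = {
    "urim-a": Resource(200, jpeg, "hash-1", "urim-a"),
    "urim-b": Resource(404, jpeg, "hash-2", "urim-b"),
}
download_2 = {
    "urim-a": Resource(200, jpeg, "hash-1", "urim-a"),
    "urim-b": Resource(200, jpeg, "hash-3", "urim-b"),
}
print(sorted(change_types(download_1, download_2)))
```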
63. Set:
One or more resources in the set comprising a composite memento has changed
https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
Reload # 1
66. A resource selected randomly by JavaScript
Reload # 3
function random_imglink(){
myimages[1]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannerbluemnt.jpg";
myimages[2]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannereagle.jpg";
myimages[3]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannertiger.jpg";
var ry=Math.floor(Math.random(1)*myimages.length)
if (ry==0)
ry=1
document.write('<a href='+'"'+imagelinks[ry]+'"'+'><img src="'+myimages[ry]+'" border="0" alt="The Open Spaces Blog. A Talk on the Wild Side. Click to Read"></a>')
}
https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
68. Status code:
The HTTP status code of one or more resources
comprising a composite memento has changed
https://web.archive.org/web/20141209193553im/http://wac.450F.edgecastcdn.net/80450F/noisecreep.com/files/2009/06/aarona042209eb200.jpg
https://web.archive.org/web/20141209193553/http://noisecreep.com/aaron-harris-of-isis-talks-twitter/
Status code changed: 404 → 200
WARC/1.0
WARC-Type: response
WARC-Target-URI: https://web.archive.org/save/_embed/http://wac.450F.edgecastcdn.net/80450F/noisecreep.com/files/2009/06/aaron_a042209eb_200.jpg
WARC-Date: 2017-11-18T10:33:14Z
…
HTTP/1.1 200 OK
Date: Sat, 18 Nov 2017 10:32:51 GMT
Content-Type: image/jpeg
Content-Location: /web/20171118103250/http://wac.450F.edgecastcdn.net/80450F/noisecreep.com/files/2009/06/aaron_a042209eb_200.jpg
Observations change archives
69. Headers:
One or more HTTP response headers, that we do
not expect to change, has changed
https://web.archive.org/web/20071111211818/http://images.sohu.com:80/chat_online/market/sohu/140140-1.html
Replayed in 2017
Replayed in 2018
70. Representation:
The returned HTTP entity body of one or more resources
comprising a composite memento has changed
curl -si http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)"
Date: Tue, 15 May 2018 21:00:45 GMT
<a href="/cdn-cgi/l/email-protection#28175b5d4a424d4b5c15690854086905720861464c4d5008474e087d067b06086f475e4d5a46454d465c086c4d58495a5c454d465c5b0849464c08694f4d464b414d5b0e494558134a474c5115405c5c585b1207075f5f5f065d5b49064f475e074e4d4c4d5a494405494f4d464b414d5b0749" …
curl -si http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)"
Date: Tue, 15 May 2018 21:00:50 GMT
<a href="/cdn-cgi/l/email-protection#68571b1d0a020d0b1c55294814482945324821060c0d1048070e483d463b46482f071e0d1a06050d061c482c0d18091a1c050d061c1b4809060c48290f0d060b010d1b4e090518530a070c1155001c1c181b5247471f1f1f461d1b09460f071e470e0d0c0d1a090445090f0d060b010d1b4709" …
When requesting the raw version, a third-party service (Cloudflare) modifies the HTML
78. URI-M redirect example
December 12, 2017: Requesting URI-M1
December 25, 2017: Requesting URI-M1 → 302 Redirect → URI-M2 (URI-M1 was NOT available; a different image is returned)
URI-M1 = perma-archives.org/warc/20170101182814id_/http://umich.edu/includes/image/type/gallery/id/113/name/ResearchDIL19Aug14_DM%28136%29.jpg/width/152/height/152/mode/minfit/
URI-M2 = perma-archives.org/warc/20170619145458id_/http://umich.edu/includes/image/type/gallery/id/113/name/ResearchDIL19Aug14_DM%28136%29.jpg/width/152/height/152/mode/minfit/
79. 88.45% of 16,627 mementos produce at
least two different hashes
80. One in eight mementos (11.55%) always produce the
same hash and one in six mementos (16.06%) produce
a different hash on each replay
blue = 11.55% (1,920 mementos)
red = 16.06% (2,670 mementos)
81. The types of changes affecting mementos after each download
83. Because most mementos produce multiple aggregated
hash values over time, we introduce two additional
hashing techniques
• URI-M-based hashing technique
Only URI-Ms of mementos comprising a composite
memento are used in the hash calculation
• Entity-based hashing technique
Only HTTP entity bodies of mementos comprising a composite
memento are used in the hash calculation
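The difference between the two techniques can be sketched in Python. The exact aggregation rule is an assumption (sorted inputs fed into SHA-256); the sketch only shows which part of each resource contributes to each hash. The URI-Ms and bodies are hypothetical.

```python
import hashlib
from typing import List, Tuple

# A composite memento as a list of (URI-M, HTTP entity body) pairs.
Composite = List[Tuple[str, bytes]]


def urim_based_hash(resources: Composite) -> str:
    """Hash only the URI-Ms of the resources."""
    h = hashlib.sha256()
    for urim in sorted(urim for urim, _ in resources):
        h.update(urim.encode("utf-8") + b"\n")
    return h.hexdigest()


def entity_based_hash(resources: Composite) -> str:
    """Hash only the HTTP entity bodies of the resources."""
    h = hashlib.sha256()
    body_hashes = sorted(hashlib.sha256(body).hexdigest() for _, body in resources)
    for body_hash in body_hashes:
        h.update(body_hash.encode("ascii") + b"\n")
    return h.hexdigest()


first = [
    ("https://archive.example.org/web/20170101000000/http://example.com/", b"<html>"),
    ("https://archive.example.org/web/20170101000001/http://example.com/a.png", b"PNG"),
]
# Same bodies, but the image is now served from a different URI-M.
second = [
    first[0],
    ("https://archive.example.org/web/20170619000000/http://example.com/a.png", b"PNG"),
]

print(urim_based_hash(first) == urim_based_hash(second))      # False
print(entity_based_hash(first) == entity_based_hash(second))  # True
```

Separating the two hashes makes it possible to tell whether a replay differs because different URI-Ms were requested or because the returned content itself differs.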
90. Complete hashing on mementos from archive.org
New hash values calculated in each download (median = 871 hash values)
Only 47% of the total number of hash values are seen in Download 1
91. URI-M-based hashing on mementos from archive.org
New URI-Ms requested in each download (median = 806 URI-Ms)
Only 50% of the total number of URI-Ms are requested in Download 1
92. Entity-based hashing on mementos from archive.org
New entity bodies observed in each download (median = 116 entity bodies)
About 80% of the total number of entity bodies are seen in Download 1
93. RQ2: Given the types of changes identified in the playback of mementos,
what steps/requirements should we consider in order to generate
repeatable fixity information (defining an archive-aware fixity-based
approach)?
Research questions
94. Guidelines for generating fixity information
on the playback of mementos
• We define these guidelines based on results from our 14-month study
95. Outline
Introduction and Motivation
Research Questions
Background
Sampling of Related work
Changes in the Playback of Mementos
The Fixity Information Dissemination Framework
Contributions, Future Work, and Conclusions
96. RQ3: How can we store and retrieve fixity information independently from the
web archives from which the associated mementos are preserved?
M. Aturban, S. Alam, M. L. Nelson, and M. C. Weigle, “Archive Assisted Archival
Fixity Verification Framework,” in Proceedings of the 19th ACM/IEEE Joint
Conference on Digital Libraries (JCDL), 2019.
97. Two approaches for disseminating and verifying
the fixity of archived web pages
(using web archives to monitor web archives)
• The Atomic approach
• Generate a manifest file (a JSON file containing the fixity information) for each
memento
• Publish the manifest at a well-known location
• Disseminate the published manifest to several archives
• The Block approach
• Batch together fixity information of multiple mementos in a single binary-
searchable file (or block)
• Publish the block at a well-known location
• Disseminate the published block to several archives
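As a rough illustration of the Atomic approach, the Python sketch below builds a JSON manifest for one memento and derives the location at which it would be published. The manifest field names (`urim`, `resources`) are hypothetical and not the framework's actual schema; the server prefix follows the manifest.ws-dl.cs.odu.edu examples in these slides.

```python
import hashlib
import json
from typing import Dict

MANIFEST_SERVER = "https://manifest.ws-dl.cs.odu.edu/manifest/"


def build_manifest(urim: str, resources: Dict[str, bytes]) -> str:
    """Return a JSON manifest with one SHA-256 hash per resource."""
    manifest = {
        "urim": urim,  # hypothetical field name
        "resources": {  # hypothetical field name
            uri: "sha256:" + hashlib.sha256(body).hexdigest()
            for uri, body in sorted(resources.items())
        },
    }
    return json.dumps(manifest, sort_keys=True, indent=2)


def manifest_uri(urim: str) -> str:
    """Well-known location of the manifest for a given memento."""
    return MANIFEST_SERVER + urim


urim = "https://web.archive.org/web/20171115140705/http://rln.fm/"
manifest_json = build_manifest(urim, {urim: b"<html>...</html>"})
print(manifest_uri(urim))
print(manifest_json)
```

The published manifest URI is what would then be submitted to several archives, so that each of them holds an independent copy of the fixity information.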
99. Atomic approach:
Push manifests into multiple archives
• In this example, the memento is in the Internet Archive (IA) and its fixity
information is disseminated to four archives including IA
• An attacker would have to hack a majority of 4 domains (archives)
Manifest: manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
Archived copies of the manifest:
• https://archive.is/20181224093334/http://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
• https://web.archive.org/web/20181224093355/http://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
• https://perma-archives.org/warc/20181224093354/http://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
• http://www.webcitation.org/74tvdsyxe
100. Block approach:
Batch together fixity information of multiple
mementos in a single file (block)
• Adding additional metadata (e.g., created_at, fields, …)
• The hash of the previous block must be added
!context ["http://oduwsdl.github.io/contexts/fixity"]
!fields {keys: ["surt"]}
!id {uri: "https://manifest.ws-dl.cs.odu.edu/"}
!meta {created_at: "20190111181327"}
!meta {prev_block:"sha256:d4eb1190f9aaae9542...845b632eb2b3f4f098a34144d"}
!meta {type: "FixityBlock"}
org,archive,web)/web/19961022175434/http://search.com
org,archive,web)/web/19961219082428/http://sho.com
org,archive,web)/web/19961223174001/http://reference.com
…
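Two properties of a block can be sketched in Python: chaining to the previous block by its hash, and binary search over records sorted by their SURT key. The helper names are hypothetical, the record payload after the key is made up (the slide elides it), and the sketch assumes records are sorted and that the key is the first space-separated field.

```python
import bisect
import hashlib
from typing import Optional


def prev_block_line(previous_block: bytes) -> str:
    """Metadata line that links a new block to the previous one."""
    digest = hashlib.sha256(previous_block).hexdigest()
    return f'!meta {{prev_block: "sha256:{digest}"}}'


def find_record(block_text: str, surt_key: str) -> Optional[str]:
    """Binary-search the sorted records of a block for a SURT key."""
    records = [line for line in block_text.splitlines()
               if line and not line.startswith("!")]
    keys = [line.split(" ", 1)[0] for line in records]
    i = bisect.bisect_left(keys, surt_key)
    if i < len(keys) and keys[i] == surt_key:
        return records[i]
    return None


previous = b"...contents of the previous block..."
block = "\n".join([
    '!fields {keys: ["surt"]}',
    prev_block_line(previous),
    'org,archive,web)/web/19961022175434/http://search.com {"hash": "h1"}',
    'org,archive,web)/web/19961219082428/http://sho.com {"hash": "h2"}',
    'org,archive,web)/web/19961223174001/http://reference.com {"hash": "h3"}',
])
print(find_record(block, "org,archive,web)/web/19961219082428/http://sho.com"))
```

Because every block embeds the hash of its predecessor, altering an older block would invalidate the `prev_block` value stored in all later blocks.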
101. Block approach:
Push the blocks entrypoint into multiple archives
manifest.ws-dl.cs.odu.edu/blocks
https://web.archive.org/web/20190121054059/https://manifest.ws-dl.cs.odu.edu/blocks/7bbf757046ac0a0a60015a1cb847c3189160d18c809b210073822df157609e01
• Will result in archiving the latest published block as well
https://perma.cc/8YG3-X7KN
102. Three steps to verify the fixity of a memento
1. Discover a manifest/block
• In the Atomic approach, this includes discovering the archived manifests
2. Compute the current fixity information of the memento
3. Compare the current fixity information with the discovered manifests/block
An example of discovering the latest manifest in the Archival Fixity server for the memento https://web.archive.org/web/20171115140705/http://rln.fm/:
$ curl -sI https://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20171115140705/http://rln.fm/ | egrep -i "(HTTP/|^location:)"
HTTP/2 302
location: https://manifest.ws-dl.cs.odu.edu/manifest/20181212074423/bd669de8835e38d54651fe9d04709515beec0c727db82a5366f4bc2506e103d8/https://web.archive.org/web/20171115140705/http://rln.fm/
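Step 3 can be sketched as a comparison of the current hash against the hashes found in the archived copies of the manifest. The majority rule used below is an assumption, motivated by the earlier remark that an attacker would have to compromise a majority of the archives; the framework's actual decision rule may differ, and the hash strings are placeholders.

```python
from collections import Counter
from typing import List, Optional


def majority_hash(archived_hashes: List[str]) -> Optional[str]:
    """Return the hash reported by a strict majority of archives, if any."""
    if not archived_hashes:
        return None
    value, count = Counter(archived_hashes).most_common(1)[0]
    return value if count > len(archived_hashes) / 2 else None


def verify_fixity(current_hash: str, archived_hashes: List[str]) -> bool:
    """True if the current hash matches what most archives recorded."""
    agreed = majority_hash(archived_hashes)
    return agreed is not None and agreed == current_hash


# Hypothetical hashes read from four archived copies of a manifest;
# one copy has been tampered with.
copies = ["abc123", "abc123", "abc123", "ffff00"]
print(verify_fixity("abc123", copies))  # True
print(verify_fixity("ffff00", copies))  # False
```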
103. Evaluation
• A data set of 16K mementos from 17 public web archives
• For each memento, we generated and disseminated a manifest to 3 archives
- The median size of a composite memento is 1143.85 KB
- The median size of a manifest file is 15.29 KB, which represents 1.33% of the size of a composite memento
105. The Block approach creates fewer resources
in archives than the Atomic approach
• Given a collection of N = 16,608 mementos
• Katomic = 3 web archives
• Kblock = 2 web archives
• The selected block size B = 1038 records per block
• The total number of resources created in the archives by
each approach:
Atomic
(N ∗ Katomic) = 49,824
Block
(Kblock ∗ (N/B)) = 32
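The two totals can be reproduced with a few lines of Python using the values stated above. Rounding the number of blocks up with `math.ceil` is an assumption for collections that do not divide evenly; here N is an exact multiple of B.

```python
import math

N = 16_608        # mementos in the collection
K_ATOMIC = 3      # archives receiving each manifest
K_BLOCK = 2       # archives receiving each block
B = 1_038         # records per block

atomic_resources = N * K_ATOMIC
block_resources = K_BLOCK * math.ceil(N / B)

print(atomic_resources)  # 49824
print(block_resources)   # 32
```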
106. It takes 1.09X and 4.54X longer to disseminate a manifest to perma.cc and
archive.org, respectively, than to archive.is
107. It takes 9.2X longer to disseminate a block to archive.org than to
perma.cc
108. The Block approach is 1.05X faster than the Atomic approach at
verifying the fixity of mementos
Discovering and downloading manifest files
in the Atomic/Block approaches per archive
Verifying mementos by both approaches
109. Outline
Introduction and Motivation
Research Questions
Background
Sampling of Related work
Changes in the Playback of Mementos
The Fixity Information Dissemination Framework
Contributions, Future Work, and Conclusions
110. Contributions
• RQ1
- Four methods for collecting mementos (arXiv’19)
- Identified and quantified types of changes on the playback of mementos (JCDL/WADL’18)
- Showed examples of missing mementos
• RQ2
- The two hashing techniques (URI-M-based and entity-based)
- The archive-aware hashing function (arXiv’17)
• RQ3
- ArchiveNow, a tool for disseminating web pages in public web archives (JCDL’18)
- A framework for disseminating fixity information to web archives (JCDL’19)
111. Future Work
Investigating generating fixity information using web packaging
• Web packaging is an emerging standard
• It should allow archives to deliver a composite memento in a single HTTP
response or in a self-contained file
• Using web packaging we can download a composite memento,
packaged in a bundle, with a single HTTP request. This should reduce
playback-related changes, such as transient errors and URI-M changes.
112. Conclusions
• Conventional hashing techniques are not suitable for replayed archived web
pages.
• We defined an archive-aware hashing function that consists of multiple
guidelines (based on our 14-month study on 16K mementos)
• Fixity information includes
(1) Multiple aggregated hash values generated using different hashing
techniques (URI-M-based and entity-based hashing)
(2) Multiple hash values generated on each resource comprising a
composite memento
• We introduce two approaches for disseminating fixity information to web
archives
113. The archive-aware hashing function
116. Collecting 16,627 Mementos
Step 1: Collect a dataset of mementos
Sources of URI-Rs:
• The HTTP Archive: httparchive.org
• The Web Archives for Historical Research: uwaterloo.ca/web-archive-group/
• Not all mementos are created equal: measuring the impact of missing resources, J. Brunelle et al. (DOI: doi.org/10.1007/s0079)
• The Moz Top 500 Websites: moz.com/top500
M. Aturban, M. L. Nelson, M. C. Weigle, M. Klein, and H. Van de Sompel, “Collecting 16K archived web pages from 17 public web archives,” Tech. Rep. arXiv:1905.03836, May 2019.
117. An example of a missing memento
• Mementos from the National Library of Ireland (NLI) collection have been moved from collections.internetmemory.org/nli/ to wayback.archive-it.org/10702/
• The URI-M http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/ was 200 OK in September 2018
• The URI-M http://wayback.archive-it.org/10702/20121223031837/http://www2008.org/ is now 404 Not Found
118. Step 5: Present results
- The heatmap shows archive-level changes by comparing consecutive downloads of mementos
- It visualizes the overall performance of each archive
- It identifies points in time where major changes occur
119. URI-Rs with different path lengths and
URI-Ms with different Memento-Datetime
Step 1: Select a dataset of mementos
Figure panels: URI-Rs per path length; URI-Ms per year
M. Aturban, M. L. Nelson, M. C. Weigle, M. Klein, and H. Van de Sompel, “Collecting 16K archived web pages from 17 public web archives,” Tech. Rep. arXiv:1905.03836, May 2019.
121. Downloading the ZIP file of a memento at three different times. Each time the
archive refers to itself differently in the index.html in the ZIP file.
http://archive.is/download/BRWpm.zip
http://archive.is/BRWpm
Representation: The returned HTTP entity body of one or more
resources comprising a composite memento has changed
Step 4: Identify changes
122. Downloading the ZIP file of a memento at three different times. Each time the
archive refers to itself differently in the index.html in the ZIP file.
http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/
http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
Representation (transient errors)
Step 4: Identify changes
125. The Block approach performs 4.46x faster than the
Atomic approach in verifying the fixity of mementos
• The fixity verification time includes:
- Discovering manifests
- Computing current fixity information
- Downloading the archived manifests
- Comparing results
• On average, the verification time of a memento is 6.65 seconds by the Atomic approach and 1.49 seconds by the Block approach
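The 4.46x figure in the slide title follows directly from the two average times. A minimal worked check, with an illustrative function name:

```python
def speedup(baseline_seconds: float, improved_seconds: float) -> float:
    """Ratio of the slower (baseline) time to the faster (improved) time."""
    return baseline_seconds / improved_seconds


# Average per-memento verification times reported on the slide.
atomic_seconds = 6.65
block_seconds = 1.49

print(f"{speedup(atomic_seconds, block_seconds):.2f}x")  # 4.46x
```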
126. A manifest file example
{
  "@context": "http://manifest.ws-dl.cs.odu.edu/",
  "created": "Sun, 23 Dec 2018 11:43:55 GMT",
  "@id": "http://manifest.ws-dl.cs.odu.edu/manifest/20181223114355/c6ad485819abbe20e37c0632843081710c95f94829f59bbe3b6ad3251d93f7d2/https://web.archive.org/web/20181219102034/https://2019.jcdl.org/",
  "uri-r": "https://2019.jcdl.org/",
  "uri-m": "https://web.archive.org/web/20181219102034/https://2019.jcdl.org/",
  "memento-datetime": "Wed, 19 Dec 2018 10:20:34 GMT",
  "http-headers": {
    "Content-Type": "text/html; charset=UTF-8",
    "X-Archive-Orig-date": "Wed, 19 Dec 2018 10:20:36 GMT",
    "X-Archive-Orig-link": "<https://2019.jcdl.org/wp-json/>; rel=\"https://api.w.org/\"",
    "Preference-Applied": "original-links, original-content"
  },
  "hash-constructor": "(curl -s '$uri-m' && echo -n '$Content-Type $X-Archive-Orig-date $X-Archive-Orig-link') | tee >(sha256sum) >(md5sum) >/dev/null | cut -d ' ' -f 1 | paste -d':' <(echo -e 'md5\nsha256') - | paste -d' ' - -",
  "hash": "md5:969d7aba4c16444a6544bdc39eefe394 sha256:c68a215eb1c3edbf51f565b9a87f49646456369e51791a86106a6667630737a6"
}
• The manifest includes how the hashes are computed
• Hashes are computed on only the base HTML file
• Fixity is also computed on things that should not change, such as certain original HTTP response headers
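The "hash-constructor" in the manifest can be approximated in a few lines. This is a sketch under stated assumptions, not the framework's implementation: it takes an already-downloaded entity body plus a dictionary of header values, appends the selected header values joined by single spaces (as the `echo -n` part of the shell command does), and formats the MD5 and SHA-256 digests the way the "hash" field appears. The function name and parameters are hypothetical.

```python
import hashlib


def manifest_hash(entity_body: bytes, headers: dict, header_names: list) -> str:
    """Hash the entity body followed by selected header values.

    Assumption: header values are UTF-8 encoded and joined by single spaces,
    mirroring the shape of the shell hash-constructor on the slide.
    """
    suffix = " ".join(headers[name] for name in header_names).encode("utf-8")
    data = entity_body + suffix
    md5_hex = hashlib.md5(data).hexdigest()
    sha256_hex = hashlib.sha256(data).hexdigest()
    return f"md5:{md5_hex} sha256:{sha256_hex}"


# Placeholder body; a real check would use the bytes returned for the URI-M.
example_headers = {
    "Content-Type": "text/html; charset=UTF-8",
    "X-Archive-Orig-date": "Wed, 19 Dec 2018 10:20:36 GMT",
}
print(manifest_hash(b"<html></html>", example_headers,
                    ["Content-Type", "X-Archive-Orig-date"]))
```

Verifying a memento then amounts to recomputing this string from the current playback and comparing it with the "hash" value stored in the published manifest.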
129. Memento framework
• Uses time as a dimension to access the web by relating current web
resources to their prior states
• Is supported by most public web archives including the Internet
Archive
http://mementoweb.org/guide/quick-intro/
131. URI-M-based hashing on mementos from archive-it.org
New URI-Ms are requested in each download
Figure panel: Actual results
132. URI-M-based hashing on mementos from archive-it.org
New URI-Ms are requested in each download
Figure panels: Expected results; Actual results
133. Atomic approach (step 1):
Push a web page into multiple archives
Web page: https://2019.jcdl.org/
Mementos created:
https://archive.is/20181224085310/https://2019.jcdl.org/
https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
https://perma-archives.org/warc/20181224085330/https://2019.jcdl.org/
http://www.webcitation.org/74tsy6pU0
This results in creating multiple mementos of the web page
Archive Assisted Archival Fixity Verification Framework ∙ JCDL 2019 ∙ June 4, 2019 ∙ Urbana-Champaign, Illinois
134. Atomic approach (steps 2 & 3):
For each memento, compute fixity “manifest” and publish it on the web at a well-known location (Archival Fixity server)
manifest.ws-dl.cs.odu.edu/manifest/https://archive.is/20181224085310/https://2019.jcdl.org/
manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
manifest.ws-dl.cs.odu.edu/manifest/https://perma-archives.org/warc/20181224085330/https://2019.jcdl.org/
manifest.ws-dl.cs.odu.edu/manifest/http://www.webcitation.org/74tsy6pU0
• In this example https://manifest.ws-dl.cs.odu.edu is the Archival Fixity server
• Actual URIs to manifests can be a bit more complex using “Trusty URIs”: http://ws-dl.blogspot.com/2017/01/2017-01-15-summary-of-trusty-uris.html
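The manifest locations on this slide share one pattern: the Archival Fixity server, the path /manifest/, then the URI-M. A minimal sketch of that generic form follows; the function name is hypothetical, and the longer Trusty URI form (which embeds a datetime and a hash) is not covered.

```python
def manifest_lookup_uri(uri_m: str,
                        server: str = "https://manifest.ws-dl.cs.odu.edu") -> str:
    """Build the generic manifest URI for a memento (URI-M).

    Assumption: the URI-M is appended verbatim, as shown on the slide.
    """
    return f"{server.rstrip('/')}/manifest/{uri_m}"


print(manifest_lookup_uri(
    "https://web.archive.org/web/20181224085329/https://2019.jcdl.org/"))
```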
136. Block approach (step 2):
Publish the block file at the Archival Fixity server
manifest.ws-dl.cs.odu.edu/blocks
The blocks entrypoint always redirects to the latest published block
137. The dissemination/download time varies
from one archive to another