This document describes an archive-assisted framework for verifying the fixity of archived web pages. It presents two approaches, Atomic and Block, for generating fixity information for archived web pages (mementos) and disseminating it to multiple web archives. In the Atomic approach, a manifest containing cryptographic hashes is generated for each memento, published, and then archived. In the Block approach, fixity information for multiple mementos is batched into a single binary-searchable file (a block). Both approaches verify the integrity of a memento by comparing its current cryptographic hash with previously published hashes. The document outlines the steps for generating, publishing, disseminating, discovering, and verifying fixity information, to help establish that archived web content has remained unaltered over time.
1. Archive Assisted Archival Fixity Verification Framework
JCDL 2019
Urbana-Champaign, Illinois
June 2-6, 2019
Mohamed Aturban, Sawood Alam, Michael L. Nelson, and Michele C. Weigle
Old Dominion University
Department of Computer Science
Norfolk, Virginia 23529 USA
3. The Internet Archive allows us to view previous versions (mementos) of that page
Archive Assisted Archival Fixity Verification Framework · JCDL 2019 · June 4, 2019 · Urbana-Champaign, Illinois · @WebSciDL
6. The page is in other web archives
For a full list of public web archives, see: http://labs.mementoweb.org/aggregator_config/archivelist.xml
Typical archive URI construction:
wayback.example.org/SomeString/climate.nasa.gov/vital-signs/carbon-dioxide
Archives holding the page:
• webarchive.loc.gov/all/*/climate.nasa.gov/vital-signs/carbon-dioxide/
• arquivo.pt/wayback/*/climate.nasa.gov/vital-signs/carbon-dioxide/
• perma-archives.org/warc/*/climate.nasa.gov/vital-signs/carbon-dioxide/
• archive.is/climate.nasa.gov/vital-signs/carbon-dioxide/
• wayback.archive-it.org/all/*/climate.nasa.gov/vital-signs/carbon-dioxide/
• web.archive.org/web/*/climate.nasa.gov/vital-signs/carbon-dioxide/
Mementos available (figure labels; the pairing with individual archives is not preserved in the text): 4,172, 62, 3, 13, 3, 39
7. The web page is archived by Michael’s Evil Wayback in July 2017
Michaelsevilwayback/web/20170717185130/https://climate.nasa.gov/vital-signs/carbon-dioxide/
8. Replaying the same memento in October 2017, we got a different CO2 value
Michaelsevilwayback/web/20170717185130/https://climate.nasa.gov/vital-signs/carbon-dioxide/
9. Which one is the real memento?
July 2017 vs. October 2017, both replayed from the same URI:
Michael_evil_wayback/web/20170717185130/https://climate.nasa.gov/vital-signs/carbon-dioxide/
• How can we ensure that a memento has remained unaltered since the time it was captured by the archive?
10. Cryptographic hashes to create fixity information
• Common hash algorithms (e.g., MD5, SHA256):
A small change in the input → a large change in the output
SHA256(HTML) = 9801 1510 87e1 6d6b ddb9 e6b0 09fd b723 abe5 1fea b548 0914 a130 6325 5ae4 6caa
SHA256(HTML) = 5d4d b590 605c 9023 000d 6622 6004 534f e84a 5549 d535 f91e cdf4 4952 5c1a 37cf
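The "small change in, large change out" property can be illustrated with a minimal Python sketch. The two page bodies below are hypothetical stand-ins that differ in a single character; they are not the actual NASA page, so the digests will not match the ones shown on the slide.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as 64 hexadecimal characters."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical page bodies that differ in one character (the CO2 reading).
original = b"<html><body>CO2: 406 ppm</body></html>"
altered = b"<html><body>CO2: 405 ppm</body></html>"

h1 = sha256_hex(original)
h2 = sha256_hex(altered)

# Count how many hex digits differ position by position.
differing = sum(a != b for a, b in zip(h1, h2))

print(h1)
print(h2)
print(f"{differing} of {len(h1)} hex digits differ")
```

Running the sketch prints two unrelated-looking digests even though the inputs differ by one byte, which is what makes a published hash usable as fixity information.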
11. Compute a hash value on the downloaded HTML
$ curl -s https://climate.nasa.gov/vital-signs/carbon-dioxide/ | shasum -a 256
17710fd38d908a3cd124510f26adaec67e57e3f1d3aec1209c4ad4efbe2c035d
• curl -s: download the page
• shasum -a 256: compute the SHA256 hash
12. Compare the current hash with the previously calculated hash to verify the fixity
Timeline:
• August 2017: HTML content is downloaded; SHA256 hash: e834 c71a efda 284f e03a 4eed 4e8c b78e a581 537b a888 4aec ec29 bd2d 66cb f521
• September 2017: the archived page has been tampered with by changing the value of CO2
• October 2017: HTML content is downloaded; SHA256 hash: fc90 88b3 a614 a588 40bd 5387 d93c 16be 824c d2bb b3fa b173 f93f a57d 241a 3790
Hashes are NOT identical → the page has changed!
http://ws-dl.blogspot.com/2017/12/2017-12-11-difficulties-in-timestamping.html
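The comparison step can be sketched in a few lines of Python. This is only an illustration: the function name `fixity_unchanged` and the page bodies are hypothetical, and the recorded hash is computed inside the example rather than taken from the slide.

```python
import hashlib


def fixity_unchanged(current_html: bytes, recorded_sha256: str) -> bool:
    """Compare the SHA-256 of the current content with a recorded hash.

    The recorded hash may be grouped with spaces (as on the slide)
    and may use upper or lower case.
    """
    current = hashlib.sha256(current_html).hexdigest()
    recorded = recorded_sha256.replace(" ", "").lower()
    return current == recorded


# Hypothetical content captured at two points in time.
first_download = b"<html><body>CO2: 406 ppm</body></html>"
second_download = b"<html><body>CO2: 306 ppm</body></html>"

# Hash recorded at the time of the first download.
recorded = hashlib.sha256(first_download).hexdigest()

print(fixity_unchanged(first_download, recorded))
print(fixity_unchanged(second_download, recorded))
```

The first call reports that the content still matches the recorded hash; the second reports a mismatch because the body was altered after the hash was recorded.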
13. Two approaches for verifying the fixity of archived web pages
• The Atomic approach
  • Generate a manifest file (a JSON file containing the fixity information) for each memento
  • Publish the manifest at a well-known web location
  • Disseminate the published manifest to several archives
• The Block approach
  • Batch together fixity information of multiple mementos in a single binary-searchable file (or block)
  • Publish the block at a well-known web location
  • Disseminate the published block to several archives
(Use web archives to monitor web archives)
14. Atomic approach (step 1): Push a web page into multiple archives
Original page: https://2019.jcdl.org/
Mementos created:
• https://archive.is/20181224085310/https://2019.jcdl.org/
• https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
• https://perma-archives.org/warc/20181224085330/https://2019.jcdl.org/
• http://www.webcitation.org/74tsy6pU0
This results in creating multiple mementos of the web page
15. Atomic approach (steps 2 & 3): For each memento, compute a fixity “manifest” and publish it on the web at a well-known Archival Fixity server
• manifest.ws-dl.cs.odu.edu/manifest/https://archive.is/20181224085310/https://2019.jcdl.org/
• manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
• manifest.ws-dl.cs.odu.edu/manifest/https://perma-archives.org/warc/20181224085330/https://2019.jcdl.org/
• manifest.ws-dl.cs.odu.edu/manifest/http://www.webcitation.org/74tsy6pU0
• In this example https://manifest.ws-dl.cs.odu.edu is the Archival Fixity server
• Actual URIs to manifests can be a bit more complex using “Trusty URIs”:
http://ws-dl.blogspot.com/2017/01/2017-01-15-summary-of-trusty-uris.html
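The two URI shapes that appear in these slides (the generic manifest URI, and the longer form carrying a creation timestamp and a hash) can be sketched as simple string templates. The function names are hypothetical, and the path layout is inferred from the example URIs in the slides rather than from a server specification.

```python
FIXITY_SERVER = "https://manifest.ws-dl.cs.odu.edu"  # server named in the slides


def generic_manifest_uri(uri_m: str, server: str = FIXITY_SERVER) -> str:
    """Generic manifest URI for a memento: <server>/manifest/<URI-M>."""
    return f"{server.rstrip('/')}/manifest/{uri_m}"


def trusty_manifest_uri(created: str, digest: str, uri_m: str,
                        server: str = FIXITY_SERVER) -> str:
    """Longer form: <server>/manifest/<14-digit datetime>/<hash>/<URI-M>.

    Layout inferred from the example URIs; treat it as an assumption.
    """
    return f"{server.rstrip('/')}/manifest/{created}/{digest}/{uri_m}"


print(generic_manifest_uri(
    "https://web.archive.org/web/20181224085329/https://2019.jcdl.org/"))
```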
17. A manifest file example
{
  "@context": "http://manifest.ws-dl.cs.odu.edu/",
  "created": "Sun, 23 Dec 2018 11:43:55 GMT",
  "@id": "http://manifest.ws-dl.cs.odu.edu/manifest/20181223114355/c6ad485819abbe20e37c0632843081710c95f94829f59bbe3b6ad3251d93f7d2/https://web.archive.org/web/20181219102034/https://2019.jcdl.org/",
  "uri-r": "https://2019.jcdl.org/",
  "uri-m": "https://web.archive.org/web/20181219102034/https://2019.jcdl.org/",
  "memento-datetime": "Wed, 19 Dec 2018 10:20:34 GMT",
  "http-headers": {
    "Content-Type": "text/html; charset=UTF-8",
    "X-Archive-Orig-date": "Wed, 19 Dec 2018 10:20:36 GMT",
    "X-Archive-Orig-link": "<https://2019.jcdl.org/wp-json/>; rel=\"https://api.w.org/\"",
    "Preference-Applied": "original-links, original-content"
  },
  "hash-constructor": "(curl -s '$uri-m' && echo -n '$Content-Type $X-Archive-Orig-date $X-Archive-Orig-link') | tee >(sha256sum) >(md5sum) >/dev/null | cut -d ' ' -f 1 | paste -d':' <(echo -e 'md5\nsha256') - | paste -d' ' - -",
  "hash": "md5:969d7aba4c16444a6544bdc39eefe394 sha256:c68a215eb1c3edbf51f565b9a87f49646456369e51791a86106a6667630737a6"
}
• The manifest includes how the hashes are computed (hash-constructor)
• Hashes are computed on the base HTML file only
• Fixity is also computed on things that should not change, like certain original HTTP response headers
• The "@id" value is a Trusty URI
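As a rough Python counterpart to the shell "hash-constructor" in the manifest, the sketch below hashes the memento body followed by the selected header values joined by single spaces, and formats the result like the manifest's "hash" field. The function name `manifest_hash` is hypothetical, the body is passed in rather than downloaded, and no attempt is made to reproduce the exact hash values shown in the example.

```python
import hashlib


def manifest_hash(body: bytes, headers: dict, names: list) -> str:
    """Hash body + selected header values, mirroring the shell pipeline.

    The header values are joined with single spaces and appended to the
    body without a trailing newline (like `echo -n`). Both MD5 and
    SHA-256 are computed over the same byte stream.
    """
    header_part = " ".join(headers[name] for name in names)
    payload = body + header_part.encode("utf-8")
    md5 = hashlib.md5(payload).hexdigest()
    sha256 = hashlib.sha256(payload).hexdigest()
    return f"md5:{md5} sha256:{sha256}"


# Hypothetical body; the header values are the ones shown in the manifest.
body = b"<html><body>JCDL 2019</body></html>"
headers = {
    "Content-Type": "text/html; charset=UTF-8",
    "X-Archive-Orig-date": "Wed, 19 Dec 2018 10:20:36 GMT",
    "X-Archive-Orig-link": '<https://2019.jcdl.org/wp-json/>; rel="https://api.w.org/"',
}
names = ["Content-Type", "X-Archive-Orig-date", "X-Archive-Orig-link"]

print(manifest_hash(body, headers, names))
```

Including the original response headers in the hashed stream means that a change to either the body or those headers produces a different "hash" value.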
18. Atomic approach (step 4): Push manifests into multiple archives
• In this example, the memento is in the Internet Archive (IA) and its fixity information is disseminated to four archives, including IA
• An attacker would have to hack a majority of the 4 domains (archives)
Manifest:
manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
Archived copies of the manifest:
• https://archive.is/20181224093334/http://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
• https://web.archive.org/web/20181224093355/http://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
• https://perma-archives.org/warc/20181224093354/http://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
• http://www.webcitation.org/74tvdsyxe
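The "majority of 4 domains" remark suggests a voting rule over the hashes recovered from the archived manifest copies. The slides do not spell out the rule, so the strict-majority policy below is an assumption, and the function names and hash strings are placeholders for illustration.

```python
from collections import Counter


def majority_hash(archived_hashes: list):
    """Return the hash reported by a strict majority of archives, else None."""
    if not archived_hashes:
        return None
    value, count = Counter(archived_hashes).most_common(1)[0]
    return value if count > len(archived_hashes) / 2 else None


def verify_with_majority(current_hash: str, archived_hashes: list) -> bool:
    """True if a strict majority agrees on a hash and it equals the current one."""
    agreed = majority_hash(archived_hashes)
    return agreed is not None and agreed == current_hash


# Placeholder hash strings: three archives agree, one copy was altered.
copies = ["sha256:aaa", "sha256:aaa", "sha256:aaa", "sha256:bbb"]
print(majority_hash(copies))
print(verify_with_majority("sha256:aaa", copies))
```

Under this policy, altering a single archived copy does not change the outcome; a tie (two against two) yields no agreed hash, so verification fails rather than trusting either side.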
19. Block approach (step 1): Batch together fixity information of multiple mementos in a single file (block)
• Adding additional metadata (e.g., created_at, fields, …)
• The hash of the previous block must be added
!context ["http://oduwsdl.github.io/contexts/fixity"]
!fields {keys: ["surt"]}
!id {uri: "https://manifest.ws-dl.cs.odu.edu/"}
!meta {created_at: "20190111181327"}
!meta {prev_block:"sha256:d4eb1190f9aaae9542...845b632eb2b3f4f098a34144d"}
!meta {type: "FixityBlock"}
org,archive,web)/web/19961022175434/http://search.com
org,archive,web)/web/19961219082428/http://sho.com
org,archive,web)/web/19961223174001/http://reference.com
…
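A minimal Python sketch of two ideas in this block format: chaining blocks through `prev_block`, and looking up a record by binary search over sorted keys. Several details are assumptions not confirmed by the slide: that `prev_block` is the SHA-256 of the entire previous block file, and that each record line is a key followed by a space and a value. The helper names and the record values are hypothetical.

```python
import hashlib
import re
from bisect import bisect_left


def block_digest(block_text: str) -> str:
    """Assumed chaining hash: SHA-256 over the whole block file."""
    return "sha256:" + hashlib.sha256(block_text.encode("utf-8")).hexdigest()


def split_block(block_text: str):
    """Separate '!' metadata lines from record lines."""
    meta, records = [], []
    for line in block_text.splitlines():
        if line.strip():
            (meta if line.startswith("!") else records).append(line)
    return meta, records


def prev_block_of(meta_lines):
    """Extract the prev_block value from the metadata lines, if present."""
    for line in meta_lines:
        match = re.search(r'prev_block:\s*"([^"]+)"', line)
        if match:
            return match.group(1)
    return None


def find_record(records, key):
    """Binary-search sorted record lines for the one whose key equals key."""
    i = bisect_left(records, key)
    if i < len(records) and records[i].split(" ", 1)[0] == key:
        return records[i]
    return None


# Two tiny hypothetical blocks; the record values are placeholder digests.
block1 = "\n".join([
    '!meta {type: "FixityBlock"}',
    "org,archive,web)/web/19961022175434/http://search.com " + block_digest("page A"),
    "org,archive,web)/web/19961219082428/http://sho.com " + block_digest("page B"),
]) + "\n"
block2 = "\n".join([
    f'!meta {{prev_block:"{block_digest(block1)}"}}',
    '!meta {type: "FixityBlock"}',
    "org,archive,web)/web/19961223174001/http://reference.com " + block_digest("page C"),
]) + "\n"

meta2, _ = split_block(block2)
chain_ok = prev_block_of(meta2) == block_digest(block1)

_, records1 = split_block(block1)
hit = find_record(records1, "org,archive,web)/web/19961219082428/http://sho.com")
miss = find_record(records1, "org,archive,web)/web/20000101000000/http://example.com")
print(chain_ok, hit is not None, miss is None)
```

Because each block records the hash of its predecessor, changing an older block would break the chain for every later block, and keeping records sorted by key is what allows the lookup to use binary search.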
20. Block approach (step 2): Publish the block file at the Archival Fixity server
• The blocks entrypoint, manifest.ws-dl.cs.odu.edu/blocks, always redirects to the latest published block
21. Block approach (step 3): Push the blocks entrypoint into multiple archives
Blocks entrypoint: https://manifest.ws-dl.cs.odu.edu/blocks
Archived copies:
• https://web.archive.org/web/20190121054059/https://manifest.ws-dl.cs.odu.edu/blocks/7bbf757046ac0a0a60015a1cb847c3189160d18c809b210073822df157609e01
• https://perma.cc/8YG3-X7KN
• Archiving the entrypoint results in archiving the latest published block as well
22. Three steps to verify the fixity of a memento
1. Discover a manifest/block
• In the Atomic approach, this includes discovering the archived manifests
2. Compute current fixity information of the memento
3. Compare current fixity information with the discovered manifests/block
An example of discovering the latest manifest in the Archival Fixity server for the memento web.archive.org/web/20171115140705/http://rln.fm/:
$ curl -sI https://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20171115140705/http://rln.fm/ | egrep -i "(HTTP/|^location:)"
HTTP/2 302
location: https://manifest.ws-dl.cs.odu.edu/manifest/20181212074423/bd669de8835e38d54651fe9d04709515beec0c727db82a5366f4bc2506e103d8/https://web.archive.org/web/20171115140705/http://rln.fm/
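Once the redirect target has been obtained, it can be split into its parts. The sketch below is a hypothetical helper that parses a `location` value of the shape shown in the example; the pattern (14-digit timestamp, 64 hex digits, then the memento URI) is inferred from that one example and is an assumption about the general form.

```python
import re

# Pattern inferred from the example redirect; treat it as an assumption.
MANIFEST_LOCATION = re.compile(
    r"^(?P<server>https?://[^/]+)/manifest/"
    r"(?P<created>\d{14})/"
    r"(?P<digest>[0-9a-f]{64})/"
    r"(?P<uri_m>.+)$"
)


def parse_manifest_location(location: str):
    """Split a redirect target into server, created, digest and uri_m.

    Returns a dict, or None if the value does not have the expected shape.
    """
    match = MANIFEST_LOCATION.match(location.strip())
    return match.groupdict() if match else None


location = (
    "https://manifest.ws-dl.cs.odu.edu/manifest/20181212074423/"
    "bd669de8835e38d54651fe9d04709515beec0c727db82a5366f4bc2506e103d8/"
    "https://web.archive.org/web/20171115140705/http://rln.fm/"
)
parts = parse_manifest_location(location)
print(parts["created"], parts["uri_m"])
```

The `created` part identifies which version of the manifest was discovered, and `uri_m` identifies the memento whose current hash is then computed and compared in steps 2 and 3.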
23. Evaluation
• A data set of 1,000 mementos from the Internet Archive
• For each memento, we generated and disseminated 3 manifests to 4 archives
• The average size of a manifest file is 1,157 bytes
• The manifest size represents 2.79% of the size of the actual downloaded HTML
24. Increasing the number of records per block reduces the block generation time
25. The Block approach creates fewer resources in archives than the Atomic approach
• Given a collection of N = 1,000 mementos
• K = 4 web archives
• The selected block size B = 100 records per block
• The total number of resources created in the archives:
  • Atomic: N ∗ K = 4,000
  • Block: K ∗ (N/B) = 40
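The resource counts follow directly from the two formulas, as the short Python sketch below shows. Rounding N/B up when N is not a multiple of B is an assumption added here; the slide only gives the evenly divisible case.

```python
import math


def atomic_resources(n: int, k: int) -> int:
    """One archived manifest per memento per archive: N * K."""
    return n * k


def block_resources(n: int, k: int, b: int) -> int:
    """One archived block per B records per archive: K * ceil(N / B).

    Rounding up for a partially filled last block is an assumption.
    """
    return k * math.ceil(n / b)


N, K, B = 1000, 4, 100  # values from the slide
print(atomic_resources(N, K))
print(block_resources(N, K, B))
```

With the slide's values, the Atomic approach creates 100 times as many archived resources as the Block approach, and the gap widens as B grows.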
26. Dissemination/download time varies from one archive to another
27. It takes 1.25x, 4x, and 36x longer to disseminate a manifest to perma.cc, archive.org, and webcitation.org, respectively, than to archive.is
28. It takes 3.5x longer to disseminate a manifest to archive.org than to perma.cc
29. Average time for discovering published fixity information
30. The Block approach performs 4.46x faster than the Atomic approach in verifying the fixity of mementos
• The fixity verification time includes:
  • Discovering manifests
  • Computing current fixity information
  • Downloading the archived manifests
  • Comparing results
• On average, the verification time of a memento is 6.65 seconds by the Atomic approach and 1.49 seconds by the Block approach
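The 4.46x figure is the ratio of the two average verification times, which a couple of lines of Python can confirm. The variable names are illustrative only.

```python
# Average verification times per memento, in seconds, from the slide.
atomic_seconds = 6.65
block_seconds = 1.49

speedup = atomic_seconds / block_seconds
saved_per_memento = atomic_seconds - block_seconds

print(f"speedup: {speedup:.2f}x")
print(f"time saved per memento: {saved_per_memento:.2f} s")
```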
31. Conclusions
• The proposed approaches do not require any changes in the infrastructure of web archives
• The Block approach creates fewer resources in archives and reduces fixity verification time (4.46x faster than the Atomic approach)
• The Atomic approach can verify the fixity of archived pages even without using the Archival Fixity server
• Varying/increasing the block size could be one potential solution to improve the Block approach performance and reduce the number of resources created in archives
• Caching archived manifests/blocks in the Archival Fixity server should improve the performance of both approaches