SlideShare ist ein Scribd-Unternehmen logo
1 von 31
Downloaden Sie, um offline zu lesen
Sawood Alam, Mat Kelly, Michele C. Weigle, Michael L. Nelson
Web Science and Digital Libraries Research Group
Old Dominion University
Norfolk, Virginia, USA
@WebSciDL
InterPlanetary Wayback
The Next Step Towards Decentralized Web Archiving
IPFS Lab Day, Decentralized Web Summit, 2018
San Francisco, CA (USA)
August 3, 2018
http://github.com/oduwsdl/ipwb
Content Addressing
http://foo.com/spaceDog.jpg
http://example.org/yuri.jpg
QmZAD4xeeNeYF3TmwWgBXypLKTiCGwGRMXHW7MtheWKtw4
QmZAD4xeeNeYF3TmwWgBXypLKTiCGwGRMXHW7MtheWKtw4
===
$ ipfs cat QmZAD4xeeNeYF3TmwWgBXypLKTiCGwGRMXHW7MtheWKtw4 > doge.jpg
2@ibnesayeed
Rendered HTML vs. Source Code
3@ibnesayeed
HTTP Response vs. WARC Record
4
HTTP headers
Payload
WARC headers
@ibnesayeed
How Wayback Machine Works?
Archival
Indexer
Archival Index
(e.g., CDXJ) Replay Engine
processes
Outputs references
reads (file, offset)
read archived content
Present WARC
content to user
5@ibnesayeed
Memento: Time Dimension to the Web
https://tools.ietf.org/html/rfc7089 6@ibnesayeed
Why IPWB?
● Persistence of archived web dependent on resilience of
organizations
● Availability of data is subject to censorship
● Redundancy in web archive files of exact duplicate content
● Lack of public participation in web archiving
● Discoverability issue of small web archives
7@ibnesayeed
Indexing
8@ibnesayeed
WARC Creation HTTP Header & Payload Extraction Push to IPFS Generate CDXJ WARC-CDXJ Correspondence
9@ibnesayeed
HTTP HEADER
BLOCK
HTTP PAYLOAD
BLOCK
WARC Creation HTTP Header & Payload Extraction Push to IPFS Generate CDXJ WARC-CDXJ Correspondence
10@ibnesayeed
QmcN9eWwRF73dZj5BgT4x8jeEcFr
xurX1hot8QwCbMi9PB
Qmczh9YnB4U1ptPeqxcaTZA4aMm
uNUswTLTWzXntvbp9sL
HEADER DIGEST
PAYLOAD DIGEST
WARC Creation HTTP Header & Payload Extraction Push to IPFS Generate CDXJ WARC-CDXJ Correspondence
11
HTTP HEADER
BLOCK
HTTP PAYLOAD
BLOCK
@ibnesayeed
QmcN9eWwRF73dZj5BgT4x8jeEcFrxurX1hot8QwCb
Mi9PB
Qmczh9YnB4U1ptPeqxcaTZA4aMmuNUswTLTWzX
ntvbp9sL
HEADER DIGEST
com,example,ipwb)/ 20180802012013 {
"locator":"urn:ipfs/QmcN9eWwRF73dZj5BgT4x8jeEcFrxurX1hot8QwCbMi9PB/Qmczh9YnB4U1ptPeqxca
TZA4aMmuNUswTLTWzXntvbp9sL",
"mime_type": "text/html",
"status_code": 200,
“other_fields”: “other values...”
}
// * This is a single-line record, line breaks and indentation are added for readability only
PAYLOAD DIGEST
CDXJ: http://ws-dl.blogspot.com/2015/09/2015-09-10-cdxj-object-resource-stream.html
WARC Creation HTTP Header & Payload Extraction Push to IPFS Generate CDXJ Record WARC-CDXJ Correspondence
12@ibnesayeed
edu,odu,cs)/~salam/dweb/ 20180802012013 {
"locator": "urn:ipfs/QmcN9eWwRF73dZj5.../Qmczh9YnB4U1ptPe...",
"mime_type": "text/html",
"status_code": "200"
}
edu,odu,cs)/~salam/dweb/style.css 20180802012013 {
"locator": "urn:ipfs/QmU1k71bT6ibZBSd.../QmbvUAo9U31wSdvA...",
"mime_type": "text/css",
"status_code": "200"
}
edu,odu,cs)/~salam/dweb/wsdl-logo.png 20180802012013 {
"locator": "urn:ipfs/QmTjfMxFGvbP4nwF.../QmYMKZbnk53kuPJi...",
"mime_type": "image/png",
"status_code": "200"
}
WARC Creation HTTP Header & Payload Extraction Push to IPFS Generate CDXJ WARC-CDXJ Correspondence
13@ibnesayeed
Replay
14@ibnesayeed
https://www.cs.odu.edu/~salam/dweb/
15
edu,odu,cs)/~salam/dweb/ 20180802012013 {
"locator": "urn:ipfs/QmcN9eWwRF.../Qmczh9YnB4...",
"mime_type": "text/html",
"status_code": "200"
}
edu,odu,cs)/~salam/dweb/style.css 20180802012013 {
"locator": "urn:ipfs/QmU1k71bT6.../QmbvUAo9U3...",
"mime_type": "text/css",
"status_code": "200"
}
edu,odu,cs)/~salam/dweb/wsdl-logo.png 20180802012013 {
"locator": "urn:ipfs/QmTjfMxFGv.../QmYMKZbnk5...",
"mime_type": "image/png",
"status_code": "200"
}
Fetch from IPFS Reroute ResourcesReconstruct ResponseLookup in CDXJ
@ibnesayeed
edu,odu,cs)/~salam/dweb/ 20180802012013 {
"locator": "urn:ipfs/QmcN9eWwRF73dZj5.../Qmczh9YnB4U1ptPe...",
"mime_type": "text/html",
"status_code": "200"
}
16
HTTP HEADER
BLOCK
HTTP PAYLOAD
BLOCK
Fetch from IPFS Reroute ResourcesReconstruct ResponseLookup in CDXJ
@ibnesayeed
17
HTTP HEADER
BLOCK
HTTP PAYLOAD
BLOCK
Fetch from IPFS Reroute ResourcesReconstruct ResponseLookup in CDXJ
Reconstruct
@ibnesayeed
Fetch from IPFS Reroute Resources
18
● https://oduwsdl.github.io/Reconstructive/
● http://ws-dl.blogspot.com/2018/01/2018-01-08-introducing-reconstructive.html
● Avoids zombies (live-leakage)
● Adds an unobtrusive archival banner (Custom HTML Element)
Reconstruct ResponseLookup in CDXJ
@ibnesayeed
19
IPWB Indexing and Replay
@ibnesayeed
Decentralization
20@ibnesayeed
Current Issues
● IPFS is permanent, but not persistent
● DHT-based IPNS is history-unaware
● CDXJ index, a critical piece of replay, is centralized
21@ibnesayeed
Persistence
● Data persistence is critical for web archiving
● A decentralized storage with sufficient replication is needed
● Memory organizations should contribute storage infrastructure
● Qri, Filecoin, IPFS-Cluster, IPFS-Sync etc. can be helpful
22@ibnesayeed
IPNS: InterPlanetary Naming System
URI IPFS Hash
http://example.org/yuri.jpg
http://example.com/style.css
http://example.com/logo.png
http://example.com/style.css
How about changes and history?
23@ibnesayeed
IPNS Blockchain
● URI → Latest hash
● URI + DateTime → A historical hash
● URI → List historical hashes with times
https://github.com/oduwsdl/IPNS-Blockchain
Owner URI Time Hash PrevBlock
Pub K1
URI1
T1
H1
1234567...
Pub K2
URI2
T2
H2
0000000...
Pub K3
URI3
T3
H3
9876543...
Owner URI Time Hash PrevBlock
Pub K1
URI1
T4
H5
5463728...
Pub K3
URI4
T5
H6
0000000...
IPNS + Blockchain + Memento
24@ibnesayeed
Lazy Relationship Evaluation
/<namespace>/about/<URI>
IP-LD to the rescue?
https://github.com/oduwsdl/ipwb/issues/61
Memento
Memento
Memento
MementoOf
(Active Relation)
HasMemento
(Lazy Evaluation)
25@ibnesayeed
Evaluation
26@ibnesayeed
● Reported IPFS slowness https://github.com/ipfs/go-ipfs/issues/1216
○ Has since been fixed, but we did not evaluate again
570 files per minute~10% overhead
27
Storage Space and Time Overhead
@ibnesayeed
Replay Time
● 600 requests in 222 seconds
● Slower than PyWB (which took 5.26 seconds)
● File vs. rich object based retrieval
● Never expiring cache
28
https://github.com/ibnesayeed/ipfsapi-concurrency-test
@ibnesayeed
Future Works
● Evaluate the improved IPFS on large dataset
● Evaluate deduplication
● Implement an index-free collaborative archiving system
● Utilize IPNS to reference URI-Rs with datetime
29@ibnesayeed
Conclusions
● A proof of concept system to leverage a novel approach to
archiving and retrieval
● Storage and time costs evaluation and qualitative analysis
● It can only work for small archives in its current state
● A path to answer “who will archive the archives?”
● More work to be done to make it a truly decentralized
archiving system
30@ibnesayeed
InterPlanetary Wayback
The Next Step Towards Decentralized Web Archiving
@WebSciDL
http://github.com/oduwsdl/ipwb
Supported in part by Protocol Labs, AMF 11600663, and NSF IIS-1526700
Sawood Alam, Mat Kelly, Michele C. Weigle, Michael L. Nelson

Weitere ähnliche Inhalte

Was ist angesagt?

InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web ArchivesInterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web ArchivesSawood Alam
 
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptCombining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptMichael Nelson
 
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred RepresentationsScripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred RepresentationsJustin Brunelle
 
Introducing Web Archiving and WSDL Research Group
Introducing Web Archiving and WSDL Research GroupIntroducing Web Archiving and WSDL Research Group
Introducing Web Archiving and WSDL Research GroupSawood Alam
 
MementoMap: An Archive Profile Dissemination Framework
MementoMap: An Archive Profile Dissemination FrameworkMementoMap: An Archive Profile Dissemination Framework
MementoMap: An Archive Profile Dissemination FrameworkSawood Alam
 
Evaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
Evaluating the SiteStory Transactional Web Archive with the ApacheBench ToolEvaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
Evaluating the SiteStory Transactional Web Archive with the ApacheBench ToolMichael Nelson
 
Profiling Web Archive Coverage for Top-Level Domain and Content Language
Profiling Web Archive Coverage for Top-Level Domain and Content LanguageProfiling Web Archive Coverage for Top-Level Domain and Content Language
Profiling Web Archive Coverage for Top-Level Domain and Content LanguageMichael Nelson
 
URI Disambiguation in the Context of Linked Data
URI Disambiguation in the Context of Linked DataURI Disambiguation in the Context of Linked Data
URI Disambiguation in the Context of Linked Databutest
 
Proposal for Text Mining PubAg
Proposal for Text Mining PubAgProposal for Text Mining PubAg
Proposal for Text Mining PubAgJake Lever
 
DHUG 2018: Towards Web-Centric Repository Interoperability
DHUG 2018: Towards Web-Centric Repository InteroperabilityDHUG 2018: Towards Web-Centric Repository Interoperability
DHUG 2018: Towards Web-Centric Repository InteroperabilityAccess Innovations, Inc.
 
Herbert Van De Sompel - Time Travel for the Web
Herbert Van De Sompel - Time Travel for the WebHerbert Van De Sompel - Time Travel for the Web
Herbert Van De Sompel - Time Travel for the WebiMinds conference
 
A Framework for Verifying the Fixity of Archived Web Resources
A Framework for Verifying the Fixity of Archived Web ResourcesA Framework for Verifying the Fixity of Archived Web Resources
A Framework for Verifying the Fixity of Archived Web Resourcesmaturban
 
Fetch It! A Custom Voyager service for Holds/Retrieval without using reporter
Fetch It! A Custom Voyager service for Holds/Retrieval without using reporterFetch It! A Custom Voyager service for Holds/Retrieval without using reporter
Fetch It! A Custom Voyager service for Holds/Retrieval without using reporterRay Schwartz
 
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...Justin Brunelle
 
RDM#2- The Distributed Web
RDM#2- The Distributed WebRDM#2- The Distributed Web
RDM#2- The Distributed WebDavid Dias
 
Node.js Interactive
Node.js InteractiveNode.js Interactive
Node.js InteractiveDavid Dias
 
Data on the web - an inconvenient truth
Data on the web - an inconvenient truthData on the web - an inconvenient truth
Data on the web - an inconvenient truthmarcobrattinga
 
Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikis...
Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikis...Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikis...
Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikis...Gaurav Vaidya
 
Scraping with Python for Fun and Profit - PyCon India 2010
Scraping with Python for Fun and Profit - PyCon India 2010Scraping with Python for Fun and Profit - PyCon India 2010
Scraping with Python for Fun and Profit - PyCon India 2010Abhishek Mishra
 

Was ist angesagt? (20)

InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web ArchivesInterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
 
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptCombining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
 
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred RepresentationsScripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
 
Introducing Web Archiving and WSDL Research Group
Introducing Web Archiving and WSDL Research GroupIntroducing Web Archiving and WSDL Research Group
Introducing Web Archiving and WSDL Research Group
 
MementoMap: An Archive Profile Dissemination Framework
MementoMap: An Archive Profile Dissemination FrameworkMementoMap: An Archive Profile Dissemination Framework
MementoMap: An Archive Profile Dissemination Framework
 
Evaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
Evaluating the SiteStory Transactional Web Archive with the ApacheBench ToolEvaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
Evaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
 
Profiling Web Archive Coverage for Top-Level Domain and Content Language
Profiling Web Archive Coverage for Top-Level Domain and Content LanguageProfiling Web Archive Coverage for Top-Level Domain and Content Language
Profiling Web Archive Coverage for Top-Level Domain and Content Language
 
URI Disambiguation in the Context of Linked Data
URI Disambiguation in the Context of Linked DataURI Disambiguation in the Context of Linked Data
URI Disambiguation in the Context of Linked Data
 
Proposal for Text Mining PubAg
Proposal for Text Mining PubAgProposal for Text Mining PubAg
Proposal for Text Mining PubAg
 
DHUG 2018: Towards Web-Centric Repository Interoperability
DHUG 2018: Towards Web-Centric Repository InteroperabilityDHUG 2018: Towards Web-Centric Repository Interoperability
DHUG 2018: Towards Web-Centric Repository Interoperability
 
Herbert Van De Sompel - Time Travel for the Web
Herbert Van De Sompel - Time Travel for the WebHerbert Van De Sompel - Time Travel for the Web
Herbert Van De Sompel - Time Travel for the Web
 
A Framework for Verifying the Fixity of Archived Web Resources
A Framework for Verifying the Fixity of Archived Web ResourcesA Framework for Verifying the Fixity of Archived Web Resources
A Framework for Verifying the Fixity of Archived Web Resources
 
Mezi snem a realitou. Otevřená data českého webového archivu.
Mezi snem a realitou. Otevřená data českého webového archivu.Mezi snem a realitou. Otevřená data českého webového archivu.
Mezi snem a realitou. Otevřená data českého webového archivu.
 
Fetch It! A Custom Voyager service for Holds/Retrieval without using reporter
Fetch It! A Custom Voyager service for Holds/Retrieval without using reporterFetch It! A Custom Voyager service for Holds/Retrieval without using reporter
Fetch It! A Custom Voyager service for Holds/Retrieval without using reporter
 
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
 
RDM#2- The Distributed Web
RDM#2- The Distributed WebRDM#2- The Distributed Web
RDM#2- The Distributed Web
 
Node.js Interactive
Node.js InteractiveNode.js Interactive
Node.js Interactive
 
Data on the web - an inconvenient truth
Data on the web - an inconvenient truthData on the web - an inconvenient truth
Data on the web - an inconvenient truth
 
Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikis...
Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikis...Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikis...
Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikis...
 
Scraping with Python for Fun and Profit - PyCon India 2010
Scraping with Python for Fun and Profit - PyCon India 2010Scraping with Python for Fun and Profit - PyCon India 2010
Scraping with Python for Fun and Profit - PyCon India 2010
 

Ähnlich wie InterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving

IPWB and IPFS at WAC2017
IPWB and IPFS at WAC2017IPWB and IPFS at WAC2017
IPWB and IPFS at WAC2017David Dias
 
A Collaborative, Secure, and Private InterPlanetary Wayback Web Archiving Sys...
A Collaborative, Secure, and Private InterPlanetary Wayback Web Archiving Sys...A Collaborative, Secure, and Private InterPlanetary Wayback Web Archiving Sys...
A Collaborative, Secure, and Private InterPlanetary Wayback Web Archiving Sys...machawk1
 
Apache ZooKeeper TechTuesday
Apache ZooKeeper TechTuesdayApache ZooKeeper TechTuesday
Apache ZooKeeper TechTuesdayAndrei Savu
 
WebSocket Perspectives 2015 - Clouds, Streams, Microservices and WoT
WebSocket Perspectives 2015 - Clouds, Streams, Microservices and WoTWebSocket Perspectives 2015 - Clouds, Streams, Microservices and WoT
WebSocket Perspectives 2015 - Clouds, Streams, Microservices and WoTFrank Greco
 
Can we run the Whole Web on Apache Sling?
Can we run the Whole Web on Apache Sling?Can we run the Whole Web on Apache Sling?
Can we run the Whole Web on Apache Sling?Bertrand Delacretaz
 
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...Glenn K. Lockwood
 
Openstack Swift - Lots of small files
Openstack Swift - Lots of small filesOpenstack Swift - Lots of small files
Openstack Swift - Lots of small filesAlexandre Lecuyer
 
Cosmos, Big Data GE implementation in FIWARE
Cosmos, Big Data GE implementation in FIWARECosmos, Big Data GE implementation in FIWARE
Cosmos, Big Data GE implementation in FIWAREFernando Lopez Aguilar
 
Cosmos, Big Data GE Implementation
Cosmos, Big Data GE ImplementationCosmos, Big Data GE Implementation
Cosmos, Big Data GE ImplementationFIWARE
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasMapR Technologies
 
Environment for training models
Environment for training modelsEnvironment for training models
Environment for training modelsFlyElephant
 
Regex Considered Harmful: Use Rosie Pattern Language Instead
Regex Considered Harmful: Use Rosie Pattern Language InsteadRegex Considered Harmful: Use Rosie Pattern Language Instead
Regex Considered Harmful: Use Rosie Pattern Language InsteadAll Things Open
 
Managing Your Content with Elasticsearch
Managing Your Content with ElasticsearchManaging Your Content with Elasticsearch
Managing Your Content with ElasticsearchSamantha Quiñones
 
HTTP/2 What's inside and Why
HTTP/2 What's inside and WhyHTTP/2 What's inside and Why
HTTP/2 What's inside and WhyAdrian Cole
 
Nodejs a-practical-introduction-oredev
Nodejs a-practical-introduction-oredevNodejs a-practical-introduction-oredev
Nodejs a-practical-introduction-oredevFelix Geisendörfer
 
Node.js - A practical introduction (v2)
Node.js  - A practical introduction (v2)Node.js  - A practical introduction (v2)
Node.js - A practical introduction (v2)Felix Geisendörfer
 

Ähnlich wie InterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving (20)

IPWB and IPFS at WAC2017
IPWB and IPFS at WAC2017IPWB and IPFS at WAC2017
IPWB and IPFS at WAC2017
 
A Collaborative, Secure, and Private InterPlanetary Wayback Web Archiving Sys...
A Collaborative, Secure, and Private InterPlanetary Wayback Web Archiving Sys...A Collaborative, Secure, and Private InterPlanetary Wayback Web Archiving Sys...
A Collaborative, Secure, and Private InterPlanetary Wayback Web Archiving Sys...
 
Apache ZooKeeper TechTuesday
Apache ZooKeeper TechTuesdayApache ZooKeeper TechTuesday
Apache ZooKeeper TechTuesday
 
WebSocket Perspectives 2015 - Clouds, Streams, Microservices and WoT
WebSocket Perspectives 2015 - Clouds, Streams, Microservices and WoTWebSocket Perspectives 2015 - Clouds, Streams, Microservices and WoT
WebSocket Perspectives 2015 - Clouds, Streams, Microservices and WoT
 
Can we run the Whole Web on Apache Sling?
Can we run the Whole Web on Apache Sling?Can we run the Whole Web on Apache Sling?
Can we run the Whole Web on Apache Sling?
 
An API Your Parents Would Be Proud Of
An API Your Parents Would Be Proud OfAn API Your Parents Would Be Proud Of
An API Your Parents Would Be Proud Of
 
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
 
Restfs internals
Restfs internalsRestfs internals
Restfs internals
 
Guillotina
GuillotinaGuillotina
Guillotina
 
Openstack Swift - Lots of small files
Openstack Swift - Lots of small filesOpenstack Swift - Lots of small files
Openstack Swift - Lots of small files
 
Cosmos, Big Data GE implementation in FIWARE
Cosmos, Big Data GE implementation in FIWARECosmos, Big Data GE implementation in FIWARE
Cosmos, Big Data GE implementation in FIWARE
 
Cosmos, Big Data GE Implementation
Cosmos, Big Data GE ImplementationCosmos, Big Data GE Implementation
Cosmos, Big Data GE Implementation
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
 
Environment for training models
Environment for training modelsEnvironment for training models
Environment for training models
 
Regex Considered Harmful: Use Rosie Pattern Language Instead
Regex Considered Harmful: Use Rosie Pattern Language InsteadRegex Considered Harmful: Use Rosie Pattern Language Instead
Regex Considered Harmful: Use Rosie Pattern Language Instead
 
Html 5 boot camp
Html 5 boot campHtml 5 boot camp
Html 5 boot camp
 
Managing Your Content with Elasticsearch
Managing Your Content with ElasticsearchManaging Your Content with Elasticsearch
Managing Your Content with Elasticsearch
 
HTTP/2 What's inside and Why
HTTP/2 What's inside and WhyHTTP/2 What's inside and Why
HTTP/2 What's inside and Why
 
Nodejs a-practical-introduction-oredev
Nodejs a-practical-introduction-oredevNodejs a-practical-introduction-oredev
Nodejs a-practical-introduction-oredev
 
Node.js - A practical introduction (v2)
Node.js  - A practical introduction (v2)Node.js  - A practical introduction (v2)
Node.js - A practical introduction (v2)
 

Mehr von Sawood Alam

TrendMachine: Temporal Resilience of Web Pages
TrendMachine: Temporal Resilience of Web PagesTrendMachine: Temporal Resilience of Web Pages
TrendMachine: Temporal Resilience of Web PagesSawood Alam
 
CDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection InsightsCDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection InsightsSawood Alam
 
Video Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback MachineVideo Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback MachineSawood Alam
 
MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento RoutingMementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento RoutingSawood Alam
 
Web ARChive (WARC) File Format
Web ARChive (WARC) File FormatWeb ARChive (WARC) File Format
Web ARChive (WARC) File FormatSawood Alam
 
MemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in GoMemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in GoSawood Alam
 
Dockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to ContainerizationDockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to ContainerizationSawood Alam
 
Avoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorkerAvoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorkerSawood Alam
 
Client-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorkerClient-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorkerSawood Alam
 
TPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive ProfilingTPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive ProfilingSawood Alam
 
Web Archive Profiling Through Fulltext Search
Web Archive Profiling Through Fulltext SearchWeb Archive Profiling Through Fulltext Search
Web Archive Profiling Through Fulltext SearchSawood Alam
 
JCDL 2016 Doctoral Consortium - Web Archive Profiling
JCDL 2016 Doctoral Consortium - Web Archive ProfilingJCDL 2016 Doctoral Consortium - Web Archive Profiling
JCDL 2016 Doctoral Consortium - Web Archive ProfilingSawood Alam
 
Web Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionWeb Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionSawood Alam
 
TPDL 2015 - Profiling Web Archives
TPDL 2015 - Profiling Web ArchivesTPDL 2015 - Profiling Web Archives
TPDL 2015 - Profiling Web ArchivesSawood Alam
 
Profiling Web Archives
Profiling Web ArchivesProfiling Web Archives
Profiling Web ArchivesSawood Alam
 
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...Sawood Alam
 
Profile Serialization IIPC GA 2015
Profile Serialization IIPC GA 2015Profile Serialization IIPC GA 2015
Profile Serialization IIPC GA 2015Sawood Alam
 
Profiling Web Archives IIPC GA 2015
Profiling Web Archives IIPC GA 2015Profiling Web Archives IIPC GA 2015
Profiling Web Archives IIPC GA 2015Sawood Alam
 
Web Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionWeb Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionSawood Alam
 
HTTP Mailbox - Asynchronous RESTful Communication
HTTP Mailbox - Asynchronous RESTful CommunicationHTTP Mailbox - Asynchronous RESTful Communication
HTTP Mailbox - Asynchronous RESTful CommunicationSawood Alam
 

Mehr von Sawood Alam (20)

TrendMachine: Temporal Resilience of Web Pages
TrendMachine: Temporal Resilience of Web PagesTrendMachine: Temporal Resilience of Web Pages
TrendMachine: Temporal Resilience of Web Pages
 
CDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection InsightsCDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection Insights
 
Video Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback MachineVideo Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback Machine
 
MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento RoutingMementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
 
Web ARChive (WARC) File Format
Web ARChive (WARC) File FormatWeb ARChive (WARC) File Format
Web ARChive (WARC) File Format
 
MemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in GoMemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in Go
 
Dockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to ContainerizationDockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to Containerization
 
Avoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorkerAvoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorker
 
Client-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorkerClient-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorker
 
TPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive ProfilingTPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive Profiling
 
Web Archive Profiling Through Fulltext Search
Web Archive Profiling Through Fulltext SearchWeb Archive Profiling Through Fulltext Search
Web Archive Profiling Through Fulltext Search
 
JCDL 2016 Doctoral Consortium - Web Archive Profiling
JCDL 2016 Doctoral Consortium - Web Archive ProfilingJCDL 2016 Doctoral Consortium - Web Archive Profiling
JCDL 2016 Doctoral Consortium - Web Archive Profiling
 
Web Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionWeb Archiving: A Brief Introduction
Web Archiving: A Brief Introduction
 
TPDL 2015 - Profiling Web Archives
TPDL 2015 - Profiling Web ArchivesTPDL 2015 - Profiling Web Archives
TPDL 2015 - Profiling Web Archives
 
Profiling Web Archives
Profiling Web ArchivesProfiling Web Archives
Profiling Web Archives
 
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
 
Profile Serialization IIPC GA 2015
Profile Serialization IIPC GA 2015Profile Serialization IIPC GA 2015
Profile Serialization IIPC GA 2015
 
Profiling Web Archives IIPC GA 2015
Profiling Web Archives IIPC GA 2015Profiling Web Archives IIPC GA 2015
Profiling Web Archives IIPC GA 2015
 
Web Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionWeb Archiving: A Brief Introduction
Web Archiving: A Brief Introduction
 
HTTP Mailbox - Asynchronous RESTful Communication
HTTP Mailbox - Asynchronous RESTful CommunicationHTTP Mailbox - Asynchronous RESTful Communication
HTTP Mailbox - Asynchronous RESTful Communication
 

Kürzlich hochgeladen

Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .Poonam Aher Patil
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryAlex Henderson
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learninglevieagacer
 
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai YoungDubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Youngkajalvid75
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Monika Rani
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxseri bangash
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Silpa
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Serviceshivanisharma5244
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfrohankumarsinghrore1
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Servicenishacall1
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.Nitya salvi
 
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flyPRADYUMMAURYA1
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bSérgio Sacani
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 

Kürzlich hochgeladen (20)

Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai YoungDubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 

InterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving

  • 1. Sawood Alam, Mat Kelly, Michele C. Weigle, Michael L. Nelson Web Science and Digital Libraries Research Group Old Dominion University Norfolk, Virginia, USA @WebSciDL InterPlanetary Wayback The Next Step Towards Decentralized Web Archiving IPFS Lab Day, Decentralized Web Summit, 2018 San Francisco, CA (USA) August 3, 2018 http://github.com/oduwsdl/ipwb
  • 3. Rendered HTML vs. Source Code 3@ibnesayeed
  • 4. HTTP Response vs. WARC Record 4 HTTP headers Payload WARC headers @ibnesayeed
  • 5. How Wayback Machine Works? Archival Indexer Archival Index (e.g., CDXJ) Replay Engine processes Outputs references reads (file, offset) read archived content Present WARC content to user 5@ibnesayeed
  • 6. Memento: Time Dimension to the Web https://tools.ietf.org/html/rfc7089 6@ibnesayeed
  • 7. Why IPWB? ● Persistence of archived web dependent on resilience of organizations ● Availability of data is subject to censorship ● Redundancy in web archive files of exact duplicate content ● Lack of public participation in web archiving ● Discoverability issue of small web archives 7@ibnesayeed
  • 9. WARC Creation HTTP Header & Payload Extraction Push to IPFS Generate CDXJ WARC-CDXJ Correspondence 9@ibnesayeed
  • 10. HTTP HEADER BLOCK HTTP PAYLOAD BLOCK WARC Creation HTTP Header & Payload Extraction Push to IPFS Generate CDXJ WARC-CDXJ Correspondence 10@ibnesayeed
  • 11. QmcN9eWwRF73dZj5BgT4x8jeEcFr xurX1hot8QwCbMi9PB Qmczh9YnB4U1ptPeqxcaTZA4aMm uNUswTLTWzXntvbp9sL HEADER DIGEST PAYLOAD DIGEST WARC Creation HTTP Header & Payload Extraction Push to IPFS Generate CDXJ WARC-CDXJ Correspondence 11 HTTP HEADER BLOCK HTTP PAYLOAD BLOCK @ibnesayeed
  • 12. QmcN9eWwRF73dZj5BgT4x8jeEcFrxurX1hot8QwCb Mi9PB Qmczh9YnB4U1ptPeqxcaTZA4aMmuNUswTLTWzX ntvbp9sL HEADER DIGEST com,example,ipwb)/ 20180802012013 { "locator":"urn:ipfs/QmcN9eWwRF73dZj5BgT4x8jeEcFrxurX1hot8QwCbMi9PB/Qmczh9YnB4U1ptPeqxca TZA4aMmuNUswTLTWzXntvbp9sL", "mime_type": "text/html", "status_code": 200, “other_fields”: “other values...” } // * This is a single-line record, line breaks and indentation are added for readability only PAYLOAD DIGEST CDXJ: http://ws-dl.blogspot.com/2015/09/2015-09-10-cdxj-object-resource-stream.html WARC Creation HTTP Header & Payload Extraction Push to IPFS Generate CDXJ Record WARC-CDXJ Correspondence 12@ibnesayeed
  • 13. edu,odu,cs)/~salam/dweb/ 20180802012013 { "locator": "urn:ipfs/QmcN9eWwRF73dZj5.../Qmczh9YnB4U1ptPe...", "mime_type": "text/html", "status_code": "200" } edu,odu,cs)/~salam/dweb/style.css 20180802012013 { "locator": "urn:ipfs/QmU1k71bT6ibZBSd.../QmbvUAo9U31wSdvA...", "mime_type": "text/css", "status_code": "200" } edu,odu,cs)/~salam/dweb/wsdl-logo.png 20180802012013 { "locator": "urn:ipfs/QmTjfMxFGvbP4nwF.../QmYMKZbnk53kuPJi...", "mime_type": "image/png", "status_code": "200" } WARC Creation HTTP Header & Payload Extraction Push to IPFS Generate CDXJ WARC-CDXJ Correspondence 13@ibnesayeed
  • 15. https://www.cs.odu.edu/~salam/dweb/ 15 edu,odu,cs)/~salam/dweb/ 20180802012013 { "locator": "urn:ipfs/QmcN9eWwRF.../Qmczh9YnB4...", "mime_type": "text/html", "status_code": "200" } edu,odu,cs)/~salam/dweb/style.css 20180802012013 { "locator": "urn:ipfs/QmU1k71bT6.../QmbvUAo9U3...", "mime_type": "text/css", "status_code": "200" } edu,odu,cs)/~salam/dweb/wsdl-logo.png 20180802012013 { "locator": "urn:ipfs/QmTjfMxFGv.../QmYMKZbnk5...", "mime_type": "image/png", "status_code": "200" } Fetch from IPFS Reroute ResourcesReconstruct ResponseLookup in CDXJ @ibnesayeed
  • 16. edu,odu,cs)/~salam/dweb/ 20180802012013 { "locator": "urn:ipfs/QmcN9eWwRF73dZj5.../Qmczh9YnB4U1ptPe...", "mime_type": "text/html", "status_code": "200" } 16 HTTP HEADER BLOCK HTTP PAYLOAD BLOCK Fetch from IPFS Reroute ResourcesReconstruct ResponseLookup in CDXJ @ibnesayeed
  • 17. 17 HTTP HEADER BLOCK HTTP PAYLOAD BLOCK Fetch from IPFS Reroute ResourcesReconstruct ResponseLookup in CDXJ Reconstruct @ibnesayeed
  • 18. Fetch from IPFS Reroute Resources 18 ● https://oduwsdl.github.io/Reconstructive/ ● http://ws-dl.blogspot.com/2018/01/2018-01-08-introducing-reconstructive.html ● Avoids zombies (live-leakage) ● Adds an unobtrusive archival banner (Custom HTML Element) Reconstruct ResponseLookup in CDXJ @ibnesayeed
  • 19. 19 IPWB Indexing and Replay @ibnesayeed
  • 21. Current Issues ● IPFS is permanent, but not persistent ● DHT-based IPNS is history-unaware ● CDXJ index, a critical piece of replay, is centralized 21@ibnesayeed
  • 22. Persistence ● Data persistence is critical for web archiving ● A decentralized storage with sufficient replication is needed ● Memory organizations should contribute storage infrastructure ● Qri, Filecoin, IPFS-Cluster, IPFS-Sync etc. can be helpful 22@ibnesayeed
  • 23. IPNS: InterPlanetary Naming System URI IPFS Hash http://example.org/yuri.jpg http://example.com/style.css http://example.com/logo.png http://example.com/style.css How about changes and history? 23@ibnesayeed
  • 24. IPNS Blockchain ● URI → Latest hash ● URI + DateTime → A historical hash ● URI → List historical hashes with times https://github.com/oduwsdl/IPNS-Blockchain Owner URI Time Hash PrevBlock Pub K1 URI1 T1 H1 1234567... Pub K2 URI2 T2 H2 0000000... Pub K3 URI3 T3 H3 9876543... Owner URI Time Hash PrevBlock Pub K1 URI1 T4 H5 5463728... Pub K3 URI4 T5 H6 0000000... IPNS + Blockchain + Memento 24@ibnesayeed
  • 25. Lazy Relationship Evaluation /<namespace>/about/<URI> IP-LD to the rescue? https://github.com/oduwsdl/ipwb/issues/61 Memento Memento Memento MementoOf (Active Relation) HasMemento (Lazy Evaluation) 25@ibnesayeed
  • 27. ● Reported IPFS slowness https://github.com/ipfs/go-ipfs/issues/1216 ○ Has since been fixed, but we did not evaluate again 570 files per minute~10% overhead 27 Storage Space and Time Overhead @ibnesayeed
  • 28. Replay Time ● 600 requests in 222 seconds ● Slower than PyWB (which took 5.26 seconds) ● File vs. rich object based retrieval ● Never expiring cache 28 https://github.com/ibnesayeed/ipfsapi-concurrency-test @ibnesayeed
  • 29. Future Works ● Evaluate the improved IPFS on large dataset ● Evaluate deduplication ● Implement an index-free collaborative archiving system ● Utilize IPNS to reference URI-Rs with datetime 29@ibnesayeed
  • 30. Conclusions ● A proof of concept system to leverage a novel approach to archiving and retrieval ● Storage and time costs evaluation and qualitative analysis ● It can only work for small archives in its current state ● A path to answer “who will archive the archives?” ● More work to be done to make it a truly decentralized archiving system 30@ibnesayeed
  • 31. InterPlanetary Wayback The Next Step Towards Decentralized Web Archiving @WebSciDL http://github.com/oduwsdl/ipwb Supported in part by Protocol Labs, AMF 11600663, and NSF IIS-1526700 Sawood Alam, Mat Kelly, Michele C. Weigle, Michael L. Nelson