SlideShare ist ein Scribd-Unternehmen logo
1 von 39
Browser-Based
Digital Preservation
Mat Kelly
Web Science and Digital Libraries (WS-DL) Research Lab
Department of Computer Science
http://ws-dl.cs.odu.edu
July 7, 2014
What This Will Be About
• Saving Things on Live Web to Web Archives
– State of the Art
– Barriers
• Dynamics of Digital Preservation Components
– Means of Preservation (e.g., crawlers)
– Means of Replay (e.g., Wayback)
• Outstanding and Unsolved Issues in DigPres
2
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
The Speaker, Mat Kelly:
• 3rd Year PhD student
• MS Degree:
• BS Degree:
• Previously a researcher
at BMW
• Currently programmer @ NASA Langley
• Research Assistant @ ODU WS-DL Research Lab
Research Lab homepage: http://ws-dl.cs.odu.edu
3
My homepage: http://www.cs.odu.edu/~mkelly
Quick Primer: The Web
• What It Is
– Client-Server Messaging + Payload
– HTML, Images, JavaScripts, Multimedia, and more!
• Where It Resides
– Live web: remote servers, distributed content
4
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
Quick Primer: HTTP
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
5
http://mysite.com/~jsmith/myPage.html
domain resource path on serverSCHEME/PROTOCOL
user agent
(browser)
Client Requests Resource
Server
Server Responds
YES! 200 OK + Resource
Resource Exists? Respond With
NO, it’s moved! 300-code + URI
NO, it’s gone! 404
Embedded Content in Resp?
e.g., images, JavaScript YES! 200 OK + Resource
NO, it’s moved! 300-code + URI
NO, it’s gone! 404
How The Web Is Used
• Web browser accesses website via URI
• Browser requests site’s contents from server
• Browser displays site’s contents
– Usually requiring multiple requests for embedded
resources
GET / HTTP/1.1
Host: mysite.com BEHIND THE SCENES
6
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
HTTP/1.1 200 OK
Content-Type: text/html
<!DOCTYPE html>
<html>
<head> …
HTTP Response
Header
HTTP Response
Entity
Returning to a Page on the Live Web
• Lookup by URI
• But Site’s Gone
• Lookup in Web Archives
GET / HTTP/1.1
Host: mysite.com
HTTP/1.1 404 NOT FOUND
Content-Type: text/html
7
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
Returning to a Page on the Live Web
• Lookup by URI at Internet Archive
✓ SUCCESS
Likely Preserved If…
• Popular
• Surface (e.g., no auth)
• Simple (no fancy JS)
USING THE ARCHIVES
8
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
Site Popularity Affects Archival
Presence
More Popular
Well preserved
Less Popular
Less preserved
Niche Site
Likely not preserved 9
10
Sites Not On Surface Web are Likely
Not Archived
• Sites can restrict being crawled
– Whose data is it anyway?
• Robots.txt can limit
– Crawler (by user-agent)
– Paths/files/whole site
11
Sites Not On Surface Web are Likely
Not Archived
• Not everything ought
to be in public archives
• …but we might still
want to preserve these
pages.
– e.g., our FB news feed
12
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
When the Browsing Tool Changes
The Result is Inconsistent
• Interactive sites
– frequently do not have
all resources required for
replay in the archives
Recall:
Likely a Web Page Is Preserved If…
• Popular
• Surface (e.g., no authentication required)
• Simple (no fancy JavaScript)
13
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
From This, We Can Surmise:
A page will be UNLIKELY or INSUFFICIENTLY
preserved if:
• Not Popular
• On the Deep Web or Behind Authentication
• Contains Fancy Effects or Asynchronously
Loaded Resources
14
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
Why and What Can Be Done
A page will be UNLIKELY or INSUFFICIENTLY
preserved if:
• Not Popular
• On the Deep Web or Behind Authentication
• Contains Fancy Effects or Asynchronously
Loaded Resources
Internet Archive chooses What is Preserved
15
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
Why and What Can Be Done
A page will be UNLIKELY or INSUFFICIENTLY
preserved if:
• Not Popular
• On the Deep Web or Behind Authentication
• Contains Fancy Effects or Asynchronously
Loaded Resources
Preserve from the Browser Instead of
Delegating by URI
16
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
Why and What Can Be Done
A page will be UNLIKELY or INSUFFICIENTLY
preserved if:
• Not Popular
• On the Deep Web or Behind Authentication
• Contains Fancy Effects or Asynchronously
Loaded Resources
Leverage the Browser’s Native JavaScript
rendering engine
17
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
• Internet Archive’s archival crawler
– Open source, also deployed @archive.org
• Command-line driven
• Web-based GUI
• Allows list of URIs to be crawled,
frequency, crawl depth, etc.
– On user’s own Heritrix deployment
• Generates Web ARChive (WARC) files
State of the Art:
18
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
The Web ARChive (WARC) File Format
• 28500
• Multiple text-based
record types in a file
• Supports arbitrary
data types on web
• Preserves HTTP
information for
replaying content
WARC/1.0
WARC-Type: warcinfo
WARC-Filename: test.warc
Content-Length: 42
Format: WARC File Format 1.0
WARC/1.0
WARC-Type: request
WARC-Target-URI: http://mysite.
WARC-Date: 2014-05-10T12:11:03Z
WARC-Concurrent-To: <urn:uuid:b
WARC-Record-ID: <urn:uuid:f8e09
Content-Length: 115
GET / HTTP/1.1
User-Agent: Mozilla/5.0
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://mysite.
WARC-Concurrent-To: <urn:uuid:b
WARC-Record-ID: <urn:uuid:f93ea
Content-Length: 11262
HTTP/1.1 200 OK
<!DOCTYPE html><html>
<head>
<body>...
WARC
info
Record
WARC
request
Record
WARC
response
Record
19
WARC Anatomy
• WARC Records each have
a header and a payload
• One warcinfo/WARC
• warcmetadata optional
– Describe files
– can have multiple/file
• Req/Resp records
document HTTP
transactions @archive time
warcinfo record header
warcinfo record payload
warcmetadata record header
warcmetadata record payload
warcrequest record header
warcrequest record payload
warcresponse record header
warcreponse record payload
warcrequest record header
warcrequest record payload
warcresponse record header
warcreponse record payload
MyArchive.warc
20
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
Reading WARCs
• “Replaying” content as it once was versus
viewing a static web page
•
– Deployed at Internet Archive (archive.org)
– Open Source software
• You can deploy your own Wayback!
– Indexes WARCs, allows content to be accessible
from web interface
21
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
But the Issue Remains…
A crawler
changes the capture context
from the browser
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
22
High Level Requirements
WARC Creation Software
• Intuitive to use
– cf. Heritrix and Wayback are difficult to setup
• Comply with WARC ISO standard
• Capture JavaScript
• Capture content behind authentication
• Allow a user to execute through the browser
– On demand, without need to delegate via URI
23
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
WARCreate
“Create WARC files from any webpage”
1. On Page Load
1. Generate warcinfo record
2. Save HTTP Requests, Responses
• Content as Strings
2. On Secondary Content Loaded
 Repeat 1.2. and Concatenate
3. On Generate WARC command
1. Create Blob from String
2. Provide Blob to FileSaver library
24
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
• onBeforeSendHeaders - HTTP request headers
• onHeadersReceived() –HTTP response headers
• onBeforeRequest() –HTTP 3XX responses
• onResponseStarted() - Payload
https://developer.chrome.com/extensions/webRequest
25
Google Chrome’s Webrequest API
Payloads Captured
• Target information can be
captured with webRequest
• File/Session descriptors
generated at archive time
• Payload headers generated
once payload captured
• Templating system used
warcinfo record header
warcinfo record payload
warcmetadata record header
warcmetadata record payload
warcrequest record header
warcrequest record payload
warcresponse record header
warcreponse record payload
warcrequest record header
warcrequest record payload
warcresponse record header
warcreponse record payload
MyArchive.warc
✓
✓
✓
✓
26
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
A Real warcinfo
WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2014-06-30T22:56:06Z
WARC-Filename: 20140630180606900.warc
WARC-Record-ID: <urn:uuid:7fc954f0-e291-279f-5b6b-5b97aeb437bd>
Content-Type: application/warc-fields
Content-Length: 458
software: WARCreate/0.2014.6.2 http://warcreate.com
format: WARC File Format 1.0
conformsTo:
http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
isPartOf: basic
description: Crawl initiated from the WARCreate Google Chrome
extension
robots: ignore
http-header-user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153
Safari/537.36
http-header-from: warcreate@matkelly.com
• Template
• Generated @ archive time
• Self-descriptive
27
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
A Real warcmetadata
WARC/1.0
WARC-Type: metadata
WARC-Target-URI: http://matkelly.com/
WARC-Date: 2014-06-30T22:56:06Z
WARC-Concurrent-To: <urn:uuid:dddc4ba2-c1e1-459b-8d0d-a98a20b87e96>
WARC-Record-ID: <urn:uuid:6fef2a49-a9ba-4b40-9f4a-5ca5db1fd5c6>
Content-Type: application/warc-fields
Content-Length: 1938
outlink: http://matkelly.com/_images/logo.png E =EMBED_MISC
outlink: http://matkelly.com/resume/ L a/@href
• Template
• Generated @ archive time
• Self-descriptive
28
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
A Real warcrequest
WARC/1.0
WARC-Type: request
WARC-Target-URI: http://matkelly.com/
WARC-Date: 2014-06-30T22:56:06Z
WARC-Concurrent-To: <urn:uuid:ba53adfd-4bd0-9a53-9396-b0dd77e39353>
WARC-Record-ID: <urn:uuid:f8e095f3-7b77-f2e6-f592-ce0b63a32f36>
Content-Type: application/http; msgtype=request
Content-Length: 317
GET / HTTP/1.1
Accept:
text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q
=0.8
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153
Safari/537.36
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8,de-DE;q=0.6
• Template
• Generated @ archive time
• Self-descriptive
29
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
HTTP
Request
header
A Real warcresponse
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://matkelly.com/
WARC-Date: 2014-06-30T22:56:06Z
WARC-Record-ID: <urn:uuid:f93eafa7-24c5-c29f-e17f-e0627201b7f1>
Content-Type: application/http; msgtype=response
Content-Length: 12662
HTTP/1.1 200 OK
Date: Mon, 30 Jun 2014 22:56:04 GMT
Server: Apache
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 4824
Keep-Alive: timeout=2, max=100
Connection: Keep-Alive
Content-Type: text/html
<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"
lang="en"><head>
• Template
• Generated @ archive time
• Self-descriptive
30
HTTP
Response
header
HTTP
Response
body
The File System Problem
• WARC strings must be saved to file system
• JS provides access to HTML5 localStorage
– Limited to sandbox, 5 MB
• NEW Chrome extension API allows unlimited MB
– chrome.storage instead of localStorage
• HTML5 W3C saveAs() support is limited
• FileSaver.js provides polyfill!
– Limited to ~ 360 MB/file
https://github.com/eligrey/FileSaver.js/
31
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
Capture Scope
• Not limited to robots.txt
• Native JS Support
(from browser)
• User accessible
– No URI delegation
32
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
User Dynamics
• Simple operation:
navigate, click a button
• Files downloaded with
smart defaults to HDD
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
33
• We may not want our personal info in archives
• Your http://facebook.com and
mine are vastly different
– …but same URI!
• Even if we capture it,
it might not be replayable
– Limited by replay system
Caveats of Personal Web Archiving
34
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
Recent Work/Outstanding Issues
Direct upload of WARC to server
Comprehensive Site Archiving (e.g., all your )
• Annotation of Web Pages
• Grouping WARCs into archival collections
• Periodic, automated execution
• Portable JS WARC interaction library
• Much more!
35
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
But…Isn’t This Just Coding
and Not Research?
• WARC work in the JS space is non-existent!
• Initial implementations have already
uncovered limitations
• JS tech has evolved, opened new doors for
functionality
• WARCreate is state-of-the art for using
JavaScript/Browsers FOR archiving
36
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
Browser-Based Digital Preservation
WARCreate as Research
• Publications
– Mat Kelly and Michele C. Weigle, "WARCreate - Create Wayback-Consumable
WARC Files from Any Webpage," In Proceedings of the ACM/IEEE Joint
Conference on Digital Libraries (JCDL). Washington, DC, June 2012, pp. 437-
438
– Mat Kelly, Michele C. Weigle and Michael L. Nelson. "WARCreate - Create
Wayback-Consumable WARC Files from Any Webpage," Digital Preservation
2012, Tools Demo Session: Web Archiving; 2012 Jul 25; Washington, DC.
– Mat Kelly, Michael L. Nelson and Michele C. Weigle. "WARCreate and WAIL:
WARC, Wayback and Heritrix Made Easy," Digital Preservation 2013,
Workshops and Sessions: Web Archiving; 2013 Jul 24; Alexandria, VA
• Academic Funding
– Archive What I See Now: Bringing Institutional Web Archiving Tools to the
Individual Researcher, National Endowment for the Humanities (NEH), Digital
Humanities Start-Up Grant, HD-51670-13, May 2013 - Dec 2014, $57,892
• Notoriety
– Future Steward Innovation Award Recipient, National Digital Stewardship
Alliance (NDSA) / Library of Congress, July 2012
37
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
Where I Have Traveled as a
ODU Researcher
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
38
Washington, DC Atlanta, GA College Park, MD
Salt Lake City, UT San Francisco, CA London, England
(Sept 2014)
…
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
http://www.cs.odu.edu/~mkelly 39
Publications
Travel
Academic
Funding
Notoriety


Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (9)

Answers to usual issues in getting started with consuming Linked Data (2010)
Answers to usual issues in getting started with consuming Linked Data (2010)Answers to usual issues in getting started with consuming Linked Data (2010)
Answers to usual issues in getting started with consuming Linked Data (2010)
 
csvconfyasmin2017_05_03
csvconfyasmin2017_05_03csvconfyasmin2017_05_03
csvconfyasmin2017_05_03
 
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
 
Interoperability and Its Role In Standardization, Plus A ResourceSync Overview
Interoperability and Its Role In Standardization, Plus A ResourceSync OverviewInteroperability and Its Role In Standardization, Plus A ResourceSync Overview
Interoperability and Its Role In Standardization, Plus A ResourceSync Overview
 
Library 2.0
Library 2.0Library 2.0
Library 2.0
 
LITA Preconference: Getting Started with Drupal (handout)
LITA Preconference: Getting Started with Drupal (handout)LITA Preconference: Getting Started with Drupal (handout)
LITA Preconference: Getting Started with Drupal (handout)
 
Blogging/RSS in Education
Blogging/RSS in EducationBlogging/RSS in Education
Blogging/RSS in Education
 
BHIS Faculty Library Refresher 2016
BHIS Faculty Library Refresher 2016BHIS Faculty Library Refresher 2016
BHIS Faculty Library Refresher 2016
 
Web2 inschools November 2010
Web2 inschools November 2010Web2 inschools November 2010
Web2 inschools November 2010
 

Ähnlich wie Browser-Based Digital Preservation

Avtar's ppt
Avtar's pptAvtar's ppt
Avtar's ppt
mak57
 
Coast gis talk
Coast gis talkCoast gis talk
Coast gis talk
Carl Sack
 
Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012
Roxanne Missingham
 
Lesson 6 web based attacks
Lesson 6 web based attacksLesson 6 web based attacks
Lesson 6 web based attacks
Frank Victory
 

Ähnlich wie Browser-Based Digital Preservation (20)

Notes on SF W3Conf
Notes on SF W3ConfNotes on SF W3Conf
Notes on SF W3Conf
 
Evaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
Evaluating the SiteStory Transactional Web Archive with the ApacheBench ToolEvaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
Evaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
 
Internet content as research data
Internet content as research dataInternet content as research data
Internet content as research data
 
Readying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web BundlesReadying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web Bundles
 
world wide web
world wide webworld wide web
world wide web
 
Avtar's ppt
Avtar's pptAvtar's ppt
Avtar's ppt
 
Web Landscape - updated in Jan 2016
Web Landscape - updated in Jan 2016Web Landscape - updated in Jan 2016
Web Landscape - updated in Jan 2016
 
introduction to Web system
introduction to Web systemintroduction to Web system
introduction to Web system
 
Mini-Training: To cache or not to cache
Mini-Training: To cache or not to cacheMini-Training: To cache or not to cache
Mini-Training: To cache or not to cache
 
Filling in the Blanks: Capturing Dynamically Generated Content
Filling in the Blanks: Capturing Dynamically Generated ContentFilling in the Blanks: Capturing Dynamically Generated Content
Filling in the Blanks: Capturing Dynamically Generated Content
 
Static Site Generators - Developing Websites in Low-resource Condition
Static Site Generators - Developing Websites in Low-resource ConditionStatic Site Generators - Developing Websites in Low-resource Condition
Static Site Generators - Developing Websites in Low-resource Condition
 
Coast gis talk
Coast gis talkCoast gis talk
Coast gis talk
 
Varnish intro
Varnish introVarnish intro
Varnish intro
 
Web Services
Web ServicesWeb Services
Web Services
 
Web browser architecture.pptx
Web browser architecture.pptxWeb browser architecture.pptx
Web browser architecture.pptx
 
Www ppt
Www pptWww ppt
Www ppt
 
Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012
 
Tool Academy: Web Archiving
Tool Academy: Web ArchivingTool Academy: Web Archiving
Tool Academy: Web Archiving
 
Lesson 6 web based attacks
Lesson 6 web based attacksLesson 6 web based attacks
Lesson 6 web based attacks
 
Codefest2015
Codefest2015Codefest2015
Codefest2015
 

Mehr von Mat Kelly

Visualizing Digital Collections of Web Archives from Columbia Web Archiving C...
Visualizing Digital Collections of Web Archives from Columbia Web Archiving C...Visualizing Digital Collections of Web Archives from Columbia Web Archiving C...
Visualizing Digital Collections of Web Archives from Columbia Web Archiving C...
Mat Kelly
 
Archive What I See Now - Archive-It Partner Meeting 2013 2013
Archive What I See Now - Archive-It Partner Meeting 2013 2013Archive What I See Now - Archive-It Partner Meeting 2013 2013
Archive What I See Now - Archive-It Partner Meeting 2013 2013
Mat Kelly
 
Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving
Making Enterprise-Level Archive Tools Accessible for Personal Web ArchivingMaking Enterprise-Level Archive Tools Accessible for Personal Web Archiving
Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving
Mat Kelly
 
An Extensible Framework for Creating Personal Web Archives of Content Behind ...
An Extensible Framework for Creating Personal Web Archives of Content Behind ...An Extensible Framework for Creating Personal Web Archives of Content Behind ...
An Extensible Framework for Creating Personal Web Archives of Content Behind ...
Mat Kelly
 
NDIIPP/NDSA 2011 - YouTube Link Restoration
NDIIPP/NDSA 2011 - YouTube Link RestorationNDIIPP/NDSA 2011 - YouTube Link Restoration
NDIIPP/NDSA 2011 - YouTube Link Restoration
Mat Kelly
 
NDIIPP/NDSA 2011 - Archive Facebook
NDIIPP/NDSA 2011 - Archive FacebookNDIIPP/NDSA 2011 - Archive Facebook
NDIIPP/NDSA 2011 - Archive Facebook
Mat Kelly
 

Mehr von Mat Kelly (20)

Aggregating Private and Public Web Archives Using the Mementity Framework
Aggregating Private and Public Web Archives Using the Mementity FrameworkAggregating Private and Public Web Archives Using the Mementity Framework
Aggregating Private and Public Web Archives Using the Mementity Framework
 
Client-Assisted Memento Aggregation Using the Prefer Header
Client-Assisted Memento Aggregation Using the Prefer HeaderClient-Assisted Memento Aggregation Using the Prefer Header
Client-Assisted Memento Aggregation Using the Prefer Header
 
A Framework for Aggregating Public and Private Web Archives
A Framework for Aggregating Public and Private Web ArchivesA Framework for Aggregating Public and Private Web Archives
A Framework for Aggregating Public and Private Web Archives
 
Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count
 
Exploring Aggregation of Personal, Private, and Institutional Web Archives
Exploring Aggregation of Personal, Private, and Institutional Web ArchivesExploring Aggregation of Personal, Private, and Institutional Web Archives
Exploring Aggregation of Personal, Private, and Institutional Web Archives
 
JCDL 2015 Doctoral Consortium - A Framework for Aggregating Private and Publi...
JCDL 2015 Doctoral Consortium - A Framework for AggregatingPrivate and Publi...JCDL 2015 Doctoral Consortium - A Framework for AggregatingPrivate and Publi...
JCDL 2015 Doctoral Consortium - A Framework for Aggregating Private and Publi...
 
Visualizing Digital Collections of Web Archives from Columbia Web Archiving C...
Visualizing Digital Collections of Web Archives from Columbia Web Archiving C...Visualizing Digital Collections of Web Archives from Columbia Web Archiving C...
Visualizing Digital Collections of Web Archives from Columbia Web Archiving C...
 
Facilitation of the A Posteriori Replication of Web Published Satellite Imagery
Facilitation of the A Posteriori Replication of Web Published Satellite ImageryFacilitation of the A Posteriori Replication of Web Published Satellite Imagery
Facilitation of the A Posteriori Replication of Web Published Satellite Imagery
 
Slides
SlidesSlides
Slides
 
Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...
Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...
Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...
 
Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014
Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014
Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014
 
Archive What I See Now - Archive-It Partner Meeting 2013 2013
Archive What I See Now - Archive-It Partner Meeting 2013 2013Archive What I See Now - Archive-It Partner Meeting 2013 2013
Archive What I See Now - Archive-It Partner Meeting 2013 2013
 
IEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction System
IEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction SystemIEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction System
IEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction System
 
Digital Preservation 2013
Digital Preservation 2013Digital Preservation 2013
Digital Preservation 2013
 
Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving
Making Enterprise-Level Archive Tools Accessible for Personal Web ArchivingMaking Enterprise-Level Archive Tools Accessible for Personal Web Archiving
Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving
 
An Extensible Framework for Creating Personal Web Archives of Content Behind ...
An Extensible Framework for Creating Personal Web Archives of Content Behind ...An Extensible Framework for Creating Personal Web Archives of Content Behind ...
An Extensible Framework for Creating Personal Web Archives of Content Behind ...
 
The Revolution Will Not Be Archived
The Revolution Will Not Be ArchivedThe Revolution Will Not Be Archived
The Revolution Will Not Be Archived
 
WARCreate - Create Wayback-Consumable WARC Files from Any Webpage
WARCreate - Create Wayback-Consumable WARC Files from Any WebpageWARCreate - Create Wayback-Consumable WARC Files from Any Webpage
WARCreate - Create Wayback-Consumable WARC Files from Any Webpage
 
NDIIPP/NDSA 2011 - YouTube Link Restoration
NDIIPP/NDSA 2011 - YouTube Link RestorationNDIIPP/NDSA 2011 - YouTube Link Restoration
NDIIPP/NDSA 2011 - YouTube Link Restoration
 
NDIIPP/NDSA 2011 - Archive Facebook
NDIIPP/NDSA 2011 - Archive FacebookNDIIPP/NDSA 2011 - Archive Facebook
NDIIPP/NDSA 2011 - Archive Facebook
 

Kürzlich hochgeladen

Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
Call Girls In Delhi Whatsup 9873940964 Enjoy Unlimited Pleasure
 
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...
nilamkumrai
 

Kürzlich hochgeladen (20)

VVIP Pune Call Girls Sinhagad WhatSapp Number 8005736733 With Elite Staff And...
VVIP Pune Call Girls Sinhagad WhatSapp Number 8005736733 With Elite Staff And...VVIP Pune Call Girls Sinhagad WhatSapp Number 8005736733 With Elite Staff And...
VVIP Pune Call Girls Sinhagad WhatSapp Number 8005736733 With Elite Staff And...
 
APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53
 
Busty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort Service
Busty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort ServiceBusty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort Service
Busty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort Service
 
Pirangut | Call Girls Pune Phone No 8005736733 Elite Escort Service Available...
Pirangut | Call Girls Pune Phone No 8005736733 Elite Escort Service Available...Pirangut | Call Girls Pune Phone No 8005736733 Elite Escort Service Available...
Pirangut | Call Girls Pune Phone No 8005736733 Elite Escort Service Available...
 
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
 
Top Rated Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
Top Rated  Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...Top Rated  Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
Top Rated Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
 
Katraj ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For S...
Katraj ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For S...Katraj ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For S...
Katraj ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For S...
 
Call Girls Sangvi Call Me 7737669865 Budget Friendly No Advance BookingCall G...
Call Girls Sangvi Call Me 7737669865 Budget Friendly No Advance BookingCall G...Call Girls Sangvi Call Me 7737669865 Budget Friendly No Advance BookingCall G...
Call Girls Sangvi Call Me 7737669865 Budget Friendly No Advance BookingCall G...
 
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
 
Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
 
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
 
Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...
Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...
Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...
 
Trump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts SweatshirtTrump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts Sweatshirt
 
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
 
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
 
(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7
(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7
(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7
 
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...
 
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
 
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency""Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
 

Browser-Based Digital Preservation

  • 1. Browser-Based Digital Preservation Mat Kelly Web Science and Digital Libraries (WS-DL) Research Lab Department of Computer Science http://ws-dl.cs.odu.edu July 7, 2014
  • 2. What This Will Be About • Saving Things on Live Web to Web Archives – State of the Art – Barriers • Dynamics of Digital Preservation Components – Means of Preservation (e.g., crawlers) – Means of Replay (e.g., Wayback) • Outstanding and Unsolved Issues in DigPres 2 Browser-Based Digital Preservation http://ws-dl.cs.odu.edu
  • 3. The Speaker, Mat Kelly: • 3rd Year PhD student • MS Degree: • BS Degree: • Previously a researcher at BMW • Currently programmer @ NASA Langley • Research Assistant @ ODU WS-DL Research Lab Research Lab homepage: http://ws-dl.cs.odu.edu 3 My homepage: http://www.cs.odu.edu/~mkelly
  • 4. Quick Primer: The Web • What It Is – Client-Server Messaging + Payload – HTML, Images, JavaScripts, Multimedia, and more! • Where It Resides – Live web: remote servers, distributed content 4 Browser-Based Digital Preservation http://ws-dl.cs.odu.edu
  • 5. Quick Primer: HTTP Browser-Based Digital Preservation http://ws-dl.cs.odu.edu 5 http://mysite.com/~jsmith/myPage.html domain resource path on serverSCHEME/PROTOCOL user agent (browser) Client Requests Resource Server Server Responds YES! 200 OK + Resource Resource Exists? Respond With NO, it’s moved! 300-code + URI NO, it’s gone! 404 Embedded Content in Resp? e.g., images, JavaScript YES! 200 OK + Resource NO, it’s moved! 300-code + URI NO, it’s gone! 404
  • 6. How The Web Is Used • Web browser accesses website via URI • Browser requests site’s contents from server • Browser displays site’s contents – Usually requiring multiple requests for embedded resources GET / HTTP/1.1 Host: mysite.com BEHIND THE SCENES 6 Browser-Based Digital Preservation http://ws-dl.cs.odu.edu HTTP/1.1 200 OK Content-Type: text/html <!DOCTYPE html> <html> <head> … HTTP Response Header HTTP Response Entity
  • 7. Returning to a Page on the Live Web • Lookup by URI • But Site’s Gone • Lookup in Web Archives GET / HTTP/1.1 Host: mysite.com HTTP/1.1 404 NOT FOUND Content-Type: text/html 7 Browser-Based Digital Preservation http://ws-dl.cs.odu.edu
  • 8. Returning to a Page on the Live Web • Lookup by URI at Internet Archive ✓ SUCCESS Likely Preserved If… • Popular • Surface (e.g., no auth) • Simple (no fancy JS) USING THE ARCHIVES 8 Browser-Based Digital Preservation http://ws-dl.cs.odu.edu
  • 9. Browser-Based Digital Preservation http://ws-dl.cs.odu.edu Site Popularity Affects Archival Presence More Popular Well preserved Less Popular Less preserved Niche Site Likely not preserved 9
  • 10. 10 Sites Not On Surface Web are Likely Not Archived • Sites can restrict being crawled – Whose data is it anyway? • Robots.txt can limit – Crawler (by user-agent) – Paths/files/whole site
  • 11. 11 Sites Not On Surface Web are Likely Not Archived • Not everything ought to be in public archives • …but we might still want to preserve these pages. – e.g., our FB news feed
  • 12. 12 Browser-Based Digital Preservation http://ws-dl.cs.odu.edu When the Browsing Tool Changes The Result is Inconsistent • Interactive sites – frequently do not have all resources required for replay in the archives
  • 13. Recall: Likely a Web Page Is Preserved If… • Popular • Surface (e.g., no authentication required) • Simple (no fancy JavaScript) 13 Browser-Based Digital Preservation http://ws-dl.cs.odu.edu
  • 14. From This, We Can Surmise: A page will be UNLIKELY or INSUFFICIENTLY preserved if: • Not Popular • On the Deep Web or Behind Authentication • Contains Fancy Effects or Asynchronously Loaded Resources 14 Browser-Based Digital Preservation http://ws-dl.cs.odu.edu
  • 15. Why and What Can Be Done A page will be UNLIKELY or INSUFFICIENTLY preserved if: • Not Popular • On the Deep Web or Behind Authentication • Contains Fancy Effects or Asynchronously Loaded Resources Internet Archive chooses What is Preserved 15 Browser-Based Digital Preservation http://ws-dl.cs.odu.edu
  • 16. Why and What Can Be Done A page will be UNLIKELY or INSUFFICIENTLY preserved if: • Not Popular • On the Deep Web or Behind Authentication • Contains Fancy Effects or Asynchronously Loaded Resources Preserve from the Browser Instead of Delegating by URI 16 Browser-Based Digital Preservation http://ws-dl.cs.odu.edu
  • 17. Why and What Can Be Done A page will be UNLIKELY or INSUFFICIENTLY preserved if: • Not Popular • On the Deep Web or Behind Authentication • Contains Fancy Effects or Asynchronously Loaded Resources Leverage the Browser’s Native JavaScript rendering engine 17 Browser-Based Digital Preservation http://ws-dl.cs.odu.edu
  • 18. • Internet Archive’s archival crawler – Open source, also deployed @archive.org • Command-line driven • Web-based GUI • Allows list of URIs to be crawled, frequency, crawl depth, etc. – On user’s own Heritrix deployment • Generates Web ARChive (WARC) files State of the Art: 18 Browser-Based Digital Preservation http://ws-dl.cs.odu.edu
  • 19. The Web ARChive (WARC) File Format • 28500 • Multiple text-based record types in a file • Supports arbitrary data types on web • Preserves HTTP information for replaying content WARC/1.0 WARC-Type: warcinfo WARC-Filename: test.warc Content-Length: 42 Format: WARC File Format 1.0 WARC/1.0 WARC-Type: request WARC-Target-URI: http://mysite. WARC-Date: 2014-05-10T12:11:03Z WARC-Concurrent-To: <urn:uuid:b WARC-Record-ID: <urn:uuid:f8e09 Content-Length: 115 GET / HTTP/1.1 User-Agent: Mozilla/5.0 WARC/1.0 WARC-Type: response WARC-Target-URI: http://mysite. WARC-Concurrent-To: <urn:uuid:b WARC-Record-ID: <urn:uuid:f93ea Content-Length: 11262 HTTP/1.1 200 OK <!DOCTYPE html><html> <head> <body>... WARC info Record WARC request Record WARC response Record 19
  • 20. WARC Anatomy • WARC Records each have a header and a payload • One warcinfo/WARC • warcmetadata optional – Describe files – can have multiple/file • Req/Resp records document HTTP transactions @archive time warcinfo record header warcinfo record payload warcmetadata record header warcmetadata record payload warcrequest record header warcrequest record payload warcresponse record header warcreponse record payload warcrequest record header warcrequest record payload warcresponse record header warcreponse record payload MyArchive.warc 20 Browser-Based Digital Preservation http://ws-dl.cs.odu.edu
  • 21. Reading WARCs • “Replaying” content as it once was versus viewing a static web page • – Deployed at Internet Archive (archive.org) – Open Source software • You can deploy your own Wayback! – Indexes WARCs, allows content to be accessible from web interface 21 Browser-Based Digital Preservation http://ws-dl.cs.odu.edu
  • 22. But the Issue Remains… A crawler changes the capture context from the browser Browser-Based Digital Preservation http://ws-dl.cs.odu.edu 22
  • 23. High Level Requirements WARC Creation Software • Intuitive to use – cf. Heritrix and Wayback are difficult to setup • Comply with WARC ISO standard • Capture JavaScript • Capture content behind authentication • Allow a user to execute through the browser – On demand, without need to delegate via URI 23 Browser-Based Digital Preservation http://ws-dl.cs.odu.edu
  • 24. WARCreate “Create WARC files from any webpage” 1. On Page Load 1. Generate warcinfo record 2. Save HTTP Requests, Responses • Content as Strings 2. On Secondary Content Loaded  Repeat 1.2. and Concatenate 3. On Generate WARC command 1. Create Blob from String 2. Provide Blob to FileSaver library 24 Browser-Based Digital Preservation http://ws-dl.cs.odu.edu
  • 25. • onBeforeSendHeaders - HTTP request headers • onHeadersReceived() –HTTP response headers • onBeforeRequest() –HTTP 3XX responses • onResponseStarted() - Payload https://developer.chrome.com/extensions/webRequest 25 Google Chrome’s Webrequest API
  • 26. Payloads Captured • Target information can be captured with webRequest • File/Session descriptors generated at archive time • Payload headers generated once payload captured • Templating system used warcinfo record header warcinfo record payload warcmetadata record header warcmetadata record payload warcrequest record header warcrequest record payload warcresponse record header warcreponse record payload warcrequest record header warcrequest record payload warcresponse record header warcreponse record payload MyArchive.warc ✓ ✓ ✓ ✓ 26 Browser-Based Digital Preservation http://ws-dl.cs.odu.edu
  • 27. A Real warcinfo WARC/1.0 WARC-Type: warcinfo WARC-Date: 2014-06-30T22:56:06Z WARC-Filename: 20140630180606900.warc WARC-Record-ID: <urn:uuid:7fc954f0-e291-279f-5b6b-5b97aeb437bd> Content-Type: application/warc-fields Content-Length: 458 software: WARCreate/0.2014.6.2 http://warcreate.com format: WARC File Format 1.0 conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf isPartOf: basic description: Crawl initiated from the WARCreate Google Chrome extension robots: ignore http-header-user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36 http-header-from: warcreate@matkelly.com • Template • Generated @ archive time • Self-descriptive 27 Browser-Based Digital Preservation http://ws-dl.cs.odu.edu
  • 28. A Real warcmetadata WARC/1.0 WARC-Type: metadata WARC-Target-URI: http://matkelly.com/ WARC-Date: 2014-06-30T22:56:06Z WARC-Concurrent-To: <urn:uuid:dddc4ba2-c1e1-459b-8d0d-a98a20b87e96> WARC-Record-ID: <urn:uuid:6fef2a49-a9ba-4b40-9f4a-5ca5db1fd5c6> Content-Type: application/warc-fields Content-Length: 1938 outlink: http://matkelly.com/_images/logo.png E =EMBED_MISC outlink: http://matkelly.com/resume/ L a/@href • Template • Generated @ archive time • Self-descriptive 28 Browser-Based Digital Preservation http://ws-dl.cs.odu.edu
  • 29. A Real warcrequest WARC/1.0 WARC-Type: request WARC-Target-URI: http://matkelly.com/ WARC-Date: 2014-06-30T22:56:06Z WARC-Concurrent-To: <urn:uuid:ba53adfd-4bd0-9a53-9396-b0dd77e39353> WARC-Record-ID: <urn:uuid:f8e095f3-7b77-f2e6-f592-ce0b63a32f36> Content-Type: application/http; msgtype=request Content-Length: 317 GET / HTTP/1.1 Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q =0.8 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36 Accept-Encoding: gzip,deflate,sdch Accept-Language: en-US,en;q=0.8,de-DE;q=0.6 • Template • Generated @ archive time • Self-descriptive 29 Browser-Based Digital Preservation http://ws-dl.cs.odu.edu HTTP Request header
  • 30. A Real warcresponse WARC/1.0 WARC-Type: response WARC-Target-URI: http://matkelly.com/ WARC-Date: 2014-06-30T22:56:06Z WARC-Record-ID: <urn:uuid:f93eafa7-24c5-c29f-e17f-e0627201b7f1> Content-Type: application/http; msgtype=response Content-Length: 12662 HTTP/1.1 200 OK Date: Mon, 30 Jun 2014 22:56:04 GMT Server: Apache Vary: Accept-Encoding Content-Encoding: gzip Content-Length: 4824 Keep-Alive: timeout=2, max=100 Connection: Keep-Alive Content-Type: text/html <!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head> • Template • Generated @ archive time • Self-descriptive 30 HTTP Response header HTTP Response body
  • 31. The File System Problem • WARC strings must be saved to file system • JS provides access to HTML5 localStorage – Limited to sandbox, 5 MB • NEW Chrome extension API allows unlimited MB – chrome.storage instead of localStorage • HTML5 W3C saveAs() support is limited • FileSaver.js provides polyfill! – Limited to ~ 360 MB/file https://github.com/eligrey/FileSaver.js/ 31 Browser-Based Digital Preservation http://ws-dl.cs.odu.edu
  • 32. Capture Scope • Not limited to robots.txt • Native JS Support (from browser) • User accessible – No URI delegation 32 Browser-Based Digital Preservation http://ws-dl.cs.odu.edu
  • 33. User Dynamics • Simple operation: navigate, click a button • Files downloaded with smart defaults to HDD Browser-Based Digital Preservation http://ws-dl.cs.odu.edu 33
  • 34. • We may not want our personal info in archives • Your http://facebook.com and mine are vastly different – …but same URI! • Even if we capture it, it might not be replayable – Limited by replay system Caveats of Personal Web Archiving 34 Browser-Based Digital Preservation http://ws-dl.cs.odu.edu
  • 35. Recent Work/Outstanding Issues Direct upload of WARC to server Comprehensive Site Archiving (e.g., all your ) • Annotation of Web Pages • Grouping WARCs into archival collections • Periodic, automated execution • Portable JS WARC interaction library • Much more! 35 Browser-Based Digital Preservation http://ws-dl.cs.odu.edu
  • 36. But…Isn’t This Just Coding and Not Research? • WARC work in the JS space is non-existent! • Initial implementations have already uncovered limitations • JS tech has evolved, opened new doors for functionality • WARCreate is state-of-the art for using JavaScript/Browsers FOR archiving 36 Browser-Based Digital Preservation http://ws-dl.cs.odu.edu
  • 37. Browser-Based Digital Preservation WARCreate as Research • Publications – Mat Kelly and Michele C. Weigle, "WARCreate - Create Wayback-Consumable WARC Files from Any Webpage," In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL). Washington, DC, June 2012, pp. 437- 438 – Mat Kelly, Michele C. Weigle and Michael L. Nelson. "WARCreate - Create Wayback-Consumable WARC Files from Any Webpage," Digital Preservation 2012, Tools Demo Session: Web Archiving; 2012 Jul 25; Washington, DC. – Mat Kelly, Michael L. Nelson and Michele C. Weigle. "WARCreate and WAIL: WARC, Wayback and Heritrix Made Easy," Digital Preservation 2013, Workshops and Sessions: Web Archiving; 2013 Jul 24; Alexandria, VA • Academic Funding – Archive What I See Now: Bringing Institutional Web Archiving Tools to the Individual Researcher, National Endowment for the Humanities (NEH), Digital Humanities Start-Up Grant, HD-51670-13, May 2013 - Dec 2014, $57,892 • Notoriety – Future Steward Innovation Award Recipient, National Digital Stewardship Alliance (NDSA) / Library of Congress, July 2012 37 Browser-Based Digital Preservation http://ws-dl.cs.odu.edu
  • 38. Where I Have Traveled as a ODU Researcher Browser-Based Digital Preservation http://ws-dl.cs.odu.edu 38 Washington, DC Atlanta, GA College Park, MD Salt Lake City, UT San Francisco, CA London, England (Sept 2014) …
  • 39. Browser-Based Digital Preservation http://ws-dl.cs.odu.edu http://www.cs.odu.edu/~mkelly 39 Publications Travel Academic Funding Notoriety 