1. Web Archiving
Tools and Technology
Dan Chudnov - GWU Libraries
dchud at gwu edu
@dchud
IS&T Workshop, April 2, 2013
Washington DC USA
Tuesday, April 2, 13
2. select scope crawl process access
unt nom tool X X
heritrix X X
wct X X X X
netarchive suite X X X X X
warc tools X
nutchwax X X
wayback X
3. select
⢠what to collect
⢠who authorizes
⢠when
⢠what order
4. scope
• how much
• robots.txt
• what to leave out
• which doors not to open
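A scope check along these lines can be sketched in a few lines of Python. This is only an illustration: the robots.txt text, the host list, the excluded path prefix, and the agent name `example-archiver` are all hypothetical, and a real crawler such as Heritrix expresses scope through its own configuration rather than code like this.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

ALLOWED_HOSTS = {"example.org"}      # "how much": hosts we agreed to collect
EXCLUDED_PREFIXES = ("/calendar/",)  # "what to leave out": e.g. endless calendars


def build_robots(text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def in_scope(url, robots, agent="example-archiver"):
    """Return True only if the URL passes host, path, and robots.txt checks."""
    parts = urlparse(url)
    if parts.hostname not in ALLOWED_HOSTS:
        return False
    if parts.path.startswith(EXCLUDED_PREFIXES):
        return False
    return robots.can_fetch(agent, url)


if __name__ == "__main__":
    robots = build_robots(ROBOTS_TXT)
    for candidate in ("http://example.org/about",
                      "http://example.org/private/x",
                      "http://example.org/calendar/2013"):
        print(candidate, in_scope(candidate, robots))
```

Keeping the three checks separate makes it easy to report why a URL was left out, which helps when a group has to agree on scope.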
5. crawl
⢠start with seeds
⢠ďŹnd, queue, follow links
⢠be kind to each site
⢠parallelize across sites
⢠schedule, log,
checkpoint, resume
⢠bundle
6. process
⢠lump, split, bundle,
rebundle
⢠quality control
⢠index, surrogate,
reorder, prep for access
⢠store, distribute,
preserve
7. access
⢠browse
⢠search
⢠known items
⢠patterns
⢠needles
8. select scope crawl process access
unt nom tool X X
heritrix X X
wct X X X X
netarchive suite X X X X X
warc tools X
nutchwax X X
wayback X
9. UNT URL Nomination Tool
⢠collaborative
selection
⢠collect seed lists
⢠attach metadata
⢠agree on scope
⢠feed crawlers
10. heritrix
⢠free software from
Internet Archive
⢠easy to start with
⢠difďŹcult to master
⢠powerful, conďŹgurable,
confusing
11. heritrix cont'd
• two major versions, "1" and "3"
• WCT and NetArchive embed "1"
• "1" - minimal UI
• "3" - even less
• iterate early - long learning curve
• best available tool
16. NetarchiveSuite
⢠free software from
netarkivet.dk
⢠used by State and University
Library, The Royal Library in
Denmark
⢠complete solution from
selection to access