1. Web archives: a new class of
primary source for historians?
Peter Webster (British Library)
@pj_webster / @UKWebArchive
2. www.bl.uk 2
Scarcity or abundance?
• Rosenzweig, ‘Scarcity or Abundance? Preserving the Past
in a Digital Era’, American Historical Review 108, 3 (June
2003) http://tinyurl.com/oawcltx
3. Web archiving DIY
• BootCat (bootcat.sslmit.unibo.it)
• Wget (lesson by @ianmilligan1:
http://programminghistorian.org/lessons/)
• The Historian’s WARC Toolkit
(https://github.com/ianmilligan1/Historian-WARC-1)
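As a rough illustration of the DIY route, the sketch below assembles a Wget command that mirrors a site and writes the crawl to a WARC file. It only builds and prints the command; it does not run it. The function name, the example URL and the two-second wait are assumptions made for illustration, not settings taken from the lesson or the toolkit.

```python
import shlex


def build_wget_warc_command(url, warc_name, wait_seconds=2):
    """Return an argument list for a polite Wget crawl that also writes a WARC file.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    return [
        "wget",
        "--mirror",                  # recursive download with timestamping
        "--page-requisites",         # also fetch images, CSS and scripts
        f"--wait={wait_seconds}",    # pause between requests to spare the server
        f"--warc-file={warc_name}",  # write the crawl to a WARC file with this base name
        url,
    ]


# Example URL and file name are placeholders.
command = build_wget_warc_command("http://example.org/", "example-crawl")
print(shlex.join(command))
```

The list can be passed to subprocess.run once the target site and crawl settings have been checked.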
4. UK Web Archive
• Selective archiving since 2004
• 13,000 sites, 60,000 instances,
20TB of data
• British Library, National Library of
Wales, JISC
• Plus many collaborators:
Women’s Library, Live Art
Development Agency, NHS
• http://webarchive.org.uk
6. An archived website in UKWA
votedavidcameron.org (archived 24/5/05) at UK Web Archive
7. Non-Print Legal Deposit (web): what may
we collect?
Web resources that:
• are issued from a .uk or other UK geographic top-level
domain, or
• are published through a process part of which takes place in the UK;
• but excluding any that are accessible only to audiences
outside the UK.
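The three conditions above can be read as a simple decision rule. The sketch below is one possible encoding, not the Library's actual procedure: the list of UK geographic top-level domains is an illustrative assumption, and the two boolean flags stand in for facts that cannot be read off a URL and would have to be established some other way.

```python
from urllib.parse import urlsplit

# Assumed, illustrative list of UK geographic top-level domains;
# not the statutory definition.
UK_TLDS = (".uk", ".scot", ".wales", ".cymru", ".london")


def in_scope(url, published_in_uk=False, only_accessible_outside_uk=False):
    """Return True if a web resource would fall within the collecting scope.

    Hypothetical helper: the flags must be supplied by the caller.
    """
    # The exclusion overrides both inclusion criteria.
    if only_accessible_outside_uk:
        return False
    host = (urlsplit(url).hostname or "").lower()
    # Either a UK top-level domain or UK-based publishing is sufficient.
    return host.endswith(UK_TLDS) or published_in_uk


print(in_scope("http://www.bl.uk/"))
print(in_scope("http://example.com/", published_in_uk=True))
```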
8. JISC UK Web Domain Dataset 1996-2010
• Funded by JISC to create a research collection of UK
websites
• Collaboration between the Internet Archive, JISC and the
British Library
• Copy of subset of the Internet Archive’s web collection that
relates to the UK
• 470,466 files (arc.gz and warc.gz), 32TB in total
• No local access; access is possible through the Internet Archive
• Can be used to generate secondary datasets
9. Big Data project (Oxford Internet Institute)
• “Demonstrating the value of the UK Web Domain Dataset
for social science research”
• Led by Professor Helen Margetts
• Link analysis of the structure of the UK government web estate
• http://www.oii.ox.ac.uk/research/projects/?id=88
• Funded by JISC
10. Datasets available for download
Link data
1996 | appserver.ed.ac.uk | portico.bl.uk | 1
19GB, available at: http://tinyurl.com/kon2eve
Geo-index
8GB (compressed) at: http://tinyurl.com/knn4zmz
File format analysis
http://tinyurl.com/nz4xoah
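A sketch of how the link data might be read, assuming each row holds four pipe-separated fields: year, source host, target host and number of links. That column interpretation is an assumption based on the sample row on this slide; the function names and the second sample row are hypothetical.

```python
from collections import Counter


def parse_link_row(row):
    """Split one row into (year, source_host, target_host, link_count).

    Assumes four pipe-separated fields; check the dataset's own
    documentation before relying on this layout.
    """
    year, source, target, count = (field.strip() for field in row.split("|"))
    return int(year), source, target, int(count)


def links_per_year(rows):
    """Sum the link counts for each year across all rows."""
    totals = Counter()
    for row in rows:
        year, _source, _target, count = parse_link_row(row)
        totals[year] += count
    return totals


sample = [
    "1996 | appserver.ed.ac.uk | portico.bl.uk | 1",
    "1996 | www.example.ac.uk | portico.bl.uk | 3",  # hypothetical row
]
print(links_per_year(sample))
```

For the full 19GB file, the rows would be streamed from disk line by line rather than held in a list.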
12. Analytical Access to the Domain Dark
Archive (AADDA)
• Led by Dr Jane Winters (IHR)
• In partnership with the British Library and the University of
Cambridge
• http://domaindarkarchive.blogspot.co.uk
• Funded by JISC
• Bringing together HSS researchers to help the Library
develop a web user interface.
• Feb 2012 – Oct 2013
19. Methodological challenges: what is in the
archive?
• National web archives: some selective, some legal deposit
• When is comprehensive not comprehensive?
• Defining the national (http://tinyurl.com/m9ue5gw)
20. Methodological challenges: when was it in
the archive?
• Understanding the crawl profile
• Crawl date NOT publication date
• Citation standard: cite what was archived and when it was archived
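One way to keep the crawl date explicit in a citation is to derive it from the archive's own capture timestamp. The sketch below assumes a 14-digit timestamp (YYYYMMDDhhmmss) of the kind used in Wayback-style archive URLs; the citation layout, the function names, the page title and the time-of-day digits are illustrative assumptions, not an established standard.

```python
from datetime import datetime


def parse_crawl_timestamp(timestamp):
    """Parse a 14-digit capture timestamp (YYYYMMDDhhmmss) into a datetime."""
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S")


def format_citation(title, original_url, crawl_timestamp, archive_name):
    """Build a citation recording what was archived and when it was crawled.

    The crawl date is the date of capture, not the date of publication.
    """
    crawled = parse_crawl_timestamp(crawl_timestamp)
    return f"{title}, {original_url} (archived {crawled:%d %B %Y}, {archive_name})"


# Title and time of day are placeholders; the date follows the
# votedavidcameron.org example (archived 24/5/05).
citation = format_citation(
    "Vote David Cameron",
    "http://votedavidcameron.org/",
    "20050524000000",
    "UK Web Archive",
)
print(citation)
```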