Building a Collection of the Historical UK Web for scholarly use
Helen Hockx-Yu
Head of Web Archiving, British Library
2. www.bl.uk
The UK Web Domain
4th largest TLD, after .com, .de and .net
Over 10 million registered .uk domains
UK organisations also use non .uk domain names (eg .com or .org) – scale unknown
Non-print Legal Deposit (since April 2013) applies to
the open (freely available) web: .uk and other UK-published (non .uk) websites, such as .com, .org…
also e-journals, e-books, news web pages and other digital publications, either by harvesting or mutual agreement on other delivery methods
3. www.bl.uk
Web Archiving at the British Library
Collect UK digital heritage and provide continued access to archived web resources
Started web archiving in 2003: Open UK Web Archive
Selective, topical collections and key sites
Consortium sharing infrastructure and development effort; agreement on who collects what
Curating collections with organisations and researchers
Archiving UK Web for non-print Legal Deposit since April 2013: Legal Deposit UK Web Archive
Comprehensive national archive with on-site access only
Joint responsibility of six Legal Deposit Libraries (LDLs)
4. www.bl.uk
Collecting strategy for websites
Domain crawl:
•Broad sweep of UK domain
•Once or twice a year
Events, key sites and news:
•Events of UK interest
•High value, high impact sites
•National & regional news
Special collections:
•Focused, thematic collections
•Support priority subjects
5. www.bl.uk
UK websites – territoriality explained
An online work is considered as “published in the UK” and therefore in scope for Legal Deposit, if it meets either of the following criteria:
(a) it is made available to the public from a website with a domain name which relates to the United Kingdom or to a place within the United Kingdom; or
(b) it is made available to the public by a person and any of that person’s activities relating to the creation or the publication of the work take place within the United Kingdom
The Legal Deposit Libraries (Non-Print Works) Regulations, 2013
6. www.bl.uk
Territoriality - implementation
All websites with a .uk domain name
Including embedded content (eg CSS, images) regardless of where it is hosted
Non .uk websites have to meet at least one of the following criteria
UK Hosting: check external IP geo-location database and add in-scope URLs to the fetch-chain
UK postal address
Correspondence
Professional judgement
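The scoping rules on this slide can be sketched as a simple decision function. This is a minimal illustration, not the crawler's actual implementation: the function name and its boolean parameters are hypothetical, the geo-location lookup is assumed to have been done elsewhere and passed in as a flag, and the rule for embedded content is not modelled.

```python
from urllib.parse import urlsplit


def is_in_scope(url, hosted_in_uk=False, has_uk_postal_address=False,
                confirmed_by_correspondence=False, curator_approved=False):
    """Return True if a URL would count as UK-published under the slide's rules.

    All keyword flags are hypothetical inputs: they stand in for the IP
    geo-location check, a UK postal address on the site, correspondence
    with the publisher, and a curator's professional judgement.
    """
    host = (urlsplit(url).hostname or "").lower().rstrip(".")

    # Rule 1: every website with a .uk domain name is in scope.
    if host.endswith(".uk"):
        return True

    # Rule 2: a non .uk website must meet at least one supporting criterion.
    return any([hosted_in_uk, has_uk_postal_address,
                confirmed_by_correspondence, curator_approved])


print(is_in_scope("http://www.bl.uk/"))
print(is_in_scope("http://example.com/"))
print(is_in_scope("http://example.com/", hosted_in_uk=True))
```

Passing the evidence in as flags keeps the sketch independent of any particular geo-location database, which the slide does not name.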
7. www.bl.uk
UK Domain Crawl
2013 domain crawl stats
3.86 million seeds
1.9 billion URLs (web pages, docs, images)
~31TB
Duration: 70 days
2014 domain crawl
90 million seeds (starting URLs)
Started on 19th June 2014
Collected 52TB of data by 9th December (incl. 4.4GB of viruses & 3TB of homepage screenshots)
Nearly 2 million non .uk domains
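To put the 2013 figures above in perspective, the short calculation below derives average rates from them. It is a back-of-the-envelope sketch: it assumes decimal terabytes (10**12 bytes) and treats the reported totals as exact, so the results are only indicative. They come out at roughly 300 URLs per second, about 16 kB per URL and just under 500 URLs per seed.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures reported for the 2013 domain crawl.
seeds = 3.86e6        # starting URLs
urls = 1.9e9          # web pages, docs, images
size_bytes = 31e12    # ~31TB, assuming decimal terabytes (an assumption)
days = 70             # crawl duration

urls_per_second = urls / (days * SECONDS_PER_DAY)
avg_bytes_per_url = size_bytes / urls
urls_per_seed = urls / seeds

print(f"{urls_per_second:.0f} URLs per second on average")
print(f"{avg_bytes_per_url / 1000:.1f} kB per URL on average")
print(f"{urls_per_seed:.0f} URLs per seed on average")
```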
8. www.bl.uk
The “access” paradoxes
Completeness versus openness of web archives
Legal Deposit national collections have restricted access
Document-centred versus data-driven
Essentially a scale issue
Pre-selected or defined collections not relevant to all researchers; difficulty in finding relevant content in large-scale web archives.
Arbitrary (national) boundaries often irrelevant to research question but most heritage institutions operate within certain geographical areas
…
10. www.bl.uk
Collaboration with researchers
Building collections
Researchers’ involvement in scoping collections, selecting and describing websites
Creation of specific, (narrow) topical collections
Formulating research question
Brain-storm sessions, workshops, discussion, surveys etc.
Lack of awareness & baseline knowledge
Challenging: you don’t know what you don’t know
Co-development of access services
This is changing how we collect and store data
11. www.bl.uk
JISC UK Web Domain dataset (1996-2013)
Collaboration between the Internet Archive (IA), the Joint Information Systems Committee (JISC) and the British Library
Extracted copies of UK websites from the Internet Archive's collection
1st tranche: 1996 – 2010, 30TB, 2.5 billion URLs
2nd tranche: 2010 – April 2013, 27.5TB, 1.5 billion URLs (estimated)
Research agreement between JISC and IA, upholding IA’s Terms of Use
Access via IA’s Wayback Machine
Allows replication / extraction of derivative or secondary datasets
BL hosts the dataset on behalf of JISC
Data used by research projects
Institute of Historical Research project: Analytical Access to the Domain Dark Archive (AADDA)
Oxford Internet Institute project: Big data for political science
12. www.bl.uk
Completed work
Analytical Access to the Domain Dark Archive Project
Use cases & experimental UI
Demonstrating the Value of the UK Web Domain Dataset for Social Science Research
Analysis of link graph
Paper accepted for WebSci’14: Mapping the UK Webspace: Fifteen Years of British Universities on the Web
MA thesis by Jules Mataly: The Three Truths of Margaret Thatcher: Creating and Analysing
Secondary datasets under open licence
Format profile, Geoindex, Host Link Graph
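As an illustration of what a host link graph is, the sketch below collapses page-level links into yearly host-to-host edge counts. The input format (tuples of year, source URL and target URL), the function name and the sample records are all assumptions made for this example; the slides do not describe how the published dataset was actually produced or formatted.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_link_graph(links):
    """Collapse (year, source_url, target_url) records into host-level edges.

    Returns a Counter mapping (year, source_host, target_host) to the
    number of page-level links observed between the two hosts that year.
    """
    edges = Counter()
    for year, source_url, target_url in links:
        src = urlsplit(source_url).hostname
        dst = urlsplit(target_url).hostname
        # Skip unparseable URLs and links that stay on the same host
        # (dropping self-links is a choice made for this sketch).
        if not src or not dst or src == dst:
            continue
        edges[(year, src, dst)] += 1
    return edges


# Made-up sample records, for illustration only.
sample = [
    (1996, "http://www.bl.uk/index.html", "http://www.ox.ac.uk/"),
    (1996, "http://www.bl.uk/about.html", "http://www.ox.ac.uk/libraries"),
    (1996, "http://www.bl.uk/index.html", "http://www.bl.uk/about.html"),
    (2010, "http://www.ox.ac.uk/", "http://www.bl.uk/"),
]
graph = host_link_graph(sample)
for (year, src, dst), count in sorted(graph.items()):
    print(year, src, "->", dst, count)
```

Aggregating to host level shrinks billions of page links to a graph small enough to explore interactively, at the cost of losing page-level detail.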
13. www.bl.uk
Exploring Host Link Graph
Courtesy of Peter Webster, Rainer Simon and Jules Mataly
18. www.bl.uk
Big UK Domain Data for Arts and Humanities
Funded by the UK Arts and Humanities Research Council as one of the 21 “Big Data” projects
Collaboration between the Institute of Historical Research, Oxford Internet Institute, British Library and Aarhus University
Develop theoretical and methodological framework for the study of web archives
Build on AADDA: researchers and the BL co-produce access tools
A major study of the history of UK web space from 1996 to 2013 + sub-projects covering a range of disciplines
Also an online training course and peer-reviewed journal articles.
20. www.bl.uk
Query building
Corpus formation and handling
Annotation and curation
In-corpus analysis
Whole-dataset analysis
Shine
21. www.bl.uk
What’s in it for us?
Helps researchers understand the value of web archives and explore new ways of using these for scholarly research
Allows BL to obtain hands-on experience with indexing and processing large scale web archive datasets
Prototype analytics and visualisations can be applied to our own Legal Deposit collection
Enables BL to participate in various UK, European and international projects
Helps curators understand characteristics of large scale digital corpora
Improves the way we collect and store web archives
22. www.bl.uk
Web archives for reference AND for analytics
Base-line knowledge self-explanatory
Focus on national events for curated collections; provide means to assemble research corpora
Link to what we do not have
Offer a bag of tools to support scholarly use
The go-to state
Exploit open licences, changes to copyright law
Online access to selected websites, metadata and secondary datasets
The British Library Collection Development Policy for websites
Lobbying – review of Non-print Legal Deposit Regulations in 2018