Presented at the International Conference on Web Archives and Electronic Legal Deposit, at the Biblioteca Nacional de España (BNE), on 9 July 2013.
Archiving the French Web: the BnF web archiving workflow. Sara Aubry
1. Archiving the French Web:
the BnF web archiving workflow
Sara Aubry
Web Archiving Project Manager, IT department
Bibliothèque nationale de France
International Conference on Web archives and e-LD
Biblioteca Nacional de España, Madrid, July 9th 2013
2. Let’s start with some figures
• Programme start in 2000, industrialisation in 2008-2012
• Collections:
– 1996 - now
– 20 000 websites for focused crawls, 2.5 million .fr domains for broad
crawls
– 18.8 billion URLs, 370 TB, growing by 100 TB per year
• Resources:
– 9 Full Time Employees (5 librarians, 4 engineers)
– many partners inside and outside the Library, at both the national and
international level
– 70 robots (648GB RAM, 144 CPUs 2.4GHz)
3. Digital curation is not different!
• « Actions, tools and practices defined
and applied to collect, identify, select,
organize and preserve digital contents
(…) in order to use them and make them
available (…) »
Definition of Digital Archiving in Wikipedia
6. Selecting with BCWeb
• A form-based application, commonly called a
« curator tool »
– for content curators and researchers to nominate
websites to harvest
– giving basic information about them (content policies,
trends watch)
• Most important information for each website:
– Internet address/URL
– frequency (daily, monthly, yearly, once…)
– size/budget (small, medium, big)
– depth (entire domain, part of it)
Content curators
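
The four fields listed above could be modelled as a small record. This is only an illustrative sketch, not BCWeb's actual data model: the class names, enum names and the example address are assumptions; only the field values (daily, monthly, yearly, once; small, medium, big; entire domain, part of it) come from the slide.

```python
from dataclasses import dataclass
from enum import Enum
from urllib.parse import urlparse


class Frequency(Enum):
    DAILY = "daily"
    MONTHLY = "monthly"
    YEARLY = "yearly"
    ONCE = "once"


class Budget(Enum):
    SMALL = "small"
    MEDIUM = "medium"
    BIG = "big"


class Depth(Enum):
    ENTIRE_DOMAIN = "entire domain"
    PART_OF_DOMAIN = "part of it"


@dataclass(frozen=True)
class Nomination:
    """One website nominated by a curator (hypothetical structure)."""
    url: str
    frequency: Frequency
    budget: Budget
    depth: Depth

    def __post_init__(self):
        # Reject anything that is not an absolute http(s) address.
        parts = urlparse(self.url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute web address: {self.url!r}")


# Example nomination; the address is a placeholder, not a real BnF selection.
nomination = Nomination(
    url="http://www.example.fr/",
    frequency=Frequency.MONTHLY,
    budget=Budget.MEDIUM,
    depth=Depth.ENTIRE_DOMAIN,
)
```

Using enumerations rather than free text keeps the harvest parameters to the closed set of choices a form-based tool would offer.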
7. The Web is made of HTML pages
1 HTML page, 48 URLs
• 1 HTML
• 1 text/css
• 4 javascript
• 17 image/png
• 5 image/jpeg
• 21 image/gif
all links and
inclusions are URL
references
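
To illustrate how a single page expands into many URL references, here is a minimal sketch using Python's standard `html.parser`. The sample page, the base address and the short tag-to-attribute table are assumptions chosen for illustration; a real harvester recognises many more kinds of reference (CSS imports, scripts that build URLs, and so on).

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin


class ReferenceExtractor(HTMLParser):
    """Collects links and inclusions found in one HTML page."""

    # Attribute that usually carries the URL reference, per tag (simplified).
    URL_ATTRS = {"a": "href", "link": "href", "img": "src", "script": "src"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.references = []

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative references against the page address.
                self.references.append((tag, urljoin(self.base_url, value)))


# Hypothetical page, much smaller than the 48-URL example on the slide.
SAMPLE_PAGE = """
<html><head>
<link rel="stylesheet" href="/style.css">
<script src="/js/menu.js"></script>
</head><body>
<img src="logo.png">
<a href="http://www.example.org/about">About</a>
</body></html>
"""

extractor = ReferenceExtractor("http://www.example.fr/index.html")
extractor.feed(SAMPLE_PAGE)
counts = Counter(tag for tag, _ in extractor.references)
```

Here `counts` tallies references per tag, in the spirit of the per-type breakdown on the slide.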
8. Harvesting with Heritrix
• A harvester is a piece of
software (crawler,
spider, robot)
• Simulates what a
person would do with a
browser but repeatedly
and very fast
• Follows a looping
process
• Repeated as long as new,
in-scope URLs are found
and limits (budget and
time) are not reached
Loop diagram: Pick a location → Make a request → Receive a response → Examine for references → Save the content (WARC)
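
The looping process described on this slide can be sketched in a few lines. This is a simplified illustration, not Heritrix's implementation: the in-memory `FAKE_WEB`, the same-host scope rule and the budget expressed as a number of saved URLs are all assumptions, the time limit is left out, and content is kept in a dictionary rather than written to WARC files.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": URL -> (content, outgoing references).
FAKE_WEB = {
    "http://www.example.fr/": ("home", ["http://www.example.fr/a", "http://other.example.org/"]),
    "http://www.example.fr/a": ("page a", ["http://www.example.fr/", "http://www.example.fr/b"]),
    "http://www.example.fr/b": ("page b", []),
}


def in_scope(url, seed_host):
    # Assumed scope rule: stay on the seed's host.
    return urlparse(url).hostname == seed_host


def crawl(seed, fetch, budget):
    """Crawl from `seed` until no new in-scope URL remains or `budget` URLs are saved."""
    seed_host = urlparse(seed).hostname
    frontier = deque([seed])
    seen = {seed}
    saved = {}
    while frontier and len(saved) < budget:
        url = frontier.popleft()              # pick a location
        response = fetch(url)                 # make a request, receive a response
        if response is None:
            continue
        content, references = response
        for ref in references:                # examine for references
            if ref not in seen and in_scope(ref, seed_host):
                seen.add(ref)
                frontier.append(ref)
        saved[url] = content                  # save the content
    return saved


archive = crawl("http://www.example.fr/", FAKE_WEB.get, budget=10)
```

The `seen` set is what stops the loop from revisiting pages that link back to each other, and the `budget` check is what ends the crawl on very large sites.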
9. Assets:
- open source
- small and large scale
- textual or all-media formats
- data structures
11. Engineers : IT department
Challenges:
• rich media and ever-changing
environment
• social networks
• content behind paywalls
(news sites, ebooks)
12. Piloting the crawls with
NetarchiveSuite
• Prepare, schedule, run and monitor harvests
of websites, perform QA
Digital curators: legal
deposit department
Engineers : IT department
13. Offering access with Wayback
• Give readers the ability to
browse the web “as it
was” with:
– a regular web browser
– search and redisplay
software
• An application called
“Web archives”
– Wayback: for URL search,
display and browsing
– Nutch prototype for
keyword search
– Guided paths for collection
highlights
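
Browsing the web "as it was" comes down to finding, for a given URL, the capture closest to the date the reader asks for. The sketch below illustrates that idea only; it does not use Wayback's code or index format. The `CAPTURES` table and the function name are hypothetical, and the timestamps use the 14-digit YYYYMMDDhhmmss form familiar from Wayback replay URLs.

```python
from bisect import bisect_left
from datetime import datetime

TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"  # 14-digit YYYYMMDDhhmmss

# Hypothetical capture index: URL -> capture timestamps, sorted ascending.
CAPTURES = {
    "http://www.example.fr/": ["20100115120000", "20110620083000", "20130301000000"],
}


def closest_capture(url, requested):
    """Return the capture timestamp nearest to `requested`, or None if the URL was never captured."""
    timestamps = CAPTURES.get(url)
    if not timestamps:
        return None
    # Only the neighbours around the insertion point can be the closest.
    i = bisect_left(timestamps, requested)
    candidates = timestamps[max(i - 1, 0):i + 1]
    target = datetime.strptime(requested, TIMESTAMP_FORMAT)
    return min(
        candidates,
        key=lambda ts: abs(datetime.strptime(ts, TIMESTAMP_FORMAT) - target),
    )


best = closest_capture("http://www.example.fr/", "20110101000000")
```

Because fixed-width timestamps sort the same way as text and as dates, a binary search over the sorted list is enough to locate the neighbouring captures.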
17. Challenges:
• links with our main Catalogue and
open data repository
• “smart” URL search
• full text search and indexing
• small-scale data mining projects with
researchers