Lessons Learned Archiving the
National Web of New Zealand
              Kris Carpenter Negulescu
                 The Internet Archive

                Gordon Paynter
      The National Library of New Zealand

    Future Perfect 2012, 27 March 2012, Wellington, New Zealand
Why collect the web?
Legal deposit

• The National Library of New Zealand Act (2003)
• “Legal deposit” now includes “Internet documents”
• Available from http://legislation.govt.nz/
Two web archiving programmes

Selective Harvesting of specific websites or parts of websites
Domain Harvesting of the entire “New Zealand Internet”




http://topics.breitbart.com/fishing+pole/




                                            http://www.trimarinegroup.com/operations/fleet.php
Selective Web Archiving
[Diagram: the National Digital Heritage Archive, NDHA (Rosetta), at the centre.
 Submission tools: Web Curator Tool; Other Published & Unpublished Material;
 Digitisation & Sound Preservation.
 Access tools: National Library Beta, Voyager, Tapuhi, etc.; Timeframes, Papers Past;
 Rosetta Access modules including ArcViewer.
 Also labelled: Administration; Collection Management Systems; Technology
 Infrastructure; IAMS.]
Selective web archiving

From January 2007: 14,182 harvests
• 83% Endorsed and Archived
• 17% Rejected or Aborted
• Using the Web Curator Tool
From 2000-2006: 441 harvests
• Some of multiple websites
• Using a desktop website capture tool
New Zealand Web Harvests

        October 2008
         April 2010
New Zealand Web Harvests

•   Scope
•   Seeds
•   Robots Policy
•   Notification and communications
•   How are we going to accomplish this?
•   When are we going to stop?
New Zealand Web Harvests

              2008                    2010
Duration      17 days in October      24 days in April-May
URLs          106,184,620             131,770,485
Data          4.6 Terabytes           6.9 Terabytes
Hosts         397,000                 559,000
Seeds         Known hosts             Include .nz, .com, .org and .net zone files
New Zealand Web Harvests

•   Harvest analysis:
     – What exactly do we have?
     – What’s a good harvest frequency?

•   Preservation analysis:
     – ARC or WARC format?
     – Should they be stored in the National
       Digital Heritage Archive?

•   Public access analysis:
     –   Ethical issues
     –   Privacy issues
     –   Legal and evidentiary value
     –   Copyright
Challenges and Lessons
Scope of a National Domain
•   How is a national web domain defined?
    – Hosts in the top-level domain or domains operated by registrars in
      country?
    – Hosts known to be hosted on IP addresses within geographic
      boundaries?
    – Content and advertising embedded in web sites published to the above?
    – Curator-selected web sites, destinations, or services considered to be
      within the bounds of a country’s legislative or cultural heritage?
Scope of a National Domain
•   New Zealand Web Harvest scope:
    – Hosts in the .nz top-level domain
    – Hosts from .com, .org and .net that are physically in New Zealand
    – A list of hosts known to be within the scope of the legislation
    – Images, video clips, and other files that are embedded in web pages on the hosts
      above

•   New Zealand Web Harvest seeds:
    – 2008: Gathered from the Library and the Internet Archive’s past crawls
    – 2010: Zone files for .nz, .com, .org and .net (plus 2008 hosts)
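
The scope rules above can be read as a simple decision function. The following is a minimal sketch only, not the harvest's actual configuration: the function names and the two input sets (`nz_hosted_hosts`, `curated_hosts`) are hypothetical, and how a host is determined to be physically in New Zealand is assumed to have been worked out beforehand.

```python
from urllib.parse import urlsplit

# Generic top-level domains that are only in scope when hosted in New Zealand.
GENERIC_TLDS = (".com", ".org", ".net")


def in_scope(url, nz_hosted_hosts, curated_hosts):
    """Return True if the URL's host falls inside the harvest scope.

    nz_hosted_hosts and curated_hosts are hypothetical, pre-computed sets:
    hosts found to be physically in New Zealand, and hosts known to be
    within the scope of the legislation.
    """
    host = (urlsplit(url).hostname or "").lower()
    if not host:
        return False
    if host.endswith(".nz"):
        return True  # rule 1: any host in the .nz top-level domain
    if host.endswith(GENERIC_TLDS) and host in nz_hosted_hosts:
        return True  # rule 2: .com/.org/.net hosts located in New Zealand
    return host in curated_hosts  # rule 3: the curated list


def embed_in_scope(page_url, nz_hosted_hosts, curated_hosts):
    """Rule 4: an embedded file is in scope when the embedding page is."""
    return in_scope(page_url, nz_hosted_hosts, curated_hosts)
```

Keeping the embed rule keyed on the embedding page, rather than on the file's own host, is what lets images and video served from out-of-scope hosts be collected with the page.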
Shape of harvest
•   How broad or deep should the harvest be?
    – Usually as broad as possible (survey of all resources at the highest
      levels)
    – Usually deep enough to collect primary resources of interest and
      minimize unwanted, unrelated junk prevalent in any top level domain
Shape of harvest
•   New Zealand Web Harvest
    – Up to 10,000 URLs from every host
    – But up to 50,000 for .govt.nz and .ac.nz.

•   On average, about 250 URLs (12 megabytes) per host
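
As a rough illustration, the per-host limits and the quoted average can be written down as follows. This is a sketch under stated assumptions: the function names are hypothetical, the totals are the 2008 and 2010 figures from the harvest summary slide, and one terabyte is taken as 10^6 megabytes.

```python
DEFAULT_CAP = 10_000                      # URLs allowed from an ordinary host
EXTENDED_CAP = 50_000                     # URLs allowed from .govt.nz and .ac.nz hosts
EXTENDED_SUFFIXES = (".govt.nz", ".ac.nz")


def url_cap(host):
    """Return the maximum number of URLs to collect from this host."""
    return EXTENDED_CAP if host.lower().endswith(EXTENDED_SUFFIXES) else DEFAULT_CAP


# Totals as reported for the two domain harvests.
HARVESTS = {
    2008: {"urls": 106_184_620, "hosts": 397_000, "terabytes": 4.6},
    2010: {"urls": 131_770_485, "hosts": 559_000, "terabytes": 6.9},
}


def per_host_averages(harvest):
    """Return (URLs per host, megabytes per host) for one harvest."""
    urls_per_host = harvest["urls"] / harvest["hosts"]
    # Assumption: decimal units, 1 terabyte = 1,000,000 megabytes.
    mb_per_host = harvest["terabytes"] * 1_000_000 / harvest["hosts"]
    return urls_per_host, mb_per_host


for year, harvest in HARVESTS.items():
    urls, mb = per_host_averages(harvest)
    print(f"{year}: {urls:.0f} URLs and {mb:.1f} MB per host")
```

Both years come out close to the quoted figure of about 250 URLs and 12 megabytes per host, far below either cap, which suggests the caps affect only a small number of very large hosts.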
Harvest Policies & Practices

•   Robots Policy
     – Respect robots.txt
     – Ignore robots.txt for embeds and inline content on unrestricted pages

•   Notification
     – Notifications may be sent to site owners/publishers prior to harvest

•   Politeness settings
     – Usually limit crawler load to that of a visitor navigating the site in a browser

•   Trade-off of harvest duration vs scale of resources
     – Need to keep the data capture period brief
Harvest Policies & Practices

•   New Zealand Web Harvest Robots Policy
     – Selective: Ignore robots.txt (usually)
     – 2008: Ignore robots.txt (unless asked otherwise)
     – 2010: Mostly honour robots.txt (following consultation)

•   Four to six weeks of notification through many channels
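
The robots policies above differ mainly in one switch (honour or ignore) plus an exception for embedded content. A minimal sketch using Python's standard `urllib.robotparser` is shown below; the crawler name, the function `may_fetch` and its parameters are hypothetical, and this is not how Heritrix itself is configured.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-harvest-bot"  # hypothetical crawler name


def may_fetch(url, robots_lines, honour_robots=True, embedded_in_allowed_page=False):
    """Decide whether a URL may be fetched under a given robots policy.

    robots_lines is the site's robots.txt content as a list of lines.
    honour_robots=False models the "ignore robots.txt" policies.
    embedded_in_allowed_page=True models the exception for embeds and
    inline content belonging to a page that is itself unrestricted.
    """
    if not honour_robots:
        return True
    parser = RobotFileParser()
    parser.parse(robots_lines)
    if parser.can_fetch(USER_AGENT, url):
        return True
    return embedded_in_allowed_page


robots = ["User-agent: *", "Disallow: /images/"]
print(may_fetch("http://example.co.nz/index.html", robots))
print(may_fetch("http://example.co.nz/images/logo.png", robots))
print(may_fetch("http://example.co.nz/images/logo.png", robots,
                embedded_in_allowed_page=True))
```

In this sketch the blocked image is refused when requested on its own but allowed when it is needed to render a page the crawler was permitted to collect.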
Harvest Infrastructure

•   Dedicated crawlers to capture data
     – Service nodes for reporting and access; shared infrastructure for automated QA,
       data mining and analysis

•   Hardware:
     – Quad Core Processors (2.6 GHz)
     – 4-8 GB RAM per core
     – 8+ Terabytes of local disk (Four 2-Terabyte SATA drives)

•   Software:
     – Ubuntu Linux
     – Java(TM) SE Runtime Environment (latest build)
     – Heritrix 3 or v1.14.x

•   Network:
     – Bandwidth is limited to ~300 Mbits/sec/project
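
For a sense of scale, the bandwidth ceiling can be compared with the average rate implied by the 2008 and 2010 harvest totals. This is a back-of-envelope sketch only: it assumes decimal terabytes and treats the stored volume as equal to the bytes transferred, which may not hold if the archive files are compressed. The function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def average_mbit_per_s(terabytes, days):
    """Average transfer rate in Mbit/s for a harvest of the given size and length."""
    bits = terabytes * 1e12 * 8          # assumption: 1 terabyte = 10^12 bytes
    return bits / (days * SECONDS_PER_DAY) / 1e6


# (year, terabytes collected, days of crawling) as reported for each harvest
for year, terabytes, days in [(2008, 4.6, 17), (2010, 6.9, 24)]:
    print(f"{year}: about {average_mbit_per_s(terabytes, days):.0f} Mbit/s on average")
```

Under these assumptions both harvests average roughly a tenth of the ~300 Mbit/s limit, which is consistent with politeness settings and per-host caps, rather than raw bandwidth, governing crawl duration.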
Harvest Infrastructure
              The New Zealand Web Harvests were
             commissioned from the Internet Archive.

In-house
•   Possibly cheaper
•   Large staff requirement
•   Hardware requirements
•   Network requirements
•   Risks: what don’t we know?

Commissioned
•   Higher outright cost
•   Contractor provides expertise: Heritrix, crawler traps, scope, etc.
•   Contractor provides staff, computers, bandwidth

             Unexpected issue: International bandwidth
Challenges of All Web Archiving

•   Not all data can be crawled
•   Can publishers “opt in” or “opt out”?
•   Data may be lost no matter how carefully it is managed
•   Harvested data is hard to make accessible
     – Intuitive interfaces for discovering and navigating resources
     – With robust APIs
     – All done in a compelling and sustainable way

•   Research and experimentation are essential to keep pace with
    publisher innovation
Challenges of Domain Archiving
•   Harvests are at best samples
     – Time & expense: can’t get everything
     – Rate of change: don’t get every version
     – Rate of collection: issues of ‘time skew’
•   Choice of User agents/protocols
     – If you crawl as the Mozilla agent your content
       may not redisplay in IE
     – Which mobile agents should you crawl as, if any?
•   Site structure & publishing models
     – Some parts of sites are not “archive-friendly”
       (JavaScript, AJAX, Flash, etc.)
     – Some sites change both their technical structure and policy
       quickly and often (YouTube, Facebook, etc.)
Challenges of Domain Archiving
70+% of the world’s digital content is now generated by individuals –
  not all of it can be crawled
(UK Telegraph, IDC annual survey, released May 2010)


Social networks and collaborative/semi-private spaces
Immersive Worlds
Challenges of Domain Archiving

•   Manageable Costs/Sustainable Approaches
     – Access to power & other critical operational resources
     – Sufficient processing capacity for collection, analysis,
       discovery, & dissemination of resources
     – Bandwidth
•   Recruitment and retention of staff/engineering expertise;
    effective ongoing training
Challenges of Domain Archiving

When do you stop crawling?
•   The internet is infinitely large!
•   Indicators that suggest diminishing returns have set in:
     – A relatively small number of remaining hosts have a lot of depth
     – More HTML than images appearing in the crawl log
     – Higher incidence of crawler traps, content farms

•   At this point we expect:
     – We will capture proportionally more junk
     – Website owners will complain that we're over-crawling
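
Two of the indicators above can be turned into simple measurements. The sketch below is illustrative only: the inputs (a recent window of MIME types taken from the crawl log, and a count of queued URLs per remaining host), the function names and the thresholds are all assumptions, not values used in the New Zealand harvests.

```python
def html_to_image_ratio(mime_types):
    """Ratio of HTML documents to images in a recent window of the crawl log."""
    html = sum(1 for mime in mime_types if mime == "text/html")
    images = sum(1 for mime in mime_types if mime.startswith("image/"))
    if images == 0:
        return float("inf") if html else 0.0
    return html / images


def top_host_share(queued_per_host, top_n=10):
    """Share of all queued URLs that belongs to the top_n deepest hosts."""
    total = sum(queued_per_host.values())
    if total == 0:
        return 0.0
    largest = sorted(queued_per_host.values(), reverse=True)[:top_n]
    return sum(largest) / total


def diminishing_returns(mime_types, queued_per_host, top_n=10,
                        ratio_threshold=1.0, share_threshold=0.5):
    """True when both indicators exceed their (assumed) thresholds."""
    return (html_to_image_ratio(mime_types) > ratio_threshold
            and top_host_share(queued_per_host, top_n) > share_threshold)
```

A ratio above 1.0 corresponds directly to "more HTML than images"; the 50% queue share is an arbitrary starting point that would need tuning against real crawl data.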
Challenges of Domain Archiving

How do you assess the quality of a harvest?
•   Quantitative measures of quality, breadth and depth
•   Qualitative measures including characterization of resources and how
    they fit with other collections
•   A harvest usually runs for weeks, depending on the desired scope, and is
    followed by a “patch crawl”
Challenges of Domain Archiving

•   Being responsive during a crawl
•   New Zealand Web Harvest 2008:
    – 37 individual contacts during harvest
    – 2 major mailing list discussions
    – Blogs & Twitter
    – Newspapers (“Library harvest costs website dear”) and radio

•   A communications strategy and plan are essential
    – The biggest difficulty is responding promptly outside working hours
Final thoughts?

    What have we learned that is
particularly relevant to New Zealand?
Final thoughts

•   New Zealand faces the same challenges as our peers overseas
•   Most of the world favours dedicated web archives
    – But we’re preserving web material alongside other formats.

•   When will it be economical to harvest from New Zealand?
Final thoughts: how should national
            domain crawls work?
•   Institutions crawl within their national domains from their own
    national infrastructure
•   Institutions share tools, metadata, knowledge and best practices
     – And to the extent possible – data!
     – Collaboration will always achieve greater results than acting alone!


•   Over the long term, shared goals and resources can help
    mitigate economic and other barriers to collection, mining, and
    access of New Zealand’s national digital heritage
