Collecting Government Web Content at
   the National Library of Australia



        AGLIN Forum 2 May 2012
                 Paul Koerbin
           Manager Web Archiving
          National Library of Australia
Web Archiving at the NLA
•   Background
•   Scale of collections
•   Archival collections (selective, bulk, govt)
•   Objectives, selection and scope
•   Retention and preservation
•   Finding government content in PANDORA
Web Archiving at the NLA
• Began web archiving activity in 1996
  – http://pandora.nla.gov.au/
• Government content is included in all NLA web
  collections
  – ‘PANDORA Archive’ collection, 1996 to now
      • Selective
  – The ‘auscrawl’ whole .au domain harvest collections
      • Annual since 2005
  – The ‘whole-of-government’ collections
      • Seed list
      • 2011, 2012
Web Archiving at the NLA
• Scale of collecting
  – PANDORA (as at April 2012, i.e. 15 years of collecting)
     • 31,000 titles
          – All govt ~ 55 % of titles
          – Commonwealth Govt ~ 12 % of titles
     • 75,000 instances
     • 145 million files
     • 6.5 TB
  – Australian .au domain harvests 2005-2011
     • 3.5 billion files
     • 140 TB
  – ‘Whole-of-government’ seed list crawl 2011
     • 7.4 million files
     • 538 GB
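
As a rough worked example of the figures above, the average size per harvested file can be derived from the file counts and storage totals. This is only a sketch: it assumes the storage figures are in decimal units (1 TB = 10**12 bytes), which the slide does not state, and the function name is illustrative.

```python
# File counts and storage totals as quoted on the slide.
# Decimal units (1 TB = 10**12 bytes, 1 GB = 10**9 bytes) are an assumption.
COLLECTIONS = {
    "PANDORA (as at April 2012)": (145_000_000, 6.5e12),
    ".au domain harvests 2005-2011": (3_500_000_000, 140e12),
    "Whole-of-government crawl 2011": (7_400_000, 538e9),
}


def average_file_size_kb(files, total_bytes):
    """Return the mean size per file in kilobytes (1 kB = 1000 bytes)."""
    return total_bytes / files / 1000


for name, (files, total_bytes) in COLLECTIONS.items():
    print(f"{name}: ~{average_file_size_kb(files, total_bytes):.1f} kB per file")
```

Under that assumption the averages come out at roughly 45 kB, 40 kB and 73 kB per file respectively, so the three collections hold files of broadly similar size despite very different totals.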
Web Archiving at the NLA
• PANDORA Archive
  – Strong representation of govt content including Commonwealth,
    State and Territory, and local govt (> 50 % of titles)
  – Generally does not include whole departmental websites
  – Prominent ministerial micro-sites (speeches, press releases)
  – Government initiatives websites (e.g. Firearms buyback, 2000)
  – Major reports, enquiries, documents (e.g. Gershon Review, 2008)
  – Discrete ‘titles’ and ‘instances’ – no links between instances
  – Quality checked
  – Catalogued and full text indexed
  – Accessible through the Trove and PANDORA discovery
    services
Web Archiving at the NLA
• Whole .au domain harvests (‘auscrawl’)
  –   Crawls of the entire .au domain (plus some)
  –   Averages over 1 million hosts crawled each year (av. 650m files)
  –   Includes gov.au second level domain
  –   Relies on crawler capabilities and subject to crawler limitations
      and constraints
  –   Obeys robots.txt (except for inline image and style elements)
  –   No quality checking for completeness of harvest or functionality
      (e.g. look and style)
  –   Retains linkages between content that is in scope for the crawl
  –   Full-text and URL indexes
  –   But, not accessible to public
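
The robots.txt behaviour described above (obey exclusions, except for inline images and style elements) might be sketched as follows. This is not the NLA's crawler configuration: the robots.txt content, the user-agent name and the `may_harvest` helper are hypothetical, and deciding what counts as an inline resource by file extension is a simplification; a real crawler would use the context in which the link was found.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /images/
"""

# Extensions treated as inline images or style elements (an assumption).
INLINE_EXTENSIONS = (".css", ".png", ".jpg", ".jpeg", ".gif")


def may_harvest(parser, user_agent, url):
    """Allow inline images and stylesheets; otherwise defer to robots.txt."""
    path = url.split("?", 1)[0].lower()
    if path.endswith(INLINE_EXTENSIONS):
        return True
    return parser.can_fetch(user_agent, url)


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

agent = "example-crawler"  # hypothetical user-agent name
print(may_harvest(parser, agent, "http://www.example.gov.au/index.html"))
print(may_harvest(parser, agent, "http://www.example.gov.au/private/report.html"))
print(may_harvest(parser, agent, "http://www.example.gov.au/images/logo.png"))
```

The point of the exception is that a page excluded from nothing may still depend on a stylesheet or logo in an excluded directory; fetching those keeps the archived page looking as it did when published.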
Web Archiving at the NLA
• Collecting Commonwealth Govt websites
  – Whole-of-government arrangements
    • Whole-of-government ICT policy
    • Secretaries’ ICT Governance Board, 7 May 2010
    • AGIMO circular 2010/01
    • http://www.finance.gov.au/e-government/strategy-and-governance/Whole-of-Government-ICT-Policies.html
    • Covers FMA Act agencies
        – CAC Act agencies – still require individual permissions
    • Subject to opt-out arrangements
    • Replaced the need for individual copyright licence arrangements
      coordinated through the CCA
    • NLA now permitted to collect, preserve and make accessible freely
      available govt web content
Web Archiving at the NLA
• Whole-of-government collection
  – Based on list of specified URLs (most at domain
    level)
  – Around 800 seed URLs
  – Only includes FMA Act agency sites
  – No QA and fixing
  – Obeys robots.txt (except for inline images and style
    elements)
  – Full-text and URL indexes
  – No public access yet (but perhaps soon)
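
A seed-list crawl of this kind needs a rule for deciding which discovered URLs stay in scope. The sketch below shows one plausible rule, assuming that a URL is in scope when its host matches a seed host or is a subdomain of one. The seed URLs and function names are hypothetical, and the actual scoping rules used for the collection are not described in the slides.

```python
from urllib.parse import urlsplit

# Hypothetical seed URLs; the real list had around 800 entries.
SEEDS = [
    "http://www.example.gov.au/",
    "http://agency.example.gov.au/",
]


def seed_hosts(seeds):
    """Extract the set of host names from a list of seed URLs."""
    return {urlsplit(seed).hostname for seed in seeds}


def in_scope(url, hosts):
    """Return True if the URL's host is a seed host or a subdomain of one."""
    host = urlsplit(url).hostname
    if host is None:  # e.g. mailto: links have no host
        return False
    return any(host == h or host.endswith("." + h) for h in hosts)


hosts = seed_hosts(SEEDS)
print(in_scope("http://www.example.gov.au/about/", hosts))
print(in_scope("http://www.example.com.au/", hosts))
```

Scoping at the host level keeps the crawl bounded to the listed agencies while still following links to any depth within each site.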
Web Archiving at the NLA
• Collecting mandate and objective
  – The National Library Act 1960 mandate to build and
    maintain a national comprehensive collection of
    material relating to Australia and Australians
  – ... and to make the collection available in the national
    interest
  – Objective is about ensuring future and ongoing
    access to materials of interest to Australia’s social,
    cultural and publishing heritage
  – Not the function of NLA web collecting (archiving)
    program to satisfy requirements for agencies under
    the Archives Act 1983
Web Archiving at the NLA
• Government ‘Web Guide’ recordkeeping advice:
   – “Archiving websites”
      • Mandatory requirement (Archives Act 1983 and Evidence Act 1995)
      • seek advice from NAA
   – “Retaining access to outdated content”
      •   Not a mandatory requirement
      •   Recommends nominating content for inclusion in PANDORA
      •   Does not ensure safeguarding of content
      •   Selective
   – Create own publicly accessible archive
   – Publish advice on how people can access out-of-date content
• New ‘whole-of-government’ web collection
      • More inclusive and larger scale than PANDORA
      • FMA Act agencies requirement (with ‘opt-out’ provisions)
      • CAC Act agencies – opt-in!
Web Archiving at the NLA
• PANDORA selection
  – Commonwealth Government publications a priority
    collecting area
  – Methodical approaches have been attempted but ...
  – Curator expertise and current awareness
  – Stakeholders as nominators (e.g. indexing agencies,
    other collecting areas in NLA, Parl Library, depts)
  – Selecting and scoping
     •   Whole site, part site, specific documents
     •   Substance and research value
     •   Scheduling (when to harvest and how frequently)
     •   Resources to undertake work
     •   Technical constraints
Web Archiving at the NLA
• PANDORA collecting
  – Websites and web ‘documents’
     • documents (discrete files), whole sites, parts of sites
     • text, images, video, style elements, client side scripts
  – Content is harvested using a crawl robot
     • efficient (no work for publisher), automated process
     • deposit of complex objects is harder to deal with
  – Dynamic content becomes static HTML
     • an artefact of the original
     • the published version as you would view it from a web browser, not
       from the content management system
     • loses dynamic functionality
     • ‘normalising’ process
  – Persistent URIs
Web Archiving at the NLA
• Retention of collected web content
  – Archiving means preservation
  – Long term access
  – Collections developed and maintained in perpetuity
    for future generations
  – What is the preservation reality?
     • Is access in perpetuity achievable?
  – Investing in systems to manage for preservation
     •   More than preserving the bit stream
     •   Establishing preservation intent
     •   Collecting and managing preservation metadata
     •   Understanding formats and their risks (... and actions?)
Web Archiving at the NLA
• ‘DIY’ archive of your published web content
  – Use a subscription service
     • Archive-It (Internet Archive) www.archive-it.org
     • CDL Web Archiving Service webarchives.cdlib.org
  – Build your own with open-source tools
     • Heritrix archival crawler crawler.archive.org
     • WARC packages
     • Wayback interface
  – Lightweight approach
     • HTTrack (free) offline browser for website snapshots
       www.httrack.com
  – Citation service
     • on demand archiving of web resources webcitation.org
Web Archiving at the NLA
• Current and future developments at NLA
  – Digital Library Infrastructure Replacement (DLIR)
    project
     • Replacing infrastructure that manages our digital
       assets
     • Will require new web collecting infrastructure and
       processes
     • Already taking steps such as the gov.au seed list
       crawl
  – Some testing of new tools underway (Heritrix,
    Wayback)
  – Opening access to domain harvest content (gov.au)
Web Archiving at the NLA
• Extension of ‘legal deposit’ to digital
  content
  – Attorney-General’s consultation paper
     • Submissions closed 14 April
  – Proposed model covers:
     • physical format digital (mandatory delivery)
     • online electronic publications (mandatory delivery on
       demand)
  – May put pressure on NLA resources & priorities
  – Already have ‘whole-of-government’ arrangements
     • Bulk harvesting of FMA Act agencies’ domains
     • Seek ‘opt-in’ from CAC Act agencies
Web Archiving at the NLA
• Finding government content in PANDORA
  – Full text search through Trove
     • Trove ‘Archived websites 1996 - now’ silo
     • All Trove (results in ‘Books’ and ‘Archived websites’)
     • PANDORA portal
  – Browse lists on PANDORA portal site
     • ‘Commonwealth Government’ (263 titles)
  – Catalogue (MARC record search)
     •   NLA online catalogue
     •   Libraries Australia
     •   Trove (books silo)
     •   Search e.g.: innovation industry pandora
           – Advanced search options for best results
           – ‘Pandora electronic collection’ (MARC 830 series field)
http://www.flickr.com/photos/ricksmit/15671245/
Web Archiving at the NLA
• Government Web Guide and NAA links

  – Archiving websites
     • http://webguide.gov.au/recordkeeping/archiving-a-website/

  – Retaining access to outdated content
     • http://webguide.gov.au/recordkeeping/retaining-access-to-outdated-content/

  – NAA Archiving Websites advice
     • http://www.naa.gov.au/records-management/publications/index.aspx#Archiving-Websites:-Advice-and-Policy-Statement

Weitere ähnliche Inhalte

Andere mochten auch (6)

Res tafarian ism at the nla
Res tafarian ism at the nlaRes tafarian ism at the nla
Res tafarian ism at the nla
 
What are some of the things that the ‘Feds’ are doing about Digital ‘Stuff’?
What are some of the things that the ‘Feds’ are doing about Digital ‘Stuff’?What are some of the things that the ‘Feds’ are doing about Digital ‘Stuff’?
What are some of the things that the ‘Feds’ are doing about Digital ‘Stuff’?
 
I say emulate
I say emulateI say emulate
I say emulate
 
Digital presevation
Digital presevationDigital presevation
Digital presevation
 
Digitisation of Panoramic Negatives
Digitisation of Panoramic NegativesDigitisation of Panoramic Negatives
Digitisation of Panoramic Negatives
 
Creating a vision for mobile service delivery
Creating a vision for mobile service deliveryCreating a vision for mobile service delivery
Creating a vision for mobile service delivery
 

Ähnlich wie Aglin

Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Roxanne Missingham
 
The development of web archiving 3
The development of web archiving 3The development of web archiving 3
The development of web archiving 3Essam Obaid
 
The National Library of Australia's New Discovery Service
The National Library of Australia's New Discovery ServiceThe National Library of Australia's New Discovery Service
The National Library of Australia's New Discovery ServiceOCLC Research
 
From Ambition to Go Live SWIB.pdf
From Ambition to Go Live SWIB.pdfFrom Ambition to Go Live SWIB.pdf
From Ambition to Go Live SWIB.pdfRichardWallis3
 
From Ambition to Go Live
From Ambition to Go LiveFrom Ambition to Go Live
From Ambition to Go LiveRichard Wallis
 
Web Archiving – Lessons and Potential
 Web Archiving – Lessons and Potential Web Archiving – Lessons and Potential
Web Archiving – Lessons and PotentialDaniel Gomes
 
Preservation of Research Data: Dataverse / Archivematica Integration by Allan...
Preservation of Research Data: Dataverse / Archivematica Integration by Allan...Preservation of Research Data: Dataverse / Archivematica Integration by Allan...
Preservation of Research Data: Dataverse / Archivematica Integration by Allan...datascienceiqss
 
Scalability andefficiencypres
Scalability andefficiencypresScalability andefficiencypres
Scalability andefficiencypresNekoGato
 
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...Micah Altman
 
Three Linked Data choices for Libraries
Three Linked Data choices for LibrariesThree Linked Data choices for Libraries
Three Linked Data choices for LibrariesRichard Wallis
 
Archiving the French Web: the BnF web archiving workflow. Sara Aubry
Archiving the French Web: the BnF web archiving workflow. Sara AubryArchiving the French Web: the BnF web archiving workflow. Sara Aubry
Archiving the French Web: the BnF web archiving workflow. Sara AubryBiblioteca Nacional de España
 
Contextual Computing: Laying a Global Data Foundation
Contextual Computing: Laying a Global Data FoundationContextual Computing: Laying a Global Data Foundation
Contextual Computing: Laying a Global Data FoundationRichard Wallis
 
Marc and beyond: 3 Linked Data Choices
 Marc and beyond: 3 Linked Data Choices  Marc and beyond: 3 Linked Data Choices
Marc and beyond: 3 Linked Data Choices Richard Wallis
 
Contextual Computing - Knowledge Graphs & Web of Entities
Contextual Computing - Knowledge Graphs & Web of EntitiesContextual Computing - Knowledge Graphs & Web of Entities
Contextual Computing - Knowledge Graphs & Web of EntitiesRichard Wallis
 
Web and Twitter Archiving at the Library of Congress
Web and Twitter Archiving at the Library of CongressWeb and Twitter Archiving at the Library of Congress
Web and Twitter Archiving at the Library of Congressnullhandle
 
Building the AAPB: Inter-Institutional Preservation and Access Workflows
Building the AAPB: Inter-Institutional Preservation and Access WorkflowsBuilding the AAPB: Inter-Institutional Preservation and Access Workflows
Building the AAPB: Inter-Institutional Preservation and Access WorkflowsWGBH Media Library and Archives
 
Web archiving challenges and opportunities
Web archiving challenges and opportunitiesWeb archiving challenges and opportunities
Web archiving challenges and opportunitiesAhmed AlSum
 
Schema.org: Where did that come from!
Schema.org: Where did that come from!Schema.org: Where did that come from!
Schema.org: Where did that come from!Richard Wallis
 

Ähnlich wie Aglin (20)

Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012
 
Scaling up to archive the UK Web. Helen Hockx-Yu
Scaling up to archive the UK Web. Helen Hockx-YuScaling up to archive the UK Web. Helen Hockx-Yu
Scaling up to archive the UK Web. Helen Hockx-Yu
 
The development of web archiving 3
The development of web archiving 3The development of web archiving 3
The development of web archiving 3
 
The National Library of Australia's New Discovery Service
The National Library of Australia's New Discovery ServiceThe National Library of Australia's New Discovery Service
The National Library of Australia's New Discovery Service
 
From Ambition to Go Live SWIB.pdf
From Ambition to Go Live SWIB.pdfFrom Ambition to Go Live SWIB.pdf
From Ambition to Go Live SWIB.pdf
 
From Ambition to Go Live
From Ambition to Go LiveFrom Ambition to Go Live
From Ambition to Go Live
 
Web Archiving – Lessons and Potential
 Web Archiving – Lessons and Potential Web Archiving – Lessons and Potential
Web Archiving – Lessons and Potential
 
Preservation of Research Data: Dataverse / Archivematica Integration by Allan...
Preservation of Research Data: Dataverse / Archivematica Integration by Allan...Preservation of Research Data: Dataverse / Archivematica Integration by Allan...
Preservation of Research Data: Dataverse / Archivematica Integration by Allan...
 
Scalability andefficiencypres
Scalability andefficiencypresScalability andefficiencypres
Scalability andefficiencypres
 
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
 
Three Linked Data choices for Libraries
Three Linked Data choices for LibrariesThree Linked Data choices for Libraries
Three Linked Data choices for Libraries
 
Archiving the French Web: the BnF web archiving workflow. Sara Aubry
Archiving the French Web: the BnF web archiving workflow. Sara AubryArchiving the French Web: the BnF web archiving workflow. Sara Aubry
Archiving the French Web: the BnF web archiving workflow. Sara Aubry
 
Contextual Computing: Laying a Global Data Foundation
Contextual Computing: Laying a Global Data FoundationContextual Computing: Laying a Global Data Foundation
Contextual Computing: Laying a Global Data Foundation
 
Marc and beyond: 3 Linked Data Choices
 Marc and beyond: 3 Linked Data Choices  Marc and beyond: 3 Linked Data Choices
Marc and beyond: 3 Linked Data Choices
 
Rs detective afpl
Rs detective afplRs detective afpl
Rs detective afpl
 
Contextual Computing - Knowledge Graphs & Web of Entities
Contextual Computing - Knowledge Graphs & Web of EntitiesContextual Computing - Knowledge Graphs & Web of Entities
Contextual Computing - Knowledge Graphs & Web of Entities
 
Web and Twitter Archiving at the Library of Congress
Web and Twitter Archiving at the Library of CongressWeb and Twitter Archiving at the Library of Congress
Web and Twitter Archiving at the Library of Congress
 
Building the AAPB: Inter-Institutional Preservation and Access Workflows
Building the AAPB: Inter-Institutional Preservation and Access WorkflowsBuilding the AAPB: Inter-Institutional Preservation and Access Workflows
Building the AAPB: Inter-Institutional Preservation and Access Workflows
 
Web archiving challenges and opportunities
Web archiving challenges and opportunitiesWeb archiving challenges and opportunities
Web archiving challenges and opportunities
 
Schema.org: Where did that come from!
Schema.org: Where did that come from!Schema.org: Where did that come from!
Schema.org: Where did that come from!
 

Mehr von National Library of Australia

Publicity and media - Anna Gressier & Sarah Kleven (Communications and Market...
Publicity and media - Anna Gressier & Sarah Kleven (Communications and Market...Publicity and media - Anna Gressier & Sarah Kleven (Communications and Market...
Publicity and media - Anna Gressier & Sarah Kleven (Communications and Market...National Library of Australia
 
CHG recipient case study - Julia Mant of the National Institute of Dramatic Art
CHG recipient case study - Julia Mant of the National Institute of Dramatic ArtCHG recipient case study - Julia Mant of the National Institute of Dramatic Art
CHG recipient case study - Julia Mant of the National Institute of Dramatic ArtNational Library of Australia
 
Just Digitise It - Daniel Wilksch of the Public Records Office Victoria
Just Digitise It - Daniel Wilksch of the Public Records Office VictoriaJust Digitise It - Daniel Wilksch of the Public Records Office Victoria
Just Digitise It - Daniel Wilksch of the Public Records Office VictoriaNational Library of Australia
 
Trove - a window to our community heritage - Hilary Berthon of Trove, NLA
Trove - a window to our community heritage - Hilary Berthon of Trove, NLATrove - a window to our community heritage - Hilary Berthon of Trove, NLA
Trove - a window to our community heritage - Hilary Berthon of Trove, NLANational Library of Australia
 
Disaster Prevention, Preparedness, Response and Recovery for Collections - Ki...
Disaster Prevention, Preparedness, Response and Recovery for Collections - Ki...Disaster Prevention, Preparedness, Response and Recovery for Collections - Ki...
Disaster Prevention, Preparedness, Response and Recovery for Collections - Ki...National Library of Australia
 
Assessing Significance and Significance 2.0: an introduction - Margaret Birt...
 Assessing Significance and Significance 2.0: an introduction - Margaret Birt... Assessing Significance and Significance 2.0: an introduction - Margaret Birt...
Assessing Significance and Significance 2.0: an introduction - Margaret Birt...National Library of Australia
 
Assessing the significance of cultural heritage - Tania Cleary
Assessing the significance of cultural heritage - Tania ClearyAssessing the significance of cultural heritage - Tania Cleary
Assessing the significance of cultural heritage - Tania ClearyNational Library of Australia
 
Publicity, Media & Completing your CHG project - 2017 - Fran D'Castro
Publicity, Media & Completing your CHG project - 2017 - Fran D'CastroPublicity, Media & Completing your CHG project - 2017 - Fran D'Castro
Publicity, Media & Completing your CHG project - 2017 - Fran D'CastroNational Library of Australia
 
Just Digitise It - Daniel Wilksch of the Public Records Office Victoria
Just Digitise It - Daniel Wilksch of the Public Records Office VictoriaJust Digitise It - Daniel Wilksch of the Public Records Office Victoria
Just Digitise It - Daniel Wilksch of the Public Records Office VictoriaNational Library of Australia
 
TROVE - a window to our community heritage - Hilary Berthon of Trove, NLA
TROVE - a window to our community heritage - Hilary Berthon of Trove, NLATROVE - a window to our community heritage - Hilary Berthon of Trove, NLA
TROVE - a window to our community heritage - Hilary Berthon of Trove, NLANational Library of Australia
 
Disaster Prevention, Preparedness, Response and Recovery for Collections - Ki...
Disaster Prevention, Preparedness, Response and Recovery for Collections - Ki...Disaster Prevention, Preparedness, Response and Recovery for Collections - Ki...
Disaster Prevention, Preparedness, Response and Recovery for Collections - Ki...National Library of Australia
 
CHG recipient case study - Donna Bailey of the Catholic Diocese of Sandhurst
CHG recipient case study - Donna Bailey of the Catholic Diocese of SandhurstCHG recipient case study - Donna Bailey of the Catholic Diocese of Sandhurst
CHG recipient case study - Donna Bailey of the Catholic Diocese of SandhurstNational Library of Australia
 
Assessing the significance of cultural heritage - Tania Cleary
Assessing the significance of cultural heritage - Tania ClearyAssessing the significance of cultural heritage - Tania Cleary
Assessing the significance of cultural heritage - Tania ClearyNational Library of Australia
 
Significance Assessment and Significance 2.0: an introduction - Veronica Bull...
Significance Assessment and Significance 2.0: an introduction - Veronica Bull...Significance Assessment and Significance 2.0: an introduction - Veronica Bull...
Significance Assessment and Significance 2.0: an introduction - Veronica Bull...National Library of Australia
 
Just digitise it - Daniel Wilksch of the Public Records Office Victoria
Just digitise it - Daniel Wilksch of the Public Records Office VictoriaJust digitise it - Daniel Wilksch of the Public Records Office Victoria
Just digitise it - Daniel Wilksch of the Public Records Office VictoriaNational Library of Australia
 

Mehr von National Library of Australia (20)

Publicity and media - Anna Gressier & Sarah Kleven (Communications and Market...
Publicity and media - Anna Gressier & Sarah Kleven (Communications and Market...Publicity and media - Anna Gressier & Sarah Kleven (Communications and Market...
Publicity and media - Anna Gressier & Sarah Kleven (Communications and Market...
 
CHG recipient case study - Julia Mant of the National Institute of Dramatic Art
CHG recipient case study - Julia Mant of the National Institute of Dramatic ArtCHG recipient case study - Julia Mant of the National Institute of Dramatic Art
CHG recipient case study - Julia Mant of the National Institute of Dramatic Art
 
Completing your CHG project - Fran D'Castro
Completing your CHG project - Fran D'CastroCompleting your CHG project - Fran D'Castro
Completing your CHG project - Fran D'Castro
 
Just Digitise It - Daniel Wilksch of the Public Records Office Victoria
Just Digitise It - Daniel Wilksch of the Public Records Office VictoriaJust Digitise It - Daniel Wilksch of the Public Records Office Victoria
Just Digitise It - Daniel Wilksch of the Public Records Office Victoria
 
Trove - a window to our community heritage - Hilary Berthon of Trove, NLA
Trove - a window to our community heritage - Hilary Berthon of Trove, NLATrove - a window to our community heritage - Hilary Berthon of Trove, NLA
Trove - a window to our community heritage - Hilary Berthon of Trove, NLA
 
National Archives of Australia
National Archives of AustraliaNational Archives of Australia
National Archives of Australia
 
Disaster Prevention, Preparedness, Response and Recovery for Collections - Ki...
Disaster Prevention, Preparedness, Response and Recovery for Collections - Ki...Disaster Prevention, Preparedness, Response and Recovery for Collections - Ki...
Disaster Prevention, Preparedness, Response and Recovery for Collections - Ki...
 
Assessing Significance and Significance 2.0: an introduction - Margaret Birt...
 Assessing Significance and Significance 2.0: an introduction - Margaret Birt... Assessing Significance and Significance 2.0: an introduction - Margaret Birt...
Assessing Significance and Significance 2.0: an introduction - Margaret Birt...
 
Preservation Needs Assessment - Tamara Lavrencic
Preservation Needs Assessment  - Tamara LavrencicPreservation Needs Assessment  - Tamara Lavrencic
Preservation Needs Assessment - Tamara Lavrencic
 
Assessing the significance of cultural heritage - Tania Cleary
Assessing the significance of cultural heritage - Tania ClearyAssessing the significance of cultural heritage - Tania Cleary
Assessing the significance of cultural heritage - Tania Cleary
 
Publicity, Media & Completing your CHG project - 2017 - Fran D'Castro
Publicity, Media & Completing your CHG project - 2017 - Fran D'CastroPublicity, Media & Completing your CHG project - 2017 - Fran D'Castro
Publicity, Media & Completing your CHG project - 2017 - Fran D'Castro
 
Just Digitise It - Daniel Wilksch of the Public Records Office Victoria
Just Digitise It - Daniel Wilksch of the Public Records Office VictoriaJust Digitise It - Daniel Wilksch of the Public Records Office Victoria
Just Digitise It - Daniel Wilksch of the Public Records Office Victoria
 
TROVE - a window to our community heritage - Hilary Berthon of Trove, NLA
TROVE - a window to our community heritage - Hilary Berthon of Trove, NLATROVE - a window to our community heritage - Hilary Berthon of Trove, NLA
TROVE - a window to our community heritage - Hilary Berthon of Trove, NLA
 
Disaster Prevention, Preparedness, Response and Recovery for Collections - Ki...
Disaster Prevention, Preparedness, Response and Recovery for Collections - Ki...Disaster Prevention, Preparedness, Response and Recovery for Collections - Ki...
Disaster Prevention, Preparedness, Response and Recovery for Collections - Ki...
 
CHG recipient case study - Donna Bailey of the Catholic Diocese of Sandhurst
CHG recipient case study - Donna Bailey of the Catholic Diocese of SandhurstCHG recipient case study - Donna Bailey of the Catholic Diocese of Sandhurst
CHG recipient case study - Donna Bailey of the Catholic Diocese of Sandhurst
 
Preservation Needs Assessment - Tamara Lavrencic
Preservation Needs Assessment - Tamara LavrencicPreservation Needs Assessment - Tamara Lavrencic
Preservation Needs Assessment - Tamara Lavrencic
 
Assessing the significance of cultural heritage - Tania Cleary
Assessing the significance of cultural heritage - Tania ClearyAssessing the significance of cultural heritage - Tania Cleary
Assessing the significance of cultural heritage - Tania Cleary
 
Significance Assessment and Significance 2.0: an introduction - Veronica Bull...
Significance Assessment and Significance 2.0: an introduction - Veronica Bull...Significance Assessment and Significance 2.0: an introduction - Veronica Bull...
Significance Assessment and Significance 2.0: an introduction - Veronica Bull...
 
Preservation assessment - Tamara Lavrencic
Preservation assessment - Tamara LavrencicPreservation assessment - Tamara Lavrencic
Preservation assessment - Tamara Lavrencic
 
Just digitise it - Daniel Wilksch of the Public Records Office Victoria
Just digitise it - Daniel Wilksch of the Public Records Office VictoriaJust digitise it - Daniel Wilksch of the Public Records Office Victoria
Just digitise it - Daniel Wilksch of the Public Records Office Victoria
 

Kürzlich hochgeladen

Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Aglin

  • 1. Collecting Government Web Content at the National Library of Australia AGLIN Forum 2 May 2012 Paul Koerbin Manager Web Archiving National Library of Australia
  • 2. Web Archiving at the NLA • Background • Scale of collections • Archival collections (selective, bulk, govt) • Objectives, selection and scope • Retention and preservation • Finding government content in PANDORA
  • 3. Web Archiving at the NLA • Began web archiving activity in 1996 – http://pandora.nla.gov.au/ • Government content is included in all NLA web collections – ‘PANDORA Archive’ collection, 1996 to now • Selective – The ‘auscrawl’ whole .au domain harvest collections • Annual since 2005 – The ‘whole-of-government’ collections • Seed list • 2011, 2012
  • 4. Web Archiving at the NLA • Scale of collecting – PANDORA (as at April 2012, i.e. 15 years of collecting) • 31,000 titles – All govt ~ 55 % of titles – Commonwealth Govt ~ 12 % of titles • 75,000 instances • 145 million files • 6.5 TB – Australian .au domain harvests 2005-2011 • 3.5 billion files • 140 TB – ‘Whole-of-government’ seed list crawl 2011 • 7.4 million files • 538 GB
  • 5. Web Archiving at the NLA • PANDORA Archive – Strong representation of govt content including Commonwealth, State and Territory, and local govt (> 50 % of titles) – Generally does not include whole departmental websites – Prominent ministerial micro-sites (speeches, press releases) – Government initiatives websites (e.g. Firearms buyback, 2000) – Major reports, enquiries, documents (e.g. Gershon Review, 2008) – Discrete ‘titles’ and ‘instances’ – no links between instances – Quality checked – Catalogued and full text indexed – Accessible through the Trove and PANDORA discovery services
  • 6. Web Archiving at the NLA • Whole .au domain harvests (‘auscrawl’) – Crawls of the entire .au domain (plus some) – Averages over 1 million hosts crawled each year (av. 650m files) – Includes gov.au second level domain – Relies on crawler capabilities and subject to crawler limitations and constraints – Obeys robots.txt (except for inline image and style elements) – No quality checking for completeness of harvest or functionality (e.g. look and style) – Retains linkages between content that is in scope for the crawl – Full-text and URL indexes – But, not accessible to public
  • 7. Web Archiving at the NLA • Collecting Commonwealth Govt websites – Whole-of-government arrangements • Whole-of-government ICT policy • Secretaries’ ICT Governance Board, 7 May 2010 • AGIMO circular 2010/01 • http://www.finance.gov.au/e-government/strategy-and-governance/Whole-of-Government-ICT-Policies.html • Covers FMA Act agencies – CAC Act agencies – still require individual permissions • Subject to opt-out arrangements • Replaced the need for individual copyright licence arrangements coordinated through the CCA • NLA now permitted to collect, preserve and make accessible freely available govt web content
  • 8. Web Archiving at the NLA • Whole-of-government collection – Based on list of specified URLs (most at domain level) – Around 800 seed URLs – Only includes FMA Act agency sites – No QA and fixing – Obeys robots.txt (except for inline images and style elements) – Full-text and URL indexes – No public access yet (but perhaps soon)
  • 9. Web Archiving at the NLA • Collecting mandate and objective – The National Library Act 1960 mandate to build and maintain a national comprehensive collection of material relating to Australia and Australians – ... and to make the collection available in the national interest – Objective is about ensuring future and ongoing access to materials of interest to Australia’s social, cultural and publishing heritage – Not the function of NLA web collecting (archiving) program to satisfy requirements for agencies under the Archives Act 1983
  • 10. Web Archiving at the NLA • Government ‘Web Guide’ recordkeeping advice: – “Archiving websites” • Mandatory requirement (Archives Act 1983 and Evidence Act 1995) • seek advice from NAA – “Retaining access to outdated content” • Not a mandatory requirement • Recommends nominating content for inclusion in PANDORA • Does not ensure safeguarding of content • Selective – Create own publicly accessible archive – Publish advice how people can access out of date content • New ‘whole-of-government’ web collection • More inclusive and larger scale than PANDORA • FMA Act agencies requirement (with ‘opt-out’ provisions) • CAC Act agencies – opt-in!
  • 11. Web Archiving at the NLA • PANDORA selection – Commonwealth Government publications a priority collecting area – Methodical approaches have been attempted but ... – Curator expertise and current awareness – Stakeholders as nominators (e.g. indexing agencies, other collecting areas in NLA, Parl Library, depts) – Selecting and scoping • Whole site, part site, specific documents • Substance and research value • Scheduling (when to harvest and how frequently) • Resources to undertake work • Technical constraints
  • 12. Web Archiving at the NLA • PANDORA collecting – Websites and web ‘documents’ • documents (discrete files), whole sites, parts of sites • text, images, video, style elements, client side scripts – Content is harvested using a crawl robot • efficient (no work for publisher), automated process • deposit of complex objects is harder to deal with – Dynamic content becomes static HTML • an artefact of the original • the published version as you would view it from a web browser, not from the content management system • loses dynamic functionality • ‘normalising’ process – Persistent URIs
  • 13. Web Archiving at the NLA • Retention of collected web content – Archiving means preservation – Long term access – Collections developed and maintained in perpetuity for future generations – What is the preservation reality? • Is access in perpetuity achievable? – Investing in systems to manage for preservation • More than preserving the bit stream • Establishing preservation intent • Collecting and managing preservation metadata • Understanding formats and their risks (... and actions?)
  • 14. Web Archiving at the NLA • ‘DIY’ archive of your published web content – Use a subscription service • Archive-It (Internet Archive) www.archive-it.org • CDL Web Archiving Service webarchives.cdlib.org – Build your own with open-source tools • Heritrix archival crawler crawler.archive.org • WARC packages • Wayback interface – Lightweight approach • HTTrack (free) offline browser for website snapshots www.httrack.com – Citation service • on demand archiving of web resources webcitation.org
  • 15. Web Archiving at the NLA • Current and future developments at NLA – Digital Library Infrastructure Replacement (DLIR) project • Replacing infrastructure that manages our digital assets • Will require new web collecting infrastructure and processes • Already taking steps such as the gov.au seed list crawl – Some testing of new tools underway (Heritrix, Wayback) – Opening access to domain harvest content (gov.au)
  • 16. Web Archiving at the NLA • Extension of ‘legal deposit’ to digital content – Attorney-General’s consultation paper • Submissions closed 14 April – Proposed model covers: • physical format digital (mandatory delivery) • online electronic publications (mandatory delivery on demand) – May put pressure on NLA resources & priorities – Already have ‘whole-of-government’ arrangements • Bulk harvesting of FMA Act agencies’ domains • Seek ‘opt-in’ from CAC Act agencies
  • 17. Web Archiving at the NLA • Finding government content in PANDORA – Full text search through Trove • Trove ‘Archived websites 1996 - now’ silo • All Trove (results in ‘Books’ and ‘Archived websites’) • PANDORA portal – Browse lists on PANDORA portal site • ‘Commonwealth Government’ (263 titles) – Catalogue (MARC record search) • NLA online catalogue • Libraries Australia • Trove (books silo) • Search e.g.: innovation industry pandora – Advanced search options for best results – ‘Pandora electronic collection’ (MARC 830 series field)
  • 19. Web Archiving at the NLA • Government Web Guide and NAA links – Archiving websites • http://webguide.gov.au/recordkeeping/archiving-a-website/ – Retaining access to outdated content • http://webguide.gov.au/recordkeeping/retaining-access-to-outdated-content/ – NAA Archiving Websites advice • http://www.naa.gov.au/records-management/publications/index.aspx#Archiving-Websites:-Advice-and-Policy-Statement
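
Slides 6 and 8 note that the domain and whole-of-government crawls obey robots.txt, except for inline image and style elements. The following is a minimal Python sketch of what such a policy could look like. It is an illustration only, not the NLA's crawler configuration: the function name `may_fetch`, the user agent `example-crawler`, the suffix list and the example.gov.au URLs are all assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

# Assumed set of file suffixes treated as inline page requisites
# (images and stylesheets); a real crawler would decide this from
# the HTML element that referenced the URL, not from the suffix.
INLINE_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".css")


def may_fetch(robots_lines, url, user_agent="example-crawler", inline=False):
    """Return True if `url` may be fetched under the sketched policy.

    robots_lines: the lines of the site's robots.txt file.
    inline: True when the URL was referenced as an inline image or
            stylesheet of a page that is already in scope.
    """
    path = url.lower().split("?", 1)[0]
    if inline and path.endswith(INLINE_SUFFIXES):
        # Exception: page requisites are fetched regardless of robots.txt.
        return True
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content for an agency site.
rules = ["User-agent: *", "Disallow: /private/"]

blocked = may_fetch(rules, "http://example.gov.au/private/report.html")
allowed_page = may_fetch(rules, "http://example.gov.au/public/index.html")
allowed_inline = may_fetch(
    rules, "http://example.gov.au/private/logo.png", inline=True
)
```

Here `blocked` is False because the page falls under the disallowed path, while the inline image under the same path is still fetched so that archived pages keep their look and style.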