The Digital Documents 
Harvesting and Processing 
Tool (Document Harvester) 
Jennie Grimshaw 
Lead Curator, Social Policy & Official Publications 
The British Library
The Challenge 
• Ensuring long-term preservation of non-commercial online 
publications through the Legal Deposit Web Archive (LDWA) 
• Providing easy access at the level of the individual 
document 
• Providing a system for efficient processing of individual 
documents at scale 
• Storing documents economically 
www.bl.uk 2
The solution: DDHAPT 
• Crawls target websites at set intervals 
• Stores them in LDWA 
• Presents list of new documents for selection 
• Selector creates basic metadata 
• Basic record + hotlink available after seven days 
STEP 1: Selector (librarian, digital processing team 
member or publisher) logs in 
• DDHAPT is a web-based application based on W3ACT 
• Web Archiving Team sets up users with different roles + 
permissions 
• After login the selector is taken to a personalised 
homepage, with 
– List of crawled targets + date 
– Crawl succeeded/failed 
– Links to new documents harvested 
– Provision to set up new watched target 
STEP 2: Selector searches for and enters details of 
selected Watched Target 
• Watched target = URL of publications page 
• Crawl frequency can be set in light of volume/frequency of 
publishing 
• Use NPLD selection criteria 
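A minimal sketch of how a watched target and its crawl frequency could be modelled. The class name, field names and frequency labels are hypothetical, as is the example URL; the slides do not describe the Tool's actual data model or the frequency options it offers.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical frequency labels and intervals; the real options are not listed in the slides.
CRAWL_INTERVALS = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),
}


@dataclass
class WatchedTarget:
    """A publications page to be crawled at a set interval (illustrative model only)."""
    url: str
    frequency: str
    last_crawled: date

    def next_crawl_due(self) -> date:
        # The next crawl falls one interval after the previous crawl.
        return self.last_crawled + CRAWL_INTERVALS[self.frequency]

    def is_due(self, today: date) -> bool:
        # A target is due once today's date reaches the next crawl date.
        return today >= self.next_crawl_due()


target = WatchedTarget(
    url="http://example.org/publications",  # placeholder URL
    frequency="weekly",
    last_crawled=date(2015, 1, 5),
)
print(target.next_crawl_due())  # 2015-01-12
```

Raising or lowering the frequency for a busy or quiet publisher then only means changing one field on the target.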
STEP 3: Doc Harvester crawls websites, retrieves Docs 
and sends Selector (or generic email address) an email 
confirmation that it has completed the crawl 
• Will report if no new documents are found 
• Will signal possible duplicates 
• Selector can ignore documents not required for 
individual cataloguing 
• Populates metadata creation form 
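The slides do not say how the Tool recognises possible duplicates. One common approach, shown here purely as an assumption, is to compare a content hash of each harvested file against hashes from earlier crawls. The function names and the wording of the report are hypothetical.

```python
import hashlib


def classify_documents(harvested, seen_hashes):
    """Split harvested documents into new ones and possible duplicates.

    harvested: dict mapping document URL -> file content as bytes.
    seen_hashes: set of SHA-256 hex digests from earlier crawls (updated in place).
    """
    new_docs, possible_duplicates = [], []
    for url, content in harvested.items():
        digest = hashlib.sha256(content).hexdigest()
        if digest in seen_hashes:
            # Identical bytes were harvested before: flag for the selector.
            possible_duplicates.append(url)
        else:
            seen_hashes.add(digest)
            new_docs.append(url)
    return new_docs, possible_duplicates


def crawl_report(new_docs, possible_duplicates):
    """Build the one-line summary a selector might receive by email."""
    if not new_docs:
        return "Crawl complete: no new documents found."
    return (f"Crawl complete: {len(new_docs)} new document(s), "
            f"{len(possible_duplicates)} possible duplicate(s).")


# Placeholder URLs and contents, for illustration only.
seen = set()
harvested = {
    "http://example.org/a.pdf": b"report one",
    "http://example.org/b.pdf": b"report one",  # same bytes as a.pdf
    "http://example.org/c.pdf": b"report two",
}
new_docs, dupes = classify_documents(harvested, seen)
print(crawl_report(new_docs, dupes))
```

An exact hash only catches byte-identical files; a reissued PDF with a changed cover page would need a looser comparison.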
STEP 4: Selector reviews and edits metadata fields for 
each Document and clicks ‘Submit’ which sends the SIP 
to Ingest 
• Separate screens for books, journal issues and journal 
articles 
• Displays document and pre-populated form side-by-side 
• Selectors will need to check and edit the metadata 
• Hit the submit button – tool confirms if submission is 
successful 
STEP 4: DDHAPT sample metadata 
• *Title: Living standards, poverty and inequality in the UK: 2014 
• ISBN: 9781909463523 
• Author(s) (up to 3): Belfield, Chris; Cribb, Jonathan; Hood, Andrew 
• Corporate Author: 
• *Publisher: Institute for Fiscal Studies 
• Edition: 
• *Year of publication: 2014 
• DOI (Digital Object Identifier): 10.1920/re.ifs.2014.0096 
• *Submission type (should autopopulate): Book 
• *Filename (automated) 
• URL landing page = http://www.ifs.org.uk/publications/7274 
• URL pdf = http://www.ifs.org.uk/uploads/publications/comms/r96.pdf 
• Comment: published in series: Report no. 96 
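A sketch of the kind of check a selector's form could run before submission, using the sample values above. It assumes that the asterisked fields are mandatory, which the slide implies but does not state; the field keys and function names are hypothetical. The ISBN-13 check-digit rule itself is the standard one.

```python
def isbn13_is_valid(isbn: str) -> bool:
    """Check an ISBN-13 check digit: digits are weighted 1, 3, 1, 3, ... and
    the weighted sum must be a multiple of 10."""
    digits = [int(c) for c in isbn if c.isdigit()]
    if len(digits) != 13:
        return False
    total = sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


# Assumption: fields marked with an asterisk on the slide are required.
REQUIRED_FIELDS = ("title", "publisher", "year", "submission_type", "filename")


def missing_required(record: dict) -> list:
    """Return the required field names that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


record = {
    "title": "Living standards, poverty and inequality in the UK: 2014",
    "isbn": "9781909463523",
    "publisher": "Institute for Fiscal Studies",
    "year": "2014",
    "doi": "10.1920/re.ifs.2014.0096",
    "submission_type": "Book",
    "filename": "",  # filled in automatically by the Tool; left empty here
}

print(isbn13_is_valid(record["isbn"]))  # True
print(missing_required(record))         # ['filename']
```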
STEP 5: The SIP is ingested and flows through Aleph 
and into Primo. 
• Basic catalogue records appear in Explore (our catalogue) 
7 days after ingest, to make content available asap. 
• Catalogue records are upgraded later by BL cataloguing 
team. 
• The Documents can also be found in the LDWA. 
Phase 2 
• In Phase 2, from January to the end of March 2015, we will 
develop the Tool to harvest content from websites with simple 
username- and password-based barriers, as covered in the 
NPLD Regulations. 
• From April 2015, we hope to roll out use of the Tool within 
BL and LDLs 
An additional use case: the landing page approach 
• Aims to reduce effort by forming sets of harvested websites 
into collections 
• Each collection will have a landing page set up in the LDWA 
• A single entry will link from Explore to the collection landing 
page 
Any questions? 

Editor's notes

  1. - Explain DDHAPT is a web-based application, an extension of W3ACT. BL Web Archiving Team sets up users and assigns different Roles and Permissions to them. - W3ACT interface is being improved as part of this project, so looks a bit different from the LD ACT Tool you might be used to.
  2. Watched Target = URL of a specific page, for example a publications listing page, within a Target website. The system prevents duplicates. PDFs only at the moment, but they do not need an ISBN/ISSN. Crawl frequency can be adjusted to keep up with the volume and frequency of publishing, which may change over time. Includes Non-Print Legal Deposit (NPLD) criteria. Can be used for NPLD and also for content for remote access services, but rights clearance for remote access services is done outside the tool and a second copy of the Documents has to be kept. Rights on the material are recorded in the Tool and are visible to all logged-in users.
  3. If it finds no Docs at all, it tells us so we can check whether the website has closed down or Watched Target URL needs updating. At this point, the Documents are already ingested into the DLS. What we do next is improve the metadata, before submitting it to Ingest.
  4. Metadata is autopopulated as far as possible. The Tool uses software called MEX, Metadata Extraction, which was developed for the BL, and looks at the PDF itself as the primary source, then the web page the PDF came from, to find as much metadata as it can. Selectors will need to edit the metadata a bit e.g. untangle Personal Author forenames and surnames into separate fields, but there should not be much typing needed. If there are dupes of a Doc, the system will reject the Dupe on Ingest.
  5. Will also include FAST subject headings
  6. Use for integrating resources, books published in chapters, annual report series, consultations, as well as local authorities.
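
Note 4 mentions untangling Personal Author forenames and surnames into separate fields. A minimal sketch of that step, assuming the extracted string uses the "Surname, Forename; Surname, Forename" pattern seen in the sample metadata; the function name and output keys are hypothetical, and this is not a description of how MEX or the Tool does it.

```python
def split_authors(raw: str, limit: int = 3) -> list:
    """Split 'Surname, Forename; Surname, Forename' into separate fields.

    limit reflects the 'up to 3' authors shown on the sample metadata slide.
    """
    authors = []
    for entry in raw.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        # Everything before the first comma is treated as the surname.
        surname, _, forename = entry.partition(",")
        authors.append({"surname": surname.strip(), "forename": forename.strip()})
    return authors[:limit]


for author in split_authors("Belfield, Chris; Cribb, Jonathan; Hood, Andrew"):
    print(author["surname"], "/", author["forename"])
```

Names that do not follow this pattern (for example "Chris Belfield" with no comma) end up entirely in the surname field, which is one reason a selector still needs to check the form.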