1. The Digital Documents Harvesting and Processing Tool (Document Harvester)
Jennie Grimshaw
Lead Curator, Social Policy & Official Publications
The British Library
2. The Challenge
• Ensuring long-term preservation of non-commercial online publications – through the Legal Deposit Web Archive (LDWA)
• Providing easy access at the level of the individual document
• Providing a system for efficient processing of individual documents at scale
• Storing documents economically
www.bl.uk 2
3. The solution: DDHAPT
• Crawls target websites at set intervals
• Stores them in LDWA
• Presents list of new documents for selection
• Selector creates basic metadata
• Basic record + hotlink available after seven days
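The crawl-and-select loop above can be sketched in Python. This is a hypothetical illustration only; the class and function names are assumptions, not the actual DDHAPT code:

```python
from dataclasses import dataclass, field

@dataclass
class WatchedTarget:
    """A publications page that is re-crawled at a set interval."""
    url: str
    crawl_frequency_days: int
    seen_documents: set = field(default_factory=set)

def crawl(target: WatchedTarget, found_urls: list) -> list:
    """Return document URLs not seen on previous crawls,
    ready to be presented to the selector."""
    new_docs = [u for u in found_urls if u not in target.seen_documents]
    target.seen_documents.update(new_docs)
    return new_docs
```

On the second and later crawls, only documents not already stored are offered for selection.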
4. STEP 1: Selector (librarian, digital processing team member or publisher) logs in
• DDHAPT is a web-based application based on W3ACT
• Web Archiving Team sets up users with different roles + permissions
• After login the selector is taken to a personalised homepage, with
– List of crawled targets + date
– Crawl succeeded/failed
– Links to new documents harvested
– Provision to set up new watched target
5. STEP 2: Selector searches for and enters details of selected Watched Target
• Watched target = URL of publications page
• Crawl frequency can be set in light of volume/frequency of publishing
• Use NPLD selection criteria
6. STEP 3: Doc Harvester crawls websites, retrieves Docs and sends Selector (or generic email address) an email confirmation that it has completed the crawl
• Will report if no new documents are found
• Will signal possible duplicates
• Selector can ignore documents not required for individual cataloguing
• Populates metadata creation form
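The reporting behaviour described above – say so when nothing new is found, flag possible duplicates – could be sketched as follows (hypothetical names; not the Tool's actual email code):

```python
def crawl_report(target_url: str, new_docs: list, possible_dupes: set) -> str:
    """Compose the body of the post-crawl confirmation email."""
    if not new_docs:
        return f"Crawl of {target_url} complete: no new documents found."
    lines = [f"Crawl of {target_url} complete: {len(new_docs)} new document(s)."]
    for doc in new_docs:
        # Possible duplicates are flagged, not silently dropped,
        # so the selector can decide whether to ignore them.
        flag = " (possible duplicate)" if doc in possible_dupes else ""
        lines.append(f"- {doc}{flag}")
    return "\n".join(lines)
```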
7. STEP 4: Selector reviews and edits metadata fields for each Document and clicks ‘Submit’, which sends the SIP (Submission Information Package) to Ingest
• Separate screens for books, journal issues and journal articles
• Displays document and pre-populated form side-by-side
• Selectors will need to check and edit the metadata
• Hit the submit button – tool confirms if submission is successful
8. STEP 4: DDHAPT sample metadata
• *Title: Living standards, poverty and inequality in the UK: 2014
• ISBN: 9781909463523
• Author(s) (up to 3): Belfield, Chris; Cribb, Jonathan; Hood, Andrew
• Corporate author:
• *Publisher: Institute for Fiscal Studies
• Edition:
• *Year of publication: 2014
• DOI (Digital Object Identifier): 10.1920/re.ifs.2014.0096
• *Submission type (should auto-populate): Book
• *Filename: (automated)
• URL (landing page): http://www.ifs.org.uk/publications/7274
• URL (PDF): http://www.ifs.org.uk/uploads/publications/comms/r96.pdf
• Comment: published in series: Report no. 96
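The sample form above can be pictured as a simple record whose starred (required) fields are validated before submission. This is an illustrative sketch, not the Tool's data model; the field names and the filename value are assumptions:

```python
# Required fields mirror the starred (*) fields on the slide.
REQUIRED = {"title", "publisher", "year", "submission_type", "filename"}

record = {
    "title": "Living standards, poverty and inequality in the UK: 2014",
    "isbn": "9781909463523",
    "authors": ["Belfield, Chris", "Cribb, Jonathan", "Hood, Andrew"],  # up to 3
    "publisher": "Institute for Fiscal Studies",
    "year": "2014",
    "doi": "10.1920/re.ifs.2014.0096",
    "submission_type": "Book",
    "filename": "r96.pdf",  # auto-populated by the Tool; value assumed here
    "url_landing_page": "http://www.ifs.org.uk/publications/7274",
    "url_pdf": "http://www.ifs.org.uk/uploads/publications/comms/r96.pdf",
    "comment": "published in series: Report no. 96",
}

def validate(rec: dict) -> list:
    """Return the names of required fields that are missing or empty."""
    return sorted(f for f in REQUIRED if not rec.get(f))
```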
9. STEP 5: The SIP is ingested and flows through Aleph and into Primo.
• Basic catalogue records appear in Explore (our catalogue) 7 days after ingest, to make content available as soon as possible.
• Catalogue records are upgraded later by the BL cataloguing team.
• The Documents can also be found in the LDWA.
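The seven-day gap between ingest and appearance in Explore amounts to a simple date calculation (illustrative only; the function name is an assumption):

```python
from datetime import date, timedelta

def available_in_explore(ingest_date: date) -> date:
    """Basic catalogue records appear in Explore 7 days after ingest."""
    return ingest_date + timedelta(days=7)
```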
10. Phase 2
• In Phase 2, from January to end of March 2015, we will develop the Tool to harvest content on websites with simple username-and-password barriers, as covered in the NPLD Regulations.
• From April 2015, we hope to roll out use of the Tool within the BL and the LDLs.
11. An additional use case: the landing page approach
• Aims to reduce effort by forming sets of harvested websites into collections
• Each collection will have a landing page set up in the LDWA
• A single entry will link from Explore to the collection landing page
- Explain that DDHAPT is a web-based application, an extension of W3ACT. The BL Web Archiving Team sets up users and assigns different Roles and Permissions to them.
- The W3ACT interface is being improved as part of this project, so it looks a bit different from the LD ACT Tool you might be used to.
Watched Target = URL of a specific page, for example a publications listing page, within a Target website. System prevents Duplicates.
PDFs only at the moment, but documents do not need an ISBN/ISSN. Crawl frequency can be adjusted to keep up with the volume and frequency of publishing, which may change over time.
Includes Non-Print Legal Deposit criteria. Can be used for NPLD and also for content for remote access services, but rights clearance for remote access services is done outside the Tool and a second copy of the Documents has to be kept. Rights on the material are recorded in the Tool and are visible to all logged-in users.
If it finds no Docs at all, it tells us so that we can check whether the website has closed down or the Watched Target URL needs updating. At this point, the Documents are already ingested into the DLS. What we do next is improve the metadata before submitting it to Ingest.
Metadata is auto-populated as far as possible. The Tool uses software called MEX (Metadata Extraction), which was developed for the BL; it looks at the PDF itself as the primary source, then at the web page the PDF came from, to find as much metadata as it can. Selectors will need to edit the metadata a little, e.g. untangle personal author forenames and surnames into separate fields, but there should not be much typing needed. If there are duplicates of a Document, the system will reject the duplicate on Ingest.
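The PDF-first, web-page-second precedence described for MEX can be sketched as a merge of two partial records. This is a hypothetical illustration of the fallback rule, not MEX itself:

```python
def extract_metadata(pdf_fields: dict, page_fields: dict) -> dict:
    """Prefer values found in the PDF itself; fill any gaps
    (or empty values) from the web page the PDF came from."""
    merged = dict(page_fields)
    # Only non-empty PDF values override the page-derived values.
    merged.update({k: v for k, v in pdf_fields.items() if v})
    return merged
```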
The Tool will also include FAST subject headings.
Can be used for integrating resources, books published in chapters, annual report series and consultations, as well as for local authority publications.