Unleashing the Power of XSLT: Catalog Records in Batch

Lucas Mak, Michigan State University Libraries
Texas Library Association Annual Conference, April 24-27, 2013

Agenda
 Overview of XSLT & AutoIt
 Case studies
 Cataloging
○ Digitized monograph workflow
○ Spoken word recordings cataloging
 Catalog Maintenance
○ Multi-format records clean up project
○ NAR and Bib record update
 Reflections

Overview of XSLT
 XSLT (Extensible Stylesheet Language
Transformations)
 Within the family of XML
○ Case sensitive
○ Current version: 2.0 (3.0 in draft)
○ Unicode compliant
○ File extension: .xsl
○ Requires matching start and end tags
 “Transformation” means:
○ Manipulation of documents by creating a new document
based on the original document
○ Output can be in XML, HTML, XHTML, text etc.
 Data-driven execution
○ Code is executed when a certain piece of data is
encountered → unexpected outcomes are possible

 <template>
○ Contains set of instructions to be executed when a
template is called explicitly or invoked based on a
matching node
○ Modularity & Reusability
 Multiple templates in an XSLT
 Multiple XSLTs in a pipeline
 Multiple inputs and outputs
○ Comparing data from multiple inputs
 document ( )
 key ( )
 Common usages in library context
○ Web display
 e.g. converting EAD into HTML for display
○ Metadata crosswalking
 Data selection and manipulation

Overview of AutoIt
 http://www.autoitscript.com/site/autoit/
 Freeware
 Scripting language designed for
automating the Windows GUI and general
scripting
 Simulated keystrokes, mouse movement and
window/control manipulation
 Simple text manipulation (supports regular
expression), read/write text file
 Send HTTP request
 Open/run/close applications

 Automation
 To execute multiple XSLTs in a sequence by
Saxon-HE command line XSLT processor
 To link XSLT processes with other
applications, e.g. MarcEdit
 To read and compile TXT, XML, XSLT files

Background
 Digital & Multimedia Center
 Media library & Digitization service
 Chiefly text digitization
 Flatbed, overhead, planetary, slide, &
sheetfed scanners
 2.5 FTE staff, 21 students
 100k – 150k pages scanned & processed
per year

Old Cataloging Workflow for
Digitized Monographs
1. DMC: Mount digital files on web server
2. DMC: Send spreadsheet of titles and URLs to cataloging (DataCat)
3. DataCat: Search title in catalog, manually derive from print, insert URL from spreadsheet

Task Automation
 Extraction of print version MARC records
 List of bib record no.
 Utilizing AutoIt to build XML query against
catalog XML server
○ Turning III XML into MarcXML by XSLT
 Insertion of URLs to appropriate records
 Finding a match point
 Instead of recording “title” & “URL” pair,
recording “unique identifier” & “URL” pair
○ Unique identifier → bib record no.

New Cataloging Workflow
1. DMC: Mount digital files on web server
2. DMC: Prepare a TXT file with bib no. & URL
3. Metadata Librarian: Run AutoIt script to extract MARC records from Millennium/Sierra, then batch derive from print and insert URL by matching bib no.

Design of XSLT
 Processing logic
 Derive electronic record from print record
 Insert URL into electronic record by
matching bib no. of the print record against
XML document with bib no./URL pairs

Conversion Table
 Structure of bib no./URL pair XML
<pair>
<bibNumber>b55612367</bibNumber>
<url>http://archive.lib.msu.edu/DMC/AREP/AREP1.pdf</url>
</pair>
<pair>
……
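
The workflow converts the TXT file of bib no./URL pairs into this XML structure with AutoIt. As a rough illustration only, the same conversion can be sketched in Python (standing in for the AutoIt script). The tab-delimited input layout, the root element name `pairs`, and the function name are assumptions, not details from the slides.

```python
import xml.etree.ElementTree as ET


def pairs_to_xml(lines):
    """Build the bib no./URL lookup document from tab-delimited text lines."""
    root = ET.Element("pairs")  # root element name is an assumption
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        bib_number, url = line.split("\t", 1)
        pair = ET.SubElement(root, "pair")
        ET.SubElement(pair, "bibNumber").text = bib_number.strip()
        ET.SubElement(pair, "url").text = url.strip()
    return root


sample = ["b55612367\thttp://archive.lib.msu.edu/DMC/AREP/AREP1.pdf"]
print(ET.tostring(pairs_to_xml(sample), encoding="unicode"))
```

Keeping the lookup data in its own small XML file is what lets the stylesheet read it as a second input alongside the MARC records.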

Derive Template
 Provider-neutral record
 Copy data from print records without manipulation
○ e.g. 100, subject headings
○ Element: <xsl:copy-of>
 Hard-coding new data in
○ 006, 007, 040, 049, 588
○ Element: <xsl:text>
 Combine existing and new data to create new element
○ e.g. 008 (“o” into “form” byte), 090 (“Online” at the end of call #),
776 (main entry, title, imprint, 020, 010)
○ Element: <xsl:value-of>, <xsl:text>
○ Function: substring()
 Copy URL from the 2nd XML file
○ Key bib number into 2nd XML file
○ Element: <xsl:key>, <xsl:variable>
○ Function: document()
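
To make the derive logic concrete, here is a minimal Python sketch of three of the steps listed above: copying the print record, putting "o" into the 008 form byte, appending "Online" to the 090 call number, and adding the URL. It is a stand-in for the XSLT, not the stylesheet itself. The function names, the 856 indicators, and the sample record are illustrative assumptions; the hard-coded 006/007/040/049/588 and the 776 are omitted.

```python
import copy
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"


def q(tag):
    """Qualify a tag name with the MARCXML namespace."""
    return "{%s}%s" % (MARC_NS, tag)


def derive_electronic(print_record, url):
    """Return an electronic-version copy of a print MARCXML record."""
    record = copy.deepcopy(print_record)  # untouched fields are copied as is

    # 008: put "o" (online) into the form-of-item byte (position 23 for books).
    f008 = record.find(q("controlfield") + "[@tag='008']")
    if f008 is not None and f008.text and len(f008.text) > 23:
        f008.text = f008.text[:23] + "o" + f008.text[24:]

    # 090: append "Online" to the end of the call number (last subfield).
    f090 = record.find(q("datafield") + "[@tag='090']")
    if f090 is not None and len(f090):
        last = f090[-1]
        last.text = (last.text or "").rstrip() + " Online"

    # 856: add the URL that was looked up by bib number (indicators assumed).
    f856 = ET.SubElement(record, q("datafield"),
                         {"tag": "856", "ind1": "4", "ind2": "0"})
    ET.SubElement(f856, q("subfield"), {"code": "u"}).text = url
    return record


# Illustrative 40-character 008 and a made-up call number.
FIELD_008 = "850101s1985    miu" + " " * 17 + "eng d"
SAMPLE = (
    '<record xmlns="http://www.loc.gov/MARC21/slim">'
    '<controlfield tag="008">' + FIELD_008 + '</controlfield>'
    '<datafield tag="090" ind1=" " ind2=" ">'
    '<subfield code="a">Z699</subfield><subfield code="b">.M35 1985</subfield>'
    '</datafield></record>'
)
electronic = derive_electronic(
    ET.fromstring(SAMPLE), "http://archive.lib.msu.edu/DMC/AREP/AREP1.pdf")
print(ET.tostring(electronic, encoding="unicode"))
```

Working on a deep copy mirrors the XSLT idea of building a new document from the original rather than editing the print record in place.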

Workflow
Catalog → Extraction → Print records (III XML) → Format conversion by XSLT → Print records (MarcXML)
Bib no. & URLs → Converted into XML by AutoIt
Print records (MarcXML) + bib no./URL XML → XSLT → PN e-monograph records

Implementation Issues
 Legacy data
 440 vs 490 & 830 pair
 Descriptive rules & ISBD punctuation
 Obsolete 1st/2nd indicators

Background
 Vincent Voice Library
 Over 40,000 hours of spoken word
recordings
 Open-reel tapes, cassettes, DAT tapes, digital
files
 Over 2TB of born-digital and digitized
recordings (WAV format)
○ Provide MP3 for public domain/copyright-cleared
items (e.g. pre-1923, presidential
speeches, MSU provenance, etc.)

Voice Library Cataloging
 MySQL database
 Digitization & inventory tracking
 Students create a database record, which
includes summary, date of utterance, speaker
names, recording source, format(s) available,
etc., for each digital file
 Library Catalog
 Item-level cataloging
○ Cataloging of analog items stopped in the 1990s
 Analog item records suppressed since then
○ Cataloging of digital items started in the 2000s
 Based on database records

Objective
 Automate cataloging of digital files
 Reformatted items
○ Derive from suppressed analog records if
available
○ Create brief electronic records, from SQL
database records, for items with no
suppressed records in catalog
 Born-digital items
○ Create brief electronic records, from SQL
database records

Tasks
 Extract XML records of digital items from
MySQL database
 Match XML records against existing
records of analog items to create
records for digital items
 Convert into .mrc file using MarcEdit for
loading into cataloging client
 Automate and link all above steps by an
AutoIt script

Task #1: XML records
Extraction
 PHP script written by a programmer
 Extract XML records for digital items ready
for cataloging (*status = cataloging)
 Each XML record includes:
○ Summary written by students doing digitization
○ Date of utterance
○ Recording source
○ Names of speakers
○ Database (DB) no. assigned to each digital file
 If reformatted, both DB no. and analog no. (M no. for
open-reel tapes, C no. for cassettes)
○ Running time (in seconds)

<vvl:Record>
<vvl:id>2159</vvl:id>
<vvl:vvl_number>01-0350-113</vvl:vvl_number>
<vvl:copyright>Broadcast News</vvl:copyright>
<vvl:main_speaker>Farrakhan, Louis</vvl:main_speaker>
<vvl:additional_speakers/>
<vvl:recording_source>CNN</vvl:recording_source>
<vvl:summary>Minister Louis Farrakhan, Head of the Nation of Islam and organizer of
the The Million Man March, concludes the gathering of social activists with a 2-1/2 hour
speech. Held on and around the National Mall in Washington, D.C.</vvl:summary>
<vvl:date_day>16</vvl:date_day>
<vvl:date_month>October</vvl:date_month>
<vvl:date_year>1995</vvl:date_year>
<vvl:running_time>9067</vvl:running_time>
<vvl:open-reel>
<vvl:formatid>M5420 - M5421</vvl:formatid>
<vvl:type>open-reel</vvl:type>
<vvl:size>0</vvl:size>
</vvl:open-reel>
<vvl:wav>
<vvl:formatid>DB2159</vvl:formatid>
<vvl:type>wav</vvl:type>
<vvl:size>870444032</vvl:size>
</vvl:wav>
</vvl:Record>

Task #2: Records matching
 Match point
 Analog item no.
○ <formatid> in database records
○ Call no. in analog version MARC records
 Matching & deriving done by XSLT
 Matched: derive from analog version MARC
record
 No-Match: create brief MARC record from
database XML record
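
The match/no-match decision can be sketched as a keyed lookup. The snippet below is a Python illustration, not the XSLT used in the project: records are represented as plain dictionaries with a hypothetical `call_no` key, and the whitespace/case normalization is an added assumption aimed at reducing mismatches caused by formatting differences.

```python
import re


def normalize_item_no(value):
    """Collapse whitespace and case so 'M5420 - M5421' and 'm5420-M5421' compare equal."""
    return re.sub(r"\s+", "", value).upper()


def build_index(analog_records):
    """Map normalized call no. -> analog record (a keyed lookup table)."""
    return {normalize_item_no(r["call_no"]): r for r in analog_records}


def choose_template(format_id, index):
    """Return ('derive', record) on a match, otherwise ('create_new', None)."""
    match = index.get(normalize_item_no(format_id))
    if match is not None:
        return ("derive", match)
    return ("create_new", None)


analog = [{"call_no": "M5420-M5421", "title": "illustrative analog record"}]
index = build_index(analog)
print(choose_template("M5420 - M5421", index)[0])
print(choose_template("C1234", index)[0])
```

Building the index once and then looking up each database record avoids rescanning all analog records for every digital file.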

 “Derive” Template
 Copy as is
○ e.g. 1XX, 7XX, 6XX, 518, etc.
 Hardcode constant data
○ e.g. 006, 007, 588, etc.
 Insert variable data
○ e.g. 776 (info from analog record), 099 (info
from database record), 033
 Copy with adjustments
○ e.g. 008 (form byte), 245 (GMD), 300

 “Create new” Template (i.e. no match)
 1XX, 7XX from <main_speaker>,
<additional_speakers>
 245 from first sentence of <summary>
 520 from <summary>
 033 from <date_day>, <date_month>, &
<date_year>
 518 from <recording_source> & date info
 099 from <formatid>
 Hardcode MARC leader, 006, 007, 008 (except
date(s)), etc.
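
A minimal Python sketch of the "create new" mapping follows, covering only the 1XX, 245, 520 and 033 pieces. The namespace URI `urn:example:vvl` is a placeholder (the real one is not shown in the slides), the summary is abridged from the sample record, and the first-sentence rule is a naive assumption that would break on abbreviations such as "Dr.".

```python
import re
import xml.etree.ElementTree as ET

VVL_NS = {"vvl": "urn:example:vvl"}  # hypothetical namespace URI

MONTHS = {name: number for number, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}


def first_sentence(summary):
    """Return the text up to the first ., ! or ? that is followed by whitespace or the end."""
    text = " ".join(summary.split())
    found = re.match(r"(.+?[.!?])(?:\s|$)", text)
    return found.group(1) if found else text


def brief_fields(record):
    """Map a database record to a few MARC field values (complete dates only)."""
    def text(name):
        return record.findtext("vvl:" + name, default="", namespaces=VVL_NS).strip()

    summary = " ".join(text("summary").split())
    date_033 = "%04d%02d%02d" % (
        int(text("date_year")), MONTHS[text("date_month")], int(text("date_day")))
    return {
        "100": text("main_speaker"),
        "245": first_sentence(summary),
        "520": summary,
        "033": date_033,  # yyyymmdd
    }


SAMPLE = """<vvl:Record xmlns:vvl="urn:example:vvl">
<vvl:main_speaker>Farrakhan, Louis</vvl:main_speaker>
<vvl:summary>Minister Louis Farrakhan concludes the gathering with a 2-1/2 hour
speech. Held on and around the National Mall in Washington, D.C.</vvl:summary>
<vvl:date_day>16</vvl:date_day>
<vvl:date_month>October</vvl:date_month>
<vvl:date_year>1995</vvl:date_year>
</vvl:Record>"""

print(brief_fields(ET.fromstring(SAMPLE)))
```

Records with a missing day or month would need extra handling that this sketch leaves out.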

Workflow
SQL → Extraction by PHP → Digital records (VVL XML)
Catalog → One-time extraction → Analog records (MarcXML)
Digital records (VVL XML) + Analog records (MarcXML) → XSLT → Digital records (MarcXML) → Format conversion by MarcEdit → Digital records (.mrc)

Benefits
 Time saving
 Eliminate manual searching for cataloged
analog items in local catalog
 Eliminate manual copy and paste from
database to SkyRiver cataloging client

Limitations
 False no-match
 Occasional discrepancy in analog item no. in
database and catalog records
○ Typo
○ Digitization vs. Cataloging practice
 Heading updates
 Headings updated after analog records
exported from catalog
 Suppressed analog records → can’t do real-time
lookup through XML server

Background
 Thesis and dissertation cataloging at
Michigan State University Libraries
 Current practice: Separate-record approach
 Pre-2007 practice:
○ Mulvered records: Print & Microform on the same
record
○ Either:
 Cataloged print on OCLC and added microform info
locally → no record for microform in OCLC
 Cataloged print and microform separately on OCLC and
merged two records into one locally
○ 7387 titles with mulvered records

Summary of Main
Characteristics
MARC field   Characteristics
001          1 (print) or 2 (print, microform)
007          For microform
008          Form of item (byte 23): Blank
099          Call no. for print
245$h        [paper, microform]
533          Reproduction note for microform
952          $a: Item record no. for print & microform; $b: Barcode

Objective
 Un-mulvering
 One record for print, one record for microform
○ Record for print:
 Turn mulvered record into print record by removing info only
pertaining to microform (e.g. 533)
- Overlay the original record
- Delete item records pertaining to microform after overlay
○ Record for microform:
 Create new microform bib record by
- Transferring data only pertaining to microform to the new
record (microform 001, 007, 533)
- Copying, with/without modification, common data into the
new record (008, 245 etc.)
 Create new item record by copying info from original item
record for microform
 To process records by XSLT

Workflow
ILS → Extract → Mulvered records → XSLT
XSLT → Fix up → Print record → Overlay existing
XSLT → Create → Microform record → Export as new

Roadblocks
 Loss of items over the years
 46 titles with print only
 11 titles with microform only
 Need to determine which title has what
format(s)
 Multiple microform formats
 Some titles have both microfiche and
microfilm formats → need to be split into 3
records (1 print, 1 microfiche, 1 microfilm)

Loss of items
 Determine available format by location
code in item record
 “mc” = Microfiche/ Microfilm
 “th” or other branch locations = Print
 Most reliable since 049 (location in bib) did
not get updated when format was lost/added
 Bib records extracted from ILS do not
contain item location
=952 $a.i69490405$b31293027362833
=952 $a.i69490417

 Export item info separately from ILS

 Merge item info into extracted bib by
matching up item record no.

 Based on item location info in MARC 952,
determine what record(s) to generate
○ “mc” & print location(s): print record (overlay) + microform record (export as new)
○ “mc” & NO print locations: microform record (overlay)
○ Print location(s) & NO “mc”: print record (overlay)
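
The decision rule above is small enough to express directly. This Python sketch (a stand-in for the XSLT format determination logic) takes the item location codes found for one title and returns the records to generate; the function name and the output labels are illustrative.

```python
def records_to_generate(location_codes):
    """Decide which record(s) to produce from a title's item location codes."""
    codes = [code.strip().lower() for code in location_codes if code.strip()]
    has_microform = "mc" in codes               # "mc" = microfiche/microfilm
    has_print = any(code != "mc" for code in codes)  # "th" or any branch location

    if has_microform and has_print:
        return ["print (overlay)", "microform (export as new)"]
    if has_microform:
        return ["microform (overlay)"]
    if has_print:
        return ["print (overlay)"]
    return []  # no items left: nothing to generate


print(records_to_generate(["th", "mc"]))
print(records_to_generate(["mc"]))
print(records_to_generate(["th"]))
```

Whichever format survives alone overlays the existing mulvered record, so the original bib record number is never left orphaned.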

Revised Workflow
ILS → Extract → Mulvered records
ILS → Extract → Item info
Mulvered records + Item info → XSLT 1 → XSLT 2
XSLT 2 → Fix up → Print record → Overlay existing
XSLT 2 → Create → Microform record → Export as new

Multiple microform formats
 Some titles have microfiche and microfilm
on the same record
 Similarity
 Both microfiche and microfilm use “mc” as item
location
 Differences
 Call# system
○ Microfiche: Goetz, S - 3 fiche
○ Microfilm: 24354 THS Microfilm
 Two MARC 533 (reproduction note)
○ One for microfiche, one for microfilm

○ Microfiche
 =533 $aMicrofiche.$bAnn Arbor, Mich.
:$cUniversity Microfilms,$d1979.$e4 microfiche ; 11
X 15cm.
○ Microfilm
 =533 $aMicrofilm.$bAnn Arbor, Mich. :$cUniversity
Microfilms,$d 1973.$e1 microfilm reel ; 35 mm.
 Solutions
○ Based on the number of 533 fields, determine how
many microform record(s) to generate
○ Use MARC 533$a to pull appropriate call#
from MARC 952

Design of XSLT
 Processing logic
 Both print and microform available
○ Go through the record twice*
 1st pass for print record
 2nd pass for microform record*
* When both microfiche & microfilm are available → 3
passes (print, fiche, film)
 One format available
○ Go through the record once (print/ microform*)
* When both microfiche and microfilm are available → 2
passes (one for microfiche, one for microfilm)

 5 templates
○ Format determination template
○ Print data template
○ Microform templates
 Microform only (949 overlay command)
 Microform with print (949 item generation command)
○ Common data template
 Data common to both formats
 Reusable

 Format determination template
1. Parse item location info in MARC 952 as a
variable
2. Determine which location(s), aka format(s),
is available
3. Invoke different combinations of “Print data”,
“microform only”, and “microform (with
print)” templates accordingly

 Print data template
 To copy print 001, 049, 245 with adjustment
 To copy 008, 099 as is
 Generate 949 overlay command
 Invoke “common data” template

 Microform data templates
 To copy microform 001, 008, 245 with adjustment
○ 008: Add “a” (microfilm) or “b” (microfiche) in Form byte
(byte 23) based on 533$a
○ 245$h: replace “[paper, microform]” with “[microform]”
 To copy 007, 533 as is
 To generate 049, 099
○ 099: Copy call# from item info stored in 952
 949 command
○ Overlay command if microform is the only available
format
○ Item record generation if print is also available
 Invoke “common data” template
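
The two "copy with adjustment" rules for microform records can be sketched as plain string operations. This is a Python illustration rather than the stylesheet; the function names are made up, and the sample 008 value is an invented 40-character string.

```python
FORM_BYTE = {"microfilm": "a", "microfiche": "b"}  # 008 byte 23 values


def microform_form_byte(field_533_a):
    """Pick the 008 form byte from 533$a, e.g. 'Microfiche.' -> 'b'."""
    key = field_533_a.strip().rstrip(".").lower()
    return FORM_BYTE[key]  # KeyError signals an unexpected 533$a value


def adjust_008(field_008, field_533_a):
    """Replace byte 23 (blank on the mulvered record) with the microform code."""
    return field_008[:23] + microform_form_byte(field_533_a) + field_008[24:]


def adjust_245h(subfield_h):
    """Turn the mulvered GMD into the microform-only GMD."""
    return subfield_h.replace("[paper, microform]", "[microform]")


sample_008 = "790101s1979    miu" + " " * 17 + "eng d"  # illustrative value
print(adjust_008(sample_008, "Microfiche."))
print(adjust_245h("[paper, microform] /"))
```

Raising an error on an unrecognized 533$a, instead of guessing, surfaces legacy records that need manual review.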

 Common data template
 Copy fields not touched by other templates
mostly without adjustment
e.g. leader, subject headings (6XX), thesis
note (502), imprint date (260$c), physical
dimension (300), etc.

Results
 Total mulvered: 7387 records
 Total un-mulvered: 14722 records
 Microform: 7346 records
 Print: 7376 records

Acronyms
 NAR (Name Authority Record)
 LC/NACO NAF (Name Authority File)
 BFM (Bibliographic File Maintenance)
 Heading/Authorized access point updates in
bib records
 SRU (Search/Retrieval via URL)
 HTTP request

LC/NACO NAF
 Dynamic file
 Contributions from over 700 NACO
participants
 Updated every day with new and
changed NARs from NACO nodes
 Full nodes
○ British Library, OCLC, SkyRiver
 Contribution-only node
○ National Library of Medicine

LC/NACO NAF
Maintenance
LC database → Distribution to BL, OCLC, SkyRiver

Authority Control at MSU
 In-house
 NACO institution
 Database maintenance
 Post-cataloging Authority Control
 New Headings Report
○ Download NARs from SkyRiver
 Updates to NARs not necessarily caught
○ 1XX (No item cataloged under changed 1XX
→ not in new heading report)
○ Elements other than 1XX (e.g. 4XX, 670)

LC/NACO NAF RDA
Transition
 PCC Day 1 for RDA NAR: Mar. 31, 2013
 PCC Task Group on AACR2 & RDA
Acceptable Heading Categories (Aug 2011)
 225,000 NARs with 1XX not usable in RDA bib
records
 172,000 NARs with 1XX usable in RDA bib
records after batch manipulation by software
 7,631,00 NARs with 1XX usable in RDA bib
records as they are and can be recoded as RDA

 Phased reissuance of NARs
 Phase 1
○ Scope
 NARs with characteristics known to be at variance with RDA practice
 Not candidates for any of the mechanical changes to be made during
phase 2
○ Adding a 667 note “THIS 1XX FIELD CANNOT BE USED UNDER
RDA UNTIL THIS RECORD HAS BEEN REVIEWED AND/OR
UPDATED”
 Completed Aug. 20, 2012 (436,943 records processed)
 Phase 2
○ Programmatic changes to 1XX headings that are not acceptable
under RDA (e.g., changes to Bible headings, spelling out Dept. and
months, etc., abbreviations in the subfield $d for personal names)
○ Completed March 27, 2013 (371,942 records changed)

 Updates of NARs by NACO institutions
 Reviewing, upgrading, and recoding Phase 1
records to RDA
 Upgrading and recoding non-Phase 1 records to
RDA
 Adding any of the 17 new MARC fields (e.g.
046, 372, etc.)
 Routine NAR maintenance
○ PCC post-RDA test guidelines “strongly
encourage” evaluating and recoding the
“RDA-acceptable AACR2 NARs” to RDA whenever
possible

Objectives
 To catch changes to NARs
 Changes in 1XX
 Addition, deletion, or updates of elements
other than 1XX
 To perform related BFM if 1XX in a NAR
is changed

Tasks
 To download NARs one-by-one/in bulk
 To detect updates to NARs already
existing in Millennium
 To overlay existing NARs with updated
ones
 To update headings in bib records if 1XX
in NAR is updated
 To automate and link up the above tasks

Task #1: Download NARs
 OCLC LCNAF SRU Service
 Pros
○ Multiple indexes (LCCN, names, dates, etc.)
○ Available in multiple schema including MARCXML
○ One-by-one or bulk download*
○ SRU-based service (HTTP request)
○ FREE!!
 Cons
○ Updated every Monday night
○ OAI-PMH service is not available though there is an
index for OAI identifier
○ Bulk download – by search term (e.g. after certain date)

Task #1: Download NARs
(cont’d)
 Implementation
 Search LCCNs one-by-one by AutoIt script
○ Around 10 records/sec. retrieved
 Download XML files into one folder (files
named by LCCN)

Task #2: NAR Update Detection
 To compare NARs from Millennium and NARs from
LC/NACO NAF by XSLT
 MARC 005 (timestamp)
 If timestamp is more current on the NAR from NAF → overlay the
NAR in Millennium
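
Because MARC 005 is a fixed-width timestamp (yyyymmddhhmmss.f), two values can be compared as plain strings. Here is a minimal Python sketch of the overlay decision, standing in for the XSLT comparison. Treating a missing local 005 as "overlay" is an assumption based on the note in the test results that many exported NARs lack field 005.

```python
def needs_overlay(local_005, naf_005):
    """Return True when the NAF copy should overlay the local NAR."""
    if not naf_005:
        return False  # nothing to compare against
    if not local_005:
        return True   # local NAR has no timestamp: overlay (assumption)
    # Fixed-width digits sort chronologically as strings.
    return naf_005.strip() > local_005.strip()


print(needs_overlay("20110315093000.0", "20130327120000.0"))
print(needs_overlay("20130327120000.0", "20130327120000.0"))
print(needs_overlay(None, "20130327120000.0"))
```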

Task #3: Export/Overlay of
NARs
 MarcEdit
 Export updated NARs into Millennium
 Through TCP/IP (Host address, Port, .mrc
file)
○ Same as export from OCLC Connexion or
SkyRiver
 One-by-one (though .mrc file can contain
multiple NARs)

Task #4: Updates of Bib
Headings
 XSLT
 To detect changes in 1XX between old and
new NARs
 To build heading conversion table (a TXT
file) when 1XX is changed
 AutoIt
 Automate bib heading updates by “Global
Update” module in Millennium
○ Read old and new headings from the TXT file
and fill out info required in “Global Update”
process
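
The heading conversion table can be pictured as one line per changed heading. The Python sketch below stands in for the XSLT step. The tab-delimited layout and the sample headings are assumptions for illustration, and headings are compared as flat strings rather than subfield by subfield.

```python
def heading_change(old_1xx, new_1xx):
    """Return (old, new) when the 1XX heading changed, else None."""
    old, new = old_1xx.strip(), new_1xx.strip()
    if old == new:
        return None
    return (old, new)


def conversion_table(pairs):
    """Build tab-delimited 'old<TAB>new' lines for changed headings only."""
    lines = []
    for old, new in pairs:
        change = heading_change(old, new)
        if change:
            lines.append("\t".join(change))
    return "\n".join(lines)


# Invented sample headings, not taken from the project data.
pairs = [
    ("Example, Author, b. 1900", "Example, Author, 1900-"),
    ("Unchanged, Name", "Unchanged, Name"),
]
print(conversion_table(pairs))
```

Writing only the changed pairs keeps the file that AutoIt reads during Global Update as short as possible.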

Task #5: Automation
 Use AutoIt to:
 Link up various steps in the workflow
 Automate searching against OCLC LCNAF SRU
Service by compiling and sending HTTP
requests
 Execute various XSLTs in a predetermined
sequence
○ e.g. NAR comparison → Heading comparison
 Read TXT files (LCCN list, heading conversion
table) created by XSLT processes
 Run MarcEdit to overlay obsolete NARs
 Execute “Global Update” process

Basic Workflow
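Millennium → Extract by Create Lists → Millennium NARs → Extract by XSLT → LCCNs
LCCNs → Search by AutoIt → Retrieve LC/NACO NARs
Millennium NARs + LC/NACO NARs → Compare by XSLT → Updated NARs → Overlay by MarcEdit → Millennium
Compare by XSLT → Updated headings → Global Update → Millennium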

Test Results
 82,398 NARs tested
 81,362 NARs needed to be overlaid*
 4,584 headings became obsolete
 10,900 bib records had at least one heading
flipped
* Many NARs exported from Millennium do not contain field 005 →
overlaying those will save comparison time down the road

Limitations
 Identities broken out from undifferentiated
NARs can’t be detected
 Partially taken care of by “New Headings Report”
 Headings with diacritics
 Code points & exact match in Global Update
 Headings in Field 880
 Slow export using MarcEdit
 Data Exchange module
 Slow “Global Update” process
 Wrong indicators put in by AutoIt during Global
Update (though correct in conversion table)
 “Java heap space” out of memory error

Reflections
 Low tolerance for differences between specified
pattern and target
 Case-sensitive
○ Normalization needed
 Implication on processing time
○ RegEx helps specify a range of patterns
○ Exceptions: matches(), replace(), tokenize()
 Data encoding
○ Diacritics, non-ASCII characters
 Data consistency
○ Extra conditions/steps needed to account for exceptions
○ Legacy data, incorrect data
○ Pre-processing clean up vs. On-the-fly clean up
○ Familiarity with source data → normal pattern & exceptions
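
As an illustration of the normalization point, the Python sketch below builds a comparison key so that headings differing only in case, spacing, or composed versus decomposed diacritics still match. The function name and the choice of NFC plus case folding are assumptions, not the normalization used in the project.

```python
import unicodedata


def match_key(text):
    """Normalize code points, case and whitespace for comparison only."""
    composed = unicodedata.normalize("NFC", text)  # combine base letter + diacritic
    return " ".join(composed.casefold().split())   # fold case, collapse spaces


precomposed = "DVOŘÁK,  Antonín"
decomposed = "Dvor\u030ca\u0301k, Antoni\u0301n"   # same name, combining marks
print(precomposed == decomposed)
print(match_key(precomposed) == match_key(decomposed))
```

The key is used only for matching; the original string, with its original code points, is what would be written to the output record.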

 Unique identifiers vital for matching
 Full automation vs. Semi-automation
 Integration with other scripting language for
full automation of ongoing workflow
 Demanding on computing power
 Multi-step matching
 Processing multiple documents
 Large XML files
○ <xsl:stream> in XSLT 3.0
 Can’t create records out of thin air

Lucas Mak
Metadata & Catalog Librarian
Michigan State University Libraries
makw@mail.lib.msu.edu
