Unleashing the Power of XSLT: Catalog Records in Batch
1. Catalog Records in Batch
Lucas Mak, Michigan State University Libraries
Texas Library Association Annual Conference, April 24-27, 2013
2. Agenda
Overview of XSLT & AutoIt
Case studies
Cataloging
○ Digitized monograph workflow
○ Spoken word recordings cataloging
Catalog Maintenance
○ Multi-format records clean up project
○ NAR and Bib record update
Reflections
3. Overview of XSLT
XSLT (Extensible Stylesheet Language
Transformations)
Within the family of XML
○ Case sensitive
○ Current version: 2.0 (3.0 in draft)
○ Unicode compliant
○ File extension: .xsl
○ Requires matching start and end tags
“Transformation” means:
○ Manipulation of documents by creating a new document
based on the original document
○ Output can be in XML, HTML, XHTML, text etc.
Data-driven execution
○ Code is executed when a certain piece of data is
encountered → can lead to unexpected outcomes
4. <template>
○ Contains set of instructions to be executed when a
template is called explicitly or invoked based on a
matching node
○ Modularity & Reusability
Multiple templates in an XSLT stylesheet
Multiple XSLTs in a pipeline
Multiple inputs and outputs
○ Comparing data from multiple inputs
document ( )
key ( )
Common usages in library context
○ Web display
e.g. converting EAD into HTML for display
○ Metadata crosswalking
Data selection and manipulation
5. Overview of AutoIt
http://www.autoitscript.com/site/autoit/
Freeware
Scripting language designed for
automating the Windows GUI and general
scripting
Simulated keystrokes, mouse movement and
window/control manipulation
Simple text manipulation (supports regular
expression), read/write text file
Send HTTP request
Open/run/close applications
6. Automation
To execute multiple XSLTs in a sequence by
Saxon-HE command line XSLT processor
To link XSLT processes with other
applications, e.g. MarcEdit
To read and compile TXT, XML, XSLT files
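The chaining of several XSLTs through the Saxon-HE command line can be sketched as follows. The talk used AutoIt for this step; Python is used here only for illustration. The jar name, stylesheet names, and file paths are assumptions, and the sketch only builds the command lines rather than running them.

```python
# Hypothetical jar name; adjust to the local Saxon-HE installation.
SAXON_JAR = "saxon9he.jar"


def saxon_command(source, stylesheet, output):
    """Build one Saxon-HE call: -s: source, -xsl: stylesheet, -o: output."""
    return ["java", "-jar", SAXON_JAR,
            f"-s:{source}", f"-xsl:{stylesheet}", f"-o:{output}"]


def pipeline_commands(first_input, stylesheets, workdir="work"):
    """Chain stylesheets so each step reads the previous step's output."""
    commands = []
    current = first_input
    for step, stylesheet in enumerate(stylesheets, start=1):
        output = f"{workdir}/step{step}.xml"
        commands.append(saxon_command(current, stylesheet, output))
        current = output
    return commands


# Hypothetical file names, for illustration only.
commands = pipeline_commands("records.xml", ["derive.xsl", "insert_url.xsl"])
for command in commands:
    print(" ".join(command))
    # To actually run a step: subprocess.run(command, check=True)
```

Building the commands separately from running them makes it easy to log or inspect the sequence before anything touches the data.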
8. Background
Digital & Multimedia Center
Media library & Digitization service
Chiefly text digitization
Flatbed, overhead, planetary, slide, &
sheetfed scanners
2.5 FTE staff, 21 students
100k – 150k pages scanned & processed
per year
9. Old Cataloging Workflow for
Digitized Monographs
DMC
• Mounting
digital files on
web server
DMC
• Spreadsheet
of titles and
URLs to be
sent to
cataloging
(DataCat)
DataCat
• Search title in
catalog
• Manually derive
from print
• Insert URL
from
spreadsheet
10. Task Automation
Extraction of print version MARC records
List of bib record no.
Using AutoIt to build XML queries against the
catalog XML server
○ Turning III XML into MarcXML by XSLT
Insertion of URLs to appropriate records
Finding a match point
Instead of recording “title” & “URL” pair,
recording “unique identifier” & “URL” pair
○ Unique identifier → bib record no.
11. New Cataloging Workflow
DMC
• Mounting digital
files on web
server
DMC
• Prepare a TXT file
with bib no. & URL
Metadata
Librarian
• Run AutoIt script
to extract MARC
records from
Millennium/Sierra
• Batch derive from
print and insert
URL by matching
bib no.
12. Design of XSLT
Processing logic
Derive electronic record from print record
Insert URL into electronic record by
matching bib no. of the print record against
XML document with bib no./URL pairs
13. Conversion Table
Structure of bib no./URL pair XML
<pair>
<bibNumber>b55612367</bibNumber>
<url>http://archive.lib.msu.edu/DMC/AREP/AREP1.pdf</url>
</pair>
<pair>
……
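A minimal sketch of turning the TXT file of bib no./URL pairs into the pair structure above, written in Python for illustration. The whitespace-delimited TXT layout and the root element name "pairs" are assumptions; the slides show neither.

```python
import xml.etree.ElementTree as ET


def pairs_to_xml(lines, root_name="pairs"):
    """Build <pair> elements from 'bibNumber URL' lines.

    root_name is a hypothetical wrapper element, not taken from the slides.
    """
    root = ET.Element(root_name)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        bib_no, url = line.split(None, 1)  # assumed: whitespace-delimited
        pair = ET.SubElement(root, "pair")
        ET.SubElement(pair, "bibNumber").text = bib_no
        ET.SubElement(pair, "url").text = url.strip()
    return root


sample = ["b55612367\thttp://archive.lib.msu.edu/DMC/AREP/AREP1.pdf"]
root = pairs_to_xml(sample)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)

# The same data as a lookup keyed by bib no., the role the conversion table plays.
lookup = {p.findtext("bibNumber"): p.findtext("url") for p in root.iter("pair")}
```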
14. Derive Template
Provider-neutral record
Copy data from print records without manipulation
○ e.g. 100, subject headings
○ Element: <xsl:copy-of>
Hard-coding new data in
○ 006, 007, 040, 049, 588
○ Element: <xsl:text>
Combine existing and new data to create new element
○ e.g. 008 (“o” into “form” byte), 090 (“Online” at the end of call #),
776 (main entry, title, imprint, 020, 010)
○ Element: <xsl:value-of>, <xsl:text>
○ Function: substring()
Copy URL from the 2nd XML file
○ Key bib number into 2nd XML file
○ Element: <xsl:key>, <xsl:variable>
○ Function: document()
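The derive logic can be illustrated outside XSLT. The Python sketch below covers only two of the steps listed above: putting "o" into the 008 form byte and inserting the URL looked up by bib number. It assumes MARCXML input, a monograph record (form of item at 008 byte 23), and an 856 $u field with indicators 4 and 0 for the URL; the slides do not name the URL field, and the sample bib number handling is hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"


def q(name):
    """Qualify a MARCXML element name with its namespace."""
    return f"{{{MARC_NS}}}{name}"


def derive_electronic(print_record, bib_no, url_by_bib):
    """Copy the print record, set 008/23 to 'o', and add the matching URL."""
    record = copy.deepcopy(print_record)  # leave the print record untouched
    for field in record.findall(q("controlfield")):
        if field.get("tag") == "008" and field.text and len(field.text) > 23:
            field.text = field.text[:23] + "o" + field.text[24:]
    url = url_by_bib.get(bib_no)
    if url:
        # Assumption: URL goes into 856 $u with indicators 4 and 0.
        link = ET.SubElement(record, q("datafield"),
                             {"tag": "856", "ind1": "4", "ind2": "0"})
        ET.SubElement(link, q("subfield"), {"code": "u"}).text = url
    return record


# Made-up print record with a 40-character 008.
print_rec = ET.Element(q("record"))
field_008 = ET.SubElement(print_rec, q("controlfield"), {"tag": "008"})
field_008.text = "850101s1985    miu" + " " * 17 + "eng d"

urls = {"b55612367": "http://archive.lib.msu.edu/DMC/AREP/AREP1.pdf"}
derived = derive_electronic(print_rec, "b55612367", urls)
new_008 = derived.find(q("controlfield")).text
new_url = derived.find(q("datafield")).find(q("subfield")).text
print(repr(new_008), new_url)
```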
18. Background
Vincent Voice Library
Over 40,000 hours of spoken word
recordings
Open-reel tapes, cassettes, DAT tapes, digital
files
Over 2TB of born-digital and digitized
recordings (WAV format)
○ Provide MP3 for public domain/copyright-cleared
items (e.g. pre-1923, presidential
speeches, MSU provenance, etc.)
19. Voice Library Cataloging
MySQL database
Digitization & inventory tracking
Students create a database record, which
includes summary, date of utterance, speaker
names, recording source, format(s) available,
etc., for each digital file
Library Catalog
Item-level cataloging
○ Cataloging of analog items stopped in 1990s
Analog item records suppressed since then
○ Cataloging of digital items started in 2000s
Based on database records
20. Objective
Automate cataloging of digital files
Reformatted items
○ Derive from suppressed analog records if
available
○ Create brief electronic records, from SQL
database records, for items with no
suppressed records in catalog
Born-digital items
○ Create brief electronic records, from SQL
database records
21. Tasks
Extract XML records of digital items from
MySQL database
Match XML records against existing
records of analog items to create
records for digital items
Convert into .mrc file using MarcEdit for
loading into cataloging client
Automate and link all above steps by an
AutoIt script
22. Task #1: XML records
Extraction
PHP script written by a programmer
Extract XML records for digital items ready
for cataloging (*status = cataloging)
Each XML record includes:
○ Summary written by students doing digitization
○ Date of utterance
○ Recording source
○ Names of speakers
○ Database (DB) no. assigned to each digital file
If reformatted, both DB no. and analog no. (M no. for
open-reel tapes, C no. for cassettes)
○ Running time (in seconds)
24. Task #2: Records matching
Match point
Analog item no.
○ <formatid> in database records
○ Call no. in analog version MARC records
Matching & deriving done by XSLT
Matched: derive from analog version MARC
record
No-Match: create brief MARC record from
database XML record
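The matched/no-match split can be sketched in a few lines. The talk did this in XSLT; the Python below is only an illustration. The dictionary field names other than formatid, the sample item numbers, and the normalization rule (strip whitespace, upper-case) are assumptions.

```python
def normalize(item_no):
    """Assumed normalization: drop whitespace and upper-case the item no."""
    return "".join(item_no.split()).upper()


def split_matches(db_records, analog_records):
    """Pair each database record with an analog MARC record by item no."""
    by_call_no = {normalize(r["call_no"]): r for r in analog_records}
    matched, unmatched = [], []
    for rec in db_records:
        analog = by_call_no.get(normalize(rec.get("formatid", "")))
        if analog:
            matched.append((rec, analog))   # derive from the analog record
        else:
            unmatched.append(rec)           # create a brief record instead
    return matched, unmatched


# Made-up sample data.
db_records = [
    {"db_no": "DB100", "formatid": "M 1234"},
    {"db_no": "DB101", "formatid": "C 77"},
]
analog_records = [{"call_no": "M1234", "title": "Sample analog record"}]

matched, unmatched = split_matches(db_records, analog_records)
print(len(matched), "matched;", len(unmatched), "to create as brief records")
```

Normalizing both sides before comparing reduces false no-matches caused by spacing or case, though it cannot repair a mistyped number.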
25. “Derive” Template
Copy as is
○ e.g. 1XX, 7XX, 6XX, 518, etc.
Hardcode constant data
○ e.g. 006, 007, 588, etc.
Insert variable data
○ e.g. 776 (info from analog record), 099 (info
from database record), 033
Copy with adjustments
○ e.g. 008 (form byte), 245 (GMD), 300
26. “Create new” Template (i.e. no match)
1XX, 7XX from <main_speaker>,
<additional_speakers>
245 from first sentence of <summary>
520 from <summary>
033 from <date_day>, <date_month>, &
<date_year>
518 from <recording_source> & date info
099 from <formatid>
Hardcode MARC leader, 006, 007, 008 (except
date(s)), etc.
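Two of the mappings above lend themselves to a short illustration: taking the first sentence of the summary for the 245, and assembling the 033 date. The Python sketch below assumes a simple "ends with . ! or ?" sentence rule (abbreviations such as "Dr." would cut the title short) and assumes that unknown parts of the 033 $a date are filled with hyphens; the sample summary is made up.

```python
import re


def first_sentence(summary):
    """Return the text up to the first sentence-ending punctuation mark."""
    text = summary.strip()
    match = re.search(r"(.+?[.!?])(\s|$)", text, flags=re.S)
    return match.group(1) if match else text


def field_033a(year, month=None, day=None):
    """Format a date as yyyymmdd, using hyphens for missing parts (assumed)."""
    def part(value, width):
        return str(value).zfill(width) if value else "-" * width
    return part(year, 4) + part(month, 2) + part(day, 2)


summary = "A speaker discusses labor policy. Recorded at a campus event."
title = first_sentence(summary)
full_date = field_033a(1948, 3, 5)
partial_date = field_033a(1948, 3)
print(title, full_date, partial_date)
```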
28. Benefits
Time saving
Eliminate manual searching for cataloged
analog items in local catalog
Eliminate manual copy and paste from
database to SkyRiver cataloging client
29. Limitations
False no-match
Occasional discrepancy in analog item no. in
database and catalog records
○ Typo
○ Digitization vs. Cataloging practice
Heading updates
Headings updated after analog records
exported from catalog
Suppressed analog records can’t do real
time lookup through XML server
31. Background
Thesis and dissertation cataloging at
Michigan State University Libraries
Current practice: Separate-record approach
Pre-2007 practice:
○ Mulvered records: Print & Microform on the same
record
○ Either:
Cataloged print on OCLC and added microform info
locally → no record for microform in OCLC
Cataloged print and microform separately on OCLC and
merged two records into one locally
○ 7387 titles with mulvered records
32. Summary of Main
Characteristics
MARC Field | Characteristics
001 | 1 (print) or 2 (print, microform)
007 | For microform
008 | Form of item (byte 23): Blank
099 | Call no. for print
245$h | [paper, microform]
533 | Reproduction note for microform
952 | $a: Item record no. for print & microform; $b: Barcode
33. Objective
Un-mulvering
One record for print, one record for microform
○ Record for print:
Turn mulvered record into print record by removing info only
pertaining to microform (e.g. 533)
- Overlay the original record
- Delete item records pertaining to microform after overlay
○ Record for microform:
Create new microform bib record by
- Transferring data only pertaining to microform to the new
record (microform 001, 007, 533)
- Copying, with/without modification, common data into the
new record (008, 245 etc.)
Create new item record by copying info from original item
record for microform
To process records by XSLT
35. Roadblocks
Loss of items over the years
46 titles with print only
11 titles with microform only
Need to determine which title has what
format(s)
Multiple microform formats
Some titles have both microfiche and
microfilm formats → need to be split into 3
records (1 print, 1 microfiche, 1 microfilm)
36. Loss of items
Determine available format by location
code in item record
“mc” = Microfiche/ Microfilm
“th” or other branch locations = Print
Most reliable since 049 (location in bib) did
not get updated when format was lost/added
Bib records extracted from ILS do not
contain item location
=952 $a.i69490405$b31293027362833
=952 $a.i69490417
38. Merge item info into extracted bib by
matching up item record no.
39. Based on item location info in MARC 952,
determine what record(s) to generate
“mc” & print location(s)
• Print record (overlay)
• Microform record (export as new)
“mc” & NO print locations
• Microform record (overlay)
Print location(s) & NO “mc”
• Print record (overlay)
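The decision on this slide reduces to a small function. The Python below is an illustrative sketch (the talk implemented it in XSLT); it assumes the item location codes have already been collected from the 952 fields into a list, and treats any code other than "mc" as a print location.

```python
def records_to_generate(item_locations):
    """Decide which records to produce from a title's item location codes."""
    has_microform = "mc" in item_locations
    has_print = any(loc != "mc" for loc in item_locations)  # "th" or branches
    if has_microform and has_print:
        return ["print (overlay)", "microform (export as new)"]
    if has_microform:
        return ["microform (overlay)"]
    if has_print:
        return ["print (overlay)"]
    return []  # no items left at all


both = records_to_generate(["th", "mc"])
film_only = records_to_generate(["mc"])
print_only = records_to_generate(["th"])
print(both, film_only, print_only)
```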
41. Multiple microform formats
Some titles have microfiche and microfilm
on the same record
Similarity
Both microfiche and microfilm use “mc” as item
location
Differences
Call# system
○ Microfiche: Goetz, S - 3 fiche
○ Microfilm: 24354 THS Microfilm
Two MARC 533 (reproduction note)
○ One for microfiche, one for microfilm
42. ○ Microfiche
=533 $aMicrofiche.$bAnn Arbor, Mich.
:$cUniversity Microfilms,$d1979.$e4 microfiche ; 11
X 15cm.
○ Microfilm
=533 $aMicrofilm.$bAnn Arbor, Mich. :$cUniversity
Microfilms,$d 1973.$e1 microfilm reel ; 35 mm.
Solutions
○ Based on the number of 533 fields, determine how
many microform record(s) to generate
○ Use MARC 533$a to pull appropriate call#
from MARC 952
44. Design of XSLT
Processing logic
Both print and microform available
○ Go through the record twice*
1st pass for print record
2nd pass for microform record*
* When both microfiche & microfilm are available → 3
passes (print, fiche, film)
One format available
○ Go through the record once (print/ microform*)
* When both microfiche and microfilm are available → 2
passes (one for microfiche, one for microfilm)
45. 5 templates
○ Format determination template
○ Print data template
○ Microform templates
Microform only (949 overlay command)
Microform with print (949 item generation command)
○ Common data template
Data common to both formats
Reusable
46. Format determination template
1. Parse item location info in MARC 952 as a
variable
2. Determine which location(s), aka format(s),
is available
3. Invoke different combination of “Print data”,
“microform only”, and “microform (with
print)” templates accordingly
47. Print data template
To copy print 001, 049, 245 with adjustment
To copy 008, 099 as is
Generate 949 overlay command
Invoke “common data” template
48. Microform data templates
To copy microform 001, 008, 245 with adjustment
○ 008: Add “a” (microfilm) or “b” (microfiche) in Form byte
(byte 23) based on 533$a
○ 245$h: replace “[paper, microform]” with “[microform]”
To copy 007, 533 as is
To generate 049, 099
○ 099: Copy call# from item info stored in 952
949 command
○ Overlay command if microform is the only available
format
○ Item record generation if print is also available
Invoke “common data” template
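The 008 and 245$h adjustments above are simple string operations. A Python sketch under stated assumptions: the 533$a text is "Microfilm." or "Microfiche." as in the examples on slide 42, and the 008 is a 40-character string with form of item at byte 23. The sample 008 value is made up.

```python
# Form-of-item codes from the slide: "a" = microfilm, "b" = microfiche.
FORM_BY_533A = {"microfilm": "a", "microfiche": "b"}


def microform_form_code(note_533a):
    """Map 533$a (e.g. 'Microfiche.') to the 008 form-of-item code."""
    key = note_533a.strip().rstrip(".").lower()
    return FORM_BY_533A[key]


def set_form_byte(field_008, code):
    """Replace byte 23 of the 008 with the given code."""
    return field_008[:23] + code + field_008[24:]


def fix_gmd(subfield_h):
    """Turn the mulvered GMD into the microform-only GMD."""
    return subfield_h.replace("[paper, microform]", "[microform]")


mulvered_008 = "790101s1979    miu" + " " * 17 + "eng d"  # made-up sample
notes_533a = ["Microfiche.", "Microfilm."]  # one microform record per 533

microform_008s = [set_form_byte(mulvered_008, microform_form_code(a))
                  for a in notes_533a]
gmd = fix_gmd("[paper, microform]")
print([f[23] for f in microform_008s], gmd)
```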
49. Common data template
Copy fields not touched by other templates
mostly without adjustment
e.g. leader, subject headings (6XX), thesis
note (502), imprint date (260$c), physical
dimension (300), etc.
50. Results
Total mulvered: 7387 records
Total un-mulvered: 14722 records
Microform: 7346 records
Print: 7376 records
52. Acronyms
NAR (Name Authority Record)
LC/NACO NAF (Name Authority File)
BFM (Bibliographic File Maintenance)
Heading/Authorized access point updates in
bib records
SRU (Search/Retrieval via URL)
HTTP request
53. LC/NACO NAF
Dynamic file
Contributions from over 700 NACO
participants
Updated everyday with new and
changed NARs from NACO nodes
Full nodes
○ British Library, OCLC, SkyRiver
Contribution-only node
○ National Library of Medicine
55. Authority Control at MSU
In-house
NACO institution
Database maintenance
Post-cataloging Authority Control
New Headings Report
○ Download NARs from SkyRiver
Updates to NARs not necessarily caught
○ 1XX (No item cataloged under changed 1XX →
not in new heading report)
○ Elements other than 1XX (e.g. 4XX, 670)
56. LC/NACO NAF RDA
Transition
PCC Day 1 for RDA NAR: Mar. 31, 2013
PCC Task Group on AACR2 & RDA
Acceptable Heading Categories (Aug 2011)
225,000 NARs with 1XX not usable in RDA bib
records
172,000 NARs with 1XX usable in RDA bib
records after batch manipulation by software
7,631,00 NARs with 1XX usable in RDA bib
records as they are and can be recoded as RDA
57. Phased reissuance of NARs
Phase 1
○ Scope
NARs with characteristics known to be at variance with RDA practice
Not candidates for any of the mechanical changes to be made during
phase 2
○ Adding a 667 note “THIS 1XX FIELD CANNOT BE USED UNDER
RDA UNTIL THIS RECORD HAS BEEN REVIEWED AND/OR
UPDATED”
Completed Aug. 20, 2012 (436,943 records processed)
Phase 2
○ Programmatic changes to 1XX headings that are not acceptable
under RDA (e.g., changes to Bible headings, spelling out Dept. and
months, etc., abbreviations in the subfield $d for personal names)
○ Completed March 27, 2013 (371,942 records changed)
58. Updates of NARs by NACO institutions
Reviewing, upgrading, and recoding Phase 1
records to RDA
Upgrading and recoding non-Phase 1 records to
RDA
Adding any of the 17 new MARC fields (e.g.
046, 372, etc.)
Routine NAR maintenance
○ PCC post-RDA test guidelines “strongly
encourage” evaluating and recoding
“RDA-acceptable AACR2 NARs” to RDA whenever
possible
59. Objectives
To catch changes to NARs
Changes in 1XX
Addition, deletion, or updates of elements
other than 1XX
To perform related BFM if 1XX in a NAR
is changed
60. Tasks
To download NARs one-by-one/in bulk
To detect updates to NARs already
existing in Millennium
To overlay existing NARs with updated
ones
To update headings in bib records if 1XX
in NAR is updated
To automate and link up the above tasks
61. Task #1: Download NARs
OCLC LCNAF SRU Service
Pros
○ Multiple indexes (LCCN, names, dates, etc.)
○ Available in multiple schema including MARCXML
○ One-by-one or bulk download*
○ SRU-based service (HTTP request)
○ FREE!!
Cons
○ Updated every Monday night
○ OAI-PMH service is not available though there is an
index for OAI identifier
○ Bulk download – by search term (e.g. after certain date)
62. Task #1: Download NARs
(cont’d)
Implementation
Search LCCNs one-by-one by AutoIt script
○ Around 10 records/sec. retrieved
Download XML files into one folder (files
named by LCCN)
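A sketch of compiling one SRU request per LCCN, in Python rather than AutoIt. The operation, version, query, recordSchema, and maximumRecords parameters are standard SRU; the base URL, the index name "local.LCCN", and the sample LCCN are placeholders, not the service's documented values. The sketch only builds the URL and the target file name; it sends nothing.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Placeholders: substitute the real service endpoint and LCCN index name.
BASE_URL = "http://example.org/srw/search/lcnaf"
LCCN_INDEX = "local.LCCN"


def sru_url(lccn, schema="info:srw/schema/1/marcxml-v1.1"):
    """Build a searchRetrieve request asking for one MARCXML record."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": f'{LCCN_INDEX} = "{lccn}"',
        "recordSchema": schema,
        "maximumRecords": "1",
    }
    return BASE_URL + "?" + urlencode(params)


def output_name(lccn):
    """Name the downloaded file by LCCN, with spaces removed."""
    return lccn.replace(" ", "") + ".xml"


lccn = "n 2013000001"  # made-up LCCN for illustration
url = sru_url(lccn)
filename = output_name(lccn)
print(url)
print(filename)
```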
63. Task #2: NAR Update Detection
To compare NARs from Millennium and NARs from
LC/NACO NAF by XSLT
MARC 005 (timestamp)
If the timestamp is more current on the NAR from NAF → overlay the
NAR in Millennium
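The timestamp test can be sketched as a string comparison, since MARC 005 is a fixed-length yyyymmddhhmmss.f value that sorts chronologically as text. This Python illustration (the talk used XSLT) also treats a local NAR with no 005 as needing overlay, following the note in the test results; the sample timestamps are made up.

```python
def needs_overlay(local_005, naf_005):
    """Return True when the NAF copy should replace the local NAR."""
    if not naf_005:
        return False  # nothing to compare against
    if not local_005:
        return True   # local record lacks 005: overlay it
    # Fixed-length yyyymmddhhmmss.f strings compare in chronological order.
    return naf_005 > local_005


newer = needs_overlay("20100101120000.0", "20130327080000.0")
same = needs_overlay("20130327080000.0", "20130327080000.0")
missing = needs_overlay(None, "20130327080000.0")
print(newer, same, missing)
```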
64. Task #3: Export/Overlay of
NARs
MarcEdit
Export updated NARs into Millennium
Through TCP/IP (Host address, Port, .mrc
file)
○ Same as export from OCLC Connexion or
SkyRiver
One-by-one (though .mrc file can contain
multiple NARs)
65. Task #4: Updates of Bib
Headings
XSLT
To detect changes in 1XX between old and
new NARs
To build heading conversion table (a TXT
file) when 1XX is changed
AutoIt
Automate bib heading updates by “Global
Update” module in Millennium
○ Read old and new headings from the TXT file
and fill out info required in “Global Update”
process
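Building the heading conversion table amounts to comparing the 1XX of the old and new NAR for each LCCN and writing out the pairs that differ. A Python sketch under assumptions: headings are already extracted as plain strings keyed by LCCN, the TXT layout is one tab-separated old/new pair per line, and the LCCNs are made up. The sample change mirrors the "spelling out Dept." example from Phase 2.

```python
def heading_changes(old_nars, new_nars):
    """List (old, new) 1XX pairs for NARs whose heading changed."""
    rows = []
    for lccn, old_heading in old_nars.items():
        new_heading = new_nars.get(lccn)
        if new_heading and new_heading != old_heading:
            rows.append((old_heading, new_heading))
    return rows


def to_txt(rows):
    """Serialize the table as tab-separated lines (assumed layout)."""
    return "\n".join(f"{old}\t{new}" for old, new in rows)


# Made-up LCCNs; headings keyed by LCCN.
old_nars = {
    "n0000001": "United States. Dept. of State",
    "n0000002": "Michigan State University",
}
new_nars = {
    "n0000001": "United States. Department of State",
    "n0000002": "Michigan State University",
}

rows = heading_changes(old_nars, new_nars)
table = to_txt(rows)
print(table)
```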
66. Task #5: Automation
Use AutoIt to:
Link up various steps in the workflow
Automate searching against OCLC LCNAF SRU
Service by compiling and sending HTTP
requests
Execute various XSLTs in a predetermined
sequence
○ e.g. NAR comparison → Heading comparison
Read TXT files (LCCN list, heading conversion
table) created by XSLT processes
Run MarcEdit to overlay obsolete NARs
Execute “Global Update” process
68. Test Results
82,398 NARs tested
81,362 NARs needed to be overlaid*
4,584 headings became obsolete
10,900 bib records had at least one heading
flipped
* Many NARs exported from Millennium do not contain field 005 →
overlaying those will save comparison time down the road
69. Limitations
Identities broken out from undifferentiated
NARs can’t be detected
Partially taken care of by “New Headings Report”
Headings with diacritics
Code points & exact match in Global Update
Headings in Field 880
Slow export using MarcEdit
Data Exchange module
Slow “Global Update” process
Wrong indicators put in by AutoIt during Global
Update (though correct in conversion table)
“Java heap space” out of memory error
71. Reflections
Low tolerance to differences between specified
pattern and target
Case-sensitive
○ Normalization needed
Implication on processing time
○ RegEx helps specify a range of patterns
○ Exceptions: matches(), replace(), tokenize()
Data encoding
○ Diacritics, non-ASCII characters
Data consistency
○ Extra conditions/steps needed to account for exceptions
○ Legacy data, incorrect data
○ Pre-processing clean up vs. On-the-fly clean up
○ Familiarity with source data → normal patterns & exceptions
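The normalization point can be made concrete. The Python sketch below builds a comparison key that is insensitive to case, extra spacing, and composed versus decomposed diacritics. The choice of NFC and casefold is an assumption for illustration, not the method used in the talk, and the key is meant for matching only, not for output.

```python
import unicodedata


def match_key(heading):
    """Normalize a heading for comparison only (not for display)."""
    text = unicodedata.normalize("NFC", heading)  # unify diacritic encoding
    text = " ".join(text.split())                 # collapse extra whitespace
    return text.casefold()                        # ignore case differences


# The same name with precomposed and with combining diacritics.
composed = "Dvo\u0159\u00e1k, Anton\u00edn"
decomposed = "Dvor\u030ca\u0301k,  Antoni\u0301n"

raw_equal = composed == decomposed
key_equal = match_key(composed) == match_key(decomposed)
print(raw_equal, key_equal)
```

An exact string comparison fails on these two forms even though they display identically, which is the low tolerance the slide describes.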
72. Unique identifiers vital for matching
Full automation vs. Semi-automation
Integration with other scripting language for
full automation of ongoing workflow
Demanding on computing power
Multi-step matching
Processing multiple documents
Large XML files
○ <xsl:stream> in XSLT 3.0
Can’t create records out of thin air
73. Lucas Mak
Metadata & Catalog Librarian
Michigan State University Libraries
makw@mail.lib.msu.edu