Unleashing the Power of XSLT: Catalog Records in Batch
Lucas Mak, Michigan State University Libraries
Texas Library Association Annual Conference, April 24-27, 2013
Agenda
 Overview of XSLT & AutoIt
 Case studies
 Cataloging
○ Digitized monograph workflow
○ Spoken word recordings cataloging
 Catalog Maintenance
○ Multi-format records clean up project
○ NAR and Bib record update
 Reflections
Overview of XSLT
 XSLT (Extensible Stylesheet Language
Transformations)
 Within the family of XML
○ Case sensitive
○ Current version: 2.0 (3.0 in draft)
○ Unicode compliant
○ File extension: .xsl
○ Requires matching start and end tags
 “Transformation” means:
○ Manipulation of documents by creating a new document
based on the original document
○ Output can be in XML, HTML, XHTML, text etc.
 Data-driven execution
○ Code is executed when a certain piece of data is
encountered, which can lead to unexpected outcomes
 <template>
○ Contains set of instructions to be executed when a
template is called explicitly or invoked based on a
matching node
○ Modularity & Reusability
• Multiple templates in an XSLT
 Multiple XSLTs in a pipeline
 Multiple inputs and outputs
○ Comparing data from multiple inputs
• document()
• key()
 Common usages in library context
○ Web display
 e.g. converting EAD into HTML for display
○ Metadata crosswalking
 Data selection and manipulation
Overview of AutoIt
 http://www.autoitscript.com/site/autoit/
 Freeware
 Scripting language designed for
automating the Windows GUI and general
scripting
 Simulated keystrokes, mouse movement and
window/control manipulation
 Simple text manipulation (supports regular
expression), read/write text file
 Send HTTP request
 Open/run/close applications
 Automation
 To execute multiple XSLTs in a sequence by
Saxon-HE command line XSLT processor
 To link XSLT processes with other
applications, e.g. MarcEdit
 To read and compile TXT, XML, XSLT files
Case study 1
Background
 Digital & Multimedia Center
 Media library & Digitization service
 Chiefly text digitization
 Flatbed, overhead, planetary, slide, &
sheetfed scanners
 2.5 FTE staff, 21 students
 100k – 150k pages scanned & processed
per year
Old Cataloging Workflow for
Digitized Monographs
DMC
• Mounting
digital files on
web server
DMC
• Spreadsheet
of titles and
URLs to be
sent to
cataloging
(DataCat)
DataCat
• Search title in
catalog
• Manually derive
from print
• Insert URL
from
spreadsheet
Task Automation
 Extraction of print version MARC records
 List of bib record no.
 Utilizing AutoIT to build XML query against
catalog XML server
○ Turning III XML into MarcXML by XSLT
 Insertion of URLs to appropriate records
 Finding a match point
 Instead of recording “title” & “URL” pair,
recording “unique identifier” & “URL” pair
○ Unique identifier: bib record no.
New Cataloging Workflow
DMC
• Mounting digital
files on web
server
DMC
• Prepare a TXT file
with bib no. & URL
Metadata
Librarian
• Run AutoIt script
to extract MARC
records from
Millennium/Sierra
• Batch derive from
print and insert
URL by matching
bib no.
Design of XSLT
 Processing logic
 Derive electronic record from print record
 Insert URL into electronic record by
matching bib no. of the print record against
XML document with bib no./URL pairs
Conversion Table
 Structure of bib no./URL pair XML
<pair>
  <bibNumber>b55612367</bibNumber>
  <url>http://archive.lib.msu.edu/DMC/AREP/AREP1.pdf</url>
</pair>
<pair>
……
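A minimal Python sketch of how a TXT file of bib no./URL pairs might be turned into XML with the structure above. The original conversion was done with AutoIt; the tab delimiter, the `pairs` root element, and the function name are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def pairs_to_xml(lines):
    """Build an XML string from 'bibNumber<TAB>url' lines."""
    root = ET.Element("pairs")  # hypothetical root element name
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        bib_number, url = line.split("\t", 1)
        pair = ET.SubElement(root, "pair")
        ET.SubElement(pair, "bibNumber").text = bib_number.strip()
        ET.SubElement(pair, "url").text = url.strip()
    return ET.tostring(root, encoding="unicode")

sample_lines = ["b55612367\thttp://archive.lib.msu.edu/DMC/AREP/AREP1.pdf"]
xml_text = pairs_to_xml(sample_lines)
print(xml_text)
```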
Derive Template
 Provider-neutral record
 Copy data from print records without manipulation
○ e.g. 100, subject headings
○ Element: <xsl:copy-of>
 Hard-coding new data in
○ 006, 007, 040, 049, 588
○ Element: <xsl:text>
 Combine existing and new data to create new element
○ e.g. 008 (“o” into “form” byte), 090 (“Online” at the end of call #),
776 (main entry, title, imprint, 020, 010)
○ Element: <xsl:value-of>, <xsl:text>
○ Function: substring()
 Copy URL from the 2nd XML file
○ Look up the bib number as a key in the 2nd XML file
○ Element: <xsl:key>, <xsl:variable>
○ Function: document()
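The derive logic can be illustrated outside XSLT. The Python sketch below is a stand-in for the stylesheet, not the author's code: it copies a print MARCXML record, writes "o" into the 008 form byte, and adds the URL found by bib number. Placing the URL in 856 $u with indicators 4 and 0 is an assumption (the slides only say the URL is inserted), and all function names are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"

def q(tag):
    """Qualify a MARCXML element name with its namespace."""
    return f"{{{MARC_NS}}}{tag}"

def derive_electronic(print_record, url_by_bib, bib_number):
    """Return an electronic-version copy of a print MARCXML record."""
    record = copy.deepcopy(print_record)  # leave the print record untouched
    for cf in record.findall(q("controlfield")):
        # 008 byte 23 (form of item for books): "o" = online
        if cf.get("tag") == "008" and cf.text and len(cf.text) > 23:
            cf.text = cf.text[:23] + "o" + cf.text[24:]
    url = url_by_bib.get(bib_number)
    if url is not None:
        # Assumption: URL goes into 856 $u
        field = ET.SubElement(record, q("datafield"),
                              {"tag": "856", "ind1": "4", "ind2": "0"})
        ET.SubElement(field, q("subfield"), {"code": "u"}).text = url
    return record

# A 40-character sample 008 with a blank form byte
FIELD_008 = "850101s1984    miu" + " " * 17 + "eng d"
print_record = ET.fromstring(
    f'<record xmlns="{MARC_NS}">'
    f'<controlfield tag="008">{FIELD_008}</controlfield></record>'
)
urls = {"b55612367": "http://archive.lib.msu.edu/DMC/AREP/AREP1.pdf"}
derived = derive_electronic(print_record, urls, "b55612367")
new_008 = derived.find(q("controlfield")).text
new_url = derived.find(q("datafield")).find(q("subfield")).text
print(new_008[23], new_url)
```

In the XSLT version, the dictionary lookup corresponds to reading the second XML file with document() and indexing it with <xsl:key>.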
Workflow
Print
records
(MarcXML)
XSLT
PN e-monograph
records
Catalog
Bib no. &
URLs
Extraction Print
records
(III XML)
Format conversion
by XSLT
Converted into
XML by AutoIt
Implementation Issues
• Legacy data
• 440 vs. 490 & 830 pair
• Descriptive rules & ISBD punctuation
• Obsolete 1st/2nd indicators
Case study 2
Background
 Vincent Voice Library
 Over 40,000 hours of spoken word
recordings
• Open-reel tapes, cassettes, DAT tapes, digital
files
 Over 2TB of born-digital and digitized
recordings (WAV format)
○ Provide MP3 for public domain/copyright-cleared
items (e.g. pre-1923, presidential
speeches, MSU provenance, etc.)
Voice Library Cataloging
 MySQL database
 Digitization & inventory tracking
 Students create a database record, which
includes summary, date of utterance, speaker
names, recording source, format(s) available,
etc., for each digital file
 Library Catalog
 Item-level cataloging
○ Cataloging of analog items stopped in the 1990s
• Analog item records suppressed since then
○ Cataloging of digital items started in the 2000s
 Based on database records
Objective
 Automate cataloging of digital files
 Reformatted items
○ Derive from suppressed analog records if
available
○ Create brief electronic records, from SQL
database records, for items with no
suppressed records in catalog
 Born-digital items
○ Create brief electronic records, from SQL
database records
Tasks
• Extract XML records of digital items from
MySQL database
• Match XML records against existing
records of analog items to create
records for digital items
• Convert into .mrc file using MarcEdit for
loading into cataloging client
• Automate and link all of the above steps with an
AutoIt script
Task #1: XML records
Extraction
 PHP script written by a programmer
 Extract XML records for digital items ready
for cataloging (*status = cataloging)
 Each XML record includes:
○ Summary written by students doing digitization
○ Date of utterance
○ Recording source
○ Names of speakers
○ Database (DB) no. assigned to each digital file
 If reformatted, both DB no. and analog no. (M no. for
open-reel tapes, C no. for cassettes)
○ Running time (in seconds)
<vvl:Record>
  <vvl:id>2159</vvl:id>
  <vvl:vvl_number>01-0350-113</vvl:vvl_number>
  <vvl:copyright>Broadcast News</vvl:copyright>
  <vvl:main_speaker>Farrakhan, Louis</vvl:main_speaker>
  <vvl:additional_speakers/>
  <vvl:recording_source>CNN</vvl:recording_source>
  <vvl:summary>Minister Louis Farrakhan, Head of the Nation of Islam and organizer of
    the The Million Man March, concludes the gathering of social activists with a 2-1/2 hour
    speech. Held on and around the National Mall in Washington, D.C.</vvl:summary>
  <vvl:date_day>16</vvl:date_day>
  <vvl:date_month>October</vvl:date_month>
  <vvl:date_year>1995</vvl:date_year>
  <vvl:running_time>9067</vvl:running_time>
  <vvl:open-reel>
    <vvl:formatid>M5420 - M5421</vvl:formatid>
    <vvl:type>open-reel</vvl:type>
    <vvl:size>0</vvl:size>
  </vvl:open-reel>
  <vvl:wav>
    <vvl:formatid>DB2159</vvl:formatid>
    <vvl:type>wav</vvl:type>
    <vvl:size>870444032</vvl:size>
  </vvl:wav>
</vvl:Record>
Task #2: Records matching
 Match point
 Analog item no.
○ <formatid> in database records
○ Call no. in analog version MARC records
 Matching & deriving done by XSLT
 Matched: derive from analog version MARC
record
 No-Match: create brief MARC record from
database XML record
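The match/no-match decision might look like the following Python sketch (the real matching was done in XSLT). It assumes analog numbers in <formatid> look like "M5420" or "C0112", that a range such as "M5420 - M5421" is matched on its listed endpoints only, and that catalog call numbers are stored in the same form. The function names are hypothetical.

```python
import re

def analog_numbers(formatid):
    """Extract analog item numbers (M or C plus digits) from a <formatid>."""
    return re.findall(r"[MC]\d+", formatid)

def choose_template(formatids, catalog_call_numbers):
    """Return ('derive', matched_no) on a match, else ('create', None)."""
    for formatid in formatids:
        for number in analog_numbers(formatid):
            if number in catalog_call_numbers:
                return ("derive", number)
    return ("create", None)

# Illustrative call numbers standing in for the exported analog records
catalog = {"M5420", "C0112"}
matched = choose_template(["M5420 - M5421", "DB2159"], catalog)
unmatched = choose_template(["DB9999"], catalog)
print(matched, unmatched)
```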
 “Derive” Template
 Copy as is
○ e.g. 1XX, 7XX, 6XX, 518, etc.
 Hardcode constant data
○ e.g. 006, 007, 588, etc.
 Insert variable data
○ e.g. 776 (info from analog record), 099 (info
from database record), 033
 Copy with adjustments
○ e.g. 008 (form byte), 245 (GMD), 300
 “Create new” Template (i.e. no match)
 1XX, 7XX from <main_speaker>,
<additional_speakers>
 245 from first sentence of <summary>
 520 from <summary>
 033 from <date_day>, <date_month>, &
<date_year>
 518 from <recording_source> & date info
 099 from <formatid>
 Hardcode MARC leader, 006, 007, 008 (except
date(s)), etc.
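Two of the "Create new" mappings lend themselves to a short Python sketch: taking the first sentence of <summary> for 245, and building a yyyymmdd value for 033 from the three date elements. This is an illustration under assumptions, not the original XSLT; the sentence splitter is naive and would cut at abbreviations such as "Dr." inside the first sentence.

```python
import re

MONTHS = {name: number for number, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

def first_sentence(summary):
    """Return the text up to the first period, '!' or '?' followed by a space."""
    text = " ".join(summary.split())  # collapse line breaks and extra spaces
    match = re.search(r"(.+?[.!?])(\s|$)", text)
    return match.group(1) if match else text

def field_033(day, month, year):
    """Format the date of utterance as yyyymmdd."""
    return f"{int(year):04d}{MONTHS[month]:02d}{int(day):02d}"

summary = (
    "Minister Louis Farrakhan, Head of the Nation of Islam and organizer of\n"
    "the The Million Man March, concludes the gathering of social activists "
    "with a 2-1/2 hour\nspeech. Held on and around the National Mall in "
    "Washington, D.C."
)
title = first_sentence(summary)
date_033 = field_033("16", "October", "1995")
print(title)
print(date_033)
```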
Workflow
Digital
records
(MarcXML)
XSLT
Catalog
Extraction
by PHP
Analog
records
(MarcXML)
Format
conversion
by MarcEdit
SQL
Digital
records
(VVL XML)
Digital
records
(.mrc)
One-time
Extraction
Benefits
 Time saving
 Eliminate manual searching for cataloged
analog items in local catalog
 Eliminate manual copy and paste from
database to SkyRiver cataloging client
Limitations
 False no-match
 Occasional discrepancy in analog item no. in
database and catalog records
○ Typo
○ Digitization vs. Cataloging practice
 Heading updates
 Headings updated after analog records
exported from catalog
• Analog records are suppressed, so real-time
lookup through the XML server is not possible
Case study 3
Background
 Thesis and dissertation cataloging at
Michigan State University Libraries
 Current practice: Separate-record approach
 Pre-2007 practice:
○ Mulvered records: Print & Microform on the same
record
○ Either:
• Cataloged print on OCLC and added microform info
locally, so there was no record for microform in OCLC
• Cataloged print and microform separately on OCLC and
merged two records into one locally
○ 7387 titles with mulvered records
Summary of Main Characteristics
MARC Field   Characteristics
001          1 (print) or 2 (print, microform)
007          For microform
008          Form of item (byte 23): Blank
099          Call no. for print
245$h        [paper, microform]
533          Reproduction note for microform
952          $a: Item record no. for print & microform; $b: Barcode
Objective
 Un-mulvering
 One record for print, one record for microform
○ Record for print:
 Turn mulvered record into print record by removing info only
pertaining to microform (e.g. 533)
- Overlay the original record
- Delete item records pertaining to microform after overlay
○ Record for microform:
 Create new microform bib record by
- Transferring data only pertaining to microform to the new
record (microform 001, 007, 533)
- Copying, with/without modification, common data into the
new record (008, 245 etc.)
 Create new item record by copying info from original item
record for microform
 To process records by XSLT
Workflow
ILS Mulvered
Print
Microform
Overlay existing
Export as new
Extract
Fix up
Create
XSLT
Roadblocks
 Loss of items over the years
 46 titles with print only
 11 titles with microform only
 Need to determine which title has what
format(s)
• Multiple microform formats
• Some titles have both microfiche and
microfilm formats, so they need to be split into 3
records (1 print, 1 microfiche, 1 microfilm)
Loss of items
 Determine available format by location
code in item record
• “mc” = Microfiche/Microfilm
• “th” or other branch locations = Print
• Most reliable, since 049 (location in bib) did
not get updated when a format was lost/added
 Bib records extracted from ILS do not
contain item location
=952 $a.i69490405$b31293027362833
=952 $a.i69490417
 Export item info separately from ILS
 Merge item info into extracted bib by
matching up item record no.
• Based on item location info in MARC 952,
determine which record(s) to generate
“mc” & print location(s)
• Print record (overlay)
• Microform record (export as new)
“mc” & NO print locations
• Microform record (overlay)
Print location(s) & NO “mc”
• Print record (overlay)
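The location-based decision above can be written as a small function. This Python sketch is only an illustration of the three cases; in the project the logic lived in an XSLT template. It assumes any location code other than "mc" counts as a print location, as the slides describe.

```python
def records_to_generate(item_locations):
    """Map a title's item location codes to the record actions to take."""
    has_microform = "mc" in item_locations
    has_print = any(code != "mc" for code in item_locations)
    if has_microform and has_print:
        return ["print (overlay)", "microform (export as new)"]
    if has_microform:
        return ["microform (overlay)"]
    if has_print:
        return ["print (overlay)"]
    return []  # no items left: nothing to generate

both = records_to_generate(["th", "mc"])
microform_only = records_to_generate(["mc"])
print_only = records_to_generate(["th"])
print(both, microform_only, print_only)
```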
Revised Workflow
ILS
Mulvered
Print
Microform
Overlay existing
Export as new
Extract
Fix up
Create
XSLT 2
Item
info
XSLT 1
Extract
Multiple microform formats
 Some titles have microfiche and microfilm
on the same record
 Similarity
 Both microfiche and microfilm use “mc” as item
location
 Differences
 Call# system
○ Microfiche: Goetz, S - 3 fiche
○ Microfilm: 24354 THS Microfilm
 Two MARC 533 (reproduction note)
○ One for microfiche, one for microfilm
○ Microfiche
 =533 $aMicrofiche.$bAnn Arbor, Mich.
:$cUniversity Microfilms,$d1979.$e4 microfiche ; 11
X 15cm.
○ Microfilm
 =533 $aMicrofilm.$bAnn Arbor, Mich. :$cUniversity
Microfilms,$d 1973.$e1 microfilm reel ; 35 mm.
 Solutions
○ Based on the number of 533 fields, determine how
many microform record(s) to generate
○ Use MARC 533$a to pull the appropriate call#
from MARC 952
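One way to pair each 533 with its call number is sketched below in Python. It assumes, based only on the two sample call numbers shown, that microfiche call numbers contain "fiche" and microfilm call numbers contain "Microfilm"; real data may need a more careful rule, and the function names are hypothetical.

```python
def microform_kind(field_533a):
    """Normalize a 533 $a value such as 'Microfiche.' to 'microfiche'."""
    return field_533a.strip().rstrip(".").lower()

def pick_call_number(field_533a, call_numbers):
    """Return the first call number whose wording matches the 533 $a format."""
    keyword = "fiche" if microform_kind(field_533a) == "microfiche" else "microfilm"
    for call_number in call_numbers:
        if keyword in call_number.lower():
            return call_number
    return None

call_numbers = ["Goetz, S - 3 fiche", "24354 THS Microfilm"]
fields_533a = ["Microfiche.", "Microfilm."]
# One microform record per 533 field
assignments = {microform_kind(a): pick_call_number(a, call_numbers)
               for a in fields_533a}
print(assignments)
```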
Design of XSLT
 Processing logic
 Both print and microform available
○ Go through the record twice*
 1st pass for print record
 2nd pass for microform record*
* When both microfiche & microfilm are available: 3
passes (print, fiche, film)
 One format available
○ Go through the record once (print/ microform*)
* When both microfiche and microfilm are available: 2
passes (one for microfiche, one for microfilm)
 5 templates
○ Format determination template
○ Print data template
○ Microform templates
 Microform only (949 overlay command)
 Microform with print (949 item generation command)
○ Common data template
 Data common to both formats
 Reusable
 Format determination template
1. Parse item location info in MARC 952 as a
variable
2. Determine which location(s), aka format(s),
are available
3. Invoke a different combination of the “Print data”,
“microform only”, and “microform (with
print)” templates accordingly
 Print data template
 To copy print 001, 049, 245 with adjustment
 To copy 008, 099 as is
 Generate 949 overlay command
 Invoke “common data” template
 Microform data templates
 To copy microform 001, 008, 245 with adjustment
○ 008: Add “a” (microfilm) or “b” (microfiche) in Form byte
(byte 23) based on 533$a
○ 245$h: replace “[paper, microform]” with “[microform]”
 To copy 007, 533 as is
 To generate 049, 099
○ 099: Copy call# from item info stored in 952
 949 command
○ Overlay command if microform is the only available
format
○ Item record generation if print is also available
 Invoke “common data” template
 Common data template
 Copy fields not touched by other templates
mostly without adjustment
e.g. leader, subject headings (6XX), thesis
note (502), imprint date (260$c), physical
dimension (300), etc.
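The two microform adjustments described above (the 008 form byte and 245$h) are simple string edits. The Python sketch below shows them under the assumption that the 008 is a 40-character string and that 533$a reads "Microfilm." or "Microfiche."; the function names are hypothetical and the actual work was done in XSLT.

```python
FORM_CODES = {"microfilm": "a", "microfiche": "b"}  # 008 byte 23 values

def microform_008(field_008, field_533a):
    """Set the 008 form byte (position 23) from the 533 $a format."""
    kind = field_533a.strip().rstrip(".").lower()
    return field_008[:23] + FORM_CODES[kind] + field_008[24:]

def microform_245h(subfield_h):
    """Turn the mulvered GMD into a microform-only GMD."""
    return subfield_h.replace("[paper, microform]", "[microform]")

# A 40-character sample 008 with a blank form byte
sample_008 = "790101s1979    miu" + " " * 17 + "eng d"
fiche_008 = microform_008(sample_008, "Microfiche.")
film_008 = microform_008(sample_008, "Microfilm.")
gmd = microform_245h("[paper, microform] /")
print(fiche_008[23], film_008[23], gmd)
```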
Results
 Total mulvered: 7387 records
 Total un-mulvered: 14722 records
 Microform: 7346 records
 Print: 7376 records
Case study 4
Acronyms
 NAR (Name Authority Record)
 LC/NACO NAF (Name Authority File)
 BFM (Bibliographic File Maintenance)
 Heading/Authorized access point updates in
bib records
 SRU (Search/Retrieval via URL)
 HTTP request
LC/NACO NAF
 Dynamic file
 Contributions from over 700 NACO
participants
• Updated every day with new and
changed NARs from NACO nodes
 Full nodes
○ British Library, OCLC, SkyRiver
 Contribution-only node
○ National Library of Medicine
LC/NACO NAF
Maintenance
LC
Database
Distribution to
BL, OCLC,
SkyRiver
Authority Control at MSU
 In-house
 NACO institution
 Database maintenance
 Post-cataloging Authority Control
 New Headings Report
○ Download NARs from SkyRiver
• Updates to NARs not necessarily caught
○ 1XX (no item cataloged under the changed 1XX,
so it is not in the New Headings Report)
○ Elements other than 1XX (e.g. 4XX, 670)
LC/NACO NAF RDA
Transition
 PCC Day 1 for RDA NAR: Mar. 31, 2013
 PCC Task Group on AACR2 & RDA
Acceptable Heading Categories (Aug 2011)
 225,000 NARs with 1XX not usable in RDA bib
records
 172,000 NARs with 1XX usable in RDA bib
records after batch manipulation by software
 7,631,00 NARs with 1XX usable in RDA bib
records as they are and can be recoded as RDA
 Phased reissuance of NARs
 Phase 1
○ Scope
 NARs with characteristics known to be at variance with RDA practice
 Not candidates for any of the mechanical changes to be made during
phase 2
○ Adding a 667 note “THIS 1XX FIELD CANNOT BE USED UNDER
RDA UNTIL THIS RECORD HAS BEEN REVIEWED AND/OR
UPDATED”
 Completed Aug. 20, 2012 (436,943 records processed)
 Phase 2
○ Programmatic changes to 1XX headings that are not acceptable
under RDA (e.g., changes to Bible headings, spelling out Dept. and
months, etc., abbreviations in the subfield $d for personal names)
○ Completed March 27, 2013 (371,942 records changed)
 Updates of NARs by NACO institutions
 Reviewing, upgrading, and recoding Phase 1
records to RDA
 Upgrading and recoding non-Phase 1 records to
RDA
 Adding any of the 17 new MARC fields (e.g.
046, 372, etc.)
 Routine NAR maintenance
○ PCC post-RDA test guidelines “strongly
encourage” evaluating and recoding the
“RDA-acceptable AACR2 NARs” to RDA whenever
possible
Objectives
 To catch changes to NARs
 Changes in 1XX
 Addition, deletion, or updates of elements
other than 1XX
 To perform related BFM if 1XX in a NAR
is changed
Tasks
 To download NARs one-by-one/in bulk
 To detect updates to NARs already
existing in Millennium
 To overlay existing NARs with updated
ones
• To update headings in bib records if 1XX
in NAR is updated
 To automate and link up the above tasks
Task #1: Download NARs
 OCLC LCNAF SRU Service
 Pros
○ Multiple indexes (LCCN, names, dates, etc.)
○ Available in multiple schema including MARCXML
○ One-by-one or bulk download*
○ SRU-based service (HTTP request)
○ FREE!!
 Cons
○ Updated every Monday night
○ OAI-PMH service is not available though there is an
index for OAI identifier
○ Bulk download – by search term (e.g. after certain date)
Task #1: Download NARs
(cont’d)
 Implementation
 Search LCCNs one-by-one by AutoIt script
○ Around 10 records/sec. retrieved
 Download XML files into one folder (files
named by LCCN)
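An LCCN lookup over SRU amounts to an HTTP GET with a handful of standard parameters. The Python sketch below only builds the request URL and the output file name; it does not send anything. The base URL and the `local.LCCN` index name are placeholders, not the actual OCLC endpoint or index, and the LCCN is just a sample value.

```python
from urllib.parse import urlencode

BASE_URL = "https://sru.example.org/lcnaf"  # placeholder, not the real service

def sru_url(lccn, index="local.LCCN", schema="marcxml"):
    """Build an SRU searchRetrieve URL for one LCCN (index name is assumed)."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": f'{index} = "{lccn}"',
        "recordSchema": schema,
        "maximumRecords": "1",
    }
    return BASE_URL + "?" + urlencode(params)

def output_filename(lccn):
    """Name the downloaded file after the LCCN, without spaces."""
    return lccn.replace(" ", "") + ".xml"

url = sru_url("n79021164")
filename = output_filename("n 79021164")
print(url)
print(filename)
```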
Task #2: NAR Update Detection
 To compare NARs from Millennium and NARs from
LC/NACO NAF by XSLT
• MARC 005 (timestamp)
• If the timestamp is more current on the NAR from NAF, overlay the
NAR in Millennium
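Because MARC 005 is a fixed-width timestamp (yyyymmddhhmmss.f), two values can be compared as plain strings. The Python sketch below illustrates the comparison; treating a local record with no 005 as needing overlay is an assumption drawn from the test-results note about records lacking that field, and the sample timestamps are made up.

```python
def needs_overlay(local_005, naf_005):
    """Return True when the NAF copy should overlay the local NAR."""
    if not local_005:
        return True   # assumption: no local timestamp means overlay
    if not naf_005:
        return False  # nothing newer to compare against
    return naf_005 > local_005  # fixed-width strings sort chronologically

newer = needs_overlay("20110812073045.0", "20130327112233.0")
same = needs_overlay("20130327112233.0", "20130327112233.0")
missing = needs_overlay(None, "20130327112233.0")
print(newer, same, missing)
```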
Task #3: Export/Overlay of
NARs
 MarcEdit
 Export updated NARs into Millennium
 Through TCP/IP (Host address, Port, .mrc
file)
○ Same as export from OCLC Connexion or
SkyRiver
 One-by-one (though .mrc file can contain
multiple NARs)
Task #4: Updates of Bib
Headings
 XSLT
 To detect changes in 1XX between old and
new NARs
 To build heading conversion table (a TXT
file) when 1XX is changed
 AutoIt
 Automate bib heading updates by “Global
Update” module in Millennium
○ Read old and new headings from the TXT file
and fill out info required in “Global Update”
process
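A heading conversion table is just a list of old/new 1XX pairs for the NARs whose heading changed. The Python sketch below shows one possible shape, assuming a tab-separated TXT layout and dictionaries keyed by LCCN; the LCCNs are invented placeholders and the headings are illustrative, and in the project the table was produced by XSLT.

```python
def heading_changes(old_headings, new_headings):
    """Yield (old, new) 1XX pairs for LCCNs whose heading changed."""
    for lccn, old in sorted(old_headings.items()):
        new = new_headings.get(lccn)
        if new is not None and new != old:
            yield old, new

def conversion_table(old_headings, new_headings):
    """Return the table as tab-separated 'old<TAB>new' lines."""
    return "\n".join(f"{old}\t{new}"
                     for old, new in heading_changes(old_headings, new_headings))

# Placeholder LCCNs; headings are illustrative only
old = {"n0000001": "United States. Dept. of State",
       "n0000002": "Michigan State University. Libraries"}
new = {"n0000001": "United States. Department of State",
       "n0000002": "Michigan State University. Libraries"}
table = conversion_table(old, new)
print(table)
```

A script driving "Global Update" could then read one line at a time and split it on the tab to get the old and new headings.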
Task #5: Automation
 Use AutoIt to:
 Link up various steps in the workflow
 Automate searching against OCLC LCNAF SRU
Service by compiling and sending HTTP
requests
 Execute various XSLTs in a predetermined
sequence
○ e.g. NAR comparison, then heading comparison
 Read TXT files (LCCN list, heading conversion
table) created by XSLT processes
 Run MarcEdit to overlay obsolete NARs
 Execute “Global Update” process
Basic Workflow
Millennium
Millennium
NARs
Extract by
Create
Lists
LCCNs
Extract
by XSLT
Search by AutoIt
LC/NACO
NARs
Retrieve
Updated
NARs
Compare by XSLT
Overlay
by
MarcEdit
Updated
Headings
Global Update
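The XSLT steps in this workflow were chained by an AutoIt script calling the Saxon-HE command-line processor. The Python sketch below shows the same idea in a different language: it builds one Saxon command per stylesheet and feeds each step's output to the next. The jar name, file names, and the assumption that every step consumes the previous step's output are all hypothetical.

```python
import subprocess

def saxon_command(source, stylesheet, output, jar="saxon-he.jar"):
    """Build a Saxon command line using the -s, -xsl and -o options."""
    return ["java", "-jar", jar,
            f"-s:{source}", f"-xsl:{stylesheet}", f"-o:{output}"]

def pipeline_commands(first_input, steps):
    """Chain steps so each stylesheet reads the previous step's output."""
    commands = []
    source = first_input
    for stylesheet, output in steps:
        commands.append(saxon_command(source, stylesheet, output))
        source = output
    return commands

def run_pipeline(first_input, steps):
    """Run the steps in order, stopping at the first failure."""
    for command in pipeline_commands(first_input, steps):
        subprocess.run(command, check=True)

# Hypothetical file names; printed here rather than executed
steps = [("compare_nars.xsl", "updated_nars.xml"),
         ("compare_headings.xsl", "heading_table.txt")]
commands = pipeline_commands("nars.xml", steps)
for command in commands:
    print(" ".join(command))
```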
Test Results
 82,398 NARs tested
 81,362 NARs needed to be overlaid*
 4,584 headings became obsolete
 10,900 bib records had at least one heading
flipped
* Many NARs exported from Millennium do not contain field 005;
overlaying those will save comparison time down the road
Limitations
 Identities broken out from undifferentiated
NARs can’t be detected
 Partially taken care of by “New Headings Report”
 Headings with diacritics
 Code points & exact match in Global Update
 Headings in Field 880
 Slow export using MarcEdit
 Data Exchange module
 Slow “Global Update” process
 Wrong indicators put in by AutoIt during Global
Update (though correct in conversion table)
 “Java heap space” out of memory error
Reflections
 Low tolerance to differences between specified
pattern and target
 Case-sensitive
○ Normalization needed
 Implication on processing time
○ RegEx helps specify a range of patterns
○ Exceptions: matches(), replace(), tokenize()
 Data encoding
○ Diacritics, non-ASCII characters
 Data consistency
○ Extra conditions/steps needed to account for exceptions
○ Legacy data, incorrect data
○ Pre-processing clean up vs. On-the-fly clean up
○ Familiarity with source data: normal patterns & exceptions
 Unique identifiers vital for matching
 Full automation vs. Semi-automation
 Integration with other scripting language for
full automation of ongoing workflow
• Demanding on computing power
 Multi-step matching
 Processing multiple documents
 Large XML files
○ <xsl:stream> in XSLT 3.0
 Can’t create records out of thin air
Lucas Mak
Metadata & Catalog Librarian
Michigan State University Libraries
makw@mail.lib.msu.edu

Weitere ähnliche Inhalte

Andere mochten auch

RESTful services
RESTful servicesRESTful services
RESTful services
gouthamrv
 
Data Power Architectural Patterns - Jagadish Vemugunta
Data Power Architectural Patterns - Jagadish VemuguntaData Power Architectural Patterns - Jagadish Vemugunta
Data Power Architectural Patterns - Jagadish Vemugunta
floridawusergroup
 

Andere mochten auch (20)

XML - Displaying Data ith XSLT
XML - Displaying Data ith XSLTXML - Displaying Data ith XSLT
XML - Displaying Data ith XSLT
 
Xml part5
Xml part5Xml part5
Xml part5
 
Xml part4
Xml part4Xml part4
Xml part4
 
Interoperable Web Services with JAX-WS
Interoperable Web Services with JAX-WSInteroperable Web Services with JAX-WS
Interoperable Web Services with JAX-WS
 
SOA Governance and WebSphere Service Registry and Repository
SOA Governance and WebSphere Service Registry and RepositorySOA Governance and WebSphere Service Registry and Repository
SOA Governance and WebSphere Service Registry and Repository
 
Open Id, O Auth And Webservices
Open Id, O Auth And WebservicesOpen Id, O Auth And Webservices
Open Id, O Auth And Webservices
 
XSLT for Web Developers
XSLT for Web DevelopersXSLT for Web Developers
XSLT for Web Developers
 
Web Services
Web ServicesWeb Services
Web Services
 
Web services
Web servicesWeb services
Web services
 
WebService-Java
WebService-JavaWebService-Java
WebService-Java
 
CTDA Workshop on XSL
CTDA Workshop on XSLCTDA Workshop on XSL
CTDA Workshop on XSL
 
Siebel Web Service
Siebel Web ServiceSiebel Web Service
Siebel Web Service
 
RESTful services
RESTful servicesRESTful services
RESTful services
 
Java web services using JAX-WS
Java web services using JAX-WSJava web services using JAX-WS
Java web services using JAX-WS
 
XSLT
XSLTXSLT
XSLT
 
OAuth 2.0 with IBM WebSphere DataPower
OAuth 2.0 with IBM WebSphere DataPowerOAuth 2.0 with IBM WebSphere DataPower
OAuth 2.0 with IBM WebSphere DataPower
 
SOAP-based Web Services
SOAP-based Web ServicesSOAP-based Web Services
SOAP-based Web Services
 
Intorduction to Datapower
Intorduction to DatapowerIntorduction to Datapower
Intorduction to Datapower
 
Data Power Architectural Patterns - Jagadish Vemugunta
Data Power Architectural Patterns - Jagadish VemuguntaData Power Architectural Patterns - Jagadish Vemugunta
Data Power Architectural Patterns - Jagadish Vemugunta
 
Writing simple web services in java using eclipse editor
Writing simple web services in java using eclipse editorWriting simple web services in java using eclipse editor
Writing simple web services in java using eclipse editor
 

Ähnlich wie Unleashing the Power of XSLT: Catalog Records in Batch

The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks
 
Scalable crawling with Kafka, scrapy and spark - November 2021
Scalable crawling with Kafka, scrapy and spark - November 2021Scalable crawling with Kafka, scrapy and spark - November 2021
Scalable crawling with Kafka, scrapy and spark - November 2021
Max Lapan
 

Ähnlich wie Unleashing the Power of XSLT: Catalog Records in Batch (20)

Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
 
Environment Canada's Data Management Service
Environment Canada's Data Management ServiceEnvironment Canada's Data Management Service
Environment Canada's Data Management Service
 
Xml and multimedia database
Xml and multimedia databaseXml and multimedia database
Xml and multimedia database
 
Web services Overview in depth
Web services Overview in depthWeb services Overview in depth
Web services Overview in depth
 
On the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream ProcessingOn the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream Processing
 
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
 
Change Tracking in Knowledge Organization Systems with skos-history
Change Tracking in Knowledge Organization Systems with skos-historyChange Tracking in Knowledge Organization Systems with skos-history
Change Tracking in Knowledge Organization Systems with skos-history
 
Legislative data portals and linked data quality
Legislative data portals and linked data qualityLegislative data portals and linked data quality
Legislative data portals and linked data quality
 
Patterns of Streaming Applications
Patterns of Streaming ApplicationsPatterns of Streaming Applications
Patterns of Streaming Applications
 
Managing Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDBManaging Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDB
 
Batchloading Presentation
Batchloading PresentationBatchloading Presentation
Batchloading Presentation
 
IWMW 1998: Deploying new web technologies
IWMW 1998: Deploying new web technologiesIWMW 1998: Deploying new web technologies
IWMW 1998: Deploying new web technologies
 
Managing data and operation distribution in MongoDB
Managing data and operation distribution in MongoDBManaging data and operation distribution in MongoDB
Managing data and operation distribution in MongoDB
 
Legislation.gov.uk
Legislation.gov.ukLegislation.gov.uk
Legislation.gov.uk
 
Elk presentation 2#3
Elk presentation 2#3Elk presentation 2#3
Elk presentation 2#3
 
New Features in Apache Pinot
New Features in Apache PinotNew Features in Apache Pinot
New Features in Apache Pinot
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Scalable crawling with Kafka, scrapy and spark - November 2021
Scalable crawling with Kafka, scrapy and spark - November 2021Scalable crawling with Kafka, scrapy and spark - November 2021
Scalable crawling with Kafka, scrapy and spark - November 2021
 
The need of Interoperability in Office and GIS formats
The need of Interoperability in Office and GIS formatsThe need of Interoperability in Office and GIS formats
The need of Interoperability in Office and GIS formats
 
XML, XML Databases and MPEG-7
XML, XML Databases and MPEG-7XML, XML Databases and MPEG-7
XML, XML Databases and MPEG-7
 

Kürzlich hochgeladen

Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
QucHHunhnh
 

Kürzlich hochgeladen (20)

Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 

Unleashing the Power of XSLT: Catalog Records in Batch

  • 1. Catalog Records in Batch Lucas Mak, Michigan State University Libraries Texas Library Association Annual Conference, April 24-27, 2013
  • 2. Agenda  Overview of XSLT & AutoIt  Case studies  Cataloging ○ Digitized monograph workflow ○ Spoken word recordings cataloging  Catalog Maintenance ○ Multi-format records clean up project ○ NAR and Bib record update  Reflections
  • 3. Overview of XSLT  XSLT (Extensible Stylesheet Language Transformations)  Within the family of XML ○ Case sensitive ○ Current version: 2.0 (3.0 in draft) ○ Unicode compliant ○ File extension: .xsl ○ Requires matching start and end tags  “Transformation” means: ○ Manipulation of documents by creating a new document based on the original document ○ Output can be in XML, HTML, XHTML, text etc.  Data-driven execution ○ Code is executed when a certain piece of data is encountered  unexpected outcomes
  • 4.  <template> ○ Contains set of instructions to be executed when a template is called explicitly or invoked based on a matching node ○ Modularity & Reusability  Multiple templates in an XSLT  Multiple XSLTs in a pipeline  Multiple inputs and outputs ○ Comparing data from multiple inputs  document ( )  key ( )  Common usages in library context ○ Web display  e.g. converting EAD into HTML for display ○ Metadata crosswalking  Data selection and manipulation
  • 5. Overview of AutoIt  http://www.autoitscript.com/site/autoit/  Freeware  Scripting language designed for automating the Windows GUI and general scripting  Simulated keystrokes, mouse movement and window/control manipulation  Simple text manipulation (supports regular expression), read/write text file  Send HTTP request  Open/run/close applications
  • 6.  Automation  To execute multiple XSLTs in a sequence by Saxon-HE command line XSLT processor  To link XSLT processes with other applications, e.g. MarcEdit  To read and compile TXT, XML, XSLT files
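The chaining of several XSLTs described on this slide might be sketched as follows. The slides use AutoIt for this step; the Python below is only a stand-in. The jar name `saxon-he.jar`, the `work` directory, and the file names are placeholders, while `-s:`, `-xsl:` and `-o:` are Saxon's command-line options for source, stylesheet, and output.

```python
def build_pipeline(source, stylesheets, workdir="work", jar="saxon-he.jar"):
    """Return one Saxon command per stylesheet; each step reads the previous output."""
    commands = []
    current = source
    for step, xsl in enumerate(stylesheets, start=1):
        output = f"{workdir}/step{step}.xml"  # placeholder naming scheme
        commands.append(
            ["java", "-jar", jar, f"-s:{current}", f"-xsl:{xsl}", f"-o:{output}"]
        )
        current = output  # the next stylesheet consumes this file
    return commands
```

Each returned command could then be handed to `subprocess.run(cmd, check=True)` so that the sequence stops at the first failing transformation.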
  • 8. Background  Digital & Multimedia Center  Media library & Digitization service  Chiefly text digitization  Flatbed, overhead, planetary, slide, & sheetfed scanners  2.5 FTE staff, 21 students  100k – 150k pages scanned & processed per year
  • 9. Old Cataloging Workflow for Digitized Monographs DMC • Mounting digital files on web server DMC • Spreadsheet of titles and URLs to be sent to cataloging (DataCat) DataCat • Search title in catalog • Manually derive from print • Insert URL from spreadsheet
  • 10. Task Automation  Extraction of print version MARC records  List of bib record no.  Utilizing AutoIt to build XML query against catalog XML server ○ Turning III XML into MarcXML by XSLT  Insertion of URLs to appropriate records  Finding a match point  Instead of recording “title” & “URL” pair, recording “unique identifier” & “URL” pair ○ Unique identifier  bib record no.
  • 11. New Cataloging Workflow DMC • Mounting digital files on web server DMC • Prepare a TXT file with bib no. & URL Metadata Librarian • Run AutoIt script to extract MARC records from Millennium/Sierra • Batch derive from print and insert URL by matching bib no.
  • 12. Design of XSLT  Processing logic  Derive electronic record from print record  Insert URL into electronic record by matching bib no. of the print record against XML document with bib no./URL pairs
  • 13. Conversion Table  Structure of bib no./URL pair XML <pair> <bibNumber>b55612367</bibNumber> <url>http://archive.lib.msu.edu/DMC/AREP/AREP1.pdf</url> </pair> <pair> ……
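A minimal sketch of how the TXT file of bib numbers and URLs could be turned into the pair XML shown on this slide. The tab-delimited input layout and the `pairs` root element name are assumptions; only `<pair>`, `<bibNumber>` and `<url>` come from the slide.

```python
import xml.etree.ElementTree as ET

def build_pairs_xml(lines):
    """Turn 'bibNumber<TAB>url' lines into a document of <pair> elements."""
    root = ET.Element("pairs")  # root element name is an assumption
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        bib_number, url = line.split("\t", 1)
        pair = ET.SubElement(root, "pair")
        ET.SubElement(pair, "bibNumber").text = bib_number.strip()
        ET.SubElement(pair, "url").text = url.strip()
    return ET.tostring(root, encoding="unicode")
```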
  • 14. Derive Template  Provider-neutral record  Copy data from print records without manipulation ○ e.g. 100, subject headings ○ Element: <xsl:copy-of>  Hard-coding new data in ○ 006, 007, 040, 049, 588 ○ Element: <xsl:text>  Combine existing and new data to create new element ○ e.g. 008 (“o” into “form” byte), 090 (“Online” at the end of call #), 776 (main entry, title, imprint, 020, 010) ○ Element: <xsl:value-of>, <xsl:text> ○ Function: substring()  Copy URL from the 2nd XML file ○ Key bib number into 2nd XML file ○ Element: <xsl:key>, <xsl:variable> ○ Function: document()
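The "combine existing and new data" and "copy URL" steps of the derive template could look roughly like this outside XSLT. This is a Python stand-in, not the presenter's stylesheet. Placing the URL in an 856 field with indicators 4 and 0 is an assumption, since the slides only say the URL is inserted.

```python
import copy
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"

def derive_electronic(print_record, url):
    """Return a copy of a MARCXML <record> with 008/23 set to 'o' and a URL field added."""
    record = copy.deepcopy(print_record)  # leave the print record untouched
    for field in record.findall(f"{{{MARC_NS}}}controlfield"):
        if field.get("tag") == "008" and field.text and len(field.text) > 23:
            # Form of item is byte 23; 'o' marks an online resource.
            field.text = field.text[:23] + "o" + field.text[24:]
    # Assumption: the URL goes into 856 $u with indicators 4 and 0.
    link = ET.SubElement(record, f"{{{MARC_NS}}}datafield",
                         {"tag": "856", "ind1": "4", "ind2": "0"})
    ET.SubElement(link, f"{{{MARC_NS}}}subfield", {"code": "u"}).text = url
    return record
```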
  • 15. Workflow (diagram labels): Catalog; Extraction; Print records (III XML); Format conversion by XSLT; Print records (MarcXML); Bib no. & URLs; Converted into XML by AutoIt; XSLT; PN e-monograph records
  • 16. Implementation Issues  Legacy data  440 vs 490 & 830 pair  Descriptive rules & ISBD punctuation  Obsolete 1st/2nd indicators
  • 18. Background  Vincent Voice Library  Over 40,000 hours of spoken word recordings  Open-reel tapes, cassettes, DAT tapes, digital files  Over 2TB of born-digital and digitized recordings (WAV format) ○ Provide MP3 for public domain/copyright-cleared items (e.g. pre-1923, presidential speeches, MSU provenance, etc.)
  • 19. Voice Library Cataloging  MySQL database  Digitization & inventory tracking  Students create a database record, which includes summary, date of utterance, speaker names, recording source, format(s) available, etc., for each digital file  Library Catalog  Item-level cataloging ○ Cataloging of analog items stopped in 1990s  Analog item records suppressed since then ○ Cataloging of digital items started in 2000s  Based on database records
  • 20. Objective  Automate cataloging of digital files  Reformatted items ○ Derive from suppressed analog records if available ○ Create brief electronic records, from SQL database records, for items with no suppressed records in catalog  Born-digital items ○ Create brief electronic records, from SQL database records
  • 21. Tasks  Extract XML records of digital items from MySQL database  Match XML records against existing records of analog items to create records for digital items  Convert into .mrc file using MarcEdit for loading into cataloging client  Automate and link all above steps by an AutoIt script
  • 22. Task #1: XML records Extraction  PHP script written by a programmer  Extract XML records for digital items ready for cataloging (*status = cataloging)  Each XML record includes: ○ Summary written by students doing digitization ○ Date of utterance ○ Recording source ○ Names of speakers ○ Database (DB) no. assigned to each digital file  If reformatted, both DB no. and analog no. (M no. for open-reel tapes, C no. for cassettes) ○ Running time (in seconds)
  • 23. <vvl:Record> <vvl:id>2159</vvl:id> <vvl:vvl_number>01-0350-113</vvl:vvl_number> <vvl:copyright>Broadcast News</vvl:copyright> <vvl:main_speaker>Farrakhan, Louis</vvl:main_speaker> <vvl:additional_speakers/> <vvl:recording_source>CNN</vvl:recording_source> <vvl:summary>Minister Louis Farrakhan, Head of the Nation of Islam and organizer of the The Million Man March, concludes the gathering of social activists with a 2-1/2 hour speech. Held on and around the National Mall in Washington, D.C.</vvl:summary> <vvl:date_day>16</vvl:date_day> <vvl:date_month>October</vvl:date_month> <vvl:date_year>1995</vvl:date_year> <vvl:running_time>9067</vvl:running_time> <vvl:open-reel> <vvl:formatid>M5420 - M5421</vvl:formatid> <vvl:type>open-reel</vvl:type> <vvl:size>0</vvl:size> </vvl:open-reel> <vvl:wav> <vvl:formatid>DB2159</vvl:formatid> <vvl:type>wav</vvl:type> <vvl:size>870444032</vvl:size> </vvl:wav> </vvl:Record>
  • 24. Task #2: Records matching  Match point  Analog item no. ○ <formatid> in database records ○ Call no. in analog version MARC records  Matching & deriving done by XSLT  Matched: derive from analog version MARC record  No-Match: create brief MARC record from database XML record
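The match/no-match split on this slide can be illustrated with a small lookup. This is a sketch only: it assumes analog numbers look like `M5420` or `C123`, as in the sample record, and that the catalog call number carries the same string. The dictionary shapes are hypothetical.

```python
import re

def analog_numbers(formatid):
    """Extract analog item numbers such as 'M5420' or 'C123' from a <formatid> value."""
    return re.findall(r"[MC]\d+", formatid.upper())

def match_records(db_formatids, catalog_call_numbers):
    """Split database records into matched and unmatched by analog item number.

    db_formatids: {database id: formatid string}
    catalog_call_numbers: {call number: catalog record id}
    """
    matched, unmatched = {}, []
    for db_id, formatid in db_formatids.items():
        hits = [catalog_call_numbers[n] for n in analog_numbers(formatid)
                if n in catalog_call_numbers]
        if hits:
            matched[db_id] = hits      # derive from the analog record(s)
        else:
            unmatched.append(db_id)    # create a brief record instead
    return matched, unmatched
```

An exact-string match like this is what produces the "false no-match" cases when the number is typed differently in the database and the catalog.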
  • 25.  “Derive” Template  Copy as is ○ e.g. 1XX, 7XX, 6XX, 518, etc.  Hardcode constant data ○ e.g. 006, 007, 588, etc.  Insert variable data ○ e.g. 776 (info from analog record), 099 (info from database record), 033  Copy with adjustments ○ e.g. 008 (form byte), 245 (GMD), 300
  • 26.  “Create new” Template (i.e. no match)  1XX, 7XX from <main_speaker>, <additional_speakers>  245 from first sentence of <summary>  520 from <summary>  033 from <date_day>, <date_month>, & <date_year>  518 from <recording_source> & date info  099 from <formatid>  Hardcode MARC leader, 006, 007, 008 (except date(s)), etc.
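The field mapping of the "create new" template might be sketched like this. The input is a plain dictionary keyed by the database element names. The first-sentence rule (split at the first period followed by whitespace) and the wording of the 518 note are assumptions, and abbreviations inside the first sentence would defeat this simple split.

```python
import re

MONTHS = {name: number for number, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

def brief_record(rec):
    """Map one database record (dict keyed by element name) to MARC-like fields."""
    summary = rec["summary"].strip()
    # First sentence: text up to the first period that is followed by whitespace.
    first_sentence = re.split(r"(?<=\.)\s+", summary, maxsplit=1)[0]
    day, year = int(rec["date_day"]), int(rec["date_year"])
    month = MONTHS[rec["date_month"]]
    return {
        "100": rec["main_speaker"],
        "245": first_sentence,
        "520": summary,
        "033": f"{year:04d}{month:02d}{day:02d}",  # yyyymmdd
        "518": f"{rec['recording_source']}, {rec['date_month']} {day}, {year}",
        "099": rec["formatid"],
    }
```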
  • 28. Benefits  Time saving  Eliminate manual searching for cataloged analog items in local catalog  Eliminate manual copy and paste from database to SkyRiver cataloging client
  • 29. Limitations  False no-match  Occasional discrepancy in analog item no. in database and catalog records ○ Typo ○ Digitization vs. Cataloging practice  Heading updates  Headings updated after analog records exported from catalog  Suppressed analog records  can’t do real time lookup through XML server
  • 31. Background  Thesis and dissertation cataloging at Michigan State University Libraries  Current practice: Separate-record approach  Pre-2007 practice: ○ Mulvered records: Print & Microform on the same record ○ Either:  Cataloged print on OCLC and added microform info locally  no record for microform in OCLC  Cataloged print and microform separately on OCLC and merged two records into one locally ○ 7387 titles with mulvered records
  • 32. Summary of Main Characteristics (MARC field: characteristics)  001: 1 (print) or 2 (print, microform)  007: For microform  008: Form of item (byte 23): Blank  099: Call no. for print  245$h: [paper, microform]  533: Reproduction note for microform  952: $a: Item record no. for print & microform; $b: Barcode
  • 33. Objective  Un-mulvering  One record for print, one record for microform ○ Record for print:  Turn mulvered record into print record by removing info only pertaining to microform (e.g. 533) - Overlay the original record - Delete item records pertaining to microform after overlay ○ Record for microform:  Create new microform bib record by - Transferring data only pertaining to microform to the new record (microform 001, 007, 533) - Copying, with/without modification, common data into the new record (008, 245 etc.)  Create new item record by copying info from original item record for microform  To process records by XSLT
  • 35. Roadblocks  Loss of items over the years  46 titles with print only  11 titles with microform only  Need to determine which title has what format(s)  Multiple microform formats  Some titles have both microfiche and microfilm formats  need to be split into 3 records (1 print, 1 microfiche, 1 microfilm)
  • 36. Loss of items  Determine available format by location code in item record  “mc” = Microfiche/ Microfilm  “th” or other branch locations = Print  Most reliable since 049 (location in bib) did not get updated when format was lost/added  Bib records extracted from ILS do not contain item location =952 $a.i69490405$b31293027362833 =952 $a.i69490417
  • 37.  Export item info separately from ILS
  • 38.  Merge item info into extracted bib by matching up item record no.
  • 39.  Based on item location info in MARC 952 to determine what record(s) to be generated • Print record (overlay) • Microform record (export as new) “mc” & print location(s) • Microform record (overlay) “mc” & NO print locations • Print record (overlay) Print location(s) & NO “mc”
  • 40. Revised Workflow (diagram labels): ILS; Extract; Item info; XSLT 1; Fix up; Mulvered; XSLT 2; Create; Print; Microform; Overlay existing; Export as new
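The three cases on this slide reduce to a small decision function. The sketch below takes the list of item location codes for one title and treats every code other than "mc" as a print location, following the previous slide. The returned labels are illustrative.

```python
def records_to_generate(locations):
    """Decide which records to produce from the item location codes of one title."""
    has_microform = "mc" in locations
    has_print = any(loc != "mc" for loc in locations)
    if has_microform and has_print:
        return ["print (overlay)", "microform (export as new)"]
    if has_microform:
        return ["microform (overlay)"]
    if has_print:
        return ["print (overlay)"]
    return []  # no items left for this title
```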
  • 41. Multiple microform formats  Some titles have microfiche and microfilm on the same record  Similarity  Both microfiche and microfilm use “mc” as item location  Differences  Call# system ○ Microfiche: Goetz, S - 3 fiche ○ Microfilm: 24354 THS Microfilm  Two MARC 533 (reproduction note) ○ One for microfiche, one for microfilm
  • 42. ○ Microfiche  =533 $aMicrofiche.$bAnn Arbor, Mich. :$cUniversity Microfilms,$d1979.$e4 microfiche ; 11 X 15cm. ○ Microfilm  =533 $aMicrofilm.$bAnn Arbor, Mich. :$cUniversity Microfilms,$d 1973.$e1 microfilm reel ; 35 mm.  Solutions ○ Based on the number of 533 fields to determine how many microform record(s) to be generated ○ Use MARC 533$a to pull appropriate call# from MARC 952
  • 44. Design of XSLT  Processing logic  Both print and microform available ○ Go through the record twice*  1st pass for print record  2nd pass for microform record* * When both microfiche & microfilm are available  3 passes (print, fiche, film)  One format available ○ Go through the record once (print/ microform*) * When both microfiche and microfilm are available  2 passes (one for microfiche, one for microfilm)
  • 45.  5 templates ○ Format determination template ○ Print data template ○ Microform templates  Microform only (949 overlay command)  Microform with print (949 item generation command) ○ Common data template  Data common to both formats  Reusable
  • 46.  Format determination template 1. Parse item location info in MARC 952 as a variable 2. Determine which location(s), aka format(s), is available 3. Invoke different combination of “Print data”, “microform only”, and “microform (with print)” templates accordingly
  • 47.  Print data template  To copy print 001, 049, 245 with adjustment  To copy 008, 099 as is  Generate 949 overlay command  Invoke “common data” template
  • 48.  Microform data templates  To copy microform 001, 008, 245 with adjustment ○ 008: Add “a” (microfilm) or “b” (microfiche) in Form byte (byte 23) based on 533$a ○ 245$h: replace “[paper, microform]” with “[microform]”  To copy 007, 533 as is  To generate 049, 099 ○ 099: Copy call# from item info stored in 952  949 command ○ Overlay command if microform is the only available format ○ Item record generation if print is also available  Invoke “common data” template
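The 008 and 245$h adjustments listed on this slide can be sketched as two small functions. This is a Python illustration rather than the XSLT used in the project, and the function names are hypothetical.

```python
def microform_form_code(field_533a):
    """Return the 008/23 code implied by 533 $a: 'a' for microfilm, 'b' for microfiche."""
    text = field_533a.lower()
    if "microfilm" in text:
        return "a"
    if "microfiche" in text:
        return "b"
    raise ValueError(f"unrecognised 533 $a: {field_533a!r}")

def adjust_for_microform(field_008, title_245h, field_533a):
    """Return the adjusted 008 and 245 $h for a microform record."""
    code = microform_form_code(field_533a)
    new_008 = field_008[:23] + code + field_008[24:]  # replace byte 23 only
    new_245h = title_245h.replace("[paper, microform]", "[microform]")
    return new_008, new_245h
```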
  • 49.  Common data template  Copy fields not touched by other templates mostly without adjustment e.g. leader, subject headings (6XX), thesis note (502), imprint date (260$c), physical dimension (300), etc.
  • 50. Results  Total mulvered: 7387 records  Total un-mulvered: 14722 records  Microform: 7346 records  Print: 7376 records
  • 52. Acronyms  NAR (Name Authority Record)  LC/NACO NAF (Name Authority File)  BFM (Bibliographic File Maintenance)  Heading/Authorized access point updates in bib records  SRU (Search/Retrieval via URL)  HTTP request
  • 53. LC/NACO NAF  Dynamic file  Contributions from over 700 NACO participants  Updated every day with new and changed NARs from NACO nodes  Full nodes ○ British Library, OCLC, SkyRiver  Contribution-only node ○ National Library of Medicine
  • 55. Authority Control at MSU  In-house  NACO institution  Database maintenance  Post-cataloging Authority Control  New Headings Report ○ Download NARs from SkyRiver  Updates to NARs not necessarily caught ○ 1XX (No item cataloged under changed 1XX  not in new heading report) ○ Elements other than 1XX (e.g. 4XX, 670)
  • 56. LC/NACO NAF RDA Transition  PCC Day 1 for RDA NAR: Mar. 31, 2013  PCC Task Group on AACR2 & RDA Acceptable Heading Categories (Aug 2011)  225,000 NARs with 1XX not usable in RDA bib records  172,000 NARs with 1XX usable in RDA bib records after batch manipulation by software  7,631,00 NARs with 1XX usable in RDA bib records as they are and can be recoded as RDA
  • 57.  Phased reissuance of NARs  Phase 1 ○ Scope  NARs with characteristics known to be at variance with RDA practice  Not candidates for any of the mechanical changes to be made during phase 2 ○ Adding a 667 note “THIS 1XX FIELD CANNOT BE USED UNDER RDA UNTIL THIS RECORD HAS BEEN REVIEWED AND/OR UPDATED”  Completed Aug. 20, 2012 (436,943 records processed)  Phase 2 ○ Programmatic changes to 1XX headings that are not acceptable under RDA (e.g., changes to Bible headings, spelling out Dept. and months, etc., abbreviations in the subfield $d for personal names) ○ Completed March 27, 2013 (371,942 records changed)
  • 58.  Updates of NARs by NACO institutions  Reviewing, upgrading, and recoding Phase 1 records to RDA  Upgrading and recoding non-Phase 1 records to RDA  Adding any of the 17 new MARC fields (e.g. 046, 372, etc.)  Routine NAR maintenance ○ PCC post-RDA test guidelines “strongly encourage” evaluating and recoding the “RDA-acceptable AACR2 NARs” to RDA whenever possible
  • 59. Objectives  To catch changes to NARs  Changes in 1XX  Addition, deletion, or updates of elements other than 1XX  To perform related BFM if 1XX in a NAR is changed
  • 60. Tasks  To download NARs one-by-one/in bulk  To detect updates to NARs already existing in Millennium  To overlay existing NARs with updated ones  To update headings in bib records if 1XX in NAR is updated  To automate and link up the above tasks
  • 61. Task #1: Download NARs  OCLC LCNAF SRU Service  Pros ○ Multiple indexes (LCCN, names, dates, etc.) ○ Available in multiple schema including MARCXML ○ One-by-one or bulk download* ○ SRU-based service (HTTP request) ○ FREE!!  Cons ○ Updated every Monday night ○ OAI-PMH service is not available though there is an index for OAI identifier ○ Bulk download – by search term (e.g. after certain date)
  • 62. Task #1: Download NARs (cont’d)  Implementation  Search LCCNs one-by-one by AutoIt script ○ Around 10 records/sec. retrieved  Download XML files into one folder (files named by LCCN)
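Searching LCCNs one by one amounts to building one SRU searchRetrieve request per LCCN. The sketch below only builds the URL. `operation`, `version`, `query`, `recordSchema` and `maximumRecords` are standard SRU parameters, but the base URL is left to the caller and the index name `local.LCCN` is an assumption, not taken from the OCLC service documentation.

```python
from urllib.parse import urlencode

def sru_url(base_url, lccn, index="local.LCCN",
            schema="info:srw/schema/1/marcxml-v1.1"):
    """Build an SRU searchRetrieve URL that asks for one record by LCCN."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": f'{index} = "{lccn}"',  # index name is hypothetical
        "recordSchema": schema,
        "maximumRecords": "1",
    }
    return f"{base_url}?{urlencode(params)}"
```

The response for each URL could then be saved under a file name derived from the LCCN, as the slide describes.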
  • 63. Task #2: NAR Update Detection  To compare NARs from Millennium and NARs from LC/NACO NAF by XSLT  MARC 005 (timestamp)  If timestamp more current on the NAR from NAF  Overlay the NAR in Millennium
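The timestamp test on this slide can be expressed as a short comparison. Treating a missing local 005 as "overlay" follows the note in the test results that many exported NARs lack the field. The function name is hypothetical.

```python
def needs_overlay(local_005, naf_005):
    """True when the NAF copy is newer, or when the local NAR has no 005 to compare."""
    if not local_005:
        return True   # nothing to compare against, so take the NAF copy
    if not naf_005:
        return False  # no timestamp on the NAF side, keep the local record
    # MARC 005 is yyyymmddhhmmss.f, so string order equals chronological order.
    return naf_005 > local_005
```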
  • 64. Task #3: Export/Overlay of NARs  MarcEdit  Export updated NARs into Millennium  Through TCP/IP (Host address, Port, .mrc file) ○ Same as export from OCLC Connexion or SkyRiver  One-by-one (though .mrc file can contain multiple NARs)
  • 65. Task #4: Updates of Bib Headings  XSLT  To detect changes in 1XX between old and new NARs  To build heading conversion table (a TXT file) when 1XX is changed  AutoIt  Automate bib heading updates by “Global Update” module in Millennium ○ Read old and new headings from the TXT file and fill out info required in “Global Update” process
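Building the heading conversion table comes down to comparing the 1XX of each old NAR with its updated counterpart. A minimal sketch, assuming both sets are already reduced to dictionaries keyed by LCCN and that the TXT file holds one tab-separated old/new pair per line (the actual layout is not given in the slides):

```python
def heading_conversion_rows(old_headings, new_headings):
    """Return 'old<TAB>new' rows for every LCCN whose 1XX heading changed."""
    rows = []
    for lccn, old in sorted(old_headings.items()):
        new = new_headings.get(lccn)
        if new is not None and new != old:
            rows.append(f"{old}\t{new}")
    return rows
```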
  • 66. Task #5: Automation  Use AutoIt to:  Link up various steps in the workflow  Automate searching against OCLC LCNAF SRU Service by compiling and sending HTTP requests  Execute various XSLTs in a predetermined sequence ○ e.g. NAR comparison  Heading comparison  Read TXT files (LCCN list, heading conversion table) created by XSLT processes  Run MarcEdit to overlay obsolete NARs  Execute “Global Update” process
  • 67. Basic Workflow (diagram labels): Millennium; Extract by Create Lists; Millennium NARs; Extract by XSLT; LCCNs; Search by AutoIt; LC/NACO NARs; Retrieve; Compare by XSLT; Updated NARs; Overlay by MarcEdit; Updated Headings; Global Update
  • 68. Test Results  82,398 NARs tested  81,362 NARs needed to be overlaid*  4,584 headings became obsolete  10,900 bib records had at least one heading flipped * Many NARs exported from Millennium do not contain field 005  overlaying those will save comparison time down the road
  • 69. Limitations  Identities broken out from undifferentiated NARs can’t be detected  Partially taken care of by “New Headings Report”  Headings with diacritics  Code points & exact match in Global Update  Headings in Field 880  Slow export using MarcEdit  Data Exchange module  Slow “Global Update” process  Wrong indicators put in by AutoIt during Global Update (though correct in conversion table)  “Java heap space” out of memory error
  • 71. Reflections  Low tolerance for differences between specified pattern and target  Case-sensitive ○ Normalization needed  Implication for processing time ○ RegEx helps specify a range of patterns ○ Exceptions: matches(), replace(), tokenize()  Data encoding ○ Diacritics, non-ASCII characters  Data consistency ○ Extra conditions/steps needed to account for exceptions ○ Legacy data, incorrect data ○ Pre-processing clean up vs. On-the-fly clean up ○ Familiarity with source data  normal pattern & exceptions
  • 72.  Unique identifiers vital for matching  Full automation vs. Semi-automation  Integration with other scripting language for full automation of ongoing workflow  Demanding on computing power  Multi-step matching  Processing multiple documents  Large XML files ○ <xsl:stream> in XSLT 3.0  Can’t create records out of thin air
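The normalization mentioned on this slide could be as simple as the sketch below, which folds case, collapses whitespace, and composes diacritics before comparing two strings. It is a Python illustration of the idea, not the approach used in the XSLTs, and the function name is hypothetical.

```python
import unicodedata

def normalize_heading(text):
    """Normalize a heading for matching: compose diacritics, fold case, collapse spaces."""
    composed = unicodedata.normalize("NFC", text)  # base letter + combining mark -> one code point
    return " ".join(composed.casefold().split())
```

Two headings that differ only in how a diacritic is encoded then compare as equal, which is the kind of mismatch the "code points & exact match" limitation refers to.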
  • 73. Lucas Mak Metadata & Catalog Librarian Michigan State University Libraries makw@mail.lib.msu.edu