This document discusses the data mapping required for the OAI Sheet Music Harvester project. Data mapping was necessary because OAI requires unqualified Dublin Core, while contributed data used different formats and definitions. Mapping addressed inconsistencies between MARC, EAD, Dublin Core and local formats used by partner institutions. Issues included field formatting, creator/contributor distinctions, and date/subject standards. Outstanding issues concerned authority control, robust data formats, and improving participation. The document outlines the mapping process and challenges of integrating diverse legacy metadata into a single discovery interface.
Open Archives Initiative for Sheet Music: Data Mapping
1. Data Mapping: OAI Sheet Music Harvester
Jenn Riley
Digital Media Specialist
Indiana University Digital Library Program
2. Why was data mapping required?
OAI requires unqualified Dublin Core
Contributed data only needed to support resource discovery
Dublin Core field definitions need interpretation
For efficient searching, data from different institutions must be consistent
5. Limitations to Dublin Core
Heavily slanted towards electronic resources
No content standards enforced
Without qualifiers, fields not granular enough for sheet music needs
Field definitions open to interpretation
7. MARC
Library of Congress
some records in AACR2 MARC
many records in non-AACR2 MARC
already had data mapped “based on” MARC to Dublin Core crosswalk
not able to alter their mapping for participation in sheet music project
8. EAD
Duke – item level finding aid
records weren’t contributed for phase 1
very robust and specific
conversion was relatively simple because data was converted to EAD from collection-specific database
mapped virtually all information in EAD documents to DC records
9. Dublin Core
UCLA – 4 types of DC records
songs
sheet music
covers et al
recordings
mapping basically only required inheritance of songs and sheet music data elements down to the covers level
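The inheritance described on this slide can be sketched as follows. This is a minimal illustration, not UCLA's actual process: the record layout (a dictionary of repeatable elements), the function name `inherit`, the choice of which elements are copied down, and all sample values are assumptions made for the example.

```python
from typing import Dict, List

# Unqualified DC: every element is optional and repeatable,
# so each element maps to a list of values.
Record = Dict[str, List[str]]


def inherit(parent: Record, child: Record, elements: List[str]) -> Record:
    """Return a copy of `child` in which each named element is copied
    down from `parent` when the child has no value of its own."""
    merged = {name: list(values) for name, values in child.items()}
    for name in elements:
        if not merged.get(name) and parent.get(name):
            merged[name] = list(parent[name])
    return merged


# Hypothetical sample records (placeholder values only).
song = {"title": ["Example Song"], "creator": ["Composer, Example"]}
sheet_music = inherit(
    song,
    {"publisher": ["Example Publisher"], "date": ["1905"]},
    ["title", "creator"],
)
cover = inherit(
    sheet_music,
    {"type": ["image"]},
    ["title", "creator", "publisher", "date"],
)
```

With this approach a cover record becomes searchable by the song's title and creator even though it carried neither itself, and the parent records are left unchanged.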
10. Local custom formats (1)
Johns Hopkins - Simple DTD
publication (location, publisher, date)
subject
call num (box, item)
title
composer/lyricist/arranger
form of composition
instrumentation
first line
first line of chorus
performer
dedicatee
engraver/lithographer/artist
advertisement
plate num
duplication
11. Local custom formats (2)
Indiana – simple database
title
composer
lyricist
place of publication
publisher
copyright
first line
first line of chorus
subject
form of composition
performance medium
copies
call #
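A crosswalk from a local field list like this one to unqualified Dublin Core could look roughly like the sketch below. The target elements chosen here are illustrative guesses, not the project's actual mapping; the names `CROSSWALK` and `to_dublin_core`, the label prefixes, and the sample record are all hypothetical. Local fields that are absent from the crosswalk are simply left out, in keeping with the goal of supporting discovery only.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical crosswalk: local field -> (DC element, label prefix).
# The target elements are assumptions for illustration only.
CROSSWALK = {
    "title": ("title", ""),
    "composer": ("creator", ""),
    "lyricist": ("creator", ""),
    "publisher": ("publisher", ""),
    "copyright": ("date", ""),
    "first line": ("description", "First line: "),
    "first line of chorus": ("description", "First line of chorus: "),
    "subject": ("subject", ""),
    "call #": ("identifier", ""),
}


def to_dublin_core(local: Dict[str, str]) -> Dict[str, List[str]]:
    """Map one local record to unqualified DC, skipping empty values
    and any local field the crosswalk does not cover."""
    dc = defaultdict(list)
    for field, value in local.items():
        if not value or field not in CROSSWALK:
            continue
        element, label = CROSSWALK[field]
        dc[element].append(label + value)
    return dict(dc)


# Hypothetical sample record (placeholder values only).
record = {
    "title": "Example Waltz",
    "composer": "Composer, Example",
    "lyricist": "Lyricist, Example",
    "first line": "An example first line",
    "copies": "2",
}
dc = to_dublin_core(record)
```

Note that composer and lyricist both land in the same DC element, so the distinction between them is lost unless it is carried in the value itself.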
13. Some mapping issues
Field formatting important, not just contents
Choices heavily influenced by LC practice
Can’t force institutions to comply
Sheet music has many alternative titles
Creator vs. contributor
Plate numbers: they’re important, but where to put them and how to label them?
Uncertain dates and date ranges
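One way to make uncertain dates and date ranges searchable is to reduce each date string to an earliest and latest year. The sketch below is only an assumption about how that might be done, not what the harvester does: the function name `year_range` and the sample strings are hypothetical, and it handles only strings that contain complete four-digit years.

```python
import re
from typing import Optional, Tuple

# Matches any run of four digits; a deliberately simple assumption.
YEAR = re.compile(r"\d{4}")


def year_range(raw: str) -> Optional[Tuple[int, int]]:
    """Return (earliest, latest) year found in a date string,
    or None when no complete four-digit year is present."""
    years = [int(y) for y in YEAR.findall(raw)]
    if not years:
        return None
    return min(years), max(years)


# Hypothetical examples of date strings as they might be contributed.
samples = ["[1905?]", "1890-1900", "c1905", "n.d."]
ranges = [year_range(s) for s in samples]
```

The original string would still need to be kept for display, since the range discards the markers of uncertainty such as brackets and question marks.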
14. Outstanding issues
Authority control for names
Date formats
Data clean-up: what can be done at harvester end and what must we ask data providers to do?
What will more robust data format look like?
How do we make it easier for more institutions to participate?
15. More information
Harvester site (still in development):
http://digital.library.ucla.edu/sheetmusic/
Jenn Riley, Indiana University Digital Library Program: jenlrile@indiana.edu
These presentation slides:
http://www.dlib.indiana.edu/~jenlrile/rbms2003/
Editor's notes
Unqualified DC required, but more robust formats also allowed. More on this later.
Since the purpose of the harvester is discovery of resources, and a user is taken out of the harvester to view items at individual institutions, it was not necessary to force all of the information from each institution’s records into DC. We needed to define what was required for discovery only, not figure out how to squeeze every MARC field into DC.
The name “Dublin Core” reveals something about its purpose. It was designed to be a core set of metadata elements applicable to all types of resources. Thus it’s meant to be flexible, with a low entry barrier. This means the definitions of fields are open to wide interpretations. We needed to develop a single interpretation that all contributors followed to make searching and browsing more effective.
Search Michigan’s OAIster for “Einstein” with format=image. In record 1, note the two forms of name in author/creator, and that the subject is a description. In record 2, note type=image, but very little indicates it’s a photo; the note text is odd, and the subjects are stuck together in a single string.
These are the 15 Dublin Core fields. None are required, all are repeatable.
You’ll notice they’re pretty basic. That’s because they’re supposed to be “core.” Many of them are obvious inclusions – title, creator, description, publisher. Others are not so obvious from their names. “Type” is meant to be used to indicate a general category to which the resource belongs. Some suggested types are image, physical object, text, collection, and event. “Format” is for describing the physical or digital manifestation of a resource. You would record things like dimensions, duration, software needed in “format.”
You’ll also notice that these terms are extremely generic (“creator,” for example). Again, they must be this generic because Dublin Core is meant to be usable to describe any type of resource. You may look at this list and think these elements are TOO generic to be really useful. To address that, Dublin Core comes in two forms: qualified and unqualified. Unqualified is just using this list of fields, exactly as they are written here. Qualified Dublin Core allows the specification of a refinement of the meaning of the field or the encoding scheme used for the field. The first of these, refining the meaning of the field, would be used, for example, with “creator.” A qualifier could specify what role that creator had in the development of the resource. The second, specifying the encoding scheme, could be used, for example, with the “subject” field, to indicate the name of the controlled vocabulary from which a subject was taken. Unfortunately, OAI requires unqualified Dublin Core, so the harvester couldn’t take advantage (at least in phase 1) of the greater specificity of description provided by qualifiers.
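What is lost when qualified records are reduced to unqualified Dublin Core can be shown with a small sketch. The `Statement` structure, the function `to_unqualified`, the role refinements “composer” and “lyricist,” and the sample values are all hypothetical, chosen only to illustrate the idea; they are not taken from the harvester or from any official qualifier list.

```python
from typing import List, NamedTuple, Optional, Tuple


class Statement(NamedTuple):
    """One qualified DC statement (hypothetical structure)."""
    element: str
    value: str
    refinement: Optional[str] = None  # refines the meaning of the element
    scheme: Optional[str] = None      # names the encoding scheme or vocabulary


def to_unqualified(statements: List[Statement]) -> List[Tuple[str, str]]:
    """Drop both kinds of qualifier, keeping only element and value."""
    return [(s.element, s.value) for s in statements]


# Hypothetical qualified record (placeholder values only).
qualified = [
    Statement("creator", "Composer, Example", refinement="composer"),
    Statement("creator", "Lyricist, Example", refinement="lyricist"),
    Statement("subject", "Example subject heading", scheme="LCSH"),
]
unqualified = to_unqualified(qualified)
```

After the reduction both names are plain “creator” values, and nothing records which vocabulary the subject came from.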
Even though unqualified DC is required for OAI, there are some reasons why it’s not the best metadata format to use for describing and searching sheet music collections.
Although it doesn’t specifically limit its scope to online materials, DC was designed in the networked world. Field definitions tend to work better for networked materials. For example, the “format” element description suggests using Internet MIME types.
Many of the DC fields have suggestions for controlled vocabularies to use. Internet MIME types for format is one example. For subject, “recommended best practice” is to use terms from a controlled vocabulary, but no specific ones are identified. But DC itself does not require field values to conform to these suggestions.
As we saw earlier, using unqualified DC doesn’t offer a great deal of specificity for describing resources. Sheet music has specific needs that DC can’t meet. For example, being able to distinguish between composers, lyricists, and arrangers is important for users of sheet music collections.
Despite our efforts to clarify DC field definitions for use with the harvester, it was inevitable that the fields would be used differently by different institutions.
Duke (rare materials emphasis) and IU (little authority control) had records in MARC too, but these weren’t contributed in phase 1 of the project.
Very specific info, for example: subject type (LCSH, AAT, TGM), dedicatee, recordings available
Very specific for sheet music, not good even for other types of printed music.
No authority control
MARC has more fields than custom DBs, but custom DBs have more applicable fields. Many MARC records don’t have relator codes, so we don’t know who’s a composer and who’s a lyricist.
Even if each individual collection was under name authority control, they still might not interoperate. But the problem was much worse: there wasn’t even agreement on name order!
Local subject vocabs in use, so even a complex mapping between LCSH, AAT, TGM wouldn’t solve the problem.