UPDATED and REPLACED with new file June 2014
Simplified presentation on library metadata evolution, the perils of not curating the metadata properly, and how it's being used "in the wild".
But…it’s all on the internet and a keyword search will find it, right? Not exactly... There's been a massive change in cataloging in libraries with the rise of the internet. Everything is connected, including our metadata. Catalogers are no longer isolated, and metadata management is no longer just an internal process. Everything we do now links to the wider world of metadata, pushing libraries into re-purposing our long-held work into the new frontiers of identity management and linked data.
6. WHY DOES METADATA MATTER? – DIGITALGEORGETOWN
Author Index Subject Index
7. WHY DOES METADATA MATTER?
“This town built a memorial to the wrong guy”
Ottawa, Canada
“It’s the metadata, stupid: and it’s not just for your
audience” (Joshua Lasky, posted 5/21/2014)
“To succeed in the digital age is to be able to easily
aggregate all of your articles in the most meaningful
way for each of your visitors. Competitors such as Circa
actively use metadata to surface relevant content during
breaking news events.”
8. WHY DOES METADATA MATTER?
What are we trying to identify? OR What are people
trying to find?
Works
Individuals
Places
Things/objects
Concepts
Discovery and discovery enhancement
Relationships
“On the fly” collections of resources
Users start elsewhere
9. WHAT DO WE DO WHEN WE CURATE [CREATE] METADATA?
Create and enhance descriptive metadata
Apply controlled vocabularies
Disambiguation of works, authors, etc.
Unique identification of editions, works, etc.
Collocation of editions, works, etc.
Use agreed-upon standards for data elements to
ensure consistent application/use
MARC
DigitalGeorgetown (DublinCore)
RDF (Resource Description Framework)
10. HOW DO WE EXPOSE “OUR” METADATA?
Controlled vocabulary and mapping
Genres
Subjects/Concepts
Classification
Identification:
People
Places/Geographic
Works
OWL (Web Ontology Language)
SKOS (Simple Knowledge Organization System)
Normalization
Indexing
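As a rough illustration of the normalization and indexing steps listed on this slide, the Python sketch below folds case, diacritics, and punctuation into a single index key. The function name index_key and the sample headings are hypothetical, and real catalogs follow published normalization rules rather than this simplified version.

```python
import re
import unicodedata


def index_key(heading: str) -> str:
    """Build a simplified match key for an index heading (illustrative only)."""
    # Decompose accented characters so the combining marks can be dropped.
    decomposed = unicodedata.normalize("NFKD", heading)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    folded = no_marks.casefold()
    # Delete apostrophes outright so "Qur'an" and "Quran" share a key.
    no_apostrophes = re.sub("[\u0027\u2019\u02bc]", "", folded)
    # Replace any remaining punctuation with a space, then collapse whitespace.
    cleaned = re.sub(r"[^\w\s]", " ", no_apostrophes)
    return " ".join(cleaned.split())


# Hypothetical variant forms of one author heading.
variants = ["Brontë, Charlotte", "BRONTE, CHARLOTTE", "Bronte,  Charlotte."]
keys = {index_key(v) for v in variants}
print(keys)
```

String normalization of this kind merges spelling and punctuation variants only. It does not merge Koran with Quran; that takes a cross-reference in a controlled vocabulary.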
11. OWL: WEB ONTOLOGY LANGUAGE
Utilizes RDF (Resource Description Framework)
5.2 Individual identity
Many languages have a so-called "unique names" assumption:
different names refer to different things in the world. On the web,
such an assumption is not possible. For example, the same
person could be referred to in many different ways (i.e. with
different URI references). For this reason OWL does not make
this assumption. Unless an explicit statement is being made that
two URI references refer to the same or to different individuals,
OWL tools should in principle assume either situation is possible.
OWL provides three constructs for stating facts about the identity
of individuals:
owl:sameAs is used to state that two URI references refer to the same
individual.
owl:differentFrom is used to state that two URI references refer to different
individuals
owl:AllDifferent provides an idiom for stating that a list of individuals are all
different.
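A minimal Python sketch of the identity reasoning quoted above, not an OWL reasoner: it groups URIs joined by owl:sameAs, records owl:differentFrom pairs, and answers "unknown" when neither has been stated, since OWL makes no unique names assumption. The class, its method names, and the example.org URIs are hypothetical.

```python
import itertools


class IdentityStore:
    """Toy store for owl:sameAs / owl:differentFrom statements."""

    def __init__(self):
        self._parent = {}     # union-find forest over URIs joined by sameAs
        self._different = []  # stated differentFrom pairs

    def _find(self, uri):
        # Follow parent links to the representative URI of the group.
        self._parent.setdefault(uri, uri)
        while self._parent[uri] != uri:
            uri = self._parent[uri]
        return uri

    def same_as(self, a, b):
        self._parent[self._find(a)] = self._find(b)

    def different_from(self, a, b):
        self._different.append((a, b))

    def all_different(self, uris):
        # owl:AllDifferent is shorthand for pairwise differentFrom.
        for a, b in itertools.combinations(uris, 2):
            self.different_from(a, b)

    def status(self, a, b):
        ra, rb = self._find(a), self._find(b)
        if ra == rb:
            return "same"
        for x, y in self._different:
            if {self._find(x), self._find(y)} == {ra, rb}:
                return "different"
        # Nothing stated either way: both situations remain possible.
        return "unknown"


# Hypothetical URIs standing in for three records about one person.
VIAF = "http://example.org/viaf/person-1"
ISNI = "http://example.org/isni/person-1"
NAF = "http://example.org/naf/person-1"
OTHER = "http://example.org/naf/person-2"
THIRD = "http://example.org/wiki/John_Smith"

store = IdentityStore()
store.same_as(VIAF, ISNI)
store.same_as(ISNI, NAF)
store.different_from(NAF, OTHER)

print(store.status(VIAF, NAF))
print(store.status(VIAF, OTHER))
print(store.status(VIAF, THIRD))
```

The sketch does not detect contradictions (a pair stated as both same and different); a real OWL tool would flag that as an inconsistency.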
12. SKOS: SIMPLE KNOWLEDGE ORGANIZATION SYSTEM
Utilizes RDF (Resource Description Framework)
2.3 Semantic Relationships
In KOSs semantic relations play a crucial role for defining
concepts. The meaning of a concept is defined not just by the
natural-language words in its labels but also by its links to other
concepts in the vocabulary. Mirroring the fundamental categories
of relations that are used in vocabularies such as thesauri
[ISO2788], SKOS supplies three standard properties:
skos:broader and skos:narrower enable the representation of hierarchical
links, such as the relationship between one genre and its more
specific species, or, depending on interpretations, the relationship between
one whole and its parts;
skos:related enables the representation of associative (non-hierarchical)
links, such as the relationship between one type of event and a category of
entities which typically participate in it. Another use for skos:related is
between two categories where neither is more general or more specific.
Note that skos:related enables the representation of associative
(non-hierarchical) links, which can also be used to represent part-whole links
that are not meant as hierarchical relationships.
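The three SKOS properties quoted above can be pictured with a small in-memory vocabulary. This is a sketch under stated assumptions: the ConceptScheme class and the genre concepts are hypothetical, and a production system would store these links as RDF rather than Python dictionaries.

```python
from collections import defaultdict


class ConceptScheme:
    """Toy vocabulary holding skos:broader/narrower/related links."""

    def __init__(self):
        self.broader = defaultdict(set)
        self.narrower = defaultdict(set)
        self.related = defaultdict(set)

    def add_broader(self, concept, broader_concept):
        # skos:narrower is the inverse of skos:broader, so record both.
        self.broader[concept].add(broader_concept)
        self.narrower[broader_concept].add(concept)

    def add_related(self, a, b):
        # skos:related is symmetric.
        self.related[a].add(b)
        self.related[b].add(a)

    def ancestors(self, concept):
        # Walk up the hierarchy. SKOS keeps skos:broader non-transitive and
        # offers skos:broaderTransitive for this kind of query.
        seen = set()
        stack = [concept]
        while stack:
            current = stack.pop()
            for parent in self.broader.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Hypothetical genre concepts.
scheme = ConceptScheme()
scheme.add_broader("Detective fiction", "Fiction")
scheme.add_broader("Fiction", "Literature")
scheme.add_related("Detective fiction", "Detectives")

print(sorted(scheme.ancestors("Detective fiction")))
print(sorted(scheme.related["Detectives"]))
```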
13. CURATED METADATA IN THE WILD – LIBRARY OF CONGRESS
Library of Congress data exposed as linked data
“The Library of Congress Linked Data Service enables
both humans and machines to programmatically access
authority data at the Library of Congress. This service is
influenced by -- and implements -- the Linked Data
movement's approach of exposing and inter-connecting
data on the Web via dereferenceable URIs.”
16. CURATED METADATA IN THE WILD - OTHERS
Wikipedia/dbpedia
WorldCat: links to WorldCat Identities
http://www.worldcat.org/identities/lccn-n79-007035/
LCCN: links to LC National Authority File (NAF)
http://id.loc.gov/authorities/names/n79007035.html
VIAF record
https://viaf.org/viaf/88919448/
ISNI (International Standard Name Identifier) record
http://isni-url.oclc.nl/isni/0000000121429031
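The WorldCat Identities URL above carries the LCCN as n79-007035, while the id.loc.gov URL uses n79007035. The Python sketch below shows the hyphen-removal and zero-padding step that connects the two forms, loosely following the Library of Congress's published LCCN normalization rules. The function names are hypothetical, and the URL pattern is simply copied from the id.loc.gov link on this slide.

```python
def normalize_lccn(lccn: str) -> str:
    """Simplified LCCN normalization (illustrative, minimal validation)."""
    # Remove blanks.
    value = lccn.replace(" ", "")
    # Drop a forward slash and everything after it, if present.
    value = value.split("/", 1)[0]
    # Remove the hyphen and left-pad the serial part with zeros to six digits.
    if "-" in value:
        prefix, serial = value.split("-", 1)
        value = prefix + serial.zfill(6)
    return value


def naf_url(lccn: str) -> str:
    # URL pattern taken from the id.loc.gov link shown on the slide.
    return "http://id.loc.gov/authorities/names/" + normalize_lccn(lccn) + ".html"


print(normalize_lccn("n79-007035"))
print(naf_url("n79-007035"))
```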
17. CURATED METADATA IN THE WILD - OTHERS
Wikipedia/dbpedia
Disambiguation
http://en.wikipedia.org/w/index.php?title=Category:All_disambiguation_pages
Identity management:
John Smith http://en.wikipedia.org/wiki/John_Smith
St. Mary’s Church
http://en.wikipedia.org/wiki/St._Mary%27s_Church
Georgetown http://en.wikipedia.org/wiki/Georgetown
Hamlet http://en.wikipedia.org/wiki/Hamlet_(disambiguation)
18. CURATED METADATA IN THE WILD - OTHERS
“MARC 21 records for
CONSER serials either
cataloged or processed by
LC or by CONSER
(Cooperative Online Serials
Program) participants. Also
includes records with ISSN
assignments and U.S.
Newspaper Program
cataloging. Records include
all languages. Available in
MARC 21 and MARCXML
formats.”
eCIP CONSER
19. BUILDING CURATED METADATA: OTHER OPTIONS
Crowd sourcing
Archives and Alumni
Identification of individuals for identity control
Penn Provenance project
“We are trying to identify former owners and virtually reunite
dispersed collections, and we welcome any information you
have about the images posted here.”
Incorporate data into records; establish identities
https://www.flickr.com/photos/58558794@N07
20. CONCLUSION
All comes back to the basics of metadata work:
DESCRIPTION
COLLOCATION
DISAMBIGUATION (uniquely identifiable)
RELATIONSHIPS
Editor's Notes
But…it’s all on the internet and a keyword search will find it, right? Not exactly…
Stop me if I get too far into the technical details – much like the rabbit hole of links (linked data!) on the web, it’s easy to get lost in a deeper technical analysis of the use of metadata
Examples of searching with keywords
Example 1
w/o controlled vocabulary there are no links (no see also) and no unified results list; the OPAC can’t fix that, you need the metadata to make it work
Koran: http://catalog.library.georgetown.edu/search~S4/X?SEARCH=Koran&searchscope=4&SORT=D
Quran: http://catalog.library.georgetown.edu/search~S4/X?SEARCH=quran&searchscope=4&SORT=D
Qur’an:
w/o controlled vocab there’s no link (no see also); discovery layer can’t fix that, you need the metadata to make it work
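One way to picture the "see also" behavior described in Example 1: a lookup table sends each variant spelling to a single preferred heading, so all three searches return one unified result list, while a plain keyword match on titles does not. The preferred form, the variant table, and the records below are hypothetical and are not taken from the Georgetown catalog or any authority file.

```python
PREFERRED = "Qur'an"  # hypothetical preferred heading

# Hypothetical cross-references from variant spellings to the preferred form.
SEE_FROM = {"koran": PREFERRED, "quran": PREFERRED, "qur'an": PREFERRED}

# Hypothetical records: free-text title plus a controlled heading.
RECORDS = [
    {"id": "b1", "title": "The Koran interpreted", "heading": PREFERRED},
    {"id": "b2", "title": "Quran: a new translation", "heading": PREFERRED},
]


def keyword_search(term):
    """Match the raw term against titles only."""
    needle = term.casefold()
    return [r["id"] for r in RECORDS if needle in r["title"].casefold()]


def heading_search(term):
    """Resolve the term through the cross-references, then match headings."""
    key = term.casefold().replace("\u2019", "'")
    heading = SEE_FROM.get(key)
    if heading is None:
        return []
    return [r["id"] for r in RECORDS if r["heading"] == heading]


print(keyword_search("Koran"), keyword_search("Quran"))
print(heading_search("Koran"), heading_search("Quran"))
```

The keyword searches return different, partial lists; the heading searches return the same complete list regardless of spelling.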
Issues in Subject and Author indexes – lack of consistency; duplication; lack of collocation for items with same term
No normalization
Currently working on a project to clean these up so that each author and each subject has a unique single entry; using established forms when possible for future links (e.g. using the NAF form so that it can link out to Wikipedia and other pages)
http://www.vocativ.com/culture/fun/town-built-huge-memorial-wrong-guy/
“concrete” (ha!) example of the perils of relying on keyword searching only
Library metadata is CURATED and CONTROLLED – making it reliable and authoritative and consistent
OWL: http://www.w3.org/TR/owl-ref/ - publishing and sharing ontologies on the web, includes “SameAs” options for linking like things (such as linking a VIAF record to ISNI to NAF)
Metadata is everywhere, but it’s not useful unless it’s managed – a keyword search can bring up a lot of disparate unrelated things…you need curated/controlled metadata to identify the thing you actually want and its links to other things
Impacts indexing, identification, collocation of like things, defining/displaying relationships
COLLOCATION and DISAMBIGUATION
DigitalGeorgetown: https://docs.google.com/spreadsheets/d/1oivICr3O1Drhn-Ncypi6dVc6gjdoaU-IYrNwJEGJBYI/edit?usp=sharing
http://id.loc.gov/
http://linkeddata.org/
Non library organizations mine the Library of Congress authority data and subject data, creating links and clarification
http://www.worldcat.org/oclc/70180992
Schema.org is used by the major search engines (it was launched by Google, Bing, and Yahoo!, with Yandex joining later)
http://www.oclc.org/news/releases/2012/201238.en.html
Also see the SameAs references!!
This then links to other available linked data sets from LC and OCLC and more.
Available for download: WorldCat Works, FAST (subjects), VIAF, Dewey.info
How? Because the metadata is available in a format that the search engines can mine and use
http://dbpedia.org/About
http://www.wikipedia.org/
ISNI: http://www.isni.org/about
They are ALL linked together via metadata – more information out there, more links are made
eCIP example: OCLC# 880237744 – see the vendor information added to the record
eCIP: The resulting CIP record is printed on the book's copyright page and used by vendors and libraries as a marketing and purchasing tool, facilitating ordering and processing of materials.
CONSER data: curated by catalogers and the ISSN Centre (in Paris)
Purchased and re-used by many companies, including SFX and SerialsSolutions/ProQuest
Repurposed MARC records:
ETDs
Finding Aids
Princeton Theological Seminary Theological Commons http://commons.ptsem.edu/
DigitalGeorgetown scanned objects
Why? We don’t always have all the information, but we can incorporate it and verify it when we have it, establishing new identities and confirming/enhancing existing ones, adding more links
If you look, it’s in the underlying structure of the web, and library data is OUT THERE