Empirical evaluation of library catalogues (SWIB 2019)
1. Empirical evaluation of library catalogues
SWIB 2019, Hamburg, 2019-11-27.
Péter Király
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
Card catalog at Gent University Library, photo: Pieter Morlion, 2010 CC-BY 4.0
https://commons.wikimedia.org/wiki/File:Boekentoren_2010PM_1179_21H9015.JPG
http://bit.ly/qa-swib2019
2. about
❏ MAchine Readable Cataloging
❏ a format and semantic specification
❏ invented in the early 1960s by H. Avram
❏ “MARC must die”*, “Stockholm syndrome of MARC”** vs. MARC is still with us
❏ know your data
❏ fix a) MARC before moving to X, or b) X after moving from MARC
❏ audience: those who would like to work with or understand data for different purposes
* Roy Tennant http://lj.libraryjournal.com/2002/10/ljarchives/marc-must-die/
** Niklas Lindström at ELAG 2019
https://twitter.com/cm_harlow/status/1126068414928293888
Henriette D. Avram (image: smithsonianmag.com)
3. workflow
1. ingest
2. measure records
3. aggregate
4. report
5. evaluate with experts
[Diagram labels: catalogue, quality assessment tool, improve records]
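The first four workflow steps can be sketched as a small pipeline. This is a minimal illustration, not the tool's actual code: the record layout (a list of tag/value pairs), the tiny `DEFINITIONS` table and all function names are assumptions made for this example, and only two issue types are checked.

```python
from collections import Counter

# Assumed, tiny subset of field definitions: tag -> is the field repeatable?
DEFINITIONS = {"001": False, "245": False, "650": True}


def ingest(raw_records):
    """Step 1: hand over records one by one (here they are already parsed)."""
    for record in raw_records:
        yield record


def measure(record):
    """Step 2: return the list of issue types found in a single record."""
    issues = []
    tag_counts = Counter(tag for tag, _ in record["fields"])
    for tag, count in tag_counts.items():
        if tag not in DEFINITIONS:
            issues.append("undefined field")
        elif count > 1 and not DEFINITIONS[tag]:
            issues.append("non-repeatable field")
    return issues


def aggregate(measured):
    """Step 3: count issue types and records with at least one issue."""
    issue_counts = Counter()
    total = with_issues = 0
    for issues in measured:
        total += 1
        if issues:
            with_issues += 1
        issue_counts.update(issues)
    return {
        "records": total,
        "records_with_issues": with_issues,
        "issues": dict(issue_counts),
    }


def report(summary):
    """Step 4: render the aggregated numbers as plain text."""
    lines = [
        f"records: {summary['records']}",
        f"records with issues: {summary['records_with_issues']}",
    ]
    for issue, count in sorted(summary["issues"].items()):
        lines.append(f"{issue}: {count}")
    return "\n".join(lines)


# Hypothetical sample data, not taken from any of the catalogues.
SAMPLE = [
    {"id": "a", "fields": [("001", "a"), ("245", "Title")]},
    {"id": "b", "fields": [("001", "b"), ("245", "T1"), ("245", "T2")]},
    {"id": "c", "fields": [("001", "c"), ("999", "local")]},
]

summary = aggregate(measure(record) for record in ingest(SAMPLE))
print(report(summary))
```

Step 5, evaluation with experts, is a human activity and is deliberately left out of the sketch.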
4. binding semantics
[Diagram: a MarcRecord holds control fields (control field 1, with position 1 to position N) and data fields (data field 1, with subfield 1 to subfield N); the same structure appears on the definitions side, and each value is bound to a value definition]
definitions:
● name
● URL
● acceptable codes and their meaning
● value constraints
● indexing rules
● FRBR functions
● historical codes
● dictionaries
● BIBFRAME name
● versions
● other rules
Avram JSON: data model export, machine readable format of the MARC standard
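One way to picture the binding is to let every value read from a record carry a reference to its definition. The sketch below is a simplified illustration, not the tool's Java model: the class names, the `bind` helper and the two-entry code list are assumptions, and only a few of the definition properties listed above (name, URL, acceptable codes) are modelled.

```python
from dataclasses import dataclass, field


@dataclass
class SubfieldDefinition:
    code: str
    label: str
    repeatable: bool = True
    # Acceptable codes and their meaning; an empty dict means free text.
    codes: dict = field(default_factory=dict)


@dataclass
class FieldDefinition:
    tag: str
    label: str
    url: str
    repeatable: bool
    subfields: dict  # subfield code -> SubfieldDefinition


@dataclass
class BoundValue:
    """A value from a record together with the definition it is bound to."""
    value: str
    definition: SubfieldDefinition

    def is_valid(self):
        return not self.definition.codes or self.value in self.definition.codes

    def meaning(self):
        return self.definition.codes.get(self.value, self.value)


# Illustrative definition of MARC 21 field 041; the code list is a tiny excerpt.
LANGUAGE = FieldDefinition(
    tag="041",
    label="Language Code",
    url="https://www.loc.gov/marc/bibliographic/bd041.html",
    repeatable=True,
    subfields={
        "a": SubfieldDefinition(
            code="a",
            label="Language code of text/sound track or separate title",
            codes={"eng": "English", "ger": "German"},
        )
    },
)
DEFINITIONS = {LANGUAGE.tag: LANGUAGE}


def bind(tag, code, value):
    """Attach the matching subfield definition to a raw value (hypothetical helper)."""
    return BoundValue(value, DEFINITIONS[tag].subfields[code])


ok = bind("041", "a", "ger")
bad = bind("041", "a", "xx")
print(ok.meaning(), ok.is_valid())
print(bad.meaning(), bad.is_valid())
```

Because the value knows its definition, both validation and human-readable display can be driven by the same object.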
6. aggregating results – records with issues
catalogue  all    filtered
bay        100.0  18.8
bzb        100.0  76.1
cer          2.8   2.8
col         90.4  66.0
dnb         13.9   0.2
gen         40.8  27.3
har        100.0  97.3
loc         30.5  29.3
mic         80.8  67.5
nfi         62.1  58.1
ris         99.7  57.1
sfp         82.7  60.4
sta         92.7  92.5
szt         30.8  30.6
tib        100.0 100.0
tor        100.0  74.2
Filtered = issues excluding the undocumented tags and subfields
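The difference between the "all" and "filtered" columns can be sketched as follows. The example assumes that the figures are percentages of records with at least one issue, and that "undocumented tags and subfields" correspond to the issue types named `undefined field` and `undefined subfield`; the function name and the sample data are hypothetical.

```python
# Issue types assumed to represent undocumented tags and subfields.
UNDOCUMENTED = frozenset({"undefined field", "undefined subfield"})


def share_with_issues(records_issues, exclude=frozenset()):
    """Percentage of records that have at least one issue not in `exclude`."""
    total = len(records_issues)
    if total == 0:
        return 0.0
    affected = sum(
        1
        for issues in records_issues
        if any(issue not in exclude for issue in issues)
    )
    return round(100.0 * affected / total, 1)


# One list of issue types per record (made-up sample).
sample = [
    ["undefined field"],
    ["undefined subfield", "invalid ISBN"],
    [],
    ["undefined field", "undefined field"],
]

all_share = share_with_issues(sample)
filtered_share = share_with_issues(sample, exclude=UNDOCUMENTED)
print(all_share, filtered_share)
```

A record whose only problems are locally defined fields counts in "all" but drops out of "filtered", which is one way a catalogue could show a large gap between the two columns.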
22. plan: deep dive into subjects
[Diagram labels: KOS with term #1 to term #11; metadata record #1 (field #1, field #2); metadata record #2 (field #1, field #2); references to generic concepts; references to specific concepts]
23. references, credits
https://github.com/pkiraly/{metadata-qa-marc, metadata-qa-marc-web}
Validating 126 million MARC records. doi:10.1145/3322905.3322929
Avram (w/ Jakob Voß): http://format.gbv.de/schema/avram/specification
@kiru, pkiraly@gwdg.de, pkiraly.github.io
Thanks to: J. Rolschewski, Phú, J. Voß, C. Klee, P. Hochstenbach, O. Suominen,
T. Virolainen, Kokas K., Bernátsky L., S. Auer, B. Genat, Sh. Doljack, D. L. Rueda,
Ph. E. Schreur, M. Lefferts, A. Jahnke, M. Kittelmann, J. Christoffersen, R.
Heuvelmann, Gyuricza A., Balázs L., Ungváry R., G. Coen, A. Ledl, A. Kasprzik,
U. Balakrishnan, Y. Y. Nicolas, M. Franke-Maier, G. Lauer
25. data sources
❏ Bavarian union cat. (bay) – 27.3 million records
❏ Baden-Würt. union cat. (bzb) – 23.1 m
❏ Columbia (col) – 6.0 m
❏ Heritage of the Printed Book DB, CERL (cer) – 6.7 m
❏ Germ. National Bibl. (dnb) – 16.7 m
❏ Gent (gen) – 1.8 m
❏ Harvard (har) – 13.7 m
❏ Library of Congress (loc) – 10.1 m
❏ Michigan (mic) – 1.3 m
❏ Finnish National Bibl. (nfi) – 1.0 m
❏ Répertoire International des Sources Musicales (ris) – 1.3 m
❏ San Francisco Public Lib. (sfp) – 0.9 m
❏ Stanford (sta) – 9.4 m
❏ Szeged (szt) – 1.2 m
❏ TIB Hannover (tib) – 3.5 m
❏ Toronto Public Lib. (tor) – 2.5 m
❏ Polish National Lib. – 6.5 m
❏ North Germ. union cat. – 69 m
❏ Hung. Acad. of Science Lib. – 1 m
❏ Hung. union cat. – 9 m
union catalogues – national libraries – university libraries – public libraries
26. issue types
issues on record level
❏ ambiguous linkage
❏ invalid linkage
❏ type error
control field issues
❏ invalid code
❏ invalid value
field issues
❏ missing reference subfield (880$6)
❏ non-repeatable field
❏ undefined field
indicator issues
❏ invalid value
❏ non-empty value
❏ obsolete value
subfield issues
❏ classification
❏ invalid ISBN
❏ invalid ISSN
❏ invalid length
❏ invalid value
❏ repetition
❏ undefined subfield
❏ non well-formatted value
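Two of the subfield issue types, invalid ISBN and invalid ISSN, come down to check-digit arithmetic. The sketch below uses the standard ISBN-10, ISBN-13 and ISSN checksum rules; it is not taken from the tool, the function names are hypothetical, and it assumes the value holds only the number itself (no qualifier such as "(pbk.)" after it).

```python
import re


def is_valid_isbn(value):
    """Check the ISBN-10 or ISBN-13 check digit of a bare ISBN string."""
    chars = re.sub(r"[\s-]", "", value).upper()
    if re.fullmatch(r"\d{9}[\dX]", chars):
        # ISBN-10: weights 10..1, 'X' stands for 10, sum must be divisible by 11.
        digits = [10 if c == "X" else int(c) for c in chars]
        return sum(d * w for d, w in zip(digits, range(10, 0, -1))) % 11 == 0
    if re.fullmatch(r"\d{13}", chars):
        # ISBN-13: alternating weights 1 and 3, sum must be divisible by 10.
        total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(chars))
        return total % 10 == 0
    return False


def is_valid_issn(value):
    """Check the ISSN check digit (weights 8..1, 'X' stands for 10, modulo 11)."""
    chars = value.replace("-", "").upper()
    if not re.fullmatch(r"\d{7}[\dX]", chars):
        return False
    digits = [10 if c == "X" else int(c) for c in chars]
    return sum(d * w for d, w in zip(digits, range(8, 0, -1))) % 11 == 0


print(is_valid_isbn("0-306-40615-2"), is_valid_isbn("978-0-306-40615-7"))
print(is_valid_issn("0378-5955"), is_valid_issn("0378-5954"))
```

Checks of this kind only show that a number is well formed; they cannot tell whether it identifies the right publication.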
27. number of subfields in catalogues
catalogue  total   1%  10%
bay          854  144   51
bzb          522  144   65
crl          169   65   39
col         1862  196   59
dnb          575  186   97
gnt          955  122   47
har         2024  154   49
loc         1156  128   40
mic         1233  138   37
nfi          811  145   54
ris          138   88   52
sfp         1046  125   37
sta         2997  225   64
szt         1210   74   42
tib           46   41   35
tor         1733  163   46
The tool has 2600+ subfield definitions
total: total number of fields; 1%: fields available in at least 1% of the records; 10%: fields available in at least 10% of the records.
Top fields (not in the table) – 50%: 13-25 fields, 80%: 4-18 fields, 90%: 0-16 fields
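The three columns can be derived from per-record presence counts. The following is a minimal sketch under assumptions: each record is represented as a list of subfield identifiers such as `245$a`, a subfield is counted once per record even if it repeats, and the function name and sample data are hypothetical.

```python
from collections import Counter


def subfield_coverage(records, thresholds=(0.01, 0.10)):
    """Count distinct subfields overall and those present in at least
    the given share of records."""
    record_count = len(records)
    presence = Counter()
    for record in records:
        presence.update(set(record))  # count each subfield once per record
    result = {"total": len(presence)}
    for threshold in thresholds:
        label = f"{threshold:.0%}"  # 0.01 -> "1%", 0.10 -> "10%"
        result[label] = sum(
            1 for n in presence.values() if n / record_count >= threshold
        )
    return result


# Made-up catalogue of 200 records.
records = []
for i in range(200):
    subfields = ["245$a", "245$a"]      # repeated on purpose
    if i < 30:
        subfields.append("650$a")       # in 15% of the records
    if i < 5:
        subfields.append("700$a")       # in 2.5% of the records
    if i == 0:
        subfields.append("999$x")       # in 0.5% of the records
    records.append(subfields)

coverage = subfield_coverage(records)
print(coverage)
```

In this made-up sample the rarely used `999$x` counts towards the total but falls below both thresholds, the same pattern the table shows at much larger scale.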
29. Tom Delsey (2002)
Functional analysis of the MARC 21 bibliographic and holdings formats.
Tech. report, Library of Congress, 2002. Prepared for the Network Development and MARC Standards Office, Library of Congress. Second revision: September 17, 2003.
https://www.loc.gov/marc/marc-functional-analysis/original_source/analysis.pdf
30. reuse
                   MARC 21  versions  total
control fields           7                7
control subfields      211              211
data fields            215        68    283
indicators             175         8    183
subfields             2259       344   2603
total                                  3287
Java classes
qa-metadata-marc.jar
Avram JSON
data model export
machine readable standard
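The reuse idea, exporting definitions held in code as a machine readable schema, can be sketched as below. This is an illustration only: the real export is produced from the Java classes, the `export_schema` and `count_elements` helpers are hypothetical, and the JSON keys (`fields`, `tag`, `label`, `repeatable`, `subfields`, `code`) follow a simplified reading of the Avram specification that should be checked against the specification itself.

```python
import json

# Two MARC 21 data fields with a few of their subfields:
# (code, label, repeatable) per subfield.
DEFINITIONS = [
    {
        "tag": "245",
        "label": "Title Statement",
        "repeatable": False,
        "subfields": [
            ("a", "Title", False),
            ("b", "Remainder of title", False),
            ("c", "Statement of responsibility, etc.", False),
        ],
    },
    {
        "tag": "650",
        "label": "Subject Added Entry - Topical Term",
        "repeatable": True,
        "subfields": [
            ("a", "Topical term or geographic name entry element", False),
            ("x", "General subdivision", True),
        ],
    },
]


def export_schema(definitions):
    """Turn in-code definitions into an Avram-like dictionary."""
    fields = {}
    for definition in definitions:
        fields[definition["tag"]] = {
            "tag": definition["tag"],
            "label": definition["label"],
            "repeatable": definition["repeatable"],
            "subfields": {
                code: {"code": code, "label": label, "repeatable": repeatable}
                for code, label, repeatable in definition["subfields"]
            },
        }
    return {"fields": fields}


def count_elements(schema):
    """Count exported data fields and subfields, as in the table above."""
    fields = schema["fields"].values()
    return {
        "data fields": len(schema["fields"]),
        "subfields": sum(len(f["subfields"]) for f in fields),
    }


schema = export_schema(DEFINITIONS)
print(json.dumps(schema, indent=2))
print(count_elements(schema))
```

Once the definitions exist as JSON, tools written in other languages can consume them without depending on the Java code.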