Empirical evaluation of
library catalogues
SWIB 2019, Hamburg, 2019-11-27.
Péter Király
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
Card catalog at Gent University Library, photo: Pieter Morlion, 2010 CC-BY 4.0
https://commons.wikimedia.org/wiki/File:Boekentoren_2010PM_1179_21H9015.JPG
http://bit.ly/qa-swib2019
about
❏ MAchine Readable Cataloging
❏ a format and semantic specification
❏ invented in early 60’s by H. Avram
❏ “MARC must die”*, “Stockholm syndrome of
MARC”** vs. MARC is still with us
❏ know your data
❏ fix a) MARC before moving to X,
or b) X after moving from MARC
❏ audience: those who would like to work with
or understand data for different purposes
* Roy Tennant http://lj.libraryjournal.com/2002/10/ljarchives/marc-must-die/
** Niklas Lindström at ELAG 2019
https://twitter.com/cm_harlow/status/1126068414928293888
Henriette D. Avram
smithsonianmag.com
workflow
1. ingest
2. measure records
3. aggregate
4. report
5. evaluate with experts
diagram labels: catalogue, quality assessment tool, improve records
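The measure, aggregate and report steps of the workflow can be sketched roughly as follows. This is only an illustration under stated assumptions: the record structure, the single "missing title" check and the function names are all hypothetical and are not taken from the tool itself.

```python
import csv
import io
from collections import Counter


def measure(record):
    """Hypothetical per-record check: flag records that lack a title."""
    return [] if record.get("title") else ["missing title"]


def aggregate(records):
    """Count how many records show each issue type."""
    counts = Counter()
    for record in records:
        counts.update(set(measure(record)))
    return counts


def report(counts):
    """Render the counts as CSV text, most frequent issue first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["issue", "records"])
    for issue, count in counts.most_common():
        writer.writerow([issue, count])
    return buffer.getvalue()


# A tiny in-memory list stands in for the ingest step (an assumption).
catalogue = [{"title": "A"}, {"title": ""}, {}]
csv_text = report(aggregate(catalogue))
```

The last step of the workflow, evaluation with experts, is deliberately left out: it is a human activity that a script like this can only feed with data.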
binding semantics
MarcRecord
control field 1: position 1 to position N
data field 1: subfield 1 to subfield N
definitions
control field 1: position 1 to position N
data field 1: subfield 1 to subfield N
value definition
● name
● URL
● acceptable codes and their meaning
● value constraints
● indexing rules
● FRBR functions
● historical codes
● dictionaries
● BIBFRAME name
● versions
● other rules
Avram JSON: data model export, a machine-readable format of the MARC standard
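A minimal sketch of what binding record elements to definitions could look like. The dictionary layout below is a simplified assumption made for illustration; it is neither the real Avram schema nor the tool's Java data model. The function name `check_record` and the sample record are hypothetical.

```python
# Hypothetical, heavily simplified definitions (not the real Avram schema).
DEFINITIONS = {
    "245": {
        "label": "Title Statement",
        "repeatable": False,
        "subfields": {"a": "Title", "b": "Remainder of title"},
    },
    "650": {
        "label": "Subject Added Entry - Topical Term",
        "repeatable": True,
        "subfields": {"a": "Topical term or geographic name entry element"},
    },
}


def check_record(record, definitions=DEFINITIONS):
    """Return (issue type, location) pairs for one record.

    `record` is a list of (tag, [(subfield code, value), ...]) tuples.
    """
    issues = []
    seen = {}
    for tag, subfields in record:
        definition = definitions.get(tag)
        if definition is None:
            issues.append(("undefined field", tag))
            continue
        seen[tag] = seen.get(tag, 0) + 1
        # Report a repeated non-repeatable field once, at its second occurrence.
        if seen[tag] == 2 and not definition["repeatable"]:
            issues.append(("non-repeatable field", tag))
        for code, _value in subfields:
            if code not in definition["subfields"]:
                issues.append(("undefined subfield", f"{tag}${code}"))
    return issues


record = [
    ("245", [("a", "Empirical evaluation"), ("9", "local note")]),
    ("245", [("a", "A second title")]),
    ("999", [("a", "local field")]),
]
issues = check_record(record)
```

Keeping the definitions as plain data rather than code is what makes an export to a machine-readable format possible in the first place.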
measure records
$ ./validate # validation
$ ./completeness # completeness analysis
$ ./classifications # classification analysis
$ ./authorities # authorities analysis
$ ./tt-completeness # Thompson-Traill completeness
$ ./serial-score # serial scores
$ ./functional-analysis # functional analysis
$ ./prepare-solr # prepare Solr index
$ ./index # indexing with Solr
$ ./all-analyses
CSV files
Lightweight web UI
aggregating results – records with issues (% of records)
catalogue    all   filtered
bay        100.0       18.8
bzb        100.0       76.1
cer          2.8        2.8
col         90.4       66.0
dnb         13.9        0.2
gen         40.8       27.3
har        100.0       97.3
loc         30.5       29.3
mic         80.8       67.5
nfi         62.1       58.1
ris         99.7       57.1
sfp         82.7       60.4
sta         92.7       92.5
szt         30.8       30.6
tib        100.0      100.0
tor        100.0       74.2
Filtered = issues excluding the undocumented tags and subfields
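The difference between the two columns can be illustrated with a short sketch. It assumes that "undocumented tags and subfields" correspond to the issue types "undefined field" and "undefined subfield"; that mapping, the function name and the sample data are assumptions for illustration only.

```python
# Assumed mapping of "undocumented tags and subfields" to issue types.
UNDOCUMENTED = {"undefined field", "undefined subfield"}


def share_with_issues(records_issues):
    """Return (all, filtered) percentages of records with at least one issue.

    `records_issues` holds one list of issue type names per record.
    """
    total = len(records_issues)
    if total == 0:
        return 0.0, 0.0
    with_any = sum(1 for issues in records_issues if issues)
    with_filtered = sum(
        1
        for issues in records_issues
        if any(issue not in UNDOCUMENTED for issue in issues)
    )
    return round(100 * with_any / total, 1), round(100 * with_filtered / total, 1)


sample = [
    ["undefined subfield"],
    ["undefined field", "invalid ISBN"],
    [],
    ["undefined subfield", "undefined subfield"],
]
all_pct, filtered_pct = share_with_issues(sample)
```

In this sample three of four records have some issue, but only one has an issue that survives the filter, which mirrors the large gaps seen for catalogues that rely on locally defined fields.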
open to see individual issues
link to records with this issue
link to the field definition in MARC standard
ordered by frequency
clicked
launch a search for 015$a:*
shows a term list of 015$a
015$9 is not defined
bar size proportional to the record count with the subfield
link to term list
check details
the element which tells us the source of terms (dictionary)
strange values
probably good,
but not listed in MARC
link to completeness
link to term list
search links
the missing step:
resolve to human-readable labels
(cooperation with coli-conc and BARTOC)
outliers
plan: deep dive into subjects
diagram: a KOS with terms #1 to #11; metadata record #1 and metadata record #2, each with field #1 and field #2; references to generic concepts; references to specific concepts
references, credits
https://github.com/pkiraly/{metadata-qa-marc, metadata-qa-marc-web}
Validating 126 million MARC records. doi:10.1145/3322905.3322929
Avram (w/ Jakob Voß): http://format.gbv.de/schema/avram/specification
@kiru, pkiraly@gwdg.de, pkiraly.github.io
Thanks to: J. Rolschewski, Phú, J. Voß, C. Klee, P. Hochstenbach, O. Suominen,
T. Virolainen, Kokas K., Bernátsky L., S. Auer, B. Genat, Sh. Doljack, D. L. Rueda,
Ph. E. Schreur, M. Lefferts, A. Jahnke, M. Kittelmann, J. Christoffersen, R.
Heuvelmann, Gyuricza A., Balázs L., Ungváry R., G. Coen, A. Ledl, A. Kasprzik,
U. Balakrishnan, Y. Y. Nicolas, M. Franke-Maier, G. Lauer
back material
data sources
Bavarian union cat. (bay) – 27.3 million records; Baden-Würt. union cat.
(bzb) – 23.1 m; Columbia (col) – 6.0 m; Heritage of the Printed Book DB,
CERL (cer) – 6.7 m; Germ. National Bibl. (dnb) – 16.7 m; Gent (gen) – 1.8 m;
Harvard (har) – 13.7 m; Library of Congress (loc) – 10.1 m; Michigan (mic) –
1.3 m; Finnish National Bibl. (nfi) – 1.0 m; Repertoire International des
Sources Musicales (ris) – 1.3 m; San Francisco Public Lib. (sfp) – 0.9 m;
Stanford (sta) – 9.4 m; Szeged (szt) – 1.2 m; TIB Hannover (tib) – 3.5 m;
Toronto Public Lib. (tor) – 2.5 m; Polish National Lib. – 6.5 m; North Germ.
union cat. – 69 m; Hung. Acad. of Science Lib. – 1 m; Hung. union cat. – 9 m
union catalogues – national libraries – university libraries – public libraries
issue types
issues on record level
❏ ambiguous linkage
❏ invalid linkage
❏ type error
control field issues
❏ invalid code
❏ invalid value
field issues
❏ missing reference subfield (880$6)
❏ non-repeatable field
❏ undefined field
indicator issues
❏ invalid value
❏ non-empty value
❏ obsolete value
subfield issues
❏ classification
❏ invalid ISBN
❏ invalid ISSN
❏ invalid length
❏ invalid value
❏ repetition
❏ undefined subfield
❏ non well-formatted value
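As an illustration of one subfield-level check from the list above, here is a sketch of ISBN check-digit validation using the standard ISBN-10 (modulo 11) and ISBN-13 (modulo 10) rules. The function name is hypothetical, and the tool's own ISBN check may be implemented differently or cover more cases.

```python
DIGITS = set("0123456789")


def is_valid_isbn(value):
    """Check the ISBN-10 or ISBN-13 check digit; hyphens and spaces are ignored."""
    chars = value.replace("-", "").replace(" ", "").upper()
    if len(chars) == 10:
        body, check = chars[:9], chars[9]
        if not set(body) <= DIGITS or check not in DIGITS | {"X"}:
            return False
        # Weights run from 10 down to 1; a final "X" stands for the value 10.
        digits = [int(c) for c in body] + [10 if check == "X" else int(check)]
        weighted = sum(d * w for d, w in zip(digits, range(10, 0, -1)))
        return weighted % 11 == 0
    if len(chars) == 13 and set(chars) <= DIGITS:
        # Weights alternate 1, 3, 1, 3, ... from the first digit.
        weighted = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(chars))
        return weighted % 10 == 0
    return False
```

A call such as `is_valid_isbn("0-306-40615-2")` passes the check, while changing the last digit makes it fail. Real catalogue values often carry qualifiers such as "(pbk.)" after the number, which this sketch does not strip.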
number of subfields in catalogues
catalogue  total    1%   10%
bay          854   144    51
bzb          522   144    65
crl          169    65    39
col         1862   196    59
dnb          575   186    97
gnt          955   122    47
har         2024   154    49
loc         1156   128    40
mic         1233   138    37
nfi          811   145    54
ris          138    88    52
sfp         1046   125    37
sta         2997   225    64
szt         1210    74    42
tib           46    41    35
tor         1733   163    46
The tool has 2600+ subfield definitions.
total: total number of fields; 1%: fields available in at least 1% of the records; 10%: fields available in at least 10% of the records.
Top fields (not in the table) – 50%: 13-25 fields, 80%: 4-18 fields, 90%: 0-16 fields
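The three columns can be derived with a simple frequency count. The sketch below counts subfields, following the slide title; representing each record as a set of identifiers such as "245$a" is an assumption, as are the function name and the sample data.

```python
from collections import Counter


def subfield_coverage(records, thresholds=(0.01, 0.10)):
    """Count distinct subfields and those present in a given share of records.

    `records` is a list of collections of subfield identifiers such as "245$a".
    """
    record_counts = Counter()
    for subfields in records:
        record_counts.update(set(subfields))  # count each subfield once per record
    total_records = len(records)
    result = {"total": len(record_counts)}
    for threshold in thresholds:
        result[threshold] = sum(
            1
            for count in record_counts.values()
            if total_records and count / total_records >= threshold
        )
    return result


# 100 sample records: "650$a" occurs in only 5% of them.
records = [{"245$a", "100$a"}] * 95 + [{"245$a", "650$a"}] * 5
coverage = subfield_coverage(records)
```

Here all three subfields clear the 1% threshold, but only two appear in at least 10% of the records, which is the same long-tail pattern the table shows at catalogue scale.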
outliers
discriminated
negative values
Tom Delsey (2002)
Functional analysis of the MARC 21 bibliographic and holdings formats.
Tech. report, Library of Congress, 2002. Prepared for the Network Development
and MARC Standards Office Library of Congress. Second Revision: September
17, 2003.
https://www.loc.gov/marc/marc-functional-analysis/original_source/analysis.pdf
reuse
                    MARC 21   versions   total
control fields            7                  7
control subfields       211                211
data fields             215         68     283
indicators              175          8     183
subfields              2259        344    2603
total                                     3287
Java classes (qa-metadata-marc.jar)
Avram JSON (data model export, machine-readable standard)

Editor's notes

  1. What is “report”?