Automatic creation of mappings between classification systems
1. Automatic creation of mappings between classification systems
Prof. Magnus Pfeffer
Stuttgart Media University
pfeffer@hdm-stuttgart.de
2. July 10th, 2013 European Conference on Data Analysis 2013 2
Agenda
• Motivation
• Instance-based matching
• Application to bibliographic data
• Evaluation
• Ongoing projects
3. Motivation
4. Current situation in Germany
• Five regional library unions
  • Subject headings
    • Predominantly RSWK ("Regeln für den Schlagwortkatalog" - "Rules for the subject catalogue") using a shared authority file
  • Classification systems
    • RVK (Regensburg Union Classification)
    • BK (Basic Classification)
    • DDC (Dewey Decimal Classification)
    • Various local classification systems
  • Low proportion of indexed titles (25-30%)
5. Current situation in Germany
• National library
  • Subject headings
    • Predominantly RSWK ("Regeln für den Schlagwortkatalog" - "Rules for the subject catalogue") using a shared authority file
  • Classification systems
    • DDC (Dewey Decimal Classification)
    • Coarse categories
  • DDC only for titles published since 2007
  • Only "Reihe A" (print trade publications) is fully indexed with RSWK
6. Goals
• Re-use existing indexing information
  • National level
    • BK is used mainly in northern Germany
    • RVK mainly in southern Germany
    • DDC mainly by the National Library
  • International level
    • Make RVK data more accessible to DDC users
    • Use DDC indexing information available from e.g. the Library of Congress
7. Ideas
• Use of appropriate classification systems
  • Faceted search in resource discovery systems
    • Should be monohierarchical
    • Should have a limited number of classes
    → DDC (first digits) or BK
  • Browsing of similar titles
    • Should be fine-grained
    → DDC (full) or RVK
  • Multi-lingual retrieval
8. Ideas
• Enable the use of existing tools and visualisations
Denton (2012), Legrady (2005)
9. Instance-based Matching
10. Ontology matching
• Well-studied problem in computer science
• Several approaches
  • Based on the descriptors
  • Based on the structure
  • Based on the manifestations (instances)
11. Instances
• Entries in catalogues with multiple classifications
12. Instance-based matching
• Assumptions
  • Classes with semantic overlap appear together
  • The more often these classes co-occur, the stronger the overlap
• Preparation
  • Extraction of all pairs of classifications from the data
  • Count of the extracted pairs
13. Example
14. Example
• Entry 1
  • DDC: 179.9
  • RVK: CC 7200
  • RVK: CC 7250
• Entry 2
  • DDC: 179.9
  • RVK: CC 7200
• Pairs
  • 179.9 / CC 7200
  • 179.9 / CC 7250
  • 179.9 / CC 7200
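
The pair extraction and counting for the two example entries can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the project; the record layout (a dictionary per entry) and the names `entries`, `extract_pairs` and `pair_counts` are assumptions made for the example.

```python
from collections import Counter
from itertools import product

# Assumed record layout: each entry maps a classification system
# to the list of class notations assigned to that entry.
entries = [
    {"DDC": ["179.9"], "RVK": ["CC 7200", "CC 7250"]},  # Entry 1
    {"DDC": ["179.9"], "RVK": ["CC 7200"]},             # Entry 2
]


def extract_pairs(entry, system_a="DDC", system_b="RVK"):
    """Return every (a, b) combination of classes assigned to one entry."""
    return list(product(entry.get(system_a, []), entry.get(system_b, [])))


# Count how often each pair was extracted across all entries.
pair_counts = Counter()
for entry in entries:
    pair_counts.update(extract_pairs(entry))

for (ddc, rvk), count in pair_counts.most_common():
    print(f"{ddc} / {rvk}: {count}")
```

For the two entries above, the pair 179.9 / CC 7200 is counted twice and 179.9 / CC 7250 once.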
16. Further interpretation
• a and b are two classes from two classification systems A and B
• The classes a and b only occur together
  → exact match
• a only co-occurs with b, but b co-occurs with other classes from A
  → a is a narrower concept than b
• a co-occurs with several classes from B (including b)
  → a is a wider concept than b
• a and b do not co-occur
  → cannot infer that a and b are unrelated
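
These rules can be written down as a small decision function. The sketch below is only one possible reading of the slide: the function names and the toy class labels are hypothetical, and the case where neither class co-occurs exclusively with the other is not covered by the rules above, so it is reported separately here as an assumption.

```python
from collections import defaultdict


def build_partners(pairs):
    """Index the co-occurrence partners of every class in both directions."""
    a_to_b = defaultdict(set)
    b_to_a = defaultdict(set)
    for a, b in pairs:
        a_to_b[a].add(b)
        b_to_a[b].add(a)
    return a_to_b, b_to_a


def interpret(a, b, a_to_b, b_to_a):
    """Apply the interpretation rules to class a (system A) and class b (system B)."""
    if b not in a_to_b.get(a, set()):
        # No co-occurrence: this does not show that a and b are unrelated.
        return "no evidence"
    a_exclusive = a_to_b[a] == {b}  # a co-occurs only with b
    b_exclusive = b_to_a[b] == {a}  # b co-occurs only with a
    if a_exclusive and b_exclusive:
        return "exact match"
    if a_exclusive:
        return "a narrower than b"
    if b_exclusive:
        return "a wider than b"
    # Assumption: neither side is exclusive; the rules above do not decide this case.
    return "overlap"


# Hypothetical toy data, not taken from any catalogue.
pairs = [("a1", "b1"), ("a2", "b2"), ("a3", "b2"), ("a4", "b3"), ("a4", "b4")]
a_to_b, b_to_a = build_partners(pairs)
print(interpret("a1", "b1", a_to_b, b_to_a))
print(interpret("a2", "b2", a_to_b, b_to_a))
print(interpret("a4", "b3", a_to_b, b_to_a))
print(interpret("a1", "b2", a_to_b, b_to_a))
```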
17. Prior work
• Pfeffer (2009)
  • Analysis of classification system structure and actual use
  • Locating classes that describe the same concept
  • Finding ways to improve existing mappings to RVK
  • Focus on RVK, using data from library union catalogues
  • Co-occurrence analysis
• Results
  • High co-occurrence and close in the hierarchy:
    → classes are hard to assign properly
  • High co-occurrence and far in the hierarchy:
    → classes describe identical concepts
  • Mappings from RSWK to RVK could be augmented
18. Related work
• Isaac et al. (2007)
  • Applied instance-based matching to bibliographic data
  • Data from the National Library of the Netherlands
  • Mapping from a thesaurus to a classification system
• Results
  • Generated mappings are quite good
  • More sophisticated measures than Jaccard do not lead to better mappings
19. Application to bibliographic data
20. Bibliographic data is different
• Multiple editions
• Multiple document types
21. Bibliographic data
• Skewed data
  • Multiple editions → More pairs
  • Some co-occurrences could appear stronger than others
• Solution: Pre-clustering individual titles on the "work" level
  • Increases the chance of instances with more than one classification
  • Each cluster contributes only once
• Allows using absolute co-occurrence numbers
  • Cut-off for small numbers
  • Ranking of competing matches
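
A minimal sketch of counting on the cluster level, with a cut-off and a ranking of competing matches, might look as follows. The cluster layout, the threshold value `MIN_COUNT` and the function names are assumptions chosen for illustration; the slides do not state the actual cut-off.

```python
from collections import Counter, defaultdict
from itertools import product

# Assumed layout: one dictionary per work cluster, holding the pooled
# classes of all editions in that cluster as sets.
clusters = [
    {"DDC": {"179.9"}, "RVK": {"CC 7200", "CC 7250"}},
    {"DDC": {"179.9"}, "RVK": {"CC 7200"}},
    {"DDC": {"179.9"}, "RVK": {"CC 7200"}},
]

MIN_COUNT = 2  # assumed cut-off for small co-occurrence numbers


def count_pairs(clusters, system_a="DDC", system_b="RVK"):
    """Count co-occurring class pairs; each cluster contributes a pair at most once."""
    counts = Counter()
    for cluster in clusters:
        # The classes are sets, so product() yields every pair once per cluster.
        counts.update(product(cluster.get(system_a, set()),
                              cluster.get(system_b, set())))
    return counts


def rank_matches(counts, min_count=MIN_COUNT):
    """Drop rare pairs and rank the remaining candidates for each source class."""
    ranked = defaultdict(list)
    for (a, b), n in counts.items():
        if n >= min_count:
            ranked[a].append((b, n))
    for a in ranked:
        # Highest absolute co-occurrence first; ties are broken by notation.
        ranked[a].sort(key=lambda item: (-item[1], item[0]))
    return dict(ranked)


counts = count_pairs(clusters)
print(rank_matches(counts))
```

With this toy data, CC 7250 co-occurs with 179.9 in only one cluster and falls below the assumed cut-off, so CC 7200 remains as the only candidate.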
22. Prior work
• Pfeffer (2013)
• Matching bibliographic records
  • Based on author, title and uniform title (as well as information on title changes)
  • Matches any edition and revision of a work, including translations
  • Full closure over the match sets → Discrete clusters
• Consolidating indexing information
  • For indexing purposes, the differences between editions and revisions are irrelevant
  • Subject headings and classifications are shared between all members of a cluster
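
The matching, closure and consolidation steps can be illustrated with a union-find structure. This is a sketch under stated assumptions, not the method from Pfeffer (2013): the key format, the normalisation, the record fields and all sample records are hypothetical, and information on title changes is left out.

```python
import re
from collections import defaultdict

# Hypothetical sample records; authors and titles are invented for illustration.
records = [
    {"id": 1, "author": "Doe, Jane", "title": "Ethics of Care",
     "uniform_title": "", "classes": {"DDC 179.9"}},
    {"id": 2, "author": "Doe, Jane", "title": "Ethik der Fürsorge",
     "uniform_title": "Ethics of Care", "classes": {"RVK CC 7200"}},
    {"id": 3, "author": "Doe, Jane", "title": "Ethik der Fürsorge",
     "uniform_title": "", "classes": set()},
    {"id": 4, "author": "Roe, Richard", "title": "Another Work",
     "uniform_title": "", "classes": {"RVK CC 7250"}},
]


def normalize(text):
    """Lower-case the text and collapse punctuation and whitespace."""
    return re.sub(r"\W+", " ", text.lower()).strip()


def match_keys(record):
    """Build author/title and author/uniform-title keys (assumed key format)."""
    author = normalize(record["author"])
    keys = {f"{author}|{normalize(record['title'])}"}
    if record.get("uniform_title"):
        keys.add(f"{author}|{normalize(record['uniform_title'])}")
    return keys


def cluster_records(records):
    """Group records sharing any key; union-find yields the full closure."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}
    for r in records:
        for key in match_keys(r):
            if key in first_seen:
                parent[find(r["id"])] = find(first_seen[key])
            else:
                first_seen[key] = r["id"]

    clusters = defaultdict(list)
    for r in records:
        clusters[find(r["id"])].append(r)
    return list(clusters.values())


def consolidate(cluster):
    """Pool the classifications of all members of one cluster."""
    return set().union(*(r["classes"] for r in cluster))


for cluster in cluster_records(records):
    print(sorted(r["id"] for r in cluster), sorted(consolidate(cluster)))
```

Records 1 and 3 share no key directly; they end up in the same cluster only because both match record 2, which is what the closure over the match sets provides.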
23. Evaluation
24. Comparison with existing mappings
• Existing (partial) mappings can be used as a basis for evaluation
  → "Gold standard"
• Comparison of automatic and manual mapping
  • Recall: Are all the mappings found?
  • Precision: Are all found mappings correct?
• Analysis of additional links
  • Maybe the gold standard can be improved?
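
Recall and precision against a gold standard can be computed as in the sketch below. The mapping pairs are placeholders rather than real BK or RVK notations, and the function name `evaluate` is an assumption. Because the gold standard is only partial, a real evaluation would probably first restrict the generated mappings to classes the gold standard covers; that step is omitted here.

```python
def evaluate(found, gold):
    """Compare generated mappings with a gold standard.

    Returns (precision, recall, additional), where `additional` holds the
    generated mappings that are absent from the gold standard.
    """
    found, gold = set(found), set(gold)
    correct = found & gold
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall, found - gold


# Placeholder mappings, invented for illustration.
gold = {("bk1", "rvk1"), ("bk2", "rvk2"), ("bk4", "rvk4"), ("bk5", "rvk5")}
found = {("bk1", "rvk1"), ("bk2", "rvk2"), ("bk3", "rvk9")}

precision, recall, additional = evaluate(found, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
print("additional links to inspect:", sorted(additional))
```

The additional links are not necessarily errors: they are the candidates to inspect when asking whether the gold standard itself can be improved.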
25. Ongoing projects
26. Data
• Bibliographic data
  • All German library union catalogues
  • German National Library catalogue
  • Austrian National Library catalogue
  • British National Bibliography
• Gold standards
  • Partial mappings BK – RVK
27. Interesting Mappings
• RVK – BK
  • Gold standard exists
  • BK well suited for faceted retrieval
  • RVK has the largest proportion of classified titles
• RVK – DDC
  • Enable data sharing between the German National Library and the RVK-using libraries
• Not limited to classification systems
  • See Pfeffer (2009) and Wang et al. (2009)
28. Software features
• Using the "Metafactory" framework
• Import of MAB2 and MARC data
• Clustering
  • Generation of keys for the match process
  • Matching and clustering (full closure)
  • Consolidation of indexing and classification information
• Statistics
  • Single occurrence and co-occurrence counts
  • Jaccard measure
• Output
  • Full mappings
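
The Jaccard measure follows directly from the single occurrence and co-occurrence counts: the number of instances carrying both classes divided by the number carrying at least one of them. The sketch below is not part of the Metafactory framework; the function name and the counts, taken from the two-entry example on slide 14, are used for illustration only.

```python
def jaccard(n_a, n_b, n_ab):
    """Jaccard measure from single occurrence counts n_a and n_b and
    the co-occurrence count n_ab of two classes."""
    union = n_a + n_b - n_ab  # instances carrying at least one of the two classes
    return n_ab / union if union else 0.0


# Counts from the two example entries: 179.9 occurs in both entries,
# CC 7200 in both, CC 7250 only in the first.
print(jaccard(n_a=2, n_b=2, n_ab=2))  # 179.9 and CC 7200
print(jaccard(n_a=2, n_b=1, n_ab=1))  # 179.9 and CC 7250
```

For the example, 179.9 and CC 7200 score 1.0 and 179.9 and CC 7250 score 0.5.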
29. Thank you for listening.
Slides available online
http://www.slideshare.net/MagnusPfeffer/
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
30. References
• Denton, W. (2012). On dentographs, a new method of visualizing library collections. In: Code4Lib.
• Isaac, A., Van Der Meij, L., Schlobach, S. and Wang, S. (2007). An empirical study of instance-based ontology matching. In: The Semantic Web (pp. 253-266). Springer Berlin Heidelberg.
• Legrady, G. (2005). Making visible the invisible. Seattle Library Data Flow Visualization. In: Digital Culture and Heritage. Proceedings of ICHIM05 Sept, 21-23.
• Pfeffer, M. (2009). Äquivalenzklassen – Alle Doppelstellen der RVK finden. Presentation given at the Librarian Workshop of the 33rd Annual Conference of the German Classification Society on Data Analysis, Machine Learning, and Applications (GfKl).
• Pfeffer, M. (2013). Using clustering across union catalogues to enrich entries with indexing information. In: Data Analysis, Machine Learning and Knowledge Discovery - Proceedings of the 36th Annual Conference of the Gesellschaft für Klassifikation e. V. Springer Berlin Heidelberg.
• Wang, S., Isaac, A., Schopman, B., Schlobach, S. and Van Der Meij, L. (2009). Matching multi-lingual subject vocabularies. In: Research and Advanced Technology for Digital Libraries (pp. 125-137). Springer Berlin Heidelberg.