Comparing taxonomies for organising collections of documents presentation

Comparing taxonomies for organising
collections of documents
Samuel Fernando, Mark Hall, Eneko Agirre,
Aitor Soroa, Paul Clough, Mark Stevenson

COLING 2012, 14th December 2012, Mumbai, India

Introduction
● Large collections of diverse data are available
online. PATHS project aims to support user
exploration in digital library collections.
● Search box is useful but taxonomies are better
suited for exploration and browsing.
● We apply taxonomies to organise data from a large
digital library collection.
● Process is automatic – either map items to an
existing taxonomy, or induce a taxonomy from the
data.


Evaluation data
● We use items from Europeana, a large online collection
of cultural heritage.
● Use English subset, approx. 550,000 items.
● Item typically contains a picture, a title, description and
subject keywords.
● Very diverse data comprising artifacts, places, people.
Topics include fashion, archaeology, architecture and
many other subjects.
● Data from many providers, some of which use
taxonomies, some don’t – need unified approach


Example item
Title: Design Council Slide
Collection

Subject: colour, exhibitions,
industrial design

Description: Display on the
theme of colour matching at
the Design Centre, London,
1960


Manually created taxonomies
● We use four existing manually created taxonomies:
– LCSH (Library of Congress)
– WordNet domains
– Wikipedia Taxonomy
– DBpedia ontology
● The taxonomies already exist and are of good
quality - but problem is to map Europeana items
into the correct place in the taxonomy.


LCSH
● A controlled vocabulary maintained by the US
Library of Congress for bibliographic records.
● Used by libraries to organise collections and also by
curators of cultural heritage.
● Subject keywords are used to map Europeana
items into the appropriate LCSH category nodes.
industrial design → design → creation (literary, artistic, etc.)
→ intellect
+30 more higher level headings
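
The keyword-to-heading mapping can be pictured with a minimal sketch. The `BROADER` table is a hypothetical stand-in for LCSH broader-term links and holds only the chain from the example; exact, case-insensitive keyword matching is an assumption, and `map_item` and `ancestors` are illustrative names rather than PATHS code.

```python
# Hypothetical broader-term links: only the chain from the slide's example.
BROADER = {
    "industrial design": ["design"],
    "design": ["creation (literary, artistic, etc.)"],
    "creation (literary, artistic, etc.)": ["intellect"],
    "intellect": [],
}

def map_item(subject_keywords, broader=BROADER):
    """Keep the subject keywords that match a heading (assumed: exact match)."""
    return [k.lower() for k in subject_keywords if k.lower() in broader]

def ancestors(heading, broader=BROADER):
    """Collect every heading reachable by following broader-term links."""
    seen, stack = [], list(broader.get(heading, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.append(node)
            stack.extend(broader.get(node, []))
    return seen

# Subject keywords of the example item shown earlier.
nodes = map_item(["colour", "exhibitions", "industrial design"])
chain = ancestors("industrial design")
```

With this toy table only "industrial design" matches a heading, and the item inherits the higher-level headings on its chain.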


WordNet domains
● WordNet domains (Bernardo Magnini, LREC 2000)
applies a small set of 164 domain labels to each of the
WordNet synsets.
● Again use subject keywords to map Europeana items -
first to Yago2 (for proper nouns) then to synset and
finally to WordNet domain label.
tourism → social
color → factotum
art → humanities
+5 more


Wikipedia Taxonomy
● Wikipedia category hierarchy preserving only is-a
relations - all others are discarded.
● Use Wikipedia Miner over each Europeana item to
identify Wikipedia articles in the subject keywords. Then
map item to all categories that contain these articles
design → visual_arts → criticism
image_processing → digital_signal_processing → signal_processing
museology → museums → educational_organizations → organizations
+35 more


DBpedia ontology
● A formalised shallow ontology manually created
based on Wikipedia (with inference capability).
● Again use Wikipedia Miner to find Wikipedia articles
in subject keywords of each item and map item to
the categories to which these articles belong.
musical_work → work
work
album → musicalwork → work


Automatic data-derived taxonomies
● We use two approaches to derive taxonomies
automatically from the Europeana data.
– LDA (Latent Dirichlet Allocation) topic modelling
– WikiFreq (Wikipedia Frequency hierarchy)
● Taxonomies fit data - no unnecessary nodes to
prune.
● Mapping from items to concept nodes is implicit
during derivation.


LDA topic modelling

● Latent Dirichlet Allocation (LDA) maps each
item to one or more topics.
● Each item is a distribution over topics; each topic is
a distribution over words.
● Item-topic and topic-word distributions are
learned using collapsed Gibbs sampling.
● Has been used to improve IR results.
● Previous work has developed hierarchical LDA,
but this is infeasible over our large data set.

Hierarchical LDA topics
● Run LDA over corpus to determine item-topic probabilities.

● Identify set of items for each topic. Each item assigned to
highest probability topic. Topic labelled with highest
probability word.
● If a topic has fewer than 60 items then stop. Otherwise go
back to the first step, using that topic's items as the
corpus.
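
The recursion above can be sketched as follows. This is only an outline under stated assumptions: the LDA run is hidden behind a callable `assign_topics`, and `toy_assign` is a crude stand-in that is not a topic model at all, used only so the sketch runs without a topic-modelling library. The guard against a corpus that does not split is also an assumption.

```python
from collections import Counter, defaultdict

def build_hierarchy(items, assign_topics, min_items=60):
    """Recursively split `items` into labelled topic nodes.

    `assign_topics(items)` stands in for an LDA run: for each item it returns
    a (topic_id, label) pair for that item's highest-probability topic.
    """
    if len(items) < min_items:
        return []                      # fewer than min_items: stop
    groups, labels = defaultdict(list), {}
    for item, (topic, label) in zip(items, assign_topics(items)):
        groups[topic].append(item)
        labels[topic] = label
    if len(groups) < 2:
        return []                      # assumption: stop if nothing splits
    return [{"label": labels[t], "items": members,
             "children": build_hierarchy(members, assign_topics, min_items)}
            for t, members in groups.items()]

def toy_assign(items):
    """Stand-in for LDA (NOT a topic model): pick each item's most frequent
    corpus word, ignoring words that every item contains."""
    counts = Counter(w for item in items for w in set(item))
    usable = {w for w, c in counts.items() if c < len(items)}
    result = []
    for item in items:
        words = sorted(w for w in item if w in usable)
        best = max(words, key=lambda w: counts[w]) if words else "misc"
        result.append((best, best))
    return result

# Invented toy corpus; threshold lowered to 3 so the recursion is visible.
corpus = ([("design", "chair")] * 3 + [("design", "lamp")] * 3
          + [("coin", "roman")] * 2)
tree = build_hierarchy(corpus, toy_assign, min_items=3)
```

Each node keeps its own items, so the mapping from items to concept nodes falls out of the derivation itself.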


Hierarchical LDA topics (example)

Bangle → design → design → design → brooch → collection


Wikipedia link frequencies
● Novel approach.
● Run Wikipedia Miner to find links in all Europeana
items – use title, subject and description.
● Find frequency counts for each link.
● For each item take the set of links found.
● Create taxonomy branch (if not already present)
with links in order of frequency (most frequent first).
● Map item to least frequent link.


Wikipedia link frequencies (cont.)
● Large number of concept nodes - limit to 24
children for each node.
● Require at least 2 links for each item - filter out
items with little metadata.
● Filter out concepts with fewer than 20 items.

industrial design → design council


Statistics

Type       Taxonomy    Items   Nodes   Avg. parents  Avg. depth  Top nodes
Manual     LCSH        99259   285238  1.8           1.97        28901
           DBpedia     178312  273     4.2           2           30
           WikiTax     275359  121359  11.7          1.13        10417
           WN domains  308687  170     7.1           7.1         6
Automatic  LDA topics  545896  22494   1             7.3         9
           Wiki Freq   66558   502     1             3.39        24


Evaluation - cohesion
● Intruder detection was originally proposed in (Chang et al.,
2009). A cohesive unit is defined as one in which the
items are similar while at the same time different from
items in other clusters.
● Present 5 items to each annotator: 4 from one concept
node, and an intruder item chosen randomly from elsewhere in
the taxonomy. The more cohesive the unit, the more
obvious the intruder will be.
● Crowd-sourcing: 111 annotators, 30 units from each
taxonomy. 1255 answers – average 7 annotators for
each unit.
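
Building one evaluation unit can be sketched as below. `make_unit` and the toy data are hypothetical; requiring that the intruder does not also belong to the chosen node is an assumption, since the slide only says it comes from elsewhere in the taxonomy.

```python
import random

def make_unit(node_items, node, rng):
    """Return (shuffled list of 5 items, intruder) for concept `node`.

    node_items: dict mapping concept node -> list of item ids.
    """
    inside = rng.sample(node_items[node], 4)          # 4 items from the node
    # Candidate intruders: items of other nodes that are not in this node.
    outside = sorted({i for n, items in node_items.items() if n != node
                      for i in items} - set(node_items[node]))
    intruder = rng.choice(outside)
    unit = inside + [intruder]
    rng.shuffle(unit)                                 # hide the intruder's position
    return unit, intruder

# Invented toy taxonomy with two concept nodes.
node_items = {"coins": ["c1", "c2", "c3", "c4", "c5"], "chairs": ["h1", "h2"]}
unit, intruder = make_unit(node_items, "coins", random.Random(0))
```

Passing an explicit `random.Random` keeps the sampling reproducible, which helps when the same units are shown to several annotators.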


Example of a cohesive unit


Evaluation - cohesion results

Type       Taxonomy       Cohesive units  Percentage
Manual     LCSH           19              63.3
           DBpedia        17              56.7
           Wiki Taxonomy  18              60.0
           WN domains     15              50.0
Automatic  LDA topics     17              56.7
           Wiki Freq      29              96.7

Number of cohesive units (out of a possible 30)
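
As a quick check, the percentages follow directly from the unit counts over the 30 units per taxonomy; the variable names below are illustrative only.

```python
# Cohesive-unit counts taken from the results table.
cohesive = {"LCSH": 19, "DBpedia": 17, "Wiki Taxonomy": 18,
            "WN domains": 15, "LDA topics": 17, "Wiki Freq": 29}
UNITS_PER_TAXONOMY = 30

# Percentage of cohesive units, rounded to one decimal place.
percentages = {name: round(100 * n / UNITS_PER_TAXONOMY, 1)
               for name, n in cohesive.items()}
```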


Evaluation - relation classification
● Previous work has typically used a simple boolean
question: “is it true that ChildNode is-a ParentNode?”
● We ask two questions for each child-parent pair A and
B:
– Are the concepts A and B related?
– If they are, is A more specific than B, less specific
than B, or neither?
● Crowd-sourcing: 173 annotators, 40 pairs from each
taxonomy, each pair evaluated on average 16 times.


Evaluation - example pairs
Taxonomy       Child (A)             Parent (B)
LCSH           Work                  Human Behaviour
               Braid                 Weaving
DBpedia        Mountain Range        Place
               Fern                  Plant
Wiki Taxonomy  Mammals of Africa     Wildlife of Africa
               Schools in Wiltshire  Schools in England
WN domains     vehicles              transport
               mechanics             engineering
LDA topics     earthenware           dish
               view                  church
Wiki Freq      Corrosion             Coin
               Interior Design       Industrial Design


Are A and B related?

Taxonomy       Yes   No    Don't know
LCSH           74.2  8.8   17.0
DBpedia        86.6  11.2  2.2
Wiki Taxonomy  96.1  1.7   2.3
WN domains     77.1  14.5  8.4
LDA topics     30.3  50.3  19.3
Wiki Freq      47.6  16.5  35.8


Which is more specific?

Taxonomy       A<B   A>B   Neither  Don't know
LCSH           65.4  8.7   23.4     2.5
DBpedia        76.2  4.9   18.1     0.7
Wiki Taxonomy  78.3  4.7   16.0     0.9
WN domains     63.6  6.3   28.0     2.0
LDA topics     21.4  14.8  62.1     1.6
Wiki Freq      30.9  22.6  43.6     2.9


Conclusions
● Wikipedia Taxonomy is conceptually well organised,
even better than LCSH, which has been widely used
for organising library collections.
● WikiFreq gives very high cohesion for items,
although the conceptual relations are not well
defined.
● Future work continues with different intrinsic and
user evaluations. Also aim to combine Wikipedia
Taxonomy and WikiFreq to get the best of both.


The End

s.fernando@sheffield.ac.uk

Supported by the PATHS project http://paths-project.eu
Funded by the European Community's Seventh Framework
Programme (FP7/2007-2013) under grant agreement no.
270082. This research was also partially funded by the Ministry
of Economy under grant TIN2009-14715-C04-01 (KNOW2
project).

Questions?
