1. Comparing taxonomies for organising collections of documents
Samuel Fernando, Mark Hall, Eneko Agirre,
Aitor Soroa, Paul Clough, Mark Stevenson
COLING 2012, 14th December 2012, Mumbai, India
2. Introduction
● Large collections of diverse data are available
online. PATHS project aims to support user
exploration in digital library collections.
● Search box is useful but taxonomies are better
suited for exploration and browsing.
● We apply taxonomies to organise data from a large
digital library collection.
● Process is automatic – either map items to an
existing taxonomy, or induce a taxonomy from the
data.
3. Evaluation data
● We use items from Europeana, a large online collection
of cultural heritage.
● Use English subset, approx. 550,000 items.
● Item typically contains a picture, a title, description and
subject keywords.
● Very diverse data comprising artifacts, places, people.
Topics include fashion, archaeology, architecture and
many other subjects.
● Data from many providers, some of which use
taxonomies, some don’t – need unified approach
4. Example item
Title: Design Council Slide Collection
Subject: colour, exhibitions, industrial design
Description: Display on the theme of colour matching at the Design Centre, London, 1960
5. Manually created taxonomies
● We use four existing manually created taxonomies:
– LCSH (Library of Congress)
– WordNet domains
– Wikipedia Taxonomy
– DBpedia ontology
● The taxonomies already exist and are of good
quality, but the problem is to map Europeana items
into the correct place in the taxonomy.
6. LCSH
● A controlled vocabulary maintained by the US
Library of Congress for bibliographic records.
● Used by libraries to organise collections and also by
curators of cultural heritage.
● Subject keywords are used to map Europeana
items into the appropriate LCSH category nodes.
industrial design → design → creation (literary, artistic, etc.) → intellect
+30 more higher level headings
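The keyword-to-heading mapping can be sketched roughly as below. This is only an illustration: the `BROADER` table is a hypothetical in-memory fragment containing just the headings listed above, each heading is assumed to have a single broader heading, and exact matching after lower-casing is an assumption, since the slides do not say how keywords are matched.

```python
# Hypothetical toy fragment of LCSH: heading -> one broader heading.
# (Assumption: real LCSH headings may have several broader headings.)
BROADER = {
    "industrial design": "design",
    "design": "creation (literary, artistic, etc.)",
    "creation (literary, artistic, etc.)": "intellect",
}
KNOWN_HEADINGS = set(BROADER) | set(BROADER.values())


def normalise(keyword):
    """Lower-case and collapse whitespace (assumed matching rule)."""
    return " ".join(keyword.lower().split())


def map_item(subject_keywords):
    """Return {matched heading: list of broader headings, nearest first}."""
    mapping = {}
    for keyword in subject_keywords:
        heading = normalise(keyword)
        if heading not in KNOWN_HEADINGS:
            continue  # no exact match in the toy vocabulary
        chain = []
        node = heading
        while node in BROADER:
            node = BROADER[node]
            chain.append(node)
        mapping[heading] = chain
    return mapping


# Subject keywords of the example item shown earlier.
example = map_item(["colour", "exhibitions", "Industrial design"])
```

With this toy table only "industrial design" matches; "colour" and "exhibitions" are skipped because the fragment does not contain them.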
7. WordNet domains
● WordNet domains (Bernardo Magnini, LREC 2000)
assigns labels from a small set of 164 domain labels
to each of the WordNet synsets.
● Again use subject keywords to map Europeana items:
first to Yago2 (for proper nouns), then to a synset and
finally to a WordNet domain label.
tourism → social
color → factotum
art → humanities
+5 more
8. Wikipedia Taxonomy
● Wikipedia category hierarchy, preserving only is-a
relations; all others are discarded.
● Use Wikipedia Miner over each Europeana item to
identify Wikipedia articles in the subject keywords. Then
map the item to all categories that contain these articles.
design → visual_arts → criticism
image_processing → digital_signal_processing → signal_processing
museology → museums → educational_organizations → organizations
+35 more
9. DBpedia ontology
● A formalised shallow ontology, manually created
based on Wikipedia (with inference capability).
● Again use Wikipedia Miner to find Wikipedia articles
in the subject keywords of each item and map the item to
the categories to which these articles belong.
musical_work → work
work
album → musicalwork → work
10. Automatic data-derived taxonomies
● We use two approaches to derive taxonomies
automatically from the Europeana data.
– LDA (Latent Dirichlet Allocation) topic modelling
– WikiFreq (Wikipedia Frequency hierarchy)
● Taxonomies fit data - no unnecessary nodes to
prune.
● Mapping from items to concept nodes is implicit
during derivation.
11. LDA topic modelling
● Latent Dirichlet Allocation (LDA) maps each
item to one or more topics.
● Each item is a distribution over topics; each topic is
a distribution over words.
● Item-topic and topic-word distributions are
learned using collapsed Gibbs sampling.
● LDA has been used for improving results from IR.
● Previous work has developed hierarchical LDA,
but this is infeasible over our large data set.
12. Hierarchical LDA topics
● Run LDA over corpus to determine item-topic probabilities.
● Identify set of items for each topic. Each item assigned to
highest probability topic. Topic labelled with highest
probability word.
● If a topic has fewer than 60 items then stop. Otherwise go
back to the first step, using the set of items identified in the
previous step as the corpus.
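The recursion described above can be sketched as follows. This is a rough illustration, not the authors' implementation: `fit_topics` is a hypothetical callable standing in for a full LDA run, `toy_fit_topics` is a toy replacement used only to make the example runnable, and the guard against a run that fails to split the corpus is an added assumption.

```python
from collections import defaultdict

MIN_ITEMS = 60  # stopping threshold quoted on the slide


def build_hierarchy(items, fit_topics, min_items=MIN_ITEMS):
    """Recursively split `items` into a list of topic nodes.

    `fit_topics` is a hypothetical stand-in for one LDA run. Given a list of
    items it must return (assignments, labels), where assignments[i] is the
    highest-probability topic of items[i] and labels[t] is the
    highest-probability word of topic t.
    """
    assignments, labels = fit_topics(items)
    groups = defaultdict(list)
    for item, topic in zip(items, assignments):
        groups[topic].append(item)

    nodes = []
    for topic in sorted(groups):
        members = groups[topic]
        node = {"label": labels[topic], "items": members, "children": []}
        # Recurse only on topics with at least `min_items` items. The second
        # condition is an added guard (assumption) against endless recursion
        # when a run fails to split the corpus at all.
        if len(members) >= min_items and len(members) < len(items):
            node["children"] = build_hierarchy(members, fit_topics, min_items)
        nodes.append(node)
    return nodes


def toy_fit_topics(items):
    """Toy replacement for LDA: first half -> topic 0, second half -> topic 1."""
    half = len(items) // 2
    assignments = [0 if i < half else 1 for i in range(len(items))]
    return assignments, {0: "first", 1: "second"}


tree = build_hierarchy(list(range(200)), toy_fit_topics)
```

With 200 toy items the first pass yields two topics of 100 items, each of which is split again into two topics of 50 items, where the recursion stops.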
14. Wikipedia link frequencies
● Novel approach.
● Run Wikipedia Miner to find links in all Europeana
items – use title, subject and description.
● Find frequency counts for each link.
● For each item take the set of links found.
● Create taxonomy branch (if not already present)
with links in order of frequency (most frequent first).
● Map item to least frequent link.
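A minimal sketch of these steps, assuming the Wikipedia links for each item have already been extracted (the `item_links` input is hypothetical). Counting a link once per item and breaking frequency ties alphabetically are both assumptions; the slides do not specify either.

```python
from collections import Counter


def build_wikifreq(item_links):
    """Build a link-frequency taxonomy.

    item_links: {item id: collection of Wikipedia link titles found in it}.
    Returns (root node, link frequencies). Each node is a dict with
    "children" (link title -> node) and "items" (item ids mapped to it).
    """
    # Frequency of a link = number of items it occurs in (assumption).
    freq = Counter(link for links in item_links.values() for link in set(links))

    root = {"children": {}, "items": []}
    for item_id, links in item_links.items():
        if not links:
            continue  # nothing to place
        # Most frequent link first; ties broken alphabetically (assumption).
        path = sorted(set(links), key=lambda link: (-freq[link], link))
        node = root
        for link in path:
            # Create the branch only if it is not already present.
            node = node["children"].setdefault(
                link, {"children": {}, "items": []}
            )
        node["items"].append(item_id)  # node of the least frequent link
    return root, freq


# Invented toy input for illustration only.
toy_links = {
    "a": {"design", "industrial design", "design council"},
    "b": {"design", "industrial design"},
    "c": {"design"},
}
root, freq = build_wikifreq(toy_links)
```

In the toy input "design" is the most frequent link, so it becomes the top of the branch, and item "a" ends up under "design council", its least frequent link.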
15. Wikipedia link frequencies (cont.)
● Large number of concept nodes - limit to 24
children for each node.
● Require at least 2 links for each item - filter out
items with little metadata.
● Filter out concepts with fewer than 20 items.
industrial design → design council
17. Evaluation - cohesion
● Intruder detection was originally proposed in (Chang et al.,
2009). A cohesive unit is defined as one in which the
items are similar to each other while being different from
items in other clusters.
● Present 5 items to each annotator: 4 from one concept
node and an intruder item chosen at random from elsewhere
in the taxonomy. The more cohesive the unit, the more
obvious the intruder will be.
● Crowd-sourcing: 111 annotators, 30 units from each
taxonomy. 1255 answers, an average of 7 annotators for
each unit.
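Constructing one intruder-detection unit might look like the sketch below. The `taxonomy` dictionary, the item ids and the `make_unit` helper are all invented for illustration; how the slides' "cohesive" verdict is derived from the annotators' answers is not stated, so it is not modelled here. The last lines only check the quoted average against the six taxonomies compared in the talk.

```python
import random


def make_unit(taxonomy, node, rng, size=4):
    """Build one unit: `size` items from `node` plus one intruder.

    taxonomy: {concept node: list of item ids} (hypothetical structure).
    Returns (shuffled list of size + 1 items, the intruder).
    """
    members = rng.sample(taxonomy[node], size)
    own_items = set(taxonomy[node])
    # Candidate intruders: items of other nodes that are not also in `node`.
    outside = [
        item
        for other, items in taxonomy.items()
        if other != node
        for item in items
        if item not in own_items
    ]
    intruder = rng.choice(outside)
    unit = members + [intruder]
    rng.shuffle(unit)  # hide the intruder's position from the annotator
    return unit, intruder


# Invented toy taxonomy.
toy_taxonomy = {
    "ceramics": ["c1", "c2", "c3", "c4", "c5"],
    "coins": ["k1", "k2", "k3"],
}
unit, intruder = make_unit(toy_taxonomy, "ceramics", random.Random(0))

# Sanity check of the quoted average: 6 taxonomies x 30 units = 180 units.
answers_per_unit = 1255 / (6 * 30)
```

The average works out at just under 7 answers per unit, consistent with the figure on the slide.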
18. Example of a cohesive unit
19. Evaluation - cohesion results
Type       Taxonomy       Cohesive units  Percentage
Manual     LCSH           19              63.3
Manual     DBpedia        17              56.7
Manual     Wiki Taxonomy  18              60.0
Manual     WN domains     15              50.0
Automatic  LDA topics     17              56.7
Automatic  Wiki Freq      29              96.7
Number of cohesive units (out of a possible 30)
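The percentage column follows directly from the unit counts, each taken out of 30 units. A short check using only the figures from the table:

```python
UNITS_PER_TAXONOMY = 30  # units evaluated per taxonomy

# Cohesive unit counts as reported in the table.
cohesive_units = {
    "LCSH": 19,
    "DBpedia": 17,
    "Wiki Taxonomy": 18,
    "WN domains": 15,
    "LDA topics": 17,
    "Wiki Freq": 29,
}

# Percentage of cohesive units, rounded to one decimal place.
percentages = {
    name: round(100 * count / UNITS_PER_TAXONOMY, 1)
    for name, count in cohesive_units.items()
}
```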
20. Evaluation - relation classification
● Previous work has typically used a simple boolean
question: “Is it true that ChildNode is-a ParentNode?”
● We ask two questions for each child-parent pair A and B:
– Are the concepts A and B related?
– If they are, is A more specific than B, less specific
than B, or neither?
● Crowd-sourcing: 173 annotators, 40 pairs from each
taxonomy, each pair evaluated on average 16 times.
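Turning the raw annotator answers into per-taxonomy percentages could be done along these lines. The `answer_percentages` helper and the sample answers are invented for illustration; the slides do not describe how answers were aggregated.

```python
from collections import Counter


def answer_percentages(answers):
    """Return {answer: percentage of all answers}, to one decimal place."""
    counts = Counter(answers)
    total = len(answers)
    return {answer: round(100 * n / total, 1) for answer, n in counts.items()}


# Invented answers to "Are the concepts A and B related?" for one pair,
# evaluated 16 times.
toy_answers = ["yes"] * 12 + ["no"] * 2 + ["don't know"] * 2
toy_result = answer_percentages(toy_answers)
```

For the toy answers this gives 75.0% "yes" and 12.5% each for "no" and "don't know".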
21. Evaluation - example pairs
Taxonomy       Child (A)             Parent (B)
LCSH           Work                  Human Behaviour
LCSH           Braid                 Weaving
DBpedia        Mountain Range        Place
DBpedia        Fern                  Plant
Wiki Taxonomy  Mammals of Africa     Wildlife of Africa
Wiki Taxonomy  Schools in Wiltshire  Schools in England
WN domains     vehicles              transport
WN domains     mechanics             engineering
LDA topics     earthenware           dish
LDA topics     view                  church
Wiki Freq      Corrosion             Coin
Wiki Freq      Interior Design       Industrial Design
22. Are A and B related?
Taxonomy Yes No Don't know
LCSH 74.2 8.8 17.0
DBpedia 86.6 11.2 2.2
Wiki Taxonomy 96.1 1.7 2.3
WN domains 77.1 14.5 8.4
LDA topics 30.3 50.3 19.3
Wiki Freq 47.6 16.5 35.8
23. Which is more specific?
Taxonomy       A<B   A>B   Neither  Don't know
LCSH           65.4  8.7   23.4     2.5
DBpedia        76.2  4.9   18.1     0.7
Wiki Taxonomy  78.3  4.7   16.0     0.9
WN domains     63.6  6.3   28.0     2.0
LDA topics     21.4  14.8  62.1     1.6
Wiki Freq      30.9  22.6  43.6     2.9
24. Conclusions
● Wikipedia Taxonomy is conceptually well organised,
even better than LCSH, which has been widely used
for organising library collections.
● WikiFreq gives very high cohesion for items,
although the conceptual relations are not well
defined.
● Future work continues with different intrinsic and
user evaluations. We also aim to combine Wikipedia
Taxonomy and WikiFreq to get the best of both.
25. The End
s.fernando@sheffield.ac.uk
Supported by the PATHS project http://paths-project.eu
Funded by the European Community's Seventh Framework
Programme (FP7/2007-2013) under grant agreement no.
270082. This research was also partially funded by the Ministry
of Economy under grant TIN2009-14715-C04-01 (KNOW2
project).
Questions?