Comparing taxonomies for organising
               collections of documents
                Samuel Fernando, Mark Hall, Eneko Agirre,
                 Aitor Soroa, Paul Clough, Mark Stevenson




COLING 2012, 14th December 2012, Mumbai, India
Introduction
●   Large collections of diverse data are available
    online. PATHS project aims to support user
    exploration in digital library collections.
●   Search box is useful but taxonomies are better
    suited for exploration and browsing.
●   We apply taxonomies to organise data from a large
    digital library collection.
●   Process is automatic – either map items to an
    existing taxonomy, or induce a taxonomy from the
    data.

Evaluation data
●   We use items from Europeana, a large online collection
    of cultural heritage.
●   Use English subset, approx. 550,000 items.
●   Item typically contains a picture, a title, description and
    subject keywords.
●   Very diverse data comprising artifacts, places, people.
    Topics include fashion, archaeology, architecture and
    many other subjects.
●   Data from many providers, some of which use
    taxonomies, some don’t – need unified approach

Example item
Title: Design Council Slide Collection

Subject: colour, exhibitions, industrial design

Description: Display on the theme of colour matching at the
Design Centre, London, 1960

Manually created taxonomies
●   We use four existing manually created taxonomies:
     –   LCSH (Library of Congress)
     –   WordNet domains
     –   Wikipedia Taxonomy
     –   DBpedia ontology
●   The taxonomies already exist and are of good
    quality - but problem is to map Europeana items
    into the correct place in the taxonomy.



LCSH
●   A controlled vocabulary maintained by the US
    Library of Congress for bibliographic records.
●   Used by libraries to organise collections and also by
    curators of cultural heritage.
●   Subject keywords are used to map Europeana
    items into the appropriate LCSH category nodes.
industrial design → design → creation (literary, artistic, etc.) → intellect
+30 more higher level headings
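
A minimal sketch of the keyword-to-heading mapping described above. The slides do not say how keywords are matched, so exact matching on a normalised label is an assumption, and the `BROADER` dictionary is a hypothetical three-entry fragment built only from the example path on this slide, not the real LCSH data.

```python
# Hypothetical fragment of the broader-term links (heading -> broader headings),
# taken from the example path on the slide. Real LCSH is far larger.
BROADER = {
    "industrial design": ["design"],
    "design": ["creation (literary, artistic, etc.)"],
    "creation (literary, artistic, etc.)": ["intellect"],
}
HEADINGS = set(BROADER) | {p for parents in BROADER.values() for p in parents}


def normalise(keyword):
    """Lower-case and collapse whitespace so labels compare reliably."""
    return " ".join(keyword.lower().split())


def map_item(subject_keywords, headings=HEADINGS):
    """Return the headings whose label exactly matches a subject keyword.

    Exact matching is an assumption made for this sketch.
    """
    return [normalise(k) for k in subject_keywords if normalise(k) in headings]


def ancestors(heading, broader=BROADER):
    """Collect every heading reachable by following broader-term links upward."""
    seen, stack = [], list(broader.get(heading, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.append(node)
            stack.extend(broader.get(node, []))
    return seen


# Subject keywords from the example item shown earlier in the deck.
nodes = map_item(["colour", "exhibitions", "Industrial design"])
path = ancestors("industrial design")
```

With this toy fragment only "industrial design" matches a heading; the other two keywords are dropped because the fragment does not contain them.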

WordNet domains
●   WordNet domains (Bernardo Magnini, LREC 2000)
    applies a small set of 164 domain labels to each of the
    WordNet synsets.
●   Again use subject keywords to map Europeana items -
    first to Yago2 (for proper nouns) then to synset and
    finally to WordNet domain label.
         tourism → social
         color → factotum
         art → humanities
         + 5 more

Wikipedia Taxonomy
●   Wikipedia category hierarchy preserving only is-a
    relations - all others are discarded.
●   Use Wikipedia Miner over each Europeana item to
    identify Wikipedia articles in the subject keywords. Then
    map item to all categories that contain these articles
design → visual_arts → criticism
image_processing → digital_signal_processing → signal_processing
museology → museums → educational_organizations → organizations
         +35 more

DBpedia ontology
●   A formalised shallow ontology, manually created and
    based on Wikipedia (with inference capability).
●   Again use Wikipedia Miner to find Wikipedia articles
    in the subject keywords of each item and map the item to
    the categories to which these articles belong.
       musical_work → work
       work
       album → musicalwork → work



Automatic data-derived taxonomies
●   We use two approaches to derive taxonomies
    automatically from the Europeana data.
     –   LDA (Latent Dirichlet Allocation) topic modelling
     –   WikiFreq (Wikipedia Frequency hierarchy)
●   Taxonomies fit data - no unnecessary nodes to
    prune.
●   Mapping from items to concept nodes is implicit
    during derivation.



LDA topic modelling

●   Latent Dirichlet Allocation (LDA) maps each
    item to one or more topics.
●   Each item is a distribution over topics; each topic is
    a distribution over words.
●   Item-topic and topic-word distributions are
    learned using collapsed Gibbs sampling.
●   Has been used to improve information retrieval (IR) results.
●   Previous work has developed hierarchical LDA,
    but this is infeasible over our large data set.
Hierarchical LDA topics
●   Run LDA over corpus to determine item-topic probabilities.




●   Identify the set of items for each topic. Each item is assigned to
    its highest probability topic. Each topic is labelled with its highest
    probability word.
●   If a topic has fewer than 60 items then stop. Otherwise go
    back to the first step, using the items assigned to that topic
    as the corpus.
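
The recursive procedure above can be sketched as follows. This is a rough illustration, not the authors' code: the topic model is passed in as a callable `fit_topics` (a hypothetical interface), and `toy_topics` is a deliberately trivial stand-in that is NOT LDA, used only so the recursion can be run without extra libraries. The guard that stops when a split puts every item in one topic is also an assumption, added to guarantee termination.

```python
from collections import Counter, defaultdict

MIN_ITEMS = 60  # stopping threshold from the slide


def build_tree(items, fit_topics, min_items=MIN_ITEMS):
    """Recursively split `items` (item_id -> list of words) into topic nodes.

    fit_topics(items) must return (item_topic, topic_word), where
    item_topic[item_id] maps topic -> probability and
    topic_word[topic] maps word -> probability.
    """
    node = {"items": sorted(items), "children": []}
    if len(items) < min_items:
        return node  # fewer than the threshold: stop here
    item_topic, topic_word = fit_topics(items)
    groups = defaultdict(dict)
    for item_id, probs in item_topic.items():
        best = max(probs, key=probs.get)  # highest probability topic
        groups[best][item_id] = items[item_id]
    if len(groups) < 2:
        return node  # assumed guard: no real split, so do not recurse
    for topic, members in groups.items():
        label = max(topic_word[topic], key=topic_word[topic].get)
        child = build_tree(members, fit_topics, min_items)
        node["children"].append((label, child))
    return node


def toy_topics(items):
    """Stand-in for a topic model (not LDA): one 'topic' per top-2 corpus word.

    Assumes every item has at least one word.
    """
    counts = Counter(w for words in items.values() for w in words)
    top = [w for w, _ in counts.most_common(2)]
    item_topic = {}
    for item_id, words in items.items():
        raw = {t: 1 + words.count(w) for t, w in enumerate(top)}
        total = sum(raw.values())
        item_topic[item_id] = {t: v / total for t, v in raw.items()}
    topic_word = {t: {w: 1.0} for t, w in enumerate(top)}
    return item_topic, topic_word


corpus = {
    "a": ["ring", "gold"], "b": ["ring", "silver"],
    "c": ["vase", "clay"], "d": ["vase", "glass"],
}
tree = build_tree(corpus, toy_topics, min_items=3)  # tiny threshold for the demo
```

In practice `fit_topics` would wrap a real LDA implementation; the recursion itself does not depend on which one is used.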


Hierarchical LDA topics (example)




    Bangle → design → design → design → brooch → collection




Wikipedia link frequencies
●   Novel approach.
●   Run Wikipedia Miner to find links in all Europeana
    items – use title, subject and description.
●   Find frequency counts for each link.
●   For each item take the set of links found.
●   Create taxonomy branch (if not already present)
    with links in order of frequency (most frequent first).
●   Map item to least frequent link.
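
A minimal sketch of the steps above, assuming the links have already been extracted for each item (the Wikipedia Miner step is not reproduced here). Two details are assumptions not stated on the slide: a link's frequency is counted as the number of items it occurs in, and ties are broken alphabetically. The item ids and link names are made up for illustration.

```python
from collections import Counter


def new_node():
    return {"children": {}, "items": []}


def build_wikifreq(item_links):
    """Build a frequency-ordered hierarchy from item_id -> iterable of link titles."""
    # Frequency of a link = number of items it occurs in (assumption).
    freq = Counter(link for links in item_links.values() for link in set(links))
    root = new_node()
    for item_id, links in item_links.items():
        # Most frequent link first; alphabetical tie-break is an assumption.
        path = sorted(set(links), key=lambda name: (-freq[name], name))
        if not path:
            continue  # nothing to attach the item to
        node = root
        for link in path:
            # Create the branch only if it is not already present.
            node = node["children"].setdefault(link, new_node())
        node["items"].append(item_id)  # item lands on its least frequent link
    return root, freq


# Hypothetical items and links for illustration.
links = {
    "item1": ["Design", "Industrial design", "Design Council"],
    "item2": ["Design", "Industrial design"],
    "item3": ["Design"],
}
root, freq = build_wikifreq(links)
design = root["children"]["Design"]
industrial = design["children"]["Industrial design"]
council = industrial["children"]["Design Council"]
```

Because branches are ordered by corpus-wide frequency, items that share their most frequent links share the upper part of a branch and only diverge at the rarer, more specific links.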


Wikipedia link frequencies (cont.)
●   Large number of concept nodes - limit to 24
    children for each node.
●   Require at least 2 links for each item - filter out
    items with little metadata.
●   Filter out concepts with fewer than 20 items.


                   industrial design → design council



Statistics

Type        Taxonomy      Items     Nodes     Avg. parents   Avg. depth   Top nodes
Manual      LCSH          99259     285238    1.8            1.97         28901
            DBpedia       178312    273       4.2            2            30
            WikiTax       275359    121359    11.7           1.13         10417
            WN domains    308687    170       7.1            7.1          6
Automatic   LDA topics    545896    22494     1              7.3          9
            Wiki Freq     66558     502       1              3.39         24




Evaluation - cohesion
●   Intruder detection was originally proposed in (Chang et al.,
    2009). A cohesive unit is defined as one in which the
    items are similar while at the same time different from
    items in other clusters.
●   Present 5 items to each annotator: 4 from one concept
    node, and an intruder item chosen at random from elsewhere in
    the taxonomy. The more cohesive the unit, the more
    obvious the intruder will be.
●   Crowd-sourcing: 111 annotators, 30 units from each
    taxonomy. 1255 answers – average 7 annotators for
    each unit.
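
The unit construction above can be sketched as below. Building a unit follows the slide (4 items from one node plus 1 intruder). The slide does not state how annotator answers are turned into a cohesive/not cohesive decision, so the majority rule in `is_cohesive` and its `threshold` parameter are assumptions made purely for illustration.

```python
import random


def make_unit(node_items, other_items, rng):
    """Return (unit, intruder): 4 items from one node plus 1 item from elsewhere."""
    members = rng.sample(node_items, 4)   # node_items must be a list with >= 4 items
    intruder = rng.choice(other_items)    # drawn from the rest of the taxonomy
    unit = members + [intruder]
    rng.shuffle(unit)                     # hide the intruder's position
    return unit, intruder


def is_cohesive(answers, intruder, threshold=0.5):
    """Assumed rule: cohesive if more than `threshold` of answers name the intruder."""
    hits = sum(1 for answer in answers if answer == intruder)
    return hits / len(answers) > threshold


rng = random.Random(0)  # fixed seed so the sketch is repeatable
unit, intruder = make_unit(["a", "b", "c", "d", "e"], ["x", "y"], rng)
```

A fixed seed keeps the example repeatable; a real study would draw units independently for each concept node.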

Example of a cohesive unit




Evaluation - cohesion results

Type        Taxonomy        Cohesive units   Percentage
Manual      LCSH            19               63.3
            DBpedia         17               56.7
            Wiki Taxonomy   18               60.0
            WN domains      15               50.0
Automatic   LDA topics      17               56.7
            Wiki Freq       29               96.7

Number of cohesive units (out of a possible 30)
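
As a quick check, the percentage column follows directly from the unit counts, since each taxonomy was evaluated on 30 units. The short sketch below recomputes it; the variable names are illustrative only.

```python
UNITS_PER_TAXONOMY = 30  # units evaluated per taxonomy, from the slide

cohesive_units = {
    "LCSH": 19,
    "DBpedia": 17,
    "Wiki Taxonomy": 18,
    "WN domains": 15,
    "LDA topics": 17,
    "Wiki Freq": 29,
}

# Percentage of cohesive units, rounded to one decimal place as in the table.
percentage = {
    name: round(100 * count / UNITS_PER_TAXONOMY, 1)
    for name, count in cohesive_units.items()
}
```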


Evaluation - relation classification
●   Previous work has typically used a simple boolean
    question “is it true that ChildNode is-a ParentNode?”
●   We ask two questions for each child-parent pair A and
    B:
     –   Are the concepts A and B related?
     –   If they are, is A more specific than B, less specific
         than B, or neither?
●   Crowd-sourcing: 173 annotators, 40 pairs from each
    taxonomy, each pair evaluated on average 16 times



Evaluation - example pairs
Taxonomy        Child (A)              Parent (B)
LCSH            Work                   Human Behaviour
                Braid                  Weaving
DBpedia         Mountain Range         Place
                Fern                   Plant
Wiki Taxonomy   Mammals of Africa      Wildlife of Africa
                Schools in Wiltshire   Schools in England
WN domains      vehicles               transport
                mechanics              engineering
LDA topics      earthenware            dish
                view                   church
Wiki Freq       Corrosion              Coin
                Interior Design        Industrial Design

Are A and B related?

Taxonomy        Yes (%)   No (%)   Don't know (%)
LCSH            74.2      8.8      17.0
DBpedia         86.6      11.2     2.2
Wiki Taxonomy   96.1      1.7      2.3
WN domains      77.1      14.5     8.4
LDA topics      30.3      50.3     19.3
Wiki Freq       47.6      16.5     35.8




Which is more specific?

Taxonomy        A<B (%)   A>B (%)   Neither (%)   Don't know (%)
LCSH            65.4      8.7       23.4          2.5
DBpedia         76.2      4.9       18.1          0.7
Wiki Taxonomy   78.3      4.7       16.0          0.9
WN domains      63.6      6.3       28.0          2.0
LDA topics      21.4      14.8      62.1          1.6
Wiki Freq       30.9      22.6      43.6          2.9



Conclusions
●   Wikipedia Taxonomy is conceptually well organised,
    even better than LCSH, which has been widely used
    for organising library collections.
●   WikiFreq gives very high cohesion for items,
    although the conceptual relations are not well
    defined.
●   Future work continues with different intrinsic and
    user evaluations. Also aim to combine Wikipedia
    Taxonomy and WikiFreq to get the best of both.



The End

s.fernando@sheffield.ac.uk

Supported by the PATHS project http://paths-project.eu
Funded by the European Community's Seventh Framework
Programme (FP7/2007-2013) under grant agreement no.
270082. This research was also partially funded by the Ministry
of Economy under grant TIN2009-14715-C04-01 (KNOW2
project).

                      Questions?

PATHS Second prototype-functional-spec
 

Kürzlich hochgeladen

ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfSpandanaRallapalli
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptxSherlyMaeNeri
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfphamnguyenenglishnb
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)cama23
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomnelietumpap1
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 

Kürzlich hochgeladen (20)

ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdf
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choom
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 

Comparing taxonomies for organising collections of documents presentation

  • 1. Comparing taxonomies for organising collections of documents Samuel Fernando, Mark Hall, Eneko Agirre, Aitor Soroa, Paul Clough, Mark Stevenson COLING 2012, 14th December 2012, Mumbai, India
  • 2. Introduction
      ● Large collections of diverse data are available online. PATHS project aims to support user exploration in digital library collections.
      ● Search box is useful but taxonomies are better suited for exploration and browsing.
      ● We apply taxonomies to organise data from a large digital library collection.
      ● Process is automatic – either map items to an existing taxonomy, or induce a taxonomy from the data.
  • 3. Evaluation data
      ● We use items from Europeana, a large online collection of cultural heritage.
      ● Use English subset, approx. 550,000 items.
      ● Item typically contains a picture, a title, description and subject keywords.
      ● Very diverse data comprising artifacts, places, people. Topics include fashion, archaeology, architecture and many other subjects.
      ● Data from many providers, some of which use taxonomies, some don’t – need unified approach
  • 4. Example item
      Title: Design Council Slide Collection
      Subject: colour, exhibitions, industrial design
      Description: Display on the theme of colour matching at the Design Centre, London, 1960
  • 5. Manually created taxonomies
      ● We use four existing manually created taxonomies:
          – LCSH (Library of Congress)
          – WordNet domains
          – Wikipedia Taxonomy
          – DBpedia ontology
      ● The taxonomies already exist and are of good quality - but problem is to map Europeana items into the correct place in the taxonomy.
  • 6. LCSH
      ● A controlled vocabulary maintained by the US Library of Congress for bibliographic records.
      ● Used by libraries to organise collections and also by curators of cultural heritage.
      ● Subject keywords are used to map Europeana items into the appropriate LCSH category nodes.
      Example:
          industrial design → design
          creation (literary, artistic, etc.) → intellect
          +30 more higher level headings
  • 7. WordNet domains
      ● WordNet domains (Bernardo Magnini, LREC 2000) applies a small set of 164 domain labels to each of the WordNet synsets.
      ● Again use subject keywords to map Europeana items - first to Yago2 (for proper nouns) then to synset and finally to WordNet domain label.
      Example:
          tourism → social
          color → factotum
          art → humanities
          + 5 more
  • 8. Wikipedia Taxonomy
      ● Wikipedia category hierarchy preserving only is-a relations - all others are discarded.
      ● Use Wikipedia Miner over each Europeana item to identify Wikipedia articles in the subject keywords. Then map item to all categories that contain these articles
      Example:
          design → visual_arts → criticism
          image_processing → digital_signal_processing → signal_processing
          museology → museums → educational_organizations → organizations
          +35 more
  • 9. DBpedia ontology
      ● A formalised shallow ontology manually created based on Wikipedia (with inference capability).
      ● Again use Wikipedia Miner to find Wikipedia articles in subject keywords of each item and map item to the categories to which these articles belong.
      Example: musical_work → work work album → musicalwork → work
  • 10. Automatic data-derived taxonomies
      ● We use two approaches to derive taxonomies automatically from the Europeana data.
          – LDA (Latent Dirichlet Allocation) topic modelling
          – WikiFreq (Wikipedia Frequency hierarchy)
      ● Taxonomies fit data - no unnecessary nodes to prune.
      ● Mapping from items to concept nodes is implicit during derivation.
  • 11. LDA topic modelling
      ● Latent Dirichlet Allocation (LDA) maps each item to one or more topics.
      ● Distribution of items over topics - each topic is a distribution over words
      ● Item-topic and topic-word distributions are learned using collapsed Gibbs sampling
      ● Has been used for improving results from IR
      ● Previous work has developed hierarchical LDA but this is infeasible over our large data set
  • 12. Hierarchical LDA topics
      ● Run LDA over corpus to determine item-topic probabilities.
      ● Identify set of items for each topic. Each item assigned to highest probability topic. Topic labelled with highest probability word.
      ● If a topic has fewer than 60 items then stop. Otherwise go back to the first step, using the set of items identified for that topic as the corpus.
  • 13. Hierarchical LDA topics (example)
      Bangle → design → design → design → brooch → collection
  • 14. Wikipedia link frequencies
      ● Novel approach.
      ● Run Wikipedia Miner to find links in all Europeana items – use title, subject and description.
      ● Find frequency counts for each link.
      ● For each item take the set of links found.
      ● Create taxonomy branch (if not already present) with links in order of frequency (most frequent first).
      ● Map item to least frequent link.
  • 15. Wikipedia link frequencies (cont.)
      ● Large number of concept nodes - limit to 24 children for each node.
      ● Require at least 2 links for each item - filter out items with little metadata.
      ● Filter out concepts with fewer than 20 items.
      Example: industrial design → design council
  • 16. Statistics
      Type       Taxonomy     Items    Nodes    Avg. parents  Avg. depth  Top nodes
      Manual     LCSH          99259   285238   1.8           1.97        28901
                 DBpedia      178312      273   4.2           2              30
                 WikiTax      275359   121359   11.7          1.13        10417
                 WN domains   308687      170   7.1           7.1             6
      Automatic  LDA topics   545896    22494   1             7.3             9
                 Wiki Freq     66558      502   1             3.39           24
  • 17. Evaluation - cohesion
      ● Intruder detection was originally proposed in (Chang et al., 2009). A cohesive unit is defined as one in which the items are similar while at the same time different from items in other clusters.
      ● Present 5 items to each annotator: 4 from one concept node, and an intruder item chosen randomly from elsewhere in the taxonomy. The more cohesive the unit, the more obvious the intruder will be.
      ● Crowd-sourcing: 111 annotators, 30 units from each taxonomy. 1255 answers – average 7 annotators for each unit
  • 18. Example of a cohesive unit
  • 19. Evaluation - cohesion results
      Type       Taxonomy       Cohesive units  Percentage
      Manual     LCSH           19              63.3
                 DBpedia        17              56.7
                 Wiki Taxonomy  18              60.0
                 WN domains     15              50.0
      Automatic  LDA topics     17              56.7
                 Wiki Freq      29              96.7
      Number of cohesive units (out of a possible 30)
  • 20. Evaluation - relation classification
      ● Previous work has typically used a simple boolean question “is it true that ChildNode is-a ParentNode?”
      ● We ask two questions for each child-parent pair A and B:
          – Are the concepts A and B related?
          – If they are, is A more specific than B, less specific than B, or neither?
      ● Crowd sourcing: 173 annotators, 40 pairs from each taxonomy, each pair evaluated on average 16 times
  • 21. Evaluation - example pairs
      Taxonomy       Child (A)             Parent (B)
      LCSH           Work                  Human Behaviour
                     Braid                 Weaving
      DBpedia        Mountain Range        Place
                     Fern                  Plant
      Wiki Taxonomy  Mammals of Africa     Wildlife of Africa
                     Schools in Wiltshire  Schools in England
      WN domains     vehicles              transport
                     mechanics             engineering
      LDA topics     earthenware           dish
                     view                  church
      Wiki Freq      Corrosion             Coin
                     Interior Design       Industrial Design
  • 22. Are A and B related?
      Taxonomy       Yes   No    Don't know
      LCSH           74.2   8.8  17.0
      DBpedia        86.6  11.2   2.2
      Wiki Taxonomy  96.1   1.7   2.3
      WN domains     77.1  14.5   8.4
      LDA topics     30.3  50.3  19.3
      Wiki Freq      47.6  16.5  35.8
  • 23. Which is more specific?
      Taxonomy       A<B   A>B   Neither  Don't know
      LCSH           65.4   8.7  23.4     2.5
      DBpedia        76.2   4.9  18.1     0.7
      WikiTaxonomy   78.3   4.7  16.0     0.9
      WN domains     63.6   6.3  28.0     2.0
      LDA topics     21.4  14.8  62.1     1.6
      Wiki Freq      30.9  22.6  43.6     2.9
  • 24. Conclusions
      ● Wikipedia Taxonomy is conceptually well organised, even better than LCSH which has been widely used for organising library collections.
      ● WikiFreq gives very high cohesion for items although the conceptual relations are not well defined.
      ● Future work continues with different intrinsic and user evaluations. Also aim to combine Wikipedia Taxonomy and WikiFreq to get the best of both.
  • 25. The End
      s.fernando@sheffield.ac.uk
      Supported by the PATHS project http://paths-project.eu
      Funded by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 270082. This research was also partially funded by the Ministry of Economy under grant TIN2009-14715-C04-01 (KNOW2 project).
      Questions?
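
The WikiFreq procedure on slides 14 and 15 can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the Wikipedia links for each item have already been extracted (the slides use Wikipedia Miner for this), counts a link's frequency as the number of items it occurs in, and breaks frequency ties alphabetically. The function name `build_wikifreq` and the sample data are hypothetical, and the limit of 24 children per node and the 20-item concept filter are left out for brevity.

```python
from collections import Counter


def build_wikifreq(item_links, min_links=2):
    """Build a frequency-ordered taxonomy from per-item link sets.

    item_links: dict mapping item id -> set of Wikipedia link titles
                (assumed to be extracted beforehand).
    Returns (tree, assignment): tree is a nested dict of link -> children,
    assignment maps each kept item to the path of the node it lands on.
    """
    # Frequency of a link = number of items it occurs in (an assumption).
    freq = Counter(link for links in item_links.values() for link in set(links))

    tree = {}
    assignment = {}
    for item_id, links in item_links.items():
        links = set(links)
        if len(links) < min_links:
            continue  # items with little metadata are filtered out
        # Most frequent link first; alphabetical tie-break is an assumption.
        path = sorted(links, key=lambda link: (-freq[link], link))
        node = tree
        for link in path:
            # Create the branch only if it is not already present.
            node = node.setdefault(link, {})
        # The item maps to its least frequent link, the last node on the path.
        assignment[item_id] = tuple(path)
    return tree, assignment


# Hypothetical items and links, for illustration only.
items = {
    "a": {"Design", "Industrial design"},
    "b": {"Design", "Industrial design", "Design Council"},
    "c": {"Design"},
}
tree, assignment = build_wikifreq(items)
```

Because the branch for an item is ordered by corpus-wide frequency, general concepts end up near the root and rare ones at the leaves, which is consistent with the high item cohesion but loosely defined parent-child relations reported in the results.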
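
The intruder-detection setup on slide 17 can be sketched in the same spirit. The Python below assembles one unit (4 items from a concept node plus 1 intruder from elsewhere) and computes the share of annotators who picked the intruder. The function names and sample data are hypothetical, and the slides do not state the threshold at which a unit counts as cohesive, so none is applied here.

```python
import random


def make_intruder_unit(node_items, node, rng, unit_size=4):
    """Return (unit, intruder): unit_size items from `node` plus one outsider.

    node_items: dict mapping concept node -> list of item ids.
    rng: a random.Random instance, passed in so that runs are repeatable.
    """
    own = rng.sample(node_items[node], unit_size)
    members = set(node_items[node])
    # Candidate intruders: items of other nodes that are not also in `node`.
    outside = [
        item
        for other, items in node_items.items()
        if other != node
        for item in items
        if item not in members
    ]
    intruder = rng.choice(outside)
    unit = own + [intruder]
    rng.shuffle(unit)  # hide the intruder's position from annotators
    return unit, intruder


def intruder_hit_rate(answers, intruder):
    """Fraction of annotator answers that identified the intruder."""
    return sum(answer == intruder for answer in answers) / len(answers)


# Hypothetical nodes and items, for illustration only.
nodes = {
    "design": ["d1", "d2", "d3", "d4", "d5"],
    "coins": ["c1", "c2", "c3"],
}
unit, intruder = make_intruder_unit(nodes, "design", random.Random(0))
```

A high hit rate suggests that the four items from the node look alike and differ from the rest of the taxonomy, which is the notion of cohesion the evaluation relies on.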