Type inference through the analysis of Wikipedia links
Andrea Giovanni Nuzzolese (nuzzoles@cs.unibo.it)
Aldo Gangemi (aldo.gangemi@cnr.it)
Valentina Presutti (valentina.presutti@cnr.it)
Paolo Ciancarini (ciancarini@cs.unibo.it)
stlab.istc.cnr.it
16 April 2012 - Lyon, France - LDOW 2012
Outline

• Motivations
• Materials
• Applied methods
• Results
• Conclusions




Motivations
✦ Only a subset of the DBpedia resources is typed with the DBpedia ontology (DBPO)
✦ The typing procedure is top-down.
✦ Is the DBPO complete with respect to the DBpedia domain?
✦ How good and homogeneous is the granularity of DBPO types?

Resources used in wikilinks relations: 15,944,381
Resources having a DBPO type: 1,518,697



Materials
DBpedia 3.6

Dataset                                     # of triples
wikilink triples                            107,892,317
infobox mapping-based “data” triples          9,357,273
rdfs:label triples                            7,972,225
rdf:type triples                              6,173,940
infobox mapping-based “object” triples        4,251,239

Wikilinks triples: 107,892,317
Wikilink triples with typed subject/object: 16,745,830
DBpedia ontology: 272 classes




What we did
• Wikilinks of a DBpedia resource convey knowledge that
 can be used for classifying it.
• Classification methods
 ✦   Inductive learning: k-Nearest Neighbor algorithm
 ✦   Abductive classification based on EKPs [1] and homotypes used as
     background knowledge
• The methods were performed on
 ✦   Resources having a DBPO type: 1,518,697
 ✦   Resources used in wikilinks relations: 15,944,381
 ✦   Sample of untyped resources: 1,000



[1] A. G. Nuzzolese, A. Gangemi, V. Presutti, and P. Ciancarini. Encyclopedic Knowledge Patterns from Wikipedia Links. In L. Aroyo, N. Noy, and C. Welty, editors, Proceedings of the 10th International Semantic Web Conference (ISWC2011), pages 520-536. Springer, 2011.
Inductive classification
• We designed two inductive classification
  experiments based on the k-NN algorithm
 ✦ on 272 features, i.e., all the classes in the DBPO
 ✦ on 27 features, i.e., the top-level classes in the DBPO hierarchy
• For each experiment we built a labeled feature
  space model as training set by using a randomly
  sampled 20% of typed resources
 ✦ the algorithms were tested on the remaining 80% of typed resources
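
The experimental setup above can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the toy data, the value of k, the Hamming distance, and the helper names (`split_train_test`, `knn_predict`, `precision`) are all assumptions. The slides only state that k-NN was trained on a random 20% of the typed resources and tested on the remaining 80%.

```python
import random
from collections import Counter

def split_train_test(examples, train_fraction=0.2, seed=0):
    """Randomly split labelled examples into a training and a test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def hamming(a, b):
    """Number of positions where two binary feature vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train, vector, k=3):
    """Majority label among the k training vectors closest to `vector`."""
    nearest = sorted(train, key=lambda ex: hamming(ex[0], vector))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def precision(train, test, k=3):
    """Fraction of test examples whose predicted label matches the true one."""
    correct = sum(knn_predict(train, vec, k) == label for vec, label in test)
    return correct / len(test)

# Hypothetical toy data: (binary feature vector, class label).
# Assumed feature order: [Company, City, Magazine, Band]
examples = (
    [((1, 1, 1, 0), "Person")] * 10
    + [((0, 0, 0, 1), "MusicalArtist")] * 10
)

train, test = split_train_test(examples)   # 20% training, 80% test
print(len(train), len(test))
print(precision(train, test, k=1))
```

In the real experiments each vector would have 272 or 27 positions, one per DBPO class used as a feature.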


Building the training set for K-Nearest Neighbor algorithm

[Figure: dbpedia:Steve_Jobs (rdf:type dbpo:Person) is connected by dbpo:wikiPageWikiLink to dbpedia:Apple_Inc. and dbpedia:NeXT (rdf:type dbpo:Organisation), dbpedia:Forbes (rdf:type dbpo:Magazine) and dbpedia:Cupertino,_California (rdf:type dbpo:City). In the last step the linked resources are replaced by their types, which are connected to dbpedia:Steve_Jobs by kp:linksTo.]

                      Mammal   Scientist   Company   Drug   City   Magazine   Class
dbpedia:Steve_Jobs      0          0          1        0      1       1       dbpo:Person
...                    ...        ...        ...      ...    ...     ...      ...

✦ Precision using all DBPO types as features: 31.65%
✦ Precision using the top-level of DBPO as features: 40.27%
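
The construction of one training row can be sketched as below. This is an illustrative reading of the slide, not the authors' code: the dictionaries `wikilinks` and `types`, the helper names, and the choice of giving the two companies the type `Company` (so that they fall in the Company column of the table) are assumptions.

```python
# Hypothetical toy graph mirroring the Steve_Jobs example on the slide.
wikilinks = {
    "dbpedia:Steve_Jobs": [
        "dbpedia:Apple_Inc.",
        "dbpedia:NeXT",
        "dbpedia:Forbes",
        "dbpedia:Cupertino,_California",
    ],
}

# rdf:type of each resource. Assumption: the two companies are typed as
# Company so that they match the "Company" column of the table.
types = {
    "dbpedia:Steve_Jobs": "Person",
    "dbpedia:Apple_Inc.": "Company",
    "dbpedia:NeXT": "Company",
    "dbpedia:Forbes": "Magazine",
    "dbpedia:Cupertino,_California": "City",
}

# Feature columns, in the order shown in the table.
FEATURES = ["Mammal", "Scientist", "Company", "Drug", "City", "Magazine"]

def feature_vector(resource, wikilinks, types, features=FEATURES):
    """Return 1 for every class that types at least one linked resource."""
    linked = wikilinks.get(resource, [])
    linked_types = {types[t] for t in linked if t in types}
    return [1 if f in linked_types else 0 for f in features]

def training_row(resource, wikilinks, types):
    """Pair the feature vector with the resource's own type (the label)."""
    return feature_vector(resource, wikilinks, types), types[resource]

print(training_row("dbpedia:Steve_Jobs", wikilinks, types))
# prints ([0, 0, 1, 0, 1, 1], 'Person')
```

The features are binary here, as in the table: two links to companies still yield a single 1 in the Company column.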
Abductive classification with
           EKPs

• EKPs
 ✦   An EKP of an entity type is a small
     vocabulary that captures the core
     types used to describe that entity
     type, as it emerges from the
     Wikipedia crowds


                      visit aemoo.org for an exploratory tool based on EKPs
How can we infer the type of “Galileo Galilei”?
• We know its path types
• We have 231 EKPs
• We compare the path types involving “Galileo Galilei” as subject with EKPs
  in order to identify the most similar, which is the "Scientist" EKP.
• The inferred type for the resource “Galileo Galilei” is the class “Scientist”

http://www.aemoo.org
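
The comparison step can be sketched as follows. The slides do not say which similarity measure is used, so the Jaccard coefficient below is an assumption, as are the function names and the toy contents of the EKPs and of the path types of "Galileo Galilei".

```python
def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def classify_with_ekps(path_types, ekps):
    """Return the class whose EKP vocabulary best matches `path_types`."""
    return max(ekps, key=lambda cls: jaccard(path_types, ekps[cls]))

# Hypothetical EKPs: class -> set of object types in its core paths.
ekps = {
    "Scientist": {"Scientist", "University", "Award", "City"},
    "MusicalArtist": {"Band", "Album", "MusicalArtist", "City"},
    "City": {"Country", "City", "River"},
}

# Hypothetical object types reached by wikilinks from "Galileo Galilei".
galileo_path_types = {"Scientist", "University", "City"}

print(classify_with_ekps(galileo_path_types, ekps))  # prints Scientist
```

With these toy sets the Scientist EKP shares 3 of 4 types with the resource, while the other two EKPs share only City, so Scientist is selected.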
Distinctive weakness of some EKPs
✦ The distinctive weakness seems due to wide overlaps among some EKPs
✦ Systematic ambiguity of the 4 largest classes
✦ Precision and recall on all DBPO types: both 44.4%
✦ Precision and recall on the top-level of DBPO hierarchy: 36.5% and 79.5%
Homotype-based abductive classification
• Homotypes are wikilinks that have the same
 type on both the subject and the object of the
 triple
   dbpedia:Immanuel_Kant (rdf:type dbpo:Philosopher)
     dbpo:wikiPageWikiLink
   dbpedia:Plato (rdf:type dbpo:Philosopher)
• We have observed that the homotype is usually
 the most frequent wikilink type (or among the top 3)
• Given an untyped entity, we hypothesize that
 the most frequent type involved in its ingoing/outgoing
 wikilinks detects its homotype, hence
 it indicates its type
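
The hypothesis above can be sketched as a simple majority vote over the types of linked resources. The function name, the data layout (a list of subject/object pairs plus a type dictionary) and the toy links are assumptions made for illustration; Immanuel_Kant is treated as untyped here only to show the inference.

```python
from collections import Counter

def infer_type_by_homotype(resource, wikilinks, types):
    """Guess the type of `resource` as the most frequent type among the
    typed resources it links to or is linked from."""
    neighbours = []
    for subject, obj in wikilinks:
        if subject == resource:
            neighbours.append(obj)      # outgoing wikilink
        elif obj == resource:
            neighbours.append(subject)  # ingoing wikilink
    counts = Counter(types[n] for n in neighbours if n in types)
    if not counts:
        return None  # no typed neighbour: nothing to infer from
    return counts.most_common(1)[0][0]

# Hypothetical wikilinks as (subject, object) pairs.
wikilinks = [
    ("dbpedia:Immanuel_Kant", "dbpedia:Plato"),
    ("dbpedia:Immanuel_Kant", "dbpedia:Aristotle"),
    ("dbpedia:David_Hume", "dbpedia:Immanuel_Kant"),
    ("dbpedia:Immanuel_Kant", "dbpedia:Prussia"),
]

# Known types; Immanuel_Kant is deliberately left out.
types = {
    "dbpedia:Plato": "dbpo:Philosopher",
    "dbpedia:Aristotle": "dbpo:Philosopher",
    "dbpedia:David_Hume": "dbpo:Philosopher",
    "dbpedia:Prussia": "dbpo:Country",
}

print(infer_type_by_homotype("dbpedia:Immanuel_Kant", wikilinks, types))
# prints dbpo:Philosopher
```

A resource whose neighbours are all untyped yields `None`, which matches the later observation that a sparse typed link structure limits this method.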
Results on classifying already typed resources
[Figure: results chart; values not recoverable from the slide text]
Results on untyped resources
 • Results on a sample of 1,000 untyped resources
  are much less satisfactory
[Figures: results with EKPs and with Homotypes; values not recoverable from the slide text]
Why? [1]

• Typed entities: 2:3 typed wikilinks ratio

• Untyped entities: 1:3 typed wikilinks ratio

• Link structure for untyped entities is not rich
 enough




Why? [2]

• DBPO does not provide a complete set of
  classes for correctly typing DBpedia resources

dbpedia:List_of_FIFA_World_Cup_finals     Collection
dbpedia:Computer_Science                  ScientificDiscipline
dbpedia:Counterattack                     Plan
dbpedia:Eros(concept)                     Concept
dbpedia:Gentlemen’s_agreement             Agreement


Conclusions
• We have investigated different approaches for
 typing DBpedia resources based on the data set
 of wikilinks

• Results are acceptable on the test set, but
 extensive untypedness in outgoing links and poor
 DBPO coverage severely compromise automatic
 typing for untyped resources

• We have analyzed possible causes deriving from
 some bias in DBpedia
Future work
• YAGO could be helpful but
 ✦ there is a lack of mapping between YAGO and DBPO
 ✦ it has larger coverage and only partially overlaps with DBPO
 ✦ the granularity of its categories is finer, and not easily
   reusable, because the top level is very large




Thank you
           Andrea Nuzzolese
                    -
           STLab, ISTC-CNR
                    &
Dipartimento di Scienze dell’Informazione
         University of Bologna
                  Italy


Path popularity
PREFIX dbpo: <http://dbpedia.org/ontology/>

nRes(dbpo:MusicalArtist): number of resources having type dbpo:MusicalArtist
[Figure: Anthony_Kiedis, Chad_Smith, Michael_Jackson, Jackie_Jackson, John_Lennon and Paul_McCartney, each with rdf:type dbpo:MusicalArtist]

nSubjectRes(Pi,j): number of distinct resources that participate in a path as subjects
[Figure: path Pi,j from subject type Si = dbpo:MusicalArtist to object type Oj = dbpo:Band via dbpo:wikiPageWikiLink; resources shown: Dave_Grohl, Michael_Jackson, Jackie_Jackson, John_Lennon, Paul_McCartney, Nirvana, Foo Fighters, Jackson_5, Beatles]

pathPopularity(Pi,j, Si) = nSubjectRes(Pi,j) / nRes(Si)
[Figure: the resources above plus Madonna, Prince, Charlie_Parker and Keith_Jarrett]
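
The three quantities can be computed as in the sketch below. The function names follow the slide labels, but the data layout and the specific wikilinks between artists and bands are assumptions: the edges of the original figures cannot be recovered from the slide text.

```python
def n_res(types, cls):
    """nRes(S): number of resources having type `cls`."""
    return sum(1 for t in types.values() if t == cls)

def n_subject_res(wikilinks, types, subject_cls, object_cls):
    """nSubjectRes(P): distinct subjects of type `subject_cls` that link
    to at least one object of type `object_cls`."""
    return len({
        s for s, o in wikilinks
        if types.get(s) == subject_cls and types.get(o) == object_cls
    })

def path_popularity(wikilinks, types, subject_cls, object_cls):
    """pathPopularity(P, S) = nSubjectRes(P) / nRes(S)."""
    total = n_res(types, subject_cls)
    if total == 0:
        return 0.0
    return n_subject_res(wikilinks, types, subject_cls, object_cls) / total

ARTIST, BAND = "dbpo:MusicalArtist", "dbpo:Band"

artists = ["Dave_Grohl", "Michael_Jackson", "Jackie_Jackson", "John_Lennon",
           "Paul_McCartney", "Madonna", "Prince", "Charlie_Parker",
           "Keith_Jarrett"]
bands = ["Nirvana", "Foo_Fighters", "Jackson_5", "Beatles"]
types = {**{a: ARTIST for a in artists}, **{b: BAND for b in bands}}

# Hypothetical wikilinks (subject, object) from artists to bands.
wikilinks = [
    ("Dave_Grohl", "Nirvana"), ("Dave_Grohl", "Foo_Fighters"),
    ("Michael_Jackson", "Jackson_5"), ("Jackie_Jackson", "Jackson_5"),
    ("John_Lennon", "Beatles"), ("Paul_McCartney", "Beatles"),
]

# 5 distinct artists link to a band, out of 9 artists in total.
print(n_subject_res(wikilinks, types, ARTIST, BAND), n_res(types, ARTIST))
print(path_popularity(wikilinks, types, ARTIST, BAND))
```

Dave_Grohl links to two bands but is counted once, since nSubjectRes counts distinct subject resources rather than triples.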

DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 

Type inference through the analysis of Wikipedia links

  • 1. Type inference through the analysis of Wikipedia links. Andrea Giovanni Nuzzolese (nuzzoles@cs.unibo.it), Aldo Gangemi (aldo.gangemi@cnr.it), Valentina Presutti (valentina.presutti@cnr.it), Paolo Ciancarini (ciancarini@cs.unibo.it). stlab.istc.cnr.it. 16 April 2012 - Lyon, France - LDOW 2012
  • 2. Outline • Motivations • Materials • Applied methods • Results • Conclusions
  • 3. Motivations ✦ Only a subset of the DBpedia resources is typed with the DBpedia ontology (DBPO) ✦ The typing procedure is top-down. ✦ Is the DBPO complete with respect to the DBpedia domain? ✦ How good and homogeneous is the granularity of DBPO types? [Figure: resources used in wikilinks relations: 15,944,381; resources having a DBPO type: 1,518,697]
  • 4. Materials (DBpedia 3.6)
      Dataset                                   # of triples
      wikilink triples                          107,892,317
      infobox mapping-based “data” triples      9,357,273
      rdfs:label triples                        7,972,225
      rdf:type triples                          6,173,940
      infobox mapping-based “object” triples    4,251,239
      DBpedia ontology: 272 classes
      Wikilink triples with typed subject/object: 16,745,830
  • 5. What we did • Wikilinks of a DBpedia resource convey knowledge that can be used for classifying it. • Classification methods ✦ Inductive learning: k-Nearest Neighbor algorithm ✦ Abductive classification based on EKPs [1] and homotypes used as background knowledge • The methods were performed on: resources having a DBPO type (1,518,697); a sample of untyped resources (1,000). [Figure context: resources used in wikilinks relations: 15,944,381] [1] A. G. Nuzzolese, A. Gangemi, V. Presutti, and P. Ciancarini. Encyclopedic Knowledge Patterns from Wikipedia Links. In L. Aroyo, N. Noy, and C. Welty, editors, Proceedings of the 10th International Semantic Web Conference (ISWC2011), pages 520-536. Springer, 2011.
  • 6. Inductive classification • We designed two inductive classification experiments based on the k-NN algorithm ✦ on 272 features, i.e., all the classes in the DBPO ✦ on 27 features, i.e., the top-level classes in the DBPO hierarchy • For each experiment we built a labeled feature space model as the training set from a randomly sampled 20% of the typed resources ✦ the algorithms were tested on the remaining 80% of the typed resources
  • 7. Building the training set for the k-Nearest Neighbor algorithm [Figure: dbpo:wikiPageWikiLink relations among dbpedia:Steve_Jobs, dbpedia:Apple_Inc., dbpedia:NeXT, dbpedia:Forbes, dbpedia:Cupertino,_California. Feature table columns: Mammal, Scientist, Company, Drug, City, Magazine, Class; first row: dbpedia:Steve_Jobs, not yet filled in]
  • 8. Building the training set for the k-Nearest Neighbor algorithm [Figure: the same resources annotated through rdf:type with dbpo:Organisation, dbpo:Magazine, dbpo:City and dbpo:Person. Feature table row: dbpedia:Steve_Jobs, Class = dbpo:Person]
  • 9. Building the training set for the k-Nearest Neighbor algorithm [Figure: dbpedia:Steve_Jobs kp:linksTo dbpo:Organisation, dbpo:Magazine, dbpo:City. Feature table row: dbpedia:Steve_Jobs: Mammal = 0, Scientist = 0, Company = 1, Drug = 0, City = 1, Magazine = 1, Class = dbpo:Person]
  • 10. Building the training set for the k-Nearest Neighbor algorithm [Figure: as in slide 9, with further rows (...) added to the feature table]
  • 11. Building the training set for the k-Nearest Neighbor algorithm ✦ Precision using all DBPO types as features: 31.65% ✦ Precision using the top-level of DBPO as features: 40.27%
  • 12. Abductive classification with EKPs • EKPs ✦ An EKP (Encyclopedic Knowledge Pattern [1]) of a certain entity type is a small vocabulary that captures the core types used for describing that entity type, as it emerges from the Wikipedia crowds. Visit aemoo.org for an exploratory tool based on EKPs.
  • 13. How can we infer the type of “Galileo Galilei”? http://www.aemoo.org
  • 14. How can we infer the type of “Galileo Galilei”? We know its path types.
  • 15. We have 231 EKPs. We compare the path types involving “Galileo Galilei” as subject with the EKPs in order to identify the most similar one, which is the “Scientist” EKP.
  • 16. The inferred type for the resource “Galileo Galilei” is the class “Scientist”.
  • 17. Distinctive weakness of some EKPs ✦ The distinctive weakness seems to be due to wide overlaps among some EKPs ✦ Systematic ambiguity of the 4 largest classes ✦ Precision and recall on all DBPO types: both 44.4% ✦ Precision and recall on the top-level of the DBPO hierarchy: 36.5% and 79.5%
  • 18. Homotype-based abductive classification • Homotypes are wikilinks that have the same type on both the subject and the object of the triple [Figure: a dbpo:wikiPageWikiLink between dbpedia:Immanuel_Kant and dbpedia:Plato, both with rdf:type dbpo:Philosopher] • We have observed that the homotype is usually the most frequent (or in the top 3) wikilink type • Given an untyped entity, we hypothesize that the most frequent type involved in its incoming/outgoing wikilinks identifies its homotype and hence indicates its type
  • 19. Homotype-based abductive classification [Figure]
  • 20. Homotype-based abductive classification [Figure]
  • 21. Results on classifying already typed resources [Figure]
  • 22. Results on untyped resources • Results on a sample of 1,000 untyped resources are much less satisfactory [Charts: with EKPs; with homotypes]
  • 23. Why? (1) • Typed entities: 2:3 typed wikilinks ratio • Untyped entities: 1:3 typed wikilinks ratio • Link structure for untyped entities is not rich enough
  • 24. Why? (2) • DBPO does not provide a complete set of classes for correctly typing DBpedia resources. Examples (resource → class): dbpedia:List_of_FIFA_World_Cup_finals → Collection; dbpedia:Computer_Science → ScientificDiscipline; dbpedia:Counterattack → Plan; dbpedia:Eros(concept) → Concept; dbpedia:Gentlemen’s_agreement → Agreement
  • 25. Conclusions • We have investigated different approaches for typing DBpedia resources based on the data set of wikilinks • Results are acceptable on the test set, but extensive untypedness in output links and poor DBPO coverage severely compromise automatic typing for untyped resources • We have analyzed possible causes deriving from some bias in DBpedia
  • 26. Future work • YAGO could be helpful but ✦ there is a lack of mapping between YAGO and DBPO ✦ it has larger coverage and only an overlap with DBPO ✦ the granularity of its categories is finer, and not easily reusable, because the top level is very large
  • 27. Thank you. Andrea Nuzzolese - STLab, ISTC-CNR & Dipartimento di Scienze dell’Informazione, University of Bologna, Italy
  • 28. Path popularity • nRes(dbpo:MusicalArtist): number of resources having type dbpo:MusicalArtist [Figure: Anthony_Kiedis, Michael_Jackson, Jackie_Jackson, Chad_Smith, John_Lennon, Paul_McCartney, each with rdf:type dbpo:MusicalArtist] PREFIX dbpo: <http://dbpedia.org/ontology/>
  • 29. Path popularity • nSubjectRes(Pi,j): number of distinct resources that participate in a path as subjects • Path Pi,j: Si dbpo:wikiPageWikiLink Oj, here with Si = dbpo:MusicalArtist and Oj = dbpo:Band [Figure: subjects Michael_Jackson, Jackie_Jackson, Dave_Grohl, John_Lennon, Paul_McCartney; objects Jackson_5, Nirvana, Foo Fighters, Beatles]
  • 30. Path popularity • pathPopularity(Pi,j, Si) = nSubjectRes(Pi,j) / nRes(Si) [Figure: Jackson_5, Dave_Grohl, Michael_Jackson, Jackie_Jackson, Nirvana, Madonna, Prince, Charlie_Parker, Keith_Jarrett, Foo Fighters, Beatles, John_Lennon, Paul_McCartney]
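
The path popularity measure defined on the last three slides, pathPopularity(Pi,j, Si) = nSubjectRes(Pi,j) / nRes(Si), can be sketched in a few lines of Python. This is only an illustration: the function name `path_popularity`, the toy link list, and the assumption that each resource has a single type are not taken from the slides, and the links below are made-up examples rather than DBpedia data.

```python
from collections import defaultdict

def path_popularity(wikilinks, types):
    """Return {(subject_type, object_type): popularity} for typed wikilinks.

    wikilinks: iterable of (subject, object) pairs.
    types: dict mapping a resource to its single type (a simplification).
    """
    # nRes(S): number of resources having type S
    n_res = defaultdict(int)
    for t in types.values():
        n_res[t] += 1

    # nSubjectRes(P): distinct subjects that take part in path P = (S, O)
    subjects = defaultdict(set)
    for s, o in wikilinks:
        if s in types and o in types:
            subjects[(types[s], types[o])].add(s)

    return {p: len(subs) / n_res[p[0]] for p, subs in subjects.items()}

# Toy data loosely following the figure on the slides (illustrative only).
artists = ["Michael_Jackson", "Jackie_Jackson", "Dave_Grohl", "John_Lennon",
           "Paul_McCartney", "Madonna", "Prince", "Charlie_Parker", "Keith_Jarrett"]
bands = ["Jackson_5", "Nirvana", "Foo_Fighters", "Beatles"]
types = {a: "MusicalArtist" for a in artists}
types.update({b: "Band" for b in bands})

wikilinks = [
    ("Michael_Jackson", "Jackson_5"), ("Jackie_Jackson", "Jackson_5"),
    ("Dave_Grohl", "Nirvana"), ("Dave_Grohl", "Foo_Fighters"),
    ("John_Lennon", "Beatles"), ("Paul_McCartney", "Beatles"),
]

popularity = path_popularity(wikilinks, types)
print(popularity[("MusicalArtist", "Band")])  # 5 subjects out of 9 artists
```

In this toy data five of the nine musical artists link to a band, so the MusicalArtist-to-Band path has popularity 5/9; Dave_Grohl counts once even though he links to two bands.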
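
The training-set construction on slides 7 to 10 (one binary feature per DBPO class, set to 1 when the resource links to something of that class) can be sketched as follows. The slides do not state the distance measure or the value of k, so the Hamming distance, `k=3`, the helper names, and the toy training rows are all assumptions made for illustration. The slide figure types the linked companies as dbpo:Organisation while the table column reads Company; the sketch simply uses Company.

```python
from collections import Counter

# Feature columns shown on the slide; the experiments use 272 or 27 DBPO classes.
FEATURES = ["Mammal", "Scientist", "Company", "Drug", "City", "Magazine"]

def feature_vector(resource, wikilinks, types):
    """Binary vector: 1 if the resource links to at least one object of that type."""
    linked = {types[o] for s, o in wikilinks if s == resource and o in types}
    return [1 if f in linked else 0 for f in FEATURES]

def hamming(a, b):
    """Number of positions where two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(vector, training, k=3):
    """Majority label among the k training rows closest to the vector."""
    nearest = sorted(training, key=lambda row: hamming(vector, row[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Wikilinks of dbpedia:Steve_Jobs as drawn on the slide.
types = {"Apple_Inc.": "Company", "NeXT": "Company",
         "Forbes": "Magazine", "Cupertino,_California": "City"}
wikilinks = [("Steve_Jobs", target) for target in types]

steve = feature_vector("Steve_Jobs", wikilinks, types)
print(steve)  # [0, 0, 1, 0, 1, 1], the row shown on slide 9

# Hypothetical labeled rows standing in for the 20% training sample.
training = [
    (steve, "Person"),
    ([0, 1, 1, 0, 1, 0], "Person"),
    ([1, 0, 0, 0, 0, 0], "Species"),
    ([0, 0, 0, 1, 0, 0], "Drug"),
]
print(knn_predict([0, 0, 1, 0, 1, 0], training, k=3))  # Person
```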
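
The EKP-based step on slides 13 to 16 compares the path types of a resource with each EKP and picks the most similar one. The slides do not say which similarity measure is used, so the sketch below assumes Jaccard similarity over sets of types; the EKP contents and the path types listed for Galileo Galilei are invented for the example and are not the actual patterns.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_similar_ekp(path_types, ekps):
    """Name of the EKP whose type vocabulary is closest to the given path types."""
    return max(ekps, key=lambda name: jaccard(path_types, ekps[name]))

# Hypothetical EKPs: each maps an entity type to the types it typically links to.
ekps = {
    "Scientist": {"Scientist", "University", "City", "Award"},
    "Philosopher": {"Philosopher", "Book", "University"},
    "Artist": {"Artwork", "Museum", "City"},
}

# Hypothetical types of the resources that Galileo_Galilei links to.
galileo_path_types = {"Scientist", "University", "City", "Philosopher"}

print(most_similar_ekp(galileo_path_types, ekps))  # Scientist
```

With these toy sets the Scientist EKP shares three of five types with the resource (similarity 0.6), ahead of Philosopher (0.4) and Artist (about 0.17).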
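
The homotype hypothesis on slide 18 (the most frequent type among a resource's incoming and outgoing wikilinks indicates its own type) reduces to a frequency count. A minimal sketch, assuming single-typed neighbors and a hypothetical function name `infer_type_by_homotype`; the links are toy data, with Immanuel_Kant treated as untyped for the sake of the example.

```python
from collections import Counter

def infer_type_by_homotype(resource, wikilinks, types):
    """Most frequent type among typed neighbors of the resource, or None."""
    counts = Counter()
    for s, o in wikilinks:
        if s == resource and o in types:      # outgoing wikilink
            counts[types[o]] += 1
        elif o == resource and s in types:    # incoming wikilink
            counts[types[s]] += 1
    return counts.most_common(1)[0][0] if counts else None

types = {"Plato": "Philosopher", "David_Hume": "Philosopher",
         "Hegel": "Philosopher", "Königsberg": "City"}
wikilinks = [
    ("Immanuel_Kant", "Plato"),
    ("Immanuel_Kant", "David_Hume"),
    ("Immanuel_Kant", "Königsberg"),
    ("Hegel", "Immanuel_Kant"),
]

print(infer_type_by_homotype("Immanuel_Kant", wikilinks, types))  # Philosopher
```

A resource with no typed neighbors yields `None`, which mirrors the observation on slide 23 that the link structure of untyped entities is often too sparse to support a guess.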