Stanbol in the Labs

Universal Topic Classification
Named Entity Disambiguation

Semantic CMS Community
Olivier Grisel, Nuxeo
June 17, 2011

Co-funded by the European Union
Copyright IKS Consortium
www.iks-project.eu




   1 - Universal Topic Classification








“Wikipedia is a Web-Scale Controlled Vocabulary”
    - Chris Sizemore, BBC








A Rather “Simple” Idea

Use Apache Lucene / Solr MoreLikeThis to perform an
approximate k-Nearest Neighbors query in the
TF-IDF vector space of Wikipedia
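
As a rough illustration of what such a query could look like over HTTP, the sketch below only builds the request URL for Solr's MoreLikeThis handler. The handler path `/mlt`, the core name `wikipedia`, the field name `text` and the helper `build_mlt_url` are assumptions made for this example, not details taken from the slides. Depending on the Solr version and configuration, passing the text through `stream.body` may need to be enabled explicitly.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_mlt_url(base_url, text, field="text", max_terms=30, rows=5):
    """Build a MoreLikeThis request URL for a free-text document.

    base_url, field and the "/mlt" handler path are assumptions about
    how the Solr core is configured.
    """
    params = {
        "stream.body": text,             # the document to categorize
        "mlt.fl": field,                 # field used for the similarity
        "mlt.maxqt": max_terms,          # maximum number of query terms
        "mlt.mintf": 1,                  # minimum term frequency in the input
        "mlt.mindf": 1,                  # minimum document frequency in the index
        "mlt.interestingTerms": "list",  # also return the selected terms
        "rows": rows,                    # number of similar articles to return
        "fl": "id,score",
        "wt": "json",
    }
    return base_url.rstrip("/") + "/mlt?" + urlencode(params)


# Hypothetical local Solr core holding the Wikipedia index.
url = build_mlt_url(
    "http://localhost:8983/solr/wikipedia",
    "British public sector workers strike over pension changes",
)
```

The URL can then be fetched with any HTTP client; the returned documents are the nearest neighbours of the input text.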





    Which means:
●   Pick the top 30 terms of the document to categorize
●   Build a fuzzy full-text query
●   Search for indexed articles that share most terms
●   Rank results according to similarity score
●   Use the top-related Wikipedia articles as “Topics”
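
The steps above can be sketched in plain Python. This is a simplified stand-in for MoreLikeThis, not Lucene's actual scoring formula. The function names (`build_index`, `top_terms`, `more_like_this`) and the toy articles are hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(articles):
    """articles: dict title -> text. Returns (term counts per article, idf)."""
    counts = {title: Counter(tokenize(text)) for title, text in articles.items()}
    df = Counter()
    for c in counts.values():
        df.update(c.keys())  # each article counts once per term
    n = len(articles)
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    return counts, idf


def top_terms(text, idf, k=30):
    """Keep the k terms of the input document with the highest TF-IDF weight."""
    tf = Counter(tokenize(text))
    weights = {t: f * idf[t] for t, f in tf.items() if t in idf}
    return dict(sorted(weights.items(), key=lambda kv: -kv[1])[:k])


def more_like_this(text, counts, idf, k_terms=30, k_results=3):
    """Rank indexed articles by similarity to the top terms of `text`."""
    query = top_terms(text, idf, k_terms)
    scores = {}
    for title, c in counts.items():
        dot = sum(w * c[t] * idf[t] for t, w in query.items() if t in c)
        if dot > 0:
            # Normalize by article length; the query norm is the same for
            # every article, so it does not change the ranking.
            norm = math.sqrt(sum((f * idf[t]) ** 2 for t, f in c.items()))
            scores[title] = dot / norm
    return sorted(scores, key=scores.get, reverse=True)[:k_results]


# Toy "Wikipedia" with three hypothetical articles.
articles = {
    "Pension": "A pension is a retirement fund paid to workers after retirement",
    "Astronaut": "An astronaut is a person trained to travel in a spacecraft",
    "Strike action": "A strike is a work stoppage by workers in a labor dispute",
}
counts, idf = build_index(articles)
topics = more_like_this("Public sector workers strike over pension changes", counts, idf)
```

Only articles sharing at least one selected term with the input receive a score, so unrelated articles never appear among the suggested topics.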








However Wikipedia has millions of articles:
Navigation Hell

Need hierarchical structure:
from generic to specific

Faceted Browsing!





    Hierarchical Wikipedia Categorization
●   Group text of all articles categorized for a given Topic
●   Use Wikipedia Categories as Hierarchical Taxonomy
●   Categorize new document with MoreLikeThis on the
    aggregate text of articles
●   Available DBpedia dumps provide:
    ●    Text summaries for each article
    ●    “subject” relationships between articles and topics
    ●    “broader” / “narrower” SKOS hierarchy between topics
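
The grouping step described above amounts to a join between article summaries and “subject” links. The proof of concept mentioned later in the deck uses Pig for this; the in-memory Python version below is only a small sketch of the same idea. The function name `aggregate_topic_text` and the sample data are hypothetical.

```python
from collections import defaultdict


def aggregate_topic_text(abstracts, subjects):
    """Concatenate the summaries of all articles filed under each topic.

    abstracts: dict article -> summary text
    subjects:  iterable of (article, topic) pairs ("subject" links)
    """
    grouped = defaultdict(list)
    for article, topic in subjects:
        text = abstracts.get(article)
        if text:  # skip links to articles without a summary
            grouped[topic].append(text)
    return {topic: "\n".join(texts) for topic, texts in grouped.items()}


# Hypothetical sample data standing in for the DBpedia dumps.
abstracts = {
    "Kidney": "The kidneys filter blood and produce urine.",
    "Nephron": "The nephron is the functional unit of the kidney.",
}
subjects = [
    ("Kidney", "Category:Kidney"),
    ("Nephron", "Category:Kidney"),
    ("Nephron", "Category:Renal_physiology"),
    ("Missing_article", "Category:Kidney"),
]
topic_text = aggregate_topic_text(abstracts, subjects)
```

Each aggregate text can then be indexed as one document per topic, so that a MoreLikeThis query returns topics rather than individual articles.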







    Challenges encountered
●   500k “technical” categories
    “People_with_missing_birth_place”, “Rivers_in_Romania”
●   70k “grounded” categories
    ●   Paths to roots need both “technical” and “grounded”
●   Loops everywhere!
    ●   Death is a subcategory of Life
         –   Life is a subcategory of Death
                ●   …
●   Scale
    ●   1.2M topic / topic links
    ●   30M topic / article links
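
Because the category graph contains loops, any walk from a topic towards its roots has to remember what it has already visited. The breadth-first traversal below is one way to do that, shown as a sketch only. The function name `ancestors` and the tiny graph are hypothetical, and the slides do not say which strategy the proof of concept actually uses.

```python
from collections import deque


def ancestors(topic, broader, max_depth=None):
    """Return every topic reachable through "broader" links, each one once.

    broader: dict topic -> list of broader topics (may contain cycles).
    max_depth: optional limit on the number of hops from `topic`.
    """
    seen = {topic}
    queue = deque([(topic, 0)])
    result = []
    while queue:
        current, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue
        for parent in broader.get(current, ()):
            if parent not in seen:  # the visited set breaks the loops
                seen.add(parent)
                result.append(parent)
                queue.append((parent, depth + 1))
    return result


# The Death / Life loop from the slide, plus one hypothetical parent.
broader = {"Death": ["Life"], "Life": ["Death", "Nature"]}
path = ancestors("Death", broader)
```

The traversal terminates even though "Death" and "Life" point at each other, and `max_depth` can cap the walk on very deep hierarchies.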




                     Sample results

Pig / Solr / Python Proof of Concept








    IKS Workshop Wiki Page
●   Category:Free_web_development_software
●   Category:Semantic_HTML
●   Category:Semantic_Web
●   Category:Web_development_software
●   Category:Office_software
●   Category:World_Wide_Web_Consortium
●   Category:Open_source_project_foundations
●   Category:Free_network-related_software
●   Category:Free_business_software




    IKS Workshop Wiki Page (cont'd)
●   Category:Knowledge_representation_languages
●   Category:PHP_programming_language
●   Category:XML-based_standards
●   Category:Content_management_systems
●   Category:Knowledge_representation
●   Category:Presentation
●   Category:Cross-platform_software
●   Category:HTML
●   Category:Data_management




    Yesterday's Wikinews Articles (1/3)
    Hundreds of thousands of British public sector workers
    strike over planned pension changes


●   Category:Retirement_in_the_United_Kingdom
●   Category:United_Kingdom_pensions_and_benefits
●   Category:Pensions_in_the_United_Kingdom
●   Category:Labor_disputes_by_country
●   Category:Labor_disputes






    Yesterday's Wikinews Articles (2/3)
    US children who celebrate Independence Day more
    likely to become Republicans, says Harvard study


●   Category:Fireworks
●   Category:Voting_theory
●   Category:Republican_Party_(United_States)
●   Category:Statistics
●   Category:Electoral_systems





    Yesterday's Wikinews Articles (3/3)
    U.S. space agency NASA sues ex-astronaut


●   Category:American_astronauts
●   Category:Aviation_halls_of_fame
●   Category:Edwards_Air_Force_Base
●   Category:Apollo_program
●   Category:Exploration_of_the_Moon






    Scientific Publications (1/2) (PLOS One)
    Metabolic Programming during Lactation Stimulates
    Renal Na+ Transport in the Adult Offspring Due to an
    Early Impact on Local Angiotensin II Pathways


●   Category:Renal_physiology
●   Category:Kidney
●   Category:Nephrology
●   Category:Hypertension
●   Category:Membrane_biology




    Scientific Publications (2/2)
    International Conference on Machine Learning 2011
    accepted papers abstracts


●   Category:Machine_learning
●   Category:Computational_statistics
●   Category:Data_analysis
●   Category:Classification_algorithms
●   Category:Ensemble_learning





    Track & Hack
●   https://github.com/ogrisel/pignlproc
●   https://issues.apache.org/jira/browse/STANBOL-201
●   Help integrate into Stanbol EntityHub / Enhancer during the
    Hackathon
●   IKS User Story S10: Automated document categorization
    ●   I create a new document in my CMS by typing in an HTML edit form or
        by uploading a document with textual content (PDF, office file, XML
        file, ...). I want the CMS to suggest a list of at most 3
        controlled properties, such as subjects/topics or geographical
        coverage, out of a list of standardised options (IPTC subjects or world
        countries), based on the text content I gave.






   2 - Named Entity Disambiguation








    An example
●   Query for a person with name = “George Bush”
    ●    Results: 2 ambiguous possibilities
●   Perform an additional MoreLikeThis query with the surrounding
    paragraph as context:
●   If more like “41st”, “1988”, “Reagan”, “Panama”...
    ●    then: dbpedia:George_H._W._Bush
●   If more like “43rd”, “9/11”, “War on Terror”, “pretzel”...
    ●    then: dbpedia:George_W._Bush
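
The example above boils down to comparing the surrounding paragraph with a textual profile of each candidate and keeping the closest one. The sketch below uses a plain bag-of-words cosine similarity in place of MoreLikeThis. The function names and the two short candidate profiles are hypothetical; in the approach described here the profiles would come from the indexed Wikipedia / DBpedia text.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def disambiguate(context, candidates):
    """candidates: dict entity URI -> descriptive text. Returns the best URI."""
    ctx = bag_of_words(context)
    return max(candidates, key=lambda uri: cosine(ctx, bag_of_words(candidates[uri])))


# Hypothetical candidate profiles built from the keywords on the slide.
candidates = {
    "dbpedia:George_H._W._Bush": "41st president 1988 Reagan Panama",
    "dbpedia:George_W._Bush": "43rd president 911 War on Terror",
}
best = disambiguate("After serving under Reagan, he won the 1988 election", candidates)
```

With realistic profiles the same comparison would be delegated to the search index, so that candidate lookup by name and context similarity happen in one query.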







    Work in Progress
●   EntityHub's SolrYard now has a SimilarityConstraint
●   OpenNLP Named Entity Engine already extracts context
●   pignlproc is able to extract an occurrence corpus from
    Wikipedia dumps
●   Early prototype during Berlin Buzzwords Hackathon

    TODO: build a prepackaged Enhancer Engine & EntityHub index
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 

Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Paris 2011)

  • 1. Stanbol in the Labs: Universal Topic Classification, Named Entity Disambiguation. Olivier Grisel, Nuxeo. June 17, 2011. Semantic CMS Community. Co-funded by the European Union. Copyright IKS Consortium. www.iks-project.eu
  • 2. 1 - Universal Topic Classification
  • 3. “Wikipedia is a Web-Scale Controlled Vocabulary” (Chris Sizemore, BBC)
  • 4. A Rather “Simple” Idea: use Apache Lucene / Solr MoreLikeThis to perform an approximate k-Nearest Neighbors query in the TF-IDF vector space of Wikipedia.
  • 5. Which means:
      ● Pick the top 30 terms of the document to categorize
      ● Build a fuzzy full-text query
      ● Search for indexed articles that share most terms
      ● Rank results according to similarity score
      ● Use the top-related Wikipedia articles as “Topics”
  • 6. However, Wikipedia has millions of articles: Navigation Hell. Need hierarchical structure, from generic to specific: Faceted Browsing!
  • 7. Hierarchical Wikipedia Categorization
      ● Group text of all articles categorized for a given Topic
      ● Use Wikipedia Categories as Hierarchical Taxonomy
      ● Categorize new document with MoreLikeThis on the aggregate text of articles
      ● Available DBpedia dumps provide:
          ● Text summaries for each article
          ● “subject” relationships between articles and topics
          ● “broader” / “narrower” SKOS hierarchy between topics
  • 8. Challenges encountered
      ● 500k “technical” categories: “People_with_missing_birth_place”, “Rivers_in_Romania”
      ● 70k “grounded” categories
      ● Paths to roots need both “technical” and “grounded”
      ● Loops everywhere!
          ● Death is a subcategory of Life; Life is a subcategory of Death
          ● …
      ● Scale
          ● 1.2M topic / topic links
          ● 30M topic / article links
  • 9. Sample results: Pig / Solr / Python Proof of Concept
  • 10. IKS Workshop Wiki Page
      ● Category:Free_web_development_software
      ● Category:Semantic_HTML
      ● Category:Semantic_Web
      ● Category:Web_development_software
      ● Category:Office_software
      ● Category:World_Wide_Web_Consortium
      ● Category:Open_source_project_foundations
      ● Category:Free_network-related_software
      ● Category:Free_business_software
  • 11. IKS Workshop Wiki Page (cont'd)
      ● Category:Knowledge_representation_languages
      ● Category:PHP_programming_language
      ● Category:XML-based_standards
      ● Category:Content_management_systems
      ● Category:Knowledge_representation
      ● Category:Presentation
      ● Category:Cross-platform_software
      ● Category:HTML
      ● Category:Data_management
  • 12. Yesterday's Wikinews Articles (1/3): “Hundreds of thousands of British public sector workers strike over planned pension changes”
      ● Category:Retirement_in_the_United_Kingdom
      ● Category:United_Kingdom_pensions_and_benefits
      ● Category:Pensions_in_the_United_Kingdom
      ● Category:Labor_disputes_by_country
      ● Category:Labor_disputes
  • 13. Yesterday's Wikinews Articles (2/3): “US children who celebrate Independence Day more likely to become Republicans, says Harvard study”
      ● Category:Fireworks
      ● Category:Voting_theory
      ● Category:Republican_Party_%28United_States%29
      ● Category:Statistics
      ● Category:Electoral_systems
  • 14. Yesterday's Wikinews Articles (3/3): “U.S. space agency NASA sues ex-astronaut”
      ● Category:American_astronauts
      ● Category:Aviation_halls_of_fame
      ● Category:Edwards_Air_Force_Base
      ● Category:Apollo_program
      ● Category:Exploration_of_the_Moon
  • 15. Scientific Publications (1/2), PLOS One: “Metabolic Programming during Lactation Stimulates Renal Na+ Transport in the Adult Offspring Due to an Early Impact on Local Angiotensin II Pathways”
      ● Category:Renal_physiology
      ● Category:Kidney
      ● Category:Nephrology
      ● Category:Hypertension
      ● Category:Membrane_biology
  • 16. Scientific Publications (2/2): abstracts of the papers accepted at the International Conference on Machine Learning 2011
      ● Category:Machine_learning
      ● Category:Computational_statistics
      ● Category:Data_analysis
      ● Category:Classification_algorithms
      ● Category:Ensemble_learning
  • 17. Track & Hack
      ● https://github.com/ogrisel/pignlproc
      ● https://issues.apache.org/jira/browse/STANBOL-201
      ● Help integrate into Stanbol EntityHub / Enhancer during the Hackathon
      ● IKS User Story S10: Automated document categorization
          ● I create a new document in my CMS by typing in an HTML edit form or by uploading a document with textual content (PDF, office file, XML file, ...). I want the CMS to suggest a list of at most 3 controlled properties, such as subjects/topics or geographical coverage, out of a list of standardised options (IPTC subjects or world countries), based on the text content I gave.
  • 18. 2 - Named Entity Disambiguation
  • 19. An example
      ● Query for person with name = “George Bush”
      ● Results: 2 ambiguous possibilities
      ● Perform additional MoreLikeThis with surrounding paragraph as context:
          ● If more like “41st”, “1988”, “Reagan”, “Panama”... then: dbpedia:George_H._W._Bush
          ● If more like “43rd”, “911”, “War on Terror”, “bretzel”... then: dbpedia:George_W._Bush
  • 20. Work in Progress
      ● EntityHub's SolrYard now has a SimilarityConstraint
      ● OpenNLP NamedEntity Engine already extracts context
      ● pignlproc is able to extract occurrence corpus from Wikipedia dumps
      ● Early prototype during Berlin Buzzwords Hackathon
      ● TODO: build a prepackaged Enhancer Engine & EntityHub index
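
The idea from slides 4 and 5 (select the top terms of a text, then rank indexed documents by similarity in a TF-IDF space), and its reuse for disambiguation on slide 19, can be illustrated with a small pure-Python sketch. This is not Lucene's or Solr's MoreLikeThis implementation, whose term selection and scoring differ; it is a plain cosine-similarity stand-in. The function names (`build_index`, `more_like_this`, ...), the smoothed IDF formula, and the sample context sentence are assumptions made for illustration. The two candidate "documents" contain only the context terms listed on slide 19.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build TF-IDF vectors for a dict mapping document name -> text."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())
    # Smoothed IDF (an assumption of this sketch, not Lucene's formula).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = {
        name: {t: c * idf[t] for t, c in term_counts.items()}
        for name, term_counts in counts.items()
    }
    return vectors, idf


def query_vector(text, idf, top_terms=30):
    """Keep only the top-weighted terms of the text (30 by default, as on slide 5)."""
    term_counts = Counter(tokenize(text))
    weighted = {t: c * idf[t] for t, c in term_counts.items() if t in idf}
    best = sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)[:top_terms]
    return dict(best)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def more_like_this(text, vectors, idf, k=5, top_terms=30):
    """Return up to k (name, score) pairs, most similar first, dropping zero scores."""
    query = query_vector(text, idf, top_terms)
    scored = sorted(((cosine(query, vec), name) for name, vec in vectors.items()), reverse=True)
    return [(name, score) for score, name in scored[:k] if score > 0]


# Candidate entities for "George Bush", described only by the terms from slide 19.
candidates = {
    "dbpedia:George_H._W._Bush": "41st 1988 Reagan Panama",
    "dbpedia:George_W._Bush": "43rd 911 War on Terror",
}
vectors, idf = build_index(candidates)

# Hypothetical surrounding paragraph, written for this example only.
context = "Vice president under Reagan, elected in 1988, he ordered the invasion of Panama."
ranked = more_like_this(context, vectors, idf, k=2)
print(ranked)
```

For topic classification the same `more_like_this` call would be run against an index whose documents are Wikipedia articles, or the aggregated text of each category as described on slide 7, rather than two hand-written candidates.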