This document discusses two natural language processing techniques: universal topic classification and named entity disambiguation.
For universal topic classification, it proposes using Apache Lucene/Solr's MoreLikeThis query to find related Wikipedia articles based on document terms, and then categorizing the document using the topics of related articles. It also discusses using Wikipedia categories to provide a hierarchical structure.
For named entity disambiguation, it suggests using MoreLikeThis with surrounding context to disambiguate entities mentioned in a document (e.g. determining if "George Bush" refers to George H. W. Bush or George W. Bush). The document outlines work in progress to integrate these techniques into the Stanbol semantic framework.
Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Paris 2011)
1. Page:
Stanbol
Semantic CMS Community in the Labs
Universal Topic Classification
Named Entity Disambiguation
Olivier Grisel
Nuxeo
June 17, 2011
Co-funded by the European Union
Copyright IKS Consortium
www.iks-project.eu
3. Page:
Wikipedia is a Web-Scale Controlled
Vocabulary
– Chris Sizemore, BBC
4. Page:
A Rather “Simple” Idea
Use
Apache Lucene / Solr MoreLikeThis
to perform an
approximate k-Nearest Neighbors
query
in the
TF-IDF vector space of Wikipedia
5. Page:
Which means:
● Pick the top 30 terms of the document to categorize
● Build a fuzzy full-text query
● Search for indexed articles that share most terms
● Rank results according to similarity score
● Use the top-related Wikipedia articles as “Topics”
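The steps above can be sketched in plain Python. This is only an illustration of the idea, not the Lucene / Solr MoreLikeThis implementation: the tokenizer, the IDF formula, the scoring by summed term weight, and the tiny in-memory `articles` index are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(text, doc_freq, n_docs, k=30):
    """Return the k terms of `text` with the highest TF-IDF weight.

    Terms that never occur in the index are dropped, since they
    cannot match any indexed article.
    """
    tf = Counter(tokenize(text))
    weights = {
        term: count * math.log(1 + n_docs / doc_freq[term])
        for term, count in tf.items()
        if term in doc_freq
    }
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:k])


def related_articles(text, articles, k=30, n_results=5):
    """Rank indexed articles by the summed weight of shared query terms."""
    tokens = {title: set(tokenize(body)) for title, body in articles.items()}
    doc_freq = Counter(t for terms in tokens.values() for t in terms)
    query = top_terms(text, doc_freq, len(articles), k)
    scores = {
        title: sum(w for term, w in query.items() if term in terms)
        for title, terms in tokens.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(title, score) for title, score in ranked[:n_results] if score > 0]


# Hypothetical three-article index standing in for Wikipedia.
articles = {
    "Pension": "pension retirement fund workers benefits",
    "Strike action": "strike workers labor union dispute",
    "Moon": "moon orbit apollo lunar landing",
}
topics = related_articles(
    "Public sector workers strike over pension changes", articles
)
print(topics)
```

A real deployment would delegate term selection and ranking to the search engine, which is what makes the approach workable at Wikipedia scale.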
6. Page:
However, Wikipedia has millions of articles:
Navigation Hell
Need hierarchical structure:
from generic to specific
Faceted Browsing!
7. Page:
Hierarchical Wikipedia Categorization
● Group text of all articles categorized for a given Topic
● Use Wikipedia Categories as Hierarchical Taxonomy
● Categorize a new document with MoreLikeThis on the aggregate text of articles
● Available DBpedia dumps provide:
  ● Text summaries for each article
  ● “subject” relationships between articles and topics
  ● “broader” / “narrower” SKOS hierarchy between topics
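As a minimal sketch of the grouping step, the following assumes the DBpedia data has already been parsed into two in-memory structures: a `summaries` mapping and a list of `(article, topic)` "subject" pairs. Both names and the sample data are hypothetical; parsing the dumps themselves is out of scope here.

```python
from collections import defaultdict


def aggregate_topic_text(summaries, subjects):
    """Concatenate the summaries of all articles filed under each topic.

    summaries: {article_id: summary text}
    subjects:  iterable of (article_id, topic_id) pairs
    """
    grouped = defaultdict(list)
    for article, topic in subjects:
        # Skip "subject" links that point at articles without a summary.
        if article in summaries:
            grouped[topic].append(summaries[article])
    return {topic: "\n".join(texts) for topic, texts in grouped.items()}


# Hypothetical sample data for illustration only.
summaries = {
    "Kidney": "The kidneys filter blood and produce urine.",
    "Nephron": "The nephron is the functional unit of the kidney.",
}
subjects = [
    ("Kidney", "Category:Kidney"),
    ("Nephron", "Category:Kidney"),
    ("Nephron", "Category:Renal_physiology"),
    ("Missing", "Category:Kidney"),
]
topic_text = aggregate_topic_text(summaries, subjects)
```

Each aggregate text can then be indexed as a single "topic document", so that a similarity query returns topics rather than individual articles.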
8. Page:
Challenges encountered
● 500k “technical” categories, e.g. “People_with_missing_birth_place”, “Rivers_in_Romania”
● 70k “grounded” categories
● Paths to roots need both “technical” and “grounded”
● Loops everywhere!
● Death is a subcategory of Life
– Life is a subcategory of Death
● …
● Scale
● 1.2M topic / topic links
● 30M topic / article links
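Because the category graph contains loops, any walk from a topic toward the roots has to remember what it has already visited. The sketch below shows one way to do that with a breadth-first traversal; the `broader` mapping and its contents are hypothetical stand-ins for the SKOS "broader" links.

```python
from collections import deque


def ancestors(topic, broader):
    """Return every topic reachable via 'broader' links, tolerating cycles."""
    seen = set()
    queue = deque(broader.get(topic, ()))
    while queue:
        current = queue.popleft()
        # Skip topics already visited and the starting topic itself,
        # so that loops such as Life <-> Death terminate.
        if current in seen or current == topic:
            continue
        seen.add(current)
        queue.extend(broader.get(current, ()))
    return seen


# Hypothetical links reproducing the Life / Death loop.
broader = {
    "Death": ["Life"],
    "Life": ["Death", "Nature"],
    "Nature": [],
}
print(sorted(ancestors("Death", broader)))
```

At the scale quoted above (1.2M topic / topic links) an in-memory dictionary of this kind is still feasible, but that is an assumption about available memory rather than something the slides state.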
12. Page:
Yesterday's Wikinews Articles (1/3)
Hundreds of thousands of British public sector workers
strike over planned pension changes
● Category:Retirement_in_the_United_Kingdom
● Category:United_Kingdom_pensions_and_benefits
● Category:Pensions_in_the_United_Kingdom
● Category:Labor_disputes_by_country
● Category:Labor_disputes
13. Page:
Yesterday's Wikinews Articles (2/3)
US children who celebrate Independence Day more
likely to become Republicans, says Harvard study
● Category:Fireworks
● Category:Voting_theory
● Category:Republican_Party_%28United_States%29
● Category:Statistics
● Category:Electoral_systems
14. Page:
Yesterday's Wikinews Articles (3/3)
U.S. space agency NASA sues ex-astronaut
● Category:American_astronauts
● Category:Aviation_halls_of_fame
● Category:Edwards_Air_Force_Base
● Category:Apollo_program
● Category:Exploration_of_the_Moon
15. Page:
Scientific publication (1/2) (PLOS One)
Metabolic Programming during Lactation Stimulates
Renal Na+ Transport in the Adult Offspring Due to an
Early Impact on Local Angiotensin II Pathways
● Category:Renal_physiology
● Category:Kidney
● Category:Nephrology
● Category:Hypertension
● Category:Membrane_biology
17. Page:
Track & Hack
● https://github.com/ogrisel/pignlproc
● https://issues.apache.org/jira/browse/STANBOL-201
● Help integrate into Stanbol EntityHub / Enhancer during the Hackathon
● IKS User Story S10: Automated document categorization
● I create a new document in my CMS by typing in an HTML edit form or by uploading a document with textual content (PDF, office file, XML file, ...). I want the CMS to suggest a list of at most 3 controlled properties, such as subjects/topics or geographical coverage, out of a list of standardised options (IPTC subjects or world countries), based on the text content I gave.
18. Page:
2 – Named Entity Disambiguation
19. Page:
An example
● Query for person with name = “George Bush”
● Results: 2 ambiguous possibilities
● Perform additional MoreLikeThis with surrounding
paragraph as context:
● If more like “41st”, “1988”, “Reagan”, “Panama”...
● then: dbpedia:George_H._W._Bush
● If more like “43rd”, “9/11”, “War on Terror”, “pretzel”...
● then: dbpedia:George_W._Bush
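The example above can be sketched as a context-overlap score. This is a deliberately simplified stand-in for a MoreLikeThis query: it counts shared tokens instead of using TF-IDF weights, and the candidate descriptions are hypothetical text written for the illustration, not data pulled from DBpedia.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def disambiguate(context, candidates):
    """Pick the candidate whose description shares most tokens with context.

    candidates: {entity URI: descriptive text for that entity}
    Returns the best URI together with the per-candidate scores.
    """
    context_tokens = tokenize(context)
    scores = {
        uri: len(context_tokens & tokenize(description))
        for uri, description in candidates.items()
    }
    # Sorting first makes ties resolve deterministically by URI.
    best = max(sorted(scores), key=scores.get)
    return best, scores


# Hypothetical candidate descriptions for the two ambiguous results.
candidates = {
    "dbpedia:George_H._W._Bush":
        "41st president elected in 1988 after Reagan; invasion of Panama",
    "dbpedia:George_W._Bush":
        "43rd president; September 11 attacks and the War on Terror",
}
context = (
    "George Bush, who served as vice president under Reagan, "
    "won the 1988 election."
)
best, scores = disambiguate(context, candidates)
print(best, scores)
```

Raw token overlap is easily skewed by common words such as "the"; weighting terms by rarity, as a TF-IDF based similarity query does, is what makes the surrounding paragraph a useful signal in practice.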
20. Page:
Work in Progress
● EntityHub's SolrYard now has a SimilarityConstraint
● OpenNLP NamedEntity Engine already extracts context
● pignlproc is able to extract an occurrence corpus from Wikipedia dumps
● Early prototype during Berlin Buzzwords Hackathon
TODO: build a prepackaged Enhancer Engine & EntityHub index