1. TRank: Ranking Entity Types Using the Web of Data
Alberto Tonon¹, Michele Catasta², Gianluca Demartini¹, Philippe Cudré-Mauroux¹, and Karl Aberer²
¹ eXascale Infolab, University of Fribourg, Switzerland
{alberto, demartini, phil}@exascale.info
² Distributed Information Systems Laboratory, EPFL, Switzerland
{firstname.lastname}@epfl.ch
ISWC, 25 October 2013
2. Why Entities?
• The Web is getting entity-centric!
• Entity-centric services
3. …and Why Types?
• “Summarization” of texts
• Contextual entity summaries in Web pages
• Disambiguation of other entities
• Diversification of search results
• Example:
– Article title: Bin Laden Relative Pleads Not Guilty in Terrorism Case
– Entities: Osama Bin Laden, Abu Ghaith, Lewis Kaplan, Manhattan
– Types: Al-Qaeda Propagandists, Kuwaiti Al-Qaeda members, Judge, Borough (New York City)
– Snippet: “Sulaiman Abu Ghaith, a son-in-law of Osama bin Laden who once served as a spokesman for Al Qaeda”, shown with the type labels Al-Qaeda Propagandist, Kuwaiti Al-Qaeda members, Jihadist Organizations
4. Entities May Have Many Types
• Types shown in the figure: Thing, Agent, Person, Living People, American Billionaires, American Philanthropists, American Computer Programmers, American People of Scottish Descent, Harvard University People, Windows People, People from King County, People from Seattle
5. Our Task: Ranking Types Given a Context
• Input: a knowledge base G, an entity e, a context c in which e appears.
• Output: e’s types ranked by relevance with respect to the context c.
• Evaluation: crowdsourcing + MAP, NDCG
• Example:
– G: DBpedia 3.8
– e: Bill Gates
– c: «Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975.»
– Ranked types of Bill Gates:
1. American Chief executive
2. American Computer Programmer
3. American Billionaires
4. …
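The two evaluation measures can be sketched for a single entity/context pair as follows. This is a minimal sketch, assuming the common log2-discounted form of NDCG and graded relevance given as a type-to-gain mapping. The function names and the judgments in the example are invented for illustration and are not taken from the TRank dataset. MAP is then the mean of the average precision over all entity/context pairs.

```python
import math

def average_precision(ranked_types, relevant):
    """Average precision of one ranked list; `relevant` is the set of relevant types."""
    hits, total = 0, 0.0
    for position, t in enumerate(ranked_types, start=1):
        if t in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0

def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

def ndcg(ranked_types, gain):
    """NDCG of one ranked list; `gain` maps type -> graded relevance."""
    actual = dcg([gain.get(t, 0) for t in ranked_types])
    ideal = dcg(sorted(gain.values(), reverse=True)[:len(ranked_types)])
    return actual / ideal if ideal > 0 else 0.0

# Invented judgments, for illustration only.
votes = {"American Chief Executives": 2, "American Computer Programmers": 1}
ranking = ["American Chief Executives", "American Billionaires",
           "American Computer Programmers"]
ap = average_precision(ranking, set(votes))
score = ndcg(ranking, votes)
```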
6. TRank Pipeline
• Text extraction (BoilerPipe)
• Named Entity Recognition (Stanford NER) ⟹ list of entity labels
• Entity linking (inverted index: DBpedia labels ⟹ resource URIs) ⟹ list of entity URIs
• For each entity URI: type retrieval (inverted index: resource URIs ⟹ type URIs) ⟹ list of type URIs
• Type ranking ⟹ ranked list of types
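The flow of the pipeline can be sketched as a chain of stages. This is only an illustration: the real system uses BoilerPipe, Stanford NER and inverted indices, whereas here every stage is a hypothetical stand-in passed in as a function or a plain dictionary, and the URIs and types are toy values.

```python
def run_pipeline(html, extract_text, recognize_entities,
                 label_index, type_index, rank_types):
    """Chain the pipeline stages; every stage is supplied by the caller."""
    text = extract_text(html)                    # text extraction
    labels = recognize_entities(text)            # named entity recognition
    uris = [label_index[l] for l in labels if l in label_index]  # entity linking
    ranked = {}
    for uri in uris:                             # foreach entity URI
        types = type_index.get(uri, [])          # type retrieval
        ranked[uri] = rank_types(uri, types, text)  # type ranking
    return ranked

# Toy stand-ins (hypothetical data, not the real indices).
label_index = {"Bill Gates": "dbr:Bill_Gates"}
type_index = {"dbr:Bill_Gates": ["Person", "AmericanBillionaires"]}
result = run_pipeline(
    "<p>Bill Gates founded Microsoft.</p>",
    extract_text=lambda html: html.replace("<p>", "").replace("</p>", ""),
    recognize_entities=lambda text: [l for l in label_index if l in text],
    label_index=label_index,
    type_index=type_index,
    rank_types=lambda uri, types, text: sorted(types),  # placeholder ranker
)
```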
10. Hierarchy-Based Approaches (An Example)
• ANCESTORS
Score(e, t) = number of t’s ancestors in the type hierarchy that are contained in Te, the set of types associated with e in the knowledge base.
• Note: Te often doesn’t contain all supertypes of a specific type.
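A minimal sketch of the ANCESTORS score, assuming the type hierarchy is a tree stored as a child-to-parent dictionary. The hierarchy and the type set below are toy values; "Agent" is deliberately missing from Te to mirror the remark that Te often lacks some supertypes.

```python
def ancestors(t, parent):
    """All ancestors of type t, following child -> parent links up to the root."""
    result = []
    while t in parent:
        t = parent[t]
        result.append(t)
    return result

def ancestors_score(t, entity_types, parent):
    """ANCESTORS: number of t's ancestors that also belong to Te."""
    return sum(1 for a in ancestors(t, parent) if a in entity_types)

# Toy hierarchy (hypothetical): Thing <- Agent <- Person <- AmericanBillionaires
parent = {"Agent": "Thing", "Person": "Agent", "AmericanBillionaires": "Person"}
Te = {"Thing", "Person", "AmericanBillionaires"}  # "Agent" is not in Te

scores = {t: ancestors_score(t, Te, parent) for t in Te}
```

With this toy data the most specific type receives the highest score, because more of its ancestors appear in Te.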
11. Context-Aware Ranking Approaches (An Example)
• SAMETYPE
Score(e, t, cT) = number of times t appears among the types of every other entity in cT.
(Figure: the entity e and the other entities e' and e'' in the context, with the type labels Person, Actor, AmericanActor, Organization, Thing.)
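A minimal sketch of SAMETYPE, assuming the context is given as a mapping from each entity to its set of types. Which types belong to which entity in the example is an assumption made for illustration; the figure does not make the assignment recoverable.

```python
def sametype_score(entity, t, context_types):
    """SAMETYPE: number of other entities in the context that have type t."""
    return sum(1 for other, types in context_types.items()
               if other != entity and t in types)

# Hypothetical assignment of types to the entities in the context.
context_types = {
    "e": {"Actor", "AmericanActor"},
    "e'": {"Person", "Actor"},
    "e''": {"Organization", "Thing"},
}
actor = sametype_score("e", "Actor", context_types)
american_actor = sametype_score("e", "AmericanActor", context_types)
```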
12. Learning to Rank Entity Types
Determine an optimal combination of all our
approaches:
• Decision trees
• Linear regression models
• 10-fold cross validation
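The setup can be sketched as follows: the scores of the individual approaches form a feature vector for each (entity, type) pair, a model maps that vector to one ranking score, and the data is split into 10 folds for validation. This is a sketch under assumptions: the weights are fixed by hand here, whereas a regression model would fit them on the training folds, and all names are hypothetical.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs; each item lands in exactly one test fold."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]

def linear_score(features, weights, bias=0.0):
    """Combine per-approach scores (e.g. ANCESTORS, SAMETYPE) linearly."""
    return bias + sum(w * f for w, f in zip(weights, features))

# 20 hypothetical entity/context pairs, identified by integers.
splits = list(k_fold_splits(list(range(20)), k=10))

# Hand-picked weights, for illustration only.
combined = linear_score([2, 1], weights=[0.5, 1.0])
```

Splitting by entity/context pair rather than by individual type keeps all types of one entity in the same fold.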
13. Avoiding SPARQL Queries with Inverted Indices and Map/Reduce
• TRank is implemented with Hadoop and Map/Reduce.
• All computations are done using inverted indices:
– Entity linking
– Path index
– Depth index
• The inverted indices are publicly available at exascale.info/TRank
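How an inverted index replaces a query can be sketched with a map phase and a reduce phase written in plain Python. This is not the Hadoop API and not the TRank code: it only mimics the two phases to build a lookup from resource URIs to type URIs out of toy triples, and the function names are hypothetical.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"

def map_triple(triple):
    """Map phase: emit (entity URI, type URI) for every rdf:type triple."""
    subject, predicate, obj = triple
    if predicate == RDF_TYPE:
        yield subject, obj

def reduce_types(pairs):
    """Reduce phase: group the emitted type URIs by entity URI."""
    index = defaultdict(list)
    for entity, type_uri in pairs:
        index[entity].append(type_uri)
    return dict(index)

def build_type_index(triples):
    return reduce_types(pair for t in triples for pair in map_triple(t))

# Toy triples (hypothetical).
triples = [
    ("dbr:Bill_Gates", "rdf:type", "dbo:Person"),
    ("dbr:Bill_Gates", "rdfs:label", "Bill Gates"),
    ("dbr:Bill_Gates", "rdf:type", "dbo:Agent"),
]
type_index = build_type_index(triples)
```

Once the index is built, retrieving the types of an entity is a dictionary lookup instead of a SPARQL query.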
17. Efficiency Evaluation
• Tested efficiency on a 1TB CommonCrawl sample
– 1,310,459 HTML pages
– 23GB compressed
• Map/Reduce on a cluster of 8 machines with 12 cores, 32GB of RAM and 3 SATA disks
• On average, 25 min. processing time (> 100 docs per node per second)
• Breakdown by pipeline step:
– Text Extraction: 18.9%
– NER: 35.6%
– Entity Linking: 29.5%
– Type Retrieval: 9.8%
– Type Ranking: 6.2%
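The throughput figure can be checked from the numbers on the slide, assuming the 25 minutes cover all 1,310,459 pages and the work is spread evenly over the 8 machines.

```python
pages = 1_310_459
minutes = 25
nodes = 8

# Documents processed per node per second.
docs_per_node_per_sec = pages / (minutes * 60 * nodes)
print(round(docs_per_node_per_sec, 1))  # 109.2
```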
18. Conclusions
• New task: ranking entity types.
– Useful for: “summarization” of Web documents, entity summaries, disambiguation.
• Various approaches: entity-centric, context-aware, hierarchy-based, learning to rank.
– Hierarchy-based and learning to rank are the most effective.
• Hadoop, Map/Reduce, and inverted indices to achieve scalability.
19. Thank You!
• Datasets (with relevance judgments!), inverted indices, evaluation tools and more material are available at exascale.info/Trank.
• TRank is open source: https://github.com/MEM0R1ES/TRank
• Check out B-hist at the SW Challenge!
• Thanks to … for the Travel Award!
Thank you for your attention!
21. Entity-Centric Ranking Approaches
• FREQ
Rank(e, t, ck) = number of triples <e> <rdf:type> <t> in the knowledge base.
• WIKILINK
Rank(e, t, ck) = number of e’s “neighbor entities” with type t.
• SAMEAS
Rank(e, t, ck) = number of URIs representing e with type t.
• LABEL
Rank(e, t, ck) = frequency of t among the top-10 most similar entities in terms of label (thank you, Lucene)
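FREQ, WIKILINK and SAMEAS can be sketched over a list of triples. This is a minimal sketch under assumptions: the predicate name "wikiLink", the "fb:" prefix and the toy triples are hypothetical, and "neighbor entities" is taken to mean entities linked to or from e.

```python
def types_of(uri, triples):
    """Set of types asserted for a URI."""
    return {o for s, p, o in triples if s == uri and p == "rdf:type"}

def freq(e, t, triples):
    """FREQ: number of <e> rdf:type <t> triples."""
    return sum(1 for s, p, o in triples if (s, p, o) == (e, "rdf:type", t))

def wikilink(e, t, triples):
    """WIKILINK: number of entities linked to or from e that have type t."""
    neighbours = {o for s, p, o in triples if s == e and p == "wikiLink"}
    neighbours |= {s for s, p, o in triples if o == e and p == "wikiLink"}
    return sum(1 for n in neighbours if t in types_of(n, triples))

def sameas(e, t, triples):
    """SAMEAS: number of URIs equivalent to e (owl:sameAs) that have type t."""
    aliases = {o for s, p, o in triples if s == e and p == "owl:sameAs"}
    return sum(1 for a in aliases if t in types_of(a, triples))

# Toy triples (hypothetical).
triples = [
    ("dbr:Bill_Gates", "rdf:type", "Person"),
    ("dbr:Bill_Gates", "wikiLink", "dbr:Paul_Allen"),
    ("dbr:Paul_Allen", "rdf:type", "Person"),
    ("dbr:Bill_Gates", "owl:sameAs", "fb:Bill_Gates"),
    ("fb:Bill_Gates", "rdf:type", "Person"),
]
e = "dbr:Bill_Gates"
```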
22. Entity-Centric Ranking Approaches
• LABEL
Rank(e, t, ck) = frequency of t among the top-10 most similar entities in terms of label. Exploits an inverted index.
(Figure: the labels of the knowledge-base entities e1 … eN, e.g. "Tom Cruise", "Tom Hanks", "Bill Gates", "Osama Bin Laden", are used to create an inverted index from label tokens to entities: "Tom" ⟹ e1, e3, …; "Cruise" ⟹ e1, …; "Hanks" ⟹ …, e3; "Bill" ⟹ …, e2. Label(e) is issued as a query against the inverted index, and TF-IDF ranking returns the top-10 entities.)
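The LABEL approach can be sketched with a small token-level inverted index. The scoring below is a simplified IDF-weighted token overlap, chosen for brevity; it is not Lucene's scoring function. Entity ids, labels and types are toy values.

```python
import math
from collections import Counter, defaultdict

def build_index(labels):
    """Inverted index: lowercase label token -> set of entity ids."""
    index = defaultdict(set)
    for entity, label in labels.items():
        for token in label.lower().split():
            index[token].add(entity)
    return index

def top_similar(entity, labels, index, k=10):
    """Other entities ranked by IDF-weighted token overlap with entity's label."""
    n = len(labels)
    scores = Counter()
    for token in set(labels[entity].lower().split()):
        postings = index.get(token, set())
        idf = math.log(1 + n / len(postings)) if postings else 0.0
        for other in postings:
            if other != entity:
                scores[other] += idf
    return [e for e, _ in scores.most_common(k)]

def label_score(entity, t, labels, index, types, k=10):
    """LABEL: frequency of t among the top-k entities with similar labels."""
    return sum(1 for other in top_similar(entity, labels, index, k)
               if t in types.get(other, ()))

# Toy data (hypothetical).
labels = {"e1": "Tom Cruise", "e2": "Bill Gates", "e3": "Tom Hanks"}
types = {"e1": {"Actor"}, "e2": {"Person"}, "e3": {"Actor", "Person"}}
index = build_index(labels)
```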
23. Hierarchy-Based Ranking Approaches
• DEPTH
Rank(e, t, cH) = depth of t in
the type hierarchy.
• ANCESTORS
Rank(e, t, cH) = number of t’s
ancestors in the type
hierarchy contained in Te.
• ANC_DEPTH
Rank(e, t, cH) =
• Note: Te often doesn’t contain all supertypes of a specific type.
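A minimal sketch of DEPTH, assuming a tree-shaped hierarchy stored as a child-to-parent dictionary, a root at depth 0, and that a higher score ranks first. The hierarchy is a toy example.

```python
def depth(t, parent):
    """Depth of type t: number of edges from t up to the root."""
    d = 0
    while t in parent:
        t = parent[t]
        d += 1
    return d

def rank_by_depth(entity_types, parent):
    """DEPTH: order types from most specific (deepest) to most general."""
    return sorted(entity_types, key=lambda t: depth(t, parent), reverse=True)

# Toy hierarchy (hypothetical): Thing <- Agent <- Person <- AmericanBillionaires
parent = {"Agent": "Thing", "Person": "Agent", "AmericanBillionaires": "Person"}
ranking = rank_by_depth(["Thing", "Person", "AmericanBillionaires"], parent)
```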
24. Context-Aware Ranking Approaches
• The context can help to obtain a better ranking of types.
Italy’s rebellious voters, who opted for a flamboyant billionaire and a
clown, reminded us last week how deeply in crisis the Continent is.
Meanwhile, France is going it virtually alone in Mali, and Britain talks
openly of jumping the European ship altogether.
Landlocked Countries
Least Developed Countries
States And Territories Established In 1960
French-speaking Countries
World Trade Organization Member Economies
Country
African Union Member States
African Countries
Member States Of La Francophonie
African Union Member Economies
Populated Place
Place
• Which is the right type for Mali?
25. Context-Aware Ranking Approaches
PATH
• Suppose we have to compute Rank(e, t, cT).
• Consider each type t’ of each other entity e’ in c.
• P(t) = path from the root of the type hierarchy to t.
27. Relevance Judgments
• Crowdsourced relevance judgments.
• Anonymous Web users are TRank users.
• 3 workers per entity/context.
• Overall cost: $190
• Pilot study on task design… mega-bubbles!
• Number of votes as relevance score for a type.
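Turning worker votes into relevance scores can be sketched as a simple count, assuming each worker picks one best type per entity/context pair. The choices below are invented for illustration.

```python
from collections import Counter

def relevance_scores(worker_choices):
    """Count how many workers picked each type for one entity/context pair."""
    return Counter(worker_choices)

# Invented choices of 3 workers for one entity/context pair.
choices = [
    "American Computer Programmers",
    "American Billionaires",
    "American Computer Programmers",
]
scores = relevance_scores(choices)
ideal_ranking = [t for t, _ in scores.most_common()]
```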
Editor's Notes
An entity is something that exists by itself, although it need not be of material existence
READ THE TYPES
STATE OF THE ART NER AND LINKING FOCUS IS RANKING TYPES
PARIS: VLDB2012 ontology alignment
Yago super specific types
Entity centric
Use only the information connected to the entity
Context-aware
Exploit the types of entities that co-occur in the context (e.g. Bill Gates + Microsoft vs Bill Gates + Scotland)
Hierarchy-based
Exploit the type hierarchy
Learning to Rank
Combine evidence coming from all previous approaches in an optimal way
We start from the node representing an entity, follow same-as links (which lead to other nodes representing the same entity), and count how many “new” nodes feature the type we’re scoring
The set of types associated to an entity in a knowledge base often doesn’t contain all super types
C_T is the context given by the text
10 FOLD CROSS VALID
DECISION TREE
REGRESSION …
Preliminary experiments showed that … is the best performing model
Use Inverted indices to AVOID SPARQL QUERIES!!
- Increasing granularities of context: from no-context (here is the entity, here are its types, rank them), one sentence/paragraph (rank the types of all entities in this sentence/paragraph)
- 3 workers were asked to select the best type of each entity appearing in a given context
ANCESTORS is the real winner: it uses inverted indices, which are faster, and needs no machine learning