TRank ISWC2013

TRank: Ranking Entity Types Using the Web of Data
Alberto Tonon1, Michele Catasta2, Gianluca Demartini1,
Philippe Cudré-Mauroux1, and Karl Aberer2
1eXascale Infolab, University of Fribourg, Switzerland
{alberto, demartini, phil}@exascale.info
2Distributed Information Systems Laboratory, EPFL, Switzerland
{firstname.lastname}@epfl.ch
ISWC, 25 October 2013

Why Entities?
• The Web is getting entity-centric!
• Entity-centric services (e.g. Google)

…and Why Types?
• “Summarization” of texts
• Contextual entity summaries in Web pages
• Disambiguation of other entities
• Diversification of search results

Example:
Article title: Bin Laden Relative Pleads Not Guilty in Terrorism Case
Entities: Osama Bin Laden, Abu Ghaith, Lewis Kaplan, Manhattan
Types: Al-Qaeda Propagandists, Kuwaiti Al-Qaeda members, Judge, Borough (New York City)

Snippet: “Sulaiman Abu Ghaith, a son-in-law of Osama bin Laden who once served as a spokesman for Al Qaeda”
Types annotated on the snippet: Al-Qaeda Propagandist, Kuwaiti Al-Qaeda members, Jihadist Organizations

Entities May Have Many Types
• Thing
• Agent
• Person
• Living People
• American Billionaires
• American Computer Programmers
• American Philanthropists
• American People of Scottish Descent
• Harvard University People
• Windows People
• People from King County
• People from Seattle

Our Task: Ranking Types Given a Context
• Input: a knowledge base G, an entity e, and a context c in which e appears.
• Output: e’s types ranked by relevance with respect to the context c.
• Evaluation: crowdsourcing + MAP, NDCG

Example:
G: DBpedia 3.8
e: Bill Gates
c: «Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975.»
Ranked types for Bill Gates:
1. American Chief Executive
2. American Computer Programmer
3. American Billionaires
4. …
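
The two evaluation metrics named above can be sketched for a single ranked list of types. This is a minimal illustration, not the evaluation code used for TRank: the log2 discount, the linear gain, and the relevance grades in the usage note are assumptions, since the slide does not state which variant was used.

```python
import math

def dcg(gains):
    # Discounted cumulative gain with a log2 position discount (assumed variant).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(ranked_types, relevance):
    # relevance maps a type to its graded judgment; unjudged types count as 0.
    gains = [relevance.get(t, 0) for t in ranked_types]
    ideal = sorted(relevance.values(), reverse=True)[:len(gains)]
    best = dcg(ideal)
    return dcg(gains) / best if best > 0 else 0.0

def average_precision(ranked_types, relevant):
    # relevant is the set of types considered relevant (binary judgments).
    hits, total = 0, 0.0
    for position, t in enumerate(ranked_types, start=1):
        if t in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    # runs is a list of (ranked_types, relevant) pairs, one per entity/context.
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For example, with hypothetical grades `{"American Chief Executive": 3, "American Billionaires": 1}`, a ranking that places the grade-3 type first gets an NDCG of 1.0, and swapping the two types lowers it.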

TRank Pipeline
1. Text extraction (BoilerPipe)
2. Named Entity Recognition (Stanford NER) → list of entity labels
3. Entity linking (inverted index: DBpedia labels ⟹ resource URIs) → list of entity URIs
4. For each entity URI: type retrieval (inverted index: resource URIs ⟹ type URIs) → list of type URIs
5. Type ranking → ranked list of types
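
The two index lookups in the pipeline can be sketched with plain dictionaries. This is a toy illustration under stated assumptions: the index contents, the URI prefixes, and the function names are hypothetical, text extraction and NER are left out, and the ranking step is a placeholder that keeps the index order.

```python
# Hypothetical toy indices; the real ones are built from DBpedia.
LABEL_TO_URIS = {
    "bill gates": ["dbpedia:Bill_Gates"],
    "microsoft": ["dbpedia:Microsoft"],
}
URI_TO_TYPES = {
    "dbpedia:Bill_Gates": ["dbo:Person", "yago:AmericanBillionaires"],
    "dbpedia:Microsoft": ["dbo:Company"],
}

def link_entities(labels):
    # Entity linking: label -> resource URIs (labels with no match are dropped).
    uris = []
    for label in labels:
        uris.extend(LABEL_TO_URIS.get(label.lower(), []))
    return uris

def retrieve_types(uris):
    # Type retrieval: resource URI -> type URIs.
    return {uri: URI_TO_TYPES.get(uri, []) for uri in uris}

def rank_types(types):
    # Placeholder for a ranking approach; keeps the index order unchanged.
    return list(types)

def pipeline(entity_labels):
    # Runs linking, type retrieval and ranking for each recognized label.
    typed = retrieve_types(link_entities(entity_labels))
    return {uri: rank_types(types) for uri, types in typed.items()}
```

Calling `pipeline(["Bill Gates", "Microsoft"])` returns one ranked type list per linked URI; any of the ranking approaches on the following slides could replace `rank_types`.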

Type Hierarchy
• Root: <owl:Thing>
• Type sources: DBpedia, schema.org, YAGO
• subClassOf relationships: explicit, inferred from <owl:equivalentClass>, manually added, PARIS ontology mapping
• Mappings YAGO/DBpedia (PARIS): <owl:equivalentClass>

Ranking Algorithms
• Entity-centric
• Hierarchy-based
• Context-aware (featuring type-hierarchy)
• Learning to Rank

Entity-Centric Ranking Approaches
(An Example)
• SAMEAS
Score(e, t) = number of
URIs representing e with
type t.

Hierarchy-Based Approaches
(An Example)
• ANCESTORS
Score(e, t) = number of t’s ancestors in the type hierarchy that are contained in Te, the set of e’s types.
Note: Te often doesn’t contain all supertypes of a specific type.
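
The ANCESTORS score can be sketched as follows, assuming the type hierarchy is a tree stored as a child-to-parent map. The `parent` map and the type names in the usage note are hypothetical.

```python
def ancestors(t, parent):
    # Walk up the child -> parent map; stop at the root or if a cycle is found.
    chain = []
    while t in parent and parent[t] not in chain:
        t = parent[t]
        chain.append(t)
    return chain

def ancestors_score(t, entity_types, parent):
    # Number of t's ancestors that also appear among the entity's types (Te).
    return sum(1 for a in ancestors(t, parent) if a in entity_types)
```

With the hypothetical chain AmericanActor → Actor → Person → Agent → Thing and Te = {AmericanActor, Person, Thing}, the score of AmericanActor is 2. Actor and Agent are ancestors too, but they are missing from Te, which is the situation the note above describes.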

Context-Aware Ranking Approaches
(An Example)
• SAMETYPE
Score(e, t, cT) = number of
times t appears among
the types of every other
entity in cT.
[Figure: a context containing the entity e and two other entities, e' and e'', annotated with types such as Person, Actor, AmericanActor, Organization and Thing]
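
A minimal sketch of the SAMETYPE count, assuming the types of every entity in the context are already known. The assignment of types to e, e' and e'' in the usage note is hypothetical.

```python
def sametype_score(t, entity, context_types):
    # Count the other entities in the context that also have type t.
    return sum(
        1
        for other, types in context_types.items()
        if other != entity and t in types
    )
```

For a context `{"e": {"Actor", "AmericanActor"}, "e'": {"Person", "Actor"}, "e''": {"Organization", "Thing"}}`, the type `Actor` of e is shared with one other entity, while `AmericanActor` is shared with none, so `Actor` ranks higher in this context.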

Learning to Rank Entity Types
Determine an optimal combination of all our
approaches:
• Decision trees
• Linear regression models
• 10-fold cross validation
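
The 10-fold cross-validation step can be sketched as a plain split of the judged examples. This shows only the splitting; training the decision tree or regression model on each split is omitted, and the round-robin fold assignment is an assumption.

```python
def k_fold_splits(items, k=10):
    # Assign items to k folds round-robin, then yield (train, test) pairs
    # in which each fold serves as the test set exactly once.
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

Each example ends up in exactly one test fold, so every judgment is used for testing once and for training in the remaining nine rounds.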

Avoiding SPARQL Queries with
Inverted Indices and Map/Reduce
• TRank is implemented with Hadoop and
Map/Reduce.
• All computations are done by using inverted
indices:
– Entity linking
– Path index
– Depth index
• The inverted indices are publicly available at
exascale.info/TRank
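
One plausible way to precompute a path index and a depth index is sketched below. It assumes a tree-shaped hierarchy given as a child-to-parent map; the actual layout of the published TRank indices may differ, and the function names are hypothetical.

```python
def build_path_index(parent, root="owl:Thing"):
    # Map every type to its path from the root, e.g. [root, ..., t].
    index = {}

    def path_to(t):
        if t not in index:
            if t == root or t not in parent:
                index[t] = [t]
            else:
                index[t] = path_to(parent[t]) + [t]
        return index[t]

    for t in set(parent) | set(parent.values()):
        path_to(t)
    return index

def build_depth_index(path_index):
    # Depth of a type = number of edges between the root and the type.
    return {t: len(path) - 1 for t, path in path_index.items()}
```

With both indices in memory, looking up the depth or the ancestors of a type is a dictionary access, so no SPARQL query is needed at ranking time.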

Datasets
• 128 recent NYTimes articles split to create:
– Entity Collection
– Sentence Collection
– Paragraph Collection
– 3-Paragraphs Collection
• Ground-truth obtained by using crowdsourcing
– 3 workers per entity/context
– 4 levels of relevance for each type
– Overall cost: $190

Effectiveness Evaluation
Check our paper or contact us for a complete description of all the approaches we evaluated.

Efficiency Evaluation
• Tested efficiency on a 1TB CommonCrawl sample
– 1,310,459 HTML pages
– 23GB compressed
• Map/Reduce on a cluster of 8 machines, each with 12 cores, 32GB of RAM and 3 SATA disks
• On average, 25 min. processing time (> 100 docs per node per second)

Breakdown of processing time by pipeline step:
Text Extraction: 18.9%
NER: 35.6%
Entity Linking: 29.5%
Type Retrieval: 9.8%
Type Ranking: 6.2%
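
The throughput figure can be checked with a short calculation. It assumes that the 25 minutes cover the whole sample of 1,310,459 pages and that the percentages are shares of that time; both are readings of the slide, not stated facts.

```python
pages = 1_310_459
minutes = 25
nodes = 8

# Documents processed per node per second over the whole run.
docs_per_node_per_sec = pages / (minutes * 60) / nodes

# Minutes spent in each step, assuming the percentages are time shares.
shares = {
    "Text Extraction": 18.9,
    "NER": 35.6,
    "Entity Linking": 29.5,
    "Type Retrieval": 9.8,
    "Type Ranking": 6.2,
}
minutes_per_step = {step: minutes * pct / 100 for step, pct in shares.items()}
```

This gives roughly 109 documents per node per second, consistent with the "> 100" on the slide. Under the same assumptions, NER and entity linking together take about 16 of the 25 minutes, while type ranking takes under 2.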

Conclusions
• New task: ranking entity types.
– Useful for: “summarization” of Web-documents,
entity summaries, disambiguation.
• Various approaches: entity-centric, context-aware, hierarchy-based, learning to rank.
– Hierarchy-based and learning to rank are the most
effective.
• Hadoop, Map/Reduce, and inverted indices to
achieve scalability.

Thank you!
• Datasets (with relevance judgments!), inverted indices, evaluation tools and more material are available at exascale.info/Trank.
Thank you for your attention!
Check out B-hist at the SW Challenge!
Thanks to [sponsor logo] for the Travel Award!
TRank is open-source! https://github.com/MEM0R1ES/TRank

• FREQ
Rank(e, t, ck) = number of triples <e> <rdf:type> <t> in the knowledge base.
• WIKILINK
Rank(e, t, ck) = number of e’s “neighbor entities” with type t.
• SAMEAS
Rank(e, t, ck) = number of URIs representing e with type t.
• LABEL
Rank(e, t, ck) = frequency of t among the top-10 most similar entities in terms of label (computed with Lucene).

• LABEL
Rank(e, t, ck) = frequency of t among the top-10 most similar entities in terms of label. Exploits an inverted index.
[Figure: an inverted index is created from the labels of the knowledge-base entities e1 … eN (e.g. "Tom Cruise", "Tom Hanks", "Bill Gates", "Osama Bin Laden"); it maps each token to the entities whose label contains it ("Tom" → e1, e3; "Cruise" → e1; "Hanks" → e3; "Bill" → e2). Label(e) is issued as a query against the index, the matches are ranked by TF-IDF, and the top-10 are kept.]
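
The LABEL approach can be sketched with a small in-memory inverted index. This is an illustration only: the slides mention Lucene, whose scoring differs from the simplified TF-IDF below, and excluding the query entity from its own results is an assumption. All names and data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def build_index(labels):
    # Inverted index: token -> set of entities whose label contains the token.
    index = defaultdict(set)
    for entity, label in labels.items():
        for token in label.lower().split():
            index[token].add(entity)
    return index

def top_k_similar(query_label, labels, index, k=10, exclude=None):
    # Score entities by a simplified TF-IDF over the tokens of the query label.
    scores = Counter()
    for token in set(query_label.lower().split()):
        postings = index.get(token, set())
        if not postings:
            continue
        idf = math.log(len(labels) / len(postings)) + 1.0
        for entity in postings:
            tf = labels[entity].lower().split().count(token)
            scores[entity] += tf * idf
    scores.pop(exclude, None)  # assumption: the entity itself does not vote
    return [entity for entity, _ in scores.most_common(k)]

def label_scores(entity, labels, types_of, index, k=10):
    # Frequency of each type among the k entities with the most similar labels.
    counts = Counter()
    for similar in top_k_similar(labels[entity], labels, index, k, exclude=entity):
        counts.update(types_of.get(similar, ()))
    return counts
```

With the labels "Tom Cruise", "Bill Gates" and "Tom Hanks", a query for "Tom Hanks" retrieves "Tom Cruise" through the shared token "Tom", so the types of Tom Cruise contribute to the scores of Tom Hanks's types.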

Hierarchy-Based Ranking Approaches
• DEPTH
Rank(e, t, cH) = depth of t in the type hierarchy.
• ANCESTORS
Rank(e, t, cH) = number of t’s ancestors in the type hierarchy contained in Te.
• ANC_DEPTH
Rank(e, t, cH) = [formula not preserved in the slide text]
Note: Te often doesn’t contain all supertypes of a specific type.

• The context can help to obtain a better ranking of types.
• Which is the right type for Mali?

Context: “Italy’s rebellious voters, who opted for a flamboyant billionaire and a clown, reminded us last week how deeply in crisis the Continent is. Meanwhile, France is going it virtually alone in Mali, and Britain talks openly of jumping the European ship altogether.”

Candidate types for Mali:
– Landlocked Countries
– Least Developed Countries
– States And Territories Established In 1960
– French-speaking Countries
– World Trade Organization Member Economies
– Country
– African Union Member States
– African Countries
– Member States Of La Francophonie
– African Union Member Economies
– Populated Place
– Place

PATH
• Suppose we have to compute Rank(e, t, cT).
• Consider each type t’ of each other entity e’ in c.
• P(t) = path from the root of the type hierarchy to t.

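Example: ranking Tom Hanks’s types when he co-occurs with Tom Cruise in some text.
[Figure: type hierarchy annotated with numeric scores; the assignment of scores to types is not recoverable from the slide text]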
Relevance Judgments
• Crowdsourced relevance judgments.
• Anonymous Web users are TRank users.
• 3 workers per entity/context.
• Overall cost: $190
• Pilot study on task design… mega-bubbles!
• Number of votes used as the relevance score for a type.
