Approaches for Improving and
Enriching Textual Knowledge Bases
Besnik Fetahu
PhD Defense
8th of November 2017
Hannover, Germany
What is a textual
knowledge base?
Wikipedia as a textual knowledge base
[Figure: the Wikipedia article "University of Hannover", annotated with its infobox, section template, and section text.]
Wikipedia is a free online encyclopedia that aims to allow anyone to edit articles.
It is the largest and most popular general reference work on the Internet and is
ranked the fifth most popular website. Wikipedia is owned by the nonprofit Wikimedia Foundation.
Wikipedia Editor Collaboration Dynamics
[Figure: Wikipedia editors and localized Wikipedias. Example editor profile: lang: {English}, topic: {Education, Politics}.]
• ~40 million articles
• 293 languages
• 32 million editors (in the English Wikipedia alone)
Wikipedia Revisions
(cur | prev) 02:51, 5 October 2017 Brilliantwiki2 (talk | contribs) . . (14,516 bytes) (+69) . . (→Rankings) (undo | thank)
(cur | prev) 00:30, 21 August 2017 Blueclaw (talk | contribs) . . (14,447 bytes) (+83) . . (→Alumni: added Flügge-Lotz) (undo | thank)
(cur | prev) 09:03, 18 June 2017 77.23.196.148 (talk) . . (14,364 bytes) (-2) . . (→History: pupils -> students, today and now in same sentence corrected) (undo)
(cur | prev) 05:35, 10 June 2017 AnomieBOT (talk | contribs) . . (14,366 bytes) (+319) . . (Rescuing orphaned refs ("Mitarbeiter und Etat" from rev 782668206; "Studierende" from rev 782668206)) (undo)
(cur | prev) 01:24, 10 June 2017 Mephistolus (talk | contribs) m . . (14,047 bytes) (+9) . . (undo | thank) (Tag: Visual edit)
(cur | prev) 01:21, 10 June 2017 Mephistolus (talk | contribs) . . (14,038 bytes) (-89) . . (Update infobox) (undo | thank) (Tag: Visual edit)
• Wikipedia and its sister projects develop at a rate of
over 10 edits per second, performed by editors from all
over the world.
• English Wikipedia has an average growth rate of 600
new articles per day.
Wikipedia Dynamics and Growth
Wikipedia’s Daily Growth Rate
Editorial Policies in Wikipedia
Wikipedia is written from a neutral point of view.
Content in Wikipedia must be verifiable. The
burden of evidence lies with the editor who adds
content into a page.
No original research. Wikipedia must not contain content, such as facts,
allegations, and ideas, for which no reliable,
published sources exist.
Why Wikipedia?
Quality in Wikipedia
• Comparable quality to Britannica[1]
• Verifiable content through third-party external sources
• Up-to-date information on emerging entities and events[2]
[1] Giles, Jim. "Internet encyclopaedias go head to head." (2005): 900-901. Nature.
[2] Keegan, Brian, Darren Gergle, and Noshir Contractor. "Hot off the wiki: Structures and dynamics of Wikipedia's coverage of breaking news events." American Behavioral Scientist 57, no. 5 (2013)
Importance and Use of Wikipedia
Rank | Site          | Daily Time on Site | Daily Pageviews per Visitor | % of Traffic from Search | Total Sites Linking In
1    | google.com    | 8:02               | 8.93                        | 4.30%                    | 3.56 mill.
2    | youtube.com   | 8:27               | 4.98                        | 15.40%                   | 2.69 mill.
3    | facebook.com  | 9:48               | 4.01                        | 8.30%                    | 7.6 mill.
4    | baidu.com     | 7:56               | 6.36                        | 8.50%                    | 1.3 mill.
5    | wikipedia.org | 4:11               | 3.28                        | 68.40%                   | 1.7 mill.
From Wikipedia to Structured Data and Search
Web Search (Google Knowledge Cards)
VoiceSearch
Use Cases: Question Answering
Danqi Chen, Adam Fisch, Jason Weston, Antoine Bordes: Reading Wikipedia to Answer Open-Domain Questions. ACL 2017.
Q: When did the 1973 oil crisis begin?
A: October 1973
What are the issues in
Wikipedia?
Issues with Verifiability in Wikipedia
https://en.wikipedia.org/wiki/Wikipedia:List_of_hoaxes_on_Wikipedia
Issues with Verifiability in Wikipedia
Unreliable news source
https://www.theguardian.com/technology/2017/feb/08/wikipedia-bans-daily-mail-as-unreliable-source-for-website
Issues with Verifiability in Wikipedia
16
Where was Obama born?
In 2012, Breitbart.com published a copy of a promotional
booklet that Obama's literary agency, Acton & Dystel,
printed in 1991 (and later posted to their website, in a
biography in place until April 2007) which misidentified
Obama's birthplace and states that Obama was "born in
Kenya and raised in Indonesia and Hawaii."
Obama was born on August 4, 1961, at Kapiʻolani Maternity &
Gynecological Hospital in Honolulu, Hawaii. He is the first President to
have been born in Hawaii, making him the first President born outside of
the contiguous 48 states. He was born to a white mother and a black
father. His mother, Ann Dunham (1942–1995), was born in Wichita,
Kansas, of mostly English descent, with some German, Irish, Scottish,
Swiss, and Welsh ancestry.
Kenya. Kapiʻolani Maternity & Gynecological Hospital in Honolulu, Hawaii
Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, Hannaneh Hajishirzi: Bidirectional Attention Flow for Machine Comprehension. CoRR abs/1611.01603 (2016)
Misalignment of Editor Efforts in Wikipedia
• Human fatalities: 10k (Odisha Cyclone) vs. 1.8k (Hurricane Katrina)
• Estimated damages: $4.5 billion (Odisha Cyclone) vs. $108 billion (Hurricane Katrina)
• “Odisha Cyclone” without
coverage and mention in
Wikipedia article “Odisha”
• “Hurricane Katrina” finds broad
coverage in Wikipedia article
“New Orleans”
Challenges addressed
in this thesis
Challenges and Contributions
For an arbitrary statement in Wikipedia, how can we find
citations that provide evidence for it?
For a paragraph in Wikipedia and an existing citation, how can
we determine the exact span of the citation?
For a Wikipedia page and a given news corpus, how can we
find and suggest important and novel information for the page?
Part (I): Finding news citations
for Wikipedia entity pages
Besnik Fetahu, Katja Markert, Wolfgang Nejdl, Avishek Anand: "Finding News Citations for Wikipedia". CIKM 2016: 337-346
[Figure: thesis overview. A news collection and a textual knowledge base evolve over time points t1, t2, ..., tn and are connected by three tasks: Citation Recommendation, Citation Span, and News Suggestion (Entity Placement, Section Placement). Running example: the entity "Barack Obama". Citation Recommendation asks whether statement s1 ("Obama was born on August 4, 1961, at Kapiʻolani Maternity & Gynecological Hospital in Honolulu, Hawaii.") needs a news citation and issues a query for it; Citation Span determines the text covered by citation c4; News Suggestion takes a news article nk published after t2 (on the 2009 Nobel Peace Prize) and places it in the entity page, which gains a new section "Nobel Peace Prize" between t2 and t3. News articles carry: publish date t, headline, body, entity mentions (e.g. "Barack Obama, Nobel Prize" ...). Entity pages carry: revision date t, entity title, sections, section text, categories, citations.]
Finding news citations for Wikipedia entity pages
Wikipedia
Articles
1. Early life and career
2. Presidential campaigns
3. Presidency (2009-2017)
4. Post-presidency (2017-present)
5. Legacy
6. Books written
Section Chunking Statements Extraction
Obama was born on August 4, 1961,[c] at Kapiʻolani
Maternity & Gynecological Hospital in Honolulu, Hawaii.[c]
He is the first President to have been born in Hawaii,[c]
making him the first President born outside of the
contiguous 48 states.[c] […] His father, Barack Obama Sr.
(1936–1982), was a married Luo Kenyan man from
Nyang'oma Kogelo. Obama's parents met in 1960 in a
Russian language class at the University of Hawaii at
Manoa, where his father was a foreign student on
scholarship.[c] […]
Obama's parents met in 1960 in
a Russian language class at the
University of Hawaii at Manoa,
where his father was a foreign
student on scholarship.[c]
Find a citation for the
statement!
Does it require a
news citation?
Yes
Task#1: Statement Categorization
Task#2: Citation Discovery
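The section chunking and statement extraction steps can be sketched roughly as follows. This is only an illustration, assuming citation positions are marked with the literal placeholder `[c]` as in the example; the name `extract_statements` and the naive sentence-boundary regex are hypothetical and not taken from the thesis.

```python
import re

# Assumption: every citation position is marked with the literal placeholder "[c]".
CITATION_MARKER = "[c]"

# Naive sentence boundary: whitespace that follows sentence-ending punctuation
# or a citation marker, and that precedes an uppercase letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])|(?<=\[c\])\s+(?=[A-Z])")


def extract_statements(section_text):
    """Return (statement_text, number_of_markers) for each cited sentence."""
    sentences = [s.strip() for s in SENTENCE_BOUNDARY.split(section_text) if s.strip()]
    statements = []
    for sentence in sentences:
        n_markers = sentence.count(CITATION_MARKER)
        if n_markers > 0:
            # Strip the markers so that only the statement text remains.
            statements.append((sentence.replace(CITATION_MARKER, ""), n_markers))
    return statements


section = (
    "Obama was born on August 4, 1961,[c] at a hospital in Honolulu, Hawaii.[c] "
    "He is the first President to have been born in Hawaii.[c] "
    "His father was a Luo Kenyan man from Nyang'oma Kogelo. "
    "Obama's parents met in 1960 in a Russian language class.[c]"
)
for text, n in extract_statements(section):
    print(n, text)
```

The regex splits too eagerly on abbreviations and on sub-sentence markers followed by a capitalized word; a real pipeline would use a proper sentence splitter.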
Task#1: Statement
Categorization
Why Statement Categorization?
type | description
arXiv | arXiv preprints
AV media | audio and visual
AV media notes | audio and visual liner notes
bioRxiv | bioRxiv preprints
book | books
conference | conference papers
encyclopedia | edited collections
episode | radio or television collections
interview | interviews
journal | academic journals and papers
magazine | magazines, periodicals
mailing list | public mailing lists
map | maps
news | news articles
newsgroup | online newsgroups
podcast | audio or video podcast
press release | press releases
report | reports
serial | audio or video serials
sign | signs, plaques
speech | speeches
techreport | technical reports
thesis | theses
web | any resource accessible through a […]
Citation Types in Wikipedia
[Figure: distribution (0 to 1) of the citation types news, book, court, journal, web, and thesis per entity type. Entity types shown: ComicsCreator, Artwork, NaturalPlace, Airline, Film, SoccerManager, LegalCase, Album, Band, SportsTeam, TelevisionShow, AnatomicalStructure, Athlete, Weapon, Criminal, MusicalArtist, Politician, Plant, Song, Non-ProfitOrganisation, Book, Actor, FictionalCharacter, RecordLabel, Broadcaster, PoliticalParty, Automobile, TradeUnion, Scientist, MilitaryPerson, Philosopher, TelevisionSeason, Election, OfficeHolder, SportsLeague, GovernmentAgency, Single, Animal, Award, SportsEvent, Airport, MilitaryConflict, TelevisionEpisode, Aircraft, Magazine, Writer, Location.]
Besnik Fetahu, Abhijit Anand, Avishek Anand: "How much is Wikipedia Lagging Behind News?" WebSci 2015: 28:1-28:9
• Citations of type web and news account for the absolute majority of citations in Wikipedia
• Citations of type news are considered a "reliable, published source"
• Depending on the context and the information added to Wikipedia, different citation types are preferred
Why Statement Categorization?
Obama emphasized issues of rapidly ending the Iraq War, increasing
energy independence, and reforming the health care system,[1] in a
campaign that projected themes of hope and change.[2]
cite type = "news"
1. "Barack Obama on the Issues: What Would Be Your Top Three Overall Priorities If Elected?". The Washington Post.
2. "The Obama promise of hope and change". The Independent. London. November 1, 2008.
On June 3, 2008, Senator Obama, along with Senators Tom Carper, Tom
Coburn, and John McCain, introduced follow-up legislation: Strengthening
Transparency and Accountability in Federal Spending Act of 2008.[1]
cite type = "report"
1. "S. 3077: Strengthening Transparency and Accountability in Federal Spending Act of 2008: 2007–2008 (110th Congress)". Govtrack.us. June 3, 2008.
In mid-1988, he traveled for the first time in Europe for three weeks and
then for five weeks in Kenya, where he met many of his paternal relatives
for the first time.[1][2]
cite type = "book"
1. Obama, Auma (2012). And then life happens: a memoir. New York: St. Martin's Press. pp. 189–208, 212–216. ISBN 978-1-250-01005-6.
Task#1: Statement Categorization
For a given Wikipedia statement, categorize it through a supervised model into
one of the predefined citation types (e.g. "news", "web", etc.).
Example statement: Obama emphasized issues of rapidly ending the Iraq War, increasing energy
independence, and reforming the health care system,[c] in a campaign that projected
themes of hope and change.[c]
[Figure: statement categorization pipeline. Wikipedia language style and Wikipedia entity structure → feature extraction → feature representation → multi-class classification.]
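The pipeline of feature extraction, feature representation, and multi-class classification can be illustrated with a small sketch. Two things are assumptions here: the features are plain bag-of-words counts (the slides name language style and entity structure features without listing them), and the classifier is a hand-written multinomial Naive Bayes chosen only to keep the example dependency-free; it is not the model used in the thesis. The three training statements and their labels are the examples from the "Why Statement Categorization?" slide.

```python
import math
import re
from collections import Counter, defaultdict


def extract_features(statement):
    """Bag-of-words counts; a stand-in for the real feature set."""
    return Counter(re.findall(r"[a-z0-9]+", statement.lower()))


class NaiveBayesCitationTypeClassifier:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, statements, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for statement, label in zip(statements, labels):
            self.token_counts[label].update(extract_features(statement))
        self.vocabulary = {t for counts in self.token_counts.values() for t in counts}
        return self

    def predict(self, statement):
        features = extract_features(statement)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            n_tokens = sum(self.token_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for token, count in features.items():
                p = (self.token_counts[label][token] + 1) / (n_tokens + len(self.vocabulary))
                score += count * math.log(p)  # smoothed log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


train = [
    ("Obama emphasized issues of rapidly ending the Iraq War, increasing energy "
     "independence, and reforming the health care system, in a campaign that "
     "projected themes of hope and change.", "news"),
    ("On June 3, 2008, Senator Obama, along with Senators Tom Carper, Tom Coburn, "
     "and John McCain, introduced follow-up legislation: Strengthening Transparency "
     "and Accountability in Federal Spending Act of 2008.", "report"),
    ("In mid-1988, he traveled for the first time in Europe for three weeks and then "
     "for five weeks in Kenya, where he met many of his paternal relatives for the "
     "first time.", "book"),
]
model = NaiveBayesCitationTypeClassifier().fit([s for s, _ in train], [l for _, l in train])
print(model.predict("a campaign that projected themes of hope and change"))
```

With three training statements the model only memorizes vocabulary; the point is the shape of the pipeline, not the accuracy.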
Task#2: Citation
Discovery
Task#2: Citation Discovery
For a Wikipedia statement that requires a citation of type news, find one or more
news article(s) as a citation from a given news corpus.
Wikipedia statement: On February 10, 2007, Obama announced his candidacy for President of the United States in front
of the Old State Capitol building in Springfield, Illinois.
[Figure: citation discovery pipeline. Wikipedia statement → query → news index → top-k retrieval → ranked news (1 doc1, 2 doc2, 3 doc3, ..., 100 doc100) → citation discovery.]
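The query and top-k retrieval steps can be sketched as below. The slides do not name the retrieval model, so TF-IDF with cosine similarity is an assumption, as are the class name `NewsIndex`, the small stopword list, and the three-article toy index (snippets n1 and n2 are shortened from text quoted in these slides, n3 is made up for contrast).

```python
import math
import re
from collections import Counter

# Assumption: a tiny hand-picked stopword list, only for this toy example.
STOPWORDS = {"a", "an", "and", "as", "for", "his", "in", "of", "on", "the", "to"}


def tokenize(text):
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


class NewsIndex:
    """In-memory TF-IDF index over news articles (illustrative only)."""

    def __init__(self, articles):
        self.doc_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in articles.items()}
        self.n_docs = len(articles)
        self.df = Counter(t for counts in self.doc_counts.values() for t in counts)

    def _vector(self, counts):
        # Smoothed inverse document frequency; unseen terms get df = 0.
        return {t: tf * (math.log((self.n_docs + 1) / (self.df[t] + 1)) + 1.0)
                for t, tf in counts.items()}

    def top_k(self, statement, k=100):
        query = self._vector(Counter(tokenize(statement)))
        q_norm = math.sqrt(sum(w * w for w in query.values()))
        ranked = []
        for doc_id, counts in self.doc_counts.items():
            doc = self._vector(counts)
            d_norm = math.sqrt(sum(w * w for w in doc.values()))
            dot = sum(w * doc.get(t, 0.0) for t, w in query.items())
            score = dot / (q_norm * d_norm) if q_norm and d_norm else 0.0
            ranked.append((doc_id, score))
        ranked.sort(key=lambda pair: pair[1], reverse=True)
        return ranked[:k]


index = NewsIndex({
    "n1": "Speaking to thousands of supporters outside the Old State Capitol, "
          "the first-term Democratic senator delivered an address",
    "n2": "The choice of Barack Obama as the recipient of the 2009 Nobel Peace Prize",
    "n3": "Hurricane Katrina finds broad coverage in the Wikipedia article New Orleans",
})
statement = ("On February 10, 2007, Obama announced his candidacy for President of the "
             "United States in front of the Old State Capitol building in Springfield, Illinois.")
print(index.top_k(statement, k=2))
```

The ranked list is only the candidate set; deciding which candidate is an actual citation is the job of the supervised step that follows.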
Task#2: Citation Discovery Properties
Wikipedia Statement
On February 10, 2007, Obama announced his candidacy for President of the United States in front of the Old State Capitol building in Springfield, Illinois.
News Sentence (from the candidate news article)
Speaking in a single-digit, morning chill and sunshine to thousands of supporters outside the Old State Capitol, the first-term Democratic senator delivered an address that built upon his biography as a community organizer in Chicago, state legislator and U.S. senator to call for quick action on issues ranging from bringing a close to the Iraq war to the need for universal health care and an end to foreign-oil dependence.
Entailment
• Lexical similarity between sentence and statement
• Language model similarity
• TreeKernel similarity
• Query and news article
Centrality
• TextRank for measuring sentence centrality in a news article
• Entailment feature scores w.r.t. the most central sentence in a news article
Authority
• Entity type specific news citation suggestion
• Authority of news domains on specific entity types
Properties of a good citation:
1. the statement should be entailed by the news article
2. the statement is central in the news article
3. the cited news article should be from an authoritative source
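To make the centrality property concrete, here is a minimal TextRank-style sketch. It assumes Jaccard word overlap as the sentence similarity (the same function doubles as a simple lexical similarity between a news sentence and a statement); the slides do not specify the similarity measure, and the function names and the three-sentence toy article are hypothetical.

```python
import re


def tokens(sentence):
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def jaccard(a, b):
    """Lexical similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def textrank(sentences, damping=0.85, iterations=50):
    """Weighted PageRank over the sentence similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    toks = [tokens(s) for s in sentences]
    weights = [[jaccard(toks[i], toks[j]) if i != j else 0.0 for j in range(n)]
               for i in range(n)]
    out_weight = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                scores[j] * weights[j][i] / out_weight[j]
                for j in range(n) if out_weight[j] > 0
            )
            for i in range(n)
        ]
    return scores


article = [
    "Obama announced his candidacy.",
    "Obama announced his candidacy outside the Old State Capitol.",
    "Thousands gathered outside the Old State Capitol.",
]
scores = textrank(article)
most_central = max(range(len(article)), key=lambda i: scores[i])
print(most_central, [round(s, 3) for s in scores])
```

In this toy article the middle sentence overlaps with both neighbours while the outer two share no words, so it receives the highest centrality score; entailment features would then be computed against that sentence.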
Evaluation
Evaluation Datasets
Task#1: Statement Categorization Data
• 6.9 million Wikipedia statements
• 8.8 million citations to external references
• 1.6 million Wikipedia entities
• 1.88 million news articles cited from Wikipedia statements
Task#2: Citation Discovery Data
• 20 million news articles from a real-world news collection (GDelt), 2013–2015
• 27k news articles cited from Wikipedia statements within the time range of GDelt
GDelt domain stats
news domain     | news articles
yahoo.com       | 1244781
allafrica.com   | 1035646
reuters.com     | 828133
dailymail.co.uk | 815372
indiatimes.com  | 743991
wn.com          | 587607
Wikipedia statement distribution by citation type
Task#1: Statement Categorization Results
[Figure: YAGO type hierarchy, including the type yagoLegalActorGeo. Separate models per entity type.]
Statement categorization results based on Random Forests (P, R, F1 per range of τ):
Parent Type     | Child Type       | 1 ≤ τ ≤ 10     | 10 < τ ≤ 50    | 50 < τ ≤ 90
owl:Thing       | Legal Actor Geo  | 0.48 0.36 0.41 | 0.51 0.43 0.47 | 0.53 0.47 0.50
Legal Actor Geo | Legal Actor      | 0.51 0.34 0.41 | 0.54 0.41 0.47 | 0.56 0.45 0.50
                | location         | 0.30 0.29 0.29 | 0.34 0.40 0.37 | 0.36 0.45 0.40
location        | region           | 0.30 0.28 0.29 | 0.35 0.40 0.37 | 0.37 0.44 0.40
                | point            | 0.30 0.10 0.14 | 0.38 0.22 0.28 | 0.39 0.26 0.32
Legal Actor     | person           | 0.53 0.36 0.43 | 0.56 0.43 0.49 | 0.58 0.46 0.51
person          | preserver        | 0.63 0.31 0.42 | 0.67 0.46 0.54 | 0.67 0.49 0.57
                | authority        | 0.53 0.20 0.29 | 0.62 0.24 0.35 | 0.65 0.33 0.44
                | contestant       | 0.59 0.43 0.50 | 0.62 0.52 0.57 | 0.64 0.56 0.60
                | leader           | 0.53 0.26 0.34 | 0.59 0.34 0.43 | 0.61 0.37 0.46
                | wc Living people | 0.55 0.37 0.44 | 0.58 0.44 0.50 | 0.59 0.47 0.52
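The F1 columns are the harmonic mean of precision and recall, F1 = 2PR / (P + R). A short sketch for re-deriving them from the reported P and R follows; the helper name `f1_score` is ours. Because the table values are rounded to two digits, a recomputed F1 can differ from the reported one in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# owl:Thing -> Legal Actor Geo, first column group: P = 0.48, R = 0.36
print(round(f1_score(0.48, 0.36), 2))
# same row, last column group: P = 0.53, R = 0.47
print(round(f1_score(0.53, 0.47), 2))
```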
Task#2: Citation Discovery Results
B1: top-1 retrieved article from the retrieval model
B2: supervised model based on the rank and similarity score from the search engine
E1: our approach with entailment, centrality, and authority features, where the correct citations for a statement are the news articles originally cited by that statement in the Wikipedia page
E1+FP: like E1, but false positives with very high similarity to ground-truth articles are counted as relevant
E2: builds on E1+FP and additionally assesses the relevance of false-positive articles below the similarity threshold
Part (I): Conclusion
• Specific citation types are preferred based on the statement, its context, and the
Wikipedia page
• Statement categorization works fairly well for some entity types
• Challenging to distinguish between citation type web and news
• Citation discovery can be performed accurately across all entity types
Part (II): Fine Grained
Citation Span for
References in Wikipedia
Besnik Fetahu, Katja Markert, Avishek Anand: "Fine Grained Citation Span for References in Wikipedia". EMNLP 2017: 1980-1989
Citations in Wikipedia
Citation Span Cases
Obama was born on August 4, 1961,[5] at Kapiʻolani Maternity & Gynecological
Hospital in Honolulu, Hawaii.[6][7][8]
On February 10, 2007, Obama announced his candidacy for President of the United
States in front of the Old State Capitol building in Springfield, Illinois.[158][159] […]
Obama emphasized issues of rapidly ending the Iraq War, increasing energy
independence, and reforming the health care system,[161] in a campaign that
projected themes of hope and change.[162]
At the Democratic National Convention in Charlotte, North Carolina, Obama and Joe
Biden were formally nominated by former President Bill Clinton as the Democratic
Party candidates for president and vice president in the general election. Their main
opponents were Republicans Mitt Romney, the former governor of Massachusetts,
and Representative Paul Ryan of Wisconsin.[183]
Citation marker placed at a sub-sentence level
Citation marker placed at the end of a sentence
Citation marker placed after multiple sentences in a paragraph
Citation Span Task
Citing Paragraph
He was reelected to the Illinois Senate in 1998, defeating Republican Yesse Yehudah in the general election, and was re-elected again in 2002.[117] In 2000, he lost a Democratic primary race for Illinois's 1st congressional district in the United States House of Representatives to four-term incumbent Bobby Rush by a margin of two to one.[118]
Textual Fragments
Chunk the paragraph at punctuation symbols.
Citing Span
Citation span for reference [117]: the text fragments of the citing paragraph that are covered by the citation.
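The chunking step can be sketched as follows. This is a minimal sketch, assuming that fragments end at one of the punctuation symbols `. , ; : ! ?` and that citation markers such as [117] stay attached to the fragment they follow. The delimiter set and the name `chunk_paragraph` are illustrative assumptions, not taken from the slides.

```python
import re
from typing import List

# Assumed delimiter set; the slides only say "punctuation symbols".
# A fragment ends at punctuation, optionally followed by citation markers like [117].
_FRAGMENT = re.compile(r"[^.,;:!?]+(?:[.,;:!?]+(?:\[\d+\])*|$)")


def chunk_paragraph(paragraph: str) -> List[str]:
    """Split a citing paragraph into text fragments at punctuation symbols."""
    pieces = (m.group().strip() for m in _FRAGMENT.finditer(paragraph))
    return [p for p in pieces if p]


paragraph = (
    "He was reelected to the Illinois Senate in 1998, defeating Republican "
    "Yesse Yehudah in the general election, and was re-elected again in 2002.[117] "
    "In 2000, he lost a Democratic primary race for Illinois's 1st congressional "
    "district in the United States House of Representatives to four-term incumbent "
    "Bobby Rush by a margin of two to one.[118]"
)
fragments = chunk_paragraph(paragraph)
for fragment in fragments:
    print(fragment)
```

Under these assumptions the example paragraph yields five fragments, and the third one carries the marker [117].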
Citation Span Approach
Determine span for citation c = [117]
Extract features for each text fragment
Paragraph Structure
• Citations other than c
• Same sentence as c
• Same sentence as previous text fragment
• Distance in terms of text fragments to c
Citation Features
• Language models per paragraph in the cited document
• Language model similarity between a fragment and a paragraph’s LM
Discourse/Temporal Features
• Explicit discourse sense annotation of sentences
• Fragments in a sentence with explicit discourse (e.g. comparison) are likely to have the same label
• Fragments with different time points are unlikely to have the same label
Sequence Classification (linear-chain CRF)
Plain Classification
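The paragraph structure features can be sketched per text fragment as below. This is a sketch under stated assumptions: the `Fragment` type, the feature names, and the 0/1 encoding are hypothetical, and the citation and discourse/temporal features are left out because they need the cited document and a discourse parser.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Fragment:
    text: str
    sentence_id: int  # index of the sentence the fragment belongs to
    citations: List[str] = field(default_factory=list)  # markers attached to it


def structure_features(fragments: List[Fragment], c: str) -> List[Dict[str, float]]:
    """One feature dict per fragment, relative to the target citation c."""
    # Position and sentence of the fragment that carries the marker c.
    c_pos = next(i for i, f in enumerate(fragments) if c in f.citations)
    c_sent = fragments[c_pos].sentence_id
    rows = []
    for i, f in enumerate(fragments):
        same_as_prev = i > 0 and fragments[i - 1].sentence_id == f.sentence_id
        rows.append({
            "has_other_citation": float(any(m != c for m in f.citations)),
            "same_sentence_as_c": float(f.sentence_id == c_sent),
            "same_sentence_as_prev": float(same_as_prev),
            "distance_to_c": float(abs(i - c_pos)),
        })
    return rows


fragments = [
    Fragment("He was reelected to the Illinois Senate in 1998,", 0),
    Fragment("defeating Republican Yesse Yehudah in the general election,", 0),
    Fragment("and was re-elected again in 2002.", 0, ["[117]"]),
    Fragment("In 2000,", 1),
    Fragment("he lost a Democratic primary race ... two to one.", 1, ["[118]"]),
]
features = structure_features(fragments, "[117]")
```

Each feature dict would then be one observation in the fragment sequence passed to either the plain classifier or the linear-chain CRF.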
Evaluation
Citation Span Dataset
• 500 citing paragraphs, pointing to either web or news citations
• Each textual fragment is manually annotated as to whether it is explicitly supported or implied by the corresponding citation
• High inter-rater agreement on a 10% sample with 𝜅=0.84
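The slides report the agreement value without naming the statistic. Assuming it is Cohen's kappa for two annotators, it can be computed as in this sketch; the labels below are toy data made up for illustration, not the dataset's annotations.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    # Chance agreement from the two annotators' label distributions.
    expected = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy labels: is a fragment covered by the citation ("in") or not ("out")?
annotator_1 = ["in", "in", "out", "out", "in", "out", "in", "out"]
annotator_2 = ["in", "in", "out", "out", "in", "out", "out", "out"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

The sketch does not guard against the degenerate case where chance agreement equals 1.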
span      news                                 web
          dist.   skip frag.   skip sent.      dist.   skip frag.   skip sent.
<= 0.5    11%     6%           -               6.8%    -            -
(.5,1]    63%     -            -               63%     -            1%
(1,2]     17%     -            8%              14%     -            19%
(2,5]     7%      5%           18%             13.1%   -            21%
> 5       1.8%    -            20%             3.1%    -            67%
Evaluation Metrics
Mean Average Precision (MAP):
$$\mathrm{MAP} = \frac{1}{|N|} \sum_{p \in N} \frac{|S' \cap S^t|}{|S'|}$$
Recall (R):
$$R = \frac{1}{|N|} \sum_{p \in N} \frac{|S' \cap S^t|}{|S^t|}$$
Erroneous Span (word and text-fragment level), Δ:
$$\Delta_w = \frac{1}{|N|} \sum_{p \in N} \frac{\sum_{\delta \in S' \setminus S^t} \mathrm{words}(\delta)}{\sum_{\delta \in S^t} \mathrm{words}(\delta)}$$
• S': text fragments marked as covered by one of the approaches.
• S^t: text fragments marked as covered by the citation in our ground-truth.
• words(δ): number of words in a text fragment δ.
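The three metrics can be sketched in code as below. This is a sketch of one reading of the slide's formulas: precision is the overlap divided by the predicted fragments, recall is the overlap divided by the ground-truth fragments, and the erroneous span is the words in predicted fragments outside the ground truth divided by the ground-truth words. The data layout, the toy values, and the handling of empty predictions are assumptions.

```python
from typing import Dict, List, Tuple


def span_metrics(paragraphs: List[Dict]) -> Tuple[float, float, float]:
    """Average precision, recall and word-level erroneous span over paragraphs."""
    n = len(paragraphs)
    precision = recall = delta_w = 0.0
    for p in paragraphs:
        pred, gold, words = p["pred"], p["gold"], p["words"]
        overlap = len(pred & gold)
        # Assumption: an empty prediction contributes zero precision.
        precision += overlap / len(pred) if pred else 0.0
        recall += overlap / len(gold)
        # Words in predicted fragments that lie outside the ground-truth span.
        extra_words = sum(words[d] for d in pred - gold)
        delta_w += extra_words / sum(words[d] for d in gold)
    return precision / n, recall / n, delta_w / n


# Toy data: fragment ids per paragraph and the word count of each fragment.
paragraphs = [
    {"pred": {0, 1, 2, 3}, "gold": {0, 1, 2}, "words": {0: 9, 1: 8, 2: 6, 3: 2, 4: 7}},
    {"pred": {0}, "gold": {0, 1}, "words": {0: 5, 1: 5}},
]
map_score, recall, delta_w = span_metrics(paragraphs)
```

Because the erroneous span is normalized by the ground-truth words, it can exceed 100% when an approach marks far more text than the citation covers.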
Citation Span Analysis — Accuracy
Citation Span results decomposed across the different span cases.
MRF: computes two functions, (i) potential and (ii) compatibility. (i) measures the similarity between sentences in a citing paragraph, and (ii) measures the similarity between a sentence and the sentences in the citing document.
IC: the span of a citation is the text between two citations; it may start at the beginning of a paragraph and may end at the end of a paragraph.
CS/CSW: for CS, the span is the citing sentence; for CSW, the span is the citing sentence +/- 2 sentences, depending on whether they contain specific cue words in specific locations in the sentence.
CSPC/CSPS: CSPC is our plain classifier with the proposed features; CSPS is a structured prediction model trained on the same feature set.
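The two simplest baselines, IC and CS, can be sketched over a fragment list as below. This is one possible reading of the descriptions above, with a hypothetical `(text, sentence index, citation markers)` tuple per fragment. CSW and MRF are omitted because the cue words and similarity functions are not specified in the slides.

```python
from typing import List, Sequence, Tuple

# (text, sentence index, citation markers attached to the fragment)
Fragment = Tuple[str, int, Sequence[str]]


def ic_span(fragments: List[Fragment], c: str) -> List[int]:
    """IC: fragments between the previous citation (or paragraph start) and c."""
    c_pos = next(i for i, f in enumerate(fragments) if c in f[2])
    start = c_pos
    # Walk backwards until a fragment that carries another citation marker.
    while start > 0 and not fragments[start - 1][2]:
        start -= 1
    return list(range(start, c_pos + 1))


def cs_span(fragments: List[Fragment], c: str) -> List[int]:
    """CS: all fragments of the sentence that contains the marker c."""
    c_sent = next(f[1] for f in fragments if c in f[2])
    return [i for i, f in enumerate(fragments) if f[1] == c_sent]


fragments = [
    ("He was reelected to the Illinois Senate in 1998,", 0, []),
    ("defeating Republican Yesse Yehudah in the general election,", 0, []),
    ("and was re-elected again in 2002.", 0, ["[117]"]),
    ("In 2000,", 1, []),
    ("he lost a Democratic primary race ... two to one.", 1, ["[118]"]),
]
ic_117 = ic_span(fragments, "[117]")
ic_118 = ic_span(fragments, "[118]")
cs_118 = cs_span(fragments, "[118]")
```

On this paragraph both baselines return the same spans, since each sentence ends with its own citation marker.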
Citation Span Analysis — Erroneous Span
[Figure: bar charts of the erroneous span Δw (%) for the approaches CSPS, CSPC, CS, CSW, IC, and MRF, one panel per citation span bucket: ≤ 0.5, (0.5,1], (1,2], > 2.]
Bar value labels per panel (the assignment of values to approaches is not recoverable from the extracted text):
≤ 0.5: 9%, 11%, 872%, 274%, 258%, 480%
(0.5,1]: 6%, 5%, 313%, 14%, 12%, 80%
(1,2]: 11%, 7%, 114%, 11%, 10%, 65%
> 2: 45%, 26%, 96%, 17%, 16%, 68%
Part (II): Conclusion
• Citation span can be accurately determined for web and news citations
• Sequence classification achieves slightly better performance than plain classification and outperforms baseline approaches
• Baseline approaches from the scientific domain do not generalize to Wikipedia’s language style
Part (III): Automated News
Suggestion for Populating
Wikipedia Pages
Besnik Fetahu, Katja Markert, Avishek Anand: “Automated News Suggestion for Populating Wikipedia Pages”. CIKM 2015: 323-332
[Overview diagram, repeated from the previous part: a News Collection and a Textual Knowledge Base, each over time points t1, t2, …, tn, connected through the tasks Citation Recommendation, Citation Span, and News Suggestion, the latter with Entity Placement and Section Placement.]
Automated News Suggestion to Entity Pages
Daily News Articles
WSJ: As it Happened: Cyclone Reaches Orissa
Some half a million people were evacuated from the southeastern Indian coast as Cyclone Phailin, a tropical storm from the Bay of Bengal, bore down on India. The states of Orissa and Andhra Pradesh, both of which have large coastal populations, were on high alert ahead of the storm’s expected arrival. […]
Task#1: Article Placement
Entity Mentions
• Bay of Bengal
• Odisha
• India
• Cyclone Phailin
Task#2: Section Placement
Odisha
Sections
1. Etymology
2. History
3. Geography
4. Government and politics
5. Economy
6. Transportation
7. Demographics
8. Education
Add section in case it is missing: Catastrophes
News Suggestion Attributes
• The entity should be a central concept in the news article
• The information in the news article should be important for the Wikipedia entity
• The news article should contain novel or missing information for the Wikipedia entity
• A news article should be suggested for the exact section; if that section does not exist, it needs to be added
Task#1: Article —
Placement
Article—Placement: News Suggestion Attributes
Entity Salience
• Reward entities appearing throughout the news article
• Reward entities appearing in top paragraphs
• Weigh the entities w.r.t. the score of the co-occurring entities
Relative Authority
• The entry barrier for information from news articles is lower for entities with low authority
• Important information for an entity can be unveiled by measuring the relative importance of its co-occurring entities
Novelty
• Information from a news article should be novel w.r.t. the entity under consideration
• Measure information novelty against already cited news sources
• Measure information novelty against the already existing content in a Wikipedia entity
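Two of the attribute groups, salience and novelty, can be illustrated with simple stand-in features. This is a sketch under loud assumptions: plain string matching replaces entity linking, Jaccard word overlap replaces whatever novelty measure the approach actually uses, and all function and feature names are hypothetical. Relative authority is omitted because it needs authority scores that the slides do not specify.

```python
import re
from typing import Dict, List, Set


def salience_features(entity: str, paragraphs: List[str]) -> Dict[str, float]:
    """Reward entities mentioned throughout the article and near the top."""
    hits = [i for i, p in enumerate(paragraphs) if entity.lower() in p.lower()]
    n = len(paragraphs)
    return {
        # Share of paragraphs that mention the entity.
        "paragraph_coverage": len(hits) / n if n else 0.0,
        # 1.0 if first mentioned in the top paragraph, lower if later.
        "first_mention_position": 1.0 - hits[0] / n if hits else 0.0,
    }


def _tokens(text: str) -> Set[str]:
    return set(re.findall(r"\w+", text.lower()))


def novelty(news_text: str, existing_texts: List[str]) -> float:
    """1 minus the highest Jaccard overlap with already present texts."""
    news = _tokens(news_text)
    best = 0.0
    for text in existing_texts:
        other = _tokens(text)
        union = news | other
        if union:
            best = max(best, len(news & other) / len(union))
    return 1.0 - best


article = [
    "Cyclone Phailin bore down on India.",
    "Odisha was on high alert.",
    "Cyclone Phailin is expected tonight.",
]
phailin = salience_features("Cyclone Phailin", article)
odisha = salience_features("Odisha", article)
```

The `existing_texts` argument of `novelty` can hold either the already cited news sources or the current section texts of the Wikipedia entity, which covers both novelty checks named on the slide.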
Task#2: Section
Placement
Section—Placement: Template Generation and Section Fit
News Articles [Germanwings incident]
Article-Entity Placement
Wikipedia Entity
Section Template
typeOf Airline
Germanwings
1. History
2. Corporate affairs
3. Destinations
4. Fleet
5. Services
6. References
Adria
1. History
2. Corporate affairs and Identity
3. Destinations
4. Codeshare agreements
5. Fleet
6. Services
7. Incidents and accidents
8. References
Lufthansa
1. History
2. Corporate affairs and identity
3. Miles & More
4. Lounges
5. Accidents and incidents
6. Criticism
7. See also
8. Citations
9. External Links
Section Template [Airline]
1. History
2. Corporate affairs and Identity
3. Destinations
4. Codeshare agreements
5. Fleet
6. Services / Lounges
7. Criticism
8. Incidents and accidents
9. References
Section Fit
• Content similarity of the news article w.r.t. sections in the template
• Topic similarity of the news article w.r.t. sections in the template
Incidents and Accidents
Evaluation
Evaluation Dataset
year   #news   #entities   #sections
2009   42707   13550       3510
2010   78328   24953       8416
2011   73491   23144       6581
2012   81473   25980       8455
2013   69079   22121       8183
2014   29961   11088       4694
Evaluation Plan
Task#1 (AEP): Baselines
• B1: baseline for AEP (Dunietz and Gillick)
Task#2 (AES): Baselines
• S2: baseline for AES (most frequent section)
Dunietz, Jesse, and Daniel Gillick. "A New Entity Salience Task with Millions of Training Examples." In EACL, p. 205. 2014.
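The S2 baseline can be pictured as a constant predictor that always suggests the section seen most often in the training data. The sketch below assumes the frequency is counted globally over all training pairs (the slide does not say whether it is counted per entity type), and the labels are hypothetical.

```python
from collections import Counter


def fit_most_frequent_section(training_sections):
    """Return the section label that occurs most often in the training data."""
    if not training_sections:
        raise ValueError("need at least one training label")
    return Counter(training_sections).most_common(1)[0][0]


def accuracy(predicted_section, gold_sections):
    """Share of test items whose gold section equals the constant prediction."""
    hits = sum(1 for gold in gold_sections if gold == predicted_section)
    return hits / len(gold_sections)


# Hypothetical article-section labels, for illustration only.
train = ["History", "Incidents and accidents", "History", "Fleet", "History"]
test_gold = ["History", "Fleet", "History"]

baseline_section = fit_most_frequent_section(train)
baseline_score = accuracy(baseline_section, test_gold)
print(baseline_section, round(baseline_score, 2))
```

A baseline like this is cheap to compute and sets the floor that a learned AES model with topic, content, and lexical features has to beat.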
Article-Entity and Article-Section Placement Results
[Figure: precision (y-axis) vs. recall (x-axis) for AEP, year 2009; groups: B1 and B1+F_e]
B1: baseline approach for AEP
AEP: B1 + entity salience, relative authority, novelty
[Figure: Avg. Precision per year (2009 to 2014) for AES; groups: Fs and S2]
S2: most frequent section
AES: topic, content, lexical features
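A precision-recall curve like the one reported for AEP is typically obtained by ranking candidate article-entity pairs by classifier score and computing precision and recall at each cut-off. The sketch below shows that computation under stated assumptions: the scores and labels are hypothetical, the function name is made up for illustration, and ties between scores are not treated specially.

```python
def precision_recall_points(scores, labels):
    """Return (recall, precision) after each cut-off of the score ranking.

    `labels` holds 1 for a correct placement and 0 otherwise.
    """
    total_positive = sum(labels)
    if total_positive == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = []
    true_positive = 0
    for cutoff, (_, label) in enumerate(ranked, start=1):
        true_positive += label
        points.append((true_positive / total_positive, true_positive / cutoff))
    return points


# Hypothetical classifier scores and gold labels, for illustration only.
points = precision_recall_points([0.9, 0.8, 0.4, 0.2], [1, 0, 1, 0])
for recall, precision in points:
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

Plotting these points with recall on the x-axis and precision on the y-axis gives a curve of the same kind as the one described for the 2009 data.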
Part (III): Conclusions
• Three main properties of a good news suggestion
• Through the AEP and AES tasks, we can suggest important and novel information for Wikipedia entities
• Entity profile expansion through section templates generated at the entity-type level
Conclusions and Future Work
Conclusions
• We account for the evolving nature of Wikipedia entities as new and novel information becomes available on the Web
• We present a holistic approach for enriching and improving Wikipedia entities
• Through our approach, we enforce core principles of Wikipedia such as the “verifiability” principle
• Our automated approach provides accurate enrichments and improvements, and furthermore accounts for long-tail entities, where editor interest is low.
Future Work
• Wikipedia is a collaboratively created and edited data source and, as such, can have pitfalls like “echo chambers”. We want to investigate how such “echo chambers” are established and which factors (e.g. editors, sources, topic interests) cause them.
• Quality issues such as NPOV violations in Wikipedia are coarse-grained, and such quality indicators are nonexistent for long-tail entities. Thus, investigating the editor, language, and source biases that cause such NPOV violations is an important quality assurance step.
• Editor dynamics reflect the quality of Wikipedia pages. How can we provide a mechanism for distributing Wikipedia pages “uniformly” across editors, such that we satisfy their interests and at the same time increase the overall quality of Wikipedia pages?
Thank you for your attention.



Questions?

Approaches for Improving and Enriching Textual Knowledge Bases

  • 1. Approaches for Improving and Enriching Textual Knowledge Bases Besnik Fetahu PhD Defense 8th of November 2017 Hannover, Germany
  • 2. What is a textual knowledge base?
  • 3. Wikipedia as a textual knowledge base 3 Wikipedia Articles University of Hannover Infobox Section Template Section Text Wikipedia is a free online encyclopedia with the aim to allow anyone to edit articles. Wikipedia is the largest and most popular general reference work on the Internet, and is ranked the 5th popular website. Wikipedia is owned by the nonprofit Wikimedia Foundation.
  • 4. Wikipedia Editor Collaboration Dynamics 4 Wikipedia Wikipedia Editors Localized Wikipedias Editor Profiles lang: {English} topic: {Education, Politics} ~40 mill. articles 293 lang. 32 mill. editors (only in english Wikipedia) Wikipedia Revisions (cur | prev) 02:51, 5 October 2017 Brilliantwiki2 (talk | contribs) . . (14,516 bytes) (+69) . . (→Rankings) (undo | thank) (cur | prev) 00:30, 21 August 2017 Blueclaw (talk | contribs) . . (14,447 bytes) (+83) . . (→Alumni: added Flügge-Lotz) (undo | thank) (cur | prev) 09:03, 18 June 2017 77.23.196.148 (talk) . . (14,364 bytes) (-2) . . (→History: pupils -> students, today and now in same sentence corrected) (undo) (cur | prev) 05:35, 10 June 2017 AnomieBOT (talk | contribs) . . (14,366 bytes) (+319) . . (Rescuing orphaned refs ("Mitarbeiter und Etat" from rev 782668206; "Studierende" from rev 782668206)) (undo) (cur | prev) 01:24, 10 June 2017 Mephistolus (talk | contribs) m . . (14,047 bytes) (+9) . . (undo | thank) (Tag: Visual edit) (cur | prev) 01:21, 10 June 2017 Mephistolus (talk | contribs) . . (14,038 bytes) (-89) . . (Update infobox) (undo | thank) (Tag: Visual edit)
  • 5. • Wikipedia and its sister projects develop at a rate of over 10 edits per second, performed by editors from all over the world. • English Wikipedia has an average growth rate of 600 new articles per day. Wikipedia Dynamics and Growth 5 Wikipedia’s Daily Growth Rate
  • 6. Editorial Policies in Wikipedia 6 Wikipedia is written from a neutral point of view. Content in Wikipedia must be verifiable. The burden of evidence lies with the editor who adds content into a page. No original research. Content — such as facts, allegations, and ideas — for which no reliable, published sources exist.
  • 9. Importance and Use of Wikipedia 9
  • 10. Importance and Use of Wikipedia 9 Rank Site Daily Time on Site Daily Pageviews per Visitor % of traffic from search Total Sites Lining in 1 google.com 8:02 8.93 4.30% 3.56 mill. 2 youtube.com 8:27 4.98 15.40% 2.69 mill. 3 facebook.com 9:48 4.01 8.30% 7.6 mill. 4 baidu.com 7:56 6.36 8.50% 1.3 mill. 5 wikipedia.org 4:11 3.28 68.40% 1.7 mill.
  • 11. From Wikipedia to Structured Data and Search 10 ++
  • 12. From Wikipedia to Structured Data and Search 11 Web Search (Google Knowledge Cards) VoiceSearch
  • 13. Use Cases: Question Answering 12 Danqi Chen, Adam Fisch, Jason Weston, Antoine Bordes: Reading Wikipedia to Answer Open-Domain Questions. ACL 2017. Q: When did the 1973 oil crisis begin? A: October 1973
  • 14. What are the issues in Wikipedia?
  • 15. Issues with Verifiability in Wikipedia 14 https://en.wikipedia.org/wiki/Wikipedia:List_of_hoaxes_on_Wikipedia
  • 16. Issues with Verifiability in Wikipedia 15 https://www.theguardian.com/technology/2017/feb/08/wikipedia-bans- daily-mail-as-unreliable-source-for-website
  • 17. Issues with Verifiability in Wikipedia 15 Unreliable news source https://www.theguardian.com/technology/2017/feb/08/wikipedia-bans- daily-mail-as-unreliable-source-for-website
  • 18. Issues with Verifiability in Wikipedia 16 Where was Obama born? In 2012, Breitbart.com published a copy of a promotional booklet that Obama's literary agency, Acton & Dystel, printed in 1991 (and later posted to their website, in a biography in place until April 2007) which misidentified Obama's birthplace and states that Obama was "born in Kenya and raised in Indonesia and Hawaii." Obama was born on August 4, 1961, at Kapiʻolani Maternity & Gynecological Hospital in Honolulu, Hawaii. He is the first President to have been born in Hawaii, making him the first President born outside of the contiguous 48 states. He was born to a white mother and a black father. His mother, Ann Dunham (1942–1995), was born in Wichita, Kansas, of mostly English descent, with some German, Irish, Scottish, Swiss, and Welsh ancestry. Kenya. Kapiʻolani Maternity & Gynecological Hospital in Honolulu, Hawaii Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, Hannaneh Hajishirzi: Bidirectional Attention Flow for Machine Comprehension. CoRR abs/1611.01603 (2016)
  • 19. Issues with Verifiability in Wikipedia 16 Where was Obama born? In 2012, Breitbart.com published a copy of a promotional booklet that Obama's literary agency, Acton & Dystel, printed in 1991 (and later posted to their website, in a biography in place until April 2007) which misidentified Obama's birthplace and states that Obama was "born in Kenya and raised in Indonesia and Hawaii." Obama was born on August 4, 1961, at Kapiʻolani Maternity & Gynecological Hospital in Honolulu, Hawaii. He is the first President to have been born in Hawaii, making him the first President born outside of the contiguous 48 states. He was born to a white mother and a black father. His mother, Ann Dunham (1942–1995), was born in Wichita, Kansas, of mostly English descent, with some German, Irish, Scottish, Swiss, and Welsh ancestry. Kenya. Kapiʻolani Maternity & Gynecological Hospital in Honolulu, Hawaii Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, Hannaneh Hajishirzi: Bidirectional Attention Flow for Machine Comprehension. CoRR abs/1611.01603 (2016)
  • 20. Misalignment of Editor Efforts in Wikipedia 17 • Human Fatalities: 10k vs 1.8k losses • Estimated Damages: $4.5 vs. $108 billions • “Odisha Cyclone” without coverage and mention in Wikipedia article “Odisha” • “Hurricane Katrina” finds broad coverage in Wikipedia article “New Orleans”
  • 22. Challenges and Contributions 19 For an arbitrary statement in Wikipedia how can we find citations which provide evidence for it? For a paragraph in Wikipedia and an existing citation how can we determine the exact span of the citation? For a Wikipedia page and a given news corpus how can we find and suggest important and novel information for a page?
  • 23. Part (I): Finding news citations for Wikipedia entity pages? Besnik Fetahu, Katja Markert, Wolfgang Nejdl, Avishek Anand: “Finding News Citations for Wikipedia”. CIKM 2016: 337-346
  • 24. News Collection t1 t2 tn Textual Knowledge Base t1 t2 tn Citation Recommendation Citation Span News Suggestion Entity Placement Section Placement e:“Barack Obama” Obama was born on August 4, 1961,[4] ….. The couple married in Wailuku on Maui on … After graduating ... a JD … magna cum laude[49]… Obama was elected to the Illinois Senate in … news? query for s1 c4 Obama was born on August 4, 1961, at Kapiʻolani Maternity & Gynecological Hospital in Honolulu, Hawaii. c4s1 Obama was born on August 4, 1961, at Kapiʻolani Maternity & Gynecological Hospital in Honolulu, Hawaii. citation c4 span e:“Barack Obama” AND t2 news: nk time: > t2 The choice of Barack Obama on Friday as the recipient of the 2009 Nobel Peace Prize, [...] around the globe. [...] The Nobel committee’s embrace of Mr. Obama was viewed [as a rejection of the unpopular tenure, in] Europe especially, of his predecessor, George W. Bush. [...] “To be honest,” the president said in the Rose Garden, [...] Last year’s laureate, former President Martti Ahtisaari of Finland, saw the award as an endorsement of Mr. Obama’s goal of achieving Middle East peace. 1.Early life and career 2.Political career 3.2008 presidential campaign 4.Presidency 5.Political positions 6.Family and Personal life 7.Cultural and political image 1.Early life and career 2.Political career 3.2008 presidential campaign 4.Presidency 5.Political positions 6.Nobel Peace Prize 7.Family and Personal life 8.Cultural and political image t2 t3 publish date t headline body entity mentions (e.g. “Barack Obama, Nobel Prize”…) revision date t entity title sections section text categories citations
  • 27. Finding news citations for Wikipedia entity pages. Pipeline: Wikipedia Articles → Section Chunking → Statements Extraction. Example sections: 1. Early life and career, 2. Presidential campaigns, 3. Presidency (2009-2017), 4. Post-presidency (2017-present), 5. Legacy, 6. Books written. Example section text: “Obama was born on August 4, 1961,[c] at Kapiʻolani Maternity & Gynecological Hospital in Honolulu, Hawaii.[c] He is the first President to have been born in Hawaii,[c] making him the first President born outside of the contiguous 48 states.[c] […] His father, Barack Obama Sr. (1936–1982), was a married Luo Kenyan man from Nyang'oma Kogelo. Obama's parents met in 1960 in a Russian language class at the University of Hawaii at Manoa, where his father was a foreign student on scholarship.[c] […]” Extracted statement: “Obama's parents met in 1960 in a Russian language class at the University of Hawaii at Manoa, where his father was a foreign student on scholarship.[c]” Task 1: Statement Categorization: does the statement require a news citation? Yes. Task 2: Citation Discovery: find a citation for the statement!
  • 30. Why Statement Categorization? Citation Types in Wikipedia (type: description): arXiv: arXiv preprints; AV media: audio and visual; AV media notes: audio and visual liner notes; bioRxiv: bioRxiv preprints; book: books; conference: conference papers; encyclopedia: edited collections; episode: radio or television collections; interview: interviews; journal: academic journals and papers; magazine: magazines, periodicals; mailing list: public mailing lists; map: maps; news: news articles; newsgroup: online newsgroups; podcast: audio or video podcast; press release: press releases; report: reports; serial: audio or video serials; sign: sign, plaques; speech: speeches; techreport: technical report; thesis: theses; web: any resource accessible through a … Figure: share (0 to 1) of the citation types news, book, court, journal, web, and thesis per entity type: ComicsCreator, Artwork, NaturalPlace, Airline, Film, SoccerManager, LegalCase, Album, Band, SportsTeam, TelevisionShow, AnatomicalStructure, Athlete, Weapon, Criminal, MusicalArtist, Politician, Plant, Song, Non-ProfitOrganisation, Book, Actor, FictionalCharacter, RecordLabel, Broadcaster, PoliticalParty, Automobile, TradeUnion, Scientist, MilitaryPerson, Philosopher, TelevisionSeason, Election, OfficeHolder, SportsLeague, GovernmentAgency, Single, Animal, Award, SportsEvent, Airport, MilitaryConflict, TelevisionEpisode, Aircraft, Magazine, Writer, Location. Source: Besnik Fetahu, Abhijit Anand, Avishek Anand: “How much is Wikipedia Lagging Behind News?” WebSci 2015: 28:1-28:9 • Citations of type web and news account for the absolute majority of citations in Wikipedia • Citations of type news are considered a “reliable, published source” • Depending on the context and the added information in Wikipedia, different citation types are preferred
  • 31. Why Statement Categorization? Example statements and the type of their citations. cite type = “news”: “Obama emphasized issues of rapidly ending the Iraq War, increasing energy independence, and reforming the health care system,[1] in a campaign that projected themes of hope and change.[2]” References: 1. “Barack Obama on the Issues: What Would Be Your Top Three Overall Priorities If Elected?”. The Washington Post. 2. “The Obama promise of hope and change”. The Independent. London. November 1, 2008. cite type = “report”: “On June 3, 2008, Senator Obama, along with Senators Tom Carper, Tom Coburn, and John McCain, introduced follow-up legislation: Strengthening Transparency and Accountability in Federal Spending Act of 2008.[1]” Reference: 1. “S. 3077: Strengthening Transparency and Accountability in Federal Spending Act of 2008: 2007–2008 (110th Congress)”. Govtrack.us. June 3, 2008. cite type = “book”: “In mid-1988, he traveled for the first time in Europe for three weeks and then for five weeks in Kenya, where he met many of his paternal relatives for the first time.[1][2]” Reference: 1. Obama, Auma (2012). And then life happens: a memoir. New York: St. Martin's Press. pp. 189–208, 212–216. ISBN 978-1-250-01005-6.
  • 36. Task#1: Statement Categorization. For a given Wikipedia statement, categorize it through a supervised model into one of the predefined citation types (e.g. “news”, “web”). Example statement: “Obama emphasized issues of rapidly ending the Iraq War, increasing energy independence, and reforming the health care system,[c] in a campaign that projected themes of hope and change.[c]” Pipeline: feature extraction (Wikipedia language style, Wikipedia entity structure) → feature representation → multi-class classification.
  • 45. Task#2: Citation Discovery. For a Wikipedia statement that requires a citation of type news, find one or more news articles as citations from a given news corpus. Example Wikipedia statement: “On February 10, 2007, Obama announced his candidacy for President of the United States in front of the Old State Capitol building in Springfield, Illinois.” Pipeline: Wikipedia statement → query against the news index → top-k retrieval of ranked news (1: doc1, 2: doc2, 3: doc3, …, 100: doc100) → citation discovery.
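The retrieval step on this slide (statement as query, news index, top-k ranked articles) can be sketched as follows. This is a minimal illustration only: the slides do not say which retrieval model or index was used, so the TF-IDF overlap score, the in-memory index, and the names `tokenize`, `build_index`, and `top_k` are all assumptions; only the cut-off of 100 ranked documents is taken from the slide.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(articles):
    """Return per-article term counts and document frequencies for a small corpus."""
    term_counts = {doc_id: Counter(tokenize(body)) for doc_id, body in articles.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    return term_counts, doc_freq


def top_k(statement, articles, k=100):
    """Rank articles by a simple TF-IDF overlap with the statement and keep the best k.

    The scoring function is a stand-in (assumption), not the retrieval model of the slides.
    """
    term_counts, doc_freq = build_index(articles)
    n_docs = len(articles)
    query_terms = set(tokenize(statement))
    scores = {}
    for doc_id, counts in term_counts.items():
        score = 0.0
        for term in query_terms:
            if term in counts:
                idf = math.log(1.0 + n_docs / doc_freq[term])
                score += counts[term] * idf
        if score > 0.0:
            scores[doc_id] = score
    # Sort by descending score; break ties by document id for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]
```

The ranked list returned by `top_k` would then be the candidate set that a downstream citation discovery model re-scores.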
  • 51. Task#2: Citation Discovery Properties. Each candidate news article is paired with the Wikipedia statement. Wikipedia statement: “On February 10, 2007, Obama announced his candidacy for President of the United States in front of the Old State Capitol building in Springfield, Illinois.” News sentence: “Speaking in a single-digit, morning chill and sunshine to thousands of supporters outside the Old State Capitol, the first-term Democratic senator delivered an address that built upon his biography as a community organizer in Chicago, state legislator and U.S. senator to call for quick action on issues ranging from bringing a close to the Iraq war to the need for universal health care and an end to foreign-oil dependence.” Properties of a good citation and the corresponding feature groups: 1. Entailment: the statement should be entailed by the news article. Features: • Lexical similarity between sentence and statement • Language model similarity • TreeKernel similarity • Query and news article. 2. Centrality: the statement is central in the news article. Features: • TextRank for measuring sentence centrality in a news article • Entailment feature scores w.r.t. the most central sentence in a news article. 3. Authority: the cited news article should be from an authoritative source. Features: • Entity type specific news citation suggestion • Authority of news domains on specific entity types.
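Two of the feature ideas named on this slide, lexical similarity between statement and news sentence and TextRank-style sentence centrality, can be illustrated with a small sketch. It is not the implementation behind the slides: Jaccard overlap is used here as a simple stand-in both for the lexical similarity and for the edge weights of the sentence graph, and the function and feature names are hypothetical.

```python
import re


def tokens(text):
    """Return the set of lowercase alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap of two texts (assumed stand-in for lexical similarity)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by a PageRank-style iteration over a similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    weights = [
        [jaccard(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to the edge weight.
            incoming = sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n)
                if out_sums[j] > 0.0
            )
            new_scores.append((1.0 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


def citation_features(statement, article_sentences):
    """Compute two illustrative features; expects a non-empty sentence list."""
    centrality = textrank(article_sentences)
    similarities = [jaccard(statement, s) for s in article_sentences]
    most_central = max(range(len(article_sentences)), key=lambda i: centrality[i])
    return {
        "max_lexical_similarity": max(similarities),
        "similarity_to_most_central": similarities[most_central],
    }
```

A statement that matches only a peripheral sentence gets a high `max_lexical_similarity` but a low `similarity_to_most_central`, which is the kind of contrast the centrality property is meant to capture.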
  • 53. Evaluation Datasets (Task#1: Statement Categorization Data; Task#2: Citation Discovery Data) • 6.9 million Wikipedia statements • 8.8 million citations to external references • 1.6 million Wikipedia entities • 1.88 million news articles cited from Wikipedia statements • 20 million news articles from a real-world news collection (GDelt), between 2013 and 2015 • 27k news articles cited from Wikipedia statements within the time range of GDelt. GDelt domain stats (news domain: news articles): yahoo.com: 1244781; allafrica.com: 1035646; reuters.com: 828133; dailymail.co.uk: 815372; indiatimes.com: 743991; wn.com: 587607. Figure: Wikipedia statement distribution by citation type.
  • 54. Task#1: Statement Categorization Results. Statement categorization results based on RandomForests, with separate models per entity type along the YAGO type hierarchy (yagoLegalActorGeo). Each cell lists P, R, F1.
    Parent Type | Child Type | 1 ≤ τ ≤ 10 | 10 < τ ≤ 50 | 50 < τ ≤ 90
    owl:Thing | Legal Actor Geo | 0.48, 0.36, 0.41 | 0.51, 0.43, 0.47 | 0.53, 0.47, 0.50
    Legal Actor Geo | Legal Actor | 0.51, 0.34, 0.41 | 0.54, 0.41, 0.47 | 0.56, 0.45, 0.50
     | location | 0.30, 0.29, 0.29 | 0.34, 0.40, 0.37 | 0.36, 0.45, 0.40
    location | region | 0.30, 0.28, 0.29 | 0.35, 0.40, 0.37 | 0.37, 0.44, 0.40
     | point | 0.30, 0.10, 0.14 | 0.38, 0.22, 0.28 | 0.39, 0.26, 0.32
    Legal Actor | person | 0.53, 0.36, 0.43 | 0.56, 0.43, 0.49 | 0.58, 0.46, 0.51
    person | preserver | 0.63, 0.31, 0.42 | 0.67, 0.46, 0.54 | 0.67, 0.49, 0.57
     | authority | 0.53, 0.20, 0.29 | 0.62, 0.24, 0.35 | 0.65, 0.33, 0.44
     | contestant | 0.59, 0.43, 0.50 | 0.62, 0.52, 0.57 | 0.64, 0.56, 0.60
     | leader | 0.53, 0.26, 0.34 | 0.59, 0.34, 0.43 | 0.61, 0.37, 0.46
     | wc Living people | 0.55, 0.37, 0.44 | 0.58, 0.44, 0.50 | 0.59, 0.47, 0.52
  • 60. Task#2: Citation Discovery Results. Compared approaches: B1: the top-1 article returned by the retrieval model. B2: a supervised model based on the rank and similarity score from the search engine. E1: our approach with entailment, centrality, and authority features, where the correct citations for a statement are the news articles originally cited for that statement in the Wikipedia page. E1+FP: like E1, but false positives that have very high similarity to ground-truth articles are also considered relevant. E2: builds on top of E1+FP by additionally assessing the relevance of FP articles below the similarity threshold.
  • 61. Part (I): Conclusion 34 • Specific citation types are preferred based on the statement, its context, and the Wikipedia page • Statement categorization works fairly well for some entity types • Challenging to distinguish between citation type web and news • Citation discovery can be performed accurately across all entity types
  • 62. Part (II): Fine Grained Citation Span for References in Wikipedia Besnik Fetahu, Katja Markert, Avishek Anand: “Fine Grained Citation Span for References in Wikipedia”. EMNLP 2017: 1980-1989
  • 65. Citation Span Cases. (1) Citation marker placed at a sub-sentence level: “Obama was born on August 4, 1961,[5] at Kapiʻolani Maternity & Gynecological Hospital in Honolulu, Hawaii.[6][7][8]” (2) Citation marker placed at the end of a sentence: “On February 10, 2007, Obama announced his candidacy for President of the United States in front of the Old State Capitol building in Springfield, Illinois.[158][159] […] Obama emphasized issues of rapidly ending the Iraq War, increasing energy independence, and reforming the health care system,[161] in a campaign that projected themes of hope and change.[162]” (3) Citation marker placed after multiple sentences in a paragraph: “At the Democratic National Convention in Charlotte, North Carolina, Obama and Joe Biden were formally nominated by former President Bill Clinton as the Democratic Party candidates for president and vice president in the general election. Their main opponents were Republicans Mitt Romney, the former governor of Massachusetts, and Representative Paul Ryan of Wisconsin.[183]”
  • 69. Citation Span Task. Citing paragraph: “He was reelected to the Illinois Senate in 1998, defeating Republican Yesse Yehudah in the general election, and was re-elected again in 2002.[117] In 2000, he lost a Democratic primary race for Illinois's 1st congressional district in the United States House of Representatives to four-term incumbent Bobby Rush by a margin of two to one.[118]” Step 1, textual fragments: chunk the paragraph at punctuation symbols. Step 2, citing span: determine the citation span for reference [117], i.e. the fragments of the paragraph covered by that citation.
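The chunking step on this slide can be sketched with a regular expression. The slide only says that paragraphs are chunked at punctuation symbols; the exact symbol set (`, ; : . ! ?`), the choice to keep citation markers attached to the preceding fragment, and the function name `chunk_paragraph` are assumptions made for illustration. Abbreviations such as "U.S." would be split by this naive rule.

```python
import re

# A fragment is a run of non-punctuation characters, optionally followed by
# punctuation and any citation markers like [117] that directly follow it.
FRAGMENT = re.compile(r"[^,;:.!?]+(?:[,;:.!?]+(?:\[\d+\])*)?")


def chunk_paragraph(paragraph):
    """Split a citing paragraph into text fragments at punctuation symbols."""
    fragments = []
    for match in FRAGMENT.finditer(paragraph):
        fragment = match.group(0).strip()
        if fragment:
            fragments.append(fragment)
    return fragments
```

Applied to the citing paragraph from the slide, this yields five fragments, with the marker [117] attached to the third and [118] to the fifth.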
  • 70. Citation Span Approach. Goal: determine the span for citation c = [117] in the citing paragraph “He was reelected to the Illinois Senate in 1998, defeating Republican Yesse Yehudah in the general election, and was re-elected again in 2002.[117] In 2000, he lost a Democratic primary race for Illinois's 1st congressional district in the United States House of Representatives to four-term incumbent Bobby Rush by a margin of two to one.[118]” Features are extracted for each text fragment, and the fragments are labeled either by sequence classification (linear-chain CRF) or by plain classification. Paragraph Structure features: • Citations other than c • Same sentence as c • Same sentence as previous text fragment • Distance in terms of text fragments to c. Citation Features: • Language models per paragraph in the cited document • Language model similarity between a fragment and the paragraph’s LM. Discourse/Temporal Features: • Explicit discourse sense annotation of sentences • Fragments in a sentence with explicit discourse (e.g. comparison) are likely to have the same label • Fragments with different time points are unlikely to have the same label.
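The four paragraph-structure features listed on this slide can be computed per fragment roughly as follows. This is a sketch under assumptions: the feature names, the input layout (fragments plus a parallel list of sentence indices), and the function name `structure_features` are hypothetical, and the slide does not specify how the features were encoded for the CRF or the plain classifier.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def structure_features(fragments, sentence_ids, target):
    """Compute paragraph-structure features for every fragment.

    fragments: text fragments of one citing paragraph, in order.
    sentence_ids: index of the sentence each fragment belongs to.
    target: number of the citation c whose span is being determined;
            it must occur in one of the fragments.
    """
    target_pos = next(
        i for i, fragment in enumerate(fragments)
        if str(target) in CITATION.findall(fragment)
    )
    features = []
    for i, fragment in enumerate(fragments):
        cited = CITATION.findall(fragment)
        features.append({
            # Does the fragment carry a citation marker other than c?
            "has_other_citation": any(c != str(target) for c in cited),
            # Is the fragment in the same sentence as the marker of c?
            "same_sentence_as_c": sentence_ids[i] == sentence_ids[target_pos],
            # Is the fragment in the same sentence as the previous fragment?
            "same_sentence_as_previous": i > 0 and sentence_ids[i] == sentence_ids[i - 1],
            # Distance to the marker of c, counted in text fragments.
            "distance_to_c": abs(i - target_pos),
        })
    return features
```

Each resulting dictionary would be one observation in the fragment sequence; a sequence model additionally sees the neighbouring fragments' labels, which the plain classifier does not.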
  • 72. Citation Span Dataset • 500 citing paragraphs, pointing to either web or news citations • Manual annotation of each textual fragment as to whether it is explicitly supported or implied by the corresponding citation • High inter-rater agreement on a 10% sample with 𝜅=0.84
    span | news: dist. | news: skip frag. | news: skip sent. | web: dist. | web: skip frag. | web: skip sent.
    <= 0.5 | 11% | 6% | - | 6.8% | - | -
    (.5,1] | 63% | - | - | 63% | - | 1%
    (1,2] | 17% | - | 8% | 14% | - | 19%
    (2,5] | 7% | 5% | 18% | 13.1% | - | 21%
    > 5 | 1.8% | - | 20% | 3.1% | - | 67%
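  • 73. Evaluation Metrics. Mean Average Precision (MAP): $\mathrm{MAP} = \frac{1}{|N|} \sum_{p \in N} \frac{|S' \cap S^t|}{|S'|}$. Recall (R): $R = \frac{1}{|N|} \sum_{p \in N} \frac{|S' \cap S^t|}{|S^t|}$. Erroneous Span (word and text-fragment level), Δ; at the word level: $\Delta_w = \frac{1}{|N|} \sum_{p \in N} \frac{\sum_{\delta \in S' \setminus S^t} \mathrm{words}(\delta)}{\sum_{\delta \in S^t} \mathrm{words}(\delta)}$. • $S'$: text fragments marked as covered by one of the approaches. • $S^t$: text fragments marked as covered by the citation in our ground-truth. • $\mathrm{words}(\delta)$: number of words in a text fragment $\delta$.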
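The three metrics can be computed directly from sets of fragment indices. The sketch below assumes the reading that, per citing paragraph, precision is the share of predicted fragments that are in the ground-truth span, recall is the share of ground-truth fragments that were predicted, and the word-level erroneous span is the number of words in wrongly included fragments divided by the number of words in the ground-truth span. The input layout and the name `evaluate_spans` are hypothetical.

```python
def evaluate_spans(paragraphs):
    """Average precision, recall, and word-level erroneous span over paragraphs.

    paragraphs: list of (predicted, truth, word_counts) triples, where
    predicted and truth are sets of fragment indices and word_counts[i]
    is the number of words in fragment i. Every truth set must be non-empty.
    """
    n = len(paragraphs)
    precision = recall = delta_w = 0.0
    for predicted, truth, word_counts in paragraphs:
        overlap = predicted & truth
        # An empty prediction contributes zero precision (assumed convention).
        precision += len(overlap) / len(predicted) if predicted else 0.0
        recall += len(overlap) / len(truth)
        extra_words = sum(word_counts[i] for i in predicted - truth)
        true_words = sum(word_counts[i] for i in truth)
        delta_w += extra_words / true_words
    return {"MAP": precision / n, "R": recall / n, "delta_w": delta_w / n}
```

Because the erroneous span is normalized by the size of the true span, it can exceed 100% when an approach marks far more text than the citation actually covers.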
  • 78. Citation Span Analysis: Accuracy. Citation span results decomposed across the different span cases. Compared approaches: MRF: computes two functions, (i) potential and (ii) compatibility. Function (i) measures the similarity between sentences in a citing paragraph, and (ii) measures the similarity between a sentence and the sentences in the citing document. IC: the span of a citation is the text between two citations; it may start at the beginning of a paragraph and may end at the end of a paragraph. CS/CSW: for CS the span is the citing sentence; for CSW the span is the citing sentence +/- 2 sentences, depending on whether they contain specific cue words in specific locations in the sentence. CSPC/CSPS: CSPC is our plain classifier with the proposed features; CSPS is a structured prediction model trained on the same feature set.
  • 79. Citation Span Analysis: Erroneous Span. Figure: erroneous span Δw (%) of the approaches CSPS, CSPC, CS, CSW, IC, and MRF, with one panel per citation span bucket. Bar values as listed on the slide: ≤ 0.5: 9%, 11%, 872%, 274%, 258%, 480%; (0.5,1]: 6%, 5%, 313%, 14%, 12%, 80%; (1,2]: 11%, 7%, 114%, 11%, 10%, 65%; > 2: 45%, 26%, 96%, 17%, 16%, 68%.
  • 80. Part (II): Conclusion 47 • Citation span can be accurately determined for web and news citations • Sequence classification achieves slightly better performance than plain classification and outperforms baseline approaches • Baseline approaches from the scientific domain do not generalize to Wikipedia’s language style
  • 81. Part (III): Automated News Suggestion for Populating Wikipedia Pages Besnik Fetahu, Katja Markert, Avishek Anand: “Automated News Suggestion for Populating Wikipedia Pages”. CIKM 2015: 323-332
  • 83–96. Automated News Suggestion to Entity Pages 50 • Input: daily news articles. Example, WSJ, “As it Happened: Cyclone Reaches Orissa”: “Some half a million people were evacuated from the southeastern Indian coast as Cyclone Phailin, a tropical storm from the Bay of Bengal, bore down on India. The states of Orissa and Andhra Pradesh, both of which have large coastal populations, were on high alert ahead of the storm’s expected arrival. […]” • Task#1: Article Placement. Entity mentions in the article: Bay of Bengal, Odisha, India, Cyclone Phailin • Task#2: Section Placement. Sections of the entity Odisha: 1. Etymology 2. History 3. Geography 4. Government and politics 5. Economy 6. Transportation 7. Demographics 8. Education • Add section in case it is missing, e.g. Catastrophes
  • 97. News Suggestion Attributes 51 • The entity should be a central concept in the news article • The information in the news article should be important for the Wikipedia entity • The news article should contain novel or missing information for the Wikipedia entity • A news article should be suggested for the exact section it belongs to; if no such section exists, it needs to be added
  • 99. Article-Placement: News Suggestion Attributes 53 • Entity Salience: reward entities appearing throughout the news article; reward entities appearing in top paragraphs; weigh the entities w.r.t. the score of the co-occurring entities • Relative Authority: the entry barrier for information from news articles is lower for entities with low authority; important information for an entity can be unveiled by measuring the relative importance of its co-occurring entities • Novelty: information from a news article should be novel w.r.t. the entity under consideration; measure information novelty against already cited news sources; measure information novelty against the already existing content in a Wikipedia entity
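The salience and novelty attributes above can be illustrated with a minimal Python sketch. This is not the feature set from the talk: the function names, the 1/(i+1) position weighting, and the term-overlap novelty measure are assumptions chosen only to make the two ideas concrete.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def salience(entity, paragraphs):
    """Hypothetical salience score: every mention of the entity counts,
    and mentions in earlier paragraphs weigh more (weight 1/(i+1))."""
    name = entity.lower()
    return sum(p.lower().count(name) / (i + 1) for i, p in enumerate(paragraphs))


def novelty(news_text, entity_text):
    """Hypothetical novelty score: share of distinct news terms that do
    not yet occur in the entity page text."""
    news_terms = set(tokenize(news_text))
    known_terms = set(tokenize(entity_text))
    if not news_terms:
        return 0.0
    return len(news_terms - known_terms) / len(news_terms)


paragraphs = [
    "Cyclone Phailin bore down on India.",
    "Odisha was on high alert.",
]
top_score = salience("Cyclone Phailin", paragraphs)   # mentioned in paragraph 1
lower_score = salience("Odisha", paragraphs)          # mentioned in paragraph 2
```

An entity mentioned only in the second paragraph receives half the weight of one mentioned in the first, which mirrors the "reward entities appearing in top paragraphs" attribute in a very coarse way.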
  • 101–104. Section-Placement: Template Generation and Section Fit 55 • Input: news articles [Germanwings incident], assigned to a Wikipedia entity by Article-Entity Placement • Section templates are generated per entity type (typeOf Airline) from the section lists of entities of that type; the slide shows three lists for Germanwings, Adria, and Lufthansa: (a) 1. History 2. Corporate affairs 3. Destinations 4. Fleet 5. Services 6. References; (b) 1. History 2. Corporate affairs and Identity 3. Destinations 4. Codeshare agreements 5. Fleet 6. Services 7. Incidents and accidents 8. References; (c) 1. History 2. Corporate affairs and identity 3. Miles & More 4. Lounges 5. Accidents and incidents 6. Criticism 7. See also 8. Citations 9. External Links • Section Template [Airline]: 1. History 2. Corporate affairs and Identity 3. Destinations 4. Codeshare agreements 5. Fleet 6. Services / Lounges 7. Criticism 8. Incidents and accidents 9. References • Section Fit: content similarity of the news article w.r.t. sections in the template; topic similarity of the news article w.r.t. sections in the template • Selected section: Incidents and Accidents
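Template generation and section fit can be sketched roughly as follows. This is an illustrative simplification, not the method from the talk: headings are merged by an assumed order-insensitive key, a heading enters the template once at least two entities share it, and fit is plain bag-of-words cosine similarity. The dictionary keys `airline_a` to `airline_c` and the section texts are hypothetical, since the transcript does not say which list belongs to which airline.

```python
import math
import re
from collections import Counter


def heading_key(heading):
    """Normalise a heading so that word order and 'and' do not matter."""
    tokens = re.findall(r"[a-z]+", heading.lower())
    return " ".join(sorted(t for t in tokens if t != "and"))


def build_template(entity_sections, min_support=2):
    """Keep headings shared by at least `min_support` entities of a type."""
    support = Counter()
    label = {}  # normalised key -> first surface form seen
    for sections in entity_sections.values():
        keys = set()
        for heading in sections:
            key = heading_key(heading)
            label.setdefault(key, heading)
            keys.add(key)
        support.update(keys)  # count each entity at most once per heading
    return [label[k] for k in label if support[k] >= min_support]


def bag_of_words(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_section(news_text, section_texts):
    """Return the template section whose text is most similar to the news."""
    news = bag_of_words(news_text)
    return max(section_texts, key=lambda s: cosine(news, bag_of_words(section_texts[s])))


airlines = {
    "airline_a": ["History", "Corporate affairs", "Destinations", "Fleet",
                  "Services", "References"],
    "airline_b": ["History", "Corporate affairs and Identity", "Destinations",
                  "Codeshare agreements", "Fleet", "Services",
                  "Incidents and accidents", "References"],
    "airline_c": ["History", "Corporate affairs and identity", "Miles & More",
                  "Lounges", "Accidents and incidents", "Criticism", "See also",
                  "Citations", "External Links"],
}
template = build_template(airlines)

# Hypothetical section texts, for illustration only.
section_texts = {
    "History": "the airline was founded in 1997",
    "Incidents and accidents": "a flight crashed in an accident",
}
chosen = best_section("Germanwings flight crashed in the Alps accident", section_texts)
```

With exact matching alone, "Incidents and accidents" and "Accidents and incidents" would count as different headings; the normalised key is one simple way to merge them. The template on the slide also merges "Services" with "Lounges", which this sketch does not attempt.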
  • 106. Evaluation Dataset 57
    Evaluation datasets:
    year   #news   #entities   #sections
    2009   42707   13550       3510
    2010   78328   24953       8416
    2011   73491   23144       6581
    2012   81473   25980       8455
    2013   69079   22121       8183
    2014   29961   11088       4694
    Evaluation plan:
    • Task#1, AEP (Article-Entity Placement), baseline B1: the approach of Dunietz and Gillick
    • Task#2, AES (Article-Section Placement), baseline S2: most frequent section
    Dunietz, Jesse, and Daniel Gillick. "A New Entity Salience Task with Millions of Training Examples." In EACL, p. 205. 2014.
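The S2 baseline (most frequent section) is simple enough to sketch directly. The function names and the toy data below are assumptions for illustration; how the talk grouped sections when counting frequencies is not stated in the transcript.

```python
from collections import Counter


def most_frequent_section(training_sections):
    """Return the section that received the most news articles in training."""
    return Counter(training_sections).most_common(1)[0][0]


def precision(predicted, gold):
    """Fraction of predictions that match the gold section."""
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Toy data: the sections that past news articles were cited in.
train = ["History", "History", "Economy"]
baseline_section = most_frequent_section(train)

gold = ["History", "Geography", "History"]
predicted = [baseline_section] * len(gold)  # the baseline ignores the article
score = precision(predicted, gold)
```

Because the baseline predicts the same section for every article, its precision equals the share of test articles that belong to that section.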
  • 107. Article-Entity and Article-Section Placement Results 58 • [Figure, AEP, 2009: precision-recall curves (x-axis: Recall, y-axis: Precision) for B1, the baseline approach for AEP, and B1+F_e, which adds the entity salience, relative authority, and novelty features] • [Figure, AES: average precision per year, 2009 to 2014, for Fs (topic, content, and lexical features) and S2 (most frequent section)]
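For readers unfamiliar with how a precision-recall curve like the AEP plot is obtained, the following Python sketch computes the curve points from scored candidates. The scores and labels are made-up toy values, not results from the talk.

```python
def precision_recall_points(scores, labels):
    """Rank candidates by score and return (recall, precision) after each rank.

    `labels` holds 1 for a correct article-entity pair and 0 otherwise.
    """
    total_positive = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = []
    true_positives = 0
    for k, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        recall = true_positives / total_positive if total_positive else 0.0
        points.append((recall, true_positives / k))
    return points


# Toy example: three candidate pairs, two of them correct.
points = precision_recall_points([0.9, 0.8, 0.3], [1, 0, 1])
```

Each point corresponds to cutting the ranked list at one position; lowering the cut-off raises recall and typically lowers precision, which produces the downward-sloping curves in such plots.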
  • 108. Part (III): Conclusions 59 • A good news suggestion has three main properties: entity salience, relative authority, and novelty • Through the AEP and AES tasks, we can suggest important and novel information for Wikipedia entities • Entity profiles are expanded through section templates generated at the entity type level
  • 110. Conclusions 61 • We account for the evolving nature of Wikipedia entities as novel information becomes available on the Web • We present a holistic approach for enriching and improving Wikipedia entities • Through our approach we enforce core principles of Wikipedia such as the “verifiability” principle • Our automated approach provides accurate enrichments and improvements, and furthermore accounts for long-tail entities, where editor interest is low
  • 111. Future Work 62 • Wikipedia is a collaboratively created and edited data source, and as such it can have pitfalls like “echo chambers”. We want to investigate how such “echo chambers” are established and which factors (e.g. editors, sources, topic interests) cause them. • Quality issues such as NPOV violations in Wikipedia are flagged only at a coarse-grained level, and such quality indicators are nonexistent for long-tail entities. Investigating the editor, language, and source biases that cause such NPOV violations is therefore an important quality assurance step. • Editor dynamics reflect the quality of Wikipedia pages. How can we provide a mechanism for distributing Wikipedia pages “uniformly” across editors, such that we satisfy their interests and at the same time increase the overall quality of Wikipedia pages?
  • 112. Thank you for your attention.
 
 Questions?