Entities, Time and Events in BiographyNet and NewsReader
1. Entities, Time and Events
in BiographyNet &
NewsReader
Antske Fokkens
VU University
Monday, November 11, 13
2. Acknowledgement
(people)
The work presented in this
presentation was carried out by/with:
Agata Cybulska, Marieke van Erp and
Piek Vossen
Niels Ockeloen, Serge ter Braake, Willem
Robert van Hage, Jesper Hoeksema, Sara
Tonelli, Rachele Sprugnoli, Luciano Serafini,
Aitor Soroa, German Rigau and others
3. Overview
mini introduction to BiographyNet
mini introduction to NewsReader
representing entities and events
4. BiographyNet
An interdisciplinary project
involving history, computer science
and computational linguistics
Goal: inspire new historic research
by identifying relations between
people and events in Biographical
dictionaries
5. NLP in BiographyNet
The Biography Portal of the Netherlands
125,000 biographies from 23 sources
describing 76,000 people
Text and metadata
Role of NLP:
Identify information in text
Study differences in style and focus
6. BiographyNet
use cases
Analysis of groups of individuals (e.g.
who were the governors-general of the Dutch
Indies?)
More complex questions, e.g. the relation
between influential people in the Dutch
colonies and current Dutch elite
Perspectives: how are people and events
judged in different sources?
7. BiographyNet data
Biographical text in Dutch
Heterogeneous corpus: 23 sources,
texts from 17th century - now
Metadata about basic facts:
high quality (few errors)
completeness varies
8. BiographyNet
Text mining
First step: fill gaps in the metadata
Basic supervised machine learning system
Next steps:
Create timelines for individuals
Identify relations between people
Identify events and relations between them
9. BiographyNet
Methodology
The output of NLP tools is used by other
researchers
They should have insight into the
performance of the tools and the
approaches that are used
Provenance information plays a vital role
10. NewsReader
Automatically process massive streams of
daily news from thousands of sources in 4
different languages
Project Partners:
VU University Amsterdam, LexisNexis,
Synerscope (the Netherlands)
University of the Basque Country (Spain)
ScraperWiki (UK)
Fondazione Bruno Kessler (Italy)
11. NewsReader
What happened, where, when and who was
involved?
Which temporal and causal relations hold
between events, what does that tell us
about the people involved?
Place the cumulated result in a knowledge
store that can handle dynamic growth of
information: a history recorder
12. NewsReader
Big Data
Focus: The financial crisis
E.g. What is the impact of the financial
crisis on the car industry?
Big Data: LexisNexis estimates
1-2 million news articles per day
an archive of 10 million English news
articles about the car industry from
the last 10 years
13. NewsReader
Narratives
What are the stories that are being
told by all this data?
Challenges:
Duplicates, overlap and repetitions: how to
distinguish old from new?
Single results tell only parts of the story
Results can be inconsistent
News is opinionated and colored
14. NewsReader
overall approach
Resolve all mentions of events, their
participants, locations and time in texts
and other resources
Determine coreference and other relations
between them
Combine all information from coreferring
event mentions around a hypothetical
event instance (independent from text)
Combine instances into storylines
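The steps above can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the `EventMention` fields, the grouping rule (same type, place and year means coreference) and the sample data are all hypothetical, and real event coreference is far less trivial than a key lookup.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EventMention:
    """One textual mention of an event in a single document (hypothetical schema)."""
    doc_id: str
    event_type: str
    place: str
    year: int


def group_into_instances(mentions):
    """Crude stand-in for coreference: same type, place and year -> same instance."""
    instances = defaultdict(list)
    for m in mentions:
        instances[(m.event_type, m.place, m.year)].append(m)
    return dict(instances)


def storyline(instances):
    """Order the event instances chronologically by their year."""
    return sorted(instances, key=lambda key: key[2])


# Made-up mentions, for illustration only.
mentions = [
    EventMention("doc1", "tsunami", "Indian Ocean", 2004),
    EventMention("doc2", "tsunami", "Indian Ocean", 2004),
    EventMention("doc3", "temblor", "Sumatra", 2005),
]

instances = group_into_instances(mentions)
for key in storyline(instances):
    print(key, [m.doc_id for m in instances[key]])
```

The instance key is independent of any single text, which is the point of the third step: information from all coreferring mentions collects around it.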
15. NLP pipeline
[Figure: NewsReader architecture]
Input: LexisNexis documents
NLP modules (run in virtual machines; partner labels EHU, VUA, FBK):
tokenizer + sentence splitter, POS-tagger, parser, NER,
NED (client/server), WSD (client/server), time expressions, SRL,
coreference resolution, event detection, event coreference,
opinion detection, factuality
Some of these processes can be carried out in any order
KnowledgeStore: resources, mentions, entities, statements + context
Storage of original input data: HBase + Hadoop
(distributed & replicated for scalability and fault tolerance)
Triple store: RDF triples + named graphs
((possibly) distributed, partial replication)
KS frontend: API implementation over layers
(replicated for scalability and fault tolerance)
Management scripts: start/stop, backup/restore, configuration,
statistics gathering
On top of the store: inference, event relations, story understanding,
visualisation (Synerscope)
16. Both Projects
Accumulate information about the same
entities and events from various
sources
Must deal with different perspectives,
contradicting and partial information
17. Grounded Annotation
Framework (GAF)
Sources report on events and entities:
event mentions and entity mentions
URIs represent instances of these
entities and events in reality
GAF links instances to mentions
Information from mentions in other
sources is merged with known
information around the instance
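A minimal sketch of the instance-to-mention link, using plain Python tuples instead of an RDF library. All URIs are hypothetical. The predicate name `gaf:denotedBy` is assumed here to follow the GAF publications and should be checked against them.

```python
# Hypothetical instance URI for one event in the world.
INSTANCE = "http://example.org/instance/tsunami-2004"

# Hypothetical mention URIs: source document plus character offsets.
mentions = [
    "http://example.org/doc/report-2008#char=4,15",
    "http://example.org/doc/blog-2013#char=58,85",
]

# Triples as (subject, predicate, object); predicate name is an assumption.
triples = [(INSTANCE, "gaf:denotedBy", m) for m in mentions]


def mentions_of(instance, triples):
    """Return every mention that denotes the given instance."""
    return [o for s, p, o in triples if s == instance and p == "gaf:denotedBy"]


def sources_of(instance, triples):
    """Return the distinct source documents that mention the instance."""
    return sorted({m.split("#")[0] for m in mentions_of(instance, triples)})


print(mentions_of(INSTANCE, triples))
print(sources_of(INSTANCE, triples))
```

Because the instance has its own URI, a new source only adds further mention links; nothing about the instance itself has to be duplicated.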
18. a GAF example
[Figure: two parallel timelines, 2004-2009 and 2013]
Changes in the world: SEM events (temblor, tsunami, USS Jimmy Carter
energy weapon, future tsunami, tsunami alert system)
Publication of sources: sensor data, direct event report,
delayed event report, future event report
Annotations (NAF, TAF) link the sources to the SEM events
Example source texts:
"The catastrophe four years ago devastated Indian
Ocean community and killed more than 230,000
people, over 170,000 of them in Aceh
at northern tip of Sumatra Island of Indonesia."
"..., the vessel is the party responsible for the 2004 Indian
Ocean tsunami that killed 230,000 people. Apparently,
the submarine was able to trigger seismic activity via
some kind of directed energy weapon."
19. Linguistic information in
GAF
The NLP Annotation Format (NAF)
Knowledge Annotation Format (KAF)
stand-off layered annotation (LAF
compatible)
separating mentions from instances
NLP Interchange Format (NIF)
RDF and URIs, inline annotation
Compatible with PROV-DM
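Stand-off layered annotation can be illustrated as follows. This is a sketch only: the sentence is made up, and the layer and field names are simplified stand-ins, not the actual NAF or KAF element names. The source text is never modified; each layer points at the layer below it.

```python
# Hypothetical example sentence; it stays untouched throughout.
text = "The tsunami hit Aceh in 2004."

# Token layer: character offsets into the source text.
tokens = []
pos = 0
for word in text.replace(".", " .").split():
    start = text.index(word, pos)
    tokens.append({"id": f"t{len(tokens) + 1}", "start": start, "end": start + len(word)})
    pos = start + len(word)

# Higher layers refer to token ids only (simplified, hypothetical layer names).
entities = [{"id": "e1", "type": "LOCATION", "span": ["t4"]}]
time_expressions = [{"id": "tx1", "value": "2004", "span": ["t6"]}]


def surface(annotation):
    """Recover the text covered by an annotation by following its token ids."""
    by_id = {t["id"]: t for t in tokens}
    return " ".join(text[by_id[t]["start"]:by_id[t]["end"]] for t in annotation["span"])


print(surface(entities[0]))
print(surface(time_expressions[0]))
```

Inline annotation, by contrast, would embed the markup in the text itself, so adding a new layer would mean rewriting the annotated document.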
20. Events in GAF
extended Simple Event Model (SEM):
RDF representations of event
instances with participant, location
and time
can represent contradictory
information
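A rough sketch of an event instance that carries contradictory information. `sem:Event`, `sem:hasPlace` and `sem:hasTime` are SEM terms; everything under the `ex:` prefix, including the `ex:causedBy` predicate and the graph names, is hypothetical. Keeping each claim in its own named graph is one possible design, assumed here for illustration.

```python
EVENT = "ex:tsunami-2004"

# Quads: (subject, predicate, object, named graph holding the statement).
quads = [
    (EVENT, "rdf:type", "sem:Event", "ex:graph/shared"),
    (EVENT, "sem:hasPlace", "ex:IndianOcean", "ex:graph/shared"),
    (EVENT, "sem:hasTime", "2004", "ex:graph/shared"),
    # Two sources disagree about the cause; both claims are kept.
    (EVENT, "ex:causedBy", "ex:temblor-2004", "ex:graph/source-A"),
    (EVENT, "ex:causedBy", "ex:energy-weapon", "ex:graph/source-B"),
]


def values_by_graph(subject, predicate, quads):
    """Map each asserted value to the named graphs that assert it."""
    result = {}
    for s, p, o, g in quads:
        if s == subject and p == predicate:
            result.setdefault(o, set()).add(g)
    return result


def has_competing_values(subject, predicate, quads):
    """True if more than one distinct value is asserted for the predicate."""
    return len(values_by_graph(subject, predicate, quads)) > 1


print(values_by_graph(EVENT, "ex:causedBy", quads))
print(has_competing_values(EVENT, "ex:causedBy", quads))
```

Several values are not always a contradiction (an event can have many actors), so deciding which predicates admit only one value is left to the application.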
21. GAF from NAF + SEM
Can accumulate information from
different sources
Can represent repeated information as a
single relation (with links to all
sources that provided this information)
Can represent contradicting information
Is compatible with the PROV-DM
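The second point, one relation with links to all sources that stated it, can be sketched with a dictionary keyed by the statement. The source identifiers and the `ex:` terms are hypothetical; only `sem:hasPlace` is an actual SEM property.

```python
from collections import defaultdict


def accumulate(reports):
    """Map each distinct (subject, predicate, object) to the set of sources stating it."""
    store = defaultdict(set)
    for source, triple in reports:
        store[triple].add(source)
    return dict(store)


# Hypothetical reports: two sources repeat the same statement, one adds another.
PLACE = ("ex:tsunami-2004", "sem:hasPlace", "ex:IndianOcean")
CAUSE = ("ex:tsunami-2004", "ex:causedBy", "ex:energy-weapon")
reports = [
    ("ex:source/report-2008", PLACE),
    ("ex:source/post-2013", PLACE),
    ("ex:source/post-2013", CAUSE),
]

store = accumulate(reports)
for triple, sources in store.items():
    print(triple, sorted(sources))
```

Three reports yield two relations: the repeated place statement is stored once, with both sources attached as its provenance.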
22. Acknowledgements
Supported by the European Union’s 7th
Framework Programme via the NewsReader
Project (ICT-316404)
Supported by the BiographyNet project
(nr. 660.011.308) funded by the
Netherlands eScience Center
(http://escience.center.nl)
23. References
GAF:
Fokkens, Antske, Marieke van Erp, Piek Vossen, Sara
Tonelli, Willem Robert van Hage, Luciano Serafini,
Rachele Sprugnoli and Jesper Hoeksema. 2013. GAF: A
Grounded Annotation Framework for Events. Proceedings
of the first Workshop on EVENTS: Definition, Detection,
Coreference and Representation. Atlanta USA.
Marieke Van Erp, Antske Fokkens, Piek Vossen, Sara
Tonelli, Willem Robert Van Hage, Luciano Serafini,
Rachele Sprugnoli and Jesper Hoeksema. 2013. Denoting
Data in the Grounded Annotation Framework. ISWC 2013
Posters and Demos. Sydney Australia, 21-25 October 2013
24. References
SEM:
Van Hage, Willem Robert, Véronique Malaisé, Roxane
Segers, Laura Hollink, and Guus Schreiber. "Design
and use of the Simple Event Model (SEM)." Web
Semantics: Science, Services and Agents on the World
Wide Web 9, no. 2 (2011): 128-136.
Cross-document coreference:
Cybulska, Agata, and Piek Vossen. “Semantic
Relations between Events and their Time, Locations
and Participants for Event Coreference Resolution.”
In: Proceedings of RANLP 2013.
25. References
Named Entity Recognition:
Marieke van Erp, Giuseppe Rizzo and Raphaël Troncy
(2013) Learning with the Web: Spotting Named Entities on
the intersection of NERD and Machine Learning. #MSM2013
Concept Extraction Challenge. Rio de Janeiro, Brazil,
May 2013.
Provenance:
Niels Ockeloen, Antske Fokkens, Serge Ter Braake, Piek
Vossen, Victor de Boer, Guus Schreiber and Susan Legêne.
2013. BiographyNet: Managing Provenance at multiple
levels and from different perspectives. In: Proceedings
of the Workshop on Linked Science 2013 (LISC2013).