This document discusses tools and techniques for monitoring global media data and events. It introduces several systems developed at the Jozef Stefan Institute for collecting news articles from around the world, enriching documents with semantic annotations, linking information across languages, and analyzing news reporting bias. It also addresses representing events with structured and semantic descriptions and tracking how topics evolve over time through an event registry system. The overall goal is to establish an integrated real-time pipeline for processing multilingual media, identifying events, and providing visualization of global event dynamics.
Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014
1. Global Media Monitoring
http://eventregistry.org/
Marko Grobelnik
Jozef Stefan Institute, Ljubljana, Slovenia
Contributions from Gregor Leban, Blaz Fortuna, Janez Brank, Jan Rupnik, Andrej Muhic
ESWC Summer School, Sep 2nd 2014, Kalamaki
4. What questions will we try to answer?
• Where to get global media data?
• What is extractable from media documents?
• How to connect information across languages?
• What is an event?
• How to approach diversity in news reporting?
• How to visualize global event dynamics?
5. Systems/Demos used within the presentation
• NewsFeed (http://newsfeed.ijs.si/)
  • News and social media crawler
• Enrycher (http://enrycher.ijs.si/)
  • Language and semantic annotation
• XLing (http://xling.ijs.si/)
  • Cross-lingual document linking and categorization
• DiversiNews (http://aidemo.ijs.si/diversinews/)
  • News Diversity Explorer
• Event Registry (http://eventregistry.org/)
  • Event detection and topic tracking
6. The overall goal
• The goal is to establish a real-time system
  • …to collect data from global media in real-time
  • …to identify events and track evolving topics
  • …to assign stable identifiers to events
  • …to identify events across languages
  • …to detect diversity of reporting along several dimensions
  • …to provide rich exploratory visualizations
  • …to provide interoperable data export
7. Global Media Monitoring pipeline
[Diagram: pipeline from input data to http://EventRegistry.org]
• Stages: Input data; Pre-processing steps; Event construction; Event storage & maintenance
• Components: Main stream news; Blogs; Article semantic annotation; Extraction of date references; Detection of article duplicates; Cross-lingual article matching; Article clustering; Cross-lingual cluster matching; Event formation; Event info. extraction; Identifying related events; Event registry; API Interface; GUI/Visualizations
9. Where to get references to news publishers?
• A good start is the Wikipedia list of newspapers:
  • http://en.wikipedia.org/wiki/Lists_of_newspapers
10. From a newspaper home-page to an article
[Figure: http://www.nytimes.com/ home page (HTML) → RSS feed (list of articles) → article to be retrieved]
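The step from an RSS feed to individual article URLs can be sketched as follows. This is a minimal illustration, assuming a standard RSS 2.0 feed; the sample feed, its URLs and the function name article_links are hypothetical and not part of the NewsFeed crawler.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed content; a real crawler would download this
# from the feed URL advertised on the publisher's home page.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Newspaper</title>
    <item>
      <title>First story</title>
      <link>http://news.example.org/articles/1</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://news.example.org/articles/2</link>
    </item>
  </channel>
</rss>"""


def article_links(rss_text):
    """Return (title, link) pairs for every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_text)
    pairs = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if link:  # skip items without a usable URL
            pairs.append((title, link))
    return pairs


links = article_links(SAMPLE_RSS)
```

Each returned link is the address of one article to be retrieved in a later step.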
11. Collecting global media data
• Data collection service News-Feed
  • http://newsfeed.ijs.si/
  • …crawling global main-stream and social media
• Monitoring
  • ~60k main-stream publishers (RSS feeds + special feeds)
  • ~250k most influential blogs (RSS feeds)
  • free Twitter feed
• Data volume: ~350k articles & blogs per day (+5M tweets)
• Languages: eng (50%), ger (10%), spa (8%), fra (5%)
12. Downloading the news stream (1/2)
• The stream is accessible at http://newsfeed.ijs.si/stream/
• To download the whole stream continuously, you can use the Python script (http://newsfeed.ijs.si/http2fs.py)
• The script does the following:
13. Downloading the news stream (2/2)
• News Stream Contents and Format
  • The root element, <article-set>, contains zero or more articles in the following XML format:
• …more details:
  • Trampus, Mitja and Novak, Blaz: The Internals Of An Aggregated Web News Feed. Proceedings of 15th Multiconference on Information Society 2012 (IS-2012). [PDF]
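Reading such a document could look roughly like the sketch below. Only the <article-set> root element comes from the slide; the <article> tag, the id attribute and the <title>/<lang> children in the sample are placeholders, since the actual per-article format is not reproduced here. The reader is therefore written generically and does not depend on any particular child names.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the child element names and attributes are
# placeholders, not the real NewsFeed schema.
SAMPLE = """<article-set>
  <article id="1"><title>Hello</title><lang>eng</lang></article>
  <article id="2"><title>Hallo</title><lang>ger</lang></article>
</article-set>"""


def read_article_set(xml_text):
    """Return one dict per child of <article-set> (attributes plus child texts)."""
    root = ET.fromstring(xml_text)
    if root.tag != "article-set":
        raise ValueError("expected <article-set> root, got <%s>" % root.tag)
    articles = []
    for node in root:  # direct children: one entry per article
        record = dict(node.attrib)
        for child in node:
            record[child.tag] = (child.text or "").strip()
        articles.append(record)
    return articles


articles = read_article_set(SAMPLE)
empty = read_article_set("<article-set/>")  # zero articles is valid
```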
15. What can be extracted from a document?
• Lexical level
  • Tokenization – extracting tokens from a document (words, separators, …)
  • Sentence splitting – set of sentences to be further processed
• Linguistic level
  • Part-of-Speech – assigning word types (nouns, verbs, adjectives, …)
  • Deep Parsing – constructing parse trees from sentences
  • Triple extraction – subject-predicate-object triple extraction
  • Named entity extraction – identifying names of people, places, organizations
• Semantic level
  • Co-reference resolution – replacing pronouns with corresponding names; merging different surface forms of names into a single entity
  • Semantic labeling – assigning semantic identifiers to names (e.g. LOD/DBpedia/Freebase) including disambiguation
  • Topic classification – assigning topic categories to a document (e.g. DMoz)
  • Summarization – assigning importance to parts of a document
  • Fact extraction – extracting relevant facts from a document
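The two lexical-level steps can be illustrated with a deliberately naive sketch based on regular expressions. It is an assumption-laden toy (it will mis-split abbreviations such as "e.g."), not the method used by Enrycher; the function names are made up for this example.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Return word tokens and single-character separators, in order."""
    return re.findall(r"\w+|[^\w\s]", sentence)


text = "Enrycher was developed at JSI. Ljubljana is the capital of Slovenia."
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
```

The linguistic and semantic levels build on this output: part-of-speech tagging and parsing operate per sentence on the token lists.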
16. Enrycher (http://enrycher.ijs.si/)
[Figure: plain text → text enrichment → extracted graph of triples from text]
Diego Maradona
Semantics:
owl:sameAs: http://dbpedia.org/resource/Diego_Maradona
owl:sameAs: http://sw.opencyc.org/concept/Mx4rvofERZwpEbGdrcN5Y29ycA
rdf:type: http://dbpedia.org/class/yago/ArgentinaInternationalFootballers
rdf:type: http://dbpedia.org/class/yago/ArgentineExpatriatesInItaly
rdf:type: http://dbpedia.org/class/yago/ArgentineFootballManagers
rdf:type: http://dbpedia.org/class/yago/ArgentineFootballers
Robbie Keane
Semantics:
owl:sameAs: http://dbpedia.org/resource/Robbie_Keane
rdf:type: http://dbpedia.org/class/yago/CoventryCityF.C.Players
rdf:type: http://dbpedia.org/class/yago/ExpatriateFootballPlayersInItaly
rdf:type: http://dbpedia.org/class/yago/F.C.InternazionaleMilanoPlayers
“Enrycher” is available as a web-service generating Semantic Graph, LOD links, Entities, Keywords, Categories, Text Summarization, Sentiment
17. Enrycher Architecture
[Figure: plain text → Enrycher → annotated document]
• Enrycher is a web service consisting of a set of interlinked modules…
  • …covering lexical, linguistic and semantic annotations
  • …exporting data in XML or RDF
• To execute the service, one should send an HTTP POST request, with the raw text in the body:
  • curl -d "Enrycher was developed at JSI, a research institute in Ljubljana. Ljubljana is the capital of Slovenia." http://enrycher.ijs.si/run
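The same POST request can be sketched in Python with the standard library. This is a hedged equivalent of the curl call, not an official client: the helper names are invented here, the response is returned unparsed, and whether the endpoint is still reachable is not something this sketch can guarantee.

```python
from urllib.request import Request, urlopen

ENRYCHER_URL = "http://enrycher.ijs.si/run"  # endpoint taken from the curl example


def build_request(text):
    """Build an HTTP POST request carrying the raw text in its body."""
    return Request(ENRYCHER_URL, data=text.encode("utf-8"), method="POST")


def annotate(text, timeout=30):
    """Send the request and return the response body as text.

    The response (XML or RDF, per the slide) is not parsed here.
    """
    with urlopen(build_request(text), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Building the request does not touch the network; annotate() does.
request = build_request("Ljubljana is the capital of Slovenia.")
```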
19. Cross-linguality
How to operate in many languages?
• Cross-linguality is a set of functions for transferring information across languages
  • …having this, we can track information independently of language borders
• Machine translation is expensive and slow, so the goal is to avoid machine translation to gain speed and scale
• The key building block is the function for comparing and categorizing documents in different languages
• XLing.ijs.si is an open web service to bridge information across 100 languages
21. XLing (XLing.ijs.si)
Service for comparing and categorizing documents across 100 languages
[Screenshot: Chinese text and English text, each with automatically extracted keywords; similarity between the two documents; selection of 100 languages]
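How XLing computes similarity is not described on these slides. One common way to compare documents without machine translation, shown here purely as an assumption, is to map each document to weights over a shared, language-independent set of concepts and then take the cosine of the two vectors. The concept names and weights below are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical concept weights for a Chinese and an English document
# that have already been projected into the same concept space.
doc_zh = {"football": 2.0, "argentina": 1.0}
doc_en = {"football": 1.0, "argentina": 0.5}
doc_other = {"earthquake": 3.0}

same_topic = cosine(doc_zh, doc_en)
different_topic = cosine(doc_zh, doc_other)
```

Because the comparison happens in the shared space, the cost per document pair is independent of the number of language pairs involved.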
24. Detecting News Reporting Bias
• The task:
  • Given a news story, are we able to say from which news source it came?
• We compared CNN and Aljazeera reports about the same events from the war in Iraq
  • …300 aligned articles describing the same story from both sources
• The same topics are expressed in both sources with the following keywords:
  • CNN with:
    • Insurgents, Troops, Baghdad, Iran, Militant, Police, Suicide, Terrorist, United, National, Hussein, Alleged, Israeli, Syria, Terrorism…
  • Aljazeera with:
    • Attacks, Claims, Rebels, Withdrawing, Report, Fighters, President, Resistance, Occupation, Injured, Army, Demanded, Hit, Muslim, …
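To make the task concrete, the keyword lists above can be turned into a toy source guesser that counts how many characteristic words of each source occur in a text. The slides do not say which model produced the keywords or how stories were classified, so this voting scheme is an illustrative assumption only; the lists are truncated exactly as on the slide.

```python
import re

# Keyword lists copied from the slide (lower-cased, incomplete as in the source).
SOURCE_KEYWORDS = {
    "CNN": {
        "insurgents", "troops", "baghdad", "iran", "militant", "police",
        "suicide", "terrorist", "united", "national", "hussein", "alleged",
        "israeli", "syria", "terrorism",
    },
    "Aljazeera": {
        "attacks", "claims", "rebels", "withdrawing", "report", "fighters",
        "president", "resistance", "occupation", "injured", "army",
        "demanded", "hit", "muslim",
    },
}


def guess_source(text):
    """Return (best_source_or_None, scores) using simple keyword counts."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = {
        source: sum(1 for w in words if w in keywords)
        for source, keywords in SOURCE_KEYWORDS.items()
    }
    top = max(scores.values())
    winners = [s for s, score in scores.items() if score == top]
    if top == 0 or len(winners) > 1:
        return None, scores  # no evidence, or a tie
    return winners[0], scores


guess_a, _ = guess_source("Insurgents attacked troops near Baghdad")
guess_b, _ = guess_source("Resistance fighters demanded an end to the occupation")
guess_c, _ = guess_source("The weather is nice today")
```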
25. DiversiNews iPad App (1/2)
• DiversiNews iPad App uses the newsfeed.ijs.si and enrycher.ijs.si services
  • …its initial screen shows a list of current hot topics and current trending events
[Screenshot: Hot Topics, Trending Events]
26. DiversiNews iPad App (2/2)
• DiversiNews “diversity search” screen allows dynamic reranking of articles describing an event along three dimensions:
  • Geography – where the content is being published from
  • Subtopics – what the subtopics of an event are
  • Sentiment – what is good news and what is bad news
• For each query it provides
  • Automatically generated summary
  • List of corresponding articles
[Screenshot: panels Geography, Subtopics, Sentiment, Summary, Articles]
28. What is an event?
(abstract description)
• …more practical question: what definition of an event is computationally feasible?
• In general, an event is something which “sticks out” of the average in some kind of (high dimensional) data space
  • …could be interpreted as an “anomaly”
  • …densification of data points (e.g. many similar documents)
  • …significant change of distribution (e.g. a trend on Twitter)
• In practice, the event could be:
  • A cluster of documents / change of a distribution in data
    • Detected in an unsupervised way
  • A fit to a pre-built model
    • Detected in a supervised way
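The idea of an event as something that “sticks out” of the average can be shown in one dimension: flag the days on which the number of matching documents is unusually far above the mean. This z-score rule is a simple unsupervised stand-in chosen for illustration, not the detection method of Event Registry; the counts and the threshold are made up.

```python
from statistics import mean, pstdev


def find_spikes(counts, threshold=2.0):
    """Return indices whose count is more than `threshold` standard
    deviations above the mean of the series."""
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        return []  # a flat series has no anomalies
    return [i for i, c in enumerate(counts) if (c - mu) / sigma > threshold]


# Hypothetical number of documents per day mentioning some topic.
daily_mentions = [10, 12, 9, 11, 10, 95, 12, 10]
spikes = find_spikes(daily_mentions)
```

In a real, high-dimensional setting the same intuition applies to document clusters: a dense group of similar articles appearing in a short time window plays the role of the spike.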
29. How to represent an event?
• Baseline data for a news event is usually a cluster of documents
  • …with some preprocessing we extract linguistic and semantic annotations
  • …semantic annotations are linked to ontologies, making multiresolution annotations possible
• Three levels of event representation:
  • Feature vector event representation:
    • …lightweight representation that can be easily represented as a set of feature vectors augmented with external ontologies – suitable for scalable ML analysis
  • Structured event representation:
    • Infobox representation (slot filling) using open schema or event taxonomy
  • Deep event representation:
    • Semantic representation linked to a world-model (e.g. CycKB common sense knowledge) – suitable for reasoning and diagnostics
30. Feature vector event representation
• Feature vectors easily extractable from news documents:
  • Topical dimension – what is being talked about? (keywords)
  • Social dimension – which entities are mentioned? (named entities)
  • Temporal aspect – what is the time of an event? (temporal distribution)
  • Geographical aspect – where is an event taking place? (location)
  • Publisher aspect – who is reporting? (publisher identifiers)
  • Sentiment/bias aspect – emotional signals (numeric estimates)
• Scalable Machine Learning techniques can easily deal with such a representation
• …in the “Event Registry” system we use this representation to describe events
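A compact way to picture this representation is one sparse count vector per dimension, filled by aggregating the articles in an event's cluster. The class below is a sketch under that assumption; the field names and the article dictionary keys are hypothetical and do not reflect Event Registry's internal data model.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class EventFeatures:
    keywords: Counter = field(default_factory=Counter)    # topical dimension
    entities: Counter = field(default_factory=Counter)    # social dimension
    dates: Counter = field(default_factory=Counter)       # temporal aspect
    locations: Counter = field(default_factory=Counter)   # geographical aspect
    publishers: Counter = field(default_factory=Counter)  # publisher aspect
    sentiment: list = field(default_factory=list)         # numeric estimates

    def add_article(self, article):
        """Fold one annotated article (a plain dict) into the event."""
        self.keywords.update(article.get("keywords", []))
        self.entities.update(article.get("entities", []))
        self.locations.update(article.get("locations", []))
        if "date" in article:
            self.dates[article["date"]] += 1
        if "publisher" in article:
            self.publishers[article["publisher"]] += 1
        if "sentiment" in article:
            self.sentiment.append(article["sentiment"])

    def mean_sentiment(self):
        return sum(self.sentiment) / len(self.sentiment) if self.sentiment else 0.0


event = EventFeatures()
event.add_article({"keywords": ["storm", "flood"], "entities": ["Chicago"],
                   "date": "2014-09-01", "publisher": "pub-a", "sentiment": -0.5})
event.add_article({"keywords": ["storm"], "locations": ["Chicago"],
                   "date": "2014-09-01", "publisher": "pub-b", "sentiment": -0.25})
```

Each Counter can be fed directly to sparse-vector learning methods, which is what makes this level suitable for scalable analysis.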
31. Example of “feature vector” event representation: Event Registry “Chicago” related events
[Screenshot: Event Registry results for the query “Chicago”, with panels Where? (geography), When? (temporal distribution), Who? (named entities), What? (keywords/topics)]
32. Structured event representation
• Structured event representation describes an event by its “Event Type” and corresponding information slots to be filled
  • Event Types should be taken from an “Event Taxonomy”
• …at this stage of development this level of representation still requires human intervention to achieve high-accuracy (Precision/Recall) extraction
• Example on the right – Wikipedia event infobox:
  • 2011 Tōhoku earthquake and tsunami
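Slot filling can be pictured as an event type that prescribes which slots an infobox should contain, plus a record of the slots filled so far. The two-entry taxonomy and the slot names below are hypothetical examples, not an actual event taxonomy; only the event type and date in the sample are taken as known facts about the Tōhoku earthquake.

```python
from dataclasses import dataclass, field

# Hypothetical taxonomy: event type -> slots an infobox of that type should have.
EVENT_TAXONOMY = {
    "Earthquake": ["date", "location", "magnitude"],
    "Election": ["date", "location", "winner"],
}


@dataclass
class StructuredEvent:
    event_type: str
    slots: dict = field(default_factory=dict)

    def __post_init__(self):
        if self.event_type not in EVENT_TAXONOMY:
            raise ValueError("unknown event type: " + self.event_type)

    def missing_slots(self):
        """Slots required by the event type that are still unfilled."""
        return [s for s in EVENT_TAXONOMY[self.event_type] if s not in self.slots]


event = StructuredEvent("Earthquake", {"date": "2011-03-11", "location": "Japan"})
missing = event.missing_slots()
```

The list of missing slots is what a human annotator or an extraction component would be asked to complete.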
34. Prototype for event Infobox extraction: XLike annotation service
• The goal is to build a system for economically viable extraction of event infoboxes
  • …using crowd-sourcing
  • …aiming at high Precision & Recall for a small cost
35. Event sequences & Hierarchical events
• Once events are identified and represented, we can connect them into “event sequences” (also called story-lines)
  • “Event sequences” include events which are supposedly related and constitute a larger story
• A collection of interrelated events can also be organized in hierarchies (e.g. a World Cup event consists of a series of smaller events)
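Both ideas fit a small tree structure: a parent event holds its sub-events, and flattening the tree in date order yields a story-line. The class and the sample events are illustrative assumptions, not Event Registry's data model.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    title: str
    date: str  # ISO date, so string order equals chronological order
    children: list = field(default_factory=list)

    def storyline(self):
        """Flatten the hierarchy into a date-ordered list of leaf events."""
        if not self.children:
            return [self]
        leaves = []
        for child in self.children:
            leaves.extend(child.storyline())
        return sorted(leaves, key=lambda e: e.date)


# Illustrative hierarchy: a tournament made of smaller events.
cup = Event("World Cup", "2014-06-12", [
    Event("Final", "2014-07-13"),
    Event("Opening match", "2014-06-12"),
])
titles = [e.title for e in cup.storyline()]
```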
51. Event Registry exports event data through API and RDF/Storyline ontology
• API to search and export event information
  • Export of all the system data in JSON
• Event data is exported in a structured form
  • BBC Storyline ontology
    • http://www.bbc.co.uk/ontologies/storyline/2013-05-01.html
• SPARQL endpoint:
  • http://eventregistry.org/rdf/search
  • http://eventregistry.org/rdf/event/{eventID}
  • http://eventregistry.org/rdf/article/{articleID}
  • http://eventregistry.org/rdf/storyline/{storylineID}
  • Example: http://eventregistry.org/rdf/event/1234
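The URL patterns listed above can be filled in programmatically. The helper below only builds the address from the templates on the slide; its name is invented for this sketch, and it makes no claim about what the server returns or whether these endpoints are still served.

```python
from urllib.parse import quote

BASE = "http://eventregistry.org/rdf"
KINDS = ("event", "article", "storyline")  # resource kinds named on the slide


def rdf_url(kind, identifier):
    """Build the RDF export URL for one event, article or storyline."""
    if kind not in KINDS:
        raise ValueError("kind must be one of %s" % ", ".join(KINDS))
    # Percent-encode the identifier so it is safe as a single path segment.
    return "%s/%s/%s" % (BASE, kind, quote(str(identifier), safe=""))


url = rdf_url("event", 1234)
```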
53. Some of the follow-up projects
• Understanding global social dynamics
  • How does global society function?
• Integrating text-based media with TV channels
  • …requires speech recognition, video processing, visual object recognition, face recognition, …
• Event prediction / Event-Consequence prediction
  • …requires understanding of causality in the social dynamics and much more
• Micro-reading / Machine-reading
  • …full understanding of individual documents – the goal for 10+ years