Discovering links between political debates and media
by Damir Juric, Laura Hollink, Geert-Jan Houben
TU Delft - WIS
at ICWE 2013, Aalborg, Denmark, July 2013
1. DISCOVERING LINKS BETWEEN POLITICAL DEBATES AND MEDIA
Damir Juric, Delft University of Technology
Laura Hollink, VU University Amsterdam
Geert-Jan Houben, Delft University of Technology
ICWE2013
3. PoliMedia research questions
• How is a person, subject or process covered and visualised by the media?
• How do debates and arguments develop over a longer period of time?
• Analysing the changing ideas, arguments and presentation in different media
4. Issues with current approach
• Researchers have to go to different archives and look up the original data by hand, which is cumbersome
6. Data Sets – Debates
Handelingen der Staten-Generaal, or the Dutch Hansard,
from 1945 to 1995
Some provenance:
1. Transcripts are made of the complete debates of the
Dutch parliament.
2. Published online by the government on
http://www.statengeneraaldigitaal.nl/ (1818 - 1995) and
http://officielebekendmakingen.nl/ (from 1995)
3. The PoliticalMashup project has converted the government PDF
and TXT files into XML, including URIs as identifiers; see
http://politicalmashup.nl/
4. We build on that work.
7. Structure of the debate data
A debate consists of metadata, followed by topics; each topic contains the
speeches (Speaker 1 / Content, Speaker 2 / Content, ...) in chronological
order.

Example topic description:
"Aan de orde is de behandeling van: - de brief van de minister van Economische
Zaken inzake Borssele (16226, nr. 26). De beraadslaging wordt geopend."
(Under consideration: the letter of the Minister of Economic Affairs regarding
Borssele (16226, no. 26). The deliberation is opened.)
NEs = {Economische Zaken, Borssele}

Example speech:
"Mijnheer de Voorzitter! Met de verdragen tot uitbreiding van de EEG met
Denemarken, Engeland, Ierland en Noorwegen wordt een van de doelstellingen
van ons buitenlands beleid verwezenlijkt."
(Mr. Chairman! With the treaties on the enlargement of the EEC with Denmark,
England, Ireland and Norway, one of the objectives of our foreign policy is
realised.)
NEs = {Borssele, Partij van de Arbeid, D66}

The data provides:
• who, when, what
• identifiers for subparts of the debate
• chronological order of speakers
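The debate structure above can be sketched with a few hypothetical Python dataclasses; the class and field names here are illustrative, not the PoliticalMashup XML schema itself, and the example values are taken from the slides:

```python
from dataclasses import dataclass, field

@dataclass
class Speech:
    speaker: str          # who
    text: str             # spoken content
    identifier: str = ""  # identifier for this subpart of the debate

@dataclass
class Topic:
    description: str                              # e.g. "Aan de orde is de behandeling van: ..."
    speeches: list = field(default_factory=list)  # speeches in chronological order

@dataclass
class Debate:
    date: str    # when
    title: str   # what
    topics: list = field(default_factory=list)

# Illustrative instance, using example values shown on the slides
debate = Debate(
    date="1945-11-20",
    title="Handelingen Verenigde Vergadering...",
    topics=[Topic(
        description="de brief van de minister van Economische Zaken inzake Borssele",
        speeches=[Speech(speaker="Speaker 1", text="Mijnheer de Voorzitter! ...")],
    )],
)
```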
8. Data sets – Media
• Newspaper articles
• at the National Library of the Netherlands
• Many newspapers, 1950-1995
• Text + images of newspaper layout
9. All data and links expressed as RDF
• We have created a semantic model to capture the datasets and the links between them.
• Reusing other vocabularies
• Simple Event Model (SEM)
• Dublin Core
• FOAF
• ISOCAT
10. All data and links expressed as RDF
The figure shows the top of the RDF graph for one debate:

• Debate node nl.proc.sgd.d.194519460000002, titled "Handelingen Verenigde
Vergadering...", with its metadata: dc:id "nl.proc.sgd.d.194519460000002",
dc:language "Dutch", dc:date "1945-11-20", dc:publisher
http://statengeneraaldigitaal.nl/, and two dc:source links,
http://resolver.politicalmashup.nl/nl.proc.sgd.d.194519460000002 and
http://resolver.kb.nl/resolve?urn=sgd:mpeg21:19451946:0000002:pdf
• hasPart connects the Debate to its PartOfDebate nodes
(nl.proc.sgd.d.194519460000002.1, .2, ...), which are chained by
hasSubsequentPartOfDebate
• Each PartOfDebate contains (hasPart) a DebateContext node (...1.1), whose
hasText holds the topic description ("De voorzitter opent de vergadering..." /
"The chair opens the meeting..."), and Speech nodes (...1.2, ...1.3) chained
by hasSubsequentSpeech
• Each Speech has hasSpokenText ("Mijnheer de Voorzitter, de Commissie
van ..." / "Mr. Chairman, the Committee of ..."), sem:hasActor pointing to the
speaker, and coveredIn linking to newspaper articles, e.g.
http://resolver.kb.nl/resolve?urn=ddd:011198136:mpeg21:a0525:ocr
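As a rough, stdlib-only sketch, this part of the graph can be written as plain (subject, predicate, object) triples; the identifiers and property names follow the figure, while the helper function is illustrative and not part of the PoliMedia tooling:

```python
# Sketch of the debate graph as plain triples (prefixes abbreviated).
D = "nl.proc.sgd.d.194519460000002"

triples = [
    (D,          "rdf:type",     "Debate"),
    (D,          "dc:language",  "Dutch"),
    (D,          "dc:date",      "1945-11-20"),
    (D,          "hasPart",      D + ".1"),
    (D + ".1",   "rdf:type",     "PartOfDebate"),
    (D + ".1",   "hasPart",      D + ".1.1"),
    (D + ".1.1", "rdf:type",     "DebateContext"),
    (D + ".1",   "hasPart",      D + ".1.2"),
    (D + ".1.2", "rdf:type",     "Speech"),
    (D + ".1.2", "hasSubsequentSpeech", D + ".1.3"),
    (D + ".1.2", "coveredIn",
     "http://resolver.kb.nl/resolve?urn=ddd:011198136:mpeg21:a0525:ocr"),
    (D + ".1",   "hasSubsequentPartOfDebate", D + ".2"),
]

def objects(subject: str, predicate: str) -> list:
    """All objects o with (subject, predicate, o) in the graph."""
    return [o for s, p, o in triples if s == subject and p == predicate]
```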
11. PoliMedia linking method
• Debate speeches and newspaper articles are different
types of documents, so default document similarity
metrics are insufficient
• Speeches contain many named entities and digressions.
• Newspaper articles are formal and concise; words are used sparingly.
• The challenge: how to create a representation of the
speeches that contains enough information to be used as
a query to retrieve the right media articles from the
archive?
12. PoliMedia linking method
• Our PoliMedia linking method consists of four steps:
1. topics: enriching the existing debate metadata with topics
2. preselection of articles: filter candidates by when they were published and
who spoke in the debate (timeframe and speakers)
3. automatic query creation: candidate articles are ranked based on
similarity to the query (automatically created from speech text) by
comparing vectors of topics and named entities
4. link creation: links are created between a speech and an article if the
similarity score is above a threshold t
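Steps 3 and 4 above can be sketched as follows; the speaker notes mention cosine similarity as one of the measures used, but the threshold value and function names here are illustrative, not the paper's actual parameters:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def create_links(query_terms, articles, t=0.3):
    """Step 4 sketch: keep articles whose similarity to the query exceeds t.

    query_terms: NEs + topic words extracted from the speech (step 3).
    articles: {article_id: list of terms}.  t = 0.3 is illustrative only.
    """
    q = Counter(query_terms)
    scores = {aid: cosine(q, Counter(terms)) for aid, terms in articles.items()}
    return {aid: s for aid, s in scores.items() if s > t}
```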
13. Topics
The MALLET topic model package
• Unsupervised analysis of text
• “a Topic consists of a cluster of words that frequently occur together”
• [see http://mallet.cs.umass.edu/topics.php]
• Input: Text, Number of iterations, Number of topics/clusters
• Output: Words that cluster around one topic.
• Example:
• Text: a speech in a debate from 1975
• number of iterations: 2000
• number of topics: 1
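The shape of MALLET's output (a cluster of words standing for one topic) can be illustrated with a crude frequency-based stand-in; this is not LDA, and the tiny stopword list is only a sample:

```python
import re
from collections import Counter

# Tiny illustrative Dutch stopword list; a real run would use a proper one.
STOPWORDS = {"de", "het", "een", "van", "en", "in", "is", "dat", "te", "met"}

def topic_words(text: str, k: int = 10) -> list:
    """Return the k most frequent content words, mimicking the shape of
    MALLET's per-topic word list, not its algorithm."""
    tokens = re.findall(r"[a-zà-ÿ]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [w for w, _ in counts.most_common(k)]
```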
16. Evaluation
• We tried three different approaches:
• Experiment 1: NEs in speech
• Experiment 2: NEs + topics in speech
• Experiment 3: NEs + topics in speech and debate
• Two independent evaluators: reading the speeches and articles linked to
them and manually assessing their relatedness
• Randomly selected 20 debates from our dataset of 10,924 debates
(different subjects: from fraud in the social system to the European elections)
• Each experiment: random 50 speeches
• In total: 150 speech-article pairs, namely 3 sets of 50 each
17. Evaluation
Results:
• best approach: named entities (speech + debate descriptions) and topics
(speech + debate)
• judgment scale: 2 = relevant, 1 = partially relevant, 0 = unrelated
18. Evaluation
• Relative recall:
• a different evaluation: an annotator reads a speech, manually creates a
suitable query for it, and assesses the relevance of the articles returned for
that query
• Repeating experiment 3 on 5 speeches / 115 articles: precision 75%, recall 62%
• Experiment 3 produced 3804 links in total (experiment 1: 5887, experiment 2: 4449)
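The two evaluation figures can be sketched as simple ratios; counting partially relevant links as correct is an assumption of this sketch, not necessarily the paper's exact scoring:

```python
def precision(judgments) -> float:
    """Fraction of returned links judged relevant (2) or partially relevant (1);
    treating partial matches as correct is an assumption of this sketch."""
    return sum(1 for j in judgments if j > 0) / len(judgments)

def relative_recall(returned, manual_pool) -> float:
    """Share of the manually found relevant articles the method also returned."""
    return len(set(returned) & set(manual_pool)) / len(manual_pool)
```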
19. Conclusion
• Creation of links between two very different datasets: a dataset of political
debates and a media archive
• Linking method takes advantage of:
• Debate content and metadata
• Named Entities and Topics from the debates
• semantic partOf structure of the debates
• In experiments we have shown the added value of topics and debate
structure
• Produced links
• different in nature from those produced by, e.g., ontology alignment tools
• Now: coarsely typed links
• Future: nature and strength of the link
Editor's notes
Politics and media are heavily intertwined and both play a role in the discussion on policy proposals and current affairs. However, a dataset that allows a joint analysis of the two does not yet exist.
The PoliMedia project is driven by research questions from historians with respect to media coverage of politics across several types of media outlets. Cross-media comparisons will be conducted over a longer period of time, on different topics. The project concentrates on media coverage of the debates in the Dutch parliament and gives insight into the different choices that different media make while reporting on those debates and how this changes over time.
Go to archives, look up original data, decide whether there is a link to a debate. Cumbersome.
The final goal of our project is to connect all the media sources we can find to one particular speech from the parliament that we are interested in. In this paper we present our method for linking the debates and the newspaper dataset.
The Dutch government publishes the proceedings of its parliamentary debates on two websites. Debates from 1995 until now can be found on the Officiële Bekendmakingen portal and can be downloaded as PDF or in an XML format, using an XML schema and permanent identifiers. The Staten-Generaal Digitaal portal contains the debates from before the year 1995, which can be accessed using the Search and Retrieval via URL (SRU) protocol. A third source for Dutch parliamentary debate data is the Political Mashup portal, created in the ongoing project War in Parliament (WIP).
All debates conform to the same structure, where speakers give speeches in some chronological order. The debates are split up into segments according to the different themes or agenda points of the meeting. The first speaker of each segment is always the president of the House of Representatives ('voorzitter', or chairperson, in Dutch). She usually gives an introduction to the subject, and after her speech she gives the floor to a member of parliament. Every word by every speaker is transcribed, including the names of the speakers and their party affiliation. The transcripts also contain metadata such as the date and title of the debate.
For newspaper data we use the historic newspaper archive of the National Library of the Netherlands, which contains the text as well as images of newspaper articles from 1618 to 1995. Metadata of the articles is available as DIDL or 'Digital Item Declaration Language', an XML dialect.
The semantic model for the PoliMedia project is built to satisfy the requirements of the project, i.e. the research questions from the historians. To represent parliamentary debates as events, we have created a domain-specific semantic model that enables us to express information associated with the debates, such as topics, actors, debate structure, and links to media. We created this model according to the rules of the Dutch parliament, although it can easily be adapted to parliaments in other countries, because core elements like speakers, speeches and topics are present in all parliamentary debates.
Top view of our semantic model in RDF. The debates and their structure are broken down into entities and the relationships between them. The entities in blue are important structural parts of the debate (like topic descriptions and speeches) and they all have their own unique identifiers.
The biggest challenge in our method was creating a query representation of the speech that contains enough meaning and context to retrieve, and distinguish between, the large number of media articles that cover topics from the parliament and politics. We should stress that debate speeches and newspaper articles are generally completely different types of documents in style and scope (so computing plain document similarity does not work). While speeches can contain a large number of NEs and digressions, which makes it hard to determine the right context for each speech, newspaper articles (especially the ones that report on topics from the parliament) are very strict and concise (words are used sparingly).
For each speech inside a debate segment (called PartOfDebate in our method) we extract ten words that represent one topic discussed inside the speech. In addition, all speeches contained inside one debate segment are concatenated into one text, and a set of ten words that represents one topic of the debate segment as a whole is then extracted from that text.
The query is made automatically by analyzing the debate document. We create four different groups (vectors) that are joined into a single query. The data for the query comes from different structural elements of the debate, as can be seen in the picture.
Transforming debates to RDF (conforming to the semantic model we made for this purpose):
1. topics: enriching the existing debate metadata with topics
2. preselection of articles: when were the candidate articles published and who spoke in the debate (timeframe and speakers)?
3. automatic query creation: candidate articles are ranked based on similarity to the query (automatically created from the speech text) by comparing vectors of topics and named entities
4. link creation: links are created between a speech and an article if the similarity score is above a threshold t (similarity measures used: cosine similarity and overlap)
Automatically made links are written back into the RDF files representing the debates. Everything is now published in an RDF store.
To gain insight into the quality and added value of the various steps of the linking method described in the previous section, we have performed experiments with three versions of the method. Specifically, we have varied which information is used to rank the candidate articles (named entities (NEs), topics) and whether the partOf relations between speeches and larger parts of debates are used to also include information associated with these larger parts (debate segments). Experiment 1: NEs in speech. In the most simple form of our method, we rank articles only based on the NEs found in the speech. Experiment 2: NEs + topics in speech. Here, we include not only NEs but also topics detected for the speech. Experiment 3: NEs + topics in speech and debate. We include not just NEs and topics extracted from the speech itself but also NEs extracted from the debate context and topics extracted from all speeches in this context.
Table 1 shows the average number of relevant, partially relevant and unrelated links found in the experiments. Using just NEs from the speech (experiment 1) gives a lot of unrelated links, and thus a low precision score of 48%. In [17], the authors stated that NEs play an important role in news documents. They wanted to exploit that characteristic by considering them as the only distinguishing features of the documents. In our experiments we found that using just NEs is not enough to distinguish between newspaper articles. When we include topics extracted from the speech (experiment 2), precision increases to 62%. Finally, in experiment 3, we leverage the debate structure. We used NEs and topics from the debate descriptions to create a query that is more specific than both previous queries. The resulting precision is highest, with values around 80%.
To calculate recall we had to conduct a different kind of evaluation. It is infeasible to manually assess the relevance of all of the close to 1 million articles in the archive. Therefore, we chose an approach where an annotator reads a speech, manually creates a suitable query for it, and assesses the relevance of the articles returned for that query. As with our automatic approach, we limit ourselves to articles published within 7 days of the debate day. For this experiment, we arbitrarily chose five speeches for which we retrieved a total of 115 newspaper articles. We repeated experiment 3 on this smaller set of 5 speeches / 115 articles. Precision on this set was 75%, which is in line with the results of experiment 3. Recall was 62%. Another good indication of the recall of our approach is the number of links returned: our approach resulted in 5887 links to articles when using the settings of experiment 1, 4449 when using the settings of experiment 2, and 3804 for experiment 3.