Discovering links between political debates and media
by Damir Juric, Laura Hollink, Geert-Jan Houben
TU Delft - WIS
at ICWE 2013, Aalborg, Denmark, July 2013
1. DISCOVERING LINKS BETWEEN POLITICAL DEBATES AND MEDIA
Damir Juric, Delft University of Technology
Laura Hollink, VU University Amsterdam
Geert-Jan Houben, Delft University of Technology
ICWE2013
3. PoliMedia research questions
• How is a person, subject or process covered and visualised by the media?
• How do debates and arguments develop over a longer period of time?
• Analysing the changing ideas, arguments and presentation in different media
4. Issues with current approach
• Researchers have to go to different archives and look up the original data by hand, which is cumbersome
6. Data Sets – Debates
Handelingen der Staten-Generaal, or the Dutch Hansard,
from 1945 to 1995
Some provenance:
1. Transcripts are made of the complete debates of the
Dutch parliament.
2. Published online by the government on
http://www.statengeneraaldigitaal.nl/ (1818 - 1995) and
http://officielebekendmakingen.nl/ (from 1995)
3. The PoliticalMashup project has converted the government PDF
and TXT files into XML, including URIs as identifiers; see
http://politicalmashup.nl/
4. We build on that work.
7. Structure of the debate data
A debate consists of metadata, followed by topics; each topic contains the
speeches (Speaker 1 / Content, Speaker 2 / Content, ...) in chronological
order.

Example topic description:
"Aan de orde is de behandeling van: - de brief van de minister van Economische
Zaken inzake Borssele (16226, nr. 26). De beraadslaging wordt geopend."
(Under consideration: the letter of the Minister of Economic Affairs regarding
Borssele (16226, no. 26). The deliberation is opened.)
NEs = {Economische Zaken, Borssele}

Example speech:
"Mijnheer de Voorzitter! Met de verdragen tot uitbreiding van de EEG met
Denemarken, Engeland, Ierland en Noorwegen wordt een van de doelstellingen
van ons buitenlands beleid verwezenlijkt."
(Mr. Chairman! With the treaties on the enlargement of the EEC with Denmark,
England, Ireland and Norway, one of the objectives of our foreign policy is
realised.)
NEs = {Borssele, Partij van de Arbeid, D66}

The data provides:
• who, when, what
• identifiers for subparts of the debate
• chronological order of speakers
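The debate structure above can be sketched with a few hypothetical Python dataclasses; the class and field names here are illustrative, not the PoliticalMashup XML schema itself, and the example values are taken from the slides:

```python
from dataclasses import dataclass, field

@dataclass
class Speech:
    speaker: str          # who
    text: str             # spoken content
    identifier: str = ""  # identifier for this subpart of the debate

@dataclass
class Topic:
    description: str                              # e.g. "Aan de orde is de behandeling van: ..."
    speeches: list = field(default_factory=list)  # speeches in chronological order

@dataclass
class Debate:
    date: str    # when
    title: str   # what
    topics: list = field(default_factory=list)

# Illustrative instance, using example values shown on the slides
debate = Debate(
    date="1945-11-20",
    title="Handelingen Verenigde Vergadering...",
    topics=[Topic(
        description="de brief van de minister van Economische Zaken inzake Borssele",
        speeches=[Speech(speaker="Speaker 1", text="Mijnheer de Voorzitter! ...")],
    )],
)
```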
8. Data sets – Media
• Newspaper articles
• at the National Library of the Netherlands
• Many newspapers, 1950-1995
• Text + images of newspaper layout
9. All data and links expressed as RDF
• We have created a semantic model to capture the datasets and the links between them.
• Reusing other vocabularies
• Simple Event Model (SEM)
• Dublin Core
• FOAF
• ISOCAT
10. All data and links expressed as RDF
The figure shows the top of the RDF graph for one debate:

• Debate node nl.proc.sgd.d.194519460000002, titled "Handelingen Verenigde
Vergadering...", with its metadata: dc:id "nl.proc.sgd.d.194519460000002",
dc:language "Dutch", dc:date "1945-11-20", dc:publisher
http://statengeneraaldigitaal.nl/, and two dc:source links,
http://resolver.politicalmashup.nl/nl.proc.sgd.d.194519460000002 and
http://resolver.kb.nl/resolve?urn=sgd:mpeg21:19451946:0000002:pdf
• hasPart connects the Debate to its PartOfDebate nodes
(nl.proc.sgd.d.194519460000002.1, .2, ...), which are chained by
hasSubsequentPartOfDebate
• Each PartOfDebate contains (hasPart) a DebateContext node (...1.1), whose
hasText holds the topic description ("De voorzitter opent de vergadering..." /
"The chair opens the meeting..."), and Speech nodes (...1.2, ...1.3) chained
by hasSubsequentSpeech
• Each Speech has hasSpokenText ("Mijnheer de Voorzitter, de Commissie
van ..." / "Mr. Chairman, the Committee of ..."), sem:hasActor pointing to the
speaker, and coveredIn linking to newspaper articles, e.g.
http://resolver.kb.nl/resolve?urn=ddd:011198136:mpeg21:a0525:ocr
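As a rough, stdlib-only sketch, this part of the graph can be written as plain (subject, predicate, object) triples; the identifiers and property names follow the figure, while the helper function is illustrative and not part of the PoliMedia tooling:

```python
# Sketch of the debate graph as plain triples (prefixes abbreviated).
D = "nl.proc.sgd.d.194519460000002"

triples = [
    (D,          "rdf:type",     "Debate"),
    (D,          "dc:language",  "Dutch"),
    (D,          "dc:date",      "1945-11-20"),
    (D,          "hasPart",      D + ".1"),
    (D + ".1",   "rdf:type",     "PartOfDebate"),
    (D + ".1",   "hasPart",      D + ".1.1"),
    (D + ".1.1", "rdf:type",     "DebateContext"),
    (D + ".1",   "hasPart",      D + ".1.2"),
    (D + ".1.2", "rdf:type",     "Speech"),
    (D + ".1.2", "hasSubsequentSpeech", D + ".1.3"),
    (D + ".1.2", "coveredIn",
     "http://resolver.kb.nl/resolve?urn=ddd:011198136:mpeg21:a0525:ocr"),
    (D + ".1",   "hasSubsequentPartOfDebate", D + ".2"),
]

def objects(subject: str, predicate: str) -> list:
    """All objects o with (subject, predicate, o) in the graph."""
    return [o for s, p, o in triples if s == subject and p == predicate]
```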
11. PoliMedia linking method
• Debate speeches and newspaper articles are different
types of documents, so default document similarity
metrics are insufficient
• Speeches contain many named entities and digressions.
• Newspaper articles are formal and concise; words are used sparingly.
• The challenge: how to create a representation of the
speeches that contains enough information to be used as
a query to retrieve the right media articles from the
archive?
12. PoliMedia linking method
• Our PoliMedia linking method consists of four steps:
1. topics: enriching the existing debate metadata with topics
2. preselection of articles: filter candidates by when they were published and
who spoke in the debate (timeframe and speakers)
3. automatic query creation: candidate articles are ranked based on
similarity to the query (automatically created from speech text) by
comparing vectors of topics and named entities
4. link creation: links are created between a speech and an article if the
similarity score is above a threshold t
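Steps 3 and 4 above can be sketched as follows; the speaker notes mention cosine similarity as one of the measures used, but the threshold value and function names here are illustrative, not the paper's actual parameters:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def create_links(query_terms, articles, t=0.3):
    """Step 4 sketch: keep articles whose similarity to the query exceeds t.

    query_terms: NEs + topic words extracted from the speech (step 3).
    articles: {article_id: list of terms}.  t = 0.3 is illustrative only.
    """
    q = Counter(query_terms)
    scores = {aid: cosine(q, Counter(terms)) for aid, terms in articles.items()}
    return {aid: s for aid, s in scores.items() if s > t}
```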
13. Topics
The MALLET topic model package
• Unsupervised analysis of text
• “a Topic consists of a cluster of words that frequently occur together”
• [see http://mallet.cs.umass.edu/topics.php]
• Input: Text, Number of iterations, Number of topics/clusters
• Output: Words that cluster around one topic.
• Example:
• Text: a speech in a debate from 1975
• number of iterations: 2000
• number of topics: 1
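The shape of MALLET's output (a cluster of words standing for one topic) can be illustrated with a crude frequency-based stand-in; this is not LDA, and the tiny stopword list is only a sample:

```python
import re
from collections import Counter

# Tiny illustrative Dutch stopword list; a real run would use a proper one.
STOPWORDS = {"de", "het", "een", "van", "en", "in", "is", "dat", "te", "met"}

def topic_words(text: str, k: int = 10) -> list:
    """Return the k most frequent content words, mimicking the shape of
    MALLET's per-topic word list, not its algorithm."""
    tokens = re.findall(r"[a-zà-ÿ]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [w for w, _ in counts.most_common(k)]
```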
16. Evaluation
• We tried three different approaches:
• Experiment 1: NEs in speech
• Experiment 2: NEs + topics in speech
• Experiment 3: NEs + topics in speech and debate
• Two independent evaluators: reading the speeches and articles linked to
them and manually assessing their relatedness
• Randomly selected 20 debates from our dataset of 10,924 debates
(different subjects: from fraud in the social system to the European elections)
• Each experiment: random 50 speeches
• In total: 150 speech-article pairs, namely 3 sets of 50 each
17. Evaluation
Results:
• best approach: named entities (speech + debate descriptions) and topics
(speech + debate)
• judgment scale: 2 = relevant, 1 = partially relevant, 0 = unrelated
18. Evaluation
• Relative recall:
• a different evaluation: an annotator reads a speech, manually creates a
suitable query for it, and assesses the relevance of the articles returned for
that query
• Repeating experiment 3 on 5 speeches / 115 articles: precision 75%, recall 62%
• Experiment 3 produced 3804 links in total (experiment 1: 5887, experiment 2: 4449)
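The two evaluation figures can be sketched as simple ratios; counting partially relevant links as correct is an assumption of this sketch, not necessarily the paper's exact scoring:

```python
def precision(judgments) -> float:
    """Fraction of returned links judged relevant (2) or partially relevant (1);
    treating partial matches as correct is an assumption of this sketch."""
    return sum(1 for j in judgments if j > 0) / len(judgments)

def relative_recall(returned, manual_pool) -> float:
    """Share of the manually found relevant articles the method also returned."""
    return len(set(returned) & set(manual_pool)) / len(manual_pool)
```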
19. Conclusion
• Creation of links between two very different datasets: a dataset of political
debates and a media archive
• Linking method takes advantage of:
• Debate content and metadata
• Named Entities and Topics from the debates
• semantic partOf structure of the debates
• In experiments we have shown the added value of topics and debate
structure
• Produced links
• different in nature from those produced by, e.g., ontology alignment tools
• Now: coarsely typed links
• Future: nature and strength of the link
Editor's notes
Politics and media are heavily intertwined and both play a role in the discussion on policy proposals and current affairs. However, a dataset that allows a joint analysis of the two does not yet exist.
The PoliMedia project is driven by research questions from historians with respect to media coverage of politics across several types of media outlets. Cross-media comparisons will be conducted over a longer period of time, on different topics. The project concentrates on media coverage of the debates in the Dutch parliament and gives insight into the different choices that different media make while reporting on those debates and how this changes over time.
Go to archives, look up original data, decide whether there is a link to a debate. Cumbersome.
The final goal of our project is to connect all the media sources we can find to one particular speech from the parliament that we are interested in. In this paper we present our method for linking the debates and the newspaper dataset.
The Dutch government publishes the proceedings of its parliamentary debates on two websites. Debates from 1995 until now can be found on the Officiële Bekendmakingen portal and can be downloaded as PDF or in an XML format, using an XML schema and permanent identifiers. The Staten-Generaal Digitaal portal contains the debates from before the year 1995, which can be accessed using the Search and Retrieval via URL (SRU) protocol. A third source for Dutch parliamentary debate data is the Political Mashup portal, created in the ongoing project War in Parliament (WIP).
All debates conform to the same structure, where speakers give speeches in some chronological order. The debates are split up into segments according to the different themes or agenda points of the meeting. The first speaker of each segment is always the president of the House of Representatives ('voorzitter', or chairperson, in Dutch). She usually gives an introduction to the subject, and after her speech she gives the floor to a member of parliament. Every word by every speaker is transcribed, including the names of the speakers and their party affiliation. The transcripts also contain metadata such as the date and title of the debate.
For newspaper data we use the historic newspaper archive of the National Library of the Netherlands, which contains the text as well as images of newspaper articles from 1618 to 1995. Metadata of the articles is available as DIDL or 'Digital Item Declaration Language', an XML dialect.
The semantic model for the PoliMedia project is built to satisfy the requirements of the project, i.e. the research questions from the historians. To represent parliamentary debates as events, we have created a domain-specific semantic model that enables us to express information associated with the debates, such as topics, actors, debate structure, and links to media. We created this model according to the rules of the Dutch parliament, although it can easily be adapted to parliaments in other countries, because core elements like speakers, speeches and topics are present in all parliamentary debates.
Top view of our semantic model in RDF. The debates and their structure are broken down into entities and the relationships between them. The entities in blue are important structural parts of the debate (like topic descriptions and speeches) and they all have their own unique identifiers.
The biggest challenge in our method was creating a query representation of the speech that contains enough meaning and context to retrieve, and distinguish between, the large number of media articles that cover topics from the parliament and politics. We should stress that debate speeches and newspaper articles are generally completely different types of documents in style and scope (so computing plain document similarity does not work). While speeches can contain a large number of NEs and digressions, which makes it hard to determine the right context for each speech, newspaper articles (especially the ones that report on topics from the parliament) are very strict and concise (words are used sparingly).
For each speech inside a debate segment (called PartOfDebate in our method) we extract ten words that represent one topic discussed inside the speech. In addition, all speeches contained inside one debate segment are concatenated into one text, and a set of ten words that represents one topic of the debate segment as a whole is then extracted from that text.
The query is made automatically by analyzing the debate document. We create four different groups (vectors) that are joined into a single query. The data for the query comes from different structural elements of the debate, as can be seen in the picture.
Transforming debates to RDF (conforming to the semantic model we made for this purpose):
1. topics: enriching the existing debate metadata with topics
2. preselection of articles: when were the candidate articles published and who spoke in the debate (timeframe and speakers)?
3. automatic query creation: candidate articles are ranked based on similarity to the query (automatically created from the speech text) by comparing vectors of topics and named entities
4. link creation: links are created between a speech and an article if the similarity score is above a threshold t (similarity measures used: cosine similarity and overlap)
Automatically made links are written back into the RDF files representing the debates. Everything is now published in an RDF store.
To gain insight into the quality and added value of the various steps of the linking method described in the previous section, we have performed experiments with three versions of the method. Specifically, we have varied which information is used to rank the candidate articles (named entities (NEs), topics) and whether the partOf relations between speeches and larger parts of debates are used to also include information associated with these larger parts (debate segments). Experiment 1: NEs in speech. In the most simple form of our method, we rank articles only based on the NEs found in the speech. Experiment 2: NEs + topics in speech. Here, we include not only NEs but also topics detected for the speech. Experiment 3: NEs + topics in speech and debate. We include not just NEs and topics extracted from the speech itself but also NEs extracted from the debate context and topics extracted from all speeches in this context.
Table 1 shows the average number of relevant, partially relevant and unrelated links found in the experiments. Using just NEs from the speech (experiment 1) gives a lot of unrelated links, and thus a low precision score of 48%. In [17], the authors stated that NEs play an important role in news documents. They wanted to exploit that characteristic by considering them as the only distinguishing features of the documents. In our experiments we found that using just NEs is not enough to distinguish between newspaper articles. When we include topics extracted from the speech (experiment 2), precision increases to 62%. Finally, in experiment 3, we leverage the debate structure. We used NEs and topics from the debate descriptions to create a query that is more specific than both previous queries. The resulting precision is highest, with values around 80%.
To calculate recall we had to conduct a different kind of evaluation. It is infeasible to manually assess the relevance of all of the close to 1 million articles in the archive. Therefore, we chose an approach where an annotator reads a speech, manually creates a suitable query for it, and assesses the relevance of the articles returned for that query. As with our automatic approach, we limit ourselves to articles published within 7 days of the debate day. For this experiment, we arbitrarily chose five speeches for which we retrieved a total of 115 newspaper articles. We repeated experiment 3 on this smaller set of 5 speeches / 115 articles. Precision on this set was 75%, which is in line with the results of experiment 3. Recall was 62%. Another good indication of the recall of our approach is the number of links returned: our approach resulted in 5887 links to articles when using the settings of experiment 1, 4449 when using the settings of experiment 2, and 3804 for experiment 3.