DISCOVERING LINKS
BETWEEN POLITICAL
DEBATES AND MEDIA
Damir Juric, Delft University of Technology
Laura Hollink, VU University Amsterdam
Geert-Jan Houben, Delft University of Technology
ICWE2013
The PoliMedia project: linking politics
to media
PoliMedia research questions
• How is a person, subject or process covered & visualised by the
media?
• How do debates and arguments develop over a longer period of time?
• Analysing the changing ideas, arguments and presentation in different
media
Issues with current approach
• Go to different archives, look up original data!
Goal: explicit links to different media
types in one system
Data Sets – Debates
Handelingen der Staten-Generaal or Dutch Hansard
from 1945 to 1995
Some provenance:
1. Transcripts are made of the complete debates of the
Dutch parliament.
2. Published online by the government on
http://www.statengeneraaldigitaal.nl/ (1818 - 1995) and
http://officielebekendmakingen.nl/ (from 1995)
3. The PoliticalMashup project has converted the government PDF
and TXT files into XML, including URIs as identifiers; see
http://politicalmashup.nl/
4. We build on that.
Structure of the debate data
Debate
Metadata
Topic 1
Topic 2
Speaker 1 / Content
Speaker 2 / Content
Speaker 3 / Content
Speaker 1 / Content
Aan de orde is de behandeling van: - de brief van de minister van Economische
Zaken inzake Borssele (16226, nr. 26).
De beraadslaging wordt geopend.
NEs={Economische Zaken, Borssele}
NEs={Borssele, Partij van de Arbeid, D66}
Metadata
Speaker 1
Speaker 2
Speaker 3
Mijnheer de Voorzitter! Met de verdragen tot uitbreiding van de EEG met
Denemarken, Engeland, Ierland en Noorwegen wordt een van de doelstellingen
van ons buitenlands beleid verwezenlijkt.
• who, when, what
• identifiers for subparts of the debate
• chronological order of speakers
Data sets – Media
• Newspaper articles
• at the National Library of the Netherlands
• Many newspapers, 1950-1995
• Text + images of newspaper layout
All data and links expressed as RDF
• We have created a semantic model to capture the
datasets and the links between them.
• Reusing other vocabularies
• Simple Event Model (SEM)
• Dublin Core
• FOAF
• ISOCAT
All data and links expressed as RDF
[Figure: RDF graph for one example debate]
• Nodes: nl.proc.sgd.d.194519460000002, nl.proc.sgd.d.194519460000002.1, nl.proc.sgd.d.194519460000002.1.1, nl.proc.sgd.d.194519460000002.1.2, nl.proc.sgd.d.194519460000002.1.3, nl.proc.sgd.d.194519460000002.2
• Classes (rdf:type): Debate, PartOfDebate, DebateContext, Speech
• Properties: dc:id, dc:source, dc:publisher, dc:language, dc:date, hasPart, hasSubsequentPartOfDebate, hasSubsequentSpeech, hasText, hasSpokenText, sem:hasActor, coveredIn
• Values shown: http://resolver.politicalmashup.nl/nl.proc.sgd.d.194519460000002, http://statengeneraaldigitaal.nl/, http://resolver.kb.nl/resolve?urn=sgd:mpeg21:19451946:0000002:pdf, nl.proc.sgd.d.19720000002, "Handelingen Verenigde Vergadering...", Dutch, 1945-11-20
• Text literals: "De voorzitter opent de vergadering…" (hasText), "Mijnheer de Voorzitter, de Commissie van …" (hasSpokenText)
• Link to media: http://resolver.kb.nl/resolve?urn=ddd:011198136:mpeg21:a0525:ocr (coveredIn)
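As a rough illustration of the graph on this slide, the sketch below stores a few of its statements as plain (subject, predicate, object) tuples and walks the speech order. The identifiers, class names and property names are taken from the slide; which node carries which property is a reading of the flattened figure, and the use of the resolver URL as a namespace, the helper names and the tuple representation are assumptions made only for this example.

```python
# Minimal sketch: a few statements from the example debate as plain tuples.
# Assumption: node URIs are formed by appending the identifier to the resolver URL.
BASE = "http://resolver.politicalmashup.nl/"
DEBATE = "nl.proc.sgd.d.194519460000002"
ARTICLE = "http://resolver.kb.nl/resolve?urn=ddd:011198136:mpeg21:a0525:ocr"


def uri(local_id):
    """Build a node URI from a debate identifier (hypothetical helper)."""
    return BASE + local_id


# Assumption: subject/property pairings below are inferred from the figure.
triples = [
    (uri(DEBATE), "rdf:type", "Debate"),
    (uri(DEBATE), "dc:language", "Dutch"),
    (uri(DEBATE), "dc:date", "1945-11-20"),
    (uri(DEBATE), "hasPart", uri(DEBATE + ".1")),
    (uri(DEBATE + ".1"), "rdf:type", "PartOfDebate"),
    (uri(DEBATE + ".1"), "hasSubsequentPartOfDebate", uri(DEBATE + ".2")),
    (uri(DEBATE + ".1"), "hasPart", uri(DEBATE + ".1.1")),
    (uri(DEBATE + ".1.1"), "rdf:type", "DebateContext"),
    (uri(DEBATE + ".1"), "hasPart", uri(DEBATE + ".1.2")),
    (uri(DEBATE + ".1.2"), "rdf:type", "Speech"),
    (uri(DEBATE + ".1.2"), "hasSubsequentSpeech", uri(DEBATE + ".1.3")),
    (uri(DEBATE + ".1.2"), "coveredIn", ARTICLE),
]


def objects(subject, predicate):
    """Return all objects of statements with the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def speeches_in_order(first_speech):
    """Follow hasSubsequentSpeech links to recover the chronological order."""
    order = [first_speech]
    while True:
        following = objects(order[-1], "hasSubsequentSpeech")
        if not following or following[0] in order:
            return order
        order.append(following[0])
```

Calling `objects(uri(DEBATE + ".1.2"), "coveredIn")` returns the newspaper article linked to that speech, which is the kind of link the method is meant to produce.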
PoliMedia linking method
• Debate speeches and newspaper articles are different
types of documents, so default document similarity
metrics are insufficient
• Speeches contain many named entities and digressions.
• Newspapers are formal and concise, words are used sparingly.
• The challenge: how to create a representation of the
speeches that contains enough information to be used as
a query to retrieve the right media articles from the
archive?
PoliMedia linking method
• Our PoliMedia linking method consists of four steps:
1. topics: enriching the existing debate metadata with topics
2. preselection of articles: candidate articles are selected by when they were
published and who spoke in the debate (timeframe and speakers)
3. automatic query creation: candidate articles are ranked based on
similarity to the query (automatically created from speech text) by
comparing vectors of topics and named entities
4. link creation: links are created between a speech and an article if the
similarity score is above a threshold t
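Steps 3 and 4 can be sketched as follows. The speaker notes mention cosine similarity and overlap as the similarity measures; this sketch shows only cosine similarity over simple term counts. The function names, the toy article data and the threshold values are assumptions for illustration, not the project's actual settings.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    shared = a.keys() & b.keys()
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def create_links(query_terms, candidates, t):
    """Rank candidate articles by similarity to the query and keep those above t.

    candidates maps an article id to its list of terms (hypothetical layout).
    Returns (article_id, score) pairs, best match first.
    """
    query = Counter(query_terms)
    scored = [
        (cosine(query, Counter(terms)), article_id)
        for article_id, terms in candidates.items()
    ]
    scored.sort(reverse=True)
    return [(article_id, score) for score, article_id in scored if score > t]


# Toy data, made up for this example.
query = ["borssele", "economische zaken"]
candidates = {
    "article-1": ["borssele", "economische zaken"],
    "article-2": ["borssele", "d66", "rente", "tarief"],
    "article-3": ["fraude"],
}
links = create_links(query, candidates, t=0.5)
```

With the toy data, only `article-1` clears a threshold of 0.5; lowering the threshold to 0.3 also admits `article-2`, which shares a single term with the query.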
Topics
The MALLET topic model package
• Unsupervised analysis of text
• “a Topic consists of a cluster of words that frequently occur together”
• [see http://mallet.cs.umass.edu/topics.php]
• Input: Text, Number of iterations, Number of topics/clusters
• Output: Words that cluster around one topic.
• Example:
• Text: a speech in a debate from 1975
• number of iterations: 2000
• number of topics: 1
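The example settings above could be passed to MALLET's command-line tool roughly as sketched here. The snippet only builds the argument list and does not run anything. The `train-topics` options used are standard MALLET options, but the `bin/mallet` path, the file names and the helper function are assumptions, and the input file is assumed to have been prepared beforehand with MALLET's import step. The value of ten top words follows the speaker notes, which mention ten words per topic.

```python
def mallet_train_topics_args(input_file, keys_file, num_topics=1,
                             num_iterations=2000, num_top_words=10):
    """Build the argument list for a MALLET topic-model run (hypothetical helper).

    input_file: a .mallet file created earlier by MALLET's import step (assumed).
    keys_file:  where MALLET should write the top words per topic.
    """
    return [
        "bin/mallet", "train-topics",  # assumes execution from the MALLET directory
        "--input", input_file,
        "--num-topics", str(num_topics),
        "--num-iterations", str(num_iterations),
        "--num-top-words", str(num_top_words),
        "--output-topic-keys", keys_file,
    ]


# File names are placeholders for one speech from a 1975 debate.
args = mallet_train_topics_args("speech-1975.mallet", "speech-1975-keys.txt")
```

The resulting list can be handed to `subprocess.run(args, check=True)` on a machine where MALLET is installed.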
[Figure: example output, with terms grouped under the labels TopicSet Speech, NE Speech, TopicSet Topic and NE Topic. Terms shown, in extraction order:]
• Kombrink, rente, inkomstenbelasting, bronheffing, vereenvoudiging, tarief, contourennota, Nederland, word, tussen
• wetgeving, sociale, moeten, fraude, fraudebestrijding, vraag, misbruik, ten, gebruik, kamer
• misbruik, fraudebestrijding, ismo-rapport, Contourennota, Kombrink, EEG, Netherlandse, OESO-verband, Nederland, Contou, Engwirda, Couprie, Midden-Oosten, Euro-kapitaalmarkt, Tariefnota, Staatssecretaris, Regering, Financiën, Zwitserland, Brussel, Grave
Automatic query creation
[Figure: the debate structure (Metadata, Topic 1, Topic 2, Speaker 1-3 / Content) connected by "came from" arrows to the query elements labelled Actor and Query Debate]
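According to the speaker notes, four groups of terms (topic words and named entities from the speech, and topic words and named entities from the surrounding debate segment) are joined into a single query. One simple way to do that is sketched below. Counting a term once per group it appears in, lowercasing, and the assignment of the example terms to groups are all assumptions for this illustration.

```python
from collections import Counter


def build_query(topics_speech, nes_speech, topics_segment, nes_segment):
    """Join the four term groups into one term-count vector.

    A term that occurs in several groups gets a higher count (a design choice
    made for this sketch, not something stated in the slides).
    """
    query = Counter()
    for group in (topics_speech, nes_speech, topics_segment, nes_segment):
        # Use a set so duplicates inside one group count only once.
        query.update({term.lower() for term in group})
    return query


# Example terms borrowed from the slides; their grouping here is illustrative.
query = build_query(
    topics_speech=["fraude", "misbruik", "rente"],
    nes_speech=["Kombrink", "EEG"],
    topics_segment=["fraude", "wetgeving"],
    nes_segment=["Kombrink", "Brussel"],
)
```

Here `fraude` and `kombrink` each end up with a count of 2 because they appear at both the speech and the segment level, so they weigh more heavily in a later similarity calculation.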
Polimedia pipeline
[Figure: pipeline diagram. Recoverable elements:]
• Input: PoliticalMashup (XML), converted to RDF files following the RDF semantic model
• Contextual vectors: NERs Speech, TopicSet Speech, NERs Topic, TopicSet Topic
• Automatic query creation: Query NE, stopword removal, topic modeling, Query content, expanded query creation
• Preselection: SRU query (actor, date range) sent to the KB, which returns article metadata
• Similarity calculation, ranking, filtering
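The preselection step sends an SRU query with an actor and a date range. A request of that shape could be assembled as sketched below. The parameters `operation`, `version`, `query` and `maximumRecords` are standard SRU request parameters; the endpoint URL, the CQL index name `date`, the record limit and the choice to count the 7-day window (mentioned in the speaker notes) forward from the debate day are assumptions.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

# Hypothetical endpoint; the real archive's SRU address is not given in the slides.
SRU_ENDPOINT = "https://sru.example.org/sru"


def preselection_url(actor, debate_day, window_days=7, max_records=100):
    """Build an SRU searchRetrieve URL for articles naming the actor in the window."""
    end = debate_day + timedelta(days=window_days)
    # Assumption: the server exposes a CQL index called "date" supporting "within".
    cql = f'"{actor}" and date within "{debate_day.isoformat()} {end.isoformat()}"'
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql,
        "maximumRecords": max_records,
    }
    return SRU_ENDPOINT + "?" + urlencode(params)


url = preselection_url("Kombrink", date(1945, 11, 20))
```

Only the articles returned by such a query need to be compared with the speech, which keeps the similarity calculation away from the full archive.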
Evaluation
• We tried three different approaches:
• Experiment 1: NEs in speech
• Experiment 2: NEs + topics in speech
• Experiment 3: NEs + topics in speech and debate
• Two independent evaluators: reading the speeches and articles linked to
them and manually assessing their relatedness
• Randomly selected 20 debates from our dataset of 10,924 debates
(different subjects: from fraud in the social system to the European elections)
• Each experiment: random 50 speeches
• In total: 150 speech-article pairs, namely 3 sets of 50 each
Evaluation
Results:
• best approach: named entities (speech + debate descriptions) and topics
(speech + debate)
(2: relevant, 1: partially relevant, 0: unrelated)
Evaluation
• Relative recall:
• different evaluation: an annotator reads a speech, manually creates a
suitable query for it, and assesses the relevance of the articles returned for
that query
• Experiment 3 repeated on 5 speeches/115 articles: precision 75%, recall 62%
• In total, experiment 3 produced 3804 links to articles
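Precision and relative recall can be computed from two sets: the links the method produced and the articles the annotator judged relevant after a manual query. The sketch below uses made-up article ids, not the project's evaluation data. Recall is "relative" because the relevant set contains only what the annotator's query surfaced, not every relevant article in the archive.

```python
def precision_recall(produced, relevant):
    """Return (precision, recall) for produced links against judged-relevant ones."""
    produced, relevant = set(produced), set(relevant)
    hits = produced & relevant
    precision = len(hits) / len(produced) if produced else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative ids only.
produced_links = {"a1", "a2", "a3", "a4"}
judged_relevant = {"a1", "a2", "a3", "a5", "a6"}
precision, recall = precision_recall(produced_links, judged_relevant)
```

In this toy case three of the four produced links are relevant (precision 0.75) and three of the five relevant articles were found (recall 0.6).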
Conclusion
• Creation of links between two very different datasets: a dataset of political
debates and a media archive
• Linking method takes advantage of:
• Debate content and metadata
• Named Entities and Topics from the debates
• semantic partOf structure of the debates
• In experiments we have shown the added value of topics and debate
structure
• Produced links
• different in nature than those produced by e.g. ontology alignment tools
• Now: coarsely typed links
• Future: nature and strength of the link


Editor's notes

  1. Politics and media are heavily intertwined and both play a role in the discussion on policy proposals and current affairs. However, a dataset that allows a joint analysis of the two does not yet exist.
  2. The PoliMedia project is driven by research questions from historians with respect to media coverage of politics across several types of media outlets. Cross-media comparisons will be conducted over a longer period of time, on different topics. The project concentrates on media coverage of the debates in the Dutch parliament and gives insight into the different choices that different media make while reporting on those debates and how this changes over time.
  3. Go to archives, look up original data, decide whether there is a link to a debate. Cumbersome.
  4. The final goal of our project was to connect all the media sources we can find with the particular parliamentary speech we are interested in. In this paper we present our method of linking the debates dataset and the newspaper dataset.
  5. The Dutch government publishes the proceedings of its parliamentary debates on two websites. Debates from 1995 until now can be found on the Officiële Bekendmakingen portal and can be downloaded as PDF or in an XML format, using an XML schema and permanent identifiers. The Staten-Generaal Digitaal portal contains the debates from before the year 1995, which can be accessed using the Search/Retrieve via URL (SRU) protocol. A third source for Dutch parliamentary debate data is the Political Mashup portal, created in the ongoing project War in Parliament (WIP).
  6. All debates conform to the same structure, where speakers give speeches in chronological order. The debates are split up into segments according to the different themes or agenda points of the meeting. The first speaker of each segment is always the president of the House of Representatives ('voorzitter' or chairperson in Dutch). She usually gives an introduction to the subject and after her speech she gives the floor to a member of the parliament. Every word by every speaker is transcribed, including the names of the speakers and their party affiliation. The transcripts also contain metadata such as the date and title of the debate.
  7. For newspaper data we use the historic newspaper archive of the National Library of the Netherlands, which contains the text as well as images of newspaper articles from 1618 to 1995. Metadata of the articles is available as DIDL ('Digital Item Declaration Language'), an XML dialect.
  8. The semantic model for the PoliMedia project is built to satisfy the requirements of the project, i.e. the research questions from the historians. To represent parliamentary debates as events, we have created a domain-specific semantic model that enables us to express information associated with the debates such as topics, actors, debate structure, and links to media. We created this model according to the rules of the Dutch parliament, although the model can be easily adapted to include parliaments in other countries, because core elements like speakers, speeches and topics are present in all parliamentary debates.
  9. Top view of our semantic model in RDF. A debate and its structure are broken down into entities and the relationships between them. Entities in blue are important structural parts of the debate (such as topic descriptions and speeches), and they all have unique identifiers.
  10. The biggest challenge in our method was creating a query representation of the speech that contains enough meaning and context to retrieve the right articles and to distinguish between the large number of media articles that cover topics from parliament and politics. We should stress that debate speeches and newspaper articles are generally completely different types of documents in style and scope, so simply computing document similarity does not work. Speeches can contain a large number of NEs and digressions, which makes it hard to determine the right context for each speech, while newspaper articles (especially the ones that report on topics from the parliament) are very strict and concise (words are used sparingly).
  11. For each speech inside a debate segment (called PartOfDebate in our method) we extract ten words that represent one topic discussed inside the speech. Also, all speeches contained inside one debate segment are concatenated into one text, and the set of ten words that represent one topic of the debate segment as a whole is then extracted from that text.
  12. The query is made automatically by analyzing the debate document. We create four different groups (vectors) that are joined into a single query. Data for the query comes from different structural elements of the debate, as can be seen in the picture.
  13. Transforming debates to RDF (conforming to the semantic model we made just for this case). Topics: enriching the existing debate metadata with topics. Preselection of articles: when were the candidate articles published and who spoke in the debate (timeframe and speakers)? Automatic query creation: candidate articles are ranked based on similarity to the query (automatically created from speech text) by comparing vectors of topics and named entities. Link creation: links are created between a speech and an article if the similarity score is above a threshold t (similarity measures used: cosine similarity and overlap). Automatically made links are written back into the RDF files representing the debates. Add that everything is now published in an RDF store.
  14. To gain insight into the quality and added value of the various steps of the linking method described in the previous section, we have performed experiments with three versions of the method. Specifically, we have varied which information is used to rank the candidate articles (named entities (NEs), topics) and whether the partOf relations between speeches and larger parts of debates are used to also include information associated with these larger parts (debate segments). Experiment 1: NEs in speech. In the simplest form of our method, we rank articles based only on the NEs found in the speech. Experiment 2: NEs + topics in speech. Here, we include not only NEs but also topics detected for the speech. Experiment 3: NEs + topics in speech and debate. We include not just NEs and topics extracted from the speech itself but also NEs extracted from the debate context and topics extracted from all speeches in this context.
  15. Table 1 shows the average number of relevant, partially relevant and unrelated links found in the experiments. Using just NEs from the speech (experiment 1) gives a lot of unrelated links, and thus a low precision score of 48%. In [17], the authors stated that NEs play an important role in news documents. They wanted to exploit that characteristic by considering them as the only distinguishing features of the documents. In our experiments we found that using just NEs is not enough to distinguish between newspaper articles. When we include topics extracted from the speech (experiment 2), precision increases to 62%. Finally, in experiment 3, we leverage the debate structure. We used NEs and topics from debate descriptions to create a query that is more specific than both previous queries. We can see that the resulting precision is highest, with values around 80%.
  16. To calculate recall we had to conduct a different kind of evaluation. It is infeasible to manually assess the relevance of all the close to 1 million articles in the archive. Therefore, we chose an approach where an annotator reads a speech, manually creates a suitable query for it, and assesses the relevance of the articles returned for that query. As with our automatic approach, we limit ourselves to articles published within 7 days of the debate day. For this experiment, we arbitrarily chose five speeches, for which we retrieved a total of 115 newspaper articles. We repeated experiment 3 on this smaller set of 5 speeches/115 articles. Precision on this set was 75%, which is in line with the results of experiment 3. Recall was 62%. Another good indication of the recall of our approach is the number of links returned. Our approach resulted in 5887 links to articles when using the settings of experiment 1, 4449 when using the settings of experiment 2 and 3804 for experiment 3.