This document discusses requirements and opportunities for opening up official documents like parliamentary proceedings. It argues that the value lies not in individual documents but in the relationships between documents over time. A political n-gram viewer application is proposed that would allow exploration of topics and language used by different political parties over decades. However, linking documents and extracting needed metadata like speaker affiliations is challenging and existing linked open data is not reliable enough. Official documents need to be self-describing and use shared standards and controlled vocabularies to be truly open and interoperable.
Keynote: Exploring and Exploiting Official Publications
1. PoliticalMashup 1
PoliticalMashup
Open Official Documents: Requirements and
Opportunities
Maarten Marx
Universiteit van Amsterdam
Istanbul, EEOP (@LREC), 2012-05-27
Content
• Official Documents: zoom in on a specific official publications dataset
• Opportunities: what makes official publications data valuable?
• Requirements: what is needed to make official publications data reusable and interoperable?
Our Leading Research Question
What is the best data format for publishing both legacy and current
parliamentary proceedings in a digital sustainable manner? [Marx et
al 2010]
W3C recommendations on Open Government Data
• make data both machine and human readable;
• link data, make data linkable, provide permanent identifiers for
each government object and data item;
• provide metadata using common standards (e.g. Dublin Core);
• make the data as easy to reuse (e.g. in mashups) as possible.
Goal of this talk: make this concrete.
Value of a large data corpus
• Consider a 200 year corpus of temperature and humidity readings
in one location.
• Value is not in the individual “documents”
• Value is not in the corpus as a whole.
• Value is in the relation between the “documents”.
Documents related by publication date
Google books Ngram viewer
Properties of our Parliamentary Proceedings
Dataset
Longitudinal data
• weekly measurements for over 150 years
• very stable measurement procedure and data model
About this collection
• very sparse available metadata
• very rich “metadata” sits hidden inside the raw data
• rich data model:
  • Meeting (1 day)
    • Topic
      • Stage direction
      • Scene
        • Stage direction
        • Speech
          • Paragraph
Very rich metadata for each word
For every word spoken in parliament, the following facts are known
at the time of the speech act, and can often be extracted from the
written proceedings:
1) when it was said,
2) who said it,
3) in what function,
4) speaking on behalf of which party,
5) in which context, and
6) who was actively present during the speech act.
How to exploit the extra metadata and structure?
• Let’s consider a simple killer app . . .
Political n-gram viewer
• From every word we know both the date and the speaker.
• Every speaker belongs to a political party.
• 3D n-gram viewer: political spectrum vs time vs word-count
• Use: topic ownership, agenda setting, framing
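The viewer's core data structure (party × time × word count) can be sketched as a simple counting pass. The records and field layout below are hypothetical toy data, not the PoliticalMashup schema:

```python
from collections import Counter

# Hypothetical toy records: (date, party, text); illustrative only.
speeches = [
    ("1998-03-12", "VVD",  "immigration policy must change"),
    ("1998-05-20", "PvdA", "education policy must improve"),
    ("2004-09-01", "VVD",  "immigration debate continues"),
]

def ngram_counts(speeches, n=1):
    """Count n-grams per (party, year): the 3D viewer's underlying table."""
    counts = Counter()
    for date, party, text in speeches:
        year = date[:4]
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            counts[(party, year, gram)] += 1
    return counts

counts = ngram_counts(speeches)
print(counts[("VVD", "1998", "immigration")])  # 1
```

Plotting these counts along the party axis, the time axis, and the frequency axis yields the 3D view described above.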
Political n-gram viewer: requirements
Documents
1. metadata: the date of the meeting
2. document structure: for every spoken word, who said it
Linked data: speaker names are disambiguated, normalized, and mapped to a database with temporal party information.
Completeness and correctness: few missing or wrong data points, also for the distant past.
Is Linked (Open) Data the solution?
• Link speaker names to their Wikipedia/DBpedia pages (named-entity
disambiguation and resolution). See also the Google Knowledge
Graph, and [Spitkovsky, Chang, LREC 2012].
• DBpedia extracts link between person and party affiliation from
Wikipedia infobox
• Timestamped triple:
Geert Wilders is a party member of the VVD
from 1998-08-25 until 2004-09-02
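Such a timestamped fact can be modeled as an interval record with a point-in-time lookup. The sketch below is a minimal illustration (the table layout is hypothetical, not DBpedia's actual representation); the data mirrors the triple above:

```python
from datetime import date

# Hypothetical temporal party-affiliation table; intervals are
# half-open [start, end). Data mirrors the triple on the slide.
memberships = {
    "Geert Wilders": [
        (date(1998, 8, 25), date(2004, 9, 2), "VVD"),
    ],
}

def party_at(person, when):
    """Return the party a person belonged to on a given date, or None."""
    for start, end, party in memberships.get(person, []):
        if start <= when < end:
            return party
    return None

print(party_at("Geert Wilders", date(2000, 1, 1)))  # VVD
```

Exactly this temporal qualification is what plain RDF triples struggle to express, as the next slide notes.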
DBpedia not yet reliable
• Data extraction is difficult, even from the infobox, even from
complete data:
Wikipedia page of Geert Wilders
DBpedia information about Geert Wilders
Notice the values of the party and the office attributes
Timestamped facts are difficult to extract and difficult to
represent in RDF triples.
Lesson learned: requirement on metadata and
relations
• One cannot rely on Linked Open Data for good quality metadata.
• Official documents should be self-describing, also for facts which
are obvious at publication time.
• Compare the speaker’s data in the original (OCRed) version with the
XMLified and enriched version:
• Original
• Part of it in XML
• And now for human consumption
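As a rough illustration of what "self-describing" means here, a speech element might carry its speaker's identity, role, party, and date explicitly, even though these were obvious at publication time. Element and attribute names below are invented for illustration, not the actual PoliticalMashup schema:

```python
import xml.etree.ElementTree as ET

# Illustrative self-describing markup: speaker identity, role, party
# and date are stored explicitly on the element itself.
xml = """
<speech speaker-id="nl.m.01234" role="mp" party="VVD" date="2000-03-15">
  <p>Mr Speaker, I would like to raise a point of order.</p>
</speech>
"""

elem = ET.fromstring(xml)
print(elem.get("party"), "--", elem.find("p").text.strip())
```

Ten years later, a machine can still answer "who spoke, for which party?" without consulting any external knowledge base.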
Entity Profiling and Entity Search
• Users search for entities, not for documents. [TREC Entity Track]
[Balog et al 2009].
• Main research questions
How to collect information on entities,
how to model an entity,
how to rank entities.
• (Parsimonious) language models work well as models. [Balog et
al, 2009][Hiemstra et al, 2004]
• Entity profiling: http://www.politiekinzicht.com
• Entity search: http://ikkieswijzer.nl
Content and structure search
• Usual advanced search combines keyword search with metadata
search.
• Extra fields are just extra filters on the returned documents.
• With structured documents we can do search on content and
structure.
• Most useful task: rank best entry points in large documents.
• Compare two search systems on the same data:
on flat text
on an XML representation
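Combined content-and-structure search can be sketched with XPath-style queries over such an XML representation. The document below and the function name are hypothetical; the point is that structural predicates (party attribute) and content predicates (keyword) combine, and the matched elements are candidate entry points:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of structured proceedings.
proceedings = ET.fromstring("""
<topic title="Budget debate">
  <speech speaker="A" party="VVD"><p>We must cut taxes now.</p></speech>
  <speech speaker="B" party="PvdA"><p>We must fund schools.</p></speech>
</topic>
""")

def entry_points(root, party, keyword):
    """Paragraphs inside speeches by a given party containing a keyword."""
    return [p.text
            for sp in root.findall(f".//speech[@party='{party}']")
            for p in sp.findall("p")
            if keyword in (p.text or "")]

print(entry_points(proceedings, "VVD", "taxes"))
```

On flat text, the same query can only return whole documents; here it returns the smallest relevant elements.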
Lesson learned: requirement on structure
• Make semantically important structure of documents explicit in
XML markup.
• Publish for machine readability
• Publish generic data, not data prepared for one use-case.
Application of structure: Interruption graph
(Attackogram)
• MP A interrupts MP B ⇐⇒ A speaks during B’s block.
• Combined with entity profiling: http://debat.politiekinzicht.com/
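The definition above turns directly into a counting pass over an ordered speech list. The records and the floor-holder flag below are a hypothetical simplification of the real document structure:

```python
from collections import Counter

# Ordered speech records within one topic: (speaker, holds_floor).
# The floor holder is the MP whose block it is; anyone else speaking
# during that block is an interrupter.
speech_order = [
    ("A", True),   # A takes the floor
    ("B", False),  # B interrupts A
    ("A", False),  # A responds (not an interruption of A by A)
    ("C", True),   # C takes the floor
    ("A", False),  # A interrupts C
]

def interruption_counts(speech_order):
    """Count directed interruptions (interrupter, floor holder)."""
    counts = Counter()
    holder = None
    for speaker, holds_floor in speech_order:
        if holds_floor:
            holder = speaker
        elif holder is not None and speaker != holder:
            counts[(speaker, holder)] += 1
    return counts

counts = interruption_counts(speech_order)
print(counts)
```

The resulting directed multigraph is the interruption graph ("attackogram").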
Exploring and exploiting official documents
• We saw what can be done with one well-curated collection.
• What are the key infrastructural and research questions?
In what direction and how to scale this up?
1. in time
2. in breadth
3. in links
Scale diachronically
• Stable data model and measurement procedure make this data
very valuable for diachronic comparisons.
• towards the past
  • OCR
  • consistency in structure
  • more missing data to link to
• towards the future
  • remaining up to date
  • legacy decisions
Scale in breadth, e.g., parliamentary proceedings of all
European countries
• All describe the same “script”, so all fit in one schema.
• Main question: how to connect the data from different countries?
• Common structure and annotation: use the same Relax NG schema
• Common values on certain attributes:
  • Entities: normalize to Wikipedia concepts
  • Controlled-vocabulary keywords: normalize to EuroVoc
  • Language: machine-translate to English
  • Events: normalize to an EMM NewsExplorer query / Wikinews query
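Normalizing to shared values can be sketched as a mapping from national vocabularies onto common identifiers. Everything below is illustrative: the identifier is an invented EuroVoc-style ID, not a real thesaurus entry:

```python
# Hypothetical mapping from national keyword vocabularies to a shared
# EuroVoc-style identifier; real IDs would come from the published
# thesaurus.
SHARED_VOCAB = {
    "onderwijs": "eurovoc:ED-01",   # Dutch "education" (invented ID)
    "education": "eurovoc:ED-01",
    "bildung":   "eurovoc:ED-01",   # German "education"
}

def normalize_keyword(raw):
    """Map a national keyword to its shared identifier, if known."""
    return SHARED_VOCAB.get(raw.lower())

print(normalize_keyword("Onderwijs"))
```

Once values are shared, a single query spans all national collections; shared attribute names alone do not achieve that.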
Scale in breadth: link to related datasets
• Link on time, entities, events, topics
• Other official publications
• News
• User generated content
• (In our case) promises of political actors: election manifestos
Conclusions
• There are ample opportunities for exploiting Official Publications.
• Preprocessing and interlinking with other datasets is difficult and
does not scale well:
• High precision and recall are needed for many applications
• Many text-analysis and data-mapping tasks [MUC, TAC]
• Every format needs its own transformer
• Linked Open Data knowledge bases are not (yet) good enough:
create special purpose knowledge extractors
• High investment, but if done in a general way, high return and
impact.
Back to our research question
What is the best data format for publishing both legacy and current
parliamentary proceedings in a digital sustainable manner?
Lessons learned
• Common, open, standardized, self-describing, machine readable,
• not tied to a single application
• linked, linked, linked
• Not only shared attributes
• but more importantly, shared data values
• also store utterly obvious facts (10 years later they aren’t)
How we can help (ourselves)
Help improve input data at the source
• Push at the source (in UK: open government data; in Holland: all
parliamentary data is now in XML . . . )
• Help reduce dumb cut-and-paste annotation work, so annotators
can concentrate on tasks which are hard for machines (e.g.
text-classification).
• Emphasize importance of using shared standards.
Future researchers will love you.
Last Question
Official Publications: are they
or ?