4. Linking Laws and Plenary Protocols
Extract agenda items and participants' information from the plenary protocols of terms 12 – 16
Use GESTA as an index of laws
Link laws to plenary speeches and vice versa
1 introduction
6. We have ...
Plenary protocol PDFs from electoral terms 12 – 16
1990-12-10 – present
120,655 pages in 1,162 documents
GESTA database of laws, terms 8 – 16
... and the ambition to deliver excellent results :)
7. We want to ...
Extract data from 1990 to the present
For each plenary session
Session number, date, ...
For each item on the agenda
descriptions
list of participants
printed matter references
speech texts
tables
Link the results with our database of laws
8. Challenges
Older electoral terms are not digitized
Each electoral term requires different pattern-matching strategies
GESTA tables generated for the project
No consistent, direct links to plenary protocols
Course of legislation not recorded in detail
Quality difference between older and newer terms
OCR errors
GESTA Database – no improvements possible for older terms
10. Xtract – software for data mining
a set of modern tools to annotate plenary protocols with relevant pieces of information
preserves document layout
uses multiple strategies to mark important text blocks
location, shape and internal structure of blocks
pattern matching
Euclidean distances
statistics
comes with its own document viewer
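As an illustration of the pattern-matching strategy, here is a minimal sketch in Python (not the C# used by Xtract) that pulls printed matter references out of a line of protocol text. It relies on the Bundestag convention of numbering printed matter as term/number (e.g. 16/1234). The regular expression, the function name find_printed_matter and the sample line are assumptions made for this example, not Xtract's actual rules.

```python
import re

# Assumed shape of a printed matter reference: "<term>/<number>", e.g. "16/1234".
# A real protocol would need stricter, term-specific patterns, since this
# simple expression would also match unrelated fractions.
PRINTED_MATTER = re.compile(r"\b(\d{1,2})/(\d{1,5})\b")


def find_printed_matter(text):
    """Return a (term, number) pair for every term/number reference in text."""
    return [(int(term), int(number)) for term, number in PRINTED_MATTER.findall(text)]


# Invented sample line in the style of an agenda item.
sample = "Beratung des Gesetzentwurfs, Drucksachen 16/1234, 16/1300"
refs = find_printed_matter(sample)
```

Each pair could then serve as a lookup key into a database of laws; how Xtract actually resolves such references is not stated in the slides.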
2 software
11. Xtract – implementation details
PDF access
pdftohtml (custom builds)
Acrobat Professional 9 Extended (older terms)
Data manipulation
C# 4.0: LINQ to XML
Visualization
C# 4.0: WPF (Windows Presentation Foundation)
Statistics
CORSIS: my personal open-source project for corpus analysis
12. Xtract – why XML?
Simple and highly 'liquid' file format
based on simple international standards
excellent APIs in many programming languages
converts easily into other formats
used in Microsoft Office, OpenOffice.org
13. Xtract – XML crash course
<event>
  <speaker id="12">
    <name>Franz Müntefering</name>
    <is>Bundesminister für Arbeit und Soziales</is>
  </speaker>
</event>
elements: event, speaker, name, is
attributes: id
hierarchical relations
children: event → speaker
parents: event ← speaker
descendants: event → speaker, name, is
siblings: name ↔ is
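The relations from the crash course can be tried out in code. The sketch below uses Python's standard xml.etree.ElementTree rather than the LINQ to XML API the project uses, so it is an illustration of the concepts only. The variable names are chosen for this example.

```python
import xml.etree.ElementTree as ET

SNIPPET = """
<event>
  <speaker id="12">
    <name>Franz Müntefering</name>
    <is>Bundesminister für Arbeit und Soziales</is>
  </speaker>
</event>
"""

event = ET.fromstring(SNIPPET.strip())

# children: event -> speaker
speaker = event.find("speaker")

# attributes: id
speaker_id = speaker.get("id")

# descendants: iter() walks the whole subtree and yields event itself first
descendants = [el.tag for el in event.iter()][1:]

# siblings: the children of one parent, here name and is
siblings = [el.tag for el in speaker]

# parents: ElementTree keeps no parent pointers, so build a child -> parent map
parent_of = {child: parent for parent in event.iter() for child in parent}
```

Here parent_of[speaker] is the event element, and speaker.find("name").text gives the speaker's name.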
19. Xtract – how does it work?
extracts text from PDF files along with layout information
merges texts into proximity blocks
marks ambient constructs
marks agenda items
annotates blocks with sections they belong to
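The block-merging step can be pictured with a small sketch. The slides do not describe the actual algorithm, so the following is only one plausible reading: text fragments are grouped whenever a chain of neighbours, each within a Euclidean distance threshold of the next, connects them. The fragment format (x, y, text), the function name merge_into_blocks and the threshold are assumptions for this example.

```python
from math import hypot


def merge_into_blocks(fragments, max_dist):
    """Group (x, y, text) fragments into blocks of mutually reachable neighbours."""
    parent = list(range(len(fragments)))  # union-find forest, one node per fragment

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Join every pair of fragments that lie within max_dist of each other.
    for i, (xi, yi, _) in enumerate(fragments):
        for j in range(i + 1, len(fragments)):
            xj, yj, _ = fragments[j]
            if hypot(xi - xj, yi - yj) <= max_dist:
                parent[find(i)] = find(j)

    # Collect fragment texts per root, keeping the input order inside a block.
    blocks = {}
    for i, fragment in enumerate(fragments):
        blocks.setdefault(find(i), []).append(fragment[2])
    return list(blocks.values())


# Invented coordinates and texts: two lines close together, one far away.
fragments = [
    (10, 10, "Agenda item 3:"),
    (10, 22, "First reading of the bill"),
    (300, 10, "(Applause)"),
]
blocks = merge_into_blocks(fragments, max_dist=15)
```

The pairwise comparison is quadratic in the number of fragments, which is acceptable for a single page. A real implementation would presumably also use the shape and internal structure of blocks mentioned earlier.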
25. DIGESTA
Based on 'GESTA Gesamtausgaben': terms 14 – 16
Always up-to-date
Detailed course of legislation information
Direct links to plenary protocols
Can be complemented with keywords from MZES
http://corsis.sf.net/ipw/digesta/
3 results
Done!!
26. PLEDA – Plenary Protocols Database
Based on plenary protocols
Links agenda items multidirectionally with participants
Useful for a range of linguistic and political research purposes
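One simple way to picture multidirectional links is an index that records every link in both directions, so that lookups from either side cost the same. This is a hedged sketch, not PLEDA's actual schema. The class name LinkIndex and the identifiers used below are hypothetical.

```python
from collections import defaultdict


class LinkIndex:
    """Two-way index between agenda items and participants (illustrative only)."""

    def __init__(self):
        self.participants_by_item = defaultdict(set)
        self.items_by_participant = defaultdict(set)

    def link(self, item_id, participant_id):
        # Store the link in both directions so neither lookup needs a scan.
        self.participants_by_item[item_id].add(participant_id)
        self.items_by_participant[participant_id].add(item_id)


# Hypothetical identifiers for agenda items and participants.
index = LinkIndex()
index.link("item-1", "speaker-12")
index.link("item-1", "speaker-40")
index.link("item-2", "speaker-12")
```

After these calls, index.items_by_participant["speaker-12"] holds both agenda items, and index.participants_by_item["item-1"] holds both participants. The same structure would work for linking laws to speeches.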
27. PLEDA – Project Status
12 13 14 15 16
OCR
Run X X - - -
Correction - - -
XML Conversion * * X X X
Division C./S. X X X
Block Merging * * X X X
Ambient Constructs X X X
Page Sections X X X
Interjections * * X X X
Contents * * X
Speeches * * X
Contents-speech links * * X
28. GLIT – German Legislative Resp ...
Laws
• .law files
• from GESTA
Protocols
• .pro files
• from BTP
GLIT
• German part of ELIT
30. Open questions
Project hosting
Where can we host the results?
Initial GLIT interface
Web service?
Rich client-side app?
Any questions from your side?
4 discussion