PoliticalMashup                                         1




                      PoliticalMashup
              Open Official Documents: Requirements and
                           Opportunities

                             Maarten Marx

                       Universiteit van Amsterdam

                   Istanbul, EEOP (@LREC), 2012-05-27



                           Content

• Official Documents: zoom in on a specific official publications
  dataset.

• Opportunities: what makes official publications data valuable?

• Requirements: what is needed to make official publications data
  reusable and interoperable?



                  Our Leading Research Question




What is the best data format for publishing both legacy and current
parliamentary proceedings in a digitally sustainable manner? [Marx et
al 2010]



 W3C recommendations on Open Government Data

• make data both machine and human readable;

• link data, make data linkable, provide permanent identifiers for
  each government object and data item;

• provide metadata using common standards (e.g. Dublin Core);

• make the data as easy to reuse (e.g. in mashups) as possible.

                  Goal of this talk: make this concrete.



                  Value of a large data corpus

• Consider a 200 year corpus of temperature and humidity readings
  in one location.

• Value is not in the individual “documents”

• Value is not in the corpus as a whole.

• Value is in the relation between the “documents”.



                  Documents related by publication date




                          Google Books Ngram Viewer



          Properties of our Parliamentary Proceedings
                            Dataset



                     Longitudinal data

• weekly measurements for over 150 years

• very stable measurement procedure and data model



                  Data about human behaviour



                  Often rather boring



         But sometimes full of drama and excitement



                       Loads of measurement points

                  24,000 days, 450,000 topics, 7.5 million speeches



                  Digitally available



                    About this collection

• the available metadata is very sparse

• very rich “metadata” sits hidden inside the raw data

• Rich data model
  • Meeting (1 day)
    • Topic
      • Stage direction
      • Scene
        • Stage direction
        • Speech
          • Paragraph
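
The hierarchy above can be sketched as nested record types. This is only an illustration, not the PoliticalMashup schema: the class and field names (Speech.speaker, Topic.title, count_paragraphs) are assumptions introduced here, and the sample meeting is invented.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Union


@dataclass
class StageDirection:
    text: str


@dataclass
class Speech:
    speaker: str  # hypothetical field: name as printed in the proceedings
    paragraphs: List[str] = field(default_factory=list)


@dataclass
class Scene:
    # A scene holds speeches and stage directions in document order.
    items: List[Union[Speech, StageDirection]] = field(default_factory=list)


@dataclass
class Topic:
    title: str
    # A topic holds scenes and stage directions in document order.
    items: List[Union[Scene, StageDirection]] = field(default_factory=list)


@dataclass
class Meeting:
    day: date  # one meeting corresponds to one day
    topics: List[Topic] = field(default_factory=list)


def count_paragraphs(meeting: Meeting) -> int:
    """Count all spoken paragraphs by walking Meeting > Topic > Scene > Speech."""
    total = 0
    for topic in meeting.topics:
        for item in topic.items:
            if isinstance(item, Scene):
                for sub in item.items:
                    if isinstance(sub, Speech):
                        total += len(sub.paragraphs)
    return total


# Invented sample data, for illustration only.
meeting = Meeting(
    day=date(2012, 5, 27),
    topics=[
        Topic(
            title="Budget",
            items=[
                StageDirection("The chair opens the debate."),
                Scene(items=[
                    Speech("A", ["First paragraph.", "Second paragraph."]),
                    StageDirection("Applause."),
                    Speech("B", ["A reply."]),
                ]),
            ],
        )
    ],
)
```

Keeping stage directions as their own type, rather than as free text inside speeches, is what lets later applications tell what was said apart from what happened in the room.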



                  Very rich metadata for each word

For every word spoken in parliament, the following facts are known
at the time of the speech act, and can often be extracted from the
written proceedings:
1)   when it was said,
2)   who said it,
3)   in what function,
4)   speaking on behalf of which party,
5)   in which context, and
6)   who was actively present during the speech act.



  How to exploit the extra metadata and structure?

• Let’s consider a simple killer app . . .



                   Political n-gram viewer
• From every word we know both the date and the speaker.

• Every speaker belongs to a political party.

• 3D n-gram viewer: political spectrum vs time vs word-count

• Use: topic ownership, agenda setting, framing



                  Political n-gram viewer: requirements

Documents:
   1. metadata: date of the meeting
   2. document structure: for every spoken word, who said it.

Linked Data: Speakers' names are disambiguated, normalized and
   mapped to a database with temporal party information.

Completeness and correctness: Little data is missing or wrong, also
  for the distant past.
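
Once these requirements are met, the counting behind such a viewer is simple. The sketch below assumes each speech has already been reduced to a (date, party, text) tuple, that is, the speaker has been resolved to a party elsewhere; the function name count_term, the party labels and the sample speeches are all invented for illustration.

```python
from collections import Counter
from datetime import date


def count_term(speeches, term):
    """Return a Counter mapping (party, year) to the number of times term occurs.

    speeches is an iterable of (day, party, text) tuples. Matching is a
    naive whitespace tokenisation with lowercasing; a real system would
    need proper tokenisation and n-grams longer than one word.
    """
    needle = term.lower()
    counts = Counter()
    for day, party, text in speeches:
        hits = text.lower().split().count(needle)
        if hits:
            counts[(party, day.year)] += hits
    return counts


# Invented sample data, for illustration only.
speeches = [
    (date(2003, 3, 1), "PartyA", "Security first and security always"),
    (date(2006, 3, 1), "PartyB", "More security"),
    (date(2006, 3, 2), "PartyC", "Education matters"),
]
counts = count_term(speeches, "security")
```

The two keys of the counter are two axes of the viewer (political spectrum and time); the count is the third.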



                  Is Linked (Open) Data the solution?

• Link each speaker's name to a Wikipedia/DBpedia page (named entity
  disambiguation and resolution). See also Google Knowledge
  Graph, and [Spitkovsky, Chang, LREC 2012].

• DBpedia extracts link between person and party affiliation from
  Wikipedia infobox

• Timestamped triple:

                     Geert Wilders is party member of VVD
                       from 1998-08-25 until 2004-09-02
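
A plain subject-predicate-object triple has no slot for the validity interval, so the fact above needs a richer record. The sketch below is one possible representation, not DBpedia's or PoliticalMashup's: the Membership type and the party_on helper are hypothetical, and treating both end dates as inclusive is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Membership:
    person: str
    party: str
    start: date
    end: Optional[date]  # None would mean the membership is still running


def party_on(memberships: List[Membership], person: str, day: date) -> Optional[str]:
    """Return the party of person on the given day, or None if unknown."""
    for m in memberships:
        # Assumption: start and end dates are both inclusive.
        if m.person == person and m.start <= day and (m.end is None or day <= m.end):
            return m.party
    return None


# The single fact shown on the slide.
memberships = [
    Membership("Geert Wilders", "VVD", date(1998, 8, 25), date(2004, 9, 2)),
]
```

With this record, a speech dated 2001 resolves to VVD, while a speech dated 2005 resolves to nothing until a further membership record is added.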



                     DBpedia not yet reliable

• Data extraction is difficult, even from the infobox, even from
  complete data:
        Wikipedia page of Geert Wilders
        DBpedia information about Geert Wilders
        Notice the values of the party and the office attributes
        Timestamped facts are difficult to extract and difficult to
        represent in RDF triples.



       Lesson learned: requirement on metadata and
                          relations

• One cannot rely on Linked Open Data for good quality metadata.

• Official documents should be self-describing, also for facts which
  are obvious at publication time.

• Compare speaker’s data in original (OCRed) data and XMLified
  and enriched version:
  • Original
  • Part of it in XML
  • And now for human consumption



                  A few more applications



                  Entity Profiling and Entity Search

• Users search for entities, not for documents. [TREC Entity Track]
  [Balog et al 2009].

• Main research questions
        How to collect information on entities,
        how to model an entity,
        how to rank entities.

• (Parsimonious) language models work well as models. [Balog et
  al, 2009][Hiemstra et al, 2004]

• Entity profiling: http://www.politiekinzicht.com

• Entity search: http://ikkieswijzer.nl



                  Content and structure search

• Usual advanced search combines keyword search with metadata
  search.

• Extra fields are just extra filters on the returned documents.

• With structured documents we can do search on content and
  structure.

• Most useful task: rank best entry points in large documents.

• Compare two search systems on the same data:
        on flat text
        on an XML representation
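
The difference can be illustrated with a toy document. The element and attribute names below (meeting, topic, speech, party, p) are invented for this sketch and are not the actual PoliticalMashup markup; the query uses the limited XPath subset of Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Invented sample document, for illustration only.
XML = """
<meeting date="2012-05-27">
  <topic title="Budget">
    <speech speaker="A" party="PartyA"><p>We must cut taxes.</p></speech>
    <speech speaker="B" party="PartyB"><p>Taxes pay for schools.</p><p>Cut nothing.</p></speech>
  </topic>
  <topic title="Health">
    <speech speaker="A" party="PartyA"><p>Hospitals need funds.</p></speech>
  </topic>
</meeting>
""".strip()


def find_speeches(root, party, keyword):
    """Return (topic title, speaker) for speeches by party that contain keyword."""
    hits = []
    for topic in root.findall("topic"):
        # Structure: restrict to speeches of one party via an attribute predicate.
        for speech in topic.findall(f"speech[@party='{party}']"):
            # Content: keyword match inside the paragraphs of that speech only.
            text = " ".join(p.text or "" for p in speech.findall("p"))
            if keyword.lower() in text.lower():
                hits.append((topic.get("title"), speech.get("speaker")))
    return hits


root = ET.fromstring(XML)
hits = find_speeches(root, "PartyB", "cut")

# Flat-text search over the same document: finds the word, but not who said it.
flat_text = " ".join(root.itertext())
found_in_flat_text = "cut" in flat_text.lower()
```

The structured query returns an entry point (topic and speaker) inside the document, whereas the flat-text search can only say that the whole meeting matches.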



              Lesson learned: requirement on structure

• Make semantically important structure of documents explicit in
  XML markup.

• Publish for machine readability

• Publish generic data, not data prepared for one use-case.



           Application of structure: Interruption graph
                          (Attackogram)

• MP A interrupts B ⇐⇒ A speaks during the block of B.




combined with entity profiling:
http://debat.politiekinzicht.com/
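
The definition above translates almost directly into code once the block structure is explicit. This sketch assumes each block is given as (owner, list of speakers in order) and that the chair is excluded from the graph; both the input shape and the exclusion are assumptions made here, and the sample blocks are invented.

```python
from collections import Counter


def interruption_graph(blocks, ignore=("chair",)):
    """Return a Counter mapping (interrupter, interrupted) to a count.

    blocks is an iterable of (owner, speakers) pairs, where owner is the
    MP whose block it is and speakers lists everyone who spoke during it,
    in order. Every turn by someone other than the owner counts as one
    interruption of the owner.
    """
    edges = Counter()
    for owner, speakers in blocks:
        for speaker in speakers:
            if speaker != owner and speaker not in ignore:
                edges[(speaker, owner)] += 1
    return edges


# Invented sample data, for illustration only.
blocks = [
    ("B", ["B", "A", "B", "A", "B"]),
    ("A", ["A", "chair", "C", "A"]),
]
edges = interruption_graph(blocks)
```

The resulting counter is a weighted directed graph: each key is an edge from the interrupter to the interrupted MP, and the value is the edge weight.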



            Exploring and exploiting official documents

• We saw what can be done with one well-curated collection.

• What are the key infrastructural and research questions?

        In what direction and how to scale this up?
   1. in time
   2. in breadth
   3. in links



                    Scale diachronically

• Stable data model and measurement procedure make this data
  very valuable for diachronic comparisons.

• towards the past
  • OCR
  • consistency in structure
  • more missing data to link to

• towards the future
  • remain up to date
  • legacy decisions



          Scale in breadth, e.g., parliamentary proceedings of all
                       European countries
• All describe the same “script”, so all fit in one schema.

• Main question: how to connect the data from different countries?
  • Common structure and annotation: use the same Relax NG schema.
  • Common values on certain attributes:
    • Entities: normalize to Wikipedia concepts
    • Controlled vocabulary keywords: normalize to Eurovoc
    • Language: machine translate to English
    • Events: normalize to EMM NewsExplorer query / Wikinews query



              Scale in breadth: link to related datasets

• Link on time, entities, events, topics

• Other official publications

• News

• User generated content

• (In our case) promises of political actors: election manifestos



                          Conclusions

• There are ample opportunities for exploiting Official Publications.

• Preprocessing and interlinking with other datasets is difficult and
  does not scale well:
  • High precision and recall are needed for many applications
  • Many text analysis and data-mapping tasks [MUC, TAC]
  • Every format needs its own transformer
  • Linked Open Data knowledge bases are not (yet) good enough:
    create special-purpose knowledge extractors

• High investment, but if done in a general way, high return and
  impact.



                  Back to our research question
What is the best data format for publishing both legacy and current
   parliamentary proceedings in a digitally sustainable manner?

Lessons learned

• Common, open, standardized, self-describing, machine readable,

• not tied to a single application

• linked, linked, linked
  • Not only shared attributes
  • but more importantly, shared data values

• also store utterly obvious facts (10 years later they aren’t)



                  How we can help (ourselves)

                  Help improve input data at the source

• Push at the source (in UK: open government data; in Holland: all
  parliamentary data is now in XML . . . )

• Help reduce dumb cut-and-paste annotation work, so annotators
  can concentrate on tasks which are hard for machines (e.g.
  text-classification).

• Emphasize importance of using shared standards.

                     Future researchers will love you.



                       Last Question

                  Official Publications: are they [image] or [image]?

Weitere ähnliche Inhalte

Was ist angesagt?

Enterprise knowledge graphs
Enterprise knowledge graphsEnterprise knowledge graphs
Enterprise knowledge graphsSören Auer
 
Design for Findability: metadata, metrics and collaboration on LOC.gov
Design for Findability: metadata, metrics and collaboration on LOC.govDesign for Findability: metadata, metrics and collaboration on LOC.gov
Design for Findability: metadata, metrics and collaboration on LOC.govUXPA International
 
Reasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven Recipes
Reasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven RecipesReasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven Recipes
Reasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven RecipesOntotext
 
Design for Findability at the Library of Congress
Design for Findability at the Library of CongressDesign for Findability at the Library of Congress
Design for Findability at the Library of CongressJill MacNeice
 
Analytics on Big Knowledge Graphs Deliver Entity Awareness and Help Data Linking
Analytics on Big Knowledge Graphs Deliver Entity Awareness and Help Data LinkingAnalytics on Big Knowledge Graphs Deliver Entity Awareness and Help Data Linking
Analytics on Big Knowledge Graphs Deliver Entity Awareness and Help Data LinkingOntotext
 
Industry@RuleML2015 DataGraft
Industry@RuleML2015 DataGraftIndustry@RuleML2015 DataGraft
Industry@RuleML2015 DataGraftRuleML
 
Web Data Extraction: A Crash Course
Web Data Extraction: A Crash CourseWeb Data Extraction: A Crash Course
Web Data Extraction: A Crash CourseGiorgio Orsi
 
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sfSparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sfHarsh Thakkar
 
The Bounties of Semantic Data Integration for the Enterprise
The Bounties of Semantic Data Integration for the Enterprise The Bounties of Semantic Data Integration for the Enterprise
The Bounties of Semantic Data Integration for the Enterprise Ontotext
 
Analytics and Access to the UK web archive
Analytics and Access to the UK web archiveAnalytics and Access to the UK web archive
Analytics and Access to the UK web archiveLewis Crawford
 
Farirhair.ai: AI platform to mine competitive intelligence from billions of u...
Farirhair.ai: AI platform to mine competitive intelligence from billions of u...Farirhair.ai: AI platform to mine competitive intelligence from billions of u...
Farirhair.ai: AI platform to mine competitive intelligence from billions of u...Aditya Jami
 
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018Ontotext
 
Fairhair.ai – alan turing institute june '17 (public)
Fairhair.ai – alan turing institute june '17 (public)Fairhair.ai – alan turing institute june '17 (public)
Fairhair.ai – alan turing institute june '17 (public)Giorgio Orsi
 
SSSW2015 Data Workflow Tutorial
SSSW2015 Data Workflow TutorialSSSW2015 Data Workflow Tutorial
SSSW2015 Data Workflow TutorialSSSW
 
Discovering Related Data Sources in Data Portals
Discovering Related Data Sources in Data PortalsDiscovering Related Data Sources in Data Portals
Discovering Related Data Sources in Data PortalsPeter Haase
 
Building Knowledge Graphs in 10 steps
Building Knowledge Graphs in 10 stepsBuilding Knowledge Graphs in 10 steps
Building Knowledge Graphs in 10 stepsOntotext
 
Sören Auer | Enterprise Knowledge Graphs
Sören Auer | Enterprise Knowledge GraphsSören Auer | Enterprise Knowledge Graphs
Sören Auer | Enterprise Knowledge Graphssemanticsconference
 

Was ist angesagt? (20)

Enterprise knowledge graphs
Enterprise knowledge graphsEnterprise knowledge graphs
Enterprise knowledge graphs
 
Design for Findability: metadata, metrics and collaboration on LOC.gov
Design for Findability: metadata, metrics and collaboration on LOC.govDesign for Findability: metadata, metrics and collaboration on LOC.gov
Design for Findability: metadata, metrics and collaboration on LOC.gov
 
Reasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven Recipes
Reasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven RecipesReasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven Recipes
Reasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven Recipes
 
Design for Findability at the Library of Congress
Design for Findability at the Library of CongressDesign for Findability at the Library of Congress
Design for Findability at the Library of Congress
 
Analytics on Big Knowledge Graphs Deliver Entity Awareness and Help Data Linking
Analytics on Big Knowledge Graphs Deliver Entity Awareness and Help Data LinkingAnalytics on Big Knowledge Graphs Deliver Entity Awareness and Help Data Linking
Analytics on Big Knowledge Graphs Deliver Entity Awareness and Help Data Linking
 
Industry@RuleML2015 DataGraft
Industry@RuleML2015 DataGraftIndustry@RuleML2015 DataGraft
Industry@RuleML2015 DataGraft
 
Web Data Extraction: A Crash Course
Web Data Extraction: A Crash CourseWeb Data Extraction: A Crash Course
Web Data Extraction: A Crash Course
 
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sfSparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
 
The Bounties of Semantic Data Integration for the Enterprise
The Bounties of Semantic Data Integration for the Enterprise The Bounties of Semantic Data Integration for the Enterprise
The Bounties of Semantic Data Integration for the Enterprise
 
Analytics and Access to the UK web archive
Analytics and Access to the UK web archiveAnalytics and Access to the UK web archive
Analytics and Access to the UK web archive
 
Farirhair.ai: AI platform to mine competitive intelligence from billions of u...
Farirhair.ai: AI platform to mine competitive intelligence from billions of u...Farirhair.ai: AI platform to mine competitive intelligence from billions of u...
Farirhair.ai: AI platform to mine competitive intelligence from billions of u...
 
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
 
Fairhair.ai – alan turing institute june '17 (public)
Fairhair.ai – alan turing institute june '17 (public)Fairhair.ai – alan turing institute june '17 (public)
Fairhair.ai – alan turing institute june '17 (public)
 
Big Data - Gerami
Big Data - GeramiBig Data - Gerami
Big Data - Gerami
 
SSSW2015 Data Workflow Tutorial
SSSW2015 Data Workflow TutorialSSSW2015 Data Workflow Tutorial
SSSW2015 Data Workflow Tutorial
 
Discovering Related Data Sources in Data Portals
Discovering Related Data Sources in Data PortalsDiscovering Related Data Sources in Data Portals
Discovering Related Data Sources in Data Portals
 
Semantic Web in the Digital Humanities
Semantic Web in the Digital HumanitiesSemantic Web in the Digital Humanities
Semantic Web in the Digital Humanities
 
Building Knowledge Graphs in 10 steps
Building Knowledge Graphs in 10 stepsBuilding Knowledge Graphs in 10 steps
Building Knowledge Graphs in 10 steps
 
Big data
Big dataBig data
Big data
 
Sören Auer | Enterprise Knowledge Graphs
Sören Auer | Enterprise Knowledge GraphsSören Auer | Enterprise Knowledge Graphs
Sören Auer | Enterprise Knowledge Graphs
 

Ähnlich wie Keynote Exploring and Exploiting Official Publications

Groningen nl pgroep
Groningen nl pgroepGroningen nl pgroep
Groningen nl pgroepmaartenmarx
 
Exploration, visualization and querying of linked open data sources
Exploration, visualization and querying of linked open data sourcesExploration, visualization and querying of linked open data sources
Exploration, visualization and querying of linked open data sourcesLaura Po
 
Linked Open Data in Romania
Linked Open Data in RomaniaLinked Open Data in Romania
Linked Open Data in RomaniaVlad Posea
 
Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Roi Blanco
 
Applied Artificial Intelligence Unit 5 Semester 3 MSc IT Part 2 Mumbai Univer...
Applied Artificial Intelligence Unit 5 Semester 3 MSc IT Part 2 Mumbai Univer...Applied Artificial Intelligence Unit 5 Semester 3 MSc IT Part 2 Mumbai Univer...
Applied Artificial Intelligence Unit 5 Semester 3 MSc IT Part 2 Mumbai Univer...Madhav Mishra
 
Building the PoliMedia search system; data- and user-driven
Building the PoliMedia search system; data- and user-drivenBuilding the PoliMedia search system; data- and user-driven
Building the PoliMedia search system; data- and user-drivenMaxKemman
 
From Open Access to Open Data
From Open Access to Open DataFrom Open Access to Open Data
From Open Access to Open DataBrian Hole
 
“The HE Landscape – Making It Happen (Painlessly)” - Andy Youell, Director of...
“The HE Landscape – Making It Happen (Painlessly)” - Andy Youell, Director of...“The HE Landscape – Making It Happen (Painlessly)” - Andy Youell, Director of...
“The HE Landscape – Making It Happen (Painlessly)” - Andy Youell, Director of...Academic Registrars Council
 
Using DBpedia for Thesaurus Management and Linked Open Data Integration
Using DBpedia for Thesaurus Management and Linked Open Data IntegrationUsing DBpedia for Thesaurus Management and Linked Open Data Integration
Using DBpedia for Thesaurus Management and Linked Open Data IntegrationMartin Kaltenböck
 
Lecture 5: Mining, Analysis and Visualisation
Lecture 5: Mining, Analysis and VisualisationLecture 5: Mining, Analysis and Visualisation
Lecture 5: Mining, Analysis and VisualisationMarieke van Erp
 
Linked Open Government Data: What’s Next?
Linked Open Government Data:  What’s Next?Linked Open Government Data:  What’s Next?
Linked Open Government Data: What’s Next?Li Ding
 
ESWC 2017 Tutorial Knowledge Graphs
ESWC 2017 Tutorial Knowledge GraphsESWC 2017 Tutorial Knowledge Graphs
ESWC 2017 Tutorial Knowledge GraphsPeter Haase
 
Big Data Europe: SC6 Workshop 3: The European Research Data Landscape: Opport...
Big Data Europe: SC6 Workshop 3: The European Research Data Landscape: Opport...Big Data Europe: SC6 Workshop 3: The European Research Data Landscape: Opport...
Big Data Europe: SC6 Workshop 3: The European Research Data Landscape: Opport...BigData_Europe
 
An Automated Snowball Census of the Political Web - JITP 2011
An Automated Snowball Census of the Political Web - JITP 2011An Automated Snowball Census of the Political Web - JITP 2011
An Automated Snowball Census of the Political Web - JITP 2011Abe Gong
 
Managing 'Big Data' in the social sciences: the contribution of an analytico-...
Managing 'Big Data' in the social sciences: the contribution of an analytico-...Managing 'Big Data' in the social sciences: the contribution of an analytico-...
Managing 'Big Data' in the social sciences: the contribution of an analytico-...CILIP MDG
 
Data.gov Overview, August 2012
Data.gov Overview, August 2012Data.gov Overview, August 2012
Data.gov Overview, August 2012Jeanne Holm
 

Ähnlich wie Keynote Exploring and Exploiting Official Publications (20)

Groningen nl pgroep
Groningen nl pgroepGroningen nl pgroep
Groningen nl pgroep
 
Exploration, visualization and querying of linked open data sources
Exploration, visualization and querying of linked open data sourcesExploration, visualization and querying of linked open data sources
Exploration, visualization and querying of linked open data sources
 
Text Mining : Experience
Text Mining : ExperienceText Mining : Experience
Text Mining : Experience
 
Linked Open Data in Romania
Linked Open Data in RomaniaLinked Open Data in Romania
Linked Open Data in Romania
 
ONS Local presents: Explore Subnational Statistics
ONS Local presents: Explore Subnational StatisticsONS Local presents: Explore Subnational Statistics
ONS Local presents: Explore Subnational Statistics
 
Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Mining Web content for Enhanced Search
Mining Web content for Enhanced Search
 
Applied Artificial Intelligence Unit 5 Semester 3 MSc IT Part 2 Mumbai Univer...
Applied Artificial Intelligence Unit 5 Semester 3 MSc IT Part 2 Mumbai Univer...Applied Artificial Intelligence Unit 5 Semester 3 MSc IT Part 2 Mumbai Univer...
Applied Artificial Intelligence Unit 5 Semester 3 MSc IT Part 2 Mumbai Univer...
 
Building the PoliMedia search system; data- and user-driven
Building the PoliMedia search system; data- and user-drivenBuilding the PoliMedia search system; data- and user-driven
Building the PoliMedia search system; data- and user-driven
 
From Open Access to Open Data
From Open Access to Open DataFrom Open Access to Open Data
From Open Access to Open Data
 
“The HE Landscape – Making It Happen (Painlessly)” - Andy Youell, Director of...
“The HE Landscape – Making It Happen (Painlessly)” - Andy Youell, Director of...“The HE Landscape – Making It Happen (Painlessly)” - Andy Youell, Director of...
“The HE Landscape – Making It Happen (Painlessly)” - Andy Youell, Director of...
 
Lecture4 Social Web
Lecture4 Social Web Lecture4 Social Web
Lecture4 Social Web
 
Using DBpedia for Thesaurus Management and Linked Open Data Integration
Using DBpedia for Thesaurus Management and Linked Open Data IntegrationUsing DBpedia for Thesaurus Management and Linked Open Data Integration
Using DBpedia for Thesaurus Management and Linked Open Data Integration
 
Lecture 5: Mining, Analysis and Visualisation
Lecture 5: Mining, Analysis and VisualisationLecture 5: Mining, Analysis and Visualisation
Lecture 5: Mining, Analysis and Visualisation
 
Linked Open Government Data: What’s Next?
Linked Open Government Data:  What’s Next?Linked Open Government Data:  What’s Next?
Linked Open Government Data: What’s Next?
 
ESWC 2017 Tutorial Knowledge Graphs
ESWC 2017 Tutorial Knowledge GraphsESWC 2017 Tutorial Knowledge Graphs
ESWC 2017 Tutorial Knowledge Graphs
 
Big Data & DS Analytics for PAARL
Big Data & DS Analytics for PAARLBig Data & DS Analytics for PAARL
Big Data & DS Analytics for PAARL
 
Big Data Europe: SC6 Workshop 3: The European Research Data Landscape: Opport...
Big Data Europe: SC6 Workshop 3: The European Research Data Landscape: Opport...Big Data Europe: SC6 Workshop 3: The European Research Data Landscape: Opport...
Big Data Europe: SC6 Workshop 3: The European Research Data Landscape: Opport...
 
An Automated Snowball Census of the Political Web - JITP 2011
An Automated Snowball Census of the Political Web - JITP 2011An Automated Snowball Census of the Political Web - JITP 2011
An Automated Snowball Census of the Political Web - JITP 2011
 
Managing 'Big Data' in the social sciences: the contribution of an analytico-...
Managing 'Big Data' in the social sciences: the contribution of an analytico-...Managing 'Big Data' in the social sciences: the contribution of an analytico-...
Managing 'Big Data' in the social sciences: the contribution of an analytico-...
 
Data.gov Overview, August 2012
Data.gov Overview, August 2012Data.gov Overview, August 2012
Data.gov Overview, August 2012
 

Mehr von maartenmarx

Ilja state2014expressivity
Ilja state2014expressivityIlja state2014expressivity
Ilja state2014expressivitymaartenmarx
 
Haagse Hogeschool 2012-09-13
Haagse Hogeschool 2012-09-13Haagse Hogeschool 2012-09-13
Haagse Hogeschool 2012-09-13maartenmarx
 
Expertmeeting, E-humanities en politieke geschiedenis, Nijmegen, 2013-09-13
Expertmeeting, E-humanities en politieke geschiedenis, Nijmegen, 2013-09-13Expertmeeting, E-humanities en politieke geschiedenis, Nijmegen, 2013-09-13
Expertmeeting, E-humanities en politieke geschiedenis, Nijmegen, 2013-09-13maartenmarx
 
Economie van de aandacht
  Economie van de aandacht  Economie van de aandacht
Economie van de aandachtmaartenmarx
 
Dans dataprijs2012
Dans dataprijs2012Dans dataprijs2012
Dans dataprijs2012maartenmarx
 
College sicco van-sas-2012_10_08
College sicco van-sas-2012_10_08College sicco van-sas-2012_10_08
College sicco van-sas-2012_10_08maartenmarx
 
Presentation at NLDB 2012
Presentation at NLDB 2012Presentation at NLDB 2012
Presentation at NLDB 2012maartenmarx
 
Women in Dutch parliament: what they did
Women in Dutch parliament: what they didWomen in Dutch parliament: what they did
Women in Dutch parliament: what they didmaartenmarx
 
Namescape 2012 03 06
Namescape 2012 03 06Namescape 2012 03 06
Namescape 2012 03 06maartenmarx
 
voting advice slides
 voting advice slides voting advice slides
voting advice slidesmaartenmarx
 
TV-slant presentatie_politicologen_etmaal
TV-slant presentatie_politicologen_etmaalTV-slant presentatie_politicologen_etmaal
TV-slant presentatie_politicologen_etmaalmaartenmarx
 
networks inparliament-ccct
 networks inparliament-ccct networks inparliament-ccct
networks inparliament-ccctmaartenmarx
 
Screen biographischportaal2010 12-10
Screen biographischportaal2010 12-10Screen biographischportaal2010 12-10
Screen biographischportaal2010 12-10maartenmarx
 

Mehr von maartenmarx (13)

Ilja state2014expressivity
Ilja state2014expressivityIlja state2014expressivity
Ilja state2014expressivity
 
Haagse Hogeschool 2012-09-13
Haagse Hogeschool 2012-09-13Haagse Hogeschool 2012-09-13
Haagse Hogeschool 2012-09-13
 
Expertmeeting, E-humanities en politieke geschiedenis, Nijmegen, 2013-09-13
Expertmeeting, E-humanities en politieke geschiedenis, Nijmegen, 2013-09-13Expertmeeting, E-humanities en politieke geschiedenis, Nijmegen, 2013-09-13
Expertmeeting, E-humanities en politieke geschiedenis, Nijmegen, 2013-09-13
 
Economie van de aandacht
  Economie van de aandacht  Economie van de aandacht
Economie van de aandacht
 
Dans dataprijs2012
Dans dataprijs2012Dans dataprijs2012
Dans dataprijs2012
 
College sicco van-sas-2012_10_08
College sicco van-sas-2012_10_08College sicco van-sas-2012_10_08
College sicco van-sas-2012_10_08
 
Presentation at NLDB 2012
Presentation at NLDB 2012Presentation at NLDB 2012
Presentation at NLDB 2012
 
Women in Dutch parliament: what they did
Women in Dutch parliament: what they didWomen in Dutch parliament: what they did
Women in Dutch parliament: what they did
 
Namescape 2012 03 06
Namescape 2012 03 06Namescape 2012 03 06
Namescape 2012 03 06
 
voting advice slides
 voting advice slides voting advice slides
voting advice slides
 
TV-slant presentatie_politicologen_etmaal
TV-slant presentatie_politicologen_etmaalTV-slant presentatie_politicologen_etmaal
TV-slant presentatie_politicologen_etmaal
 
networks inparliament-ccct
 networks inparliament-ccct networks inparliament-ccct
networks inparliament-ccct
 
Screen biographischportaal2010 12-10
Screen biographischportaal2010 12-10Screen biographischportaal2010 12-10
Screen biographischportaal2010 12-10
 

Keynote Exploring and Exploiting Official Publications

  • 1. PoliticalMashup 1 PoliticalMashup Open Official Documents: Requirements and Opportunities Maarten Marx Universiteit van Amsterdam Istanbul, EEOP (@LREC), 2012-05-27
  • 2. Content • Official Documents: zoom in on a specific official publications dataset • Opportunities: what makes official publications data valuable? • Requirements: what is needed to make official publications data reusable and interoperable?
  • 3. Our Leading Research Question: What is the best data format for publishing both legacy and current parliamentary proceedings in a digitally sustainable manner? [Marx et al. 2010]
  • 4. W3C recommendations on Open Government Data • make data both machine and human readable; • link data, make data linkable, provide permanent identifiers for each government object and data item; • provide metadata using common standards (e.g. Dublin Core); • make the data as easy to reuse (e.g. in mashups) as possible. Goal of this talk: make this concrete.
  • 5. Value of a large data corpus • Consider a 200-year corpus of temperature and humidity readings in one location. • Value is not in the individual “documents”. • Value is not in the corpus as a whole. • Value is in the relation between the “documents”.
  • 6. Documents related by publication date (Google Books Ngram Viewer)
  • 7. Properties of our Parliamentary Proceedings Dataset
  • 8. Longitudinal data • weekly measurements for over 150 years • very stable measurement procedure and data model
  • 9. Data about human behaviour
  • 10. Often rather boring
  • 11. But sometimes full of drama and excitement
  • 12. Loads of measurement points: 24,000 days, 450,000 topics, 7.5 million speeches
  • 13. Digitally available
  • 14. About this collection • available metadata is very sparse • very rich “metadata” sits hidden inside the raw data • Rich data model • Meeting (1 day) • Topic • Stage direction • Scene • Stage direction • Speech • Paragraph
  • 15. Very rich metadata for each word. For every word spoken in parliament, the following facts are known at the time of the speech act, and can often be extracted from the written proceedings: 1) when it was said, 2) who said it, 3) in what function, 4) speaking on behalf of which party, 5) in which context, and 6) who was actively present during the speech act.
  • 16. How to exploit the extra metadata and structure? • Let’s consider a simple killer app . . .
  • 17. Political n-gram viewer • For every word we know both the date and the speaker. • Every speaker belongs to a political party. • 3D n-gram viewer: political spectrum vs time vs word count • Use: topic ownership, agenda setting, framing
  • 18. Political n-gram viewer: requirements. Documents: 1. metadata: date of the meeting 2. document structure: for every spoken word, who said it. Linked Data: speakers’ names are disambiguated, normalized and mapped to a database with temporal party information. Completeness and correctness: few missing or wrong data, also for material from long ago.
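
The counting behind such a viewer can be sketched roughly as follows. This is a minimal illustration, not the PoliticalMashup implementation: the records, party names and the `ngram_counts` helper are all hypothetical, and it assumes each speech has already been resolved to the speaker's party at the time of the meeting.

```python
from collections import Counter
from datetime import date

# Hypothetical records: (meeting date, party of the speaker on that date, text).
# The party is assumed to be resolved already.
speeches = [
    (date(2003, 5, 1), "PartyA", "the budget debate on the budget"),
    (date(2003, 6, 1), "PartyB", "climate policy and the budget"),
    (date(2005, 3, 1), "PartyB", "climate climate climate"),
]

def ngram_counts(records, n=1):
    """Count n-grams per (party, year, n-gram): the three axes of the viewer."""
    counts = Counter()
    for day, party, text in records:
        tokens = text.lower().split()  # naive whitespace tokenisation
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            counts[(party, day.year, ngram)] += 1
    return counts

unigrams = ngram_counts(speeches)
bigrams = ngram_counts(speeches, n=2)
```

Looking up `unigrams[("PartyA", 2003, "budget")]` then gives one cell of the party vs time vs word-count cube; a real viewer would normalise by the total number of words per party and year.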
  • 19. Is Linked (Open) Data the solution? • Link the speaker’s name to a Wikipedia/DBpedia page (named entity disambiguation and resolution). See also Google Knowledge Graph, and [Spitkovsky, Chang, LREC 2012]. • DBpedia extracts the link between person and party affiliation from the Wikipedia infobox. • Timestamped triple: Geert Wilders is party member of VVD from 1998-08-25 until 2004-09-02
  • 20. DBpedia not yet reliable • Data extraction is difficult, even from the infobox, even from complete data: Wikipedia page of Geert Wilders; DBpedia information about Geert Wilders. Notice the values of the party and the office attributes. Timestamped facts are difficult to extract and difficult to represent in RDF triples.
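
One way to see why a plain subject-predicate-object triple falls short: the membership fact needs a validity interval attached to it. The sketch below stores the fact from the slide as a record with start and end dates and answers "which party on this day?". The `Membership` class and `party_of` function are hypothetical names, and treating the end date as inclusive is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class Membership:
    """A party-membership fact with a validity interval (hypothetical model)."""
    person: str
    party: str
    start: date
    end: Optional[date]  # None means the membership is still open-ended

    def holds_on(self, day: date) -> bool:
        # Assumption: both interval endpoints are inclusive.
        return self.start <= day and (self.end is None or day <= self.end)

# The timestamped fact given on the slide.
facts = [
    Membership("Geert Wilders", "VVD", date(1998, 8, 25), date(2004, 9, 2)),
]

def party_of(person: str, day: date, facts) -> Optional[str]:
    """Return the party the person belonged to on the given day, if known."""
    for fact in facts:
        if fact.person == person and fact.holds_on(day):
            return fact.party
    return None
```

A lookup for a date outside every known interval returns `None`, which makes gaps in the knowledge base visible instead of silently attributing a speech to the wrong party.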
  • 21. Lesson learned: requirement on metadata and relations • One cannot rely on Linked Open Data for good quality metadata. • Official documents should be self-describing, also for facts which are obvious at publication time. • Compare speaker’s data in original (OCRed) data and XMLified and enriched version: • Original • Part of it in XML • And now for human consumption
  • 22. A few more applications
  • 23. Entity Profiling and Entity Search • Users search for entities, not for documents. [TREC Entity Track] [Balog et al. 2009]. • Main research questions: how to collect information on entities, how to model an entity, how to rank entities. • (Parsimonious) language models work well as models. [Balog et al. 2009] [Hiemstra et al. 2004] • Entity profiling: http://www.politiekinzicht.com • Entity search: http://ikkieswijzer.nl
  • 24. Content and structure search • Usual advanced search combines keyword search with metadata search. • Extra fields are just extra filters on the returned documents. • With structured documents we can search on content and structure. • Most useful task: rank best entry points in large documents. • Compare two search systems on the same data: one on flat text, one on an XML representation
  • 25. Lesson learned: requirement on structure • Make semantically important structure of documents explicit in XML markup. • Publish for machine readability. • Publish generic data, not data prepared for one use case.
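
To make the structure requirement concrete, here is a small sketch of a content-and-structure query over explicitly marked-up proceedings. The element and attribute names (`meeting`, `topic`, `scene`, `speech`, `p`, `party`) and the nesting are assumptions loosely modelled on the data model listed earlier; they are not the actual PoliticalMashup schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the nesting and all names are illustrative assumptions.
DOC = """<meeting date="2012-05-27">
  <topic title="Budget">
    <stage-direction>The chair opens the debate.</stage-direction>
    <scene speaker="A">
      <speech speaker="A" party="PartyA"><p>We must cut the budget.</p></speech>
      <speech speaker="B" party="PartyB"><p>That is nonsense.</p></speech>
      <speech speaker="A" party="PartyA"><p>Let me finish.</p><p>The budget is too large.</p></speech>
    </scene>
  </topic>
</meeting>"""

root = ET.fromstring(DOC)

def paragraphs_by(root, party, word):
    """Return paragraphs spoken by the given party that mention the word."""
    hits = []
    for speech in root.iter("speech"):
        if speech.get("party") != party:
            continue
        for p in speech.iter("p"):
            text = p.text or ""
            if word.lower() in text.lower():  # simple substring match
                hits.append(text)
    return hits

budget_paragraphs = paragraphs_by(root, "PartyA", "budget")
```

Because the speaker and party sit in the markup, the query returns individual paragraphs as entry points rather than whole documents, which is not possible on flat text without re-parsing it.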
  • 26. Application of structure: Interruption graph (Attackogram) • MP A interrupts B ⇔ A speaks during the block of B. Combined with entity profiling: http://debat.politiekinzicht.com/
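
The definition above translates almost directly into code. In this sketch a block is represented as its owner plus the ordered list of speakers inside it; that representation, the sample data and the choice to count every speech turn by a non-owner as one interruption are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical blocks: (owner of the block, speakers in order of appearance).
blocks = [
    ("B", ["B", "A", "B", "C", "B", "A"]),
    ("A", ["A", "B", "A"]),
]

def interruption_graph(blocks):
    """Build weighted edges (interrupter, interrupted) -> number of turns."""
    edges = Counter()
    for owner, speakers in blocks:
        for speaker in speakers:
            if speaker != owner:  # anyone else speaking in the block interrupts
                edges[(speaker, owner)] += 1
    return edges

graph = interruption_graph(blocks)
```

The resulting weighted, directed edges can be passed to any graph library for layout; the edge weight shows how often one MP interrupts another.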
  • 27. Exploring and exploiting official documents • We saw what can be done with one well-curated collection. • What are the key infrastructural and research questions? In what direction and how to scale this up? 1. in time 2. in breadth 3. in links
  • 28. Scale diachronically • Stable data model and measurement procedure make this data very valuable for diachronic comparisons. • towards the past • OCR • consistency in structure • more missing data to link to • towards the future • remain up to date • legacy decisions
  • 29. Scale in breadth, e.g., parliamentary proceedings of all European countries • All describe the same “script”, so all fit in one schema. • Main question: how to connect the data from different countries? Common structure and annotation: use the same Relax NG schema. Common values on certain attributes: • Entities: normalize to Wikipedia concepts • Controlled vocabulary keywords: normalize to Eurovoc • Language: machine translate to English • Events: normalize to EMM Newsexplorer query / Wikinews query
  • 30. Scale in breadth: link to related datasets • Link on time, entities, events, topics • Other official publications • News • User generated content • (In our case) promises of political actors: election manifestos
  • 31. Conclusions • There are ample opportunities for exploiting Official Publications. • Preprocessing and interlinking with other datasets is difficult and does not scale well: • High precision and recall are needed for many applications • Many text analysis and data-mapping tasks [MUC, TAC] • Every format needs its own transformer • Linked Open Data knowledge bases are not (yet) good enough: create special purpose knowledge extractors • High investment, but if done in a general way, high return and impact.
  • 32. Back to our research question: What is the best data format for publishing both legacy and current parliamentary proceedings in a digitally sustainable manner? Lessons learned • Common, open, standardized, self-describing, machine readable, • not tied to a single application • linked, linked, linked • Not only shared attributes • but more importantly, shared data values • also store utterly obvious facts (10 years later they aren’t)
  • 33. How we can help (ourselves): Help improve input data at the source • Push at the source (in UK: open government data; in Holland: all parliamentary data is now in XML . . . ) • Help reduce dumb cut-and-paste annotation work, so annotators can concentrate on tasks which are hard for machines (e.g. text classification). • Emphasize importance of using shared standards. Future researchers will love you.
  • 34. Last Question Official Publications: are they or ?