Poio API: a CLARIN-D curation project for
language documentation and language typology
Peter Bouda
Centro Interdisciplinar de Documentação Linguística e Social
pbouda@cidles.eu
Overview
● Existing infrastructure and workflows
● Poio API and CLASS within CLARIN
● GrAF and TEI
● Poio API
● GrAF as pivot structures (IGT)
● GrAF for retro-digitization (Dictionary)
Fieldwork
Photos
Existing Infrastructure
LD tools and standards
● Elan: EAF, MPEG, WAV
● Toolbox: TXT, XML, WAV
● Arbil: IMDI/CMDI ("Component MetaData Infrastructure")
● Praat: XML, WAV
● ...
● No standards for tier hierarchies, tier names or
annotation schemes
● Efforts in ISOcat
Interlinear Glossed Text
CLARIN
GrAF
● GrAF: Graph Annotation Framework
● ISO 24612: Language resource management - Linguistic
annotation framework (LAF)
● Started as stand-off version of XCES
● API and representation as data structures, not a file format
● GrAF/XML as XML representation
● Used for the MASC of the ANC
● Nodes, edges, regions, annotations, feature structures
GrAF entities
GrAF structure
GrAF-XML
<node xml:id="words..W-Words..na23">
  <link targets="words..W-Words..ra23"/>
</node>
<region anchors="780 1340" xml:id="words..W-Words..ra23"/>
<edge from="utterance..W-Spch..n8" to="words..W-Words..na23"
      xml:id="ea23"/>
<a as="words" label="words" ref="words..W-Words..na23"
   xml:id="a23">
  <fs>
    <f name="annotation_value">so</f>
  </fs>
</a>
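To make the entities in this fragment concrete, the following minimal sketch reads it with Python's standard xml.etree.ElementTree. The wrapping <graph> root element is added here only so that the fragment parses as one document; real GrAF/XML files also declare a namespace, which this sketch leaves out. All variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is predefined by the XML specification; ElementTree
# expands it to this namespace URI in attribute names.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The slide's fragment, wrapped in a root element so that it is a
# well-formed document (assumption: no default namespace).
FRAGMENT = """
<graph>
  <node xml:id="words..W-Words..na23">
    <link targets="words..W-Words..ra23"/>
  </node>
  <region anchors="780 1340" xml:id="words..W-Words..ra23"/>
  <edge from="utterance..W-Spch..n8" to="words..W-Words..na23"
        xml:id="ea23"/>
  <a as="words" label="words" ref="words..W-Words..na23"
     xml:id="a23">
    <fs>
      <f name="annotation_value">so</f>
    </fs>
  </a>
</graph>
"""

root = ET.fromstring(FRAGMENT)

# Regions anchor an annotation in the primary data.
regions = {
    r.get(XML_ID): tuple(int(x) for x in r.get("anchors").split())
    for r in root.iter("region")
}

# Nodes point to regions through <link targets="...">.
node_regions = {
    n.get(XML_ID): n.find("link").get("targets").split()
    for n in root.iter("node")
}

# Edges connect a parent node to a child node.
edges = [(e.get("from"), e.get("to")) for e in root.iter("edge")]

# Annotations attach a feature structure to a node via ref.
annotations = {
    a.get("ref"): {f.get("name"): f.text for f in a.iter("f")}
    for a in root.iter("a")
}

node_id = "words..W-Words..na23"
print(annotations[node_id]["annotation_value"])
print(regions[node_regions[node_id][0]])
```

The word node carries the value "so", is anchored through its region at 780 to 1340, and hangs below an utterance node via the edge.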
Why we use GrAF
● No inline markup
● Radical stand-off approach
– Easier to share and manage data
– Preferred solution to archive cultural heritage
– Ideal for sparse annotations
● Existing code: Java and Python
● API vs. XQuery
● The beauty of annotation graphs
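The stand-off idea can be illustrated without any GrAF machinery. In the minimal sketch below the primary text is never modified; each annotation layer is a separate list of character offsets plus a value. The sentence, the layer names and the helper function are invented for illustration and are not part of GrAF or Poio API.

```python
# Primary data: stored once and never edited.
primary = "so we went home"

# Each layer is independent stand-off data: (start, end, value).
# Layers may be sparse and may be authored by different people.
words = [(0, 2, "so"), (3, 5, "we"), (6, 10, "went"), (11, 15, "home")]
pos_tags = [(6, 10, "VERB")]  # sparse: only one token is tagged


def resolve(layer, text):
    """Pair each annotation value with the text span it points to."""
    return [(text[start:end], value) for start, end, value in layer]


print(resolve(words, primary))
print(resolve(pos_tags, primary))
```

Because the layers only point into the text, a new layer can be added or shared without touching the primary data or the other layers.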
Poio API
● Think of GrAF as an assembly language for linguistic annotation; Poio API
is then a library that maps to and from higher-level languages
● Subset of GrAF to represent tier based annotation
– Interlinear glossed text (IGT)
● Filters and filter chains for search
● Plugin mechanism for file formats
– Mapping semantics: tiers and annotations to nodes and edges
● Meta-data for additional information (tier types etc.)
● Efforts to map between TEI and GrAF
– Poio API supports IGT, next step is dictionaries and lexica
– Retro-digitized dictionary data at University of Marburg are published as GrAF files
– We want to publish as TEI
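The mapping from tiers and annotations to nodes and edges could look roughly like the sketch below. It uses plain tuples and dictionaries rather than Poio API's own classes; the tier names, the ID scheme and the function tiers_to_graph are hypothetical and only loosely modelled on the GrAF-XML fragment shown earlier.

```python
# A tiny tier-based record: one utterance with two words, each glossed.
# Every item is (value, tier name, children).
igt = ("so we", "utterance", [
    ("so", "word", [("thus", "gloss", [])]),
    ("we", "word", [("1PL", "gloss", [])]),
])


def tiers_to_graph(record):
    """Flatten a nested tier record into nodes, edges and annotations."""
    nodes, edges, annotations = [], [], {}
    counters = {}

    def visit(item, parent_id=None):
        value, tier, children = item
        counters[tier] = counters.get(tier, 0) + 1
        node_id = "{0}..n{1}".format(tier, counters[tier])
        nodes.append(node_id)
        annotations[node_id] = {"annotation_value": value}
        if parent_id is not None:
            # The tier hierarchy becomes a parent -> child edge.
            edges.append((parent_id, node_id))
        for child in children:
            visit(child, node_id)

    visit(record)
    return nodes, edges, annotations


nodes, edges, annotations = tiers_to_graph(igt)
for parent, child in edges:
    print(parent, "->", child, annotations[child]["annotation_value"])
```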
A basic converter in Poio API
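import poioapi.io.graf
import poioapi.io.wikipedia_extractor

# Parse Wikipedia.xml and write the result as GrAF
parser = poioapi.io.wikipedia_extractor.Parser("Wikipedia.xml")
writer = poioapi.io.graf.Writer()
converter = poioapi.io.graf.GrAFConverter(parser, writer)
converter.parse()
converter.write("Wikipedia.hdr")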
A parser for CSV files
import poioapi.io.graf

class CsvParser(poioapi.io.graf.BaseParser):
    def get_root_tiers(self):
        pass
    def get_child_tiers_for_tier(self, tier):
        pass
    def get_annotations_for_tier(self, tier, annotation_parent=None):
        pass
    def tier_has_regions(self, tier):
        pass
    def region_for_annotation(self, annotation):
        pass
    def get_primary_data(self):
        pass
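As a rough illustration of what the six methods might return, here is a self-contained sketch that does not import Poio API. It mirrors the method names from the skeleton above, but the Tier and Annotation tuples, the two-column CSV layout and the return types are assumptions made for this example; the real BaseParser contract may differ.

```python
import csv
import io
from collections import namedtuple

# Hypothetical stand-ins for the library's tier and annotation objects.
Tier = namedtuple("Tier", "name")
Annotation = namedtuple("Annotation", "id value")

CSV_DATA = "word,gloss\nso,thus\nwe,1PL\n"


class CsvParserSketch:
    """Treats column 'word' as the root tier and 'gloss' as its child."""

    def __init__(self, stream):
        self.rows = list(csv.DictReader(stream))

    def get_root_tiers(self):
        return [Tier("word")]

    def get_child_tiers_for_tier(self, tier):
        return [Tier("gloss")] if tier.name == "word" else []

    def get_annotations_for_tier(self, tier, annotation_parent=None):
        # One annotation per row; a child annotation shares the row
        # index of its parent, so the index doubles as the id.
        indices = range(len(self.rows))
        if annotation_parent is not None:
            indices = [annotation_parent.id]
        return [Annotation(i, self.rows[i][tier.name]) for i in indices]

    def tier_has_regions(self, tier):
        # Only words are anchored in the primary data.
        return tier.name == "word"

    def region_for_annotation(self, annotation):
        # Character offsets of the word in the space-joined primary text.
        start = sum(len(r["word"]) + 1 for r in self.rows[:annotation.id])
        return (start, start + len(annotation.value))

    def get_primary_data(self):
        return " ".join(row["word"] for row in self.rows)


parser = CsvParserSketch(io.StringIO(CSV_DATA))
text = parser.get_primary_data()
for word in parser.get_annotations_for_tier(Tier("word")):
    start, end = parser.region_for_annotation(word)
    gloss = parser.get_annotations_for_tier(Tier("gloss"), word)[0]
    print(text[start:end], gloss.value, (start, end))
```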
Example: Analysis of CSV data
http://nbviewer.ipython.org/urls/raw.github.com/pbouda/notebooks/master/Diana%2520Hinuq%25
Retro-digitization of dictionaries
● From scan to .doc to XML to DB to GrAF
● Radical stand-off approach for unsupervised
collaboration
● Dictionaries as cultural heritage texts
● GrAF as primary publication format
● Connectors to brat and TEI
Analysis of the data
● Spanish as pivot language, subset of body-part terms
● Converting GrAF to networkx graph
● Nodes are heads, translations, etc.
● Head and translation connected via edges if they appear
in one entry
● Merge of graphs
● Count of paths of length 2 between Spanish heads
● Python writes JSON graph, visualized with D3.js
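The analysis steps listed above can be sketched in a few lines. The original work converts GrAF to networkx graphs; to stay dependency-free, this sketch substitutes plain adjacency sets from the standard library. The two toy dictionaries, their entries and all function names are invented for illustration and are not data from the project.

```python
import json
from collections import defaultdict
from itertools import combinations

# Toy entries: (Spanish head, translation). Invented for illustration.
dict_a = [("mano", "A:maki"), ("brazo", "A:maki"), ("pie", "A:chaki")]
dict_b = [("mano", "B:po"), ("brazo", "B:po"), ("pie", "B:py")]


def build_graph(entries):
    """Undirected graph: a head and a translation share an edge if they
    appear in the same entry."""
    graph = defaultdict(set)
    for head, translation in entries:
        graph[head].add(translation)
        graph[translation].add(head)
    return graph


def merge(*graphs):
    """Union of nodes and edges; identical heads collapse into one node."""
    merged = defaultdict(set)
    for graph in graphs:
        for node, neighbours in graph.items():
            merged[node] |= neighbours
    return merged


def paths_of_length_two(graph, heads):
    """For each pair of heads, count the shared neighbours; each shared
    neighbour is exactly one path head - translation - head."""
    return {
        (u, v): len(graph[u] & graph[v])
        for u, v in combinations(sorted(heads), 2)
    }


graph = merge(build_graph(dict_a), build_graph(dict_b))
heads = {head for head, _ in dict_a + dict_b}
counts = paths_of_length_two(graph, heads)
print(counts)

# Node-link JSON, a layout commonly fed to D3.js force-directed graphs.
payload = json.dumps({
    "nodes": [{"id": head} for head in sorted(heads)],
    "links": [{"source": u, "target": v, "value": c}
              for (u, v), c in sorted(counts.items()) if c > 0],
})
print(payload)
```

In this toy data "mano" and "brazo" share a translation in both dictionaries, so they are linked by two paths of length 2, while "pie" stays unconnected.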
D3 visualization
http://www.peterbouda.eu/bodyparts/index_bodyparts.html
Thank you for your attention!
pbouda@cidles.eu
Links
CLARIN curation project:
http://de.clarin.eu/en/discipline-specific-working-groups/wg-3-linguistic-fieldwork-anthr
Poio:
http://media.cidles.eu/poio/
GrAF:
http://www.xces.org/ns/GrAF/1.0/
