This document discusses standards and tools for language documentation and annotation, including existing infrastructure, CLARIN, annotation graphs, the Graph Annotation Framework (GrAF) and Poio API. It provides an example of mapping annotations from an ELAN EAF file to GrAF-XML using the Poio API and discusses analysis workflows and the CLASS project.
Poio API and GrAF-XML: A radical stand-off approach in language documentation and language typology
1. Poio API and GrAF-XML
A radical stand-off approach in
language documentation and language typology
Jonathan Blumtritt, Cologne Center for eHumanities, University of Cologne
Peter Bouda, Centro Interdisciplinar de Documentação Linguística e Social
Felix Rau, Department of Linguistics, University of Cologne
2. Overview
● Existing infrastructure and workflows
● CLARIN
● Annotation graphs
● GrAF and Poio API
● Example: ELAN EAF to GrAF-XML
● CLASS
5. LD tools and standards
● ELAN: EAF, MPEG, WAV
● Toolbox: TXT, XML, WAV
● Arbil: IMDI/CMDI („Component MetaData
Infrastructure“)
● Praat: XML, WAV
● ...
● No standards for tier hierarchies, tier names or
annotation schemes
● Efforts in ISOcat
6. CLARIN
● European initiative within the European Research
Infrastructure Consortium: Common Language Resources
and Technology Infrastructure (CLARIN)
● aims at providing easy and sustainable access for scholars in
the humanities and social sciences to digital language data
● Started in 2006, part of a roadmap process, timeline currently
ending 2020
● CLARIN-D: working groups in Germany
● Curation projects for different research areas in linguistics
7. Annotation Graphs
● the underlying data model for linguistic annotations
● pivot structure for linguistic data
● time vs. byte offsets
● not hierarchical (but trees are also graphs)
● stand-off annotation
● "It is important to recognize that translation into AGs does
not magically create compatibility among systems whose
semantics are different." [Bird & Liberman 2001]
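The annotation graph model above can be sketched in a few lines of Python. This is only an illustration of the idea (anchors carrying offsets into the signal, connected by labeled arcs), not code from Poio API or from Bird & Liberman; all class, field, and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Arc:
    """A labeled arc between two anchors; the label lives outside the signal."""
    start: str   # id of the start anchor
    end: str     # id of the end anchor
    tier: str    # annotation type, e.g. "word" or "phrase"
    label: str   # annotation value


@dataclass
class AnnotationGraph:
    # anchor id -> offset into the signal (seconds here; byte offsets for text)
    anchors: dict = field(default_factory=dict)
    arcs: list = field(default_factory=list)

    def add_anchor(self, anchor_id, offset):
        self.anchors[anchor_id] = offset

    def add_arc(self, start, end, tier, label):
        if start not in self.anchors or end not in self.anchors:
            raise KeyError("both anchors must exist before adding an arc")
        self.arcs.append(Arc(start, end, tier, label))

    def labels_between(self, t0, t1):
        """Return (tier, label) pairs for arcs lying fully inside [t0, t1]."""
        return [
            (a.tier, a.label)
            for a in self.arcs
            if self.anchors[a.start] >= t0 and self.anchors[a.end] <= t1
        ]


ag = AnnotationGraph()
for anchor_id, seconds in [("a0", 0.0), ("a1", 0.4), ("a2", 0.9)]:
    ag.add_anchor(anchor_id, seconds)
ag.add_arc("a0", "a1", "word", "the")
ag.add_arc("a1", "a2", "word", "dog")
ag.add_arc("a0", "a2", "phrase", "NP")  # a longer arc spans the two word arcs

print(ag.labels_between(0.0, 0.5))
```

Because arcs only refer to anchors, a hierarchy such as the NP over two words is expressed by arcs sharing anchors rather than by nesting, which is what "not hierarchical (but trees are also graphs)" amounts to in practice.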
9. GrAF
● GrAF: Graph Annotation Framework
● ISO 24612: Language resource management - Linguistic
annotation framework (LAF)
● Started as stand-off version of XCES
● API and representation as data structures, not a file format
● GrAF/XML as XML representation
● Used for the Manually Annotated Sub-Corpus (MASC) of the
American National Corpus (ANC)
● Nodes, edges, regions, annotations, feature structures
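To make the building blocks named above more concrete, here is a rough sketch of how regions, nodes, edges, and annotations with feature structures might relate to each other. It is a simplified stand-in written for this illustration, not the GrAF API or the GrAF/XML serialization; the names `Region`, `Node`, and `text_of` are assumptions.

```python
from dataclasses import dataclass, field

primary_data = "the dog barks"  # never modified: all annotation is stand-off


@dataclass
class Region:
    start: int  # character offset into primary_data
    end: int


@dataclass
class Node:
    node_id: str
    regions: list = field(default_factory=list)      # links into the primary data
    annotations: list = field(default_factory=list)  # (label, feature structure) pairs


regions = {"r1": Region(0, 3), "r2": Region(4, 7), "r3": Region(8, 13)}

nodes = {
    "n1": Node("n1", [regions["r1"]], [("word", {"pos": "DET"})]),
    "n2": Node("n2", [regions["r2"]], [("word", {"pos": "NOUN"})]),
    "n3": Node("n3", [regions["r3"]], [("word", {"pos": "VERB"})]),
    # A node without regions: it is anchored only through its edges.
    "np": Node("np", [], [("phrase", {"cat": "NP"})]),
}

# Directed edges (from node id, to node id) express structure such as constituency.
edges = [("np", "n1"), ("np", "n2")]


def text_of(node_id):
    """Recover the span covered by a node, following one level of outgoing edges."""
    node = nodes[node_id]
    spans = list(node.regions)
    for source, target in edges:
        if source == node_id:
            spans.extend(nodes[target].regions)
    if not spans:
        return ""
    return primary_data[min(r.start for r in spans):max(r.end for r in spans)]


print(text_of("n3"))
print(text_of("np"))
```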
10. TEI and GrAF
● Schemata for GrAF created with TEI Roma
● Customized version of TEI P5 schema
● ODD: „One Document Does it all“
● GrAF is not TEI compliant
● Share data types and feature structures of annotations
● TEI has „stand-off“ variant, uses XPointer/XLink
– Primary data has to be XML
11. Why we use GrAF
● Because it's new! :-)
● No inline markup
● Radical stand-off approach
– Easier to share and manage data
– Preferred solution to archive cultural heritage
– Ideal for sparse annotations
● Existing code: Java and Python
● The beauty of annotation graphs
12. Poio API
● Think of GrAF as an assembly language for linguistic
annotation; then Poio API is a library to map from and to
higher-level languages
● Subset of GrAF to represent tier-based annotation
● Filters and filter chains for search
● Plugin mechanism for file formats
– Mapping semantics: tiers and annotations to nodes and edges
● Meta-data for additional information (tier types etc.)
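The mapping semantics mentioned above (tiers and annotations to nodes and edges) and the idea of a search filter can be illustrated with a small, self-contained sketch. It does not use Poio API itself; the tier names, the node id scheme, and the functions `tiers_to_graph` and `filter_nodes` are hypothetical and only meant to show the general shape of such a mapping.

```python
# Hypothetical tier hierarchy: each tier names its parent tier (None for the root).
tier_hierarchy = {"utterance": None, "word": "utterance", "gloss": "word"}

# Hypothetical tier-based annotations:
# (tier, annotation id, parent annotation id, value)
annotations = [
    ("utterance", "u1", None, "the dog barks"),
    ("word", "w1", "u1", "the"),
    ("word", "w2", "u1", "dog"),
    ("word", "w3", "u1", "barks"),
    ("gloss", "g2", "w2", "canine"),  # sparse: only one word carries a gloss
]


def tiers_to_graph(annotations):
    """Map every annotation to a node and every parent link to an edge."""
    nodes = {}
    edges = []
    for tier, ann_id, parent_id, value in annotations:
        node_id = f"{tier}..{ann_id}"  # encode the tier name in the node id
        nodes[node_id] = {"tier": tier, "value": value}
        if parent_id is not None:
            parent_tier = tier_hierarchy[tier]
            edges.append((f"{parent_tier}..{parent_id}", node_id))
    return nodes, edges


def filter_nodes(nodes, tier, predicate):
    """A simple search filter: keep nodes of one tier whose value matches."""
    return sorted(
        node_id
        for node_id, data in nodes.items()
        if data["tier"] == tier and predicate(data["value"])
    )


nodes, edges = tiers_to_graph(annotations)
print(edges)
print(filter_nodes(nodes, "word", lambda value: value.startswith("b")))
```

Several such filters could be applied one after another to narrow a result set, which is roughly what a filter chain amounts to.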
19. The code
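import poioapi.annotationgraph
import poioapi.io.graf

# Annotation graph object (not used by the conversion below)
ag = poioapi.annotationgraph.AnnotationGraph()
# Parser for the ELAN file (other Poio API releases may expose it under a different name)
parser = poioapi.io.ElanParser("example.eaf")
writer = poioapi.io.graf.Writer()
# The converter maps tiers and annotations to GrAF nodes and edges
converter = poioapi.io.graf.GrAFConverter(parser, writer)
converter.parse()
# Write the GrAF header file together with the stand-off annotation files
converter.write("example.hdr")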
20. Analysis workflows
● Graph-based methods
● Pipe to scientific Python libraries
● GrAF connectors for major linguistic workflow
tools (GATE and Apache UIMA)
● Example: Polysemy in dictionaries
● Example: Counting word orders
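As a toy illustration of the word-order example, the sketch below counts constituent orders over a handful of clauses. The data is invented for this illustration, and the representation (offset and role pairs) is an assumption; in a real workflow the pairs would be read from the annotation graph.

```python
from collections import Counter

# Hypothetical data: each clause is a list of (start offset, grammatical role) pairs.
clauses = [
    [(0, "S"), (4, "V"), (9, "O")],
    [(0, "S"), (6, "O"), (11, "V")],
    [(2, "V"), (0, "S"), (8, "O")],  # annotation order need not match text order
]


def word_order(clause):
    """Sort constituents by offset and join their role labels, e.g. 'SVO'."""
    return "".join(role for _, role in sorted(clause))


order_counts = Counter(word_order(clause) for clause in clauses)
print(order_counts.most_common())
```

The resulting counts are plain Python objects, so they can be handed on to scientific Python libraries for further analysis.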