Querying GrAF data in linguistic analysis
Peter Bouda
Centro Interdisciplinar de Documentação Linguística e Social
pbouda@cidles.eu
Overview
● Existing infrastructure and workflows
● GrAF
● GrAF and TEI
● Poio API
● Queries in Poio API
● Queries in GrAF API
Fieldwork
Photos
Existing Infrastructure
LD tools and standards
● Elan: EAF, MPEG, WAV
● Toolbox: TXT, XML, WAV
● Arbil: IMDI/CMDI („Component MetaData Infrastructure“)
● Praat: XML, WAV
● ...
● No standards for tier hierarchies, tier names or
annotation schemes
● Efforts in ISOcat
Interlinear Glossed Text
GrAF
● GrAF: Graph Annotation Framework
● ISO 24612: Language resource management - Linguistic
annotation framework (LAF)
● Started as stand-off version of XCES
● API and representation as data structures, not a file format
● GrAF/XML as XML representation
● Used for the MASC (Manually Annotated Sub-Corpus) of the ANC (American National Corpus)
● Nodes, edges, regions, annotations, feature structures
GrAF entities
GrAF structure
GrAF-XML
<node xml:id="words..W-Words..na23">
  <link targets="words..W-Words..ra23"/>
</node>

<region anchors="780 1340" xml:id="words..W-Words..ra23"/>

<edge from="utterance..W-Spch..n8" to="words..W-Words..na23"
      xml:id="ea23"/>

<a as="words" label="words" ref="words..W-Words..na23" xml:id="a23">
  <fs>
    <f name="annotation_value">so</f>
  </fs>
</a>
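To make the relationships in this snippet explicit, here is a minimal Python sketch of the four GrAF entities it uses. The class names and fields are simplifications for illustration only, not the actual graf-python API.

from dataclasses import dataclass, field

# Simplified stand-ins for the GrAF entities above; illustrative only,
# NOT the graf-python classes.

@dataclass
class Region:        # <region>: a span of primary data, e.g. time offsets
    id: str
    anchors: tuple

@dataclass
class Node:          # <node>: links to zero or more regions
    id: str
    regions: list = field(default_factory=list)

@dataclass
class Edge:          # <edge>: connects two nodes
    id: str
    from_node: str
    to_node: str

@dataclass
class Annotation:    # <a>: labelled feature structure attached to a node or edge
    id: str
    label: str
    ref: str
    features: dict = field(default_factory=dict)

# The XML example above, rebuilt as objects:
region = Region("words..W-Words..ra23", (780, 1340))
node = Node("words..W-Words..na23", [region])
edge = Edge("ea23", "utterance..W-Spch..n8", "words..W-Words..na23")
anno = Annotation("a23", "words", node.id, {"annotation_value": "so"})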
TEI and GrAF
● Schemata for GrAF created with TEI Roma
● Customized version of the TEI P5 schema
● ODD: „One Document Does it all“
● GrAF is not TEI compliant
● Share data types and feature structures of annotations
● TEI has a „stand-off“ variant that uses XPointer/XLink
– Primary data has to be XML
Why we use GrAF
● No inline markup
● Radical stand-off approach
– Easier to share and manage data
– Preferred solution for archiving cultural heritage
– Ideal for sparse annotations
● Existing code: Java and Python
● API vs. XQuery
● The beauty of annotation graphs
Poio API
● Think of GrAF as an assembly language for linguistic annotation;
  Poio API is then a library that maps from and to higher-level languages
● Subset of GrAF to represent tier-based annotation
● Filters and filter chains for search
● Plugin mechanism for file formats (a sketch of a minimal plugin parser follows below)
  – Mapping semantics: tiers and annotations to nodes and edges
● Efforts to map between TEI and GrAF
  – Retro-digitized dictionary data at the University of Marburg are published
    as GrAF files
  – We want to publish as TEI
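As a rough illustration of the mapping semantics (tiers and annotations onto nodes and edges), here is a minimal sketch of what a file-format plugin could look like. The parser interface and method names are assumptions made for this sketch, not the documented Poio API plugin API.

# Hypothetical plugin sketch: the interface and method names are assumptions,
# not the documented Poio API classes.

class MyFormatParser:
    """Exposes a custom file format as tiers and annotations, which the
    framework can then map onto GrAF nodes and edges."""

    def __init__(self, filepath):
        self.filepath = filepath
        # ... read and parse the file into an in-memory structure here ...

    def get_root_tiers(self):
        # Top-level tiers, e.g. utterances.
        return ["utterance"]

    def get_child_tiers_for_tier(self, tier):
        # Tier hierarchy: words hang off utterances, glosses off words.
        return {"utterance": ["word"], "word": ["gloss"]}.get(tier, [])

    def get_annotations_for_tier(self, tier, parent_annotation=None):
        # Annotations of `tier` under `parent_annotation`: each annotation
        # becomes a node, the link to its parent becomes an edge.
        return []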
Queries in GrAF API
● All queries are in-memory
● Users can load parts of the full graph
● Annotation graph to network conversion
– Python library networkx (see the conversion sketch after this list)
● Example: Semantic similarity
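A minimal sketch of the graph-to-networkx conversion, assuming the node and edge attributes used in the query example on the next slide (graf_graph.nodes, node.out_edges, e.to_node) plus an id attribute on nodes.

import networkx as nx

def graf_to_networkx(graf_graph):
    # Build a directed networkx graph so that standard graph algorithms
    # (e.g. path-based measures for semantic similarity) can be applied.
    g = nx.DiGraph()
    for node_id, node in graf_graph.nodes.items():
        g.add_node(node_id)
        for e in node.out_edges:
            # e.to_node.id is assumed; the edge label is taken from the
            # first annotation, as in the query example below.
            g.add_edge(node_id, e.to_node.id,
                       label=e.annotations.get_first().label)
    return g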
Queries in GrAF API
# Walk the dictionary graph: for every entry node, follow its outgoing
# edges to the "head" and "translation" annotations and read their substrings.
for (node_id, node) in graf_graph.nodes.items():
    if node_id.endswith("entry"):
        for e in node.out_edges:
            if e.annotations.get_first().label == "head" or \
               e.annotations.get_first().label == "translation":
                features = e.to_node.annotations.get_first().features
                substr = features.get_value("substring")
                [...]
Queries in Poio API
● Example: Word order in Hinuq
Queries in Poio API
import collections

# from_excel() is Poio API's importer for the CSV export used here
# (its import statement was omitted on the slide).
ag = from_excel("data/Hinuq2.csv")
clause_unit_nodes = ag.nodes_for_tier("clause_id")

verbs = [ 'COP', 'cop', 'SAY', 'say', 'v.tr', 'v.intr', 'v.aff' ]
others = [ 'A', 'S', 'P', 'EXP', 'STIM' ]
search_terms = verbs + others

# Count word order patterns per clause unit, collapsing all verb tags to 'V'.
word_orders = collections.defaultdict(int)
for parent_node in clause_unit_nodes:
    word_order = []
    for word_n in parent_node.iter_children():
        a_list = ag.annotations_for_tier("grammatical_relation", word_n)
        if len(a_list) > 0:
            a_value = ag.annotation_value_for_annotation(a_list[0])
            if a_value in search_terms:
                if a_value in verbs:
                    word_order.append('V')
                else:
                    word_order.append(a_value)
    word_orders[tuple(word_order)] += 1
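A short follow-up (not on the original slide): one way to inspect the counts collected above, most frequent pattern first.

# Print the attested word order patterns sorted by frequency.
for order, count in sorted(word_orders.items(),
                           key=lambda item: item[1],
                           reverse=True):
    print("-".join(order), count)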
Filters and filter chains
import poioapi.annotationgraph
import poioapi.data

# Load an ELAN file and pick the first tier hierarchy as the data structure.
ag = poioapi.annotationgraph.AnnotationGraph()
ag.from_elan("elan-example3.eaf")
ag.structure_type_handler = \
    poioapi.data.DataStructureType(ag.tier_hierarchies[0])

# Build a filter: search for "follow" on the word tier and "bprob" on the POS tier.
af = poioapi.annotationgraph.AnnotationGraphFilter(ag)
af.set_filter_for_tier("words..W-Words", "follow")
af.set_filter_for_tier("part_of_speech..W-POS", r"bprob")
ag.append_filter(af)

print("Filtered root nodes:")
print(ag.filtered_node_ids)

# Shortcut: build the same filter from a dict of tier name -> search term.
search_terms = {
    "words..W-Words": "follow",
    "part_of_speech..W-POS": r"bprob"
}
af = ag.create_filter_for_dict(search_terms)
ag.append_filter(af)
Poio Analyzer
● Developed for and with Prof. Johannes
Helmbrecht, University of Regensburg
● How to query the corpus in order to write a
descriptive grammar?
● Started with a list of requirements
● Need to publish and archive queries and results
Poio Analyzer
Thank you for your attention!
pbouda@cidles.eu
Links
Clarin curation project:
http://de.clarin.eu/en/discipline-specific-working-groups/wg-3-linguistic-fieldwork-anthr
Poio:
http://media.cidles.eu/poio/
GrAF:
http://www.xces.org/ns/GrAF/1.0/