
Sasaki datathon-madrid-2015


We describe the NIF approach to representing annotations and focus on roundtripping: converting existing digital content from formats such as Word or HTML into NIF, then re-integrating the annotations into the original file format. Such roundtripping is needed for many industry applications of linguistic linked data and natural language processing. It is not always possible. It is constrained by 1) the options for storing annotations in the original format while preserving existing information (e.g. HTML inline tags) and 2) the annotation model, which is basically tree-structured for markup languages such as generic XML or HTML. There is no general solution to this problem. Developers of roundtripping applications should use existing libraries as much as possible and adapt them to their needs.


  1. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
     Roundtripping of NIF based Linguistic Linked Data with non linked data sources
     Felix Sasaki, DFKI / W3C Fellow
     Slides: http://de.slideshare.net/atcfsenzoku/sasaki-datathonmadrid2015
  2. What is NIF?
     • Natural Language Processing Interchange Format
       – See http://nlp2rdf.org/
     • LLD format to store annotations and to organize NLP pipelines
     • API specification to create NIF workflows
     • More details: after the coffee break
     • Following slides: main roles for NIF
  4. Example (Partial; JSON-LD Syntax)
     {
       "@graph" : [
         {
           "@id" : "p:char=0,18",
           "@type" : [ "nif:Context", "nif:Sentence", "nif:RFC5147String" ],
           "anchorOf" : "Welcome to Prague.",
           "beginIndex" : "0",
           "endIndex" : "18",
           "isString" : "Welcome to Prague.",
           "referenceContext" : "p:char=0,18"
         },
         {
           "@id" : "p:char=11,17",
           "@type" : [ "nif:RFC5147String", "nif:Word" ],
           …
           "referenceContext" : "p:char=0,18",
           "taIdentRef" : "http://dbpedia.org/resource/Prague"
         },
         …
       ]
     }
     What the example shows:
     • Identifying and typing annotations
     • Identifying annotation offsets
     • Adding additional knowledge, e.g. named entity identifier
     • Interrelating annotations
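The offset-based identifiers in the example can be derived mechanically from the text. The following is a minimal sketch, not the NIF reference implementation: the function names and the `(surface, entity_uri)` input shape are hypothetical, and the word node carries `anchorOf` and offsets by analogy with the context node, since the slide elides those fields.

```python
import json


def char_id(prefix, begin, end):
    """Build an offset-based identifier such as 'p:char=11,17'."""
    return f"{prefix}char={begin},{end}"


def nif_graph(text, mentions, prefix="p:"):
    """Return a JSON-LD-like dict with one context node and one node per mention.

    mentions: list of (surface, entity_uri) pairs (hypothetical input shape).
    Only the first occurrence of each surface string is annotated.
    """
    context_id = char_id(prefix, 0, len(text))
    graph = [{
        "@id": context_id,
        "@type": ["nif:Context", "nif:Sentence", "nif:RFC5147String"],
        "anchorOf": text,
        "beginIndex": "0",
        "endIndex": str(len(text)),
        "isString": text,
        "referenceContext": context_id,
    }]
    for surface, entity_uri in mentions:
        begin = text.find(surface)
        if begin < 0:
            raise ValueError(f"{surface!r} does not occur in the text")
        end = begin + len(surface)
        graph.append({
            "@id": char_id(prefix, begin, end),
            "@type": ["nif:RFC5147String", "nif:Word"],
            "anchorOf": surface,            # assumption: elided on the slide
            "beginIndex": str(begin),       # assumption: elided on the slide
            "endIndex": str(end),           # assumption: elided on the slide
            "referenceContext": context_id,
            "taIdentRef": entity_uri,
        })
    return {"@graph": graph}


doc = nif_graph("Welcome to Prague.",
                [("Prague", "http://dbpedia.org/resource/Prague")])
print(json.dumps(doc, indent=2))
```

Every annotation node points back to the context node through `referenceContext`, which is what interrelates the annotations.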
  8. A NIF workflow
     Diagram boxes: Existing content; Content analytics, e.g. named entity recognition; Conversion to NIF; Deploying knowledge from the LLD cloud
  9. Potential scenario: roundtripping
     Diagram boxes: Existing content; Content analytics, e.g. named entity recognition; Conversion to NIF; Storing annotations in original content; Deploying knowledge from the LLD cloud
  10. Roundtripping
      • Roundtripping: storing the outcome of content processing (analytics) tasks in the original content
      • Not always needed, but sometimes. Examples:
        – Enriching Web content with named entity information; generating Schema.org markup via NIF pipelines. Format: HTML
        – Enriching localisation content, to add value beyond translation. Format: XLIFF
  11. Example: HTML. Example roundtripping workflow
      Before: … <p>Welcome to Prague!</p> …
      After: … <p>Welcome to <span … itemtype="http://schema.org/Place">Prague</span>!</p> …
      Steps: 1) Conversion to NIF 2) NER processing 3) Back conversion to HTML
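The back conversion step for the simplest case, a paragraph without existing inline markup, can be sketched as follows. This is an illustration under stated assumptions: `annotate_paragraph` and the `(begin, end, item_type)` tuples are hypothetical names, and the sketch emits only the `itemtype` attribute visible on the slide. The attributes the slide elides are not reconstructed.

```python
from html import escape


def annotate_paragraph(text, annotations):
    """Wrap character ranges of plain text in <span itemtype="..."> elements.

    annotations: list of (begin, end, item_type) tuples using NIF-style
    offsets (hypothetical input shape). Overlapping ranges are rejected
    because inline markup can only express properly nested structure.
    """
    pieces = []
    cursor = 0
    for begin, end, item_type in sorted(annotations):
        if begin < cursor or begin >= end or end > len(text):
            raise ValueError("annotations must be in range and must not overlap")
        # Copy the unannotated text before this annotation.
        pieces.append(escape(text[cursor:begin], quote=False))
        pieces.append(
            f'<span itemtype="{escape(item_type)}">'
            f'{escape(text[begin:end], quote=False)}</span>'
        )
        cursor = end
    pieces.append(escape(text[cursor:], quote=False))
    return "<p>" + "".join(pieces) + "</p>"


html_out = annotate_paragraph("Welcome to Prague!",
                              [(11, 17, "http://schema.org/Place")])
print(html_out)
```

Working from offsets rather than string search keeps the result tied to the NIF identifiers, so a repeated word is annotated only where the analytics step actually found the entity.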
  12. Example: XLIFF. Example roundtripping workflow
      Before: … <xlf:source>Welcome to Prague!</xlf:source> …
      After: … <xlf:source>Welcome to <mrk … its:taClassRef="http://schema.org/Place">Prague</mrk>!</xlf:source> …
      Steps: 1) Conversion to NIF 2) NER processing 3) Back conversion to XLIFF
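For XML formats such as XLIFF, the same step can be done on a parsed tree instead of on strings. The sketch below uses Python's standard `xml.etree.ElementTree`. It assumes the XLIFF 1.2 namespace for the `xlf` prefix (the slide does not say which version is meant), handles only a `<source>` element without child elements, and sets only the `its:taClassRef` attribute shown on the slide. `mark_entity` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

XLF = "urn:oasis:names:tc:xliff:document:1.2"   # assumption: XLIFF 1.2
ITS = "http://www.w3.org/2005/11/its"           # ITS 2.0 namespace
ET.register_namespace("xlf", XLF)
ET.register_namespace("its", ITS)


def mark_entity(source, begin, end, class_ref):
    """Wrap source.text[begin:end] in an <mrk> element carrying its:taClassRef."""
    if len(source) > 0:
        raise ValueError("this sketch only handles text-only <source> elements")
    text = source.text or ""
    if not 0 <= begin < end <= len(text):
        raise ValueError("offsets out of range")
    source.text = text[:begin]                  # text before the entity
    mrk = ET.SubElement(source, f"{{{XLF}}}mrk")
    mrk.set(f"{{{ITS}}}taClassRef", class_ref)
    mrk.text = text[begin:end]                  # the entity itself
    mrk.tail = text[end:]                       # text after the entity
    return mrk


source = ET.fromstring(
    f'<xlf:source xmlns:xlf="{XLF}">Welcome to Prague!</xlf:source>'
)
mark_entity(source, 11, 17, "http://schema.org/Place")
print(ET.tostring(source, encoding="unicode"))
```

A production XLIFF file would need further attributes on `<mrk>` that the slide elides, which is one reason to prefer an existing library for the real roundtrip.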
  13. Example usage scenario: FREME project
      • See http://www.freme-project.eu/
      • Developing interfaces for multilingual and semantic enrichment of digital content
      • Relies on NIF based enrichment workflows
        – See FREME API version 0.1 http://api.freme-project.eu/doc/0.1/
      • Deploys aspects of the LIDER reference architecture for LLD processing
        – See D3.1.1 at http://lider-project.eu/?q=doc/deliverables
      • Focuses on four business cases
        – Localization BC requires XLIFF roundtripping
        – Web content personalisation BC requires HTML roundtripping
  14. Challenges for roundtripping
      • Source format
        – How to store enrichment information (annotations)
        – How to handle existing information
      • Annotation model
        – NIF = a general graph-based annotation model
        – Source format and annotation motivation may require restriction of the model
  15. How to store annotations in various source formats
      • Solvable for markup languages like HTML or XLIFF
      • Challenge to preserve existing markup: “<p>Welcome to <b>Prague</b>!</p>”
      • General issue with complex and proprietary formats:
        – “My own” storage mechanism = no tool support
        – Using existing storage mechanisms may mean: overloading semantics
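One way to preserve existing markup is to keep a map from plain-text offsets (which NIF uses) back to positions in the marked-up source. The sketch below is a simplification under stated assumptions: tags are recognised with a naive regular expression, character entities are not decoded, and an annotation that would cross an existing tag boundary is rejected rather than split. The function names are hypothetical.

```python
import re

TAG = re.compile(r"<[^>]+>")   # naive tag matcher; assumption, not a full parser


def text_with_offset_map(markup):
    """Return (plain_text, source_index); source_index[i] is the position
    in markup of the i-th plain-text character."""
    source_index = []
    pos = 0
    tag_spans = [m.span() for m in TAG.finditer(markup)]
    for tag_start, tag_end in tag_spans + [(len(markup), len(markup))]:
        source_index.extend(range(pos, tag_start))   # characters between tags
        pos = tag_end
    text = "".join(markup[i] for i in source_index)
    return text, source_index


def wrap_text_range(markup, begin, end, open_tag, close_tag):
    """Insert open_tag/close_tag around plain-text range [begin, end)."""
    text, source_index = text_with_offset_map(markup)
    if not 0 <= begin < end <= len(text):
        raise ValueError("offsets out of range")
    src_begin = source_index[begin]
    src_end = source_index[end - 1] + 1
    if TAG.search(markup[src_begin:src_end]):
        raise ValueError("annotation crosses existing markup")
    return (markup[:src_begin] + open_tag + markup[src_begin:src_end]
            + close_tag + markup[src_end:])


markup = "<p>Welcome to <b>Prague</b>!</p>"
enriched = wrap_text_range(
    markup, 11, 17, '<span itemtype="http://schema.org/Place">', "</span>"
)
print(enriched)
```

The rejected case is exactly where roundtripping becomes hard: an annotation covering "to Prague" would have to start outside `<b>` and end inside it, which inline markup cannot express without fragmentation, milestones or stand-off storage.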
  16. Source format example: Word
      Content storage (Word file unzipped), before enrichment:
        … <w:t>Welcome to Prague!</w:t> …
      Enrichment process; storing enrichment as comments
      Content storage, after enrichment (change of original content: creation of anchor):
        … <w:commentRangeStart w:id="0"/><w:t>Prague</w:t> <w:commentRangeEnd w:id="0"/> <w:r w:rsidR="00987079"> …
      Comment storage (comment stored separately; refers to anchor: “standoff approach”):
        <w:p w:rsidRPr="00987079">… Enrichment: type "http://schema.org/Place"…</w:p>
  17. Annotation models
      • NIF: like RDF = general graph model
        – Consisting of nodes and arcs
      Diagram: node p:char=11,17, arc taIdentRef, node dbp:Prague
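The node-and-arc picture can be written down as plain subject, predicate, object triples. This is a minimal sketch using only the Python standard library. The document base `http://example.org/doc#` is a hypothetical expansion of the `p:` prefix, which the slides do not define, and `dbp:Prague` is expanded to the DBpedia resource URI used in the JSON-LD example.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
ITSRDF = "http://www.w3.org/2005/11/its/rdf#"
DOC = "http://example.org/doc#"   # hypothetical expansion of the 'p:' prefix

word = DOC + "char=11,17"

# Each triple is one arc of the graph: (source node, arc label, target node).
triples = [
    (word, ITSRDF + "taIdentRef", "http://dbpedia.org/resource/Prague"),
    (word, RDF_TYPE, NIF + "Word"),
]


def to_ntriples(triples):
    """Serialise IRI-only triples in N-Triples syntax, one arc per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


def arcs_from(node, triples):
    """Return the (arc label, target node) pairs leaving a node."""
    return [(p, o) for s, p, o in triples if s == node]


print(to_ntriples(triples))
```

Nothing in this model forces the arcs to form a tree, which is the difference from XML or HTML that the following slides address.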
  18. Restricting graphs: Tree structured annotations on several layers
      • Tree structures for syntactic annotations
      • Several annotation layers for the same text
      • Concurrent hierarchies
      • Representation only of one of these in roundtripping with XML
      Example taken from TEI http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html
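Whether a set of annotations fits into a single XML tree can be checked from the offsets alone: two spans are compatible if they are disjoint or one contains the other. The sketch below reports the pairs that cross. The function name and the `(begin, end)` tuple shape are hypothetical.

```python
def find_crossing_spans(spans):
    """Return pairs of (begin, end) spans that partially overlap.

    An empty result means the spans nest properly and can be written as
    inline markup in one hierarchy.
    """
    # Sort by start offset; for equal starts put the longer span first so
    # that a containing span always precedes the spans it contains.
    ordered = sorted(spans, key=lambda span: (span[0], -span[1]))
    crossing = []
    for i, (begin_a, end_a) in enumerate(ordered):
        for begin_b, end_b in ordered[i + 1:]:
            if begin_b >= end_a:
                break                      # disjoint, and so are all later spans
            if end_b > end_a:              # starts inside a, ends outside a
                crossing.append(((begin_a, end_a), (begin_b, end_b)))
    return crossing


# "Welcome to Prague." with a sentence span, a word span and two phrase spans.
spans = [(0, 18), (11, 17), (0, 10), (8, 17)]
print(find_crossing_spans(spans))
```

Here the spans for "Welcome to" (0 to 10) and "to Prague" (8 to 17) cross, so only one of the two can be represented as an element in a single XML hierarchy.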
  19. Representing overlapping hierarchies with markup (1/2)
      Solutions advertised by the TEI
      • Multiple encoding of the same information
        – One XML document per annotation
      • Boundary marking with empty “milestone” elements
        – Also used by XLIFF
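Boundary marking can be sketched by emitting an empty element at the start and at the end of every span. The result stays well-formed even when spans cross. The element and attribute names (`seg`, `start`, `end`, `id`, `ref`, `type`) are hypothetical placeholders, not the TEI or XLIFF vocabulary.

```python
from xml.sax.saxutils import escape, quoteattr


def to_milestones(text, spans):
    """Serialise (begin, end, name) spans as empty start/end marker elements."""
    events = []
    for number, (begin, end, name) in enumerate(spans):
        span_id = f"s{number}"
        # Second tuple field: 0 for end markers, 1 for start markers, so that
        # at the same offset an end marker is written before a start marker.
        events.append((begin, 1, number,
                       f"<start id={quoteattr(span_id)} type={quoteattr(name)}/>"))
        events.append((end, 0, number, f"<end ref={quoteattr(span_id)}/>"))
    events.sort()

    pieces = []
    cursor = 0
    for position, _, _, marker in events:
        pieces.append(escape(text[cursor:position]))
        pieces.append(marker)
        cursor = position
    pieces.append(escape(text[cursor:]))
    return "<seg>" + "".join(pieces) + "</seg>"


xml_out = to_milestones("Welcome to Prague.",
                        [(0, 10, "phrase"), (8, 17, "phrase")])
print(xml_out)
```

The price of this approach is that the spans are no longer elements, so XPath or CSS selectors cannot address their content directly. A consumer has to pair the markers up again.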
  20. Representing overlapping hierarchies with markup (2/2)
      Solutions advertised by the TEI
      • Fragmentation and reconstitution of virtual elements
        – One hierarchy explicit, others with interrelated marked-up spans
      • Stand-off markup
        – Separation of text and annotations, interlinked via anchor and reference
        – Cf. Word example
  21. Representing overlapping hierarchies in RDF
      POWLA (cf. Chiarcos, 2012)
      • RDF representation for corpus annotation, based on the PAULA XML standoff format
      • Allows representing hierarchical, multi-layer corpora in RDF and querying them in SPARQL
      • Not relevant for roundtripping, but for linguistic annotation representation and processing in RDF
  22. Lessons learned
      • Choose the overlap solution that fits your roundtripping modelling and processing needs
      • Consider off-the-shelf tooling
        – For 100% hierarchical data: XPath / CSS selectors, DOM, …
      • Consider libraries
        – For extraction only: Tika http://tika.apache.org/
        – For roundtripping: Okapi http://okapi.opentag.com/ (in FREME currently being adapted for roundtripping in selected formats)
      • Make sure the annotation survives in the original format
        – Cf. Word example
        – Soon to be made easier by using Okapi
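For fully hierarchical data, a plain DOM walk is enough to check that an annotation survived in the original format. The sketch below reads annotated markup back into plain text plus offsets, using Python's standard `xml.etree.ElementTree`. It assumes well-formed XML input (for example XHTML) and treats any element with an `itemtype` attribute as an annotation. `extract_annotations` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def extract_annotations(fragment):
    """Return (plain_text, [(begin, end, itemtype), ...]) for an XML fragment."""
    root = ET.fromstring(fragment)
    annotations = []

    def walk(element, offset):
        begin = offset
        offset += len(element.text or "")
        for child in element:
            offset = walk(child, offset)
            offset += len(child.tail or "")   # text following the child
        item_type = element.get("itemtype")
        if item_type:
            annotations.append((begin, offset, item_type))
        return offset

    walk(root, 0)
    return "".join(root.itertext()), annotations


text, found = extract_annotations(
    '<p>Welcome to <span itemtype="http://schema.org/Place">Prague</span>!</p>'
)
print(text, found)
```

Comparing the recovered offsets with the ones in the NIF data gives a simple roundtrip test for the inline case. Formats that store annotations stand-off, as in the Word example, need format-specific extraction instead.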
