Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Roundtripping of NIF based
Linguistic Linked Data with non
linked data...
Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
What is NIF?
• Natural Language Processing Interchange
Format
– See ht...
Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Example (Partial; JSON-LD Syntax)
{ "@graph" : [ {
"@id" : "p:char=0,1...
Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Example (Partial; JSON-LD Syntax)
{ "@graph" : [ {
"@id" : "p:char=0,1...
Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Example (Partial; JSON-LD Syntax)
{ "@graph" : [ {
"@id" : "p:char=0,1...
Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Example (Partial; JSON-LD Syntax)
{ "@graph" : [ {
"@id" : "p:char=0,1...
Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Example (Partial; JSON-LD Syntax)
{ "@graph" : [ {
"@id" : "p:char=0,1...
Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
A NIF workflow
8
Existing
content
Content analytics, e.g.
named entity...
Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Potential scenario: roundtripping
9
Existing
content
Content analytics...
Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Roundtripping
• Roundtripping: Storing the outcome of
content processi...
Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Example: HTML
Example roundtripping workflow
11
… <p>Welcome to Prague...
Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Example: XLIFF
Example roundtripping workflow
12
… <xlf:source>Welcome...
Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Example usage scenario:
FREME project
• See http://www.freme-project.e...
Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Challenges for roundtripping
• Source format
– How to store enrichment...
Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
How to store annotations in various
source formats
• Solvable for mark...
Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Source format example: Word
… <w:t>Welcome to Prague!</w:t> …
16
… <w:...
Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Annotation models
• NIF: like RDF = general graph model
– Consisting o...
Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Restricting graphs: Tree structured annotations
on several layers
18
•...
Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Representing overlapping hierarchies
with markup (1/2)
Solutions adver...
Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Representing overlapping hierarchies
with markup (2/2)
Solutions adver...
Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Representing overlapping hierarchies
in RDF
POWLA (cf. Chiarcos, 2012)...
Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Lessons learned
• Choose the overlap solution that fits your
roundtrip...
Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Roundtripping of NIF based
Linguistic Linked Data with non
linked data...
Nächste SlideShare
Wird geladen in …5
×

Sasaki datathon-madrid-2015

992 Aufrufe

Veröffentlicht am

We describe the NIF approach towards representing annotations and focus on roundtripping: the conversion of existing digital content from formats like Word, HTML etc. into NIF and re-integration of annotations into the original file format. Such roundtripping is needed for many industry applications of linguistic linked data and natural language processing. Roundtripping is not always possible and constrained by 1) possibilities to store annotations in the original format, while preserving existing information (e.g. HTML inline tags) and 2) constraints of the annotation model, which is basicaly tree-structured for markup languages like generic XML or HTML. There is no general solution to this problem. Developers of roundtripping applications should use existing libraries as much as possible and leverage them to their needs.

Veröffentlicht in: Internet
0 Kommentare
1 Gefällt mir
Statistik
Notizen
  • Als Erste(r) kommentieren

Keine Downloads
Aufrufe
Aufrufe insgesamt
992
Auf SlideShare
0
Aus Einbettungen
0
Anzahl an Einbettungen
55
Aktionen
Geteilt
0
Downloads
9
Kommentare
0
Gefällt mir
1
Einbettungen 0
Keine Einbettungen

Keine Notizen für die Folie

Sasaki datathon-madrid-2015

  1. 1. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015 Roundtripping of NIF based Linguistic Linked Data with non linked data sources Felix Sasaki DFKI / W3C Fellow Slides: http://de.slideshare.net/atcfsenzoku/sasaki-datathonmadrid2015 1
  2. 2. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015 What is NIF? • Natural Language Processing Interchange Format – See http://nlp2rdf.org/ • LLD format to store annotations & to organize NLP pipelines • API specification to create NIF workflows • More details: after the coffee break  • Following slides: main roles for NIF 2
  3. 3. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015 Example (Partial; JSON-LD Syntax) { "@graph" : [ { "@id" : "p:char=0,18", "@type" : [ "nif:Context", "nif:Sentence", "nif:RFC5147String" ], "anchorOf" : "Welcome to Prague.", "beginIndex" : "0", "endIndex" : "18", "isString" : "Welcome to Prague.", "referenceContext" : "p:char=0,18” }, { "@id" : "p:char=11,17", "@type" : [ "nif:RFC5147String", "nif:Word" ], … "referenceContext" : "p:char=0,18", "taIdentRef" : "http://dbpedia.org/resource/Prague" }, …] } 3
  4. 4. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015 Example (Partial; JSON-LD Syntax) { "@graph" : [ { "@id" : "p:char=0,18", "@type" : [ "nif:Context", "nif:Sentence", "nif:RFC5147String" ], "anchorOf" : "Welcome to Prague.", "beginIndex" : "0", "endIndex" : "18", "isString" : "Welcome to Prague.", "referenceContext" : "p:char=0,18” }, { "@id" : "p:char=11,17", "@type" : [ "nif:RFC5147String", "nif:Word" ], … "referenceContext" : "p:char=0,18", "taIdentRef" : "http://dbpedia.org/resource/Prague" }, …] } 4 • Identifying and typing annotations • Identifying annotation offsets • Adding additional knowledge, e.g. named entity identifier • Interrelating annotations
  5. 5. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015 Example (Partial; JSON-LD Syntax) { "@graph" : [ { "@id" : "p:char=0,18", "@type" : [ "nif:Context", "nif:Sentence", "nif:RFC5147String" ], "anchorOf" : "Welcome to Prague.", "beginIndex" : "0", "endIndex" : "18", "isString" : "Welcome to Prague.", "referenceContext" : "p:char=0,18” }, { "@id" : "p:char=11,17", "@type" : [ "nif:RFC5147String", "nif:Word" ], … "referenceContext" : "p:char=0,18", "taIdentRef" : "http://dbpedia.org/resource/Prague" }, …] } 5 • Identifying and typing annotations • Identifying annotation offsets • Adding additional knowledge, e.g. named entity identifier • Interrelating annotations
  6. 6. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015 Example (Partial; JSON-LD Syntax) { "@graph" : [ { "@id" : "p:char=0,18", "@type" : [ "nif:Context", "nif:Sentence", "nif:RFC5147String" ], "anchorOf" : "Welcome to Prague.", "beginIndex" : "0", "endIndex" : "18", "isString" : "Welcome to Prague.", "referenceContext" : "p:char=0,18” }, { "@id" : "p:char=11,17", "@type" : [ "nif:RFC5147String", "nif:Word" ], … "referenceContext" : "p:char=0,18", "taIdentRef" : "http://dbpedia.org/resource/Prague" }, …] } 6 • Identifying and typing annotations • Identifying annotation offsets • Adding additional knowledge, e.g. named entity identifier • Interrelating annotations
  7. 7. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015 Example (Partial; JSON-LD Syntax) { "@graph" : [ { "@id" : "p:char=0,18", "@type" : [ "nif:Context", "nif:Sentence", "nif:RFC5147String" ], "anchorOf" : "Welcome to Prague.", "beginIndex" : "0", "endIndex" : "18", "isString" : "Welcome to Prague.", "referenceContext" : "p:char=0,18” }, { "@id" : "p:char=11,17", "@type" : [ "nif:RFC5147String", "nif:Word" ], … "referenceContext" : "p:char=0,18", "taIdentRef" : "http://dbpedia.org/resource/Prague" }, …] } 7 • Identifying and typing annotations • Identifying annotation offsets • Adding additional knowledge, e.g. named entity identifier • Interrelating annotations
  8. 8. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015 A NIF workflow 8 Existing content Content analytics, e.g. named entity recognition Conversion to NIF Deploying knowledge from the LLD cloud
  9. 9. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015 Potential scenario: roundtripping 9 Existing content Content analytics, e.g. named entity recognition Conversion to NIF Storing annotations in original content Deploying knowledge from the LLD cloud
  10. 10. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015 Roundtripping • Roundtripping: Storing the outcome of content processing (analytics) tasks in the original content • Not always needed, but sometimes – examples: – Enriching Web content with named entity information; generating Schema.org markup via NIF pipelines. Format: HTML – Enriching localisation content, to add value beyond translation: Format: XLIFF 10
  11. 11. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015 Example: HTML Example roundtripping workflow 11 … <p>Welcome to Prague!</p>… …<p>Welcome to <span … itemtype="http://schema.org/Place">Prague</span>!< /p>… 1) Conversion to NIF 2) NER processing 3) Back conversion to HTML
  12. 12. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015 Example: XLIFF Example roundtripping workflow 12 … <xlf:source>Welcome to Prague!</xlf:source> … … <xlf:source>Welcome to <mrk … its:taClassRef="http://schema.org/Place">Prague </mrk>!</xlf:source> … 1) Conversion to NIF 2) NER processing 3) Back conversion to HTML
  13. 13. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015 Example usage scenario: FREME project • See http://www.freme-project.eu/ • Developing interfaces for multilingual and semantic enrichment of digital content • Relies on NIF based enrichment workflows – See FREME API version 0.1 http://api.freme-project.eu/doc/0.1/ • Deploys aspects of the LIDER reference architecture for LLD processing – See D3.1.1 at http://lider-project.eu/?q=doc/deliverables • Focuses on four business cases – Localization BC requires XLIFF roundtripping – Web content personalisation BC requires HTML roundtripping 13
  14. 14. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015 Challenges for roundtripping • Source format – How to store enrichment information (annotations) – How to handle existing information • Annotation model – NIF = a general graph-based annotation model – Sources format and annotation motivation may require restriction of the model 14
  15. 15. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015 How to store annotations in various source formats • Solvable for markup languages like HTML or XLIFF • Challenge to preserve existing markup “<p>Welcome to <b>Prague</b>!</p>” • General issue with complex and proprietary formats: – “My own” storage mechanism = no tool support – Using existing storage mechanisms may mean: overloading semantics 15
  16. 16. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015 Source format example: Word … <w:t>Welcome to Prague!</w:t> … 16 … <w:commentRangeStart w:id="0"/><w:t>Prague</w:t> <w:commentRangeEnd w:id="0"/> <w:r w:rsidR="00987079"> … <w:p w:rsidRPr="00987079">… Enrichment: type "http://schema.org/Place"…</w:p> Enrichment process; storing enrichment as comments Change of original content: creation of anchor Comment stored separately; refers to anchor: “standoff approach” Content storage Comment storage Content storage (Word file unzipped)
  17. 17. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015 Annotation models • NIF: like RDF = general graph model – Consisting of nodes and arcs 17 p:char=11,17 dbp:Prague taIdentRef
  18. 18. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015 Restricting graphs: Tree structured annotations on several layers 18 • Tree structures for syntactic annotations • Several annotation layers for the same text • Concurrent hierarchies • Representation only of one of these in roundtripping with XML Example taken from TEI http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html
  19. 19. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015 Representing overlapping hierarchies with markup (1/2) Solutions advertised by the TEI • Multiple encoding of the same information – One XML document per annotation • Boundary marking with empty “milestone” elements – Also used by XLIFF 19
  20. 20. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015 Representing overlapping hierarchies with markup (2/2) Solutions advertised by the TEI • Fragmentation and reconstitution of virtual elements – One hierarchy explicit, others with interrelated marked-up spans • Stand-off markup – Separation of text and annotations, interlinked via anchor and reference – Cf. Word example 20
  21. 21. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015 Representing overlapping hierarchies in RDF POWLA (cf. Chiarcos, 2012) • RDF representation for corpus annotation, based on PAULA XML Standoff format • Allows to represent hierarchical, multi-layer corpora in RDF and query in SPARQL • Not relevant for roundtripping, but for linguistic annotation representation and processing in RDF 21
  22. 22. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015 Lessons learned • Choose the overlap solution that fits your roundtripping modelling and processing needs • Consider off-the-shelf tooling – For 100% hierarchical data: XPath / CSS selectors, DOM, … • Consider libraries – For extraction only: Tika http://tika.apache.org/ – For roundtripping: Okapi http://okapi.opentag.com/ - in FREME currently being adapted for roundtripping in selected formats • Make sure the annotation survives in the original format – cf. Word example – Soon to be made easier by using Okapi 22
  23. 23. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015 Roundtripping of NIF based Linguistic Linked Data with non linked data sources Felix Sasaki DFKI / W3C Fellow 23

×