We describe the NIF approach to representing annotations and focus on roundtripping: converting existing digital content from formats such as Word or HTML into NIF and re-integrating the resulting annotations into the original file format. Such roundtripping is needed for many industry applications of linguistic linked data and natural language processing. It is not always possible and is constrained by 1) the possibilities for storing annotations in the original format while preserving existing information (e.g. HTML inline tags) and 2) the annotation model, which is basically tree-structured for markup languages like generic XML or HTML. There is no general solution to this problem. Developers of roundtripping applications should use existing libraries as much as possible and adapt them to their needs.
Sasaki datathon-madrid-2015
1. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Roundtripping of NIF-based Linguistic Linked Data with non-linked data sources
Felix Sasaki
DFKI / W3C Fellow
Slides:
http://de.slideshare.net/atcfsenzoku/sasaki-datathonmadrid2015
2. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
What is NIF?
• Natural Language Processing Interchange
Format
– See http://nlp2rdf.org/
• LLD format to store annotations & to organize
NLP pipelines
• API specification to create NIF workflows
• More details: after the coffee break
• Following slides: main roles for NIF
8. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
A NIF workflow
[Diagram boxes: Existing content; Content analytics, e.g. named entity recognition; Conversion to NIF]
[Diagram label: Deploying knowledge from the LLD cloud]
9. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Potential scenario: roundtripping
[Diagram boxes: Existing content; Content analytics, e.g. named entity recognition; Conversion to NIF; Storing annotations in original content]
[Diagram label: Deploying knowledge from the LLD cloud]
10. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Roundtripping
• Roundtripping: Storing the outcome of
content processing (analytics) tasks in the
original content
• Not always needed, but sometimes –
examples:
– Enriching Web content with named entity
information; generating Schema.org markup via
NIF pipelines. Format: HTML
– Enriching localisation content, to add value
beyond translation. Format: XLIFF
11. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Example: HTML
Example roundtripping workflow
Before: … <p>Welcome to Prague!</p> …
After: … <p>Welcome to <span … itemtype="http://schema.org/Place">Prague</span>!</p> …
1) Conversion to NIF 2) NER processing
3) Back conversion to HTML
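The three steps can be sketched in Python using only the standard library. This is a minimal illustration, not the FREME implementation: it assumes a paragraph without inline markup, replaces the NER step with a simple string lookup, and emits only the `itemtype` attribute (the attributes elided on the slide are not reproduced). The helper names `html_to_text` and `annotate_paragraph` are hypothetical.

```python
from html import escape
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the character data of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(fragment):
    """Step 1 (simplified): obtain the plain text that NIF offsets refer to."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.chunks)


def annotate_paragraph(text, begin, end, item_type):
    """Step 3 (simplified): rebuild a <p> with one span around text[begin:end]."""
    return (
        "<p>" + escape(text[:begin])
        + '<span itemtype="' + escape(item_type, quote=True) + '">'
        + escape(text[begin:end]) + "</span>"
        + escape(text[end:]) + "</p>"
    )


source = "<p>Welcome to Prague!</p>"
text = html_to_text(source)

# Step 2: stand-in for a real NER service, which would return these offsets.
begin = text.index("Prague")
end = begin + len("Prague")

result = annotate_paragraph(text, begin, end, "http://schema.org/Place")
print(result)
```

Rebuilding the paragraph from plain text only works because the input has no inline tags; preserving existing markup needs a mapping from text offsets back to source offsets.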
12. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Example: XLIFF
Example roundtripping workflow
Before: … <xlf:source>Welcome to Prague!</xlf:source> …
After: … <xlf:source>Welcome to <mrk … its:taClassRef="http://schema.org/Place">Prague</mrk>!</xlf:source> …
1) Conversion to NIF 2) NER processing
3) Back conversion to XLIFF
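The back conversion into an XLIFF source element might look as follows with Python's `xml.etree.ElementTree`. This is a sketch under assumptions: the XLIFF 1.2 namespace URI is assumed (the slide does not state a version), the element is assumed to contain text only, the helper `mark_entity` is hypothetical, and the output is not checked against the XLIFF schema.

```python
import xml.etree.ElementTree as ET

XLF = "urn:oasis:names:tc:xliff:document:1.2"  # assumption: XLIFF 1.2
ITS = "http://www.w3.org/2005/11/its"


def mark_entity(source_elem, begin, end, class_ref):
    """Wrap source_elem.text[begin:end] in an <mrk> child with its:taClassRef.

    Only handles elements without existing inline children.
    """
    if len(source_elem):
        raise ValueError("inline markup present; not handled by this sketch")
    text = source_elem.text or ""
    source_elem.text = text[:begin]           # text before the entity
    mrk = ET.SubElement(source_elem, f"{{{XLF}}}mrk")
    mrk.set(f"{{{ITS}}}taClassRef", class_ref)
    mrk.text = text[begin:end]                # the entity itself
    mrk.tail = text[end:]                     # text after the entity
    return mrk


ET.register_namespace("xlf", XLF)
ET.register_namespace("its", ITS)

source = ET.fromstring(
    f'<xlf:source xmlns:xlf="{XLF}">Welcome to Prague!</xlf:source>'
)
mrk = mark_entity(source, 11, 17, "http://schema.org/Place")
print(ET.tostring(source, encoding="unicode"))
```

A useful invariant for any roundtripping step is that the concatenated text content of the element is unchanged after annotation.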
13. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Example usage scenario:
FREME project
• See http://www.freme-project.eu/
• Developing interfaces for multilingual and semantic
enrichment of digital content
• Relies on NIF based enrichment workflows
– See FREME API version 0.1
http://api.freme-project.eu/doc/0.1/
• Deploys aspects of the LIDER reference architecture for LLD
processing
– See D3.1.1 at http://lider-project.eu/?q=doc/deliverables
• Focuses on four business cases
– Localization BC requires XLIFF roundtripping
– Web content personalisation BC requires HTML roundtripping
14. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Challenges for roundtripping
• Source format
– How to store enrichment information
(annotations)
– How to handle existing information
• Annotation model
– NIF = a general graph-based annotation model
– Source format and annotation motivation may
require restriction of the model
15. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
How to store annotations in various
source formats
• Solvable for markup languages like HTML or
XLIFF
• Challenge to preserve existing markup
“<p>Welcome to <b>Prague</b>!</p>”
• General issue with complex and proprietary
formats:
– “My own” storage mechanism = no tool support
– Using existing storage mechanisms may mean:
overloading semantics
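One way to preserve existing markup is to map offsets in the extracted text back to offsets in the source string and insert the new tags there. The sketch below illustrates this for the example above. It assumes markup without character references, comments, or ">" inside attribute values, and it refuses annotations that cross an existing tag; the function names are hypothetical.

```python
def text_to_source_offsets(markup):
    """Return (text, offsets) where offsets[i] is the index in markup of text[i]."""
    text_chars, offsets, in_tag = [], [], False
    for i, ch in enumerate(markup):
        if ch == "<":
            in_tag = True
        elif ch == ">" and in_tag:
            in_tag = False
        elif not in_tag:
            text_chars.append(ch)
            offsets.append(i)
    return "".join(text_chars), offsets


def wrap_span(markup, begin, end, open_tag, close_tag):
    """Insert open_tag/close_tag around the text range [begin, end)."""
    _, offsets = text_to_source_offsets(markup)
    src_begin = offsets[begin]
    src_end = offsets[end - 1] + 1
    if "<" in markup[src_begin:src_end]:
        # The annotation would overlap an existing element boundary.
        raise ValueError("annotation crosses existing markup")
    return (
        markup[:src_begin] + open_tag
        + markup[src_begin:src_end]
        + close_tag + markup[src_end:]
    )


markup = "<p>Welcome to <b>Prague</b>!</p>"
text, _ = text_to_source_offsets(markup)
enriched = wrap_span(
    markup, 11, 17, '<span itemtype="http://schema.org/Place">', "</span>"
)
print(enriched)
```

Here the new span ends up inside the existing `<b>` element, so both the original formatting and the enrichment survive. An annotation such as "to Prague" (offsets 8 to 17) is rejected because it would cross the `<b>` start tag.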
16. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Source format example: Word
Content storage (Word file unzipped):
… <w:t>Welcome to Prague!</w:t> …
Enrichment process; storing enrichment as comments
Content storage after enrichment (change of original content: creation of anchor):
… <w:commentRangeStart w:id="0"/><w:t>Prague</w:t>
<w:commentRangeEnd w:id="0"/>
<w:r w:rsidR="00987079"> …
Comment storage (comment stored separately; refers to anchor: “standoff approach”):
<w:p w:rsidRPr="00987079">… Enrichment: type "http://schema.org/Place"…</w:p>
17. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Annotation models
• NIF: like RDF = general graph model
– Consisting of nodes and arcs
[Diagram: node p:char=11,17, arc taIdentRef, node dbp:Prague]
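The node-and-arc example can be written out as RDF triples. The sketch below builds them as plain strings in an N-Triples-like form, without an RDF library. The document URI `http://example.org/doc1` is hypothetical, `dbp:Prague` is assumed to stand for the DBpedia resource, and the properties are taken from the NIF core ontology and the ITS RDF vocabulary.

```python
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
ITSRDF = "http://www.w3.org/2005/11/its/rdf#"
XSD_INT = "http://www.w3.org/2001/XMLSchema#nonNegativeInteger"


def uri(value):
    return f"<{value}>"


def literal(value, datatype=None):
    # Minimal escaping; sufficient for short single-line strings only.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    suffix = f"^^<{datatype}>" if datatype else ""
    return f'"{escaped}"{suffix}'


def nif_entity_triples(doc_uri, text, begin, end, ident_ref):
    """Triples for one annotated substring of a document text."""
    node = uri(f"{doc_uri}#char={begin},{end}")
    context = uri(f"{doc_uri}#char=0,{len(text)}")
    return [
        (node, uri(NIF + "referenceContext"), context),
        (node, uri(NIF + "beginIndex"), literal(str(begin), XSD_INT)),
        (node, uri(NIF + "endIndex"), literal(str(end), XSD_INT)),
        (node, uri(NIF + "anchorOf"), literal(text[begin:end])),
        (node, uri(ITSRDF + "taIdentRef"), uri(ident_ref)),
    ]


text = "Welcome to Prague!"
triples = nif_entity_triples(
    "http://example.org/doc1", text, 11, 17,
    "http://dbpedia.org/resource/Prague",
)
for s, p, o in triples:
    print(s, p, o, ".")
```

Because the annotation is attached to a character range rather than to an element, nothing in this model prevents two annotations from overlapping, which is where the restrictions discussed next come in.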
18. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Restricting graphs: Tree structured annotations
on several layers
• Tree structures for syntactic annotations
• Several annotation layers for the same text
• Concurrent hierarchies
• Representation only of one of these in roundtripping with XML
Example taken from TEI http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html
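Whether a set of annotations fits into a single XML hierarchy can be checked before back conversion. The following sketch, with a hypothetical helper `crossing_pairs`, reports pairs of character ranges that overlap without one containing the other; if the result is empty, the ranges can be nested as elements.

```python
def crossing_pairs(spans):
    """Return pairs of (begin, end) spans that overlap without nesting."""
    ordered = sorted(spans)
    pairs = []
    for i, (b1, e1) in enumerate(ordered):
        for b2, e2 in ordered[i + 1:]:
            # After sorting, b1 <= b2. The spans cross if the second one
            # starts strictly inside the first and ends strictly after it.
            if b1 < b2 < e1 < e2:
                pairs.append(((b1, e1), (b2, e2)))
    return pairs


# Offsets refer to "Welcome to Prague!"
nested = [(0, 18), (11, 17)]             # sentence, entity
concurrent = [(0, 10), (8, 17)]          # "Welcome to" vs. "to Prague"

print(crossing_pairs(nested))
print(crossing_pairs(concurrent))
```

The quadratic loop is adequate for the handful of annotations in a sentence; a stack-based scan would be a better fit for large documents.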
19. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Representing overlapping hierarchies
with markup (1/2)
Solutions advertised by the TEI
• Multiple encoding of the same information
– One XML document per annotation
• Boundary marking with empty “milestone”
elements
– Also used by XLIFF
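Boundary marking can be illustrated with a small generator that emits empty elements at the start and end of each annotation. This is a sketch only: the element and attribute names (`seg`, `milestone`, `type`, `ref`) are hypothetical and not taken from the TEI or XLIFF vocabularies.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def with_milestones(text, spans):
    """Mark each span (id -> (begin, end)) with empty start/end elements."""
    events = []
    for span_id, (begin, end) in spans.items():
        ref = quoteattr(span_id)
        # At equal offsets, end markers (0) sort before start markers (1).
        events.append((end, 0, f'<milestone type="end" ref={ref}/>'))
        events.append((begin, 1, f'<milestone type="start" ref={ref}/>'))
    events.sort()

    out, pos = [], 0
    for offset, _, tag in events:
        out.append(escape(text[pos:offset]))
        out.append(tag)
        pos = offset
    out.append(escape(text[pos:]))
    return "<seg>" + "".join(out) + "</seg>"


text = "Welcome to Prague!"
xml = with_milestones(text, {"a": (0, 10), "b": (8, 17)})
print(xml)

# The result is well-formed even though spans "a" and "b" overlap.
root = ET.fromstring(xml)
```

Since the markers are empty elements, the overlapping ranges never have to nest; the price is that generic XML tools no longer see the annotated ranges as elements.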
20. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Representing overlapping hierarchies
with markup (2/2)
Solutions advertised by the TEI
• Fragmentation and reconstitution of virtual
elements
– One hierarchy explicit, others with interrelated
marked-up spans
• Stand-off markup
– Separation of text and annotations, interlinked via
anchor and reference
– Cf. Word example
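The stand-off idea can be sketched independently of any file format: the text keeps only anchors, and annotations are stored separately and point to an anchor by identifier. The class and field names below are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anchor:
    anchor_id: str
    begin: int
    end: int


@dataclass(frozen=True)
class Annotation:
    anchor_id: str   # reference to an Anchor
    prop: str
    value: str


def resolve(text, anchors, annotations):
    """Join annotations with the text ranges their anchors point to."""
    by_id = {a.anchor_id: a for a in anchors}
    resolved = []
    for note in annotations:
        anchor = by_id[note.anchor_id]
        resolved.append((text[anchor.begin:anchor.end], note.prop, note.value))
    return resolved


text = "Welcome to Prague!"
anchors = [Anchor("0", 11, 17), Anchor("1", 8, 17)]
annotations = [
    Annotation("0", "taClassRef", "http://schema.org/Place"),
    Annotation("1", "note", "overlaps anchor 0"),
]
print(resolve(text, anchors, annotations))
```

Overlapping anchors are unproblematic here because the annotations never become part of the element hierarchy of the content.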
21. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Representing overlapping hierarchies
in RDF
POWLA (cf. Chiarcos, 2012)
• RDF representation for corpus annotation,
based on PAULA XML Standoff format
• Allows representing hierarchical, multi-layer
corpora in RDF and querying them in SPARQL
• Not relevant for roundtripping, but for
linguistic annotation representation and
processing in RDF
22. Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Lessons learned
• Choose the overlap solution that fits your
roundtripping modelling and processing needs
• Consider off-the-shelf tooling
– For 100% hierarchical data: XPath / CSS selectors, DOM, …
• Consider libraries
– For extraction only: Tika http://tika.apache.org/
– For roundtripping: Okapi http://okapi.opentag.com/ - in
FREME currently being adapted for roundtripping in
selected formats
• Make sure the annotation survives in the original
format – cf. Word example
– Soon to be made easier by using Okapi
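For fully hierarchical data, off-the-shelf tooling is often enough. As a small illustration, the limited XPath support in Python's `xml.etree.ElementTree` can read annotations back from enriched content. The sketch assumes well-formed XML input (real-world HTML would need an HTML parser), and the helper name `extract_entities` is hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_entities(fragment):
    """Return (text, itemtype) for every span that carries an itemtype."""
    root = ET.fromstring(fragment)
    return [
        ("".join(span.itertext()), span.get("itemtype"))
        for span in root.findall(".//span[@itemtype]")
    ]


enriched = (
    '<p>Welcome to <span itemtype="http://schema.org/Place">Prague</span>!</p>'
)
entities = extract_entities(enriched)
print(entities)
```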