Many applications handle XML documents in which multiple overlapping hierarchies are necessary, and rely on a number of workarounds to force overlaps into the single hierarchy of an XML format. Although these workarounds are transparent to users, they are very difficult to handle for applications that read these formats. This paper proposes an approach to document markup based on Semantic Web technologies. Our model allows the same expressiveness as XML and any other hierarchical meta-markup language and, rather than requiring complex workarounds, allows the explicit expression of overlapping structures in such a way that searching and manipulating these structures does not require any specific tool or language. By simply using mainstream technologies such as OWL and SPARQL, our model – called EARMARK (Extremely Annotational RDF Markup) – can perform rather sophisticated tasks with no special tricks.
Handling Markup Overlaps Using OWL
1. Handling markup
overlaps using OWL
Angelo Di Iorio (diiorio@cs.unibo.it)
Silvio Peroni (speroni@cs.unibo.it)
Fabio Vitali (fabio@cs.unibo.it)
http://creativecommons.org/licenses/by-sa/3.0
2. Summary
• Overlapping markup in everyday life
• EARMARK: an OWL-based meta-markup language
• Conclusions and future work
3. Overlapping markup... wait, what?
• A definition: overlapping markup “describes cases where some markup
structures do not nest neatly into others”
DeRose, S. (2004). Markup Overlap: A Review and a Horse. In Proceedings of Extreme Markup Languages 2004. Montreal,
Canada.
<body>
<p>Some <em>very</p>
<p>interesting</em> text</p>
</body>
• Different techniques to embed overlap in XML hierarchies, for instance:
✦ milestones – expressed through empty elements to mark the boundaries of the content
<body>
<p>Some <em start="id1"/>very</p>
<p>interesting<em end="id1"/> text</p>
</body>
✦ fragmentation – expressed by two non-overlapping elements linked through id-idref pairs
<body>
<p>Some <em id="em1" next="em2">very</em></p>
<p><em id="em2">interesting</em> text</p>
</body>
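To give a feel for the extra work a reader of such a format has to do, here is a minimal Python sketch that reassembles the logical em element from the fragmentation example. The function name join_fragments is hypothetical, and the sketch assumes that every fragment carries an id and that the next chains contain no cycles.

```python
import xml.etree.ElementTree as ET

# The fragmentation example from the slide, with straight quotes.
FRAGMENTED = """<body>
<p>Some <em id="em1" next="em2">very</em></p>
<p><em id="em2">interesting</em> text</p>
</body>"""


def join_fragments(xml_text, tag="em"):
    """Return one list of text fragments per logical element.

    Assumption: fragments are chained through id/next attributes,
    as in the slide; this is an illustration, not a general tool.
    """
    root = ET.fromstring(xml_text)
    by_id = {el.get("id"): el for el in root.iter(tag)}
    # Any id that is the target of a "next" is a continuation, not a head.
    continued = {el.get("next") for el in by_id.values() if el.get("next")}
    logical = []
    for el_id, el in by_id.items():
        if el_id in continued:
            continue
        parts = []
        current = el
        while current is not None:
            parts.append("".join(current.itertext()))
            nxt = current.get("next")
            current = by_id.get(nxt) if nxt else None
        logical.append(parts)
    return logical


print(join_fragments(FRAGMENTED))
```

The application, not the markup, has to know that the two em elements are one emphasis spanning two paragraphs.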
6. Overlapping everywhere
• Where we can find it: word processor formats + change tracking (e.g., ODT)
<office:text>
<text:changed-region text:id="S1">
<text:insertion>
<office:change-info>
<dc:creator>John Smith</dc:creator>
<dc:date>2009-10-27T18:45:00</dc:date>
</office:change-info>
</text:insertion>
</text:changed-region>
<text:p>
The beginning and
<text:change-start text:change-id="S1"/>
</text:p>
<text:p>
also
<text:change-end text:change-id="S1"/>
the end.
</text:p>
</office:text>
What the document is
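As a rough illustration of how an application has to recover the changed region from the milestone elements, the following Python sketch walks the tree in document order and collects the text between text:change-start and text:change-end. The helper names walk and changed_text are hypothetical, and the sample adds the namespace declarations that the slide omits.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"

# Body of the slide's example, with namespaces declared so it parses.
DOC = f"""<office:text xmlns:office="{OFFICE_NS}" xmlns:text="{TEXT_NS}">
<text:p>The beginning and <text:change-start text:change-id="S1"/></text:p>
<text:p>also <text:change-end text:change-id="S1"/>the end.</text:p>
</office:text>"""


def walk(el):
    """Yield ("elem", element) and ("text", string) events in document order."""
    yield ("elem", el)
    if el.text:
        yield ("text", el.text)
    for child in el:
        yield from walk(child)
        if child.tail:
            yield ("text", child.tail)


def changed_text(root, change_id):
    """Collect the text lying between the two milestones of one change."""
    start = "{%s}change-start" % TEXT_NS
    end = "{%s}change-end" % TEXT_NS
    id_attr = "{%s}change-id" % TEXT_NS
    inside = False
    chunks = []
    for kind, value in walk(root):
        if kind == "elem":
            if value.tag == start and value.get(id_attr) == change_id:
                inside = True
            elif value.tag == end and value.get(id_attr) == change_id:
                inside = False
        elif inside:
            chunks.append(value)
    # Normalise whitespace; the paragraph break inside the change is lost here.
    return " ".join("".join(chunks).split())


print(changed_text(ET.fromstring(DOC), "S1"))
```

Note that the region crosses a paragraph boundary, which is exactly what the tree structure cannot express directly.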
What the document represents
[Figure: two hierarchies over the same text, each rooted at office:text. Labels recoverable from the slide: "before", with a single text:p over "The beginning and the end."; "after", with two text:p elements; and the string "also", inserted by John Smith on 2009-10-27T18:45:00.]
10. • EARMARK is a vocabulary that defines a meta-markup language by means of OWL
ontologies – http://www.essepuntato.it/2008/12/earmark
• It is more expressive than XML
                  XML                     EARMARK
Data structure    Tree                    DAG
Overlapping       Only by using tricks    Of course, it is a feature here
Semantics         What?                   Yes, it is OWL!
• Three disjoint base classes:
✦ Docuverse – it represents the textual content of a document
Subclasses: StringDocuverse, URIDocuverse
✦ Range – it describes any text lying between two locations
Subclasses: PointerRange, XPathRange, XPathPointerRange
✦ MarkupItem – a collection of individuals belonging to the classes MarkupItem and
Range
Subclasses: Element, Attribute, Comment
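The three classes can be pictured with a small Python sketch. This is not the EARMARK ontology or its Java API, only an illustrative model; the class and field names are borrowed from the slide, and the half-open, zero-based offsets are an assumption. It encodes the overlapping p/em example from the earlier slides without any workaround, because ranges are shared between markup items (a DAG rather than a tree).

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class StringDocuverse:
    """The textual content of a document."""
    content: str


@dataclass(frozen=True)
class PointerRange:
    """The text lying between two locations of a docuverse."""
    docuverse: StringDocuverse
    begins: int
    ends: int

    def text(self) -> str:
        # Assumption: offsets behave like Python slice bounds.
        return self.docuverse.content[self.begins:self.ends]


@dataclass
class Element:
    """A markup item: an ordered collection of markup items and ranges."""
    general_identifier: str
    children: List[Union["Element", PointerRange]] = field(default_factory=list)

    def text(self) -> str:
        return "".join(child.text() for child in self.children)


doc = StringDocuverse("Some very interesting text")
some = PointerRange(doc, 0, 5)
very = PointerRange(doc, 5, 9)
gap = PointerRange(doc, 9, 10)
interesting = PointerRange(doc, 10, 21)
tail = PointerRange(doc, 21, 26)

# "very" and "interesting" each belong to two elements at once.
p1 = Element("p", [some, very])
p2 = Element("p", [interesting, tail])
em = Element("em", [very, gap, interesting])

print(p1.text(), "|", em.text(), "|", p2.text())
```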
12. An example
@prefix earmark: <http://www.essepuntato.it/2008/12/earmark#> .
@prefix : <http://www.example.com/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
:aDoc a earmark:StringDocuverse
; earmark:hasContent "The beginning and the end."^^xsd:string .
The beginning and the end.
18. An example
@prefix earmark: <http://www.essepuntato.it/2008/12/earmark#> .
@prefix : <http://www.example.com/> .
@prefix c: <http://swan.mindinformatics.org/ontologies/1.2/collections/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
:aDoc a earmark:StringDocuverse
; earmark:hasContent "The beginning and the end."^^xsd:string .
:r2 a earmark:PointerRange
; earmark:refersTo :aDoc
; earmark:begins "14"^^xsd:nonNegativeInteger
; earmark:ends "26"^^xsd:nonNegativeInteger .
:aMarkupItem a earmark:Element
; earmark:hasGeneralIdentifier "p"
; earmark:hasNamespace
"urn:oasis:names:tc:opendocument:xmlns:text:1.0"
; c:firstItem :item1
; c:lastItem :item2 .
:item1 c:itemContent :r1
; c:nextItem :item2 .
:item2 c:itemContent :r2 .
:p2 a Insertion ; dc:creator "John Smith"
; dc:date "2009-10-27T18:45:00"^^xsd:dateTime .
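To make the two mechanisms in this example concrete, the sketch below stores a subset of the statements as plain Python tuples and then (a) follows the c:firstItem / c:nextItem chain to list the content of :aMarkupItem in order and (b) resolves the offsets of :r2 against the docuverse. The helper names are hypothetical, no RDF library is used, and offsets are assumed to be zero-based and half-open. The range :r1 is referenced but not defined on the slide, so it is left unresolved here as well.

```python
# A subset of the statements above as (subject, predicate, object) tuples.
TRIPLES = {
    (":aDoc", "earmark:hasContent", "The beginning and the end."),
    (":r2", "earmark:refersTo", ":aDoc"),
    (":r2", "earmark:begins", 14),
    (":r2", "earmark:ends", 26),
    (":aMarkupItem", "c:firstItem", ":item1"),
    (":aMarkupItem", "c:lastItem", ":item2"),
    (":item1", "c:itemContent", ":r1"),
    (":item1", "c:nextItem", ":item2"),
    (":item2", "c:itemContent", ":r2"),
}


def value(subject, predicate):
    """Return the single object of (subject, predicate), or None."""
    matches = [o for (s, p, o) in TRIPLES if s == subject and p == predicate]
    return matches[0] if matches else None


def children(markup_item):
    """Walk the linked list of items and return their contents in order."""
    contents = []
    node = value(markup_item, "c:firstItem")
    while node is not None:
        contents.append(value(node, "c:itemContent"))
        node = value(node, "c:nextItem")
    return contents


def range_text(range_id):
    """Resolve a pointer range against the content of its docuverse."""
    content = value(value(range_id, "earmark:refersTo"), "earmark:hasContent")
    return content[value(range_id, "earmark:begins"):value(range_id, "earmark:ends")]


print(children(":aMarkupItem"))
print(range_text(":r2"))
```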
19. EARMARK Data Structure
• It is an API and a Java library that makes it easy to create and
modify EARMARK documents within Java applications
• Open Source project: http://earmark.sourceforge.net
EARMARKDocument ed = new EARMARKDocument(new URI("http://www.example.com"));
Docuverse aDoc =
ed.createStringDocuverse("The beginning and the end.");
[...]
Range aRange = ed.createPointerRange(aDoc, 14, 26);
[...]
Element aMarkupItem =
ed.createElement("p", "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
Collection.Type.List);
ed.appendChild(anotherMarkupItem);
[...]
20. Semantic Web technologies as added value
• Because every EARMARK document is expressed as a proper ABox of an ontology,
we can use Semantic Web technologies:
✦ to manipulate documents
✦ to query them
✦ to infer new assertions
✦ to check some integrity constraints on document structure and on content semantics
• In EARMARK, those technologies can be very helpful in solving issues that are
difficult to solve or are not solvable at all by using XML tools
• An example: “get all the text fragments inserted by John Smith”
✦ XPath
for $id in //@text:id[../text:insertion//(dc:creator[. = 'John Smith'] |
@office:chg-author[. = 'John Smith'])]
return //text:p//text()[(preceding-sibling::text:change-start[1][@text:change-id = $id]
and following-sibling::text:change-end[1][@text:change-id = $id])
or ancestor::text:changed-region/@text:id = $id]
✦ SPARQL
SELECT ?r WHERE { ?r a earmark:Range , Insertion ; dc:creator "John Smith" . }
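The SPARQL query is short because it is a plain conjunction of three triple patterns on the same subject. The Python sketch below mimics that conjunction over an in-memory list of tuples. It is not a SPARQL engine, and the sample data (including the second creator, "Jane Doe") is invented purely for illustration.

```python
# Hypothetical data: each tuple is (subject, predicate, object).
TRIPLES = [
    (":r1", "rdf:type", "earmark:Range"),
    (":r1", "rdf:type", "Insertion"),
    (":r1", "dc:creator", "John Smith"),
    (":r2", "rdf:type", "earmark:Range"),
    (":r2", "rdf:type", "Insertion"),
    (":r2", "dc:creator", "Jane Doe"),
    (":r3", "rdf:type", "earmark:Range"),
    (":r3", "dc:creator", "John Smith"),
]


def ranges_inserted_by(triples, creator):
    """Subjects that are a Range, an Insertion, and created by `creator`."""
    facts = set(triples)
    subjects = {s for (s, _, _) in facts}
    return sorted(
        s for s in subjects
        if (s, "rdf:type", "earmark:Range") in facts
        and (s, "rdf:type", "Insertion") in facts
        and (s, "dc:creator", creator) in facts
    )


print(ranges_inserted_by(TRIPLES, "John Smith"))
```

Because the insertion is asserted directly on the range, no navigation between milestone elements is needed, in contrast with the XPath version.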
23. Conclusions and
future work
• We presented a new meta-markup language called EARMARK,
defined by means of OWL ontologies, that makes it possible to express very
complex markup documents
• We applied it to a real-world scenario (ODT format with change
tracking), showing that it makes complex documents easier to handle,
manipulate and query than XML does
• Future work on this topic includes:
✦ Rocco and Fretta, two ongoing projects that support transformations from
XML documents (with overlapping markup specified by using tricks) to
EARMARK documents, and vice versa
✦ a formalism to explicitly specify the semantics of markup and of textual content
✦ a word processor for defining EARMARK documents in a very
simple way, with the possibility of adding any kind of semantic assertion to any
entity of the document (both markup items and textual content)
24. Thanks for your attention
I think it’s time for questions :-)
25. Late time example:
A more complex ODT document...
<office:text>
<text:changed-region text:id="S2">
  <text:deletion><office:change-info>
      <dc:creator>Silvio Peroni</dc:creator>
      <dc:date>2009-10-27T18:45:00</dc:date>
    </office:change-info><text:p>.</text:p></text:deletion>
  <text:insertion>
    <office:change-info office:chg-author="Angelo Di Iorio"
      office:chg-date-time="2009-10-27T18:42:00"/>
  </text:insertion>
</text:changed-region>
<text:changed-region text:id="A2">
  <text:insertion><office:change-info>
      <dc:creator>Angelo Di Iorio</dc:creator>
      <dc:date>2009-10-27T18:42:00</dc:date>
    </office:change-info></text:insertion>
</text:changed-region>
[...]
<text:p>This is one paragraph<text:change-start text:change-id="S1"/>;
  actually, it was!<text:change-end text:change-id="S1"/>
  <text:change text:change-id="S2"/>
<text:change-start text:change-id="A2"/></text:p>
<text:p><text:change-end text:change-id="A2"/>
  <text:change text:change-id="A3"/><text:change-start text:change-id="A4"/>S
  <text:change-end text:change-id="A4"/>plit in two.</text:p>
</office:text>
26. ... and its representation in EARMARK
[Figure: the document above represented in EARMARK, laid out along a TIME axis in four columns: docuverses, ranges, markup items, assertions. Docuverse contents shown include "This is one paragraph that will be split in two.", "; actually, it was!" and ".S". Ranges r1 to r6 point into the docuverses and are collected by p and text markup items. The assertions are typed text:insertion or text:deletion and carry dc:creator ("Silvio Peroni", "Angelo Di Iorio") and dc:date ("2009-10-27T18:45:00", "2009-10-27T18:42:00") values. Legend: docuverse content, begin location, end location, string in the range.]