This tutorial is given by Sebastian Hellmann from the NLP2RDF Group at AKSW.
The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations. NIF consists of specifications, ontologies and software (overview), which are combined under the version identifier “NIF 2.0”. Links:
http://nlp2rdf.org
http://persistence.uni-leipzig.org/nlp2rdf/
NIF 2.0 Tutorial: Content Analysis and the Semantic Web
1. NIF Tutorial – 2013/09/24 – Page 1 http://lod2.eu
Creating Knowledge out of Interlinked Data
Sebastian Hellmann
AKSW, Universität Leipzig
http://nlp2rdf.org
http://lod2.eu
http://slideshare.net/kurzum
Sebastian Hellmann: researcher working on the LOD2 EU project
AKSW: Agile Knowledge Engineering and Semantic Web research group in Leipzig - http://aksw.org
InfAI: Institute for Applied Informatics - http://infai.org
ALL DEMOS ARE AVAILABLE AT:
http://nlp2rdf.org/leipzig-24-9-2013
Introduction
Barriers to NLP
End users have tasks for NLP, but each new tool is a challenge:
• How to download and start it?
• What kind of annotations does it use?
• How well does it perform (on my domain)?
• If badly, are there any alternatives? How can I find them?
• Is it open source?
• Lots of know-how is needed to exploit NLP.
• Lots of data is needed to exploit NLP.
The Semantic Gap
From a walled garden to an interoperable infrastructure
• Part 1: Exploiting free, open and interoperable (FOI) language resources
• Part 2: Connecting text to these resources
• Part 3: Tools, demos, infrastructure
Part 1: Exploiting free, open and interoperable (FOI) language resources
http://lod-cloud.net
Linguistic/NLP data is currently filed under “cross-domain”.
Linked Open Data (http://lod-cloud.net)
- All datasets provide open access to individual records via HTTP.
- Many are free (no payment required, as in royalty-free).
- Some are openly licensed, e.g. CC-0 or CC-BY-SA.
=> Open access also applies to HTML published on the WWW, but in LOD the data itself is published unrendered, as RDF.
Question:
• Who knows how to add a new bubble to the LOD cloud?
http://datahub.io/group/linguistics
https://github.com/jmccrae/llod-cloud.py
http://validator.lod-cloud.net/validate.php
FOI data
Question:
• What are the most important data sets and ontologies for NLP?
• Who has used what?
FOI data 1: Wikipedia / DBpedia
Analysis of mentions of Wikipedia / DBpedia at LREC 2012:
• https://www.google.com/webhp?q=site:http%3A%2F%2Fwww.lrec-conf.org%2
→ 163 papers
• https://www.google.com/webhp?q=site:http%3A%2F%2Fwww.lrec-conf.org%2
→ 24 papers
FOI data 1: Wikipedia / DBpedia
• Training data for NLP, e.g. URI, surrounding text, surface form (sf)
• Probabilities:
  • P(URI|sf): probability that “apple” in text refers to wikipedia:Apple_Inc.
  • P(sf|URI): probability that wikipedia:Apple_Inc. appears as “apple” in text
http://wiki.dbpedia.org/Datasets/NLP
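A minimal sketch of how the two probabilities could be estimated from (surface form, URI) pairs, reading P(a|b) as "probability of a given b". The pairs and their counts are invented for illustration and are not taken from the DBpedia NLP datasets; the function names are hypothetical.

```python
from collections import Counter

# Hypothetical (surface form, URI) pairs, as they might be harvested from
# Wikipedia link anchors. The counts are invented for illustration only.
pairs = (
    [("apple", "wikipedia:Apple_Inc.")] * 6
    + [("apple", "wikipedia:Apple")] * 4
    + [("Apple Inc.", "wikipedia:Apple_Inc.")] * 10
)

pair_counts = Counter(pairs)
sf_counts = Counter(sf for sf, _ in pairs)
uri_counts = Counter(uri for _, uri in pairs)


def p_uri_given_sf(uri: str, sf: str) -> float:
    """Relative frequency with which the surface form links to the URI."""
    return pair_counts[(sf, uri)] / sf_counts[sf]


def p_sf_given_uri(sf: str, uri: str) -> float:
    """Relative frequency with which the URI is written as the surface form."""
    return pair_counts[(sf, uri)] / uri_counts[uri]


print(p_uri_given_sf("wikipedia:Apple_Inc.", "apple"))
print(p_sf_given_uri("apple", "wikipedia:Apple_Inc."))
```

The two values differ because they are normalized by different totals: all occurrences of the surface form in one case, all links to the URI in the other.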
FOI data: Wikipedia / DBpedia
http://lookup.dbpedia.org/api/search.asmx/KeywordSearch?QueryString=sodium
Available data for “Sodium”
http://dbpedia.org/snorql
# Labels of the resource
SELECT ?labels WHERE {
  <http://dbpedia.org/resource/Sodium> rdfs:label ?labels .
} LIMIT 100
# Alternative labels: labels of the pages that redirect to Sodium
SELECT ?altlabel WHERE {
  ?redirect dbpedia-owl:wikiPageRedirects <http://dbpedia.org/resource/Sodium> .
  ?redirect rdfs:label ?altlabel .
} LIMIT 100
http://lcl.uniroma1.it/babelnet/explore.jsp?word=sodium&lang=EN
Lemon Ontology - http://lemon-model.net
IntersectiveDataPropertyAdjective("extinct", dbpedia:conservationStatus, "EX")
IntersectiveDataPropertyAdjective("endangered", dbpedia:conservationStatus, "EN")
https://github.com/cunger/lemon.dbpedia
Christina Unger, John McCrae, Sebastian Walter, Sara Winter and Philipp Cimiano (2013): A lemon lexicon for DBpedia. NLP & DBpedia Workshop
Part 2: Connecting text to these resources
https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki
Overview of existing tools:
• http://en.wikipedia.org/wiki/Knowledge_extraction#Tools
Developer's nightmare:
• All tools belong to a similar class of NLP tools
→ Wikifier or Named Entity Linking, SOA principle
But they all have:
• Heterogeneous output formats (JSON, XML)
• Heterogeneous API parameters
• Heterogeneous ways of annotating text:
  • Some remove HTML internally, so the offsets are not usable
  • Some use byte offsets instead of character offsets
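The byte versus character offset mismatch can be illustrated with a short sketch. It assumes the tool reports offsets into the UTF-8 encoding of the text; the helper name byte_to_char_offset is hypothetical and not part of any of the tools mentioned.

```python
def byte_to_char_offset(text: str, byte_offset: int) -> int:
    """Convert an offset counted in UTF-8 bytes into one counted in code points."""
    prefix = text.encode("utf-8")[:byte_offset]
    # decode() raises UnicodeDecodeError if the byte offset falls inside
    # a multi-byte character, which signals an invalid offset.
    return len(prefix.decode("utf-8"))


text = "Universität Leipzig"
mention = "Leipzig"

start_char = text.index(mention)                                  # character offset
start_byte = text.encode("utf-8").index(mention.encode("utf-8"))  # byte offset

print(start_char, start_byte, byte_to_char_offset(text, start_byte))
```

Both offsets agree for pure ASCII text and drift apart after the first non-ASCII character, which is why annotations from different tools cannot be merged without knowing how each one counts.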
Demo
• http://rdface.aksw.org/new/tinymce/examples/rdface.html
ITS 2.0 - http://www.w3.org/TR/its20/
The Internationalization Tag Set (ITS) 2.0 enhances the foundation to integrate automated processing of human language into core Web technologies.
• Currently in Last Call
• Driven by the localization industry
• Embeds translation aids into HTML and XML
• Robust way to encode NLP information in HTML
• ITS 2.0 describes 20 data categories → ontology
NIF overview
Summary
• Motivated the walled garden problem
• Gave an overview of the emerging Web of language resources
• Motivated the NLP tool heterogeneity problem
• Introduced the ITS 2.0 use case for NIF
• Now: NIF 2.0
NIF Overview
The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations.
• Reuses existing standards such as RDF, OWL 2, the PROV Ontology, LAF (ISO 24612), Unicode and RFC 5147
• Standardizes access parameters, annotations (e.g. tokenization), validation and log messages
• A NIF workflow, however, cannot provide better performance (F-measure, speed) than a properly configured UIMA or GATE pipeline with the same components
• Lowers the entry barrier and offers easy data integration, reusability of tools and conceptualisation, and off-the-shelf solutions for common tasks
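To make the reuse of RFC 5147 concrete, here is a minimal sketch that addresses a substring of a plain-text document with a "char=begin,end" fragment identifier. It assumes that a NIF-style string URI is simply the document URI plus such a fragment; the helper name, the example URI and the example sentence are made up for illustration.

```python
def char_fragment_uri(document_uri: str, text: str, begin: int, end: int) -> str:
    """Build a URI with an RFC 5147 character-range fragment for text[begin:end]."""
    if not 0 <= begin <= end <= len(text):
        raise ValueError("offsets out of range")
    return f"{document_uri}#char={begin},{end}"


text = "My favourite actress is Natalie Portman."
mention = "Natalie Portman"

begin = text.index(mention)   # offsets are counted in code points
end = begin + len(mention)

uri = char_fragment_uri("http://example.org/doc.txt", text, begin, end)
print(uri)
```

Because the offsets are part of the URI, any tool that counts characters the same way can point at exactly the same substring.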
Relation of NIF to UIMA and GATE
• A Formal Framework for Linguistic Annotation (2000) by Steven Bird and Mark Liberman
  • Take-home message: generic annotation formats should be based on graphs
• Ontologies in NIF (e.g. OLiA, lemon) can be hard-compiled for internal use (as is done in Stanbol)
WP3 Task 3.2 - Community work: NLP2RDF
Not primarily aimed at increasing features or performance (F-measure)
NIF Overview
• NIF turns out to have a unique selling proposition regarding NLP and RDF
• NIF will be the recommended RDF conversion of the Internationalization Tag Set 2.0 of the W3C (ITS 2.0) - http://www.w3.org/TR/its20/
• There was no alternative RDF vocabulary available for this conversion.
NIF Overview
Available resources:
http://persistence.uni-leipzig.org/nlp2rdf/
Disclaimer
Migration to the online presence is still ongoing, but there are 15 scientific publications, e.g.
Sebastian Hellmann, Jens Lehmann, Sören Auer and Martin Brümmer (2013): Integrating NLP using Linked Data. 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia - http://svn.aksw.org/papers/2013/ISWC_NIF/public.pdf
NIF Basics
Question:
• What is a String?
NIF Basics: Unicode
Counting the characters of a string is more difficult than it seems:
• Three ways to count Unicode:
  • Code units
  • Code points
  • Graphemes
• Encoding:
  • UTF-8, UTF-16, UTF-32
Unicode
• Code Unit. The minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form.
• Code Point. (1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF (hexadecimal). Not all code points are assigned to encoded characters. See code point type. (2) A value, or position, for a character, in any coded character set.
• Unicode Normal Form C
  • http://unicode.org/reports/tr15/#Norm_Forms
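The three ways of counting can be compared on one short string. This is a sketch using only the Python standard library; the grapheme count is a deliberate approximation (code points that are not combining marks) and does not implement the full segmentation rules of Unicode. All function names are hypothetical.

```python
import unicodedata


def count_code_units(s: str, encoding: str) -> int:
    """Number of code units of s in 'utf-8', 'utf-16' or 'utf-32'."""
    width = {"utf-8": 1, "utf-16": 2, "utf-32": 4}[encoding]
    # The explicit little-endian codecs do not prepend a byte order mark.
    codec = encoding if encoding == "utf-8" else encoding + "-le"
    return len(s.encode(codec)) // width


def count_code_points(s: str) -> int:
    """Python 3 strings are sequences of code points, so len() counts them."""
    return len(s)


def count_graphemes_approx(s: str) -> int:
    """Approximation only: counts code points that are not combining marks."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


# 'e' + COMBINING ACUTE ACCENT + MUSICAL SYMBOL G CLEF (outside the BMP)
s = "e\u0301\U0001D11E"

for enc in ("utf-8", "utf-16", "utf-32"):
    print(enc, count_code_units(s, enc))
print("code points", count_code_points(s))
print("graphemes (approx.)", count_graphemes_approx(s))
```

Only the code point count is independent of the chosen encoding, which is one reason to standardize on it for offsets.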
Unicode Normal Form C
• Recommended for RDF literals
• http://unicode.org/reports/tr15/#Norm_Forms
Unicode
• NIF uses Unicode Normal Form C
• NIF counts in code points
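A small sketch of what these two rules mean in practice: normalize to NFC first, then count code points. The helper name nif_length is hypothetical; unicodedata.normalize is part of the Python standard library.

```python
import unicodedata


def nif_length(s: str) -> int:
    """Length in code points after normalization to NFC (hypothetical helper)."""
    return len(unicodedata.normalize("NFC", s))


decomposed = "Universita\u0308t"  # 'a' followed by COMBINING DIAERESIS
composed = "Universit\u00e4t"     # precomposed 'ä'

print(len(decomposed), len(composed))                # raw code point counts differ
print(nif_length(decomposed), nif_length(composed))  # counts agree after NFC
```

Without the normalization step, two strings that render identically would yield different offsets for everything after the first accented character.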
Demo
• Sadly, there are still implementation problems:
  • Java length() vs. PHP strlen() function
  • curl --data-urlencode i="대" -d f=text "http://nlp2rdf.lod2.eu/nif-ws.php"
  • The Korean character is URL-encoded (#%EB%8C%80) and counted as 3 characters, because PHP strlen() counts bytes rather than code points
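The mismatch can be reproduced without the web service. The sketch below counts the same Korean character in the three units involved: code points (what NIF counts), UTF-16 code units (what Java's length() returns) and UTF-8 bytes (what PHP's strlen() returns).

```python
from urllib.parse import quote

ch = "대"  # the Korean character from the curl call

code_points = len(ch)                           # what NIF counts
utf16_units = len(ch.encode("utf-16-le")) // 2  # what Java length() returns
utf8_bytes = len(ch.encode("utf-8"))            # what PHP strlen() returns
url_encoded = quote(ch)                         # percent-encoded UTF-8 bytes

print(code_points, utf16_units, utf8_bytes, url_encoded)
```

The three bytes of the percent-encoded form are exactly the three "characters" reported by a byte-counting implementation.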
Context
• Now some RDF (finally):
• Note that in NIF the document is not the same as the content of the document.
  • Two different documents can have the same content
  => they must not have the same URI
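A minimal sketch of this distinction, using plain Python tuples as triples instead of an RDF library. It assumes the NIF core namespace and its terms Context and isString, and it assumes that a context URI is the document URI plus a character-range fragment; the document URIs and the helper name are made up.

```python
# Assumed namespace of the NIF 2.0 core ontology.
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def context_triples(document_uri: str, content: str):
    """Describe the content of one document as a context resource (sketch)."""
    context_uri = f"{document_uri}#char=0,{len(content)}"
    return [
        (context_uri, RDF_TYPE, NIF + "Context"),
        (context_uri, NIF + "isString", content),
    ]


content = "My favourite actress is Natalie Portman."
doc_a = context_triples("http://example.org/a.txt", content)
doc_b = context_triples("http://example.org/b.txt", content)

for triple in doc_a + doc_b:
    print(triple)
```

Both contexts carry the same string, but each keeps its own URI, so annotations attached to one document never leak into the other.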
Tokenization
Christian Chiarcos, Julia Ritz and Manfred Stede (2012): By all these lovely tokens... Merging conflicting tokenizations. Language Resources and Evaluation 46(1): 53-74
Reasoning
• Demo
• Load terminological model or inference model
Open Community – All feedback is welcome!
http://slideshare.net/kurzum
Websites:
http://dbpedia.org
http://nlp2rdf.org
http://lod2.eu
Thanks for your attention