2. Rubén Izquierdo Beviá
About me
5-year degree in Computer Science (University of
Alicante, Alicante, Spain)
National NLP projects and 1 European project (QALLME)
(University of Alicante, Alicante, Spain)
Thesis about NLP & Word Sense Disambiguation (University
of Alicante, Alicante, Spain. Sept 2010)
Postdoc position at DutchSemCor Project (University of
Tilburg, Tilburg. Sept 2011-Sept 2012)
Postdoc position at OpeNER Project (Vrije
Universiteit, Amsterdam. Sept 2012-)
3. CLTL software
In general, a common input/output format
KAF
NAF, as an extension of KAF
Single components performing single tasks
Integration of existing modules
Adaptation of input/output formats
Development of new ones
4. KAF
Kyoto Annotation Format
Stand-off, layered, XML-based representation format
Different types of information are stored in different layers
Layers are linked by means of references
Suitable for creating pipelines based on this format
Layers:
Text: tokens
Terms: lemmas, part-of-speech, term sentiment, word senses
Entities, chunks, opinions…
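The layered, stand-off design can be illustrated with a small sketch. The document below is a simplified, KAF-like example written for this illustration (a token layer and a term layer only, with illustrative attribute values), not a complete or validated KAF file. The helper name resolve_terms is hypothetical. It shows how a term in one layer points back to tokens in another layer through references.

```python
import xml.etree.ElementTree as ET

# Simplified KAF-like document (illustrative, not a full KAF file):
# the <text> layer holds tokens, the <terms> layer holds lemmas and
# part-of-speech, and each term refers to its tokens by id.
KAF_SAMPLE = """<KAF>
  <text>
    <wf wid="w1" sent="1">Hotels</wf>
    <wf wid="w2" sent="1">were</wf>
    <wf wid="w3" sent="1">clean</wf>
  </text>
  <terms>
    <term tid="t1" lemma="hotel" pos="N"><span><target id="w1"/></span></term>
    <term tid="t2" lemma="be" pos="V"><span><target id="w2"/></span></term>
    <term tid="t3" lemma="clean" pos="G"><span><target id="w3"/></span></term>
  </terms>
</KAF>"""


def resolve_terms(kaf_xml):
    """Return (term id, lemma, pos, surface text) for every term."""
    root = ET.fromstring(kaf_xml)
    # Index the token layer by token id.
    tokens = {wf.get("wid"): wf.text for wf in root.iter("wf")}
    resolved = []
    for term in root.iter("term"):
        # Follow the references from the term layer to the token layer.
        token_ids = [target.get("id") for target in term.iter("target")]
        surface = " ".join(tokens[token_id] for token_id in token_ids)
        resolved.append((term.get("tid"), term.get("lemma"), term.get("pos"), surface))
    return resolved


TERMS = resolve_terms(KAF_SAMPLE)
```

Because each layer only refers to the ids of a lower layer, a new module can add its own layer without modifying the layers written by earlier modules.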
6. NAF
NewsReader Annotation Format
Extension of KAF
Allows cross-document processing
Event coreference
IDs are converted into valid URIs
Stores the same type of information provided by different
tools
E.g. the results of two different POS taggers
7. How the software is provided I
All modules are publicly available on GitHub
CLTL GitHub
http://github.com/cltl
NewsReader GitHub
http://github.com/newsreader
OpeNER GitHub
http://github.com/opener-project/
8. How the software is provided II
Some are available as Web Services
Exposed as REST web services
Accept an input stream (KAF/NAF)
Generate an output stream (KAF/NAF)
Easy to call from the command line with curl
Easy to create module pipelines in the same way you create a
Linux command pipeline
http://wordpress.let.vupr.nl/web-services/
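The pipeline idea can be sketched in a few lines: each service receives a KAF/NAF document and returns an enriched one, so the output of one call becomes the input of the next. This is a minimal sketch under assumptions: the endpoint URLs are hypothetical placeholders, and it is assumed that each service reads the document from the body of a POST request. The actual services may expect a different request shape.

```python
import urllib.request


def post_kaf(url, document):
    """POST a KAF/NAF document (bytes) to a service and return the response body.

    Assumption: the service reads the document from the raw request body.
    """
    request = urllib.request.Request(
        url,
        data=document,
        headers={"Content-Type": "application/xml"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


def run_pipeline(document, urls, post=post_kaf):
    """Send the document through each service in order, like a shell pipe."""
    for url in urls:
        document = post(url, document)
    return document


# Hypothetical endpoints, used here only to show the call order.
PIPELINE = [
    "http://example.org/tokenizer",
    "http://example.org/pos-tagger",
]
```

The post argument is injectable so that the chaining logic can be tried without network access, for example by passing a function that returns a canned response.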
11. Our software I
General modules (integrated)
Tokenizers: whitespace-based, OpenNLP-trained...
Sentence splitters: rule-based, OpenNLP
POS taggers: TreeTagger, OpenNLP POS taggers
Chunker: trained on Alpino data with OpenNLP
Parsers: Alpino (nl), Stanford (en)
12. Our software II
General modules (developed by us)
WordNet tools
Functions to use a WordNet in LMF format
Word Sense Disambiguation systems
UKB: unsupervised
SVM: supervised (for nl, derived from DutchSemCor)
Multiword tagger
Tags multiword sequences of terms according to WordNet
OntoTagger
OntoTagger inserts (semantic) labels into the KAF representation on the basis
of lemma or WordNet synset representations of the text
13. Our software III
General modules (developed by us)
Named Entity Recognizer
Detects dates and locations using specific resources +
GeoNames
KyBot
Extracts tuples and relations based on a set of profiles formulated
using semantic and structural properties
14. Our software IV
OpeNER related (developed by us)
Hotel property tagger
Detects aspects related to
cleanliness, staff, breakfast, rooms…
Term polarity tagger
Positive/negative terms, intensifiers, negators …
Opinion miner
Detects opinions: target + holder + expression
2 rule-based versions // 1 machine-learning version
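To make the roles of polar terms, intensifiers and negators concrete, here is a toy sketch of lexicon-based scoring. It is not the OpeNER implementation: the lexicon entries, the weights and the rule that a modifier applies only to the next word are all assumptions made for this illustration.

```python
# Toy lexicons (illustrative values, not taken from any OpeNER resource).
POLARITY = {"clean": 1.0, "friendly": 1.0, "dirty": -1.0, "noisy": -1.0}
INTENSIFIERS = {"very": 2.0, "slightly": 0.5}
NEGATORS = {"not", "never"}


def score_tokens(tokens):
    """Sum term polarities, letting modifiers act on the next word only."""
    total = 0.0
    weight = 1.0  # Pending effect of the modifiers seen so far.
    for token in tokens:
        word = token.lower()
        if word in NEGATORS:
            weight *= -1.0  # A negator flips the sign.
        elif word in INTENSIFIERS:
            weight *= INTENSIFIERS[word]  # An intensifier scales the strength.
        else:
            # Polar words contribute their weighted score, other words
            # contribute nothing. Either way the pending modifiers are reset.
            total += weight * POLARITY.get(word, 0.0)
            weight = 1.0
    return total
```

For example, score_tokens("the room was not very clean".split()) combines the negator and the intensifier before reaching the polar word, giving a negative score for the sentence.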
15. Our software V
NewsReader related (developed by us)
Discourse Module
Splits incoming texts into headers and paragraphs
Factuality Classifier
Classifies whether a statement is factual/probable/possible or
not
Event Coreference
Compares descriptions of events within and across
documents to decide if they refer to the same events.