Proceedings of Recent Advances in Natural Language Processing, pages 40–45,
Varna, Bulgaria, Sep 4–6 2017.
https://doi.org/10.26615/978-954-452-049-6_006
An Extensible Multilingual Open Source Lemmatizer
Ahmet Aker (a, b), Johann Petrak (a), and Firas Sabbah (b)
(a) Department of Computer Science, University of Sheffield
(b) Department of Information Engineering, University of Duisburg-Essen
a.aker@is.inf.uni-due.de, johann.petrak@sheffield.ac.uk, firas.sabbah@stud.uni-due.de
Abstract
We present GATE DictLemmatizer, a multilingual open source lemmatizer for the GATE NLP framework that currently supports English, German, Italian, French, Dutch, and Spanish, and is easily extensible to other languages. The software is freely available under the LGPL license. The lemmatization is based on the Helsinki Finite-State Transducer Technology (HFST) and lemma dictionaries automatically created from Wiktionary. We evaluate the performance of the lemmatizers against TreeTagger, which is only freely available for research purposes. Our evaluation shows that DictLemmatizer achieves similar or even better results than TreeTagger for languages where there is support from HFST. The performance drops when there is no support from HFST and the entire lemmatization process is based on lemma dictionaries. However, the results are still satisfactory given the fact that DictLemmatizer is open-source and can be easily extended to other languages. The software for extending the lemmatizer by creating word lists from Wiktionary dictionaries is also freely available as open-source software.
1 Introduction
The process of lemmatization is an important part of many computational linguistics applications such as Information Retrieval (IR) and Natural Language Processing (NLP). In lemmatization, inflected forms of a lexeme are mapped to a canonical form that is referred to as the lemma. The task of finding the correct lemma for a word in context is often complicated by the fact that a word can be the inflected form of more than one lexeme, each of which may have a different lemma. Lemmas can be used in various ways for NLP, for instance, to improve the performance of text similarity metrics. For this application, all words are mapped to their lemma before a similarity is calculated. Lemmas are also often used in information retrieval and information extraction to better identify and group terms which occur in their inflected forms.
The task of finding lemmas is different from, and harder than, finding stems. Stemming is often used as a much cruder heuristic approach to map inflectional forms of words to some canonical form. Unlike lemmatization, it does not differentiate between different lexemes which could have the same inflectional form, and the stem of a word need not be a valid lexeme of the language.
The TreeTagger (Schmid, 2013) software provides lemmatization for 20 languages including English, German, Italian, French, Dutch and Spanish. However, it is not open source and it is not straightforward to use it for non-research or commercial applications. There exist a few other lemmatizers which are open for non-research purposes (Lezius et al., 1998; Perera and Witte, 2005; Bär et al., 2013; Cappelli and Moretti, 1983) [1]. However, these lemmatizers are mostly concerned with only one language and do not provide a broad coverage like the TreeTagger.
In this paper, we describe GATE DictLemmatizer, a plugin for the GATE NLP framework [2] (Cunningham et al., 2011) that performs lemmatization for English, German, Italian, French, Dutch, and Spanish and is freely available under the LGPL license. The GATE NLP framework is one of the most widely used frameworks for applied natural language processing. It is implemented in Java, freely available under the permissive LGPL license and can be extended through plugins.
[1] https://github.com/giodegas/morphit-lemmatizer
[2] https://gate.ac.uk
Our method combines the Helsinki Finite-State Transducer Technology (HFST) [3] (Lindén et al., 2011) and word-lemma dictionaries obtained from Wiktionary. Since we use separate dictionaries depending on the word category, the method also depends on a POS tagger for the language. The word dictionaries are obtained automatically from Wiktionary [4] data dumps. The code for creating the dictionaries automatically is available as free and open-source software [5]. This software can be used to easily add dictionaries for new languages to the DictLemmatizer. The plugin also contains the HFST models for the 4 languages for which models are available: English, German, French and Italian [6].
[3] http://www.ling.helsinki.fi/kieliteknologia/tutkimus/hfst/
[4] https://www.wiktionary.org/
[5] https://github.com/ahmetaker/Wiktionary-Lemma-Extractor
[6] https://sourceforge.net/projects/hfst/files/resources/morphological-transducers/
The rest of the paper is structured as follows. First we describe our method of performing lemmatization (Section 2). Our lemmatizer uses automatically generated lemma dictionaries. The process of obtaining such dictionaries from Wiktionary is outlined in Section 3. In Section 4 we detail the release information. Next, in Section 5 we evaluate the performance of our lemmatizer. We use the TreeTagger for comparison. We conclude in Section 6.
2 Method
To obtain lemmas we combine two strategies: the Helsinki Finite-State Transducer Technology (HFST) [7] and word-lemma dictionaries obtained from Wiktionary [8]. For both strategies, it is necessary to know the coarse-grained word category, such as "noun", "verb", or "adposition", for each word.
[7] http://www.ling.helsinki.fi/kieliteknologia/tutkimus/hfst/
[8] https://www.wiktionary.org/
For this purpose, the lemmatizer requires the Universal POS tags [9] from the Universal Dependencies project. In GATE (Cunningham et al., 2011), POS tags can be created using different methods or plugins; for the evaluation in this paper we use the ANNIE POS tagger (Cunningham et al., 2002) for English and the Stanford CoreNLP POS tagger (Toutanova et al., 2003) for all other languages. These language-specific POS tags are then converted to Universal Dependencies tags using mappings adapted from https://github.com/slavpetrov/universal-pos-tags (Petrov et al., 2011).
[9] http://universaldependencies.org/u/pos/all.html
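The tag conversion step can be pictured as a plain table lookup. The following is a minimal sketch, assuming Penn Treebank style input tags; the table `PENN_TO_UNIVERSAL` is a small hand-picked illustration and the function name `to_universal` is hypothetical, so neither reproduces the mapping files actually shipped with the plugin.

```python
# Illustrative subset only: a few Penn Treebank tags and the coarse
# Universal POS category each one would fall under. The mapping used
# by the plugin is larger and language-specific.
PENN_TO_UNIVERSAL = {
    "NN": "NOUN",
    "NNS": "NOUN",
    "VB": "VERB",
    "VBD": "VERB",
    "VBZ": "VERB",
    "JJ": "ADJ",
    "RB": "ADV",
    "IN": "ADP",
    "DT": "DET",
    "PRP": "PRON",
}


def to_universal(tag, mapping=PENN_TO_UNIVERSAL):
    """Return the universal tag for a language-specific tag.

    Unmapped tags fall back to "X" (assumption: the catch-all category).
    """
    return mapping.get(tag, "X")


tagged = [("The", "DT"), ("computers", "NNS"), ("crashed", "VBD")]
universal = [(word, to_universal(tag)) for word, tag in tagged]
```

Keeping the mapping as data rather than code means a new language only needs a new table, which fits the extensibility goal described in the paper.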
The lemmatizer first tries to look up each word form in the dictionary that matches the language and word category of the word. Currently there are lists for the following categories: adjective, adposition, adverb, conjunction, determiner, noun, particle, pronoun, verb. If the word form is found in the dictionary, the corresponding lemma is used. Pre-generated dictionaries for the six supported languages are included with the plugin.
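A minimal sketch of this first lookup step is shown here. The nested in-memory layout (language, then category, then word form), the name `lookup_lemma`, and the lowercasing of the word form are all assumptions made for illustration; the plugin's actual storage format and matching rules may differ.

```python
# Hypothetical in-memory layout: language -> category -> word form -> lemma.
# The entries are toy examples, not taken from the shipped dictionaries.
DICTIONARIES = {
    "en": {
        "noun": {"computers": "computer", "mice": "mouse"},
        "verb": {"ran": "run", "computes": "compute"},
    },
}


def lookup_lemma(form, language, category, dictionaries=DICTIONARIES):
    """Return the lemma for a word form, or None if it is not listed.

    Only the dictionary for the given language and word category is
    consulted, so the same form can map to different lemmas per category.
    """
    by_category = dictionaries.get(language, {})
    # Assumption: lookups are case-insensitive.
    return by_category.get(category, {}).get(form.lower())
```

A `None` result is the signal to move on to the next strategy, for example `lookup_lemma("mice", "en", "verb")` finds nothing because "mice" is only listed as a noun.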
If the word cannot be found in the dictionary, an attempt is made to find the lemma by using the HFST model for the language, if it is available. The HFST model returns all possible morphological analyses for each word, which makes it difficult to directly find the lemma for the word. We therefore implemented rules that use the Universal POS tag information to extract the correct lemma. For example, for the word "computers" the HFST model returns the following options:
compute[V]+ER[V/N]+N+PL
computer[N]+N+PL
Since we know from the POS tagger that "computers" is a noun, we can use that information and extract from the HFST list the entry that refers to a noun ([N]): "computer".
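The selection rule for the "computers" example can be sketched as follows. This is one plausible reading, not the plugin's rule set: it assumes the candidate lemma is the text before the first bracket and that the first bracketed tag gives its category. The table `UNIVERSAL_TO_HFST` and the function `select_lemma` are hypothetical, and only the N and V tags are taken from the example output.

```python
import re

# Assumed correspondence between Universal POS tags and the bracketed
# tags in the analyses; only N and V appear in the example.
UNIVERSAL_TO_HFST = {"NOUN": "N", "VERB": "V"}

# Leading stem followed by its first bracketed tag, e.g. "computer[N]".
_FIRST_SEGMENT = re.compile(r"^([^\[\]+]+)\[([^\]]+)\]")


def select_lemma(analyses, universal_pos):
    """Pick the stem of the first analysis whose leading tag matches the POS."""
    wanted = UNIVERSAL_TO_HFST.get(universal_pos)
    if wanted is None:
        return None
    for analysis in analyses:
        match = _FIRST_SEGMENT.match(analysis)
        if match and match.group(2) == wanted:
            return match.group(1)
    return None


analyses = ["compute[V]+ER[V/N]+N+PL", "computer[N]+N+PL"]
noun_lemma = select_lemma(analyses, "NOUN")  # "computer"
verb_lemma = select_lemma(analyses, "VERB")  # "compute"
```

Note that both analyses end in `+N+PL`, so matching on the trailing tags alone would not separate them; that is why this sketch looks at the tag attached to the stem.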
The HFST models are freely available only for a few languages. For any language where there is no HFST model, our lemmatizer will rely only on the Wiktionary-based dictionaries [10].
[10] In this case, it is also possible to make DictLemmatizer work without any POS tags at all by merging the original dictionaries per word type into one dictionary for unknown/unidentified POS type.
3 Parsing dictionaries
We implemented a Java-based tool that allows users to extract lemma information from the Wiktionary API. With this tool it is easy to create dictionaries for additional languages not included in the lemmatizer distribution. We refer to this tool as Wiktionary-Lemma Extractor. It fetches the lemma for a given word form from the Wiktionary page. In addition to the word form, the tool expects the language, such as English or German. Once these pieces of information are provided, the tool fetches through the Wiktionary API the English version of the Wiktionary page for the queried word. The English Wiktionary page is divided into different areas, each of which conveys a particular kind of information such as lemma, synonym, or translation. Our tool isolates the lemma area and finds the non-inflected form for the queried word. The queried word and the non-inflected form are saved into a database to be used for dictionary lookup.
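To make the extraction step more concrete, here is a rough sketch in Python rather than the Java of the actual tool. It assumes the page text has already been fetched, that language sections start with a level-2 heading such as `==English==`, and that inflected forms are marked with a form-of template such as `{{plural of|en|page}}`. The sample text, the regular expressions, and the name `extract_lemmas` are illustrative assumptions; the real extractor may isolate the lemma area differently.

```python
import re

# Made-up sample, loosely modelled on wikitext for the word form "pages".
SAMPLE_WIKITEXT = """==English==
===Noun===
{{head|en|noun form}}
# {{plural of|en|page}}
==French==
===Noun===
{{head|fr|noun form}}
# {{plural of|fr|page}}
"""

LEVEL2 = re.compile(r"^==([^=]+)==\s*$")          # language heading
HEADING = re.compile(r"^={3,}([^=]+)={3,}\s*$")   # word category heading
FORM_OF = re.compile(r"\{\{[^{}|]*\bof\|[^{}|]*\|([^{}|]+)")


def extract_lemmas(wikitext, language):
    """Return (category, lemma) pairs found in one language section."""
    results = []
    in_language = False
    category = None
    for line in wikitext.splitlines():
        level2 = LEVEL2.match(line)
        if level2:
            # Entering a new language section; keep only the requested one.
            in_language = level2.group(1).strip() == language
            category = None
            continue
        if not in_language:
            continue
        heading = HEADING.match(line)
        if heading:
            category = heading.group(1).strip().lower()
            continue
        for template in FORM_OF.finditer(line):
            results.append((category, template.group(1).strip()))
    return results


pairs = extract_lemmas(SAMPLE_WIKITEXT, "English")
```

Each resulting pair, together with the queried word form, is the kind of record that would be stored for later dictionary lookup, grouped by word category.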
4 Software Availability
4.1 GATE DictLemmatizer Plugin
Most of the tools and resources for the GATE NLP framework are created as separate plugins which can be used as needed for a processing pipeline. The approach for finding lemmas described earlier has been implemented as a GATE plugin and is freely available from https://github.com/GateNLP/gateplugin-dict-lemmatizer. This plugin only implements the lemmatization part since there are already several plugins for tokenisation, sentence splitting, and POS-tagging included or separately available for GATE.
4.2 Wiktionary-Lemma Extractor
Similar to the GATE plugin for lemmatization, we make our Wiktionary-Lemma Extractor publicly available through GitHub [11]. Along with the code we also provide a client that eases the creation of new dictionaries. The client just expects the target language as input, such as English, German, Turkish, or Urdu. The client first collects all possible words for that particular language from Wiktionary titles, determines the lemma for each title word, and finally extracts the lemma dictionaries. These lemma dictionaries can then be directly injected into the GATE plugin.
[11] https://github.com/ahmetaker/Wiktionary-Lemma-Extractor
5 Evaluation
We evaluated DictLemmatizer and compared it to TreeTagger on the following corpora:
• English British National Corpus (EN-BNC) (BNC Consortium, 2007; Clear, 1993)
• German Tiger Corpus (DE-Tiger) (Brants et al., 2004)
• Universal Dependencies English tree bank (EN-UD) (Bies et al., 2012)
• Universal Dependencies French tree bank (FR-UD)
• Universal Dependencies German tree bank (DE-UD)
• Universal Dependencies Spanish tree bank (ES-UD)
• Universal Dependencies Spanish Ancora corpus (ES-Ancora)
For more information on the Universal Dependencies tree banks see McDonald et al. (2013).
All corpora were converted to GATE documents using format-specific open-source software [12][13][14]. The software and setup for carrying out all evaluation is also available online [15].
Note that for this comparison, the GATE Generic Tagger Framework plugin [16] was used to wrap the original TreeTagger software. This plugin does not use the full processing pipeline of the original TreeTagger software [17] but instead just uses the tree-tagger binary to retrieve per-token information.
All corpora were converted so that the token boundaries from the corpus were preserved in the conversion to GATE format. However, for the evaluation, the tokens produced by the annotation pipeline are based on the GATE tokenizer and can therefore differ from the correct tokens as present in the tree bank. We list the performance of the tokeniser used together with the performance of the lemmatizer on the tokens which match exactly. Some corpora use corpus-specific ways to represent multi-token words or multi-word tokens which cannot be represented in an identical way as GATE annotations and so these cases get excluded from the evaluation. Results of the evaluation reported in accuracy are shown in Table 1.
[12] https://github.com/GateNLP/corpusconversion-bnc
[13] https://github.com/GateNLP/corpusconversion-tiger
[14] https://github.com/GateNLP/corpusconversion-universal-dependencies
[15] https://github.com/johann-petrak/evaluation-lemmatizer
[16] https://gate.ac.uk/userguide/sec:parsers:taggerframework
[17]
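The scoring scheme described above (lemma accuracy, ignoring case, computed only on tokens whose boundaries match the corpus exactly) can be sketched as below. The token representation as `(start, end, lemma)` tuples and the function name `lemma_accuracy` are assumptions for illustration and do not mirror the evaluation code published by the authors.

```python
def lemma_accuracy(gold, predicted):
    """Score predicted lemmas against gold lemmas.

    Both arguments are lists of (start offset, end offset, lemma).
    Only predicted tokens whose offsets exactly match a gold token are
    scored; lemmas are compared ignoring case.
    Returns (accuracy, number of matched tokens).
    """
    gold_by_span = {(start, end): lemma for start, end, lemma in gold}
    matched = 0
    correct = 0
    for start, end, lemma in predicted:
        gold_lemma = gold_by_span.get((start, end))
        if gold_lemma is None:
            continue  # token boundaries differ, so the token is excluded
        matched += 1
        if lemma.lower() == gold_lemma.lower():
            correct += 1
    accuracy = correct / matched if matched else 0.0
    return accuracy, matched


# Toy example for the text "The computers were fast."
gold = [(0, 3, "the"), (4, 13, "computer"), (14, 18, "be"),
        (19, 23, "fast"), (23, 24, ".")]
# The last predicted token spans "fast." and therefore matches no gold token.
predicted = [(0, 3, "The"), (4, 13, "computer"), (14, 18, "were"),
             (19, 24, "fast.")]
accuracy, matched = lemma_accuracy(gold, predicted)
```

In this toy case three predicted tokens line up with gold tokens and two of them carry the right lemma, so the mis-tokenised "fast." neither helps nor hurts the score.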
From the results in Table 1 we can see that for English and German, DictLemmatizer outperforms TreeTagger. For French, TreeTagger achieves better performance on the test corpus; however, on the training corpus DictLemmatizer achieves better results. For Spanish, the TreeTagger results are much better. Apart from TreeTagger's outstanding performance on the Spanish corpora, the reason for this is that HFST is not used in our lemmatizer for Spanish because there is no HFST model available, so the lemmatization is performed using only the lemma dictionaries obtained from Wiktionary [18]. Although there is a big performance difference for the Spanish language, we consider this result satisfactory given the restrictions.
[18] In our evaluation we focused on languages which are rich in resources and high-performing lemmatizers and are also supported by HFST, such as English, German and French, and on languages that are less rich in terms of resources and also have no support by the HFST tool, such as Spanish.
To get a better indication of the performance of each of the two strategies for the other languages, we also performed evaluations using the DictLemmatizer where we used only the dictionary-based or only the HFST-based approach. Table 2 shows the results for all corpora except Spanish (where only the dictionary is used by default). We can see that for English and German, using just HFST and using just lemma dictionaries achieve comparable results, though using only lemma dictionaries is never worse and usually slightly better (the two are tied on UD-Test-DE). The picture looks different for French. There, using only HFST clearly wins against using only the lemma dictionaries, with accuracy that is 7.5 to 7.9 percentage points higher (around 10% in relative terms). Nevertheless both resources are complementary and when combined boost the results as seen in Table 1.
In addition to the Wiktionary source, the word lists can be extended using an annotated training corpus. We tested this by finding the 500 most frequent incorrect assignments on each of the Universal Dependencies training corpora, grouped by target POS tag, and adding those to the dictionaries for each language. The evaluations using those extended word lists are shown in Table 1 with the indication "DL-TR". This improves the accuracy on all Universal Dependencies training and test sets and on the BNC corpus, but slightly decreases accuracy on the Tiger corpus.
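The extension of the word lists from a training corpus can be sketched roughly as follows. The source does not say whether the 500 most frequent errors are counted overall or per POS tag; this sketch assumes an overall count with entries stored under their POS tag. The names `extend_dictionaries` and `lemmatize`, and the toy data, are hypothetical.

```python
from collections import Counter


def extend_dictionaries(dictionaries, training_tokens, lemmatize, top_n=500):
    """Add the most frequent lemmatization errors to the dictionaries.

    training_tokens is an iterable of (form, pos, gold lemma) triples.
    lemmatize(form, pos) returns the current system's lemma as a string.
    """
    errors = Counter()
    for form, pos, gold_lemma in training_tokens:
        if lemmatize(form, pos).lower() != gold_lemma.lower():
            errors[(pos, form.lower(), gold_lemma)] += 1
    seen = set()
    for (pos, form, gold_lemma), _count in errors.most_common(top_n):
        if (pos, form) in seen:
            continue  # keep the most frequent gold lemma for this form
        seen.add((pos, form))
        dictionaries.setdefault(pos, {})[form] = gold_lemma
    return dictionaries


# Toy setup: a one-entry dictionary and a lemmatizer that falls back
# to the word form itself when nothing is found.
dictionaries = {"noun": {"computers": "computer"}}


def lemmatize(form, pos):
    return dictionaries.get(pos, {}).get(form.lower(), form)


training = [("mice", "noun", "mouse"), ("mice", "noun", "mouse"),
            ("geese", "noun", "goose"), ("computers", "noun", "computer")]
extend_dictionaries(dictionaries, training, lemmatize, top_n=1)
```

With `top_n=1` only the most frequent error ("mice") is added, which shows how the cut-off limits the additions to the cases that matter most in the training data.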
Along with the accuracy figures, we also recorded the time needed to process each corpus (see Table 1). The timing information only gives a rough indication because we show the results of a single run only, and because the machine was under different load for different runs. Also note that the times for the TreeTagger include the overhead of wrapping the original TreeTagger binary for use in a Java plugin. The implementation of the Generic Tagger Framework plugin executes the binary for every document, so the timing information includes the overhead for this and thus also depends on the average document size for a corpus, while the timing for the DictLemmatizer does not. However, from these rough results we can still see that DictLemmatizer is always significantly faster than the TreeTagger for the concrete GATE-plugin implementations that were compared.
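As a small worked example of reading the timing columns, the sketch below converts a few HH:MM:SS values from Table 1 into seconds and computes how many times longer the TreeTagger run took. The helper names `to_seconds` and `slowdown` are illustrative, and given the single-run caveat above the ratios should be read as rough indications only.

```python
def to_seconds(hms):
    """Convert an H:MM:SS string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def slowdown(time_dl, time_tt):
    """Return how many times longer the TreeTagger run took than the DL run."""
    return to_seconds(time_tt) / to_seconds(time_dl)


# (Time DL, Time TT) pairs copied from Table 1.
timings = {
    "EN-BNC": ("1:47:23", "3:11:29"),
    "EN-UD-Test": ("0:00:14", "0:11:08"),
    "ES-Ancora-Train": ("0:07:21", "0:58:06"),
}
ratios = {corpus: slowdown(dl, tt) for corpus, (dl, tt) in timings.items()}
```

The ratio is smallest on the BNC and much larger on the small test sets, which is consistent with the remark that the per-document start-up cost of the wrapped binary weighs more heavily when documents are small.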
Corpus           TT     DL     DL-TR  Time DL  Time TT
EN-BNC           0.927  0.938  0.958  1:47:23  3:11:29
EN-UD-Test       0.922  0.945  0.972  0:00:14  0:11:08
EN-UD-Train      0.920  0.941  0.972  0:01:18  1:14:54
DE-Tiger         0.804  0.940  0.925  0:12:12  0:53:07
DE-UD-Test       0.841  0.915  0.946  0:00:44  0:13:34
DE-UD-Train      0.783  0.910  0.944  0:05:08  3:02:56
FR-UD-Test       0.902  0.881  0.961  0:00:12  0:03:00
FR-UD-Train      0.882  0.896  0.973  0:04:10  1:31:28
ES-Test          0.904  0.779  0.936  0:00:18  0:01:47
ES-Train         0.884  0.797  0.954  0:02:41  0:45:50
ES-Ancora-Test   0.993  0.806  0.950  0:00:26  0:05:30
ES-Ancora-Train  0.993  0.807  0.953  0:07:21  0:58:06
Table 1: Performance of TreeTagger (TT), our Dictionary Lemmatizer (DL), and the Dictionary Lemmatizer trained on the UD training set (DL-TR) on different corpora. Figures are lemma accuracy (ignoring case) for tokens matching the corpus token boundaries; times are in HH:MM:SS.
Corpus       HFST   Lemma Dicts
BNC-EN       0.89   0.924
UD-Test-EN   0.905  0.933
UD-Train-EN  0.90   0.928
Tiger-DE     0.812  0.853
UD-Test-DE   0.827  0.827
UD-Train-DE  0.817  0.827
UD-Test-FR   0.878  0.799
UD-Train-FR  0.894  0.819
Table 2: Performance of DictLemmatizer (HFST only and Wiktionary lemma dictionary only) on different corpora.
6 Conclusion
In this paper we presented a lemmatizer for six languages (English, German, Italian, French, Dutch and Spanish) that is easily extensible to other languages. We compared the performance of our lemmatizer to that of TreeTagger. Our results show that our lemmatizer achieves similar or better results when there is support from HFST. Where there is no HFST support, we still achieve satisfactory results.
Both the DictLemmatizer and the lemma dictionary collector software are freely available, including for commercial use, under the LGPL license. The dictionary collector can be used to easily extend the lemmatizer to new languages that are currently not included in DictLemmatizer.
Acknowledgments
This work was partially supported by the European Union under grant agreement No. 687847 (COMRADES) and grant agreement No. 611223 (PHEME).
References
Daniel Bär, Torsten Zesch, and Iryna Gurevych. 2013. DKPro Similarity: An open source framework for text similarity. In ACL (Conference System Demonstrations). pages 121–126.
Ann Bies, Justin Mott, Colin Warner, and Seth Kulick. 2012. English web treebank LDC2012T13. Web Download. Philadelphia: Linguistic Data Consortium.
Sabine Brants, Stefanie Dipper, Peter Eisenberg, Silvia Hansen-Schirra, Esther König, Wolfgang Lezius, Christian Rohrer, George Smith, and Hans Uszkoreit. 2004. Tiger: Linguistic interpretation of a German corpus. Research on Language and Computation 2(4):597–620. https://doi.org/10.1007/s11168-004-7431-3.
Amedeo Cappelli and Lorenzo Moretti. 1983. Aspetti della rappresentazione della conoscenza in linguistica computazionale, volume 5. Pacini.
Jeremy H. Clear. 1993. The digital word. MIT Press, Cambridge, MA, USA, chapter The British National Corpus, pages 163–187. http://dl.acm.org/citation.cfm?id=166403.166418.
BNC Consortium. 2007. The British National Corpus, version 3 (BNC XML edition). Distributed by Oxford University Computing Services on behalf of the BNC Consortium. http://www.natcorp.ox.ac.uk/.
Hamish Cunningham, Diana Maynard, Kalina Bontcheva, and Valentin Tablan. 2002. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02).
Hamish Cunningham, Diana Maynard, Kalina Bontcheva, Valentin Tablan, Niraj Aswani, Ian Roberts, Genevieve Gorrell, Adam Funk, Angus Roberts, Danica Damljanovic, Thomas Heitz, Mark A. Greenwood, Horacio Saggion, Johann Petrak, Yaoyong Li, and Wim Peters. 2011. Text Processing with GATE (Version 6).
Wolfgang Lezius, Reinhard Rapp, and Manfred Wettler. 1998. A freely available morphological analyzer, disambiguator and context sensitive lemmatizer for German. In Proceedings of the 17th international conference on Computational linguistics. Association for Computational Linguistics, pages 743–748.
Krister Lindén, Erik Axelson, Sam Hardwick, Miikka Silfverberg, and Tommi Pirinen. 2011. HFST – framework for compiling and applying morphologies, pages 67–85.
Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Tackstrom, Claudia Bedini, Nuria Bertomeu Castello, and Jungmee Lee. 2013. Universal dependency annotation for multilingual parsing. In Proceedings of ACL 2013.
Praharshana Perera and René Witte. 2005. A self-learning context-aware lemmatizer for German. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 636–643.
Slav Petrov, Dipanjan Das, and Ryan T. McDonald. 2011. A universal part-of-speech tagset. CoRR abs/1104.2086. http://arxiv.org/abs/1104.2086.
Helmut Schmid. 2013. Probabilistic part-of-speech tagging using decision trees. In New methods in language processing. Routledge, page 154.
Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology. Association for Computational Linguistics, Stroudsburg, PA, USA, NAACL '03, pages 173–180.
45

Weitere Àhnliche Inhalte

Was ist angesagt?

5. manuel arcedillo & juanjo arevalillo (hermes) translation memories
5. manuel arcedillo & juanjo arevalillo (hermes) translation memories5. manuel arcedillo & juanjo arevalillo (hermes) translation memories
5. manuel arcedillo & juanjo arevalillo (hermes) translation memories
RIILP
 
Lectura 3.5 word normalizationintwitter finitestate_transducers
Lectura 3.5 word normalizationintwitter finitestate_transducersLectura 3.5 word normalizationintwitter finitestate_transducers
Lectura 3.5 word normalizationintwitter finitestate_transducers
Matias Menendez
 
FIRE2014_IIT-P
FIRE2014_IIT-PFIRE2014_IIT-P
FIRE2014_IIT-P
Shubham Kumar
 
Machine transliteration survey
Machine transliteration surveyMachine transliteration survey
Machine transliteration survey
unyil96
 
6. Khalil Sima'an (UVA) Statistical Machine Translation
6. Khalil Sima'an (UVA) Statistical Machine Translation6. Khalil Sima'an (UVA) Statistical Machine Translation
6. Khalil Sima'an (UVA) Statistical Machine Translation
RIILP
 
CETS 2012, Mitch Donaldson & Ralph Strozza, slides for eLearning for Overseas...
CETS 2012, Mitch Donaldson & Ralph Strozza, slides for eLearning for Overseas...CETS 2012, Mitch Donaldson & Ralph Strozza, slides for eLearning for Overseas...
CETS 2012, Mitch Donaldson & Ralph Strozza, slides for eLearning for Overseas...
Chicago eLearning & Technology Showcase
 
Principles of-programming-languages-lecture-notes-
Principles of-programming-languages-lecture-notes-Principles of-programming-languages-lecture-notes-
Principles of-programming-languages-lecture-notes-
Krishna Sai
 
PDFTextProcessing
PDFTextProcessingPDFTextProcessing
PDFTextProcessing
Joshua Mathias
 

Was ist angesagt? (17)

5. manuel arcedillo & juanjo arevalillo (hermes) translation memories
5. manuel arcedillo & juanjo arevalillo (hermes) translation memories5. manuel arcedillo & juanjo arevalillo (hermes) translation memories
5. manuel arcedillo & juanjo arevalillo (hermes) translation memories
 
TALC 2008 Workshop 1 - Teaching and Language Corpora
TALC 2008 Workshop 1 - Teaching and Language CorporaTALC 2008 Workshop 1 - Teaching and Language Corpora
TALC 2008 Workshop 1 - Teaching and Language Corpora
 
Introduction to programing languages part 1
Introduction to programing languages   part 1Introduction to programing languages   part 1
Introduction to programing languages part 1
 
Compiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFSTCompiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFST
 
TRANSLATOR'S TOOLS, by Dr. Shadia Y. BAnjar
TRANSLATOR'S TOOLS, by Dr. Shadia Y. BAnjarTRANSLATOR'S TOOLS, by Dr. Shadia Y. BAnjar
TRANSLATOR'S TOOLS, by Dr. Shadia Y. BAnjar
 
SOFTWARE TOOL FOR TRANSLATING PSEUDOCODE TO A PROGRAMMING LANGUAGE
SOFTWARE TOOL FOR TRANSLATING PSEUDOCODE TO A PROGRAMMING LANGUAGESOFTWARE TOOL FOR TRANSLATING PSEUDOCODE TO A PROGRAMMING LANGUAGE
SOFTWARE TOOL FOR TRANSLATING PSEUDOCODE TO A PROGRAMMING LANGUAGE
 
Lectura 3.5 word normalizationintwitter finitestate_transducers
Lectura 3.5 word normalizationintwitter finitestate_transducersLectura 3.5 word normalizationintwitter finitestate_transducers
Lectura 3.5 word normalizationintwitter finitestate_transducers
 
A tutorial on Machine Translation
A tutorial on Machine TranslationA tutorial on Machine Translation
A tutorial on Machine Translation
 
FIRE2014_IIT-P
FIRE2014_IIT-PFIRE2014_IIT-P
FIRE2014_IIT-P
 
Machine transliteration survey
Machine transliteration surveyMachine transliteration survey
Machine transliteration survey
 
6. Khalil Sima'an (UVA) Statistical Machine Translation
6. Khalil Sima'an (UVA) Statistical Machine Translation6. Khalil Sima'an (UVA) Statistical Machine Translation
6. Khalil Sima'an (UVA) Statistical Machine Translation
 
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
 
Moses
MosesMoses
Moses
 
2013 ALC Boston: Your Trained Moses SMT System doesn't work. What can you do?
2013 ALC Boston: Your Trained Moses SMT System doesn't work. What can you do?2013 ALC Boston: Your Trained Moses SMT System doesn't work. What can you do?
2013 ALC Boston: Your Trained Moses SMT System doesn't work. What can you do?
 
CETS 2012, Mitch Donaldson & Ralph Strozza, slides for eLearning for Overseas...
CETS 2012, Mitch Donaldson & Ralph Strozza, slides for eLearning for Overseas...CETS 2012, Mitch Donaldson & Ralph Strozza, slides for eLearning for Overseas...
CETS 2012, Mitch Donaldson & Ralph Strozza, slides for eLearning for Overseas...
 
Principles of-programming-languages-lecture-notes-
Principles of-programming-languages-lecture-notes-Principles of-programming-languages-lecture-notes-
Principles of-programming-languages-lecture-notes-
 
PDFTextProcessing
PDFTextProcessingPDFTextProcessing
PDFTextProcessing
 

Ähnlich wie An Extensible Multilingual Open Source Lemmatizer

Shallow parser for hindi language with an input from a transliterator
Shallow parser for hindi language with an input from a transliteratorShallow parser for hindi language with an input from a transliterator
Shallow parser for hindi language with an input from a transliterator
Shashank Shisodia
 
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorDynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Waqas Tariq
 

Ähnlich wie An Extensible Multilingual Open Source Lemmatizer (20)

Smart grammar a dynamic spoken language understanding grammar for inflective ...
Smart grammar a dynamic spoken language understanding grammar for inflective ...Smart grammar a dynamic spoken language understanding grammar for inflective ...
Smart grammar a dynamic spoken language understanding grammar for inflective ...
 
IRJET- Text to Speech Synthesis for Hindi Language using Festival Framework
IRJET- Text to Speech Synthesis for Hindi Language using Festival FrameworkIRJET- Text to Speech Synthesis for Hindi Language using Festival Framework
IRJET- Text to Speech Synthesis for Hindi Language using Festival Framework
 
ijcai11
ijcai11ijcai11
ijcai11
 
Allin Qillqay A Free On-Line Web Spell Checking Service For Quechua
Allin Qillqay  A Free On-Line Web Spell Checking Service For QuechuaAllin Qillqay  A Free On-Line Web Spell Checking Service For Quechua
Allin Qillqay A Free On-Line Web Spell Checking Service For Quechua
 
Natural language processing with python and amharic syntax parse tree by dani...
Natural language processing with python and amharic syntax parse tree by dani...Natural language processing with python and amharic syntax parse tree by dani...
Natural language processing with python and amharic syntax parse tree by dani...
 
Different valuable tools for Arabic sentiment analysis: a comparative evaluat...
Different valuable tools for Arabic sentiment analysis: a comparative evaluat...Different valuable tools for Arabic sentiment analysis: a comparative evaluat...
Different valuable tools for Arabic sentiment analysis: a comparative evaluat...
 
W17 5406
W17 5406W17 5406
W17 5406
 
An implementation of apertium based assamese morphological analyzer
An implementation of apertium based assamese morphological analyzerAn implementation of apertium based assamese morphological analyzer
An implementation of apertium based assamese morphological analyzer
 
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
 
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
 
Shallow parser for hindi language with an input from a transliterator
Shallow parser for hindi language with an input from a transliteratorShallow parser for hindi language with an input from a transliterator
Shallow parser for hindi language with an input from a transliterator
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
An Extensible Multilingual Open Source Lemmatizer

Proceedings of Recent Advances in Natural Language Processing, pages 40–45, Varna, Bulgaria, Sep 4–6 2017.
https://doi.org/10.26615/978-954-452-049-6_006

Ahmet Aker (a, b), Johann Petrak (a) and Firas Sabbah (b)
(a) Department of Computer Science, University of Sheffield
(b) Department of Information Engineering, University of Duisburg-Essen
a.aker@is.inf.uni-due.de, johann.petrak@sheffield.ac.uk, firas.sabbah@stud.uni-due.de

Abstract

We present GATE DictLemmatizer, a multilingual open source lemmatizer for the GATE NLP framework that currently supports English, German, Italian, French, Dutch, and Spanish, and is easily extensible to other languages. The software is freely available under the LGPL license. The lemmatization is based on the Helsinki Finite-State Transducer Technology (HFST) and lemma dictionaries automatically created from Wiktionary. We evaluate the performance of the lemmatizers against TreeTagger, which is only freely available for research purposes. Our evaluation shows that DictLemmatizer achieves similar or even better results than TreeTagger for languages where there is support from HFST. The performance drops when there is no support from HFST and the entire lemmatization process is based on lemma dictionaries. However, the results are still satisfactory given the fact that DictLemmatizer is open-source and can be easily extended to other languages. The software for extending the lemmatizer by creating word lists from Wiktionary dictionaries is also freely available as open-source software.

1 Introduction

The process of lemmatization is an important part of many computational linguistics applications such as Information Retrieval (IR) and Natural Language Processing (NLP). In lemmatization, inflected forms of a lexeme are mapped to a canonical form that is referred to as the lemma.
The task of finding the correct lemma for a word in context is often complicated by the fact that a word can be the inflected form of more than one lexeme, each of which may have different lemmas. Lemmas can be used in various ways for NLP, for instance, to improve the performance of text similarity metrics. For this application, all words are mapped to their lemma before a similarity is calculated. Lemmas are also often used in information retrieval and information extraction to better identify and group terms which occur in their inflected forms.

The task of finding lemmas is different and harder than finding stems. Stemming is often used as a much cruder heuristic approach to map inflectional forms of words to some canonical form, but unlike lemmatization does not differentiate between different lexemes which could have the same inflectional form, and it is possible for the stem of a word to not be a valid lexeme of the language.

The TreeTagger (Schmid, 2013) software provides lemmatization for 20 languages including English, German, Italian, French, Dutch and Spanish. However, it is not open source and it is not straightforward to use it for non-research or commercial applications. There exist a few other lemmatizers which are open for non-research purposes (Lezius et al., 1998; Perera and Witte, 2005; Bär et al., 2013; Cappelli and Moretti, 1983) [1]. However, these lemmatizers are mostly concerned with only one language and do not provide a broad coverage like the TreeTagger.

In this paper, we describe GATE DictLemmatizer, a plugin for the GATE NLP framework [2] (Cunningham et al., 2011) that performs lemmatization for English, German, Italian, French, Dutch, and Spanish and is freely available under the LGPL license. The GATE NLP framework is one of the most widely used frameworks for applied natural language processing. It is implemented in Java, freely available under the permissive LGPL license and can be extended through plugins.

Our method combines the Helsinki Finite-State Transducer Technology (HFST) [3] (Lindén et al., 2011) and word-lemma dictionaries obtained from Wiktionary. Since we use separate dictionaries depending on the word category, the method also depends on a POS tagger for the language. The word dictionaries are obtained automatically from Wiktionary [4] data dumps. The code for creating the dictionaries automatically is available as free and open-source software. [5] This software can be used to easily add dictionaries for new languages to the DictLemmatizer. The plugin also contains the HFST models for the 4 languages for which models are available: English, German, French and Italian. [6]

The rest of the paper is structured as follows. First we describe our method of performing lemmatization (Section 2). Our lemmatizer uses automatically generated lemma dictionaries. The process of obtaining such dictionaries from Wiktionary is outlined in Section 3. In Section 4 we detail the release information. Next, in Section 5 we evaluate the performance of our lemmatizer. We use the TreeTagger for comparison. We conclude in Section 6.

[1] https://github.com/giodegas/morphit-lemmatizer
[2] https://gate.ac.uk

2 Method

To obtain lemmas we combine two strategies: the Helsinki Finite-State Transducer Technology (HFST) [7] and word-lemma dictionaries obtained from Wiktionary [8]. For both strategies, it is necessary to know the coarse-grained word categories such as "noun", "verb", "adposition" for each word. For this purpose, the lemmatizer requires the Universal POS tags [9] from the Universal Dependencies project.
In GATE (Cunningham et al., 2011), POS tags can be created using different methods or plugins; however, for the evaluation in this paper we use the ANNIE POS-tagger (Cunningham et al., 2002) for English and the Stanford CoreNLP POS tagger (Toutanova et al., 2003) for all other languages. These language-specific POS tags are then converted to Universal Dependencies tags using mappings adapted from https://github.com/slavpetrov/universal-pos-tags (Petrov et al., 2011).

[3] http://www.ling.helsinki.fi/kieliteknologia/tutkimus/hfst/
[4] https://www.wiktionary.org/
[5] https://github.com/ahmetaker/Wiktionary-Lemma-Extractor
[6] https://sourceforge.net/projects/hfst/files/resources/morphological-transducers/
[7] http://www.ling.helsinki.fi/kieliteknologia/tutkimus/hfst/
[8] https://www.wiktionary.org/
[9] http://universaldependencies.org/u/pos/all.html

The lemmatizer first tries to look up each word form in the dictionary that matches the language and word category of the word. Currently there are lists for the following categories: adjective, adposition, adverb, conjunction, determiner, noun, particle, pronoun, verb. If the word form is found in the dictionary, the corresponding lemma is used. Pre-generated dictionaries for the six supported languages are included with the plugin.

If the word could not be found in the dictionary, an attempt is made to find the lemma by using the HFST model for the language, if it is available. The HFST model returns for each word all possible morphological variants. This makes it difficult to directly find the lemma for the word. We therefore implemented rules that use the Universal POS tag information and extract the correct lemma. E.g. for the word "computers" the HFST returns the following options:

    compute[V]+ER[V/N]+N+PL
    computer[N]+N+PL

Since we know from the POS tagger that "computers" is a noun, we can use that information and extract from the HFST list the entry that refers to a noun ([N]): "computer".
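The two-step strategy described above (dictionary lookup first, then HFST analyses filtered by the POS tag) can be sketched roughly as follows. This is a minimal illustration in Python, not the plugin's Java implementation: the names `DICTIONARIES`, `UPOS_TO_HFST`, `lemma_from_hfst` and `lemmatize` are hypothetical, the dictionary entries are made up, the rule for picking an analysis is an assumed simplification of the rules mentioned in the text, and the HFST analyzer is passed in as a plain function rather than called through any real HFST API.

```python
# Hypothetical per-category dictionaries: {Universal POS tag: {word form: lemma}}.
# The entries are invented for illustration only.
DICTIONARIES = {
    "NOUN": {"mice": "mouse"},
    "VERB": {"went": "go"},
}

# Assumed mapping from Universal POS tags to the category markers
# that appear in analyses such as "computer[N]+N+PL".
UPOS_TO_HFST = {"NOUN": "N", "VERB": "V", "ADJ": "A"}


def lemma_from_hfst(analyses, upos):
    """Return the root of the first analysis whose root carries the POS marker."""
    marker = UPOS_TO_HFST.get(upos)
    if marker is None:
        return None
    suffix = f"[{marker}]"
    for analysis in analyses:
        root = analysis.split("+", 1)[0]  # e.g. "computer[N]"
        if root.endswith(suffix):
            return root[: -len(suffix)]
    return None


def lemmatize(word, upos, analyze=None):
    """Dictionary lookup first; fall back to the analyzer if one is given."""
    lemma = DICTIONARIES.get(upos, {}).get(word)
    if lemma is not None:
        return lemma
    if analyze is not None:
        return lemma_from_hfst(analyze(word), upos)
    return None  # no dictionary entry and no HFST model for this language
```

Passing the analyzer as an argument keeps the sketch independent of any particular HFST binding; for a language without an HFST model, `analyze` is simply left as `None` and only the dictionary is consulted.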
The HFST models are freely available only for a few languages. For any language where there is no HFST model, our lemmatizer will rely only on the Wiktionary-based dictionaries. [10]

[10] In this case, it is also possible to make DictLemmatizer work without any POS tags at all by merging the original dictionaries per word type into one dictionary for unknown/unidentified POS type.

3 Parsing dictionaries

We implemented a Java-based tool that allows users to extract lemma information from the Wiktionary API. With this tool it is easy to create dictionaries for additional languages not included in the lemmatizer distribution. We refer to this tool as Wiktionary-Lemma Extractor. It fetches for a given word form its lemma from the Wiktionary page. In addition, the tool expects the language information, such as English, German, etc. Once these pieces of information are provided, the tool fetches through the Wiktionary API the English version of the Wiktionary page for the queried word. The English Wiktionary page is divided into different areas, each of which conveys a particular kind of information such as lemma, synonym, translation, etc. Our tool isolates the lemma area and finds the non-inflected form for the queried word. The queried word and the non-inflected form are saved into a database to be used for dictionary lookup.

4 Software Availability

4.1 GATE DictLemmatizer Plugin

Most of the tools and resources for the GATE NLP framework are created as separate plugins which can be used as needed for a processing pipeline. The approach for finding lemmas described earlier has been implemented as a GATE plugin and is freely available from https://github.com/GateNLP/gateplugin-dict-lemmatizer. This plugin only implements the lemmatization part since there are already several plugins for tokenisation, sentence splitting, and POS-tagging included or separately available for GATE.

4.2 Wiktionary-Lemma Extractor

Similar to the GATE plugin for lemmatization, we make our Wiktionary-Lemma Extractor publicly available through GitHub [11]. Along with the code we also provide a client that eases the creation of new dictionaries. The client just expects the input of the target language such as English, German, Turkish, Urdu, etc. The client first collects all possible words for that particular language from Wiktionary titles, determines for each title word its lemma and finally extracts the lemma dictionaries. These lemma dictionaries can then be directly injected into the GATE plugin.
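As a rough illustration of the dictionary side, the sketch below groups extracted (word form, category, lemma) entries into one dictionary per word category, and then merges the per-category dictionaries into a single one for use without POS tags, as the footnote on languages without an HFST model suggests. It is written in Python with hypothetical function names; the in-memory representation and the first-entry-wins handling of conflicting lemmas are assumptions, since the text does not say how the extractor stores entries or resolves such conflicts.

```python
from collections import defaultdict


def build_dictionaries(entries):
    """Group (form, category, lemma) triples into {category: {form: lemma}}.

    Assumption: if a form occurs twice in one category, the first lemma wins.
    """
    dictionaries = defaultdict(dict)
    for form, category, lemma in entries:
        dictionaries[category].setdefault(form, lemma)
    return dict(dictionaries)


def merge_dictionaries(dictionaries):
    """Merge all per-category dictionaries into one POS-independent dictionary.

    Assumption: categories are visited in alphabetical order and the first
    lemma seen for a form is kept, so ambiguous forms get a single lemma.
    """
    merged = {}
    for category in sorted(dictionaries):
        for form, lemma in dictionaries[category].items():
            merged.setdefault(form, lemma)
    return merged
```

The merged dictionary loses the ability to tell "left" (adjective) from "left" (verb form of "leave"), which is one way to see why the per-category dictionaries combined with a POS tagger are the default setup.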
[11] https://github.com/ahmetaker/Wiktionary-Lemma-Extractor

5 Evaluation

We evaluated DictLemmatizer and compared it to TreeTagger on the following corpora:

• English British National Corpus (EN-BNC) (Consortium, 2007; Clear, 1993)
• German Tiger Corpus (DE-Tiger) (Brants et al., 2004)
• Universal Dependencies English tree bank (EN-UD) (Bies et al., 2012)
• Universal Dependencies French tree bank (FR-UD)
• Universal Dependencies German tree bank (DE-UD)
• Universal Dependencies Spanish tree bank (ES-UD)
• Universal Dependencies Spanish Ancora corpus (ES-Ancora)

For more information on the Universal Dependencies tree banks see McDonald et al. (2013). All corpora were converted to GATE documents using format-specific open-source software [12][13][14]. The software and setup for carrying out all evaluation is also available online. [15]

Note that for this comparison, the GATE Generic Tagger Framework plugin [16] was used to wrap the original TreeTagger software. This plugin does not use the full processing pipeline of the original TreeTagger software [17] but instead just uses the tree-tagger binary to retrieve per-token information.

All corpora were converted so that the token boundaries from the corpus were preserved for the conversion to GATE format. However, for the evaluation, the tokens produced by the annotation pipeline are based on the GATE tokenizer and can therefore differ from the correct tokens as present in the tree bank. We list the performance of the tokeniser used together with the performance of the lemmatizer on the tokens which match exactly.
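The scoring just described can be sketched as follows: only predicted tokens whose boundaries coincide exactly with a gold token are scored, and lemmas are compared ignoring case. This Python snippet is an illustrative reconstruction and not the published evaluation code; the function name and the (start, end, lemma) tuple representation are assumptions made for the example.

```python
def lemma_accuracy(gold, predicted):
    """Score lemmas on tokens whose boundaries match the gold tokens exactly.

    gold, predicted: lists of (start_offset, end_offset, lemma) tuples.
    Returns (share of gold tokens with a matching predicted token,
             case-insensitive lemma accuracy on those matched tokens).
    """
    predicted_by_span = {(start, end): lemma for start, end, lemma in predicted}
    matched = 0
    correct = 0
    for start, end, gold_lemma in gold:
        if (start, end) not in predicted_by_span:
            continue  # tokenizer disagreed with the corpus; token is not scored
        matched += 1
        if predicted_by_span[(start, end)].lower() == gold_lemma.lower():
            correct += 1
    token_match_rate = matched / len(gold) if gold else 0.0
    accuracy = correct / matched if matched else 0.0
    return token_match_rate, accuracy
```

Reporting the token match rate alongside the accuracy makes it visible how much of each corpus the accuracy figure actually covers.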
Some corpora use corpus-specific ways to represent multi-token words or multi-word tokens which cannot be represented in an identical way as GATE annotations, and so these cases get excluded from the evaluation. Results of the evaluation, reported as accuracy, are shown in Table 1.

[12] https://github.com/GateNLP/corpusconversion-bnc
[13] https://github.com/GateNLP/corpusconversion-tiger
[14] https://github.com/GateNLP/corpusconversion-universal-dependencies
[15] https://github.com/johann-petrak/evaluation-lemmatizer
[16] https://gate.ac.uk/userguide/sec:parsers:taggerframework
[17]

From the results in Table 1 we can see that for English and German, DictLemmatizer outperforms TreeTagger. For French, TreeTagger achieves better performance on the test corpus; however, on the training corpus DictLemmatizer achieves better results. For Spanish, the TreeTagger results are much better. The reason for this is that, apart from TreeTagger's outstanding performance on the Spanish corpora, the HFST inducer is not used in our lemmatizer for Spanish because there is no HFST model available, so the lemmatization is performed using only the lemma dictionaries obtained from Wiktionary. [18] Although there is a big performance difference for the Spanish language, we consider this result satisfactory given the restrictions.

To get a better indication of the performance of each of the two strategies for the other languages, we also performed evaluations using the DictLemmatizer where we used only the dictionary-based or only the HFST-based approach. Table 2 shows the results for all corpora except Spanish (where only the dictionary is used by default). We can see that for English and German, using just HFST and using just lemma dictionaries achieve comparable results, though using only lemma dictionaries is slightly better on all corpora except UD-Test-DE, where the two are tied. This picture looks different when we look at the French language. There, using only HFST clearly wins against using only the lemma dictionaries and achieves around 8 percentage points higher accuracy (roughly 10% in relative terms). Nevertheless, both resources are complementary and when combined boost the results, as seen in Table 1.

In addition to the Wiktionary source, the word lists can be extended by an annotated training corpus.
We tested this by finding the 500 most frequent incorrect assignments on each of the Universal Dependencies training corpora, grouped by target POS tag, and adding those to the dictionaries for each language. The evaluations using those extended word lists are shown in Table 1 with the indication "DL-TR". This improves the accuracy on all Universal Dependencies training and test sets and on the BNC corpus, but slightly decreases accuracy on the Tiger corpus.

Along with the accuracy figures, we also recorded the time needed to process each corpus (see Table 1). The timing information only gives a rough indication because we show the results of a single run only, and because the machine was under different load for different runs. Also note that the times for the TreeTagger include the overhead of wrapping the original TreeTagger binary for use in a Java plugin. The implementation of the Generic Tagger Framework plugin executes the binary for every document, so the timing information includes the overhead for this and thus also depends on the average document size for a corpus, while the timing for the DictLemmatizer does not. However, from these rough results we can still see that DictLemmatizer is always significantly faster than the TreeTagger for the concrete GATE plugin implementations that were compared.

[18] In our evaluation we focused on languages which are rich in resources and high-performing lemmatizers, such as English, German and French, and which are also supported by HFST, and on languages that are less rich in resources and have no support from the HFST tool, such as Spanish.

6 Conclusion

In this paper we presented a lemmatizer for six languages (English, German, Italian, French, Dutch and Spanish) that is easily extensible to other languages. We compared the performance of our lemmatizer to that of TreeTagger. Our results show that our lemmatizer achieves similar or better results when there is support from HFST.
In case there is no HFST support we still achieve satisfactory results.

Both the DictLemmatizer and the lemma dictionary collector software are available freely for commercial use under the LGPL license. The dictionary collector can be used to easily extend the lemmatizer to new languages that are currently not included in DictLemmatizer.

Acknowledgments

This work was partially supported by the European Union under grant agreement No. 687847 (COMRADES) and grant agreement No. 611223 (PHEME).

Corpus            TT      DL      DL-TR   Time DL   Time TT
EN-BNC            0.927   0.938   0.958   1:47:23   3:11:29
EN-UD-Test        0.922   0.945   0.972   0:00:14   0:11:08
EN-UD-Train       0.920   0.941   0.972   0:01:18   1:14:54
DE-Tiger          0.804   0.940   0.925   0:12:12   0:53:07
DE-UD-Test        0.841   0.915   0.946   0:00:44   0:13:34
DE-UD-Train       0.783   0.910   0.944   0:05:08   3:02:56
FR-UD-Test        0.902   0.881   0.961   0:00:12   0:03:00
FR-UD-Train       0.882   0.896   0.973   0:04:10   1:31:28
ES-Test           0.904   0.779   0.936   0:00:18   0:01:47
ES-Train          0.884   0.797   0.954   0:02:41   0:45:50
ES-Ancora-Test    0.993   0.806   0.950   0:00:26   0:05:30
ES-Ancora-Train   0.993   0.807   0.953   0:07:21   0:58:06

Table 1: Performance of TreeTagger (TT), our Dictionary Lemmatizer (DL) and the Dictionary Lemmatizer trained on the UD training set (DL-TR) on different corpora. Figures are accuracy of lemma (ignoring case) for tokens matching the corpus token boundaries; times are in HH:MM:SS.

Corpora       HFST    Lemma Dicts
BNC-EN        0.89    0.924
UD-Test-EN    0.905   0.933
UD-Train-EN   0.90    0.928
Tiger-DE      0.812   0.853
UD-Test-DE    0.827   0.827
UD-Train-DE   0.817   0.827
UD-Test-FR    0.878   0.799
UD-Train-FR   0.894   0.819

Table 2: Performance of DictLemmatizer (HFST only and Wiktionary lemma dictionary only) on different corpora.

References

Daniel Bär, Torsten Zesch, and Iryna Gurevych. 2013. DKPro Similarity: An open source framework for text similarity. In ACL (Conference System Demonstrations), pages 121–126.

Ann Bies, Justin Mott, Colin Warner, and Seth Kulick. 2012. English web treebank LDC2012T13. Web Download. Philadelphia: Linguistic Data Consortium.

Sabine Brants, Stefanie Dipper, Peter Eisenberg, Silvia Hansen-Schirra, Esther König, Wolfgang Lezius, Christian Rohrer, George Smith, and Hans Uszkoreit. 2004. Tiger: Linguistic interpretation of a German corpus. Research on Language and Computation 2(4):597–620. https://doi.org/10.1007/s11168-004-7431-3.

Amedeo Cappelli and Lorenzo Moretti. 1983. Aspetti della rappresentazione della conoscenza in linguistica computazionale, volume 5. Pacini.

Jeremy H. Clear. 1993. The digital word. MIT Press, Cambridge, MA, USA, chapter The British National Corpus, pages 163–187. http://dl.acm.org/citation.cfm?id=166403.166418.

BNC Consortium. 2007. The British National Corpus, version 3 (BNC XML edition). Distributed by Oxford University Computing Services on behalf of the BNC Consortium. http://www.natcorp.ox.ac.uk/.

Hamish Cunningham, Diana Maynard, Kalina Bontcheva, and Valentin Tablan. 2002. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02).

Hamish Cunningham, Diana Maynard, Kalina Bontcheva, Valentin Tablan, Niraj Aswani, Ian Roberts, Genevieve Gorrell, Adam Funk, Angus Roberts, Danica Damljanovic, Thomas Heitz, Mark A. Greenwood, Horacio Saggion, Johann Petrak, Yaoyong Li, and Wim Peters. 2011. Text Processing with GATE (Version 6).

Wolfgang Lezius, Reinhard Rapp, and Manfred Wettler. 1998. A freely available morphological analyzer, disambiguator and context sensitive lemmatizer for German. In Proceedings of the 17th international conference on Computational linguistics. Association for Computational Linguistics, pages 743–748.

Krister Lindén, Erik Axelson, Sam Hardwick, Miikka Silfverberg, and Tommi Pirinen. 2011. HFST–framework for compiling and applying morphologies, pages 67–85.

Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Tackstrom, Claudia Bedini, Nuria Bertomeu Castello, and Jungmee Lee. 2013. Universal dependency annotation for multilingual parsing. In Proceedings of ACL 2013.

Praharshana Perera and René Witte. 2005. A self-learning context-aware lemmatizer for German. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 636–643.

Slav Petrov, Dipanjan Das, and Ryan T. McDonald. 2011. A universal part-of-speech tagset. CoRR abs/1104.2086. http://arxiv.org/abs/1104.2086.

Helmut Schmid. 2013. Probabilistic part-of-speech tagging using decision trees. In New methods in language processing. Routledge, page 154.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology. Association for Computational Linguistics, Stroudsburg, PA, USA, NAACL '03, pages 173–180.