Academy of Graduate Studies
Tripoli - Libya
PART OF SPEECH TAGGING OF ARABIC TEXT
By
Massaoud Abuzed Abolqasem Abuzed
March 2006
Abstract
Part-of-speech tagging is an important area of research in natural language processing. Although it has been well studied for several Indo-European languages, it is still not very well investigated for Arabic.
In this thesis, the Brill tagger and a modified version of the Khoja tagset, along with a corpus prepared for this purpose, are applied to tag Modern Standard Arabic (henceforth MSA) text. The Brill tagger is a well-known public domain part-of-speech tagger, originally designed for tagging English text, implementing a machine learning approach based on transformation rules. It has been adapted to other languages, such as German and Hungarian, by many researchers. Several modifications had to be made to the learner and the tagger, which are written partly in Perl and partly in C and run under the Unix/Linux operating system. The main change concerns the initial state tagger, which is used by both the learner and the tagger. A program was written using the lexical analyzer Lex to capture Arabic morphological structures, and then interfaced with both the learner and the tagger. The tagset used in this work is a revised version of the one introduced by Khoja. The revision included changing some of the tags for linguistic reasons and introducing new tags to make the set more powerful, or to make up for limitations in the original tagset that hinder tagging some words. The corpus was obtained from two Jordanian magazines and had to go through a series of editing steps. A collection of lexical rules and contextual rules was produced by the learning system and applied to Arabic text. The tagging accuracy of the resulting tagged text is measured to be approximately 84% for both known and unknown words. This result may seem low, but taking into consideration the complexity of the language, the richness of the tagset, the fact that this work is the first to apply such a tagset to Arabic, and the fact that we did not have a reference corpus to base our work on, we consider the results very promising.
Acknowledgements
I would like to express my gratitude to:
Associate Professor Mohamed Arteimi, my academic supervisor, who guided me through this research and gave me his valuable advice.
The Department of Computer Science in the Academy of Graduate Studies and
personally to Dr. Abdussalam Elmusrati for his encouragement and help.
The Academy of Graduate Studies, and to Dr. Saleh Ibrahim for his encouragement
by sponsoring this research through an academic scholarship.
And to my family and friends for their support and endurance.
List of Tables

Table (4-1): A list of lexical rules
Table (4-2): Examples of misleading lexical rules
Table (4-3): A list of contextual rules
Table (5-1): Accuracy for the original tagset
Table (5-2): Accuracy for the complete modified tagset
Table (5-3): Accuracy for the complete modified tagset with enlarged training corpora
Table (5-4): Accuracy for the ungrammatized modified tagset
Table (5-5): Types of errors
Table (5-6): A sample of errors in grammatized tests
Table (5-7): Percentage error for each error type in the grammatized tests
Table (5-8): A sample of errors in ungrammatized tests
List of Figures and Illustrations

Figure (2-1): Copy of the manually tagged excerpt cited by Khoja
Figure (3-1): Example of a general classification tagset
Figure (3-2): Example of a detailed tagset for verbs
Figure (3-3): The entire Penn Treebank tagset
Figure (3-4): Preliminary steps for tagging
Figure (3-5): Lexical rule learning
Figure (3-6): Context rule learning
Figure (3-7): Tagging
Figure (4-1): (a) A sentence from the corpus; (b) A transliteration of a sentence from the corpus
Figure (4-2): Tagged and detransliterated sentence from the corpus
Figure (4-3): Tags of plurals
Figure (4-4): Tags of defected verbs
Contents

Abstract
Acknowledgements
List of Tables
List of Figures and Illustrations
Contents
Chapter One: Introduction
  1.1 Background
  1.2 Part-of-Speech Tagging Methods
  1.3 Machine learning in POS tagging
    1.3.1 N-gram and Markov models
    1.3.2 Neural Networks
    1.3.3 Vector-based clustering
    1.3.4 Transformation-Based Learning
  1.4 Aims and objectives
  1.5 Tools used in this work
    1.5.1 Corpus
    1.5.2 Tagset
    1.5.3 Tagger
  1.6 Testing strategy
  1.7 Chapters summary
Chapter Two: Literature Review
  2.1 Corpora in European languages
    2.1.1 General Corpora
    2.1.2 Historical Corpora
    2.1.3 Annotated Corpora
  2.2 Arabic corpora
  2.3 Arabic taggers
  2.4 Definition of training and testing texts
Chapter Three: Design
  3.1 Tagsets and the adopted Arabic tagset
    3.1.1 Tagsets
    3.1.2 The adopted tagset
  3.2 Corpora used for this work
  3.3 The Brill system
    3.3.1 Learner
    3.3.2 Tagger
  3.4 Testing strategies
Chapter Four: Implementation and Testing
  4.1 Corpus
  4.2 Tagset
    4.2.1 Nouns
    4.2.2 Verbs
    4.2.3 Particles
  4.3 The program
  4.4 Rules
    4.4.1 Lexical Rules
    4.4.2 Contextual Rules
  4.5 Testing
Chapter Five: Results and discussion
  5.1 Results
  5.2 Examples of errors in tagging
  5.3 Discussion
  5.4 Evaluation
  5.5 Accomplishments
Chapter Six: Conclusions and Future work
  6.1 Conclusion
  6.2 Future work
References
Appendix A: Sample tagged sentences as compared to the truth corpus
Appendix B: The complete tagset (Tagset2)
Appendix C: The Lex file used for initial state tagger
Chapter One
Introduction
1.1 Background
It is very hard, or even impossible, to encode manually all the information about a human language that is needed to build a system that annotates text with structural descriptions [9]. Such a task would require deciding on the type of grammar to be used, gathering a great deal of morphological, lexical, and syntactical information about the language itself, and encoding all of it in an algorithmic way that the intended system can handle. This is not easy; it would consume a great deal of time and would probably require a group of language experts. Even if achieved, the result would be language specific and could not be applied to different languages.
For this reason, language processing has recently been tackled with different approaches. One of the fastest-growing approaches relies on machine learning techniques. These techniques start from samples of manually annotated text, which must be reviewed very carefully to make sure they represent the truth for the given language. A learning system is then applied to that text to figure out the cues for assigning the given annotations to the given words. These cues are converted either into statistical information stating the probabilities of assigning a given annotation to a certain word according to its lexical structure and/or its location in the context, or into a collection of rules stating when and why to assign a given annotation to the word. Afterwards, another system, the tagger, is given new raw text to be annotated; it goes through the text and assigns annotations to the words according to the accompanying cues (probability figures or rules).
Clearly, the use of rules obtained from a learning system is preferable to the use of probability figures for the following two reasons:
1- Rules are easy to understand and directly reflect a human understanding of the language.
2- Rules can be manipulated by changing, omitting, or adding rules when doing so would enhance the annotation ability of the system.
For these reasons, we have chosen to use a rule-based machine learning system for our work.
Part-of-speech (POS) tagging means taking a text written in a human language and identifying its lexical and/or syntactical structure by assigning to each word/token in the text the correct part of speech, such as noun, verb, adjective, or adverb. Furthermore, the tags in many cases give additional features, such as number (singular/plural), tense, and gender, thus turning the raw (unannotated) text into an annotated, or tagged, corpus. This process of tagging requires a set of tags that classify words according to their lexical and syntactical meanings. This set is referred to as a tagset.
Part-of-speech tagging is the foundation of natural language processing (NLP) systems, and thus has been an active area of research for many years [25]. The use of corpora has become an important issue in Language Engineering (LE), the field that deals with all the different ways of handling natural languages computationally. There are many ways to deal with corpora. These include the use of monolingual corpora that are annotated to reflect some information about the language structure, and parallel corpora, i.e. corpora of the same text written in two or more different languages, where at least one of the corpora is annotated, to help annotate the other corpora or to help extract information from them. Both kinds are valuable sources of linguistic metaknowledge, which forms the basis of techniques such as tokenization, POS tagging, and morphological and syntactic analysis, which in turn can be used to develop LE applications [9].
An annotated corpus is a corpus that has had some level of linguistic detail added to
the raw data. For example, the Penn Treebank [41] is an annotated corpus, because it
contains the linguistic structure and part-of-speech tags for the words in the corpus.
A tagged corpus is more useful than an untagged corpus because there is more
information there than in the raw text alone. Once a corpus is tagged, it can be used to
extract information from the corpus. This can then be used for creating dictionaries and
grammars of languages using real language data. Tagged corpora are also useful for
detailed quantitative analysis of text [22].
Other applications of Part-of-speech tagging include speech recognition [14],
enhancing input methods [6], machine translation [24], and discovering errors in OCR
files [20].
1.2 Part-of-Speech Tagging Methods
It has recently become clear that automatically extracting linguistic information from a sample text corpus can be an extremely powerful method for building accurate natural language processing systems [9]. There are several part-of-speech taggers that are widely used for Indo-European languages, all of which are trained, and retrainable, on text corpora. Structural ambiguity can be greatly reduced by adding empirically derived probabilities to grammar rules and by computing statistical measures of lexical association. Word sense disambiguation can, in some cases, be done with high accuracy when all information is derived automatically from corpora. An effort has recently been undertaken to create automated machine translation systems, where the linguistic information needed for translation is extracted automatically from aligned text corpora [22].
These are just some of the recent applications of corpus-based techniques in natural language processing. Along with great research advances, the infrastructure is in place for this line of research to grow even stronger. With on-line corpora, the use of corpus-based natural language processing is growing, producing better performance and becoming more readily available. There is a worldwide trend to annotate large corpora with linguistic information, including parts of speech.
Many techniques have been used to tag English and other European language corpora, such as:
1- Rule-based techniques: used by Greene and Rubin in 1970 to tag the Brown corpus. They designed the tagger TAGGIT [13], which used context-frame rules to select the appropriate tag for each word. It achieved an accuracy of 77%. More recently, interest in rule-based taggers has re-emerged with Eric Brill's tagger, which used another type of rules called transformation rules (Section 3.3) and achieved an accuracy of 97.5%.
2- Hidden Markov models: used in the 1980s to select the appropriate tag. Examples of such taggers are:
   i. CLAWS [12], which was developed at Lancaster University and achieved an accuracy of 97%.
   ii. The Xerox tagger [38], developed by Doug Cutting, which achieved an accuracy of 96%.
3- Hybrid taggers: these use a combination of both statistical and rule-based methods. This approach achieved an accuracy of 98%, as reported by Tapanainen and Voutilainen [31], who applied both techniques separately and then aligned the output.
1.3 Machine learning in POS tagging
Machine learning deals with acquiring knowledge from an environment in a computational manner, in order to improve performance. Many factors have contributed over the past couple of decades to the blending of ML and NLP. These factors include the ever-expanding availability of large corpora, more powerful computing resources, and a greater demand for natural language based applications [27]. This has led to the use of many machine learning techniques in natural language processing, and in particular in part-of-speech tagging [34].
Since the method we are using in our work belongs to these techniques, we give here a more detailed idea about these methods.
1.3.1 N-gram and Markov models
A Markov model of a sequence of states or symbols (e.g. words or Part-of-speech
tags) is used to estimate the probability or “likelihood” of a symbol sequence. It can
be used for disambiguation, e.g. for choosing the most likely tag for an ambiguous
word in a given context, by estimating the probability of every candidate sequence.
A Markov model applies the simplifying assumption that the probability or
“likelihood” of a long sequence or chain of symbols can be estimated in terms of its
parts or n-grams.
Hidden Markov Models (HMMs) [18] are a variant of Markov models with two layers of states: a visible layer corresponding to input symbols (e.g. words) and a hidden layer, learnt by the system, corresponding to broader categories (e.g. word-classes).
Markov or n-gram models have been widely used for part-of-speech tagging, following their successful use in tagging the LOB Corpus [19].
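As a concrete illustration of the n-gram idea, the sketch below (not part of the Brill system used in this thesis; the sample data, simplified tags, and function names are invented for the example) estimates bigram tag-transition and word-emission counts from a tiny tagged sample and uses them to score candidate tags for an ambiguous transliterated word.

```python
from collections import Counter

def train_bigram_model(tagged_sentences):
    """Collect tag-transition and word-emission counts from [(word, tag), ...] sentences."""
    transitions, emissions, tag_counts = Counter(), Counter(), Counter()
    for sentence in tagged_sentences:
        prev = "<s>"                              # sentence-start pseudo-tag
        tag_counts[prev] += 1
        for word, tag in sentence:
            transitions[(prev, tag)] += 1
            emissions[(tag, word)] += 1
            tag_counts[tag] += 1
            prev = tag
    return transitions, emissions, tag_counts

def sequence_likelihood(words, tags, model):
    """Score a candidate tag sequence as a product of transition and emission probabilities."""
    transitions, emissions, tag_counts = model
    prob, prev = 1.0, "<s>"
    for word, tag in zip(words, tags):
        prob *= (transitions[(prev, tag)] / max(tag_counts[prev], 1)) * \
                (emissions[(tag, word)] / max(tag_counts[tag], 1))
        prev = tag
    return prob

# Choose the more likely tag for the ambiguous transliterated word "ktb"
# (simplified tags: NC = common noun, VP = perfect verb).
sample = [[("qr>", "VP"), ("alwld", "NC"), ("ktb", "NC")],
          [("ktb", "VP"), ("alwld", "NC"), ("alwajb", "NC")]]
model = train_bigram_model(sample)
best = max([("NC",), ("VP",)], key=lambda t: sequence_likelihood(["ktb"], t, model))
print(best)   # ('VP',) here, because sentences in this tiny sample start with a verb
```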
1.3.2 Neural Networks
Neural networks (NNs) have been widely explored in Artificial Intelligence, and they have been studied for many years in the hope of achieving human-like performance in many fields.
There are many rules used in the learning process of neural networks. The type of learning in a neural network is determined by the manner in which the parameters change. This can happen with or without the intervention of a supervisor; hence, neural network learning is divided into three groups: supervised learning, unsupervised learning, and reinforcement learning.
Neural networks typically consist of multiple layers of nodes, where the lowest layer is the input layer, the highest is the output layer, and the layers in between are the hidden layers. Nodes of adjacent layers are connected via weighted links. The weights on these links are manipulated using a special function, so that the given input produces the desired output. When this stage is reached, the weights on the links are recorded, or learnt, as the proper values for the given input to produce the desired output.
In part-of-speech tagging applications, the input consists of all the information the system has about the parts of speech of the current word, i.e. all its possible tags, the tags of a certain number (p) of the preceding words, and the tags of another number (f) of the following words. The output of the network is the appropriate tag of that word in this context, and the weights on the links are adapted accordingly.
When the learning process is done, the tagger will have a huge number of weights,
along with their tag sequences, to be applied to tag new texts.
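As a sketch of how the input to such a network could be assembled (illustrative only; the tag inventory, window sizes, and function names are invented, and no training is shown), the code below builds a feature vector from one-hot encodings of the tags of p preceding and f following words.

```python
TAGS = ["NC", "NP", "VP", "VI", "PC", "PUNC"]     # simplified tag inventory for the example
PAD = "NONE"                                       # placeholder tag outside the sentence

def one_hot(tag):
    """One-hot encode a tag; all zeros for the out-of-sentence placeholder."""
    return [1.0 if tag == t else 0.0 for t in TAGS]

def window_features(tags, index, p=2, f=1):
    """Concatenate one-hot vectors for the p preceding and f following tags of a word."""
    features = []
    for offset in range(-p, f + 1):
        if offset == 0:
            continue                               # the current word's own tag is the output, not an input
        j = index + offset
        features.extend(one_hot(tags[j] if 0 <= j < len(tags) else PAD))
    return features

# Input vector for the third word, given (possibly tentative) tags of its neighbours:
sentence_tags = ["PC", "NC", "NC", "VP"]
print(len(window_features(sentence_tags, 2, p=2, f=1)))   # 3 context positions x 6 tags = 18
```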
1.3.3 Vector-based clustering

This approach uses co-occurrence statistics to construct vectors that represent word classes or meanings by virtue of their direction in a multi-dimensional word-collocation space. For example, Atwell [4] annotated each word in a sample from the LOB Corpus with a vector of neighboring word-types; words with similar vectors were clustered into word-classes.
One method for calculating semantic word vectors is to use random labeling of words in narrow context windows to calculate semantic context vectors for each word type in the text data. Incorporating linguistic information in the context vectors can enhance the results.
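A minimal sketch of the co-occurrence representation behind this approach is shown below (the sentences, window size, and function names are invented; real systems work over much larger corpora): each word is mapped to counts of its neighbours, and words are compared by cosine similarity before clustering.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Map each word to counts of the words appearing within +/- `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][sentence[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# Words occurring in similar contexts end up with similar vectors.
sents = [["the", "boy", "reads", "a", "book"], ["the", "girl", "reads", "a", "letter"]]
vecs = cooccurrence_vectors(sents)
print(cosine(vecs["boy"], vecs["girl"]))
```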
1.3.4 Transformation-Based Learning
Brill has developed a symbolic machine learning method called Transformation-Based Learning (TBL) [7,8,9]. Given a tagged training corpus, Transformation-Based Learning produces a sequence of rules that serves as a model of the training data. To derive the appropriate tags, each rule is applied, in a specific order, to each instance in an untagged corpus.
TBL relies heavily on a large annotated training corpus and on reasonable default heuristics to get things started. It learns rules that are clearly coupled to human understanding of a natural language, and it allows rules to be easily acquired for different domains or genres.
There is a gap between an initial semantic network generated from input data and a semantic network representing profound knowledge, from which a knowledge database can be constructed. By using transformation rules, the semantic analysis method is based on pattern matching with a semantic network. A transformation rule description language allows users to manipulate their knowledge base and to define rules.
1.4 Aims and objectives
The main purpose of this research work is to produce a system that can correctly tag
Arabic words with high accuracy utilizing a set of available tools after modifying them
to suit our purposes. These tools are a corpus, a tagset, and a tagger.
1.5 Tools used in this work
1.5.1 Corpus
Most research on tagging for other languages has pretagged standard corpora available to work on and to test the performance of the systems against. For Arabic, the case is different: no standard corpora are available. This doubles the burden on anyone who wants to work on this subject; instead of concentrating on the tagger, one has to shift part of one's attention to preparing a large enough truth corpus tagged with the chosen tagset, a task which is tedious and time consuming.
The lack of an easily available standard tagged Arabic corpus was the motivation for this work. At the beginning of this study, the researcher intended to work on morphological analysis of Arabic by machine learning, but on reviewing the literature he discovered that no dependable tagged corpus was available, which is one of the basic requirements for such a study. He found that most researchers in the field complain about this problem. So he decided to start from scratch and work in the direction of providing such a corpus.
For this purpose, the researcher started with a raw corpus and made some revisions and a series of automatic taggings and manual corrections until the study reached satisfactory results. Because of time limitations, the size of the corpus reached is moderate and not as large as one would wish.
The corpus used for this study is derived from a raw corpus whose data are articles from two Jordanian journals, Aldustur and Aldustur Aleqtesady, but it had to go through extensive preprocessing, which will be explained in detail in Chapter Four.
1.5.2 Tagset
We adapted the detailed Khoja tagset, a morphosyntactic tagset that is very rich and comprehensive for Arabic, and hence hard to deal with, whether manually or automatically. The original tagset consists of 177 tags; this number is heavily increased by the fact that we do not use a stemmer in the tagging system, so another group of composite tags is introduced to handle composite words. These tags can be composites of two, three, or even four basic tags.
This tagset was revised by introducing new tags and refining some of the original tags. The revision included distinguishing between plural forms (beneficial for morphological studies) and recognizing defected verbs (beneficial for syntactical studies). This modification raised the number of basic tags to 319. The complete new tagset is shown in Appendix B.
Another subset of the resulting tagset is introduced by removing case information, thus gaining two advantages: decreasing the size of the tagset and, more importantly, getting rid of some complexity, leading to better accuracy, as will be seen in Chapter Five.
Another set of tests was performed on the original tagset as well, where we noticed that very little gain in accuracy was achieved by modifying the tagset. It should be kept in mind, however, that the main purpose of modifying the tagset was not better accuracy; rather, it was clarity of tags and richer features for some of the tags. In fact we expected to lose some accuracy for this reason, and we were willing to sacrifice it.
1.5.3 Tagger
The tagger used for this study is the Brill tagger, which will be introduced in detail in Section 3.3. It is based on the transformation rule method. This tagger was originally designed for tagging English text, and has been adapted by many researchers to other languages such as Hungarian [23] and German [28,33]. The reasons for choosing this tagger are:
1. The source code is available, and written mostly in a common language (C),
which makes the modification possible.
2. It is based on transformation rules, which makes it possible to adapt to other
languages.
3. The use of transformation rules also makes it easy to understand the
underlying reasons behind choosing certain tags (see Section 4.4), and easy
to modify the rules and/or omit some of them if needed. This is in contrast to
using statistical taggers (Section 2.2), where information is converted into a
huge set of numbers, representing the probabilities of choosing a specific tag
for each word.
A lot of work has to be done for adapting the tagger to our purposes, which
includes:
1. Manually tagged Arabic corpus has to be prepared, since we have to start from
scratch. This corpus is then enlarged in many steps.
2. Since the original system is written for Unix, and makes use of some of its facilities, we first attempted to convert it to the DOS environment, which is more common to us and in our academic environment. A lot of work was done in this direction, but many problems were encountered. The latest and hardest of these was the fact that Turbo C under DOS does not deal with extended RAM explicitly, as C under Unix does. So at last we decided to switch to Unix, a task that also had many obstacles in the beginning, but worked out smoothly in the end.
However, we still have an ambition, even after the completion of this project, to
switch back to DOS/Windows, and attempt to get a working DOS/Windows version.
3. The original code mixes C in most of its parts with Perl in some others, especially the lexical learner, which we had to work on. Perl was a new language for the researcher, so some work had to be done in this direction: first learning as much of Perl as was needed, and then using that knowledge to make an efficient change to the learner, so that it makes use of the program generated by Lex for the lexical analysis of the corpus. The problem that took most of our time and effort here is that exactly the same changes had to be made to both the learner, which is written in Perl, and the tagger, which is written in C.
1.6 Testing strategy
Testing was done using the method of cross validation. Because of the unavailability of a standard reference (truth) corpus, we had to be satisfied with a rather small corpus for this purpose. The corpus we prepared for learning was divided into three parts, and three tests had to be performed, each of which utilizes two thirds of the whole corpus for training and the other third for testing, changing the parts every time, and then the average of the results is taken. At this stage we used a total corpus of 38,000 words, so every test involves about 25,000 words for training and 13,000 words for testing. This whole experiment is done three times: once for the original tagset, a second time for the modified tagset, and a third time for the ungrammatized tagset. That means that three sets of corpora and three learning/tagging systems, each using the appropriate tagset, are prepared.
The rather small size of the corpus is justified by the lack of a standard tagged corpus. This is the best we could reach within the available time and effort, and we
think we achieved very promising results that can be enhanced by many improvements,
including the enlargement of the learning corpus. This work is probably the first real
step in the direction of having a standard Arabic corpus tagged with a rich and
comprehensive tagset, not forgetting the contribution of Khoja who provided the
baseline for our work.
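A minimal sketch of this three-fold cross-validation procedure is given below; `train_and_tag` is a placeholder standing in for a run of the Brill learner on the training portion followed by the tagger on the test portion.

```python
def cross_validate(tagged_corpus, train_and_tag, folds=3):
    """Average tagging accuracy over `folds` train/test splits of a truth corpus.

    tagged_corpus: list of (word, gold_tag) pairs.
    train_and_tag: placeholder for training a tagger on the training part
                   and tagging the words of the test part.
    """
    size = len(tagged_corpus) // folds
    accuracies = []
    for k in range(folds):
        test = tagged_corpus[k * size:(k + 1) * size]
        train = tagged_corpus[:k * size] + tagged_corpus[(k + 1) * size:]
        predicted = train_and_tag(train, [w for w, _ in test])
        correct = sum(1 for (_, gold), tag in zip(test, predicted) if gold == tag)
        accuracies.append(correct / len(test))
    return sum(accuracies) / folds
```

With a corpus of about 38,000 words and three folds, each run trains on roughly 25,000 words and tests on the remaining 13,000, as described above.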
1.7 Chapters summary
Chapter Two gives a literature review of tagging and discusses taggers and different tagging strategies, concentrating on the efforts devoted to Arabic, in terms of the three parts of a tagging system: corpora, tagsets, and taggers.
Chapter Three talks about the original tools chosen for this work, namely the Khoja tagset and the Brill tagger, giving a detailed idea of their form and the way they are designed. It then describes the strategy used for testing.
Chapter Four explains our contribution in modifying the tagset, preparing the corpus for work, and adapting the tagger to fit our needs.
Chapter Five gives the tests and results of our experiments. First it gives the average accuracies of each of the three performed tests; then it discusses the types of errors encountered, studies their causes, and suggests solutions to them.
Chapter Six gives the conclusions of the work and suggests future expansions.
Chapter Two
Literature Review
2.1 Corpora in European languages
In European languages and some other languages, there are many famous and standard corpora available to researchers, either to be used in extracting information of interest to their fields of study or as references for testing their tagging strategies. Below is a list of just a few examples of such corpora:
2.1.1 General Corpora
The Brown Corpus: a corpus of written American English, with a corresponding British corpus, the Lancaster-Oslo/Bergen corpus (LOB) [19], a corpus of written British English. The Brown corpus was compiled in the 60's, while its British counterpart was compiled in the 70's. Both consist of around
one million tokens (i.e. words, counted every time they appear).
The Brown corpus was used in seminal linguistic and psycholinguistic research
that involved word frequency, and continues to be used today. It comes as text,
tagged, and parsed.
BNC: The British National Corpus (BNC) [40,42] is a 100 million word
collection of samples of written and spoken language from a wide range of
sources, designed to represent a wide cross-section of current British
English, both spoken and written. Because of its large size and its sampling of written and spoken language, the BNC is very good for research involving lexical frequency. Words with very low frequency are more likely to occur in a 100-million-word corpus than in a 1-million-word corpus.
The Amsterdam Corpus (AC): This corpus [30] was compiled at the beginning of the 1980s by a group of scholars directed by Anthonij Dees and resulted in the Atlas des formes linguistiques des textes littéraires de l'ancien français. The electronic version of the AC was provided by Piet van Reenen (Free University of Amsterdam). It contains about 200 different texts, some of them in several manuscripts, which adds up to a total of 289 texts and close to three million word forms. These forms have been manually annotated with 225 numeric tags encoding part-of-speech and other morphological categories (e.g. "566" for verb, future tense, 3rd person, plural).
2.1.2 Historical Corpora
Helsinki Corpus: The Helsinki Corpus of English Texts: Diachronic and
Dialectal [39] is a computerized collection of extracts of continuous text.
The Corpus contains a diachronic part covering the period from c. 750 to c.
1700 and a dialect part based on transcripts of interviews with speakers of
British rural dialects from the 1970's. The aim of the corpus is to promote
and facilitate the diachronic and dialectal study of English as well as to offer
computerized material to those interested in the development and varieties of
language. The uses for such a corpus are fairly obvious: it is used for
diachronic research, whether one is interested in lexical frequency, semantics, syntax, etc. This corpus also has a parsed version.
2.1.3 Annotated Corpora
Celex: lexical databases of English, Dutch, and German [40]. This corpus contains ASCII versions of the CELEX databases. It was developed as a joint
enterprise of the University of Nijmegen, the Institute for Dutch Lexicology
in Leiden, the Max Planck Institute for Psycholinguistics in Nijmegen and
the Institute for Perception Research in Eindhoven. This corpus contains
detailed information on the orthography, the phonology (phonetic
transcriptions, variations in pronunciation, syllable structure, primary stress),
the morphology (derivational and compositional structure, inflectional
paradigms), the syntax (word class, word class-specific subcategorizations,
argument structures), and word frequency (summed word and lemma counts,
based on recent and representative text corpora). Thus it is useful for various
types of linguistic and psycholinguistic research.
The Penn Treebank: The Penn Treebank Project [41] annotates naturally occurring text for linguistic structure. Most notably, it produces skeletal parses showing rough syntactic and semantic information, a bank of linguistic trees. It also annotates text with part-of-speech tags and, for the Switchboard corpus of telephone conversations, dysfluency annotation. The
Penn Treebank project has annotated the Switchboard Corpus, the Wall
Street Journal Corpus, Chinese Journal Project, the Brown Corpus, and the
Helsinki Corpus (among others). It is very useful for syntactic research, or
any research involving the syntactic/semantic relationships between words.
Tgrep (tree-grep) is a useful tool for use with this corpus.
There are corpora available for many other languages. Examples include:
American English [35], German [29], Hungarian [23,26], Swedish [5, 25],
and Hebrew [15]. More information about all these corpora and others can
be easily found on the Internet.
2.2 Arabic corpora
A number of electronic Arabic text corpora have been compiled [32], but these corpora are raw, which means that their exploration remains problematic. Some of the analyses that have been conducted on these corpora involve very limited data. Others have developed proficient word form analyzers, such as the analyzer by the Xerox European Research Centre, but the question remains whether these analyzers provide an adequate solution for the exploration of Arabic tagged corpora.
In order to explore corpora in an efficient and economically reliable way, some preliminary operations ought to be performed [32]. As is generally known, analyzing Arabic corpora is more complex than analyzing corpora of other languages, for three main reasons. In the first place, the Arabic language is very polysemic, much more so than, for example, Dutch. In Dutch, one way to create new words is to join two words together to obtain a compound. These new words are very widespread, but they are also identifiable by a computer in a simple way, i.e. by defining a word as a string of characters between two blanks. In Arabic, new meanings for words are often created by extending the older meaning of an existing word. This means that the external morphological form of the word does not change, in spite of the fact that the word carries a new meaning.
A second element that makes analysis of Arabic more complex than that of other languages is the fact that the language is usually not vocalized, which means that the degree of ambiguity of words as separate units is much greater than, for example, in English or Dutch. Words, in their raw form, can belong to different grammatical categories, as seen in the string of characters "ktb". This string of characters stands for the verb "kataba" (to write) as well as for the plural "kutub" (books). This complicates the searching for words in a corpus of texts.
In the third place, the problem is complicated by the fact that in Arabic a number of prefixes and suffixes are directly linked to the word. This makes searching by computer even more complex. For example, the string of characters "fhm" can stand for the verb "fahima" (understood), but it can just as well stand for the particle and suffix "fahum" (since they) or for the particle and verb "fahamma" (then he considered).
These facts and others are behind the lack of tagged Arabic corpora. One of the
researchers in this line [11] noticed “the frustrating reality was that the NLP experts
with experience in dealing with European languages and scripts deemed the problem [of
providing tagged corpora and taggers for Arabic] trivial and therefore not worth wasting
time on. While the available Arabic language experts had no computer experience and
deemed the problem impossible to solve and therefore not worth wasting time on it”.
This is true to some extent, but what is certainly true is that there are very few available corpora for the Arabic language. Some large corpora do exist; unfortunately, they are not free. Also, although some of these corpora are marked up with XML or SGML tags, none of them is POS tagged [32, 37].
There are some efforts towards the preparation of a POS-tagged corpus for Arabic, but
they are still in their early and testing stages. One of these works is that of Shereen
Khoja [16,17]. Although her work has some limitations and deficiencies, as will be
explained below, it is probably the first step towards building an Arabic POS-tagged
corpus. She introduced two tagsets, one is very small containing only five classes or
basic tags (noun, verb, particle, residual, punctuation), and the other is very
comprehensive and appropriate for Arabic, containing more detailed tags (i.e. singular,
masculine, definite common noun). She used the first tagset to manually tag 50,000
words of Arabic newspaper text. This type of tagging is obviously of little use, but she
also tagged 1,700 words with the second tagset [37]. I sent many email messages to
Miss Khoja hoping to get a copy of her tagged corpus and benefit from it, but
unfortunately I did not receive any response. However, from the small excerpt (see
20
22. Figure 2-1) she enclosed in her paper [16], it seems that the corpus is not well built,
since there are in that short passage many mistagged items. Mistakes include the
following (refer to Section 3.2.2 and to Appendix B to get an idea about meanings of
tags):
1. Mistagging adjectives as nouns. Example: الشريفين is tagged as NCDuMGD instead of NADuMGD. There are many instances of such an error.
2. Case information for nouns seems almost random. Example: مبناسرة االور ا الر ين اا is tagged as PPr-NCSgFGI NCSgMAD NCSgMND instead of PPr-NCSgFGI NCSgMGD NASgMGD, and أعري االكرااالير is tagged as VPSg3M NCSgMND NCSgMAD instead of VPSg3M NCSgMND NASgMND.
3. Tagging singular as plural, e.g. لبالده is tagged as PPr_NCPlFGI_NPrPSg3M instead of PPr_NCSgFGI_NPrPSg3M.
4. Tagging feminine as masculine, e.g. عر اأ كر االهاراي is tagged as PPr NCSgFNI NCPlMND instead of PPr NASgMGI NCPlFGD.

Figure 2-1: Copy of the manually tagged excerpt cited by Khoja

These are just a few examples of the mistakes found in the 48-word passage. Note also that some of the words cited in the above examples contain more than one type of mistake.
It is worth mentioning here that mistakes in manually tagged corpora are very unfavorable, since these corpora are considered to represent the truth and are to be used as guidelines for learning systems. If they are not carefully built, the whole system is a failure, regardless of how high the reported accuracy may be.
We used the same detailed Khoja tagset to tag about 38,000 words, and we have three versions of this corpus: one tagged with the original detailed tagset as proposed in [16], the second tagged with a modified version of that tagset, as explained in Section 4.2 and presented in Appendix B, and the third tagged with the modified version with grammatical information removed. We do not claim perfection, but we think that our work, besides being much larger, is also much more accurate in applying the tagset to real Arabic text.
2.3 Arabic Taggers
Very few people have worked towards building a complete tagger for Arabic. The following cases, though none of them is complete, are among the best examples:
Abuliel [1]: in his paper he described some preparatory steps towards building an Arabic POS tagger. Rule-based techniques were used for finding phrases,
analyzing affixes of the word, and discovering proper nouns. The tagset used
in this work is not specified, and no results are reported concerning the
overall performance of a tagging system.
Alshalabi et al [3] dealt with vowelized Arabic text and considered
recognizing nouns only. This work showed how to discover nouns in the text
but does not reach the stage of tagging. The fact that the system is
constrained to vowelized text makes it deficient. Although they talked about
part-of-speech tagging and gave a survey of taggers, they did not really do
any tagging, nor did they give any tagset for this purpose. They reported
95.4% accuracy, which is a good performance rate, but we should keep in
mind that the system is constrained to completely vowelized words, which minimizes ambiguity, and that it is restricted to discovering nouns, which
simplifies the classification task.
Maloney and Niv [21] also worked with names only, in their name recognition system called TAGARAB.
Freeman, from the Department of Near Eastern Studies at the University of Michigan [11], reported that he was attempting to adapt the Brill tagger to
morphological analysis, and explained the hurdles he encountered in that
work. According to his paper he did not reach the stage of tagging to report
any accuracy rate.
Khoja: the title of her paper [17] may lead one to conclude that she has a complete tagger. That misled us at the beginning of our work, but after carefully studying the paper we concluded that she had only done some preliminary work in this direction and was still working on the tagger. This was confirmed by consulting her website [37], where she declares: "As far as I know, a POS tagger has yet to be developed for Arabic, which is why I am developing one myself."
2.4 Definition of training and testing texts
A corpus of over 38,000 words was prepared. Three versions of this corpus are
available: one tagged with the original Khoja tagset, the second with a modified
tagset as explained in Section 4.1, and the third is tagged with a subset of the
modified tagset which excludes the grammar information, as explained in Section
4.5.
Each of these corpora is divided into three equal portions, and then a cross validation is done three times, using a different two thirds of the corpus for training and the remaining third for testing each time. The average of the three tests is taken as the estimated performance accuracy of the tagger. This means that nine tests are done in this way.
In addition to these tests, three other tests were performed on the corpus tagged
with the complete modified tagset, this time to test the effect of enlarging the corpus
size on the accuracy of the tagger. To do that, about five sixths of the corpus are
used for training, and the other one sixth for testing, for each new test. Then, the
average is taken to get an estimate of the overall accuracy.
Chapter Three
Design
3.1 Tagsets and the adapted Arabic tagset
3.1.1 Tagsets
As mentioned in section 2.1, tagging requires a set of tags, which classify the words
according to their lexical and syntactical meanings, i.e. a tagset.
Tagsets vary in size: the number of tags used by different systems varies a lot. Some systems use fewer than 20 tags, while others use over 400. The larger the tagset, the more information is carried in each tag. For example, we may have a basic tagset which divides words into a very small set of classes, as in Figure 3-1 below. We may refine this tagging by classifying nouns into singular and plural, verbs into present and past, and so on, as shown in Figure 3-2, which lists a subset of a refined tagset showing the different tags that belong to the general class verb in English. Words can be classified further still, as shown in Figure 3-3, which gives a complete list of the Penn Treebank tagset [41].
Tag   Description        Tag   Description
NN    Noun               JJ    Adjective
NNP   Proper noun        CC    Coord conj
DT    Determiner         CD    Cardinal number
IN    Preposition        Prp   Personal pronoun
VB    Verb               RB    Adverb
-R    Comparative        -S    Superlative
-$    Possessive

Figure 3-1: Example of a general classification tagset.
3.1.2 The adopted tagset
This section describes the tagset adapted for our work. The tagset is based on the
Khoja tagset as mentioned earlier. We introduce the tagset as described by its designer
[20]. The modifications that are specific to our work are marked using an asterisk
symbol (*), and are further discussed in detail in Section 4.2. The original tagset
(Tagset1) contains 177 tags: 103 nouns, 57 verbs, 9 particles, 7 residual, and 1
punctuation. We derived two other tagsets: Tagset2, a modified version of Tagset1,
containing 319 tags, and Tagset3, a simplified version of Tagset2, which excludes
grammatical information, with 189 tags. The complete modified tagset (Tagset2) is
given in Appendix B. A full description of each of the tags and examples of Arabic
words that take those tags now follows. This description is based on that given by Khoja.
The five main categories for words are:
1. N [noun]
2. V [verb]
3. P [particle]
4. R [residual]
*5. punc [punctuation]
Note that category number 5 is preceded by an asterisk (*). This indicates a modification in the name of the category, or a completely new category (or subcategory), as shall be seen in subsequent examples.
The residual category contains foreign words, mathematical formulae, and numbers. The punctuation category contains all punctuation symbols, both Arabic and foreign, such as ?, ،, ., !, and ؟.
The subcategories of noun are:
1.1 C [common]
1.2 P [proper]
1.3 Pr [pronoun]
1.4 Nu [numeral]
1.5 A [adjective]
*1.6 T [title]
Adjectives are nouns that describe the aspects of an object. Adjectives inherit the
properties of nouns, so they take “nunation” when in the indefinite and can take the
definite article when definite. For example, alwld alSgyr “The small boy” contains the
adjective Sgyr “small”. This adjective can take the definite article as in ‘darasa alwaladu
alSagyr’ “the small boy studied”, and it can also have “nunation” as in ‘hasan Sgyr’
“Hassan is small”.
Examples of these subcategories include:
• Singular, masculine, accusative, common noun such as ktab "book" in the sentence
‘>x* alwld ktaba’ “the boy took a book”.
• Singular, masculine, genitive, common noun such as ktab “book” in the sentence
‘drst mn ktab’ “I studied from a book”.
• Singular, feminine, nominative, common noun such as mdrsp “school” in the
sentence ‘h*h mdrsp’ “this is a school”.
Note here and in subsequent examples that vocalization does not appear in
transliteration, because we do not assume dealing with vocalized text.
The subcategories of the pronoun are:
1.3.1 P [personal]
1.3.2 R [relative]
1.3.3 D [demonstrative]
The personal pronouns can be detached words such as ‘hw’ “he”, or attached to a
word in the form of a clitic. The attached pronouns can be attached to nouns to indicate
possession, to verbs as direct object, or attached to prepositions such as fyh “in it”.
Some examples of pronouns include:
• Third person, singular, masculine, personal pronoun, such as hw “him”.
• Singular, feminine, demonstrative pronoun, such as h*h “this”.
The subcategories of the relative pronoun are:
1.3.2.1 S [specific]
1.3.2.2 C [common]
Examples of relative pronouns include:
• Dual, feminine, specific, relative pronoun, such as alltan “who”.
• Plural, masculine, specific, relative pronoun, such as al*yn “who”.
• Common, relative pronoun, such as ‘mn’ “who”.
The subcategories of the numeral are:
1.4.1 Ca [cardinal]
1.4.2 O [ordinal]
*1.4.3 Na [numerical adjective]
We preferred to omit subcategory 1.4.3 and to add the related tags to normal adjectives. This kind of adjective, however, is not very common, and we did not encounter any of them in the corpus we used.
Examples of numerals include:
• Singular, masculine, nominative, indefinite cardinal number such as ‘>rbEp’ “four”.
• Singular, masculine, nominative, definite ordinal number such as ‘alrabE’ “the
fourth”.
The linguistic attributes of nouns, adjectives, and numerals that have been used in this tagset are:
(i) Gender: M [masculine], F [feminine], N [neuter]
(ii) Number: Sg [single], Du [dual], *Plm [masculine sound plural], *Plf [feminine sound plural], *Plb [broken plural]
(iii) Person: 1 [first], 2 [second], 3 [third]
(iv) Case: N [nominative], A [accusative], G [genitive]
(v) Definiteness: D [definite], I [indefinite]
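To make the structure of these tags concrete, the sketch below decodes a Khoja-style common-noun tag such as NCSgFGI into its attribute fields. It is only an illustrative reading of the tag format: the lookup tables are abbreviated, and it covers noun tags of this particular shape only.

```python
CATEGORY = {"NC": "common noun", "NP": "proper noun", "NA": "adjective"}
NUMBER = {"Sg": "singular", "Du": "dual", "Pl": "plural",
          "Plm": "masculine sound plural", "Plf": "feminine sound plural", "Plb": "broken plural"}
GENDER = {"M": "masculine", "F": "feminine", "N": "neuter"}
CASE = {"N": "nominative", "A": "accusative", "G": "genitive"}
DEFINITE = {"D": "definite", "I": "indefinite"}

def decode_noun_tag(tag):
    """Split a tag like 'NCSgFGI' into category, number, gender, case, definiteness."""
    category, rest = tag[:2], tag[2:]
    # number codes vary in length (Sg, Du, Plm, ...), so try the longest match first
    number = next(n for n in sorted(NUMBER, key=len, reverse=True) if rest.startswith(n))
    gender, case, definite = rest[len(number):]
    return (CATEGORY[category], NUMBER[number], GENDER[gender], CASE[case], DEFINITE[definite])

# 'NCSgFGI' is the default tag assigned to unknown words in this work:
print(decode_noun_tag("NCSgFGI"))
# ('common noun', 'singular', 'feminine', 'genitive', 'indefinite')
```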
Verbs are categorised into three main parts:
1. P [perfect]
2. I [imperfect]
3. Iv [imperative]
The definition of perfect verbs not only includes (i) the equivalent of English past tense verbs (i.e. describing acts completed at some past time), but also (ii) acts which at the moment of speaking have already been completed and remain in a state of completion, (iii) past acts that often took place or still take place (e.g. commentators are agreed (have agreed and still agree)), (iv) acts which are completed at the moment and by the very act of speaking (I sell you this), and (v) acts which are so certain to occur that they can be described as having already taken place (mostly used in promises, treaties, and so on) [16].
The imperfect does not in itself express any idea of time; it merely indicates a begun, incomplete, or enduring existence either in present, past, or future time. The imperative verbs order or ask for something to be done in the future.
Examples of verbs include:
• First person, singular, neuter, perfect verb 'ksrt' (كسرت) "I broke".
• First person, singular, neuter, indicative, imperfect verb '>ksr' (أكسر) "I break".
• Second person, singular, masculine, imperative verb 'aksr' (اكسر) "Break!".
The verbal attributes that have been used in our tagset are:
(i) Gender: M [masculine], F [feminine]
(ii) Number: Sg [single], Du [dual], Pl [plural]
(iii) Person: 1 [first], 2 [second], 3 [third]
(iv) Mood: I [indicative], S [subjunctive], N [neuter], j [jussive]
The two most notable verbal attributes that are fundamental to Arabic but do not
normally appear in Indo-European tagsets are the dual number, and the jussive mood.
The subcategories of particle are:
1.1 Pr [prepositions]
1.2 A [adverbial]
1.3 C [conjunctions]
1.4 I [interjections]
1.5 E [exceptions]
1.6 N [negatives]
1.7 A [answers]
1.8 X [explanations]
1.9 S [subordinates]
*1.10 dt [doutive]
*1.11 cr [certain]
*1.12 Str [stressive]
*LM [lm]
*LN [ln]
Examples of particles include:
• Prepositions: fy (في) "in"
• Adverbial particles: swf (سوف) "shall"
• Conjunctions: w (و) "and"
• Interjections: ya (يا) "O"
• Exceptions: swY (سوى) "except"
• Negatives: la (لا) "not"
• Answers: nEm (نعم) "yes"
• Explanations: >y (أي) "that is"
• Subordinates: lw (لو) "if"
3.2 Corpora used for this work
Early in our work, we were faced with the unavailability of corpora of MSA text. Even the ones that we read about in some previous works were not easily available, besides not fitting our needs well. We contacted some of the researchers, but only a few of them responded to our request and questions. One of these responses provided a raw corpus of excerpts from two Jordanian magazines, containing about 160,000 words. For the sake of saving time, we preferred working on this corpus to creating our own, in spite of the fact that the corpus needed some processing before it could be used in our experiments. These excerpts were provided as a Microsoft Word document in Arabic characters, which had to undergo a series of preparatory steps to be ready for use in our tagging task, as will be explained in detail in Section 4.1.
3.3 The Brill system
The Brill system is divided into two separate parts: the learner and the tagger. In the
following subsections we explain the way each of these two programs works.
3.3.1 Learner
Before the process of learning starts, the truth corpus undergoes a series of preliminary operations to prepare a set of files that are necessary for learning. These operations are sketched in Figure 3-4 and explained in more detail in Section 4.3.

Figure 3-4: Preliminary steps for tagging (manual review of the corpus, conversion to Brill format, transliteration, semi-automatic tagging, division of the corpus, untagging, and preparation of the final lexicon)
Transformation-based error-driven learning, as shown in Figures 3-5 and 3-6, works
as follows: First, unannotated text is passed through an initial-state annotator. Various
initial state annotators, representing different levels of complexity, have been used,
including: the output of a stochastic n-gram tagger; labeling all words with their most
likely tag as indicated in the training corpus; and simply labeling all words as nouns.
For example, Brill gave two simple algorithms to do that: one assigns to all unknown words the tag "NN" for common noun in the Penn Treebank tagset, and
the other assigns to every word in the corpus either of two tags: “NNP” for proper noun
if the word starts with a capital letter, or “NN” otherwise. This strategy is based on a
conclusion that common nouns constitute a high percentage of an English text. In this
research we used a more detailed strategy, where the pattern of the letters of a word is
compared with a predefined set of patterns, to determine which word class the word
belongs to, making use of the rules of Arabic morphology (Srf). Then a tag is assigned
to the word accordingly. If the word does not belong to any of the standard patterns, it is
assigned the tag "NCSgFGI", which stands for "singular, feminine, genitive, indefinite common noun", since this is the most probable tag for unknown words, as noticed when the manually tagged corpus was prepared. The different patterns used to tag unknown
words are further shown in Appendix C.
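A simplified sketch of this pattern-based initial-state tagging is given below. The actual implementation is a Lex specification interfaced with the learner and tagger (Appendix C); here the idea is shown in Python with a couple of invented patterns over transliterated word forms, falling back to the default tag NCSgFGI mentioned above.

```python
import re

# Illustrative (pattern, tag) pairs over transliterated word forms; the real rules,
# written in Lex, cover the standard Arabic morphological patterns (Srf).
PATTERNS = [
    (re.compile(r"^al.+p$"), "NCSgFGD"),   # definite article 'al' + feminine ending 'p' (ta marbuta)
    (re.compile(r"^al"),     "NCSgMND"),   # bare definite article: guess a definite masculine noun
]
DEFAULT_TAG = "NCSgFGI"                    # most probable tag for unknown words

def initial_state_tag(word):
    """Assign a default tag to an unknown word from its letter pattern (first match wins)."""
    for pattern, tag in PATTERNS:
        if pattern.match(word):
            return tag
    return DEFAULT_TAG

print(initial_state_tag("almdrsp"))   # matches the 'al...p' pattern -> NCSgFGD
print(initial_state_tag("ktab"))      # no pattern matches -> NCSgFGI
```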
Once text has been passed through the initial-state annotator, it is then compared to
the truth. A manually annotated corpus is used as our reference for truth. An ordered list
of transformations is learned that can be applied to the output of the initial state
annotator to make it better resemble the truth. Each transformation has two components: a rewrite rule and a triggering environment. A rewrite rule can be in the form
X → Y, meaning "change the tag from X to Y",
while a triggering environment can be in the form
"al" hasprefix 2, meaning "the current word has a 2-letter prefix 'al'".
Taken together, the transformation with this rewrite rule and triggering environment would be
X → Y "al" hasprefix 2,
meaning "change the tag of the current word from X to Y if it has a 2-letter prefix 'al'".
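The sketch below shows how an ordered list of such lexical transformations could be applied to a word's current tag. The rule encoding is a simplified placeholder, not the actual rule-file format of the Brill learner, and the two example rules are invented for illustration.

```python
def hasprefix(word, affix, length):
    """Triggering environment: the word starts with the given prefix of the given length."""
    return len(affix) == length and word.startswith(affix)

# Each transformation: (from_tag, to_tag, trigger, affix, length).
# Illustrative rules only; real rules are learned from the training corpus (Section 4.4).
LEXICAL_RULES = [
    ("NCSgFGI", "NCSgFGD", hasprefix, "al", 2),      # word starting with 'al' is likely definite
    ("NCSgFGI", "PC_NCSgFGD", hasprefix, "bal", 3),  # 'b' + 'al': particle clitic attached to a definite noun
]

def apply_lexical_rules(word, tag):
    """Apply the ordered transformations; each may rewrite the current tag."""
    for from_tag, to_tag, trigger, affix, length in LEXICAL_RULES:
        if tag == from_tag and trigger(word, affix, length):
            tag = to_tag
    return tag

print(apply_lexical_rules("almdrsp", "NCSgFGI"))   # -> NCSgFGD
print(apply_lexical_rules("balmdrsp", "NCSgFGI"))  # -> PC_NCSgFGD
```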
There are two types of rules: Lexical rules and contextual rules. Therefore, there are
two learners that have to be run consecutively. First lexical rules are learned, then
context rules are learned to refine the tags, and make up for some divergences that may
occur in applying the lexical rules. In both cases the learning procedure is done in passes through the truth corpus, each pass learning the rule that, when applied, minimizes the errors in tagging the corpus as compared to the truth corpus. These rules are then stored in a file in the order they are learned, thus giving two rule files: a lexical rule file and a contextual rule file. The tagger applies these rules in the same order to get similar
results. Examples of both types of rules, obtained from the Arabic tagged corpus, are
given in Section 4.4, with explanatory comments giving the meaning of each rule.
The ideal goal of the lexical module is to find rules that can produce the most likely
tag for any word in the given language, i.e. the most frequent tag for the word in
question considering all texts in that language. The problem is to determine the most
likely tags for unknown words, given the most likely tag for each word in a
comparatively small set of words. This is done by transformation-based learning (TBL) using three different lists: a list of (word, tag, frequency) triples derived from the first half of the training corpus, a list of all available words sorted by decreasing frequency, and a list of all word pairs, i.e. bigrams. Thus, the lexical learner module does not use running text.
Once the tagger has learned the most likely tag for each word found in the annotated
training corpus and the rules for predicting the most likely tag for unknown words,
contextual rules are learned for disambiguation. The learner discovers rules on the basis
of the particular environments (or the context) of word tokens. The contextual learning
process needs an initially annotated text. The input to the initial state annotator is an untagged corpus, a running text, which is the other half of the annotated corpus with the tagging information of the words removed. The initial state annotator also uses a list consisting of words with a number of tags attached to each word, found in the first half of the annotated corpus. The first tag is the most likely tag for the word in question, and the rest of the tags are in no particular order. With the help of this list, a
list of bigrams (the same as used in the lexical learning module, see above) and the
lexical rules, the initial state annotator assigns to every word in the untagged corpus the
most likely tag. In other words, it tags the known words with the most frequent tag for
the word in question. The tags for the unknown words are computed using the lexical
rules: each unknown word is first tagged with a default tag and then the lexical rules are
applied in order.
There is one difference compared to the lexical learning module, namely that the
application of the rules is restricted in the following way: if the current word occurs in
the lexicon, but the new tag given by the rule is not one of the tags associated with the
word in the lexicon, then the rule does not change the tag of this word.
Figure 3-6: Context rule learning (flowchart: the dummy-tagged corpus, produced from the
unannotated corpus 2, is fed to the context learner, which compares it with tagged corpus 2,
the truth, and emits context rules until a threshold is reached, at which point it stops).
When tagging new text, the initial state annotator first applies the predefined default
tags to the unknown words (i.e. words not in the lexicon). Then, the ordered
lexical rules are applied to these words. The known words are tagged with the most
likely tag. Finally the ordered contextual rules are applied to all words.
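The order of these operations is summarized in the following Perl sketch. It is a schematic of the procedure described above rather than the actual tagger code; the lexicon entries, the default tag, and the two sample rules are illustrative only:

use strict;
use warnings;

# %lexicon maps a word to its tags, the first being the most likely one.
my %lexicon = ( 'fy' => ['PPr'], 'ktb' => ['VPSg3M', 'NCPlbMGI'] );

# One lexical rule: words starting with "al" become (by default) definite common nouns.
my @lexical_rules = ( sub { my ($w, $t) = @_; $w =~ /^al/ ? 'NCSgMGD' : $t } );

# One contextual rule (cf. rule 9 of Table 4-3): NP becomes NCSgMGI after a preposition.
my @contextual_rules = ( sub {
    my ($i, $tags) = @_;
    return ($i > 0 && $tags->[$i-1] eq 'PPr' && $tags->[$i] eq 'NP') ? 'NCSgMGI' : $tags->[$i];
} );

sub tag_sentence {
    my @words = @_;
    my @tags;
    for my $i (0 .. $#words) {
        if (exists $lexicon{$words[$i]}) {
            $tags[$i] = $lexicon{$words[$i]}[0];            # known word: most likely tag
        } else {
            my $t = 'NP';                                   # unknown word: default tag ...
            $t = $_->($words[$i], $t) for @lexical_rules;   # ... then the ordered lexical rules
            $tags[$i] = $t;
        }
    }
    for my $rule (@contextual_rules) {                      # finally the ordered contextual rules
        $tags[$_] = $rule->($_, \@tags) for 0 .. $#words;
    }
    return @tags;
}

print join(' ', tag_sentence(qw(fy byrwt))), "\n";          # prints "PPr NCSgMGI"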
3.4 Testing strategies
Testing was done using the method of cross validation. Taking into consideration that
we do not have a large standard truth corpus, we had to manage with the corpus we
tagged. This corpus is divided into three portions of about 13,000 words each, and the test
is repeated three times, each time with a different third used for testing and the other two
thirds for learning; the average of the three tests is then taken as an overall measure of
the performance of the system. This whole experiment is
repeated using three versions of the tagset and therefore three versions of corpora:
1. Tagset1: the original detailed Khoja tagset [16] containing 177 tags.
2. Tagset2: the complete modified tagset of 319 tags (Appendix B).
3. Tagset3: a subset of Tagset2 of which grammatical information is excluded
for nouns and imperfect verbs, thus reducing the number of tags to 185 tags.
All three tagsets are drastically enlarged by the fact that the system we used
does not apply stemming prior to the learning and tagging phases. Rather, it uses
composite tags to tag composite words, which introduces a new set of tags. As an
example, consider the word balmdrsp (بالمدرسة). If stemming were applied, this word
would be divided into two separate words, b and almdrsp, and would be tagged as
b/PC almdrsp/NCSgFGD. But since we work without stemming, the word is treated as
one unit and is tagged as balmdrsp/PC_NCSgFGD, thus introducing the new tag
PC_NCSgFGD. Stemming would probably enhance the accuracy of the system, but it
would divert our attention in other directions and put extra burdens on the users of the
system.
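The way a composite tag is formed can be sketched as follows (our own illustration in Perl; the stemmed analysis shown in the code is hypothetical and only mirrors the example above):

use strict;
use warnings;

# Illustrative only: how an attached particle enlarges the tagset when no stemming is used.
# With stemming the word would be two tokens with two ordinary tags; without stemming
# the two tags are joined with "_" into one composite tag.
sub composite_tag {
    my @tags = @_;
    return join '_', @tags;          # e.g. ("PC", "NCSgFGD") -> "PC_NCSgFGD"
}

my $word    = 'balmdrsp';                                     # b + almdrsp, written as one word
my @stemmed = ( [ 'b', 'PC' ], [ 'almdrsp', 'NCSgFGD' ] );    # hypothetical stemmed analysis
print "$word/", composite_tag(map { $_->[1] } @stemmed), "\n";   # balmdrsp/PC_NCSgFGD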
Chapter Four
Implementation and Testing
4.1 Corpus
The corpus used for this study is part of a corpus of about 160,000 words taken from two
Jordanian newspapers (Aldustor and Aldustor Aleqtsady). Any MSA corpus would have
done the task, but this corpus was obtained at an early stage of the work and was used
thereafter. A lot of preprocessing was needed before the corpus could be used. The corpus
was originally a Microsoft Word document, so it had to undergo the following correction
and revision tasks to be ready for our work:
1. There were many typing, spelling, and grammatical mistakes, frequent enough to
constitute a real phenomenon in the text; they would hinder the process of tagging and
add to the problem of ambiguity, which is already inherent in Arabic text. These
problems had to be fixed beforehand. Examples of such mistakes include:
a. Missing hamza, like: اشار, اوضح, اقصي, احبار.
b. Misplaced hamza, like: أجياء instead of إجياء.
c. هـ instead of ة, like: باخر.
d. ي instead of ى, or vice versa, like: اعكى.
e. Typing mistakes.
f. Grammar mistakes.
2. Getting rid of passage numbers, titles, and end marks to concentrate on
complete sentences of text.
3. The text is then converted to an ASCII MS-DOS format.
4. Because of technical considerations, such as the different code pages used for
representing Arabic characters and the use of software that does not support
Arabization (especially the Lex analyzing system and the Linux environment), it was
decided to follow most of the previous research on Arabic [e.g. 1, 14, 21] and use
transliteration. For this purpose the Buckwalter transliteration scheme [36] is used, and
a small C program was written to do this task (a sketch of the character mapping is
given after this list).
5. The corpus is then edited to match the Brill format and copied to the Linux
system for the rest of the processing.
6. Then, it is tagged, using a program written with the help of the lexical
analyzer LEX [2]. The resulting corpus, calculated to be about 43% accurate,
is then revised manually. The result, which is supposed to represent the truth,
was then given to the learner of the Brill tagger to learn lexical and
contextual rules, a step that also requires some other preparations, as
explained in Section 3.3.
7. The above steps are performed initially on a corpus of about 1,000 words. After the
rules are learned, a larger corpus is presented to the tagger, tagged, manually revised,
and given to the learner to enhance the rule set. This process is repeated continuously,
enlarging the truth corpus and enhancing the performance of the tagger simultaneously,
until satisfactory results are obtained and/or enough time has been spent on this step.
At present, a truth corpus of over 38,000 words has been reached.
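The actual transliteration program was written in C; the following Perl sketch of the character mapping, covering only part of the Buckwalter scheme, conveys the idea:

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
binmode STDIN,  ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';

# A (partial) Buckwalter map; the full scheme covers all Arabic letters and diacritics.
my %bw = (
    'ا' => 'A', 'أ' => '>', 'إ' => '<', 'آ' => '|', 'ء' => "'", 'ؤ' => '&', 'ئ' => '}',
    'ب' => 'b', 'ت' => 't', 'ث' => 'v', 'ج' => 'j', 'ح' => 'H', 'خ' => 'x',
    'د' => 'd', 'ذ' => '*', 'ر' => 'r', 'ز' => 'z', 'س' => 's', 'ش' => '$',
    'ص' => 'S', 'ض' => 'D', 'ط' => 'T', 'ظ' => 'Z', 'ع' => 'E', 'غ' => 'g',
    'ف' => 'f', 'ق' => 'q', 'ك' => 'k', 'ل' => 'l', 'م' => 'm', 'ن' => 'n',
    'ه' => 'h', 'ة' => 'p', 'و' => 'w', 'ي' => 'y', 'ى' => 'Y',
);

while (my $line = <STDIN>) {
    # Replace each Arabic character by its Buckwalter equivalent; leave the rest as is.
    $line =~ s/(.)/exists $bw{$1} ? $bw{$1} : $1/ge;
    print $line;
}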
Figures 4-1 and 4-2 show sample sentences in different stages of the tagging cycle.
على هامش أعمال المنتدى المتوسطي للتنمية والذي عقد في القاهرة خلال آذار الجاري نظم المركز المصري للدراسات الاقتصادية ورشة عمل حول ضعف الموارد البشرية والتدريب وتفضيل الدول العربية للمنتج الأجنبي وأهم معوقات التنافسية للشركات في المنطقة . وقد ناقشت هذه الحلقة التطورات المتلاحقة في الاقتصاد العالمي والتي أصبحت تفرض تحديات على الشركات .

(a) A sentence from the corpus
ElY ham$ >Emal almntdY almtwsTy lltnmyp wal*y Eqd fy alqahrp xlal |*ar aljary nZm almrkz almSry lldrasat alaqtSadyp wr$p Eml Hwl DEf almward alb$ryp waltdryb wtfDyl aldwl alErbyp llmntj al>jnby w>hm mEwqat altnafsyp ll$rkat fy almnTqp . wqd naq$t h*h alHlqp altTwrat almtlaHqp fy alaqtSad alEalmy walty >SbHt tfrD tHdyat ElY al$rkat .

(b) A transliteration of the sentence in (a) in the Brill format

Figure 4-1
4.2 Tagset
The tagset used in this work is a modified version of the tagset designed by Khoja,
fully described in [16] and redescribed in Section 3.1.1. The work of Khoja is highly
esteemed, being the first comprehensive work in designing a tagset for Arabic, which
encompasses the richness and complexity of the language. Nevertheless, it has some
limitations and mistakes, some of which are treated in this work, and others may be a
task of future work. Modifications considered here include nouns, verbs, and particles.

Figure 4-2: Part of the sentence in Figure 4-1 after tagging and detransliteration (each word
of the sentence is shown together with its tag, e.g. على/PPr, هامش/NCSgMGI, أعمال/NCPlbMGI).
4.2.1 Nouns:
For nouns the following was done:
a- Avoiding distinctions between foreign names and Arabic names. Instead all
names, whether Arabic or foreign, are given the same tag NP (for proper noun).
The tag RF (residual foreign) is kept to refer only to words of foreign languages
written in Arabic characters. In the original tagset, the tag (RF) is given to all
foreign names and words (see Figure 1-1 and compare the tags given there to foreign
proper names and to foreign words).
b- Using different tags for the different plural forms. Plural nouns are therefore marked
with the subtags PlbM, PlbF, Plm, and Plf for broken masculine plural, broken feminine
plural, sound masculine plural, and sound feminine plural respectively, instead of just
PlM and PlF for plural masculine and plural feminine. The table below (Figure (4-3))
gives examples of this. Notice that in our set the gender is not repeated with sound
plurals, since it is included implicitly in the plural form.
word          Original tag    New tag
الموظفون       NCPlMND         NCPlmND
العاملين       NCPlMGD         NCPlmGD
الشبكات        NCPlFND         NCPlfND
المدارس        NCPlFND         NCPlbFND
البنوك         NCPlMGD         NCPlbMGD

Figure 4-3: Tags of Plurals
The last two characters of each tag are irrelevant here and are given only for
completeness.
Including this information is useful when the resulting tagged corpus is used for
morphological studies.
c- Introducing some new tags.
d- Introducing another general category, in addition to common nouns (NC) and
adjectives (NA), namely title nouns (NT), like: (ال في، وزفي، أمن، السيري، الان س، اليئوس، الكا). This would increase the tagset
drastically, since each of these nouns can be single or plural, masculine or
feminine, definite or indefinite and can take any of the three cases. But it would
help in many cases to discover unknown proper nouns that usually follow these
titles.
4.2.2 Verbs:
For verbs the modifications include:
Using distinct tags for defective verbs (الأفعال الناقصة) to capture their effect on the case
of the following nouns. Therefore each verb tag is marked by a small d following the first
two characters of the tag if the verb is defective, as in Figure (4-4).
word        Original tag    New tag
ذهب          VPSg3M          VPSg3M
يذهب         VISg3MI         VISg3MI
كانت         VPSg3F          VPdSg3F
يصبحون       VIPl3MI         VIdPl3MI

Figure 4-4: Tags of defective verbs
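The marking scheme can be illustrated with the following small Perl sketch (our own; the list of defective-verb stems shown is only a sample):

use strict;
use warnings;

# Mark a verb tag for a defective verb by inserting "d" after the first two
# characters (VP... -> VPd..., VI... -> VId...).  The stem list is illustrative.
my %defective = map { $_ => 1 } qw(kan kant ykwn SbH ySbH);    # sample of كان وأخواتها

sub verb_tag {
    my ($word, $tag) = @_;
    return $tag unless $tag =~ /^V[PI]/;                        # only perfect/imperfect verb tags
    return $defective{$word} ? substr($tag, 0, 2) . 'd' . substr($tag, 2) : $tag;
}

print verb_tag('kant', 'VPSg3F'), "\n";    # VPdSg3F
print verb_tag('*hb',  'VPSg3M'), "\n";    # VPSg3M (not a defective verb)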
4.2.3 Particles:
For particles, the modifications include:
Introducing a few tags to refine the tagging of some particles, and to make room for
some particles not considered in the original tagset, namely: Pcr, Pdt, Pst, PQ, LM, and
LN. These cover, among others, the particles أنّ and إنّ, the two senses of قد (قد التحقيقية
and قد التشكيكية), the interrogative particles (أدوات الاستفهام), and the particles لم and لن.
All these tags are added to help pick up some information about the following words.
Although these tags do contribute to refining the tagset, there is still a lot to be
done with particles, since the available tags do not cover the wide range of meanings
that particles have in Arabic. For example, the prefix particle ف is now given the tag
PC (for conjunctional particle), whereas this is not always its role; it sometimes has
different meanings, especially when affixed to verbs (الفاء السببية), and the same applies
to some other prefixed particles.
All particles that do not belong clearly to any of the available tags are given a
general tag PA (for adverbial particle) regardless of the fact that some of them are
not really adverbial, so we do not have to take the meaning of this tag literally.
Making more distinctions is left for future work, after studying more deeply the
need for such refinement.
It should be kept in mind, also, that the corpus we dealt with is not stemmed.
So the tagging is done by composite tags, which would introduce a new set of tags
for composite words. For example, the word بالعرريض
is tagged as
PPr_NCSgMGD, which is a completely different tag from either PPr or
NCSgMGD, thus leading to a drastic theoretical increase in the tagset. Contrary to
what was expected, this fact did not cause a lot of problems with the tagging
accuracy, due to the fact that the Brill tagger is powerful in dealing with prefixes
and suffixes, and that composite words comprise only a small portion of an Arabic
text (estimated to be less than 6% according to the data we worked on).
4.3 The program
The same Lex-based program, which was used for initial tagging of the very
first corpus, is now used as a start state tagger for both learner and tagger of the
Brill system.
In the original system initial tagging is done by a very simple routine, which
assigns to all words either the tag (NN) for common nouns or (NP) for proper
nouns if the word starts with a capital letter. This start state suffices for English
and similar simple languages, but for Arabic we preferred to use another type of
start state tagger, where each unknown word is checked for its morphological structure
and assigned an initial tag accordingly. For this purpose, the Lex-based routine
was used, after facing a lot of trouble getting it to work, especially since it has to
be interfaced to both the lexical learner (written in Perl) and the tagger (written in C).
The start state routine is an important factor in getting accurate results, especially
for unknown words, and the better it is designed to take care of word structures,
the better the achieved results are. At present, the routine takes care of many
morphological structures and relies on statistical information, gathered during manual
tagging, to assign the most probable tag to words that do not belong to any of the
captured patterns.
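The real routine is written with Lex, but its idea can be conveyed by the following Perl sketch (our own, with far fewer patterns than the actual routine, and with illustrative tag choices): the word form, in Buckwalter transliteration, is matched against a few morphological patterns and the first matching pattern determines the initial tag.

use strict;
use warnings;

# A few morphological patterns in Buckwalter transliteration (a small subset of the
# patterns handled by the real Lex routine; the chosen tags are illustrative).
sub initial_tag {
    my ($word) = @_;
    return 'Rnu'         if $word =~ /^[0-9]+$/;                 # numbers
    return 'PC_NCSgMGD'  if $word =~ /^wal/;                     # w + al + noun
    return 'PPr_NCSgMGD' if $word =~ /^(ll|bal)/;                # l/b + al + noun
    return 'NCSgFGD'     if $word =~ /^al/ && $word =~ /p$/;     # definite noun with ta marbuta
    return 'NCPlfGI'     if $word =~ /at$/;                      # sound feminine plural ending
    return 'NCSgFGI'     if $word =~ /p$/;                       # ta marbuta ending
    return 'VISg3MI'     if $word =~ /^y/;                       # imperfect-verb prefix
    return 'NCSgMGD'     if $word =~ /^al/;                      # definite noun
    return 'NCSgMGI';                                            # fall-back: the most probable tag
}

print "$_/" . initial_tag($_), "\n" for qw(walktab llmntj mharat almdrsp yktb ktab 1996);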
4.4 Rules
In this section we give a list of the resulting rules and explain how they are
interpreted, and the actual lexical and contextual information derived from them. It is
worth mentioning here that the obtained rules are based on majority tests and not
on absolute truth. In other words, it is not necessarily the case that each rule applies to all
situations in MSA text; rather, it applies to most similar situations. As an
example of this consider the rule number 9 in Table 4-3 which states:
NP NCSgMGI PREV1OR2TAG PPr
meaning that if a word tagged as a proper noun has a preposition as one of its two
preceding words, then it should be retagged as a common noun rather than a proper noun.
This rule is derived because, in the training corpus, it turned out that applying
this rule would enhance the accuracy of the tagger, by minimizing the discrepancy
between the starting corpus and the truth corpus. But that does not mean that the
rule has no exceptions. It is easy to think of many exceptions to this rule or to any
other rule, but what counts is the overall effect of applying the rule.
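The way such a rule is applied can be sketched with a small interpreter for the contextual rule format (our own Perl illustration, covering only the triggers that appear in Table 4-3; the example phrase is invented):

use strict;
use warnings;

# A tiny interpreter for contextual rules written in the Brill format
# "FROM TO TRIGGER ARG" (only the triggers used in Table 4-3 are sketched here).
sub apply_contextual_rule {
    my ($rule, $words, $tags) = @_;             # $words, $tags: array refs of equal length
    my ($from, $to, $trigger, $arg) = split ' ', $rule;
    for my $i (0 .. $#$tags) {
        next unless $tags->[$i] eq $from;
        my $fire =
            $trigger eq 'PREVTAG'     ? ($i >= 1 && $tags->[$i-1] eq $arg)
          : $trigger eq 'NEXTTAG'     ? ($i < $#$tags && $tags->[$i+1] eq $arg)
          : $trigger eq 'PREVWD'      ? ($i >= 1 && $words->[$i-1] eq $arg)
          : $trigger eq 'CURWD'       ? ($words->[$i] eq $arg)
          : $trigger eq 'PREV1OR2TAG' ? (($i >= 1 && $tags->[$i-1] eq $arg)
                                         || ($i >= 2 && $tags->[$i-2] eq $arg))
          : 0;
        $tags->[$i] = $to if $fire;
    }
}

# Rule 9 of Table 4-3 applied to the phrase "fy dm$q", initially tagged PPr NP.
my @words = qw(fy dm$q);
my @tags  = qw(PPr NP);
apply_contextual_rule('NP NCSgMGI PREV1OR2TAG PPr', \@words, \@tags);
print "@tags\n";     # prints "PPr NCSgMGI"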
4.4.1 Lexical Rules
Table 4-1 shows a list of lexical rules, together with the meaning of each rule
and its interpretation in the context of Arabic morphology. Table 4-2, on the other hand,
lists a group of rules that may be considered misleading: although they may enhance the
tagging of the training corpus, they will surely have negative effects on the testing and
real-life corpora.
4.4.2 Contextual Rules
Table 4-3 shows a list of contextual rules, together with the meaning of each
rule, and its interpretation in the context of Arabic morphology and syntax.
4.5 Testing
Many tests are performed to check the efficiency of the system:
In the first group of tests, the truth corpus is divided into
three portions of similar sizes, then the cross validation
method is used three times for each type of tagset as
would be explained below. In each test of the three, two
portions of the corpus (about 25,000 words) are used in
45
46. learning and the third (about 13,000 words) for evaluation,
and the average of the accuracy for the three tests is taken
as the overall measure for the system’s accuracy. This is
performed on three types of corpora: one tagged with
original tagset (Tagset1) as introduced by Khoja, the
second tagged with a modified tagset thereof (Tagset2), as
explained in Section 4.2, and the third tagged with the
modified set with the exclusion of grammar features
(Tagset3). These three tagsets are defined in Section 3.4.
The results of these tests are summarized in Table 5-1, 52, 5-3 respectively.
To test the effect of enlarging the corpus size on accuracy,
another group of corpora are prepared. In this case since
we do not have a large reference corpus to work on, we
had to reduce the size of the testing corpora to enlarge the
training corpora. So we chose the size of the learning
corpora to be about 31,000 words each, i.e. about five
sixths of the size of the complete corpus. And the test
corpus is the rest of the corpus, whose size is over 6,000
words. Three tests were performed this way, changing the
test corpus each time, and taking the average. The results
of these tests are summarized in the tables of Section 5.1.
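The accuracy figures reported below are obtained by comparing the tagger output with the manually tagged truth and averaging over the three folds, roughly as in the following Perl sketch (our own; the file names are hypothetical):

use strict;
use warnings;

# Compare the tagger output with the manually tagged truth, fold by fold, and
# average the three accuracies.
sub accuracy {
    my ($truth_file, $output_file) = @_;
    open my $t, '<', $truth_file  or die "cannot open $truth_file: $!";
    open my $o, '<', $output_file or die "cannot open $output_file: $!";
    my ($total, $correct) = (0, 0);
    while (defined(my $tline = <$t>) and defined(my $oline = <$o>)) {
        my @ttok = split ' ', $tline;
        my @otok = split ' ', $oline;
        for my $i (0 .. $#ttok) {
            my ($tw, $tt) = split m{/}, $ttok[$i], 2;          # "word/tag"
            my ($ow, $ot) = split m{/}, ($otok[$i] // ''), 2;
            $total++;
            $correct++ if defined $ot && $tt eq $ot;
        }
    }
    return 100 * $correct / $total;
}

my @folds = ( ['truth1.txt', 'out1.txt'], ['truth2.txt', 'out2.txt'], ['truth3.txt', 'out3.txt'] );
my @acc   = map { accuracy(@$_) } @folds;
my $avg   = 0;
$avg += $_ / @acc for @acc;
printf "fold accuracies: %s  average: %.2f%%\n", join(', ', map { sprintf '%.2f', $_ } @acc), $avg;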
1. al haspref 2 NASgFGD
   Meaning: if a word has a two-letter prefix "al", then tag it as NASgFGD.
   Comment: "al" is a sign of definiteness.

2. at hassuf 2 NCPlfGD
   Meaning: if a word has a two-letter suffix "at", then tag it as NCPlfGD.
   Comment: "at" is an ending of the feminine plural.

3. NCSgMGI p fchar NCSgFGI
   Meaning: if a word tagged as NCSgMGI contains the character "p", then tag it as NCSgFGI.
   Comment: "p" (التاء المربوطة) is a sign of the feminine.

4. y haspref 1 VISg3MI
   Meaning: if a word has a one-letter prefix "y", then tag it as VISg3MI.
   Comment: "y" (الياء) is a prefix of the imperfect verb.

5. NCSgMGI l fhaspref 1 PPr_NCSgMGI
   Meaning: if a word tagged as NCSgMGI has a one-letter prefix "l", then tag it as PPr_NCSgMGI.
   Comment: "l" (اللام) at the beginning of a word is a preposition.

6. NCSgMGI a fhassuf 1 NCSgMAI
   Meaning: if a word tagged as NCSgMGI has a one-letter suffix "a", then tag it as NCSgMAI.
   Comment: an "a" ending is a sign of the accusative case.

7. NASgFGD p faddsuf 1 NASgMGD
   Meaning: if it is possible to add "p" to a word tagged as NASgFGD, then tag it as NASgMGD.
   Comment: a word cannot have two "p" (تاء مربوطة).

8. NCSgMGI w fhaspref 1 PC_NCSgMGI
   Meaning: if a word tagged as NCSgMGI starts with "w", then tag it as PC_NCSgMGI.
   Comment: "w" is a conjunctional particle.

Table 4-1: a list of lexical rules
9. b deletepref 1 PPr_NCSgMGI
   Meaning: if removing the letter "b" from a word gives a word in the lexicon, tag the original word as PPr_NCSgMGI.
   Comment: an attached "b" is a preposition.

10. wal haspref 3 PC_NCSgMGD
    Meaning: any word starting with "wal" should be tagged as PC_NCSgMGD.
    Comment: "wal" (والـ) is a conjunctional particle followed by "al" for definiteness.

11. ll haspref 2 PPr_NCSgMGD
    Meaning: any word starting with "ll" should be tagged as PPr_NCSgMGD.
    Comment: "ll" (للـ) is a preposition followed by "al" for definiteness.

12. NCSgMGI t fhassuf 1 VPSg3F
    Meaning: if a word tagged as NCSgMGI ends with "t", tag it as VPSg3F.
    Comment: "t" (ت) is a suffix of the past tense verb (third person singular feminine).

13. 0 char Rnu
    Meaning: a word containing the character "0" is tagged as a number.
    Comment: numeric.

14. NCPlfGD al faddpref 2 NCPlfGI
    Meaning: if a word is tagged as NCPlfGD and accepts adding the prefix "al", tag it as NCPlfGI.
    Comment: "al" cannot be added to a word that is already definite.

15. PC_NCSgMGI S-T-A-R-T fgoodright PC_VPSg3M
    Meaning: if a word at the beginning of a sentence is tagged as PC_NCSgMGI, then tag it as PC_VPSg3M.
    Comment: a sentence cannot start with the genitive case.

Table 4-1: a list of lexical rules (continued)
1. NASgFGD d fhassuf 1 NCSgMGD
   Meaning: if a word tagged as NASgFGD ends with "d", tag it as NCSgMGD.

2. NCSgMGD al> fhaspref 3 NCPlbMGD
   Meaning: if a word tagged as NCSgMGD starts with "al>", tag it as NCPlbMGD.

3. NCSgMGI n fhassuf 1 NP
   Meaning: if a word tagged as NCSgMGI ends with "n", tag it as NP.

4. NCSgMGI_NPrPSg3F <lY fgoodleft PPr_NPrPSg3F
   Meaning: if a word tagged as NCSgMGI_NPrPSg3F is followed by إلى (<lY), tag it as PPr_NPrPSg3F.

Comment (on these rules): only a chance.

Table 4-2: Examples of misleading lexical rules
1. NCSgFAI NCSgFGI PREV1OR2TAG PPr
   Meaning: change the tag from NCSgFAI to NCSgFGI if one of the two previous words is tagged PPr.
   Comment: ما بعد حرف الجر مجرور.

2. NCSgFGD NCSgFND PREV1OR2TAG VPSg3F
   Meaning: change the tag from NCSgFGD to NCSgFND if one of the two previous words is tagged VPSg3F.
   Comment: الفاعل مرفوع.

3. Pst PA NEXTTAG VISg3MI
   Meaning: change the tag from Pst to PA if the next word is tagged VISg3MI.
   Comment: ما قبل الفعل المضارع أنْ وليس أنَّ.

4. VISg3MI VISg3MS PREVWD >n
   Meaning: change the tag from VISg3MI to VISg3MS if the previous word is >n.
   Comment: أنْ تنصب الفعل المضارع.

5. NCSgMGD NCSgMND PREV1OR2TAG STAART
   Meaning: change the tag from NCSgMGD to NCSgMND if the word is one of the two starting words of the sentence.
   Comment: المبتدأ المعرف مرفوع وليس مجروراً.

6. NCSgMGD NASgMGD PREVTAG NCSgMGD
   Meaning: change the tag from NCSgMGD to NASgMGD if the previous word is tagged NCSgMGD.
   Comment: ما بعد الاسم المعرف صفة معرفة وليس اسماً معرفاً.

7. NCSgMGI NCSgMNI PREV1OR2TAG STAART
   Meaning: change the tag from NCSgMGI to NCSgMNI if the word is one of the two starting words of the sentence.
   Comment: المبتدأ النكرة مرفوع.

8. NASgFGD NCSgFGD PREVTAG PC_NCSgMGI
   Meaning: change the tag from NASgFGD to NCSgFGD if the previous word is tagged PC_NCSgMGI.
   Comment: تمييز بين المضاف إليه والصفة.

9. NP NCSgMGI PREV1OR2TAG PPr
   Meaning: change the tag from NP to NCSgMGI if one of the two previous words is tagged PPr.
   Comment: حرف الجر يسبق الاسم العادي وليس اسم العلم.

10. NCSgFGI NCSgFNI PREVTAG VPSg3F
    Meaning: change the tag from NCSgFGI to NCSgFNI if the previous word is tagged VPSg3F.
    Comment: الفاعل مرفوع.

11. NASgFGD NCSgFGD PREVTAG NCSgMGI
    Meaning: change the tag from NASgFGD to NCSgFGD if the previous word is tagged NCSgMGI.
    Comment: تمييز بين المضاف إليه والصفة.

12. NASgFGI NCSgFGI PREVTAG PPr
    Meaning: change the tag from NASgFGI to NCSgFGI if the previous word is tagged PPr.
    Comment: ما بعد حرف الجر اسم وليس صفة.

13. PA_VISg3FI NNuCaSgFAI CURWD stp
    Meaning: if the current word is stp, change the tag from PA_VISg3FI to NNuCaSgFAI.
    Comment: تخصيص للقاعدة المعجمية: ما يبدأ بـ"ست" فعل مضارع مع سين الاستقبال.

Table 4-3: a list of contextual rules
Chapter Five
Results and discussion
5.1 Results
Below are the results of the tests performed. Each table illustrates a group of related
tests using the method of cross validation. Table 5-1 gives the results for the original
tagset, Table 5-2 for the modified tagset, Table 5-3 for the modified tagset using enlarged
training corpora, and Table 5-4 for the modified tagset with the case (grammar)
information removed.
Test      Training size (words)   Test size (words)   No. lexical rules   No. context rules   Tagging accuracy (%)
Test1     23834                    13662               153                 134                 73.60
Test2     25372                    12124               149                 137                 72.07
Test3     25786                    11710               150                 161                 75.05
Average   -                        -                   -                   -                   73.57

Table 5-1: Accuracy for the original tagset
Test      Training size (words)   Test size (words)   No. lexical rules   No. context rules   Tagging accuracy (%)
Test4     23834                    13662               120                 151                 74.34
Test5     25372                    12124               143                 158                 72.13
Test6     25786                    11710               150                 135                 75.69
Average   -                        -                   -                   -                   74.05

Table 5-2: Accuracy for the complete modified tagset
Test      Training size (words)   Test size (words)   No. lexical rules   No. context rules   Tagging accuracy (%)
Test7     31422                    6261                174                 190                 75.72
Test8     31467                    6216                176                 162                 75.39
Test9     31634                    6049                167                 148                 77.16
Average   -                        -                   -                   -                   76.09

Table 5-3: Accuracy for the complete modified tagset with enlarged training corpora
Test      Training size (words)   Test size (words)   No. lexical rules   No. context rules   Tagging accuracy (%)
Test10    23834                    13662               151                 83                  83.89
Test11    25372                    12124               148                 116                 82.64
Test12    25786                    11710               145                 106                 85.10
Average   -                        -                   -                   -                   83.87

Table 5-4: Accuracy for the ungrammatized modified tagset
5.2 Examples of errors in tagging
A sample of errors was taken from the error report file, consisting of 38 consecutive
lines of the original text, chosen at random, for the grammatized tagset. This sample
contains 1079 words, 280 of which are tagged erroneously. The errors are categorized
into fifteen types, as in Table 5-5 below, and the occurrences of each type are then counted
in the sample to get an idea of the percentage of each error type. Table 5-6 shows the list
of erroneously tagged words in this sample and, for each word, its true and erroneous tags
and the type of error. Table 5-7 then shows a summary of the errors, their counts in the
sample, and their percentages in descending order.
Error type   Meaning
1            Interpreting a title as a common noun
2            Mistagging a broken plural
3            Interchanging an adjective and a common noun
4            Interchanging definite and indefinite
5            Interchanging sound plural and single
6            Interchanging verb with noun
7            Grammatical error
8            Error in composite tag
9            Lexicon entry missing
10           Interchanging dual with sound masculine plural
11           Typing mistake
12           Interchanging adverbial article with stress article
13           Error in gender
14           Taking a common noun for a proper noun
15           Interchanging doubt particle and certainty particle

Table 5-5: Types of errors
Word            Truth tag               System tag              Type    Comments
almdyryn        NTPlmGD                 NCPlmGD                 1
w>SHab          PC_NCPlbMGI             PC_NCSgMGI              2
r&yp            NCSgFGI                 NASgFGI                 3
alm&vrat        NCPlfGI                 NCPlfGD                 4
m$rwEathm       NCPlfGI_NPrPPl3M        NCSgMGI_NPrPPl3M        5
wDE             NCSgMGI                 VPSg3M                  6
almdyryn        NTPlmGD                 NCPlmGD                 1
alastratyjyp    NASgFGD                 NCSgFGD                 3
$rwT            NCPlbMGI                NCPlbMAI                7
wtnmyp          PC_NCSgFGI              PC_NCSgFAI              7
mharat          NCPlfGI                 NCPlfAI                 7
>salyb          NCPlbMGI                NCPlbMNI                7
wastratyjyat    PC_NCPlfGD              PC_NCPlfGI              4
w>hmyth         PC_NCSgFGI_NPrPSg3M     NCSgFGI_NPrPSg3M        8
almdaxl         NCSgMGD                 NCPlbMND                (7,2)
al>rbEa'        RD                      NCPlbMGD                9       lexicon
lmqablat        PPr_NCplfGI             PPr_NCPlfGI             11      mistype
astxdamha       NCSgMNI_NPrPSgF         NCSgMNI_NPrPSg3F        11      mistype
dafws           RP                      NCSgMGI                 9       lexicon
bswysra         PPr_RP                  NCSgMAI                 9       lexicon
wsykwn          PC_PA_VIdSg3MI          PC_NCSgMGI              8
almtHdvyn       NADuMAD                 NCPlmGD                 10      Du-Plm
alr}ysyyn       NCDuAD                  NCPlmGD                 10
ryma            NP                      NCSgMAI                 9       lexicon
Na}b            NTSgMGI                 NTSgMNI                 7

Table 5-6: A sample of errors in the grammatized tests
3. Distinction between sound masculine plural and dual nouns is not easy for
unknown nouns in Genitive and Accusative case states.
4. Some forms of broken plural are intermixed with other forms of names, and not
always easily distinguished since the processed text is not vocalized.
The above notes can be drawn from Table 5-7, where it is easily noticed that grammar
contributes the highest portion of the errors (almost half of them). Then comes the broken
plural problem, which accounts for 10% of the errors, then the distinction between
adjectives and nouns, also close to 10%. After that comes the problem of proper names
(names of people, cities, countries, etc.), which takes almost 10%, and then the problem
of past tense verbs, about 9%. Composite tags and adverbial articles contribute about 5%
each, and the remaining error types have an insignificant contribution to the overall error
percentage.
Each of the error types that contribute heavily to the overall error percentage is
justified and expected, although their order and exact rates were not expected to turn out
as they did in this test. We think the following were the leading factors:
1. The grammatical errors are partly due to the fact that some of the tags do not
reflect the case of the word, and hence it is hard for the learner to conclude why the
following word is given its tag; examples of that are proper nouns, relative pronouns
(أسماء الموصول), and demonstrative pronouns (أسماء الإشارة). Giving case information
for these tags is expected to help solve this problem, but it would drastically increase
the already large tagset, a task that we preferred to avoid at present, though it is a proper
consideration for future work. It is worth mentioning that most of the words that are
erroneously tagged for this reason are otherwise correctly tagged (i.e. the information
about category, number, gender, and definiteness is correct).
2. The size of the corpus affected the accuracy of the results, and in fact the error
rate was increased by two other factors: first, the corpus had to be split into three
portions to perform cross validation; and second, the Brill tagger splits the training
corpus again into two halves, one to derive lexical rules and the other to derive
contextual rules. So, starting with a corpus of about 38,000 words, each test is done
with about 25,000 words for training and 13,000 words for evaluation, and the training
part is divided into two parts of about 12,500 words each, for lexical and contextual
learning respectively. Had we had a ready corpus to work with, matters would have
been different and we are confident that better results would have been obtained. This
is supported by three separate cross validation experiments in which the training corpus
was enlarged by about 6,000 words, leading to an increase of about 2% in the accuracy
of the system, as shown in Table 5-3; this value does not look very large, but at least it
gives an indication of the improvement.
3. Lack of vocalization also makes it hard to distinguish between some of the
forms of the past tense verbs, and between them and some of the nouns. In this
case accuracy of tagging relies primarily on the statistical information captured
in the lexicon for known words, and on context for the unknown words. But it
should be remembered that lack of vocalization in itself is not a disadvantage of
the corpus, rather it is an advantage for the following reasons:
a. The input text to the tagger is rarely expected to be vocalized, since
vocalization is not common in most MSA writing.
b. Vocalization puts an extra burden on the user of the system.
c. Getting good results in spite of the lack of vocalization is a credit to the system,
and a sign of overcoming the problem of ambiguity without relying on the
user to disambiguate the words by vocalization.
5.4 Evaluation
Compared with other reported results, the results we obtained may look low; for
example, Diab et al. [10] reported an accuracy of 95.4%, and Khoja [17] reported 90%
disambiguation accuracy. But studying these works, we notice that the first dealt with a
very small tagset (24 tags) based on an English tagset, while the second did not specify
precisely the size of the tagset; rather, she talked somewhat confusingly about three
different levels of tagging, with tagsets of 5, 35, and 131 tags, and said she used the
smaller tagset for initial tagging. This means that the tagset she used contains a maximum
of 35 tags. Consulting her website [37], however, one concludes that the tagging is done
using the 5-tag set. The other problem with her results is that she reports that "the
statistical tagger achieved an accuracy of around 90% when disambiguating ambiguous
words" [17], but checking the statistics she offers, we find that ambiguous words comprise
a maximum of 3% of the test corpora, and we do not know the tagging accuracy for the
rest of the corpora.
So, taking into consideration the large and rich tagset we worked with, and the
unavailability of a standard truth corpus tagged with the same tagset, we think the
results obtained here are very promising, and are the best obtained for such a
tagset.
5.5 Accomplishments
In this work we achieved the following:
Revised the Khoja tagset to satisfy our needs and to get rid of some of its
limitations. It was expected that this revision would lead to some drop in
the accuracy of the tagger, and we were willing to accept that, but gladly
enough, the accuracy of the new system turned out to be slightly higher.
Prepared a manually tagged corpus of moderate size, which has the advantage
of being tagged with a rich and comprehensive tagset that we consider the best
available for Arabic and recommend as the basis for a standard Arabic
morphosyntactic tagset. The corpus we tagged is about 38,000 words, far
exceeding in size the only POS tagged Arabic corpus that we know of, which is
a 1,700-word corpus prepared by Khoja. In fact we prepared several versions of
this corpus, as follows:
o One tagged with the original tagset.
o The second tagged with a modified tagset thereof.
o And the third is tagged with the modified tagset but excluding syntactic
(grammar) features.
o All the above corpora are available in both Arabic characters and
transliterated form.
Adapted the Brill transformation rule tagger to work with the above corpora,
giving the first complete tagger for Arabic, which achieved, we believe, a very
promising accuracy of 75-84%, depending on the tagset used.
Prepared in parallel with the corpus a tagged lexicon for Arabic, which would
help researchers in NLP tasks for Arabic.