Academy of Graduate Studies
Tripoli - Libya

PART OF SPEECH TAGGING OF ARABIC
TEXT
By
Massaoud Abuzed Abolqasem Abuzed

March 2006
Abstract
Part of speech tagging is an important area of research in natural language
processing. Although it has been well studied in several Indo-European languages, it is
still not very well investigated with respect to Arabic.
In this thesis, the Brill tagger and a modified version of the Khoja tagset, along with
a corpus prepared for this purpose, are applied to tag Modern Standard Arabic
(henceforth MSA) text. The Brill tagger is a well-known public domain part of speech
tagger, originally designed for tagging English text by implementing a machine learning
approach through the method of transformation rules. It has been adapted for other
languages, such as German and Hungarian, by many researchers. Some modifications
had to be made to the learner and tagger, which are written partly in Perl and partly in C
and run under the Unix/Linux operating system. The main change concerns the initial
state tagger, which is used by both the learner and the tagger. A program was written
using the lexical analyzer generator Lex to capture Arabic morphological structures, and
then interfaced with both the learner and the tagger. The tagset used in this work is a
revised version of the one introduced by Khoja. The revision included changing some
of the tags for linguistic reasons and introducing new tags, either to make the set more
powerful or to make up for limitations in the original tagset that hinder the tagging of
some words. The corpus was obtained from two Jordanian magazines and had to go
through a series of editing steps. A collection of lexical rules and contextual rules was
produced by the learning system and applied to Arabic text. The tagging accuracy of
the resulting tagged text is measured to be approximately 84% for both known and
unknown words. This result may seem low, but taking into consideration the
complexity of the language, the richness of the tagset, the fact that this is the first work
to employ such a tagset for Arabic, and the fact that we did not have a reference corpus
to base our work on, we consider the results very promising.

Acknowledgements
I would like to express my gratitude to:
Associate Professor Mohamed Arteimi, my academic supervisor, who guided me
through this research and gave me his valuable advice.
The Department of Computer Science at the Academy of Graduate Studies, and
Dr. Abdussalam Elmusrati personally, for his encouragement and help.
The Academy of Graduate Studies, and Dr. Saleh Ibrahim, for his encouragement
in sponsoring this research through an academic scholarship.
And my family and friends for their support and endurance.

List of Tables

Table (4-1)  A list of lexical rules
Table (4-2)  Examples of misleading lexical rules
Table (4-3)  A list of contextual rules
Table (5-1)  Accuracy for the original tagset
Table (5-2)  Accuracy for the complete modified tagset
Table (5-3)  Accuracy for the complete modified tagset with enlarged training corpora
Table (5-4)  Accuracy for the ungrammatized modified tagset
Table (5-5)  Types of errors
Table (5-6)  A sample of errors in grammatized tests
Table (5-7)  Percentage error for each error type in the grammatized tests
Table (5-8)  A sample of errors in ungrammatized tests
List of Figures and Illustrations

Figure (2-1)  Copy of the manually tagged excerpt sought by Khoja
Figure (3-1)  Example of a general classification tagset
Figure (3-2)  Example of a detailed tagset for verbs
Figure (3-3)  The entire Penn Treebank tagset
Figure (3-4)  Preliminary steps for tagging
Figure (3-5)  Lexical rule learning
Figure (3-6)  Context rule learning
Figure (3-7)  Tagging
Figure (4-1)  (a) A sentence from the corpus; (b) a transliteration of a sentence from the corpus
Figure (4-2)  Tagged and detransliterated sentence from the corpus
Figure (4-3)  Tags of plurals
Figure (4-4)  Tags of defected verbs
Contents

Abstract
Acknowledgements
List of Tables
List of Figures and Illustrations
Contents
Chapter One: Introduction
    1.1 Background
    1.2 Part-Of-Speech Tagging Methods
    1.3 Machine learning in POS tagging
        1.3.1 N-gram and Markov models
        1.3.2 Neural Networks
        1.3.3 Vector-based clustering
        1.3.4 Transformation-Based Learning
    1.4 Aims and objectives
    1.5 Tools used in this work
        1.5.1 Corpus
        1.5.2 Tagset
        1.5.3 Tagger
    1.6 Testing strategy
    1.7 Chapters summary
Chapter Two: Literature Review
    2.1 Corpora in European languages
        2.1.1 General Corpora
        2.1.2 Historical Corpora
        2.1.3 Annotated Corpora
    2.2 Arabic corpora
    2.4 Arabic taggers
    2.5 Definition of training and testing texts
Chapter Three: Design
    3.1 Tagsets and the adopted Arabic tagset
        3.1.1 Tagsets
        3.1.2 The adopted tagset
    3.2 Corpora used for this work
    3.3 The Brill system
        3.3.1 Learner
        3.3.2 Tagger
    3.4 Testing strategies
Chapter Four: Implementation and Testing
    4.1 Corpus
    4.2 Tagset
        4.2.1 Nouns
        4.2.2 Verbs
        4.2.3 Particles
    4.3 The program
    4.4 Rules
        4.4.1 Lexical Rules
        4.4.2 Contextual Rules
    4.5 Testing
Chapter Five: Results and discussion
    5.1 Results
    5.2 Examples of errors in tagging
    5.3 Discussion
    5.4 Evaluation
    5.5 Accomplishments
Chapter Six: Conclusions and Future work
    6.1 Conclusion
    6.2 Future work
References
Appendix A: Sample tagged sentences as compared to the truth corpus
Appendix B: The complete tagset (Tagset2)
Appendix C: The Lex file used for initial state tagger
Chapter One
Introduction
1.1 Background
It is very hard, or even impossible, to manually encode all the information about a
human language that is necessary to build a system that annotates text with structural
descriptions [9]. Such work would require a great deal of information about the type of
grammar to be used, plus extensive morphological, lexical, and syntactic knowledge of
the language itself, all encoded in an algorithmic form that the intended system can
handle. This is not an easy task; it would consume a lot of time and probably require a
group of language experts. Even if achieved, the result would be language specific and
could not be applied to other languages.
For this reason, language processing has recently been tackled with different
approaches. One of the fastest growing approaches relies on machine learning
techniques. These techniques start from samples of manually annotated text, which
must be reviewed very carefully to make sure they represent the truth for the given
language. A learning system is then applied to that text to figure out the cues for
annotating the given words with the given annotations. These cues are converted either
into statistical information stating the probabilities of assigning a given annotation to a
certain word, according to its lexical structure and/or its location in the context, or into
a collection of rules stating when and why to assign a given annotation to the word.
Afterwards, another system, the tagger, is given new raw text to annotate; it goes
through the text and assigns annotations to the words according to the accompanying
cues (probability figures, or rules).
Clearly, the use of rules obtained from a learning system is preferable to the use of
probability figures for the following two reasons:
1- Rules are easy to understand and can directly reflect the human understanding of
the language.
2- Rules can be manipulated by changing, omitting, or adding rules when doing so
would enhance the annotation ability of the system.
For these reasons, we have chosen to use a rule-based machine learning system for
our work.
Part-of-speech (POS) tagging means taking a text written in a human language and
identifying its lexical and/or syntactic structure by assigning to each word/token in the
text the correct part of speech, such as noun, verb, adjective, or adverb. Furthermore,
the tags in many cases give additional features, such as number (singular/plural), tense,
and gender, thus turning the raw (unannotated) text into an annotated, or tagged, corpus.
This process of tagging requires a set of tags that classify words according to their
lexical and syntactic meanings. This set is referred to as a tagset.
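To make the word/tag pairing concrete, the sketch below shows one common way a tagged text can be represented; the English sentence and the tag names (DET, NOUN, VERB, PREP) are invented for illustration and are not the tagset used in this work.

```python
# A minimal illustration of POS-tagged text as word/tag pairs.
# The sentence and tag names are hypothetical examples, not the
# Arabic tagset used in this thesis.
from collections import Counter

sentence = "the cat sat on the mat"
tags = ["DET", "NOUN", "VERB", "PREP", "DET", "NOUN"]

# A tagged corpus pairs every token with its tag.
tagged = list(zip(sentence.split(), tags))

# Once tagged, simple quantitative questions become easy to answer,
# e.g. how often each part of speech occurs in the text.
tag_counts = Counter(tag for _, tag in tagged)

print(tagged[0])         # ('the', 'DET')
print(tag_counts["DET"]) # 2
```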
Part-of-speech tagging is the foundation of natural language processing (NLP)
systems, and thus has been an active area of research for many years [25]. The use of
corpora has become an important issue in Language Engineering (LE), the field that
deals with all the different ways of handling natural languages computationally. There
are many ways to work with corpora. These include the use of monolingual corpora,
annotated to reflect some information about the language structure, and parallel
corpora, i.e. corpora of the same text written in two or more languages, where at least
one of the corpora is annotated, to help annotate the other corpora or to help extract
information from them. Both kinds are valuable sources of linguistic metaknowledge,
which forms the basis of techniques such as tokenization, POS tagging, and
morphological and syntactic analysis, which in turn can be used to develop LE
applications [9].
An annotated corpus is a corpus that has had some level of linguistic detail added to
the raw data. For example, the Penn Treebank [41] is an annotated corpus, because it
contains the linguistic structure and part-of-speech tags for the words in the corpus.
A tagged corpus is more useful than an untagged corpus because there is more
information there than in the raw text alone. Once a corpus is tagged, it can be used to
extract information from the corpus. This can then be used for creating dictionaries and
grammars of languages using real language data. Tagged corpora are also useful for
detailed quantitative analysis of text [22].
Other applications of part-of-speech tagging include speech recognition [14],
enhancing input methods [6], machine translation [24], and discovering errors in OCR
files [20].

1.2 Part-Of-Speech Tagging Methods
It has recently become clear that automatically extracting linguistic information
from a sample text corpus can be an extremely powerful method for building accurate
natural language processing systems [9]. There are several part-of-speech taggers that
are widely used for Indo-European languages, all of which are trained, and retrainable,
on text corpora. Structural ambiguity can be greatly reduced by adding empirically
derived probabilities to grammar rules and by computing statistical measures of lexical
association. Word sense disambiguation can, in some cases, be done with high accuracy
when all information is derived automatically from corpora. An effort has recently been
undertaken to create automated machine translation systems in which the linguistic
information needed for translation is extracted automatically from aligned text corpora
[22].
These are just some of the recent applications of corpus-based techniques in natural
language processing. Along with great research advances, the infrastructure is in place
for this line of research to grow even stronger. With on-line corpora becoming more
readily available, corpus-based natural language processing is growing and producing
better performance. There is a worldwide trend to annotate large corpora with linguistic
information, including parts of speech.
Many techniques have been used to tag English and other European language
corpora, such as:
1- Rule-based techniques: used by Greene and Rubin in 1970 to tag the Brown
corpus. They designed the tagger TAGGIT [13], which used context-frame rules
to select the appropriate tag for each word. It achieved an accuracy of 77%.
More recently, interest in rule-based taggers has re-emerged with Eric Brill's
tagger, which used another type of rules called transformation rules
(Section 3.3) and achieved an accuracy of 97.5%.
2- Hidden Markov models: used since the 1980s to select the appropriate tag.
Examples of such taggers are:
   i.  CLAWS [12], which was developed at Lancaster University and achieved
       an accuracy of 97%.
   ii. The Xerox tagger [38], developed by Doug Cutting, which achieved an
       accuracy of 96%.
3- Hybrid taggers: these use a combination of both statistical and rule-based
methods. This approach achieved an accuracy of 98%, as reported by
Tapanainen and Voutilainen [31], who used both techniques separately and then
aligned the outputs.

1.3 Machine learning in POS tagging
Machine learning deals with acquiring knowledge from an environment in a
computational manner, in order to improve the performance. There are many factors
that contributed over the past couple of decades to the blending of ML and NLP.
These factors include the ever expanding availability of large corpora, more
powerful computing resources; and a greater demand for natural language based
applications [27]. This led to the use of many machine learning techniques in natural
language processing, and in particular in Part-of-speech tagging[34].
Since the method we are using in our work belongs to these techniques, we shall
give here a more detailed idea about these methods.

1.3.1 N-gram and Markov models

A Markov model of a sequence of states or symbols (e.g. words or Part-of-speech
tags) is used to estimate the probability or “likelihood” of a symbol sequence. It can
be used for disambiguation, e.g. for choosing the most likely tag for an ambiguous
word in a given context, by estimating the probability of every candidate sequence.
A Markov model applies the simplifying assumption that the probability or
“likelihood” of a long sequence or chain of symbols can be estimated in terms of its
parts or n-grams.
Hidden Markov Models (HMMs) [18] are a variant of Markov models including
two layers of states: a visible layer corresponding to input symbols (e.g. words) and
a hidden layer learnt by the system, corresponding to broader categories (e.g. wordclasses).
Markov or n-gram models have been widely used for Part-of-speech tagging,
following the successful use in tagging the LOB Corpus [19].
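As an illustration of the idea above, the following sketch scores candidate tag sequences with a bigram (first-order) model, choosing the sequence that maximizes the product of transition probabilities P(tag_i | tag_{i-1}) and emission probabilities P(word_i | tag_i). The toy probabilities and tag names are invented for illustration and are not drawn from any real corpus.

```python
# A sketch of bigram Markov tagging with a Viterbi-style search.
# All probabilities below are invented toy values; a real tagger
# estimates them from a tagged training corpus. The toy model also
# assumes every word has at least one possible tag.

# P(tag | previous tag); "<s>" marks the start of the sentence.
trans = {
    ("<s>", "DET"): 0.8, ("<s>", "NOUN"): 0.2,
    ("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.1,
    ("NOUN", "VERB"): 0.7, ("NOUN", "NOUN"): 0.3,
    ("VERB", "DET"): 0.6, ("VERB", "NOUN"): 0.4,
}
# P(word | tag)
emit = {
    ("DET", "the"): 0.9,
    ("NOUN", "dog"): 0.4, ("NOUN", "barks"): 0.1,
    ("VERB", "barks"): 0.5,
}
TAGS = ["DET", "NOUN", "VERB"]

def viterbi(words):
    """Return the most probable tag sequence under the toy model."""
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {"<s>": (1.0, [])}
    for w in words:
        new = {}
        for t in TAGS:
            e = emit.get((t, w), 0.0)
            if e == 0.0:
                continue  # this tag cannot emit the word
            # extend every surviving path by tag t and keep the best
            cands = [(p * trans.get((prev, t), 0.0) * e, path + [t])
                     for prev, (p, path) in best.items()]
            new[t] = max(cands)
        best = new
    return max(best.values())[1]

print(viterbi(["the", "dog", "barks"]))  # ['DET', 'NOUN', 'VERB']
```

Note that "barks" is ambiguous between NOUN and VERB in this toy lexicon; the transition probability P(VERB | NOUN) is what resolves it, which is exactly the disambiguation role of the Markov model described above.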

1.3.2 Neural Networks

Neural networks (NNs) have been widely explored in Artificial Intelligence, and
they have been studied for many years in the hope of achieving human-like
performance in many fields.
There are many rules used in the learning process of neural networks. The type of
learning in a neural network is determined by the manner in which its parameters
change. This can happen with or without the intervention of a supervisor; hence,
neural network learning is divided into three groups: supervised learning, unsupervised
learning, and reinforcement learning.
Neural networks typically consist of multiple layers of nodes, where the lowest layer
is the input layer, the highest is the output layer, and the layers in between are the
hidden layers. Nodes of adjacent layers are connected via weighted links. The
weights on these links are manipulated using a special function so that the given
input produces the desired output. When this stage is reached, the weights on
the links are recorded, or learnt, as the proper values for the given input to produce
the desired output.
In part-of-speech tagging applications, the input consists of all the information the
system has about the parts of speech of the current word, i.e. all its possible tags, the
tags of a certain number (p) of the preceding words, and the tags of another number
(f) of the following words. The output of the network is the appropriate tag of that
word in this context, and the weights on the links are adapted accordingly.
When the learning process is done, the tagger will have a huge number of weights,
along with their tag sequences, to be applied in tagging new texts.
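The window-based idea above can be sketched in a much-simplified, single-layer form: features from the current word and the previous tag are connected by weights to each candidate output tag, and the weights are adjusted whenever the prediction is wrong. This is a perceptron-style update rather than full multilayer backpropagation, and the tiny training data and tags are invented for illustration.

```python
# A greatly simplified, single-layer sketch of a window-based tagger.
# Real neural taggers use multiple hidden layers and backpropagation;
# here a perceptron-style update stands in for weight adaptation.
from collections import defaultdict

TAGS = ["DET", "NOUN", "VERB"]
weights = defaultdict(float)  # (feature, tag) -> weight on that link

def features(word, prev_tag):
    # The "input layer": current word plus the tag of the previous word.
    return ["word=" + word, "prev=" + prev_tag]

def predict(word, prev_tag):
    # The "output layer": the tag with the highest summed link weight.
    scores = {t: sum(weights[(f, t)] for f in features(word, prev_tag))
              for t in TAGS}
    return max(scores, key=scores.get)

def train(tagged_sentences, epochs=5):
    for _ in range(epochs):
        for sent in tagged_sentences:
            prev = "<s>"
            for word, gold in sent:
                guess = predict(word, prev)
                if guess != gold:
                    # strengthen links to the correct tag, weaken the wrong one
                    for f in features(word, prev):
                        weights[(f, gold)] += 1.0
                        weights[(f, guess)] -= 1.0
                prev = gold

train([[("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")]])
print(predict("barks", "NOUN"))  # VERB
```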
1.3.3 Vector-based clustering

This approach uses co-occurrence statistics to construct vectors that represent
word classes or meanings by virtue of their direction in a multi-dimensional
word-collocation space. For example, Atwell [4] annotated each word in a sample from
the LOB Corpus with a vector of neighboring word-types; words with similar
vectors were clustered into word-classes.
One method for calculating semantic word vectors is to use random labeling of
words in narrow context windows to calculate a semantic context vector for each
word type in the text data. Incorporating linguistic information in the context vectors
can enhance the results.
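A minimal sketch of this clustering idea follows, using counts of immediate neighbours as the word vectors and cosine similarity to compare their directions; the tiny English corpus is invented for illustration.

```python
# Vector-based word clustering in miniature: each word is represented
# by counts of its immediate left/right neighbours, and words whose
# vectors point in similar directions (high cosine similarity) would
# be grouped into the same class. Toy corpus, for illustration only.
import math
from collections import defaultdict

corpus = ("the cat sat on the mat . "
          "the dog sat on the rug .").split()

# Count immediate left and right neighbours of every word type.
vectors = defaultdict(lambda: defaultdict(int))
for i, w in enumerate(corpus):
    if i > 0:
        vectors[w]["L:" + corpus[i - 1]] += 1
    if i < len(corpus) - 1:
        vectors[w]["R:" + corpus[i + 1]] += 1

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = lambda x: math.sqrt(sum(c * c for c in x.values()))
    return dot / (norm(u) * norm(v))

# "cat" and "dog" occur in identical contexts, so their vectors align,
# while "cat" and "sat" share no context features at all.
print(cosine(vectors["cat"], vectors["dog"]))  # ~1.0
print(cosine(vectors["cat"], vectors["sat"]))  # 0.0
```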
1.3.4 Transformation-Based Learning

Brill has developed a symbolic machine learning method called Transformation-Based
Learning (TBL) [7,8,9]. Given a tagged training corpus, Transformation-Based
Learning produces a sequence of rules that serves as a model of the training
data. To derive the appropriate tags, each rule is applied, in a specific order, to each
instance in an untagged corpus.
TBL relies heavily on a large annotated training corpus and on reasonable default
heuristics to get things started. It learns rules that are clearly coupled to human
understanding of a natural language, and allows rules to be easily acquired for
different domains or genres.
There is a gap between an initial semantic network generated from input data and
a semantic network representing profound knowledge, from which a knowledge
database can be constructed. Using transformation rules, the semantic analysis method
is based on pattern matching against a semantic network. A transformation-rule
description language allows users to manipulate their knowledge base and to define
rules.

1.4 Aims and objectives
The main purpose of this research is to produce a system that can correctly tag
Arabic words with high accuracy, utilizing a set of available tools after modifying them
to suit our purposes. These tools are a corpus, a tagset, and a tagger.

1.5 Tools used in this work
1.5.1 Corpus
Most researchers on tagging for other languages have pretagged standard
corpora on which to work and test the performance of their systems. For Arabic, the
case is different: no standard corpora are available. This doubles the burden on anyone
who wants to work on this subject; instead of concentrating on the tagger, one has to
shift part of one's attention to preparing a large enough truth corpus tagged with the
chosen tagset, a task that is tedious and time consuming.
The lack of an easily available standard tagged Arabic corpus was the motivation for
this work. At the beginning of this study, the researcher intended to work on
morphological analysis of Arabic by machine learning, but on reviewing the literature
he discovered the unavailability of a dependable tagged corpus, one of the basic
requirements for such a study, and found that most researchers in the field complain of
this problem. So he decided to start from scratch and work in the direction of providing
such a corpus.
For this purpose, the researcher started with a raw corpus and made some revisions
and a series of automatic taggings and manual corrections until the study reached
satisfactory results. Because of time limitations, the size of the corpus reached is
moderate and not as large as one would wish.
The corpus used for this study is derived from a raw corpus whose data are articles
from two Jordanian journals, Aldustur and Aldustur Aleqtesady, but it had to go through
extensive preprocessing, which will be explained in detail in Chapter Four.

1.5.2 Tagset
We adapted the Khoja detailed tagset, a morphosyntactic tagset that is very rich and
comprehensive for Arabic, and hence hard to deal with, whether manually or
automatically. The original tagset consists of 177 tags, and its size is heavily increased
by the fact that we do not use a stemmer in the tagging system; a further group of
composite tags is therefore introduced to handle composite words. These tags can be
composites of two, three, or even four basic tags.
This tagset was revised by introducing new tags and refining some of the original
ones. The revision included distinguishing between plural forms (beneficial for
morphological studies) and recognizing defected verbs (beneficial for syntactic
studies). This modification raised the number of basic tags to 319. The complete new
tagset is shown in Appendix B.
Another subset of the resulting tagset was created by removing case information,
thus gaining two advantages: decreasing the size of the tagset and, more importantly,
getting rid of some complexity, leading to better accuracy, as will be seen in Chapter
Five.
Another set of tests was performed on the original tagset as well, where we noticed
that very little gain in accuracy was achieved by modifying the tagset. It should be
kept in mind, however, that the main purpose of modifying the tagset was not better
accuracy but rather clarity of the tags and richer features for some of them. In fact we
expected to lose some accuracy for this reason, and we were willing to sacrifice it.

1.5.3 Tagger
The tagger used for this study is the Brill tagger, which will be introduced in detail
in Section 3.2. It is based on the transformation-rule method, was originally designed
for tagging English text, and has been adapted by many researchers for other languages
such as Hungarian [23] and German [28,33]. The reasons for choosing this tagger are:
1. The source code is available, and written mostly in a common language (C),
which makes the modification possible.
2. It is based on transformation rules, which makes it possible to adapt to other
languages.
3. The use of transformation rules also makes it easy to understand the
underlying reasons behind choosing certain tags (see Section 4.4), and easy
to modify the rules and/or omit some of them if needed. This is in contrast to
using statistical taggers (Section 2.2), where information is converted into a
huge set of numbers, representing the probabilities of choosing a specific tag
for each word.
A lot of work had to be done to adapt the tagger to our purposes, including:
1. A manually tagged Arabic corpus had to be prepared, since we had to start from
scratch. This corpus was then enlarged in several steps.
2. Since the original system is written for Unix and makes use of some of its facilities,
we first attempted to convert it to the DOS environment, which is more common to
us and in our academic environment. A lot of work was done in this direction, but
many problems were encountered. The last and hardest of these was the fact that
Turbo C under DOS does not deal with extended RAM explicitly, as C under Unix
does. So at last we decided to switch to Unix, a task that also presented many
obstacles in the beginning but worked out smoothly in the end.
However, we still have the ambition, even after the completion of this project, to
switch back to DOS/Windows and attempt to produce a working DOS/Windows
version.
3. The original code mixes C in most of its parts with Perl in some others, especially
the lexical learner, which we had to work on. Perl was a new language for the
researcher, so some work had to be done in this direction: first learning as much of
Perl as possible and needed, then using that knowledge to make an efficient change
to the learner so it could use the program generated by Lex for the lexical analysis
of the corpus. The problem that took most of our time and effort here was the fact
that exactly the same changes had to be made to both the learner, which is written
in Perl, and the tagger, which is written in C.

1.6 Testing strategy
Testing was done using the method of cross-validation. Because of the
unavailability of a standard reference (truth) corpus, we had to be satisfied with a rather
small corpus for this purpose. The corpus we prepared for learning was divided into
three parts, and three tests were performed, each of which utilizes two thirds of the
whole corpus for training and the remaining third for testing, changing the parts every
time; the results are then averaged. At this stage we used a total corpus of 38,000
words, so every test involves about 25,000 words for training and 13,000 words for
testing. This whole experiment was done three times: once for the original tagset, once
for the modified tagset, and once for the ungrammatized tagset. That means three sets
of corpora and three learning/tagging systems, each using the appropriate tagset, were
prepared.
The rather small size of the corpus is justified by the lack of a standard tagged
corpus. This is the best we could reach within the available time and effort, and we
think we achieved very promising results that can be enhanced by many improvements,
including the enlargement of the learning corpus. This work is probably the first real
step in the direction of having a standard Arabic corpus tagged with a rich and
comprehensive tagset, not forgetting the contribution of Khoja, who provided the
baseline for our work.
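The three-fold procedure described above can be sketched as follows; here train() and evaluate() are toy stand-ins for the Brill learner and tagger, which are external programs in the actual experiments.

```python
# A sketch of three-fold cross-validation: split the corpus into
# thirds, train on two thirds, test on the remaining third, rotate,
# and average the three accuracies. train()/evaluate() below are
# invented stand-ins so the sketch runs end to end.

def cross_validate(corpus, train, evaluate, folds=3):
    n = len(corpus)
    accuracies = []
    for k in range(folds):
        test = corpus[k * n // folds : (k + 1) * n // folds]
        training = [s for s in corpus if s not in test]  # the other two thirds
        model = train(training)
        accuracies.append(evaluate(model, test))
    return sum(accuracies) / folds  # average over the three tests

corpus = list(range(9))        # nine toy "sentences"
train = lambda data: set(data)  # the "model" just memorizes its data
# This toy evaluate() verifies that the held-out third never leaks
# into training: it returns the fraction of test items not in the model.
evaluate = lambda model, test: sum(t not in model for t in test) / len(test)

print(cross_validate(corpus, train, evaluate))  # 1.0
```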
1.7 Chapters summary
Chapter two gives a literature review of tagging and talks about taggers and
different tagging strategies, concentrating on the efforts devoted to Arabic, in terms of
the three parts of a tagging system: corpora, tagsets, and taggers.
Chapter three talks about the original tools chosen for this work, namely the Khoja
tagset and the Brill tagger, giving a detailed idea of their form and the way they are
designed. It then outlines the strategy used for testing.
Chapter four explains our contribution in modifying the tagset, preparing the corpus
for work, and adapting the tagger to fit our needs.
Chapter five gives the tests and results of our experiments. First it gives the average
accuracies of each of the three tests performed; then it discusses the types of errors
encountered, studies their causes, and suggests solutions to them.
Chapter six gives the conclusions of the work and suggests future expansions.

Chapter Two
Literature Review
2.1 Corpora in European languages
In European languages and some other languages, there are many famous, standard
corpora available to researchers, either to be used for extracting information of interest
to their fields of study or as references for testing their tagging strategies. Below are
just a few examples of such corpora:

2.1.1 General Corpora


• The Brown Corpus: a corpus of written American English, with a corresponding
British corpus, the Lancaster-Oslo/Bergen (LOB) corpus [19], of written British
English. The Brown corpus was compiled in the 60s, while its British counterpart
was compiled in the 70s. Both consist of around one million tokens (i.e. words,
counted every time they appear).
The Brown corpus was used in seminal linguistic and psycholinguistic research
involving word frequency, and continues to be used today. It comes in text, tagged,
and parsed versions.


• BNC: The British National Corpus (BNC) [40,42] is a 100-million-word collection
of samples of written and spoken language from a wide range of sources, designed
to represent a wide cross-section of current British English, both spoken and
written. Because of its large size and its sampling of both written and spoken
language, the BNC is very good for research involving lexical frequency: words
with very low frequency are more likely to occur in a 100-million-word corpus
than in a 1-million-word corpus.



• The Amsterdam Corpus (AC): This corpus [30] was compiled at the beginning of
the 1980s by a group of scholars directed by Anthonij Dees and resulted in the
Atlas des formes linguistiques des textes littéraires de l'ancien français. The
electronic version of the AC was provided by Piet van Reenen (Free University of
Amsterdam). It contains about 200 different texts, some of them in several
manuscripts, which adds up to a total of 289 texts and close to three million word
forms. These forms have been manually annotated with 225 numeric tags encoding
part-of-speech and other morphological categories (e.g. "566" for verb, future
tense, 3rd person, plural).

2.1.2 Historical Corpora
• Helsinki Corpus: The Helsinki Corpus of English Texts: Diachronic and
Dialectal [39] is a computerized collection of extracts of continuous text. The
corpus contains a diachronic part covering the period from c. 750 to c. 1700 and a
dialect part based on transcripts of interviews with speakers of British rural
dialects from the 1970s. The aim of the corpus is to promote and facilitate the
diachronic and dialectal study of English, as well as to offer computerized material
to those interested in the development and varieties of the language. The uses for
such a corpus are fairly obvious: it is used for diachronic research, whether one is
interested in lexical frequency, semantics, syntax, etc. This corpus also has a
parsed version.

2.1.3 Annotated Corpora


• Celex: lexical databases of English, Dutch, and German [40]. This corpus contains
ASCII versions of the CELEX databases. It was developed as a joint enterprise of
the University of Nijmegen, the Institute for Dutch Lexicology in Leiden, the Max
Planck Institute for Psycholinguistics in Nijmegen, and the Institute for Perception
Research in Eindhoven. It contains detailed information on orthography, phonology
(phonetic transcriptions, variations in pronunciation, syllable structure, primary
stress), morphology (derivational and compositional structure, inflectional
paradigms), syntax (word class, word-class-specific subcategorizations, argument
structures), and word frequency (summed word and lemma counts based on recent
and representative text corpora). It is thus useful for various types of linguistic and
psycholinguistic research.



• The Penn Treebank: The Penn Treebank Project [41] annotates naturally-occurring
text for linguistic structure. Most notably, it produces skeletal parses showing
rough syntactic and semantic information: a bank of linguistic trees. It also
annotates text with part-of-speech tags and, for the Switchboard corpus of
telephone conversations, with dysfluency annotation. The Penn Treebank project
has annotated the Switchboard Corpus, the Wall Street Journal Corpus, the Chinese
Journal Project, the Brown Corpus, and the Helsinki Corpus, among others. It is
very useful for syntactic research, or any research involving the syntactic/semantic
relationships between words. Tgrep (tree-grep) is a useful tool for use with this
corpus.


There are corpora available for many other languages. Examples include:
American English [35], German [29], Hungarian [23,26], Swedish [5, 25],
and Hebrew [15]. More information about all these corpora and others can
be easily found on the Internet.

2.2 Arabic corpora
A number of electronic Arabic text corpora have been compiled [32], but these
corpora are raw, which means that their exploration remains problematic. Some
analyses conducted on these corpora involve very limited data. Others have
developed proficient word-form analyzers, such as the analyzer by the Xerox
European Research Centre, but the question remains whether these analyzers
provide an adequate solution for the exploration of Arabic tagged corpora.
In order to explore corpora in an efficient and economically reliable way, some
preliminary operations ought to be performed [32]. As is generally known, analyzing
Arabic corpora is more complex than analyzing corpora of other languages, for three
main reasons. In the first place, the Arabic language is very polysemic, much more
so than, for example, Dutch. In the Dutch language, one way to create
new words is by adding two words together in order to obtain a new word as a
compound. These new words are very widespread, but are also identifiable by a
computer in a simple way, i.e. by defining a word as a string of characters between two
blanks. In Arabic new meanings for words are often given by expanding the older
meaning of an existing word to a new one. This means that the external morphological
form of the word does not change, in spite of the fact that the word carries a new
meaning.
A second element that makes analysis of Arabic more complex than other languages
is the fact that the language is usually not vocalized, which means that the degree of
ambiguity of words as separate units is much greater than e.g. in English or Dutch.
Words, in their raw form, can belong to different grammatical categories as seen in the
string of characters "ktb". This string of characters stands for the verb "kataba" (to
write) as well as for the plural "kutub" (books). This complicates the searching for
words in a corpus of texts.
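This one-to-many relation between a written form and its readings can be pictured as a simple lookup. The sketch below is purely illustrative: the small ANALYSES dictionary is a hypothetical stand-in for a real morphological lexicon, populated with the "ktb" and "fhm" readings discussed in this section.

```python
# Illustrative sketch: one unvocalized surface form maps to several
# possible analyses, so a tagger cannot resolve it in isolation.
# ANALYSES is a hypothetical stand-in for a real morphological lexicon.
ANALYSES = {
    "ktb": [("kataba", "verb: he wrote"),
            ("kutub", "noun: books")],
    "fhm": [("fahima", "verb: he understood"),
            ("fahum", "particle + pronoun: since they"),
            ("fahamma", "particle + verb: then he considered")],
}

def readings(surface):
    """Return every recorded analysis of an unvocalized form."""
    return ANALYSES.get(surface, [])

print(len(readings("ktb")))  # the single written form "ktb" has 2 readings
```

Only context, not the form itself, can choose between these readings, which is precisely what makes searching and tagging unvocalized Arabic text hard.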
In the third place, the problem is complicated by the fact that in Arabic a number of
prefixes and suffixes are directly linked to the word. This makes the searching by
computer even more complex. For example the string of characters ‘fhm’ can stand for
the verb "fahima" (understood), but it can as well stand for the particle and suffix
"fahum" (since they) or for the particle and verb "fahamma" (then he considered).
These facts and others are behind the lack of tagged Arabic corpora. One of the
researchers in this line [11] noticed “the frustrating reality was that the NLP experts
with experience in dealing with European languages and scripts deemed the problem [of
providing tagged corpora and taggers for Arabic] trivial and therefore not worth wasting
time on. While the available Arabic language experts had no computer experience and
deemed the problem impossible to solve and therefore not worth wasting time on it”.
This is true to some extent, but what is certainly true is that very few corpora are
available for the Arabic language. There are some large corpora, but unfortunately
they are not free. Also, although some of these corpora are marked up with XML or
SGML tags, none of them are POS tagged [32, 37].
There are some efforts towards the preparation of a POS-tagged corpus for Arabic, but
they are still in their early and testing stages. One of these works is that of Shereen
Khoja [16,17]. Although her work has some limitations and deficiencies, as will be
explained below, it is probably the first step towards building an Arabic POS-tagged
corpus. She introduced two tagsets, one is very small containing only five classes or
basic tags (noun, verb, particle, residual, punctuation), and the other is very
comprehensive and appropriate for Arabic, containing more detailed tags (i.e. singular,
masculine, definite common noun). She used the first tagset to manually tag 50,000
words of Arabic newspaper text. This type of tagging is obviously of little use, but she
also tagged 1,700 words with the second tagset [37]. I sent many email messages to
Miss Khoja hoping to get a copy of her tagged corpus and benefit from it, but
unfortunately I did not receive any response. However, from the small excerpt (see
Figure 2-1) she enclosed in her paper [16], it seems that the corpus is not well built,
since there are in that short passage many mistagged items. Mistakes include the
following (refer to Section 3.2.2 and to Appendix B to get an idea about meanings of
tags):
1. Mistagging adjectives as nouns; for example, ‫ الشريفين‬is tagged as NCDuMGD
instead of NADuMGD. There are many instances of this error.
2. Case information for nouns seems almost random. For example,
‫مبناسرة االور ا الر ين اا‬ is tagged as PPr-NCSgFGI NCSgMAD NCSgMND, instead of
PPr-NCSgFGI NCSgMGD NASgMGD, and ‫أعري االكرااالير‬ is tagged as
VPSg3M NCSgMND NCSgMAD, instead of VPSg3M NCSgMND NASgMND.
3. Tagging singular as plural; for example, ‫ لبالده‬is tagged as
PPr_NCPlFGI_NPrPSg3M instead of PPr_NCSgFGI_NPrPSg3M.

Figure 2-1: copy of the manually tagged excerpt cited by Khoja

4. Tagging feminine as masculine; for example, ‫عر اأ كر االهاراي‬ is tagged as
PPr NCSgFNI NCPlMND, instead of PPr NASgMGI NCPlFGD.
These are just a few examples of the mistakes found in the 48-word passage. Note
also that some of the words cited in the above examples contain more than one type of
mistake.

It is worth mentioning here that mistakes in manually tagged corpora are very
unfavorable, since these corpora are considered to represent the truth and are used
as guidelines for learning systems. If they are not carefully built, the whole system is a
failure, regardless of how high the reported accuracy may be.

We used the same detailed Khoja tagset to tag about 38,000 words, and have three
versions of this corpus: one tagged with the original detailed tagset as proposed in [16];
the second tagged with a modified version of that tagset, as explained in
Section 4.2 and presented in Appendix B; and the third tagged with the modified
version with the grammatical information removed. We do not claim perfection, but
we think that our work, besides being much larger, is also much more accurate in
applying the tagset to real Arabic text.

2.3 Arabic Taggers
Very few people have worked towards building a complete tagger for Arabic. The
following cases, though the list is not exhaustive, are among the best examples:


Abuliel [1]: in his paper he described some preparatory steps of building an
Arabic POS tagger. Rule-based techniques were used for finding phrases,
analyzing affixes of the word, and discovering proper nouns. The tagset used
in this work is not specified, and no results are reported concerning the
overall performance of a tagging system.



Alshalabi et al. [3] dealt with vowelized Arabic text and considered
recognizing nouns only. This work showed how to discover nouns in the text
but does not reach the stage of tagging. The fact that the system is
constrained to vowelized text makes it deficient. Although they talked about
part-of-speech tagging and gave a survey of taggers, they did not really do
any tagging, nor did they give any tagset for this purpose. They reported
95.4% accuracy, which is a good performance rate, but we should keep in
mind that the system is constrained to completely vowelized words, thereby
minimizing ambiguity, and that it is restricted to discovering nouns, which
simplifies the classification task.



Maloney and Niv [21], also worked with names only, in their name
recognizing system called TAGARAB.


Freeman, from the Department of Near Eastern Studies at the University of
Michigan [11], reported that he is attempting to adapt the Brill tagger to
Arabic. He designed his own tagset for this purpose, started to do some
morphological analysis, and explained the hurdles he encountered in that
work. According to his paper, he did not reach the stage of tagging to report
any accuracy rate.



Khoja: the title of her paper [17] may lead one to conclude that she has a
complete tagger. That misled us at the beginning of our work, but after
carefully studying the paper we concluded that she has only done some
preliminary work in this direction and is still working on the tagger. This
was ascertained by consulting her website [37], where she declares: “As far
as I know, a POS tagger has yet to be developed for Arabic, which is why I
am developing one myself.”

2.4 Definition of training and testing texts
A corpus of over 38,000 words was prepared. Three versions of this corpus are
available: one tagged with the original Khoja tagset, the second with a modified
tagset as explained in Section 4.1, and the third is tagged with a subset of the
modified tagset which excludes the grammar information, as explained in Section
4.5.
Each of these corpora is divided into three equal portions; then a cross validation
is performed three times, each time using a different two thirds of the corpus for
training and the remaining third for testing. The average of the three tests is taken
as the estimated performance accuracy of the tagger. This means that nine tests are
done in this way.
In addition to these tests, three other tests were performed on the corpus tagged
with the complete modified tagset, this time to test the effect of enlarging the corpus
size on the accuracy of the tagger. To do that, about five sixths of the corpus are
used for training, and the other one sixth for testing, for each new test. Then, the
average is taken to get an estimate of the overall accuracy.
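The cross-validation scheme described above can be sketched as follows. This is a generic illustration rather than the actual Perl/C pipeline used in this work, and the train_and_score callback is a hypothetical stand-in for a full train-and-evaluate run of the tagger.

```python
# Sketch of the k-fold cross validation described above: each fold
# trains on the other (k-1) portions and tests on the held-out one,
# and the reported accuracy is the average over the k runs.
def cross_validate(sentences, train_and_score, k=3):
    n = len(sentences)
    folds = [sentences[i * n // k:(i + 1) * n // k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [s for j, f in enumerate(folds) if j != i for s in f]
        scores.append(train_and_score(train, test))
    return sum(scores) / k

# Toy run with a dummy scorer that just reports the test share.
corpus = list(range(39))
avg = cross_validate(corpus, lambda tr, te: len(te) / (len(tr) + len(te)))
print(round(avg, 2))  # each test fold is one third of the corpus
```

The five-sixths/one-sixth experiments mentioned above correspond to calling the same routine with k=6.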

Chapter Three
Design
3.1 Tagsets and the adapted Arabic tagset
3.1.1 Tagsets

As mentioned in section 2.1, tagging requires a set of tags, which classify the words
according to their lexical and syntactical meanings, i.e. a tagset.
Tagsets vary considerably in size: some systems use fewer than 20 tags, while others
use over 400. The larger the tagset, the more information is carried in each tag. For
example, we may have a basic tagset which divides the words into a very small set of
classes, as in Figure 3-1 below.
We may refine this classification by dividing nouns into singular and plural, verbs
into present and past, and so on, as shown in Figure 3-2, which lists a subset of a
refined tagset showing the different tags that belong to the general class verb in
English. The classes can be further subdivided, as shown in Figure 3-3, which gives
a complete list of the Penn Treebank tagset [41].

Tag    Description        Tag    Description
NN     Noun               JJ     Adjective
NNP    Proper noun        CC     Coordinating conjunction
DT     Determiner         CD     Cardinal number
IN     Preposition        Prp    Personal pronoun
VB     Verb               RB     Adverb
-R     Comparative        -S     Superlative
-$     Possessive

Figure 3-1: example of a general classification tagset.

Tag    Description           Example
VBP    Base present          take
VB     Infinitive            to take
VBD    Past                  took
VBG    Present participle    taking
VBN    Past participle       taken
VBZ    Present 3sg           takes
MD     Modal                 can, will

Figure 3-2: example of a detailed tagset for verbs.

Figure 3-3: the entire Penn Treebank tagset

3.1.2 The adopted tagset

This section describes the tagset adapted for our work. The tagset is based on the
Khoja tagset, as mentioned earlier. We introduce the tagset as described by its designer
[20]. The modifications that are specific to our work are marked with an asterisk
(*) and are discussed in detail in Section 4.2. The original tagset
(Tagset1) contains 177 tags: 103 noun, 57 verb, 9 particle, 7 residual, and 1
punctuation tags. We derived two other tagsets: Tagset2, a modified version of Tagset1
containing 319 tags, and Tagset3, a simplified version of Tagset2, which excludes
grammatical information, with 189 tags. The complete modified tagset (Tagset2) is
given in Appendix B. A full description of each of the tags, with examples of Arabic
words that take those tags, now follows. This description is based on that given by
Khoja.
The five main categories for words are:
1. N [noun]
2. V [verb]
3. P [particle]
4. R [residual]
*5. punc [punctuation]
Note that category number 5 is preceded by an asterisk (*). This indicates a
modification in the name of the category, or a completely new category (or
subcategory), as shall be seen in subsequent examples.
The residual category contains foreign words, mathematical formulae and numbers.
The punctuation category contains all punctuation symbols, both Arabic and foreign,
such as (? ، ! " ؟ .).
The subcategories of noun are:
1.1 C [common]
1.2 P [proper]
1.3 Pr [pronoun]
1.4 Nu [numeral]
1.5 A [adjective]
*1.6 T [title]

Adjectives are nouns that describe the aspects of an object. Adjectives inherit the
properties of nouns, so they take “nunation” when in the indefinite and can take the
definite article when definite. For example, alwld alSgyr “The small boy” contains the
adjective Sgyr “small”. This adjective can take the definite article as in ‘darasa alwaladu
alSagyr’ “the small boy studied”, and it can also have “nunation” as in ‘hasan Sgyr’
“Hassan is small”.
Examples of these subcategories include:
• Singular, masculine, accusative, common noun such as ktab “book” in the sentence
‘>x* alwld ktaba’ “the boy took a book”.
• Singular, masculine, genitive, common noun such as ktab “book” in the sentence
‘drst mn ktab’ “I studied from a book”.
• Singular, feminine, nominative, common noun such as mdrsp “school” in the
sentence ‘h*h mdrsp’ “this is a school”.
Note here and in subsequent examples that vocalization does not appear in the
transliteration, because we do not assume that we are dealing with vocalized text.
The subcategories of the pronoun are:
1.3.1 P [personal]
1.3.2 R [relative]
1.3.3 D [demonstrative]

The personal pronouns can be detached words such as ‘hw’ “he”, or attached to a
word in the form of a clitic. The attached pronouns can be attached to nouns to indicate
possession, to verbs as direct object, or attached to prepositions such as fyh “in it”.
Some examples of pronouns include:
• Third person, singular, masculine, personal pronoun, such as hw “he”.
• Singular, feminine, demonstrative pronoun, such as h*h “this”.
The subcategories of the relative pronoun are:
1.3.2.1 S [specific]
1.3.2.2 C [common]

Examples of relative pronouns include:
• Dual, feminine, specific, relative pronoun, such as alltan “who”.
• Plural, masculine, specific, relative pronoun, such as al*yn “who”.
• Common, relative pronoun, such as ‘mn’ “who”.
The subcategories of the numeral are:
1.4.1 Ca [cardinal]
1.4.2 O [ordinal]
*1.4.3 Na [numerical adjective]

We preferred omitting subcategory 1.4.3 and adding related tags to normal
adjectives. This kind of adjective, however, is not very common, and we did not
encounter any in the corpus we used.
Examples of numerals include:
• Singular, masculine, nominative, indefinite cardinal number such as ‘>rbEp’ “four”.
• Singular, masculine, nominative, definite ordinal number such as ‘alrabE’ “the
fourth”.
The linguistic attributes of nouns, adjectives, and numerals that have been used in
this tagset are:

(i) Gender: M [masculine], F [feminine], N [neuter]
(ii) Number: Sg [singular], Du [dual], *Plm [masculine sound plural],
*Plf [feminine sound plural], *Plb [broken plural]
(iii) Person: 1 [first], 2 [second], 3 [third]
(iv) Case: N [nominative], A [accusative], G [genitive]
(v) Definiteness: D [definite], I [indefinite]

Verbs are categorised into three main parts:
1. P [perfect]
2. I [imperfect]
3. Iv [imperative]

The definition of perfect verbs not only includes (i) the equivalent of English past-tense
verbs (i.e. describing acts completed at some past time), but also (ii) acts
which at the moment of speaking have already been completed and remain in a state of
completion, (iii) a past act that often took place or still takes place (e.g.
commentators are agreed (have agreed and still agree)), (iv) an act which is
completed at the very moment of speaking it (I sell you this), and (v) acts so
certain to occur that they can be described as having already taken
place (mostly used in promises, treaties and so on) [16].
The imperfect does not in itself express any idea of time; it merely indicates a
begun, incomplete, or enduring existence in present, past, or future time. The
imperative, by contrast, orders or asks for something to be done in the future.
Examples of verbs include:
• First person, singular, neuter, perfect verb ‘ksrt’ (كسرت) “I broke”.
• First person, singular, neuter, indicative, imperfect verb ‘>ksr’ (أكسر) “I break”.
• Second person, singular, masculine, imperative verb ‘aksr’ (اكسر) “Break!”.
The verbal attributes that have been used in our tagset are:

(i) Gender: M [masculine], F [feminine], N [neuter]
(ii) Number: Sg [singular], Du [dual], Pl [plural]
(iii) Person: 1 [first], 2 [second], 3 [third]
(iv) Mood: I [indicative], S [subjunctive], J [jussive]

The two most notable verbal attributes that are fundamental to Arabic but do not
normally appear in Indo-European tagsets are the dual number, and the jussive mood.
The subcategories of particle are:
1.1 Pr [prepositions]
1.2 A [adverbial]
1.3 C [conjunctions]
1.4 I [interjections]
1.5 E [exceptions]
1.6 N [negatives]
1.7 A [answers]
1.8 X [explanations]
1.9 S [subordinates]
*1.10 dt [dubitive]
*1.11 cr [certain]
*1.12 Str [stressive]
*LM [lm]
*LN [ln]

Examples of particles include:
• Prepositions: fy (في) “in”
• Adverbial particles: swf (سوف) “shall”
• Conjunctions: w (و) “and”
• Interjections: ya (يا) “O”
• Exceptions: swY (سوى) “except”
• Negatives: la (لا) “not”
• Answers: nEm (نعم) “yes”
• Explanations: >y (أي) “that is”
• Subordinates: lw (لو) “if”

3.2 Corpora used for this work
Early in our work, we were faced with the unavailability of corpora for MSA text.
Even the ones that we read about in some of the previous works were not easily
available, besides not fitting our needs well. We contacted some of the researchers,
but only a few of them responded to our requests and questions. One of these responses
provided a raw corpus of excerpts from two Jordanian magazines, containing about
160,000 words. For the sake of saving time, we preferred working on this corpus rather
than creating our own, in spite of the fact that the corpus needed some processing before
it could be used in our experiments. These excerpts were provided as a Microsoft Word
document in Arabic characters, which had to undergo a series of preparatory steps to be
ready for use in our tagging task, as will be explained in detail in Section 4.1.

3.3 The Brill system
The Brill system is divided into two separate parts: the learner and the tagger. In the
following subsections we explain the way each of these two programs works.

3.3.1 Learner

Before the process of learning starts, the truth corpus undergoes a series of
preliminary operations that prepare a set of files necessary for learning. These
operations are sketched in Figure 3-4 and explained in more detail in Section 4.3.
Transformation-based error-driven learning, as shown in Figures 3-5 and 3-6, works
as follows. First, unannotated text is passed through an initial-state annotator. Various
initial-state annotators, representing different levels of complexity, have been used,
including: the output of a stochastic n-gram tagger; labeling all words with their most
likely tag as indicated in the training corpus; and simply labeling all words as nouns.
For example, Brill gave two simple algorithms to do that; one assigns to all
unknown words the tag “NN” for common noun in the Penn Treebank tagset, and
[Figure 3-4 is a flowchart: the original corpus is manually reviewed for errors and
typing mistakes, manually converted to the Brill format, and transliterated (C
program); the resulting untagged corpus is tagged semi-automatically and divided
in two (Brill Perl program); each tagged half can be untagged again (Brill Perl
program), and the final lexicon is prepared from the entire tagged corpus.]

Figure 3-4: Preliminary steps for tagging

the other assigns to every word in the corpus one of two tags: “NNP” for proper noun
if the word starts with a capital letter, or “NN” otherwise. This strategy is based on the
observation that common nouns constitute a high percentage of English text. In this
research we used a more detailed strategy, where the pattern of the letters of a word is
compared with a predefined set of patterns to determine which word class the word
belongs to, making use of the rules of Arabic morphology (Srf). A tag is then assigned
to the word accordingly. If the word does not match any of the standard patterns, it is
assigned the tag “NCSgFGI”, which stands for “singular, feminine, genitive, indefinite
common noun”, since this is the most probable tag for unknown words, as noticed when
the manually tagged corpus was prepared. The patterns used to tag unknown
words are shown in Appendix C.
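The pattern-based initial tagging just described can be sketched as below. The regular expressions and the VISg3MI tag are invented stand-ins for the real Lex patterns of Appendix C; only the default tag NCSgFGI is taken from the text.

```python
import re

# Sketch of the initial-state annotator: match the transliterated word
# against a few morphological patterns and fall back to the default tag
# NCSgFGI, the most probable tag for unknown words.
# These patterns are illustrative, not the real set of Appendix C.
PATTERNS = [
    (re.compile(r"^al.+p$"), "NCSgFGD"),    # definite article + feminine marker
    (re.compile(r"^al.+wn$"), "NCPlmMND"),  # definite + masculine sound plural
    (re.compile(r"^y.+$"), "VISg3MI"),      # imperfect prefix y- (hypothetical tag)
]
DEFAULT_TAG = "NCSgFGI"

def initial_tag(word):
    for pattern, tag in PATTERNS:
        if pattern.match(word):
            return tag
    return DEFAULT_TAG

print(initial_tag("almdrsp"))  # matches the first pattern
print(initial_tag("ktab"))     # no pattern matches, so the default tag is used
```

In the actual system this matching is done by a Lex-generated scanner interfaced with both learner and tagger.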
Once text has been passed through the initial-state annotator, it is then compared to
the truth; a manually annotated corpus is used as our reference for truth. An ordered list
of transformations is learned that can be applied to the output of the initial-state
annotator to make it better resemble the truth. Each transformation has two components:
a rewrite rule and a triggering environment. A rewrite rule can be of the form:
X → Y, meaning “change the tag from X to Y”,
while a triggering environment can be of the form:
“al” hasprefix 2, meaning “if the current word has a 2-letter prefix ‘al’”.
Taken together, the transformation with this rewrite rule and triggering environment
would be:
X → Y “al” hasprefix 2,
meaning “change the tag of the current word from X to Y if it has a 2-letter prefix
‘al’”.
There are two types of rules: lexical rules and contextual rules. Therefore, two
learners have to be run consecutively. First, lexical rules are learned; then contextual
rules are learned to refine the tags and make up for divergences that may occur in
applying the lexical rules. In both cases, the learning procedure makes passes
through the truth corpus, each pass learning the rule that, when applied, most reduces
the errors in tagging the corpus as compared to the truth. These rules are stored
in a file in the order they are learned, yielding two rule files: a lexical rule file
and a contextual rule file. The tagger applies these rules in the same order to get similar
results. Examples of both types of rules, obtained from the Arabic tagged corpus, are
given in Section 4.4, with explanatory comments giving the meaning of each rule.
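Applying an ordered list of such transformations can be sketched as follows. The two rules shown are illustrative examples in the spirit of the hasprefix rule above, not rules actually learned by the system.

```python
# Sketch of applying ordered lexical transformations of the form
# "change tag X to Y if the word has prefix P", as in the
# X -> Y "al" hasprefix 2 example in the text.
RULES = [
    # (from_tag, to_tag, trigger_prefix) -- illustrative rules only
    ("NCSgFGI", "NCSgFGD", "al"),        # definite article makes the noun definite
    ("NCSgFGI", "PC_NCSgFGD", "bal"),    # preposition b + definite noun, composite tag
]

def apply_lexical_rules(word, tag):
    for from_tag, to_tag, prefix in RULES:
        if tag == from_tag and word.startswith(prefix):
            tag = to_tag  # rules fire in the order they were learned
    return tag

print(apply_lexical_rules("almdrsp", "NCSgFGI"))   # first rule fires
print(apply_lexical_rules("balmdrsp", "NCSgFGI"))  # second rule fires
```

Contextual rules have the same shape, except that their triggers inspect the tags of neighbouring words rather than the word's own prefixes and suffixes.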
The ideal goal of the lexical module is to find rules that can produce the most likely
tag for any word in the given language, i.e. the most frequent tag for the word in
question considering all texts in that language. The problem is to determine the most
likely tags for unknown words, given the most likely tag for each word in a
comparatively small set of words. This is done by transformation-based learning
(TBL) using three different lists: a list of word–tag–frequency triples derived from the
first half of the training corpus, a list of all available words sorted by decreasing
frequency, and a list of all word pairs, i.e. bigrams. Thus, the lexical learner module
does not use running texts.
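The three lists consumed by the lexical learner can be derived from a tagged corpus along these lines. This is a schematic reconstruction, not the Brill distribution's actual scripts, and the tiny tagged sample is invented.

```python
from collections import Counter

# Schematic construction of the three inputs to the lexical learner:
# word-tag-frequency triples, words by decreasing frequency, and bigrams.
tagged = [("ElY", "PPr"), ("ktab", "NCSgMGI"), ("ktab", "NCSgMAI"),
          ("ktab", "NCSgMGI"), ("fy", "PPr")]

word_tag_freq = Counter(tagged)                 # (word, tag) -> count
word_freq = Counter(w for w, _ in tagged)
words_by_freq = [w for w, _ in word_freq.most_common()]
words = [w for w, _ in tagged]
bigrams = list(zip(words, words[1:]))           # adjacent word pairs

print(word_tag_freq[("ktab", "NCSgMGI")])  # 2
print(words_by_freq[0])                    # most frequent word first
print(len(bigrams))                        # 4
```

Note that the bigram list keeps only word adjacency, which is exactly why the lexical learner needs no running text beyond these precomputed lists.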
Once the tagger has learned the most likely tag for each word found in the annotated
training corpus and the rules for predicting the most likely tag for unknown words,
contextual rules are learned for disambiguation. The learner discovers rules on the basis
of the particular environments (or the context) of word tokens. The contextual learning
process needs an initially annotated text. The input to the initial-state annotator is an
untagged corpus, a running text, which is the other half of the annotated corpus with
the tagging information of the words removed. The initial-state annotator also
uses a list of words with a number of tags attached to each word, found in the
first half of the annotated corpus. The first tag is the most likely tag for the word in
question, and the rest of the tags follow in no particular order. With the help of this list, a

[Figure 3-5 is a flowchart: Untagged corpus 2 passes through the initial-state tagger
to produce a dummy-tagged corpus; the lexical learner compares it with Tagged
corpus 2 (the truth) and emits lexical rules, iterating until a threshold is reached.]

Figure 3-5: Lexical rule learning
list of bigrams (the same as used in the lexical learning module, see above) and the
lexical rules, the initial-state annotator assigns to every word in the untagged corpus the
most likely tag. In other words, it tags the known words with the most frequent tag for
the word in question. The tags for the unknown words are computed using the lexical
rules: each unknown word is first tagged with a default tag, and then the lexical rules are
applied in order.
There is one difference compared to the lexical learning module: the application of
the rules is restricted in the following way. If the current word occurs in the lexicon
but the new tag given by the rule is not one of the tags associated with the word in the
lexicon, then the rule does not change the tag of this word.
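This restriction can be sketched as a guard around rule application; the LEXICON contents below are illustrative.

```python
# Sketch of the lexicon restriction in the contextual phase: a rule may
# only change a known word's tag to one of the tags the lexicon already
# associates with that word; unknown words are unrestricted.
LEXICON = {
    "ktab": {"NCSgMGI", "NCSgMAI"},  # tags seen for this word in training
}

def constrained_retag(word, old_tag, new_tag):
    allowed = LEXICON.get(word)
    if allowed is not None and new_tag not in allowed:
        return old_tag   # known word, unlisted tag: keep the old tag
    return new_tag       # unknown word, or the tag is permitted

print(constrained_retag("ktab", "NCSgMGI", "VPSg3M"))   # blocked
print(constrained_retag("ktab", "NCSgMGI", "NCSgMAI"))  # allowed
print(constrained_retag("qlm", "NCSgFGI", "NCSgMGI"))   # unknown word: allowed
```

The guard keeps contextual rules from overriding reliable lexicon evidence while still letting them correct the guessed tags of unknown words.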

[Figure 3-6 is a flowchart: the dummy-tagged corpus and Tagged corpus 2 (the
truth) are fed to the context learner, which emits context rules, iterating until a
threshold is reached.]

Figure 3-6: Context rule learning
When tagging new text, an initial-state annotator first applies the predefined default
tags to the unknown words (i.e. words not in the lexicon). Then, the ordered
lexical rules are applied to these words. The known words are tagged with the most
likely tag. Finally, the ordered contextual rules are applied to all words.
[Figure 3-7 is a flowchart: an unannotated corpus receives initial tags; the lexical
tagger applies the lexical rules in order, then the context tagger applies the context
rules in order, producing the tagged corpus.]

Figure 3-7: Tagging

3.3.2 Tagger

The tagger follows the same path as the learner. Starting with any raw text corpus
given to it, it first applies the same initial-state annotator as the one used in learning, so
that the transformation rules work correctly. Then it uses the rule files obtained by the
learner to change initial tags to new tags. The rules are applied in the same order they
were collected: first the lexical rules, and then the contextual rules.

3.4 Testing strategies
Testing was done using the method of cross validation. Taking into consideration that
we do not have a large standard truth corpus, we had to manage with the corpus we
tagged. This corpus is divided into three portions, each containing about 13,000
words, and the test is repeated three times, with a different third used for testing
and the other two thirds for learning each time; the average of the three tests is then
taken as an overall measure of the performance of the system. This whole experiment is
repeated using three versions of the tagset, and therefore three versions of corpora:

1. Tagset1: the original detailed Khoja tagset [16] containing 177 tags.
2. Tagset2: the complete modified tagset of 319 tags (Appendix B).
3. Tagset3: a subset of Tagset2 from which grammatical information is excluded
for nouns and imperfect verbs, reducing the number of tags to 185.
All three tagsets are considerably enlarged by the fact that the system we used
does not apply stemming prior to the learning and tagging phases. Rather, it uses
composite tags to tag composite words, a fact that introduces a new set of
tags. As an example, consider the word balmdrsp (بالمدرسة). If stemming
were applied, this word would be divided into two separate words, b
and almdrsp, and would be tagged as b/PC almdrsp/NCSgFGD. But since we
work without stemming, the word is treated as one unit and is tagged as
balmdrsp/PC_NCSgFGD, thus introducing the new tag PC_NCSgFGD.
Stemming would probably enhance the accuracy of the system, but it would
divert our attention in other directions and put extra burdens on the users of the
system.
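Treating a clitic group as one token with one composite tag can be sketched as joining the segment tags with an underscore, as in the balmdrsp example; the segmentation is supplied by hand here for illustration.

```python
# Sketch of composite tags: since no stemming is done, a clitic group
# such as b+almdrsp stays one token and receives one joined tag,
# enlarging the effective tagset.
def composite_tag(segment_tags):
    """Join the tags of a token's segments into one composite tag."""
    return "_".join(segment_tags)

# b/PC + almdrsp/NCSgFGD -> one token balmdrsp with one composite tag
print(composite_tag(["PC", "NCSgFGD"]))
# three-segment clitic groups yield three-part tags in the same way
print(composite_tag(["PPr", "NCSgFGI", "NPrPSg3M"]))
```

Every distinct combination of segment tags that occurs in the corpus thus becomes a tag of its own, which is why Tagset2 grows to 319 tags.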

Chapter Four
Implementation and Testing
4.1 Corpus
The corpus used for this study is part of a corpus of about 160,000 words from two
Jordanian newspapers (Aldustor and Aldustor Aleqtsady). Any MSA corpus would have
done the task, but this corpus was obtained at an early stage of the work and was used
henceforth. A lot of preprocessing was needed before using the corpus. The corpus was
originally a Microsoft Word document, so it had to undergo the following correction
and revision tasks to be ready for our work:
1. There were many typing, spelling, and grammatical mistakes that constituted
quite a phenomenon in the text, and would hinder the process of tagging and
add to the problem of ambiguity, which is already an inherent problem of
Arabic texts. These problems had to be fixed beforehand. Examples of such
mistakes include:
a.Missing hamza, like: ‫احبار ,اقصي ,اوضح ,اشار‬
‫,احبار‬
b. Misplaced hamza, like:

‫ أجياء‬instead of

‫.تامل‬
‫تامل‬
‫إجياء‬

c. ‫ هـ‬instead of ‫ ,ة‬like: ‫. باخر‬
d. ‫ ي‬instead of ‫ ,ى‬or vice versa, like: ‫اعكى‬

‫.حتهاج ايل ,أ يي ,أمح احمم‬
e.Typing mistakes, like: ‫,لك ل اب ًام الك الل ,الةوئى اب ًام االةوئ‬
‫ال‬
‫ال‬
‫ا‬
‫ا‬
‫.األولاوفاتاب اًام ااألول فات ,فاجلياءااب اًام افاإلجياء‬
‫ال‬
‫ال‬
f. Grammar mistakes, like:

‫ …اس ر ر اءايفااينر رراراالر ررني ام ابر ررلاال ر ر اء االذ ذذيح األذ ذ ذ ا ذ ذ ذ‬
. ‫المتحدة لألغدادابوعاني ام ابلاسكعااساسو ااوا ارجو‬
.…‫ وفه اف االزواراواليحافه قعاأنافزف اع دهما‬
‫ و اص اوان‬
‫ ك لااف ابكغاع داالسرواراتاواآلل ات افرغتايفامونراءاالع ةر ا‬
.)‫ب اًام ا(اليتاأفيغت‬
‫ال‬
.‫ وتهجاوزاقوم ام ج داهتااعناسةع امكواراتادوالر‬

‫ انامشرريوااالر زارقاال اضرراابال رراءاال رريفاالهجارفر ايفامياكررزااالل فر ا‬
.)‫شروعاً عشوائ اً اب اًام ا( ااعش اا‬
‫ا ائا‬
‫مشيوا‬
‫ال‬
‫شروعا‬
.)‫ حيصلاعكىات قوعاالمؤسسوناعكوه اب اًام ا(الؤسسن‬
‫ال‬

‫ …اوالر ياخا هلررهاحبررلاالسررةلاالمينر اله نور االهعرراونا ذذا ن ذ ن‬

‫الجمع ذذع ذذا ال ذذو ا ةايفاتنس ررو ااجلا ر االر ر ين ابر ر اًامر ر ا(ب ررنااجلمعور ر ا‬
‫ال‬
.)‫وال زارق‬
2. Getting rid of passage numbers, titles, and end marks to concentrate on
complete sentences of text.
3. The text is then converted to an ASCII MS-DOS format.
4. Because of technical considerations, like the different code pages used for
representing Arabic characters and the use of software that does not support
Arabization, especially the Lex analyzing system and the Linux environment, it
was decided to follow most of the previous line of research in Arabic [e.g.
1,14,21] and use transliteration. For this purpose the Buckwalter transliteration
code [36] is used, and a small C program was written to do this task.
5. The corpus is then edited to match the Brill format and copied to the Linux
system for the rest of the processing.
6. Then it is tagged, using a program written with the help of the lexical
analyzer Lex [2]. The resulting corpus, calculated to be about 43% accurate,
is then revised manually. The result, which is supposed to represent the truth,
is then given to the learner of the Brill tagger to learn lexical and
contextual rules, a step that also requires some other preparations, as
explained in Section 3.3.
7. The above steps are performed initially on a corpus of size about 1000
words. After the rules are learned a larger corpus is presented to the tagger,
tagged, manually revised, and given to the learner to enhance the rule set.
This process is repeated continuously, hence enlarging the truth corpus and
enhancing the performance of the tagger simultaneously, until satisfactory

results are obtained and/or enough time is spent on this point. At the present
a truth corpus of over 38,000 words is reached.
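The transliteration step (step 4) can be sketched as follows. The thesis used a small C program; this Python version is only an illustrative reconstruction of the published Buckwalter scheme, abbreviated to the basic letters (no diacritics, which the unvocalized corpus does not carry anyway).

```python
# Sketch of the Arabic-to-Buckwalter transliteration step (the thesis used a
# small C program; this Python reconstruction is illustrative only).
# Note: the thesis samples write bare alif as lowercase "a" (standard
# Buckwalter uses "A"); the table below follows the samples.
BUCKWALTER = {
    '\u0621': "'", '\u0622': '|', '\u0623': '>', '\u0624': '&',
    '\u0625': '<', '\u0626': '}', '\u0627': 'a', '\u0628': 'b',
    '\u0629': 'p', '\u062A': 't', '\u062B': 'v', '\u062C': 'j',
    '\u062D': 'H', '\u062E': 'x', '\u062F': 'd', '\u0630': '*',
    '\u0631': 'r', '\u0632': 'z', '\u0633': 's', '\u0634': '$',
    '\u0635': 'S', '\u0636': 'D', '\u0637': 'T', '\u0638': 'Z',
    '\u0639': 'E', '\u063A': 'g', '\u0641': 'f', '\u0642': 'q',
    '\u0643': 'k', '\u0644': 'l', '\u0645': 'm', '\u0646': 'n',
    '\u0647': 'h', '\u0648': 'w', '\u0649': 'Y', '\u064A': 'y',
}

def transliterate(text: str) -> str:
    """One-to-one mapping of Arabic letters to ASCII; characters not in
    the table (spaces, digits, punctuation) pass through unchanged."""
    return ''.join(BUCKWALTER.get(ch, ch) for ch in text)
```

A one-to-one, fully reversible mapping like this is what makes the later detransliteration step (Figure 4-2) possible.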

Figures 4-1 and 4-2 show sample sentences in different stages of the tagging cycle.

على هامش أعمال المنتدى المتوسطي للتنمية والذي عقد في القاهرة خلال آذار الجاري نظم المركز المصري للدراسات الاقتصادية ورشة عمل حول ضعف الموارد البشرية والتدريب وتفضيل الدول العربية للمنتج الأجنبي وأهم معوقات التنافسية للشركات في المنطقة. وقد ناقشت هذه الحلقة التطورات المتلاحقة في الاقتصاد العالمي والتي أصبحت تفرض تحديات على الشركات.

(a) A sentence from the corpus

ElY ham$ >Emal almntdY almtwsTy lltnmyp wal*y Eqd fy alqahrp xlal |*ar aljary nZm almrkz almSry lldrasat alaqtSadyp wr$p Eml Hwl DEf almward alb$ryp waltdryb wtfDyl aldwl alErbyp llmntj al>jnby w>hm mEwqat altnafsyp ll$rkat fy almnTqp . wqd naq$t h*h alHlqp altTwrat almtlaHqp fy alaqtSad alEalmy walty >SbHt tfrD tHdyat ElY al$rkat .

(b) A transliteration of the sentence in (a) in the Brill format

Figure 4-1

4.2 Tagset
The tagset used in this work is a modified version of the tagset designed by Khoja,
fully described in [16] and redescribed in Section 3.1.1. The work of Khoja is highly
esteemed, being the first comprehensive work in designing a tagset for Arabic, which
encompasses the richness and complexity of the language. Nevertheless, it has some

على/PPr هامش/NCSgMGI أعمال/NCPlbMGI المنتدى/NCSgMGD المتوسطي/NASgMGD للتنمية/PPr_NCSgFGD والذي/PC_NPrRSSgM عقد/VPSg3M في/PPr القاهرة/RP خلال/PA آذار/Rmoy الجاري/NASgMGD نظم/VPSg3M المركز/NCSgMND المصري/NASgMND للدراسات/PPr_NCPlfGD الاقتصادية/NASgFGD ورشة/NCSgFAI عمل/NCSgMGI حول/PA ضعف/NCSgMGI الموارد/NCPlbMGD البشرية/NASgFGD والتدريب/PC_NCSgMGD وتفضيل/PC_NCSgMGI الدول/NCPlbMGD العربية/NASgFGD للمنتج/PPr_NCSgMGD الأجنبي/NASgMGD وأهم/PC_NASgMGI معوقات/NCPlfGI التنافسية/NCSgFGD للشركات/PPr_NCPlfGD في/PPr المنطقة/NCSgFGD ./punc

Figure 4-2: Part of the sentence in Figure 4-1 after tagging
and detransliteration
limitations and mistakes, some of which are treated in this work, and others may be a
task of future work. Modifications considered here include nouns, verbs, and particles.

4.2.1 Nouns:

For nouns the following was done:
a- Avoiding distinctions between foreign names and Arabic names. Instead all
names, whether Arabic or foreign, are given the same tag NP (for proper noun).
The tag RF (residual foreign) is kept to refer only to words of foreign languages
written in Arabic characters. In the original tagset, the tag (RF) is given to all
foreign names and words (see Figure 1-1, and compare the tag given to the
foreign proper names there with that given to foreign words).
b- Using different tags for the different plural forms; hence the indication of
plural nouns is given by the subtags PlbM, PlbF, Plm, Plf for broken masculine
plural, broken feminine plural, sound masculine plural, and sound feminine plural
respectively, instead of just PlM and PlF for plural masculine and plural
feminine respectively. The table below (Figure 4-3) gives examples of this.
Notice that in our set the gender is not repeated with sound plurals, since it is
included implicitly in the plural form.

word        Original tag   New tag
الموظفون     NCPlMND        NCPlmND
العاملين     NCPlMGD        NCPlmGD
الشبكات      NCPlFND        NCPlfND
المدارس      NCPlFND        NCPlbFND
البنوك       NCPlMGD        NCPlbMGD

Figure 4-3: Tags of Plurals

The last two characters of each tag are irrelevant here and are given only for
completeness. Including this information is useful when the resulting tagged
corpus is used for morphological studies.
c- Introducing some new tags.
d- Introducing another general category in addition to common nouns (NC) and
adjectives (NA), namely title nouns (NT), like: المدير، وزير، أمين، السفير، المهندس، الرئيس، الملك.
This would increase the tagset drastically, since each of these nouns can be
singular or plural, masculine or feminine, definite or indefinite, and can take
any of the three cases. But it would help in many cases to discover unknown
proper nouns, which usually follow these titles.

4.2.2 Verbs:

For verbs the modifications include:
Using distinct tags for defective verbs (الأفعال الناقصة) to capture the effect
they have on the case of the following nouns. Therefore each verb tag is marked
by a small "d" following the first two characters if the verb is defective, as
in Figure (4-4).

word      Original tag   New tag
ذهب       VPSg3M         VPSg3M
يذهب      VISg3MI        VISg3MI
كانت      VPSg3F         VPdSg3F
يصبحون    VIPl3MI        VIdPl3MI

Figure 4-4: Tags of defective verbs

4.2.3 Particles:

For particles, the modifications include:
Introducing a few tags to refine the tagging of some particles, and to make
room for some particles unconsidered in the original tagset; namely: Pcr, Pdt,
Pst, PQ, LM, and LN, for tagging قد التحقيقية (qad of certainty), قد التشكيكية
(qad of doubt), (أنَّ، إنَّ), أدوات الاستفهام (interrogative particles), لم, and لن,
respectively. All these tags are added to help pick up some information about
the following words.
Although these tags do contribute to refining the tagset, there is still a lot
to be done with particles, since the available tags do not cover the wide range
of meanings for particles in Arabic. For example, the prefix particle ف is now
given the tag PC (for conjunctional particle), whereas it is not always so; it
sometimes has different meanings, especially when affixed to verbs (فاء السببية).
The same goes for و.
All particles that do not belong clearly to any of the available tags are given
the general tag PA (for adverbial particle), regardless of the fact that some of
them are not really adverbial, so the meaning of this tag should not be taken
literally.

Making more distinctions is left for future work after studying more deeply the
need for such refinement.
It should be kept in mind, also, that the corpus we dealt with is not stemmed.
So the tagging is done with composite tags, which introduces a new set of tags
for composite words. For example, the word بالعرض is tagged as
PPr_NCSgMGD, which is a completely different tag from either PPr or
NCSgMGD, thus leading to a drastic theoretical increase in the tagset. Contrary to
what was expected, this fact did not cause many problems with the tagging
accuracy, due to the fact that the Brill tagger is powerful in dealing with prefixes
and suffixes, and that composite words comprise only a small portion of an Arabic
text (estimated to be less than 6% according to the data we worked on).
4.3 The program
The same Lex-based program, which was used for initial tagging of the very
first corpus, is now used as a start state tagger for both learner and tagger of the
Brill system.
In the original system, initial tagging is done by a very simple routine, which
assigns to every word either the tag (NN) for common nouns or, if the word
starts with a capital letter, (NP) for proper nouns. This start state suffices
for English and similarly simple cases, but for Arabic we preferred another type
of start state tagger, in which each unknown word is checked for its syntactic
structure and assigned an initial tag accordingly. For this purpose the Lex-based
routine was used, after considerable trouble getting it to work, especially since
it has to be interfaced to both the lexical learner (written in Perl) and the
tagger (written in C).
The start state routine is an important factor in getting accurate results,
especially for unknown words, and the better it is designed to take care of word
structures, the better the achieved results. At present, the routine handles many
morphological structures and relies on statistical information, gathered while
tagging manually, to assign the most probable tag to words that do not match any
of the captured patterns.
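The idea of the start state tagger can be sketched roughly as follows. The actual routine is a Lex program interfaced to the Perl learner and the C tagger; the Python sketch below only illustrates assigning an initial tag from surface structure, using a few of the cues from Table 4-1 (the default tag and the small pattern set are simplifying assumptions, not the full routine).

```python
import re

# A few surface patterns over Buckwalter-transliterated words, loosely
# following the lexical cues of Table 4-1; the real Lex routine covers
# many more morphological structures.
PATTERNS = [
    (re.compile(r'^wal.'),    'PC_NCSgMGD'),   # wa+al...: conjunction + definite noun
    (re.compile(r'^ll.'),     'PPr_NCSgMGD'),  # li+al...: preposition + definite noun
    (re.compile(r'^al.'),     'NCSgMGD'),      # al...: definite noun
    (re.compile(r'^y.'),      'VISg3MI'),      # y- prefix: imperfect verb
    (re.compile(r'.at$'),     'NCPlfGD'),      # -at suffix: sound feminine plural
    (re.compile(r'^[0-9]+$'), 'Rnu'),          # digits: numeral
]

def initial_tag(word: str, default: str = 'NCSgMGI') -> str:
    """Return the first matching structural tag, else a default assumed
    here to be the statistically most probable tag (NCSgMGI)."""
    for pattern, tag in PATTERNS:
        if pattern.match(word):
            return tag
    return default
```

The ordering of the patterns matters: more specific prefixes ("wal", "ll") must be tested before the bare "al" case.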
4.4 Rules

In this section we give a list of the resulting rules and explain how they are
interpreted, and the actual lexical and contextual information derived from them.
It is worth mentioning here that the obtained rules are based on majority tests
and not on absolute truth. In other words, it is not necessary that each rule
applies to all situations in any MSA text; rather, it applies to most similar
situations. As an example, consider rule number 9 in Table 4-3, which states:
NP NCSgMGI PREV1OR2TAG PPr
meaning that if a word has a preposition as one of its two preceding words,
then that word should be tagged as a common noun and not as a proper noun.
This rule is derived because, in the training corpus, it turned out that applying
it would enhance the accuracy of the tagger by minimizing the discrepancy
between the starting corpus and the truth corpus. But that does not mean that the
rule has no exceptions. It is easy to think of many exceptions to this rule or to
any other rule; what counts is the overall effect of applying the rule.
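The effect of such a contextual transform can be sketched as follows; this is a simplified Python reconstruction of the Brill PREV1OR2TAG template, not the tagger's actual C implementation.

```python
def apply_prev1or2tag(tagged, from_tag, to_tag, prev_tag):
    """Brill contextual transform PREV1OR2TAG: retag a word from from_tag
    to to_tag when one of its two preceding words carries prev_tag.
    `tagged` is a list of (word, tag) pairs; a new list is returned."""
    out = list(tagged)
    for i, (word, tag) in enumerate(tagged):
        if tag == from_tag and any(t == prev_tag
                                   for _, t in tagged[max(0, i - 2):i]):
            out[i] = (word, to_tag)
    return out

# Rule 9 of Table 4-3, "NP NCSgMGI PREV1OR2TAG PPr": after a preposition,
# prefer the common-noun reading over the proper-noun reading.
# (ham$ "margin" is hypothetically mistagged NP here for illustration.)
sentence = [('fy', 'PPr'), ('ham$', 'NP'), ('>Emal', 'NCPlbMGI')]
retagged = apply_prev1or2tag(sentence, 'NP', 'NCSgMGI', 'PPr')
# retagged[1] is now ('ham$', 'NCSgMGI')
```

Note that the condition is tested against the tags as they stood before the transform, so the rule's effect does not cascade within a single pass.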

4.4.1 Lexical Rules
Table 4-1 shows a list of lexical rules, together with the meaning of each rule
and its interpretation in the context of Arabic morphology. Table 4-2 lists a
group of rules that may be considered misleading: although they may enhance the
tagging of the training corpus, they will surely have negative effects on test
and real-life corpora.
4.4.2 Contextual Rules
Table 4-3 shows a list of contextual rules, together with the meaning of each
rule, and its interpretation in the context of Arabic morphology and syntax.
4.5 Testing
Many tests are performed to check the efficiency of the system:

 In the first group of tests, the truth corpus is divided into three portions
of similar sizes, then the cross-validation method is used three times for each
type of tagset, as explained below. In each of the three tests, two portions of
the corpus (about 25,000 words) are used in learning and the third (about 13,000
words) for evaluation, and the average accuracy over the three tests is taken as
the overall measure of the system's accuracy. This is performed on three types
of corpora: one tagged with the original tagset (Tagset1) as introduced by
Khoja, the second tagged with a modified tagset thereof (Tagset2), as explained
in Section 4.2, and the third tagged with the modified set excluding grammar
features (Tagset3). These three tagsets are defined in Section 3.4. The results
of these tests are summarized in Tables 5-1, 5-2, and 5-4, respectively.

 To test the effect of enlarging the corpus size on accuracy, another group of
corpora is prepared. Since we do not have a large reference corpus to work on,
we had to reduce the size of the test corpora to enlarge the training corpora.
So we chose the size of the learning corpora to be about 31,000 words each, i.e.
about five sixths of the size of the complete corpus, and the test corpus is the
remainder, whose size is over 6,000 words. Three tests were performed this way,
changing the test corpus each time and taking the average. The results of these
tests are summarized in Table 5-3.
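The bookkeeping behind these tests amounts to the following computation; the fold accuracies used below are those later reported in Table 5-1, and the tagger runs themselves are of course not reproduced here.

```python
def tagging_accuracy(truth_tags, system_tags):
    """Percentage of tokens whose system tag matches the truth tag."""
    assert len(truth_tags) == len(system_tags)
    hits = sum(t == s for t, s in zip(truth_tags, system_tags))
    return 100.0 * hits / len(truth_tags)

# Three-fold cross validation: the overall measure is the average of the
# per-fold accuracies (the fold values below are those of Table 5-1).
folds = [73.60, 72.07, 75.05]
overall = sum(folds) / len(folds)   # about 73.57
```

Averaging over folds rather than pooling tokens is what the tables in Section 5.1 report, which is why the corpus had to be split into three portions of similar sizes.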

1. "al haspref 2 NASgFGD": if a word has a prefix of two letters, "al", tag it as NASgFGD. ("al" is a sign of definiteness.)
2. "at hassuf 2 NCPlfGD": if a word has a suffix of two letters, "at", tag it as NCPlfGD. ("at" is an ending of the feminine plural.)
3. "NCSgMGI p fchar NCSgFGI": if a word tagged NCSgMGI contains the character "p", tag it as NCSgFGI. ("p" (التاء المربوطة) is a sign of femininity.)
4. "y haspref 1 VISg3MI": if a word has a prefix of one letter, "y", tag it as VISg3MI. ("y" (الياء) is a prefix of the imperfect verb.)
5. "NCSgMGI l fhaspref 1 PPr_NCSgMGI": if a word tagged NCSgMGI has a prefix of one letter, "l", tag it as PPr_NCSgMGI. ("l" (اللام) at the beginning of a word is a preposition.)
6. "NCSgMGI a fhassuf 1 NCSgMAI": if a word tagged NCSgMGI has a suffix of one letter, "a", tag it as NCSgMAI. (An "a" ending is a sign of the accusative case.)
7. "NASgFGD p faddsuf 1 NASgMGD": if it is possible to add "p" to a word tagged NASgFGD, tag it as NASgMGD. (A word cannot have two "p" (تاء مربوطة).)
8. "NCSgMGI w fhaspref 1 PC_NCSgMGI": if a word tagged NCSgMGI starts with "w", tag it as PC_NCSgMGI. ("w" is a conjunctional particle.)
9. "wal haspref 3 PC_NCSgMGD": any word starting with "wal" should be tagged PC_NCSgMGD. ("wal" (والـ) is a conjunctional particle followed by "al" for definiteness.)
10. "ll haspref 2 PPr_NCSgMGD": any word starting with "ll" should be tagged PPr_NCSgMGD. ("ll" (للـ) is a preposition followed by "al" for definiteness.)
11. "NCSgMGI t fhassuf 1 VPSg3F": if a word tagged NCSgMGI ends with "t", tag it as VPSg3F. ("t" (ت) is a suffix of the past tense verb, third person singular feminine.)
12. "b deletepref 1 PPr_NCSgMGI": if removing the letter "b" from a word gives a word in the lexicon, tag the original word as PPr_NCSgMGI. (An attached "b" is a preposition.)
13. "0 char Rnu": a word containing the character "0" is a number. (Numeric.)
14. "NCPlfGD al faddpref 2 NCPlfGI": if a word tagged NCPlfGD accepts the added prefix "al", tag it as NCPlfGI. (One cannot add "al" to an already definite word.)
15. "PC_NCSgMGI STAART fgoodright PC_VPSg3M": if a word at the beginning of a sentence is tagged PC_NCSgMGI, tag it as PC_VPSg3M. (A sentence cannot start with the genitive case.)

Table 4-1: a list of lexical rules
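The lexical predicates of Table 4-1 (haspref, hassuf, fchar) can be sketched as simple string tests over the transliterated word. This is an illustrative reconstruction of the Brill lexical-rule templates, not the learner's own Perl/C implementation; `tag_by_rule1` is a hypothetical helper showing how one rule fires.

```python
def haspref(word, n, prefix):
    """haspref n: the word begins with the given n-letter prefix."""
    return len(prefix) == n and word.startswith(prefix)

def hassuf(word, n, suffix):
    """hassuf n: the word ends with the given n-letter suffix."""
    return len(suffix) == n and word.endswith(suffix)

def fchar(word, ch):
    """fchar: the word contains the given character."""
    return ch in word

def tag_by_rule1(word, current=None):
    """Rule 1 of Table 4-1, 'al haspref 2 NASgFGD': an "al" prefix
    marks definiteness; otherwise keep the current tag."""
    if haspref(word, 2, 'al'):
        return 'NASgFGD'
    return current
```

The f-prefixed variants (fhaspref, fhassuf, fchar) differ only in that they are conditioned on the word's current tag, as in rules 3, 5, 6, and 11.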
1. "NASgFGD d fhassuf 1 NCSgMGD": if a word tagged NASgFGD ends with "d", tag it as NCSgMGD.
2. "NCSgMGD al> fhaspref 3 NCPlbMGD": if a word tagged NCSgMGD starts with "al>", tag it as NCPlbMGD.
3. "NCSgMGI n fhassuf 1 NP": if a word tagged NCSgMGI ends with "n", tag it as NP.
4. "NCSgMGI_NPrPSg3F <lY fgoodleft PPr_NPrPSg3F": if a word tagged NCSgMGI_NPrPSg3F is followed by "<lY" (إلى), tag it as PPr_NPrPSg3F.

Comment (all rules): these correlations hold only by chance in the training corpus.

Table 4-2: Examples of misleading lexical rules

1. "NCSgFAI NCSgFGI PREV1OR2TAG PPr": change NCSgFAI to NCSgFGI if one of the two previous words is tagged PPr. (What follows a preposition is genitive.)
2. "NCSgFGD NCSgFND PREV1OR2TAG VPSg3F": change NCSgFGD to NCSgFND if one of the two previous words is tagged VPSg3F. (The subject of a verb is nominative.)
3. "Pst PA NEXTTAG VISg3MI": change Pst to PA if the next word is tagged VISg3MI. (Before an imperfect verb it is أنْ, not the stress particle أنَّ.)
4. "VISg3MI VISg3MS PREVWD >n": change VISg3MI to VISg3MS if the previous word is ">n". (أنْ puts the imperfect verb in the subjunctive.)
5. "NCSgMGD NCSgMND PREV1OR2TAG STAART": change NCSgMGD to NCSgMND if the word is one of the two starting words of the sentence. (A definite topic (مبتدأ) is nominative, not genitive.)
6. "NCSgMGD NASgMGD PREVTAG NCSgMGD": change NCSgMGD to NASgMGD if the previous word is tagged NCSgMGD. (What follows a definite noun is a definite adjective.)
7. "NCSgMGI NCSgMNI PREV1OR2TAG STAART": change NCSgMGI to NCSgMNI if the word is one of the two starting words of the sentence. (An indefinite topic (مبتدأ) is nominative.)
8. "NASgFGD NCSgFGD PREVTAG PC_NCSgMGI": change NASgFGD to NCSgFGD if the previous word is tagged PC_NCSgMGI. (Distinguishing the مضاف إليه from the adjective.)
9. "NP NCSgMGI PREV1OR2TAG PPr": change NP to NCSgMGI if one of the two previous words is tagged PPr. (A preposition usually precedes an ordinary noun, not a proper noun.)
10. "NCSgFGI NCSgFNI PREVTAG VPSg3F": change NCSgFGI to NCSgFNI if the previous word is tagged VPSg3F. (The subject of a verb is nominative.)
11. "NASgFGD NCSgFGD PREVTAG NCSgMGI": change NASgFGD to NCSgFGD if the previous word is tagged NCSgMGI. (Distinguishing the مضاف إليه from the adjective.)
12. "NASgFGI NCSgFGI PREVTAG PPr": change NASgFGI to NCSgFGI if the previous word is tagged PPr. (What follows a preposition is a noun, not an adjective.)
13. "PA_VISg3FI NNuCaSgFAI CURWD stp": if the current word is "stp", change its tag from PA_VISg3FI to NNuCaSgFAI. (An exception to the lexical rule "a word beginning with st is an imperfect verb with the future prefix s": stp (ستة) is a numeral.)

Table 4-3: a list of contextual rules
Chapter five
Results and discussion

5.1 Results
Below are the results of the performed tests. Each table illustrates a group of
related tests using the cross-validation method. Table 5-1 gives the results for
the original tagset, Table 5-2 for the modified tagset, Table 5-3 for the
modified tagset using enlarged versions of the training corpora, and Table 5-4
for the modified tagset with the case (grammar) information removed.

Test      Training       Test           No. lexical   No. context   Tagging
          size (words)   size (words)   rules         rules         accuracy (%)
Test1     23834          13662          153           134           73.60
Test2     25372          12124          149           137           72.07
Test3     25786          11710          150           161           75.05
Average   -              -              -             -             73.57

Table 5-1: Accuracy for the original tagset

Test      Training       Test           No. lexical   No. context   Tagging
          size (words)   size (words)   rules         rules         accuracy (%)
Test4     23834          13662          120           151           74.34
Test5     25372          12124          143           158           72.13
Test6     25786          11710          150           135           75.69
Average   -              -              -             -             74.05

Table 5-2: Accuracy for the complete modified tagset
Test      Training       Test           No. lexical   No. context   Tagging
          size (words)   size (words)   rules         rules         accuracy (%)
Test7     31422          6261           174           190           75.72
Test8     31467          6216           176           162           75.39
Test9     31634          6049           167           148           77.16
Average   -              -              -             -             76.09

Table 5-3: Accuracy for the complete modified tagset
with enlarged training corpora

Test      Training       Test           No. lexical   No. context   Tagging
          size (words)   size (words)   rules         rules         accuracy (%)
Test10    23834          13662          151           83            83.89
Test11    25372          12124          148           116           82.64
Test12    25786          11710          145           106           85.10
Average   -              -              -             -             83.87

Table 5-4: Accuracy for the Ungrammatized modified tagset

5.2 Examples of errors in tagging
A sample of errors was taken from the error report file, consisting of 38
consecutive lines of the original text, chosen at random, for the Grammatized
tagset. This sample contains 1079 words, 280 of which are tagged erroneously.
The errors are categorized into fifteen types, as in Table 5-5 below, and the
occurrences of each type are counted in the sample to get an idea of the
relative frequency of each error type. Table 5-6 lists the erroneously tagged
words of this sample with, for each word, its truth tag, its erroneous tag, and
the type of error. Table 5-7 then summarizes the errors, their counts in the
sample, and their percentages in descending order.
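The percentages of Table 5-7 follow directly from the error-type counts in the sample; the arithmetic is sketched below using the counts as reported.

```python
from collections import Counter

# Error-type counts from the 280-error sample (as reported in Table 5-7).
counts = Counter({7: 119, 2: 30, 3: 29, 9: 27, 6: 24, 8: 15, 12: 15,
                  4: 9, 1: 7, 10: 2, 5: 1, 11: 1, 13: 1})
total = sum(counts.values())                               # 280 errors
percent = {t: round(100 * c / total, 2) for t, c in counts.items()}
# e.g. error type 7 (grammatical error): 42.5% of all errors
```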

Error type   Meaning
1            Interpreting a title as a common noun
2            Mistagging a broken plural
3            Interchanging an adjective and a common noun
4            Interchanging definite and indefinite
5            Interchanging sound plural and single
6            Interchanging verb with noun
7            Grammatical error
8            Error in composite tag
9            Lexicon entry missing
10           Interchanging dual with sound masculine plural
11           Typing mistake
12           Interchanging adverbial particle with stress particle
13           Error in gender
14           Taking a common noun for a proper noun
15           Interchanging doubt particle and certainty particle

Table 5-5: types of errors

Word            Truth tag              System tag             Type    Comments
almdyryn        NTPlmGD                NCPlmGD                1
w>SHab          PC_NCPlbMGI            PC_NCSgMGI             2
r&yp            NCSgFGI                NASgFGI                3
alm&vrat        NCPlfGI                NCPlfGD                4
m$rwEathm       NCPlfGI_NPrPPl3M       NCSgMGI_NPrPPl3M       5
wDE             NCSgMGI                VPSg3M                 6
almdyryn        NTPlmGD                NCPlmGD                1
alastratyjyp    NASgFGD                NCSgFGD                3
$rwT            NCPlbMGI               NCPlbMAI               7
wtnmyp          PC_NCSgFGI             PC_NCSgFAI             7
mharat          NCPlfGI                NCPlfAI                7
>salyb          NCPlbMGI               NCPlbMNI               7
wastratyjyat    PC_NCPlfGD             PC_NCPlfGI             4
w>hmyth         PC_NCSgFGI_NPrPSg3M    NCSgFGI_NPrPSg3M       8
almdaxl         NCSgMGD                NCPlbMND               7,2
al>rbEa'        RD                     NCPlbMGD               9       lexicon
lmqablat        PPr_NCplfGI            PPr_NCPlfGI            11      mistype
astxdamha       NCSgMNI_NPrPSgF        NCSgMNI_NPrPSg3F       11      mistype
dafws           RP                     NCSgMGI                9       lexicon
bswysra         PPr_RP                 NCSgMAI                9       lexicon
wsykwn          PC_PA_VIdSg3MI         PC_NCSgMGI             8
almtHdvyn       NADuMAD                NCPlmGD                10      Du-Plm
alr}ysyyn       NCDuAD                 NCPlmGD                10
ryma            NP                     NCSgMAI                9       lexicon
Na}b            NTSgMGI                NTSgMNI                7
wzyr            NTSgMGI                NTSgMNI                7
wEql            PC_NP                  PC_NCSgMGI             9       lexicon
mtHdvyn         NAPlmAI                NCSgMGI                7,2
w>hm            PC_NASgMGI             PC_NCSgMAI             3,7
myna            RF                     NCSgMAI                9       lexicon
>n              PA                     Pst                    12
w>n             PC_PA                  PC_Pst                 12
mst$ar          NTSgMHNI               NTSgMGI                11      mistype
waDHp           NASgFGI                NCSgFNI                7,3
tktml           VISg3MI                VISg3FI                13      gender
aldktwrp        NTSgFNI                NASgFGD                1,7
>Hyana          NAPlbMAI               NCSgMAI                2,3
>bwabha         NCSgMGI_NPrPSg3F       NP                     8       starts with >bw

Table 5-6: A sample of errors in the Grammatized tests

Error type   Count   %
7            119     42.5
2            30      10.71
3            29      10.36
9            27      9.64
6            24      8.57
8            15      5.36
12           15      5.36
4            9       3.21
1            7       2.50
10           2       0.71
5            1       0.36
11           1       0.36
13           1       0.36
Total        280     100

Table 5-7: Percentage error for each error type in the Grammatized tests

Word         Truth tag            System tag           Type
wbmwazap     PC_PPr_NCSgFI        PC_NCSgFI            8
qryp         NCSgFI               NASgFD               3,4
mTar         NCSgMI               NCPlbMI              2
53           NCSgMI               Rnu                  9
kmHwr        PPr_NCSgMI           NCSgMI               8
bSnaEp       PPr_NCSgFI           PPr_NCSgMD           4,13
stqwm        PA_VISg3F            PA_VISg3M            13
Tyran        NCSgMI               NP                   14
kEaml        PPr_NCSgMI           NASgMI               8
rqmyn        NCDuMI               NCSgMI               10
wtzyd        PC_VISg3F            PC_PA                8
>rbaHha      NCPlbMI_NPrPSg3F     NCPlbFI_NPrPSg3F     13
bmtxSSyn     PPr_NCPlMI           PPr_NCSgMI           5
mkantha      NCSgFI_NPrPSg3F      NCPlfI_NPrPSg3F      5
kmrkz        PPr_NCSgMI           NCSgMI               8
wttkaml      PC_VISg3F            PC_NCSgMI            6
bd>          VPSg3M               PPr_NCSgMI           9
wtm          PC_VPSg3M            PC_NCSgMI            6
tkml         VISg3F               NCSgMI               6
wtEzz        PC_VISg3F            PC_NCSgMI            6
qryp         NCSgFI               NCSgFD               4
ykml         VISg3F               VISg3M               13
sahm         VPSg3M               NCSgMI               6
t$ark        VISg3F               VPSg3M               13
alxarTp      NCSgFD               NCSgFI               4
vany         NNuORSgMI            NP                   9
mltqY        NCSgMI               NASgMI               3
alTa}rat     NCPlfI               NCPlfD               11
wqTE         PC_NCPlbMI           PC_NCSgMI            2
bal>mm       PPr_NCPlbFD          PPr_NCPlbMD          13
mEZm         NASgMI               NCSgMI               3
qd           Pdt                  Pcr                  15
tDr          VISg3F               NASgMI               6
wtDEf        PC_VISg3F            PC_NCSgMI            6
<mdadat      NCPlfI               NCPlfD               4

Table 5-8: A sample of errors in the Ungrammatized tests

5.3 Discussion
Most of the errors can be categorized as follows:
1. Errors in the case of the word are the most frequent.
2. Unknown proper nouns (of people and places) cannot be guessed. Only a few
rules may lead to recognizing a proper noun.
3. Distinguishing sound masculine plural from dual nouns is not easy for
unknown nouns in the genitive and accusative cases.
4. Some forms of the broken plural are intermixed with other forms of names,
and are not always easily distinguished since the processed text is not
vocalized.

These notes can be drawn from Table 5-7, where it is easily noticed that
grammar contributes the highest portion of the errors (almost half of them).
Then comes the broken plural problem, which accounts for about 10% of the
errors; then the distinction between adjectives and nouns, also close to 10%;
after that, the problem of proper names (names of people, cities, countries,
etc.), which takes almost 10%; then the problem of past tense verbs taken for
nouns, about 9%. Composite tags and adverbial particles contribute about 5%
each, and the remaining error types make an insignificant contribution to the
overall error percentage.
Each of the error types contributing heavily to the overall percentage is
justified and expected, although the order and exact rates were not expected to
turn out as they did. We think the following factors were the leading ones:
1. The grammatical errors are partially due to the fact that some of the tags do
not reflect the case of the word, and hence it is hard for the learner to deduce
why the following word is given its tag; examples of that are proper nouns,
relative pronouns (أسماء الموصول), and demonstrative pronouns (أسماء الإشارة).
Giving case information to these tags is expected to help solve this problem,
but it would drastically increase the already large tagset, a task we preferred
to avoid at present but which is a proper consideration for future work. It is
worth mentioning that most of the words erroneously tagged for this reason are
otherwise correctly tagged (i.e. the information about category, number, gender,
and definiteness is correct).
2. The size of the corpus affected the accuracy of the results; in fact, the
error rate was aggravated by two other factors: first, the corpus had to be
split into three portions to perform cross-validation; second, the Brill tagger
splits the training corpus again into two halves, one used to derive lexical
rules and the other to derive contextual rules. So, starting with a corpus of
about 38,000 words, each test is done with 25,000 words for training and 13,000
words for evaluation, and the training part is divided into two parts of about
12,500 words each, for lexical and contextual learning respectively. Had we had
a ready corpus to work with, matters would be different, and we are confident we
would get better results. This is shown by the three separate cross-validation
experiments in which the training corpora were enlarged by about 6,000 words
each, leading to about a 2% increase in the accuracy of the system, as shown in
Table 5-3; a value which does not look very great, but which at least indicates
the trend.
3. Lack of vocalization also makes it hard to distinguish between some forms of
past tense verbs, and between them and some nouns. In this case the accuracy of
tagging relies primarily on the statistical information captured in the lexicon
for known words, and on context for unknown words. But it should be remembered
that lack of vocalization is not in itself a disadvantage of the corpus; rather,
it is an advantage for the following reasons:

a. The input text to the tagger is rarely expected to be vocalized, since
vocalization is not common in most MSA writings.
b. Vocalization puts an extra burden on the user of the system.
c. Getting good results despite the absence of vocalization is a credit to the
system, and a sign of overcoming the problem of ambiguity without relying on
the user to disambiguate words through vocalization.
5.4 Evaluation
Compared with other reported results, the results we obtained may look low;
for example, Diab et al. [10] reported an accuracy of 95.4%, and Khoja [17]
reported 90% disambiguation accuracy. But studying the mentioned works, we
notice that the first dealt with a very small tagset (24 tags) based on an
English tagset, while the second did not specify precisely the size of the
tagset; rather, she talked confusingly about three different levels of tagging
with tagsets of 5, 35, and 131 tags, and said she used the smaller tagset for
initial tagging. This means that the tagset she used contains a maximum of 35
tags. Consulting her website [37], however, one concludes that the tagging is
done using the 5-tag set. The other problem with her results is that she reports
that "the statistical tagger achieved an accuracy of around 90% when
disambiguating ambiguous words" [17], but checking the statistics she offers, we
find that ambiguous words comprise a maximum of 3% of the test corpora, and we
do not know the performance accuracy for the rest of the corpora.
So, taking into consideration the large and rich tagset we worked with, and the
unavailability of a standard truth corpus tagged with the same tagset, we think
the results obtained here are very promising, and are the best obtained for such
a tagset.

5.5 Accomplishments
In this work we achieved the following:

 Revised the Khoja tagset to satisfy our needs and remove some of its
limitations. It was expected that this revision would lead to some drawback in
the accuracy of the tagger, and we were willing to accept that; but, gladly
enough, the accuracy of the new system turned out to be slightly higher.

 Prepared a manually tagged corpus of moderate size, which has the advantage
of being tagged with a rich and comprehensive tagset that we consider the best
available for Arabic, and which we recommend as the basis for a standard Arabic
morphosyntactic tagset. The size of the corpus we tagged is about 38,000 words,
far exceeding the only POS-tagged Arabic corpus we know of, a 1,700-word corpus
prepared by Ms. Khoja. In fact we prepared several versions of this corpus:
o One tagged with the original tagset.
o A second tagged with a modified tagset thereof.
o A third tagged with the modified tagset but excluding syntactic (grammar)
features.
o All of the above are available in both Arabic characters and transliterated
form.

 Adapted the Brill transformation-rule tagger to work with the above corpora,
producing the first complete tagger for Arabic, which gave, we believe, a very
promising accuracy of 75-84% depending on the tagset used.

 Prepared, in parallel with the corpus, a tagged lexicon for Arabic, which
would help researchers in NLP tasks for Arabic.

 
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES ijnlc
 
CBAS: CONTEXT BASED ARABIC STEMMER
CBAS: CONTEXT BASED ARABIC STEMMERCBAS: CONTEXT BASED ARABIC STEMMER
CBAS: CONTEXT BASED ARABIC STEMMERijnlc
 
Ijartes v1-i1-002
Ijartes v1-i1-002Ijartes v1-i1-002
Ijartes v1-i1-002IJARTES
 

Was ist angesagt? (20)

Comparative study of Text-to-Speech Synthesis for Indian Languages by using S...
Comparative study of Text-to-Speech Synthesis for Indian Languages by using S...Comparative study of Text-to-Speech Synthesis for Indian Languages by using S...
Comparative study of Text-to-Speech Synthesis for Indian Languages by using S...
 
Real-time DirectTranslation System for Sinhala and Tamil Languages.
Real-time DirectTranslation System for Sinhala and Tamil Languages.Real-time DirectTranslation System for Sinhala and Tamil Languages.
Real-time DirectTranslation System for Sinhala and Tamil Languages.
 
Tamil Morphological Analysis
Tamil Morphological AnalysisTamil Morphological Analysis
Tamil Morphological Analysis
 
Parts of speech tagger
Parts of speech taggerParts of speech tagger
Parts of speech tagger
 
NLP_KASHK:Text Normalization
NLP_KASHK:Text NormalizationNLP_KASHK:Text Normalization
NLP_KASHK:Text Normalization
 
Tamil-English Document Translation Using Statistical Machine Translation Appr...
Tamil-English Document Translation Using Statistical Machine Translation Appr...Tamil-English Document Translation Using Statistical Machine Translation Appr...
Tamil-English Document Translation Using Statistical Machine Translation Appr...
 
Statistical machine translation
Statistical machine translationStatistical machine translation
Statistical machine translation
 
Arabic MT Project
Arabic MT ProjectArabic MT Project
Arabic MT Project
 
Machine Translation System: Chhattisgarhi to Hindi
Machine Translation System: Chhattisgarhi to HindiMachine Translation System: Chhattisgarhi to Hindi
Machine Translation System: Chhattisgarhi to Hindi
 
RULE BASED TRANSLITERATION SCHEME FOR ENGLISH TO PUNJABI
RULE BASED TRANSLITERATION SCHEME FOR ENGLISH TO PUNJABIRULE BASED TRANSLITERATION SCHEME FOR ENGLISH TO PUNJABI
RULE BASED TRANSLITERATION SCHEME FOR ENGLISH TO PUNJABI
 
HANDLING CHALLENGES IN RULE BASED MACHINE TRANSLATION FROM MARATHI TO ENGLISH
HANDLING CHALLENGES IN RULE BASED MACHINE TRANSLATION FROM MARATHI TO ENGLISHHANDLING CHALLENGES IN RULE BASED MACHINE TRANSLATION FROM MARATHI TO ENGLISH
HANDLING CHALLENGES IN RULE BASED MACHINE TRANSLATION FROM MARATHI TO ENGLISH
 
Duration for Classification and Regression Treefor Marathi Textto- Speech Syn...
Duration for Classification and Regression Treefor Marathi Textto- Speech Syn...Duration for Classification and Regression Treefor Marathi Textto- Speech Syn...
Duration for Classification and Regression Treefor Marathi Textto- Speech Syn...
 
Enriching Transliteration Lexicon Using Automatic Transliteration Extraction
Enriching Transliteration Lexicon Using Automatic Transliteration ExtractionEnriching Transliteration Lexicon Using Automatic Transliteration Extraction
Enriching Transliteration Lexicon Using Automatic Transliteration Extraction
 
NLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language ModelNLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language Model
 
Pxc3898474
Pxc3898474Pxc3898474
Pxc3898474
 
An implementation of apertium based assamese morphological analyzer
An implementation of apertium based assamese morphological analyzerAn implementation of apertium based assamese morphological analyzer
An implementation of apertium based assamese morphological analyzer
 
8. Qun Liu (DCU) Hybrid Solutions for Translation
8. Qun Liu (DCU) Hybrid Solutions for Translation8. Qun Liu (DCU) Hybrid Solutions for Translation
8. Qun Liu (DCU) Hybrid Solutions for Translation
 
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES
 
CBAS: CONTEXT BASED ARABIC STEMMER
CBAS: CONTEXT BASED ARABIC STEMMERCBAS: CONTEXT BASED ARABIC STEMMER
CBAS: CONTEXT BASED ARABIC STEMMER
 
Ijartes v1-i1-002
Ijartes v1-i1-002Ijartes v1-i1-002
Ijartes v1-i1-002
 

Andere mochten auch

Corpus linguistics the basics
Corpus linguistics the basicsCorpus linguistics the basics
Corpus linguistics the basicsJorge Baptista
 
Trade agreements in arab countries
Trade agreements in arab countriesTrade agreements in arab countries
Trade agreements in arab countriesMahmoud Fath-Allah
 
Parallel computing on internet
Parallel computing on internetParallel computing on internet
Parallel computing on internetMohamed Boudchiche
 
Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...Guy De Pauw
 
CLIN 2012: DutchSemCor Building a semantically annotated corpus for Dutch
CLIN 2012: DutchSemCor  Building a semantically annotated corpus for DutchCLIN 2012: DutchSemCor  Building a semantically annotated corpus for Dutch
CLIN 2012: DutchSemCor Building a semantically annotated corpus for DutchRubén Izquierdo Beviá
 
POS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik LanguagePOS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik LanguageGuy De Pauw
 
Ai in education
Ai in educationAi in education
Ai in educationarteimi
 
essentials of arabic grammar by brid. zahoor
essentials of arabic grammar by brid. zahooressentials of arabic grammar by brid. zahoor
essentials of arabic grammar by brid. zahoorsamadash
 
Les outils de veille sur internet
Les outils de veille sur internetLes outils de veille sur internet
Les outils de veille sur internetAref Jdey
 
The Use of Corpus Linguistics in Lexicography
The Use of Corpus Linguistics in LexicographyThe Use of Corpus Linguistics in Lexicography
The Use of Corpus Linguistics in LexicographyIhsan Ibadurrahman
 
Pmbok 5édition change 2013
Pmbok 5édition change   2013Pmbok 5édition change   2013
Pmbok 5édition change 2013Marc Bonnemains
 
LEXICOGRAPHY
LEXICOGRAPHY LEXICOGRAPHY
LEXICOGRAPHY mimisy
 
Les 4 phases du management de projet
Les 4 phases du management de projetLes 4 phases du management de projet
Les 4 phases du management de projetAntonin GAUNAND
 
Management de Projet: piloter, animer, conduire des projets
Management de Projet: piloter, animer, conduire des projetsManagement de Projet: piloter, animer, conduire des projets
Management de Projet: piloter, animer, conduire des projetsPascal Méance
 

Andere mochten auch (20)

Discourse annotation for arabic 3
Discourse annotation for arabic 3Discourse annotation for arabic 3
Discourse annotation for arabic 3
 
Corpus linguistics the basics
Corpus linguistics the basicsCorpus linguistics the basics
Corpus linguistics the basics
 
Trade agreements in arab countries
Trade agreements in arab countriesTrade agreements in arab countries
Trade agreements in arab countries
 
Carré Magique Cpp
Carré Magique CppCarré Magique Cpp
Carré Magique Cpp
 
DNS
DNSDNS
DNS
 
Parallel computing on internet
Parallel computing on internetParallel computing on internet
Parallel computing on internet
 
Hackers
HackersHackers
Hackers
 
Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...
 
CLIN 2012: DutchSemCor Building a semantically annotated corpus for Dutch
CLIN 2012: DutchSemCor  Building a semantically annotated corpus for DutchCLIN 2012: DutchSemCor  Building a semantically annotated corpus for Dutch
CLIN 2012: DutchSemCor Building a semantically annotated corpus for Dutch
 
POS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik LanguagePOS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik Language
 
Ai in education
Ai in educationAi in education
Ai in education
 
essentials of arabic grammar by brid. zahoor
essentials of arabic grammar by brid. zahooressentials of arabic grammar by brid. zahoor
essentials of arabic grammar by brid. zahoor
 
Les outils de veille sur internet
Les outils de veille sur internetLes outils de veille sur internet
Les outils de veille sur internet
 
The Use of Corpus Linguistics in Lexicography
The Use of Corpus Linguistics in LexicographyThe Use of Corpus Linguistics in Lexicography
The Use of Corpus Linguistics in Lexicography
 
Pmbok 5édition change 2013
Pmbok 5édition change   2013Pmbok 5édition change   2013
Pmbok 5édition change 2013
 
LEXICOGRAPHY
LEXICOGRAPHY LEXICOGRAPHY
LEXICOGRAPHY
 
Les 4 phases du management de projet
Les 4 phases du management de projetLes 4 phases du management de projet
Les 4 phases du management de projet
 
Management de Projet: piloter, animer, conduire des projets
Management de Projet: piloter, animer, conduire des projetsManagement de Projet: piloter, animer, conduire des projets
Management de Projet: piloter, animer, conduire des projets
 
PMbok les nouveautés de la 5ème édition
PMbok les nouveautés de la 5ème éditionPMbok les nouveautés de la 5ème édition
PMbok les nouveautés de la 5ème édition
 
Introduction à MATLAB
Introduction à MATLABIntroduction à MATLAB
Introduction à MATLAB
 

Ähnlich wie part of speech tagger for ARABIC TEXT

Mechanising_Programs_in_IsabelleHOL
Mechanising_Programs_in_IsabelleHOLMechanising_Programs_in_IsabelleHOL
Mechanising_Programs_in_IsabelleHOLAnkit Verma
 
Advancements in Hindi-English Neural Machine Translation: Leveraging LSTM wit...
Advancements in Hindi-English Neural Machine Translation: Leveraging LSTM wit...Advancements in Hindi-English Neural Machine Translation: Leveraging LSTM wit...
Advancements in Hindi-English Neural Machine Translation: Leveraging LSTM wit...IRJET Journal
 
Arules_TM_Rpart_Markdown
Arules_TM_Rpart_MarkdownArules_TM_Rpart_Markdown
Arules_TM_Rpart_MarkdownAdrian Cuyugan
 
Part-of-Speech Tagging for Bengali Thesis submitted to Indian ...
Part-of-Speech Tagging for Bengali Thesis submitted to Indian ...Part-of-Speech Tagging for Bengali Thesis submitted to Indian ...
Part-of-Speech Tagging for Bengali Thesis submitted to Indian ...butest
 
Melkamu_Tilahun_Oct_2017_Final_Thesis.pdf
Melkamu_Tilahun_Oct_2017_Final_Thesis.pdfMelkamu_Tilahun_Oct_2017_Final_Thesis.pdf
Melkamu_Tilahun_Oct_2017_Final_Thesis.pdfeshetuTesfa
 
Error Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation OutputsError Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation OutputsParisa Niksefat
 
S.N.Sivanandam & S.N. Deepa - Introduction to Genetic Algorithms 2008 ISBN 35...
S.N.Sivanandam & S.N. Deepa - Introduction to Genetic Algorithms 2008 ISBN 35...S.N.Sivanandam & S.N. Deepa - Introduction to Genetic Algorithms 2008 ISBN 35...
S.N.Sivanandam & S.N. Deepa - Introduction to Genetic Algorithms 2008 ISBN 35...edwinray3
 
IRJET- Vernacular Language Spell Checker & Autocorrection
IRJET- Vernacular Language Spell Checker & AutocorrectionIRJET- Vernacular Language Spell Checker & Autocorrection
IRJET- Vernacular Language Spell Checker & AutocorrectionIRJET Journal
 
Cohesive Software Design
Cohesive Software DesignCohesive Software Design
Cohesive Software Designijtsrd
 
Generation of strings in language for given Regular Expression and printing i...
Generation of strings in language for given Regular Expression and printing i...Generation of strings in language for given Regular Expression and printing i...
Generation of strings in language for given Regular Expression and printing i...IRJET Journal
 
A Novel Method for An Intelligent Based Voice Meeting System Using Machine Le...
A Novel Method for An Intelligent Based Voice Meeting System Using Machine Le...A Novel Method for An Intelligent Based Voice Meeting System Using Machine Le...
A Novel Method for An Intelligent Based Voice Meeting System Using Machine Le...IRJET Journal
 
Quality estimation of machine translation outputs through stemming
Quality estimation of machine translation outputs through stemmingQuality estimation of machine translation outputs through stemming
Quality estimation of machine translation outputs through stemmingijcsa
 
Ozlem istek thesis
Ozlem istek thesisOzlem istek thesis
Ozlem istek thesissorinakader
 
Chinese Word Segmentation in MSR-NLP
Chinese Word Segmentation in MSR-NLPChinese Word Segmentation in MSR-NLP
Chinese Word Segmentation in MSR-NLPAndi Wu
 
Machine Transalation.pdf
Machine Transalation.pdfMachine Transalation.pdf
Machine Transalation.pdfAmir Abdalla
 
Automation of IT Ticket Automation using NLP and Deep Learning
Automation of IT Ticket Automation using NLP and Deep LearningAutomation of IT Ticket Automation using NLP and Deep Learning
Automation of IT Ticket Automation using NLP and Deep LearningPranov Mishra
 
Laboratory manual
Laboratory manualLaboratory manual
Laboratory manualAsif Rana
 
IRJET- Survey on Grammar Checking and Correction using Deep Learning for Indi...
IRJET- Survey on Grammar Checking and Correction using Deep Learning for Indi...IRJET- Survey on Grammar Checking and Correction using Deep Learning for Indi...
IRJET- Survey on Grammar Checking and Correction using Deep Learning for Indi...IRJET Journal
 

Ähnlich wie part of speech tagger for ARABIC TEXT (20)

Mechanising_Programs_in_IsabelleHOL
Mechanising_Programs_in_IsabelleHOLMechanising_Programs_in_IsabelleHOL
Mechanising_Programs_in_IsabelleHOL
 
Advancements in Hindi-English Neural Machine Translation: Leveraging LSTM wit...
Advancements in Hindi-English Neural Machine Translation: Leveraging LSTM wit...Advancements in Hindi-English Neural Machine Translation: Leveraging LSTM wit...
Advancements in Hindi-English Neural Machine Translation: Leveraging LSTM wit...
 
Arules_TM_Rpart_Markdown
Arules_TM_Rpart_MarkdownArules_TM_Rpart_Markdown
Arules_TM_Rpart_Markdown
 
Part-of-Speech Tagging for Bengali Thesis submitted to Indian ...
Part-of-Speech Tagging for Bengali Thesis submitted to Indian ...Part-of-Speech Tagging for Bengali Thesis submitted to Indian ...
Part-of-Speech Tagging for Bengali Thesis submitted to Indian ...
 
Melkamu_Tilahun_Oct_2017_Final_Thesis.pdf
Melkamu_Tilahun_Oct_2017_Final_Thesis.pdfMelkamu_Tilahun_Oct_2017_Final_Thesis.pdf
Melkamu_Tilahun_Oct_2017_Final_Thesis.pdf
 
dmo-phd-thesis
dmo-phd-thesisdmo-phd-thesis
dmo-phd-thesis
 
Natural Language Processing using Java
Natural Language Processing using JavaNatural Language Processing using Java
Natural Language Processing using Java
 
Error Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation OutputsError Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation Outputs
 
S.N.Sivanandam & S.N. Deepa - Introduction to Genetic Algorithms 2008 ISBN 35...
S.N.Sivanandam & S.N. Deepa - Introduction to Genetic Algorithms 2008 ISBN 35...S.N.Sivanandam & S.N. Deepa - Introduction to Genetic Algorithms 2008 ISBN 35...
S.N.Sivanandam & S.N. Deepa - Introduction to Genetic Algorithms 2008 ISBN 35...
 
IRJET- Vernacular Language Spell Checker & Autocorrection
IRJET- Vernacular Language Spell Checker & AutocorrectionIRJET- Vernacular Language Spell Checker & Autocorrection
IRJET- Vernacular Language Spell Checker & Autocorrection
 
Cohesive Software Design
Cohesive Software DesignCohesive Software Design
Cohesive Software Design
 
Generation of strings in language for given Regular Expression and printing i...
Generation of strings in language for given Regular Expression and printing i...Generation of strings in language for given Regular Expression and printing i...
Generation of strings in language for given Regular Expression and printing i...
 
A Novel Method for An Intelligent Based Voice Meeting System Using Machine Le...
A Novel Method for An Intelligent Based Voice Meeting System Using Machine Le...A Novel Method for An Intelligent Based Voice Meeting System Using Machine Le...
A Novel Method for An Intelligent Based Voice Meeting System Using Machine Le...
 
Quality estimation of machine translation outputs through stemming
Quality estimation of machine translation outputs through stemmingQuality estimation of machine translation outputs through stemming
Quality estimation of machine translation outputs through stemming
 
Ozlem istek thesis
Ozlem istek thesisOzlem istek thesis
Ozlem istek thesis
 
Chinese Word Segmentation in MSR-NLP
Chinese Word Segmentation in MSR-NLPChinese Word Segmentation in MSR-NLP
Chinese Word Segmentation in MSR-NLP
 
Machine Transalation.pdf
Machine Transalation.pdfMachine Transalation.pdf
Machine Transalation.pdf
 
Automation of IT Ticket Automation using NLP and Deep Learning
Automation of IT Ticket Automation using NLP and Deep LearningAutomation of IT Ticket Automation using NLP and Deep Learning
Automation of IT Ticket Automation using NLP and Deep Learning
 
Laboratory manual
Laboratory manualLaboratory manual
Laboratory manual
 
IRJET- Survey on Grammar Checking and Correction using Deep Learning for Indi...
IRJET- Survey on Grammar Checking and Correction using Deep Learning for Indi...IRJET- Survey on Grammar Checking and Correction using Deep Learning for Indi...
IRJET- Survey on Grammar Checking and Correction using Deep Learning for Indi...
 

Mehr von arteimi

Nils nilson
Nils nilsonNils nilson
Nils nilsonarteimi
 
تحديات تطبيق تقنيات انطمة الدفع الالكتروني في ليبيا
تحديات تطبيق تقنيات انطمة الدفع الالكتروني في ليبياتحديات تطبيق تقنيات انطمة الدفع الالكتروني في ليبيا
تحديات تطبيق تقنيات انطمة الدفع الالكتروني في ليبياarteimi
 
My paper for philippine conference
My paper for philippine conferenceMy paper for philippine conference
My paper for philippine conferencearteimi
 
My paper for philippine conference
My paper for philippine conferenceMy paper for philippine conference
My paper for philippine conferencearteimi
 
009 icemi2014 h00014
009 icemi2014 h00014009 icemi2014 h00014
009 icemi2014 h00014arteimi
 
UTILIZING COOPERATIVE LEARNING FOR IT GRADUATE STUDIES
UTILIZING COOPERATIVE LEARNING FOR IT GRADUATE STUDIESUTILIZING COOPERATIVE LEARNING FOR IT GRADUATE STUDIES
UTILIZING COOPERATIVE LEARNING FOR IT GRADUATE STUDIESarteimi
 
rule refinement in inductive knowledge based systems
rule refinement in inductive knowledge based systemsrule refinement in inductive knowledge based systems
rule refinement in inductive knowledge based systemsarteimi
 
UTILIZING learning styles for e-learning
UTILIZING learning styles for e-learningUTILIZING learning styles for e-learning
UTILIZING learning styles for e-learningarteimi
 
أهمية إدارة المعلومات في تخطيط موارد المؤسسة
أهمية إدارة المعلومات في تخطيط موارد المؤسسةأهمية إدارة المعلومات في تخطيط موارد المؤسسة
أهمية إدارة المعلومات في تخطيط موارد المؤسسةarteimi
 
دور الجامعات2
دور الجامعات2دور الجامعات2
دور الجامعات2arteimi
 
Double transform contoor extraction
Double transform contoor extractionDouble transform contoor extraction
Double transform contoor extractionarteimi
 
Utilising learning styles
Utilising learning stylesUtilising learning styles
Utilising learning stylesarteimi
 
Electronic publishing
Electronic publishingElectronic publishing
Electronic publishingarteimi
 
Ai in education2
Ai in education2Ai in education2
Ai in education2arteimi
 
Part of speech tagger
Part of speech taggerPart of speech tagger
Part of speech taggerarteimi
 
Intellectual property and software
Intellectual property and softwareIntellectual property and software
Intellectual property and softwarearteimi
 
Information technology infrastructure and education future
Information technology infrastructure and education futureInformation technology infrastructure and education future
Information technology infrastructure and education futurearteimi
 
Active learning arabic
Active learning arabicActive learning arabic
Active learning arabicarteimi
 
الذكاء الإصطناعي والنظم الخبيرة
الذكاء الإصطناعي والنظم الخبيرةالذكاء الإصطناعي والنظم الخبيرة
الذكاء الإصطناعي والنظم الخبيرةarteimi
 
الجودة في التعليم التقني
الجودة في التعليم التقنيالجودة في التعليم التقني
الجودة في التعليم التقنيarteimi
 

Mehr von arteimi (20)

Nils nilson
Nils nilsonNils nilson
Nils nilson
 
تحديات تطبيق تقنيات انطمة الدفع الالكتروني في ليبيا
تحديات تطبيق تقنيات انطمة الدفع الالكتروني في ليبياتحديات تطبيق تقنيات انطمة الدفع الالكتروني في ليبيا
تحديات تطبيق تقنيات انطمة الدفع الالكتروني في ليبيا
 
My paper for philippine conference
My paper for philippine conferenceMy paper for philippine conference
My paper for philippine conference
 
My paper for philippine conference
My paper for philippine conferenceMy paper for philippine conference
My paper for philippine conference
 
009 icemi2014 h00014
009 icemi2014 h00014009 icemi2014 h00014
009 icemi2014 h00014
 
UTILIZING COOPERATIVE LEARNING FOR IT GRADUATE STUDIES
UTILIZING COOPERATIVE LEARNING FOR IT GRADUATE STUDIESUTILIZING COOPERATIVE LEARNING FOR IT GRADUATE STUDIES
UTILIZING COOPERATIVE LEARNING FOR IT GRADUATE STUDIES
 
rule refinement in inductive knowledge based systems
rule refinement in inductive knowledge based systemsrule refinement in inductive knowledge based systems
rule refinement in inductive knowledge based systems
 
UTILIZING learning styles for e-learning
UTILIZING learning styles for e-learningUTILIZING learning styles for e-learning
UTILIZING learning styles for e-learning
 
أهمية إدارة المعلومات في تخطيط موارد المؤسسة
أهمية إدارة المعلومات في تخطيط موارد المؤسسةأهمية إدارة المعلومات في تخطيط موارد المؤسسة
أهمية إدارة المعلومات في تخطيط موارد المؤسسة
 
دور الجامعات2
دور الجامعات2دور الجامعات2
دور الجامعات2
 
Double transform contoor extraction
Double transform contoor extractionDouble transform contoor extraction
Double transform contoor extraction
 
Utilising learning styles
Utilising learning stylesUtilising learning styles
Utilising learning styles
 
Electronic publishing
Electronic publishingElectronic publishing
Electronic publishing
 
Ai in education2
Ai in education2Ai in education2
Ai in education2
 
Part of speech tagger
Part of speech taggerPart of speech tagger
Part of speech tagger
 
Intellectual property and software
Intellectual property and softwareIntellectual property and software
Intellectual property and software
 
Information technology infrastructure and education future
Information technology infrastructure and education futureInformation technology infrastructure and education future
Information technology infrastructure and education future
 
Active learning arabic
Active learning arabicActive learning arabic
Active learning arabic
 
الذكاء الإصطناعي والنظم الخبيرة
الذكاء الإصطناعي والنظم الخبيرةالذكاء الإصطناعي والنظم الخبيرة
الذكاء الإصطناعي والنظم الخبيرة
 
الجودة في التعليم التقني
الجودة في التعليم التقنيالجودة في التعليم التقني
الجودة في التعليم التقني
 

Kürzlich hochgeladen

Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
part of speech tagger for ARABIC TEXT

  • 1. Academy of Graduate Studies Tripoli - Libya PART OF SPEECH TAGGING OF ARABIC TEXT By Massaoud Abuzed Abolqasem Abuzed March 2006
  • 2. Abstract Part of speech tagging is an important area of research in natural language processing. Although it has been well studied in several Indo-European languages, it is still not very well investigated with respect to Arabic. In this thesis, the Brill tagger and a modified version of the Khoja tagset, along with a corpus prepared for this purpose, are applied to tag Modern Standard Arabic (henceforth MSA) text. The Brill tagger is a well-known public domain part of speech tagger, originally designed for tagging English text by implementing a machine learning approach based on transformation rules. It has been adapted to other languages, such as German and Hungarian, by many researchers. Some modifications had to be made to the learner and tagger, which are written partly in Perl and partly in C, and run under the Unix/Linux operating system. The main change was made to the initial-state tagger, which is used by both the learner and the tagger. A program was written using the lexical analyzer generator Lex to capture Arabic morphological structures, and then interfaced with both the learner and the tagger. The tagset used in this work is a revised version of the one introduced by Khoja. The revision included changing some of the tags for linguistic reasons and introducing new tags to make the set more powerful, or to make up for limitations in the original tagset that hindered the tagging of some words. The corpus was obtained from two Jordanian magazines and had to go through a series of editing steps. A collection of lexical rules and contextual rules is produced by the learning system and applied to Arabic text. The tagging accuracy of the resulting tagged text is measured to be approximately 84% for both known and unknown words. This result may seem low, but considering the complexity of the language, the richness of the tagset, the fact that this is the first work to apply such a tagset to Arabic, and the fact that we had no reference corpus to base our work on, we consider the results very promising.
  • 3. Acknowledgements I would like to express my gratitude to: Associate Professor Mohamed Arteimi, my academic supervisor, who guided me through this research and gave me his valuable advice; the Department of Computer Science in the Academy of Graduate Studies, and personally Dr. Abdussalam Elmusrati, for his encouragement and help; the Academy of Graduate Studies, and Dr. Saleh Ibrahim, for his encouragement and for sponsoring this research through an academic scholarship; and my family and friends for their support and endurance.
  • 4. List of Tables
    Table (4-1)  A list of lexical rules ................................ 40-41
    Table (4-2)  Examples of misleading lexical rules ................... 42
    Table (4-3)  A list of contextual rules ............................. 43-44
    Table (5-1)  Accuracy for the original tagset ....................... 45
    Table (5-2)  Accuracy for the complete modified tagset .............. 45
    Table (5-3)  Accuracy for the complete modified tagset with enlarged training corpora ... 46
    Table (5-4)  Accuracy for the ungrammatized modified tagset ......... 46
    Table (5-5)  Types of errors ........................................ 47
    Table (5-6)  A sample of errors in grammatized tests ................ 47-48
    Table (5-7)  Percentage error for each error type in the grammatized tests ... 48
    Table (5-8)  A sample of errors in ungrammatized tests .............. 48-49
  • 5. List of Figures and Illustrations
    Figure (2-1)     Copy of the manually tagged excerpt sought by Khoja ... 15
    Figure (3-1)     Example of a general classification tagset .......... 18
    Figure (3-2)     Example of a detailed tagset for verbs .............. 19
    Figure (3-3)     The entire Penn Treebank tagset ..................... 19
    Figure (3-4)     Preliminary steps for tagging ....................... 25
    Figure (3-5)     Lexical rule learning ............................... 27
    Figure (3-6)     Context rule learning ............................... 28
    Figure (3-7)     Tagging ............................................. 33
    Figure (4-1)(a)  A sentence from the corpus ......................... 33
    Figure (4-1)(b)  A transliteration of a sentence from the corpus .... 33
    Figure (4-2)     Tagged and detransliterated sentence from the corpus ... 33
    Figure (4-3)     Tags of plurals .................................... 35
    Figure (4-4)     Tags of defected verbs ............................. 36
  • 6. Contents
    Abstract ............................................................ i
    Acknowledgements .................................................... ii
    List of Tables ...................................................... iii
    List of Figures and Illustrations ................................... iv
    Contents ............................................................ v
    Chapter One: Introduction ........................................... 1
      1.1 Background .................................................... 1
      1.2 Part-Of-Speech Tagging Methods ................................ 3
      1.3 Machine learning in POS tagging ............................... 4
        1.3.1 N-gram and Markov models .................................. 4
        1.3.2 Neural Networks ........................................... 5
        1.3.3 Vector-based clustering ................................... 5
        1.3.4 Transformation-Based Learning ............................. 6
      1.4 Aims and objectives ........................................... 6
      1.5 Tools used in this work ....................................... 6
        1.5.1 Corpus .................................................... 6
        1.5.2 Tagset .................................................... 7
        1.5.3 Tagger .................................................... 8
      1.6 Testing strategy .............................................. 9
      1.7 Chapters summary .............................................. 10
    Chapter Two: Literature Review ...................................... 11
      2.1 Corpora in European languages ................................. 11
        2.1.1 General Corpora ........................................... 11
        2.1.2 Historical Corpora ........................................ 12
        2.1.3 Annotated Corpora ......................................... 12
      2.2 Arabic corpora ................................................ 13
      2.4 Arabic taggers ................................................ 16
      2.5 Definition of training and testing texts ...................... 17
    Chapter Three: Design ............................................... 18
      3.1 Tagsets and the adopted Arabic tagset ......................... 18
        3.1.1 Tagsets ................................................... 18
        3.1.2 The adopted tagset ........................................ 20
      3.2 Corpora used for this work .................................... 24
      3.3 The Brill system .............................................. 24
        3.3.1 Learner ................................................... 24
        3.3.2 Tagger .................................................... 29
      3.4 Testing strategies ............................................ 30
    Chapter Four: Implementation and Testing ............................ 31
      4.1 Corpus ........................................................ 31
      4.2 Tagset ........................................................ 34
        4.2.1 Nouns ..................................................... 34
        4.2.2 Verbs ..................................................... 35
        4.2.3 Particles ................................................. 36
      4.3 The program ................................................... 37
      4.4 Rules ......................................................... 37
        4.4.1 Lexical Rules ............................................. 38
  • 7.
        4.4.2 Contextual Rules .......................................... 38
      4.5 Testing ....................................................... 38
    Chapter Five: Results and discussion ................................ 45
      5.1 Results ....................................................... 45
      5.2 Examples of errors in tagging ................................. 46
      5.3 Discussion .................................................... 49
      5.4 Evaluation .................................................... 51
      5.5 Accomplishments ............................................... 52
    Chapter Six: Conclusions and Future work ............................ 53
      6.1 Conclusion .................................................... 53
      6.2 Future work ................................................... 53
    References .......................................................... 54
    APPENDIX A  Sample tagged sentences as compared to the truth corpus . 58
    APPENDIX B  The complete tagset (Tagset2) ........................... 62
    APPENDIX C  The Lex file used for initial state tagger .............. 78
  • 8. Chapter One Introduction 1.1 Background It is very hard, or even impossible, to manually encode all the information about a human language that is needed to build a system that will annotate text with structural descriptions [9]. Such work would require a great deal of information concerning the type of grammar to be used, plus much of the morphological, lexical, and syntactic information about the language itself, all encoded in an algorithmic form that the intended system can handle. This is not an easy task; it would consume a lot of time and would probably require a group of language experts. Even if achieved, the result would be language specific and could not be applied to other languages. For this reason, language processing has recently been tackled with different approaches. One of the fastest-growing approaches uses machine learning techniques. These techniques start with samples of manually annotated text, which must be reviewed very carefully to make sure they represent the truth for the given language. A learning system is then applied to that text to discover the cues for annotating the given words with the given annotations. These cues are converted either into statistical information stating the probabilities of assigning a given annotation to a certain word according to its lexical structure and/or its location in the context, or into a collection of rules stating when and why to assign a given annotation to the word. Afterwards, another system, the tagger, is given raw text to be annotated; it goes through the text and assigns annotations to the words according to the accompanying cues (probability figures or rules). Clearly, the use of rules obtained from a learning system is preferable to the use of probability figures for the following two reasons: 1- Rules are easy to understand and can directly reflect the human understanding of the language.
2- Rules can be manipulated by changing, omitting, or adding rules when doing so would enhance the annotation ability of the system. For these reasons, we have chosen to use a rule-based machine learning system for our work.
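To make the rule-based idea concrete, here is a minimal sketch of applying an ordered list of contextual transformation rules to the output of an initial-state tagger. The tags ("PRON", "NOUN", "MODAL") and the single rule are hypothetical English-style examples, not rules actually learned in this work:

```python
# A minimal sketch of Brill-style contextual rule application.
# The tags and the rule below are hypothetical illustrations.

def apply_contextual_rules(tagged, rules):
    """tagged: list of (word, tag) pairs from an initial-state tagger.
    rules: ordered list of (from_tag, to_tag, required_previous_tag)."""
    tags = [t for _, t in tagged]
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            # Rewrite the tag when the word's current tag and the
            # preceding word's tag both match the rule.
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return [(word, tag) for (word, _), tag in zip(tagged, tags)]

sentence = [("he", "PRON"), ("can", "NOUN"), ("swim", "VERB")]
rules = [("NOUN", "MODAL", "PRON")]  # "change NOUN to MODAL after PRON"
print(apply_contextual_rules(sentence, rules))
```

Because the rules are ordered and human-readable, a linguist can inspect, reorder, or delete them, which is exactly the advantage over opaque probability tables noted above.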
  • 9. Part-of-speech (POS) tagging means taking a text written in a human language and identifying its lexical and/or syntactic structure by assigning to each word/token in the text the correct part of speech, such as noun, verb, adjective, or adverb. Furthermore, in many cases the tags give additional features, such as number (singular/plural), tense, and gender, thus changing the raw (unannotated) text into an annotated, or tagged, corpus. This process of tagging requires a set of tags that classifies words according to their lexical and syntactic functions. This set is referred to as a tagset. Part-of-speech tagging is a foundation of natural language processing (NLP) systems, and thus has been an active area of research for many years [25]. The use of corpora has become an important issue in Language Engineering (LE), the field that deals with all the different ways of handling natural languages computationally. There are many ways to deal with corpora. These include monolingual corpora, annotated to reflect some information about the language's structure, and parallel corpora, i.e. corpora of the same text written in two or more different languages, where at least one of the corpora is annotated, to help annotate the others or to help extract information from them. Both kinds are valuable sources of linguistic metaknowledge, which forms the basis of techniques such as tokenization, POS tagging, and morphological and syntactic analysis, which in turn can be used to develop LE applications [9]. An annotated corpus is a corpus that has had some level of linguistic detail added to the raw data. For example, the Penn Treebank [41] is an annotated corpus, because it contains the linguistic structure and part-of-speech tags for the words in the corpus. A tagged corpus is more useful than an untagged corpus because it carries more information than the raw text alone.
Once a corpus is tagged, it can be used to extract information, which can then be used for creating dictionaries and grammars of languages from real language data. Tagged corpora are also useful for detailed quantitative analysis of text [22]. Other applications of part-of-speech tagging include speech recognition [14], enhancing input methods [6], machine translation [24], and discovering errors in OCR files [20].
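As an illustration of the kind of quantitative analysis a tagged corpus enables, the short sketch below counts tag frequencies; the "word/TAG" token format and the tags themselves are hypothetical, chosen only for the example:

```python
from collections import Counter

# Count tag frequencies in a tagged corpus. The "word/TAG" token
# format and the tags are hypothetical, used only to show the idea.
def tag_frequencies(tagged_text):
    return Counter(token.rsplit("/", 1)[1] for token in tagged_text.split())

sample = "the/DET cat/NOUN sat/VERB on/PREP the/DET mat/NOUN"
print(tag_frequencies(sample))
```

The same few lines, run over a large tagged corpus, give the word-class distributions that dictionary and grammar builders start from.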
  • 10. 1.2 Part-Of-Speech Tagging Methods It has recently become clear that automatically extracting linguistic information from a sample text corpus can be an extremely powerful method for building accurate natural language processing systems [9]. There are several part-of-speech taggers that are widely used for Indo-European languages, all of which are trained, and retrainable, on text corpora. Structural ambiguity can be greatly reduced by adding empirically derived probabilities to grammar rules and by computing statistical measures of lexical association. Word sense disambiguation can, in some cases, be done with high accuracy when all information is derived automatically from corpora. An effort has recently been undertaken to create automated machine translation systems, where the linguistic information needed for translation is extracted automatically from aligned text corpora [22]. These are just some of the recent applications of corpus-based techniques in natural language processing. Along with great research advances, the infrastructure is in place for this line of research to grow even stronger. With on-line corpora, the use of corpus-based natural language processing is growing, producing better performance, and becoming more readily available. There is a worldwide trend to annotate large corpora with linguistic information, including parts of speech. Many techniques have been used to tag English and other European language corpora, such as: 1- Rule-based technique: used by Greene and Rubin in 1970 to tag the Brown corpus. They designed the tagger TAGGIT [13], which used context-frame rules to select the appropriate tag for each word. It achieved an accuracy of 77%. More recently, interest in rule-based taggers has re-emerged with Eric Brill's tagger, which used another type of rules, called transformation rules (Section 3.3), and achieved an accuracy of 97.5%. 2- Hidden Markov models were used in the 1980s to select the appropriate tag.
Examples of such taggers are: i. CLAWS [12], which was developed at Lancaster University and achieved an accuracy of 97%.
  • 11. ii. The Xerox tagger [38], developed by Doug Cutting, which achieved an accuracy of 96%. 3- Hybrid taggers: these use a combination of both statistical and rule-based methods. This approach achieved an accuracy of 98% as reported by Tapanainen and Voutilainen [31], who used both techniques separately and then aligned the output. 1.3 Machine learning in POS tagging Machine learning deals with acquiring knowledge from an environment in a computational manner in order to improve performance. Many factors have contributed over the past couple of decades to the blending of ML and NLP. These factors include the ever-expanding availability of large corpora, more powerful computing resources, and a greater demand for natural language based applications [27]. This has led to the use of many machine learning techniques in natural language processing, and in particular in part-of-speech tagging [34]. Since the method we are using in our work belongs to these techniques, we shall give here a more detailed idea of them. 1.3.1 N-gram and Markov models A Markov model of a sequence of states or symbols (e.g. words or part-of-speech tags) is used to estimate the probability or "likelihood" of a symbol sequence. It can be used for disambiguation, e.g. for choosing the most likely tag for an ambiguous word in a given context, by estimating the probability of every candidate sequence. A Markov model applies the simplifying assumption that the probability of a long sequence or chain of symbols can be estimated in terms of its parts, or n-grams. Hidden Markov Models (HMMs) [18] are a variant of Markov models including two layers of states: a visible layer corresponding to input symbols (e.g. words) and a hidden layer learnt by the system, corresponding to broader categories (e.g. word classes). Markov or n-gram models have been widely used for part-of-speech tagging, following their successful use in tagging the LOB Corpus [19].
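A single bigram disambiguation step of the kind just described can be sketched as follows. All probabilities here are invented for illustration, not derived from any corpus: the idea is simply to pick the tag t that maximizes P(t | previous tag) * P(word | t).

```python
# Toy bigram (Markov) tag choice with made-up probabilities.

transition = {("DET", "NOUN"): 0.6, ("DET", "VERB"): 0.1}      # P(t | prev tag)
emission = {("book", "NOUN"): 0.003, ("book", "VERB"): 0.001}  # P(word | t)

def best_tag(word, prev_tag, candidates):
    # Choose the candidate tag with the highest bigram score.
    return max(candidates,
               key=lambda t: transition.get((prev_tag, t), 0.0)
                             * emission.get((word, t), 0.0))

print(best_tag("book", "DET", ["NOUN", "VERB"]))  # NOUN: 0.0018 > 0.0001
```

A full HMM tagger scores whole tag sequences (e.g. with the Viterbi algorithm) rather than one word at a time, but the scoring of each step has this shape.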
  • 12. 1.3.2 Neural Networks Neural networks (NNs) have been widely explored in Artificial Intelligence, and they have been studied for many years in the hope of achieving human-like performance in many fields. There are many rules used in the learning process of neural networks. The type of learning in a neural network is determined by the manner in which the parameters change. This can happen with or without the intervention of a supervisor; hence, neural network learning is divided into three groups: supervised learning, unsupervised learning, and reinforcement learning. Neural networks typically consist of multiple layers of nodes, where the lowest layer is the input layer, the highest is the output layer, and the layers in between are the hidden layers. Nodes of adjacent layers are connected via weighted links. The weights on these links are manipulated using a special function, so that the given input produces the desired output. When this stage is reached, the weights on the links are recorded, or learnt, as the proper values for the given input to produce the desired output. In part of speech tagging applications, the input consists of all the information the system has about the parts of speech of the current word, i.e. all its possible tags, the tags of a certain number (p) of the preceding words, and the tags of another number (f) of the following words. The output of the network is the appropriate tag of that word in this context, and the weights on the links are adapted accordingly. When the learning process is done, the tagger will have a huge number of weights, along with their tag sequences, to be applied in tagging new texts. 1.3.3 Vector-based clustering This approach uses co-occurrence statistics to construct vectors that represent word classes or meanings by virtue of their direction in multi-dimensional word-collocation space.
For example, Atwell [4] annotated each word in a sample from the LOB Corpus with a vector of neighboring word types; words with similar vectors were clustered into word classes. One method for calculating semantic word vectors is to use random labeling of words in narrow context windows to calculate semantic context vectors for each
  • 13. word type in the text data. Incorporating linguistic information in the context vectors can enhance the results. 1.3.4 Transformation-Based Learning Brill has developed a symbolic machine learning method called Transformation-Based Learning (TBL) [7,8,9]. Given a tagged training corpus, Transformation-Based Learning produces a sequence of rules that serves as a model of the training data. To derive the appropriate tags, each rule is applied to each instance in an untagged corpus in a specific order. TBL relies heavily on a large annotated training corpus and on reasonable default heuristics to get things started. It learns rules that are clearly coupled to human understanding of a natural language, and allows rules to be easily acquired for different domains or genres. There is a gap between an initial semantic network generated from input data and a semantic network representing profound knowledge, from which a knowledge database can be constructed. By using transformation rules, the semantic analysis method is based on pattern matching with a semantic network. A transformation rule description language allows users to manipulate their knowledge base and to define rules. 1.4 Aims and objectives The main purpose of this research work is to produce a system that can correctly tag Arabic words with high accuracy, utilizing a set of available tools after modifying them to suit our purposes. These tools are the corpus, the tagset, and the tagger. 1.5 Tools used in this work 1.5.1 Corpus Most research on tagging for other languages has had pretagged standard corpora available for training and for testing system performance. But for Arabic the case is different: no standard corpora are available. This doubles the burden on anyone who wants to work on this subject; instead of concentrating on the tagger, one has to shift
part of one's attention to preparing a large enough truth corpus tagged with the chosen tagset, a task which is tedious and time-consuming. The lack of an easily available standard tagged Arabic corpus was the motivation for this work. At the beginning of this study, the researcher intended to work on morphological analysis of Arabic by machine learning, but on reviewing the literature he discovered that no dependable tagged corpus was available, which is one of the basic requirements for such a study. He found that most researchers in the field complain of this problem. So he decided to start from scratch and work in the direction of providing such a corpus. For this purpose, the researcher started with a raw corpus and made some revisions and a series of automatic taggings and manual corrections until the study reached satisfactory results. Because of time limitations, the size of the corpus reached is moderate and not as large as one would wish. The corpus used for this study is derived from a raw corpus whose data are articles from two Jordanian journals, Aldustur and Aldustur Aleqtesady, but it had to go through extensive preprocessing, which will be explained in detail in Chapter Four. 1.5.2 Tagset We adapted the Khoja detailed tagset, a morphosyntactic tagset that is very rich and comprehensive for Arabic, and hence hard to deal with, whether manually or automatically. The original tagset consists of 177 tags, and this number is heavily increased by the fact that we do not use a stemmer in the tagging system, so another group of composite tags is introduced to make up for composite words. These tags can be composites of two, three, or even four basic tags. This tagset was revised, introducing new tags and refining some original tags. That included distinguishing between plural forms (beneficial for morphological studies) and recognizing defected verbs (beneficial for syntactic studies).
This modification raised the number of basic tags to 319. The complete new tagset is shown in Appendix B. Another subset of the resulting tagset is introduced by removing case information, thus gaining two advantages: decreasing the size of the tagset and, more importantly, getting rid of some complexity, leading to better accuracy, as will be seen in Chapter Five.
  • 15. Another set of tests was performed on the original tagset as well, where we noticed that very little gain in accuracy was achieved by modifying the tagset. But it should be kept in mind that the main purpose of modifying the tagset was not better accuracy; rather, it was clarity of tags and richer features for some of them. In fact, we expected to lose some accuracy for this reason, and we were willing to sacrifice it. 1.5.3 Tagger The tagger used for this study is the Brill tagger, which will be introduced in detail in Section 3.2, a tagger based on the transformation rule method. This tagger was originally designed for tagging English text and has been adapted by many researchers to other languages, such as Hungarian [23] and German [28,33]. The reasons for choosing this tagger are: 1. The source code is available, and written mostly in a common language (C), which makes modification possible. 2. It is based on transformation rules, which makes it possible to adapt it to other languages. 3. The use of transformation rules also makes it easy to understand the underlying reasons behind choosing certain tags (see Section 4.4), and easy to modify the rules and/or omit some of them if needed. This is in contrast to statistical taggers (Section 2.2), where information is converted into a huge set of numbers representing the probabilities of choosing a specific tag for each word. A lot of work had to be done to adapt the tagger to our purposes, including: 1. A manually tagged Arabic corpus had to be prepared, since we had to start from scratch. This corpus was then enlarged in several steps. 2. Since the original system is written for Unix, and makes use of some of its facilities, we first attempted to convert it to the DOS environment, which is more common to us and in our academic environment. A lot of work was done in this direction, but many problems were encountered.
The latest and hardest of these was the fact that Turbo C under DOS did not deal with extended RAM
  • 16. explicitly, as is the case for C under Unix. So at last we decided to switch to Unix, a task that also faced many obstacles at the beginning but worked out smoothly in the end. However, we still have the ambition, even after the completion of this project, to switch back and produce a working DOS/Windows version. 3. The original code mixed the use of C in most of its parts with Perl in some others, especially the lexical learner, which we had to work on. Perl was a new language for the researcher, so some work had to be done in this direction: first learning as much of Perl as possible and needed, and then using that to make an efficient change to the learner, so that it makes use of the program generated by Lex for the lexical analysis of the corpus. The problem that took most of our time and effort here was the fact that exactly the same changes had to be made to both the learner, written in Perl, and the tagger, written in C. 1.6 Testing strategy Testing was done using the method of cross-validation. Because of the unavailability of a standard reference (truth) corpus, we had to be satisfied with a rather small corpus for this purpose. The corpus we prepared for learning was divided into three parts, and three tests were performed, each of which uses two thirds of the whole corpus for training and the other third for testing, rotating the parts every time and then taking the average of the results. At this stage we used a total corpus of 38,000 words, so every test involves about 25,000 words for training and 13,000 words for testing. This whole experiment was done three times: once for the original tagset, once for the modified tagset, and once for the ungrammatized tagset. That means three sets of corpora and three learning/tagging systems, each using the appropriate tagset, were prepared. The rather small size of the corpus is justified by the lack of a standard tagged corpus.
This is the best we could reach with our available time and effort, and we think we achieved very promising results that can be enhanced by many improvements, including enlarging the learning corpus. This work is probably the first real step in the direction of having a standard Arabic corpus tagged with a rich and comprehensive tagset, not forgetting the contribution of Khoja, who provided the baseline for our work.
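The three-fold scheme described above (split the corpus into thirds, train on two parts, test on the third, rotate, and average) can be sketched as follows. The `train_and_score` callback is a hypothetical stand-in for the real learn/tag/evaluate cycle:

```python
# Sketch of three-fold cross-validation as used in this work:
# train on two thirds, test on the remaining third, average the scores.
# `train_and_score(train, test)` is a hypothetical stand-in for the
# actual Brill learn/tag/evaluate pipeline.

def three_fold_accuracy(sentences, train_and_score):
    n = len(sentences)
    folds = [sentences[:n // 3],
             sentences[n // 3:2 * n // 3],
             sentences[2 * n // 3:]]
    scores = []
    for i in range(3):
        test = folds[i]
        # Training data is everything outside the held-out fold.
        train = [s for j in range(3) if j != i for s in folds[j]]
        scores.append(train_and_score(train, test))
    return sum(scores) / 3.0
```

With a 38,000-word corpus this yields roughly 25,000 training words and 13,000 test words per run, matching the figures reported above.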
  • 17. 1.7 Chapters summary Chapter Two gives a literature review of tagging and discusses taggers and different tagging strategies, concentrating on the efforts devoted to Arabic, in terms of the three parts of a tagging system: corpora, tagsets, and taggers. Chapter Three discusses the original tools chosen for this work, namely the Khoja tagset and the Brill tagger, giving a detailed idea of their form and the way they are designed; it then describes the strategy used for testing. Chapter Four explains our contribution in modifying the tagset, preparing the corpus for the work, and adapting the tagger to fit our needs. Chapter Five gives the tests and results of our experiments: first the average accuracies of each of the three tests performed, then a discussion of the types of errors encountered, their causes, and suggested solutions. Chapter Six gives the conclusions of the work and suggests future expansions.
  • 18. Chapter Two Literature Review 2.1 Corpora in European languages In European languages and some other languages, there are many famous, standard corpora available to researchers, either to be used for extracting information of interest to their fields of study or as references for testing their tagging strategies. Below are just a few examples of such corpora: 2.1.1 General Corpora  The Brown Corpus, a corpus of written American English, and the corresponding British corpus, the Lancaster-Oslo/Bergen (LOB) corpus [19], a corpus of written British English. The Brown corpus was compiled in the 60's, while its British counterpart was compiled in the 70's. Both consist of around one million tokens (i.e. words, counted every time they appear). The Brown corpus was used in seminal linguistic and psycholinguistic research involving word frequency, and continues to be used today. It comes as raw text, tagged, and parsed.  BNC: The British National Corpus (BNC) [40,42] is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written. Because of its large size and its sampling of both written and spoken language, the BNC is very good for research involving lexical frequency: words with very low frequency are more likely to occur in a 100 million word corpus than in a 1 million word corpus.  The Amsterdam Corpus (AC): This corpus [30] was compiled in the beginning of the 1980s by a group of scholars directed by Anthonij Dees and resulted in the Atlas des formes linguistiques des textes littéraires de l'ancien français. The electronic version of the AC was provided by Piet van Reenen (Free University of Amsterdam). It contains about 200 different
• 19. texts, some of them in several manuscripts, which adds up to a total of 289 texts and close to three million word forms. These forms have been manually annotated with 225 numeric tags encoding part-of-speech and other morphological categories (e.g. “566” for verb, future tense, 3rd person, plural). 2.1.2 Historical Corpora  Helsinki Corpus: The Helsinki Corpus of English Texts: Diachronic and Dialectal [39] is a computerized collection of extracts of continuous text. The Corpus contains a diachronic part covering the period from c. 750 to c. 1700 and a dialect part based on transcripts of interviews with speakers of British rural dialects from the 1970's. The aim of the corpus is to promote and facilitate the diachronic and dialectal study of English as well as to offer computerized material to those interested in the development and varieties of language. The uses for such a corpus are fairly obvious: it is used for diachronic research, whether one is interested in lexical frequency, semantics, syntax, etc. This corpus also has a parsed version. 2.1.3 Annotated Corpora  Celex: lexical databases of English, Dutch, and German [40]. This corpus contains ASCII versions of the CELEX databases. It was developed as a joint enterprise of the University of Nijmegen, the Institute for Dutch Lexicology in Leiden, the Max Planck Institute for Psycholinguistics in Nijmegen and the Institute for Perception Research in Eindhoven. This corpus contains detailed information on the orthography, the phonology (phonetic transcriptions, variations in pronunciation, syllable structure, primary stress), the morphology (derivational and compositional structure, inflectional paradigms), the syntax (word class, word class-specific subcategorizations, argument structures), and word frequency (summed word and lemma counts, based on recent and representative text corpora). Thus it is useful for various types of linguistic and psycholinguistic research.  The Penn Treebank. 
The Penn Treebank Project [41] annotates naturally occurring text for linguistic structure. Most notably, they produce skeletal
• 20. parses showing rough syntactic and semantic information - a bank of linguistic trees. They also annotate text with part-of-speech tags, and, for the Switchboard corpus of telephone conversations, dysfluency annotation. The Penn Treebank project has annotated the Switchboard Corpus, the Wall Street Journal Corpus, the Chinese Journal Project, the Brown Corpus, and the Helsinki Corpus (among others). It is very useful for syntactic research, or any research involving the syntactic/semantic relationships between words. Tgrep (tree-grep) is a useful tool for use with this corpus.  There are corpora available for many other languages. Examples include: American English [35], German [29], Hungarian [23,26], Swedish [5, 25], and Hebrew [15]. More information about all these corpora and others can easily be found on the Internet. 2.2 Arabic corpora A number of electronic Arabic text corpora have been compiled [32], but these corpora are raw, which means that their exploration remains problematic. Some analyses conducted on these corpora have involved very limited data. Others have developed proficient word-form analyzers, such as the analyzer by the Xerox European Research Centre, but the question remains whether these analyzers provide an adequate solution for the exploration of Arabic tagged corpora. In order to explore corpora in an efficient and economically reliable way, some preliminary operations ought to be performed [32]. As is generally known, analyzing Arabic corpora is more complex than analyzing corpora of other languages, for three main reasons. In the first place, Arabic is highly polysemic, much more so than, for example, Dutch. In the Dutch language one way to create new words is to join two words together to obtain a compound. These new words are very widespread, but they are also identifiable by a computer in a simple way, i.e. 
by defining a word as a string of characters between two blanks. In Arabic, new meanings for words are often created by extending the older meaning of an existing word. This means that the external morphological form of the word does not change, in spite of the fact that the word carries a new meaning.
• 21. A second element that makes analysis of Arabic more complex than that of other languages is the fact that the language is usually not vocalized, which means that the degree of ambiguity of words as separate units is much greater than in, e.g., English or Dutch. Words in their raw form can belong to different grammatical categories, as seen in the string of characters "ktb". This string stands for the verb "kataba" (to write) as well as for the plural "kutub" (books). This complicates searching for words in a corpus of texts. In the third place, the problem is complicated by the fact that in Arabic a number of prefixes and suffixes are directly attached to the word. This makes searching by computer even more complex. For example, the string of characters ‘fhm’ can stand for the verb "fahima" (understood), but it can also stand for the particle and suffix "fahum" (since they) or for the particle and verb "fahamma" (then he considered). These facts and others are behind the lack of tagged Arabic corpora. One of the researchers in this line [11] noticed: “the frustrating reality was that the NLP experts with experience in dealing with European languages and scripts deemed the problem [of providing tagged corpora and taggers for Arabic] trivial and therefore not worth wasting time on. While the available Arabic language experts had no computer experience and deemed the problem impossible to solve and therefore not worth wasting time on it”. This is true to some extent, but what is certainly true is that there are very few corpora available for the Arabic language. Some large corpora do exist, but unfortunately they are not free. Also, although some of these corpora are marked up with XML or SGML tags, none of them are POS tagged [32, 37]. There are some efforts towards the preparation of a POS-tagged corpus for Arabic, but they are still in their early and testing stages. One of these works is that of Shereen Khoja [16,17]. 
Although her work has some limitations and deficiencies, as will be explained below, it is probably the first step towards building an Arabic POS-tagged corpus. She introduced two tagsets: one is very small, containing only five classes or basic tags (noun, verb, particle, residual, punctuation), and the other is very comprehensive and appropriate for Arabic, containing more detailed tags (e.g. singular, masculine, definite common noun). She used the first tagset to manually tag 50,000 words of Arabic newspaper text. This type of tagging is obviously of little use, but she also tagged 1,700 words with the second tagset [37]. I sent many email messages to Miss Khoja hoping to get a copy of her tagged corpus and benefit from it, but unfortunately I did not receive any response. However, from the small excerpt (see
  • 22. Figure 2-1) she enclosed in her paper [16], it seems that the corpus is not well built, since there are in that short passage many mistagged items. Mistakes include the following (refer to Section 3.2.2 and to Appendix B to get an idea about meanings of tags): 1. Mistagging adjectives as nouns, example: ‫ الشريفين‬is tagged as NCDuMGD, instead of :NADuMGD. There are many instances of such an error. 2. Case information for nouns seems almost random. Example ‫الر ين اا‬ ‫مبناسرة االور ا‬ is tagged as PPr-NCSgFGI NCSgMAD NCSgMND, instead of: PPr-NCSgFGI NCSgMGD NASgMGD, and ‫ أعري االكرااالير‬is tagged as: VPSg3M NCSgMND NCSgMAD, instead of VPSg3M NCSgMND NASgMND 3. tagging single as plural, like: ‫ لبالده‬is tagged as: PPr_NCPlFGI_NPrPSg3M instead of: PPr_NCSgFGI_NPrPSg3M. Figure 2-1: copy of the manually tagged excerpt cited by Khoja 4. Tagging feminine as masculine, like: ‫ عر اأ كر االهاراي‬is tagged as: PPr NCSgFNI NCPlMND, instead of: PPr NASgMGI NCPlFGD. These are just few examples of the mistakes found in the 48-word passage. Note also that some of the words cited in the above examples contain more than one type of mistakes. 22
• 23. It is worth mentioning here that mistakes in manually tagged corpora are very unfavorable, since these corpora are considered to represent the truth and are used as guidelines for learning systems. If they are not carefully built, the whole system is a failure, regardless of how large the reported accuracy may be. We used the same detailed Khoja tagset to tag about 38,000 words, and have three versions of this corpus: one tagged with the original detailed tagset as proposed in [16], the second tagged with a modified version of that tagset, as explained in Section 4.2 and presented in Appendix B, and the third tagged with the modified version with the grammatical information removed. We do not claim perfection, but we think that our work, besides being much larger, is also much more accurate in applying the tagset to real Arabic text. 2.3 Arabic Taggers Very few people have worked towards building a complete tagger for Arabic. The following cases, though not exhaustive, are among the best examples:  Abuliel [1]: in his paper he described some preparatory steps for building an Arabic POS tagger. Rule-based techniques were used for finding phrases, analyzing affixes of the word, and discovering proper nouns. The tagset used in this work is not specified, and no results are reported concerning the overall performance of a tagging system.  Alshalabi et al [3] dealt with vowelized Arabic text and considered recognizing nouns only. This work showed how to discover nouns in the text but does not reach the stage of tagging. The fact that the system is constrained to vowelized text makes it deficient. Although they talked about part-of-speech tagging and gave a survey of taggers, they did not really do any tagging, nor did they give any tagset for this purpose. 
They reported 95.4% accuracy, which is a good performance rate, but we should keep in mind that the system is constrained to completely vowelized words, thereby minimizing ambiguity, and that it is restricted to discovering nouns, which simplifies the classification task.  Maloney and Niv [21] also worked with names only, in their name-recognizing system called TAGARAB.
• 24.  Freeman, from the department of near eastern studies at the University of Michigan [11], reported that he is attempting to adapt the Brill tagger to Arabic. He designed his own tagset for this purpose, started to do some morphological analysis, and explained the hurdles he encountered in that work. According to his paper he did not reach the stage of tagging, so no accuracy rate is reported.  Khoja: the title of her paper [17] may lead one to conclude that she has a complete tagger. That misled us at the beginning of our work, but on carefully studying the paper we concluded that she only did some preliminary work in this direction and is still working on the tagger. This was confirmed by consulting her website [37], where she declares: “As far as I know, a POS tagger has yet to be developed for Arabic, which is why I am developing one myself.” 2.4 Definition of training and testing texts A corpus of over 38,000 words was prepared. Three versions of this corpus are available: one tagged with the original Khoja tagset, the second with a modified tagset as explained in Section 4.1, and the third tagged with a subset of the modified tagset which excludes the grammar information, as explained in Section 4.5. Each of these corpora is divided into three equal portions, and a cross validation is done three times, using a different two thirds of the corpus for training and the remaining third for testing each time. The average of the three tests is taken as the estimated performance accuracy of the tagger. This means that nine tests are done in this way. In addition to these tests, three other tests were performed on the corpus tagged with the complete modified tagset, this time to test the effect of enlarging the corpus size on the accuracy of the tagger. For each of these tests, about five sixths of the corpus are used for training and the remaining sixth for testing. The average is then taken to get an estimate of the overall accuracy.
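The splitting scheme just described can be sketched in a few lines of Python (a minimal illustration; the function names and the toy numbers are ours, standing in for the actual corpus files and the Brill scripts):

```python
def three_fold_splits(items):
    """Divide the corpus into three equal portions and yield one
    (training, testing) pair per fold: two thirds train, one third test."""
    third = len(items) // 3
    portions = [items[i * third:(i + 1) * third] for i in range(3)]
    for held_out in range(3):
        train = [x for i, p in enumerate(portions) if i != held_out for x in p]
        yield train, portions[held_out]

def average_accuracy(per_fold):
    """The estimated performance is the mean of the three test runs."""
    return sum(per_fold) / len(per_fold)

corpus = list(range(39000))  # stand-in for roughly 39,000 tagged words
for train, test in three_fold_splits(corpus):
    assert len(train) == 26000 and len(test) == 13000
print(round(average_accuracy([0.83, 0.85, 0.84]), 2))  # mean of three folds
```

The same helpers cover the second experiment by replacing the three portions with six and holding out one sixth per fold.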
• 25. Chapter Three Design 3.1 Tagsets and the adapted Arabic tagset 3.1.1 Tagsets As mentioned in Section 2.1, tagging requires a set of tags which classify the words according to their lexical and syntactic functions, i.e. a tagset. Tagsets vary in size: some systems use fewer than 20 tags, while others use over 400. The larger the tagset, the more information is carried in each tag. For example, we may have a basic tagset which divides the words into a very small set of classes, as in Figure 3-1 below. We may refine this classification by dividing nouns into singular and plural, verbs into present and past, and so on, as shown in Figure 3-2, which lists the subset of a refined tagset covering the general class verb in English. The classification can be refined further still, as in Figure 3-3, which gives the complete Penn Treebank tagset [41].

Tag  Description      Tag  Description
NN   Noun             JJ   Adjective
NNP  Proper noun      CC   Coord. conjunction
DT   Determiner       CD   Cardinal number
IN   Preposition      PRP  Personal pronoun
VB   Verb             RB   Adverb
-R   Comparative      -S   Superlative
-$   Possessive

Figure 3-1: example of a general classification tagset.
• 26.
Tag  Description         Example
VBP  Base present        take
VB   Infinitive          to take
VBD  Past                took
VBG  Present participle  taking
VBN  Past participle     taken
VBZ  Present 3sg         takes
MD   Modal               can, will

Figure 3-2: example of a detailed tagset for verbs.

Figure 3-3: the entire Penn Treebank tagset
• 27. 3.1.2 The adopted tagset This section describes the tagset adapted for our work. The tagset is based on the Khoja tagset, as mentioned earlier. We introduce the tagset as described by its designer [20]. The modifications that are specific to our work are marked with an asterisk (*), and are discussed in detail in Section 4.2. The original tagset (Tagset1) contains 177 tags: 103 nouns, 57 verbs, 9 particles, 7 residual, and 1 punctuation. We derived two other tagsets: Tagset2, a modified version of Tagset1 containing 319 tags, and Tagset3, a simplified version of Tagset2 which excludes grammatical information, with 189 tags. The complete modified tagset (Tagset2) is given in Appendix B. A full description of each of the tags and examples of Arabic words that take those tags now follows. This description is based on that given by Khoja. The five main categories for words are: 1. N [noun] 2. V [verb] 3. P [particle] 4. R [residual] *5. punc [punctuation] Note that category number 5 is preceded by an asterisk (*). This indicates a modification in the name of the category, or a completely new category (or subcategory), as shall be seen in subsequent examples. The residual category contains foreign words, mathematical formulae and numbers. The punctuation category contains all punctuation symbols, both Arabic and foreign, such as (?, !, ", ،, ؟). The subcategories of noun are: 1.1 C [common] 1.2 P [proper] 1.3 Pr [pronoun] 1.4 Nu [numeral] 1.5 A [adjective] *1.6 T [title] Adjectives are nouns that describe the aspects of an object. Adjectives inherit the properties of nouns, so they take “nunation” when indefinite and can take the definite article when definite. For example, alwld alSgyr “the small boy” contains the adjective Sgyr “small”. This adjective can take the definite article, as in ‘darasa alwaladu alSagyr’ “the small boy studied”, and it can also have “nunation”, as in ‘hasan Sgyr’ “Hassan is small”. 
Examples of these subcategories include:
• 28. • Singular, masculine, accusative, common noun such as ktab “book” in the sentence ‘>x* alwld ktaba’ “the boy took a book”. • Singular, masculine, genitive, common noun such as ktab “book” in the sentence ‘drst mn ktab’ “I studied from a book”. • Singular, feminine, nominative, common noun such as mdrsp “school” in the sentence ‘h*h mdrsp’ “this is a school”. Note here and in subsequent examples that vocalization does not appear in the transliteration, because we do not assume dealing with vocalized text. The subcategories of the pronoun are: 1.3.1 P [personal] 1.3.2 R [relative] 1.3.3 D [demonstrative] The personal pronouns can be detached words such as ‘hw’ “he”, or attached to a word in the form of a clitic. The attached pronouns can be attached to nouns to indicate possession, to verbs as direct objects, or to prepositions, as in fyh “in it”. Some examples of pronouns include: • Third person, singular, masculine, personal pronoun, such as hw “he”. • Singular, feminine, demonstrative pronoun, such as h*h “this”. The subcategories of the relative pronoun are: 1.3.2.1 S [specific] 1.3.2.2 C [common] Examples of relative pronouns include: • Dual, feminine, specific, relative pronoun, such as alltan “who”. • Plural, masculine, specific, relative pronoun, such as al*yn “who”. • Common, relative pronoun, such as ‘mn’ “who”. The subcategories of the numeral are: 1.4.1 Ca [cardinal] 1.4.2 O [ordinal] *1.4.3 Na [numerical adjective] We preferred omitting subcategory 1.4.3 and adding the related tags to normal adjectives. Such adjectives, however, are not very common, and we did not encounter any of them in the corpus we used. Examples of numerals include: • Singular, masculine, nominative, indefinite cardinal number such as ‘>rbEp’ “four”. • Singular, masculine, nominative, definite ordinal number such as ‘alrabE’ “the fourth”.
• 29. The linguistic attributes of nouns, adjectives, and numerals that have been used in this tagset are:
(i) Gender: M [masculine], F [feminine], N [neuter]
(ii) Number: Sg [single], Du [dual], *Plm [masculine sound plural], *Plf [feminine sound plural], *Plb [broken plural]
(iii) Person: 1 [first], 2 [second], 3 [third]
(iv) Case: N [nominative], A [accusative], G [genitive]
(v) Definiteness: D [definite], I [indefinite]
Verbs are categorised into three main parts: 1. P [perfect] 2. I [imperfect] 3. Iv [imperative] The definition of perfect verbs not only includes (i) the equivalent of English past tense verbs (i.e. to describe acts completed in some past time) but also (ii) describes acts which at the moment of speaking have already been completed and remain in a state of completion, (iii) describes a past act that often took place or still takes place (i.e. commentators are agreed (have agreed and still agree)), (iv) describes an act which is just completed at the moment by the very act of speaking it (I sell you this), and (v) describes acts which are so certain to occur that they can be described as having already taken place (mostly used in promises, treaties and so on) [16]. The imperfect does not in itself express any idea of time; it merely indicates a begun, incomplete, or enduring existence either in present, past or future time, while the imperative verbs order or ask for something to be done in the future. Examples of verbs include: • First person, singular, neuter, perfect verb ‘ksrt’ (كسرت) “I broke”. • First person, singular, neuter, indicative, imperfect verb ‘>ksr’ (أكسر) “I break”. • Second person, singular, masculine, imperative verb ‘aksr’ (اكسر) “Break!”. The verbal attributes that have been used in our tagset are:
• 30.
(i) Gender: M [masculine], F [feminine], N [neuter]
(ii) Number: Sg [single], Du [dual], Pl [plural]
(iii) Person: 1 [first], 2 [second], 3 [third]
(iv) Mood: I [indicative], S [subjunctive], J [jussive]
The two most notable verbal attributes that are fundamental to Arabic but do not normally appear in Indo-European tagsets are the dual number and the jussive mood. The subcategories of particle are: 1.1 Pr [prepositions] 1.2 A [adverbial] 1.3 C [conjunctions] 1.4 I [interjections] 1.5 E [exceptions] 1.6 N [negatives] 1.7 A [answers] 1.8 X [explanations] 1.9 S [subordinates] *1.10 dt [dubitive] *1.11 cr [certain] *1.12 Str [stressive] *1.13 LM [lm] *1.14 LN [ln] Examples of particles include: • Prepositions fy (في) “in” • Adverbial particles swf (سوف) “shall” • Conjunctions w (و) “and” • Interjections ya (يا) “O” • Exceptions swY (سوى) “except” • Negatives la (لا) “not” • Answers nEm (نعم) “yes” • Explanations >y (أي) “that is” • Subordinates lw (لو) “if”
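The attribute codes above combine with the category letters into compact tag strings such as NCSgFGI, which appear throughout this thesis. As a rough illustration of how such a tag packs its attributes, here is a hypothetical Python sketch; the decoding function and the assumed field order (category, subcategory, number, gender, case, definiteness) are our own reading of the examples, not part of the thesis software:

```python
# Attribute code tables, taken from the lists above.
NUMBER = {"Sg": "single", "Du": "dual", "Plm": "masc. sound plural",
          "Plf": "fem. sound plural", "Plb": "broken plural"}
GENDER = {"M": "masculine", "F": "feminine", "N": "neuter"}
CASE = {"N": "nominative", "A": "accusative", "G": "genitive"}
DEFIN = {"D": "definite", "I": "indefinite"}

def decode_common_noun(tag):
    """Decode a common-noun tag of the assumed form NC<Number><Gender><Case><Def>."""
    assert tag.startswith("NC")
    rest = tag[2:]
    for num in NUMBER:  # the number code is two or three letters long
        if rest.startswith(num):
            g, c, d = rest[len(num)], rest[len(num) + 1], rest[len(num) + 2]
            return (NUMBER[num], GENDER[g], CASE[c], DEFIN[d])
    raise ValueError(tag)

print(decode_common_noun("NCSgFGI"))
# ('single', 'feminine', 'genitive', 'indefinite')
```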
• 31. 3.2 Corpora used for this work Early in our work, we were faced with the unavailability of corpora for MSA text. Even the ones that we read about in some of the previous works were not easily available, besides not fitting our needs well. We contacted some of the researchers, but only a few of them responded to our requests and questions. One of these responses provided a raw corpus of excerpts from two Jordanian magazines, containing about 160,000 words. For the sake of saving time, we preferred working on this corpus rather than creating our own, in spite of the fact that the corpus needed some processing before it could be used in our experiments. These excerpts were provided as a Microsoft document in Arabic characters, which had to undergo a series of preparatory steps to be ready for use in our tagging task, as will be explained in detail in Section 4.1. 3.3 The Brill system The Brill system is divided into two separate parts: the learner and the tagger. In the following subsections we explain the way each of these two programs works. 3.3.1 Learner Before the process of learning starts, the truth corpus undergoes a series of preliminary operations to prepare a set of files that are necessary for learning. These operations are sketched in Figure 3-4 and explained in more detail in Section 4.3. Transformation-based error-driven learning, as shown in Figures 3-5 and 3-6, works as follows: First, unannotated text is passed through an initial-state annotator. Various initial state annotators, representing different levels of complexity, have been used, including: the output of a stochastic n-gram tagger; labeling all words with their most likely tag as indicated in the training corpus; and simply labeling all words as nouns. For example, Brill gave two simple algorithms to do this: one assigns to all unknown words the tag “NN” for common noun in the Penn Treebank tagset, and
• 32.
Figure 3-4: Preliminary steps for tagging (review for errors and typing mistakes (manual); conversion to Brill format (manual); transliteration (C program); semi-automatic tagging; division of the tagged corpus in two (Brill Perl program); untagging (Brill Perl program); preparation of the final lexicon).

the other assigns to every word in the corpus either of two tags: “NNP” for proper noun if the word starts with a capital letter, or “NN” otherwise. This strategy is based on the observation that common nouns constitute a high percentage of an English text. In this research we used a more detailed strategy, in which the pattern of the letters of a word is compared with a predefined set of patterns to determine which word class the word belongs to, making use of the rules of Arabic morphology (Srf). A tag is then assigned to the word accordingly. If the word does not match any of the standard patterns, it is assigned the tag “NCSgFGI”, which stands for “singular, feminine, genitive, indefinite common noun”, since this is the most probable tag for unknown words, as noticed when the manually tagged corpus was prepared. The patterns used to tag unknown words are shown in Appendix C. Once text has been passed through the initial-state annotator, it is then compared to the truth. A manually annotated corpus is used as our reference for truth. An ordered list of transformations is learned that can be applied to the output of the initial state
• 33. annotator to make it better resemble the truth. Each transformation has two components: a rewrite rule and a triggering environment. A rewrite rule can be of the form: X -> Y, meaning “change the tag from X to Y”, while a triggering environment can be of the form: “al” hasprefix 2, meaning “the current word has a 2-letter prefix ‘al’”. Taken together, the transformation with this rewrite rule and triggering environment would be: X -> Y “al” hasprefix 2, meaning “change the tag of the current word from X to Y if it has a 2-letter prefix ‘al’”. There are two types of rules, lexical rules and contextual rules, so there are two learners that have to be run consecutively. First lexical rules are learned; then contextual rules are learned to refine the tags and make up for some divergences that may occur in applying the lexical rules. In both cases the learning procedure works in passes through the truth corpus, each pass learning the rule that, when applied, minimizes the errors in tagging the corpus as compared to the truth corpus. These rules are then stored in a file in the order they are learned, thus producing two rule files: a lexical rule file and a contextual rule file. The tagger applies these rules in the same order to get similar results. Examples of both types of rules, obtained from the Arabic tagged corpus, are given in Section 4.4, with explanatory comments giving the meaning of each rule. The ideal goal of the lexical module is to find rules that can produce the most likely tag for any word in the given language, i.e. the most frequent tag for the word in question considering all texts in that language. The problem is to determine the most likely tags for unknown words, given the most likely tag for each word in a comparatively small set of words. 
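To make the mechanics concrete, here is a small Python sketch of the two pieces just described: a pattern-based initial-state annotator (standing in for our Lex program) and one Brill-style lexical transformation of the "hasprefix" kind. The patterns, the verb and plural tags, and the rule itself are invented for illustration; they are not the actual patterns of Appendix C or the learned rules of Section 4.4.

```python
import re

DEFAULT_TAG = "NCSgFGI"  # most probable tag for unknown words (see text)

# Two illustrative morphological patterns over Buckwalter letters
# (assumed, simplified stand-ins for the real Lex pattern set).
PATTERNS = [
    (re.compile(r"^y."), "VISg3MI"),     # imperfect-verb prefix y-
    (re.compile(r".*wn$"), "NCPlmMNI"),  # masculine sound-plural ending -wn
]

def initial_tag(word):
    """Initial-state annotator: first matching pattern wins, else default."""
    for pat, tag in PATTERNS:
        if pat.match(word):
            return tag
    return DEFAULT_TAG

def apply_hasprefix_rule(word, tag, x, y, prefix):
    """One lexical transformation: change tag X to Y if word has the prefix."""
    if tag == x and word.startswith(prefix):
        return y
    return tag

tag = initial_tag("alktab")  # unknown word, no pattern match -> default tag
tag = apply_hasprefix_rule("alktab", tag, "NCSgFGI", "NCSgFGD", "al")
print(tag)  # NCSgFGD: the 'al' prefix rule makes the noun definite
```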
This is done by transformation-based learning (TBL) using three different lists: a list of word/tag/frequency triples derived from the first half of the training corpus, a list of all available words sorted by decreasing frequency, and a list of all word pairs, i.e. bigrams. Thus, the lexical learner module does not use running text. Once the tagger has learned the most likely tag for each word found in the annotated training corpus and the rules for predicting the most likely tag for unknown words, contextual rules are learned for disambiguation. The learner discovers rules on the basis of the particular environments (or contexts) of word tokens. The contextual learning
• 34. process needs an initially annotated text. The input to the initial state annotator is an untagged corpus, a running text, which is the other half of the annotated corpus with the tagging information of the words removed. The initial state annotator also uses a list consisting of words with a number of tags attached to each word, found in the first half of the annotated corpus. The first tag is the most likely tag for the word in question, and the rest are listed in no particular order.

Figure 3-5: Lexical rule learning

With the help of this list, a list of bigrams (the same as used in the lexical learning module, see above) and the lexical rules, the initial state annotator assigns to every word in the untagged corpus the most likely tag. In other words, it tags the known words with the most frequent tag for the word in question. The tags for the unknown words are computed using the lexical
• 35. rules: each unknown word is first tagged with a default tag and then the lexical rules are applied in order. There is one difference compared to the lexical learning module, namely that the application of the rules is restricted in the following way: if the current word occurs in the lexicon but the new tag given by the rule is not one of the tags associated with the word in the lexicon, then the rule does not change the tag of this word.

Figure 3-6: Context rule learning

When tagging new text, an initial state annotator first applies the predefined default tags to the unknown words (i.e. words not in the lexicon). Then the ordered lexical rules are applied to these words. The known words are tagged with their most likely tag. Finally, the ordered contextual rules are applied to all words.
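The lexicon constraint just described can be captured in a few lines. The following Python sketch uses an invented word and tags purely for illustration; the real tagger implements this check inside the Brill code:

```python
# Toy lexicon: word -> tags seen for it in the training corpus,
# with the most likely tag listed first.
LEXICON = {
    "ktab": ["NCSgMAD", "NCSgMGD"],
}

def retag(word, old_tag, new_tag, lexicon):
    """Apply a contextual rule's new tag, honouring the lexicon constraint:
    a known word may only receive a tag the lexicon lists for it."""
    if word in lexicon and new_tag not in lexicon[word]:
        return old_tag  # known word, unattested tag: rule does not fire
    return new_tag

print(retag("ktab", "NCSgMAD", "NCSgMGD", LEXICON))  # allowed: NCSgMGD
print(retag("ktab", "NCSgMAD", "VPSg3M", LEXICON))   # blocked: NCSgMAD
```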
• 36.
Figure 3-7: Tagging

3.3.2 Tagger The tagger follows the same path as the learner. Starting with any raw text corpus given to it, it first applies the same initial state annotator as the one used in learning, so that the transformation rules work correctly. It then uses the rule files produced by the learner to change the initial tags to new tags. The rules are applied in the same order in which they were learned: first the lexical rules, then the contextual rules.
• 37. 3.4 Testing strategies Testing was done using the method of cross validation. Taking into consideration that we do not have a large standard truth corpus, we had to make do with the corpus we tagged. This corpus is divided into three portions, each containing about 13,000 words, and the test is repeated three times, with a different one third for testing and the other two thirds for learning each time; the average of the three tests is then taken as an overall measure of the performance of the system. This whole experiment is repeated using three versions of the tagset, and therefore three versions of the corpus: 1. Tagset1: the original detailed Khoja tagset [16] containing 177 tags. 2. Tagset2: the complete modified tagset of 319 tags (Appendix B). 3. Tagset3: a subset of Tagset2 from which grammatical information is excluded for nouns and imperfect verbs, reducing the number of tags to 185. All three tagsets are drastically enlarged by the fact that the system we used does not apply stemming prior to the learning and tagging phases; rather, it uses composite tags to tag composite words, a fact that introduces a new set of tags. As an example, consider the word balmdrsp (بالمدرسة). If stemming were applied, this word would be divided into two separate words, b and almdrsp, and would be tagged as b/PC almdrsp/NCSgFGD. But since we work without stemming, the word is treated as one unit and is tagged as balmdrsp/PC_NCSgFGD, thus introducing the new tag PC_NCSgFGD. Stemming would probably enhance the accuracy of the system, but it would divert our attention in other directions and put extra burdens on the users of the system.
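The balmdrsp example can be made concrete with a few lines of Python. This is a sketch of the underscore convention only; the helper functions are ours, not a reproduction of the thesis code:

```python
def composite_tag(parts):
    """Join the tags of a clitic + host sequence into one composite tag,
    using the underscore convention described above."""
    return "_".join(tag for _, tag in parts)

parts = [("b", "PC"), ("almdrsp", "NCSgFGD")]

# With stemming, the clitic b- and its host would be tagged separately:
print(" ".join(f"{w}/{t}" for w, t in parts))  # b/PC almdrsp/NCSgFGD

# Without stemming, the whole word carries one composite tag:
word = "".join(w for w, _ in parts)
print(f"{word}/{composite_tag(parts)}")        # balmdrsp/PC_NCSgFGD
```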
  • 38. Chapter Four Implementation and Testing 4.1 Corpus The corpus used for this study is part of an about 160,000 word corpus of two Jordanian newspapers (Aldustor and Aldustor Aleqtsady). Any MSA corpus would have done the task, but this corpus was gotten at an early stage of the work, and was used henceforth. A lot of preprocessing was needed before using the corpus. The corpus is originally a Microsoft word document, so it has to undergo the following corrections and revision tasks to be ready for our work: 1. There were many typing, spelling, and grammatical mistakes that constituted quite a phenomena in the text, and would hinder the process of tagging and add up to the problem of ambiguity which is already an inherent problem of Arabic texts. These problems had to be fixed beforehand. Examples of such mistakes include: a.Missing hamza, like: ‫احبار ,اقصي ,اوضح ,اشار‬ ‫,احبار‬ b. Misplaced hamza, like: ‫ أجياء‬instead of ‫.تامل‬ ‫تامل‬ ‫إجياء‬ c. ‫ هـ‬instead of ‫ ,ة‬like: ‫. باخر‬ d. ‫ ي‬instead of ‫ ,ى‬or vice versa, like: ‫اعكى‬ ‫.حتهاج ايل ,أ يي ,أمح احمم‬ e.Typing mistakes, like: ‫,لك ل اب ًام الك الل ,الةوئى اب ًام االةوئ‬ ‫ال‬ ‫ال‬ ‫ا‬ ‫ا‬ ‫.األولاوفاتاب اًام ااألول فات ,فاجلياءااب اًام افاإلجياء‬ ‫ال‬ ‫ال‬ f. Grammar mistakes, like: ‫ …اس ر ر اءايفااينر رراراالر ررني ام ابر ررلاال ر ر اء االذ ذذيح األذ ذ ذ ا ذ ذ ذ‬ . ‫المتحدة لألغدادابوعاني ام ابلاسكعااساسو ااوا ارجو‬ .…‫ وفه اف االزواراواليحافه قعاأنافزف اع دهما‬ ‫ و اص اوان‬ ‫ ك لااف ابكغاع داالسرواراتاواآلل ات افرغتايفامونراءاالع ةر ا‬ .)‫ب اًام ا(اليتاأفيغت‬ ‫ال‬ 38
  • 39. 2. Getting rid of passage numbers, titles, and end marks, to concentrate on complete sentences of text. 3. The text was then converted to an ASCII MS-DOS format. 4. For technical reasons (the different code pages used for representing Arabic characters, and the use of software that does not support Arabization, notably the Lex analyzing system and the Linux environment), it was decided to follow most of the previous line of research in Arabic [e.g. 1, 14, 21] and use transliteration. For this purpose the Buckwalter transliteration scheme [36] was adopted, and a small C program was written for this task. 5. The corpus was then edited to match the Brill format and copied to the Linux system for the rest of the processing. 6. It was then tagged using a program written with the help of the lexical analyzer Lex [2]. The resulting corpus, measured to be about 43% accurate, was then revised manually. The result, which is taken to represent the truth, was given to the learner of the Brill tagger to learn lexical and contextual rules, a step that also requires some other preparations, as explained in Section 3.3. 7. The above steps were performed initially on a corpus of about 1,000 words. After the rules were learned, a larger corpus was presented to the tagger, tagged, manually revised, and given to the learner to enhance the rule set. This process was repeated continuously, enlarging the truth corpus and improving the performance of the tagger simultaneously, until satisfactory
  • 40. results were obtained and/or enough time had been spent on this step. At present, a truth corpus of over 38,000 words has been reached. Figures 4-1 and 4-2 show sample sentences at different stages of the tagging cycle.
(a) A sentence from the corpus (in Arabic script)
ElY ham$ >Emal almntdY almtwsTy lltnmyp wal*y Eqd fy alqahrp xlal |*ar aljary nZm almrkz almSry lldrasat alaqtSadyp wr$p Eml Hwl DEf almward alb$ryp waltdryb wtfDyl aldwl alErbyp llmntj al>jnby w>hm mEwqat altnafsyp ll$rkat fy almnTqp . wqd naq$t h*h alHlqp altTwrat almtlaHqp fy alaqtSad alEalmy walty >SbHt tfrD tHdyat ElY al$rkat .
(b) A transliteration of the sentence in (a) in the Brill format
Figure 4-1
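The transliteration step illustrated in Figure 4-1(b) can be sketched as follows (the thesis used a small C program; this Python version shows only part of the Buckwalter table, and the dictionary and function names are ours):

```python
# Sketch of Buckwalter transliteration (partial character table).
# Characters without a mapping (digits, punctuation) pass through unchanged.
BUCKWALTER = {
    'ا': 'A', 'ب': 'b', 'ت': 't', 'ث': 'v', 'ج': 'j', 'ح': 'H', 'خ': 'x',
    'د': 'd', 'ذ': '*', 'ر': 'r', 'ز': 'z', 'س': 's', 'ش': '$', 'ص': 'S',
    'ض': 'D', 'ط': 'T', 'ظ': 'Z', 'ع': 'E', 'غ': 'g', 'ف': 'f', 'ق': 'q',
    'ك': 'k', 'ل': 'l', 'م': 'm', 'ن': 'n', 'ه': 'h', 'و': 'w', 'ي': 'y',
    'ى': 'Y', 'ة': 'p', 'ء': "'", 'أ': '>', 'إ': '<', 'آ': '|', 'ؤ': '&',
    'ئ': '}',
}

def transliterate(text: str) -> str:
    """Map Arabic script to its Buckwalter ASCII representation."""
    return ''.join(BUCKWALTER.get(ch, ch) for ch in text)

print(transliterate('على'))  # ElY
```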
  • 41. 4.2 Tagset The tagset used in this work is a modified version of the tagset designed by Khoja, fully described in [16] and summarized in Section 3.1.1. The work of Khoja is highly esteemed, being the first comprehensive attempt at designing a tagset for Arabic that encompasses the richness and complexity of the language. Nevertheless, it has some limitations and mistakes, some of which are treated in this work, while others are left as future work. Modifications considered here concern nouns, verbs, and particles.
Figure 4-2: Part of the sentence in Figure 4-1 after tagging and detransliteration
4.2.1 Nouns: For nouns the following was done: a- Avoiding the distinction between foreign names and Arabic names. Instead, all names, whether Arabic or foreign, are given the same tag NP (for proper noun). The tag RF (residual foreign) is kept to refer only to words of foreign languages written in Arabic characters. In the original tagset, the tag RF is given to all foreign names and words (see Figure 1-1, and compare the tags given to the foreign names there).
  • 42. b- Using different tags for the different plural forms. The indication of plural nouns is given by the subtags PlbM, PlbF, Plm, and Plf, for broken masculine plural, broken feminine plural, sound masculine plural, and sound feminine plural respectively, instead of just PlM and PlF for plural masculine and plural feminine. Figure 4-3 below gives examples. Notice that in our set the gender is not repeated with sound plurals, since it is implied by the plural form itself.
Original tag   New tag
NCPlMND        NCPlmND
NCPlMGD        NCPlmGD
NCPlFND        NCPlfND
NCPlFND        NCPlbFND
NCPlMGD        NCPlbMGD
Figure 4-3: Tags of plurals (the word column of the original table gives an Arabic example for each row)
The last two characters of each tag are irrelevant here and are given only for completeness. Including this information is useful when the resulting tagged corpus is used for morphological studies. c- Introducing some new tags. d- Introducing another general category in addition to common nouns (NC) and adjectives (NA), namely title nouns (NT), for titles such as minister, ambassador, engineer, and president. This increases the tagset drastically, since each of these nouns can be singular or plural, masculine or feminine, definite or indefinite, and can take any of the three cases. But it helps in many cases to discover unknown proper nouns, which usually follow these titles. 4.2.2 Verbs:
  • 43. For verbs the modifications include using distinct tags for defective verbs (kAna and its sisters) to capture their interaction with the case of the following noun. Each verb tag is therefore marked with a small d following the first two characters if the verb is defective, as in Figure 4-4.
Word       Original tag   New tag
*hb        VPSg3M         VPSg3M
y*hb       VISg3MI        VISg3MI
kAnt       VPSg3F         VPdSg3F
ySbHwn     VIPl3MI        VIdPl3MI
Figure 4-4: Tags of defective verbs
4.2.3 Particles: For particles, the modifications include introducing a few tags to refine the tagging of some particles and to make room for particles not considered in the original tagset; namely Pcr and Pdt for the certainty and doubt senses of the particle qd, Pst for the stress particles (<n~ and >n~), PQ for interrogative particles, and LM and LN for the negation particles lm and ln, respectively. All these tags are added to help pick up information about the following words. Although these tags do contribute to refining the tagset, there is still a lot to be done with particles, since the available tags do not cover the wide range of meanings of Arabic particles. For example, the prefix particle f is now given the tag PC (for conjunctional particle), although it is not always conjunctional and sometimes has different meanings, especially when affixed to verbs (the causative fa); the same goes for w. All particles that do not clearly belong to any of the available tags are given a general tag PA (for adverbial particle), regardless of the fact that some of them are not really adverbial, so the meaning of this tag should not be taken literally.
  • 44. Making further distinctions is left for future work, after studying more deeply the need for such refinement. It should also be kept in mind that the corpus we dealt with is not stemmed, so tagging is done with composite tags, which introduces a new set of tags for composite words. For example, a word consisting of a prepositional prefix and a definite noun is tagged PPr_NCSgMGD, a completely different tag from either PPr or NCSgMGD, thus leading to a drastic theoretical increase in the tagset. Contrary to what was expected, this did not cause many problems with tagging accuracy, because the Brill tagger is powerful in dealing with prefixes and suffixes, and because composite words comprise only a small portion of an Arabic text (estimated at less than 6% in the data we worked on). 4.3 The program The same Lex-based program that was used for the initial tagging of the very first corpus is now used as the start-state tagger for both the learner and the tagger of the Brill system. In the original system, initial tagging is done by a very simple routine that assigns every word the tag NN (common noun), or NP (proper noun) if the word starts with a capital letter. This start state suffices for English and similarly simple languages, but for Arabic we preferred another type of start-state tagger, in which each unknown word is checked for its morphological structure and assigned an initial tag accordingly. For this purpose the Lex-based routine was used, after considerable trouble getting it to work, especially since it has to be interfaced to both the lexical learner (written in Perl) and the tagger (written in C). The start-state routine is an important factor in getting accurate results, especially for unknown words; the better it is designed to take account of word structure, the better the achieved results.
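A start-state tagger of the kind just described can be sketched as follows (the patterns and the default tag here are illustrative examples of our own, not the thesis's actual Lex rules):

```python
import re

# Illustrative morphology-based start-state tagger in the spirit of
# Section 4.3. Each pattern maps a word shape (in Buckwalter
# transliteration) to a plausible initial tag; order matters, since the
# first matching pattern wins.
PATTERNS = [
    (re.compile(r'^[0-9]+$'), 'Rnu'),         # digits -> numeral
    (re.compile(r'^wal.+'),   'PC_NCSgMGD'),  # wa+al: conjunction + definite noun
    (re.compile(r'^al.+'),    'NCSgMGD'),     # al: definite noun
    (re.compile(r'^y.+'),     'VISg3MI'),     # y- imperfect-verb prefix
    (re.compile(r'.+at$'),    'NCPlfGI'),     # -at: sound feminine plural
    (re.compile(r'.+p$'),     'NCSgFGI'),     # -p (ta marbuta): feminine singular
]

def initial_tag(word: str, default: str = 'NCSgMGI') -> str:
    """Assign an initial tag from the word's surface shape."""
    for pattern, tag in PATTERNS:
        if pattern.match(word):
            return tag
    return default  # fall back to the statistically most probable tag

print(initial_tag('almdrsp'))  # NCSgMGD (the 'al' pattern fires first)
```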
At present, the routine takes care of many morphological structures, and relies on statistical information gathered during manual tagging to assign the most probable tag to words that do not match any of the captured patterns. 4.4 Rules
  • 45. In this section we give a list of the resulting rules and explain how they are interpreted, and the actual lexical and contextual information derived from them. It is worth mentioning here that the obtained rules are based on majority tests, not on absolute truth. In other words, it is not necessary that each rule apply to every situation in any MSA text; rather, it applies to most similar situations. As an example, consider rule number 9 in Table 4-3, which states: NP NCSgMGI PREV1OR2TAG PPr meaning that if a word tagged NP has a preposition among its two preceding words, then that word should be retagged as a common noun and not as a proper noun. This rule was derived because, on the training corpus, applying it enhanced the accuracy of the tagger by minimizing the discrepancy between the starting corpus and the truth corpus. That does not mean the rule has no exceptions; it is easy to think of many exceptions to this rule or to any other rule, but what counts is the overall effect of applying it. 4.4.1 Lexical Rules Table 4-1 shows a list of lexical rules, together with the meaning of each rule and its interpretation in the context of Arabic morphology, while Table 4-2 lists a group of rules that may be considered misleading: although they may enhance the tagging of the training corpus, they will surely have negative effects on the testing and real-life corpora. 4.4.2 Contextual Rules Table 4-3 shows a list of contextual rules, together with the meaning of each rule and its interpretation in the context of Arabic morphology and syntax. 4.5 Testing Many tests were performed to check the efficiency of the system:  In the first group of tests, the truth corpus was divided into three portions of similar sizes, and the cross-validation method was used three times for each type of tagset, as explained below. 
In each of the three tests, two portions of the corpus (about 25,000 words) are used in
  • 46. learning and the third (about 13,000 words) for evaluation, and the average accuracy of the three tests is taken as the overall measure of the system's accuracy. This is performed on three types of corpora: one tagged with the original tagset (Tagset1) as introduced by Khoja, the second tagged with a modified version of it (Tagset2), as explained in Section 4.2, and the third tagged with the modified set excluding grammar features (Tagset3). These three tagsets are defined in Section 3.4. The results of these tests are summarized in Tables 5-1, 5-2, and 5-4 respectively.  To test the effect of enlarging the corpus size on accuracy, another group of corpora was prepared. In this case, since we do not have a large reference corpus to work on, we had to reduce the size of the testing corpora in order to enlarge the training corpora. So we chose the size of the training corpora to be about 31,000 words each, i.e. about five sixths of the size of the complete corpus, with the remainder (over 6,000 words) serving as the test corpus. Three tests were performed this way, changing the test corpus each time and taking the average. The results of these tests are summarized in the tables of Section 5.1.
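The cross-validation protocol just described can be sketched as follows (function and variable names are ours; the accuracy figures in the comment are those reported later in Table 5-1):

```python
# Sketch of the three-fold cross-validation protocol: hold out each
# portion once, train on the remaining portions, and average the
# resulting accuracies.

def cross_validate(portions, train_and_test):
    """Return the mean accuracy over all hold-one-out splits."""
    accuracies = []
    for i, test_part in enumerate(portions):
        train_parts = [p for j, p in enumerate(portions) if j != i]
        accuracies.append(train_and_test(train_parts, test_part))
    return sum(accuracies) / len(accuracies)

# With the per-fold accuracies of Table 5-1 (73.60, 72.07, 75.05),
# the average comes out to 73.57.
folds = iter([73.60, 72.07, 75.05])
print(round(cross_validate(['p1', 'p2', 'p3'], lambda tr, te: next(folds)), 2))
```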
  • 47.
1. al haspref 2 NASgFGD
   If a word has the two-letter prefix "al", tag it as NASgFGD. ("al" is a sign of definiteness.)
2. at hassuf 2 NCPlfGD
   If a word has the two-letter suffix "at", tag it as NCPlfGD. ("at" is an ending of the sound feminine plural.)
3. NCSgMGI p fchar NCSgFGI
   If a word tagged NCSgMGI contains the character "p", tag it as NCSgFGI. ("p", the ta marbuta, is a sign of the feminine.)
4. y haspref 1 VISg3MI
   If a word has the one-letter prefix "y", tag it as VISg3MI. ("y" is a prefix of the imperfect verb.)
5. NCSgMGI l fhaspref 1 PPr_NCSgMGI
   If a word tagged NCSgMGI has the one-letter prefix "l", tag it as PPr_NCSgMGI. ("l" at the beginning of a word is a preposition.)
6. NCSgMGI a fhassuf 1 NCSgMAI
   If a word tagged NCSgMGI has the one-letter suffix "a", tag it as NCSgMAI. (An "a" ending is a sign of the accusative case.)
7. NASgFGD p faddsuf 1 NASgMGD
   If it is possible to add "p" to a word tagged NASgFGD, tag it as NASgMGD. (A word cannot contain two ta marbutas.)
8. NCSgMGI w fhaspref 1 PC_NCSgMGI
   If a word tagged NCSgMGI starts with "w", tag it as PC_NCSgMGI. ("w" is a conjunctional particle.)
Table 4-1: a list of lexical rules
  • 48.
9. wal haspref 3 PC_NCSgMGD
   Any word starting with "wal" should be tagged PC_NCSgMGD. ("wal" is the conjunctional particle "w" followed by "al" for definiteness.)
10. ll haspref 2 PPr_NCSgMGD
    Any word starting with "ll" should be tagged PPr_NCSgMGD. ("ll" is the preposition "l" followed by "al" for definiteness.)
11. NCSgMGI t fhassuf 1 VPSg3F
    If a word tagged NCSgMGI ends with "t", tag it as VPSg3F. ("t" is a suffix of the past-tense verb, third person singular feminine.)
12. b deletepref 1 PPr_NCSgMGI
    If removing the letter "b" from a word gives a word in the lexicon, tag the original word as PPr_NCSgMGI. (An attached "b" is a preposition.)
13. 0 char Rnu
    A word containing the digit character "0" is tagged as a numeral.
14. NCPlfGD al faddpref 2 NCPlfGI
    If a word tagged NCPlfGD accepts adding the prefix "al", tag it as NCPlfGI. ("al" cannot be added to an already definite word.)
15. PC_NCSgMGI STAART fgoodright PC_VPSg3M
    If a word at the beginning of a sentence is tagged PC_NCSgMGI, tag it as PC_VPSg3M. (A sentence cannot start with the genitive case.)
Table 4-1: a list of lexical rules (continued)
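How lexical rules of this kind fire can be sketched as follows (a minimal re-implementation for illustration, not the thesis code; real Brill lexical predicates also distinguish known from unknown words):

```python
# Sketch of Brill-style lexical rule predicates and their sequential
# application. Rules are tried in order, and each matching rule may
# overwrite the tag assigned by an earlier one.

def haspref(word: str, n: int, prefix: str) -> bool:
    """True if the word starts with the given n-character prefix."""
    return len(prefix) == n and word.startswith(prefix)

def hassuf(word: str, n: int, suffix: str) -> bool:
    """True if the word ends with the given n-character suffix."""
    return len(suffix) == n and word.endswith(suffix)

def apply_rules(word: str, tag: str) -> str:
    # Rule 1: "al haspref 2 NASgFGD" -- "al" marks definiteness.
    if haspref(word, 2, "al"):
        tag = "NASgFGD"
    # Rule 2: "at hassuf 2 NCPlfGD" -- "-at" marks the sound feminine plural.
    if hassuf(word, 2, "at"):
        tag = "NCPlfGD"
    return tag

print(apply_rules("aldrasat", "NCSgMGI"))  # NCPlfGD: both rules fire, rule 2 last
```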
  • 49.
1. NASgFGD d fhassuf 1 NCSgMGD
   If a word tagged NASgFGD ends with "d", tag it as NCSgMGD.
2. NCSgMGD al> fhaspref 3 NCPlbMGD
   If a word tagged NCSgMGD starts with "al>", tag it as NCPlbMGD.
3. NCSgMGI n fhassuf 1 NP
   If a word tagged NCSgMGI ends with "n", tag it as NP.
4. NCSgMGI_NPrPSg3F <lY fgoodleft PPr_NPrPSg3F
   If a word tagged NCSgMGI_NPrPSg3F is followed by "<lY", tag it as PPr_NPrPSg3F. (Only a chance correlation in the training corpus.)
Table 4-2: Examples of misleading lexical rules
  • 50.
1. NCSgFAI NCSgFGI PREV1OR2TAG PPr
   Change NCSgFAI to NCSgFGI if one of the two previous words is tagged PPr. (What follows a preposition takes the genitive case.)
2. NCSgFGD NCSgFND PREV1OR2TAG VPSg3F
   Change NCSgFGD to NCSgFND if one of the two previous words is tagged VPSg3F. (The agent of a verb is nominative.)
3. Pst PA NEXTTAG VISg3MI
   Change Pst to PA if the next word is tagged VISg3MI. (Before an imperfect verb, >n is the subordinating particle, not the stress particle.)
4. VISg3MI VISg3MS PREVWD >n
   Change VISg3MI to VISg3MS if the previous word is >n. (>n puts the imperfect verb in the subjunctive.)
5. NCSgMGD NCSgMND PREV1OR2TAG STAART
   Change NCSgMGD to NCSgMND if the word is one of the two starting words of the sentence. (A definite topic is nominative, not genitive.)
6. NCSgMGD NASgMGD PREVTAG NCSgMGD
   Change NCSgMGD to NASgMGD if the previous word is tagged NCSgMGD. (What follows a definite noun and agrees with it is a definite adjective.)
7. NCSgMGI NCSgMNI PREV1OR2TAG STAART
   Change NCSgMGI to NCSgMNI if the word is one of the two starting words of the sentence. (An indefinite topic is nominative.)
8. NASgFGD NCSgFGD PREVTAG PC_NCSgMGI
   Change NASgFGD to NCSgFGD if the previous word is tagged PC_NCSgMGI. (Distinguishing the second term of a genitive construct from an adjective.)
9. NP NCSgMGI PREV1OR2TAG PPr
   Change NP to NCSgMGI if one of the two previous words is tagged PPr. (A preposition typically precedes a common noun, not a proper name.)
10. NCSgFGI NCSgFNI PREVTAG VPSg3F
    Change NCSgFGI to NCSgFNI if the previous word is tagged VPSg3F. (The agent of a verb is nominative.)
Table 4-3: a list of contextual rules
  • 51.
11. NASgFGD NCSgFGD PREVTAG NCSgMGI
    Change NASgFGD to NCSgFGD if the previous word is tagged NCSgMGI. (Distinguishing the second term of a genitive construct from an adjective.)
12. NASgFGI NCSgFGI PREVTAG PPr
    Change NASgFGI to NCSgFGI if the previous word is tagged PPr. (What follows a preposition is a noun, not an adjective.)
13. PA_VISg3FI NNuCaSgFAI CURWD stp
    If the current word is stp, change PA_VISg3FI to NNuCaSgFAI. (A lexical exception: stp, "six", would otherwise be read as the future particle s followed by an imperfect verb.)
Table 4-3: a list of contextual rules (continued)
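Applying a contextual transformation such as rule 9 of Table 4-3 can be sketched as follows (a minimal illustration of the PREV1OR2TAG trigger; function and variable names are ours):

```python
# Sketch of a Brill contextual transformation:
#   NP NCSgMGI PREV1OR2TAG PPr
# i.e. retag NP as NCSgMGI when one of the two preceding tags is PPr.

def apply_prev1or2tag(tags, from_tag, to_tag, trigger):
    """Return a new tag sequence with the transformation applied."""
    out = list(tags)
    for i, t in enumerate(out):
        # Look at the window of (up to) two tags before position i.
        if t == from_tag and trigger in out[max(0, i - 2):i]:
            out[i] = to_tag
    return out

tags = ['VPSg3M', 'PPr', 'NP', 'punc']
print(apply_prev1or2tag(tags, 'NP', 'NCSgMGI', 'PPr'))
# ['VPSg3M', 'PPr', 'NCSgMGI', 'punc']
```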
  • 52. Chapter Five Results and Discussion 5.1 Results Below are the results of the performed tests. Each table illustrates a group of related tests using the method of cross-validation. Table 5-1 gives the results for the original tagset, Table 5-2 for the modified tagset, Table 5-3 for the modified tagset using enlarged versions of the training corpora, and Table 5-4 for the modified tagset with the case (grammar) information removed.
                         Test1    Test2    Test3    Average
Training size (words)    23834    25372    25786    -
Test size (words)        13662    12124    11710    -
No. lexical rules        153      149      150      -
No. contextual rules     134      137      161      -
Tagging accuracy (%)     73.60    72.07    75.05    73.57
Table 5-1: Accuracy for the original tagset
                         Test4    Test5    Test6    Average
Training size (words)    23834    25372    25786    -
Test size (words)        13662    12124    11710    -
No. lexical rules        120      143      150      -
No. contextual rules     151      158      135      -
Tagging accuracy (%)     74.34    72.13    75.69    74.05
Table 5-2: Accuracy for the complete modified tagset
  • 53.
                         Test7    Test8    Test9    Average
Training size (words)    31422    31467    31634    -
Test size (words)        6261     6216     6049     -
No. lexical rules        174      176      167      -
No. contextual rules     190      162      148      -
Tagging accuracy (%)     75.72    75.39    77.16    76.09
Table 5-3: Accuracy for the complete modified tagset with enlarged training corpora
                         Test10   Test11   Test12   Average
Training size (words)    23834    25372    25786    -
Test size (words)        13662    12124    11710    -
No. lexical rules        151      148      145      -
No. contextual rules     83       116      106      -
Tagging accuracy (%)     83.89    82.64    85.10    83.87
Table 5-4: Accuracy for the ungrammatized modified tagset
5.2 Examples of errors in tagging A sample of errors was taken from the error report file, consisting of 38 randomly chosen consecutive lines of the original text for the grammatized tagset. This sample contains 1079 words, 280 of which are tagged erroneously. The errors are categorized into fifteen types, as in Table 5-5 below, and the occurrences of each type are counted in the sample to get an idea of the percentage of each error type. Table 5-6 shows the list of erroneously tagged words of this sample with, for each word, its true and erroneous tags and the type of error. Table 5-7 then summarizes the errors, their counts in the sample, and their percentages in descending order.
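Before turning to the error breakdown, the arithmetic behind the Section 5.1 tables can be re-checked (numbers copied from the tables above; variable names are ours):

```python
# Re-deriving the averages and comparisons from Tables 5-1 to 5-4.
t1 = [73.60, 72.07, 75.05]  # Table 5-1: original tagset
t2 = [74.34, 72.13, 75.69]  # Table 5-2: complete modified tagset
t3 = [75.72, 75.39, 77.16]  # Table 5-3: modified tagset, enlarged training
t4 = [83.89, 82.64, 85.10]  # Table 5-4: ungrammatized modified tagset

def avg(xs):
    """Mean accuracy, rounded to two decimal places."""
    return round(sum(xs) / len(xs), 2)

print(avg(t1), avg(t2), avg(t3))    # 73.57 74.05 76.09
print(avg(t4))                      # the reported average for Table 5-4
print(round(avg(t3) - avg(t2), 2))  # 2.04 -> the "about 2%" gain from more data
```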
  • 54.
Error type   Meaning
1    Interpreting a title as a common noun
2    Mistagging a broken plural
3    Interchanging an adjective and a common noun
4    Interchanging definite and indefinite
5    Interchanging sound plural and singular
6    Interchanging verb with noun
7    Grammatical error
8    Error in composite tag
9    Lexicon entry missing
10   Interchanging dual with sound masculine plural
11   Typing mistake
12   Interchanging adverbial particle with stress particle
13   Error in gender
14   Taking a common noun for a proper noun
15   Interchanging doubt particle and certainty particle
Table 5-5: Types of errors
Word           Truth tag             System tag           Type   Comments
almdyryn       NTPlmGD               NCPlmGD              1
w>SHab         PC_NCPlbMGI           PC_NCSgMGI           2
r&yp           NCSgFGI               NASgFGI              3
alm&vrat       NCPlfGI               NCPlfGD              4
m$rwEathm      NCPlfGI_NPrPPl3M      NCSgMGI_NPrPPl3M     5
wDE            NCSgMGI               VPSg3M               6
almdyryn       NTPlmGD               NCPlmGD              1
alastratyjyp   NASgFGD               NCSgFGD              3
$rwT           NCPlbMGI              NCPlbMAI             7
wtnmyp         PC_NCSgFGI            PC_NCSgFAI           7
mharat         NCPlfGI               NCPlfAI              7
>salyb         NCPlbMGI              NCPlbMNI             7
wastratyjyat   PC_NCPlfGD            PC_NCPlfGI           4
w>hmyth        PC_NCSgFGI_NPrPSg3M   NCSgFGI_NPrPSg3M     8
almdaxl        NCSgMGD               NCPlbMND             7,2
al>rbEa'       RD                    NCPlbMGD             9      lexicon
lmqablat       PPr_NCplfGI           PPr_NCPlfGI          11     mistype
astxdamha      NCSgMNI_NPrPSgF       NCSgMNI_NPrPSg3F     11     mistype
dafws          RP                    NCSgMGI              9      lexicon
bswysra        PPr_RP                NCSgMAI              9      lexicon
wsykwn         PC_PA_VIdSg3MI        PC_NCSgMGI           8
almtHdvyn      NADuMAD               NCPlmGD              10     Du-Plm
alr}ysyyn      NCDuAD                NCPlmGD              10
ryma           NP                    NCSgMAI              9      lexicon
Na}b           NTSgMGI               NTSgMNI              7
Table 5-6: A sample of errors in the grammatized tests
  • 55.
Word       Truth tag            System tag    Type     Comments
wzyr       NTSgMGI              NTSgMNI       7
wEql       PC_NP                PC_NCSgMGI    9        lexicon
mtHdvyn    NAPlmAI              NCSgMGI       7,2
w>hm       PC_NASgMGI           PC_NCSgMAI    3,7
myna       RF                   NCSgMAI       9        lexicon
>n         PA                   Pst           12
w>n        PC_PA                PC_Pst        12
mst$ar     NTSgMHNI             NTSgMGI       11       mistype
waDHp      NASgFGI              NCSgFNI       7,3
tktml      VISg3MI              VISg3FI       13       gender
aldktwrp   NTSgFNI              NASgFGD       1,7
>Hyana     NAPlbMAI             NCSgMAI       2,3
>bwabha    NCSgMGI_NPrPSg3F     NP            8        starts with >bw
Table 5-6: A sample of errors in the grammatized tests (continued)
Error type   Count   %
7            119     42.50
2            30      10.71
3            29      10.36
9            27      9.64
6            24      8.57
8            15      5.36
12           15      5.36
4            9       3.21
1            7       2.50
10           2       0.71
5            1       0.36
11           1       0.36
13           1       0.36
Total        280     100
Table 5-7: Percentage of each error type in the grammatized tests
Word        Truth tag        System tag   Type
wbmwazap    PC_PPr_NCSgFI    PC_NCSgFI    8
qryp        NCSgFI           NASgFD       3,4
mTar        NCSgMI           NCPlbMI      2
53          NCSgMI           Rnu          9
Table 5-8: A sample of errors in the ungrammatized tests
  • 56.
Word        Truth tag            System tag           Type
kmHwr       PPr_NCSgMI           NCSgMI               8
bSnaEp      PPr_NCSgFI           PPr_NCSgMD           4,13
stqwm       PA_VISg3F            PA_VISg3M            13
Tyran       NCSgMI               NP                   14
kEaml       PPr_NCSgMI           NASgMI               8
rqmyn       NCDuMI               NCSgMI               10
wtzyd       PC_VISg3F            PC_PA                8
>rbaHha     NCPlbMI_NPrPSg3F     NCPlbFI_NPrPSg3F     13
bmtxSSyn    PPr_NCPlMI           PPr_NCSgMI           5
mkantha     NCSgFI_NPrPSg3F      NCPlfI_NPrPSg3F      5
kmrkz       PPr_NCSgMI           NCSgMI               8
wttkaml     PC_VISg3F            PC_NCSgMI            6
bd>         VPSg3M               PPr_NCSgMI           9
wtm         PC_VPSg3M            PC_NCSgMI            6
tkml        VISg3F               NCSgMI               6
wtEzz       PC_VISg3F            PC_NCSgMI            6
qryp        NCSgFI               NCSgFD               4
ykml        VISg3F               VISg3M               13
sahm        VPSg3M               NCSgMI               6
t$ark       VISg3F               VPSg3M               13
alxarTp     NCSgFD               NCSgFI               4
vany        NNuORSgMI            NP                   9
mltqY       NCSgMI               NASgMI               3
alTa}rat    NCPlfI               NCPlfD               11
wqTE        PC_NCPlbMI           PC_NCSgMI            2
bal>mm      PPr_NCPlbFD          PPr_NCPlbMD          13
mEZm        NASgMI               NCSgMI               3
qd          Pdt                  Pcr                  15
tDr         VISg3F               NASgMI               6
wtDEf       PC_VISg3F            PC_NCSgMI            6
<mdadat     NCPlfI               NCPlfD               4
Table 5-8: A sample of errors in the ungrammatized tests (continued)
5.3 Discussion Most of the errors can be categorized as follows: 1. Errors in the case of the word are the most frequent. 2. Unknown proper nouns (of people and places) cannot be guessed; only a few rules may lead to recognizing a proper noun.
  • 57. 3. Distinguishing sound masculine plurals from dual nouns is not easy for unknown nouns in the genitive and accusative cases. 4. Some forms of the broken plural are intermixed with other noun forms and are not always easily distinguished, since the processed text is not vocalized. The above notes can be drawn from Table 5-7, where it is easily seen that grammar contributes the largest portion of the errors (almost half of them). Next comes the broken-plural problem, which accounts for about 10% of the errors, then the distinction between adjectives and nouns, also close to 10%. After that come the problem of proper names (names of people, cities, countries, etc.), at almost 10%, and the problem of past-tense verbs, at about 9%. Composite tags and adverbial particles contribute about 5% each, and the remaining error types contribute insignificantly to the overall error percentage. Each of the error types with a large contribution is justified and expected, although the order and exact rates were not expected to turn out as they did. We think the following factors were the leading ones: 1. The grammatical errors are partly due to the fact that some tags do not reflect the case of the word, which makes it hard for the learner to infer why the following word was given its tag; examples are proper nouns, relative pronouns, and demonstrative pronouns. Giving case information to these tags is expected to help solve this problem, but it would drastically increase the already large tagset, a step that we preferred to avoid at present but which is a proper consideration for future work. It is worth mentioning that most of the words erroneously tagged for this reason are otherwise correctly tagged (i.e. the information about category, number, gender, and definiteness is correct). 2.
The size of the corpus affected the accuracy of the results, and the error rate was in fact inflated by two other factors: first, the corpus had to be split into three portions to perform cross-validation; and second, the Brill tagger splits the training corpus again into two halves, one to derive lexical rules and the other to derive contextual rules. So, starting with a corpus of about 38,000 words, each test uses about 25,000 words for training and 13,000 for evaluation, and the training part is divided into two parts of about 12,500
  • 58. words each, for lexical and contextual learning respectively. Had we had a ready corpus to work with, matters would have been different, and we are confident we would have obtained better results. This is supported by three separate cross-validation experiments in which the training corpus was enlarged by about 6,000 words, leading to about a 2% increase in the accuracy of the system, as shown in Table 5-3; the value may not look very great, but it at least indicates the trend. 3. Lack of vocalization also makes it hard to distinguish between some forms of the past-tense verbs, and between them and some nouns. In this case, tagging accuracy relies primarily on the statistical information captured in the lexicon for known words, and on context for unknown words. But it should be remembered that lack of vocalization is not in itself a disadvantage of the corpus; rather, it is an advantage, for the following reasons: a. The input text to the tagger is rarely expected to be vocalized, since vocalization is not common in most MSA writing. b. Vocalization puts an extra burden on the user of the system. c. Getting good results despite the lack of vocalization is a credit to the system, and a sign of overcoming the problem of ambiguity without relying on the user to disambiguate words by vocalization. 5.4 Evaluation Compared with other reported results, the results we obtained may look low; for example, Diab et al. [10] reported 95.4% accuracy, and Khoja [17] reported 90% disambiguation accuracy. But studying those works, we notice that the first dealt with a very small tagset (24 tags) based on an English tagset, while the second did not specify precisely the size of the tagset used; rather, it discusses three different levels of tagging with tagsets of 5, 35, and 131 tags, and states that the smaller tagset was used for initial tagging. This means that the tagset she used contains at most 35 tags.
Consulting her website [37], however, one concludes that the tagging was done using the 5-tag set. The other problem with her results is the claim that "the statistical tagger achieved an accuracy of around 90% when disambiguating ambiguous words" [17]; but checking the statistics she offers, we find that
  • 59. ambiguous words comprise at most 3% of the test corpora, and the performance on the rest of the corpora is not reported. So, taking into consideration the large and rich tagset we worked with, and the unavailability of a standard truth corpus tagged with the same tagset, we think the results obtained here are very promising, and the best obtained for such a tagset. 5.5 Accomplishments In this work we achieved the following:  Revised the Khoja tagset to satisfy our needs and to remove some of its limitations. It was expected that this revision would cause some drop in the accuracy of the tagger, and we were willing to accept that; gladly enough, the accuracy of the new system turned out to be slightly higher.  Prepared a manually tagged corpus of moderate size, tagged with a rich and comprehensive tagset that we consider the best available for Arabic, and which we recommend as the basis for a standard Arabic morphosyntactic tagset. The corpus we tagged is about 38,000 words, far exceeding in size the only POS-tagged Arabic corpus we know of, a 1,700-word corpus prepared by Ms. Khoja. In fact we prepared several versions of this corpus, as follows: o One tagged with the original tagset. o A second tagged with our modified tagset. o A third tagged with the modified tagset but excluding syntactic (grammar) features. o All of the above corpora are available in both Arabic characters and transliterated form.  Adapted the Brill transformation-rule tagger to work with the above corpora, yielding the first complete tagger for Arabic, with what we believe is a very promising accuracy of 75-84% depending on the tagset used.  Prepared, in parallel with the corpus, a tagged lexicon for Arabic, which should help researchers in NLP tasks for Arabic.