P02- Towards a New Arabic Corpus of Dyslexic Texts

Towards a New Arabic Corpus of Dyslexic Texts
Maha Alamri
Elp003@bangor.ac.uk
William John Teahan
W.J.Teahan@bangor.ac.uk
School of Computer Science.
Bangor University.

Outline
• Introduction.
• Arabic Corpus of Dyslexic Texts.
• Towards Automatic Correction of Dyslexic Errors.
• Conclusion.
THE 2ND WORKSHOP ON ARABIC CORPORA AND PROCESSING TOOLS
LREC2016

Introduction
This presentation covers the creation of a new Arabic corpus of texts written by people with dyslexia, and software for automatically correcting spelling errors in such texts.
Dyslexia:
• The term comes from the Greek ‘dys-’, meaning difficulty with, and ‘-lexia’, meaning language or word.
• Difficulty mastering the use of written language, including problems with comprehension.
• 1 in 10 people have dyslexia.

Introduction
The main area of interest is the overlap between the fields illustrated:
[Diagram labels: Dyslexia Arabic Corpus; Automatic spelling correction]
Automatic spelling correction denotes the way in which a program identifies a misspelled word and then alters it to its correct form.

Spelling Errors
Common Spelling Errors (Damerau, 1964):
• Additional letters, e.g. unniverse.
• Omitted letters, e.g. univrse.
• Substituted letters, e.g. umiverse.
• Swapped letters, e.g. uinverse.
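The four error types above can be told apart mechanically when a misspelling is a single edit away from the intended word. The following is a minimal sketch under that assumption; the function name `classify_single_error` and its labels are illustrative and do not come from the presentation.

```python
def classify_single_error(typed: str, intended: str) -> str:
    """Classify a misspelling that is one edit away from the intended word.

    Returns "insertion", "omission", "substitution" or "transposition",
    or "other" when no single edit explains the difference.
    """
    if typed == intended:
        return "other"
    # Additional letter: deleting one letter from `typed` gives `intended`.
    if len(typed) == len(intended) + 1:
        for i in range(len(typed)):
            if typed[:i] + typed[i + 1:] == intended:
                return "insertion"
    # Omitted letter: deleting one letter from `intended` gives `typed`.
    if len(typed) + 1 == len(intended):
        for i in range(len(intended)):
            if intended[:i] + intended[i + 1:] == typed:
                return "omission"
    if len(typed) == len(intended):
        diffs = [i for i, (a, b) in enumerate(zip(typed, intended)) if a != b]
        # Substituted letter: exactly one position differs.
        if len(diffs) == 1:
            return "substitution"
        # Swapped letters: two adjacent positions hold each other's letters.
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and typed[diffs[0]] == intended[diffs[1]]
                and typed[diffs[1]] == intended[diffs[0]]):
            return "transposition"
    return "other"


for misspelling in ["unniverse", "univrse", "umiverse", "uinverse"]:
    print(misspelling, classify_single_error(misspelling, "universe"))
```

Words containing more than one error fall into "other" here; handling them would need a full edit-distance alignment.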

Dyslexia Spelling Errors
• Words containing silent letters (e.g. knife).
• Morphemes that change when affixes are added: explain – explanation.
• Difficulty relating the sound of a word to how it is spelt.
• Difficulty retaining orthographic symbols in memory, which makes it hard for dyslexic writers to remember the correct order of letters in a word.

Spelling errors by Arabic writers with dyslexia
• Phonetic errors.
• Irregular spelling rules.
• Word omission.
• Hamza.
• Long vowels.
• Exchanging consonants.
• Difficulty in writing letters in the correct shape.
• Words spelt according to how the writer hears them in the local spoken dialect.

Arabic Corpus of Dyslexic Texts
Misspelling rates are noticeably higher in texts written by children. The texts were therefore collected from female primary school students who have been professionally diagnosed with dyslexia and are taught in resource rooms.
The Bangor Dyslexic Arabic Corpus (BDAC) is a preliminary version, which aims to investigate whether a corpus can be used as an aid for Arabic dyslexic writers.
BDAC information:
Text: Writing exercises (homework).
Size: 1067 words containing 694 errors.
Year: 2013.
Language: Arabic.
Country of production: Saudi Arabia (Riyadh).

Example Dyslexic Text
Screenshot of a scanned image of one of the texts written by a dyslexic
female child (nine years old).

Example Dyslexic Text
This example includes basic errors as below:

Analysis of the BDAC errors
3. Substitution (47 times), commonly found in:
replacing (Heh - ه) with (Teh Marbuta - ة) or vice versa;
replacing (Heh - ه) or (Teh Marbuta - ة) with (Teh - ت) or vice versa;
exchanging (Dad - ض) with (Zah - ظ) or vice versa.
4. Transposition (19 times).
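The substitution pairs listed above suggest one way to generate correction candidates: swap each confusable letter for the other members of its group. The sketch below illustrates that idea only. The grouping into `CONFUSION_GROUPS`, the function names and the example word are assumptions made for illustration, not part of the BDAC analysis or the TMT software.

```python
# Confusable letter groups taken from the substitution errors listed above.
CONFUSION_GROUPS = [
    {"\u0647", "\u0629", "\u062a"},  # Heh, Teh Marbuta, Teh
    {"\u0636", "\u0638"},            # Dad, Zah
]


def alternatives(ch: str) -> list[str]:
    """Return the letters that `ch` is commonly confused with."""
    for group in CONFUSION_GROUPS:
        if ch in group:
            return sorted(group - {ch})
    return []


def single_substitution_candidates(word: str) -> list[str]:
    """Return every word obtained by one confusable-letter substitution."""
    candidates = []
    for i, ch in enumerate(word):
        for alt in alternatives(ch):
            candidates.append(word[:i] + alt + word[i + 1:])
    return candidates


# Hypothetical example: "school" written with a final Heh
# instead of a final Teh Marbuta.
misspelt = "\u0645\u062f\u0631\u0633\u0647"
for candidate in single_substitution_candidates(misspelt):
    print(candidate)
```

A candidate list like this still needs a language model or a dictionary to decide which candidate, if any, is the intended word.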


Towards Automatic Correction of Dyslexic Errors
The main tool employed was the Text Mining Toolkit (TMT), a software package designed for compression-based language modelling, text categorisation, text correction and text segmentation.
The toolkit was used to correct a small number of the dyslexic errors, using a method similar to the one that Alhawiti (2014) found effective for correcting errors in Arabic OCR text.

Towards Automatic Correction of Dyslexic Errors
First, a large training corpus of Arabic text was needed to train the compression-based language model created by the toolkit. After a review of suitable corpora, the Bangor Arabic Compression Corpus (BACC), created by Dr. Khaled Alhawiti, was chosen.
Due to current limitations of the TMT software, correction was applied only to one-to-one character errors, using the toolkit’s markup correction capabilities, which find the most probable corrected sequence given the compression-based language model.
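The idea of choosing the most probable corrected sequence under a language model can be sketched with a much simpler model than the one in TMT. The example below uses an add-one smoothed character bigram model as a stand-in and scores each string by its approximate code length in bits (the negative log2 probability). It is not the TMT implementation; the class, the function and the toy training text are all hypothetical.

```python
import math
from collections import Counter


class CharBigramModel:
    """Add-one smoothed character bigram model (a simple stand-in)."""

    def __init__(self, training_text: str):
        words = training_text.split()
        # "^" and "$" mark the start and end of a word.
        self.alphabet = set("".join(words)) | {"^", "$"}
        self.bigrams = Counter()
        self.contexts = Counter()
        for word in words:
            padded = "^" + word + "$"
            for a, b in zip(padded, padded[1:]):
                self.bigrams[(a, b)] += 1
                self.contexts[a] += 1

    def cost_bits(self, word: str) -> float:
        """Approximate code length of `word` in bits; lower is more probable."""
        padded = "^" + word + "$"
        v = len(self.alphabet)
        return -sum(
            math.log2((self.bigrams[(a, b)] + 1) / (self.contexts[a] + v))
            for a, b in zip(padded, padded[1:])
        )


def best_correction(model: CharBigramModel, observed: str,
                    candidates: list[str]) -> str:
    """Return the cheapest string among the observed word and its candidates."""
    return min([observed, *candidates], key=model.cost_bits)


# Hypothetical toy training text: "school" twice and "library" once.
SCHOOL = "\u0645\u062f\u0631\u0633\u0629"
LIBRARY = "\u0645\u0643\u062a\u0628\u0629"
model = CharBigramModel(f"{SCHOOL} {SCHOOL} {LIBRARY}")

observed = SCHOOL[:-1] + "\u0647"  # final Teh Marbuta written as Heh
candidates = [SCHOOL, SCHOOL[:-1] + "\u062a"]
print(best_correction(model, observed, candidates) == SCHOOL)
```

Keeping the observed word among the options means a correctly spelt word is left unchanged whenever no candidate is cheaper. A real system would train on a large corpus such as BACC rather than three words.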

Experimental Results
All errors involving more than one character were removed.
BDAC Corpus:
Text: 1067 words
Errors: 694
One-to-one character errors: 280

Experimental Results
            Errors  Correct
Word           153       99
Sentences       80       49
Paragraphs      47       39
Total          280      187
The TMT software was able to correct more than half of the one-to-one character errors (187 of 280).
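The counts reported on this slide can be turned into correction rates with a few lines of arithmetic. The sketch below only restates the reported numbers; the variable and function names are illustrative.

```python
# (errors, corrected) counts as reported on this slide.
results = {
    "Word": (153, 99),
    "Sentences": (80, 49),
    "Paragraphs": (47, 39),
}


def correction_rate(errors: int, corrected: int) -> float:
    """Fraction of errors that were corrected."""
    return corrected / errors


total_errors = sum(errors for errors, _ in results.values())
total_corrected = sum(corrected for _, corrected in results.values())
overall_rate = correction_rate(total_errors, total_corrected)

for name, (errors, corrected) in results.items():
    print(f"{name}: {correction_rate(errors, corrected):.2%}")
print(f"Total: {total_corrected}/{total_errors} = {overall_rate:.2%}")
```

The per-category counts sum to the reported totals (280 errors, 187 corrected), which gives an overall correction rate of roughly two thirds.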

Conclusion
• The corpus used in this study offers a useful platform for analysing dyslexic text.
• It provides a better understanding of how these errors occur and the factors that determine them, and is therefore suitable for assisting dyslexic writers.
• The corpus can serve as a platform for other researchers to build upon.
• A preliminary investigation into automatic processing techniques as a form of assistance for Arabic dyslexic writers achieved some initial success in automatically correcting dyslexic errors in Arabic text.
• Future work: extending the corpus to include more text for analysis will require considerably more resources and effort.

Thank you.
Any questions?
