By:
Maha Alamri and William John Teahan
Abstract
This paper presents a detailed account of the preliminary work for the creation of a new Arabic corpus of dyslexic text. The analysis of errors found in the corpus revealed that there are four types of spelling errors made as a result of dyslexia in addition to four common spelling errors. The subsequent aim was to develop a spellchecker capable of automatically correcting the spelling mistakes of dyslexic writers in Arabic texts using statistical techniques. The purpose was to provide a tool to assist Arabic dyslexic writers. Some initial success was achieved in the automatic correction of dyslexic errors in Arabic text.
P02- Towards a New Arabic Corpus of Dyslexic Texts
1. Towards A New Arabic Corpus of Dyslexic Texts
Maha Alamri
Elp003@bangor.ac.uk
William John Teahan
W.J.Teahan@bangor.ac.uk
School of Computer Science.
Bangor University.
2. Outline
Introduction.
Arabic Corpus of Dyslexic Texts.
Towards Automatic Correction of Dyslexic Errors.
Conclusion.
THE 2ND WORKSHOP ON ARABIC CORPORA AND PROCESSING TOOLS
LREC2016
3. Introduction
The focus of this presentation is the creation of a new
Arabic corpus of texts written by dyslexics and software for
automatic spelling correction for Arabic texts written by
dyslexics.
Dyslexia:
The term has its roots in the Greek prefix ‘dys-’, meaning difficulty with, and
‘-lexia’, which means language or word.
An inability to master the use of written language, including issues
with comprehension.
1 in 10 people have dyslexia.
4. Introduction
The main area of interest lies in the zone of convergence represented by
the overlap area as illustrated:
Dyslexia Arabic Corpus
Automatic spelling correction
The term "automatic spelling correction" denotes the way in which a
misspelled word is identified by a program and is then altered to its
correct form.
5. Spelling Errors
Common Spelling Errors (Damerau, 1964):
Additional letters e.g. unniverse.
Omitted letters e.g. univ rse.
Substituted letters e.g. umiverse.
Swapped letters e.g. uinverse.
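The four categories above can be told apart mechanically when a misspelling is a single edit away from the intended word. The following is a minimal sketch, not part of the original work: the function name `classify_single_edit` and its return labels are hypothetical, and the omitted-letter example is written as `univrse` on the assumption that the gap on the slide only marks the missing letter.

```python
def classify_single_edit(correct: str, typed: str) -> str:
    """Name the single-edit error type that turns `correct` into `typed`."""
    if correct == typed:
        return "none"
    n, m = len(correct), len(typed)
    # Find the first position where the two words differ.
    i = 0
    while i < min(n, m) and correct[i] == typed[i]:
        i += 1
    # One extra letter in the typed word.
    if m == n + 1 and correct[i:] == typed[i + 1:]:
        return "additional"
    # One letter missing from the typed word.
    if m == n - 1 and correct[i + 1:] == typed[i:]:
        return "omitted"
    if m == n:
        # Same length, exactly one letter differs.
        if correct[i + 1:] == typed[i + 1:]:
            return "substituted"
        # Same length, two adjacent letters exchanged.
        if (i + 1 < n
                and correct[i] == typed[i + 1]
                and correct[i + 1] == typed[i]
                and correct[i + 2:] == typed[i + 2:]):
            return "swapped"
    # Anything needing more than one edit is outside the four categories.
    return "other"


for typed in ["unniverse", "univrse", "umiverse", "uinverse"]:
    print(typed, classify_single_edit("universe", typed))
```

Errors that need more than one edit fall through to `"other"`, which is roughly where many dyslexic misspellings would be expected to land.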
6. Dyslexia Spelling Errors
Words that contain silent letters (e.g. knife).
Morphemes that change when affixes are added:
explain – explanation.
Dyslexic writers struggle with the relationship between the
sound of a word and how it is spelt.
The inability to retain orthographic symbols in memory makes it
difficult for dyslexics to remember the right order of letters in a word.
7. Spelling errors by Arabic writers with dyslexia
Phonetic errors.
Irregular spelling rules.
Word omission.
Hamza.
Long vowel.
Exchanging consonants.
Difficulty in writing the letters in the correct shape.
Arabic words are spelt according to how the writer hears them in the
local spoken dialect.
8. Arabic Corpus of Dyslexic Texts
The rate of misspellings in text is noticeably higher in the case of
children. Therefore, the texts were collected from female primary school
students who have been professionally diagnosed with dyslexia and are
taught in resource rooms.
BDAC information:
Text: Writing exercises (Homework).
Size: 1067 words containing 694 errors.
Year: 2013.
Language: Arabic.
Country of production: Saudi Arabia (Riyadh).
The Bangor Dyslexic Arabic Corpus (BDAC) is a preliminary version, which
aims to investigate the possibility of a corpus being used as an aid for
Arabic dyslexic writers.
9. Example Dyslexic Text
Screenshot of a scanned image of one of the texts written by a dyslexic
female child (nine years old).
10. Example Dyslexic Text
This example includes basic errors as below:
11. Analysis of the BDAC errors
12. Analysis of the BDAC errors
3. Substitution (47 times), commonly found in:
replacement of (Heh - ه) with (Teh Marbuta - ة) or
vice versa, exchanging (Heh - ه or Teh Marbuta - ة)
with the letter (Teh - ت) or vice versa, and
exchanging the letter (Dad - ض) with (Zah - ظ) or
vice versa.
4. Transposition (19 times).
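The substitution pairs listed above can be turned into candidate corrections by swapping each confusable letter for its counterparts. The sketch below is only an illustration under stated assumptions: the names `CONFUSABLE_GROUPS` and `substitution_candidates` are hypothetical, and the sample word is a made-up example rather than an entry from the BDAC.

```python
# Letter groups taken from the substitution errors listed above.
CONFUSABLE_GROUPS = [
    {"\u0647", "\u0629", "\u062A"},  # Heh, Teh Marbuta, Teh
    {"\u0636", "\u0638"},            # Dad, Zah
]


def substitution_candidates(word):
    """Return every word obtained by replacing one letter with a confusable one."""
    candidates = set()
    for pos, ch in enumerate(word):
        for group in CONFUSABLE_GROUPS:
            if ch in group:
                for alt in group - {ch}:
                    candidates.add(word[:pos] + alt + word[pos + 1:])
    return candidates


# Hypothetical example: "school" written with a final Heh instead of Teh Marbuta.
misspelt = "\u0645\u062F\u0631\u0633\u0647"
for candidate in sorted(substitution_candidates(misspelt)):
    print(candidate)
```

A candidate list like this does not say which alternative is right; choosing among them needs a dictionary or a language model.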
17. Towards Automatic Correction of Dyslexic Errors
The main tool employed was the Text Mining Toolkit (TMT).
TMT is a software package designed for compression-based
language modelling, text categorisation, text correction,
and text segmentation.
The toolkit was used to correct a small number of the
dyslexic errors using a method similar to the one
described by Alhawiti (2014), which was found effective for
the correction of errors in Arabic OCR text.
18. Towards Automatic Correction of Dyslexic Errors
First, it was crucial to choose a large training corpus of
Arabic text to train the compression-based language model
created by the toolkit. After researching suitable corpora,
the Bangor Arabic Compression Corpus (BACC) created by
Dr. Khaled Alhawiti was chosen.
Due to the current limitations of the TMT software, the
correction of the dyslexic texts was applied only to one-to-one
character errors, using the toolkit’s markup correction
capabilities to find the most probable corrected sequence
given the compression-based language model.
19. Experimental Results
All errors containing more than one character were removed.
BDAC Corpus:
Text: 1067 words.
Errors: 694.
One-to-one character errors: 280.
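To put the three counts in proportion, the ratios can be worked out directly. The short calculation below uses only the figures reported for the BDAC; the variable names are illustrative, and the first ratio is errors per word of text, not the share of words that are misspelt.

```python
corpus_words = 1067        # words in the BDAC
total_errors = 694         # errors found in the corpus
one_to_one_errors = 280    # errors kept for the correction experiment

errors_per_word = total_errors / corpus_words
one_to_one_share = one_to_one_errors / total_errors

print(f"Errors per word of text: {errors_per_word:.1%}")         # 65.0%
print(f"Share of errors that are one-to-one: {one_to_one_share:.1%}")  # 40.3%
```

So the experiment covers roughly two fifths of the errors in the corpus, and the remaining errors are outside what the one-to-one correction step can address.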
21. Conclusion
The corpus used in this study offers a useful platform for analysing
dyslexic text.
It provides a better understanding of the occurrence of these errors
and the factors determining such occurrences and therefore it is
suitable for assisting dyslexic writers.
This corpus can serve as a platform for other researchers to build upon.
A preliminary investigation was undertaken into using automatic
processing techniques as a form of assistance for Arabic dyslexic writers
and some initial success was achieved in the automatic correction of
dyslexic errors in Arabic text.
Future work will require considerably more resources and effort to
extend the corpus to include more text for analysis.
“A room in an ordinary school which students with special needs attend for a period of not more than a half of the school day for the purpose of receiving special education services from a special education teacher.” (Ministry of Education of Saudi Arabia, 2002)
Essentially, markup correction is employed to identify the corrected text that has the highest probability given the output that was observed.
The TMT markup routines are two aspects of the markup transformation that help to establish whether the markup model can be realised.