Towards A New Arabic Corpus of Dyslexic Texts
Maha Alamri
Elp003@bangor.ac.uk
William John Teahan
W.J.Teahan@bangor.ac.uk
School of Computer Science,
Bangor University.
Outline
• Introduction.
• Arabic Corpus of Dyslexic Texts.
• Towards Automatic Correction of Dyslexic Errors.
• Conclusion.
THE 2ND WORKSHOP ON ARABIC CORPORA AND PROCESSING TOOLS
LREC2016
Introduction
This presentation covers two things: the creation of a new Arabic corpus
of texts written by people with dyslexia, and software for automatically
correcting spelling errors in such texts.
Dyslexia:
• The term derives from the Greek ‘dys-’, meaning difficulty with, and
‘-lexia’, meaning language or word.
• An inability to master the use of written language, including problems
with comprehension.
• 1 in 10 people have dyslexia.
Introduction
The main area of interest is the overlap between the areas illustrated:
[Diagram: overlapping areas labelled Dyslexia, Arabic Corpus and Automatic spelling correction.]
Automatic spelling correction denotes the way in which a misspelled word
is identified by a program and then altered to its correct form.
Spelling Errors
Common Spelling Errors (Damerau, 1964):
• Additional letters, e.g. unniverse.
• Omitted letters, e.g. univrse.
• Substituted letters, e.g. umiverse.
• Swapped letters, e.g. uinverse.
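The four categories above can be told apart mechanically when a misspelling differs from the intended word by a single edit. The following is a minimal sketch, not code from the presentation; the function name `classify_single_error` and the category labels are illustrative assumptions.

```python
def classify_single_error(wrong: str, right: str) -> str:
    """Name the single edit that turns `right` into `wrong` (hypothetical helper).

    Returns "insertion", "omission", "substitution", "transposition",
    "none" (identical strings) or "other" (more than one edit).
    """
    if wrong == right:
        return "none"
    lw, lr = len(wrong), len(right)
    if lw == lr + 1:
        # Additional letter: deleting one letter of `wrong` gives `right`.
        if any(wrong[:i] + wrong[i + 1:] == right for i in range(lw)):
            return "insertion"
    elif lw + 1 == lr:
        # Omitted letter: deleting one letter of `right` gives `wrong`.
        if any(right[:i] + right[i + 1:] == wrong for i in range(lr)):
            return "omission"
    elif lw == lr:
        diffs = [i for i in range(lw) if wrong[i] != right[i]]
        if len(diffs) == 1:
            return "substitution"
        # Swapped letters: two adjacent positions hold each other's letter.
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and wrong[diffs[0]] == right[diffs[1]]
                and wrong[diffs[1]] == right[diffs[0]]):
            return "transposition"
    return "other"


for misspelling in ["unniverse", "univrse", "umiverse", "uinverse"]:
    print(misspelling, classify_single_error(misspelling, "universe"))
```

Errors involving more than one edit fall into "other"; a fuller treatment would compute a Damerau-style edit distance instead.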
Dyslexia Spelling Errors
• Words that contain silent letters (e.g. knife).
• Morphemes that change when affixes are added, e.g.
explain – explanation.
• Dyslexic writers struggle with the relationship between the sound of a
word and how it is spelt.
• Difficulty retaining orthographic symbols in memory makes it hard for
dyslexic writers to remember the right order of letters in a word.
Spelling errors by Arabic writers with dyslexia
• Phonetic errors.
• Irregular spelling rules.
• Word omission.
• Hamza.
• Long vowels.
• Exchanging consonants.
• Difficulty in writing the letters in the correct shape.
• Spelling an Arabic word according to how it sounds in the local spoken
dialect.
Arabic Corpus of Dyslexic Texts
The rate of misspellings is noticeably higher in texts written by
children. The texts were therefore collected from female primary school
students who had been professionally diagnosed with dyslexia and were
taught in resource rooms.
BDAC information:
Text: Writing exercises (Homework).
Size: 1067 words containing 694 errors.
Year: 2013.
Language: Arabic.
Country of production: Saudi Arabia (Riyadh).
The Bangor Dyslexic Arabic Corpus (BDAC) is a preliminary version, which
aims to investigate the possibility of a corpus being used as an aid for
Arabic dyslexic writers.
Example Dyslexic Text
Screenshot of a scanned image of one of the texts written by a dyslexic
female child (nine years old).
Example Dyslexic Text
This example includes basic errors as below:
Analysis of the BDAC errors
3. Substitution (47 times), most commonly:
   • Heh (ه) replaced with Teh Marbuta (ة), or vice versa;
   • Heh (ه) or Teh Marbuta (ة) replaced with Teh (ت), or vice versa;
   • Dad (ض) exchanged with Zah (ظ), or vice versa.
4. Transposition (19 times).
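The substitution patterns listed above can be treated as groups of mutually confusable letters, from which candidate spellings of a word are generated. This is a rough sketch under that assumption; `CONFUSION_GROUPS`, `alternatives` and `candidates` are hypothetical names and are not part of the BDAC tooling.

```python
from itertools import product

# Confusable letter groups, taken from the substitution patterns listed above.
CONFUSION_GROUPS = [
    {"\u0647", "\u0629", "\u062A"},  # Heh, Teh Marbuta, Teh
    {"\u0636", "\u0638"},            # Dad, Zah
]


def alternatives(ch: str) -> list[str]:
    """Return every letter that may stand in for `ch`, including `ch` itself."""
    for group in CONFUSION_GROUPS:
        if ch in group:
            return sorted(group)
    return [ch]


def candidates(word: str) -> list[str]:
    """Enumerate all spellings reachable by swapping confusable letters."""
    return ["".join(chars) for chars in product(*(alternatives(c) for c in word))]


# A word written with a final Heh yields three candidates,
# one of which ends in Teh Marbuta.
for cand in candidates("مدرسه"):
    print(cand)
```

The number of candidates grows multiplicatively with the number of confusable letters in a word, so a real system would score or prune candidates rather than enumerate them all.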
Towards Automatic Correction of Dyslexic Errors
The main tool employed was the Text Mining Toolkit (TMT), a software
package designed for compression-based language modelling, text
categorisation, text correction and text segmentation.
The toolkit was used to correct a small number of the dyslexic errors
with a method similar to the one described by Alhawiti (2014), which was
found effective for correcting errors in Arabic OCR text.
Towards Automatic Correction of Dyslexic Errors
First, a large training corpus of Arabic text had to be chosen to train
the compression-based language model created by the toolkit. After a
review of suitable corpora, the Bangor Arabic Compression Corpus (BACC),
created by Dr. Khaled Alhawiti, was chosen.
Due to current limitations of the TMT software, correction of the
dyslexic texts was applied only to one-to-one character errors. It used
the toolkit's markup correction capability, which finds the most
probable corrected sequence given the compression-based language model.
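The idea of finding the most probable corrected sequence under a character-level language model can be sketched as follows. This is not the TMT implementation: as a stand-in for its compression-based model, the sketch assumes a character bigram model with add-one smoothing, and the names `train_bigram`, `log_prob` and `correct`, the confusion table and the toy training text are all illustrative.

```python
import math
from collections import Counter


def train_bigram(text):
    """Count character unigrams and bigrams in the training text."""
    return Counter(text), Counter(zip(text, text[1:]))


def log_prob(prev, ch, unigrams, bigrams):
    """Add-one smoothed log P(ch | prev); a stand-in for a compression model."""
    vocab = len(unigrams) + 1  # the extra 1 leaves mass for unseen characters
    return math.log((bigrams[(prev, ch)] + 1) / (unigrams[prev] + vocab))


def correct(text, confusions, unigrams, bigrams):
    """Viterbi search over one-to-one substitutions (length is preserved)."""
    # `best` maps the last character of a hypothesis to (score, hypothesis).
    best = {" ": (0.0, "")}  # a space acts as the start-of-text context
    for ch in text:
        options = [ch] + sorted(confusions.get(ch, ()))
        best = {
            opt: max(
                (score + log_prob(prev, opt, unigrams, bigrams), hyp + opt)
                for prev, (score, hyp) in best.items()
            )
            for opt in options
        }
    return max(best.values())[1]


# Toy training text (assumption); a real run would train on a large corpus.
unigrams, bigrams = train_bigram("المدرسة كبيرة. المدرسة جميلة. ")
confusions = {"ه": {"ة", "ت"}}  # Heh may stand for Teh Marbuta or Teh
print(correct("المدرسه", confusions, unigrams, bigrams))
```

Because every substitution replaces exactly one character with one other character, the search space is a lattice of fixed length. This is the one-to-one restriction described above; insertions, omissions and transpositions would need a different search.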
Experimental Results
All errors containing more than one character were removed.
BDAC Corpus:
Text (words): 1067
Errors: 694
One-to-one character errors: 280
Experimental Results
Level        Errors   Corrected
Word         153      99
Sentences    80       49
Paragraphs   47       39
Total        280      187
The TMT software corrected 187 of the 280 one-to-one character errors
(about 67%), i.e. more than half.
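The correction rates implied by these counts can be checked with a few lines of arithmetic. The sketch below assumes the chart's counts pair up by level as (errors, corrected) = Word (153, 99), Sentences (80, 49), Paragraphs (47, 39); this pairing is consistent with the reported totals of 280 and 187.

```python
# (errors, corrected) per level; the pairing is an assumption that matches the totals.
results = {
    "Word": (153, 99),
    "Sentences": (80, 49),
    "Paragraphs": (47, 39),
}
ALL_ERRORS = 694  # every error annotated in the BDAC


def rate(corrected: int, errors: int) -> float:
    """Percentage of errors that were corrected."""
    return 100.0 * corrected / errors


total_errors = sum(errors for errors, _ in results.values())
total_corrected = sum(corrected for _, corrected in results.values())

for level, (errors, corrected) in results.items():
    print(f"{level:<12}{corrected:>4} / {errors:<4}{rate(corrected, errors):6.1f}%")
print(f"{'Total':<12}{total_corrected:>4} / {total_errors:<4}"
      f"{rate(total_corrected, total_errors):6.1f}%")
print(f"One-to-one share of all errors: {rate(total_errors, ALL_ERRORS):.1f}%")
```

The last line puts the result in context: the one-to-one errors that the toolkit could attempt are roughly 40% of the 694 errors in the corpus.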
Conclusion
• The corpus used in this study offers a useful platform for analysing
dyslexic text.
• It gives a better understanding of how these errors occur and of the
factors behind them, and is therefore suitable for assisting dyslexic
writers.
• The corpus can serve as a platform for other researchers to build upon.
• A preliminary investigation into using automatic processing techniques
to assist Arabic dyslexic writers achieved some initial success in
automatically correcting dyslexic errors in Arabic text.
• Future work: extending the corpus with more text for analysis will
require considerably more resources and effort.
Thank you.
Any questions?
THE 2ND WORKSHOP ON ARABIC CORPORA AND PROCESSING TOOLS
LREC2016
22

Weitere ähnliche Inhalte

Ähnlich wie P02- Towards a New Arabic Corpus of Dyslexic Texts

Exploring the effects of stemming on
Exploring the effects of stemming onExploring the effects of stemming on
Exploring the effects of stemming onijaia
 
DEVELOPING A SIMPLIFIED MORPHOLOGICAL ANALYZER FOR ARABIC PRONOMINAL SYSTEM
DEVELOPING A SIMPLIFIED MORPHOLOGICAL ANALYZER FOR ARABIC PRONOMINAL SYSTEMDEVELOPING A SIMPLIFIED MORPHOLOGICAL ANALYZER FOR ARABIC PRONOMINAL SYSTEM
DEVELOPING A SIMPLIFIED MORPHOLOGICAL ANALYZER FOR ARABIC PRONOMINAL SYSTEMkevig
 
Final quantitative analysis of egyptian aphorisms by using r
Final quantitative analysis of egyptian aphorisms by using rFinal quantitative analysis of egyptian aphorisms by using r
Final quantitative analysis of egyptian aphorisms by using rAlexandria University
 
The Arabic Speech Database: PADAS
The Arabic Speech Database: PADASThe Arabic Speech Database: PADAS
The Arabic Speech Database: PADASCSCJournals
 
Rule Based Transliteration Scheme for English to Punjabi
Rule Based Transliteration Scheme for English to PunjabiRule Based Transliteration Scheme for English to Punjabi
Rule Based Transliteration Scheme for English to Punjabikevig
 
RULE BASED TRANSLITERATION SCHEME FOR ENGLISH TO PUNJABI
RULE BASED TRANSLITERATION SCHEME FOR ENGLISH TO PUNJABIRULE BASED TRANSLITERATION SCHEME FOR ENGLISH TO PUNJABI
RULE BASED TRANSLITERATION SCHEME FOR ENGLISH TO PUNJABIijnlc
 
STANDARD ARABIC VERBS INFLECTIONS USING NOOJ PLATFORM
STANDARD ARABIC VERBS INFLECTIONS USING NOOJ PLATFORMSTANDARD ARABIC VERBS INFLECTIONS USING NOOJ PLATFORM
STANDARD ARABIC VERBS INFLECTIONS USING NOOJ PLATFORMijnlc
 
Concordancing 1
Concordancing 1Concordancing 1
Concordancing 1Hala Fawzi
 
XMODEL: An XML-based Morphological Analyzer for Arabic Language
XMODEL: An XML-based Morphological Analyzer for Arabic LanguageXMODEL: An XML-based Morphological Analyzer for Arabic Language
XMODEL: An XML-based Morphological Analyzer for Arabic LanguageWaqas Tariq
 
NEURAL SYMBOLIC ARABIC PARAPHRASING WITH AUTOMATIC EVALUATION
NEURAL SYMBOLIC ARABIC PARAPHRASING WITH AUTOMATIC EVALUATIONNEURAL SYMBOLIC ARABIC PARAPHRASING WITH AUTOMATIC EVALUATION
NEURAL SYMBOLIC ARABIC PARAPHRASING WITH AUTOMATIC EVALUATIONcscpconf
 
MoM2010: Arabic natural language processing
MoM2010: Arabic natural language processingMoM2010: Arabic natural language processing
MoM2010: Arabic natural language processingHend Al-Khalifa
 
Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...
Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...
Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...IJCI JOURNAL
 
Rule-Based Standard Arabic Phonetization at Phoneme, Allophone, and Syllable ...
Rule-Based Standard Arabic Phonetization at Phoneme, Allophone, and Syllable ...Rule-Based Standard Arabic Phonetization at Phoneme, Allophone, and Syllable ...
Rule-Based Standard Arabic Phonetization at Phoneme, Allophone, and Syllable ...CSCJournals
 
FURTHER INVESTIGATIONS ON DEVELOPING AN ARABIC SENTIMENT LEXICON
FURTHER INVESTIGATIONS ON DEVELOPING AN ARABIC SENTIMENT LEXICONFURTHER INVESTIGATIONS ON DEVELOPING AN ARABIC SENTIMENT LEXICON
FURTHER INVESTIGATIONS ON DEVELOPING AN ARABIC SENTIMENT LEXICONkevig
 
FURTHER INVESTIGATIONS ON DEVELOPING AN ARABIC SENTIMENT LEXICON
FURTHER INVESTIGATIONS ON DEVELOPING AN ARABIC SENTIMENT LEXICONFURTHER INVESTIGATIONS ON DEVELOPING AN ARABIC SENTIMENT LEXICON
FURTHER INVESTIGATIONS ON DEVELOPING AN ARABIC SENTIMENT LEXICONijnlc
 
A new hybrid metric for verifying
A new hybrid metric for verifyingA new hybrid metric for verifying
A new hybrid metric for verifyingcsandit
 
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATION
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATIONA ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATION
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATIONkevig
 

Ähnlich wie P02- Towards a New Arabic Corpus of Dyslexic Texts (20)

Exploring the effects of stemming on
Exploring the effects of stemming onExploring the effects of stemming on
Exploring the effects of stemming on
 
DEVELOPING A SIMPLIFIED MORPHOLOGICAL ANALYZER FOR ARABIC PRONOMINAL SYSTEM
DEVELOPING A SIMPLIFIED MORPHOLOGICAL ANALYZER FOR ARABIC PRONOMINAL SYSTEMDEVELOPING A SIMPLIFIED MORPHOLOGICAL ANALYZER FOR ARABIC PRONOMINAL SYSTEM
DEVELOPING A SIMPLIFIED MORPHOLOGICAL ANALYZER FOR ARABIC PRONOMINAL SYSTEM
 
Final quantitative analysis of egyptian aphorisms by using r
Final quantitative analysis of egyptian aphorisms by using rFinal quantitative analysis of egyptian aphorisms by using r
Final quantitative analysis of egyptian aphorisms by using r
 
The Arabic Speech Database: PADAS
The Arabic Speech Database: PADASThe Arabic Speech Database: PADAS
The Arabic Speech Database: PADAS
 
Rule Based Transliteration Scheme for English to Punjabi
Rule Based Transliteration Scheme for English to PunjabiRule Based Transliteration Scheme for English to Punjabi
Rule Based Transliteration Scheme for English to Punjabi
 
RULE BASED TRANSLITERATION SCHEME FOR ENGLISH TO PUNJABI
RULE BASED TRANSLITERATION SCHEME FOR ENGLISH TO PUNJABIRULE BASED TRANSLITERATION SCHEME FOR ENGLISH TO PUNJABI
RULE BASED TRANSLITERATION SCHEME FOR ENGLISH TO PUNJABI
 
STANDARD ARABIC VERBS INFLECTIONS USING NOOJ PLATFORM
STANDARD ARABIC VERBS INFLECTIONS USING NOOJ PLATFORMSTANDARD ARABIC VERBS INFLECTIONS USING NOOJ PLATFORM
STANDARD ARABIC VERBS INFLECTIONS USING NOOJ PLATFORM
 
Concordancing 1
Concordancing 1Concordancing 1
Concordancing 1
 
XMODEL: An XML-based Morphological Analyzer for Arabic Language
XMODEL: An XML-based Morphological Analyzer for Arabic LanguageXMODEL: An XML-based Morphological Analyzer for Arabic Language
XMODEL: An XML-based Morphological Analyzer for Arabic Language
 
Arabic spell checkers
Arabic spell  checkersArabic spell  checkers
Arabic spell checkers
 
NEURAL SYMBOLIC ARABIC PARAPHRASING WITH AUTOMATIC EVALUATION
NEURAL SYMBOLIC ARABIC PARAPHRASING WITH AUTOMATIC EVALUATIONNEURAL SYMBOLIC ARABIC PARAPHRASING WITH AUTOMATIC EVALUATION
NEURAL SYMBOLIC ARABIC PARAPHRASING WITH AUTOMATIC EVALUATION
 
C14-1028
C14-1028C14-1028
C14-1028
 
MoM2010: Arabic natural language processing
MoM2010: Arabic natural language processingMoM2010: Arabic natural language processing
MoM2010: Arabic natural language processing
 
Speech recognition for arabic
Speech recognition for arabicSpeech recognition for arabic
Speech recognition for arabic
 
Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...
Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...
Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...
 
Rule-Based Standard Arabic Phonetization at Phoneme, Allophone, and Syllable ...
Rule-Based Standard Arabic Phonetization at Phoneme, Allophone, and Syllable ...Rule-Based Standard Arabic Phonetization at Phoneme, Allophone, and Syllable ...
Rule-Based Standard Arabic Phonetization at Phoneme, Allophone, and Syllable ...
 
FURTHER INVESTIGATIONS ON DEVELOPING AN ARABIC SENTIMENT LEXICON
FURTHER INVESTIGATIONS ON DEVELOPING AN ARABIC SENTIMENT LEXICONFURTHER INVESTIGATIONS ON DEVELOPING AN ARABIC SENTIMENT LEXICON
FURTHER INVESTIGATIONS ON DEVELOPING AN ARABIC SENTIMENT LEXICON
 
FURTHER INVESTIGATIONS ON DEVELOPING AN ARABIC SENTIMENT LEXICON
FURTHER INVESTIGATIONS ON DEVELOPING AN ARABIC SENTIMENT LEXICONFURTHER INVESTIGATIONS ON DEVELOPING AN ARABIC SENTIMENT LEXICON
FURTHER INVESTIGATIONS ON DEVELOPING AN ARABIC SENTIMENT LEXICON
 
A new hybrid metric for verifying
A new hybrid metric for verifyingA new hybrid metric for verifying
A new hybrid metric for verifying
 
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATION
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATIONA ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATION
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATION
 

Mehr von iwan_rg

Automatic text simplification evaluation aspects
Automatic text simplification  evaluation aspectsAutomatic text simplification  evaluation aspects
Automatic text simplification evaluation aspectsiwan_rg
 
تلخيص كتاب مقدمة في معالجة اللغة العربية
تلخيص كتاب مقدمة في معالجة اللغة العربيةتلخيص كتاب مقدمة في معالجة اللغة العربية
تلخيص كتاب مقدمة في معالجة اللغة العربيةiwan_rg
 
Building theoretical models using structured equation modeling
Building theoretical models using structured equation modelingBuilding theoretical models using structured equation modeling
Building theoretical models using structured equation modelingiwan_rg
 
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshopورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshopiwan_rg
 
Introduction to Arabic natural language processing (Infographics)
Introduction to Arabic natural language processing (Infographics)Introduction to Arabic natural language processing (Infographics)
Introduction to Arabic natural language processing (Infographics)iwan_rg
 
Summary of Multilingual Natural Language Processing Applications: From Theory...
Summary of Multilingual Natural Language Processing Applications: From Theory...Summary of Multilingual Natural Language Processing Applications: From Theory...
Summary of Multilingual Natural Language Processing Applications: From Theory...iwan_rg
 
التقرير السنوي لمجموعة إيوان البحثية 1437هـ-1438هـ
التقرير السنوي لمجموعة إيوان البحثية 1437هـ-1438هـالتقرير السنوي لمجموعة إيوان البحثية 1437هـ-1438هـ
التقرير السنوي لمجموعة إيوان البحثية 1437هـ-1438هـiwan_rg
 
CHOOSING RESEARCH TOPICS AND WRITING RESEARCH PAPERS
CHOOSING RESEARCH TOPICS AND WRITING RESEARCH PAPERSCHOOSING RESEARCH TOPICS AND WRITING RESEARCH PAPERS
CHOOSING RESEARCH TOPICS AND WRITING RESEARCH PAPERSiwan_rg
 
التقرير السنوي لمجموعة إيوان البحثية 1436هـ-1437هـ
التقرير السنوي لمجموعة إيوان البحثية 1436هـ-1437هـالتقرير السنوي لمجموعة إيوان البحثية 1436هـ-1437هـ
التقرير السنوي لمجموعة إيوان البحثية 1436هـ-1437هـiwan_rg
 
مركز تميز الحوسبة العربية المتقدمة
مركز تميز  الحوسبة العربية المتقدمةمركز تميز  الحوسبة العربية المتقدمة
مركز تميز الحوسبة العربية المتقدمةiwan_rg
 
P05- DINA: A Multi-Dialect Dataset for Arabic Emotion Analysis
P05- DINA: A Multi-Dialect Dataset for Arabic Emotion Analysis P05- DINA: A Multi-Dialect Dataset for Arabic Emotion Analysis
P05- DINA: A Multi-Dialect Dataset for Arabic Emotion Analysis iwan_rg
 
P03- MANDIAC: A Web-based Annotation System For Manual Arabic Diacritization
P03- MANDIAC: A Web-based Annotation System For Manual Arabic Diacritization P03- MANDIAC: A Web-based Annotation System For Manual Arabic Diacritization
P03- MANDIAC: A Web-based Annotation System For Manual Arabic Diacritization iwan_rg
 
P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation
P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation
P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation iwan_rg
 
P01- Toward a rich Arabic Speech Parallel Corpus for Algerian sub-Dialects
P01- Toward a rich Arabic Speech Parallel Corpus for Algerian sub-Dialects P01- Toward a rich Arabic Speech Parallel Corpus for Algerian sub-Dialects
P01- Toward a rich Arabic Speech Parallel Corpus for Algerian sub-Dialects iwan_rg
 
Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...
Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...
Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...iwan_rg
 
OSACT2 LREC 2016 workshop proceedings
OSACT2 LREC 2016 workshop proceedingsOSACT2 LREC 2016 workshop proceedings
OSACT2 LREC 2016 workshop proceedingsiwan_rg
 
محاضرة المدونات اللغوية وأدواتها
محاضرة المدونات اللغوية وأدواتهامحاضرة المدونات اللغوية وأدواتها
محاضرة المدونات اللغوية وأدواتهاiwan_rg
 
لغويات المدونة الحاسوبية
لغويات المدونة الحاسوبيةلغويات المدونة الحاسوبية
لغويات المدونة الحاسوبيةiwan_rg
 
iWAN Annual Report 1435/1436H
 iWAN Annual Report 1435/1436H iWAN Annual Report 1435/1436H
iWAN Annual Report 1435/1436Hiwan_rg
 
التقرير السنوي لمجموعة إيوان البحثية لعام 1434 هـ - 1435هـ
التقرير السنوي لمجموعة إيوان البحثية لعام 1434 هـ -  1435هـالتقرير السنوي لمجموعة إيوان البحثية لعام 1434 هـ -  1435هـ
التقرير السنوي لمجموعة إيوان البحثية لعام 1434 هـ - 1435هـiwan_rg
 

Mehr von iwan_rg (20)

Automatic text simplification evaluation aspects
Automatic text simplification  evaluation aspectsAutomatic text simplification  evaluation aspects
Automatic text simplification evaluation aspects
 
تلخيص كتاب مقدمة في معالجة اللغة العربية
تلخيص كتاب مقدمة في معالجة اللغة العربيةتلخيص كتاب مقدمة في معالجة اللغة العربية
تلخيص كتاب مقدمة في معالجة اللغة العربية
 
Building theoretical models using structured equation modeling
Building theoretical models using structured equation modelingBuilding theoretical models using structured equation modeling
Building theoretical models using structured equation modeling
 
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshopورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
 
Introduction to Arabic natural language processing (Infographics)
Introduction to Arabic natural language processing (Infographics)Introduction to Arabic natural language processing (Infographics)
Introduction to Arabic natural language processing (Infographics)
 
Summary of Multilingual Natural Language Processing Applications: From Theory...
Summary of Multilingual Natural Language Processing Applications: From Theory...Summary of Multilingual Natural Language Processing Applications: From Theory...
Summary of Multilingual Natural Language Processing Applications: From Theory...
 
التقرير السنوي لمجموعة إيوان البحثية 1437هـ-1438هـ
التقرير السنوي لمجموعة إيوان البحثية 1437هـ-1438هـالتقرير السنوي لمجموعة إيوان البحثية 1437هـ-1438هـ
التقرير السنوي لمجموعة إيوان البحثية 1437هـ-1438هـ
 
CHOOSING RESEARCH TOPICS AND WRITING RESEARCH PAPERS
CHOOSING RESEARCH TOPICS AND WRITING RESEARCH PAPERSCHOOSING RESEARCH TOPICS AND WRITING RESEARCH PAPERS
CHOOSING RESEARCH TOPICS AND WRITING RESEARCH PAPERS
 
التقرير السنوي لمجموعة إيوان البحثية 1436هـ-1437هـ
التقرير السنوي لمجموعة إيوان البحثية 1436هـ-1437هـالتقرير السنوي لمجموعة إيوان البحثية 1436هـ-1437هـ
التقرير السنوي لمجموعة إيوان البحثية 1436هـ-1437هـ
 
مركز تميز الحوسبة العربية المتقدمة
مركز تميز  الحوسبة العربية المتقدمةمركز تميز  الحوسبة العربية المتقدمة
مركز تميز الحوسبة العربية المتقدمة
 
P05- DINA: A Multi-Dialect Dataset for Arabic Emotion Analysis
P05- DINA: A Multi-Dialect Dataset for Arabic Emotion Analysis P05- DINA: A Multi-Dialect Dataset for Arabic Emotion Analysis
P05- DINA: A Multi-Dialect Dataset for Arabic Emotion Analysis
 
P03- MANDIAC: A Web-based Annotation System For Manual Arabic Diacritization
P03- MANDIAC: A Web-based Annotation System For Manual Arabic Diacritization P03- MANDIAC: A Web-based Annotation System For Manual Arabic Diacritization
P03- MANDIAC: A Web-based Annotation System For Manual Arabic Diacritization
 
P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation
P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation
P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation
 
P01- Toward a rich Arabic Speech Parallel Corpus for Algerian sub-Dialects
P01- Toward a rich Arabic Speech Parallel Corpus for Algerian sub-Dialects P01- Toward a rich Arabic Speech Parallel Corpus for Algerian sub-Dialects
P01- Toward a rich Arabic Speech Parallel Corpus for Algerian sub-Dialects
 
Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...
Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...
Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...
 
OSACT2 LREC 2016 workshop proceedings
OSACT2 LREC 2016 workshop proceedingsOSACT2 LREC 2016 workshop proceedings
OSACT2 LREC 2016 workshop proceedings
 
محاضرة المدونات اللغوية وأدواتها
محاضرة المدونات اللغوية وأدواتهامحاضرة المدونات اللغوية وأدواتها
محاضرة المدونات اللغوية وأدواتها
 
لغويات المدونة الحاسوبية
لغويات المدونة الحاسوبيةلغويات المدونة الحاسوبية
لغويات المدونة الحاسوبية
 
iWAN Annual Report 1435/1436H
 iWAN Annual Report 1435/1436H iWAN Annual Report 1435/1436H
iWAN Annual Report 1435/1436H
 
التقرير السنوي لمجموعة إيوان البحثية لعام 1434 هـ - 1435هـ
التقرير السنوي لمجموعة إيوان البحثية لعام 1434 هـ -  1435هـالتقرير السنوي لمجموعة إيوان البحثية لعام 1434 هـ -  1435هـ
التقرير السنوي لمجموعة إيوان البحثية لعام 1434 هـ - 1435هـ
 

Kürzlich hochgeladen

Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxVishalSingh1417
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...Nguyen Thanh Tu Collection
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin ClassesCeline George
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptxMaritesTamaniVerdade
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfPoh-Sun Goh
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docxPoojaSen20
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhikauryashika82
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxAreebaZafar22
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 

Kürzlich hochgeladen (20)

Asian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptxAsian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptx
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Web & Social Media Analytics Previous Year Question Paper.pdf

P02- Towards a New Arabic Corpus of Dyslexic Texts

  • 1. Towards A New Arabic Corpus of Dyslexic Texts. Maha Alamri (Elp003@bangor.ac.uk) and William John Teahan (W.J.Teahan@bangor.ac.uk), School of Computer Science, Bangor University.
  • 2. Outline: Introduction. Arabic Corpus of Dyslexic Texts. Towards Automatic Correction of Dyslexic Errors. Conclusion. Venue: The 2nd Workshop on Arabic Corpora and Processing Tools, LREC 2016.
  • 3. Introduction. The focus of this presentation is the creation of a new Arabic corpus of texts written by dyslexics, and of software for automatic spelling correction of such texts. Dyslexia: the term has its roots in the Greek 'dys-', meaning difficulty with, and '-lexia', meaning language or word. It is an inability to master the use of written language, including issues with comprehension. 1 in 10 people have dyslexia.
  • 4. Introduction. The main area of interest lies in the zone of convergence represented by the overlap area of the diagram (labels: Dyslexia; Arabic Corpus; Automatic spelling correction). Automatic spelling correction denotes the way in which a misspelled word is identified by a program and is then altered to its correct form.
  • 5. Spelling Errors. Common spelling errors (Damerau, 1964): additional letters, e.g. unniverse; omitted letters, e.g. univ rse; substituted letters, e.g. umiverse; swapped letters, e.g. uinverse.
  • 6. Dyslexia Spelling Errors. Words containing silent letters (e.g. knife). Morphemes that change when affixes are added (explain, explanation). Dyslexic writers struggle with the relationship between the sound of a word and how it is spelt. The inability to retain orthographic symbols in memory makes it difficult for dyslexics to remember the right order of letters in a word.
  • 7. Spelling errors by Arabic writers with dyslexia: phonetic errors; irregular spelling rules; word omission; Hamza; long vowels; exchanging consonants; difficulty in writing the letters in the correct shape; spelling an Arabic word according to how it is heard in the local spoken dialect.
  • 8. Arabic Corpus of Dyslexic Texts. The rate of misspellings in text is noticeably higher in the case of children. Therefore, the texts were collected from female primary school students who have been professionally diagnosed with dyslexia and taught in resource rooms. BDAC information: Text: writing exercises (homework). Size: 1067 words containing 694 errors. Year: 2013. Language: Arabic. Country of production: Saudi Arabia (Riyadh). The Bangor Dyslexic Arabic Corpus (BDAC) is a preliminary version, which aims to investigate the possibility of a corpus being used as an aid for Arabic dyslexic writers.
  • 9. Example Dyslexic Text. Screenshot of a scanned image of one of the texts written by a dyslexic female child (nine years old).
  • 10. Example Dyslexic Text. This example includes basic errors as below:
  • 11. Analysis of the BDAC errors
  • 12. Analysis of the BDAC errors. 3. Substitution (47 times), commonly found in: replacing Heh (ه) with Teh Marbuta (ة) or vice versa; replacing Heh (ه) or Teh Marbuta (ة) with Teh (ت) or vice versa; and exchanging Dad (ض) with Zah (ظ) or vice versa. 4. Transposition (19 times).
  • 13. Analysis of the BDAC errors
  • 14. Analysis of the BDAC errors
  • 15. Analysis of the BDAC errors
  • 16. Analysis of the BDAC errors
  • 17. Towards Automatic Correction of Dyslexic Errors The main tool employed was the Text Mining Toolkit (TMT). TMT is a software package designed specifically to conduct tasks revolving around compression-based language modelling, text categorisation and correction, and segmentation of the text. The toolkit was used to correct a small number of the dyslexic errors using a method that was similar to the method described by Alhawiti (2014) found effective for the correction of errors in Arabic OCR text. THE 2ND WORKSHOP ON ARABIC CORPORA AND PROCESSING TOOLS LREC2016 17
  • 18. Towards Automatic Correction of Dyslexic Errors. First, it was crucial to choose a large training corpus of Arabic text to train the compression-based language model created by the toolkit. After researching suitable corpora, the Bangor Arabic Compression Corpus (BACC) created by Dr. Khaled Alhawiti was chosen. Due to the current limitations of the TMT software, correction of the dyslexic texts was applied only to one-to-one character errors, using the toolkit's markup correction capabilities, which were able to find the most probable corrected sequence given the compression-based language model.
  • 19. Experimental Results. All errors containing more than one character were removed. BDAC corpus: 1067 words of text, 694 errors, 280 one-to-one character errors.
  • 20. Experimental Results. Words: 153 errors, 99 corrected. Sentences: 80 errors, 49 corrected. Paragraphs: 47 errors, 39 corrected. Total: 280 errors, 187 corrected. The TMT software was able to correct more than half of the one-to-one character errors (187 of 280, about 67%).
  • 21. Conclusion. The corpus used in this study offers a useful platform for analysing dyslexic text. It provides a better understanding of the occurrence of these errors and the factors determining such occurrences, and is therefore suitable for assisting dyslexic writers. This corpus can serve as a platform for other researchers to build upon. A preliminary investigation was undertaken into using automatic processing techniques as a form of assistance for Arabic dyslexic writers, and some initial success was achieved in the automatic correction of dyslexic errors in Arabic text. In future work, extending the corpus to include more text for analysis will require considerably more resources and effort.
  • 22. Thank you. Any questions?
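
The four error types attributed to Damerau (1964) on slide 5 can be illustrated with a short classifier. This is only a sketch, not code from the presentation: the function name and the category labels are assumptions chosen here, and it handles only misspellings that are a single edit away from the intended word.

```python
def classify_single_error(misspelled: str, intended: str) -> str:
    """Classify a misspelling that is one edit away from the intended word.

    Labels (hypothetical names): 'addition', 'omission', 'substitution',
    'transposition', 'correct', or 'other' for anything more complex.
    """
    m, w = misspelled, intended
    if m == w:
        return "correct"
    if len(m) == len(w) + 1:
        # Additional letter: deleting one character from m gives w.
        if any(m[:i] + m[i + 1:] == w for i in range(len(m))):
            return "addition"
    elif len(m) + 1 == len(w):
        # Omitted letter: deleting one character from w gives m.
        if any(w[:i] + w[i + 1:] == m for i in range(len(w))):
            return "omission"
    elif len(m) == len(w):
        diffs = [i for i in range(len(w)) if m[i] != w[i]]
        if len(diffs) == 1:
            # Exactly one position holds a different letter.
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and m[diffs[0]] == w[diffs[1]]
                and m[diffs[1]] == w[diffs[0]]):
            # Two adjacent letters are swapped.
            return "transposition"
    return "other"


for wrong in ["unniverse", "univrse", "umiverse", "uinverse"]:
    print(wrong, "->", classify_single_error(wrong, "universe"))
```

The same function works on Arabic strings, since it compares characters position by position and makes no assumption about the script.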

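The counts reported on slides 19 and 20 can be turned into correction rates with a few lines of arithmetic. The snippet below is a sketch that uses only the figures from the slides; the variable and function names are assumptions, and the three category labels are taken as they appear on slide 20.

```python
# (errors, corrected) per text type, as reported on slide 20.
results = {
    "Words": (153, 99),
    "Sentences": (80, 49),
    "Paragraphs": (47, 39),
}


def correction_rate(errors: int, corrected: int) -> float:
    """Percentage of errors that were corrected."""
    return 100.0 * corrected / errors


total_errors = sum(errors for errors, _ in results.values())
total_corrected = sum(corrected for _, corrected in results.values())

for text_type, (errors, corrected) in results.items():
    rate = correction_rate(errors, corrected)
    print(f"{text_type}: {corrected}/{errors} corrected ({rate:.1f}%)")

overall = correction_rate(total_errors, total_corrected)
print(f"Total: {total_corrected}/{total_errors} corrected ({overall:.1f}%)")

# Share of all 694 BDAC errors that are one-to-one character errors (slide 19).
print(f"One-to-one share of all errors: {100.0 * total_errors / 694:.1f}%")
```

The per-type counts sum to the totals given on the slide (280 errors, 187 corrected), which is a useful consistency check on the reconstructed figures.
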
Editor's notes

  1. Resource room: “A room in an ordinary school which students with special needs attend for a period of not more than a half of the school day for the purpose of receiving special education services from a special education teacher.” (Ministry of Education of Saudi Arabia, 2002)
  2. Essentially, markup is employed to identify the corrected text that has the highest probability given the output that was observed. The TMT markup routines are two aspects of the markup transformation that help to establish whether the markup model can be realised.
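
The idea in the markup note above, choosing the corrected text with the highest probability for one-to-one character errors, can be sketched as follows. This is not the TMT implementation: TMT uses compression-based language models, whereas this sketch swaps in a much simpler add-one-smoothed character bigram model. The confusion sets come from the substitutions listed on slide 12; the class and function names, the `change_penalty` value, and the tiny training text (a stand-in for a large corpus such as BACC) are all assumptions made for illustration.

```python
import math
from collections import Counter

# Letters reported as commonly confused on slide 12.
CONFUSION_SETS = [{"ه", "ة", "ت"}, {"ض", "ظ"}]


def candidates(ch):
    """Return the letters that may have been intended for an observed letter."""
    for group in CONFUSION_SETS:
        if ch in group:
            return sorted(group)
    return [ch]


class BigramModel:
    """Character bigram model with add-one smoothing (a stand-in, not TMT)."""

    def __init__(self, text):
        padded = " " + text + " "
        self.bigrams = Counter(zip(padded, padded[1:]))
        self.unigrams = Counter(padded[:-1])
        self.vocab_size = len(set(padded)) + 1  # +1 for unseen characters

    def logp(self, prev, ch):
        return math.log((self.bigrams[(prev, ch)] + 1)
                        / (self.unigrams[prev] + self.vocab_size))


def correct(text, model, change_penalty=math.log(0.5)):
    """Viterbi search for the most probable one-to-one corrected sequence."""
    best = {" ": (0.0, "")}  # last character -> (log score, sequence so far)
    for observed in text:
        step = {}
        for cand in candidates(observed):
            # Changing the observed letter costs a fixed (assumed) penalty.
            cost = 0.0 if cand == observed else change_penalty
            step[cand] = max(
                ((score + model.logp(prev, cand) + cost, seq + cand)
                 for prev, (score, seq) in best.items()),
                key=lambda item: item[0])
        best = step
    # Close the sequence with a word boundary and pick the best path.
    final = max(((score + model.logp(last, " "), seq)
                 for last, (score, seq) in best.items()),
                key=lambda item: item[0])
    return final[1]


# Toy training text; a real experiment would train on a large corpus.
model = BigramModel("المدرسة كبيرة المدرسة جميلة")
print(correct("المدرسه", model))  # Heh written in place of Teh Marbuta
```

Because every candidate replaces exactly one observed character, the search never inserts, deletes or reorders letters, which mirrors the one-to-one restriction described on slide 18.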