2. Introduction
Common Arabic Spell Error
Towards Automatic Spell Checking for Arabic
Towards Arabic Spell-Checker Based on N-Grams Scores
Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions
Arabic Word Generation and Modeling for Spell Checking
Improved Spelling Error Detection and Correction for Arabic
Conclusion
Outline
3. Arabic language
NLP applications
Approaches for solving the Arabic spell-checking problem
Introduction
5. 1. Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions
• A stochastic approach to correcting misspellings in Arabic text.
• A context-based, two-layer system that automatically corrects misspelled words in large datasets.
6. Automatic Stochastic Arabic Spelling Correction With
Emphasis on Space Insertions and Deletions (Cont.)
• Error detection
• Candidates’ generation component, covering single-word errors, space deletion errors, and space insertion errors
• Best candidate selection component
7. Automatic Stochastic Arabic Spelling Correction With
Emphasis on Space Insertions and Deletions (Cont.)
Candidates’ generation component: Space Deletion Errors
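A space deletion error fuses two or more words into one token, so candidate generation for this error class amounts to proposing places to re-insert the missing spaces. The following is a minimal sketch of that idea, not the authors' implementation: the function name `split_candidates`, the `max_words` limit, and the tiny lexicon are all hypothetical and chosen only for illustration.

```python
def split_candidates(token, lexicon, max_words=3):
    """Return every way to cut `token` into 1..max_words lexicon words.

    Assumption: a fused token is repaired only by re-inserting spaces,
    and every resulting piece must be a known word.
    """
    results = []

    def extend(start, parts):
        if start == len(token):          # consumed the whole token
            results.append(parts)
            return
        if len(parts) == max_words:      # too many pieces already
            return
        for end in range(start + 1, len(token) + 1):
            piece = token[start:end]
            if piece in lexicon:
                extend(end, parts + [piece])

    extend(0, [])
    return results


# Hypothetical example: "في" + "البيت" typed without the space.
lexicon = {"في", "البيت"}
fused = "في" + "البيت"
print(split_candidates(fused, lexicon))
```

In a full system each split would still have to be ranked against the other candidates by the best candidate selection component; this sketch only enumerates them.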
8. Result
• A standard Arabic text corpus (TRN_DB_I)
• An extra standard Arabic text corpus (TRN_DB_II)
• The test data (TST_DB)
• The testing results show that performance improves as the size of the training set increases, reaching an F1 score of 97.9% for detection and 92.3% for correction.
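For reference, the F1 score used in these results is the harmonic mean of precision and recall. The symbols P (precision) and R (recall) are introduced here for illustration and do not appear on the slides:

```latex
F_1 = \frac{2 \, P \, R}{P + R}
```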
9. 2. Towards Automatic Spell Checking for Arabic
• Develops an Arabic spelling checker program.
• Implemented in the SICStus Prolog language.
• Recognizes common Arabic spelling errors and offers suggestions for error correction.
• Recognizes common spelling errors in both Standard Arabic and Egyptian dialects.
• Can be integrated with other text-processing software, such as word processors.
10. • Analyzes common spelling errors; the analysis is used to detect misspelled Arabic words.
• Limits the detection of spelling errors to isolated words (non-words), e.g. ‘ ’ for ‘ ’.
• Performs a series of heuristic steps to find a replacement candidate:
Add a missing character
Replace an incorrect character
Remove an excessive character
Add a space to split words
Towards Automatic Spell Checking for Arabic (Cont.)
11. • Add missing character, e.g. the candidates of the misspelled word are : ,
• Replace incorrect character, e.g. the candidates of the misspelled word are : , ,
• Remove excessive character, e.g. the candidates of the misspelled word are : ,
• Add a space to split words, e.g. the candidates of the misspelled word are : ,
Towards Automatic Spell Checking for Arabic (Cont.)
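The four heuristic steps can be sketched in a few lines. The original system is written in SICStus Prolog; the Python below is only an illustrative stand-in, and the function name `candidates`, the letter inventory `ARABIC_LETTERS`, and the sample lexicon are assumptions rather than anything taken from the source.

```python
# Assumed letter inventory (basic letters plus common hamza/alif forms).
ARABIC_LETTERS = "ءآأؤإئابةتثجحخدذرزسشصضطظعغفقكلمنهوىي"


def candidates(word, lexicon, alphabet=ARABIC_LETTERS):
    """Generate replacement candidates for a non-word using four heuristics."""
    found = set()
    for i in range(len(word) + 1):
        head, tail = word[:i], word[i:]
        for ch in alphabet:
            found.add(head + ch + tail)            # add a missing character
            if tail:
                found.add(head + ch + tail[1:])    # replace an incorrect character
        if tail:
            found.add(head + tail[1:])             # remove an excessive character
        if head and tail and head in lexicon and tail in lexicon:
            found.add(head + " " + tail)           # add a space to split words
    found.discard(word)
    # Keep single-word candidates only if the lexicon knows them;
    # split candidates were already checked piece by piece.
    return {c for c in found if " " in c or c in lexicon}


# Hypothetical lexicon and misspellings, for illustration only.
lexicon = {"كتاب", "كاتب", "في", "البيت"}
print(candidates("كتب", lexicon))
print(candidates("في" + "البيت", lexicon))
```

Each heuristic changes exactly one character position, so the candidate set stays small enough to filter against a word list directly.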
13. • Develops a simple and flexible spell checker for the Arabic language (error detection).
• Based on N-gram scores.
• Uses a matrix approach.
• The corpus is adapted from Muaidi's PhD thesis.
• It consists of 101,987 word types.
3. Towards Arabic Spell-Checker Based on N-Grams Scores
15. • Building the matrices
Number of matrices = length of the longest word in the corpus – 1.
The dimension of each matrix is 28 × 28 (one row and one column per Arabic letter).
M1 holds the combination of the first and second letters of a word, M2 the combination of the second and third letters, and so on.
All matrices are initialized to zero.
Matrix Method Deals With Each Word
16. • 2-Gram set (S)
Each item in S consists of two adjacent letters of the word.
Each item is assigned the value 1 or 2 in the corresponding matrix:
2 if the word ends with these two letters;
1 if the two letters are connected and the word is not over yet.
e.g. for the word: [ ]
the 2-Gram set is S = { }
M1[ ][ ] = 1, M2[ ][ ] = 1, M3[ ][ ] = 2.
Matrix Method Deals With Each Word
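The matrix method described on the last two slides can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the 28-letter inventory in `LETTERS`, the choice to keep the larger score when two words write to the same cell, and the acceptance test in `is_known` are all assumptions, as are the function names.

```python
# Assumed 28-letter inventory; words containing other characters are skipped.
LETTERS = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"
INDEX = {ch: i for i, ch in enumerate(LETTERS)}
N = len(LETTERS)  # 28


def build_matrices(words):
    """Build one N x N matrix per adjacent letter position."""
    words = [w for w in words if len(w) >= 2 and all(c in INDEX for c in w)]
    longest = max((len(w) for w in words), default=1)
    # Number of matrices = length of the longest word - 1, all zeros.
    mats = [[[0] * N for _ in range(N)] for _ in range(longest - 1)]
    for w in words:
        for k in range(len(w) - 1):
            r, c = INDEX[w[k]], INDEX[w[k + 1]]
            score = 2 if k == len(w) - 2 else 1   # 2 marks a word-final pair
            # Assumption: a cell keeps the higher of the scores written to it.
            mats[k][r][c] = max(mats[k][r][c], score)
    return mats


def is_known(word, mats):
    """Accept a word if every pair is connected and the last pair scores 2."""
    if len(word) < 2 or len(word) - 1 > len(mats):
        return False
    if any(c not in INDEX for c in word):
        return False
    last = len(word) - 2
    for k in range(len(word) - 1):
        cell = mats[k][INDEX[word[k]]][INDEX[word[k + 1]]]
        if cell == 0 or (k == last and cell != 2):
            return False
    return True


# Hypothetical two-word corpus, for illustration only.
mats = build_matrices(["كتب", "كتاب"])
print(len(mats), is_known("كتاب", mats), is_known("كتا", mats))
```

Because each matrix only records letter pairs at one position, a check like this can accept some strings that never occurred in the corpus; it trades exactness for constant-size storage per position.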
18. Result
• The training dataset consists of 71,390 Arabic words (70%), while the testing dataset consists of 30,597 Arabic words (30%).
The Overall Evaluation of the Results
• Increasing the size of the dataset increases the accuracy.
19. Bridges the critical gap in available open-source spell-checking resources for Arabic.
Creates an open-source, large-coverage word list for Arabic (9,000,000 words).
Error Detection:
Direct method: match words in an open text against a list of correct words.
Language modeling method: build a character-based tri-gram language model using SRILM in order to classify generated words as valid or invalid.
4. Arabic Word Generation and Modeling for Spell Checking
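Both detection methods can be illustrated with a short sketch. The source builds its language model with SRILM; the code below does not use SRILM and instead substitutes a hand-rolled character tri-gram model with add-one smoothing, purely to show the idea. The function names, the `^`/`$` padding symbols, and the sample word list are assumptions.

```python
import math
from collections import Counter


def detect_errors(tokens, word_list):
    """Direct method: flag every token that is not in the word list."""
    return [t for t in tokens if t not in word_list]


def train_char_trigrams(words):
    """Count character tri-grams and their two-character contexts."""
    tri, ctx, vocab = Counter(), Counter(), set()
    for w in words:
        s = "^^" + w + "$"               # assumed start/end padding
        vocab.update(s)
        for i in range(len(s) - 2):
            tri[s[i:i + 3]] += 1
            ctx[s[i:i + 2]] += 1
    return tri, ctx, len(vocab)


def avg_log_prob(word, model):
    """Average add-one-smoothed log-probability per tri-gram of `word`."""
    tri, ctx, v = model
    s = "^^" + word + "$"
    total = 0.0
    for i in range(len(s) - 2):
        total += math.log((tri[s[i:i + 3]] + 1) / (ctx[s[i:i + 2]] + v))
    return total / (len(s) - 2)


# Hypothetical word list, for illustration only.
word_list = {"كتاب", "كاتب", "مكتب", "كتب"}
model = train_char_trigrams(word_list)
print(detect_errors(["كتاب", "كتتب"], word_list))
print(avg_log_prob("كتاب", model) > avg_log_prob("ققققق", model))
```

To classify words as valid or invalid, a threshold on the average log-probability would be needed; how that threshold is chosen is not stated on the slides, so it is left out here.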
20. Flow chart of spelling error correction. Components shown: input; finite-state transducer; Arabic word list; error decision (yes/no); candidates list; candidates ranker (augmented edit distance and language-specific rules); noisy channel model; Gigaword corpus; score; post-processing; suggestion list; display suggestions.
21. Best accuracy score = 75%
Evaluation on:
Microsoft Word 2010 = 80.54%
Hunspell using Ayaspell = 45.64%
Result
22. Spelling error detection and correction components:
• Dictionary (or reference word list)
• Error model
• Language model
5. Improved Spelling Error Detection and Correction for Arabic
23. AraComLex Extended word list
• Matching its word list against the Gigaword corpus
• Double-checked by the Buckwalter Arabic Morphological Analyzer
• Creating a dictionary of 9.3 million Arabic words
Improving the Dictionary
25. Analyze the level of noise in different sources of data.
Agence France-Presse (AFP) is the noisiest, while Al-Jazeera data is the cleanest.
Select the optimal subset to train the system on.
Improving the language model: Analyzing the Training Data
26. AFP = 73.93 %
Al-Jazeera data = 80.97 %
Gigaword corpus = 82.86 %
Clean data is better than noisy data when the two are comparable in size; however, more data is better than clean data.
Evaluation on:
Google Docs = 9.32 %
Ayaspell for OpenOffice = 41.86 %
Microsoft Word 2010 = 57.15 %
Result
27. Having reviewed these approaches, we see that the results are promising and represent a good starting point for future research on improving Arabic spell checking.
Conclusion