Arabic spell checking approaches

By: Banan AlHadlaq, Dalal AlZeer, Monirah AlOrf
Supervised by: Dr. Amal Al-Saif
Natural language processing - CS465

Outline
• Introduction
• Common Arabic Spell Errors
• Towards Automatic Spell Checking for Arabic
• Towards Arabic Spell-Checker Based on N-Grams Scores
• Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions
• Arabic Word Generation and Modeling for Spell Checking
• Improved Spelling Error Detection and Correction for Arabic
• Conclusion

Introduction
• Arabic language
• NLP applications
• Approaches for solving the Arabic spell checking problem

Common Arabic Spell Errors
• Reading Errors
• Hearing Errors
• Touch-Typing Errors
• Morphological Errors
• Editing Errors

1. Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions
• Stochastic-based approach for misspelling correction of Arabic text.
• A context-based, two-layer system that automatically corrects misspelled words in large datasets.

Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions (Cont.)
System components:
• Error detection
• Candidates’ generation component, covering:
  - Single-word errors
  - Space deletion errors
  - Space insertion errors
• Best candidate selection component

Automatic Stochastic Arabic Spelling Correction With
Emphasis on Space Insertions and Deletions (Cont.)
Candidates’ generation component: Space Deletion Errors
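
A space deletion error merges neighbouring words into one token, so one plausible way to generate candidates is to try every split of the token into known words. The sketch below is only an illustration of that idea, not the paper's implementation: the function name, the `max_parts` limit, and the toy lexicon are all assumptions.

```python
def space_deletion_candidates(token, lexicon, max_parts=3):
    """Return every split of `token` into 2..max_parts words found in `lexicon`.

    Hypothetical helper: the slides do not specify how splits are enumerated.
    """
    results = []

    def extend(rest, parts):
        if not rest:
            # Only keep real splits (at least two words).
            if len(parts) >= 2:
                results.append(" ".join(parts))
            return
        if len(parts) == max_parts:
            return
        for cut in range(1, len(rest) + 1):
            head = rest[:cut]
            if head in lexicon:
                extend(rest[cut:], parts + [head])

    extend(token, [])
    return results


# Toy lexicon (assumption): two correct words that were typed without a space.
lexicon = {"في", "البيت"}
merged = "في" + "البيت"
print(space_deletion_candidates(merged, lexicon))
```

In a full system, each split would then be scored by the best candidate selection component rather than accepted directly.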

Result
• A standard Arabic text corpus (TRN_DB_I)
• An extra standard Arabic text corpus (TRN_DB_II)
• The test data (TST_DB)
• The testing results show that performance improves as the size of the training set increases, reaching an F1 score of 97.9% for detection and 92.3% for correction.

2. Towards Automatic Spell Checking for Arabic
• Developing an Arabic spelling checker program.
• Written in the SICStus Prolog language.
• Recognizes common Arabic spelling errors and offers suggestions for error correction.
• Able to recognize common spelling errors for standard Arabic and Egyptian dialects.
• Can be integrated with other text processing software, such as word processors.

Towards Automatic Spell Checking for Arabic (Cont.)
• Analysis of the common spelling errors, which is used for detecting misspelled Arabic words.
• Detection of spelling errors is limited to isolated words (non-words), e.g. ‘ ’ for ‘ ’.
• Performs a series of heuristic steps to find a replacement candidate:
  - Add a missing character
  - Replace an incorrect character
  - Remove an excessive character
  - Add a space to split words
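
The four heuristic steps can be illustrated with a short sketch. The original checker is written in SICStus Prolog; the Python below is only an approximation under stated assumptions: it tries every one of the 28 basic letters at every position (the real system may restrict replacements, for example with a neighbors table), and the function name and toy lexicon are hypothetical.

```python
# The 28 basic Arabic letters (assumption: hamza forms and ta marbuta are ignored).
ALPHABET = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"


def heuristic_candidates(word, lexicon, alphabet=ALPHABET):
    """Apply the four heuristic steps and keep only results found in `lexicon`."""
    found = {"add": set(), "replace": set(), "remove": set(), "split": set()}

    # Step 1: add a missing character at any position.
    for i in range(len(word) + 1):
        for ch in alphabet:
            candidate = word[:i] + ch + word[i:]
            if candidate in lexicon:
                found["add"].add(candidate)

    for i in range(len(word)):
        # Step 2: replace an incorrect character.
        for ch in alphabet:
            if ch != word[i]:
                candidate = word[:i] + ch + word[i + 1:]
                if candidate in lexicon:
                    found["replace"].add(candidate)
        # Step 3: remove an excessive character.
        candidate = word[:i] + word[i + 1:]
        if candidate in lexicon:
            found["remove"].add(candidate)

    # Step 4: add a space to split the token into two known words.
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in lexicon and right in lexicon:
            found["split"].add(left + " " + right)

    return found


# Toy lexicon (assumption) and a non-word missing its final letter.
lexicon = {"كتاب", "كتب", "كاتب", "باب"}
print(heuristic_candidates("كتا", lexicon))
```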

Towards Automatic Spell Checking for Arabic (Cont.)
• e.g. each heuristic step proposes its own candidates for the misspelled word:
  - Add missing character
  - Replace incorrect character
  - Remove excessive character
  - Add a space to split words

Towards Automatic Spell Checking for Arabic (Cont.)
Neighbors table

3. Towards Arabic Spell-Checker Based on N-Grams Scores
• Developing a simple and flexible spell-checker for the Arabic language (error detection).
• Based on N-Grams scores.
• Uses a matrix approach.
• The corpus used is adapted from Muaidi's PhD thesis.
• It consists of 101,987 word types.

Towards Arabic Spell-Checker Based on N-Grams Scores (Cont.)
Processing steps, in order:
• Enter the tested text
• Tokenizing process
• Cleaning process
• Matrix method deals with each word

Matrix Method Deals With Each Word
• Building the matrices
  - Number of matrices = length of the longest word in the corpus − 1.
  - The dimension of each matrix is 28 × 28 (one row and one column per Arabic letter).
  - (M1) holds the combination of the first and second letters in a word, (M2) the combination of the second and third letters, and so on.
  - All the matrices are initialized with zeros.

• 2-Gram set (S)
  - Each item in (S) consists of two letters.
  - Each item is assigned the value 1 or 2 in the corresponding matrix:
  - 2 if the word ends with these two letters.
  - 1 if there is a connection and the word is not over yet.
  - e.g. for the word [ ], the 2-Gram set is S = { }, and M1[ ][ ] = 1, M2[ ][ ] = 1, M3[ ][ ] = 2.
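
The matrix method described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the slides do not say what happens when the same letter pair at the same position is word-final in one word and word-internal in another, so this sketch keeps the larger value (2); the function names and the toy corpus are also hypothetical.

```python
# The 28 basic Arabic letters; each matrix has one row and one column per letter.
ARABIC_LETTERS = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"
INDEX = {ch: i for i, ch in enumerate(ARABIC_LETTERS)}


def build_matrices(corpus):
    """Build (longest word length - 1) matrices of size 28 x 28 from the corpus."""
    usable = [w for w in corpus if len(w) >= 2 and all(c in INDEX for c in w)]
    longest = max((len(w) for w in usable), default=1)
    matrices = [[[0] * 28 for _ in range(28)] for _ in range(longest - 1)]
    for word in usable:
        last = len(word) - 2  # position of the final letter pair
        for pos in range(len(word) - 1):
            row, col = INDEX[word[pos]], INDEX[word[pos + 1]]
            score = 2 if pos == last else 1
            # Assumption: on conflict, keep the larger value.
            matrices[pos][row][col] = max(matrices[pos][row][col], score)
    return matrices


def looks_valid(word, matrices):
    """Accept a word only if every pair is known and the final pair is marked 2."""
    if len(word) < 2 or len(word) - 1 > len(matrices):
        return False
    if any(c not in INDEX for c in word):
        return False
    last = len(word) - 2
    for pos in range(len(word) - 1):
        score = matrices[pos][INDEX[word[pos]]][INDEX[word[pos + 1]]]
        if score == 0 or (pos == last and score != 2):
            return False
    return True


matrices = build_matrices(["كتب", "كتاب", "درس"])  # toy corpus (assumption)
print(looks_valid("كتاب", matrices), looks_valid("كتا", matrices))
```

Because each matrix only records adjacent letter pairs, the method can accept strings that never occur in the corpus as long as every pair is attested at its position.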


Result
• The training dataset consists of 71,390 Arabic words (70%), while the testing dataset consists of 30,597 Arabic words (30%).
The Overall Evaluation of the Results
• Increasing the size of the dataset increases the accuracy.

4. Arabic Word Generation and Modeling for Spell Checking
• Bridge the critical gap in available open-source spell checking resources for Arabic.
• Create an open-source, large-coverage word list for Arabic (9,000,000 words).
• Error Detection:
  - Direct method: match words in an open text against a list of correct words.
  - Language modeling method: build a character-based tri-gram language model using SRILM in order to classify generated words as valid or invalid.
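
Both detection methods can be illustrated in a few lines. The sketch below does not call SRILM; it substitutes a hand-rolled character trigram model with add-one smoothing purely to show the idea. The function names, the padding symbols, and the toy training words are assumptions.

```python
import math
from collections import Counter


def is_listed(word, word_list):
    """Direct method: a word is accepted only if it appears in the word list."""
    return word in word_list


def train_char_trigrams(words):
    """Count character trigrams and their two-character contexts."""
    trigrams, contexts, symbols = Counter(), Counter(), set()
    for w in words:
        padded = "^^" + w + "$"  # assumed start and end markers
        symbols.update(padded)
        for i in range(len(padded) - 2):
            trigrams[padded[i:i + 3]] += 1
            contexts[padded[i:i + 2]] += 1
    return trigrams, contexts, len(symbols)


def avg_log_prob(word, model):
    """Average add-one smoothed log-probability per character transition."""
    trigrams, contexts, vocab_size = model
    padded = "^^" + word + "$"
    steps = len(padded) - 2
    total = 0.0
    for i in range(steps):
        numerator = trigrams[padded[i:i + 3]] + 1
        denominator = contexts[padded[i:i + 2]] + vocab_size
        total += math.log(numerator / denominator)
    return total / steps


word_list = {"كتاب", "كتب", "مكتب", "كاتب"}  # toy list (assumption)
model = train_char_trigrams(word_list)
print(is_listed("كتاب", word_list), avg_log_prob("كتاب", model))
```

A generated word would be classified as valid or invalid by comparing its score against a threshold; the slides do not state how that threshold is chosen.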

Flow chart of spelling error correction. Elements shown:
• Input
• Finite-state transducer
• Arabic word list
• Error? (Yes / No)
• Candidates list
• Candidates ranker (augmented edit distance and language-specific rules)
• Noisy channel model
• Gigaword corpus
• Score
• Suggestion list
• Post-processing
• Display suggestions
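
As a rough illustration of the ranking stage, candidates can be ordered by edit distance to the misspelled word, with corpus frequency as a tie-breaker. This sketch uses plain Levenshtein distance, not the augmented edit distance or the language-specific rules named in the flow chart, and the frequency table is a hypothetical stand-in for counts taken from a corpus.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def rank_candidates(misspelled, candidates, frequency):
    """Sort by smallest edit distance first, then by highest frequency."""
    return sorted(
        candidates,
        key=lambda c: (levenshtein(misspelled, c), -frequency.get(c, 0)),
    )


# Hypothetical frequencies, for illustration only.
frequency = {"كتاب": 50, "كتب": 20}
print(rank_candidates("كتا", ["كاتب", "كتاب", "كتب"], frequency))
```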

Result
• Best accuracy score = 75%
• Evaluation on:
  - Microsoft Word 2010 = 80.54%
  - Hunspell using Ayaspell = 45.64%

5. Improved Spelling Error Detection and Correction for Arabic
Spelling error detection and correction components:
• Dictionary (or reference word list)
• Error model
• Language model

Improving the Dictionary
AraComLex Extended word list:
• Matching its word list against the Gigaword corpus
• Double-checked by the Buckwalter Arabic Morphological Analyzer
• Creating a dictionary of 9.3 million Arabic words

Improving the Error Model: Candidate Generation
• Finite-state transducer to propose candidate corrections
• Discard candidates that are not found in the word list
• Rank the remaining candidates
• Spelling Correction:
  - N-gram language models.
  - The candidate with the lowest perplexity score is selected as the correction.
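
Selecting the candidate with the lowest perplexity can be sketched with a tiny word-bigram model. Everything here is an assumption made for illustration: the add-one smoothing, the function names, and the English toy sentences (the idea is language-independent); the paper's own models are trained on far larger data.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count word bigrams and their left contexts, with sentence boundary marks."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)


def perplexity(sentence, lm):
    """Perplexity of a token list under the add-one smoothed bigram model."""
    contexts, bigrams, vocab_size = lm
    tokens = ["<s>"] + sentence + ["</s>"]
    log_sum = 0.0
    for left, right in zip(tokens, tokens[1:]):
        prob = (bigrams[(left, right)] + 1) / (contexts[left] + vocab_size)
        log_sum += math.log(prob)
    return math.exp(-log_sum / (len(tokens) - 1))


def pick_correction(left, candidates, right, lm):
    """Return the candidate whose sentence has the lowest perplexity."""
    return min(candidates, key=lambda c: perplexity(left + [c] + right, lm))


# Toy training data (assumption).
lm = train_bigram_lm([
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["a", "cot", "broke"],
])
print(pick_correction(["the"], ["cot", "cat"], ["sat"], lm))
```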

Improving the Language Model: Analyzing the Training Data
• Analyze the level of noise in different sources of data.
• Agence France-Presse (AFP) is the noisiest, while Al-Jazeera data is the cleanest.
• Select the optimal subset to train the system on.

Result
• AFP = 73.93%
• Al-Jazeera data = 80.97%
• Gigaword corpus = 82.86%
• Clean data is better than noisy data when they are comparable in size; however, more data is better than clean data.
• Evaluation on:
  - Google Docs = 9.32%
  - Ayaspell for OpenOffice = 41.86%
  - Microsoft Word 2010 = 57.15%

Conclusion
• Having reviewed these approaches, we see that the results are promising and represent a good starting point for future research on improving Arabic spell checking.
