Arabic Tokenization and Stemming
Imam Mohammad Ibn Saud Islamic University
College of Computing and Information Science
Computer Science Department
Prepared by: Al-Moammar, A., Al-Abdullah, H., and Al-Ajlan, N.
Supervised by: Dr. Amal Al-Saif
Outline
 Introduction
 Tokenization:
• Arabic Characteristics.
• Methodology.
• Results.
 Stemming:
• Arabic Characteristics.
• Methodology.
• Results.
 Conclusion.
Introduction
 Arabic language.
 Tokenization.
 Stemming.
Arabic Language Characteristics
• Writing a letter in an ambiguous form causes orthographic problems.
• Encliticization of a word ending with “ ” or “ ” :
• Ambiguity results from decliticization of “l”, “A”, and “Al” (“the”).
Examples of encliticized words (English glosses): “their Friday”, “collect them”, “your level”.
My Approach
 Sample of Arabic tokenized text:
 The bigram equation used combines two probabilities:
• P(wi | sj): the probability of the ith word given the jth segmentation.
• P(sj | sj-1): the probability of the jth segmentation given the previous segmentation.
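The equation itself did not survive on the slide, so the following is only a minimal sketch of how the two probabilities could be combined, assuming the usual bigram decomposition in which a candidate segmentation sequence is scored by the product of P(w | s) and P(s | previous s). The function name, the start symbol, the transliterated words, and every probability value are hypothetical and chosen purely for illustration.

```python
import math
from itertools import product

def best_segmentation(words, candidates, emit, trans, start="<s>"):
    """Pick one segmentation per word, maximizing the sum of
    log P(word | seg) + log P(seg | previous seg) over the sentence."""
    best_score, best_path = -math.inf, None
    # Brute force over all combinations; fine for a toy example only.
    for path in product(*(candidates[w] for w in words)):
        score, prev = 0.0, start
        for word, seg in zip(words, path):
            p = emit.get((word, seg), 0.0) * trans.get((prev, seg), 0.0)
            if p == 0.0:          # unseen event: this path is impossible
                score = -math.inf
                break
            score += math.log(p)
            prev = seg
        if score > best_score:
            best_score, best_path = score, path
    return list(best_path) if best_path is not None else None

# Hypothetical toy model (transliterated words, made-up probabilities).
words = ["wktb", "Alwld"]
candidates = {"wktb": ["w+ktb", "wktb"], "Alwld": ["Al+wld", "Alwld"]}
emit = {("wktb", "w+ktb"): 0.7, ("wktb", "wktb"): 0.3,
        ("Alwld", "Al+wld"): 0.9, ("Alwld", "Alwld"): 0.1}
trans = {("<s>", "w+ktb"): 0.6, ("<s>", "wktb"): 0.4,
         ("w+ktb", "Al+wld"): 0.8, ("w+ktb", "Alwld"): 0.2,
         ("wktb", "Al+wld"): 0.5, ("wktb", "Alwld"): 0.5}

print(best_segmentation(words, candidates, emit, trans))
```

A real tokenizer would estimate both tables from an annotated corpus and replace the exhaustive search with dynamic programming (for example Viterbi decoding).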
Results
The results of the My Approach algorithm:
• They used bigrams on 45 files with a total size of 29,092 tokens.
• The final accuracy was 98.83%.

                                     Recall     Accuracy   Precision  F-measure
Result without statistical support   0.9877462  0.9802977  0.8617793  0.920473
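As a quick consistency check on the table, the F-measure can be recomputed from the reported precision and recall. The sketch below assumes the balanced F-measure (the harmonic mean of precision and recall); the slides do not state which variant was used.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values reported for the run without statistical support.
reported_f = 0.920473
computed_f = f_measure(precision=0.8617793, recall=0.9877462)
print(abs(computed_f - reported_f) < 1e-4)
```

Under that assumption the recomputed value agrees with the reported F-measure to about four decimal places.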
Arabic Language Characteristics
Methodology
 Root-based.
 Light Stemmer.
 N-Gram.
 Hybrid Method.
Root-based
 Example of root-based stemmer
Light Stemmer
 Morphemes removed by light stemmers
 Classification of the Light8 stemmer
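The tables of removed morphemes did not survive on these slides, so the following is only a minimal sketch of the light-stemming idea: strip a small, fixed set of prefixes and suffixes without trying to recover the root. The affix lists and length thresholds are assumptions modeled on published descriptions of light stemmers, not values recovered from the slides, and the function name is hypothetical.

```python
# Assumed affix lists (illustrative, not taken from the slides).
ARTICLES = ["وال", "بال", "كال", "فال", "ال"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]

def light_stem(word):
    """Strip common Arabic affixes without looking for the root."""
    # 1. Drop a leading conjunction waw if at least 3 letters remain.
    if word.startswith("و") and len(word) - 1 >= 3:
        word = word[1:]
    # 2. Drop one definite-article prefix if at least 2 letters remain.
    for prefix in ARTICLES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 2:
            word = word[len(prefix):]
            break
    # 3. Walk the suffix list once, in order, removing each suffix
    #    that matches as long as at least 2 letters remain.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            word = word[:-len(suffix)]
    return word

for w in ["والكتاب", "المعلمون", "ولد"]:
    print(w, "->", light_stem(w))
```

Because no dictionary or root list is consulted, a stemmer of this kind is fast but can both over-stem and under-stem.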
N-gram
 Statistical stemmer based on calculating a measure of
similarity between a pair of words.
 N-gram techniques:
• Digram (N=2).
• Trigram (N=3).
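N-gram
 The string similarity measure is calculated using Dice’s coefficient:
S = 2 C_wq / (A_w + B_q)
where A_w and B_q are the numbers of unique n-grams in words w and q, and C_wq is the number of unique n-grams the two words share.
Example: for two words with 10 and 5 unique digrams, 4 of which are shared, the similarity would be:
S = (2 * 4) / (10 + 5) = 0.533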
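The calculation can be sketched in a few lines of Python. This is a minimal illustration, assuming unique character n-grams with no padding at word boundaries; the function names are hypothetical, and the English word pair merely stands in for the Arabic example that did not survive on the slide.

```python
def ngrams(word, n=2):
    """Return the set of unique character n-grams in a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def dice(word_a, word_b, n=2):
    """Dice's coefficient S = 2C / (A + B) over unique n-grams."""
    a, b = ngrams(word_a, n), ngrams(word_b, n)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def dice_from_counts(shared, count_a, count_b):
    """Same formula, starting from the three counts directly."""
    return 2 * shared / (count_a + count_b)

# The slide's numbers: 4 shared digrams, 10 and 5 unique digrams.
print(round(dice_from_counts(4, 10, 5), 3))
# Stand-in word pair: the two words share only the digram "ht".
print(dice("night", "nacht"))
```

Words whose similarity exceeds a chosen threshold would then be grouped as variants of the same stem; that threshold is a tuning parameter the slides do not specify.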
Hybrid Method
 Incorporates three different techniques for Arabic stemming.
 The hybrid algorithm starts by constructing a root file
containing more than 9,000 valid Arabic roots.
Results
 The hybrid algorithm was found to outperform the other
stemming algorithms.
 The results show that using the hybrid stemmer
improves the performance of some Arabic processing tasks.
 In Arabic text categorization, the average accuracies are:
74.41% for Khoja, 59.71% for light stemming, 48.17% for
n-grams, and 82.33% for the hybrid stemmer.
Conclusion
Thanks
