Processing Arabic Text
JOSA TechTalk
Dr. Samir Tartir
1/May/2019
Outline
 Intros
 NLU
 Issues in NLU
 Issues in Arabic NLU
 Tools & Resources
Dr. Samir Tartir
 Senior Research Scientist of AI at Mawdoo3
 PhD from the University of Georgia (USA) in 2009
 BSc & MSc from the University of Jordan in 1998 & 2002
 Highly referenced researcher
 Recognized educator (CourseHero.com)
 Research areas:
 Arabic NLP, Semantic Web, & Ontologies.
 Previously:
 Head and founder of the Department of Web Engineering at Philadelphia University (1 year)
 Assistant professor at the Department of Computer Science at Philadelphia University (9 years)
 Google outsourced team leader 2011-2016
 Database programmer for 5 years (in Jordan, USA)
Mawdoo3 is the largest Arabic website in the world.
Committed to enhancing the Arabic language through content and technology.
Mawdoo3 © Copyright 2018. All Rights Reserved.
ACHIEVEMENTS
50M monthly active users (MAU)
150K articles
300 paid expert contributors
100B words read every month
Mawdoo3 AI
 Founded in January 2017
 Leverages data collected by Mawdoo3.com for AI
 Currently working on Salma
 The first Arabic voice-enabled assistant
 The beta version was released during a Silicon Valley event
 We are about to release the next version, with major updates (weekend)
 It is available for download for Android and iOS devices
 Continuously improving (you’ll understand soon)
Salma Introduction
http://www.cellstrat.com/2017/10/27/nlp-vs-nlu-vs-nlg/
https://labs.eleks.com/2018/02/how-to-build-nlp-engine-that-wont-screw-up.html
[Figure: cartoon panels. Speech heard as "blah blah blah", with the reaction "?!?!?!?!"; then "5 words: I, must, organize, my, room"; then "I must organize my room".]
Many issues*
1. Easy or mostly solved issues
2. Issues with intermediate progress
3. Hard issues, or those that still need a lot of work
Mostly applies to English, but also applies to Arabic
*https://www.quora.com/What-are-the-major-open-problems-in-natural-language-understanding
1. Easy or mostly solved
 Spam detection and email classification
 Works!
1. Easy or mostly solved
 Part-of-speech tagging (≈ إعراب)
 Follows tokenization (تجزيء)
 KEY: N = Noun, V = Verb, P = Preposition, ADV = Adverb, etc.
 INPUT:
Profits soared at Boeing Co., easily topping forecasts on Wall
Street, as their CEO Alan Mulally announced first quarter results.
 OUTPUT:
Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V
forecasts/N on/P Wall/N Street/N ,/, as/P their/POSS CEO/N
Alan/N Mulally/N announced/V first/ADJ quarter/N results/N ./.
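The slash-delimited output above is easy to consume programmatically. A minimal sketch, assuming the `token/TAG` format shown on this slide; the helper name `parse_tagged` is illustrative and does not come from any tagging library:

```python
def parse_tagged(text):
    """Split whitespace-separated 'token/TAG' items into (token, tag) pairs.

    Splits on the LAST slash, so punctuation items such as ',/,' still work.
    """
    pairs = []
    for item in text.split():
        token, _, tag = item.rpartition("/")
        pairs.append((token, tag))
    return pairs


tagged = "Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V"
pairs = parse_tagged(tagged)

# Keep only the tokens tagged as nouns.
nouns = [token for token, tag in pairs if tag == "N"]
print(nouns)
```

Real taggers use much finer tag sets than the simplified key on the slide, so the filter on `"N"` is only meaningful for this toy format.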
1. Easy or mostly solved
 Named Entity Recognition
 Usual classes: People, organizations, locations
 KEY: P = Person, O = Organization, L = Location
 INPUT:
Profits soared at Boeing Co., easily topping forecasts on Wall
Street, as their CEO Alan Mulally announced first quarter results.
 OUTPUT:
Profits soared at {Boeing Co.}/O , easily topping forecasts on {Wall
Street}/L , as their CEO {Alan Mulally}/P announced first quarter
results .
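The braces-and-label notation above can be read back with a regular expression. A small sketch, assuming exactly the `{text}/X` convention and the P/O/L key from this slide; `extract_entities` is a hypothetical helper, not part of an NER toolkit:

```python
import re

# Class codes from the slide's key.
LABELS = {"P": "Person", "O": "Organization", "L": "Location"}

# '{entity text}/X' where X is one of the class codes.
ENTITY = re.compile(r"\{([^}]+)\}/([POL])")


def extract_entities(annotated):
    """Return (entity text, class name) pairs in order of appearance."""
    return [(text, LABELS[code]) for text, code in ENTITY.findall(annotated)]


annotated = (
    "Profits soared at {Boeing Co.}/O , easily topping forecasts on "
    "{Wall Street}/L , as their CEO {Alan Mulally}/P announced first "
    "quarter results ."
)
entities = extract_entities(annotated)
for text, label in entities:
    print(f"{label}: {text}")
```

This only parses annotations that already exist; recognizing the entities in raw text is the actual NER task.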
Arabic NER (مستخرج الأعلام) Example*
وقد أبدى خافيير سولانا المنسق الأعلى للسياسة الخارجية في الاتحاد الأوروبي، تفاؤلا كبيرا بالمبادرة المصرية، مرجحا احتمال وقف إطلاق النار قريبا، وتوقعت مصادر فرنسية انسحاب القوات الإسرائيلية من غزة خلال ثمانية أيام، وأوضح سولانا أن دعوة مصر لإسرائيل لبحث وقف الهجوم قد تؤتي ثمارها خلال الساعات القليلة المقبلة، مؤكدا أن الدول دائمة العضوية بمجلس الأمن استقبلت المبادرة المصرية بترحاب شديد.
Named entities: Persons, Locations, Organizations
*Microsoft Arabic NLP Toolkit (ATK) For Academia in the Arab World Presentation, 11/2012
1. Easy or mostly solved
 Plagiarism detection
 INPUT:
A research paper, homework report, etc.
 OUTPUT:
Rate of similarity to other papers
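One common way to produce a "rate of similarity" is to compare overlapping word n-grams (shingles) with the Jaccard coefficient. The sketch below assumes that technique; the slide does not say which method is meant, and the function names are illustrative:

```python
def shingles(text, n=3):
    """Return the set of word n-grams (as tuples) in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def similarity(a, b, n=3):
    """Jaccard similarity of the two texts' shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0  # texts shorter than n words have no shingles
    return len(sa & sb) / len(sa | sb)


original = "the thieves stole the paintings from the museum"
suspect = "the thieves stole the paintings from a gallery"
print(f"{similarity(original, suspect):.2f}")
```

For Arabic text, the normalization steps listed later in the talk would normally run before shingling, so that spelling variants do not hide a match.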
2. Intermediate or making good progress
 Sentiment analysis - Examples:
 Best roast chicken in San Francisco! – Positive
 The waiter ignored us for 20 minutes. – Negative
 قال احسن مطعم قال! (roughly: "'Best restaurant', they say!") – sarcastic, so the surface wording misleads
2. Intermediate or making good progress
 Coreference resolution - Example:
 "The thieves stole the paintings. They were subsequently caught."
 "The thieves stole the paintings. They were subsequently sold."
 "The thieves stole the paintings. They were subsequently found."
 The task is to decide whether "they" refers to "the thieves" or "the paintings".
2. Intermediate or making good progress
 Word sense disambiguation - Example:
 “I need new batteries for my mouse”
2. Intermediate or making good progress
 Machine Translation
 Translating sentences from one language to another
 The "best" example would be Google Translate.
3. Hard, or still needs a lot of work
 Text Summarization
 Takes one or more text documents as input and condenses them into a summary.
https://www.springer.com/us/book/9783030167998
© 2019
Lithium-Ion Batteries
A Machine-Generated Summary of Current Research
Authors: Writer, Beta
• The first machine-generated book in chemistry
• Provides an overview of recent research
• Includes summaries of 150 articles
3. Hard, or still needs a lot of work
 Machine dialog system - Example:
 User - I need a flight from New York to London, arriving at 10 pm.
 System - What day are you leaving?
 User - Tomorrow.
The system detects the information missing from the user's request and asks for it.
Arabic
Background
 History
– Arabic has remained intelligible and functional for around 20 centuries.
 Strategically important
– 330 million speakers living in an important region
 Huge oil reserves, sacred sites, sea ports
– 1.4 billion Muslims use it in their prayers.
 Cultural and literary heritage
– Closely associated with Islam
Distribution
‫ثمن‬
‫علم‬
‫ق‬
One-Letter Verbs*
1. (إ) من وَأَى يَئِي وَأْيًا، والوأي: الوعد.
2. (ت) من أَتَى يَأْتِي أَتْيًا وأُتِيًّا وإِتِيًّا وإِتْيَانًا وإِتْيَانَةً ومَأْتَاةً؛ أي: جئته، وبعض العرب تقول في الأمر: (ائت)، وأكثر الناس يأمرون بـ (ائت) لا (ت)!
3. (ث) من وَثَى يَثِي؛ أي: وَشَى به عند السلطان.
4. (ج) من وَجَى يَجِي؛ أي قطع، وأما الوَجَا فهو الحفا.
5. (ح) من وَحَى يَحِي وَحْيًا، والوحي: الإشارة والكتابة والكلام.
6. (خ) وَخَى يَخِي وَخْيًا؛ أي: قصد.
7. (د) من وَدَى يَدِي وَدْيًا؛ أي: دفع الدية.
8. (ر) أ- من رَأَى يَرَى الهلالَ رُؤْيةً ورَأْيًا ورَاءَةً.
9. (ر) ب- ومن وَرَى يَرِي القيحُ جوفَه وَرْيًا؛ أي: أفسده.
10. (س) من وَسَى يَسِي زيدٌ رأسَ عمرو وَسْيًا؛ أي حلقه.
11. (ش) من وَشَى يَشِي وَشْيًا، والوشي: نقش الثوب.
12. (ص) من وَصَى يَصِي الشيءَ بالشيء وَصْيًا؛ أي وصله.
13. (ع) من وَعَى يَعِي وَعْيًا؛ أي: حفظ وجمع.
14. (ف) من وَفَى يَفِي وَفَاءً؛ بمعنى الوفاء بالعهد.
15. (ق) من وَقَى يَقِي وَقْيًا ووِقَايةً ووَاقِيةً، بمعنى الحفظ، واو الوقاية مثلثة.
16. (ك) من وَكَى يَكِي القِرْبةَ، والوكاء: رباط القربة وغيرها.
17. (ل) من وَلَى يَلِي ولايةً، والولاية: الإمارة.
18. (م) وَمَى يَمِي وَمْيًا؛ أشار.
19. (ن) وَنَى يَنِي وَنْيًا ووَنَاءً، أي: تعب.
20. (هـ) وَهَى يَهِي وَهْيًا؛ أي: ضعف.
*https://www.m-a-arabia.com/vb/showthread.php?t=140
Versions
 Classical
 فصحى مشكلة (with diacritics)
 Modern Standard Arabic (MSA)
 فصحى غير مشكلة (without diacritics)
 Dialects (اللهجات)
1. Egyptian Arabic (EGY)
2. Levantine Arabic (LEV)
3. Gulf Arabic (GLF)
4. North African Arabic (NOR)
 كراء: استئجار (both mean "renting")
Arabic Language Characteristics
 Written right to left
 Highly structured
 Highly derivational language
 Morphology
 Free word order
 Many dialects
 Long sentences
Example*
*Microsoft Arabic NLP Toolkit (ATK) For Academia in the Arab World Presentation, 11/2012
Issues Specific to Arabic, Part 1
 No central entity that governs the language
 E.g. produces dictionaries
 Punctuation mark rules
 Because Arabic is spoken over a large area, different regions imported different foreign words from neighboring languages
 Letters:
 One letter, one sound
 Letters change shape
 Hamza
 No capital letters
 High error rate of written text
High WER
 Arabic content has a very high Word Error Rate (WER).
 Analysis of a 1000-article tagged corpus: the average WER is 6% in news text.
News Site        Error Rate
Akbar El Youm    13%
BBC Arabic       9%
AL-Nahar         8%
Al-Syassa        8%
Al-Ahram         8%
Al-Quds          8%
Al-Qabas         5%
Al-Hayat         4%
Al-Jazeera       1%
*Microsoft Arabic NLP Toolkit (ATK) For Academia in the Arab World Presentation, 11/2012
High WER
Error Class           Error Rate
Missing Hamza         21%
Extra Hamza           19%
Missing Yaa           15%
Extra Yaa             12%
Missing Taa Marbouta  11%
Extra Taa Marbouta    8%
Wrong Hamza           7%
Missing Space         2%
Swapped Letters       1%
Hamza (47%) + Yaa (15%) + Taa (19%) = 81%
*Microsoft Arabic NLP Toolkit (ATK) For Academia in the Arab World Presentation, 11/2012
Issues Specific to Arabic, Part 1
 Synonymy and confusion of non-standardized terms
 Thermometer: محر، محرار، مقياس حرارة، ميزان حرارة، ترمومتر
 Technical translation
 Hydrometer: جهاز قياس كثافة السوائل ("device for measuring the density of liquids")
 Male/female forms
 Dual form
 Uncle, parent…
The Qadafi/Schwarzenegger Problem *
Arabic English
‫قذافي‬ Gadafi
Gaddafi
Gaddfi
Gadhafi
Ghaddafi
Kadaffy
Qaddafi
Qadhafi
‫شوارزنيغر‬
‫شوارزنغر‬
‫شوارزنيجر‬
‫شوارزنجر‬
Schwarzenegger
* Mona Diab & Nizar Habash. Natural Language Processing of Arabic and its Dialects. EMNLP 2014, Doha, Qatar.
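One way to cope with such spelling variants is to map every variant to a coarse matching key before comparison. The sketch below is a deliberately crude, hypothetical key built only from the Latin variants listed on this slide; it is not a standard transliteration or phonetic algorithm, and it will also merge unrelated names:

```python
import re


def crude_key(name):
    """Reduce a romanized name to a rough consonant skeleton."""
    s = name.lower()
    s = s.replace("gh", "g").replace("dh", "d")  # fold digraphs first
    s = re.sub(r"[qk]", "g", s)      # the variants use q, k and g for one letter
    s = s.replace("y", "i")          # treat a trailing 'y' as a vowel
    s = re.sub(r"[aeiou]", "", s)    # drop vowels
    return re.sub(r"(.)\1+", r"\1", s)  # collapse doubled letters


variants = [
    "Gadafi", "Gaddafi", "Gaddfi", "Gadhafi",
    "Ghaddafi", "Kadaffy", "Qaddafi", "Qadhafi",
]
keys = {crude_key(v) for v in variants}
print(keys)
```

All eight spellings collapse to a single key here, which helps recall; the price is precision, since any name with the same consonant skeleton would match too.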
Ambiguity
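 Homographs
 قدم (e.g. "foot" or "presented")
 Internal word structure ambiguity
 بعقوبة (the city name "Baqubah", or بـ + عقوبة, "with a punishment")
 Syntactic ambiguity
 قابلت مدير البنك الجديد ("I met the new manager of the bank" or "I met the manager of the new bank")
 Semantic ambiguity
 يحب علي احمد اكثر من ابراهيم (e.g. "Ali likes Ahmad more than Ibrahim does" or "more than he likes Ibrahim")
 Anaphoric ambiguity
 قابل الصحفي الوزير الذي انتقده ("The journalist met the minister who criticized him" or "whom he criticized")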
• Homonyms: words that sound alike but have different meanings.
• Homophones: homonyms that sound alike and have different meanings, but have different spellings.
• Homographs: words that are spelled the same but have different meanings.
So?
Ambiguity in Word Segmentation*
* Abed Alhakim Freihat, et al. A Single-Model Approach for Arabic Segmentation, POS-Tagging, and Named Entity Recognition.
Ambiguity in POS-Tagging*
* Abed Alhakim Freihat, et al. A Single-Model Approach for Arabic Segmentation, POS-Tagging, and Named Entity Recognition.
Ambiguity in Fine-grained POS-Tagging*
* Abed Alhakim Freihat, et al. A Single-Model Approach for Arabic Segmentation, POS-Tagging, and Named Entity Recognition.
NLP Issues that Apply to Arabic (and many others)
 Lack of resources
 Financial (governmental and private)
 Human (computer and linguistic researchers and programmers)
 Lack of tools
 Lack of linguistic references
 E.g. lexical semantics (classes, synonyms, relationships)
 Lack of corpora
 Training data
 Benchmarks
It's NOT only Arabic
English
 Hewlett-Packard → Hewlett and Packard as two tokens?
 state-of-the-art: break up hyphenated sequence.
 co-education
 lowercase, lower-case, lower case ?
 It can be effective to get the user to put in possible hyphens
 San Francisco: one token or two?
 How do you decide it is one token?
Sec. 2.2.1
C. D Manning, P. Raghavan, H. Schutze, Introduction to Information Retrieval, Cambridge University Press, 2008
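The questions on this slide come down to which tokenization rule is applied. A minimal sketch comparing two simple rules (both function names are illustrative assumptions, not taken from the cited book):

```python
import re


def tokenize_whitespace(text):
    """Split on whitespace only; hyphenated sequences stay whole."""
    return text.split()


def tokenize_split_hyphens(text):
    """Split on whitespace and hyphens alike."""
    return [t for t in re.split(r"[\s-]+", text) if t]


for phrase in ["Hewlett-Packard", "lower-case", "lower case", "San Francisco"]:
    print(phrase, tokenize_whitespace(phrase), tokenize_split_hyphens(phrase))
```

Splitting hyphens makes "lower-case" and "lower case" match each other but breaks "Hewlett-Packard" apart, and neither rule keeps "San Francisco" together; that would need a phrase list or a learned model.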
French
 L'ensemble → one token or two?
 L ? L’ ? Le ?
 Want l’ensemble to match with un ensemble
 Until at least 2003, it didn’t on Google
 Internationalization!
Sec. 2.2.1
C. D Manning, P. Raghavan, H. Schutze, Introduction to Information Retrieval, Cambridge University Press, 2008
German
 Lebensversicherungsgesellschaftsangestellter
 ‘life insurance company employee’
 German retrieval systems benefit greatly from a compound splitter module
 Can give a 15% performance boost for German
C. D Manning, P. Raghavan, H. Schutze, Introduction to Information Retrieval, Cambridge University Press, 2008
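A compound splitter can be sketched as a recursive dictionary lookup. The version below is a toy under stated assumptions: the four-word vocabulary is made up for this example, only the linking element "s" is handled, and production splitters rely on large lexicons and corpus frequencies instead:

```python
def split_compound(word, vocab, linkers=("", "s")):
    """Return one split of `word` into vocabulary items, or None.

    Tries the longest known prefix first. A part may be followed by a
    linking element (such as the German linking "s"), which is dropped.
    """
    word = word.lower()
    if not word:
        return []
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head not in vocab:
            continue
        rest = word[end:]
        for link in linkers:
            if not rest.startswith(link):
                continue
            remainder = rest[len(link):]
            if link and not remainder:
                continue  # a linking element cannot end the word
            tail = split_compound(remainder, vocab, linkers)
            if tail is not None:
                return [head] + tail
    return None


# Hypothetical mini-lexicon covering only the slide's example.
vocab = {"leben", "versicherung", "gesellschaft", "angestellter"}
parts = split_compound("Lebensversicherungsgesellschaftsangestellter", vocab)
print(parts)
```

Indexing the parts alongside the full compound lets a query for one component retrieve documents that only contain the long form.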
Japanese
 Japanese (and Chinese) have no spaces between words:
 莎拉波娃现在居住在美国东南部的佛罗里达。
 Not always guaranteed a unique tokenization
 Further complicated in Japanese, with multiple alphabets intermingled
 Dates/amounts in multiple formats
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
Katakana Hiragana Kanji Romaji
End-user can express query entirely in hiragana!
Sec. 2.2.1
C. D Manning, P. Raghavan, H. Schutze, Introduction to Information Retrieval, Cambridge University Press, 2008
Solutions (to Arabic), Technical
 Preprocessing
 Spell checking (a whole new problem)
 Tatweel
 Repetitions
 Normalization
To improve matching (i.e. increase recall)
 Change أ, إ, آ, ء to ا
 Change ة to ه
 Change ى to ي
 Can use rule-based approaches to solve some common issues
 Dialects and MSA might need different approaches
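The preprocessing and normalization steps above can be sketched as a small rule-based function. The character folds follow the slide. Two steps are assumptions added for illustration: stripping short-vowel diacritics, and reading "Repetitions" as collapsing a letter repeated three or more times. The name `normalize` is illustrative as well:

```python
import re

TATWEEL = "\u0640"  # elongation (kashida) character
DIACRITICS = re.compile("[\u064B-\u0652]")  # fathatan through sukun
# An Arabic letter repeated 3+ times (assumed meaning of "Repetitions").
REPEATS = re.compile("([\u0621-\u064A])\\1{2,}")

# Character folds listed on the slide.
FOLD = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0621": "\u0627",  # standalone hamza -> bare alef
    "\u0629": "\u0647",  # teh marbuta -> heh
    "\u0649": "\u064A",  # alef maksura -> yeh
})


def normalize(text):
    """Apply tatweel removal, diacritic stripping, repeat collapsing, folding."""
    text = text.replace(TATWEEL, "")
    text = DIACRITICS.sub("", text)  # assumption: not listed on the slide
    text = REPEATS.sub(r"\1", text)
    return text.translate(FOLD)


print(normalize("\u0623\u062D\u0645\u062F"))  # a name written with hamza on alef
```

Folding raises recall, as the slide notes, but lowers precision: words that differ only in hamza, teh marbuta, or final yeh become indistinguishable, so the same function has to be applied to both the indexed text and the query.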
Question Answering**
** Hammo et al. QARAB: A Question Answering System to Support the Arabic Language. Workshop on Computational Approaches to Semitic Languages, ACL 2002.
Available Tools
 Mawdoo3’s AI tools (MALTK)
 Arabic Treebank
 Arabic WordNet
 MySQL database
 SUMO Ontology
 Java
 Dr. Mostafa Jarrar's Arabic Ontology
 Quran Corpus
 Dr. Nizar Habash’s MADAR
 Al-Ma’any dictionary
 Microsoft Arabic Toolkit (ATK)
 BabelNet
 Farasa
Solutions, Research
 Governments and private organizations must
 Support research
 i.e. pour millions!
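 E.g. مبادرة ضاد من ولي العهد الاردني (the "Dhad" initiative of the Jordanian Crown Prince)
 Find a way to standardize and centralize Arabic
 Teach Arabic professionally
 The Semantic web
 Understanding the context
وتهدف المبادرة التي ستندرج ضمن مبادرات مؤسسة ولي العهد، إلى إعداد سفراء للغة العربية يعملون على تعزيز استخدامها في مختلف ميادين المعرفة، وإنشاء وتعزيز عدد من المنصّات للتواصل باللغة العربية، واستخدامها في جميع مجالات الحياة العملية والعلمية والتكنولوجية.
(Translation: The initiative, which will come under the Crown Prince Foundation's initiatives, aims to prepare ambassadors for the Arabic language who work to promote its use in the various fields of knowledge, and to create and strengthen a number of platforms for communicating in Arabic and using it in all practical, scientific, and technological areas of life.)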
Summary
 NLU has many challenges and opportunities
 Arabic NLU has even more challenges and opportunities
Acknowledgment
 Dr. Abdallah Abo Shmais
 Linguistic Research Manager at Mawdoo3
 Dr. Abed Al-Hakim Freihat
 Senior Computational Linguist at Mawdoo3
وسعت كتاب الله لفظاً وغاية / وما ضقت عن آي به وعظات
فكيف أضيق اليوم عن وصف آلة / وتنسيق أسماء لمخترعات
أنا البحر في أحشائه الدر كامن / فهل سألوا الغواص عن صدفاتي
فيا ويحكم أبلى وتبلى محاسني / ومنكم وإن عز الدواء أساتي
فلا تكلوني للزمان فإنني / أخاف عليكم أن تحين وفاتي
حافظ ابراهيم
Thank You

Weitere ähnliche Inhalte

Was ist angesagt?

Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
Yasir Khan
 
Adversarial search
Adversarial searchAdversarial search
Adversarial search
Nilu Desai
 
Practical Natural Language Processing
Practical Natural Language ProcessingPractical Natural Language Processing
Practical Natural Language Processing
Jaganadh Gopinadhan
 

Was ist angesagt? (20)

Recent Advances in Natural Language Processing
Recent Advances in Natural Language ProcessingRecent Advances in Natural Language Processing
Recent Advances in Natural Language Processing
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language Processing
 
NLP_KASHK:POS Tagging
NLP_KASHK:POS TaggingNLP_KASHK:POS Tagging
NLP_KASHK:POS Tagging
 
Nlp
NlpNlp
Nlp
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with Python
 
Presentation on Text Classification
Presentation on Text ClassificationPresentation on Text Classification
Presentation on Text Classification
 
Text Classification
Text ClassificationText Classification
Text Classification
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
 
Lecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationLecture: Word Sense Disambiguation
Lecture: Word Sense Disambiguation
 
Adversarial search
Adversarial searchAdversarial search
Adversarial search
 
Bert
BertBert
Bert
 
Natural Language Toolkit (NLTK), Basics
Natural Language Toolkit (NLTK), Basics Natural Language Toolkit (NLTK), Basics
Natural Language Toolkit (NLTK), Basics
 
NLP State of the Art | BERT
NLP State of the Art | BERTNLP State of the Art | BERT
NLP State of the Art | BERT
 
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
 
Usage of regular expressions in nlp
Usage of regular expressions in nlpUsage of regular expressions in nlp
Usage of regular expressions in nlp
 
Practical Natural Language Processing
Practical Natural Language ProcessingPractical Natural Language Processing
Practical Natural Language Processing
 
Conditional Random Fields - Vidya Venkiteswaran
Conditional Random Fields - Vidya VenkiteswaranConditional Random Fields - Vidya Venkiteswaran
Conditional Random Fields - Vidya Venkiteswaran
 

Ähnlich wie Processing Arabic Text

Design and Implementation of a Language Assistant for English – Arabic Texts
Design and Implementation of a Language Assistant for English – Arabic TextsDesign and Implementation of a Language Assistant for English – Arabic Texts
Design and Implementation of a Language Assistant for English – Arabic Texts
IJCSIS Research Publications
 
Rule-Based Standard Arabic Phonetization at Phoneme, Allophone, and Syllable ...
Rule-Based Standard Arabic Phonetization at Phoneme, Allophone, and Syllable ...Rule-Based Standard Arabic Phonetization at Phoneme, Allophone, and Syllable ...
Rule-Based Standard Arabic Phonetization at Phoneme, Allophone, and Syllable ...
CSCJournals
 

Ähnlich wie Processing Arabic Text (20)

The Latest Advances in Patent Machine Translation
The Latest Advances in Patent Machine TranslationThe Latest Advances in Patent Machine Translation
The Latest Advances in Patent Machine Translation
 
Sarvesh mittal major project
Sarvesh mittal major projectSarvesh mittal major project
Sarvesh mittal major project
 
Sarvesh mittal major project
Sarvesh mittal major projectSarvesh mittal major project
Sarvesh mittal major project
 
Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...
 
Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...
 
New Words New Toys
New Words New ToysNew Words New Toys
New Words New Toys
 
T URN S EGMENTATION I NTO U TTERANCES F OR A RABIC S PONTANEOUS D IALOGUES ...
T URN S EGMENTATION I NTO U TTERANCES F OR  A RABIC  S PONTANEOUS D IALOGUES ...T URN S EGMENTATION I NTO U TTERANCES F OR  A RABIC  S PONTANEOUS D IALOGUES ...
T URN S EGMENTATION I NTO U TTERANCES F OR A RABIC S PONTANEOUS D IALOGUES ...
 
Seven Steps to better Translations - A Beechwood Guide to Translation
Seven Steps to better Translations - A Beechwood Guide to TranslationSeven Steps to better Translations - A Beechwood Guide to Translation
Seven Steps to better Translations - A Beechwood Guide to Translation
 
How To "Speak Developer"
How To "Speak Developer"How To "Speak Developer"
How To "Speak Developer"
 
Growing Your Freelance Business (Olga Melnikova)
Growing Your Freelance Business (Olga Melnikova)Growing Your Freelance Business (Olga Melnikova)
Growing Your Freelance Business (Olga Melnikova)
 
Ijetcas14 575
Ijetcas14 575Ijetcas14 575
Ijetcas14 575
 
Design and Implementation of a Language Assistant for English – Arabic Texts
Design and Implementation of a Language Assistant for English – Arabic TextsDesign and Implementation of a Language Assistant for English – Arabic Texts
Design and Implementation of a Language Assistant for English – Arabic Texts
 
Creating Technical Documents In English For Global Audiences
Creating Technical Documents In English For Global AudiencesCreating Technical Documents In English For Global Audiences
Creating Technical Documents In English For Global Audiences
 
24 Ways to Shut Down The Application and Other Apocryphal Stories
24 Ways to Shut Down The Application and Other Apocryphal Stories24 Ways to Shut Down The Application and Other Apocryphal Stories
24 Ways to Shut Down The Application and Other Apocryphal Stories
 
English in the 21st Century Global Knowledge Economy
English in the 21st Century Global Knowledge EconomyEnglish in the 21st Century Global Knowledge Economy
English in the 21st Century Global Knowledge Economy
 
Internationalised Domain Names & Internet Investigations
Internationalised Domain Names & Internet InvestigationsInternationalised Domain Names & Internet Investigations
Internationalised Domain Names & Internet Investigations
 
Mohammad mahmoud abdul gawwad
Mohammad mahmoud abdul gawwad Mohammad mahmoud abdul gawwad
Mohammad mahmoud abdul gawwad
 
Python overview
Python overviewPython overview
Python overview
 
"Machine Translation 101" and the Challenge of Patents
"Machine Translation 101" and the Challenge of Patents"Machine Translation 101" and the Challenge of Patents
"Machine Translation 101" and the Challenge of Patents
 
Rule-Based Standard Arabic Phonetization at Phoneme, Allophone, and Syllable ...
Rule-Based Standard Arabic Phonetization at Phoneme, Allophone, and Syllable ...Rule-Based Standard Arabic Phonetization at Phoneme, Allophone, and Syllable ...
Rule-Based Standard Arabic Phonetization at Phoneme, Allophone, and Syllable ...
 

Mehr von Jordan Open Source Association

Mehr von Jordan Open Source Association (20)

JOSA TechTalks - Data Oriented Architecture
JOSA TechTalks - Data Oriented ArchitectureJOSA TechTalks - Data Oriented Architecture
JOSA TechTalks - Data Oriented Architecture
 
JOSA TechTalks - Machine Learning on Graph-Structured Data
JOSA TechTalks - Machine Learning on Graph-Structured DataJOSA TechTalks - Machine Learning on Graph-Structured Data
JOSA TechTalks - Machine Learning on Graph-Structured Data
 
OpenSooq Mobile Infrastructure @ Scale
OpenSooq Mobile Infrastructure @ ScaleOpenSooq Mobile Infrastructure @ Scale
OpenSooq Mobile Infrastructure @ Scale
 
Data-Driven Digital Transformation
Data-Driven Digital TransformationData-Driven Digital Transformation
Data-Driven Digital Transformation
 
Data Science in Action
Data Science in ActionData Science in Action
Data Science in Action
 
JOSA TechTalks - Downgrade your Costs
JOSA TechTalks - Downgrade your CostsJOSA TechTalks - Downgrade your Costs
JOSA TechTalks - Downgrade your Costs
 
JOSA TechTalks - Docker in Production
JOSA TechTalks - Docker in ProductionJOSA TechTalks - Docker in Production
JOSA TechTalks - Docker in Production
 
JOSA TechTalks - Word Embedding and Word2Vec Explained
JOSA TechTalks - Word Embedding and Word2Vec ExplainedJOSA TechTalks - Word Embedding and Word2Vec Explained
JOSA TechTalks - Word Embedding and Word2Vec Explained
 
JOSA TechTalks - Better Web Apps with React and Redux
JOSA TechTalks - Better Web Apps with React and ReduxJOSA TechTalks - Better Web Apps with React and Redux
JOSA TechTalks - Better Web Apps with React and Redux
 
JOSA TechTalks - RESTful API Concepts and Best Practices
JOSA TechTalks - RESTful API Concepts and Best PracticesJOSA TechTalks - RESTful API Concepts and Best Practices
JOSA TechTalks - RESTful API Concepts and Best Practices
 
Web app architecture
Web app architectureWeb app architecture
Web app architecture
 
Intro to the Principles of Graphic Design
Intro to the Principles of Graphic DesignIntro to the Principles of Graphic Design
Intro to the Principles of Graphic Design
 
Intro to Graphic Design Elements
Intro to Graphic Design ElementsIntro to Graphic Design Elements
Intro to Graphic Design Elements
 
JOSA TechTalk: Realtime monitoring and alerts
JOSA TechTalk: Realtime monitoring and alerts JOSA TechTalk: Realtime monitoring and alerts
JOSA TechTalk: Realtime monitoring and alerts
 
JOSA TechTalk: Metadata Management
in Big Data
JOSA TechTalk: Metadata Management
in Big DataJOSA TechTalk: Metadata Management
in Big Data
JOSA TechTalk: Metadata Management
in Big Data
 
JOSA TechTalk: Introduction to Supervised Learning
JOSA TechTalk: Introduction to Supervised LearningJOSA TechTalk: Introduction to Supervised Learning
JOSA TechTalk: Introduction to Supervised Learning
 
JOSA TechTalk: Taking Docker to Production
JOSA TechTalk: Taking Docker to ProductionJOSA TechTalk: Taking Docker to Production
JOSA TechTalk: Taking Docker to Production
 
JOSA TechTalk: Introduction to docker
JOSA TechTalk: Introduction to dockerJOSA TechTalk: Introduction to docker
JOSA TechTalk: Introduction to docker
 
D programming language
D programming languageD programming language
D programming language
 
A taste of Functional Programming
A taste of Functional ProgrammingA taste of Functional Programming
A taste of Functional Programming
 

Kürzlich hochgeladen

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Kürzlich hochgeladen (20)

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 

Processing Arabic Text

  • 1. Processing Arabic Text JOSA TechTalk Dr. Samir Tartir 1/May/2019 Processing Arabic Text
  • 2. Outline  Intros  NLU  Issues in NLU  Issues in Arabic NLU  Tools & Resources
  • 3. Dr. Samir Tartir  Senior Research Scientist of AI at Mawdoo3  PhD from the University of Georgia (USA) in 2009  BSc & MSc from the University of Jordan in 1998 & 2002  Highly referenced researcher  Recognized educator (CourseHero.com)  Research areas:  Arabic NLP, Semantic Web, & Ontologies.  Previously:  Head and founder of the Department of Web Engineering at Philadelphia University (1 year)  Assistant professor at the Department of Computer Science at Philadelphia University (9 years)  Google outsourced team leader 2011-2016  Database programmer for 5 years (in Jordan, USA)
  • 4.
  • 5. Mawdoo3 is the largest Arabic website in the world. Committed to enhancing the Arabic language through content and technology. Mawdoo3 © Copyright 2018. All Rights Reserved.
  • 6. ACHIEVEMENTS  50M monthly active users (MAU)  150K articles  300 paid expert contributors  100B words read every month
  • 7. Mawdoo3 AI  Founded in January 2017  Leverage data collected by Mawdoo3.com for AI  Currently working on Salma  The first Arabic Voice Enabled Assistant  The beta version was released during a Silicon Valley event  We are about to release the next version with major updates (weekend)  It is available for download for Android and iOS devices  Continuously improving (you’ll understand soon)
  • 9.
  • 13. blah blah blah 5 words: I, must, organize, my, room
  • 16. Many issues* 1. Easy or mostly solved issues 2. Issues with intermediate progress 3. Hard issues that still need a lot of work. Mostly applies to English, but also applies to Arabic. *https://www.quora.com/What-are-the-major-open-problems-in-natural-language-understanding
  • 17. 1. Easy or mostly solved  Spam detection and email classification  Works!
  • 18. 1. Easy or mostly solved  Part-of-Speech Tagging (≈ إعراب)  Follows tokenization (تجزيء)  KEY: N = Noun, V = Verb, P = Preposition, ADV = Adverb, ADJ = Adjective, POSS = Possessive, etc.  INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.  OUTPUT: Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V forecasts/N on/P Wall/N Street/N ,/, as/P their/POSS CEO/N Alan/N Mulally/N announced/V first/ADJ quarter/N results/N ./.
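A minimal sketch of how the word/TAG output format on this slide can be read back into (word, tag) pairs. The function name `parse_tagged` is hypothetical and the snippet only parses an already tagged string; it is not a tagger.

```python
def parse_tagged(text: str) -> list[tuple[str, str]]:
    """Split 'word/TAG' items into (word, tag) pairs.

    rpartition splits on the LAST slash, so punctuation items
    such as ',/,' and './.' are handled like any other token.
    """
    pairs = []
    for item in text.split():
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs


# First part of the tagged OUTPUT shown on the slide.
tagged = "Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V"
pairs = parse_tagged(tagged)

# Example query: collect every token tagged as a noun.
nouns = [word for word, tag in pairs if tag == "N"]
print(nouns)
```

Keeping the tag attached to each token in this way makes it easy to filter by tag in later steps.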
  • 19. 1. Easy or mostly solved  Named Entity Recognition  Usual classes: People, organizations, locations  KEY: P = Person, O = Organization, L = Location  INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.  OUTPUT: Profits soared at {Boeing Co.}/O , easily topping forecasts on {Wall Street}/L , as their CEO {Alan Mulally}/P announced first quarter results .
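  • 20. Arabic NER (مستخرج الأعلام) example*, with named entities marked as locations, persons, and organizations: وقد أبدى خافيير سولانا المنسق الأعلى للسياسة الخارجية في الاتحاد الأوروبي، تفاؤلا كبيرا بالمبادرة المصرية، مرجحا احتمال وقف إطلاق النار قريبا، وتوقعت مصادر فرنسية انسحاب القوات الإسرائيلية من غزة خلال ثمانية أيام، وأوضح سولانا أن دعوة مصر لإسرائيل لبحث وقف الهجوم قد تؤتي ثمارها خلال الساعات القليلة المقبلة، مؤكدا أن الدول دائمة العضوية بمجلس الأمن استقبلت المبادرة المصرية بترحاب شديد. (Translation: Javier Solana, the European Union's High Representative for foreign policy, expressed great optimism about the Egyptian initiative and considered a ceasefire likely soon. French sources expected Israeli forces to withdraw from Gaza within eight days. Solana said that Egypt's invitation to Israel to discuss halting the offensive may bear fruit within the next few hours, stressing that the permanent members of the Security Council had warmly welcomed the Egyptian initiative.) *Microsoft Arabic NLP Toolkit (ATK) For Academia in the Arab World Presentation, 11/2012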
  • 21. 1. Easy or mostly solved  Plagiarism detection  INPUT: A research paper, homework report, etc.  OUTPUT: Rate of similarity to other papers
  • 22. 2. Intermediate or making good progress  Sentiment analysis - Examples:  Best roast chicken in San Francisco! – Positive  The waiter ignored us for 20 minutes. – Negative  قال احسن مطعم قال! (sarcastic: "Best restaurant, they say!")
  • 23. 2. Intermediate or making good progress  Coreference resolution - Example:  "The thieves stole the paintings. They were subsequently caught."  "The thieves stole the paintings. They were subsequently sold."  "The thieves stole the paintings. They were subsequently found."  The task is to decide whether "they" refers to "the thieves" or "the paintings".
  • 24. 2. Intermediate or making good progress  Word sense disambiguation - Example:  “I need new batteries for my mouse”
  • 25. 2. Intermediate or making good progress  Machine Translation  Translating sentences from one language to another  The "best" example would be Google Translate.
  • 26. 3. Hard or still needs a lot of work  Text Summarization  Take one or more text documents as input and condense them into a summary.  Example: Lithium-Ion Batteries: A Machine-Generated Summary of Current Research (Author: Beta Writer, © 2019), https://www.springer.com/us/book/9783030167998 • The first machine-generated book in chemistry • Provides an overview of recent research • Includes summaries of 150 articles
  • 27. 3. Hard or still needs a lot of work  Machine dialog system - Example:  User - I need a flight from New York to London, arriving at 10 pm.  System - What day are you leaving?  User - Tomorrow.  The system detects the information that is missing from the user's request.
  • 29. Background  History – Arabic has remained intelligible and functional for around 20 centuries.  Strategically important – 330 million speakers living in an important region  huge oil reserves, sacred sites, sea ports – 1.4 billion Muslims use it in their prayers.  Cultural and literary heritage – Closely associated with Islam
  • 34. One-Letter Verbs* (imperatives of weak verbs that reduce to a single letter)
      1. (إ) من وأى يئي وأيًا، والوأي: الوعد.
      2. (ت) من أتى يأتي أتيًا وأتيًّا وإتيًّا وإتيانًا وإتيانةً ومأتاةً؛ أي: جئته، وبعض العرب تقول في الأمر: (ائت)، وأكثر الناس يأمرون بـ(ائت) لا (ت)!
      3. (ث) من وثى يثي؛ أي: وشى به عند السلطان.
      4. (ج) من وجى يجي؛ أي قطع، وأما الوجا فهو الحفا.
      5. (ح) من وحى يحي وحيًا، والوحي: الإشارة والكتابة والكلام.
      6. (خ) وخى يخي وخيًا؛ أي: قصد.
      7. (د) من ودى يدي وديًا؛ أي: دفع الدية.
      8. (رَ) أ - من رأى يرى الهلال رؤيةً ورأيًا وراءةً.
      9. (رِ) ب - ومن ورى يري القيحُ جوفَه وريًا؛ أي: أفسده.
      10. (س) من وسى يسي زيدٌ رأسَ عمرو وسيًا؛ أي حلقه.
      11. (ش) من وشى يشي وشيًا، والوشي: نقش الثوب.
      12. (ص) من وصى يصي الشيءَ بالشيء وصيًا؛ أي وصله.
      13. (ع) من وعى يعي وعيًا؛ أي: حفظ وجمع.
      14. (ف) من وفى يفي وفاءً؛ بمعنى الوفاء بالعهد.
      15. (ق) من وقى يقي وقيًا ووقايةً وواقيةً، بمعنى الحفظ، واو الوقاية مثلثة.
      16. (ك) من وكى يكي القربةَ، والوكاء: رباط القربة وغيرها.
      17. (ل) من ولى يلي ولايةً، والولاية: الإمارة.
      18. (م) ومى يمي وميًا؛ أشار.
      19. (ن) ونى يني ونيًا ووناءً، أي: تعب.
      20. (هـ) وهى يهي وهيًا؛ أي: ضعف.
      *https://www.m-a-arabia.com/vb/showthread.php?t=140
  • 35. Versions  Classical: فصحى مشكلة (written with diacritics)  Modern Standard Arabic (MSA): فصحى غير مشكلة (written without diacritics)  Dialects (اللهجات) 1. Egyptian Arabic (EGY) 2. Levantine Arabic (LEV) 3. Gulf Arabic (GLF) 4. North African Arabic (NOR)  كراء: استئجار (both mean "renting")
  • 36. Arabic Language Characteristics  Written right to left  Highly structured  Highly derivational language  Morphology  Free word order  Many dialects  Long sentences
  • 37. Example* *Microsoft Arabic NLP Toolkit (ATK) For Academia in the Arab World Presentation, 11/2012
  • 38. Issues Specific to Arabic, Part 1  No central entity that governs the language  E.g. produces dictionaries  Punctuation mark rules  The language covers a large area and has imported different foreign words from different neighboring languages  Letters:  One letter, one sound  Letters change shape  Hamza  No capital letters  High error rate of written text
  • 39. High WER  Arabic content has a very high Word Error Rate (WER).  Analysis of a 1000-article tagged corpus: the average WER is 6% in news text.  Error rate by news site: Akbar El Youm 13%; BBC Arabic 9%; AL-Nahar 8%; Al-Syassa 8%; Al-Ahram 8%; Al-Quds 8%; Al-Qabas 5%; Al-Hayat 4%; Al-Jazeera 1% *Microsoft Arabic NLP Toolkit (ATK) For Academia in the Arab World Presentation, 11/2012
  • 40. High WER  Error rate by error class: Missing Hamza 21%; Extra Hamza 19%; Missing Yaa 15%; Extra Yaa 12%; Missing TaaMarbouta 11%; Extra TaaMarbouta 8%; Wrong Hamza 7%; Missing Space 2%; Swapped Letters 1%  Hamza (47%) + Yaa (15%) + Taa (19%) = 81% *Microsoft Arabic NLP Toolkit (ATK) For Academia in the Arab World Presentation, 11/2012
  • 41. Issues Specific to Arabic, Part 1  Synonymy and confusion of non-standardized terms  Thermometer: محر، محرار، مقياس حرارة، ميزان حرارة، ترمومتر  Technical translation  Hydrometer: جهاز قياس كثافة السوائل  Male/female forms  Dual form  Uncle, parent…
  • 42. The Qadafi/Schwarzenegger Problem*  One Arabic spelling, many English spellings: قذافي → Gadafi, Gaddafi, Gaddfi, Gadhafi, Ghaddafi, Kadaffy, Qaddafi, Qadhafi  Many Arabic spellings, one English spelling: شوارزنيغر، شوارزنغر، شوارزنيجر، شوارزنجر → Schwarzenegger *Mona Diab & Nizar Habash. Natural Language Processing of Arabic and its Dialects. EMNLP 2014, Doha, Qatar.
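One simple way to match spelling variants like the English ones on this slide is to reduce each name to a rough consonant skeleton and compare the skeletons. The sketch below is only an illustration of that idea, not a method from the talk: the function name `skeleton_key` and its folding rules are assumptions chosen to fit this one list, and rules this aggressive would also merge unrelated names.

```python
import re

# The English spellings listed on the slide.
VARIANTS = ["Gadafi", "Gaddafi", "Gaddfi", "Gadhafi",
            "Ghaddafi", "Kadaffy", "Qaddafi", "Qadhafi"]


def skeleton_key(name: str) -> str:
    """Reduce a romanized name to a crude consonant skeleton (hypothetical rules)."""
    s = name.lower()
    s = re.sub(r"gh|q|k", "g", s)    # fold the competing spellings of the first consonant
    s = re.sub(r"[aeiouyh]", "", s)  # drop vowels, 'y', and 'h'
    s = re.sub(r"(.)\1+", r"\1", s)  # collapse doubled letters
    return s


keys = {variant: skeleton_key(variant) for variant in VARIANTS}
for variant, key in keys.items():
    print(f"{variant:10s} -> {key}")
```

All eight spellings end up with the same key, so a lookup keyed on the skeleton would treat them as one name.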
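  • 43. Ambiguity  Homographs: قدم ("foot", "presented", or "became old", depending on vocalization)  Internal word structure ambiguity: بعقوبة (the city of Baqubah, or بـ + عقوبة, "with a punishment")  Syntactic ambiguity: قابلت مدير البنك الجديد ("I met the new bank manager" or "I met the manager of the new bank")  Semantic ambiguity: يحب علي احمد اكثر من ابراهيم ("Ali loves Ahmad more than Ibrahim": more than he loves Ibrahim, or more than Ibrahim does)  Anaphoric ambiguity: قابل الصحفي الوزير الذي انتقده ("The journalist met the minister who criticized him" or "whom he criticized") • Homonyms: words that sound alike but have different meanings. • Homophones: homonyms that sound alike and have different meanings, but have different spellings. • Homographs: words that are spelled the same but have different meanings.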
  • 44. So?
  • 45. Ambiguity in Word Segmentation* *Freihat, Abed Alhakim, et al. A Single-Model Approach for Arabic Segmentation, POS-Tagging, and Named Entity Recognition.
  • 46. Ambiguity in POS-Tagging* *Freihat, Abed Alhakim, et al. A Single-Model Approach for Arabic Segmentation, POS-Tagging, and Named Entity Recognition.
  • 47. Ambiguity in Fine-grained POS-Tagging* *Freihat, Abed Alhakim, et al. A Single-Model Approach for Arabic Segmentation, POS-Tagging, and Named Entity Recognition.
  • 48. NLP Issues that Apply to Arabic (and many others)  Lack of resources  Financial (governmental and private)  Human (computer and linguistic researchers and programmers)  Lack of tools  Lack of linguistic references  E.g. lexical semantics (classes, synonyms, relationships)  Lack of corpora  Training data  Benchmarks
  • 50. English  Hewlett-Packard → Hewlett and Packard as two tokens?  state-of-the-art: break up hyphenated sequence.  co-education  lowercase, lower-case, lower case ?  It can be effective to get the user to put in possible hyphens  San Francisco: one token or two?  How do you decide it is one token? Sec. 2.2.1 C. D Manning, P. Raghavan, H. Schutze, Introduction to Information Retrieval, Cambridge University Press, 2008
  • 51. French  L'ensemble → one token or two?  L ? L’ ? Le ?  Want l’ensemble to match with un ensemble  Until at least 2003, it didn’t on Google  Internationalization! Sec. 2.2.1 C. D Manning, P. Raghavan, H. Schutze, Introduction to Information Retrieval, Cambridge University Press, 2008
  • 52. German  Lebensversicherungsgesellschaftsangestellter  ‘life insurance company employee’  German retrieval systems benefit greatly from a compound splitter module  Can give a 15% performance boost for German C. D Manning, P. Raghavan, H. Schutze, Introduction to Information Retrieval, Cambridge University Press, 2008
  • 53. Japanese  Japanese (and Chinese) have no spaces between words:  莎拉波娃现在居住在美国东南部的佛罗里达。  Not always guaranteed a unique tokenization  Further complicated in Japanese, with multiple alphabets intermingled  Dates/amounts in multiple formats フォーチュン500社は情報不足のため時間あた$500K(約6,000万円) Katakana Hiragana Kanji Romaji End-user can express query entirely in hiragana! Sec. 2.2.1 C. D Manning, P. Raghavan, H. Schutze, Introduction to Information Retrieval, Cambridge University Press, 2008
  • 54. Solutions (to Arabic), Technical  Preprocessing  Spell checking (a whole new problem)  Tatweel  Repetitions  Normalization, to improve matching (i.e. increase recall)  Change أ, إ, آ, ء to ا  Change ة to ه  Change ى to ي  Can use rule-based approaches to solve some common issues  Dialects and MSA might need different approaches
  • 55. Question Answering* *Hammo et al. QARAB: A Question Answering System to Support the Arabic Language. Workshop on Computational Approaches to Semitic Languages, ACL 2002
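The preprocessing and normalization rules on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the speaker's implementation: the function name `normalize_arabic` is hypothetical, "Tatweel" is read as removing the elongation character (U+0640), and "Repetitions" is read as collapsing runs of three or more identical characters to one. Letters are written as Unicode escapes so the mapping stays unambiguous.

```python
import re

TATWEEL = "\u0640"  # elongation character (kashida)

# Letter mappings listed on the slide, written as code points.
LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0621": "\u0627",  # standalone hamza      -> bare alef (as listed on the slide)
    "\u0629": "\u0647",  # taa marbouta          -> haa
    "\u0649": "\u064A",  # alef maqsura          -> yaa
})

# Assumption: three or more identical characters in a row are an elongation.
REPEATS = re.compile(r"(.)\1{2,}")


def normalize_arabic(text: str) -> str:
    """Apply tatweel removal, repetition collapsing, and letter normalization."""
    text = text.replace(TATWEEL, "")
    text = REPEATS.sub(r"\1", text)
    return text.translate(LETTER_MAP)


# Two spellings of the same name now compare equal.
with_hamza = "\u0623\u062D\u0645\u062F"     # name written with hamza on the alef
without_hamza = "\u0627\u062D\u0645\u062F"  # same name with a bare alef
print(normalize_arabic(with_hamza) == normalize_arabic(without_hamza))
```

Applying the same function to both the indexed text and the query is what raises recall; the cost is that some genuinely different words become indistinguishable after normalization.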
  • 56. Available Tools  Mawdoo3’s AI tools (MALTK)  Arabic Treebank  Arabic WordNet  MySQL database  SUMO Ontology  Java  Dr. Mostafa Jarrar’s Arabic Ontology  Quran Corpus  Dr. Nizar Habash’s MADAR  Al-Ma’any dictionary  Microsoft Arabic Toolkit (ATK)  BabelNet  Farasa
  • 57. Solutions, Research  Governments and private organizations must  Support research  i.e. pour millions!  E.g. مبادرة ضاد من ولي العهد الاردني (the "Dhad" initiative of the Jordanian Crown Prince)  Find a way to standardize and centralize Arabic  Teach Arabic professionally  The Semantic web  Understanding the context. وتهدف المبادرة التي ستندرج ضمن مبادرات مؤسسة ولي العهد، إلى إعداد سفراء للغة العربية يعملون على تعزيز استخدامها في مختلف ميادين المعرفة، وإنشاء وتعزيز عدد من المنصّات للتواصل باللغة العربية، واستخدامها في جميع مجالات الحياة العملية والعلمية والتكنولوجية. (Translation: The initiative, which will be one of the Crown Prince Foundation's initiatives, aims to prepare ambassadors for the Arabic language who work to promote its use in the various fields of knowledge, and to create and strengthen a number of platforms for communicating in Arabic and using it in all practical, scientific, and technological areas of life.)
  • 58. Summary  NLU has many challenges and opportunities  Arabic NLU has even more challenges and opportunities
  • 59. Acknowledgment  Dr. Abdallah Abo Shmais  Linguistic Research Manager at Mawdoo3  Dr. Abed Al-Hakim Freihat  Senior Computational Linguist at Mawdoo3
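  • 60. Closing verses by حافظ ابراهيم (Hafez Ibrahim), in which the Arabic language speaks of itself:
      وسعت كتاب الله لفظاً وغاية / وما ضقت عن آي به وعظات
      فكيف أضيق اليوم عن وصف آلة / وتنسيق أسماء لمخترعات
      أنا البحر في أحشائه الدر كامن / فهل سألوا الغواص عن صدفاتي
      فيا ويحكم أبلى وتبلى محاسني / ومنكم وإن عز الدواء أساتي
      فلا تكلوني للزمان فإنني / أخاف عليكم أن تحين وفاتي
      (Translation: I encompassed the Book of God in word and purpose, and was never too narrow for its verses and admonitions. / How then could I be too narrow today to describe a machine or to coin names for inventions? / I am the sea, in whose depths the pearls lie hidden; have they asked the diver about my shells? / Woe to you: I decay and my beauties decay, though my healers are among you, scarce as the cure may be. / Do not abandon me to time, for I fear for you should my death draw near.)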