Slides for 'JOSA TechTalks: Arabic NLP in Practice': an introduction to Natural Language Understanding (NLU) with a focus on Arabic, covering the main issues in (Arabic) NLU and the tools and resources in use.
By Dr. Samir Tartir - Senior Research Scientist of AI at Mawdoo3
3. Dr. Samir Tartir
Senior Research Scientist of AI at Mawdoo3
PhD from the University of Georgia (USA) in 2009
BSc & MSc from the University of Jordan in 1998 & 2002
Highly referenced researcher
Recognized educator (CourseHero.com)
Research areas:
Arabic NLP, Semantic Web, & Ontologies.
Previously:
Head and founder of the Department of Web Engineering at Philadelphia University (1 year)
Assistant professor at the Department of Computer Science at Philadelphia University (9 years)
Google outsourced team leader 2011-2016
Database programmer for 5 years (in Jordan, USA)
7. Mawdoo3 AI
Founded in January 2017
Leverage data collected by Mawdoo3.com for AI
Currently working on Salma
The first Arabic Voice Enabled Assistant
The beta version was released during a Silicon Valley event
We are about to release the next version with major updates
(weekend)
It is available for download for Android and iOS devices
Continuously improving (you’ll understand soon)
16. Many issues*
1. Easy or mostly solved issues
2. Issues with Intermediate progress
3. Hard issues, or issues that still need a lot of work
Mostly applies to English, but also applies to Arabic
*https://www.quora.com/What-are-the-major-open-problems-in-natural-language-understanding
17. 1. Easy or mostly solved
Spam detection and email classification
Works!
18. 1. Easy or mostly solved
Part of Speech Tagging (≈ إعراب)
Follows tokenization (تجزيء)
KEY: N = Noun, V = Verb, P = Preposition, Adv = Adverb, etc.
INPUT:
Profits soared at Boeing Co., easily topping forecasts on Wall
Street, as their CEO Alan Mulally announced first quarter results.
OUTPUT:
Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V
forecasts/N on/P Wall/N Street/N ,/, as/P their/POSS CEO/N
Alan/N Mulally/N announced/V first/ADJ quarter/N results/N ./.
19. 1. Easy or mostly solved
Named Entity Recognition
Usual classes: People, organizations, locations
KEY: P = Person, O = Organization, L = Location
INPUT:
Profits soared at Boeing Co., easily topping forecasts on Wall
Street, as their CEO Alan Mulally announced first quarter results.
OUTPUT:
Profits soared at {Boeing Co.}/O , easily topping forecasts on {Wall
Street}/L , as their CEO {Alan Mulally}/P announced first quarter
results .
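The `{text}/LABEL` notation in the output above is easy to consume downstream. The following is a minimal sketch, assuming the annotations always follow that exact pattern; the function name `extract_entities` is hypothetical and only illustrates reading the tagger's output, not how a tagger produces it.

```python
import re

# Matches spans written as {entity text}/LABEL, the notation used on the slide.
ENTITY_PATTERN = re.compile(r"\{([^{}]+)\}/([A-Z]+)")

# Label key taken from the slide.
LABELS = {"P": "Person", "O": "Organization", "L": "Location"}


def extract_entities(annotated):
    """Return (entity text, class name) pairs found in an annotated sentence."""
    pairs = []
    for text, label in ENTITY_PATTERN.findall(annotated):
        # Unknown labels are passed through unchanged.
        pairs.append((text, LABELS.get(label, label)))
    return pairs


annotated = (
    "Profits soared at {Boeing Co.}/O , easily topping forecasts on "
    "{Wall Street}/L , as their CEO {Alan Mulally}/P announced first "
    "quarter results ."
)
entities = extract_entities(annotated)
for text, kind in entities:
    print(f"{kind}: {text}")
```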
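20. Arabic NER Example (مستخرج الأعلام)*
وقد أبدى خافيير سولانا المنسق الأعلى للسياسة الخارجية في الاتحاد الأوروبي، تفاؤلا كبيرا بالمبادرة المصرية، مرجحا احتمال وقف إطلاق النار قريبا، وتوقعت مصادر فرنسية انسحاب القوات الإسرائيلية من غزة خلال ثمانية أيام، وأوضح سولانا أن دعوة مصر لإسرائيل لبحث وقف الهجوم قد تؤتي ثمارها خلال الساعات القليلة المقبلة، مؤكدا أن الدول دائمة العضوية بمجلس الأمن استقبلت المبادرة المصرية بترحاب شديد.
(Translation: Javier Solana, the European Union's High Representative for Foreign Policy, expressed great optimism about the Egyptian initiative and considered a ceasefire likely soon. French sources expected Israeli forces to withdraw from Gaza within eight days. Solana explained that Egypt's invitation to Israel to discuss halting the offensive may bear fruit within the next few hours, stressing that the permanent members of the Security Council received the Egyptian initiative very warmly.)
Named Entities: Locations, Persons, Organizations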
*Microsoft Arabic NLP Toolkit (ATK) For Academia in the Arab World Presentation, 11/2012
21. 1. Easy or mostly solved
Plagiarism detection
INPUT:
A research paper, homework report, etc.
OUTPUT:
Rate of similarity to other papers
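One common way to produce such a similarity rate is to compare overlapping word n-grams ("shingles") with the Jaccard measure. The sketch below assumes that technique and a shingle size of 3; the slide does not say which method real plagiarism detectors use, and the function names are hypothetical.

```python
def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in the text."""
    words = text.lower().split()
    if len(words) < n:
        # Short texts become a single shingle (or nothing if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def similarity_rate(doc_a, doc_b, n=3):
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    a, b = shingles(doc_a, n), shingles(doc_b, n)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


submitted = "natural language understanding is hard for arabic text"
source = "natural language understanding is hard for english text"
rate = similarity_rate(submitted, source)
print(f"similarity: {rate:.0%}")  # 4 shared shingles out of 8 distinct ones -> 50%
```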
22. 2. Intermediate or making good progress
Sentiment analysis - Examples:
Best roast chicken in San Francisco! – Positive
The waiter ignored us for 20 minutes. – Negative
قال احسن مطعم قال! (sarcastic: "Best restaurant, they said!")
23. 2. Intermediate or making good progress
Coreference resolution - Example:
"The thieves stole the paintings. They were subsequently caught."
"The thieves stole the paintings. They were subsequently sold."
"The thieves stole the paintings. They were subsequently found."
The task is to decide whether "they" refers to "the thieves" or "the paintings".
24. 2. Intermediate or making good progress
Word sense disambiguation - Example:
“I need new batteries for my mouse”
25. 2. Intermediate or making good progress
Machine Translation
Translating sentences from one language to another
The “best” example would be Google Translate.
27. 3. Hard or still needs a lot of work
Machine dialog system - Example:
User - I need a flight from New York to London, arriving at 10 pm.
System - What day are you leaving?
User - Tomorrow.
The system detects which information is missing from the user's request and asks for it.
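The exchange above can be viewed as slot filling: the system tracks which pieces of information it has and asks for the first one still missing. The sketch below is a toy illustration under that assumption. The slot names, regular expressions, and follow-up questions are all hypothetical, and a real dialog system would need far more robust language understanding.

```python
import re

# Hypothetical slots for a flight-booking dialog: each maps to a toy pattern
# and to the follow-up question asked when the slot is still empty.
SLOTS = {
    "origin": (
        re.compile(r"\bfrom ([A-Z][a-z]+(?: [A-Z][a-z]+)*)"),
        "Where are you flying from?",
    ),
    "destination": (
        re.compile(r"\bto ([A-Z][a-z]+(?: [A-Z][a-z]+)*)"),
        "Where are you flying to?",
    ),
    "arrival_time": (
        re.compile(r"\barriving at (\d{1,2}(?::\d{2})? ?(?:am|pm))"),
        "What time do you want to arrive?",
    ),
    "date": (
        re.compile(r"\b(today|tomorrow|on \w+day)\b", re.IGNORECASE),
        "What day are you leaving?",
    ),
}


def fill_slots(utterance, state=None):
    """Copy the dialog state and add any slots found in the utterance."""
    state = dict(state or {})
    for name, (pattern, _question) in SLOTS.items():
        match = pattern.search(utterance)
        if match and name not in state:
            state[name] = match.group(1)
    return state


def next_question(state):
    """Return the question for the first missing slot, or None if complete."""
    for name, (_pattern, question) in SLOTS.items():
        if name not in state:
            return question
    return None


state = fill_slots("I need a flight from New York to London, arriving at 10 pm.")
question = next_question(state)   # only the date is missing at this point
state = fill_slots("Tomorrow.", state)
```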
29. Background
History
– Arabic has remained intelligible and functional for around 20 centuries.
Strategically important
– 330 million speakers living in an important region:
huge oil reserves, sacred sites, sea ports
– 1.4 billion Muslims use it in their prayers.
Cultural and literary heritage
– Closely associated with Islam
36. Arabic Language Characteristics
Written right to left
Highly structured
Highly derivational language
Morphology
Free word order
Many dialects
Long sentences
38. Issues Specific to Arabic, Part 1
No central entity that governs the language
E.g. produces dictionaries
Punctuation mark rules
The language covers a large area, so it has imported foreign words from many neighboring languages
Letters:
One letter, one sound
Letters change shape
Hamza
No capital letters
High error rate of written text
39. High WER
Arabic content has a very high Word Error Rate (WER).
Analysis of 1000-Article Tagged Corpus: the Average WER is 6% in News text.
News Site        Error Rate
Akbar El Youm    13%
BBC Arabic       9%
AL-Nahar         8%
Al-Syassa        8%
Al-Ahram         8%
Al-Quds          8%
Al-Qabas         5%
Al-Hayat         4%
Al-Jazeera       1%
*Microsoft Arabic NLP Toolkit (ATK) For Academia in the Arab World Presentation, 11/2012
40. High WER
Error Class          Error Rate
Missing Hamza        21%
Extra Hamza          19%
Missing Yaa          15%
Extra Yaa            12%
Missing TaaMarbouta  11%
Extra TaaMarbouta    8%
Wrong Hamza          7%
Missing Space        2%
Swapped Letters      1%
Hamza (47%) + Yaa (15%) + Taa (19%) = 81%
*Microsoft Arabic NLP Toolkit (ATK) For Academia in the Arab World Presentation, 11/2012
41. Issues Specific to Arabic, Part 2
Synonymy and confusion of non-standardized terms
Thermometer: ترمومتر، ميزان حرارة، مقياس حرارة، محرار، محر
Technical translation
Hydrometer: جهاز قياس كثافة السوائل (literally "device for measuring the density of liquids")
Male/female forms
Dual form
Uncle, parent…
42. The Qadafi/Schwarzenegger Problem *
Arabic: قذافي
English: Gadafi, Gaddafi, Gaddfi, Gadhafi, Ghaddafi, Kadaffy, Qaddafi, Qadhafi
Arabic: شوارزنيغر، شوارزنغر، شوارزنيجر، شوارزنجر
English: Schwarzenegger
* Mona Diab & Nizar Habash. Natural Language Processing of Arabic and its Dialects. EMNLP 2014, Doha, Qatar.
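One crude way to match such spelling variants is to reduce each one to a shared key before comparing. The sketch below is illustrative only: the rules in the hypothetical `skeleton_key` function were chosen by hand to cover the eight English spellings listed above, and they would over-merge unrelated names in a real system.

```python
import re


def skeleton_key(name):
    """Map a Latin-script spelling to a crude consonant skeleton."""
    key = name.lower()
    # Handle digraphs first so "gh" and "dh" are not split up later.
    key = key.replace("gh", "k").replace("dh", "d")
    # In these spellings q, g and k all stand for the same Arabic letter.
    key = re.sub(r"[qg]", "k", key)
    # Collapse doubled letters (dd -> d, ff -> f).
    key = re.sub(r"(.)\1+", r"\1", key)
    # Drop vowels, including y used as a vowel.
    key = re.sub(r"[aeiouy]", "", key)
    return key


variants = [
    "Gadafi", "Gaddafi", "Gaddfi", "Gadhafi",
    "Ghaddafi", "Kadaffy", "Qaddafi", "Qadhafi",
]
keys = {skeleton_key(v) for v in variants}
print(keys)  # every variant reduces to the same key
```

The same idea applies in the other direction: the four Arabic spellings of Schwarzenegger would need their own hand-written rules, for example treating غ and ج as equivalent.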
43. Ambiguity
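Homographs
قدم (e.g. "foot", "he presented", or "he arrived")
Internal word structure ambiguity
بعقوبة (the city name "Baqubah", or بـ + عقوبة "with a punishment")
Syntactic ambiguity
قابلت مدير البنك الجديد ("I met the new manager of the bank" or "I met the manager of the new bank")
Semantic ambiguity
يحب علي احمد اكثر من ابراهيم ("Ali loves Ahmad more than Ibrahim does" or "more than he loves Ibrahim")
Anaphoric ambiguity
قابل الصحفي الوزير الذي انتقده ("The journalist met the minister who criticized him": who criticized whom?)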
• Homonyms: words that sound alike but
have different meanings.
• Homophones: homonyms that sound
alike and have different meanings, but
have different spellings.
• Homographs: words that are spelled
the same but have different meanings.
45. Ambiguity in Word Segmentation*
* Freihat, Abed Alhakim, et al. A Single-Model Approach for Arabic Segmentation, POS-Tagging, and Named Entity Recognition.
46. Ambiguity in POS-Tagging*
* Freihat, Abed Alhakim, et al. A Single-Model Approach for Arabic Segmentation, POS-Tagging, and Named Entity Recognition.
47. Ambiguity in Fine-grained POS-Tagging*
* Freihat, Abed Alhakim, et al. A Single-Model Approach for Arabic Segmentation, POS-Tagging, and Named Entity Recognition.
48. NLP Issues that Apply to Arabic (and many others)
Lack of resources
Financial (governmental and private)
Human (computer and linguistic researchers and programmers)
Lack of tools
Lack of linguistic references
E.g. lexical semantics (classes, synonyms, relationships)
Lack of corpora
Training data
Benchmarks
50. English
Hewlett-Packard → Hewlett and Packard as two tokens?
state-of-the-art: break up hyphenated sequence.
co-education
lowercase, lower-case, lower case ?
It can be effective to get the user to put in possible hyphens
San Francisco: one token or two?
How do you decide it is one token?
Sec. 2.2.1
C. D Manning, P. Raghavan, H. Schutze, Introduction to Information Retrieval, Cambridge University Press, 2008
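The hyphenation questions above come down to a choice the tokenizer has to make. The following toy sketch shows both options; the `tokenize` function and its `split_hyphens` flag are hypothetical and are not taken from the cited book.

```python
import re


def tokenize(text, split_hyphens=True):
    """Toy tokenizer: lowercase the text, then extract word-like runs."""
    if split_hyphens:
        pattern = r"[a-z0-9]+"                    # hyphens act as separators
    else:
        pattern = r"[a-z0-9]+(?:-[a-z0-9]+)*"     # keep hyphenated sequences whole
    return re.findall(pattern, text.lower())


print(tokenize("Hewlett-Packard is state-of-the-art"))
print(tokenize("Hewlett-Packard is state-of-the-art", split_hyphens=False))

# With hyphens split, "lower-case" and "lower case" produce the same tokens,
# but "lowercase" still does not match them.
for form in ("lowercase", "lower-case", "lower case"):
    print(form, "->", tokenize(form))
```

Neither setting answers the "San Francisco" question: deciding that two space-separated words form one token needs a phrase list or statistics, not a character-level rule.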
51. French
L'ensemble → one token or two?
L ? L’ ? Le ?
Want l’ensemble to match with un ensemble
Until at least 2003, it didn’t on Google
Internationalization!
Sec. 2.2.1
C. D Manning, P. Raghavan, H. Schutze, Introduction to Information Retrieval, Cambridge University Press, 2008
52. German
Lebensversicherungsgesellschaftsangestellter
‘life insurance company employee’
German retrieval systems benefit greatly from a compound splitter module
Can give a 15% performance boost for German
C. D Manning, P. Raghavan, H. Schutze, Introduction to Information Retrieval, Cambridge University Press, 2008
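A compound splitter can be sketched as a dictionary lookup with backtracking. The example below assumes a tiny hand-made lexicon and a short list of German linking elements; both are illustrative assumptions, and a real module would rely on a large word list and frequency information.

```python
# Tiny hypothetical lexicon; a real splitter would use a large word list.
LEXICON = {"leben", "versicherung", "gesellschaft", "angestellter"}

# Linking elements that may join two parts of a compound ("" means none).
LINKERS = ("", "s", "es", "n", "en")


def split_compound(word):
    """Return the lexicon words that make up `word`, or None if no split exists."""
    word = word.lower()
    if not word:
        return []
    # Try the longest possible first part, then shorter ones.
    for end in range(len(word), 0, -1):
        head = word[:end]
        for linker in LINKERS:
            if not head.endswith(linker):
                continue
            stem = head[:len(head) - len(linker)]
            if stem in LEXICON:
                rest = split_compound(word[end:])
                if rest is not None:
                    return [stem] + rest
    return None


parts = split_compound("Lebensversicherungsgesellschaftsangestellter")
print(parts)
```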
53. Japanese
Japanese (and Chinese) have no spaces between words:
莎拉波娃现在居住在美国东南部的佛罗里达。
Not always guaranteed a unique tokenization
Further complicated in Japanese, with multiple alphabets intermingled
Dates/amounts in multiple formats
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
Katakana Hiragana Kanji Romaji
End-user can express query entirely in hiragana!
Sec. 2.2.1
C. D Manning, P. Raghavan, H. Schutze, Introduction to Information Retrieval, Cambridge University Press, 2008
54. Solutions (to Arabic), Technical
Preprocessing
Spelling checking (a whole new problem)
Tatweel
Repetitions
Normalization
To improve matching (i.e. increase recall)
Change the alef/hamza variants أ، إ، آ، ء to bare alef ا
Change ة to ه
Change ى to ي
Can use rule-based approaches to solve some common issues
Dialects and MSA might need different approaches
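The preprocessing and normalization steps above can be sketched in a few lines. This is a minimal illustration, not MALTK's or any other toolkit's implementation. The `normalize` function is hypothetical, and the rule for repetitions (collapsing runs of three or more identical characters to one) is an assumption, since the slide does not define it.

```python
import re

TATWEEL = "\u0640"  # elongation character (kashida)

# Character mappings listed on the slide.
CHAR_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0621": "\u0627",  # standalone hamza -> bare alef
    "\u0629": "\u0647",  # taa marbouta -> haa
    "\u0649": "\u064A",  # alef maqsura -> yaa
})


def normalize(text):
    """Apply tatweel removal, repetition collapsing, and letter normalization."""
    text = text.replace(TATWEEL, "")
    # Assumed rule: a run of 3+ identical characters is collapsed to one.
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    return text.translate(CHAR_MAP)


print(normalize("أحمد ذهب إلى المدرسة"))
```

Because the same function is applied to both the indexed text and the query, spelling variants such as a missing hamza end up matching, which is the recall gain the slide refers to. The cost is that some distinct words become identical.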
55. Question Answering**
Hammo et al. QARAB: A Question Answering System to Support the Arabic Language.
Workshop on Computational Approaches to Semitic Languages. ACL 2002
56. Available Tools
Mawdoo3’s AI tools (MALTK)
Arabic Treebank
Arabic WordNet
MySQL database
SUMO Ontology
Java
Dr. Mostafa Jarrar’s Arabic Ontology
Quran Corpus
Dr. Nizar Habash’s MADAR
Al-Ma’any dictionary
Microsoft Arabic Toolkit (ATK)
BabelNet
Farasa
57. Solutions, Research
Governments and private organizations must
Support research
i.e. pour millions!
E.g. مبادرة ضاد من ولي العهد الأردني (the "Daad" initiative of the Jordanian Crown Prince)
Find a way to standardize and centralize Arabic
Teach Arabic professionally
The Semantic web
Understanding the context
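وتهدف المبادرة التي ستندرج ضمن مبادرات مؤسسة ولي العهد، إلى إعداد سفراء للغة العربية يعملون على تعزيز استخدامها في مختلف ميادين المعرفة، وإنشاء وتعزيز عدد من المنصّات للتواصل باللغة العربية، واستخدامها في جميع مجالات الحياة العملية والعلمية والتكنولوجية.
(Translation: The initiative, which will be part of the Crown Prince Foundation's initiatives, aims to prepare ambassadors for the Arabic language who work to promote its use in various fields of knowledge, and to create and strengthen a number of platforms for communicating in Arabic and using it in all practical, scientific, and technological areas of life.)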
58. Summary
NLU has many challenges and opportunities
Arabic NLU has even more challenges and opportunities
59. Acknowledgment
Dr. Abdallah Abo Shmais
Linguistic Research Manager at Mawdoo3
Dr. Abed Al-Hakim Freihat
Senior Computational Linguist at Mawdoo3