The Floating Arabic Dictionary: An
Automatic Method for Updating a Lexical
  Database through the Detection and
   Lemmatization of Unknown Words
      Mohammed Attia, Younes Samih, Khaled Shaalan
                      and Josef van Genabith
       Faculty of Engineering and IT, The British University in Dubai
                   Heinrich-Heine-Universität, Germany
           School of Computing, Dublin City University, Ireland
Outline
•   Introduction
•   Morphological Guesser
•   Methodology
•   Testing and Evaluation
•   Conclusion
Introduction
• Why deal with unknown words?

• Complexity of lemmatization in Arabic

• Data used
Introduction
A living language is just
… living
… dynamic
… constantly changing
… new words appear
… old words die out
… some words are seasonal
… some are core
Introduction
Why deal with unknown words?
• Language is always changing
     • New words appear
     • Old words disappear
     • Unknown words make up 29% of the Gigaword
       corpus
• Unknown (out-of-vocabulary, OOV) words cause problems for:
     • Morphological analysers
     • Parsers
     • Machine Translation & other applications
Review of Arabic lexicographic work
Kitab al-'Ain by al-Khalil bin Ahmed al-Farahidi (died 789)
            (refinement/expansion/organizational improvement)
                                    ▼

•   Tahzib al-Lughah by Abu Mansour al-Azhari (died 980)
•   al-Muheet by al-Sahib bin 'Abbad (died 995)
•   Lisan al-'Arab by ibn Manzour (died 1311)
•   al-Qamous al-Muheet by al-Fairouzabadi (died 1414)
•   Taj al-Arous by Muhammad Murtada al-Zabidi (died
    1791)
•   Muheet al-Muheet (1869) by Butrus al-Bustani
•   al-Mu'jam al-Waseet (1960)
•   Buckwalter Arabic Morphological Analyzer (2002)
     Size: 40,222 lemmas (including 2,034 proper nouns)
     Includes many obsolete lexical items
     Many modern words are missing
Review of Arabic lexicographic work

Buckwalter obsolete words: 8,400 obsolete words
رمل (sand): هيلان وعس ميعاس عثير
صحراء (desert): فيفاء فدفد قواء موماة متلف سبسب
سرج (saddle): حداجة مخلوفة
حمل (load): ظعينة حدج ظعون وقر
لجام (bridle): فدام كعم كعام أرنبة شكم غمامة
راكب (rider): حداء
جمل (camel): هجينة
رداء (gown): دفية بشتة
حذاء (shoes): مبذل بشمق زربول زربون صرمة قبقاب
Review of Arabic lexicographic work

Not in Dictionaries: about 10,000 need to be added
سياسة: أمننة شرعنة أفروعربية إثني إقصائي تسييس محاصصة جبهوي جمهوعسكرية العصبوية شخصنة أمركة عصرنة
تكنولوجيا:
رقمنة، أتمتة، مكننة
فيس بوك، تويتر، تغريدة
هاتف جوال، تليفون محمول
لاب توب
الهواتف الذكية
حوسبة
بريد إلكتروني
دي في دي، سي دي
سبام، فيروس
ملتي ميديا
كمبيوتر لوحي، شاشة لمسية
شيفرة
اقتصاد: خصخصة ريعي يورو بورصة تعويم داو_جونز تضخم أسهم قيمة_دفترية مليار ترليون تجارة_إلكترونية
Review of Arabic lexicographic work

Not in Dictionaries: about 10,000 need to be added
Politics: legalizing, Afro-Arab, ethnic, ostracizing, Americanize, modernize
Technology:
    digitalizing, automating, mechanizing
    Facebook, Twitter, tweet
    mobile phone
    laptop
    smartphone
    computerizing
    email
    DVD, CD
    spam, virus
    multimedia
    tablet PC, touch screen
Economy: privatization, Euro, inflation, billion, trillion, e-commerce
Floating Dictionary
Introduction
Complexity of lemmatization in Arabic
• Lemmatization means reducing words to their base
  (canonical) forms
      • played -> play     studies -> study
      • went -> go         wives -> wife
• New words in English appear in their base form 86% of
  the time (Lindén, 2008)
• New words in Arabic appear in their base form 45% of
  the time
• Arabic morphology is complex and semi-algorithmic:
  root, patterns, inflections, clitics, etc.
Introduction
Complexity of lemmatization in Arabic
Possible Concatenations in Arabic Verbs

Example: وسيشكرونه wasayashkurunahu
         wa@sa@yashkuruna@hu
         and@will@thank[they]@him

Slot                                          Possible values
Proclitic: conjunction/question article       Conjunctions و wa ‘and’ or ف fa ‘then’; question word أ ᾽a ‘is it true that’
Proclitic: Comp                               ل li ‘to’; س sa ‘will’; ل la ‘then’
Prefix: tense/mood, number/gender             Imperfective tense (5); Perfective tense (1); Imperative (2)
Lemma                                         Verb lemma
Suffix: tense/mood, number/gender             Imperfective tense (10); Perfective tense (12); Imperative (5)
Enclitic: object pronoun                      First person (2); Second person (5); Third person (5)

شكر šakara ‘to thank’ generates 2,552 valid forms
Introduction
Complexity of lemmatization in Arabic
Possible Concatenations in Arabic Nouns

Example: وللمدرسين walilmudarrisiyna
         wa@li@al@mudarrisiyna
         and@to@the@teachers

Slot                                          Possible values
Proclitic: conjunction/question article       Conjunctions و wa ‘and’ or ف fa ‘then’; question word أ ᾽a ‘is it true that’
Proclitic: preposition                        ب bi ‘with’, ك ka ‘as’ or ل li ‘to’
Proclitic: definite article                   ال al ‘the’
Lemma                                         Noun stem/lemma
Suffix: gender/number                         Masculine dual (4); Feminine dual (4); Masculine regular plural (4); Feminine regular plural (1); Feminine mark (1)
Enclitic: genitive pronoun                    First person (2); Second person (5); Third person (5)

مدرس mudarris ‘teacher’ generates 519 valid forms
Introduction
Difference between stemming and lemmatizing

وسيقولونها
wa-sa-ya+quwl+uwna-ha
and they will say it

Stemming:     quwl  قول
Lemmatizing:  qAla  قال  (obtained through alteration rules)
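
The contrast on this slide can be illustrated with a minimal Python sketch. It assumes the segmentation notation used above (`-` marks clitic boundaries, `+` marks prefix and suffix boundaries). The `STEM_TO_LEMMA` table is a hypothetical stand-in for the alteration rules, which a real system would apply by rule rather than by lookup.

```python
# Hypothetical stand-in for the alteration rules that map a surface stem
# to its lemma; only the slide's single example is included.
STEM_TO_LEMMA = {"quwl": "qAla"}


def split_segmented(form):
    """Split a form written as proclitics-prefix+stem+suffix-enclitics."""
    chunks = form.split("-")
    # The chunk containing "+" is the inflected core of the word.
    core_index = next(i for i, chunk in enumerate(chunks) if "+" in chunk)
    prefix, stem, suffix = chunks[core_index].split("+")
    return {
        "proclitics": chunks[:core_index],
        "prefix": prefix,
        "stem": stem,
        "suffix": suffix,
        "enclitics": chunks[core_index + 1:],
    }


def stem_of(form):
    """Stemming: strip clitics and affixes, keep the surface stem."""
    return split_segmented(form)["stem"]


def lemma_of(form):
    """Lemmatizing: map the stem to its dictionary (canonical) form."""
    stem = stem_of(form)
    return STEM_TO_LEMMA.get(stem, stem)


word = "wa-sa-ya+quwl+uwna-ha"
print(stem_of(word), lemma_of(word))
```

Stemming stops at the surface string left after stripping, while lemmatizing needs one further step to reach the form a dictionary would list.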
Introduction
Data used
• A large-scale corpus of 1,089,111,204 words
      •   85% from the Arabic Gigaword Fourth Edition
      •   15% from news articles crawled from the Al-Jazeera web site

If printed on paper, the stack would be more than twice the height of the Eiffel Tower

= 16,000 large books
= 640 meters of bookshelves

An average reader reads 200 wpm with 60% comprehension.

Reading the Gigaword corpus would take 11 years of nonstop (24/7) reading

Technical issues:
Analyzing the corpus with MADA takes 20-30 days using 10 parallel sessions.
Loading a 3-, 4-, or 5-gram language model of the Arabic Gigaword requires a machine with 256GB RAM.
Morphological Guesser
We develop a morphological guesser for
unknown Arabic words that handles all
possible
  • Clitics
  • Prefixes
  • Suffixes
  • And all relevant alteration operations that include
    insertion, assimilation, and deletion
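
As a rough illustration of what such a guesser does, the Python sketch below enumerates candidate analyses for one unknown word by optionally peeling off a conjunction, a preposition, the definite article and a suffix. It is not the authors' finite-state implementation: the affix inventories are a tiny illustrative subset, the dictionary keys are hypothetical, and alteration operations (for example the assimilation of ل + ال) are deliberately left out.

```python
# Tiny, illustrative affix inventories (assumptions, not the full grammar).
CONJUNCTIONS = ["", "و", "ف"]
PREPOSITIONS = ["", "ل", "ك", "ب"]
ARTICLES = ["", "ال"]
SUFFIXES = {"": "+sg", "ون": "+masc+pl+nom"}
POS_TAGS = ["adj", "noun"]


def guess(word, min_stem=2):
    """Return every analysis whose affixes are compatible with the word."""
    analyses = []
    for conj in CONJUNCTIONS:
        for prep in PREPOSITIONS:
            for art in ARTICLES:
                prefix = conj + prep + art
                if not word.startswith(prefix):
                    continue
                rest = word[len(prefix):]
                for suffix, tags in SUFFIXES.items():
                    if suffix and not rest.endswith(suffix):
                        continue
                    stem = rest[: len(rest) - len(suffix)]
                    if len(stem) < min_stem:  # reject implausibly short stems
                        continue
                    for pos in POS_TAGS:
                        analyses.append({
                            "pos": pos, "conj": conj, "prep": prep,
                            "art": art, "stem": stem, "tags": tags,
                        })
    return analyses


for analysis in guess("والمسوقون"):
    print(analysis)
```

Because every compatible split is kept, the guesser over-generates by design. Choosing among the candidates is left to a later, context-sensitive step.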
Guesser
LEXC       1
======
LEXICON Conjunctions
+‫وـ‬conj:‫وـ‬           Prepositions;
+‫فـ‬conj:‫فـ‬           Prepositions;
                     Prepositions;
LEXICON Prepositions
+‫لـ‬prep:‫لـ‬           Article;
+‫كـ‬prep:‫كـ‬           Article;
+‫بـ‬prep:‫بـ‬           Article;
                     Article;
LEXICON Article
+‫الـ‬defArt           Nouns;
+‫الـ‬defArt           Adjectives;
                     Nouns;
                     Adjectives;
LEXICON Nouns
+noun                GuessWords;
^ss^^‫خادم‬se^         FemMascduMascpl;
....
LEXICON Adjectives
+adj+fem             GuessWords;
+adj+masc            GuessWords;
^ss^^‫سعيد‬se^+adj+masc        FemMascduFemduMascplFempl;
....
LEXICON GuessWords
^ss^^GUESSNOUNSTEM^^se^      FemMascduFemduMascplFempl;
^ss^^GUESSNOUNSTEM^^se^      FemMascduFemduFempl;
^ss^^GUESSNOUNSTEM^^se^      FemMascduFemdu;
….

ALTERATION RULES          2
=================
a -> b || L _ R

XFST                       3
=====
read regex < arb-Alphabet.txt
define Alphabet
define PossNounStem [[Alphabet]^{2,24}] "+Guess":0;
substitute defined PossNounStem for "^GUESSNOUNSTEM^"
Methodology
We use a pipeline-based approach
• First: a machine learning (SVM), context-sensitive tool
  (MADA) is used to predict:
   • POS
   • Morpho-syntactic features of number, gender, person, tense, etc.
• Second: The finite-state morphological guesser is used
  to produce all the possible interpretations of words and
  suggested lemmas.
• Third: The two outputs are matched against each other and the
  analysis on which they agree is selected.
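
The three steps above can be sketched in Python as a simple filter. Here a context-sensitive prediction, with feature names borrowed from the MADA output format (`pos`, `num`, `prc0`, `prc2`), is used to select among guesser analyses. The dictionary layout of the analyses, the Buckwalter-style stems and the matching criteria are assumptions made for illustration, not the authors' matching code.

```python
def agrees(mada, analysis):
    """True when a guesser analysis is compatible with the MADA prediction."""
    has_article = mada["prc0"] == "Al_det"   # definite-article proclitic
    has_conj = mada["prc2"] != "0"           # "0" marks an empty slot
    is_plural = mada["num"] == "p"
    return (
        analysis["pos"] == mada["pos"]
        and bool(analysis["art"]) == has_article
        and bool(analysis["conj"]) == has_conj
        and (analysis["number"] == "pl") == is_plural
    )


def select(mada, analyses):
    """Keep only the analyses on which both components agree."""
    return [a for a in analyses if agrees(mada, a)]


# Context-sensitive prediction for wAlmswqwn (values follow the MADA format).
mada = {"pos": "noun", "num": "p", "gen": "m", "prc0": "Al_det", "prc2": "wa_conj"}

# A few over-generated guesser analyses (hand-written for this sketch).
candidates = [
    {"pos": "adj", "conj": "", "art": "", "stem": "wAlmswq", "number": "pl"},
    {"pos": "noun", "conj": "wa", "art": "", "stem": "Almswq", "number": "pl"},
    {"pos": "noun", "conj": "wa", "art": "Al", "stem": "mswq", "number": "pl"},
    {"pos": "noun", "conj": "wa", "art": "Al", "stem": "mswqwn", "number": "sg"},
    {"pos": "adj", "conj": "wa", "art": "Al", "stem": "mswq", "number": "pl"},
]

print(select(mada, candidates))
```

The guesser supplies the candidate lemmas and the tagger supplies the context, so neither component has to solve the whole problem alone.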
Methodology
Example
والمسوقون
wa-Al-musaw~iquwna “and-the-marketers”

MADA output:
form:wAlmswqwn    num:p      gen:m    per:na    case:n     asp:na    mod:na     vox:na
         pos:noun prc0:Al_det prc1:0   prc2:wa_conj   prc3:0     enc0:0    stt:d

Finite-state guesser output:
‫والمسوقون‬   +adj‫+والمسوق‬Guess+masc+pl+nom@
‫والمسوقون‬   +adj‫+والمسوقون‬Guess+sg@
‫والمسوقون‬   +noun‫+والمسوق‬Guess+masc+pl+nom@
‫والمسوقون‬   +noun‫+والمسوقون‬Guess+sg@
‫والمسوقون‬   ‫+و‬conj@‫+ال‬defArt@+adj‫+مسوق‬Guess+masc+pl+nom@
‫والمسوقون‬   ‫+و‬conj@‫+ال‬defArt@+adj‫+مسوقون‬Guess+sg@
‫والمسوقون‬   ‫+و‬conj@‫+ال‬defArt@+noun‫+مسوق‬Guess+masc+pl+nom@ Correct Analysis
‫والمسوقون‬   ‫+و‬conj@‫+ال‬defArt@+noun‫+مسوقون‬Guess+sg@
‫والمسوقون‬   ‫+و‬conj@+adj‫+المسوق‬Guess+masc+pl+nom@
‫والمسوقون‬   ‫+و‬conj@+adj‫+المسوقون‬Guess+sg@
‫والمسوقون‬   ‫+و‬conj@+noun‫+المسوق‬Guess+masc+pl+nom@
‫والمسوقون‬   ‫+و‬conj@+noun‫+المسوقون‬Guess+sg@
Methodology
Results
• Corpus size is 1,089,111,204 tokens, 7,348,173
  types
• Unknown types in the corpus: 2,116,180 (29%)
• After spell checking, correctly spelt types are
  208,188
• Types with frequency of 10 or more: 40,277
• After lemmatization: 18,399 types
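
The filtering funnel behind these figures can be checked with a few lines of Python. The counts are taken from the slide. The stage labels and the `retention` helper are names chosen for this sketch.

```python
# Counts from the slide; each stage filters the previous one.
stages = [
    ("all types", 7_348_173),
    ("unknown types", 2_116_180),
    ("correctly spelt", 208_188),
    ("frequency >= 10", 40_277),
    ("after lemmatization", 18_399),
]


def retention(stages):
    """Return (name, count, percent of previous stage) for each step."""
    rows = []
    for (_, previous), (name, count) in zip(stages, stages[1:]):
        rows.append((name, count, 100 * count / previous))
    return rows


for name, count, percent in retention(stages):
    print(f"{name:<22}{count:>10,}  {percent:5.1f}% of previous stage")
```

The first step reproduces the 29% unknown-type ratio. The later steps show how sharply spell checking and the frequency threshold cut the candidate list.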
Testing and Evaluation
We create a gold standard of 1,310 words manually annotated for:
• Gold lemma
• Gold POS
• Lexical relevance (include in a dictionary): yes or no

Among unknown words,
- Proper nouns are the most common
- Verbs are the least common

Gold POS                                   Type Count   Ratio
noun_prop                                  584          45%
noun                                       264          20%
adj                                        255          19%
verb                                       52           4%
noun_fem_plural (pluralia tantum)          28           2%
noun_broken_plural                         28           2%
others: noun_masc_plural (pluralia         8            0.6%
  tantum) (4), part (3), pron_dem (1)
Excluded
misspelling                                55           4%
not_known                                  15           1%
colloquial                                 19           1.5%
Lexicographic relevance
Include in a dictionary                    671          51%
Don’t include in a dictionary              639          49%
Testing and Evaluation
Evaluating POS (accuracy)
• Baseline: assigning the most frequent tag (proper noun)
  to all unknown words: 45%
• MADA: 60%
• Voted POS tagging: 69%. When a lemma receives different
  POS tags across its occurrences, we take the tag with
  the highest frequency.

    POS tagging              Accuracy
1   POS tagging baseline     45%
2   MADA POS tagging         60%
3   Voted POS tagging        69%
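
The voting step can be sketched as a majority vote per lemma, shown below in Python. The function name and the sample observations are invented for illustration. Ties fall to whichever tag was seen first, which is a simplification the slides do not specify.

```python
from collections import Counter, defaultdict


def vote_pos(tagged_tokens):
    """Map each lemma to the POS tag it receives most often."""
    counts = defaultdict(Counter)
    for lemma, pos in tagged_tokens:
        counts[lemma][pos] += 1
    # most_common(1) returns the (tag, count) pair with the highest count.
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}


# Invented observations: (lemma, POS tag predicted for one occurrence).
observations = [
    ("mswq", "noun"),
    ("mswq", "noun"),
    ("mswq", "adj"),
    ("rqmnp", "noun"),
]

print(vote_pos(observations))
```

Pooling the occurrences lets frequent, consistent tags override isolated tagging errors.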
Testing and Evaluation
Evaluating Lemmatization (accuracy)
• Baseline: new words appear in their base form:
  45%
• Pipelined, with strict matching of the definite article ‘al’: 54%
• Pipelined, ignoring the definite article ‘al’: 63%

    Lemmatization                                        Accuracy
1   Lemma first-order baseline                           45%
2   Pipelined lemmatization (first-order decision)       54%
    with strict definite article matching
3   Pipelined lemmatization (first-order decision)       63%
    ignoring definite article matching
Testing and Evaluation
Evaluating Lemma Weighting
•   The weighting criteria aim to push lexicographically
    relevant words up the list and less interesting words down.
• We aim to make the number of important words high in the
  top 100 and low in the bottom 100

Word Weight = ((number of sister forms * 800) + frequencies of sister forms) / 2 + POS factor

Good words                                  In top 100   In bottom 100
relying on frequency alone (baseline)       63           50
relying on number of sister forms * 800     87           28
relying on POS factor                       58           30
using combined criteria                     78           15
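
The weighting formula can be written out as a short Python function. The multiplier of 800 comes from the slide. Reading "frequencies of sister forms" as the sum of their corpus frequencies is an assumption. The `POS_FACTOR` values are hypothetical placeholders, since the slides do not list them.

```python
SISTER_FORM_MULTIPLIER = 800  # value given on the slide

# Hypothetical POS factors; the slides do not give the actual values.
POS_FACTOR = {"noun": 50, "adj": 50, "verb": 50, "noun_prop": 0}


def word_weight(sister_form_frequencies, pos, pos_factor=POS_FACTOR):
    """Weight = ((sister forms * 800) + summed frequencies) / 2 + POS factor."""
    form_count = len(sister_form_frequencies)
    total_frequency = sum(sister_form_frequencies)
    base = (form_count * SISTER_FORM_MULTIPLIER + total_frequency) / 2
    return base + pos_factor.get(pos, 0)


# A noun attested in three inflected forms vs. a frequent one-form proper noun.
print(word_weight([120, 40, 15], "noun"))
print(word_weight([900], "noun_prop"))
```

In this sketch a lemma attested in several inflected forms outranks a more frequent word seen in a single form. This matches the slide's finding that sister forms are a stronger signal than raw frequency.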
Testing and Evaluation
Oxford new words list: June 2012
•   BitTorrent: a protocol that underpins the practice of peer-to-peer file
    sharing
•   command line: a user interface that is navigated by typing
    commands
•   cybercast: a news or entertainment program transmitted over the
    Internet
•   subcommunity: a distinct grouping within a community
•   subjectivization: to make subjective
•   subpersonality: a personality mode that kicks in (appears on a
    temporary basis) to allow a person to cope with certain types of
    psychosocial situations.
•   superglue v: to stick with superglue
Testing and Evaluation
Words expected in the next Arabic dictionary/morphological analyser
Bird’s Eye view
Problem
  • Out-of-vocabulary (OOV) words cause problems for
    morphological analysers, parsers, MT, etc.
  • The manual extension of lexical databases is costly and
    time-consuming.
  • With the large amount of data, manual extension of lexicons
    becomes practically impossible.
Solution
  • Creating an automatic method for updating a lexical database
  • Integrating a machine learning method with a finite-state
    guesser to lemmatize unknown words
  • Weighting new words by relevance and importance
Conclusion
• We develop a methodology for automatically extracting
  and lemmatizing unknown words in Arabic
• We pipeline a finite-state guesser with a machine
  learning tool for lemmatization
• We develop a weighting mechanism for predicting the
  relevance and importance of lemmas
• Out of 2,116,180 unknown words, we create a lexicon of
  18,399 lemmatized, POS-tagged and weighted entries.
