Natural Language Processing for Amazigh Language:
Challenges and Future Directions
Fadoua Ataa Allah, Siham Boulaknadel
CEISIC, IRCAM
{ataaallah, boulaknadel}@ircam.ma
Outline

    Amazigh Language
    Amazigh Complexity in NLP
    State of the Technology on Amazigh
    Future Directions




                LREC-2012: SALTMIL-AfLaT Workshop   2
Amazigh language
  Sociolinguistic Context
    North African autochthonous language
      Spoken by millions of people, in the form of dialects




Amazigh language
                              Sociolinguistic Context
    Languages of Morocco
        Classical Arabic, an official language.
        Amazigh, an official language since 2011.
        Moroccan Arabic, or Darija, in a diglossic relationship with
         Classical Arabic.
        French, the first foreign language.
        Spanish, used in the north of Morocco.
        English, which is becoming the second foreign language.

                                                           10/07/2012
Amazigh language
                                                          History

    Amazigh abjad
        Tifinagh has been attested for 25 centuries.
        Its written form has continued to change, from the
         traditional Tuareg writing to Tifinaghe-IRCAM.
        [Figure: Tinzouline inscriptions (Zagora, Morocco)]


Amazigh language
  History
    Direction
        [Figure: Plate 9, Anou Elias, Mammanet Valley (Niger).
         Source: Henri Lhote, Oued Mammanet gravures,
         Les Nouvelles Editions Africaines, 1979.]




Moroccan Amazigh characteristics
   Amazigh writing system
       Direction: horizontal, from left to right.
       Alphabet:
            27 consonants: ⴱ, ⴳ, ⴳⵯ, ⴷ, ⴹ, ⴼ, ⴽ, ⴽⵯ, ⵀ, ⵃ, ⵄ, ⵅ, ⵇ, ⵊ, ⵍ, ⵎ, ⵏ, ⵔ, ⵕ,
             ⵖ, ⵙ, ⵚ, ⵛ, ⵜ, ⵟ, ⵣ, ⵥ;
            2 semi-consonants: ⵢ and ⵡ;
            4 vowels: ⴰ, ⵉ, ⵓ, ⴻ.
       Punctuation marks: conventional signs, including “ ” (space),
        “.”, “,”, “;”, “:”, “?”, “!”, “…”, etc.
       Numerals: Hindu-Arabic numerals [0-9].



Amazigh Complexity in NLP

  Different writing forms
  Complex phonological and phonetic systems
  Rich morphology




Amazigh Complexity in NLP
                  Amazigh script
     The conversion of writing prescriptions into
      ‘Tifinaghe – Unicode’ is confronted with:
       Spelling variation related to regional
        varieties ([tfucht] vs. [tafukt] (sun)),
       Spelling variation based on the use or the
        omission of spaces within or between
        words ([tadartino] vs. [tadart ino] (my house)),
       Arabic or Latin transcription systems.




Amazigh Complexity in NLP
          Phonology & phonetics
     The main problem of Amazigh phonology
      and phonetics lies in allophones, e.g.:

         /ll/, which is realized as [dj] in the North.




Amazigh Complexity in NLP
                      Morphology
     Highly inflected language.
     Word structure:
         Prefix + Stem + Suffix
     Affix set: prefixes, infixes, and suffixes.
     Base form varies with paradigms:
         (qqim → svim (make sit)).



State of the Amazigh technology
    Tifinaghe Encoding
    Optical character recognition
    Fundamental processing tools
    Language resources




State of the Amazigh technology
               Tifinaghe Encoding
     ANSI
     Unicode




State of the Amazigh technology
                            OCR
    Amazigh OCR systems:
        System focused on isolated printed characters
         based on a syntactic approach using finite
         automata.
        Global approach based on Hidden Markov
         Models for recognizing handwritten characters.
        Method using invariant moments for recognizing
         printed script.
        System based on an artificial neural network for
         recognizing printed characters.

State of the Amazigh technology
           Fundamental processing
     Transliterator
     Tagging assistance tool
     Light stemmer
     Search engine
     Concordancer


State of the Amazigh technology
           Fundamental processing
     Transliterator
         [Diagram labels: Arabic script, Latin script, Tifinaghe Latin;
          Converter; Transliterator; Tifinaghe Unicode]
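
A minimal sketch of one direction such a tool might cover, Latin transcription to Tifinaghe Unicode, assuming a simple one-to-one character table. The table below is a small illustrative subset and the function name `latin_to_tifinaghe` is hypothetical; neither is taken from the tool described on the slide.

```python
# Illustrative subset only (an assumption): a full table would cover the
# whole alphabet and handle digraphs, emphatics and labialized consonants.
LATIN_TO_TIFINAGHE = {
    "a": "ⴰ", "b": "ⴱ", "d": "ⴷ", "f": "ⴼ", "i": "ⵉ",
    "k": "ⴽ", "l": "ⵍ", "m": "ⵎ", "n": "ⵏ", "r": "ⵔ",
    "s": "ⵙ", "t": "ⵜ", "u": "ⵓ", "z": "ⵣ",
}


def latin_to_tifinaghe(text: str) -> str:
    """Replace each known Latin letter with its Tifinaghe counterpart.

    Characters without a table entry (spaces, digits, punctuation)
    are passed through unchanged.
    """
    return "".join(LATIN_TO_TIFINAGHE.get(ch, ch) for ch in text.lower())


print(latin_to_tifinaghe("tanmirt"))  # prints ⵜⴰⵏⵎⵉⵔⵜ
```

A character-by-character lookup keeps the sketch short; multi-letter Latin sequences would need a longest-match pass before the single-letter lookup.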




State of the Amazigh technology
           Fundamental processing
    Tagging assistance tool
        [Diagram: Amazigh raw corpora → Tokenization →
         Manual Stemming (producing a stem list) and
         Manual POS tagging (using a tag set, producing a tagged corpus) →
         Validation → Standard output]


State of the Amazigh technology
           Fundamental processing
     Light stemmer
         [Flowchart: Begin → Prefix + Stem + Suffix → Find the largest prefix →
          Stem + Suffix → Find the largest suffix → Stem → End]
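
The two steps of the flowchart (strip the largest matching prefix, then the largest matching suffix) can be sketched as follows. The affix lists and the `min_stem` length guard are hypothetical placeholders chosen for illustration; they are not the lists or rules used by the authors.

```python
# Hypothetical affix lists, for illustration only.
PREFIXES = ["t", "ta", "ti", "i", "n"]
SUFFIXES = ["t", "in", "n", "ino"]


def light_stem(word, prefixes=PREFIXES, suffixes=SUFFIXES, min_stem=2):
    """Strip the longest matching prefix, then the longest matching suffix.

    An affix is removed only if at least `min_stem` characters remain
    (an assumed safeguard against over-stemming short words).
    """
    # Step 1: try prefixes from longest to shortest and strip the first match.
    for prefix in sorted(prefixes, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem:
            word = word[len(prefix):]
            break
    # Step 2: same procedure for suffixes on the remaining "stem + suffix".
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            word = word[:-len(suffix)]
            break
    return word


print(light_stem("tadart"))  # prints dar
```

Sorting the affixes by length before matching is what makes the stripped affix the largest one, as the flowchart requires.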


State of the Amazigh technology
           Fundamental processing
     Search engine
         [Architecture diagram:
          Data Crawling: Web → Crawler → Repository
          Data Indexing: Natural Language Processing Tools → Indexer → Index
          Data Searching: User Interface → Query Engine
           (Natural Language Processing Tools) → Index]
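
The indexing and searching stages of such an architecture can be illustrated with a small inverted index. This is a sketch under stated assumptions: the repository is a plain dictionary standing in for crawled pages, `tokenize` stands in for the natural language processing tools, and all function names are hypothetical.

```python
from collections import defaultdict


def tokenize(text):
    """Stand-in for the NLP tools: lowercase and split on whitespace."""
    return text.lower().split()


def build_index(repository):
    """Indexer: map each term to the set of document ids containing it.

    `repository` is a dict of doc_id -> text (a stand-in for crawled pages).
    """
    index = defaultdict(set)
    for doc_id, text in repository.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Query engine: return ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


repository = {1: "tadart ino", 2: "tafukt", 3: "tadart"}
index = build_index(repository)
print(sorted(search(index, "tadart")))  # prints [1, 3]
```

Applying the same `tokenize` step at indexing time and at query time keeps both sides consistent; a stemmer could be plugged in at that point.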




State of the Amazigh technology
           Fundamental processing
     Concordancer
         [Diagram: input field (.txt, .doc, .pdf, .zip) → Tokenization →
          (1) list of the text words and their frequency;
          (2) word / expression context display]
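
The two outputs of the concordancer (a word frequency list and a context display) can be sketched on plain text as follows. File parsing for .doc, .pdf or .zip input is left out, and the function names and sample text are illustrative assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (Unicode-aware in Python 3)."""
    return re.findall(r"\w+", text.lower())


def word_frequencies(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence."""
    keyword = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


tokens = tokenize("tadart ino tafukt tadart")
print(word_frequencies(tokens)[0])       # prints ('tadart', 2)
print(concordance(tokens, "tadart", 1))
```

The `width` parameter sets how many tokens of context are shown on each side of the keyword.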




State of the Amazigh technology
               Language resources
     Corpora
     Dictionary
     Terminology database




State of the Amazigh technology
               Language resources
     Corpora:
         General corpus,
         POS corpus.




State of the Amazigh technology
               Language resources
     Dictionary
         Definition,
         Arabic equivalent words,
         French equivalent words,
         English equivalent words,
         Synonyms,
         Classification by domains,
         Derivational families.

State of the Amazigh technology
               Language resources
     Terminology database
         Media vocabulary
         Grammatical vocabulary




Future Directions

  Building large and representative Amazigh corpora.
  Developing a machine translation system.
  Creating a pool of competent human resources.




Thank you for your attention

       ⵜⴰⵏⵎⵉⵔⵜ


