SlideShare ist ein Scribd-Unternehmen logo
1 von 4
Downloaden Sie, um offline zu lesen
 

           Exploiting Cross‐linguistic Similarities in Zulu and Xhosa 
                         Computational Morphology 
                         Laurette Pretorius                                Sonja Bosch 
                        School of Computing                       Department of African Languages 
                     University of South Africa &                    University of South Africa 
                       Meraka Institute, CSIR                          Pretoria, South Africa 
                       Pretoria, South Africa                          boschse@unisa.ac.za
                         pretol@unisa.ac.za
 
Summary 
 
       •   An existing Zulu morphological analyser prototype (ZulMorph) serves as basis for the 
           bootstrapping of a Xhosa analyser.  
       •   The investigation is structured around the morphotactics and the morphophonological 
           alternations of the languages involved.  
       •   Special attention is given to the so‐called “open” class, which represents the word root 
           lexicons for specifically nouns and verbs.  
       •   The acquisition and coverage of these lexicons prove to be crucial for the success of the 
           analysers under development.   
       •   The bootstrapped morphological analyser is applied to parallel test corpora and the results 
           are discussed.  
       •   A variety of cross‐linguistic effects is illustrated with examples from the corpora. 
 
Background on ZulMorph 
 
Xerox finite‐state toolkit:  
lexc for modelling the morphotactics;  
xfst (regular expression language) for modelling morphophonological alternations 
Word roots include 15 800 nouns, 7 600 verbs, 408 relatives, 47 adjectives, 2 735 ideophones, 176 
conjunctions.  
      
    Morphotactics (lexc)   Affixes  for  all  parts‐of‐speech  Word roots (e.g.         Rules  for  legal  combinations  and 
                             (e.g. subject & object concords,  nouns, verbs,            orders  of  morphemes  (e.g.  u‐ya‐
                             noun  class  prefixes,  verb  relatives,                   ngi‐thand‐a  and  not  *ya‐u‐a‐
                             extensions etc.)                   ideophones)             thand‐ngi)  
    Morpho‐                  Rules that determine the form of each morpheme 
    phonological             (e.g. ku‐lob‐w‐a > ku‐lotsh‐w‐a, u‐mu‐lomo > u‐m‐lomo)  
    alternations    (xfst)  
                                  Table 1: Zulu Morphological Analyser Components 

Morphotactics 
We distinguish between so‐called closed and open classes:  
   • The  open  class  accepts  the  addition  of  new  items  by  means  of  processes  such  as  borrowing, 
        coining, compounding and derivation. In the context of this paper, the open class represents word 
        roots including verb roots and noun stems.  
   • The closed class represents affixes that model the fixed morphological structure of words, as well 
        as items such as conjunctions, pronouns etc.  
 
 
 
 
 


We focus on Xhosa affixes that differ from their Zulu counterparts. A few examples are given in Table 2. 
    
Morpheme                      Zulu                                    Xhosa 
Noun Class Prefixes 
Class 1 and 3 um(u)‐          full form umu‐ with monosyllabic        um‐ with all noun stems: um‐ntu, 
                              noun stems, shortened form with         um‐fana 
                              polysyllabic noun stems: 
                              umu‐ntu, um‐fana 
Class 2a                      o‐: o‐baba                              oo‐: oo‐bawo 
Class 9                       in‐ with all noun stems:                i‐ with noun stems beginning with 
                              in‐nyama                                h, i, m, n, ny: i‐hambo  
                                                                       
Class 10                      izin‐  with  monosyllabic  and  iin‐ with polysyllabic stems: 
                              polysyllabic stems.                     iin‐dlebe 
                              izin‐ja;  izin‐dlebe 
Contracted subject concords (future tense). Examples: 
1ps                           ngo‐                                    ndo‐ 
2ps, Class 1 & 3              wo‐                                     uyo‐ 
Class 4 & 9                   yo‐                                     iyo‐ 
              Table 2. Examples of variations in Zulu and Xhosa ‘closed’ morpheme information 
 
The bootstrapping process is iterative and new information regarding dissimilar morphological 
constructions is incorporated systematically in the morphotactics component. Similarly, rules are 
adapted in a systematic manner. 
 

Morphophonological alternations 
Differences in morphophonological alternations between Zulu and Xhosa are exemplified in  
Table 3. 
 
                      Zulu                                                  Xhosa 
Class  10  class  prefix  izin‐  occurs  before  Class  10  class  prefix  izin‐  changes  to  iin‐  before 
monosyllabic  as  well  as  polysyllabic  stems,  polysyllabic stems, e.g. izinja, iindlebe 
e.g. izinja, izindlebe                                Adverb prefix na + ii > nee; e.g. neendlebe (na‐iin‐
Adverb prefix na + i > ne, e.g. nezindlebe (na‐ ndlebe) 
izin‐ndlebe) 
Palatalisation  with  passive,  diminutive  &  Palatalisation  with  passive,  diminutive  &  locative 
locative formation:                                   formation: 
 b   > tsh                                            b > ty                    
‐hlab‐w‐a > ‐hlatsh‐w‐a, intaba‐ana > intatsh‐ ‐hlab‐w‐a > ‐hlaty‐w‐a, intaba‐ana > intaty‐a na 
ana, indaba > endatsheni                              ihlobo > ehlotyeni 
ph > sh                                               ph > tsh                  
‐boph‐w‐a            > ‐bosh‐w‐a,                     –boph‐w‐a                 > ‐botsh‐w‐a,  
iphaphu‐ana      > iphash‐ana                         iphaphu‐ana       > iphatsh‐ana, 
iphaphu              > ephasheni                      usapho                > elusatsheni 
                         Table 3. Examples of variations in Zulu and Xhosa morphophonology 
 
 

 

The word root lexicons 
Zulu lexicon:  
    • Based on an extensive word list dating back to the mid 1950s, but significant 
        improvements and additions are regularly made.  
    • At present the Zulu word roots include noun stems with class information (15 759), verb 
        roots (7 567), relative stems (406), adjective stems (48), ideophones (1 360), conjunctions 
        (176).  
Xhosa lexicon:  
    • Noun stems with class information (4 959) and verb roots (5 984) extracted from various 
        recent prototype paper dictionaries whereas relative stems (27), adjective stems (17), 
        ideophones (30) and conjunctions (28) were only included as representative samples at 
        this stage. 
 

A computational approach to cross‐linguistic similarity 
    •   Application of the bootstrapped morphological analyser to parallel test corpora 
    •   Guesser variant of the morphological analyser that uses typical word root patterns for 
        identifying potential new word roots 
     
                                          Bootstrapping procedure 
                                                   




                                                                                        
        
        
        
The language resources chosen to illustrate this point are parallel corpora in the form of the South 
African Constitution. 
        
         
         
 
     
     
     
    Results of the application of the bootstrapped morphological analyser to this corpus are as follows: 
                                                                   
                                                              
                                Zulu Statistics 
                                Corpus size:    7057 types 
                                Analysed:        5748 types (81.45 %) 
                                Failures:        1309 types (18.55%) 
                                Failures analysed by guesser: 1239 types 
                                Failures not analysed by guesser: 70 types 
                                 
                                Xhosa Statistics 
                                Corpus size:   7423 types 
                                Analysed:        5380 types (72.48 %) 
                                Failures:        2043 types (27.52%) 
                                Failures analysed by guesser: 1772 types 
                                Failures not analysed by guesser: 271 types 
                                                              
             
Examples from the Zulu corpus: 
The analysis of the Zulu word ifomu ‘form’ uses the Xhosa noun stem –fomu (9/10) in the Xhosa lexicon in 
the absence of the Zulu stem: 

ifomu i[NPrePre9]fomu[Xh][NStem]
 
Examples from the Xhosa corpus: 
The analysis of the Xhosa words bephondo ‘of the province’ and esikhundleni ‘in the office’ use the Zulu 
noun stems –phondo (5/6) and –khundleni (7/8) respectively in the Zulu lexicon: 

bephondo ba[PossConc14]i[NPrePre5]li[BPre5]phondo[NStem]

esikhundleni e[LocPre]i[NPrePre7]si[BPre7]khundla[NStem]ini[LocSuf]

Examples of the guesser output from the Zulu corpus: 
The compound noun –shayamthetho (7/8) ‘legislature’ is not listed in the Zulu lexicon, but was guessed 
correctly:  

isishayamthetho i[NPrePre7]si[BPre7]shayamthetho-Guess[NStem]
                                         

Conclusion and Future Work 
        •   Zulu  informed  Xhosa  in  the  sense  that  the  systematically  developed  grammar  for  ZulMorph  was 
            directly available for the Xhosa analyser development, which significantly reduced the development 
            time for the Xhosa prototype compared to that for ZulMorph. 
        •   To a lesser extent, Xhosa also informed Zulu by providing a current (more up to date) Xhosa lexicon. 
            In addition, the guesser variant was employed in identifying possible new roots in the test corpora, 
            both for Zulu and for Xhosa. 
        •   Bootstrapping  morphological  analysers  for  languages  that  exhibit  significant  structural  and  lexical 
            similarities may be fruitfully exploited for developing analysers for lesser‐resourced languages.  
        •   Future  work  includes  the  application  of  the  approach  followed  in  this  work  to  the  other  Nguni 
            languages, namely Swati and Ndebele (Southern and Zimbabwe); the application to larger corpora, 
            and the subsequent construction of stand‐alone versions. Finally, the combined analyser could also 
            be used for (corpus‐based) quantitative studies in cross‐linguistic similarity.  

Weitere ähnliche Inhalte

Mehr von Guy De Pauw

Tagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News CorpusTagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News CorpusGuy De Pauw
 
A Corpus of Santome
A Corpus of SantomeA Corpus of Santome
A Corpus of SantomeGuy De Pauw
 
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...Guy De Pauw
 
Compiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFSTCompiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFSTGuy De Pauw
 
The Database of Modern Icelandic Inflection
The Database of Modern Icelandic InflectionThe Database of Modern Icelandic Inflection
The Database of Modern Icelandic InflectionGuy De Pauw
 
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingLearning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingGuy De Pauw
 
Issues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken IrishIssues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken IrishGuy De Pauw
 
How to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsHow to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsGuy De Pauw
 
Towards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersTowards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersGuy De Pauw
 
The PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentThe PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentGuy De Pauw
 
A System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá CharactersA System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá CharactersGuy De Pauw
 
IFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation SystemIFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation SystemGuy De Pauw
 
A Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription SystemA Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription SystemGuy De Pauw
 
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...Guy De Pauw
 
Towards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersTowards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersGuy De Pauw
 
Human Language Technologies for Ethiopian Languages: Challenges and Future Di...
Human Language Technologies for Ethiopian Languages: Challenges and Future Di...Human Language Technologies for Ethiopian Languages: Challenges and Future Di...
Human Language Technologies for Ethiopian Languages: Challenges and Future Di...Guy De Pauw
 
Amharic document clustering
Amharic document clusteringAmharic document clustering
Amharic document clusteringGuy De Pauw
 
Modeling Improved Syllabification Algorithm for Amharic
Modeling Improved Syllabification Algorithm for AmharicModeling Improved Syllabification Algorithm for Amharic
Modeling Improved Syllabification Algorithm for AmharicGuy De Pauw
 
Ge'ez Verbs Morphology and Declaration Model
Ge'ez Verbs Morphology and Declaration ModelGe'ez Verbs Morphology and Declaration Model
Ge'ez Verbs Morphology and Declaration ModelGuy De Pauw
 
Do we need linguistic knowledge for speech technology applications in African...
Do we need linguistic knowledge for speech technology applications in African...Do we need linguistic knowledge for speech technology applications in African...
Do we need linguistic knowledge for speech technology applications in African...Guy De Pauw
 

Mehr von Guy De Pauw (20)

Tagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News CorpusTagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News Corpus
 
A Corpus of Santome
A Corpus of SantomeA Corpus of Santome
A Corpus of Santome
 
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
 
Compiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFSTCompiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFST
 
The Database of Modern Icelandic Inflection
The Database of Modern Icelandic InflectionThe Database of Modern Icelandic Inflection
The Database of Modern Icelandic Inflection
 
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingLearning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
 
Issues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken IrishIssues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken Irish
 
How to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsHow to build language technology resources for the next 100 years
How to build language technology resources for the next 100 years
 
Towards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersTowards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound Analysers
 
The PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentThe PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource Development
 
A System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá CharactersA System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá Characters
 
IFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation SystemIFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation System
 
A Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription SystemA Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription System
 
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
 
Towards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersTowards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound Analysers
 
Human Language Technologies for Ethiopian Languages: Challenges and Future Di...
Human Language Technologies for Ethiopian Languages: Challenges and Future Di...Human Language Technologies for Ethiopian Languages: Challenges and Future Di...
Human Language Technologies for Ethiopian Languages: Challenges and Future Di...
 
Amharic document clustering
Amharic document clusteringAmharic document clustering
Amharic document clustering
 
Modeling Improved Syllabification Algorithm for Amharic
Modeling Improved Syllabification Algorithm for AmharicModeling Improved Syllabification Algorithm for Amharic
Modeling Improved Syllabification Algorithm for Amharic
 
Ge'ez Verbs Morphology and Declaration Model
Ge'ez Verbs Morphology and Declaration ModelGe'ez Verbs Morphology and Declaration Model
Ge'ez Verbs Morphology and Declaration Model
 
Do we need linguistic knowledge for speech technology applications in African...
Do we need linguistic knowledge for speech technology applications in African...Do we need linguistic knowledge for speech technology applications in African...
Do we need linguistic knowledge for speech technology applications in African...
 

Kürzlich hochgeladen

Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 

Kürzlich hochgeladen (20)

Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 

Exploiting Cross-Linguistic Similarities in Zulu and Xhosa Computational Morphology

  • 1.   Exploiting Cross‐linguistic Similarities in Zulu and Xhosa  Computational Morphology  Laurette Pretorius  Sonja Bosch  School of Computing  Department of African Languages  University of South Africa &  University of South Africa  Meraka Institute, CSIR  Pretoria, South Africa  Pretoria, South Africa  boschse@unisa.ac.za pretol@unisa.ac.za   Summary    • An existing Zulu morphological analyser prototype (ZulMorph) serves as basis for the  bootstrapping of a Xhosa analyser.   • The investigation is structured around the morphotactics and the morphophonological  alternations of the languages involved.   • Special attention is given to the so‐called “open” class, which represents the word root  lexicons for specifically nouns and verbs.   • The acquisition and coverage of these lexicons prove to be crucial for the success of the  analysers under development.    • The bootstrapped morphological analyser is applied to parallel test corpora and the results  are discussed.   • A variety of cross‐linguistic effects is illustrated with examples from the corpora.    Background on ZulMorph    Xerox finite‐state toolkit:   lexc for modelling the morphotactics;   xfst (regular expression language) for modelling morphophonological alternations  Word roots include 15 800 nouns, 7 600 verbs, 408 relatives, 47 adjectives, 2 735 ideophones, 176  conjunctions.     Morphotactics (lexc)   Affixes  for  all  parts‐of‐speech  Word roots (e.g.  Rules  for  legal  combinations  and  (e.g. subject & object concords,  nouns, verbs,  orders  of  morphemes  (e.g.  u‐ya‐ noun  class  prefixes,  verb  relatives,  ngi‐thand‐a  and  not  *ya‐u‐a‐ extensions etc.)   ideophones)   thand‐ngi)   Morpho‐ Rules that determine the form of each morpheme  phonological  (e.g. ku‐lob‐w‐a > ku‐lotsh‐w‐a, u‐mu‐lomo > u‐m‐lomo)   alternations    (xfst)   Table 1: Zulu Morphological Analyser Components  Morphotactics  We distinguish between so‐called closed and open classes:   • The  open  class  accepts  the  addition  of  new  items  by  means  of  processes  such  as  borrowing,  coining, compounding and derivation. In the context of this paper, the open class represents word  roots including verb roots and noun stems.   • The closed class represents affixes that model the fixed morphological structure of words, as well  as items such as conjunctions, pronouns etc.      
  • 2.       We focus on Xhosa affixes that differ from their Zulu counterparts. A few examples are given in Table 2.    Morpheme  Zulu  Xhosa  Noun Class Prefixes  Class 1 and 3 um(u)‐  full form umu‐ with monosyllabic  um‐ with all noun stems: um‐ntu,    noun stems, shortened form with  um‐fana    polysyllabic noun stems:  umu‐ntu, um‐fana  Class 2a  o‐: o‐baba  oo‐: oo‐bawo  Class 9  in‐ with all noun stems:  i‐ with noun stems beginning with    in‐nyama   h, i, m, n, ny: i‐hambo     Class 10  izin‐  with  monosyllabic  and  iin‐ with polysyllabic stems:  polysyllabic stems.   iin‐dlebe  izin‐ja;  izin‐dlebe  Contracted subject concords (future tense). Examples:  1ps  ngo‐  ndo‐  2ps, Class 1 & 3  wo‐  uyo‐  Class 4 & 9  yo‐  iyo‐  Table 2. Examples of variations in Zulu and Xhosa ‘closed’ morpheme information    The bootstrapping process is iterative and new information regarding dissimilar morphological  constructions is incorporated systematically in the morphotactics component. Similarly, rules are  adapted in a systematic manner.    Morphophonological alternations  Differences in morphophonological alternations between Zulu and Xhosa are exemplified in   Table 3.    Zulu  Xhosa  Class  10  class  prefix  izin‐  occurs  before  Class  10  class  prefix  izin‐  changes  to  iin‐  before  monosyllabic  as  well  as  polysyllabic  stems,  polysyllabic stems, e.g. izinja, iindlebe  e.g. izinja, izindlebe  Adverb prefix na + ii > nee; e.g. neendlebe (na‐iin‐ Adverb prefix na + i > ne, e.g. nezindlebe (na‐ ndlebe)  izin‐ndlebe)  Palatalisation  with  passive,  diminutive  &  Palatalisation  with  passive,  diminutive  &  locative  locative formation:  formation:   b   > tsh     b > ty                     ‐hlab‐w‐a > ‐hlatsh‐w‐a, intaba‐ana > intatsh‐ ‐hlab‐w‐a > ‐hlaty‐w‐a, intaba‐ana > intaty‐a na  ana, indaba > endatsheni  ihlobo > ehlotyeni  ph > sh    ph > tsh     ‐boph‐w‐a  > ‐bosh‐w‐a,   –boph‐w‐a   > ‐botsh‐w‐a,   iphaphu‐ana      > iphash‐ana  iphaphu‐ana       > iphatsh‐ana,  iphaphu              > ephasheni  usapho                > elusatsheni  Table 3. Examples of variations in Zulu and Xhosa morphophonology   
  • 3.     The word root lexicons  Zulu lexicon:   • Based on an extensive word list dating back to the mid 1950s, but significant  improvements and additions are regularly made.   • At present the Zulu word roots include noun stems with class information (15 759), verb  roots (7 567), relative stems (406), adjective stems (48), ideophones (1 360), conjunctions  (176).   Xhosa lexicon:   • Noun stems with class information (4 959) and verb roots (5 984) extracted from various  recent prototype paper dictionaries whereas relative stems (27), adjective stems (17),  ideophones (30) and conjunctions (28) were only included as representative samples at  this stage.    A computational approach to cross‐linguistic similarity  • Application of the bootstrapped morphological analyser to parallel test corpora  • Guesser variant of the morphological analyser that uses typical word root patterns for  identifying potential new word roots    Bootstrapping procedure            The language resources chosen to illustrate this point are parallel corpora in the form of the South  African Constitution.       
  • 4.         Results of the application of the bootstrapped morphological analyser to this corpus are as follows:      Zulu Statistics  Corpus size:    7057 types  Analysed:     5748 types (81.45 %)  Failures:   1309 types (18.55%)  Failures analysed by guesser: 1239 types  Failures not analysed by guesser: 70 types    Xhosa Statistics  Corpus size:   7423 types  Analysed:    5380 types (72.48 %)  Failures:   2043 types (27.52%)  Failures analysed by guesser: 1772 types  Failures not analysed by guesser: 271 types      Examples from the Zulu corpus:  The analysis of the Zulu word ifomu ‘form’ uses the Xhosa noun stem –fomu (9/10) in the Xhosa lexicon in  the absence of the Zulu stem:  ifomu i[NPrePre9]fomu[Xh][NStem]   Examples from the Xhosa corpus:  The analysis of the Xhosa words bephondo ‘of the province’ and esikhundleni ‘in the office’ use the Zulu  noun stems –phondo (5/6) and –khundleni (7/8) respectively in the Zulu lexicon:  bephondo ba[PossConc14]i[NPrePre5]li[BPre5]phondo[NStem] esikhundleni e[LocPre]i[NPrePre7]si[BPre7]khundla[NStem]ini[LocSuf] Examples of the guesser output from the Zulu corpus:  The compound noun –shayamthetho (7/8) ‘legislature’ is not listed in the Zulu lexicon, but was guessed  correctly:   isishayamthetho i[NPrePre7]si[BPre7]shayamthetho-Guess[NStem]     Conclusion and Future Work  • Zulu  informed  Xhosa  in  the  sense  that  the  systematically  developed  grammar  for  ZulMorph  was  directly available for the Xhosa analyser development, which significantly reduced the development  time for the Xhosa prototype compared to that for ZulMorph.  • To a lesser extent, Xhosa also informed Zulu by providing a current (more up to date) Xhosa lexicon.  In addition, the guesser variant was employed in identifying possible new roots in the test corpora,  both for Zulu and for Xhosa.  • Bootstrapping  morphological  analysers  for  languages  that  exhibit  significant  structural  and  lexical  similarities may be fruitfully exploited for developing analysers for lesser‐resourced languages.   • Future  work  includes  the  application  of  the  approach  followed  in  this  work  to  the  other  Nguni  languages, namely Swati and Ndebele (Southern and Zimbabwe); the application to larger corpora,  and the subsequent construction of stand‐alone versions. Finally, the combined analyser could also  be used for (corpus‐based) quantitative studies in cross‐linguistic similarity.