The SAWA Corpus: A Parallel Corpus English - Swahili
Guy De Pauw (guy.depauw@aflat.org)
Peter Waiganjo Wagacha (waiganjo@aflat.org)
Gilles-Maurice de Schryver (gillesmaurice.deschryver@aflat.org)
Resource-scarceness
Data-driven approaches
Machine Translation
Data-driven: learn translation from examples, i.e. from a parallel corpus!
Parallel Corpus
Example
3 phases
Data Collection
Available data in SAWA Corpus
All manually sentence aligned!

Source               English     Kiswahili   English   Kiswahili
                     Sentences   Sentences   Words     Words
New Testament        16.4k       16.3k       189.2k    151.1k
Quran                14.3k       14.5k       165.5k    124.3k
Declaration of HR    0.2k        0.2k        1.8k      1.8k
Kamusi.org           5.6k        5.6k        35.5k     26.7k
Movie Subtitles      9.0k        9.0k        72.2k     58.4k
Investment Reports   3.2k        3.1k        52.9k     54.9k
Local Translator     1.5k        1.6k        25.0k     25.7k
Total                50.2k       50.3k       542.1k    442.9k

Thanks to Mahmoud Shokrollahi-Far, University College of Nabiye Akram (Iran)
Thanks to Dr. James Omboga Zaja, University of Nairobi
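The Total row of the table can be checked against the per-source rows. The short Python sketch below does that arithmetic. It assumes that the three sources listed with a single sentence figure (Declaration of HR, Kamusi.org, Movie Subtitles) have the same sentence count in both languages; the names ROWS, TOTAL and column_sums are illustrative only.

```python
# Per-source counts in thousands, copied from the table.
# Columns: English sentences, Kiswahili sentences, English words, Kiswahili words.
# Assumption: sources with one sentence figure use it for both languages.
ROWS = {
    "New Testament":      (16.4, 16.3, 189.2, 151.1),
    "Quran":              (14.3, 14.5, 165.5, 124.3),
    "Declaration of HR":  (0.2, 0.2, 1.8, 1.8),
    "Kamusi.org":         (5.6, 5.6, 35.5, 26.7),
    "Movie Subtitles":    (9.0, 9.0, 72.2, 58.4),
    "Investment Reports": (3.2, 3.1, 52.9, 54.9),
    "Local Translator":   (1.5, 1.6, 25.0, 25.7),
}
TOTAL = (50.2, 50.3, 542.1, 442.9)


def column_sums(rows):
    """Sum each column over all sources, rounded to one decimal (thousands)."""
    return tuple(round(sum(col), 1) for col in zip(*rows.values()))


if __name__ == "__main__":
    sums = column_sums(ROWS)
    for name, got, want in zip(("EN sent", "SW sent", "EN words", "SW words"), sums, TOTAL):
        print(f"{name}: sum={got}k reported={want}k")
```

Under that assumption the column sums agree with the reported totals, which supports reading the single sentence figures as applying to both languages.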
Word alignment
English: No she ' s uh , , up north
Kiswahili: La , , , yuko , aa juu kaskazini

Word alignment
English: You caught me skiving , I ' m afraid .
Kiswahili: Samahani , umenidaka nikihepa .
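One common way to store a word alignment is as a set of (English index, Kiswahili index) links, which makes it easy to score a predicted alignment against a reference. The Python sketch below illustrates this for the second example sentence. Both link sets are hypothetical and exist only to show the bookkeeping; they are not the alignments from the slides, and precision_recall is an illustrative helper name.

```python
# Tokens as shown on the slide.
english = ["You", "caught", "me", "skiving", ",", "I", "'", "m", "afraid", "."]
swahili = ["Samahani", ",", "umenidaka", "nikihepa", "."]

# An alignment is a set of (english_index, swahili_index) links.
# Hypothetical reference links, for illustration only.
gold = {(0, 2), (1, 2), (2, 2), (3, 3), (4, 1),
        (5, 0), (6, 0), (7, 0), (8, 0), (9, 4)}
# Hypothetical system output: misses some links and adds one wrong link (2, 3).
predicted = {(0, 2), (1, 2), (3, 3), (4, 1), (8, 0), (9, 4), (2, 3)}


def precision_recall(predicted, gold):
    """Precision and recall of predicted links against reference links."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    p, r = precision_recall(predicted, gold)
    for e, s in sorted(predicted):
        print(f"{english[e]} -> {swahili[s]}")
    print(f"precision={p:.3f} recall={r:.3f}")
```

Several English tokens link to the same Kiswahili token here, which is the many-to-one pattern that makes alignment between the two languages difficult.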
Current results
Precision   Recall   F (β=1)
39.4%       44.5%    41.79%
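With β=1 the F-score is the harmonic mean of precision and recall. The Python sketch below applies the standard F-beta formula to the precision and recall reported on this slide; the function name f_beta is illustrative, and nothing here describes how the authors computed their scores.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta: (1 + b^2) * P * R / (b^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Precision and recall as reported on the slide.
    print(f"{f_beta(0.394, 0.445):.2%}")  # prints 41.79%
```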
Alignment problems
Kiswahili: nimemkatalia
English: I have turned him down

Morphological decomposition
Kiswahili: ni+ me+ m+ katalia
English: I have turned him down
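The decomposition on this slide can be mimicked with a toy prefix splitter. The Python sketch below strips one subject, one tense and one object prefix, in that order, and treats the remainder as the stem. The prefix inventories contain only the morphemes from the slide example and are hypothetical placeholders, and the GLOSS mapping is an assumption based on the usual glosses of these morphemes. This is not the decomposition method used for the corpus.

```python
# Tiny, hypothetical prefix inventories: just enough for the slide example.
SUBJECT = ("ni",)   # 1st person singular subject
TENSE = ("me",)     # perfect
OBJECT = ("m",)     # 3rd person singular object

# Assumed English counterparts, for illustration only.
GLOSS = {"ni+": ["I"], "me+": ["have"], "m+": ["him"], "katalia": ["turned", "down"]}


def decompose(word):
    """Strip at most one prefix per slot, in order; the remainder is the stem."""
    morphemes = []
    rest = word
    for slot in (SUBJECT, TENSE, OBJECT):
        for prefix in slot:
            # Never consume the whole word: a stem must remain.
            if rest.startswith(prefix) and len(rest) > len(prefix):
                morphemes.append(prefix + "+")
                rest = rest[len(prefix):]
                break
    morphemes.append(rest)
    return morphemes


if __name__ == "__main__":
    parts = decompose("nimemkatalia")
    print(" ".join(parts))  # ni+ me+ m+ katalia
    for part in parts:
        print(part, "->", " ".join(GLOSS.get(part, ["?"])))
```

After decomposition the single Kiswahili token becomes four units, so each English word has a much smaller unit to align with.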
Current results
Precision   Recall   F (β=1)
50.2%       64.5%    55.8%
Future work
Conclusion
