SlideShare ist ein Scribd-Unternehmen logo
1 von 8
Downloaden Sie, um offline zu lesen
Enriching Transliteration Lexicon Using
Automatic Transliteration Extraction
Sarvnaz Karimi
School of Computer Science and IT
RMIT University
Supervisors: Dr. Falk Scholer and Dr. Andrew Turpin
Keywords: Transliteration, Parallel Corpus
Machine Transliteration
• Machine transliteration transforms a word from a source language to a
target language with preserved pronunciation.
• Machine translation, cross-lingual information retrieval and cross-lingual
question answering are the main areas that automatic transliteration is
applicable.
• Transliteration has been studied in two major areas: transliteration
generation and transliteration extraction.
• Transliteration generation gets an input source word in source language
(e.g. Sydney in English) and generates its transliteration in target
language (e.g.
©
Â
ª
in Persian).
• Transliteration extraction is discovering transliteration pairs (e.g.
(Sydney,
©
Â
ª
) in bilingual texts.
Transliteration Extraction So Far!
Discovery of transliteration methods in literature consider:
• Extraction from parallel corpus:
– Statistical methods are beneficial, particularly because the
sentences/words can be aligned.
– Yet parallel corpus is hard to find for many less-computerised
languages.
• Extraction from comparable corpus:
– More evidence than just statistical information are required to extract
pairs (e.g. temporal, phonetic information or Web-count).
– Comparable corpora are easier to construct and find than parallel one.
Most studies use name entity (NE) recogniser to separate proper nouns that
are subject to transliteration from other words.
Persian and English Transliteration
• Transliteration generation has been studied using n-gram based and
consonant-vowel based approaches.
• Transliteration extraction is not previously studied for this language pair,
mainly due to lack of any parallel or comparable corpus.
• Transliteration extraction has been studied using co-occurrence, temporal,
edit distance measures or phonetic similarities. We aim to apply our
transliteration generation methods as a basis for this task.
Proposed Method: Application of Transliteration
Generation in Extraction
1. For each document in each language we perform a pre-processing
to generate a bag of words from each document (tokenise) and
also remove stop-words
2. Each word in source language is matched against a dictionary,
if not found then it is an out-of-dictionary word that needs
transliteration in target document.
3. A ranked list of possible transliterations for each source
word is generated by transliteration system.
4. Those transliterations matching with the target document
potential words are considered as a potential pair.
5. A score can be given to these pairs based on the rank of the
transliteration and number of times they are paired.
Experimental Setup
• An English-Persian comparable corpus of news texts is constructed
consisting of 3,474 documents.
• An English machine-readable dictionary was applied which contains
120,177 entries.
• Experiments:
– Accuracy of transliterations extracted (Fixed Training Collection).
Different methods of matching experienced (1-English documents are
parsed to extract their out-of-dictionary words using dictionary look-up
and stemming. 2- A parsing on Persian documents is performed by
rendering the words that contain allophones characters to one unique
character. 3- Repeating the previous experiment including capital
characters knowledge.)
– Impact of seed transliteration lexicon.
Experiments and Results
Experiment 1 : Accuracy
#Pairs #Correct Avg.Rank Rank 1-5 #P #E #doc Lex-Size Lex Pr.
1 4.6 3.6 (81.3) 5.8 68.9 8.5 11.3 1662 2860 70.3
2 4.1 3.6 (90.2) 5.9 69.7 8.5 11.3 1641 2496 80.4
3 6.6 5.9 (89.2) 6.9 66.8 8.4 22.6 1725 3694 75.2
Experiment 2 : Train Size
Train #Pairs #Correct Avg.Rank Rank 1-5 #P #E #doc Lex-Size Lex Pr.
200 2.6 2.2 (86.5) 2.8 84.2 9.1 24.0 1322 1287 78.2
300 3.1 2.5 (82.5) 4.2 71.9 8.8 23.5 1494 1579 73.3
400 3.1 2.6 (84.5) 4.6 71.3 8.8 23.4 1483 1569 74.1
500 2.8 2.5 (89.5) 4.3 78.0 8.9 23.5 1459 1507 79.0
Conclusions and Further Work
• Transliteration extraction can be helpful in automatically generating
transliteration lexicons.
• Transliteration lexicon as a dictionary of transliteration of a proper noun or
technical terms that are not translated are beneficial in dictionary-based
machine translation applications.
• We investigated a method of applying the current yet incomplete
transliteration lexicons in enriching them using comparable corpora.
• In future, role of NE-recogniser will be investigated to compare with a
simple dictionary look-up.

Weitere ähnliche Inhalte

Was ist angesagt?

Assessing receptive skills
Assessing receptive skillsAssessing receptive skills
Assessing receptive skillsMichael Chin
 
English to Bangla Translation
English to Bangla TranslationEnglish to Bangla Translation
English to Bangla TranslationSaugata Bose
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)inventionjournals
 
Computer dictionaries and_parsing_ppt
Computer dictionaries and_parsing_pptComputer dictionaries and_parsing_ppt
Computer dictionaries and_parsing_pptSubramanianMuthusamy3
 
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On SilenceSegmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On Silencepaperpublications3
 
Phonetic Recognition In Words For Persian Text To Speech Systems
Phonetic Recognition In Words For Persian Text To Speech SystemsPhonetic Recognition In Words For Persian Text To Speech Systems
Phonetic Recognition In Words For Persian Text To Speech Systemspaperpublications3
 
Corpus linguistics
Corpus linguisticsCorpus linguistics
Corpus linguisticsAlicia Ruiz
 
4 salient features of corpus
4 salient features of corpus4 salient features of corpus
4 salient features of corpusThennarasuSakkan
 
11 terms in Corpus Linguistics1 (2)
11 terms in Corpus Linguistics1 (2)11 terms in Corpus Linguistics1 (2)
11 terms in Corpus Linguistics1 (2)ThennarasuSakkan
 
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATION
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATIONA ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATION
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATIONkevig
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
 

Was ist angesagt? (18)

Assessing receptive skills
Assessing receptive skillsAssessing receptive skills
Assessing receptive skills
 
Part of speech tagging for Arabic
Part of speech tagging for ArabicPart of speech tagging for Arabic
Part of speech tagging for Arabic
 
P99 1067
P99 1067P99 1067
P99 1067
 
English to Bangla Translation
English to Bangla TranslationEnglish to Bangla Translation
English to Bangla Translation
 
Corpus linguistics
Corpus linguisticsCorpus linguistics
Corpus linguistics
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
Computer dictionaries and_parsing_ppt
Computer dictionaries and_parsing_pptComputer dictionaries and_parsing_ppt
Computer dictionaries and_parsing_ppt
 
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On SilenceSegmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
 
Phonetic Recognition In Words For Persian Text To Speech Systems
Phonetic Recognition In Words For Persian Text To Speech SystemsPhonetic Recognition In Words For Persian Text To Speech Systems
Phonetic Recognition In Words For Persian Text To Speech Systems
 
Corpus linguistics
Corpus linguisticsCorpus linguistics
Corpus linguistics
 
Syntactic parsing for arabic
Syntactic parsing for arabicSyntactic parsing for arabic
Syntactic parsing for arabic
 
4 salient features of corpus
4 salient features of corpus4 salient features of corpus
4 salient features of corpus
 
11 terms in Corpus Linguistics1 (2)
11 terms in Corpus Linguistics1 (2)11 terms in Corpus Linguistics1 (2)
11 terms in Corpus Linguistics1 (2)
 
Applications of CL to FLT
Applications of CL to FLTApplications of CL to FLT
Applications of CL to FLT
 
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATION
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATIONA ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATION
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATION
 
Moses
MosesMoses
Moses
 
Selecting proper lexical paraphrase for children
Selecting proper lexical paraphrase for childrenSelecting proper lexical paraphrase for children
Selecting proper lexical paraphrase for children
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 

Andere mochten auch

Parallel text extraction from multimodal comparable corpora
Parallel text extraction from multimodal comparable corporaParallel text extraction from multimodal comparable corpora
Parallel text extraction from multimodal comparable corporaHaithem Afli
 
Chelo Vargas-Sierra
Chelo Vargas-SierraChelo Vargas-Sierra
Chelo Vargas-SierraChelo Vargas
 
Identification of Fertile Translations in Comparable Corpora: a Morpho-Compos...
Identification of Fertile Translations in Comparable Corpora: a Morpho-Compos...Identification of Fertile Translations in Comparable Corpora: a Morpho-Compos...
Identification of Fertile Translations in Comparable Corpora: a Morpho-Compos...Estelle Delpech
 
Meng Zhang - 2017 - Adversarial Training for Unsupervised Bilingual Lexicon I...
Meng Zhang - 2017 - Adversarial Training for Unsupervised Bilingual Lexicon I...Meng Zhang - 2017 - Adversarial Training for Unsupervised Bilingual Lexicon I...
Meng Zhang - 2017 - Adversarial Training for Unsupervised Bilingual Lexicon I...Association for Computational Linguistics
 
Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange
Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchangeDealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange
Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchangeEstelle Delpech
 
Michael Bloodgood - 2017 - Acquisition of Translation Lexicons for Historical...
Michael Bloodgood - 2017 - Acquisition of Translation Lexicons for Historical...Michael Bloodgood - 2017 - Acquisition of Translation Lexicons for Historical...
Michael Bloodgood - 2017 - Acquisition of Translation Lexicons for Historical...Association for Computational Linguistics
 
Applicative evaluation of bilingual terminologies
Applicative evaluation of bilingual terminologiesApplicative evaluation of bilingual terminologies
Applicative evaluation of bilingual terminologiesEstelle Delpech
 
Philippe Langlais - 2017 - Users and Data: The Two Neglected Children of Bili...
Philippe Langlais - 2017 - Users and Data: The Two Neglected Children of Bili...Philippe Langlais - 2017 - Users and Data: The Two Neglected Children of Bili...
Philippe Langlais - 2017 - Users and Data: The Two Neglected Children of Bili...Association for Computational Linguistics
 
Extraction of domain-specific bilingual lexicon from comparable corpora: comp...
Extraction of domain-specific bilingual lexicon from comparable corpora: comp...Extraction of domain-specific bilingual lexicon from comparable corpora: comp...
Extraction of domain-specific bilingual lexicon from comparable corpora: comp...Estelle Delpech
 
Cross-lingual ontology lexicalisation, translation and information extraction...
Cross-lingual ontology lexicalisation, translation and information extraction...Cross-lingual ontology lexicalisation, translation and information extraction...
Cross-lingual ontology lexicalisation, translation and information extraction...Tobias Wunner
 
Bilingual terminology mining
Bilingual terminology miningBilingual terminology mining
Bilingual terminology miningEstelle Delpech
 
A cognitive view of the bilingual lexicon
A cognitive view of the bilingual lexiconA cognitive view of the bilingual lexicon
A cognitive view of the bilingual lexiconİrem Tümer
 
Bilingual Terminology Extraction based on Translation Patterns
Bilingual Terminology Extraction based on Translation PatternsBilingual Terminology Extraction based on Translation Patterns
Bilingual Terminology Extraction based on Translation PatternsAlberto Simões
 
Challenges in the linguistic exploitation of specialized republishable web co...
Challenges in the linguistic exploitation of specialized republishable web co...Challenges in the linguistic exploitation of specialized republishable web co...
Challenges in the linguistic exploitation of specialized republishable web co...Adrien Barbaresi
 
Macro economische analyse van brazilië
Macro economische analyse van braziliëMacro economische analyse van brazilië
Macro economische analyse van braziliëJan-Willem Lammens
 
Embedded Human Computation for Knowledge Extraction and Evaluation
Embedded Human Computation for Knowledge Extraction and EvaluationEmbedded Human Computation for Knowledge Extraction and Evaluation
Embedded Human Computation for Knowledge Extraction and EvaluationwebLyzard technology
 
Word Formation in English
Word Formation in EnglishWord Formation in English
Word Formation in Englishteflang
 

Andere mochten auch (17)

Parallel text extraction from multimodal comparable corpora
Parallel text extraction from multimodal comparable corporaParallel text extraction from multimodal comparable corpora
Parallel text extraction from multimodal comparable corpora
 
Chelo Vargas-Sierra
Chelo Vargas-SierraChelo Vargas-Sierra
Chelo Vargas-Sierra
 
Identification of Fertile Translations in Comparable Corpora: a Morpho-Compos...
Identification of Fertile Translations in Comparable Corpora: a Morpho-Compos...Identification of Fertile Translations in Comparable Corpora: a Morpho-Compos...
Identification of Fertile Translations in Comparable Corpora: a Morpho-Compos...
 
Meng Zhang - 2017 - Adversarial Training for Unsupervised Bilingual Lexicon I...
Meng Zhang - 2017 - Adversarial Training for Unsupervised Bilingual Lexicon I...Meng Zhang - 2017 - Adversarial Training for Unsupervised Bilingual Lexicon I...
Meng Zhang - 2017 - Adversarial Training for Unsupervised Bilingual Lexicon I...
 
Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange
Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchangeDealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange
Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange
 
Michael Bloodgood - 2017 - Acquisition of Translation Lexicons for Historical...
Michael Bloodgood - 2017 - Acquisition of Translation Lexicons for Historical...Michael Bloodgood - 2017 - Acquisition of Translation Lexicons for Historical...
Michael Bloodgood - 2017 - Acquisition of Translation Lexicons for Historical...
 
Applicative evaluation of bilingual terminologies
Applicative evaluation of bilingual terminologiesApplicative evaluation of bilingual terminologies
Applicative evaluation of bilingual terminologies
 
Philippe Langlais - 2017 - Users and Data: The Two Neglected Children of Bili...
Philippe Langlais - 2017 - Users and Data: The Two Neglected Children of Bili...Philippe Langlais - 2017 - Users and Data: The Two Neglected Children of Bili...
Philippe Langlais - 2017 - Users and Data: The Two Neglected Children of Bili...
 
Extraction of domain-specific bilingual lexicon from comparable corpora: comp...
Extraction of domain-specific bilingual lexicon from comparable corpora: comp...Extraction of domain-specific bilingual lexicon from comparable corpora: comp...
Extraction of domain-specific bilingual lexicon from comparable corpora: comp...
 
Cross-lingual ontology lexicalisation, translation and information extraction...
Cross-lingual ontology lexicalisation, translation and information extraction...Cross-lingual ontology lexicalisation, translation and information extraction...
Cross-lingual ontology lexicalisation, translation and information extraction...
 
Bilingual terminology mining
Bilingual terminology miningBilingual terminology mining
Bilingual terminology mining
 
A cognitive view of the bilingual lexicon
A cognitive view of the bilingual lexiconA cognitive view of the bilingual lexicon
A cognitive view of the bilingual lexicon
 
Bilingual Terminology Extraction based on Translation Patterns
Bilingual Terminology Extraction based on Translation PatternsBilingual Terminology Extraction based on Translation Patterns
Bilingual Terminology Extraction based on Translation Patterns
 
Challenges in the linguistic exploitation of specialized republishable web co...
Challenges in the linguistic exploitation of specialized republishable web co...Challenges in the linguistic exploitation of specialized republishable web co...
Challenges in the linguistic exploitation of specialized republishable web co...
 
Macro economische analyse van brazilië
Macro economische analyse van braziliëMacro economische analyse van brazilië
Macro economische analyse van brazilië
 
Embedded Human Computation for Knowledge Extraction and Evaluation
Embedded Human Computation for Knowledge Extraction and EvaluationEmbedded Human Computation for Knowledge Extraction and Evaluation
Embedded Human Computation for Knowledge Extraction and Evaluation
 
Word Formation in English
Word Formation in EnglishWord Formation in English
Word Formation in English
 

Ähnlich wie Enriching Transliteration Lexicon Using Automatic Transliteration Extraction

Understanding Natural Languange with Corpora-based Generation of Dependency G...
Understanding Natural Languange with Corpora-based Generation of Dependency G...Understanding Natural Languange with Corpora-based Generation of Dependency G...
Understanding Natural Languange with Corpora-based Generation of Dependency G...Edmond Lepedus
 
Building of Database for English-Azerbaijani Machine Translation Expert System
Building of Database for English-Azerbaijani Machine Translation Expert SystemBuilding of Database for English-Azerbaijani Machine Translation Expert System
Building of Database for English-Azerbaijani Machine Translation Expert SystemWaqas Tariq
 
Experiments with Different Models of Statistcial Machine Translation
Experiments with Different Models of Statistcial Machine TranslationExperiments with Different Models of Statistcial Machine Translation
Experiments with Different Models of Statistcial Machine Translationkhyati gupta
 
Experiments with Different Models of Statistcial Machine Translation
Experiments with Different Models of Statistcial Machine TranslationExperiments with Different Models of Statistcial Machine Translation
Experiments with Different Models of Statistcial Machine Translationkhyati gupta
 
Machine Transalation.pdf
Machine Transalation.pdfMachine Transalation.pdf
Machine Transalation.pdfAmir Abdalla
 
An expert system for automatic reading of a text written in standard arabic
An expert system for automatic reading of a text written in standard arabicAn expert system for automatic reading of a text written in standard arabic
An expert system for automatic reading of a text written in standard arabicijnlc
 
Ijarcet vol-2-issue-2-323-329
Ijarcet vol-2-issue-2-323-329Ijarcet vol-2-issue-2-323-329
Ijarcet vol-2-issue-2-323-329Editor IJARCET
 
Fsmnlp presentation mohammed_attia
Fsmnlp presentation mohammed_attiaFsmnlp presentation mohammed_attia
Fsmnlp presentation mohammed_attiaMohammed Attia
 
Personalising speech to-speech translation
Personalising speech to-speech translationPersonalising speech to-speech translation
Personalising speech to-speech translationbehzad66
 
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...IJITE
 
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...ijrap
 
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...gerogepatton
 
2017:12:06 acl読み会"Learning attention for historical text normalization by lea...
2017:12:06 acl読み会"Learning attention for historical text normalization by lea...2017:12:06 acl読み会"Learning attention for historical text normalization by lea...
2017:12:06 acl読み会"Learning attention for historical text normalization by lea...ayaha osaki
 
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT Lifeng (Aaron) Han
 

Ähnlich wie Enriching Transliteration Lexicon Using Automatic Transliteration Extraction (20)

Understanding Natural Languange with Corpora-based Generation of Dependency G...
Understanding Natural Languange with Corpora-based Generation of Dependency G...Understanding Natural Languange with Corpora-based Generation of Dependency G...
Understanding Natural Languange with Corpora-based Generation of Dependency G...
 
Parafraseo-Chenggang.pdf
Parafraseo-Chenggang.pdfParafraseo-Chenggang.pdf
Parafraseo-Chenggang.pdf
 
Building of Database for English-Azerbaijani Machine Translation Expert System
Building of Database for English-Azerbaijani Machine Translation Expert SystemBuilding of Database for English-Azerbaijani Machine Translation Expert System
Building of Database for English-Azerbaijani Machine Translation Expert System
 
Experiments with Different Models of Statistcial Machine Translation
Experiments with Different Models of Statistcial Machine TranslationExperiments with Different Models of Statistcial Machine Translation
Experiments with Different Models of Statistcial Machine Translation
 
Experiments with Different Models of Statistcial Machine Translation
Experiments with Different Models of Statistcial Machine TranslationExperiments with Different Models of Statistcial Machine Translation
Experiments with Different Models of Statistcial Machine Translation
 
project present
project presentproject present
project present
 
Machine Transalation.pdf
Machine Transalation.pdfMachine Transalation.pdf
Machine Transalation.pdf
 
C8 akumaran
C8 akumaranC8 akumaran
C8 akumaran
 
An expert system for automatic reading of a text written in standard arabic
An expert system for automatic reading of a text written in standard arabicAn expert system for automatic reading of a text written in standard arabic
An expert system for automatic reading of a text written in standard arabic
 
Ey4301913917
Ey4301913917Ey4301913917
Ey4301913917
 
Ijarcet vol-2-issue-2-323-329
Ijarcet vol-2-issue-2-323-329Ijarcet vol-2-issue-2-323-329
Ijarcet vol-2-issue-2-323-329
 
Fsmnlp presentation mohammed_attia
Fsmnlp presentation mohammed_attiaFsmnlp presentation mohammed_attia
Fsmnlp presentation mohammed_attia
 
Personalising speech to-speech translation
Personalising speech to-speech translationPersonalising speech to-speech translation
Personalising speech to-speech translation
 
FIRE2014_IIT-P
FIRE2014_IIT-PFIRE2014_IIT-P
FIRE2014_IIT-P
 
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
 
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
 
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
 
2017:12:06 acl読み会"Learning attention for historical text normalization by lea...
2017:12:06 acl読み会"Learning attention for historical text normalization by lea...2017:12:06 acl読み会"Learning attention for historical text normalization by lea...
2017:12:06 acl読み会"Learning attention for historical text normalization by lea...
 
Translationusing moses1
Translationusing moses1Translationusing moses1
Translationusing moses1
 
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
 

Mehr von Sarvnaz Karimi

Search in Medical Text
Search in Medical TextSearch in Medical Text
Search in Medical TextSarvnaz Karimi
 
Corpus Effects on the Evaluation of Automated Transliteration Systems
Corpus Effects on the Evaluation of Automated Transliteration SystemsCorpus Effects on the Evaluation of Automated Transliteration Systems
Corpus Effects on the Evaluation of Automated Transliteration SystemsSarvnaz Karimi
 
Collapsed Consonant and Vowel Models: New Approaches for English-Persian Tran...
Collapsed Consonant and Vowel Models: New Approaches for English-Persian Tran...Collapsed Consonant and Vowel Models: New Approaches for English-Persian Tran...
Collapsed Consonant and Vowel Models: New Approaches for English-Persian Tran...Sarvnaz Karimi
 
Pinpointing Location Focus in Microblogs
Pinpointing Location Focus in MicroblogsPinpointing Location Focus in Microblogs
Pinpointing Location Focus in MicroblogsSarvnaz Karimi
 
Classifying Microblogs For Disasters
Classifying Microblogs For DisastersClassifying Microblogs For Disasters
Classifying Microblogs For DisastersSarvnaz Karimi
 

Mehr von Sarvnaz Karimi (7)

Search in Medical Text
Search in Medical TextSearch in Medical Text
Search in Medical Text
 
Corpus Effects on the Evaluation of Automated Transliteration Systems
Corpus Effects on the Evaluation of Automated Transliteration SystemsCorpus Effects on the Evaluation of Automated Transliteration Systems
Corpus Effects on the Evaluation of Automated Transliteration Systems
 
Collapsed Consonant and Vowel Models: New Approaches for English-Persian Tran...
Collapsed Consonant and Vowel Models: New Approaches for English-Persian Tran...Collapsed Consonant and Vowel Models: New Approaches for English-Persian Tran...
Collapsed Consonant and Vowel Models: New Approaches for English-Persian Tran...
 
Karimi esair2015
Karimi esair2015Karimi esair2015
Karimi esair2015
 
Pinpointing Location Focus in Microblogs
Pinpointing Location Focus in MicroblogsPinpointing Location Focus in Microblogs
Pinpointing Location Focus in Microblogs
 
Biomedical Search
Biomedical SearchBiomedical Search
Biomedical Search
 
Classifying Microblogs For Disasters
Classifying Microblogs For DisastersClassifying Microblogs For Disasters
Classifying Microblogs For Disasters
 

Kürzlich hochgeladen

Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 

Kürzlich hochgeladen (20)

Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 

Enriching Transliteration Lexicon Using Automatic Transliteration Extraction

  • 1. Enriching Transliteration Lexicon Using Automatic Transliteration Extraction Sarvnaz Karimi School of Computer Science and IT RMIT University Supervisors: Dr. Falk Scholer and Dr. Andrew Turpin Keywords: Transliteration, Parallel Corpus
  • 2. Machine Transliteration • Machine transliteration transforms a word from a source language to a target language with preserved pronunciation. • Machine translation, cross-lingual information retrieval and cross-lingual question answering are the main areas that automatic transliteration is applicable. • Transliteration has been studied in two major areas: transliteration generation and transliteration extraction. • Transliteration generation gets an input source word in source language (e.g. Sydney in English) and generates its transliteration in target language (e.g. © Â ª in Persian). • Transliteration extraction is discovering transliteration pairs (e.g. (Sydney, © Â ª ) in bilingual texts.
  • 3. Transliteration Extraction So Far! Discovery of transliteration methods in literature consider: • Extraction from parallel corpus: – Statistical methods are beneficial, particularly because the sentences/words can be aligned. – Yet parallel corpus is hard to find for many less-computerised languages. • Extraction from comparable corpus: – More evidence than just statistical information are required to extract pairs (e.g. temporal, phonetic information or Web-count). – Comparable corpora are easier to construct and find than parallel one. Most studies use name entity (NE) recogniser to separate proper nouns that are subject to transliteration from other words.
  • 4. Persian and English Transliteration • Transliteration generation has been studied using n-gram based and consonant-vowel based approaches. • Transliteration extraction is not previously studied for this language pair, mainly due to lack of any parallel or comparable corpus. • Transliteration extraction has been studied using co-occurrence, temporal, edit distance measures or phonetic similarities. We aim to apply our transliteration generation methods as a basis for this task.
  • 5. Proposed Method: Application of Transliteration Generation in Extraction 1. For each document in each language we perform a pre-processing to generate a bag of words from each document (tokenise) and also remove stop-words 2. Each word in source language is matched against a dictionary, if not found then it is an out-of-dictionary word that needs transliteration in target document. 3. A ranked list of possible transliterations for each source word is generated by transliteration system. 4. Those transliterations matching with the target document potential words are considered as a potential pair. 5. A score can be given to these pairs based on the rank of the transliteration and number of times they are paired.
  • 6. Experimental Setup • An English-Persian comparable corpus of news texts is constructed consisting of 3,474 documents. • An English machine-readable dictionary was applied which contains 120,177 entries. • Experiments: – Accuracy of transliterations extracted (Fixed Training Collection). Different methods of matching experienced (1-English documents are parsed to extract their out-of-dictionary words using dictionary look-up and stemming. 2- A parsing on Persian documents is performed by rendering the words that contain allophones characters to one unique character. 3- Repeating the previous experiment including capital characters knowledge.) – Impact of seed transliteration lexicon.
  • 7. Experiments and Results Experiment 1 : Accuracy #Pairs #Correct Avg.Rank Rank 1-5 #P #E #doc Lex-Size Lex Pr. 1 4.6 3.6 (81.3) 5.8 68.9 8.5 11.3 1662 2860 70.3 2 4.1 3.6 (90.2) 5.9 69.7 8.5 11.3 1641 2496 80.4 3 6.6 5.9 (89.2) 6.9 66.8 8.4 22.6 1725 3694 75.2 Experiment 2 : Train Size Train #Pairs #Correct Avg.Rank Rank 1-5 #P #E #doc Lex-Size Lex Pr. 200 2.6 2.2 (86.5) 2.8 84.2 9.1 24.0 1322 1287 78.2 300 3.1 2.5 (82.5) 4.2 71.9 8.8 23.5 1494 1579 73.3 400 3.1 2.6 (84.5) 4.6 71.3 8.8 23.4 1483 1569 74.1 500 2.8 2.5 (89.5) 4.3 78.0 8.9 23.5 1459 1507 79.0
  • 8. Conclusions and Further Work • Transliteration extraction can be helpful in automatically generating transliteration lexicons. • Transliteration lexicon as a dictionary of transliteration of a proper noun or technical terms that are not translated are beneficial in dictionary-based machine translation applications. • We investigated a method of applying the current yet incomplete transliteration lexicons in enriching them using comparable corpora. • In future, role of NE-recogniser will be investigated to compare with a simple dictionary look-up.