Chinese Character Decomposition for Neural MT with Multi-Word Expressions
Lifeng Han1, Gareth J. F. Jones1, Alan F. Smeaton2 and Paolo Bolzoni3
Email: lifeng.han@adaptcentre.ie
Institutes: 1 ADAPT Research Center, School of Computing, DCU, Ireland; 2 Insight Centre for Data Analytics, School of Computing, DCU, Ireland; 3 Oplus International Co. LTD
Highlights
1. Chinese character decomposition with varied levels for NMT
2. NMT learning curves on different levels of decomposition
3. MWE investigation in NMT, MWEs in decomposed levels
4. Open source toolkit for Chinese character decomposition
5. Open source bilingual Chinese-English MWE glossaries
Motivations
I. Out-of-vocabulary (OOV) word translation for European languages is addressed by incorporating sub-word knowledge using Byte Pair Encoding (BPE). However, such methods cannot be directly applied to ideographic languages (Chinese, Japanese and others).
II. Chinese character decomposition has been used as a feature to enhance NMT, and recent work has investigated ideograph- or stroke-level embeddings. However, the impact of decomposition embeddings at each level (radical, stroke, and intermediate levels) has not yet been investigated in detail; see our paper Han and Kuang (2018ESSLLI).
III. MWE integration into MT has seen more than a decade of exploration. However, decomposed MWEs have not previously been explored. [1]
Chinese Characters Knowledge
Figure: decomposition of 鋒 (fēng) and 劍 (jiàn) from Level-1 down to full-stroke level.

鋒 (fēng)
Level-1: 金 (semantic, metal) + 夆 (phonetic, féng)
Level-2: 人王丷 夂丰
Level-3: 人一土丷 夂三丨
…
Full-stroke: 丿㇏一一丨丶㇀一 ㇀㇇㇏一一一丨

劍 (jiàn)
Level-1: 僉 (phonetic, qiān) + 刂 (semantic, knife)
Level-2: 亼吅从 刂
Level-3: 人一口口人人 刂
…
Full-stroke: 丿㇏一丨𠃍一丨𠃍一丿㇏丿㇏ 丨亅
Chinese characters often contain two separate parts: one carrying the phonetics and one carrying the meaning (via radicals). For instance, the two-stroke radical 刂 (tí dāo páng) ordinarily relates to knife. It appears in the character 劍 (jiàn, sword) and in the MWE 鋒利 (fēnglì, sharp).
The same holds for many other radicals. 刂 (tí dāo páng) preserves the meaning of knife because it is a variation of a drawing of a knife that evolved from the original bronze inscription.
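The level-wise decomposition illustrated for 鋒 and 劍 can be sketched in a few lines of Python. This is a minimal illustration, not the released toolkit: the table `ONE_STEP` and the function `decompose` are hypothetical names, and the table holds only the component splits shown on this poster, whereas a real system would load them from an IDS dictionary.

```python
# Toy one-step decomposition table, transcribed from the 鋒 / 劍 example.
# Hypothetical stand-in for a full IDS dictionary.
ONE_STEP = {
    "鋒": "金夆",
    "金": "人王丷",
    "夆": "夂丰",
    "王": "一土",
    "丰": "三丨",
    "劍": "僉刂",
    "僉": "亼吅从",
    "亼": "人一",
    "吅": "口口",
    "从": "人人",
}


def decompose(text: str, level: int, table: dict = ONE_STEP) -> str:
    """Expand every character in `text` by `level` decomposition steps.

    Characters without an entry in `table` are kept unchanged, so the
    output stops changing once only atomic components remain.
    """
    for _ in range(level):
        text = "".join(table.get(ch, ch) for ch in text)
    return text


if __name__ == "__main__":
    for lvl in (1, 2, 3):
        print(lvl, decompose("鋒", lvl), decompose("劍", lvl))
```

Treating each level as one more pass over the same table keeps shallow and deep decompositions consistent with each other: a Level-3 sequence is always the Level-2 sequence with one further expansion applied.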
Experimental Setup
Chinese character sequences are decomposed into different levels, from shallow to deep, based on the IDS dictionary from the open-source CHISE project. Chinese-English NMT uses an attention-based model (Google2017) with the THUMT (Tsinghua Uni MT) toolkit. Parameters: 7-7 layer encoder-decoder, batch size 6250, 32k BPE. Bilingual MWEs are extracted following our earlier work Han et al. (2020LREC), based on MWEtoolkit. [2]
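As a rough sketch of how an IDS dictionary could drive decomposition of a word-segmented sentence, the Python below parses a few entries and applies them token by token. It assumes a tab-separated layout of code point, character, and IDS string; the sample entries, `parse_ids_lines`, and `decompose_tokens` are hypothetical and not taken from the released toolkit. Entries whose IDS contains entity references rather than plain characters are not handled here.

```python
# Ideographic Description Characters (U+2FF0..U+2FFB) describe layout only,
# so they are dropped when collecting components.
IDC = {chr(cp) for cp in range(0x2FF0, 0x2FFC)}


def parse_ids_lines(lines):
    """Build {character: components} from tab-separated IDS entries.

    Assumed line layout: codepoint <TAB> character <TAB> IDS string.
    Comment lines (starting with ';') and malformed lines are skipped.
    """
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith(";"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        char, ids = fields[1], fields[2]
        components = "".join(ch for ch in ids if ch not in IDC)
        # An entry that maps a character to itself is atomic; leave it out.
        if components and components != char:
            table[char] = components
    return table


def decompose_tokens(sentence, table, level):
    """Decompose each whitespace-separated token, keeping word boundaries."""
    out = []
    for token in sentence.split():
        for _ in range(level):
            token = "".join(table.get(ch, ch) for ch in token)
        out.append(token)
    return " ".join(out)


# Hypothetical sample entries in the assumed format.
SAMPLE = [
    ";; sample entries for illustration only",
    "U+92D2\t鋒\t⿰金夆",
    "U+528D\t劍\t⿰僉刂",
    "U+5229\t利\t⿰禾刂",
]

table = parse_ids_lines(SAMPLE)
print(decompose_tokens("鋒利 的 劍", table, level=1))
```

Decomposing inside each token rather than across the whole line keeps word and MWE boundaries intact, so a decomposed MWE such as 鋒利 still forms a single unit in the resulting sequence.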
BLEU & Crowd-source Humans
Direct Assessment human evaluation results show five models with similar performance, forming the first cluster. Decomposition level 2 is significantly behind the other models according to the human judges (and also in BLEU scores). n: the number of distinct translations; N: the number of human assessments (including repeat assessments).
Discussion
BLEU has long been criticised for not reflecting real differences between high-performing MT models. Crowd-sourced human evaluation is not perfect either: very recent work highlights that professional translators disagree with crowd-sourced human rankings of MT systems, largely on WMT data. We therefore look at detailed translation examples from the different system outputs at 100K learning steps; the examples that follow reflect the advantages of the decomposition models, e.g. rxd3.
Neural MT Examples
1) Chinese MWE 商场 (shāngchǎng) in the first sentence:
- correctly translated as mall by the rxd3 model -vs- translated as shop by the baseline character sequence model;
2) MWE 楼梯间 (lóutījiān) in the second sentence:
- correctly translated as stairwell by the rxd3 model -vs- translated as stairs by the baseline;
3) MWE 近日 (jìnrì), meaning recently, in the second sentence:
- missed entirely by the baseline character sequence model → results in a misleading, ambiguous translation of a larger span (i.e., did the chef move to San Francisco (SF) recently or this week?)
-vs- correctly translated by the rxd3 model → the overall meaning of the sentence is clear.
src
28 岁 厨师 被 发现 死 于 旧金山 一家 商场
近日 刚 搬 至 旧金山 的 一位 28 岁 厨师 本周 被 发现 死 于 当地 一家 商场 的 楼梯间 。
ref
28 @-@ Year @-@ Old Chef Found Dead at San Francisco Mall
a 28 @-@ year @-@ old chef who had recently moved to San Francisco was found dead in the stairwell of a local mall this week .
rxd3
the 28 @-@ year @-@ old chef was found dead at a San Francisco mall
a 28 @-@ year @-@ old chef who recently moved to San Francisco has been found dead on a stairwell in a local mall this week .
base
the 28 @-@ year @-@ old chef was found dead in a shop in San Francisco
a 28 @-@ year @-@ old chef who has moved to San Francisco this week was found dead on the stairs of a local mall .
base MWE
28 @-@ year @-@ old chef was found dead at a San Francisco mall
a 28 @-@ year @-@ old chef who recently moved to San Francisco was found dead this week at a local mall .
rxd3 MWE
28 @-@ year @-@ old chef was found dead at a San Francisco mall
a 28 @-@ year @-@ old chef recently moved to San Francisco was found dead this week at a local mall .
rxd1
the 28 @-@ year @-@ old chef was found dead at a San Francisco mall
a 28 @-@ year @-@ old chef recently moved to San Francisco was found dead in a local shopping mall this week .
rxd2
the 28 @-@ year @-@ old chef was found dead in a San Francisco mall
a 28 @-@ year @-@ old San Francisco chef was found dead in a local mall this week .
[1] Acknowledgments: We thank Yvette Graham for helping with human evaluation, Eoin Brophy for help with Colab, and the anonymous reviewers for their thorough reviews and insightful feedback. The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund. The input of Alan Smeaton is part-funded by SFI under grant number SFI/12/RC/2289 (Insight Centre).
[2] Our character decomposition toolkit and multilingual MWE resources are available at https://github.com/poethan/MWE4MT. This paper is based on our earlier work at ESSLLI2018 (https://arxiv.org/abs/1805.01565) and LREC2020 (https://www.aclweb.org/anthology/2020.lrec-1.363/). We acknowledge and endorse the following resources: http://www.chise.org/, https://github.com/THUNLP-MT/THUMT, and http://mwetoolkit.sf.net
