NoDaLiDa2021 Poster presentation. https://arxiv.org/abs/2104.04497v1
Chinese character decomposition has been used as a feature to enhance Machine Translation (MT) models, combining radicals into character and word level models. Recent work has investigated ideograph or stroke level embedding. However, questions remain about which decomposition levels of Chinese character representations, radical or stroke, are best suited for MT. To investigate the impact of Chinese decomposition embedding in detail, i.e., radical, stroke, and intermediate levels, and how well these decompositions represent the meaning of the original character sequences, we carry out analysis with both automated and human evaluation of MT. Furthermore, we investigate whether the combination of decomposed Multiword Expressions (MWEs) can enhance model learning. MWE integration into MT has seen more than a decade of exploration. However, decomposed MWEs have not previously been explored.
Chinese Character Decomposition for Neural MT with Multi-Word Expressions
Lifeng Han (1), Gareth J. F. Jones (1), Alan F. Smeaton (2) and Paolo Bolzoni (3)
Email: lifeng.han@adaptcentre.ie
Institutes: (1) ADAPT Research Center, School of Computing, DCU, Ireland; (2) Insight Centre for Data Analytics, School of Computing, DCU, Ireland; (3) Oplus International Co. LTD
Highlights
1. Chinese character decomposition with varied levels for NMT
2. NMT learning curves on different levels of decomposition
3. MWE investigation in NMT, MWEs in decomposed levels
4. Open source toolkit for Chinese character decomposition
5. Open source bilingual Chinese-English MWE glossaries
Motivations
I. Out-of-vocabulary (OOV) word translation for European languages often incorporates sub-word knowledge using Byte Pair Encoding (BPE). However, such methods cannot be directly applied to ideographic languages (Chinese, Japanese and others).
II. Chinese character decomposition has been used as a feature to enhance NMT, and recent work has investigated ideograph or stroke level embeddings. However, questions remain about the impact of decomposition embedding in detail, i.e., at radical, stroke, and intermediate levels. See our earlier paper, Han and Kuang (2018ESSLLI).
III. MWE integration into MT has seen more than a decade of exploration. However, decomposed MWEs have not previously been explored. [1]
Chinese Characters Knowledge
鋒 (fēng) = 金 (semantic, metal) + 夆 (phonetic, féng)
Level-1: 金 夆
Level-2: 人王丷 夂丰
Level-3: 人一土丷 夂三丨
…
Full-stroke: 丿㇏一一丨丶㇀一 ㇀㇇㇏一一一丨

劍 (jiàn) = 僉 (phonetic, qiān) + 刂 (semantic, knife)
Level-1: 僉 刂
Level-2: 亼吅从 刂
Level-3: 人一口口人人 刂
…
Full-stroke: 丿㇏一丨𠃍一丨𠃍一丿㇏丿㇏ 丨亅
Chinese characters often contain two separate parts: one carrying phonetics and one carrying meaning (via radicals). For instance, the two-stroke radical 刂 (tí dāo páng) ordinarily relates to knife; it appears in the character 劍 (jiàn, sword) and in the MWE 鋒利 (fēnglì, sharp). 刂 preserves the meaning of knife because it is a variation of a drawing of a knife evolving from the original bronze inscription. Many other radicals have similar stories.
Experimental Setup
We decompose Chinese character sequences into different levels, from shallow to deep, based on the IDS dictionary from the open-source CHISE Project. Chinese-English NMT uses an attention-based model (Google2017) and the THUMT (Tsinghua Uni MT) toolkit. Parameters: 7-7 layer encoder-decoder, batch size 6250, 32k BPE. Bilingual MWEs are extracted following our earlier work Han et al. (2020LREC), based on MWEtoolkit. [2]
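The level-by-level decomposition described above can be sketched as a repeated table lookup. This is only an illustration, not the released toolkit: the `TOY_IDS` table and the `decompose` function are hypothetical names, the entries are copied from the 鋒 / 劍 example on this poster, and the real CHISE IDS files use a richer format that is not reproduced here.

```python
# Hypothetical toy table: character -> immediate components.
# Entries come from the 鋒 / 劍 example on this poster, NOT from the
# actual CHISE IDS files, whose format differs.
TOY_IDS = {
    "鋒": ["金", "夆"],
    "金": ["人", "王", "丷"],
    "夆": ["夂", "丰"],
    "王": ["一", "土"],
    "丰": ["三", "丨"],
    "劍": ["僉", "刂"],
    "僉": ["亼", "吅", "从"],
    "亼": ["人", "一"],
    "吅": ["口", "口"],
    "从": ["人", "人"],
}


def decompose(text, level, table=TOY_IDS):
    """Expand every symbol in `text` `level` times.

    Level 0 returns the original symbols. Symbols with no table entry
    (digits, spaces, already-atomic components) are kept unchanged.
    """
    symbols = list(text)
    for _ in range(level):
        expanded = []
        for sym in symbols:
            expanded.extend(table.get(sym, [sym]))
        symbols = expanded
    return symbols


if __name__ == "__main__":
    for lvl in range(4):
        print(lvl, "".join(decompose("鋒", lvl)))
```

The same function applies unchanged to a multi-character MWE such as 鋒利: characters missing from the toy table simply pass through, which is how a real pipeline would need to treat symbols absent from its dictionary.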
BLEU & Crowd-source Humans
Direct Assessment human evaluation results show five models with similar performance, forming the first cluster. Decomposition level 2 is significantly behind the other models according to human judges (and also in BLEU scores). n: the number of distinct translations; N: the number of human assessments (including repeat assessments).
Discussion
BLEU has long been criticised as not reflecting real differences between high-performing MT models. Crowd-source human evaluation is not perfect either, with very recent work highlighting that professional translators disagree with crowd-source human rankings of MT systems, based largely on WMT data. We therefore look at detailed translation examples from the different system outputs at 100K learning steps in the following examples; they reflect the advantages of the decomposition models, e.g. rxd3.
Neural MT Examples
1) Chinese MWE 商场 (shāngchǎng) in the first sentence:
- correctly translated as mall by the rxd3 model -vs- translated as shop by the baseline character sequence model;
2) MWE 楼梯间 (lóutījiān) in the second sentence:
- correctly translated as stairwell by the rxd3 model -vs- translated as stairs by the baseline;
3) MWE 近日 (jìnrì), meaning recently, in the second sentence:
- missed entirely by the baseline character sequence model → this results in a misleading, ambiguous translation of an even larger span of content (i.e., did the chef move to San Francisco (SF) recently or this week?);
- correctly translated by the rxd3 model → the overall meaning of the sentence is clear.
src
28 岁 厨师 被 发现 死 于 旧金山 一家 商场
近日 刚 搬 至 旧金山 的 一位 28 岁 厨师 本周 被 发现 死 于 当地 一家 商场 的 楼梯间 。
ref
28 @-@ Year @-@ Old Chef Found Dead at San Francisco Mall
a 28 @-@ year @-@ old chef who had recently moved to San Francisco was found dead in the stairwell of a local mall this week .
rxd3
the 28 @-@ year @-@ old chef was found dead at a San Francisco mall
a 28 @-@ year @-@ old chef who recently moved to San Francisco has been found dead on a stairwell in a local mall this week .
base
the 28 @-@ year @-@ old chef was found dead in a shop in San Francisco
a 28 @-@ year @-@ old chef who has moved to San Francisco this week was found dead on the stairs of a local mall .
base MWE
28 @-@ year @-@ old chef was found dead at a San Francisco mall
a 28 @-@ year @-@ old chef who recently moved to San Francisco was found dead this week at a local mall .
rxd3 MWE
28 @-@ year @-@ old chef was found dead at a San Francisco mall
a 28 @-@ year @-@ old chef recently moved to San Francisco was found dead this week at a local mall .
rxd1
the 28 @-@ year @-@ old chef was found dead at a San Francisco mall
a 28 @-@ year @-@ old chef recently moved to San Francisco was found dead in a local shopping mall this week .
rxd2
the 28 @-@ year @-@ old chef was found dead in a San Francisco mall
a 28 @-@ year @-@ old San Francisco chef was found dead in a local mall this week .
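The three MWE observations made for rxd3 and the baseline can be checked mechanically with a simple token lookup. The sketch below is a rough illustration only and is not the evaluation method used in this work: `OUTPUTS`, `EXPECTED` and `mwe_hits` are hypothetical names, the sentences are copied from the example outputs above, and an exact token match would miss valid paraphrases.

```python
# System outputs copied from the examples above:
# (first sentence, second sentence) per system.
OUTPUTS = {
    "rxd3": (
        "the 28 @-@ year @-@ old chef was found dead at a San Francisco mall",
        "a 28 @-@ year @-@ old chef who recently moved to San Francisco has "
        "been found dead on a stairwell in a local mall this week .",
    ),
    "base": (
        "the 28 @-@ year @-@ old chef was found dead in a shop in San Francisco",
        "a 28 @-@ year @-@ old chef who has moved to San Francisco this week "
        "was found dead on the stairs of a local mall .",
    ),
}

# (source MWE, English rendering discussed above, sentence index)
EXPECTED = [
    ("商场", "mall", 0),
    ("楼梯间", "stairwell", 1),
    ("近日", "recently", 1),
]


def mwe_hits(outputs, expected=EXPECTED):
    """Report, per system, whether each expected rendering occurs as a token."""
    report = {}
    for system, sentences in outputs.items():
        report[system] = {
            mwe: target in sentences[idx].split()
            for mwe, target, idx in expected
        }
    return report


if __name__ == "__main__":
    for system, hits in mwe_hits(OUTPUTS).items():
        print(system, hits)
```

A check of this kind only confirms the specific renderings discussed in the examples; it says nothing about overall translation quality, which is why the poster relies on BLEU and human Direct Assessment.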
[1] Acknowledgments: We thank Yvette Graham for helping with human evaluation, Eoin Brophy for help with Colab, and the anonymous reviewers for their thorough reviews and insightful feedback. The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund. The input of Alan Smeaton is part-funded by SFI under grant number SFI/12/RC/2289 (Insight Centre).
[2] Our character decomposition toolkit and multilingual MWE resources are available at https://github.com/poethan/MWE4MT. This paper is based on our earlier work, ESSLLI2018: https://arxiv.org/abs/1805.01565 and LREC2020: https://www.aclweb.org/anthology/2020.lrec-1.363/. We acknowledge and endorse the following resources: http://www.chise.org/, https://github.com/THUNLP-MT/THUMT, and http://mwetoolkit.sf.net