(PDF: http://tr.im/freennnb Article: http://hdl.handle.net/10045/12025 )
We describe the development of a two-way shallow-transfer machine translation system between Norwegian Nynorsk and Norwegian Bokmål built on the Apertium platform, using the Free and Open Source resources Norsk Ordbank and the Oslo–Bergen Constraint Grammar tagger. We detail the integration of these and other resources in the system along with the construction of the lexical and structural transfer, and evaluate the translation quality in comparison with another system. Finally, some future work is suggested.
Reuse of Free Resources in Machine Translation between Norwegian Nynorsk and Bokmål
1. Reuse of Free Resources in Machine Translation between Nynorsk and Bokmål
Kevin Unhammer (1), Trond Trosterud (2)
(1) Department of Linguistics, University of Bergen, Bergen, Norway, kun041@student.uib.no
(2) Department of Linguistics, University of Tromsø, Tromsø, Norway, trond.trosterud@uit.no
2nd November 2009
2. Outline of talk
Introduction
  Nynorsk and Bokmål
  Norwegian language resources
The Apertium architecture and nn-nb pipeline
  Constraint Grammar
Developing apertium-nn-nb
  Disambiguation and CG conversion
  Translation dictionary
  Structural transfer
Evaluation
  Coverage
  WER and BLEU
Future work
3. The Norwegian language(s)
A lot of dialectal variation
Two written variants:
  Bokmål
    Based on Danish and the Dano-Norwegian koiné of the major cities in the 1800s
  Nynorsk
    Based on the spoken dialects of Norway, standardised by linguist Ivar Aasen in the late 1800s
Nynorsk used by around 12% of the population
“Language-friendly” politics: both standards are officially recognised and both are taught in school from age 12 and up
Both Nynorsk and Bokmål allow quite a lot of variation, with some choices being considered more “radical” or “conservative” than others
4. Free, Open Source Norwegian language resources
Norsk Ordbank
  full form dictionaries for Nynorsk and Bokmål; 106,789 and 142,899 lemmas, respectively
The Oslo–Bergen tagger
  Constraint Grammar morphological disambiguation
  Constraint Grammar syntactic dependency parser
  Various other modules (compounding, NER, ...)
No freely available bilingual dictionary between Nynorsk and Bokmål, until now...
5. The apertium-nn-nb pipeline
Morphological analysis
  lttoolbox: XML format, compiles to very fast FSTs
  one XML dictionary gives both analysis and generation
CG pre-disambiguation
Statistical disambiguation (HMM)
Bilingual dictionary for lexical transfer
Shallow syntactic transfer rules
  Local re-ordering (det noun → noun det)
  Insertions, deletions and substitutions of lexical units (and chunks, but we don’t use them yet)
Morphological generation (again with lttoolbox)
6. Constraint Grammar
Rules work on ambiguous input and may SELECT one analysis over all others, or REMOVE one analysis from the set of analyses, or ADD a new tag, etc.
Often thousands of short, hand-written rules
Rules apply based on “context conditions”:
  (-1* noun) means “there must be a word with a noun analysis somewhere to the left”
  (1C* verb) means “there must be a word disambiguated to a verb somewhere to the right”
  (1* verb LINK 2 noun) means “there must be a verb-analysis to the right, and a noun-analysis two positions to the right of that”
  (1* verb BARRIER noun) means “there must be a verb-analysis to the right, and no noun-analyses before that”
There are many other possibilities...
7. Example of a CG rule
If the input contains the word ‘walks’ analysed as either verb 3sg present or noun pl, the following rule

    SELECT (verb 3sg present) IF
        (-1*C 3sg BARRIER verb)
        (NOT -1 det);

would choose the verb analysis if there is a disambiguated word, analysed as third singular, somewhere to the left, with no verb between the two; and there is no determiner immediately to the left
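The behaviour of the example rule can be approximated in a few lines of Python. This is only a minimal sketch under stated assumptions, not how a CG compiler works: the `Cohort` class, the function `select_verb_3sg` and the tag strings are hypothetical names, "disambiguated" (the C in the condition) is taken to mean that every remaining reading of a word carries the tag, and the leftward scan checks the target before the barrier.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    """A word form with its candidate readings (each reading is a set of tags)."""
    form: str
    readings: list


def has_tag(cohort, tag):
    """True if at least one reading carries the tag."""
    return any(tag in reading for reading in cohort.readings)


def all_have_tag(cohort, tag):
    """True if every reading carries the tag (the assumed meaning of 'C')."""
    return all(tag in reading for reading in cohort.readings)


def select_verb_3sg(sentence, i):
    """Apply the slide's SELECT rule to the cohort at position i, in place."""
    target = {"verb", "3sg", "present"}
    cohort = sentence[i]
    if not any(target <= reading for reading in cohort.readings):
        return  # nothing to select
    # (NOT -1 det): the word immediately to the left must not be a determiner.
    if i > 0 and has_tag(sentence[i - 1], "det"):
        return
    # (-1*C 3sg BARRIER verb): scan leftwards for an unambiguous 3sg word,
    # giving up as soon as a possible verb is passed.
    for j in range(i - 1, -1, -1):
        if all_have_tag(sentence[j], "3sg"):
            cohort.readings = [r for r in cohort.readings if target <= r]
            return
        if has_tag(sentence[j], "verb"):
            return


# Toy input: "she walks", with 'walks' ambiguous between verb and noun.
sentence = [
    Cohort("she", [{"pron", "3sg"}]),
    Cohort("walks", [{"verb", "3sg", "present"}, {"noun", "pl"}]),
]
select_verb_3sg(sentence, 1)
print(len(sentence[1].readings))
```

With "the walks" as input instead, the determiner blocks the rule and both readings of 'walks' are kept.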
8. Development of apertium-nn-nb
Most of the work done within 12 weeks (Google Summer of Code 2009)
Helped by high quality free resources
  Monolingual dictionaries: Norsk Ordbank converted from full form listing to lttoolbox format
  CG: Oslo–Bergen tagger converted to use Apertium tag scheme
9. Disambiguation and CG conversion
Bigram HMMs trained on Wikipedia text (Baum-Welch, 8 iterations)
Conversion of CG tag set mostly done within a few days
Errors fixed in CG reported back to Oslo–Bergen tagger team, win-win.
However: the Oslo–Bergen tagger was designed for corpus annotation and lexicography
  For the linguist, recall is more important than precision
  For (our) MT, only one analysis matters
  So we need to take more chances with our rules
Also, we get some MT-specific rules (like CG-based lexical selection)
11. Finding word translations semi-automatically
Method 1: Exact matches where the morphology is the same
  If lemma and morphological possibilities are the same, assume we have a translation
  ‘snøvle’, verb, pres/pass/imp/pret/inf... exists in both monolingual dictionaries; add it as a translation
  36,000 entries (although quite a lot are low-frequency / loan-words)
  Risk of “radical forms”
Method 2: Predictable substring-translations
  find Bokmål entries without translations
  run string replacements for typical differences (-hjem-→-heim-, -lig→-leg, ...)
  check if the altered entries are in the Nynorsk analyser
  ... and vice versa
  Main run gave 2500 good entries
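Methods 1 and 2 can be sketched in Python as follows. This is an illustration under assumptions, not the scripts used for apertium-nn-nb: the lexicon layout (lemma mapped to part of speech plus a set of forms), the function names and the toy entries are all hypothetical, and only the two replacements named on the slide are included.

```python
import re

# The two replacements named on the slide; a real run would use many more.
SUBSTRING_RULES = [
    (re.compile(r"hjem"), "heim"),  # -hjem- -> -heim- (anywhere in the word)
    (re.compile(r"lig$"), "leg"),   # -lig -> -leg (word-final only)
]


def exact_matches(nb_lexicon, nn_lexicon):
    """Method 1: lemmas with identical morphology in both lexicons."""
    return {
        lemma
        for lemma, morphology in nb_lexicon.items()
        if nn_lexicon.get(lemma) == morphology
    }


def substring_matches(nb_lexicon, nn_lexicon, already_translated):
    """Method 2: rewrite untranslated Bokmål lemmas and keep the result
    only if the rewritten form exists in the Nynorsk lexicon."""
    pairs = {}
    for lemma in nb_lexicon:
        if lemma in already_translated:
            continue
        candidate = lemma
        for pattern, replacement in SUBSTRING_RULES:
            candidate = pattern.sub(replacement, candidate)
        if candidate != lemma and candidate in nn_lexicon:
            pairs[lemma] = candidate
    return pairs


# Toy lexicons (invented morphology, for illustration only).
VERB = ("verb", frozenset({"inf", "pres", "pret", "imp", "pass"}))
NOUN = ("noun", frozenset({"sg", "pl"}))
ADJ = ("adj", frozenset({"pos"}))
nb = {"snøvle": VERB, "hjemland": NOUN, "vanlig": ADJ, "kirke": NOUN}
nn = {"snøvle": VERB, "heimland": NOUN, "vanleg": ADJ, "kyrkje": NOUN}

same = exact_matches(nb, nn)
predicted = substring_matches(nb, nn, same)
print(sorted(same), sorted(predicted.items()))
```

In this toy data 'kirke' stays untranslated, since neither replacement yields 'kyrkje'; such pairs are left to the other methods. Running the same function with the lexicons swapped and reversed rules would give the "vice versa" direction.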
13. Expanding the translational dictionary using alignments
Method 3: Automatic word alignments
  Corpora:
    KDE4 software translations (400,000 words)
    government web pages (50,000 words, crawled with bitextor)
  po-terminology (only on KDE4)
    gave some hundreds of new terms
  morphological tagging → Giza++ → ReTraTos
    about 3500 entries
  Lots of cleaning needed
Method 4: User-contributed entries (via Wikipedia)
14. Structural transfer
Finite passive verbs
(1) a. Bevilgning gis oftest ikke
       grant.IND give.PRES.PASS usually not
    b. Løyve blir oftast ikkje gjeve
       grant.IND AUX usually not give.PART
       ‘Grants are usually not given’
    c. Om høsten fylles fjorden med sild
       In fall.DEF fill.PRES.PASS fjord.DEF with herring
    d. Om hausten blir fjorden fylt med sild
       In fall.DEF AUX fjord.DEF fill.PART with herring
       ‘In fall, the fjord is filled with herring’
15. Structural transfer
Genitive noun phrases
(2) a. forfatterens siste utgivelse
       author.DEF.GEN last publication.IND
    b. den siste utgjevinga til forfattaren
       the last publication.DEF of author.DEF
       ‘the author’s last publication’
    c. mitt nye luftputefartøy
       my new hovercraft.IND
    d. det nye luftputefartøyet mitt
       the new hovercraft.DEF mine
       ‘my new hovercraft’
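The reordering in (2a) to (2b) can be pictured as an operation on tagged lexical units. The Python below is a hypothetical sketch of that idea only, not Apertium's transfer rule format: the `(lemma, tags)` representation, the tag names and the function `transfer_genitive` are assumptions, and lexical transfer (utgivelse → utgjeving) and agreement of the inserted determiner are left out.

```python
def transfer_genitive(units):
    """Rewrite [owner.GEN, modifiers..., head noun] as
    [determiner, modifiers..., head.DEF, 'til', owner]."""
    if len(units) < 2:
        return units
    owner_lemma, owner_tags = units[0]
    head_lemma, head_tags = units[-1]
    if "gen" not in owner_tags or "n" not in head_tags:
        return units  # pattern does not match; leave the phrase alone
    modifiers = units[1:-1]
    determiner = ("den", frozenset({"det"}))
    # The head noun becomes definite in the periphrastic construction.
    head = (head_lemma, (head_tags - {"ind"}) | {"def"})
    preposition = ("til", frozenset({"pr"}))
    # The owner keeps its definiteness but loses the genitive marking.
    owner = (owner_lemma, owner_tags - {"gen"})
    return [determiner, *modifiers, head, preposition, owner]


phrase = [
    ("forfatter", frozenset({"n", "def", "gen"})),
    ("sist", frozenset({"adj"})),
    ("utgivelse", frozenset({"n", "ind"})),
]
for lemma, tags in transfer_genitive(phrase):
    print(lemma, sorted(tags))
```

The output keeps the source lemmas in the new order; in a full system the bilingual dictionary and the morphological generator would then produce the surface forms.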
16. Evaluation
Coverage
WER
BLEU
17. Coverage
Naïve coverage on Nynorsk Wikipedia: 89.6%
Naïve coverage on Bokmål Wikipedia: 88.2%
Coverage seems to be the most important issue:
  Not only is every 10th word untranslated, but we get disambiguation problems and transfer problems in the rest of the sentence
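Naïve coverage is usually understood as the share of tokens that receive at least one analysis, whether or not that analysis is the right one. A minimal sketch of such a count, assuming analyser output in the Apertium stream format where unknown words are marked with a leading `*`; the function name, the regular expression and the sample line (including its tags) are illustrative assumptions, and escaped characters in the stream are not handled.

```python
import re

# One lexical unit: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed_text):
    """Fraction of lexical units with at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = sum(1 for _, analyses in units if not analyses.startswith("*"))
    return known / len(units)


# Made-up sample: three known tokens and one unknown word.
sample = "^Eg/eg<prn><pers>$ ^likar/like<vblex><pres>$ ^zzyzx/*zzyzx$ ^./.<sent>$"
print(naive_coverage(sample))
```

Whether punctuation and numerals are counted as tokens changes the figure, so published coverage numbers are only comparable when the counting convention is the same.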
18. WER and BLEU scores in the nb→nn direction
Word Error Rate, BLEU and Unknown Word Rate on text from government web pages

            BLEU   WER_O         WER_W         UWR
Apertium    0.74   32.5 (36.1)   17.7 (50.5)   9.5
Nyno        0.85   29.1 (34.6)   13.3 (47.3)   0.8

Table: BLEU score (two reference translations) and WER (for the Original and Wikipedia references). Numbers in parentheses give percentage of unknown words which were free-rides.

WER on post-edited Apertium MT output on a Wikipedia article, however, was 10.71% (64.93% free-rides)
Coverage seems like the major difference.
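For reference, Word Error Rate is commonly computed as the word-level edit distance between the MT output and a reference, divided by the reference length. The sketch below shows that standard definition; it is not the evaluation script behind the figures above, the function name is an assumption, and the example sentences are a toy pair.

```python
def word_error_rate(hypothesis, reference):
    """Word-level Levenshtein distance divided by reference length."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edits needed to turn the first j hypothesis words
    # into the reference words seen so far.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,                           # insert a reference word
                current[j - 1] + 1,                        # delete a hypothesis word
                previous[j - 1] + (ref_word != hyp_word),  # substitute or match
            ))
        previous = current
    return previous[-1] / len(ref)


# Toy pair: one substitution and one missing word against a 7-word reference.
print(word_error_rate("Han anbefalte meg å gå heim",
                      "Han rådte meg til å gå heim"))
```

Since the score is normalised by reference length, the same number of edits weighs more heavily on short sentences.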
21. Future work
Compounding
(3) a. bilkirkegård → bilkyrkjegard
       car.cemetery → car.cemetery
    b. postordrelager → #postordrelagar
       mail.order.storage → mail.order.creator
Multi-word expressions
(4) a. Han anbefalte meg å gå hjem
       he recommended me INF go home
    b. Han rådte meg til å gå heim
       he counseled me to INF go home
       ‘He recommended that I go home’
Expanding the Scandinavian language group
22. Thanks for listening!
23. Licences
This presentation may be distributed under the terms of the GNU GPL, GNU FDL and CC-BY-SA licences.
  GNU GPL v. 3.0
    http://www.gnu.org/licenses/gpl.html
  GNU FDL v. 1.2
    http://www.gnu.org/licenses/gfdl.html
  CC-BY-SA v. 3.0
    http://creativecommons.org/licenses/by-sa/3.0/