Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
technology
from seed
L2F - Spoken Language Systems Laboratory
Cross-Language Alignments:
Challenges, Guidelines and Gold Sets
Anabela Barreiro Luísa Coheur Tiago Luís
Ângela Costa Fernando Batista João Graça
Outline – Part 1
• Word alignment
• Basic concepts
• Applications
• State of the art
• Limitations
• Paraphrase alignment
• Multiword, meaning and translation unit alignment: importance
• Our task
• Alignment tool: CLUE-Aligner
Outline – Part 2
• General annotation guidelines
• Cross-linguistic major challenges to word alignment
• Annotation guidelines for multiword units and lexical and non-lexical
realization phenomena
• Pro-dropping
• Articles and zero articles
• Examples: continuous multiword units
• Examples: continuous and discontinuous support verb constructions
• Preposition-dependency (V, N and Adj)
• Active vs passive
• Choice of noun pre-modifiers
• Different PoS with same semantics (V vs process N)
• Noun adjuncts
• Coordination
• Anaphora: choice of co-referents
• Impersonal constructions
• Contractions
• Style: idiomatic vs non-idiomatic
• Antonyms and negation constructions
• Romance languages double negation
• Singular vs plural
• Flexible/loose paraphrasing constructions
• Idiosyncrasies of each language
Outline – Part 3
• Our contribution
• Annotation process
• Preliminary results
• Discussion
• Future work
Word Alignment: Basic Concepts
• Objects representing the mapping of words (or expressions)
that are semantically equivalent in a source and a target
sentence of a parallel corpus [Brown et al., 1990]
– Matrix of N * M entries for a source sentence of N words and a target
sentence of M words. An entry a_{n,m} in that matrix specifies whether the
word at source position n is part of a translation of the word at target
position m
• Task of word alignment - identifying translational equivalences
(= semantic correspondences) in the aligned sentence pairs of
a parallel text [Hearne & Way, 2011]
• Translational equivalences - graphically represented in a grid
by the intersection of single segments (individual words) or
blocks (semantico-syntactic units, phrases, expressions)
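The matrix view described above can be sketched in a few lines of Python. This is only an illustration: the function name `alignment_matrix` and the toy English-Portuguese pair are assumptions made for the example, not material from the gold sets.

```python
from typing import List, Set, Tuple


def alignment_matrix(n_source: int, n_target: int,
                     links: Set[Tuple[int, int]]) -> List[List[int]]:
    """Build an n_source x n_target 0/1 matrix.

    Entry [n][m] is 1 when the word at source position n is linked to
    the word at target position m, and 0 otherwise.
    """
    matrix = [[0] * n_target for _ in range(n_source)]
    for n, m in links:
        matrix[n][m] = 1
    return matrix


# Hypothetical toy sentence pair (not taken from the annotated corpus).
source = ["the", "system", "works"]
target = ["o", "sistema", "funciona"]
links = {(0, 0), (1, 1), (2, 2)}

matrix = alignment_matrix(len(source), len(target), links)
for word, row in zip(source, matrix):
    print(f"{word:>8} {row}")
```

A block alignment would show up in the same grid as a rectangle of filled cells rather than a single cell.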
Word Alignment: Basic Concepts
• Sure alignment (S-alignment)
– Unambiguous and valid in all contexts
• EN system
• ES sistema
• FR système
• PT sistema
• Possible alignment (P-alignment)
– Ambiguous and invalid in some contexts
• EN be
• ES ser/estar/haber/existir
• FR être/avoir/exister
• PT ser/estar/haver/existir
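One way to store the two link types is to attach the label to each link. The sketch below uses the English-Portuguese examples from this slide; the `Link` type and the `by_kind` helper are hypothetical names chosen for illustration, and the labels are entered by hand, as an annotator would assign them.

```python
from typing import List, NamedTuple


class Link(NamedTuple):
    source: str
    target: str
    kind: str  # "S" = sure alignment, "P" = possible alignment


# EN system / PT sistema is sure; EN be has several possible PT equivalents.
links: List[Link] = [Link("system", "sistema", "S")] + [
    Link("be", pt, "P") for pt in ("ser", "estar", "haver", "existir")
]


def by_kind(all_links: List[Link], kind: str) -> List[Link]:
    """Return only the links carrying the given label."""
    return [link for link in all_links if link.kind == kind]


sure = by_kind(links, "S")
possible = by_kind(links, "P")
print(len(sure), "sure link(s),", len(possible), "possible link(s)")
```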
Word Alignment: Applications
• Statistical machine translation
– [Brown et al., 1990] – statistical machine translation
– [Och and Ney, 2004] – phrase-based machine translation
– [Galley et al., 2004] – syntax-based machine translation
• Annotation projection
• Extraction of bilingual lexica
• Evaluation of machine translation systems
Word Alignment: State of the Art
• Workshops and evaluation tasks (multi-language)
– http://www.cse.unt.edu/~rada/wp/
– http://www.statmt.org/wpt05
– http://www.lpl.univ-aix.fr/projects/arcade
• Projects
– Blinker project – French-English
http://nlp.cs.nyu.edu/blinker/
• Guidelines
[Melamed, 1998] [Och and Ney, 2000]
[Lambert et al., 2005] [Kruijff-Korbayová et al., 2006]
[Graça et al., 2004]
Word Alignment: Limitations
• Language does not operate on a word-for-word basis
• A large number of words cannot be dissociated from one another
– Multiword units (MWU)
• [Gross and Senellart, 1998] – more than 40% of one year of Le Monde text consists of MWUs
• [Sag et al., 2002] – 50-70% of specialized lexica are MWUs
• [Ramisch et al., 2010] – 56.7% of terms in the Genia corpus have 2+
words (not including general-purpose MWUs, e.g., generic compounds,
lexical bundles, phrasal verbs, fixed expressions, which also occur in
domain-specific texts)
– Translation units
– Meaning units
– Paraphrases
• Segment and block alignment (sure and possible)
Example: Segment and Block
Alignment (Sure and Possible)
Paraphrase Alignment
• Monolingual
– [Callison-Burch et al., 2006]
• Annotation guidelines for paraphrase alignment
• Paraphrases - sentences that convey the same meaning but are
worded differently
• Alignment of words, phrases, expressions, within the same language
• Bilingual = (non-literal) translation
– Need to account for paraphrases across languages
Multiword, Meaning and Translation
Unit Alignment: Importance
• Publicly available manual word alignments are restricted
to a few language pairs
• Manual word alignments are a desired resource
– Evaluation of word alignment algorithms
– Training of supervised and semi-supervised algorithms
– Tuning of parameters for different types of model
• But, “name”, “concept” and “techniques” of alignment need
to be linguistically sophisticated to be more useful and
help provide improved machine translation!
Our Task
• EuroParl corpus [Koehn, 2005]
• 6 gold alignment sets
– 400 sentence alignments per set (400 x 6 = 2,400)
• Languages: English, French, Portuguese and Spanish
– Language pairs: [en-es], [en-fr], [en-pt], [es-fr], [pt-es], [pt-fr]
• Guidelines for multi-language manual word annotations
(with inter-annotator agreement)
• Linguistically-informed (and linguistically-motivated) cross-
language multiword unit and paraphrase alignment
(translation unit alignment)
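The count of six gold sets follows from taking every unordered pair of the four languages. A minimal sketch of that arithmetic, assuming 400 sentence pairs per set as stated on this slide (variable names are illustrative; the order of the two codes within a pair may differ from the slide, e.g. es-pt here versus pt-es above):

```python
from itertools import combinations

languages = ["en", "es", "fr", "pt"]
sentences_per_set = 400  # figure given on the slide

# Every unordered pair of distinct languages.
pairs = list(combinations(languages, 2))
total_sentence_alignments = len(pairs) * sentences_per_set

for a, b in pairs:
    print(f"[{a}-{b}]")
print(len(pairs), "pairs,", total_sentence_alignments, "sentence alignments")
```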
CLUE-Aligner Alignment Tool
CLUE-Aligner =
Cross-Language Unit Elicitation Aligner
• Helps reduce ambiguity in the alignment process
• Facilitates the alignment of translation units
Major Challenges (4 different classes)
• semantico-discursive
– emphatic linguistic constructions
• tautology
• pleonasm and repetition
• focus constructions
• lexical and semantico-syntactic
– multiword units
– compound verbs
– prepositional predicates
Major Challenges (4 different classes)
• morphological
– contracted forms
– lexical versus non-lexical realization
• articles and zero articles
• pro-dropping
– subject pronoun drop
– empty relative pronoun
• morpho-syntactic
– free noun adjuncts
General Annotation Guidelines
Linguistic phenomenon No alignment P-alignment
Incomplete or non-translation X
Incorrect translation and typo X*
Approximate correspondence (numeric) X
Non-obligatory linguistic structure
– Pleonasm X
– Repetition of words or expressions X
– Redundancy or additional/extra information X
Mismatching pronoun, determiner, verbs, etc. X
Abbreviations versus full word X
Punctuation mark
– Different but correct X
– Incorrect / mismatch X
– Missing X
* If a multiword unit is incorrectly translated or contains a typo, none of its internal segments are aligned
Annotation Guidelines
Linguistic phenomenon No alignment Block-alignment (S-align, P-align)
Multiword unit
– continuous X X
– discontinuous X*
Lexical versus non-lexical realization
– article + N versus zero-article + N (Ø people = PT as pessoas) X
– pro-drop + V versus pronoun + V (I went = PT Ø fui) X
– empty relative pronoun versus realized relative pronoun (N that I met = N I met = PT que (eu) conheci) X
– relative versus participial adjective (that was written = written = PT escrito) X
* Some discontinuous multiword units are candidates for block-alignment (e.g., when the number of inserts is small or the multiword unit
is “semi-frozen”)
Annotation Guidelines
Continuous multiword units Block-S-alignment Block-P-alignment
Support verb construction X X
Compound X X
Phrasal verb X X
Named entity X X
Date and time expression X
Lexical bundle X
Idiomatic expression X
Domain term X
French negation (ne pas) X
English infinitive (to + V) X X
[Barreiro, 2008] presents a detailed description and examples of the different types of multiword unit
Example: Continuous Support Verb
Constructions (alignment)
ES aprueba plenamente
FR approuve pleinement
Example: Discontinuous Support Verb
Constructions (no alignment)
ES para que acelere la directiva sobre pensiones complementarias
FR pour faire avancer la directive sur les pensions complémentaires
Cross-Linguistic Challenges
• Prepositional predicates
EN I too should like to congratulate [NE] on his excellent report
ES también yo quisiera felicitar a mi colega [NE] por su excelente informe
FR je voudrais féliciter moi aussi mon collègue [NE] pour son excellent
rapport
PT também eu gostaria de felicitar o meu colega [NE] pelo seu excelente
relatório
EN […] our Asian partners prefer to deal with questions which unite us
ES […] nuestros socios asiáticos prefieren dedicarse a las cuestiones que
nos unen
FR […] nos partenaires asiatiques préfèrent s’attacher à ce qui nous unit
PT […] os nossos parceiros asiáticos preferem centrar-se unicamente nas
questões comuns
Segment S-alignment
Impossible to annotate discontinuous preposition-dependency
Block P-alignment
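The difficulty with discontinuous preposition-dependency (as in "congratulate [NE] on") can be illustrated with a small lookup that pairs a verb with its governed preposition across intervening tokens. This is a rough sketch under stated assumptions: the lexicon `PREP_VERBS`, the function `find_prep_dependencies` and the `max_gap` window are all hypothetical, matching is on surface forms only (no lemmatization), and nothing here describes how CLUE-Aligner works.

```python
from typing import Dict, List, Set, Tuple

# Tiny illustrative lexicon (assumption): verb -> prepositions it governs.
PREP_VERBS: Dict[str, Set[str]] = {
    "congratulate": {"on"},
    "deal": {"with"},
    "refer": {"to"},
}


def find_prep_dependencies(tokens: List[str],
                           max_gap: int = 6) -> List[Tuple[int, int]]:
    """Return (verb_index, preposition_index) pairs.

    A pair is reported when a listed verb is followed, within max_gap
    tokens, by one of its prepositions; only the nearest one is kept.
    """
    pairs = []
    for i, tok in enumerate(tokens):
        preps = PREP_VERBS.get(tok.lower())
        if not preps:
            continue
        for j in range(i + 1, min(len(tokens), i + 1 + max_gap)):
            if tokens[j].lower() in preps:
                pairs.append((i, j))
                break
    return pairs


tokens = "I too should like to congratulate [NE] on his excellent report".split()
pairs = find_prep_dependencies(tokens)
for verb_i, prep_i in pairs:
    gap = prep_i - verb_i - 1
    print(tokens[verb_i], "...", tokens[prep_i], f"(tokens in between: {gap})")
```

A gap of zero corresponds to a continuous unit that can be block-aligned directly; a non-zero gap is the discontinuous case that the grid cannot mark as a single block.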
Cross-Linguistic Challenges
• Prepositional verbs
agree with; belong to; forgive s/o for; pay for; stand for
aim at/for; choose between; hope for; prepare for; thank s/o for
allow for; comment on; insist on; prevent s/o from; think of/about
apologise for; compare with; interfere with/in; provide s/o with; volunteer to
apply for; complain about; joke about; refer to; wait for
approve of; concentrate on; laugh at; rely on; warn s/o about
argue with/about; congratulate on; lend s/th to s/o; run for; worry about
ask for; consist of; listen to; smile at
attend to; deal with; long for; succeed in
believe in; decide on; object to; suffer from
Cross-Linguistic Challenges
• Prepositional nouns
attack on; attitude towards; in agreement; on strike
cruelty towards; comparison between; on average; in trouble
difficulty in/with; decrease in; on condition; on behalf of
knowledge of; disadvantage of; delay in; connection between
reason for; increase in; in doubt; difference between/of
rise in; preference for; information about; under guarantee
solution to; reduction in; need for; in power
use of; at risk; protection from; reaction to
in a hurry; at stake; report on; result of
in practice; in theory; room for; trouble with
Cross-Linguistic Challenges
• Prepositional adjectives
delighted at/about; frightened of; opposed to; similar to
different from; friendly with; pleased with; sorry for/about
dissatisfied with; good at; popular with; suspicious of
doubtful about; guilty of; proud of; sympathetic to(wards)
enthusiastic about; incapable of; puzzled by/about; tired of
envious of; interested in; safe from; typical of
excited about; jealous of; satisfied with; unaware of
famous for; keen on; sensitive to(wards); used to
fed up with; kind to; serious about
fond of; mad at/about; sick of
Cross-Linguistic Challenges
• Noun Adjuncts
– Compounds
• European investment bank = banco europeu de investimento
[Adj N N] = [N Adj [de N]]
– Free noun phrases (not compounds)
• presidency communication = comunicação da presidência
[N N] = [N [de N]]
Block S-alignment
Segment S-alignment
Block-P-alignment
of [de N]
Cross-Linguistic Challenges
• Contractions
– two or more words with different parts-of-speech overlap, which
makes syntactic analysis and generation difficult
– in cross-language analysis, the contrast between languages that
have contractions and languages that do not have them, or do not
have them in the same contexts, presents additional difficulties
– The alignment of one segment that corresponds to a contracted form
in one language with the corresponding segments where elements
are not contracted in the other language of the parallel pair is
pragmatically motivated
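One pragmatic way to handle a contracted form is to expand it into its component words before alignment, so that each part can be linked separately. The sketch below does this for a handful of Portuguese preposition + article contractions. The table `PT_CONTRACTIONS` is a deliberately small, illustrative subset and the function name is hypothetical; forms that are ambiguous in Portuguese (for example "nos", which can also be a pronoun) are left out on purpose.

```python
from typing import Dict, List, Tuple

# Illustrative subset (assumption): contraction -> (preposition, article).
PT_CONTRACTIONS: Dict[str, Tuple[str, str]] = {
    "ao": ("a", "o"),
    "da": ("de", "a"),
    "do": ("de", "o"),
    "das": ("de", "as"),
    "nas": ("em", "as"),
    "pelo": ("por", "o"),
    "pelos": ("por", "os"),
}


def split_contractions(tokens: List[str]) -> List[str]:
    """Replace each known contraction by its two component words.

    Unknown tokens are copied unchanged; expanded parts are lower-case.
    """
    expanded: List[str] = []
    for tok in tokens:
        parts = PT_CONTRACTIONS.get(tok.lower())
        if parts:
            expanded.extend(parts)
        else:
            expanded.append(tok)
    return expanded


print(split_contractions("apesar das dificuldades".split()))
print(split_contractions("pelo seu excelente relatório".split()))
```

With this expansion "apesar das dificuldades" becomes "apesar de as dificuldades", which gives "de" and "as" their own cells in the alignment grid.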
Example: Contractions (block-P-alignment)
Interference with the support verb construction
EN to make a reference to
PT fazer uma referência a
Example: Contractions (block-P-alignment)
Interference with the support verb construction
ES hacer una referencia a
FR faire référence à
Cross-Linguistic Challenges
• Singular versus plural (related to determiner)
EN in every official language of the union
ES en todos los idiomas oficiales de la unión
FR dans toutes les langues officielles de l'union
PT em cada uma das línguas oficiais da união
• Active versus passive
EN before new member states are admitted
ES antes de la incorporación de nuevos miembros
FR avant l'admission de nouveaux membres
PT antes da entrada de novos membros
Block or segment P-alignment
Block-S-alignment if there is some fixedness (such as in this case)
Block P-alignment
Cross-Linguistic Challenges
• Coordination
EN which we will send to the council and Ø parliament
ES que enviaremos al consejo y al parlamento
FR qui sera envoyée au conseil et au parlement
PT que remeterá ao conselho e ao parlamento
• Style: idiomatic versus non-idiomatic
EN which began four years ago
ES que empezó hace cuatro años
FR qui a vu le jour il y a quatre ans
PT que se iniciou há quatro anos
No alignment
Block P-alignment
Cross-Linguistic Challenges
• Choice of noun pre-modifiers
EN we should use that public funding for those types of project which are
most difficult to finance through the private sector
ES deberíamos utilizar esa financiación pública para aquel tipo de proyectos
que tienen mayor dificultad para ser financiados por el sector privado
FR nous devrions recourir au financement public pour les projets que le
secteur privé boude
PT o financiamento público deveria ser utilizado para os projectos que
registam maiores dificuldades em serem financiados pelo sector privado
Block P-alignment
EN despite certain difficulties
PT apesar das dificuldades
Cross-Linguistic Challenges
• Anaphora - choice of co-referents (noun versus pronoun)
EN it is not acceptable that we assisted Korea during the Asean crisis by
means of IMF loans and suchlike, only for Korea still to be subsidising its
shipyards
ES no resulta procedente que hayamos ayudado a Corea en la crisis de la
Asean a través de préstamos del FMI, etc. y que Corea siga
subvencionando sus astilleros
FR il n’est pas acceptable que nous ayons aidé la Corée dans la crise de
l’Anase, avec des prêts du FMI, etc. et qu’elle continue à subventionner
ses chantiers navals
PT é inadmissível que, depois de termos ajudado a Coreia, através de
créditos do FMI, etc., na crise da Asean, este país continue a
subvencionar agora os seus estaleiros navais
Segment or block P-alignment
Cross-Linguistic Challenges
• Antonyms and negation constructions
EN the countries of Asia have not unfortunately been in favour of that
proposal
ES los países de Asia desgraciadamente no han sido favorables a dicha
propuesta
FR les pays d'Asie ont malheureusement rejeté cette proposition
PT os países da Ásia, infelizmente, não se mostraram favoráveis a esta
proposta
Block S-alignment together with adverb (insert in EN and FR)
Cross-Linguistic Challenges
• Flexible/loose paraphrasing constructions
EN and we shall vote against it
ES y merece nuestra condena
FR et dénonçons
PT e merece a nossa condenação
EN 1993 was a significant year
ES el año 1993 es una fecha notable
FR l’année 1993 est à marquer d’une pierre blanche
PT 1993 é uma data charneira
Block P-alignment
Cross-Linguistic Challenges
• Different parts-of-speech with same semantics (verbs versus
process nouns)
EN we must use all the financial instruments at our disposal to rapidly
develop the market
ES es preciso utilizar todos los instrumentos financieros disponibles para un
rápido desarrollo ulterior del mercado
FR il faut utiliser tous les instruments financiers disponibles pour
développer rapidement le marché
PT todos os instrumentos financeiros disponíveis deverão ser aplicados
para continuar a desenvolver rapidamente o mercado
Block S-alignment (with internal segment P-alignments)
EN and PT :
Segment S-alignment
No alignment of [continuar a]
Cross-Linguistic Challenges
• Impersonal constructions
(+ “impersonal” relative versus participial adjective)
EN we must fully support the demands that have been made
ES hay que apoyar plenamente las exigencias que se han formulado
FR il faut par conséquent appuyer les requêtes formulées
PT as reivindicações formuladas deverão ser plenamente apoiadas
Block P-alignment
Internal P-alignment
EN we must
ES hay que
FR il faut
Internal segment S-alignment - adverb and verb (EN, ES, FR)
Internal segment P-alignment - verb (PT)
Cross-Linguistic Challenges
• Romance languages double negation (+ coordination)
EN it is not, therefore, surprising that there is, in this context, no real
integration or genuine political dialogue
ES no es nada sorprendente, entonces, que en ese contexto, no haya ni
verdadera integración ni verdadero diálogo político
FR rien d’étonnant donc, qu'il n'y ait dans ce contexte, ni intégration
véritable, ni dialogue politique véritable
PT assim, não é de espantar que, nesse contexto, não exista verdadeira
integração nem verdadeiro diálogo político
Block P-alignment of the relative existential with adverbial (insert)
EN that there is, in this context, no
ES que en ese contexto, no haya
FR qu’il n’y ait dans ce contexte
PT que, nesse contexto, não exista
Segment P-alignment of negation
and negation connector
EN no – or
ES ni – ni
FR n’ – ni
PT Ø - nem
Cross-Linguistic Challenges
• Idiosyncrasies of languages
• Portuguese inflected infinitive (peculiar verb tense)
• English to+Infinitive
• French negation
• English apostrophe
• …
• Sociolinguistic differences
Our Contribution
• Tool CLUE-Aligner
• Annotated corpora
• Cross-language resources – gold collection
Publicly available on the META-NET website:
http://metanet4u.l2f.inesc-id.pt/
• Guidelines
– http://www.inesc-id.pt/ficheiros/publicacoes/8204.pdf
Annotation Process
• Annotation of 400 x 6 (2,400 sentence alignments) by a
linguist
• Alignment of a subset by a second linguist (25
sentences of the English-Portuguese language pair)
• Inter-annotator agreement
Preliminary Results
language   words   avg. words
en         11158   27.9
es         11664   29.2
fr         12464   31.2
pt         11649   29.1

pair    Sure   Possible   Total
en-pt   6684   418        7102
en-fr   7025   569        7594
en-es   7636   399        8035
es-fr   7477   767        8244
pt-es   7958   557        8515
pt-fr   7029   782        7811

pair    Sure   Possible   Total
en-pt   2588   602        3190
en-fr   3865   414        4279
en-es   3551   351        3902
es-fr   3516   495        4011
pt-es   3162   382        3544
pt-fr   3253   698        3951

Block (MWU) alignment
Segment (word) alignment
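The figures on this slide can be cross-checked with simple arithmetic: each average is the word count divided by the 400 sentences per set, and each Total is Sure plus Possible. The sketch below performs those checks; the names `table_a` and `table_b` are neutral labels for the first and second pair tables and do not assert which one holds the segment counts and which the block counts.

```python
SENTENCES_PER_SET = 400  # from the "Our Task" slide

words = {"en": 11158, "es": 11664, "fr": 12464, "pt": 11649}
reported_avg = {"en": 27.9, "es": 29.2, "fr": 31.2, "pt": 29.1}

# (sure, possible, total) per language pair, copied from the slide.
table_a = {
    "en-pt": (6684, 418, 7102), "en-fr": (7025, 569, 7594),
    "en-es": (7636, 399, 8035), "es-fr": (7477, 767, 8244),
    "pt-es": (7958, 557, 8515), "pt-fr": (7029, 782, 7811),
}
table_b = {
    "en-pt": (2588, 602, 3190), "en-fr": (3865, 414, 4279),
    "en-es": (3551, 351, 3902), "es-fr": (3516, 495, 4011),
    "pt-es": (3162, 382, 3544), "pt-fr": (3253, 698, 3951),
}


def averages_match(tolerance: float = 0.05) -> bool:
    """True when words / 400 agrees with the reported one-decimal average."""
    return all(
        abs(words[lang] / SENTENCES_PER_SET - reported_avg[lang]) < tolerance
        for lang in words
    )


def totals_match(table) -> bool:
    """True when Sure + Possible equals Total in every row."""
    return all(s + p == t for s, p, t in table.values())


def possible_share(table):
    """Fraction of links in each pair that are P-alignments."""
    return {pair: p / t for pair, (s, p, t) in table.items()}


print(averages_match(), totals_match(table_a), totals_match(table_b))
```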
Inter-annotator Agreement
• Statistical significance for kappa is rarely reported. However, a
number of magnitude guidelines have appeared in the literature.
– Landis & Koch (1977) consider
• kappas between .4 and .6 as moderate agreement
• kappas between .8 and 1 as almost perfect agreement
– Fleiss (1981) (equally arbitrary guidelines) characterizes
• kappas from .40 to .75 as fair to good
• kappas over .75 as excellent
• These guidelines are, however, by no means universally accepted
Cohen's kappa coefficient
Multi-word units (MWU)   0.541
Word alignments (WA)     0.984
Total                    0.871
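For reference, Cohen's kappa compares observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two annotators' label sequences. How the alignment decisions were turned into categories for the figures above is not stated on the slide, so the labels used here ("S", "P", "N" for no link) and the ten decisions are purely hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(a)
    # Observed agreement: share of items given the same label by both.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the annotators' label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical decisions over ten candidate links (illustration only).
ann1 = ["S", "S", "P", "N", "S", "P", "N", "N", "S", "S"]
ann2 = ["S", "S", "P", "N", "P", "P", "N", "S", "S", "S"]
kappa = cohen_kappa(ann1, ann2)
print(f"kappa = {kappa:.3f}")
```

In this toy case the annotators agree on 8 of 10 items (p_o = 0.8) and chance agreement is 0.37, so kappa is 0.43 / 0.63.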
Discussion
• Difficulties in analyzing fluency, stylistics (including word order),
paraphrase, etc.
• Alignments do not always work bi-directionally (sometimes the source-
target direction for a language pair matters)
• Levels of alignment and ranking systems (n-grams, morphology,
semantico-syntactic level, phrase, paraphrase, etc.)
• Terminology imprecision is found in corpora (it leads to poor quality
machine translation)
Future Work
• Integration of lexica (multiword units, etc.) obtained via the use of local
grammars – use multiword units as ONE (1) segment of alignment,
whenever that is possible (contiguous, etc.)
• Pre-processing of contractions and post-processing of elements that
need to be contracted is important if applied to machine translation or
to create “more polished” lexica
• Evaluation of the current alignments in a statistical machine translation
system to see if translation quality improves
Future Work
• Machine learning of recognition and alignment of multiword units
– based on segment alignments, i.e., individual words inside the
multiword unit
– based on multiword units of a parallel sentence in another language or
language pair alignment
• Use of local grammars that identify and process discontinuous
multiword units and other complex linguistic phenomena to combine
with word alignment techniques – how to combine?
Main Conclusion
• Bringing linguistics into SMT at the start is the first, inevitable place
where hybridization should be possible.
• We believe that it would be productive to convert texts on both sides of
a translation pair into a common semantico-syntactic
representation before applying statistics to them. For this, each
language would have to have a parser capable of producing
homogeneous output.
• If this common representation were available, it would open up vast
possibilities for multilingual SMT.
Thank you!

Weitere ähnliche Inhalte

Was ist angesagt?

referát.doc
referát.docreferát.doc
referát.docbutest
 
Realization of natural language interfaces using
Realization of natural language interfaces usingRealization of natural language interfaces using
Realization of natural language interfaces usingunyil96
 
MACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMassimo Schenone
 
Seeing is Correcting:Linked Open Data for Portuguese
Seeing is Correcting:Linked Open Data for PortugueseSeeing is Correcting:Linked Open Data for Portuguese
Seeing is Correcting:Linked Open Data for PortugueseValeria de Paiva
 
Linguistic markup and transclusion processing in XML documents
Linguistic markup and transclusion processing in XML documentsLinguistic markup and transclusion processing in XML documents
Linguistic markup and transclusion processing in XML documentsSimon Dew
 
Logics and Ontologies for Portuguese Understanding
Logics and Ontologies for Portuguese UnderstandingLogics and Ontologies for Portuguese Understanding
Logics and Ontologies for Portuguese UnderstandingValeria de Paiva
 
Controlled Natural Language Generation from a Multilingual FrameNet-based Gra...
Controlled Natural Language Generation from a Multilingual FrameNet-based Gra...Controlled Natural Language Generation from a Multilingual FrameNet-based Gra...
Controlled Natural Language Generation from a Multilingual FrameNet-based Gra...Normunds Grūzītis
 
Elena Bruches - The Hybrid Approach to Part-of-Speech Disambiguation
Elena Bruches - The Hybrid Approach to Part-of-Speech DisambiguationElena Bruches - The Hybrid Approach to Part-of-Speech Disambiguation
Elena Bruches - The Hybrid Approach to Part-of-Speech DisambiguationAIST
 
Parallel Corpora in (Machine) Translation: goals, issues and methodologies
Parallel Corpora in (Machine) Translation: goals, issues and methodologiesParallel Corpora in (Machine) Translation: goals, issues and methodologies
Parallel Corpora in (Machine) Translation: goals, issues and methodologiesAntonio Toral
 
An Introduction to Pre-training General Language Representations
An Introduction to Pre-training General Language RepresentationsAn Introduction to Pre-training General Language Representations
An Introduction to Pre-training General Language Representationszperjaccico
 

Cross language alignments - challenges guidelines and gold sets

  • 1. Cross-Language Alignments: Challenges, Guidelines and Gold Sets
      Anabela Barreiro, Luísa Coheur, Tiago Luís, Ângela Costa, Fernando Batista, João Graça
      Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa (technology from seed)
      L2F - Spoken Language Systems Laboratory
  • 2. Outline – Part 1
      • Word alignment
      • Basic concepts
      • Applications
      • State of the art
      • Limitations
      • Paraphrase alignment
      • Multiword, meaning and translation unit alignment: importance
      • Our task
      • Alignment tool: CLUE-Aligner
  • 3. Outline – Part 2
      • General annotation guidelines
      • Cross-linguistic major challenges to word alignment
      • Annotation guidelines for multiword units and lexical and non-lexical realization phenomena
      • Pro-dropping
      • Articles and zero articles
      • Examples: continuous multiword units
      • Examples: continuous and discontinuous support verb constructions
      • Preposition-dependency (V, N and Adj)
      • Active vs passive
      • Choice of noun pre-modifiers
      • Different PoS with same semantics (V vs process N)
      • Noun adjuncts
      • Coordination
      • Anaphora: choice of co-referents
      • Impersonal constructions
      • Contractions
      • Style
      • Antonyms and negation constructions
      • Romance languages double negation
      • Singular vs plural
      • Idiomatic vs non-idiomatic
      • Flexible/loose paraphrasing constructions
      • Idiosyncrasies of each language
  • 4. Outline – Part 3
      • Our contribution
      • Annotation process
      • Preliminary results
      • Discussion
      • Future work
  • 5. Word Alignment: Basic Concepts
      • Word alignments are objects that represent the mapping between words (or expressions) that are semantically equivalent in a source and a target sentence of a parallel corpus [Brown et al., 1990]
        – Matrix of n * m entries, where n is a position in the source sentence and m is a position in the target sentence. An entry a(n,m) in that matrix specifies whether the source word at position n is part of a translation of the target word at position m
      • Task of word alignment: identifying translational equivalences (= semantic correspondences) in the aligned sentence pairs of a parallel text [Hearne & Way, 2011]
      • Translational equivalences are represented graphically in a grid by the intersection of single segments (individual words) or blocks (semantico-syntactic units, phrases, expressions)
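The matrix representation described on this slide can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `alignment_matrix`, the example sentence pair and the link list are assumptions made for the example and do not come from the slides.

```python
def alignment_matrix(source, target, links):
    """Build an n x m matrix of 0/1 entries from (source_pos, target_pos) links.

    matrix[n][m] == 1 means the source word at position n is aligned
    with the target word at position m.
    """
    matrix = [[0] * len(target) for _ in source]
    for n, m in links:
        matrix[n][m] = 1
    return matrix


# Hypothetical EN-PT sentence pair, invented for illustration only.
source = ["the", "system", "works"]
target = ["o", "sistema", "funciona"]
links = [(0, 0), (1, 1), (2, 2)]

matrix = alignment_matrix(source, target, links)
for word, row in zip(source, matrix):
    print(f"{word:>8} {row}")
```

A block alignment (a multiword unit aligned as a whole) would simply set every cell in the corresponding rectangle of the grid to 1.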
  • 6. Word Alignment: Basic Concepts
      • Sure alignment (S-alignment)
        – Unambiguous and valid in all contexts
          • EN system
          • ES sistema
          • FR système
          • PT sistema
      • Possible alignment (P-alignment)
        – Ambiguous and invalid in some contexts
          • EN be
          • ES ser/estar/haber/existir
          • FR être/avoir/exister
          • PT ser/estar/haver/existir
  • 7. Word Alignment: Applications
      • Statistical machine translation
        – [Brown et al., 1990]: statistical machine translation
        – [Och and Ney, 2004]: phrase-based machine translation
        – [Galley et al., 2004]: syntax-based machine translation
      • Annotation projection
      • Extraction of bilingual lexica
      • Evaluation of machine translation systems
  • 8. Word Alignment: State of the Art
      • Workshops and evaluation tasks (multi-language)
        – http://www.cse.unt.edu/~rada/wp/
        – http://www.statmt.org/wpt05
        – http://www.lpl.univ-aix.fr/projects/arcade
      • Projects
        – Blinker project (French-English): http://nlp.cs.nyu.edu/blinker/
      • Guidelines: [Melamed, 1998], [Och and Ney, 2000], [Lambert et al., 2005], [Kruijff-Korbayová et al., 2006], [Graça et al., 2004]
  • 9. Word Alignment: Limitations • Language does not operate on a word-for-word basis • A large number of words cannot be dissociated from one another – Multiword units (MWU) • [Gross and Senellart, 1998] – more than 40% of one year of Le Monde consists of MWU • [Sag et al., 2002] – 50-70% of specialized lexica are MWU • [Ramisch et al., 2010] – 56.7% of terms in the Genia corpus have 2+ words (not including general-purpose MWU, e.g., generic compounds, lexical bundles, phrasal verbs and fixed expressions, which also occur in domain-specific texts) – Translation units – Meaning units – Paraphrases • Segment and block alignment (sure and possible)
  • 10. Example: Segment and Block Alignment (Sure and Possible)
  • 11. Paraphrase Alignment • Monolingual – [Callison-Burch et al., 2006] • Annotation guidelines for paraphrase alignment • Paraphrases: sentences that convey the same meaning but are worded differently • Alignment of words, phrases and expressions within the same language • Bilingual = (non-literal) translation – Need to account for paraphrases across languages
  • 12. Multiword, Meaning and Translation Unit Alignment: Importance • Publicly available manual word alignments are restricted to a few language pairs • Manual word alignments are a desired resource – Evaluation of word alignment algorithms – Training of supervised and semi-supervised algorithms – Tuning of parameters for different types of model • However, the “name”, “concept” and “techniques” of alignment need to be more linguistically sophisticated to be more useful and to help improve machine translation
  • 13. Our Task • EuroParl corpus [Koehn, 2005] • 6 gold alignment sets – 400 alignments per set (400 × 6 = 2,400) • Languages: English, French, Portuguese and Spanish – Language pairs: [en-es], [en-fr], [en-pt], [es-fr], [pt-es], [pt-fr] • Guidelines for multi-language manual word annotations (with inter-annotator agreement) • Linguistically-informed (and linguistically-motivated) cross-language multiword unit and paraphrase alignment (translation unit alignment)
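The figures on this slide follow from simple counting: four languages yield six unordered pairs, and 400 alignments per pair give 2,400. A minimal sketch of that arithmetic, assuming the pairs are unordered (the slide writes [pt-es] and [pt-fr], which are the same pairs as es-pt and fr-pt here):

```python
from itertools import combinations

languages = ["en", "es", "fr", "pt"]
alignments_per_set = 400  # figure stated on the slide

# Every unordered pair of distinct languages gets one gold set.
pairs = list(combinations(languages, 2))
total_alignments = len(pairs) * alignments_per_set

print(pairs)
print(len(pairs), "gold sets,", total_alignments, "alignments in total")
```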
  • 14. CLUE-Aligner Alignment Tool • CLUE-Aligner = Cross-Language Unit Elicitation Aligner • Helps reduce ambiguity in the alignment process • Facilitates the alignment of translation units
  • 15. Major Challenges (4 different classes) • semantico-discursive – emphatic linguistic constructions • tautology • pleonasm and repetition • focus constructions • lexical and semantico-syntactic – multiword units – compound verbs – prepositional predicates
  • 16. Major Challenges (4 different classes) • morphological – contracted forms – lexical versus non-lexical realization • articles and zero articles • pro-dropping – subject pronoun drop – empty relative pronoun • morpho-syntactic – free noun adjuncts
  • 17. General Annotation Guidelines • Linguistic phenomenon / No alignment / P-alignment • Incomplete or non-translation X; Incorrect translation and typo X*; Approximate correspondence (numeric) X; Non-obligatory linguistic structure; Pleonasm X; Repetition of words or expressions X; Redundancy or additional/extra information X; Mismatching pronoun, determiner, verbs, etc. X; Abbreviations versus full word X; Punctuation mark; Different but correct X; Incorrect / mismatch X; Missing X • * If a multiword unit is incorrectly translated or contains a typo, none of its internal segments are aligned
  • 18. Annotation Guidelines • Linguistic phenomenon / No alignment / Block-alignment / S-align / P-align • Multiword unit: continuous X X; discontinuous X* • Lexical versus non-lexical realization: article + N versus zero-article + N (Ø people = PT as pessoas) X; Pro-drop + V versus pronoun + V (I went = PT Ø fui) X; Empty relative pronoun versus realized relative pronoun (N that I met = N I met = PT que (eu) conheci) X; Relative versus participial adjective (that was written = written = PT escrito) X • * Some discontinuous multiword units are candidates for block-alignment (e.g., when the number of inserts is small or the multiword unit is “semi-frozen”)
  • 19. Annotation Guidelines • Continuous multiword units / Block-S-alignment / Block-P-alignment • Support verb construction X X; Compound X X; Phrasal verb X X; Named entity X X; Date and time expression X; Lexical bundle X; Idiomatic expression X; Domain term X; French negation (ne pas) X; English infinitive (to + V) X X • [Barreiro, 2008] presents a detailed description and examples of the different types of multiword unit
  • 20. Example: Continuous Support Verb Constructions (alignment) ES aprueba plenamente FR approuve pleinement
  • 21. Example: Discontinuous Support Verb Constructions (no alignment) ES para que acelere la directiva sobre pensiones complementarias FR pour faire avancer la directive sur les pensions complémentaires
  • 22. Cross-Linguistic Challenges • Prepositional predicates EN I too should like to congratulate [NE] on his excellent report ES también yo quisiera felicitar a mi colega [NE] por su excelente informe FR je voudrais féliciter moi aussi mon collègue [NE] pour son excellent rapport PT também eu gostaria de felicitar o meu colega [NE] pelo seu excelente relatório EN […] our Asian partners prefer to deal with questions which unite us ES […] nuestros socios asiáticos prefieren dedicarse a las cuestiones que nos unen FR […] nos partenaires asiatiques préfèrent s’attacher à ce qui nous unit PT […] os nossos parceiros asiáticos preferem centrar-se unicamente nas questões comuns • Segment S-alignment; Impossible to annotate discontinuous preposition-dependency; Block P-alignment
  • 23. Cross-Linguistic Challenges • Prepositional verbs: agree with, belong to, forgive s/o for, pay for, stand for, aim at/for, choose between, hope for, prepare for, thank s/o for, allow for, comment on, insist on, prevent s/o from, think of/about, apologise for, compare with, interfere with/in, provide s/o with, volunteer to, apply for, complain about, joke about, refer to, wait for, approve of, concentrate on, laugh at, rely on, warn s/o about, argue with/about, congratulate on, lend s/th to s/o, run for, worry about, ask for, consist of, listen to, smile at, attend to, deal with, long for, succeed in, believe in, decide on, object to, suffer from
  • 24. Cross-Linguistic Challenges • Prepositional nouns: attack on, attitude towards, in agreement, on strike, cruelty towards, comparison between, on average, in trouble, difficulty in/with, decrease in, on condition, on behalf of, knowledge of, disadvantage of, delay in, connection between, reason for, increase in, in doubt, difference between/of, rise in, preference for, information about, under guarantee, solution to, reduction in, need for, in power, use of, at risk, protection from, reaction to, in a hurry, at stake, report on, result of, in practice, in theory, room for, trouble with
  • 25. Cross-Linguistic Challenges • Prepositional adjectives: delighted at/about, frightened of, opposed to, similar to, different from, friendly with, pleased with, sorry for/about, dissatisfied with, good at, popular with, suspicious of, doubtful about, guilty of, proud of, sympathetic to(wards), enthusiastic about, incapable of, puzzled by/about, tired of, envious of, interested in, safe from, typical of, excited about, jealous of, satisfied with, unaware of, famous for, keen on, sensitive to(wards), used to, fed up with, kind to, serious about, fond of, mad at/about, sick of
  • 26. Cross-Linguistic Challenges • Noun Adjuncts – Compounds • European investment bank / banco europeu de investimento: [Adj N N] / [N Adj [de N]] – Free noun phrases (not compounds) • presidency communication / comunicação da presidência: [N N] / [N [de N]] • Block S-alignment; Segment S-alignment; Block-P-alignment of [de N]
  • 27. Cross-Linguistic Challenges • Contractions – two or more words with different parts-of-speech overlap, which makes syntactic analysis and generation difficult – in cross-language analysis, the contrast between languages that have contractions and languages that do not have them, or do not have them in the same contexts, presents additional difficulties – The alignment of one segment that corresponds to a contracted form in one language with the corresponding segments where elements are not contracted in the other language of the parallel pair is pragmatically motivated
  • 28. Example: Contractions (block P-alignment) • Interference with the support verb construction EN to make a reference to PT fazer uma referência a
  • 29. Example: Contractions (block P-alignment) • Interference with the support verb construction ES hacer una referencia a FR faire référence à
  • 30. Cross-Linguistic Challenges • Singular versus plural (related to determiner) EN in every official language of the union ES en todos los idiomas oficiales de la unión FR dans toutes les langues officielles de l'union PT em cada uma das línguas oficiais da união • Active versus passive EN before new member states are admitted ES antes de la incorporación de nuevos miembros FR avant l'admission de nouveaux membres PT antes da entrada de novos membros • Block or segment P-alignment; Block-S-alignment if there is some fixedness (such as in this case); Block P-alignment
  • 31. Cross-Linguistic Challenges • Coordination EN which we will send to the council and Ø parliament ES que enviaremos al consejo y al parlamento FR qui sera envoyée au conseil et au parlement PT que remeterá ao conselho e ao parlamento • Style: idiomatic versus non-idiomatic EN which began four years ago ES que empezó hace cuatro años FR qui a vu le jour il y a quatre ans PT que se iniciou há quatro anos • No alignment; Block P-alignment
  • 32. Cross-Linguistic Challenges • Choice of noun pre-modifiers EN we should use that public funding for those types of project which are most difficult to finance through the private sector ES deberíamos utilizar esa financiación pública para aquel tipo de proyectos que tienen mayor dificultad para ser financiados por el sector privado FR nous devrions recourir au financement public pour les projets que le secteur privé boude PT o financiamento público deveria ser utilizado para os projectos que registam maiores dificuldades em serem financiados pelo sector privado • Block P-alignment EN despite certain difficulties PT apesar das dificuldades
  • 33. Cross-Linguistic Challenges • Anaphora: choice of co-referents (noun versus pronoun) EN it is not acceptable that we assisted Korea during the Asean crisis by means of IMF loans and suchlike, only for Korea still to be subsidising its shipyards ES no resulta procedente que hayamos ayudado a Corea en la crisis de la Asean a través de préstamos del FMI, etc. y que Corea siga subvencionando sus astilleros FR il n’est pas acceptable que nous ayons aidé la Corée dans la crise de l’Anase, avec des prêts du FMI, etc. et qu’elle continue à subventionner ses chantiers navals PT é inadmissível que, depois de termos ajudado a Coreia, através de créditos do FMI, etc., na crise da Asean, este país continue a subvencionar agora os seus estaleiros navais • Segment or block P-alignment
  • 34. Cross-Linguistic Challenges • Antonyms and negation constructions EN the countries of Asia have not unfortunately been in favour of that proposal ES los países de Asia desgraciadamente no han sido favorables a dicha propuesta FR les pays d'Asie ont malheureusement rejeté cette proposition PT os países da Ásia, infelizmente, não se mostraram favoráveis a esta proposta • Block S-alignment together with adverb (insert in EN and FR)
  • 35. Cross-Linguistic Challenges • Flexible/loose paraphrasing constructions EN and we shall vote against it ES y merece nuestra condena FR et dénonçons PT e merece a nossa condenação EN 1993 was a significant year ES el año 1993 es una fecha notable FR l’année 1993 est à marquer d’une pierre blanche PT 1993 é uma data charneira • Block P-alignment
  • 36. Cross-Linguistic Challenges • Different parts-of-speech with same semantics (verbs versus process nouns) EN we must use all the financial instruments at our disposal to rapidly develop the market ES es preciso utilizar todos los instrumentos financieros disponibles para un rápido desarrollo ulterior del mercado FR il faut utiliser tous les instruments financiers disponibles pour développer rapidement le marché PT todos os instrumentos financeiros disponíveis deverão ser aplicados para continuar a desenvolver rapidamente o mercado • Block S-alignment (with internal segment P-alignments); EN and PT: Segment S-alignment; No alignment of [continuar a]
  • 37. Cross-Linguistic Challenges • Impersonal constructions (+ “impersonal” relative versus participial adjective) EN we must fully support the demands that have been made ES hay que apoyar plenamente las exigencias que se han formulado FR il faut par conséquent appuyer les requêtes formulées PT as reivindicações formuladas deverão ser plenamente apoiadas • Block P-alignment; Internal P-alignment: EN we must, ES hay que, FR il faut; Internal segment S-alignment: adverb and verb (EN, ES, FR); Internal segment P-alignment: verb (PT)
  • 38. Cross-Linguistic Challenges • Romance languages double negation (+ coordination) EN it is not, therefore, surprising that there is, in this context, no real integration or genuine political dialogue ES no es nada sorprendente, entonces, que en ese contexto, no haya ni verdadera integración ni verdadero diálogo político FR rien d’étonnant donc, qu'il n'y ait dans ce contexte, ni intégration véritable, ni dialogue politique véritable PT assim, não é de espantar que, nesse contexto, não exista verdadeira integração nem verdadeiro diálogo político • Block P-alignment of the relative existential with adverbial (insert): EN that there is, in this context, no; ES que en ese contexto, no haya; FR qu’il n’y ait dans ce contexte; PT que, nesse contexto, não exista • Segment P-alignment of negation and negation connector: EN no – or; ES ni – ni; FR n’ – ni; PT Ø – nem
  • 39. Cross-Linguistic Challenges • Idiosyncrasies of languages • Portuguese inflected infinitive (peculiar verb tense) • English to+Infinitive • French negation • English apostrophe • … • Sociolinguistic differences
  • 40. Our Contribution • CLUE-Aligner tool • Annotated corpora • Cross-language resources – gold collection publicly available on the META-NET website: http://metanet4u.l2f.inesc-id.pt/ • Guidelines – http://www.inesc-id.pt/ficheiros/publicacoes/8204.pdf
  • 41. Annotation Process • Annotation of 400 × 6 (2,400 sentence alignments) by a linguist • Alignment of a subset by a second linguist (25 sentences of the English-Portuguese language pair) • Inter-annotator agreement
  • 42. Preliminary Results
      language   words   avg. words
      en         11158   27.9
      es         11664   29.2
      fr         12464   31.2
      pt         11649   29.1

      pair    Sure   Possible   Total
      en-pt   6684   418        7102
      en-fr   7025   569        7594
      en-es   7636   399        8035
      es-fr   7477   767        8244
      pt-es   7958   557        8515
      pt-fr   7029   782        7811

      pair    Sure   Possible   Total
      en-pt   2588   602        3190
      en-fr   3865   414        4279
      en-es   3551   351        3902
      es-fr   3516   495        4011
      pt-es   3162   382        3544
      pt-fr   3253   698        3951

      Block (MWU) alignment   Segment (word) alignment
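The figures on this slide can be cross-checked with a short script: each average should equal the word count divided by the 400 sentences per set, and each Total should equal Sure + Possible. The sketch below assumes 400 sentences per language, as stated in the task description. The names `first_table` and `second_table` are neutral placeholders, because the transcript does not make clear which table carries the block label and which the segment label.

```python
SENTENCES_PER_SET = 400  # from the task description in the slides

words = {"en": 11158, "es": 11664, "fr": 12464, "pt": 11649}
reported_avg = {"en": 27.9, "es": 29.2, "fr": 31.2, "pt": 29.1}

# (sure, possible, total) per language pair, copied from the slide.
first_table = {
    "en-pt": (6684, 418, 7102), "en-fr": (7025, 569, 7594),
    "en-es": (7636, 399, 8035), "es-fr": (7477, 767, 8244),
    "pt-es": (7958, 557, 8515), "pt-fr": (7029, 782, 7811),
}
second_table = {
    "en-pt": (2588, 602, 3190), "en-fr": (3865, 414, 4279),
    "en-es": (3551, 351, 3902), "es-fr": (3516, 495, 4011),
    "pt-es": (3162, 382, 3544), "pt-fr": (3253, 698, 3951),
}

# Averages: words / 400, rounded to one decimal place.
averages_ok = all(
    abs(round(words[lang] / SENTENCES_PER_SET, 1) - reported_avg[lang]) < 1e-9
    for lang in words
)

# Totals: sure + possible must equal the reported total in both tables.
totals_ok = all(
    sure + possible == total
    for table in (first_table, second_table)
    for sure, possible, total in table.values()
)

print("averages consistent:", averages_ok)
print("totals consistent:", totals_ok)
```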
  • 43. Inter-annotator Agreement • Statistical significance for kappa is rarely reported. However, a number of magnitude guidelines have appeared in the literature. – Landis & Koch (1977) consider • kappas between .4 and .6 as moderate agreement • kappas between .8 and 1 as almost perfect agreement – Fleiss (1981) (equally arbitrary guidelines) characterizes • kappas from .40 to .75 as fair to good • kappas over .75 as excellent • These guidelines are, however, by no means universally accepted
      Cohen's kappa coefficient
      Multi-word units (MWU)   0.541
      Word alignments (WA)     0.984
      Total                    0.871
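For readers unfamiliar with the statistic, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance: kappa = (p_o − p_e) / (1 − p_e). The sketch below is a generic implementation, not the authors' evaluation script. The unit of comparison (one label per candidate link, with "S", "P" or "-" for no alignment) and the toy label sequences are assumptions made for illustration only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical decisions by two annotators on eight candidate links.
annotator_1 = ["S", "S", "P", "-", "-", "S", "-", "P"]
annotator_2 = ["S", "S", "-", "-", "-", "S", "-", "P"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

With these toy labels the annotators agree on 7 of 8 items, and the resulting kappa is lower than the raw agreement of 0.875 because part of that agreement is attributable to chance.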
  • 44. Discussion • Difficulties in analyzing fluency, stylistics (including word order), paraphrase, etc. • Alignments do not always work bi-directionally (sometimes the source-target direction for a language pair matters) • Levels of alignment and ranking systems (n-grams, morphology, semantico-syntactic level, phrase, paraphrase, etc.) • Terminology imprecision is found in corpora (it leads to poor quality machine translation)
  • 45. Future Work • Integration of lexica (multiword units, etc.) obtained via the use of local grammars – use multiword units as ONE (1) segment of alignment, whenever that is possible (contiguous, etc.) • Pre-processing of contractions and post-processing of elements that need to be contracted is important if applied to machine translation or to create “more polished” lexica • Evaluation of the current alignments in a statistical machine translation system to see if translation quality improves
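The contraction pre-processing mentioned on this slide could look roughly like the following sketch for Portuguese. It is a simplified illustration, not the authors' pipeline: the table covers only a handful of preposition + article contractions, the function name `split_contractions` is hypothetical, and ambiguous forms such as "nos" (which can also be a pronoun) are deliberately left out because they need context to resolve.

```python
# Small, deliberately incomplete table of Portuguese contractions.
# "nos" is omitted on purpose: it is ambiguous between em + os and a pronoun.
CONTRACTIONS = {
    "do": ("de", "o"), "da": ("de", "a"), "dos": ("de", "os"), "das": ("de", "as"),
    "no": ("em", "o"), "na": ("em", "a"), "nas": ("em", "as"),
    "ao": ("a", "o"), "aos": ("a", "os"),
    "pelo": ("por", "o"), "pela": ("por", "a"),
    "pelos": ("por", "os"), "pelas": ("por", "as"),
}


def split_contractions(tokens):
    """Expand known contractions so each part can be aligned on its own."""
    expanded = []
    for token in tokens:
        parts = CONTRACTIONS.get(token.lower())
        if parts is None:
            expanded.append(token)
        else:
            expanded.extend(parts)
    return expanded


sentence = "antes da entrada de novos membros".split()
print(split_contractions(sentence))
```

The reverse step (re-contracting "de a" into "da" after translation) would be the post-processing the slide refers to; it is not shown here.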
  • 46. Future Work • Machine learning of recognition and alignment of multiword units • based on segment alignments, i.e., individual words inside the multiword unit • based on multiword units of a parallel sentence in another language or language pair alignment • Use of local grammars that identify and process discontinuous multiword units and other complex linguistic phenomena to combine with word alignment techniques – how to combine?
  • 47. Main Conclusion • Bringing linguistics into SMT at the start is the first inevitable place where hybridization should be possible. • We believe that it would be productive to convert texts on both sides of a translation pair into a common semantico-syntactic representation before applying statistics to them. For this, each language would need a parser capable of producing homogeneous output. • If this common representation were available, it would open up vast possibilities for multilingual SMT.
  • 48. Thank you!
  • 49. Cross-Language Alignments: Challenges, Guidelines and Gold Sets • Anabela Barreiro, Luísa Coheur, Tiago Luís, Ângela Costa, Fernando Batista, João Graça

Editor's notes

  1. Before starting this presentation, I would like to thank Priberam for the opportunity to show our work at this seminar. And now, I will proceed in English… Good morning. My name is Anabela Barreiro. I am an invited researcher at the Spoken Language Systems Laboratory, at INESC-ID, Lisbon. Today, I will present “Cross-Language Alignments: Challenges, Guidelines and Gold Sets”, done in collaboration with my colleagues Luísa Coheur, Tiago Luís, Ângela Costa, Fernando Batista and João Graça. In this presentation, I will describe the key cross-language annotation guidelines to provide support for machine translation systems. The guidelines aim at improving the quality of the machine translation output by using linguistically-informed and motivated annotation of special case multiwords and semantico-syntactic translation units.
  2. This presentation is divided into 3 parts. I will describe CLUE-Aligner, a tool developed to reduce ambiguity in the alignment process and facilitate the alignment of meaning and translation units.
  3. I will focus on the challenges to the alignment of special cross-linguistic cases, such as multiword units, lexical and non-lexical realization (the pro-drop phenomenon, determiners and zero determiners), noun adjuncts, and idiosyncrasies of each language.
  4. The main use of word alignments is SMT. [Brown et al., 1990] introduced the concept of word alignment and applied it directly to an SMT system. [Och and Ney, 2004] used it as a primary resource for phrase-based machine translation. [Galley et al., 2004] used it as a resource for syntax-based machine translation.
  5. In recent years, with the increase in freely available parallel corpora, a huge development took place in SMT. Many workshops and evaluation tasks have been dedicated to multi-language word alignment. Some projects too. For example, the Blinker project aimed at aligning words between French and English texts. Many word alignment guidelines have been suggested.
  6. Redefinition of word alignment: word and multiword, phrase, expression – translation unit
  7. Despite the growing number of available multi-language sentence-aligned parallel corpora and alignment tools, the number of publicly available manual word alignments is restricted to a few language pairs. Word alignment is a desirable resource.
  8. The guidelines were based on the alignment of bilingual texts of the common test set of the publicly available Europarl corpus, which contains proceedings of the European Parliament in the different official languages of the EU. The work provides 6 gold alignment sets. The bilingual texts cover all possible combinations between the English, Spanish, French, and Portuguese languages.
  9. CLUE-Aligner is a tool developed to reduce ambiguity in the alignment process and facilitate the alignment of meaning and translation units.
  10. We use block alignments only when one of these elements is elided. When the elements are lexically realized, determiners, pronouns and the individual elements of the relatives are aligned individually with the corresponding elements in the parallel sentence. Exceptions: discontinuous multiword units with a small number of inserts are aligned.
  11. Other examples of aligned and not aligned MWU. Phrasal verbs: Aligned – look into the problem – debruçar-se sobre este problema; Not aligned – (230). Verb compounds: Aligned – has also increased (22); Not aligned – . French negation: Aligned – ne pas; Not aligned –
  12. (da presidência is S-aligned with presidency) Presidency communication is in the corpus, but it does not sound right!
  13. NOT A GOOD SOLUTION – it does not account for the double negation structure
  14. The gold collection and alignment tool are publicly available.