Parallel Corpora in (Machine) Translation:
goals, issues and methodologies
Time in Translation Workshop
University of Utrecht, 23rd June 2017
Antonio Toral
a.toral.ruiz@rug.nl
Content
1. Introduction: Why, Which, Where, How
2. Massive Acquisition Made Easy
3. Cleaning the Messy
4. Creative Translators and Parallel Corpora
Context, about me
● Named Entities & WSD, till 2009
  – PhD at Univ. Alacant & CNR Pisa
  – Topic: Named Entity Acquisition from Wikipedia
● MT since 2010
  – Researcher in MT at DCU
  – Assistant Prof at Univ. Groningen (since October 2016)
  – Topics
    ● Corpora acquisition
    ● Domain adaptation
    ● Diagnostic evaluation
    ● Literary text
1. Introduction: Why, Which, Where, How
Why use parallel corpora
● Machine Translation (MT)
  – Statistical approaches (phrase-based, neural): key component
  – Rule-based approaches: automatic acquisition of rules and dictionaries
    ● Less manual effort required
    ● Resulting rules and dictionaries reflect language use
● Computer-assisted translation
  – Translation memories, fuzzy matches
● Corpus-based linguistic research of translations
Which Parallel Corpora
● Wish list
  – Clean
  – Appropriate domain, genre and style
  – Big size
  – ...
Where to get parallel corpora
● Already existing
  – free (Europarl, OPUS) versus
  – for a fee (ELDA, LDC, TAUS, etc.)
● Web crawling
  – ad hoc (e.g. SETimes, TLAXCALA) versus
  – generic (e.g. Bitextor, ILSP Focused Crawler)
● Generate manually
  – Professional translation (~0.05 cent/word) versus
  – crowdsourcing (very cheap, but: quality, ethics...)
  – Cost-benefit trade-off: careful selection of the data to be translated
    ● Variety: avoid redundancy in the sentences translated
    ● Performance: predict which data, if translated, would improve MT the most
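The "variety" criterion can be illustrated with a minimal sketch. The slides do not say which selection method is meant, so the greedy vocabulary-coverage heuristic below, the function names and the toy sentences are all assumptions made for illustration.

```python
def new_words(sentence, covered):
    """Words in the sentence that no already-selected sentence contains."""
    return set(sentence.lower().split()) - covered


def select_for_translation(candidates, budget):
    """Greedily pick up to `budget` sentences, each adding the most unseen words.

    Hypothetical heuristic for the "variety" criterion; not the method
    used by the author.
    """
    covered, selected = set(), []
    remaining = list(candidates)
    while remaining and len(selected) < budget:
        best = max(remaining, key=lambda s: len(new_words(s, covered)))
        gain = new_words(best, covered)
        if not gain:  # every remaining sentence is fully redundant
            break
        selected.append(best)
        covered |= gain
        remaining.remove(best)
    return selected


candidates = ["the cat sat", "the cat sat down", "a dog barked loudly", "the cat"]
print(select_for_translation(candidates, budget=2))
```

With a budget of two sentences, the near-duplicates "the cat sat" and "the cat" are skipped because they add no new words once "the cat sat down" has been chosen.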
How to format it
● Tokenised
  – and case-normalised (truecased, lowercased)
● Sentence aligned
  – E.g. Hunalign (2005)
● Word aligned?
  – E.g. GIZA++, fast_align
● ...
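As a rough illustration of the first formatting step, the sketch below tokenises a sentence with a regular expression and lowercases it. It is a simplification under stated assumptions: real pipelines typically use a dedicated tokeniser, truecasing needs a trained model and is not shown, and the names `tokenise` and `normalise` are hypothetical.

```python
import re

# A token is either a run of word characters (optionally joined by
# apostrophes, as in "j’ai") or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:['’]\w+)*|[^\w\s]")


def tokenise(sentence: str) -> list[str]:
    """Split a sentence into word and punctuation tokens."""
    return TOKEN_RE.findall(sentence)


def normalise(sentence: str) -> str:
    """Tokenise, join tokens with single spaces, and lowercase."""
    return " ".join(tokenise(sentence)).lower()


print(normalise("I was exhausted, and threw myself on my bunk."))
```

Lowercasing is the simplest form of case normalisation; it discards information (for example proper nouns) that truecasing would keep.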
2. Massive acquisition... made easy
Massive Acquisition
● Current parallel crawlers
  – Input: URL of a web domain with parallel data
  – Output: parallel documents identified from that web domain
● Problem
  – One needs to compile the list of promising web domains manually!
● Solution
  – Automatically identify promising web domains
Massive Acquisition
● Spidextor [1]
  – Crawling of parallel (and monolingual) corpora from top-level domains (TLDs)
  – Example
    ● Inputs
      – TLD: .nl
      – Languages of interest: en and nl
    ● Output: corpora crawled
      – Monolingual: en, nl
      – Parallel: en-nl
[1] Ljubešić et al. Producing monolingual web corpora and bitext at the same time -- SpiderLing and bitextor's love affair. LREC 2016.
Massive Acquisition
Results
● 4 TLDs: .fi, .hr, .sl, .sr
● Crawled for 3 to 7 days
Evaluation
● Intrinsic (% of non-parallel segments)
  – 16% (en-hr) to 32% (en-sl) noise
● Extrinsic
Issues (I): Quality
● Is web-crawled parallel data not low-quality?
  – Noise: documents in languages A and B that are not translations of each other
    ● Sentence alignment confidence scores (sentence and document level)
      – Where to set the threshold: precision vs recall trade-off
  – Some data may not be human but machine translation!
    ● Tell apart human from machine translated documents with a binary classifier
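The precision vs recall trade-off when thresholding alignment scores can be made concrete with a small sketch. The scores and manual judgements below are invented toy data, and `precision_recall` is a hypothetical helper, not part of any aligner.

```python
def precision_recall(scored_pairs, threshold):
    """Precision and recall of keeping pairs whose score is >= threshold.

    scored_pairs: list of (alignment_score, is_parallel) tuples, where
    is_parallel is a manual judgement (toy data in this example).
    """
    kept = [is_parallel for score, is_parallel in scored_pairs if score >= threshold]
    total_parallel = sum(is_parallel for _, is_parallel in scored_pairs)
    true_positives = sum(kept)
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / total_parallel if total_parallel else 0.0
    return precision, recall


# Hypothetical alignment scores with manual parallel / non-parallel labels.
sample = [(0.95, True), (0.90, True), (0.70, False),
          (0.65, True), (0.40, False), (0.30, True)]

for threshold in (0.2, 0.6, 0.8):
    p, r = precision_recall(sample, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

Raising the threshold removes more noise (higher precision) but also discards genuine translations (lower recall), which is the decision the slide refers to.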
Issues (II): Domain
● Web-crawled corpora from a TLD (W) represent the “general” domain...
● ... but I am interested in a particular domain (D)
● Find the subset of W that belongs (or is closely related) to D
  – Rank sentences (or documents) in W according to their similarity to a small corpus in the domain D
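The ranking step can be sketched as follows. The slides do not name the similarity measure, so the bag-of-words cosine similarity used here is an assumption chosen for brevity (language-model based scores are a common alternative); the function names and sentences are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Word-frequency vector of a whitespace-tokenised, lowercased text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_by_domain(general_sentences, in_domain_sentences):
    """Sort sentences from W by similarity to the small in-domain corpus D."""
    domain_profile = bag_of_words(" ".join(in_domain_sentences))
    return sorted(general_sentences,
                  key=lambda s: cosine(bag_of_words(s), domain_profile),
                  reverse=True)


in_domain = ["the patient received a dose of the drug",
             "side effects of the drug were mild"]
general = ["the match ended in a draw",
           "the drug reduced the patient 's symptoms",
           "book your hotel online"]
for sentence in rank_by_domain(general, in_domain):
    print(sentence)
```

One would then keep the top-ranked portion of W. Note that frequent function words influence this simple measure, which is one reason more refined scores are preferred in practice.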
3. Cleaning the messy
Motivation
● Many publicly available parallel corpora are potentially useful
● But... they are too noisy
  – Misalignments
  – Encoding errors
  – etc.
Case Study: OpenSubtitles
● Available for many language pairs
● Big size
● But very messy
● How can we unlock its potential?
Procedure
● Automatic cleaning
  – Fixing (sparsity)
    ● Converting Cyrillic characters to their Latin counterparts
    ● Converting encoding to UTF-8
    ● Spelling errors
    ● Inconsistent punctuation marks, numbers and spacing
  – Removing sentences (noise)
    ● Without alphabetical characters
    ● Too different in length
    ● Not in the right language
[2] Forcada et al. D4.1b MT systems for the second development cycle. Abu-MaTran deliverable. 2014.
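Two of the removal criteria above can be sketched in a few lines. This is a minimal illustration, not the Abu-MaTran implementation: the length-ratio threshold of 2.0 and the function names are assumptions, and the language check is omitted because it needs an external language identifier.

```python
def has_alpha(sentence):
    """True if the sentence contains at least one alphabetical character."""
    return any(ch.isalpha() for ch in sentence)


def length_ratio_ok(src, tgt, max_ratio=2.0):
    """True if the token counts of both sides differ by at most max_ratio.

    max_ratio=2.0 is an assumed value for illustration only.
    """
    a, b = len(src.split()), len(tgt.split())
    if a == 0 or b == 0:
        return False
    return max(a, b) / min(a, b) <= max_ratio


def keep_pair(src, tgt, max_ratio=2.0):
    """Keep a sentence pair only if it passes both noise filters."""
    return has_alpha(src) and has_alpha(tgt) and length_ratio_ok(src, tgt, max_ratio)


pairs = [("Hello.", "Bok."),
         ("- ...", "- ..."),
         ("Yes.", "Da , naravno da ću doći sutra ujutro .")]
cleaned = [pair for pair in pairs if keep_pair(*pair)]
print(cleaned)
```

Only the first pair survives: the second has no alphabetical characters and the third is too unbalanced in length.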
Results
● Data
  – Corpus: OpenSubtitles en-hr
  – Input: 30M sentence pairs
  – Output: 17M sentence pairs
● Extrinsic evaluation
  – Train MT system with OpenSubs as is vs cleaned
  – Test set: news domain
Results
MT results (BLEU)

                       EN-HR   HR-EN
OpenSubs as is         0.09    0.22
OpenSubs cleaned       0.22    0.31
Relative improvement   145%    37%
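The arithmetic behind these figures can be sketched as below, using the sentence-pair counts (30M in, 17M out) and the rounded BLEU scores reported above. The helper `relative_change` is hypothetical. Recomputing from the rounded scores gives values close to, but not identical with, the percentages on the slide, which were presumably derived from unrounded scores.

```python
def relative_change(before, after):
    """Relative change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before


# Share of sentence pairs removed by cleaning (30M in, 17M out).
removed = 1 - 17 / 30

# Relative BLEU improvement recomputed from the rounded scores in the table.
en_hr = relative_change(0.09, 0.22)
hr_en = relative_change(0.22, 0.31)

print(f"removed: {removed:.0%}")
print(f"EN-HR: {en_hr:.0%}, HR-EN: {hr_en:.0%}")
```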
4. Creative translators and parallel corpora
Translation Options
● There are many possible translations
● What makes a good translation?
● Different views in Translation Studies, e.g.:
  – Domesticating: bring the original to the target audience
  – Foreignising: bring the target audience to the source text
[Diagram: Source (Language & Culture) and Target (Language & Culture)]
Translation Options
● Lui parti, j’ai retrouvé le calme.
J’étais épuisé et je me suis jeté sur ma couchette.
Je crois que j’ai dormi parce que je me suis réveillé avec des étoiles sur le visage.
Des bruits de campagne montaient jusqu’à moi.
Des odeurs de nuit, de terre et de sel rafraîchissaient mes tempes.
La merveilleuse paix de cet été endormi entrait en moi comme une marée.
A ce moment, et à la limite de la nuit, des sirènes ont hurlé.
Elles annonçaient des départs pour un monde qui maintenant m’était à jamais indifférent.
Pour la première fois depuis bien longtemps, j’ai pensé à maman.
Jones and Irvine (2013)
Toral and Way (2015)
Translation Options
● French
  J’étais épuisé et je me suis jeté sur ma couchette.
  Je crois que j’ai dormi parce que je me suis réveillé avec des étoiles sur le visage.
● English – translation A
  But all this excitement had exhausted me and I dropped heavily on to my sleeping plank.
  I must have had a longish sleep, for, when I woke, the stars were shining down on my face.
● English – translation B
  I was exhausted and threw myself on my bunk.
  I must have fallen asleep, because I woke up with the stars in my face.
● Is either of the translations better/easier for MT and/or corpus studies?
Translation Options
● French
  J’étais épuisé et je me suis jeté sur ma couchette.
  Je crois que j’ai dormi parce que je me suis réveillé avec des étoiles sur le visage.
● English – Gilbert (1946): domesticating, transcreation, free translation
  But all this excitement had exhausted me and I dropped heavily on to my sleeping plank.
  I must have had a longish sleep, for, when I woke, the stars were shining down on my face.
  BLEU 0.11, TER 0.80
● English – Ward (1989): foreignising, literal translation
  I was exhausted and threw myself on my bunk.
  I must have fallen asleep, because I woke up with the stars in my face.
  BLEU 0.28, TER 0.56
[3] Toral and Way. Machine-Assisted Translation of Literary Text: A Case Study. Translation Spaces. 2015.
Consequences
● Strategies such as domestication and transcreation lead to translations that can differ significantly from the source text
● This can be challenging for corpus-based research. Examples:
  – MT: poor performance because it is difficult to
    ● Find reliable word alignments (training)
    ● Find n-gram matches (automatic evaluation)
  – Finding corresponding tenses
    ● If they are the same in both languages
      – can this be attributed to similarity of use in both languages, or to the translation being foreignising?
    ● If they are different
      – can this be attributed to a divergence of preferred tense between the two languages, or to the stylistic choice of the translator?
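The n-gram matching problem can be illustrated with clipped bigram precision, the basic ingredient of BLEU (not full BLEU, which combines several n-gram orders and a brevity penalty). The MT output below is a hypothetical literal rendering invented for this sketch; the two references are the first sentences of the Gilbert and Ward translations quoted earlier, tokenised and lowercased by hand.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, reference, n=2):
    """Share of hypothesis n-grams also found in the reference (clipped counts)."""
    hyp = ngrams(hypothesis.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


# Hypothetical literal MT output, for illustration only.
mt_output = "i was exhausted and i threw myself on my bunk ."
gilbert = ("but all this excitement had exhausted me and i dropped "
           "heavily on to my sleeping plank .")
ward = "i was exhausted and threw myself on my bunk ."

print("vs Gilbert:", clipped_precision(mt_output, gilbert))
print("vs Ward:   ", clipped_precision(mt_output, ward))
```

A literal output shares most of its bigrams with the more literal reference and almost none with the freer one, which is why the choice of reference translation affects automatic scores so strongly.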
5. Conclusions
1. Web crawling
● Easy-to-use automatic procedure to acquire massive amounts of parallel data
● Can be paired with post-processing to
  – Remove noise: misalignments, MT
  – Find data for a specific domain of interest
2. Cleaning
● It is possible to clean very messy corpora...
● ...at the expense of removing a lot of data
● OpenSubs use case
  – Removed: 43%
  – MT scores increase
    ● HR -> EN: 37%
    ● EN -> HR: 145%
3. Translation strategies
● Be aware that human translations can differ substantially based on strategies, theories, etc.
● One important variable to take into account when using parallel corpora
● Should we do anything about it?
  – e.g. identify the translation strategy used for each parallel corpus
Acknowledgements
- Massive Acquisition
- Cleaning corpora
- Creative translators
PiPeNovel
The End! Questions?
