Bitextor: harvest your own parallel
corpora from the Web
Miquel Esplà-Gomis
Universitat d’Alacant
mespla@dlsi.ua.es
Who is behind Bitextor?
✭ Transducens (Universitat d’Alacant)
✹ Parallel data crawling
✹ Rule-based machine translation
✹ Machine translation quality estimation
✹ Computer-aided translation
✹ ...
✭ Prompsit Language Engineering
✹ Parallel data curation
✹ Machine translation (rule-based, statistical,
neural, hybrid, etc.)
✹ Linguistic variant adaptation
✹ ...
UA + Prompsit:
✹ Apertium: Rule-based MT
✹ Abu-MaTran: Automatic building of MT
✹ Bitextor: Parallel data crawling
Our motivation
✭ Specific sources (legal, technical documentation, etc.) exhaustively
exploited
✹ Europarl, EMEA, MultiUN, TEDTalks, etc.
✹ Most of them available at OPUS [http://opus.lingfil.uu.se/]
✭ What about more general sources (translated websites, etc.)?
✹ good source of data for small languages
✹ easy to find domain-specific contents
✹ not as productive as crawling well-known websites...?
What is Bitextor?
✭ Free/open-source tool for automatically crawling the Web
✭ Crawl parallel data between any two languages
✭ Build parallel corpora from any XML-based data: XML, XHTML, OOXML (.docx), etc.
✭ It can generate TMX (for translators) or Moses-like plain text (for
training SMT)
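To make the two output formats concrete, here is a minimal sketch of how a list of aligned segment pairs could be written as Moses-like plain text (one segment per line, with the two languages aligned line by line) and as a bare-bones TMX document. The function names `to_moses` and `to_tmx`, the header values and the sample sentences are assumptions made for illustration; this is not Bitextor's own writer.

```python
import xml.etree.ElementTree as ET

# ElementTree serialises this qualified name as the standard xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def to_moses(pairs):
    """Return (source_text, target_text), aligned line by line."""
    src = "".join(s + "\n" for s, _ in pairs)
    tgt = "".join(t + "\n" for _, t in pairs)
    return src, tgt


def to_tmx(pairs, src_lang, tgt_lang):
    """Return a minimal TMX 1.4 document as a string (illustrative header values)."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-sketch",
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src_seg, tgt_seg in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per segment pair
        for lang, text in ((src_lang, src_seg), (tgt_lang, tgt_seg)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


# Invented sample data.
pairs = [("Hello world", "Bonjour le monde"), ("Thank you", "Merci")]
moses_src, moses_tgt = to_moses(pairs)
tmx_xml = to_tmx(pairs, "en", "fr")
```

The plain-text pair is the form that SMT training tools expect, while the TMX string can be loaded into a computer-aided translation tool as a translation memory.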
Brief history of the project
✭ First version developed at Universitat d’Alacant in 2006
✭ Up to version 2.0, it was hard to compile and install: not a good
product
✭ Re-implemented in version 3.0 at Prompsit Language Engineering:
✹ Unix-pipeline architecture made of scripts
✹ highly scalable
✹ up-to-date external libraries and tools
✹ good documentation and support
✭ Currently at version 5.0: dramatic improvement in performance!
Why Bitextor?
✭ Customizable:
✹ Document- and segment-alignment quality threshold
✹ Several input/output formats available
✹ Time/size limit for crawling
✹ ...
✭ High performance in document alignment: precision ~90%, recall ~80%
✭ Fast and easy to use:
bitextor -v dic -u http://golftrotter.com en fr
+ 90 seconds
= 4,056 pairs of segments!
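As a reminder of what the quoted figures measure, the sketch below computes precision and recall for document alignment on invented toy data. The counts are chosen only so that the result matches the approximate 90% and 80% on the slide; they are not the data behind the reported evaluation. The last lines turn the 90-second example run into a throughput figure.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted document pairs against a gold standard."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Invented toy data: 45 true document pairs on a hypothetical website.
gold = {(f"en/page{i}", f"fr/page{i}") for i in range(45)}
# The aligner proposes 40 pairs: 36 of them are right, 4 are wrong.
predicted = {(f"en/page{i}", f"fr/page{i}") for i in range(36)}
predicted |= {(f"en/page{i}", f"fr/page{i + 1}") for i in range(36, 40)}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")

# Throughput of the run quoted on the slide: 4,056 segment pairs in 90 seconds.
pairs_per_second = 4056 / 90
print(f"{pairs_per_second:.1f} segment pairs per second")
```

Precision answers "how many of the proposed document pairs are really translations of each other", recall answers "how many of the true pairs were found". The quoted run works out to about 45 segment pairs per second.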
What do you need to run Bitextor?
✭ A Unix-like (POSIX-compliant) operating system
✭ Bitextor installed by following the tutorial: https://sf.net/p/bitextor/wiki
✭ One or more URLs to crawl
✭ A bilingual lexicon for the languages to crawl
✹ For many languages, you can download from:
https://sf.net/projects/bitextor/files/lexicons
✹ Bitextor can build a lexicon from parallel corpora
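To illustrate why a bilingual lexicon helps, here is a minimal sketch of one way a lexicon can be used to compare two documents: count how many source-language words covered by the lexicon have at least one translation in the candidate target document. The function `translated_overlap`, the lexicon layout (a word mapped to a set of translations) and the sample data are assumptions made for this example; Bitextor's actual scoring may differ.

```python
def translated_overlap(src_tokens, tgt_tokens, lexicon):
    """Fraction of lexicon-covered source words with a translation in the target."""
    tgt_vocab = set(tgt_tokens)
    src_vocab = {w for w in set(src_tokens) if w in lexicon}
    if not src_vocab:
        return 0.0  # nothing to compare: the lexicon covers no source word
    hits = sum(1 for w in src_vocab if lexicon[w] & tgt_vocab)
    return hits / len(src_vocab)


# Invented toy English-French lexicon: word -> set of possible translations.
lexicon = {
    "golf": {"golf"},
    "course": {"parcours", "cours"},
    "hotel": {"hôtel"},
}

en_doc = "golf course near the hotel".split()
fr_parallel = "parcours de golf près de l' hôtel".split()
fr_unrelated = "recette de cuisine".split()

score_parallel = translated_overlap(en_doc, fr_parallel, lexicon)
score_unrelated = translated_overlap(en_doc, fr_unrelated, lexicon)
```

A pair of documents that are translations of each other should score much higher than an unrelated pair, which is what makes a threshold on such a score usable for filtering candidates.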
Bitextor is a good team player
✭ Bitextor can be combined with Spiderling* to crawl top-level domains
✭ Crawl monolingual and parallel corpora at the same time
✭ No need to look for multilingual websites!
✭ ~100GB of data in one week!
* A monolingual crawler focused on linguistic resources
Useful for translation
Useful for translators or translation companies when ...
✭ ... a statistical MT system has to be built for a new pair of languages
(Abu-MaTran: en-hr, en-fi, ...)
✭ ... an MT system needs to be adapted to a domain
✭ ... our translation memories (TM) need to be adapted to a specific domain
to improve coverage
✹ even more: for new clients, we may want to build a new TM using their documents
More than just translation
✭ build bilingual (and domain-specific?) lexicons: Bitextor uses MGIZA++ to
generate these lexicons from crawled data
✭ identify the parts of two documents that are parallel: for example, get the
translation of a word/sentence from an e-book in a foreign language
✭ get information about multilingualism to improve your strategy:
✹ discover potential customers by finding websites that need to be translated
✹ focus your efforts by identifying linguistic domains or languages with a low ratio of
translated documents by crawling top-level domains
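The "ratio of translated documents" can be sketched as a simple count over the output of a crawl. The snippet below assumes two hypothetical inputs: a mapping from each crawled document to its detected language, and a list of aligned document pairs. Both data structures and the function name `translated_ratio` are invented for illustration and do not correspond to a Bitextor output format.

```python
from collections import Counter


def translated_ratio(doc_langs, aligned_pairs):
    """Per language, the share of crawled documents that have an aligned translation."""
    totals = Counter(doc_langs.values())
    translated = Counter()
    seen = set()
    for doc_a, doc_b in aligned_pairs:
        for doc in (doc_a, doc_b):
            if doc not in seen:  # count each document only once
                seen.add(doc)
                translated[doc_langs[doc]] += 1
    return {lang: translated[lang] / total for lang, total in totals.items()}


# Invented crawl of a hypothetical site: four Latvian pages, one English page.
doc_langs = {
    "example.lv/1": "lv",
    "example.lv/2": "lv",
    "example.lv/3": "lv",
    "example.lv/4": "lv",
    "example.lv/en/1": "en",
}
aligned_pairs = [("example.lv/1", "example.lv/en/1")]

ratios = translated_ratio(doc_langs, aligned_pairs)
```

In this toy crawl only a quarter of the Latvian pages have a translation, which is the kind of signal the slide suggests using to spot content that still needs translating.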
What will be next?
✭ improved segmentation and segment alignment
✭ new tool for cleaning translation memories
✭ Bitextor (and other crawlers) as a web service
✹ ask Prompsit info@prompsit.com
✭ Bitextor for Windows
✭ generation of “deferred” translation memories (Forcada, Esplà-Gomis and
Pérez-Ortiz, 2016)
Better corpora!
Improving usability!
Thank you very much for your attention!
Paldies jums par uzmanību!
we will be glad to hear from you at
bitextor-stuff@lists.sourceforge.net
