Parallel data has become an extremely valuable resource, not only for building new statistical machine translation systems, but also for building other useful resources for translators, such as bilingual concordancers, translation memories or bilingual lexicons. One of the most important and under-exploited sources of bilingual information is the Internet: many strategies have been proposed to crawl specific websites, but defining methods for surfing the whole Web and harvesting bitexts is still an open problem. Recently, the free/open-source tool Bitextor has become one of the reference tools for this task: it has been one of the basic tools featured in European projects such as Panacea or Abu-MaTran, and it has been chosen as the reference tool for the shared task on document alignment of the 1st Conference on Machine Translation (WMT 2016). In this presentation we will describe this tool, explaining the advantages when compared to other state-of-the-art tools, and the strategies chosen to crawl large amounts of parallel data.
Bitextor: harvest your own parallel corpora from the Web, Miquel Esplà-Gomis, Universitat d’Alacant
1. Bitextor: harvest your own parallel
corpora from the Web
Miquel Esplà-Gomis
Universitat d’Alacant
mespla@dlsi.ua.es
2. Who is behind Bitextor?
✭ Transducens (Universitat d’Alacant)
✹ Parallel data crawling
✹ Rule-based machine translation
✹ Machine translation quality estimation
✹ Computer-aided translation
✹ ...
✭ Prompsit Language Engineering
✹ Parallel data curation
✹ Machine translation (rule-based, statistical,
neural, hybrid, etc.)
✹ Linguistic variant adaptation
✹ ...
UA + Prompsit:
✹ Apertium: Rule-based MT
✹ Abu-MaTran: Automatic building of MT
✹ Bitextor: Parallel data crawling
3. Our motivation
✭ Specific sources (legal, technical documentation, etc.) have already been
exhaustively exploited
✹ Europarl, EMEA, MultiUN, TEDTalks, etc.
✹ Most of them available at OPUS [http://opus.lingfil.uu.se/]
✭ What about more general sources (translated websites, etc.)?
✹ good source of data for small languages
✹ easy to find domain-specific contents
✹ not as productive as crawling well-known websites...?
4. What is Bitextor?
✭ Free/open-source tool for automatically crawling the Web
✭ Crawl parallel data between any two languages
✭ Build parallel corpora from any XML-based data: XML, XHTML, OOXML (.docx), etc.
✭ It can generate TMX (for translators) or Moses-like plain text (for
training SMT)
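The two output formats mentioned on this slide can be illustrated with a minimal Python sketch. It is not Bitextor's own code: the function names `to_tmx` and `to_moses`, the header values and the sample segments are assumptions made for illustration. It writes the same segment pairs once as a small TMX document and once as two line-aligned plain-text strings of the kind used for SMT training.

```python
import xml.etree.ElementTree as ET

# Clark notation for the predefined xml: namespace, serialized as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def to_tmx(pairs, src_lang, tgt_lang):
    """Serialize (source, target) segment pairs as a minimal TMX 1.4 document."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-sketch",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per segment pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


def to_moses(pairs):
    """Return two line-aligned strings: line i of each is a translation pair."""
    src_lines = [src.replace("\n", " ") for src, _ in pairs]
    tgt_lines = [tgt.replace("\n", " ") for _, tgt in pairs]
    return "\n".join(src_lines) + "\n", "\n".join(tgt_lines) + "\n"


# Invented sample segments.
pairs = [("Hello world", "Bonjour le monde"), ("Thank you", "Merci")]
print(to_tmx(pairs, "en", "fr"))
source_side, target_side = to_moses(pairs)
```

The TMX form keeps each pair together with its language codes, which is what translation-memory tools expect; the plain-text form relies only on line position, so both sides must always have the same number of lines.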
5. Brief history of the project
✭ First version developed at Universitat d’Alacant in 2006
✭ Versions up to 2.0 were difficult to compile and install: not
a good product
✭ Re-implemented in version 3.0 at Prompsit Language Engineering:
✹ Unix-pipeline architecture made of scripts
✹ highly scalable
✹ up-to-date external libraries and external tools
✹ good documentation and support
✭ Currently at version 5.0: dramatic improvement of performance!!
6. Why Bitextor?
✭ Customizable:
✹ Document- and segment-alignment quality threshold
✹ Several input/output formats available
✹ Time/size limit for crawling
✹ ...
✭ High performance in document alignment: precision ~90%, recall ~80%
✭ Fast and easy to use:
bitextor -v dic -u http://golftrotter.com en fr
+ 90 seconds
= 4,056 pairs of segments!!!
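The relation between an alignment quality threshold and the precision/recall figures quoted on this slide can be sketched in Python. This is only an illustration: the function names, the scores, the URLs and the gold standard are invented, and the sketch does not reproduce how Bitextor scores document pairs.

```python
def filter_pairs(scored_pairs, threshold):
    """Keep candidate document pairs whose alignment score reaches the threshold."""
    return {(a, b) for a, b, score in scored_pairs if score >= threshold}


def precision_recall(predicted, gold):
    """Precision and recall of predicted pairs against a gold-standard set."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Invented toy data: (source URL, target URL, alignment score).
candidates = [
    ("site/en/a", "site/fr/a", 0.92),
    ("site/en/b", "site/fr/b", 0.81),
    ("site/en/c", "site/fr/x", 0.35),  # wrong pair with a low score
    ("site/en/d", "site/fr/d", 0.30),  # correct pair with a low score
]
gold = {
    ("site/en/a", "site/fr/a"),
    ("site/en/b", "site/fr/b"),
    ("site/en/d", "site/fr/d"),
}

for threshold in (0.2, 0.5):
    p, r = precision_recall(filter_pairs(candidates, threshold), gold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

On this toy data the lower threshold gives precision 0.75 and recall 1.00, while the higher one gives precision 1.00 and recall 0.67: raising the threshold trades recall for precision.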
7. What do you need to run Bitextor?
✭ A Unix-like (POSIX) operating system
✭ To follow the installation tutorial: https://sf.net/p/bitextor/wiki
✭ To identify one or more URLs to crawl
✭ To have a bilingual lexicon for the languages to crawl
✹ For many languages, you can download from:
https://sf.net/projects/bitextor/files/lexicons
✹ Bitextor can build a lexicon from parallel corpora
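One plausible use of a bilingual lexicon when pairing documents is to measure how many words of one document have a known translation in the other. The Python sketch below shows that idea only: the slides do not describe how Bitextor actually uses the lexicon, so the scoring function `lexicon_overlap`, the lexicon format and the sample documents are all assumptions.

```python
def lexicon_overlap(src_tokens, tgt_tokens, lexicon):
    """Fraction of distinct source words found in the lexicon that have
    at least one of their translations present in the target document."""
    tgt_vocab = set(tgt_tokens)
    src_vocab = {w for w in src_tokens if w in lexicon}
    if not src_vocab:
        return 0.0
    covered = sum(1 for w in src_vocab if lexicon[w] & tgt_vocab)
    return covered / len(src_vocab)


# Hypothetical lexicon: source word -> set of possible translations.
lexicon = {
    "house": {"maison"},
    "red": {"rouge"},
    "car": {"voiture", "auto"},
}

en_doc = "the red house".split()
fr_candidate_1 = "la maison rouge".split()
fr_candidate_2 = "une voiture bleue".split()

print(lexicon_overlap(en_doc, fr_candidate_1, lexicon))  # both known words are covered
print(lexicon_overlap(en_doc, fr_candidate_2, lexicon))  # no known word is covered
```

A score like this depends on the coverage of the lexicon, which is one reason why a lexicon for the language pair is listed as a requirement.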
8. Bitextor is a good team player
✭ Bitextor can be combined with Spiderling* to crawl top-level domains
✭ Crawl monolingual and parallel corpora at the same time
✭ No need to look for multilingual websites!
✭ ~100GB of data in one week!
* A monolingual crawler focused on linguistic resources
9. Useful for translation
Useful for translators or translation companies when ...
✭ ... a statistical MT system has to be built for a new pair of languages
(Abu-MaTran: en-hr, en-fi, ...)
✭ ... an MT system has to be adapted to a new domain
✭ ... our translation memories (TM) need to be adapted to a specific domain
to improve coverage
✹ even more: for new clients, we may want to build a new TM using their documents
10. More than just translation
✭ build bilingual (and domain-specific?) lexicons: Bitextor uses MGIZA++ to
generate these lexicons from crawled data
✭ identify the parts of two documents that are parallel: for example, get the
translation of a word/sentence from an e-book in a foreign language
✭ get information about multilingualism to improve your strategy:
✹ discover potential customers by finding websites that need to be translated
✹ focus your efforts: crawl top-level domains to identify linguistic domains or
languages with a low ratio of translated documents
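The idea of deriving a bilingual lexicon from crawled segment pairs can be sketched in Python. Note that this sketch does not use MGIZA++ or statistical word alignment: it swaps in a much simpler co-occurrence measure (the Dice coefficient) purely to illustrate the principle. The function names and the sample segments are assumptions.

```python
from collections import Counter
from itertools import product


def dice_lexicon(segment_pairs, min_score=0.5):
    """Score word pairs by the Dice coefficient of their co-occurrence
    across aligned segments; keep pairs scoring at least min_score."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in segment_pairs:
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        joint.update(product(src_words, tgt_words))
    lexicon = {}
    for (s, t), n in joint.items():
        score = 2 * n / (src_count[s] + tgt_count[t])
        if score >= min_score:
            lexicon.setdefault(s, {})[t] = score
    return lexicon


def best_translation(lexicon, word):
    """Highest-scoring candidate translation for a word, or None if unknown."""
    candidates = lexicon.get(word)
    return max(candidates, key=candidates.get) if candidates else None


# Invented sample of aligned segments.
segments = [
    ("red house", "maison rouge"),
    ("red car", "voiture rouge"),
    ("blue house", "maison bleue"),
]
lexicon = dice_lexicon(segments)
for word in ("red", "house", "car", "blue"):
    print(word, "->", best_translation(lexicon, word))
```

On three segments the measure already separates "rouge" from "maison" as the translation of "red", because "red" and "rouge" always appear together. Real word aligners model this far more carefully, which is why a dedicated tool is used in practice.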
11. What will be next?
✭ improved segmentation and segment alignment
✭ new tool for cleaning translation memories
✭ Bitextor (and other crawlers) as a web service
✹ ask Prompsit info@prompsit.com
✭ Bitextor for Windows
✭ generation of “deferred” translation memories (Forcada, Esplà-Gomis and
Pérez-Ortiz, 2016)
Better corpora!
Improving usability!
12. Thank you very much for your attention!
Paldies jums par uzmanību!
we will be glad to hear from you at
bitextor-stuff@lists.sourceforge.net