Bitextor: harvest your own parallel corpora from the Web, Miquel Esplà-Gomis, Universitat d’Alacant

Bitextor: harvest your own parallel
corpora from the Web
Miquel Esplà-Gomis
Universitat d’Alacant
mespla@dlsi.ua.es

Who is behind Bitextor?
✭ Transducens (Universitat d’Alacant)
✹ Parallel data crawling
✹ Rule-based machine translation
✹ Machine translation quality estimation
✹ Computer-aided translation
✹ ...
✭ Prompsit Language Engineering
✹ Parallel data curation
✹ Machine translation (rule-based, statistical,
neural, hybrid, etc.)
✹ Linguistic variant adaptation
✹ ...
UA + Prompsit:
✹ Apertium: Rule-based MT
✹ Abu-MaTran: Automatic building of MT
✹ Bitextor: Parallel data crawling

Our motivation
✭ Specific sources (legal, technical documentation, etc.) exhaustively
exploited
✹ Europarl, EMEA, MultiUN, TEDTalks, etc.
✹ Most of them available at OPUS [http://opus.lingfil.uu.se/]
✭ What about more general sources (translated websites, etc.)?
✹ good source of data for small languages
✹ easy to find domain-specific contents
✹ not as productive as crawling well-known websites...?

What is Bitextor?
✭ Free/open-source tool for automatically crawling the Web
✭ Crawl parallel data between any two languages
✭ Build parallel corpora from any XML-based data: XML, XHTML, OOXML (.docx), etc.
✭ It can generate TMX (for translators) or Moses-like plain text (for
training SMT)
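
The two output formats named on this slide can be illustrated with a minimal sketch. It is not Bitextor's own writer: the function names `to_moses` and `to_tmx` and the sample segments are assumptions made for illustration. It only shows the general shape of Moses-like plain text (two line-aligned files) and of a TMX 1.4 document (one `tu` per segment pair).

```python
import xml.etree.ElementTree as ET


def to_moses(pairs):
    """Return (source_text, target_text): one segment per line, line i aligned with line i."""
    src = [s.replace("\n", " ").strip() for s, _ in pairs]
    tgt = [t.replace("\n", " ").strip() for _, t in pairs]
    return "\n".join(src) + "\n", "\n".join(tgt) + "\n"


def to_tmx(pairs, src_lang, tgt_lang):
    """Return a TMX 1.4 document as a string, one translation unit per segment pair."""
    tmx = ET.Element("tmx", version="1.4")
    # Header attributes required by TMX 1.4; the values are illustrative.
    ET.SubElement(tmx, "header", {
        "creationtool": "example-sketch",
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical segment pairs standing in for crawled data.
    sample = [("Hello world", "Bonjour le monde"), ("Thank you", "Merci")]
    en_text, fr_text = to_moses(sample)
    print(en_text, fr_text, sep="")
    print(to_tmx(sample, "en", "fr"))
```

The Moses-like form suits SMT training pipelines that read two parallel files; the TMX form is what computer-aided translation tools import as a translation memory.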

Brief history of the project
✭ First version developed at Universitat d’Alacant in 2006
✭ Versions up to 2.0 were hard to compile and install: not
a good product
✭ Re-implemented in version 3.0 at Prompsit Language Engineering:
✹ Unix-pipeline architecture made of scripts
✹ highly scalable
✹ up-to-date external libraries and external tools
✹ good documentation and support
✭ Currently at version 5.0: dramatic improvement of performance!!

Why Bitextor?
✭ Customizable:
✹ Document- and segment-alignment quality threshold
✹ Several input/output formats available
✹ Time/size limit for crawling
✹ ...
✭ High performance in document alignment: precision ~90%, recall ~80%
✭ Fast and easy to use:
bitextor -v dic -u http://golftrotter.com en fr
+ 90 seconds = 4,056 pairs of segments!
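
For readers unfamiliar with the document-alignment figures quoted on this slide, the following minimal sketch shows how precision and recall are usually computed over document pairs. The function name `precision_recall` and the sample pairs are assumptions for illustration; the slides do not describe the evaluation setup actually used.

```python
def precision_recall(predicted, gold):
    """Compare predicted document pairs against a gold standard.

    precision = correct predicted pairs / all predicted pairs
    recall    = correct predicted pairs / all gold pairs
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical (source document, target document) pairs.
    gold = [("a/en", "a/fr"), ("b/en", "b/fr"), ("c/en", "c/fr"),
            ("d/en", "d/fr"), ("e/en", "e/fr")]
    predicted = [("a/en", "a/fr"), ("b/en", "b/fr"), ("c/en", "c/fr"),
                 ("d/en", "x/fr")]
    p, r = precision_recall(predicted, gold)
    print(f"precision={p:.2f} recall={r:.2f}")
```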

What do you need to run Bitextor?
✭ A Unix-like or POSIX-compliant operating system
✭ To follow the installation tutorial: https://sf.net/p/bitextor/wiki
✭ To identify one or more URLs to crawl
✭ A bilingual lexicon for the languages to crawl
✹ For many languages, you can download from:
https://sf.net/projects/bitextor/files/lexicons
✹ Bitextor can build a lexicon from parallel corpora
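
As a rough idea of how a bilingual lexicon can be derived from a parallel corpus, here is a minimal sketch. It assumes the word alignments have already been produced by an external aligner and simply turns link counts into relative frequencies. The function name `build_lexicon`, the `min_prob` threshold and the toy data are assumptions for illustration, not Bitextor's actual procedure.

```python
from collections import Counter, defaultdict


def build_lexicon(aligned_pairs, min_prob=0.1):
    """Build {source_word: {target_word: probability}} from word-aligned segments.

    aligned_pairs: iterable of (source_tokens, target_tokens, links), where
    links is a list of (source_index, target_index) alignment points.
    """
    counts = defaultdict(Counter)
    for src_tokens, tgt_tokens, links in aligned_pairs:
        for i, j in links:
            counts[src_tokens[i].lower()][tgt_tokens[j].lower()] += 1

    lexicon = {}
    for src_word, tgt_counts in counts.items():
        total = sum(tgt_counts.values())
        # Keep only translations whose relative frequency reaches min_prob.
        lexicon[src_word] = {
            tgt_word: count / total
            for tgt_word, count in tgt_counts.items()
            if count / total >= min_prob
        }
    return lexicon


if __name__ == "__main__":
    # Toy word-aligned corpus (hypothetical data).
    corpus = [
        ("the house".split(), "la maison".split(), [(0, 0), (1, 1)]),
        ("the car".split(), "la voiture".split(), [(0, 0), (1, 1)]),
        ("the book".split(), "le livre".split(), [(0, 0), (1, 1)]),
    ]
    for word, translations in build_lexicon(corpus).items():
        print(word, translations)
```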

Bitextor is a good team player
✭ Bitextor can be combined with Spiderling* to crawl top-level domains
✭ Crawl monolingual and parallel corpora at the same time
✭ No need to look for multilingual websites!
✭ ~100GB of data in one week!
* A monolingual crawler focused on linguistic resources

Useful for translation
Useful for translators or translation companies when ...
✭ ... a statistical MT system has to be built for a new pair of languages
(Abu-MaTran: en-hr, en-fi, ...)
✭ ... an MT system needs to be adapted to a new domain
✭ ... our translation memories (TM) need to be adapted to a specific domain
to improve coverage
✹ even more: for new clients, we may want to build a new TM using their documents

More than just translation
✭ build bilingual (and domain-specific?) lexicons: Bitextor uses MGIZA++ to
generate these lexicons from crawled data
✭ identify the parts of two documents that are parallel: for example, get the
translation of a word/sentence from an e-book in a foreign language
✭ get information about multilingualism to improve your strategy:
✹ discover potential customers by finding websites that need to be translated
✹ focus your efforts: crawl top-level domains to identify linguistic domains or
languages with a low ratio of translated documents

What will be next?
✭ improved segmentation and segment alignment
✭ new tool for cleaning translation memories
✭ Bitextor (and other crawlers) as a web service
✹ ask Prompsit info@prompsit.com
✭ Bitextor for Windows
✭ generation of “deferred” translation memories (Forcada, Esplà-Gomis and
Pérez-Ortiz, 2016)
Better corpora!
Improving usability!

Thank you very much for your attention!
Paldies jums par uzmanību!
we will be glad to hear from you at
bitextor-stuff@lists.sourceforge.net
