State of the Machine Translation by Intento (November 2017)
23,764 views

Evaluation of 11 major Machine Translation providers (Google, Microsoft, IBM, SAP, Yandex, SDL, Systran, Baidu, GTCom, PROMT, DeepL) for the 35 most popular language pairs: performance, quality, language coverage, API update frequency.

Published in: Technology
  • Thanks for your presentation, but I would like to point out that api.ai (now called Dialogflow) is not completely free; it runs on a freemium model.
  • Dion, my apologies for the misleading title; we just wanted to make it more catchy. Sorry for that. Ideally, we want (and we will eventually) to show the price-quality range for every major domain, and this inevitably requires evaluating domain-specific MT engines. However, most of the customised solutions (including Omniscien) refused to participate in such a study. On the one hand, it's understandable (a poorly cooked report may harm a fine brand). Still, we are going to solve this sooner or later: we have a number of clients who use customised MT engines, and we are going to ask for their help in evaluating this type of MT technology. Another problem is domain-specific datasets, which we still need to solve somehow.
  • While the content is certainly interesting, the title is misleading. The research shows the state of non-customized machine translation engines, but as written it implies all machine translation. The research does not show what is possible with customization or what kind of machine translation was used (rules, SMT, NMT, etc.). As with a human, if you take someone trained in a specialty domain (e.g. life sciences), they will translate better than someone with only general translation skills. Machine translation is the same. I recently did a blog post on this topic that shows how easily Google and others can be surpassed in quality with relatively small amounts of training data specialized in a domain such as life sciences: https://omniscien.com/riding-machine-translation-hype-cycle. A professionally customized engine takes into account domain and client terminology, non-translatable terms, writing style and more. The generic engines listed in this report are designed to translate anything for anyone, anytime. As they translate at the sentence level only, there are huge amounts of ambiguity that a custom engine would resolve but a non-customized engine would not. It would be interesting to see similar research on MT vendors that provide customization, based on the same data provided by a client, going through their respective customization processes and seeing the quality of the outcome. This would complement this report and provide a more complete state of machine translation than generic MT systems alone.
  • @HansemEUG, Inc. That's interesting. We could easily compute TER as well and see how it correlates with hLEPOR. Do you think it would be good to include in the next report?
  • Great insights, Leonid! I think TER will be the main metric for assessing payment for MTPE tasks, but I think there is currently no metric that can directly represent or accurately measure the performance of MT engine output.


  1. State of the Machine Translation by Intento, November 2017
  2. About
     • At Intento, we want to make Machine Intelligence services easy to discover, choose and use.
     • So far, evaluation is the most problematic part: to compare different services, one needs to sign a lot of contracts and integrate a lot of APIs.
     • We deliver this overview report for FREE. To get the full report or to evaluate on your own dataset, contact us.
     • Also, check out our Natural Language Understanding Benchmark. NLU may help you automate workflows beyond automated translation.
     November 2017 © Intento, Inc.
  3. Overview
     11 Machine Translation Services · 35 Language Pairs
     TRANSLATION QUALITY · LANGUAGE COVERAGE · DEVELOPER EXPERIENCE · MISCELLANEOUS
     Get the full version of this report.
  4. Changes since July 2017
     • +2 vendors: DeepL (beta), SAP (beta)
     • +21 language pairs
     • Detailed performance analysis
     • Developer experience comparison
  5. Machine Translation Services Compared*
     * We have evaluated general-purpose Cloud Machine Translation services with prebuilt translation models, provided via API. Some vendors also provide web-based, on-premise or custom MT engines, which may differ in all aspects from what we've evaluated.
     Baidu Translate API · DeepL API (beta) · Google Cloud Translation API · GTCom YeeCloud MT · IBM Watson Language Translator · Microsoft Translator Text API · PROMT Cloud API · SAP Translation Hub (beta) · SDL Cloud Machine Translation · Systran REST Translation API · Yandex Translate API
  6. Translation Quality
     Evaluation Methodology · Overall Performance · Available MT Quality · Price vs. Performance
  7. Evaluation methodology (I)
     • Translation quality is evaluated by computing the LEPOR score between reference translations and the MT output (Slide 9).
     • Currently, our goal is to evaluate the performance of translation between the most popular languages (Slide 10).
     • We use public datasets from StatMT/WMT and CASMACAT News Commentary (Slide 11).
     • We have performed a LEPOR metric convergence analysis to identify the minimal viable number of segments in the dataset. See Slide 12 for details.
  8. Evaluation methodology (II)
     • We consider MT service A more performant than service B for language pair C if:
       - the mean LEPOR score of A is greater than that of B for pair C, and
       - the lower bound of A's 95% LEPOR confidence interval is greater than the upper bound of B's confidence interval for pair C.
       See Slide 12 for an example.
     • Different language pairs (and different datasets) impose different translation complexity. To compare the overall MT performance of different services, we regularise LEPOR scores across all language pairs (see Appendix A for details).
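The decision rule above can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes per-sentence LEPOR scores are available for each service, and it uses a normal-approximation 95% confidence interval for the mean; the report's exact procedure may differ.

```python
import math
import statistics

def confidence_interval(scores, z=1.96):
    """Normal-approximation 95% confidence interval for the mean
    of per-sentence LEPOR scores."""
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean - z * sem, mean + z * sem

def a_outperforms_b(scores_a, scores_b):
    """A is more performant than B if A's mean score is higher AND
    the lower bound of A's 95% CI exceeds the upper bound of B's."""
    lo_a, _ = confidence_interval(scores_a)
    _, hi_b = confidence_interval(scores_b)
    return statistics.mean(scores_a) > statistics.mean(scores_b) and lo_a > hi_b
```

With well-separated score distributions the rule fires; with overlapping confidence intervals the two services are treated as indistinguishable for that pair.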
  9. LEPOR score
     • LEPOR: an automatic machine translation evaluation metric combining an enhanced Length Penalty, an n-gram Position-difference Penalty and Recall. LIKE BLEU, BUT BETTER.
     • In our evaluation we used hLEPORA v.3.1, the best metric from the ACL-WMT 2013 contest.
     https://www.slideshare.net/AaronHanLiFeng/lepor-an-augmented-machine-translation-evaluation-metric-thesis-ppt
     https://github.com/aaronlifenghan/aaron-project-lepor
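As a rough illustration of the ingredients named above, here is a toy unigram-only sketch combining a length penalty, a position-difference penalty and a harmonic mean of precision and recall. It is not the hLEPORA v.3.1 implementation linked above, which works on n-grams with tuned weights; every simplification here is ours.

```python
import math

def length_penalty(out_len, ref_len):
    # Penalise outputs that are shorter or longer than the reference.
    if out_len == ref_len:
        return 1.0
    if out_len < ref_len:
        return math.exp(1 - ref_len / out_len)
    return math.exp(1 - out_len / ref_len)

def position_penalty(out, ref):
    # Position-difference penalty, simplified to unigram matches at
    # their nearest relative position in the reference.
    diffs = []
    for i, w in enumerate(out):
        positions = [j for j, r in enumerate(ref) if r == w]
        if positions:
            diffs.append(min(abs(i / len(out) - j / len(ref)) for j in positions))
    npd = sum(diffs) / len(out) if out else 1.0
    return math.exp(-npd)

def harmonic_pr(out, ref, alpha=1.0, beta=1.0):
    # Weighted harmonic mean of unigram precision and recall.
    matches = sum(1 for w in out if w in ref)
    if matches == 0:
        return 0.0
    precision = matches / len(out)
    recall = matches / len(ref)
    return (alpha + beta) / (alpha / recall + beta / precision)

def simple_lepor(output, reference):
    out, ref = output.split(), reference.split()
    return (length_penalty(len(out), len(ref))
            * position_penalty(out, ref)
            * harmonic_pr(out, ref))
```

An identical output scores 1.0, a word-order scramble is penalised only by the position term, and a translation with no overlapping words scores 0.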
  10. Language Pairs
     We focus on en-P1, P1-en and (partially) P1-P1 pairs.
     Language groups by web popularity*:
     P1 - ≥ 2.0% of websites
     P2 - 0.5%-2% of websites
     P3 - 0.1%-0.3% of websites
     P4 - <0.1% of websites
     * https://w3techs.com/technologies/overview/content_language/all
     [Matrix: ✓ marks the evaluated pairs among en, ru, ja, de, es, fr, pt, it, zh, cs, tr, fi, ro; the full list of pairs corresponds to the Datasets slide.]
  11. Datasets
     • WMT-2013 (translation task, news domain): en-es, es-en
     • WMT-2015 (translation task, news domain): fr-en, en-fr
     • WMT-2016 (translation task, news domain): cs-en, en-cs, de-en, en-de, ro-en, en-ro, fi-en, en-fi, ru-en, en-ru, tr-en, en-tr
     • WMT-2017 (translation task, news domain): zh-en, en-zh
     • NewsCommentary-2011: en-ja, ja-en, en-pt, pt-en, en-it, it-en, ru-de, ru-es, ru-fr, ru-pt, ja-fr, de-ja, es-zh, fr-ru, fr-es, it-pt, zh-it
  12. LEPOR Convergence
     We used 1440-3000 sentences per language pair. In all cases it is clear that the metric stabilises, and adding more sentences from the same domain won't change the outcome.
     [Charts: regularised hLEPOR score vs. number of sentences — aggregated mean and confidence interval across all language pairs, plus examples for individual pairs.]
     Detailed data on each language pair is provided in the full report.
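A convergence check of this kind can be approximated by tracking the running mean of the score as sentences are added. The `window` and `tolerance` values below are made-up illustrations, not the thresholds used in the report.

```python
def running_means(scores):
    """Running mean of the score after each additional sentence."""
    total, means = 0.0, []
    for i, s in enumerate(scores, 1):
        total += s
        means.append(total / i)
    return means

def has_converged(scores, window=500, tolerance=0.005):
    """Treat the metric as stable when the running mean moves by less
    than `tolerance` over the last `window` sentences."""
    means = running_means(scores)
    if len(means) < window:
        return False
    tail = means[-window:]
    return max(tail) - min(tail) < tolerance
```

A sample that is too small, or whose running mean is still drifting, is reported as not converged, which is the cue to add more sentences.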
  13. Overall Performance
     35 language pairs, 1440-3000 sentences per pair.
     [Chart: regularised hLEPOR scores per provider; scores vary among language pairs from under 40% to over 70%.]
     Detailed data on each language pair is provided in the full report.
  14. Available MT Quality
     [Matrix: for each evaluated language pair, the maximal achieved hLEPOR score (30%-70%), the number of top-performing MT providers, and the minimal price for this quality per 1M characters: $$$ ≥ $20, $$ $10-15, $ < $10.]
     Detailed data on each language pair is provided in the full report.
  15. Sample pair analysis: en-pt
     LEPOR score | Providers         | Price range (per 1M characters)
     77%         | Google            | $20
     72%         | Yandex, Microsoft | $4.5-$15
     70%         | Baidu, SDL, IBM   | $8-$21
     62%         | Systran, PROMT    | $3-$8
     BEST QUALITY: Google · BEST PRICE: PROMT · PRICE & QUALITY: Microsoft
     All 35 pairs are available in the full report.
  16. Price vs. Performance (as of November 2017)
     [Chart: providers plotted by affordability vs. performance, with COST-EFFECTIVE, ACCURATE, FREE (BETA) and NOT SET YET regions.]
     Performance: regularised hLEPOR score aggregated across all language pairs in the dataset. Affordability = 1/price, using public volume-based pricing tiers.
     Legend: performance range (regularised average, max and min across all pairs) and price range.
     Detailed data on each language pair is provided in the full report.
  17. Language Coverage
     Supported and Unique per Provider · Coverage by Language Popularity
  18. Language coverage
     [Chart: total and unique supported language pairs per provider (log scale) for Google, Yandex, Microsoft, Baidu, Systran, SDL MT, PROMT, SAP, DeepL, IBM Watson and GTCom. Unique language pairs are those supported exclusively by one provider.]
     Detailed data on the supported languages is provided in the full report.
  19. Language popularity
     Language groups by web popularity*:
     P1 - ≥ 2.0% of websites: en, ru, ja, de, es, fr, pt, it, zh
     P2 - 0.5%-2% of websites: pl, fa, tr, nl, ko, cs, ar, vi, el, sv, in, ro, hu
     P3 - 0.1%-0.3% of websites: da, sk, fi, th, bg, he, lt, uk, hr, no, nb, sr, ca, sl, lv, et
     P4 - <0.1% of websites: hi, az, bs, ms, is, mk, bn, eu, ka, sq, gl, mn, kk, hy, se, uz, kr, ur, ta, nn, af, be, si, my, br, ne, sw, km, fil, ml, pa, …
     A total of 29070 pairs is possible; 12989 are supported across all providers.
     * https://w3techs.com/technologies/overview/content_language/all
  20. Supported language pairs by popularity
     [Chart: share of supported pairs per popularity-group combination (P1-P4), ranging from 100% down to 31%.]
     Overall, 45%* of possible language pairs are covered.
     * Up from 44% in July 2017, as we now better distinguish variations of the Chinese language.
  21. Language coverage by service provider
     Google Cloud Translation API · Microsoft Translator Text API · Yandex Translate API · Systran REST Translation API · SDL Cloud Machine Translation · PROMT Cloud API · IBM Watson Language Translation · Baidu Translate API · GTCom YeeCloud API · DeepL API · SAP Translation Hub
     Detailed data on the supported languages is provided in the full report.
  22. Developer Experience (DX)
     Evaluation Methodology · DX Charts
  23. Evaluation methodology (I)
     Here we evaluate overall service organisation from the following angles:
     • Product - support of Machine Translation features desired for using the API in various MT scenarios
     • Design - overall API design and technical convenience
     • Documentation - how well the API is documented
     • Onboarding - how easy it is to integrate and start using the API
     • Commercial - flexibility of the commercial terms
     • Implementation - important low-level features of the API
     • Maintenance - convenience of getting information about API changes for ongoing support
     • Reliability - various technical issues we've encountered
     Some references:
     • http://talks.kinlane.com/apistrat/api101/index.html#/14
     • https://mathieu.fenniak.net/the-api-checklist/
     • https://www.slideshare.net/jmusser/ten-reasons-developershateyourapi
     • https://restfulapi.net/richardson-maturity-model/
     • https://github.com/shieldfy/API-Security-Checklist
     • https://nordicapis.com/why-api-developer-experience-matters-more-than-ever/
     • http://www.drdobbs.com/windows/measuring-api-usability/184405654?pgno=1
  24. Evaluation methodology (II)
     Criteria groups: Product, Design, Documentation, Onboarding, Commercial, Implementation, Maintenance, Reliability.
     Product: translation domains, translation engines, language autodetect, glossaries, TM support, custom engines, bulk mode, formatted text, XLIFF support.
     The remaining groups cover, among others: authentication, use of SSL, quota/domain/balance info, self-sufficient and intuitive design, versioning, bulk mode, task-invocation ratio, I/O structure, list of endpoints, user documentation, supported languages, quotas, response codes, error codes and messages, API explorer and console, number of docs, HTML doc, self-registration, self-issued keys, self-payment, free/trial plan, sandbox, test console, GitHub repo, code libraries, SDK/PDK, sample code, direct support, ticket system, self-support, tutorial, FAQ/KB, starter package, public pricing, pay-as-you-go, post-paid, volume discounts, payment systems, billing history, API spec, data compression, JSON support, negotiable content, Unicode support, news source and subscriptions, changelog, release notes, roadmap, status dashboard, developer dashboard, exportable logs, uptime, sporadic errors, bugs, performance issues, outage alerts.
  25. Developer experience by service provider
     Available in the full report.
  26. Miscellaneous
     API change frequency
  27. API change frequency
     API & documentation changes since July 2017:
     SDL Language Cloud: 31
     Google Translate API: 14
     IBM Watson Translator: 10
     Microsoft Translator API: 7
     SAP Translation Hub: 6
  28. Detailed version of this report
     • We give this overview version away for free.
     • The full evaluation report contains:
       - a detailed best-deal analysis for each of the 35 language pairs
       - a developer experience analysis for each of the 11 MT providers
     • Also, by ordering the full report you support our ongoing evaluation of Cloud MT.
     • To get the full report, reach us at hello@inten.to
  29. Intento Service Platform
     • Discover the best service providers for your AI task.
     • Evaluate performance on your own data at a fraction of the potential cost savings.
     • Access any provider with no effort using our Single API.
  30. Intento
     https://inten.to
     Konstantin Savenkov, CEO, Intento, Inc.
     <ks@inten.to>
  31. Appendix A
     Overall performance of the MT services across many language pairs is computed in the following way:
     1. [Standardisation] We compute the mean language-standardised LEPOR score (z-score) for each provider.
     2. [Scale adjustment] We restore the original scale by multiplying each provider's z-score by the global LEPOR standard deviation and adding the global mean LEPOR score.
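The two steps above can be sketched as follows, assuming one mean LEPOR score per provider per language pair. This is a hypothetical sketch; the report's exact aggregation may differ.

```python
import statistics

def regularise(scores_by_pair):
    """scores_by_pair: {language_pair: {provider: mean LEPOR score}}.
    Step 1: standardise scores within each language pair (z-scores)
    and average them per provider.
    Step 2: map the averaged z-scores back to the original LEPOR
    scale using the global mean and standard deviation."""
    all_scores = [s for pair in scores_by_pair.values() for s in pair.values()]
    g_mean = statistics.mean(all_scores)
    g_std = statistics.pstdev(all_scores)
    z_sums, counts = {}, {}
    for pair_scores in scores_by_pair.values():
        p_mean = statistics.mean(pair_scores.values())
        p_std = statistics.pstdev(pair_scores.values()) or 1.0  # guard: all-equal pair
        for provider, score in pair_scores.items():
            z_sums[provider] = z_sums.get(provider, 0.0) + (score - p_mean) / p_std
            counts[provider] = counts.get(provider, 0) + 1
    return {p: g_mean + (z_sums[p] / counts[p]) * g_std for p in z_sums}
```

Standardising first removes the per-pair difficulty offset, so a provider is rewarded for being good relative to its competitors on each pair rather than for being evaluated on easy pairs.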
