This document discusses how big data from cloud-based translation tools can provide insights through benchmarking and analytics. It provides examples of metrics that can be tracked, such as translation productivity, use of translation memory and machine translation, and project manager performance. Aggregating anonymous data from many users could establish universal performance standards, help identify areas for process improvement, and provide cost savings estimates for tools like translation memory. However, challenges include cleaning and interpreting diverse data from many users and systems.
2. Quick intro
• 2010: Memsource founded
• 2015: 50,000 users & 100+ million words translated monthly
• Some of the world’s largest translation providers and buyers are customers (e.g. SEGA, FUJIFILM)
3. Cloud tools lead to Big Data
• Server tools – private data silos
• Cloud tools – centralized data
4. And the clouds are getting bigger…
In May alone, users processed 0.8 billion words in Memsource
7. Example problem - quality
• Free testing
• Since the end of LISA everyone has a unique quality metric
• Can we embed a certain standard into the tool itself?
12. So what can we track there?
• In theory, anything:
    • Translation data
    • Productivity
    • Business analytics
    • Notifications
• In practice (challenges):
    • Data clean-up
    • Relevance
    • Interpretation
14. Users save 10 to 40% with TM
[Chart: Overall TM leverage by top 50 volume users. X-axis: users (1 to 50). Y-axis: share of words (0% to 100%). Series: repetitions, tm.match101, tm.match100, tm.match95, tm.match85, tm.match75, tm.match50, tm.match0.]
Data for jobs where post-editing analysis has been performed, December 2015 - May 2016
• Sample: 9 bn words
• Savings: approx. $300 million
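As a rough illustration of how a savings share like the 10 to 40% above could be derived from per-band word counts, the Python sketch below weights each match band by a discount rate. The band names come from the chart legend; the discount rates and the example word counts are hypothetical placeholders, not Memsource figures.

```python
# Hypothetical discount per match band: the fraction of the full per-word
# rate that is NOT charged for words in that band. These values are
# assumptions for illustration only.
ASSUMED_DISCOUNT = {
    "repetitions": 0.90,
    "tm.match101": 0.90,
    "tm.match100": 0.80,
    "tm.match95": 0.60,
    "tm.match85": 0.40,
    "tm.match75": 0.20,
    "tm.match50": 0.0,
    "tm.match0": 0.0,
}


def tm_savings_share(word_counts):
    """Return the fraction of full cost saved, given words per match band."""
    total = sum(word_counts.values())
    if total == 0:
        return 0.0
    saved = sum(ASSUMED_DISCOUNT[band] * n for band, n in word_counts.items())
    return saved / total


# Made-up word counts for a single user.
example = {
    "repetitions": 1000,
    "tm.match100": 2000,
    "tm.match75": 1000,
    "tm.match0": 6000,
}
share = tm_savings_share(example)
print(f"Estimated saving: {share:.0%}")
```

With real per-band counts and an organization's own rate card in place of the assumed discounts, the same weighting gives a per-user savings estimate.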
15. MT is currently used on 31% of projects
Top MT Engines
ENGINE %
Microsoft with Feedback 15.8%
Microsoft Translator Hub 9.9%
Google Translate 2.6%
Microsoft Translator 2.5%
SDL BeGlobal 0.4%
Other 0.6%
MT not used 68.2%
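The headline figure can be cross-checked against the table. The short Python sketch below sums the engine shares listed above and compares the result with the complement of the "MT not used" row; both come to 31.8%, which the slide title gives as 31%.

```python
# Engine shares as listed in the table above (percent of projects).
engine_share = {
    "Microsoft with Feedback": 15.8,
    "Microsoft Translator Hub": 9.9,
    "Google Translate": 2.6,
    "Microsoft Translator": 2.5,
    "SDL BeGlobal": 0.4,
    "Other": 0.6,
}
mt_not_used = 68.2

# Round to one decimal to avoid floating-point noise in the sums.
mt_used = round(sum(engine_share.values()), 1)
complement = round(100 - mt_not_used, 1)
top_engine = max(engine_share, key=engine_share.get)

print(mt_used, complement, top_engine)
```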
16. Up to 80% of content is pasted from MT, then edited
Sample size 20 million words, December 2015 - May 2016
[Chart: Edit distance for major language pairs. X-axis: sample language pairs (en:es, pt:en, en:pt, es:en, en:ru, ru:en, en:de, pt:es, en:fr, es:pt). Y-axis: % of words in segments from MT (0% to 100%). Series: mt.match100, mt.match95, mt.match85, mt.match75, mt.match50, mt.match0. Annotations: MT not used, raw MT, moderate edits, heavily edited.]
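To make the banding concrete, the Python sketch below assigns a segment to one of the legend's bands by comparing the MT output with the final translation. The slides do not say how the match score is computed, so this uses the similarity ratio from Python's standard `difflib` module as a stand-in; the band lower bounds are assumed from the band names.

```python
from difflib import SequenceMatcher

# Lower bounds (percent) implied by the legend names; the exact band
# boundaries are an assumption.
BANDS = [
    (100, "mt.match100"),
    (95, "mt.match95"),
    (85, "mt.match85"),
    (75, "mt.match75"),
    (50, "mt.match50"),
    (0, "mt.match0"),
]


def mt_band(mt_output, final_text):
    """Return the band for a segment, using difflib similarity as the score."""
    score = SequenceMatcher(None, mt_output, final_text).ratio() * 100
    for lower, name in BANDS:
        if score >= lower:
            return name
    return "mt.match0"


print(mt_band("Hello world", "Hello world"))  # unedited MT
print(mt_band("abc", "xyz"))                  # nothing kept from MT
```

Under this scheme, raw MT that the linguist accepts unchanged lands in mt.match100, while heavily edited segments fall into the lower bands.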
17. Many linguists consistently translate more than 10 pages a day
[Chart: Top 1000 linguist role productivity, pages in April 2016. X-axis: users (1 to 1000). Y-axis: pages completed in April (0 to 1600). Reference lines: norm of 8 pages a day x 20 days; 10 pages a day; 20 pages a day (probably not human translation).]
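The reference lines on this chart translate into monthly page counts with simple arithmetic. The Python sketch below converts a linguist's monthly total into pages per day and compares it with the slide's thresholds (8, 10 and 20 pages a day over 20 working days). The category labels and the function names are illustrative, not part of the tool.

```python
WORKING_DAYS = 20  # from the slide's norm: 8 pages a day x 20 days


def pages_per_day(pages_in_month, working_days=WORKING_DAYS):
    """Average daily output for a monthly page total."""
    return pages_in_month / working_days


def classify(pages_in_month):
    """Place a monthly total relative to the slide's reference lines."""
    rate = pages_per_day(pages_in_month)
    if rate >= 20:
        return "probably not human translation"
    if rate > 10:
        return "more than 10 pages a day"
    if rate >= 8:
        return "at or above norm"
    return "below norm"


monthly_norm = 8 * WORKING_DAYS  # 160 pages a month

for pages in (100, 160, 250, 400):
    print(pages, classify(pages))
```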
18. Project manager productivity
Jobs created by PMs and completed by linguists in the last 30 days (test organization)
PM: JOBS
Renato: 408
Joana: 325
Kris: 313
John: 263
Bill: 159
Robert: 143
Alex: 122
Sandor: 74
Dave: 68
Millingan: 63
Mihiko: 31
Olga: 10
Barbora: 5
19. Benchmarking possibilities
PM productivity, completed jobs per month, December 2015 – May 2016
NUMBER OF JOBS COMPLETED: NUMBER OF USERS
1 or less: 674
from 1 to 10: 440
from 11 to 100: 428
from 101 to 200: 94
from 200 to 300: 37
from 300 to 400: 13
from 400 to 500: 9
from 500 to 600: 5
from 501 to 1000: 12
from 1001 to 2000: 10
more than 2000: 7
Chart annotation: Top 10%
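One way to locate the "Top 10%" mark is to walk the distribution from the busiest bucket downward until a tenth of all users has been counted. The Python sketch below does this with the user counts from this chart; pairing each count with a bucket label by position is an assumption about how the chart was laid out, and the function name is illustrative.

```python
# (bucket label, number of users) as read from the chart, lowest bucket first.
# Labels are kept exactly as they appear on the slide.
HISTOGRAM = [
    ("1 or less", 674),
    ("from 1 to 10", 440),
    ("from 11 to 100", 428),
    ("from 101 to 200", 94),
    ("from 200 to 300", 37),
    ("from 300 to 400", 13),
    ("from 400 to 500", 9),
    ("from 500 to 600", 5),
    ("from 501 to 1000", 12),
    ("from 1001 to 2000", 10),
    ("more than 2000", 7),
]


def top_share_bucket(histogram, share=0.10):
    """Return the bucket in which the top `share` of users begins."""
    total = sum(n for _, n in histogram)
    cutoff = total * share
    running = 0
    # Accumulate users from the most productive bucket downward.
    for label, n in reversed(histogram):
        running += n
        if running >= cutoff:
            return label
    return histogram[0][0]


total_users = sum(n for _, n in HISTOGRAM)
bucket = top_share_bucket(HISTOGRAM)
print(total_users, bucket)
```

On these counts, the top tenth of project managers begins in the "from 101 to 200" jobs-per-month bucket.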
20. Project manager productivity
[Chart: the same PM productivity chart as slide 18 (jobs created by PMs and completed by linguists in the last 30 days), with the annotation "Top 10% of Global PM User Population".]
22. “In fact, Big Data applications are bound only by the human imagination.”
Peter Pham
23. What you can do now
• What to track?
• How can organizations benefit from each other’s data?
• Which data should not be shared?