The document discusses translation quality and metrics. It makes several key points:
1) While quality is subjective, metrics can help objectively measure translation quality by counting errors. Various quality standards like ISO aim to regulate processes to reduce errors but have limitations.
2) Other industries use techniques like Six Sigma to minimize defects through process improvement, but translation standards have focused more on requirements than measurability.
3) To measure quality, its components must be clearly defined and translated to metrics. Simply measuring something because it's measurable does not ensure usefulness if it's irrelevant to the quality model.
Let’s call the whole thing off
Tout ce qui a besoin d'être dit l'a déjà été. Mais puisque personne n'écoutait, tout doit être dit à nouveau.
Everything that needs to be said has already been said. But since no one was listening, everything must be
said again.
André Gide, Le traité du Narcisse, 1891*
Although considered a mature and widespread concept, quality is a relative and largely subjective notion.
There is no unique conventional set of metrics for translation quality measurement and, as in many other
fields of application, translation quality broadly corresponds to the fulfillment of a set of specifications,
encompassing the buyer’s requirements.
Quality, utility and pricing
Utility is defined as the ability of something to satisfy needs or wants. In this sense, it is quite similar to
quality, defined as ‘fitness for purpose’. Both refer to the customer’s satisfaction with a good or a service.
In business, quality has a pragmatic meaning as the non-inferiority or superiority of something, but is
always an intuitive, conditional, and subjective attribute and may be interpreted differently by different
people.
In economics, utility is a representation of preferences over some set of goods and services.
As is the case for quality, utility cannot be measured. Nobel laureate Paul Samuelson gave the name ‘revealed
preferences’ to the choices that outline utility.
In economics, the marginal utility of a good or service is the gain from an increase or loss from a decrease in
the consumption of that good or service. In other words, the first unit of consumption of a good or service
yields more utility than the second and subsequent units, with a continuing reduction for greater amounts.
A good or service should then be consumed at a quantity at which the marginal utility equals the change in
the cost of producing one more unit of a good (marginal cost).
Due to information asymmetry, translation is supplied without qualitative differentiation across markets.
This makes it a typical commodity.
* Many thanks to Kirti Vashee for the quote.
Research claims that the demand for translation has been increasing, although at a slower pace in the last
few years. As the rate of commodity acquisition increases, marginal utility decreases, and if commodity
consumption continues to rise, marginal utility at some point may fall to zero, reaching maximum total
utility. Further increase in consumption of units of commodities causes marginal utility to become negative;
this signifies dissatisfaction.
Price is determined by both marginal utility and marginal cost, and this dynamic explains clearly not only
why the marginal cost of water is far lower than that of diamonds, but also why quality is an expected
feature in a good or service, which is not linked to its selling price.
Is quality, then, a way to differentiate? Is ‘purity’ the right way to differentiate? Is diversity still richness?
Is differentiation really important? Or do we have to wear camouflage to stay alive?
Most often, the buyer perceives translation as the only material available for scrutiny. Therefore,
particularly in translation, there is no such thing as absolute quality, with different jobs meeting different
requirements and different quality criteria.
To be reliable, translation quality assessment must be indisputable and repeatable; effective metrics must
be available that are objective (measurable), unbiased, and able to provide enough resolution (detail) to
assess the factors that need improvement.
Since there is no common protocol or tool for automated translation quality assessment, guidelines
enable a human team to perform this task while keeping the error margin as low as possible.
However, so far, two different people following the same protocol have hardly ever achieved the same result
(or even a comparable one).
In fact, the detailed and strict error-based evaluation models used so far have proved costly, ineffectual,
and erratic, as they hardly consider content type, end-user requirements, and usability: in a word, fitness
for purpose. These models have been developed, unfolded, and implemented by linguists for linguists. They
focus on linguistic features instead of cost-effectiveness and functionality, with time and cost growing
linearly with volume.
Technology and Incomes
Google has made machine translation a General Purpose Technology (GPT), thus helping spread the
concept of translation as a utility. It cannot, however, be accused of having contributed to the
commodification of translation, as the two concepts are distinct and unconnected. If anything, Google
Translate has helped raise awareness of the importance of translation for the circulation of information
and knowledge, even if only indirectly.
The Moses engine has been harming translation more, because it is free of charge and apparently easy and
convenient to implement, while, like any other complex technology, no matter how seemingly simple, it
requires specific skills, know-how, understanding, and patience. Improvisation does not pay. Neither does
gambling, especially if the ultimate goal is to lower costs and increase profits. Just as players in the
translation industry would like prospects to see translation as an investment, professional users should see
machine translation as a complex technology, and they should refrain from proclaiming themselves experts
just for being able to install and run a do-it-yourself piece of software. This applies to any software product.
In the last decade the acceleration in technology has shocked not only the industry, but virtually everyone.
Skills and institutions in the translation industry have not been able to keep pace with the rapid changes of
technology. In the translation industry too, skill-biased technological change (SBTC) increases the incomes
of highly skilled workers and reduces the incomes and employment of low-skilled workers.
As Erik Brynjolfsson and Andrew McAfee argue in The Second Machine Age, in the last decade, the fall in
demand has been greater for those who find themselves in the middle of the skill distribution. Highly qualified
workers have done well, but workers with lower qualifications have been less affected than those with
medium qualifications, reflecting a polarization of labor demand and an interesting fact about automation.
Physical activities requiring coordination between movement and sensory perception have proved more
resilient to automation than basic data processing, in line with Moravec's Paradox, which claims that
high-level reasoning requires very little computation, but low-level sensorimotor skills require enormous
computational resources.
A recent Economist article helps clarify this point. Lower qualified jobs are and will most
probably remain low paid; this makes replacement of highly qualified jobs with machines convenient,
especially in the long run, despite Keynes’s opinions (“This long run is a misleading guide to current affairs. In
the long run we are all dead.”)
As Brynjolfsson and McAfee suggest, the whole translation industry should pursue a strategy of innovating
and reshaping organizations, structures, processes, and business models to leverage developing
technologies and human skills. This would be easier to achieve than the disruptive technological innovations
that, as its history has proved, the industry is incapable of producing and instead has to endure at the hands
of outsiders (see also Moore’s Law and Commoditization (of Translation too)).
On the other hand, the more technologies are present in an industry, the harsher the competition. The
spread between the highest and lowest performers increases as well as the profit margin spread between
the companies at the top and at the bottom of the scale.
Going back for a moment to information asymmetry, it is worth recalling a study by Robert Jensen of the
John F. Kennedy School of Government of Harvard University on the digital provide in the fisheries sector in
Kerala. In Professor Jensen’s words, “when information is limited or costly, agents are unable to engage in
optimal arbitrage. Excess price dispersion across markets can arise, and goods may not be allocated
efficiently.”
Information technologies and mobile phones in Kerala allowed fishermen to access information on prices
and market demand in real time and use this information to make decisions. This resulted in a significant
reduction in price dispersion and improved market performance, after an initial drop in prices and
subsequent stabilization, with an eventual increase in profits.
A New Standard
After a lustrum, a seemingly endless gestation, especially for
our fast-paced times, ISO/DIS 17100 (Translation Services —
Requirements for translation services) has finally reached
quasi-final draft status (voting terminated on November
20, 2013). This draft has been submitted to the ISO member
bodies and to the CEN member bodies for a parallel enquiry,
which is about to end as well. Pending the imprimatur, this
20-page draft is available for purchase at CHF 66.00 (€54.18 or
US$74.50).
Very ambitiously, in its introduction, ISO/DIS 17100 claims
to specify “requirements for all aspects of the translation
process directly affecting the quality and delivery of
translation services.” ‘All’ is a very challenging word,
especially when it comes to a typical human task like
translation; more realistically, in its scope section,
the standard only “provides requirements for the core
processes, resources and other aspects necessary for the
delivery of a quality translation service that meets applicable
specifications.”
It is not a good start for a supposed state-of-the-art standard
made by presumably renowned experts.
In fact, ISO/DIS 17100 is a reworking of EN 15038 to partly accommodate ASTM F2575-06 and give a nod
to the Chinese GB/T 19363.1-2003. ISO/TS 11669 is crucial to the ISO/DIS 17100 framework.
ISO/TS 11669 is a technical specification. The shelf life of an ISO technical specification is six years: within
this timeframe it is either converted to a full standard or eliminated.
ISO/TS 11669 provides a framework for developing structured specifications for translation projects, but it
does not cover legally binding contracts between parties involved in a translation project. It addresses
quality assurance and provides the basis for qualitative assessment, but it does not provide procedures for
a quantitative measurement of the quality of a translation product.
ISO/TS 11669 describes a system for deciding how translation projects should be carried out.
Those decisions, or project specifications, would then become a resource for both the requester and
the translation service provider (TSP) throughout all phases of a translation project. These specifications
can be attached to a legally binding contract to define the work to be done. In the absence of a contract,
they can be attached to a purchase order or any other document supporting the request.
Requesters and TSPs should determine project specifications together. The project specifications can be
used to guide assessments made by either the TSP or the end user. The use of the same specifications by all
parties avoids assessments based on personal opinions of how source content should be translated.
ISO/TS 11669 does not provide any procedures for quantitative measurement of the quality of a translation
product.
ISO/TS 11669 introduces translation parameters, understood as the key factors, activities, elements and
attributes of a given project that are used for creating project specifications. However, the long list of
translation parameters is a surreptitious way to bring in vague and blurry translation quality assessment
criteria, which are traditionally subjective.
In addition, since quality is defined as the degree to which the translation product conforms to the project
specifications, and no guidance is given for qualitative assessment, register should not be a parameter, as
its compliance with requirements is highly subjective.
Like EN 15038:2006, ISO/DIS 17100 specifies requirements that a provider of translation services must
meet, in terms of staff and equipment, project management and processes. Like EN 15038:2006,
ISO/DIS 17100 shows the typical conservatism of the translation industry. Although the EN 15038:2006
draft was finalized a good two years before its release, ISO/DIS 17100, like EN 15038:2006, still reflects the
typical old business model of the whole translation industry. Like EN 15038:2006, ISO/DIS 17100 contains
no commitment towards metrics, and no hint of how to establish that the quality of translations reaches a
certain level. However, in one of the informative annexes, ISO/DIS 17100 contains a timid commitment towards
service level agreements (SLAs) that could outline such a framework.
Translators’ competences are still a weak point in ISO/DIS 17100. The TSP is required to “have a
documented process in place to ensure that the people selected to perform translation projects have the
required competences and qualifications,” but no means is envisaged to ensure that it actually does so. A
basic requirement for translator qualification is “a recognized graduate qualification in translation” or
substantial full-time professional experience in translating. These are the same basic errors as in
EN 15038:2006, reflecting a naive view that is now far removed from reality, which unfailingly proves the
inadequacy of the newcomers being churned out by the old-fashioned translation schools crowding the Old
and the New World. Not surprisingly, these schools are under the thumb of the same advocates of
EN 15038:2006 and ISO/DIS 17100.
On the other hand, ISO/DIS 17100 takes translation vendor and project management into consideration, in
the view of the assurance that “the people selected to perform translation projects have the required
competences and qualifications.” According to the standard, “translation project management competence
can be acquired in the course of formal or informal training, e.g. as part of a relevant higher educational
course or by means of on-the-job training or by industry experience.” This is somewhat dismissive of the
importance admittedly accorded to translation project management, and yet it is definitely much more
than the attention devoted to translation vendor management, which is in fact crucial. Indeed, the
standard does not envisage any requirement in this respect.
Here comes the biggest flaw in ISO/DIS 17100, in section 5.3 Translation process. With the typical dirigiste
trait of translation scholars, the abundance of details is not accompanied by any specification of
requirements as to who should monitor the several tasks in the process, and how, ending in an utter
manifestation of the typical wishful thinking that permeates the industry.
A blatant example is given in section 5.3.3 Revision. Beyond the impractical revival of the typical academic
approach based on contrastive analysis, no indication is given of the basis on which to “correct any errors
found in the translation output or recommend the corrective measures to be implemented”, leaving any
decision entirely to the reviser’s discretion, and thus ample room for the introduction of further errors.
ISO/DIS 17100 still contains all the flaws and limitations of EN 15038:2006 and incorporates some from
ASTM F2575-06, although both left much room for improvement, and the four years both had already been in
force at the start, plus as many spent drafting, were time enough to do better.
Annexes A and G are perfect examples in this respect. They seem quite a divertissement in themselves, with
the translation workflow outlined in the former still offering a monolithic serial model far from agile,
and a ‘DOK’ in the latter, with no definition or elaboration, being something that would most probably disturb
the sleep of many uninformed readers. For informative annexes, both surely miss their goal.
Annex B offers a list of elements to be included in an agreement as project specifications possibly in “the
form of statements of work such as a service level agreement (SLA),” but it gives no definition for
statements of work (SoW) or SLA.
Standards are all about allowing stakeholders to overcome information asymmetries and make informed
decisions; to this end, they must be simple, functional, and end-user oriented.
ISO/DIS 17100 is another missed opportunity to gain respect and consideration for the translation industry.
Measurability and Metrics
The quality process standard par excellence, ISO 9001:2008, is based on the assumption that regulating and
systematizing tasks in repeatable processes, with strong audit trails, will eventually lead to controlled
production processes and to products and services delivered with repeatable quality (attributes).
Over the years, the concept of continuous improvement has been spreading, to be eventually incorporated
in this standard. While leading industries developed complementary sets of techniques and tools for
process improvement, the translation industry pursued its own standards, which respected its peculiarity
and the special nature of its services.
The manufacturing industry applied the concept of Kaizen and conceived Total Quality Management (TQM),
Six Sigma (6σ) and CMMI to improve the quality of process outputs by identifying and removing the causes
of defects (errors) and minimizing variability in manufacturing and business processes.
The table below gives a measure of process performance corresponding to Six Sigma levels, roughly
expressed in defects per million opportunities (DPMO).
Sigma level    DPMO       Percentage yield
1              690,000    31%
2              310,000    69%
3              67,000     93.3%
4              6,200      99.38%
5              230        99.977%
6              3.4        99.99966%
This means that, in a 10,000-word project, the seemingly minute difference between 99.38% and
99.99% amounts to 62 errors compared to 1; 2 errors every three pages compared to only 1 in total.
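The arithmetic behind this comparison can be checked with a minimal sketch. It assumes, for illustration only, that every word counts as one opportunity for error; the function names are hypothetical and not part of any Six Sigma toolkit.

```python
def expected_errors(units: int, yield_percent: float) -> float:
    """Expected number of defective units for a given percentage yield."""
    return units * (100.0 - yield_percent) / 100.0


def dpmo(yield_percent: float) -> float:
    """Defects per million opportunities implied by a percentage yield."""
    return (100.0 - yield_percent) * 10_000


# A 10,000-word project, assuming one opportunity for error per word.
words = 10_000
print(round(expected_errors(words, 99.38)))  # 62
print(round(expected_errors(words, 99.99)))  # 1
```

Rounding is used because the yields are themselves rounded figures.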
In the language industry, quality is a much-debated subject. The most commonly asked question about
quality is: how can quality be measured? To measure something, you must know what it is, and then
you must develop metrics that measure it.
Defining metrics is the hardest part for people who have always thought of the quality of their
deliverables as a debatable subject.
The best way to assess quality remains measuring the number and magnitude of defects, and when
defects cannot be physically removed, their features and scope must be specified.
The first step, then, is to establish a model or definition of quality, and translate it into a set of metrics
that measure each of the elements of quality in it. Measuring things just because they can be
measured is not useful. If something is not relevant to the quality model established, it is not a good
use of time to develop metrics to measure it.
Striving for a single, all-encompassing metric is not only troublesome, it can be useless, as a single
metric would not reveal all the problems. Creating multiple metrics that assess the various aspects of
what is to be measured can help re-compose the overall framework: knowing which parts of a process
work well and which ones do not makes it possible to take measures to correct the problems.
A comprehensive set of metrics must measure quality from several perspectives and at several points
during the production process, regardless of the quality model. At a minimum, metrics should tell
something about:
The quality of the finished product or the lack of it;
The quality of the process, i.e. how reliable it is to produce quality products;
The likelihood of achieving quality in a deliverable.
The quality of the finished product corresponds to general customer satisfaction ratings, while the lack of
quality is shown by defects such as technical errors. The quality of the process comes from repeatability,
and typical predictors of quality are in-process indicators such as editing.
Sampling
From this perspective, the distinction between quality assurance, quality assessment, and quality
inspection and control is important.
Quality assurance is a planned and systematic pattern of all actions necessary to provide adequate
confidence that the item or product conforms to established technical requirements. Quality assurance
covers all activities, in accordance with two basic rules, “fit for purpose” and “do it right the first time”.
Quality control and quality assessment contribute to quality assurance.
Quality assurance is the full set of procedures applied before, during and after the production process, by
all members of an organization, to ensure that quality objectives important to clients are being met.
Quality assessment is intended for establishing whether contract conditions have been met. Whereas
quality control is product-oriented and customer-oriented, quality assessment is business-oriented.
Unlike quality control, which always occurs before the final product is delivered to the client, quality
assessment may take place after delivery. Assessment is not part of the production process. It consists in
identifying — but not correcting — problems in one or more randomly selected samples of a product
output to determine the degree to which it meets the agreed standards.
In the translation industry, quality control is done with specific software tools, whether standalone or
integrated in translation environments. These tools usually detect mechanical errors, spelling errors,
omissions, inconsistencies, and oversights, especially when reference material is provided.
Nevertheless, since there is no ‘perfect’ translation, the intended purpose of a translation and its suitability
remain the only judgment criteria which, for the sake of objectivity, must be accompanied by assessment
metrics. The combination of process and output quality assessment of translation work will eventually tell
simply whether it is acceptable or defective.
Therefore, translation quality assessment (TQA) criteria are to be agreed upon with the client, made the
subject of requirements and formalized in a separate document.
So far, TQA has been performed on the basis of a strict correspondence between source and target texts
and on intensive error detection and analysis. While this could be the best approach from a theoretical —
and maybe pedagogical — point of view, it is uneconomic. It requires a considerable investment in human
resources and time, and it reduces translation to a matter of trust.
On the other hand, who will go over 100,000 words of translation to check for terminology changes after a
translation has been delivered? However, while terminology issues can be approached in a systematic way,
style is a matter of personal preferences. The same goes for correctness and meaning with respect to
completeness. Any translation can be fully checked, automatically, for completeness against the source
text, freedom from mechanical flaws or errors, and even for grammar, understood as conformity to an
approved or conventional standard. In any case, any job done by a professional translator
is taken for granted as free from such defects.
Today, any large translation project follows the same standards and rules as a production process in
common business. From this perspective, defects should be reproduced under the same
conditions, corrected and then removed.
A first step towards improvement in the quality of process outputs consists in preventing the emergence of
defects by minimizing variability in processes. To this end, a detailed statement of work and an accurate
style guide can be helpful (although time-consuming) in most situations, possibly together with
examples of do’s and don’ts. This approach could eventually lead to setting up defect tracking and
assessment procedures.
Here comes inspection.
Just like any other object, to be measurable, a translation, especially a large one, should be divided into
definite lots, homogeneous in size and scope, so that the number and significance of defects can be
reasonably estimated and a limit set for both.
Such apportionment is called sampling. Sampling becomes necessary for any translation project exceeding
a typical freelancer’s single-day capacity, which makes 100% inspection unsustainable.
Sampling will allow for inspection of meaningful, representative batches, and for accepting or rejecting
them through the determination of the maximum number of defects, based on simple pass/fail criteria.
Acceptance sampling is the middle-of-the-road approach between no inspection and full inspection. Its
main purpose is to decide whether a lot is acceptable, not to estimate its quality. To determine
acceptability, criteria for inspection by attributes must be specified in advance.
Once criteria for inspection are specified, acceptability thresholds must be set. The ISO 2859 series of
standards can be used here as a reference.
For acceptance sampling to be effective, a lot acceptance sampling plan (LASP) must be implemented
indicating the conditions for acceptance or rejection of the lot that is being inspected. These parameters
are usually the numbers of defectives of each kind allowed in a sample, and they should vary in quantity and
severity in direct relation to the importance of the characteristics inspected.
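A pass/fail rule of this kind can be sketched as follows. The severity classes and acceptance numbers are hypothetical placeholders, not values taken from ISO 2859 or from any actual plan; in practice they would be agreed with the client.

```python
def lot_decision(defects_found: dict, acceptance_numbers: dict) -> str:
    """Accept the lot only if no severity class exceeds its acceptance number.

    defects_found: severity class -> defectives counted in the sample.
    acceptance_numbers: severity class -> maximum defectives tolerated.
    """
    for severity, limit in acceptance_numbers.items():
        if defects_found.get(severity, 0) > limit:
            return "reject"
    return "accept"


# Hypothetical plan: stricter limits for more important characteristics.
plan = {"critical": 0, "major": 2, "minor": 5}

print(lot_decision({"major": 1, "minor": 4}, plan))     # accept
print(lot_decision({"critical": 1, "minor": 0}, plan))  # reject
```

Keeping one acceptance number per severity class mirrors the idea that tolerances should tighten as the importance of the inspected characteristic grows.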
Average Outgoing Quality (AOQ) procedures are the best suited for translation projects, since sampling is
non-destructive, rejected lots are fully inspected and all their defectives are replaced with good units. In
this case, all rejected lots are made perfect and the only defects left are those in lots that were accepted.
AOQ expresses the average nonconforming fraction that is shipped to clients:
\[
AOQ(p) = \frac{P_A \, p \, (N - n)}{N - n p - (1 - P_A) \, p \, (N - n)}
\]
where N is the lot size, n is the sample size, P_A is the probability of accepting the lot, (N - n)P_A is the
number of pieces that are shipped without inspection, and p is the nonconforming fraction. The numerator is
the number of bad pieces that are shipped, and the denominator is the total number of pieces shipped.
Corrections are made to make rejected lots perfect and allow for identifying and removing the causes of
defects, thus preventing their recurrence by improving processes and then the quality of outputs.
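As a rough illustration, AOQ can be computed for a single sampling plan. This sketch rests on several assumptions that are not in the text: the probability of acceptance is taken from a binomial model with sample size n and acceptance number c, and two conventions are offered for the denominator. When defectives found are corrected in place, the total shipped stays N; when they are removed without replacement, it shrinks to N - np - (1 - P_A)p(N - n). All names and figures are hypothetical.

```python
from math import comb


def accept_probability(n: int, c: int, p: float) -> float:
    """Binomial probability of finding at most c defectives in a sample of n."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c + 1))


def aoq(N: int, n: int, c: int, p: float, replaced: bool = True) -> float:
    """Average outgoing quality for lot size N and nonconforming fraction p."""
    pa = accept_probability(n, c, p)
    bad_shipped = pa * p * (N - n)  # defectives left in accepted lots
    if replaced:
        total_shipped = N  # defectives found are corrected and stay in the lot
    else:
        # defectives found in samples and in rejected lots are removed
        total_shipped = N - n * p - (1 - pa) * p * (N - n)
    return bad_shipped / total_shipped


# Hypothetical lot of 2,000 segments, 80 sampled, at most 2 defectives allowed.
for fraction in (0.01, 0.02, 0.05):
    print(fraction, round(aoq(2000, 80, 2, fraction), 5))
```

Either way, the outgoing fraction never exceeds the incoming one, since rejected lots are made perfect.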
To make assessment criteria, methods and tools unambiguous, AQLs (Acceptance Quality Levels) can be
used allowing for tolerance and deviations (errors). AQLs should be agreed upon in an SLA and should specify
the maximum percentage of non-conforming items to be considered as a satisfactory process average.
Different AQLs may be designated for different types of defects.
An implication of acceptance sampling is that a lot exceeding a given percentage of deviations from the
AQL is unsatisfactory and must be rejected. At the same time, a high defect level (Lot Tolerance Percentage
Defective, LTPD) must be designated that would be unacceptable to the consumer.
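One way to turn an AQL and an LTPD into a concrete plan is to search for the smallest sample size n and acceptance number c that satisfy both sides. The sketch below assumes a binomial model, a producer's risk of 5% at the AQL and a consumer's risk of 10% at the LTPD; these risk levels are conventional choices, not values from the text, and the function names are hypothetical.

```python
from math import comb


def p_accept(n: int, c: int, p: float) -> float:
    """Probability of accepting a lot: at most c defectives in a sample of n."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c + 1))


def find_plan(aql: float, ltpd: float, alpha: float = 0.05,
              beta: float = 0.10, max_n: int = 2000):
    """Smallest (n, c) accepting AQL-quality lots with probability >= 1 - alpha
    and LTPD-quality lots with probability <= beta, or None if none is found."""
    for n in range(1, max_n + 1):
        for c in range(n + 1):
            if p_accept(n, c, ltpd) > beta:
                break  # a larger c only raises acceptance of bad lots
            if p_accept(n, c, aql) >= 1 - alpha:
                return n, c
    return None


# Hypothetical levels: 1% non-conforming is fine, 5% is unacceptable.
print(find_plan(aql=0.01, ltpd=0.05))
```

The closer the LTPD is to the AQL, the larger the sample needed to tell good lots from bad ones, which is the practical cost of a tight tolerance.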
AQLs imply that a level of non-quality exists in a product where defects remain that ruin a batch, despite
being ‘acceptable’. This level represents a compromise between quality, volume and price negotiated.
To set AQLs, a simple defect prediction technique can be implemented to separate the defects found in a
translation sample into two groups. Depending on the number of defects found in either of the two groups
(but not in both), the defects that have not been found in the sample can then be estimated. This number
gives approximately the number of defects in the entire project.
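The text does not name the technique. One plausible reading, assumed here, is the two-reviewer capture-recapture (Lincoln-Petersen) estimate: two reviewers check the same sample independently, and the overlap between their findings indicates how many defects both missed. The function name and the defect identifiers below are hypothetical.

```python
def estimate_defects(found_a: set, found_b: set) -> tuple:
    """Lincoln-Petersen estimate from two independent reviews of one sample.

    Returns (estimated total defects, estimated defects found by neither).
    """
    both = found_a & found_b
    if not both:
        raise ValueError("no common defects: the estimate is undefined")
    total = len(found_a) * len(found_b) / len(both)
    found = len(found_a | found_b)
    return total, max(total - found, 0.0)


# Hypothetical defect IDs logged by two reviewers on the same sample.
reviewer_a = set(range(1, 11))   # defects 1..10
reviewer_b = set(range(6, 14))   # defects 6..13
print(estimate_defects(reviewer_a, reviewer_b))  # (16.0, 3.0)
```

The estimate is only as good as the independence of the two reviews; reviewers who share a checklist will tend to overlap more and so understate what remains.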
Drawing Samples
A sample is a subset of a production output used to estimate characteristics of the whole output. The sample
drawing process consists of:
Defining the production output;
Specifying a sampling frame, a set of items to measure;
Specifying a sampling method for selecting items from the frame;
Determining the sample size;
Implementing the sampling plan;
Sampling and data collecting;
Data that can be selected.
In most cases, it is inconvenient and uneconomic to sentence (accept or reject) a batch of material from
production (acceptance sampling by lots) by identifying and measuring every single item in the production
output and including every one of them in the sample.
Given the variety and variance in projects, the need to use different providers to match (large) volumes
with (tight) deadlines, and the consequent unpredictable nature of translation, simple random sampling
(SRS) is the most advisable method to minimize bias and simplify the analysis of results.
In SRS, the variance between individual results within the sample is a good indicator of the variance in the
overall output, which helps estimate the accuracy of results, even though the randomness of a selection may
result in a sample that does not reflect the makeup of the overall output.
Assuming the source content of a translation project is homogeneous per se, the size of samples could be
determined according to the type of deliverables and AOQ.
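Simple random sampling of translation segments can be sketched with the Python standard library. The segment numbering, sample size and seed are illustrative assumptions; the standard error uses the usual estimator for a proportion under sampling without replacement, with a finite population correction.

```python
import random
from math import sqrt


def draw_sample(segment_ids, sample_size: int, seed=None) -> list:
    """Draw a simple random sample of segment IDs, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return sorted(rng.sample(list(segment_ids), sample_size))


def defect_rate(flags: list, population_size: int) -> tuple:
    """Observed defective fraction and its estimated standard error.

    flags holds 1 for a defective segment and 0 otherwise.
    """
    n = len(flags)
    p_hat = sum(flags) / n
    fpc = 1 - n / population_size  # finite population correction
    return p_hat, sqrt(fpc * p_hat * (1 - p_hat) / (n - 1))


# Hypothetical project of 1,000 segments, 100 of them inspected.
sample_ids = draw_sample(range(1, 1001), 100, seed=42)
flags = [1] * 5 + [0] * 95  # placeholder inspection results
print(len(sample_ids), defect_rate(flags, 1000))
```

Recording the seed alongside the results lets a second assessor redraw exactly the same segments.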
Purity and Quality
In recent years, Statistical Machine Translation (SMT) has become interesting, particularly for LSPs, mostly
thanks to the availability of the free Moses engine.
However, contrary to expectations, creating the corpus a system needs to run effectively and
satisfactorily can be costly. In fact, for quite some time now, a distinction has been made between generic
SMT and customized SMT, where the latter leverages domain resources for phraseology, terminology,
and style. In this respect, a further distinction has been made between clean data and quality data. In
reality, the latter includes the former. The following table should help clarify this concept.
Clean Data                                         Quality Data
Small number of trusted quality sources            Actual data
Domain relevance (restricted)                      Standard length sentences
No less than 1,000 segments                        Terminologically consistent
Encoding consistency                               Consistent writing style
No empty segments                                  No mistakes or errors (syntax, grammar, spelling)
No mechanical errors (diacritics, punctuation,     Correct translation (exact words, morphology,
capitalization, spelling)                          no loans)
Cleaning data for training purposes can be performed automatically or semi-automatically with the aid of
software tools. These tools can be used to run a series of checks on parallel data, e.g. for no empty
segments, unbroken markups, correct numbers, etc. and even for consistent translations and
correspondence with approved terminology.
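A few of these checks can be sketched for a single source/target pair. This is a simplified illustration, not the behavior of any particular tool: the regular expressions, the function name and the issue labels are assumptions, and real corpora would need more careful handling of markup and number formats.

```python
import re

TAG = re.compile(r"</?[A-Za-z][^>]*>")   # very rough inline-markup pattern
NUM = re.compile(r"\d+(?:[.,]\d+)*")     # integers and separated numbers


def _numbers(text: str) -> list:
    # Drop separators so that 1,000.50 and 1.000,50 compare as equal.
    return sorted(re.sub(r"[.,]", "", n) for n in NUM.findall(text))


def check_pair(source: str, target: str) -> list:
    """Return the list of issues found in one source/target segment pair."""
    if not source.strip() or not target.strip():
        return ["empty segment"]
    issues = []
    if sorted(TAG.findall(source)) != sorted(TAG.findall(target)):
        issues.append("markup mismatch")
    if _numbers(source) != _numbers(target):
        issues.append("number mismatch")
    return issues


pairs = [
    ("Press <b>OK</b> 3 times", "Premere <b>OK</b> 3 volte"),
    ("Page 12", "Pagina 21"),
    ("Hello", "  "),
]
for src, tgt in pairs:
    print(check_pair(src, tgt))
```

Pairs that return an empty list pass these mechanical checks only; whether the translation is terminologically and stylistically sound remains a separate question.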
Refining data for quality, i.e. to match the intended purpose and target audience with preferred writing
style and terminology, is a human task requiring thorough understanding of the data.