Quality
Anthony Pym
Universitat Rovira i Virgili
Post-print of: Pym, A. (2020). Quality. In M. O’Hagan (Ed.) The Routledge Handbook of
Translation and Technology, pp. 437-452. Abingdon and New York: Routledge.
Abstract
Understood as the relative excellence of a translation product or process, quality can be
measured in many ways, including automatic comparison metrics, evaluation by
translators, evaluation by monolingual end users, time required for postediting, time
required for non-translation (language learning), process regulation, user satisfaction,
and translator satisfaction. Behind all these measures there lie a series of human
judgements and work-process considerations. In order to draw out those human aspects
of quality, a critical appraisal is made of five relations involved: 1) Automatic
evaluation metrics appear to measure equivalence to a start text but in effect adopt a
reference translation, which is itself subject to all the hazards of translational
indeterminacy; 2) Claims to parity with human translation are based on human
judgements of acceptability but are often measured on the basis of isolated sentence
pairs, which is not how humans communicate; 3) Criteria of usability generally do not
take into account the risks involved in not knowing where error might lie; 4) Industrial
regulations of production processes allow for enhanced reviewing and revision needs
but do not address technologies directly; and 5) Assessments of translator satisfaction
give variable results but tend not to account for the individual skills involved in the use
of technologies.
Keywords: translation quality, neural machine translation, indeterminacy, usability, job
satisfaction
‘If the mere quantity of labor functions as a measure of value
regardless of quality, it presupposes that simple labor has
become the pivot of industry. It presupposes that labor has
been equalized by the subordination of people to machines or
by the extreme division of labor: people are effaced by their
labor.’
Karl Marx (1847:30) [my translation]
Introduction
‘Qualities’, in Aristotle (Categories), are properties of things, as opposed to the common-usage sense where ‘quality’, in the singular, is the relative excellence of the thing,
usually for a particular purpose. These two senses are nevertheless related, particularly
in a field of ongoing innovation where changes in properties (‘qualities’) are to be
measured in terms of changes in excellence (‘quality’). Translation technologies
constitute one such field.
The basic thing to be measured here is the relative excellence of a translation
produced with or without a particular technology. This is complicated only slightly by
the possibility of measuring the excellence of the translation process as well. The
quality of translation technologies is superficially presented as an affair of numbers and
rules: Levenshtein distances, BLEU scores, adherence to industrial standards, and the
like (see Wright, Doherty and Melby for their respective chapters in this volume). Such
apparently objective criteria are nevertheless themselves judged and thus ultimately
made meaningful in human terms, incorporating criteria that may include how fast
translations are produced, how efficacious they are in the attainment of purposes, how
satisfied users are with linguistic products, how happy translators are, and hopefully
how successful whole communication acts are. Behind the technical numbers, if you
know where to look, there are kinds of quality that are ultimately measured in terms of
what humans think and do.
This is not to say that the human aspect of quality is ever just one absolute
measure; it involves several quite different ways in which people interact with
translations.
Literature review
The general development of translation technologies is clear enough: the rise and fall
of machine translation in the 1950s and 1960s, the acceptance of translation memories
in the 1990s, and the integration of statistical and neural machine translation in the 21st
century. Each step along the way has been accompanied by a set of discourses on
quality, mostly seeking progress. Quality is not, as Cronin surmises (2013: 128), a new
concern with postediting, the ‘return of the repressed translation detail’. It has long been
an issue in arguments for and against technologies.
The 1966 ALPAC Report, for example, was very much about assessing the
current and future quality of machine translation. In part it did so by calculating the
human time required to make machine translation between Russian and English usable,
compared with the number of hours a scientist required to learn enough Russian to read in
their field of expertise (1966: 5). A further comparison was between the time required to
postedit machine translation (yes, in 1966) and the time needed to translate from scratch
(ibid.:97). On both those counts, the bottom-line criterion for quality was the number of
hours a human would spend on different tasks – the mere quantity of labour. It was on
that criterion that research on machine translation (albeit not on computational
linguistics) was drastically curtailed.
When machine translation research picked up again, attempts were made to
evaluate output in terms of human judgements of linguistic quality. In the 1990s the
Advanced Research Projects Agency (ARPA) did this by using methods including
comprehension tests on back-translations, judgements by professional translators, and
human assessment of adequacy (how much information is transferred) and fluency (how
correct the language is) (White, O’Connell and Carlson 1993, White 2003, 2005). The
expense, time, and subjectivity of many of those methods proved generally daunting
(Hovy 1999).
It is intriguing to see how the early evaluations of quality attempted to marshal
the opinions of different translators. The ALPAC report includes an experiment
comparing the way monolinguals and bilinguals assess MT outputs. It notes that
bilinguals took longer to assess the sentences and added little to the overall evaluation:
‘One is inclined to give more credence to the results from the monolinguals because
monolinguals are more representative of potential users of translations’ (1966: 72-73).
This particular debate has not been resolved over the years, although the industry-based
papers do tend to prefer monolingual evaluation in terms of efficiency (White, O’Connell
and Carlson 1993, Coughlin 2003). The issue nevertheless shifted from straight MT
evaluation to the kind of active evaluation performed in postediting. Koponen (2016)
cites studies that compare monolingual and bilingual postediting and generally finds,
contrary to ALPAC, that ‘post-editing does not currently appear feasible if the
posteditors have no access to the source text’ (2016: 142). This could, however, be
unnecessarily pessimistic. In a small study, Mitchell et al. (2013) found that
‘monolingual post-editing can lead to improved fluency and comprehensibility scores
similar to those achieved through bilingual post-editing’ (2013: 1), although ‘fidelity’
was improved more in the bilingual setting. Temizöz (2013), working with a technical
text, found that postediting MT produced higher quality when carried out by subject-matter
experts than by trained translators, and that it may be more advantageous to have
experts revise translators’ work than vice versa.
These discussions indirectly address the question of who has the authority to
assess the quality of a translation. While translation theorists tend to refer to ‘equivalence’
as some kind of objective yardstick (cf. House 2015), actual experiments show a range of
different judgements: Martín-Mor (2011), for example, finds that academics list more
ostensible errors than do translation professionals; García (2010) records worrying
differences between official accreditation evaluators; Le and Schuster (2016) show there
is no universal agreement on what a ‘perfect translation’ is.
Given the high costs of human evaluation, automatic MT evaluation metrics
became standard (Fiederer and O’Brien 2009). Proposals date from the early 1990s and
basically set out to measure the edit distance between the MT and a reference human
translation (Su et al. 1992). Edit distances are generally based on Levenshtein (1966) and
involve the same kind of calculations that give us fuzzy matches in translation memory
systems. BLEU, TER and METEOR scores all compare MT output to a human-produced
reference translation (see Doherty in this volume). The correlations between
human and automatic evaluations have been cause for investigation and occasional
dispute (Coughlin 2003, Banerjee and Lavie 2005, Graham and Baldwin 2014, Wu et
al. 2016), since the selection of different parameters and reference texts can give quite
different scores.
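The calculation underlying both fuzzy matching and reference-based metrics can be sketched briefly. The following is a minimal illustration in Python; the normalization into a similarity score is illustrative only, since commercial TM systems use proprietary variants:

```python
# Minimal sketch of the edit-distance logic behind TM fuzzy matching and
# reference-based MT evaluation. The normalization is illustrative only;
# commercial systems use proprietary variants.

def levenshtein(a: str, b: str) -> int:
    """Single-character insertions, deletions and substitutions
    needed to turn string a into string b (Levenshtein 1966)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def fuzzy_match(segment: str, tm_entry: str) -> float:
    """Similarity in the style of a TM fuzzy match (1.0 = exact match)."""
    dist = levenshtein(segment, tm_entry)
    return 1.0 - dist / max(len(segment), len(tm_entry), 1)

print(fuzzy_match("The system translates text.",
                  "The system translated the text."))  # ≈ 0.84
```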
While human vs. automatic evaluation became an issue for machine translation,
the initial quality claims for the use of translation memories were more commonly
focused on the time taken to achieve a particular translation product. This was done
through reasoned speculation and ad hoc surveys of users, leading to quite complex lists
of incommensurate advantages and disadvantages (Freigang 1998, Webb 2000).
The assumption was that translation memories did indeed increase productivity,
and this became part of a general promotional discourse. Production and distribution
companies could produce any number of reports announcing savings achieved and
associated benefits obtained, although productivity basically depended on the degree of
repetition in the text involved – virtually any claim could be made by selecting
appropriate texts and parameters.
When more solid evidence did start to come in, it was ambivalent. García
(2010), for example, found that postediting MT was only sometimes more advantageous
than translating from scratch. From 2002 a series of audits of the European
Commission’s Translation Service showed little evidence of enhanced productivity due
to the use of translation technology: the average cost of one page of translation was 150
euros in 2003 and rose to 194 euros in 2005 (European Commission 2006), despite 23.7
million euros having been spent on technology in 2003. The ideological emphasis on
productivity then shifted to alternative benefits such as terminological and
phraseological consistency. At the same time, translators’ discussion lists gave voice to
doubts about the quality not just of outgoing text (García 2006) but also of the
translation memories that had been built up over time (Austermühl 2006).
In view of general awareness that productivity was only part of the story, a
series of scholars (for example, Reinke 1999, Gow 2003, García 2006) began to
broaden the way in which the various technologies are evaluated. In historical terms,
this may be seen as the traditional equivalence-based concept of translation quality
meeting up with the purpose-based paradigm that had been developing in general
translation theory since 1984 (Reiss and Vermeer 1984). The confluence drew easily on
existing technical discourses of usability (Jordan 1998, Byrne 2006) and pointed toward
what would become Multidimensional Quality Metrics.
In the background of the various technical discourses, questions have been
raised about the wider effects of translation technologies. Torres del Rey (2005)
comments on problems for the very concept of communication. Pym (2004) talks about
the dehumanization resulting from technologies where translators cannot visualize the
reception situation. Dragsted (2004) finds that automatic segmentation is adequate to
the purpose of establishing phrase-level equivalence, but that the segments used by
translators without the technology are frequently much larger, embracing factors
pertaining to the communication situation.
Methodological considerations
To make sense of these many different aspects, Chesterman (2004) usefully points out
that quality is never an absolute value; it is always a relation, of which there are several
kinds. Two possible relations are linguistic: between the translation and the start text¹
(allowing judgments of adequacy, equivalence or similarity), then between the
translation and ‘parallel texts’ understood as non-translations of the same type
(allowing people to judge fluency, the acceptability of language). Chesterman next
recognizes two further relations: between the translation and the need or purpose
(Skopos) that it is meant to fulfill (judgments of usability) and between the translation
and industrial standards (judgments concerning production processes). A fifth quality
relation is then between the translation and the translator (allowing judgments of job
satisfaction and just recompense).
If we look at the way translation technologies intersect with these five relations,
we find that the front-page judgements seem to be associated with the first relation only:
between the translation and the start text, and thus with questions of equivalence,
considered ‘the conceptual basis of translation quality assessment’ (House 2015: 5).
When a machine translation system or translation memory suite is being evaluated
automatically, that is what appears to be at stake – we want to know how well the start
text is rendered. However, if you look at the actual metrics used, they mostly involve
comparisons not with start texts but with human translations of those start texts, and
those human translations are in turn evaluated by several non-automatic metrics that
may concern holistic naturalness and the like. Further, when humans are called upon to
evaluate rival MT candidates, they are often doing so monolingually, comparing the
machine translation output with what is acceptable in the target languages. So what
appears to be a traditional relation between start text and translation (the first of
Chesterman’s relations) frequently turns out to be a complex evaluation involving
human translations, selective human reception, and implicit human comparison with
non-translations (the second of his relations).
This conceptual slippage is repeated elsewhere, to the point where the increasing
use of technologies has repercussions on all the relations identified by Chesterman.
Precisely because production processes are automated, there is an emphasis on
localization workflows, the need for adaptation, and measurements of usability and
satisfaction on the user’s side (Byrne 2006: 193ff.). And precisely because those same
workflows increasingly involve stages of revision and review, handling more words
than can be evaluated manually, industrial standards are created in order to regulate the
processes, on the precarious but time-saving assumption that if the process has quality,
then so must the product (see Wright in this volume). Further, since automated
translation can produce texts at different levels of quality depending on the resources
invested in pre-editing and postediting, there is increasing market awareness that clients
can fine-tune ‘fit for purpose’ translation (Way 2018, also see Bowker in this volume),
thus involving a very active human relation between translation and client. In fact, the
one relation that does not usually enter into calculations of quality, it seems, is translator
satisfaction, where the empirical indications of relative happiness are nowhere near as
plentiful as the enthusiastic testimonials used in sales promotions.
Critical discussion
Here I shall consider each of Chesterman’s five relations in turn, teasing out the human
elements from behind the technocratic discourse.
Indeterminacy in ST-TT comparison
Although only part of the range of translation technologies available, machine
translation systems deservedly grab the headlines. In theory, the historical movement
away from rule-based translation and towards statistical methods means that the
machines are not actually translating in any strict sense: they are searching for the
optimal translations previously done by a human, then putting those previous
translations together in various ways. In theory, if the algorithm finds the right human
working on the right text, then it finds the right translation. The basic operative
assumption behind statistical methods would thus be nobly democratic: if a large
majority of people have rendered a string in the same way, then they are likely to be
correct – likely, but with no certainty, in either machine translation or democratic
politics. A correlative assumption would be that the larger the database of previous
human translations, the better the statistical likelihood of the translation being correct.
Likely, but again with no more certainty than could be attributed to the presumption that
large democracies elect better governments than do small ones. Neural approaches
might be seen as correcting the democratic fallacies: the size of the database still
influences the quality produced, but so do the specific thematic contexts in which the
translations were done and the absence of intrusive items in the database. That is, neural
systems theoretically aim to select more specific human translations and eliminate
proposals that are contextually aberrant – a society of voters theoretically defers to
groups of more expert language users.
That theory does not tell the whole story, of course, since MT quality is still very
much dependent on the grammatical similarity between the languages concerned
(French to Spanish will give higher quality than German to Japanese), whereas if clean
context-specific databases were all that counted, the language distance should not matter so much.
And then much depends on the relative standardization of the text to be translated
(highly technical texts with fixed terminologies and a limited repertoire of verbal
relations will perform relatively well) and the electronic language resources developed
for particular languages (smaller languages with few electronic texts do not fare well).
The quality of MT output itself can be evaluated for at least two purposes: to
assess the inherent superiority of one system or another, or to estimate the degree of
postediting (and potentially pre-editing) necessary and thereby to calculate the relative
pricing of the associated language work. These evaluations can be holistic, based on a
checklist, or performed by algorithms that automatically compare texts.
The human evaluation of MT, with or without a checklist, is not necessarily
different from any other kind of translation, with the same kinds of problems: it
involves high workloads, suffers from a lack of inter-rater and intra-rater consistency,
and the categories and weightings ideally have to be adjusted to suit specific purposes.
Further, both human and automatic evaluation are more basically haunted by the
precarious assumption that there is only one correct translation for a given ST segment.
The fact that different people make different evaluations of translations evinces the
fundamental ‘indeterminacy’, described by Quine (1960) as a situation where translators
produce different translations, all of which are correct but different. That is, there are
almost always equally valid but different translations for the one input – and when this
is not the case, we are judging grammar or terminology rather than translation. This
indeterminacy concerns not just the inevitable differences between human evaluators,
but also the dependence of automatic evaluation on initial ‘reference translations’,
which could always have been otherwise.
The past decade has seen numerous variants on these approaches, particularly on
BLEU with respect to the possible multiplicity of reference translations, and systems
that seek to combine the results from previous systems, since the same MT output can
in theory be fed through all of them and the results can be synthesized. For example,
Akiba et al. (2001) do propose the use of multiple reference translations, and part of the
work since then has been on how to select and combine those reference translations,
which concerns the same kind of algorithms that have driven statistical machine
translation. The mechanical evaluation of MT (see Doherty in this volume) is thus
conceptually inseparable from the very technologies being evaluated.
A fundamental problem facing all these metrics remains the multiplicity of
possible human translations, and the human labour required to tell which of them might
be superior.
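The dependence on the reference translation can be made concrete with a toy example. The sentences below are hypothetical, and a crude unigram-overlap score stands in for BLEU’s n-gram precision; the point is that the same MT output scores differently depending on which equally valid rendering happens to serve as the reference:

```python
# Toy illustration of indeterminacy in automatic evaluation: the same MT
# output receives different scores against different (equally valid)
# reference translations. A crude unigram-overlap score stands in for BLEU.

def overlap_score(candidate: str, reference: str) -> float:
    cand, ref = candidate.lower().split(), reference.lower().split()
    matched = sum(min(cand.count(w), ref.count(w)) for w in set(cand))
    return matched / len(cand) if cand else 0.0

mt_output = "The committee approved the new budget yesterday"
references = [
    "Yesterday the committee passed the new budget",          # one rendering
    "The new budget received committee approval yesterday",   # another
]

for ref in references:
    print(round(overlap_score(mt_output, ref), 2))   # 0.86, then 0.71
# Scoring against multiple references and keeping the best match, as
# multi-reference BLEU does, softens but does not remove the effect.
print("best:", round(max(overlap_score(mt_output, r) for r in references), 2))
```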
Human parity in TT-TT judgements
The question most frequently asked about the quality of machine translation is whether
it will ever be as good as fully human translation (understood as a production process
that uses no MT). The answer depends on the nature of the start text, the MT system, the
human translation entering into the comparison, the language pair, the definition of
‘quality’, and the human or automatic metrics used. So the answer will never be simple.
The more important point, however, is that the question itself is poorly formulated,
since it allows at least three simplistic answers:
- Yes, statistical MT is by definition as good as human translation since it is based
on locating and recycling prior human translations (see supra.);
- Yes, statistical MT will reach and surpass humans because computer processing
capacity increases geometrically and has surpassed, or should surpass, or will
surpass the capacity of the human brain – this is the moment of ‘singularity’
(Kurzweil 2005), achieved in the fields of chess and the game Go, and rumoured
to have been identified in some translation experiments (see infra.);
- No, statistical MT, even in its neural avatars, will never reach human quality
because people are stupid: they think MT translations are valid; they thus put
raw MT output on public websites and the like; the defective translations are fed
back into the MT databases; the user-detected quality of statistical MT in some
language pairs has thus stagnated (see, for example, Lotz and van Rensburg
2016). That is, even when computer processing capacity equals that of humans
(such are the claims of singularity), people still need to learn what to do with
that capacity.
In theory, neural technologies help restrict the concerns of this third answer by
working from cleaner databases and thus sidelining some of the stupid people. They
also use omission for phrases where there is no sure candidate translation – better to say
nothing at all rather than something completely off the wall.
These advances notwithstanding, the more pertinent question, at least for practicing
translators and their employers, is not what quality machine translation produces, but
whether it is cost-beneficial, in terms of effort and quality, to pre-edit and/or postedit
machine-translation output rather than translate from scratch (see Vieira in this volume).
The answer to this question must be, once again, that it depends on a wide range of
factors, including language distance, text type, definition of quality, the metric used, and
who the posteditor is, with this last factor perhaps being the most crucial (Mellinger
2018) – a recurrent assumption in the more theoretical studies is that all editors
somehow share the same level of expertise.
Pre-editing is based on revising the start text in order to remove ‘negative
translatability indicators’ or elements that are likely to be problematic for machine
translation (Underwood and Jongejan 2001, Mitamura and Nyberg 2001). This is in
many respects an application of controlled language, although more specific indicators
should be identified for particular language pairs (Nyberg, Mitamura and Huijsen 2003,
O’Brien and Roturier 2007, Fiederer and O’Brien 2009). The effort invested in pre-editing
logically increases the quality of the MT output and thereby reduces the effort
required for postediting (Aikawa et al. 2007). The cost-effectiveness of pre-editing with
respect to postediting generally only comes into its own when a given start text is to be
translated into more than a few target languages. It should then increase more or less
arithmetically with each additional target language, depending on how language-pair-specific the translatability indicators are.
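The arithmetic can be sketched in a few lines, with all figures invented for illustration: pre-editing is a one-off cost on the start text, while the postediting saving recurs for each target language.

```python
# Back-of-envelope sketch of pre-editing cost-effectiveness.
# All figures are invented for illustration.

preedit_cost_hours = 4.0      # one-off cost, applied to the start text
postedit_saving_hours = 1.5   # postediting effort saved per target language

for n_languages in (1, 2, 3, 5, 10):
    net = n_languages * postedit_saving_hours - preedit_cost_hours
    print(f"{n_languages:>2} target languages: net saving {net:+.1f} hours")
# Break-even here at around 3 target languages; beyond that the saving
# grows roughly arithmetically with each additional language, provided
# the translatability indicators are not too language-pair-specific.
```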
In terms of linguistic features, judgements of the quality of pre-edited and/or
postedited machine translation generally have little to say about the ST-TT relationship
(since the posteditor is supposed to be a human translator, and thus theoretically
constitutes the yardstick by which measurements are made) and tend to concern instead
the ‘fluency’ or ‘acceptability’ of the translation, where questions of discursive style
seem important. Quite early in the game, Claude Bédard (2000) noted that technologies
that draw on the work of previous translators tend to bring across the styles of those
different translators, and indeed the styles of the various thematic fields in which they
were working. Bédard was complaining about translation-memory systems more than
MT, but his general warning, that any text-reuse technology can result in a ‘sentence
salad’ – a mixture of linguistic styles – retains validity in any genre where style is not
yet fully regulated.
Some studies have detected stylistic effects resulting from the ways in which the
various technologies segment the texts to be postedited. Bowker (2005) and Ribas
(2007) seeded errors into translation memories and found that, thanks to the
combination of segmentation and immediate translation proposals, some of those errors
were not corrected, especially when they concerned numbers. Other studies such as
Vilanova i Subirats (2006) and Martín-Mor (2011) find that the use of translation
memories increases linguistic interference from the start language. Such findings may
be attributed to the modes of reading invited by the segmentation of texts within the
translation memory suites, when translators are made to focus on the sentence level to
the detriment of cohesion and cohesive flow – the translator’s sense of what ‘sounds
right’ is thereby rendered less acute. The effects might also be explained by a degree of
authority being attributed to the translation suggestions, especially in cases where the
translation memories come from the client and the translator is instructed not to touch
the exact or full matches (LeBlanc 2013). The translator’s strategic use of risk transfer
is thereby enhanced (the translation may be wrong, but I’m not responsible for it…).
Such effects explain why post-draft revision and review processes become all the more
important in the workflows.
Further, when evaluations of MT quality are based on isolated out-of-context
sentences, as is often the case, the linguistic skills of translators and other bilinguals are
rarely required: sentences are usually right or wrong on target-language criteria. For
Coughlin (2003: 7), human raters are thus made to ‘behave like expensive, slow versions
of BLEU’ – their skills are unnecessary. A corollary, of course, is that such testing
procedures exclude human co-textual and contextual processing skills from the evaluation
process.
Fiederer and O’Brien (2009: 62-63) found that postedited MT was slightly better
for clarity and accuracy but was rated lower in terms of style. García (2010), however,
found that postedited MT was rated slightly higher than fully human translations by
evaluators applying the system developed by the Australian National Accreditation
Authority for Translators and Interpreters (NAATI). The small groups used in these
experiments and the limited range of language pairs mean that no definitive conclusions
can be drawn. Yet it is worth repeating that the quality of postedited statistical MT can
indeed be similar to that of fully human translation (cf. Koponen 2016), and anecdotal
evidence suggests that neural systems are significantly enhancing the advantages of
postedited MT in cognate language pairs.
An example of this problem is a paper released by Microsoft in March 2018 that
makes the following claim with respect to Chinese to English news translation: ‘We
find that our latest neural machine translation (NMT) system has reached a new state-of-the-art, and that the translation quality is at human parity when compared to
professional human translations’ (Hassan et al. 2018: 1). This is the kind of strong claim
that deserves to be analyzed in detail (also see Vieira in this volume). ‘Parity’ here
means that the neural systems produce translations that are indistinguishable from
human translations, not that the translations are error-free. This finding comes from
18M bilingual sentence pairs being evaluated by ‘bilingual crowd workers’, who were
asked whether the candidate translation conveyed ‘the semantics of the source text’
(2018: 3). ‘Bilingual annotators’ were then used to identify errors. The important point
is that the human testing involved context-free sentence pairs assessed in terms of
content, not form. So pragmatic, discursive, stylistic or other purpose-based features
were excluded from the very narrow concept of quality, to the extent that professional
translators were apparently not needed in the evaluation process.
Such ways of testing quality necessarily work to the advantage of MT output.
The concept of quality operative in paired-sentence evaluation assumes a quite literalist
mode of translation, devoid of the kind of solutions that can be used for specific
readerships: transcription, calque, resegmentation, compensation, cultural
correspondence, updating, and so on. A legitimate case can be made for comparing
translated documents rather than isolated sentences (Läubli, Sennrich and Volk 2018).
More specifically, the evocation of human parity precariously assumes that all sentences
are of the same importance for communicative success. A human translator may expend
greater effort on the more high-risk passages, reducing the risk of error in them and
negotiating ethical issues, whereas MT output may have the same probability of error
per sentence quite independently of communicative criteria. In other words, if a page of
human translation and a page of MT output both have three errors in them, one would
expect that the human translation does not have the errors in the high-risk passages. And
when the end-user gets the translations, they may indeed assume human parity, but they
still do not know exactly where those three errors are.
Parity and usability are thus quite different things.
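The argument can be put numerically, with all figures invented: two translations with the same expected number of errors per page can carry very different expected damage once the stakes of individual sentences are taken into account.

```python
# Numerical sketch of risk-weighted error: a human translator who
# concentrates effort on high-stakes passages produces less expected
# damage than uniformly distributed MT error, for the same error count.
# All figures are invented for illustration.

stakes = [10.0, 1.0, 1.0, 1.0, 1.0]   # damage if each sentence is wrong
p_mt = [0.2] * 5                       # MT: roughly uniform error probability
p_human = [0.02, 0.245, 0.245, 0.245, 0.245]  # human: extra care where it counts

expected_errors = lambda p: sum(p)
expected_damage = lambda p: sum(pi * s for pi, s in zip(p, stakes))

print(expected_errors(p_mt), expected_errors(p_human))  # both ≈ 1.0 errors
print(expected_damage(p_mt), expected_damage(p_human))  # ≈ 2.8 vs ≈ 1.18
```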
Usability, acceptability and specific purposes
The growth of the localization industry saw widespread use of several metrics for
evaluating the quality of a translation. The model put forward by the Localization
Industry Standards Association (LISA) was based on the identification of errors
(Mistranslation, Accuracy, Terminology, etc.) and their weighting in terms of Minor,
Major or Critical, in much the same way as different categories and weightings are
employed in pedagogy and certification exams for translators. Technology has no
special place in these approaches. It does, however, start to figure in more flexible
metrics designed not just for evaluation but also for quality assurance, where the causes
and remedies of error are addressed. Multidimensional Quality Metrics (MQM) allow
for different degrees of quality to be sought for specific purposes, for error to be
attributed to real-world factors like poorly written start texts, and for some errors to be
attributed to probable causes like defective glossaries or poor use of automatic checking
systems. MQM has a checklist of over a hundred quality issues, which can be adjusted
to allow for different degrees of granularity in assessments. The high-order categories
are: accuracy, design, fluency, internationalization, locale convention, style,
terminology, verity, and other (MQM 2015a). The selection and weighting of categories
allow the system’s designers to claim it is based on the ‘functionalist’ approach whereby
‘quality can be defined by how well a text meets its communicative purpose’ (MQM
2015b): although the individual items involve comparisons between texts and with
standards and expectations, the ‘function’ enters when the user decides which particular
categories are going to enter into the evaluation. The MQM project is being continued
in Quality Translation 21 (QT21), which extends the categories into areas like
audiovisual translation and the particular demands for MT in under-resourced and
morphologically complex languages.
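The underlying calculation is simple enough to sketch. In the following, the severity weights and pass threshold are invented for illustration; real metrics tune both to the purpose of the translation:

```python
# Sketch of LISA/MQM-style weighted error scoring: reviewers log errors
# by category and severity; the weighted total is measured against a
# per-1000-words threshold. Weights and threshold are invented here.

SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def quality_score(errors, word_count, max_penalty_per_1000=15):
    """errors: list of (category, severity) pairs logged in review."""
    penalty = sum(SEVERITY_WEIGHTS[sev] for _cat, sev in errors)
    per_1000 = penalty * 1000 / word_count
    return per_1000, per_1000 <= max_penalty_per_1000

errors = [("terminology", "minor"), ("accuracy", "major"),
          ("fluency", "minor"), ("locale convention", "critical")]
score, passed = quality_score(errors, word_count=1500)
print(f"penalty per 1000 words: {score:.1f}, pass: {passed}")  # 11.3, True
```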
In principle, clients are the first recipients of translations, since they are the ones
who have to be satisfied in order to pay for them and to keep paying for more. Their
perception of quality is thus of key importance. In the case of translation technologies,
though, there has been some doubt as to whether clients actually attach meaning to
BLEU scores and the like, and indeed whether they are in a position to judge quality at
all, especially when they do not master all the languages involved. Beninatto (2007)
provocatively claimed that ‘quality doesn’t matter’ since translation vendors all promise
perfect quality, their clients thus all expect perfect quality, and the same clients are often
unable to distinguish one translation provider from another. Something similar can be
seen in surveys where vendors all say they require ‘ability to produce 100% quality’
from the translators they employ (as in Optimale, cf. Toudic 2012), since no company
can afford to admit that it might also appreciate factors like speed, eloquence, or
suggestions about how to improve a text. Beninatto’s more serious argument was
nevertheless that translation companies should differentiate themselves precisely on
such ‘non-quality’ things, particularly the range and timeliness of their services, which
is where the variable use of technologies has a role to play (also see Bowker in this
volume).
Responding to Beninatto a few years later, Davies (2013) pointed out that
companies do in fact compete on quality, but not in the sense that they all claim to offer
the best. The competition is to provide scenarios where clients can ‘fine-tune’ the
quality they require, basically by selecting the technologies, the number of revisions and
reviews, and ideally the price. That is, technologies allow adjustment to the required
level of quality and the available budget. When this is the case, translation quality
obviously has not become irrelevant.
There are relatively few studies on quality as experienced by end users of
translations. It is commonly pointed out that different users have different expectations,
and that MT is mostly acceptable for those who want a general ‘gist’ understanding of a text
(Church and Hovy 1993, Lewis 1997, Miller et al. 2001). This is common sense and
feeds into a view where different degrees of postediting can be used to meet user
demands and indeed budgets. There are nevertheless areas where raw MT can prove
costly, not just in terms of linguistic errors but more especially with respect to company
image or general trustworthiness. Many cities around the world offer their websites with
unmediated MT in multiple languages: the information may be understandable, but the
public branding of the city will not benefit – linguistic form has a value, independently
of understood content. Similar but more serious considerations concern communities
that use translation to preserve and promote a threatened culture (Bowker 2009).
If the process is right, the product must be right
As in many industries where evaluating outputs is labour-intensive and partly
subjective, the translation industry has seen a tendency to rely on external industrial
standards. The standards purport to embody best practices; they apply to translation
service providers rather than to individual translators; they themselves involve
significant costs, which is a way of filtering out smaller and less solid companies; and
they are based on the philosophy that if you can regulate how translations are produced
(and by whom), you thereby regulate the quality of the translations themselves.
Among the main standards used by translation companies, ISO 17100 in Europe
and ASTM F2575 in the United States both refer to the importance of having each
translation be reviewed by someone other than the translator, although the terminology
for revision varies and it is unclear whether the task is to be performed by a bilingual or
a monolingual (cf. Biel 2011, Mellinger 2018). The MQM evaluation approach is based
on ISO/TS 11669 for translation projects. Perhaps the prime usefulness of these
standards is that they establish clear (albeit differing) terms and concepts that can be
understood by all stakeholders in the translation process. In this respect, the plurality of
standards would nevertheless constitute a drawback.
Strangely, though, these industrial standards have little to say about the use of
translation technologies as such. Most notably, the postediting of MT output has no
explicit place in ISO 17100, even though the development of the technologies would
seem to have been one of the main historical reasons for the development of new
standards (see Wright in this volume). That said, there is a separate standard for
postediting (ISO 18587) and all the recent standards attribute enhanced roles to the
various kinds of checking, revision and reviewing, all of which might be seen as
quality-control measures. At the same time, though, the need for greater quality control
could be attributed to the increasing role of machine translation and translation
memories, which would thus be operating as unspoken forces behind the industrial
standards.
Emerging issues: Quality for the translator
A question too rarely posed in discussions of technologies is how they affect the
translator’s mode of work and job satisfaction, which is in itself an experience of
quality. As Marx complained of industrialization, machines reduce labour to the
‘carcass of time’ (1847: 30). Gone, apparently, are factors such as craft, pride in
product, and quality as some kind of intrinsic virtue, a timeless cause of happiness for
the producer and of appreciation for the consumer.
We know relatively little about the kind of quality that is felt by translators when
working with technology. A quick but common assumption is that technologies remove
the donkeywork, leaving translators to solve the more interesting problems. An
alternative surmise is that job satisfaction can only come from fully human work. In
between those extremes, there is much to be discovered.
Some attention has been paid to who is best able to improve the quality of MT
outputs by postediting. Remarkably, ALPAC (1966) included an experiment with 23
translators on this very question and found that ‘fast translators will lose productivity if
given postediting to do, whereas slow translators will gain’ (ibid.: 94). Further, ‘most of
the translators found postediting tedious and even frustrating’ (96), although some
found it useful as an aid when translating, ‘particularly with regard to technical terms’
(97). Although the linguistic quality of MT output has improved enormously since the
1960s, the basics of this job-satisfaction assessment seem to have remained fairly
constant (cf. García 2010): poor translators benefit the most.
Guerberof (2013) interviewed 24 translators and three reviewers to find that
translators have mixed feelings when postediting MT, with the degree of satisfaction
depending mainly on the quality of the MT output. One might assume that NMT has
thus increased their satisfaction since then. A particular source of frustration was
nevertheless the different ways in which pay rates are calculated for different kinds of
tasks. It seems that there is still little standardization of the way postediting is
remunerated, although one would assume that a healthy hourly rate would be fair for all
sides.
Teixeira (2014), analyzing ten professional translators, found that translators
perceive postediting tasks more positively the more they are familiar with the text genre
and the technologies, and that there is a preference for metadata to appear when the
technologies offer possible translations. That is, posteditors appreciate being able to
know more or less where a translation suggestion is coming from, in order to assess its
trustworthiness. Interestingly, Teixeira’s experiment involved a situation where the
translators did not know whether the suggested solutions were from MT, a TM or a
fully human translation. He found that when a suggested translation was faulty, the
translators tended to assume it came from MT, even when this was not the case. There
are thus signs of a partly unfounded prejudice against MT.
LeBlanc (2013, 2017) studies the use of TMs in the workplace and reports that
translator dissatisfaction is not so much with the tools themselves as with the business
practices that have been built upon them. The ability to measure productivity
mechanically means that the translator has to meet productivity requirements – the
measure of quality becomes the number of words per unit of time. And the frequent
obligation to recycle previous translations is similarly resented as a departure from
previous workplace practices.
These findings are necessarily located in a particular historical context. Each
generation progresses professionally and scales hierarchies thanks to the technologies it
was trained to master: spoken rhetoric, styli on wax tablets, papyrus, parchment, paper,
print of many kinds and speeds, typewriters, word processors, and now machine
translation in translation memories. And just as each generation rises, it seeks to resist
the new technologies that it did not initially master: the priestly castes did not want
writing to be a general skill; journalists feared word processors; and as for translators, it
is still not unusual to hear global dismissal of advances: ‘if you can’t translate with
pencil and paper,’ says the former Canadian government translator Brian Mossop
(2003: 20), ‘then you can’t translate with the latest translation technology.’ The claim
may be technically correct – although Temizöz (2013) found that area experts do pretty
well postediting NMT – but it does not address a labour market where young translators
can create niches and displace the older generations precisely by stressing the new
technologies that allow them to do much better than pencil and paper. On this view, the
resistance found in surveys like Guerberof (2013) and Teixeira (2014) may be that of an
established generation clinging to the technologies that brought it to relative power.
Most of those who train students in translation technology are aware that this is an area
where the younger the student is, the faster they pick up new software.
Research on the actual experience of quality needs to focus on specific skills.
Our technologies come bundled in suites that involve quite different things. One of the
main advances in word processing software has been online spellchecking and thesauri
– tools that no one would really want to be without. Translation memory suites
incorporate a segmentation tool, translation suggestions, metadata on those suggestions,
MT feeds, quality assurance tools, sometimes sophisticated terminology tools, and a
user interface that may be more or less inviting. So when translators perform well with the
suites or react negatively against them, it is often because of one or two of these
features, and in terms of the specific skills associated with those tools. Studies that just
give global reactions to whole translation memory suites thus seem to be missing the
point. It makes more sense to isolate the specific tools, test how well or badly they suit
translators’ dispositions, and to do the skills analysis from there.
There seems little doubt that translation will increasingly become pre-editing
and postediting, and the translator’s role will be to correct and authorize texts that have
been produced electronically. That kind of task can involve high-level, satisfying
intellectual work, relieved of the donkeywork that much manual translating can involve.
It is up to translators and their employers to ensure that that kind of work-process
quality is appreciated and rewarded.
Conclusion: The human in the machine
Translation technologies are helping translators keep up with global demand, enhancing
cross-border service provision, and enabling collaboration with volunteer translators. As
this happens, debates over quality tend to be highly ideological: on the one hand,
fear-mongering Luddites idealize the human and fail to see real progress in technology;
on the other, unabashed company promotion produces numerical bravura based on
questionable methodologies.
In between those extremes, it is salutary to recall the extent to which all
evaluations of quality rest on human values (a translation of this kind, for this user, to
achieve this purpose) and are built on a fundamental indeterminacy (translations can be
different and equally correct). Awareness of the human elements should help us temper
the hypostasized hype, and hopefully ensure progress for all.
Further reading
Developments in this field tend to occur faster than book publications, to the extent that
there are few valid reference works that will stand the test of time. Readers are advised
to follow the quality claims as they are published online, but to do so with a critical eye
to the human factors involved: who is doing the evaluating, on what basis, in relation to
what purpose, and with what kind of claim to authority. The following nevertheless
successfully address the current issues:
Kenny, D. (ed.) (2017) Human issues in translation technology, London and New York:
Routledge. | As the title indicates, this collective volume hits the bullseye in focusing on
the most worrying part of translation technologies: what happens to the human. Key
articles by Doherty, LeBlanc, Koskinen and Ruokonen, and Moorkens and O’Brien are
based on empirical data, taking us beyond the promises of gurus.
Moorkens, J., S. Castilho, F. Gaspari and S. Doherty (eds) (2018) Translation Quality
Assessment. From Principles to Practice, Cham: Springer. | This is a collection of
papers from the academy and industry, in part dealing with translation technology,
including five papers specifically on quality in machine translation. The contributions
by Way, Specia et al., and Toral et al. assess future possibilities of quality assessment.
References
Aikawa, T., L. Schwartz, R. King, M. Corston-Oliver and C. Lozano (2007) ‘Impact of
Controlled Language on Translation Quality and Post-editing in a Statistical
Machine Translation Environment’. Available online: https://goo.gl/sNoxfr
[last access 28 October 2018].
Akiba, Y., K. Imamura and E. Sumita (2001) ‘Using Multiple Edit Distances to
Automatically Rank Machine Translation Output’, in Proceedings of the MT
Summit VIII, Santiago de Compostela, Spain. Available online:
https://goo.gl/zvMF3o [last access 28 October 2018].
ALPAC (1966) Languages and machines: computers in translation and linguistics. A
report by the Automatic Language Processing Advisory Committee, Division of
Behavioral Sciences. Washington DC: National Academy of Sciences, National
Research Council. Available online: https://goo.gl/6PmwcA [last access 28
October 2018]
Aristotle (1938) Categories. On Interpretation. Prior Analytics, trans. H. P. Cooke and
H. Tredennick, Cambridge MA: Harvard University Press.
Austermühl, F. (2006) ‘Training Translators to Localize’, in A. Pym et al. (eds)
Translation Technology and its Teaching (with much mention of localization),
Tarragona: Intercultural Studies Group, 97-105. Available online:
https://goo.gl/ysdcdW [last access 28 October 2018].
Banerjee, S. and A. Lavie (2005) ‘METEOR: An Automatic Metric for MT Evaluation
with Improved Correlation with Human Judgments’, in Proceedings of Workshop
on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization,
Ann Arbor, Michigan, June 2005.
Bédard, C. (2000) ‘“Translation Memory Seeks Sentence-oriented Translator…”’,
Traduire 186: 41-9.
Beninatto, R. (2007) ‘Quality still doesn’t matter’, presentation to ATA conference, San
Francisco. Available online: http://goo.gl/QuhnEg [last access 28 October 2018].
Biel, Ł. (2011) ‘Training translators or translation service providers? EN 15038: 2006
standard of translation services and its training implications’, The Journal of
Specialised Translation 16: 61-76.
Bowker, L. (2005) ‘Productivity vs Quality? A pilot study on the impact of translation
memory systems’, Localisation Focus 4: 13-20.
Bowker, L. (2009) ‘Can Machine Translation meet the needs of official language
minority communities in Canada? A recipient evaluation’, Linguistica
Antverpiensia 8: 123-155.
Byrne, J. (2006) Technical translation: Usability strategies for translating technical
documentation, Dordrecht: Springer.
Chesterman, A. (2004) ‘Functional quality’, a lecture at Universitat Rovira i Virgili,
Tarragona. Available online: https://goo.gl/t36dL7 [last access 28 October 2018].
Church, K. and E. Hovy (1993) ‘Good applications for crummy machine translation’,
Machine Translation 8(4): 239-258.
Coughlin, D. (2003) ‘Correlating Automated and Human Assessments of Machine
Translation Quality’, MT Summit IX, New Orleans, 23–27. Available online:
https://goo.gl/7yNzLV [last access 28 October 2018].
Cronin, M. (2013) Translation in the digital age, London & New York: Routledge.
Davies, I. (2013) ‘The hardest word in translation’, The Pillar Box (Institute of
Translation and Interpreting). Available online: https://goo.gl/Raxob4 [last access
8 July 2018].
Dragsted, B. (2004) Segmentation in translation and translation memory systems: An
empirical investigation of cognitive segmentation and effects of integrating a TM
system into the translation process, Copenhagen: Samfundslitteratur.
European Commission (2006) Special Report No 9/2006 concerning translation
expenditure incurred by the Commission, the Parliament and the Council.
Available online: https://goo.gl/RZzntv [last access 28 October 2018].
Fiederer, R. and S. O’Brien (2009) ‘Quality and machine translation: A realistic
objective?’, Journal of Specialised Translation 11: 52–74. Available online:
https://goo.gl/fBqP4u [last access 28 October 2018].
Freigang, K. H. (1998) ‘Machine-aided translation’, in M. Baker (ed.) Routledge
Encyclopedia of Translation Studies, London & New York: Routledge, 134-136.
García, I. (2006) ‘Translators on Translation Memories. A Blessing or a Curse?’, in A.
Pym et al. (eds) Translation Technology and its Teaching (with much mention of
localization), Tarragona: Intercultural Studies Group, 97-105. Available online:
https://goo.gl/ysdcdW [last access 28 October 2018].
García, I. (2010) ‘Is machine translation ready yet?’, Target 22(1): 7–21.
Gow, F. (2003) Metrics for Evaluating Translation Memory Software, MA thesis,
University of Ottawa. Available online: https://goo.gl/osq7ez [last access 28
October 2018].
Graham, Y. and T. Baldwin (2014) ‘Testing for Significance of Increased Correlation
with Human Judgment’, Proceedings of EMNLP 2014, Doha, Qatar.
Guerberof, A. (2013) ‘What do professional translators think about postediting?’, Journal of Specialised Translation 19: 75-95.
Hassan, H., et al. (2018) ‘Achieving Human Parity on Automatic Chinese to English
News Translation’. Available online: https://goo.gl/iD9DEd [last access 28
October 2018].
House, J. (2015) Translation quality assessment. Past and present, London & New
York: Routledge.
Hovy, E. E. (1999) ‘Toward finely differentiated evaluation metrics for machine
translation’, in Proceedings of the Eagles Workshop on Standards and
Evaluation, Pisa, Italy. 127-133.
Jordan, P. W. (1998) An Introduction to Usability, London: Taylor and Francis.
Koponen, M. (2016) ‘Is machine translation post-editing worth the effort? A survey of
research into post-editing and effort’, Journal of Specialised Translation 25: 131-147.
Kurzweil, R. (2005) The singularity is near, New York: Viking Books.
Läubli, S., R. Sennrich and M. Volk (2018) ‘Has Machine Translation Achieved Human
Parity? A Case for Document-level Evaluation’. Available online:
https://arxiv.org/pdf/1808.07048.pdf [last access 28 October 2018].
Le, Quoc V. and M. Schuster (2016) ‘A neural network for machine translation, at
production scale’. Google AI Blog. Available online: https://goo.gl/EcFszd [last
access 28 October 2018].
LeBlanc, M. (2013) ‘Translators on translation memory (TM). Results of an
ethnographic study in three translation services and agencies’, Translation &
Interpreting 5(2): 1-13.
LeBlanc, M. (2017) ‘“I can’t get no satisfaction!” Should we blame translation
technologies or shifting business practices?’, in D. Kenny (ed.) Human issues in
translation technology, London and New York: Routledge, 45-62.
Levenshtein, V. I. (1966) ‘Binary codes capable of correcting deletions, insertions and
reversals’, Soviet Physics - Doklady 10(8): 707-710. Available online:
https://goo.gl/y7ioao [last access 28 October 2018].
Lewis, D. (1997) ‘Machine translation in a Modern Languages curriculum’, Computer
Assisted Language Learning 10(3): 255-271.
Lotz, S. and A. van Rensburg (2016) ‘Omission and other sins: tracking the quality of
online machine translation output over four years’, Stellenbosch Papers in
Linguistics 46: 77-97.
Martín-Mor, A. (2011) La interferència lingüística en entorns de Traducció Assistida
per Ordinador: Recerca empíricoexperimental, doctoral thesis, Universitat
Autònoma de Barcelona. Available online: https://goo.gl/41Hn95 [last access 28
October 2018].
Marx, K. (1847) Misère de la philosophie, Paris & Brussels: A. Frank, C. H. Vogler.
Mellinger, C. D. (2018) ‘Re-thinking translation quality: Revision in the digital age’,
Target 30(2): 310-331.
Miller, K., D. Gates, N. Underwood and J. Magdalen (2001) ‘Evaluation of Machine
Translation Output for an Unknown Source Language’. Available online:
https://goo.gl/oUjffh [last access 28 October 2018].
Mitamura, T. and E. Nyberg (2001) ‘Automatic rewriting for controlled language
translation’. Available online: https://goo.gl/PfYbue [last access 28 October
2018].
Mitchell, L., J. Roturier and S. O’Brien (2013) ‘Community-based post-editing of
machine-translated content: monolingual vs. bilingual’, in S. O’Brien, M. Simard
and L. Specia (eds) Workshop Proceedings: Workshop on Post-editing
Technology and Practice (WPTP-2), Allschwil: The European Association for
Machine Translation, 35–44. Available online: https://goo.gl/JKjGQF [last access
28 October 2018].
Moorkens, J., S. Castilho, F. Gaspari and S. Doherty (eds.) (2018) Translation Quality
Assessment. From Principles to Practice, Cham: Springer.
Mossop, B. (2003) ‘What Should be Taught at Translation School?’, in A. Pym, C.
Fallada, J. R. Biau and J. Orenstein (eds) Innovation and E-Learning in
Translator Training, Tarragona: Universitat Rovira i Virgili, 20-22. Available
online: http://www.intercultural.urv.cat/publications/elearning/ [last access July
8, 2018].
MQM (2015a) ‘Multidimensional Quality Metrics (MQM) Issue Types’. Available
online: http://www.qt21.eu/mqm-definition/issues-list-2015-12-30.html [last
access 28 October 2018].
MQM (2015b) ‘Multidimensional Quality Metrics (MQM) Definition’. Available
online: http://www.qt21.eu/mqm-definition/definition-2015-12-30.html [last
access 28 October 2018].
Nyberg, E., T. Mitamura and W.O. Huijsen (2003) ‘Controlled language for authoring
and translation’, in H. Somers (ed) Computers and Translation: A Translator’s
Guide, Amsterdam & Philadelphia: John Benjamins, 245-281.
O’Brien, S. and J. Roturier (2007) ‘How Portable are Controlled Languages Rules? A
Comparison of Two Empirical MT Studies’, MT Summit XI, Copenhagen,
Denmark, 345-352. Available online: https://goo.gl/2fyrHB [last access 28
October 2018].
Pym, A. (2004) The Moving Text: Localization, Translation, and Distribution,
Amsterdam & Philadelphia: Benjamins.
Quine, W. V. O. (1960) Word and Object, Cambridge, MA: MIT Press.
Reiss, K. and H. J. Vermeer (1984) Grundlegung einer allgemeinen
Translationstheorie, Tübingen: Niemeyer.
Ribas, C. (2007) ‘Translation Memories as Vehicles for Error Propagation: A Pilot
Study’. Minor Dissertation. Tarragona: Universitat Rovira i Virgili.
Reinke, U. (1999) ‘Evaluierung der linguistischen Leistungsfähigkeit von Translation
Memory Systemen. Ein Erfahrungsbericht’, LDV-Forum 16: 100-117.
Su, K.Y., M. W. Wu and J. S. Chang (1992) ‘A New Quantitative Quality Measure for
Machine Translation Systems’, in Proceedings of COLING-92, Nantes, France.
Available online: https://goo.gl/kSEsmJ [last access July 8 2018].
Teixeira, C. S. C. (2014) ‘Perceived vs. measured performance in the post-editing of
suggestions from machine translation and translation memories’, in S. O’Brien,
M. Simard and L. Specia (eds) Proceedings of the Third Workshop on Post-Editing
Technology and Practice, 45-59.
Temizöz, Ö. (2013) Postediting machine-translation output and its revision. Subject-matter experts versus professional translators, PhD thesis, Universitat Rovira i
Virgili. Available online: https://goo.gl/OQtQvv [last access 28 October 2018].
Torres del Rey, J. (2005) La interfaz de la traducción. Formación de traductores y
nuevas tecnologías, Granada: Comares.
Toudic, D. (2012) ‘Employer Consultation Synthesis Report’, OPTIMALE Academic
Network project on translator education and training, Université Rennes 2,
Rennes.
Underwood, N. and B. Jongejan (2001) ‘Translatability checker: A tool to help decide
whether to use MT’, in B. Maegaard (ed.) Proceedings of MT Summit VIII:
Machine translation in the information age, Santiago de Compostela, 363-368.
Available online: https://goo.gl/QdrF9s [last access 28 October 2018].
Vilanova i Subirats, S. (2006) L’impacte de les memòries de traducció sobre el text
d’arribada: interferències i trets lingüístics, Minor dissertation, Universitat
Rovira i Virgili. Available online: https://goo.gl/xjLuPT [last access 28 October
2018].
Way, A. (2018) ‘Quality expectations of machine translation’, in J. Moorkens et al.
(eds) (2018) Translation Quality Assessment. From Principles to Practice, Cham:
Springer, 159-178.
Webb, L. E. (2000) Advantages and Disadvantages of Translation Memory, MA thesis,
Monterey Institute for International Studies. Available online:
https://goo.gl/d9kBHU [last access 28 October 2018].
White, J. (1995) ‘Approaches to Black Box MT Evaluation’. Available online:
https://goo.gl/LM432C [last access 28 October 2018].
White, J. (2003) ‘How to evaluate machine translation’, in H. Somers (ed) Computers
and translation. A translator’s guide, Amsterdam & Philadelphia: John
Benjamins, 211-244.
White, J., T. O’Connell and L. Carlson (1993) ‘Evaluation of machine translation’, in
Human Language Technology: Proceedings of the Workshop (ARPA): 206–210.
Available online: https://goo.gl/KCxTP2 [last access 28 October 2018].
Wu Y., M. Schuster, Z. Chen, Q. V. Le and M. Norouzi (2016) ‘Google’s Neural
Machine Translation System: Bridging the Gap between Human and Machine
Translation’. Available online: https://goo.gl/29cgid [last access 28 October
2018].
Note
1. I use the term start text rather than source text because technologies mean that
translations are these days produced from translation memories, glossaries and machine-translation proposals, all of which are as much a ‘source’ as the text the translator
actually starts from. The term also brings us into line with what is said in neighbouring
languages: Ausgangstext, texte de départ, texto de partida, for example.