Quality
Anthony Pym
Universitat Rovira i Virgili
Post-print of: Pym, A. (2020). Quality. In M. O’Hagan (Ed.) The Routledge Handbook of
Translation and Technology, pp. 437-452. Abingdon and New York: Routledge.
Abstract
Understood as the relative excellence of a translation product or process, quality can be
measured in many ways, including automatic comparison metrics, evaluation by
translators, evaluation by monolingual end users, time required for postediting, time
required for non-translation (language learning), process regulation, user satisfaction,
and translator satisfaction. Behind all these measures there lie a series of human
judgements and work-process considerations. In order to draw out those human aspects
of quality, a critical appraisal is made of five relations involved: 1) Automatic
evaluation metrics appear to measure equivalence to a start text but in effect adopt a
reference translation, which is itself subject to all the hazards of translational
indeterminacy; 2) Claims to parity with human translation are based on human
judgements of acceptability but are often measured on the basis of isolated sentence
pairs, which is not how humans communicate; 3) Criteria of usability generally do not
take into account the risks involved in not knowing where error might lie; 4) Industrial
regulations of production processes allow for enhanced reviewing and revision needs
but do not address technologies directly; and 5) Assessments of translator satisfaction
give variable results but tend not to account for the individual skills involved in the use
of technologies.
Keywords: translation quality, neural machine translation, indeterminacy, usability, job
satisfaction
‘If the mere quantity of labor functions as a measure of value
regardless of quality, it presupposes that simple labor has
become the pivot of industry. It presupposes that labor has
been equalized by the subordination of people to machines or
by the extreme division of labor: people are effaced by their
labor.’
Karl Marx (1847:30) [my translation]
Introduction
‘Qualities’, in Aristotle (Categories), are properties of things, as opposed to the common-usage sense where ‘quality’, in the singular, is the relative excellence of the thing,
usually for a particular purpose. These two senses are nevertheless related, particularly
in a field of ongoing innovation where changes in properties (‘qualities’) are to be
measured in terms of changes in excellence (‘quality’). Translation technologies
constitute one such field.
The basic thing to be measured here is the relative excellence of a translation
produced with or without a particular technology. This is complicated only slightly by
the possibility of measuring the excellence of the translation process as well. The
quality of translation technologies is superficially presented as an affair of numbers and
rules: Levenshtein distances, BLEU scores, adherence to industrial standards, and the
like (see Wright, Doherty and Melby for their respective chapters in this volume). Such
apparently objective criteria are nevertheless themselves judged and thus ultimately
made meaningful in human terms, incorporating criteria that may include how fast
translations are produced, how efficacious they are in the attainment of purposes, how
satisfied users are with linguistic products, how happy translators are, and hopefully
how successful whole communication acts are. Behind the technical numbers, if you
know where to look, there are kinds of quality that are ultimately measured in terms of
what humans think and do.
This is not to say that the human aspect of quality is ever just one absolute
measure; it involves several quite different ways in which people interact with
translations.
Literature review
The general development of translation technologies is clear enough: the rise and fall
of machine translation in the 1950s and 1960s, the acceptance of translation memories
in the 1990s, and the integration of statistical and neural machine translation in the 21st
century. Each step along the way has been accompanied by a set of discourses on
quality, mostly seeking progress. Quality is not, as Cronin surmises (2013: 128), a new
concern with postediting, the ‘return of the repressed translation detail’. It has long been
an issue in arguments for and against technologies.
The 1966 ALPAC Report, for example, was very much about assessing the
current and future quality of machine translation. In part it did so by calculating the
human time required to make machine translation between Russian and English usable,
compared with the number of hours a scientist required to learn enough Russian to read in
their field of expertise (1966: 5). A further comparison was between the time required to
postedit machine translation (yes, in 1966) and the time needed to translate from scratch
(ibid.:97). On both those counts, the bottom-line criterion for quality was the number of
hours a human would spend on different tasks – the mere quantity of labour. It was on
that criterion that research on machine translation (albeit not on computational
linguistics) was drastically curtailed.
When machine translation research picked up again, attempts were made to
evaluate output in terms of human judgements of linguistic quality. In the 1990s the
Advanced Research Projects Agency (ARPA) did this by using methods including
comprehension tests on back-translations, judgements by professional translators, and
human assessment of adequacy (how much information is transferred) and fluency (how
correct the language is) (White, O’Connell and Carlson 1993, White 2003, 2005). The
expense, time, and subjectivity of many of those methods proved generally daunting
(Hovy 1999).
It is intriguing to see how the early evaluations of quality attempted to marshal
the opinions of different translators. The ALPAC report includes an experiment
comparing the way monolinguals and bilinguals assess MT outputs. It notes that
bilinguals took longer to assess the sentences and added little to the overall evaluation:
‘One is inclined to give more credence to the results from the monolinguals because
monolinguals are more representative of potential users of translations’ (1966: 72-73).
This particular debate has not been resolved over the years, although the industry-based
papers do tend to prefer monolingual evaluation in terms of efficiency (White, O’Connell
and Carlson 1993, Coughlin 2003). The issue nevertheless shifted from straight MT
evaluation to the kind of active evaluation performed in postediting. Koponen (2016)
cites studies that compare monolingual and bilingual postediting and generally finds,
contrary to ALPAC, that ‘post-editing does not currently appear feasible if the
posteditors have no access to the source text’ (2016: 142). This could, however, be
unnecessarily pessimistic. In a small study, Mitchell et al. (2013) found that
‘monolingual post-editing can lead to improved fluency and comprehensibility scores
similar to those achieved through bilingual post-editing’ (2013: 1), although ‘fidelity’
was improved more in the bilingual setting. Temizöz (2013), working with a technical
text, found that postediting MT produced higher quality when carried out by subject-matter
experts than by trained translators, and that it may be more advantageous to have
experts revise translators’ work than vice versa.
These discussions indirectly address the question of who has the authority to
assess the quality of a translation. While translation theorists tend to refer to ‘equivalence’
as some kind of objective yardstick (cf. House 2015), actual experiments show a range of
different judgements: Martín-Mor (2011), for example, finds that academics list more
ostensible errors than do translation professionals; García (2010) records worrying
differences between official accreditation evaluators; Le and Schuster (2016) show there
is no universal agreement on what a ‘perfect translation’ is.
Given the high costs of human evaluation, automatic MT evaluation metrics
became standard (Fiederer and O’Brien 2009). Proposals date from the early 1990s and
basically set out to measure the edit distance between the MT and a reference human
translation (Su et al. 1992). Edit distances are generally based on Levenshtein (1966) and
involve the same kind of calculations that give us fuzzy matches in translation memory
systems. BLEU, TER and METEOR scores all compare MT output to a human-produced
reference translation (see Doherty in this volume). The correlations between
human and automatic evaluations have been cause for investigation and occasional
dispute (Coughlin 2003, Banerjee and Lavie 2005, Graham and Baldwin 2014, Wu et
al. 2016), since the selection of different parameters and reference texts can give quite
different scores.
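The calculation underlying both fuzzy matching and reference-based metrics can be sketched briefly. The following is a minimal illustration in Python; the normalization into a similarity score is illustrative only, since commercial TM systems use proprietary variants:

```python
# Minimal sketch of the edit-distance logic behind TM fuzzy matching and
# reference-based MT evaluation. The normalization is illustrative only;
# commercial systems use proprietary variants.

def levenshtein(a: str, b: str) -> int:
    """Single-character insertions, deletions and substitutions
    needed to turn string a into string b (Levenshtein 1966)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def fuzzy_match(segment: str, tm_entry: str) -> float:
    """Similarity in the style of a TM fuzzy match (1.0 = exact match)."""
    dist = levenshtein(segment, tm_entry)
    return 1.0 - dist / max(len(segment), len(tm_entry), 1)

print(fuzzy_match("The system translates text.",
                  "The system translated the text."))  # ≈ 0.84
```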
While human vs. automatic evaluation became an issue for machine translation,
the initial quality claims for the use of translation memories were more commonly
focused on the time taken to achieve a particular translation product. This was done
through reasoned speculation and ad hoc surveys of users, leading to quite complex lists
of incommensurate advantages and disadvantages (Freigang 1998, Webb 2000).
The assumption was that translation memories did indeed increase productivity,
and this became part of a general promotional discourse. Production and distribution
companies could produce any number of reports announcing savings achieved and
associated benefits obtained, although productivity basically depended on the degree of
repetition in the text involved – virtually any claim could be made by selecting
appropriate texts and parameters.
When more solid evidence did start to come in, it was ambivalent. García
(2010), for example, found that postediting MT was only sometimes more advantageous
than translating from scratch. From 2002 a series of audits of the European
Commission’s Translation Service showed little evidence of enhanced productivity due
to the use of translation technology: the average cost of one page of translation was 150
euros in 2003 and rose to 194 euros in 2005 (European Commission 2006), despite 23.7
million euros having been spent on technology in 2003. The ideological emphasis on
productivity then shifted to alternative benefits such as terminological and
phraseological consistency. At the same time, translators’ discussion lists gave voice to
doubts about the quality not just of outgoing text (García 2006) but also of the
translation memories that had been built up over time (Austermühl 2006).
In view of general awareness that productivity was only part of the story, a
series of scholars (for example, Reinke 1999, Gow 2003, García 2006) began to
broaden the way in which the various technologies are evaluated. In historical terms,
this may be seen as the traditional equivalence-based concept of translation quality
meeting up with the purpose-based paradigm that had been developing in general
translation theory since 1984 (Reiss and Vermeer 1984). The confluence drew easily on
existing technical discourses of usability (Jordan 1998, Byrne 2006) and pointed toward
what would become Multidimensional Quality Metrics.
In the background of the various technical discourses, questions have been
raised about the wider effects of translation technologies. Torres del Rey (2005)
comments on problems for the very concept of communication. Pym (2004) talks about
the dehumanization resulting from technologies where translators cannot visualize the
reception situation. Dragsted (2004) finds that automatic segmentation is adequate to
the purpose of establishing phrase-level equivalence, but that the segments used by
translators without the technology are frequently much larger, embracing factors
pertaining to the communication situation.
Methodological considerations
To make sense of these many different aspects, Chesterman (2004) usefully points out
that quality is never an absolute value; it is always a relation, of which there are several
kinds. Two possible relations are linguistic: between the translation and the start text¹
(allowing judgments of adequacy, equivalence or similarity), then between the
translation and ‘parallel texts’ understood as non-translations of the same type
(allowing people to judge fluency, the acceptability of language). Chesterman next
recognizes two further relations: between the translation and the need or purpose
(Skopos) that it is meant to fulfill (judgments of usability) and between the translation
and industrial standards (judgments concerning production processes). A fifth quality
relation is then between the translation and the translator (allowing judgments of job
satisfaction and just recompense).
If we look at the way translation technologies intersect with these five relations,
we find that the front-page judgements seem to be associated with the first relation only:
between the translation and the start text, and thus with questions of equivalence,
considered ‘the conceptual basis of translation quality assessment’ (House 2015: 5).
When a machine translation system or translation memory suite is being evaluated
automatically, that is what appears to be at stake – we want to know how well the start
text is rendered. However, if you look at the actual metrics used, they mostly involve
comparisons not with start texts but with human translations of those start texts, and
those human translations are in turn evaluated by several non-automatic metrics that
may concern holistic naturalness and the like. Further, when humans are called upon to
evaluate rival MT candidates, they are often doing so monolingually, comparing the
machine translation output with what is acceptable in the target languages. So what
appears to be a traditional relation between start text and translation (the first of
Chesterman’s relations) frequently turns out to be a complex evaluation involving
human translations, selective human reception, and implicit human comparison with
non-translations (the second of his relations).
This conceptual slippage is repeated elsewhere, to the point where the increasing
use of technologies has repercussions on all the relations identified by Chesterman.
Precisely because production processes are automated, there is an emphasis on
localization workflows, the need for adaptation, and measurements of usability and
satisfaction on the user’s side (Byrne 2006: 193ff.). And precisely because those same
workflows increasingly involve stages of revision and review, handling more words
than can be evaluated manually, industrial standards are created in order to regulate the
processes, on the precarious but time-saving assumption that if the process has quality,
then so must the product (see Wright in this volume). Further, since automated
translation can produce texts at different levels of quality depending on the resources
invested in pre-editing and postediting, there is increasing market awareness that clients
can fine-tune ‘fit for purpose’ translation (Way 2018, also see Bowker in this volume),
thus involving a very active human relation between translation and client. In fact, the
one relation that does not usually enter into calculations of quality, it seems, is translator
satisfaction, where the empirical indications of relative happiness are nowhere near as
plentiful as the enthusiastic testimonials used in sales promotions.
Critical discussion
Here I shall consider each of Chesterman’s five relations in turn, teasing out the human
elements from behind the technocratic discourse.
Indeterminacy in ST-TT comparison
Although only part of the range of translation technologies available, machine
translation systems deservedly grab the headlines. In theory, the historical movement
away from rule-based translation and towards statistical methods means that the
machines are not actually translating in any strict sense: they are searching for the
optimal translations previously done by a human, then putting those previous
translations together in various ways. In theory, if the algorithm finds the right human
working on the right text, then it finds the right translation. The basic operative
assumption behind statistical methods would thus be nobly democratic: if a large
majority of people have rendered a string in the same way, then they are likely to be
correct – likely, but with no certainty, in either machine translation or democratic
politics. A correlative assumption would be that the larger the database of previous
human translations, the better the statistical likelihood of the translation being correct.
Likely, but again with no more certainty than could be attributed to the presumption that
large democracies elect better governments than do small ones. Neural approaches
might be seen as correcting the democratic fallacies: the size of the database still
influences the quality produced, but so do the specific thematic contexts in which the
translations were done and the absence of intrusive items in the database. That is, neural
systems theoretically aim to select more specific human translations and eliminate
proposals that are contextually aberrant – a society of voters theoretically defers to
groups of more expert language users.
That theory does not tell the whole story, of course, since MT quality is still very
much dependent on the grammatical similarity between the languages concerned
(French to Spanish will give higher quality than German to Japanese), whereas if clean
context-specific databases were all that counted, the language distance should not matter so much.
And then much depends on the relative standardization of the text to be translated
(highly technical texts with fixed terminologies and a limited repertoire of verbal
relations will perform relatively well) and the electronic language resources developed
for particular languages (smaller languages with few electronic texts do not fare well).
The quality of MT output itself can be evaluated for at least two purposes: to
assess the inherent superiority of one system or another, or to estimate the degree of
postediting (and potentially pre-editing) necessary and thereby to calculate the relative
pricing of the associated language work. These evaluations can be holistic, based on a
checklist, or performed by algorithms that automatically compare texts.
The human evaluation of MT, with or without a checklist, is not necessarily
different from any other kind of translation, with the same kinds of problems: it
involves high workloads, suffers from a lack of inter-rater and intra-rater consistency,
and the categories and weightings ideally have to be adjusted to suit specific purposes.
Further, both human and automatic evaluation are more basically haunted by the
precarious assumption that there is only one correct translation for a given ST segment.
The fact that different people make different evaluations of translations evinces the
fundamental ‘indeterminacy’, described by Quine (1960) as a situation where translators
produce different translations, all of which are correct but different. That is, there are
almost always equally valid but different translations for the one input – and when this
is not the case, we are judging grammar or terminology rather than translation. This
indeterminacy concerns not just the inevitable differences between human evaluators,
but also the dependence of automatic evaluation on initial ‘reference translations’,
which could always have been otherwise.
The past decade has seen numerous variants on these approaches, particularly on
BLEU with respect to the possible multiplicity of reference translations, and systems
that seek to combine the results from previous systems, since the same MT output can
in theory be fed through all of them and the results can be synthesized. For example,
Akiba et al. (2001) do propose the use of multiple reference translations, and part of the
work since then has been on how to select and combine those reference translations,
which concerns the same kind of algorithms that have driven statistical machine
translation. The mechanical evaluation of MT (see Doherty in this volume) is thus
conceptually inseparable from the very technologies being evaluated.
A fundamental problem facing all these metrics remains the multiplicity of
possible human translations, and the human labour required to tell which of them might
be superior.
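The dependence on the reference translation can be made concrete with a toy example. The sentences below are hypothetical, and a crude unigram-overlap score stands in for BLEU’s n-gram precision; the point is that the same MT output scores differently depending on which equally valid rendering happens to serve as the reference:

```python
# Toy illustration of indeterminacy in automatic evaluation: the same MT
# output receives different scores against different (equally valid)
# reference translations. A crude unigram-overlap score stands in for BLEU.

def overlap_score(candidate: str, reference: str) -> float:
    cand, ref = candidate.lower().split(), reference.lower().split()
    matched = sum(min(cand.count(w), ref.count(w)) for w in set(cand))
    return matched / len(cand) if cand else 0.0

mt_output = "The committee approved the new budget yesterday"
references = [
    "Yesterday the committee passed the new budget",          # one rendering
    "The new budget received committee approval yesterday",   # another
]

for ref in references:
    print(round(overlap_score(mt_output, ref), 2))   # 0.86, then 0.71
# Scoring against multiple references and keeping the best match, as
# multi-reference BLEU does, softens but does not remove the effect.
print("best:", round(max(overlap_score(mt_output, r) for r in references), 2))
```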
Human parity in TT-TT judgements
The question most frequently asked about the quality of machine translation is whether
it will ever be as good as fully human translation (understood as a production process
that uses no MT). The answer depends on the nature of the start text, the MT system, the
human translation entering into the comparison, the language pair, the definition of
‘quality’, and the human or automatic metrics used. So the answer will never be simple.
The more important point, however, is that the question itself is poorly formulated,
since it allows at least three simplistic answers:
- Yes, statistical MT is by definition as good as human translation since it is based
on locating and recycling prior human translations (see supra.);
- Yes, statistical MT will reach and surpass humans because computer processing
capacity increases geometrically and has surpassed, or should surpass, or will
surpass the capacity of the human brain – this is the moment of ‘singularity’
(Kurzweil 2005), achieved in the fields of chess and the game Go, and rumoured
to have been identified in some translation experiments (see infra.);
- No, statistical MT, even in its neural avatars, will never reach human quality
because people are stupid: they think MT translations are valid; they thus put
raw MT output on public websites and the like; the defective translations are fed
back into the MT databases; the user-detected quality of statistical MT in some
language pairs has thus stagnated (see, for example, Lotz and van Rensburg
2016). That is, even when computer processing capacity equals that of humans
(such are the claims of singularity), people still need to learn what to do with
that capacity.
In theory, neural technologies help restrict the concerns of this third answer by
working from cleaner databases and thus sidelining some of the stupid people. They
also use omission for phrases where there is no sure candidate translation – better to say
nothing at all rather than something completely off the wall.
These advances notwithstanding, the more pertinent question, at least for practicing
translators and their employers, is not what quality machine translation produces, but
whether it is cost-beneficial, in terms of effort and quality, to pre-edit and/or postedit
machine-translation output rather than translate from scratch (see Vieira in this volume).
The answer to this question must be, once again, that it depends on a wide range of
factors, including language distance, text type, definition of quality, the metric used, and
who the posteditor is, with this last factor perhaps being the most crucial (Mellinger
2018) – a recurrent assumption in the more theoretical studies is that all editors
somehow share the same level of expertise.
Pre-editing is based on revising the start text in order to remove ‘negative
translatability indicators’ or elements that are likely to be problematic for machine
translation (Underwood and Jongejan 2001, Mitamura and Nyberg 2001). This is in
many respects an application of controlled language, although more specific indicators
should be identified for particular language pairs (Nyberg, Mitamura and Huijsen 2003,
O’Brien and Roturier 2007, Fiederer and O’Brien 2009). The effort invested in pre-editing
logically increases the quality of the MT output and thereby reduces the effort
required for postediting (Aikawa et al. 2007). The cost-effectiveness of pre-editing with
respect to postediting generally only comes into its own when a given start text is to be
translated into more than a few target languages. It should then increase more or less
arithmetically with each additional target language, depending on how language-pair-specific the translatability indicators are.
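The arithmetic can be sketched in a few lines, with all figures invented for illustration: pre-editing is a one-off cost on the start text, while the postediting saving recurs for each target language.

```python
# Back-of-envelope sketch of pre-editing cost-effectiveness.
# All figures are invented for illustration.

preedit_cost_hours = 4.0      # one-off cost, applied to the start text
postedit_saving_hours = 1.5   # postediting effort saved per target language

for n_languages in (1, 2, 3, 5, 10):
    net = n_languages * postedit_saving_hours - preedit_cost_hours
    print(f"{n_languages:>2} target languages: net saving {net:+.1f} hours")
# Break-even here at around 3 target languages; beyond that the saving
# grows roughly arithmetically with each additional language, provided
# the translatability indicators are not too language-pair-specific.
```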
In terms of linguistic features, judgements of the quality of pre-edited and/or
postedited machine translation generally have little to say about the ST-TT relationship
(since the posteditor is supposed to be a human translator, and thus theoretically
constitutes the yardstick by which measurements are made) and tend to concern instead
the ‘fluency’ or ‘acceptability’ of the translation, where questions of discursive style
seem important. Quite early in the game, Claude Bédard (2000) noted that technologies
that draw on the work of previous translators tend to bring across the styles of those
different translators, and indeed the styles of the various thematic fields in which they
were working. Bédard was complaining about translation-memory systems more than
MT, but his general warning, that any text-reuse technology can result in a ‘sentence
salad’ – a mixture of linguistic styles – retains validity in any genre where style is not
yet fully regulated.
Some studies have detected stylistic effects resulting from the ways in which the
various technologies segment the texts to be postedited. Bowker (2005) and Ribas
(2007) seeded errors into translation memories and found that, thanks to the
combination of segmentation and immediate translation proposals, some of those errors
were not corrected, especially when they concerned numbers. Other studies such as
Vilanova i Subirats (2006) and Martín-Mor (2011) find that the use of translation
memories increases linguistic interference from the start language. Such findings may
be attributed to the modes of reading invited by the segmentation of texts within the
translation memory suites, when translators are made to focus on the sentence level to
the detriment of cohesion and cohesive flow – the translator’s sense of what ‘sounds
right’ is thereby rendered less acute. The effects might also be explained by a degree of
authority being attributed to the translation suggestions, especially in cases where the
translation memories come from the client and the translator is instructed not to touch
the exact or full matches (LeBlanc 2013). The translator’s strategic use of risk transfer
is thereby enhanced (the translation may be wrong, but I’m not responsible for it…).
Such effects explain why post-draft revision and review processes become all the more
important in the workflows.
Further, when evaluations of MT quality are based on isolated out-of-context
sentences, as is often the case, the linguistic skills of translators and other bilinguals are
rarely required: sentences are usually right or wrong on target-language criteria. For
Coughlin (2003: 7), human raters are thus made to ‘behave like expensive, slow versions
of BLEU’ – their skills are unnecessary. A corollary, of course, is that such testing
procedures exclude human co-textual and contextual processing skills from the evaluation
process.
Fiederer and O’Brien (2009: 62-63) found that postedited MT was slightly better
for clarity and accuracy but was rated lower in terms of style. García (2010), however,
found that postedited MT was rated slightly higher than fully human translations by
evaluators applying the system developed by the Australian National Accreditation
Authority for Translators and Interpreters (NAATI). The small groups used in these
experiments and the limited range of language pairs mean that no definitive conclusions
can be drawn. Yet it is worth repeating that the quality of postedited statistical MT can
indeed be similar to that of fully human translation (cf. Koponen 2016), and anecdotal
evidence suggests that neural systems are significantly enhancing the advantages of
postedited MT in cognate language pairs.
An example of this problem is a paper released by Microsoft in March 2018 that
makes the following claim with respect to Chinese to English news translation: ‘We
find that our latest neural machine translation (NMT) system has reached a new state-of-the-art, and that the translation quality is at human parity when compared to
professional human translations’ (Hassan et al. 2018: 1). This is the kind of strong claim
that deserves to be analyzed in detail (also see Vieira in this volume). ‘Parity’ here
means that the neural systems produce translations that are indistinguishable from
human translations, not that the translations are error-free. This finding comes from
18M bilingual sentence pairs being evaluated by ‘bilingual crowd workers’, who were
asked whether the candidate translation conveyed ‘the semantics of the source text’
(2018: 3). ‘Bilingual annotators’ were then used to identify errors. The important point
is that the human testing involved context-free sentence pairs assessed in terms of
content, not form. So pragmatic, discursive, stylistic or other purpose-based features
were excluded from the very narrow concept of quality, to the extent that professional
translators were apparently not needed in the evaluation process.
Such ways of testing quality necessarily work to the advantage of MT output.
The concept of quality operative in paired-sentence evaluation assumes a quite literalist
mode of translation, devoid of the kind of solutions that can be used for specific
readerships: transcription, calque, resegmentation, compensation, cultural
correspondence, updating, and so on. A legitimate case can be made for comparing
translated documents rather than isolated sentences (Läubli, Sennrich and Volk 2018).
More specifically, the evocation of human parity precariously assumes that all sentences
are of the same importance for communicative success. A human translator may expend
greater effort on the more high-risk passages, reducing the risk of error in them and
negotiating ethical issues, whereas MT output may have the same probability of error
per sentence quite independently of communicative criteria. In other words, if a page of
human translation and a page of MT output both have three errors in them, one would
expect that the human translation does not have the errors in the high-risk passages. And
when the end-user gets the translations, they may indeed assume human parity, but they
still do not know exactly where those three errors are.
Parity and usability are thus quite different things.
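The argument can be put numerically, with all figures invented: two translations with the same expected number of errors per page can carry very different expected damage once the stakes of individual sentences are taken into account.

```python
# Numerical sketch of risk-weighted error: a human translator who
# concentrates effort on high-stakes passages produces less expected
# damage than uniformly distributed MT error, for the same error count.
# All figures are invented for illustration.

stakes = [10.0, 1.0, 1.0, 1.0, 1.0]   # damage if each sentence is wrong
p_mt = [0.2] * 5                       # MT: roughly uniform error probability
p_human = [0.02, 0.245, 0.245, 0.245, 0.245]  # human: extra care where it counts

expected_errors = lambda p: sum(p)
expected_damage = lambda p: sum(pi * s for pi, s in zip(p, stakes))

print(expected_errors(p_mt), expected_errors(p_human))  # both ≈ 1.0 errors
print(expected_damage(p_mt), expected_damage(p_human))  # ≈ 2.8 vs ≈ 1.18
```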
Usability, acceptability and specific purposes
The growth of the localization industry saw widespread use of several metrics for
evaluating the quality of a translation. The model put forward by the Localization
Industry Standards Association (LISA) was based on the identification of errors
(Mistranslation, Accuracy, Terminology, etc.) and their weighting in terms of Minor,
Major or Critical, in much the same way as different categories and weightings are
employed in pedagogy and certification exams for translators. Technology has no
special place in these approaches. It does, however, start to figure in more flexible
metrics designed not just for evaluation but also for quality assurance, where the causes
and remedies of error are addressed. Multidimensional Quality Metrics (MQM) allow
for different degrees of quality to be sought for specific purposes, for error to be
attributed to real-world factors like poorly written start texts, and for some errors to be
attributed to probable causes like defective glossaries or poor use of automatic checking
systems. MQM has a checklist of over a hundred quality issues, which can be adjusted
to allow for different degrees of granularity in assessments. The high-order categories
are: accuracy, design, fluency, internationalization, locale convention, style,
terminology, verity, and other (MQM 2015a). The selection and weighting of categories
allow the system’s designers to claim it is based on the ‘functionalist’ approach whereby
‘quality can be defined by how well a text meets its communicative purpose’ (MQM
2015b): although the individual items involve comparisons between texts and with
standards and expectations, the ‘function’ enters when the user decides which particular
categories are going to enter into the evaluation. The MQM project is being continued
in Quality Translation 21 (QT21), which extends the categories into areas like
audiovisual translation and the particular demands for MT in under-resourced and
morphologically complex languages.
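The underlying calculation is simple enough to sketch. In the following, the severity weights and pass threshold are invented for illustration; real metrics tune both to the purpose of the translation:

```python
# Sketch of LISA/MQM-style weighted error scoring: reviewers log errors
# by category and severity; the weighted total is measured against a
# per-1000-words threshold. Weights and threshold are invented here.

SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def quality_score(errors, word_count, max_penalty_per_1000=15):
    """errors: list of (category, severity) pairs logged in review."""
    penalty = sum(SEVERITY_WEIGHTS[sev] for _cat, sev in errors)
    per_1000 = penalty * 1000 / word_count
    return per_1000, per_1000 <= max_penalty_per_1000

errors = [("terminology", "minor"), ("accuracy", "major"),
          ("fluency", "minor"), ("locale convention", "critical")]
score, passed = quality_score(errors, word_count=1500)
print(f"penalty per 1000 words: {score:.1f}, pass: {passed}")  # 11.3, True
```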
In principle, clients are the first recipients of translations, since they are the ones
who have to be satisfied in order to pay for them and to keep paying for more. Their
perception of quality is thus of key importance. In the case of translation technologies,
though, there has been some doubt as to whether clients actually attach meaning to
BLEU scores and the like, and indeed whether they are in a position to judge quality at
all, especially when they do not master all the languages involved. Beninatto (2007)
provocatively claimed that ‘quality doesn’t matter’ since translation vendors all promise
perfect quality, their clients thus all expect perfect quality, and the same clients are often
unable to distinguish one translation provider from another. Something similar can be
seen in surveys where vendors all say they require ‘ability to produce 100% quality’
from the translators they employ (as in Optimale, cf. Toudic 2012), since no company
can afford to admit that it might also appreciate factors like speed, eloquence, or
suggestions about how to improve a text. Beninatto’s more serious argument was
nevertheless that translation companies should differentiate themselves precisely on
such ‘non-quality’ things, particularly the range and timeliness of their services, which
is where the variable use of technologies has a role to play (also see Bowker in this
volume).
Responding to Beninatto a few years later, Davies (2013) pointed out that
companies do in fact compete on quality, but not in the sense that they all claim to offer
the best. The competition is to provide scenarios where clients can ‘fine-tune’ the
quality they require, basically by selecting the technologies, the number of revisions and
reviews, and ideally the price. That is, technologies allow adjustment to the required
level of quality and the available budget. When this is the case, translation quality
obviously has not become irrelevant.
There are relatively few studies on quality as experienced by end users of
translations. It is commonly pointed out that different users have different expectations,
and that MT is mostly acceptable for those who want a general ‘gist’ understanding of a text
(Church and Hovy 1993, Lewis 1997, Miller et al. 2001). This is common sense and
feeds into a view where different degrees of postediting can be used to meet user
demands and indeed budgets. There are nevertheless areas where raw MT can prove
costly, not just in terms of linguistic errors but more especially with respect to company
image or general trustworthiness. Many cities around the world offer their websites with
unmediated MT in multiple languages: the information may be understandable, but the
public branding of the city will not benefit – linguistic form has a value, independently
of understood content. Similar but more serious considerations concern communities
that use translation to preserve and promote a threatened culture (Bowker 2009).
If the process is right, the product must be right
As in many industries where evaluating outputs is labour-intensive and partly
subjective, the translation industry has seen a tendency to rely on external industrial
standards. The standards purport to embody best practices; they apply to translation
service providers rather than to individual translators; they themselves involve
significant costs, which is a way of filtering out smaller and less solid companies; and
they are based on the philosophy that if you can regulate how translations are produced
(and by whom), you thereby regulate the quality of the translations themselves.
Among the main standards used by translation companies, ISO 17100 in Europe
and ASTM F2575 in the United States both refer to the importance of having each
translation be reviewed by someone other than the translator, although the terminology
for revision varies and it is unclear whether the task is to be performed by a bilingual or
a monolingual (cf. Biel 2011, Mellinger 2018). The MQM evaluation approach is based
on ISO/TS 11669 for translation projects. Perhaps the prime usefulness of these
standards is that they establish clear (albeit differing) terms and concepts that can be
understood by all stakeholders in the translation process. In this respect, the plurality of
standards would nevertheless constitute a drawback.
Strangely, though, these industrial standards have little to say about the use of
translation technologies as such. Most notably, the postediting of MT output has no
explicit place in ISO 17100, even though the development of the technologies would
seem to have been one of the main historical reasons for the development of new
standards (see Wright in this volume). That said, there is a separate standard for
postediting (ISO 18587) and all the recent standards attribute enhanced roles to the
various kinds of checking, revision and reviewing, all of which might be seen as
quality-control measures. At the same time, though, the need for greater quality control
could be attributed to the increasing role of machine translation and translation
memories, which would thus be operating as unspoken forces behind the industrial
standards.
Emerging issues: Quality for the translator
A question too rarely posed in discussions of technologies is how they affect the
translator’s mode of work and job satisfaction, which is in itself an experience of
quality. As Marx complained of industrialization, machines reduce labour to the
‘carcass of time’ (1847: 30). Gone, apparently, are factors such as craft, pride in
product, and quality as some kind of intrinsic virtue, a timeless cause of happiness for
the producer and of appreciation for the consumer.
We know relatively little about the kind of quality that is felt by translators when
working with technology. A quick but common assumption is that technologies remove
the donkeywork, leaving translators to solve the more interesting problems. An
alternative surmise is that job satisfaction can only come from fully human work. In
between those extremes, there is much to be discovered.
Some attention has been paid to who is best able to improve the quality of MT
outputs by postediting. Remarkably, ALPAC (1966) included an experiment with 23
translators on this very question and found that ‘fast translators will lose productivity if
given postediting to do, whereas slow translators will gain’ (ibid.: 94). Further, ‘most of
the translators found postediting tedious and even frustrating’ (96), although some
found it useful as an aid when translating, ‘particularly with regard to technical terms’
(97). Although the linguistic quality of MT output has improved enormously since the
1960s, the basics of this job-satisfaction assessment seem to have remained fairly
constant (cf. García 2010): poor translators benefit the most.
Guerberof (2013) interviewed 24 translators and three reviewers to find that
translators have mixed feelings when postediting MT, with the degree of satisfaction
depending mainly on the quality of the MT output. One might assume that NMT has
thus increased their satisfaction since then. A particular source of frustration was
nevertheless the different ways in which pay rates are calculated for different kinds of
tasks. It seems that there is still little standardization of the way postediting is
remunerated, although one would assume that a healthy hourly rate would be fair for all
sides.
Teixeira (2014), analyzing ten professional translators, found that translators
perceive postediting tasks more positively the more they are familiar with the text genre
and the technologies, and that there is a preference for metadata to appear when the
technologies offer possible translations. That is, posteditors appreciate being able to
know more or less where a translation suggestion is coming from, in order to assess its
trustworthiness. Interestingly, Teixeira’s experiment involved a situation where the
translators did not know whether the suggested solutions were from MT, a TM or a
fully human translation. He found that when a suggested translation was faulty, the
translators tended to assume it came from MT, even when this was not the case. There
are thus signs of a partly unfounded prejudice against MT.
LeBlanc (2013, 2017) studies the use of TMs in the workplace and reports that
translator dissatisfaction is not so much with the tools themselves as with the business
practices that have been built upon them. The ability to measure productivity
mechanically means that the translator has to meet productivity requirements – the
measure of quality becomes the number of words per unit of time. And the frequent
obligation to recycle previous translations is similarly resented as a departure from
previous workplace practices.
These findings are necessarily located in a particular historical context. Each
generation progresses professionally and scales hierarchies thanks to the technologies it
was trained to master: spoken rhetoric, styli on wax tablets, papyrus, parchment, paper,
print of many kinds and speeds, typewriters, word processors, and now machine
translation in translation memories. And just as each generation rises, it seeks to resist
the new technologies that it did not initially master: the priestly castes did not want
writing to be a general skill; journalists feared word processors; and as for translators, it
is still not unusual to hear global dismissal of advances: ‘if you can’t translate with
pencil and paper,’ says the former Canadian government translator Brian Mossop
(2003: 20), ‘then you can’t translate with the latest translation technology.’ The claim
may be technically correct – although Temizöz (2013) found that area experts do pretty
well postediting NMT – but it does not address a labour market where young translators
can create niches and displace the older generations precisely by stressing the new
technologies that allow them to do much better than pencil and paper. On this view, the
resistance found in surveys like Guerberof (2013) and Teixeira (2014) may be that of an
established generation clinging to the technologies that brought it to relative power.
Most of those who train students in translation technology are aware that this is an area
where the younger the student is, the faster they pick up new software.
Research on the actual experience of quality needs to focus on specific skills.
Our technologies come bundled in suites that involve quite different things. One of the
main advances in word processing software has been online spellchecking and thesauri
– tools that no one would really want to be without. Translation memory suites
incorporate a segmentation tool, translation suggestions, metadata on those suggestions,
MT feeds, quality assurance tools, sometimes sophisticated terminology tools, and a
user interface that may be more or less inviting. So when translators perform well with the
suites or react negatively against them, it is often because of one or two of these
features, and in terms of the specific skills associated with those tools. Studies that just
give global reactions to whole translation memory suites thus seem to be missing the
point. It makes more sense to isolate the specific tools, test how well or badly they suit
translators’ dispositions, and to do the skills analysis from there.
There seems little doubt that translation will increasingly become pre-editing
and postediting, and the translator’s role will be to correct and authorize texts that have
been produced electronically. That kind of task can involve high-level, satisfying
intellectual work, relieved of the donkeywork that much manual translating can involve.
It is up to translators and their employers to ensure that that kind of work-process
quality is appreciated and rewarded.
Conclusion: The human in the machine
Translation technologies are helping translators keep up with global demand, enhancing
cross-border service provision, and enabling collaboration with volunteer translators. As
this happens, debates over quality tend to be highly ideological: on the one hand,
fear-mongering Luddites idealize the human and fail to see real progress in technology;
on the other, unabashed company promotion produces numerical bravura based on
questionable methodologies.
In between those extremes, it is salutary to recall the extent to which all
evaluations of quality rest on human values (a translation of this kind, for this user, to
achieve this purpose) and are built on a fundamental indeterminacy (translations can be
different and equally correct). Awareness of the human elements should help us temper
the hypostasized hype, and hopefully ensure progress for all.
Further reading
Developments in this field tend to occur faster than book publications, to the extent that
there are few valid reference works that will stand the test of time. Readers are advised
to follow the quality claims as they are published online, but to do so with a critical eye
to the human factors involved: who is doing the evaluating, on what basis, in relation to
what purpose, and with what kind of claim to authority. The following nevertheless
successfully address the current issues:
Kenny, D. (ed.) (2017) Human issues in translation technology, London and New York:
Routledge. | As the title indicates, this collective volume hits the bullseye in focusing on
the most worrying part of translation technologies: what happens to the human. Key
articles by Doherty, LeBlanc, Koskinen and Ruokonen, and Moorkens and O’Brien are
based on empirical data, taking us beyond the promises of gurus.
Moorkens, J., S. Castilho, F. Gaspari and S. Doherty (eds) (2018) Translation Quality
Assessment. From Principles to Practice, Cham: Springer. | This is a collection of
papers from the academy and industry, in part dealing with translation technology,
including five papers specifically on quality in machine translation. The contributions
by Way, Specia et al., and Toral et al. assess future possibilities of quality assessment.
References
Aikawa, T., L. Schwartz, R. King, M. Corston-Oliver and C. Lozano (2007) ‘Impact of
Controlled Language on Translation Quality and Post-editing in a Statistical
Machine Translation Environment’. Available online: https://goo.gl/sNoxfr
[last access 28 October 2018].
Akiba, Y., K. Imamura and E. Sumita (2001) ‘Using Multiple Edit Distances to
Automatically Rank Machine Translation Output’, in Proceedings of the MT
Summit VIII, Santiago de Compostela, Spain. Available online:
https://goo.gl/zvMF3o [last access 28 October 2018].
ALPAC (1966) Languages and machines: computers in translation and linguistics. A
report by the Automatic Language Processing Advisory Committee, Division of
Behavioral Sciences. Washington DC: National Academy of Sciences, National
Research Council. Available online: https://goo.gl/6PmwcA [last access 28
October 2018]
Aristotle (1938) Categories. On Interpretation. Prior Analytics, trans. H. P. Cooke and
H. Tredennick, Cambridge MA: Harvard University Press.
Austermühl, F. (2006) ‘Training Translators to Localize’, in A. Pym et al. (eds)
Translation Technology and its Teaching (with much mention of localization),
Tarragona: Intercultural Studies Group, 97-105. Available online:
https://goo.gl/ysdcdW [last access 28 October 2018].
Banerjee, S. and A. Lavie (2005) ‘METEOR: An Automatic Metric for MT Evaluation
with Improved Correlation with Human Judgments’, in Proceedings of Workshop
on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization,
Ann Arbor, Michigan, June 2005.
Bédard, C. (2000) ‘“Translation Memory Seeks Sentence-oriented Translator…”’,
Traduire 186: 41-9.
Beninatto, R. (2007) ‘Quality still doesn’t matter’, presentation to ATA conference, San
Francisco. Available online: http://goo.gl/QuhnEg [last access 28 October 2018].
Biel, Ł. (2011) ‘Training translators or translation service providers? EN 15038: 2006
standard of translation services and its training implications’, The Journal of
Specialised Translation 16: 61-76.
Bowker, L. (2005) ‘Productivity vs Quality? A pilot study on the impact of translation
memory systems’, Localisation Focus 4: 13-20.
Bowker, L. (2009) ‘Can Machine Translation meet the needs of official language
minority communities in Canada? A recipient evaluation’, Linguistica
Antverpiensia 8: 123-155.
Byrne, J. (2006) Technical translation: Usability strategies for translating technical
documentation, Dordrecht: Springer.
Chesterman, A. (2004) ‘Functional quality’, a lecture at Universitat Rovira i Virgili,
Tarragona. Available online: https://goo.gl/t36dL7 [last access 28 October 2018].
Church, K. and E. Hovy (1993) ‘Good applications for crummy machine translation’,
Machine Translation 8(4): 239-258.
Coughlin, D. (2003) ‘Correlating Automated and Human Assessments of Machine
Translation Quality’, MT Summit IX, New Orleans, 23–27. Available online:
https://goo.gl/7yNzLV [last access 28 October 2018].
Cronin, M. (2013) Translation in the digital age, London & New York: Routledge.
Davies, I. (2013) ‘The hardest word in translation’, The Pillar Box (Institute of
Translation and Interpreting). Available online: https://goo.gl/Raxob4 [last access
8 July 2018].
Dragsted, B. (2004) Segmentation in translation and translation memory systems: An
empirical investigation of cognitive segmentation and effects of integrating a TM
system into the translation process, Copenhagen: Samfundslitteratur.
European Commission (2006) Special Report No 9/2006 concerning translation
expenditure incurred by the Commission, the Parliament and the Council.
Available online: https://goo.gl/RZzntv [last access 28 October 2018].
Fiederer, R. and S. O’Brien (2009) ‘Quality and machine translation: A realistic
objective?’, Journal of Specialised Translation 11: 52–74. Available online:
https://goo.gl/fBqP4u [last access 28 October 2018].
Freigang, K. H. (1998) ‘Machine-aided translation’, in M. Baker (ed.) Routledge
Encyclopedia of Translation Studies, London & New York: Routledge, 134-136.
García, I. (2006) ‘Translators on Translation Memories. A Blessing or a Curse?’, in A.
Pym et al. (eds) Translation Technology and its Teaching (with much mention of
localization), Tarragona: Intercultural Studies Group, 97-105. Available online:
https://goo.gl/ysdcdW [last access 28 October 2018].
García, I. (2010) ‘Is machine translation ready yet?’, Target 22(1): 7–21.
Gow, F. (2003) Metrics for Evaluating Translation Memory Software, MA thesis,
University of Ottawa. Available online: https://goo.gl/osq7ez [last access 28
October 2018].
Graham, Y. and T. Baldwin (2014) ‘Testing for Significance of Increased Correlation
with Human Judgment’, Proceedings of EMNLP 2014, Doha, Qatar.
Guerberof, A. (2013) ‘What do professional translators think about postediting?’, Journal of Specialised Translation 19: 75-95.
Hassan, H., et al. (2018) ‘Achieving Human Parity on Automatic Chinese to English
News Translation’. Available online: https://goo.gl/iD9DEd [last access 28
October 2018].
House, J. (2015) Translation quality assessment. Past and present, London & New
York: Routledge.
Hovy, E. E. (1999) ‘Toward finely differentiated evaluation metrics for machine
translation’, in Proceedings of the Eagles Workshop on Standards and
Evaluation, Pisa, Italy. 127-133.
Jordan, P. W. (1998) An Introduction to Usability, London: Taylor and Francis.
Koponen, M. (2016) ‘Is machine translation post-editing worth the effort? A survey of
research into post-editing and effort’, Journal of Specialised Translation 25: 131-147.
Kurzweil, R. (2005) The singularity is near, New York: Viking Books.
Läubli, S., R. Sennrich and M. Volk (2018) ‘Has Machine Translation Achieved Human
Parity? A Case for Document-level Evaluation’. Available online:
https://arxiv.org/pdf/1808.07048.pdf [last access 28 October 2018].
Le, Quoc V. and M. Schuster (2016) ‘A neural network for machine translation, at
production scale’. Google AI Blog. Available online: https://goo.gl/EcFszd [last
access 28 October 2018].
LeBlanc, M. (2013) ‘Translators on translation memory (TM). Results of an
ethnographic study in three translation services and agencies’, Translation &
Interpreting 5(2): 1-13.
LeBlanc, M. (2017) ‘“I can’t get no satisfaction!” Should we blame translation
technologies or shifting business practices?’, in D. Kenny (ed.) Human issues in
translation technology, London and New York: Routledge, 45-62.
Levenshtein, V. I. (1966) ‘Binary codes capable of correcting deletions, insertions and
reversals’, Soviet Physics - Doklady 10(8): 707-710. Available online:
https://goo.gl/y7ioao [last access 28 October 2018].
Lewis, D. (1997) ‘Machine translation in a Modern Languages curriculum’, Computer
Assisted Language Learning 10(3): 255-271.
Lotz, S. and A. van Rensburg (2016) ‘Omission and other sins: tracking the quality of
online machine translation output over four years’, Stellenbosch Papers in
Linguistics 46: 77-97.
Martín-Mor, A. (2011) La interferència lingüística en entorns de Traducció Assistida
per Ordinador: Recerca empíricoexperimental, doctoral thesis, Universitat
Autònoma de Barcelona. Available online: https://goo.gl/41Hn95 [last access 28
October 2018].
Marx, K. (1847) Misère de la philosophie, Paris & Brussels: A. Frank, C. H. Vogler.
Mellinger, C. D. (2018) ‘Re-thinking translation quality: Revision in the digital age’,
Target 30(2): 310-331.
Miller, K., D. Gates, N. Underwood and J. Magdalen (2001) ‘Evaluation of Machine
Translation Output for an Unknown Source Language’. Available online:
https://goo.gl/oUjffh [last access 28 October 2018].
Mitamura, T. and E. Nyberg (2001) ‘Automatic rewriting for controlled language
translation’. Available online: https://goo.gl/PfYbue [last access 28 October
2018].
Mitchell, L., J. Roturier and S. O’Brien (2013) ‘Community-based post-editing of
machine-translated content: monolingual vs. bilingual’, in S. O’Brien, M. Simard
and L. Specia (eds) Workshop Proceedings: Workshop on Post-editing
Technology and Practice (WPTP-2), Allschwil: The European Association for
Machine Translation, 35–44. Available online: https://goo.gl/JKjGQF [last access
28 October 2018].
Moorkens, J., S. Castilho, F. Gaspari and S. Doherty (eds.) (2018) Translation Quality
Assessment. From Principles to Practice, Cham: Springer.
Mossop, B. (2003) ‘What Should be Taught at Translation School?’, in A. Pym, C.
Fallada, J. R. Biau and J. Orenstein (eds) Innovation and E-Learning in
Translator Training, Tarragona: Universitat Rovira i Virgili, 20-22. Available
online: http://www.intercultural.urv.cat/publications/elearning/ [last access July
8, 2018].
MQM (2015a) ‘Multidimensional Quality Metrics (MQM) Issue Types’. Available
online: http://www.qt21.eu/mqm-definition/issues-list-2015-12-30.html [last
access 28 October 2018].
MQM (2015b) ‘Multidimensional Quality Metrics (MQM) Definition’. Available
online: http://www.qt21.eu/mqm-definition/definition-2015-12-30.html [last
access 28 October 2018].
Nyberg, E., T. Mitamura and W.O. Huijsen (2003) ‘Controlled language for authoring
and translation’, in H. Somers (ed) Computers and Translation: A Translator’s
Guide, Amsterdam & Philadelphia: John Benjamins, 245-281.
O’Brien, S. and J. Roturier (2007) ‘How Portable are Controlled Languages Rules? A
Comparison of Two Empirical MT Studies’, MT Summit XI, Copenhagen,
Denmark, 345-352. Available online: https://goo.gl/2fyrHB [last access 28
October 2018].
Pym, A. (2004) The Moving Text: Localization, Translation, and Distribution,
Amsterdam & Philadelphia: Benjamins.
Quine, W. V. O. (1960) Word and Object, Cambridge, MA: MIT Press.
Reiss, K. and H. J. Vermeer (1984) Grundlegung einer allgemeinen
Translationstheorie, Tübingen: Niemeyer.
Ribas, C. (2007) ‘Translation Memories as Vehicles for Error Propagation: A Pilot
Study’. Minor Dissertation. Tarragona: Universitat Rovira i Virgili.
Reinke, U. (1999) ‘Evaluierung der linguistischen Leistungsfähigkeit von Translation
Memory Systemen. Ein Erfahrungsbericht’, LDV-Forum 16: 100-117.
Su, K.Y., M. W. Wu and J. S. Chang (1992) ‘A New Quantitative Quality Measure for
Machine Translation Systems’, in Proceedings of COLING-92, Nantes, France.
Available online: https://goo.gl/kSEsmJ [last access July 8 2018].
Teixeira, C. S. C. (2014) ‘Perceived vs. measured performance in the post-editing of
suggestions from machine translation and translation memories’, in S. O’Brien,
M. Simard and L. Specia (eds) Proceedings of the Third Workshop on Post-Editing
Technology and Practice, 45-59.
Temizöz, Ö. (2013) Postediting machine-translation output and its revision. Subject-matter experts versus professional translators, PhD thesis, Universitat Rovira i
Virgili. Available online: https://goo.gl/OQtQvv [last access 28 October 2018].
Torres del Rey, J. (2005) La interfaz de la traducción. Formación de traductores y
nuevas tecnologías, Granada: Comares.
Toudic, D. (2012) ‘Employer Consultation Synthesis Report’, OPTIMALE Academic
Network project on translator education and training, Université Rennes 2,
Rennes.
Underwood, N. and B. Jongejan (2001) ‘Translatability checker: A tool to help decide
whether to use MT’, in B. Maegaard (ed.) Proceedings of MT Summit VIII:
Machine translation in the information age, Santiago de Compostela, 363-368.
Available online: https://goo.gl/QdrF9s [last access 28 October 2018].
Vilanova i Subirats, S. (2006) L’impacte de les memòries de traducció sobre el text
d’arribada: interferències i trets lingüístics, Minor dissertation, Universitat
Rovira i Virgili. Available online: https://goo.gl/xjLuPT [last access 28 October
2018].
Way, A. (2018) ‘Quality expectations of machine translation’, in J. Moorkens et al.
(eds) (2018) Translation Quality Assessment. From Principles to Practice, Cham:
Springer, 159-178.
Webb, L. E. (2000) Advantages and Disadvantages of Translation Memory, MA thesis,
Monterey Institute for International Studies. Available online:
https://goo.gl/d9kBHU [last access 28 October 2018].
White, J. (1995) ‘Approaches to Black Box MT Evaluation’. Available online:
https://goo.gl/LM432C [last access 28 October 2018].
White, J. (2003) ‘How to evaluate machine translation’, in H. Somers (ed) Computers
and translation. A translator’s guide, Amsterdam & Philadelphia: John
Benjamins, 211-244.
White, J., T. O’Connell and L. Carlson (1993) ‘Evaluation of machine translation’, in
Human Language Technology: Proceedings of the Workshop (ARPA): 206–210.
Available online: https://goo.gl/KCxTP2 [last access 28 October 2018].
Wu Y., M. Schuster, Z. Chen, Q. V. Le and M. Norouzi (2016) ‘Google’s Neural
Machine Translation System: Bridging the Gap between Human and Machine
Translation’. Available online: https://goo.gl/29cgid [last access 28 October
2018].
Note
1. I use the term start text rather than source text because technologies mean that
translations are these days produced from translation memories, glossaries and machine-translation proposals, all of which are as much a ‘source’ as the text the translator
actually starts from. The term also brings us into line with what is said in neighbouring
languages: Ausgangstext, texte de départ, texto de partida, for example.