1. Introduction
In recent years, increasing attention has been paid to the inequalities caused by the monopoly of English as the language of science. Amano et al. [1] surveyed 908 researchers in the environmental sciences and found that non-native English speakers, particularly early in their careers, face greater challenges and spend more effort in conducting scientific activities in English than native English speakers. In addition to the negative impact on non-English-speaking researchers at an individual level, systematic reviews, which are considered a reliable form of research evidence, often neglect non-English literature as a source of important evidence [2]. Bowker et al. [3] carried out a systematic review to examine the use of machine translation (MT) tools in scholarly communication and to investigate whether such tools contribute to a more linguistically diverse ecosystem. The study found that, while there is interest in and positive attitudes towards these tools, the quality of MT output depends heavily on the data used to train the system, so translation quality varies across language pairs, text types, and scientific domains. In addition, when custom-built prototypes were compared with general-purpose tools such as Google Translate, the custom-built tools designed for scholarly communication performed better on related tasks.
Several papers have explored the usability of MT in the context of scholarly communication from different perspectives. O’Brien and colleagues [4] examined the potential of using MT and self-post-editing to support the academic writing process for authors writing in English as a foreign language. In their experiment, participants wrote part of an abstract in their native language and part in English, and the native-language section was then machine translated. The results suggest that MT and self-post-editing show promise as tools that could support academic writing in a foreign language without compromising quality, although further research is needed. Two studies [5,6] focus on improving the quality of MT from Russian to English for academic articles by pre-editing the source text. The first study demonstrated that minimizing abbreviations, simplifying complex phrases, and ensuring grammatical accuracy in the source text led to higher-quality output from the Google, Amazon, and DeepL machine translation systems than translations of the unedited originals. The second study developed fifteen pre-editing rules based on common negative translatability indicators in Russian life sciences texts; when these rules were applied, over 95% of the sentences met the generally accepted threshold for publication-level quality with minimal post-editing. Roussis and colleagues [7] present the development of parallel and monolingual corpora for scientific domains and their application in fine-tuning general-purpose neural machine translation (NMT) systems. They built large corpora for the language pairs Spanish–English, French–English, and Portuguese–English, covering general scientific texts as well as the specific domains of cancer research, energy research, neuroscience, and transportation research. Their results suggest that domain adaptation through targeted collection of scientific data and model fine-tuning can effectively improve the performance of NMT for specialized scientific and academic texts. In the present study, we evaluate the performance of three different MT engines, one of which was developed using a methodology similar to that described in [7]. We do not pre-edit or post-edit the texts; instead, we evaluate the usability of the machine-translated texts by means of reading experiments, which is a new contribution to the field.
In his 2021 paper, “Investigating how we read translations. A Call to Action for Experimental Studies of Translation Reception”, Walker [8] stresses the need for more research on the reception of written translation. He highlights that while there is a considerable amount of research on the reception of audiovisual translation [9], less attention has been paid to the reception of written translation. Whyatt and colleagues [10] responded to this call and conducted a small exploratory study in 2023 to investigate the relationship between the translator’s effort in producing a translation, the quality of the final translation, and the reader’s reception effort. Twenty native Polish speakers were randomly assigned to read either a high-quality or a low-quality translation of a product description translated from English into Polish. The researchers fitted a linear mixed-effects model to analyze the data. The results show that the translator’s effort had no significant effect on the readers’ reception effort, but translation quality did: participants showed greater effort when reading low-quality translations than when reading high-quality ones. Despite the small data set (the source text consisted of 8 sentences and 162 words), the authors suggest that reading experience can be used to evaluate the effectiveness of translation decisions, particularly in terms of translation quality, but that further research is needed to explore the impact of translation errors and the severity of their effects on the reading experience of translated texts.
Today, translation is not only performed by human translators. MT systems are used to generate translations in many settings, especially where human translation (HT) is too expensive or simply not feasible. Eye-tracking has previously been used to evaluate the quality of MT output. In 2010, Doherty et al. [11] investigated whether eye movement data reflected MT quality. Ten native French speakers were instructed to read 50 machine-translated sentences of either excellent or poor quality. The study found that when participants read MT sentences that were rated poorly by human evaluators, the number of fixations and the average gaze duration increased. In 2020, Kasperavičienė et al. [12] examined the eye movements of 14 participants reading a news article automatically translated from English into Lithuanian. According to the authors, the language of the source text was simple and intended for a general audience. The study found more fixations and longer gaze durations for segments containing MT errors than for correct segments. In 2022, Colman et al. [13] collected eye movements of 20 Dutch-speaking participants reading an entire novel (Agatha Christie’s The Mysterious Affair at Styles), comparing the published human translation with a machine translation. The participants alternated between reading passages of the human translation and passages of the machine translation. In line with earlier research, the average reading duration, the number of fixations, and the average fixation duration were all slightly higher in the MT condition than in the HT condition. In 2023, Kasperė et al. [14] conducted a study using eye-tracking to investigate the reading process of professional and nonprofessional users of MT. The participants read a 371-word recipe that had been automatically translated from English into Lithuanian. The results show that professional users spent more time reading the machine-translated text and assessed the quality of the MT more critically than nonprofessional users.
Although the studies discussed above address different language pairs, use different MT architectures and technologies, and vary in the type and length of the texts that were read, there is a consensus that reading machine-translated output requires more processing effort than reading human-translated text. Most studies, however, have compared machine-translated texts with human-translated versions, compared high- and low-quality translations, or focused on specific errors in the machine-translated texts. We are not aware of any studies that have examined the reader’s effort when processing the output of different MT systems of varying translation quality.
In recent years, the field of MT has seen rapid advancements, beginning with the introduction of encoder–decoder transformer models [15] and more recently with the development of decoder-only large language models (LLMs) such as Mistral [16] and Llama 3 (https://ai.meta.com/blog/meta-llama-3/, accessed on 5 July 2024). While there is increasing interest in leveraging LLMs for MT due to their additional capabilities, such as instruction following [17], their use does not necessarily lead to superior performance in translation tasks, particularly in specialized domains (e.g., legal) or when translating into less-represented languages (e.g., Czech or Russian) [18]. Several toolkits are available to train MT models. Open-source NMT toolkits, such as OpenNMT [19] and Marian [20], allow users to train NMT models from scratch with existing bilingual data sets, providing flexibility while requiring substantial data and computational power. Additionally, open-source NMT models, such as Opus-MT [21] and Meta’s NLLB (No Language Left Behind) [22], and LLMs, such as Llama 3 and Mistral, offer pretrained models that can be further fine-tuned with language- or domain-specific data sets, balancing ease of use with customization potential. Commercial off-the-shelf MT solutions, like Google Translate (https://translate.google.com/, accessed on 5 July 2024) and DeepL (https://www.deepl.com/en/translator, accessed on 5 July 2024), offer user-friendly, high-quality translation services without the need for any model training. However, they are limited in their ability to be customized for specialized domains, specific styles, or terminology. Some commercial toolkits, such as ModernMT (https://www.modernmt.com/, accessed on 5 July 2024) and Google Cloud AI (https://cloud.google.com/, accessed on 5 July 2024), bridge this gap by allowing further fine-tuning on user-specific data, combining the benefits of high baseline performance and adaptability to particular needs.
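As an illustration of how such pretrained open-source models can be used out of the box, the following minimal Python sketch loads an Opus-MT English-to-French model through the Hugging Face Transformers library; the model name and example sentence are chosen purely for illustration and are not part of the systems evaluated in this study.

```python
# Illustrative sketch: translating with a pretrained Opus-MT model via Hugging Face Transformers.
# The model name and example sentence are assumptions made for this example only.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-fr"  # pretrained English-to-French Opus-MT model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

sentences = ["Machine translation quality varies across scientific domains."]
batch = tokenizer(sentences, return_tensors="pt", padding=True)
generated = model.generate(**batch)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```

A model loaded in this way can subsequently be fine-tuned on domain-specific parallel data, which is the customization route described above.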
MT output can be assessed through various methods, each providing insights into different aspects of translation quality. Existing MT assessment techniques can be categorized into two types: human evaluation and automatic evaluation. Human evaluation typically involves human annotators or evaluators who assess translations based on predefined criteria. Common human evaluation techniques consist of (i) direct assessment, where MT output is scored on a predefined scale (e.g., from 0 to 100); (ii) ranking, where translations produced by different systems are compared and ranked from best to worst; (iii) intrinsic evaluation of fluency (i.e., the well-formedness of the translation in the target language) or accuracy (i.e., the extent to which the source-text meaning is conveyed in the target text); and (iv) fine-grained error annotation, where errors in MT output are identified and categorized at the word/phrase level. Error annotation tasks are based on existing MT error taxonomies, such as the Multidimensional Quality Metrics (MQM) framework [23] or the SCATE taxonomy [24]. Notably, both the MQM and the SCATE taxonomies categorize MT errors hierarchically, with accuracy and fluency as the two main error categories. While direct assessment, ranking, and intrinsic evaluation provide insights into the overall quality of MT output from different perspectives, fine-grained error annotation allows for a detailed analysis of translation quality, pinpointing error patterns and determining the relative strengths and weaknesses of different MT systems across various linguistic aspects.
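To make the structure of such fine-grained annotations concrete, the following sketch shows one possible record layout for an MQM-style error annotation; the field names and severity labels are illustrative assumptions, not the annotation scheme used in this study.

```python
# Hypothetical record structure for a fine-grained, MQM-style error annotation.
# Field names and severity labels are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ErrorAnnotation:
    segment_id: int    # identifier of the annotated sentence or segment
    span: tuple        # (start, end) character offsets of the error in the MT output
    category: str      # hierarchical label, e.g., "accuracy/mistranslation"
    severity: str      # e.g., "minor", "major", "critical"
    comment: str = ""  # optional annotator note

annotations = [
    ErrorAnnotation(segment_id=3, span=(12, 25),
                    category="accuracy/mistranslation", severity="major"),
    ErrorAnnotation(segment_id=7, span=(0, 8),
                    category="fluency/grammar", severity="minor"),
]
```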
Although human evaluation offers a nuanced and detailed linguistic assessment of MT quality, it is often resource-intensive and time-consuming, which necessitates the use of automatic evaluation methods to provide efficient and cost-effective assessments, particularly in large-scale evaluation tasks or real-time monitoring scenarios. Automatic evaluation of MT involves the use of computational techniques to assess the quality of translated texts without human involvement. Unlike human assessment techniques, which compare source texts with machine-translated texts, traditional automatic evaluation metrics often rely on reference translations (i.e., gold-standard translations) and assign a quality score to a given MT output based on the degree of similarity or divergence between the output and the reference.
Existing automatic evaluation metrics vary in how they measure the similarity between MT output and reference translations. While some metrics, such as BLEU [25] and METEOR [26], rely on word-level n-gram matching, others, such as chrF [27], use a character-level n-gram F-score, and metrics like TER [28] calculate a word-level edit distance. More recently developed metrics that rely on neural machine learning techniques, such as BERTScore [29], BLEURT [30], and COMET [31], compute the similarity of learned vector representations of the texts. Moreover, certain neural-based metrics are tailored to specific evaluation tasks; for example, COMET, which additionally incorporates source-text information, was specifically trained to predict various human judgments, including post-editing effort, direct assessment, and translation error analysis [31].
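For illustration, the lexical reference-based metrics mentioned above can be computed with the sacrebleu library, as in the minimal sketch below; the hypothesis and reference sentences are invented for the example.

```python
# Minimal sketch of reference-based automatic evaluation with the sacrebleu library.
# The hypothesis and reference sentences are invented for illustration.
import sacrebleu

hypotheses = ["The system translates the abstract into French."]
references = [["The tool translates the abstract into French."]]  # one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)   # word-level n-gram matching
chrf = sacrebleu.corpus_chrf(hypotheses, references)   # character-level n-gram F-score
ter = sacrebleu.corpus_ter(hypotheses, references)     # word-level edit distance
print(bleu.score, chrf.score, ter.score)
```

Neural metrics such as COMET are typically run through their own toolkits and additionally require the source sentences as input.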
In many evaluation scenarios, reference translations are not consistently available, and this dependency on references restricts the usability of automatic evaluation metrics. Consequently, there has been growing interest in developing quality estimation (QE) systems: metrics that aim to assess MT quality in the absence of reference translations, relying solely on the source text and the MT output. Examples of recently developed neural-based QE metrics include COMETKiwi [32] and MetricX [33]. Besides enhancing the applicability of automatic metrics, QE systems, which compare the source text with the MT output, can be argued to capture different characteristics of MT quality than reference-based metrics.
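As a rough sketch of how such reference-free scoring works in practice, the example below uses the open-source COMET toolkit with a COMETKiwi checkpoint; the checkpoint name and example segment are assumptions made for illustration.

```python
# Reference-free quality estimation sketch with the COMET toolkit (unbabel-comet package).
# The checkpoint name and example segment are assumptions for illustration.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-cometkiwi-da")  # COMETKiwi QE model (assumed name)
model = load_from_checkpoint(model_path)

data = [{"src": "La traduction automatique progresse rapidement.",
         "mt": "Machine translation is progressing rapidly."}]  # no reference translation needed
output = model.predict(data, batch_size=8, gpus=0)
print(output.system_score)
```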
While automatic evaluation is generally faster and more cost-effective than human evaluation, its performance is typically measured by the extent to which it emulates human assessment. Hence, a reliable metric is often characterized by its ability to correlate strongly with human judgments [34]. Viewed from this standpoint, neural-based automatic evaluation (and QE) metrics continue to stand out as the state-of-the-art approach for automatic quality assessment, having achieved notably higher correlations with human assessments than non-neural metrics across different domains and language pairs in recent years [35,36]. In addition to employing advanced machine learning techniques, the effectiveness of neural metrics can be attributed to their superior ability to capture semantic similarity between texts. By using word and sentence embeddings, these metrics are not confined to surface-form comparisons with the reference translation, unlike lexical-based metrics, allowing them to more effectively identify semantically related translations (e.g., paraphrases and synonyms) [37]. On the other hand, while neural-based metrics generally offer better performance than non-neural metrics, they require a pretrained language model, human-labeled data, and additional training, which limits their use for low-resource languages [37]. As Lee et al. [37] further argue, this limitation, together with the fact that lexical-based metrics enable comparisons with past research to some degree, is one reason why metrics like BLEU continue to be commonly used for MT assessment.
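In practice, this kind of meta-evaluation amounts to correlating metric scores with human judgments, as in the sketch below; all score values are invented for illustration.

```python
# Minimal sketch: correlating segment-level metric scores with human judgments.
# All score values are invented for illustration.
from scipy.stats import pearsonr, spearmanr

metric_scores = [0.71, 0.55, 0.82, 0.64, 0.90]
human_scores = [75, 52, 88, 60, 93]

print(pearsonr(metric_scores, human_scores))   # linear correlation
print(spearmanr(metric_scores, human_scores))  # rank correlation
```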
Current Study
This study is part of the ‘Translations and Open Science’ project (https://operas-eu.org/projects/translations-and-open-science/, accessed on 5 July 2024), which explores the possibility of using MT to disseminate research results in different languages. More specifically, it is part of a larger study on the evaluation of MT in the context of scholarly communication, focusing on the language pair English–French in three scientific disciplines:
D1: Human Mobility, Environment, and Space (Social Sciences and Humanities, translated from French into English)
D2: Neurosciences (Life Sciences, translated from English into French)
D3: Climatology and Climate Change (Physical Sciences, translated from English into French)
Three different engines were selected for evaluation: two commercial systems (DeepL and ModernMT) and one open-source system (OpenNMT), which was trained on various data sets [38]. ModernMT was further customized by uploading a domain-specific translation memory. The OpenNMT system was trained from scratch on publicly available data from the OPUS repository [39] and fine-tuned on collected in-domain data and the SciPar data set [40]. In the larger study, automatic evaluations on held-out test sets revealed that MT quality varied across the systems, with DeepL achieving the best performance and ModernMT the second best. For more information, we refer to the publicly available reports.
The project evaluated the usability of the raw MT output in three different use cases: (i) researchers specialized in the domain in question using MT to translate publications, as a writing aid, or for gisting purposes; (ii) professional translators using MT for post-editing to speed up the translation task; and (iii) nonexpert readers using MT to get an idea of the content of a scientific publication. This study focuses on the assessment of MT quality for nonexpert readers. The other use cases are described in Fiorini et al. [38]. Detailed reports and resources are available in a Zenodo repository (https://zenodo.org/records/10972872, accessed on 5 July 2024).
To assess the usability of the MT output for nonexpert readers, reading experiments were combined with automatic evaluation and fine-grained error annotations using the MQM framework [23]. The project was on a tight schedule, which did not allow for time-consuming eye-tracking studies; instead, self-paced reading experiments were set up [41]. Participants read translations in a cumulative, self-paced reading view, where each key press revealed the next sentence while the previous sentences of the text remained in view. Although less fine-grained than eye-tracking, this setup allowed us to collect reading times per text.
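For illustration, the timing logic of such a cumulative self-paced reading task can be sketched as follows; this is a simplified stand-in for the actual experiment software, with invented sentences.

```python
# Simplified sketch of cumulative self-paced reading: each key press reveals the next
# sentence, earlier sentences stay on screen, and the time per sentence is recorded.
# This is an illustration, not the software used in the study.
import time

sentences = ["First sentence of the translated text.",
             "Second sentence, revealed by the next key press."]

reading_times = []
shown = []
for sentence in sentences:
    shown.append(sentence)
    print("\n".join(shown))               # previously read sentences remain in view
    start = time.perf_counter()
    input("Press Enter to continue...")   # stands in for the key press
    reading_times.append(time.perf_counter() - start)

total_reading_time = sum(reading_times)   # reading time per text
```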
In this study, we were not only interested in how readers rated the quality of MT but also in the relationship between translation quality (either measured by fine-grained error annotations or by automatic metrics) and the readers’ reception effort, measured by reading times. More specifically, we were interested in whether different measures of translation quality could predict reading times.
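The kind of analysis this implies can be sketched as a mixed-effects regression of reading times on a quality measure, with random intercepts per participant; the file and column names in the sketch below are hypothetical.

```python
# Hedged sketch: a linear mixed-effects model predicting reading times from a
# translation quality measure, with random intercepts per participant.
# The file and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("reading_data.csv")  # one row per participant x text (hypothetical)
model = smf.mixedlm("reading_time ~ quality_score", data=df, groups=df["participant"])
result = model.fit()
print(result.summary())
```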