Translatotron-V(ision): An End-to-End Model for
In-Image Machine Translation

Zhibin Lan¹,³, Liqiang Niu², Fandong Meng², Jie Zhou², Min Zhang⁴, Jinsong Su¹,³
¹School of Informatics, Xiamen University, China
²Pattern Recognition Center, WeChat AI, Tencent Inc, China
³Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage
of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China
⁴Institute of Computer Science and Technology, Soochow University, China
lanzhibin@stu.xmu.edu.cn jssu@xmu.edu.cn Work was done when Zhibin Lan was interning at Pattern Recognition Center, WeChat AI, Tencent Inc, China. Corresponding author.

Abstract

In-image machine translation (IIMT) aims to translate an image containing texts in source language into an image containing translations in target language. In this regard, conventional cascaded methods suffer from issues such as error propagation, massive parameters, and difficulties in deployment and retaining visual characteristics of the input image. Thus, constructing end-to-end models has become an option, which, however, faces two main challenges: 1) the huge modeling burden, as it is required to simultaneously learn alignment across languages and preserve the visual characteristics of the input image; 2) the difficulties of directly predicting excessively lengthy pixel sequences. In this paper, we propose Translatotron-V(ision), an end-to-end IIMT model consisting of four modules. In addition to an image encoder and an image decoder, our model contains a target text decoder and an image tokenizer. Among them, the target text decoder is used to alleviate the language alignment burden, and the image tokenizer converts long sequences of pixels into shorter sequences of visual tokens, preventing the model from focusing on low-level visual features. Besides, we present a two-stage training framework for our model to assist the model in learning alignment across modalities and languages. Finally, we propose a location-aware evaluation metric called Structure-BLEU to assess the translation quality of the generated images. Experimental results demonstrate that our model achieves competitive performance compared to cascaded models with only 70.9% of parameters, and significantly outperforms the pixel-level end-to-end IIMT model.¹

¹ Our code and dataset can be found at https://github.com/DeepLearnXMU/translatotron-v.

1 Introduction

In recent years, significant advancements have been achieved in natural language processing (NLP) and computer vision (CV), largely due to the evolution of deep learning. As a combined direction of these two fields, in-image machine translation (IIMT) aims to convert an image containing texts in source language into another image containing the translations in target language, which has significant research value and practical applications. It not only helps us understand the fusion mechanism of multimodal and multilingual information, but also finds widespread applications in daily life. For instance, IIMT can effortlessly enable foreign travelers to read signs written in other languages.

Figure 1: The illustration of two paradigms of IIMT.

As shown in Figure 1, current IIMT systems are divided into two paradigms: cascaded and end-to-end. The first one relies on cascading multiple models, including an optical character recognition (OCR) model, a machine translation (MT) model, and a text-to-image (T2I) model. However, this paradigm suffers from error propagation, massive parameters, and difficulties in deployment and retaining visual characteristics of the input image. By contrast, end-to-end methods Mansimov et al. (2020); Tian et al. (2023) integrate different models into one IIMT model and conduct end-to-end training. Thus, they have potential advantages over cascaded systems in terms of avoiding error propagation, reducing parameters, and easing deployment. Particularly, they are naturally capable of retaining visual characteristics from the input image during translation, e.g., maintaining background, text location, and font.

Despite the above advantages, the end-to-end IIMT models still face two major challenges: 1) the huge modeling burden, since they are required both to learn the alignment between two languages and to preserve the visual characteristics of the input image; 2) the difficulties of directly predicting excessively lengthy pixel sequences, which are low-level and involve a large search space Ramesh et al. (2021); Yu et al. (2022b).

To the best of our knowledge, Mansimov et al. (2020) and Tian et al. (2023) are the only two attempts to explore end-to-end IIMT. However, the former is directly based on pixel prediction, resulting in significantly lower translation quality compared to the cascaded models, while the latter requires converting RGB images to grayscale ones, losing visual characteristics. Besides, both of them can only handle images containing single-line text. These defects make them still far from real-world applications.

In this paper, we propose Translatotron-V(ision), the first end-to-end IIMT model capable of generating RGB images, achieving comparable performance to cascaded models with only 70.9% of parameters. As shown in Figure 2, our model consists of four modules: 1) an image encoder that represents the semantics of the image as a sequence of visual vectors; 2) a target text decoder that utilizes the visual vector sequence to predict the text translation, which can effectively reduce the modeling burden on the image decoder; 3) an image decoder that generates the visual tokens of the target image based on the visual and linguistic information generated from the image encoder and target text decoder, respectively; 4) an image tokenizer that converts the image into discrete visual tokens and can reconstruct the image from these visual tokens. By converting the image into visual tokens, the image decoder only needs to predict visual tokens, rather than excessively lengthy pixel sequences, which allows the model to avoid spending too much capacity capturing low-level visual features.

Furthermore, as illustrated in Figure 3, we propose a training framework for our model, consisting of two stages. First, we utilize large-scale unlabeled images to train the image tokenizer through an image reconstruction task. Then, we freeze the image tokenizer and train other modules using IIMT dataset. Inspired by end-to-end speech translation Jia et al. (2019), we introduce multi-task learning at this stage. The auxiliary tasks include OCR and text image translation (TIT), assisting the model in learning alignment across different modalities and languages. Particularly, we introduce a knowledge distillation method to reduce the difficulty of the end-to-end model directly learning from ground-truth labels.

Due to the absence of publicly available IIMT datasets, we use IWSLT14 German-English Cettolo et al. (2014) to synthesize a dataset for this task. Note that unlike previous works Mansimov et al. (2020); Tian et al. (2023), which only focus on images containing single-line text, the images in our dataset are more complex, featuring multiple lines of text, as well as text rotation and translation. Furthermore, since the conventional BLEU Papineni et al. (2002) is not applicable to image evaluation, we extend BLEU to Structure-BLEU, which considers text location information to better evaluate the quality of text translations within images.

To summarize, we have the following major contributions in this work:

• We propose a novel end-to-end IIMT model named Translatotron-V. More importantly, it introduces two crucial modules to address major challenges in end-to-end IIMT: 1) a target text decoder used to alleviate the modeling burden; 2) an image tokenizer that prevents the model from directly predicting pixels.
• We present a two-stage training framework for Translatotron-V, which fully exploits unlabeled images, OCR, and TIT data to refine the model training.
• We propose Structure-BLEU, an evaluation metric that considers text location information for IIMT.
• Experimental results demonstrate that Translatotron-V not only significantly outperforms the pixel-level end-to-end IIMT model, but also achieves performance comparable to cascaded models with fewer parameters.

2 Related Work

To achieve high-performance IIMT, previous research mainly focuses on text image translation (TIT), which is a subtask of IIMT Watanabe et al. (1998); Yang et al. (2002); Du et al. (2011); Chen et al. (2015); Afli and Way (2016); Lan et al. (2023). Unlike conventional multimodal machine translation Elliott et al. (2016); Yin et al. (2020); Lin et al. (2020); Su et al. (2021a); Yin et al. (2023); Kang et al. (2023), TIT aims to translate source language texts in images into target language. In this regard, dominant studies resort to the cascading method, which uses an OCR model to obtain the recognized source language texts and then feed them into an MT model for translation Goodfellow et al. (2014); Zhang et al. (2016); Gu et al. (2018).

Afterwards, due to its advantage in mitigating error propagation, end-to-end TIT has attracted increasing attention. Chen et al. (2020) adopt a multi-task learning framework that integrates OCR as an auxiliary task. Along this line, Ma et al. (2022) incorporate MT into the multi-task learning framework. Unlike previous studies, both Su et al. (2021b) and Ma et al. (2023b) employ an adapter to combine individual pretrained OCR and MT modules in a TIT model. Furthermore, Ma et al. (2023c) apply knowledge distillation to effectively distill the knowledge of OCR and MT models into the end-to-end TIT model. Zhu et al. (2023) explore an end-to-end TIT model with an aligner and a regularizer to reduce the modality gap. To explicitly exploit guidance from recognized texts, Ma et al. (2023a) incorporate recognized text information into the TIT decoder through interactive attention. Differing from the above studies focusing on model design, Salesky et al. (2021) analyze the effect of visual text representation, and find that it exhibits significant robustness to various types of noise.

However, none of the aforementioned works consider generating the image with target translations, which is a common requirement in real-world scenarios. To this end, Mansimov et al. (2020) first explore the IIMT task. They introduce an end-to-end model that contains a self-attention encoder, two convolutional encoders, and a convolutional decoder to generate target images at the pixel level. Nonetheless, their model significantly lags behind cascaded models, suffering from issues such as character omission and artifacts. Recently, Tian et al. (2023) convert pixels into characters, thereby transforming the IIMT task into a conventional sequence-to-sequence text generation task. However, this method can only generate grayscale images, losing visual characteristics. Susladkar et al. (2023) present a conditional diffusion-based image editing model, which replaces text in the input image with a given translation while preserving the visual characteristics of the original image. However, this model can only perform single-word editing, which makes its application very limited.

Different from these studies, we propose an end-to-end IIMT model that can generate RGB images with multiple lines of text while preserving the visual features of the input image, and achieve comparable performance to the cascaded model.

3 Our Model

3.1 Model Architecture

As shown in Figure 2, our model consists of four modules: an image encoder, a target text decoder, an image decoder, and an image tokenizer. All of those modules will be elaborated in the following.

Image Encoder. This module converts the input image into a sequence of visual vectors.

We use ViT Dosovitskiy et al. (2021) as the backbone of the image encoder. In order to convert a 2D image into a 1D sequence that can be handled by a Transformer, we first split the input image $\mathbf{x}$ into $N=HW/P^{2}$ image patches $\{x_{i}\}_{i=1}^{N}$, where $(H,W)$ is the resolution of the input image, and $(P,P)$ is the resolution of each patch. Then we apply a linear projection matrix $\mathbf{W}_{e}$ to transform image patches into patch embeddings, and add a standard learnable positional embedding matrix $\mathbf{E}_{pos}$ to these patch embeddings. Formally, the initial hidden states $\mathbf{H}_{ie}^{(0)}$ of the image encoder can be formulated as

$$\mathbf{H}_{ie}^{(0)}=[x_{0};\mathbf{W}_{e}x_{1};\mathbf{W}_{e}x_{2};...;\mathbf{W}_{e}x_{N}]+\mathbf{E}_{pos}, \tag{1}$$

where $x_{0}$ is the special token prepended to the input sequence.
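As an illustration of Eq. (1), the following is a minimal pure-Python sketch of patch splitting, linear projection, and positional embedding. The toy sizes, the averaging projection used for `w_e`, and the helper names `patchify` and `embed` are assumptions made for this example and are not taken from the released code.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened P x P patches."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    flat.extend(image[r][c])  # append the C channel values
            patches.append(flat)
    return patches  # N = H*W / P^2 patches, each of length P*P*C


def embed(patches, w_e, special_token, e_pos):
    """Compute [x_0; W_e x_1; ...; W_e x_N] + E_pos as in Eq. (1)."""
    projected = [[sum(w * v for w, v in zip(row, p)) for row in w_e]
                 for p in patches]
    sequence = [special_token] + projected
    return [[a + b for a, b in zip(tok, pos)]
            for tok, pos in zip(sequence, e_pos)]


# Toy example (hypothetical sizes): 4x4 RGB image, 2x2 patches, embedding size 2.
H, W, C, P, D = 4, 4, 3, 2, 2
image = [[[float(r * W + c)] * C for c in range(W)] for r in range(H)]
patches = patchify(image, P)
w_e = [[1.0 / (P * P * C)] * (P * P * C) for _ in range(D)]  # averaging projection
special_token = [0.0] * D
e_pos = [[0.1 * i] * D for i in range(len(patches) + 1)]
h0 = embed(patches, w_e, special_token, e_pos)
print(len(patches), len(h0), len(h0[0]))  # number of patches, sequence length, width
```

In a real model, `w_e`, the special token, and `e_pos` would be learned parameters; here they are fixed so that the shapes are easy to check.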

Afterwards, we process these patch embeddings using a Transformer encoder with multiple layers. Each Transformer encoder layer is composed of a self-attention sub-layer and a feed-forward network (FFN) sub-layer. Layer normalization (LN) is applied before each sub-layer, and a residual connection is added after each sub-layer Wang et al. (2019). The hidden states $\mathbf{H}_{ie}^{(l)}$ of the $l$-th encoder layer are calculated as

$$\mathbf{H}_{ie}^{(l)}=\mathrm{FFN}(\mathrm{MHA}(\mathbf{H}_{ie}^{(l-1)},\mathbf{H}_{ie}^{(l-1)},\mathbf{H}_{ie}^{(l-1)})), \tag{2}$$

where $\mathrm{MHA}(\cdot,\cdot,\cdot)$ denotes a multi-head attention function. The residual connection and layer normalization are omitted for simplicity.

Target Text Decoder. By utilizing the features generated by the image encoder, this decoder is responsible for producing text translations. In this way, it focuses on the alignment of different languages, and thus alleviates the modeling burden of the image decoder.

When constructing our target text decoder, we employ the widely-used Transformer Vaswani et al. (2017) decoder as the architecture, consisting of multiple identical layers. In addition to the standard self-attention and FFN sub-layers, each decoder layer is equipped with a cross-attention sub-layer to exploit hidden states produced by the image encoder. Formally, we calculate the hidden states $\mathbf{H}_{td}^{(l)}$ for the $l$ -th decoder layer using the following equations:

$$\mathbf{C}_{td}^{(l)}=\mathrm{MHA}(\mathbf{H}_{td}^{(l-1)},\mathbf{H}_{td}^{(l-1)},\mathbf{H}_{td}^{(l-1)}), \tag{3}$$

$$\mathbf{H}_{td}^{(l)}=\mathrm{FFN}(\mathrm{MHA}(\mathbf{C}_{td}^{(l)},\mathbf{H}_{ie}^{(L)},\mathbf{H}_{ie}^{(L)})), \tag{4}$$

where the initial hidden states $\mathbf{H}_{td}^{(0)}$ are computed by summing the word embeddings and position embeddings of the input sequence. Unless otherwise specified, we use $L$ to represent the last layer.

Image Decoder. This module is responsible for generating visual tokens based on visual and linguistic information generated from the image encoder and target text decoder, respectively.

The architecture of the image decoder closely resembles that of the target text decoder but with the following notable modifications. It includes two cross-attention sub-layers to gather information from both the image encoder and target text decoder, followed by a fusion sub-layer to generate intermediate representations enriched with both visual and linguistic features. Besides, we incorporate the 2D relative position encoding Wu et al. (2021) into the self-attention sub-layer to capture relative positional relationships within images.

Let $\mathbf{C}_{id}^{(l)}$ denote the hidden states output by the self-attention sub-layer of the $l$-th image decoder layer. We calculate it in the following way:

$$\mathbf{C}_{id}^{(l)}=\mathrm{MHA}(\mathbf{H}_{id}^{(l-1)},\mathbf{H}_{id}^{(l-1)},\mathbf{H}_{id}^{(l-1)}), \tag{5}$$

where $\mathbf{H}_{id}^{(l-1)}$ represents the hidden states output by the $(l-1)$-th image decoder layer. Subsequently, the hidden states $\mathbf{\overline{H}}_{id}^{(l)}$ and $\mathbf{\widetilde{H}}_{id}^{(l)}$ are computed through two cross-attention mechanisms, which attend to the image encoder and the target text decoder, respectively, as follows:

$$\mathbf{\overline{H}}_{id}^{(l)}=\mathrm{MHA}(\mathbf{C}_{id}^{(l)},\mathbf{H}_{ie}^{(L)},\mathbf{H}_{ie}^{(L)}), \tag{6}$$

$$\mathbf{\widetilde{H}}_{id}^{(l)}=\mathrm{MHA}(\mathbf{C}_{id}^{(l)},\mathbf{H}_{td}^{(L)},\mathbf{H}_{td}^{(L)}). \tag{7}$$

Finally, the hidden states of the $l$ -th image decoder layer are obtained through a gated fusion mechanism, which is calculated using the following equations:

$$\Lambda=\mathrm{sigmoid}(\mathbf{W}_{\Lambda}\mathbf{\overline{H}}_{id}^{(l)}+\mathbf{U}_{\Lambda}\mathbf{\widetilde{H}}_{id}^{(l)}), \tag{8}$$

$$\mathbf{H}_{id}^{(l)}=\Lambda\mathbf{\overline{H}}_{id}^{(l)}+(1-\Lambda)\mathbf{\widetilde{H}}_{id}^{(l)}, \tag{9}$$

where $\mathbf{W}_{\Lambda}$ and $\mathbf{U}_{\Lambda}$ are projection matrices, and $\Lambda$ is a gated matrix featuring values ranging from 0 to 1, serving the purpose of dynamically fusing two modalities of information.
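The gated fusion in Eqs. (8) and (9) can be sketched for a single position as follows. This is a minimal illustration that assumes the products with $\Lambda$ are element-wise and that both projection matrices are square; the function names and toy values are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def gated_fusion(h_img, h_txt, w_gate, u_gate):
    """Fuse the two cross-attention outputs of one position (Eqs. 8-9).

    h_img: output of the cross-attention over the image encoder.
    h_txt: output of the cross-attention over the target text decoder.
    The gate is applied element-wise, which is an assumption of this sketch.
    """
    pre = [a + b for a, b in zip(matvec(w_gate, h_img), matvec(u_gate, h_txt))]
    gate = [1.0 / (1.0 + math.exp(-p)) for p in pre]  # sigmoid, values in (0, 1)
    fused = [g * a + (1.0 - g) * b for g, a, b in zip(gate, h_img, h_txt)]
    return gate, fused


h_img = [1.0, 2.0]
h_txt = [3.0, -2.0]

# With zero projections the gate is 0.5 everywhere, so the output is the average.
zeros = [[0.0, 0.0], [0.0, 0.0]]
gate, fused = gated_fusion(h_img, h_txt, zeros, zeros)

# A large positive pre-activation pushes the gate towards 1 (visual features win).
strong = [[10.0, 0.0], [0.0, 10.0]]
gate_img, fused_img = gated_fusion(h_img, h_txt, strong, zeros)
print(gate, fused, fused_img)
```

Because the gate is computed from both inputs, each dimension can lean on visual or linguistic information independently.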

Image Tokenizer. It is used to perform the conversion between an image and a sequence of discrete visual tokens. By introducing this module, we allow the image decoder only to predict visual tokens, preventing it from modeling excessively lengthy sequences. For instance, a 256 $\times$ 256 $\times$ 3 RGB image results in 196,608 rasterized values.

Our image tokenizer follows the architecture of ViT-VQGAN Yu et al. (2022a), which includes a Vision Transformer (ViT) Dosovitskiy et al. (2021) based encoder and decoder. The encoder $E$ of the image tokenizer is used to tokenize the image into $\mathbf{z}=(z_{1},...,z_{N})$ through a quantizer $q(\cdot)$. Formally, the quantizer looks up the nearest visual token for each input, as shown in the following:

$$z_{i}=q(E(x_{i}))=\mathop{\mathrm{argmin}}\limits_{e_{k}\in\mathcal{V}}\|E(x_{i})-e_{k}\|_{2}, \tag{10}$$

where $\mathcal{V}$ is the image vocabulary containing visual tokens.
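The nearest-neighbour lookup in Eq. (10) can be sketched as below. The two-dimensional codebook and the feature values are toy assumptions; a trained tokenizer would use a learned codebook and encoder outputs of much higher dimension.

```python
def quantize(features, codebook):
    """Return, for each encoder output, the index of the nearest codebook entry.

    Minimising the squared L2 distance selects the same entry as minimising
    the L2 distance in Eq. (10).
    """
    token_ids = []
    for feat in features:
        dists = [sum((f - e) ** 2 for f, e in zip(feat, entry))
                 for entry in codebook]
        token_ids.append(dists.index(min(dists)))
    return token_ids


# Toy image vocabulary with three visual tokens (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
# Toy encoder outputs E(x_i) for four patches.
features = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7], [0.6, 0.6]]

token_ids = quantize(features, codebook)
quantized = [codebook[i] for i in token_ids]  # vectors passed on to the decoder G
print(token_ids)
```

The indices in `token_ids` play the role of the discrete visual tokens that the image decoder is trained to predict.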

Conversely, the decoder $G$ of the image tokenizer reconstructs the input image based on the visual tokens generated by $E$ , formulated as

$$\hat{\mathbf{x}}=G(q(E(\mathbf{x}))). \tag{11}$$

Please note that during training, we use the encoder to obtain visual tokens of the target image as labels. During inference, the decoder converts visual tokens generated by the image decoder into the target image.

3.2 Model Training

We provide a detailed description of the training procedures for our model, which consists of two stages, as illustrated in Figure 3.

Stage 1. At this stage, we train the image tokenizer using a large-scale unlabeled image dataset $D_{v}$ in the same way as ViT-VQGAN Yu et al. (2022a), where we convert the input image into visual tokens and then reconstruct the image from these visual tokens.

Given an image $\mathbf{x}$ from the unlabeled image dataset $D_{v}$ , we define the training objective of this stage as follows:

$$\mathcal{L}_{1}=\|\hat{\mathbf{x}}-\mathbf{x}\|^{2}+\|\mathrm{sg}(E(\mathbf{x}))-\mathbf{z}\|_{2}^{2}+\beta\|E(\mathbf{x})-\mathrm{sg}(\mathbf{z})\|_{2}^{2}. \tag{12}$$

Here, the first term is the reconstruction loss optimizing the encoder and decoder, the middle term is the vector-quantization loss used to update the visual tokens, and the last term is the so-called “commitment loss” for the encoder, which prevents its output from fluctuating frequently from one visual token to another. $\mathrm{sg}(\cdot)$ denotes the stop-gradient operation, and $\beta$ is the weighting factor set to 0.25 following van den Oord et al. (2017).²

² Note that we also include other loss terms as presented in Yu et al. (2022a), but omit the descriptions for brevity. Please refer to Yu et al. (2022a) for more details.
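To make the three terms of Eq. (12) concrete, the following sketch evaluates their forward values on toy vectors. It is only an illustration: plain Python has no automatic differentiation, so the stop-gradient operator is described in comments rather than implemented, and all input values are hypothetical.

```python
def sq_norm(a, b):
    """Squared L2 distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def stage1_loss_terms(x, x_hat, enc_out, z, beta=0.25):
    """Forward values of the three terms in Eq. (12).

    sg(.) only changes which parameters receive gradients, not the value:
    the vector-quantization term updates the codebook entry z, while the
    commitment term updates the encoder output E(x).
    """
    recon = sq_norm(x_hat, x)            # reconstruction loss
    vq = sq_norm(enc_out, z)             # ||sg(E(x)) - z||^2
    commit = beta * sq_norm(enc_out, z)  # beta * ||E(x) - sg(z)||^2
    return recon, vq, commit, recon + vq + commit


x = [0.0, 1.0, 0.5]      # toy input pixels
x_hat = [0.0, 0.5, 0.5]  # toy reconstruction
enc_out = [1.0, 2.0]     # toy encoder output E(x)
z = [1.0, 1.0]           # nearest codebook vector

recon, vq, commit, total = stage1_loss_terms(x, x_hat, enc_out, z)
print(recon, vq, commit, total)
```

In the forward pass the last two terms differ only by the factor $\beta$; their distinct roles appear in the backward pass, which an autograd framework would handle through its own stop-gradient or detach operation.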

Stage 2. Using an IIMT dataset, we then adopt multi-task learning and knowledge distillation to train the image encoder, target text decoder, and image decoder.

Overall, the training objective at this stage is defined as follows:

$$\mathcal{L}_{2}=\mathcal{L}_{iimt}+\mathcal{L}_{ocr}+\mathcal{L}_{tit}+\mathcal{L}_{kd}, \tag{13}$$

where $\mathcal{L}_{iimt}$, $\mathcal{L}_{ocr}$, $\mathcal{L}_{tit}$, and $\mathcal{L}_{kd}$ denote the IIMT task loss, OCR auxiliary task loss, TIT auxiliary task loss, and knowledge distillation loss, respectively.³

³ We also explore the balance of different training objectives. Experimental results in Appendix A show that we do not need to introduce additional hyperparameters to balance different objectives.

Given an IIMT training instance $(\mathbf{x},\mathbf{y},\mathbf{s},\mathbf{t})$ from the IIMT dataset $D_{iimt}$ , we can utilize the image tokenizer trained in the first stage to process the target image, obtaining visual tokens denoted as $\mathbf{z}$ . Here, $\mathbf{x}$ represents the source image, $\mathbf{y}$ is the target image, $\mathbf{s}$ denotes the source language text within the source image, and $\mathbf{t}$ denotes the target language text within the target image.

To alleviate the burden of end-to-end model training, we adopt multi-task learning, which involves not only the primary IIMT task but also two auxiliary tasks: the OCR task and the TIT task. The OCR task is employed to assist the model in recognizing texts within the image, while the TIT task further facilitates cross-lingual alignment. Formally, the training objective of the IIMT task can be formulated as follows:

$$\mathcal{L}_{iimt}=-\mathrm{log}\,p(\mathbf{z}|\mathbf{x};\theta_{ie},\theta_{ttd},\theta_{id}), \tag{14}$$

where $\theta_{ie}$ , $\theta_{ttd}$ , $\theta_{id}$ denote the trainable parameters of the image encoder, target text decoder, and image decoder, respectively.

To train our model using the OCR auxiliary task, we additionally introduce a source text decoder, which adopts the same architecture as the target text decoder. Formally, the training objectives of the OCR and TIT auxiliary tasks are defined as

$$\mathcal{L}_{ocr}=-\mathrm{log}\,p(\mathbf{s}|\mathbf{x};\theta_{ie},\theta_{std}), \tag{15}$$

$$\mathcal{L}_{tit}=-\mathrm{log}\,p(\mathbf{t}|\mathbf{x};\theta_{ie},\theta_{ttd}), \tag{16}$$

where $\theta_{std}$ denotes the parameters of the source text decoder. Note that the source text decoder takes the intermediate hidden states of the image encoder as input. This design is based on the intuition that the shallow encoder layers represent the source visual content, while the deep layers encode more information about the target visual content.

Besides, training an end-to-end model is considerably more difficult than training a T2I model, since the latter only needs to learn the mapping between different modalities and thus performs better. Consequently, we introduce a T2I model as a teacher to facilitate knowledge transfer to the end-to-end model. This T2I model includes a Transformer-based text encoder, a ResNet-based image encoder He et al. (2016), and an image decoder similar to that of our model, where the image encoder is used to preserve the features of the original image. Denoting the output distribution of the teacher model for the $t$-th visual token $z_{t}$ as $q(z_{t}|\mathbf{z}_{<t},\mathbf{x},\mathbf{t};\theta_{t2i})$, we define the cross-entropy between the distributions of teacher and student as the distillation loss:

$$\mathcal{L}_{kd}=-\sum_{t=1}^{|\mathbf{z}|}\sum_{k=1}^{|\mathcal{V}|}q(z_{t}=k|\mathbf{z}_{<t},\mathbf{x},\mathbf{t};\theta_{t2i})\,\mathrm{log}\,p(z_{t}=k|\mathbf{z}_{<t},\mathbf{x};\theta_{ie},\theta_{ttd},\theta_{id}), \tag{17}$$

where $\theta_{t2i}$ represents the parameters of the T2I model. Note that we will remove the source text decoder and T2I model during inference.
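The distillation loss in Eq. (17) is a token-level cross-entropy between the teacher and student distributions. A minimal sketch, assuming both distributions are already available as lists of per-position probability vectors (the toy vocabulary of two visual tokens is hypothetical):

```python
import math


def kd_loss(teacher_probs, student_probs):
    """Cross-entropy between teacher q and student p, summed over positions.

    teacher_probs[t][k] stands for q(z_t = k | ...) from the T2I teacher and
    student_probs[t][k] for p(z_t = k | ...) from the end-to-end model.
    """
    total = 0.0
    for q_t, p_t in zip(teacher_probs, student_probs):
        for q, p in zip(q_t, p_t):
            if q > 0.0:  # terms with q = 0 contribute nothing
                total -= q * math.log(p)
    return total


teacher = [[0.7, 0.3], [1.0, 0.0]]
uniform_student = [[0.5, 0.5], [0.5, 0.5]]

loss_uniform = kd_loss(teacher, uniform_student)
# A student that copies the teacher reaches the minimum (the teacher's entropy).
loss_matched = kd_loss(teacher, teacher)
print(loss_uniform, loss_matched)
```

Unlike training on one-hot ground-truth tokens alone, the teacher distribution also tells the student how plausible the alternative visual tokens are at each position.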

4 Experiments

4.1 Setup

Dataset. Due to the lack of readily available data, we utilize the widely-used IWSLT14 German-English (De-En) dataset Cettolo et al. (2014) to synthesize paired images for this task. Concretely, we leverage the Python Pillow package⁴ to render texts onto images with the black Arial⁵ font. The text is arranged horizontally from left to right and vertically from top to bottom, with random translation and rotation. This involves shifting the text in a random direction and changing its orientation by a random angle. Additionally, the background color of the image is selected randomly, and the resolution of the images is 512 $\times$ 512. Note that bilingual texts exceeding the image boundaries are discarded during data synthesis. In contrast to prior studies Mansimov et al. (2020); Tian et al. (2023), which focus solely on generating images with single-line text and a white background, our research delves into more complex scenes. In the end, the synthesized dataset comprises 81,741 training instances, 3,765 validation instances, and 3,527 test instances. Several synthetic examples and comparisons with previous data can be found in Appendix B.

⁴ https://github.com/python-pillow/Pillow
⁵ https://learn.microsoft.com/en-us/typography/font-list/arial
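The layout sampling and boundary filtering described above can be sketched with the standard library alone. This is not the authors' synthesis script: the shift and angle ranges, the function names, and the text box sizes are hypothetical, and the actual rendering with Pillow is omitted.

```python
import math
import random

IMAGE_SIZE = 512  # image resolution stated in the text


def rotated_corners(cx, cy, width, height, angle_deg):
    """Corners of a width x height box centred at (cx, cy), rotated by angle_deg."""
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    half_w, half_h = width / 2, height / 2
    offsets = ((-half_w, -half_h), (half_w, -half_h),
               (half_w, half_h), (-half_w, half_h))
    return [(cx + dx * cos_t - dy * sin_t, cy + dx * sin_t + dy * cos_t)
            for dx, dy in offsets]


def fits_in_image(corners, size=IMAGE_SIZE):
    """True if every corner of the text box lies inside the canvas."""
    return all(0 <= x <= size and 0 <= y <= size for x, y in corners)


def sample_layout(text_w, text_h, rng, max_shift=40, max_angle=15):
    """Sample a random shift, rotation, and background color (assumed ranges)."""
    cx = IMAGE_SIZE / 2 + rng.uniform(-max_shift, max_shift)
    cy = IMAGE_SIZE / 2 + rng.uniform(-max_shift, max_shift)
    angle = rng.uniform(-max_angle, max_angle)
    background = tuple(rng.randrange(256) for _ in range(3))
    corners = rotated_corners(cx, cy, text_w, text_h, angle)
    return {"center": (cx, cy), "angle": angle, "background": background,
            "keep": fits_in_image(corners)}


rng = random.Random(0)
small = sample_layout(200, 60, rng)  # text block well inside the canvas
large = sample_layout(600, 60, rng)  # wider than the canvas, so it is discarded
print(small["keep"], large["keep"])
```

An instance would be kept only if both the source and the target text blocks pass the boundary check, which mirrors the rule that bilingual texts exceeding the image boundaries are discarded.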

In addition to the IIMT data, a substantial quantity of images is also indispensable for training the image tokenizer. To this end, we employ the text extracted from the WMT14 English-German (En-De) Bojar et al. (2014) to synthesize images for training our image tokenizer.


| Model | De $\rightarrow$ En BLEU $\uparrow$ | De $\rightarrow$ En Structure-BLEU $\uparrow$ | De $\rightarrow$ En SSIM $\uparrow$ | En $\rightarrow$ De BLEU $\uparrow$ | En $\rightarrow$ De Structure-BLEU $\uparrow$ | En $\rightarrow$ De SSIM $\uparrow$ | #Param |
|---|---|---|---|---|---|---|---|
| Cascaded Models | | | | | | | |
| OCR+MT+T2I | 15.37 | 14.87 | 0.7785 | 13.22 | 12.57 | 0.7550 | 247M |
| TIT+T2I | 14.80 | 14.73 | 0.7812 | 12.92 | 12.74 | 0.7620 | 201M |
| PEIT+T2I | 10.91 | 10.78 | 0.7740 | 8.77 | 8.01 | 0.7594 | 178M |
| End-to-end Models | | | | | | | |
| Pixel-level Transformer | 0.15 | 0.15 | 0.7538 | 1.11 | 1.22 | 0.7616 | 162M |
| Translatotron-V | 15.39 | 15.26 | 0.7832 | 13.23 | 12.92 | 0.7629 | 175M |

| Model | BLEU $\uparrow$ | S-BLEU $\uparrow$ | SSIM $\uparrow$ |
|---|---|---|---|
| Translatotron-V | 15.39 | 15.26 | 0.7832 |
| w/o gated fusion | 14.34 | 14.20 | 0.7830 |
| w/o OCR auxiliary task | 1.39 | 1.18 | 0.7277 |
| w/o knowledge distillation | 13.35 | 13.43 | 0.7813 |
| w/o target text decoder | 0.47 | 0.43 | 0.7751 |

| Model | Fr $\rightarrow$ En BLEU $\uparrow$ | Fr $\rightarrow$ En Structure-BLEU $\uparrow$ | Fr $\rightarrow$ En SSIM $\uparrow$ | Ro $\rightarrow$ En BLEU $\uparrow$ | Ro $\rightarrow$ En Structure-BLEU $\uparrow$ | Ro $\rightarrow$ En SSIM $\uparrow$ | #Param |
|---|---|---|---|---|---|---|---|
| Cascaded Models | | | | | | | |
| OCR+MT+T2I | 21.60 | 21.58 | 0.7738 | 18.34 | 18.61 | 0.7752 | 247M |
| TIT+T2I | 21.87 | 21.78 | 0.7801 | 18.39 | 18.30 | 0.7764 | 201M |
| PEIT+T2I | 18.51 | 18.55 | 0.7741 | 14.54 | 14.90 | 0.7704 | 178M |
| End-to-end Models | | | | | | | |
| Pixel-level Transformer | 2.08 | 2.61 | 0.7753 | 1.58 | 2.11 | 0.7696 | 162M |
| Translatotron-V | 22.20 | 22.17 | 0.7811 | 18.44 | 18.73 | 0.7780 | 175M |
