Abstract
To reduce ambiguity and semantic distortion when translating a text, current dominant methods focus on integrating features from multiple modalities, such as text and image. However, this indiscriminate integration neglects the inherent differences between modalities, introducing noise that adversely affects translation. To overcome this challenge, we propose a model, FINE-LMT, that learns common and specific features from the modalities. To recognize the features common to the text and image modalities, we employ contrastive learning, which enhances the distinction between common and specific features. Additionally, when extracting specific features from the text modality, we utilize an orthogonal loss to keep them clearly distinct from the extracted common features. By fusing common and specific features, FINE-LMT surpasses advanced multi-modal machine translation (MMT) methods and integrates effectively with pre-trained language models, achieving BLEU score improvements of 0.98% and 1.06% on the En→De and En→Fr translation tasks, respectively, and improvements of 0.61% and 0.78% when integrated with pre-trained models, all averaged across three benchmarks.
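The abstract describes two auxiliary training objectives: a contrastive loss that aligns the common features shared by the text and image modalities, and an orthogonal loss that keeps text-specific features distinct from those common features. Below is a minimal PyTorch sketch of how such objectives are typically implemented; the function names, tensor shapes, temperature, and Frobenius-norm formulation are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_common, image_common, temperature=0.07):
    """InfoNCE-style alignment of common features from paired text/image.

    text_common, image_common: (batch, dim) projected features.
    Hypothetical sketch; the paper's exact formulation may differ.
    """
    text_common = F.normalize(text_common, dim=-1)
    image_common = F.normalize(image_common, dim=-1)
    logits = text_common @ image_common.t() / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: matched pairs are positives, all others negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def orthogonal_loss(text_specific, text_common):
    """Squared Frobenius norm of the cross-correlation between the
    text-specific and common feature matrices, a standard penalty that
    pushes the two subspaces toward orthogonality (assumed form)."""
    text_specific = F.normalize(text_specific, dim=-1)
    text_common = F.normalize(text_common, dim=-1)
    return (text_specific.t() @ text_common).pow(2).sum()
```

In training, such terms would normally be added to the translation cross-entropy loss with weighting coefficients, e.g. L = L_mt + λ1·L_contrastive + λ2·L_orth; the weights and the exact combination are likewise assumptions rather than the reported setup.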
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Wang, Y. et al. (2025). FINE-LMT: Fine-Grained Feature Learning for Multi-modal Machine Translation. In: Hadfi, R., Anthony, P., Sharma, A., Ito, T., Bai, Q. (eds) PRICAI 2024: Trends in Artificial Intelligence. PRICAI 2024. Lecture Notes in Computer Science, vol 15282. Springer, Singapore. https://doi.org/10.1007/978-981-96-0119-6_32
Publisher Name: Springer, Singapore
Print ISBN: 978-981-96-0118-9
Online ISBN: 978-981-96-0119-6