Abstract
To reduce ambiguity and semantic distortion when translating a text, current dominant methods focus on integrating features from multiple modalities, such as text and image. However, this indiscriminate integration neglects the inherent differences between modalities, introducing noise that adversely affects translation. To overcome this challenge, we propose a model, FINE-LMT, that learns common and specific features from the modalities. To recognize the features common to the text and image modalities, we employ contrastive learning, which enhances the distinction between common and specific features. Additionally, when extracting specific features from the text modality, we utilize an orthogonal loss to keep them clearly distinct from the extracted common features. By fusing common and specific features, FINE-LMT surpasses advanced multi-modal machine translation (MMT) methods and integrates effectively with pre-trained language models, achieving BLEU score improvements of 0.98% and 1.06% on the En→De and En→Fr translation tasks, respectively, and improvements of 0.61% and 0.78% when integrated with pre-trained models, all averaged across three benchmarks.
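The abstract describes two auxiliary training objectives: a contrastive loss that aligns the common features shared by the text and image modalities, and an orthogonal loss that keeps text-specific features distinct from those common features. Below is a minimal PyTorch sketch of how such objectives are typically implemented; the function names, tensor shapes, temperature, and Frobenius-norm formulation are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_common, image_common, temperature=0.07):
    """InfoNCE-style alignment of common features from paired text/image.

    text_common, image_common: (batch, dim) projected features.
    Hypothetical sketch; the paper's exact formulation may differ.
    """
    text_common = F.normalize(text_common, dim=-1)
    image_common = F.normalize(image_common, dim=-1)
    logits = text_common @ image_common.t() / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: matched pairs are positives, all others negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def orthogonal_loss(text_specific, text_common):
    """Squared Frobenius norm of the cross-correlation between the
    text-specific and common feature matrices, a standard penalty that
    pushes the two subspaces toward orthogonality (assumed form)."""
    text_specific = F.normalize(text_specific, dim=-1)
    text_common = F.normalize(text_common, dim=-1)
    return (text_specific.t() @ text_common).pow(2).sum()
```

In training, such terms would normally be added to the translation cross-entropy loss with weighting coefficients, e.g. L = L_mt + λ1·L_contrastive + λ2·L_orth; the weights and the exact combination are likewise assumptions rather than the reported setup.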
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Wang, Y. et al. (2025). FINE-LMT: Fine-Grained Feature Learning for Multi-modal Machine Translation. In: Hadfi, R., Anthony, P., Sharma, A., Ito, T., Bai, Q. (eds) PRICAI 2024: Trends in Artificial Intelligence. PRICAI 2024. Lecture Notes in Computer Science, vol 15282. Springer, Singapore. https://doi.org/10.1007/978-981-96-0119-6_32
Publisher Name: Springer, Singapore
Print ISBN: 978-981-96-0118-9
Online ISBN: 978-981-96-0119-6