FINE-LMT: Fine-Grained Feature Learning for Multi-modal Machine Translation

  • Conference paper
  • PRICAI 2024: Trends in Artificial Intelligence (PRICAI 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 15282)


Abstract

To reduce ambiguity and semantic distortion when translating text, current dominant methods focus on integrating features from multiple modalities, such as text and images. However, this indiscriminate integration neglects the inherent differences between modalities and introduces noise that adversely affects translation. To overcome this challenge, we propose a model, FINE-LMT, that learns both common and specific features from the modalities. To recognize the features shared by the text and image modalities, we employ contrastive learning, which sharpens the distinction between common and specific features. Additionally, when extracting specific features from the text modality, we apply an orthogonal loss to keep them clearly separated from the extracted common features. By fusing common and specific features, FINE-LMT surpasses advanced MMT methods and integrates effectively with pre-trained language models, achieving BLEU score improvements of 0.98% and 1.06% on the En→De and En→Fr translation tasks, and improvements of 0.61% and 0.78% when integrated with pre-trained models, all averaged across three benchmarks.
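
The abstract describes two auxiliary objectives: a contrastive loss that aligns the common features extracted from the text and image modalities, and an orthogonal loss that keeps text-specific features separated from those common features. The sketch below illustrates one plausible form of these objectives in PyTorch. It is a minimal illustration only, assuming an InfoNCE-style contrastive loss and a squared-cosine orthogonality penalty; all function and tensor names (contrastive_loss, orthogonal_loss, common_text, common_image, specific_text) are hypothetical and do not reproduce the paper's exact formulation.

    # Hedged sketch of the two auxiliary losses named in the abstract.
    # Assumes an InfoNCE-style contrastive loss for common features and a
    # squared-cosine orthogonality penalty for specific features.
    import torch
    import torch.nn.functional as F

    def contrastive_loss(common_text: torch.Tensor, common_image: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
        """Pull text/image common features of the same sample together and
        push apart features from different samples (InfoNCE over the batch)."""
        text = F.normalize(common_text, dim=-1)    # (B, d)
        image = F.normalize(common_image, dim=-1)  # (B, d)
        logits = text @ image.t() / temperature    # (B, B) similarity matrix
        targets = torch.arange(text.size(0), device=text.device)
        return F.cross_entropy(logits, targets)

    def orthogonal_loss(common_text: torch.Tensor,
                        specific_text: torch.Tensor) -> torch.Tensor:
        """Penalise overlap between common and specific text features by
        driving their cosine similarity towards zero."""
        common = F.normalize(common_text, dim=-1)
        specific = F.normalize(specific_text, dim=-1)
        return (common * specific).sum(dim=-1).pow(2).mean()

    # Example usage with random features (batch of 4, dimension 512).
    if __name__ == "__main__":
        c_txt, c_img, s_txt = (torch.randn(4, 512) for _ in range(3))
        loss = contrastive_loss(c_txt, c_img) + orthogonal_loss(c_txt, s_txt)
        print(loss.item())

In a full model, such terms would typically be added to the translation loss with weighting coefficients before the fused common and specific features are passed to the decoder; the exact weighting used by FINE-LMT is not given in the abstract.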



Author information

Corresponding author

Correspondence to Dongyuan Li.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Wang, Y. et al. (2025). FINE-LMT: Fine-Grained Feature Learning for Multi-modal Machine Translation. In: Hadfi, R., Anthony, P., Sharma, A., Ito, T., Bai, Q. (eds) PRICAI 2024: Trends in Artificial Intelligence. PRICAI 2024. Lecture Notes in Computer Science, vol 15282. Springer, Singapore. https://doi.org/10.1007/978-981-96-0119-6_32

  • DOI: https://doi.org/10.1007/978-981-96-0119-6_32

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-96-0118-9

  • Online ISBN: 978-981-96-0119-6

  • eBook Packages: Computer Science (R0)
