DOI: 10.1145/3512527.3531386

HybridVocab: Towards Multi-Modal Machine Translation via Multi-Aspect Alignment

Published: 27 June 2022

    Abstract

    Multi-modal machine translation (MMT) aims to augment linguistic machine translation frameworks by incorporating aligned visual information. The core research challenge in MMT remains how to fuse the image information and align it with the bilingual data. Existing works have either focused on alignment within the space of the bilingual text or emphasized combining one side of the text with the given image. In this work, we entertain the possibility of a triplet alignment among the source text, the target text, and the image instance. In particular, we propose the Multi-aspect AlignmenT (MAT) model, which augments the MMT task with three sub-tasks: cross-language translation alignment, cross-modal captioning alignment, and multi-modal hybrid alignment. At the core of this model is a hybrid vocabulary that compiles the visually depictable entities (nouns) occurring on both sides of the text, as well as the object labels detected in the images. Through the hybrid alignment sub-task, we postulate that MAT further aligns the modalities by casting the three instances into a shared domain, in contrast to previously proposed methods. Extensive experiments and analyses demonstrate the superiority of our approach, which achieves several state-of-the-art results on two benchmark datasets for the MMT task.
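    The abstract describes the hybrid vocabulary concretely enough to sketch its construction. Below is a minimal Python sketch, under stated assumptions: nouns serve as a proxy for visually depictable entities, the inputs are POS-tagged sentence pairs plus per-image detector labels, and the helper names (extract_nouns, build_hybrid_vocab) are hypothetical, as the paper's exact tooling is not given on this page.

```python
# A minimal sketch of compiling a "hybrid vocabulary" as the abstract
# describes it: visually depictable entities (approximated as nouns) from
# both sides of the bilingual text, unioned with detected object labels.
# ASSUMPTIONS: the (token, pos_tag) input format, the helper names, and
# the plain set union are illustrative, not the paper's actual method.
from typing import Iterable, List, Set, Tuple

def extract_nouns(tagged_tokens: Iterable[Tuple[str, str]]) -> Set[str]:
    """Keep only tokens tagged as nouns (any POS tagger could supply the tags)."""
    return {tok.lower() for tok, pos in tagged_tokens if pos == "NOUN"}

def build_hybrid_vocab(
    source_tagged: List[List[Tuple[str, str]]],  # POS-tagged source sentences
    target_tagged: List[List[Tuple[str, str]]],  # POS-tagged target sentences
    detected_labels: List[List[str]],            # object labels per image
) -> Set[str]:
    """Union of source-side nouns, target-side nouns, and detector labels."""
    vocab: Set[str] = set()
    for sent in source_tagged:
        vocab |= extract_nouns(sent)
    for sent in target_tagged:
        vocab |= extract_nouns(sent)
    for labels in detected_labels:
        vocab |= {lab.lower() for lab in labels}
    return vocab

# Toy example: one English-German sentence pair and one image.
src = [[("a", "DET"), ("man", "NOUN"), ("rides", "VERB"), ("a", "DET"), ("bike", "NOUN")]]
tgt = [[("ein", "DET"), ("Mann", "NOUN"), ("fährt", "VERB"), ("Fahrrad", "NOUN")]]
objs = [["person", "bicycle"]]
print(sorted(build_hybrid_vocab(src, tgt, objs)))
# -> ['bicycle', 'bike', 'fahrrad', 'man', 'mann', 'person']
```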

    Supplementary Material

    MP4 File (ICMR22-fp160.mp4)
    This is the video presentation of our work. In this paper, we propose a Multi-aspect AlignmenT (MAT) model that augments the MMT task with three sub-tasks, namely cross-language translation alignment, cross-modal captioning alignment, and multi-modal hybrid alignment. To address the triplet alignment issue, a hybrid vocabulary is introduced to cast the source text, the target text, and the image into a shared space. Extensive experiments and analyses demonstrate the effectiveness of our approach, which achieves several state-of-the-art results on two benchmark MMT datasets.
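    Since the video summary, like the abstract, frames MAT as three jointly trained alignment sub-tasks, a weighted multi-task objective is the natural reading. The sketch below shows one plausible formulation in PyTorch; the function name mat_loss, the uniform default weights, and the simple weighted sum are assumptions, as the paper's exact objective is not given on this page.

```python
# A plausible joint objective for MAT's three alignment sub-tasks.
# ASSUMPTION: the paper's exact loss formulation and weights are not shown
# on this page; a weighted sum is the conventional choice for such setups.
import torch

def mat_loss(
    loss_translation: torch.Tensor,  # cross-language translation alignment
    loss_captioning: torch.Tensor,   # cross-modal captioning alignment
    loss_hybrid: torch.Tensor,       # multi-modal hybrid alignment
    w_trans: float = 1.0,
    w_cap: float = 1.0,
    w_hyb: float = 1.0,
) -> torch.Tensor:
    """Combine the three sub-task losses into a single training objective."""
    return w_trans * loss_translation + w_cap * loss_captioning + w_hyb * loss_hybrid
```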


    Cited By

    • (2024) RetrievalMMT: Retrieval-Constrained Multi-Modal Prompt Learning for Multi-Modal Machine Translation. In Proceedings of the 2024 International Conference on Multimedia Retrieval. DOI: 10.1145/3652583.3658018, pp. 860-868. Online publication date: 30-May-2024.
    • (2023) CLIPTrans: Transferring Visual Knowledge with Pre-trained Models for Multimodal Machine Translation. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV). DOI: 10.1109/ICCV51070.2023.00269, pp. 2863-2874. Online publication date: 1-Oct-2023.

    Published In

    ICMR '22: Proceedings of the 2022 International Conference on Multimedia Retrieval
    June 2022
    714 pages
    ISBN:9781450392389
    DOI:10.1145/3512527

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. hybrid vocabulary
    2. multi-aspect alignment
    3. multi-modal machine translation

    Qualifiers

    • Research-article

    Conference

    ICMR '22

    Acceptance Rates

    Overall Acceptance Rate 204 of 685 submissions, 30%

