DOI: 10.1145/3474085.3478331

X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics

Published: 17 October 2021

Abstract

With the rise and development of deep learning over the past decade, there has been a steady momentum of innovation and breakthroughs that convincingly push the state of the art of cross-modal analytics between vision and language in the multimedia field. Nevertheless, there has not been an open-source codebase that supports training and deploying numerous neural network models for cross-modal analytics in a unified and modular fashion. In this work, we propose X-modaler, a versatile and high-performance codebase that encapsulates state-of-the-art cross-modal analytics into several general-purpose stages (e.g., pre-processing, encoder, cross-modal interaction, decoder, and decode strategy). Each stage is empowered with functionality that covers a series of modules widely adopted in state-of-the-art methods and allows seamless switching between them. This design naturally enables a flexible implementation of state-of-the-art algorithms for image captioning, video captioning, and vision-language pre-training, aiming to facilitate the rapid development of the research community. Meanwhile, since the effective modular designs in several stages (e.g., cross-modal interaction) are shared across different vision-language tasks, X-modaler can be readily extended to power startup prototypes for other tasks in cross-modal analytics, including visual question answering, visual commonsense reasoning, and cross-modal retrieval. X-modaler is an Apache-licensed codebase, and its source code, sample projects, and pre-trained models are available online: https://github.com/YehLi/xmodaler.
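
The stage decomposition described above lends itself to a registry-plus-configuration pattern, in which each stage implementation is registered under a name and a configuration selects one implementation per stage. The sketch below illustrates that idea in plain Python; the registries, class names, and config keys are hypothetical stand-ins chosen for illustration only, not the actual X-modaler interfaces (see the GitHub repository for those), and only three of the five stages are shown.

# Illustrative sketch (hypothetical names, not the actual X-modaler API):
# a captioning pipeline is assembled from interchangeable stage modules,
# so swapping e.g. the decode strategy only requires one config change.

ENCODERS, DECODERS, DECODE_STRATEGIES = {}, {}, {}

def register(registry, name):
    """Decorator that records a stage implementation under a config key."""
    def wrap(cls):
        registry[name] = cls
        return cls
    return wrap

@register(ENCODERS, "transformer")
class TransformerEncoder:
    def __call__(self, features):
        return [f * 2 for f in features]      # stand-in for real encoding

@register(DECODERS, "lstm")
class LSTMDecoder:
    def __call__(self, states):
        return ["token_%d" % i for i, _ in enumerate(states)]

@register(DECODE_STRATEGIES, "greedy")
class GreedySearch:
    def __call__(self, tokens):
        return " ".join(tokens)

class CaptioningPipeline:
    """Chains general-purpose stages, each selected by name from a config."""
    def __init__(self, cfg):
        self.encoder = ENCODERS[cfg["encoder"]]()
        self.decoder = DECODERS[cfg["decoder"]]()
        self.decode_strategy = DECODE_STRATEGIES[cfg["decode_strategy"]]()

    def run(self, features):
        states = self.encoder(features)
        tokens = self.decoder(states)
        return self.decode_strategy(tokens)

if __name__ == "__main__":
    cfg = {"encoder": "transformer", "decoder": "lstm", "decode_strategy": "greedy"}
    print(CaptioningPipeline(cfg).run([0.1, 0.2, 0.3]))

Under such a pattern, switching from greedy search to, say, beam search would amount to registering one additional decode-strategy class and changing a single config value, which is the kind of seamless module swapping the abstract describes.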





Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 October 2021


Author Tags

  1. cross-modal analytics
  2. open source
  3. vision and language

Qualifiers

  • Short-paper

Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall acceptance rate: 2,145 of 8,556 submissions (25%)

Article Metrics

  • Downloads (last 12 months): 11
  • Downloads (last 6 weeks): 3
Reflects downloads up to 31 Jan 2025

Cited By

  • (2025) Exploring Vision-Language Foundation Model for Novel Object Captioning. IEEE Transactions on Circuits and Systems for Video Technology, 35(1), 91-102. DOI: 10.1109/TCSVT.2024.3452437. Online publication date: Jan 2025.
  • (2024) CrowdCaption++: Collective-Guided Crowd Scenes Captioning. IEEE Transactions on Multimedia, 26, 4974-4986. DOI: 10.1109/TMM.2023.3328189. Online publication date: 2024.
  • (2024) TridentCap: Image-Fact-Style Trident Semantic Framework for Stylized Image Captioning. IEEE Transactions on Circuits and Systems for Video Technology, 34(5), 3563-3575. DOI: 10.1109/TCSVT.2023.3315133. Online publication date: May 2024.
  • (2024) Advancements in Multimodal Social Media Post Summarization: Integrating GPT-4 for Enhanced Understanding. 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC), 1934-1940. DOI: 10.1109/COMPSAC61105.2024.00307. Online publication date: 2 Jul 2024.
  • (2023) A Review of Transformer-Based Approaches for Image Captioning. Applied Sciences, 13(19), 11103. DOI: 10.3390/app131911103. Online publication date: 9 Oct 2023.
  • (2023) A Survey on Learning Objects' Relationship for Image Captioning. Computational Intelligence and Neuroscience, 2023, 1-16. DOI: 10.1155/2023/8600853. Online publication date: 29 May 2023.
  • (2023) Bottom-up and Top-down Object Inference Networks for Image Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications, 19(5), 1-18. DOI: 10.1145/3580366. Online publication date: 16 Mar 2023.
  • (2023) Boosting Vision-and-Language Navigation with Direction Guiding and Backtracing. ACM Transactions on Multimedia Computing, Communications, and Applications, 19(1), 1-16. DOI: 10.1145/3526024. Online publication date: 5 Jan 2023.
  • (2023) Boosting Scene Graph Generation with Visual Relation Saliency. ACM Transactions on Multimedia Computing, Communications, and Applications, 19(1), 1-17. DOI: 10.1145/3514041. Online publication date: 5 Jan 2023.
  • (2023) IcoCap: Improving Video Captioning by Compounding Images. IEEE Transactions on Multimedia, 26, 4389-4400. DOI: 10.1109/TMM.2023.3322329. Online publication date: 5 Oct 2023.
