DOI: 10.1145/3474085.3478331

X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics

Published: 17 October 2021

Abstract

With the rise and development of deep learning over the past decade, there has been a steady momentum of innovation and breakthroughs that convincingly push the state of the art of cross-modal analytics between vision and language in the multimedia field. Nevertheless, there has not been an open-source codebase that supports training and deploying numerous neural network models for cross-modal analytics in a unified and modular fashion. In this work, we propose X-modaler, a versatile and high-performance codebase that encapsulates state-of-the-art cross-modal analytics into several general-purpose stages (e.g., pre-processing, encoder, cross-modal interaction, decoder, and decode strategy). Each stage is empowered with functionality that covers a series of modules widely adopted in state-of-the-art methods and allows seamless switching between them. This design naturally enables a flexible implementation of state-of-the-art algorithms for image captioning, video captioning, and vision-language pre-training, aiming to facilitate the rapid development of the research community. Meanwhile, since the effective modular designs in several stages (e.g., cross-modal interaction) are shared across different vision-language tasks, X-modaler can be readily extended to power startup prototypes for other tasks in cross-modal analytics, including visual question answering, visual commonsense reasoning, and cross-modal retrieval. X-modaler is an Apache-licensed codebase, and its source code, sample projects, and pre-trained models are available online: https://github.com/YehLi/xmodaler.
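
The stage decomposition described above lends itself to a registry-plus-configuration pattern, in which each stage implementation is registered under a name and a configuration selects one implementation per stage. The sketch below illustrates that idea in plain Python; the registries, class names, and config keys are hypothetical stand-ins chosen for illustration only, not the actual X-modaler interfaces (see the GitHub repository for those), and only three of the five stages are shown.

# Illustrative sketch (hypothetical names, not the actual X-modaler API):
# a captioning pipeline is assembled from interchangeable stage modules,
# so swapping e.g. the decode strategy only requires one config change.

ENCODERS, DECODERS, DECODE_STRATEGIES = {}, {}, {}

def register(registry, name):
    """Decorator that records a stage implementation under a config key."""
    def wrap(cls):
        registry[name] = cls
        return cls
    return wrap

@register(ENCODERS, "transformer")
class TransformerEncoder:
    def __call__(self, features):
        return [f * 2 for f in features]      # stand-in for real encoding

@register(DECODERS, "lstm")
class LSTMDecoder:
    def __call__(self, states):
        return ["token_%d" % i for i, _ in enumerate(states)]

@register(DECODE_STRATEGIES, "greedy")
class GreedySearch:
    def __call__(self, tokens):
        return " ".join(tokens)

class CaptioningPipeline:
    """Chains general-purpose stages, each selected by name from a config."""
    def __init__(self, cfg):
        self.encoder = ENCODERS[cfg["encoder"]]()
        self.decoder = DECODERS[cfg["decoder"]]()
        self.decode_strategy = DECODE_STRATEGIES[cfg["decode_strategy"]]()

    def run(self, features):
        states = self.encoder(features)
        tokens = self.decoder(states)
        return self.decode_strategy(tokens)

if __name__ == "__main__":
    cfg = {"encoder": "transformer", "decoder": "lstm", "decode_strategy": "greedy"}
    print(CaptioningPipeline(cfg).run([0.1, 0.2, 0.3]))

Under such a pattern, switching from greedy search to, say, beam search would amount to registering one additional decode-strategy class and changing a single config value, which is the kind of seamless module swapping the abstract describes.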





Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 October 2021


Author Tags

  1. cross-modal analytics
  2. open source
  3. vision and language

Qualifiers

  • Short-paper

Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall acceptance rate: 2,145 of 8,556 submissions (25%)

Article Metrics

  • Downloads (last 12 months): 11
  • Downloads (last 6 weeks): 3
Reflects downloads up to 31 Jan 2025

Cited By

  • (2025) Exploring Vision-Language Foundation Model for Novel Object Captioning. IEEE Transactions on Circuits and Systems for Video Technology, 35(1), 91-102. DOI: 10.1109/TCSVT.2024.3452437. Online publication date: Jan 2025.
  • (2024) CrowdCaption++: Collective-Guided Crowd Scenes Captioning. IEEE Transactions on Multimedia, 26, 4974-4986. DOI: 10.1109/TMM.2023.3328189. Online publication date: 2024.
  • (2024) TridentCap: Image-Fact-Style Trident Semantic Framework for Stylized Image Captioning. IEEE Transactions on Circuits and Systems for Video Technology, 34(5), 3563-3575. DOI: 10.1109/TCSVT.2023.3315133. Online publication date: May 2024.
  • (2024) Advancements in Multimodal Social Media Post Summarization: Integrating GPT-4 for Enhanced Understanding. 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC), 1934-1940. DOI: 10.1109/COMPSAC61105.2024.00307. Online publication date: 2 Jul 2024.
  • (2023) A Review of Transformer-Based Approaches for Image Captioning. Applied Sciences, 13(19), 11103. DOI: 10.3390/app131911103. Online publication date: 9 Oct 2023.
  • (2023) A Survey on Learning Objects' Relationship for Image Captioning. Computational Intelligence and Neuroscience, 2023, 1-16. DOI: 10.1155/2023/8600853. Online publication date: 29 May 2023.
  • (2023) Bottom-up and Top-down Object Inference Networks for Image Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications, 19(5), 1-18. DOI: 10.1145/3580366. Online publication date: 16 Mar 2023.
  • (2023) Boosting Vision-and-Language Navigation with Direction Guiding and Backtracing. ACM Transactions on Multimedia Computing, Communications, and Applications, 19(1), 1-16. DOI: 10.1145/3526024. Online publication date: 5 Jan 2023.
  • (2023) Boosting Scene Graph Generation with Visual Relation Saliency. ACM Transactions on Multimedia Computing, Communications, and Applications, 19(1), 1-17. DOI: 10.1145/3514041. Online publication date: 5 Jan 2023.
  • (2023) IcoCap: Improving Video Captioning by Compounding Images. IEEE Transactions on Multimedia, 26, 4389-4400. DOI: 10.1109/TMM.2023.3322329. Online publication date: 5 Oct 2023.
