Research article
DOI: 10.1145/3638584.3638635

Ontology-Semantic Alignment On Contrastive Video-Language Model for Multimodal Video Retrieval Task

Published: 14 March 2024

Abstract

Contrastive learning-based models have shown impressive performance on text-image retrieval tasks. When applied to video retrieval, however, traditional contrastive learning strategies struggle to achieve satisfactory results because of the redundancy of video content. We identify several potential reasons: (1) current methods sometimes overlook the significant information imbalance between videos and query texts, in particular neglecting an in-depth textual representation of the content within the videos; (2) current video matching methods typically perform cross-modal alignment at the level of general entity similarity, without specifically considering how entity-pair preferences and similarity properties affect the task at hand; (3) previous vectorized retrieval based on video content features has been somewhat flawed, focusing on aligning overall features without a video content tag feature that would support meaningful feature discrimination. To address these three shortcomings, we propose a retrieval model augmented with ontology semantic labels and introduce a method to integrate video ontology semantic labels into the contrastive learning framework. In particular, we develop ontology semantic descriptions of entities, covering both the human figures and the textual elements that appear in the videos. We then train and test on the CMIVQA dataset to assess the performance of our approach. The experimental results show that employing fine-grained ontology labels as sample pairs for contrastive learning increases precision on video retrieval tasks.
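The abstract describes the idea but gives no implementation details, so the following is only a minimal sketch (PyTorch) of how ontology semantic labels might be folded into a CLIP-style contrastive objective: each query text is augmented with its video's ontology labels before encoding, and matched video-text pairs serve as positives in a symmetric InfoNCE loss. Every concrete choice here (the function name, the bracketed tag format, the temperature of 0.07) is an illustrative assumption, not the authors' released code.

```python
# Minimal sketch, NOT the authors' implementation: a symmetric InfoNCE
# objective in which each video is paired with its query text augmented
# by ontology semantic labels before being encoded. The tag-joining
# scheme and the temperature value are illustrative assumptions.
import torch
import torch.nn.functional as F

def ontology_augmented_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of aligned pairs.

    video_emb, text_emb: (B, D) embeddings of videos and of their query
    texts concatenated with ontology semantic labels, e.g.
    "how to bandage a wrist [person: nurse] [object: gauze]".
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature          # (B, B) cosine-similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Matching (video_i, text_i) pairs are positives; all other
    # in-batch combinations act as negatives.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_v2t + loss_t2v)

if __name__ == "__main__":
    # Random features stand in for real encoder outputs in this sketch.
    B, D = 8, 512
    loss = ontology_augmented_contrastive_loss(torch.randn(B, D), torch.randn(B, D))
    print(loss.item())
```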



Published In

CSAI '23: Proceedings of the 2023 7th International Conference on Computer Science and Artificial Intelligence
December 2023
563 pages
ISBN: 9798400708688
DOI: 10.1145/3638584

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Multimodal alignment
  2. Ontology description
  3. Video content understanding

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

CSAI 2023
