Research article
DOI: 10.1145/3638584.3638635

Ontology-Semantic Alignment On Contrastive Video-Language Model for Multimodal Video Retrieval Task

Published: 14 March 2024

Abstract

Contrastive learning-based models have shown impressive performance on text-image retrieval tasks. When applied to video retrieval, however, traditional contrastive learning strategies struggle to achieve satisfactory results because of the redundancy of video content. We identify several potential reasons: (1) current methods sometimes overlook the significant information imbalance between videos and query texts, in particular neglecting an in-depth textual representation of the content within the videos; (2) current video matching methods typically perform cross-modal alignment at the level of general entity similarity, without specifically considering how entity-pair preferences and similarity properties affect the task at hand; (3) previous vectorized retrieval based on video content features has been somewhat flawed, focusing on aligning overall features without a video content tag feature that would support meaningful feature discrimination. To address these three shortcomings, we propose a retrieval model augmented with ontology semantic labels and introduce a method to integrate video ontology semantic labels into the contrastive learning framework. In particular, we develop ontology semantic descriptions of entities, covering both the human figures and the textual elements that appear in the videos. We then train and test on the CMIVQA dataset to assess the performance of our approach. The experimental results show that employing fine-grained ontology labels as sample pairs for contrastive learning increases precision on video retrieval tasks.
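The abstract describes the idea but gives no implementation details, so the following is only a minimal sketch (PyTorch) of how ontology semantic labels might be folded into a CLIP-style contrastive objective: each query text is augmented with its video's ontology labels before encoding, and matched video-text pairs serve as positives in a symmetric InfoNCE loss. Every concrete choice here (the function name, the bracketed tag format, the temperature of 0.07) is an illustrative assumption, not the authors' released code.

```python
# Minimal sketch, NOT the authors' implementation: a symmetric InfoNCE
# objective in which each video is paired with its query text augmented
# by ontology semantic labels before being encoded. The tag-joining
# scheme and the temperature value are illustrative assumptions.
import torch
import torch.nn.functional as F

def ontology_augmented_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of aligned pairs.

    video_emb, text_emb: (B, D) embeddings of videos and of their query
    texts concatenated with ontology semantic labels, e.g.
    "how to bandage a wrist [person: nurse] [object: gauze]".
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature          # (B, B) cosine-similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Matching (video_i, text_i) pairs are positives; all other
    # in-batch combinations act as negatives.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_v2t + loss_t2v)

if __name__ == "__main__":
    # Random features stand in for real encoder outputs in this sketch.
    B, D = 8, 512
    loss = ontology_augmented_contrastive_loss(torch.randn(B, D), torch.randn(B, D))
    print(loss.item())
```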



Published In

CSAI '23: Proceedings of the 2023 7th International Conference on Computer Science and Artificial Intelligence
December 2023
563 pages
ISBN: 9798400708688
DOI: 10.1145/3638584

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Multimodal alignment
  2. Ontology description
  3. Video content understanding

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

CSAI 2023
