DOI: 10.1145/3581783.3612838
Research article

A Hierarchical Deep Video Understanding Method with Shot-Based Instance Search and Large Language Model

Published: 27 October 2023 Publication History

Abstract

Deep video understanding (DVU) is considered challenging because it aims to interpret a video together with its storyline, and it involves two levels of problems: predicting human interactions at the scene level and identifying the relationship between two entities at the movie level. Based on our understanding of movie characteristics and our analysis of the DVU tasks, we propose in this paper a four-stage method: video structuring, shot-based instance search, interaction and relation prediction, and shot/scene summarization with question answering (QA) via ChatGPT. Among these stages, shot-based instance search enables accurate identification and tracking of characters at an appropriate video granularity. Using ChatGPT for QA both narrows the answer space and, owing to its strong text-understanding ability, supplies background knowledge that helps answer the questions. Our method ranks first in movie-level group 2 and scene-level group 1, and second in movie-level group 1 and scene-level group 2 of the ACM MM 2023 Grand Challenge.
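To make the four-stage pipeline described in the abstract concrete, the sketch below mocks each stage with toy logic. This is a minimal illustration, not the authors' implementation: all function names, data shapes, and the co-occurrence heuristic are assumptions, and the ChatGPT QA stage is replaced by a simple lookup over the predicted relations.

```python
def structure_video(video):
    """Stage 1 (video structuring): split the movie into shots and
    group shots into scenes. Here: one trivial scene containing all shots."""
    shots = [{"id": i, "frames": f} for i, f in enumerate(video["shot_frames"])]
    scenes = [shots]
    return shots, scenes

def instance_search(shots, character_gallery):
    """Stage 2 (shot-based instance search): match each shot against a
    gallery of known characters. Frames are mocked as lists of names."""
    return {s["id"]: [c for c in character_gallery if c in s["frames"]]
            for s in shots}

def predict_relations(scenes, appearances):
    """Stage 3 (interaction & relation prediction): derive relations from
    character co-occurrence, standing in for the learned classifiers."""
    relations = set()
    for scene in scenes:
        for shot in scene:
            chars = appearances[shot["id"]]
            relations |= {(a, b, "interacts")
                          for a in chars for b in chars if a < b}
    return relations

def answer_question(question, relations):
    """Stage 4 (summary & QA): the paper uses ChatGPT here; we mock it by
    filtering the relation set for characters mentioned in the question."""
    return sorted(r for r in relations if r[0] in question or r[1] in question)

# Toy run on fabricated data: two shots, two gallery characters.
video = {"shot_frames": [["Alice", "Bob"], ["Alice"]]}
shots, scenes = structure_video(video)
appearances = instance_search(shots, ["Alice", "Bob"])
relations = predict_relations(scenes, appearances)
print(answer_question("What is the relationship between Alice and Bob?", relations))
```

The point of the structure is that each stage consumes only the previous stage's output, so components (e.g. the tracker in stage 2 or the QA model in stage 4) can be swapped independently.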



Published In
    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. instance search
    2. multi-modal feature
    3. video understanding

    Qualifiers

    • Research-article


    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
