DOI: 10.1145/3581783.3612838
Research article

A Hierarchical Deep Video Understanding Method with Shot-Based Instance Search and Large Language Model

Published: 27 October 2023 Publication History

Abstract

Deep video understanding (DVU) is considered challenging because it aims to interpret a video together with its storyline, and it involves two levels of problems: predicting human interactions at the scene level and identifying the relationship between two entities at the movie level. Based on our understanding of movie characteristics and our analysis of the DVU tasks, we propose in this paper a four-stage method: video structuring, shot-based instance search, interaction and relation prediction, and shot/scene summarization with question answering (QA) via ChatGPT. Among these stages, shot-based instance search enables accurate identification and tracking of characters at an appropriate video granularity. Using ChatGPT for QA both narrows the answer space and, owing to its strong text-understanding ability, supplies background knowledge that helps answer the questions. Our method ranks first in movie-level group 2 and scene-level group 1, and second in movie-level group 1 and scene-level group 2 of the ACM MM 2023 Grand Challenge.
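To make the four-stage pipeline described in the abstract concrete, the sketch below mocks each stage with toy logic. This is a minimal illustration, not the authors' implementation: all function names, data shapes, and the co-occurrence heuristic are assumptions, and the ChatGPT QA stage is replaced by a simple lookup over the predicted relations.

```python
def structure_video(video):
    """Stage 1 (video structuring): split the movie into shots and
    group shots into scenes. Here: one trivial scene containing all shots."""
    shots = [{"id": i, "frames": f} for i, f in enumerate(video["shot_frames"])]
    scenes = [shots]
    return shots, scenes

def instance_search(shots, character_gallery):
    """Stage 2 (shot-based instance search): match each shot against a
    gallery of known characters. Frames are mocked as lists of names."""
    return {s["id"]: [c for c in character_gallery if c in s["frames"]]
            for s in shots}

def predict_relations(scenes, appearances):
    """Stage 3 (interaction & relation prediction): derive relations from
    character co-occurrence, standing in for the learned classifiers."""
    relations = set()
    for scene in scenes:
        for shot in scene:
            chars = appearances[shot["id"]]
            relations |= {(a, b, "interacts")
                          for a in chars for b in chars if a < b}
    return relations

def answer_question(question, relations):
    """Stage 4 (summary & QA): the paper uses ChatGPT here; we mock it by
    filtering the relation set for characters mentioned in the question."""
    return sorted(r for r in relations if r[0] in question or r[1] in question)

# Toy run on fabricated data: two shots, two gallery characters.
video = {"shot_frames": [["Alice", "Bob"], ["Alice"]]}
shots, scenes = structure_video(video)
appearances = instance_search(shots, ["Alice", "Bob"])
relations = predict_relations(scenes, appearances)
print(answer_question("What is the relationship between Alice and Bob?", relations))
```

The point of the structure is that each stage consumes only the previous stage's output, so components (e.g. the tracker in stage 2 or the QA model in stage 4) can be swapped independently.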



Published In
    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. instance search
    2. multi-modal feature
    3. video understanding

    Qualifiers

    • Research-article


    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
