Contextual Explainable Video Representation: Human Perception-based Understanding

Vo, Khoa; Yamazaki, Kashu; Nguyen, Phong X.; Nguyen, Phat; Luu, Khoa; Le, Ngan


arXiv:2212.06206 (cs)

[Submitted on 12 Dec 2022 (v1), last revised 17 Dec 2022 (this version, v2)]



Abstract: Video understanding is a growing field and a subject of intense research. It includes many tasks that require understanding both spatial and temporal information, e.g., action detection, action recognition, video captioning, and video retrieval. One of the most challenging problems in video understanding is feature extraction, i.e., extracting a contextual visual representation from a given untrimmed video, which is difficult because of the long and complicated temporal structure of unconstrained videos. Unlike existing approaches, which apply a pre-trained backbone network as a black box to extract visual representations, our approach aims to extract the most contextual information with an explainable mechanism. As we observe, humans typically perceive a video through the interactions between three main factors: the actors, the relevant objects, and the surrounding environment. It is therefore crucial to design a contextual, explainable video representation extraction method that can capture each of these factors and model the relationships between them. In this paper, we discuss approaches that incorporate the human perception process into modeling actors, objects, and the environment. We choose video paragraph captioning and temporal action detection to illustrate the effectiveness of human perception-based contextual representation in video understanding. Source code is publicly available at this https URL.

Comments:	Accepted in Asilomar Conference 2022
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2212.06206 [cs.CV]
	(or arXiv:2212.06206v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2212.06206

Submission history

From: Khoa Vo Ho Viet
[v1] Mon, 12 Dec 2022 19:29:07 UTC (8,567 KB)
[v2] Sat, 17 Dec 2022 06:29:37 UTC (8,567 KB)
