Front Matter
Hunting Group Clues with Transformers for Social Group Activity Recognition
This paper presents a novel framework for social group activity recognition. As an extension of group activity recognition, social group activity recognition requires recognizing multiple sub-group activities and identifying group members. ...
Contrastive Positive Mining for Unsupervised 3D Action Representation Learning
Recent contrastive-based 3D action representation learning has made great progress. However, the strict positive/negative constraint is yet to be relaxed and the use of non-self positives is yet to be explored. In this paper, a Contrastive Positive ...
Target-Absent Human Attention
The prediction of human gaze behavior is important for building human-computer interaction systems that can anticipate the user’s attention. Computer vision models have been developed to predict the fixations made by people as they search for ...
Uncertainty-Based Spatial-Temporal Attention for Online Action Detection
Online action detection aims at detecting the ongoing action in a streaming video. In this paper, we propose uncertainty-based spatial-temporal attention for online action detection. By explicitly modeling the distribution of model parameters, ...
Rethinking Zero-shot Action Recognition: Learning from Latent Atomic Actions
To avoid the time-consuming annotation and retraining cycle in applying supervised action recognition models, Zero-Shot Action Recognition (ZSAR) has become a thriving direction. ZSAR requires models to recognize actions that never appear in training ...
Mining Cross-Person Cues for Body-Part Interactiveness Learning in HOI Detection
Human-Object Interaction (HOI) detection plays a crucial role in activity understanding. Though significant progress has been made, interactiveness learning remains a challenging problem in HOI detection: existing methods usually generate ...
Collaborating Domain-Shared and Target-Specific Feature Clustering for Cross-domain 3D Action Recognition
In this work, we consider the problem of cross-domain 3D action recognition in the open-set setting, which has been rarely explored before. Specifically, there is a source domain and a target domain that contain the skeleton sequences with ...
Is Appearance Free Action Recognition Possible?
Intuition might suggest that motion and dynamic information are key to video-based action recognition. In contrast, there is evidence that state-of-the-art deep-learning video understanding architectures are biased toward static information ...
Learning Spatial-Preserved Skeleton Representations for Few-Shot Action Recognition
Few-shot action recognition aims to recognize few-labeled novel action classes and attracts growing attention due to its practical significance. Human skeletons provide an explainable and data-efficient representation for this problem by explicitly ...
Dual-Evidential Learning for Weakly-supervised Temporal Action Localization
Weakly-supervised temporal action localization (WS-TAL) aims to localize the action instances and recognize their categories with only video-level labels. Despite great progress, existing methods suffer from severe action-background ambiguity, ...
Global-Local Motion Transformer for Unsupervised Skeleton-Based Action Learning
We propose a new transformer model for the task of unsupervised learning of skeleton motion sequences. The existing transformer model utilized for unsupervised skeleton-based action learning learns the instantaneous velocity of each joint from ...
AdaFocusV3: On Unified Spatial-Temporal Dynamic Video Recognition
Yulin Wang, Yang Yue, Xinhong Xu, Ali Hassani, Victor Kulikov, Nikita Orlov, Shiji Song, Humphrey Shi, Gao Huang
Recent research has revealed that reducing temporal redundancy and reducing spatial redundancy are both effective approaches towards efficient video recognition, e.g., allocating the majority of computation to a task-relevant subset of frames or the most valuable ...
Panoramic Human Activity Recognition
To obtain a more comprehensive activity understanding for a crowded scene, in this paper, we propose a new problem of panoramic human activity recognition (PAR), which aims to simultaneously achieve the recognition of individual actions, social ...
Delving into Details: Synopsis-to-Detail Networks for Video Recognition
In this paper, we explore the details in video recognition with the aim of improving accuracy. It is observed that most failure cases in recent works are misclassifications among very similar actions (such as high kick vs. side kick) ...
A Generalized and Robust Framework for Timestamp Supervision in Temporal Action Segmentation
In temporal action segmentation, Timestamp Supervision requires only a handful of labelled frames per video sequence. For unlabelled frames, previous works rely on assigning hard labels, and performance rapidly collapses under subtle violations of ...
Few-Shot Action Recognition with Hierarchical Matching and Contrastive Learning
Few-shot action recognition aims to recognize actions in test videos based on limited annotated data of target action classes. The dominant approaches project videos into a metric space and classify videos via nearest-neighbor search. They mainly ...
PrivHAR: Recognizing Human Actions from Privacy-Preserving Lens
The accelerated use of digital cameras prompts an increasing concern about privacy and security, particularly in applications such as action recognition. In this paper, we propose an optimization framework to provide robust visual privacy protection ...
Scale-Aware Spatio-Temporal Relation Learning for Video Anomaly Detection
Recent progress in video anomaly detection (VAD) has shown that feature discrimination is the key to effectively distinguishing anomalies from normal events. We observe that many anomalous events occur in limited local regions, and the severe ...
Compound Prototype Matching for Few-Shot Action Recognition
Few-shot action recognition aims to recognize novel action classes using only a small number of labeled training samples. In this work, we propose a novel approach that first summarizes each video into compound prototypes consisting of a group of ...
Dynamic Spatio-Temporal Specialization Learning for Fine-Grained Action Recognition
The goal of fine-grained action recognition is to successfully discriminate between action categories with subtle differences. To tackle this, we derive inspiration from the human visual system which contains specialized regions in the brain that ...
Dynamic Local Aggregation Network with Adaptive Clusterer for Anomaly Detection
Existing methods for anomaly detection based on memory-augmented autoencoder (AE) have the following drawbacks: (1) Establishing a memory bank requires additional memory space. (2) The fixed number of prototypes from subjective assumptions ignores ...
Action Quality Assessment with Temporal Parsing Transformer
Action Quality Assessment (AQA) is important for action understanding, and resolving the task poses unique challenges due to subtle visual differences. Existing state-of-the-art methods typically rely on holistic video representations for score ...
Entry-Flipped Transformer for Inference and Prediction of Participant Behavior
Some group activities, such as team sports and choreographed dances, involve closely coupled interaction between participants. Here we investigate the tasks of inferring and predicting participant behavior, in terms of motion paths and actions, ...
Pairwise Contrastive Learning Network for Action Quality Assessment
Considering the complexity of modeling diverse actions of athletes, action quality assessment (AQA) in sports is a challenging task. A common solution is to tackle this problem as a regression task that maps the input video to the final score ...