Progressive Instance-Aware Feature Learning for Compositional Action Recognition

Published: 01 August 2023

Abstract

To enable a model to generalize to unseen “action-object” combinations (compositional actions), previous methods encode multiple pieces of information (i.e., the appearance, position, and identity of visual instances) independently and concatenate them for classification. However, these methods ignore the potential supervisory role of instance information (i.e., position and identity) in the process of visual perception. To this end, we present a novel framework, namely Progressive Instance-aware Feature Learning (PIFL), which progressively extracts, reasons over, and predicts dynamic cues of moving instances in videos for compositional action recognition. Specifically, the framework extracts features from foreground instances that are likely to be relevant to human actions (Position-aware Appearance Feature Extraction, Section III-B1), performs identity-aware reasoning among instance-centric features with semantic-specific interactions (Identity-aware Feature Interaction, Section III-B2), and finally predicts instances’ positions from observed states to force the model to perceive their movement (Semantic-aware Position Prediction, Section III-B3). We evaluate our approach on two compositional action recognition benchmarks, Something-Else and IKEA-Assembly. It achieves consistent accuracy gains over off-the-shelf action recognition algorithms with both ground-truth and detected instance positions.
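To make the three-stage pipeline concrete, below is a minimal, illustrative PyTorch sketch of how such a framework could be wired together, assuming per-frame instance appearance features, bounding boxes, and identity labels as inputs. All module names, tensor shapes, and the joint classification/position-prediction loss here are assumptions for illustration only, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PositionAwareExtractor(nn.Module):
    """Fuses per-instance appearance features with their box coordinates
    (a stand-in for Position-aware Appearance Feature Extraction; the real
    module would pool RoI features from a video backbone)."""

    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(feat_dim + 4, hidden_dim)

    def forward(self, appearance, boxes):
        # appearance: (B, T, N, feat_dim), boxes: (B, T, N, 4) normalized coordinates
        return torch.relu(self.proj(torch.cat([appearance, boxes], dim=-1)))


class IdentityAwareInteraction(nn.Module):
    """Reasons among instance-centric features; identity (e.g., hand vs. object)
    is injected as an embedding before self-attention, a simple proxy for
    semantic-specific interactions."""

    def __init__(self, hidden_dim, num_identities=2, heads=4):
        super().__init__()
        self.id_embed = nn.Embedding(num_identities, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, heads, batch_first=True)

    def forward(self, feats, identities):
        # feats: (B, T, N, D), identities: (B, T, N) integer labels
        B, T, N, D = feats.shape
        x = (feats + self.id_embed(identities)).reshape(B, T * N, D)
        out, _ = self.attn(x, x, x)
        return out.reshape(B, T, N, D)


class SemanticAwarePositionPredictor(nn.Module):
    """Predicts each instance's box in the next frame from its current state,
    giving the auxiliary position-prediction signal described in the abstract."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.head = nn.Linear(hidden_dim, 4)

    def forward(self, feats):
        return self.head(feats)


class PIFLSketch(nn.Module):
    def __init__(self, feat_dim=256, hidden_dim=128, num_classes=174):
        super().__init__()
        self.extract = PositionAwareExtractor(feat_dim, hidden_dim)
        self.interact = IdentityAwareInteraction(hidden_dim)
        self.predict = SemanticAwarePositionPredictor(hidden_dim)
        self.classify = nn.Linear(hidden_dim, num_classes)

    def forward(self, appearance, boxes, identities):
        feats = self.extract(appearance, boxes)          # stage 1: position-aware features
        feats = self.interact(feats, identities)         # stage 2: identity-aware reasoning
        pred_boxes = self.predict(feats[:, :-1])         # stage 3: predict boxes of frames 1..T-1
        logits = self.classify(feats.mean(dim=(1, 2)))   # pooled features -> action class
        return logits, pred_boxes


# Toy usage: 2 clips, 8 frames, 4 instances, 256-d appearance features.
model = PIFLSketch()
appearance = torch.randn(2, 8, 4, 256)
boxes = torch.rand(2, 8, 4, 4)
identities = torch.randint(0, 2, (2, 8, 4))
logits, pred_boxes = model(appearance, boxes, identities)

# Joint objective: action classification plus auxiliary position prediction.
cls_loss = F.cross_entropy(logits, torch.randint(0, 174, (2,)))
pos_loss = F.smooth_l1_loss(pred_boxes, boxes[:, 1:])   # targets: boxes of later frames
loss = cls_loss + pos_loss
```

Per the abstract, the position-prediction branch serves as auxiliary supervision that forces the model to perceive instance movement; at inference only the classification output would be used.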



Published In

IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 45, Issue 8 (Aug. 2023), 1338 pages

Publisher

IEEE Computer Society, United States
