Progressive Instance-Aware Feature Learning for Compositional Action Recognition

Published: 01 August 2023

Abstract

To enable a model to generalize to unseen “action-object” combinations (compositional actions), previous methods encode multiple pieces of information (i.e., the appearance, position, and identity of visual instances) independently and concatenate them for classification. However, these methods ignore the potential supervisory role of instance information (i.e., position and identity) in the process of visual perception. To this end, we present a novel framework, namely Progressive Instance-aware Feature Learning (PIFL), which progressively extracts, reasons over, and predicts dynamic cues of moving instances in videos for compositional action recognition. Specifically, the framework extracts features from foreground instances that are likely to be relevant to human actions (Position-aware Appearance Feature Extraction, Section III-B1), performs identity-aware reasoning among instance-centric features with semantic-specific interactions (Identity-aware Feature Interaction, Section III-B2), and finally predicts instances’ positions from observed states to force the model to perceive their movement (Semantic-aware Position Prediction, Section III-B3). We evaluate our approach on two compositional action recognition benchmarks, Something-Else and IKEA-Assembly. It achieves consistent accuracy gains over off-the-shelf action recognition algorithms with both ground-truth and detected instance positions.
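To make the three-stage pipeline concrete, below is a minimal, illustrative PyTorch sketch of how such a framework could be wired together, assuming per-frame instance appearance features, bounding boxes, and identity labels as inputs. All module names, tensor shapes, and the joint classification/position-prediction loss here are assumptions for illustration only, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PositionAwareExtractor(nn.Module):
    """Fuses per-instance appearance features with their box coordinates
    (a stand-in for Position-aware Appearance Feature Extraction; the real
    module would pool RoI features from a video backbone)."""

    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(feat_dim + 4, hidden_dim)

    def forward(self, appearance, boxes):
        # appearance: (B, T, N, feat_dim), boxes: (B, T, N, 4) normalized coordinates
        return torch.relu(self.proj(torch.cat([appearance, boxes], dim=-1)))


class IdentityAwareInteraction(nn.Module):
    """Reasons among instance-centric features; identity (e.g., hand vs. object)
    is injected as an embedding before self-attention, a simple proxy for
    semantic-specific interactions."""

    def __init__(self, hidden_dim, num_identities=2, heads=4):
        super().__init__()
        self.id_embed = nn.Embedding(num_identities, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, heads, batch_first=True)

    def forward(self, feats, identities):
        # feats: (B, T, N, D), identities: (B, T, N) integer labels
        B, T, N, D = feats.shape
        x = (feats + self.id_embed(identities)).reshape(B, T * N, D)
        out, _ = self.attn(x, x, x)
        return out.reshape(B, T, N, D)


class SemanticAwarePositionPredictor(nn.Module):
    """Predicts each instance's box in the next frame from its current state,
    giving the auxiliary position-prediction signal described in the abstract."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.head = nn.Linear(hidden_dim, 4)

    def forward(self, feats):
        return self.head(feats)


class PIFLSketch(nn.Module):
    def __init__(self, feat_dim=256, hidden_dim=128, num_classes=174):
        super().__init__()
        self.extract = PositionAwareExtractor(feat_dim, hidden_dim)
        self.interact = IdentityAwareInteraction(hidden_dim)
        self.predict = SemanticAwarePositionPredictor(hidden_dim)
        self.classify = nn.Linear(hidden_dim, num_classes)

    def forward(self, appearance, boxes, identities):
        feats = self.extract(appearance, boxes)          # stage 1: position-aware features
        feats = self.interact(feats, identities)         # stage 2: identity-aware reasoning
        pred_boxes = self.predict(feats[:, :-1])         # stage 3: predict boxes of frames 1..T-1
        logits = self.classify(feats.mean(dim=(1, 2)))   # pooled features -> action class
        return logits, pred_boxes


# Toy usage: 2 clips, 8 frames, 4 instances, 256-d appearance features.
model = PIFLSketch()
appearance = torch.randn(2, 8, 4, 256)
boxes = torch.rand(2, 8, 4, 4)
identities = torch.randint(0, 2, (2, 8, 4))
logits, pred_boxes = model(appearance, boxes, identities)

# Joint objective: action classification plus auxiliary position prediction.
cls_loss = F.cross_entropy(logits, torch.randint(0, 174, (2,)))
pos_loss = F.smooth_l1_loss(pred_boxes, boxes[:, 1:])   # targets: boxes of later frames
loss = cls_loss + pos_loss
```

Per the abstract, the position-prediction branch serves as auxiliary supervision that forces the model to perceive instance movement; at inference only the classification output would be used.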



Published In

IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 45, Issue 8 (Aug. 2023), 1338 pages

Publisher

IEEE Computer Society, United States
