Article

Contrastive Positive Mining for Unsupervised 3D Action Representation Learning

Authors:

Wanqing Li

Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IV

Pages 36–51

https://doi.org/10.1007/978-3-031-19772-7_3

Published: 23 October 2022

Abstract

Contrastive learning has recently driven substantial progress in 3D action representation learning. However, existing methods still impose a strict positive/negative constraint, and the use of non-self positives remains unexplored. This paper proposes a Contrastive Positive Mining (CPM) framework for unsupervised skeleton-based 3D action representation learning. CPM identifies non-self positives in a contextual queue to boost learning. Specifically, siamese encoders are trained to match the similarity distributions of the augmented instances with respect to all instances in the contextual queue. Once the non-self positive instances in the queue have been identified, a positive-enhanced learning strategy leverages the mined positives to make the learned latent space more robust to intra-class and inter-class diversity. Experimental results show that CPM is effective and outperforms existing state-of-the-art unsupervised methods on the challenging NTU and PKU-MMD datasets.
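The two ideas named in the abstract, matching similarity distributions against a queue and mining non-self positives from it, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the cosine similarity, the softmax temperatures `t_query` and `t_key`, the cross-entropy form of the matching loss, and the top-`k` mining rule are all assumptions made for illustration, and every function name below is hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def similarity_distribution(z, queue, temperature):
    """Softmax over cosine similarities between each embedding and every queue entry."""
    logits = l2_normalize(z) @ l2_normalize(queue).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def distribution_matching_loss(z_query, z_key, queue, t_query=0.1, t_key=0.05):
    """Cross-entropy between the two views' similarity distributions.

    Assumption: the key view provides the target distribution, with a
    lower temperature so that the target is sharper than the prediction.
    """
    p_key = similarity_distribution(z_key, queue, t_key)
    p_query = similarity_distribution(z_query, queue, t_query)
    return float(-(p_key * np.log(p_query + 1e-12)).sum(axis=1).mean())

def mine_positives(z_key, queue, k=1):
    """Indices of the k most similar queue entries per instance.

    Assumption: the nearest queue entries are treated as non-self positives.
    """
    sims = l2_normalize(z_key) @ l2_normalize(queue).T
    return np.argsort(-sims, axis=1)[:, :k]
```

Given a batch of embeddings from the two encoders and a queue of stored embeddings, `distribution_matching_loss(z_query, z_key, queue)` returns a scalar loss and `mine_positives(z_key, queue, k)` returns one row of queue indices per instance. How the mined positives enter the training objective is specified in the paper itself and is not reproduced here.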


Cited By

Abdelfattah M, Alahi A (2024) S-JEPA: A Joint Embedding Predictive Architecture for Skeletal Action Recognition. Computer Vision – ECCV 2024, pp. 367–384. Online publication date: 29-Sep-2024
https://dl.acm.org/doi/10.1007/978-3-031-73411-3_21
Wang Z, Wang H, Tian C, Jin Y (2024) Preventing Catastrophic Overfitting in Fast Adversarial Training: A Bi-level Optimization Perspective. Computer Vision – ECCV 2024, pp. 144–160. Online publication date: 29-Sep-2024
https://dl.acm.org/doi/10.1007/978-3-031-73390-1_9
Sun S, Liu D, Dong J, Qu X, Gao J, Yang X, Wang X, Wang M, El Saddik A, Mei T, Cucchiara R, Bertini M, Tobon Vallejo D, Atrey P, Hossain M (2023) Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding. Proceedings of the 31st ACM International Conference on Multimedia, pp. 2973–2984. Online publication date: 26-Oct-2023
https://dl.acm.org/doi/10.1145/3581783.3612449


Published In


Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IV

Oct 2022

800 pages

ISBN: 978-3-031-19771-0

DOI: 10.1007/978-3-031-19772-7

Editors:
Shai Avidan (Tel Aviv University, Tel Aviv, Israel)
Gabriel Brostow (University College London, London, UK)
Moustapha Cissé (Google AI, Accra, Ghana)
Giovanni Maria Farinella (University of Catania, Catania, Italy)
Tal Hassner (Facebook (United States), Menlo Park, CA, USA)

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022.

Publisher

Springer-Verlag

Berlin, Heidelberg
