DOI: 10.1007/978-3-031-19772-7_3
Article

Contrastive Positive Mining for Unsupervised 3D Action Representation Learning

Published: 23 October 2022

Abstract

Recent contrastive-based 3D action representation learning has made great progress. However, the strict positive/negative constraint has yet to be relaxed, and the use of non-self positives has yet to be explored. In this paper, a Contrastive Positive Mining (CPM) framework is proposed for unsupervised skeleton-based 3D action representation learning. CPM identifies non-self positives in a contextual queue to boost learning. Specifically, siamese encoders are adopted and trained to match the similarity distributions of the augmented instances with reference to all instances in the contextual queue. By identifying the non-self positive instances in the queue, a positive-enhanced learning strategy is proposed that leverages the mined positives to make the learned latent space more robust to intra-class and inter-class diversity. Experimental results show that the proposed CPM is effective and outperforms existing state-of-the-art unsupervised methods on the challenging NTU and PKU-MMD datasets.
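
The abstract does not give the loss or the mining rule, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that each view's similarity distribution is a temperature-scaled softmax over cosine similarities to the queue, that distributions are matched with a cross-entropy term, and that non-self positives are the top-k queue entries under the target view. The function names, the temperatures, and the top-k rule are all hypothetical choices made for illustration.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def similarity_distribution(embedding, queue, temperature):
    # Softmax over cosine similarities between one embedding and every queue entry.
    z = l2_normalize(embedding)
    logits = [
        sum(a * b for a, b in zip(z, l2_normalize(q))) / temperature
        for q in queue
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mine_positives(target_dist, top_k=1):
    # Assumed mining rule: the top_k queue entries with the highest target probability.
    order = sorted(range(len(target_dist)), key=lambda i: target_dist[i], reverse=True)
    return order[:top_k]

def cross_entropy(target, pred):
    # Cross-entropy of the predicted distribution against the target distribution.
    return -sum(t * math.log(p + 1e-12) for t, p in zip(target, pred))

def cpm_style_loss(query_emb, key_emb, queue, t_query=0.1, t_key=0.05, top_k=1):
    # Hypothetical loss: distribution matching plus a term that raises the
    # query's probability on the mined non-self positives.
    p_query = similarity_distribution(query_emb, queue, t_query)
    p_key = similarity_distribution(key_emb, queue, t_key)
    match_loss = cross_entropy(p_key, p_query)
    positives = mine_positives(p_key, top_k)
    positive_loss = -sum(math.log(p_query[i] + 1e-12) for i in positives) / len(positives)
    return match_loss + positive_loss, positives

# Toy usage: two augmented views of one instance and a three-entry queue.
queue = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
loss, positives = cpm_style_loss([0.9, 0.1], [1.0, 0.05], queue)
```

In a real training setup the two embeddings would come from the siamese encoders and the queue would hold embeddings of earlier instances; how the queue is updated and how the two encoders share or average weights are not stated in the abstract and are left out here.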

Cited By

  • (2024) S-JEPA: A Joint Embedding Predictive Architecture for Skeletal Action Recognition. Computer Vision – ECCV 2024, pp. 367–384. DOI: 10.1007/978-3-031-73411-3_21. Online publication date: 29-Sep-2024
  • (2024) Preventing Catastrophic Overfitting in Fast Adversarial Training: A Bi-level Optimization Perspective. Computer Vision – ECCV 2024, pp. 144–160. DOI: 10.1007/978-3-031-73390-1_9. Online publication date: 29-Sep-2024
  • (2023) Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding. Proceedings of the 31st ACM International Conference on Multimedia, pp. 2973–2984. DOI: 10.1145/3581783.3612449. Online publication date: 26-Oct-2023

Published In

Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IV
Oct 2022
800 pages
ISBN: 978-3-031-19771-0
DOI: 10.1007/978-3-031-19772-7

Publisher

Springer-Verlag, Berlin, Heidelberg

Author Tags

  1. Unsupervised learning
  2. 3D action representation
  3. Skeleton
  4. Positive mining

Qualifiers

  • Article
