
Deformable patch embedding-based shift module-enhanced transformer for panoramic action recognition

  • Original article
  • Published in The Visual Computer

Abstract

With the growing popularity of omnidirectional cameras, \(360^{\circ }\) video action recognition has become one of the most promising research fields. To obtain a more precise understanding of actions in panoramic scenes, we propose a deformable patch embedding-based, temporal shift module-enhanced vision transformer (DS-ViT), which aims to simultaneously eliminate the distortion introduced by equirectangular projection (ERP) and model temporal relationships across video frames. Panoramic action recognition is a practical but challenging domain owing to the lack of panoramic feature extraction methods. With deformable patch embedding, our scheme adaptively learns position offsets between pixels and thus captures distorted features effectively. The temporal shift module facilitates temporal information exchange by shifting part of the channels along the time dimension while adding zero parameters. Thanks to the powerful transformer encoder, DS-ViT efficiently learns distorted features directly from ERP inputs. Experiments on the recent EgoK360 dataset show that our solution outperforms the state-of-the-art two-stream solution by 9.29\({\%}\) in action accuracy and 8.18\({\%}\) in activity accuracy.


Data Availability

The datasets generated and/or analyzed during the current study are available in the EGOK360 repository, https://egok360.github.io/.

Notes

  1. K is set to 3.


Acknowledgements

This work is supported in part by the National Natural Science Foundation of China (No. 61702335), in part by the National Science Foundation of Guangdong Province of China (No. 2021A1515011632) and in part by the funding of Shenzhen University (001203234).

Author information

Corresponding author

Correspondence to Xiaoyan Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, X., Cui, Y. & Huo, Y. Deformable patch embedding-based shift module-enhanced transformer for panoramic action recognition. Vis Comput 39, 3247–3257 (2023). https://doi.org/10.1007/s00371-023-02959-y

