Abstract
\(360^{\circ }\) video action recognition is one of the most promising fields emerging with the popularity of omnidirectional cameras. To achieve more precise action understanding in panoramic scenes, in this paper we propose a deformable patch embedding-based, temporal shift module-enhanced vision transformer model (DS-ViT), which aims to simultaneously eliminate the distortion caused by equirectangular projection (ERP) and model temporal relationships across video sequences. Panoramic action recognition is a practical but challenging domain owing to the lack of panoramic feature extraction methods. With deformable patch embedding, our scheme adaptively learns position offsets between pixels, which effectively captures distorted features. The temporal shift module facilitates temporal information exchange by shifting part of the channels while adding zero parameters. Thanks to the powerful encoder, DS-ViT can efficiently learn distorted features directly from ERP inputs. Simulation results on the recent EgoK360 dataset show that our proposed solution outperforms the state-of-the-art two-stream solution by 9.29\({\%}\) in action accuracy and 8.18\({\%}\) in activity accuracy.
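The temporal shift idea mentioned above can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration, not the paper's implementation: it shows the parameter-free channel shift (as in TSM) on a clip tensor of shape (T, C, H, W), where a fraction of channels is shifted one step backward in time, an equal fraction one step forward, and the remainder left in place; the `fold_div` parameter name is a hypothetical choice.

```python
import numpy as np

def temporal_shift(x, fold_div=8):
    """Parameter-free temporal shift over a clip of shape (T, C, H, W).

    The first C//fold_div channels take their values from the next
    frame, the second C//fold_div from the previous frame, and the
    remaining channels are copied unchanged. Boundary frames are
    zero-padded, so no learnable parameters are introduced.
    """
    t, c, h, w = x.shape
    fold = c // fold_div
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]            # shift backward in time
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]  # shift forward in time
    out[:, 2 * fold:] = x[:, 2 * fold:]       # untouched channels
    return out

# Usage: a toy 4-frame clip with 8 channels and 1x1 spatial size.
clip = np.arange(32, dtype=float).reshape(4, 8, 1, 1)
shifted = temporal_shift(clip, fold_div=8)
```

After the shift, each frame's feature map mixes information from its temporal neighbors, which is what lets a purely spatial encoder reason about motion at no extra parameter cost.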
Data Availability
The datasets generated during and/or analyzed during the current study are available in the EGOK360 repository, https://egok360.github.io/.
Notes
K is set to 3.
Acknowledgements
This work is supported in part by the National Natural Science Foundation of China (No. 61702335), in part by the National Science Foundation of Guangdong Province of China (No. 2021A1515011632) and in part by the funding of Shenzhen University (001203234).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, X., Cui, Y. & Huo, Y. Deformable patch embedding-based shift module-enhanced transformer for panoramic action recognition. Vis Comput 39, 3247–3257 (2023). https://doi.org/10.1007/s00371-023-02959-y