Abstract
\(360^{\circ }\) video action recognition is one of the most promising fields emerging with the popularity of omnidirectional cameras. To achieve more precise action understanding in panoramic scenes, in this paper we propose a deformable patch embedding-based, temporal shift module-enhanced vision transformer model (DS-ViT), which aims to simultaneously eliminate the distortion caused by equirectangular projection (ERP) and model temporal relationships across video sequences. Panoramic action recognition is a practical but challenging domain owing to the lack of panoramic feature extraction methods. With deformable patch embedding, our scheme adaptively learns position offsets between pixels, which effectively captures distorted features. The temporal shift module facilitates temporal information exchange by shifting part of the channels while adding zero parameters. Thanks to the powerful encoder, DS-ViT can efficiently learn distorted features directly from ERP inputs. Simulation results on the recent EgoK360 dataset show that our proposed solution outperforms the state-of-the-art two-stream solution by 9.29\({\%}\) in action accuracy and 8.18\({\%}\) in activity accuracy.
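The temporal shift idea mentioned above can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration, not the paper's implementation: it shows the parameter-free channel shift (as in TSM) on a clip tensor of shape (T, C, H, W), where a fraction of channels is shifted one step backward in time, an equal fraction one step forward, and the remainder left in place; the `fold_div` parameter name is a hypothetical choice.

```python
import numpy as np

def temporal_shift(x, fold_div=8):
    """Parameter-free temporal shift over a clip of shape (T, C, H, W).

    The first C//fold_div channels take their values from the next
    frame, the second C//fold_div from the previous frame, and the
    remaining channels are copied unchanged. Boundary frames are
    zero-padded, so no learnable parameters are introduced.
    """
    t, c, h, w = x.shape
    fold = c // fold_div
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]            # shift backward in time
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]  # shift forward in time
    out[:, 2 * fold:] = x[:, 2 * fold:]       # untouched channels
    return out

# Usage: a toy 4-frame clip with 8 channels and 1x1 spatial size.
clip = np.arange(32, dtype=float).reshape(4, 8, 1, 1)
shifted = temporal_shift(clip, fold_div=8)
```

After the shift, each frame's feature map mixes information from its temporal neighbors, which is what lets a purely spatial encoder reason about motion at no extra parameter cost.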
Data Availability
The datasets generated during and/or analyzed during the current study are available in the EGOK360 repository, https://egok360.github.io/.
Notes
K is set to 3.
Acknowledgements
This work is supported in part by the National Natural Science Foundation of China (No. 61702335), in part by the National Science Foundation of Guangdong Province of China (No. 2021A1515011632) and in part by the funding of Shenzhen University (001203234).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, X., Cui, Y. & Huo, Y. Deformable patch embedding-based shift module-enhanced transformer for panoramic action recognition. Vis Comput 39, 3247–3257 (2023). https://doi.org/10.1007/s00371-023-02959-y