Sign language recognition via dimensional global–local shift and cross-scale aggregation

Published: 01 March 2023

Abstract

Sign languages generally consist of sequences of upper-body gestures and involve the cooperation of several body parts, such as the hands, arms, and face. Therefore, the dynamics of these parts, as well as the holistic appearance of the upper body and of the individual parts, are essential for robust recognition. In this paper, a global–local representation (GLR) module is proposed to boost spatiotemporal feature modeling. The GLR module is composed of a global shift and a local shift along the height, width, and temporal dimensions. Specifically, the global shift is applied to the entire feature map for holistic representation, while the local shift is restricted to local patches to capture detailed features. Furthermore, a novel cross-scale aggregation module is designed to combine the global and local information across different dimensions. Extensive experimental results on three large-scale benchmarks, WLASL, INCLUDE, and LSA64, demonstrate that the proposed method achieves state-of-the-art recognition performance.
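
The abstract describes the GLR module only verbally. As a rough illustration, the following is a minimal PyTorch sketch, under stated assumptions, of how a dimensional global shift and a patch-restricted local shift could be realized; the function names, shift offset, and patch size are hypothetical and are not the authors' implementation, and the naive sum at the end merely stands in for the paper's cross-scale aggregation module.

```python
# Hypothetical sketch of dimensional global/local shift, inferred from the
# abstract's description only. Names, offsets, and patch sizes are assumptions.
import torch

def global_shift(x, dim, offset=1):
    """Shift the entire feature map by `offset` positions along `dim`.

    x: tensor of shape (N, C, T, H, W); dim selects the temporal (2),
    height (3), or width (4) axis. Wrapped-around values are zeroed so no
    information leaks across the clip or image boundary.
    """
    shifted = torch.roll(x, shifts=offset, dims=dim)
    idx = [slice(None)] * x.dim()
    if offset > 0:
        idx[dim] = slice(0, offset)
    else:
        idx[dim] = slice(x.size(dim) + offset, None)
    shifted[tuple(idx)] = 0
    return shifted

def local_shift(x, dim, patch=4, offset=1):
    """Apply the same shift independently inside non-overlapping patches of
    length `patch` along `dim`, so features only move within a patch."""
    size = x.size(dim)
    assert size % patch == 0, "sketch assumes the dimension divides evenly"
    # split the chosen dimension into (num_patches, patch)
    new_shape = list(x.shape)
    new_shape[dim:dim + 1] = [size // patch, patch]
    xp = x.reshape(new_shape)
    xp = torch.roll(xp, shifts=offset, dims=dim + 1)  # roll inside each patch
    return xp.reshape(x.shape)

# Toy usage: a batch of 2 clips, 16 channels, 8 frames, 56x56 resolution.
feat = torch.randn(2, 16, 8, 56, 56)
g = global_shift(feat, dim=2)            # holistic temporal shift
l = local_shift(feat, dim=3, patch=7)    # height-wise shift within 7-row patches
fused = g + l                            # placeholder for cross-scale aggregation
```

For brevity the sketch shifts the full tensor in one direction; shift-based designs such as the temporal shift module typically split the channels into groups and shift them in opposite directions, which keeps the operation parameter-free while still mixing information bidirectionally.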

Published In

Neural Computing and Applications, Volume 35, Issue 17
June 2023, 676 pages
ISSN: 0941-0643
EISSN: 1433-3058

Publisher

Springer-Verlag, Berlin, Heidelberg

Publication History

Published: 01 March 2023
Accepted: 13 February 2023
Received: 04 May 2022

Author Tags

  1. Sign language recognition
  2. Global–local representation
  3. Shift operation
  4. Cross-scale aggregation

Qualifiers

  • Research-article
