A multi-modal fusion framework for continuous sign language recognition based on multi-layer self-attention mechanism

Published: 01 January 2022

Abstract

Some existing continuous sign language recognition (CSLR) methods require an explicit alignment step. Alignment is time-consuming, breaks the continuity of the frame sequence, and adversely affects the subsequent stages of CSLR. In this paper, we propose a multi-modal network framework for CSLR based on a multi-layer self-attention mechanism. For the feature extraction stage, we propose a 3D convolution residual neural network (CR3D) and a multi-layer self-attention network (ML-SAN). The CR3D extracts short-term spatiotemporal features from the RGB and optical flow image streams, whereas the ML-SAN uses a bidirectional gated recurrent unit (BGRU) to model long-term sequence relationships and a multi-layer self-attention mechanism to learn the internal relationships within sign language sequences. For the performance optimization stage, we propose a cross-modal spatial mapping loss function, which improves the precision of CSLR by learning the spatial similarity between the video and text domains. Experiments were conducted on two test datasets: the RWTH-PHOENIX-Weather multi-signer dataset and a Chinese sign language (CSL) dataset. The results show that the proposed method obtains state-of-the-art recognition performance on the two datasets, with a word error rate (WER) of 24.4% and an accuracy of 14.42%, respectively.
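For readers unfamiliar with the WER metric reported above, the following is a minimal sketch of how it is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between the recognized sequence and the reference, divided by the reference length. The function name and the example token sequences are illustrative assumptions, not taken from the paper, and the paper's own evaluation script may differ in tokenization or normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up sequences: one substitution (b -> x) and one deletion (d)
# against a four-word reference.
print(word_error_rate("a b c d", "a x c"))
```

Lower is better for WER, and because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.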


            Published In

            Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology  Volume 43, Issue 4
            2022
            1429 pages

            Publisher

            IOS Press

            Netherlands

            Author Tags

            1. CR3D
            2. multi-modal fusion
            3. self-attention mechanism
            4. ML-SAN
            5. cross-modal spatial mapping
