A multi-modal fusion framework for continuous sign language recognition based on multi-layer self-attention mechanism

Authors: Cuihong Xue, Ming Yu, Gang Yan, Mengxian Qin, Yuehao Liu, Jingli Jia

Journal of Intelligent & Fuzzy Systems, Volume 43, Issue 4

Pages 4303 - 4316

https://doi.org/10.3233/JIFS-211697

Published: 01 January 2022

Abstract

Some existing continuous sign language recognition (CSLR) methods require alignment, which is time-consuming, breaks the continuity of the frame sequence, and affects the subsequent stages of CSLR. In this paper, we propose a multi-modal network framework for CSLR based on a multi-layer self-attention mechanism. For the feature extraction stage, we propose a 3D convolution residual neural network (CR3D) and a multi-layer self-attention network (ML-SAN). The CR3D extracts short-term spatiotemporal features from the RGB and optical flow image streams, whereas the ML-SAN uses a bidirectional gated recurrent unit (BGRU) to model long-term sequence relationships and a multi-layer self-attention mechanism to learn the internal relationships within sign language sequences. For the performance optimization stage, we propose a cross-modal spatial mapping loss function, which improves the precision of CSLR by learning the spatial similarity between the video and text domains. Experiments were conducted on two datasets: the RWTH-PHOENIX-Weather multi-signer dataset and a Chinese sign language (CSL) dataset. The results show that the proposed method achieves state-of-the-art recognition performance on both datasets, with a word error rate (WER) of 24.4% and an accuracy of 14.42%, respectively.
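The abstract reports results as a word error rate (WER). As a minimal sketch of how that metric is conventionally computed, assuming the standard definition (substitutions + deletions + insertions, divided by the reference length) rather than the authors' own evaluation script, the following Python function aligns a predicted gloss sequence against a reference with edit distance. The function name `word_error_rate` and the example glosses are hypothetical and are not taken from either dataset.

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Standard WER: (substitutions + deletions + insertions) / len(reference)."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and hypothesis[:j]; it starts as the distance from an empty prefix.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i] + [0] * len(hypothesis)
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference token missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra token in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(reference)


# Hypothetical gloss sequences for illustration only.
ref = ["MORGEN", "REGEN", "NORD", "WIND"]
hyp = ["MORGEN", "SONNE", "NORD"]
print(word_error_rate(ref, hyp))  # 0.5 (one substitution + one deletion over 4 tokens)
```

Under this definition a lower value is better, and the value can exceed 1.0 when the hypothesis contains many insertions.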

Cited By

Chen Y., Fang X., Liu Y., Zheng W., Kang P., Han N., Xie S. (2024), Two-Step Strategy for Domain Adaptation Retrieval, IEEE Transactions on Knowledge and Data Engineering 36(2), 897-912. Online publication date: 1-Feb-2024.
https://dl.acm.org/doi/10.1109/TKDE.2023.3289882

Index Terms

A multi-modal fusion framework for continuous sign language recognition based on multi-layer self-attention mechanism
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
      2. Computer vision tasks
        Scene understanding
    2. Natural language processing
  2. Machine learning
    1. Learning paradigms
    2. Machine learning approaches
      1. Neural networks

Index terms have been assigned to the content through auto-classification.

Information & Contributors

Information

Published In

Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology Volume 43, Issue 4

2022

1429 pages

ISSN:1064-1246

© 2022 – IOS Press. All rights reserved.

Publisher

IOS Press

Netherlands

Publication History

Published: 01 January 2022

Qualifiers

Research-article
