
Cross-modal knowledge distillation for continuous sign language recognition

Published: 21 November 2024

Abstract

Continuous Sign Language Recognition (CSLR) is the task of converting a sign language video into a gloss sequence. Existing deep-learning-based sign language recognition methods usually rely on large-scale training data and rich supervisory information. However, current sign language datasets are limited, and they are annotated only at the sentence level rather than the frame level. This inadequate supervision poses a serious challenge for sign language recognition and may leave recognition models insufficiently trained. To address these problems, we propose a cross-modal knowledge distillation method for continuous sign language recognition that contains two teacher models and one student model. The first teacher is the Sign2Text dialogue teacher model, which takes a sign language video and a dialogue sentence as input and outputs the sign language recognition result. The second teacher is the Text2Gloss translation teacher model, which aims to translate a text sentence into a gloss sequence. Both teacher models provide information-rich soft labels to assist the training of the student model, a general sign language recognition model. We conduct extensive experiments on several commonly used sign language datasets (PHOENIX 2014T, CSL-Daily and QSL); the results show that the proposed cross-modal knowledge distillation method effectively improves sign language recognition accuracy by transferring multi-modal information from the teacher models to the student model. Code is available at https://github.com/glq-1992/cross-modal-knowledge-distillation_new.
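The abstract describes teacher models that supply information-rich soft labels to train the student. The paper's exact losses are not given here, so the following is only a minimal sketch of the standard temperature-scaled soft-label distillation loss (in the style of Hinton et al.'s knowledge distillation) that such a teacher could provide per prediction step; all function names, parameter names, and default values (`T`, `alpha`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax: larger T yields a softer distribution.
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, T=2.0, alpha=0.5):
    """Blend a soft-label term (KL divergence between temperature-softened
    teacher and student distributions) with a hard-label cross-entropy term.
    The T**2 factor keeps the soft-label gradient scale comparable across T."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))   # soft-label term
    ce = -float(np.log(softmax(student_logits)[hard_label]))  # hard-label term
    return alpha * (T ** 2) * kl + (1 - alpha) * ce
```

With two teachers, as in the abstract, one plausible combination is to average the two teachers' soft-label terms before blending them with the hard-label (e.g. CTC or cross-entropy) objective of the student.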


Published In

Neural Networks, Volume 179, Issue C (November 2024), 1557 pages

Publisher

Elsevier Science Ltd., United Kingdom

Author Tags

  1. Sign language recognition
  2. Knowledge distillation
  3. Cross-modal
  4. Attention mechanism

Qualifiers

  • Research-article
