
Cross-modal knowledge distillation for continuous sign language recognition

Published: 21 November 2024

Abstract

Continuous Sign Language Recognition (CSLR) is the task of converting a sign language video into a gloss sequence. Existing deep-learning-based sign language recognition methods usually rely on large-scale training data and rich supervisory information. However, current sign language datasets are limited, and they are annotated only at the sentence level rather than the frame level. This inadequate supervision poses a serious challenge for sign language recognition and may leave recognition models insufficiently trained. To address these problems, we propose a cross-modal knowledge distillation method for continuous sign language recognition that contains two teacher models and one student model. The first teacher is the Sign2Text dialogue teacher model, which takes a sign language video and a dialogue sentence as input and outputs the sign language recognition result. The second teacher is the Text2Gloss translation teacher model, which aims to translate a text sentence into a gloss sequence. Both teacher models provide information-rich soft labels to assist the training of the student model, a general sign language recognition model. We conduct extensive experiments on several commonly used sign language datasets (PHOENIX 2014T, CSL-Daily and QSL); the results show that the proposed cross-modal knowledge distillation method effectively improves sign language recognition accuracy by transferring multi-modal information from the teacher models to the student model. Code is available at https://github.com/glq-1992/cross-modal-knowledge-distillation_new.
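The abstract describes teacher models that supply information-rich soft labels to train the student. The paper's exact losses are not given here, so the following is only a minimal sketch of the standard temperature-scaled soft-label distillation loss (in the style of Hinton et al.'s knowledge distillation) that such a teacher could provide per prediction step; all function names, parameter names, and default values (`T`, `alpha`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax: larger T yields a softer distribution.
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, T=2.0, alpha=0.5):
    """Blend a soft-label term (KL divergence between temperature-softened
    teacher and student distributions) with a hard-label cross-entropy term.
    The T**2 factor keeps the soft-label gradient scale comparable across T."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))   # soft-label term
    ce = -float(np.log(softmax(student_logits)[hard_label]))  # hard-label term
    return alpha * (T ** 2) * kl + (1 - alpha) * ce
```

With two teachers, as in the abstract, one plausible combination is to average the two teachers' soft-label terms before blending them with the hard-label (e.g. CTC or cross-entropy) objective of the student.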


Published In

Neural Networks, Volume 179, Issue C (November 2024), 1557 pages

Publisher

Elsevier Science Ltd., United Kingdom

Author Tags

  1. Sign language recognition
  2. Knowledge distillation
  3. Cross-modal
  4. Attention mechanism

Qualifiers

  • Research-article
