research-article

Skeleton-Aware Neural Sign Language Translation

Authors:

Sanglu Lu

MM '21: Proceedings of the 29th ACM International Conference on Multimedia

Pages 4353 - 4361

https://doi.org/10.1145/3474085.3475577

Published: 17 October 2021

Abstract

As an essential means of communication for deaf and hard-of-hearing people, sign languages are expressed through human actions. To distinguish human actions for sign language understanding, the skeleton, which encodes the position information of the human pose, provides an important cue, since different actions usually correspond to different poses/skeletons. However, skeletons have not been fully studied for Sign Language Translation (SLT), especially for end-to-end SLT. Therefore, in this paper, we propose a novel end-to-end Skeleton-Aware neural Network (SANet) for video-based SLT. Specifically, to achieve end-to-end SLT, we design a self-contained branch for skeleton extraction. To efficiently guide feature extraction from video with skeletons, we concatenate the skeleton channel and the RGB channels of each frame. To distinguish the importance of clips, we construct a skeleton-based Graph Convolutional Network (GCN) for feature scaling, i.e., assigning an importance weight to each clip. The scaled features of each clip are then sent to a decoder module to generate spoken language. In our SANet, a joint training strategy is designed to optimize skeleton extraction and sign language translation jointly. Experimental results on two large-scale SLT datasets demonstrate the effectiveness of our approach, which outperforms the state-of-the-art methods. Our code is available at https://github.com/SignLanguageCode/SANet.
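
The channel concatenation and clip-level feature scaling described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the tensor shapes, the single graph-convolution layer, and the sigmoid that turns a clip score into a weight are all assumptions made for illustration.

```python
import numpy as np


def concat_skeleton_channel(rgb, skeleton_map):
    """Stack a one-channel skeleton map onto RGB frames.

    rgb:          (T, H, W, 3) video frames (assumed layout)
    skeleton_map: (T, H, W)    per-frame skeleton channel (assumed layout)
    returns:      (T, H, W, 4) frames with the skeleton as a fourth channel
    """
    if rgb.shape[:3] != skeleton_map.shape:
        raise ValueError("rgb and skeleton_map must share (T, H, W)")
    return np.concatenate([rgb, skeleton_map[..., None]], axis=-1)


def normalized_adjacency(edges, num_joints):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A + I) D^-1/2."""
    a = np.eye(num_joints)
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def clip_weights(joints, adj, w):
    """One graph-convolution layer followed by pooling and a sigmoid.

    joints: (num_clips, J, C) skeleton features per clip (hypothetical input)
    adj:    (J, J)            normalized adjacency of the skeleton graph
    w:      (C, 1)            projection that would be learned in practice
    returns (num_clips,) importance weights strictly between 0 and 1
    """
    hidden = np.maximum(adj @ joints @ w, 0.0)  # ReLU(A X W), shape (num_clips, J, 1)
    score = hidden.mean(axis=(1, 2))            # average over joints, one score per clip
    return 1.0 / (1.0 + np.exp(-score))         # squash each score into (0, 1)


def scale_clip_features(features, weights):
    """Multiply each clip's feature vector by its importance weight."""
    return features * weights[:, None]
```

In this sketch the scaled clip features would then be passed to a sequence decoder. The decoder and the joint training objective are omitted because the abstract does not specify them.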

Index Terms

Skeleton-Aware Neural Sign Language Translation
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks
        Activity recognition and understanding

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia

October 2021

5796 pages

ISBN: 9781450386517

DOI: 10.1145/3474085

General Chairs:
Heng Tao Shen, University of Electronic Science and Technology of China, China
Yueting Zhuang, Zhejiang University, China
John R. Smith, IBM, USA

Program Chairs:
Yang Yang, University of Electronic Science and Technology of China, China
Pablo Cesar, CWI & TU Delft, The Netherlands
Florian Metze, Facebook, Inc., USA
Balakrishnan Prabhakaran, University of Texas at Dallas, USA

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Natural Science Foundation of Jiangsu Province
Collaborative Innovation Center of Novel Software Technology and Industrialization
National Natural Science Foundation of China
National Key R&D Program of China

Conference

MM '21

Sponsor:

SIGMM

MM '21: ACM Multimedia Conference

October 20 - 24, 2021

Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
