DOI: 10.1145/3474085.3475577
research-article

Skeleton-Aware Neural Sign Language Translation

Published: 17 October 2021

Abstract

As an essential means of communication for deaf and hard-of-hearing people, sign languages are expressed through human actions. To distinguish human actions for sign language understanding, the skeleton, which carries the position information of the human pose, provides an important cue, since different actions usually correspond to different poses and skeletons. However, skeletons have not been fully studied for Sign Language Translation (SLT), especially for end-to-end SLT. Therefore, in this paper, we propose a novel end-to-end Skeleton-Aware neural Network (SANet) for video-based SLT. Specifically, to achieve end-to-end SLT, we design a self-contained branch for skeleton extraction. To efficiently guide feature extraction from video with skeletons, we concatenate the skeleton channel with the RGB channels of each frame before feature extraction. To distinguish the importance of clips, we construct a skeleton-based Graph Convolutional Network (GCN) for feature scaling, i.e., assigning an importance weight to each clip. The scaled features of each clip are then sent to a decoder module to generate spoken language. In SANet, a joint training strategy optimizes skeleton extraction and sign language translation together. Experimental results on two large-scale SLT datasets demonstrate the effectiveness of our approach, which outperforms state-of-the-art methods. Our code is available at https://github.com/SignLanguageCode/SANet.
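
The two skeleton-related steps named in the abstract, channel concatenation and GCN-based clip weighting, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository linked above for that). The function names, tensor shapes, the single graph-convolution layer, the mean pooling, and the sigmoid used to produce a weight are all assumptions made for illustration.

```python
import numpy as np

def concat_skeleton_channel(rgb, skeleton_map):
    """Append a one-channel skeleton map to RGB frames (assumed layout).

    rgb: (T, H, W, 3), skeleton_map: (T, H, W) -> (T, H, W, 4)
    """
    return np.concatenate([rgb, skeleton_map[..., None]], axis=-1)

def normalized_adjacency(edges, num_joints):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} of the joint graph."""
    adj = np.eye(num_joints)
    for i, j in edges:
        adj[i, j] = adj[j, i] = 1.0
    d_inv_sqrt = np.diag(1.0 / np.sqrt(adj.sum(axis=1)))
    return d_inv_sqrt @ adj @ d_inv_sqrt

def clip_weights(joints, adj_hat, weight):
    """One graph-convolution layer, pooled to a scalar weight per clip.

    joints: (num_clips, num_joints, 2) joint coordinates per clip
    weight: (2, hidden) projection matrix (learned in practice, random here)
    Returns: (num_clips,) weights in (0, 1). Pooling and sigmoid are assumptions.
    """
    hidden = np.maximum(adj_hat @ joints @ weight, 0.0)  # ReLU(A_hat X W)
    score = hidden.mean(axis=(1, 2))
    return 1.0 / (1.0 + np.exp(-score))

def scale_clip_features(features, weights):
    """Scale each clip's feature vector by its importance weight."""
    return features * weights[:, None]
```

A hypothetical call would pass frames through `concat_skeleton_channel` before a feature extractor, then multiply the per-clip features by the output of `clip_weights` before decoding.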

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN:9781450386517
DOI:10.1145/3474085

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. neural network
  2. sign language translation
  3. skeleton

Qualifiers

  • Research-article

Conference

MM '21
Sponsor:
MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
