Abstract
Sign languages are expressed through hand and upper-body gestures as well as facial expressions; therefore, Sign Language Recognition (SLR) needs to attend to all of these cues. Previous work relies on hand-crafted mechanisms or network aggregation to extract features for the different cues, which improves SLR performance but is slow and requires complicated architectures. We propose a more straightforward approach: training separate cue models that specialize on the dominant hand, both hands, the face, and the upper body. We compare the performance of 3D Convolutional Neural Network (CNN) models specializing in these regions, combine them through score-level fusion, and also evaluate a weighted fusion alternative. Our experiments demonstrate the effectiveness of mixed convolutional models: their fusion yields up to a \(19\%\) accuracy improvement over the baseline that uses the full upper body. We also include a discussion of fusion settings that can inform future work on Sign Language Translation (SLT).
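The score-level fusion described in the abstract can be illustrated with a minimal sketch. The cue names, weights, and helper function below are illustrative assumptions, not the paper's exact configuration: each specialized 3D CNN is assumed to output softmax class scores for a clip, and the fused prediction is their (optionally weighted) average.

```python
import numpy as np

def fuse_scores(cue_scores, weights=None):
    """Fuse per-cue class scores by (weighted) averaging.

    cue_scores: dict mapping cue name -> np.ndarray of shape (num_classes,),
                assumed to be softmax probabilities from each cue model.
    weights:    optional dict of per-cue weights; uniform if omitted.
    """
    cues = list(cue_scores)
    if weights is None:
        weights = {c: 1.0 for c in cues}
    total = sum(weights[c] for c in cues)
    fused = sum(weights[c] * cue_scores[c] for c in cues) / total
    return fused

# Toy usage: random softmax-like scores for a 10-class problem.
rng = np.random.default_rng(0)
cues = ["dominant_hand", "hands", "face", "upper_body"]  # illustrative cue set
scores = {c: rng.dirichlet(np.ones(10)) for c in cues}
fused = fuse_scores(scores, weights={"dominant_hand": 2.0, "hands": 1.0,
                                     "face": 0.5, "upper_body": 1.0})
print("Predicted class:", int(np.argmax(fused)))
```

In the weighted variant, cues that carry more discriminative information (e.g., the dominant hand) can be given larger weights; the uniform case reduces to plain score averaging.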
Acknowledgement
This work has been supported by the TUBITAK Project No. 117E059 and TAM Project No. 2007K120610 under the Turkish Ministry of Development.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Gökçe, Ç., Özdemir, O., Kındıroğlu, A.A., Akarun, L. (2020). Score-Level Multi Cue Fusion for Sign Language Recognition. In: Bartoli, A., Fusiello, A. (eds) Computer Vision – ECCV 2020 Workshops. ECCV 2020. Lecture Notes in Computer Science, vol 12536. Springer, Cham. https://doi.org/10.1007/978-3-030-66096-3_21
DOI: https://doi.org/10.1007/978-3-030-66096-3_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-66095-6
Online ISBN: 978-3-030-66096-3
eBook Packages: Computer Science (R0)