Abstract
Sign Language Recognition (SLR) has become one of the most important research areas in the field of human-computer interaction. SLR systems are meant to automatically translate sign language into text or speech, in order to reduce the communication gap between deaf and hearing people. The aim of this paper is to exploit multimodal learning techniques for accurate SLR, making use of data provided by the Kinect and Leap Motion sensors. To this end, single-modality approaches as well as different multimodal methods, mainly based on convolutional neural networks, are proposed. Our main contribution is a novel multimodal end-to-end neural network that explicitly models private feature representations that are specific to each modality and shared feature representations that are similar between modalities. The underlying idea of imposing such regularization in the learning process is to increase the discriminative ability of the learned features and, hence, improve the generalization capability of the model. Experimental results demonstrate that multimodal learning yields an overall improvement in sign recognition performance. In particular, the novel neural network architecture outperforms the current state-of-the-art methods for the SLR task.
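The private/shared regularization described above can be illustrated with a minimal sketch. The abstract does not state the loss terms, so everything below is an assumption: the function names, the weights `alpha` and `beta`, and the two penalties (a squared distance pulling the shared features of the two modalities together, and a squared Frobenius norm pushing each modality's private features to be orthogonal to its shared features, in the style of domain separation networks). It is not the authors' implementation.

```python
import numpy as np

def similarity_loss(shared_a, shared_b):
    """Mean squared Euclidean distance between the shared features of
    two modalities (rows are samples, columns are feature dimensions)."""
    return float(np.mean(np.sum((shared_a - shared_b) ** 2, axis=1)))

def difference_loss(shared, private):
    """Squared Frobenius norm of shared^T @ private. It is zero when every
    shared dimension is orthogonal to every private dimension over the batch."""
    return float(np.sum((shared.T @ private) ** 2))

def private_shared_regularizer(shared_a, private_a, shared_b, private_b,
                               alpha=1.0, beta=1.0):
    """Hypothetical combined penalty: alpha and beta are illustrative
    weights, not values taken from the paper."""
    sim = similarity_loss(shared_a, shared_b)
    diff = (difference_loss(shared_a, private_a)
            + difference_loss(shared_b, private_b))
    return alpha * sim + beta * diff

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for features produced by a Kinect branch (a) and a
    # Leap Motion branch (b): 8 samples, 16 feature dimensions each.
    shared_a, private_a = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
    shared_b, private_b = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
    print(private_shared_regularizer(shared_a, private_a, shared_b, private_b))
```

In a setup of this kind, the penalty would be added to the classification loss during training, so that the network is rewarded both for agreeing across modalities in the shared subspace and for keeping modality-specific information in the private one.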
Acknowledgements
This work was funded by the Project “NanoSTIMA: Macro-to-Nano Human Sensing: Towards Integrated Multimodal Health Monitoring and Analytics/NORTE-01-0145-FEDER-000016” financed by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and through the European Regional Development Fund (ERDF), and also by Fundação para a Ciência e a Tecnologia (FCT) within PhD and BPD grants with numbers SFRH/BD/102177/2014 and SFRH/BPD/101439/2014.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Ferreira, P.M., Cardoso, J.S. & Rebelo, A. On the role of multimodal learning in the recognition of sign language. Multimed Tools Appl 78, 10035–10056 (2019). https://doi.org/10.1007/s11042-018-6565-5
DOI: https://doi.org/10.1007/s11042-018-6565-5