Int. Worksh. on “Photogrammetric & Computer Vision Techniques for Video Surveillance, Biometrics and Biomedicine”, 13–15 May 2019, Moscow, Russia
St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, SPIIRAS, Saint-Petersburg,
Russian Federation – denis.ivanko11@gmail.com, dl_03.03.1991@mail.ru, karpov@iias.spb.su
Commission II WG II/5
KEY WORDS: Lip-reading, hearing impaired people, region-of-interest detection, visual speech recognition
ABSTRACT:
Inability to use speech interfaces greatly limits deaf and hearing-impaired people in human-machine interaction. To address this problem and to increase the accuracy and reliability of an automatic Russian sign language recognition system, we propose to use lip-reading in addition to hand gesture recognition. Deaf and hearing-impaired people use sign language as the main way of communication in everyday life. Sign language is a structured form of hand gestures and lip movements involving visual motions and signs, which is used as a communication system. Since sign language includes not only hand gestures but also lip movements that mimic vocalized pronunciation, it is of interest to investigate how accurately such visual speech can be recognized by a lip-reading system, especially considering that the visual speech of hearing-impaired people is often characterized by hyper-articulation, which should potentially facilitate its recognition. For this purpose, the thesaurus of Russian sign language (TheRuSLan), collected at SPIIRAS in 2018-19, was used. The database consists of color optical FullHD video recordings of 13 native Russian sign language signers (11 females and 2 males) from the “Pavlovsk boarding school for the hearing impaired”. Each signer demonstrated 164 phrases five times. This work covers the initial stages of this research, including data collection, data labeling, region-of-interest detection, and methods for informative feature extraction. The results of this study can later be used to create assistive technologies for deaf and hearing-impaired people.
Sign language is a structured form of hand gestures involving visual motions and signs, which is used as a communication system. SL recognition includes the whole process of tracking and identifying the signs performed and converting them into semantically meaningful words and expressions (Cheok et al., 2017). The majority of sign languages involve only the upper part of the body, from waist level upwards. Besides, the same sign can have

The focus of this research is on improving the accuracy and robustness of automatic Russian sign language recognition by adding a lip-reading module to the hand gesture recognition system. This paper covers the initial stages of this research, including data collection, data labeling, region-of-interest detection, and methods for informative feature extraction.
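The paper does not commit to a particular fusion scheme at this point; purely to make the two-module architecture concrete, the following is a minimal sketch of weighted late fusion of per-class scores from a hypothetical hand-gesture recognizer and a hypothetical lip-reading module. Both recognizers and the fusion weight are illustrative placeholders.

# Hedged sketch: weighted late fusion of two recognition modules.
# The recognizers and the weight value are hypothetical placeholders,
# not the fusion scheme actually proposed in this paper.
def fuse_scores(gesture_scores, lip_scores, lip_weight=0.3):
    """Combine per-class scores from the two modules and pick the best class.

    gesture_scores, lip_scores: {class_label: score}, higher is better.
    """
    labels = set(gesture_scores) | set(lip_scores)
    fused = {c: (1.0 - lip_weight) * gesture_scores.get(c, 0.0)
                + lip_weight * lip_scores.get(c, 0.0)
             for c in labels}
    return max(fused, key=fused.get)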
2.1 TheRuSLan
3. PROPOSED METHOD
3.3 Challenges
4. CONCLUSIONS AND FUTURE WORK

The paper discusses the possibility of using an additional modality (lip-reading) to improve the accuracy and robustness of an automatic Russian sign language recognition system. Such studies have not been conducted before, and it is of great practical interest to investigate how well the visual speech of hearing-impaired people can be recognized. The present study is the first step in this direction and covers the following stages: data collection, data labeling, region-of-interest detection, and methods for informative feature extraction.
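To make the region-of-interest detection stage concrete, the following is a minimal sketch of mouth ROI extraction built on the dlib toolkit's frontal face detector and its publicly available 68-point facial landmark model; the model file name and the padding parameter are illustrative assumptions, not details fixed by this paper.

# Minimal sketch of mouth region-of-interest (ROI) detection with dlib.
# Assumes the standard 68-point landmark model file is available locally;
# in that scheme, landmark indices 48-67 outline the mouth.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def mouth_roi(frame, pad=10):
    """Return the cropped mouth region of the first detected face, or None."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    points = [(shape.part(i).x, shape.part(i).y) for i in range(48, 68)]
    xs, ys = zip(*points)
    x0, y0 = max(min(xs) - pad, 0), max(min(ys) - pad, 0)
    x1, y1 = max(xs) + pad, max(ys) + pad
    return frame[y0:y1, x0:x1]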
We also proposed a method for the parametric representation of visual speech that adapts to the overlapping of hand and mouth regions. The feature vectors obtained with this method will later be used to train the lip-reading system. Due to its versatility, the method can also be applied to other tasks in biometrics, computer vision, etc.
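The exact parameterization is not restated here; purely as a hedged illustration of what a geometry-based representation with occlusion handling can look like, the sketch below computes simple mouth-shape features and appends a flag marking frames where a detected hand bounding box overlaps the mouth region, so that training can treat occluded frames separately. The feature set and the overlap test are assumptions for illustration, not the proposed method itself.

# Hedged sketch: geometry-based lip features plus a hand-overlap flag.
# The feature set (width, height, aspect ratio) and the box-intersection
# test are illustrative assumptions, not the paper's actual parameterization.
def lip_features(landmarks):
    """landmarks: list of 20 (x, y) mouth points (dlib indices 48-67)."""
    xs = [p[0] for p in landmarks]
    ys = [p[1] for p in landmarks]
    width = max(xs) - min(xs)    # horizontal mouth extent
    height = max(ys) - min(ys)   # vertical mouth extent
    aspect = height / width if width else 0.0
    return [width, height, aspect]

def boxes_overlap(a, b):
    """a, b: (x0, y0, x1, y1) boxes; True if they intersect."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def frame_vector(mouth_points, mouth_box, hand_box):
    """Feature vector with a trailing occlusion flag (1.0 = hand over mouth)."""
    occluded = hand_box is not None and boxes_overlap(mouth_box, hand_box)
    return lip_features(mouth_points) + [1.0 if occluded else 0.0]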
In further research, we are going to use statistical approaches, e.g. based on Hidden Markov Models or deep neural networks, to develop a robust and accurate Russian sign language recognition system. The results of this study can later be used to create assistive technologies for deaf and hearing-impaired people.
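As one concrete instance of such statistical approaches, the sketch below trains one Gaussian HMM per phrase class on sequences of per-frame visual feature vectors and classifies a test sequence by maximum log-likelihood; the hmmlearn library and all hyperparameters are illustrative assumptions rather than choices made in this work.

# Hedged sketch: one Gaussian HMM per phrase class, maximum-likelihood decoding.
# hmmlearn and the chosen hyperparameters (5 states, diagonal covariance)
# are illustrative assumptions, not choices made by the paper.
import numpy as np
from hmmlearn import hmm

def train_models(sequences_by_class, n_states=5):
    """sequences_by_class: {label: [np.ndarray of shape (T_i, n_features)]}."""
    models = {}
    for label, seqs in sequences_by_class.items():
        X = np.concatenate(seqs)            # hmmlearn expects stacked frames
        lengths = [len(s) for s in seqs]    # plus per-sequence lengths
        model = hmm.GaussianHMM(n_components=n_states,
                                covariance_type="diag", n_iter=100)
        model.fit(X, lengths)
        models[label] = model
    return models

def classify(models, seq):
    """Return the label whose HMM assigns the sequence the highest likelihood."""
    return max(models, key=lambda label: models[label].score(seq))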
ACKNOWLEDGEMENTS

This research is financially supported by the Ministry of Science and Higher Education of the Russian Federation, agreement No. 14.616.21.0095 (reference RFMEFI61618X0095).