Sign Language Recognition System Using TensorFlow
1 Introduction
Communication can be defined as the act of transferring information from one place,
person, or group to another. It consists of three components: the speaker, the message
that is being communicated, and the listener. It can be considered successful only when
whatever message the speaker is trying to convey is received and understood by the
listener. It can be divided into different categories as follows [1]: formal and informal
communication, oral (face-to-face and distance) and written communication, non-verbal, grapevine, feedback, and visual communication, and active listening. The
------------------------------------------------------
This is a pre-print of an accepted article for publication in proceedings of
International Conference on Advanced Network Technologies and Intelligent
Computing (ANTIC-2021), part of the book series ‘Communications in Computer
and Information Science (CCIS)’, Springer. The final authenticated version is
available online at: https://doi.org/10.1007/978-3-030-96040-7_48
2 Related work
Sign languages are defined as organized collections of hand gestures with specific meanings, employed by hearing-impaired people to communicate in everyday life [3]. Being visual languages, they use the movements of the hands, face, and body as communication mediums. There are over 300 different sign languages in use around the world [5]. Despite this variety, the percentage of the population that knows any of them is low, which makes it difficult for specially-abled people to communicate freely with everyone. Sign language recognition (SLR) provides a means to communicate in sign language without knowing it: it recognizes a gesture and translates it into a commonly spoken language like English.
SLR is a vast research topic in which a lot of work has been done, but various issues still need to be addressed. Machine learning techniques allow electronic systems to take decisions based on experience, i.e., data. Classification algorithms need two datasets – a training dataset and a testing dataset. The training set provides experience to the classifier and the model is tested using the testing set [6]. Many authors have developed efficient data acquisition and classification methods [3][7].
Based on the data acquisition method, previous work can be categorized into two approaches: the direct measurement methods and the vision-based approaches [3]. The direct measurement methods are based on motion data gloves, motion capturing systems, or sensors. The extracted motion data can supply accurate tracking of fingers, hands, and other body parts, which leads to the development of robust SLR methodologies.
The vision-based SLR approaches rely on the extraction of discriminative spatial and temporal features from RGB images. Most of the vision-based methods initially try to track and extract the hand regions before classifying them into gestures [3]. Hand detection is often achieved by semantic segmentation and skin colour detection, as skin colour is usually easy to distinguish [8][9]. However, because other body parts such as the face and arms can be mistakenly recognized as hands, recent hand detection methods also use face detection and subtraction, and background subtraction, to recognize only the moving parts in a scene [10][11]. To attain accurate and robust hand tracking, particularly in cases of obstruction, authors have employed filtering techniques, for example, Kalman and particle filters [10][12].
For data acquisition by either the direct measurement or the vision-based approaches, different devices need to be used. The primary input device employed in an SLR system is the camera [13]. Other devices are also used for input, such as the Microsoft Kinect, which provides a colour video stream and a depth video stream together; the depth data helps in background segmentation. Apart from these devices, other means of acquiring data are accelerometers and sensory gloves. Another system used for data acquisition is the Leap Motion Controller (LMC) [14][15] – a touchless controller developed by the technology company “Leap Motion”, now called “Ultraleap”, based in San Francisco. It operates at approximately 200 frames per second and can detect and track hands, fingers, and finger-like objects. Most researchers collect their training dataset by recording it from their own signers, as finding a sign language dataset is a problem [2].
Different processing methods have been used for creating an SLR system
[16][17][18]. Hidden Markov Model (HMM) has been widely used in SLR [12]. The
HMM variants that have been used include the Multi-Stream HMM (MSHMM), which is based on two standard single-stream HMMs, the Light-HMM, and the Tied-Mixture Density HMM [2]. Other processing models that have been used are neural networks [19][20][21][22][23], ANN [24], the Naïve Bayes Classifier (NBC) and Multilayer Perceptron (MLP) [14], the unsupervised neural network Self-Organizing Map (SOM) [25], the Self-Organizing Feature Map (SOFM), the Simple Recurrent Network (SRN) [26], the Support Vector Machine (SVM) [27], and a 3D convolutional residual network [28]. Researchers have also used self-designed methods like the wavelet-based method [29] and Eigen Value weighted Euclidean Distance [30].
The use of different processing methods and application systems has given different accuracy results: the Light-HMM gave 83.6% accuracy, the MSHMM 86.7%, SVM 97.5%, the Eigen Value method 97%, and the wavelet family method 100% [2][31][22][32]. Although different models have given high accuracy results, accuracy does not depend only on the processing model used; it also depends on factors such as the size of the dataset, the clarity of its images (which in turn depends on the data acquisition methods), the devices used, etc.
There are two types of SLR systems – isolated SLR and continuous SLR. In isolated SLR, the system is trained to recognize a single gesture at a time; each image is labelled to represent an alphabet, a digit, or some special gesture. Continuous SLR differs from isolated gesture classification: in continuous SLR, the system is able to recognize and translate whole sentences instead of a single gesture [33][34].
Even with all the research that has been done in SLR, many inadequacies need to be
dealt with by further research. Some of the issues and challenges that need to be worked
on are as follows [33][2][4][6].
• Isolated SLR methods require strenuous labeling for each word.
• Continuous SLR methods use isolated SLR systems as building blocks, with temporal segmentation as pre-processing, which is non-trivial and unavoidably propagates errors into subsequent steps, and sentence synthesis as post-processing.
• Devices needed for data acquisition are costly; a cheap method is needed for SLR systems to be commercialized.
• A web camera is an alternative to a higher-specification camera, but its images are blurred, so quality is compromised.
• Data acquisition by sensors also has some issues, e.g., noise, bad human manipulation, bad ground connection, etc.
• Vision-based methodologies introduce inaccuracies due to overlapping of hands and fingers.
• Large datasets are not available.
• There are misconceptions about sign languages, e.g., that sign language is the same around the world, whereas sign language is based upon the spoken language of its region.
• Indian Sign Language is communicated using hand gestures made by a single hand as well as by both hands, due to which there are two types of gestures representing the same thing.
In this paper, the dataset that will be used is created using Python and OpenCV with
the help of a webcam. The SLR system that is being developed is a real-time detection
system.
3 Data acquisition
A real-time sign language detection system is being developed for Indian Sign Language. For data acquisition, images are captured by webcam using Python and OpenCV. OpenCV provides functions which are primarily aimed at real-time computer vision. It accelerates the use of machine perception in commercial products and provides a common infrastructure for computer vision-based applications. The
OpenCV library has more than 2500 efficient computer vision and machine learning
algorithms which can be used for face detection and recognition, object identification,
classification of human actions, tracking camera and object movements, extracting 3D
object models, and many more [35].
The created dataset is made up of signs representing alphabets in Indian Sign Language [36] as shown in Fig. 1. For every alphabet, 25 images are captured to make up the dataset. The images are captured every 2 seconds, providing time to record the gesture with a slight difference each time, and a break of five seconds is given between two individual signs, i.e., a five-second interval is provided to change from the sign of one alphabet to the sign of a different alphabet. The captured images are stored in their respective folders.
For data acquisition, dependencies like cv2, i.e., OpenCV, os, time, and uuid have been imported. The os dependency is used to work with file paths; it is one of Python's standard utility modules and provides functions for interacting with the operating system. With the help of the time module in Python, time can be represented in multiple ways in code, such as objects, numbers, and strings. Apart from representing time, it can be used to measure code efficiency or to wait during code execution; here, it is used to add breaks between image captures in order to provide time for hand movements. The uuid library is used in naming the image files; it generates random 128-bit ids whose uniqueness comes from being derived from the time and the computer hardware.
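To make the capture step concrete, a minimal sketch of such a script is shown below. It follows the procedure described above (25 images per sign, a 2-second pause between captures, and a 5-second break between signs); the folder path and window name are assumptions for illustration, not the exact ones used in this work.

```python
# Sketch of the image-capture step (folder path and window name are assumptions).
import os
import time
import uuid

import cv2  # OpenCV

IMAGES_PATH = os.path.join('Tensorflow', 'workspace', 'images', 'collectedimages')  # hypothetical path
labels = [chr(c) for c in range(ord('A'), ord('Z') + 1)]  # the 26 ISL alphabet signs
number_imgs = 25  # images captured per sign

cap = cv2.VideoCapture(0)  # default webcam
for label in labels:
    os.makedirs(os.path.join(IMAGES_PATH, label), exist_ok=True)
    print(f'Collecting images for {label}')
    time.sleep(5)  # five-second break to change to the next sign
    for img_num in range(number_imgs):
        ret, frame = cap.read()
        if not ret:
            continue
        # uuid gives each file a unique name derived from time and hardware
        img_name = os.path.join(IMAGES_PATH, label, f'{label}.{uuid.uuid1()}.jpg')
        cv2.imwrite(img_name, frame)
        cv2.imshow('frame', frame)
        time.sleep(2)  # two seconds to vary the gesture slightly
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
cap.release()
cv2.destroyAllWindows()
```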
Once all the images have been captured, they are then one by one labelled using the
LabelImg package. LabelImg is a free open-source tool for graphically labelling im-
ages. The hand gesture portion of the image is labelled by what the gesture in the box
or the sign represents as shown in Fig. 2 and Fig. 3. On saving the labelled image, its
XML file is created. The XML files have all the details of the images including the
detail of the labelled portion. After labelling all the images, their XML files are available; these are later used for creating the TF (TensorFlow) records. All the images along with
their XML files are then divided into training data and validation data in the ratio of
80:20. From 25 images of an alphabet, 20 (80%) of them were taken and stored as a
training dataset and the remaining 5 (20%) were taken and stored as validation dataset.
This task was performed for all the images of all 26 alphabets.
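As an illustration, the 80:20 split can be scripted in a few lines. The sketch below assumes each image sits next to a LabelImg XML file with the same base name, and the folder names are assumptions rather than the exact layout used here.

```python
# Sketch of the 80:20 train/validation split (folder names are assumptions).
import os
import random
import shutil

SOURCE = os.path.join('Tensorflow', 'workspace', 'images', 'collectedimages')
TRAIN = os.path.join('Tensorflow', 'workspace', 'images', 'train')
TEST = os.path.join('Tensorflow', 'workspace', 'images', 'test')

for split_dir in (TRAIN, TEST):
    os.makedirs(split_dir, exist_ok=True)

for label in sorted(os.listdir(SOURCE)):
    images = [f for f in os.listdir(os.path.join(SOURCE, label)) if f.endswith('.jpg')]
    random.shuffle(images)
    cut = int(0.8 * len(images))            # 20 of 25 images go to training
    for i, img in enumerate(images):
        dest = TRAIN if i < cut else TEST   # remaining 5 go to validation
        for f in (img, img.replace('.jpg', '.xml')):  # copy the image and its LabelImg XML
            shutil.copy(os.path.join(SOURCE, label, f), os.path.join(dest, f))
```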
4 Methodology
The proposed system is designed to develop a real-time sign language detector using the TensorFlow object detection API and to train it through transfer learning on the created dataset [37]. For data acquisition, images are captured by a webcam using Python and
OpenCV following the procedure described under Section 3.
Following the data acquisition, a label map is created which is a representation of all the objects within the model, i.e., it contains the label of each sign (alphabet) along with its id. The label map contains 26 labels, each one representing an alphabet, and each label has been assigned a unique id ranging from 1 to 26. This is used as a reference to look up the class names. TF records of the training data and the testing data are then created using generate_tfrecord; these are used to train the TensorFlow object detection API. TFRecord is the binary storage format of TensorFlow. Using binary files for data storage significantly improves the performance of the import pipeline and, consequently, the training time of the model: the data takes less space on disk, copies fast, and can be read from disk efficiently.
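A minimal sketch of the label map creation is given below; the file path is an assumption, and the generate_tfrecord invocation is shown only as a comment since its exact flags depend on which variant of that script is used.

```python
# Sketch of label map creation for the 26 alphabet signs (file path is an assumption).
import string

LABEL_MAP_PATH = 'Tensorflow/workspace/annotations/label_map.pbtxt'

with open(LABEL_MAP_PATH, 'w') as f:
    for idx, letter in enumerate(string.ascii_uppercase, start=1):  # ids 1..26
        f.write('item {\n')
        f.write(f"    name: '{letter}'\n")
        f.write(f'    id: {idx}\n')
        f.write('}\n')

# TF records are then generated from the labelled images with a generate_tfrecord
# script, for example (flags vary between versions of the script):
#   python generate_tfrecord.py -x images/train -l label_map.pbtxt -o annotations/train.record
#   python generate_tfrecord.py -x images/test  -l label_map.pbtxt -o annotations/test.record
```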
The TensorFlow object detection API is an open-source framework that makes it easy to develop, train, and deploy an object detection model. It provides the TensorFlow detection model zoo, a collection of detection models pre-trained on the COCO 2017 dataset. The pre-trained model used here is SSD MobileNet v2 320x320. The SSD MobileNet v2 object detection model is combined with the FPN-lite feature extractor, shared box predictor, and focal loss, with training images scaled to 320x320. The pipeline configuration, i.e., the configuration of the pre-trained model, is set up and then updated for transfer learning so that the model can be trained on the created dataset. For configuration, dependencies like TensorFlow, config_util, pipeline_pb2, and text_format have been imported. The major update is to change the number of classes, which is initially 90, to 26, the number of signs (alphabets) that the model will be trained on.
After setting up and updating the configuration, the model was trained for 10,000 steps; the number of training steps was the main hyper-parameter set for the run. During training, the model incurs classification loss, regularization loss, and localization loss. The localization loss is the mismatch between the predicted bounding box corrections and the true values; it is a smooth L1 loss between the predicted box and the encoded ground truth box [38], given in Eq. (1) – (5):

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\left(l_i^{m} - \hat{g}_j^{m}\right) \tag{1}$$

$$\hat{g}_j^{cx} = (g_j^{cx} - d_i^{cx}) / d_i^{w} \tag{2}$$

$$\hat{g}_j^{cy} = (g_j^{cy} - d_i^{cy}) / d_i^{h} \tag{3}$$

$$\hat{g}_j^{w} = \log\left(g_j^{w} / d_i^{w}\right) \tag{4}$$

$$\hat{g}_j^{h} = \log\left(g_j^{h} / d_i^{h}\right) \tag{5}$$

where $N$ is the number of matched default boxes, $l$ is the predicted bounding box, $g$ is the ground truth bounding box, $\hat{g}$ is the ground truth bounding box encoded with respect to the default box $d$ as in Eq. (2) – (5), and $x_{ij}^{k}$ is the matching indicator between default box $i$ and ground truth box $j$ of category $k$.
The classification loss is defined as the softmax loss over multiple classes. The formula of the classification loss [38] is given in Eq. (6):

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_i^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_i^{0}\right) \tag{6}$$

where $\hat{c}_i^{p} = \exp(c_i^{p}) / \sum_{p} \exp(c_i^{p})$ is the softmax-activated class score for default box $i$ with category $p$, and $x_{ij}^{p}$ is the matching indicator between default box $i$ and ground truth box $j$ of category $p$.
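As a toy illustration of these two terms, the short snippet below evaluates the softmax-activated scores of Eq. (6) and a smooth L1 localization term of Eq. (1) for a single matched default box; all numbers are invented purely for illustration.

```python
# Toy computation of the SSD loss terms for one matched default box (all numbers invented).
import numpy as np

# Classification term (Eq. 6): raw class scores c_i, truncated to 4 classes for brevity.
c_i = np.array([0.2, 2.5, 0.3, 0.1])
c_hat = np.exp(c_i) / np.exp(c_i).sum()   # softmax-activated class scores
conf_loss = -np.log(c_hat[1])             # the box is matched to category 1

# Localization term (Eq. 1): smooth L1 between predicted offsets l and encoded ground truth g_hat.
l = np.array([0.10, -0.05, 0.20, 0.00])       # predicted (cx, cy, w, h) corrections
g_hat = np.array([0.12, -0.02, 0.25, 0.03])   # encoded ground truth (Eq. 2 - 5)
diff = np.abs(l - g_hat)
loc_loss = np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).sum()

print(f'confidence loss = {conf_loss:.3f}, localization loss = {loc_loss:.4f}')
```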
The different losses incurred during the experimentation are reported in the subsequent section. After training, the model is restored from the latest checkpoint created during training, which completes the model and makes it ready for real-time sign language detection.
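A compressed sketch of the configuration update and checkpoint restore described above is given below. It follows the usual TensorFlow Object Detection API workflow; the directory paths and model folder name are assumptions rather than the exact ones used in this work.

```python
# Sketch of the pipeline-config update and checkpoint restore (paths are assumptions).
import tensorflow as tf
from google.protobuf import text_format
from object_detection.protos import pipeline_pb2
from object_detection.utils import config_util
from object_detection.builders import model_builder

CONFIG_PATH = 'Tensorflow/workspace/models/my_ssd_mobnet/pipeline.config'
CHECKPOINT_DIR = 'Tensorflow/workspace/models/my_ssd_mobnet'

# 1. Read the SSD MobileNet v2 pipeline config and adapt it to the 26 alphabet classes.
pipeline_config = pipeline_pb2.TrainEvalPipelineConfig()
with tf.io.gfile.GFile(CONFIG_PATH, 'r') as f:
    text_format.Merge(f.read(), pipeline_config)
pipeline_config.model.ssd.num_classes = 26          # originally 90 (COCO classes)
with tf.io.gfile.GFile(CONFIG_PATH, 'w') as f:
    f.write(text_format.MessageToString(pipeline_config))

# 2. Training itself is run with the API's training script for 10,000 steps, e.g.
#    python model_main_tf2.py --pipeline_config_path=... --model_dir=... --num_train_steps=10000

# 3. After training, rebuild the model and restore the latest checkpoint for inference.
configs = config_util.get_configs_from_pipeline_file(CONFIG_PATH)
detection_model = model_builder.build(model_config=configs['model'], is_training=False)
ckpt = tf.train.Checkpoint(model=detection_model)
ckpt.restore(tf.train.latest_checkpoint(CHECKPOINT_DIR)).expect_partial()

@tf.function
def detect_fn(image):
    """Run the restored SSD model on a batched image tensor."""
    image, shapes = detection_model.preprocess(image)
    prediction_dict = detection_model.predict(image, shapes)
    return detection_model.postprocess(prediction_dict, shapes)
```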
Real-time detection is again done with OpenCV and the webcam. For real-time detection, the cv2 and NumPy dependencies are used. The system detects signs in real time and translates what each gesture means into English, as shown in Fig. 5. The system is tested in real time by showing it different signs. The confidence rate of each sign (alphabet), i.e., how confident the system is in recognizing that sign, is checked, noted, and tabulated for the results.
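Assuming the detect_fn and label map from the previous sketches, the real-time loop can be sketched as follows; the window name and the 0.5 score threshold are illustrative choices rather than values stated in the text.

```python
# Sketch of the real-time detection loop (threshold and window name are illustrative).
import cv2
import numpy as np
import tensorflow as tf
from object_detection.utils import label_map_util, visualization_utils as viz_utils

category_index = label_map_util.create_category_index_from_labelmap(
    'Tensorflow/workspace/annotations/label_map.pbtxt')  # assumed path

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    image_np = np.array(frame)
    input_tensor = tf.convert_to_tensor(np.expand_dims(image_np, 0), dtype=tf.float32)
    detections = detect_fn(input_tensor)  # detect_fn from the previous sketch

    num = int(detections.pop('num_detections'))
    detections = {k: v[0, :num].numpy() for k, v in detections.items()}
    detections['detection_classes'] = detections['detection_classes'].astype(np.int64)

    image_with_boxes = image_np.copy()
    viz_utils.visualize_boxes_and_labels_on_image_array(
        image_with_boxes,
        detections['detection_boxes'],
        detections['detection_classes'] + 1,  # label map ids start at 1
        detections['detection_scores'],
        category_index,
        use_normalized_coordinates=True,
        min_score_thresh=0.5)  # show a sign only when confidence exceeds 50%

    cv2.imshow('Sign language detection', image_with_boxes)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
```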
5 Experimental Evaluation
The dataset is created for Indian Sign Language where signs are alphabets of the Eng-
lish language. The dataset is created following the data acquisition method described
in Section 3.
The experimentation was carried out on a system with an Intel i5 7th generation 2.70
GHz processor, 8 GB memory and webcam (HP TrueVision HD camera with 0.31 MP
and 640x480 resolution), running Windows 10 operating system. The programming
environment includes Python (version 3.7.3), Jupyter Notebook, OpenCV (version
4.2.0), TensorFlow Object Detection API.
The developed system is able to detect Indian Sign Language alphabets in real-time.
The system has been created using TensorFlow object detection API. The pre-trained
model that has been taken from the TensorFlow model zoo is SSD MobileNet v2
320x320. It has been trained using transfer learning on the created dataset which con-
tains 650 images in total, 25 images for each alphabet.
The total loss incurred during the last part of the training, at 10,000 steps, was 0.25; the localization loss was 0.18, the classification loss was 0.13, and the regularization loss was 0.10, as shown in Fig. 4. Fig. 4 also shows that the lowest loss of 0.17 occurred at step 9,900.
The result of the system is based on the confidence rate; the average confidence rate of the system is 85.45%. For each alphabet, the confidence rate is recorded and tabulated in Table 1. The confidence rate of the system can be increased by enlarging the dataset, which would boost the recognition ability of the system and thus improve its results.
Table 1. Confidence rate recorded for each alphabet (A–Z).
The state-of-the-art method for Indian Sign Language recognition achieves 93–96% accuracy [4]. Though highly accurate, it is not a real-time SLR system; this issue is addressed in this paper. In spite of the dataset being small, our system has achieved an average confidence rate of 85.45%.
6 Conclusion
Sign languages are visual languages that employ movements of the hands and body and facial expressions as a means of communication. They are important for specially-abled people, giving them a means to communicate and to express and share their feelings with others. The drawback is that not everyone possesses knowledge of sign languages, which limits communication. This limitation can be overcome by automated Sign Language Recognition systems that can translate sign language gestures into a commonly spoken language. In this paper, this has been done using the TensorFlow object detection API. The system has been trained on an Indian Sign Language alphabet dataset and detects sign language in real time. For data acquisition, images have been captured by a webcam using Python and OpenCV, which keeps the cost low. The developed system shows an average confidence rate of 85.45%. Though the system has achieved a high average confidence rate, the dataset it has been trained on is small and limited.
In the future, the dataset can be enlarged so that the system can recognize more ges-
tures. The TensorFlow model that has been used can be interchanged with another
model as well. The system can be implemented for different sign languages by changing
the dataset.
References
(2019). https://doi.org/10.1109/BIGDATA.2018.8622141.
21. Hore, S., Chatterjee, S., Santhi, V., Dey, N., Ashour, A.S., Balas, V.E., Shi, F.: Indian Sign
Language Recognition Using Optimized Neural Networks. Adv. Intell. Syst. Comput. 455,
553–563 (2017). https://doi.org/10.1007/978-3-319-38771-0_54.
22. Kumar, P., Roy, P.P., Dogra, D.P.: Independent Bayesian classifier combination based sign
language recognition using facial expression. Inf. Sci. (Ny). 428, 30–48 (2018).
https://doi.org/10.1016/J.INS.2017.10.046.
23. Sharma, A., Sharma, N., Saxena, Y., Singh, A., Sadhya, D.: Benchmarking deep neural
network approaches for Indian Sign Language recognition. Neural Comput. Appl. 33, 6685–6696 (2020). https://doi.org/10.1007/S00521-020-05448-8.
24. Kishore, P.V.V., Prasad, M. V.D., Prasad, C.R., Rahul, R.: 4-Camera model for sign
language recognition using elliptical fourier descriptors and ANN. Int. Conf. Signal
Process. Commun. Eng. Syst. - Proc. SPACES 2015, Assoc. with IEEE. 34–38 (2015).
https://doi.org/10.1109/SPACES.2015.7058288.
25. Tewari, D., Srivastava, S.K.: A Visual Recognition of Static Hand Gestures in Indian Sign
Language based on Kohonen Self-Organizing Map Algorithm. Int. J. Eng. Adv. Technol.
165 (2012).
26. Gao, W., Fang, G., Zhao, D., Chen, Y.: A Chinese sign language recognition system based
on SOFM/SRN/HMM. Pattern Recognit. 37, 2389–2402 (2004).
https://doi.org/10.1016/J.PATCOG.2004.04.008.
27. Quocthang, P., Dung, N.D., Thuy, N.T.: A comparison of SimpSVM and RVM for sign
language recognition. ACM Int. Conf. Proceeding Ser. 98–104 (2017).
https://doi.org/10.1145/3036290.3036322.
28. Pu, J., Zhou, W., Li, H.: Iterative alignment network for continuous sign language
recognition. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 2019-June,
4160–4169 (2019). https://doi.org/10.1109/CVPR.2019.00429.
29. Kalsh, E.A., Garewal, N.S.: Sign Language Recognition System. Int. J. Comput. Eng. Res.
6.
30. Singha, J., Das, K.: Indian Sign Language Recognition Using Eigen Value Weighted
Euclidean Distance Based Classification Technique. Int. J. Adv. Comput. Sci. Appl. (IJACSA) 4 (2013).
31. Liang, Z., Liao, S., Hu, B.: 3D Convolutional Neural Networks for Dynamic Sign
Language Recognition. Comput. J. 61, 1724–1736 (2018).
https://doi.org/10.1093/COMJNL/BXY049.
32. Pigou, L., Van Herreweghe, M., Dambre, J.: Gesture and Sign Language Recognition with
Temporal Residual Networks. Proc. - 2017 IEEE Int. Conf. Comput. Vis. Work. ICCVW
2017. 2018-Janua, 3086–3093 (2017). https://doi.org/10.1109/ICCVW.2017.365.
33. Huang, J., Zhou, W., Zhang, Q., Li, H., Li, W.: Video-based Sign Language Recognition
without Temporal Segmentation.
34. Cui, R., Liu, H., Zhang, C.: Recurrent convolutional neural networks for continuous sign
language recognition by staged optimization. Proc. - 30th IEEE Conf. Comput. Vis. Pattern
Recognition, CVPR 2017. 2017-Janua, 1610–1618 (2017).
https://doi.org/10.1109/CVPR.2017.175.
35. About - OpenCV.
36. Poster of the Manual Alphabet in ISL | Indian Sign Language Research and Training Center