A Modern Approach For Sign Language Interpretation Using CNN
Abstract. There are nearly 70 million deaf people in the world. A signif-
icant portion of them and their families use sign language as a medium
for communicating with each other. As automation is gradually being
introduced to many parts of everyday life, the ability of machines to
understand and act on sign language will be critical to creating an inclu-
sive society. This paper presents multiple convolutional neural network
based approaches, suitable for fast classification of hand sign characters.
We propose two custom convolutional neural network (CNN) based archi-
tectures which are able to generalize 24 static American Sign Language
(ASL) signs using only convolutional and fully connected layers. We com-
pare these networks with transfer learning based approaches, where mul-
tiple pre-trained models were utilized. Our models outperform all preceding
models, achieving 86.52% and 85.88% accuracy on RGB images of the ASL Finger
Spelling dataset.
1 Introduction
A language that uses manual communication and body language to convey meaning,
as opposed to acoustically conveyed sound patterns, is known as sign language.
It can involve a simultaneous combination of handshapes, orientation,
and movement of the hands, arms or body, and different facial expressions to
fluidly express a speaker’s thoughts. In some cases, sign language is the only
method that is used to communicate with a person with hearing impairment.
Sign languages such as the American Sign Language (ASL), British Sign Lan-
guage (BSL), Quebec Sign Language (LSQ), Spanish sign language (SSL) differ
in the way an expression is made. They share many similarities with spoken
languages, which is why linguists consider sign languages to be a part of natural
languages.
2 Related Works
approach, the output of the pooling layers was directly fed into the LSTM. The
second approach gave a better result with an accuracy of 95.2%.
3 Experimental Setup
This section provides details of the setup used for the experiments performed.
We initially present the dataset on which we will train and compare the different
models. This is followed by a brief description of the data preprocessing and par-
titioning. The proposed models are discussed next, which includes descriptions
of the custom models as well as the transfer learning techniques.
3.1 Dataset
The work is based on the ASL Finger Spelling dataset, which consists of images
obtained from 5 different users. In the proposed dataset [18], images were
captured in 2 different ways: each user was asked to perform 24 static ASL
signs, which were recorded in both color and depth format. There are a total of
131,670 images, of which 65,774 are RGB images and the rest are depth images,
whose intensity values represent the distance of the object, or simply its
depth, from the viewpoint. The reason for choosing American Sign Language (ASL)
for this work is that ASL is widely learned as a second language and the
dataset contains signs made with only one hand, which avoids over-complicated
feature extraction. The dataset comprises 24 static signs with similar lighting
and background, excluding the letters j and z, since these two letters require
dictionary lookup and involve motion (Table 1).
From the total of 5 user samples, 4 were considered, and the proposed
dataset [18] was divided into two parts: the first part is Dataset-A, which
contains only color images, and the other is Dataset-B, which contains both
depth and color images. This is shown in Table 2. In both Dataset-A and
Dataset-B, images from users C and D were used as the training set and images
from users A and B were used as the validation/test set. As the images were of
different sizes, all of them were resized to 200 × 200 pixels. Pixel values
were rescaled to between 0 and 1, and then each image was normalized by
subtracting the mean (Fig. 1).
Fig. 1. Illustration of the variety of the dataset, where each column represents
images of an individual letter collected from 4 different users.
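The preprocessing described above (resizing to 200 × 200, rescaling pixel values to [0, 1], and mean subtraction) can be sketched as follows. This is an illustrative reconstruction rather than the authors' code: the function name is ours, OpenCV is only an assumed I/O backend, and since the text does not specify whether the mean is computed per image or over the whole training set, the per-image variant is shown.

```python
import numpy as np
import cv2  # assumed image I/O backend; any loader yielding HxWx3 arrays works


def preprocess_image(path, target_size=(200, 200)):
    """Resize to 200x200, rescale to [0, 1], and subtract the (per-image) mean."""
    img = cv2.imread(path)                 # load image as a uint8 array
    img = cv2.resize(img, target_size)     # unify image dimensions
    img = img.astype(np.float32) / 255.0   # rescale pixel values to [0, 1]
    img -= img.mean()                      # normalize by subtracting the mean
    return img
```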
To increase the amount of training data, each training image was augmented
using the transformations listed in Table 3. The augmentations were applied
singly (not compositionally) and only to the RGB images. The validation data
were not augmented as such, but were modified.
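Because Table 3 is not reproduced in this excerpt, the transformations below (small rotations, shifts, and zooms) are placeholders; the sketch only illustrates how each augmentation could be applied singly to the RGB training images using Keras.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# One generator per transformation, so augmentations are applied singly
# rather than composed.  Parameter values are placeholders, not Table 3's.
single_augmenters = [
    ImageDataGenerator(rotation_range=15),
    ImageDataGenerator(width_shift_range=0.1),
    ImageDataGenerator(height_shift_range=0.1),
    ImageDataGenerator(zoom_range=0.1),
]


def augment_singly(image):
    """Yield one independently transformed copy of `image` per augmenter."""
    for gen in single_augmenters:
        yield gen.random_transform(image)  # applies only this generator's transform
```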
Training details
Batch size: 64
Input size: 200 × 200 × 3
Learning rate: 0.001
Optimizer: Adam
Loss function: Categorical cross-entropy
Epochs: 25

The loss function of choice was categorical cross-entropy, shown in Eq. 1, which
measures the classification error when multiple categories are in use. Here, the
double sum runs over the observations i, whose number is N, and the categories
c, whose number is C; the term $1_{y_i \in C_c}$ is the indicator function of
the ith observation belonging to the cth category, and
$P_{\mathrm{model}}[y_i \in C_c]$ is the probability predicted by the model that
the ith observation belongs to the cth category.

$$-\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} 1_{y_i \in C_c}\,\log P_{\mathrm{model}}[y_i \in C_c] \qquad (1)$$
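As a concrete illustration, Eq. 1 can be evaluated directly from one-hot labels and predicted probabilities; the NumPy sketch below is ours and assumes the labels are already one-hot encoded.

```python
import numpy as np


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Eq. 1: y_true is an (N, C) one-hot matrix, y_pred holds (N, C) predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    # The indicator 1_{y_i in C_c} is exactly the one-hot entry y_true[i, c].
    return -np.sum(y_true * np.log(y_pred)) / y_true.shape[0]
```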
For this work, the base learning rate was set to 0.001, from which the network
starts training; as mentioned earlier, the learning rate is adapted step-wise
using Eq. 2, where the value of $\text{Step\_Wise\_LR}$ is updated once an epoch
completes all of its steps and $\text{last\_epoch}$ counts the completed epochs.

$$\text{Step\_Wise\_LR} = \text{base\_lr} \times \text{gamma} \times \left(\frac{\text{last\_epoch}}{\text{step\_size}}\right) \qquad (2)$$
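Eq. 2 can be plugged into training through a Keras learning-rate scheduler callback, as in the sketch below; the gamma and step_size values are not given in this excerpt and are placeholders, and the guard for epoch 0 is our addition so that training starts from the base learning rate.

```python
from tensorflow.keras.callbacks import LearningRateScheduler

base_lr = 0.001
gamma = 0.1      # placeholder: decay factor not specified in this excerpt
step_size = 10   # placeholder: epochs per step not specified in this excerpt


def step_wise_lr(last_epoch, current_lr):
    """Compute the learning rate for the coming epoch according to Eq. 2."""
    if last_epoch == 0:
        return base_lr  # start from the base learning rate
    return base_lr * gamma * (last_epoch / step_size)


# Pass to training with: model.fit(..., callbacks=[lr_callback])
lr_callback = LearningRateScheduler(step_wise_lr)
```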
4 Results
The task of finding a model that detects ASL signs was divided into two parts.
In the first segment, two custom models were built, with which an accuracy of
86.52% was achieved, and in the second segment an accuracy of 85.88% was
achieved using pre-trained models on Dataset-A.
This indicates that there was no need to train the model beyond a certain
number of epochs. To overcome overfitting, regularization techniques such as
Dropout and L2 regularization were applied, with the hyperparameters tuned for
the best performance on the validation set. For this work, 3 different dropout
values for custom model-B were considered, where dropping 60% of the neurons
reduced the overall validation loss by 0.25, which helped to increase the
validation accuracy.
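A minimal sketch of such a regularized classifier head in Keras is shown below; the 0.6 dropout rate follows the best configuration reported for custom model-B, while the feature-map shape, layer sizes, and the L2 factor are assumptions, since the full architecture is not reproduced in this excerpt.

```python
from tensorflow.keras import layers, models, regularizers

# Sketch of a classifier head with Dropout and L2 regularization.
# The feature-map shape, 128-unit layer, and L2 factor of 1e-4 are assumptions;
# the 0.6 dropout rate follows the best configuration for custom model-B.
head = models.Sequential([
    layers.Flatten(input_shape=(25, 25, 64)),   # placeholder feature-map shape
    layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.6),                        # drop 60% of the neurons
    layers.Dense(24, activation="softmax"),     # 24 static ASL classes
])
```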
Fig. 4. Illustration of training and validation accuracy of the best two transfer learning
models
To improve the validation accuracy, a fine-tuning process was introduced in
which the model was initialized using the technique mentioned in Sect. 3.4.
From this configuration, with 9,051,928 trainable and 20,024,384 non-trainable
parameters, a validation accuracy of 55.57% was achieved using the VGG16 model
whose weights were pre-trained on the ImageNet dataset, and from VGG19, with
9,051,928 trainable and 14,714,688 non-trainable parameters, a validation
accuracy of 59.93% was achieved with a training accuracy of 84.75%. In both
models, all parameters except those in the fully connected layers were frozen.
As these results were not even close to those of our custom models, a different
technique with other pre-trained models was applied. With this technique, the
top (fully connected) layers of the model were first trained for 10 epochs;
then the weights of all the pre-trained layers and the top layers were unfrozen
and the same model was trained a second time. In the first stage, when only the
top layers were trained, the final softmax classifier adapted in such a way
that, when the model was retrained for 25 epochs in the second stage, it gave
much better validation accuracy, as reported in Table 7. With this process,
using the pre-trained weights of the ‘MobileNetV2’ and ‘NASNetMobile’ models
with 2072 and 1176 corresponding neurons, accuracies of 84.93% and 85.88% were
recorded. For DenseNet121, VGG16 and VGG19, the same configuration could not be
applied because of the huge number of parameters in terms of memory. Among all
the pre-trained models, ‘MobileNetV2’ and ‘NASNetMobile’ show the steadiest
growth in validation accuracy. From Fig. 4 we can see that the validation
accuracy remains low for the first 3–4 epochs, then jumps to 75%, gradually
increases to 84%, and stabilizes for the remaining epochs. On the other hand,
the training accuracy reaches 98% in the first 5–6 epochs and remains stable
for the rest of the epochs.
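The two-stage fine-tuning procedure described above can be sketched with MobileNetV2 in Keras as follows; the 2072-unit dense layer matches the neuron count reported for MobileNetV2, but the exact head architecture is not fully specified in the text, so the remaining layer choices are assumptions.

```python
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications import MobileNetV2

# Stage 1: train only the classifier head for 10 epochs with the base frozen.
base = MobileNetV2(weights="imagenet", include_top=False,
                   input_shape=(200, 200, 3), pooling="avg")
base.trainable = False

model = models.Sequential([
    base,
    layers.Dense(2072, activation="relu"),    # 2072 neurons as reported for MobileNetV2
    layers.Dense(24, activation="softmax"),   # 24 static ASL classes
])
model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_data, epochs=10, validation_data=val_data)

# Stage 2: unfreeze the pre-trained layers and retrain the whole model for 25 epochs.
base.trainable = True
model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_data, epochs=25, validation_data=val_data)
```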
The previous work with the best validation accuracy on the ASL Finger Spelling
dataset was conducted by Pugeault and Bowden [18], who reported accuracy in
three different settings: 73% using only RGB images, 69% using only depth
information, and 75% using RGB+depth images. In our work, we considered only
two settings, as we used only RGB (“Dataset-A”) and depth+RGB (“Dataset-B”) to
measure performance. Although our customized models could not outperform their
work [18] on “Dataset-B”, all the other models performed better than [18] on
RGB images. A total of 240 unseen color images were used to measure the F1
score of both customized models. Both models were evaluated against the ground
truth of 10 images from each class. Based on the precision and recall values,
an F1 score was then generated for each class, as shown in Table 8.
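The per-class precision, recall, and F1 scores reported in Table 8 (and the macro averages quoted below) can be obtained with scikit-learn as sketched here; y_true and y_pred stand for the ground-truth and predicted labels of the 240 held-out images.

```python
from sklearn.metrics import classification_report, f1_score


def per_class_report(y_true, y_pred):
    """Print per-class precision/recall/F1 and return the macro-averaged F1."""
    print(classification_report(y_true, y_pred, digits=3))
    return f1_score(y_true, y_pred, average="macro")
```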
For “Custom-Model-A”, the recall values are significantly higher than the
precision values for the classes k, m, o and v, whereas for “Custom-Model-B”
those classes are d, q and w. The reason might be that the signs for c and o,
w and f, d and l, m and n, and k and r, shown in Fig. 5, are quite similar,
which is why the models may get confused when classifying those particular
classes. For both models, the classifiers could not correctly predict n and r
for any of the given images. For the letters c and f, “Custom-Model-A” shows
slight confusion, as the precision values are slightly lower than the recall
values for those classes, whereas for “Custom-Model-B” those classes are l and
t. Although for some classes the custom models could not give accurate
predictions, the overall performance of both models was good, as the
macro-average value is nearly 59% for “Custom-Model-A” and nearly 68% for
“Custom-Model-B”.
5 Conclusion
References
1. Anderson, R., Wiryana, F., Chandra, M., Putra, G.: Sign language recognition
application systems for deaf-mute people: a review based on input-process-output.
Procedia Comput. Sci. 116, 441–448 (2017)
2. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition, pp. 1–14 (2015)
3. 2014 IEEE International Conference on Advanced Communications, Control and
Computing Technologies, pp. 1412–1415 (2014)
4. Núñez Fernández, D., Kwolek, B.: Hand posture recognition using convolutional
neural network. In: Mendoza, M., Velastín, S. (eds.) CIARP 2017. LNCS, vol.
10657, pp. 441–449. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-75193-1_53
5. Ghotkar, A., Kharate, G.K.: Study of vision based hand gesture recognition using
Indian sign language (2017)
6. Chollet, F.: Xception: deep learning with depthwise separable convolutions (2014)
7. Hoque, T., Kabir, F.: Automated Bangla sign language translation system:
prospects, limitations and applications, pp. 856–862 (2016)
8. Hosoe, H., Sako, S.: Recognition of JSL finger spelling using convolutional neural
networks, pp. 85–88 (2017)
9. Huang, G., Weinberger, K.Q.: Densely connected convolutional networks (2016)
10. Karabasi, M., Bhatti, Z., Shah, A.: A model for Real-time recognition and tex-
tual representation of Malaysian sign language through image processing. In: 2013
International Conference on Advanced Computer Science Applications and Tech-
nologies (2013)
11. Karmokar, B.C., Alam, K.R., Siddiquee, K.: Bangladeshi sign language recognition
employing neural network ensemble (2012)
12. Kishore, P.V.V., Kumar, P.R.: Segment, track, extract, recognize and convert sign
language videos to voice/text. IJACSA 3, 35–47 (2012)
13. Koller, O., Forster, J., Ney, H.: Continuous sign language recognition: towards large
vocabulary statistical recognition systems handling multiple signers. Comput. Vis.
Image Underst. 141, 108–125 (2015)
14. Kumar, P.K., Prahlad, P., Loh, A.P.: Attention based detection and recognition of
hand postures against complex backgrounds (2012)
15. Masood, S., Srivastava, A., Thuwal, H.C., Ahmad, M.: Real-time sign language ges-
ture (word) recognition from video sequences using CNN and RNN. In: Bhateja,
V., Coello Coello, C.A., Satapathy, S.C., Pattnaik, P.K. (eds.) Intelligent Engineering
Informatics. AISC, vol. 695, pp. 623–632. Springer, Singapore (2018).
https://doi.org/10.1007/978-981-10-7566-7_63
16. Mekala, P., Gao, Y., Fan, J., Davari, A.: Real-time sign language recognition based
on neural network architecture, pp. 195–199 (2011)
17. Prajapati, R., Pandey, V., Jamindar, N., Yadav, N., Phadnis, P.N.: Hand gesture
recognition and voice conversion for deaf and dumb. IRJET 5, 1373–1376 (2018)
18. Pugeault, N., Bowden, R.: Spelling it out: real-time ASL fingerspelling recognition
(2011)
19. Rahaman, M.A., Jasim, M., Ali, H.: Real-time computer vision-based Bengali sign
language recognition, pp. 192–197 (2014)
20. Rajam, P.S., Balakrishnan, G.: Real time Indian sign language recognition system
to aid deaf-dumb people, pp. 1–6 (2011)
21. Rao, G.A., Kishore, P.V.: Selfie video based continuous Indian sign language recog-
nition system. Ain Shams Eng. J. 9, 1929 (2017)
22. Sandler, M., Zhu, M., Zhmoginov, A., Howard, A., Chen, L.-C.: MobileNetV2:
inverted residuals and linear bottlenecks (2018)
23. Savur, C.: Real-time American sign language recognition system by using surface
EMG signal, pp. 497–502 (2015)
24. Sarawate, N., Leu, M.C., ÖZ, C.: A real-time American sign language word recog-
nition system based on neural networks and a probabilistic model. Turk. J. Electr.
Eng. Comput. Sci. 23, 2107–2123 (2015)
25. Seth, D., Ghosh, A., Dasgupta, A., Nath, A.: Real time sign language processing
system. In: Unal, A., Nayak, M., Mishra, D.K., Singh, D., Joshi, A. (eds.) Smart-
Com 2016. CCIS, vol. 628, pp. 11–18. Springer, Singapore (2016).
https://doi.org/10.1007/978-981-10-3433-6_2
26. Singha, J., Das, K.: Recognition of Indian sign language in live video. Int. J. Com-
put. Appl. 70, 17–22 (2013)
27. Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplic-
ity: the all convolutional net, pp. 1–14 (2015)
28. Szegedy, C., Vanhoucke, V., Shlens, J., Wojna, Z.: Rethinking the inception archi-
tecture for computer vision (2014)
29. Tripathi, K., Baranwal, N., Nandi, G.C.: Continuous Indian sign language gesture
recognition and sentence formation. Procedia Comput. Sci. 54, 523–531 (2015)
30. Uddin, S.J.: Bangla sign language interpretation using image processing (2017)
31. Wazalwar, S., Shrawankar, U.: Interpretation of sign language into English using
NLP techniques. J. Inf. Optim. Sci. 38, 895 (2017)
32. Zoph, B., Shlens, J.: Learning transferable architectures for scalable image recog-
nition (2017)