Sign Language Recognition Using Machine Learning
https://doi.org/10.22214/ijraset.2023.49199
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.538
Volume 11 Issue II Feb 2023- Available at www.ijraset.com
Abstract: This study proposes a method for recognising gestures through image processing. In this project, we create a sign detector that recognises the numbers 1 to 10 and can be readily extended to recognise a wide range of other signs and hand gestures, such as the alphabets. We used the Python Keras and OpenCV libraries to build this project. The applications of this project are incredibly diverse. Machine learning is used to build models, extract key features, and construct applications, since it has access to enormous databases, and it can be used in our daily lives to make living easier. Detecting signs in sign language is a complicated process, and deep learning is used to perform it.
Index Terms: CNN, OpenCV, Keras, machine learning.
I. INTRODUCTION
Sign language is highly beneficial in many ways. Here, we accept input in the form of signs and send it through an interface that, depending on the sign's recognition threshold value, launches an application. We use OpenCV to create an interface that takes real-time input and launches the required application. We first train the model on the newly produced dataset.
II. MODEL
A. Convolutional Neural Networks
We first generate our own dataset of images of hand gestures representing the digits 1 to 10 using NumPy and a number of tools and methods. A wide variety of photos can be used to produce the dataset. The data is then trained using the Keras library. To extract the features, or indicators, from the images, we employ convolutional neural networks (CNNs).
A. Real-Time Sign Language Fingerspelling Recognition Using Convolutional Neural Networks From Depth Map [5].
This work focuses on static fingerspelling in American Sign Language. It describes a technique for implementing a sign-language-to-text/speech conversion system without handheld gloves and sensors, in which the gestures are continually recorded and converted into voice. Only a small number of images was used in this method to identify the signs, and the work outlines the design of a device that lets physically challenged people communicate.
B. Design of a Communication Aid for the Physically Challenged [6]
The system was developed in the MATLAB environment and consists mainly of a training phase and a testing phase. The author used feed-forward neural networks during the training stage. The issue here is that MATLAB is not very efficient, and it is also challenging to integrate the concurrent qualities of the system as a whole.
C. American Sign Language Interpreter System for Deaf and Dumb Individuals [7]
Twenty of the static ASL alphabets could be recognised using the described approaches. The occlusion issue prevented the recognition of the letters A, M, N, and S, and only a small number of images was used.
D. A Framework for Hand Gesture Recognition and Spotting Using Sub-gesture Modeling [10]
Gesture-completion models are built in a manner identical to filler models, the only difference being that a gesture model (each gesture has its own gesture-completion model) is added between the start state and the exit state of the filler model.
V. PROPOSED METHODOLOGY
This model will be used to recognise hand gestures in real-time video, determine the proper threshold value, and launch the corresponding application. It will be deployed in standalone applications or embedded devices using web frameworks such as Django.
A. Mission
The goal is to create a system that accepts a sign and processes it through a real-time interface that opens the corresponding application.
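The recognition loop itself is not listed in the paper; the following is a minimal sketch of what it could look like, assuming a trained Keras model saved as "gesture_model.h5", a 64x64 grayscale input, and a hypothetical mapping from predicted classes to applications. Names such as APP_FOR_CLASS, the threshold value, and the region-of-interest coordinates are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch (assumed model file name, ROI coordinates, class-to-app mapping).
import subprocess
import cv2
import numpy as np
from tensorflow.keras.models import load_model

model = load_model("gesture_model.h5")          # hypothetical file name
APP_FOR_CLASS = {1: ["gedit"], 2: ["firefox"]}  # illustrative class -> command mapping
THRESHOLD = 0.90                                # recognition threshold value

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame = cv2.flip(frame, 1)
    roi = frame[100:300, 350:550]               # predetermined region of interest
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
    x = cv2.resize(gray, (64, 64)).astype("float32") / 255.0
    probs = model.predict(x.reshape(1, 64, 64, 1), verbose=0)[0]
    cls, conf = int(np.argmax(probs)) + 1, float(np.max(probs))
    if conf >= THRESHOLD and cls in APP_FOR_CLASS:
        subprocess.Popen(APP_FOR_CLASS[cls])    # launch the mapped application
    cv2.rectangle(frame, (350, 100), (550, 300), (0, 255, 0), 2)
    cv2.imshow("Sign detector", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```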
B. Corpus
The corpus we employed is the dataset that we produced ourselves, containing various indicators. A suitable dataset may be found online; however, for this project we built the dataset ourselves. Every frame in which a hand is recognised in the predetermined region of interest is saved in the "gesture" directory, which contains the folders "train" and "test", each of which holds 10 folders of images captured using the script "generate gesture using data.py".
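The exact contents of "generate gesture using data.py" are not reproduced here; the sketch below shows, under stated assumptions (webcam index 0, a fixed region of interest, and the gesture/train/<class> folder layout described above), how such a capture script could populate the dataset.

```python
# Minimal capture sketch (assumed ROI position and key bindings).
import os
import cv2

label = "1"                                    # which of the 10 classes is being recorded
out_dir = os.path.join("gesture", "train", label)
os.makedirs(out_dir, exist_ok=True)

cap = cv2.VideoCapture(0)
count = len(os.listdir(out_dir))
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame = cv2.flip(frame, 1)
    cv2.rectangle(frame, (350, 100), (550, 300), (0, 255, 0), 2)
    roi = cv2.cvtColor(frame[100:300, 350:550], cv2.COLOR_BGR2GRAY)
    cv2.imshow("Capture", frame)
    key = cv2.waitKey(1) & 0xFF
    if key == ord("c"):                        # press 'c' to save the current ROI frame
        cv2.imwrite(os.path.join(out_dir, f"{count}.jpg"), roi)
        count += 1
    elif key == ord("q"):                      # press 'q' to quit
        break
cap.release()
cv2.destroyAllWindows()
```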
C. Data Pre-processing
An image, in its most basic form, is a 2-dimensional array of pixels with values ranging from 0 to 255; typically, 0 indicates black and 255 indicates white. The mathematical function f(x, y) defines an image, where x denotes the horizontal and y the vertical coordinate, and an image's pixel value at every position is given by the value of f(x, y) at that location. Image pre-processing uses algorithms to manipulate pictures. Before the photos are delivered for model training, it is crucial to preprocess them; for instance, all of the photos should be 200x200 pixels in size, otherwise the model cannot be trained.
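As a concrete illustration of this step, the sketch below resizes an image to a fixed size and rescales pixel values from the 0-255 range to 0-1 before training; the 200x200 target follows the text, and the file name is only an example.

```python
# Pre-processing sketch: fixed size and rescaled pixel values.
import cv2
import numpy as np

def preprocess(path, size=(200, 200)):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)   # 2-D array of values in [0, 255]
    img = cv2.resize(img, size)                    # enforce a common input size
    return img.astype("float32") / 255.0           # scale to [0, 1] for training

x = preprocess("gesture/train/1/0.jpg")            # example path from the corpus layout
print(x.shape, x.min(), x.max())
```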
D. Model
The model we developed trains on the data sequentially. The train and test data are first loaded using the Keras library. We then build the CNN using a variety of hyperparameters. The model is fitted and saved for future use. It will be deployed in standalone applications or embedded devices that use web frameworks such as Django to identify hand gestures in real-time video, calculate the appropriate threshold value, and launch the appropriate application.
In our Sequential model we use four convolutional layers and pooling layers with different filters. As described above, the images are preprocessed and resized to a fixed size before being provided for training.
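A minimal sketch of such a Sequential model and its training is given below. The four Conv2D/MaxPooling2D blocks, the 10 output classes, and the gesture/train and gesture/test folders follow the description above; the 64x64 grayscale input size is an assumption chosen to be loosely consistent with the (None, 62, 62, 16) and (None, 31, 31, 16) layer shapes reported in the Results section, and the filter counts, epochs, and batch size are illustrative hyperparameters rather than the paper's exact values.

```python
# Sketch of the Sequential CNN and its training loop (illustrative hyperparameters).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255)
train_gen = datagen.flow_from_directory("gesture/train", target_size=(64, 64),
                                        color_mode="grayscale", class_mode="categorical",
                                        batch_size=32)
test_gen = datagen.flow_from_directory("gesture/test", target_size=(64, 64),
                                       color_mode="grayscale", class_mode="categorical",
                                       batch_size=32)

model = Sequential([
    Conv2D(16, (3, 3), activation="relu", input_shape=(64, 64, 1)),  # -> (62, 62, 16)
    MaxPooling2D((2, 2)),                                            # -> (31, 31, 16)
    Conv2D(32, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Conv2D(128, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation="relu"),
    Dropout(0.5),                        # dropout layer to limit overfitting
    Dense(10, activation="softmax"),     # one output per gesture class (1-10)
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_gen, validation_data=test_gen, epochs=20)
model.save("gesture_model.h5")           # reused by the real-time recognition sketch above
```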
The CNN uses filters (also known as kernels) to support feature learning and to identify abstract properties such as blurring, edge detection, sharpening, etc., similar to how the human brain recognises objects in time and space. Because weights may be reused and there are fewer parameters involved, the architecture fits the image dataset more accurately (2048 to 256).
In this study, a convolutional neural network (CNN) is employed to convert an RGB image into a visual feature vector. The three most commonly used CNN layers are convolution, pooling, and fully connected layers. ReLU, f(x) = max(0, x), a nonlinear activation function, is also used; ReLU is faster to compute than the common function f(x) = tanh(x). A dropout layer is used to prevent overfitting: each hidden neuron's output is set to zero with probability 0.5, and the "dropped out" neurons take part in neither the forward pass nor backpropagation [5].
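To make these two components concrete, the short sketch below evaluates the ReLU function on a small array and shows how a dropout layer with rate 0.5 is declared in Keras; the values are purely illustrative.

```python
# ReLU and dropout in isolation (illustrative values).
import numpy as np
from tensorflow.keras.layers import Dense, Dropout

def relu(x):
    return np.maximum(0.0, x)            # f(x) = max(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))   # -> [0. 0. 0. 1.5]

hidden = Dense(128, activation="relu")   # hidden layer with ReLU activation
drop = Dropout(0.5)                      # zeroes each hidden output with probability 0.5 during training
```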
Because both the CNN and the RNN contain millions of parameters, there are special convergence concerns when they are merged. For instance, Vinyals et al. found it optimal to fix the convolutional layer's parameters to those learnt from ImageNet. The only parameters that are actually learnt from caption instances are the RNN parameters and the non-convolutional layer parameters.
Even though new deep models are constantly being developed, deep learning has long captured the interest of the scientific community. Finding and choosing the best model from the many available in the literature is difficult, and choosing optimizers and tuning optimization hyperparameters is not a simple task either. This study evaluates the performance of two pre-trained deep models, four adaptive gradient-based optimizers, and the tuning of their associated hyperparameters on a static dataset of Indian sign language. According to the experimental results, InceptionResNetV2 with the Adam optimizer may be used for transfer-learning-based static sign language recognition; the InceptionResNetV2 model outperformed even the most sophisticated machine learning methods by a wide margin.
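A hedged sketch of that transfer-learning setup is shown below: the ImageNet-pretrained InceptionResNetV2 base is frozen, in the spirit of fixing the convolutional parameters as discussed above, and only a small classification head is trained with the Adam optimizer. The input size, head layers, learning rate, and class count are assumptions, not values reported by the cited study.

```python
# Transfer-learning sketch: frozen InceptionResNetV2 base + trainable head (assumed sizes).
from tensorflow.keras.applications import InceptionResNetV2
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

base = InceptionResNetV2(weights="imagenet", include_top=False, input_shape=(299, 299, 3))
base.trainable = False                      # fix the convolutional parameters learnt on ImageNet

x = GlobalAveragePooling2D()(base.output)
out = Dense(35, activation="softmax")(x)    # e.g. 35 static sign classes (assumed count)
model = Model(base.input, out)

model.compile(optimizer=Adam(learning_rate=1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```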
VII. RESULTS
The initial training produced poor results: over 45 epochs, the training accuracy ranged from 0.00 percent to 20.00 percent. Because we believed something had gone wrong, we halted that run. The training accuracy then improved to 89 percent after 761 epochs, with a loss of 1.4432 and a validation accuracy of 60.00 percent, and a final accuracy of 90.00 percent was recorded. The final conv2d (Conv2D) layer has an output shape of (None, 62, 62, 16), and the maxpooling2d (MaxPooling2D) layer has an output shape of (None, 31, 31, 16). Four convolutional layers, including input and output, make up I3D Inception, and there are 9 inception modules; the details of the inception module are displayed. We divide the dataset in a 6:2:2 ratio: 300 videos are thus used for the training set, 100 for the validation set, and 100 for the testing set.
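The 6:2:2 split can be expressed in a few lines of Python; the sketch below assumes the 500 videos are available as a list of file paths and simply shuffles and slices them into 300/100/100. The paths and seed are placeholders.

```python
# 6:2:2 train/validation/test split over a list of video paths (placeholder paths).
import random

videos = [f"videos/clip_{i:03d}.mp4" for i in range(500)]   # placeholder file names
random.seed(42)
random.shuffle(videos)

n = len(videos)
train = videos[: int(0.6 * n)]                 # 300 videos
val = videos[int(0.6 * n): int(0.8 * n)]       # 100 videos
test = videos[int(0.8 * n):]                   # 100 videos
print(len(train), len(val), len(test))
```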
These results led us to believe that our model was overfitting: it acquired too much signer-specific knowledge before it could even categorise the sign. The next step was to test this hypothesis, so we retrained using a new dataset structure. We currently use two signers for ten classes with a total of one hundred videos, two signers for twenty classes with one hundred videos, and four signers for forty classes with two hundred videos. With two signers, the training accuracy ranged from 50.00 to 80.00 percent for the first ten classes, 100.00 percent for the following ten classes, and 20.00 percent for the final ten classes.
REFERENCES
[1] Deaf. Cambridge Dictionary. (2018). Retrieved from Cambridge Dictionary: https://dictionary.cambridge.org/dictionary/english/deaf
[2] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L. (2009). ImageNet: A Large-Scale Hierarchical Image Database. Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on (pp. 248-255). IEEE. Miami, FL, USA.
[3] Escalera, S., Baró, X., Gonzàlez, J., Bautista, M., Madadi, M., Reyes, M., . . . Guyon, I. (2014). ChaLearn Looking at People Challenge 2014: Dataset and Results. Workshop at the European Conference on Computer Vision (pp. 459-473). Springer, Cham.
[4] Feichtenhofer, C., Pinz, A., Wildes, R. P. (2016). Spatiotemporal Residual Networks for Video Action Recognition. Advances in Neural Information Processing Systems (pp. 3468-3476).
[5] B. Kang, S. Tripathi and T. Q. Nguyen, "Real-time sign language fingerspelling recognition using convolutional neural networks from depth map," 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), Kuala Lumpur, Malaysia, 2015, pp. 136-140, doi: 10.1109/ACPR.2015.7486481.
[6] Tomohito Fujimoto, Takayuki Kawamura, Keiichi Zempo, Sandra Puentes, "First-person view hand posture estimation and fingerspelling recognition using HoloLens," 2022 IEEE 11th Global Conference on Consumer Electronics (GCCE), pp. 323-327, 2022.
[7] S. Upendran and A. Thamizharasi, "American Sign Language interpreter system for deaf and dumb individuals," 2014 International Conference on Control, Instrumentation, Communication and Computational Technologies (ICCICCT), Kanyakumari, India, 2014, pp. 1477-1481, doi: 10.1109/ICCICCT.2014.6993193.