Recognizing Sign Language Using Machine Learning and Deep Learning Models
Abstract:- Individuals with hearing impairments communicate mostly through sign language. Our goal was to create an American Sign Language recognition dataset and utilize it in a neural network-based machine learning model that can interpret hand gestures and positions into natural language. In our study, we incorporated the SVM, CNN, and ResNet-18 models to enhance predictability when interpreting ASL signs through this new dataset, which includes provisions such as lighting and distance limitations. Our research also features comparison results between the other models, implemented under invariant conditions, and our proposed CNN model. As demonstrated by its high precision of 95.10% despite changes encountered during testing, such as varying datasets or scene configurations, with minimal loss (0.545), there is great potential for future applications in image recognition systems requiring deep learning techniques. Furthermore, these advancements may lead to significant improvements in fields related to speech-language therapy, helping people overcome challenges associated with deafness while building bridges towards improved social integration.

Keywords:- Image Recognition, Image Classification, Feature Extraction, Deep Learning, Convolutional Neural Network (CNN), Sign Language Translation, American Sign Language (ASL), Real-Time Recognition.

I. INTRODUCTION

Sign language is a type of communication used by deaf and hard-of-hearing people that relies on visual cues. While it features its own unique grammar, vocabulary, and syntax, the majority of people in the world are not fluent in sign language, which can make interaction between members who use different forms difficult at times. The advancement of machine learning technology offers a potential solution to this challenge. Machine learning involves teaching computers how to learn from experience without explicit programming or instruction. This includes training computer algorithms on extensive datasets so they can recognize recurring patterns and make predictions based on past input. Therefore, sign language recognition systems utilizing these advances offer an opportunity for bridging gaps among community members with differing abilities to communicate effectively when face-to-face interactions occur. These programs aim to interpret any sign language gesture accurately and translate it into written or spoken language as required, thereby promoting smoother integration among those involved, irrespective of individual preferences in personal communication scenarios.

This study investigates the use of a CNN model to improve sign language recognition. The goal is to break down communication barriers between people with hearing impairments and the rest of society by developing an efficient, trustworthy system capable of quickly identifying and understanding sign language motions. By training on vast amounts of American Sign Language (ASL) data, our proposed CNN will learn the subtle variations in sign characteristics needed to distinguish among numerous signs with high precision.

II. RELATED WORKS

Sign language identification is an important field of research because it facilitates communication for people with hearing difficulties. Various deep learning-based approaches have been investigated over time to improve sign language recognition systems. This literature analysis seeks to consolidate present research findings and highlight prospective future research paths in the field of sign language recognition with Convolutional Neural Networks (CNNs).

Koller, Zargaran, Ney, and Bowden (2016) presented a hybrid CNN-HMM model for continuous sign language recognition [4], which combines the discriminative abilities of CNNs with the sequence modelling capabilities of Hidden Markov Models (HMMs). They showed that their end-to-end embedding improved performance on three challenging continuous sign language recognition benchmarks, achieving relative improvements of between 15% and 38%, and up to 13.3% absolute. This study sheds light on the potential of hybrid CNN-HMM models for enhancing the overall accuracy of sign language recognition systems.

Koller, Zargaran, Ney, and Bowden (2018) [5] conducted additional research into the usage of hybrid CNN-HMMs for robust statistical continuous sign language recognition. Although data from this work are not available, it is clear that the study expands on the investigation of CNN-HMM models in sign language identification,
IJISRT24MAY500 www.ijisrt.com 93
Volume 9, Issue 5, May – 2024 International Journal of Innovative Science and Research Technology
ISSN No:-2456-2165 https://doi.org/10.38124/ijisrt/IJISRT24MAY500
highlighting the necessity for strong and accurate recognition systems.

Furthermore, Wadhawan and Kumar (2020) discussed the application of 3D-CNNs to train spatio-temporal features from raw video data for sign language recognition [1]. The authors used spatial attention in the network to focus on areas of interest during feature extraction, and temporal attention to choose meaningful motions for categorization. This approach demonstrates how attention mechanisms can improve the overall accuracy of sign language recognition by allowing the network to focus on essential spatiotemporal information.

Huang, Zhou, Li, and Li (2019) also investigated attention-based 3D-CNNs for large-vocabulary sign language recognition [8]. Although no specific research outcomes were presented, the attention-based method emphasizes the necessity of focusing on significant features in sign language recognition, which contributes to the investigation of advanced CNN architectures for this task.

Barbhuiya, Karsh, and Jain (2020) proposed a CNN-based feature extraction and classification method for sign language recognition [7]. While specific results were not available, the study adds to our understanding of CNNs' potential for feature extraction and classification in sign language recognition systems.

Furthermore, Masood, Srivastava, Thuwal, and Ahmad (2018) studied real-time sign language gesture detection with CNNs and Recurrent Neural Networks (RNNs) [2]. Their research focuses on detecting gestures from video sequences, demonstrating the power of merging CNNs and RNNs for real-time sign language recognition.

Additionally, Katoch, Singh, and Tiwary (2022) investigated the use of CNNs for American Sign Language identification, combining Speeded Up Robust Features (SURF) with Support Vector Machines (SVM) and CNN [6]. Their work adds to our understanding of CNNs' applicability in recognizing various sign languages.

Finally, Rastgoo, Kiani, and Escalera (2020) underlined the use of Long Short-Term Memory (LSTM) models for isolated hand sign language recognition [3]. Their method entailed linking an LSTM to the fully connected layer of a CNN, revealing the potential of combining RNNs and CNNs for sequence learning tasks in sign language recognition.

In conclusion, this literature review explores various CNN-based techniques for sign language recognition and suggests areas for future research to improve overall accuracy and robustness. Despite significant progress in sign language recognition using CNN-based approaches, there are still research gaps and areas for improvement. One key challenge is real-time recognition, as many existing studies focus only on offline scenarios. However, seamless communication requires accurate real-time applications, especially when interacting dynamically with individuals who have hearing impairments. Moreover, recognizing variations in gestures that arise due to the diversity and complexity of different sign languages poses a considerable obstacle that warrants further investigation. Furthermore, integrating multimodal approaches, such as incorporating additional physiological cues, could significantly enhance the overall accuracy and robustness of current systems, particularly in large-vocabulary scenarios common among people with varying levels of deafness or cultural backgrounds. Investigating various techniques while considering contextual factors would offer insights into generalizing models across diverse user populations beyond their native sign-language habitat. In summary, future research must prioritize addressing these knowledge gaps to realize more effective, inclusive solutions for individuals living with hearing disabilities through advanced sign language recognition technology.

III. PROPOSED METHODOLOGY

A. Dataset Description
We used two widely available datasets and one custom dataset to train the model. The American Sign Language Dataset, published on Kaggle.com, consists of 2515 JPEG files totaling 32.46 MB. It contains images of the digits 0 through 9, as well as all alphabets from A to Z, making it a comprehensive resource for ASL recognition and interpretation. The Indian sign language dataset contains 42k JPEG files, totaling 80 MB. Similar to the ASL dataset, it contains images representing the digits 0 to 9 as well as all alphabets from A to Z. Both datasets are publicly available, making them useful for study, education, and the development of sign language-related applications and technology. The third, custom dataset, also covering American Sign Language, is meticulously crafted, containing images captured in the ".JPEG" format. Spanning a considerable 320 MB, the dataset is rich with detail, consisting of 3600 files carefully organized to encompass a wide array of sign language gestures. It covers numeric hand signs from 0 to 9, as well as the complete set of alphabets in American Sign Language, offering a comprehensive resource for sign language recognition and related studies.

B. Pre-Processing
Pre-processing involves three stages. The first is resizing, which adjusts the image size according to the input requirements of each architecture. Second, scaling changes the pixel range from 0-255 to 0-1, improving numerical conditioning and speeding up training. Lastly, normalization and augmentation are applied: transformations such as random resizing and cropping from PyTorch's 'transforms' module are applied to the training and validation datasets for optimal results.
Prewitt, Canny, and Sobel are edge detection algorithms that are widely utilized as feature extraction methods in image classification. These methods detect rapid changes in intensity or color in an image, which are commonly associated with borders or boundaries between different objects or regions.

Compared with its counterparts, this method is less vulnerable to interference; hence, it is widely regarded as one of the most reliable techniques currently available for detecting an object's boundary.
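As a sketch of how such gradient-based detectors work, the Sobel and Prewitt operators can be written directly as 3x3 convolutions over a grayscale image. This is a minimal NumPy illustration, not the paper's implementation; in practice library routines such as OpenCV's `cv2.Sobel` and `cv2.Canny` would be used.

```python
import numpy as np

# Horizontal-derivative kernels: Sobel weights the centre row, Prewitt does not.
SOBEL_X   = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
PREWITT_X = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=float)

def gradient_magnitude(img, kx):
    """Convolve img with kx and its transpose, return the edge magnitude map."""
    ky = kx.T
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            gx = np.sum(patch * kx)   # horizontal intensity change
            gy = np.sum(patch * ky)   # vertical intensity change
            out[i, j] = np.hypot(gx, gy)
    return out

# Toy image: left half dark, right half bright -> a single vertical edge.
img = np.zeros((8, 8))
img[:, 4:] = 1.0

edges = gradient_magnitude(img, SOBEL_X)
print(edges[3])  # strongest response at the dark/bright boundary columns
```

The response is zero in the flat regions and peaks only where the intensity jumps, which is exactly the boundary cue these feature extractors provide to the classifiers.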
industries counting upon state-of-the-art tools available today, representing future progress moving forward.

Table 2: Architecture of ResNet-18

E. Training
During the training process, iterating over the training dataset for a specified number of epochs is crucial for optimizing model performance. Stochastic Gradient Descent (SGD) serves as the optimizer, adjusting model parameters to minimize the Cross-Entropy Loss function. Backpropagation computes the gradients of the loss and updates the model's parameters accordingly. To align with the number of classes in the dataset, the last fully connected layer is replaced, and the evaluation mode is set for testing. Data splitting involves allocating 70 percent for training and 30 percent for testing across all three datasets, ensuring robust model evaluation and generalization.

Table 3 shows the overall accuracy evolution of the models, in which accuracy can be seen to increase noticeably. CNN exhibits a significantly higher initial accuracy of 37.5%, which rapidly escalates to 99.20% over the course of the epochs, indicating its remarkable efficiency in learning and capturing complex patterns within the training data. Overall, CNN outperforms ResNet and SVM, showcasing the highest training accuracy rates across all epochs.
This trend indicates that all models effectively learn from the training data over the epochs, with CNN consistently exhibiting the most substantial decrease in loss, implying its superior ability to optimize and generalize compared to ResNet and SVM.

IV. RESULTS AND ANALYSIS

The overall accuracy comparison table illustrates the performance of three models (CNN, ResNet, and SVM) across various feature extraction methods: Canny, contour, Harris, Prewitt, watershed, and Sobel. Across the different feature extraction techniques, CNN consistently achieves competitive accuracy rates, with the highest scores observed with the Sobel (96.03%), Prewitt (95.03%), and contour (95.03%) methods.

ResNet also demonstrates strong performance, particularly excelling with the Harris feature extraction method (96.45%). SVM generally lags behind CNN and ResNet but still maintains respectable overall accuracy rates, with its highest score achieved using the Harris method (97.45%). Overall, while ResNet and SVM show notable accuracy levels, CNN consistently performs well across most feature extraction techniques, indicating its robustness and effectiveness in handling diverse image features.

The study analyzes categorization using the following parameters: recall or sensitivity (R), precision (P), F1 score (F1), accuracy (A), and error.
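These metrics follow their standard definitions. As an illustration of how they are computed from per-class true/false positive and negative counts (the counts below are made up for the sketch, not the paper's results):

```python
# Toy confusion-matrix counts for one class; illustrative values only.
tp, fp, fn, tn = 90, 10, 5, 95

precision = tp / (tp + fp)                            # P = TP / (TP + FP)
recall    = tp / (tp + fn)                            # R (sensitivity) = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)
accuracy  = (tp + tn) / (tp + fp + fn + tn)           # A over all predictions
error     = 1 - accuracy

print(round(precision, 3), round(recall, 3), round(f1, 3), round(accuracy, 3))
# 0.9 0.947 0.923 0.925
```

For the multi-class setting of Table 7, these quantities are computed per class and then averaged across the 36 classes.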
In terms of precision, ResNet exhibits the highest value at 95.83%, closely followed by SVM at 95.16%, with CNN slightly lower at 95.14%. In terms of overall accuracy, ResNet stands out with a score of 94.3%, followed closely by CNN at 94.74%, and SVM slightly lower at 92.26%.

Overall, ResNet consistently demonstrates competitive performance across all metrics, while CNN closely follows, and SVM lags slightly behind in terms of precision, F1 score, overall accuracy, and recall.
From Table 7, we observe a comparative analysis of three distinct classifiers (SVM, CNN, and ResNet-18) across a diverse set of 36 classes, encompassing both numerical digits (0-9) and alphabetical letters (A-Z). The ResNet architecture demonstrates a commendable level of accuracy, particularly excelling with a flawless 100% in the majority of classes. It does, however, exhibit some challenges, most notably with the numerical class '0', where it achieves only 65% accuracy, and to a lesser extent with the classes '1', 'O', 'T', and 'V'. The SVM classifier, while achieving perfect scores in several instances, shows a more erratic performance profile with significant dips in accuracy for certain classes. It struggles considerably with the numerical class '0' at 64% and the alphabetical classes 'N' at 70%, 'M' at 79%, and 'K' at 81%, indicating potential weaknesses in its classification capabilities for these particular characters.

The CNN classifier maintains a robust performance across the dataset, with perfect scores in numerous classes, but it is not without its shortcomings, as evidenced by lower accuracies in classes such as 'W', 'U', and 'R'. Despite these individual variances, the overall performance of each classifier is impressive, showcasing their ability to effectively discern and classify a wide range of characters. In summary, ResNet appears to be the most consistent and accurate across the majority of classes, with only a few instances of reduced accuracy. SVM, while achieving high accuracy in certain classes, shows more pronounced dips in performance, particularly with specific numerical and alphabetical classes. CNN generally performs well, with a few exceptions where its accuracy falls below that of ResNet.