AML 22 Sign Language Recognition Using Deep Learning
Swapnil Shinde, AI & DS, Vishwakarma Institute of Information Technology, Pune, India (swapnil.shinde@viit.ac.in)
Parikshit Mahalle, AI & DS, Vishwakarma Institute of Information Technology, Pune, India (parikshit.mahalle@viit.ac.in)
Sayee Panchal, AI & DS, Vishwakarma Institute of Information Technology, Pune, India (sayee.22110420@viit.ac.in)
Shreya Mahalle, AI & DS, Vishwakarma Institute of Information Technology, Pune, India (shreya.22110637@viit.ac.in)
Atharva Pandit, AI & DS, Vishwakarma Institute of Information Technology, Pune, India (atharva.22110449@viit.ac.in)
Parag Tonpe, AI & DS, Vishwakarma Institute of Information Technology, Pune, India (parag.22110305@viit.ac.in)
Abstract— This study aims to develop a robust deep learning-based solution for accurate detection and recognition of Indian Sign Language (ISL) motions. The proposed model integrates advanced techniques such as MediaPipe Holistic for feature extraction and Long Short-Term Memory (LSTM) networks for sign language detection and translation. The overall process is divided into several key phases: data collection, feature extraction, and model training using essential Python modules. In the data collection phase, a comprehensive dataset of ISL motions is compiled in the form of video clips. These videos are meticulously annotated with corresponding textual representations of the signs to facilitate accurate training and evaluation. The preprocessing step involves extracting significant hand landmarks from the video frames, which are crucial for distinguishing between different sign language motions. Feature extraction is carried out using the MediaPipe Holistic algorithm, which is renowned for its high accuracy and efficiency in detecting hand and body landmarks. The provided code snippet outlines the detailed procedure for converting videos to landmark coordinates; this process leverages several libraries, including OpenCV for video processing, MediaPipe for landmark detection, and NumPy for efficient data manipulation. The extracted landmarks serve as critical inputs to the subsequent deep learning model. Using the TensorFlow and Keras frameworks, an LSTM network, a type of Recurrent Neural Network (RNN), is implemented during the deep learning phase. These frameworks are used to specify the model's architecture, train it on the preprocessed data, and deploy it for real-time sign language recognition. Because the LSTM network can capture temporal dependencies in sequential data, it is especially well suited to the dynamic character of sign language motions. The goal of this research is to develop a reliable and accurate system for recognizing sign language and, in doing so, to facilitate inclusive communication by removing the obstacles that the hearing-impaired community faces. The project's objectives of precise and dependable ISL recognition depend on the effective fusion of cutting-edge deep learning approaches with efficient landmark extraction strategies.

Keywords: Sign Language Recognition, Deep Learning, MediaPipe Holistic, RNN model, LSTM, Indian Sign Language, Feature Extraction, TensorFlow, Keras, OpenCV, NumPy.

I. INTRODUCTION
In light of a report from the World Health Organization highlighting that India is home to approximately 63 million deaf individuals, whether completely or partially hearing-impaired, the significance of addressing communication barriers within this demographic becomes increasingly apparent. Sign language, serving as the primary means of communication for the deaf community, stands as a vital component of their cultural identity. Employing body and eye gestures, sign language possesses its own distinctive vocabulary, grammatical structures, and rules, analogous to any spoken language used by the hearing population. Despite its importance, a considerable gap exists in the understanding of sign language by non-deaf individuals, thereby contributing to pervasive communication challenges.
Sign language, being a visual and physical mode of communication, is integral to interaction among deaf and hard-of-hearing individuals. However, limited fluency in sign language among the broader population exacerbates the existing communication divide, hindering effective interaction. In response to this pressing need, recent years have witnessed a surge in interest and efforts to develop technological solutions aimed at bridging the communication gap between deaf and hearing communities. One promising avenue involves the automatic conversion of sign language into written or spoken language, and vice versa, through the utilization of advanced technologies. This paper explores the application of MediaPipe Holistic and LSTM (Long Short-Term Memory) networks for the conversion of sign language, an endeavor that holds immense potential for fostering seamless communication and inclusivity between these distinct linguistic communities.
II. LITERATURE SURVEY
The primary goal of the proposed system in [1] is to create a feature vector capable of representing dynamic hand movements and achieving adequate recognition accuracy using just the Leap Motion controller (LMC). A feature vector with depth information is computed and fed into a Hidden Conditional Neural Field (HCNF) classifier as part of the suggested solution. For the LeapMotion-Gesture 3D dataset and the Handicraft-Gesture dataset, the system achieved recognition accuracies of 89.5% and 95.0%, respectively. The system's main advantage is the LMC's superior localization precision compared to other depth sensors.
[2] proposes using the Google API and NLP to automate sign-to-text language conversion. The solution entails acquiring the input sign and converting it to text with the Google API, removing the infected parts with NLP concepts, and matching each word/character in the processed text with the visual sign word library to retrieve the matched videos. These videos are then concatenated to create a single video on the final display that depicts the entire text in sign language. In terms of sign interpretation, the proposed model achieved 90% accuracy.
[3] The paper titled "Advancements in English to Regional Machine Translation" discusses approaches and challenges in English-to-regional machine translation, an essential field for bridging language barriers.
[4] The research paper "Text-to-Speech Synthesis: An Overview" provides a thorough overview of text-to-speech technology, which is essential for converting written text into spoken language.
[5] The research paper "Neural Machine Translation and Sequence-to-Sequence Models: A Tutorial" offers a tutorial on neural machine translation models, which play a vital role in enhancing translation capabilities.
[6] introduces metrics such as BLEU, which are commonly used in machine translation research and aid in the evaluation of translation quality.
[7] explores the development of a system that can instantly translate sign language into spoken language and text, aiming to enhance communication accessibility for the deaf and hard-of-hearing community.
[8] This thesis by Daniel Varab aims to enhance automatic text summarization by improving language support and designing more pragmatic systems for generating concise and context-aware summaries.
The goal of reference [9] was to create a system that could translate ISL gestures into both text and speech. To accomplish this, they proposed converting gesture images into text/speech using the K-Nearest Neighbor algorithm. Their strategy consisted of four steps: A) using captured ISL gestures as input, B) extracting features from the segmented images, C) analyzing multiple images using unsupervised feature learning (UFL) and classification, and D) synthesizing text and speech from the classified images. With unsupervised feature learning, the system achieved an accuracy rate of 78%.
The goal of [10] is to create a Sign Language Interpreter using 2D/3D sensing and AI/ML neural network algorithms to bridge the communication gap between individuals who are deaf or mute and those who are not. To generate text from sign language, this system employs a vision-based approach and neural network algorithms.
Similarly, [11] aims to eliminate the communication barrier between individuals who are deaf or mute and those who are not. They propose a system that converts text to a gloss network and then maps the gloss to a skeleton pose using pose estimation and Decision Tree algorithms. They create datasets using tf-pose-estimation, and the system can recognize multiple sign language gestures in sequence and output the corresponding words.
The aim of [12] is to create a hand detection-based learning tool for individuals who are new to sign language. They proposed a solution that utilizes a CNN algorithm to identify and translate static sign gestures into their corresponding words. The system achieved an accuracy rate of 93.44% in recognizing numbers, with an average time of 3.93 seconds.

III. RELATED WORK
A. Gesture Recognition Using Deep Learning
Deep learning techniques have resulted in significant advancements in gesture recognition, which involves detecting and interpreting human gestures and movements. Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), specifically, have been critical in developing models that automatically learn to recognize gestures.
CNNs (Convolutional Neural Networks):
CNNs have been extensively used to recognize hand gestures. Hand shape recognition, an essential component of gesture recognition, depends on correctly identifying features such as finger positions and hand orientation. CNNs can efficiently learn to detect these features and classify hand shapes in real time.
RNNs (Recurrent Neural Networks):
RNNs, particularly Long Short-Term Memory (LSTM) networks, have been utilized for continuous gesture recognition. Recognizing a sequence of gestures, rather than individual gestures, is complex, and RNNs are well suited to this challenge due to their ability to capture temporal dependencies between gestures.
Other Deep Learning Techniques:
Besides CNNs and RNNs, other deep learning methods have been applied to gesture recognition. For example, the Transformer model, originally designed for natural language processing, has been adapted for gesture recognition; it treats gestures as a sequence of symbols, effectively capturing complex relationships between them.
The ability of deep learning to learn directly from raw data eliminates the need for manual feature engineering in gesture recognition, allowing models to uncover complex patterns and relationships. However, deep learning models require a large amount of labeled data for training, which can be difficult to obtain due to the variability and complexity of gestures.
While deep learning shows immense promise in gesture recognition, challenges remain, such as enhancing model robustness to factors like lighting, orientation, and background variations, and adapting models to diverse types of gestures.

B. Computer Vision in Education
Computer vision has found applications in education, revolutionizing how students learn and interact with educational content.
Gesture-Based Learning:
Computer vision enables gesture-based learning systems that interpret students' hand gestures to interact with educational software and content. This technology makes learning more engaging and interactive, allowing students to control educational applications through gestures and sign language.
Sign Language Instruction:
Computer vision can be employed to create sign language teaching tools. These tools can recognize and interpret sign language gestures, helping both deaf and hearing individuals learn and practice sign language more effectively. The system can provide feedback and corrections, making sign language education more accessible.
Interactive Learning Environments:
Computer vision creates interactive learning environments where students can use gestures to control virtual elements. For example, students can manipulate 3D models or conduct virtual experiments using hand gestures, enhancing their understanding of complex concepts.
scalability for further research in sign language recognition using deep learning techniques.
Data Preparation: The extracted landmarks are used to create a comprehensive dataset for training and validation. This dataset is divided into suitable training and validation sets to facilitate model development.
Model Training: The proposed approach relies on recurrent neural networks (RNNs), specifically Long Short-Term Memory (LSTM) networks, to train a deep learning model. These networks excel at capturing temporal dependencies, making them ideal for recognizing gesture sequences.
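Below is a minimal, illustrative sketch of this data-preparation step, combining the video-to-landmark conversion described in the abstract (OpenCV for reading frames, MediaPipe Holistic for landmark detection, NumPy for array handling) with the training/validation split. The folder layout, the example sign labels, and the 63-value right-hand keypoint vector (21 landmarks x 3 coordinates) are assumptions made for illustration, not details taken from the paper:

# Illustrative data-preparation sketch (assumed file layout and labels).
import os
import cv2
import numpy as np
import mediapipe as mp
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

mp_holistic = mp.solutions.holistic
DATA_PATH = 'ISL_data'                         # hypothetical dataset root
actions = ['hello', 'thanks', 'yes', 'no']     # hypothetical sign labels
SEQUENCE_LENGTH = 30                           # frames kept per clip

def extract_keypoints(results):
    # Flatten the right-hand landmarks into a 63-value vector
    # (zeros when no hand is detected in the frame).
    if results.right_hand_landmarks:
        return np.array([[lm.x, lm.y, lm.z]
                         for lm in results.right_hand_landmarks.landmark]).flatten()
    return np.zeros(21 * 3)

def video_to_sequence(video_path):
    # Convert one annotated clip into a (SEQUENCE_LENGTH, 63) landmark sequence.
    frames = []
    cap = cv2.VideoCapture(video_path)
    with mp_holistic.Holistic(min_detection_confidence=0.5,
                              min_tracking_confidence=0.5) as holistic:
        while len(frames) < SEQUENCE_LENGTH:
            ok, frame = cap.read()
            if not ok:
                break
            results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            frames.append(extract_keypoints(results))
    cap.release()
    while len(frames) < SEQUENCE_LENGTH:       # pad short clips with zero vectors
        frames.append(np.zeros(21 * 3))
    return np.array(frames)

# Build the dataset: one labelled sequence per annotated clip.
sequences, labels = [], []
for label, action in enumerate(actions):
    for clip in os.listdir(os.path.join(DATA_PATH, action)):
        sequences.append(video_to_sequence(os.path.join(DATA_PATH, action, clip)))
        labels.append(label)

X = np.array(sequences)                        # shape: (num_clips, 30, 63)
y = to_categorical(labels).astype(int)         # one-hot sign labels
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1)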
proficiency in understanding diverse human gestures. The proposed approach integrates computer vision and deep learning to revolutionize gesture recognition, with potential applications in various domains, including human-computer interaction, virtual and augmented reality, and more. It signifies a significant advancement in the understanding and interpretation of human gestures, opening doors to exciting possibilities in the realm of technology.

VI. RESULTS
● The sign language recognition system, developed through the integration of the MediaPipe library and LSTM, demonstrated
exceptional performance in its evaluation. The accuracy
achieved on the train-test split dataset was a perfect 100.0%.
This high accuracy level indicates the system's robustness in
accurately recognizing and interpreting various sign language
gestures under controlled conditions. The Matthews Correlation
Coefficient (MCC) for the system was 1.0000, emphasizing its
capability to maintain high precision and recall across multiple
classes.
Overall Metrics Summary:
● Accuracy: 100.0%
● Matthews Correlation Coefficient (MCC): 1.0000
These metrics demonstrate the high precision and recall achieved
across specific words in our dataset, highlighting the effectiveness
of our sign language recognition system.
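For reference, the overall metrics reported above can be recomputed on the held-out split with scikit-learn. The following is a minimal sketch, assuming the trained model and the validation arrays (X_val, y_val) named in the data-preparation sketch earlier; these names are illustrative, not from the paper:

# Illustrative evaluation sketch on the held-out split.
import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef

y_pred = np.argmax(model.predict(X_val), axis=1)   # predicted class indices
y_true = np.argmax(y_val, axis=1)                  # one-hot labels to indices
print('Accuracy:', accuracy_score(y_true, y_pred))
print('MCC:', matthews_corrcoef(y_true, y_pred))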
Model Summary:
Our sign language recognition model utilizes a sequential
architecture built with Long Short-Term Memory (LSTM) layers
for effective recognition of sign sequences. The model summary
(shown below) provides a detailed breakdown of the network
structure and its trainable parameters.
Key Points from the Model Summary:
Sequential Architecture:
The model employs a sequential structure, where data flows
through each layer in a single sequence. This is well-suited for
tasks involving ordered data like sign language sequences.
LSTM Layers:
The core of the model consists of three LSTM layers (lstm, lstm_1,
and lstm_2). LSTMs are known for their ability to capture temporal
dependencies within sequences, which is crucial for recognizing
the order of signs in sign language.
Layer Outputs:
The first two LSTM layers (lstm and lstm_1) have an output shape of (None, 30, units), where None represents the variable batch size, 30 is the sequence length, and units is the number of hidden units in the layer (64 for the first LSTM and 128 for the second). The final LSTM layer (lstm_2) has an output shape of (None, units), where units is the number of hidden units in the layer (set to 64 in this case).
Dense Layers:
Following the LSTM layers, three fully-connected dense layers
(dense, dense_1, and dense_2) are used. These layers perform non-
linear transformations on the extracted features to classify the sign
sequence into one of the possible output categories. The number of
units in each dense layer progressively reduces (64, 32, and 8) as
the network progresses towards the final output layer.
Total Parameters:
The model has a total of 187,496 trainable parameters. This
indicates the model's capacity to learn complex relationships
between the input sign sequences and their corresponding labels.
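As a sketch of how this architecture could be reproduced in TensorFlow/Keras, the listing below builds a model matching the layer sizes described above. The 63-dimensional per-frame input is an inference (it is consistent with 21 hand landmarks x 3 coordinates and with the reported 187,496 trainable parameters); the activations, optimizer, and loss are illustrative assumptions rather than details confirmed by the paper:

# Minimal TensorFlow/Keras sketch of the described architecture.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

SEQUENCE_LENGTH = 30   # frames per sign sequence (as in the model summary)
FEATURES = 63          # assumed: 21 hand landmarks x (x, y, z)
NUM_CLASSES = 8        # matches the 8-unit output layer described above

model = Sequential([
    LSTM(64, return_sequences=True, activation='relu',
         input_shape=(SEQUENCE_LENGTH, FEATURES)),        # -> (None, 30, 64)
    LSTM(128, return_sequences=True, activation='relu'),  # -> (None, 30, 128)
    LSTM(64, activation='relu'),                          # -> (None, 64)
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(NUM_CLASSES, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])
model.summary()  # reports 187,496 trainable parameters for these shapes
# model.fit(X_train, y_train, epochs=200)  # training on the prepared sequences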
Fig: Classification Report

Real-time evaluation of the system in practical scenarios, however, showed a slightly lower accuracy of 96.4%. This discrepancy between train-test split and real-time performance underscores the challenges posed by real-world factors such as varying lighting conditions, background noise, and different hand orientations.

VII. CONCLUSIONS
In conclusion, the advancement of technology, particularly the
integration of MediaPipe Holistic and LSTM, holds immense
promise in breaking down communication barriers between the
deaf and hearing communities. With approximately 63 million
deaf individuals in India alone, the need for effective sign
language conversion systems is critical.
According to the information presented, a variety of approaches
and strategies for sign language recognition have been developed