Nats Project
BY
SUBMITTED TO
22/03/2024
APPROVAL PAGE
This is to certify that this project entitled “Design and Implementation of a Sign Language Translation System” was carried out by Uriri Nathaniel Elo-oghene (VUG/CSC/20/4188) in the Faculty of Natural and Applied Sciences, Veritas University, Abuja, for the award of the degree of Bachelor of Science in Computer Science.
DEDICATION
ACKNOWLEDGEMENTS
I would like to express my deep gratitude to my educational institution. Their unwavering
support, encouragement, and guidance have been invaluable throughout this journey. Their belief
in the significance of this project has fueled my determination, and their insights have greatly
enriched the outcome. This expression of gratitude is a humble acknowledgment of the profound
impact they have had on the success of this endeavor. Their support has been a constant source
of inspiration, and I am truly thankful for their contribution to this project's realization.
Table of Contents
APPROVAL PAGE
ABSTRACT
CHAPTER ONE - INTRODUCTION
1.1 Background of the Study
1.2 Problem Statement
1.3 Research Question
1.4 Aim and Objectives of the Study
1.5 Significance of the Study
1.6 Scope of the Study
1.7 Limitations of the Study
1.8 Overview of Research Method
1.9 Structure of the Project
CHAPTER TWO - LITERATURE REVIEW
2.1 Introduction
2.2 The Concept of Sign Language
2.3 Understanding Sign Language Recognition
2.4 Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation
2.5 Video-Based Sign Language Recognition Without Temporal Segmentation
2.6 Hand Gesture and Sign Language Recognition Techniques
2.7 Real-Time Sign Language Recognition
CHAPTER THREE - RESEARCH METHODOLOGY
3.3 Data Pre-processing
3.4 Statistical Analysis
3.5 Software Development Methodology
3.6 Testing and Validation of the New System
CHAPTER FOUR - SYSTEM DESIGN AND IMPLEMENTATION
4.1 Introduction
4.2 Analysis of Existing Systems
4.2.1 Analysis of Existing Systems: Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation
4.2.2 Analysis of Proposed System: Sign Language Recognition Using Neural Networks
4.2.3 Comparison of Proposed System and Existing System
4.3 Limitations of Existing Systems
4.3.1 Limitations of the Transformer-Based System for Sign Language Recognition and Translation Presented by Camgöz et al.
4.3.2 Limitations of the Proposed System
4.4 Proposed System: Design and Implementation of a Sign Language Translation System
4.4.1 Rationale for Implementation of a Sign Language Translation System Using an LSTM
4.4.2 System Architecture
4.4.3 Advantages of the Proposed System
4.5 SYSTEM MODELLING
4.5.1 System Activity Diagram
4.5.2 Use Case Diagram for Sign Language Translation Web Application
4.5.3 Sequence Diagram for Sign Language Translation Web Application
4.6 AI Process
4.6.1 Data Collection
4.6.2 Data Processing
4.6.3 Training of the Model
4.6.4 Testing of the Model
4.6.5 Model Evaluation
4.6.6 Model Deployment (Mobile or Web)
4.7 Functionality/Experimentation and Results 1
4.8 Functionality/Experimentation and Results 2
4.9 Functionality/Experimentation and Results 3
4.10 Functionality/Experimentation and Results 4
CHAPTER 5 - SUMMARY AND CONCLUSION
5.1 Summary
5.2 Contribution of Research and Conclusion
5.3 Recommendations
References
APPENDIX
LIST OF FIGURES
ABSTRACT
In today's interconnected world, artificial intelligence (AI) plays a big role in breaking down
language barriers and connecting people across distances. However, people with hearing
impairments, especially those in the Deaf community, still face significant communication
challenges. Sign language is their primary way of communicating, but it is not widely
understood, creating difficulties in societies where spoken and written languages dominate.
This research focuses on using AI to improve communication for the Deaf community. It looks
at how AI can help bridge the gap between Deaf individuals and the rest of the world, supporting
efforts to promote inclusivity and equal access to information. Despite progress in AI, translating
sign language remains a challenge due to the lack of data for training AI models and the
complexity of sign languages, which use both hands and facial expressions.
The study aims to develop an AI-powered system that can translate sign language in real-time. It
examines current systems, identifies their shortcomings, and proposes a more effective solution.
The goal is to create a user-friendly interface that allows Deaf individuals and those unfamiliar with sign language to communicate with one another. The study uses a research method that involves designing and testing a new system to see how well it
works. Despite its limitations, this study contributes to the ongoing effort to improve sign
language translation and address the communication needs of the Deaf community.
CHAPTER ONE - INTRODUCTION
In the contemporary interconnected world driven by technology, artificial intelligence (AI) has transformed how people communicate across linguistic and geographical constraints. This transformation, facilitated by real-time language translation, has, however, left a significant portion of society, the Deaf community, behind.
Sign language serves as the primary mode of expression for the Deaf community, with its rich visual language, distinct grammar, syntax, and cultural nuances. Despite its expressive nature, it is not widely understood outside the Deaf community, creating difficulties in societies where spoken and written languages dominate.
This research addresses the intersection of AI and sign language understanding to enhance
inclusive communication for the Deaf community. It explores the potential of AI to bridge the
communication gap between Deaf individuals and the broader world, considering both
technological advancements and social implications. The study aligns with the global drive toward inclusivity and equal access to information.
However, the field of AI sign language translation (SLT) faces notable research gaps. Sign
languages, being low-resource languages, lack sufficient data for training SLT models due to
factors such as their limited usage compared to spoken languages. Additionally, the visual
complexity of sign languages, incorporating information from both hands and face, poses
challenges for accurate recognition and translation by SLT models. Furthermore, the multitude of
sign languages and dialects worldwide requires SLT systems to adapt to diverse linguistic inputs
and outputs.
Addressing these research gaps is essential for the widespread deployment of AI SLT systems.
Affordability and accessibility for the Deaf community, integration with existing technologies
like video conferencing and social media platforms, and adaptation to various sign languages and
dialects are crucial considerations. Despite challenges, the AI SLT sector is progressing rapidly,
with ongoing developments in data collection, model refinement, and evaluation methods.
Continued research and development hold the potential to revolutionize communication for deaf and hard-of-hearing individuals.
A range of studies have explored the development of sign language recognition systems. (Vargas
2011) and (K 2022) both proposed systems using neural networks for image pattern recognition,
with Vargas focusing on static signs and K on alphabets and gestures. (Holden 2005) developed
a system for Australian sign language recognition, achieving high accuracy through the use of
Hidden Markov Models and features invariant to scaling and rotation. (Cho 2020) further described a recognition pipeline involving image acquisition, pre-processing, and training with a multilayer perceptron and gradient descent
momentum. These studies collectively demonstrate the potential of neural networks and other
advanced techniques in the development of accurate and efficient sign language recognition
systems.
1.2 Problem Statement
American Sign Language (ASL) is the language of choice for most deaf people in the United States (Starner, T. 1995). ASL uses approximately 6,000 gestures for common words and finger spelling for communicating obscure words or proper nouns. Because of the lack of data on sign languages, building a sign language translation (SLT) system is a major challenge. SLT systems are used to translate sign language into spoken language and vice versa. This technology has the potential to educate non-signers about signs and help signers communicate with non-signers much faster. However, the lack of sign language data is a major obstacle to the development and deployment of such systems.
Sign languages are low-resource languages, meaning that there is relatively little data available for training machine learning models to distinguish among the roughly 6,000 possible classes a sign language may contain. This is due to a number of factors, including the fact that sign languages are not as widely used as spoken languages, and that it is more difficult to collect and annotate sign language data. The research identifies these problems and frames the research questions presented in the next section.
1.3 Research Question
The factors that affect the design and implementation of a sign language translation system are
not well understood. Recently, researchers have tried to figure out how different priorities impact
these systems, but the results are unclear. There have been calls for further research to resolve the conflicting findings in the literature. This study aims to design and implement a system capable
of interpreting at least 30 different signs in American Sign Language. Thus, the research questions are:
i. How can SLT models be made more robust to variations in lighting?
ii. How can SLT models be made more resistant to changes in the signer's background?
iii. What kind of data would be helpful to train SLT models to recognize a wide range of signers?
1.4 Aim and Objectives of the Study
Objectives
II. Create a User-Friendly and Accessible Interface: Design an interface that caters to both
Deaf individuals and those unfamiliar with sign language. Prioritize accessibility and ease of use
for all users.
III. Refine the System Through User-Centered Design: Conduct extensive user testing to
gather valuable feedback. Employ an iterative process to continuously improve the system's
accessibility and user experience, focusing on the specific needs of Deaf users.
1.5 Significance of the Study
Thus, the significance of this work is that it addresses a specific set of communication needs. This
system could cater to basic but crucial expressions, allowing for essential interactions between
individuals who use these particular signs. While it may not cover the entirety of ASL, it still
provides a valuable tool for those who rely on these specific signs, contributing to improved communication and inclusivity.
1.6 Scope of the Study
This project is focused on the design and implementation of a sign language translation system that can understand at most 40 different single-hand signs in American Sign Language (ASL).
1.7 Limitations of the Study
This project, being limited to understanding at most 40 different single-hand signs in ASL, has
inherent limitations that warrant acknowledgment. Firstly, as with any research focused on a
specific dataset, the system's effectiveness cannot be universally established based on this
singular study. Secondly, the study is confined to a particular scope within the domain of ASL
and does not encompass the entirety of the sign language lexicon. Consequently, caution is
advised when generalizing the outcomes beyond the defined set of signs and their interpretations.
Furthermore, the study's applicability is bound to a specific context, and the results may not be
universally representative. The limited scope of 40 signs raises concerns about the system's applicability to larger vocabularies. Additionally, the study may not account for the nuances and variations in sign language used by different signers, which may affect performance in practical scenarios.
Another constraint is the focus on single-hand signs, excluding the complexities that arise in
signs involving both hands, facial expressions, and other non-manual markers. This limitation may affect the system's ability to comprehensively interpret the richness of sign language.
It is important to note that the development of this system was constrained by available resources
and time, leading to a narrowed focus on a specific subset of signs. Consequently, the findings
and functionalities may not universally apply, and the limitations of the study should be kept in mind when interpreting its results. It is crucial to recognize that the current work, while contributing to sign language translation, may not generalize to the full breadth of sign language communication.
1.8 Overview of Research Method
The chosen research method for this project is Design Science Research (DSR). Design Science
Research is a collection of synthetic and analytical techniques and viewpoints that complement
positivist, interpretive, and critical perspectives in conducting research within the field of
Information Systems. This approach encompasses two key activities aimed at enhancing and
comprehending the behavior of various aspects of Information Systems: firstly, the generation of
new knowledge through the design of innovative artifacts, whether they be tangible items or
processes; and secondly, the analysis of the artifact's utilization and/or performance through reflection and abstraction.
Figure 1.8: Schematic diagram of DSR adopted in this study.
The artifacts created in the design science research process include, but are not limited to, constructs, models, methods, and instantiations. This study adopts the Design Science Research
methodology, which offers specific guidelines for evaluation and iteration within research projects. Figure 1.8 represents the schematic diagram of the research method. Design Science knowledge, then, takes the form of constructs, techniques and methods, models, and well-developed theory for performing the mapping of knowledge for creating artifacts that satisfy given sets of functional requirements.
1.9 Structure of the Project
This project work is organized into five chapters. Chapter one introduces the topic, Design and Implementation of a Sign Language Translation System, a case study on communication with non-signers, and presents a background of the work. It discusses the problems and challenges faced by most existing methods of solving the problem of communication with deaf individuals. Chapter
two delves into the literature review of former work carried out on the topic. Chapter three gives
the design analysis and project consideration based on the diagram above. Chapter four presents
the proposed implementation and finally, chapter five contains the summary, conclusion and
recommendations. It is finally rounded off with the references and appendix consisting of code
of the implementation.
CHAPTER TWO- LITERATURE REVIEW
2.1 Introduction
Neural networks have greatly shaped Sign Language Translation, evolving from recognizing
isolated signs to the complex task of translating sign languages. Necati Cihan Camgöz and team
played a crucial role, challenging the idea that sign languages are mere translations of spoken
languages. Their notable work (Camgöz et al., 2018) stressed the unique structures of sign
languages.
They introduced Continuous Sign Language Translation, using advanced deep learning models
to translate continuous sign language videos into spoken language. A key moment was the creation of the RWTH-PHOENIX-Weather 2014T dataset. Camgöz et al. set a performance benchmark with their models achieving a BLEU-4 score of up to 18.13, and their later work (Camgöz et al., 2020) joined recognition and translation in a unified framework.
The collaborative work of Danielle Bragg, Oscar Koller, Mary Bellard, et al. (2019) in "Sign Language Recognition, Generation, and Translation" advances the field. They highlight the need for larger datasets and standardized annotation systems, advocating for interdisciplinary collaboration.
The paper is a must-read for researchers, offering a clear review of the state of the art and steps for future research. It is a critical resource for developing technologies that bridge communication gaps.
(Huang, Zhou, Zhang, et al., 2018) break barriers by eliminating temporal segmentation in sign language recognition. Their Hierarchical Attention Network with Latent Space (LS-HAN) showcases recognition of continuous sign language without segmenting it first.
"A Review of Hand Gesture and Sign Language Recognition Techniques" by (Ming Jin Cheok,
Zaid Omar, and Mohamed Hisham Jaward, 2017) comprehensively explores recognition
methods and challenges. The paper sheds light on the intricacies of gesture and sign language
recognition, emphasizing the need for context-dependent models and the incorporation of three-dimensional information.
Despite progress, the paper acknowledges limitations and calls for sustained interdisciplinary research to advance gesture and sign language recognition. The transformative potential of this technology extends well beyond any single application domain.
2.2 The Concept of Sign Language
Sign languages are comprehensive and distinct languages that manifest through visual signs and
gestures, fulfilling the full communicative needs of deaf communities where they arise (Brentari
& Coppola, 2012). Researchers such as Diane Brentari and Marie Coppola have explored how
these languages are created and develop, a process that occurs when specific social conditions
allow for the transformation of individual gesture systems into rich, communal languages
(Brentari & Coppola, 2012). Emerging sign languages serve as a primary communication system
Brentari and Coppola highlight that the creation of a new sign language requires two essential
elements: a shared symbolic environment and the ability to exploit that environment, particularly
by child learners. These conditions are met in scenarios where deaf individuals come together
and collectively evolve their individual homesign systems into a community-wide language
Developmental Pathways
The evolution of sign languages can follow various trajectories. One common pathway involves
the establishment of institutions, like schools for the deaf, which become hubs for this evolution.
A pertinent example is Nicaraguan Sign Language, which developed rapidly when a special
education center in Managua expanded in 1978, bringing together a large deaf population. This
setting facilitated the transition from homesign to an established sign language through what
Brentari and Coppola describe as the 'initial contact stage', followed by the 'sustained contact
stage' as the language was adopted by subsequent generations (Brentari & Coppola, 2012).
2.3 Understanding Sign Language Recognition
Sign Language Recognition has been a growing area of research for several decades. However, it
was not until recently that the field has evolved towards the more complex task of Sign
Language Translation. In the seminal work of Necati Cihan Camgöz and colleagues, a significant
shift is proposed from recognizing isolated signs as a naive gesture recognition problem to a
neural machine translation approach that respects the unique grammatical and linguistic structures of sign languages.
Previous studies primarily focused on SLR with simple sentence constructions from isolated
signs, overlooking the intricate linguistic features of sign languages. These studies operated
under the flawed assumption that there exists a direct, one-to-one mapping between signed and
spoken languages. In contrast, the work by Camgöz et al. acknowledges that sign languages are
independent languages with their own syntax, morphology, and semantics, and are not merely visual renderings of spoken languages.
To address these limitations, Camgöz and his team introduced the concept of Continuous SLT,
which tackles the challenge of translating continuous sign language videos into spoken language
while accounting for the different word orders and grammar. They utilize state-of-the-art deep learning models to extract representations from continuous sign language videos and map these to spoken or written language (Camgöz et al., 2018). A milestone in this field is the creation of the RWTH-PHOENIX-Weather 2014T dataset, the first publicly available
continuous SLT dataset. This dataset contains sign language video segments from weather
broadcasts, accompanied by gloss annotations and spoken language translations. The dataset is
pivotal for advancing research, allowing for the evaluation and development of SLT models
(Camgöz et al., 2018). Camgöz et al. set the bar for translation performance with their models
achieving a BLEU-4 score of up to 18.13, creating a benchmark for future research efforts. Their experiments explored various tokenization methods, attention schemes, and parameter configurations (Camgöz et al.,
2018).
2.4 Sign Language Transformers: Joint End-to-end Sign Language Recognition and
Translation
The pioneering work by (Camgöz et al., 2020) addresses the inherent challenges in the domain of sign language translation with a transformer-based architecture. This novel approach leverages a Connectionist Temporal Classification loss,
integrating the recognition of continuous sign language and translation into a single unified
framework, which leads to significant gains in performance. The preceding efforts in sign
language translation primarily relied on a mid-level sign gloss representation, which is crucial for
translation models, as supported by prior research in the field. Camgöz and colleagues delineate
sign glosses as minimal lexical items that correlate spoken language words with their respective signs.
Camgöz et al.'s (2020) research contributes to addressing critical sub-tasks in sign language
translation, such as sign segmentation and the comprehensive understanding of sign sentences.
These sub-tasks are vital since sign languages leverage multiple articulators, including manual
and non-manual features, to convey information. The grammar disparities between sign and
spoken languages necessitate models that can navigate the asynchronous multi-articulatory
nature and the high-dimensional spatio-temporal data of sign language. Additionally, this work
underscores the utilization of transformer encoders and decoders to handle the compound nature
of translating sign language videos into spoken language sentences. Unique to their approach is
the fact that ground-truth timing information is not a prerequisite for training, as their system can
concurrently solve sequence-to-sequence learning problems inherent to both recognition and translation. Reported experiments demonstrate the approach's effectiveness, with improvements over existing sign video to spoken
language, and gloss to spoken language translation models. The results indicate more than a
doubling in performance in some instances, establishing a new benchmark for the task. Camgöz et al.'s (2020) transformative research on sign language transformers thus makes a substantial
contribution to the fields of computer vision and machine translation, setting the stage for
advanced developments in accessible communication technologies for the Deaf and hard-of-
hearing communities.
2.5 Video-Based Sign Language Recognition Without Temporal Segmentation
The work of Huang, Zhou, Zhang, et al. introduces a significant shift away from conventional approaches.
Traditionally, sign language recognition has been bifurcated into isolated SLR, which handles
the recognition of words or expressions one at a time, and continuous SLR, which interprets
whole sentences. Continuous SLR has hitherto relied on temporal segmentation, a process of pre-processing videos to identify individual word or expression boundaries, which is not only challenging due to the subtlety and variety of transition movements in sign language but also demands finely labelled datasets.
To combat the limitations of existing methods, (Huang et al.) propose a novel framework called the Hierarchical Attention Network with Latent Space, which aims to forgo the need for temporal segmentation. Combining a convolutional neural network for feature extraction and a Hierarchical Attention Network, their method pays detailed
attention to both the global and local features within the video to facilitate the understanding of
sign language semantics without the need for finely labeled datasets, creating a sort of semi-self-supervised approach.
Furthermore, the authors address an existing gap in sign language datasets by compiling a
comprehensive Modern Chinese Sign Language dataset with sentence-level annotations. This
contribution not only aids their proposed framework but also provides a resource for future
research in the field. The effectiveness of the LS-HAN framework is substantiated through
experiments conducted on two large-scale datasets, highlighting its potential to revolutionize the
landscape of sign language recognition by circumventing some of the field's most persistent
challenges.
The document "A review of hand gesture and sign language recognition techniques" by (Ming
Jin Cheok, Zaid Omar, and Mohamed Hisham Jaward) presents a comprehensive examination of
the methods and challenges in hand gesture and sign language recognition. The authors illustrate
communication for the deaf and hard-of-hearing individuals, tracing its development alongside
methodically explores various algorithms involved in the recognition process, which are
organized into stages such as data acquisition, pre-processing, segmentation, feature extraction,
and classification, and offers insights into their respective advantages and disadvantages.
In particular, the authors delve into the intricacies and hurdles inherent in gesture recognition,
ranging from environmental factors like lighting and viewpoint to movement variability and the
use of aids like colored gloves for improved segmentation. Additionally, the review provides an
extensive look at sign language recognition, primarily focusing on American Sign Language (ASL) while also mentioning other sign languages from around the world. The necessity for context-dependent models and the incorporation of three-dimensional information is emphasized.
Despite the progress, the paper acknowledges the limitations of present technologies, especially
concerning their adaptability across different individuals, which significantly affects recognition
accuracy. The authors suggest that future research could focus on overcoming these limitations
to create more robust and universally applicable recognition systems. Concluding on a visionary
note, the paper underscores the transformative potential of gesture and sign language recognition,
not only within specific application domains but also in fostering more inclusive communication
tools, and calls for sustained interdisciplinary research to advance the state of the art in this
domain.
2.7 Real-Time Sign Language Recognition
"Real Time Sign Language Recognition" by Pankaj Kumar Varshney, G. S. Naveen Kumar, Shrawan Kumar, et al. presents a substantive contribution to the domain of assistive technology, specifically concerning the deaf and mute community. Varshney and colleagues embark on a mission to bridge the
communication gap faced by individuals with speech and hearing impairments through the
recognition of the static alphabet gestures of American Sign Language, with an exception for the dynamically oriented signs 'J' and 'Z.'
The researchers methodically underscore the significance of sign language as a primary tool for
non-verbal communication, using visual cues like hands, eyes, facial expressions, and body
language. The study underscores the intricate challenge of creating a system that can seamlessly
interpret these signals in real time, a task compounded by the substantial variation in signing styles across individuals.
The heart of the research lies in its exploration of a vision-based approach over a data glove
method for gesture detection, arguing for the former's intuitiveness in human-computer
interaction. This choice reflects a keen understanding that technology for the impaired must be natural and unobtrusive to use.
Interestingly, the authors note the system's potential reach beyond its assistive purpose,
highlighting applications in fields such as gaming, medical imaging, and augmented reality. This
prospect indicates that the technological advancements spurred by this research might ripple outward into other domains.
However, the study is not without its limitations. The vision-based approach, while innovative,
faces potential challenges in terms of gesture and posture identification versatility and the
accurate interpretation of dynamic movements. These are concerns that merit further exploration,
with the study likely serving as a springboard for additional research geared at refining real-time recognition.
In conclusion, "Real Time Sign Language Recognition" is a pivotal work that addresses a critical
societal need while also opening avenues for extensive applications across multiple technological
spheres. Its clear focus on enhancing communication accessibility using neural networks makes
it a notable cornerstone in the field of human-computer interaction and assistive technology. The
researchers' contribution is commendable for both its direct impact on the lives of those with
hearing and speech disabilities and its potential to inform subsequent innovations.
CHAPTER THREE – RESEARCH METHODOLOGY
3.1 Introduction
The architecture of the system will utilize a Long Short-Term Memory (LSTM) neural network for real-time sign language recognition, predicting sign classes from keypoints extracted from sequences of frames, together with a WebRTC-based video component and the Streamlit web library in Python.
Figure 3.1: User Interface of the Sign Language Translation Web Application
Webcam Integration: The application utilizes the webcam to capture video of the user's hands performing signs.
Keypoint Extraction: For each video frame (image), the system extracts key points that represent the location and orientation of the hands and fingers (imagine these as tiny dots marking the joints).
Sequence Building: These key points are captured for a set number of frames (e.g., 30 frames) to form a sequence representing one sign.
Model Inference: The sequence is fed into the model to perform a prediction; the model outputs the most likely sign class.
This section details the methods used to create the sign language dataset for training the translation model.
● Primary Sources: Video recordings: For this study, I created the dataset myself by capturing keypoint sequences from my own sign language gestures, recording 30 different sequences for each sign class and extracting 30 frames of data per sequence. This type of sequential data is essential for training Long Short-Term Memory (LSTM) networks. LSTMs require a minimum amount of sequential data to achieve even basic recognition performance.
● Secondary Sources: Sign language dictionaries and resources: I consulted sign language dictionaries and online resources to confirm the standard form of each sign.
3.3 Data Pre-processing
The collected data underwent a preprocessing stage to prepare it for model training. This involved:
● Video segmentation: Videos were segmented into individual sign instances, isolating each sign for labelling.
● Data normalization: Sign data (both video and motion capture) was normalized to a consistent range and format.
3.4 Statistical Analysis
Accuracy: This core metric reflects the overall proportion of signs translated correctly. In my case, the model achieved an accuracy of 80%, indicating a strong ability to translate most signs
accurately.
Precision: Precision delves deeper, measuring the exactness of positive predictions. A precision
of 86.7% signifies that when the model identifies a sign, there's an 86.7% chance it's the correct
sign. This demonstrates the model's proficiency in accurately identifying true signs.
Recall: Recall focuses on the model's ability to capture all relevant signs. The model achieved a
recall of 80%, indicating it successfully translates a high percentage of actual signs presented to
it.
F1-Score: To gain a balanced view, we employed the F1-Score, which combines precision and recall. Our model's F1-Score of 78.7% suggests a good balance between identifying true signs and capturing all relevant signs.
These statistical analyses provide valuable insights. The high accuracy signifies the model's
overall effectiveness, while the breakdown by precision and recall helps pinpoint areas for
potential improvement. Future work might involve gathering more data or refining the model
architecture to enhance both the accuracy of positive predictions and the ability to capture all
signs correctly.
3.5 Software Development Methodology
Design Science Research Methodology (DSRM) isn't a rigid, step-by-step process; it's a flexible approach that emphasizes knowledge creation through the design and development of artifacts. Unlike traditional scientific methods that focus on explaining how things work, DSRM centers on creating innovative solutions to address real-world problems (Hevner et al., 2004). This iterative approach allows researchers to continuously refine their artifacts based on user feedback and evaluation results.
Figure 3.5: Design Science Research Methodology (Adapted from Peffers et al. 2008)
● Problem identification: DSRM begins with the identification of a need or problem within a specific domain. The research is driven by the desire to create an artifact that addresses this need.
● Iterative development: DSRM is not a linear process. Researchers build and evaluate prototypes of their artifact in cycles, allowing them to learn from each iteration and refine the design.
● Evaluation: A crucial aspect of DSRM is the rigorous evaluation of the designed artifact. This evaluation can involve a variety of methods, such as user testing, performance measurement, or comparison with existing solutions.
● Contribution to knowledge: Beyond the creation of the artifact itself, DSRM seeks to contribute to the broader body of knowledge in the field. This can involve the generalization of design principles and lessons learned from building and evaluating the artifact.
3.6 Testing and Validation of the New System
The following tests were carried out to validate the new system.
● Sign Detection: The system correctly detects signs (single-handed). Recommendation: consider using a pre-trained detection model and test with signers of different ethnicities and genders to ensure generalizability.
● Accuracy: The system classifies detected signs into the correct sign classes. Recommendation: test across a diverse set of signs; some signs still need improvement.
● Real-Time Performance: The system displays the predicted sign with minimal latency. Recommendation: measure the time between the end of the sign execution (30 frames) and the appearance of the prediction to quantify the user experience.
● Camera and WebRTC Integration: The system successfully integrates with the webcam and uses WebRTC. Recommendation: verify that the video stream and the system function correctly across browsers and devices.
● User Interface: The UI elements respond as expected; the screen darkens slightly while a prediction is processed. Recommendation: test UI elements for responsiveness and performance.
● Lighting: Tested under different lighting conditions (dim, natural, artificial). Recommendation: consider adding image pre-processing techniques to handle lighting variations.
● Background: The system focuses on the signer's hands. Recommendation: test with various backgrounds.
CHAPTER FOUR - SYSTEM DESIGN AND IMPLEMENTATION
4.1 Introduction
This section explores the challenges of building a sign language translation system that captures video in real time and predicts what a sign might mean in American Sign Language.
4.2 Analysis of Existing Systems
This section analyzes existing sign language recognition methods and techniques, focusing on how they handle the nuances of understanding gestures and signs in video frames and images.
4.2.1 Analysis of Existing Systems: Sign Language Transformers: Joint End-to-end Sign
Language Recognition and Translation
The researchers developed a novel transformer-based architecture for sign language recognition
and translation which is unified and capable of being trained end-to-end. Their system utilizes a
Connectionist Temporal Classification loss to bind the recognition and translation problems into
one architecture. This approach does not require ground-truth timing information, which refers to
the precise moment when a sign begins and ends in a video sequence—a significant challenge in continuous sign language video. The architecture jointly addresses two sequence-to-sequence learning problems, where the first sequence is a video of continuous sign language and the second sequence is the corresponding spoken language sentences. The model must detect and segment sign sentences in a continuous video stream, understand the conveyed information within each segment, and generate the corresponding spoken language output.
An essential component of the system is the recognition of sign glosses—spoken language words
that represent the meaning of individual signs and serve as minimal lexical items. The
researchers' model recognizes these glosses using specially designed transformer encoders that
are adept at handling high-dimensional spatiotemporal data (such as sign videos). These
encoders are trained to understand the 3D signing space and how signers interact and move
within that space. Additionally, the transformer architecture also captures the multi-articulatory
nature of sign language, where multiple channels are used simultaneously to convey information
—including manual gestures and non-manual features like facial expressions, mouth movements,
and the position and movements of the head, shoulders, and torso.
For the translation task, once the sign glosses and their meaning have been understood, the
system embarks on converting this information into spoken language sentences. This conversion
is not straightforward due to the structural and grammatical differences between sign language
and spoken language, such as different ordering of words and the use of space and motion to
convey relationships between entities. To address this, the researchers applied transformer
decoders, which take the output from the transformer encoders and generate a sequence of
spoken language words. This output is then shaped into coherent sentences that represent the
meaning of the sign language input. In sum, the sophisticated encoder-decoder structure of the
transformers allows the system to perform both recognition and translation tasks effectively, and
4.2.2 Analysis of Proposed system: Sign Language recognition using Neural Networks
The proposed system captures keypoints from sequences of images using OpenCV and Google's MediaPipe Holistic model, which detects the face, pose, and hand landmarks as keypoints. The dataset is stored as several sequences of frames, with the keypoints of each frame pushed into a NumPy array. Thereafter, the system is trained using a Long Short-Term Memory (LSTM) deep learning model composed of three LSTM layers and three Dense layers. This model was trained for 2000 epochs with a batch size of 128 on the extracted dataset, minimizing the categorical cross-entropy loss with the Adam optimizer. Finally, after building the neural network, real-time sign language recognition is performed using Streamlit and OpenCV, where the gestures are recognized and displayed as text within the highlighted area.
4.2.3 Comparison of Proposed system and existing system
● Focuses on real-time detection and classification: This application prioritizes processing
live video streams and providing instant sign language action recognition.
● Uses keypoints for prediction: It extracts keypoints from body pose, face, and hands to
represent signs and feeds them to a pre-trained machine learning model for classification.
The existing system, Sign Language Transformers, in contrast:
i. Aims for full translation: It provides an end-to-end trainable model that can both recognize signs and translate them into spoken language sentences.
ii. Processes video frames directly: The model takes full video frames as input, rather than extracted keypoints.
iii. Leverages transformers for complex tasks: The system utilizes a powerful transformer architecture capable of handling high-dimensional data and understanding the intricacies of sign language.
Level of translation: The first system classifies individual signs, while Sign Language Transformers aim for full sentence translation. Data processing: The first system uses keypoints, while Sign Language Transformers process entire video frames. Model architecture: The first system uses a pre-trained machine learning model, while Sign Language Transformers use a transformer encoder-decoder trained end-to-end.
4.3 Limitation of Existing Systems
Existing sign language translation systems often face limitations in accuracy, real-time
performance, and user experience. Accuracy can be hampered by factors like limited training
data, variations in signing styles, and complex hand movements. Real-time translation can be
computationally demanding and may struggle with rapid signing or background noise. User
interfaces might not be intuitive for all users, especially those unfamiliar with sign language.
These limitations can hinder effective communication and highlight the need for ongoing research and improvement.
4.3.1 Limitations Of The Transformer-based System For Sign Language Recognition And
Translation Presented By Camgöz et al
Domain Limitation: The state-of-the-art results achieved by the system are within the context of
a limited domain of discourse which is weather forecasts (Camgöz et al., 2020). The
performance in more generic or diverse sign language contexts may not be as high, indicating a
limitation in the system's ability to generalize across different subjects or domains outside of this
controlled scope.
Recognition Complexity: Despite advances, the system must address the challenge of
recognizing sign glosses from high-dimensional spatiotemporal data, which is complex due to
the asynchronous multi-articulatory nature of sign languages (Camgöz et al., 2020). The models
need to accurately comprehend the 3D signing space and understand what these different aspects of signing convey within that space.
Sign Segmentation: The translation system needs to accomplish the task of sign segmentation,
detecting sign sentences from continuous sign language videos (Camgöz et al., 2020). Unlike
text, which has punctuation, or spoken languages that have pauses, sign language does not have
obvious delimiters, making segmentation for translating continuous sign language a persistent
issue that is not yet fully resolved in the literature (Camgöz et al., 2020).
4.3.2 Limitations Of The Proposed System
Limited Vocabulary and Sign Language Variations: LSTMs require a substantial amount of
training data to learn the complex relationships between keypoints and their corresponding signs.
This can be challenging for sign languages with vast vocabularies, limiting the system's ability to scale to a larger set of signs.
Dependency On Keypoint Detection Accuracy: The system's performance heavily relies on the accuracy of keypoint detection. Errors in identifying keypoints (e.g., due to lighting, occlusion, or unusual hand orientations) can propagate into incorrect predictions. Training and running such models can also be computationally demanding, requiring significant processing power and time. Additionally, real-time translation necessitates low-latency inference.
Limited Context and Idioms: Sign language heavily relies on facial expressions and body
language for conveying context. Current systems struggle to capture these subtleties, leading to potential misinterpretations: pointing to an object might have a different meaning from just pointing, but the current proposed system does not capture such contextual cues.
Background Noise and Occlusions: Real-world environments often have background noise or
occlusions (e.g., from other people or objects). These factors can disrupt keypoint detection and lead to inaccurate translations.
Lighting Variations: Changes in lighting conditions can affect the quality of video frames, impairing keypoint detection.
4.4 Proposed System: Design and Implementation of a Sign Language Translation System
This section introduces a system utilizing a Long Short-Term Memory (LSTM) model and a pretrained model for
detecting landmark keypoints of the face, pose, and hands for sign language translation. By addressing the limitations identified in existing systems (Section 4.3), this approach aims to
provide a more comprehensive and efficient method for Sign language translation to non-signers.
4.4.1 Rationale for Implementation of a sign language translation system Using an LSTM
Sign language translation systems bridge the communication gap between signers and non-signers.
Information Access: Sign language translation systems can provide real-time access to spoken
information for deaf and hard-of-hearing individuals. This includes lectures, meetings, and everyday conversations.
Improved Participation: These systems can enable deaf and hard-of-hearing people to actively
participate in conversations and express themselves more readily. This fosters inclusivity and a greater sense of belonging.
4.4.2 System Architecture
The system combines a Long Short-Term Memory (LSTM) neural network for real-time sign language recognition, a WebRTC-based video component, and the Streamlit web library in Python. Here's a breakdown of its components:
1. Input Data: The system expects a sequence of keypoint vectors extracted from video frames,
representing the user's body language (pose, hands, face). The input shape is defined as `(30,
1662)`, where:
30: Represents the number of frames (time window) used to capture the sign language gesture.
1662: Represents the dimensionality of each keypoint vector, containing information about the pose, face, and hand landmarks.
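As an illustration of where the 1662 figure comes from, the sketch below flattens the MediaPipe Holistic landmarks into a single vector per frame. The landmark counts (33 pose, 468 face, 21 per hand) are the library's standard output; the helper name extract_keypoints is illustrative rather than taken from the project code.

```python
import numpy as np

def extract_keypoints(results):
    # Pose: 33 landmarks x (x, y, z, visibility) = 132 values
    pose = (np.array([[lm.x, lm.y, lm.z, lm.visibility]
                      for lm in results.pose_landmarks.landmark]).flatten()
            if results.pose_landmarks else np.zeros(33 * 4))
    # Face: 468 landmarks x (x, y, z) = 1404 values
    face = (np.array([[lm.x, lm.y, lm.z]
                      for lm in results.face_landmarks.landmark]).flatten()
            if results.face_landmarks else np.zeros(468 * 3))
    # Each hand: 21 landmarks x (x, y, z) = 63 values
    lh = (np.array([[lm.x, lm.y, lm.z]
                    for lm in results.left_hand_landmarks.landmark]).flatten()
          if results.left_hand_landmarks else np.zeros(21 * 3))
    rh = (np.array([[lm.x, lm.y, lm.z]
                    for lm in results.right_hand_landmarks.landmark]).flatten()
          if results.right_hand_landmarks else np.zeros(21 * 3))
    # 132 + 1404 + 63 + 63 = 1662 values per frame
    return np.concatenate([pose, face, lh, rh])
```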
Model Architecture:
1. Sequential Model: A sequential model is used to stack multiple LSTM layers for effective learning of temporal features.
2. LSTM Layers:
3 LSTM layers: The model utilizes three LSTM layers with the following configurations:
1st LSTM Layer: 64 units, ReLU activation, returns sequences.
2nd LSTM Layer: 128 units, ReLU activation, returns sequences.
3rd LSTM Layer: 64 units, ReLU activation, does not return sequences (flattens the output).
LSTMs are adept at capturing long-term dependencies present in sequential data, which is why they were chosen.
3. Dense Layers:
2 Dense layers: Two fully-connected dense layers are added after the LSTM layers:
These layers help extract higher-level features from the LSTM outputs.
4. Output Layer:
Dense layer with Softmax activation: The final layer has a number of units equal to the
number of actions in the sign language vocabulary (e.g., "Hello", "Thanks", "I love you").
The Softmax activation ensures the output probabilities sum to 1, representing the likelihood of each sign.
Model Training:
1. Optimizer: Adam optimizer is used for efficient gradient descent during training.
2. Loss Function: Categorical cross-entropy is used as the loss function, suitable for multi-class classification.
3. Metrics: Categorical accuracy is used as a metric to monitor training progress and evaluate
model performance.
4. Training Data: The model is trained on pre-processed training data (`X_train`) containing
sequences of keypoint vectors and corresponding labels (`y_train`) indicating the performed sign
language actions.
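A minimal Keras sketch of this architecture follows. The LSTM layer sizes, activations, loss, optimizer, and metric match the description above; the two dense layer widths (64 and 32) and the number of output classes are assumptions, since the text does not state them.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

NUM_ACTIONS = 3  # number of sign classes (three shown for illustration)

model = Sequential([
    # Three stacked LSTM layers over (30 frames, 1662 keypoints per frame)
    LSTM(64, return_sequences=True, activation='relu', input_shape=(30, 1662)),
    LSTM(128, return_sequences=True, activation='relu'),
    LSTM(64, return_sequences=False, activation='relu'),
    # Two dense layers for higher-level features (widths assumed)
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    # One output unit per sign, with probabilities summing to 1
    Dense(NUM_ACTIONS, activation='softmax'),
])

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])
model.summary()
```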
WebRTC Streamer: Streamlit WebRTC component is used to capture the user's webcam video
MediaPipe: The system utilizes MediaPipe's Holistic model for multi-modal human body
landmark detection. It extracts keypoints from the user's face, hands, and pose from the webcam
stream that was used to train the LSTM that is able to predict the sign language.
Preprocessing: The captured video frames are converted from BGR to RGB color format for the MediaPipe model.
Landmark Extraction: MediaPipe processes each frame and outputs keypoint landmarks for the face, hands, and pose.
Keypoint Sequence Building: The extracted keypoints from each frame are combined and stored in a sequence (30 frames). A lock (`sequence_lock`) ensures thread safety for accessing and updating the shared sequence buffer.
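A sketch of this per-frame step is shown below, assuming the extract_keypoints helper sketched earlier and a streamlit-webrtc video callback; the exact wiring in the actual application may differ.

```python
import threading
from collections import deque

import av
import cv2
import mediapipe as mp

holistic = mp.solutions.holistic.Holistic(min_detection_confidence=0.5,
                                          min_tracking_confidence=0.5)
sequence = deque(maxlen=30)        # rolling window of the last 30 frames
sequence_lock = threading.Lock()   # WebRTC callbacks run on a worker thread

def video_frame_callback(frame: av.VideoFrame) -> av.VideoFrame:
    img = frame.to_ndarray(format="bgr24")
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)   # MediaPipe expects RGB input
    results = holistic.process(rgb)
    keypoints = extract_keypoints(results)       # 1662-value vector per frame
    with sequence_lock:                          # thread-safe buffer update
        sequence.append(keypoints)
    return av.VideoFrame.from_ndarray(img, format="bgr24")

# This callback would be passed to streamlit-webrtc, e.g.
# webrtc_streamer(key="sign", video_frame_callback=video_frame_callback)
```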
Model Prediction:
Action Prediction: When the sequence reaches 30 frames (representing a short window of sign
language), the system predicts the performed action using a pre-trained TensorFlow LSTM
model.
Prediction Results: The model takes the sequence of keypoints as input and outputs the
probability of each action in the defined vocabulary (e.g., "Hello", "Thanks", "I love you"). The
action with the highest probability is considered the predicted sign language gesture.
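A sketch of the prediction step follows, under the same assumptions as the previous snippets (the 30-frame buffer and a trained model loaded from disk); the file name and the confidence threshold shown are illustrative.

```python
import numpy as np
from tensorflow.keras.models import load_model

actions = ["Hello", "Thanks", "I love you"]
model = load_model("action.h5")            # file name assumed

def predict_sign(sequence, threshold=0.7):
    """Return the predicted sign once 30 frames have been buffered."""
    if len(sequence) < 30:
        return None
    window = np.expand_dims(np.array(sequence), axis=0)  # shape (1, 30, 1662)
    probs = model.predict(window, verbose=0)[0]          # one probability per action
    best = int(np.argmax(probs))
    # Only report a sign when the model is sufficiently confident
    return actions[best] if probs[best] > threshold else None
```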
Result Display
Landmarks Visualization: If a face is detected, the system overlays the detected landmarks and connections on the video frame.
Sign Prediction Overlay: The predicted sign language action is displayed as text on top of the
video frame.
This architecture allows for real-time sign language recognition by continuously capturing video frames, extracting keypoints, building a sequence, and predicting the user's sign language gestures.
4.4.3 Advantages of the Proposed System
i. Temporal Modelling with LSTMs: LSTMs are well-suited for capturing the temporal dependencies present in sign language gestures. Sign language involves sequences of hand movements and poses over time, and LSTMs can model these sequences effectively.
ii. Real-time Recognition: Employs a WebRTC streamer to capture video frames from the user's
webcam in real-time. This allows for immediate processing and prediction of sign language gestures.
iii. MediaPipe Integration: Leverages MediaPipe's Holistic model for efficient and accurate
multi-modal human body landmark detection. This extracts relevant keypoints from the user's
face, hands, and pose, providing necessary data for the LSTM model.
iv. Scalable Model Architecture: Uses a modular architecture with sequential LSTM layers
followed by dense layers for feature extraction and classification. This allows for easy modification and extension of the model.
v. Reduced Complexity: Utilizing a pretrained model for feature extraction reduces total system complexity.
vi. Cross-platform: The system was developed using a web library, so it is accessible from any device with a modern web browser.
4.5 SYSTEM MODELLING
System modelling is the process of creating abstract models of a system, each with a unique
perspective or viewpoint on that system. The design tool utilized in this study is the Unified
Modeling Language. It is a standard graphical notation for describing software analysis and
designs. UML leverages symbols to describe and document the application development process.
4.5.1 System Activity Diagram
A user flow diagram, also known as a user journey or user process flow, visually represents the
steps a user takes within a system or application to complete a specific task or goal. It shows the
user's actions and interactions with the system, highlighting the pathways and decision points
along the way. The objective of a user flow diagram is to provide a clear and concise overview
of the user's experience and movement within the application. It helps designers and stakeholders
understand the user's journey, identify potential pain points or usability issues, and make informed design decisions.
Figure 4.5.1 illustrates the entire user journey throughout the web app.
4.5.2 Use Case Diagram for Sign Language Translation Web Application
A use case diagram provides a high-level view of the functional requirements of a software application. It illustrates the interactions between users (actors) and the system, showing how they work together to achieve specific goals or tasks. In the context of a sign language translation web application, use case diagrams are used to capture and explain the functional requirements of the system.
Figure 4.5.2 illustrates the functional requirements of the web app.
4.5.3 Sequence Diagram for Sign Language Translation Web Application
A sequence diagram is a type of interaction diagram in UML (Unified Modeling Language) that
illustrates the sequence of interactions between objects or components within a system. It depicts
the flow of messages exchanged between these objects over time to achieve a specific functionality. In this project, the sequence diagram shows how various components, such as the user interface, translation engine, and database, interact with one another.
Figure 4.5.3 illustrates the interactions within the web app.
4.6 AI Process
Artificial intelligence (AI) development follows a structured process. First, relevant data is
collected and prepared for the AI system to learn from. This might involve cleaning, organizing,
and transforming the data into a suitable format. Next, the appropriate AI model type (e.g.,
decision tree, neural network) is chosen based on the task and data characteristics. The model's
architecture, defining its components and connections, is then designed. Training involves
feeding the prepared data to the model, allowing it to learn by adjusting internal parameters to
identify patterns and relationships. The model's performance is then evaluated using unseen data
from a testing set to assess its ability to generalize its learning to new situations. Finally, well-
performing models can be deployed for real-world use, with ongoing monitoring to ensure
continued accuracy and the potential for retraining with new data over time.
4.6.1 Data Collection
For this study, I created the dataset myself by capturing keypoint sequences from my own sign
language gestures. I recorded 30 different sequences for each sign class, extracting 30 frames of
data per sequence. This type of sequential data is essential for training Long Short-Term Memory (LSTM) networks. LSTMs require a minimum amount of sequential data to achieve even basic recognition performance.
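A sketch of this collection loop is shown below, assuming the extract_keypoints helper sketched in the system architecture section and a layout of one .npy file per frame; the folder name, sign list, and file layout are illustrative rather than taken from the project code.

```python
import os
import cv2
import numpy as np
import mediapipe as mp

ACTIONS = ["hello", "thanks", "iloveyou"]    # illustrative sign classes
NUM_SEQUENCES, SEQUENCE_LENGTH = 30, 30      # 30 recordings of 30 frames per sign
DATA_PATH = "MP_Data"                        # assumed output folder

cap = cv2.VideoCapture(0)
with mp.solutions.holistic.Holistic() as holistic:
    for action in ACTIONS:
        for seq in range(NUM_SEQUENCES):
            os.makedirs(os.path.join(DATA_PATH, action, str(seq)), exist_ok=True)
            for frame_num in range(SEQUENCE_LENGTH):
                ok, frame = cap.read()
                if not ok:
                    break
                results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                keypoints = extract_keypoints(results)   # 1662 values per frame
                np.save(os.path.join(DATA_PATH, action, str(seq), str(frame_num)),
                        keypoints)
cap.release()
```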
4.6.2 Data Processing
After data collection using keypoint sequences, data processing is crucial to prepare the data for
training of the LSTM model. Here's a breakdown of the data processing steps:
Missing Values Check: Check for missing keypoints or frames in the sequences. These can be addressed by removing sequences with excessive missing data, or by imputing missing values using techniques like interpolation or filling with the previous/next available values (depending on the context).
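As a hedged illustration of the imputation step, the sketch below linearly interpolates missing frames within one 30-frame sequence using pandas; this is one possible implementation of the approach described above, not necessarily the one used in the project.

```python
import numpy as np
import pandas as pd

def impute_missing_frames(sequence: np.ndarray) -> np.ndarray:
    """Fill missing frames (rows of NaN) in a (30, 1662) keypoint sequence."""
    df = pd.DataFrame(sequence)
    # Linear interpolation between neighbouring frames, then
    # forward/backward fill for gaps at the start or end of the sequence.
    df = df.interpolate(method="linear", axis=0).ffill().bfill()
    return df.to_numpy()

# Example: a sequence with frame 10 missing entirely
seq = np.random.rand(30, 1662)
seq[10, :] = np.nan
clean = impute_missing_frames(seq)
```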
Data Splitting: Divide the processed data into training, validation, and test sets. Training set:
Used to train the LSTM model. Validation set: Used to monitor model performance during
training and prevent overfitting. Test set: Used to evaluate the model's final performance on
unseen data.
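As an illustration of the imputation option mentioned in the Missing Values Check step, the sketch below linearly interpolates frames whose keypoints are entirely zero, which is how missing detections are represented if absent landmarks are zero-filled during extraction. This is a minimal sketch under that assumption, not the exact preprocessing used in this project.
```
import numpy as np

def interpolate_missing_frames(sequence):
    """Linearly interpolate frames whose keypoints are all zero (missing detections).

    sequence: array of shape (num_frames, num_keypoints), e.g. (30, 1662).
    """
    sequence = sequence.copy()
    missing = ~sequence.any(axis=1)          # True where the whole frame is zero
    valid = np.flatnonzero(~missing)         # indices of frames with real detections
    if valid.size == 0:
        return sequence                      # nothing to interpolate from
    for k in range(sequence.shape[1]):       # interpolate each keypoint channel over time
        sequence[missing, k] = np.interp(np.flatnonzero(missing), valid, sequence[valid, k])
    return sequence
```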
For this research, I trained an LSTM model using the Keras library within a Python environment.
The model architecture consists of three stacked LSTM layers with varying hidden unit sizes (64,
128, and 64) and ReLU activation functions. The first two layers utilize (return_sequences=True)
to maintain the sequential nature of the data. A final dense layer with a softmax activation
predicts the most likely action class from the provided keypoint sequences. The model was
compiled with the Adam optimizer, categorical cross-entropy loss function, and categorical
accuracy metric. Training was conducted for 2000 epochs, utilizing a TensorBoard callback to monitor training progress.
After training, tests were performed on the model to quantify its accuracy. The train_test_split function from the scikit-learn library in Python was used to split the dataset into training and testing sets. This is a crucial step in machine learning, ensuring the model doesn't simply memorize the training data and can generalize well to unseen examples. After splitting the dataset, I used the model to perform predictions on the test dataset, expecting it to correctly classify the unseen sequences.
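A minimal sketch of this prediction step is shown below, assuming model, X_test, and y_test are the trained network and held-out arrays produced by the split described above.
```
import numpy as np

# Predict class probabilities for every test sequence, then take the most likely class.
y_prob = model.predict(X_test)        # shape: (num_samples, num_classes)
y_pred = np.argmax(y_prob, axis=1)    # predicted class indices
y_true = np.argmax(y_test, axis=1)    # ground-truth indices from the one-hot labels
```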
4.6.5 Model Evaluation
Model evaluation is a crucial step in the AI process. It determines how well a trained model performs on unseen data. This involves testing the model on a separate dataset (the testing set) and comparing its predictions against the known labels.
Evaluation metrics are quantitative measures used to assess the performance of a trained model. The choice of metrics depends on the specific task and the desired outcome of the model.
Classification Tasks
These metrics evaluate how well a model can categorize data points into predefined classes:
i. Accuracy: The proportion of correct predictions (True Positives + True Negatives) divided by the total number of samples. It's a good overall measure. The accuracy of the trained model is 0.8; it was calculated using the scikit-learn library in Python.
ii. Precision: The proportion of predicted positives that are actually correct (True Positives divided by True Positives + False Positives). It measures how good the model is at identifying actual positives. The precision of the trained model is approximately 0.8667; it was also calculated with scikit-learn.
iii. Recall: The proportion of actual positives that are correctly identified by the model (True Positives divided by True Positives + False Negatives). It measures how good the model is at capturing all relevant positives. The recall of the trained model is 0.8; it was calculated using the same scikit-learn library.
iv. F1-Score: The harmonic mean of precision and recall, combining both metrics into a single score. It provides a balance between precision and recall. The F1-Score of the trained model was likewise computed with scikit-learn.
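The metrics listed above can be computed with scikit-learn as sketched below, reusing the y_true and y_pred arrays from the prediction step; the 'weighted' averaging mode is an assumption for this multi-class task and may differ from the exact settings used to obtain the figures reported here.
```
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average='weighted', zero_division=0)
recall = recall_score(y_true, y_pred, average='weighted', zero_division=0)
f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```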
The model is saved as a Keras model in .h5 format. While this model itself isn't directly deployed as a web app or mobile app, it can be integrated into a larger application for user interaction and real-world deployment. In my sign language translation system, I built a web app to demonstrate how the model performs on live input.
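Loading the saved model for use inside the web app can be as simple as the sketch below; the file name 'action.h5' and the predict_sign helper are illustrative placeholders rather than the exact code used in this project.
```
import numpy as np
from tensorflow.keras.models import load_model

model = load_model('action.h5')   # placeholder file name for the saved .h5 model

def predict_sign(frame_buffer, actions):
    """Predict the current sign from a buffer of the last 30 frames of keypoints.

    frame_buffer: list or array of 30 keypoint vectors (each of length 1662).
    actions: array of sign-class names, in the order used during training.
    """
    sequence = np.expand_dims(np.array(frame_buffer), axis=0)   # shape: (1, 30, 1662)
    probabilities = model.predict(sequence)[0]
    return actions[int(np.argmax(probabilities))]
```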
Experimentation: experiment by performing signs while some of the landmark keypoints are hidden or not detected.
Results: The model can still recognize the signs even if the landmark keypoints are not complete.
Figure 4.7: Illustrating the user interface of the model predicting the correct sign with only half of the landmark keypoints detected.
Experimentation: experiment by making gestures very quickly to see whether the model still makes an accurate prediction.
Results: The model can sometimes mistake signs if they are performed too quickly.
Figure 4.8: Illustrating the user interface of the model predicting the correct sign with complete landmark keypoints.
Experimentation: experiment by making sure the landmark keypoints are complete before the model makes its prediction.
Results: Whenever the keypoints are complete, there is a higher chance that the model will predict the sign correctly.
Figure 4.9: Illustrating the user interface of the model predicting the correct sign with complete landmark keypoints.
The sign language translation web app leverages a combination of cutting-edge technologies to deliver an accurate, efficient, and user-friendly experience. Here's an overview of the key tools used:
○ Description: A robust library that uses only Python to create a polished, interactive web app.
○ Description: During the training process, the data I collected was captured from the user's webcam and processed into keypoint sequences.
● Sign Language Translation Model: Neural Network
○ Description: At the heart of the system lies a neural network trained on a dataset of sequenced frames and their corresponding translations. This model, built using the TensorFlow framework, analyzes the recognized signs and translates them into readable text.
● WebRTC API:
○ Description: Provides real-time access to the user's camera stream directly from the web browser.
This combination of tools allows for a web-based sign language translation app that is not only
accurate but also user-friendly and accessible from any device with a web browser.
This report outlines the potential of a sign language translation system, highlighting its key features, current limitations, and directions for future development.
5.1 Summary
The system facilitates communication between deaf and hearing individuals by translating sign language into spoken or written language, and vice versa. Users can create profiles detailing their communication preferences (sign language type, spoken language) and areas of expertise (if applicable), supporting more personalized conversations. The system can be integrated with various platforms, including video conferencing tools.
Further research and development are necessary to refine the translation accuracy and encompass a wider range of sign languages and dialects. A user-centered design approach should be adopted to ensure the system is intuitive and user-friendly for deaf and hearing individuals with varying levels of technical proficiency. Collaboration with deaf communities and sign language experts is crucial for gathering feedback and ensuring the system effectively addresses their communication needs. Exploring integration with educational resources and social platforms can broaden the system's impact and promote language learning and cultural exchange.
5.3 Recommendations
Current sign language translation technology is under development, and accuracy may be affected by factors like signing speed, variations in execution, and background noise. Limited availability of training data for certain sign languages can hinder translation accuracy for those languages. The system may not capture the full nuance of sign language communication, which also relies on facial expressions and body movements.
Investigate methods to improve translation accuracy, particularly for complex sentences and idiomatic expressions. Develop mechanisms to account for regional variations and cultural nuances within sign languages. Explore integration with sentiment analysis tools to better convey emotional tone.
5.5 Conclusion
Sign language translation systems hold immense potential to break down communication barriers
and foster greater social inclusion. By addressing the limitations and continuously developing the
technology, these systems can empower deaf and hearing individuals to connect and participate fully in everyday life.
References
Katiyar, P., Shukla, K. S. K., & Kumar, V. (2023, May). Analysis of Human Action Recognition
Using Machine Learning techniques. In 2023 4th International Conference for Emerging
Guo, L., Lu, Z., & Yao, L. (2021). Human-machine interaction sensing technology based on
hand gesture recognition: A review. IEEE Transactions on Human-Machine Systems, 51(4), 300-
309.
Achenbach, P., Laux, S., Purdack, D., Müller, P. N., & Göbel, S. (2023). Give Me a Sign: Using
Camgoz, N. C., Hadfield, S., Koller, O., Ney, H., & Bowden, R. (2018). Neural sign language
translation. In Proceedings of the IEEE conference on computer vision and pattern recognition
(pp. 7784-7793).
Rastgoo, R., Kiani, K., & Escalera, S. (2021). Sign language recognition: A deep survey. Expert Systems with Applications.
Camgoz, N. C., Koller, O., Hadfield, S., & Bowden, R. (2020). Sign language transformers: Joint
end-to-end sign language recognition and translation. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition (pp. 10023-10033).
Huang, J., Zhou, W., Zhang, Q., Li, H., & Li, W. (2018, April). Video-based sign language
recognition without temporal segmentation. In Proceedings of the AAAI Conference on Artificial
Intelligence (Vol. 32, No. 1).
Cheok, M. J., Omar, Z., & Jaward, M. H. (2019). A review of hand gesture and sign language
recognition techniques. International Journal of Machine Learning and Cybernetics, 10, 131-
153.
Li, D., Rodriguez, C., Yu, X., & Li, H. (2020). Word-level deep sign language recognition from
video: A new large-scale dataset and methods comparison. In Proceedings of the IEEE/CVF
winter conference on applications of computer vision (pp. 1459-1469).
Varshney, P. K., Kumar, S. K., & Thakur, B. (2024). Real-Time Sign Language Recognition. In
Medical Robotics and AI-Assisted Diagnostics for a High-Tech Healthcare Industry (pp. 81-92).
IGI Global.
Wadhawan, A., & Kumar, P. (2021). Sign language recognition systems: A decade systematic
literature review. Archives of Computational Methods in Engineering, 28, 785-813.
Bragg, D., Koller, O., Bellard, M., Berke, L., Boudreault, P., Braffort, A., ... & Ringel Morris,
M. (2019, October). Sign language recognition, generation, and translation: An interdisciplinary
perspective. In Proceedings of the 21st International ACM SIGACCESS Conference on
Computers and Accessibility (pp. 16-31).
APPENDIX EXCERPT OF PROGRAM SOURCE CODE
```
# Drawing helpers from the MediaPipe library.
# Note: newer MediaPipe releases renamed FACE_CONNECTIONS (e.g. to FACEMESH_TESSELATION).
import mediapipe as mp

mp_holistic = mp.solutions.holistic       # holistic model (face, pose, hands)
mp_drawing = mp.solutions.drawing_utils   # landmark drawing utilities

def draw_landmarks(image, results):
    """Draw the detected face, pose, and hand landmarks onto the frame."""
    mp_drawing.draw_landmarks(image, results.face_landmarks,
                              mp_holistic.FACE_CONNECTIONS)
    mp_drawing.draw_landmarks(image, results.pose_landmarks,
                              mp_holistic.POSE_CONNECTIONS)
    mp_drawing.draw_landmarks(image, results.left_hand_landmarks,
                              mp_holistic.HAND_CONNECTIONS)
    mp_drawing.draw_landmarks(image, results.right_hand_landmarks,
                              mp_holistic.HAND_CONNECTIONS)
```
```
import os
import numpy as np
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split

# actions, DATA_PATH and sequence_length are defined earlier in the source.
label_map = {label: num for num, label in enumerate(actions)}   # sign name -> class index
sequences, labels = [], []
for action in actions:
    for sequence in np.array(os.listdir(os.path.join(DATA_PATH, action))).astype(int):
        window = []
        for frame_num in range(sequence_length):
            res = np.load(os.path.join(DATA_PATH, action, str(sequence),
                                       "{}.npy".format(frame_num)))
            window.append(res)
        sequences.append(window)
        labels.append(label_map[action])

X = np.array(sequences)                    # shape: (num_sequences, 30, 1662)
y = to_categorical(labels).astype(int)     # one-hot encoded labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05)
```
```
import os
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import TensorBoard

# TensorBoard callback for monitoring training progress.
log_dir = os.path.join('Logs')
tb_callback = TensorBoard(log_dir=log_dir)

# Stacked LSTM network: 30-frame sequences of 1662 keypoint values per frame.
model = Sequential()
model.add(LSTM(64, return_sequences=True, activation='relu', input_shape=(30, 1662)))
model.add(LSTM(128, return_sequences=True, activation='relu'))
model.add(LSTM(64, return_sequences=False, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(actions.shape[0], activation='softmax'))  # one output per sign class
model.compile(optimizer='Adam', loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])
model.fit(X_train, y_train, epochs=1000, callbacks=[tb_callback])
```