Nats Project
BY
SUBMITTED TO
22/03/2024
APPROVAL PAGE
This is to certify that this project entitled “Design and Implementation of a Sign Language Translation System” was carried out by Uriri Nathaniel Elo-oghene (VUG/CSC/20/4188) in the Faculty of Natural and Applied Sciences, Veritas University, Abuja, for the award of the degree of Bachelor of Science in Computer Science.
DEDICATION
ACKNOWLEDGEMENTS
I would like to express my deep gratitude to my educational institution. Their unwavering
support, encouragement, and guidance have been invaluable throughout this journey. Their belief
in the significance of this project has fueled my determination, and their insights have greatly
enriched the outcome. This expression of gratitude is a humble acknowledgment of the profound
impact they have had on the success of this endeavor. Their support has been a constant source
of inspiration, and I am truly thankful for their contribution to this project's realization.
Table of Contents
APPROVAL PAGE
ABSTRACT
CHAPTER ONE - INTRODUCTION
1.1 Background of the Study
1.2 Problem Statement
1.3 Research Question
1.4 Aim and Objectives of the Study
1.5 Significance of the Study
1.6 Scope of the Study
1.7 Limitations of the Study
1.8 Overview of Research Method
1.9 Structure of the Project
CHAPTER TWO - LITERATURE REVIEW
2.1 Introduction
2.2 The Concept of Sign Language
2.3 Understanding Sign Language Recognition
2.4 Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation
2.5 Video-Based Sign Language Recognition Without Temporal Segmentation
2.6 Hand Gesture and Sign Language Recognition Techniques
2.7 Real-Time Sign Language Recognition
CHAPTER THREE - RESEARCH METHODOLOGY
3.3 Data Pre-processing
3.4 Statistical Analysis
3.5 Software Development Methodology
3.6 Testing and Validation of the New System
CHAPTER FOUR - SYSTEM DESIGN AND IMPLEMENTATION
4.1 Introduction
4.2 Analysis of Existing Systems
4.2.1 Analysis of Existing Systems: Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation
4.2.2 Analysis of Proposed System: Sign Language Recognition Using Neural Networks
4.2.3 Comparison of Proposed System and Existing System
4.3 Limitations of Existing Systems
4.3.1 Limitations of the Transformer-Based System for Sign Language Recognition and Translation Presented by Camgöz et al.
4.3.2 Limitations of the Proposed System
4.4 Proposed System: Design and Implementation of a Sign Language Translation System
4.4.1 Rationale for Implementation of a Sign Language Translation System Using an LSTM
4.4.2 System Architecture
4.4.3 Advantages of the Proposed System
4.5 SYSTEM MODELLING
4.5.1 System Activity Diagram
4.5.2 Use Case Diagram for Sign Language Translation Web Application
4.5.3 Sequence Diagram for Sign Language Translation Web Application
4.6 AI Process
4.6.1 Data Collection
4.6.2 Data Processing
4.6.3 Training of the Model
4.6.4 Testing of the Model
4.6.5 Model Evaluation
4.6.6 Model Deployment (Mobile or Web)
4.7 Functionality/Experimentation and Results 1
4.8 Functionality/Experimentation and Results 2
4.9 Functionality/Experimentation and Results 3
4.10 Functionality/Experimentation and Results 4
CHAPTER 5 - SUMMARY AND CONCLUSION
5.1 Summary
5.2 Contribution of Research and Conclusion
5.3 Recommendations
References
APPENDIX
LIST OF FIGURES
ABSTRACT
In today's interconnected world, artificial intelligence (AI) plays a big role in breaking down
language barriers and connecting people across distances. However, people with hearing
impairments, especially those in the Deaf community, still face significant communication
challenges. Sign language is their primary way of communicating, but it is not widely
understood, creating difficulties in societies where spoken and written languages dominate.
This research focuses on using AI to improve communication for the Deaf community. It looks
at how AI can help bridge the gap between Deaf individuals and the rest of the world, supporting
efforts to promote inclusivity and equal access to information. Despite progress in AI, translating
sign language remains a challenge due to the lack of data for training AI models and the
complexity of sign languages, which use both hands and facial expressions.
The study aims to develop an AI-powered system that can translate sign language in real-time. It
examines current systems, identifies their shortcomings, and proposes a more effective solution.
The goal is to create a user-friendly interface that allows Deaf individuals and those unfamiliar with sign language to communicate with one another. The study uses a research method that involves designing and testing a new system to see how well it
works. Despite its limitations, this study contributes to the ongoing effort to improve sign
language translation and address the communication needs of the Deaf community.
CHAPTER ONE - INTRODUCTION
In the contemporary interconnected world driven by technology, artificial intelligence (AI) has transformed how people communicate across linguistic and geographical constraints. This transformation, facilitated by real-time language translation, has, however, left a significant portion of society, the Deaf community, behind.
Sign language serves as the primary mode of expression for the Deaf community, with its rich visual language, distinct grammar, syntax, and cultural nuances. Despite its expressive nature, it is not widely understood outside the Deaf community, creating difficulties in societies where spoken and written languages dominate.
This research addresses the intersection of AI and sign language understanding to enhance
inclusive communication for the Deaf community. It explores the potential of AI to bridge the
communication gap between Deaf individuals and the broader world, considering both
technological advancements and social implications. The study aligns with the global drive toward inclusivity and equal access to information.
However, the field of AI sign language translation (SLT) faces notable research gaps. Sign
languages, being low-resource languages, lack sufficient data for training SLT models due to
factors such as their limited usage compared to spoken languages. Additionally, the visual
complexity of sign languages, incorporating information from both hands and face, poses
challenges for accurate recognition and translation by SLT models. Furthermore, the multitude of
sign languages and dialects worldwide requires SLT systems to adapt to diverse linguistic inputs
and outputs.
Addressing these research gaps is essential for the widespread deployment of AI SLT systems.
Affordability and accessibility for the Deaf community, integration with existing technologies
like video conferencing and social media platforms, and adaptation to various sign languages and
dialects are crucial considerations. Despite challenges, the AI SLT sector is progressing rapidly,
with ongoing developments in data collection, model refinement, and evaluation methods.
Continued research and development hold the potential to revolutionize communication for deaf and hard-of-hearing individuals.
A range of studies have explored the development of sign language recognition systems. (Vargas
2011) and (K 2022) both proposed systems using neural networks for image pattern recognition,
with Vargas focusing on static signs and K on alphabets and gestures. (Holden 2005) developed
a system for Australian sign language recognition, achieving high accuracy through the use of
Hidden Markov Models and features invariant to scaling and rotation. (Cho 2020) further described a recognition pipeline involving image acquisition, pre-processing, and training with a multilayer perceptron and gradient descent
momentum. These studies collectively demonstrate the potential of neural networks and other
advanced techniques in the development of accurate and efficient sign language recognition
systems.
1.2 Problem Statement
American Sign Language (ASL) is the language of choice for most deaf people in the United States (Starner, T. 1995). ASL uses approximately 6,000 gestures for common words and finger spelling for communicating obscure words or proper nouns. Because of the lack of data on sign languages, building a sign language translation (SLT) system is a major challenge. SLT systems are used to translate sign language into spoken language and vice versa. This technology has the potential to educate non-signers about signs and help signers communicate with non-signers much faster. However, the lack of sign language data is a major obstacle to the development and deployment of such systems.
Sign languages are low-resource languages, meaning that there is relatively little data available for training machine learning models to distinguish among the roughly 6,000 possible classes a sign language may contain. This is due to a number of factors, including the fact that sign languages are not as widely used as spoken languages, and that it is more difficult to collect and annotate sign language data. The research identifies these problems and frames the research questions presented in the next section.
1.3 Research Question
The factors that affect the design and implementation of a sign language translation system are
not well understood. Recently, researchers have tried to figure out how different priorities impact
these systems, but the results are unclear. There have been calls for further research to resolve the conflicting findings in the literature. This study aims to design and implement a system capable
of interpreting at least 30 different signs in American Sign Language. Thus, the research questions are:
i. How can SLT models be made more robust to variations in lighting?
ii. How can SLT models be made more resistant to changes in the signer's background?
iii. What kind of data would be helpful to train SLT models to recognize a wide range of signers?
1.4 Aim and Objectives of the Study
Objectives
II. Create a User-Friendly and Accessible Interface: Design an interface that caters to both
Deaf individuals and those unfamiliar with sign language. Prioritize accessibility and ease of use
for all users.
III. Refine the System Through User-Centered Design: Conduct extensive user testing to
gather valuable feedback. Employ an iterative process to continuously improve the system's
accessibility and user experience, focusing on the specific needs of Deaf users.
1.5 Significance of the Study
Thus, the significance of this work is that it addresses a specific set of communication needs. This
system could cater to basic but crucial expressions, allowing for essential interactions between
individuals who use these particular signs. While it may not cover the entirety of ASL, it still
provides a valuable tool for those who rely on these specific signs, contributing to improved communication and inclusivity.
1.6 Scope of the Study
This project is focused on the design and implementation of a sign language translation system that can understand at most 40 different single-hand signs in American Sign Language (ASL).
1.7 Limitations of the Study
This project, being limited to understanding at most 40 different single-hand signs in ASL, has
inherent limitations that warrant acknowledgment. Firstly, as with any research focused on a
specific dataset, the system's effectiveness cannot be universally established based on this
singular study. Secondly, the study is confined to a particular scope within the domain of ASL
and does not encompass the entirety of the sign language lexicon. Consequently, caution is
advised when generalizing the outcomes beyond the defined set of signs and their interpretations.
Furthermore, the study's applicability is bound to a specific context, and the results may not be
universally representative. The limited scope of 40 signs raises concerns about the system's applicability to larger vocabularies. Additionally, the study may not account for the nuances and variations in sign language used by different signers, which may affect performance in practical scenarios.
Another constraint is the focus on single-hand signs, excluding the complexities that arise in
signs involving both hands, facial expressions, and other non-manual markers. This limitation may affect the system's ability to comprehensively interpret the richness of sign language.
It is important to note that the development of this system was constrained by available resources
and time, leading to a narrowed focus on a specific subset of signs. Consequently, the findings
and functionalities may not universally apply, and the limitations of the study should be kept in mind when interpreting its results. It is crucial to recognize that the current work, while contributing to sign language translation, may not generalize to the full breadth of sign language communication.
1.8 Overview of Research Method
The chosen research method for this project is Design Science Research (DSR). Design Science
Research is a collection of synthetic and analytical techniques and viewpoints that complement
positivist, interpretive, and critical perspectives in conducting research within the field of
Information Systems. This approach encompasses two key activities aimed at enhancing and
comprehending the behavior of various aspects of Information Systems: firstly, the generation of
new knowledge through the design of innovative artifacts, whether they be tangible items or
processes; and secondly, the analysis of the artifact's utilization and/or performance through reflection and abstraction.
Figure 1.8: Schematic diagram of DSR adopted in this study.
The artifacts created in the design science research process include, but are not limited to, constructs, models, methods, and instantiations. This study adopts the Design Science Research
methodology, which offers specific guidelines for evaluation and iteration within research projects. Figure 1.8 represents the schematic diagram of the research method. Design Science knowledge, then, takes the form of constructs, techniques and methods, models, and well-developed theory for performing the mapping of knowledge for creating artifacts that satisfy given sets of functional requirements.
1.9 Structure of the Project
This project work is organized into five chapters. Chapter one introduces the topic, Design and Implementation of a Sign Language Translation System, a case study on communication with non-signers, and presents a background of the work. It discusses the problems and challenges faced by most existing methods of solving the problem of communication with deaf individuals. Chapter
two delves into the literature review of former work carried out on the topic. Chapter three gives
the design analysis and project consideration based on the diagram above. Chapter four presents
the proposed implementation and finally, chapter five contains the summary, conclusion and
recommendations. It is finally rounded off with the references and appendix consisting of code
of the implementation.
CHAPTER TWO- LITERATURE REVIEW
2.1 Introduction
Neural networks have greatly shaped Sign Language Translation, evolving from recognizing
isolated signs to the complex task of translating sign languages. Necati Cihan Camgöz and team
played a crucial role, challenging the idea that sign languages are mere translations of spoken
languages. Their notable work (Camgöz et al., 2018) stressed the unique structures of sign
languages.
They introduced Continuous Sign Language Translation, using advanced deep learning models
to translate continuous sign language videos into spoken language. A key moment was the creation of the RWTH-PHOENIX-Weather 2014T dataset. Camgöz et al. set a performance benchmark with their models achieving a BLEU-4 score of up to 18.13, and their later work (Camgöz et al., 2020) joined recognition and translation in a unified framework.
The collaborative work of Danielle Bragg, Oscar Koller, Mary Bellard, et al. (2019) in "Sign Language Recognition, Generation, and Translation" advances the field. They highlight the need for larger datasets and standardized annotation systems, advocating for interdisciplinary collaboration.
The paper is a must-read for researchers, offering a clear review of the state of the art and steps for future research. It is a critical resource for developing technologies that bridge communication gaps.
(Huang, Zhou, Zhang, et al., 2018) break barriers by eliminating temporal segmentation in sign language recognition. Their Hierarchical Attention Network with Latent Space (LS-HAN) showcases recognition of continuous sign language without segmenting it first.
"A Review of Hand Gesture and Sign Language Recognition Techniques" by (Ming Jin Cheok,
Zaid Omar, and Mohamed Hisham Jaward, 2017) comprehensively explores recognition
methods and challenges. The paper sheds light on the intricacies of gesture and sign language
recognition, emphasizing the need for context-dependent models and the incorporation of three-dimensional information.
Despite progress, the paper acknowledges limitations and calls for sustained interdisciplinary research to advance gesture and sign language recognition. The transformative potential of this technology extends well beyond any single application domain.
2.2 The Concept of Sign Language
Sign languages are comprehensive and distinct languages that manifest through visual signs and
gestures, fulfilling the full communicative needs of deaf communities where they arise (Brentari
& Coppola, 2012). Researchers such as Diane Brentari and Marie Coppola have explored how
these languages are created and develop, a process that occurs when specific social conditions
allow for the transformation of individual gesture systems into rich, communal languages
(Brentari & Coppola, 2012). Emerging sign languages serve as a primary communication system
Brentari and Coppola highlight that the creation of a new sign language requires two essential
elements: a shared symbolic environment and the ability to exploit that environment, particularly
by child learners. These conditions are met in scenarios where deaf individuals come together
and collectively evolve their individual homesign systems into a community-wide language
Developmental Pathways
The evolution of sign languages can follow various trajectories. One common pathway involves
the establishment of institutions, like schools for the deaf, which become hubs for this evolution.
A pertinent example is Nicaraguan Sign Language, which developed rapidly when a special
education center in Managua expanded in 1978, bringing together a large deaf population. This
setting facilitated the transition from homesign to an established sign language through what
Brentari and Coppola describe as the 'initial contact stage', followed by the 'sustained contact
stage' as the language was adopted by subsequent generations (Brentari & Coppola, 2012).
2.3 Understanding Sign Language Recognition
Sign Language Recognition has been a growing area of research for several decades. However, it
was not until recently that the field has evolved towards the more complex task of Sign
Language Translation. In the seminal work of Necati Cihan Camgöz and colleagues, a significant
shift is proposed from recognizing isolated signs as a naive gesture recognition problem to a
neural machine translation approach that respects the unique grammatical and linguistic structures of sign languages.
Previous studies primarily focused on SLR with simple sentence constructions from isolated
signs, overlooking the intricate linguistic features of sign languages. These studies operated
under the flawed assumption that there exists a direct, one-to-one mapping between signed and
spoken languages. In contrast, the work by Camgöz et al. acknowledges that sign languages are
independent languages with their own syntax, morphology, and semantics, and are not merely visual renderings of spoken languages.
To address these limitations, Camgöz and his team introduced the concept of Continuous SLT,
which tackles the challenge of translating continuous sign language videos into spoken language
while accounting for the different word orders and grammar. They utilize state-of-the-art deep learning models to extract representations from continuous sign language videos and map these to spoken or written language (Camgöz et al., 2018). A milestone in this field is the creation of the RWTH-PHOENIX-Weather 2014T dataset, the first publicly available
continuous SLT dataset. This dataset contains sign language video segments from weather
broadcasts, accompanied by gloss annotations and spoken language translations. The dataset is
pivotal for advancing research, allowing for the evaluation and development of SLT models
(Camgöz et al., 2018). Camgöz et al. set the bar for translation performance with their models
achieving a BLEU-4 score of up to 18.13, creating a benchmark for future research efforts. Their experiments explored various tokenization methods, attention schemes, and parameter configurations (Camgöz et al.,
2018).
2.4 Sign Language Transformers: Joint End-to-end Sign Language Recognition and
Translation
The pioneering work by (Camgöz et al., 2020) addresses the inherent challenges in the domain of sign language translation with a transformer-based architecture. This novel approach leverages a Connectionist Temporal Classification loss,
integrating the recognition of continuous sign language and translation into a single unified
framework, which leads to significant gains in performance. The preceding efforts in sign
language translation primarily relied on a mid-level sign gloss representation, which is crucial for
translation models, as supported by prior research in the field. Camgöz and colleagues delineate
sign glosses as minimal lexical items that correlate spoken language words with their respective signs.
Camgöz et al.'s (2020) research contributes to addressing critical sub-tasks in sign language
translation, such as sign segmentation and the comprehensive understanding of sign sentences.
These sub-tasks are vital since sign languages leverage multiple articulators, including manual
and non-manual features, to convey information. The grammar disparities between sign and
spoken languages necessitate models that can navigate the asynchronous multi-articulatory
nature and the high-dimensional spatio-temporal data of sign language. Additionally, this work
underscores the utilization of transformer encoders and decoders to handle the compound nature
of translating sign language videos into spoken language sentences. Unique to their approach is
the fact that ground-truth timing information is not a prerequisite for training, as their system can
concurrently solve sequence-to-sequence learning problems inherent to both recognition and translation. Reported experiments demonstrate the approach's effectiveness, with improvements over existing sign video to spoken
language, and gloss to spoken language translation models. The results indicate more than a
doubling in performance in some instances, establishing a new benchmark for the task. Camgöz et al.'s (2020) transformative research on sign language transformers thus makes a substantial
contribution to the fields of computer vision and machine translation, setting the stage for
advanced developments in accessible communication technologies for the Deaf and hard-of-
hearing communities.
2.5 Video-Based Sign Language Recognition Without Temporal Segmentation
The work of Huang, Zhou, Zhang, et al. introduces a significant shift away from conventional approaches.
Traditionally, sign language recognition has been bifurcated into isolated SLR, which handles
the recognition of words or expressions one at a time, and continuous SLR, which interprets
whole sentences. Continuous SLR has hitherto relied on temporal segmentation, a process of pre-processing videos to identify individual word or expression boundaries, which is not only challenging due to the subtlety and variety of transition movements in sign language but also demands finely labelled datasets.
To combat the limitations of existing methods, (Huang et al.) propose a novel framework called the Hierarchical Attention Network with Latent Space, which aims to forgo the need for temporal segmentation. Combining a convolutional neural network for feature extraction and a Hierarchical Attention Network, their method pays detailed
attention to both the global and local features within the video to facilitate the understanding of
sign language semantics without the need for finely labeled datasets, creating a sort of semi-self-supervised approach.
Furthermore, the authors address an existing gap in sign language datasets by compiling a
comprehensive Modern Chinese Sign Language dataset with sentence-level annotations. This
contribution not only aids their proposed framework but also provides a resource for future
research in the field. The effectiveness of the LS-HAN framework is substantiated through
experiments conducted on two large-scale datasets, highlighting its potential to revolutionize the
landscape of sign language recognition by circumventing some of the field's most persistent
challenges.
The document "A review of hand gesture and sign language recognition techniques" by (Ming
Jin Cheok, Zaid Omar, and Mohamed Hisham Jaward) presents a comprehensive examination of
the methods and challenges in hand gesture and sign language recognition. The authors illustrate
communication for the deaf and hard-of-hearing individuals, tracing its development alongside
methodically explores various algorithms involved in the recognition process, which are
organized into stages such as data acquisition, pre-processing, segmentation, feature extraction,
and classification, and offers insights into their respective advantages and disadvantages.
In particular, the authors delve into the intricacies and hurdles inherent in gesture recognition,
ranging from environmental factors like lighting and viewpoint to movement variability and the
use of aids like colored gloves for improved segmentation. Additionally, the review provides an
extensive look at sign language recognition, primarily focusing on American Sign Language (ASL) while also mentioning other sign languages from around the world. The necessity for context-dependent models and the incorporation of three-dimensional information is emphasized.
Despite the progress, the paper acknowledges the limitations of present technologies, especially
concerning their adaptability across different individuals, which significantly affects recognition
accuracy. The authors suggest that future research could focus on overcoming these limitations
to create more robust and universally applicable recognition systems. Concluding on a visionary
note, the paper underscores the transformative potential of gesture and sign language recognition,
not only within specific application domains but also in fostering more inclusive communication
tools, and calls for sustained interdisciplinary research to advance the state of the art in this
domain.
2.7 Real-Time Sign Language Recognition
"Real Time Sign Language Recognition" by Pankaj Kumar Varshney, G. S. Naveen Kumar, Shrawan Kumar, et al. presents a substantive contribution to the domain of assistive technology, specifically concerning the deaf and mute community. Varshney and colleagues embark on a mission to bridge the
communication gap faced by individuals with speech and hearing impairments through the
recognition of the static alphabet gestures of American Sign Language, with an exception for the dynamically oriented signs 'J' and 'Z.'
The researchers methodically underscore the significance of sign language as a primary tool for
non-verbal communication, using visual cues like hands, eyes, facial expressions, and body
language. The study underscores the intricate challenge of creating a system that can seamlessly
interpret these signals in real time, a task compounded by the substantial variation in signing styles across individuals.
The heart of the research lies in its exploration of a vision-based approach over a data glove
method for gesture detection, arguing for the former's intuitiveness in human-computer
interaction. This choice reflects a keen understanding that technology for the impaired must be natural and unobtrusive to use.
Interestingly, the authors note the system's potential reach beyond its assistive purpose,
highlighting applications in fields such as gaming, medical imaging, and augmented reality. This
prospect indicates that the technological advancements spurred by this research might ripple outward into other domains.
However, the study is not without its limitations. The vision-based approach, while innovative,
faces potential challenges in terms of gesture and posture identification versatility and the
accurate interpretation of dynamic movements. These are concerns that merit further exploration,
with the study likely serving as a springboard for additional research geared at refining real-time recognition.
In conclusion, "Real Time Sign Language Recognition" is a pivotal work that addresses a critical
societal need while also opening avenues for extensive applications across multiple technological
spheres. Its clear focus on enhancing communication accessibility using neural networks makes
it a notable cornerstone in the field of human-computer interaction and assistive technology. The
researchers' contribution is commendable for both its direct impact on the lives of those with
hearing and speech disabilities and its potential to inform subsequent innovations.
CHAPTER THREE – RESEARCH METHODOLOGY
3.1 Introduction
The architecture of the system will utilize a Long Short-Term Memory (LSTM) neural network for real-time sign language recognition, predicting sign classes from keypoints extracted from sequences of frames, together with a WebRTC-based video component and the Streamlit web library in Python.
Figure 3.1: User Interface of the Sign Language Translation Web Application
Webcam Integration: The application utilizes the webcam to capture video of the user's hands performing signs.
Keypoint Extraction: For each video frame (image), the system extracts key points that represent the location and orientation of the hands and fingers (imagine these as tiny dots marking the joints).
Sequence Building: These key points are captured for a set number of frames (e.g., 30 frames) to form a sequence representing one sign.
Model Inference: The sequence is fed into the model to perform a prediction; the model outputs the most likely sign class.
This section details the methods used to create the sign language dataset for training the translation model.
● Primary Sources: Video recordings: For this study, I created the dataset myself by capturing keypoint sequences from my own sign language gestures, recording 30 different sequences for each sign class and extracting 30 frames of data per sequence. This type of sequential data is essential for training Long Short-Term Memory (LSTM) networks. LSTMs require a minimum amount of sequential data to achieve even basic recognition performance.
● Secondary Sources: Sign language dictionaries and resources: I consulted sign language dictionaries and online resources to confirm the standard form of each sign.
3.3 Data Pre-processing
The collected data underwent a preprocessing stage to prepare it for model training. This involved:
● Video segmentation: Videos were segmented into individual sign instances, isolating each sign for labelling.
● Data normalization: Sign data (both video and motion capture) was normalized to a consistent range and format.
3.4 Statistical Analysis
Accuracy: This core metric reflects the overall proportion of signs translated correctly. In my case, the model achieved an accuracy of 80%, indicating a strong ability to translate most signs
accurately.
Precision: Precision delves deeper, measuring the exactness of positive predictions. A precision
of 86.7% signifies that when the model identifies a sign, there's an 86.7% chance it's the correct
sign. This demonstrates the model's proficiency in accurately identifying true signs.
Recall: Recall focuses on the model's ability to capture all relevant signs. The model achieved a
recall of 80%, indicating it successfully translates a high percentage of actual signs presented to
it.
F1-Score: To gain a balanced view, we employed the F1-Score, which combines precision and recall. Our model's F1-Score of 78.7% suggests a good balance between identifying true signs and capturing all relevant signs.
These statistical analyses provide valuable insights. The high accuracy signifies the model's
overall effectiveness, while the breakdown by precision and recall helps pinpoint areas for
potential improvement. Future work might involve gathering more data or refining the model
architecture to enhance both the accuracy of positive predictions and the ability to capture all
signs correctly.
3.5 Software Development Methodology
Design Science Research Methodology (DSRM) isn't a rigid, step-by-step process; it's a flexible approach that emphasizes knowledge creation through the design and development of artifacts. Unlike traditional scientific methods that focus on explaining how things work, DSRM centers on creating innovative solutions to address real-world problems (Hevner et al., 2004). This iterative approach allows researchers to continuously refine their artifacts based on user feedback and evaluation results.
Figure 3.5: Design Science Research Methodology (Adapted from Peffers et al. 2008)
● Problem identification: DSRM begins with the identification of a need or problem within a specific domain. The research is driven by the desire to create an artifact that addresses this need.
● Iterative development: DSRM is not a linear process. Researchers build and evaluate prototypes of their artifact in cycles, allowing them to learn from each iteration and refine the design.
● Evaluation: A crucial aspect of DSRM is the rigorous evaluation of the designed artifact. This evaluation can involve a variety of methods, such as user testing, performance measurement, or comparison with existing solutions.
● Contribution to knowledge: Beyond the creation of the artifact itself, DSRM seeks to contribute to the broader body of knowledge in the field. This can involve the generalization of design principles and lessons learned from building and evaluating the artifact.
3.6 Testing and Validation of the New System
The following tests were carried out to validate the new system.
● Sign Detection: The system correctly detects signs (single-handed). Recommendation: consider using a pre-trained detection model and test with signers of different ethnicities and genders to ensure generalizability.
● Accuracy: The system classifies detected signs into the correct sign classes. Recommendation: test across a diverse set of signs; some signs still need improvement.
● Real-Time Performance: The system displays the predicted sign with minimal latency. Recommendation: measure the time between the end of the sign execution (30 frames) and the appearance of the prediction to quantify the user experience.
● Camera and WebRTC Integration: The system successfully integrates with the webcam and uses WebRTC. Recommendation: verify that the video stream and the system function correctly across browsers and devices.
● User Interface: The UI elements respond as expected; the screen darkens slightly while a prediction is processed. Recommendation: test UI elements for responsiveness and performance.
● Lighting: Tested under different lighting conditions (dim, natural, artificial). Recommendation: consider adding image pre-processing techniques to handle lighting variations.
● Background: The system focuses on the signer's hands. Recommendation: test with various backgrounds.
CHAPTER FOUR - SYSTEM DESIGN AND IMPLEMENTATION
4.1 Introduction
This section explores the challenges of building a sign language translation system that captures video in real time and predicts what a sign might mean in American Sign Language.
4.2 Analysis of Existing Systems
This section analyzes existing sign language recognition methods and techniques, focusing on how they handle the nuances of understanding gestures and signs in video frames and images.
4.2.1 Analysis of Existing Systems: Sign Language Transformers: Joint End-to-end Sign
Language Recognition and Translation
The researchers developed a novel transformer-based architecture for sign language recognition
and translation which is unified and capable of being trained end-to-end. Their system utilizes a
Connectionist Temporal Classification loss to bind the recognition and translation problems into
one architecture. This approach does not require ground-truth timing information, which refers to
the precise moment when a sign begins and ends in a video sequence—a significant challenge in continuous sign language video. The architecture jointly addresses two sequence-to-sequence learning problems, where the first sequence is a video of continuous sign language and the second sequence is the corresponding spoken language sentences. The model must detect and segment sign sentences in a continuous video stream, understand the conveyed information within each segment, and generate the corresponding spoken language output.
An essential component of the system is the recognition of sign glosses—spoken language words
that represent the meaning of individual signs and serve as minimal lexical items. The
researchers' model recognizes these glosses using specially designed transformer encoders that
are adept at handling high-dimensional spatiotemporal data (such as sign videos). These
encoders are trained to understand the 3D signing space and how signers interact and move
within that space. Additionally, the transformer architecture also captures the multi-articulatory
nature of sign language, where multiple channels are used simultaneously to convey information
—including manual gestures and non-manual features like facial expressions, mouth movements,
and the position and movements of the head, shoulders, and torso.
For the translation task, once the sign glosses and their meaning have been understood, the
system embarks on converting this information into spoken language sentences. This conversion
is not straightforward due to the structural and grammatical differences between sign language
and spoken language, such as different ordering of words and the use of space and motion to
convey relationships between entities. To address this, the researchers applied transformer
decoders, which take the output from the transformer encoders and generate a sequence of
spoken language words. This output is then shaped into coherent sentences that represent the
meaning of the sign language input. In sum, the sophisticated encoder-decoder structure of the
transformers allows the system to perform both recognition and translation tasks effectively, and
4.2.2 Analysis of Proposed system: Sign Language recognition using Neural Networks
The proposed system captures keypoints from sequences of images using OpenCV and Google's MediaPipe Holistic model, which detects the face, pose, and hand landmarks as keypoints. The dataset is stored as several sequences of frames, with the keypoints of each frame pushed into a NumPy array. Thereafter, the system is trained using a Long Short-Term Memory (LSTM) deep learning model composed of three LSTM layers and three Dense layers. This model was trained for 2000 epochs with a batch size of 128 on the extracted dataset, minimizing the categorical cross-entropy loss with the Adam optimizer. Finally, after building the neural network, real-time sign language recognition is performed using Streamlit and OpenCV, where the gestures are recognized and displayed as text within the highlighted area.
4.2.3 Comparison of Proposed system and existing system
● Focuses on real-time detection and classification: This application prioritizes processing
live video streams and providing instant sign language action recognition.
● Uses keypoints for prediction: It extracts keypoints from body pose, face, and hands to
represent signs and feeds them to a pre-trained machine learning model for classification.
The existing system, Sign Language Transformers, in contrast:
i. Aims for full translation: It provides an end-to-end trainable model that can both recognize signs and translate them into spoken language sentences.
ii. Processes video frames directly: The model takes full video frames as input, rather than extracted keypoints.
iii. Leverages transformers for complex tasks: The system utilizes a powerful transformer architecture capable of handling high-dimensional data and understanding the intricacies of sign language.
Level of translation: The first system classifies individual signs, while Sign Language Transformers aim for full sentence translation. Data processing: The first system uses keypoints, while Sign Language Transformers process entire video frames. Model architecture: The first system uses a pre-trained machine learning model, while Sign Language Transformers use a transformer encoder-decoder trained end-to-end.
4.3 Limitation of Existing Systems
Existing sign language translation systems often face limitations in accuracy, real-time
performance, and user experience. Accuracy can be hampered by factors like limited training
data, variations in signing styles, and complex hand movements. Real-time translation can be
computationally demanding and may struggle with rapid signing or background noise. User
interfaces might not be intuitive for all users, especially those unfamiliar with sign language.
These limitations can hinder effective communication and highlight the need for ongoing research and improvement.
4.3.1 Limitations Of The Transformer-based System For Sign Language Recognition And
Translation Presented By Camgöz et al
Domain Limitation: The state-of-the-art results achieved by the system are within the context of
a limited domain of discourse which is weather forecasts (Camgöz et al., 2020). The
performance in more generic or diverse sign language contexts may not be as high, indicating a
limitation in the system's ability to generalize across different subjects or domains outside of this
controlled scope.
Recognition Complexity: Despite advances, the system must address the challenge of
recognizing sign glosses from high-dimensional spatiotemporal data, which is complex due to
the asynchronous multi-articulatory nature of sign languages (Camgöz et al., 2020). The models
need to accurately comprehend the 3D signing space and understand what these different aspects of signing convey within that space.
Sign Segmentation: The translation system needs to accomplish the task of sign segmentation,
detecting sign sentences from continuous sign language videos (Camgöz et al., 2020). Unlike
text, which has punctuation, or spoken languages that have pauses, sign language does not have
obvious delimiters, making segmentation for translating continuous sign language a persistent
issue that is not yet fully resolved in the literature (Camgöz et al., 2020).
4.3.2 Limitations Of The Proposed System
Limited Vocabulary and Sign Language Variations: LSTMs require a substantial amount of
training data to learn the complex relationships between keypoints and their corresponding signs.
This can be challenging for sign languages with vast vocabularies, limiting the system's ability to scale to a larger set of signs.
Dependency On Keypoint Detection Accuracy: The system's performance heavily relies on the accuracy of keypoint detection. Errors in identifying keypoints (e.g., due to lighting, occlusion, or unusual hand orientations) can propagate into incorrect predictions. Training and running such models can also be computationally demanding, requiring significant processing power and time. Additionally, real-time translation necessitates low-latency inference.
Limited Context and Idioms: Sign language heavily relies on facial expressions and body
language for conveying context. Current systems struggle to capture these subtleties, leading to potential misinterpretations: pointing to an object might have a different meaning from just pointing, but the current proposed system does not capture such contextual cues.
Background Noise and Occlusions: Real-world environments often have background noise or
occlusions (e.g., from other people or objects). These factors can disrupt keypoint detection and lead to inaccurate translations.
Lighting Variations: Changes in lighting conditions can affect the quality of video frames, impairing keypoint detection.
4.4 Proposed System: Design and Implementation of a Sign Language Translation System
This section introduces a system utilizing a Long Short-Term Memory (LSTM) model and a pretrained model for
detecting landmark keypoints of the face, pose, and hands for sign language translation. By addressing the limitations identified in existing systems (Section 4.3), this approach aims to
provide a more comprehensive and efficient method for Sign language translation to non-signers.
4.4.1 Rationale for Implementation of a sign language translation system Using an LSTM
Sign language translation systems bridge the communication gap between signers and non-signers.
Information Access: Sign language translation systems can provide real-time access to spoken
information for deaf and hard-of-hearing individuals. This includes lectures, meetings, and everyday conversations.
Improved Participation: These systems can enable deaf and hard-of-hearing people to actively
participate in conversations and express themselves more readily. This fosters inclusivity and a greater sense of belonging.
4.4.2 System Architecture
The system combines a Long Short-Term Memory (LSTM) neural network for real-time sign language recognition, a WebRTC-based video component, and the Streamlit web library in Python. Here's a breakdown of its components:
1. Input Data: The system expects a sequence of keypoint vectors extracted from video frames,
representing the user's body language (pose, hands, face). The input shape is defined as `(30,
1662)`, where:
30: Represents the number of frames (time window) used to capture the sign language gesture.
1662: Represents the dimensionality of each keypoint vector, containing information about the pose, face, and hand landmarks.
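As an illustration of where the 1662 figure comes from, the sketch below flattens the MediaPipe Holistic landmarks into a single vector per frame. The landmark counts (33 pose, 468 face, 21 per hand) are the library's standard output; the helper name extract_keypoints is illustrative rather than taken from the project code.

```python
import numpy as np

def extract_keypoints(results):
    # Pose: 33 landmarks x (x, y, z, visibility) = 132 values
    pose = (np.array([[lm.x, lm.y, lm.z, lm.visibility]
                      for lm in results.pose_landmarks.landmark]).flatten()
            if results.pose_landmarks else np.zeros(33 * 4))
    # Face: 468 landmarks x (x, y, z) = 1404 values
    face = (np.array([[lm.x, lm.y, lm.z]
                      for lm in results.face_landmarks.landmark]).flatten()
            if results.face_landmarks else np.zeros(468 * 3))
    # Each hand: 21 landmarks x (x, y, z) = 63 values
    lh = (np.array([[lm.x, lm.y, lm.z]
                    for lm in results.left_hand_landmarks.landmark]).flatten()
          if results.left_hand_landmarks else np.zeros(21 * 3))
    rh = (np.array([[lm.x, lm.y, lm.z]
                    for lm in results.right_hand_landmarks.landmark]).flatten()
          if results.right_hand_landmarks else np.zeros(21 * 3))
    # 132 + 1404 + 63 + 63 = 1662 values per frame
    return np.concatenate([pose, face, lh, rh])
```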
Model Architecture:
1. Sequential Model: A sequential model is used to stack multiple LSTM layers for effective learning of temporal features.
2. LSTM Layers:
3 LSTM layers: The model utilizes three LSTM layers with the following configurations:
1st LSTM Layer: 64 units, ReLU activation, returns sequences.
2nd LSTM Layer: 128 units, ReLU activation, returns sequences.
3rd LSTM Layer: 64 units, ReLU activation, does not return sequences (flattens the output).
LSTMs are adept at capturing long-term dependencies present in sequential data, which is why they were chosen.
3. Dense Layers:
2 Dense layers: Two fully-connected dense layers are added after the LSTM layers:
These layers help extract higher-level features from the LSTM outputs.
4. Output Layer:
Dense layer with Softmax activation: The final layer has a number of units equal to the
number of actions in the sign language vocabulary (e.g., "Hello", "Thanks", "I love you").
The Softmax activation ensures the output probabilities sum to 1, representing the likelihood of each sign.
Model Training:
1. Optimizer: Adam optimizer is used for efficient gradient descent during training.
2. Loss Function: Categorical cross-entropy is used as the loss function, suitable for multi-class classification.
3. Metrics: Categorical accuracy is used as a metric to monitor training progress and evaluate
model performance.
4. Training Data: The model is trained on pre-processed training data (`X_train`) containing
sequences of keypoint vectors and corresponding labels (`y_train`) indicating the performed sign
language actions.
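A minimal Keras sketch of this architecture follows. The LSTM layer sizes, activations, loss, optimizer, and metric match the description above; the two dense layer widths (64 and 32) and the number of output classes are assumptions, since the text does not state them.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

NUM_ACTIONS = 3  # number of sign classes (three shown for illustration)

model = Sequential([
    # Three stacked LSTM layers over (30 frames, 1662 keypoints per frame)
    LSTM(64, return_sequences=True, activation='relu', input_shape=(30, 1662)),
    LSTM(128, return_sequences=True, activation='relu'),
    LSTM(64, return_sequences=False, activation='relu'),
    # Two dense layers for higher-level features (widths assumed)
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    # One output unit per sign, with probabilities summing to 1
    Dense(NUM_ACTIONS, activation='softmax'),
])

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])
model.summary()
```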
WebRTC Streamer: Streamlit WebRTC component is used to capture the user's webcam video
MediaPipe: The system utilizes MediaPipe's Holistic model for multi-modal human body
landmark detection. It extracts keypoints from the user's face, hands, and pose from the webcam
stream that was used to train the LSTM that is able to predict the sign language.
Preprocessing: The captured video frames are converted from BGR to RGB color format for the MediaPipe model.
Landmark Extraction: MediaPipe processes each frame and outputs keypoint landmarks for the face, hands, and pose.
Keypoint Sequence Building: The extracted keypoints from each frame are combined and stored in a sequence (30 frames). A lock (`sequence_lock`) ensures thread safety for accessing and updating the shared sequence buffer.
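A sketch of this per-frame step is shown below, assuming the extract_keypoints helper sketched earlier and a streamlit-webrtc video callback; the exact wiring in the actual application may differ.

```python
import threading
from collections import deque

import av
import cv2
import mediapipe as mp

holistic = mp.solutions.holistic.Holistic(min_detection_confidence=0.5,
                                          min_tracking_confidence=0.5)
sequence = deque(maxlen=30)        # rolling window of the last 30 frames
sequence_lock = threading.Lock()   # WebRTC callbacks run on a worker thread

def video_frame_callback(frame: av.VideoFrame) -> av.VideoFrame:
    img = frame.to_ndarray(format="bgr24")
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)   # MediaPipe expects RGB input
    results = holistic.process(rgb)
    keypoints = extract_keypoints(results)       # 1662-value vector per frame
    with sequence_lock:                          # thread-safe buffer update
        sequence.append(keypoints)
    return av.VideoFrame.from_ndarray(img, format="bgr24")

# This callback would be passed to streamlit-webrtc, e.g.
# webrtc_streamer(key="sign", video_frame_callback=video_frame_callback)
```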
Model Prediction:
Action Prediction: When the sequence reaches 30 frames (representing a short window of sign
language), the system predicts the performed action using a pre-trained TensorFlow LSTM
model.
Prediction Results: The model takes the sequence of keypoints as input and outputs the
probability of each action in the defined vocabulary (e.g., "Hello", "Thanks", "I love you"). The
action with the highest probability is considered the predicted sign language gesture.
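A sketch of the prediction step follows, under the same assumptions as the previous snippets (the 30-frame buffer and a trained model loaded from disk); the file name and the confidence threshold shown are illustrative.

```python
import numpy as np
from tensorflow.keras.models import load_model

actions = ["Hello", "Thanks", "I love you"]
model = load_model("action.h5")            # file name assumed

def predict_sign(sequence, threshold=0.7):
    """Return the predicted sign once 30 frames have been buffered."""
    if len(sequence) < 30:
        return None
    window = np.expand_dims(np.array(sequence), axis=0)  # shape (1, 30, 1662)
    probs = model.predict(window, verbose=0)[0]          # one probability per action
    best = int(np.argmax(probs))
    # Only report a sign when the model is sufficiently confident
    return actions[best] if probs[best] > threshold else None
```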
Result Display
Landmarks Visualization: If a face is detected, the system overlays the detected landmarks and connections on the video frame.
Sign Prediction Overlay: The predicted sign language action is displayed as text on top of the
video frame.
This architecture allows for real-time sign language recognition by continuously capturing video frames, extracting keypoints, building a sequence, and predicting the user's sign language gestures.
4.4.3 Advantages of the Proposed System
i. Temporal Modelling with LSTMs: LSTMs are well-suited for capturing the temporal dependencies present in sign language gestures. Sign language involves sequences of hand movements and poses over time, and LSTMs can model these sequences effectively.
ii. Real-time Recognition: Employs a WebRTC streamer to capture video frames from the user's
webcam in real-time. This allows for immediate processing and prediction of sign language gestures.
iii. MediaPipe Integration: Leverages MediaPipe's Holistic model for efficient and accurate
multi-modal human body landmark detection. This extracts relevant keypoints from the user's
face, hands, and pose, providing necessary data for the LSTM model.
iv. Scalable Model Architecture: Uses a modular architecture with sequential LSTM layers
followed by dense layers for feature extraction and classification. This allows for easy modification and extension of the model.
v. Reduced Complexity: Utilizing a pretrained model for feature extraction reduces total system complexity.
vi. Cross-platform: The system was developed using a web library, so it is accessible from any device with a modern web browser.
4.5 SYSTEM MODELLING
System modelling is the process of creating abstract models of a system, each with a unique
perspective or viewpoint on that system. The design tool utilized in this study is the Unified
Modeling Language. It is a standard graphical notation for describing software analysis and
designs. UML leverages symbols to describe and document the application development process.
4.5.1 System Activity Diagram
A user flow diagram, also known as a user journey or user process flow, visually represents the
steps a user takes within a system or application to complete a specific task or goal. It shows the
user's actions and interactions with the system, highlighting the pathways and decision points
along the way. The objective of a user flow diagram is to provide a clear and concise overview
of the user's experience and movement within the application. It helps designers and stakeholders
understand the user's journey, identify potential pain points or usability issues, and make informed design decisions.
Figure 4.5.1 illustrates the entire user journey throughout the web app.
4.5.2 Use Case Diagram for Sign Language Translation Web Application
A use case diagram provides a high-level view of the functional requirements of a software application. It illustrates the interactions between users (actors) and the system, showing how they work together to achieve specific goals or tasks. In the context of a sign language translation web application, use case diagrams are used to capture and explain the functional requirements of the system.
Figure 4.5.2 illustrates the functional requirements of the web app.
4.5.3 Sequence Diagram for Sign Language Translation Web Application
A sequence diagram is a type of interaction diagram in UML (Unified Modeling Language) that
illustrates the sequence of interactions between objects or components within a system. It depicts
the flow of messages exchanged between these objects over time to achieve a specific functionality. In this project, the sequence diagram shows how various components, such as the user interface, translation engine, and database, interact with one another.
Figure 4.5.3 illustrates the interactions within the web app.
4.6 AI Process
Artificial intelligence (AI) development follows a structured process. First, relevant data is
collected and prepared for the AI system to learn from. This might involve cleaning, organizing,
and transforming the data into a suitable format. Next, the appropriate AI model type (e.g.,
decision tree, neural network) is chosen based on the task and data characteristics. The model's
architecture, defining its components and connections, is then designed. Training involves
feeding the prepared data to the model, allowing it to learn by adjusting internal parameters to
identify patterns and relationships. The model's performance is then evaluated using unseen data
from a testing set to assess its ability to generalize its learning to new situations. Finally, well-
performing models can be deployed for real-world use, with ongoing monitoring to ensure
continued accuracy and the potential for retraining with new data over time.
4.6.1 Data Collection
For this study, I created the dataset myself by capturing keypoint sequences from my own sign
language gestures. I recorded 30 different sequences for each sign class, extracting 30 frames of
data per sequence. This type of sequential data is essential for training Long Short-Term Memory (LSTM) networks. LSTMs require a minimum amount of sequential data to achieve even basic recognition performance.
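A sketch of this collection loop is shown below, assuming the extract_keypoints helper sketched in the system architecture section and a layout of one .npy file per frame; the folder name, sign list, and file layout are illustrative rather than taken from the project code.

```python
import os
import cv2
import numpy as np
import mediapipe as mp

ACTIONS = ["hello", "thanks", "iloveyou"]    # illustrative sign classes
NUM_SEQUENCES, SEQUENCE_LENGTH = 30, 30      # 30 recordings of 30 frames per sign
DATA_PATH = "MP_Data"                        # assumed output folder

cap = cv2.VideoCapture(0)
with mp.solutions.holistic.Holistic() as holistic:
    for action in ACTIONS:
        for seq in range(NUM_SEQUENCES):
            os.makedirs(os.path.join(DATA_PATH, action, str(seq)), exist_ok=True)
            for frame_num in range(SEQUENCE_LENGTH):
                ok, frame = cap.read()
                if not ok:
                    break
                results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                keypoints = extract_keypoints(results)   # 1662 values per frame
                np.save(os.path.join(DATA_PATH, action, str(seq), str(frame_num)),
                        keypoints)
cap.release()
```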
4.6.2 Data Processing
After data collection using keypoint sequences, data processing is crucial to prepare the data for
training of the LSTM model. Here's a breakdown of the data processing steps:
Missing Values Check: Check for missing keypoints or frames in the sequences. These can be addressed by removing sequences with excessive missing data, or by imputing missing values using techniques like interpolation or filling with the previous/next available values (depending on the context).
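As a hedged illustration of the imputation step, the sketch below linearly interpolates missing frames within one 30-frame sequence using pandas; this is one possible implementation of the approach described above, not necessarily the one used in the project.

```python
import numpy as np
import pandas as pd

def impute_missing_frames(sequence: np.ndarray) -> np.ndarray:
    """Fill missing frames (rows of NaN) in a (30, 1662) keypoint sequence."""
    df = pd.DataFrame(sequence)
    # Linear interpolation between neighbouring frames, then
    # forward/backward fill for gaps at the start or end of the sequence.
    df = df.interpolate(method="linear", axis=0).ffill().bfill()
    return df.to_numpy()

# Example: a sequence with frame 10 missing entirely
seq = np.random.rand(30, 1662)
seq[10, :] = np.nan
clean = impute_missing_frames(seq)
```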
Data Splitting: Divide the processed data into training, validation, and test sets. Training set:
Used to train the LSTM model. Validation set: Used to monitor model performance during
training and prevent overfitting. Test set: Used to evaluate the model's final performance on
unseen data.
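As an illustration of the imputation option mentioned in the Missing Values Check step, the sketch below linearly interpolates frames whose keypoints are entirely zero, which is how missing detections are represented if absent landmarks are zero-filled during extraction. This is a minimal sketch under that assumption, not the exact preprocessing used in this project.
```
import numpy as np

def interpolate_missing_frames(sequence):
    """Linearly interpolate frames whose keypoints are all zero (missing detections).

    sequence: array of shape (num_frames, num_keypoints), e.g. (30, 1662).
    """
    sequence = sequence.copy()
    missing = ~sequence.any(axis=1)          # True where the whole frame is zero
    valid = np.flatnonzero(~missing)         # indices of frames with real detections
    if valid.size == 0:
        return sequence                      # nothing to interpolate from
    for k in range(sequence.shape[1]):       # interpolate each keypoint channel over time
        sequence[missing, k] = np.interp(np.flatnonzero(missing), valid, sequence[valid, k])
    return sequence
```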
For this research, I trained an LSTM model using the Keras library within a Python environment.
The model architecture consists of three stacked LSTM layers with varying hidden unit sizes (64,
128, and 64) and ReLU activation functions. The first two layers utilize (return_sequences=True)
to maintain the sequential nature of the data. A final dense layer with a softmax activation
predicts the most likely action class from the provided keypoint sequences. The model was
compiled with the Adam optimizer, categorical cross-entropy loss function, and categorical
accuracy metric. Training was conducted for 2000 epochs, utilizing a TensorBoard callback to monitor training progress.
After training, tests were performed on the model to quantify its accuracy. The train_test_split function from the scikit-learn library in Python was used to split the dataset into training and testing sets. This is a crucial step in machine learning, ensuring the model doesn't simply memorize the training data and can generalize well to unseen examples. After splitting the dataset, I used the model to perform predictions on the test dataset, expecting it to correctly classify the unseen sequences.
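A minimal sketch of this prediction step is shown below, assuming model, X_test, and y_test are the trained network and held-out arrays produced by the split described above.
```
import numpy as np

# Predict class probabilities for every test sequence, then take the most likely class.
y_prob = model.predict(X_test)        # shape: (num_samples, num_classes)
y_pred = np.argmax(y_prob, axis=1)    # predicted class indices
y_true = np.argmax(y_test, axis=1)    # ground-truth indices from the one-hot labels
```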
4.6.5 Model Evaluation
Model evaluation is a crucial step in the AI process. It determines how well a trained model performs on unseen data. This involves testing the model on a separate dataset (the testing set) and comparing its predictions against the known labels.
Evaluation metrics are quantitative measures used to assess the performance of a trained model. The choice of metrics depends on the specific task and the desired outcome of the model.
Classification Tasks
These metrics evaluate how well a model can categorize data points into predefined classes:
i. Accuracy: The proportion of correct predictions (True Positives + True Negatives) divided by the total number of samples. It's a good overall measure. The accuracy of the trained model is 0.8; it was calculated using the scikit-learn library in Python.
ii. Precision: The proportion of predicted positives that are actually correct (True Positives divided by True Positives + False Positives). It measures how good the model is at identifying actual positives. The precision of the trained model is approximately 0.8667; it was also calculated with scikit-learn.
iii. Recall: The proportion of actual positives that are correctly identified by the model (True Positives divided by True Positives + False Negatives). It measures how good the model is at capturing all relevant positives. The recall of the trained model is 0.8; it was calculated using the same scikit-learn library.
iv. F1-Score: The harmonic mean of precision and recall, combining both metrics into a single score. It provides a balance between precision and recall. The F1-Score of the trained model was likewise computed with scikit-learn.
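The metrics listed above can be computed with scikit-learn as sketched below, reusing the y_true and y_pred arrays from the prediction step; the 'weighted' averaging mode is an assumption for this multi-class task and may differ from the exact settings used to obtain the figures reported here.
```
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average='weighted', zero_division=0)
recall = recall_score(y_true, y_pred, average='weighted', zero_division=0)
f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```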
The model is saved as a Keras model in .h5 format. While this model itself isn't directly deployed as a web app or mobile app, it can be integrated into a larger application for user interaction and real-world deployment. In my sign language translation system, I built a web app to demonstrate how the model performs on live input.
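Loading the saved model for use inside the web app can be as simple as the sketch below; the file name 'action.h5' and the predict_sign helper are illustrative placeholders rather than the exact code used in this project.
```
import numpy as np
from tensorflow.keras.models import load_model

model = load_model('action.h5')   # placeholder file name for the saved .h5 model

def predict_sign(frame_buffer, actions):
    """Predict the current sign from a buffer of the last 30 frames of keypoints.

    frame_buffer: list or array of 30 keypoint vectors (each of length 1662).
    actions: array of sign-class names, in the order used during training.
    """
    sequence = np.expand_dims(np.array(frame_buffer), axis=0)   # shape: (1, 30, 1662)
    probabilities = model.predict(sequence)[0]
    return actions[int(np.argmax(probabilities))]
```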
Experimentation: experiment by performing signs while some of the landmark keypoints are hidden or not detected.
Results: The model can still recognize the signs even if the landmark keypoints are not complete.
Figure 4.7: Illustrating the user interface of the model predicting the correct sign with only half of the landmark keypoints detected.
Experimentation: experiment by making gestures very quickly to see whether the model still makes an accurate prediction.
Results: The model can sometimes mistake signs if they are performed too quickly.
Figure 4.8: Illustrating the user interface of the model predicting the correct sign with complete landmark keypoints.
Experimentation: experiment by making sure the landmark keypoints are complete before the model makes its prediction.
Results: Whenever the keypoints are complete, there is a higher chance that the model will predict the sign correctly.
Figure 4.9: Illustrating the user interface of the model predicting the correct sign with complete landmark keypoints.
The sign language translation web app leverages a combination of cutting-edge technologies to deliver an accurate, efficient, and user-friendly experience. Here's an overview of the key tools used:
○ Description: A robust library that uses only Python to create a polished, interactive web app.
○ Description: During the training process, the data I collected was captured from the user's webcam and processed into keypoint sequences.
● Sign Language Translation Model: Neural Network
○ Description: At the heart of the system lies a neural network trained on a dataset of sequenced frames and their corresponding translations. This model, built using the TensorFlow framework, analyzes the recognized signs and translates them into readable text.
● WebRTC API:
○ Description: Provides real-time access to the user's camera stream directly from the web browser.
This combination of tools allows for a web-based sign language translation app that is not only
accurate but also user-friendly and accessible from any device with a web browser.
This report outlines the potential of a sign language translation system, highlighting its key features, current limitations, and directions for future development.
5.1 Summary
The system facilitates communication between deaf and hearing individuals by translating sign language into spoken or written language, and vice versa. Users can create profiles detailing their communication preferences (sign language type, spoken language) and areas of expertise (if applicable), supporting more personalized conversations. The system can be integrated with various platforms, including video conferencing tools.
Further research and development are necessary to refine the translation accuracy and encompass a wider range of sign languages and dialects. A user-centered design approach should be adopted to ensure the system is intuitive and user-friendly for deaf and hearing individuals with varying levels of technical proficiency. Collaboration with deaf communities and sign language experts is crucial for gathering feedback and ensuring the system effectively addresses their communication needs. Exploring integration with educational resources and social platforms can broaden the system's impact and promote language learning and cultural exchange.
5.3 Recommendations
Current sign language translation technology is under development, and accuracy may be affected by factors like signing speed, variations in execution, and background noise. Limited availability of training data for certain sign languages can hinder translation accuracy for those languages. The system may not capture the full nuance of sign language communication, which also relies on facial expressions and body movements.
Investigate methods to improve translation accuracy, particularly for complex sentences and idiomatic expressions. Develop mechanisms to account for regional variations and cultural nuances within sign languages. Explore integration with sentiment analysis tools to better convey emotional tone.
5.5 Conclusion
Sign language translation systems hold immense potential to break down communication barriers
and foster greater social inclusion. By addressing the limitations and continuously developing the
technology, these systems can empower deaf and hearing individuals to connect and participate fully in everyday life.
References
Katiyar, P., Shukla, K. S. K., & Kumar, V. (2023, May). Analysis of Human Action Recognition
Using Machine Learning techniques. In 2023 4th International Conference for Emerging
Guo, L., Lu, Z., & Yao, L. (2021). Human-machine interaction sensing technology based on
hand gesture recognition: A review. IEEE Transactions on Human-Machine Systems, 51(4), 300-
309.
Achenbach, P., Laux, S., Purdack, D., Müller, P. N., & Göbel, S. (2023). Give Me a Sign: Using
Camgoz, N. C., Hadfield, S., Koller, O., Ney, H., & Bowden, R. (2018). Neural sign language
translation. In Proceedings of the IEEE conference on computer vision and pattern recognition
(pp. 7784-7793).
Rastgoo, R., Kiani, K., & Escalera, S. (2021). Sign language recognition: A deep survey. Expert Systems with Applications.
Camgoz, N. C., Koller, O., Hadfield, S., & Bowden, R. (2020). Sign language transformers: Joint
end-to-end sign language recognition and translation. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition (pp. 10023-10033).
Huang, J., Zhou, W., Zhang, Q., Li, H., & Li, W. (2018, April). Video-based sign language
recognition without temporal segmentation. In Proceedings of the AAAI Conference on Artificial
Intelligence (Vol. 32, No. 1).
Cheok, M. J., Omar, Z., & Jaward, M. H. (2019). A review of hand gesture and sign language
recognition techniques. International Journal of Machine Learning and Cybernetics, 10, 131-
153.
Li, D., Rodriguez, C., Yu, X., & Li, H. (2020). Word-level deep sign language recognition from
video: A new large-scale dataset and methods comparison. In Proceedings of the IEEE/CVF
winter conference on applications of computer vision (pp. 1459-1469).
Varshney, P. K., Kumar, S. K., & Thakur, B. (2024). Real-Time Sign Language Recognition. In
Medical Robotics and AI-Assisted Diagnostics for a High-Tech Healthcare Industry (pp. 81-92).
IGI Global.
Wadhawan, A., & Kumar, P. (2021). Sign language recognition systems: A decade systematic
literature review. Archives of Computational Methods in Engineering, 28, 785-813.
Bragg, D., Koller, O., Bellard, M., Berke, L., Boudreault, P., Braffort, A., ... & Ringel Morris,
M. (2019, October). Sign language recognition, generation, and translation: An interdisciplinary
perspective. In Proceedings of the 21st International ACM SIGACCESS Conference on
Computers and Accessibility (pp. 16-31).
APPENDIX EXCERPT OF PROGRAM SOURCE CODE
```
# Drawing helpers from the MediaPipe library.
# Note: newer MediaPipe releases renamed FACE_CONNECTIONS (e.g. to FACEMESH_TESSELATION).
import mediapipe as mp

mp_holistic = mp.solutions.holistic       # holistic model (face, pose, hands)
mp_drawing = mp.solutions.drawing_utils   # landmark drawing utilities

def draw_landmarks(image, results):
    """Draw the detected face, pose, and hand landmarks onto the frame."""
    mp_drawing.draw_landmarks(image, results.face_landmarks,
                              mp_holistic.FACE_CONNECTIONS)
    mp_drawing.draw_landmarks(image, results.pose_landmarks,
                              mp_holistic.POSE_CONNECTIONS)
    mp_drawing.draw_landmarks(image, results.left_hand_landmarks,
                              mp_holistic.HAND_CONNECTIONS)
    mp_drawing.draw_landmarks(image, results.right_hand_landmarks,
                              mp_holistic.HAND_CONNECTIONS)
```
```
import os
import numpy as np
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split

# actions, DATA_PATH and sequence_length are defined earlier in the source.
label_map = {label: num for num, label in enumerate(actions)}   # sign name -> class index
sequences, labels = [], []
for action in actions:
    for sequence in np.array(os.listdir(os.path.join(DATA_PATH, action))).astype(int):
        window = []
        for frame_num in range(sequence_length):
            res = np.load(os.path.join(DATA_PATH, action, str(sequence),
                                       "{}.npy".format(frame_num)))
            window.append(res)
        sequences.append(window)
        labels.append(label_map[action])

X = np.array(sequences)                    # shape: (num_sequences, 30, 1662)
y = to_categorical(labels).astype(int)     # one-hot encoded labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05)
```
```
import os
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import TensorBoard

# TensorBoard callback for monitoring training progress.
log_dir = os.path.join('Logs')
tb_callback = TensorBoard(log_dir=log_dir)

# Stacked LSTM network: 30-frame sequences of 1662 keypoint values per frame.
model = Sequential()
model.add(LSTM(64, return_sequences=True, activation='relu', input_shape=(30, 1662)))
model.add(LSTM(128, return_sequences=True, activation='relu'))
model.add(LSTM(64, return_sequences=False, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(actions.shape[0], activation='softmax'))  # one output per sign class
model.compile(optimizer='Adam', loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])
model.fit(X_train, y_train, epochs=1000, callbacks=[tb_callback])
```