
Design and Implementation of a sign language translation system

(Case Study: Communication with Non-Signers)

BY

Uriri Nathaniel Elo-oghene


VUG/CSC/20/4188

SUBMITTED TO

FACULTY OF NATURAL AND APPLIED SCIENCE

DEPARTMENT OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY

VERITAS UNIVERSITY, ABUJA

IN PARTIAL FULFILLMENT OF THE

REQUIREMENTS FOR THE AWARD OF

B.Sc. IN COMPUTER SCIENCE

22/03/2024
APPROVAL PAGE

This is to certify that this project entitled “Design and Implementation of a Sign Language Translation System” was carried out by Uriri Nathaniel Elo-oghene (VUG/CSC/20/4188) in the Faculty of Natural and Applied Science, Veritas University, Abuja, for the award of the Bachelor of Science degree in Computer Science.

Supervisor Signature Date

Head of Department Signature Date

DEAN, NAS Signature Date

External Examiner Signature Date

DEDICATION

I dedicate this project to my family, my mentors and advisors, and my educational institution. Their unwavering support, encouragement, and inspiration have been the driving force behind the completion of this endeavor. Their belief in the importance of this work has fueled my determination to make a meaningful contribution. This dedication is a tribute to their invaluable influence on my journey, and I extend my heartfelt gratitude to them for being an integral part of this project's realization.

ACKNOWLEDGEMENTS

I would like to express my deep gratitude to my educational institution. Their unwavering support, encouragement, and guidance have been invaluable throughout this journey. Their belief in the significance of this project has fueled my determination, and their insights have greatly enriched the outcome. This expression of gratitude is a humble acknowledgment of the profound impact they have had on the success of this endeavor. Their support has been a constant source of inspiration, and I am truly thankful for their contribution to this project's realization.

Table of Contents
APPROVAL PAGE....................................................................................................................................................
ABSTRACT...............................................................................................................................................................
CHAPTER ONE - INTRODUCTION.......................................................................................................................
1.1 BACKGROUND OF THE STUDY...............................................................................................................
1.2 Problem Statement..........................................................................................................................................
1.3 Research Question..........................................................................................................................................
1.4 Aim and Objectives of the Study....................................................................................................
1.5 Significance of the Study................................................................................................................................
1.6 Scope of the Study..........................................................................................................................................
1.7 Limitations of the Study.................................................................................................................................
1.8 Overview of Research Method.......................................................................................................................
1.9 Structure of the Project...................................................................................................................................
CHAPTER TWO- LITERATURE REVIEW............................................................................................................
2.1 Introduction....................................................................................................................................................
2.2 The concept of sign language.........................................................................................................................
2.3 Understanding Sign Language Recognition...................................................................................................
2.4 Sign Language Transformers: Joint End-to-end Sign Language Recognition and
Translation............................................................................................................................................................
2.5 Video-Based Sign Language Recognition Without Temporal Segmentation................................................
2.6 Hand gesture and sign language recognition techniques................................................................................
2.7 Real Time Sign Language Recognition..........................................................................................................
CHAPTER THREE – RESEARCH METHODOLOGY...........................................................................................
3.1 Introduction.................................................................................................................................
3.2 Data Acquisition for Sign Language Translation System...........................................................
3.3 Data Pre-processing...................................................................................................................
3.4 Statistical Analysis......................................................................................................................................
3.5 Software Development Methodology..........................................................................................................
3.6 Testing and Validation of the New System.................................................................................................
CHAPTER FOUR - SYSTEM DESIGN AND IMPLEMENTATION.........................................................................
4.1 Introduction....................................................................................................................................................
4.2 Analysis of Existing system...........................................................................................................................
4.2.1 Analysis of Existing Systems: Sign Language Transformers: Joint End-to-end Sign
Language Recognition and Translation..........................................................................................................
4.2.2 Analysis of Proposed system: Sign Language recognition using Neural Networks.............................
4.2.3 Comparison of Proposed system and existing system...........................................................................
4.3 Limitation of Existing Systems......................................................................................................................
4.3.1 Limitations Of The Transformer-based System For Sign Language Recognition
And Translation Presented By Camgöz et al..................................................................................................
4.3.2 Limitations Of The Proposed System....................................................................................................
4.4 Proposed System: Design and Implementation of a sign language translation system..................................
4.4.1 Rationale for Implementation of a sign language translation system Using an LSTM
........................................................................................................................................................................
4.4.2. System Architecture.............................................................................................................................
4.4.3. Advantages of the Proposed System....................................................................................................
4.5 SYSTEM MODELLING................................................................................................................................
4.5.1 System Activity Diagram.....................................................................................................................
4.5.2 Use Case Diagram for Sign Language Translation Web Application.................................................
4.5.3 Sequence Diagram for Sign Language Translation Web Application.................................................
4.6 AI Process.......................................................................................................................................................
4.6.1 Data Collection......................................................................................................................................
4.6.2 Data Processing....................................................................................................................................
4.6.3 Training Of The Model..........................................................................................................................
4.6.4 Testing Of The Model...........................................................................................................................
4.6.5 Model Evaluation..................................................................................................................................
4.6.6 Model Deployment (Mobile or Web)....................................................................................................
4.7 Functionality/Experimentation and Results 1..................................................................................
4.8 Functionality/Experimentation and Results 2................................................................................
4.9 Functionality/Experimentation and Results 3................................................................................
4.10 Functionality/Experimentation and Results 4.............................................................................
CHAPTER 5 – SUMMARY AND CONCLUSION....................................................................................................
5.1 Summary......................................................................................................................................................
5.2 Contribution of Research and Conclusion...................................................................................................
5.3 Recommendations.......................................................................................................................................
References............................................................................................................................................................
APPENDIX.........................................................................................................................................................

LIST OF FIGURES

Figure 1.8: Schematic diagram of DSR adopted in this study
Figure 3.1: User Interface of the Sign Language Translation Web Application p20


Figure 3.5: Design Science Research Methodology p24
Figure 3.6: A table showing all test cases p27
Figure 4.5.1: Illustrates the entire user journey throughout the web app p40
Figure 4.5.2: Illustrates the functional requirements of the web app p41
Figure 4.5.3: Illustrates the entire interaction done in the web app p42
Figure 4.6.3: An image of the training process p45
Figure 4.7: Illustrating the User Interface of the model predicting the correct sign with
only half of the keypoints being detected p47
Figure 4.8: Illustrating the User Interface of the model predicting the correct sign with
complete keypoints being detected for single-handed signs p48
Figure 4.9: Illustrating the User Interface of the model predicting the correct sign with
complete keypoints being detected for double-handed signs p49

ABSTRACT
In today's interconnected world, artificial intelligence (AI) plays a big role in breaking down

language barriers and connecting people across distances. However, people with hearing

impairments, especially those in the Deaf community, still face significant communication

challenges. Sign language is their primary way of communicating, but it is not widely

understood, creating difficulties in societies where spoken and written languages dominate.

This research focuses on using AI to improve communication for the Deaf community. It looks

at how AI can help bridge the gap between Deaf individuals and the rest of the world, supporting

efforts to promote inclusivity and equal access to information. Despite progress in AI, translating

sign language remains a challenge due to the lack of data for training AI models and the

complexity of sign languages, which use both hands and facial expressions.

The study aims to develop an AI-powered system that can translate sign language in real-time. It

examines current systems, identifies their shortcomings, and proposes a more effective solution.

The goal is to create a user-friendly interface that allows Deaf individuals and those unfamiliar

with sign language to communicate more easily.

The study adopts a design science research method that involves designing and testing a new system to evaluate how well it works. Despite its limitations, this study contributes to the ongoing effort to improve sign

language translation and address the communication needs of the Deaf community.

CHAPTER ONE - INTRODUCTION

1.1 BACKGROUND OF THE STUDY

In the contemporary interconnected world driven by technology, artificial intelligence (AI) has

significantly influenced communication by overcoming language barriers and geographical

constraints. This transformation, facilitated by real-time language translation, has, however, left a

distinct group—individuals with hearing impairments, particularly the Deaf community—facing

communication challenges requiring innovative solutions.

Sign language serves as the primary mode of expression for the Deaf community, with its rich

visual language, distinct grammar, syntax, and cultural nuances. Despite its expressive nature,

sign language is not universally understood, leading to communication barriers in a

predominantly spoken and written language society.

This research addresses the intersection of AI and sign language understanding to enhance

inclusive communication for the Deaf community. It explores the potential of AI to bridge the

communication gap between Deaf individuals and the broader world, considering both

technological advancements and social implications. The study aligns with the global

commitment to inclusivity, equal access to information, and social integration in an era

celebrating diversity and accessibility.

However, the field of AI sign language translation (SLT) faces notable research gaps. Sign

languages, being low-resource languages, lack sufficient data for training SLT models due to

factors such as their limited usage compared to spoken languages. Additionally, the visual

complexity of sign languages, incorporating information from both hands and face, poses

challenges for accurate recognition and translation by SLT models. Furthermore, the multitude of

sign languages and dialects worldwide requires SLT systems to adapt to diverse linguistic inputs

and outputs.

Addressing these research gaps is essential for the widespread deployment of AI SLT systems.

Affordability and accessibility for the Deaf community, integration with existing technologies

like video conferencing and social media platforms, and adaptation to various sign languages and

dialects are crucial considerations. Despite challenges, the AI SLT sector is progressing rapidly,

with ongoing developments in data collection, model refinement, and evaluation methods.

Continued research and development hold the potential to revolutionize communication for deaf

individuals, fostering enhanced interaction with the world.

A range of studies have explored the development of sign language recognition systems. (Vargas

2011) and (K 2022) both proposed systems using neural networks for image pattern recognition,

with Vargas focusing on static signs and K on alphabets and gestures. (Holden 2005) developed

a system for Australian sign language recognition, achieving high accuracy through the use of

Hidden Markov Models and features invariant to scaling and rotation. (Cho 2020) further

improved the performance of a sign language recognition system by using a combination of

image acquisition, pre-processing, and training with a multilayer perceptron and gradient descent

momentum. These studies collectively demonstrate the potential of neural networks and other

advanced techniques in the development of accurate and efficient sign language recognition

systems.

1.2 Problem Statement

American Sign Language (ASL) is the language of choice for most deaf people in the United States (Starner, 1995). ASL uses approximately 6,000 gestures for common words and finger spelling for communicating obscure words or proper nouns, and because of the lack of data on sign languages, building a sign language translation (SLT) system is a major challenge. SLT systems are used to translate sign language into spoken language and vice versa. This technology has the potential to educate non-signers about signs and to help signers communicate with non-signers much faster. However, the lack of sign language data is a major obstacle to the development and deployment of SLT systems.

Sign languages are low-resource languages, meaning that there is relatively little data available for training machine learning models to recognize and predict among the roughly 6,000 possible classes in a sign language. This is due to a number of factors, including the fact that sign languages are not as widely used as spoken languages, and that it is more difficult to collect and annotate sign language data. This research identifies the following problems and their implications:

i. Lack of data for training a model to understand sign language.

ii. Too many different sign languages.

1.3 Research Question

The factors that affect the design and implementation of a sign language translation system are

not well understood. Recently, researchers have tried to figure out how different priorities impact

these systems, but the results are unclear. Many people are asking for more research to

understand the conflicts in literature. This study aims to design and implement a system capable

of interpreting at least 30 different signs in American Sign Language. Thus, the research question for this work is:

“How can SLT models that are more robust to changes in lighting, background, and the

signer's appearance be developed?”

i. Can existing Sign Language Translation (SLT) models be improved to handle

variations in lighting?

ii. How can SLT models be made more resistant to changes in the signer's

background?

iii. What kind of data would be helpful to train SLT models to recognize signers

wearing different clothing?

1.4 Aim and Objectives of the Study

This research’s aim is to revolutionize sign language communication by developing a


groundbreaking AI-powered sign language translation system. This system will bridge the
communication gap and foster inclusivity between Deaf and hearing communities.

Objectives

I. Design and Develop a Robust Sign Language Understanding System: Engineer a


comprehensive system capable of interpreting sign language gestures in real-time. Utilize
machine learning algorithms to ensure accurate translation into written or spoken language.

II. Create a User-Friendly and Accessible Interface: Design an interface that caters to both
Deaf individuals and those unfamiliar with sign language. Prioritize accessibility and ease of use
for all users.

III. Refine the System Through User-Centered Design: Conduct extensive user testing to
gather valuable feedback. Employ an iterative process to continuously improve the system's
accessibility and user experience, focusing on the specific needs of Deaf users.

1.5 Significance of the Study

Thus the significance of this work is that it addresses a specific set of communication needs. This

system could cater to basic but crucial expressions, allowing for essential interactions between

individuals who use these particular signs. While it may not cover the entirety of ASL, it still

provides a valuable tool for those who rely on these specific signs, contributing to improved

accessibility and understanding in certain communication scenarios.

1.6 Scope of the Study

This project is focused on the design and implementation of a sign language translation system that can understand at most 40 different single-hand signs in American Sign Language.

1.7 Limitations of the Study

This project on the design and implementation of a sign language translation system, limited to understanding at most 40 different single-hand signs in American Sign Language (ASL), has

inherent limitations that warrant acknowledgment. Firstly, as with any research focused on a

specific dataset, the system's effectiveness cannot be universally established based on this

singular study. Secondly, the study is confined to a particular scope within the domain of ASL

and does not encompass the entirety of the sign language lexicon. Consequently, caution is

advised when generalizing the outcomes beyond the defined set of signs and their interpretations.

Furthermore, the study's applicability is bound to a specific context, and the results may not be

universally representative. The limited scope to 40 signs raises concerns about the system's

adaptability to a broader range of expressions present in real-life sign language communication.

Additionally, the study may not account for the nuances and variations in sign language used by

different individuals or in diverse situations, potentially limiting the system's accuracy in

practical scenarios.

Another constraint is the focus on single-hand signs, excluding the complexities that arise in

signs involving both hands, facial expressions, and other non-manual markers. This limitation

may affect the system's ability to comprehensively interpret the richness of sign language,

impacting its practical utility.

It is important to note that the development of this system was constrained by available resources

and time, leading to a narrowed focus on a specific subset of signs. Consequently, the findings

and functionalities may not universally apply, and the limitations of the study should be

considered in assessing the system's broader implications and generalizability. Therefore, it is

crucial to recognize that the current work, while contributing to sign language translation, may

not offer a one-size-fits-all solution and is subject to inherent limitations.

1.8 Overview of Research Method

The chosen research method for this project is Design Science Research (DSR). Design Science

Research is a collection of synthetic and analytical techniques and viewpoints that complement

positivist, interpretive, and critical perspectives in conducting research within the field of

Information Systems. This approach encompasses two key activities aimed at enhancing and

comprehending the behavior of various aspects of Information Systems: firstly, the generation of

new knowledge through the design of innovative artifacts, whether they be tangible items or

processes; and secondly, the analysis of the artifact's utilization and/or performance through

reflection and abstraction.

Figure 1.8: Schematic diagram of DSR adopted in this study.

The artifacts created in the design science research process include, but are not limited to, algorithms, human/computer interfaces, and system design models or languages (Wieringa, 2014). This will be performed using an outcome-based information technology research methodology, which offers specific guidelines for evaluation and iteration within research projects. Figure 1.8 represents the schematic diagram of the research method. Design Science, then, is knowledge in the form of constructs, techniques and methods, and well-developed theory for performing the mapping of knowledge needed to create artifacts that satisfy given sets of functional requirements. With Design Science Research, it is possible to create categories of missing knowledge using design, analysis, reflection and abstraction.

1.9 Structure of the Project

This project work is organized into five chapters. Chapter one introduces the topic, Design and Implementation of a Sign Language Translation System (a case study on communication with non-signers), and presents the background of the work. It discusses the problems and challenges faced by most existing methods of addressing communication with deaf individuals. Chapter two delves into the literature review of former work carried out on the topic. Chapter three gives the design analysis and project considerations based on the diagram above. Chapter four presents the proposed implementation and, finally, chapter five contains the summary, conclusion and recommendations. The project is rounded off with the references and an appendix containing the code of the implementation.

CHAPTER TWO- LITERATURE REVIEW

2.1 Introduction

Neural networks have greatly shaped Sign Language Translation, evolving from recognizing

isolated signs to the complex task of translating sign languages. Necati Cihan Camgöz and team

played a crucial role, challenging the idea that sign languages are mere translations of spoken

languages. Their notable work (Camgöz et al., 2018) stressed the unique structures of sign

languages.

They introduced Continuous Sign Language Translation, using advanced deep learning models

to translate continuous sign language videos into spoken language. A key moment was the

creation of the RWTH-PHOENIX-Weather 2014 T dataset, a significant resource for evaluating

Sign Language Translation models (Camgöz et al., 2018).

(Camgöz et al., 2020) set a performance benchmark with their models achieving a BLEU-4 score

of up to 18.13. Their subsequent work on "Sign Language Transformers" brought a

transformative approach, addressing recognition and translation challenges simultaneously, using

a unified framework.

The collaborative work of Danielle Bragg, Oscar Koller, Mary Bellard, et al. (2019) in "Sign Language Recognition, Generation, and Translation" advances the field. They highlight the need

for larger datasets and standardized annotation systems, advocating for interdisciplinary

collaboration in computer vision, linguistics, Deaf culture, and human-computer interaction.

The paper is a must-read for researchers, offering a clear review of the state-of-the-art and steps

for future research. It's a critical resource for developing technologies that bridge communication

gaps for the Deaf community.

In "Video-Based Sign Language Recognition Without Temporal Segmentation,"( Huang, Zhou,

Zhang, et al, 2018). break barriers by eliminating temporal segmentation in sign language

recognition. Their Hierarchical Attention Network with Latent Space (LS-HAN) showcases

potential in overcoming challenges, offering a new direction for the field.

"A Review of Hand Gesture and Sign Language Recognition Techniques" by (Ming Jin Cheok,

Zaid Omar, and Mohamed Hisham Jaward, 2017) comprehensively explores recognition

methods and challenges. The paper sheds light on the intricacies of gesture and sign language

recognition, emphasizing the need for context-dependent models and the incorporation of three-

dimensional data to enhance accuracy.

Despite progress, the paper acknowledges limitations and calls for sustained interdisciplinary

research to advance gesture and sign language recognition. The transformative potential of this

technology in fostering inclusive communication tools is highlighted in their paper.

2.2 The concept of sign language

Sign languages are comprehensive and distinct languages that manifest through visual signs and

gestures, fulfilling the full communicative needs of deaf communities where they arise (Brentari

& Coppola, 2012). Researchers such as Diane Brentari and Marie Coppola have explored how

these languages are created and develop, a process that occurs when specific social conditions

allow for the transformation of individual gesture systems into rich, communal languages

(Brentari & Coppola, 2012). Emerging sign languages serve as a primary communication system

for a community of deaf individuals, leveraging pre-existing homesign gestures as a foundation.

Brentari and Coppola highlight that the creation of a new sign language requires two essential

elements: a shared symbolic environment and the ability to exploit that environment, particularly

by child learners. These conditions are met in scenarios where deaf individuals come together

and collectively evolve their individual homesign systems into a community-wide language

(Brentari & Coppola, 2012).

Developmental Pathways

The evolution of sign languages can follow various trajectories. One common pathway involves

the establishment of institutions, like schools for the deaf, which become hubs for this evolution.

A pertinent example is Nicaraguan Sign Language, which developed rapidly when a special

education center in Managua expanded in 1978, bringing together a large deaf population. This

setting facilitated the transition from homesign to an established sign language through what

Brentari and Coppola describe as the 'initial contact stage', followed by the 'sustained contact

stage' as the language was adopted by subsequent generations (Brentari & Coppola, 2012).

2.3 Understanding Sign Language Recognition

Sign Language Recognition has been a growing area of research for several decades. However, it

was not until recently that the field has evolved towards the more complex task of Sign

Language Translation. In the seminal work of Necati Cihan Camgöz and colleagues, a significant

shift is proposed from recognizing isolated signs as a naive gesture recognition problem to a

neural machine translation approach that respects the unique grammatical and linguistic

structures inherent to sign languages (Camgöz et al., 2018).

Previous studies primarily focused on SLR with simple sentence constructions from isolated

signs, overlooking the intricate linguistic features of sign languages. These studies operated

under the flawed assumption that there exists a direct, one-to-one mapping between signed and

spoken languages. In contrast, the work by Camgöz et al. acknowledges that sign languages are

independent languages with their own syntax, morphology, and semantics, and are not merely

translated verbatim from spoken languages (Camgöz et al., 2018).

To address these limitations, Camgöz and his team introduced the concept of Continuous SLT,

which tackles the challenge of translating continuous sign language videos into spoken language

while accounting for the different word orders and grammar. They utilize state-of-the-art

sequence-to-sequence deep learning models to learn the spatio-temporal representation of signs

and map these to spoken or written language (Camgöz et al., 2018). A milestone in this field is

the creation of the RWTH-PHOENIX-Weather 2014 T dataset, the first publicly available

continuous SLT dataset. This dataset contains sign language video segments from weather

broadcasts, accompanied by gloss annotations and spoken language translations. The dataset is

pivotal for advancing research, allowing for the evaluation and development of SLT models

(Camgöz et al., 2018). Camgöz et al. set the bar for translation performance with their models

achieving a BLEU-4 score of up to 18.13, creating a benchmark for future research efforts. Their

work provides a broad range of experimental results, offering a comprehensive analysis of

various tokenization methods, attention schemes, and parameter configurations (Camgöz et al.,

2018).
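
The BLEU-4 figure quoted above is a standard machine-translation metric based on overlapping n-grams between a system's output and a reference translation. The following minimal Python sketch shows how a BLEU-4 score can be computed with NLTK for a single hypothesis; the sentences are placeholders, and published results such as 18.13 are corpus-level scores, so this is only an illustration of the metric, not of Camgöz et al.'s evaluation code.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "weather", "will", "be", "sunny", "tomorrow"]]   # reference translation
hypothesis = ["the", "weather", "is", "sunny", "tomorrow"]            # system output

# BLEU-4 uses equal weights over 1- to 4-gram precisions.
score = sentence_bleu(reference, hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score * 100:.2f}")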

2.4 Sign Language Transformers: Joint End-to-end Sign Language Recognition and

Translation

The pioneering work by Camgöz et al. (2020) addresses the inherent challenges in the domain of

sign language recognition and translation by proposing an end-to-end trainable transformer-

based architecture. This novel approach leverages Connectionist Temporal Classification loss,

integrating the recognition of continuous sign language and translation into a single unified

framework, which leads to significant gains in performance. The preceding efforts in sign

language translation primarily relied on a mid-level sign gloss representation, which is crucial for

enhancing translation performance. Gloss-level tokenization is a precursor for state-of-the-art

translation models, as supported by prior research in the field. Camgöz and colleagues delineate

sign glosses as minimal lexical items that correlate spoken language words with their respective

sign meanings, serving as an essential step in the translation process.

Camgöz et al.'s (2020) research contributes to addressing critical sub-tasks in sign language

translation, such as sign segmentation and the comprehensive understanding of sign sentences.

These sub-tasks are vital since sign languages leverage multiple articulators, including manual

and non-manual features, to convey information. The grammar disparities between sign and

spoken languages necessitate models that can navigate the asynchronous multi-articulatory

nature and the high-dimensional spatio-temporal data of sign language. Additionally, this work

underscores the utilization of transformer encoders and decoders to handle the compound nature

of translating sign language videos into spoken language sentences. Unique to their approach is

the fact that ground-truth timing information is not a prerequisite for training, as their system can

concurrently solve sequence-to-sequence learning problems inherent to both recognition and

translation. The authors' testing on the RWTH-PHOENIX-Weather 2014 T dataset demonstrates

the approach's effectiveness, with reported improvements over existing sign video to spoken

language, and gloss to spoken language translation models. The results indicate more than a

doubling in performance in some instances, establishing a new benchmark for the task. Camgöz et al.'s (2020) transformative research on sign language transformers thus makes a substantial

contribution to the fields of computer vision and machine translation, setting the stage for

advanced developments in accessible communication technologies for the Deaf and hard-of-

hearing communities.

2.5 Video-Based Sign Language Recognition Without Temporal Segmentation

The paper "Video-Based Sign Language Recognition Without Temporal Segmentation" by

Huang, Zhou, Zhang, et al. introduces a significant shift away from conventional approaches.

Traditionally, sign language recognition has been bifurcated into isolated SLR, which handles

the recognition of words or expressions one at a time, and continuous SLR, which interprets

whole sentences. Continuous SLR has hitherto relied on temporal segmentation, a process of pre-

processing videos to identify individual word or expression boundaries, which is not only

challenging due to the subtlety and variety of transition movements in sign language but also

prone to error propagation through the later stages of recognition.

To combat the limitations of existing methods, Huang et al. propose a novel framework called

the Hierarchical Attention Network with Latent Space, which aims to forgo the need for

temporal segmentation altogether. By employing a two-stream 3D Convolutional Neural

Network for feature extraction and a Hierarchical Attention Network, their method pays detailed

attention to both the global and local features within the video to facilitate the understanding of

sign language semantics without the need for finely labeled datasets, creating a sort of semi-self-supervised learning approach.

Furthermore, the authors address an existing gap in sign language datasets by compiling a

comprehensive Modern Chinese Sign Language dataset with sentence-level annotations. This

contribution not only aids their proposed framework but also provides a resource for future

research in the field. The effectiveness of the LS-HAN framework is substantiated through

experiments conducted on two large-scale datasets, highlighting its potential to revolutionize the

landscape of sign language recognition by circumventing some of the field's most persistent

challenges.

2.6 Hand gesture and sign language recognition techniques

The document "A review of hand gesture and sign language recognition techniques" by (Ming

Jin Cheok, Zaid Omar, and Mohamed Hisham Jaward) presents a comprehensive examination of

the methods and challenges in hand gesture and sign language recognition. The authors illustrate

the importance of this technology in improving human-computer interaction and aiding

communication for the deaf and hard-of-hearing individuals, tracing its development alongside

similar advancements in speech and handwriting recognition technologies. The paper

methodically explores various algorithms involved in the recognition process, which are

organized into stages such as data acquisition, pre-processing, segmentation, feature extraction,

and classification, and offers insights into their respective advantages and disadvantages.

In particular, the authors delve into the intricacies and hurdles inherent in gesture recognition,

ranging from environmental factors like lighting and viewpoint to movement variability and the

use of aids like colored gloves for improved segmentation. Additionally, the review provides an

extensive look at sign language recognition, primarily focusing on ASL while also mentioning

other sign languages from around the world. The necessity for context-dependent models and the

incorporation of three-dimensional data to enhance accuracy are emphasized.

Despite the progress, the paper acknowledges the limitations of present technologies, especially

concerning their adaptability across different individuals, which significantly affects recognition

accuracy. The authors suggest that future research could focus on overcoming these limitations

to create more robust and universally applicable recognition systems. Concluding on a visionary

note, the paper underscores the transformative potential of gesture and sign language recognition,

not only within specific application domains but also in fostering more inclusive communication

tools, and calls for sustained interdisciplinary research to advance the state of the art in this

domain.

2.7 Real Time Sign Language Recognition

"Real Time Sign Language Recognition" by Pankaj Kumar Varshney, G. S. Naveen Kumar, Shrawan Kumar, et al. presents a substantive contribution to the domain of assistive technology, specifically concerning the deaf and mute community. Varshney and colleagues embark on a mission to bridge the

communication gap faced by individuals with speech and hearing impairments through the

development of a neural network model capable of identifying finger spelling-based hand

gestures of American Sign Language, with the exception of the dynamically oriented signs 'J'

and 'Z.'

The researchers methodically underscore the significance of sign language as a primary tool for

non-verbal communication, using visual cues like hands, eyes, facial expressions, and body

language. The study underscores the intricate challenge of creating a system that can seamlessly

interpret these signals in real-time, a task compounded by the substantial variation in sign

language across different cultures and geographies.

The heart of the research lies in its exploration of a vision-based approach over a data glove

method for gesture detection, arguing for the former's intuitiveness in human-computer

interaction. This choice reflects a keen understanding that technology for the impaired must

emphasize ease of use to be truly transformative.

Interestingly, the authors note the system's potential reach beyond its assistive purpose,

highlighting applications in fields such as gaming, medical imaging, and augmented reality. This

prospect indicates that the technological advancements spurred by this research might ripple

across various industries, indicating a broader impact.

However, the study is not without its limitations. The vision-based approach, while innovative,

faces potential challenges in terms of gesture and posture identification versatility and the

accurate interpretation of dynamic movements. These are concerns that merit further exploration,

with the study likely serving as a springboard for additional research geared at refining real-time

sign language recognition systems.

In conclusion, "Real Time Sign Language Recognition" is a pivotal work that addresses a critical

societal need while also opening avenues for extensive applications across multiple technological

spheres. Its clear focus on enhancing communication accessibility using neural networks makes

it a notable cornerstone in the field of human-computer interaction and assistive technology. The

researchers' contribution is commendable for both its direct impact on the lives of those with

hearing and speech disabilities and its potential to inform subsequent innovations.

CHAPTER THREE – RESEARCH METHODOLOGY

3.1 Introduction

The architecture of the system utilizes a Long Short-Term Memory (LSTM) neural network for real-time sign language recognition, predicting sign classes from keypoints extracted over sequences of frames. The user interface is built with a WebRTC-based video component and the Streamlit web library in Python.

Figure 3.1: User Interface of the Sign Language Translation Web Application

Webcam Integration: The application uses the webcam to capture video of the user's hands performing signs.

Keypoint Extraction: For each video frame, the system extracts key points that represent the location and orientation of the hands and fingers (imagine these as tiny dots marking specific points on the hand).

Sequence Building: These key points are captured for a set number of frames (e.g., 30 frames) to create a sequence that represents the complete sign execution.

Model Inference: The sequence is passed to the model, which predicts which sign the signer is performing.
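
To make the pipeline above concrete, the following minimal Python sketch shows how per-frame keypoints can be extracted with MediaPipe Holistic and flattened into a single feature vector. The function name extract_keypoints and the exact landmark layout (pose, face, left hand, right hand) are illustrative assumptions rather than the project's exact code.

import cv2
import numpy as np
import mediapipe as mp

mp_holistic = mp.solutions.holistic

def extract_keypoints(results):
    # Flatten pose, face and both hand landmarks into one feature vector,
    # padding with zeros when a landmark group is not detected.
    pose = np.array([[lm.x, lm.y, lm.z, lm.visibility]
                     for lm in results.pose_landmarks.landmark]).flatten() \
        if results.pose_landmarks else np.zeros(33 * 4)
    face = np.array([[lm.x, lm.y, lm.z]
                     for lm in results.face_landmarks.landmark]).flatten() \
        if results.face_landmarks else np.zeros(468 * 3)
    left = np.array([[lm.x, lm.y, lm.z]
                     for lm in results.left_hand_landmarks.landmark]).flatten() \
        if results.left_hand_landmarks else np.zeros(21 * 3)
    right = np.array([[lm.x, lm.y, lm.z]
                      for lm in results.right_hand_landmarks.landmark]).flatten() \
        if results.right_hand_landmarks else np.zeros(21 * 3)
    return np.concatenate([pose, face, left, right])   # one keypoint vector per frame

cap = cv2.VideoCapture(0)   # webcam feed
with mp_holistic.Holistic(min_detection_confidence=0.5,
                          min_tracking_confidence=0.5) as holistic:
    ret, frame = cap.read()
    if ret:
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        keypoints = extract_keypoints(results)   # one frame's worth of key points
cap.release()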

3.2 Data Acquisition For Sign Language Translation System

This section details the methods used to create the sign language dataset for training the

translation model. I employed a combination of primary and secondary data sources.

● Primary Sources: Video recordings: For this study, I created the dataset myself by

capturing keypoint sequences from my own sign language gestures. I recorded 30

different sequences for each sign class, extracting 30 frames of data per sequence. This

type of sequential data is essential for training Long Short-Term Memory (LSTM)

networks. LSTMs require a minimum amount of sequential data to achieve even basic

levels of accuracy in predicting sign language gestures (a minimal sketch of this collection step follows this list).

● Secondary Sources: Sign language dictionaries and resources: I consulted sign language

dictionaries, tutorials, and online resources to ensure accurate representation of signs

within the dataset.
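
The following is a minimal sketch of the collection step described above, assuming keypoint vectors are saved frame by frame as NumPy files in one folder per sign class and sequence. The folder name, class names, and vector size are placeholders, and zero vectors stand in for real keypoints so the sketch runs on its own.

import os
import numpy as np

DATA_PATH = "MP_Data"                       # assumed output folder
actions = ["hello", "thanks", "iloveyou"]   # placeholder sign classes
no_sequences = 30                           # sequences recorded per class
sequence_length = 30                        # frames captured per sequence
n_features = 1662                           # assumed keypoint vector size (see the Section 3.1 sketch)

for action in actions:
    for sequence in range(no_sequences):
        seq_dir = os.path.join(DATA_PATH, action, str(sequence))
        os.makedirs(seq_dir, exist_ok=True)
        for frame_num in range(sequence_length):
            # In the real pipeline this vector comes from extract_keypoints();
            # zeros stand in here so the sketch is self-contained.
            keypoints = np.zeros(n_features)
            np.save(os.path.join(seq_dir, f"{frame_num}.npy"), keypoints)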

3.3 Data Pre-processing

The collected data underwent a preprocessing stage to prepare it for model training. This

involved:

● Video segmentation: Videos were segmented into individual sign instances, isolating

each sign for independent processing.

● Data normalization: Sign data (both video and motion capture) was normalized to a

standard format and scale for efficient model training.
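
A minimal sketch of this pre-processing step is given below, assuming the folder layout from the collection sketch above: sequences are stacked into a (samples, 30, features) array, labels are one-hot encoded, and the data is split into training and test sets.

import os
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

DATA_PATH = "MP_Data"                       # same layout as the collection sketch above
actions = ["hello", "thanks", "iloveyou"]   # placeholder sign classes
label_map = {action: idx for idx, action in enumerate(actions)}

sequences, labels = [], []
for action in actions:
    for sequence in os.listdir(os.path.join(DATA_PATH, action)):
        window = [np.load(os.path.join(DATA_PATH, action, sequence, f"{frame}.npy"))
                  for frame in range(30)]
        sequences.append(window)
        labels.append(label_map[action])

X = np.array(sequences)                     # shape: (samples, 30, features)
y = to_categorical(labels).astype(int)      # one-hot labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)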

3.4 Statistical Analysis

Accuracy: This core metric reflects the overall proportion of signs translated correctly. In this case, the model achieved an accuracy of 80%, indicating a strong ability to translate most signs

accurately.

Precision: Precision delves deeper, measuring the exactness of positive predictions. A precision

of 86.7% signifies that when the model identifies a sign, there's an 86.7% chance it's the correct

sign. This demonstrates the model's proficiency in accurately identifying true signs.

Recall: Recall focuses on the model's ability to capture all relevant signs. The model achieved a

recall of 80%, indicating it successfully translates a high percentage of actual signs presented to

it.

F1-Score: To gain a balanced view, we employed the F1-Score, which combines precision and

recall. Our model's F1-Score of 78.7% suggests a good balance between identifying true signs

and capturing all relevant ones.

These statistical analyses provide valuable insights. The high accuracy signifies the model's

overall effectiveness, while the breakdown by precision and recall helps pinpoint areas for

potential improvement. Future work might involve gathering more data or refining the model

architecture to enhance both the accuracy of positive predictions and the ability to capture all

signs correctly.
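
The metrics above can be computed from the model's predictions as in the following sketch, which assumes the X_test and y_test arrays from the pre-processing sketch and an already trained model. F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R).

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = np.argmax(y_test, axis=1)                  # integer class labels
y_pred = np.argmax(model.predict(X_test), axis=1)   # predicted classes

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average="weighted", zero_division=0)
recall = recall_score(y_true, y_pred, average="weighted", zero_division=0)
f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)   # weighted harmonic mean of P and R
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")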

3.5 Software Development Methodology

Design Science Research Methodology (DSRM) isn't a rigid, step-by-step process; it's a flexible

approach that emphasizes knowledge creation through the design and development of artifacts.

Unlike traditional scientific methods that focus on explaining how things work, DSRM centers

on creating innovative solutions to address real-world problems (Hevner et al., 2004). This

iterative approach allows researchers to continuously refine their artifacts based on user feedback

and evaluation, ultimately leading to more effective and impactful solutions.

Figure 3.5: Design Science Research Methodology (Adapted from Peffers et al. 2008)

Some key characteristics of DSRM include:

● Problem-centric: DSRM starts with a clearly defined problem or opportunity in a

specific domain. The research is driven by the desire to create an artifact that addresses

this need.

● Iterative development: DSRM is not a linear process. Researchers build and evaluate

prototypes of their artifact in cycles, allowing them to learn from each iteration and

improve the design.

● Evaluation: A crucial aspect of DSRM is the rigorous evaluation of the designed artifact.

This evaluation can involve a variety of methods, such as user testing, performance

analysis, and case studies.

● Contribution to knowledge: Beyond the creation of the artifact itself, DSRM seeks to

contribute to the broader body of knowledge in the field. This can involve the

development of design principles, frameworks, or methodologies that can be applied to

future research efforts.

3.6 Testing and Validation of the New System

Test Case: Sign Detection Accuracy
Expected Outcome: The system accurately detects various signs from a diverse set of signers.
Pass/Fail Criteria: The system correctly identifies at least 80% of signs across different categories (simple, complex, single-handed, double-handed). Test with a variety of signers (different ages, ethnicities, genders) to ensure generalizability.
Notes: Consider using a pre-defined dataset of signs for testing or create a custom dataset with diverse signers.

Test Case: LSTM Prediction Accuracy
Expected Outcome: The LSTM model accurately translates detected signs into text.
Pass/Fail Criteria: The system correctly translates at least 75% of signs across different categories. Test with unseen signs not used for training to assess model generalizability.
Notes: Similar to sign detection, use a diverse set of signs and signers for testing. Analyze errors to identify specific signs or signing styles that need improvement.

Test Case: Real-time Performance
Expected Outcome: The system translates signs with minimal latency.
Pass/Fail Criteria: The system displays the translated text within 1 second of the completed sign execution (30 frames).
Notes: Measure the time between the completion of a sign and the appearance of the translated text. Aim for low latency to ensure a smooth user experience.

Test Case: Camera and WebRTC Integration
Expected Outcome: The system seamlessly integrates with the camera and WebRTC for real-time communication.
Pass/Fail Criteria: The system successfully captures video from the webcam and uses WebRTC for real-time processing. Test across different browsers to ensure compatibility.
Notes: Verify that the video feed is displayed and the system functions correctly in different web browsers.

Test Case: User Interface (UI) Functionality
Expected Outcome: The UI elements function as intended.
Pass/Fail Criteria: The screen darkens slightly before displaying the translated text for better visibility. The translated text is clearly displayed on the left side of the screen.
Notes: Test UI elements like screen dimming and text display for proper functionality. Ensure the UI is intuitive and user-friendly.

Test Case: Robustness to Lighting Conditions
Expected Outcome: The system maintains accuracy in different lighting conditions.
Pass/Fail Criteria: The system accurately detects signs and translates them correctly under various lighting conditions (bright, dim, natural, artificial).
Notes: Test the system in different lighting environments to ensure consistent performance. Consider adding image pre-processing techniques to handle lighting variations.

Test Case: Background Interference
Expected Outcome: The system minimizes the impact of background distractions.
Pass/Fail Criteria: The system focuses on the signer's hands and minimizes the influence of background objects or movements.
Notes: Test with various backgrounds (simple, cluttered, moving objects) to assess the system's ability to isolate the signer's hands.

Figure 3.6: A table showing all test cases
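
As one example of how the Real-time Performance test case can be executed, the sketch below measures the delay between feeding a completed 30-frame sequence to the model and receiving a prediction. The sequence shape and the 1-second threshold follow the table above, while the variable names and zero-filled input are illustrative assumptions.

import time
import numpy as np

def measure_latency(model, sequence_length=30, n_features=1662):
    # Zeros stand in for a captured keypoint sequence so the check runs on its own.
    sequence = np.zeros((1, sequence_length, n_features))
    start = time.perf_counter()
    prediction = model.predict(sequence, verbose=0)
    elapsed = time.perf_counter() - start
    return elapsed, int(np.argmax(prediction))

# Pass/fail: the translated text should appear within 1 second of sign completion.
# elapsed, predicted_class = measure_latency(model)
# assert elapsed < 1.0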

CHAPTER FOUR - SYSTEM DESIGN AND IMPLEMENTATION

4.1 Introduction

This section explores the challenges of building a sign language translation system that captures video in real time and predicts what a sign might mean in American Sign Language.

4.2 Analysis of Existing system

This section analyzes existing sign language recognition methods and techniques, focusing on how they handle the nuances of understanding gestures and signs in video frames and images.

4.2.1 Analysis of Existing Systems: Sign Language Transformers: Joint End-to-end Sign
Language Recognition and Translation

The researchers developed a novel transformer-based architecture for sign language recognition

and translation which is unified and capable of being trained end-to-end. Their system utilizes a

Connectionist Temporal Classification loss to bind the recognition and translation problems into

one architecture. This approach does not require ground-truth timing information, which refers to

the precise moment when a sign begins and ends in a video sequence—a significant challenge in

translation systems. The architecture intelligently manages sequence-to-sequence learning

problems, where the first sequence is a video of continuous sign language and the second

sequence is the corresponding spoken language sentences. The model must detect and segment

sign sentences in a continuous video stream, understand the conveyed information within each

sign, and finally translate this into spoken language sentences.

An essential component of the system is the recognition of sign glosses—spoken language words

that represent the meaning of individual signs and serve as minimal lexical items. The

researchers' model recognizes these glosses using specially designed transformer encoders that

are adept at handling high-dimensional spatiotemporal data (such as sign videos). These

encoders are trained to understand the 3D signing space and how signers interact and move

within that space. Additionally, the transformer architecture also captures the multi-articulatory

nature of sign language, where multiple channels are used simultaneously to convey information

—including manual gestures and non-manual features like facial expressions, mouth movements,

and the position and movements of the head, shoulders, and torso.

For the translation task, once the sign glosses and their meaning have been understood, the

system embarks on converting this information into spoken language sentences. This conversion
is not straightforward due to the structural and grammatical differences between sign language

and spoken language, such as different ordering of words and the use of space and motion to

convey relationships between entities. To address this, the researchers applied transformer

decoders, which take the output from the transformer encoders and generate a sequence of

spoken language words. This output is then shaped into coherent sentences that represent the

meaning of the sign language input. In sum, the sophisticated encoder-decoder structure of the

transformers allows the system to perform both recognition and translation tasks effectively, and

its application led to significant performance gains in the challenging RWTH-PHOENIX-

Weather-2014 T dataset (Camgöz et al., 2020).
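
The following Python sketch is an illustrative, simplified rendering of the joint set-up described above, not the authors' published code: a shared transformer encoder feeds both a gloss-recognition head trained with CTC loss and an autoregressive decoder trained with cross-entropy for translation. All dimensions and vocabulary sizes are placeholders.

import torch
import torch.nn as nn

class JointSignTransformer(nn.Module):
    def __init__(self, feat_dim=1024, d_model=512, n_gloss=1200, n_words=3000):
        super().__init__()
        self.embed = nn.Linear(feat_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=3)
        self.gloss_head = nn.Linear(d_model, n_gloss + 1)        # +1 for the CTC blank symbol
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=3)
        self.word_embed = nn.Embedding(n_words, d_model)
        self.word_head = nn.Linear(d_model, n_words)

    def forward(self, frame_feats, target_words):
        memory = self.encoder(self.embed(frame_feats))           # (batch, frames, d_model)
        gloss_logits = self.gloss_head(memory)                   # scored with CTC loss (recognition)
        dec_out = self.decoder(self.word_embed(target_words), memory)
        word_logits = self.word_head(dec_out)                    # scored with cross-entropy (translation)
        return gloss_logits, word_logits

# Training combines both objectives, e.g.
#   loss = ctc_loss(gloss_logits, gloss_targets) + cross_entropy(word_logits, next_words)
# so recognition and translation are learned jointly without frame-level timing labels.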

4.2.2 Analysis of Proposed system: Sign Language recognition using Neural Networks
The proposed system captures keypoints from sequences of images using OpenCV and Google's MediaPipe Holistic model, which detects face, pose and hand landmarks as key points. The dataset is stored as several sequences of frames, with the key points of each frame pushed into a NumPy array. The system is then built and trained using a Long Short-Term Memory (LSTM) deep learning model formed from three LSTM layers and three Dense layers. This model was trained for 2000 epochs with a batch size of 128 on the extracted dataset, minimizing the categorical cross-entropy loss with the Adam optimizer. Finally, after building the neural network, real-time sign language recognition is performed using Streamlit and OpenCV, where the gestures are recognized and displayed as text within the highlighted area.
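
A minimal Keras sketch of the network described above is shown below: three LSTM layers followed by three Dense layers, compiled with the Adam optimizer and categorical cross-entropy. The layer widths and the 30 x 1662 keypoint input shape are assumptions, since only the layer counts and training settings are stated in the text.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

n_classes = 3   # placeholder: one output unit per sign class
model = Sequential([
    LSTM(64, return_sequences=True, activation="relu", input_shape=(30, 1662)),
    LSTM(128, return_sequences=True, activation="relu"),
    LSTM(64, return_sequences=False, activation="relu"),
    Dense(64, activation="relu"),
    Dense(32, activation="relu"),
    Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["categorical_accuracy"])

# Training configuration reported in the text:
# model.fit(X_train, y_train, epochs=2000, batch_size=128)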

4.2.3 Comparison of Proposed system and existing system

Proposed System (Sign Language Recognition Using Neural Networks)

● Focuses on real-time detection and classification: This application prioritizes processing

live video streams and providing instant sign language action recognition.

● Uses keypoints for prediction: It extracts keypoints from body pose, face, and hands to

represent signs and feeds them to a pre-trained machine learning model for classification.

Sign Language Transformers

i. Focuses on end-to-end recognition and translation: This research proposes a unified

model that can both recognize signs and translate them into spoken language sentences.

ii. Processes video frames directly: The model takes full video frames as input, potentially capturing more nuanced information about signs.

iii. Leverages transformers for complex tasks: The system utilizes a powerful transformer

architecture capable of handling high-dimensional data and understanding the intricacies of sign

language, including facial expressions and multi-articulated gestures.

(Level of translation) The first system classifies individual signs, while Sign Language

Transformers aim for full sentence translation. (Data Processing) The first system uses

keypoints, while Sign Language Transformers process entire video frames. (Model Architecture)

The first system uses a pre-trained machine learning model, while Sign Language Transformers

leverage a complex transformer architecture.

4.3 Limitation of Existing Systems

Existing sign language translation systems often face limitations in accuracy, real-time

performance, and user experience. Accuracy can be hampered by factors like limited training

data, variations in signing styles, and complex hand movements. Real-time translation can be

computationally demanding and may struggle with rapid signing or background noise. User

interfaces might not be intuitive for all users, especially those unfamiliar with sign language.

These limitations can hinder effective communication and highlight the need for ongoing

research in sign language translation technology.

4.3.1 Limitations Of The Transformer-based System For Sign Language Recognition And
Translation Presented By Camgöz et al

Domain Limitation: The state-of-the-art results achieved by the system are within the context of

a limited domain of discourse which is weather forecasts (Camgöz et al., 2020). The

performance in more generic or diverse sign language contexts may not be as high, indicating a

limitation in the system's ability to generalize across different subjects or domains outside of this

controlled scope.

Recognition Complexity: Despite advances, the system must address the challenge of

recognizing sign glosses from high-dimensional spatiotemporal data, which is complex due to

the asynchronous multi-articulatory nature of sign languages (Camgöz et al., 2020). The models

need to accurately comprehend the 3D signing space and understand what these different aspects

mean in combination, which remains a significant modeling challenge.

Sign Segmentation: The translation system needs to accomplish the task of sign segmentation,

detecting sign sentences from continuous sign language videos (Camgöz et al., 2020). Unlike

text, which has punctuation, or spoken languages that have pauses, sign language does not have

obvious delimiters, making segmentation for translating continuous sign language a persistent

issue that is not yet fully resolved in the literature (Camgöz et al., 2020).

4.3.2 Limitations Of The Proposed System

Limited Vocabulary and Sign Language Variations: LSTMs require a substantial amount of

training data to learn the complex relationships between keypoints and their corresponding signs.

This can be challenging for sign languages with vast vocabularies, limiting the system's ability to

recognize and translate a wide range of signs.

Dependency On Keypoint Detection Accuracy: The system's performance heavily relies on the

accuracy of keypoint detection. Errors in identifying keypoints (e.g., due to lighting, occlusion,

or hand posture variations) can lead to misinterpretations and incorrect translations.

Computational Cost: Training LSTMs on large datasets can be computationally expensive,

requiring significant processing power and time. Additionally, real-time translation necessitates

efficient inference algorithms to minimize latency.

Limited Context and Idioms: Sign language relies heavily on facial expressions and body language to convey context. Current systems struggle to capture these subtleties, which can lead to misunderstandings, especially for idioms or nuanced expressions. For example, pointing at an object can carry a different meaning from simply pointing, and the current proposed system is not able to detect such distinctions.

Background Noise and Occlusions: Real-world environments often have background noise or

occlusions (e.g., from other people or objects). These factors can disrupt keypoint detection and

impact translation accuracy.

Lighting Variations: Changes in lighting conditions can affect the quality of video frames,

making it harder for the system to accurately identify keypoints.

4.4 Proposed System: Design and Implementation of a Sign Language Translation System

This section introduces a system utilizing a Long Short-Term Memory (LSTM) model together with a pretrained model for detecting landmark keypoints on the face, pose and hands for sign language translation. By addressing the limitations identified in existing systems (Section 4.3), this approach aims to provide a more comprehensive and efficient method for translating sign language for non-signers.

4.4.1 Rationale for Implementing a Sign Language Translation System Using an LSTM

Sign language translation systems bridge the communication gap between signers and non-signers, allowing for smoother interaction in a variety of social settings.

Information Access: Sign language translation systems can provide real-time access to spoken

information for deaf and hard-of-hearing individuals. This includes lectures, meetings,

presentations, news broadcasts, and other audio-based content.

Improved Participation: These systems can enable deaf and hard-of-hearing people to actively

participate in conversations and express themselves more readily. This fosters inclusivity and a

sense of belonging in various environments.

4.4.2. System Architecture


System Architecture for the Sign Language Recognition System: this architecture combines a Long Short-Term Memory (LSTM) neural network for real-time sign language recognition with a WebRTC-based video component and the Streamlit web library in Python. Here's a breakdown of its components:

Data Preprocessing For Model Training:

1. Input Data: The system expects a sequence of keypoint vectors extracted from video frames, representing the user's body language (pose, hands, face). The input shape is defined as `(30, 1662)`, where 30 represents the number of frames (the time window) used to capture the sign language gesture, and 1662 represents the dimensionality of each keypoint vector, containing the x, y and z coordinates (plus a visibility value for pose landmarks) of the detected body landmarks.
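The 1662-dimensional vector follows directly from the landmark counts used in the keypoint extraction step (see the `extract_keypoints` function in the appendix): 33 pose landmarks with four values each, 468 face landmarks with three values each, and 21 landmarks per hand with three values each. A quick check of the arithmetic:

```python
# Dimensionality of one keypoint vector, matching extract_keypoints in the appendix
pose = 33 * 4        # x, y, z and visibility per pose landmark
face = 468 * 3       # x, y, z per face-mesh landmark
hands = 2 * 21 * 3   # x, y, z per hand landmark, two hands
print(pose + face + hands)  # 1662
```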

Model Architecture:

1. Sequential Model: A sequential model is used to stack multiple LSTM layers for effective

learning of temporal dependencies in the sign language gestures.

2. LSTM Layers:

3 LSTM layers: The model utilizes three LSTM layers with the following configurations:

1st LSTM Layer: 64 units, ReLU activation, returns sequences.

2nd LSTM Layer: 128 units, ReLU activation, returns sequences.

3rd LSTM Layer: 64 units, ReLU activation, does not return sequences (flattens the output).

LSTMs are adept at capturing long-term dependencies present in sequential data, which is why this architecture was chosen.

3. Dense Layers:

2 Dense layers: Two fully-connected dense layers are added after the LSTM layers:

1st Dense Layer: 64 units, ReLU activation.

2nd Dense Layer: 32 units, ReLU activation.

These layers help extract higher-level features from the LSTM outputs.

4. Output Layer:

Dense layer with Softmax activation: The final layer has a number of units equal to the

number of actions in the sign language vocabulary (e.g., "Hello", "Thanks", "I love you").

The Softmax activation ensures the output probabilities sum to 1, representing the likelihood

of each action class.
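A minimal Keras sketch of this stack, mirroring the training code in the appendix (the three action classes are the example vocabulary used in this project):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

num_actions = 3  # e.g. 'hello', 'thanks', 'iloveyou'

model = Sequential([
    LSTM(64, return_sequences=True, activation='relu', input_shape=(30, 1662)),
    LSTM(128, return_sequences=True, activation='relu'),
    LSTM(64, return_sequences=False, activation='relu'),
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(num_actions, activation='softmax'),
])
```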

Model Training:

1. Optimizer: The Adam optimizer is used for efficient gradient descent during training.

2. Loss Function: Categorical cross-entropy is used as the loss function, suitable for multi-class classification problems like sign language recognition.

3. Metrics: Categorical accuracy is used as a metric to monitor training progress and evaluate

model performance.

4. Training Data: The model is trained on pre-processed training data (`X_train`) containing

sequences of keypoint vectors and corresponding labels (`y_train`) indicating the performed sign

language actions.
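A sketch of the corresponding compile and fit calls, assuming `model`, `X_train` and `y_train` have been prepared as described above (the epoch count and batch size follow the values reported in Section 4.2.2):

```python
from tensorflow.keras.callbacks import TensorBoard

tb_callback = TensorBoard(log_dir='Logs')  # log training progress for TensorBoard

model.compile(optimizer='Adam',
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])
model.fit(X_train, y_train, epochs=2000, batch_size=128, callbacks=[tb_callback])
```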

Data Acquisition For the System:

WebRTC Streamer: Streamlit WebRTC component is used to capture the user's webcam video

and feed it to the MediaPipe model for processing in real-time.

MediaPipe: The system utilizes MediaPipe's Holistic model for multi-modal human body landmark detection. It extracts keypoints for the user's face, hands and pose from the webcam stream; keypoints of the same form were used to train the LSTM that predicts the sign.
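A minimal sketch of how the webcam feed can be wired into the keypoint pipeline with streamlit-webrtc is shown below. The class and variable names are illustrative, and the exact callback API depends on the streamlit-webrtc version installed:

```python
import av
import cv2
import mediapipe as mp
from streamlit_webrtc import webrtc_streamer, VideoProcessorBase

mp_holistic = mp.solutions.holistic

class SignProcessor(VideoProcessorBase):
    def __init__(self):
        self.holistic = mp_holistic.Holistic(min_detection_confidence=0.5,
                                             min_tracking_confidence=0.5)

    def recv(self, frame: av.VideoFrame) -> av.VideoFrame:
        img = frame.to_ndarray(format="bgr24")
        # BGR -> RGB for MediaPipe, then run landmark detection
        results = self.holistic.process(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
        # ...append extract_keypoints(results) to the shared 30-frame sequence here...
        return av.VideoFrame.from_ndarray(img, format="bgr24")

webrtc_streamer(key="sign-language", video_processor_factory=SignProcessor,
                async_processing=True)
```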

Sign Language Recognition

Preprocessing: The captured video frames are converted from BGR to RGB color format for compatibility with the MediaPipe model.

Landmark Extraction: MediaPipe processes each frame and outputs keypoint landmarks for the

face, hands, and pose.

Keypoint Sequence Building: The extracted keypoints from each frame are combined and stored

in a sequence (30 frames). A lock (`sequence_lock`) ensures thread safety for accessing and

modifying the sequence.

Model Prediction:

Action Prediction: When the sequence reaches 30 frames (representing a short window of sign

language), the system predicts the performed action using a pre-trained TensorFlow LSTM

model.

Prediction Results: The model takes the sequence of keypoints as input and outputs the

probability of each action in the defined vocabulary (e.g., "Hello", "Thanks", "I love you").The

action with the highest probability is considered the predicted sign language gesture.
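A sketch of this windowed prediction step, assuming `sequence` is the list of keypoint vectors maintained by the video callback, `model` is the trained network and `actions` is the label array (e.g. ['hello', 'thanks', 'iloveyou']):

```python
import numpy as np

if len(sequence) >= 30:
    # shape (1, 30, 1662): one batch containing the latest 30-frame keypoint window
    window = np.expand_dims(np.array(sequence[-30:]), axis=0)
    probabilities = model.predict(window)[0]
    predicted_sign = actions[np.argmax(probabilities)]
```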

Result Display

Landmarks Visualization: If a face is detected, the system overlays the detected landmarks and connections on the video frame.

Sign Prediction Overlay: The predicted sign language action is displayed as text on top of the

video frame.

The system utilizes asynchronous processing (`async_processing=True`) with the WebRTC

streamer for efficient handling of video frames and model prediction.

This architecture allows for real-time sign language recognition by continuously capturing video

frames, extracting keypoints, building a sequence, and predicting the user's sign language

gestures using the trained model.

4.4.3. Advantages of the Proposed System


i. LSTM for Temporal Dependence: Utilizes Long Short-Term Memory (LSTM) layers, which

are well-suited for capturing the temporal dependencies present in sign language gestures. Sign

language involves sequences of hand movements and poses over time, and LSTMs can

effectively learn these patterns.

ii. Real-time Recognition: Employs a WebRTC streamer to capture video frames from the user's

webcam in real-time. This allows for immediate processing and prediction of sign language

gestures as the user performs them.

iii. MediaPipe Integration: Leverages MediaPipe's Holistic model for efficient and accurate

multi-modal human body landmark detection. This extracts relevant keypoints from the user's

face, hands, and pose, providing necessary data for the LSTM model.

iv. Scalable Model Architecture: Uses a modular architecture with sequential LSTM layers

followed by dense layers for feature extraction and classification. This allows for easy

customization and potential expansion of the model's capabilities.

v. Reduced Complexity: Utilizing a pretrained model for feature extraction reduces overall system complexity.

vi. Cross-platform: The system was developed using a web library, so it is accessible from any device that has a camera and a web browser.

4.5 SYSTEM MODELLING

System modelling is the process of creating abstract models of a system, each with a unique

perspective or viewpoint on that system. The design tool utilized in this study is the Unified

Modeling Language. It is a standard graphical notation for describing software analysis and

designs. UML leverages symbols to describe and document the application development process.

When UML notation is employed, it provides an efficient means of communication and a detailed explanation of a system's design.

4.5.1 System Activity Diagram
A user flow diagram, also known as a user journey or user process flow, visually represents the

steps a user takes within a system or application to complete a specific task or goal. It shows the

user's actions and interactions with the system, highlighting the pathways and decision points

along the way. The objective of a user flow diagram is to provide a clear and concise overview

of the user's experience and movement within the application. It helps designers and stakeholders

understand the user's journey, identify potential pain points or usability issues, and make

informed decisions to enhance the user experience.

Figure 4.5.1 illustrates the entire user journey throughout the web app.

4.5.2 Use Case Diagram for Sign Language Translation Web Application

A use case diagram is a graphical representation of the functional requirements of a system or

software application. It illustrates the interactions between users (actors) and the system,

showing how they work together to achieve specific goals or tasks. In the context of a sign

language translation web application, use case diagrams are used to capture and explain the

application's functionality and the roles of the various actors involved.

Figure 4.5.2 illustrates the functional requirements of the web app.

4.5.3 Sequence Diagram for Sign Language Translation Web Application

A sequence diagram is a type of interaction diagram in UML (Unified Modeling Language) that

illustrates the sequence of interactions between objects or components within a system. It depicts

the flow of messages exchanged between these objects over time to achieve a specific

functionality or scenario. In a sign language translation web application, a sequence diagram

shows how various components, such as the user interface, translation engine, and database,

interact to provide translation services.

Figure 4.5.3 illustrates the interactions that take place in the web app.

4.6 AI Process

Artificial intelligence (AI) development follows a structured process. First, relevant data is

collected and prepared for the AI system to learn from. This might involve cleaning, organizing,

and transforming the data into a suitable format. Next, the appropriate AI model type (e.g.,

decision tree, neural network) is chosen based on the task and data characteristics. The model's

architecture, defining its components and connections, is then designed. Training involves

feeding the prepared data to the model, allowing it to learn by adjusting internal parameters to

identify patterns and relationships. The model's performance is then evaluated using unseen data

from a testing set to assess its ability to generalize its learning to new situations. Finally, well-

performing models can be deployed for real-world use, with ongoing monitoring to ensure

continued accuracy and the potential for retraining with new data over time.

4.6.1 Data Collection

For this study, I created the dataset myself by capturing keypoint sequences from my own sign

language gestures. I recorded 30 different sequences for each sign class, extracting 30 frames of

data per sequence. This type of sequential data is essential for training Long Short-Term

Memory (LSTM) networks. LSTMs require a minimum amount of sequential data to achieve

even basic levels of accuracy in predicting sign language gestures.
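For reference, this collection scheme produces a folder per action class and a numbered folder per sequence, each holding 30 saved keypoint vectors (the layout used by the appendix code). A quick sanity check of the collected data might look like this; the folder name and class list are the ones used in this project:

```python
import os

DATA_PATH = 'MP_Data'                      # data folder used in the appendix code
actions = ['hello', 'thanks', 'iloveyou']  # example classes from this project

for action in actions:
    n = len(os.listdir(os.path.join(DATA_PATH, action)))
    print(action, n, 'sequences')          # expected: 30 sequences per class

# each sequence folder holds 30 .npy files, one 1662-value keypoint vector per frame
```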

4.6.2 Data Processing

After data collection using keypoint sequences, data processing is crucial to prepare the data for

training of the LSTM model. Here's a breakdown of the data processing steps:

Missing Values Check: Check for missing keypoints or frames in the sequences. These can be addressed by removing sequences with excessive missing data, or by imputing missing values using techniques such as interpolation or filling with the previous/next available values (depending on the context).

Data Splitting: Divide the processed data into training, validation, and test sets. Training set:

Used to train the LSTM model. Validation set: Used to monitor model performance during

training and prevent overfitting. Test set: Used to evaluate the model's final performance on

unseen data.
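A sketch of the splitting step with scikit-learn, assuming `X` is the stacked keypoint array and `y` the one-hot labels built as in the appendix code (the validation split here is an illustrative addition; the appendix code itself uses only a train/test split):

```python
from sklearn.model_selection import train_test_split

# hold out 5% of the sequences for final testing, as in the appendix code
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05)

# optionally carve a validation set out of the training data to monitor overfitting
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1)
```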

4.6.3 Training Of The Model

For this research, I trained an LSTM model using the Keras library within a Python environment.

The model architecture consists of three stacked LSTM layers with varying hidden unit sizes (64,

128, and 64) and ReLU activation functions. The first two layers utilize (return_sequences=True)

to maintain the sequential nature of the data. A final dense layer with a softmax activation

predicts the most likely action class from the provided keypoint sequences. The model was

compiled with the Adam optimizer, categorical cross-entropy loss function, and categorical

accuracy metric. Training was conducted for 2000 epochs, utilizing a TensorBoard callback to monitor training progress and help diagnose overfitting.

Figure 4.6.3: An image of the training process.

4.6.4 Testing Of The Model

After training, tests were performed on the model to quantify its accuracy. The `train_test_split` function from the scikit-learn library in Python was used to split the dataset into training and testing sets. This is a crucial step in machine learning, ensuring the model doesn't simply memorize the training data and can generalize well to unseen examples. After splitting the dataset, the model was used to make predictions on the test set, and the outputs were compared against the expected labels.
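A sketch of this check, following the appendix code and assuming `model`, `X_test`, `y_test` and the `actions` array are already defined:

```python
import numpy as np

res = model.predict(X_test)                 # class probabilities for each test sequence
predicted = actions[np.argmax(res[0])]      # predicted label for the first test sample
expected = actions[np.argmax(y_test[0])]    # ground-truth label for the same sample
print(predicted, expected)
```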

4.6.5 Model Evaluation

Model evaluation is a crucial step in the AI Process. It determines how well a trained model

performs on unseen data. This involves testing the model on a separate dataset (testing set) and

measuring its performance using metrics relevant to the task.

Evaluation Metrics Used

Evaluation metrics are quantitative measures used to assess the performance of a trained model.

The choice of metrics depends on the specific task and the desired outcome of your model.

Classification Tasks

These metrics evaluate how well a model can categorize data points into predefined classes:

i. Accuracy: The proportion of correctly classified samples (true positives + true negatives) divided by the total number of samples. It is a good overall measure. The accuracy of the trained model is 0.8, calculated using the scikit-learn Python library.

ii. Precision: The proportion of predicted positives that are actually correct (true positives divided by true positives + false positives). It measures how good the model is at identifying actual positives. The precision of the trained model is approximately 0.867, calculated using scikit-learn.

iii. Recall: The proportion of actual positives that are correctly identified by the model (true positives divided by true positives + false negatives). It measures how good the model is at capturing all relevant positives. The recall of the trained model is 0.8, calculated using scikit-learn.

iv. F1-Score: The harmonic mean of precision and recall, combining both metrics into a single score and providing a balance between them. The F1-score of the trained model is approximately 0.787, calculated using scikit-learn.
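A sketch of how these figures can be computed with scikit-learn, assuming `model` is the trained network and `y_test` holds one-hot encoded labels; the `average='weighted'` setting for the multi-class metrics is an assumption about how the reported values were obtained:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = np.argmax(y_test, axis=1)                   # integer class labels
y_pred = np.argmax(model.predict(X_test), axis=1)    # predicted class labels

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred, average='weighted'))
print(recall_score(y_true, y_pred, average='weighted'))
print(f1_score(y_true, y_pred, average='weighted'))
```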

4.6.6 Model Deployment (Mobile or Web)

The model is saved as a Keras model in .h5 format. While the model itself isn't directly deployed as a web or mobile app, it can be integrated into a larger application for user interaction and real-world deployment. In my sign language translation system, I built a web app to demonstrate how the model works.
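A sketch of the save and load steps around deployment (the file name follows the appendix code):

```python
from tensorflow.keras.models import load_model

model.save('action.h5')                    # persist the trained model in HDF5 format
deployed_model = load_model('action.h5')   # reload it inside the web app at startup
```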

4.7 Functionality/Experimentation and Results 1

Experimentation: Make gestures without complete landmark keypoints being detected.

Results: The model can still recognize the signs even if the landmark keypoints are not complete.

Figure 4.7: The user interface showing the model predicting the correct sign with only half of the keypoints detected.

4.8 Functionality/Experimentation and Results 2

Experimentation: Make gestures very quickly and check whether the model still makes an accurate prediction.

Results: The model can sometimes mistake signs if they are performed too quickly.

Figure 4.8: The user interface showing the model predicting the correct sign with complete keypoints detected for single-handed signs.

4.9 Functionality/Experimentation and Results 3

Experimentation: Ensure the landmark keypoints are complete so that the model has a full landmark keypoint sequence from which to make a prediction.

Result: Whenever the keypoints are complete, there is a higher chance that the model will predict the sign correctly.

Figure 4.9: The user interface showing the model predicting the correct sign with complete keypoints detected for double-handed signs.

4.10 DEVELOPMENT TOOLS AND TECHNOLOGIES

The sign language translation web app leverages a combination of cutting-edge technologies to

deliver an accurate, efficient, and user-friendly experience. Here's an overview of the key tools

and their functionalities:

● Web Library: Streamlit

○ Description: Streamlit, a robust library that uses only Python to build web apps, was used to create the user interface.

● Computer Vision Library: OpenCV

○ Description: During data collection and training, OpenCV was used to capture and process the user's webcam feed so that hand gestures could be recognized.

● Sign Language Translation Model: Neural Network

○ Description: At the heart of the system lies a Neural Network trained on a dataset

of sequenced frames and their corresponding translations. This model, built using

the TensorFlow framework, analyzes the recognized signs and translates them

into the desired language.

● WebRTC API:

○ Description: For real-time communication features, the WebRTC API can be

implemented. This API enables direct browser-to-browser audio and video

communication, potentially facilitating video calls between sign language users

and interpreters or other individuals.

This combination of tools allows for a web-based sign language translation app that is not only

accurate but also user-friendly and accessible from any device with a web browser.

CHAPTER 5 – SUMMARY AND CONCLUSION

This report outlines the potential of a sign language translation system, highlighting its key

functionalities, advantages, and areas for further exploration.

5.1 Summary

The system facilitates communication between deaf and hearing individuals by translating sign language into spoken or written language, and vice versa. Users can create profiles detailing their communication preferences (sign language type, spoken language) and areas of expertise (if applicable for professional settings). The system employs machine learning algorithms to recognize and translate signs accurately. Real-time translation capabilities enable fluid conversations. The system can be integrated with various platforms, including video conferencing tools, for broader accessibility.

5.2 Contribution of Research and Conclusion

Further research and development are necessary to refine the translation accuracy and encompass a wider range of sign languages and dialects. A user-centered design approach should be adopted to ensure the system is intuitive and user-friendly for deaf and hearing individuals with varying levels of technical proficiency. Collaboration with deaf communities and sign language experts is crucial for gathering feedback and ensuring the system effectively addresses their communication needs. Exploring integration with educational resources and social platforms can broaden the system's impact and promote language learning and cultural exchange.

5.3 Recommendations

Current sign language translation technology is under development, and accuracy may be affected by factors like signing speed, variations in execution, and background noise. Limited availability of training data for certain sign languages can hinder translation accuracy for those languages. The system may not capture the full nuance of sign language communication, which often incorporates facial expressions and body language.

5.4 Further Research Considerations

Investigate methods to improve translation accuracy, particularly for complex sentences and idiomatic expressions. Develop mechanisms to account for regional variations and cultural nuances within sign languages. Explore integration with sentiment analysis tools to better convey the emotional context of signed communication. Research ethical considerations surrounding data privacy and potential biases within the translation algorithms.

5.5 Conclusion

Sign language translation systems hold immense potential to break down communication barriers

and foster greater social inclusion. By addressing the limitations and continuously developing the

technology, these systems can empower deaf and hearing individuals to connect and participate

more fully in all aspects of life.

References

Katiyar, P., Shukla, K. S. K., & Kumar, V. (2023, May). Analysis of human action recognition using machine learning techniques. In 2023 4th International Conference for Emerging Technology (INCET) (pp. 1-4). IEEE.

Guo, L., Lu, Z., & Yao, L. (2021). Human-machine interaction sensing technology based on hand gesture recognition: A review. IEEE Transactions on Human-Machine Systems, 51(4), 300-309.

Achenbach, P., Laux, S., Purdack, D., Müller, P. N., & Göbel, S. (2023). Give me a sign: Using data gloves for static hand-shape recognition. Sensors, 23(24), 9847.

Camgoz, N. C., Hadfield, S., Koller, O., Ney, H., & Bowden, R. (2018). Neural sign language translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7784-7793).

Rastgoo, R., Kiani, K., & Escalera, S. (2021). Sign language recognition: A deep survey. Expert Systems with Applications, 164, 113794.

Camgoz, N. C., Koller, O., Hadfield, S., & Bowden, R. (2020). Sign language transformers: Joint end-to-end sign language recognition and translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10023-10033).

Huang, J., Zhou, W., Zhang, Q., Li, H., & Li, W. (2018, April). Video-based sign language recognition without temporal segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 32, No. 1).

Cheok, M. J., Omar, Z., & Jaward, M. H. (2019). A review of hand gesture and sign language recognition techniques. International Journal of Machine Learning and Cybernetics, 10, 131-153.

Li, D., Rodriguez, C., Yu, X., & Li, H. (2020). Word-level deep sign language recognition from video: A new large-scale dataset and methods comparison. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 1459-1469).

Varshney, P. K., Kumar, S. K., & Thakur, B. (2024). Real-time sign language recognition. In Medical Robotics and AI-Assisted Diagnostics for a High-Tech Healthcare Industry (pp. 81-92). IGI Global.

Wadhawan, A., & Kumar, P. (2021). Sign language recognition systems: A decade systematic literature review. Archives of Computational Methods in Engineering, 28, 785-813.

Bragg, D., Koller, O., Bellard, M., Berke, L., Boudreault, P., Braffort, A., ... & Ringel Morris, M. (2019, October). Sign language recognition, generation, and translation: An interdisciplinary perspective. In Proceedings of the 21st International ACM SIGACCESS Conference on Computers and Accessibility (pp. 16-31).

APPENDIX EXCERPT OF PROGRAM SOURCE CODE

#### Step-by-Step Instructions for Sign Language Translation System

1. **Import and Install Dependencies**


```python
!pip install tensorflow==2.4.1 tensorflow-gpu==2.4.1 opencv-python mediapipe scikit-learn matplotlib
import cv2
import numpy as np
import os
from matplotlib import pyplot as plt
import time
import mediapipe as mp
```

2. **Initialize MediaPipe Holistic Model**


```python
mp_holistic = mp.solutions.holistic
mp_drawing = mp.solutions.drawing_utils
```

3. **Function to Process Image with MediaPipe**


```python
def mediapipe_detection(image, model):
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # BGR -> RGB for MediaPipe
    image.flags.writeable = False                   # mark read-only to improve performance
    results = model.process(image)                  # run landmark detection
    image.flags.writeable = True
    image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)  # back to BGR for OpenCV display
    return image, results
```

4. **Function to Draw Landmarks**


```python

def draw_landmarks(image, results):
    # FACE_CONNECTIONS comes from older MediaPipe releases; newer versions expose
    # mp_holistic.FACEMESH_CONTOURS / FACEMESH_TESSELATION instead.
    mp_drawing.draw_landmarks(image, results.face_landmarks, mp_holistic.FACE_CONNECTIONS)
    mp_drawing.draw_landmarks(image, results.pose_landmarks, mp_holistic.POSE_CONNECTIONS)
    mp_drawing.draw_landmarks(image, results.left_hand_landmarks, mp_holistic.HAND_CONNECTIONS)
    mp_drawing.draw_landmarks(image, results.right_hand_landmarks, mp_holistic.HAND_CONNECTIONS)
```

5. **Function to Draw Styled Landmarks**


```python
def draw_styled_landmarks(image, results):
    mp_drawing.draw_landmarks(
        image, results.face_landmarks, mp_holistic.FACEMESH_TESSELATION,
        mp_drawing.DrawingSpec(color=(80, 110, 10), thickness=1, circle_radius=1),
        mp_drawing.DrawingSpec(color=(80, 256, 121), thickness=1, circle_radius=1)
    )
    mp_drawing.draw_landmarks(
        image, results.pose_landmarks, mp_holistic.POSE_CONNECTIONS,
        mp_drawing.DrawingSpec(color=(80, 22, 10), thickness=2, circle_radius=4),
        mp_drawing.DrawingSpec(color=(80, 44, 121), thickness=2, circle_radius=2)
    )
    mp_drawing.draw_landmarks(
        image, results.left_hand_landmarks, mp_holistic.HAND_CONNECTIONS,
        mp_drawing.DrawingSpec(color=(121, 22, 76), thickness=2, circle_radius=4),
        mp_drawing.DrawingSpec(color=(121, 44, 250), thickness=2, circle_radius=2)
    )
    mp_drawing.draw_landmarks(
        image, results.right_hand_landmarks, mp_holistic.HAND_CONNECTIONS,
        mp_drawing.DrawingSpec(color=(245, 117, 66), thickness=2, circle_radius=4),
        mp_drawing.DrawingSpec(color=(245, 66, 230), thickness=2, circle_radius=2)
    )
```

6. **Capture Video and Apply MediaPipe Model**


```python
cap = cv2.VideoCapture(0)
with mp_holistic.Holistic(min_detection_confidence=0.5, min_tracking_confidence=0.5) as holistic:
    while cap.isOpened():
        ret, frame = cap.read()
        image, results = mediapipe_detection(frame, holistic)
        draw_styled_landmarks(image, results)
        cv2.imshow('OpenCV Feed', image)
        if cv2.waitKey(10) & 0xFF == ord('q'):
            break
cap.release()
cv2.destroyAllWindows()
```

7. **Extract Keypoint Values**


```python
def extract_keypoints(results):
    pose = np.array([[res.x, res.y, res.z, res.visibility] for res in results.pose_landmarks.landmark]).flatten() if results.pose_landmarks else np.zeros(33*4)
    face = np.array([[res.x, res.y, res.z] for res in results.face_landmarks.landmark]).flatten() if results.face_landmarks else np.zeros(468*3)
    lh = np.array([[res.x, res.y, res.z] for res in results.left_hand_landmarks.landmark]).flatten() if results.left_hand_landmarks else np.zeros(21*3)
    rh = np.array([[res.x, res.y, res.z] for res in results.right_hand_landmarks.landmark]).flatten() if results.right_hand_landmarks else np.zeros(21*3)
    return np.concatenate([pose, face, lh, rh])
```

8. **Setup Folders for Data Collection**


```python
DATA_PATH = os.path.join('MP_Data')
actions = np.array(['hello', 'thanks', 'iloveyou'])
no_sequences = 30
sequence_length = 30
start_folder = 30

for action in actions:
    dirmax = np.max(np.array(os.listdir(os.path.join(DATA_PATH, action))).astype(int))
    for sequence in range(1, no_sequences + 1):
        try:
            os.makedirs(os.path.join(DATA_PATH, action, str(dirmax + sequence)))
        except:
            pass
```

9. **Collect Keypoint Values for Training and Testing**


```python
cap = cv2.VideoCapture(0)
with mp_holistic.Holistic(min_detection_confidence=0.5, min_tracking_confidence=0.5) as holistic:
    for action in actions:
        for sequence in range(start_folder, start_folder + no_sequences):
            for frame_num in range(sequence_length):
                ret, frame = cap.read()
                image, results = mediapipe_detection(frame, holistic)
                draw_styled_landmarks(image, results)
                if frame_num == 0:
                    cv2.putText(image, 'STARTING COLLECTION', (120, 200),
                                cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 4, cv2.LINE_AA)
                    cv2.putText(image, f'Collecting frames for {action} Video Number {sequence}',
                                (15, 12), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1, cv2.LINE_AA)
                    cv2.imshow('OpenCV Feed', image)
                    cv2.waitKey(500)
                else:
                    cv2.putText(image, f'Collecting frames for {action} Video Number {sequence}',
                                (15, 12), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1, cv2.LINE_AA)
                    cv2.imshow('OpenCV Feed', image)
                keypoints = extract_keypoints(results)
                npy_path = os.path.join(DATA_PATH, action, str(sequence), str(frame_num))
                np.save(npy_path, keypoints)
                if cv2.waitKey(10) & 0xFF == ord('q'):
                    break
cap.release()
cv2.destroyAllWindows()
```

10. **Preprocess Data and Create Labels and Features**


```python
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

label_map = {label: num for num, label in enumerate(actions)}

sequences, labels = [], []
for action in actions:
    for sequence in np.array(os.listdir(os.path.join(DATA_PATH, action))).astype(int):
        window = []
        for frame_num in range(sequence_length):
            res = np.load(os.path.join(DATA_PATH, action, str(sequence), "{}.npy".format(frame_num)))
            window.append(res)
        sequences.append(window)
        labels.append(label_map[action])

X = np.array(sequences)
y = to_categorical(labels).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05)
```

11. **Build and Train LSTM Neural Network**


```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import TensorBoard

log_dir = os.path.join('Logs')
tb_callback = TensorBoard(log_dir=log_dir)

model = Sequential()
model.add(LSTM(64, return_sequences=True, activation='relu', input_shape=(30,1662)))
model.add(LSTM(128, return_sequences=True, activation='relu'))
model.add(LSTM(64, return_sequences=False, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(actions.shape[0], activation='softmax'))

model.compile(optimizer='Adam', loss='categorical_crossentropy',
metrics=['categorical_accuracy'])
model.fit(X_train, y_train, epochs=1000, callbacks=[tb_callback])
```

12. **Make Predictions**


```python
res = model.predict(X_test)
print(actions)
actions[np.argmax(res[4])]
actions[np.argmax(y_test[4])]
```

13. **Save Weights**


```python
model.save('action.h5')            # save the trained model to disk
model.load_weights('action.h5')    # reload the saved weights when needed
```

14. **Evaluation using Confusion Matrix and Accuracy**


```python
from sklearn.metrics import multilabel_confusion_matrix, accuracy_score, recall_score, f1_score, precision_score

# (reconstructed completion) evaluate the trained model on the held-out test set
yhat = np.argmax(model.predict(X_test), axis=1)
ytrue = np.argmax(y_test, axis=1)
print(multilabel_confusion_matrix(ytrue, yhat))
print(accuracy_score(ytrue, yhat))
```
