ABSTRACT
1.1 Introduction
Sign language is a vital mode of communication for individuals who are deaf or
hard of hearing. A sign language translator system is designed to bridge
communication gaps between sign language users and non-sign language users.
The development of such systems has grown over the years, utilizing various
technologies, including computer vision, machine learning, and artificial
intelligence, to translate gestures and signs into readable or audible language.
This technology has the potential to improve accessibility and integration for the
deaf and hard-of-hearing community into broader society.
1.2 Existing System
Currently, several systems attempt to translate sign language into text or speech.
Some of the popular solutions involve gesture recognition using cameras and
sensors, where algorithms process hand movements and convert them into a
readable format. However, these systems often face limitations in terms of
accuracy, speed, and the diversity of sign languages they can support. Many
existing solutions are also expensive or require specific hardware, limiting
accessibility for everyday use.
1.3 Advantages
• Accessibility: Sign language translator systems offer greater accessibility for the
deaf and hard of hearing, enabling communication in public spaces, education,
and workplaces.
• Improved Social Integration: These systems help break down communication
barriers, allowing individuals who use sign language to engage more seamlessly
with the wider community.
• Advancements in AI: With AI, machine learning, and computer vision, the
accuracy of translations is improving over time, making communication more
effective.
1.4 Disadvantages
1.5 Application
The applications of sign language translator systems are wide-ranging and include:
• Education: Facilitating better communication in classrooms for deaf students,
enabling inclusive learning environments.
• Public Services: Allowing more accessible communication in government
offices, hospitals, and public transport for people who use sign language.
• Assistive Technology: Providing everyday users with the ability to easily
communicate with those who are deaf or hard of hearing.
Chapter 2: Literature Survey for Problem Identification and
Specification
Over the years, research in the field of sign language translation has made
significant strides, particularly with advancements in artificial intelligence (AI),
computer vision, and machine learning. Early attempts at sign language
translation relied heavily on physical devices, such as gloves and sensors, to
capture hand movements. However, these systems were often limited by their
complexity, cost, and lack of precision in recognizing nuanced gestures. More
recent approaches use vision-based systems, where cameras and machine
learning algorithms are employed to recognize sign language gestures in real-
time. Various studies have examined the use of deep learning and convolutional
neural networks (CNNs) for gesture recognition, leading to improvements in
accuracy and efficiency.
A key development in the literature has been the use of data-driven models, which
learn from large datasets of sign language gestures to improve the system’s ability
to translate a wide range of signs. Additionally, natural language processing
(NLP) techniques have been integrated to convert the recognized signs into
spoken or written language. While there has been significant progress, challenges
remain, including difficulties in interpreting complex sign language grammar,
regional dialects, and the speed of translation.
2.2 Problem Definition
The primary problem that sign language translator systems seek to address is the
communication barrier between individuals who are deaf or hard of hearing and
those who do not understand sign language. Despite technological advancements,
most existing systems are still hindered by limitations such as:
• Inaccuracy in sign recognition, especially for complex signs or non-
standardized gestures.
• Limited real-time translation capabilities, which can make conversations slow
and cumbersome.
• Hardware limitations, as many systems require specialized equipment that is
not widely available or affordable.
• Lack of support for multiple sign languages, as many translation systems are
designed to work only with a specific sign language, such as American Sign
Language (ASL), and do not support others like British Sign Language (BSL) or
Indian Sign Language (ISL).
These challenges highlight the need for a more versatile and accessible system
capable of accurately translating a wide range of sign languages and ensuring
real-time communication.
2.3 Background of Sign Language Translator and Related Technologies
3. MediaPipe:
MediaPipe’s Hand module uses machine learning models to track hand
landmarks in real-time, making it highly useful for sign language translation
systems. By tracking the position and movement of fingers, the system can
recognize different signs accurately.
4. pyttsx3:
Unlike gTTS, pyttsx3 works offline and supports multiple text-to-speech
engines, including SAPI5 on Windows and NSSpeechSynthesizer on macOS. A
minimal usage sketch of MediaPipe and pyttsx3 is given below.
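To make the roles of these two libraries concrete, the following illustrative sketch (not part of the project code) detects the 21 hand landmarks in one webcam frame with MediaPipe and then speaks a placeholder phrase with pyttsx3; the confidence threshold and the spoken text are arbitrary choices for the example.

import cv2
import mediapipe as mp
import pyttsx3

mp_hands = mp.solutions.hands
engine = pyttsx3.init()  # offline text-to-speech engine

cap = cv2.VideoCapture(0)
with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.7) as hands:
    ok, frame = cap.read()
    if ok:
        # MediaPipe expects RGB images, OpenCV delivers BGR.
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            landmarks = results.multi_hand_landmarks[0].landmark
            print("Detected", len(landmarks), "hand landmarks")  # 21 points per hand
            # In the full system these landmarks would be classified into a sign;
            # here a placeholder phrase is simply spoken aloud.
            engine.say("Hand detected")
            engine.runAndWait()
cap.release()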
Chapter 3: Scope of the Project
The project aims to develop a sign language translator using Python that can
recognize and translate sign language gestures into text and/or speech. The
scope of this project includes the following key aspects:
1. Gesture Recognition: The system will focus on recognizing hand gestures
associated with a specific sign language (e.g., American Sign Language - ASL)
using computer vision techniques. Python libraries such as OpenCV, MediaPipe,
and TensorFlow/Keras will be used for real-time gesture recognition, where
video inputs will be processed to detect and track hand movements and positions
(a minimal capture-and-crop sketch is given after this list).
2. Translation to Text: Once the system recognizes a gesture, it will convert the
gesture into a readable format (text). This will be achieved by using pre-trained
machine learning models, such as convolutional neural networks (CNNs), which
will be trained on a dataset of sign language gestures.
3. Text-to-Speech Output: After the text translation, the system will convert the
text into speech using Text-to-Speech (TTS) technology, such as gTTS (Google
Text-to-Speech) or pyttsx3.
4. User Interaction: The user will interact with the system via a camera or webcam
that captures real-time hand gestures. The system will process these gestures and
provide immediate feedback in the form of translated text or speech. This will be
implemented with a user-friendly interface for simplicity.
5. Platform Compatibility: The project will primarily target desktop platforms
(Windows, macOS, Linux), but it could be extended to mobile platforms in the
future. Python's cross-platform capabilities and libraries will ensure that the
solution is portable.
6. Real-time Performance: The system will be optimized for real-time
performance, ensuring that sign language gestures are recognized and translated
with minimal delay.
7. Limitations:
o Language Limitation: Initially, the system will focus on one specific sign
language (such as ASL) due to dataset limitations.
o Hardware Dependency: The system will rely on a webcam for input, meaning its
performance is partially dependent on the quality of the camera.
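As a concrete illustration of the gesture-recognition and user-interaction items above, the sketch below captures webcam frames and crops the detected hand with a small margin; the crop is what would later be fed to the classifier described in section 3.2. It assumes a single webcam at index 0 and cvzone's HandDetector (a wrapper around MediaPipe); the exact return value of findHands can differ slightly between cvzone versions.

import cv2
from cvzone.HandTrackingModule import HandDetector

detector = HandDetector(maxHands=1)
cap = cv2.VideoCapture(0)
MARGIN = 29  # padding around the detected bounding box

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hands, frame = detector.findHands(frame, flipType=True)  # draws landmarks on the frame
    if hands:
        x, y, w, h = hands[0]["bbox"]
        crop = frame[max(0, y - MARGIN):y + h + MARGIN, max(0, x - MARGIN):x + w + MARGIN]
        # "crop" is the hand region that the gesture classifier would receive.
    cv2.imshow("Sign Language Translator - capture", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):  # press q to quit
        break

cap.release()
cv2.destroyAllWindows()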
3.2 Modelling and Analysis
The development of the sign language translator involves several modeling and
analysis steps that combine computer vision, machine learning, and natural
language processing.
1. Hand Gesture Recognition Model:
The core of the system is the gesture recognition model, which will be built
using deep learning techniques.
o Data Preprocessing: Before training, the dataset will be preprocessed, which may
include resizing images, normalizing pixel values, and augmenting the dataset to
introduce variations in gesture poses, lighting conditions, and backgrounds.
o Model Architecture: The CNN model will consist of several convolutional layers
followed by fully connected layers. The output of the model will be a
classification of the recognized gesture into a corresponding sign language
symbol (an illustrative Keras sketch of such a model is given after this list).
2. Text Translation and Sentence Generation:
Once a gesture is recognized, it is converted into a written text representation.
This text can be simple (one-to-one mapping between signs and words) or more
complex, depending on the context.
3. Text-to-Speech (TTS) Model:
The converted text will then be processed by a Text-to-Speech (TTS) engine to
produce spoken output. Python library pyttsx3 will be used to generate speech
from the text output of the gesture recognition model.
4. System Integration:
The gesture recognition, text translation, and speech synthesis components will
be integrated into a single Python application. This will involve creating a main
processing loop that captures video frames, processes the frames for gesture
recognition, and then feeds the output into the text and speech components.
5. Performance Metrics:
The model's performance will be evaluated using standard metrics like:
o Accuracy: The percentage of correctly classified gestures in the dataset.
o Latency: The time taken from gesture input to text or speech output (a simple
timing snippet is shown after this list).
o User Feedback: The ease with which users can interact with the system and
understand the translation.
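To make the Model Architecture step concrete, the following Keras sketch shows a compact CNN of the kind described above. It is illustrative only: the 64x64 grayscale input and the 26 output classes (one per letter) are assumptions for the example, not the exact architecture of the model trained for this project.

from tensorflow.keras import layers, models

def build_gesture_cnn(input_shape=(64, 64, 1), num_classes=26):
    # Stacked convolution + pooling blocks extract hand-shape features;
    # the dense layers classify them into one sign per output unit.
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(num_classes, activation="softmax"),  # one probability per sign
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_gesture_cnn()
model.summary()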
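For the latency metric, a simple timing wrapper such as the one below can be used; classify_frame is a hypothetical stand-in for the project's prediction routine and is not part of the actual code.

import time

def measure_latency(classify_frame, frame):
    # Time a single gesture-recognition call and report it in milliseconds.
    start = time.perf_counter()
    label = classify_frame(frame)
    latency_ms = (time.perf_counter() - start) * 1000
    print(f"Predicted '{label}' in {latency_ms:.1f} ms")
    return label, latency_ms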
Chapter 4: Methodology
The development of the sign language translator system will follow the
Waterfall Model, a traditional software development process. The Waterfall
Model is characterized by a sequential and systematic approach, where each
phase must be completed before moving on to the next. This model is
particularly suitable for projects where the requirements are well understood
upfront and the scope of work is clear. The following phases will guide the
project:
1. Requirement Gathering and Analysis:
In this phase, the objectives of the project are defined, and the project’s scope is
established. For this sign language translator, the main goal is to create a system
that can recognize sign language gestures and translate them into text and/or
speech. The necessary features and functionalities of the system, such as real-
time translation, accuracy, and the ability to recognize a set of standard gestures,
will be outlined.
2. System Design:
The system design phase involves creating the architecture of the sign language
translator system. This includes choosing the right algorithms and technologies
for gesture recognition, machine learning, and text-to-speech functionality. Key
decisions will be made on the hardware requirements (such as cameras),
software libraries (e.g., OpenCV, TensorFlow, MediaPipe), and how the system
will process and output translated signs.
3. Implementation:
During this phase, the actual coding of the system will take place. The
implementation will involve setting up Python environments, integrating
various libraries for gesture recognition (e.g., OpenCV, MediaPipe), building
and training deep learning models (e.g., TensorFlow), and developing the text-
to-speech conversion component (e.g., gTTS or pyttsx3). The system will also
include the development of the user interface for interacting with the camera
and viewing translations.
4. Integration and Testing:
Once individual modules are implemented, they will be integrated into a single
system. The integration process will focus on ensuring that the gesture
recognition, text translation, and speech synthesis components work seamlessly
together.
5. Deployment:
After successful testing and validation, the system will be deployed for real-
time use. This phase involves installing the system on user devices, such as
desktops or laptops, and ensuring it functions in different environments.
6. Maintenance:
Post-deployment, the system will enter the maintenance phase. Regular updates
and patches will be applied to improve performance, accuracy, and add support
for additional sign languages.
4.2 Technologies Used
4.3 System Requirements
The following are the system requirements needed to run the sign language
translator effectively:
1. Hardware Requirements:
o Camera: A webcam or external camera with at least 720p resolution for
capturing hand gestures. A higher frame rate (30 FPS or more) will improve
gesture recognition accuracy.
o CPU: A modern processor (Intel i5 or higher) for handling real-time image
processing and running deep learning models.
o RAM: Minimum 8 GB of RAM for smooth execution, especially when running
deep learning models.
o GPU: A dedicated GPU (e.g., NVIDIA GTX or RTX) is recommended for faster
processing when training deep learning models using TensorFlow, but not
mandatory for inference.
2. Software Requirements:
o Operating System: The system can run on Windows, macOS, or Linux.
o Python: Python 3.x installed on the system.
3. Required Libraries:
▪ OpenCV: For image processing and video capture.
▪ TensorFlow/Keras: For building and training machine learning models.
▪ MediaPipe: For hand gesture detection.
▪ pyttsx3: For converting text to speech.
4. Development Environment:
o Editor: An Integrated Development Environment (IDE) such as VS Code can be
used to write and run the Python code. A typical package-installation command
is shown below.
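One possible way to install the libraries listed above (package names as published on PyPI; the exact set depends on the final code, and pyenchant additionally requires the system enchant library on Linux or macOS):

pip install opencv-python tensorflow mediapipe pyttsx3 cvzone pyenchant python-docx Pillow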
Chapter 5: Details of Designs, Working, and Processes
5.1 Use Case Diagram
A Use Case Diagram illustrates the interaction between the system and its users
(actors), representing the key functionalities that the system provides. For the
sign language translator, the use case diagram will include interactions between
the user and the system, such as:
5.2 System DFD (Data Flow Diagram)
A Data Flow Diagram (DFD) represents how data moves within the system. The
DFD for the sign language translator can be broken down into several levels:
1. Level 0:
2. Level 1:
5.3 Sequence Diagram
A sequence diagram details the order of operations and interactions between the
system components. The sequence diagram for the sign language translator will
include the following steps:
5.4 System Description
5.5 Implementation and Testing
5.6 Code
import numpy as np
import math
import cv2
from docx import Document
import os, sys
import traceback
import pyttsx3
from keras.models import load_model
from cvzone.HandTrackingModule import HandDetector
from string import ascii_uppercase
import enchant
import tkinter as tk
from PIL import Image, ImageTk

ddd = enchant.Dict("en-US")     # spell-check dictionary used for word suggestions
hd = HandDetector(maxHands=1)   # detector for the full camera frame
hd2 = HandDetector(maxHands=1)  # detector for the cropped hand region

offset = 29  # padding (pixels) added around the detected hand bounding box
os.environ["THEANO_FLAGS"] = "device=cuda, assert_no_cpu_op=True"  # legacy backend flag; has no effect with TensorFlow
# Application: main Tkinter GUI that captures webcam frames, recognises signs
# and builds up the translated sentence.
class Application:

    def __init__(self):
        self.vs = cv2.VideoCapture(0)                      # webcam video stream
        self.current_image = None
        self.model = load_model('cnn8grps_rad1_model.h5')  # pre-trained CNN classifier

        # Offline text-to-speech engine (pyttsx3).
        self.speak_engine = pyttsx3.init()
        self.speak_engine.setProperty("rate", 100)
        voices = self.speak_engine.getProperty("voices")
        self.speak_engine.setProperty("voice", voices[0].id)

        # Per-character counters and state flags used while building the sentence.
        self.ct = {}
        self.ct['blank'] = 0
        self.blank_flag = 0
        self.space_flag = False
        self.next_flag = True
        self.prev_char = ""
        self.count = -1
        self.ten_prev_char = []
        for i in range(10):
            self.ten_prev_char.append(" ")
        for i in ascii_uppercase:
            self.ct[i] = 0
        print("Loaded model from disk")

        # --- Tkinter window and widgets ---
        self.root = tk.Tk()
        self.root.title("Sign Language To Text Conversion")
        self.root.protocol('WM_DELETE_WINDOW', self.destructor)
        self.root.geometry("1300x700")

        self.panel = tk.Label(self.root)                   # live camera feed
        self.panel.place(x=100, y=3, width=480, height=640)

        self.panel2 = tk.Label(self.root)                  # hand-skeleton image panel
        self.panel2.place(x=700, y=115, width=400, height=400)

        self.T = tk.Label(self.root)
        self.T.place(x=60, y=5)
        self.T.config(text="Sign Language To Text Conversion", font=("Courier", 30, "bold"))

        self.panel3 = tk.Label(self.root)                  # current symbol
        self.panel3.place(x=280, y=585)

        self.T1 = tk.Label(self.root)
        self.T1.place(x=10, y=580)
        self.T1.config(text="Character :", font=("Courier", 30, "bold"))

        self.panel5 = tk.Label(self.root)                  # sentence built so far
        self.panel5.place(x=260, y=632)

        self.T3 = tk.Label(self.root)
        self.T3.place(x=10, y=632)
        self.T3.config(text="Sentence :", font=("Courier", 30, "bold"))

        self.T4 = tk.Label(self.root)
        self.T4.place(x=10, y=700)
        self.T4.config(text="Suggestions :", fg="red", font=("Courier", 30, "bold"))

        # Four buttons that show spelling suggestions for the current word.
        self.b1 = tk.Button(self.root)
        self.b1.place(x=390, y=700)
        self.b2 = tk.Button(self.root)
        self.b2.place(x=590, y=700)
        self.b3 = tk.Button(self.root)
        self.b3.place(x=790, y=700)
        self.b4 = tk.Button(self.root)
        self.b4.place(x=990, y=700)

        self.speak = tk.Button(self.root)
        self.speak.place(x=1205, y=630)
        self.speak.config(text="Speak", font=("Courier", 20), wraplength=100,
                          command=self.speak_fun)

        self.save_btn = tk.Button(self.root, text="Save", font=("Courier", 20),
                                  command=self.save_sentence_to_docx)
        self.save_btn.place(x=1005, y=630)

        self.clear = tk.Button(self.root)
        self.clear.place(x=1100, y=630)
        self.clear.config(text="Clear", font=("Courier", 20), wraplength=100,
                          command=self.clear_fun)

        self.str = " "                 # sentence assembled so far
        self.ccc = 0
        self.word = " "                # word currently being spelled
        self.current_symbol = "C"
        self.photo = "Empty"

        self.video_loop()
    def video_loop(self):
        try:
            ok, frame = self.vs.read()
            if ok and frame is not None:
                cv2image = cv2.flip(frame, 1)              # mirror the frame for a natural view
                hands = hd.findHands(cv2image, draw=False, flipType=True)
                cv2image_copy = np.array(cv2image)
                cv2image = cv2.cvtColor(cv2image, cv2.COLOR_BGR2RGB)
                self.current_image = Image.fromarray(cv2image)
                imgtk = ImageTk.PhotoImage(image=self.current_image)
                self.panel.imgtk = imgtk
                self.panel.config(image=imgtk)

                if hands[0]:
                    hand = hands[0]
                    handmap = hand[0]
                    x, y, w, h = handmap['bbox']
                    # Crop the hand region with a small margin around the bounding box.
                    image = cv2image_copy[y - offset:y + h + offset, x - offset:x + w + offset]
                    white = cv2.imread("white.jpg")        # blank white canvas for the skeleton

                    if image.size != 0:
                        handz = hd2.findHands(image, draw=False, flipType=True)
                        self.ccc += 1
                        if handz[0]:
                            hand = handz[0]
                            handmap = hand[0]
                            self.pts = handmap['lmList']   # 21 hand landmarks

                            # Offsets that centre the hand on the 400x400 canvas
                            # (renamed from os/os1 to avoid shadowing the os module).
                            off_x = ((400 - w) // 2) - 15
                            off_y = ((400 - h) // 2) - 15

                            # Draw the skeleton of each finger on the white canvas.
                            for start, end in [(0, 4), (5, 8), (9, 12), (13, 16), (17, 20)]:
                                for t in range(start, end):
                                    cv2.line(white,
                                             (self.pts[t][0] + off_x, self.pts[t][1] + off_y),
                                             (self.pts[t + 1][0] + off_x, self.pts[t + 1][1] + off_y),
                                             (0, 255, 0), 3)
                            # Mark every landmark point.
                            for i in range(21):
                                cv2.circle(white, (self.pts[i][0] + off_x, self.pts[i][1] + off_y),
                                           2, (0, 0, 255), 1)

                            res = white
                            self.predict(res)              # classify the skeleton image

                            self.current_image2 = Image.fromarray(res)
                            imgtk = ImageTk.PhotoImage(image=self.current_image2)
                            self.panel2.imgtk = imgtk
                            self.panel2.config(image=imgtk)

                            self.panel3.config(text=self.current_symbol, font=("Courier", 30))

                            # Show the four spelling suggestions on the buttons.
                            self.b1.config(text=self.word1, font=("Courier", 20),
                                           wraplength=825, command=self.action1)
                            self.b2.config(text=self.word2, font=("Courier", 20),
                                           wraplength=825, command=self.action2)
                            self.b3.config(text=self.word3, font=("Courier", 20),
                                           wraplength=825, command=self.action3)
                            self.b4.config(text=self.word4, font=("Courier", 20),
                                           wraplength=825, command=self.action4)

                self.panel5.config(text=self.str, font=("Courier", 30), wraplength=1025)
        except Exception:
            # If anything above fails (e.g. no hand in the frame), fall back to a
            # more defensive version of the same processing.
            traceback.print_exc()

            hands = hd.findHands(cv2image, draw=False, flipType=True)
            cv2image_copy = np.array(cv2image)
            cv2image = cv2.cvtColor(cv2image, cv2.COLOR_BGR2RGB)
            self.current_image = Image.fromarray(cv2image)
            imgtk = ImageTk.PhotoImage(image=self.current_image)
            self.panel.imgtk = imgtk
            self.panel.config(image=imgtk)

            if hands and len(hands) > 0:
                hand = hands[0]
                if "bbox" in hand:
                    x, y, w, h = hand["bbox"]
                    image = cv2image_copy[y - offset:y + h + offset, x - offset:x + w + offset]
                    white = cv2.imread("white.jpg")
                    if image.size != 0:
                        handz = hd2.findHands(image, draw=False, flipType=True)
                        self.ccc += 1
                        if handz and len(handz) > 0:
                            hand = handz[0]
                            if "lmList" in hand:
                                self.pts = hand['lmList']
                                off_x = ((400 - w) // 2) - 15
                                off_y = ((400 - h) // 2) - 15

                                # Draw the finger skeleton exactly as in the main path.
                                for start, end in [(0, 4), (5, 8), (9, 12), (13, 16), (17, 20)]:
                                    for t in range(start, end):
                                        cv2.line(white,
                                                 (self.pts[t][0] + off_x, self.pts[t][1] + off_y),
                                                 (self.pts[t + 1][0] + off_x, self.pts[t + 1][1] + off_y),
                                                 (0, 255, 0), 3)
                                # Palm line (the endpoint was truncated in the original
                                # listing; landmark 9 is assumed here).
                                cv2.line(white, (self.pts[5][0] + off_x, self.pts[5][1] + off_y),
                                         (self.pts[9][0] + off_x, self.pts[9][1] + off_y),
                                         (0, 255, 0), 3)

                                res = white
                                self.predict(res)
                                self.current_image2 = Image.fromarray(res)
                                imgtk = ImageTk.PhotoImage(image=self.current_image2)
                                self.panel2.imgtk = imgtk
                                self.panel2.config(image=imgtk)
                                self.panel3.config(text=self.current_symbol, font=("Courier", 30))

        # Schedule the next frame so the capture loop keeps running.
        self.root.after(1, self.video_loop)
    def distance(self, x, y):
        # Euclidean distance between two landmark points.
        return math.sqrt(((x[0] - y[0]) ** 2) + ((x[1] - y[1]) ** 2))

    def speak_fun(self):
        # Read the current sentence aloud with the offline TTS engine.
        self.speak_engine.say(self.str)
        self.speak_engine.runAndWait()

    def clear_fun(self):
        # Reset the sentence.
        self.str = " "

    def save_sentence_to_docx(self):
        doc_file = "sentences.docx"
        # Strip the control words produced by the "next"/"Backspace" gestures.
        filtered = self.str.replace("next", "").replace("Backspace", "").strip()
        if filtered:
            # Always create a new document.
            doc = Document()
            doc.add_paragraph(filtered)
            doc.save(doc_file)
            self.clear_fun()
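The listing above is an excerpt: the prediction routine (predict), the suggestion handlers (action1 to action4) and the window-close handler (destructor) referenced in the GUI code belong to the full source but are not reproduced here. A minimal entry point of the following form (assumed for this excerpt, not shown in the original listing) starts the application:

if __name__ == "__main__":
    print("Starting Application...")
    app = Application()      # opens the webcam and builds the GUI
    app.root.mainloop()      # hands control to the Tkinter event loop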
5.7 Snapshots
2. Text Output:
A screenshot displaying the translated text on the user interface.
Chapter 6: Results and Applications
6.1 Result
o Hand Positioning: The system was most accurate when the user's hands were
clearly visible and in a consistent position. Complex or fast movements
sometimes led to lower recognition accuracy.
o Gesture Variety: The system recognized a wide range of static signs, but
dynamic gestures (such as those requiring movement between multiple hand
positions) were harder to detect without additional model refinement.
Overall Results: The system demonstrates promising results for real-time sign
language translation, with a high accuracy rate for basic static gestures and good
performance under normal conditions. However, further improvements in
accuracy, especially for dynamic gestures, are needed for broader use cases.
6.2 Application
Chapter 7: Conclusions and Future Scope
7.1 Conclusion
7.2 Future Scope and Future Enhancement of the Project
While the current Sign Language Translator project has made significant
strides, there are several opportunities for enhancing its functionality, expanding
its features, and improving its overall performance. The future scope of this
project includes the following:
1. Improving Gesture Recognition Accuracy
• Expanding the Dataset: The accuracy of gesture recognition can be improved by
training the system on a larger and more diverse dataset, including different sign
languages (e.g., American Sign Language, British Sign Language) and a broader
range of gestures.
• Advanced Models: The use of more advanced deep learning models could
improve recognition, particularly for dynamic or complex gestures.
• Handling Variability: Enhancing the system to account for variations in hand
shapes, sizes, and movements will make it more accurate across different users.
2. Supporting Multiple Sign Languages
Currently, the system may support only a specific sign language (such as
American Sign Language). Future enhancements could include:
• Multi-Language Support: The system could be expanded to recognize multiple
sign languages, allowing users from different regions to interact in their native
sign language.
5. Speech Synthesis Improvements
• Multi-Language Support for Speech: The system could incorporate multi-
language text-to-speech engines that support a wider range of voices and
accents, making it more accessible to people from different linguistic
backgrounds.
Chapter 8: Appendix