A PROJECT REPORT ON
REAL TIME HAND DETECTION AND RECOGNITION USING MACHINE LEARNING
IN PARTIAL FULFILLMENT OF
BACHELOR OF ENGINEERING
IN
INFORMATION TECHNOLOGY
UNDER THE GUIDANCE OF
PROF. V. P. TONDE
CERTIFICATE
Submitted by
This is a bonafide work carried out by them under the supervision of Prof. V. P. Tonde, and it is
approved in partial fulfillment of the requirements of Savitribai Phule Pune University, Pune,
for the award of the degree of Bachelor of Engineering (Information Technology) in the academic
year 2020-2021.
This project report has not been submitted earlier to any other institute or university for the
award of any degree or diploma.
ACKNOWLEDGEMENT
We would like to thank our project guide, Prof. K. S. Karnekar, Assistant Professor in
Information Technology, Sinhgad Institute of Technology, Lonavala, for the continuous support
and valuable suggestions throughout this work. The authors are also grateful to the reviewer for
critically going through the manuscript and giving valuable suggestions for its improvement. We
would also like to thank the Department of Information Technology, Sinhgad Institute of
Technology, Lonavala for providing us with the facilities for carrying out the simulations.
I express my thanks to all staff members and friends for all the help and coordination
extended in bringing out this project successfully in time. I would be failing in my duty if I
did not acknowledge with grateful thanks the authors of the references and other literature
referred to in this project. Last but not least, I am very much thankful to my parents, who
guided me in every step I took.
ABSTRACT
Hand gesture recognition is a system that can detect hand gestures in real-time video.
The hand gesture is classified within a certain region of interest. Designing a hand gesture
recognition system is a complicated task that involves two major problems. The first is the
detection of the hand itself. The second is creating signs that are suitable for use with one
hand at a time. This project concentrates on how a system can detect, recognize and interpret
hand gestures through computer vision, despite challenging factors such as variability in pose,
orientation, location and scale.
For this project to perform well, different types of gestures, such as numbers and sign
language signs, need to be created in the system. Each image taken from the real-time video is
analysed with a Haar cascade classifier to detect the hand before further image processing is
done, or in other words, to detect the appearance of a hand in a frame. [1]
In this project, hand detection is done by applying the concept of a Region of Interest
(ROI) in Python. The explanation of the results focuses on the simulation part, since the only
difference for a hardware implementation is the source code that reads the real-time input
video.
Hand gesture recognition with Python and OpenCV can be implemented by applying the
theories of hand segmentation and hand detection using the Haar cascade classifier. In future
work, we will incorporate complex video, such as disaster footage, for detection and
classification of the objects present in the scene, enabling more sophisticated vision-based
applications such as fire-accident and earthquake-disaster monitoring. As a result, our
algorithm identifies objects by class, assigns each object a tag, and reports the dimensions of
each detection in the image. [1]
LIST OF FIGURES
CONTENTS
CERTIFICATE
ACKNOWLEDGEMENT
ABSTRACT
LIST OF FIGURES
CHAPTER 1
INTRODUCTION
Hand gestures are a form of non-verbal communication. They can be used in
several fields such as communicating with deaf-mute people, robot control (gesture-
controlled robotics), HCI, home automation and medical applications.
A central requirement that this implementation depends on is the need for the system
to provide feedback in real time. Ideally, when communication is to occur between someone who
understands sign language and someone who does not, the translation should be available as soon
as the signer finishes signing.
These constraints make communication very ambiguous. The system uses a VGG16 neural
network model trained to predict hand gestures. It also allows users to add their own hand
gestures and associated labels to the database, after which the top layers of the neural network
are retrained using transfer learning.
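A minimal Keras sketch of this retraining step, assuming 224x224 RGB input; NUM_CLASSES is a
hypothetical gesture-label count, not the project's exact setting:

# Illustrative only: freeze the VGG16 convolutional base and retrain a new
# classification head on the user-supplied gesture images (transfer learning).
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

NUM_CLASSES = 10  # hypothetical number of gesture labels

# Load VGG16 pretrained on ImageNet, without its original classifier.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the convolutional layers

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),  # one output per gesture
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)  # retrain on the new gestures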
This project uses static hand gestures because they can be identified more accurately
than dynamic hand gestures, such as those for the letters J and Z. The proposed study tries to
increase accuracy by combining a Convolutional Neural Network (CNN) with Open Computer Vision
(OpenCV).
The motivation of the project is to build a cost-efficient system that can help people
use hand gestures to give commands to machines. In India alone, over 50 lakh people suffer from
hearing disabilities as per the 2011 Census, making up about 19 percent of the disabled
population. Close to 20 lakh people suffer from speech impairments, roughly 7 percent of the
disabled population. In order to bridge this communication gap and ease the lives of these
minorities, artificial-intelligence-aided technology is introduced.
This project aims at creating a system to make their communication with other people one
step easier by converting sign language into text and audio output. The aim of this thesis is to
evaluate the classification performance of suitable deep learning models for real-time hand
gesture recognition and translation into text or speech.
The following objectives have been identified to fulfil the aim of this thesis work:
● To identify suitable and highly efficient deep learning models for real-time
object recognition.
● To evaluate the classification performance of the selected deep learning models.
● To compare the classification performance of the selected models among each other
and present the results.
● To produce a model that can recognize fingerspelling-based hand gestures, so that a
complete word can be formed by combining individual gestures.
● To gain the ability to convert text to speech, and to sense and react to the environment
so as to operate without the help or involvement of a human.
The system implements a very generic way of communication, not restricted to any particular
sign language. But while users have the ability to add new gestures to the system, the number of
gestures that can be added is limited, because the accuracy of the CNN may vary greatly as more
gestures are added. Algorithms could therefore be built that produce a flexible model which can
adjust, be modified, and be retrained with good accuracy on the periodic addition of a good
amount of images, and a system with higher computational power could be used to hold the maximum
number of gestures.
Sentiment analysis could also be introduced to potentially extend the capabilities of the
architecture.
The system draws different geometric shapes on receiving commands from the user. In a similar
way, systems that perform multiple tasks such as playing music or video, sending email, playing
games, or home automation using gestures as input can be developed.
CHAPTER 2
LITERATURE SURVEY
The following inclusion and exclusion criteria have been followed while collecting
articles for the literature review:
● Only articles that discuss sign language detection/recognition and deep learning
models have been included.
● Only articles published in the years 2020 and 2021 have been included, as they
reflect the most recent research conducted in this area.
● Only journal articles, conference papers, magazines and reviews have been included.
● Only articles written in the English language have been included, for
understandability purposes.
● Abstracts and PowerPoint presentations have been excluded.
CHAPTER 3
PROBLEM STATEMENT
With the growth of ubiquitous computing, the use of software is not limited to computers
or CPUs; it has reached everywhere in one form or another. Interaction between user and machine
is no longer limited to the mouse or keyboard, and each orthodox input device has its
limitations for receiving commands. Using hand gestures as an input device provides natural and
efficient interaction. Our project is based on the concepts of image processing and neural
networks. A lot of research is based on gesture recognition using Kinect sensors or HD cameras,
but such cameras and Kinect sensors are costly. To reduce cost and improve the robustness of the
proposed system, we used a simple web camera.
CHAPTER 4
Steps to be followed:
1) Download and install Python version 3 from the official Python language website,
https://www.python.org/.
2) Install the following Python libraries:
i. TensorFlow:
While the reference implementation runs on single devices, TensorFlow can run on multiple
CPUs and GPUs (with optional CUDA and SYCL extensions for general-purpose computing on graphics
processing units). TensorFlow is available on various platforms such as 64-bit Linux, macOS and
Windows, and on mobile computing platforms including Android and iOS.
The architecture of TensorFlow allows the easy deployment of computation across a variety of
platforms (CPUs, GPUs, TPUs), from desktops and clusters of servers to mobile and edge devices.
TensorFlow computations are expressed as stateful dataflow graphs. The name TensorFlow derives
from the operations that such neural networks perform on multidimensional data arrays, which are
referred to as tensors.
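A small illustrative sketch of these tensor operations, assuming TensorFlow 2.x with eager
execution:

# Illustrative only: a few operations on multidimensional arrays (tensors).
import tensorflow as tf

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])   # 2x2 tensor
b = tf.constant([[5.0, 6.0], [7.0, 8.0]])

c = tf.matmul(a, b)           # matrix multiplication
d = tf.reduce_sum(c, axis=0)  # sum down each column

print(c.numpy())  # [[19. 22.] [43. 50.]]
print(d.numpy())  # [62. 72.]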
ii. NumPy:
NumPy is a library for the Python programming language that adds support for large,
multi-dimensional arrays and matrices, along with a large collection of high-level mathematical
functions that operate on these arrays. The ancestor of NumPy, Numeric, was originally created
by Jim Hugunin with contributions from several developers. In 2005, Travis Oliphant created
NumPy by incorporating features of the competing Numarray into Numeric, with extensive
modifications. NumPy is open-source software and has many contributors.
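A brief illustrative sketch of the kind of array operations NumPy provides:

# Illustrative only: vectorized operations on a multidimensional array.
import numpy as np

img = np.array([[0, 128], [255, 64]], dtype=np.uint8)  # toy 2x2 "image"

normalized = img / 255.0   # element-wise division over the whole array
print(normalized.mean())   # ~0.438 (average pixel intensity)
print(img.T)               # transpose: [[0 255] [128 64]]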
iii. SciPy:
SciPy contains modules for optimization, linear algebra, integration, interpolation,
special functions, FFT, signal and image processing, ODE solvers and other tasks common in
engineering. SciPy builds mainly on the NumPy array object and is part of the NumPy stack, which
includes tools like Matplotlib, pandas and SymPy, and an expanding set of scientific computing
libraries. The NumPy stack has similar uses to applications such as MATLAB, Octave and Scilab,
and is sometimes also referred to as the SciPy stack.
The SciPy library is currently distributed under the BSD license, and its development is
sponsored and supported by an open community of developers. It is also supported by NumFOCUS, a
community foundation for supporting reproducible and accessible science.
iv. OpenCV:
OpenCV is a library of programming functions mainly aimed at real-time computer vision.
Originally developed by Intel, it was later supported by Willow Garage and then Itseez. The
library is cross-platform and free to use under the open-source BSD license.
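Since this project reads real-time input video, a minimal OpenCV capture loop of the kind it
relies on might look like this (an illustrative sketch, not the project's exact code):

# Illustrative only: read frames from the default web camera in real time.
import cv2

cap = cv2.VideoCapture(0)  # device 0 = default webcam
while True:
    ret, frame = cap.read()      # grab one frame
    if not ret:
        break
    cv2.imshow("Live", frame)    # display it
    if cv2.waitKey(1) & 0xFF == ord('q'):  # press 'q' to quit
        break
cap.release()
cv2.destroyAllWindows()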
v. Pillow :
Python Imaging Library is a free Python programming language library that provides support
to open, edit and save several different formats of image files. Windows, Mac OS X and Linux
are available for this.
vi. Matplotlib:
Matplotlib is a plotting library for the Python programming language and its numerical
mathematics extension NumPy. It provides an object-oriented API for embedding plots into
applications that use general-purpose GUI toolkits such as Tkinter, wxPython, Qt or GTK+.
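A short illustrative sketch of this object-oriented API:

# Illustrative only: plot with the object-oriented API rather than pyplot state.
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()            # Figure and Axes objects
x = np.linspace(0, 2 * np.pi, 100)
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("amplitude")
ax.legend()
fig.savefig("sine.png")             # the same Figure can also be embedded in a GUI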
vii. h5py:
The h5py package includes both a high-level and a low-level interface to the HDF5 library
for Python. The low-level interface is intended to be a complete wrapping of the HDF5 API, while
the high-level component uses established Python and NumPy concepts to support access to HDF5
files, datasets and groups.
A strong emphasis on automatic conversion between Python (NumPy) datatypes and data structures
and their HDF5 equivalents vastly simplifies the process of reading and writing data from
Python.
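An illustrative sketch of that high-level interface (the file and dataset names are
hypothetical):

# Illustrative only: write and read a NumPy array through h5py's high-level API.
import h5py
import numpy as np

data = np.random.rand(100, 64, 64)  # e.g. 100 small grayscale frames

with h5py.File("gestures.h5", "w") as f:   # hypothetical file name
    f.create_dataset("frames", data=data)

with h5py.File("gestures.h5", "r") as f:
    frames = f["frames"][:]   # reads back as a NumPy array
    print(frames.shape)       # (100, 64, 64)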
viii. Keras:
Keras is a high-level neural-network API written in Python, capable of running on top of
TensorFlow. It focuses on enabling fast experimentation with deep neural networks.
Hardware Requirements:
Memory: 6 GB
Display Memory: 4 GB
CHAPTER 5
2. Segmentation: Segmentation is the method of separating objects or signs from the
context of a captured image. Background subtraction, skin-color detection and edge detection are
all used in the segmentation process. The motion and location of the hand must be detected and
segmented in order to recognise gestures. (Figure: thresholded hand image.) A small sketch of
skin-colour segmentation follows.
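An illustrative OpenCV sketch of skin-colour segmentation by thresholding; the HSV bounds are
rough assumptions, not tuned values from this project:

# Illustrative only: segment skin-coloured regions with an HSV threshold.
import cv2
import numpy as np

frame = cv2.imread("frame.png")                       # hypothetical captured frame
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)

lower = np.array([0, 30, 60], dtype=np.uint8)         # rough skin-tone bounds (assumed)
upper = np.array([20, 150, 255], dtype=np.uint8)
mask = cv2.inRange(hsv, lower, upper)                 # white where skin-like, black elsewhere

segmented = cv2.bitwise_and(frame, frame, mask=mask)  # keep only the hand region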
3. Feature Extraction: Predefined features such as form, contour, geometrical features
(position, angle, distance, etc.), colour features and histograms are extracted from the
preprocessed images and used later for sign classification or recognition. Feature extraction is
a step in the dimensionality-reduction process that divides and organises a large collection of
raw data into smaller, easier-to-manage classes, so that processing becomes simpler. The most
important characteristic of these large data sets is that they have a large number of variables,
and processing these variables requires a large amount of computational power. Feature
extraction therefore helps extract the best features from large data sets by selecting and
combining variables into features, reducing the size of the data. These features are simple to
use while still accurately and uniquely describing the actual data collection.
4. Preprocessing: Each picture frame is preprocessed to eliminate noise using a variety of
filters including erosion, dilation and Gaussian smoothing, among others. The size of an image
is reduced when a colour image is transformed to grayscale; converting an image to grayscale is
a common method of reducing the amount of data to be processed. The phases of preprocessing are
as follows.
Morphological operations apply a structuring element to an input image to create an output image
of similar size. The value of each pixel in the output image is determined by comparing the
corresponding pixel in the input image with its neighbours. There are two basic kinds of
morphological transformations, erosion and dilation, as sketched below.
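A minimal OpenCV sketch of these two operations on a binary mask; the input file and kernel size
are assumptions for illustration:

# Illustrative only: erosion and dilation on a binary (thresholded) image.
import cv2
import numpy as np

mask = cv2.imread("hand_mask.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input
kernel = np.ones((5, 5), np.uint8)   # 5x5 structuring element (assumed size)

eroded = cv2.erode(mask, kernel, iterations=1)    # shrinks white regions, removes specks
dilated = cv2.dilate(mask, kernel, iterations=1)  # grows white regions, fills small holes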
5. Recognition:
Classifiers are used at this stage. Classifiers are the methods or algorithms used to interpret
the signs. Popular classifiers for identifying or understanding sign language include the Hidden
Markov Model (HMM), K-Nearest Neighbour classifiers, the Support Vector Machine (SVM),
Artificial Neural Networks (ANN) and Principal Component Analysis (PCA), among others. In this
project, however, the classifier is a CNN. Because of their high precision, CNNs are widely used
for image classification and recognition. A CNN uses a hierarchical model that builds a network,
similar to a funnel, and ends in a fully-connected layer in which all neurons are connected to
each other and the output is processed.
Image classification is the process of taking an input (like a picture) and outputting its
class, or the probability that the input belongs to a particular class. Neural networks are
applied in the following steps:
1) One-hot encode the data: A one-hot encoding can be applied to the integer
representation. The integer-encoded variable is removed and a new binary variable is
added for each unique integer value, as in the sketch below.
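A small illustrative sketch of one-hot encoding gesture labels (a Keras utility is shown; the
label values are hypothetical):

# Illustrative only: integer labels -> one-hot binary vectors.
import numpy as np
from tensorflow.keras.utils import to_categorical

labels = np.array([0, 2, 1, 2])        # hypothetical gesture class indices
one_hot = to_categorical(labels, num_classes=3)
print(one_hot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]]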
2) Define the model: A model, said in a very simplified form, is nothing but a function that
takes certain input, performs certain operations to its best on the given input (learning,
then predicting/classifying) and produces the suitable output.
3) Compile the model: The optimizer controls the learning rate. We will be using 'adam' as
our optimizer. Adam is generally a good optimizer to use for many cases; it adjusts the
learning rate throughout training. The learning rate determines how fast the optimal
weights for the model are calculated. A smaller learning rate may lead to more accurate
weights (up to a certain point), but the time it takes to compute the weights will be longer.
4) Train the model: Training a model simply means learning (determining) good values for
all the weights and the bias from labeled examples. In supervised learning, a machine
learning algorithm builds a model by examining many examples and attempting to find a
model that minimizes loss; this process is called empirical risk minimization.
5) Test the model. These steps are sketched in code below.
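A condensed illustrative sketch of steps 2-5 in Keras; the layer sizes, image shape and class
count are assumptions, not the project's exact settings:

# Illustrative only: define, compile, train and test a small CNN classifier.
from tensorflow.keras import layers, models

NUM_CLASSES = 10  # hypothetical number of gesture classes

# 2) Define the model.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

# 3) Compile the model with the adam optimizer.
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# 4) Train on one-hot encoded labels, then 5) test on held-out data.
# model.fit(x_train, y_train_one_hot, epochs=10, validation_split=0.1)
# model.evaluate(x_test, y_test_one_hot)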
A convolutional neural network convolves learned features with the input data, using 2D
convolution layers.
Convolution Operation:
Three elements enter into the convolution operation:
● Input image
● Feature detector
● Feature map
You place the feature detector over the input image, beginning from the top-left corner,
and count the number of cells in which the feature detector matches the input image.
The number of matching cells is then written into the top-left cell of the feature map.
You then move the feature detector one cell to the right and do the same thing. This
movement is called a stride, and since we are moving the feature detector one cell at a time,
this would be a stride of one pixel.
In this example, the feature detector's middle-left cell, which contains the number 1,
matches the cell it is standing over in the input image. That is the only matching cell, so you
write "1" in the next cell of the feature map, and so on.
After you have gone through the whole first row, you move the feature detector down to the
next row and repeat the process.
Deriving a feature map has several uses, the most important of which is reducing the size of
the input image; note that the larger your stride (the movement across pixels), the smaller your
feature map.
ReLU Layer: The rectified linear unit is used to clamp values to be non-negative. Pixel
values can come out of the convolution as negative values, and in this layer we set them to 0.
The purpose of applying the rectifier function is to increase the non-linearity in our images,
because images are naturally non-linear. The rectifier serves to break up the linearity even
further, to make up for the linearity we might impose on an image when we put it through the
convolution operation. What the rectifier function does to an image is remove all the black
(negative) elements from it, keeping only those carrying a positive value (the grey and white
colours). The essential difference between the non-rectified version of the image and the
rectified one is the progression of colours: after we rectify the image, the colours change more
abruptly. The gradual change is no longer there, which indicates that the linearity has been
disposed of.
Pooling Layer:
The pooling (POOL) layer reduces the height and width of the input. It helps reduce
computation, and it helps make feature detectors more invariant to their position in the input;
this process is what provides the convolutional neural network with its "spatial variance"
capability. In addition, pooling minimizes the size of the images as well as the number of
parameters, which in turn prevents "overfitting". Overfitting, in a nutshell, is when you create
an excessively complex model to account for the idiosyncrasies we just mentioned. The result of
using a pooling layer and creating downsampled (pooled) feature maps is a summarized version of
the features detected in the input. These maps are useful because small changes in the location
of a feature in the input will still result in a pooled feature map with the feature in the same
location. This capability added by pooling is called the model's invariance to local
translation, and it aids generalization. A small sketch follows.
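An illustrative NumPy sketch of 2x2 max pooling with stride 2 (the input values are made up):

# Illustrative only: 2x2 max pooling with stride 2 on a small feature map.
import numpy as np

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 5],
                 [0, 1, 3, 2],
                 [2, 6, 0, 1]])

pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))  # max of each 2x2 block
print(pooled)
# [[4 5]
#  [6 3]]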
CONVOLUTION IN CNN:
Kernel convolution is not only used in CNNs; it is also a key element of many other computer
vision algorithms. It is a process where we take a small matrix of numbers (called a kernel or
filter), pass it over our image, and transform the image based on the values of the filter.
Subsequent feature map values are calculated according to the following formula, where the input
image is denoted by f and our kernel by h, and the indexes of the rows and columns of the result
matrix are marked with m and n respectively:

G[m, n] = (f * h)[m, n] = Σj Σk h[j, k] · f[m − j, n − k]

After placing our filter over a selected pixel, we take each value from the kernel, multiply
them in pairs with the corresponding values from the image, and sum the products.
In a CNN, the kernels can be considered analogous to nodes in an ANN: the kernel matrix values
are weights that are updated in each epoch. A sketch follows.
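An illustrative NumPy implementation of this sliding-kernel computation (valid padding, stride
1, using the cross-correlation form common in CNN layers; the image and filter values are made
up):

# Illustrative only: slide a kernel over an image and sum elementwise products
# at each position (the cross-correlation form used in most CNN layers).
import numpy as np

def convolve2d(f, h):
    kh, kw = h.shape
    out_h = f.shape[0] - kh + 1   # "valid" output size, stride 1
    out_w = f.shape[1] - kw + 1
    G = np.zeros((out_h, out_w))
    for m in range(out_h):
        for n in range(out_w):
            # multiply kernel and image patch in pairs, then sum
            G[m, n] = np.sum(f[m:m + kh, n:n + kw] * h)
    return G

image = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)
kernel = np.array([[1, 0], [0, -1]], dtype=float)  # made-up 2x2 filter
print(convolve2d(image, kernel))
# [[-4. -4.]
#  [-4. -4.]]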
CHAPTER 6
A level 1 DFD notates each of the main sub-processes that together form the complete
system. We can think of a level 1 DFD as an "exploded view" of the context diagram; it shows the
input and output flows of the proposed system.
The purpose of the use case diagram is simply to provide a high-level view of the system
and convey the requirements in laypeople's terms to the stakeholders.
The class diagram is the main building block of object-oriented modeling. It is used for
general conceptual modeling of the structure of the application, and for detailed modeling that
translates the models into programming code. Class diagrams can also be used for data modeling.
Fig. 8: Class Diagram
A sequence diagram is a type of interaction diagram that describes how, and in what
order, a group of objects works together. These diagrams are used by software developers and
business professionals to understand the requirements for a new system or to document an
existing process.
CHAPTER 7
SYSTEM DOCUMENTATION
Sample Code:

import argparse
import os
import shutil
import time
from pathlib import Path

import cv2
import torch
import torch.backends.cudnn as cudnn
from numpy import random

# Model loading and helper utilities from the detection framework's repository.
from models.experimental import attempt_load
from utils.datasets import LoadStreams, LoadImages
from utils.general import (
    check_img_size, non_max_suppression, apply_classifier, scale_coords,
    xyxy2xywh, plot_one_box, strip_optimizer, set_logging)
from utils.torch_utils import select_device, load_classifier, time_synchronized


def detect(save_img=False):
    # Unpack the command-line options into local variables.
    out, source, weights, view_img, save_txt, imgsz = \
        opt.save_dir, opt.source, opt.weights, opt.view_img, opt.save_txt, opt.img_size
    # Treat numeric sources, stream URLs and .txt lists of sources as webcam-style streams.
    webcam = source.isnumeric() or source.startswith(('rtsp://', 'rtmp://', 'http://')) or \
        source.endswith('.txt')
CHAPTER 9
OTHER SPECIFICATIONS
9.1 ADVANTAGES
9.2 LIMITATIONS
● Class imbalance.
● Speed for real-time objects.
● Multiple spatial scales and aspect ratios.
● The user must have large memory storage.
● The user must have all the software required to run the application.
9.3 APPLICATIONS
In this section, we provide an overview of real-world use cases for real-time hand
detection. We have mentioned several of them in previous sections, but here we dive a bit deeper
and explore the impact this computer vision technique can have across industries.
Specifically, real-time hand detection can be used in the areas introduced earlier:
communicating with deaf-mute people, gesture-controlled robotics, HCI, home automation and
medical applications.
CHAPTER 10
CONCLUSIONS
Gesture recognition is a budding field of computer science and AI. Using hand gestures as input
to a system can enhance the way the user interacts with it. This system acts as a medium of
communication for the deaf-mute community, and thus tries to reduce the barriers faced by the
aforementioned minorities. The system also allows the user to add new gestures and associated
labels to the system for translation.
The primary advantage of the system is that it is designed as an interface that functions in
real time and would be available to the masses. If developed as a mobile application, it could
be used effectively by the targeted audience.
Like systems in general, this system does not promise 100% accuracy in translating gestures. And
since it is made to be used in real time, the chances of error increase due to randomness in the
behaviour of the user and noise in the captured images. But as the user would be in the midst of
a communication or task, the chance of the user flagging an error is considerably low; however,
this does not rule out the user ignoring all the errors. Built-in machine learning algorithms
can be used to avoid the errors flagged by the user.
Since the user has the privilege to add new gestures to the database for gesture translation, on
which the model is trained again, this process may affect the accuracy of the system. To
overcome this, we have limited the number of new gestures that a user can add; but still,
depending on the nature of a gesture, the model accuracy may be affected slightly.
Furthermore, this system does not translate the mood or emotions of the user; it is made for
simple translation of gestures to text.
REFERENCES
1. Dr. V. Subedha, Sandhya, Shree Lakshmi, Swathi (IRJMETS, 2021) - Sign Language
Recognition to Aid Physically Challenged Using OpenCV and CNN
2. Shruti Chavan, Xinrui Yu and Jafar Saniie (IEEE, 2021) - Convolutional Neural Network Hand
Gesture Recognition for American Sign Language
3. Dr. J. Rethna Virgil Jeny, A. Anjana, Karnati Monica, Thandu Sumanth, A. Mamatha (IEEE,
2021) - Hand Gesture Recognition for Sign Language Using Convolutional Neural Network
4. Shagun Gupta, Riya Thakur, Vinay Maheshwari and Namita Pulgam (IEEE, 2020) - Sign
Language Converter Using Hand Gestures
5. Manasi Agrawal, Rutuja Ainapur, Shrushti Agrawal, Simran Bhosale, Dr. Sharmishta
Desai (IEEE, 2020) - Models for Hand Gesture Recognition using Deep Learning
6. Aishwarya Sharma, Dr. Siba Panda, Prof. Saurav Verma (IEEE, 2020) - Sign Language
to Speech Translation
7. Ritika Bharti, Sarthak Yadav, Sourav Gupta and Rajitha Bakthula (SSRN, 2019) - Automated
Speech to Sign Language Conversion using Google API and NLP