1. Abstract
Humans interact with each other through natural language, whether in the form of speech, text, hand gestures, and so on. Sign language, the language used by deaf people, is built entirely on gestures, so understanding it poses a serious problem for hearing people in daily life, since gestures are the only means of communication for the deaf. A system that recognizes sign language can therefore benefit deaf people both in their social lives and in the workplace. In this paper, we propose a highly optimized approach for a visual American Sign Language (ASL) recognition system based on a word-level custom dataset. It uses an optimized form of data acquisition built on computer vision (CV) and neural network (NN) methodologies, described later, to identify the attributes of the hands in images captured from a camera video stream, and converts the moving images into a sequence of meaningful text. Hand shape is identified in continuous frames using Google's MediaPipe Hands model, which produces high-fidelity tracking results and applies machine learning to derive 21 3D landmarks of a hand from a single frame [6]. Gestures and their labels are then classified by a very simple form of neural network known as a densely connected neural network. The benefit of working with the 3D hand landmark points and a dense neural network, rather than with the images directly, is that training requires significantly less time while yielding a significant increase in the overall accuracy of the system. This optimized approach surpasses previous work on the classification of American Sign Language in both the quality and the quantity of its results.
2. Introduction
2.1. Brief
Sign language is a visual language used by people who are vocally or hearing impaired, who communicate by making gestures with their hands. In ASL there are 26 different gestures for the 26 alphabet letters used all around the world [3]; these signs differ from one another in their hand postures. This paper, however, deals with the Word-Level American Sign Language (WLASL) dataset [7]. Several research projects for the translation of American Sign Language have been developed in the past, but they fall short on performance criteria during evaluation and are time- and power-consuming without being very efficient. Some sign language translators use computer vision as the basis for data acquisition, while others use sensor-based data; the two methods differ in how the input is gathered. To raise the recognition efficiency of classification systems, researchers have used methods such as HMMs, ANNs, CNNs, and Long Short-Term Memory units, and effective approaches for the analysis and recognition of hand-sign patterns have evolved over the years. The objective of this paper is to develop an optimized approach to the classification problem posed by the WLASL dataset: a solution that requires significantly less training time and produces much higher accuracy.
Conversation is an elemental part of human life, but for a person who is mute or hearing impaired it is a challenge. Several research projects have attempted to build sign language translators so that others can understand them, but so far these projects are limited by the efficiency and evaluated performance of the trained models and have required a drastic amount of time and effort. The goal is a translator system, built with modern machine learning approaches, through which a person with no knowledge of ASL can understand ASL.
The following modules are implemented in these existing applications:
sign2text
We-Capable: Text to Sign Language (ASL) Converter
sign-language-gesture-recognition
2.4. Analysis from Literature Review
Each neuron processes the input vector [x0, x1, x2], the weight vector [w0, w1, w2], and a bias b as:

F = Σi (xi · wi) + b

where each (xi · wi) term is one element of the dot product between the input and weight vectors. Processing images directly in this way resulted in a very large number of dot products per layer of the CNN [9], and the extra processing required more time and power. There is therefore a need for a less complex model: one that does not have to process the images directly, converting pixels into a huge matrix of array values and then using that matrix for dot products in a neural network, which amounts to an enormous number of computations.
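As a minimal illustrative sketch (assuming NumPy; the numbers are arbitrary), a single neuron simply computes this weighted sum plus the bias:

import numpy as np

x = np.array([0.5, -0.2, 0.1])   # input vector [x0, x1, x2]
w = np.array([0.4, 0.3, -0.1])   # weight vector [w0, w1, w2]
b = 0.05                         # bias

f = np.dot(x, w) + b             # F = sum over i of (xi * wi), plus b
print(f)                         # approximately 0.18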
The MediaPipe Hands model provides 21 landmarks for each hand in a frame, as shown in the following figure:
Figure 2: MediaPipe Hand Model Tracking
Figure 3: NormalizedLandmark Objects and extraction values
Each of these normalized landmarks is an object containing three values for a point:
X: the horizontal position of the point, normalized to the image width.
Y: the vertical position of the point, normalized to the image height.
Z: the depth of the point, i.e., its estimated distance from the camera.
Since we do not want to train our model for a specific distance from the camera, we ignore the Z value of each landmark point.
The X and Y values are used to create a custom dataset, so that our ML model can learn the differences between hand signs and the relationships between the landmark (x, y) points of hand signs with the same label.
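A minimal sketch of reading these values, assuming a hand_landmarks object taken from results.multi_hand_landmarks as returned by MediaPipe Hands in the data acquisition code of the appendix; z is simply dropped:

def extract_xy(hand_landmarks):
    # hand_landmarks is one entry of results.multi_hand_landmarks (see Appendix 1.1);
    # each landmark carries normalized x, y, z, and z is deliberately ignored here.
    return [(lm.x, lm.y) for lm in hand_landmarks.landmark]  # 21 (x, y) pairs per hand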
3.2. Dataset
As concluded above, we can obtain x and y values for each landmark of a hand in a frame, and we exclude the z values for performance reasons. Our dataset therefore looks like this:
Figure 5: Dataset file
Each hand has 21 landmarks, and each landmark contributes the x and y values that we use in our dataset. We created a dataset of 40 signs with 1000 frames per sign, for a total of 40,000 frames, stored in 4 CSV files that we later append into one large dataset on which to train the model.
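A minimal sketch of this appending step, assuming pandas and the CSV file paths used in the appendix; the exact column layout (one label followed by the 84 preprocessed (x, y) values per row) is an assumption:

import pandas as pd

csv_paths = [
    'extracted_data/generalized/keypoint_1.csv',
    'extracted_data/generalized/keypoint_2.csv',
    'extracted_data/generalized/keypoint_3.csv',
    'extracted_data/generalized/keypoint_4.csv',
]
# Append the four per-part CSV files into one dataset of 40,000 rows.
dataset = pd.concat((pd.read_csv(path, header=None) for path in csv_paths),
                    ignore_index=True)
print(dataset.shape)  # expected (40000, 85) under the assumed layout: 1 label + 84 values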
4. Methodologies:
4.1. Usage of ANN (Artificial Dense Neural Network)
As the name suggests, a dense neural network is a network in which each layer is fully connected to the previous and next layers, i.e., every neuron in a layer is connected to all neurons of the adjacent layers, hence the term "dense".
4.2. Why use Dense type? Why not any other?
A densely connected neural network learns patterns from the combination of all input features, which is why its architecture uses dense connections. A convolutional neural network, by contrast, relies on segmented pattern learning: each convolutional layer applies a set of filters, each of which learns a small segment of the whole pattern, dividing the learning process in a way that is computationally costly. Using a large number of filters requires drastic amounts of computation, and we do not want to make the process that expensive.
4.3. Proposed Neural Network Architecture
This section of the paper emphasizes a general problem-solving ideology: complex problems often call for the simplest form of solution. Keeping this in view, we use the simplest form of neural network in our approach, a densely connected artificial neural network.
Since we wanted our neural network to be small, to require little training time, and to yield results of higher quality than previous approaches, we used dense layers throughout.
The first layer is the input layer with a shape of 84 (the number of inputs we calculated for two hands: 21 landmarks x 2 values x 2 hands); it has no function other than passing the inputs to the first hidden layer.
We then followed the convention of reducing the number of units in each layer as we move towards the output layer, so the dense layers before the output layer show a progressive decrease in the number of units.
We used ReLU as the activation function for the following reasons:
ReLU is a piecewise-linear function that passes its input through if it is positive and outputs zero otherwise.
Why only ReLU and not another activation function?
Unlike the S-shaped sigmoid and tanh activation functions, ReLU does not have their sensitivity issues: because the outputs of sigmoid and tanh are confined to the range -1 to +1, they are only sensitive to inputs near their mid-point and saturate for larger inputs, which slows learning.
The output layer of the neural network has a number of units equal to the number of classes, 40.
The output layer uses a different activation function, softmax, which maps the outputs of the previous ReLU layer to a probability distribution over the 40 labels (0 to 39); since ReLU can output any positive number, the raw outputs would otherwise not correspond to our label range.
An early-stopping callback is added to the network: it monitors accuracy and stops training if there is no significant change after a certain number of epochs, so that the model does not overfit the data.
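A compact sketch of this architecture and the early-stopping callback, mirroring the model listing in the appendix (the monitored metric and patience value here are assumptions):

import tensorflow as tflow

NUM_CLASSES = 40
model = tflow.keras.models.Sequential([
    tflow.keras.layers.Input((42 * 2, )),                 # 84 inputs: 21 landmarks x (x, y) x 2 hands
    tflow.keras.layers.Dense(80, activation='relu'),
    tflow.keras.layers.Dropout(0.4),
    tflow.keras.layers.Dense(60, activation='relu'),
    tflow.keras.layers.Dropout(0.4),
    tflow.keras.layers.Dense(55, activation='relu'),
    tflow.keras.layers.Dense(NUM_CLASSES, activation='softmax')
])
# Stop training once the monitored accuracy stops improving, so the model does not overfit.
early_stopping_callback = tflow.keras.callbacks.EarlyStopping(
    monitor='val_accuracy', patience=20, verbose=1)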
What are the benefits of using such a simple form of neural network?
It removes the complex recurrent computations of RNNs and the complex convolutional computations of CNNs.
Because we work only indirectly with images, we do not need a complex neural architecture to process and learn our data; we extract the (x, y) landmark values so that even the simplest form of ANN can learn from them.
It drastically decreases training time.
It requires much less computation power, with no GPU needed at all.
5. Experimental setup
5.1. Feature Selection & Extraction
Each point in a frame is represented as a NormalizedLandmark, as shown in Figure 3. We select only the x and y values and leave out z, since it is not relevant for our model and we do not want to train the model for a specific distance from the camera.
To extract these values as pixel coordinates in the image frame, we used the formulas:
real_x_coordinate = min((x_value * img_width), img_width - 1)
real_y_coordinate = min((y_value * img_height), img_height - 1)
Each formula takes the minimum of two values: the frame dimension minus one, and the value obtained by multiplying the normalized coordinate by the corresponding width or height. This gives the actual pixel coordinates of each point on the image.
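A minimal sketch of this conversion (mirroring the calc_landmark_list function in the appendix; the function name and example numbers are illustrative):

def to_pixel_coordinates(normalized_x, normalized_y, img_width, img_height):
    # Scale the normalized landmark into pixel space and clamp it to the frame boundary.
    real_x = min(int(normalized_x * img_width), img_width - 1)
    real_y = min(int(normalized_y * img_height), img_height - 1)
    return real_x, real_y

# landmark_0 from the appendix output, on a 640 x 480 frame:
print(to_pixel_coordinates(0.5298218, 0.9121418, 640, 480))  # (339, 437)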
After obtaining the original (x, y) coordinates of each point in the picture, we preprocess the list of points so that, in each list of 21 points, x0 and y0 become base_x and base_y respectively (a code sketch of these steps follows Figure 11).
First, base_x and base_y are subtracted from each point in the list to obtain the relative difference of each point from the first one, giving a list of relative differences between points.
Next, the absolute maximum value of this list of relative differences is found.
Finally, each x and y difference is divided by this absolute maximum to normalize it, and the resulting list is appended as a row of the dataset.
Figure 11: Preprocessing of Relative differences of points for each frame.
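A sketch of these three preprocessing steps, assembled from the code fragments in the appendix (the function wrapper and the zero-division guard are additions):

import copy
import itertools

def pre_process_landmarks(landmark_points):
    temp_landmark_list = copy.deepcopy(landmark_points)

    # Step 1: express every point relative to the first landmark (base_x, base_y).
    base_x, base_y = temp_landmark_list[0]
    for index, (x, y) in enumerate(temp_landmark_list):
        temp_landmark_list[index] = [x - base_x, y - base_y]

    # Step 2: flatten the 21 [x, y] pairs into a single list of 42 values.
    temp_landmark_list = list(itertools.chain.from_iterable(temp_landmark_list))

    # Step 3: divide by the absolute maximum so every value lies in [-1, 1].
    max_value = max(map(abs, temp_landmark_list)) or 1
    return [value / max_value for value in temp_landmark_list]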
6. Results
The performance of the trained model upon testing it on test split of the dataset shows a drastic increase in the
accuracy of the classification approach. The overall complex problem was solved by simple methods and yields a
greater quality result. The model training time was about 2, 3 minutes and it achieved an accuracy of 99.89% in such a
short interval of time, within epochs of 339 out of 10,000 epochs. This accuracy is obtained due to excellent methods
of pre-processing and data acquisition as it eradicated the need of using complex models for such a simple problem.
Figure 12: Deep learning model performance
7. Discussion and Conclusions
The paper concludes that conventional machine learning approaches to classifying the WLASL dataset use relatively complex models such as CNNs and RNNs, which can take an extremely long time to train when the available dataset is large. These approaches require intense computation to reach evaluated accuracies in the range of 75% to 85%, which is not bad but is too much effort for a problem as simple as this one. The proposed approach removes the complications of training complex models on raw data; the short training time, together with the achieved accuracy of 99.89%, proves that it is indeed a well-optimized approach. Careful data acquisition allowed a very simple type of deep learning model, a DNN, to be trained on the relative differences between the points of a specific hand sign and on how these relative differences differ from those of other hand signs.
8. References
Prototype 2: "sign language gesture recognition", Harish Thuwal (Sep 27, 2020), deep learning model. https://github.com/hthuwal/sign-language-gesture-recognition
1. Appendices
1.1. Data acquisition code
import cv2 as cv
import numpy as np
import mediapipe as media_pipe

camera = cv.VideoCapture(0)
media_pipe_drawing = media_pipe.solutions.drawing_utils
media_pipe_hands = media_pipe.solutions.hands

with media_pipe_hands.Hands(
        min_detection_confidence=0.5,
        min_tracking_confidence=0.5) as hands:
    while camera.isOpened():
        success_bool, camera_image_frame = camera.read()
        if not success_bool:
            print("Got an Empty camera camera_image_frame!")
            # If loading a video, use 'break' instead of 'continue'.
            continue
        print('camera_image_frame_SHAPE: ', camera_image_frame.shape)

        # Flip the camera_image_frame horizontally for a later selfie-view display, and convert
        # the BGR camera_image_frame to RGB.
        camera_image_frame = cv.cvtColor(cv.flip(camera_image_frame, 1), cv.COLOR_BGR2RGB)

        # To improve performance, optionally mark the camera_image_frame as not writeable to
        # pass by reference.
        camera_image_frame.flags.writeable = False
        results = hands.process(camera_image_frame)

        # Draw the detected hand landmarks on a displayable BGR frame.
        camera_image_frame.flags.writeable = True
        camera_image_frame = cv.cvtColor(camera_image_frame, cv.COLOR_RGB2BGR)
        if results.multi_hand_landmarks:
            for hand_landmarks in results.multi_hand_landmarks:
                media_pipe_drawing.draw_landmarks(
                    camera_image_frame, hand_landmarks, media_pipe_hands.HAND_CONNECTIONS)
        cv.imshow('MediaPipe Hands', camera_image_frame)
        if cv.waitKey(5) & 0xFF == 27:  # press ESC to quit
            break

cv.destroyAllWindows()
camera.release()
import itertools

# Calculates the real pixel coordinates on the image for each landmark point
def calc_landmark_list(image_frame, landmarks):
    image_width, image_height = image_frame.shape[1], image_frame.shape[0]
    landmark_points = []
    for landmark in landmarks.landmark:
        landmark_x = min(int(landmark.x * image_width), image_width - 1)
        landmark_y = min(int(landmark.y * image_height), image_height - 1)
        landmark_points.append([landmark_x, landmark_y])
    return landmark_points

# Preprocessing: make every point relative to the first landmark, then flatten and normalize
base_x, base_y = 0, 0
print(" x,\ty")
for index, landmark_point in enumerate(temp_landmark_list):
    if index == 0:
        base_x, base_y = landmark_point[0], landmark_point[1]
    temp_landmark_list[index][0] = temp_landmark_list[index][0] - base_x
    temp_landmark_list[index][1] = temp_landmark_list[index][1] - base_y
    print("DIFF:", temp_landmark_list[index][0], temp_landmark_list[index][1])

temp_landmark_list = list(
    itertools.chain.from_iterable(temp_landmark_list))

max_value = max(list(map(abs, temp_landmark_list)))
print("ABSOLUT MAX: ", max_value)

def normalize_(n):
    return n / max_value
# Demo code to show the whole process from acquisition to preprocessing of data
camera = cv.VideoCapture(0)
media_pipe_drawing = media_pipe.solutions.drawing_utils
media_pipe_hands = media_pipe.solutions.hands

with media_pipe_hands.Hands(
        min_detection_confidence=0.5,
        min_tracking_confidence=0.5) as hands:
    while camera.isOpened():
        success, image_frame = camera.read()
        if not success:
            print("Ignoring empty camera frame.")
            # If loading a video, use 'break' instead of 'continue'.
            continue
        print('IMAGE_FRAME_SHAPE: ', image_frame.shape)

        # Flip the image horizontally for a later selfie-view display, and convert
        # the BGR image to RGB.
        image_frame = cv.cvtColor(cv.flip(image_frame, 1), cv.COLOR_BGR2RGB)

        # To improve performance, optionally mark the image as not writeable to
        # pass by reference.
        image_frame.flags.writeable = False
        results = hands.process(image_frame)

        # For each detected hand, print the raw landmarks and convert them to
        # pixel coordinates, as reflected in the output below.
        if results.multi_hand_landmarks:
            for hand_landmarks in results.multi_hand_landmarks:
                print('ORIG_LANDMARK:', hand_landmarks)
                landmark_list = calc_landmark_list(image_frame, hand_landmarks)
                print('LandMark_CALC_LIST(21):', landmark_list)

cv.destroyAllWindows()
camera.release()

OutPut:
IMAGE_SHAPE: (480, 640, 3)
---------------------------------------------------------------
ORIG_LANDMARK: landmark_0 {
x: 0.5298218
y: 0.9121418
z: -8.171827e-05
}
Landmark_1 {
x: 0.45480007
y: 0.8920693
z: -0.06166866
}
Landmark_2 {
x: 0.3911696
y: 0.82389766
z: -0.09706815
}
Landmark_3 {
x: 0.33392072
y: 0.7772389
z: -0.13308653
}
Landmark_4 {
x: 0.27219355
y: 0.75963604
z: -0.17590481
}
Landmark_5 {
x: 0.4567382
y: 0.6703464
z: -0.0669591
}
Landmark_6 {
x: 0.43688262
y: 0.5598857
z: -0.1016331
}
Landmark_7 {
x: 0.4251266
y: 0.49154353
z: -0.1258149
}
Landmark_8 {
x: 0.4154522
y: 0.4323473
z: -0.1451781
}
Landmark_9 {
x: 0.5110696
y: 0.6602165
z: -0.06574423
}
Landmark_10 {
x: 0.5108578
y: 0.5347301
z: -0.09177217
}
Landmark_11 {
x: 0.5153187
y: 0.45548508
z: -0.12585047
}
Landmark_12 {
x: 0.5219439
y: 0.3864655
z: -0.15050456
}
Landmark_13 {
x: 0.55792934
y: 0.6755521
z: -0.0664633
}
Landmark_14 {
x: 0.56623095
y: 0.5567026
z: -0.0971865
}
Landmark_15 {
x: 0.5762762
y: 0.48259073
z: -0.12416917
}
Landmark_16 {
x: 0.58437335
y: 0.417627
z: -0.14456311
}
Landmark_17 {
x: 0.6011278
y: 0.7114185
z: -0.072560504
}
Landmark_18 {
x: 0.63162947
y: 0.63810915
z: -0.109268196
}
Landmark_19 {
x: 0.65477633
y: 0.596596
z: -0.12863845
}
Landmark_20 {
x: 0.6754149
y: 0.55502
z: -0.14428398
}
---------------------------------------------------------------
---------------------------------------------------------------
LandMark_CALC_LIST(21): [[339, 437], [291, 428], [250, 395], [213, 373], [174, 364], [292, 321], [279, 268], [272, 235], [265, 207], [327, 316], [326, 256], [329, 218], [334, 185], [357, 324], [362, 267], [368, 231], [373, 200], [384, 341], [404, 306], [419, 286], [432, 266]]
---------------------------------------------------------------
x, y
DIFF: 0 0
DIFF: -48 -9
DIFF: -89 -42
DIFF: -126 -64
DIFF: -165 -73
DIFF: -47 -116
DIFF: -60 -169
DIFF: -67 -202
DIFF: -74 -230
DIFF: -12 -121
DIFF: -13 -181
DIFF: -10 -219
DIFF: -5 -252
DIFF: 18 -113
DIFF: 23 -170
DIFF: 29 -206
DIFF: 34 -237
DIFF: 45 -96
DIFF: 65 -131
DIFF: 80 -151
DIFF: 93 -171
ABSOLUT MAX: 252
---------------------------------------------------------------
PREPROCESSED: [0.0, 0.0, -0.19047619047619047, -0.03571428571428571, -0.3531746031746032,
-0.16666666666666666, -0.5, -0.25396825396825395, -0.6547619047619048, -0.2896825396825397,
-0.1865079365079365, -0.4603174603174603, -0.23809523809523808, -0.6706349206349206,
-0.26587301587301587, -0.8015873015873016, -0.29365079365079366, -0.9126984126984127,
-0.047619047619047616, -0.4801587301587302, -0.051587301587301584, -0.7182539682539683,
-0.03968253968253968, -0.8690476190476191, -0.01984126984126984, -1.0,
0.07142857142857142, -0.44841269841269843, 0.09126984126984126, -0.6746031746031746,
0.11507936507936507, -0.8174603174603174, 0.1349206349206349, -0.9404761904761905,
0.17857142857142858, -0.38095238095238093, 0.25793650793650796, -0.5198412698412699,
0.31746031746031744, -0.5992063492063492, 0.36904761904761907, -0.6785714285714286]
---------------------------------------------------------------
RANDOM_SEED = 42
dataset_p1 = 'extracted_data/generalized/keypoint_1.csv'
dataset_p2 = 'extracted_data/generalized/keypoint_2.csv'
dataset_p3 = 'extracted_data/generalized/keypoint_3.csv'
dataset_p4 = 'extracted_data/generalized/keypoint_4.csv'
model_save_path = 'extracted_data/generalized/model/beta_clf.hdf5'
OutPut:
(40000,)
import tensorflow as tflow

RANDOM_SEED = 42
NUM_CLASSES = 40

model = tflow.keras.models.Sequential([
    tflow.keras.layers.Input((42 * 2, )),
    tflow.keras.layers.Dense(80, activation='relu'),
    tflow.keras.layers.Dropout(0.4),
    tflow.keras.layers.Dense(60, activation='relu'),
    tflow.keras.layers.Dropout(0.4),
    tflow.keras.layers.Dense(55, activation='relu'),
    tflow.keras.layers.Dense(NUM_CLASSES, activation='softmax')
])
model.summary()
OutPut:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 80) 6800
_________________________________________________________________
dropout (Dropout) (None, 80) 0
_________________________________________________________________
dense_1 (Dense) (None, 60) 4860
_________________________________________________________________
dropout_1 (Dropout) (None, 60) 0
_________________________________________________________________
dense_2 (Dense) (None, 55) 3355
_________________________________________________________________
dense_3 (Dense) (None, 40) 2240
=================================================================
Total params: 17,255
Trainable params: 17,255
Non-trainable params: 0
_________________________________________________________________
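The two callbacks passed to model.fit below are not defined in this listing; a plausible sketch is given here, where the ModelCheckpoint role of capture_callback, its settings, and the patience value are assumptions:

early_stopping_callback = tflow.keras.callbacks.EarlyStopping(
    monitor='val_accuracy', patience=20, verbose=1)
# Assumed: capture_callback periodically saves the best model to model_save_path.
capture_callback = tflow.keras.callbacks.ModelCheckpoint(
    model_save_path, verbose=1, save_best_only=True)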
model.fit(
    train_x,
    train_y,
    epochs=10000,
    batch_size=512,
    validation_data=(test_x, test_y),
    callbacks=[capture_callback, early_stopping_callback]
)
1.7. Model Evaluation
value_loss, value_accuracy = model.evaluate(test_x, test_y, batch_size=128)
Output:
from sklearn.metrics import classification_report

predictions = model.predict(test_x)
predictions = np.argmax(predictions, axis=1)

if report:
    print('-----------Classification Report for Model-----------')
    print(classification_report(test_y, predictions))

print_confusion_matrix(test_y, predictions)
OutPut: