
Table of Contents

1. Abstract
2. Introduction
2.1. Brief
2.2. Problem Background
2.3. Literature Review
2.4. Analysis from Literature Review
2.5. Problem Statement
3. Problem & Dataset
3.1. Problem description
3.2. Dataset
4. Methodologies:
4.1. Usage of ANN (Artificial Dense Neural Network)
4.2. Why use Dense type? Why not any other?
4.3. Proposed Neural Network Architecture
5. Experimental setup
5.1. Feature Selection & Extraction
5.2. Data Pre-processing
6. Results
6.1. Classification Report for Model
7. Discussion and Conclusions
8. References
9. Appendices
9.1. Data acquisition code
9.2. Data preprocessing code
9.3. Dataset creation code
9.4. Train test split
9.5. Model Creation code
9.6. Model Training
9.7. Model Evaluation
9.8. Classification Report for model

1. Abstract
Humans interact with each other using natural language, whether in the form of speech, text, hand gestures, and so on. Understanding the language of deaf people, known as sign language, therefore poses a serious problem for hearing people in daily life, since gestures are the primary means of communication for the deaf. A system that can recognize sign language would benefit deaf people both in their social lives and in the workplace. In this paper we propose a highly optimized approach to a visual American Sign Language recognition system based on a word-level custom dataset. It combines an optimized data-acquisition pipeline built on computer vision (CV) with the neural network (NN) methodologies described later, to identify the attributes of the hands in images captured from a camera video stream. The technique converts moving images into a sequence of meaningful sentences as text. Hand shapes are identified in continuous frames using Google's MediaPipe Hands model, a high-fidelity hand-tracking solution that applies machine learning to infer 21 3D landmarks of a hand from a single frame [6]. Gestures and their labels are then classified using a very simple form of neural network, a densely connected neural network. The benefit of using the 3D hand-model points (landmarks) together with a dense network is that training requires significantly less time while the overall accuracy of the system increases markedly, because the system works with a compact point-based hand model rather than with the images themselves. This optimized approach surpasses previous work on American Sign Language classification in both accuracy and training efficiency.
2. Introduction

2.1. Brief

Sign language is a visual language used by people who are vocally or hearing impaired, who communicate by making gestures with their hands. There are 26 different gestures for the 26 alphabet letters used worldwide in ASL [3]; these signs differ from one another in their hand postures. This paper, however, deals with the Word-Level American Sign Language (WLASL) dataset [7]. Several research projects for the translation of American Sign Language have been developed in the past, but they fall short on performance criteria during evaluation and are very time- and power-consuming without being particularly efficient. Some sign-language translators use computer vision as the basis for data acquisition, while others use sensor-based data; the two methods differ only in the way input is gathered. To raise the recognition efficiency of classification systems, researchers use methods such as HMMs, ANNs, CNNs, and Long Short-Term Memory units. Effective approaches for the analysis and recognition of hand-sign patterns have evolved over the years. The objective of this paper is to develop such a solution, an optimized approach to the classification problem posed by the WLASL dataset: one that requires significantly less training time and produces much higher accuracy.

2.2. Problem Background

Conversation is an elemental part of human life, but for a person who is mute and hearing impaired, conversation is a challenge. To help others understand them, several research projects have attempted to build sign-language translators, but so far these projects are limited by the efficiency and performance of the trained models, which require a drastic amount of time and effort to produce a translation system based on modern machine-learning approaches, one that would let a person without any knowledge of ASL understand it.

2.3. Literature Review


We have studied systems that are not exactly like our project but that implement some of its modules:

• sign2text
• We-Capable: Text to Sign Language (ASL) Converter
• sign-language-gesture-recognition
2.4. Analysis from Literature Review

Table 1: Analysis from Literature Review

Application: sign2text [1]
Weakness: Prototype, not a product; does not provide appropriate text for a series of hand signs.
Proposed Solution: Our system will be a stable application/product and will provide appropriate text translation for a series of hand signs.

Application: We-Capable: Text to Sign Language (ASL) Converter [8]
Weakness: Only converts input text into a series of signs, one sign per letter.
Proposed Solution: Our product will perform the complete sign-to-text direction and help users to communicate.

Application: sign-language-gesture-recognition [2]
Weakness: Deployed, but not a stable or useful application; prototype, not a product; does not provide appropriate text for a series of hand signs.
Proposed Solution: Our system will provide a meaningful word-level translation, which requires fewer signs to express thoughts.

2.5. Problem Statement


Upon researching previous work in the field of WLASL classification, it was concluded that the conventional machine-learning approaches (CNNs [4], LSTMs [5], RNNs) to building a hand-sign classification system require a huge amount of time, ranging from days to weeks or even months depending on the amount of data the model must process during training. Such approaches also require a large amount of processing power, such as GPUs, to reduce the training time, and even then the time complexity grows from Ω(n) to Ω(n³) as the input size to a CNN increases, as illustrated below.

Figure 1: Internal working of Neurons in ANNs

Each neuron processes the input vector [x0, x1, x2], the weight vector [w0, w1, w2], and the provided bias b as

F = Σi (xi · wi) + b

where the sum runs over the inputs i and Σi (xi · wi) is the dot product of the input and weight vectors. Larger inputs therefore mean more dot products per layer of the CNN [9], and more processing requires more time and power. Thus there is a need for a less complex model, one that does not process images directly; converting pixels into a huge matrix and pushing that matrix through a neural network as dot products entails an enormous number of computations.
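For illustration, the per-neuron computation above can be written in a few lines of NumPy (a toy example with made-up numbers, not the paper's code):

import numpy as np

# Hypothetical input vector, weight vector and bias for a single neuron.
x = np.array([0.2, 0.5, 0.1])   # inputs  [x0, x1, x2]
w = np.array([0.4, 0.3, 0.9])   # weights [w0, w1, w2]
b = 0.1                         # bias

# F = sum_i (x_i * w_i) + b, i.e. a dot product plus the bias.
F = np.dot(x, w) + b
print(F)  # 0.2*0.4 + 0.5*0.3 + 0.1*0.9 + 0.1 = 0.42

Every neuron in every layer repeats this computation, which is why feeding whole images (thousands of pixel values) into a network is far more expensive than feeding 84 landmark values.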

3. Problem & Dataset

3.1. Problem description


This paper introduces an optimized approach to a machine-learning solution for the classification problem of the Word-Level American Sign Language dataset. Prior experiments used images directly as input to the neural network, which resulted in a complex and very large number of computations, increasing both the time complexity of the model and the processing power required. In this approach we instead use a pre-trained hand-detection model from Google, the MediaPipe Hands model [6].

This model provides 21 landmarks for each hand in a frame, as shown in the following figure:

Figure 2: MediaPipe Hand Model Tracking

Upon observing each landmark point closely, we found the following:

Figure 3: NormalizedLandmark objects and extracted values

Each of these normalized landmarks is a structure containing three values for a point:
• x: the horizontal position of the point within the frame.
• y: the vertical position of the point within the frame.
• z: the depth of the point, i.e. its distance from the camera.

Since we do not want to train our model for a specific distance from the camera, we ignore the z value of each landmark point. The x and y values are used to create a custom dataset, so that our ML model can learn the differences between hand signs and the relationships between the landmark (x, y) points of hand signs that share the same label.
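As a minimal sketch, the (x, y) pair of each NormalizedLandmark can be kept while z is discarded; extract_xy is our own illustrative helper (not part of the MediaPipe API), and hand_landmarks is one element of results.multi_hand_landmarks, as in the acquisition code of Appendix 9.1:

def extract_xy(hand_landmarks):
    # Keep only the (x, y) pair of each of the 21 NormalizedLandmark points.
    # z is dropped on purpose so the model is not tied to camera distance.
    return [(lm.x, lm.y) for lm in hand_landmarks.landmark]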

Display Demo of landmarks extraction:

Figure 4: Display demo of MediaPipe Normalized Landmarks system plot

3.2. Dataset
As concluded above, we can obtain x and y values for each landmark of a hand in a frame, and we exclude the z values for performance reasons. Our dataset therefore looks like this:

Figure 5: Dataset file

Figure 6: Clear Dataset View

Since each hand has 21 landmarks and each landmark contributes the x and y values we want to use in our dataset:

• A row in our dataset consists of (21 landmarks × 2 values (x, y)) × 2 hands = 84 columns of data.
• The first column of each dataset file contains the label, from 0 to 9 (10 labels per file).
• We collected 1000 rows, each 84 columns long, for every label; label 0, for example, has 1000 rows, i.e. 1000 frames of data. Each row represents one frame.
• For a frame in which only one hand was present, we padded the remaining 21 landmarks × 2 values = 42 values with zeros. This indicates that a single hand was present in the frame and keeps the row consistent with the dimensions our ML model expects, namely 84 values for both hands (a sketch of this row layout follows Figure 7).

In total we created a dataset of 40 signs with 1000 frames per sign, i.e. 40,000 frames, stored in four CSV files that we later append into one big dataset to train our model on, as shown in the figure:

Figure 7: Big Data creation for model
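For illustration only, one such 84-column row could be assembled as follows; build_row is a hypothetical helper (not part of the pipeline code in the appendices) that applies the zero-padding rule described above:

def build_row(label, hands_xy):
    # hands_xy: list holding the 42-value feature list of each detected hand (at most two).
    row = [label]
    for hand in hands_xy[:2]:
        row.extend(hand)                        # 42 values per detected hand
    row.extend([0.0] * (84 - (len(row) - 1)))   # zero-pad so every row has 84 feature columns
    return row

# One hand detected: 1 label + 42 features + 42 zeros = 85 values in total.
single_hand = [0.0] * 42
print(len(build_row(0, [single_hand])))  # 85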

4. Methodologies:
4.1. Usage of ANN (Artificial Dense Neural Network)
As the name suggests, a dense neural network is one in which each layer is fully connected to the previous and the next layer, i.e. every neuron in a layer is connected to all neurons of the adjacent layers, hence the term "dense".
4.2. Why use Dense type? Why not any other?
A densely connected neural network relies on learning patterns from the combination of all of the features, which is why its architecture uses dense connections. A convolutional neural network, by contrast, relies on segmented pattern learning: each layer applies a set of learned "filters" that each capture a small segment of the whole pattern, dividing the learning process into many pieces, which is computationally costly. Using a large number of filters requires drastic amounts of computation, and we do not want to make the process that expensive.
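To put this cost in perspective, the fully connected network proposed in the next section needs only a modest number of parameters. A quick check, matching the model summary in Appendix 9.5:

# Parameter count of the dense network described in Section 4.3 / Appendix 9.5:
# each dense layer has (inputs x units) weights plus one bias per unit.
layers = [(84, 80), (80, 60), (60, 55), (55, 40)]
total = sum(i * u + u for i, u in layers)
print(total)  # 17255, matching "Total params: 17,255" in the model summary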
4.3. Proposed Neural Network Architecture
This section of the paper emphasizes a general ideology of problem solving: complex problems often call for the simplest form of a solution. Keeping this in view, we introduce the simplest form of a neural network for our approach, a densely connected artificial neural network.

Since we wanted our neural network to be small, to train in a short amount of time, and to yield higher-quality results than previous approaches, we used dense layers throughout the network.

The following figure shows the architecture of the neural network:

Figure 8: Deep Learning Model Architecture

The architecture is simple and consists of four dense layers (a compact sketch of this stack follows the list):

• The first layer is the input layer with a shape of 84 (the number of inputs we calculated for two hands); it has no function other than passing the inputs to the first hidden layer.
• We followed the convention of reducing the number of units per layer as we move towards the output layer, so the dense layers before the output layer show a progressive decrease in unit count.
• We used ReLU as the activation function in the hidden layers: ReLU is a piecewise-linear function that passes its input through when it is positive and outputs zero otherwise.

Why only ReLU and not another activation function? Unlike the S-shaped sigmoid and tanh activation functions, ReLU does not have their sensitivity issues: sigmoid and tanh squash their outputs into the ranges 0 to 1 and -1 to +1 respectively, so a small change in input can cause an abrupt change in output.

• The output layer consists of a number of units equal to the number of classes, 40.
• The output layer uses a different activation function, softmax, whose purpose is to map the outputs of the previous ReLU layer onto a probability distribution over the 40 labels (0 to 39); since ReLU can output any positive number, its outputs could otherwise fall outside our label range.
• A callback function for early stopping is introduced in the network: it monitors the training metric and, if there is no significant improvement after a certain number of epochs, stops training so that the model does not overfit the data.
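A compact sketch of the four-dense-layer stack described above, consistent with the full model-creation code in Appendix 9.5 (the dropout layers between hidden layers are part of that code and are included here for completeness):

import tensorflow as tflow

NUM_CLASSES = 40  # one output unit per word-level sign label

model = tflow.keras.models.Sequential([
    tflow.keras.layers.Input((84,)),                  # 21 landmarks x 2 values (x, y) x 2 hands
    tflow.keras.layers.Dense(80, activation='relu'),  # hidden layers shrink towards the output
    tflow.keras.layers.Dropout(0.4),
    tflow.keras.layers.Dense(60, activation='relu'),
    tflow.keras.layers.Dropout(0.4),
    tflow.keras.layers.Dense(55, activation='relu'),
    tflow.keras.layers.Dense(NUM_CLASSES, activation='softmax'),  # probabilities over the 40 labels
])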

What are the benefits of using such a simple form of neural network?

• It removes the complex recurrent computations of RNNs and the complex convolutional computations of CNNs.
• Because we work only indirectly with images, we do not need complex neural architectures to process and learn our data; we extract the (x, y) points so that even the simplest form of ANN can learn from them.
• It drastically decreases training time.
• It requires much less computational power, with no GPU needed at all.

5. Experimental setup
5.1. Feature Selection & Extraction
Each point in a frame is represented as a NormalizedLandmark, as shown in the figure below:

Figure 9: Landmark data

• We select only the x and y values and leave out z, since it is not relevant for our model and we do not want to train the model for a specific distance from the camera.
• To convert these normalized values into pixel coordinates on the image frame, we used the formulas (a worked example follows):

real_x_coordinate = min(int(x_value * img_width), img_width - 1)
real_y_coordinate = min(int(y_value * img_height), img_height - 1)

Each formula multiplies the normalized coordinate by the corresponding image dimension and then takes the minimum of that product and the dimension minus one, which clamps the resulting pixel coordinate so it always stays inside the frame. This gives the actual position of each point on the image.
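For example, using the wrist landmark from the demo output in Appendix 9.2 (x ≈ 0.5298218, y ≈ 0.9121418 on a 640 × 480 frame):

# Worked example: normalized wrist landmark -> pixel coordinates on a 640x480 frame
real_x = min(int(0.5298218 * 640), 640 - 1)   # int(339.08...) -> 339
real_y = min(int(0.9121418 * 480), 480 - 1)   # int(437.82...) -> 437
print(real_x, real_y)                          # 339 437, the first entry of LandMark_CALC_LIST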

Figure 10: Conversion of normalized landmarks into real x, y coordinates

5.2. Data Pre-processing

After obtaining the original (x, y) coordinates of each point from the picture, we preprocess each list of 21 points so that x0 and y0 become base_x and base_y respectively:

• First, base_x and base_y are subtracted from every point in the list, giving the relative difference of each point from the first one (landmark 0, the wrist).
• We then take the absolute maximum value from this list of relative differences.
• Each x and y difference is divided by this maximum, normalizing the list.
• The resulting normalized difference list is appended as one row to the dataset (see the sketch below).
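A condensed sketch of these steps (assuming at least one nonzero difference; the full pre_process_landmark routine in Appendix 9.2 is the authoritative version):

def preprocess(points):
    # points: list of 21 [x, y] pixel coordinates for one hand (from calc_landmark_list).
    base_x, base_y = points[0]                          # landmark 0 becomes the reference point
    diffs = [[x - base_x, y - base_y] for x, y in points]
    flat = [value for pair in diffs for value in pair]  # flatten to 42 values
    max_value = max(abs(value) for value in flat)       # absolute maximum difference
    return [value / max_value for value in flat]        # normalized row for the dataset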

The following figure demonstrates the actual process:

Figure 11: Preprocessing of Relative differences of points for each frame.

6. Results
The performance of the trained model on the test split of the dataset shows a drastic increase in the accuracy of the classification approach: an overall complex problem was solved with simple methods and yields a higher-quality result. Training took about 2-3 minutes, and the model reached an accuracy of 99.89% in that short interval, stopping after 339 of the allowed 10,000 epochs. This accuracy is largely owed to the data-acquisition and pre-processing methods, which removed the need for complex models on what then becomes a simple problem.

Figure 12: Deep learning model performance

6.1. Classification Report for Model

The per-class precision, recall, and F1-scores for all 40 labels are produced by the classification-report code in Appendix 9.8.
7. Discussion and Conclusions
This paper concludes that conventional machine-learning approaches to classifying the WLASL dataset use rather complex models, such as CNNs and RNNs, that can take an extremely long time to train when the available dataset is large. These approaches require intense computation to reach an evaluated accuracy in the range of 75% to 85%, which is not bad, but is a great deal of effort for a problem that can be made simple. The proposed approach removes these complications: the short training time, together with the achieved accuracy of 99.89%, shows that it is indeed a well-optimized approach. Careful data acquisition made it possible for a very simple type of deep-learning model, a densely connected network, to learn the relative differences between the points of a specific hand sign and how those relative differences differ from those of other hand signs.

8. References

[1] Prototype 1: "Sign2Text", BelalC (Jul 30, 2020), machine-learning model. https://github.com/BelalC/sign2text
[2] Prototype 2: "sign-language-gesture-recognition", Harish Thuwal (Sep 27, 2020), deep-learning model. https://github.com/hthuwal/sign-language-gesture-recognition
[3] American Sign Language study.
[4] System study using deep learning: literature study.
[5] System study for ASL: ASL Recognition System using CNN and Stacked LSTM.
[6] MediaPipe documentation, Google.
[7] WLASL dataset study: how to gather custom data.
[8] Simple ASL converter (We-Capable text-to-sign converter): a case study of a relevant system.
[9] Training complexity of CNNs: CNN training.

9. Appendices
9.1. Data acquisition code
import cv2 as cv
import numpy as npi
import mediapipe as media_pipe

camera = cv.VideoCapture(0)

media_pipe_drawing = media_pipe.solutions.drawing_utils
media_pipe_hands = media_pipe.solutions.hands

with media_pipe_hands.Hands(
        min_detection_confidence=0.5,
        min_tracking_confidence=0.5) as hands:
    while camera.isOpened():
        success_bool, camera_image_frame = camera.read()
        if not success_bool:
            print("Got an empty camera frame!")
            # If loading a video, use 'break' instead of 'continue'.
            continue
        print('camera_image_frame_SHAPE: ', camera_image_frame.shape)

        # Flip the camera_image_frame horizontally for a later selfie-view display,
        # and convert the BGR camera_image_frame to RGB.
        camera_image_frame = cv.cvtColor(cv.flip(camera_image_frame, 1), cv.COLOR_BGR2RGB)

        # To improve performance, optionally mark the camera_image_frame as not
        # writeable to pass by reference.
        camera_image_frame.flags.writeable = False
        results = hands.process(camera_image_frame)

        # Draw the hand annotations on a blank (black) copy of the camera_image_frame.
        camera_image_frame.flags.writeable = True
        camera_image_frame = cv.cvtColor(camera_image_frame, cv.COLOR_RGB2BGR)
        camera_image_frame = npi.zeros_like(camera_image_frame)
        if results.multi_hand_landmarks:
            for hand_landmarks in results.multi_hand_landmarks:
                media_pipe_drawing.draw_landmarks(camera_image_frame, hand_landmarks,
                                                  media_pipe_hands.HAND_CONNECTIONS)
                # calc_landmark_list() is defined in the preprocessing code (Appendix 9.2).
                land_mark_list = calc_landmark_list(camera_image_frame, hand_landmarks)
        cv.imshow('MediaPipe Demo Hands', camera_image_frame)
        if cv.waitKey(5) & 0xFF == 27:
            break

cv.destroyAllWindows()
camera.release()

9.2. Data preprocessing code

import cv2 as cv
import numpy as npi
import mediapipe as media_pipe
import copy
import itertools

# Calculates the real pixel coordinates on the image for each landmark point
def calc_landmark_list(image_frame, landmarks):
    image_width, image_height = image_frame.shape[1], image_frame.shape[0]

    landmark_points = []

    for i, landmark in enumerate(landmarks.landmark):
        landmark_x = min(int(landmark.x * image_width), image_width - 1)
        landmark_y = min(int(landmark.y * image_height), image_height - 1)

        landmark_points.append([landmark_x, landmark_y])

    return landmark_points

# Preprocesses each original coordinate and finds its relative differences
def pre_process_landmark(landmark_list):
    temp_landmark_list = copy.deepcopy(landmark_list)

    base_x, base_y = 0, 0
    print(" x,\ty")
    for index, landmark_point in enumerate(temp_landmark_list):
        if index == 0:
            base_x, base_y = landmark_point[0], landmark_point[1]

        temp_landmark_list[index][0] = temp_landmark_list[index][0] - base_x
        temp_landmark_list[index][1] = temp_landmark_list[index][1] - base_y
        print("DIFF: ", temp_landmark_list[index][0], "\t", temp_landmark_list[index][1])

    # Flatten the 21 [x, y] pairs into a single list of 42 values
    temp_landmark_list = list(
        itertools.chain.from_iterable(temp_landmark_list))

    max_value = max(list(map(abs, temp_landmark_list)))
    print("ABSOLUT MAX: ", max_value)

    def normalize_(n):
        return n / max_value

    temp_landmark_list = list(map(normalize_, temp_landmark_list))

    print('---------------------------------------------------------------')
    return temp_landmark_list

# Demo code to show the whole process from acquisition to preprocessing of data
camera = cv.VideoCapture(0)

media_pipe_drawing = media_pipe.solutions.drawing_utils
media_pipe_hands = media_pipe.solutions.hands

with media_pipe_hands.Hands(
        min_detection_confidence=0.5,
        min_tracking_confidence=0.5) as _hands:
    while camera.isOpened():
        success, image_frame = camera.read()
        if not success:
            print("Ignoring empty camera frame.")
            # If loading a video, use 'break' instead of 'continue'.
            continue
        print('IMAGE_FRAME_SHAPE: ', image_frame.shape)

        # Flip the image horizontally for a later selfie-view display, and convert
        # the BGR image to RGB.
        image_frame = cv.cvtColor(cv.flip(image_frame, 1), cv.COLOR_BGR2RGB)
        # To improve performance, optionally mark the image as not writeable to
        # pass by reference.
        image_frame.flags.writeable = False
        results = _hands.process(image_frame)

        # Draw the hand annotations on a blank (black) copy of the image.
        image_frame.flags.writeable = True
        image_frame = cv.cvtColor(image_frame, cv.COLOR_RGB2BGR)
        image_frame = npi.zeros_like(image_frame)
        if results.multi_hand_landmarks:
            for found_hand_landmarks in results.multi_hand_landmarks:
                media_pipe_drawing.draw_landmarks(image_frame, found_hand_landmarks,
                                                  media_pipe_hands.HAND_CONNECTIONS)
                land_mark_list = calc_landmark_list(image_frame, found_hand_landmarks)
                print('---------------------------------------------------------------')
                print('ORIG_LANDMARK: ', found_hand_landmarks)
                print('---------------------------------------------------------------')
                print('---------------------------------------------------------------')
                print(f'LandMark_CALC_LIST({len(land_mark_list)}): ', land_mark_list)
                print('---------------------------------------------------------------')
                print('PREPROCESSED: ', pre_process_landmark(land_mark_list))
                print('---------------------------------------------------------------')
                break  # demo: process only the first detected hand
        cv.imshow('Demo MediaPipe Hands', image_frame)
        if cv.waitKey(5) & 0xFF == 27:
            break
        break  # demo: process only a single frame

cv.destroyAllWindows()
camera.release()

Output:
IMAGE_SHAPE: (480, 640, 3)
---------------------------------------------------------------
ORIG_LANDMARK: landmark_0 {
x: 0.5298218
y: 0.9121418
z: -8.171827e-05
}
Landmark_1 {
x: 0.45480007
y: 0.8920693
z: -0.06166866
}
Landmark_2 {
x: 0.3911696
y: 0.82389766
z: -0.09706815
}
Landmark_3 {
x: 0.33392072
y: 0.7772389
z: -0.13308653
}
Landmark_4 {
x: 0.27219355
y: 0.75963604
z: -0.17590481
}
Landmark_5 {
x: 0.4567382
y: 0.6703464
z: -0.0669591
}
Landmark_6 {
x: 0.43688262
y: 0.5598857
z: -0.1016331
}
Landmark_7 {
x: 0.4251266
y: 0.49154353
z: -0.1258149
}
Landmark_8 {
x: 0.4154522
y: 0.4323473
z: -0.1451781
}
Landmark_9 {
x: 0.5110696
y: 0.6602165
z: -0.06574423
}
Landmark_10 {
x: 0.5108578
y: 0.5347301
z: -0.09177217
}
Landmark_11 {
x: 0.5153187
y: 0.45548508
z: -0.12585047
}
Landmark_12 {
x: 0.5219439
y: 0.3864655
z: -0.15050456
}
Landmark_13 {
x: 0.55792934
y: 0.6755521
z: -0.0664633
}
Landmark_14 {
x: 0.56623095
y: 0.5567026
z: -0.0971865
}
Landmark_15 {
x: 0.5762762
y: 0.48259073
z: -0.12416917
}
Landmark_16 {
x: 0.58437335
y: 0.417627
z: -0.14456311
}
Landmark_17 {
x: 0.6011278
y: 0.7114185
z: -0.072560504
}
Landmark_18 {
x: 0.63162947
y: 0.63810915
z: -0.109268196
}
Landmark_19 {
x: 0.65477633
y: 0.596596
z: -0.12863845
}
Landmark_20 {
x: 0.6754149
y: 0.55502
z: -0.14428398
}

---------------------------------------------------------------
---------------------------------------------------------------
LandMark_CALC_LIST(21): [[339, 437], [291, 428], [250, 395], [213, 373], [174, 364], [292, 321], [279, 268], [272, 235], [265, 207], [327, 316], [326, 256], [329, 218], [334, 185], [357, 324], [362, 267], [368, 231], [373, 200], [384, 341], [404, 306], [419, 286], [432, 266]]
---------------------------------------------------------------
x, y
DIFF: 0 0
DIFF: -48 -9
DIFF: -89 -42
DIFF: -126 -64
DIFF: -165 -73
DIFF: -47 -116
DIFF: -60 -169
DIFF: -67 -202
DIFF: -74 -230
DIFF: -12 -121
DIFF: -13 -181
DIFF: -10 -219
DIFF: -5 -252
DIFF: 18 -113
DIFF: 23 -170
DIFF: 29 -206
DIFF: 34 -237
DIFF: 45 -96
DIFF: 65 -131
DIFF: 80 -151
DIFF: 93 -171
ABSOLUT MAX: 252
---------------------------------------------------------------
PREPROCESSED: [0.0, 0.0, -0.19047619047619047, -0.03571428571428571, -0.3531746031746032, -0.16666666666666666, -0.5, -0.25396825396825395, -0.6547619047619048, -0.2896825396825397, -0.1865079365079365, -0.4603174603174603, -0.23809523809523808, -0.6706349206349206, -0.26587301587301587, -0.8015873015873016, -0.29365079365079366, -0.9126984126984127, -0.047619047619047616, -0.4801587301587302, -0.051587301587301584, -0.7182539682539683, -0.03968253968253968, -0.8690476190476191, -0.01984126984126984, -1.0, 0.07142857142857142, -0.44841269841269843, 0.09126984126984126, -0.6746031746031746, 0.11507936507936507, -0.8174603174603174, 0.1349206349206349, -0.9404761904761905, 0.17857142857142858, -0.38095238095238093, 0.25793650793650796, -0.5198412698412699, 0.31746031746031744, -0.5992063492063492, 0.36904761904761907, -0.6785714285714286]
---------------------------------------------------------------

9.3. Dataset creation code

import csv

import numpy as npi
import tensorflow as tflow
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42

dataset_p1 = 'extracted_data/generalized/keypoint_1.csv'
dataset_p2 = 'extracted_data/generalized/keypoint_2.csv'
dataset_p3 = 'extracted_data/generalized/keypoint_3.csv'
dataset_p4 = 'extracted_data/generalized/keypoint_4.csv'
model_save_path = 'extracted_data/generalized/model/beta_clf.hdf5'

# Each CSV holds 10 signs (labels 0-9 in column 0) followed by 84 feature columns.
X_dataset1 = npi.loadtxt(dataset_p1, delimiter=',', dtype='float32', usecols=list(range(1, (42 * 2) + 1)))
y_dataset1 = npi.loadtxt(dataset_p1, delimiter=',', dtype='int32', usecols=(0))

X_dataset2 = npi.loadtxt(dataset_p2, delimiter=',', dtype='float32', usecols=list(range(1, (42 * 2) + 1)))
y_dataset2 = npi.loadtxt(dataset_p2, delimiter=',', dtype='int32', usecols=(0))
y_dataset2 = y_dataset2 + 10   # shift labels so the second file covers 10-19

X_dataset3 = npi.loadtxt(dataset_p3, delimiter=',', dtype='float32', usecols=list(range(1, (42 * 2) + 1)))
y_dataset3 = npi.loadtxt(dataset_p3, delimiter=',', dtype='int32', usecols=(0))
y_dataset3 = y_dataset3 + 20   # labels 20-29

X_dataset4 = npi.loadtxt(dataset_p4, delimiter=',', dtype='float32', usecols=list(range(1, (42 * 2) + 1)))
y_dataset4 = npi.loadtxt(dataset_p4, delimiter=',', dtype='int32', usecols=(0))
y_dataset4 = y_dataset4 + 30   # labels 30-39

# Append the four partial datasets into one big feature matrix.
concat_x = npi.concatenate((X_dataset1, X_dataset2), axis=0)
concat_x = npi.concatenate((concat_x, X_dataset3), axis=0)
concat_x = npi.concatenate((concat_x, X_dataset4), axis=0)
concat_x.shape

Output:
(40000, 84)

concat_y = npi.concatenate((y_dataset1, y_dataset2), axis=0)
concat_y = npi.concatenate((concat_y, y_dataset3), axis=0)
concat_y = npi.concatenate((concat_y, y_dataset4), axis=0)
concat_y.shape

Output:
(40000,)

9.4. Train test split

train_x, test_x, train_y, test_y = train_test_split(
    concat_x, concat_y, train_size=0.75, random_state=RANDOM_SEED)

9.5. Model Creation code

import numpy as npi
import tensorflow as tflow
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42
NUM_CLASSES = 40   # number of word-level sign labels

model = tflow.keras.models.Sequential([
    tflow.keras.layers.Input((42 * 2, )),
    tflow.keras.layers.Dense(80, activation='relu'),
    tflow.keras.layers.Dropout(0.4),
    tflow.keras.layers.Dense(60, activation='relu'),
    tflow.keras.layers.Dropout(0.4),
    tflow.keras.layers.Dense(55, activation='relu'),
    tflow.keras.layers.Dense(NUM_CLASSES, activation='softmax')
])

model.summary()

Output:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 80) 6800
_________________________________________________________________
dropout (Dropout) (None, 80) 0
_________________________________________________________________
dense_1 (Dense) (None, 60) 4860
_________________________________________________________________
dropout_1 (Dropout) (None, 60) 0
_________________________________________________________________
dense_2 (Dense) (None, 55) 3355
_________________________________________________________________
dense_3 (Dense) (None, 40) 2240
=================================================================
Total params: 17,255
Trainable params: 17,255
Non-trainable params: 0
_________________________________________________________________

9.6. Model Training

# Callbacks: save the model after each epoch and stop early when the monitored
# metric (validation loss by default) stops improving, to avoid overfitting.
capture_callback = tflow.keras.callbacks.ModelCheckpoint(model_save_path, verbose=1, save_weights_only=False)
early_stopping_callback = tflow.keras.callbacks.EarlyStopping(patience=20, verbose=1)

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

model.fit(
    train_x,
    train_y,
    epochs=10000,
    batch_size=512,
    validation_data=(test_x, test_y),
    callbacks=[capture_callback, early_stopping_callback]
)
9.7. Model Evaluation

value_loss, value_accuracy = model.evaluate(test_x, test_y, batch_size=128)

Output:

79/79 [==============================] - 0s 1ms/step - loss: 0.0034 - accuracy: 0.9989

9.8. Classification Report for model

import numpy as np
import pandas as p_d
import seaborn as s_n_s
import matplotlib.pyplot as py_plt
from sklearn.metrics import confusion_matrix, classification_report

def print_confusion_matrix(y_original, y_predictions, report=True):
    # Plot the confusion matrix as a heatmap and, optionally, print the
    # per-class precision/recall/F1 classification report.
    y_labels = sorted(list(set(y_original)))
    conf_matrix_data = confusion_matrix(y_original, y_predictions, labels=y_labels)
    df_conf_matrix = p_d.DataFrame(conf_matrix_data, index=y_labels, columns=y_labels)
    fig, ax = py_plt.subplots(figsize=(20, 20))
    s_n_s.heatmap(df_conf_matrix, annot=True, fmt='g', square=False)
    ax.set_ylim(len(set(y_original)), 0)
    py_plt.show()

    if report:
        print('-----------Classification Report for Model-----------')
        print(classification_report(y_original, y_predictions))

predictions = model.predict(test_x)
predictions = np.argmax(predictions, axis=1)

print_confusion_matrix(test_y, predictions)

Output: confusion matrix heatmap and per-class classification report for the 40 labels.
