A project report on

Emotion Analysis using Natural Language Processing on AWS

Submitted in partial fulfilment for the award of the degree of

Bachelor of Technology in Computer Science and Engineering with
Specialization in Cyber Physical Systems

by
KSHITIJ LARIWAL (19BPS1126)

SCHOOL OF COMPUTER SCIENCE AND ENGINEERING
April, 2023

DECLARATION
I hereby declare that the thesis entitled “Emotion Analysis
using NLP on AWS” submitted by me, for the award of the
degree of Bachelor of Technology in Computer Science and
Engineering with Specialization in Cyber Physical Systems,
Vellore Institute of Technology, Chennai, is a record of bonafide
work carried out by me under the supervision of Dr.
Rajarajeshwari S.
I further declare that the work reported in this thesis has not
been submitted and will not be submitted, either in part or in
full, for the award of any other degree or diploma in this
institute or any other institute or university.
Place: Chennai
Date: 20/04/2023

Signature of the Candidate


School of Computer Science and Engineering


CERTIFICATE

This is to certify that the report entitled “Emotion Analysis using Natural Language
Processing on AWS”, prepared and submitted by Kshitij Lariwal (19BPS1126) to Vellore
Institute of Technology, Chennai, in partial fulfillment of the requirement for the award of the
degree of Bachelor of Technology in Computer Science and Engineering with Specialization in
Cyber Physical Systems programme, is a bonafide record of work carried out under my guidance. The
project fulfils the requirements as per the regulations of this University and in my opinion meets the
necessary standards for submission. The contents of this report have not been submitted and will not
be submitted either in part or in full, for the award of any other degree or diploma and the same is
certified.

Signature of the Guide: Name:

Date:

Signature of the Examiner 1: Name:


Date:

Signature of the Examiner 2: Name:


Date:

Approved by the Head of Department

B.Tech. CSE with Specialization in Cyber Physical Systems

Name: Dr. Maheswari R Date: 24 – 04 - 2023

ABSTRACT

Emotion recognition is a challenging task in the field of Natural Language Processing. In this
research, we develop a hybrid architecture for natural language processing (NLP) tasks by combining
the strengths of the BERT-Base preprocessor and the ELECTRA fine-tuner model. BERT-Base is a widely
used pre-trained language model that provides a robust representation of text, while the ELECTRA
fine-tuner is known for its ability to generate more accurate and efficient predictions by leveraging
adversarial training. The expected outcome of this project is to demonstrate that the proposed hybrid
architecture can improve the accuracy and efficiency of NLP tasks and provide a better alternative
to the standalone models. Additionally, the scope of the work is extended to the AWS platform to
explore real-life deployment applications.

ACKNOWLEDGEMENT

It is my pleasure to express my deep sense of gratitude to Dr. Rajarajeswari S, Assistant Professor,
SCOPE, Vellore Institute of Technology, Chennai, for her constant guidance, continual
encouragement and understanding; more than all, she taught me patience in my endeavor. My
association with her is not confined to academics only; it has been a great opportunity on my part to
work with an intellectual and an expert in the field of facial emotion recognition.

It is with gratitude that I would like to extend thanks to our honorable Chancellor, Dr. G.
Viswanathan, Vice Presidents, Mr. Sankar Viswanathan, Dr. Sekar Viswanathan and Mr. G V
Selvam, Assistant Vice-President, Ms. Kadhambari S. Viswanathan, Vice-Chancellor, Dr. Rambabu
Kodali, Pro-Vice Chancellor, Dr. V. S. Kanchana Bhaaskaran and Additional Registrar, Dr.
P.K.Manoharan for providing an exceptional working environment and inspiring all of us during the
tenure of the course.

Special mention to Dean, Dr. Ganesan R, Associate Dean Academics, Dr. Parvathi R and Associate
Dean Research, Dr. Geetha S, SCOPE, Vellore Institute of Technology, Chennai, for spending their
valuable time and efforts in sharing their knowledge and for helping us in every aspect.

In jubilant mood I express ingeniously my whole-hearted thanks to Dr. Maheswari R, Head of the
Department, Project Coordinators, Dr. Priyadarshini R, Dr. Abdul Quadir Md, and Dr. Padmavathy
T V, B.Tech. CSE with Specialization in Cyber Physical Systems, SCOPE, Vellore Institute of
Technology, Chennai, for their valuable support and encouragement to take up and complete the
thesis.

My sincere thanks to all the faculties and staff at Vellore Institute of Technology, Chennai, who
helped me acquire the requisite knowledge. I would like to thank my parents for their support. It is
indeed a pleasure to thank my friends who encouraged me to take up and complete this task.

Place: Chennai Date: 20/04/2023

Kshitij Lariwal


CONTENTS

CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS

CHAPTER 1 INTRODUCTION
1.1 INTRODUCTION TO EMOTION RECOGNITION
1.2 PROBLEM STATEMENT
1.3 RESEARCH MOTIVATION
1.4 RESEARCH OBJECTIVES
1.5 RESEARCH CHALLENGES

CHAPTER 2 BACKGROUND
2.1 LITERATURE SURVEY
2.2 DATASET ANALYSIS

CHAPTER 3 PROPOSED SYSTEM
3.1 DATASET PREPROCESSING
3.1.2 AUGMENTATION USING SMOTE
3.1.3 TOKENISATION
3.1.4 MASKING
3.1.5 CONVOLUTIONAL NEURAL NETWORK
3.2.1 CONVOLUTIONAL LAYER
3.2.2 POOLING LAYER
3.2.3 FULLY CONNECTED LAYER
3.2.4 BATCH NORMALIZATION LAYER
3.2.5 OPTIMIZERS
3.2.6 CALLBACKS
3.3 PERFORMANCE METRICS
3.3.1 ACCURACY
3.3.2 CONFUSION MATRIX
3.3.3 PRECISION
3.3.4 RECALL
3.3.5 F1-SCORE

CHAPTER 4 IMPLEMENTATION

CHAPTER 5 RESULTS AND DISCUSSION

CHAPTER 6 CONCLUSION
6.1 CONCLUSION
6.2 FUTURE WORK

APPENDICES
APPENDIX 1
APPENDIX 2
APPENDIX 3
APPENDIX 4

REFERENCES










LIST OF FIGURES
2.1 VISUALIZATION OF DATASET
2.2 SAMPLE IMAGES FROM DATASET
3.1 VISUALIZATION OF DATASET AFTER SMOTE
3.2 SAMPLE HAAR FEATURES
3.3 FLOW OF PRE-PROCESSING OF AN IMAGE
4.1 IMAGE AFTER RESIZING TO 640X640
4.2 IMAGE AFTER APPLICATION OF HAAR FILTERS
4.3 IMAGE AFTER RESIZING AND APPLICATION OF GABOR FILTERS
4.4 SUMMARY OF CNN MODEL
5.1 TRAINING ACCURACY
5.2 TRAINING LOSS
5.3 VALIDATION ACCURACY
5.4 VALIDATION LOSS

LIST OF ACRONYMS
ConvNet   Convolutional Neural Network
CV        Computer Vision
AI        Artificial Intelligence
ML        Machine Learning
SMOTE     Synthetic Minority Oversampling Technique
CK+       Extended Cohn-Kanade Dataset
VGG19     Visual Geometry Group - 19 layers deep
CMU       Carnegie Mellon University
NIST      National Institute of Standards and Technology
DCNN      Deep Convolutional Neural Networks
KDEF      Karolinska Directed Emotional Faces
JAFFE     Japanese Female Facial Expression
ReLU      Rectified Linear Unit
VGG-Net   Visual Geometry Group Networks
SGD       Stochastic Gradient Descent
TanH      Hyperbolic Tangent
RMSProp   Root Mean Square Propagation
TP        True Positives
TN        True Negatives
FP        False Positives
FN        False Negatives
RGB       Red, Green and Blue
cv2       OpenCV library
np        NumPy library
CSV       Comma-Separated Values
LR        Learning Rate

Introduction

Automation with transformer-based NLP engines in the present age of AI represents a major advance in
how machines approximate human understanding. Capabilities that were thought impossible in the 1990s
are now routine: less than a generation later, programs running on silicon chips perform language
tasks that are unsettling to some and inspiring to others. Having computers provide a human touch is
the result of advances in AI and NLP models.
In this research, algorithms such as GPT-3, ELECTRA, BERT, T5 (for text ranking) and XLNet (for
dependency calculations) have been explored to find a fitting solution for sentiment analysis and for
mapping human remarks to their most representative emotion descriptors. To further extend this work,
we have chosen to execute an NLP model on a cloud-hosted service, where we can examine the channeling
of attributed data and explore services such as AWS Glue, Amazon EC2, Amazon S3 and AutoML.
Additionally, emotion analysis is a crucial component of human-computer interaction, where the
ability to recognize and respond to emotions is essential for creating more natural and effective
interactions. It also has many real-time applications, some of which may not be for the present day
but for the near future, and it can be used as a benchmark to test the limits of such models.





Research Motivation:
Natural Language Processing (NLP) is a rapidly evolving field of Artificial Intelligence (AI) that
offers numerous applications in various domains such as healthcare, finance, and e-commerce.
There is a demand for scalable, affordable solutions to analyse and manage unstructured text data
since organisations are producing more and more data. Without the need for expensive hardware or
specialised technical knowledge, cloud-hosted services offer an accessible and scalable way for
organisations to harness NLP capabilities.

Despite how practical and affordable cloud-hosted NLP services are, there are still a number of
issues that need to be resolved. First, the infrastructure and algorithm selections made by the
cloud service provider may have an impact on the consistency and dependability of cloud-based
NLP models. Second, in the present age, when data privacy and security are such hot topics, we
strive to address ethical concerns and algorithmic bias with transparency.

Problem Statement:

As a result, the purpose of this research is to examine how well cloud-hosted NLP models,
such as GPT-3 and ELECTRA, perform in a variety of NLP tasks, including sentiment analysis, named
entity identification, and text categorisation, and to perform zero-shot learning to accomplish
greater accuracies in our hybrid system. The research also seeks to examine how various cloud
service providers' infrastructure decisions affect the effectiveness of NLP models. The study also
identifies and suggests best practices to address the ethical, privacy, and security issues associated
with employing cloud-hosted NLP services.

Through this research, we aim to provide insights into the advantages and limitations of using
cloud-hosted NLP services, and into how to leverage these services in a secure, ethical,
and effective manner.


Research Objectives:
One of the major aims of this research is to try to eliminate problems present in the
dataset, such as activity bias. These are biases that exist in human-generated
content, especially on social media. The truth is that only a very small portion of the population
actively uses these social networking sites. Therefore, the data that has been gathered throughout
the years on these platforms does not accurately reflect the population as a whole. Though quite
similar to the first, the second source of bias is slightly different: societal prejudice. Once
again, human biases are present in data produced by people, and they are not limited to social media.
Preconceived assumptions present in society may have contributed to the introduction of these
prejudices. Because we all have unconscious prejudice, data produced by people may be biased.
The machine learning system itself may occasionally generate bias.
Consider a scenario where a machine learning application offers users a few alternatives to choose
from, and once the user chooses one, the user's choice is utilised as training data to further train and
enhance the model. This introduces feedback loops. Masking datasets for the most relatable query is
likely to reinforce the chance of getting a low BLEU score.

The main objective is, of course, to test the model's accuracy and offer some insights on
how to further extend the scope of the model's efficacy, and also to have a Human-in-the-Loop pipeline
for rendering useful feedback on our AWS platform.

Background & Literature Survey



Research Challenges:

One of the main research challenges is to determine which pre-processing method is best placed to
find and classify the emotion with maximum accuracy.

Another challenge is the attempt to reduce selection bias, which includes a feedback loop involving
both the model consumers and the machine learning model itself. Even when we detect some of the
statistical biases in our dataset prior to training our model, drift can still happen once the model
is trained and deployed. There are several different variations of data drift. Sometimes the
distribution of the independent variables, or the features that make up the dataset, can change;
that is called covariate drift. Sometimes the data distribution of the labels or the target variables
might change; that is the second one, prior probability drift. Sometimes the relationship between the
two, that is, the relationship between the features and the labels, can change as well; that is called
concept drift. Addressing these is a difficult task, and it needs to be done effectively.
A further challenge is the construction of models and the use of novel technologies to introduce new,
better combinations of machine learning algorithms able to accomplish a task with high accuracy. The
complexity of integrating the architecture of a discriminative model and a transformer-based
generative algorithm into a hybrid architecture is a Herculean task. The approach adopted is to fine-
tune GPT-3 and ELECTRA on a large corpus of texts, stressing hyper-parameters that can be tuned,
such as the learning rate, batch size, and number of epochs.



CHAPTER - 3
Proposed System

The dataset will be preprocessed first using the following steps:


3.1 Text Analysis

3.1.1 Dataset Pre-Processing


After the dataset is loaded into the notebook, it is checked for any discrepancies in the upload. Then
tokenisation is carried out:

• Tokenization: T5 is based on a sequence-to-sequence architecture, and thus requires input data to
be tokenized into subwords or words. The input text is first segmented into subwords or words
using a tokenizer, such as the SentencePiece tokenizer. Similarly, BERT also requires input data
to be tokenized into subwords or words, using a tokenizer such as the WordPiece tokenizer. The
further feature extraction is done using supervised learning techniques. Removing stopwords,
punctuation, handles and URLs, stemming, and lowercasing are carried out as part of text masking
(a minimal code sketch of these steps appears after this list).

Features, Labels => Train => Predict

Extract features => Train LR => Predict sentiment

• Special Tokens: special tokens such as '<pad>', '<bos>', and '<eos>' are added to the tokenized
text. In a typical sequence-to-sequence (Seq2Seq) model for natural language processing (NLP), the
encoder is deployed after the input text has been tokenized into individual words or subword
units. This means that any special tokens, such as start-of-sequence (SOS) and end-of-sequence
(EOS) markers, are included in the tokenized input sequence and are processed by the encoder
like any other tokens.





• Segment IDs: segment IDs are added to indicate which part of the input sequence belongs to the
first sentence and which part belongs to the second sentence (for tasks such as sentence pair
classification): the segment ID is 0 for tokens of the first sentence and 1 for tokens of the second
sentence.

[Figure: Training LR (here, the v vector is the context identifier variable)]

The net LOSS is used to assess the modularity and supremacy of our datasets.

• Padding: To guarantee the model operates at its best while utilising BERT-BASE as a
preprocessor for ELECTRA fine-tuning, it is crucial to take into account the essential padding and
masking features.
In order to make sure that the input sequences are of the same length, input sequences are
"padded" by adding zeros to the end. Padding is required to guarantee that all input sequences are
the same length when using BERT-BASE since input sequences are tokenized into subwords that
have a set length (often 512 tokens). To prevent wasting memory, it's crucial to take the maximum
length of input sequences into account when padding. For instance, padding to 512 tokens would
be inefficient if the majority of input sequences are less than 200 tokens long.

• Masking: Masking is the practice of hiding specific input sequence tokens so that the model won't
pay attention to them while being trained. Masking is utilised in the case of BERT-BASE to train
the model on anticipating missing words in a sequence. This is achieved by randomly replacing
15% of the input sequence's tokens with the special token [MASK]. Then, using the context that
the surrounding tokens give, the model is trained to predict the original token. It is an essential
distinction that not all masked tokens are utilised for training in the same way: some are replaced
with the original token and others are replaced with a random token. By doing this, the model is
kept from memorising the disguised tokens and is instead prompted to acquire broader
representations, and the generator set can be used for further computation.
The effect of padding and masking on the computing resources needed for training must also be
taken into account. Padding can substantially raise the model's memory consumption, particularly
if the maximum sequence length is set too high. Longer training times and errors due to
memory issues may result from this. On the other hand, masking necessitates the model to
produce several predictions for each token, which raises the computational cost of training. This
can be lessened by employing strategies like gradient accumulation or lowering the proportion of
tokens that are hidden.
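As a concrete illustration of the cleaning, padding and masking steps described in this list, the following base-R sketch walks through them on a single sentence. It is deliberately simplified: a whitespace tokenizer, a toy stopword list and a plain [MASK] replacement stand in for the WordPiece/SentencePiece tokenizers and the full masking scheme used by the real models; the function names and the example sentence are illustrative only.

# Minimal base-R sketch of text cleaning, padding and masking (illustrative, not the real pipeline)
clean_text <- function(text) {
  text <- tolower(text)                          # lowercasing
  text <- gsub("https?://\\S+", "", text)        # remove URLs
  text <- gsub("@\\w+", "", text)                # remove handles
  text <- gsub("[[:punct:]]", " ", text)         # remove punctuation
  tokens <- unlist(strsplit(text, "\\s+"))
  tokens[tokens != "" & !(tokens %in% c("the", "a", "an", "is", "of"))]  # toy stopword list
}

pad_tokens <- function(tokens, max_len = 16) {
  # Pad with "<pad>" to a fixed length and build the matching attention mask
  ids  <- c(tokens, rep("<pad>", max(0, max_len - length(tokens))))[1:max_len]
  mask <- as.integer(ids != "<pad>")
  list(tokens = ids, attention_mask = mask)
}

mask_tokens <- function(tokens, mask_prob = 0.15) {
  # Randomly replace roughly 15% of the non-padding tokens with the special [MASK] token
  idx <- which(runif(length(tokens)) < mask_prob & tokens != "<pad>")
  tokens[idx] <- "[MASK]"
  tokens
}

example <- clean_text("Loving the new update! https://t.co/xyz @support is so helpful")
padded  <- pad_tokens(example)
masked  <- mask_tokens(padded$tokens)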

To guarantee optimum efficiency while utilising BERT-BASE as a preprocessor for ELECTRA


fine-tuning, it is crucial to take the required padding and masking information into account.

$$p_G(x_t \mid \mathbf{x}) = \frac{\exp\big(e(x_t)^{\top} h_G(\mathbf{x})_t\big)}{\sum_{x'} \exp\big(e(x')^{\top} h_G(\mathbf{x})_t\big)}$$

Replaced token preprocessing

where "e" stands for "token embeddings." With a sigmoid output layer, the discriminator
determines whether the token $x_t$ is "replaced," i.e., whether it originates from the real data as
opposed to the generator distribution, at a particular position t:

$$D(\mathbf{x}, t) = \mathrm{sigmoid}\big(w^{\top} h_D(\mathbf{x})_t\big)$$

4.1.2 Text analysis algorithm - Word2Vec: In order to establish learning from the basics, a
top-down approach is unrealistic when training an NLP model; the core concept is
feedback, i.e. retraining. Word2Vec uses natural language processing (NLP) to convert text into
embeddings, which are vectors. A 300-dimensional vector space is represented by each vector's 300
values. Numerous machine learning techniques, including clustering algorithms and nearest
neighbour classification, may make use of these embeddings. Word2Vec uses the continuous-bag-
of-words (CBOW) and continuous skip-gram model architectures to build embeddings. While
continuous skip-gram utilises the current word to predict the context words around it, CBOW
predicts the current word from a window of context words. Word2Vec, however, can
experience out-of-vocabulary problems since its vocabulary is restricted to three
million recognised words, which are the words the model acquired during training. GloVe and
FastText were launched in 2014 and 2016, respectively, to get around this restriction.
4.1.3 Text analysis algorithm- FastText and GloVe:
FastText interprets each word as a collection of sub-words or character n-grams, whereas GloVe
uses an innovative method of learning word representations using unsupervised regression,
expanding the useful vocabulary of Word2Vec. FastText incorporates support for text categorization
use cases along with the CBOW and skip-gram models. By capturing the connections between all of
the words in the input sequence, the Transformer Architecture, which was unveiled in 2017,
dramatically increases the accuracy of NLP tasks like machine translation. Other research teams
continue to develop different NLP architectures despite the Transformer Architecture's significance,
which is fascinating.

=> We will be dealing with the Transformer architecture.


Concept: The text vectors (called embeddings) are stored as token-to-vector mappings in a
high-dimensional coordinate system, where each dimension expresses some aspect of meaning. These
semantic representations can then be fed into our NLP model for text classification. The vectors in
our matrix are created using CBOW and continuous skip-gram (for predictive context, e.g. "apple" is
likely to be followed by "pie", "orchard" or "juice").
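To make the idea of semantic similarity in the embedding space concrete, the following base-R sketch compares toy three-dimensional vectors with cosine similarity; the words and vector values are invented for illustration and are not real Word2Vec outputs (which would be 300-dimensional).

# Cosine similarity between word embeddings: closer meanings => higher similarity
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))

# Toy 3-dimensional embeddings (illustrative values only)
embeddings <- list(
  apple  = c(0.9, 0.1, 0.3),
  pie    = c(0.8, 0.2, 0.4),
  engine = c(0.1, 0.9, 0.7)
)

cosine_sim(embeddings$apple, embeddings$pie)     # relatively high: related words
cosine_sim(embeddings$apple, embeddings$engine)  # lower: unrelated words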

SOURCE: Efficient Estimation of Word Representations in Vector Space - Mikolov et al., 2013

4.2 Seq2Seq Model: A variable-length input sequence, such as a sentence in one language, is
mapped to a variable-length output sequence, such as the translation of that sentence into another
language. The encoder takes the input sequence and transforms it into a context vector, which is
a fixed-length representation that captures the crucial information in the input sequence. The output
sequence is then produced, one token at a time, by the decoder using this context vector. The main
consequence of this design is degraded performance on long sequences, because all information must
pass through the fixed-length context vector. The resolution is the addition of an attention ('focus')
layer, trading some efficiency for accuracy, as proposed by Dzmitry Bahdanau (Jacobs University
Bremen, Germany) and KyungHyun Cho and Yoshua Bengio (Université de Montréal).
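A minimal sketch of the attention idea in base R: given hypothetical encoder hidden states and a decoder query, alignment scores are turned into weights with a softmax, and the context vector becomes a weighted sum of the encoder states. The dimensions and values are invented for illustration.

softmax <- function(x) exp(x) / sum(exp(x))

# Hypothetical encoder hidden states (4 time steps, 3-dimensional) and a decoder query vector
H <- matrix(runif(12), nrow = 4, ncol = 3)
query <- runif(3)

scores  <- as.vector(H %*% query)       # dot-product alignment scores, one per input position
weights <- softmax(scores)              # attention weights a_ij, summing to 1
context <- as.vector(t(H) %*% weights)  # context vector: weighted sum of encoder states

round(weights, 3)
context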

[Figure: Traditional Seq2Seq model]

3.1.4 Batch Normalisation:


Batch Normalization is a layer which is added between two hidden layers, especially layers with
activation functions. It is a process to make neural networks faster and more stable through adding
extra layers in a deep neural network. The new layer performs the standardizing and normalizing
operations on the input of a layer coming from a previous layer. Batch normalization is a
regularization technique which helps reduce or prevent the over-fitting of the network.
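A short sketch using the R interface to Keras (assuming the keras package is installed) of how batch-normalization layers are inserted between dense hidden layers; the layer sizes, input shape and number of output classes are arbitrary choices for illustration, not the configuration used in this project.

library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu", input_shape = c(300)) %>%
  layer_batch_normalization() %>%   # standardizes the previous layer's activations per batch
  layer_dense(units = 64, activation = "relu") %>%
  layer_batch_normalization() %>%
  layer_dense(units = 6, activation = "softmax")  # e.g. six emotion classes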

[Figure: Model with an active attention layer (a_ij: attention weights)]


3.1.5 Optimisers:
An optimizer is an algorithm used in machine learning to modify a model's parameters in order to
reduce the discrepancy between the model's expected and actual outputs. The goal of the
optimisation procedure is to identify the set of parameters that will give the model the greatest
possible performance on the training set. It is a function which modifies the attributes of a neural
network, such as weights and learning rates. The main goal is to optimize, i.e. minimize the loss and
maximize the accuracy. There are many types of optimizers, namely Gradient Descent, RMSProp,
Adagrad and Adam. Gradient Descent updates parameters using the gradients of the loss function
with respect to the current parameters. Adagrad is a gradient-based optimizer which changes the
learning rate based on historical gradients and on the frequency of occurrence in the updates.
RMSProp is an adaptive optimization algorithm which uses the root mean square of gradients, i.e. a
moving mean of squared gradients, to adjust the learning rate. Adam is also an adaptive
optimization algorithm which uses both the first and second moments of the gradients to change the
learning rate of each parameter. We have used Adam, which is an extended version of the stochastic
gradient optimizer, combines the advantages of RMSProp and Adagrad, and is also known to give
good results, especially for image datasets.
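Continuing the R keras sketch from the batch-normalization example above, the model can be compiled with the Adam optimizer; the learning rate shown is a typical default, not a tuned value from this project (older versions of the R keras package use lr instead of learning_rate).

model %>% compile(
  optimizer = optimizer_adam(learning_rate = 0.001),  # Adam: adaptive learning rate per parameter
  loss = "categorical_crossentropy",
  metrics = c("accuracy")
)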

3.2 Convolutional Neural Networks:

A Convolutional Neural Network (CNN or ConvNet) is a type of Artificial Neural Network (ANN)
that is used to analyze visual imagery. It is a deep learning technique that is specifically designed to
work with image data. The main difference between a CNN and a regular Multilayer Perceptron
(MLP) is the way the network is connected. In an MLP, each neuron in one layer is connected to all
the neurons in the next layer, which means the network is fully connected. In contrast, a CNN uses
convolution and pooling operations to reduce the number of parameters and computational cost
while retaining the ability to learn hierarchical representations of the input data. The convolution
operation involves sliding a filter over the input image, which is then transformed into a feature
map that contains the relevant information from the original image. The pooling operation reduces
the spatial resolution of the feature map and increases its invariance to small translations. These
operations are repeated several times to form a hierarchy of features that are used for the final
classification or regression task. CNNs have been shown to be effective for various computer vision
tasks such as image classification, object detection, and segmentation.

3.2.1 CONVOLUTIONAL LAYER.


The convolution operation is a fundamental building block of Convolutional Neural Networks
(ConvNets) and it is used to extract meaningful features from the input image. This operation
involves sliding a small matrix called a kernel or filter over the input image and computing the dot
product at each location. The result is a feature map that highlights specific properties of the input
image. In ConvNets, multiple Convolutional Layers can be stacked to extract higher-level features,
such as textures, patterns, and shapes, in addition to the low-level features like edges and color. The
first Convolutional Layer typically extracts low-level features while subsequent layers extract more
complex features. This process of feature extraction is critical to the overall performance of
ConvNets in image classification and object recognition tasks.

3.2.2 POOLING LAYER.

The Pooling layer in a Convolutional Neural Network (ConvNet) is an important component that
reduces the spatial dimensions of the convolved features generated by the Convolutional Layer. The
purpose of this layer is to decrease the computational and memory requirements of the network by
reducing the count of parameters that needs to be processed. This reduction is achieved through
dimensionality reduction, where the spatial size of the feature map is reduced through sub-
sampling. Additionally, pooling also helps in capturing dominant features in an image that are
invariant to rotation and positional changes. This helps the model to be robust to these
transformations and generalize better to unseen data. There are two commonly used pooling
methods: Max Pooling and Average Pooling. In Max Pooling, the maximum value from the portion
of the image covered by the pooling kernel is selected, whereas in Average Pooling the average of
all values from the portion of the image is calculated. Both methods have their pros and cons and
the choice between the two depends on the specific use case and the problem being solved.


3.2.3 FULLY-CONNECTED LAYER.

Fully connected layers are extensively employed in deep learning for tasks such as image
classification, natural language processing, and speech recognition. They are strong and adaptable,
but they can also be costly to compute and prone to overfitting, particularly with high-dimensional
data. To prevent overfitting, it is crucial to carefully plan the network's architecture and properly
regularise it.

A neural network with fully-connected layers is one in which each neuron uses a weights matrix to
apply a linear transformation to the input vector. As a consequence, every input of the input vector
influences every output of the output vector, and all layer-to-layer connections are present.

A fully connected network has 3 parts: the input layer, the main hidden fully connected layers, and the output layer.
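As a concrete illustration of sections 3.2.1-3.2.3, the following sketch in the R keras interface stacks convolutional, pooling and fully connected layers into a small network; the input shape, filter counts and class count are arbitrary illustrative choices, not the configuration used in this project.

library(keras)

cnn <- keras_model_sequential() %>%
  # 3.2.1 Convolutional layer: slides 32 filters of size 3x3 over the input to build feature maps
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(64, 64, 1)) %>%
  # 3.2.2 Pooling layer: max pooling halves the spatial resolution of the feature maps
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  # 3.2.3 Fully connected layers: flatten the feature maps and classify
  layer_flatten() %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dense(units = 6, activation = "softmax")

summary(cnn)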

4.3 NLP Transformer:


The earliest neural network models that could analyse sequential data, such as text or speech, and
then produce another series of data were known as seq-2-seq models. These models were based on the
encoder-decoder architecture, where the decoder was used to produce the output sequence after the
input sequence had been encoded into a fixed-size representation. Sutskever et al. initially presented
this architecture in 2014 for machine translation, and it has subsequently been applied to a wide
range of additional applications, including text summarization, speech recognition, and image
captioning.

• GPT-3 has a cross-entropy loss function, which gauges the discrepancy between the projected
output sequence and the actual output sequence, during training. The loss is determined by how
well the model's predictions match the actual next word in the sequence after it has been trained
to predict the next word in a sequence based on the previous words.
• On the other hand, ELECTRA’s variation of the binary cross-entropy loss function calculates the
difference between the predicted class (real or generated) and the actual class for optimal
working. In addition to learning to distinguish between actual and produced tokens, the model is
taught to produce realistic masked tokens.
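To make the two loss functions concrete, the following base-R sketch computes a cross-entropy loss over a small predicted next-word distribution (the GPT-style objective) and a binary cross-entropy loss for a replaced-token decision (the ELECTRA-style discriminator objective); the probabilities and words are invented for illustration.

# Cross-entropy for next-word prediction: -log of the probability assigned to the true next word
predicted_probs <- c(happy = 0.70, sad = 0.20, angry = 0.10)  # toy softmax output
true_word <- "happy"
cross_entropy <- -log(predicted_probs[true_word])

# Binary cross-entropy for replaced-token detection: label 1 = "replaced", 0 = "original"
p_replaced <- 0.85   # discriminator's predicted probability that the token was replaced
label <- 1
binary_cross_entropy <- -(label * log(p_replaced) + (1 - label) * log(1 - p_replaced))

cross_entropy
binary_cross_entropy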



library(tensorflow)

# Load a saved BERT model (path kept as a placeholder)
bert_model <- tensorflow$keras$models$load_model("path/to/bert_model")

# Create input tensors: WordPiece token ids for one sentence, padded to length 11
input_ids <- tensorflow$constant(array(
  c(101, 2023, 2003, 2019, 2742, 1997, 2129, 14324, 1012, 102, 0),
  dim = c(1, 11)))
token_type_ids <- tensorflow$constant(array(0, dim = c(1, 11)))  # single-segment input
# Attention mask: 1 = attend to this position, 0 = padding (last token id 0 is [PAD])
attention_mask <- tensorflow$constant(array(c(rep(1, 10), 0), dim = c(1, 11)))

# Bundle the inputs and run the model to obtain its outputs (e.g. contextual embeddings)
inputs <- list(input_ids = input_ids,
               token_type_ids = token_type_ids,
               attention_mask = attention_mask)
outputs <- bert_model$predict(inputs)

The self-attention mechanism allows the model to selectively attend to relevant parts of the input
sequence, making it more efficient and effective than previous models.

• T5 (Text-to-Text Transfer Transformer): It is simple to apply T5 to generative tasks (such as


summarization and translation) since these tasks are sequence-to-sequence in nature. For
classification tasks like sentiment analysis or textual entailment identification, T5 may also be
easily modified (Raffel et al., 2020). These tasks all have the same kind of output, namely text
tokens or sequences of text tokens. Because the output of a ranking model is not tokens but rather
numerical scores that are used for sorting, the use of T5 in ranking models has not received much
attention.

$$\hat{y}_{ij} = \frac{e^{z_{\text{true}}}}{e^{z_{\text{true}}} + e^{z_{\text{false}}}}$$


The modified T5 model of Nogueira et al. (2020) takes the query and document tokens as input, and
a numerical ranking score is produced.


4.3.1 GPT-3
GPT-3 expands upon earlier developments in transformer construction. With 175 billion parameters,
it is one of the biggest and strongest transformer models created to date. GPT-3 processes sequential
data using a set of transformer blocks, each made up of self-attention layers and feed-forward
layers. Additionally, it makes use of a multi-headed self-attention mechanism that enables it to
recognise relationships at many levels of abstraction. A strong tool for natural language processing
tasks including text production, summarization, and language translation, GPT-3's huge size enables
it to create extremely coherent and diversified content.
The number of training epochs has a massive impact on the size and cost of the model. Since GPT-3
has been limited to academic researchers, is not available to the public for download, and is
reported to cost around 12 million dollars to train from scratch, we have opted to
use a pre-processed dataset to aid our accuracy requirements.
Due to its accessibility, GPT-3 is better suited for examining dataset artefacts than SoTA models,
where such issues have previously been thoroughly researched.
Dataset artefacts are a well-known problem in most NLP tasks, where statistical anomalies allow a
model to perform better than it should be able to without access to the context (Poliak et al., 2018).
A hypothesis-only model that can outperform some models trained with a context may be taught for
tasks like NLI (Poliak et al., 2018). Due to annotation artefacts such lexical choice and sentence
length, a hypothesis-only model can, in more detail, achieve 67% accuracy for SNLI (Gururangan
et al., 2018). This not only is a promising discovery, but also helps in devising a report to accentuate
our hybrid model.

Drawbacks:
These have been noted by several researchers:
• Its outputs embody all the biases that might be found in its training data
• If you want white supremacist manifestos, GPT-3 can be trained to produce them endlessly.
• Its outputs may correspond to assertions that are inconsistent with the truth.


As a result of these flaws, many of the remarkable outputs that have been demonstrated are the
products of cherry-picking: you run the API several times with the same prompt, then select the best
outcome, ignoring the results that sound less convincing or are just plain bad.
This doesn't void the practical usefulness of the model; the only drawback is the generated fiction.
We're already seeing a great number of applications where a person is kept in the loop (HITL)
because they are considerably safer.
tools that take a user's text input and present an alternate version of that input that may be longer or
shorter depending on the application.

[Figure: Testing GPT-3 with emphasis on emotion analysis]


4.3.2 BERT-BASE

BERT is a transformers model that was self-supervisedly pretrained on a sizable corpus of English
data. This indicates that an automatic method was used to produce inputs and labels from those
texts after it had been pretrained on solely the raw texts without any human labelling (which
explains why it may use a large amount of publically available data). It was pretrained with two
goals in mind, specifically:

Masked language modelling (MLM) involves randomly masking 15% of the words in an input
sentence before running the complete sentence through the model to predict the masked
words. This contrasts with conventional recurrent neural networks (RNNs), which typically see the
words sequentially, and autoregressive models like GPT, which internally conceal the next tokens.
This makes it possible for the model to learn a bidirectional representation of the sentence.
Additionally, it comprises the Next Sentence Prediction (NSP) objective: during pretraining, the
model concatenates two masked sentences as inputs. They occasionally match sentences that were
adjacent to one another in the original text, and sometimes they don't. The model must then
determine whether or not the two sentences followed one another.

The raw model can be applied to next sentence prediction or masked language modelling, but its
main purpose is to be fine-tuned on a downstream task. It should be noted that this model is
primarily intended to be optimised for tasks like sequence classification, token classification, or
question answering that need the use of the entire sentence (perhaps masked) to make judgements.
One should consider a model like GPT-2 for tasks like text generation.

For the intents and purposes of our hybrid model, BERT may be used as a preprocessor for
ELECTRA, which is a fine-tuning method used to train a transformer-based model particularly for
the purpose of emotion analysis. The ELECTRA model's
performance can be enhanced by using the BERT-BASE model as a preprocessor to create training
data.

The base version of BERT, known as BERT-BASE, includes 12 transformer layers and 110 million
parameters. Its masking function allows for the creation of generator and discriminator model
inputs, which are employed in the ELECTRA model's training.

Evaluation Results
[Table: GLUE results when fine-tuned on downstream tasks]

The first step in using BERT-BASE as an ELECTRA preprocessor is to tokenize the input text.
Tokenization is the process of breaking up the text into single words or subwords so that the BERT
model may be used. WordPiece tokenization, a method used by BERT, divides words into smaller
subwords based on their frequency in a corpus of text. As a result, BERT can handle terms that
aren't in its lexicon and enhance its performance with uncommon words.

The next step is to feed the tokenized text into the BERT model to produce contextualised word
embeddings. These embeddings may be utilised as input to the ELECTRA model since they capture
the meaning and context of each word in the sentence.
BERT may be used to create artificial training data in addition to creating embeddings. To do this,
tokens in the input sentence are randomly selected and then predicted using the context of the words
around them. This produces more training data and can help the ELECTRA model function better.

Finally, the ELECTRA model can be improved using the BERT-BASE model. A pre-trained model is
subjected to fine-tuning by being trained on a particular task using a smaller dataset. The pre-trained
BERT-BASE model is utilised as a starting point in the emotion analysis case, and the ELECTRA
model is refined using a collection of labelled emotion data.


The model's performance on the specific job of emotion analysis is enhanced through the process of
fine-tuning, which modifies the model's parameters to better match that task.

In conclusion, by tokenizing input text, producing contextualised word embeddings, producing
synthetic training data, and optimising the ELECTRA model on a dataset of labelled emotion data,
BERT-BASE can be utilised as a preprocessor for ELECTRA in an emotion analysis model. This
strategy has been demonstrated to be very successful for emotion analysis and is transferable to
other natural language processing applications.

The specifics of how each sentence was masked are as follows:

• 15 percent of the tokens are masked.
• The masked tokens are replaced by [MASK] in 80% of the instances.
• 10% of the time, a random token (different from the one it replaces) is used in place of the
masked token.
• The masked tokens are left intact for the remaining 10% of occurrences.
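A base-R sketch of this 80/10/10 rule applied to a toy token vector; the vocabulary, tokens and function name are invented for illustration.

apply_mlm_masking <- function(tokens, vocab, mask_prob = 0.15) {
  out <- tokens
  selected <- which(runif(length(tokens)) < mask_prob)     # roughly 15% of positions
  for (i in selected) {
    r <- runif(1)
    if (r < 0.8) {
      out[i] <- "[MASK]"                                   # 80%: replace with [MASK]
    } else if (r < 0.9) {
      out[i] <- sample(setdiff(vocab, tokens[i]), 1)       # 10%: replace with a random other token
    }                                                      # remaining 10%: leave the token intact
  }
  # Labels: the original token at selected positions, NA elsewhere (only selected tokens are predicted)
  list(masked = out, labels = ifelse(seq_along(tokens) %in% selected, tokens, NA))
}

vocab  <- c("i", "feel", "so", "happy", "sad", "today", "because", "it", "rains")
tokens <- c("i", "feel", "so", "happy", "today")
apply_mlm_masking(tokens, vocab)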


4.3.3 ELECTRA

The ELECTRA model was introduced in the publication "ELECTRA: Pre-training Text Encoders as
Discriminators Rather Than Generators". This innovative pretraining method trains two transformer
models, a generator and a discriminator. The generator is trained as a masked language
model because its function is to substitute tokens in a sequence. The model we're interested in, the
discriminator, seeks to determine which tokens in the sequence were changed by the generator.
ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately)
shares characteristics with existing transformer-based models like BERT and GPT-2, but it also
employs a brand-new pre-training technique that boosts the efficacy and efficiency of the training
procedure. The transformer encoder, which is a stack of multi-head self-attention layers followed by
position-wise feedforward layers, is what makes up the ELECTRA model. The input sequence is
processed in parallel by each transformer layer, which employs self-attention to identify
relationships among the sequence's various components. A series of hidden vectors produced by the
transformer encoder may be applied to a variety of downstream applications, including sentiment
analysis, named entity identification, and language modelling.
Generator pre-training and discriminator pre-training are the two steps of the pre-training procedure
for ELECTRA. The model is taught to anticipate randomly masked tokens in the input sequence
during the generator pre-training phase. The model is trained to differentiate between the original
input sequence and a sequence in which a portion of the tokens have been substituted with random
tokens generated by the generator network during the discriminator pre-training phase. The
generator network is taught to reduce the likelihood of being identified as fake, while the
discriminator network is trained to increase the likelihood of successfully differentiating between
actual and false sequences. Using supervised learning, where the model is trained on labelled data
for the target task, the model may be adjusted for certain downstream tasks. ELECTRA uses self-
attention mechanisms to capture long-range dependencies between various sections of the input
sequence. This is the major advantage this model has over the competition.


The BERT-BASE model may be used as a preprocessor to create contextualised embeddings for the
input text once the generator and discriminator models have been trained. The mechanism for
creating the embeddings is the same as that used in BERT: the input text is fed into the transformer
network, which creates embeddings for each token.

The discriminator model, which is optimised for the particular job of emotion analysis, is then fed
the created embeddings as input. The parameters of the discriminator model are tweaked throughout
this phase of fine-tuning in order to better suit the specific goal of emotion analysis and enhance its
performance on that task.

Because the generator model in ELECTRA is trained to predict only the masked tokens, as opposed
to all the tokens in the input text, it can provide a language model that is more effective. Faster
training and inference times are made possible by this, which may result in higher performance on
subsequent tasks like emotion analysis.

The following phases make up the ELECTRA training process:

• Pre-training the generator model on a big corpus of unlabeled text using a masked language model
(MLM) objective. To achieve the MLM objective, certain tokens in the input text are randomly
masked, and the generator model is trained to predict the original masked tokens.
• Training the discriminator model on a huge corpus of unlabeled text, using the generator model to
produce fake instances. The discriminator model is trained to differentiate between genuine and
fake instances, where the fake examples are produced by using the generator model to swap out
part of the tokens in the real examples with other tokens.
• Fine-tuning the discriminator model for a particular downstream task, such as emotion analysis,
using a smaller labelled dataset. During fine-tuning, the discriminator model is trained to predict
the input text's emotion label based on the contextualised embeddings produced by the BERT model.
• Assessing the performance of the fine-tuned discriminator model on a separate test set, and
possibly modifying the model's hyperparameters or pre-training procedure to enhance performance.
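As a small illustration of the replaced-token-detection objective in the second step above, the following base-R sketch builds the binary labels the discriminator is trained on, given an original token sequence and a generator-corrupted copy; the sentences are invented for illustration.

original  <- c("the", "movie", "made", "me", "cry")
corrupted <- c("the", "movie", "made", "me", "laugh")  # the generator replaced the last token

# Discriminator target: 1 where the token was replaced by the generator, 0 where it is original
rtd_labels <- as.integer(original != corrupted)
data.frame(token = corrupted, replaced = rtd_labels)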


For emotion analysis, the ELECTRA algorithm provides the following benefits over BERT-BASE:

Efficiency: The ELECTRA method uses a masked language model goal to train only the generator
model, as opposed to training the complete model on the MLM objective, which makes it more
effective than BERT-BASE. Faster training and inference times are the outcome, which is crucial
for huge datasets or for fine-tuning on a given job.
Robustness: ELECTRA's generator model is trained to forecast the original masked tokens rather
than all of the input text's tokens. As a result, the generator model is more resistant to noise and
textual fluctuation, which might improve its performance on subsequent tasks like emotion analysis.
Improved Performance: When compared to BERT-BASE, the ELECTRA algorithm has been found
to perform as well as or better on a variety of NLP tasks, including emotion analysis. This is most
likely due to the enhanced efficiency and robustness of ELECTRA's generator model.
Conclusion

In conclusion, the ELECTRA algorithm is a pre-training method for NLP tasks that entails training
both a discriminator model and a generator model to produce false instances. The ELECTRA
algorithm is more effective, more reliable, and will result in improved performance on subsequent
tasks, such as emotion recognition.


4.4 Callbacks

A callback is a set of functions to be applied at given stages of the training procedure. You can use
callbacks to get a view on internal states and statistics of the model during training. It can be
applied either at the start or the end of an epoch, before or after the execution of a single batch or in
any other possible places during execution of the training part of the model.
The functions used in this model are CSVLogger, ReduceLROnPlateau and even ModelCheckpoint
can be implemented.
The CSVLogger saves the values of accuracies and losses of both training set and validation set in a
CSV file which can be downloaded from the outputs section and can be used further for analysis of
the model and visualization of graphs. The parameters this function takes are the name of the file in
which the logs have to be saved, the separator to be used inside the file to separate different
entries of data, and an append parameter: if set to True and the file already exists, the file
will be appended to instead of rewritten.
The ReduceLROnPlateau is a neat function in which the learning rate is controlled in every epoch
and the conditions to toggle it can also be adjusted between the various parameters which are
released during the model training. Models often benefit from reducing the learning rate by a factor
of 2-10 once learning stagnates. This callback monitors a quantity and if no improvement is seen for
a 'patience' number of epochs, the learning rate is reduced.
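A sketch of these callbacks with the R keras interface; the file name, patience and reduction factor are illustrative choices, not tuned settings from this project.

library(keras)

callbacks <- list(
  # Log per-epoch accuracy and loss (training and validation) to a CSV file
  callback_csv_logger("training_log.csv", separator = ",", append = FALSE),
  # Halve the learning rate if the validation loss has not improved for 3 epochs
  callback_reduce_lr_on_plateau(monitor = "val_loss", factor = 0.5,
                                patience = 3, min_lr = 1e-6)
)

# Passed to fit(), e.g.:
# model %>% fit(x_train, y_train, epochs = 30, validation_split = 0.2, callbacks = callbacks)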


4.5 Performance Metrics

Performance metrics are the measures used to check and evaluate a model for its efficiency and
effectiveness in predictive machine learning systems. They are generally used to compare the current
model with existing models, and with multiple models in the same system, to check whether it is better
than existing state-of-the-art algorithms and whether it can be used feasibly in typical operating
conditions.
All of these metrics have the following definitions in common:

• True Positives (TP): values which were predicted correctly by the model, where the actual values
were positive.
• False Negatives (FN): values which were predicted as negative while the actual values were
positive.
• True Negatives (TN): values which were predicted correctly and are negative.
• False Positives (FP): values which were predicted as positive while the actual value is negative,
so the prediction is wrong.
Here are the metrics which are going to be used in measuring the performance of the proposed model:


4.5.1 ACCURACY
It is the easiest way of implementing a metric into the model, and it is readily available and
visible to the user while running the optimizer itself. Accuracy is one of the default
arguments which show up during the running of the epochs, and we can see the progression of
training accuracy and validation accuracy and can even plot it in a graph. It is commonly used
as a proper metric when the target variable classifications in the data are fairly
balanced, and it is not suggested when there is a clear imbalance in the dataset classification,
especially when one category of classification highly outweighs the others in terms of sheer
number. The formula for accuracy is as follows:
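$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$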

where TP, TN, FP, and FN are listed above.

4.5.2 Confusion Matrix

It is generally a square matrix which gives a good view of what the model has
predicted versus what the right answer is. It is widely used for visual representation of the
performance of models, and it makes it easier to see where the model has its weaknesses. It is
generally an MxM matrix, where M is the number of output classes that the model predicts from; if
there are only 2 classes, it becomes a 2x2 matrix, and usually one class is named Positive and the
other Negative.
It is generally in the format shown below:

                   Predicted Positive       Predicted Negative
Actual Positive    True Positive (TP)       False Negative (FN)
Actual Negative    False Positive (FP)      True Negative (TN)

Confusion Matrix in Binary Classification


Here the actual classes are in the rows and the predicted classes are in the columns. When
there are M classes, it is illustrated as follows:

[Figure: Confusion matrix in multi-class classification]

In this tabular representation, the shaded regions represent the view of Class 1. The blue shaded cell
is the True Positive, The yellows are False Negatives and their sum is taken for calculation, the reds
are False Positives and the greens are True Negatives.



4.5.3 PRECISION
Precision, a performance metric in machine learning, measures the fraction of true positives (correct
predictions) among all the occurrences the model classified as positive. In other words,
precision reveals the proportion of the model's positive predictions that are right.
This metric is very useful as it covers a weakness of accuracy, namely that accuracy cannot reveal the
exact distribution of false positives, false negatives and true negatives. Precision is used to
measure the false positives, and a high precision score indicates that the system does well in this
scenario because it has a low number of false positives. The formula for it is as shown below:
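$$\text{Precision} = \frac{\text{True Positives}}{\text{Total Predicted Positives}}$$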

Or in terms of positives and negatives:
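$$\text{Precision} = \frac{TP}{TP + FP}$$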


4.5.4 RECALL
It is also known as sensitivity and is very similar to precision, except that it checks for
false negatives, i.e. it is calculated from the proportion of actual positives that were classified
as negative. Recall, in other words, reveals the proportion of positive events that the model
properly detected and predicted.

Recall is especially helpful when reducing the number of false negatives. When the model predicts
a negative label for an incident that is actually positive, this is known as a false negative. For
instance, a false negative in a medical diagnosis problem happens when a patient with an illness is
not recognised as having the ailment. A high recall score says that the model is predicting true
positives correctly and minimizing predictions of false negatives. The formula for it is as shown
below:
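$$\text{Recall} = \frac{TP}{TP + FN}$$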

4.5.5 F1 SCORE

The F1 score is the metric used when both precision and recall are important to calculate and
necessary for the analysis of the model, i.e. if you want to know about the state of both false
negatives and false positives, then you turn to the F1 score. It is calculated by taking the harmonic
mean of precision and recall, and hence it can be said to represent both precision and recall. It is
often used in place of the other two, as most classification problems depend on both false positives
and false negatives. The formula for it is as shown below:
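$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

To tie the four metrics together, the following base-R sketch computes accuracy, precision, recall and the F1 score from illustrative TP, TN, FP and FN counts; the counts are invented and are not results from this project.

# Toy confusion-matrix counts (invented for illustration)
TP <- 80; TN <- 90; FP <- 10; FN <- 20

accuracy  <- (TP + TN) / (TP + TN + FP + FN)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1), 3)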


4.6 AWS:
The cloud computing platform AWS (Amazon Web Services) offers a number of services for
hosting, controlling, and deploying infrastructure and applications in the cloud. Instances for GPU,
TPU, and CPU are just a few of the services that AWS provides for hosting AI/ML workloads.
These services are essential for delivering the compute resources needed to develop and deploy AI/
ML models.

The flexibility and scalability that AWS provides is one of the key benefits of adopting it to host AI/
ML services. For various AI/ML workloads, AWS offers a variety of instance types and settings that
may be tailored to fit their unique needs. This contains instances with specialised hardware for jobs
like image and video processing, voice recognition, and NLP in addition to instances with high-
performance CPUs, GPUs, and TPUs.
Amazon SageMaker, a fully managed service for developing, tuning, and deploying machine
learning models at scale, is one of the services that AWS offers for maintaining and deploying AI/
ML models. For our purposes we have used AWS SageMaker's services, in particular BlazingText, a
SageMaker algorithm in which the Word2vec technique and the text categorization algorithm are both
significantly optimised. The Word2vec technique converts words into word
embeddings, which are high-quality distributed vectors that capture the semantic links between
words.
The enhanced word vectors produced by BlazingText may be used to incorporate the characteristics
of BlazingText into a BERT algorithm. This plays highly in favour of deploying an Amazon S3 backed
data lake. The architecture is simple: we channel a data stream using AWS
Glue, and preprocess the data using our algorithm to generate text categorisation vectors to be weighed




against the Word2vec technique. We compare the BLEU scores and put in an interpreter to judge the
better result (human in the loop).
We may give the BERT algorithm pre-trained embeddings: a text corpus that has been preprocessed
using methods like tokenization, subword tokenization, and sentence splitting can be used as the
input, and the BlazingText technique may then be used to turn the preprocessed
text into word embeddings. These embeddings may then be used by the BERT algorithm to carry
out operations like named entity recognition and sentiment analysis.

The anticipated labels or responses for the input text can be the BERT algorithm's output. A number
of measures, including accuracy, precision, recall, and F1-score, may be used to assess these
predictions. The BERT algorithm can perform well and attain high accuracy on big datasets by
utilising BlazingText's characteristics.
Utilising the platform's global infrastructure and network of data centres is one of the main
advantages of utilising AWS to host AI/ML services. This can provide additional redundancy and
resilience in the event of system failures or outages, as well as help to reduce latency and increase
performance for AI/ML applications.

Drawbacks particular to our project: to utilise the platform properly, one has to have a certain degree
of competence. Both AWS and GCP require some technical know-how to set up and administer,
although the precise skills needed may change based on the services being used. For instance, the
GCP Cloud Natural Language API offers pre-built models that can be used with little to no scripting
for NLP activities such as sentiment analysis and entity identification. Similar pre-built models are
available through AWS's Comprehend service; however, integrating it with a custom NLP model can
require additional configuration and modification. This is a major limiting factor for us due to a lack
of resources.


4.4.1 AWS Sagemaker


The cloud-based AWS SageMaker platform allows users to create, train, and deploy machine
learning models, including NLP models. It offers tools for building custom models as well as
pre-built NLP algorithms. Users can scale to huge datasets with ease thanks to the highly optimised
Word2vec and text classification implementations provided by SageMaker's BlazingText algorithm.
BlazingText additionally provides a batch_skipgram mode for Word2vec, which enables faster
training and distributed computing, and it enriches word vectors with subword information. Users
can boost the effectiveness and quality of their NLP applications by incorporating BlazingText into a
custom model.
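
As an illustration, the sketch below shows how such a BlazingText training job could be launched with the SageMaker Python SDK. It is a minimal, hedged example: the S3 paths, instance type, and hyperparameter values are placeholder assumptions, and it presumes a preprocessed text corpus has already been uploaded to S3.

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
region = session.boto_region_name
role = sagemaker.get_execution_role()          # assumes execution inside a SageMaker notebook

# Resolve the BlazingText container image for the current region.
container = image_uris.retrieve("blazingtext", region)

# Hypothetical S3 locations for the preprocessed corpus and the model artefacts.
s3_train = "s3://my-emotion-bucket/blazingtext/train"
s3_output = "s3://my-emotion-bucket/blazingtext/output"

bt = Estimator(image_uri=container,
               role=role,
               instance_count=1,
               instance_type="ml.c5.2xlarge",
               output_path=s3_output,
               sagemaker_session=session)

# "skipgram" learns Word2vec-style embeddings; "supervised" would run text classification,
# and "batch_skipgram" enables distributed CPU training. subwords=True enriches the vectors.
bt.set_hyperparameters(mode="skipgram", vector_dim=100, epochs=5, min_count=5, subwords=True)

bt.fit({"train": s3_train})

The resulting vectors can then be downloaded from the output path and incorporated into the downstream BERT pipeline as described above.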

For our custom models, we can leverage SageMaker's managed training infrastructure, which
provides scalable compute instances and distributed training capabilities. Once the model is trained,
users can deploy it using SageMaker's managed hosting infrastructure, which provides automatic
scaling, high availability, and security features.
Since it is not apt to deploy a hybrid non-BERT model onto an AWS-hosted platform with ETLs, we
will be deploying an AutoML-devised trained model to seek compatibility and expand our grasp of
the services of AWS.

A Human-in-the-Loop (HITL) pipeline incorporates human input to increase the precision and
applicability of model predictions. This pipeline is particularly helpful for Natural Language
Processing (NLP) models, where human expertise can offer beneficial insights that automated
systems might overlook. Several services from Amazon Web Services (AWS) may be utilised to set
up an HITL pipeline for NLP models. Here is a general explanation of how it may be done:

Ingesting the data you wish to analyse is the first step. Amazon S3 for data storage, Amazon Kinesis
for data streaming, and Amazon Connect for call-centre data are just a few of the services AWS
offers for this.
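
One possible way to wire the human-review step into such a pipeline is Amazon Augmented AI (A2I), which can route low-confidence predictions to a human work team. The sketch below is illustrative only: the region, flow-definition ARN, loop name, confidence threshold, and payload fields are placeholder assumptions rather than the exact configuration used in this project.

import json
import boto3

# Amazon Augmented AI (A2I) runtime client; the region is illustrative.
a2i = boto3.client("sagemaker-a2i-runtime", region_name="us-east-1")

# Placeholder ARN of a flow definition (worker template + work team) created beforehand.
FLOW_DEFINITION_ARN = "arn:aws:sagemaker:us-east-1:123456789012:flow-definition/emotion-review"

def send_for_review(text, predicted_label, confidence):
    # Start a human loop so that a reviewer can confirm or correct the prediction.
    response = a2i.start_human_loop(
        HumanLoopName=f"emotion-review-{abs(hash(text)) % 10**8}",
        FlowDefinitionArn=FLOW_DEFINITION_ARN,
        HumanLoopInput={"InputContent": json.dumps({
            "text": text,
            "predicted_label": predicted_label,
            "confidence": confidence,
        })},
    )
    return response["HumanLoopArn"]

confidence = 0.55
if confidence < 0.70:   # only escalate uncertain predictions to the reviewers
    send_for_review("I can't believe this happened today...", "sadness", confidence)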

4.4.2 Human in the Loop deployment:


Deployment options:

Real-Time Inference
  Use case: near-zero latency, real-time predictions.
  Price: pay as you go (pay for the service while it is in use).

Batch Inference
  Use case: predicting requests and responses in batches when that suits the use case.
  Price: pay as you go (pay per batch job).

Edge
  Use case: deploying APIs on edge devices.
  Price: varies according to the scalable architecture.
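
For the real-time option, a deployed SageMaker endpoint can be queried through the SageMaker runtime API. The snippet below is a hedged illustration: the endpoint name, region, and payload format are assumptions that depend on how the model was actually packaged and deployed.

import json
import boto3

# Client for invoking a deployed real-time SageMaker endpoint.
runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

ENDPOINT_NAME = "emotion-analysis-endpoint"    # placeholder endpoint name

payload = {"instances": ["I am so happy with the results of this project!"]}

response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
                                   ContentType="application/json",
                                   Body=json.dumps(payload))

prediction = json.loads(response["Body"].read().decode("utf-8"))
print(prediction)

Batch inference would instead run a batch transform job over files stored in S3, and the edge option would compile and ship the model to the device rather than calling a hosted endpoint.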

The presence of a human-in-the-loop pipeline in a hybrid model that uses a BERT-BASE
preprocessor with an ELECTRA fine-tuning model hosted on AWS is essential for assuring the
output's correctness and quality. Despite being extremely sophisticated and effective, these models
may still need human input to work at their best.

The process of incorporating human input and interaction into the machine learning pipeline is
known as the "human-in-the-loop pipeline." This may be done at several phases of the procedure,
such as model training, model assessment, and data annotation. When a hybrid model with a BERT-
BASE preprocessor and an ELECTRA fine-tuning model is involved, the HITL process may be used
to enhance the accuracy and quality of the output in a number of ways.

Data annotation is one application where a person in the loop might be useful. Although the BERT-
BASE preprocessor is very effective in tokenizing and processing text data, some amount of human
annotation may still be necessary to recognise and name particular characteristics or entities. A
person could be required, for instance, to label the text data with the proper sentiment (positive,
negative, neutral, etc.), if the model is being used for sentiment analysis. Similar to this, a person
may be required to recognise and name particular entities in the text data if the model is being used
for entity recognition.

In model training, a HITL step can be deployed. Even though the ELECTRA fine-tuning model is
very good at improving language models that have already been pre-trained, such as BERT-BASE, it
may still need human input to spot and rectify biases or inaccuracies in its output. This can be
accomplished through a human assessment procedure, in which a reviewer assesses the model's
output and offers comments or corrections as necessary.



A person in the loop can be utilised for model evaluation in addition to data annotation and model
training. Although automated evaluation metrics can be a valuable tool for measuring model
performance, they might not always be able to account for all flaws or biases that can be present in
the model's output.
A more thorough and accurate evaluation of the model's performance may be gained by including
human evaluation into the model evaluation process.

In a hybrid model consisting of a BERT-BASE preprocessor and an ELECTRA fine-tuning model
hosted on AWS, the job of the person in the loop is to give human input and involvement at various
stages of the machine learning pipeline. This can serve to increase the precision and calibre of the
model output and can be done through data annotation, model training, and model assessment. Even
though these models are quite sophisticated and effective, human input is still required for the best
results.


CHAPTER - 4 IMPLEMENTATION

We evaluated the model's robustness against erroneous correlations and statistical anomalies using
adversarially produced datasets in order to examine the effects of dataset artefacts. We then tested
whether these challenge sets could reduce the dataset artefacts seen in regular datasets by training the
models on the adversarial sets.
Since most of the downstream tasks lack a baseline performance, we looked at a variety of datasets
to assess each task's practicality and feasibility for BERT-Base.
After gaining some insight into the baseline performance of BERT-Base, we conducted a number of
tests to choose the best dataset for examining the effects of dataset artefacts on the model. We settled
on a public dataset, EmoReact, which had been prepared by Google researchers. The text took the
form of tweets made publicly available by Twitter, with associated IDs and categories. The emotions
we have tried to identify are weighted as follows:

List of emotions

There are various processes involved in using BERT-Base for emotion analysis, such as loading
tokenizers and load balancers. The order in which these processes are completed has an impact on
the model's performance and accuracy. We shall go into technical detail about these processes in
this section.


Load balancers are used to distribute incoming requests among several servers in order to increase
availability, improve performance, and make better use of available resources. To speed up the
training or inference process in the context of BERT-Base, load balancers can be utilised to share
the workload over numerous GPUs or CPUs. The load balancer module must first be loaded and
initialised with a list of the available devices.
The model is built in a distributed environment using the NCCL backend, contains 10 input features
and 1 output feature, is wrapped in DistributedDataParallel, and is trained over 10 epochs using
stochastic gradient descent, as sketched below.
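
The sketch below shows what such a setup could look like in PyTorch; the toy linear model, the random placeholder data, and the learning rate are illustrative stand-ins for the actual training code, and the script is assumed to be launched with one process per GPU (for example via torchrun) so that MASTER_ADDR and MASTER_PORT are set.

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    # Initialise the process group with the NCCL backend (one process per GPU).
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Model with 10 input features and 1 output feature, wrapped in DistributedDataParallel.
    model = nn.Linear(10, 1).cuda(rank)
    model = DDP(model, device_ids=[rank])

    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # Placeholder data; in practice each rank loads its own shard of the dataset.
    inputs = torch.randn(64, 10).cuda(rank)
    targets = torch.randn(64, 1).cuda(rank)

    for epoch in range(10):                       # trained over 10 epochs with SGD
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()                           # DDP synchronises gradients across ranks here
        optimizer.step()

    dist.destroy_process_group()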

The process of breaking the input text up into separate tokens or words is known as tokenization.
The WordPiece tokenizer used by BERT-Base divides words into subwords based on how
frequently they occur in a huge corpus of text. For instance, "unbelievable" could be broken down
into "un", "believ", and "able". As a result, the model can better handle terms that are not in its
vocabulary (OOV).
Using the BertWordPieceTokenizer class and the pre-trained vocabulary file bert-base-uncased-
vocab.txt, a piece tokenizer for the BERT-Base model is created. The input text "This is a sample
input text." is then tokenized using the encode method, returning an Encoding object with the
encoded tokens. The tokens attribute of the Encoding object is then used to print the list of tokens.

After tokenization, we must make sure that all of the input sequences have the same length before
feeding them into the BERT-Base model. This is done by padding and truncating them. We can
achieve this by either truncating the longer sequences to their maximum length or padding the
shorter sequences with zeros. To indicate the start and end of the input sequence, [CLS] and [SEP],
respectively, we must add special tokens in both situations.
The truncate and pad methods of the Encoding object can be used to accomplish padding and
truncation using the tokenizers library.
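
A small sketch of these steps with the tokenizers library is shown below; the local vocabulary file path and the fixed sequence length of 128 are assumptions made for illustration.

from tokenizers import BertWordPieceTokenizer

# Assumes the pre-trained vocabulary file has been downloaded locally.
tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)

# encode() inserts the special [CLS] and [SEP] tokens automatically.
encoding = tokenizer.encode("This is a sample input text.")
print(encoding.tokens)        # ['[CLS]', 'this', 'is', 'a', 'sample', 'input', 'text', '.', '[SEP]']

# Force every sequence to the same fixed length of 128 tokens.
encoding.truncate(128)
encoding.pad(128, pad_token="[PAD]")
print(len(encoding.ids))      # 128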

The pre-trained BERT-Base model, which comes in a number of variations, including bert-base-
uncased and bert-base-cased, must then be loaded. The transformers library, which offers a simple
user interface for working with BERT-Base and other transformer models, can be used to load the
model.
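
For illustration, the model can be loaded roughly as follows; the bert-base-uncased checkpoint and the maximum sequence length of 128 are common defaults rather than project-specific settings.

import torch
from transformers import BertTokenizer, BertModel

# Load the pre-trained uncased BERT-Base checkpoint and its matching tokenizer.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("This is a sample input text.",
                   padding="max_length", truncation=True,
                   max_length=128, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional hidden vector per token: (batch, sequence_length, hidden_size).
print(outputs.last_hidden_state.shape)     # torch.Size([1, 128, 768])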


Summary of CNN model

The model.fit function has a parameter known as class_weight. This is used when there is an
imbalance in the dataset and the model should be made to account for it, or in the rarer scenario
where more importance needs to be given to one or a few classes than to the others. The parameter
accepts a dictionary in which the keys are the classes and the values are the weights assigned to each
class. The weights can be assigned by the developer, or they can be computed from the class
frequencies (for example, from the class counts exposed by a data generator such as
ImageDataGenerator) and collected into a dictionary manually. This approach was tried before the
implementation of SMOTE-based oversampling (see Appendix 1).
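
A minimal sketch of this idea is shown below, using scikit-learn's compute_class_weight as one way of deriving the weights from the class frequencies; the variable names (labels, model, X_train, and so on) follow the Appendix code, and the training settings mirror Appendix 2.

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# `labels` is the integer class vector built in Appendix 1 (seven emotion classes).
class_ids = np.unique(labels)
weights = compute_class_weight(class_weight="balanced", classes=class_ids, y=labels)
class_weight = dict(zip(class_ids, weights))

# The loss now weighs under-represented classes more heavily.
hist = model.fit(X_train, y_train,
                 batch_size=7, epochs=25,
                 validation_data=(X_test, y_test),
                 class_weight=class_weight)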


Performance of BERT/ELECTRA prompt prediction and fine-tuning on different-sized models

During optimisation and fitting, a few monitoring callbacks can be implemented to improve the way
the model performs, collect data about the model, perform early stopping if the model deteriorates,
or save the best iteration of the model with respect to various parameters. The callbacks used in this
model are CSVLogger and ReduceLROnPlateau, and ModelCheckpoint can also be implemented.
CSVLogger saves the accuracy and loss values of both the training set and the validation set in a
CSV file, which can be downloaded from the outputs section and used for further analysis of the
model and visualisation of graphs. ReduceLROnPlateau is a useful callback in which the learning
rate is controlled every epoch, and the conditions that trigger it can be adjusted through the
parameters exposed during model training. Here, validation loss is monitored: if the validation loss
plateaus, stops decreasing, or shows signs of increasing for a certain number of consecutive epochs,
the learning rate (LR) is changed by a factor we provide. Here the assigned factor is 0.4. The number
of epochs is termed the patience of the callback, and it is set to 4, i.e. if the model's validation loss
plateaus for four consecutive iterations, the callback multiplies the learning rate by the factor, which
can address the cause of the plateau. The initial learning rate was set to 0.001 and the minimum
learning rate ReduceLROnPlateau can go to is 0.00001.

The validation accuracy stabilised at around 20 epochs, and the validation loss likewise stabilised in
the same region. The discriminator and generator datasets are used as inputs to the ELECTRA
model, which is then fine-tuned on top of a BERT-Base preprocessing model.

To enhance the diversity and robustness of the supplied data, the preprocessor additionally employs
any necessary data augmentation techniques. The discriminator dataset and the generator dataset are
created from the preprocessed text data. The generator dataset comprises text sequences that were
intentionally produced using methods such as text augmentation or paraphrasing, whereas the
discriminator dataset contains the original text sequences. The discriminator component of the
ELECTRA model is trained on the discriminator dataset to work with generator (MASKED) keys
from BERT.

The discriminator component of the ELECTRA model, which separates genuine text sequences
from imitations, is trained using the discriminator dataset. The ELECTRA model's generator
component, which creates the synthetic text sequences, is trained using the generator dataset.

The ELECTRA model's discriminator and generator components are trained simultaneously during
the training phase using a technique called adversarial training. In this method, the generator
attempts to deceive the discriminator by producing text sequences that are indistinguishable from the
genuine ones, while the discriminator attempts to discern between the real text sequences and the
fake ones made by the generator.
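
The behaviour of the trained discriminator can be illustrated with the publicly released ELECTRA checkpoint; the small model size and the example sentence below are illustrative choices rather than the configuration used in this project.

import torch
from transformers import ElectraTokenizerFast, ElectraForPreTraining

tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-discriminator")
discriminator = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")

# A sentence in which one token has been replaced, imitating a generator substitution.
fake_sentence = "The quick brown fox fake over the lazy dog"

inputs = tokenizer(fake_sentence, return_tensors="pt")
with torch.no_grad():
    logits = discriminator(**inputs).logits

# Positive logits mean the discriminator believes the token was replaced.
flags = (logits > 0).long().squeeze().tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, flag in zip(tokens, flags):
    print(token, "replaced" if flag else "original")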
The preprocessed text data is fed into the ELECTRA model during the fine-tuning process, which
creates a set of hidden representations for each input sequence. These hidden representations capture
the semantic and syntactic characteristics of the input text data and are used as input to a classifier
that forecasts the emotion label of the input text sequence.

A loss function that calculates the difference between the predicted emotion label and the actual
emotion label of the input text sequence is minimised during the fine-tuning phase.
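
A hedged sketch of one such fine-tuning step is given below, using the Hugging Face ElectraForSequenceClassification head on top of a pre-trained ELECTRA encoder; the checkpoint name, label count, learning rate, and example tweets are placeholder assumptions.

import torch
from transformers import ElectraTokenizerFast, ElectraForSequenceClassification

NUM_EMOTIONS = 7    # illustrative number of emotion labels

tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-discriminator")
model = ElectraForSequenceClassification.from_pretrained(
    "google/electra-small-discriminator", num_labels=NUM_EMOTIONS)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One mini-batch of labelled tweets (placeholder data).
texts = ["I am thrilled with these results!", "This is the worst day ever."]
labels = torch.tensor([4, 5])    # indices into the emotion label set

batch = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")

# Passing `labels` makes the model return the cross-entropy loss directly.
outputs = model(**batch, labels=labels)
outputs.loss.backward()          # backpropagation adjusts the ELECTRA weights
optimizer.step()
optimizer.zero_grad()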


Performance of RoBERTa, MLM-BERT, BERT-Base, and ELECTRA on few-shot learning

Backpropagation is used to modify the ELECTRA model's weights in order to reduce this loss
function and improve the model's performance on the emotion analysis task.

By cleaning, converting, and dividing the data into training, validation, and testing sets, we prepared
the data for SageMaker for NLP. Once the data was prepared, we set up a SageMaker notebook
instance, which offers a Jupyter notebook environment for exploring, analysing, and visualising the
data. Then, using SageMaker's built-in algorithms or by creating custom models with well-known
deep learning frameworks such as TensorFlow or PyTorch, we can train our NLP models.
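
The data-preparation step can be sketched roughly as follows; the CSV file name, column names, split ratios, and S3 prefix are placeholder assumptions rather than the exact artefacts used in the project.

import pandas as pd
import sagemaker
from sklearn.model_selection import train_test_split

# Placeholder dataset: a CSV of tweets with an emotion label column.
df = pd.read_csv("emoreact_tweets.csv")[["text", "label"]].dropna()

# Split into training, validation, and testing sets (roughly 80/10/10).
train_df, temp_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df["label"])
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42, stratify=temp_df["label"])

for name, split in [("train", train_df), ("validation", val_df), ("test", test_df)]:
    split.to_csv(f"{name}.csv", index=False)

# Upload the splits to S3 so that a SageMaker training job or notebook can consume them.
session = sagemaker.Session()
for name in ["train", "validation", "test"]:
    uri = session.upload_data(f"{name}.csv", key_prefix=f"emotion-analysis/data/{name}")
    print(name, "->", uri)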


Chapter 5

Results and Discussion

All the components in the dataset were pre-processed correctly and the
model was trained and tested successfully with the dataset. The model
achieved a training accuracy of 92.2%, as depicted in Figure 5.1. Figure 5.2
shows the training loss of the model, which is around 0.2 after 20 epochs.
Figures 5.3 and 5.4 show the validation accuracy and validation loss of the
model respectively. The validation accuracy came to 88.5% and the loss
was around 0.5; had the iterations continued, these values would not have
changed much.

Fig.5.2 Training Loss

Figures 5.3 and 5.4 show the accuracy and the loss on the validation set. A spike
can be seen in the loss during the 6th and 7th epochs, and similarly a dip can be
seen in the validation accuracy in the same epochs. This is a normal occurrence
during training and does not affect the final results, which show very good loss
and accuracy values.

We can also calculate the means of these metrics, which resulted in the
following: mean precision 0.888, mean recall 0.885, and mean F1-score 0.884.

Chapter 6

Conclusion & Future Work

The unavailability of a balanced dataset has definitely affected the flow of
the project and also its effectiveness, but it can be said that this was
overcome by the use of sound methods, and the images were of very high
quality with usable features and little loss after compression and resizing,
which reiterates their quality.
It was difficult to work out how to tackle the problem of imbalance, and
many methods were tried unsuccessfully, such as the implementation of
class weights to make the model give higher weight and priority to
minority classes rather than focusing mainly on the majority classes, which
ultimately did not work. The addition of multiple dropout layers was also
tested without much success. The size of the model was also reduced to
help with the imbalance, as it is the norm to use smaller models to simplify
the case; more layers would make the model more complex and prone to
over-fitting.
The hybrid model yielded an accuracy increase of at least 10% for emotions
with low label counts. The proposed system has three modules: a BERT-Base
preprocessor, an ELECTRA fine-tuning module, and an AWS HITL module.



Over-fitting was handled through the implementation of several methods,
such as Batch Normalization, Data Augmentation, the right number of
dropout layers, and the right degree of model-size reduction. Many lessons
were also learnt during the phase of building the project, such as different
techniques for measuring model performance on different datasets,
especially imbalanced datasets, where accuracy is not the way to go but
precision, recall, and F1-score fare better at showing the performance of
the model. However, there is still scope for improvement here, on the basis
of trial and error, by checking other possible parameters and other possible
fine-tuning algorithms which might yield better results.

6.2 Future Work

The future work for this project includes acquiring other reputable social
media datasets, either to train the model with them or to cross-reference
the existing model by using them as testing data.
The accuracy of the model in specific classes can be improved further by
means of parameter fine-tuning and by implementing other techniques,
such as adding regularisation to the CNN, which would result in a more
efficient model, or by integrating the RoBERTa architecture instead of the
much lighter BERT-Base.

The future work also encompasses the deployment compatibility of the
custom AI module: testing out use cases such as edge deployment of client
APIs with minimal latency, even on low-end devices.



Appendices
Appendix 1: Code for Importing Dataset & Pre-Processing
Importing dataset into the environment and resizing it into square
images.
import os
import cv2
import matplotlib.pyplot as plt

data_path = '../input/ck-dataset/Dataset'
directory = os.listdir(data_path)
img_data_before = []
for dataset in directory:
    img_list = os.listdir(data_path + '/' + dataset)
    for img in img_list:
        input_img = cv2.imread(data_path + '/' + dataset + '/' + img)
        # input_img = cv2.cvtColor(input_img, cv2.COLOR_BGR2GRAY)
        input_img_resize = cv2.resize(input_img, (640, 640))   # resize to 640x640 squares
        img_data_before.append(input_img_resize)
    print(img_list)
plt.imshow(img_data_before[-1])
plt.show()
Data Augmentation using SMOTE
from imblearn.over_sampling import SMOTE

smote = SMOTE()
# x_reshape holds the images flattened to 2-D (samples x features) for SMOTE
x_newa, y_new = smote.fit_resample(x_reshape, y)
x_newa.shape
# reshape the oversampled data back into image form (nx x ny x ncolor)
x_new = x_newa.reshape(1736, nx, ny, ncolor)
x_new.shape
Applying necessary pre-processing

filters = build_filters(241)   # custom Gabor filter bank used for feature extraction
img_data_new = []
for i, img in enumerate(img_data_before):
    haar_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
    faces_result = haar_cascade.detectMultiScale(img, 1.1, 5)
    print(str(i), faces_result[0])
    (x, y, w, h) = faces_result[0]
    img_cropped = img[y+1:y+h+20, x+1:x+w]
    if (w - x) * (h - y) < 241 * 241:
        img_final = cv2.resize(img_cropped, (241, 241), interpolation=cv2.INTER_CUBIC)
    else:
        img_final = cv2.resize(img_cropped, (241, 241), interpolation=cv2.INTER_AREA)
    res = process(img_final, filters)   # apply the Gabor filters to the cropped face
    img_data_new.append(res)
    # print(str(i)+' ')
plt.imshow(img_data_new[-1])
plt.show()

Converting array into a vector usable by model


img_data = np.array(img_data_new)
img_data = img_data.astype('float32')
img_data_proc = img_data / 255       # scale pixel values to [0, 1]
img_data_unproc = img_data           # keep an unscaled copy
img_data = img_data_proc


Labeling Classes by assigning numbers


num_classes = 7
samples = img_data.shape[0]
labels = np.ones((samples,), dtype='int64')   # 'class' is a reserved word, so use 'labels'
labels[0:134] = 0
labels[135:188] = 1
labels[189:365] = 2
labels[366:440] = 3
labels[441:647] = 4
labels[648:731] = 5
labels[732:980] = 6
names = ['anger', 'contempt', 'disgust', 'fear', 'happy', 'sadness', 'surprise']
def getLabel(id):
    return names[id]

Shuffling and splitting the dataset into training and testing data

Y = np_utils.to_categorical(labels, num_classes)
x, y = shuffle(img_data, Y, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.15, random_state=2)
x_test = X_test







Appendix 2: Emotion Detection CNN Model
Building and compiling the model
input_shape = (241, 241, 3)
model = Sequential()
model.add(Conv2D(6, (2, 2), input_shape=input_shape, padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(16, (2, 2), padding='same', activation='relu'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (2, 2), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(1024, activation='relu'))
# model.add(Dropout(0.5))
# model.add(Dense(1024, activation='relu'))
# model.add(Dropout(0.5))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(7, activation='softmax'))

learning_rate = 1e-3
opt = Adam(lr=learning_rate, beta_1=0.9, beta_2=0.999, epsilon=None, amsgrad=False)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

Introducing callbacks for the model to log metrics and control the learning rate

from keras import callbacks

file_name = 'model_train_new.csv'
file_path = "Best-weights-my_model-{epoch:03d}-{loss:.4f}-{acc:.4f}.hdf5"
csv_log = callbacks.CSVLogger(file_name, separator=',', append=False)
reduce_lr = callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.4, patience=3, min_lr=0.00001)
callbacks_list = [csv_log, reduce_lr]
hist = model.fit(X_train, y_train, batch_size=7, epochs=25, verbose=1,
                 validation_data=(X_test, y_test), callbacks=callbacks_list)

















Appendix 3: Validation Dataset Prediction and Metrics
Prediction of testing dataset

res2 = np.argmax(model.predict(X_test), axis=1)
print(res2)
res2_label = []
for i in range(len(res2)):
    res2_label.append(getLabel(res2[i]))
y_true = y_test.argmax(1)
print(y_true)
y_true_label = []
for i in range(len(y_true)):
    y_true_label.append(getLabel(y_true[i]))

Analysis using Performance Metrics


from sklearn.metrics import confusion_matrix

cnf = confusion_matrix(y_true_label, res2_label, labels=names)
print(cnf)
recall = np.diag(cnf) / np.sum(cnf, axis=1)
precision = np.diag(cnf) / np.sum(cnf, axis=0)
print(recall, precision)
print(np.mean(recall), np.mean(precision))
f1 = 2 * recall * precision / (recall + precision)
print(f1, np.mean(f1))





Appendix 4: Sample Outputs

Summary of ELECTRA Fine Tuning Module

BERT-Base preprocessor Transformer output

Results of the Hybrid model

REFERENCES

1. Kanade, T., Cohn, J. F., & Tian, Y. (2000, March). Comprehensive database for facial expression analysis. In Proceedings Fourth IEEE International Conference on Automatic Face and Gesture Recognition (Cat. No. PR00580) (pp. 46-53). IEEE.

2. Lucey, P., Cohn, J. F., Kanade, T., Saragih, J., Ambadar, Z., & Matthews, I. (2010, June). The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops (pp. 94-101). IEEE.

3. Chowdary, M. K., Nguyen, T. N., & Hemanth, D. J. (2021). Deep learning-based facial emotion recognition for human-computer interaction applications. Neural Computing and Applications, 1-18.

4. Mehendale, N. (2020). Facial emotion recognition using convolutional neural networks (FERC). SN Applied Sciences, 2(3), 446.

5. Pranav, E., Kamal, S., Chandran, C. S., & Supriya, M. H. (2020, March). Facial emotion recognition using deep convolutional neural network. In 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS) (pp. 317-320). IEEE.

6. Akhand, M. A. H., Roy, S., Siddique, N., Kamal, M. A. S., & Shimamura, T. (2021). Facial emotion recognition using transfer learning in the deep CNN. Electronics, 10(9), 1036.

7. Modi, S., & Bohara, M. H. (2021, May). Facial emotion recognition using convolution neural network. In 2021 5th International Conference on Intelligent Computing and Control Systems (ICICCS) (pp. 1339-1344). IEEE.

8. Khattak, A., Asghar, M. Z., Ali, M., & Batool, U. (2022). An efficient deep learning technique for facial emotion recognition. Multimedia Tools and Applications, 1-35.

9. Minaee, S., Minaei, M., & Abdolrashidi, A. (2021). Deep-emotion: Facial expression recognition using attentional convolutional network. Sensors, 21(9), 3046.

10. Khaireddin, Y., & Chen, Z. (2021). Facial emotion recognition: State of the art performance on FER2013. arXiv preprint arXiv:2105.03588.

11. Jain, D. K., Shamsolmoali, P., & Sehdev, P. (2019). Extended deep neural network for facial emotion recognition. Pattern Recognition Letters, 120, 69-74.

12. Lakshmi, A. V., & Mohanaiah, P. (2021). WOA-TLBO: Whale optimization algorithm with teaching-learning-based optimization for global optimization and facial emotion recognition. Applied Soft Computing, 110, 107623.

13. Zadeh, M. M. T., Imani, M., & Majidi, B. (2019, February). Fast facial emotion recognition using convolutional neural networks and Gabor filters. In 2019 5th Conference on Knowledge Based Engineering and Innovation (KBEI) (pp. 577-581). IEEE.

14. Li, K., Jin, Y., Akram, M. W., Han, R., & Chen, J. (2020). Facial expression recognition with convolutional neural networks via a new face cropping and rotation strategy. The Visual Computer, 36, 391-404.

15. Jain, N., Kumar, S., Kumar, A., Shamsolmoali, P., & Zareapoor, M. (2018). Hybrid deep neural networks for face emotion recognition. Pattern Recognition Letters, 115, 101-106.

16. Xiaohua, W., Muzi, P., Lijuan, P., Min, H., Chunhua, J., & Fuji, R. (2019). Two-level attention with two-stage multi-task learning for facial emotion recognition. Journal of Visual Communication and Image Representation, 62, 217-225.

17. Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.

18. Viola, P., & Jones, M. (2001, December). Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001) (Vol. 1, pp. I-I). IEEE.

19. Ni, S., & Kao, H.-Y. (2022, July). ELECTRA is a zero-shot learner, too. Department of Computer Science and Information Engineering, National Cheng Kung University.

20. Ryu, M., & Nakajima, K. (2022). Analysis and mitigation of dataset artifacts in OpenAI GPT-3. University of Texas at Austin.
