
Proceedings of the 5th International Conference on Smart Systems and Inventive Technology (ICSSIT 2023)

IEEE Xplore Part Number: CFP23P17-ART; ISBN: 978-1-6654-7467-2

Attention based Image Caption Generation (ABICG) using Encoder-Decoder Architecture

DOI: 10.1109/ICSSIT55814.2023.10061040

Uday Kulkarni, Asst. Professor, Dept. of CSE, KLE Technological University, Hubballi, India (uday kulkarni@kletech.ac.in)
Kushagra Tomar, Department of CSE, KLE Technological University, Hubballi, India (kushagratomar2016@gmail.com)
Mayuri Kalmat, Department of CSE, KLE Technological University, Hubballi, India (mayurikalmat1@gmail.com)
Rakshita Bandi, Department of CSE, KLE Technological University, Hubballi, India (rakshitabandi0@gmail.com)
Pranav Jadhav, Department of CSE, KLE Technological University, Hubballi, India (jadhavpranav250@gmail.com)
Dr. Meena S M, Professor and Head, Dept. of CSE, KLE Technological University, Hubballi, India (msm@kletech.ac.in)

Abstract—Image captioning is used to generate sentences describing the scenes captured in an image or picture. The applications of image captioning are vast, although it is a tedious task for a machine to learn what a human is capable of: the model must be built in such a way that, when it reads a scene, it recognizes it and reproduces to-the-point captions or descriptions. The generated descriptions must be semantically and syntactically accurate. The availability of Artificial Intelligence (AI) and Machine Learning algorithms, viz. Natural Language Processing (NLP), Deep Learning (DL), etc., makes the task easier. Although the majority of existing machine-generated captions are valid, they do not focus on the crucial parts of the images, which results in less clear captions. In this paper, Bahdanau's attention mechanism is introduced along with an Encoder-Decoder architecture so as to produce image captions that are more accurate and detailed. The model uses a pre-trained Convolutional Neural Network (CNN), the InceptionV3 architecture, to gather the features of images, and then a Recurrent Neural Network (RNN), the Gated Recurrent Unit (GRU) architecture, to generate captions. The model is trained on the Flickr8k dataset and the captions generated are 10% more accurate than the present state of the art.

Keywords—Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Gated Recurrent Unit (GRU), Encoder, Decoder, Attention mechanism, Image captioning.

I. INTRODUCTION

Language is the medium through which society constantly interacts, be it written or spoken. It typically describes the perceptible world around us. Photos and symbols are another way for physically disabled individuals to speak and perceive. Automatic description generation from an image in proper sentences is a tedious task; nevertheless, it can have an ample impact on visually challenged individuals by helping them better understand the descriptions of pictures on the web. For a long time, "Image or Picture Captioning" [2] [3] [4] has been a rigid mission, and the captions generated for a given image were not very pertinent. In conjunction with the progress of Deep Learning [21], CNNs and techniques for processing text like NLP [22], a number of previously challenging pieces of work became straightforward using Machine Learning. These are profitable in the recognition, classification and captioning of images and several additional AI [19] applications. "Image Captioning" [2] is pragmatic in various applications, viz. self-driving cars and surveillance, which are at present matters of the moment. In comparison with classification and object recognition, the task of automatically generating captions and describing images is significantly more complicated. A description of an image must include more than just the objects in it, but also how the objects are related to their attributes and activities, as shown in Fig. 1.

Fig. 1. Examples of image captioning

However, it is best to express semantic knowledge in a natural language such as English.


A single model is to be designed which takes in an image and is trained to give out a string of words, each affiliated to the vocabulary, that narrates the picture accordingly. In this paper, an ABICG model is proposed which is capable of describing images in a novel way. For this job, a dataset composed of 8000 images with five descriptions per image is taken from the Flickr8k dataset [20]. As new applications are being developed every day, Deep Learning has emerged as a prepotent field today. Test and trial is the main way to explore Deep Learning to a greater extent, and in this way one gains a better understanding of the topic. Real-life applications of this technique are numerous. The use of Deep Learning for image description has been proposed in many different models, including object detection, captioning on the basis of visual attention and image captioning using Deep Learning. Different Deep Learning models exist as well, such as the InceptionV3 model [9], the Visual Geometry Group 16 (VGG-16) model [14], the Residual Networks (ResNet) [18] with Long Short-Term Memory (LSTM) model [15] and the traditional CNN-RNN model [23]. Both CNN and RNN are used here. For image encoding, i.e. as an encoder, a pre-trained CNN is used to classify images. The output of its last hidden layer is used to train the RNN, which acts as a decoder to generate captions. In LSTM networks, memory cells are endowed with a limited number of phases determined by long-term dilution of the existing memory information. With a total of 16 layers, VGG-16 is an ingenious model for object recognition. In the following stage, the extracted features are trained with the wording specified in the dataset. Two architectures, LSTM and GRU [16], are used for framing sentences from the given input images.

In ABICG, the Flickr8k dataset [20] is used and it is subjected to elaborate preprocessing steps to optimize the input. The preprocessed data is fed to the model, which uses InceptionV3 as the encoder and GRU as the decoder. Bahdanau's attention model is applied to this encoder-decoder model to fetch more focused captions. Meaningful captions are generated by the model, and its performance is evaluated with BLEU scores [31].

The paper is organized as follows. The Section 'Related Work' discusses the previous works and methodologies related to this domain, including the scope for improvement in existing methods and the uniqueness of the model presented in this paper. The Section 'Proposed Work' comprises the proposed work, where the ABICG architecture is discussed along with an explanation of each component of the architecture. The Section 'Experimental Results' further elaborates on the system specifications, dataset used, data preprocessing, results, comparison of training loss and evaluation of the model using BLEU scores [32]. Finally, the conclusion and the future scope are discussed in the Section 'Conclusion'.

II. RELATED WORK

In [1], the authors have made use of a CNN as an encoder to extract the characteristics or attributes from the images. The CNN is a pre-trained InceptionV3. Owing to the fact that InceptionV3 is a deep network for object detection, it needs to be altered slightly to assist in encoding. A feature vector is obtained from this deep network by removing the terminating layer; the feature vector obtained is of size (8x8x2048) and is the input to the RNN. The RNN employed for decoding is a GRU [16]. To generate more focused captions, the Bahdanau model is used.

The writers of [2] have proposed a model where the input is the Flickr8k dataset and the output is passed to a new fully connected layer introduced at the end of the InceptionV3 model. The task of this layer is to transform the model's output into a vector which embeds words. It serves as an input to an LSTM cell by implanting a vector. The LSTM unit attaches the series of information and collects it progressively, hence enabling the establishment of meaningful captions. The InceptionV3 component of this model is trained to recognize all possible objects in a picture. Each word of the caption is predicted using the previous words in the phrase. The main intention of training is to reduce the loss function. They have used the Flickr8k dataset, which has nearly 8000 images, each tagged with five unique captions or descriptions that offer compact reports of the noteworthy features.

In [3], the authors have put forth a model that allows neural networks to view an image automatically and yield meaningful captions similar to natural English sentences. It is a well-trained model that performs the above-mentioned tasks. Here, a pre-trained CNN is utilized to classify images and handles the task of encoding them. The input to the RNN (the decoder here) is the hindmost hidden layer of the encoder, and the decoder generates sentences. The dataset used, Flickr8k [20], consists of about 8000 images with five descriptions tagged to every image. They used VGG [14] for large-scale image recognition. They conclude that using a bulky dataset boosts the performance of the model; in addition to reducing losses, it also improves accuracy.

To achieve better results, the authors of [4] worked on a model that combines a CNN architecture and LSTM for image captioning. The proposed model uses three CNN architectures: ResNet-50 [25], Xception [24], and InceptionV3 [9]. The aptest combination of CNN and LSTM is chosen based on the model's accuracy. Training is performed on the Flickr8k data. Combining Xception with LSTM has the highest accuracy of 75% across epochs among the three CNN models.


The authors of [5] proposed a model where CNN features are extracted from an image and encoded into vector representations using the 16 convolutional layers of VGG-16. Next, an RNN decoder model is used to develop corresponding sentences based on the learned image features, i.e., by training the features with the captions or descriptions provided in the dataset. The input images are processed using two architectures, viz. GRU and LSTM. From the results, it is evident that the LSTM model achieves better results than the GRU [16] model, although it takes a longer time to train and generate captions due to the model's complexity.

In [6], the authors have developed a model where preprocessed images are fed to the InceptionV3 model and the features are extracted. Later, a D-dimensional representation of each and every part of the image is produced by the extractor as L vectors. With the spatial features of a CNN convolution layer, the decoder calculates the context vector according to the specific regions of the input image. For the decoder's job, a GRU is utilized, which has a simpler structure than the LSTM [15]. The vanishing gradient problem does not affect the GRU, unlike the plain RNN. Thus, it shows that the usage of GRU gives better results than LSTM.

In [7], the authors present a method to overcome the vanishing gradient problem which hinders the existing CNN-RNN models. They have proposed ResNet-LSTM as an encoder-decoder technique for image captioning. The ResNet (encoder) extracts the features and the LSTM (decoder) generates the caption from the extracted features. For this, the images are resized to (224x224x3) and subjected to several pre-processing steps. They have used the Flickr8k dataset for training the model. After a minimum of 20 epochs, meaningful captions begin to be generated. It is better than the VGG and CNN-RNN models.

The authors of [8] explain a multi-feature fusion model to generate image captions. Models that currently exist focus on the global characteristics of an image, but along with these comprehensive features, this model also considers the localized features of images. Global feature extraction is performed using the VGG16 network and Faster R-CNN is used to excerpt the local characteristics. The local and global features are fused and fed as input, through an attention layer, to a Bi-LSTM [28]. The caption obtained is corrected if any error occurs. The ImageNet dataset with image size (224x224x3) is used to train VGG16 [14], the Pascal VOC dataset is used to train Faster R-CNN [29] (a 1:1 positive-to-negative sample ratio is maintained), and the Bi-LSTM is trained with the MSCOCO dataset [27]. The fused features have turned out to be superior to global or local features alone. The accuracies on the training set and verification set are 78.20% and 66.50% respectively.

The majority of the existing models are hindered by the vanishing gradient [12] problem. Usage of the CNN is prevalent in the existing models. Due to vanishing gradients, as the depth of the hidden layers increases, the learnability of the models falls to zero. In most of the works, LSTMs are used, which are slower and computationally less efficient as compared to the GRU. To solve this issue of vanishing gradients, the GRU is utilized in the current work.

III. PROPOSED WORK

To accomplish the task of image captioning, ABICG comprises a bipartite architecture, viz. an encoder and a decoder. Images are fed to the encoder, where the image information is transformed into feature vectors. The output of the encoder is passed to the decoder, which translates the features into English sentences. This method is termed "Classic Image Captioning". The problem with this method is that it is not possible to take into account the spatial features of an image. As a result, a caption is generated considering the full image as a scene, without considering the sensitive or important features of the image. To enable the model to focus on important features of the image, Bahdanau's attention has been used jointly with the encoder-decoder architecture.

The captions generated by the proposed model are semantically and grammatically correct. The generated captions are close to human-generated ones, or carry human-centric meaning, and they describe not only the scene in the image but also the intricate details and the relationship of the objects with the background.

A. Model Overview

In ABICG, the CNN used is InceptionV3 [9], pretrained on ImageNet weights, which serves as the encoder. It extracts the features from the receptive fields of the images and forwards them to the decoder. Here, the RNN used is a GRU, which acts as the decoder. The role of the decoder is to decode the sentence from the encoding. The Bahdanau attention model [11] is used to enhance the capability of the decoder by allowing it to focus on the important aspects of the images while producing the captions, thus taking care that the sensitive parts of the image are not left out of the generated caption.

Fig. 2. Overall Architecture
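As an illustration of this encoder-attention-decoder flow, the minimal TensorFlow sketch below generates a caption word by word until an <end> token is produced. The encoder, decoder and tokenizer objects and the decoder call signature are assumptions in the spirit of the components described in this paper, not the authors' released code.

```python
import tensorflow as tf

def generate_caption(image_tensor, encoder, decoder, tokenizer, max_len=40):
    """Greedy decoding sketch: the decoder emits one word per timestep."""
    # Encoder: (1, 299, 299, 3) image -> grid of feature vectors for attention.
    features = encoder(image_tensor)

    # Decoding starts from a zero hidden state and the <start> token.
    hidden = tf.zeros((1, decoder.units))
    dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)

    words = []
    for _ in range(max_len):
        # Attention inside the decoder weighs the feature vectors against
        # the current hidden state before predicting the next word.
        predictions, hidden, _ = decoder(dec_input, features, hidden)
        predicted_id = int(tf.argmax(predictions[0]))
        word = tokenizer.index_word.get(predicted_id, '<UNK>')
        if word == '<end>':
            break
        words.append(word)
        # The predicted word is fed back in at the next timestep.
        dec_input = tf.expand_dims([predicted_id], 0)
    return ' '.join(words)
```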


The above figure (Fig. 2) shows the overall architecture of the model, which includes the CNN encoder (InceptionV3), the RNN decoder (GRU) and the attention model. The image is fed into the encoder and then the tanh activation function is applied to introduce non-linearity. The output is then fed to the decoder and, for each timestamp of the decoder, the attention model enables the decoder to focus on specific parts of the image.

B. Convolutional Neural Network (CNN)

InceptionV3 [9] is often used for image recognition and has been very popular in the field of image processing because of its up-to-the-mark accuracy on different datasets. It encompasses building blocks of asymmetric and symmetric types, along with convolutions, max pooling, average pooling, concatenations, dropouts and various fully connected layers, as shown in Fig. 3.

It was built for the purpose of object detection, receiving a (299x299x3) image. Since InceptionV3 [9] is mostly used for object detection, it requires some refinement to turn it into an encoder for extracting the image features. The last layer, which classifies the images into labels, is eliminated since classification of images is not required. Thus, a feature vector of size (8x8x2048) is obtained. The resulting feature vector is static and does not change at each timestamp. Therefore, this vector is passed to the attention model along with the hidden state of the decoder to create the context vector.

Fig. 3. Convolutional Neural Network (Inception V3) [9]

The benefit of using the InceptionV3 CNN for the encoder part is that it generates fewer parameters for computation, which makes it computationally less expensive in comparison to the other models, and it is memory efficient. There is no comparison between InceptionV3 and the other models when it comes to depth and accuracy.
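A minimal TensorFlow/Keras sketch of such a headless InceptionV3 feature extractor is shown below. The (299, 299) input size, the scaling to the range -1 to 1 and the (8x8x2048) output follow the description above; the function name and the flattening into 64 location vectors are illustrative assumptions.

```python
import tensorflow as tf

# InceptionV3 pretrained on ImageNet, without the final classification layer.
base_model = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')
image_features_extract_model = tf.keras.Model(base_model.input, base_model.output)

def extract_features(img_path):
    """Load one image, preprocess it for InceptionV3 and return its feature grid."""
    img = tf.io.read_file(img_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)   # scale to [-1, 1]
    features = image_features_extract_model(tf.expand_dims(img, 0))  # (1, 8, 8, 2048)
    # Flatten the 8x8 grid into 64 location vectors for the attention module.
    return tf.reshape(features, (1, 64, 2048))
```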
C. Attention Mechanism

The attention model is a deep learning technique that makes use of an attention mechanism to provide additional focus on specific components. Bahdanau's Attention Model [11] is used in ABICG. It selectively highlights the relevant features of the input data. It can also be seen as an interface that connects the encoder and decoder: it supplies the decoder with the relevant details from each and every encoder hidden state. Decoding begins with the context vector generated by the attention model to predict the word at that particular timestamp. The context vector changes at each timestamp since it is adaptive in nature.

Fig. 4. The Bahdanau's Attention Model [11]

The attention model (Fig. 4) [11] performs a linear transformation of the input and applies tanh to it, as in (1), so as to introduce non-linearities and thereby achieve a smoother distribution. Then the attention score a_s is computed. The output is required to be in the range (0, 1), so the softmax function is applied to the attention score and the final attention weights are obtained. This model intends to overcome the limitations of the orthodox CNN-RNN models: it facilitates passing various parts of the image instead of the whole, which also makes it swift and raises its accuracy.

a_s = tanh(W_1 h_d1 + W_2 h_d2)   (1)

With this score, the attention weights are calculated using (2):

α = softmax(a_s)   (2)

Then, by using the attention weights (α) from (2) and the features (h_d2) obtained from the encoder, the context vector c_vec is obtained with (3):

c_vec = α h_d2   (3)

Ultimately, the fixed-length vector c_vec is combined with the decoder's output from the preceding timestamp, h_t, and then fed into the RNN cell in order to obtain the decoder's output for the current timestamp.
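A minimal Keras sketch of equations (1)-(3) is given below, taking h_d1 as the decoder's previous hidden state and h_d2 as the encoder feature vectors, as described above. The extra Dense(1) projection, which collapses the score to one value per image location before the softmax, is a common practical realization of (2) and is an assumption here.

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.Model):
    """Sketch of equations (1)-(3): score -> softmax weights -> context vector."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # applied to the decoder state h_d1
        self.W2 = tf.keras.layers.Dense(units)  # applied to the encoder features h_d2
        self.V = tf.keras.layers.Dense(1)       # one scalar score per image location

    def call(self, features, hidden):
        # features (h_d2): (batch, 64, dim); hidden (h_d1): (batch, units)
        hidden_with_time_axis = tf.expand_dims(hidden, 1)                        # (batch, 1, units)
        score = tf.nn.tanh(self.W1(hidden_with_time_axis) + self.W2(features))   # eq. (1)
        attention_weights = tf.nn.softmax(self.V(score), axis=1)                 # eq. (2)
        context_vector = attention_weights * features                            # eq. (3)
        context_vector = tf.reduce_sum(context_vector, axis=1)                   # sum over locations
        return context_vector, attention_weights
```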
In the above example image, there is a white-colored bird sitting on a sign board. The image is fed into the encoder, which extracts the image features and gives them as input to the decoder, which transforms the image feature vector into a concise caption. So here, the caption generated would be "a white bird perched on top of a red stop sign", all in lower case.

The project aims at mimicking the human brain, with its ability to generate a caption for every scene it senses. Therefore, it becomes crucial to add an attention mechanism using which the CNN-RNN model focuses on the more important parts of the image.


Fig. 5. Complete architecture of the proposed CNN-RNN-Attention model [17]

There is no static vector encoding of the whole image in the attention mechanism. Instead, it adds the spatial information corresponding to the image to the extracted image features. As a result, statements are described in a more detailed manner, as shown in Fig. 5 [17]. By such means, while generating sentences, a simulation of human vision using the attention mechanism can be coupled with the word-sequence generation process. This ensures that the generated sentence will reflect the expression habits of people.

D. Recurrent Neural Network (RNN)

A GRU is used as the decoder. It works on the mechanism of an RNN, which anticipates expressions in natural language. The commonly used RNNs are the LSTM, the Vanilla RNN, and the GRU. The Vanilla RNN is not preferred due to its vanishing gradient problem.

The GRU is similar to the LSTM, but it has a few key differences: the GRU has only two gates whereas the LSTM has three, and it exposes its total memory as well as the hidden layers. The GRU not only requires fewer parameters for training but also computes more effectively, thus making it computationally efficient.

The GRU is composed of two gates, viz. the update gate and the reset gate. Together, these two gates act as a convex combination which gives the verdict on which information of the hidden state is to be updated and which is to be forgotten.

A large number of layers in the network leads to a fall in the derivative product until the partial derivative of the loss function tends to zero and the gradient vanishes. This phenomenon is known as the vanishing gradient problem. In simple words, this means that the initially predicted words are wiped out as new words are predicted, therefore giving less weightage to the initial words (and vice versa) in the output generated. To tackle this problem, the LSTM was introduced. There is not much difference between the GRU and the LSTM, but the GRU has a simpler network cell architecture, as shown in Fig. 6 [10], compared to that of the LSTM. Hence, the GRU is used in this caption generator model.

Fig. 6. The Gated Recurrent Unit [16]

Fig. 6 displays the working principle of the GRU with a diagram. Here, x_t is the input vector, z_t is the update gate vector, h_{t-1} is the previous output, h_t is the current output, r_t is the reset gate vector and h̃_t is the activation vector. Sigma (σ) represents the sigmoid activation function and tanh represents the hyperbolic tangent operation. Firstly, the update gate vector z_t is calculated for time step t using (4):

z_t = σ(Weight_input_update · x_t + Weight_hidden_update · h_{t-1})   (4)

The input vector x_t and the previous output h_{t-1} are multiplied with their respective weights, viz. Weight_input_update and Weight_hidden_update. The obtained products are added and the sum is squashed between 0 and 1 using the sigmoid activation function. The update gate enables the model to decide how much of the previous content needs to be passed on to the future.

Then, the reset (forget) gate vector r_t in (5) is calculated using the same form of formula as in (4). A gate like this allows the model to decide how much information has to be forgotten from the past:

r_t = σ(Weight_input_reset · x_t + Weight_hidden_reset · h_{t-1})   (5)

For the current memory content, the input x_t is multiplied with a weight W_X1 and h_{t-1} is multiplied with a weight W_h1. Then, the Hadamard (element-wise) product is calculated between the reset gate r_t and W_h1 · h_{t-1}. These terms are added and the nonlinear activation function tanh is applied, as shown in (6):

h̃_t = tanh(r_t ⊙ (W_h1 · h_{t-1}) + W_X1 · x_t)   (6)

For the final memory content at the current time step, element-wise multiplication is applied to the update gate z_t and h_{t-1}, and to 1 − z_t and h̃_t. Then, these two products are added, as shown in (7):

h_t = (1 − z_t) ⊙ h̃_t + z_t ⊙ h_{t-1}   (7)
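To make equations (4)-(7) concrete, the following NumPy sketch performs a single GRU step. The weight names mirror the equations; the shapes, the random initialization and the omission of bias terms are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_iu, W_hu, W_ir, W_hr, W_x1, W_h1):
    """One GRU time step following equations (4)-(7)."""
    z_t = sigmoid(W_iu @ x_t + W_hu @ h_prev)               # (4) update gate
    r_t = sigmoid(W_ir @ x_t + W_hr @ h_prev)               # (5) reset gate
    h_tilde = np.tanh(r_t * (W_h1 @ h_prev) + W_x1 @ x_t)   # (6) candidate activation
    h_t = (1.0 - z_t) * h_tilde + z_t * h_prev               # (7) final hidden state
    return h_t

# Tiny usage example with random weights: 4-dimensional input, 3-dimensional hidden state.
rng = np.random.default_rng(0)
x_t, h_prev = rng.normal(size=4), np.zeros(3)
W_iu, W_ir, W_x1 = (rng.normal(size=(3, 4)) for _ in range(3))
W_hu, W_hr, W_h1 = (rng.normal(size=(3, 3)) for _ in range(3))
print(gru_step(x_t, h_prev, W_iu, W_hu, W_ir, W_hr, W_x1, W_h1))
```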


IV. EXPERIMENTAL RESULTS

In the proposed ABICG, a neural framework is proposed for generating the descriptions for the given input images.

A. System Specifications

The model was trained on a system having the specifications mentioned in Fig. 7. The hyperparameters decided were 25 epochs, a batch size of 64, and a learning rate of 0.001. NVIDIA's CUDA was used to achieve parallel processing. The training took approximately 1 hour 31 minutes.

Fig. 7. Hardware specifications of the system on which the model was trained.

TensorFlow [26] is an end-to-end open-source platform for Machine Learning, pioneered by Google, and it offers various frameworks for Deep Learning. Being utterly flexible, portable, and reliable, it is used in ABICG.

B. Dataset

The dataset used to develop ABICG is Flickr8k [20]. There are approximately 8000 images in the collection (8091 to be precise) and each image has 5 captions, so there are a total of 40455 captions used to build the aimed model. Python's TensorFlow library is used to preprocess the images.

Fig. 8. Sample data from Flickr8k Dataset [20]

C. Data Preprocessing

The considered dataset comprises 5 captions corresponding to each image. Therefore, preprocessing is performed twice, once for the images and once for the captions or annotations. Caption preprocessing includes the removal of punctuation and alpha-numeric values from each caption. Also, <start> and <end> tags are introduced at the beginning and ending of each caption respectively. Then tokenized vectors are created by tokenizing the captions, i.e. splitting them into words using spaces and other filters. This gives a lexicon of all unrepeated words in the data. For memory efficiency, the total vocabulary is restricted to 5000 words; all other words are replaced with the unknown token <UNK>. Then follows the creation of a word-to-index mapping and an index-to-word mapping.

The input to the decoder should be of the same size and shape. Therefore, padding is used to bring all captions to a fixed length before proceeding further. In order to ensure that all samples have a standard length, zero padding is applied before or after a sequence; in this model, zero padding is done at the end of the caption sequence. Padding, however, risks adding a penalty to the model. Masking is applied to rectify this, truncating all the added penalties back to zero. As for the image preprocessing, the images are reshaped to (299, 299) and normalized to the range -1 to 1, so that they are in the correct format for the CNN encoder (InceptionV3).

Afterwards, the captions are mapped to their corresponding image names in the dataset, so that during training the vectors corresponding to the caption and the image features are mapped together and trained suitably.
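A minimal sketch of the caption side of this preprocessing in TensorFlow/Keras is given below. The <start>/<end> tags, the 5000-word vocabulary limit, the <UNK> token and the end padding follow the description above; the example captions and the exact filter string are illustrative assumptions.

```python
import tensorflow as tf

# Example captions (placeholders), wrapped in <start> ... <end> tags.
captions = ['A dog runs through the grass.', 'Two girls are hanging upside down.']
captions = ['<start> ' + c.lower() + ' <end>' for c in captions]

tokenizer = tf.keras.preprocessing.text.Tokenizer(
    num_words=5000,                            # restrict the vocabulary to 5000 words
    oov_token='<UNK>',                         # token used for unknown words
    filters='!"#$%&()*+.,-/:;=?@[\\]^_`{|}~')  # strip punctuation but keep < and >
tokenizer.fit_on_texts(captions)               # builds word-to-index and index-to-word maps
seqs = tokenizer.texts_to_sequences(captions)

# Zero padding at the end so every caption has the same length; the zeros are masked later.
cap_vectors = tf.keras.preprocessing.sequence.pad_sequences(seqs, padding='post')
```

The image side (resizing to (299, 299) and scaling to the range -1 to 1) corresponds to the InceptionV3 feature-extractor sketch shown earlier.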

Fig. 9. Process of data preprocessing [13]

D. Results

1) Image 1: The below image is fed to the model and the generated caption is "large bird swooping down towards the ground", in comparison with the human-generated annotation "a white bird swooping down the ground".


Fig. 10. "large bird swooping down towards the ground"

2) Image 2: The below image is fed to the model and the caption generated is "two girls hanging upside down on monkey-bars at a park", in comparison with the human-generated annotation "two girls are hanging upside down".

Fig. 11. "two girls hanging upside down on monkey-bars at a park"

3) Image 3: The below image is fed to the model and the caption generated is "the people are standing in front of the building", in comparison with the human-generated annotation "the people are standing before a building".

Fig. 12. The people are standing in front of building

4) Graph of Loss vs Epoch for ABICG model:

Fig. 13. Train-Test Loss vs Epoch for ABICG model


The ABICG model was trained for 25 epochs with the learning rate and batch size set to 0.001 and 64 respectively. The resultant Test and Training Loss vs Epoch plot obtained is shown in Fig. 13. As expected, the training and testing loss decreases as the number of epochs increases.
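The following minimal sketch shows how one teacher-forced training step with the masked cross-entropy loss implied by the preprocessing section could look, assuming the encoder, decoder and tokenizer sketched earlier. The Adam optimizer uses the stated learning rate, but the overall structure is an assumption rather than the authors' code.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    """Cross-entropy that ignores the zero-padded positions of each caption."""
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    return tf.reduce_mean(loss_ * tf.cast(mask, loss_.dtype))

@tf.function
def train_step(img_tensor, target, encoder, decoder, tokenizer):
    loss = 0.0
    hidden = tf.zeros((target.shape[0], decoder.units))
    dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * target.shape[0], 1)
    with tf.GradientTape() as tape:
        features = encoder(img_tensor)
        for i in range(1, target.shape[1]):
            predictions, hidden, _ = decoder(dec_input, features, hidden)
            loss += loss_function(target[:, i], predictions)
            # Teacher forcing: feed the ground-truth word as the next input.
            dec_input = tf.expand_dims(target[:, i], 1)
    variables = encoder.trainable_variables + decoder.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss / int(target.shape[1])
```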
E. Comparison of Training Losses

The 'Train Loss vs Epoch' graph was plotted for 25 epochs for both the traditional InceptionV3-GRU model (without attention) and the InceptionV3-GRU model with attention (the ABICG model). From the comparison below, it is evident that the train loss is higher for the InceptionV3-GRU model without attention [32] (loss = 0.8647) than for the InceptionV3-GRU model with attention (ABICG model) (loss = 0.0050).

Fig. 14. Comparison of train losses between InceptionV3-GRU without attention and InceptionV3-GRU with attention (ABICG)

F. Metrics

The volume of data makes it impractical to show the result for each image. Hence, it becomes essential to look for a method to assess the system's average accuracy on the entire dataset. There are multiple ways to evaluate the quality of machine-generated text. For this model, the Bilingual Evaluation Understudy (BLEU) [31] has been chosen as the evaluation metric, owing to its popularity and ease of use. Before introducing BLEU, some knowledge about 'precision', a simpler and more well-known metric, is customary. Let machine-generated n-grams and ground-truth n-grams be denoted by the vectors x and y respectively. For instance, x could be taken as the words of a caption generated from an image, with x_i representing an individual word, and y could be the words from the actual captions describing the same scene; y is expected to cover the several possible captions of a single idea.

p = (1/N) Σ_{i=1}^{N} 1{x_i ∈ y}

The BLEU score and precision are equivalent, except for the fact that an n-gram in x is counted at most once for every occurrence of that n-gram in y. Say, the statement "is is is is is" would receive perfect precision if the word 'is' was present in the reference translation, but not necessarily a perfect BLEU score, as BLEU limits the count to the number of occurrences of 'is' as it appears in y.

For the ABICG model, a BLEU score of 90% was obtained for the weights (0.75, 0.25, 0, 0) and (0.50, 0.25, 0, 0).
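A minimal sketch of how such weighted BLEU scores can be computed with NLTK is shown below. The weight tuples are those reported above, and the reference and candidate sentences reuse the Image 1 captions from the Results section purely as an example, so the printed scores are not the paper's reported figures.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Human reference caption(s) and a machine-generated candidate (Section IV-D, Image 1).
references = [['a', 'white', 'bird', 'swooping', 'down', 'the', 'ground']]
candidate = ['large', 'bird', 'swooping', 'down', 'towards', 'the', 'ground']

smooth = SmoothingFunction().method1
for weights in [(0.75, 0.25, 0, 0), (0.50, 0.25, 0, 0)]:
    # The weights distribute importance across 1-gram to 4-gram precisions.
    score = sentence_bleu(references, candidate,
                          weights=weights, smoothing_function=smooth)
    print(weights, round(score, 4))
```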

Fig. 15. Comparison of accuracies

From the above table, it is evident that the accuracy of the InceptionV3-GRU model with attention (the ABICG model) is approximately 10% more than that of the traditional InceptionV3-GRU model (without attention).

V. CONCLUSION

In this paper, a caption generator for any given input image is proposed using encoder-decoder techniques. The attention mechanism is the prime focus of the paper. The attention mechanism, which is introduced after the InceptionV3 layers, makes the model focus on the highlighted receptive fields in the image, so that the decoder produces captions specifically for those parts. This greatly improves the caption-generation process as compared to the orthodox encoder-decoder models. Results fetched from the model are promising and the generated captions are clear.

Since the model has been exposed to a confined training set and vocabulary, it may be deficient in connecting input images to features or characteristics which are not present in the vocabulary. Such words are replaced with the <UNK> tag, which means they are unknown to the model. The model might not do well with input images where the <UNK> tag occurs; in such cases, the captions produced might be too trivial.


Future scope of this work includes the usage of transformer-based models instead of the existing encoder-decoder based models, where multi-head attention coupled with positional embeddings helps render information regarding how the different words are related in the correct order.

REFERENCES

[1] V. Agrawal, S. Dhekane, N. Tuniya and V. Vyas, "Image Caption Generator Using Attention Mechanism," 2021 12th International Conference on Computing Communication and Networking Technologies (ICCCNT), 2021, pp. 1-6, doi: 10.1109/ICCCNT51525.2021.9579967.
[2] S. Degadwala, D. Vyas, H. Biswas, U. Chakraborty and S. Saha, "Image Captioning Using Inception V3 Transfer Learning Model," 2021 6th International Conference on Communication and Electronics Systems (ICCES), 2021, pp. 1103-1108, doi: 10.1109/ICCES51350.2021.9489111.
[3] C. Amritkar and V. Jabade, "Image Caption Generation Using Deep Learning Technique," 2018, pp. 1-4, doi: 10.1109/ICCUBEA.2018.8697360.
[4] C. S. Kanimozhiselvi, K. V, K. S. P and K. S, "Image Captioning Using Deep Learning," 2022 International Conference on Computer Communication and Informatics (ICCCI), 2022, pp. 1-7, doi: 10.1109/ICCCI54379.2022.9740788.
[5] G. Sharma, P. Kalena, N. Malde, A. Nair and S. Parkar, "Visual Image Caption Generator Using Deep Learning," SSRN Electronic Journal, 2019, doi: 10.2139/ssrn.3368837.
[6] M. Raypurkar, A. Supe, P. Bhumkar, P. Borse and S. Sayyad, "Deep Learning Based Image Caption Generator," 2021.
[7] A. Maroju, S. S. Doma and L. Chandarlapati, "Image Caption Generating Deep Learning Model," International Journal of Engineering Research & Technology (IJERT), vol. 10, no. 09, September 2021.
[8] M. Duan, J. Liu and S. Lv, "Encoder-decoder based multi-feature fusion model for image caption generation," Journal on Big Data, vol. 3, no. 2, pp. 77-83, 2021.
[9] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens and Z. Wojna, "Rethinking the Inception Architecture for Computer Vision," 2016, doi: 10.1109/CVPR.2016.308.
[10] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk and Y. Bengio, "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation," 2014, doi: 10.3115/v1/D14-1179.
[11] D. Bahdanau, K. Cho and Y. Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate," arXiv:1409, 2014.
[12] S. Hochreiter, "The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 6, pp. 107-116, 1998, doi: 10.1142/S0218488598000094.
[13] K. Doshi, "Image Captions with Attention in Tensorflow, Step-by-step," Medium.com, April 30, 2021. Retrieved December 28, 2022, from https://link.medium.com/s77SJEyi7vb
[14] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[15] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 15 Nov. 1997, doi: 10.1162/neco.1997.9.8.1735.
[16] J. Chung, C. Gulcehre, K. Cho and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
[17] J. Lu, C. Xiong, D. Parikh and R. Socher, 2016.
[18] K. He, X. Zhang, S. Ren and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[19] G. Chaitin, "Computing Machinery and Intelligence," in Alan Turing: His Work and Impact, 2013, pp. 551-621, doi: 10.1016/B978-0-12-386980-7.50023-X.
[20] M. Hodosh, P. Young and J. Hockenmaier, "Framing image description as a ranking task: Data, models and evaluation metrics," Journal of Artificial Intelligence Research, vol. 47, pp. 853-899, 2013.
[21] Y. LeCun, Y. Bengio and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015.
[22] J. Weizenbaum, "ELIZA—a computer program for the study of natural language communication between man and machine," Communications of the ACM, vol. 9, no. 1, pp. 36-45, Jan. 1966, https://doi.org/10.1145/365153.365168.
[23] S. Liu, L. Bai, Y. Hu and H. Wang, "Image captioning based on deep neural networks," in MATEC Web of Conferences, vol. 232, p. 01052, EDP Sciences, 2018.
[24] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1251-1258.
[25] N. Mo, L. Yan, R. Zhu and H. Xie, "Class-Specific Anchor Based and Context-Guided Multi-Class Object Detection in High Resolution Remote Sensing Imagery with a Convolutional Neural Network," Remote Sensing, vol. 11, p. 272, 2019, doi: 10.3390/rs11030272.
[26] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden and X. Zhang, "TensorFlow: A system for large-scale machine learning," 2016.
[27] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei, "ImageNet: A Large-Scale Hierarchical Image Database," in CVPR, 2009.
[28] A. Graves, S. Fernández and J. Schmidhuber, "Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition," 2005, pp. 799-804.
[29] S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, 2015, doi: 10.1109/TPAMI.2016.2577031.
[30] D. Hubel and T. Wiesel, "David Hubel and Torsten Wiesel," Neuron, vol. 75, pp. 182-184, 2012, doi: 10.1016/j.neuron.2012.07.002.
[31] K. Papineni, S. Roukos, T. Ward and W.-J. Zhu, "BLEU: a Method for Automatic Evaluation of Machine Translation," 2002, doi: 10.3115/1073083.1073135.
[32] Hyunju1, "Image-Captioning" [Source code], 2018, https://github.com/HyunJu1/Image-Captioning
