Attention Based Image Caption Generation (ABICG) Using Encoder-Decoder Architecture
Abstract—Image captioning is used to develop sentences describing the scenes captured in an image or picture. The applications of image captioning are vast, although it is a tedious task for a machine to learn what a human is capable of: the model must be built in such a way that when it reads a scene, it recognizes it and reproduces to-the-point captions or descriptions. The generated descriptions must be semantically and syntactically accurate. The availability of Artificial Intelligence (AI) and Machine Learning algorithms, viz. Natural Language Processing (NLP), Deep Learning (DL) etc., makes the task easier. Although the majority of existing machine-generated captions are valid, they do not focus on the crucial parts of the images, which results in less clear captions. In this paper, Bahdanau's attention mechanism is introduced along with an encoder-decoder architecture so as to produce image captions that are more accurate and detailed. The model uses a pre-trained Convolutional Neural Network (CNN), the InceptionV3 architecture, to gather the features of images, and then a Recurrent Neural Network (RNN), the Gated Recurrent Unit (GRU) architecture, to develop captions. The model is trained on the Flickr8k dataset and the captions generated are 10% more accurate than the present state of the art.
Keywords—Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Gated Recurrent Unit (GRU), Encoder, Decoder, Attention mechanism, Image captioning.
I. INTRODUCTION
Language is the medium through which society constantly interacts, be it written or spoken. It typically describes the perceptible world around us. For physically disabled individuals, photos and symbols are an alternative way to speak and perceive. Automatic description generation from an image in proper sentences is a tedious task; nevertheless, it can have an ample impact on visually challenged individuals by giving them a better understanding of the pictures on the web.

Long back, "Image or Picture Captioning" [2] [3] [4] had always been a rigid mission, and the generated captions for a given image were not so pertinent. In conjunction with the progress of Deep Learning [21], CNNs and techniques for processing text like NLP [22], a number of previously challenging pieces of work became straightforward using Machine Learning. These are profitable in recognition, classification and captioning of images and several additional AI [19] applications. Image captioning [2] is pragmatic in various applications, viz. self-driving cars and surveillance, which are at present matters of the moment. In comparison with classification and object recognition, the task of automatically generating captions and describing images is significantly more complicated. A description of an image must include more than just the objects in it; it must also capture how the objects are related to their attributes and activities, as shown in Fig. 1.

Fig. 1. Examples of image captioning
However, it is best to express semantic knowledge in natural language such as English. A single model is to be designed which takes an image as input and is trained to give out a string of words, each affiliated to the glossary, that narrates the picture accordingly. In this paper, an ABICG model is proposed which is capable of describing images in a novel way. For this job, a dataset composed of 8000 images with five descriptions per image is taken from the Flickr8k dataset [20]. As new applications are being developed every day, Deep Learning has emerged as a prepotent field today, and test and trial is the mere way to explore it to a greater extent. Real-life applications of this technique are numerous. The use of Deep Learning for image description has been proposed in many different models, including detection of objects, captioning on the basis of visual attention and image captioning using Deep Learning. Different Deep Learning models exist as well, such as the InceptionV3 model [9], the Visual Geometry Group 16 (VGG-16) model [14], the Residual Networks (ResNet) [18]-Long Short-Term Memory (LSTM) [15] model and the traditional CNN-RNN model [23]. Both a CNN and an RNN are used here: a pre-trained CNN acts as the encoder that classifies images, and the output of its last hidden layer is used to train the RNN, which acts as a decoder to generate captions. In LSTM networks, memory cells are endowed with a limited number of phases determined by long-term dilution of existing memory information. With a total of 16 layers, VGG-16 is an ingenious model for object recognition. In the following stage, the extracted features are trained with the wording specified in the dataset. Two architectures, LSTM and GRU [16], are used for framing sentences from the given input images.

In ABICG, the Flickr8k dataset [20] is used and it is subjected to elaborate preprocessing steps to optimize the input. The preprocessed data is fed to the model, which uses Inception-V3 as the encoder and GRU as the decoder. Bahdanau's attention model is applied to this encoder-decoder model to fetch more focused captions, and meaningful captions are generated by the model. The performance of this model is evaluated by BLEU scores [31].

The paper is organized as follows. The Section 'Related Work' discusses the previous works and methodologies related to this domain, including the scope of improvement in the methods that already exist and the uniqueness of the model presented in this paper. The Section 'Proposed Work' comprises the proposed work, where the ABICG architecture is discussed along with an explanation of each component of the architecture. The Section 'Experimental Results' further elaborates on the system specifications, dataset used, data preprocessing, results, comparison of training loss and evaluation of the model using BLEU scores [32]. Finally, the conclusion and the future scope are discussed in the Section 'Conclusion'.

II. RELATED WORK

In [1], the authors have made use of a CNN as an encoder to extract the characteristics or attributes from the images. The CNN is a pre-trained InceptionV3. Since InceptionV3 is a deep network built for object detection, it demands to be altered slightly to assist in encoding: a feature vector is obtained from this deep network by removing the terminating layer. The feature vector obtained is of size (8x8x2048) and is the input to the RNN. The RNN employed for decoding is a GRU [16]. To generate more focused captions, the Bahdanau model is used.

The writers of [2] have proposed a model where the input is the Flickr8k dataset, and the output is passed to a new fully connected layer introduced at the termination of the InceptionV3 model. The task of this layer is to transform the model's output into a word-embedding vector, which serves as the input to an LSTM cell. The LSTM unit attaches the series of information and collects it progressively, hence enabling the generation of meaningful captions. The Inception-V3 component of this model is trained to recognize all possible objects in a picture, and each word of the caption is predicted using the previous words in the phrase. The main intention of training is to reduce the loss function. They have used the Flickr8k dataset, which has nearly 8000 images, each tagged with five unique captions or descriptions that offer compact reports of the noteworthy features.
In [3], the authors have put forth a model that allows neural networks to view an image automatically and yield meaningful captions similar to natural English sentences, and it is well trained to perform these tasks. Here, a pre-trained CNN is utilized to classify images; this network handles the task of encoding images. The input to the RNN (the decoder here) is the hindmost hidden layer of the encoder, and the decoder generates sentences. The dataset used here is Flickr8k [20], consisting of about 8000 images with five descriptions tagged to every image. They used VGG [14] for large-scale image recognition. They conclude that using a bulkier dataset boosts the performance of the model: in addition to reducing losses, it also improves accuracy.

To achieve better results, the authors of [4] worked on a model that combines CNN architectures and LSTM for image captioning. The proposed model uses three CNN architectures: ResNet-50 [25], Xception [24], and Inception-V3 [9]. The aptest combination of CNN and LSTM is chosen based on the model's accuracy. Training is performed on the Flickr8k data. Combining Xception with LSTM gives the highest accuracy of 75% across epochs among the three CNN models.
The authors of [5] proposed a model where CNN features are extracted from an image and encoded into vector representations using the 16 convolutional layers of VGG-16. Next, an RNN decoder model is used to develop corresponding sentences based on the learned image features, i.e., by training the features with the captions or descriptions provided in the dataset. The input images are processed using two architectures, viz. GRU and LSTM. The results make it evident that the LSTM model achieves better results than the GRU [16] model, although it takes a longer time to train and generate captions due to the model's complexity.

In [6], the authors have developed a model where pre-processed images are fed to the Inception V3 model and the features are extracted. Later, a D-dimensional representation of each part of the image is produced by the extractor as L vectors. With the spatial features of a CNN convolution layer, the decoder calculates the context vector according to the specific regions of the input image. For the decoder's job, a GRU is utilized, which has a simpler structure than LSTM [15]. The vanishing gradient problem does not affect the GRU, unlike the plain RNN. Thus, it proves that usage of GRU gives better results than LSTM.
The authors of [7] present a method to overcome the vanishing gradient problem which hinders the existing CNN-RNN models. They have proposed ResNet-LSTM as an encoder-decoder technique for image captioning: the ResNet (encoder) extracts the features and the LSTM (decoder) generates the caption from the extracted features. For this, the images are resized to (224x224x3) and subjected to several pre-processing steps. They have used the Flickr8k dataset for training the model. After a minimum of 20 epochs, meaningful captions begin to be generated. The model performs better than the VGG and CNN-RNN models.

The authors of [8] explain a multi-feature fusion model to generate image captions. Models that currently exist focus on the global characteristics of an image, but along with these comprehensive features, this model also considers the localized features of images. Global feature extraction is performed using the VGG16 network, and Faster R-CNN is used to excerpt the local characteristics. The local and global features are fused and fed as input, through an attention layer, to a Bi-LSTM [28]. The caption obtained is corrected if any error occurs. The ImageNet dataset with image size (224x224x3) is used to train VGG16 [14], the Pascal VOC dataset is used to train Faster R-CNN [29] (with a 1:1 ratio of positive to negative samples), and the Bi-LSTM is trained with the MSCOCO dataset [27]. The fused features have turned out to be superior to global or local features alone. The accuracies on the training set and verification set are 78.20% and 66.50% respectively.

Across these works, it is observed that as the depth of the hidden layers increases, the learnability of the models falls towards zero. In most of the works, LSTMs are used, which are slower and computationally less efficient as compared to the GRU. To address this issue of vanishing gradients, the GRU is utilized in the current work.

III. PROPOSED WORK

To accomplish the task of image captioning, ABICG comprises a bipartite architecture, viz. an encoder and a decoder. Images are fed to the encoder, where the image information is transformed into feature vectors. The output of the encoder is passed to the decoder, which translates the features into English sentences. This method is termed "classic image captioning". The problem with this method is that it cannot take the spatial features of an image into account: a caption is generated by considering the full image as one scene rather than attending to the sensitive or important features of the image. To enable the model to focus on the important features of the image, Bahdanau's attention has been used jointly with the encoder-decoder architecture.

The captions generated by the proposed model are semantically and grammatically correct. The generated captions are close to human-generated ones, or carry human-centric meaning, and they describe not only the scene in the image but also the intricate details and the relationship of the objects with the background.

A. Model Overview

In ABICG, the CNN used is InceptionV3 [9], pretrained on ImageNet weights, which serves as the encoder. It extracts the features from the receptive fields of the images and forwards them to the decoder. The RNN used as the decoder is a GRU; its role is to decode the sentence from the encoding. The Bahdanau attention model [11] is used to enhance the capability of the decoder by allowing it to focus on the important aspects of the images while producing the captions, thus taking care that the sensitive parts of the image are not left out in the generated caption.
The figure above shows the overall architecture of the model, which includes the CNN encoder (InceptionV3), the RNN decoder (GRU) and the attention model. The image is fed into the encoder and the tanh activation function is applied to introduce non-linearity. The output is then fed to the decoder and, at each timestep of the decoder, the attention model enables the decoder to focus on specific parts of the image.
B. Encoder

InceptionV3 was built for the purpose of object detection and receives a (299x299x3) image. Since InceptionV3 [9] is mostly used for object detection, it requires some refinement to act as an encoder for extracting the image features. The last layer, which classifies the images into labels, is eliminated, since classification of images is not required here. Thus, a feature vector of size (8x8x2048) is obtained. The resulting feature vector is static and does not alter at each timestep. Therefore, this vector is passed to the attention model along with the hidden state of the decoder to create the context vector.
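As an illustration of this encoder step, the short TensorFlow sketch below (not the authors' code; the helper names and the example.jpg path are placeholders) loads InceptionV3 without its classification head, resizes an input image to (299x299x3), and reshapes the resulting (8x8x2048) feature map into 64 location vectors for the attention model.

import tensorflow as tf

def build_feature_extractor():
    # InceptionV3 pretrained on ImageNet, with the classification layer removed,
    # so the output is the final 8x8x2048 convolutional feature map.
    base = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')
    return tf.keras.Model(inputs=base.input, outputs=base.output)

def load_image(path):
    # Resize to the 299x299x3 input expected by InceptionV3 and scale pixels to [-1, 1].
    img = tf.io.read_file(path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    return tf.keras.applications.inception_v3.preprocess_input(img)

extractor = build_feature_extractor()
features = extractor(tf.expand_dims(load_image('example.jpg'), 0))            # (1, 8, 8, 2048)
features = tf.reshape(features, (features.shape[0], -1, features.shape[3]))   # (1, 64, 2048)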
Fig. 3. Convolutional Neural Network (Inception V3) [9]

The benefit of using the InceptionV3 CNN for the encoder part is that it generates fewer parameters for computation, which makes it computationally less expensive in comparison to the other models, as well as memory efficient. In terms of depth and accuracy, the other models do not come close to InceptionV3.
C. Attention Mechanism

The attention model is a deep learning technique that makes use of an attention mechanism to provide additional focus on specific components. Bahdanau's attention model [11] is used in ABICG. It selectively highlights the relevant features of the input data and is also referred to as an interface that connects the encoder and the decoder: it supplies the decoder with the relevant details from every encoder hidden state. Decoding begins with the context vector generated by the attention model to predict the word at a particular timestep. The context vector changes at each timestep since it is adaptive in nature.

The attention model (Fig. 4) [11] performs a linear transformation of the input and applies tanh (1) to it so as to introduce non-linearities, thereby achieving a smoother distribution. Then the attention score a_s is computed. Since the output is required to lie in the range (0, 1), the softmax function is applied to the attention score and the final attention weights are obtained. This model intends to overcome the limitations of the orthodox CNN-RNN models: it facilitates passing various parts of the image instead of the whole, which also makes it swift and improves its accuracy.

a_s = tanh(W1 h_d1 + W2 h_d2)   (1)

With this score, the attention weights α are calculated using (2):

α = softmax(a_s)   (2)

Then, by using the attention weights α from (2) and the features h_d2 obtained from the encoder, the context vector c_vec is obtained with (3):

c_vec = α h_d2   (3)
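A minimal sketch of how equations (1)-(3) can be written as a Keras layer is given below. It follows the common TensorFlow formulation of Bahdanau attention, in which an extra learned projection V reduces the score to one scalar per image region before the softmax; the layer and variable names are illustrative rather than the authors' own.

import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)   # applied to the encoder features (h_d2)
        self.W2 = tf.keras.layers.Dense(units)   # applied to the decoder hidden state (h_d1)
        self.V = tf.keras.layers.Dense(1)        # reduces the score to one value per region

    def call(self, features, hidden):
        # features: (batch, 64, 2048) encoder output; hidden: (batch, units) decoder state
        hidden_with_time_axis = tf.expand_dims(hidden, 1)
        score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))   # eq. (1)
        attention_weights = tf.nn.softmax(self.V(score), axis=1)                 # eq. (2)
        context_vector = tf.reduce_sum(attention_weights * features, axis=1)     # eq. (3)
        return context_vector, attention_weights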
Ultimately, the fixed-length vector c_vec is concatenated with the decoder's output from the preceding timestep, h_t, and then fed into the RNN cell in order to obtain the decoder's output for the current timestep.
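A sketch of a single decoder step matching this description is shown below; it reuses the BahdanauAttention layer sketched above, and the embedding size, number of GRU units and vocabulary size are placeholders rather than values reported in the paper.

import tensorflow as tf

class GRUDecoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super().__init__()
        self.units = units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)
        self.fc1 = tf.keras.layers.Dense(units)
        self.fc2 = tf.keras.layers.Dense(vocab_size)
        self.attention = BahdanauAttention(units)

    def call(self, word, features, hidden):
        # word: (batch, 1) id of the previously generated word; hidden: (batch, units)
        context_vector, attention_weights = self.attention(features, hidden)
        x = self.embedding(word)                                        # (batch, 1, embedding_dim)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)  # concatenate c_vec with the word embedding
        output, state = self.gru(x)                                     # one timestep of the GRU
        x = self.fc1(output)
        x = tf.reshape(x, (-1, x.shape[2]))
        logits = self.fc2(x)                                            # scores over the vocabulary
        return logits, state, attention_weights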
In the above image, there is a white bird sitting on a sign board. The image is fed into the encoder, which extracts the image features and gives them as input to the decoder, and the decoder transforms the image feature vector into a concise caption. Here, the caption generated would be "a white bird perched on top of a red stop sign", all in lower case.

The project aims at mimicking the human brain and its ability to generate a caption for every scene it senses. Therefore, it becomes crucial to add an attention mechanism using which the CNN-RNN model focuses on the more important parts of the image.
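The word-by-word captioning process described above can be sketched as a greedy decoding loop. The snippet below assumes the extractor, load_image and GRUDecoder sketches from earlier, plus a Keras tokenizer with '<start>' and '<end>' tokens; all of these names are illustrative.

import tensorflow as tf

def generate_caption(image_path, extractor, decoder, tokenizer, max_len=40):
    # Greedy decoding: at every timestep the most probable word is appended
    # to the caption and fed back in, until '<end>' or the length limit.
    img = tf.expand_dims(load_image(image_path), 0)
    features = extractor(img)
    features = tf.reshape(features, (features.shape[0], -1, features.shape[3]))

    hidden = tf.zeros((1, decoder.units))
    word = tf.expand_dims([tokenizer.word_index['<start>']], 0)
    caption = []
    for _ in range(max_len):
        logits, hidden, _ = decoder(word, features, hidden)
        predicted_id = int(tf.argmax(logits[0]))
        if tokenizer.index_word[predicted_id] == '<end>':
            break
        caption.append(tokenizer.index_word[predicted_id])
        word = tf.expand_dims([predicted_id], 0)
    return ' '.join(caption)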
Fig. 8. Sample data from Flickr8k Dataset [20]

D. Results

1) Image 1: The image below is fed to the model and the generated caption is "large bird swooping down towards the ground", in comparison with the human-generated annotation – "a white ...".

2) Image 2: The image below is fed to the model and the generated caption is "two girls hanging upside down on monkey-bars at a park", in comparison with the human-generated annotation – "two girls are hanging upside down".
The ABICG model was trained for 25 epochs with hyperparameters such as the learning rate and batch size set to 0.001 and 64 respectively. The resulting Test and Training vs Loss plot is shown in Fig. 13. As expected, the training and testing loss decreases as the number of epochs increases.
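A sketch of this training configuration is given below. The learning rate, batch size and number of epochs are the values reported above; the choice of the Adam optimizer and of a padding-masked sparse cross-entropy loss is an assumption, made here only to complete the example.

import tensorflow as tf

EPOCHS = 25
BATCH_SIZE = 64
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

def loss_function(real, pred):
    # Average the per-token loss while ignoring padded positions (word id 0).
    mask = tf.cast(tf.not_equal(real, 0), tf.float32)
    loss = loss_object(real, pred) * mask
    return tf.reduce_sum(loss) / tf.maximum(tf.reduce_sum(mask), 1.0)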
E. Comparison of Training Losses

The 'Train Loss vs Epoch' graph was plotted for 25 epochs for both the traditional InceptionV3-GRU model (without attention) and the InceptionV3-GRU model with attention (the ABICG model). From the comparison below, it is evident that the training loss is higher in the InceptionV3-GRU model without attention [32] (loss = 0.8647) as compared to the InceptionV3-GRU model with attention, the ABICG model (loss = 0.0050).
F. Evaluation Using BLEU Scores

Before defining the BLEU score [31], it is customary to first consider 'precision', a simpler and better-known metric. Let machine-generated n-grams and ground-truth n-grams be denoted by the vectors x and y respectively. For instance, x could be taken as the words of a caption generated from an image, with x_i representing an individual word, and y could be the words from the actual captions describing the same scene; y is always expected to denote the several possible captions of a single idea. The precision is then

p = (1/N) Σ_{i=1}^{N} 1{x_i ∈ y}

The BLEU score and precision are equivalent, except for the fact that an n-gram in x is counted at most once for every occurrence of that n-gram in y. Say, the statement "is is is is is" would receive perfect precision if the word 'is' were present in the reference translation, but not necessarily a perfect BLEU score, since BLEU counts 'is' only as many times as it appears in y.
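The difference between plain precision and BLEU's clipped counting can be seen in a few lines of Python; the reference sentence below is made up purely for illustration.

from collections import Counter

def precision(x, y):
    # p = (1/N) * sum_i 1{x_i in y}: fraction of generated words found in the reference.
    return sum(1 for token in x if token in y) / len(x)

def clipped_precision(x, y):
    # BLEU-style modified unigram precision: a reference word is credited
    # at most as many times as it occurs in the reference.
    x_counts, y_counts = Counter(x), Counter(y)
    matched = sum(min(count, y_counts[token]) for token, count in x_counts.items())
    return matched / len(x)

candidate = "is is is is is".split()
reference = "the cat is on the mat".split()
print(precision(candidate, reference))          # 1.0  -- every 'is' occurs in the reference
print(clipped_precision(candidate, reference))  # 0.2  -- 'is' is only credited once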
The future scope of this work includes the usage of transformer-based models instead of the existing encoder-decoder based models, where multi-head attention coupled with positional embeddings helps render information regarding how the different words are related in the correct order.
REFERENCES

[1] V. Agrawal, S. Dhekane, N. Tuniya and V. Vyas, "Image Caption Generator Using Attention Mechanism," 2021 12th International Conference on Computing Communication and Networking Technologies (ICCCNT), 2021, pp. 1-6, doi: 10.1109/ICCCNT51525.2021.9579967.

[2] S. Degadwala, D. Vyas, H. Biswas, U. Chakraborty and S. Saha, "Image Captioning Using Inception V3 Transfer Learning Model," 2021 6th International Conference on Communication and Electronics Systems (ICCES), 2021, pp. 1103-1108, doi: 10.1109/ICCES51350.2021.9489111.

[3] Amritkar, Chetan; Jabade, Vaishali. (2018). Image Caption Generation Using Deep Learning Technique. 1-4. 10.1109/ICCUBEA.2018.8697360.

[4] C. S. Kanimozhiselvi, K. V, K. S. P and K. S, "Image Captioning Using Deep Learning," 2022 International Conference on Computer Communication and Informatics (ICCCI), 2022, pp. 1-7, doi: 10.1109/ICCCI54379.2022.9740788.

[5] Sharma, Grishma; Kalena, Priyanka; Malde, Nishi; Nair, Aromal; Parkar, Saurabh. (2019). Visual Image Caption Generator Using Deep Learning. SSRN Electronic Journal. 10.2139/ssrn.3368837.

[6] Manish Raypurkar, Abhishek Supe, Pratik Bhumkar, Pravin Borse, Shabnam Sayyad (2021). Deep Learning Based Image Caption Generator.

[7] Aishwarya Maroju, Sneha Sri Doma, Lahari Chandarlapati, 2021, Image Caption Generating Deep Learning Model, International Journal of Engineering Research & Technology (IJERT), Volume 10, Issue 09 (September 2021).

[8] M. Duan, J. Liu and S. Lv, "Encoder-decoder based multi-feature fusion model for image caption generation," Journal on Big Data, vol. 3, no. 2, pp. 77-83, 2021.

[9] Szegedy, Christian; Vanhoucke, Vincent; Ioffe, Sergey; Shlens, Jon; Wojna, Z. (2016). Rethinking the Inception Architecture for Computer Vision. 10.1109/CVPR.2016.308.

[10] Cho, Kyunghyun; van Merrienboer, Bart; Gulcehre, Caglar; Bougares, Fethi; Schwenk, Holger; Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. 10.3115/v1/D14-1179.

[11] Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473.

[12] Hochreiter, Sepp. (1998). The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems. 6. 107-116. 10.1142/S0218488598000094.

[13] Doshi, K. (2021, April 30). Image Captions with Attention in Tensorflow, Step-by-step. Medium.com. Retrieved December 28, 2022, from https://link.medium.com/s77SJEyi7vb

[15] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," in Neural Computation, vol. 9, no. 8, pp. 1735-1780, 15 Nov. 1997, doi: 10.1162/neco.1997.9.8.1735.

[16] Chung, J., Gulcehre, C., Cho, K., Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

[17] Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. 2016.

[18] He, K., Zhang, X., Ren, S., Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).

[19] Chaitin, G. (2013). Computing Machinery and Intelligence. Alan Turing: His Work and Impact. 551-621. 10.1016/B978-0-12-386980-7.50023-X.

[20] Hodosh, Micah, Peter Young, and Julia Hockenmaier. "Framing image description as a ranking task: Data, models and evaluation metrics." Journal of Artificial Intelligence Research 47 (2013): 853-899.

[21] LeCun, Y., Bengio, Y., Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.

[22] Weizenbaum, J. 1966. ELIZA—a computer program for the study of natural language communication between man and machine. Commun. ACM 9, 1 (Jan. 1966), 36-45. https://doi.org/10.1145/365153.365168

[23] Liu, S., Bai, L., Hu, Y., Wang, H. (2018). Image captioning based on deep neural networks. In MATEC Web of Conferences (Vol. 232, p. 01052). EDP Sciences.

[24] Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1251-1258).

[25] Mo, Nan; Yan, Li; Zhu, Ruixi; Xie, Hong. (2019). Class-Specific Anchor Based and Context-Guided Multi-Class Object Detection in High Resolution Remote Sensing Imagery with a Convolutional Neural Network. Remote Sensing. 11. 272. 10.3390/rs11030272.

[26] Abadi, Martín; Barham, Paul; Chen, Jianmin; Chen, Zhifeng; Davis, Andy; Dean, Jeffrey; Devin, Matthieu; Ghemawat, Sanjay; Irving, Geoffrey; Isard, Michael; Kudlur, Manjunath; Levenberg, Josh; Monga, Rajat; Moore, Sherry; Murray, Derek; Steiner, Benoit; Tucker, Paul; Vasudevan, Vijay; Warden, Pete; Zhang, Xiaoqiang. (2016). TensorFlow: A system for large-scale machine learning.

[27] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hierarchical Image Database. In: CVPR (2009).

[28] Graves, Alex; Fernández, Santiago; Schmidhuber, Jürgen. (2005). Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition. 799-804.

[29] Ren, Shaoqing; He, Kaiming; Girshick, Ross; Sun, Jian. (2015). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence. 39. 10.1109/TPAMI.2016.2577031.

[30] Hubel, David; Wiesel, Torsten. (2012). David Hubel and Torsten Wiesel. Neuron. 75. 182-4. 10.1016/j.neuron.2012.07.002.

[31] Papineni, Kishore; Roukos, Salim; Ward, Todd; Zhu, Wei-Jing. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. 10.3115/1073083.1073135.

[32] Hyunju1 (2018). Image-Captioning [Source code]. https://github.com/HyunJu1/Image-Captioning