Attention Based Image Caption Generation (ABICG) Using Encoder-Decoder Architecture
Abstract—Image captioning is used to develop sentences describing the scenes captured in an image or picture. The applications of image captioning are vast, although it is a tedious task for a machine to learn what a human is capable of: the model must be built in such a way that when it reads a scene, it recognizes it and reproduces to-the-point captions or descriptions. The generated descriptions must be semantically and syntactically accurate. The availability of Artificial Intelligence (AI) and Machine Learning algorithms, viz. Natural Language Processing (NLP), Deep Learning (DL) etc., makes the task easier. Although the majority of existing machine-generated captions are valid, they do not focus on the crucial parts of the images, which results in less clear captions. In this paper, Bahdanau's attention mechanism is introduced along with an encoder-decoder architecture so as to produce image captions that are more accurate and detailed. The model uses a pre-trained Convolutional Neural Network (CNN), the InceptionV3 architecture, to gather the features of images, and then a Recurrent Neural Network (RNN), the Gated Recurrent Unit (GRU) architecture, to develop captions. The model is trained on the Flickr8k dataset and the captions generated are 10% more accurate than the present state of the art.
Keywords—Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Gated Recurrent Unit (GRU), Encoder, Decoder, Attention mechanism, Image captioning.
I. INTRODUCTION
Language is the medium through which society constantly interacts, be it written or spoken. It typically describes the perceptible world around us. For physically disabled individuals, photos and symbols are an alternative way to speak and perceive. Automatic description generation from an image in proper sentences is a tedious task; nevertheless, it can have an ample impact on visually challenged individuals by giving them a better understanding of the pictures on the web.

Long back, "Image or Picture Captioning" [2] [3] [4] had always been a rigid mission, and the generated captions for a given image were not so pertinent. In conjunction with the progress of Deep Learning [21], CNNs and techniques for processing text like NLP [22], a number of previously challenging pieces of work became straightforward using Machine Learning. These are profitable in recognition, classification and captioning of images and several additional AI [19] applications. Image captioning [2] is pragmatic in various applications, viz. self-driving cars and surveillance, which are at present matters of the moment. In comparison with classification and object recognition, the task of automatically generating captions and describing images is significantly more complicated. A description of an image must include more than just the objects in it; it must also capture how the objects are related to their attributes and activities, as shown in Fig. 1.

Fig. 1. Examples of image captioning
However, it is best to express semantic knowledge in natural language such as English. A single model is to be designed which takes an image as input and is trained to give out a string of words, each affiliated to the glossary, that narrates the picture accordingly. In this paper, an ABICG model is proposed which is capable of describing images in a novel way. For this job, a dataset composed of 8000 images with five descriptions per image is taken from the Flickr8k dataset [20]. As new applications are being developed every day, Deep Learning has emerged as a prepotent field today, and test and trial is the mere way to explore it to a greater extent. Real-life applications of this technique are numerous. The use of Deep Learning for image description has been proposed in many different models, including detection of objects, captioning on the basis of visual attention and image captioning using Deep Learning. Different Deep Learning models exist as well, such as the InceptionV3 model [9], the Visual Geometry Group 16 (VGG-16) model [14], the Residual Networks (ResNet) [18]-Long Short-Term Memory (LSTM) [15] model and the traditional CNN-RNN model [23]. Both a CNN and an RNN are used here: a pre-trained CNN acts as the encoder that classifies images, and the output of its last hidden layer is used to train the RNN, which acts as a decoder to generate captions. In LSTM networks, memory cells are endowed with a limited number of phases determined by long-term dilution of existing memory information. With a total of 16 layers, VGG-16 is an ingenious model for object recognition. In the following stage, the extracted features are trained with the wording specified in the dataset. Two architectures, LSTM and GRU [16], are used for framing sentences from the given input images.

In ABICG, the Flickr8k dataset [20] is used and it is subjected to elaborate preprocessing steps to optimize the input. The preprocessed data is fed to the model, which uses Inception-V3 as the encoder and GRU as the decoder. Bahdanau's attention model is applied to this encoder-decoder model to fetch more focused captions, and meaningful captions are generated by the model. The performance of this model is evaluated by BLEU scores [31].

The paper is organized as follows. The Section 'Related Work' discusses the previous works and methodologies related to this domain, including the scope of improvement in the methods that already exist and the uniqueness of the model presented in this paper. The Section 'Proposed Work' comprises the proposed work, where the ABICG architecture is discussed along with an explanation of each component of the architecture. The Section 'Experimental Results' further elaborates on the system specifications, dataset used, data preprocessing, results, comparison of training loss and evaluation of the model using BLEU scores [32]. Finally, the conclusion and the future scope are discussed in the Section 'Conclusion'.

II. RELATED WORK

In [1], the authors have made use of a CNN as an encoder to extract the characteristics or attributes from the images. The CNN is a pre-trained InceptionV3. Since InceptionV3 is a deep network built for object detection, it demands to be altered slightly to assist in encoding: a feature vector is obtained from this deep network by removing the terminating layer. The feature vector obtained is of size (8x8x2048) and is the input to the RNN. The RNN employed for decoding is a GRU [16]. To generate more focused captions, the Bahdanau model is used.

The writers of [2] have proposed a model where the input is the Flickr8k dataset, and the output is passed to a new fully connected layer introduced at the termination of the InceptionV3 model. The task of this layer is to transform the model's output into a word-embedding vector, which serves as the input to an LSTM cell. The LSTM unit attaches the series of information and collects it progressively, hence enabling the generation of meaningful captions. The Inception-V3 component of this model is trained to recognize all possible objects in a picture, and each word of the caption is predicted using the previous words in the phrase. The main intention of training is to reduce the loss function. They have used the Flickr8k dataset, which has nearly 8000 images, each tagged with five unique captions or descriptions that offer compact reports of the noteworthy features.
In [3], the authors have put forth a model that allows neural networks to view an image automatically and yield meaningful captions similar to natural English sentences, and it is well trained to perform these tasks. Here, a pre-trained CNN is utilized to classify images; this network handles the task of encoding images. The input to the RNN (the decoder here) is the hindmost hidden layer of the encoder, and the decoder generates sentences. The dataset used here is Flickr8k [20], consisting of about 8000 images with five descriptions tagged to every image. They used VGG [14] for large-scale image recognition. They conclude that using a bulkier dataset boosts the performance of the model: in addition to reducing losses, it also improves accuracy.

To achieve better results, the authors of [4] worked on a model that combines CNN architectures and LSTM for image captioning. The proposed model uses three CNN architectures: ResNet-50 [25], Xception [24], and Inception-V3 [9]. The aptest combination of CNN and LSTM is chosen based on the model's accuracy. Training is performed on the Flickr8k data. Combining Xception with LSTM gives the highest accuracy of 75% across epochs among the three CNN models.
The authors of [5] proposed a model where CNN features are extracted from an image and encoded into vector representations using the 16 convolutional layers of VGG-16. Next, an RNN decoder model is used to develop corresponding sentences based on the learned image features, i.e., by training the features with the captions or descriptions provided in the dataset. The input images are processed using two architectures, viz. GRU and LSTM. The results make it evident that the LSTM model achieves better results than the GRU [16] model, although it takes a longer time to train and generate captions due to the model's complexity.

In [6], the authors have developed a model where pre-processed images are fed to the Inception V3 model and the features are extracted. Later, a D-dimensional representation of each part of the image is produced by the extractor as L vectors. With the spatial features of a CNN convolution layer, the decoder calculates the context vector according to the specific regions of the input image. For the decoder's job, a GRU is utilized, which has a simpler structure than LSTM [15]. The vanishing gradient problem does not affect the GRU, unlike the plain RNN. Thus, it proves that usage of GRU gives better results than LSTM.
The authors of [7] present a method to overcome the vanishing gradient problem which hinders the existing CNN-RNN models. They have proposed ResNet-LSTM as an encoder-decoder technique for image captioning: the ResNet (encoder) extracts the features and the LSTM (decoder) generates the caption from the extracted features. For this, the images are resized to (224x224x3) and subjected to several pre-processing steps. They have used the Flickr8k dataset for training the model. After a minimum of 20 epochs, meaningful captions begin to be generated. The model performs better than the VGG and CNN-RNN models.

The authors of [8] explain a multi-feature fusion model to generate image captions. Models that currently exist focus on the global characteristics of an image, but along with these comprehensive features, this model also considers the localized features of images. Global feature extraction is performed using the VGG16 network, and Faster R-CNN is used to excerpt the local characteristics. The local and global features are fused and fed as input, through an attention layer, to a Bi-LSTM [28]. The caption obtained is corrected if any error occurs. The ImageNet dataset with image size (224x224x3) is used to train VGG16 [14], the Pascal VOC dataset is used to train Faster R-CNN [29] (with a 1:1 ratio of positive to negative samples), and the Bi-LSTM is trained with the MSCOCO dataset [27]. The fused features have turned out to be superior to global or local features alone. The accuracies on the training set and verification set are 78.20% and 66.50% respectively.

Across these works, it is observed that as the depth of the hidden layers increases, the learnability of the models falls towards zero. In most of the works, LSTMs are used, which are slower and computationally less efficient as compared to the GRU. To address this issue of vanishing gradients, the GRU is utilized in the current work.

III. PROPOSED WORK

To accomplish the task of image captioning, ABICG comprises a bipartite architecture, viz. an encoder and a decoder. Images are fed to the encoder, where the image information is transformed into feature vectors. The output of the encoder is passed to the decoder, which translates the features into English sentences. This method is termed "classic image captioning". The problem with this method is that it cannot take the spatial features of an image into account: a caption is generated by considering the full image as one scene rather than attending to the sensitive or important features of the image. To enable the model to focus on the important features of the image, Bahdanau's attention has been used jointly with the encoder-decoder architecture.

The captions generated by the proposed model are semantically and grammatically correct. The generated captions are close to human-generated ones, or carry human-centric meaning, and they describe not only the scene in the image but also the intricate details and the relationship of the objects with the background.

A. Model Overview

In ABICG, the CNN used is InceptionV3 [9], pretrained on ImageNet weights, which serves as the encoder. It extracts the features from the receptive fields of the images and forwards them to the decoder. The RNN used as the decoder is a GRU; its role is to decode the sentence from the encoding. The Bahdanau attention model [11] is used to enhance the capability of the decoder by allowing it to focus on the important aspects of the images while producing the captions, thus taking care that the sensitive parts of the image are not left out in the generated caption.
The figure above shows the overall architecture of the model, which includes the CNN encoder (InceptionV3), the RNN decoder (GRU) and the attention model. The image is fed into the encoder and the tanh activation function is applied to introduce non-linearity. The output is then fed to the decoder and, at each timestep of the decoder, the attention model enables the decoder to focus on specific parts of the image.
B. Encoder

InceptionV3 was built for the purpose of object detection and receives a (299x299x3) image. Since InceptionV3 [9] is mostly used for object detection, it requires some refinement to act as an encoder for extracting the image features. The last layer, which classifies the images into labels, is eliminated, since classification of images is not required here. Thus, a feature vector of size (8x8x2048) is obtained. The resulting feature vector is static and does not alter at each timestep. Therefore, this vector is passed to the attention model along with the hidden state of the decoder to create the context vector.
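As an illustration of this encoder step, the short TensorFlow sketch below (not the authors' code; the helper names and the example.jpg path are placeholders) loads InceptionV3 without its classification head, resizes an input image to (299x299x3), and reshapes the resulting (8x8x2048) feature map into 64 location vectors for the attention model.

import tensorflow as tf

def build_feature_extractor():
    # InceptionV3 pretrained on ImageNet, with the classification layer removed,
    # so the output is the final 8x8x2048 convolutional feature map.
    base = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')
    return tf.keras.Model(inputs=base.input, outputs=base.output)

def load_image(path):
    # Resize to the 299x299x3 input expected by InceptionV3 and scale pixels to [-1, 1].
    img = tf.io.read_file(path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    return tf.keras.applications.inception_v3.preprocess_input(img)

extractor = build_feature_extractor()
features = extractor(tf.expand_dims(load_image('example.jpg'), 0))            # (1, 8, 8, 2048)
features = tf.reshape(features, (features.shape[0], -1, features.shape[3]))   # (1, 64, 2048)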
Fig. 3. Convolutional Neural Network (Inception V3) [9]

The benefit of using the InceptionV3 CNN for the encoder part is that it generates fewer parameters for computation, which makes it computationally less expensive in comparison to the other models, as well as memory efficient. In terms of depth and accuracy, the other models do not come close to InceptionV3.
C. Attention Mechanism

The attention model is a deep learning technique that makes use of an attention mechanism to provide additional focus on specific components. Bahdanau's attention model [11] is used in ABICG. It selectively highlights the relevant features of the input data and is also referred to as an interface that connects the encoder and the decoder: it supplies the decoder with the relevant details from every encoder hidden state. Decoding begins with the context vector generated by the attention model to predict the word at a particular timestep. The context vector changes at each timestep since it is adaptive in nature.

The attention model (Fig. 4) [11] performs a linear transformation of the input and applies tanh (1) to it so as to introduce non-linearities, thereby achieving a smoother distribution. Then the attention score a_s is computed. Since the output is required to lie in the range (0, 1), the softmax function is applied to the attention score and the final attention weights are obtained. This model intends to overcome the limitations of the orthodox CNN-RNN models: it facilitates passing various parts of the image instead of the whole, which also makes it swift and improves its accuracy.

a_s = tanh(W1 h_d1 + W2 h_d2)   (1)

With this score, the attention weights α are calculated using (2):

α = softmax(a_s)   (2)

Then, by using the attention weights α from (2) and the features h_d2 obtained from the encoder, the context vector c_vec is obtained with (3):

c_vec = α h_d2   (3)
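A minimal sketch of how equations (1)-(3) can be written as a Keras layer is given below. It follows the common TensorFlow formulation of Bahdanau attention, in which an extra learned projection V reduces the score to one scalar per image region before the softmax; the layer and variable names are illustrative rather than the authors' own.

import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)   # applied to the encoder features (h_d2)
        self.W2 = tf.keras.layers.Dense(units)   # applied to the decoder hidden state (h_d1)
        self.V = tf.keras.layers.Dense(1)        # reduces the score to one value per region

    def call(self, features, hidden):
        # features: (batch, 64, 2048) encoder output; hidden: (batch, units) decoder state
        hidden_with_time_axis = tf.expand_dims(hidden, 1)
        score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))   # eq. (1)
        attention_weights = tf.nn.softmax(self.V(score), axis=1)                 # eq. (2)
        context_vector = tf.reduce_sum(attention_weights * features, axis=1)     # eq. (3)
        return context_vector, attention_weights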
Ultimately, the fixed-length vector c_vec is concatenated with the decoder's output from the preceding timestep, h_t, and then fed into the RNN cell in order to obtain the decoder's output for the current timestep.
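A sketch of a single decoder step matching this description is shown below; it reuses the BahdanauAttention layer sketched above, and the embedding size, number of GRU units and vocabulary size are placeholders rather than values reported in the paper.

import tensorflow as tf

class GRUDecoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super().__init__()
        self.units = units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)
        self.fc1 = tf.keras.layers.Dense(units)
        self.fc2 = tf.keras.layers.Dense(vocab_size)
        self.attention = BahdanauAttention(units)

    def call(self, word, features, hidden):
        # word: (batch, 1) id of the previously generated word; hidden: (batch, units)
        context_vector, attention_weights = self.attention(features, hidden)
        x = self.embedding(word)                                        # (batch, 1, embedding_dim)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)  # concatenate c_vec with the word embedding
        output, state = self.gru(x)                                     # one timestep of the GRU
        x = self.fc1(output)
        x = tf.reshape(x, (-1, x.shape[2]))
        logits = self.fc2(x)                                            # scores over the vocabulary
        return logits, state, attention_weights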
In the above image, there is a white bird sitting on a sign board. The image is fed into the encoder, which extracts the image features and gives them as input to the decoder, and the decoder transforms the image feature vector into a concise caption. Here, the caption generated would be "a white bird perched on top of a red stop sign", all in lower case.

The project aims at mimicking the human brain and its ability to generate a caption for every scene it senses. Therefore, it becomes crucial to add an attention mechanism using which the CNN-RNN model focuses on the more important parts of the image.
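The word-by-word captioning process described above can be sketched as a greedy decoding loop. The snippet below assumes the extractor, load_image and GRUDecoder sketches from earlier, plus a Keras tokenizer with '<start>' and '<end>' tokens; all of these names are illustrative.

import tensorflow as tf

def generate_caption(image_path, extractor, decoder, tokenizer, max_len=40):
    # Greedy decoding: at every timestep the most probable word is appended
    # to the caption and fed back in, until '<end>' or the length limit.
    img = tf.expand_dims(load_image(image_path), 0)
    features = extractor(img)
    features = tf.reshape(features, (features.shape[0], -1, features.shape[3]))

    hidden = tf.zeros((1, decoder.units))
    word = tf.expand_dims([tokenizer.word_index['<start>']], 0)
    caption = []
    for _ in range(max_len):
        logits, hidden, _ = decoder(word, features, hidden)
        predicted_id = int(tf.argmax(logits[0]))
        if tokenizer.index_word[predicted_id] == '<end>':
            break
        caption.append(tokenizer.index_word[predicted_id])
        word = tf.expand_dims([predicted_id], 0)
    return ' '.join(caption)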
Fig. 8. Sample data from Flickr8k Dataset [20]

D. Results

1) Image 1: The image below is fed to the model and the generated caption is "large bird swooping down towards the ground", in comparison with the human-generated annotation – "a white ...".

2) Image 2: The image below is fed to the model and the generated caption is "two girls hanging upside down on monkey-bars at a park", in comparison with the human-generated annotation – "two girls are hanging upside down".
The ABICG model was trained for 25 epochs with hyperparameters such as the learning rate and batch size set to 0.001 and 64 respectively. The resulting Test and Training vs Loss plot is shown in Fig. 13. As expected, the training and testing loss decreases as the number of epochs increases.
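A sketch of this training configuration is given below. The learning rate, batch size and number of epochs are the values reported above; the choice of the Adam optimizer and of a padding-masked sparse cross-entropy loss is an assumption, made here only to complete the example.

import tensorflow as tf

EPOCHS = 25
BATCH_SIZE = 64
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

def loss_function(real, pred):
    # Average the per-token loss while ignoring padded positions (word id 0).
    mask = tf.cast(tf.not_equal(real, 0), tf.float32)
    loss = loss_object(real, pred) * mask
    return tf.reduce_sum(loss) / tf.maximum(tf.reduce_sum(mask), 1.0)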
E. Comparison of Training Losses

The 'Train Loss vs Epoch' graph was plotted for 25 epochs for both the traditional InceptionV3-GRU model (without attention) and the InceptionV3-GRU model with attention (the ABICG model). From the comparison below, it is evident that the training loss is higher in the InceptionV3-GRU model without attention [32] (loss = 0.8647) as compared to the InceptionV3-GRU model with attention, the ABICG model (loss = 0.0050).
F. Evaluation Using BLEU Scores

Before defining the BLEU score [31], it is customary to first consider 'precision', a simpler and better-known metric. Let machine-generated n-grams and ground-truth n-grams be denoted by the vectors x and y respectively. For instance, x could be taken as the words of a caption generated from an image, with x_i representing an individual word, and y could be the words from the actual captions describing the same scene; y is always expected to denote the several possible captions of a single idea. The precision is then

p = (1/N) Σ_{i=1}^{N} 1{x_i ∈ y}

The BLEU score and precision are equivalent, except for the fact that an n-gram in x is counted at most once for every occurrence of that n-gram in y. Say, the statement "is is is is is" would receive perfect precision if the word 'is' were present in the reference translation, but not necessarily a perfect BLEU score, since BLEU counts 'is' only as many times as it appears in y.
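The difference between plain precision and BLEU's clipped counting can be seen in a few lines of Python; the reference sentence below is made up purely for illustration.

from collections import Counter

def precision(x, y):
    # p = (1/N) * sum_i 1{x_i in y}: fraction of generated words found in the reference.
    return sum(1 for token in x if token in y) / len(x)

def clipped_precision(x, y):
    # BLEU-style modified unigram precision: a reference word is credited
    # at most as many times as it occurs in the reference.
    x_counts, y_counts = Counter(x), Counter(y)
    matched = sum(min(count, y_counts[token]) for token, count in x_counts.items())
    return matched / len(x)

candidate = "is is is is is".split()
reference = "the cat is on the mat".split()
print(precision(candidate, reference))          # 1.0  -- every 'is' occurs in the reference
print(clipped_precision(candidate, reference))  # 0.2  -- 'is' is only credited once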
The future scope of this work includes the usage of transformer-based models instead of the existing encoder-decoder based models, where multi-head attention coupled with positional embeddings helps render information regarding how the different words are related in the correct order.
REFERENCES

[1] V. Agrawal, S. Dhekane, N. Tuniya and V. Vyas, "Image Caption Generator Using Attention Mechanism," 2021 12th International Conference on Computing Communication and Networking Technologies (ICCCNT), 2021, pp. 1-6, doi: 10.1109/ICCCNT51525.2021.9579967.

[2] S. Degadwala, D. Vyas, H. Biswas, U. Chakraborty and S. Saha, "Image Captioning Using Inception V3 Transfer Learning Model," 2021 6th International Conference on Communication and Electronics Systems (ICCES), 2021, pp. 1103-1108, doi: 10.1109/ICCES51350.2021.9489111.

[3] Amritkar, Chetan; Jabade, Vaishali. (2018). Image Caption Generation Using Deep Learning Technique. 1-4. 10.1109/ICCUBEA.2018.8697360.

[4] C. S. Kanimozhiselvi, K. V, K. S. P and K. S, "Image Captioning Using Deep Learning," 2022 International Conference on Computer Communication and Informatics (ICCCI), 2022, pp. 1-7, doi: 10.1109/ICCCI54379.2022.9740788.

[5] Sharma, Grishma; Kalena, Priyanka; Malde, Nishi; Nair, Aromal; Parkar, Saurabh. (2019). Visual Image Caption Generator Using Deep Learning. SSRN Electronic Journal. 10.2139/ssrn.3368837.

[6] Manish Raypurkar, Abhishek Supe, Pratik Bhumkar, Pravin Borse, Shabnam Sayyad (2021). Deep Learning Based Image Caption Generator.

[7] Aishwarya Maroju, Sneha Sri Doma, Lahari Chandarlapati, 2021, Image Caption Generating Deep Learning Model, International Journal of Engineering Research & Technology (IJERT), Volume 10, Issue 09 (September 2021).

[8] M. Duan, J. Liu and S. Lv, "Encoder-decoder based multi-feature fusion model for image caption generation," Journal on Big Data, vol. 3, no. 2, pp. 77-83, 2021.

[9] Szegedy, Christian; Vanhoucke, Vincent; Ioffe, Sergey; Shlens, Jon; Wojna, Z. (2016). Rethinking the Inception Architecture for Computer Vision. 10.1109/CVPR.2016.308.

[10] Cho, Kyunghyun; van Merrienboer, Bart; Gulcehre, Caglar; Bougares, Fethi; Schwenk, Holger; Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. 10.3115/v1/D14-1179.

[11] Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473.

[12] Hochreiter, Sepp. (1998). The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems. 6. 107-116. 10.1142/S0218488598000094.

[13] Doshi, K. (2021, April 30). Image Captions with Attention in Tensorflow, Step-by-step. Medium.com. Retrieved December 28, 2022, from https://link.medium.com/s77SJEyi7vb

[15] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," in Neural Computation, vol. 9, no. 8, pp. 1735-1780, 15 Nov. 1997, doi: 10.1162/neco.1997.9.8.1735.

[16] Chung, J., Gulcehre, C., Cho, K., Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

[17] Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. 2016.

[18] He, K., Zhang, X., Ren, S., Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).

[19] Chaitin, G. (2013). Computing Machinery and Intelligence. Alan Turing: His Work and Impact. 551-621. 10.1016/B978-0-12-386980-7.50023-X.

[20] Hodosh, Micah, Peter Young, and Julia Hockenmaier. "Framing image description as a ranking task: Data, models and evaluation metrics." Journal of Artificial Intelligence Research 47 (2013): 853-899.

[21] LeCun, Y., Bengio, Y., Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.

[22] Weizenbaum, J. 1966. ELIZA—a computer program for the study of natural language communication between man and machine. Commun. ACM 9, 1 (Jan. 1966), 36-45. https://doi.org/10.1145/365153.365168

[23] Liu, S., Bai, L., Hu, Y., Wang, H. (2018). Image captioning based on deep neural networks. In MATEC Web of Conferences (Vol. 232, p. 01052). EDP Sciences.

[24] Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1251-1258).

[25] Mo, Nan; Yan, Li; Zhu, Ruixi; Xie, Hong. (2019). Class-Specific Anchor Based and Context-Guided Multi-Class Object Detection in High Resolution Remote Sensing Imagery with a Convolutional Neural Network. Remote Sensing. 11. 272. 10.3390/rs11030272.

[26] Abadi, Martín; Barham, Paul; Chen, Jianmin; Chen, Zhifeng; Davis, Andy; Dean, Jeffrey; Devin, Matthieu; Ghemawat, Sanjay; Irving, Geoffrey; Isard, Michael; Kudlur, Manjunath; Levenberg, Josh; Monga, Rajat; Moore, Sherry; Murray, Derek; Steiner, Benoit; Tucker, Paul; Vasudevan, Vijay; Warden, Pete; Zhang, Xiaoqiang. (2016). TensorFlow: A system for large-scale machine learning.

[27] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hierarchical Image Database. In: CVPR (2009).

[28] Graves, Alex; Fernández, Santiago; Schmidhuber, Jürgen. (2005). Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition. 799-804.

[29] Ren, Shaoqing; He, Kaiming; Girshick, Ross; Sun, Jian. (2015). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence. 39. 10.1109/TPAMI.2016.2577031.

[30] Hubel, David; Wiesel, Torsten. (2012). David Hubel and Torsten Wiesel. Neuron. 75. 182-4. 10.1016/j.neuron.2012.07.002.

[31] Papineni, Kishore; Roukos, Salim; Ward, Todd; Zhu, Wei-Jing. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. 10.3115/1073083.1073135.

[32] Hyunju1 (2018). Image-Captioning [Source code]. https://github.com/HyunJu1/Image-Captioning