e-ISSN: 2582-5208

International Research Journal of Modernization in Engineering Technology and Science


( Peer-Reviewed, Open Access, Fully Refereed International Journal )
Volume:04/Issue:06/June-2022 Impact Factor- 6.752 www.irjmets.com
IMAGE CAPTION GENERATION USING DEEP LEARNING TECHNIQUE
Dr. P. Srinivasa Rao*1, Thipireddy Pavankumar*2, Raghu Mukkera*3,
Gopu Hruthik Kiran*4, Velisala Hariprasad*5
*1Professor, Department Of Computer Science And Engineering, JB Institute Of Engineering
And Technology, Hyderabad, Telangana, India.
*2,3,4,5B.Tech Student, Department Of Computer Science And Engineering, JB Institute Of Engineering
And Technology, Hyderabad, Telangana, India.
ABSTRACT
An image caption is a description of an image in the form of text. It is widely used in applications where information from an image is needed automatically in text form. We analyze three components of the process: convolutional neural networks (CNN), recurrent neural networks (RNN) and sentence generation. The paper develops a model that decomposes both images and sentences into their elements, relating image regions to fragments of language with the help of an LSTM model and NLP methods. It also presents an implementation of the LSTM method with additional efficiency features. The Gated Recurrent Unit (GRU) and the LSTM method are compared in this paper. According to tests using BLEU metrics, LSTM is identified as the better of the two, with 80% efficiency. This method improves on the best results on the Visual Genome paragraph-caption dataset.
Keywords: CNN, RNN, LSTM, GRU.
I. INTRODUCTION
Image captions aim to describe the objects, actions, and details found in an image using natural language. Most image captioning research focuses on single-sentence captions, but the descriptive capacity of this form is limited; one sentence can describe in detail only a small part of an image. Recent work has instead addressed paragraph captioning, the task of generating a paragraph (usually 5-8 sentences) describing the image. Compared to single-sentence captioning, paragraph captioning is a relatively new task. The main dataset for paragraph captioning is the Visual Genome paragraph corpus, presented by Krause et al. (2016). When strong single-sentence captioning models are trained on this dataset, they produce repetitive paragraphs that fail to cover the various aspects of the images: the generated paragraphs repeat slight variations of the same sentence many times, even when beam search is used.
Similarly, different methods are used for this task, namely: Long-term Recurrent Networks: the input can be a single image or an image sequence from video frames, and the output is a caption [4]. Visual Paragraph Generation: this method aims to produce a coherent and detailed paragraph; a few semantic regions are extracted from the image using an attention model, and sentences are generated sequentially to form the paragraph [14]. RNN: a recurrent neural network is a specialized neural network for processing sequences of data indexed by a time step t from 1 to T. For tasks that involve sequential inputs, such as speech and language, it is usually best to use RNNs, since each output depends on the inputs that came before it.
GRU: the Gated Recurrent Unit is a gated recurrent unit of more recent design, proposed by Cho et al. Similar to the LSTM unit, the GRU has gating units that modulate the flow of information inside the unit, however, without having separate memory cells. The GRU uses two gates, called the update and reset gates, that control the flow of information for each hidden unit. The hidden state at step t is calculated as follows:
z_t = σ(W_z x_t + U_z h_{t−1})            (update gate)
r_t = σ(W_r x_t + U_r h_{t−1})            (reset gate)
h̃_t = tanh(W x_t + U (r_t ⊙ h_{t−1}))     (new memory)
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t     (final memory)
The update gate controls how much of the new memory and how much of the old memory should be combined in the final memory. The reset gate is calculated similarly but with a different set of weights; it controls the balance between the previous memory and the new input information when forming the new memory.
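As an illustration of these gate computations, a minimal NumPy sketch of a single GRU step is given below; the matrix sizes and random weights are illustrative placeholders, not values used in this work.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One GRU time step following the update/reset-gate equations above."""
    z = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate
    h_tilde = np.tanh(W @ x_t + U @ (r * h_prev))   # new (candidate) memory
    h_t = (1 - z) * h_prev + z * h_tilde            # final memory
    return h_t

# Illustrative sizes: 8-dimensional input, 16-dimensional hidden state.
rng = np.random.default_rng(0)
x_t, h_prev = rng.normal(size=8), np.zeros(16)
Wz, Wr, W = (rng.normal(size=(16, 8)) for _ in range(3))
Uz, Ur, U = (rng.normal(size=(16, 16)) for _ in range(3))
print(gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U).shape)  # (16,)
```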
II. RELATED WORK
Machine learning is the idea of learning from examples and experience, without being explicitly programmed. Instead of writing code, you feed data to a generic algorithm, and it builds its logic based on the given data. Machine learning is a field that grew out of Artificial Intelligence (AI). Applying AI, we wanted to build better and more intelligent machines. But except for a few simple tasks, such as finding the shortest path between points A and B, we were unable to program more complex and constantly evolving challenges. There was a
realisation that the only way to achieve this was to let the machine learn by itself. This is similar to a child learning from its own experience. So machine learning was developed as a new capability for computers. Machine learning is now present in so many segments of technology that we do not even realise it while using it.
Supervised learning is the most popular machine learning paradigm. It is easy to understand and very easy to use. It is the task of learning a function that maps inputs to outputs based on example input-output pairs. The function is inferred from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (usually a vector) and the desired output value (also called the supervisory signal). The supervised learning algorithm analyzes the training data and produces an inferred function, which can be used to map new examples. Supervised learning is very similar to teaching a child with the data provided, where the data takes the form of labeled examples. We can feed the learning algorithm these example-label pairs one by one, allowing it to predict the label for each example and telling it whether its answer is correct or not. Over time, the algorithm learns to approximate the exact nature of the relationship between examples and their labels. When fully trained, the supervised learning algorithm is able to observe a new, never-before-seen example and predict a good label for it.
Unsupervised learning is a machine learning method in which you do not need to supervise the model. Instead, you let the model work on its own to discover information. It works with unlabeled data and looks for previously undetected patterns in a data set with no pre-existing labels and minimal human supervision. In contrast to supervised learning, which usually makes use of human-labeled data, unsupervised learning, also known as self-organization, allows the modeling of the structure of the inputs themselves.
A neural network (or artificial neural network, ANN) has the ability to learn by example. An ANN is an information-processing model inspired by the biological neuron system. ANNs are biologically inspired simulations performed on a computer to carry out specific tasks such as clustering, classification and pattern recognition. They are composed of a large number of highly interconnected processing units, known as neurons, working together to solve problems. A neural network follows a non-linear approach and processes information in parallel across all its nodes. A neural network is a complex adaptive system: adaptive means it has the ability to change its internal structure by adjusting the weights of its inputs.
Deep learning is a branch of machine learning based entirely on artificial neural networks. Deep learning is an artificial intelligence function that imitates the workings of the human brain in processing data and creating patterns for use in decision making. Deep learning is a subset of machine learning in artificial intelligence (AI) with networks capable of learning, without supervision, from data that is unstructured or unlabeled. It uses a large number of hidden layers and is also known as deep neural learning or deep neural networks. Deep learning has evolved hand in hand with the digital era, which has brought about an explosion of data in all forms and from every region of the world. This data, known as big data, is drawn from sources such as social media, internet search engines, e-commerce platforms, and online cinemas, among others. This enormous amount of data is readily accessible and can be shared through fintech applications such as cloud computing. However, the data, which is normally unstructured, is so vast that it could take decades for humans to comprehend it and extract the relevant information. Companies realize the incredible potential that can result from unraveling this wealth of information and are increasingly adapting AI systems for automated support. Deep learning learns from large amounts of unstructured data that would normally take decades for humans to understand and process. Deep learning also uses a hierarchy of neural network layers to carry out the process of machine learning. The neural networks are built like the human brain, with neuron nodes connected together like a web. While traditional programs build analysis of data in a linear way, the hierarchical function of deep learning systems enables machines to process data with a non-linear approach.
III. METHODOLOGY
The goal of image paragraph captioning is to produce captions for the regions of an image and then combine those captions to obtain the output paragraph. Tokenization is the first module in this process, where the stream of characters is divided into tokens before further processing. It is the act of splitting a continuous string into pieces such as words, keywords, phrases, symbols and other elements, called tokens. The tokens are stored in a file and used when needed.
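A minimal sketch of this tokenization step, using the spaCy English tokenizer referred to later in the algorithm steps; the caption string and the output file name are hypothetical examples:

```python
import spacy

# A blank English pipeline provides just the rule-based tokenizer.
nlp = spacy.blank("en")

caption = "A black dog is running through the water"  # hypothetical example sentence
tokens = [tok.text.lower() for tok in nlp(caption)]
print(tokens)  # ['a', 'black', 'dog', 'is', 'running', 'through', 'the', 'water']

# The tokens are stored in a file and reused when needed.
with open("tokens.txt", "w") as f:
    f.write(" ".join(tokens))
```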

Data pre-processing is the process of filtering duplicates out of the data and retaining it in its purest form. Object identification [4, 13] is the second module in this work, where objects are extracted to make the later stages easier. This is done using the LSTM model. Fig. 1 shows the processing flow. Initially the image is uploaded. In the first step the features in the picture are found. Then the extracted features are given to the LSTM, where the word related to each object feature is found and a sentence is produced. Afterwards, the flow moves to the paragraph stage, where several sentences are formed and the paragraph is given as output.
Sentence generation is the third module in this activity. Words are generated by recognizing the objects among the extracted features and taking tokens from the file as captions. Each word is appended to the previously generated words to form the sentence.
Paragraph generation is the last module of this activity. The sentences produced are arranged in a logical order that conveys a coherent meaning. A sketch of this word-by-word and sentence-by-sentence assembly is given below.
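The loop below is only a sketch of that assembly: the predict_next_word argument stands in for the trained LSTM decoder, and the canned demo predictor exists purely for illustration.

```python
def generate_sentence(predict_next_word, image_features, max_words=20):
    """Append one predicted word at a time until the end token is produced."""
    words = ["startseq"]
    for _ in range(max_words):
        next_word = predict_next_word(image_features, words)
        if next_word == "endseq":
            break
        words.append(next_word)
    return " ".join(words[1:])

def generate_paragraph(predict_next_word, region_features):
    """Generate one sentence per image region and arrange them into a paragraph."""
    sentences = [generate_sentence(predict_next_word, feats) for feats in region_features]
    return ". ".join(sentences) + "."

# Stand-in predictor that simply replays a fixed caption, for illustration only.
canned = iter("a dog runs through the water endseq".split())
demo_predict = lambda feats, words: next(canned)
print(generate_paragraph(demo_predict, [None]))  # "a dog runs through the water."
```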
IV. MODELING AND ANALYSIS

In this project the Flickr8k dataset is used, which consists of 8,000 images. Data preprocessing is performed on these images, splitting the dataset into train, test and validation sets.
Algorithm Steps
Step 1: Download the Visual Genome dataset and perform preprocessing.
Step 2: Download the spaCy English tokenizer and convert the text into tokens.
Step 3: Extract image features using an object detector named LSTM.
Step 4: Features are generated from tokenization, on which the LSTM is trained, and it generates the captions.
Step 5: A paragraph is generated by combining all the captions.
Interpreting an image is the problem of producing a human-readable description of the image, such as a description of an object or a scene.
The problem is sometimes called "automatic image annotation" or "image tagging." It is a simple problem for a human, but very challenging for a machine.
Data pre-processing – Images : Images are nothing but the input (X) to our model. As you probably know, any input to the model should be given in the form of a vector.
We need to convert every image into a fixed-size vector that can be fed as input to the neural network. For this purpose, we opt for transfer learning using the InceptionV3 model (a convolutional neural network) created by Google Research.
This model was trained on the ImageNet dataset to perform image classification into 1000 different classes of images. However, our goal here is not to classify the image but simply to obtain a fixed-length informative vector for each image. This process is called automatic feature engineering.
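A minimal sketch of this feature-extraction step with Keras, assuming InceptionV3 with ImageNet weights; the image path is a hypothetical example:

```python
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model

# Load InceptionV3 with ImageNet weights and drop the final softmax layer,
# so the encoder outputs a fixed-length 2048-dimensional feature vector.
base = InceptionV3(weights="imagenet")
encoder = Model(inputs=base.input, outputs=base.layers[-2].output)

def encode_image(path):
    """Convert one image file into a fixed-size feature vector."""
    img = image.load_img(path, target_size=(299, 299))  # InceptionV3 input size
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return encoder.predict(x, verbose=0).reshape(2048)

features = encode_image("images/example.jpg")  # hypothetical path
print(features.shape)  # (2048,)
```

Each 2048-dimensional vector can then be cached to disk so the encoder only has to run once per image before training the caption model.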
Data pre-processing – Captions : We must realize that the captions are what we want to predict. Therefore, during the training period, the captions are the target variable (Y) that the model learns to predict.
But the prediction of the whole caption, given the image, does not happen all at once. We predict the caption word by word. Therefore, we need to encode each word into a fixed-size vector.
Two dictionaries are used: "wordtoix" (pronounced word-to-index) and "ixtoword" (pronounced index-to-word), which map words to integer indices and back, as sketched below.
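A minimal sketch of building these two dictionaries, assuming a small hypothetical list of training captions already wrapped with 'startseq'/'endseq' tokens:

```python
# Hypothetical training captions, already wrapped with start/end tokens.
captions = [
    "startseq the black cat sat on grass endseq",
    "startseq a dog runs through the water endseq",
]

# Collect the vocabulary from all caption words.
vocab = sorted({word for cap in captions for word in cap.split()})

# wordtoix: word -> integer index; ixtoword: integer index -> word.
wordtoix = {word: ix for ix, word in enumerate(vocab, start=1)}  # index 0 reserved for padding
ixtoword = {ix: word for word, ix in wordtoix.items()}

print(wordtoix["startseq"], ixtoword[wordtoix["cat"]])
```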
Data pre-processing using a generator function : Let's take the first image vector Image_1 and its corresponding caption "startseq the black cat sat on grass endseq". Recall that the image vector is the input and the caption is what we need to predict. The way we predict the caption is as follows:
First, we provide the image vector and the first word as input and try to predict the second word, i.e.:
Input = Image_1 + 'startseq'; Output = 'the'
Then we provide the image vector and the first two words as input and try to predict the third word, i.e.:
Input = Image_1 + 'startseq the'; Output = 'cat'
And so on…
Thus, we can summarize the data matrix for one image and its corresponding caption as a sequence of such partial-caption / next-word pairs.
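A minimal sketch of expanding one (image, caption) example into such partial-caption / next-word training pairs, reusing the wordtoix dictionary sketched above; the maximum caption length and the zero feature vector are hypothetical placeholders:

```python
import numpy as np

def make_training_pairs(image_vec, caption, wordtoix, max_len=34):
    """Expand one (image, caption) example into partial-sequence / next-word pairs."""
    seq = [wordtoix[w] for w in caption.split() if w in wordtoix]
    pairs = []
    for i in range(1, len(seq)):
        in_seq = seq[:i]                                  # 'startseq', 'startseq the', ...
        in_seq = in_seq + [0] * (max_len - len(in_seq))   # pad to a fixed length
        out_word = seq[i]                                 # the next word to predict
        pairs.append((image_vec, np.array(in_seq), out_word))
    return pairs

image_1 = np.zeros(2048)  # placeholder for an InceptionV3 feature vector
pairs = make_training_pairs(image_1, "startseq the black cat sat on grass endseq", wordtoix)
print(len(pairs))  # one training pair per word to predict
```

In practice this expansion is wrapped in a Python generator so batches are produced on the fly, instead of materializing the full data matrix in memory.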

V. RESULTS AND DISCUSSION
Flickr8k Dataset : a benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events. The images were chosen from six different Flickr groups and tend not to contain any well-known people or locations, but were manually selected to depict a variety of scenes and situations.
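Generated captions are scored against the reference captions with BLEU, as mentioned in the abstract and conclusion. A minimal sketch of a corpus-level BLEU-1 computation with NLTK follows; the reference and hypothesis sentences are hypothetical examples:

```python
from nltk.translate.bleu_score import corpus_bleu

# Each generated caption is compared against its list of reference captions.
references = [[
    "a dog is running through the water".split(),
    "a black dog runs in the river".split(),
]]
hypotheses = ["dog is running through the water".split()]

# BLEU-1 uses unigram overlap only; change the weights for higher-order n-grams.
score = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0))
print(f"BLEU-1: {score:.2f}")
```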

Figure: Dog is running through the water

Figure: Man is skiing down a snowy mountain


VI. CONCLUSION
This paper mainly focuses on image captioning, based on existing research papers. Different captioning metrics are used for the evaluation of the sentences generated by the system. The scores indicate the accuracy of the words obtained. Different methods are compared, which shows the efficiency of the LSTM method to be 80%. It provides the best results on the Flickr8k dataset. The generated output has a few limitations, i.e., it can contain up to 50 words or 1-2 lines. Hence, this paper provides a clear view of how a caption is generated from an image. The scope of the paper is limited to the LSTM approach only. In future, the scope of the work can be extended so that the system can be used more efficiently by all researchers.

VII. REFERENCES
[1] Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. "Sequence level training with recurrent neural networks." arXiv preprint arXiv:1511.06732.
[2] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2017. "Bottom-up and top-down attention for image captioning and VQA." arXiv preprint arXiv:1707.07998.
[3] Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. 2016. "Self-critical sequence training for image captioning."
[4] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. 2014. "Microsoft COCO: Common objects in context." In European Conference on Computer Vision, pages 740-755. Springer.
[5] Xiaodan Liang, Zhiting Hu, Hao Zhang, Chuang Gan, and Eric P. Xing. 2017.
[6] "Recurrent topic-transition GAN for visual paragraph generation."
[7] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. "Every picture tells a story: Generating sentences from images." In ECCV, 2010.
[8] A. Karpathy and L. Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." In CVPR, 2015.
[9] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. "Baby talk: Understanding and generating simple image descriptions." In CVPR, 2011.
[10] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. "Long-term recurrent convolutional networks for visual recognition and description." In CVPR, 2015.
[11] Mert Kilickaya, Aykut Erdem, Nazli Ikizler-Cinbis, and Erkut Erdem. 2016. "Re-evaluating automatic metrics for image captioning." arXiv preprint arXiv:1612.07600.
[12] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2014. "Show and tell: A neural image caption generator." CoRR, abs/1411.4555.
[13] Ruotian Luo, Brian L. Price, Scott Cohen, and Gregory Shakhnarovich. 2018. "Discriminability objective for training descriptive captions." CoRR, abs/1803.04376.
[14] Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. 2016. "Self-critical sequence training for image captioning."
[15] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. 2014. "Microsoft COCO: Common objects in context." In European Conference on Computer Vision, pages 740-755. Springer.
[16] Xiaodan Liang, Zhiting Hu, Hao Zhang, Chuang Gan, and Eric P. Xing. 2017. "Recurrent topic-transition GAN for visual paragraph generation."
[17] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. "Google's neural machine translation system: Bridging the gap between human and machine translation." arXiv preprint arXiv:1609.08144.
