Image Caption Generator

Project report submitted in partial fulfillment


of the requirements for the degree of

Bachelor of Technology
in
Computer Science Engineering

by

Kuber Varshney - 19UCS194


Shubham Agrawal - 19UCS009

Under Guidance of
Dr. Aloke Datta

Department of Computer Science Engineering


The LNM Institute of Information Technology, Jaipur

August 2022
©The LNM INSTITUTE OF INFORMATION TECHNOLOGY Jaipur-2022.
All rights reserved.
The LNM Institute of Information Technology
Jaipur, India

CERTIFICATE

This is to certify that the project entitled “Image Caption Generator”, submitted by Kuber
Varshney (19UCS194) and Shubham Agrawal (19UCS009) in partial fulfillment of the requirements
of the degree of Bachelor of Technology (B.Tech), is a bonafide record of work carried
out by them at the Department of Computer Science Engineering, The LNM Institute of Information
Technology, Jaipur (Rajasthan), India, during the academic session 2021-2022 under
my supervision and guidance, and the same has not been submitted elsewhere for the award of any
other degree. In my opinion, this report is of the standard required for the award of the degree
of Bachelor of Technology (B.Tech).

Date Supervisor: Dr. Aloke Datta


Acknowledgment

We would like to express our deepest gratitude to our mentor, Dr. Aloke Datta, for his valuable
guidance, consistent encouragement, personal care, and timely help, and for providing
us with an excellent atmosphere for carrying out the project work. Despite his hectic schedule, he
offered us pleasant and kind support throughout the project to help us finish the work.
We would also like to thank him for considering us for this project and guiding us wherever
possible.

Abstract

In the modern era, image captioning has become one of the most widely required tools. Many
applications now generate and provide a caption for a given image automatically, and all of
this is done with the help of deep neural network models. The process of generating a
description of an image is called image captioning. It requires recognizing the important
objects, their attributes, and the relationships among the objects in an image, and it generates
syntactically and semantically correct sentences. In this report, we present a deep learning model
to describe images and generate captions using computer vision and machine translation. The
aim is to detect the different objects found in an image, recognize the relationships between
those objects, and generate captions. The dataset used is Flickr8k, the programming language
used is Python3, and a machine learning technique called transfer learning is implemented with
the help of the VGG-16 model to demonstrate the proposed experiment. This study also
elaborates on the functions and structure of the various neural networks involved. Generating
image captions is an important aspect of Computer Vision and Natural Language Processing.
Image caption generators can find applications in image search and organization, as used by
Facebook and Google Photos, and their use can be extended to video frames. They can
automate much of the work of a person who has to interpret images.

Keywords- Image, Caption, CNN, VGG-16, RNN, LSTM, Neural Networks.

Contents

List of Figures vii

List of Tables viii

1 Introduction 1
1.1 The Area of Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Addressed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Existing System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Literature Review 3
2.1 Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 Layers used to build ConvNets . . . . . . . . . . . . . . . . . . . . . 3
2.1.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Long Short Term Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

3 Proposed Work 6
3.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1.1 Flickr 8k Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1.2 Image Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1.3 Caption Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2 Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.3 Project File Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.3.1 Building the python based Project . . . . . . . . . . . . . . . . . . . 8
3.3.2 Extracting the feature vector from all Images . . . . . . . . . . . . . 8
3.3.3 Text Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.3.4 Tokenizing the Vocabulary . . . . . . . . . . . . . . . . . . . . . . . 11
3.3.5 Creating the Data Generator . . . . . . . . . . . . . . . . . . . . . . 11
3.3.6 Defining the CNN-RNN Model . . . . . . . . . . . . . . . . . . . . 12
3.3.7 Training the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

4 Simulation and Results 14

5 Conclusions and Future Work 18

Bibliography 18

List of Figures

3.1 Flickr Dataset Caption Format . . . . . . . . . . . . . . . . . . . . . . . . . 8


3.2 VGG-16 Model Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.3 Code for Feature extraction and saving them in a Pickle File . . . . . . . . . 10
3.4 Mapping images with captions . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.5 Text Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.6 Text Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.7 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

4.1 Result 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.2 Result 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.3 Result 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.4 Result 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

List of Tables

3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

Chapter 1

Introduction

1.1 The Area of Work

Every day, we encounter a large number of images from various sources such as the internet, news
articles, document diagrams, and advertisements. These sources contain images that viewers
have to interpret themselves. Most images do not have a description, but humans can
largely understand them without detailed captions. However, machines must be able to interpret
some form of image description if we want them to caption images automatically. Image
captioning is important for several reasons; for example, captions for every image on the web can
enable faster and more descriptively accurate image search and indexing.

The goal of this project is to generate appropriate captions for a given image, capturing the
contextual information in the image. Current methods use convolutional neural networks (CNNs)
and recurrent neural networks (RNNs), or their variants, to generate appropriate captions. These
networks provide an encoder-decoder approach to this task, where a CNN encodes the image into
feature vectors and an RNN is used as a decoder to generate language descriptions.

Image captioning finds applications in fields such as commerce, biomedicine, web search, and
the military. Social media platforms such as Instagram and Facebook can generate captions
automatically from images.

1.2 Problem Addressed

Generating captions for images is a vital task relevant to both Natural Language
Processing and Computer Vision. Mimicking the human ability to provide descriptions of
images by a machine is itself a motivating step along the road of Artificial Intelligence. The
main challenge of this task is to capture how the objects in an image relate to each other and to
express these relationships in a natural language such as English. Traditionally, computer systems
have used predefined templates for generating text descriptions of images. However, this approach
does not provide the variety required for generating lexically rich text descriptions.
This shortcoming has been lessened by the increased efficiency of neural networks.
Many state-of-the-art models use neural networks for generating captions by taking an image as
input and predicting the next lexical unit in the output sentence.

1.3 Existing System

The concept of generating text from images has been a topic of great interest since the
introduction of machine learning and Artificial Intelligence. The usefulness of such models in
various fields creates strong interest in optimizing them well beyond their present state. Existing
work on image caption generation uses CNNs and RNNs: a CNN is used to detect and recognize
the contents of an image, and an RNN then generates relationships between those contents and
produces text based on them. Microsoft has built an AI system on Azure that can generate
captions for images with good accuracy. Some work has also been done on converting image
captions to speech. However, we confine this report to generating rich text descriptions for
the input images.
Chapter 2

Literature Review

2.1 Convolutional Neural Network

In deep learning, a convolutional neural network (CNN or ConvNet) is a class of deep neural
networks most commonly applied to analyzing visual imagery. When we think of a neural
network, we usually think of matrix multiplications, but that is not the whole story with a
ConvNet: it uses a special technique called convolution. In mathematics, convolution is an
operation on two functions that produces a third function expressing how the shape of one
is modified by the other.

2.1.1 Layers used to build ConvNets

A ConvNet is a sequence of layers, and every layer transforms one volume of activations into
another through a differentiable function. Let us take an example by running a ConvNet on an
image of dimension 32 x 32 x 3.

1. Input Layer: This layer holds the raw input of the image with width 32, height 32, and
depth 3.

2. Convolution Layer: This layer computes the output volume by computing the dot prod-
uct between all filters and image patches. Suppose we use a total of 12 filters for this
layer we’ll get output volume of dimension 32 x 32 x 12.

3. Activation Function Layer: This layer applies an element-wise activation function
to the output of the convolution layer. Some common activation functions are
ReLU, softmax, tanh, and Leaky ReLU.

4. Pool Layer: This layer is periodically inserted in the ConvNet; its main function is to
reduce the size of the volume, which makes computation faster, reduces memory use, and also
helps prevent overfitting. Two common types of pooling layers are max pooling and average
pooling. If we use a max pool with 2 x 2 filters and stride 2, the resultant volume will
be of dimension 16 x 16 x 12.

5. Fully-Connected Layer: This layer is a regular neural network layer that takes input
from the previous layer, computes the class scores, and outputs a 1-D array of size
equal to the number of classes.
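As an illustration only (this is not the project's code), the example layer stack above can be sketched in Keras, the library used later in this report; the 10-class output is an assumed size for the example:

# Minimal Keras sketch of the 32 x 32 x 3 example above: a convolution layer with 12 filters,
# a ReLU activation, 2 x 2 max pooling, and a fully-connected classifier.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(12, (3, 3), padding='same', input_shape=(32, 32, 3)),  # conv layer -> 32 x 32 x 12
    layers.Activation('relu'),                                           # activation function layer
    layers.MaxPooling2D(pool_size=(2, 2), strides=2),                    # pool layer -> 16 x 16 x 12
    layers.Flatten(),
    layers.Dense(10, activation='softmax'),                              # class scores (10 classes assumed)
])
model.summary()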

2.1.2 Limitations

For all their power and computational cost, CNNs provide in-depth results: at the root of it all,
a CNN is recognizing patterns and details so minute and inconspicuous that they go unnoticed
by the human eye. But when it comes to understanding the contents of an image as a whole, a
CNN on its own falls short.

2.2 Recurrent Neural Networks

Recurrent neural networks (RNNs) are a class of neural networks designed to process sequential
data. Unlike traditional feed-forward networks, RNNs have a unique design that allows them to
operate by maintaining hidden states that evolve over time. This ability makes RNNs effective
for many tasks where understanding and modeling sequential behavior is important.

An important feature of RNNs is the presence of recurrent connections that form loops in
the network architecture. These connections allow the RNN to carry information from previous
steps and use it to make predictions at the current time step [1]. RNNs are widely used in NLP
tasks such as language modelling, text generation, sentiment analysis, and machine translation.
They are good at understanding and producing human language because they capture context
and patterns. In computer vision, RNNs are used to add captions to images [2]: they produce
descriptions by processing visual features extracted from images, thus improving the
understanding of image content.

2.3 Long Short Term Memory

LSTM stands for Long Short-Term Memory; LSTMs are a type of RNN (recurrent neural network)
well suited for sequence prediction problems. Based on the previous text, they can
predict what the next word will be. The LSTM has proven more effective than the traditional
RNN by overcoming its short-term-memory limitation: an LSTM can carry relevant information
throughout the processing of inputs and, with a forget gate, it discards non-relevant information.

LSTMs are designed to overcome the vanishing gradient problem, which allows them to retain
information for longer periods compared to traditional RNNs. LSTMs can maintain a near-constant
error, which allows them to continue learning over numerous time steps when backpropagating
through time and layers.

LSTMs use gated cells to store information outside the regular flow of the RNN. With these
cells, the network can manipulate the information in many ways, including storing information
in the cells and reading from them. The cells are individually capable of making decisions
regarding the information and can execute these decisions by opening or closing the gates. The
ability to retain information for a long period of time gives LSTM the edge over traditional
RNNs in these tasks. The chain-like architecture of LSTM allows it to contain information
for longer time periods, solving challenging tasks that traditional RNNs struggle to or simply
cannot solve.

The three major parts of the LSTM include:

1. Forget gate—removes information that is no longer necessary for the completion of the
task. This step is essential to optimizing the performance of the network.

2. Input gate—responsible for adding information to the cells.

3. Output gate—selects and outputs necessary information.
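For reference, the standard LSTM gate equations (general background, not taken from this project's code) can be written as:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)           (forget gate)
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)           (input gate)
\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)    (candidate cell state)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t        (cell state update)
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)           (output gate)
h_t = o_t \odot \tanh(c_t)                             (hidden state)

where \sigma is the sigmoid function, \odot denotes element-wise multiplication, x_t is the current input, h_{t-1} is the previous hidden state, and c_t is the cell state.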

The CNN-LSTM architecture involves using Convolutional Neural Network (CNN) layers for
feature extraction on the input data, combined with LSTMs to support sequence prediction. This
architecture was originally referred to as the Long-term Recurrent Convolutional Network (LRCN)
model, although we will use the more generic name “CNN-LSTM” to refer to LSTMs
that use a CNN as a front end in this report. This architecture is used for the task of generating
textual descriptions of images. The key is the use of a CNN that is pre-trained on a challenging
image classification task and repurposed as a feature extractor for the caption generation
problem.
Chapter 3

Proposed Work

3.1 Dataset

This project requires a dataset which has both images and their captions, and the dataset should
be sufficient to train the image captioning model.

3.1.1 Flickr 8k Dataset

The Flickr 8k dataset is a public benchmark dataset for image-to-sentence description. It
consists of 8000 images with five captions for each image. These images were extracted from
diverse groups on the Flickr website. Each caption provides a clear description of the entities
and events present in the image. The dataset depicts a variety of events and scenarios and
does not include images of well-known people and places, which makes it more generic.
Features of the dataset that make it suitable for this project are:
• Multiple captions mapped to a single image make the model more generic and help avoid
overfitting of the model.
• Diverse categories of training images allow the image captioning model to work for
multiple categories of images, making the model more robust.

3.1.2 Image Data Preparation

Images must be converted into suitable features before they can be used to train a deep
learning model; feature extraction is a mandatory step for training on any image. The features
are extracted using a Convolutional Neural Network (CNN), namely the Visual Geometry Group
(VGG-16) model. This model was among the top performers in the ImageNet Large Scale Visual
Recognition Challenge (ILSVRC) 2014, which required classifying images into one of the 1000
classes given in the challenge. Hence, this model is well suited for this project, as image
captioning requires identification of the contents of images. VGG-16 has 16 weight layers, and
the larger number of layers helps in better feature extraction from images. The VGG-16 network
uses 3 x 3 convolutional layers, making its architecture simple, with max pooling layers in
between to reduce the volume size of the image. The last layer of the network, which predicts the
classification, is removed, and the internal representation of the image just before classification
is returned as the feature. The input image should have dimensions 224 x 224, and the model
extracts features of the image and returns a 1-dimensional 4096-element vector.
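As a minimal sketch of this step (illustrative only; the project's actual code appears in Figure 3.3, and the image path here is a placeholder), the VGG-16 network can be loaded in Keras with its final classification layer removed so that it returns the 4096-element feature vector:

# Extract a 4096-element VGG-16 feature vector for a single image (sketch).
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.image import load_img, img_to_array

vgg = VGG16()
# Drop the final 1000-class prediction layer; use the last fully-connected layer as output.
feature_model = Model(inputs=vgg.inputs, outputs=vgg.layers[-2].output)

image = load_img('example.jpg', target_size=(224, 224))   # VGG-16 expects 224 x 224 input
x = img_to_array(image)
x = preprocess_input(np.expand_dims(x, axis=0))           # add batch dimension and normalize
feature = feature_model.predict(x, verbose=0)             # shape (1, 4096)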

3.1.3 Caption Data Preparation

The Flickr 8k dataset contains multiple descriptions for a single image. In the data preparation
phase, each image id is taken as a key and its corresponding captions are stored as the value in a
dictionary.
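A rough sketch of this step is shown below; the file name captions.txt and the comma-separated “image,caption” layout are assumptions based on Figure 3.1, so the split may need adjusting for a different file format.

# Build a {image_id: [caption, ...]} dictionary from the captions file (sketch).
mapping = {}
with open('captions.txt', 'r') as f:
    next(f)                                   # skip the header line, if the file has one
    for line in f:
        parts = line.strip().split(',', 1)    # split into image name and caption text
        if len(parts) < 2:
            continue
        image_name, caption = parts
        image_id = image_name.split('.')[0]   # drop the .jpg extension
        mapping.setdefault(image_id, []).append(caption)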

3.2 Prerequisites

This project requires good knowledge of deep learning, Python, working on Kaggle notebooks,
the Keras library, NumPy, and natural language processing. Make sure you have installed
all of the following necessary libraries (for example with pip install):

1. tensorflow

2. keras

3. pillow

4. numpy

5. tqdm

3.3 Project File Structure

Downloaded from the dataset:


Flicker8k Dataset - dataset folder which contains 8091 images.
Flickr 8k text - dataset folder which contains text files with the captions of the images.

The files below will be created by us while making the project:


best model.h5 - will contain our trained model.
features.pkl - pickle object that maps each image to the feature vector extracted from the pre-trained VGG-16 CNN model.
model.png - visual representation of the dimensions of our model.

FIGURE 3.1: Flickr Dataset Caption Format

3.3.1 Building the python based Project

The main text file which contains all the image captions is captions.txt in our Flickr 8k text folder.

3.3.2 Extracting the feature vector from all Images

This technique is also called transfer learning: we do not have to do everything on our own;
instead, we use a pre-trained model that has already been trained on a large dataset, extract the
features from it, and use them for our task. We are using the VGG-16 model, which has been
trained on the ImageNet dataset with 1000 different classes. We can directly import this model
from keras.applications. Make sure you are connected to the internet, as the weights are
downloaded automatically.

Since the VGG-16 model was originally built for ImageNet, we will make small changes to
integrate it with our model.

FIGURE 3.2: VGG-16 Model Summary

Then we will extract the features for all the images, map each image name to its respective
feature array, and dump the features dictionary into a “features.pkl” pickle file.
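A simplified sketch of this loop is shown below (the project's actual code is in Figure 3.3; the image folder name is an assumption):

# Extract a feature vector for every image and dump the dictionary to features.pkl (sketch).
import os
import pickle
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.image import load_img, img_to_array

vgg = VGG16()
feature_model = Model(inputs=vgg.inputs, outputs=vgg.layers[-2].output)  # 4096-element output

features = {}
image_dir = 'Flicker8k_Dataset'               # assumed image folder name
for name in os.listdir(image_dir):
    img = load_img(os.path.join(image_dir, name), target_size=(224, 224))
    x = preprocess_input(np.expand_dims(img_to_array(img), axis=0))
    features[name.split('.')[0]] = feature_model.predict(x, verbose=0)

with open('features.pkl', 'wb') as f:
    pickle.dump(features, f)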

3.3.3 Text Preprocessing

We will first create a dictionary that maps each image to its captions, as shown in the
figure below.

FIGURE 3.3: Code for Feature extraction and saving them in a Pickle File

FIGURE 3.4: Mapping images with captions

Cleaning the text (captions) – Here we take all the captions and perform data cleaning. This is an
important step when working with textual data; depending on our goal, we decide what type
of cleaning we want to perform on the text. In our case, we will remove punctuation,
convert all text to lowercase, and remove words that contain numbers. We will also
append the <start> and <end> identifiers to each caption. We need this so that our LSTM model
can identify the start and end of a caption. So, a caption like “A man riding on a
three-wheeled wheelchair” will be transformed into “start man riding on three wheeled wheelchair end”.
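A minimal sketch of this cleaning step is shown below (the project's actual code is in Figure 3.5); it operates on the image-to-captions dictionary built earlier and stores the start/end markers as plain words, matching the layout shown later in Table 3.1:

# Clean the captions: lowercase, strip punctuation, drop words containing numbers,
# and wrap each caption in the start/end identifiers (sketch).
import string

def clean_captions(mapping):
    table = str.maketrans('', '', string.punctuation)
    for image_id, captions in mapping.items():
        for i, caption in enumerate(captions):
            words = caption.lower().translate(table).split()
            words = [w for w in words if w.isalpha()]           # removes words containing numbers
            captions[i] = 'start ' + ' '.join(words) + ' end'   # add the start/end identifiers
    return mapping

mapping = clean_captions(mapping)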

FIGURE 3.5: Text Processing

3.3.4 Tokenizing the Vocabulary

Computers do not understand English words; for computers, we have to represent words
with numbers. So we will map each word of the vocabulary to a unique index value. The Keras
library provides a tokenizer function that we will use to create tokens from our
vocabulary and save them to a “tokenizer.p” pickle file. Our vocabulary contains 8467 words.
We also calculate the maximum length of the descriptions, which is important for deciding the
model structure parameters; the maximum description length is 35.
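A sketch of this step using the Keras tokenizer is shown below; the variable names follow the earlier sketches and are assumptions.

# Fit a tokenizer on all cleaned captions, save it, and compute vocabulary size and max length (sketch).
import pickle
from tensorflow.keras.preprocessing.text import Tokenizer

all_captions = [c for caps in mapping.values() for c in caps]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_captions)                     # assigns a unique index to each word
vocab_size = len(tokenizer.word_index) + 1               # +1 because index 0 is reserved for padding
max_length = max(len(c.split()) for c in all_captions)   # longest caption, used for padding

with open('tokenizer.p', 'wb') as f:
    pickle.dump(tokenizer, f)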

3.3.5 Creating the Data Generator

Let us first see what the input and output of our model will look like. To make this a
supervised learning task, we have to provide input and output to the model for training. We
train our model on 6000 images, where each image is represented by a 4096-element feature
vector and the caption is also represented as numbers. It is not possible to hold this amount of
data for 6000 images in memory, so we will use a generator method that yields batches. The
generator yields the input and output sequences. For example, the input to our model is [x1, x2]
and the output is y, where x1 is the 4096-element feature vector of the image, x2 is the input
text sequence, and y is the next word of the output text sequence that the model has to predict.

FIGURE 3.6: Text Tokenization

TABLE 3.1

x1 (Feature Vector) | x2 (Text Sequence)             | y (Word to Predict)
feature             | start                          | two
feature             | start, two                     | dogs
feature             | start, two, dogs               | drink
feature             | start, two, dogs, drink        | water
feature             | start, two, dogs, drink, water | end
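A simplified sketch of such a generator, following the layout of Table 3.1, is shown below; the names features, mapping, tokenizer, max_length, and vocab_size refer to the earlier sketches, and the batch size is an assumption.

# Data generator yielding ([image_feature, partial_sequence], next_word) batches (sketch).
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def data_generator(image_ids, mapping, features, tokenizer, max_length, vocab_size, batch_size):
    X1, X2, y = [], [], []
    while True:                                           # Keras expects an endless generator
        for image_id in image_ids:
            for caption in mapping[image_id]:
                seq = tokenizer.texts_to_sequences([caption])[0]
                for i in range(1, len(seq)):
                    in_seq = pad_sequences([seq[:i]], maxlen=max_length)[0]          # words so far
                    out_word = to_categorical([seq[i]], num_classes=vocab_size)[0]   # next word
                    X1.append(features[image_id][0])      # image feature vector
                    X2.append(in_seq)
                    y.append(out_word)
                    if len(X1) == batch_size:
                        yield (np.array(X1), np.array(X2)), np.array(y)
                        X1, X2, y = [], [], []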

3.3.6 Defining the CNN-RNN Model

To define the structure of the model, we will use the Keras Functional API. The model
consists of three major parts:

Feature Extractor – The feature extracted from the image has a size of 4096; with a dense layer,
we reduce the dimensions to 256 nodes.
Sequence Processor – An embedding layer handles the textual input, followed by an LSTM layer.
Decoder – By merging the outputs of the above two parts, we pass the result through dense
layers to make the final prediction. The final layer contains a number of nodes equal to our
vocabulary size.
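A sketch of this model using the Keras Functional API is shown below; the 4096-element input matches the VGG-16 features described in Section 3.1.2, while the dropout rates are assumptions (see Figure 3.7 for the project's model).

# Merge (CNN-RNN) model for caption generation (sketch); max_length and vocab_size come
# from the earlier steps.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

# Feature extractor: 4096-element VGG-16 feature reduced to 256 nodes
inputs1 = Input(shape=(4096,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)

# Sequence processor: embedding layer followed by an LSTM
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)

# Decoder: merge both branches and predict the next word over the whole vocabulary
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')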

3.3.7 Training the Model

To train the model, we use the 6000 training images, generating the input and output
sequences in batches and fitting them to the model using the model.fit() method. We also save
the model to our models folder. This will take some time, depending on your system capability.
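A sketch of this training loop is shown below; the epoch count, batch size, steps estimate, and file paths are assumptions, and train_ids denotes the list of training image ids.

# Train the model on generated batches, saving the weights after each epoch (sketch).
epochs = 20
batch_size = 32
steps = (len(train_ids) * 5) // batch_size    # rough estimate: 5 captions per training image

for i in range(epochs):
    generator = data_generator(train_ids, mapping, features, tokenizer,
                               max_length, vocab_size, batch_size)
    model.fit(generator, epochs=1, steps_per_epoch=steps, verbose=1)
    model.save('models/model_' + str(i) + '.h5')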

FIGURE 3.7: Model


Chapter 4

Simulation and Results

Now that the model has been trained, we create a separate file, testing caption generator.py,
which loads the model and generates predictions. The predictions are sequences of word-index
values (up to the maximum caption length), so we use the same tokenizer.p pickle file to map
the indices back to words.
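A sketch of this greedy decoding step is shown below; the model, tokenizer, and maximum length are assumed to be loaded from the saved files, and the start/end words follow the convention used in Section 3.3.3.

# Generate a caption for one image with greedy search: start from the 'start' token and
# repeatedly predict the most likely next word until 'end' or the maximum length (sketch).
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def index_to_word(index, tokenizer):
    for word, idx in tokenizer.word_index.items():
        if idx == index:
            return word
    return None

def generate_caption(model, tokenizer, photo_feature, max_length):
    in_text = 'start'
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([in_text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = model.predict([photo_feature, seq], verbose=0)
        word = index_to_word(int(np.argmax(yhat)), tokenizer)
        if word is None or word == 'end':
            break
        in_text += ' ' + word
    return ' '.join(in_text.split()[1:])       # drop the leading 'start' token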

FIGURE 4.1: Result 1

FIGURE 4.2: Result 2


FIGURE 4.3: Result 3


FIGURE 4.4: Result 4


Chapter 5

Conclusions and Future Work

Future work on image captioning can focus on these points:

• Multimodal Approaches: Enhancing the integration of visual and textual information in
image captioning models.

• Bias and Fairness Considerations: Detecting and mitigating biases in image captioning
models to ensure fairness and eliminate stereotypes.

• Long-Term Context Modeling: Incorporating temporal information and image sequence
analysis for generating captions that consider broader context.

The aforementioned future work can be incorporated into this project to boost its usefulness and
applicability. Data efficiency, generalization, and addressing bias and fairness considerations
are important aspects to consider for creating robust and unbiased image captioning systems.
Additionally, incorporating user interaction and personalization can provide opportunities to
tailor captions to individual preferences and refine the captioning process. By addressing these
challenges and exploring these directions, the field of image captioning can continue to evolve,
leading to more accurate, detailed, and contextually aware captions that enhance our
understanding of and interaction with visual content.

Bibliography

[1] S. Takkar, A. Jain, and P. Adlakha, “Comparative study of different image captioning models,”
2021.

[2] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in
2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

