Image Caption Generator Using Deep Learning
1 Sufiyan Ali Khan, 2 G Swetha, 3 G Thejashwini, 4 H Varshith, 5 M Varun Teja, 6 A Siva Kumar
1,2,3,4,5 Student, 6 Professor
1. ABSTRACT

Image captioning is a task in which each image must be understood properly so that a suitable caption with correct grammatical structure can be generated. We present a hybrid system that uses a multilayer CNN (Convolutional Neural Network) to generate keywords describing a given input image and a Long Short Term Memory (LSTM) network to construct meaningful captions from the obtained words. CNNs have proven so effective that they can be applied to almost any estimation problem that takes image data as input, while the LSTM was developed to avoid the poor predictive behaviour of traditional sequence models. We use an encoder-decoder model that is capable of generating grammatically correct captions for images, with VGG16 (Visual Geometry Group) as the encoder and an LSTM as the decoder. Once trained, the model produces captions that closely describe a given image. The efficiency of the model is demonstrated on the Flickr8K dataset, which contains 8000 images with five captions each; the combination of CNN and LSTM captures both the spatial relationships within images and the contextual information of captions, and generates contextually relevant captions.

Keywords—CNN (Convolutional Neural Network), LSTM (Long Short Term Memory), VGG16 (Visual Geometry Group), Deep Learning, Encoder-Decoder.

2. INTRODUCTION

Creating precise descriptions for images has posed a significant challenge in the field of Artificial Intelligence, with a wide range of applications from robotic vision to assisting the visually impaired. The ultimate goal is to develop a system that can produce accurate and meaningful captions for images. Researchers have been exploring various methods to improve predictions, utilizing deep neural networks and machine learning techniques to construct effective models. Using the Flickr8K dataset, which consists of 8000 sample images each with five captions, we focus on two key phases: extracting features from images using a Convolutional Neural Network (CNN) and generating natural language sentences from those features using a Recurrent Neural Network (RNN). In the first phase, we adopt a feature-extraction approach that captures even subtle differences between similar images, rather than simply identifying objects; the VGG-16 model, with 16 weight layers, is employed for this purpose. In the second phase, the extracted features are trained together with the dataset's captions, and LSTM (Long Short Term Memory) architectures are used to formulate sentences for the input images.
3. LITERATURE SURVEY

[1] This document introduces a model that utilizes a pre-trained deep learning model to generate captions for a given image. The VGG model is employed for image processing through deep convolutional neural networks. The generated caption is evaluated by comparing the model's output with actual human sentences: both the model's output and the captions provided by humans for the image are analyzed, and it is determined that the generated caption closely resembles the human-provided caption, resulting in an accuracy of approximately 75%. Therefore, the generated sentence and the human-provided sentence are highly similar.

[2] The image caption generator in this study was developed using the Flickr_8k database, which consists of a wide range of images depicting various situations. With a total of 8000 images, each picture is accompanied by 5 captions. The dataset is split into 6000 training images, 1000 validation images, and 1000 testing images. Through thorough training and testing, the model successfully generated accurate captions for the images. The proposed model utilizes a combination of a Convolutional Neural Network and a Recurrent Neural Network to assign appropriate labels and create grammatically correct captions, with the CNN serving as the encoder and the RNN as the decoder.

[3] In an image caption generator, the VGG16 model acts like a smart filter for pictures. It looks at an image and identifies important features like shapes and objects. These features help create a sort of summary of the picture. This summary, produced by VGG16, is then used by another part of the system to write a sentence describing what's happening in the image. So, VGG16 helps the system understand what's in the picture, and then the caption generator turns that understanding into words.

[4] This study introduces a framework that produces appropriate descriptions based on images. The framework utilizes the Flickr8K dataset, which includes 8000 images, each accompanied by five descriptions. The research showcases a model that employs a neural network to automatically analyze an image and create suitable English captions. The generated captions are categorized as follows: error-free descriptions, descriptions with minimal errors, descriptions somewhat related to the image, and descriptions unrelated to the image.

[5] The integration of VGG16, LSTM, and CNN in various applications, particularly in image caption generation, highlights the significance of combining these powerful components to achieve improved performance. The use of VGG16 as a robust feature extractor proves pivotal, capturing intricate details and semantic information from images. LSTM, with its ability to model sequential data, complements VGG16 by generating coherent and contextually rich captions. The synergy between CNN and LSTM addresses the challenge of combining visual and linguistic information for a more comprehensive understanding of images. While the surveyed literature demonstrates the effectiveness of this combination, ongoing research and advancements in deep learning may unveil novel architectures and methodologies, further refining the synergy between VGG16, LSTM, and CNN in image captioning and related tasks.

4. METHODOLOGY

The methodology focuses on developing an Image Caption Generator that integrates the VGG16 CNN for feature extraction and an LSTM for caption generation.

[1] Data Collection and Preprocessing:

Flickr8K Dataset: The Flickr8K dataset comprises 8,000 images, each paired with five descriptive captions, totaling 40,000 captions. These images cover a wide range of categories and scenes, providing diversity for training and evaluation.

Data Collection: The dataset was collected from the Flickr website, where users upload images with descriptive captions. Each image is associated with multiple captions provided by different users.

Preprocessing: Prior to training, the images were preprocessed by resizing them to a uniform size (e.g., 224x224 pixels) and normalizing the pixel values to a fixed range (e.g., [0, 1]). Captions were tokenized into individual words, and a vocabulary was created by assigning a unique index to each word. Additionally, captions were padded or truncated to ensure uniform length for batch processing during training.
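The paper does not name a specific framework, so the following is only a minimal sketch of the image-preprocessing step, assuming TensorFlow/Keras; the directory path, helper name, and example file name are illustrative, not taken from the paper.

```python
import os
import numpy as np
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# Illustrative path to a local copy of the Flickr8K images.
IMAGE_DIR = "Flickr8k_Dataset/Images"

def load_and_preprocess(image_name, target_size=(224, 224)):
    """Resize an image to 224x224 and scale pixel values to [0, 1]."""
    path = os.path.join(IMAGE_DIR, image_name)
    img = load_img(path, target_size=target_size)   # resize to a uniform size
    arr = img_to_array(img) / 255.0                  # normalize to [0, 1]
    return np.expand_dims(arr, axis=0)               # add a batch dimension

# Example usage (the file name is hypothetical):
# x = load_and_preprocess("example_image.jpg")
# print(x.shape)  # (1, 224, 224, 3)
```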
[2] Feature Extraction Using VGG16:

VGG16 Architecture: VGG16, a widely used convolutional neural network (CNN) architecture, was employed as the feature extractor. The model consists of 16 weight layers: 13 convolutional layers followed by three fully connected layers.

Pre-trained Weights: Pre-trained weights of the VGG16 model trained on the ImageNet dataset were utilized. The fully connected layers of VGG16 were removed, and only the convolutional layers were retained to extract image features while discarding unnecessary classification information.

Feature Extraction: Each image in the Flickr8K dataset was passed through the modified VGG16 network, and the activations from one of the intermediate convolutional layers were extracted. These activations served as high-level feature representations capturing the visual content of the images.
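A minimal sketch of this feature-extraction step, assuming TensorFlow/Keras: the fully connected layers are dropped via include_top=False, and global average pooling turns the final convolutional activations into a single 512-dimensional vector per image. The pooling choice is an assumption; the paper only states that activations from an intermediate convolutional layer were used.

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# VGG16 with ImageNet weights and the fully connected layers removed;
# global average pooling reduces the final conv feature maps to a 512-d vector.
encoder = VGG16(weights="imagenet", include_top=False, pooling="avg")

def extract_feature(image_path):
    """Return a (512,) feature vector for one image."""
    img = load_img(image_path, target_size=(224, 224))
    x = img_to_array(img)[np.newaxis, ...]
    # preprocess_input applies the ImageNet normalization the pretrained
    # weights expect (this differs slightly from plain [0, 1] scaling).
    x = preprocess_input(x)
    return encoder.predict(x, verbose=0)[0]
```

In practice the features for all 8,000 images are typically computed once and cached (for example in a dictionary keyed by image id) before caption training begins.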
[3] Caption Preprocessing:

Tokenization: The captions associated with each image were tokenized into individual words or subword units to represent the textual content.

Vocabulary Creation: A vocabulary was constructed by compiling all unique words from the captions and assigning a unique index to each word.

Padding or Truncation: To ensure uniform length for captions, they were padded with a special token (e.g., <PAD>) or truncated to a maximum length. This step facilitated batch processing during training.
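A minimal sketch of the caption-preprocessing step, again assuming TensorFlow/Keras utilities. The startseq/endseq boundary tokens and the sample captions are illustrative conventions, not values specified in the paper; padding index 0 plays the role of the <PAD> token.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Example captions keyed by image id (contents are illustrative).
captions = {
    "img1": ["startseq a dog runs across the grass endseq",
             "startseq a brown dog is running outside endseq"],
}
all_captions = [c for caps in captions.values() for c in caps]

# Build the vocabulary: every unique word gets an integer index.
tokenizer = Tokenizer(oov_token="<unk>")
tokenizer.fit_on_texts(all_captions)
vocab_size = len(tokenizer.word_index) + 1           # +1 for the padding index 0
max_length = max(len(c.split()) for c in all_captions)

# Convert captions to index sequences and pad them to a uniform length.
sequences = tokenizer.texts_to_sequences(all_captions)
padded = pad_sequences(sequences, maxlen=max_length, padding="post")
print(vocab_size, max_length, padded.shape)
```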
[4] Model Architecture:

Sequence-to-Sequence Architecture: The model architecture comprised an encoder-decoder framework, with VGG16 serving as the encoder and an LSTM network serving as the decoder.

Integration of VGG16 Features: The features extracted by VGG16 were fed into the initial state of the LSTM decoder, enabling the model to generate captions conditioned on the visual content of the images.
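A minimal sketch of such an encoder-decoder in TensorFlow/Keras: the 512-d VGG16 feature is projected into the LSTM's initial hidden and cell states, and the decoder reads a caption prefix and predicts the next word at every position. The layer sizes (256 units, 256-d embeddings) and the example vocabulary size are illustrative choices, not values reported in the paper.

```python
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout
from tensorflow.keras.models import Model

def build_caption_model(vocab_size, feat_dim=512, units=256):
    # Encoder side: project the precomputed VGG16 feature vector into the
    # LSTM's initial hidden state and cell state.
    img_input = Input(shape=(feat_dim,), name="image_feature")
    h0 = Dense(units, activation="tanh")(img_input)
    c0 = Dense(units, activation="tanh")(img_input)

    # Decoder side: embed the caption prefix (variable length) and run the
    # LSTM, predicting a distribution over the vocabulary at each time step.
    seq_input = Input(shape=(None,), name="caption_prefix")
    emb = Embedding(vocab_size, units, mask_zero=True)(seq_input)
    emb = Dropout(0.3)(emb)
    lstm_out = LSTM(units, return_sequences=True)(emb, initial_state=[h0, c0])
    outputs = Dense(vocab_size, activation="softmax")(lstm_out)

    return Model(inputs=[img_input, seq_input], outputs=outputs)

# Illustrative size; use the tokenizer's vocab_size in practice.
model = build_caption_model(vocab_size=8000)
model.summary()
```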
[5] Training:

Dataset Splitting: The Flickr8K dataset was divided into training, validation, and test sets, typically with a standard split of 6,000 training images, 1,000 validation images, and 1,000 test images.

Training Parameters: Training parameters such as batch size, learning rate, and number of epochs were selected through experimentation and hyperparameter tuning.

Optimization: The model was trained using an optimization algorithm such as Adam, with the cross-entropy loss function used to measure the discrepancy between predicted and ground truth captions.

Teacher Forcing: During training, teacher forcing was employed to facilitate learning by feeding the ground truth previous word as input to predict the next word.
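A minimal sketch of this training setup under the same assumptions: teacher forcing is implemented by using the caption shifted right by one position as decoder input and the caption shifted left by one as the target, and the model from the previous sketch is compiled with Adam and a cross-entropy loss. The dummy data, batch size, and epoch count are placeholders, since the paper does not report its tuned values.

```python
import numpy as np

# Dummy stand-ins for the outputs of the earlier steps: 512-d VGG16 features
# keyed by image id, and padded caption index sequences per image (0 = pad).
features = {"img1": np.random.rand(512).astype("float32")}
padded_caps = {"img1": [np.array([1, 4, 9, 2, 0, 0])]}

def make_training_pairs(features, padded_caps):
    """Teacher forcing: input is the caption minus its last token,
    target is the caption shifted left by one (the next word at each step)."""
    X_img, X_seq, y = [], [], []
    for img_id, caps in padded_caps.items():
        for cap in caps:
            X_img.append(features[img_id])
            X_seq.append(cap[:-1])
            y.append(cap[1:])
    return np.array(X_img), np.array(X_seq), np.array(y)

X_img, X_seq, y = make_training_pairs(features, padded_caps)

# `model` is the encoder-decoder from the previous sketch.
# Sparse categorical cross-entropy avoids one-hot encoding the targets;
# batch size and epoch count are illustrative, not the paper's tuned values.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit([X_img, X_seq], y, batch_size=64, epochs=20)
```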
[6] Evaluation:

Evaluation Metrics: Evaluation of the trained model was performed using standard metrics such as BLEU score, METEOR, and CIDEr, which measure the similarity between generated captions and human-written references.

Dataset Split: The evaluation was conducted on the validation and test sets of the Flickr8K dataset to assess the generalization performance of the model.

Comparison: The performance of the model was compared with baseline methods or previous state-of-the-art approaches to validate its effectiveness.
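As an illustration of the BLEU part of this evaluation, the sketch below scores generated captions against the reference captions for each image using NLTK's corpus BLEU implementation; the tokenized sentences shown are illustrative, and METEOR and CIDEr require additional tooling that is omitted here.

```python
from nltk.translate.bleu_score import corpus_bleu

# Each test image contributes a list of tokenized reference captions and one
# tokenized generated caption. Contents here are illustrative only.
references = [
    [["a", "dog", "runs", "across", "the", "grass"],
     ["a", "brown", "dog", "is", "running", "outside"]],
]
hypotheses = [["a", "dog", "is", "running", "on", "the", "grass"]]

# BLEU-1 through BLEU-4, the scores usually reported for Flickr8K captioning.
print("BLEU-1:", corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0)))
print("BLEU-2:", corpus_bleu(references, hypotheses, weights=(0.5, 0.5, 0, 0)))
print("BLEU-3:", corpus_bleu(references, hypotheses, weights=(1/3, 1/3, 1/3, 0)))
print("BLEU-4:", corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25)))
```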
[7] Inference:

Caption Generation: For inference, unseen images were passed through the trained VGG16 network to extract features, which were then fed into the LSTM decoder to generate captions.
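A minimal greedy-decoding sketch consistent with the earlier model and tokenizer sketches; the paper does not state whether greedy or beam search was used, so greedy search and the startseq/endseq tokens are assumptions here.

```python
import numpy as np

def generate_caption(model, tokenizer, feature, max_length=35):
    """Greedy decoding: repeatedly predict the most likely next word."""
    index_to_word = {i: w for w, i in tokenizer.word_index.items()}
    words = ["startseq"]
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([" ".join(words)])[0]
        seq = np.array(seq)[np.newaxis, :]                   # shape (1, t)
        preds = model.predict([feature[np.newaxis, :], seq], verbose=0)
        next_id = int(np.argmax(preds[0, -1]))               # last time step
        next_word = index_to_word.get(next_id)
        if next_word is None or next_word == "endseq":
            break
        words.append(next_word)
    return " ".join(words[1:])

# Example: caption one unseen image from its cached VGG16 feature
# (the file name is hypothetical).
# print(generate_caption(model, tokenizer, extract_feature("new_image.jpg")))
```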
[8] Fine-tuning:

5. RESULTS
6. DRAWBACK
7. CONCLUSION
In conclusion, the proposed methodology leveraging
VGG16 and LSTM for Image Caption Generation
demonstrates promising results on the Flickr8K dataset,
showcasing the efficacy of the approach in generating
descriptive and contextually relevant captions. Despite
certain drawbacks, such as dataset limitations and
computational complexity, the methodology serves as a
valuable foundation for advancing research in computer
vision and natural language processing. Future work
should focus on addressing these limitations and
exploring innovative techniques to enhance model
performance, scalability, and interpretability in diverse
application domains. Overall, the findings presented
herein contribute to the ongoing dialogue surrounding
multimodal understanding and intelligent image
captioning systems.
8. REFERENCES