Image Caption Generator Using Deep Learning
1 Sufiyan Ali Khan, 2 G Swetha, 3 G Thejashwini, 4 H Varshith, 5 M Varun Teja, 6 A Siva Kumar
1,2,3,4,5 Student, 6 Professor
1. ABSTRACT

Image captioning is a task in which each image must be understood properly so that a suitable caption with correct grammatical structure can be generated. We present a hybrid system that uses a multilayer CNN (Convolutional Neural Network) to generate keywords describing a given input image and a Long Short Term Memory (LSTM) network to construct meaningful captions from the obtained words. CNNs have proven so effective that they can be applied to almost any estimation problem that takes image data as input, while the LSTM was developed to avoid the poor predictive behaviour of traditional sequence models. We use an encoder-decoder model that is capable of generating grammatically correct captions for images, with VGG16 (Visual Geometry Group) as the encoder and an LSTM as the decoder. Once trained, the model produces captions that closely describe a given image. The efficiency of the model is demonstrated on the Flickr8K dataset, which contains 8000 images with five captions each; the combination of CNN and LSTM captures both the spatial relationships within images and the contextual information of captions, and generates contextually relevant captions.

Keywords—CNN (Convolutional Neural Network), LSTM (Long Short Term Memory), VGG16 (Visual Geometry Group), Deep Learning, Encoder-Decoder.

2. INTRODUCTION

Creating precise descriptions for images has posed a significant challenge in the field of Artificial Intelligence, with a wide range of applications from robotic vision to assisting the visually impaired. The ultimate goal is to develop a system that can produce accurate and meaningful captions for images. Researchers have been exploring various methods to improve predictions, utilizing deep neural networks and machine learning techniques to construct effective models. Using the Flickr8K dataset, which consists of 8000 sample images each with five captions, we focus on two key phases: extracting features from images using a Convolutional Neural Network (CNN) and generating natural language sentences from those features using a Recurrent Neural Network (RNN). In the first phase, we adopt a feature-extraction approach that captures even subtle differences between similar images, rather than simply identifying objects; the VGG-16 model, with 16 weight layers, is employed for this purpose. In the second phase, the extracted features are trained together with the dataset's captions, and LSTM (Long Short Term Memory) architectures are used to formulate sentences for the input images.
3. LITERATURE SURVEY

[1] This document introduces a model that utilizes a pre-trained deep learning model to generate captions for a given image. The VGG model is employed for image processing through deep convolutional neural networks. The generated caption is evaluated by comparing the model's output with actual human sentences: both the model's output and the captions provided by humans for the image are analyzed, and it is determined that the generated caption closely resembles the human-provided caption, resulting in an accuracy of approximately 75%. Therefore, the generated sentence and the human-provided sentence are highly similar.

[2] The image caption generator in this study was developed using the Flickr_8k database, which consists of a wide range of images depicting various situations. With a total of 8000 images, each picture is accompanied by 5 captions. The dataset is split into 6000 training images, 1000 validation images, and 1000 testing images. Through thorough training and testing, the model successfully generated accurate captions for the images. The proposed model utilizes a combination of a Convolutional Neural Network and a Recurrent Neural Network to assign appropriate labels and create grammatically correct captions, with the CNN serving as the encoder and the RNN as the decoder.

[3] In an image caption generator, the VGG16 model acts like a smart filter for pictures. It looks at an image and identifies important features like shapes and objects. These features help create a sort of summary of the picture. This summary, produced by VGG16, is then used by another part of the system to write a sentence describing what's happening in the image. So, VGG16 helps the system understand what's in the picture, and then the caption generator turns that understanding into words.

[4] This study introduces a framework that produces appropriate descriptions based on images. The framework utilizes the Flickr8K dataset, which includes 8000 images, each accompanied by five descriptions. The research showcases a model that employs a neural network to automatically analyze an image and create suitable English captions. The generated captions are categorized as follows: error-free descriptions, descriptions with minimal errors, descriptions somewhat related to the image, and descriptions unrelated to the image.

[5] The integration of VGG16, LSTM, and CNN in various applications, particularly in image caption generation, highlights the significance of combining these powerful components to achieve improved performance. The use of VGG16 as a robust feature extractor proves pivotal, capturing intricate details and semantic information from images. LSTM, with its ability to model sequential data, complements VGG16 by generating coherent and contextually rich captions. The synergy between CNN and LSTM addresses the challenge of combining visual and linguistic information for a more comprehensive understanding of images. While the surveyed literature demonstrates the effectiveness of this combination, ongoing research and advancements in deep learning may unveil novel architectures and methodologies, further refining the synergy between VGG16, LSTM, and CNN in image captioning and related tasks.

4. METHODOLOGY

The methodology focuses on developing an Image Caption Generator that integrates the VGG16 CNN for feature extraction and an LSTM for caption generation.

[1] Data Collection and Preprocessing:

Flickr8K Dataset: The Flickr8K dataset comprises 8,000 images, each paired with five descriptive captions, totaling 40,000 captions. These images cover a wide range of categories and scenes, providing diversity for training and evaluation.

Data Collection: The dataset was collected from the Flickr website, where users upload images with descriptive captions. Each image is associated with multiple captions provided by different users.

Preprocessing: Prior to training, the images were preprocessed by resizing them to a uniform size (e.g., 224x224 pixels) and normalizing the pixel values to a fixed range (e.g., [0, 1]). Captions were tokenized into individual words, and a vocabulary was created by assigning a unique index to each word. Additionally, captions were padded or truncated to ensure uniform length for batch processing during training.
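The paper does not name a specific framework, so the following is only a minimal sketch of the image-preprocessing step, assuming TensorFlow/Keras; the directory path, helper name, and example file name are illustrative, not taken from the paper.

```python
import os
import numpy as np
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# Illustrative path to a local copy of the Flickr8K images.
IMAGE_DIR = "Flickr8k_Dataset/Images"

def load_and_preprocess(image_name, target_size=(224, 224)):
    """Resize an image to 224x224 and scale pixel values to [0, 1]."""
    path = os.path.join(IMAGE_DIR, image_name)
    img = load_img(path, target_size=target_size)   # resize to a uniform size
    arr = img_to_array(img) / 255.0                  # normalize to [0, 1]
    return np.expand_dims(arr, axis=0)               # add a batch dimension

# Example usage (the file name is hypothetical):
# x = load_and_preprocess("example_image.jpg")
# print(x.shape)  # (1, 224, 224, 3)
```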
[2] Feature Extraction Using VGG16:

VGG16 Architecture: VGG16, a widely used convolutional neural network (CNN) architecture, was employed as the feature extractor. The model consists of 16 weight layers: 13 convolutional layers followed by three fully connected layers.

Pre-trained Weights: Pre-trained weights of the VGG16 model trained on the ImageNet dataset were utilized. The fully connected layers of VGG16 were removed, and only the convolutional layers were retained to extract image features while discarding unnecessary classification information.

Feature Extraction: Each image in the Flickr8K dataset was passed through the modified VGG16 network, and the activations from one of the intermediate convolutional layers were extracted. These activations served as high-level feature representations capturing the visual content of the images.
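A minimal sketch of this feature-extraction step, assuming TensorFlow/Keras: the fully connected layers are dropped via include_top=False, and global average pooling turns the final convolutional activations into a single 512-dimensional vector per image. The pooling choice is an assumption; the paper only states that activations from an intermediate convolutional layer were used.

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# VGG16 with ImageNet weights and the fully connected layers removed;
# global average pooling reduces the final conv feature maps to a 512-d vector.
encoder = VGG16(weights="imagenet", include_top=False, pooling="avg")

def extract_feature(image_path):
    """Return a (512,) feature vector for one image."""
    img = load_img(image_path, target_size=(224, 224))
    x = img_to_array(img)[np.newaxis, ...]
    # preprocess_input applies the ImageNet normalization the pretrained
    # weights expect (this differs slightly from plain [0, 1] scaling).
    x = preprocess_input(x)
    return encoder.predict(x, verbose=0)[0]
```

In practice the features for all 8,000 images are typically computed once and cached (for example in a dictionary keyed by image id) before caption training begins.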
[3] Caption Preprocessing:

Tokenization: The captions associated with each image were tokenized into individual words or subword units to represent the textual content.

Vocabulary Creation: A vocabulary was constructed by compiling all unique words from the captions and assigning a unique index to each word.

Padding or Truncation: To ensure uniform length for captions, they were padded with a special token (e.g., <PAD>) or truncated to a maximum length. This step facilitated batch processing during training.
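A minimal sketch of the caption-preprocessing step, again assuming TensorFlow/Keras utilities. The startseq/endseq boundary tokens and the sample captions are illustrative conventions, not values specified in the paper; padding index 0 plays the role of the <PAD> token.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Example captions keyed by image id (contents are illustrative).
captions = {
    "img1": ["startseq a dog runs across the grass endseq",
             "startseq a brown dog is running outside endseq"],
}
all_captions = [c for caps in captions.values() for c in caps]

# Build the vocabulary: every unique word gets an integer index.
tokenizer = Tokenizer(oov_token="<unk>")
tokenizer.fit_on_texts(all_captions)
vocab_size = len(tokenizer.word_index) + 1           # +1 for the padding index 0
max_length = max(len(c.split()) for c in all_captions)

# Convert captions to index sequences and pad them to a uniform length.
sequences = tokenizer.texts_to_sequences(all_captions)
padded = pad_sequences(sequences, maxlen=max_length, padding="post")
print(vocab_size, max_length, padded.shape)
```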
[4] Model Architecture:

Sequence-to-Sequence Architecture: The model architecture comprised an encoder-decoder framework, with VGG16 serving as the encoder and an LSTM network serving as the decoder.

Integration of VGG16 Features: The features extracted by VGG16 were fed into the initial state of the LSTM decoder, enabling the model to generate captions conditioned on the visual content of the images.
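A minimal sketch of such an encoder-decoder in TensorFlow/Keras: the 512-d VGG16 feature is projected into the LSTM's initial hidden and cell states, and the decoder reads a caption prefix and predicts the next word at every position. The layer sizes (256 units, 256-d embeddings) and the example vocabulary size are illustrative choices, not values reported in the paper.

```python
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout
from tensorflow.keras.models import Model

def build_caption_model(vocab_size, feat_dim=512, units=256):
    # Encoder side: project the precomputed VGG16 feature vector into the
    # LSTM's initial hidden state and cell state.
    img_input = Input(shape=(feat_dim,), name="image_feature")
    h0 = Dense(units, activation="tanh")(img_input)
    c0 = Dense(units, activation="tanh")(img_input)

    # Decoder side: embed the caption prefix (variable length) and run the
    # LSTM, predicting a distribution over the vocabulary at each time step.
    seq_input = Input(shape=(None,), name="caption_prefix")
    emb = Embedding(vocab_size, units, mask_zero=True)(seq_input)
    emb = Dropout(0.3)(emb)
    lstm_out = LSTM(units, return_sequences=True)(emb, initial_state=[h0, c0])
    outputs = Dense(vocab_size, activation="softmax")(lstm_out)

    return Model(inputs=[img_input, seq_input], outputs=outputs)

# Illustrative size; use the tokenizer's vocab_size in practice.
model = build_caption_model(vocab_size=8000)
model.summary()
```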
[5] Training:

Dataset Splitting: The Flickr8K dataset was divided into training, validation, and test sets, typically with a standard split of 6,000 training images, 1,000 validation images, and 1,000 test images.

Training Parameters: Training parameters such as batch size, learning rate, and number of epochs were selected through experimentation and hyperparameter tuning.

Optimization: The model was trained using an optimization algorithm such as Adam, with the cross-entropy loss function used to measure the discrepancy between predicted and ground truth captions.

Teacher Forcing: During training, teacher forcing was employed to facilitate learning by feeding the ground truth previous word as input to predict the next word.
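A minimal sketch of this training setup under the same assumptions: teacher forcing is implemented by using the caption shifted right by one position as decoder input and the caption shifted left by one as the target, and the model from the previous sketch is compiled with Adam and a cross-entropy loss. The dummy data, batch size, and epoch count are placeholders, since the paper does not report its tuned values.

```python
import numpy as np

# Dummy stand-ins for the outputs of the earlier steps: 512-d VGG16 features
# keyed by image id, and padded caption index sequences per image (0 = pad).
features = {"img1": np.random.rand(512).astype("float32")}
padded_caps = {"img1": [np.array([1, 4, 9, 2, 0, 0])]}

def make_training_pairs(features, padded_caps):
    """Teacher forcing: input is the caption minus its last token,
    target is the caption shifted left by one (the next word at each step)."""
    X_img, X_seq, y = [], [], []
    for img_id, caps in padded_caps.items():
        for cap in caps:
            X_img.append(features[img_id])
            X_seq.append(cap[:-1])
            y.append(cap[1:])
    return np.array(X_img), np.array(X_seq), np.array(y)

X_img, X_seq, y = make_training_pairs(features, padded_caps)

# `model` is the encoder-decoder from the previous sketch.
# Sparse categorical cross-entropy avoids one-hot encoding the targets;
# batch size and epoch count are illustrative, not the paper's tuned values.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit([X_img, X_seq], y, batch_size=64, epochs=20)
```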
[6] Evaluation:

Evaluation Metrics: Evaluation of the trained model was performed using standard metrics such as BLEU score, METEOR, and CIDEr, which measure the similarity between generated captions and human-written references.

Dataset Split: The evaluation was conducted on the validation and test sets of the Flickr8K dataset to assess the generalization performance of the model.

Comparison: The performance of the model was compared with baseline methods or previous state-of-the-art approaches to validate its effectiveness.
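As an illustration of the BLEU part of this evaluation, the sketch below scores generated captions against the reference captions for each image using NLTK's corpus BLEU implementation; the tokenized sentences shown are illustrative, and METEOR and CIDEr require additional tooling that is omitted here.

```python
from nltk.translate.bleu_score import corpus_bleu

# Each test image contributes a list of tokenized reference captions and one
# tokenized generated caption. Contents here are illustrative only.
references = [
    [["a", "dog", "runs", "across", "the", "grass"],
     ["a", "brown", "dog", "is", "running", "outside"]],
]
hypotheses = [["a", "dog", "is", "running", "on", "the", "grass"]]

# BLEU-1 through BLEU-4, the scores usually reported for Flickr8K captioning.
print("BLEU-1:", corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0)))
print("BLEU-2:", corpus_bleu(references, hypotheses, weights=(0.5, 0.5, 0, 0)))
print("BLEU-3:", corpus_bleu(references, hypotheses, weights=(1/3, 1/3, 1/3, 0)))
print("BLEU-4:", corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25)))
```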
[7] Inference:

Caption Generation: For inference, unseen images were passed through the trained VGG16 network to extract features, which were then fed into the LSTM decoder to generate captions.
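A minimal greedy-decoding sketch consistent with the earlier model and tokenizer sketches; the paper does not state whether greedy or beam search was used, so greedy search and the startseq/endseq tokens are assumptions here.

```python
import numpy as np

def generate_caption(model, tokenizer, feature, max_length=35):
    """Greedy decoding: repeatedly predict the most likely next word."""
    index_to_word = {i: w for w, i in tokenizer.word_index.items()}
    words = ["startseq"]
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([" ".join(words)])[0]
        seq = np.array(seq)[np.newaxis, :]                   # shape (1, t)
        preds = model.predict([feature[np.newaxis, :], seq], verbose=0)
        next_id = int(np.argmax(preds[0, -1]))               # last time step
        next_word = index_to_word.get(next_id)
        if next_word is None or next_word == "endseq":
            break
        words.append(next_word)
    return " ".join(words[1:])

# Example: caption one unseen image from its cached VGG16 feature
# (the file name is hypothetical).
# print(generate_caption(model, tokenizer, extract_feature("new_image.jpg")))
```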
[8] Fine-tuning:

5. RESULTS
6. DRAWBACK
7. CONCLUSION
In conclusion, the proposed methodology leveraging
VGG16 and LSTM for Image Caption Generation
demonstrates promising results on the Flickr8K dataset,
showcasing the efficacy of the approach in generating
descriptive and contextually relevant captions. Despite
certain drawbacks, such as dataset limitations and
computational complexity, the methodology serves as a
valuable foundation for advancing research in computer
vision and natural language processing. Future work
should focus on addressing these limitations and
exploring innovative techniques to enhance model
performance, scalability, and interpretability in diverse
application domains. Overall, the findings presented
herein contribute to the ongoing dialogue surrounding
multimodal understanding and intelligent image
captioning systems.
8. REFERENCES