Image Captioning Using CNN & LSTM: Digital Signal Processing Laboratory (EEE - 316)
● Problem Statement
● Basic building blocks for the network
- CNN
- Transfer Learning
- RNN
- LSTM
● How do we wire them together?
● Code
● Other places this can be implemented
● Interaction & Questions
Problem Overview
Overall Model:
Building Blocks for the Network:
CNN
A convolution layer is a feature detector: it learns convolution kernels that filter out irrelevant
information from the input and respond strongly to the patterns that matter.
Pooling layers compute the maximum or average value of a feature over a region of the input,
downsampling the feature maps. This makes detection more robust to where an object appears in
the image and reduces the memory footprint.
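As a concrete (hedged) illustration, here is a minimal convolution + pooling block in tf.keras; the layer counts and sizes are arbitrary choices for this sketch, not the ones used in the project.

```python
# A minimal sketch of a convolution + pooling block in tf.keras
# (illustrative only; layer sizes are arbitrary, not the project's).
import tensorflow as tf
from tensorflow.keras import layers, models

cnn_block = models.Sequential([
    # 32 learned 3x3 kernels slide over the image and act as feature detectors.
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    # Max pooling keeps the strongest response in each 2x2 region,
    # halving the spatial size and adding some translation invariance.
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
])

cnn_block.summary()  # shows how each pooling layer shrinks the feature maps
```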
Building Blocks for the Network:
Transfer Learning
Building Blocks for the Network:
Inception V3
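Below is a sketch of how a pretrained Inception V3 can be reused as a fixed feature extractor in tf.keras. The 299x299 input size and the 2048-d pooled output follow the standard Inception V3 setup; the helper name encode_image is only illustrative.

```python
# A hedged sketch of transfer learning with a pretrained Inception V3:
# the ImageNet weights are reused, the classification head is dropped,
# and the pooled 2048-d output serves as the image feature.
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image

# include_top=False + pooling="avg" gives a 2048-d global feature vector.
feature_extractor = InceptionV3(weights="imagenet", include_top=False, pooling="avg")
feature_extractor.trainable = False  # freeze the pretrained weights

def encode_image(path):
    """Load an image, resize to Inception's 299x299 input, and extract features."""
    img = image.load_img(path, target_size=(299, 299))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return feature_extractor.predict(x)[0]  # shape: (2048,)
```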
Building Blocks for the Network:
RNN
● As humans, we understand language in context
● We don't reset our understanding after every word
● Our thoughts have persistence
● Traditional feed-forward networks such as CNNs have no such persistence
● Speech recognition, language modeling, and translation all require this
persistence (see the sketch below)
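A small sketch of that persistence idea, assuming tf.keras: a recurrent layer carries a hidden state from one timestep to the next, so each output depends on everything seen so far. The shapes here are arbitrary.

```python
# The "persistence" idea: an RNN keeps a hidden state that is carried
# from one timestep to the next (illustrative shapes only).
import tensorflow as tf
from tensorflow.keras import layers

rnn = layers.SimpleRNN(8, return_sequences=True, return_state=True)

sequence = tf.random.normal((1, 5, 16))   # batch of 1, 5 timesteps, 16 features
outputs, final_state = rnn(sequence)

# The output at each step depends on the current input AND the previous state,
# which a plain feed-forward (CNN-style) layer cannot do.
print(outputs.shape)      # (1, 5, 8)  one output per timestep
print(final_state.shape)  # (1, 8)     the persistent "memory" after the sequence
```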
[Figure: recurrent cell internals, labelled "Forget Gate" and "Concatenate"]
Building Blocks for the Network:
LSTM
[Figure: a classic neuron, for comparison with the LSTM cell]
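To make the gate labels concrete, here is a minimal NumPy sketch of one step of a standard LSTM cell: the previous hidden state is concatenated with the current input, and the forget gate decides how much of the old cell state survives. The weight layout and toy sizes are assumptions for illustration, not the project's code.

```python
# One LSTM timestep, following the standard LSTM equations.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """W has shape (4*hidden, hidden + input_dim), b has shape (4*hidden,)."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b   # concatenate, then one affine map
    f = sigmoid(z[0*hidden:1*hidden])           # forget gate: what to erase from c_prev
    i = sigmoid(z[1*hidden:2*hidden])           # input gate: what new info to write
    g = np.tanh(z[2*hidden:3*hidden])           # candidate cell values
    o = sigmoid(z[3*hidden:4*hidden])           # output gate
    c_t = f * c_prev + i * g                    # persistent cell state
    h_t = o * np.tanh(c_t)                      # new hidden state
    return h_t, c_t

# Toy sizes: 3-dimensional input, 4-dimensional hidden state.
rng = np.random.default_rng(0)
W, b = rng.normal(size=(16, 7)), np.zeros(16)
h, c = np.zeros(4), np.zeros(4)
h, c = lstm_step(rng.normal(size=3), h, c, W, b)
```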
Building Blocks for the Network:
LSTM
Embeddings turn textual data (words, sentences, paragraphs) into high-dimensional
vector representations and place semantically similar items close together in the
vector space. This lets the computer measure similarity mathematically.
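A minimal sketch of an embedding lookup in tf.keras; the vocabulary size and embedding dimension below are arbitrary placeholders.

```python
# Each word index maps to a dense vector; after training, semantically
# similar words end up close together in this vector space.
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, embed_dim = 5000, 256
embedding = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)

caption_ids = tf.constant([[12, 847, 3, 91]])   # a caption encoded as word indices
vectors = embedding(caption_ids)
print(vectors.shape)   # (1, 4, 256): one 256-d vector per word
```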
Final Model:
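One plausible way to wire the pieces together is the common "merge" captioning architecture sketched below in tf.keras: the Inception V3 image feature and the LSTM-encoded caption prefix are combined to predict the next word. The layer sizes, the 34-word maximum caption length, and the add-style merge are assumptions for this sketch, not necessarily the exact configuration of the final model.

```python
# A "merge"-style CNN + LSTM captioning model (sizes are assumptions).
from tensorflow.keras import layers, Model

vocab_size, max_len, embed_dim, units = 5000, 34, 256, 256

# Image branch: 2048-d Inception V3 feature vector -> dense projection.
img_in = layers.Input(shape=(2048,))
img_feat = layers.Dense(units, activation="relu")(layers.Dropout(0.5)(img_in))

# Text branch: caption prefix -> embeddings -> LSTM.
txt_in = layers.Input(shape=(max_len,))
txt_emb = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(txt_in)
txt_feat = layers.LSTM(units)(layers.Dropout(0.5)(txt_emb))

# Decoder: merge the two branches and predict the next word of the caption.
merged = layers.add([img_feat, txt_feat])
out = layers.Dense(vocab_size, activation="softmax")(
    layers.Dense(units, activation="relu")(merged))

caption_model = Model(inputs=[img_in, txt_in], outputs=out)
caption_model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
caption_model.summary()
```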
Training Data:
Flickr8k Dataset:
The dataset contains 8,000 different images, each with 5 different
human-labelled captions.
Each image is given 5 different captions. Some example captions from the dataset:
● Three people are on a boat in the water
● Three people pose for a picture together
● One man is sitting at a table in front of a restaurant
● A soccer player prepares to kick the ball
● A group of kids play in the water
● A boy hits the ball at a baseball game
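For reference, here is a hedged sketch of loading the Flickr8k captions, assuming the commonly distributed Flickr8k.token.txt file in which each line looks like "image.jpg#index<TAB>caption"; the file name and format are assumptions, not taken from the slides.

```python
# Parse Flickr8k captions into {image id -> list of its 5 captions}.
from collections import defaultdict

def load_captions(path="Flickr8k.token.txt"):
    captions = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_tag, caption = line.split("\t")
            image_id = image_tag.split("#")[0]   # drop the "#0".."#4" caption index
            captions[image_id].append(caption.lower())
    return captions

# captions = load_captions()
# print(len(captions))   # expected: ~8000 images, 5 captions each
```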
Application: