Image Captioning Using CNN & LSTM: Digital Signal Processing Laboratory (EEE-316)

This document describes a model for image captioning using a convolutional neural network (CNN) and long short-term memory (LSTM) network. It first outlines the problem and overall model, then discusses the key building blocks: CNNs for image feature extraction, transfer learning with Inception V3, RNNs/LSTMs for generating captions, and word embeddings. It shows how these pieces are connected in the final model and provides examples of model performance on test and real data, as well as potential applications.

Digital Signal Processing Laboratory (EEE-316)

Image Captioning Using CNN & LSTM

Uday Kamal (Id: 1406041), Hasib Amin (Id: 1406045), Rajib Al-Sabah (Id: 1406035), Abhishek Shushil (Id: 1406034)

Bangladesh University of Engineering and Technology (BUET)


Presentation Outline

● Problem Statement
● Basic building blocks for the network
  - CNN
  - Transfer Learning
  - RNN
  - LSTM
● How do we wire them together?
● Code
● Other places this can be implemented
● Interaction & Questions
Problem Overview

Overall Model:

Building Blocks for the Network: CNN

A convolution layer is a feature detector that automatically learns to filter unneeded information out of its input using convolution kernels.
Pooling layers compute the max or average value of a particular feature over a region of the input (downsampling the input images). Pooling also helps detect objects in unusual positions and reduces memory size.
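As a concrete illustration, one convolution + pooling stage takes only a few lines of Keras. This is a minimal sketch; the filter count, kernel size, and input shape are illustrative assumptions, not values from the slides:

```python
# Minimal sketch: one convolution + pooling stage in Keras.
from tensorflow.keras import layers, models

model = models.Sequential([
    # 32 learned 3x3 kernels act as feature detectors over the image.
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(299, 299, 3)),
    # Max pooling keeps the strongest response in each 2x2 region,
    # halving the spatial size (downsampling) and saving memory.
    layers.MaxPooling2D((2, 2)),
])
model.summary()
```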
Building Blocks for the Network: Transfer Learning

Building Blocks for the Network: Inception V3
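A minimal Keras sketch of this transfer-learning step, assuming ImageNet weights and Inception V3's standard 299x299 input (the global-average-pooling choice is an assumption):

```python
# Minimal sketch: use a pre-trained Inception V3 as an image encoder.
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.applications.inception_v3 import preprocess_input
from tensorflow.keras.models import Model

# Load ImageNet weights and drop the classification head; global average
# pooling turns the final feature maps into a single 2048-d vector.
base = InceptionV3(weights='imagenet', include_top=False, pooling='avg')
encoder = Model(inputs=base.input, outputs=base.output)
encoder.trainable = False  # transfer learning: keep the learned filters fixed

# images: a batch of 299x299 RGB images, scaled with preprocess_input.
# features = encoder.predict(preprocess_input(images))  # shape (N, 2048)
```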
Building Blocks for the Network: RNN
● As humans, we understand context
● We don't reset our understanding from scratch every single time
● Our thoughts have persistence
● Traditional NNs like CNNs don't have persistence
● Speech recognition, language modeling, and translation require this persistence

RNNs are general computers which can learn algorithms to map input sequences to output sequences (flexible-sized vectors). The output vector's contents are influenced by the entire history of inputs.
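Concretely, this persistence comes from a hidden state carried from one time step to the next; the standard vanilla-RNN recurrence is

$$h_t = \tanh(W_{hh}\,h_{t-1} + W_{xh}\,x_t + b_h), \qquad y_t = W_{hy}\,h_t + b_y$$

so each output depends, through $h_t$, on the entire history of inputs.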
Building Blocks for the Network: RNN

Building Blocks for the Network: LSTM
The LSTM units give the network memory cells with read, write and reset
operations. During training, the network can learn when it should remember data
and when it should throw it away.
Building Blocks for the Network: LSTM

C_t is the cell state, which flows through the entire chain.
Building Blocks for the Network: LSTM

Forget Gate: the previous output h_(t-1) is concatenated with the current input x_t and passed through a sigmoid layer, which decides what to throw away from the cell state.
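The slide's diagram is not reproduced here; in the standard LSTM formulation this gate is

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)$$

where a value of $f_t$ near 0 means "forget" and near 1 means "keep".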
Building Blocks for the Network: LSTM

Input Gate Layer: a sigmoid layer decides which values to update, while a tanh layer (a classic neuron) computes the new contribution to the cell state.
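In the standard formulation these two parts are

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right), \qquad \tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)$$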
Building Blocks for the Network: LSTM

Update Cell State (memory):
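In the standard formulation, the old state is scaled by the forget gate and the input-gated candidate values are added in:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$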


Building Blocks for the Network: LSTM

Output Gate Layer: a sigmoid layer selects which parts of the cell state to expose; the state is squashed through tanh and multiplied by this gate to form the output to the next layer.
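In the standard formulation,

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \qquad h_t = o_t \odot \tanh(C_t)$$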


Building Blocks for the Network: Word Embedding

Embeddings turn textual data (words, sentences, paragraphs) into high-dimensional vector representations and group them with semantically similar data in a vector space. The computer can thereby detect similarities mathematically.
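A minimal Keras sketch of the idea; the vocabulary size and embedding dimension here are illustrative assumptions:

```python
# Minimal sketch: map integer word indices to dense vectors.
import numpy as np
from tensorflow.keras.layers import Embedding

vocab_size = 8000   # illustrative vocabulary size
embed_dim = 256     # illustrative embedding dimension

embedding = Embedding(input_dim=vocab_size, output_dim=embed_dim)

# A toy "caption" of 5 word indices -> 5 vectors of length 256.
word_ids = np.array([[12, 47, 5, 901, 3]])
vectors = embedding(word_ids)
print(vectors.shape)  # (1, 5, 256)
```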
Final Model:
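The slide shows the wiring as a diagram. Below is a minimal Keras sketch of one common way to combine the pieces, the "merge" architecture, where the Inception V3 feature vector and the embedded partial caption jointly predict the next word. The layer sizes, maximum caption length, and merge choice are assumptions, not values from the slides:

```python
# Minimal sketch of a CNN + LSTM captioning model ("merge" style).
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size, embed_dim, max_len = 8000, 256, 34  # illustrative values

# Image branch: the 2048-d Inception V3 feature vector.
img_in = Input(shape=(2048,))
img_vec = Dense(256, activation='relu')(img_in)

# Text branch: the caption generated so far, embedded then encoded by an LSTM.
txt_in = Input(shape=(max_len,))
txt_emb = Embedding(vocab_size, embed_dim, mask_zero=True)(txt_in)
txt_vec = LSTM(256)(txt_emb)

# Merge both branches and predict a distribution over the next word.
merged = add([img_vec, txt_vec])
out = Dense(vocab_size, activation='softmax')(Dense(256, activation='relu')(merged))

model = Model(inputs=[img_in, txt_in], outputs=out)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```

At inference time the model runs word by word: feed the image features plus the caption so far, take the predicted next word, append it, and repeat until an end token.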
Training Data:

Flickr8k Dataset:
The dataset contains 8000 different images, each with 5 different human-labelled captions.
The example image is given 5 different captions:

1) A boy runs as others play on a home-made slip and slide.
2) Children in swimming clothes in a field.
3) Little kids are playing outside with a water hose and are sliding down a water slide.
4) Several children are playing outside with a wet tarp on the ground.
5) Several children playing on a homemade water slide.


Training History:
Model’s Performance on Test Data:
Model’s Performance on Real Data:

Generated captions (left to right, top to bottom):

● Three people are on a boat in the water
● Three people pose for a picture together
● One man is sitting at a table in front of a restaurant
● A soccer player prepares to kick the ball
● A group of kids play in the water
● A boy hits the ball at a baseball game
Applications:

● Visual-to-text systems for blind people
● Search engines that search medical records by content-based captions
● Auto-tagging of different imaging data
● Auto video tagging and summary generation
