
gPhotoCNN: Is this a Good Photograph?

MINI PROJECT REPORT

Submitted by

Utkarsh Gupta
17BCS058

Agha Syed
17BCS006

BACHELOR OF TECHNOLOGY

IN

COMPUTER SCIENCE ENGINEERING

SHRI MATA VAISHNO DEVI UNIVERSITY

November 2019
SHRI MATA VAISHNO DEVI UNIVERSITY

CERTIFICATE

Certified that this project report “gPhotoCNN: Is this a Good Photograph?” is the work of “Utkarsh Gupta 17BCS058 and Agha Syed 17BCS006”, who carried out the mini project work under my supervision.

Submitted to the Viva Voce Examination held on ___________________

INTERNAL EXAMINER                                EXTERNAL EXAMINER

ACKNOWLEDGEMENT

We would like to express our sincere gratitude towards our project mentor, Mr. SUDESH KUMAR, who constantly motivated and supervised us throughout the duration of the project. We are grateful to him for his critical views and for the relentless support and belief he had in us.

TABLE OF CONTENTS

1. ABSTRACT
2. TASKS OF OUR PROJECT
3. INTRODUCTION
3.1. WHAT IS CNN?
3.1.1. HISTORY
3.1.2. DESIGN
3.2. WHAT IS WEB SCRAPING?
3.3. WHAT IS CLEANING DATA?
4. TOOLS USED
4.1. PYTHON 3.6.4
4.2. JUPYTER NOTEBOOKS
4.3. TENSORFLOW
4.4. KERAS
4.5. INSTALOADER
5. METHODOLOGY (DAY 1 - DAY 11)
6. CONCLUSION
7. REFERENCES

ABSTRACT

This project targets creating a Machine Learning model which can predict the probability of a particular photograph being selected as a post on the Instagram account @indiapictures.

Instagram holds millions of photographs and is a platform where people showcase their photography skills. The photos posted there are judged by millions of people and receive likes, comments and shoutouts. This provides us an opportunity to collect a dataset that has been reviewed and accepted by millions of people.

There is an Instagram handle named @indiapictures which has 460K followers and 5,281 posts. On Instagram, new photographers who want their photos to be posted on this account post their photos with the hashtag #indiapictures. This account then features only a few images that are good enough, out of the hundreds of thousands of submissions.

We need a model which can evaluate the quality of a photograph captured by a newbie photographer. This model will output a probability in the range [0, 1].

TASKS OF OUR PROJECT

1. Web Scraping
First of all, we will scrape the photos from Instagram that use #indiapictures, as well as all the photos that have been posted on the account @indiapictures.

2. Cleaning Data
We will then remove corrupted and duplicate images and irrelevant metadata from the scraped data, and split the images into our training_set and test_set.

3. Building CNN
The next step will be to create a Machine Learning model based on a ConvNet. We will decide on the best neural network architecture to attack this particular problem.

INTRODUCTION

1. WHAT IS CNN?

In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural networks, most commonly applied to analyzing visual imagery.

The name "convolutional neural network" indicates that the network employs a mathematical operation called convolution. Convolution is a specialized kind of linear operation. Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.

CNNs are regularized versions of multilayer perceptrons. The "fully-connectedness" of these networks makes them prone to overfitting data. However, CNNs take a different approach towards regularization: they take advantage of the hierarchical pattern in data and assemble more complex patterns using smaller and simpler patterns. Therefore, on the scale of connectedness and complexity, CNNs are on the lower extreme. The equation of the convolution function is given below.
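In its standard discrete form, for a two-dimensional input image I and a kernel K:

S(i, j) = (I * K)(i, j) = Σ_m Σ_n I(m, n) · K(i − m, j − n)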

1.1 HISTORY

A system to recognize hand-written ZIP Code numbers involved convolutions in which the kernel coefficients had been laboriously hand-designed.

Yann LeCun et al. (1989) used back-propagation to learn the convolution kernel coefficients directly from images of hand-written numbers. Learning was thus fully automatic, performed better than manual coefficient design, and was suited to a broader range of image recognition problems and image types.

1.2 DESIGN

A convolutional neural network consists of an input and an output layer, as well as multiple hidden layers. The hidden layers of a CNN typically consist of a series of convolutional layers that convolve with a multiplication or other dot product. The activation function is commonly a ReLU layer, and is subsequently followed by additional convolutions such as pooling layers, fully connected layers and normalization layers, referred to as hidden layers because their inputs and outputs are masked by the activation function and final convolution.

The final convolution, in turn, often involves backpropagation in order to more accurately weight the end product.

Though the layers are colloquially referred to as convolutions, this is only by convention. Mathematically, it is technically a sliding dot product or cross-correlation. This has significance for the indices in the matrix, in that it affects how weight is determined at a specific index point.

Here are the elements of our Design:

A. Receptive field
The input area of a neuron is called its receptive field. In a fully connected layer, each neuron receives input from every element of the previous layer. In a convolutional layer, neurons receive input from only a restricted subarea of the previous layer.

B. Weights
Each neuron in a neural network computes an output value by applying a
specific function to the input values coming from the receptive field in
the previous layer. The function that is applied to the input values is
determined by a vector of weights and a bias (typically real numbers).
Learning, in a neural network, progresses by making iterative adjustments
to these biases and weights. The vector of weights and the bias are called
filters and represent particular features of the input (e.g., a particular
shape). A distinguishing feature of CNNs is that many neurons can share
the same filter. This reduces memory footprint because a single bias and a
single vector of weights are used across all receptive fields sharing that
filter.

C. Convolutional
The input is a tensor with shape (number of images) x (image width) x (image height) x (image depth). Convolution is performed with kernels whose width and height are hyper-parameters, and whose depth must be equal to that of the image. Convolutional layers convolve the input and pass the result to the next layer. This is similar to the response of a neuron in the visual cortex to a specific stimulus. Each convolutional neuron processes data only for its receptive field. If the layers were fully connected instead, a very high number of neurons would be necessary, even in a shallow (opposite of deep) architecture, due to the very large input sizes associated with images, where each pixel is a relevant variable.

D. Pooling
Convolutional networks may include local or global pooling layers to
streamline the underlying computation. Pooling layers reduce the
dimensions of the data by combining the outputs of neuron clusters at one
layer into a single neuron in the next layer. Local pooling combines small
clusters, typically 2 x 2. Max pooling uses the maximum value from each cluster of neurons at the prior layer, while average pooling uses the average value from each cluster (a short numeric sketch of both operations follows this list).

E. Fully connected
Fully connected layers connect every neuron in one layer to every neuron
in another layer. It is in principle the same as the traditional multi-layer
perceptron neural network (MLP). The flattened matrix goes through a
fully connected layer to classify the images.
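As a numeric illustration of the pooling operation described in D above (the array values here are arbitrary and chosen purely for demonstration), 2 x 2 pooling with stride 2 collapses each 2 x 2 block of a feature map into a single number:

import numpy as np

# A toy 4 x 4 feature map (values are arbitrary, for illustration only)
x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 8, 3],
              [1, 0, 4, 9]], dtype=float)

# Group the map into non-overlapping 2 x 2 blocks, then reduce each block
blocks = x.reshape(2, 2, 2, 2)
max_pooled = blocks.max(axis=(1, 3))    # [[6. 4.]
                                        #  [7. 9.]]
avg_pooled = blocks.mean(axis=(1, 3))   # [[3.75 2.25]
                                        #  [2.5  6.  ]]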

2. WHAT IS WEB SCRAPING?

Web Scraping (also termed Screen Scraping, Web Data Extraction, Web
Harvesting etc.) is a technique employed to extract large amounts of data from
websites, whereby the data is extracted and saved to a local file on your computer
or to a database in table (spreadsheet) format.

Data displayed by most websites can only be viewed using a web browser. They
do not offer the functionality to save a copy of this data for personal use. The
only option then is to manually copy and paste the data - a very tedious job
which can take many hours or sometimes days to complete. Web Scraping is the
technique of automating this process, so that instead of manually copying the
data from websites, the Web Scraping software will perform the same task
within a fraction of the time.
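For illustration, a few lines of Python with the requests and BeautifulSoup libraries are enough to pull every image URL out of a page. The URL below is a placeholder, and this generic snippet is not the approach used later in this project, which relies on Instaloader instead:

import requests
from bs4 import BeautifulSoup

# Fetch a page and collect the URLs of all images embedded in its HTML
resp = requests.get("https://example.com/gallery")
soup = BeautifulSoup(resp.text, "html.parser")
image_urls = [img["src"] for img in soup.find_all("img") if img.has_attr("src")]
print(len(image_urls), "images found")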

3. WHAT IS CLEANING DATA?

Data cleaning is the process of detecting and correcting (or removing) corrupt or
inaccurate records from a record set, table, or database and refers to identifying
incomplete, incorrect, inaccurate or irrelevant parts of the data and then
replacing, modifying, or deleting the dirty or coarse data. Data cleansing may be
performed interactively with data wrangling tools, or as batch processing
through scripting.

After cleansing, a data set should be consistent with other similar data sets in the
system. The inconsistencies detected or removed may have been originally
caused by user entry errors, by corruption in transmission or storage, or by
different data dictionary definitions of similar entities in different stores. Data
cleaning differs from data validation in that validation almost invariably means
data is rejected from the system at entry and is performed at the time of entry,
rather than on batches of data.

The actual process of data cleansing may involve removing typographical errors
or validating and correcting values against a known list of entities. The
validation may be strict (such as rejecting any address that does not have a valid
postal code) or fuzzy (such as correcting records that partially match existing,
known records). Some data cleansing solutions will clean data by
cross-checking with a validated data set. A common data cleansing practice is
data enhancement, where data is made more complete by adding related
information.
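For illustration, a small batch-cleaning script can drop duplicate and incomplete records in a few lines. The records and column names below are made up purely for demonstration:

import pandas as pd

# Toy record set with one exact duplicate row and one missing value
records = pd.DataFrame({
    "file":  ["a.jpg", "b.jpg", "b.jpg", "c.jpg"],
    "likes": [120, 45, 45, None],
})
cleaned = (records
           .drop_duplicates()          # remove exact duplicate records
           .dropna(subset=["likes"]))  # drop incomplete records
print(cleaned)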

TOOLS USED

1. PYTHON 3.6.4
Python is an interpreted, high-level, general-purpose programming
language. Created by Guido van Rossum and first released in 1991,
Python's design philosophy emphasizes code readability with its notable
use of significant whitespace. Its language constructs and object-oriented
approach aim to help programmers write clear, logical code for small and
large-scale projects. It is used in Machine Learning, Image Processing,
Scientific Calculations, Web Scraping, Databases etc.

2. JUPYTER NOTEBOOKS
Jupyter Notebook (formerly IPython Notebook) is a web-based interactive computational environment for creating notebook documents. A notebook contains an ordered list of input/output cells which can contain code, text (using Markdown), mathematics, plots and rich media, and is usually saved with the .ipynb extension. A Jupyter Notebook can be
converted to a number of open standard output formats (HTML,
presentation slides, LaTeX, PDF, ReStructuredText, Markdown, Python).

3. TENSORFLOW
TensorFlow is a free and open-source software library for dataflow and
differentiable programming across a range of tasks. It is a symbolic math
library, and is also used for machine learning applications such as neural
networks. It is used for both research and production at Google.

4. KERAS
Keras is an open source neural network library written in Python. It is
capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, or
Theano. Designed to enable fast experimentation with deep neural
networks, it focuses on being user-friendly, modular, and extensible. In
2017, Google's TensorFlow team decided to support Keras in
TensorFlow's core library. Keras contains numerous implementations of
commonly used neural network building blocks such as layers,
objectives, activation functions, optimizers, and a host of tools to make
working with image and text data easier.

5. INSTALOADER
Instaloader is a tool to download pictures (or videos) along with their
captions and other metadata from Instagram. It downloads public and
private profiles, hashtags, user stories, feeds and saved media. It also
downloads comments, geotags and captions of each post. It automatically
detects profile name changes and renames the target directory
accordingly. It allows fine-grained customization of filters and where to
store downloaded media. It is free open source software written in
Python.

METHODOLOGY

Day 1 

At noon, I installed Instaloader on Ubuntu to scrape images from @indiapictures. After installation I started by downloading the images that were featured on @indiapictures.

Entered command: instaloader profile "indiapictures"

By evening, 5,185 posts had been downloaded, containing 5,371 images, 102 videos, captions, and metadata of the featured posts. The JPG, MP4, XZ and TXT files were put in separate folders.

In the evening, I started downloading images that were posted under #indiapictures. I began by downloading the posts on the hashtag #indiapictures in which the user had tagged @indiapictures in the image(s) or had mentioned @indiapictures in the caption.

Using command:
instaloader --post-filter="'indiapictures' in caption_mentions or 'indiapictures' in tagged_users" "#indiapictures"

But this method was taking too much time, around 20 seconds per post, and all 20 of the 20 posts I checked satisfied both post-filter conditions anyway. So instead I downloaded all images under #indiapictures.

Entered command:
instaloader --post-filter="not is_video" "#indiapictures"

By 2 PM the next day, about 44k+ (TBD) posts had been downloaded, containing TBD. Downloading stopped with HTTP error code 410.

Day 2

The dataset was approximately 8.94 GB and contained the following folders:

● #indiapictures
❖ 53769 JPG
❖ 336 MP4
❖ 45280 Text Docs
❖ 44135 XZ Files (JSON Archives)
● Indiapictures
❖ 102 MP4 Files
❖ 4 Folders
➢ .xd (Empty)
➢ JPGs
■ 5371 JPG
➢ Txt
■ 5185 Text Docs
➢ XZ
■ 2006 JSON
■ 5219 XZ Files (JSON Archives)
● Profile

The files in #indiapictures are our bad data (pictures that merely used #indiapictures: 53,769 JPGs), and Indiapictures -> JPGs is our good data (pictures that were featured on the @indiapictures page: 5,371 JPGs).

I then created following Directories:

● Dataset

○ test_set

■ bad - 3714 JPG

■ good - 371 JPG

○ training_set

■ bad - 50055 JPG

■ good - 5000 JPG

I made the split such that 5,000 good images go to the training_set and the other 371 to the test_set. The split ratio turned out to be about 6.9% for the test_set, with the rest going to the training_set. We used the same ratio to split the bad images.
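A minimal sketch of how such a split can be scripted (the paths and the helper function are illustrative, not the exact script used in the project):

import random
import shutil
from pathlib import Path

def split_folder(src, train_dir, test_dir, test_ratio=0.069, seed=42):
    """Copy roughly test_ratio of the JPGs in src to test_dir, the rest to train_dir."""
    files = sorted(Path(src).glob("*.jpg"))
    random.Random(seed).shuffle(files)
    n_test = round(len(files) * test_ratio)
    for i, f in enumerate(files):
        dest = Path(test_dir if i < n_test else train_dir)
        dest.mkdir(parents=True, exist_ok=True)
        shutil.copy2(f, dest / f.name)

# split_folder("Indiapictures/JPGs", "dataset/training_set/good", "dataset/test_set/good")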

Day 3

I created a CNN model using Keras (TensorFlow backend). The model I made used too many feature maps, and the Dense layers also contained a large number of neurons. It resulted in an error: a ResourceExhaustedError, raised because my laptop's GPU (GTX 1050 Ti) does not have enough memory to run such a model.

Solution - Make a simpler model that fits within the GPU's memory.

Day 4

I now created a simpler model. It was as such.

from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense

classifier = Sequential()
# Two convolution + max-pooling blocks on 64 x 64 RGB inputs
classifier.add(Convolution2D(128, (5, 5), input_shape=(64, 64, 3), activation='relu'))
classifier.add(MaxPooling2D(pool_size=(2, 2)))
classifier.add(Convolution2D(64, (3, 3), activation='relu'))
classifier.add(MaxPooling2D(pool_size=(2, 2)))
classifier.add(Flatten())
classifier.add(Dense(units=256, activation='relu'))
# Single sigmoid output: probability that the photo is "good"
classifier.add(Dense(units=1, activation='sigmoid'))
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

This model also resulted in an error.

I had assumed that none of the scraped images would be corrupted. I was wrong.

Solution - Remove all the corrupted images.

Day 5

Today, I realized that there is an error in our data. There may be some images in the folder #indiapictures that were also featured on the page; in simple terms, some of our good data files will appear as duplicates among the bad data files. I calculated the probability. Note: we scraped all images that were featured on the page.

Total Images with tag #indiapictures = 40,00,000 (approx.)
Total Scraped Images with tag #indiapictures = 53,769
Total Images featured on the page = 5,371

P (Image is Duplicate)
= Total Images featured on the page / Total Images with tag #indiapictures
= 5,371 / 40,00,000
= 0.00134275

Good Data that may be duplicated in Bad Data
= Total Scraped Images with tag #indiapictures * P (Image is Duplicate)
= 53,769 * 0.00134275
= 72.19 (approx.)

It means about 72 images are duplicated between the Good and Bad data, if we assume that the scraped data is completely random. We need to remove these images before building the model.

Solution - Find software or a script to remove the duplicates.

Day 6

Today, I spent most of the time finding a Python script to detect corrupted files, and I completed my script to find the corrupted images. The code is completely reusable, and anyone who wants to perform this check can use this script. The script is provided as an HTML file, so you should download it first and then open it in a browser.

This script scans a folder for all the JPGs and JPEGs. It then verifies whether each file is corrupted. If a corrupted file is found, it moves that file into a separate directory that you have to create beforehand; in my case I named that folder "corrupted".
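A minimal sketch of what such a check can look like using the Pillow library (the folder layout is illustrative, and the exact code in the original script may differ):

from pathlib import Path
from PIL import Image
import shutil

def quarantine_corrupted(folder, quarantine="corrupted"):
    """Move every JPG/JPEG that Pillow cannot verify into a 'corrupted' sub-folder."""
    folder = Path(folder)
    (folder / quarantine).mkdir(exist_ok=True)
    for path in list(folder.glob("*.jpg")) + list(folder.glob("*.jpeg")):
        try:
            with Image.open(path) as img:
                img.verify()          # raises if the file is truncated or corrupt
        except Exception:
            shutil.move(str(path), str(folder / quarantine / path.name))

# quarantine_corrupted("dataset/training_set/bad")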

Luckily, in our project we got only one corrupted file.

Now, we have only one challenge left before building our CNN: we need to remove duplicates. Removing duplicates is essential in our project because files common to both folders will neutralize the changes in the cost function which we are trying to minimize. In simple words, we would be wasting almost 72 images of the good dataset.

Day 7

Challenge:
We have two folders named "good" and "bad". We want to remove from the folder "bad" the images which are common to both folders.

Today, I completed writing the script to remove the duplicate files. The script first calculated md5 hash codes of the images and created a HashMap out of them. The HashMap was as follows:

hashMap = {
    "good": [
        {"dir": "D://Codes//gPhotoCNN//dataset//good//2013-04-24_05-57-27_UTC.jpg",
         "sign": "4b625aa4eca63d3d41202d96ae3c09620ca0d0b2fb516dca21a892322bf82efa"},
        ...
    ],
    "bad": [...]
}

Now, this hashMap consists of two keys, "good" and "bad", named after the two folders. Each key has an array as its value, whose entries hold the file path and the 128-bit md5 hash code. If two hash codes are identical, then the files are the same. The odds of 2 different files generating identical md5 sums are 1 / 2^128, i.e. 1 / 340282366920938463463374607431768211456. In other words, you would need around 2^64 files before there is a 10% chance of a collision. This script will give good results for a project of our size.

This HashMap is then stored in a JSON file so that we do not need to compute the hash codes again and again, as that is a time-consuming process. We then checked for identical hashes using simple for loops. Luckily, we got only 19 identical pictures. That is far fewer than our computed estimate; it means the scraper mostly picked up posts that were never featured on the @indiapictures page.
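A minimal sketch of the duplicate check (the paths are illustrative, and unlike the original script this version does not cache the HashMap to a JSON file; it deletes the duplicates from the bad folder directly):

import hashlib
from pathlib import Path

def file_hash(path, algo="md5"):
    """Hash a file's contents in chunks so large JPGs do not have to fit in memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

good_hashes = {file_hash(p) for p in Path("dataset/good").glob("*.jpg")}
duplicates = [p for p in Path("dataset/bad").glob("*.jpg")
              if file_hash(p) in good_hashes]
for p in duplicates:
    p.unlink()          # delete the duplicate from the "bad" folder
print(len(duplicates), "duplicates removed")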

Day 8

I started training the model. Training was set up for 5 epochs, with each epoch consisting of 55,037 steps and 4,083 validation steps. During the first epoch the accuracy was not good enough: it was around 90%, which is no better than a trivial model that always predicts "bad", since roughly 90% of the images belong to the bad class.

At present it’s showing 30 hours left for one epoch. Let’s wait for now.
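A sketch of how this training run can be wired up in Keras with flow_from_directory, assuming the directory layout from Day 2, the classifier from Day 4, and a batch size of 1 (which roughly matches the reported step counts); the exact generator settings of the original notebook are not shown in this report:

from keras.preprocessing.image import ImageDataGenerator

train_gen = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    'dataset/training_set', target_size=(64, 64), batch_size=1, class_mode='binary')
test_gen = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    'dataset/test_set', target_size=(64, 64), batch_size=1, class_mode='binary')

classifier.fit_generator(train_gen,
                         steps_per_epoch=55037,
                         epochs=5,
                         validation_data=test_gen,
                         validation_steps=4083)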

Day 9

The first epoch completed, and now we have:

● Accuracy: 91.14%
● Validation Accuracy: 90.06%

The difference between accuracy and validation accuracy is that accuracy is an estimate of the mean accuracy over the training mini-batches, while validation accuracy is the accuracy measured on the validation set.

Now, we may continue training for further epochs as long as the validation accuracy keeps improving.

Day 10

The Notebook Crashed. :-(

There was a power outage for a few hours at my university. Things like this don't happen often, maybe once or twice a year, but this was one of those moments. The model had luckily reached an accuracy of 92.51% before the crash, which was a considerable improvement. Unluckily, I lost all the learning.

This model was not a total waste though. It showed that learning can reach a level that is better than random. This means that if we train it for longer, we may get a sufficiently good result.

New Plan ->

I will search for a method to download weights pretrained on ImageNet and use them to initialize the weights of this model.

Day 11

Today I implemented transfer learning. Transfer learning is a method of taking a pre-trained model's weights and biases and reusing them in your own project. We do this by making changes to the structure of the downloaded neural network so that it fits our use case. It is better than random initialization of weights and biases, because the complex patterns have already been learned by the pre-trained model.

In my case, I used weights pretrained on ImageNet.

Keras has a simple method for transfer learning. We used the VGG16 model with the ImageNet weights (1,000 classes). We removed the top layer so that we could add our own fully connected layers.

The input tensor was of dimension (224, 224, 3).

The ImageNet-pretrained layers were frozen (their weights are not trained). It is good practice to keep these layers frozen so that the new features are learned only in the layers added on top.

We then flattened the output of the convolutional block and added two dense layers with 4 and 1 neurons respectively.
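A minimal Keras sketch of this setup, following the description above (exact hyper-parameters in the original notebook may differ):

from keras.applications import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense

# VGG16 convolutional base with ImageNet weights, without its 1000-class top
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False              # freeze the pretrained layers

x = Flatten()(base.output)
x = Dense(4, activation='relu')(x)       # small fully connected layer
out = Dense(1, activation='sigmoid')(x)  # probability that the photo is "good"

model = Model(inputs=base.input, outputs=out)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])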

This model was very fast to train and produced the following results:
● Accuracy: 90.8%
● Validation Accuracy: 90.9%
● Time Taken: 14 hours 15 minutes

We reached a good accuracy in a comparatively short amount of time.

Conclusion

We built two models:

❏ First Model:
It learned from scratch and took a large amount of time to train; a single epoch took almost 30 hours. The accuracies are as follows:
❏ Accuracy: 91.14%
❏ Validation Accuracy: 90.06%

❏ Second Model:
We used transfer learning for this model: the VGG16 model with weights pretrained on ImageNet (1,000 classes). We then flattened the output of the convolutional block and added two dense layers with 4 and 1 neurons respectively. The accuracies are as follows:
❏ Accuracy: 90.80%
❏ Validation Accuracy: 90.90%

The conclusion of our project is that transfer learning is a very good approach
especially for problems related to computer vision. If we create a model from
scratch, it takes a large amount of time to learn and distracts us from our main
goal.

References

➢ Object Recognition with Gradient-Based Learning - Yann LeCun
➢ ImageNet website
➢ Keras documentation
➢ Instaloader
