Mini Project Final Report
Submitted by
Utkarsh Gupta
17BCS058
Agha Syed
17BCS006
BACHELOR OF TECHNOLOGY
IN
November 2019
SHRI MATA VAISHNO DEVI UNIVERSITY
CERTIFICATE
ACKNOWLEDGEMENT
TABLE OF CONTENTS
1. ABSTRACT
2. TASKS OF OUR PROJECT
3. INTRODUCTION
3.1. WHAT IS CNN?
3.1.1. HISTORY
3.1.2. DESIGN
3.2. WHAT IS WEB SCRAPING?
3.3. WHAT IS CLEANING DATA?
4. TOOLS USED
4.1. PYTHON 3.6.4
4.2. JUPYTER NOTEBOOKS
4.3. TENSORFLOW
4.4. KERAS
4.5. INSTALOADER
5. METHODOLOGY (DAY 1 - DAY 11)
6. CONCLUSION
7. REFERENCES
ABSTRACT
This project aims to create a Machine Learning model that predicts the
probability of a particular photograph being selected as a post on the
Instagram account @indiapictures.
We need a model that can evaluate the quality of a photograph captured by a
newcomer photographer. The model outputs a probability in the range [0, 1].
TASKS OF OUR PROJECT
1. Web Scraping
First, we will scrape photos from Instagram that use the hashtag
#indiapictures, along with all the photos posted on the account @indiapictures.
2. Cleaning Data
We will then remove corrupted and duplicate images, as well as irrelevant
metadata, from the scraped data, and split the remaining images into our
training_set and test_set.
3. Building CNN
The next step will be to create a Machine Learning model based on a
convolutional neural network (ConvNet). We will decide on the Neural Network
architecture best suited to this particular problem.
INTRODUCTION
1. WHAT IS CNN?
The name “convolutional neural network” indicates that the network employs a
mathematical operation called convolution. Convolution is a specialized kind of
linear operation. Convolutional networks are simply neural networks that use
convolution in place of general matrix multiplication in at least one of their
layers.
1.1 HISTORY
Yann LeCun et al. (1989) used back-propagation to learn the convolution kernel
coefficients directly from images of hand-written numbers. Learning was thus
fully automatic, performed better than manual coefficient design, and was suited
to a broader range of image recognition problems and image types.
1.2 DESIGN
Here are the elements of our Design:
A. Receptive field
The input area of a neuron is called its receptive field. In a fully
connected layer, each neuron receives input from every element of the
previous layer. In a convolutional layer, neurons receive input from only
a restricted subarea of the previous layer.
B. Weights
Each neuron in a neural network computes an output value by applying a
specific function to the input values coming from the receptive field in
the previous layer. The function that is applied to the input values is
determined by a vector of weights and a bias (typically real numbers).
Learning, in a neural network, progresses by making iterative adjustments
to these biases and weights. The vector of weights and the bias are called
filters and represent particular features of the input (e.g., a particular
shape). A distinguishing feature of CNNs is that many neurons can share
the same filter. This reduces memory footprint because a single bias and a
single vector of weights are used across all receptive fields sharing that
filter.
C. Convolutional
The input is a tensor with shape (number of images) x (image width) x (image
height) x (image depth). The convolutional kernels have a width and height
that are hyper-parameters, and a depth that must equal the depth of the input.
Convolutional layers convolve the input and pass the result to the next layer;
this is similar to the response of a neuron in the visual cortex to a specific
stimulus. Each convolutional neuron processes data only for its receptive
field. This matters because a fully connected architecture would require a
very high number of neurons, even in a shallow (as opposed to deep)
architecture, due to the very large input sizes associated with images, where
each pixel is a relevant variable. (A small numeric sketch of convolution and
pooling is given after this list.)
D. Pooling
Convolutional networks may include local or global pooling layers to
streamline the underlying computation. Pooling layers reduce the
dimensions of the data by combining the outputs of neuron clusters at one
layer into a single neuron in the next layer. Local pooling combines small
clusters, typically 2 x 2. Max pooling uses the maximum value from each
of a cluster of neurons at the prior layer. Average pooling uses the
average value from each of a cluster of neurons at the prior layer.
E. Fully connected
Fully connected layers connect every neuron in one layer to every neuron
in another layer. It is in principle the same as the traditional multi-layer
perceptron neural network (MLP). The flattened matrix goes through a
fully connected layer to classify the images.
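To make the receptive field, convolution and pooling items above concrete, here
is a small illustrative NumPy sketch (not part of the project code): one 3 x 3
convolution over a single receptive field, followed by 2 x 2 max pooling over a
toy 4 x 4 image.

import numpy as np

# A toy 4 x 4 single-channel "image" and a 3 x 3 kernel (kernel size is a hyper-parameter).
image = np.array([[1, 2, 0, 1],
                  [0, 1, 3, 1],
                  [2, 1, 0, 0],
                  [1, 0, 1, 2]], dtype=float)
kernel = np.ones((3, 3)) / 9.0           # a simple averaging filter

# Convolution at one position: the 3 x 3 receptive field in the top-left corner.
receptive_field = image[0:3, 0:3]
conv_output = np.sum(receptive_field * kernel)   # one value of the feature map

# 2 x 2 max pooling: keep the maximum of each 2 x 2 cluster of the image.
pooled = image.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(conv_output)   # 1.111...
print(pooled)        # [[2. 3.]
                     #  [2. 2.]]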
2. WHAT IS WEB SCRAPING?
Web Scraping (also termed Screen Scraping, Web Data Extraction, Web
Harvesting etc.) is a technique employed to extract large amounts of data from
websites whereby the data is extracted and saved to a local file in your computer
or to a database in table (spreadsheet) format.
Data displayed by most websites can only be viewed using a web browser. They
do not offer the functionality to save a copy of this data for personal use. The
only option then is to manually copy and paste the data - a very tedious job
which can take many hours or sometimes days to complete. Web Scraping is the
technique of automating this process, so that instead of manually copying the
data from websites, the Web Scraping software will perform the same task
within a fraction of the time.
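As a generic illustration of the idea (not the method used in this project,
which relies on the Instaloader tool described later), a page can be fetched
and its image URLs saved to a local file with the requests and BeautifulSoup
libraries; the URL below is only a placeholder:

import requests
from bs4 import BeautifulSoup

# Placeholder URL: replace with the page to be scraped.
url = "https://example.com/gallery"
html = requests.get(url, timeout=10).text

# Parse the HTML and collect the source URL of every <img> tag.
soup = BeautifulSoup(html, "html.parser")
image_urls = [img.get("src") for img in soup.find_all("img") if img.get("src")]

# Save the extracted data to a local file, as described above.
with open("image_urls.txt", "w") as f:
    f.write("\n".join(image_urls))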
3. WHAT IS CLEANING DATA?
Data cleaning is the process of detecting and correcting (or removing) corrupt or
inaccurate records from a record set, table, or database and refers to identifying
incomplete, incorrect, inaccurate or irrelevant parts of the data and then
replacing, modifying, or deleting the dirty or coarse data. Data cleansing may be
performed interactively with data wrangling tools, or as batch processing
through scripting.
After cleansing, a data set should be consistent with other similar data sets in the
system. The inconsistencies detected or removed may have been originally
caused by user entry errors, by corruption in transmission or storage, or by
different data dictionary definitions of similar entities in different stores. Data
cleaning differs from data validation in that validation almost invariably means
data is rejected from the system at entry and is performed at the time of entry,
rather than on batches of data.
The actual process of data cleansing may involve removing typographical errors
or validating and correcting values against a known list of entities. The
validation may be strict (such as rejecting any address that does not have a valid
postal code) or fuzzy (such as correcting records that partially match existing,
known records). Some data cleansing solutions will clean data by
cross-checking with a validated data set. A common data cleansing practice is
data enhancement, where data is made more complete by adding related
information.
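For tabular data, the kind of batch cleansing described above can be scripted
in a few lines; a generic pandas sketch (the file name, column name and allowed
values are placeholders, not project data):

import pandas as pd

# Load a raw record set (placeholder file name).
df = pd.read_csv("records.csv")

# Remove exact duplicate records and rows with missing values,
# then validate a field against a known list of allowed entities.
df = df.drop_duplicates()
df = df.dropna()
allowed_countries = {"India", "Nepal", "Bhutan"}
df = df[df["country"].isin(allowed_countries)]

df.to_csv("records_clean.csv", index=False)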
TOOLS USED
1. PYTHON 3.6.4
Python is an interpreted, high-level, general-purpose programming
language. Created by Guido van Rossum and first released in 1991,
Python's design philosophy emphasizes code readability with its notable
use of significant whitespace. Its language constructs and object-oriented
approach aim to help programmers write clear, logical code for small and
large-scale projects. It is used in Machine Learning, Image Processing,
Scientific Calculations, Web Scraping, Databases etc.
2. JUPYTER NOTEBOOKS
Jupyter Notebook (formerly IPython Notebook) is a web-based
interactive computational environment for creating notebook documents. A
notebook contains an ordered list of input/output cells which can hold code,
text (using Markdown), mathematics, plots and rich media, and is usually saved
with the .ipynb extension. A Jupyter notebook can be converted to a number of
open standard output formats (HTML, presentation slides, LaTeX, PDF,
ReStructuredText, Markdown, Python).
3. TENSORFLOW
TensorFlow is a free and open-source software library for dataflow and
differentiable programming across a range of tasks. It is a symbolic math
library, and is also used for machine learning applications such as neural
networks. It is used for both research and production at Google.
4. KERAS
Keras is an open source neural network library written in Python. It is
capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, or
Theano. Designed to enable fast experimentation with deep neural
networks, it focuses on being user-friendly, modular, and extensible. In
2017, Google's TensorFlow team decided to support Keras in
TensorFlow's core library. Keras contains numerous implementations of
commonly used neural network building blocks such as layers,
objectives, activation functions, optimizers, and a host of tools to make
working with image and text data easier.
5. INSTALOADER
Instaloader is a tool to download pictures (or videos) along with their
captions and other metadata from Instagram. It downloads public and
private profiles, hashtags, user stories, feeds and saved media. It also
downloads comments, geotags and captions of each post. It automatically
detects profile name changes and renames the target directory
accordingly. It allows fine-grained customization of filters and where to
store downloaded media. It is free open source software written in
Python.
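Besides the command line used later in this report, Instaloader also exposes a
Python API; a minimal sketch based on its documented quick-start (exact names
should be checked against the installed version):

import instaloader

L = instaloader.Instaloader()

# Download every post of a public profile into a folder named after it.
profile = instaloader.Profile.from_username(L.context, "indiapictures")
for post in profile.get_posts():
    L.download_post(post, target=profile.username)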
METHODOLOGY
Day 1
By evening, 5185 posts had been downloaded, containing 5371 images, 102 videos,
captions, and metadata of the featured posts. The JPG, MP4, XZ and TXT files
were put in separate folders.
I started by downloading posts under the hashtag #indiapictures that either had
@indiapictures tagged in the image(s) or mentioned @indiapictures in the
caption.
Using command:
instaloader --post-filter="'indiapictures' in caption_mentions or 'indiapictures' in tagged_users" "#indiapictures"
But this method was too slow, taking around 20 seconds per post, and all 20 of
the 20 posts I checked satisfied both post-filter conditions anyway. So instead
I downloaded all the images under #indiapictures.
Entered command:
instaloader --post-filter="not is_video" "#indiapictures"
Day 2
The dataset was approximately 8.94 GB and contained the following folders:
● #indiapictures
❖ 53769 JPG
❖ 336 MP4
❖ 45280 Text Docs
❖ 44135 XZ Files (JSON Archives)
● Indiapictures
❖ 102 MP4 Files
❖ 4 Folders
➢ .xd (Empty)
➢ JPGs
■ 5371 JPG
➢ Txt
■ 5185 Text Docs
➢ XZ
■ 2006 JSON
■ 5219 XZ Files (JSON Archives)
● Profile
The files in #indiapictures are our bad data (53,769 JPGs of pictures that
merely used #indiapictures), and Indiapictures -> JPGs is our good data (5,371
JPGs of pictures that were featured on the Indiapictures page).
I then created the following directories:
● Dataset
○ test_set
○ training_set
I made the split such that 5000 good images go to the training_set and the
other 371 to the test_set. The split ratio thus works out to about 6.9% for the
test_set and the rest for the training_set. We used the same ratio to split the
bad images as well.
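A minimal sketch of how such a split can be scripted (the folder names follow
the report, but the exact paths are assumptions):

import os
import random
import shutil

random.seed(0)                            # reproducible split

src = "dataset/good"                      # assumed location of the good images
files = [f for f in os.listdir(src) if f.lower().endswith((".jpg", ".jpeg"))]
random.shuffle(files)

test_count = 371                          # about 6.9% of 5371, as described above
os.makedirs("dataset/test_set/good", exist_ok=True)
os.makedirs("dataset/training_set/good", exist_ok=True)

for i, name in enumerate(files):
    dest = "dataset/test_set/good" if i < test_count else "dataset/training_set/good"
    shutil.move(os.path.join(src, name), os.path.join(dest, name))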
Day 3
I created a CNN model using Keras (TensorFlow backend). The model used too many
feature maps, and the Dense layers also contained a large number of neurons. It
resulted in an error.
Day 4
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense

classifier = Sequential()
# Two convolution + max-pooling blocks over 64x64 RGB input images.
classifier.add(Convolution2D(128, (5, 5), input_shape=(64, 64, 3), activation='relu'))
classifier.add(MaxPooling2D(pool_size=(2, 2)))
classifier.add(Convolution2D(64, (3, 3), activation='relu'))
classifier.add(MaxPooling2D(pool_size=(2, 2)))
# Flatten the feature maps and classify with fully connected layers;
# the final sigmoid neuron outputs the probability of being a "good" photo.
classifier.add(Flatten())
classifier.add(Dense(units=256, activation='relu'))
classifier.add(Dense(units=1, activation='sigmoid'))
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
I had assumed that none of the scraped images would be corrupted. I was wrong.
Day 5
Today, I realized that there is an error in our data. There may be some images
in the #indiapictures folder that were also featured on the page. In simple
terms, some of our good data files may also appear as duplicates among the bad
data files. I calculated the probability. Note: we scraped all images that were
featured on the page.
Total images with tag #indiapictures = 4,000,000 (approx.)
Total scraped images with tag #indiapictures = 53,769
Total images featured on the page = 5,371
P(image is duplicate)
= Total images featured on the page / Total images with tag #indiapictures
= 5,371 / 4,000,000
= 0.00134275
Expected duplicates = P(image is duplicate) x Total scraped images
= 0.00134275 x 53,769 ≈ 72
So about 72 of the scraped images are expected to appear in both the good and
bad data, if we assume that the scraped data is a completely random sample.
We need to remove these images before building the model.
Day 6
Today, I spent most of the time finding a Python script to detect corrupted
files, and I completed my script to find the corrupted images. The code is
completely reusable, and anyone who wants to perform this check can use it. The
script is shared as HTML, so you should download it first and then open it in a
browser.
The script scans a folder for all JPG and JPEG files and verifies whether each
one is corrupted. If a corrupted file is found, it moves that file into a
separate directory that you have to create beforehand; in my case I named that
folder “corrupted”.
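The original script is not reproduced here; a minimal sketch of the kind of
check it performs (using Pillow's verify(), with assumed folder names) would
be:

import os
import shutil
from PIL import Image

src = "dataset/bad"                # assumed folder to scan
corrupted_dir = "corrupted"        # destination folder for broken files
os.makedirs(corrupted_dir, exist_ok=True)

for name in os.listdir(src):
    if not name.lower().endswith((".jpg", ".jpeg")):
        continue
    path = os.path.join(src, name)
    try:
        with Image.open(path) as img:
            img.verify()           # raises an exception if the file is corrupted
    except Exception:
        shutil.move(path, os.path.join(corrupted_dir, name))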
Now, we have only one challenge left before building our CNN: we need to remove
duplicates. Removing duplicates is essential in our project because files
common to both folders will neutralize the changes in the cost function we are
trying to minimize. In simple words, we would be wasting almost 72 images of
the good dataset.
Day 7
Challenge:
We have two folders named “good” and “bad”. We want to remove from the folder
bad the images that are common to both folders.
Today, I completed writing the script to remove the duplicate files. The script
first calculates hash codes of the images and builds a HashMap out of them. The
HashMap looks as follows:
hashMap = {
    "good": [
        {"dir": "D://Codes//gPhotoCNN//dataset//good//2013-04-24_05-57-27_UTC.jpg",
         "sign": "4b625aa4eca63d3d41202d96ae3c09620ca0d0b2fb516dca21a892322bf82efa"},
        ...
    ],
    "bad": [...]
}
This hashMap consists of two keys, "good" and "bad", named after the two
folders. Each key holds an array whose entries contain the file path and its
128-bit MD5 hash code. If two hash codes are identical, the files are the same.
The odds of two different files generating identical MD5 sums are 1 / 2^128,
that is, 1 / 340282366920938463463374607431768211456. In other words, you would
need on the order of 2^64 files before there is an appreciable chance of a
collision. This script will give good results for a project of our size.
This HashMap is then stored in a JSON file so that we do not need to compute
the hash codes again and again, as that is a time-consuming process. We then
checked for identical hashes using simple for loops. Luckily, we got only 19
identical pictures, far fewer than the roughly 72 duplicates we expected. It
means the scraper mostly scraped posts that do not overlap with the images
featured on the indiapictures page.
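A minimal sketch of this duplicate-removal step (folder names as in the report;
hashlib.md5 is used here to match the hash named above):

import hashlib
import json
import os

def file_hash(path):
    # Hash the raw bytes of an image file.
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

# Build the hash map for both folders.
hash_map = {}
for folder in ("good", "bad"):
    hash_map[folder] = [
        {"dir": os.path.join(folder, name), "sign": file_hash(os.path.join(folder, name))}
        for name in os.listdir(folder)
    ]

# Cache the hashes so they do not have to be recomputed.
with open("hash_map.json", "w") as f:
    json.dump(hash_map, f)

# Remove from "bad" every file whose hash also appears in "good".
good_signs = {entry["sign"] for entry in hash_map["good"]}
for entry in hash_map["bad"]:
    if entry["sign"] in good_signs:
        os.remove(entry["dir"])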
Day 8
I started training the model. It was set to run for 5 epochs, with each epoch
consisting of 55,037 training steps and 4,083 validation steps. During the
first epoch the accuracy was not good enough: it was around 90%, which is no
better than a trivial model that always predicts the majority class (about 91%
of our images are in the bad class).
At present it is showing 30 hours left for one epoch. Let's wait for now.
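The training call itself is not shown in the report; a sketch of a typical
Keras setup matching these numbers (the directory names and the batch size of 1
are assumptions, and classifier is the model from Day 4):

from keras.preprocessing.image import ImageDataGenerator

# Rescale pixel values and read images from the class-labelled folders.
train_datagen = ImageDataGenerator(rescale=1.0 / 255)
test_datagen = ImageDataGenerator(rescale=1.0 / 255)

training_set = train_datagen.flow_from_directory(
    'dataset/training_set', target_size=(64, 64), batch_size=1, class_mode='binary')
test_set = test_datagen.flow_from_directory(
    'dataset/test_set', target_size=(64, 64), batch_size=1, class_mode='binary')

# 55,037 training steps and 4,083 validation steps per epoch, as described above.
classifier.fit_generator(
    training_set,
    steps_per_epoch=55037,
    epochs=5,
    validation_data=test_set,
    validation_steps=4083)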
Day 9
The difference between accuracy and validation accuracy is that accuracy is an
estimate of the mean accuracy over the training mini-batches, while validation
accuracy is the accuracy measured on the validation set.
Day 10
There was a power outage for a few hours at my university. Things like this
don't happen often, maybe once or twice a year, but this was one of those
moments, and training was interrupted. This is the accuracy that the model had
luckily reached by then.
The model was not a total waste, though. It showed that learning can reach a
level that is better than random, which means that if we train it for longer,
we may get sufficiently good results.
Day 11
Keras has a simple way of doing transfer learning. We used the VGG16 model with
weights pre-trained on ImageNet (1000 classes). We removed the top layer so
that we could add our own fully connected layers.
The pre-trained VGG16 layers were locked (their weights are not trained). It is
good practice to keep the pre-trained layers locked so that new features are
learned only in the layers added on top.
We then flattened the output of the convolutional block and added two dense
layers, with 4 and 1 neurons respectively.
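A minimal sketch of the described setup in Keras (the 64 x 64 input size and
the ReLU activation of the 4-neuron layer are assumptions carried over from the
first model):

from keras.applications import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense

# VGG16 convolutional base pre-trained on ImageNet, without its top classifier.
base = VGG16(weights='imagenet', include_top=False, input_shape=(64, 64, 3))
for layer in base.layers:
    layer.trainable = False        # lock the pre-trained layers

# Flatten the convolutional output and add the two dense layers (4 and 1 neurons).
x = Flatten()(base.output)
x = Dense(4, activation='relu')(x)
output = Dense(1, activation='sigmoid')(x)

model = Model(inputs=base.input, outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])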
This model was much faster to train and produced the following results:
● Accuracy: 90.8%
● Validation Accuracy: 90.9%
● Time Taken: 14 hours 15 minutes
Conclusion
❏ First Model:
It was trained from scratch and took a large amount of time to learn; one
epoch alone took almost 30 hours. The accuracies are as follows:
❏ Accuracy: 91.14%
❏ Validation Accuracy: 90.06%
❏ Second Model:
We used transfer learning for this model, starting from the VGG16 model with
ImageNet weights (1000 classes). We then flattened the output of the
convolutional block and added two dense layers with 4 and 1 neurons
respectively. The accuracies are as follows:
❏ Accuracy: 90.80%
❏ Validation Accuracy: 90.90%
The conclusion of our project is that transfer learning is a very good
approach, especially for computer vision problems. Creating a model from
scratch takes a large amount of time to train and distracts us from our main
goal.
References