Object Detection Project Report
TENSORFLOW
PROJECT REPORT
OF MAJOR PROJECT (EASYKART)
Supervised By: Reema Lalit, Assistant Professor
Submitted By: Parul Kataria, Roll No. 1252159
PANIPAT INSTITUTE OF ENGINEERING &
TECHNOLOGY, PANIPAT
DECLARATION
Date: PARUL
CERTIFICATE
It is certified that Ms. Parul, a student of Master of Computer Applications, under class
Roll No. 1252159 for the session 2017-2019, has completed the project entitled Easykart
under my supervision.
This project report is an authentic work of the candidate, as per her declaration, and in my
opinion is fit for the award of the Master's degree in Computer Applications in accordance
with the rules and regulations of PIET, Panipat.
ACKNOWLEDGEMENT
It has been rightly remarked, "Success is the satisfactory achievement of a chosen and
desired objective. It is the attainment of the major objects you earnestly desired and
worked for with burning enthusiasm and dedication."
Before we get into things, I would like to share a few heartfelt words with the people who
were part of this project in numerous ways, people who gave unending support from the
beginning.
The successful completion of the project is a combined effort of a number of people, and
all of them have their own importance in the achievement of the objective.
I cannot miss this opportunity to thank Ms. Reema Lalit as a mentor for her timely
support and valuable guidance throughout the project.
Last but not least, I would also like to thank my parents and project mates for being a
pillar of strength and support in times of stress and difficulty throughout the project
duration.
(Parul)
COMPANY PROFILE
Established in 2012, Influence Technolabs Pvt. Ltd. has always had the vision of coming
up with world-class products and key services related to online systems, i.e. travel,
insurance, ERP, CRM, SCM, etc. Founded by Ajay Kumar Jain and Malini Jain, the
company has grown to a strength of more than 200 specialists working in various state-of-
the-art development facilities located worldwide.
Today, Influence Technolabs Pvt. Ltd. is among the largest service providers of online
booking technology. We facilitate global clients with a wide product range built on
technologies that have been crafted with the future in mind.
Our technology solutions are vast and include creative design, solution definition, mobile
application development, product development, hotel consolidators, extranet systems,
payment gateway integration, online booking engines, online insurance, POS, ERP, CRM,
SCM, AI-based automation and many others. This variety of services lets us work with
several kinds of clients, including travel agencies (both online and offline), destination
management companies, tour operators, consortia, consolidators, insurance brokers,
insurance agents, aggregators, manufacturing and trading companies, etc.
Our focus at Influence Technolabs Pvt. Ltd. has always been on reducing the clients'
cost-versus-capability ratio while providing them with an easy interface to work upon. All
our products integrate the right mix of technical finesse and business solutions, the key to
growth in the industry. We further provide customized technology solutions to suit the
unique requirements of our special clients.
We at Influence Technolabs have been able to create a highly fulfilling workplace. Our
workplace is rated among the most sought-after in this industry, which has given our
organization an extra edge. We reach out to large and small companies alike, providing
solutions that enable them to grow in the online travel business with minimal investment
and no need for programmers, IT professionals, or expensive servers and hosting.
Company Products
We provide IT solutions that are vast and include creative design, solution definition,
mobile application development, product development, hotel consolidators, extranet
systems, payment gateway integration, online booking engines, insurance APIs, insurance
portals, AI-based technology and many others.
ABBREVIATIONS
LIST OF FIGURES
LIST OF TABLES
Table 7.1 Train_labels.csv 53
Table 7.2 Train_labels.csv 54
Table 7.3 Test_labels.csv 54
Table 7.4 Train_labels.csv 55
ABSTRACT
There is an ever-increasing amount of image data in the world, and the rate of growth
itself is increasing. Infotrends estimates that in 2016 still cameras and mobile devices
captured more than 1.1 trillion images. According to the same estimate, in 2020 the
figure will increase to 1.4 trillion. Many of these images are stored in cloud services or
published on the Internet. In 2014, over 1.8 billion images were uploaded daily to the
most popular platforms, such as Instagram and Facebook.
Going beyond consumer devices, there are cameras all over the world that capture images
for automation purposes. Cars monitor the road, and traffic cameras monitor the same
cars. Robots need to understand a visual scene in order to smartly build devices and sort
waste. Imaging devices are used by engineers, doctors and space explorers alike.
To effectively manage all this data, we need to have some idea about its contents.
Automated processing of image contents is useful for a wide variety of image-related
tasks. For computer systems, this means crossing the so-called semantic gap between the
pixel-level information stored in the image files and the human understanding of the same
images. Computer vision attempts to bridge this gap.
Objects contained in image files can be located and identified automatically. This is
called object detection and is one of the basic problems of computer vision. As we will
demonstrate, convolutional neural networks are currently the state-of-the-art solution for
object detection. The main task of this thesis is to review and test convolutional object
detection methods.
In the theoretical part, we review the relevant literature and study how convolutional
object detection methods have improved in the past few years. In the experimental part,
we study how easily a convolutional object detection system can be implemented in
practice, test how well a detection system trained on general image data performs in a
specific task and explore, both experimentally and based on the literature, how the
current systems can be improved.
TABLE OF CONTENTS
Declaration i
Certificate ii
Acknowledgement iii
Abstract vi
CHAPTER 1
PROJECT INTRODUCTION
1.1 Objective
The goal of “object detection” is to find the location of an object in a given picture
accurately and mark the object with the appropriate category. To be precise, the problem
that object detection seeks to solve involves determining where the object is, and what it
is. However, solving this problem is not easy. Unlike the human eye, a computer processes
images in two dimensions. Furthermore, the size of the object, its orientation in the space,
its attitude, and its location in the image can all vary greatly.
1.2 Introduction
Object detection is a technologically challenging and practically useful problem in the field
of computer vision. Object detection deals with identifying the presence of various
individual objects in an image. Great success has been achieved in controlled environments
for the object detection/recognition problem, but the problem remains unsolved in
uncontrolled settings, in particular when objects are placed in arbitrary poses in cluttered
and occluded environments. As an example, it might be easy to train a domestic help robot
to recognize the presence of a coffee machine with nothing else in the image.
On the other hand, imagine the difficulty such a robot would have in detecting the machine
on a kitchen slab that is cluttered with other utensils, gadgets, tools, etc. The searching or
recognition process in such a scenario is very difficult. So far, no effective solution has been
found for this problem. A lot of research has been done in the area of object recognition and
detection during the last two decades. The research on object detection is multi-disciplinary
and often involves the fields of image processing, machine learning, linear algebra,
topology, statistics/probability, optimization, etc. The research innovations in this field
have become so diverse that getting a first-hand summary of most state-of-the-art
approaches is quite difficult and time-consuming.
The approach used incorporates four computer vision and machine learning concepts:
sliding windows to extract sub-images from the image, feature extraction to get meaningful
data from the sub-images, Support Vector Machines (SVMs) to classify the objects in each
sub-image, and Principal Component Analysis (PCA) to improve efficiency. As a model
problem for the motivating application, we focused on the problem of recognizing objects
in images, in particular soccer balls and sunflowers. For this algorithm to be useful as a
real-time aid to the visually impaired, it would have to be enhanced to distinguish between
"close" and "far" objects, as well as provide information about the relative distance between
the user and the object, etc. We do not consider these complications in this project; we
focus on the core machine learning issues of object recognition. The training and testing of
the proposed algorithm was done using data sets. A sketch of this classical pipeline is given
below.
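The following is a minimal sketch of such a sliding-window pipeline, assuming HOG features, PCA for dimensionality reduction and a linear SVM from scikit-image/scikit-learn. The window size, step and training patches are illustrative assumptions, not the exact configuration used in this project.

import numpy as np
from skimage.feature import hog
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

def sliding_windows(image, window=(64, 64), step=16):
    # Yield (x, y, sub_image) for every window position in a grayscale image.
    h, w = image.shape
    for y in range(0, h - window[1] + 1, step):
        for x in range(0, w - window[0] + 1, step):
            yield x, y, image[y:y + window[1], x:x + window[0]]

def extract_features(patches):
    # HOG feature vectors for a list of equally sized grayscale patches.
    return np.array([hog(p, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for p in patches])

def train_detector(pos_patches, neg_patches, n_components=100):
    # Fit PCA on the HOG features, then train a linear SVM on the reduced vectors.
    X = extract_features(pos_patches + neg_patches)
    y = np.array([1] * len(pos_patches) + [0] * len(neg_patches))
    pca = PCA(n_components=n_components).fit(X)
    clf = LinearSVC().fit(pca.transform(X), y)
    return pca, clf

def detect(image, pca, clf, window=(64, 64), step=16):
    # Return bounding boxes (x, y, w, h) for windows classified as containing the object.
    boxes = []
    for x, y, patch in sliding_windows(image, window, step):
        feat = pca.transform(extract_features([patch]))
        if clf.predict(feat)[0] == 1:
            boxes.append((x, y, window[0], window[1]))
    return boxes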
Fig. 1.2.1: Detecting objects
1.3 Applications
Fig. 1.3.1
A deep learning facial recognition system called "DeepFace" has been developed by a
group of researchers at Facebook; it identifies human faces in a digital image very
effectively. Google uses its own facial recognition system in Google Photos, which
automatically groups all the photos based on the person in the image. There are various
components involved in facial recognition, such as the eyes, nose, mouth and eyebrows.
Fig.1.3.2
Object detection can also be used for people counting; it is used for analyzing store
performance or crowd statistics during festivals. These tend to be more difficult tasks, as
people move out of the frame quickly.
Fig. 1.3.3
Object detection is also used in industrial processes to identify products. Finding a specific
object through visual inspection is a basic task that is involved in multiple industrial
processes like sorting, inventory management, machining, quality management, packaging
etc.
Inventory management can be very tricky, as items are hard to track in real
time. Automatic object counting and localization allows inventory accuracy to be improved.
Fig. 1.3.4
Self-driving cars are the future; there is no doubt about that. But the working behind them is
very tricky, as they combine a variety of techniques to perceive their surroundings, including
radar, laser light, GPS and computer vision.
One of the best examples of why you need object detection is the high-level algorithm for
autonomous driving:
o In order for a car to decide what to do next (accelerate, apply brakes or turn), it
needs to know where all the objects around the car are and what those objects are.
That requires object detection.
o You would essentially train the car to detect a known set of objects: cars,
pedestrians, traffic lights, road signs, bicycles, motorcycles, etc.
GPU
For good cost/performance, I generally recommend an RTX 2070 or an RTX 2080 Ti. If
you use these cards you should use 16-bit models. Otherwise, GTX 1070, GTX 1080,
GTX 1070 Ti and GTX 1080 Ti cards from eBay are fair choices, and you can use these
GPUs with 32-bit (but not 16-bit) models. Be careful about the memory requirements when
you pick your GPU. RTX cards, which can run in 16-bit mode, can train models that are
twice as big within the same memory compared to GTX cards. As such, RTX cards have a
memory advantage, and picking RTX cards and learning how to use 16-bit models
effectively will carry you a long way. In general, the requirements for memory are roughly
the following:
Fig. 2.4.1: Suspect line-up
CPU
The main mistake people make is paying too much attention to the PCIe lanes of a CPU.
You should not care much about PCIe lanes. Instead, just look up whether your CPU and
motherboard combination supports the number of GPUs that you want to run. We need a
CPU with plenty of RAM for running the large number of training steps.
Fig. 2.4.2: Performance graph
o Python 3.7
o TensorFlow
o Anaconda Software
o Machine Learning Libraries
Detection accuracy is usually measured on a given test set, where the expected outcome for
each detection sample is compared to the actual outcome of the object detection system.
The detection accuracy is the percentage of samples for which the expected outcome
matches the actual outcome of the detection system.
Fig. 2.6: Expected outcome
CHAPTER 2
BACKGROUND
Learning algorithms are widely used in computer vision applications. Before considering
image-related tasks, we are going to have a brief look at the basics of machine learning.
Machine learning has emerged as a useful tool for modelling problems that are otherwise
difficult to formulate exactly. Classical computer programs are explicitly programmed by
hand to perform a task. With machine learning, some portion of the human contribution is
replaced by a learning algorithm. As the availability of computational capacity and data has
increased, machine learning has become more and more practical over the years, to the
point of being almost ubiquitous.
2.1.1 Types
In unsupervised learning, the algorithm attempts to learn useful properties of the data
without a human teacher telling it what the correct output should be. A classical example of
unsupervised learning is clustering. More recently, especially with the advent of deep
learning technologies, unsupervised pre-processing has become a popular tool in
supervised learning tasks for discovering useful representations of the data [9].
2.1.2 Features
Some kind of pre-processing is almost always needed. Pre-processing the data into a new,
simpler variable space is called feature extraction. Often, it is impractical or impossible
to use the full-dimensional training data directly. Rather, detectors are programmed to
extract interesting features from the data, and these features are used as input to the
machine learning algorithm.
In the past, the feature detectors were often hand-crafted. The problem with this approach
is that we do not always know in advance which features are interesting. The trend in
machine learning has been towards learning the feature detectors as well, which enables
using the complete data.
2.1.3 Generalization
Since the training data cannot include every possible instance of the inputs, the learning
algorithm has to be able to generalize in order to handle unseen data points. A too simple
model estimate can fail to capture important aspects of the true model. On the other hand,
too complex methods can overfit by modelling unimportant details and noise, which also
leads to bad generalization. Typically, overfitting happens when a complex method is
used in conjunction with too little training data. An overfitted model learns to model the
known examples but does not understand what connects them.
The performance of the algorithm can be evaluated from the quality and quantity of
errors. A loss function, such as mean squared error, is used to assign a cost to the errors.
The objective in the training phase is to minimize this loss.
Neural networks are a popular type of machine learning model. A special case of a neural
network called the convolutional neural network (CNN) is the primary focus of this
thesis. Before discussing CNNs, we will discuss how regular neural networks work.
2.2.1 Origins
Neural networks were originally called artificial neural networks, because they were
developed to mimic the neural function of the human brain. Pioneering research includes
the threshold logic unit by Warren McCulloch and Walter Pitts in 1943 and the
perceptron by Frank Rosenblatt in 1957.
Even though the inspiration from biology is apparent, it would be misleading to
overemphasize the connection between artificial neurons and biological neurons or
neuroscience. The human brain contains approximately 100 billion neurons operating in
parallel. Artificial neurons are mathematical functions implemented on more-or-less
serial computers. Research into neural networks is mostly guided by developments in
engineering and mathematics rather than biology.
An artificial neuron based on the McCulloch-Pitts model is shown in the figure. The neuron
k receives m input parameters xj. The neuron also has m weight parameters wkj. The
weight parameters often include a bias term that has a matching dummy input with a fixed
value of 1. The inputs and weights are linearly combined and summed. The sum is then
fed to an activation function φ that produces the output yk of the neuron:
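y_k = \varphi\left( \sum_{j=1}^{m} w_{kj}\, x_j \right)

(written here in standard notation, with the bias absorbed into the weights via the dummy input).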
The neuron is trained by carefully selecting the weights to produce a desired output for
each input.
Figure 2.2: A fully-connected multi-layer neural network.
A multi-layer network typically includes three types of layers: an input layer, one or more
hidden layers and an output layer. The input layer usually merely passes data along
without modifying it. Most of the computation happens in the hidden layers. The output
layer converts the hidden layer activations to an output, such as a classification. A
multilayer feed-forward network with at least one hidden layer can function as a
universal approximator, i.e. it can be constructed to compute almost any function.
2.2.3 Back-propagation
A neural network is trained by selecting the weights of all neurons so that the network
learns to approximate target outputs from known inputs. It is difficult to solve the neuron
weights of a multi-layer network analytically. The back-propagation algorithm provides a
simple and effective solution for solving the weights iteratively. The classical version uses
gradient descent as its optimization method. Gradient descent can be quite time-consuming
and is not guaranteed to find the global minimum of the error, but with proper configuration
(known in machine learning as hyperparameters) it works well enough in practice.
In the first phase of the algorithm, an input vector is propagated forward through the
neural network. Before this, the weights of the network neurons have been initialized to
some values, for example small random values. The received output of the network is
compared to the desired output (which should be known for the training examples) using
a loss function. The gradient of the loss function is then computed. This gradient is also
called the error value. When using mean squared error as the loss function, the output
layer error value is simply the difference between the current and desired output.
The error values are then propagated back through the network to calculate the error
values of the hidden layer neurons. The hidden neuron loss function gradients can be
solved using the chain rule of derivatives. Finally, the neuron weights are updated by
calculating the gradient of the weights and subtracting a proportion of the gradient from
the weights. This ratio is called the learning rate. The learning rate can be fixed or
dynamic. After the weights have been updated, the algorithm continues by executing the
phases again with different input until the weights converge.
In the above description, we have described online learning, which calculates the weight
updates after each new input. Online learning can lead to "zig-zagging" behavior, where
the single data point estimate of the gradient keeps changing direction and does not
approach the minimum directly. Another way of computing the updates is full batch
learning, where we compute the weight updates for the complete dataset. This is quite
computationally heavy and has other drawbacks. A compromise version is mini-batch
learning, where we use only some portion of the training set for each update.
Mathematical descriptions of the algorithm are readily available in other works.
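As a concrete illustration of these update rules (not the training code used later in this project), the following minimal sketch trains a network with one hidden ReLU layer using a mean-squared-error loss and mini-batch gradient descent; the layer sizes, learning rate and toy data are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Toy network sizes and learning rate (illustrative assumptions).
n_in, n_hidden, n_out, lr = 4, 8, 1, 0.01
W1, b1 = rng.normal(0, 0.1, (n_in, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(0, 0.1, (n_hidden, n_out)), np.zeros(n_out)

def forward(X):
    # Forward pass: linear -> ReLU -> linear.
    z1 = X @ W1 + b1
    a1 = np.maximum(z1, 0)          # ReLU activation
    y = a1 @ W2 + b2
    return z1, a1, y

def backprop_step(X, t):
    # One mini-batch update using a 0.5 * mean-squared-error loss and gradient descent.
    global W1, b1, W2, b2
    z1, a1, y = forward(X)
    n = X.shape[0]
    d_y = (y - t) / n               # output-layer error value
    # Gradients via the chain rule.
    dW2, db2 = a1.T @ d_y, d_y.sum(axis=0)
    d_a1 = d_y @ W2.T
    d_z1 = d_a1 * (z1 > 0)          # ReLU derivative
    dW1, db1 = X.T @ d_z1, d_z1.sum(axis=0)
    # Subtract a proportion (the learning rate) of the gradient from the weights.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

# Example usage on a random mini-batch with an arbitrary regression target.
X = rng.normal(size=(32, n_in))
t = X.sum(axis=1, keepdims=True)
for _ in range(100):
    backprop_step(X, t)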
The activation function φ determines the final output of each neuron. It is important to
select the function properly in order to create an effective network.
Early researchers found that perceptrons and other linear systems had severe drawbacks,
being unable to solve problems that were not linearly separable, such as the XOR
problem. Sometimes, linear systems can solve these kinds of problems using hand-crafted
feature detectors, but this is not the most advantageous use of machine learning. Simply
adding layers does not help either, because a network composed of linear neurons
remains linear no matter how many layers it has.
A light-weight and effective way of creating a non-linear network is using rectified linear
units (ReLU). A rectified linear function generates the output using a ramp function such
as:
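f(x) = \max(0, x)

(the standard form of the ramp function: negative inputs are clamped to zero and positive inputs pass through unchanged).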
This type of function is easy to compute and differentiate (for back-propagation). The
function is not differentiable at zero, but this has not prevented its use in practice. ReLUs
have become quite popular lately, often replacing sigmoidal activation functions, which
have smooth derivatives but suffer from gradient saturation problems and slower
computation.
For multi-class classification problems, the softmax activation function is used in the
output layer of the network:
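\sigma(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad k = 1, \dots, K

(the standard softmax expression for a vector z of K values).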
The softmax function takes a vector of K arbitrarily large values and outputs a vector of
K values that range between 0 and 1 and sum to 1. The values output by the softmax unit
can be utilized as class probabilities.
Modern neural networks are often called deep neural networks. Even though multi-layer
neural networks have existed since the 1980s, several reasons prevented the effective
training of networks with multiple hidden layers.
One of the main problems is the curse of dimensionality. As the number of variables
increases, the number of different configurations of the variables grows exponentially. As
the number of configurations increases, the number of training samples should increase in
equal measure. Collecting a training dataset of sufficient size is time-consuming and
costly, or outright impossible.
Fortunately, real-world data is not uniformly distributed and often involves a structure,
where the interesting information lies on a low-dimensional manifold. The manifold
hypothesis assumes that most data configurations are invalid or rare. We can decrease
dimensionality by learning to represent the data using the coordinates of the manifold.
Another way to improve generalization is to assume local constancy. This means
assuming that the function that the neural network learns to approximate should not
change much within a small region.
In the past ten years, neural networks have had a renaissance, mainly because of the
availability of more powerful computers and larger datasets. In the early 2000s, it was
discovered that neural networks could be trained efficiently using graphics processing
units. GPUs are more efficient for the task than traditional CPUs and provide a relatively
cheap alternative to specialist hardware. Today, researchers typically use high-end
graphics cards, such as the NVIDIA Tesla K40.
With deep learning, there is less need for hand-tuned machine learning solutions that
were used previously. A classical pattern detection system, for example, includes a hand-
tuned feature detection phase before a machine learning phase. The deep learning
equivalent consists of a single neural network. The lower layers of the neural network
learn to recognize the basic features, which are then fed forward to higher layers of the
network.
Next, we are going to discuss computer vision in general and explore the primary subject
of this thesis, object detection, as a subproblem of computer vision.
2.3.1 Overview
Computer vision deals with the extraction of meaningful information from the contents of
digital images or video. This is distinct from mere image processing, which involves
manipulating visual information on the pixel level. Applications of computer vision
include image classification, visual detection, 3D scene reconstruction from 2D images,
image retrieval, augmented reality, machine vision and traffic automation.
Object detection is one of the classical problems of computer vision and is often
described as a difficult task. In many respects, it is similar to other computer vision tasks,
because it involves creating a solution that is invariant to deformation and changes in
lighting and viewpoint. What makes object detection a distinct problem is that it involves
both locating and classifying regions of an image. The locating part is not needed in, for
example, whole image classification.
To detect an object, we need to have some idea where the object might be and how the
image is segmented. This creates a type of chicken-and-egg problem, where, to recognize
the shape (and class) of an object, we need to know its location, and to recognize the
location of an object, we need to know its shape. Some visually dissimilar features, such
as the clothes and face of a human being, may be parts of the same object, but it is
difficult to know this without recognizing the object first. On the other hand, some
objects stand out only slightly from the background, requiring separation before
recognition.
Low-level visual features of an image, such as a saliency map, may be used as a guide for
locating candidate objects. The location and size are typically defined using a bounding
box, which is stored in the form of corner coordinates. Using a rectangle is simpler than
using an arbitrarily shaped polygon, and many operations, such as convolution, are
performed on rectangles in any case. The sub-image contained in the bounding box is
then classified by an algorithm that has been trained using machine learning. The
boundaries of the object can be further refined iteratively, after making an initial guess.
During the 2000s, popular solutions for object detection utilized feature descriptors, such
as scale-invariant feature transform (SIFT) developed by David Lowe in 1999 and
histogram of oriented gradients (HOG) popularized in 2005. In the 2010s, there has been
a shift towards utilizing convolutional neural networks.
Before the widescale adoption of CNNs, there were two competing solutions for
generating bounding boxes. In the first solution, a dense set of region proposals is
generated and then most of these are rejected. This typically involves a sliding window
detector. In the second solution, a sparse set of bounding boxes is generated using a
region proposal method, such as Selective Search. Combining sparse region proposals
with convolutional neural networks has provided good results and is currently popular.
2.4.1 Justification
The problem with solving computer vision problems using traditional neural networks is
that even a modestly sized image contains an enormous amount of information (see
section 2.2.5 on deep learning and the curse of dimensionality).
A monochrome 620x480 image contains 297,600 pixels. If each pixel intensity of this
image is input separately to a fully-connected network, each neuron requires 297,600
weights. A 1920x1080 full HD image would require 2,073,600 weights. If the images are
polychrome, the number of weights is multiplied by the number of color channels
(typically three). Thus, we can see that the overall number of free parameters in the
network quickly becomes extremely large as the image size increases. Too large models
cause over-fitting and slow performance.
Furthermore, many pattern detection tasks require that the solution is translation
invariant. It is inefficient to train neurons to separately recognize the same pattern in the
left-top corner and in the right-bottom corner of an image. A fully-connected neural
network fails to take this kind of structure into account.
The basic idea of the CNN was inspired by a concept in biology called the receptive field.
Receptive fields are a feature of the animal visual cortex. They act as detectors that are
sensitive to certain types of stimulus, for example, edges. They are found across the
visual field and overlap each other.
Figure 2.3: Detecting horizontal edges from an image using convolution filtering.
The discrete convolution operation between an image f and a filter matrix g is defined as:
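h(x, y) = \sum_{i} \sum_{j} f(x + i,\, y + j)\, g(i, j)

(written here in the standard cross-correlation form commonly used in CNN literature, with i and j running over the filter neighbourhood).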
In effect, the dot product of the filter g and a sub-image of f (with the same dimensions as g)
centered on coordinates (x, y) produces the pixel value of h at coordinates (x, y). The size of
the receptive field is adjusted by the size of the filter matrix. Aligning the filter
successively with every sub-image of f produces the output pixel matrix h. In the case
of neural networks, the output matrix is also called a feature map (or an activation map
after computing the activation function). Edges need to be treated as a special case. If
image f is not padded, the output size decreases slightly with every convolution.
Since the same filters are used for all parts of the image, the number of free parameters is
reduced drastically compared to a fully-connected neural layer. The neurons of the
convolutional layer mostly share the same parameters and are only connected to a local
region of the input. Parameter sharing resulting from convolution ensures translation
invariance. An alternative way of describing the convolutional layer is to imagine a fully-
connected layer with an infinitely strong prior placed on its weights. This prior forces the
neurons to share weights at different spatial locations and to have zero weight outside the
receptive field.
Successive convolutional layers (often combined with other types of layers, such as the
pooling layers described below) form a convolutional neural network (CNN). An example
of a convolutional network is shown in the figure. The back-propagation training algorithm
described in section 2.2.3 is also applicable to convolutional networks. In theory, the
layers closer to the input should learn to recognize low-level features of the image, such
as edges and corners, and the layers closer to the output should learn to combine these
features to recognize more meaningful shapes. In this thesis, we are interested in studying
whether convolutional networks can learn to recognize complete objects.
To make the network more manageable for classification, it is useful to decrease the
activation map size in the deep end of the network. Generally, the deep layers of the
network require less information about exact spatial locations of features, but require
more filter matrices to recognize multiple high-level patterns. By reducing the height and
width of the data volume, we can increase the depth of the data volume and keep the
computation time at a reasonable level.
There are two ways of reducing the data volume size. One way is to include a pooling
layer after a convolutional layer. The layer effectively down-samples the activation maps.
Pooling has the added effect of making the resulting network more translation invariant
by forcing the detectors to be less precise. However, pooling can destroy information
about spatial relationships between subparts of patterns. A typical pooling method is max-
pooling. Max-pooling simply outputs the maximum value within a rectangular
neighborhood of the activation map.
Another way of reducing the data volume size is adjusting the stride parameter of the
convolution operation. The stride parameter controls whether the convolution output is
calculated for a neighborhood centered on every pixel of the input image (stride 1) or for
every nth pixel (stride n). Research has shown that pooling layers can often be discarded
without loss in accuracy by using convolutional layers with larger stride value. The stride
operation is equivalent to using a fixed grid for pooling.
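As a small illustration of these two mechanisms, the following sketch shows 2x2 max-pooling and the convolution output-size rule; the 4x4 input, filter size and stride values are arbitrary assumptions, not part of the project configuration.

import numpy as np

def max_pool2d(x, size=2):
    # Down-sample an activation map by taking the maximum of each size x size block.
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]   # crop so the map divides evenly into blocks
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

def conv_output_size(n, k, stride=1, pad=0):
    # Spatial output size of a convolution with filter size k, given stride and padding.
    return (n + 2 * pad - k) // stride + 1

a = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(a))                               # 2x2 map of block maxima
print(conv_output_size(32, 3, stride=1, pad=0))    # 30: unpadded convolution shrinks the map
print(conv_output_size(32, 3, stride=2, pad=1))    # 16: stride 2 halves the resolution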
Some systems also implement a layer called local response normalization, which
is used as a regularization technique. Local response normalization mimics a function of
biological neurons called lateral inhibition, which causes excited neurons to decrease the
activity of neighboring neurons. However, other regularization techniques are currently
more popular; these are discussed in the next section.
The final hidden layers of a CNN are typically fully-connected layers. A fully-connected
layer can capture some interesting relationships that parameter-sharing convolutional layers
cannot. However, a fully-connected layer requires a sufficiently small data volume size
in order to be practical. Pooling and stride settings can be used to reduce the size of the
data volume that reaches the fully-connected layers. A convolutional network that does
not include any fully-connected layers is called a fully convolutional network (FCN).
If the network is used for classification, it usually includes a softmax output layer (see
also section 2.2.4). The activations of the topmost layers can also be used directly to
generate a feature representation of an image. This means that the convolutional network
is used as a large feature detector.
There are several regularization techniques that are specific to deep neural networks. A
popular technique called dropout attempts to reduce the co-adaptation of neurons. This is
achieved by randomly dropping out neurons during training, meaning that a slightly
different neural network is used for each training sample or minibatch. This causes the
system not to depend too much on any single neuron or connection and provides an
effective yet computationally inexpensive way of implementing regularization. In
convolutional networks, dropout is typically used in the final fully-connected layers.
Overfitting can also be reduced by increasing the amount of training data. When it is not
possible to acquire more actual samples, data augmentation is used to generate more
samples from the existing data. For classification using convolutional networks, this can
be achieved by computing transformations of the input images that do not alter the
perceived object classes, yet provide additional challenge to the system. The images can
be, for example, flipped, rotated or subsampled with different crops and scales. Also,
noise can be added to the input images.
2.4.6 Development
Convolutional neural networks were one of the first successful deep neural networks. The
Neocognitron, developed by Fukushima in the 1980s, provided a neural network model for
translation-invariant object recognition, inspired by biology. LeCun et al. combined this
method with a learning algorithm, i.e. back-propagation. These early solutions were
mostly used for handwritten character recognition.
After providing some promising results, the neural network methods faded in prominence
and were mostly replaced by support vector machines. Then, in 2012, Krizhevsky et al.
achieved excellent results on the ImageNet Large Scale Visual Recognition Challenge
(ILSVRC) dataset by combining LeCun's method with recent fine-tuning methods for
deep learning. These results popularized CNNs and led to the development of the new,
powerful object detection methods described in the next chapter.
For the 2014 ImageNet challenge, Simonyan and Zisserman explored the effect of
increasing the depth of a convolutional network on localization and classification
accuracy. The team improved the then state-of-the-art results by using
convolutional networks 16 and 19 layers deep. The 16-layer architecture includes 13
convolutional layers (with 3x3 filters), 5 pooling layers (2x2 neighborhood max-pooling)
and 3 fully-connected layers. All hidden layers use rectified (ReLU) activations. The
fully-connected layers reduce 4096 channels down to 1000 softmax outputs and are
regularized using dropout. This form of network is referred to as VGG-16 later in this
thesis.
The current (2016) winner of the object detection category in the ImageNet challenge is
also CNN-based. The method uses a combination of CRAFT region proposal generation,
gated bi-directional CNN, clustering, landmark generation and ensembling.
Chapter 3
Convolutional Neural Networks
3.1 R-CNN
In 2012, Krizhevsky et al. achieved promising results with CNNs for the general image
classification task, as mentioned in section 2.4.6. In 2013, Girshick et al. published a
method generalizing these results to object detection. This method is called R-CNN
("regions with CNN features").
R-CNN forward computation has several stages, shown in the figure. First, the regions of
interest (RoIs) are generated. The RoIs are category-independent bounding boxes that have a
high likelihood of containing an interesting object. In the paper, a separate method called
Selective Search is used for generating these, but other region generation methods can be
used instead. Selective Search, along with other region proposal generation techniques, is
discussed in further detail in section 3.3.
Next, a convolutional network is used to extract features from each region proposal. The
sub-image contained in the bounding-box is warped to match the input size of the CNN
and then fed to the network. After the network has extracted features from the input, the
features are input to support vector machines (SVM) that provide the final classification.
Figure 3.1: Stages of R-CNN forward computation.
The method is trained in multiple stages, beginning with the convolutional network. After
the CNN has been trained, the SVMs are fitted to the CNN features. Finally, the region
proposal generating method is trained.
3.1.2 Drawbacks
R-CNN is an important method, because it provided the first practical solution for object
detection using CNNs. Being the first, it has many drawbacks that have been improved
upon by later methods.
In his 2015 paper on Fast R-CNN, Girshick lists three main problems of R-CNN. First,
training is a multi-stage pipeline. Second, training is expensive in space and time, because
features are extracted and written to disk for every region proposal.
Third, and perhaps most important, object detection is slow, requiring almost a minute for
each image, even on a GPU. This is because the CNN forward computation is performed
separately for every object proposal, even if the proposals originate from the same image
or overlap each other.
Fast R-CNN published in 2015 by Girshick provides a more practical method for object
recognition. The main idea is to perform the forward pass of the CNN for the entire
image, instead of performing it separately for each RoI.
Figure 3.2: Stages of Fast R-CNN forward computation.
The general structure of Fast R-CNN is illustrated in figure 3.2. The method receives as
input an image plus regions of interest computed from the image. As in R-CNN, the RoIs
are generated using an external method. The image is processed using a CNN that
includes several convolutional and max pooling layers.
The convolutional feature map that is generated after these layers is input to a RoI
pooling layer. This extracts a fixed-length feature vector for each RoI from the feature
map. The feature vectors are then input to fully-connected layers that are connected to
two output layers: a softmax layer that produces probability estimates for the object
classes and a real-valued layer that outputs bounding box coordinates computed using
regression (meaning refinements to the initial candidate boxes).
According to the authors, Fast R-CNN provides significantly shorter classification time
compared to regular R-CNN, taking less than a second on a state-of-the-art GPU. This is
mainly due to using the same feature map for each RoI.
As the detection time decreases, the overall computation time begins to depend
significantly on the performance of the region proposal generation method. The RoI
generation can thus form a computational bottleneck. Additionally, when there are many
RoIs, the time spent on evaluating the fully-connected layers can dominate the evaluation
time of the convolutional layers. Classification time can be accelerated by approximately
30% if the fully-connected layers are compressed using truncated singular value
decomposition. This results in a slight decrease in precision, however.
3.2.3 Training
According to the original publication, Fast R-CNN is more efficient to train than R-CNN,
with a nine-fold reduction in training time. The entire network (including the RoI pooling
layer and the fully-connected layers) can be trained using the back-propagation algorithm
and stochastic gradient descent. Typically, a pre-trained network is used as a starting
point and then fine-tuned. Training is done in mini-batches of N images. R/N RoIs are
sampled from each mini-batch image. The RoI samples are assigned to a class if their
intersection over union (see section 4.6) with a ground-truth box is over 0.5. Other RoIs
belong to the background class.
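For reference, a minimal sketch of the intersection-over-union computation used for this assignment; the (xmin, ymin, xmax, ymax) box format is an assumption, and this helper is illustrative rather than code from the project.

def iou(box_a, box_b):
    # Intersection over union of two boxes given as (xmin, ymin, xmax, ymax).
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (zero area if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / float(area_a + area_b - inter)

# Example: a sampled RoI is assigned to an object class if iou(roi, ground_truth_box) > 0.5.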
As in classification, RoIs from the same image share computation and memory usage.
For data augmentation, the original image is flipped horizontally with probability 0.5.
The softmax classifier and the bounding-box regressors are fine-tuned together using a
multi-task loss function, which considers both the true class of the sampled RoI and the
offset of the sampled bounding box from the true bounding box.
To use R-CNN and Fast R-CNN, we need a method for generating the class-agnostic
regions of interest. Next, we are going to discuss general principles of RoI generation,
and have a closer look at two popular methods: Selective Search and Edge Boxes.
3.3.1 Overview
The aim of region proposal generation in object detection is to maximize recall i.e. to
generate enough regions so that all true objects are recovered. The generator is less
concerned with precision, since it is the task of the object detector to identify correct
regions from the output of the region proposal generator.
Dense set solutions attempt to generate by brute force an exhaustive set of bounding
boxes that includes every potential object location. This can be achieved by sliding a
detection window across the image. However, searching through every location of the
image is computationally costly and requires a fast object detector. Additionally, different
window shapes and sizes need to be considered. Thus, most sliding window methods
limit the number of candidate objects by using a coarse step size and a limited number of
fixed aspect ratios.
Most region proposals in a dense set do not contain interesting objects. These proposals
need to be discarded after the object detection phase. Detection results can be discarded
if they fall below a predefined confidence threshold or if their confidence value is below
a local maximum (non-maximum suppression).
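A minimal sketch of greedy non-maximum suppression, reusing the iou helper sketched earlier; the 0.5 overlap threshold and the box/score representation are illustrative assumptions.

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    # Greedy NMS: repeatedly keep the highest-scoring box, drop boxes that overlap it too much.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep   # indices of the retained boxes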
Instead of discarding the regions after the object detection stage, the region proposal
generator itself can rank the regions in a class-agnostic way and discard low-ranking
regions. This generates a sparse set of object detections. Similarly to dense set methods,
thresholding and non-maximum suppression can be implemented after the detection
phase to further improve the detection quality. Sparse set solutions can be grouped into
unsupervised and supervised methods.
One of the most popular unsupervised methods is Selective Search (see section 3.3.2),
which utilizes an iterative merging of superpixels. There are also other methods that use
the same approach. Another approach is to rank the objectness of a sliding window. A
popular example of this is Edge Boxes (see section 3.3.2), which calculates the objectness
score by counting the number of edges within a bounding box and subtracting the
number of edges that overlap the box boundary. There is also a third group of methods
based on seed segmentation.
Certain advanced object detection methods, such as Faster R-CNN described in section
3.4.1, use parts of the same convolutional network both for generating the region proposals
and for detection. We call these kinds of methods integrated methods.
The algorithm begins by creating a set of small initial regions using a method called
Graph-Based Image Segmentation, designed by Felzenszwalb and Huttenlocher. The
method creates a set of regions called superpixels. The superpixels are internally nearly
uniform. Combined, they span the entire image, but individually they should not span
different objects.
Selective Search then continues by iteratively grouping the regions together using a
greedy algorithm, beginning with the two most similar regions. Many complementary
measures are used to compute the similarity. These measures consider color similarity
(by computing a color histogram), texture similarity (by computing a SIFT-like measure),
size of the regions (small regions should be merged earlier) and how well the regions fit
together (gaps should be avoided). The grouping phase ends when every region has been
combined.
The hypothetical object locations thus generated are then ordered by the likelihood of the
location containing an object. In practice, the locations are ordered based on the order in
which they were grouped together by the different measures. A certain element of
randomness is added to prevent large objects from being favored too much. Lower-
ranking duplicates are removed.
Both the region generating method and the similarity measures were selected to be fast to
compute, making the method fast in general. In addition to using diverse similarity
measures, the search can be further diversified by using complementary color spaces (to
ensure lighting invariance) and using complementary starting regions.
As the name suggests, Edge Boxes is based on detecting objects from edge maps. The
main contribution of the authors of the method is the observation that the number of edge
contours wholly enclosed by a bounding box is correlated with the likelihood that the box
contains an object.
First, the edge map is calculated using a method by the same authors called Structured
Edge Detector. Then, thick edge lines are thinned using non-maximum suppression.
Instead of operating on the edge pixels directly, the pixels are grouped using a greedy
algorithm. An affinity measure is devised to calculate whether edge groups are part of the
same contour.
The region proposals are found by scanning the image using the traditional sliding
window method and calculating an objectness score at each position, aspect ratio and
scale. The score is calculated by summing the edge strength of edge groups that lie
completely within the box and subtracting the strength of edge groups that are part of a
contour that crosses the box boundary. Promising regions are then further refined.
In the experimental section of this thesis, we will focus mostly on Fast R-CNN. There
are, however, several state-of-the-art algorithms with improved computation time or
accuracy. Next, we will describe two of these algorithms. See also chapter 7 for a
discussion of improvements to convolutional object detection.
Faster R-CNN by Ren et al. is an integrated method. The main idea is to use shared
convolutional layers for region proposal generation and for detection. The authors
discovered that feature maps generated by object detection networks can also be used to
generate the region proposals. The fully convolutional part of the Faster R-CNN network
that generates the region proposals is called a region proposal network (RPN). The
authors used the Fast R-CNN architecture for the detection network.
A Faster R-CNN network is trained by alternating between training for RoI generation
and detection. First, two separate networks are trained. Then, these networks are
combined and fine-tuned. During fine-tuning, certain layers are kept fixed and certain
layers are trained in turn.
The trained network receives a single image as input. The shared fully convolutional
layers generate feature maps from the image. These feature maps are fed to the RPN. The
RPN outputs region proposals, which are input, together with the said feature maps, to
the final detection layers. These layers include a RoI pooling layer and output the final
classifications.
Using shared convolutional layers, region proposals are computationally almost cost-free.
Computing the region proposals on a CNN has the added benefit of being realizable on a
GPU. Traditional RoI generation methods, such as Selective Search, are implemented
using a CPU.
For dealing with different shapes and sizes of the detection window, the method uses
special anchor boxes instead of using a pyramid of scaled images or a pyramid of
different filter sizes (see section 7.2 for discussion of scale invariance). The anchor boxes
function as reference points to different region proposals centered on the same pixel.
3.4.2 SSD
The Single Shot MultiBox Detector (SSD) takes integrated detection even further. The
method does not generate proposals at all, nor does it involve any resampling of image
segments. It generates object detections using a single pass of a convolutional network.
Somewhat resembling a sliding window method, the algorithm begins with a default set
of bounding boxes. These include different aspect ratios and scales. The object
predictions calculated for these boxes include offset parameters, which predict how much
the correct bounding box surrounding the object differs from the default box.
The algorithm deals with different scales by using feature maps from many different
convolutional layers (i.e. larger and smaller feature maps) as input to the classifier. Since
the method generates a dense set of bounding boxes, the classifier is followed by a non-
maximum suppression stage that eliminates most boxes below a certain confidence
threshold.
Above, we described how Fast R-CNN is faster and more accurate than regular R-CNN.
But how does Fast R-CNN perform compared to the above-mentioned advanced
methods?
Liu et al. compared the performance of Fast R-CNN, Faster R-CNN and SSD on the
PASCAL VOC 2007 test set (see section 4.5 for a discussion of the standard benchmarks).
When using networks trained on the PASCAL VOC 2007 training data, Fast R-CNN
achieved a mean average precision (mAP) of 66.9 (see section 4.6 for a discussion of
evaluation methods). Faster R-CNN performed slightly better, with a mAP of 69.9. SSD
achieved a mAP of 68.0 with input size 300 x 300 and 71.6 with input size 512 x 512. As
the standard implementations of Fast R-CNN and Faster R-CNN use 600 as the length of
the shorter dimension of the input image, SSD seems to perform better with similarly
sized images. However, SSD requires extensive use of data augmentation to achieve this
result. Fast R-CNN and Faster R-CNN only use horizontal flipping, and it is currently
unknown whether they would benefit from additional augmentation.
While the advanced methods are more precise than Fast R-CNN, the real improvements
come from speed. When most of the detections with a low probability are eliminated
using thresholding and non-maximum suppression (see section 4.6 for details), SSD512
can run at 19 FPS on a Titan X GPU. Meanwhile, Faster R-CNN with a VGG-16
architecture performs at 7 FPS. The original authors of Faster R-CNN report a running
time of 5 FPS, i.e. 0.2 s per image. Fast R-CNN has approximately the same evaluation
speed, but requires additional time for calculating the region proposals. Region generation
time depends on the method, with Selective Search requiring 2 seconds per image on a
CPU and Edge Boxes requiring 0.2 seconds per image.
CHAPTER 4
SYSTEM DESIGN
Process
A procedure or process performs operations and gives output based on the supplied
arguments. Pure functions are considered low-level processes that do not have side effects.
A process data flow component is represented as an ellipse.
Data Flows
The connection from one process to another, or from one sub-entity to its parent entity, is
represented by an arrow labelled with the intermediate value.
Graphical Representation
Actors
The element that drives the data flow by taking the inputs and computing the output is
termed an actor.
Data Store
Sometimes data is required to be accessed later in the data flow; this is done by the data
store component of the DFD.
External Entity
Any external entity that can access the flow in the DFD, such as a librarian, is called an
external entity component. It is represented as a rectangle.
Graphical Representation
Output Symbol
When the user interacts with the system, the DFD depicts the output in the form of the
polygon shown below.
Graphical Representation
4.3 Detailed Design
Fig. 4.3.2: Pre-processing the image (10 different directions, 28 dimensions)
Fig. 4.3.3: Processing (find the nodal points)
Fig. 4.3.4: Recognition (compare both images and the maximum precision against the testing image)
Fig. 4.3.5
Second level DFD
Fig. 4.3.6: Second-level DFD (convert to gray image, find the neighbourhood of the nodal points, retrieve the stored image, compare the images, compare the maximal percentage over 28 dimensions)
Use case
Fig. 4.3.7: Use case diagram (input gray image, pre-processing, processing, recognition)
Sequence Diagram
Fig. 4.3.8: Sequence diagram (level of pixel; compare calculated principal points with the test image)
Chapter 5
Implementation
o Python
o TensorFlow
o Tensorboard
o Protobuf v3.4 or above
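Every class the detector can output must appear in a label map (a *.pbtxt file) that links a numeric id to a display name. The items below appear to be an excerpt from the COCO label map shipped with the TensorFlow Object Detection API (mscoco_label_map.pbtxt); the ids and display names are reproduced as given.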
item {
name: "/m/01g317"
id: 1
display_name: "person"}
item {
name: "/m/0199g"
id: 2
display_name: "bicycle"}
item {
name: "/m/0k4j"
id: 3
display_name: "car"}
item {
name: "/m/04_sv"
id: 4
display_name: "motorcycle"}
item {
name: "/m/01940j"
id: 27
display_name: "backpack"}
item {
name: "/m/080hkjn"
id: 31
display_name: "handbag"}
item {
name: "/m/01c648"
id: 73
display_name: "laptop"}
item {
name: "/m/050k8"
id: 77
display_name: "cell phone"}
item {
name: "/m/0bt_c3"
id: 84
display_name: "book"}
To do this we can write a simple script that iterates through all *.xml files in the
annotation folders and aggregates them into a single *.csv file:
import os
import glob
import pandas as pd
import argparse
import xml.etree.ElementTree as ET

def xml_to_csv(path):
    print(path)
    xml_list = []
    for xml_file in glob.glob(path + '/*.xml'):
        tree = ET.parse(xml_file)
        root = tree.getroot()
        for member in root.findall('object'):
            value = (root.find('filename').text,
                     int(root.find('size')[0].text),
                     int(root.find('size')[1].text),
                     member[0].text,
                     int(member[4][0].text),
                     int(member[4][1].text),
                     int(member[4][2].text),
                     int(member[4][3].text))
            xml_list.append(value)
    column_name = ['filename', 'width', 'height', 'class', 'xmin', 'ymin', 'xmax', 'ymax']
    xml_df = pd.DataFrame(xml_list, columns=column_name)
    return xml_df

def main():
    parser = argparse.ArgumentParser(
        description="sample tensorflow xml-to-csv convertor")
    parser.add_argument("-i", "--inputDir",
                        help="path to the folder where the input .xml files are stored",
                        type=str)
    parser.add_argument("-o", "--outputFile",
                        help="name of output .csv file (including path)",
                        type=str)
    args = parser.parse_args()
    print(args)
    if args.inputDir is None:
        args.inputDir = os.getcwd()
    if args.outputFile is None:
        args.outputFile = args.inputDir + "/labels.csv"
    assert os.path.isdir(args.inputDir)
    xml_df = xml_to_csv(args.inputDir)
    xml_df.to_csv(args.outputFile, index=None)
    print('successfully converted xml to csv')

if __name__ == '__main__':
    main()
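Assuming the script is saved as xml_to_csv.py, it can be run once for the training images and once for the test images, for example (the paths are illustrative):

python xml_to_csv.py -i images/train -o annotations/train_labels.csv
python xml_to_csv.py -i images/test -o annotations/test_labels.csv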
5.3.2 Converting from *.csv to *.record
Now that we have obtained our *.csv annotation files, we will need to convert them into
TFRecords.
from __future__ import division
from __future__ import print_function
from __future__ import absolute_import

import os
import io
import sys
from collections import namedtuple

import pandas as pd
import tensorflow as tf
from PIL import Image

sys.path.append("../../models/research")
from object_detection.utils import dataset_util

flags = tf.app.flags
flags.DEFINE_string('csv_input',
                    '/tensorflow/workspace/training/annotation/test_labels.csv',
                    'Path to the CSV input')
flags.DEFINE_string('output_path',
                    '/tensorflow/workspace/training/annotation/test.record',
                    'Path to output TFRecord')
flags.DEFINE_string('label0', 'mobile', 'Name of class[0] label')
flags.DEFINE_string('label1', 'hand', 'Name of class[1] label')
flags.DEFINE_string('label2', 'book', 'Name of class[2] label')
flags.DEFINE_string('label3', 'pen', 'Name of class[3] label')
flags.DEFINE_string('label4', 'bag', 'Name of class[4] label')
flags.DEFINE_string('img_path',
                    '/tensorflow/workspace/training/images/test',
                    'Path to images')
FLAGS = flags.FLAGS


def class_text_to_int(row_label):
    """Map a class name from the CSV to the numeric id used in the label map."""
    if row_label == FLAGS.label0:
        return 1
    elif row_label == FLAGS.label1:
        return 2
    elif row_label == FLAGS.label2:
        return 3
    elif row_label == FLAGS.label3:
        return 4
    else:
        return 5


def split(df, group):
    """Group annotation rows by filename, so each group holds all boxes of one image."""
    data = namedtuple('data', ['filename', 'object'])
    gb = df.groupby(group)
    return [data(filename, gb.get_group(x))
            for filename, x in zip(gb.groups.keys(), gb.groups)]


def create_tf_example(group, path):
    """Build a tf.train.Example holding one encoded image and all of its boxes."""
    with tf.gfile.GFile(os.path.join(path, '{}'.format(group.filename)), 'rb') as fid:
        encoded_jpg = fid.read()
    encoded_jpg_io = io.BytesIO(encoded_jpg)
    image = Image.open(encoded_jpg_io)
    width, height = image.size

    filename = group.filename.encode('utf8')
    image_format = b'jpg'
    xmins, xmaxs, ymins, ymaxs = [], [], [], []
    classes_text, classes = [], []

    for index, row in group.object.iterrows():
        # Box coordinates are stored normalized to [0, 1].
        xmins.append(row['xmin'] / width)
        xmaxs.append(row['xmax'] / width)
        ymins.append(row['ymin'] / height)
        ymaxs.append(row['ymax'] / height)
        classes_text.append(row['class'].encode('utf8'))
        classes.append(class_text_to_int(row['class']))

    tf_example = tf.train.Example(features=tf.train.Features(feature={
        'image/height': dataset_util.int64_feature(height),
        'image/width': dataset_util.int64_feature(width),
        'image/filename': dataset_util.bytes_feature(filename),
        'image/source_id': dataset_util.bytes_feature(filename),
        'image/encoded': dataset_util.bytes_feature(encoded_jpg),
        'image/format': dataset_util.bytes_feature(image_format),
        'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
        'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
        'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
        'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),
        'image/object/class/text': dataset_util.bytes_list_feature(classes_text),
        'image/object/class/label': dataset_util.int64_list_feature(classes),
    }))
    return tf_example


def main(_):
    writer = tf.python_io.TFRecordWriter(FLAGS.output_path)
    path = os.path.join(os.getcwd(), FLAGS.img_path)
    examples = pd.read_csv(FLAGS.csv_input)
    grouped = split(examples, 'filename')
    for group in grouped:
        tf_example = create_tf_example(group, path)
        writer.write(tf_example.SerializeToString())
    writer.close()
    output_path = os.path.join(os.getcwd(), FLAGS.output_path)
    print('Successfully created the TFRecord: {}'.format(output_path))


if __name__ == '__main__':
    tf.app.run()
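The converter is typically run twice, once per split, overriding the default flag values. A minimal usage sketch, assuming the script is saved as generate_tfrecord.py and the relative paths mirror the defaults above:

python generate_tfrecord.py --csv_input=annotation/train_labels.csv --output_path=annotation/train.record --img_path=images/train
python generate_tfrecord.py --csv_input=annotation/test_labels.csv --output_path=annotation/test.record --img_path=images/test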
5.3.3 Configuring the Training Pipeline
The following pipeline configuration (based on the ssd_inception_v2 template from the TensorFlow Object Detection API) defines the model, training and evaluation settings. The paths marked PATH_TO_BE_CONFIGURED must be edited to point to the local checkpoint, label map and *.record files.
model {
  ssd {
    num_classes: 1
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    feature_extractor {
      type: "ssd_inception_v2"
      depth_multiplier: 1.0
      min_depth: 16
      conv_hyperparams {
        regularizer {
          l2_regularizer {
            weight: 3.99999989895e-05
          }
        }
        initializer {
          truncated_normal_initializer {
            mean: 0.0
            stddev: 0.0299999993294
          }
        }
        activation: RELU_6
        batch_norm {
          decay: 0.999700009823
          center: true
          scale: true
          epsilon: 0.0010000000475
          train: true
        }
      }
    }
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    box_predictor {
      convolutional_box_predictor {
        conv_hyperparams {
          regularizer {
            l2_regularizer {
              weight: 3.99999989895e-05
            }
          }
          initializer {
            truncated_normal_initializer {
              mean: 0.0
              stddev: 0.0299999993294
            }
          }
          activation: RELU_6
        }
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.800000011921
        kernel_size: 3
        box_code_size: 4
        apply_sigmoid_to_scores: false
      }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.20000000298
        max_scale: 0.949999988079
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.333299994469
        reduce_boxes_in_lowest_layer: true
      }
    }
    post_processing {
      batch_non_max_suppression {
        score_threshold: 0.300000011921
        iou_threshold: 0.600000023842
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
    normalize_loss_by_num_matches: true
    loss {
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_loss {
        weighted_sigmoid {
        }
      }
      hard_example_miner {
        num_hard_examples: 3000
        iou_threshold: 0.990000009537
        loss_type: CLASSIFICATION
        max_negatives_per_positive: 3
        min_negatives_per_image: 0
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
  }
}
train_config: {
  batch_size: 24
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
  optimizer {
    rms_prop_optimizer {
      learning_rate {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.00400000018999
          decay_steps: 800720
          decay_factor: 0.949999988079
        }
      }
      momentum_optimizer_value: 0.899999976158
      decay: 0.899999976158
      epsilon: 1.0
    }
  }
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
  from_detection_checkpoint: true
  num_steps: 200000
}
train_input_reader {
  label_map_path: "PATH_TO_BE_CONFIGURED/mscoco_label_map.pbtxt"
  tf_record_input_reader {
    input_path: "PATH_TO_BE_CONFIGURED/mscoco_train.record"
  }
}
eval_config {
  num_examples: 8000
  max_evals: 10
  use_moving_averages: false
}
eval_input_reader {
  label_map_path: "PATH_TO_BE_CONFIGURED/mscoco_label_map.pbtxt"
  shuffle: false
  num_readers: 1
  tf_record_input_reader {
    input_path: "PATH_TO_BE_CONFIGURED/mscoco_val.record"
  }
}
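Before training, a handful of fields in this template are usually adapted for the local project. A hedged sketch of the adjusted values (the exact paths and the class count are assumptions, based on the five class labels defined in the TFRecord script and the label map path used by the detection script):

num_classes: 5                              # mobile, hand, book, pen, bag
fine_tune_checkpoint: "pre-trained-model/model.ckpt"
label_map_path: "annotations/label_map.pbtxt"
input_path: "annotation/train.record"       # train.record / test.record for the train and eval readers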
5.3.5 train.py
The training job is launched with the (now legacy) train.py script from the TensorFlow Object Detection API:
import functools
import json
import os

import tensorflow as tf

# NOTE: these imports are restored from the Object Detection API's train.py;
# in older releases the trainer module lives directly under object_detection.
from object_detection.builders import dataset_builder
from object_detection.builders import graph_rewriter_builder
from object_detection.builders import model_builder
from object_detection.legacy import trainer
from object_detection.utils import config_util

flags = tf.app.flags
flags.DEFINE_string('master', '', 'Name of the TensorFlow master to use.')
flags.DEFINE_integer('task', 0, 'task id')
flags.DEFINE_integer('num_clones', 1, 'Number of clones to deploy per worker.')
flags.DEFINE_boolean('clone_on_cpu', False,
                     'Force clones to be deployed on CPU. Note that even if '
                     'set to False (allowing ops to run on gpu), some ops may '
                     'still be run on the CPU if they have no GPU kernel.')
flags.DEFINE_integer('worker_replicas', 1, 'Number of worker+trainer replicas.')
flags.DEFINE_integer('ps_tasks', 0,
                     'Number of parameter server tasks. If None, does not use '
                     'a parameter server.')
flags.DEFINE_string('train_dir', '',
                    'Directory to save the checkpoints and training summaries.')
flags.DEFINE_string('pipeline_config_path', '',
                    'Path to a pipeline_pb2.TrainEvalPipelineConfig config '
                    'file. If provided, other configs are ignored.')
flags.DEFINE_string('train_config_path', '',
                    'Path to a train_pb2.TrainConfig config file.')
flags.DEFINE_string('input_config_path', '',
                    'Path to an input_reader_pb2.InputReader config file.')
flags.DEFINE_string('model_config_path', '',
                    'Path to a model_pb2.DetectionModel config file.')
FLAGS = flags.FLAGS


@tf.contrib.framework.deprecated(None, 'Use object_detection/model_main.py.')
def main(_):
    assert FLAGS.train_dir, '`train_dir` is missing.'
    if FLAGS.task == 0:
        tf.gfile.MakeDirs(FLAGS.train_dir)

    # Load either a single pipeline config or separate model/train/input configs.
    if FLAGS.pipeline_config_path:
        configs = config_util.get_configs_from_pipeline_file(
            FLAGS.pipeline_config_path)
        if FLAGS.task == 0:
            tf.gfile.Copy(FLAGS.pipeline_config_path,
                          os.path.join(FLAGS.train_dir, 'pipeline.config'),
                          overwrite=True)
    else:
        configs = config_util.get_configs_from_multiple_files(
            model_config_path=FLAGS.model_config_path,
            train_config_path=FLAGS.train_config_path,
            train_input_config_path=FLAGS.input_config_path)
        if FLAGS.task == 0:
            for name, config in [('model.config', FLAGS.model_config_path),
                                 ('train.config', FLAGS.train_config_path),
                                 ('input.config', FLAGS.input_config_path)]:
                tf.gfile.Copy(config, os.path.join(FLAGS.train_dir, name),
                              overwrite=True)

    model_config = configs['model']
    train_config = configs['train_config']
    input_config = configs['train_input_config']

    model_fn = functools.partial(
        model_builder.build,
        model_config=model_config,
        is_training=True)

    def get_next(config):
        return dataset_builder.make_initializable_iterator(
            dataset_builder.build(config)).get_next()

    create_input_dict_fn = functools.partial(get_next, input_config)

    # Parse cluster information for (optional) distributed training.
    env = json.loads(os.environ.get('TF_CONFIG', '{}'))
    cluster_data = env.get('cluster', None)
    cluster = tf.train.ClusterSpec(cluster_data) if cluster_data else None
    task_data = env.get('task', None) or {'type': 'master', 'index': 0}
    task_info = type('TaskSpec', (object,), task_data)

    # Default parameters for a single-machine training job.
    ps_tasks = 0
    worker_replicas = 1
    worker_job_name = 'lonely_worker'
    task = 0
    is_chief = True
    master = ''

    if cluster_data and 'worker' in cluster_data:
        worker_replicas = len(cluster_data['worker']) + 1
    if cluster_data and 'ps' in cluster_data:
        ps_tasks = len(cluster_data['ps'])

    if worker_replicas > 1 and ps_tasks < 1:
        raise ValueError('At least 1 ps task is needed for distributed training.')

    if worker_replicas >= 1 and ps_tasks > 0:
        # Set up a server for distributed training.
        server = tf.train.Server(tf.train.ClusterSpec(cluster),
                                 protocol='grpc',
                                 job_name=task_info.type,
                                 task_index=task_info.index)
        if task_info.type == 'ps':
            server.join()
            return

        worker_job_name = '%s/task:%d' % (task_info.type, task_info.index)
        task = task_info.index
        is_chief = (task_info.type == 'master')
        master = server.target

    graph_rewriter_fn = None
    if 'graph_rewriter_config' in configs:
        graph_rewriter_fn = graph_rewriter_builder.build(
            configs['graph_rewriter_config'], is_training=True)

    trainer.train(
        create_input_dict_fn,
        model_fn,
        train_config,
        master,
        task,
        FLAGS.num_clones,
        worker_replicas,
        FLAGS.clone_on_cpu,
        ps_tasks,
        worker_job_name,
        is_chief,
        FLAGS.train_dir,
        graph_hook_fn=graph_rewriter_fn)


if __name__ == '__main__':
    tf.app.run()
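A minimal sketch of how the training job is typically launched (the training directory name and config file name are assumptions based on the pipeline configuration above):

python train.py --train_dir=training/ --pipeline_config_path=training/ssd_inception_v2_coco.config

Checkpoints and training summaries are then written to the directory passed as --train_dir.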
5.3.6 Real-Time Detection Script
The following script loads the exported frozen inference graph and runs detection on live frames captured from the webcam.

import numpy as np
import os
import six.moves.urllib as urllib
import sys
import tarfile
import tensorflow as tf
import zipfile
import cv2

from io import StringIO
from matplotlib import pyplot as plt
from PIL import Image

from utils import label_map_util
from utils import visualization_utils as vis_util

# Capture frames from the default webcam.
cap = cv2.VideoCapture(0)

# Path to the frozen detection graph: the actual model used for the object detection.
PATH_TO_CKPT = 'trained-inference-graphs/output_inference_graph_v1.pb/frozen_inference_graph.pb'
PATH_TO_LABELS = 'annotations/label_map.pbtxt'
NUM_CLASSES = 4

# Load the frozen TensorFlow graph into memory.
detection_graph = tf.Graph()
with detection_graph.as_default():
    od_graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH_TO_CKPT, 'rb') as fid:
        serialized_graph = fid.read()
        od_graph_def.ParseFromString(serialized_graph)
        tf.import_graph_def(od_graph_def, name='')

# Load the label map and build a category index mapping ids to class names.
label_map = label_map_util.load_labelmap(PATH_TO_LABELS)
categories = label_map_util.convert_label_map_to_categories(
    label_map, max_num_classes=NUM_CLASSES, use_display_name=True)
category_index = label_map_util.create_category_index(categories)


def load_image_into_numpy_array(image):
    (im_width, im_height) = image.size
    return np.array(image.getdata()).reshape(
        (im_height, im_width, 3)).astype(np.uint8)


with detection_graph.as_default():
    with tf.Session(graph=detection_graph) as sess:
        while True:
            # Read a frame from the camera.
            ret, image_np = cap.read()
            # Expand dimensions since the model expects images to have shape: [1, None, None, 3].
            image_np_expanded = np.expand_dims(image_np, axis=0)
            # Extract the input image tensor.
            image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
            # Extract detection boxes.
            boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
            # Extract detection scores.
            scores = detection_graph.get_tensor_by_name('detection_scores:0')
            # Extract detection classes.
            classes = detection_graph.get_tensor_by_name('detection_classes:0')
            # Extract the number of detections.
            num_detections = detection_graph.get_tensor_by_name('num_detections:0')
            # Actual detection.
            (boxes, scores, classes, num_detections) = sess.run(
                [boxes, scores, classes, num_detections],
                feed_dict={image_tensor: image_np_expanded})
            # Visualization of the results of a detection.
            vis_util.visualize_boxes_and_labels_on_image_array(
                image_np,
                np.squeeze(boxes),
                np.squeeze(classes).astype(np.int32),
                np.squeeze(scores),
                category_index,
                use_normalized_coordinates=True,
                line_thickness=8)
            # Display the annotated frame; press 'q' to quit.
            cv2.imshow('object detection', cv2.resize(image_np, (800, 600)))
            if cv2.waitKey(25) & 0xFF == ord('q'):
                cv2.destroyAllWindows()
                break
Chapter 6
Dataset
The dataset is stored as TFRecord files, one for training and one for testing, generated from the annotated images.
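A small sketch for sanity-checking the generated records by counting the examples in each file (the file paths are assumptions based on the conversion step in Chapter 5):

import tensorflow as tf

for record in ['annotation/train.record', 'annotation/test.record']:
    # tf_record_iterator yields one serialized Example per annotated image.
    count = sum(1 for _ in tf.python_io.tf_record_iterator(record))
    print(record, count)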
6.1 Train_labels.csv:
This file lists every training image together with its annotation parameters: filename, width, height, class, xmin, ymin, xmax and ymax.
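An illustrative excerpt of the CSV layout (the file names and coordinate values below are made-up examples, not rows from the actual dataset):

filename,width,height,class,xmin,ymin,xmax,ymax
img_001.jpg,800,600,mobile,120,80,340,420
img_001.jpg,800,600,hand,300,150,520,580
img_002.jpg,800,600,book,60,40,410,330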
Table 6.1
Table 6.2
6.2 Test_labels.csv
This file lists the test images with the same annotation parameters as the training file.
Table 6.3
Table 6.4
Chapter 7
Snapshots/Forms
Fig.7.1
I used labelImg to annotate the images. Annotations are created in the Pascal VOC format, which is useful later on. labelImg is written in Python and uses Qt for its interface; I used Python 3 + Qt5 with no problems. The figure shows an example of an annotated image. Essentially, we identify xmin, ymin, xmax and ymax for each object and pass these coordinates to the model along with the image for training; the resulting *.xml file looks like the sketch below.
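A hedged sketch of one Pascal VOC annotation as produced by labelImg (the file name, size and coordinates are illustrative, not taken from the actual dataset):

<annotation>
    <folder>train</folder>
    <filename>img_001.jpg</filename>
    <size>
        <width>800</width>
        <height>600</height>
        <depth>3</depth>
    </size>
    <object>
        <name>mobile</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>120</xmin>
            <ymin>80</ymin>
            <xmax>340</xmax>
            <ymax>420</ymax>
        </bndbox>
    </object>
</annotation>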
Fig. 7.2
Fig. 7.3
Fig. 7.4
Another example of an annotated image; we use up to 100 images for each object.
Fig. 7.5
7.4 Bounding Box
Fig. 7.6
7.5 Creating XML File
Fig. 7.7
After annotating the images, we create a label map that includes an item name, id and display name for each object class. This label_map.pbtxt file is used to convert each label name to a numeric id, as sketched below.
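A hedged sketch of such a label map for the five classes used in this project (the exact file contents are an assumption; the ids must match class_text_to_int in the TFRecord script):

item {
  id: 1
  name: 'mobile'
  display_name: 'mobile'
}
item {
  id: 2
  name: 'hand'
  display_name: 'hand'
}
item {
  id: 3
  name: 'book'
  display_name: 'book'
}
item {
  id: 4
  name: 'pen'
  display_name: 'pen'
}
item {
  id: 5
  name: 'bag'
  display_name: 'bag'
}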
Fig. 7.8
7.7 Raw Images and XML Files
This shows all the images stored in the test and train folders. These images are used for training and testing the object detector.
Fig. 7.9
7.8 Monitor Training Job Progress using TensorBoard
We monitor training progress and the loss rate using TensorBoard, which displays them as graphs. It also shows, checkpoint by checkpoint, how well the model is training.
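A minimal sketch of how TensorBoard is typically started for this job (the training directory name is an assumption):

tensorboard --logdir=training/

TensorBoard then serves the loss and learning-rate graphs on http://localhost:6006 by default.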
Fig. 7.10
Fig. 7.11
Fig. 7.13
Fig. 7.14
7.10 Result
After running the program, a new window opens, which can be used to detect objects in real time.
Fig. 7.15
Fig. 7.16
Chapter 8
Testing
Testing is the set of activities carried out to check the functionality and stability of a system. These activities are planned and performed systematically so that they leave little scope for rework or bugs. General characteristics of testing strategies are:
1 Testing begins at the module level and works outward.
2 Different testing techniques are appropriate at different points in time.
3 Debugging and testing are altogether different procedures.
4 The developer of the software conducts testing, and if the project is large there is a dedicated testing team.
System testing is the most widely used testing procedure and consists of five stages, as shown in the figure. In general, the sequence of testing activities is component testing, followed by integration testing and then user testing.
Unit Testing → Module Testing → Sub-System Testing → System Testing → Acceptance Testing
(Unit and module testing constitute component testing; sub-system and system testing constitute integration testing; acceptance testing is user testing.)
Fig. 8.1
8.1 Functional Testing
Once the system is completely developed and integrated, it is checked and evaluated for its functionality as a whole against specific demands and requirements. This type of testing falls under the category of black-box testing and does not require knowledge of the in-depth working and protocol of the system.
In contrast to functional testing, structural testing checks the functionality of the different modules of the whole system and how well they link with the other modules. This type of testing requires full knowledge of the behaviour, protocol and working of the system, both as a whole and module-wise. Knowledge of the system's code base and programming is also required to perform this testing. The tester chooses inputs to exercise paths through the code and determines the appropriate outputs.
To test the model, we first select a checkpoint (usually the latest) and export it into a frozen inference graph. Checkpoints are created while training the model, and the graph exported from a checkpoint is what we test. We divide our data, using 70% of the images for training and 30% for testing, so the images are split between the train and test folders. A sketch of the export step follows.
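A hedged sketch of the export step using the Object Detection API's export script (the checkpoint number XXXX and the directory names are assumptions; the output directory matches the path loaded by the detection script):

python export_inference_graph.py \
    --input_type image_tensor \
    --pipeline_config_path training/ssd_inception_v2_coco.config \
    --trained_checkpoint_prefix training/model.ckpt-XXXX \
    --output_directory trained-inference-graphs/output_inference_graph_v1.pb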
We store around 100 images per object so that the model is trained on every angle of the object. Figure 8.3.1 shows some of the test images.
Fig. 8.3.1
We ran tests with databases built for 6, 12, 18 and 24 objects and obtained overall success rates (correct classification on forced choice) of 99.6%, 98%, 97.4% and 97% respectively. The worst cases were the book and the pen in the 24-object test, with 19/24 and 20/24 correct respectively.
Table 8.3.2
The time to identify an object depends more or less linearly on the number of key features fed to the system and the size of the database. At the moment, overall recognition time on a single processor is about 20 seconds for the 6-object database and about 2 minutes for the 24-object database. This could be improved substantially by improving the indexing methods.
The program updates the video window with a new frame every 0.25 to 0.5 seconds, which corresponds to roughly 2-4 FPS. In this project we detect objects live with the help of a camera.
Fig. 8.3.3
It identifies me as a person with 95% confidence and a water bottle, also with 95% confidence. This illustrates the confidence with which objects are detected.
Fig. 8.3.4
Chapter 9
Maintenance & Evaluation
Maintenance is the term used to refer to modifications made to a software system after its release. System maintenance is an ongoing activity covering a wide variety of tasks, including removing program and design errors, updating documentation and test data, and updating user support. Maintenance can be broadly classified into the following three classes:
Corrective maintenance: This is used to remove errors in the program that appear after the product has been delivered as well as during maintenance. In corrective maintenance, the product is modified to fix the errors discovered after the software has been delivered to the customer.
Adaptive maintenance: This is generally not requested by the client but is imposed by the outside environment, and may include organizational changes.
Perfective maintenance: This means changing the software to improve some of its qualities, such as adding new functions, improving efficiency, or making it easier to use. This type of maintenance responds to the user's additional needs, which may arise from changes within or outside the organization. These changes include:
Changes in software
Economic and competitive conditions
Changes in models
Chapter 10
Conclusion and Future Scope
The object detection system for images is a web-based application which mainly aims to detect multiple objects in various types of images. To achieve this goal, shape and edge features are extracted from the image. It uses a large image database for correct object detection and recognition. The system provides an easy user interface to retrieve the desired images. The system has an additional feature, sketch-based detection, in which the user can draw a sketch by hand as input. Finally, the system returns output images by searching for the images the user wants.
The project has wide scope in multiple areas and its utility can easily be increased by adding more efficient algorithms. Some of these areas are as follows:
Medical diagnosis:
Use of object detection and recognition in medical diagnosis, for example to analyse X-ray reports and detect brain tumours.
Shape recognition:
Recognizing shapes within whole regions of images.
Cartography:
Cartography is the discipline dealing with the conception, production, dissemination and study of maps.
Robotics:
In robotics, object detection is used for body-part movement and motion sensing.
Chapter 11
References
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image
recognition. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (2016), pp. 770–778.
Hoiem, D., Efros, A. A., and Hebert, M. Automatic photo popup. ACM Transactions
on Graphics (TOG) 24, 3 (2005), 577–584.
Hoiem, D., Efros, A. A., and Hebert, M. Geometric context from a single image. In
Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on
(2005), vol. 1, IEEE, pp. 654–661.
Hoiem, D., Efros, A. A., and Hebert, M. Putting objects in perspective.
International Journal of Computer Vision 80, 1 (2008), 3–15.
Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural
Networks 4, 2 (1991), 251–257.
Huang, T. Computer vision: Evolution and promise. CERN European Organization
for Nuclear Research - Reports (1996), 21–26.
Hubel, D. H., and Wiesel, T. N. Receptive fields and functional architecture of
monkey striate cortex. The Journal of Physiology 195, 1 (1968), 215–243.
Ioffe, S., and Szegedy, C. Batch normalization: Accelerating deep network training
by reducing internal covariate shift. CoRR abs/1502.03167 (2015).
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long,