IMINT Target Detection Using Deep Learning
I would like to thank my supervisor Li Wei for suggesting the topic and
for guidance.
I also would like to thank my wife Souha Chebab for supporting me.
I would like to thank all the staff of the Overseas Student Training in
the college for providing the necessary support.
ABSTRACT
Detecting different targets in High Resolution Remote Sensing images is one
of the classical problems of computer vision and is often described as a difficult
task. This thesis addresses that problem using Deep Learning under the constraint of a
small training dataset, relying on a pre-trained Convolutional Neural Network (CNN),
preprocessing of the image dataset, and a suitable training process. The Faster R-CNN
method is used for the object detection task. For practical reasons, this work detects
one class (airplane), but it can be expanded to include other classes (storage tanks,
ships, armored vehicles, bridges…). The dataset used for training combines an existing
dataset with collected images of military aircraft. A Graphical User Interface is also
provided for loading real-world imagery and detecting the targets in it.
With this approach, the analysis of large volumes of IMINT data becomes faster and
human labor is reduced to a minimum. The software recognizes targets in large images
collected by satellites, reconnaissance UAVs, or aircraft on ISR missions, with a
mean average precision of 90.12% on the test image dataset.
Index Terms: Imagery Intelligence; Remote Sensing; ISR; Deep Learning; Object
Detection; Convolutional Neural Network; Faster RCNN; Computer
Vision.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
LIST OF TABLES
CHAPTER 1. INTRODUCTION
1.1 PROBLEM STATEMENT
CHAPTER 4. CREATED DATA AND METHOD
4.1 STARTING POINT
5.3 EVALUATION
REFERENCES
APPENDIX
PUBLISHED ARTICLES
TABLE OF FIGURES
FIGURE 2.3: DETECTING HORIZONTAL EDGES FROM AN IMAGE USING CONVOLUTION FILTERING
FIGURE 3.3: FASTER RCNN IS A SINGLE, UNIFIED NETWORK FOR OBJECT DETECTION
FIGURE 4.1: THESIS STARTING RESULT (LEFT) THESIS FINAL RESULT (RIGHT)
FIGURE 4.2: ONE EXAMPLE IMAGE FOR EACH CLASS OF THE UC MERCED LAND USE DATASET
FIGURE 4.3: EXAMPLE IMAGES FROM THE DATASET COLLECTED USING GLOBAL MAPPER
FIGURE 4.4: AIRPLANE: DAILY NATURE IMAGE (LEFT) REMOTE SENSING IMAGE (RIGHT)
FIGURE 5.1: HR TEST IMAGES SAMPLES: CHENGDU (LEFT) AFB2 (RIGHT)
FIGURE 5.2: NMS PERFORMED ON AN EXTRACT OF TEST IMAGE 5 WITH NC_FASTER_AUG APPROACH
FIGURE 5.7: DETECTION AND SHOWING THE RESULT ON THE GUI SOFTWARE
LIST OF TABLES
TABLE 2: COMPARING STILL IMAGE OBJECT DETECTION METHOD WITH VOC 2007
LIST OF ABBREVIATIONS
AI Artificial Intelligence
CNN Convolutional Neural Network
CPU Central Processing Unit
DCNN Deep CNN
DL Deep Learning
FC Fully-Connected Layer
FCN Fully Convolutional Network
FPS Frames Per Second
GIS Geographic Information System
GPU Graphics Processing Unit
GT Ground Truth
GUI Graphical User Interface
HD High Definition
HOG Histogram of Oriented Gradients
HR High Resolution
ILSVRC ImageNet Large Scale Visual Recognition Challenge
IMINT IMagery INTelligence
ISR Intelligence, Surveillance and Reconnaissance
NMS Non-Maximum Suppression
RAM Random Access Memory
RCNN CNN with Region proposals
RGB Red Green Blue channels
RoI Region of Interest
RPN Region Proposal Network
RS Remote Sensing
SAR Synthetic Aperture Radar
SGDM Stochastic Gradient Descent with Momentum
SIFT Scale Invariant Feature Transform
SSD Single Shot multibox Detector
SVM Support Vector Machines
USGS United States Geological Survey
VOC Visual Object Classes
XOR Exclusive OR
YOLO You Only Look Once
Chapter 1. INTRODUCTION
This work combines computer vision tasks with DL technology. The most common
technique for this combination is the Convolutional Neural Network (CNN) and the
Deep CNN (DCNN). The basic idea of the CNN was inspired by a concept in biology
called the receptive field [1]. Receptive fields act as detectors that are sensitive
to certain types of stimulus, for example, edges. This biological function can be
approximated in computers using the convolution operation. Our work concerns
object (target) detection in High Resolution (HR) Remote Sensing (RS) images.
RESEARCH ON IMINT MILITARY TARGET DETECTION USING DEEP LEARNING
aspects, it is similar to other computer vision tasks because it involves creating a solution
that is invariant to deformation, rotation (especially for Remote Sensing detection) and
changes in lighting resolution. What makes object detection a distinct problem is that it
involves both locating and classifying regions of an image.
In the theoretical part, we review the relevant literature and study how
convolutional object detection methods have improved in the past few years. In the
experimental part, we study how a convolutional object detection system can be
implemented in practice and test how well a detection system trained on remote
sensing image data performs in aircraft detection.
A few detection methods transfer the pre-trained CNNs for object detection. Zhou
et al. [5] propose a weakly supervised learning framework to train an object detector,
where a pre-trained CNN model is transferred to extract high-level features of objects and
the negative bootstrapping scheme is incorporated into the detector training process to
provide faster convergence of the detector. Zhang et al. [6] propose a hierarchical oil tank
detector, which combines deep surrounding features, which are extracted from the pre-
trained CNN model with local features (Histogram of Oriented Gradients). The candidate
regions are selected by an ellipse and line segment detector. Salberg [7] proposes to
extract features from the pre-trained AlexNet model and applies the deep CNN features
for automatic detection of seals in aerial images. Ševo et al. [8] propose a two-stage
approach for CNN training and develop an automatic object detection method based on a
pre-trained CNN, where the GoogLeNet is first fine-tuned twice on UC-Merced dataset,
using different fine-tuning options, and then the fine-tuned model is utilized for sliding-
window object detection. To address the problem of orientation variations of objects, Zhu
et al. [9] employ the pre-trained CNN features that are extracted from combined layers
and implement orientation-robust object detection in a coarse localization framework.
To enhance the performance of generic object detection, Cheng et al. [10]
propose an effective approach to learn a rotation-invariant CNN (RICNN) that improves
invariance to object rotation. In their paper, they add a new rotation-invariant layer to
the off-the-shelf AlexNet model. The RICNN is learned by optimizing a new objective
function that includes an additional regularization constraint, which forces the training
samples before and after rotation to share similar features in order to guarantee the
rotation invariance of the RICNN model.
the numerical results, but also some analysis of them. Finally, we will present the GUI
software design.
Chapter 2. BACKGROUND
In this chapter, we provide the theoretical background necessary for understanding the
methods discussed in the next chapter. First, we discuss relevant details of machine
learning, neural networks, and computer vision. Finally, we explain how these disciplines
are combined in convolutional neural networks.
Machine learning has emerged as a useful tool for modelling problems that are
otherwise difficult to formulate exactly. Classical computer programs are explicitly
programmed by hand to perform a task. With machine learning, some portion of the
human contribution is replaced by a learning algorithm [2]. As availability of
computational capacity and data has increased, machine learning has become more and
more practical over the years, to the point of being almost ubiquitous.
2.1.1 Types
A typical way of using machine learning is supervised learning [3]. A learning
algorithm is shown multiple examples that have been annotated or labelled by humans.
For example, in the object detection problem we use training images where humans have
marked the locations and classes of relevant objects. After learning from the examples,
the algorithm is able to predict the annotations or labels of previously unseen data.
Classification and regression are the most important task types [3]. In classification, the
algorithm attempts to predict the correct class of a new piece of data based on the training
data. In regression, instead of discrete classes, the algorithm tries to predict a continuous
output.
of unsupervised learning is clustering [3]. More recently, especially with the advent of
deep learning technologies, unsupervised preprocessing has become a popular tool in
supervised learning tasks for discovering useful representations of the data [4].
2.1.2 Features
Some kind of preprocessing is almost always needed. Preprocessing the data into a
new, simpler variable space is called feature extraction [3]. Often, it is impractical or
impossible to use the full-dimensional training data directly. Rather, detectors are
programmed to extract interesting features from the data, and these features are used as
input to the machine learning algorithm.
In the past, the feature detectors were often hand-crafted. The problem with this
approach is that we do not always know in advance which features are interesting. The
trend in machine learning has been towards learning the feature detectors as well, which
enables using the complete data [2].
2.1.3 Generalization
Since the training data cannot include every possible instance of the inputs, the
learning algorithm must be able to generalize in order to handle unseen data points [3].
A model that is too simple can fail to capture important aspects of the true model. On the
other hand, a method that is too complex can overfit by modelling unimportant details and
noise, which also leads to bad generalization [3]. Typically, overfitting happens when a complex
method is used in conjunction with too little training data. An overfitted model learns to
model the known examples but does not understand what connects them.
The performance of the algorithm can be evaluated from the quality and quantity
of errors. A loss function, such as mean squared error, is used to assign a cost to the errors
[3]. The objective in the training phase is to minimize this loss.
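As a minimal sketch of the loss function mentioned above, the mean squared error can be computed as follows. The function name and the sample values are illustrative assumptions and do not come from the thesis.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)

# Illustrative values only: two predictions are off by 0.5, one is exact.
loss = mean_squared_error([2.5, 0.0, 2.0], [3.0, -0.5, 2.0])
```

Training then amounts to adjusting the model parameters so that this value decreases on the training data.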
2.2.1 Origins
Neural networks were originally called artificial neural networks because they were
developed to mimic the neural function of the human brain. Pioneering research includes
the threshold logic unit by Warren McCulloch and Walter Pitts in 1943 and the perceptron
by Frank Rosenblatt in 1957.
An artificial neuron based on the McCulloch-Pitts model is shown in Figure 2.1 [3].
The neuron receives input parameters $x_1, \dots, x_n$. The neuron also has weight
parameters $w_1, \dots, w_n$. The weight parameters often include a bias term $w_0$ that
has a matching dummy input $x_0$ with a fixed value of 1. The inputs and weights are
linearly combined and summed. The sum is then fed to an activation function $\varphi$
that produces the output $y$ of the neuron:
$$y = \varphi\left(\sum_{i=0}^{n} w_i x_i\right) = \varphi(\mathbf{w}^{\top}\mathbf{x}).$$
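The neuron described above can be sketched in a few lines of Python. This is an illustrative sketch rather than code from the thesis: the function names, the step activation and the AND-gate weights are assumptions chosen to keep the example small.

```python
import math

def neuron_output(inputs, weights, bias, activation=math.tanh):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    # The bias is treated as a weight attached to a dummy input fixed at 1.
    extended_inputs = [1.0] + list(inputs)
    extended_weights = [bias] + list(weights)
    z = sum(w * x for w, x in zip(extended_weights, extended_inputs))
    return activation(z)

def step(z):
    """Threshold activation in the spirit of the McCulloch-Pitts unit."""
    return 1.0 if z >= 0.0 else 0.0

# A two-input neuron acting as a logical AND gate (hypothetical weights).
and_gate = [neuron_output([a, b], [1.0, 1.0], -1.5, step)
            for a in (0, 1) for b in (0, 1)]
```

Only the input pair (1, 1) pushes the weighted sum above zero, so the neuron fires for that pair alone.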
A multi-layer network typically includes three types of layers: an input layer, one
or more hidden layers and an output layer [3]. The input layer usually passes data along
without modifying it. Most of the computation happens in the hidden layers. The output
layer converts the hidden layer activations to an output, such as a classification.
2.2.3 Back-propagation
A neural network is trained by selecting the weights of all neurons so that the
network learns to approximate target outputs from known inputs. It is difficult to solve
for the neuron weights of a multi-layer network analytically. The back-propagation
algorithm [2] provides a simple and effective way to solve for the weights iteratively.
The classical version uses gradient descent as the optimization method. Gradient descent
can be quite time-consuming and is not guaranteed to find the global minimum of the error,
but with properly chosen configuration settings (known in machine learning as
hyperparameters) it works well enough in practice [2] [3].
In the first phase of the algorithm, an input vector is propagated forward through
the neural network. Before this, the weights of the network neurons have been initialized
to some values, for example small random values. The received output of the network is
compared to the desired output (which should be known for the training examples) using
a loss function. The gradient of the loss function is then computed. This gradient is also
called the error value. When using mean squared error as the loss function, the output
layer error value is simply the difference between the current and desired output.
The error values are then propagated back through the network to calculate the error
values of the hidden layer neurons. The hidden neuron loss function gradients can be
solved using the chain rule of derivatives. Finally, the neuron weights are updated by
calculating the gradient of the loss with respect to the weights and subtracting a
proportion of the gradient from the weights. This proportion is called the learning rate
[3]. The learning rate can be fixed or
dynamic. After the weights have been updated, the algorithm continues by executing the
phases again with different input until the weights converge.
In the above description, we have described online learning that calculates the
weight updates after each new input [2]. Online learning can lead to “zig-zagging”
behavior, where the single data point estimate of the gradient keeps changing direction
and does not approach the minimum directly. Another way of computing the updates is
full batch learning, where we compute the weight updates for the complete dataset [2].
This is quite computationally heavy and has other drawbacks. A compromise version is
mini-batch learning, where we use only some portion of the training set for each update
[6]. Mathematical descriptions of the algorithm are available in this reference [7].
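The difference between online, mini-batch and full-batch learning can be illustrated on the simplest possible model, a single linear neuron trained with gradient descent. The sketch below is a toy example under stated assumptions: the function name, the learning rate, the number of epochs and the synthetic data are all chosen for illustration and are not the training setup used in this thesis.

```python
import random

def train_linear_neuron(samples, batch_size, learning_rate=0.5, epochs=200, seed=0):
    """Fit y = w*x + b by gradient descent on the mean squared error.

    batch_size=1 gives online learning, batch_size=len(samples) gives
    full-batch learning, and anything in between is mini-batch learning.
    """
    rng = random.Random(seed)
    w, b = rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1)  # small random init
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Forward pass and error (current output minus desired output).
            errors = [(w * x + b) - y for x, y in batch]
            # Gradient of 0.5 * mean squared error with respect to w and b.
            grad_w = sum(e * x for e, (x, _) in zip(errors, batch)) / len(batch)
            grad_b = sum(errors) / len(batch)
            # Subtract a proportion (the learning rate) of the gradient.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

# Noise-free samples of the line y = 2x + 1 for x in [-1, 1].
samples = [(x / 10.0, 2.0 * x / 10.0 + 1.0) for x in range(-10, 11)]
w_online, b_online = train_linear_neuron(samples, batch_size=1)
w_mini, b_mini = train_linear_neuron(samples, batch_size=7)
w_full, b_full = train_linear_neuron(samples, batch_size=len(samples))
```

On this noise-free toy problem all three variants recover the same line; the variants differ in how many weight updates they perform per pass over the data and in how noisy each gradient estimate is.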
Early researchers found that perceptrons and other linear systems had severe
drawbacks, being unable to solve problems that were not linearly separable, such as the
XOR-problem. Sometimes, linear systems can solve these kinds of problems using hand-
crafted feature detectors, but this is not the most advantageous use of machine learning.
Simply adding layers does not help either, because a network composed of linear neurons
remains linear no matter how many layers it has [2].
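The claim that stacking linear layers adds no expressive power follows from the associativity of matrix multiplication: $W_2 (W_1 \mathbf{x}) = (W_2 W_1)\,\mathbf{x}$. The short check below uses arbitrary, hypothetical weight matrices to show that two linear layers collapse into one.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(a, x):
    """Multiply a matrix by a vector."""
    return [sum(row[k] * x[k] for k in range(len(x))) for row in a]

# Two linear layers without activation functions (weights chosen arbitrarily).
W1 = [[1.0, -2.0], [0.5, 3.0], [2.0, 1.0]]   # 2 inputs -> 3 hidden units
W2 = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.5]]     # 3 hidden units -> 2 outputs

x = [3.0, -1.0]
two_layer_output = matvec(W2, matvec(W1, x))

# The same mapping expressed as a single linear layer W = W2 * W1.
W = matmul(W2, W1)
single_layer_output = matvec(W, x)
```

A non-linear activation function between the layers breaks this equivalence, which is what allows multi-layer networks to solve problems such as XOR.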
For multi-class classification problems, the Softmax activation function [3] is used
in the output layer of the network:
$$\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \qquad j = 1, \dots, K.$$
The Softmax function takes a vector of K arbitrarily large values and outputs a
vector of K values that lie in the range [0, 1] and sum to 1. The values output by the
Softmax unit can be utilized as class probabilities.
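A minimal sketch of the Softmax function is given below. Subtracting the maximum input before exponentiation is a common numerical safeguard and is an addition of this sketch, not something stated in the text; it leaves the result unchanged.

```python
import math

def softmax(values):
    """Map K real values to K probabilities that sum to 1."""
    # Subtracting the maximum does not change the result but avoids overflow
    # in exp() when the inputs are large.
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([2.0, 1.0, 0.1])
large_inputs = softmax([1000.0, 1000.0])
```

The largest input receives the largest probability, and equal inputs receive equal probabilities even when the raw values are very large.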
One of the main problems is the curse of dimensionality [2]. As the number of
variables increases, the number of different configurations of the variables grows
exponentially. As the number of configurations increases, the number of training samples
should increase in equal measure. Collecting a training dataset of sufficient size is time-
consuming and costly or outright impossible.
In the past ten years, neural networks have had a renaissance, mainly because of the
availability of more powerful computers and larger datasets. In the early 2000s, it was
discovered that neural networks could be trained efficiently using graphics processing
units (GPUs). GPUs are more efficient for the task than traditional CPUs and provide a
relatively cheap alternative to specialist hardware [8]. Today, researchers typically use
high-end graphics cards, such as the NVIDIA Tesla K40 [9].
With deep learning, there is less need for hand-tuned machine learning solutions
that were used previously [2]. A classical pattern detection system, for example, includes
a hand-tuned feature detection phase before a machine learning phase. The deep learning
equivalent consists of a single neural network. The lower layers of the neural network
learn to recognize the basic features, which are then fed forward to higher layers of the
network.
2.3.1 Overview
Computer vision deals with the extraction of meaningful information from the
contents of digital images or video. This is distinct from mere image processing, which
involves manipulating visual information on the pixel level. Applications of computer
vision include image classification, visual detection, 3D scene reconstruction from 2D
images, image retrieval, augmented reality, machine vision and traffic automation [10].
To detect an object, we need to have some idea where the object might be and how
the image is segmented. This creates a type of chicken-and-egg problem, where, to
recognize the shape (and class) of an object, we need to know its location, and to
recognize the location of an object, we need to know its shape [12]. Some visually
dissimilar features, such as the clothes and face of a human being, may be parts of the
same object, but it is difficult to know this without recognizing the object first. On the
other hand, some objects stand out only slightly from the background, requiring
separation before recognition [13].
During the 2000s, popular solutions for object detection utilized feature descriptors,
such as Scale-Invariant Feature Transform (SIFT) developed by David Lowe in 1999
and Histogram of Oriented Gradients (HOG) popularized in 2005. In the 2010s, there has
been a shift towards utilizing convolutional neural networks [14] [9] [15].
Before the widescale adoption of CNNs, there were two competing solutions for
generating bounding boxes. In the first solution, a dense set of region proposals is
generated and then most of these are rejected [16]. This typically involves a sliding
window detector. In the second solution, a sparse set of bounding boxes is generated using
a region proposal method, such as Selective Search [13]. Combining sparse region
proposals with convolutional neural networks has provided good results and is currently
popular [9].
2.4.1 Justification
The problem with solving computer vision problems using traditional neural
networks is that even a modestly sized image contains an enormous amount of
information (see section 2.2.5).
For example, a monochrome 620 × 480 image contains 297,600 pixels. If each
pixel intensity of this image is input separately to a Fully-connected network, each neuron
requires 297,600 weights. A 1920 × 1080 full HD image would require 2,073,600
weights per neuron. If the images are polychrome, the number of weights is multiplied by
the number of color channels (typically three). Thus, the overall number of free
parameters in the network quickly becomes extremely large as the image size increases.
Models that are too large cause overfitting and slow performance [3].
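The arithmetic behind these figures can be reproduced with a few lines of code. The helper name is hypothetical, and the 3 × 3 filter used for comparison is an illustrative assumption.

```python
def weights_per_neuron(width, height, channels=1):
    """Number of input weights for one neuron fully connected to an image."""
    return width * height * channels

small_mono = weights_per_neuron(620, 480)          # monochrome 620 x 480
full_hd_mono = weights_per_neuron(1920, 1080)      # monochrome full HD
full_hd_rgb = weights_per_neuron(1920, 1080, 3)    # three color channels

# For comparison, a single 3x3 convolution filter over an RGB input has
# 3 * 3 * 3 = 27 weights regardless of the image size.
conv_filter_weights = 3 * 3 * 3
```

The comparison with the convolution filter anticipates the parameter sharing discussed later in this section: the filter size, not the image size, determines the number of weights.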
Furthermore, many pattern detection tasks require that the solution be translation
invariant. It is inefficient to train neurons to separately recognize the same pattern in the
top-left corner and in the bottom-right corner of an image. A Fully-connected neural
network fails to take this kind of structure into account.
Figure 2.3: Detecting horizontal edges from an image using convolution filtering
$$h[m, n] = f[m, n] * g[m, n] = \sum_{j} \sum_{k} f[j, k]\, g[m - j, n - k].$$
In effect, the dot product of the filter $g$ and a sub-image of $f$ (with the same
dimensions as $g$) centred on coordinates $m, n$ produces the pixel value of $h$ at
coordinates $m, n$ [2]. The size of the receptive field is adjusted by the size of the filter
matrix. Aligning the filter successively with every sub-image of $f$ produces the output
pixel matrix $h$. In the case of neural networks, the output matrix is also called a feature
map [2] (or an activation map after computing the activation function). Edges need to be
treated as a special case. If image $f$ is not padded, the output size decreases slightly
with every convolution [2].
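The convolution operation can be sketched directly from its definition. The example below is a plain-Python illustration with no padding; the function name, the toy image and the Prewitt-style horizontal-edge filter are assumptions made for this sketch and are not the data behind Figure 2.3.

```python
def convolve2d_valid(image, kernel):
    """True 2-D convolution (kernel flipped), without padding."""
    kh, kw = len(kernel), len(kernel[0])
    # Flipping the kernel in both directions turns the sliding dot product
    # (cross-correlation) into convolution.
    flipped = [row[::-1] for row in kernel[::-1]]
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for m in range(out_h):
        row = []
        for n in range(out_w):
            # Dot product of the flipped filter and the sub-image under it.
            row.append(sum(flipped[j][k] * image[m + j][n + k]
                           for j in range(kh) for k in range(kw)))
        output.append(row)
    return output

# A 5x5 image: dark top rows, bright bottom rows (one horizontal edge).
image = [[0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0],
         [9, 9, 9, 9, 9],
         [9, 9, 9, 9, 9],
         [9, 9, 9, 9, 9]]

# A simple horizontal-edge filter (Prewitt-style, chosen for illustration).
kernel = [[ 1,  1,  1],
          [ 0,  0,  0],
          [-1, -1, -1]]

feature_map = convolve2d_valid(image, kernel)
```

The feature map responds strongly where the filter straddles the edge and is zero over the uniform bright region. Because no padding is used, the 5 × 5 input shrinks to a 3 × 3 output.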
Since the same filters are used for all parts of the image, the number of free
parameters is reduced drastically compared to a Fully-connected neural layer [2]. The
neurons of the convolutional layer mostly share the same parameters and are only
connected to a local region of the input. Parameter sharing resulting from convolution
ensures translation invariance. An alternative way of describing the convolutional layer
is to imagine a Fully-connected layer with an infinitely strong prior placed on its weights
[2]. This prior forces the neurons to share weights at different spatial locations and to
have zero weight outside the receptive field.
Successive convolutional layers (often combined with other types of layers, such as
pooling described below) form a convolutional neural network (CNN). An example of a
convolutional network is shown in Figure 2.4. The backpropagation training algorithm,
described in subsection 2.2.3, is also applicable to convolutional networks [2]. In theory,
the layers closer to the input should learn to recognize low-level features of the image,
such as edges and corners, and the layers closer to the output should learn to combine
these features to recognize more meaningful shapes [1]. In this thesis, we are interested
in studying whether convolutional networks can learn to recognize complete objects.
There are two ways of reducing the data volume size. One way is to include a
pooling layer after a convolutional layer [2]. The layer effectively down-samples the
activation maps as shown in Figure 2.5. Pooling has the added effect of making the
resulting network more translation invariant by forcing the detectors to be less precise.
However, pooling can destroy information about spatial relationships between subparts
of patterns. A typical pooling method is max-pooling (Figure 2.6). Max-pooling simply
outputs the maximum value within a rectangular neighborhood of the activation map [2].
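Max-pooling is simple enough to sketch in full. The function name and the sample activation map below are illustrative assumptions; a 2 × 2 neighborhood with stride 2 is used because it is a common configuration.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Max-pooling over size x size neighborhoods of a 2-D activation map."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [[max(feature_map[r * stride + i][c * stride + j]
                 for i in range(size) for j in range(size))
             for c in range(out_w)]
            for r in range(out_h)]

activation_map = [[1, 3, 2, 0],
                  [4, 2, 1, 5],
                  [0, 1, 7, 2],
                  [3, 2, 4, 6]]

pooled = max_pool2d(activation_map)
```

Each output value keeps only the strongest activation of its 2 × 2 block, so the 4 × 4 map is reduced to 2 × 2 and the exact position of the maximum within each block is lost.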
Another way of reducing the data volume size is adjusting the stride parameter of
the convolution operation. The stride parameter controls whether the convolution output
is calculated for a neighborhood centered on every pixel of the input image (stride 1) or
for every $s$th pixel (stride $s$) [2]. Research has shown that pooling layers can often be
discarded without loss in accuracy by using convolutional layers with a larger stride value
[2]. The stride operation is equivalent to using a fixed grid for pooling.
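The effect of the stride on the output size can be captured by the standard output-size formula, floor((n + 2p - k) / s) + 1 for input size n, filter size k, padding p and stride s. The helper below is a sketch under the assumption of square inputs and filters; the 32-pixel input is an arbitrary example.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one axis."""
    return (input_size + 2 * padding - filter_size) // stride + 1

same_size = conv_output_size(32, 3, stride=1, padding=1)   # 3x3 conv, stride 1
strided = conv_output_size(32, 3, stride=2, padding=1)     # 3x3 conv, stride 2
pooled = conv_output_size(32, 2, stride=2)                 # 2x2 pooling, stride 2
```

A 3 × 3 convolution with stride 2 halves the spatial size just as a 2 × 2 pooling layer with stride 2 does, which is why a larger stride can stand in for pooling.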
Some systems, such as [18], also implement a layer called local response
normalization, which is used as a regularization technique. Local response normalization
mimics a function of biological neurons called lateral inhibition, which causes excited
neurons to decrease the activity of neighboring neurons. However, other regularization
techniques are currently more popular, and these are discussed in the next section.
The final hidden layers of a CNN are typically Fully-connected layers [3]. A
Fully-connected layer can capture some interesting relationships that parameter-sharing
convolutional layers cannot. However, a Fully-connected layer requires a sufficiently
small data volume size in order to be practical. Pooling and stride settings can be used to
reduce the size of the data volume that reaches the Fully-connected layers. A
convolutional network that does not include any Fully-connected layers is called a fully
convolutional network (FCN) [15].
If the network is used for classification, it usually includes a Softmax output layer
[3] (see section 2.2.4). The activations of the topmost layers can also be used directly to
generate a feature representation of an image. This means that the convolutional network
is used as a large feature detector [2].
There are several regularization techniques that are specific to deep neural networks.
A popular technique called dropout [19] attempts to reduce the co-adaptation of neurons.
This is achieved by randomly dropping out neurons during training, meaning that a
slightly different neural network is used for each training sample or minibatch. This
causes the system not to depend too much on any single neuron or connection and
provides an effective yet computationally inexpensive way of implementing
regularization [2]. In convolutional networks, Dropout is typically used in the final Fully-
connected layers [18].
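The dropout mechanism can be sketched as follows. The sketch uses the "inverted dropout" variant, which rescales the surviving activations during training; this is a common implementation choice and an assumption of this example, not necessarily the exact formulation of [19]. The function name and drop probability are also illustrative.

```python
import random

def dropout(activations, drop_probability, rng, training=True):
    """Inverted dropout: zero units at random and rescale the survivors."""
    if not training or drop_probability == 0.0:
        return list(activations)
    keep_probability = 1.0 - drop_probability
    # Scaling by 1/keep_probability keeps the expected activation unchanged,
    # so nothing needs to be rescaled at test time.
    return [a / keep_probability if rng.random() < keep_probability else 0.0
            for a in activations]

rng = random.Random(0)
activations = [1.0] * 10000
dropped = dropout(activations, drop_probability=0.5, rng=rng)
kept_fraction = sum(1 for a in dropped if a != 0.0) / len(dropped)
unchanged = dropout(activations, drop_probability=0.5, rng=rng, training=False)
```

A new random mask is drawn on every call, so each training sample or mini-batch effectively sees a slightly different network, while inference uses all neurons.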
Overfitting can also be reduced by increasing the amount of training data. When it
is not possible to acquire more actual samples, data augmentation is used to generate more
samples from the existing data [2]. For classification using convolutional networks, this
can be achieved by computing transformations of the input images that do not alter the
perceived object classes yet provide additional challenge to the system. The images can
be, for example, flipped, rotated or subsampled with different crops and scales. Also,
noise can be added to the input images [2].
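The transformations listed above are easy to express on images stored as nested lists. The helpers below are a sketch with hypothetical names operating on a tiny single-channel image. Note that for object detection, as opposed to classification, the bounding-box annotations have to be transformed together with the image; that step is not shown here.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def flip_vertical(image):
    """Mirror the image top to bottom."""
    return [list(row) for row in image[::-1]]

def rotate90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def crop(image, top, left, height, width):
    """Extract a height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]

image = [[1, 2, 3],
         [4, 5, 6]]

augmented = {
    "flip_h": flip_horizontal(image),
    "flip_v": flip_vertical(image),
    "rot90": rotate90_clockwise(image),
    "crop": crop(image, 0, 1, 2, 2),
}
```

Rotations are particularly relevant for remote sensing imagery, where objects such as aircraft appear at arbitrary orientations when viewed from above.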
2.4.6 Development
Convolutional neural networks were among the first successful deep neural
networks. The Neocognitron, developed by Fukushima in the 1980s, provided a neural
network model for translation-invariant object recognition, inspired by biology [1].
LeCun et al. combined this method with a learning algorithm, i.e. back-propagation. These
early solutions were mostly used for handwritten character recognition.
After providing some promising results, the neural network methods faded in
prominence and were mostly replaced by support vector machines [14]. Then, in 2012,
Krizhevsky et al. [20] achieved excellent results on the ImageNet Large Scale Visual
Recognition Challenge (ILSVRC) dataset by combining LeCun's method with recent
fine-tuning methods for deep learning. These results popularized CNNs and led to the
development of new powerful object detection methods described in Chapter 3 [14].
For the 2014 ImageNet challenge, Simonyan and Zisserman [18] explored the effect
of increasing the depth of a convolutional network on localization and classification
accuracy. The team achieved results that improved the then state-of-the-art by using
convolutional networks 16 layers deep. The 16-layer architecture includes 13
convolutional layers (with 3x3 filters), 5 pooling layers (2x2 neighborhood max-pooling)
and 3 Fully-connected layers. All hidden layers use rectified linear unit (ReLU) activations. The
Fully-connected layers reduce 4096 channels down to 1000 Softmax outputs and are
regularized using dropout. This form of network is referred to as VGG-16.
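The layer counts above can be checked with a short script. The channel widths per layer and the 224 × 224 RGB input are taken from the published VGG-16 configuration rather than from this text, so they should be read as assumptions of the sketch; the function name is hypothetical.

```python
# Channel widths of the 13 convolutional layers; "M" marks a 2x2 max-pooling layer.
VGG16_LAYOUT = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                512, 512, 512, "M", 512, 512, 512, "M"]
FC_SIZES = [4096, 4096, 1000]

def describe_vgg16(input_size=224, input_channels=3):
    """Count layers and parameters and track the spatial size of the feature maps."""
    size, channels = input_size, input_channels
    conv_layers = pool_layers = parameters = 0
    for item in VGG16_LAYOUT:
        if item == "M":
            size //= 2                      # 2x2 pooling halves width and height
            pool_layers += 1
        else:
            # 3x3 filters with padding keep the spatial size; +item for biases.
            parameters += 3 * 3 * channels * item + item
            channels = item
            conv_layers += 1
    features = size * size * channels       # flattened input to the first FC layer
    for width in FC_SIZES:
        parameters += features * width + width
        features = width
    return conv_layers, pool_layers, size, parameters

conv_layers, pool_layers, final_size, parameters = describe_vgg16()
```

Under these assumptions the five pooling layers reduce a 224 × 224 input to 7 × 7 feature maps, and most of the roughly 138 million parameters sit in the first Fully-connected layer.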
Chapter 3. CNN IN OBJECT
DETECTION
In this chapter, we discuss and compare different object detection methods that utilize
convolutional neural networks. In particular, we look at methods that combine CNNs
with region proposal classification. We further discuss how the region proposals, also
called Regions of Interest (RoI), are generated.
3.1 RCNN
In 2012, Krizhevsky et al. [20] achieved promising results with CNNs for the
general image classification task, as mentioned in subsection 2.4.6. In 2013, Girshick et
al. published a method [14] generalizing these results to object detection. This method is
called R-CNN (“CNN with region proposals”).
Next, a convolutional network is used to extract features from each region proposal.
The sub-image contained in the bounding-box is warped to match the input size of the
CNN and then fed to the network. After the network has extracted features from the input,
the features are input to Support Vector Machines (SVM) that provide the final
classification.
The method is trained in multiple stages, beginning with the convolutional network
[9]. After the CNN has been trained, the SVMs are fitted to the CNN features. Finally,
the bounding-box regressors are trained.
3.1.2 Drawbacks
R-CNN is an important method because it provided the first practical solution for
object detection using CNNs. Being the first, it has many drawbacks that have been
improved upon by later methods.
In his 2015 paper for Fast R-CNN [9], Girshick lists three main problems of R-
CNN:
Second, training is expensive. For both SVM and bounding-box regressor training,
features are extracted from each region proposal and stored on disk. This requires days of
computation and hundreds of gigabytes of storage space.
Third, and perhaps most important, object detection is slow, requiring almost a
minute for each image, even on a GPU. This is because the CNN forward computation is
performed separately for every object proposal, even if the proposals originate from the
same image or overlap each other.
The convolutional feature map that is generated after these layers is input to a RoI
pooling layer. This extracts a fixed-length feature vector for each RoI from the feature
map. The feature vectors are then input to Fully-connected layers that are connected to
two output layers: a Softmax layer that produces probability estimates for the object
classes and a real-valued layer that outputs bounding box co-ordinates computed using
regression (meaning refinements to the initial candidate boxes).
As the detection time decreases, the overall computation time begins to depend
significantly on the performance of the region proposal generation method. The RoI
generation can thus form a computational bottleneck [15]. Additionally, when there are
many RoIs, the time spent on evaluating the Fully-connected layers can dominate the
evaluation time of the convolutional layers. Classification time can be accelerated by
approximately 30% if the Fully-connected layers are compressed using truncated singular
value decomposition [9]. This results in a slight decrease in precision.
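The compression mentioned above can be summarized as follows. The notation is borrowed from the Fast R-CNN publication [9] and is an assumption with respect to this text: $W$ is the $u \times v$ weight matrix of a Fully-connected layer and $t$ is the number of singular values that are kept.

```latex
W \approx U \, \Sigma_t \, V^{\top},
\qquad U \in \mathbb{R}^{u \times t},\;
\Sigma_t \in \mathbb{R}^{t \times t},\;
V \in \mathbb{R}^{v \times t}
% Parameter count of the layer:
%   before compression: u v
%   after compression:  t (u + v)
```

The single layer is replaced by two smaller layers, one with weights $\Sigma_t V^{\top}$ and one with weights $U$, with no non-linearity between them. The saving is significant when $t$ is much smaller than $\min(u, v)$.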
3.2.3 Training
According to the original publication [9], Fast R-CNN is more efficient to train than
R-CNN, with a nine-fold reduction in training time. The entire network (including the RoI
pooling layer and the Fully-connected layers) can be trained using the back-propagation
algorithm and stochastic gradient descent. Typically, a pre-trained network is used as a
starting point and then fine-tuned. Training is done in mini-batches of N images, and R/N
RoIs are sampled from each mini-batch image, where R is the total number of RoIs per
mini-batch. A sampled RoI is assigned to an object class if its intersection over union
with a ground-truth box is over 0.5. The other RoIs belong to the background class.
As in classification, RoIs from the same image share computation and memory
usage. For data augmentation, the original image is flipped horizontally with probability
0.5. The Softmax classifier and the bounding box regressors are fine-tuned together using
a multi-task loss function, which considers both the true class of the sampled RoI and the
offset of the sampled bounding box from the true bounding box.
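The multi-task loss described above can be written compactly. The following is a sketch in the notation of the Fast R-CNN paper [9]; the symbols (p for the predicted class probabilities, u for the true class with u = 0 as background, t^u for the predicted box offsets of class u, v for the true offsets, and λ for the balancing weight) are introduced here for illustration and are not used elsewhere in this thesis.

```latex
L(p, u, t^{u}, v) = L_{\mathrm{cls}}(p, u) + \lambda \, [u \ge 1] \, L_{\mathrm{loc}}(t^{u}, v),
\qquad L_{\mathrm{cls}}(p, u) = -\log p_{u}

L_{\mathrm{loc}}(t^{u}, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}\left(t^{u}_{i} - v_{i}\right),
\qquad
\mathrm{smooth}_{L_1}(z) =
\begin{cases}
0.5\, z^{2}, & |z| < 1 \\
|z| - 0.5, & \text{otherwise}
\end{cases}
```

The indicator [u ≥ 1] switches the localization term off for background RoIs, which have no ground-truth box to regress to.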
The aim of region proposal generation in object detection is to maximize recall i.e.
to generate enough regions so that all true objects are recovered [21]. The generator is
less concerned with precision since it is the task of the object detector to identify correct
regions from the output of the region proposal generator.
Dense set solutions attempt to generate by brute force an exhaustive set of bounding
boxes that includes every potential object location [13]. This can be achieved by sliding
a detection window across the image. However, searching through every location of the
image is computationally costly and requires a fast object detector. Additionally, different
window shapes and sizes need to be considered. Thus, most sliding window methods limit
the number of candidate objects by using a coarse step size and a limited number of fixed
aspect ratios.
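To give a feeling for why the step size matters, the following minimal sketch enumerates sliding-window candidates. It is written in Python for illustration only (the thesis implementation uses MATLAB); the function name, the image size and the window sizes are hypothetical choices, and boxes use the [x y width height] convention.

```python
def sliding_windows(image_w, image_h, window_sizes, step):
    """Yield candidate boxes [x, y, width, height] that fit inside the image."""
    for win_w, win_h in window_sizes:
        # Slide the window over every position reachable with the given step.
        for y in range(0, image_h - win_h + 1, step):
            for x in range(0, image_w - win_w + 1, step):
                yield [x, y, win_w, win_h]


# Hypothetical 100 x 100 image with two fixed window shapes.
shapes = [(50, 50), (50, 25)]
coarse = list(sliding_windows(100, 100, shapes, step=25))
dense = list(sliding_windows(100, 100, shapes, step=5))
print(len(coarse), len(dense))
```

Even on this tiny example, reducing the step from 25 to 5 pixels multiplies the number of candidates more than tenfold, which is why dense sets need a very fast detector.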
Most region proposals in a dense set do not contain interesting objects. These
proposals need to be discarded after the object detection phase. Detection results can be
discarded if they fall below a predefined confidence threshold or if their confidence
value is below that of an overlapping local maximum (non-maximum suppression) [16].
Instead of discarding the regions after the object detection stage, the region proposal
generator itself can rank the regions in a class-agnostic way and discard low-ranking
regions. This generates a sparse set of object detections [22]. Similarly to dense set
methods, thresholding and non-maximum suppression (NMS) can be implemented after
the detection phase to further improve the detection quality. Sparse set solutions can be
grouped into unsupervised and supervised methods.
One of the most popular unsupervised methods is Selective Search [13], which
utilizes an iterative merging of super-pixels. Another approach is to rank the objectness
of a sliding window, where objectness measures membership to a set of object classes vs.
background. A popular example of this is Edge Boxes [21], which calculates the
objectness score by counting the number of edges within a bounding box and
subtracting the number of edges that overlap the box boundary.
Certain advanced object detection methods, such as Faster R-CNN [15] described
in section 3.4 below, use parts of the same convolutional network both for generating the
region proposals and for detection. We call these kinds of methods integrated methods.
that generates the region proposals is called a Region Proposal Network (RPN). The
authors used the Fast R-CNN architecture for the detection network. This thesis uses this
method for target detection. Hence, Faster R-CNN is described in more detail in this
section, based on the authors' paper [15].
Figure 3.3: Faster RCNN is a single, unified network for object detection
Thus, Faster R-CNN is composed of two modules. The first module is a deep FCN
that proposes regions (described in subsection 3.4.2 below), and the second module is the
Fast R-CNN detector (described in section 3.2) that uses the proposed regions. The entire
system is a single, unified network for object detection (Figure 3.3). The RPN module
tells the Fast R-CNN module where to look.
has 5 shareable convolutional layers and the Simonyan and Zisserman model (VGG-16)
[24], which has 13 shareable convolutional layers. To generate region proposals, they
slide a small network over the convolutional feature map output by the last shared
convolutional layer. This small network takes as input an n × n spatial window of the
input convolutional feature map. Each sliding window is mapped to a lower-dimensional
feature (256-d for ZF and 512-d for VGG, with ReLU following). This feature is fed into
two sibling Fully-connected layers: a box-regression layer (reg) and a box-classification
layer (cls). The authors used n = 3. This mini-network is illustrated in Figure 3.4. Note
that because the mini-network operates in a sliding-window fashion, the Fully-connected
layers are shared across all spatial locations. This architecture is naturally implemented
with an n × n convolutional layer followed by two sibling 1 × 1 convolutional layers
(for regression and classification, respectively).
Intersection over Union (IoU): the area of the intersection between the anchor and the
ground-truth box (object) in the image divided by the area of the union of these two boxes,
as shown in Figure 3.5. For example, IoU > 0.7 means that the anchor contains an object
(the positive overlap range is [0.7 … 1]), whereas IoU < 0.3 means that the anchor does not
contain an object (the negative overlap range is [0 … 0.3]). With A denoting the anchor and
B the ground-truth box, the mathematical representation of IoU is:
$$\mathrm{IoU} = \frac{\mathrm{area}(A \cap B)}{\mathrm{area}(A \cup B)}, \qquad \text{anchor} = \begin{cases} \text{positive (object)}, & \mathrm{IoU} > 0.7 \\ \text{negative (not object)}, & \mathrm{IoU} < 0.3 \end{cases}$$
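The IoU definition and the two overlap ranges can be sketched in a few lines. This is an illustrative Python sketch, not the MATLAB code used in this thesis; the function names are hypothetical, boxes are assumed to be in [x y width height] form, and the "ignored" label for anchors that fall between the two thresholds follows the Faster R-CNN paper [15], where such anchors do not contribute to training. The additional rule of [15] that the anchor with the highest IoU for a ground-truth box is also positive is left out for brevity.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as [x, y, width, height]."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def label_anchor(anchor, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Label an anchor from its best overlap with any ground-truth box."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best > pos_thr:
        return "positive"
    if best < neg_thr:
        return "negative"
    return "ignored"  # between the two ranges: not used for training


anchor = [0, 0, 10, 10]
print(label_anchor(anchor, [[1, 0, 10, 10]]))  # large overlap
print(label_anchor(anchor, [[8, 0, 10, 10]]))  # small overlap
```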
Non-Maximum Suppression (NMS): in this thesis, NMS is used to
enhance the detection accuracy. The technique suits this work because objects of the same
class in a remote sensing image cannot overlap. NMS discards every detection whose IoU
with a higher-probability detection is larger than a pre-set parameter value (typically
0.5) and keeps the others. The purpose is to remove multiple
detections of the same object before evaluation, as shown in Figure 3.6.
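A greedy form of NMS can be sketched as follows. This Python sketch is illustrative only (the thesis relies on MATLAB tools); the function names and the sample detections are hypothetical, and boxes are assumed to be in [x y width height] form.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as [x, y, width, height]."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS on a list of (box, score) pairs; returns the kept pairs, best first."""
    kept = []
    # Visit detections from the most to the least confident.
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        # Keep a detection only if it does not overlap an already kept,
        # higher-scoring detection by more than the threshold.
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


detections = [([0, 0, 10, 10], 0.9), ([1, 0, 10, 10], 0.8), ([50, 50, 10, 10], 0.7)]
print(nms(detections, iou_threshold=0.5))
```

Lowering `iou_threshold` makes the suppression more aggressive, which is appropriate when objects of the same class are known not to overlap.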
3.4.3 Training
A Faster R-CNN network is trained by alternating between training for RoI
generation and detection. First, two separate networks are trained. Then, these networks
are combined and fine-tuned. During fine-tuning, certain layers are kept fixed and certain
layers are trained in turn.
The trained network receives a single image as input. The shared fully
convolutional layers generate feature maps from the image. These feature maps are fed
to the RPN. The RPN outputs region proposals, which are input, together with the said
feature maps, to the final detection layers. These layers include a RoI pooling layer and
output the final classifications. Using shared convolutional layers, region proposals are
computationally almost cost-free. Computing the region proposals on a CNN has the
added benefit of being realizable on a GPU. Traditional RoI generation methods, such as
Selective Search, are implemented using a CPU.
SSD: The Single Shot MultiBox Detector (SSD) [16] takes integrated detection
further. The method does not generate proposals at all, nor does it involve any resampling
of image segments. It generates object detections using a single pass of a convolutional
network.
The algorithm deals with different scales by using feature maps from many different
convolutional layers (i.e. larger and smaller feature maps) as input to the classifier. Since
the method generates a dense set of bounding boxes, the classifier is followed by a non-
maximum suppression (NMS) stage that eliminates most boxes below a certain confidence
threshold.
Liu et al. [16] compared the performance of Fast R-CNN, Faster R-CNN and SSD
on the PASCAL VOC 2007 [26] test set. When using networks trained on the PASCAL
VOC 2007 training data, Fast R-CNN achieved a mean Average Precision (mAP) of 66.9%
(see subsection 5.3.2 for the discussion of the evaluation metrics). Faster R-CNN
was more accurate, with a mAP of 69.9%. SSD achieved a mAP of 68.0% with
input size 300 × 300 and 71.6% with input size 512 × 512. As the standard
implementations of Fast R-CNN and Faster R-CNN use 600 as the length of the shorter
dimension of the input image, SSD seems to perform better with similarly sized images.
However, SSD requires extensive use of data augmentation to achieve this result [16].
Fast R-CNN and Faster R-CNN only use horizontal flipping, and it is currently unknown
whether they would benefit from additional augmentation.
While the advanced methods are more precise than Fast R-CNN, the real
improvements come from speed. When most of the detections with a low probability are
eliminated using thresholding and non-maximum suppression, SSD512 can run at 19 FPS
(frames per second) on a Titan X GPU. Meanwhile, Faster R-CNN with a VGG-16
architecture performs at 7 FPS [16]. The original authors of Faster R-CNN [15] report a
running time of 5 FPS (0.2 s per image). Fast R-CNN has approximately the same evaluation
speed but requires additional time for calculating the region proposals. Region generation
time depends on the method, with Selective Search requiring 2 seconds per image on a CPU
and Edge Boxes requiring 0.2 seconds per image [15]. This work does not consider speed
performance because evaluating execution time requires a standardized environment.
As mentioned, the purpose of the thesis is to detect targets in High Resolution
Remote Sensing images. Comparing R-CNN, Fast R-CNN and Faster R-CNN is therefore
necessary in order to choose the method used in this work. Table 2 [15] shows
that Faster R-CNN is the appropriate method. It is by far the fastest at test time:
250 times faster than R-CNN and 10 times faster than Fast R-CNN. Concerning
detection precision, the methods were run on the VOC 2007 Data Set and
Faster R-CNN achieved the best mAP, 69.9%.
Chapter 4. CREATED DATA AND METHOD
After covering the basic concepts of object detection using CNNs,
we present and discuss the methods and techniques (the AlexNet [20] network and the
Faster R-CNN [15] detection method) used and created in this thesis, along with
the Dataset used for training to make the target detection functional. A first experiment
was done as a starting point to figure out how to enhance the detection. The main goal
of this thesis is to create software capable of detecting different military targets in
High Resolution Remote Sensing images (airplanes, bridges, storage tanks, troops…).
Because collecting enough appropriate training images for all the target classes is a big
challenge, and because of the heavy computational work required from the
hardware, we focus on only one class, the Airplane class. However, the work
and the method would be similar for detecting multi-class targets, given the
appropriate training data.
We trained the Faster R-CNN using AlexNet on the collected Dataset (see
subsection 4.2.1). As shown in Figure 4.1, the detection result at first was unusable (it
was enhanced during this thesis work). The initial global mAP was
very low compared with the final result of this thesis work, a mAP of 90.12%
(detailed in section 5.3).
Figure 4.1: Thesis starting result (Left) Thesis final result (Right)
Figure 4.2: One example image for each class of the UC Merced Land Use Dataset
In addition to the UC Merced Dataset, we collected other satellite images,
mostly of military transport airplanes (good availability in most airbases), using the
Geographic Information System (GIS) software Global Mapper [28]. The source map
used by the software is provided by the ESRI World Imagery Map [29]. We searched for
different Air Bases around the world and downloaded high-resolution images of
different sizes with a spatial resolution of 0.3 meter, as shown in Figure 4.3. Spatial
resolution is the real distance between distinguishable patterns (pixels) in an image that
can be separated from each other, often expressed in meters. The total
number of images in the dataset is 83, with different sizes.
Figure 4.3: Example images from the dataset collected using Global Mapper
NWPU Dataset: In addition to the above Dataset, we used a large Dataset called
NWPU-RESISC45 [30], created by Northwestern Polytechnical University (NWPU), to train
a CNN which is used in some approaches (see subsection 5.2.1). This dataset contains
31,500 aerial images spread over 45 scene classes. So far, it is the largest dataset for land
use scene classification in terms of both total number of images and number of scene
classes. To extract the right general features from remote sensing images, we should
reduce the “distance” between everyday natural images and satellite images, which is
clear in Figure 4.4. This dataset is used to fine-tune the layers of the CNN because we
believe that the features learned are then more oriented to satellite images, which can
help to exploit the intrinsic characteristics of satellite images. This approach will be
compared with the other approaches used for the purpose of this thesis. This dataset was
split into 2 parts: 90% as training data and 10% as validation data.
Figure 4.4: Airplane: everyday natural image (left), remote sensing image (right)
The first step in the image preprocessing is to rotate the images, for two reasons.
The first is to enlarge the dataset (83 images are not enough to get good results
even if a pretrained network is used). The second is to try to make the detection rotation
invariant, because a CNN does not by itself recognize a target under different rotations.
The idea is to make copies of every image in the dataset by rotating it in steps of 45°
from 0° to 180°, which gives 4 new images. We chose that interval because a data
augmentation function will be added in the image input layer (see subsection 4.3.1) that
flips the image vertically. In the end, while training the network, we will have a
theoretical number of 9 augmented images for each source image. (Total output is 435 images).
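The count of 9 augmented images per source image can be checked by enumerating the variants. The sketch below is in Python for illustration only (the preprocessing in this thesis is a MATLAB script); the function name is hypothetical. It assumes that the stored copies use the angles 0°, 45°, 90°, 135° and 180°, and that each stored copy can additionally be seen vertically flipped at training time.

```python
def augmentation_variants(angles=(0, 45, 90, 135, 180), vertical_flip=True):
    """List the (angle, flipped) variants that one source image can produce."""
    variants = [(angle, False) for angle in angles]
    if vertical_flip:
        # The random vertical flip of the input layer can apply to every stored copy.
        variants += [(angle, True) for angle in angles]
    return variants


variants = augmentation_variants()
new_rotated_copies = len(augmentation_variants(vertical_flip=False)) - 1
extra_per_source = len(variants) - 1  # everything except the untouched original
print(new_rotated_copies, extra_per_source)
```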
The second step is to draw the ground-truth (GT) boxes around the targets in the images
manually. Training an object detection CNN requires images with GT boxes labeled with the
target categories, in order to train the RPN and to learn how to localize the object within
its neighboring environment. We ran a manual labelling session using the built-in app
(Image Labeler) of MATLAB 2017b [32] to get the GT boxes in the form
[x y width height], as shown in Figure 4.5.
As a third step, all the labeled targets in the dataset images after step 2 are
extracted by cropping. We did this step to initially train the CNN as a classifier.
We believe that it initializes the CNN weights so as to get a better result in the Faster
R-CNN training (the total output is 774 images after this step). But to feed the
CNN, images must be resized to the square input size of the network while keeping the
same image ratio to prevent distortion, hence the need for the next step.
The CNN takes a square input image (in our case 227 × 227 with RGB channels), so
we resized all the cropped images from step iii. The idea of the resize method is to pad
the shortest side with black (zeros) and output a square image, so that we don’t lose the
image ratio and don’t get a distorted target for training. After that, we resize the output
to 227 × 227 pixels.
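The pad-then-resize idea can be sketched on a tiny example. This is an illustrative Python sketch, not the MATLAB script of this thesis; the function names are hypothetical, images are plain lists of rows, the padding is assumed to be centered (the text does not say on which side the zeros are added), and nearest-neighbour sampling stands in for whatever interpolation the actual resize uses.

```python
def pad_to_square(image, fill=0):
    """Pad a 2-D image (list of rows) with `fill` so that it becomes square."""
    height = len(image)
    width = len(image[0]) if height else 0
    side = max(height, width)
    top = (side - height) // 2   # centered padding (assumption)
    left = (side - width) // 2
    square = [[fill] * side for _ in range(side)]
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            square[top + r][left + c] = value
    return square


def resize_nearest(square, size):
    """Resize a square image to size x size with nearest-neighbour sampling."""
    src = len(square)
    return [[square[r * src // size][c * src // size] for c in range(size)]
            for r in range(size)]


crop = [[1, 2, 3, 4],
        [5, 6, 7, 8]]              # a 2 x 4 crop: wider than tall
square = pad_to_square(crop)       # 4 x 4, black rows added above and below
network_input = resize_nearest(square, 227)
print(len(network_input), len(network_input[0]))
```

Because the padding happens before the resize, the target keeps its aspect ratio inside the 227 × 227 input.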
The fifth step is to prepare the data for training. We create an image
datastore with all the images from the previous step and their respective classification
labels. We then split it into 3 parts: a Training Dataset representing 80% of the data, a
Validation Dataset (to follow the improvement of the network while training) of 10%, and
a Test Dataset (to get the accuracy) of 10%. This datastore is used only for training and
testing the CNN (for classification).
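An 80/10/10 split of this kind can be sketched as follows. The code is Python for illustration only and does not reproduce the MATLAB datastore functions used in this thesis; the function name, the fixed seed and the use of simple rounding are assumptions. The 774 items correspond to the number of cropped images reported above.

```python
import random


def split_dataset(items, train=0.8, val=0.1, seed=0):
    """Shuffle the items and split them into training, validation and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    train_part = shuffled[:n_train]
    val_part = shuffled[n_train:n_train + n_val]
    test_part = shuffled[n_train + n_val:]  # the remainder, roughly 10%
    return train_part, val_part, test_part


image_ids = list(range(774))  # placeholder identifiers for the cropped images
train_set, val_set, test_set = split_dataset(image_ids)
print(len(train_set), len(val_set), len(test_set))
```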
A MATLAB script was created to batch process all the previous image
preprocessing techniques and prepare the data for feeding the network and method described
in the section below.
4.3.1 CNN
Regardless of the visual task that we want to achieve, a CNN or DCNN is necessary
for DL in computer vision tasks. 2012 marked the first year where a CNN was used
to achieve a top-5 test error rate of 15.3%. Alex Krizhevsky et al. [20] achieved promising
results with CNNs for the general image classification task by developing AlexNet.
It achieved excellent results on the ImageNet Large Scale Visual Recognition Challenge
(ILSVRC) [34] dataset. It was trained (on two GTX 580 GPUs for five to six
days) on ImageNet data; ImageNet contains over 15 million annotated images from a total
of over 22,000 categories.
As presented in Figure 4.6, to apply transfer learning, we completely delete
the last 3 layers that were trained for the classification (Fully-connected layer, Softmax
layer and classification layer) because the original output was 1000 classes, and our
output is only 1 class. We also change the first layer, the input layer, and add a data
augmentation method that applies a random vertical flip to the input image on every
iteration of the mini-batch during training. This data augmentation, together with the
image preprocessing explained in subsection 4.2.2, helps the Faster R-CNN to
detect the object with rotational invariance, which addresses the limitation of the object
detection method explained in subsection 4.3.2 below.
input image. This issue was overcome by the data augmentation in the input layer
(subsection 4.3.1) and the image preprocessing (subsection 4.2.2).
The Faster R-CNN has 2 CNNs working together. One network is the pretrained
CNN described in subsection 4.3.1 above, transformed into a Fast R-CNN (see section
3.2) by adding a regression network in order to output the localization box represented by
4 values [x y width height]. The other one is the RPN, which outputs the RoIs.
This network shares the same weights as the previous CNN, but the last layers concerned
with the classification are replaced by an RoI output that feeds the Fast R-CNN for the
classification [15].
2. Train a separate detection network by Fast R-CNN using the proposals generated in step
1, initialized by the CNN,
3. Fix the convolutional layers and fine-tune the layers unique to the RPN, initialized by
the detector from step 2,
4. Fix the convolutional layers and fine-tune the Fully-Connected layers of Fast R-CNN.
Chapter 5. IMPLEMENTATION
In the previous chapter, we presented the work that has been done on the Dataset,
on the CNN and on the detection method. In this chapter, we discuss the practical
implementation of the previous work by presenting the different proposed training
approaches and processes. Then we go through the evaluation results and choose the
best approach according to the mAP in order to create the software. The GUI is also
presented at the end of this chapter.
5.1 ENVIRONMENT
The implementation environment was a Lenovo V110 laptop computer with an
Intel Core i5-7200U 2.50 GHz CPU and 8 GB of RAM; no GPU was used for the training.
The operating system was Windows 10.
The main software tool was MATLAB 2017b. The object detection system and its
related methods were implemented as a combination of preexisting and self-programmed
MATLAB tools.
To find the most effective method and to understand the behavior
of the network and the object detection method, different approaches were created.
Those approaches differ in the sequence in which the training on the Datasets is done, and
in whether we use data augmentation or not. This lets us present the correct approach and
method to detect airplanes in a High-Resolution remote sensing image using Deep
Learning.
5.2.1 Approaches
During the research we ran different tests and trainings to understand how
the detection method behaves when parameters such as the
dataset or the training process change. The different processes are presented (Table 3)
according to the Dataset used and the training sequences of the CNN and the Faster R-CNN.
There are 4 types of sequences with 2 Dataset types (with or without data augmentation),
so we have 8 approaches in total, which are evaluated in section 5.3.
The first training process (Faster RCNN) means that we train only the detection method
without pretraining the CNN. The weights of the CNN are then initially the
weights of AlexNet (pretrained on the ImageNet Dataset).
The second process trains the CNN first, to initialize the weights
of the network with the appropriate data before training the Faster R-CNN.
In the third sequence, the network is trained first on the NWPU Dataset (described in
Chapter 4) because we believe that it makes the features learned by the network more
oriented to satellite images, which can help to exploit the intrinsic characteristics of
satellite images. Then we train the Faster R-CNN detection using the previous network.
In the last sequence, the network trained on the NWPU Dataset is trained again with the
targets, to orient it more towards detecting airplanes, and is then used for training the
Faster R-CNN.
Training Process          | Name (No Data Augmentation) | Name (Data Augmentation)
Faster RCNN (Only)        | Faster                      | Faster_Aug
CNN → Faster RCNN         | C_Faster                    | C_Faster_Aug
NWPU → Faster RCNN        | N_Faster                    | N_Faster_Aug
NWPU → CNN → Faster RCNN  | NC_Faster                   | NC_Faster_Aug
5.2.2 Training
To be able to compare the results of the different approaches presented above, we
fixed the training parameters for the CNN trainings and the Faster R-CNN method. Fixing
the training parameters across the different approaches lets us evaluate the results
properly and discuss only the behavior of the approach. More details about the training
options are shown in Appendix 2.
i. Training Options
CNN on NWPU and created dataset: For the training of our CNNs we used SGDM
(Stochastic Gradient Descent with Momentum) with a batch size of 128, a momentum of
0.9 and a constant learning rate of 10 (because it’s just a fine-tuning step). We also
shuffle the data in each epoch. We also validate the network on the validation data
(described in 4.2.2v) in every epoch to evaluate the training, and we set a patience of 4
(to stop the training if the accuracy has not improved after 4 successive epochs).
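The patience rule can be sketched as below. This is an illustrative Python sketch of the rule as described in the text (no improvement in validation accuracy for 4 successive epochs), not of the MATLAB training routine itself, which may monitor a different quantity such as the validation loss; the function name and the sample accuracy values are hypothetical.

```python
def epochs_until_stop(val_accuracies, patience=4):
    """Return the 1-based epoch at which training stops under the patience rule."""
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        if accuracy > best:
            best = accuracy
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # stop early: no improvement for `patience` epochs
    return len(val_accuracies)  # ran through all scheduled epochs


history = [0.50, 0.60, 0.70, 0.69, 0.70, 0.68, 0.66, 0.90]
print(epochs_until_stop(history))
```

In this made-up history the best accuracy is reached at epoch 3 and training stops after epoch 7, so the late jump at epoch 8 is never seen; this is the usual trade-off of a small patience value.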
Faster RCNN: As the Faster R-CNN goes through a 4-step alternating training, 4 sets of
stage options were defined. The learning options are almost the same for the 4 steps:
SGDM with a batch size of 128, a learn rate drop factor of 0.5 every 3 epochs (this option
multiplies the learning rate by 0.5 every 3 epochs, to reach a lower loss more easily) and
by shuffling the image dataset on every epoch. The initial learn rate of the 2 first steps is
10 , and 10 for the rest of the steps (because those steps are used to fine tune the
previous ones). We fixed the number of training epochs at 10, where one epoch is
one complete pass through the entire training data set. We also fixed the positive IoU
range to [0.7 1] (object) and the negative IoU range to [0 0.3] (not object), as defined
for the anchors in Chapter 3. All the
parameters were chosen after many practical tries to get a better result.
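The effect of the learn rate drop factor over the 10 epochs can be sketched as follows. The Python code is for illustration only; the function name is hypothetical, and the initial rate of 1.0 is a placeholder that makes the multipliers visible, not a value used in this thesis.

```python
def learning_rate(epoch, initial_rate, drop_factor=0.5, drop_period=3):
    """Piecewise-constant schedule: multiply the rate by drop_factor every drop_period epochs.

    `epoch` is 1-based, so epochs 1 to 3 use the initial rate.
    """
    return initial_rate * drop_factor ** ((epoch - 1) // drop_period)


# Placeholder initial rate of 1.0, so the printed values are the multipliers themselves.
schedule = [learning_rate(epoch, initial_rate=1.0) for epoch in range(1, 11)]
print(schedule)
```

With 10 epochs and a drop every 3 epochs, the rate is halved three times, so the last epoch runs at one eighth of the initial rate.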
Major problems limiting the use of deep learning methods are the availability of
computing power and training data (see subsection 2.2.5). For this thesis, we did not have
access to a server farm or to a GPU used for research purposes. Rather, we needed to
implement the methods on an ordinary consumer laptop (section 5.1). Training a
convolutional network from start to finish on such hardware would be enormously
time-consuming. Thus, we favored methods that were established enough to have pre-trained
networks available. We also favored methods that had MATLAB implementations
available. Even so, the training process took a very long time, as shown in Table 4.
This table shows the total duration of the training process of every approach (including
CNN and Faster R-CNN training). It is presented for information only, because evaluating
the time requires a more standardized environment. A sample of the training details is
presented in Appendix 3.
As we can see, the training time is directly related to the size of the training dataset.
It took almost 7 days to train on the NWPU dataset (which contains 31,500 aerial images).
The CNN training took between 1 hour and 1 hour 26 minutes without data augmentation,
and between 6 and 9 hours for the augmented dataset. The Faster R-CNN (4-stage training)
took between 3.5 and 4 hours for the non-augmented dataset, and between
40 and 44 hours on the augmented dataset. As an example, training the NC_Faster_Aug
approach requires 9 days. Fortunately, the CNN training on the NWPU dataset had to be
done only once.
5.3 EVALUATION
5.3.1 Test Dataset
To evaluate the accuracy of the presented approaches, we have to check that the
detection generalizes by testing them on new images that were not used for training. For
this purpose we collected 7 new High Resolution remote sensing images of different
Airports and Airbases (by the same method as in subsection 4.2.1) with a spatial resolution
of 0.3 meter and different sizes, as shown in Table 5 (samples of the images are
shown in Figure 5.1). Foreign airbases are named AFBx.
N° | Name              | Type                  | Size (pixels) | Number of annotated targets
1  | Xian XianYang     | International Airport | 6395 × 7703   | 39
2  | Chengdu ShuangLiu | International Airport | 6372 × 11652  | 50
3  | Beijing TongXian  | Air Base              | 1950 × 2717   | 8
4  | AFB1              | Air Base              | 5281 × 3844   | 15
5  | AFB2              | Air Base              | 2796 × 4002   | 12
6  | AFB3              | Air Base              | 1228 × 1465   | 11
7  | AFB4              | Air Base              | 1259 × 1067   | 2
Generally, an “IoU score over 0.5 is counted as a true positive detection” [21] as
used in the PASCAL VOC Challenge [26] and this definition is used in this thesis as well.
There can be only one true positive match per each ground-truth box. If several detections
match a single ground-truth, the box with the highest likelihood is selected as the true
positive match and the other detections are marked as false positives. Detections with no
matching ground-truth box are marked false positives as well. Ground-truth boxes with
no matching detections are called false negatives (see Table 6 below).
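The matching rules above can be sketched as follows. This Python sketch is illustrative and is not the MATLAB evaluation code used in this thesis; the function names and the sample boxes are hypothetical, boxes are in [x y width height] form, and each detection is compared with its best-overlapping ground-truth box, which is one common way (used in PASCAL VOC style evaluation) to realize the one-match-per-ground-truth rule.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as [x, y, width, height]."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def match_detections(detections, gt_boxes, iou_threshold=0.5):
    """Count (TP, FP, FN) for a list of (box, score) detections."""
    matched = set()
    tp = fp = 0
    # Most confident detections get the first chance to claim a ground-truth box.
    for box, _ in sorted(detections, key=lambda d: d[1], reverse=True):
        best_iou, best_gt = 0.0, None
        for index, gt in enumerate(gt_boxes):
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_gt = overlap, index
        if best_iou > iou_threshold and best_gt not in matched:
            matched.add(best_gt)
            tp += 1
        else:
            fp += 1  # duplicate match or no sufficiently overlapping ground truth
    fn = len(gt_boxes) - len(matched)
    return tp, fp, fn


ground_truth = [[0, 0, 10, 10], [50, 50, 10, 10]]
detections = [([0, 0, 10, 10], 0.9), ([1, 0, 10, 10], 0.8), ([100, 100, 10, 10], 0.7)]
print(match_detections(detections, ground_truth))
```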
                      | Ground Truth: Object | Ground Truth: Not Object
Detection: Object     | True Positive        | False Positive
Detection: Not Object | False Negative       | True Negative
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
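Precision and recall follow directly from the counts of true positives (TP), false positives (FP) and false negatives (FN). The Python sketch below is illustrative only; the function names are hypothetical, and the way the precision-recall points are accumulated (one point per detection, in order of decreasing confidence) is a common convention assumed here rather than a description of the MATLAB evaluation used in this thesis.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN); 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pr_curve(is_true_positive, num_ground_truth):
    """Points (precision, recall) after each detection, sorted by descending score."""
    points = []
    tp = fp = 0
    for flag in is_true_positive:
        if flag:
            tp += 1
        else:
            fp += 1
        points.append((tp / (tp + fp), tp / num_ground_truth))
    return points


print(precision_recall(tp=1, fp=2, fn=1))
print(pr_curve([True, False, True], num_ground_truth=2))
```

A recall of 1 at the end of the curve means that every annotated target was found by at least one detection.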
The mAP has also been evaluated for each image and each approach. We show
only the detailed results after the NMS process, in Table 8 below.
Name          | Image 1 | Image 2 | Image 3 | Image 4 | Image 5 | Image 6 | Image 7
Faster        | 46.94%  | 56.16%  | 26.09%  | 89.83%  | 84.31%  | 47.26%  | 41.67%
C_Faster      | 18.69%  | 11.10%  | 45.33%  | 29.54%  | 16.80%  | 27.16%  | 36.67%
N_Faster      | 76.17%  | 72.14%  | 94.64%  | 84.45%  | 72.90%  | 98.05%  | 100%
NC_Faster     | 75.66%  | 34.27%  | 86.23%  | 59.96%  | 67.47%  | 100%    | 100%
Faster_Aug    | 55.96%  | 59.93%  | 66.50%  | 73.86%  | 87.27%  | 100%    | 100%
C_Faster_Aug  | 81.96%  | 79.69%  | 49.76%  | 90.87%  | 84.47%  | 100%    | 100%
N_Faster_Aug  | 97.36%  | 81.78%  | 100%    | 81.42%  | 87.64%  | 100%    | 100%
NC_Faster_Aug | 97.43%  | 83.63%  | 100%    | 83.26%  | 87.64%  | 100%    | 100%
The detection time has also been evaluated for the detection process of each image,
as shown in Table 9. As mentioned before, it is presented for information only, because
evaluating the time requires a more standardized environment.
Name          | Image 1 | Image 2 | Image 3 | Image 4 | Image 5 | Image 6 | Image 7 | Total
Faster        | 859     | 1384    | 96      | 355     | 183     | 31      | 23      | 2931
C_Faster      | 756     | 1140    | 76      | 297     | 160     | 25      | 18      | 2472
N_Faster      | 732     | 1143    | 73      | 286     | 152     | 25      | 17      | 2428
NC_Faster     | 910     | 1359    | 94      | 369     | 201     | 32      | 23      | 2988
Faster_Aug    | 839     | 1302    | 83      | 307     | 165     | 26      | 19      | 2741
C_Faster_Aug  | 712     | 1105    | 74      | 289     | 156     | 24      | 17      | 2377
N_Faster_Aug  | 731     | 1132    | 71      | 286     | 158     | 24      | 17      | 2419
NC_Faster_Aug | 847     | 1310    | 90      | 375     | 199     | 33      | 23      | 2877
All the above results are discussed in the next subsection, along with
the detection results on some images of the test dataset.
NMS added between 2% and 10% of mAP in most of the cases. This technique
removed the overlapping bounding boxes of the same target by keeping the most accurate
one, as shown in Figure 5.2. Since targets of the same class in a Remote Sensing image
cannot overlap, we set the NMS IoU threshold to 0.1 (minimum value).
Performing NMS therefore enhances the detection precision.
Figure 5.2: NMS performed on an extract of test image 5 with NC_Faster_Aug approach
The system can now perform state-of-the-art detection (a mAP above 90%) after
using our image preprocessing and data augmentation discussed in subsection 4.2.2.
We can see clearly how much this method enhances the detection. For
example, for the NC_Faster_Aug approach after NMS, the mAP rose
by about 40 percentage points: from 50.01% without data augmentation to 90.12% with it,
and by about 50 percentage points without NMS. Figure 5.3 below shows the precision-recall
curves (explained in subsection 5.3.2) for the NC_Faster_Aug and NC_Faster approaches.
As we can see, for the first approach, the recall reaches the value of 1 (this is the case
for all the approaches using the data augmentation and image preprocessing; details can be
seen in Appendix 4). This means that this approach can find all the targets in the whole
test dataset. The precision of the first approach is also higher than that of the second
one, which means that there are fewer false positive detections. So, the data augmentation
and image preprocessing enhance the detection by finding not only more targets, but also
the correct ones.
Table 7 also shows that training the CNN with a very large RS dataset
(NWPU) helps to enhance the results. We initially assumed that the features learned would
be more oriented to satellite images, which can help to exploit the intrinsic
characteristics of satellite images. The evaluation supports this assumption: as we can
see, the approaches whose CNN was trained on the NWPU dataset
performed better. For example, the mAP increases by around 10% from the
Faster to the N_Faster approach.
Table 8 shows that the accuracy is higher for smaller images.
As we can see, the 100% results (NC_Faster_Aug approach) are obtained on the 3 smallest
images (3, 6 and 7), which also contain fewer targets. But even for the larger image 1 we
got a good accuracy of 97.43%. Figure 5.4 shows the detection on the test images 1, 6 and 7
by the NC_Faster_Aug approach.
Table 9 shows that the detection time is directly related to the size of the input test
image and the number of targets. This is expected, since the kernel window needs to slide
over more pixels in the image and more targets need to be classified.
The Framework of the software is shown in Figure 5.5. The basic functionality of
the software is to input a High Resolution Remote Sensing image and get the same image
with the bounding boxes of the detected targets.
The main window is simple. The menu bar contains 3 buttons, as shown in Figure
5.6. The process menu contains the processing steps in order: inputting the image,
starting the detection and saving the resulting image.
An option window lets the user change the line width, the color of the boxes, the
font size of the classification accuracy displayed for each detection, and the NMS IoU
value (recommended to be 0.1).
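To illustrate what the NMS IoU value controls, the sketch below shows a plain greedy non-maximum suppression. It is an illustrative assumption about the general technique, not the code of the software itself; the box format (x1, y1, x2, y2) and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.1):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it too much.

    Returns the indices of the kept boxes, ordered by decreasing score.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap any already kept box
        # by more than the threshold.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


# Hypothetical detections: two boxes on the same target, one on another target.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores, iou_threshold=0.1))
```

A low threshold such as 0.1 removes almost every box that touches a stronger detection, which suits targets that rarely overlap in an overhead view; a higher value keeps more overlapping boxes.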
Figure 5.7: Detection and showing the result on the GUI Software
After inputting the raw image, the user clicks the detect button. As shown in
Figure 5.7, the detection starts, and the software then shows the detection results
together with the number of targets and the detection time. The user can then save the
result image in a chosen folder.
Chapter 6. CONCLUSION
In this chapter, we will review our results and provide some concluding remarks. We
set out to review the current methods of convolutional object detection, to implement
one such method and to explore potential improvements.
6.1 THEORY
We began the thesis with a review of the theoretical background. We explained how
neural networks function and what object detection entails. We demonstrated why regular
neural networks are insufficient for image-related tasks and how translation-invariant
convolutional networks provide an effective solution to many computer vision problems.
Next, we demonstrated how convolutional object detection has evolved from the
relatively slow R-CNN to the current optimized method. This development is mostly not
related to the structure of the convolutional network itself. Rather, it is related to how the
convolutional network is used and to computation that takes place before and after the
convolutional network. In the previous methods, there were many more separate phases
involving preprocessing, region generation, computation of the fully connected layers and
the final classification. In the latest methods, these phases have been increasingly
integrated into the convolutional network itself, while keeping the basic CNN model
intact. On the other hand, the 2016 winner of the ImageNet challenge [34] is again a
model composed of many separate components. Nonetheless, several computational
bottlenecks have disappeared. Over the past few years, the speed of object detection has
improved more dramatically than precision.
6.2 PRACTICE
To experiment with a convolutional method in practice, we created a working
MATLAB implementation of Faster R-CNN. We learned that the most challenging part
of implementing a deep learning system is collecting the training data and performing the
training itself. Training time can be further shortened by using a pretrained network. Even
if the final system does not feature the same object classes as the benchmark data, visual
problems are universal enough to benefit from detectors trained for a different problem.
The optimal bottom layers of a convolutional network are often similar regardless of the
problem, just as the human eye uses the same receptive fields for all visual tasks. Thus, it
makes sense to initialize the layers using a pretrained network.
Related to the implementation, we also learned that there are no easy “out-of-the-box”
solutions for effectively implementing convolutional networks, especially for remote
sensing images. The goal of this work is to show that DL-based object detection with
the Faster R-CNN method can detect military targets in a High-Resolution RS image.
To implement this method, a series of processing steps had to be applied to the created
dataset and to the chosen pre-trained CNN because of the small number of images in the
dataset. We demonstrated the feasibility of this type of work on a single CPU laptop.
6.3 RESULTS
Regarding precision, the results were promising. We reached state-of-the-art
accuracy, with a mean average precision of 90.12%. We showed how a system pretrained
on general image data can be used to detect objects in a specific task (military target
detection from HR RS images), thus demonstrating the adaptability of the methods when
combined with an appropriate dataset, image preprocessing, data augmentation and
training process. In many cases, Faster R-CNN detected more objects than the marked
targets. These detections were counted as false positives, although visual inspection
shows that they clearly belong to the right object class.
REFERENCES
[3] X. Chen, S. Xiang, C.-L. Liu and C. H. Pan, "Vehicle detection in satellite
images by hybrid deep convolutional neural networks," IEEE Geoscience and
Remote Sensing Letters, vol. 11, no. 10, pp. 1797-1801, 2014.
[4] Q. Jiang, L. Cao, M. Cheng, C. Wang and J. Li, "Deep neural networks-based
vehicle detection in satellite images," in International Symposium on
Bioelectronics and Bioinformatics, 2015.
[6] L. Zhang, Z. Shi and J. Wu, "A hierarchical oil tank detector with deep
surrounding features for high-resolution optical satellite imagery," IEEE
Journal of Selected Topics in Applied Earth Observations and Remote
Sensing, vol. 8, no. 10, pp. 4895-4909, 2015.
[7] A.-B. Salberg, "Detection of seals in remote sensing images using features
extracted from deep convolutional neural networks," in IEEE International
Geoscience and Remote Sensing Symposium (IGARSS), 2015.
[9] H. Zhu, X. Chen, W. Dai, K. Fu, Q. Ye and J. Jiao, "Orientation robust object
detection in aerial images using deep convolutional neural network," in IEEE
International Conference on Image Processing (ICIP), 2015.
IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 12, pp.
7405-7415, 2016.
[17] D. Steinkraus, I. Buck and P. Simard, "Using GPUs for machine learning
algorithms," in Proceedings Eighth International Conference on Document
Analysis and Recognition, 2005.
[18] R. Girshick, "Fast R-CNN," Proceedings of the IEEE International, pp. 1440-
1448, 2015.
[20] N. Sebe, "Machine learning in computer vision," Springer Science & Business
Media, vol. 29, 2005.
[23] R. Girshick, J. Donahue, T. Darrell and J. Malik, "Rich feature hierarchies for
accurate object detection and semantic segmentation," in Proceedings of the
IEEE conference on computer vision and, 2014.
[24] S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards Real-Time
Object Detection with Region Proposal Networks," IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, 01
06 2017.
[26] D. Marr and E. Hildreth, "Theory of edge detection," Proceedings, vol. 207,
no. 1167, pp. 187 - 217, 1980.
[27] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-
scale image recognition," arXiv preprint, 2014.
[30] C. L. Zitnick and P. Dollar, "Edge boxes: Locating object proposals from
edges," European Conference on Computer Vision, pp. 391-405, 2014.
[31] B. Yang, J. Yan, Z. Lei and S. Z. Li, "Craft objects from images," in
Proceedings of the IEEE Conference on Computer Vision and Pattern, 2016.
[33] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-
scale image recognition," in International Conference on Learning
Representations (ICLR), 2015.
[34] J. Redmon, S. Divvala, R. Girshick and A. Farhadi, "You Only Look Once:
Unified, Real-Time Object Detection," in Conference on Computer Vision and
Pattern Recognition, 2016.
[39] G. Cheng, J. Han and X. Lu, "Remote sensing image scene classification:
Benchmark and state of the art," in Proceedings of the IEEE.
APPENDIX
ALEXNET
MODIFIED ALEXNET
C_FASTER
N_FASTER
NC_FASTER
FASTER_AUG
C_FASTER_AUG
N_FASTER_AUG
NC_FASTER_AUG
PUBLISHED ARTICLES