
Security Classification ______________________

AIR FORCE ENGINEERING UNIVERSITY, XI’AN - CHINA

ACADEMIC DISSERTATION FOR


MASTER OF ENGINEERING
(For Postgraduate Overseas Students)

TITLE: RESEARCH ON IMINT MILITARY TARGET DETECTION USING DEEP LEARNING

Name: Cpt. Ghazi MARZOUK (Tunisia)
Tutor: Li Wei

Specialty: Artificial Intelligence

Serial N°:

Date: June 2018


ACKNOWLEDGEMENTS

I would like to thank my supervisor Li Wei for suggesting the topic and
for guidance.

I also would like to thank my wife Souha Chebab for supporting me.

I would like to thank all the staff of the Overseas Student Training in
the college for providing everything I needed.

I also would like to thank my friend Mohamed Chaouechi, who helped
me and guided me through life in Xi'an and China.

Xi’an, June 2018

Cpt. Ghazi MARZOUK (Tunisia)

ABSTRACT

Detecting different targets in a high-resolution remote sensing image is one
of the classical problems of computer vision and is often described as a difficult
task. This thesis presents the appropriate computer vision tasks using Deep
Learning technology under the constraint of small training data, with the use of a
pre-trained Convolutional Neural Network (CNN) (by preprocessing the image dataset
and developing the right training process). The Faster R-CNN method is used for the
object detection task. For practical reasons, this work detects one class (airplane),
but it can be expanded to include other classes (storage tanks, ships, armored
vehicles, bridges…). The dataset used for training is a combination of an existing
dataset and collected images of military aircraft. A Graphical User Interface is
also created to input real-world imagery in order to detect the targets.

The analysis of large volumes of IMINT data will be faster and human labor will be
reduced to a minimum. The software recognizes targets in large images collected by
satellites, reconnaissance UAVs, or aircraft on ISR missions, with a mean average
precision of 90.12% on the test image dataset.

Index Terms: Imagery Intelligence; Remote Sensing; ISR; Deep Learning; Object
Detection; Convolutional Neural Network; Faster RCNN; Computer
Vision.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
TABLE OF CONTENTS
TABLE OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
CHAPTER 1. INTRODUCTION
    1.1 PROBLEM STATEMENT
    1.2 EXISTING RESEARCH
    1.3 STRUCTURE OF THE THESIS
CHAPTER 2. BACKGROUND
    2.1 MACHINE LEARNING
    2.2 NEURAL NETWORKS
    2.3 COMPUTER VISION
    2.4 CONVOLUTIONAL NEURAL NETWORK
CHAPTER 3. CNN IN OBJECT DETECTION
    3.1 RCNN
    3.2 FAST RCNN
    3.3 REGION PROPOSAL GENERATION
    3.4 FASTER RCNN
    3.5 REAL-TIME CAPABLE CONVOLUTIONAL OBJECT DETECTION
    3.6 COMPARING THE METHODS
CHAPTER 4. CREATED DATA AND METHOD
    4.1 STARTING POINT
    4.2 TRAINING DATA
    4.3 ARCHITECTURE OF THE CREATED NETWORK AND DETECTION METHOD
CHAPTER 5. IMPLEMENTATION PROCEDURE AND EVALUATIONS
    5.1 ENVIRONMENT
    5.2 TRAINING PROCESS
    5.3 EVALUATION
    5.4 DISCUSSION OF THE RESULTS
    5.5 GUI SOFTWARE DESIGN
CHAPTER 6. CONCLUSION
    6.1 THEORY
    6.2 PRACTICE
    6.3 RESULTS
    6.4 THE FUTURE
REFERENCES
APPENDIX
PUBLISHED ARTICLES

TABLE OF FIGURES

Figure 2.1: Artificial Neuron
Figure 2.2: A Fully-connected multi-layer neural network
Figure 2.3: Detecting horizontal edges from an image using convolution filtering
Figure 2.4: An example of a CNN
Figure 2.5: Pooling function
Figure 2.6: Max Pooling with Stride
Figure 3.1: Stages of R-CNN forward computation
Figure 3.2: General structure of Fast R-CNN
Figure 3.3: Faster RCNN is a single, unified network for object detection
Figure 3.4: Region Proposal Network (RPN)
Figure 3.5: Intersection over Union
Figure 3.6: NMS
Figure 4.1: Thesis starting result (left); thesis final result (right)
Figure 4.2: One example image for each class of the UC Merced Land Use dataset
Figure 4.3: Example images from the dataset collected using Global Mapper
Figure 4.4: Airplane: daily nature image (left); remote sensing image (right)
Figure 4.5: Example of drawing the GT boxes
Figure 4.6: Modifications on the AlexNet
Figure 5.1: HR test image samples: Chengdu (left); AFB2 (right)
Figure 5.2: NMS performed on an extract of test image 5 with NC_Faster_Aug approach
Figure 5.3: Data augmentation enhancement
Figure 5.4: Detection result on test images 1, 6 and 7
Figure 5.5: GUI software framework
Figure 5.6: Software main window
Figure 5.7: Detection and showing the result on the GUI software
Figure 5.8: GUI software "About" window

LIST OF TABLES

Table 1: PASCAL VOC2007 test detection results
Table 2: Comparing still image object detection methods with VOC 2007
Table 3: Training processes
Table 4: Training time of the approaches (in minutes)
Table 5: Test dataset details
Table 6: Detection event table
Table 7: Evaluation results
Table 8: Detailed evaluation results (after performing NMS)
Table 9: Detailed evaluation time (in seconds)

LIST OF ABBREVIATIONS
AI Artificial Intelligence
CNN Convolutional Neural Network
CPU Central Processing Unit
DCNN Deep CNN
DL Deep Learning
FC Fully-Connected Layer
FCN Fully Convolutional Network
FPS Frames Per Second
GIS Geographic Information System
GPU Graphics Processing Unit
GT Ground Truth
GUI Graphical User Interface
HD High Definition
HOG Histogram of Oriented Gradients
HR High Resolution
ILSVRC ImageNet Large Scale Visual Recognition Challenge
IMINT IMagery INTelligence
ISR Intelligence Surveillance and Reconnaissance
NMS Non-Maximum Suppression
RAM Random Access Memory
RCNN CNN with Region proposals
RGB Red Green Blue channels
RoI Region of Interest
RPN Region Proposal Network
RS Remote Sensing
SAR Synthetic Aperture Radar
SGDM Stochastic Gradient Descent with Momentum
SIFT Scale Invariant Feature Transform
SSD Single Shot multibox Detector
SVM Support Vector Machines
USGS United States Geological Survey
VOC Visual Object Classes
XOR Exclusive OR
YOLO You Only Look Once

Chapter 1. INTRODUCTION

The growing quantity of airborne and satellite images (electro-optical, infrared,
multispectral, SAR, etc.) acquired for military intelligence, known as IMINT
(IMagery INTelligence), leads us to seek a solution for detecting and recognizing
targets such as tactical targets (tanks, armored vehicles, land troops…) and strategic
targets (aircraft, ships, runways, bridges, storage tanks, buildings…). Due to the large
scale (surface area) of the battlefield, the solution needs an automatic system to detect
and recognize targets (object detection) and then extract the categorized targets. This
system could enhance the intelligence capability of the Air Force in ISR (Intelligence
Surveillance and Reconnaissance) missions, but it can also be employed by land forces
or the navy. In order to make this system autonomous, Artificial Intelligence (AI) could
be the best choice to achieve that goal. Nowadays, Deep Learning (DL) is widely used in
many fields (medicine, social media, energy optimization, autonomous vehicle navigation,
and also military applications). Such a system can recognize different objects without
being told their features explicitly.

This work is a combination of computer vision tasks and DL technology. The
most common technique for this combination is the use of Convolutional Neural
Networks (CNN) and Deep CNNs (DCNN). The basic idea of the CNN was inspired by a
concept in biology called the receptive field [1]. Receptive fields act as detectors
that are sensitive to certain types of stimulus, for example, edges. This biological
function can be approximated in computers using the convolution operation. Our work
addresses object (target) detection in High Resolution (HR) Remote Sensing (RS) images.

1.1 PROBLEM STATEMENT


Objects contained in image files can be located and identified automatically. This
is called object detection. It is one of the classical problems of computer vision and is
often described as a difficult task. Target recognition in large high-resolution remote
sensing images, such as aircraft and vehicle detection, is challenging due to the
small size and large number of targets and the complex neighboring environments. In many
aspects, it is similar to other computer vision tasks because it involves creating a solution
that is invariant to deformation, rotation (especially for remote sensing detection) and
changes in lighting and resolution. What makes object detection a distinct problem is that it
involves both locating and classifying regions of an image.

As we will demonstrate, convolutional neural networks are currently the state-of-the-art
solution for object detection. The main task of this thesis is to review, test and
implement convolutional object detection methods to create software that detects airplanes
in high-resolution remote sensing images.

In the theoretical part, we review the relevant literature and study how
convolutional object detection methods have improved in the past few years. In the
experimental part, we study how a convolutional object detection system can be
implemented in practice and test how well a detection system trained on remote sensing
image data performs in aircraft detection.

1.2 EXISTING RESEARCH


Jin et al. [2] propose a vector-guided vehicle detection approach for IKONOS
satellite imagery using a morphological shared-weight neural network, which learns the
implicit vehicle model and incorporates both spatial and spectral characteristics, and
classifies pixels into vehicles and non-vehicles. To address the problem of large-scale
variance of objects, Chen et al. [3] propose a hybrid deep CNN model for vehicle
detection in satellite images, which divides all feature maps of the last convolutional and
max-pooling layer of CNN into multiple blocks of variable receptive field size or pooling
size, to extract multi-scale features. Jiang et al. [4] propose a CNN-based vehicle
detection approach, where a graph-based super-pixel segmentation is used to extract
image patches and a CNN model is trained to predict whether a patch contains a vehicle.

A few detection methods transfer the pre-trained CNNs for object detection. Zhou
et al. [5] propose a weakly supervised learning framework to train an object detector,
where a pre-trained CNN model is transferred to extract high-level features of objects and
the negative bootstrapping scheme is incorporated into the detector training process to
provide faster convergence of the detector. Zhang et al. [6] propose a hierarchical oil tank
detector, which combines deep surrounding features, which are extracted from the pre-
trained CNN model with local features (Histogram of Oriented Gradients). The candidate
regions are selected by an ellipse and line segment detector. Salberg [7] proposes to
extract features from the pre-trained AlexNet model and applies the deep CNN features
for automatic detection of seals in aerial images. Ševo et al. [8] propose a two-stage
approach for CNN training and develop an automatic object detection method based on a
pre-trained CNN, where the GoogLeNet is first fine-tuned twice on UC-Merced dataset,
using different fine-tuning options, and then the fine-tuned model is utilized for sliding-
window object detection. To address the problem of orientation variations of objects, Zhu
et al. [9] employ the pre-trained CNN features that are extracted from combined layers
and implement orientation-robust object detection in a coarse localization framework.

For enhancing the performance of generic object detection, Cheng et al. [10]
propose an effective approach to learn a rotation-invariant CNN (RICNN) to improve
invariance to object rotation. In their paper, they add a new rotation-invariant layer to the
off-the-shelf AlexNet model. The RICNN is learned by optimizing a new objective function
with an additional regularization constraint that enforces the training samples
before and after rotation to share similar features, guaranteeing the rotation-invariant
ability of the RICNN model.

1.3 STRUCTURE OF THE THESIS


The thesis begins with two theoretical chapters. Since convolutional object
detection is a combination of several fields of computer science, we need to discuss
several theoretical topics that seem disparate at first. In Chapter 2, we begin with a short
introduction to machine learning and neural networks. Next, we discuss computer vision
and object detection as its subfield. We end the chapter by introducing convolutional
neural networks as a combination of machine learning and computer vision. In Chapter
3, we discuss how convolutional networks can be used for object detection and review
the relevant literature and methods.

In Chapter 4, we move to the experimental part. We discuss the experimental
setup we used for testing a convolutional network, covering not only the
details of the experiments but also the details of the datasets. In Chapter 5, we discuss
how we evaluate the results by presenting the created training approaches and
processes. We also discuss the practical implementation of the experiments, including
the required software and hardware. Then, we evaluate the results, not only by providing
the numerical results, but also some analysis of them. Finally, we will present the GUI
software design.

In Chapter 6, we provide a review of the thesis and some concluding remarks.

Chapter 2. BACKGROUND

In this chapter, we provide the theoretical background necessary for understanding the
methods discussed in the next chapter. First, we discuss relevant details of machine
learning, neural networks, and computer vision. Finally, we explain how these disciplines
are combined in convolutional neural networks.

2.1 MACHINE LEARNING


Learning algorithms are widely used in computer vision applications. Before
considering image-related tasks, we take a brief look at the basics of machine
learning.

Machine learning has emerged as a useful tool for modelling problems that are
otherwise difficult to formulate exactly. Classical computer programs are explicitly
programmed by hand to perform a task. With machine learning, some portion of the
human contribution is replaced by a learning algorithm [2]. As availability of
computational capacity and data has increased, machine learning has become more and
more practical over the years, to the point of being almost ubiquitous.

2.1.1 Types
A typical way of using machine learning is supervised learning [3]. A learning
algorithm is shown multiple examples that have been annotated or labelled by humans.
For example, in the object detection problem we use training images where humans have
marked the locations and classes of relevant objects. After learning from the examples,
the algorithm is able to predict the annotations or labels of previously unseen data.
Classification and regression are the most important task types [3]. In classification, the
algorithm attempts to predict the correct class of a new piece of data based on the training
data. In regression, instead of discrete classes, the algorithm tries to predict a continuous
output.

In unsupervised learning, the algorithm attempts to learn useful properties of the
data without a human teacher telling it what the correct output should be. A classical example
of unsupervised learning is clustering [3]. More recently, especially with the advent of
deep learning technologies, unsupervised preprocessing has become a popular tool in
supervised learning tasks for discovering useful representations of the data [4].

2.1.2 Features
Some kind of preprocessing is almost always needed. Preprocessing the data into a
new, simpler variable space is called feature extraction [3]. Often, it is impractical or
impossible to use the full-dimensional training data directly. Rather, detectors are
programmed to extract interesting features from the data, and these features are used as
input to the machine learning algorithm.

In the past, feature detectors were often hand-crafted. The problem with this
approach is that we do not always know in advance which features are interesting. The
trend in machine learning has been towards learning the feature detectors as well, which
enables using the complete data [2].

2.1.3 Generalization
Since the training data cannot include every possible instance of the inputs, the
learning algorithm must be able to generalize in order to handle unseen data points [3].
Too simple a model can fail to capture important aspects of the true model. On the
other hand, too complex a model can overfit by modelling unimportant details and noise,
which also leads to bad generalization [3]. Typically, overfitting happens when a complex
method is used in conjunction with too little training data. An overfitted model learns to
model the known examples but does not understand what connects them.

The performance of the algorithm can be evaluated from the quality and quantity
of errors. A loss function, such as mean squared error, is used to assign a cost to the errors
[3]. The objective in the training phase is to minimize this loss.

2.2 NEURAL NETWORKS


Neural networks are a popular type of machine learning model. A special case of a
neural network called the Convolutional Neural Network (CNN) is the primary focus of
this thesis. Before discussing CNNs, we will discuss how regular neural networks work.


2.2.1 Origins
Neural networks were originally called artificial neural networks because they were
developed to mimic the neural function of the human brain. Pioneering research includes
the threshold logic unit by Warren McCulloch and Walter Pitts in 1943 and the perceptron
by Frank Rosenblatt in 1957.

Even though the inspiration from biology is apparent, it would be misleading to
overemphasize the connection between artificial neurons and biological neurons or
neuroscience. The human brain contains approximately 100 billion neurons operating in
parallel [5]. Artificial neurons are mathematical functions implemented on more-or-less
serial computers. Research into neural networks is mostly guided by developments in
engineering and mathematics rather than biology [2].

Figure 2.1: Artificial Neuron

An artificial neuron based on the McCulloch-Pitts model is shown in Figure 2.1 [3].
The neuron receives input parameters $x_1, \dots, x_n$. The neuron also has weight
parameters $w_1, \dots, w_n$. The weight parameters often include a bias term that has a
matching dummy input with a fixed value of 1. The inputs and weights are linearly combined
and summed. The sum is then fed to an activation function $\varphi$ that produces the
output of the neuron:

$$y = \varphi(s) = \varphi\left(\sum_{i=1}^{n} w_i x_i\right).$$

The neuron is trained by carefully selecting the weights to produce a desired
output for each input.
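As a concrete illustration, here is a minimal NumPy sketch of such a neuron; the sigmoid activation and the specific input and weight values are illustrative assumptions, not values from the thesis.

import numpy as np

def neuron(x, w, b, activation):
    # Linearly combine inputs and weights, add the bias, apply the activation.
    s = np.dot(w, x) + b
    return activation(s)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

x = np.array([0.5, -1.2, 3.0])   # example inputs
w = np.array([0.4, 0.6, -0.1])   # example weights (bias handled separately)
print(neuron(x, w, b=1.0, activation=sigmoid))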


2.2.2 Multi-layer networks

Figure 2.2: A Fully-connected multi-layer neural network

A neural network is a combination of artificial neurons. The neurons are typically
grouped into layers. In a Fully-connected feed-forward multi-layer network, shown in
Figure 2.2, each output of a layer of neurons is fed as input to each neuron of the next
layer. Thus, some layers process the original input data, while some process data received
from other neurons. Each neuron has a number of weights equal to the number of neurons
in the previous layer [3].

A multi-layer network typically includes three types of layers: an input layer, one
or more hidden layers and an output layer [3]. The input layer usually passes data along
without modifying it. Most of the computation happens in the hidden layers. The output
layer converts the hidden layer activations to an output, such as a classification.

In this thesis, we will mostly discuss Fully-connected networks and convolutional
networks (see section 2.4 below). Convolutional networks utilize parameter sharing and
have limited connections compared to Fully-connected networks [2].

2.2.3 Back-propagation
A neural network is trained by selecting the weights of all neurons so that the
network learns to approximate target outputs from known inputs. It is difficult to solve
the neuron weights of a multi-layer network analytically. The back-propagation
algorithm [2] provides a simple and effective solution to solve the weights iteratively.
The classical version uses gradient descent as optimization method. Gradient descent can
be quite time-consuming and is not guaranteed to find the global minimum of error, but
with proper configuration (known in machine learning as hyperparameters) works well
enough in practice [2] [3].


In the first phase of the algorithm, an input vector is propagated forward through
the neural network. Before this, the weights of the network neurons have been initialized
to some values, for example small random values. The received output of the network is
compared to the desired output (which should be known for the training examples) using
a loss function. The gradient of the loss function is then computed. This gradient is also
called the error value. When using mean squared error as the loss function, the output
layer error value is simply the difference between the current and desired output.

The error values are then propagated back through the network to calculate the error
values of the hidden layer neurons. The hidden neuron loss function gradients can be
solved using the chain rule of derivatives. Finally, the neuron weights are updated by
calculating the gradient of the weights and subtracting a proportion of the gradient from
the weights. This ratio is called the learning rate [3]. The learning rate can be fixed or
dynamic. After the weights have been updated, the algorithm continues by executing the
phases again with different input until the weights converge.

In the above description, we described online learning, which calculates the
weight updates after each new input [2]. Online learning can lead to "zig-zagging"
behavior, where the single data point estimate of the gradient keeps changing direction
and does not approach the minimum directly. Another way of computing the updates is
full batch learning, where we compute the weight updates for the complete dataset [2].
This is quite computationally heavy and has other drawbacks. A compromise version is
mini-batch learning, where we use only some portion of the training set for each update
[6]. A mathematical description of the algorithm is available in [7].
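To make the update rule concrete, the sketch below trains a single linear neuron with mini-batch gradient descent on a mean-squared-error loss; the toy dataset, learning rate, and batch size are arbitrary illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))                 # toy inputs
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3      # targets from a known linear model

w, b = np.zeros(3), 0.0
lr, batch = 0.1, 32                           # learning rate and mini-batch size

for epoch in range(100):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch):
        i = order[start:start + batch]
        err = X[i] @ w + b - y[i]             # gradient of 0.5 * MSE w.r.t. output
        w -= lr * X[i].T @ err / len(i)       # subtract a proportion of the gradient
        b -= lr * err.mean()

print(w, b)  # converges towards [2, -1, 0.5] and 0.3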

2.2.4 Activation function types


The activation function determines the final output of each neuron. It is important
to select the function properly in order to create an effective network.

Early researchers found that the perceptron and other linear systems had severe
drawbacks, being unable to solve problems that were not linearly separable, such as the
XOR problem. Sometimes, linear systems can solve these kinds of problems using hand-crafted
feature detectors, but this is not the most advantageous use of machine learning.
Simply adding layers does not help either, because a network composed of linear neurons
remains linear no matter how many layers it has [2].


A light-weight and effective way of creating a non-linear network is using rectified
linear units (ReLU) [2]. A rectified linear function generates the output using a ramp
function: $f(x) = \max(0, x)$.

This type of function is easy to compute and differentiate (for backpropagation).
The function is not differentiable at zero, but this has not prevented its use in practice.
ReLU units have become quite popular lately, often replacing sigmoidal activation functions,
which have smooth derivatives but suffer from gradient saturation problems and slower
computation.

For multi-class classification problems, the Softmax activation function [3] is used
in the output layer of the network:

$$\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \qquad j = 1, \dots, K.$$

The Softmax function takes a vector of K arbitrarily large values and outputs a
vector of K values that range between [0...1] and sum to 1. The values output by the
Softmax unit can be utilized as class probabilities.
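Both activation functions are straightforward to implement; in the softmax sketch below, subtracting the maximum before exponentiation is a standard numerical-stability trick that does not change the result.

import numpy as np

def relu(x):
    # Ramp function: max(0, x), applied element-wise.
    return np.maximum(0.0, x)

def softmax(z):
    # Map K arbitrary scores to K probabilities that sum to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, -3.0])
print(relu(scores))     # [2. 1. 0.]
print(softmax(scores))  # class probabilities summing to 1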

2.2.5 Deep Learning


Modern neural networks are often called deep neural networks. Even though multi-
layer neural networks have existed since the 1980s, several reasons prevented the
effective training of networks with multiple hidden layers [2].

One of the main problems is the curse of dimensionality [2]. As the number of
variables increases, the number of different configurations of the variables grows
exponentially. As the number of configurations increases, the number of training samples
should increase in equal measure. Collecting a training dataset of sufficient size is time-
consuming and costly or outright impossible.

In the past ten years, neural networks have had a renaissance, mainly because of the
availability of more powerful computers and larger datasets. In the early 2000s, it was
discovered that neural networks could be trained efficiently using graphics processing
units. GPUs are more efficient for the task than traditional CPUs and provide a relatively
cheap alternative to specialist hardware [8]. Today, researchers typically use high-end
graphics cards, such as the NVIDIA Tesla K40 [9].


Other more theoretical breakthroughs include replacing mean-squared error
functions with cross-entropy based functions and replacing sigmoidal activation functions
with rectified linear units [2].

With deep learning, there is less need for hand-tuned machine learning solutions
that were used previously [2]. A classical pattern detection system, for example, includes
a hand-tuned feature detection phase before a machine learning phase. The deep learning
equivalent consists of a single neural network. The lower layers of the neural network
learn to recognize the basic features, which are then fed forward to higher layers of the
network.

2.3 COMPUTER VISION


Next, we are going to discuss computer vision in general and explore the primary
subject of this thesis, object detection, as a subproblem of computer vision.

2.3.1 Overview
Computer vision deals with the extraction of meaningful information from the
contents of digital images or video. This is distinct from mere image processing, which
involves manipulating visual information on the pixel level. Applications of computer
vision include image classification, visual detection, 3D scene reconstruction from 2D
images, image retrieval, augmented reality, machine vision and traffic automation [10].

Today, machine learning is a necessary component of many computer vision
algorithms [11]. Such algorithms can be described as a combination of image processing
and machine learning. Effective solutions require algorithms that can cope with the vast
amount of information contained in visual images, and critically for many applications,
can carry out the computation in real time.

2.3.2 Object detection


Object detection is one of the classical problems of computer vision and is often
described as a difficult task. It is similar to other computer vision tasks because it involves
creating a solution that is invariant to deformation and changes in lighting and viewpoint.
What makes object detection a distinct problem is that it involves both locating and
classifying regions of an image [9]. The locating part is not needed in whole image
classification.


To detect an object, we need to have some idea where the object might be and how
the image is segmented. This creates a type of chicken-and-egg problem, where, to
recognize the shape (and class) of an object, we need to know its location, and to
recognize the location of an object, we need to know its shape [12]. Some visually
dissimilar features, such as the clothes and face of a human being, may be parts of the
same object, but it is difficult to know this without recognizing the object first. On the
other hand, some objects stand out only slightly from the background, requiring
separation before recognition [13].

Low-level visual features of an image, such as a saliency map, may be used as a
guide for locating candidate objects [12]. The location and size are typically defined using
a bounding box, which is stored in the form of corner coordinates. Using a rectangle is
simpler than using an arbitrarily shaped polygon, and many operations, such as
convolution, are performed on rectangles in any case. The sub-image contained in the
bounding box is then classified by an algorithm that has been trained using machine
learning [14]. The boundaries of the object can be further refined iteratively, after making
an initial guess [10].

During the 2000s, popular solutions for object detection utilized feature descriptors,
such as Scale-Invariant Feature Transform (SIFT) developed by David Lowe in 1999
and Histogram of Oriented Gradients (HOG) popularized in 2005. In the 2010s, there has
been a shift towards utilizing convolutional neural networks [14] [9] [15].

Before the widescale adoption of CNNs, there were two competing solutions for
generating bounding boxes. In the first solution, a dense set of region proposals is
generated and then most of these are rejected [16]. This typically involves a sliding
window detector. In the second solution, a sparse set of bounding boxes is generated using
a region proposal method, such as Selective Search [13]. Combining sparse region
proposals with convolutional neural networks has provided good results and is currently
popular [9].

2.4 CONVOLUTIONAL NEURAL NETWORK


Next, we are going to discuss why and how convolutional neural networks (CNN)
are used and describe their history.


2.4.1 Justification
The problem with solving computer vision problems using traditional neural
networks is that even a modestly sized image contains an enormous amount of
information (see section 2.2.5).

For example, a monochrome 620 × 480 image contains 297,600 pixels. If each
pixel intensity of this image is input separately to a Fully-connected network, each neuron
requires 297,600 weights. A 1920 × 1080 full HD image would require 2,073,600
weights per neuron. If the images are polychrome, the number of weights is multiplied by
the number of color channels (typically three). Thus, the overall number of free
parameters in the network quickly becomes extremely large as the image size increases.
Overly large models cause overfitting and slow performance [3].

Furthermore, many pattern detection tasks require that the solution is translation
invariant. It is inefficient to train neurons to separately recognize the same pattern in the
left-top corner and in the right-bottom corner of an image. A Fully-connected neural
network fails to take this kind of structure into account.
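The weight counts quoted above follow from a quick back-of-the-envelope computation (assuming, as in the text, one weight per input pixel for every neuron of a hypothetical Fully-connected first layer):

mono = 620 * 480            # 297,600 weights per neuron (monochrome)
full_hd = 1920 * 1080       # 2,073,600 weights per neuron (full HD)
rgb_full_hd = 3 * full_hd   # 6,220,800 with three color channels
print(mono, full_hd, rgb_full_hd)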

2.4.2 Basic structure


The basic idea of the CNN was inspired by a concept in biology called the receptive
field. Receptive fields are a feature of the animal visual cortex [1]. They act as detectors
that are sensitive to certain types of stimulus, for example, edges. They are found across
the visual field and overlap each other.

This biological function can be approximated in computers using the convolution
operation [17]. In image processing, images can be filtered using convolution to produce
different visible effects. Figure 2.3 shows how a hand-selected convolutional filter detects
horizontal edges from an image, functioning similarly to a receptive field.

Figure 2.3: Detecting horizontal edges from an image using convolution filtering


The discrete convolution operation between an image $f$ and a filter matrix $g$ is
defined as:

$$h[m, n] = (f * g)[m, n] = \sum_{i}\sum_{j} f[i, j]\, g[m - i,\, n - j].$$

In effect, the dot product of the filter $g$ and a sub-image of $f$ (with the same
dimensions as $g$) centred on coordinates $m, n$ produces the pixel value of $h$ at
coordinates $m, n$ [2]. The size of the receptive field is adjusted by the size of the
filter matrix. Aligning the filter successively with every sub-image of $f$ produces the
output pixel matrix $h$. In the case of neural networks, the output matrix is also called
a feature map [2] (or an activation map after computing the activation function). Edges
need to be treated as a special case. If image $f$ is not padded, the output size decreases
slightly with every convolution [2].
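The sketch below applies this operation to a single-channel image. As in most CNN frameworks, the filter is not flipped, so strictly speaking it computes cross-correlation rather than convolution; for learned filters the distinction is immaterial. The edge-detecting filter is an illustrative choice, not the one from Figure 2.3.

import numpy as np

def conv2d(f, g):
    # 'Valid' convolution of image f with filter g (no padding): the output
    # shrinks by the filter size minus one, as noted above.
    fh, fw = f.shape
    gh, gw = g.shape
    h = np.empty((fh - gh + 1, fw - gw + 1))
    for m in range(h.shape[0]):
        for n in range(h.shape[1]):
            # Dot product of the filter and the sub-image aligned at (m, n).
            h[m, n] = np.sum(f[m:m + gh, n:n + gw] * g)
    return h

image = np.arange(25, dtype=float).reshape(5, 5)
edge_filter = np.array([[1.0, 1.0],
                        [-1.0, -1.0]])  # responds to horizontal edges
print(conv2d(image, edge_filter))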

A set of convolutional filters can be combined to form a convolutional layer of a
neural network [1]. The matrix values of the filters are treated as neuron parameters and
trained using machine learning. The convolution operation replaces the multiplication
operation of a regular neural network layer. Output of the layer is usually described as a
volume. The height and width of the volume depend on the dimensions of the activation
map. The depth of the volume depends on the number of filters.

Since the same filters are used for all parts of the image, the number of free
parameters is reduced drastically compared to a Fully-connected neural layer [2]. The
neurons of the convolutional layer mostly share the same parameters and are only
connected to a local region of the input. Parameter sharing resulting from convolution
ensures translation invariance. An alternative way of describing the convolutional layer
is to imagine a Fully-connected layer with an infinitely strong prior placed on its weights
[2]. This prior forces the neurons to share weights at different spatial locations and to have
zero weight outside the receptive field.

Successive convolutional layers (often combined with other types of layers, such as
pooling described below) form a convolutional neural network (CNN). An example of a
convolutional network is shown in Figure 2.4. The backpropagation training algorithm,
described in subsection 2.2.3, is also applicable to convolutional networks [2]. In theory,
the layers closer to the input should learn to recognize low-level features of the image,
such as edges and corners, and the layers closer to the output should learn to combine
these features to recognize more meaningful shapes [1]. In this thesis, we are interested
in studying whether convolutional networks can learn to recognize complete objects.

Figure 2.4: An example of a CNN

2.4.3 Pooling and stride


To make the network more manageable for classification, it is useful to decrease
the activation map size in the deep end of the network. Generally, the deep layers of the
network require less information about exact spatial locations of features but require more
filter matrices to recognize multiple high-level patterns [2]. By reducing the height and
width of the data volume, we can increase the depth of the data volume and keep the
computation time at a reasonable level.

Figure 2.5: Pooling function

There are two ways of reducing the data volume size. One way is to include a
pooling layer after a convolutional layer [2]. The layer effectively down-samples the
activation maps as shown in Figure 2.5. Pooling has the added effect of making the
resulting network more translation invariant by forcing the detectors to be less precise.
However, pooling can destroy information about spatial relationships between subparts
of patterns. A typical pooling method is max-pooling (Figure 2.6), which simply
outputs the maximum value within a rectangular neighborhood of the activation map [2].


Figure 2.6: Max Pooling with Stride

Another way of reducing the data volume size is adjusting the stride parameter of
the convolution operation. The stride parameter controls whether the convolution output
is calculated for a neighborhood centered on every pixel of the input image (stride 1) or
for every $n$th pixel (stride $n$) [2]. Research has shown that pooling layers can often be
discarded without loss in accuracy by using convolutional layers with a larger stride value
[2]. The stride operation is equivalent to using a fixed grid for pooling.
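A minimal sketch of max-pooling with a stride parameter, matching the description above (the 2x2 window and stride of 2 are the typical illustrative choice):

import numpy as np

def max_pool(a, size=2, stride=2):
    # Output the maximum of each size x size window, moving by `stride`.
    h = (a.shape[0] - size) // stride + 1
    w = (a.shape[1] - size) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            r, c = i * stride, j * stride
            out[i, j] = a[r:r + size, c:c + size].max()
    return out

a = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)
print(max_pool(a))  # [[6. 8.]
                    #  [3. 4.]]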

2.4.4 Additional layers


The convolutional layer typically includes a non-linear activation function, such as
a rectified linear activation function (see subsection 2.2.4). Activations are sometimes
described as a separate layer between the convolutional layer and the pooling layer.

Some systems, such as [18], also implement a layer called local response
normalization, which is used as a regularization technique. Local response normalization
mimics a function of biological neurons called lateral inhibition, which causes excited
neurons to decrease the activity of neighboring neurons. However, other regularization
techniques are currently more popular, and these are discussed in the next section.

The final hidden layers of a CNN are typically Fully-connected layers [3]. A Fully-connected
layer can capture some interesting relationships that parameter-sharing
convolutional layers cannot. However, a Fully-connected layer requires a sufficiently
small data volume size in order to be practical. Pooling and stride settings can be used to
reduce the size of the data volume that reaches the Fully-connected layers. A
convolutional network that does not include any Fully-connected layers is called a fully
convolutional network (FCN) [15].

If the network is used for classification, it usually includes a Softmax output layer
[3] (see section 2.2.4). The activations of the topmost layers can also be used directly to
generate a feature representation of an image. This means that the convolutional network
is used as a large feature detector [2].

2.4.5 Regularization and data augmentation


Regularization refers to methods that are used to reduce overfitting by introducing
additional constraints or information to the machine learning system [2]. A classical way
of using regularization in neural networks is adding a penalty term to the objective/loss
function that penalizes certain types of weights. The parameter sharing feature of
convolutional networks is another example of regularization.

There are several regularization techniques that are specific to deep neural networks.
A popular technique called dropout [19] attempts to reduce the co-adaptation of neurons.
This is achieved by randomly dropping out neurons during training, meaning that a
slightly different neural network is used for each training sample or minibatch. This
causes the system not to depend too much on any single neuron or connection and
provides an effective yet computationally inexpensive way of implementing
regularization [2]. In convolutional networks, Dropout is typically used in the final Fully-
connected layers [18].
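A sketch of the widely used "inverted dropout" variant: units are zeroed with probability p during training, and the survivors are rescaled so that the expected activation is unchanged and no adjustment is needed at test time (the drop probability is illustrative).

import numpy as np

def dropout(activations, p=0.5, rng=np.random.default_rng()):
    # Zero each unit with probability p, scale survivors by 1 / (1 - p).
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

h = np.ones(8)
print(dropout(h))  # roughly half the units zeroed, the rest scaled to 2.0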

Overfitting can also be reduced by increasing the amount of training data. When it
is not possible to acquire more actual samples, data augmentation is used to generate more
samples from the existing data [2]. For classification using convolutional networks, this
can be achieved by computing transformations of the input images that do not alter the
perceived object classes yet provide additional challenge to the system. The images can
be, for example, flipped, rotated or subsampled with different crops and scales. Also,
noise can be added to the input images [2].
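A sketch of such label-preserving transformations for a single image patch; the particular mix of horizontal flips, 90-degree rotations, and additive Gaussian noise is an illustrative assumption (rotations are especially relevant to remote sensing imagery, where targets appear at arbitrary orientations):

import numpy as np

rng = np.random.default_rng(1)

def augment(image):
    # Return a randomly flipped, rotated, and noised copy of the image.
    if rng.random() < 0.5:
        image = np.fliplr(image)                    # horizontal flip
    image = np.rot90(image, k=rng.integers(4))      # rotate by 0/90/180/270 deg
    return image + rng.normal(0.0, 0.01, image.shape)  # small additive noise

patch = rng.random((64, 64))
print(augment(patch).shape)  # (64, 64): the class label is unchanged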

2.4.6 Development
Convolutional neural networks were one of the first successful deep neural
networks. The Neocognitron, developed by Fukushima in the 1980s, provided a neural
network model for translation-invariant object recognition, inspired by biology [1]. Le
Cun et al. combined this method with a learning algorithm, i.e. back-propagation. These
early solutions were mostly used for handwritten character recognition.

After providing some promising results, the neural network methods faded in
prominence and were mostly replaced by support vector machines [14]. Then, in 2012,
Krizhevsky et al. [20] achieved excellent results on the ImageNet Large Scale Visual
Recognition Challenge (ILSVRC) dataset by combining Le Cun’s method with recent
fine-tuning methods for deep learning. These results popularized CNNs and led to the
development of new powerful object detection methods described in Chapter 3 [14].

For the 2014 ImageNet challenge, Simonyan and Zisserman [18] explored the effect
of increasing the depth of a convolutional network on localization and classification
accuracy. The team achieved results that improved the then state-of-the-art by using
convolutional networks 16 layers deep. The 16-layer architecture includes 13
convolutional layers (with 3x3 filters), 5 pooling layers (2x2 neighborhood max-pooling)
and 3 Fully-connected layers. All hidden layers use rectified (ReLu) activations. The
Fully-connected layers reduce 4096 channels down to 1000 Softmax outputs and are
regularized using dropout. This form of network is referred to as VGG-16.
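The layout described above can be summarized compactly as follows; this configuration-list notation (numbers are output channel counts, 'M' marks a 2x2 max-pooling layer) is an illustrative sketch, not code from the original publication.

# 13 convolutional layers (3x3 filters) in five blocks, each block followed by
# 2x2 max-pooling ('M'), then three Fully-connected layers into the Softmax.
VGG16_CONV = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
              512, 512, 512, 'M', 512, 512, 512, 'M']
VGG16_FC = [4096, 4096, 1000]  # the last layer feeds the 1000-way Softmax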

Chapter 3. CNN IN OBJECT DETECTION

In this chapter, we discuss and compare different object detection methods that utilize
convolutional neural networks. In particular, we are going to look at methods that
combine CNNs with region proposal classification. We further discuss how the region
proposals, also called Regions of Interest (RoI), are generated.

3.1 RCNN
In 2012, Krizhevsky et al. [20] achieved promising results with CNNs for the
general image classification task, as mentioned in subsection 2.4.6. In 2013, Girshick et
al. published a method [14] generalizing these results to object detection. This method is
called R-CNN (“CNN with region proposals”).

3.1.1 General description


R-CNN forward computation has several stages, shown in Figure 3.1. First, the
Regions of Interest are generated. The RoIs are category-independent bounding boxes
that have a high likelihood of containing an interesting object. In the paper, a separate
method called Selective Search [13], is used for generating these, but other region
generation methods can be used instead. Selective Search, along with other region
proposal generation techniques, is discussed in further detail in section 3.3.

Next, a convolutional network is used to extract features from each region proposal.
The sub-image contained in the bounding-box is warped to match the input size of the
CNN and then fed to the network. After the network has extracted features from the input,
the features are input to Support Vector Machines (SVM) that provide the final
classification.


Figure 3.1: Stages of R-CNN forward computation

The method is trained in multiple stages, beginning with the convolutional network
[9]. After the CNN has been trained, the SVMs are fitted to the CNN features. Finally,
the region proposal generating method is trained.

3.1.2 Drawbacks
R-CNN is an important method because it provided the first practical solution for
object detection using CNNs. Being the first, it has many drawbacks that have been
improved upon by later methods.

In his 2015 paper on Fast R-CNN [9], Girshick lists three main problems of R-CNN:

First, training consists of multiple stages, as described above.

Second, training is expensive. For both SVM and region proposal training, features
are extracted from each region proposal and stored on disk. This requires days of
computation and hundreds of gigabytes of storage space.

Third, and perhaps most important, object detection is slow, requiring almost a
minute for each image, even on a GPU. This is because the CNN forward computation is
performed separately for every object proposal, even if the proposals originate from the
same image or overlap each other.

3.2 FAST RCNN


Fast R-CNN [9], published in 2015 by Girshick, provides a more practical method
for object recognition. The main idea is to perform the forward pass of the CNN for the
entire image, instead of performing it separately for each RoI.


3.2.1 General description


The general structure of Fast R-CNN is illustrated in Figure 3.2. The method
receives as input an image plus regions of interest computed from the image. As in R-
CNN, the RoIs are generated using an external method. The image is processed using a
CNN that includes several convolutional and max pooling layers.

Figure 3.2: General structure of Fast R-CNN

The convolutional feature map that is generated after these layers is input to a RoI
pooling layer. This extracts a fixed-length feature vector for each RoI from the feature
map. The feature vectors are then input to Fully-connected layers that are connected to
two output layers: a Softmax layer that produces probability estimates for the object
classes and a real-valued layer that outputs bounding box co-ordinates computed using
regression (meaning refinements to the initial candidate boxes).
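A minimal single-channel sketch of the RoI pooling idea, assuming the RoI is already expressed in feature-map coordinates; real implementations handle many channels and sub-pixel rounding details that are omitted here.

import numpy as np

def roi_pool(feature_map, roi, out_size=2):
    # Divide the RoI window into an out_size x out_size grid and take the max
    # of each cell, producing a fixed-size output whatever the RoI's shape.
    x1, y1, x2, y2 = roi
    window = feature_map[y1:y2, x1:x2]
    rows = np.linspace(0, window.shape[0], out_size + 1).astype(int)
    cols = np.linspace(0, window.shape[1], out_size + 1).astype(int)
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = window[rows[i]:rows[i + 1], cols[j]:cols[j + 1]].max()
    return out

fm = np.arange(64, dtype=float).reshape(8, 8)  # toy feature map
print(roi_pool(fm, roi=(1, 2, 6, 7)))          # always a 2x2 output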

3.2.2 Classification performance


According to the authors, Fast R-CNN provides significantly shorter classification
time compared to regular R-CNN, taking less than a second on a state-of-the-art GPU [9].
This is mainly due to using the same feature map for each RoI.

As the detection time decreases, the overall computation time begins to depend
significantly on the performance of the region proposal generation method. The RoI
generation can thus form a computational bottleneck [15]. Additionally, when there are
many RoIs, the time spent on evaluating the Fully-connected layers can dominate the
evaluation time of the convolutional layers. Classification time can be accelerated by
approximately 30% if the Fully-connected layers are compressed using truncated singular
value decomposition [9]. This results in a slight decrease in precision.
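A sketch of this compression with NumPy's SVD: the weight matrix of a Fully-connected layer is replaced by two thinner factors, so one large matrix multiply becomes two cheaper ones at the cost of a small approximation error (the matrix sizes and retained rank are illustrative).

import numpy as np

def compress_fc(W, k):
    # Truncated SVD: keep the k largest singular values, so W ~= A @ B.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * s[:k]   # shape (out, k)
    B = Vt[:k, :]          # shape (k, in)
    return A, B

rng = np.random.default_rng(2)
W = rng.normal(size=(1024, 1024))
A, B = compress_fc(W, k=128)
x = rng.normal(size=1024)
y_full, y_fast = W @ x, A @ (B @ x)  # two thin multiplies instead of one big one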


3.2.3 Training
According to the original publication, Fast R-CNN is more efficient to train than R-CNN,
with a nine-fold reduction in training time. The entire network (including the RoI
pooling layer and the Fully-connected layers) can be trained using the back-propagation
algorithm and stochastic gradient descent. Typically, a pre-trained network is used as a
starting point and then fine-tuned. Training is done in mini-batches of N images, and R/N
RoIs are sampled from each mini-batch image. A RoI sample is assigned to a class if its
intersection over union with a ground-truth box is over 0.5; other RoIs belong to the
background class.
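The intersection-over-union criterion used above can be computed as follows for axis-aligned boxes stored as corner coordinates (the (x1, y1, x2, y2) convention is an illustrative assumption):

def iou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2); returns intersection area / union area.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ~= 0.14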

As in classification, RoIs from the same image share computation and memory
usage. For data augmentation, the original image is flipped horizontally with probability
0.5. The Softmax classifier and the bounding box regressors are fine-tuned together using
a multi-task loss function, which considers both the true class of the sampled RoI and the
offset of the sampled bounding box from the true bounding box.

3.3 REGION PROPOSAL GENERATION


To use R-CNN and Fast R-CNN, we need a method for generating the class-agnostic
regions of interest. Next, we are going to discuss the general principles of RoI
generation.

The aim of region proposal generation in object detection is to maximize recall, i.e.
to generate enough regions so that all true objects are recovered [21]. The generator is
less concerned with precision, since it is the task of the object detector to identify correct
regions from the output of the region proposal generator.

However, the amount of proposals generated affects performance. As mentioned in
subsection 2.3.2, there are two main approaches to region generation: dense set generation
and sparse set generation.

Dense set solutions attempt to generate by brute force an exhaustive set of bounding
boxes that includes every potential object location [13]. This can be achieved by sliding
a detection window across the image. However, searching through every location of the
image is computationally costly and requires a fast object detector. Additionally, different
window shapes and sizes need to be considered. Thus, most sliding window methods limit
the amount of candidate objects by using a coarse step-size and a limited number of fixed
aspect ratios.

Most region proposals in a dense set do not contain interesting objects. These
proposals need to be discarded after the object detection phase. Detection results can be
discarded if they fall below a predefined confidence threshold or if their confidence
value is below a local maximum (non-maximum suppression) [16].
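A sketch of greedy non-maximum suppression as just described: keep the highest-scoring box, discard every lower-scoring box that overlaps it too much, and repeat (the 0.5 IoU threshold is a common illustrative choice, not a value prescribed by the text).

import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    # boxes: (N, 4) array of (x1, y1, x2, y2); returns indices of kept boxes.
    boxes = np.asarray(boxes, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # IoU of the best box with every remaining box.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[best] + areas[rest] - inter)
        order = rest[iou <= iou_threshold]    # suppress heavy overlaps
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
print(nms(boxes, np.array([0.9, 0.8, 0.7])))  # -> [0, 2]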

Instead of discarding the regions after the object detection stage, the region proposal
generator itself can rank the regions in a class-agnostic way and discard low-ranking
regions. This generates a sparse set of object detections [22]. Similarly to dense set
methods, thresholding and non-maximum suppression (NMS) can be implemented after
the detection phase to further improve the detection quality. Sparse set solutions can be
grouped into unsupervised and supervised methods.

One of the most popular unsupervised methods is Selective Search [13], which utilizes an iterative merging of super-pixels. Another approach is to rank the objectness¹ of a sliding window. A popular example of this is Edge Boxes [21], which computes the objectness score by counting the number of edges within a bounding box and subtracting the number of edges that overlap the box boundary.

Supervised methods treat region proposal generation as a classification or a regression problem. This means using a machine learning algorithm, such as a support vector machine [22]. It is also possible to use a convolutional network to generate the regions of interest.

Certain advanced object detection methods, such as Faster R-CNN [15] described
in section 3.4 below, use parts of the same convolutional network both for generating the
region proposals and for detection. We call these kinds of methods integrated methods.

3.4 FASTER R-CNN


Faster R-CNN [15] by Ren et al. is an integrated method. The main idea is to use
shared convolutional layers for region proposal generation and for detection. The authors
discovered that feature maps generated by object detection networks can also be used to
generate the region proposals. The fully convolutional part of the Faster R-CNN network

1
Objectness: measures membership to a set of object classes vs. background


that generates the region proposals is called a Region Proposal Network (RPN). The authors used the Fast R-CNN architecture for the detection network. In this thesis, this method is used for the target detection; hence, Faster R-CNN is described in more detail in this section, based on the authors' paper [15].

3.4.1 General description


Faster R-CNN addresses the somewhat complex training pipeline that both R-CNN and Fast R-CNN exhibited. The authors insert an RPN after the last convolutional layer. This network looks at the last convolutional feature map and produces region proposals directly from it. From that stage onward, the same pipeline as in Fast R-CNN is used (RoI pooling, Fully-connected layers, and then the classification and regression heads).

Figure 3.3: Faster RCNN is a single, unified network for object detection

Thus, Faster R-CNN is composed of two modules. The first module is a deep FCN
that proposes regions (described in subsection 3.4.2 below), and the second module is the
Fast R-CNN detector (described in section 3.2) that uses the proposed regions. The entire
system is a single, unified network for object detection (Figure 3.3). The RPN module
tells the Fast R-CNN module where to look.

3.4.2 Region Proposal Network


A Region Proposal Network (RPN) takes an image (of any size) as input and outputs
a set of rectangular object proposals, each with an objectness score. This process is
modeled with an FCN. Because the goal is to share computation with a Fast R-CNN
object detection network, both networks share a common set of convolutional layers. In
the authors' experiments, they investigate the Zeiler and Fergus model (ZF) [23], which


has 5 shareable convolutional layers and the Simonyan and Zisserman model (VGG-16)
[24], which has 13 shareable convolutional layers. To generate region proposals, they
slide a small network over the convolutional feature map output by the last shared
convolutional layer. This small network takes as input an n × n spatial window of the input convolutional feature map. Each sliding window is mapped to a lower-dimensional feature (256-d for ZF and 512-d for VGG, with ReLU following). This feature is fed into two sibling Fully-connected layers: a box-regression layer (reg) and a box-classification layer (cls). The authors used n = 3. This mini-network is illustrated in Figure 3.4. Note that because the mini-network operates in a sliding-window fashion, the Fully-connected layers are shared across all spatial locations. This architecture is naturally implemented with an n × n convolutional layer followed by two sibling 1 × 1 convolutional layers (for regression and classification, respectively).

Figure 3.4: Region Proposal Network (RPN)

Anchors: At each sliding-window location, multiple region proposals are simultaneously predicted, where the maximum number of possible proposals for each location is denoted k. The regression layer thus has 4k outputs encoding the coordinates of k boxes, and the classification layer outputs 2k scores that estimate the probability of object or not object for each proposal. The k proposals are parameterized relative to k reference boxes, called anchors. In other words, an anchor is a mechanism to judge whether a box contains an object or not, using the Intersection over Union (described below). An anchor is centered at the sliding window in question and is associated with a scale and an aspect ratio (Figure 3.4). By default, 3 scales and 3 aspect ratios are used, yielding k = 9 anchors at each sliding position. For a convolutional feature map of size W × H (typically WH ∼ 2,400), there are W × H × k anchors in total.

Intersection over Union (IoU): the intersection between the anchor and the ground-truth box (object) in the image, divided by the union of these two boxes, as shown in Figure 3.5. For example, IoU > 0.7 means that the anchor contains an object (we call this the positive overlap range, [0.7 … 1]), while IoU < 0.3 means that the anchor does not contain an object (the negative overlap range, [0 … 0.3]). Mathematically:

    IoU = area(A ∩ B) / area(A ∪ B),   object if IoU > 0.7, not object if IoU < 0.3

where A is the anchor box and B the ground-truth box.

Figure 3.5: Intersection over Union
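As an illustration, the IoU of two axis-aligned boxes can be computed directly. The following is a minimal MATLAB sketch, assuming boxes in [x y width height] form (boxIoU is a hypothetical helper written for illustration; the bboxOverlapRatio function of the Computer Vision System Toolbox provides an equivalent computation):

% Minimal IoU sketch for two boxes given as [x y width height].
% boxIoU is a hypothetical helper; bboxOverlapRatio offers the same result.
function iou = boxIoU(a, b)
    x1 = max(a(1), b(1));                            % intersection rectangle
    y1 = max(a(2), b(2));
    x2 = min(a(1) + a(3), b(1) + b(3));
    y2 = min(a(2) + a(4), b(2) + b(4));
    interArea = max(0, x2 - x1) * max(0, y2 - y1);
    unionArea = a(3)*a(4) + b(3)*b(4) - interArea;   % union = sum - intersection
    iou = interArea / unionArea;
end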

Non-Maximum Suppression (NMS): in this thesis, NMS is used to enhance the detection accuracy. Since objects of the same class cannot overlap in a remote sensing image, this technique is well suited to this work. NMS discards detections whose IoU with a higher-probability detection is larger than a pre-set parameter value (typically 0.5), keeping only the strongest one. The purpose is to remove multiple detections of the same object before evaluation, as shown in Figure 3.6.

Figure 3.6: NMS
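A minimal sketch of this suppression step, assuming MATLAB's selectStrongestBbox function (Computer Vision System Toolbox), which implements exactly this greedy, score-ordered suppression:

% bboxes: M-by-4 detections [x y width height]; scores: M-by-1 confidences.
% Among detections overlapping more than the threshold, keep the strongest.
[keptBoxes, keptScores] = selectStrongestBbox(bboxes, scores, ...
    'OverlapThreshold', 0.5);   % this thesis later uses 0.1 for RS images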

3.4.3 Training
A Faster R-CNN network is trained by alternating between training for RoI
generation and detection. First, two separate networks are trained. Then, these networks
are combined and fine-tuned. During fine-tuning, certain layers are kept fixed and certain
layers are trained in turn.


The trained network receives a single image as input. The shared fully
convolutional layers generate feature maps from the image. These feature maps are fed
to the RPN. The RPN outputs region proposals, which are input, together with the said
feature maps, to the final detection layers. These layers include a RoI pooling layer and
output the final classifications. Using shared convolutional layers, region proposals are
computationally almost cost-free. Computing the region proposals on a CNN has the
added benefit of being realizable on a GPU. Traditional RoI generation methods, such as
Selective Search, are implemented using a CPU.

3.5 REAL-TIME CAPABLE CONVOLUTIONAL OBJECT DETECTION

Detecting objects in a video is possible with advanced existing methods. Some methods, like YOLO [25] or SSD [16], can detect targets at a high FPS² rate, but not necessarily with good precision (detailed in section 3.6 below). This type of method is not needed for the goal of this thesis, which is detecting targets in remote sensing images. The method discussed below serves to illustrate how real-time capable detection works; we chose SSD as an example.

SSD: The Single Shot MultiBox Detector (SSD) [16] takes integrated detection
further. The method does not generate proposals at all, nor does it involve any resampling
of image segments. It generates object detections using a single pass of a convolutional
network.

Somewhat resembling a sliding window method, the algorithm begins with a default set of bounding boxes. These include different aspect ratios and scales. The object predictions calculated for these boxes include offset parameters, which predict how much the correct bounding box surrounding the object differs from the default box.

The algorithm deals with different scales by using feature maps from many different
convolutional layers (i.e. larger and smaller feature maps) as input to the classifier. Since
the method generates a dense set of bounding boxes, the classifier is followed by a non-
maximum suppression (NMS) stage that eliminates most boxes below a certain confidence
threshold.

2
FPS: Frame Per Second


3.6 COMPARING THE METHODS


Above, we described how Faster R-CNN is faster and more accurate than regular
R-CNN. But how does Faster R-CNN perform compared to the abovementioned
advanced methods?

Liu et al. [16] compared the performance of Fast R-CNN, Faster R-CNN and SSD on the PASCAL VOC 2007 [26] test set. When using networks trained on the PASCAL VOC 2007 training data, Fast R-CNN achieved a mean Average Precision (mAP) of 66.9% (see subsection 5.3.2 for a discussion of the evaluation metrics). Faster R-CNN performed better, with a mAP of 69.9%. SSD achieved a mAP of 68.0% with input size 300 × 300 and 71.6% with input size 512 × 512. As the standard implementations of Fast R-CNN and Faster R-CNN use 600 as the length of the shorter dimension of the input image, SSD seems to perform better with similarly sized images. However, SSD requires extensive use of data augmentation to achieve this result [16]. Fast R-CNN and Faster R-CNN only use horizontal flipping, and it is currently unknown whether they would benefit from additional augmentation.

Method         Dataset     mAP
Fast RCNN      VOC 2007    66.9%
Faster RCNN    VOC 2007    69.9%
SSD300         VOC 2007    68.0%
SSD512         VOC 2007    71.6%

Table 1: PASCAL VOC 2007 test detection results

While the advanced methods are more precise than Fast R-CNN, the real
improvements come from speed. When most of the detections with a low probability are
eliminated using thresholding and non-maximum suppression, SSD512 can run at 19 FPS
on a Titan X GPU. Meanwhile, Faster R-CNN with a VGG-16 architecture performs at 7
FPS [16]. The original authors of Faster R-CNN [15] report a running time of 5 FPS (0.2s
per image). Fast R-CNN has approximately the same evaluation speed but requires
additional time for calculating the region proposals. Region generation time depends on
the method, with Selective Search requiring 2 seconds per image on a CPU and Edge
Boxes requiring 0.2 seconds per image [15]. This work will not consider speed performance, because evaluating execution time requires a standardized environment.


                                   Faster RCNN   Fast RCNN   RCNN
Test time/image (with proposal)    0.2 s         2 s         50 s
Test speedup                       250x          25x         1x
mAP (VOC 2007)                     69.9%         66.9%       66.0%

Table 2: Comparing still-image object detection methods on VOC 2007

As mentioned, the purpose of this thesis is detecting targets in High Resolution remote sensing images, so comparing R-CNN, Fast R-CNN and Faster R-CNN is necessary in order to choose the method used in this work. Table 2 [15] shows that Faster R-CNN is the appropriate method: it is by far the fastest at test time, 250 times faster than R-CNN and 10 times faster than Fast R-CNN. Concerning detection precision, the methods were run on the VOC 2007 dataset, and Faster R-CNN achieved the best mAP, 69.9%.

Chapter 4. CREATED DATA AND METHOD

After covering the basic concepts of object detection using CNNs, we will present and discuss the methods and techniques (the Alexnet [20] network and the Faster R-CNN [15] detection method) used and created in this thesis, along with the dataset used for training to make the target detection functional. A first experiment was done as a starting point, to figure out how to enhance the detection. The main goal of this thesis is to create a software capable of detecting different military targets in High-Resolution Remote Sensing images (airplanes, bridges, storage tanks, troops…). Because collecting the right amount of appropriate images of all the target classes for training is a big challenge, and because of the heavy computational work required from the hardware, we will focus on only one class, the Airplane class. However, the work and the method would be similar for detecting multi-class targets, given the appropriate training data.

4.1 STARTING POINT


In order to understand the behavior of the CNN and the Faster R-CNN, a first experiment was done to visualize the result of target detection without any of the enhancements or data augmentation that constitute the main work of this thesis.

We trained the Faster R-CNN using Alexnet on the collected dataset (see subsection 4.2.1). As shown in Figure 4.1, the detection result at first was unusable (it was enhanced during this thesis work). We got a global mAP = 41.29%, which is very low compared with the final result of this thesis work, mAP = 90.12% (detailed in section 5.3).


Figure 4.1: Thesis starting result (left) and final result (right)

4.2 TRAINING DATA


This section presents the images used for training the pretrained network, which we call the Dataset. First, we present the source of the collected images. Next, we discuss the image preprocessing applied to these images and explain the reasons for it.

4.2.1 Image source


The main source of the RS image dataset is the well-known University of California (UC) Merced Land Use Dataset [27]. Derived from the United States Geological Survey (USGS) National Map, this dataset contains 2100 aerial scene images of 256 × 256 pixels, manually labeled into 21 land use classes, 100 per class. Figure 4.2 shows one example image for each class. Since each image comes with a single label, the dataset can only be used for image classification purposes (i.e., to classify the whole image into a single land use class). So, we needed to enhance the images in the dataset as described in subsection 4.2.2 below. Also, the number of images per category is relatively small (100 images), hence the need to collect more images, especially of military airplanes, and to perform the preprocessing and data augmentation described in the next subsection. For the practical purpose of this work, as explained at the beginning of this chapter, we chose only one category: Airplane.


Figure 4.2: One example image for each class of the UC Merced Land Use Dataset

In addition to the UC Merced Dataset, we collected other satellite images, generally of military transport airplanes (good availability on most airbases), using the Geographic Information System (GIS) software Global Mapper [28]. The source map used by the software is provided by the ESRI World Imagery Map [29]. We searched for different airbases around the world and downloaded high-resolution images of different sizes, with a spatial resolution³ of 0.3 meter, as shown in Figure 4.3. The total number of images in this dataset is 83.

Figure 4.3: Example images from the dataset collected using Global Mapper

3
Spatial resolution: the real distance between distinguishable patterns (pixels) in an image that can be separated from each other, often expressed in meters.

NWPU Dataset: In addition to the above datasets, we used a large dataset called NWPU-RESISC45 [30], created by Northwestern Polytechnical University (NWPU), to train a CNN which is used in some approaches (see subsection 5.2.1). This dataset contains


31,500 aerial images spread over 45 scene classes. So far, it is the largest dataset for land use scene classification in terms of both total number of images and number of scene classes. To extract the right general features from remote sensing images, we should reduce the "distance" between daily natural images and satellite images, as is clear in Figure 4.4. This dataset is used to fine-tune the layers of the CNN, because we believe that the features learned this way are more oriented toward satellite images, which can help to exploit the intrinsic characteristics of satellite imagery. This approach will be compared with the other approaches used for the purpose of this thesis. The dataset was split into 2 parts: 90% as training data and 10% as validation data.

Figure 4.4: Airplane: daily natural image (left), remote sensing image (right)

4.2.2 Image preprocessing and Data augmentation


Data augmentation is a practical technique for training an effective deep CNN. However, when we transfer a pre-trained deep CNN to remote sensing classification, the feature extractor is fixed. Objects in RS imagery need to be detected in a rotation-invariant way, due to the random orientation of the targets in the input image. Because the method used for object detection does not have rotation invariance implemented [15], we created several processing techniques to augment the dataset and to try to make the network rotation invariant. Moreover, data augmentation enhances accuracy in general, as shown in [31].

i. Rotating the images

The first step in the image preprocessing is to rotate the images, for two reasons. The first is to enlarge the dataset (83 images would not be enough to get good results, even with a pretrained network). The second is to try to make the detection rotation invariant, since the CNN cannot by itself distinguish between different rotations of the target. The idea is to make copies of every image in the dataset rotated in 45° steps from 0° to 180°, yielding 4 new images per source image. We chose this interval because a data augmentation function is added in the input image layer (see subsection 4.3.1) that flips the image vertically. In the end, while training the network, we have a theoretical augmentation of 9 additional images for each source image. (The total output is 435 images.)

ii. Drawing the Ground Truth (GT) boxes

The second step is to manually draw the GT boxes around the targets in the images. Training an object detection CNN requires images with GT boxes labeled with the different target categories, to train the RPN and to learn how to localize the object against its neighboring environment. We ran a manual labelling session using the built-in Image Labeler app of MATLAB 2017b [32] to get the GT boxes in the form [x y width height], as shown in Figure 4.5.

Figure 4.5: Example of drawing the GT boxes

iii. Extracting the target images from the image dataset

As a third step, all the labeled targets in the dataset from step ii are extracted by cropping. We did this step to initially train the CNN as a classifier, as we believe it initializes the CNN weights to give a better result in the Faster R-CNN training (the total output is 774 images after this step). But to feed the CNN, the images must be resized to a fixed square size while keeping the same aspect ratio to prevent distortion, hence the need for the next step.

iv. Image resizing

The CNN input is a square image (in our case 227 × 227 with RGB channels), so we resized all the images cropped in step iii. The idea of the resize method is to pad the shorter side with black (i.e., zeros) to output a square image, so that we do not lose the aspect ratio and do not get a distorted target for training. After that, we resize the output to 227 × 227 pixels.


v. Create image datastore

The fifth step prepares the data for training. We create an image datastore with all the images from the previous step and their respective classification labels. We then split it into 3 parts: a Training Dataset representing 80% of the data, a Validation Dataset (to follow the improvement of the network while training) with 10%, and a Test Dataset (to get the accuracy) with 10%. This datastore is used only for training and testing the CNN (for classification).

MATLAB scripts were created to batch-process all the previous image preprocessing techniques and prepare the data for feeding the network and the method described in the section below.
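As an illustration of steps i, iv and v, the following is a minimal MATLAB sketch of such batch preprocessing (the folder names and the padToSquare helper are hypothetical placeholders, not the actual thesis scripts):

% Step i: create rotated copies of every source image in 45-degree steps.
files = dir(fullfile('srcDir', '*.jpg'));          % hypothetical folder
for i = 1:numel(files)
    I = imread(fullfile('srcDir', files(i).name));
    for angle = 45:45:180
        R = imrotate(I, angle);                    % 4 rotated copies per image
        imwrite(R, fullfile('outDir', sprintf('%03d_%03d.jpg', i, angle)));
    end
end

% Step v: labeled datastore split into 80% / 10% / 10% parts.
imds = imageDatastore('outDir', 'LabelSource', 'foldernames');  % hypothetical layout
[trainDs, valDs, testDs] = splitEachLabel(imds, 0.8, 0.1, 0.1, 'randomized');

% Step iv: pad the shorter side with zeros (black), then resize to 227x227.
function J = padToSquare(I)
    [h, w, ~] = size(I);
    s = max(h, w);
    J = padarray(I, [s - h, s - w], 0, 'post');    % keeps the aspect ratio
    J = imresize(J, [227 227]);
end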

4.3 ARCHITECTURE OF THE CREATED NETWORK AND DETECTION METHOD
In this section we discuss and present the chosen CNN and the object detection method used to achieve the goal of this thesis. We used the transfer learning technique on a pretrained network because of the benefits it provides: the network has already learned a rich set of image features applicable to a wide range of images. Training time can otherwise become very long because of the heavy computation needed to train the network, especially for detection. Moreover, using a pretrained network has been shown to give better performance than networks trained from scratch [33].

4.3.1 CNN
Regardless of the visual task we want to achieve, a CNN or DCNN is necessary for DL in computer vision tasks. 2012 marked the first year a CNN was used to achieve a top-5 test error rate of 15.4%: Alex Krizhevsky et al. [20] achieved promising results on the general image classification task by developing Alexnet, which obtained excellent results on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [34] dataset. It was trained on ImageNet data (on two GTX 580 GPUs for five to six days), which contained over 15 million annotated images from a total of over 22,000 categories.


The Alexnet network is composed of 25 layers, as shown in Appendix 1. It uses a relatively simple layout compared to modern architectures. The network is made up of 5 convolutional layers, max-pooling layers, dropout layers (which combat overfitting to the training data and make transfer learning with a small dataset viable, hence our choice of this model), and 3 fully connected layers. The network is used for classification with 1000 possible categories [20].

As presented in Figure 4.6, to apply transfer learning we completely remove the last 3 layers trained for classification (the Fully-connected layer, the Softmax layer and the classification layer), because the original output was 1000 classes and our output is only 1 class. We also change the first layer, the input layer, adding a data augmentation method that performs a random vertical flip of the input image on every iteration of the mini-batch during training. This data augmentation, together with the image preprocessing explained in subsection 4.2.2, helps the Faster R-CNN detect objects with rotational invariance, given the limitation of the object detection method explained in subsection 4.3.2 below.

Figure 4.6: Modifications on the Alexnet
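A minimal sketch of this layer surgery, assuming MATLAB R2017b with the Alexnet support package (numClasses is a placeholder for the number of output classes of the task at hand, e.g. the 45 NWPU scene classes):

% Transfer learning sketch: keep Alexnet's feature layers, replace the
% input layer (adding flip augmentation) and the last 3 task-specific layers.
net = alexnet;                           % pretrained network
transferred = net.Layers(2:end-3);       % drop input layer and last 3 layers
layers = [
    imageInputLayer([227 227 3], 'DataAugmentation', 'randfliplr')  % random flip
    transferred
    fullyConnectedLayer(numClasses)      % new output size for our task
    softmaxLayer
    classificationLayer];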

4.3.2 Object detection method


For the object detection method, we use the Faster R-CNN described in section 3.4; the high accuracy achieved by this method, as shown in section 3.6, led us to choose it. Faster R-CNN does not have rotation invariance implemented [15], while objects in RS imagery need to be detected in a rotation-invariant way, due to the random orientation of the targets in the input image. This issue was overcome by the data augmentation in the input layer (subsection 4.3.1) and the image preprocessing (subsection 4.2.2).

Faster R-CNN has 2 CNNs working together. One network is the pretrained CNN described in subsection 4.3.1 above, transformed into a Fast R-CNN (see section 3.4) by adding a regression network in order to output the localization box represented by the 4 values [x y width height]. The other is the RPN, which outputs the RoIs. This network shares its weights with the previous CNN, but the last layers concerned with classification are replaced by an RoI output that feeds the Fast R-CNN for classification [15].

Faster R-CNN goes through a 4-step alternating training, as sketched below:

1. Train the RPN, initialized from the pretrained CNN;

2. Train a separate detection network with Fast R-CNN, using the proposals generated by step 1, also initialized from the pretrained CNN;

3. Fix the convolutional layers and fine-tune the layers unique to the RPN, initialized from the detector of step 2;

4. Fix the convolutional layers and fine-tune the Fully-connected layers of the Fast R-CNN.
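A minimal sketch of this training, assuming the trainFasterRCNNObjectDetector function of MATLAB's Computer Vision System Toolbox (R2017a and later), which implements this 4-step scheme and accepts one trainingOptions object per step; trainingData (a table with an imageFilename column and a column of airplane GT boxes), layers and the learning rates shown are placeholders/assumptions:

% 4-step alternating training sketch; steps 3-4 fine-tune with a lower rate.
optsMain = trainingOptions('sgdm', 'MiniBatchSize', 128, 'MaxEpochs', 10, ...
    'InitialLearnRate', 1e-5, 'Shuffle', 'every-epoch');
optsFine = trainingOptions('sgdm', 'MiniBatchSize', 128, 'MaxEpochs', 10, ...
    'InitialLearnRate', 1e-6, 'Shuffle', 'every-epoch');

detector = trainFasterRCNNObjectDetector(trainingData, layers, ...
    [optsMain, optsMain, optsFine, optsFine], ...   % one options object per step
    'PositiveOverlapRange', [0.7 1], ...            % anchor counted as object
    'NegativeOverlapRange', [0 0.3]);               % anchor counted as background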

Chapter 5. IMPLEMENTATION PROCEDURE AND EVALUATIONS

In the previous chapter, we presented the work done on the dataset, the CNN and the detection method. In this chapter, we discuss the practical implementation of that work by presenting the different proposed training approaches and processes. We then go through the evaluation results and choose the best approach according to the mAP, in order to create the software. The GUI is also presented at the end of this chapter.

5.1 ENVIRONMENT
The implementation environment was a Lenovo V110 laptop computer with an Intel Core i5-7200U 2.50 GHz CPU, 8 GB of RAM and no GPU used for the training. The operating system was Windows 10.

The main software tool was MATLAB 2017b. The object detection system and its related methods were implemented as a combination of preexisting and self-programmed MATLAB tools.

We implemented the convolutional network using the Neural Network Toolbox and the Computer Vision System Toolbox, MATLAB toolboxes developed specifically for this purpose. An alternative would have been the Caffe deep learning framework [35], which is more popular, more versatile and also has a MATLAB interface. However, the chosen toolboxes provided all the required functionality.

5.2 TRAINING PROCESS


In the environment described above, we wrote different MATLAB scripts and functions to train the CNN and the Faster R-CNN, in addition to those used for image preprocessing and data augmentation (subsection 4.2.2).


In order to find the most efficient method and to understand the behavior of the network and of the object detection method, different approaches were created. These approaches differ in the sequence in which the dataset is used for training, and in whether data augmentation is used. This lets us identify the correct approach and method to detect airplanes in a High-Resolution remote sensing image using Deep Learning.

5.2.1 Approaches
During the research, we ran different tests and trainings to understand how the detection method behaves depending on different parameters, such as the dataset used or the training process. Different processes are therefore presented (Table 3), according to the dataset used and to the training sequence of the CNN and the Faster R-CNN. There are 4 types of sequences and 2 dataset types (with or without data augmentation), giving 8 approaches in total, which are evaluated in section 5.3.

Concerning dataset augmentation, No Data Augmentation means that we used the raw dataset of this thesis, without any image preprocessing or augmentation, to train the Faster R-CNN: 83 images in total (to train the Faster R-CNN), containing 134 annotated targets (extracted to train the CNN). Data Augmentation means that the previous data was processed for data augmentation as explained in subsection 4.2.2: 435 images in total, containing 774 annotated targets.

The first training process (Faster RCNN) means that we train only the detection method, without pretraining the CNN: the weights of the CNN are initially those of Alexnet (pretrained on the ImageNet dataset).

The second process trains the CNN first, to initialize the weights of the network with the appropriate data, before training the Faster R-CNN.

In the third sequence, the network is first trained on the NWPU dataset (see subsection 4.2.1), because we believe this makes the features learned by the network more oriented toward satellite images, which can help to exploit the intrinsic characteristics of satellite imagery. We then train the Faster R-CNN detector using this network.

In the last sequence, the NWPU-trained network is trained once more on the extracted targets, to orient it further toward detecting airplanes, before it is used for training the Faster R-CNN.


Training Process               No Data Augmentation   Data Augmentation
Faster RCNN (only)             Faster                 Faster_Aug
CNN → Faster RCNN              C_Faster               C_Faster_Aug
NWPU → Faster RCNN             N_Faster               N_Faster_Aug
NWPU → CNN → Faster RCNN       NC_Faster              NC_Faster_Aug

Table 3: Training processes

All the training parameters are presented in the next subsection.

5.2.2 Training
To be able to compare the results of the different approaches presented above, we fixed the training parameters for the CNN trainings and for the Faster R-CNN method. Fixing the training parameters across the different approaches lets us evaluate the results properly and discuss only the behavior of each approach. More details about the training options are shown in Appendix 2.

i. Training Options

CNN on NWPU and on the created dataset: For the training of our CNNs we used SGDM (Stochastic Gradient Descent with Momentum) with a batch size of 128, a momentum of 0.9 and a low constant learning rate (because it is just a fine-tuning step). We also shuffle the data in each epoch⁴. We validate the network on the validation data (described in 4.2.2.v) every epoch to evaluate the training, and we set a patience of 4 (training stops if the accuracy has not improved after 4 successive epochs), as sketched below.
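A minimal sketch of these options, assuming MATLAB R2017b (the learning rate value and the valData variable are placeholders/assumptions):

% CNN fine-tuning options sketch: SGDM, batch 128, momentum 0.9,
% per-epoch shuffling, and validation with a patience of 4.
optsCNN = trainingOptions('sgdm', ...
    'MiniBatchSize', 128, ...
    'Momentum', 0.9, ...
    'InitialLearnRate', 1e-4, ...       % placeholder for the low constant rate
    'Shuffle', 'every-epoch', ...
    'ValidationData', valData, ...      % the 10% validation split (4.2.2.v)
    'ValidationPatience', 4);           % stop after 4 validations w/o improvement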

Faster R-CNN: As the Faster R-CNN goes through a 4-step alternating training, 4 sets of stage options were made. These learning options are almost the same for the 4 steps: SGDM with a batch size of 128, a learning rate drop factor of 0.5 every 3 epochs (this decreases the learning rate by a factor of 0.5 every 3 epochs, to approach the lower loss more easily), and shuffling of the image dataset every epoch. The initial learning rate is 10⁻⁵ for the first 2 steps and 10⁻⁶ for the remaining steps (because those steps fine-tune the previous ones). We fixed the number of training epochs at 10. We also fixed the positive IoU range to [0.7 … 1] (object) and the negative IoU range to [0 … 0.3] (not object) (see subsection 3.4.2). All the parameters were chosen after many practical trials to get the best result.

4
Epoch: 1 epoch is defined as one complete pass through the entire training data set.

ii. Training results

Major problems limiting the use of deep learning methods are the availability of computing power and of training data (see subsection 2.2.5). For this thesis, we did not have access to a server farm or to a GPU reserved for research purposes; rather, we needed to implement the methods on an ordinary consumer laptop (section 5.1). Training a convolutional network from scratch on such hardware would be enormously time-consuming. Thus, we favored methods established enough to have pre-trained networks available, as well as methods with MATLAB implementations. Even so, the training process took a very long time, as shown in Table 4, which gives the total duration of the training process for every approach (including the CNN and Faster R-CNN trainings). It is reported for information only, because evaluating time properly requires a more standardized environment. A sample of the training details is presented in Appendix 3.

Approach Name     CNN on the NWPU Dataset   CNN on our Dataset   Faster RCNN   Total
Faster            -                         -                    224           224
C_Faster          -                         64                   224           288
N_Faster          9 851                     -                    208           10 059
NC_Faster         9 851                     86                   240           10 177
Faster_Aug        -                         -                    2 646         2 646
C_Faster_Aug      -                         384                  2 384         2 768
N_Faster_Aug      9 851                     -                    2 428         12 279
NC_Faster_Aug     9 851                     539                  2 497         12 887

Table 4: Training time of the approaches (in minutes)

As we can see, the training time is directly related to the size of the training dataset. It took almost 7 days to train on the NWPU dataset (which contains 31,500 aerial images). The CNN took between 1 hour and 1 hour 26 minutes without data augmentation, and between 6 and 9 hours with the augmented dataset. The Faster R-CNN (4-stage training) took between 3.5 and 4 hours on the non-augmented dataset, and between 40 and 44 hours on the augmented one. As an example, training NC_Faster_Aug requires 9 days in total. Fortunately, the CNN training on the NWPU dataset had to be done only once.


5.3 EVALUATION
5.3.1 Test Dataset
To evaluate the accuracy of the presented approaches, we have to test how the detection generalizes to new images on which the methods were not trained. For this purpose, we collected 7 new High Resolution remote sensing images of different airports and airbases (by the same method as in subsection 4.2.1), with a spatial resolution of 0.3 meter and different sizes, as shown in Table 5 (samples of the images are shown in Figure 5.1). Foreign airbases are named AFBx.

Since there is no standardized test dataset for object detection in High Resolution Remote Sensing images, the results of the evaluation on this test dataset are not standardized and cannot be compared directly with other research in this field.

N°   Name                           Type       Size (pixels)    Annotated targets
1    Xian XianYang International    Airport    6395 × 7703      39
2    Chengdu ShuangLiu              Airport    6372 × 11652     50
3    Beijing TongXian               Airport    1950 × 2717      8
4    AFB1                           Air Base   5281 × 3844      15
5    AFB2                           Air Base   2796 × 4002      12
6    AFB3                           Air Base   1228 × 1465      11
7    AFB4                           Air Base   1259 × 1067      2

Table 5: Test Dataset details


Figure 5.1: HR Test images samples: Chengdu (left) AFB2 (right)

5.3.2 Evaluation Metrics

The object detections are evaluated using the standard Intersection over Union (IoU) metric (see subsection 3.4.2) [21]. The bounding boxes of the detections are rarely a pixel-perfect match to the ground-truth boxes; in practice, we are interested in finding detections that are close enough to be called true positive matches. IoU is calculated by dividing the intersection (the overlap) of the detection box and the ground-truth box by the area of their union.

Generally, an "IoU score over 0.5 is counted as a true positive detection" [21], as in the PASCAL VOC Challenge [26], and this definition is used in this thesis as well. There can be only one true positive match per ground-truth box. If several detections match a single ground-truth box, the detection with the highest likelihood is selected as the true positive match and the other detections are marked as false positives. Detections with no matching ground-truth box are marked as false positives as well. Ground-truth boxes with no matching detections are called false negatives (see Table 6 below).


                          Ground Truth
                          Object            Not Object
Detection    Object       True Positive     False Positive
             Not Object   False Negative    True Negative

Table 6: Detection event table

The performance of the object detection algorithm is evaluated by calculating the precision-recall curve and the interpolated average precision of the detections for each class, similarly to the PASCAL VOC Challenge. We calculate the curve directly for an entire dataset (i.e., over all objects in the test data), instead of calculating an average curve sampled at certain recall values from each test image. Precision and recall are calculated cumulatively at each detection's position in the ranked list of target detections. Precision is defined as the number of true positive detections retrieved, divided by the total number of retrieved targets (true and false positives). Recall is defined as the number of true positive detections retrieved, divided by the total number of positive targets known from the ground-truth data (true positives + false negatives):

    Precision = TP / (TP + FP)

    Recall = TP / (TP + FN)

An interpolated (monotonically decreasing) curve is computed for visualization purposes and for calculating the interpolated average precision. The interpolated curve is created by replacing the actual precision values with the maximum precision values from the remaining part of the curve. The PASCAL VOC challenge uses interpolated average precision [26]; we use the same method to produce comparable results. In the following, the terms average precision and mean average precision are to be understood as interpolated average precision, noted mAP.
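A sketch of how these metrics can be computed, assuming MATLAB's evaluateDetectionPrecision function (Computer Vision System Toolbox, R2017a and later); results and testGroundTruth are placeholder tables of detections and GT boxes:

% Average precision sketch: 'results' holds per-image Boxes and Scores,
% 'testGroundTruth' holds the GT boxes; 0.5 is the true-positive IoU threshold.
[ap, recall, precision] = evaluateDetectionPrecision(results, testGroundTruth, 0.5);
plot(recall, precision);                 % precision-recall curve
xlabel('Recall'); ylabel('Precision');
title(sprintf('mAP = %.2f%%', ap * 100));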


5.3.3 Evaluation results

A MATLAB script was created to load the test dataset (described in subsection 5.3.1) and to run the detector file, containing the detection features, on every single image of the dataset. The results are bounding box locations and sizes in the form [x y width height], with the classification confidence of each target. Before calculating the evaluation outputs (mAP, precision, recall), NMS is performed (see subsection 3.4.2) with an IoU threshold of 0.1, since objects of the same class cannot overlap in a remote sensing image the way they can in daily natural images. The effect of NMS is evaluated to confirm that this technique enhances the detection accuracy (No NMS means that the IoU threshold keeps the default value used in Faster R-CNN, IoU = 0.5).
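A minimal sketch of this evaluation loop, assuming the detect method of the trained fasterRCNNObjectDetector and selectStrongestBbox for the NMS (testImds and detector are placeholders):

% Evaluation loop sketch: detect, apply NMS with IoU 0.1, collect results.
n = numel(testImds.Files);
results = table(cell(n, 1), cell(n, 1), 'VariableNames', {'Boxes', 'Scores'});
for i = 1:n
    I = readimage(testImds, i);
    [bboxes, scores] = detect(detector, I);              % Faster R-CNN detection
    [bboxes, scores] = selectStrongestBbox(bboxes, scores, ...
        'OverlapThreshold', 0.1);                        % NMS for RS imagery
    results.Boxes{i} = bboxes;
    results.Scores{i} = scores;
end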

No Data Augmentation               Data Augmentation
Name         No NMS     NMS        Name              No NMS     NMS
Faster       41.29%     49.00%     Faster_Aug        55.82%     61.72%
C_Faster     13.47%     15.48%     C_Faster_Aug      78.07%     83.21%
N_Faster     49.97%     59.62%     N_Faster_Aug      87.36%     89.08%
NC_Faster    38.58%     50.01%     NC_Faster_Aug     88.22%     90.12%

Table 7: Evaluation results (mAP)

The mAP was also evaluated for each image of each approach. We show only the detailed results after the NMS process, in Table 8 below.

Image            1        2        3        4        5        6        7
Faster           46.94%   56.16%   26.09%   89.83%   84.31%   47.26%   41.67%
C_Faster         18.69%   11.10%   45.33%   29.54%   16.80%   27.16%   36.67%
N_Faster         76.17%   72.14%   94.64%   84.45%   72.90%   98.05%   100%
NC_Faster        75.66%   34.27%   86.23%   59.96%   67.47%   100%     100%
Faster_Aug       55.96%   59.93%   66.50%   73.86%   87.27%   100%     100%
C_Faster_Aug     81.96%   79.69%   49.76%   90.87%   84.47%   100%     100%
N_Faster_Aug     97.36%   81.78%   100%     81.42%   87.64%   100%     100%
NC_Faster_Aug    97.43%   83.63%   100%     83.26%   87.64%   100%     100%

Table 8: Detailed evaluation results, mAP per test image (after NMS)


The detection time was also evaluated for each image, as shown in Table 9. As mentioned before, it is reported for information only, because evaluating time properly requires a more standardized environment.

Image            1     2       3    4     5     6    7    Total
Faster           859   1 384   96   355   183   31   23   2 931
C_Faster         756   1 140   76   297   160   25   18   2 472
N_Faster         732   1 143   73   286   152   25   17   2 428
NC_Faster        910   1 359   94   369   201   32   23   2 988
Faster_Aug       839   1 302   83   307   165   26   19   2 741
C_Faster_Aug     712   1 105   74   289   156   24   17   2 377
N_Faster_Aug     731   1 132   71   286   158   24   17   2 419
NC_Faster_Aug    847   1 310   90   375   199   33   23   2 877

Table 9: Detailed evaluation time (in seconds)

All the above results are discussed in the next section, along with detection results on some images of the test dataset.

5.4 DISCUSSION OF THE RESULTS

Analyzing Table 7, we notice that the approaches clearly show the improvement brought by our augmented dataset: the last approach (NC_Faster_Aug) raised the detection mAP from 41.29% for the Faster approach (no data augmentation or image preprocessing) to 90.12%.

The NMS added between 2% and 10% of mAP in most cases. This technique removes the overlapping bounding boxes of the same target, keeping the most accurate one, as shown in Figure 5.2. Since targets of the same class cannot overlap in a remote sensing image, we set the IoU ratio to 0.1 (a minimal value); performing the NMS thus reliably enhances the detection precision.


Figure 5.2: NMS performed on an extract of test image 5 with NC_Faster_Aug approach

The system can now perform state-of-the-art detection (more than 90% accuracy) after applying the image preprocessing and data augmentation discussed in subsection 4.2.2, and we can see clearly how much this created method enhances the detection. For example, for the NC_Faster_Aug approach after performing NMS, the mAP jumped by 40 points, from 50.01% without data augmentation to 90.12%, and by 50 points without performing NMS. Figure 5.3 below presents the precision-recall curve (explained in subsection 5.3.2) for the NC_Faster_Aug and NC_Faster approaches. As we can see, for the first approach the recall reaches the value of 1 (this is the case for all the approaches using data augmentation and image preprocessing; details can be seen in Appendix 4), which means that this approach finds all the targets in the whole test dataset. The precision of the first approach is also higher than that of the second, meaning fewer false positive detections. So, the data augmentation and image preprocessing enhance the detection by finding not only all the targets, but also the correct ones.


Figure 5.3: Data Augmentation enhancement

Table 7 also shows that training the CNN on a very large RS dataset (NWPU) helps to enhance the results. We initially believed that the features learned would be more oriented toward satellite images, which can help to exploit the intrinsic characteristics of satellite imagery. The evaluation validates this supposition: the approaches whose CNN was trained on the NWPU dataset performed better. For example, the mAP increases by around 10 points from the Faster to the N_Faster approach.

Exploring Table 8 shows that the accuracy is higher for smaller images. The 100% accuracies (NC_Faster_Aug approach) are obtained on the 3 smallest images (3, 6 and 7), which contain fewer targets; but even for the largest image (1) we get a good accuracy of 97.43%. Figure 5.4 shows the detections on test images 1, 6 and 7 by NC_Faster_Aug.

Table 9 shows that the detection time is directly related to the size of the input test image and to the number of targets. This is expected, since the kernel window must slide over more pixels in a larger image, and more targets must be classified.


Figure 5.4: Detection result on test images 1, 6 and 7

5.5 GUI SOFTWARE DESIGN

As discussed in the previous section, the NC_Faster_Aug approach gave the best result in this work. To use this method in practice, we built a GUI software to detect airplane targets in High Resolution remote sensing images.

The framework of the software is shown in Figure 5.5. The basic functionality of the software is to input a High Resolution Remote Sensing image and to get back the same image with bounding boxes around the detected targets.


Figure 5.5: GUI Software Framework

The main window is simple. The menu bar contains 3 buttons, as shown in Figure 5.6. The Process menu contains the processing procedure: starting by inputting the image, then running the detection, and ending by saving the resulting image.

Figure 5.6: Software main window

An options window was created to let the user change the line width, the box color, the font size of the classification confidence shown for each detection, and also the NMS IoU value (0.1 is recommended).


Figure 5.7: Detection and showing the result on the GUI Software

After inputting the raw image, the user clicks on the Detect button. As shown in Figure 5.7, the detection starts, then the software shows the detection results with the number of targets and the detection time. After that, the user can save the result image in a chosen folder.

Figure 5.8: GUI Software "About" window

Chapter 6. CONCLUSION

In this chapter, we review our results and provide some concluding remarks. We set out to review the current methods of convolutional object detection, to implement one such method and to explore potential improvements.

6.1 THEORY
We began the thesis with a review of the theoretical background. We explained how neural networks function and what object detection entails. We demonstrated why regular neural networks are insufficient for image-related tasks and how translation-invariant convolutional networks provide an effective solution to many computer vision problems.

Next, we demonstrated how convolutional object detection has evolved from the relatively slow R-CNN to the current optimized methods. This development is mostly not related to the structure of the convolutional network itself; rather, it is related to how the convolutional network is used and to the computation that takes place before and after it. In the earlier methods, there were many separate phases involving preprocessing, region generation, computation of the fully connected layers and the final classification. In the latest methods, these phases have been increasingly integrated into the convolutional network itself, while keeping the basic CNN model intact. On the other hand, the 2016 winner of the ImageNet challenge [34] is again a model composed of many separate components. Nonetheless, several computational bottlenecks have disappeared. Over the past few years, the speed of object detection has improved more dramatically than its precision.

6.2 PRACTICE
To experiment with a convolutional method in practice, we created a working MATLAB implementation of Faster R-CNN. We learned that the most challenging part of implementing a deep learning system is collecting the training data and performing the training itself. Training time can be shortened considerably by using a pretrained network. Even if the final system does not feature the same object classes as the benchmark data, visual problems are universal enough to benefit from detectors trained for a different problem. The optimal bottom layers of a convolutional network are often similar regardless of the problem, just as the human eye uses the same receptive fields for all visual tasks. Thus, it makes sense to initialize the layers using a pretrained network.

Related to the implementation, we also learned that there are no easy "out-of-the-box" solutions for effectively implementing convolutional networks, especially for remote sensing images. The goal of this work was to show the ability of object detection using DL technology, with the Faster R-CNN method, for military targets in High-Resolution RS images. To implement this method, a series of processing steps had to be applied to the created dataset and to the chosen pre-trained CNN, due to the small number of images in the dataset. We demonstrated the feasibility of this type of work on a single-CPU laptop.

6.3 RESULTS
Regarding precision, the results were promising: we reached state-of-the-art accuracy with an mAP of 90.12%. We showed how a system pretrained on general image data can be used to detect objects in a specific task (military target detection in HR RS images), demonstrating the adaptability of the methods through the appropriate dataset creation, image preprocessing, data augmentation and training process. In many cases, Faster R-CNN detected more objects than the marked targets; these were labelled as false positives, despite clearly having the right object class upon visual inspection.

6.4 THE FUTURE

In this work, one class was used as an example (Airplane). In the future, the software can be trained to detect more classes (storage tanks, ships, armored vehicles, bridges, deployed troops…). There is no limitation, as long as we can collect the appropriate dataset from Remote Sensing images (including EO, IR, SAR…).

REFERENCES

[1] K. Fukushima, "Neocognitron: A hierarchical neural network capable of visual pattern recognition," Neural Networks, vol. 1, no. 2, pp. 119-130, 1988.

[2] X. Jin and C. H. Davis, "Vehicle detection from high-resolution satellite


imagery using morphological shared-weight neural networks," Image and
Vision Computing, vol. 25, no. 9, pp. 1422-1431, 2007.

[3] X. Chen, S. Xiang, C.-L. Liu and C. H. Pan, "Vehicle detection in satellite
images by hybrid deep convolutional neural networks," IEEE Geoscience and
remote sensing letters, vol. 11, no. 10, pp. 1797-1801, 2014.

[4] Q. Jiang, L. Cao, M. Cheng, C. Wang and J. Li, "Deep neural networks-based
vehicle detection in satellite images," in International Symposium on
Bioelectronics and Bioinformatics, 2015.

[5] P. Zhou, G. Cheng, Z. Liu, S. Bu and X. Hu, "Weakly supervised target


detection in remote sensing images based on transferred deep features and
negative bootstrapping," Multidimensional Systems and Signal Processing,
vol. 27, no. 4, pp. 925-944, 2016.

[6] L. Zhang, Z. Shi and J. Wu, "A hierarchical oil tank detector with deep
surrounding features for high-resolution optical satellite imagery," IEEE
Journal of Selected Topics in Applied Earth Observations and Remote
Sensing, vol. 8, no. 10, pp. 4895-4909, 2015.

[7] A.-B. Salberg, "Detection of seals in remote sensing images using features
extracted from deep convolutional neural networks," in IEEE International
Geoscience and Remote Sensing Symposium (IGARSS), 2015.

[8] I. Ševo and A. Avramovic, "Convolutional neural network based automatic


object detection on aerial images," IEEE Geoscience and Remote Sensing
Letters, vol. 13, no. 5, pp. 740-744, 2016.

[9] H. Zhu, X. Chen, W. Dai, K. Fu, Q. Ye and J. Jiao, "Orientation robust object detection in aerial images using deep convolutional neural network," in IEEE International Conference on Image Processing (ICIP), 2015.

[10] G. Cheng, P. Zhou and J. Han, "Learning rotation-invariant convolutional


neural networks for object detection in VHR optical remote sensing images,"


IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 12, pp.
7405-7415, 2016.

[11] I. Goodfellow, Y. Bengio and A. Courville, DEEP LEARNING, MIT Press,


2016.

[12] C. M. Bishop, Pattern Recognition and Machine Learning, Secaucus,NJ, USA:


Springer-Verlag New York, Inc., 2006.

[13] Y. Bengio, "Deep Learning of Representations for Unsupervised and Transfer


Learning," in ICML Unsupervised and Transfer Learning, 2012.

[14] L. N. Long and A. Gupta, "Scalable massively parallel artificial neural networks," Journal of Aerospace Computing, Information, and Communication, vol. 5, no. 1, pp. 3-15, 2008.

[15] D. R. Wilson and T. R. Martinez, "The general inefficiency of batch training for gradient descent learning," Neural Networks, vol. 16, no. 10, pp. 1429-1451, 2003.

[16] D. Rumelhart, G. Hinton and R. Williams, "Learning representations by back-


propagating errors," Cognitive modeling, vol. 5, no. 3, 1986.

[17] D. Steinkraus, I. Buck and P. Simard, "Using gpus for machine learning
algorithms," in Proceedings Eighth International Conference on Document
Analysis and Recognition, 2005.

[18] R. Girshick, "Fast R-CNN," Proceedings of the IEEE International, pp. 1440-
1448, 2015.

[19] R. Szeliski, Computer Vision: Algorithms and Applications, Springer, 2010.

[20] N. Sebe, "Machine learning in computer vision," Springer Science & Business
Media, vol. 29, 2005.

[21] D. Walther, L. Itti, M. Riesenhuber, T. Poggio and C. Koch, "Attentional


selection for object recognition - a gentle way," in International Workshop on
Biologically Motivated Computer Vision, 2002.

[22] J. R. R. Uijlings, K. E. A. Sande, T. Gevers and A. W. M. Smeulders,


"Selective Search for Object Recognition," International Journal of Computer
Vision, vol. 2, pp. 154-171, 2013.

[23] R. Girshick, J. Donahue, T. Darrell and J. Malik, "Rich feature hierarchies for
accurate object detection and semantic segmentation," in Proceedings of the
IEEE conference on computer vision and, 2014.


[24] S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards Real-Time
Object Detection with Region Proposal Networks," IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, 01
06 2017.

[25] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu and A. C. Berg,


"SSD: Single shot multibox detector," in European Conference on Computer
Vision , Amsterdam, Netherlands, 2016.

[26] D. Marr and E. Hildreth, "Theory of edge detection," Proceedings of the Royal Society of London. Series B, Biological Sciences, vol. 207, no. 1167, pp. 187-217, 1980.

[27] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-
scale image recognition," arXiv preprint, 2014.

[28] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov,


"Dropout: A simple way to prevent neural networks from overfitting," Journal
of Machine Learning Research , vol. 15, pp. 1929-1958, 2014.

[29] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet Classification with


Deep Convolutional Neural Networks," Advances in neural information
processing systems, pp. 1097-1105, 2012.

[30] C. L. Zitnick and P. Dollar, "Edge boxes: Locating object proposals from
edges," European Conference on Computer Vision, pp. 391-405, 2014.

[31] B. Yang, J. Yan, Z. Lei and S. Z. Li, "Craft objects from images," in
Proceedings of the IEEE Conference on Computer Vision and Pattern, 2016.

[32] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in European Conference on Computer Vision (ECCV), 2014.

[33] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-
scale image recognition," in International Conference on Learning
Representations (ICLR), 2015.

[34] J. Redmon, S. Divvala, R. Girshick and A. Farhadi, "You Only Look Once:
Unified, Real-Time Object Detection," in Conference on Computer Vision and
Pattern Recognition, 2016.

[35] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn and A. Zisserman,


"The PASCAL Visual Object Classes Challenge 2007," [Online]. Available:
http://host.robots.ox.ac.uk/pascal/VOC/voc2007/.


[36] Y. Yang and S. Newsam, "Bag-of-visual-words and spatial extensions for


land-use classification," in 18th ACM SIGSPATIAL Int. Symposium on
Advances in Geograph. Inf. Sys., San Jose, CA, USA, June, 2010.

[37] Blue Marble Geographics, "Global Mapper," Blue Marble Geographics, [Online]. Available: http://www.bluemarblegeo.com/products/global-mapper.php.

[38] esri, "Esri World Imagery Map," ESRI, [Online]. Available: https://www.arcgis.com/home/item.html?id=10df2279f9684e4a9f6a7f08febac2a9.

[39] G. Cheng, J. Han and X. Lu, "Remote sensing image scene classification:
Benchmark and state of the art," in Proceedings of the IEEE.

[40] J. Wang, C. Luo, H. Huang, H. Zhao and S. Wang, "Transferring Pre-Trained


Deep CNNs for Remote Scene Classification with General Features Learned
from Linear PCA Network," MDPI Remote Sensing, vol. 9, no. 225, 2017.

[41] Mathworks, "Matlab," [Online]. Available:


https://www.mathworks.com/products/matlab.html.

[42] M. Castelluccio, G. Poggi, C. Sansone and L. Verdoliva, Land use


classification in remote sensing images by convolutional neural networks,
arXiv, 2015.

[43] ImageNet, "Large Scale Visual Recognition Challenge (ILSVRC)," Stanford Vision Lab, [Online]. Available: http://www.image-net.org/challenges/LSVRC/.

[44] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S.


Guadarrama and T. Darrell, "Caffe: Convolutional architecture for fast feature
embedding," arXiv preprint. arXiv:1408.5093 , 2014.

APPENDIX

Appendix 1. Network Visualizations

ALEXNET


MODIFIED ALEXNET


Appendix 2. Training options

CNNS TRAINING OPTION

FASTER RCNN TRAINING OPTION


Appendix 3. Training results

CNN ON THE NWPU DATASET

CNN OF THE NC_FASTER_AUG

DETECTOR OF THE NC_FASTER
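
The four "STEP n OF 4" blocks in the log below are MATLAB's alternating four-stage Faster R-CNN training: the RPN, a Fast R-CNN network trained on the RPN's proposals, the RPN re-trained with shared weights, and Fast R-CNN re-trained on the updated proposals. A call of roughly the following shape produces such a log; trainingData (the table of 83 labeled airplane images mentioned in the log), the modified AlexNet layer array, and the option objects sketched in Appendix 2 are assumed to be in the workspace.

% One trainingOptions object per stage; stages 1-2 start at 1e-5 and
% stages 3-4 at 1e-6, matching the log below.
stageOptions = [optsEarly, optsEarly, optsLate, optsLate];
detector = trainFasterRCNNObjectDetector(trainingData, layers, stageOptions);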


TRAINING A FASTER R-CNN OBJECT DETECTOR FOR THE FOLLOWING OBJECT CLASSES:
* AIRPLANE
STEP 1 OF 4: TRAINING A REGION PROPOSAL NETWORK (RPN).
TRAINING ON SINGLE CPU.
|==============================================================================|
| EPOCH | ITERATION | TIME ELAPSED | MINI-BATCH | MINI-BATCH | BASE LEARNING|
| | | (SECONDS) | LOSS | ACCURACY | RATE |
|==============================================================================|
| 1 | 1 | 4.23 | 0.8015 | 32.26% | 1.00E-05 |
| 1 | 50 | 330.82 | 0.8326 | 53.60% | 1.00E-05 |
| 2 | 100 | 720.84 | 0.9636 | 53.28% | 1.00E-05 |
| 2 | 150 | 1070.42 | 0.6072 | 63.96% | 1.00E-05 |
| 3 | 200 | 1385.49 | 0.1029 | 96.34% | 1.00E-05 |
| 4 | 250 | 1772.18 | 0.1347 | 98.41% | 5.00E-06 |
| 4 | 300 | 2095.05 | 0.2642 | 97.50% | 5.00E-06 |
| 5 | 350 | 2449.28 | 0.0717 | 98.98% | 5.00E-06 |
| 5 | 400 | 2783.54 | 0.1200 | 98.02% | 5.00E-06 |
| 6 | 450 | 3152.80 | 0.1356 | 95.79% | 5.00E-06 |
| 7 | 500 | 3531.74 | 0.0932 | 96.80% | 2.50E-06 |
| 7 | 550 | 3880.59 | 0.0831 | 99.09% | 2.50E-06 |
| 8 | 600 | 4194.11 | 0.0479 | 98.26% | 2.50E-06 |
| 9 | 650 | 4565.16 | 0.0524 | 100.00% | 2.50E-06 |
| 9 | 700 | 4927.12 | 0.0697 | 98.18% | 2.50E-06 |
| 10 | 750 | 5261.77 | 0.3381 | 86.61% | 1.25E-06 |
| 10 | 800 | 5536.18 | 0.0856 | 96.19% | 1.25E-06 |
| 10 | 810 | 5658.32 | 0.0564 | 97.66% | 1.25E-06 |
|===============================================================================|
STEP 2 OF 4: TRAINING A FAST R-CNN NETWORK USING THE RPN FROM STEP 1.
*******************************************************************
TRAINING A FAST R-CNN OBJECT DETECTOR FOR THE FOLLOWING OBJECT CLASSES:
* AIRPLANE
--> EXTRACTING REGION PROPOSALS FROM 83 TRAINING IMAGES...DONE.
TRAINING ON SINGLE CPU.
|===========================================================================|
| EPOCH | ITERATION | TIME ELAPSED | MINI-BATCH | MINI-BATCH | BASE LEARNING|
| | | (SECONDS) | LOSS | ACCURACY | RATE |
|============================================================================|
| 1 | 1 | 4.00 | 0.8623 | 0.00% | 1.00E-05 |
| 3 | 50 | 304.92 | 0.4888 | 100.00% | 1.00E-05 |
| 5 | 100 | 612.74 | 0.2417 | 100.00% | 5.00E-06 |
| 7 | 150 | 914.63 | 0.6358 | 50.00% | 2.50E-06 |
| 9 | 200 | 1203.98 | 0.1600 | 100.00% | 2.50E-06 |
| 10 | 230 | 1378.66 | 0.2679 | 100.00% | 1.25E-06 |
|===============================================================================|
STEP 3 OF 4: RE-TRAINING RPN USING WEIGHT SHARING WITH FAST R-CNN.
STARTING PARALLEL POOL (PARPOOL) USING THE 'LOCAL' PROFILE ...
CONNECTED TO 2 WORKERS.
TRAINING ON SINGLE CPU.
|===========================================================================|
| EPOCH | ITERATION | TIME ELAPSED | MINI-BATCH | MINI-BATCH | BASE LEARNING|
| | | (SECONDS) | LOSS | ACCURACY | RATE |
|==============================================================================|
| 1 | 1 | 19.80 | 0.5322 | 77.69% | 1.00E-06 |
| 1 | 50 | 305.92 | 0.3426 | 88.19% | 1.00E-06 |
| 2 | 100 | 695.85 | 0.1598 | 98.37% | 1.00E-06 |
| 2 | 150 | 1043.31 | 0.5692 | 69.84% | 1.00E-06 |
| 3 | 200 | 1341.78 | 0.2618 | 96.25% | 1.00E-06 |
| 4 | 250 | 1770.97 | 0.1327 | 99.21% | 5.00E-07 |
| 4 | 300 | 2045.80 | 0.6939 | 65.04% | 5.00E-07 |
| 5 | 350 | 2440.63 | 0.5911 | 71.43% | 5.00E-07 |
| 5 | 400 | 2791.42 | 0.2100 | 95.83% | 5.00E-07 |
| 6 | 450 | 3183.70 | 0.2255 | 97.09% | 5.00E-07 |
| 7 | 500 | 3552.22 | 0.0987 | 96.40% | 2.50E-07 |
| 7 | 550 | 3900.39 | 0.2507 | 96.09% | 2.50E-07 |
| 8 | 600 | 4300.56 | 0.2686 | 88.19% | 2.50E-07 |
| 9 | 650 | 4659.95 | 0.1339 | 93.33% | 2.50E-07 |
| 9 | 700 | 5013.47 | 0.0985 | 98.98% | 2.50E-07 |
| 10 | 750 | 5390.06 | 0.1851 | 94.44% | 1.25E-07 |
| 10 | 800 | 5721.74 | 0.1568 | 99.21% | 1.25E-07 |
| 10 | 810 | 5761.51 | 0.2388 | 92.63% | 1.25E-07 |
|===============================================================================|
STEP 4 OF 4: RE-TRAINING FAST R-CNN USING UPDATED RPN.
*******************************************************************
TRAINING A FAST R-CNN OBJECT DETECTOR FOR THE FOLLOWING OBJECT CLASSES:
* AIRPLANE
--> EXTRACTING REGION PROPOSALS FROM 83 TRAINING IMAGES...DONE.
TRAINING ON SINGLE CPU.
|===========================================================================|
| EPOCH | ITERATION | TIME ELAPSED | MINI-BATCH | MINI-BATCH | BASE LEARNING|
| | | (SECONDS) | LOSS | ACCURACY | RATE |
|==============================================================================|
| 1 | 1 | 3.72 | 0.7580 | 50.00% | 1.00E-06 |
| 2 | 50 | 267.51 | 0.7434 | 50.00% | 1.00E-06 |
| 4 | 100 | 547.59 | 0.6311 | 50.00% | 5.00E-07 |
| 6 | 150 | 831.96 | 0.5542 | 50.00% | 5.00E-07 |
| 7 | 200 | 1108.40 | 0.6893 | 50.00% | 2.50E-07 |
| 9 | 250 | 1380.31 | 0.7083 | 50.00% | 2.50E-07 |
| 10 | 290 | 1608.14 | 0.5942 | 50.00% | 1.25E-07 |
|===============================================================================|
FINISHED TRAINING FASTER R-CNN OBJECT DETECTOR.


Appendix 4. Evaluation precision-recall curves
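
The curves that follow, one per model variant from FASTER to NC_FASTER_AUG, can be produced with the standard MATLAB evaluation workflow. A minimal sketch, assuming detector is a trained fasterRCNNObjectDetector and testData is a table whose first column (here called imageFilename, an assumed name) holds the test image paths and whose second column holds the ground-truth airplane boxes:

% Run the detector over every test image, collecting boxes and scores.
numImages = height(testData);
boxes = cell(numImages, 1);
scores = cell(numImages, 1);
for i = 1:numImages
    I = imread(testData.imageFilename{i});
    [boxes{i}, scores{i}] = detect(detector, I);
end
results = table(boxes, scores, 'VariableNames', {'Boxes', 'Scores'});

% Average precision and the data behind the precision-recall curve.
[ap, recall, precision] = evaluateDetectionPrecision(results, testData(:, 2));
plot(recall, precision), grid on
xlabel('Recall'), ylabel('Precision')
title(sprintf('Average Precision = %.4f', ap))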


FASTER

C_FASTER

N_FASTER


NC_FASTER

FASTER_AUG

C_FASTER_AUG


N_FASTER_AUG

NC_FASTER_AUG


(PAGE INTENTIONALLY LEFT BLANK)

PUBLISHED ARTICLES

[1] Ghazi MARZOUK, Li Wei, "Design of IMINT Military Target Detection Software using Deep Learning," Journal of Air Force Commander College.

