Deep Neural Networks For Marine Debris Detection in Sonar Images
Matias Alejandro Valdenegro Toro, M.Sc., B.Sc.
April 2019.
The copyright in this thesis is owned by the author. Any quotation from the
thesis or use of any of the information contained in it must acknowledge this
thesis as the source of the quotation or information.
Copyright © 2018 - 2019 Matias Alejandro Valdenegro Toro.
Final version, arXiv release, April 2019. Compiled on May 15, 2019.
Abstract
Acknowledgements
Contents

Abstract
Acknowledgements
1 Introduction
1.1 Research Questions
1.2 Feasibility Analysis
1.3 Scope
1.4 Thesis Structure
1.5 Software Implementations
1.6 Contributions
1.7 Related Publications
4.3.4 Comparison with State of the Art
4.3.5 Feature Visualization
4.3.6 Computational Performance
4.4 Summary of Results
9.1 Future Work
Bibliography

List of Tables

List of Figures
4.8 TinyNet5-8 t-SNE Feature visualization for the output of each Tiny module
4.9 TinyNet5-8 MDS Feature visualization for the output of each Tiny module
4.10 ClassicNet-BN-5 t-SNE Feature visualization
4.11 ClassicNet-BN-5 MDS Feature visualization
4.12 ClassicNet-Dropout-5 t-SNE Feature visualization
4.13 ClassicNet-Dropout-5 MDS Feature visualization
4.14 Computation Time comparison for the HP and LP platforms on our tested networks
7.8 Sample detections using objectness thresholding with To = 0.6 and NMS threshold St = 0.7
7.9 ClassicNet with Objectness Ranking: Detection proposal results over our dataset
7.10 TinyNet-FCN with Objectness Ranking: Detection proposal results over our dataset
7.11 Sample detections using objectness ranking with K = 10 and NMS threshold St = 0.7
7.12 Number of output proposals versus Recall for different techniques
7.13 Objectness Map Visualization
7.14 Detection proposals on unseen objects
7.15 Detection proposals on unseen objects (cont.)
7.16 Sample missed detections with objectness thresholding at To = 0.6 and NMS threshold St = 0.7
7.17 Proposal Quality for ClassicNet objectness
7.18 Proposal Quality for TinyNet-FCN objectness
7.19 Training set size evaluation for our proposal networks
Symbols and Abbreviations
α Learning Rate
B Batch Size
ŷ Predicted Value
x Input Scalar
x Input Vector
C Number of Classes
Tr Training Set
Vl Validation Set
Ts Test Set
ML Machine Learning
GD Gradient Descent
Neural Network Notation
Throughout this thesis we use the following notation for neural network layers:
Conv( f , w × h)
Two Dimensional Convolutional layer with f filters of w × h
spatial size.
MaxPool(w × h)
Two Dimensional Max-Pooling layer with spatial sub-sampling
size w × h.
AvgPool()
Two Dimensional Global Average Pooling layer.
FC(n)
Fully Connected layer with n neurons.
1 Introduction
implies that the object sinks and could be forgotten at the bottom of the water body.
There is an extensive scientific literature² describing, locating, and quantifying the amount of marine debris found in the environment. There are reports of human-made discarded objects at depths of up to 4000 meters off the coast of California³, and at more than 10,000 meters in the Mariana Trench⁴.

During trials at Loch Earn (Scotland, UK) we saw first-hand the amount of submerged marine debris at the bottom of this lake. This experience was the initial motivation for this doctoral research. For an Autonomous Underwater Vehicle, detecting and mapping objects with a large intra- and inter-class variability, such as marine debris, is a considerable challenge.

This thesis proposes the use of Autonomous Underwater Vehicles to survey and recover/pick up submerged marine debris from the bottom of a water body. This is an important problem: we believe that contaminating our natural water sources is not a sustainable way of life, and there is evidence⁵ that debris is made of materials that pollute and have a negative effect on marine environments⁶.

Most research in underwater object detection and classification deals with mine-like objects (MLOs). This bias is also driven by large streams of funding from different military sources around the world. We believe that a much more interesting and challenging problem for underwater perception is to find and map submerged marine debris.

2 WC Li, HF Tse, and L Fok. Plastic waste in the marine environment: A review of sources, occurrence and effects. Science of the Total Environment, 566:333-349, 2016.
3 Kyra Schlining, Susan Von Thun, Linda Kuhnz, Brian Schlining, Lonny Lundsten, Nancy Jacobsen Stout, Lori Chaney, and Judith Connor. Debris in the deep: Using a 22-year video annotation database to survey marine litter in Monterey Canyon, central California, USA. Deep Sea Research Part I: Oceanographic Research Papers, 79:96-105, 2013.
4 Sanae Chiba, Hideaki Saito, Ruth Fletcher, Takayuki Yogi, Makino Kayo, Shin Miyagi, Moritaka Ogido, and Katsunori Fujikura. Human footprint in the abyss: 30 year records of deep-sea plastic debris. Marine Policy, 2018.
5 María Esperanza Iñiguez, Juan A Conesa, and Andres Fullana. Marine debris occurrence and treatment: A review. Renewable and Sustainable Energy Reviews, 64:394-402, 2016.
6 SB Sheavly and KM Register. Marine debris & plastics: environmental concerns, sources, impacts and solutions. Journal of Polymers and the Environment, 15(4):301-305, 2007.
T = A / (V × S)    (1.1)
Note that V might be limited not only by the AUV's maximum speed, but also by the maximum operational speed that the sonar sensor requires (towing speed). For example, some sonar sensors collect information across time (multiple pings) and might not operate correctly if the vehicle is moving too fast. High speeds might also produce motion blur in the produced images. Maximum towing speeds are also limited by depth range.

A typical value is V = 2 meters per second, which is limited by drag forces and energy efficiency⁹. Assuming A = 1,000,000 m² (one square kilometer), we can evaluate the required survey time as a function of the sensor range/swath S, shown in Figure 1.1. For S = 10 m, 13.8 hours are required, and this value drops to 42 minutes with S = 200 m.

Figure 1.1: Survey time in minutes as a function of sensor range for a 1 km² patch and V = 2 m/s.

Assuming that there are N marine debris elements on the seafloor per square kilometer, then every T/N seconds a piece of marine debris will be found. As mentioned before, for a shallow coastal area N ∈ [13.7, 320], and taking T = 42 × 60 s, T/N is in the range [7.8, 184.0] seconds. The lower bound implies that for the densest debris distribution, one piece will be found every 8 seconds. This motivates a real-time computational implementation, where processing one frame should take at most approximately one second. This requirement also depends on the sensor, as Forward-Looking sonars can provide data at up to 15 frames per second, also requiring a real-time perception implementation, up to 1/15 ≈ 66.6 milliseconds per frame.

9 Thor I Fossen. Handbook of marine craft hydrodynamics and motion control. John Wiley & Sons, 2011.
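To make the arithmetic above concrete, the short Python sketch below reproduces Equation 1.1 and the expected time between debris encounters; the variable names are ours, and the script only restates the rounded figures quoted in the text.

```python
# Survey-time feasibility sketch for Equation 1.1: T = A / (V * S).
A = 1_000_000.0   # surveyed area in m^2 (one square kilometer)
V = 2.0           # AUV speed in m/s

for S in (10.0, 200.0):             # sensor swath in meters
    T = A / (V * S)                 # survey time in seconds
    print(f"S = {S:5.0f} m -> T = {T / 3600:.1f} h ({T / 60:.0f} min)")

# Time between debris encounters for N pieces per km^2,
# using the shallow coastal estimate N in [13.7, 320].
T = A / (V * 200.0)                 # best case: S = 200 m, T is about 42 min
for N in (13.7, 320.0):
    print(f"N = {N:5.1f} per km^2 -> one piece every {T / N:.1f} s")
```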
in Figure 1.2. With the ARIS Explorer 3000 we can expect at most 33-125 hours of battery life, while with the more power-consuming Kraken Aquapix SAS we can expect 14-15 hours. This calculation does not include power consumption by other subsystems of the AUV, such as propulsion, control, perception processing, and autonomy, so we can take these values as strict maximums in the best case.

Figure 1.2: Battery life as a function of sensor power requirement for a 2000 Watt-Hour battery.

Since the ARIS Explorer 3000 has an approximate value S = 10 meters, surveying one km² will take 13.8 hours. In the best possible case, 2.4-9 km² of surface can be surveyed with a single battery charge. For the Kraken Aquapix SAS, which has S = 200 meters,
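The same back-of-the-envelope calculation can be scripted for battery life and surveyed area, assuming the 2000 Watt-Hour battery used above; the sensor power draws below are illustrative values chosen to reproduce the quoted 33-125 hour range, not manufacturer figures.

```python
BATTERY_WH = 2000.0   # assumed battery capacity in watt-hours
V = 2.0               # AUV speed in m/s

def battery_hours(sensor_power_w):
    # Hours of operation if the sonar were the only power consumer.
    return BATTERY_WH / sensor_power_w

def coverage_km2(hours, swath_m):
    # Area covered at speed V with swath S over the given time, in km^2.
    return hours * 3600.0 * V * swath_m / 1e6

# Illustrative power draws: roughly 16-60 W matches the quoted 33-125 h.
for power_w in (16.0, 60.0):
    h = battery_hours(power_w)
    print(f"{power_w:5.1f} W -> {h:6.1f} h -> {coverage_km2(h, 10.0):.1f} km^2")
# With S = 10 m this yields roughly 9.0 and 2.4 km^2, the best-case figures above.
```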
1.3 Scope
1.6 Contributions
2 Marine Debris As Motivation For Object Detection
This chapter describes the "full picture" that motivates this thesis. While the introduction provides a summarized version of that motivation, this chapter takes a deeper dive into the problem of polluting our natural environment, with a specific look at the pollution of water bodies.

Figure 2.1: Surface Marine Debris captured by the author at Manarola, Italy.

After reading this chapter, the reader will have a general idea of how our daily lives are affected by marine debris, and how the use of Autonomous Underwater Vehicles can help us to reduce this problem.

This chapter is structured as follows. First we define what marine debris is, what it is composed of, and where it can be found on the seafloor. Then we describe the effect of marine debris on the environment as a pollutant and its ecological consequences. We then make the scientific argument that submerged marine debris can be recovered by AUVs. Finally we close the chapter by describing a small dataset of marine debris in sonar images, which we use in the technical chapters of this thesis.

Figure 2.2: Surface Marine Debris captured by the author at the Union Canal in Edinburgh, Scotland.

The author's initial motivation regarding marine debris came from direct observation of the environment. During a couple of excursions to Loch Earn (Scotland) we observed submerged marine debris (beer and soft drink cans, tires) at the bottom of the loch, and during the daily commute to Heriot-Watt University along the Union Canal in Edinburgh, we also observed both submerged and floating marine debris. Figure 2.4 shows small submerged objects in the Union Canal, while Figure 2.3 shows large objects that were discarded in the same place.
Figure 2.3: Photos of Large Submerged Marine Debris captured by the author at the Union Canal in Edinburgh, Scotland. (a) Refrigerator (b) Traffic Cone and Pram (c) Shopping Cart.

2.1 What is Marine Debris?
Marine Debris encompasses a very large category of human-made objects that have been discarded and are either floating in the ocean, partially submerged in the water column, or fully submerged and lying on the floor of a water body. Marine Debris is mostly composed of processed materials¹ that

1 Judith S Weis. Marine pollution: what everyone needs to know. Oxford University Press, 2015.
Intentional Release. There are of course people and entities that will directly dump garbage into the ocean (usually from ships) or into rivers and lakes, without any consideration of the damage to the environment. The legality of these releases varies by country, but it is generally very hard to enforce any kind of regulation, especially in developing countries, mostly due to economic needs.
The majority of the items were found along the Monterey Canyon system, which makes sense, as debris can be dragged by underwater currents and accumulate there.

Another study, in the western gulfs of Greece by Stefatos et al.¹², counted the debris recovered by nets from fishing boats. In the Gulf of Patras 240 items per km² were found, while in the Gulf of Echinadhes 89 items per km² were recovered. Debris distributions at both sites are similar, with over 80% of recovered debris being plastics, 10% metals, and less than 5% glass, wood, nylon, and synthetics. The authors attribute the high percentage of drink packaging in Echinadhes to shipping traffic, while general packaging was mostly found in Patras, suggesting that the primary source for this site is debris carried from land by rivers.

Chiba et al.¹³ built and examined a database of almost 30 years (1989-2018) of video and photographic evidence of marine debris at deep-sea locations around the world. They found 3425 man-made debris pieces in their footage, of which more than one third were plastics and 89% were single-use products. This dataset covers deep-sea parts of the ocean, showing that marine debris has reached depths of 6-10 thousand meters, at distances of up to 1000 km from the closest coast. In the North-Western Pacific Ocean, up to 17-335 debris pieces per km² were found at depths of 1000-6000 meters. The deepest piece of debris was found in the Mariana Trench at 10898 meters deep. Their survey shows that plastic debris from land-based sources accumulates in the deepest parts of the ocean, and they suggest that a way to monitor debris is needed.

Jambeck et al.¹⁴ studied the transfer of plastic debris from land to sea. The authors built a model that estimates, to within an order of magnitude, the contribution of each coastal country to global plastic marine debris. The model considers plastic resin production, population growth, mass of plastic waste per capita, and the ratio of

12 A Stefatos, M Charalampakis, G Papatheodorou, and G Ferentinos. Marine debris on the seafloor of the Mediterranean Sea: examples from two enclosed gulfs in western Greece. Marine Pollution Bulletin, 38(5):389-393, 1999.
13 Sanae Chiba, Hideaki Saito, Ruth Fletcher, Takayuki Yogi, Makino Kayo, Shin Miyagi, Moritaka Ogido, and Katsunori Fujikura. Human footprint in the abyss: 30 year records of deep-sea plastic debris. Marine Policy, 2018.
14 Jenna R Jambeck, Roland Geyer, Chris Wilcox, Theodore R Siegler, Miriam Perryman, Anthony Andrady, Ramani Narayan, and Kara Lavender Law. Plastic waste inputs from land into the ocean. Science, 347(6223):768-771, 2015.
One very important point that must be made now is that we are not proposing a "silver bullet" that can fully solve the problem of marine debris, or the general issues with waste management. The first and most obvious solution is not to pollute our environment: recycling and proper disposal of waste are key elements of any sustainable policy. Our proposal only deals with the submerged marine debris that is currently lying on the seafloor, and does not prevent further debris from being discarded into the ocean and other water bodies. Some debris also floats and does not sink, and collecting it would require a whole different set of techniques.
This section describes the data used to train deep neural networks
and to produce all the results presented in further chapters. We
describe both the data capture setup and the data itself.
There are no public datasets that contain marine debris in sonar images, so for the purposes of this thesis a dataset of sonar images with bounding box annotations is required. We captured such a dataset in the water tank of the Ocean Systems Lab, Heriot-Watt University, Edinburgh, Scotland. The water tank measures approximately (W, H, D) = 3 × 2 × 4 meters, and in it we submerged the Nessie AUV with a sonar sensor attached to the underside, as shown in Figure 2.6.

Figure 2.6: Nessie AUV with ARIS Sonar attached to the underside.

2.4.1 Sonar Sensing

The sonar sensor used to capture data was the ARIS Explorer 3000³², built by SoundMetrics, which is a Forward-Looking Sonar, but can

32 SoundMetrics. ARIS Explorer 3000: See what others can't, 2018. Accessed 1-9-2018. Available at http://www.soundmetrics.com/products/aris-sonars/ARIS-Explorer-3000/015335_RevC_ARIS-Explorer-3000_Brochure
also means that the acoustic lens moves inside the sonar in order to focus different parts of the scene, and the focusing distance can be set by the user.

There is some publicly available information about the use of acoustic lenses with the DIDSON sonar. Belcher et al. in 1999³⁵ showed three high-frequency sonar prototypes using acoustic lenses, and in 2001³⁶ they showcased how the DIDSON sonar can be used for object identification. A Master's thesis by Kevin Fink³⁷ produced computer simulations of sound waves through an acoustic lens beamformer. Kamgar-Parsi et al.³⁸ describe how to perform sensor fusion from multiple views of the scene obtained by a moving acoustic lens.

35 Edward Belcher, Dana Lynn, Hien Dinh, and Thomas Laughlin. Beamforming and imaging with acoustic lenses in small, high-frequency sonars. In OCEANS'99 MTS/IEEE, volume 3, pages 1495-1499. IEEE, 1999.
36 Edward Belcher, Brian Matsuyama, and Gary Trimble. Object identification with acoustic lenses. In OCEANS 2001 MTS/IEEE, volume 1, pages 6-11. IEEE, 2001.
37 Kevin Fink. Computer simulation of pressure fields generated by acoustic lens beamformers. Master's thesis, University of Washington, 1994.
by dropping them into the water, and manually adjusting position
covers far more objects than we did, we believe this set of objects is appropriate for the scope of this thesis.

We also included a set of marine objects as distractors (or counter-examples), namely a chain, a hook, a propeller, a rubber tire, and a mock-up valve. These objects can be expected to be present in a marine environment⁴¹ due to fishing and marine operations, and are not necessarily debris.

The object set was pragmatically limited to the objects we could easily get and that were readily available at our lab. There is a slight imbalance between the marine debris and distractor objects, with approximately 10 object instances for marine debris and 5 instances of distractors.

41 Paul K Dayton, Simon F Thrush, M Tundi Agardy, and Robert J Hofman. Environmental effects of marine fishing. Aquatic Conservation: Marine and Freshwater Ecosystems, 5(3):205-232, 1995.
A summary of the object classes is shown in Table 2.1. Bottles of different materials that lie horizontally on the tank floor are grouped into a single class, but a beer bottle that was standing on the bottom was split into its own class, as it looks completely different in a sonar image. A shampoo bottle was also found standing on the tank bottom and was assigned its own class. The rest of the objects map directly to classes in our dataset. We also have one additional class called background that represents anything that is not an object in our dataset, which is typically the tank bottom. Due to the high
We were not able to label all the objects in each image, as some objects looked blurry and we could not determine their class, but we labeled all the objects whose class could be easily and clearly recognized by the human annotator. In total 2364 objects are annotated in our dataset.

Figure 2.13: Sample of Shampoo Bottle Class.

Most of the objects we used for this dataset did not produce a shadow, depending on the perspective and size of the object. For the purpose of bounding box labeling, we decided to always include the highlight of the object, as it is the most salient feature, but we decided not to include the shadow of most objects in the bounding box, as there is a large variability in the shadows of small objects (sometimes present and sometimes completely absent).
For two object classes we always included the shadow, namely the standing bottle and the shampoo bottle, as sometimes the highlight of the object is easily confused with the background, but the shadow should allow for discrimination of this kind of object. The drink carton is an object that often had a shadow that we did not label. More detailed image crops of these objects are presented in Appendix Figures A.8, A.7, and A.4.

Figure 2.15: Sample of Standing Bottle Class.

We note that this kind of labeling is not ideal, as there is a bias from not always labeling shadows. We made the practical decision
producing blurry objects that are not in focus. We can also see the walls of the water tank as very strong reflections of a linear structure.

Figure 2.18 shows the count distribution of the labels in our dataset. It is clearly unbalanced, as we made no effort to build a balanced dataset, but this did not prove to be an issue during learning, as there is no dominant class that would make learning fail.

Figure 2.17: Sample of Valve Class.

In the Appendix, Figures A.2 to A.11 show randomly selected image crops of each class, with a variable number depending on the size of the object in order to fill one page. This shows the intra- and inter-class variability of our dataset, which is not high, especially for the intra-class variability, due to the low number of object instances that we used.
Figure panels (cont.): (d) Drink Carton, (e) Hook and Propeller, (f) Shampoo Bottle.
3 Machine Learning Background
advances in the field starting from 2012. Small contributions like the Rectified Linear Unit (ReLU) or the ADAM optimizer, and bigger ones like Dropout and Batch Normalization, have made it possible to push the limits of neural networks.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org

We aim to fill the gap in that common knowledge, and to provide a self-contained thesis that can be read by specialists who do not know neural networks in detail. We also made the effort to cover in detail many practical issues that arise when training neural networks, such as how to tune hyper-parameters, the proper machine learning model iteration loop from the point of view of the neural network designer, and best practices when using machine learning models.

We also summarize my experience training neural networks, as it is a process that contains equal parts science and art. The artistic part is quite a problem, as it introduces "researcher degrees of freedom" that could skew the results. One must always be aware of this.

Training a neural network is not an easy task, as it requires at a minimum the following steps:
detect overfitting.
Testing After the network has been trained and the loss indicates
convergence, then the network must be tested. Many compu-
tational frameworks include the testing step explicitly during
training, as it helps to debug issues. The testing step requires a
validation set and metrics on this set are reported continuously
during training.
z(x) = ∑_i w_i x_i + b = w · x + b
a(x) = g(z(x))    (3.1)

Where x is an input vector of n elements, w is a learned weight vector, and b is a scalar bias that completes an affine transformation of the input. g(x) is a scalar activation function, which is intended
z_1 = Θ_1 · x + B_1
a_1 = g(z_1)
z_2 = Θ_2 · a_1 + B_2
a_2 = g(z_2)
z_3 = Θ_3 · a_2 + B_3
a_3 = g(z_3)
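As a minimal illustration of the forward pass in Equations 3.1 and the layered composition above, the NumPy sketch below evaluates a three-layer network with ReLU activations; the layer sizes and random weights are placeholders only.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                      # input vector

# One weight matrix Theta_i and bias vector B_i per layer.
shapes = [(8, 4), (8, 8), (3, 8)]           # arbitrary layer sizes
thetas = [rng.normal(scale=0.1, size=s) for s in shapes]
biases = [np.zeros(s[0]) for s in shapes]

a = x
for theta, b in zip(thetas, biases):
    z = theta @ a + b                       # z_i = Theta_i . a_{i-1} + B_i
    a = relu(z)                             # a_i = g(z_i)
print(a)                                    # network output a_3
```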
Learning Rate. Setting the right value of the learning rate is a key factor in using gradient descent successfully. Typical values of the learning rate are α ∈ [0, 1], and common starting values are 10⁻¹ or 10⁻². It is important that the learning rate is set to the right value before training: a larger than necessary learning rate can make the optimization process fail (by overshooting the optimum or making the process unstable), a lower learning rate will converge slowly to the optimum, and the "right" learning rate will make the process converge at an appropriate speed. The learning rate can also be changed during training; this is called a Learning Rate Schedule. Common approaches are to
Loss Surface. The geometry and smoothness of the loss surface is also key to good training, as it defines the quality of the gradients. The ideal loss function would be convex in the network parameters, but this is typically not the case for outputs produced by DNNs. Non-convexity leads to multiple local optima where gradient descent can become "stuck". In practice a non-convex loss function is not a big problem, and many theoretical results show that the local optima in deep neural networks are very similar and close in terms of loss value.
Where x_{i:j} denotes that the neural network h_Θ(x) and the loss function

value and using values (i, j) ∈ {(0, B), (B, 2B), (2B, 3B), . . . , (cB, n)}. Note that not all batches have the same size, because B might not divide |Tr| exactly; the last batch can be smaller. A variation of MGD is Stochastic Gradient Descent (SGD), where B is simply set to one. After approximately |Tr|/B iterations of gradient descent, the learning process will have "seen" the whole dataset. This is called an Epoch, and corresponds to a single pass over the training data.
The most basic loss function is the mean squared error (MSE), typically used for regression:

MSE(ŷ, y) = n⁻¹ ∑_{i=0}^{n} (ŷ_i − y_i)²    (3.5)

The MSE loss penalizes predicted values ŷ that diverge from the ground truth values y. The error is defined simply as the difference between ŷ and y, and squaring is done to obtain a smooth positive value. One problem with the MSE is that, due to the square term, large errors are penalized more heavily than smaller ones. This produces a practical problem where using the MSE loss might lead the output to converge to the mean of the ground truth values instead of predicting values close to them. This issue can be reduced by using the Mean Absolute Error (MAE), which is just the mean of the absolute values of the errors:

MAE(ŷ, y) = n⁻¹ ∑_{i=0}^{n} |ŷ_i − y_i|    (3.6)

The MSE is also called the L2 loss, while the MAE is called the L1 loss, both named after the order of the norm applied to the errors. Note that the MAE/L1 loss is not differentiable at the origin, but generally this is not a big issue. The L1 loss recovers the median of the targets, in contrast to the mean recovered by the L2 loss.
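Both losses are one-line NumPy expressions; the toy example below uses an arbitrary target vector with one outlier to show how much more heavily the MSE penalizes it compared to the MAE.

```python
import numpy as np

y     = np.array([0.0, 1.0, 2.0, 10.0])   # ground truth with one outlier
y_hat = np.array([0.1, 0.9, 2.1,  2.0])   # predictions

mse = np.mean((y_hat - y) ** 2)            # Eq. 3.5, quadratic penalty
mae = np.mean(np.abs(y_hat - y))           # Eq. 3.6, linear penalty
print(mse, mae)                            # 16.0075 versus 2.075
```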
For classification, the cross-entropy loss function is preferred, as it produces a much smoother loss surface and does not have the outlier weighting problems of the MSE. Given a classifier that outputs a probability value ŷ_c for each class c, the categorical cross-entropy loss function is defined as:

CE(ŷ, y) = − ∑_{i=0}^{n} ∑_{c=0}^{C} y_ic log ŷ_ic    (3.7)

Minimizing the cross-entropy between the ground truth probability distribution and the predicted distribution is equivalent to minimizing the Kullback-Leibler divergence⁹. For the case of binary classification there is a simplification, usually called the binary cross-entropy:

BCE(ŷ, y) = − ∑_{i=0}^{n} [y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i)]    (3.8)

In this case ŷ is the probability of the positive class.

9 David JC MacKay. Information theory, inference and learning algorithms. Cambridge University Press, 2003.
Rectified Linear Unit (ReLU). This activation function has a constant output of zero for negative
g(x)_i = e^{x_i} / ∑_j e^{x_j}    (3.10)

Given a softmax output a, the class decision can be obtained by:

c = arg max_i a_i    (3.11)

Figure 3.2: Non-Saturating Activation Functions (ReLU and Softplus).

Looking at Equation 3.10 one can see that softmax outputs are "tied" by the normalization value in the denominator. This produces a comparison operation between the inputs, and the biggest softmax output will always be located at the largest input relative to the other inputs. Inputs to a softmax are typically called logits. As the softmax operation is differentiable, its use as an activation function produces a loss surface that is easier to optimize. Softmax combined with a categorical cross-entropy loss function is the basic building block used to construct DNN classifiers.
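A NumPy sketch of Equations 3.7, 3.10 and 3.11 follows, using arbitrary logits and a one-hot label; the max-shift inside the softmax is only for numerical stability.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())      # shift for numerical stability
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])        # arbitrary network outputs
y_hat = softmax(logits)                    # Eq. 3.10, class probabilities
c = int(np.argmax(y_hat))                  # Eq. 3.11, class decision

y = np.array([1.0, 0.0, 0.0])              # one-hot ground truth
ce = -np.sum(y * np.log(y_hat))            # Eq. 3.7 for a single sample
print(y_hat, c, ce)
```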
As the ReLU is not differentiable at x = 0 and it has constant zero
output for negative inputs, this could produce a new kind of problem
called "dying ReLU", where neurons that use ReLU can stop learning
completely if they output negative values. As the activations and
gradients become zero, the neuron can "get stuck" and not learn
anymore. While this problem does not happen very often in practice,
it can be prevented by using other kinds of activation functions like
the Softplus function, which can be seen as a "softer" version of
the ReLU that only has a zero gradient as the limit when x → −∞.
Figure 3.2 shows the Softplus versus the ReLU activation functions.
The Softplus function is given by:
w ∼ U (−s, s) (3.14)
w ∼ N (0, σ) (3.15)
issues modeling.

The scale of the input features is an important issue: if inputs have different ranges, then the weights associated with those features will be on different scales. Since we usually use fixed learning rates, this leads to the problem that some parts of a neuron learn at different speeds than others, and this issue propagates through the network, making learning harder as the network becomes deeper.

The scale of the outputs poses a different but easier problem. The designer has to make sure that the range of the activation of the output layer matches the range of the desired targets. If these do not match, then learning will be poor or not possible at all. Matching the ranges will make sure that learning happens smoothly.
x̂ = (x − min_i x_i) / (max_i x_i − min_i x_i)    (3.17)

Z-Score Normalization. Subtract the sample mean µ_x and divide by the sample standard deviation σ_x. This produces values that are approximately in the [−1, 1] range.

x̂ = (x − µ_x) / σ_x    (3.18)

µ_x = n⁻¹ ∑ x_i        σ_x = √( (n − 1)⁻¹ ∑ (x_i − µ_x)² )    (3.19)
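Both scalings are straightforward in NumPy when applied column-wise to a feature matrix; the toy matrix below is arbitrary.

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [4.0, 100.0]])               # toy features with very different ranges

# Min-max normalization (Eq. 3.17): values end up in [0, 1].
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Z-score normalization (Eqs. 3.18-3.19): zero mean, unit sample variance.
X_z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

print(X_minmax)
print(X_z)
```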
3.1.6 Regularization
and letting others pass unchanged. This is called the dropout mechanism. During training the masks at each Dropout layer are randomly sampled at each iteration, meaning that these masks change during training and mask different activations at an output. This breaks any correlations between activations in one layer and the one before it (where the Dropout layer is placed), meaning that stronger features can be learned and co-adaptation of neurons is prevented.

At inference or test time, Dropout layers do not perform any stochastic dropping of neurons; instead they just multiply any incoming activation by p, which accounts for all activations being present during inference, unlike at training time. This also prevents any kind of stochastic effect during inference. It should also be noted that Dropout can be used with its stochastic behavior at inference time, which produces very powerful model uncertainty estimates, as shown by Gal et al. in 2015²¹.

Using Dropout layers in a neural network, typically before fully connected ones, has the effect of reducing overfitting and improving generalization significantly.

21 Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. arXiv preprint arXiv:1506.02142, 2, 2015.
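A sketch of this behavior, where p denotes the probability of keeping an activation: a fresh random mask is sampled during training, and activations are deterministically scaled by p at inference. Keeping the stochastic branch active at test time yields the Monte Carlo uncertainty estimates mentioned above.

```python
import numpy as np

def dropout(a, p, training, rng=np.random.default_rng()):
    # p is the probability of keeping an activation.
    if training:
        mask = rng.random(a.shape) < p     # sample a fresh mask each call
        return a * mask                    # drop a fraction (1 - p) of units
    return a * p                           # deterministic scaling at test time

a = np.ones(10)
print(dropout(a, p=0.5, training=True))    # roughly half the units zeroed
print(dropout(a, p=0.5, training=False))   # all units scaled by 0.5
```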
y_i = γ_i x̂_i + β_i    (3.22)

schemes have appeared in the literature, such as Layer Normalization²⁵, Instance Normalization, and Group Normalization²⁶. These

26 Yuxin Wu and Kaiming He. Group normalization. arXiv preprint arXiv:1803.08494, 2018.
them add a term λ ∑_i ||w_i||_p to the loss function (weight decay), which penalizes large weights that are not supported by evidence from the data; p is the order of the norm that is computed over the weights.
3.1.7 Optimizers
r_n = ρ r_{n−1} + (1 − ρ) g_n ⊙ g_n
Θ_{n+1} = Θ_n − α g_n / (ε + √r_n)    (3.24)

Where ρ is a decay parameter that controls the weight of past squared gradients through the moving average; usually it is set to 0.9 or 0.99. RMSProp is more stable than AdaGrad due to better control of the history of squared gradients, and allows a model to reach a better optimum, which improves generalization. It is regularly used by practitioners as one of the first methods to try when training a model.

Another advanced optimizer algorithm is Adam³², which stands for adaptive moment estimation.

32 Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
s_n = ρ_1 s_{n−1} + (1 − ρ_1) g_n
r_n = ρ_2 r_{n−1} + (1 − ρ_2) g_n ⊙ g_n
Θ_{n+1} = Θ_n − α (√(1 − ρ_2ⁿ) / (1 − ρ_1ⁿ)) · s_n / (ε + √r_n)    (3.25)

Where s_n is the biased estimate of the gradient and r_n is the biased estimate of the squared gradients, both obtained with an exponential moving average with different decay rates ρ_1 and ρ_2. The factors 1 − ρ_1ⁿ and 1 − ρ_2ⁿ are used to correct the bias in the exponential moving averages. These computations are done component-wise.

Overall Adam performs considerably better than RMSProp and
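A single parameter update for both optimizers, written directly from Equations 3.24 and 3.25; the gradient is a placeholder and the decay rates are the commonly used defaults rather than values prescribed here.

```python
import numpy as np

alpha, eps = 0.001, 1e-8
rho = 0.9                        # RMSProp decay for squared gradients
rho1, rho2 = 0.9, 0.999          # Adam decay rates

theta = np.zeros(3)              # parameters being optimized
g = np.array([0.1, -0.5, 2.0])   # placeholder gradient at step n = 1
n = 1

# RMSProp update (Eq. 3.24), starting from r_0 = 0
r_rms = rho * 0.0 + (1 - rho) * g * g
theta_rmsprop = theta - alpha * g / (eps + np.sqrt(r_rms))

# Adam update (Eq. 3.25), starting from s_0 = r_0 = 0
s = rho1 * 0.0 + (1 - rho1) * g
r = rho2 * 0.0 + (1 - rho2) * g * g
theta_adam = theta - alpha * (np.sqrt(1 - rho2**n) / (1 - rho1**n)) * s / (eps + np.sqrt(r))

print(theta_rmsprop, theta_adam)
```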
the Training set, the Validation set, and the Test set. The fractions for each split vary, but it is ideal to make the biggest split the training set, at least 50% of the available data, and use the rest in equal splits for the validation and test sets.

The validation set is used to evaluate performance during hyper-parameter selection, and only after fixing these values is a final evaluation on the test set performed. This prevents any kind of bias in the samples of the training or validation set from affecting conclusions about model performance.
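A possible way to produce such splits is to shuffle sample indices and cut them by the chosen fractions, as in the sketch below (50%/25%/25% is only an example).

```python
import numpy as np

n = 2000                                   # number of available samples
idx = np.random.permutation(n)

n_train = int(0.50 * n)                    # at least half for training
n_val = (n - n_train) // 2                 # remainder split equally

train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]
print(len(train_idx), len(val_idx), len(test_idx))   # 1000 500 500
```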
Overfitting is the problem where the model learns unwanted
patterns and/or noise from the training data, and fails to generalize outside of its training set. Detecting overfitting is key when training any machine learning model, and is the reason why validation or test sets are used, as they are the only known way to detect overfitting. During the training process, the loss and metrics on the training set are typically tracked and displayed to the designer, and after each epoch the loss and associated metrics can be computed on the validation set. The overall pattern of both training and validation loss tells a story about what is happening; we cover three cases:
Training Loss Not Decreasing, Validation Loss Not Decreasing. The model is not overfitting, but this indicates that the model does not fit the data. A model with more learning capacity might be needed, as the current model cannot really predict the data given the input features; for example, when fitting a linear model to data with a quadratic shape. This case might also indicate that the input features are not well correlated with the desired output, or that the learning problem is ill-defined.
Learning Rate. This parameter controls the "speed" at which learning is performed, as it scales the gradient, which effectively makes it a kind of step size in the parameter space. Valid learning rate values are typically in the [0, 1] range, but small values are mostly used in practice. If a large LR is used, learning could diverge (producing infinite or NaN loss values); if a too small LR is used, learning happens but very slowly, taking a large number of epochs to converge. The right LR value will produce fast learning, with a loss curve that resembles an exponential decay. Figure 3.3 shows typical loss curves with different learning rates. The high-LR case shows that, when learning does not fail outright, the loss decreases and then stays approximately constant after a certain number of epochs. Note that the LR does not have to be a constant, and it can be varied during training. The typical method is to decrease the LR by a factor after a plateau of the loss curve has been detected, which potentially allows the loss to decrease further.

Figure 3.3: Effect of Learning Rate on the Loss Curve during Training (low, correct, and high LR).

The learning rate can be tuned using grid or random search, but a faster way is to guess an initial LR, train different models varying the learning rate manually, and decrease or increase it according to the previously mentioned rules. A common heuristic³⁶ is that if learning fails, the LR is decreased by a factor of ten until the loss starts to decrease consistently, and then adjusted further in small steps to produce the ideal loss curve. Typical learning rates used in the literature are negative powers of 10, like α = [0.1, 0.01, 0.001].

36 Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org
Note that both the value of the learning rate and the number of epochs depend on the actual loss function that is being minimized; any change to the loss implies re-tuning both parameters, as the loss surface or landscape is what defines the appropriate learning rate and length of training.
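If a framework such as Keras is used, the decrease-on-plateau heuristic is available as a ready-made callback; the factor and patience values below are illustrative choices, not settings used in this thesis.

```python
from keras.callbacks import ReduceLROnPlateau

# Halve the learning rate whenever the validation loss has not
# improved for 5 consecutive epochs (plateau heuristic).
schedule = ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                             patience=5, min_lr=1e-5)

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[schedule])
```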
y = f (x ∗ F + b) (3.27)
the input image, and the result after applying bias and activation
function is stored in the channels dimension of the output, stacking
all feature maps into a 3D volume. Convolutional layers can also
take feature maps as inputs, which forms the feature hierarchy
previously mentioned.
Another important detail of a convolutional layer is that both the bias and the weights of the filter are learned using gradient descent. They are not hand-tuned as was previously done in image processing, for example to build edge detection or sharpening filters. The filter in a convolutional layer is not necessarily a two-dimensional matrix: when the input image or feature map has K > 1 channels, the filter must have a matching shape (W, H, K) so that convolution is possible. When the filter has multiple channels, convolution is performed individually for each channel using the classical convolution operation from image processing⁴².

The filter size (width and height) in a convolutional layer is a hyper-parameter that must be tuned for specific applications, and generally it must be an odd integer; typical values are 3 × 3 or 5 × 5, but some networks such as AlexNet⁴³ used filter sizes up to 11 × 11. The width and height of a filter do not have to be the same, but generally square filters are used.

Convolution is normally performed with a stride of 1 pixel, meaning that the convolution sliding window is moved by one pixel at a time, but different strides can also be used, which amounts to a kind of sub-sampling of the feature map.

The output dimensions of a convolutional layer are defined by the filter sizes, as convolution is typically only performed for pixels

42 Rafael C. Gonzalez and Richard E. Woods. Digital Image Processing (3rd Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2006.
43 Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
that lie inside the image region, and out-of-bounds pixels are not considered (at the edges of the image). Padding can be added to the input image or feature map in order to output the same spatial dimensions as the input.

For an N × N filter and a W × W input image or feature map, with padding of P pixels and stride S, the output has dimensions O:

O = (W − N + 2P) / S + 1    (3.28)
y = max_{x∈R} x    (3.29)

y = D⁻² ∑_{x∈R} x    (3.30)
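Equation 3.28 and the 2 × 2 pooling operations of Equations 3.29-3.30 can be written as small NumPy helpers; the input sizes below are arbitrary.

```python
import numpy as np

def conv_output_size(W, N, P=0, S=1):
    # Eq. 3.28: output width of an N x N filter on a W x W input.
    return (W - N + 2 * P) // S + 1

print(conv_output_size(96, 5))             # 92: valid 5x5 convolution
print(conv_output_size(96, 5, P=2))        # 96: "same" padding

def pool2x2(x, mode="max"):
    # Eqs. 3.29-3.30 for non-overlapping 2x2 regions (D = 2).
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    blocks = x[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(pool2x2(x))                          # 2x2 max-pooled feature map
print(pool2x2(x, mode="avg"))              # 2x2 average-pooled feature map
```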
LeNet was a big innovation for its time, since it obtains a 0.95% error rate on the MNIST dataset (corresponding to 99.05% accuracy), which is very close to human performance. Other kinds of classifiers, such as K-Nearest-Neighbors with Euclidean distance, obtain a 5% error rate, which shows the advantages of a CNN.

LeNet set the initial standard for CNN design: convolution and max-pooling blocks that are repeated a certain number of times and perform feature extraction, followed by a couple of fully connected layers that perform classification or regression on those learned features. The network can be trained end-to-end using gradient descent.

A second milestone in CNNs is AlexNet⁴⁷, which is one of the first real deep neural networks trained on a large-scale dataset. This network was designed to compete in the ImageNet Large Scale Visual Recognition Challenge⁴⁸, where the task is to classify variable-sized images over 1000 different classes, with a training set containing 1.2 million images. It is a very difficult task due to the large training set, the large number of classes, and considerable visual confusion between classes⁴⁹.

47 Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
48 Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.
proven by the margin over the second place, which obtained around 10% lower top-5 accuracy than AlexNet.

The architecture of AlexNet is shown in Figure 3.2; the network has 15 layers and approximately 60 million trainable parameters. AlexNet obtains 83.6% top-5 accuracy on the ImageNet 2012 dataset, while the second place winner of the same competition obtains 73.8% top-5 accuracy, showing the superior performance and capability of a deep neural network.

Progress in the ImageNet competition has been constant over the years, producing advances in CNN architecture engineering. Pretty much all of the contenders after 2012 were using CNNs. In 2014 the VGG group at Oxford made a deeper version of AlexNet, which is typically just called VGG⁵⁰, with over 144 million parameters, obtaining 92% top-5 accuracy. The VGG networks use a simpler structure, with only 3 × 3 filters, combining two consecutive 3 × 3 convolutions to simulate a bigger 5 × 5 filter.

50 Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
3.2.4 Discussion
4 Forward-Looking Sonar Image Classification
about the features they used, namely "profiles", perimeter, area, and pixel statistics of the regions. A Radial Basis Function classifier with a Gaussian kernel

Additional features are computed with Principal Component Analysis (PCA), by projecting each image onto the first 5 principal components, which produces an additional 5-dimensional feature vector. Different combinations of features are tested. 90% classification accuracy is obtained with shadow, highlight, and PCA features. Other combinations of features performed slightly worse, but notably shadow-only features obtained considerably worse accuracy (around 40%). An interesting observation made in this work is that using the normalized image pixels as a feature vector obtains almost the same classification performance as the more complex set of features.
Myers and Fawcett⁶ proposed the Normalized Shadow-Echo Matching (NSEM) method for object detection in SAS images. NSEM first requires segmenting the sonar image into bright echo, echo, background, shadow, and dark shadow. Fixed values are assigned to each segmentation class (in the range [−1, 1]). Then similarity is computed with a custom correlation function, shown in Eq. 4.3, where I and T are the segmented and post-processed test and template images, I ⊗ T = ∑ T ⋆ I is the standard cross-correlation operator, I_E/T_E are the highlight components of the corresponding images, and I_S/T_S are the shadow components. The bar operation inverts the image, setting any non-zero element to zero and any zero value to one.

f(T, I) = (I ⊗ T_E) / (1 + I_E ⊗ T̄_E) + (I ⊗ T_S) / (1 + I_S ⊗ T̄_S)    (4.3)

The final classification is performed by outputting the class of the template with the highest similarity score as given by Eq. 4.3. This method can also be used as an object detector by setting a minimum similarity threshold to declare a detection.

6 Vincent Myers and John Fawcett. A template matching procedure for automatic target recognition in synthetic aperture sonar imagery. Signal Processing Letters, IEEE, 17(7):683-686, 2010.
This method was tested on a dataset of MLOs, namely Cylinder, Cone, Truncated Cone, and Wedge shapes. Target templates were generated using a SAS simulator, adding multiple views by rotating the objects. Objects are correctly classified with accuracies in the range 92-97%, which is slightly higher than the reported baselines using normalized cross-correlation on the raw image (62-92%) and on segmented images (81-95%). This method performs quite well as reported, but it requires a segmentation of the input image, which limits its applicability to marine debris.
Sawas et al.⁷ ⁸ propose boosted cascades of classifiers for object detection in SAS images. This method can also be used for classification, as its core technique (AdaBoost) is a well-known ML classification framework. Haar features are quite similar to the shadow-highlight structure present in typical sonar images, which is what motivates their use. Haar features have the additional advantage that they can be efficiently computed using summed-area tables.

A boosted cascade of classifiers⁹ is a set of weak classifiers¹⁰ that are stacked in a cascade fashion. The basic idea, originally introduced by Viola and Jones¹¹ for face detection, is that weak classifiers at the beginning of the cascade can be used to quickly reject non-face windows, while classifiers close to the end of the cascade can concentrate on more specific features for face detection. AdaBoost is then used to jointly train these classifiers. This structure produces a very efficient algorithm, as the amount of computation varies with each stage and depth in the cascade (deep classifiers can use more features, while shallow classifiers can use fewer), but fast-to-compute features are required, as feature selection is performed during training.

Sawas et al. propose an extension to the classic Haar feature, where a long shadow area with a small highlight is used as a Haar feature, matching the signature of the MLOs that are used as targets. On a synthetic dataset with Manta, Rockan, and Cylinder objects, the authors obtain close to 100% accuracy for the Manta, but

7 Jamil Sawas, Yvan Petillot, and Yan Pailhas. Cascade of boosted classifiers for rapid detection of underwater objects. In Proceedings of the European Conference on Underwater Acoustics, 2010.
8 Jamil Sawas and Yvan Petillot. Cascade of boosted classifiers for automatic target recognition in synthetic aperture sonar imagery. In Proceedings of Meetings on Acoustics ECUA2012. ASA, 2012.
9 Christopher Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
10 Classifiers with low accuracy, but computationally inexpensive.
11 Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I-I. IEEE, 2001.
backgrounds.

The authors use a boosted cascade of weak classifiers (same as Sawas et al.) with Haar and Local Binary Pattern (LBP) features. For Haar features, classifiers that were trained on real data have an accuracy advantage of around 3% over classifiers that were trained on semi-synthetic data, but the classifiers trained on real data saturate at 92-93% correct classification. For LBP features, the difference is more drastic, as a classifier trained on real data has a 20% accuracy advantage over one trained on synthetic data. A classifier trained on real data obtains close to 90% accuracy, while synthetic classifiers obtain 70%. Some specific configurations of semi-synthetic data generation can improve accuracy to the point of a 6% difference versus the real classifier.
David Williams¹⁸ performs target classification in SAS images using a CNN with sigmoid activations. His dataset contains MLO-like objects for the positive class (cylinders, wedges, and truncated cones), while the negative class contains distractor objects like rocks, a washing machine, a diving bottle, and a weighted duffel bag,

18 David P Williams. Underwater target classification in synthetic aperture sonar imagery using deep convolutional neural networks. In Pattern Recognition (ICPR), 2016 23rd International Conference on, pages 2497-2502. IEEE, 2016.
4.1.1 Discussion
• MLOs have mostly convex shapes, while marine debris can have shapes with concave parts. This has the effect of producing strong reflections in the sonar image, and consequently a much stronger viewpoint dependence. Simply put, objects look quite different in the sonar image if you rotate them.
Classic Module
This is the most common module used by CNNs, starting from LeNet²⁵. It consists of one convolution layer followed by a max-pooling layer. Hyper-parameters are the number of filters f and the size of the convolutional filters s. In this work we set s = 5 × 5. This module is shown in Figure 4.1a.

25 Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
Fire Module
The Fire module was introduced by Iandola et al.²⁶ as part of SqueezeNet. The basic idea of the Fire module is to use 1 × 1 convolutions to reduce the number of channels and 3 × 3 convolutions to capture spatial features. This module is shown in Figure 4.1b. The initial 1 × 1 convolution is used to "squeeze" the

26 Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
Tiny Module
The Tiny module was designed as part of this thesis. It is a
modification of the Fire module, removing the expand 1 × 1
convolution and adding 2 × 2 Max-Pooling into the module itself.
The basic idea of these modifications is that by aggressively using
Max-Pooling in a network, smaller feature maps require less
computation, making a network that is more computationally
efficient. This module is shown in Figure 4.1c. The Tiny module
has one hyper-parameter, the number of convolutional filters f ,
which is shared for both 1 × 1 and 3 × 3 convolutions.
MaxFire Module
This is a variation of the Fire module that includes two Fire
modules with the same hyper-parameters and one 2 × 2 Max-
Pooling inside the module. It is shown in Figure 4.1d and has the
same hyper-parameters as a Fire module.
All modules in Figure 4.1 use ReLU as activation. We designed
four kinds of neural networks, each matching a kind of module.
Networks are denominated as ClassicNet, TinyNet and FireNet.
While FireNet is quite similar to SqueezeNet, we did not want to
use that name as it refers to a specific network architecture that uses
the Fire module. Our FireNet uses the MaxFire module instead.
To build ClassicNet, we stack N Classic modules and add two fully connected layers as classifiers. This corresponds to a configuration FC(64)-FC(C), where C is the number of classes. The first fully connected layer uses a ReLU activation, while the second uses a softmax in order to produce class probabilities. This architecture can be seen in Figure 4.2.

Figure 4.1: Basic Convolutional Modules that are used in this Chapter. (a) Classic Module (b) Fire Module (c) Tiny Module (d) MaxFire Module.

Figure 4.2: ClassicNet Network Architecture. Input → Conv2D(f, 5 × 5) → Max-Pool(2 × 2) (n instances) → FC(64) → FC(C) → Class Probabilities.
FireNet is built in a similar way, but differs from ClassicNet: this network contains an initial convolution to "expand" the number of available channels, as sonar images are single channel. Then N MaxFire modules are stacked, and a final convolution is used to change the number of channels to C. Then global average pooling²⁷ is applied to reduce feature maps from any size to 1 × 1 × C. FireNet is shown in Figure 4.4. TinyNet is similarly constructed, but it does not have an initial convolution. It contains a stack of n Tiny modules with a final 1 × 1 convolution to produce C output channels. Global average pooling is applied and a softmax activation is used to produce output class probabilities. TinyNet is shown in Figure 4.3.

27 Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
Both FireNet and TinyNet do not use fully connected layers for classification; instead such layers are replaced by global average pooling and a softmax activation. This is a very different approach, but it is useful as it reduces the number of learnable parameters, reducing the chance of overfitting and increasing computational performance.
Each network is trained using the same algorithm, namely gradient descent with the ADAM optimizer²⁸, using an initial learning rate of α = 0.01 and a batch size B = 64.

28 Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
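A Keras sketch consistent with the ClassicNet description above is given below: n Classic modules (Conv(f, 5 × 5) followed by Max-Pool(2 × 2)), then FC(64) and FC(C) with softmax, trained with ADAM at α = 0.01 and batch size 64. The input shape, number of classes, and filter count are placeholders rather than the exact experimental configuration, and a Flatten layer is assumed between the convolutional stack and the fully connected classifier.

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.optimizers import Adam

def classicnet(n_modules=2, f=32, num_classes=11, input_shape=(96, 96, 1)):
    # Stack of Classic modules: Conv(f, 5x5) + MaxPool(2x2), then FC(64)-FC(C).
    # num_classes and input_shape are placeholders, not the thesis settings.
    model = Sequential()
    for i in range(n_modules):
        if i == 0:
            model.add(Conv2D(f, (5, 5), activation='relu',
                             input_shape=input_shape))
        else:
            model.add(Conv2D(f, (5, 5), activation='relu'))
        model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(64, activation='relu'))
    model.add(Dense(num_classes, activation='softmax'))
    return model

model = classicnet()
model.compile(optimizer=Adam(lr=0.01), loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(x_train, y_train, batch_size=64, epochs=20,
#           validation_data=(x_val, y_val))
```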
ClassicNet
TinyNet
For this architecture we also evaluate up to 6 modules, but only 4 and 8 filters. The main reason driving the number of filters is to minimize the total number of parameters, as these networks were designed for fast execution in embedded devices³¹.

31 Matias Valdenegro-Toro. Real-time convolutional networks for sonar image classification in low-power embedded systems. In European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), 2017.

FireNet
This network was evaluated with up to 6 modules, as accuracy saturated at the maximum value with more modules. We only evaluate 4 filters per module, corresponding to s11 = e11 = e33 = 4 filters in each Fire module inside the MaxFire one.

Each network is trained in the same way, using the ADAM optimizer³² with an initial learning rate α = 0.01. ClassicNet is trained for 20 epochs, while TinyNet and FireNet are trained for 30 epochs. We train 10 instances of each network architecture for each parameter set. We do this because of random initialization, as training a single network can produce biased or "lucky" results. For each parameter set we report the mean and standard deviation of accuracy evaluated on the validation set.

32 Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
ClassicNet results are shown in Figure 4.5. We see that a choice of 32 filters seems to be the best, as it produces the highest accuracy on the validation set and learning seems to be more stable. Configurations with fewer filters seem to be less stable, as shown by the 8-filter configuration, whose accuracy decreases after adding more than 3 modules, and by the 16-filter configuration, which shows large variations in accuracy. In general it is expected that a deeper network should have better accuracy, but tuning the right number of layers/modules is not easy, as these results show.
We also compared other design choices, namely whether to use regularization (Dropout or Batch Normalization) or no regularization at all.
(Figure 4.5: ClassicNet validation accuracy (%) versus number of modules (1–6), shown for 8, 16 and 32 filters.)
(Figure: TinyNet and FireNet validation accuracy (%) versus number of modules (1–6); TinyNet with 4 and 8 filters, FireNet with 4 filters.)
P = M × W × H    (4.5)
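As a worked example of Eq. 4.5, the following arithmetic sketch assumes 96 × 96 templates and 11 object classes purely for illustration; with these assumptions the result is consistent with the 15.20 M parameters reported for 150 templates per class in Table 4.2.

# Worked example of Eq. 4.5: P = M x W x H, where M is the total number of
# stored templates. Template size and class count are assumptions.
templates_per_class = 150
num_classes = 11          # assumed
width = height = 96       # assumed template size in pixels
M = templates_per_class * num_classes
P = M * width * height
print(P)                  # 15206400, i.e. about 15.2 M parameters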
and this is reflected in the fact that many templates are required
to model such variability.
(Figure: template matching accuracy (%) versus number of templates per class (1–150), for cross-correlation (CC) and sum of squared differences (SQD).)
Table 4.2: Template matching with Cross-Correlation (CC) and Sum of Squared Differences (SQD): accuracy, number of parameters and computation time versus number of templates per class (TPC). The number of parameters is expressed in millions.

TPC | # of Params | CC Accuracy (%) | CC Time (ms) | SQD Accuracy (%) | SQD Time (ms)
1 | 0.10 M | 45.67 ± 5.99 | 0.7 ± 0.1 | 43.85 ± 4.94 | 0.2 ± 0.0
5 | 0.50 M | 68.90 ± 3.01 | 3.2 ± 0.1 | 69.47 ± 3.18 | 1.1 ± 0.0
10 | 1.01 M | 77.17 ± 2.35 | 6.3 ± 0.1 | 79.83 ± 2.67 | 2.3 ± 0.1
20 | 2.02 M | 84.21 ± 1.65 | 12.6 ± 0.1 | 88.25 ± 1.75 | 4.6 ± 0.1
30 | 3.04 M | 86.32 ± 1.52 | 18.9 ± 0.2 | 91.62 ± 1.75 | 7.0 ± 0.1
40 | 4.05 M | 88.49 ± 1.00 | 25.2 ± 0.7 | 93.76 ± 1.20 | 9.2 ± 0.1
50 | 5.07 M | 89.67 ± 1.09 | 31.4 ± 0.3 | 95.03 ± 1.02 | 11.6 ± 0.1
60 | 6.09 M | 90.39 ± 1.08 | 37.6 ± 0.3 | 96.05 ± 0.81 | 13.9 ± 0.2
70 | 7.09 M | 90.96 ± 0.81 | 43.9 ± 0.4 | 96.52 ± 0.71 | 16.2 ± 0.2
80 | 8.11 M | 91.52 ± 0.70 | 50.1 ± 0.4 | 96.96 ± 0.63 | 18.6 ± 0.2
90 | 9.12 M | 91.99 ± 0.67 | 56.5 ± 0.4 | 97.23 ± 0.55 | 20.7 ± 0.2
100 | 10.13 M | 92.10 ± 0.65 | 62.7 ± 0.5 | 97.35 ± 0.54 | 23.0 ± 0.2
110 | 11.15 M | 92.42 ± 0.67 | 68.9 ± 0.5 | 97.63 ± 0.46 | 25.2 ± 0.3
120 | 12.16 M | 92.62 ± 0.54 | 75.1 ± 0.5 | 97.80 ± 0.46 | 27.5 ± 0.3
130 | 13.17 M | 92.78 ± 0.56 | 81.3 ± 0.6 | 97.95 ± 0.34 | 29.8 ± 0.3
140 | 14.19 M | 92.91 ± 0.46 | 87.7 ± 0.6 | 97.99 ± 0.39 | 32.1 ± 0.3
150 | 15.20 M | 92.97 ± 0.47 | 93.8 ± 0.7 | 98.08 ± 0.34 | 34.6 ± 0.3

Gradient Boosting and a Random Forest. All of these classifiers are trained on normalized image pixels.

Neural Networks — We also include our best results produced by CNN classifiers, as described in Section 4.3.2.
All classifiers [33] are trained on the same training set and evaluated on the same testing set. Each classifier was tuned independently.
[33] I used the scikit-learn 0.18.1 implementation of these algorithms.
The SVM classifiers use a one-vs-one decision function, which consists of training C(C − 1)/2 SVMs, evaluating them at test time, and taking the majority decision as the class output. The linear kernel obtains its best performance with C = 0.1, while the RBF kernel uses C = 100.0. Both classifiers have 506880 parameters, considering 110 trained SVMs. The ratio of parameters to number of training data points ranges from 1200/2069 ≈ 0.58 for TinyNet(5, 4) to 15.2M/2069 ≈ 7346.5 for the template matching classifiers. ClassicCNN with Dropout has a parameter to data point ratio of 930000/2069 ≈ 449.5, which is reduced to 224.8 during the training phase due to the use of Dropout.
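A minimal sketch of this setup with scikit-learn (whose SVC implements the one-vs-one scheme internally; the C values follow the text above and the variable names are illustrative):

# Minimal sketch: one-vs-one SVMs on flattened, normalized pixel features.
from sklearn.svm import SVC

# scikit-learn's SVC internally trains C(C-1)/2 binary SVMs (one-vs-one)
# and takes the majority vote among them at prediction time.
linear_svm = SVC(kernel='linear', C=0.1)
rbf_svm = SVC(kernel='rbf', C=100.0)

# x_train: (n_samples, 96*96) normalized pixels, y_train: class labels
# linear_svm.fit(x_train, y_train)
# predictions = linear_svm.predict(x_test)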
Our comparison results are shown in Table 4.3. Classic machine learning methods perform poorly: gradient boosting and random forests obtain accuracies that are lower than simpler classifiers like a linear SVM. We expected both algorithms to perform better, given their popularity in competitions such as Kaggle, which suggested that they might generalize well with small amounts of data.
The best classic ML classifier according to our results is a linear SVM, which is surprising, but it still does not perform better than the state of the art template matching method using sum of squared differences.
The best performing method is a convolutional neural network, either TinyNet with 5 modules and 8 filters per layer or FireNet with 3 layers. There is a small difference in accuracy (0.1 %) between both networks. These results show that a CNN can be successfully trained with a small quantity of data (approximately 2000 images) and that even in this case it can outperform other methods, especially template matching with cross-correlation and ensemble classifiers like random forests and gradient boosting.
The second best performing method is also a CNN, namely the ClassicCNN. It should be noted that there is a large difference in the number of parameters between ClassicCNN and TinyNet/FireNet. Those networks are able to efficiently encode the mapping between image and class. Considering TM-SQD as the baseline, TinyNet(5, 8) is 1.18 % more accurate, and FireNet-3 outperforms it by 1.28 %. A more realistic baseline is TM-CC, as it is used in many published works [35]; in that case TinyNet(5, 8) outperforms TM-CC by 6.16 % while FireNet-3 is superior by 6.26 %.
[35] Natalia Hurtós, Narcis Palomeras, Sharad Nagappa, and Joaquim Salvi. Automatic detection of underwater chain links using a forward-looking sonar. In OCEANS-Bergen, 2013 MTS/IEEE, pages 1–7. IEEE, 2013.

4.3.5 Feature Visualization
In the previous section we established that a CNN is the best classifier for FLS images. In this section we move away from raw classification performance and instead look at the features learned by these networks.
(Table 4.3, partial: accuracy and number of parameters of ML methods for FLS image classification — Gradient Boosting 90.63 % / 9.9K parameters; Random Forest 93.17 % / 7.9K; TM with CC 93.44 % / 15.2M.)
where pij = (pj|i + pi|j) / (2n). Then stochastic gradient descent is used to
dij = ||xi − xj||    (4.9)
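Both embeddings can be obtained with scikit-learn; the following minimal sketch assumes the intermediate CNN activations have already been extracted into a features array (names are illustrative).

# Minimal sketch: 2-D t-SNE and MDS embeddings of extracted CNN features.
from sklearn.manifold import TSNE, MDS

def embed_features(features):
    # features: (n_samples, n_features) activations from a chosen layer
    tsne_2d = TSNE(n_components=2, perplexity=30.0).fit_transform(features)
    mds_2d = MDS(n_components=2).fit_transform(features)
    return tsne_2d, mds_2d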
(Figures: per-module two-dimensional feature visualizations (t-SNE and MDS), panels (a) Module 1 through (e) Module 5.)
(Figure: feature visualization for ClassicNet with Batch Normalization; panels include (d) BN4, (e) BN5 and (f) FC1.)
(Figure: feature visualization for ClassicNet with Dropout; panels (a) MP1 to (e) MP5 and (f) FC1.)
(Figure: computation time (ms) versus number of modules (1–6), for (a) the High-Power platform and (b) the Low-Power platform.)
High-Power platform results show that all networks are quite fast, with a maximum computation time of 12 milliseconds per image, which allows real-time operation at 83 frames per second. Still, TinyNet is considerably faster than the other networks, at less than 2.5 milliseconds per frame (over 400 frames per second), which is 5 times faster. FireNet is 2–6 times slower than TinyNet, depending on the number of modules.
Low-Power platform results are more interesting, as ClassicNet is quite slow on this platform, peaking at 1740 milliseconds per frame, while TinyNet-4 and TinyNet-8 take less than 40 and 100 milliseconds per frame, respectively. TinyNet-4 is 43 times faster than ClassicNet, while TinyNet-8 is 17 times faster. As seen previously, using fewer filters in ClassicNet considerably degrades classification performance, especially when using a small number of modules. FireNet is also more accurate than ClassicNet, and considerably faster at 310 milliseconds per frame (a 5-times speedup).
TinyNet and FireNet are quite similar, with FireNet only containing extra 3 × 3 filters.
Selecting a network to run on a low-power embedded device is
now quite easy, as TinyNet is only 0.5% less accurate than ClassicNet
and 1.0% less than FireNet, but many times faster. TinyNet can run
at 25 frames per second on a Raspberry Pi 2.
There is a measurable difference between using Batch Normalization and Dropout, especially on the Low-Power platform.
It must also be pointed out that, comparing the results in this section with Table 4.7, which shows computation time on the HP platform, template matching is not an option either, as it is slightly less accurate than TinyNet but considerably slower. TinyNet is 24 times faster than template matching with SQD, and 67 times faster than with CC. On the Low-Power platform with 150 templates per class, a CC template matcher takes 3150 milliseconds to classify one image, while an SQD template matcher takes 1200 milliseconds per image.
These results form the core argument that for sonar image classi-
fication, template matching or classic machine learning should not
be used, and convolutional neural networks should be preferred. A
CNN can be more accurate and perform faster, both in high and
low power hardware, which is appropriate for use in autonomous
underwater vehicles.
Table 4.6: TinyNet and FireNet performance as a function of the number of modules (#) and convolution filters. Mean computation time is shown for both the High-Power platform (HP) and the Low-Power platform (LP). The standard deviation is not shown as it is less than 0.2 ms for HP and 1 ms for LP. The number of parameters (P) in each model is also shown.

# | TinyNet 4 filters: P / LP / HP | TinyNet 8 filters: P / LP / HP | FireNet 4 filters: P / LP / HP
1 | 307 / 24 ms / 0.9 ms | 443 / 53 ms / 1.5 ms | 3499 / 310 ms / 6.0 ms
2 | 859 / 32 ms / 1.1 ms | 1483 / 82 ms / 2.1 ms | 8539 / 247 ms / 6.2 ms
3 | 1123 / 35 ms / 1.3 ms | 2235 / 90 ms / 2.4 ms | 10195 / 237 ms / 6.2 ms
4 | 1339 / 36 ms / 1.2 ms | 2939 / 92 ms / 2.3 ms | 11815 / 236 ms / 6.3 ms
5 | 1531 / 37 ms / 1.4 ms | 3619 / 95 ms / 2.4 ms | 13415 / 236 ms / 6.3 ms
6 | 1711 / 38 ms / 1.4 ms | 4287 / 96 ms / 2.4 ms | 15003 / 237 ms / 6.5 ms

4.4 Summary of Results

This chapter has performed an in-depth evaluation of image classification algorithms for sonar images. While the state of the art typically uses template matching with cross-correlation [39], we have shown that a convolutional neural network can outperform the state of the art.
5 Limits of Convolutional Neural Networks on FLS Images

This chapter deals with a less applied problem than the rest of this thesis. While the use of Deep Neural Networks for different tasks has exploded during the last 5 years, many questions of practical importance remain unanswered [1].
[1] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
A general rule of thumb in machine learning is that more data always improves the generalization ability of a classifier or regressor. For Deep Learning it has been largely assumed that a large dataset
7 classes. Note that the state of the art compared in this work is
mostly composed of engineered features, like bag of visual words,
clustering, and dictionary learning from HoG, SIFT and LBP features.
Even as OverFeat is not trained on the same dataset, its features
generalize outside of this set quite well.
On the MIT 67 indoor scenes dataset, the authors obtain 69.0 %
mean accuracy with data augmentation, which is 5 % better than
the state of the art. This dataset is considerably different from the
ImageNet dataset used to train OverFeat.
In order to evaluate a more complex task, the authors used the
Caltech-UCSD Birds dataset, where the task is to classify images of
200 different species of birds, where many birds "look alike" and are
hard to recognize. Again this simple method outperforms the state
of the art by 6 %, producing 61.8 accuracy. This result shows how
CNN features outperform engineered ones, even when the task is
considerably different from the training set. This work also shows 6
Yan Pailhas, Yvan Petil-
the importance of data augmentation for computer vision tasks. lot, and Chris Capus. High-
resolution sonars: what reso-
Pailhas, Petillot and Capus 6 have explored the relationship be- lution do we need for target
recognition? EURASIP Journal
tween sonar resolution and target recognition accuracy. While this on Advances in Signal Processing,
2010(1):205095, 2010
is not the same question as we are exploring, it is similar enough to
0.8 M and 1.2 million images (the original size). Accuracy decreases from 45 % at 1.2 M to 30 % at 0.2 M. The relationship between training set size and accuracy is quite close to linear, as accuracy decreases slowly from 1.2 M down to 0.4 M, but then drops more sharply. While both results are quite interesting, these authors have not controlled for the random weight initialization, and variations in accuracy should be computed. Due to the large size of the ImageNet dataset, it can be expected that this kind of evaluation protocol was not performed because of the large computational resources required.
the best generalization performance, but using the first layer has by
(Figure: test accuracy (%) for configurations with 8 and 32 filters, using Batch Normalization and Dropout.)
5.3 Effect of Object Size
(Figure: test accuracy (%) versus image size (16–96 pixels) for (a) ClassicNet and (b) TinyNet and FireNet, trained with ADAM or SGD and with Batch Normalization or Dropout.)
(Figure: test accuracy (%) versus samples per class; (a) full plot, (b) zoom into the region of 1–30 samples per class.)
(Figure: test accuracy (%) versus samples per class, second plot; (a) full plot, (b) zoom into the region of 1–30 samples per class.)
(Figure: accuracy (%) versus samples per class for an SVM and for ClassicNet-2-BN, ClassicNet-2-Dropout and ClassicNet-3-BN; panels show the full plot and a zoom into the 1–30 region.)
In this section we combine the ideas of Section 5.2 and Section 5.4, evaluating how transfer learning can be used to make a CNN produce good generalization with a small number of samples.
(Figure: test accuracy (%) versus SVM samples per class (1–150) for ClassicNet-BN-TL, ClassicNet-Dropout-TL, ClassicNet-BN, ClassicNet-Dropout and an SVM, for (a) different objects and (b) same objects.)
result.
Considering now ten samples per class: in the case of different objects, both networks produce generalization that is very close to 90 % accuracy, while for the same objects Dropout produces 93 % accuracy and Batch Normalization 96 %. Both results can be considered usable for practical applications.
Now considering large sample sizes (more than 30 samples per class), the performance of the learned features is not considerably different from learning a classifier network from the data directly. This means the only advantage of learning features arises when one has a small number of samples to train with. Only when using the same objects is the generalization of feature learning slightly better than the baselines from the previous section.
(Figures: accuracy (%) versus SVM samples per class, for features learned with 1–150 samples per class; panels (a) features 1–50, (b) 60–100, (c) 110–150.)
Results using the same objects for feature learning show improved generalization over using different objects. This is reasonable, as the learned features have a natural bias towards representing the learned objects well. We believe that this invariance can be considerably improved with more data and more variation among object classes.
In this case, achieving 95 % accuracy reliably requires only 40
samples per class for feature learning (F). Performance at a single
sample per class for T also improves considerably with more feature
learning samples, starting at 40 % and increasing to 80 % for 40
samples per class, and it further increases up to 90 % when more
samples are used for feature learning.
The same case of a single sample for training T shows that Batch Normalization features are superior, as BN produces 50 % accuracy versus less than 40 % for Dropout. When more samples are added to F, single-sample T performance improves considerably, reaching more than 80 % with BN features and 70 % with Dropout. As more samples are used for F, performance continues to improve slowly, eventually achieving 98 % accuracy reliably with 100 samples per class in F. In the case of a large number of samples in F, Batch Normalization is still superior, reaching the 98 % barrier more consistently than Dropout.
Two clear conclusions can be obtained from these experiments. High generalization (95 % accuracy) can be achieved with few samples (10–30 samples per class for both T and F), but only if the same objects are used for both sets; this implies that generalization outside of the training set will probably be reduced. The second conclusion is that if T and F do not share objects, there will be a performance hit compared to sharing objects, but even in this case learning features will improve generalization when compared to training a CNN on the same data.
It has to be mentioned that our results show that by using the same data, but changing the training procedure, a considerable improvement in generalization can be obtained, even when using few samples to learn features (F) and to train an SVM on those features (T).
samples per class for F and T, but only if the same objects are used in both datasets. In the case of learning features on one set of objects and training an SVM for a different one, more data is required to achieve 95 % accuracy, in the order of 100 T samples per class and 40–50 feature learning (F) samples per class.
We expect that our results will contribute to the discussion about how many samples are actually required to use Deep Neural Networks on different kinds of images. For the marine robotics community, we hope that our argument is convincing and that neural networks will see more use in the field.
(Figures: additional accuracy (%) versus SVM samples per class curves, for features learned with 1–150 samples per class; panels (a) features 1–50, (b) 60–100, (c) 110–150.)
6 Matching Sonar Image Patches
Object Recognition
Given an image, classify its contents into a predefined set of
classes. Instead of using a trainable classifier, recognition can be
performed by using a database of labeled images and match the
input image with one in the labeled database. This approach has
the advantage of being dynamic, so new object classes can be
easily added to the database, but recognition performance now
depends on the quality of the matching functionality.
Object Detection
Similar to Object Recognition, but instead of classifying the com-
plete image, specific objects must be localized in the image. Typ-
While many of the use cases described in this chapter are fields of their own, with mature techniques, using matching to perform such tasks does offer some potential advantages:
D ( x, y, σ ) = L( x, y, σ ) − L( x, y, βσ) (6.1)
L( x, y, σ ) = G ( x, y, σ ) ∗ I ( x, y) (6.2)
I ( X, Y ) = H ( X ) + H (Y ) − H ( X, Y ) (6.3)
e^(j2π(βuo + νvo)) = FX(β, ν) FY*(β, ν) / ||FX(β, ν) FY(β, ν)||    (6.6)

where Fs is the Fourier transform of s. This equation comes from the Fourier shift theorem, associating the Fourier transforms of two images that are related by a spatial shift (uo, vo) with a constant multiplicative factor. The phase correlation function is computed as the inverse Fourier transform of Eq. 6.6. This function will have a peak at (uo, vo), and by finding the maxima it is possible to recover this shift. For the case of a rotation and a translation, the same principle applies, as the spectrum is also rotated. To recover the rotation shift, the same method can be applied to the log-polar transformation of the Fourier transform. The authors mention that this method is the most accurate across all evaluated methods. This result is consistent with other publications on the topic [17].
[17] Natalia Hurtós, Sharad Nagappa, Narcis Palomeras, and Joaquim Salvi. Real-time mosaicing with two-dimensional forward-looking sonar. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 601–606. IEEE, 2014.
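The translation-only case can be sketched in a few lines of numpy; this is a simplified illustration of the phase correlation idea, not the implementation evaluated by the authors.

# Minimal sketch: recover a pure translation between two images via the
# cross-power spectrum (Eq. 6.6) and an inverse FFT.
import numpy as np

def phase_correlation_shift(image_a, image_b):
    fa = np.fft.fft2(image_a)
    fb = np.fft.fft2(image_b)
    cross_power = fa * np.conj(fb)
    cross_power /= np.abs(cross_power) + 1e-12  # avoid division by zero
    correlation = np.fft.ifft2(cross_power).real
    # the peak location gives the shift (uo, vo), modulo the image size
    return np.unravel_index(np.argmax(correlation), correlation.shape)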
Only one feature-based registration method was evaluated, corresponding to SIFT. The authors detect SIFT keypoints in both images and try to match their feature descriptors using a nearest neighbour search. The authors' experiments show that many false matches are produced in sonar images. RANSAC is applied to discard outliers while fitting a transformation to align both images. The results are good, but only when a large number of keypoints are detected. The method fails when keypoints are sparse or when they are mismatched, requiring methods that iteratively refine the learned transformation parameters.
Two hybrid methods were evaluated. The first is a combination
of the log-polar transformation:
r = sqrt((x − xc)² + (y − yc)²)    (6.7)
θ = tan⁻¹((y − yc) / (x − xc))    (6.8)
Where ( xc , yc ) represents the polar center. This transforms the
input images in order to provide some kind of normalization, as
image rotations are represented as linear shifts in polar coordinates.
Then normalized cross-correlation (as presented previously in Eq 4.1)
is applied to obtain the different shifts by using a sliding window
and keeping the transformation parameters that maximize the cross-
correlation. The second method is Region-of-Interest detection with
a custom operator that detects regions of interest through a variance
saliency map. Both methods perform poorly, as the correlation
between features in multiple observations is quite low, leading to
false matches.
Pham and Gueriot [18] propose the use of guided block-matching for sonar image registration. The method we propose in this chapter could be considered similar to the block matching step, as this step just needs to make a binary decision on whether a pair of blocks (image patches) matches or not. The authors use a pipeline that first extracts dense features from the image and then performs unsupervised segmentation of the image through a Self-Organizing Map (SOM) applied on the computed features. The unsupervised segmentation of the image is then used to aid the block matching process, as only blocks that are similar in feature space according to the SOM will be compared; but this process is unsupervised, which implies that different comparison functions will be learned for each image. The final step is to estimate a motion vector from the matched blocks, from which a geometrical transformation for registration can be estimated.
[18] Minh Tân Pham and Didier Guériot. Guided block-matching for sonar image registration using unsupervised kohonen neural networks. In 2013 OCEANS San Diego, pages 1–5. IEEE, 2013.
The results from this paper show that the method performs considerably faster than standard block matching, but only visual results are provided. The displayed mosaics make sense, showing that standard and guided block matching can recover the correct translation vector, but a more thorough numerical comparison is needed.
Moving into modern methods for matching, Zagoruyko and Komodakis [19] were one of the first to propose the use of CNNs to learn an image matching function, where they learn how to compare image patches, predicting a similarity score instead of a plain binary classification. We base our own matching networks on this work.
[19] Sergey Zagoruyko and Nikos Komodakis. Learning to compare image patches via convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4353–4361, 2015.
This work defined several CNN models that can be used for patch
matching:
L = (λ/2) ||w|| + Σi max(0, 1 − yi ŷi)    (6.9)
Their results show that the best performing network configuration is the two-channel central-surround one, by a considerable margin over all other configurations. The plain two-channel networks perform slightly worse. As two-channel networks are considerably simpler to implement, this is the choice we made for our own matching networks in this chapter. These results are quite good, but they seem to only be possible due to the large labeled dataset that is available.
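As an illustration of the two-channel idea, a minimal Keras sketch follows; the layer sizes are assumptions for illustration and not the exact architecture evaluated later in this chapter. The two 96 × 96 patches are simply stacked as the two channels of a single input tensor, and a single sigmoid output is one simple choice for the match score.

# Minimal sketch: two-channel patch matching network.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

def two_channel_matcher(patch_size=96):
    model = Sequential()
    # both patches stacked into one (H, W, 2) input
    model.add(Conv2D(16, (5, 5), activation='relu',
                     input_shape=(patch_size, patch_size, 2)))
    model.add(MaxPooling2D((2, 2)))
    model.add(Conv2D(32, (5, 5), activation='relu'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(64, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))  # match / no-match score
    return model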
The authors also test their matching networks for wide-baseline stereo on a different dataset, showing superior performance when compared with DAISY [23]. These results show that using a CNN for matching is quite powerful and can be used for other tasks. Generalization outside of the training set is quite good.
[23] Engin Tola, Vincent Lepetit, and Pascal Fua. A fast local descriptor for dense matching. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
Zbontar and LeCun [24] also propose the use of a CNN to compare image patches. Their method is specifically designed for wide-baseline stereo matching. Given a stereo disparity map, the authors construct a binary classification dataset by extracting one positive and one negative example where the true disparity is known a priori.
[24] Jure Zbontar and Yann LeCun. Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research, 17:1–32, 2016.
6.1.1 Discussion
While there is a rich literature on mosaicing and registration for sonar images, these approaches typically use simplistic methods for image patch matching. The most complex technique that is commonly used is SIFT, which was not designed specifically for sonar images.
Many techniques use extensive domain knowledge in order to obtain good registration performance, but in general the evaluation is quite simplistic: only visual confirmation that the method works on a tiny dataset (less than 10 images), and no further numerical
from each image, and weight sharing allows for a reduction in the number of parameters.
(Figure 6.2: CNN architecture; the margin diagram lists layers including Conv(32, 5 × 5), MaxPool(2, 2) and FC(32).)

6.4 Experimental Evaluation

metrics: false and true positive rates, precision and recall, ROC
(Figure: siamese matching architecture — two input branches (Image A and Image B), each Conv(16, 5 × 5) → MaxPool(2, 2) → Conv(32, 5 × 5) → MaxPool(2, 2) → Conv(32, 5 × 5) → MaxPool(2, 2) → Conv(16, 5 × 5) → MaxPool(2, 2) → FC(96) → FC(96), merged and followed by FC(64) and FC(c).)
cross-validation.
TPR = TP / P        FPR = FP / N    (6.11)
Assuming a classifier that provides a score for each class, then
the TPR and FPR rates vary as a threshold is set on the output
classification score. Then the ROC curve is built as points in the
(FPR, TPR) space as the threshold is varied. This curve indicates the
different operating points that a given classifier outputs can produce,
and it is useful in order to tune a specific threshold while using the
classifier in production.
The AUC is just the area under the ROC curve, which is a number in the [0, 1] range. The AUC is a metric that is not simple to interpret [34]. One interpretation is that the AUC is the probability that the classifier will produce a higher score for a randomly chosen positive example than for a randomly chosen negative example. A classifier with a higher AUC is then preferable.
[34] Claude Sammut and Geoffrey I Webb. Encyclopedia of machine learning. Springer Science & Business Media, 2011.
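A minimal sketch of these computations with scikit-learn (variable names are illustrative; y_true holds binary match labels and scores the continuous classifier outputs):

# Minimal sketch: ROC curve and AUC from binary labels and continuous scores.
from sklearn.metrics import roc_curve, roc_auc_score

def roc_and_auc(y_true, scores):
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    auc = roc_auc_score(y_true, scores)
    return fpr, tpr, thresholds, auc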
We also evaluated accuracy, and as we have labels for the three components that were used to generate each dataset, we also evaluated accuracy on each component. This includes examples that represent an object-object positive match, an object-object negative match, and an object-background negative match. We also compute and present mean accuracy. As we are also evaluating regressors that predict a score, we compute accuracy from class predictions:
Method | AUC | Mean Acc | Obj-Obj (+) Acc | Obj-Obj (−) Acc | Obj-Bg (−) Acc
SIFT | 0.610 | 54.0 % | 74.5 % | 43.6 % | 44.0 %
SURF | 0.679 | 48.1 % | 89.9 % | 18.6 % | 35.9 %
ORB | 0.682 | 54.9 % | 72.3 % | 41.9 % | 60.5 %
AKAZE | 0.634 | 52.2 % | 95.1 % | 4.8 % | 56.8 %
RF-Score | 0.741 | 57.6 % | 22.5 % | 88.2 % | 97.2 %
RF-Class | 0.795 | 69.9 % | 12.5 % | 97.7 % | 99.7 %
Different
(Figure: ROC curves — False Positive Rate versus True Positive Rate — for SIFT, SURF, ORB and AKAZE.)
(Figure: test AUC (%) versus samples per class; (a) sub-region 1–500, (b) full plot up to 5000 samples per class.)
training set.
We evaluated classic keypoint matching algorithms, namely SIFT,
SURF, ORB, and AKAZE. We show that these techniques do not
perform well in sonar images, with area under the ROC curve in the
range 0.61-0.63, which is slightly better than a random classifier.
We also evaluated the use of classic ML classifiers for this problem,
including a Random Forest and a Support Vector Machine. In this
case we model matching as binary classification given two 96 × 96
image patches. These methods work better than keypoint detectors
at AUC 0.65–0.80. Based on previous work by Zagoruyko et al. [36], we decided to implement and compare a two-channel and a siamese network.
[36] Sergey Zagoruyko and Nikos Komodakis. Learning to compare image patches via convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4353–4361, 2015.
The two-channel network obtains the best matching performance at 0.91 AUC, performing binary classification. It is closely followed
154
7 Detection Proposals in Forward-Looking Sonar
This chapter deals with the core problem of this thesis, namely the
detection of marine debris in sonar images. But here we focus on a
slightly different problem.
Most of the object detection literature deals with the problem of designing and testing a class-specific object detector, which only detects objects of a certain class or classes. But our object of interest, marine debris, has a very large intra-class and inter-class variability. This motivates the construction of a generic or class-agnostic object detector, which in the computer vision literature is called detection proposals.
For example, in order for a robot to detect novel objects [1], a class-independent object detector must be available. A novel object could be placed in front of the robot's sensors, and the robot would be able to say that there is an object in front of it, even though it does not match any class that was previously trained, and it could ask the operator for information in order to label the new object.
[1] Ian Endres and Derek Hoiem. Category independent object proposals. In Computer Vision–ECCV 2010, pages 575–588. Springer, 2010.
Detection proposals are connected to the concept of objectness, which is a basic measurement of how likely it is that an image window or patch contains an object of interest. This is also related to the concept of an object itself, which is hard to define.
While there are many computer vision algorithms that produce detection proposals, the concept of a class-agnostic object detector has not been applied to sonar images. Detection proposals are generally used to construct class-specific object detectors and improve their performance, but in our case we would like to design and build a class-agnostic detector as a goal in itself, as we want an AUV to have the capability to detect novel objects that were not considered at training time.
For our specific objective of detecting marine debris, as we cannot
possibly model or collect training data for all kinds of marine debris,
• Can we make sure that it generalizes well outside its training set?
IoU(A, B) = area(A ∩ B) / area(A ∪ B)    (7.2)
The most common value [3] is Ot = 0.5, but higher values are possible. The IoU score measures how well two bounding boxes match, and it is used because ground truth bounding boxes are human-generated and could be considered arbitrary for a computer algorithm. Using IoU introduces a degree of slack into the proposals
[3] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
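As a concrete reference, a minimal sketch of Eq. 7.2 for axis-aligned bounding boxes given as (x1, y1, x2, y2) corners (the coordinate convention is an assumption):

# Minimal sketch: intersection-over-union of two axis-aligned boxes.
def iou(box_a, box_b):
    # boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0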
in the 0.68–0.69 range. This method works acceptably, but it performs heavy feature engineering, which indicates that the features do generalize, yet not enough to obtain recall closer to 99 %.
Rahtu et al. [6] use a cascade architecture to learn a category-independent object detector. Their motivation is that a cascade is considerably faster than the usual object detection architectures (like, for example, Viola-Jones [7] for real-time face detection). The authors introduce features that are useful to predict the likelihood that a bounding box contains an object (objectness).
[6] Esa Rahtu, Juho Kannala, and Matthew Blaschko. Learning a category independent object detection cascade. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1052–1059. IEEE, 2011.
[7] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I–I. IEEE, 2001.
The first step is to generate an initial set of bounding boxes from the input image, where two methods are applied. One is super-pixel segmentation, and the second is to sample 100K bounding boxes from a prior distribution computed from the training set. This prior distribution is parameterized by bounding box width, height, and row/column position in the image.
Each window is then evaluated for objectness and a binary decision is made by a classifier. Three features are used: super-pixel boundary integral, boundary edge distribution and window symmetry. Non-maxima suppression is applied before outputting the final bounding boxes. Results on the PASCAL VOC 2007 dataset show that 95 % recall can be obtained at an IoU threshold of Ot = 0.5. This approach works well in terms of recall and is promising in terms of computational performance, but, as has been previously mentioned, their choice of features does not transfer to sonar, especially the super-pixel ones, due to the noise and lack of clear boundaries in sonar images.
Alexe et al. [8] present an objectness measure, putting a number on how likely an image window is to contain an object of interest, without belonging to any specific class. The authors define an object through three basic characteristics: a defined closed boundary, a different appearance from its surroundings, and being unique and salient in the image.
[8] Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. Measuring the objectness of image windows. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(11):2189–2202, 2012.
The authors use multiple objectness cues that are combined using
a Bayesian framework. The cues consist of multi-scale saliency, color
contrast, edge density, super-pixel straddling, and location and size.
All these cues contain parameters that must be learned from training
data.
This method was also evaluated on the PASCAL VOC 2007
dataset, obtaining 91 % recall with 1000 bounding boxes per im-
age, taking approximately 4 seconds per image. This objectness
measure seems to generalize quite well in unseen objects, but it is
ABO(G, B) = (1/|G|) Σ_{g ∈ G} max_{b ∈ B} IoU(b, g)    (7.3)
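A minimal sketch of Eq. 7.3, reusing the iou function sketched earlier in this chapter (the list-of-boxes inputs are an assumption):

# Minimal sketch: average best overlap between ground truth and proposals.
def average_best_overlap(ground_truth_boxes, proposal_boxes):
    if not ground_truth_boxes or not proposal_boxes:
        return 0.0
    best = [max(iou(g, b) for b in proposal_boxes) for g in ground_truth_boxes]
    return sum(best) / len(ground_truth_boxes)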
Varying the diversification strategies produces different ABO scores, with a combination of all similarity measures giving the best ABO. The best color space seems to be HSV. Using a single diversification strategy with only the HSV color space and all four similarity measures produces bounding boxes with 0.693 ABO. The authors define two other selective search configurations: "fast", which uses 8 strategies and produces 0.799 ABO, and "quality", with 80 strategies, producing 0.878 ABO.
Due to the high ABO scores produced by selective search, it obtains 98–99 % recall on the PASCAL VOC test set. This shows that the produced bounding boxes are of high quality and are usable for object detection/recognition purposes. Two disadvantages of selective search are the large number of bounding boxes that are required to obtain high recall (over 1000), and the slow computational
larger recall than smaller windows. This paper also evaluates recall as a function of the number of proposals, where EdgeBoxes and Selective Search are the best methods, requiring fewer proposals for a given recall target, or achieving a higher overall recall.
Hosang et al. [15] refine their results in a follow-up journal paper. This work repeats the previous experiments but uses two new datasets: ImageNet 2013 object detection, and the COCO 2014 object detection dataset. The basic idea of this comparison is to check for overfitting to the PASCAL VOC object categories. The authors see no general loss of recall performance on the other datasets, as methods perform similarly. But on the COCO dataset there are some significant differences, such as EdgeBoxes performing poorly when compared to Selective Search, while other methods improve their performance.
[15] Jan Hosang, Rodrigo Benenson, Piotr Dollár, and Bernt Schiele. What makes for effective detection proposals? IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(4):814–830, 2016.
This paper also introduces the average recall (AR) metric in order to predict object detection performance in a class-specific setting, which can serve as a proxy metric for the more commonly used mean average precision (mAP). AR is computed as the mean recall score as the IoU overlap threshold is varied in the range Ot ∈ [0.5, 1.0]. Better proposals that lead to an improved class-specific object detector will be reflected in a higher AR score.
We now cover two methods that use CNNs for detection proposals. The first is MultiBox [16], which extends AlexNet to generate detection proposals. The authors propose to use a final layer that outputs a vector of 5K values, four of them corresponding to the upper-left and lower-right corner coordinates of a bounding box, and a confidence score that represents objectness. The default boxes are obtained by applying K-means to the ground truth bounding boxes, which makes them not translation invariant [17] and could hurt generalization.
[16] Dumitru Erhan, Christian Szegedy, Alexander Toshev, and Dragomir Anguelov. Scalable object detection using deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2147–2154, 2014.
[17] This is mentioned in the Faster R-CNN paper as a big downside.
The assignment between ground truth bounding boxes and predictions by the network is made by solving an optimization problem. The idea is to assign the best ground truth bounding boxes in order to predict a high confidence score. These bounding boxes can then be ranked, and fewer proposals should be needed to achieve high recall.
This network is trained on 10 million positive samples, consisting of patch crops that intersect the ground truth with at least IoU overlap threshold Ot = 0.5, and 20 million negative samples obtained from crops with IoU overlap threshold smaller than Ot = 0.2. On the PASCAL VOC 2012 dataset, this model can obtain close to 80 % recall with 100 proposals per image, which is a considerable improvement.
network outputting 256 − 512 features that are fed to two sibling
fully connected layers. One of these produces four coordinates
corresponding to a bounding box location, and the other sibling layer
produces two softmax scores indicating the presence of an object.
This small network can be implemented as a fully convolutional
network for efficient evaluation.
The RPN uses the concept of an anchor box. As proposals should
cover a wide range of scale and aspect ratio, the RPN produces k
bounding box and objectness score predictions, each corresponding
to a different anchor box. Then at test time, all anchors are predicted
and the objectness score is used to decide final detections. One
contribution of the RPN is that the anchors are translation invari-
ant, which is very desirable in order to predict objects at multiple
locations.
Training the RPN is not a simple process. Anchors must be labeled
as objects or background. Given a set of ground truth bounding
boxes, the anchor with the highest IoU overlap is given a positive
label (pi = 1), as well as any anchor with IoU larger than Ot = 0.7.
Anchors with IoU smaller than Ot = 0.3 are given a negative label
(background, pi = 0). Then the RPN layers are trained using a
multi-task loss function:

L = N⁻¹ ( Σi CE(pi, p̂i) + λ Σi pi H(|ti − t̂i|) )    (7.4)

where pi is the ground truth object label, ti is the true vector of normalized bounding box coordinates (tx, ty, tw, th), λ is a trade-off factor used to combine both sub-losses, CE is the cross-entropy loss, and H is the Huber loss with δ = 1, which is also called the smooth L1
loss:

Hδ(x) = (1/2) x²  if |x| < δ;  δ(|x| − δ/2)  otherwise    (7.5)
Bounding box coordinates for regression are normalized by:

tx = (x − xa) / wa    ty = (y − ya) / ha    tw = w / wa    th = h / ha
Where the a subscript denotes the anchor coordinates. The RPN is not evaluated as a standalone component, but as part of the whole Faster R-CNN pipeline, including the use of Fast R-CNN [19] for object detection given a set of proposals. Mean average precision (mAP) on PASCAL VOC 2012 improves from 65.7 % when using Selective Search proposals to 67.0 % with RPN proposals.
[19] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
Faster R-CNN was a milestone in object detection, being consider-
ably faster than previous iterations (R-CNN and Fast R-CNN), while
also being more accurate and introducing the RPN. We have tried to
train similar networks performing bounding box regression on our forward-looking sonar images, but they fail to converge into a state that produces useful predictions. We believe that this is due to a much smaller training set and the failure to learn "appropriate" features by pre-training the network on a large dataset. For comparison, the RPN is trained on the PASCAL VOC 07+12 dataset. The 2007 [20] version of this dataset contains 5K training images with 12.5K labeled object instances, while the 2012 [21] version contains 11.5K training images with 31.5K object instances. The RPN has also been trained successfully on the COCO dataset [22], with 328K images and 2.5 million labeled object instances. Both datasets vastly surpass our dataset of 2069 images with 2364 labeled object instances. Additionally, DSOD (Deeply Supervised Object Detection) [23] also evaluates the RPN in an end-to-end approach without pre-training the network; while DSOD works well (and outperforms the RPN and Faster R-CNN) using a proposal-free approach, the authors mention that using the RPN in a proposal-based framework with DSOD failed to converge, which suggests that there are additional issues with the RPN formulation that are not related to the size of the training set. Additional research is needed to understand how the RPN works and what is required for convergence.
[20] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[21] Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.
[22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[23] Zhiqiang Shen, Zhuang Liu, Jianguo Li, Yu-Gang Jiang, Yurong Chen, and Xiangyang Xue. DSOD: Learning deeply supervised object detectors from scratch. In The IEEE International Conference on Computer Vision (ICCV), volume 3, page 7, 2017.
7.1.1 Discussion
This section describes our detection proposals pipeline and the use
of convolutional neural networks for objectness prediction.
• Objectness Thresholding. Any image window having objectness larger than or equal to a predefined threshold To is output as a detection proposal. The value of To must be carefully tuned, as a low value will produce many proposals with high recall, and a large value will generate fewer proposals with lower recall. In general it is desirable to produce the smallest number of proposals that reaches a given recall target (a small sketch of this selection step is given below).
(Figure, margin: sliding-window grid generated with s = 8; many sliding windows overlap with each other.)
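The following is a minimal sketch of the thresholding step combined with non-maximum suppression; the window representation, helper names and the To = 0.6, St = 0.7 values are illustrative, and the iou helper is the one sketched earlier in this chapter.

# Minimal sketch: objectness thresholding followed by greedy NMS.
def propose_by_thresholding(windows, objectness, to=0.6, st=0.7):
    # windows: list of (x1, y1, x2, y2); objectness: one score per window
    candidates = [(s, w) for s, w in zip(objectness, windows) if s >= to]
    candidates.sort(key=lambda c: c[0], reverse=True)
    proposals = []
    for score, window in candidates:
        # keep a window only if it does not overlap a kept window too much
        if all(iou(window, kept) < st for _, kept in proposals):
            proposals.append((score, window))
    return proposals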
classification. This network is also trained with a mean squared error loss for 15 epochs, using ADAM. We applied the same methodology as before; the loss converges in more epochs, but does not seem to produce overfitting. This network has a final loss that is slightly higher than the previous one, but it still performs adequately. Note that we do not use regularization as part of this network, as using Batch Normalization prevents us from transforming the CNN into an FCN due to the fixed sizes in the Batch Normalization parameters.
(Figure 7.3: TinyNet-Objectness architecture for objectness prediction; the margin diagram lists a 96 × 96 image patch input followed by layers including Conv(24, 3 × 3), Conv(24, 1 × 1) and MaxPool(2, 2).)
As mentioned before, this network performs well and does not seem to overfit.
(Figure 7.5: Objectness map produced by TinyNet-FCN-Objectness on a given input image; the margin diagram shows a Conv(1, 24 × 24) layer producing the objectness map.)
We have three datasets which are used to train and test this method:
• Training: This is a dataset of 51653 96 × 96 sonar image patches, obtained and labeled with the methodology described in Section 7.2.1. CNN models are trained on this dataset. We perform data augmentation by flipping images left-right and up-down, which increases the amount of data by three times.
• Average Best Overlap: Mean value of the best overlap score for
each ground truth object. This metric tells how well the generated
proposals match the ground truth bounding boxes.
7.3.3 Baseline

As a baseline we use cross-correlation template matching, which is commonly used for sonar object detection, and a simple modification can transform it into an objectness score. (A detailed description of template matching is available in Chapter 4.)
We randomly select a set of N templates from the training set,
and apply cross-correlation as a sliding window between the input
image (with size (W, H )) and each template. This produces a set
of images that correspond to the response of each template, with
dimensions ( N, W, H ). To produce a final objectness score, we take
the maximum value across the template dimensions, producing a
final image with size (W, H ). Taking the maximum makes sense, in
order to make sure that only the best matching template produces
an objectness score. As cross-correlation takes values in the [−1, 1]
range, we produce objectness values in the [0, 1] range by just setting
any negative value to zero.
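A minimal sketch of this baseline follows, assuming OpenCV is available; cv2.matchTemplate with a normalized correlation mode is one way to obtain responses in [−1, 1], and the fact that the response map is smaller than the input by the template size is glossed over here.

# Minimal sketch: template-matching objectness as the maximum response
# across templates, with negative responses clipped to zero.
import numpy as np
import cv2

def template_objectness(image, templates):
    # all templates are assumed to have the same size (e.g. 96 x 96)
    responses = [cv2.matchTemplate(image, t, cv2.TM_CCOEFF_NORMED)
                 for t in templates]
    objectness = np.max(np.stack(responses), axis=0)
    return np.clip(objectness, 0.0, 1.0)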
(Figure: objectness thresholding results — (a) recall (%) versus objectness threshold To, (b) number of proposals versus To, (c) average best overlap (ABO) versus To.)
(Figure: additional objectness thresholding results — (a) recall (%) versus objectness threshold To, (b) number of proposals versus To, (c) ABO versus To.)
(Figure: objectness ranking results — (a) recall (%) versus K, (b) ABO versus K.)
(Figure: additional objectness ranking results — (a) recall (%) versus K, (b) ABO versus K.)
Table 7.1: Comparison of detection proposal techniques with the state of the art. Our proposed methods obtain the highest recall with the lowest number of proposals. Only EdgeBoxes has a higher recall, with a considerably larger number of output proposals.

Method | Best Recall | # of Proposals | Time (s)
TM CC Threshold | 91.83 % | 150 | 10.0 ± 0.5
TM CC Ranking | 88.59 % | 110 | 10.0 ± 0.5
EdgeBoxes (Thresh) | 57.01 % | 300 | 0.1
EdgeBoxes (# Boxes) | 97.94 % | 5000 | 0.1
Selective Search Fast | 84.98 % | 1000 | 1.5 ± 0.1
Selective Search Quality | 95.15 % | 2000 | 5.4 ± 0.3
ClassicNet Threshold | 96.42 % | 125 | 12.4 ± 2.0
TinyNet-FCN Threshold | 96.33 % | 300 | 3.1 ± 1.0
ClassicNet Ranking | 96.12 % | 80 | 12.4 ± 2.0
TinyNet-FCN Ranking | 95.43 % | 100 | 3.1 ± 1.0
over 2000 proposals needed for this. While this is lower than what
is required by EdgeBoxes, it is still too much for practical purposes.
We also notice the same pattern that too many boxes are assigned to
just noise in the image. This can be expected as these algorithms are
not really designed for sonar images.
Cross-correlation template matching produces the lowest recall we observed in this experiment. Our objectness networks obtain very good recall with a low number of proposals per image. ClassicNet with objectness ranking produces 96 % recall with only 80 proposals per image, which is 62 times fewer than EdgeBoxes with only a 1 % absolute loss in recall. TinyNet-FCN with objectness ranking also produces 95 % recall with only 100 proposals per image, at a four times lower computational cost. Selective Search produces 1 % less recall than the best of our methods, while outputting 25 times more proposals.
In terms of computation time, EdgeBoxes is the fastest. FCN objectness is 4 times faster to compute than CNN objectness, due to the fully convolutional network structure, at the cost of only a 1 % reduction in recall. CC template matching is also quite slow, at 10 seconds per image, making it difficult to use in an AUV.
Figure 7.12 shows a comparison of the selected techniques as the number of output proposals is varied. This provides a broader overview of how increasing the number of output proposals affects recall. The best methods provide high recall with a low number of proposals, corresponding to the top-left part of the plot, and it can be seen that both ClassicNet and TinyNet-FCN objectness do a better job of predicting the correct objectness values at the right regions of the image, which leads to high recall with fewer proposals.
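The two selection strategies compared throughout this chapter differ only in how boxes are kept from the objectness scores: thresholding keeps every candidate whose objectness exceeds To, while ranking keeps the top K candidates. A rough sketch of both, combined with greedy non-maximum suppression and reusing iou() from the previous sketch, is shown below; the default values of To, K, and St, and whether NMS runs before or after the top-K cut, are illustrative assumptions rather than the thesis implementation.

    # Sketch of proposal selection from scored candidates: each candidate is
    # a (box, objectness) pair with objectness in [0, 1]; iou() is defined above.

    def nms(candidates, st=0.7):
        """Greedy non-maximum suppression: drop any box that overlaps an
        already kept, higher-scoring box by more than IoU St."""
        kept = []
        for box, score in sorted(candidates, key=lambda c: c[1], reverse=True):
            if all(iou(box, kept_box) <= st for kept_box, _ in kept):
                kept.append((box, score))
        return kept

    def select_by_threshold(candidates, to=0.6, st=0.7):
        """Objectness thresholding: keep every candidate with objectness >= To."""
        return nms([c for c in candidates if c[1] >= to], st=st)

    def select_by_ranking(candidates, k=10, st=0.7):
        """Objectness ranking: keep the K best-scoring candidates after NMS."""
        return nms(candidates, st=st)[:k]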
Overall we believe that our results show that our CNN-based
[Figure 7.12: Recall (%) versus number of output proposals (10⁰ to 10⁴, log scale) for SS Fast, SS Quality, EdgeBoxes, CNN Threshold, FCN Threshold, TM Threshold, CNN Ranking, FCN Ranking, and TM Ranking.]
[Figures: sample detection proposal results, including panels (a) Chain and (b) Wrench.]
[Two figures: Recall (%) plotted over a 0–0.9 range (left panels) and a 0–100 range (right panels).]
[Figure: validation MSE and validation AUC as a function of samples per class: (a) MSE full plot (1–10,000 samples per class), (b) MSE sub-region 1–1,000, (c) AUC, (d) AUC sub-region 1–1,000.]
7.4 Limitations
typical cameras. This means that objects have the same size independent of the distance to the sensor; the only difference with distance is the sampling due to the polar field of view. Still, our method has problems producing accurate bounding boxes for small objects. Ideally, a detection proposal algorithm should be scale invariant and produce variable-sized bounding boxes that fit objects tightly.
• Training Data. Our training set is quite small, at only 50K images. The variability inside the dataset is limited, and more data, including additional views of the objects, different sonar poses, and more variation in objects, would definitely help train an objectness regressor with better generalization.
8 Selected Applications of Detection Proposals
8.1.1 Introduction
We have built an object detection pipeline based on detection proposals. While object detection is a well-researched subject, many underwater object detection systems suffer from the generalization issues that we have mentioned previously. There is a rampant use of feature engineering that is problem- and object-specific, which harms the ability to reuse such features for different objects and environments. We have previously shown that classic methods for sonar images do not perform adequately for marine debris, and extensions using deep neural networks are required.
A natural extension of our detection proposal system is to include
a classification stage so full object detection can be performed. An
additional desirable characteristic of a CNN-based object detector is
[Figure: network architecture used in this section: 96 × 96 input image → Conv(32, 5 × 5) → MaxPooling(2, 2) → Conv(32, 5 × 5) → MaxPooling(2, 2) → FC(128).]
error:
weight γ. Most papers that use multi-task learning tune this value on a validation set, but we have found that setting the right value is critical for good performance and for balance between the tasks. We take an empirical approach and evaluate a predefined set of values, namely γ ∈ [0.5, 1, 2, 3, 4], and later determine which produces the best result using both recall and accuracy. This is not an exhaustive evaluation, but we found that it produces good results. Recent research by Kendall et al.⁸ proposes a method to automatically tune multi-task weights using task uncertainty that could be used in the future. This paper also notes that multi-task weights have a great effect on overall system performance.

⁸ Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. arXiv preprint arXiv:1705.07115, 2017.
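As an illustration, such a two-head network with a tunable multi-task loss weight can be written in a few lines with the Keras functional API. The sketch below follows the layer configuration shown in the figure above (96 × 96 input, two Conv(32, 5 × 5)/MaxPooling(2, 2) stages, FC(128)), with a sigmoid objectness output trained with MSE and a softmax classification output trained with cross-entropy; the activations, the number of classes, and the convention that γ scales the classification term are assumptions made for this sketch, not details fixed by the text.

    # Hedged sketch of a two-head (objectness regression + classification)
    # network with multi-task loss weight gamma, using the Keras functional API.
    from tensorflow.keras import layers, models

    NUM_CLASSES = 10  # placeholder: set to the number of debris classes

    def build_detector(gamma, num_classes=NUM_CLASSES):
        inputs = layers.Input(shape=(96, 96, 1))              # single-channel sonar crop
        x = layers.Conv2D(32, (5, 5), activation="relu")(inputs)
        x = layers.MaxPooling2D((2, 2))(x)
        x = layers.Conv2D(32, (5, 5), activation="relu")(x)
        x = layers.MaxPooling2D((2, 2))(x)
        x = layers.Flatten()(x)
        x = layers.Dense(128, activation="relu")(x)           # shared FC(128) trunk
        objectness = layers.Dense(1, activation="sigmoid", name="objectness")(x)
        classes = layers.Dense(num_classes, activation="softmax", name="classes")(x)

        model = models.Model(inputs, [objectness, classes])
        # Assumed weighting: total loss = MSE(objectness) + gamma * CE(classes).
        model.compile(optimizer="adam",
                      loss={"objectness": "mse",
                            "classes": "categorical_crossentropy"},
                      loss_weights={"objectness": 1.0, "classes": gamma})
        return model

    # Empirical sweep over the gamma values evaluated in the text.
    detectors = {g: build_detector(gamma=g) for g in [0.5, 1, 2, 3, 4]}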
Figure 8.3 shows our principal results as the objectness threshold To is varied; we also show the effect of different multi-task loss weights γ.
Our detection proposal recall is good, close to 95 % at To = 0.5 for many values of γ, showing little variation between different choices of that parameter. Larger variations in recall are observed for larger values of To (> 0.6).
The classification accuracy produced by our system shows considerable variation across different values of the multi-task loss weight γ. The best performing value is γ = 3, but it is not clear why this value is optimal or how other values could be chosen other than by trial and error. As mentioned before, setting multi-task loss weights is not trivial and has a large effect on the result. It is also counter-intuitive that a larger value of γ leads to lower classification performance. At To = 0.5, the best performing model produces 70 % accuracy.
Looking at the number of proposals, there is again considerable variation between different values of γ, but the number of proposals decreases considerably with an increasing objectness threshold To. For To = 0.5, the best performing value γ = 3 produces 112 ± 62 proposals per image. This is higher than the number of proposals that our non-classification methods produce (as shown in Chapter 7).
From a multi-task learning point of view, learning the classification task seems to be considerably harder than the objectness regression task. This is probably due to the different task complexity (modeling class-agnostic objects seems to be easier than modeling class-specific objects), and we also have to consider the limitations of our small training set. Other object detectors like Faster R-CNN are usually trained on much bigger datasets, which also contain more object pose variability.

[Figure 8.3: Detection Recall, Classification Accuracy and Number of Proposals as a function of the Objectness Threshold To. Different values of the multi-task loss weight γ (0.5, 1, 2, 3, 4) are shown.]

Figure 8.4 shows the relation between recall/accuracy and the number of detection proposals. For detection proposal recall, our results show that only 100 to 150 proposals per image are required to achieve 95 % recall, and using more proposals only marginally increases performance.
For classification accuracy, the pattern is different from proposal recall. As mentioned previously, there is a large variation in classification performance as the multi-task loss weight γ is varied, and γ = 3 clearly performs best. But accuracy increases only slowly as the number of proposals is increased, which shows that many proposals are being misclassified, indicating a problem with the classification branch of the network. We expected classification performance to increase in a similar way to proposal recall if the network were performing well at both tasks, but it is likely that our implicit assumption that both tasks are approximately equally hard does not hold.
While our object detector has high proposal recall, the results we
obtained in terms of classification are not satisfactory.
8.2.1 Introduction
Tracking is the process of first detecting an object of interest in an image and then continually detecting its position in subsequent frames. This operation is typically performed on video frames or, equivalently, on data that has a temporal dimension. Tracking is challenging due to possible changes in object appearance as time moves forward.
In general, tracking is performed by finding relevant features of the target object and trying to match them in subsequent frames¹¹. This is called feature tracking. An alternative formulation is tracking by detection¹², which uses simple object detectors to detect the target in the current and subsequent frames, with additions in order to exploit temporal correlations.
In this section we evaluate a tracker built with our matching function, to showcase a real underwater robotics application.

¹¹ Alper Yilmaz, Omar Javed, and Mubarak Shah. Object tracking: A survey. ACM Computing Surveys (CSUR), 38(4):13, 2006.
¹² Luka Čehovin, Aleš Leonardis, and Matej Kristan. Visual object tracking performance measures revisited. IEEE Transactions on Image Processing, 25(3):1261–1274, 2016.
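At its core, tracking by detection with a matching function reduces to a simple loop: in each new frame, candidate windows are scored against a template of the target, and the best-scoring window above a minimum score becomes the new target location. The sketch below is a generic version of this loop; propose_windows and match_score are hypothetical stand-ins for a window generator and a matching function (our CNN matcher or the cross-correlation baseline used later in this section), and the normalized cross-correlation shown is only one plausible form of the similarity in Eq. 8.2.

    # Generic tracking-by-detection loop. propose_windows and match_score are
    # hypothetical stand-ins for a window/proposal generator and a matching
    # function (CNN matcher or cross-correlation baseline).
    import numpy as np

    def normalized_cross_correlation(patch_a, patch_b):
        """One plausible cross-correlation similarity between two equally
        sized patches (an assumption; Eq. 8.2 may differ in detail)."""
        a = patch_a.astype(np.float64).ravel()
        b = patch_b.astype(np.float64).ravel()
        a -= a.mean()
        b -= b.mean()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(np.dot(a, b) / denom) if denom > 0 else 0.0

    def track(frames, template, propose_windows, match_score, min_score=0.01):
        """Yield the best-matching window in each frame, or None if no
        candidate scores above min_score (0.01 as for the CC baseline)."""
        for frame in frames:
            candidates = propose_windows(frame)      # list of (box, patch) pairs
            scored = [(match_score(template, patch), box)
                      for box, patch in candidates]
            best_score, best_box = max(scored, key=lambda s: s[0],
                                       default=(0.0, None))
            yield best_box if best_score > min_score else None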
For the Marine Debris task, tracking is required as the AUV might experience underwater currents that cause the target object to move in the sonar image, and during object manipulation the object must be tracked robustly. One way to measure the robustness of the tracker is the number of correctly tracked frames (CTF)¹³, which is also human interpretable.

¹³ Luka Čehovin, Aleš Leonardis, and Matej Kristan. Visual object tracking performance measures revisited. IEEE Transactions on Image Processing, 25(3):1261–1274, 2016.

To make a fair comparison, we constructed another tracker that uses a cross-correlation similarity (Eq. 8.2) instead of our CNN matching function. If SCC > 0.01, then we declare a match. This is a delib-
[Figure: Our Tracker vs the CC Tracker as a function of the IoU threshold Ot: (a) Can Sequence, (b) Valve Sequence, (c) Bottle Sequence.]
[Figure: IoU over the normalized frame number for Our Tracker and the CC Tracker: (a) Can Sequence, (b) Valve Sequence, (c) Bottle Sequence.]
and tune.
We do not believe that our tracker can outperform state-of-the-art trackers that are based on CNNs. Other, newer approaches are likely to perform better but require considerably more data, for example the deep regression networks for tracking by Held et al.¹⁴ The winners of the Visual Object Tracking (VOT) challenge (held each year) are strong candidates to outperform our method, but usually these methods are quite complex and are trained on considerably larger datasets.

¹⁴ David Held, Sebastian Thrun, and Silvio Savarese. Learning to track at 100 fps with deep regression networks. In European Conference on Computer Vision, pages 749–765. Springer, 2016.
A big challenge in Machine Learning and Computer Vision is to solve tasks without requiring large training sets¹⁵. While large datasets are available for many computer vision tasks, their availability for robotics scenarios is smaller, and for underwater environments in particular public datasets are very scarce. Our tracking technique is therefore particularly valuable in contexts where large training sets are not available.

¹⁵ Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 843–852, 2017.
Improvements in the matching function, either from better network models or from data with more object variability, will immediately transfer into improvements in tracking. As we expect detection proposals to be used on an AUV to find "interesting" objects, using the matching networks for tracking can be considered a more streamlined architecture than a separate network that performs tracking as a black box, independently of object detection.
9 Conclusions and Future Work
There is plenty of work that can be done in the future to extend this
thesis.
The dataset that we captured does not fully cover the large variety of marine debris, and it only has one environment: the OSL water tank. We believe that a larger scientific effort should be made to capture an ImageNet-scale dataset of marine debris in a variety of real-world environments; this requires the effort of more than just one PhD student. A new dataset should consider a larger set of debris objects and include many kinds of distractor objects, such as rocks and marine flora and fauna, with a richer variety of environments, like sand, rocks, and mud.
Bounding box prediction seems to be a complicated issue, as our informal experiments showed that, with the data available to us, it does not converge to a usable solution. Because of this we used fixed-scale bounding boxes for detection proposals, which works well for our objects but would not work with larger scale variations. A larger and more varied dataset could make a state-of-the-art object detection method such as SSD or Faster R-CNN work well, upon which more advanced detection proposal methods could be built.
We only explored the use of the ARIS Explorer 3000 Forward-Looking Sonar to detect marine debris, but other sensors could also be useful. In particular, a civilian Synthetic Aperture Sonar could be used for large-scale surveying of the seafloor, notably to locate regions where debris accumulates, so that AUVs can target these regions more thoroughly. Underwater laser scanners could also prove useful to recognize debris or to perform manipulation and grasping for collection, but these sensors would require new neural network architectures to deal with the highly unstructured outputs that they produce. There are newer advances in neural networks⁴ that can process point clouds produced by laser sensors, but they are computationally expensive.

⁴ Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 1(2):4, 2017.

Another promising approach to detect marine debris is that, instead of using object detection methods, which detect based on visual
A Randomly Selected Samples of the Marine Debris Dataset
Bibliography
[3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer
normalization. arXiv preprint arXiv:1607.06450, 2016.
[5] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded up robust features. In European Conference on Computer Vision, pages 404–417. Springer, 2006.
[9] Edward Belcher, Dana Lynn, Hien Dinh, and Thomas Laugh-
lin. Beamforming and imaging with acoustic lenses in small,
[11] James Bergstra and Yoshua Bengio. Random search for hyper-
parameter optimization. Journal of Machine Learning Research,
13(Feb):281–305, 2012.
[26] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgra-
dient methods for online learning and stochastic optimization.
Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
[50] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Spatial pyramid pooling in deep convolutional networks for
visual recognition. In European Conference on Computer Vision,
pages 346–361. Springer, 2014.
[51] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition. In Proceedings
of IEEE CVPR, pages 770–778, 2016.
[54] Jan Hosang, Rodrigo Benenson, Piotr Dollár, and Bernt Schiele.
What makes for effective detection proposals? IEEE transac-
tions on pattern analysis and machine intelligence, 38(4):814–830,
2016.
[55] Jan Hosang, Rodrigo Benenson, and Bernt Schiele. How good
are detection proposals, really? arXiv preprint arXiv:1406.6962,
2014.
[68] Diederik Kingma and Jimmy Ba. Adam: A method for stochas-
tic optimization. arXiv preprint arXiv:1412.6980, 2014.
[75] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner.
Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11):2278–2324, 1998.
[76] WC Li, HF Tse, and L Fok. Plastic waste in the marine envi-
ronment: A review of sources, occurrence and effects. Science
of the Total Environment, 566:333–349, 2016.
[77] Min Lin, Qiang Chen, and Shuicheng Yan. Network in net-
work. arXiv preprint arXiv:1312.4400, 2013.
[79] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
[82] Ping Luo, Xinjiang Wang, Wenqi Shao, and Zhanglin Peng.
Understanding regularization in batch normalization. arXiv
preprint arXiv:1809.00846, 2018.
[96] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 1(2):4, 2017.
[102] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[103] Pere Ridao, Marc Carreras, David Ribas, Pedro J Sanz, and Gabriel Oliver. Intervention AUVs: The next challenge. IFAC Proceedings Volumes, 47(3):12146–12159, 2014.
[105] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF. In 2011 International Conference on Computer Vision, pages 2564–2571. IEEE, 2011.
[108] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[109] Peter G Ryan. Litter survey detects the South Atlantic 'garbage patch'. Marine Pollution Bulletin, 79(1-2):220–224, 2014.
[115] Kyra Schlining, Susan Von Thun, Linda Kuhnz, Brian Schlining, Lonny Lundsten, Nancy Jacobsen Stout, Lori Chaney, and Judith Connor. Debris in the deep: Using a 22-year video annotation database to survey marine litter in Monterey Canyon, Central California, USA. Deep Sea Research Part I: Oceanographic Research Papers, 79:96–105, 2013.
ARIS-Explorer-3000/015335_RevC_ARIS-Explorer-3000_
Brochure.
[132] Engin Tola, Vincent Lepetit, and Pascal Fua. A fast local
descriptor for dense matching. In Computer Vision and Pattern
Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8.
IEEE, 2008.
[138] Paul Viola and Michael Jones. Rapid object detection using
a boosted cascade of simple features. In Computer Vision and
Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001
IEEE Computer Society Conference on, volume 1, pages I–I. IEEE,
2001.
[140] Chris Wilcox, Erik Van Sebille, and Britta Denise Hardesty.
Threat of plastic pollution to seabirds is global, pervasive,
and increasing. Proceedings of the National Academy of Sciences,
112(38):11899–11904, 2015.
[146] Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. LIFT: Learned Invariant Feature Transform. In European Conference on Computer Vision, pages 467–483. Springer, 2016.
[147] Alper Yilmaz, Omar Javed, and Mubarak Shah. Object tracking: A survey. ACM Computing Surveys (CSUR), 38(4):13, 2006.