5.1. Dataset Configuration
Our approach was principally assessed using the MASATI dataset described in
Section 4.3. In order to measure how the complexity of the dataset, in terms of the number of classes, impacts the classification process, three sets were created by grouping samples from different classes.
We evaluated these sets considering both the main class (which only distinguishes between ships and non-ships) and the sub-class (the fine-grained labels, see
Table 3). Therefore, we performed a total of five experiments: one for Set 1, whose main and sub-class labels are the same, and two for each of Sets 2 and 3.
In all the experiments, we used n-fold cross validation (with n = 5), which yields a better Monte-Carlo estimate than performing the tests on a single random partition [50]. Our dataset was consequently divided into n mutually exclusive sub-sets, maintaining the percentage of samples for each class. For each fold, we used one of the partitions for test (20% of the samples) and the rest for training (80%). The classifier was trained n times using these sets, after which the average results were calculated.
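As an illustration of this protocol, the following sketch uses scikit-learn's StratifiedKFold so that every partition keeps the class proportions; the builder and evaluation callbacks are placeholders rather than the actual implementation used in our experiments:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(X, y, build_and_train, evaluate, n_folds=5):
    """Stratified n-fold cross-validation: each fold keeps the class proportions,
    using one partition (20%) for test and the remaining 80% for training."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        model = build_and_train(X[train_idx], y[train_idx])   # train on 4 folds
        scores.append(evaluate(model, X[test_idx], y[test_idx]))  # test on the held-out fold
    return np.mean(scores), np.std(scores)
```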
Before training, the raw input data were normalized to zero mean and unit variance by means of a z-score normalization [51]:

$$\hat{M} = \frac{M - \mu}{\sigma}$$

where M is the input matrix containing the raw image pixels from the training subset, and μ and σ are its mean and standard deviation. The same operation was subsequently applied to the test subset, maintaining the same mean and standard deviation calculated for the training. The z-score normalization yields data that follow a standard normal distribution, i.e., each dimension of the data has zero mean and unit standard deviation. This helps correct outliers and remove the effect of lighting, thus improving the performance.
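A minimal sketch of this normalization step (assuming the images have already been flattened into a pixel matrix; variable names are illustrative):

```python
import numpy as np

def zscore_fit(M_train, eps=1e-8):
    """Estimate mean and standard deviation on the training pixels only."""
    mu = M_train.mean(axis=0)
    sigma = M_train.std(axis=0) + eps   # eps avoids division by zero
    return mu, sigma

def zscore_apply(M, mu, sigma):
    """Apply z-score normalization with the statistics estimated on the training set."""
    return (M - mu) / sigma

# Usage: the same (mu, sigma) computed on the training subset is reused for the test subset.
# mu, sigma = zscore_fit(X_train.reshape(len(X_train), -1))
# X_train_norm = zscore_apply(X_train.reshape(len(X_train), -1), mu, sigma)
# X_test_norm  = zscore_apply(X_test.reshape(len(X_test), -1), mu, sigma)
```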
In this work, we have applied data augmentation [30,52] to artificially increase the size of the training subset: new samples are generated by applying certain random transformations (specifically, translation, rotation and reflection) to the input data. This technique usually improves the performance and helps reduce overfitting.
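The following sketch shows how such augmentation could be configured with Keras' ImageDataGenerator; the specific transformation ranges are assumptions and not the exact values used in our experiments:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random translations, rotations and reflections (illustrative ranges).
augmenter = ImageDataGenerator(
    rotation_range=20,         # random rotation in degrees
    width_shift_range=0.1,     # horizontal translation (fraction of the width)
    height_shift_range=0.1,    # vertical translation (fraction of the height)
    horizontal_flip=True,      # reflection
    vertical_flip=True,
)

# Generate batches of augmented samples from the training subset:
# train_flow = augmenter.flow(X_train, y_train, batch_size=32)
# model.fit(train_flow, epochs=...)
```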
5.2. Evaluation Metrics
In order to evaluate the performance of the proposed models, two evaluation metrics widely used for this kind of task were chosen: F-measure (F1) and Average Precision (AP). F1 can be defined by means of precision and recall:

$$F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

where TP (True Positives) denotes the number of correctly detected ships, FN (False Negatives) the number of non-detected or missed ships, and FP (False Positives, or false alarms) the number of incorrectly detected ships.
The Average Precision (AP) metric has been used to compare the behavior of our method with that of others in terms of the ship classification ratio. AP calculates the mean precision value over the recall interval [0, 1], which is equivalent to the area under the Precision-Recall Curve (PRC). It is not, therefore, necessary to plot the PRC in order to show the performance.
Since both datasets contain unbalanced classes, such as the MASATI classes Multi and Coast & Ship, we have chosen these metrics rather than accuracy or other alternatives, as they are more appropriate and fairer in the case of unbalanced data. In addition, for the classification using sub-classes, we calculate the confusion matrix in order to analyze the errors.
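For reference, these metrics (and the per-fold confusion matrix) can be computed with scikit-learn as in the following sketch, where the binary "ship" class is assumed to be the positive label:

```python
from sklearn.metrics import f1_score, average_precision_score, confusion_matrix

def evaluate_fold(y_true, y_pred, y_score):
    """y_pred holds the predicted labels; y_score holds scores/probabilities for the positive (ship) class."""
    f1 = f1_score(y_true, y_pred)                  # 2*P*R / (P + R)
    ap = average_precision_score(y_true, y_score)  # area under the precision-recall curve
    cm = confusion_matrix(y_true, y_pred)          # rows: true classes, columns: predicted classes
    return f1, ap, cm
```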
5.3. Hyperparameters Evaluation
Both size and image quality may have an impact on the performance of the deep learning techniques used, as mentioned in [
53,
54]. In order to determine how the image size could affect our models, we performed exhaustive experimentation by rescaling the input images to sizes ranging from 32 × 32 to 512 × 512 pixels using anisotropic scaling. This scaling was used rather than a square crop because cropping may remove information that is relevant to our task, such as ships that are close to the image borders.
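A minimal sketch of this anisotropic rescaling, here with OpenCV (the list of evaluated sizes is only illustrative):

```python
import cv2

def rescale_anisotropic(image, size):
    """Resize to size x size without preserving the aspect ratio (anisotropic scaling),
    so ships close to the image borders are not cropped out."""
    return cv2.resize(image, (size, size), interpolation=cv2.INTER_AREA)

# Example: evaluate several input sizes between 32x32 and 512x512 pixels.
# for size in (32, 64, 100, 128, 224, 250, 512):
#     X_resized = [rescale_anisotropic(img, size) for img in images]
```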
Figure 3 shows the average F1 score obtained for Set 3 with the main class labeling, using the network model described in
Table 1. We performed this experiment using only the Baseline network because the other models have a fixed input size due to their architecture limitations. As can be seen in this plot, changing the size of the image makes the F1 vary by up to 5%, and the optimal results are obtained with image sizes ranging from 100 × 100 to 250 × 250 pixels. This experiment allows us to verify that the optimal input-size range includes the input size of the other CNN topologies evaluated, which is fixed at 224 × 224 pixels.
Furthermore, as previously mentioned, some images underwent transformations in order to generate new data and increase the size of the training set. Data augmentation increases the performance when using our baseline CNN and the MASATI dataset, as shown in
Figure 4. This plot shows the positive impact of data augmentation on the F1, which increases until stability is attained. This experiment was also carried out with Set 3 and the main class labeling. In addition to the baseline network, we evaluated two other models, VGG-16 and Xception, which are those that obtained the best results. The highest increase occurs at the beginning, when around five to ten new samples are generated from each image in the original dataset. Although a higher F1 can be obtained by adding many more augmented samples, we set this parameter to 5 for the subsequent experiments, as a high number of samples also significantly increases the computational cost of the training stage. All these experimental tests allowed us to adjust the training parameters for the CNN topologies evaluated in the following section.
5.4. Results with MASATI Dataset
This section shows the evaluation of the six CNN topologies described in
Section 4.1. Training was carried out in three different ways for each network: Final (backpropagating only through the last three layers, which are fully connected), Middle (backpropagating through the second half of the network), and Full (training the whole network). In all three cases, weights were initialized using a model pre-trained with ILSVRC ImageNet [44] in order to subsequently perform fine-tuning. The only exception was the baseline network, which was only trained without initialization and using the Full approach since, being a custom model, it has no pre-trained weights available.
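The three training modes can be illustrated with the following Keras sketch; the VGG-16 backbone, the layer split used for "middle", and the size of the fully connected head are assumptions made for the example rather than our exact configuration:

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

def build_finetune_model(mode="full", n_classes=2, input_shape=(224, 224, 3)):
    """mode: 'final'  -> train only the fully connected head,
             'middle' -> train the second half of the convolutional base and the head,
             'full'   -> train every layer. Weights start from ImageNet."""
    base = VGG16(weights="imagenet", include_top=False, input_shape=input_shape)
    if mode == "final":
        for layer in base.layers:
            layer.trainable = False
    elif mode == "middle":
        half = len(base.layers) // 2
        for layer in base.layers[:half]:
            layer.trainable = False
    # 'full': every layer stays trainable
    x = layers.Flatten()(base.output)
    x = layers.Dense(4096, activation="relu")(x)
    x = layers.Dense(4096, activation="relu")(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(base.input, out)
```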
In order to adjust the kNN classification of our architecture, five values of k = {1, 5, 10, 15, 20} were tested. We evaluated incremental values of k until a downward trend was obtained, selecting the one that yielded the highest F1. The value from this set closest to the best average k obtained with ℓ2-normalized neural codes was therefore chosen for the remaining experiments. The results obtained when using the standard Softmax output on the neural codes are reported for each network model and training method (NC + Softmax), along with those of the proposed architecture (NC + ℓ2 + kNN).
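A simplified sketch of this pipeline, extracting neural codes from an intermediate layer, ℓ2-normalizing them, and classifying with kNN over the training prototypes; the layer name and the default k are placeholders:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import normalize
from tensorflow.keras import models

def classify_with_neural_codes(cnn, layer_name, X_train, y_train, X_test, k=5):
    """Extract neural codes (NC) from the given layer, l2-normalize them,
    and classify the test codes by k-nearest neighbours over the training prototypes."""
    extractor = models.Model(cnn.input, cnn.get_layer(layer_name).output)
    nc_train = normalize(extractor.predict(X_train).reshape(len(X_train), -1), norm="l2")
    nc_test = normalize(extractor.predict(X_test).reshape(len(X_test), -1), norm="l2")
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(nc_train, y_train)
    return knn.predict(nc_test)
```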
Table 4 and
Table 5 show a comparison of the results obtained on the three sets described in
Section 5.1 using all network models and architecture configurations, along with the three types of fine-tuning.
Table 4 shows the results obtained using the main class labeling (Ship/Non-ship). To perform this evaluation, we trained the CNN using only two classes and then obtained the neural codes. During classification, evaluation was performed using both Softmax and a kNN search on the training set prototypes.
Table 5 shows the results obtained with the same methodology as above, but in this case the models were trained and evaluated with the seven sub-classes.
Table 5 does not include the results for Set 1 as they are the same as those shown in
Table 4 (this set only includes the Ship and Non-ship labels). In both tables, the value displayed is the F1 percentage averaged over the 5 folds. The best result for each model and set is shown in bold type. The lower rows show the average over all models for each set, and also over all sets, excluding the baseline network, since it can only contribute to one of the averages and would therefore bias those figures.
As can be seen in
Table 4 and
Table 5, the best average results for all models are obtained with NC + ℓ2 + kNN and full training. The Xception model specifically yields the highest score for the three sets, closely followed by VGG-16, ResNet, and Inception V3. As can be noted, the fine-tuning configuration also has an impact on the F1: on average, middle training outperforms final training, and full training outperforms both.
To analyze this hypothesis in a more rigorous way, we performed a statistical significance analysis using the non-parametric Wilcoxon signed-rank test [55]. More precisely, the idea is to assess whether the improvement observed in the classification performance with the use of NC + ℓ2 + kNN is statistically relevant. For this purpose, we compared the results obtained with both approaches (NC + Softmax and NC + ℓ2 + kNN) in each of the five folds, with all the network models, the three types of training, the three sets evaluated, and both the main and the sub-class levels. The results showed that the proposed method (NC + ℓ2 + kNN) significantly improved on the NC + Softmax approach considering the most restrictive statistical significance threshold normally used.
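In practice, such a test can be run with SciPy over the paired per-configuration results, as in the following sketch with purely illustrative numbers (not the values obtained in our experiments):

```python
from scipy.stats import wilcoxon

# Paired F1 results of NC + Softmax and NC + l2 + kNN over the same folds,
# models, training types, sets and labeling levels (illustrative values only).
f1_softmax = [0.81, 0.84, 0.79, 0.88, 0.90, 0.86, 0.83, 0.87]
f1_knn     = [0.85, 0.88, 0.84, 0.91, 0.93, 0.90, 0.86, 0.90]

stat, p_value = wilcoxon(f1_softmax, f1_knn)
print(f"Wilcoxon signed-rank: statistic={stat:.2f}, p={p_value:.4f}")
```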
Previous experiments showed that adding weight initialization increases the F1 by 4.5% on average. It is therefore expected that using this technique on the baseline model would increase its performance similarly, although it would still be far from the state-of-the-art models evaluated. As shown in
Figure 4, adding more training images helps improve the F1 until a limit is reached, beyond which the models do not seem to improve further. This indicates that the results obtained are also highly dependent on the network architecture used. For example, when classifying Sets 1 and 2 there is a difference of more than 10% between VGG-16 and Xception, both using weight initialization.
Table 4 and
Table 5 show that the best average results were obtained with Set 3, which can be explained by the fact that it is the set with the largest number of samples. However, the results for Set 2 are slightly lower than those for Set 1, possibly owing to the addition of two new unbalanced classes (Coast and Coast & Ship). Additionally, Set 3 contains more classes and many more samples of ships and land (more than 3000 new samples are added), allowing better discrimination.
In addition, there is a difference in F1 between the average results obtained when classifying the main classes (
Table 4) and the sub-classes (
Table 5). Note that the average result obtained with the sub-classes increases for Set 2 and decreases for Set 3, although if we compare the results of the best model (Xception), an improvement is observed in both cases. This improvement when classifying with the sub-classes (which, despite involving more classes, obtain better results) may be due to the fact that certain classes, such as Coast and Coast & Ship, are confused when grouped into Ship or Non-ship.
In order to analyze these errors, we visualize them using confusion matrices.
Figure 5a shows the normalized confusion matrix for sub-classes. It can be seen that most instances are correctly classified, and the errors are only caused by confusion between Ship and Sea, as well as between Coast and Coast & Ship. As can be seen in
Figure 2, Ship and Sea samples are very similar since, in general, the ships are very small, which can lead to misclassification. The same occurs with the Coast and Coast & Ship classes.
Figure 5b shows the normalized confusion matrix for the main-class classification, but analyzed at the sub-class level. That is, in this case the system was trained and evaluated at the main-class level, but we generated the confusion matrix by assessing each error according to the sub-class to which the sample belongs. It can be seen that, although the network is trained only to differentiate between the main classes, it learned discriminative characteristics that also allow it to distinguish the ships at the sub-class level.
It can be seen that the main mistakes are made within the main-class group (separated into quadrants by two blue lines in the graph), for example when classifying Ship samples as Detail, or Coast samples as Land. Far fewer errors are made between the main classes, where only some samples of Sea and Ship, and of Coast and Coast & Ship, are confused.
In addition to the experiments shown in
Table 4 and
Table 5, which prove both the validity and the robustness of our proposal, we also compared it with other methods used for ship classification.
Table 6 shows a comparison of our best architecture (Xception NC + ℓ2 + kNN) and some traditional techniques mentioned in
Section 3. These experiments were carried out using the MASATI dataset and the main class labeling to discriminate the presence of ships. Specifically, the techniques with which we compare our method are:
“Features + NN”: An approach that is similar to the method described in [
7], which is based on hand-crafted features, was evaluated. This algorithm applies an adaptive threshold and a morphological opening (with a 2 × 2 kernel) to remove noise. The candidate ships are located by means of a region-growing process. These objects are characterized by a set of features that are used to train a neural network (a fully connected network with three hidden layers and four nodes in each layer).
“HOG + SVM”: An approach that extracts local features from the input images, based on the methods proposed in [
9,
10,
11]. This algorithm calculates the Histogram of Oriented Gradients (HOG) [
56] (with an 8 × 8 cell size, a 16 × 16 block size, and nine histogram channels) and classifies the resulting descriptors with an SVM (using the C-Support Vector implementation). HOG is based on counting the occurrences of gradient orientations in localized portions of an input image; a sketch of this baseline is given after this list.
“ORB + aNN”: This method uses the Oriented FAST and rotated BRIEF (ORB) [
57] (with an edge threshold of 10, a patch size of 31, a scale factor of 1.2, and eight levels in the scale pyramid) to extract local features that are matched using an approximate Nearest Neighbors (aNN) algorithm. For this step we evaluated incremental values of k, finally selecting the one that obtained the best results. ORB is a fast and robust local feature detector based on the FAST keypoint detector and the BRIEF (Binary Robust Independent Elementary Features) visual descriptor, and includes some modifications to enhance performance.
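As an example of these traditional baselines, the following sketch reproduces the "HOG + SVM" pipeline with scikit-image and scikit-learn; the image size, SVM kernel and penalty parameter are assumptions for illustration, not the exact values used in the comparison:

```python
import cv2
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

def hog_features(image, size=128):
    """Grayscale, resize, and compute HOG with 8x8 cells, 16x16 blocks (2x2 cells), 9 orientations."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, (size, size))
    return hog(gray, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")

# X_train_hog = np.array([hog_features(img) for img in train_images])
# clf = SVC(kernel="linear", C=1.0)   # kernel and C are assumptions
# clf.fit(X_train_hog, y_train)
# y_pred = clf.predict(np.array([hog_features(img) for img in test_images]))
```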
According to the results of the comparison shown in
Table 6, it is possible to affirm that, given the ground truth of our dataset, our best approach (Xception neural codes normalized with ℓ2 and classified with kNN) outperforms the best result of any of the previous methods, attaining an F1 score of 99.05% versus 79.27% for “HOG + SVM” (the best of the traditional methods).
5.5. Results with NWPU VHR-10 Dataset
The evaluation has also been performed with an existing dataset used in the literature for aerial scene classification, in order to compare the score of our best setup with other approaches that have previously been evaluated on these data. NWPU VHR-10 [
20,
49] is a challenging ten-class geospatial object classification dataset that contains 800 VHR optical remote sensing images gathered from Google Earth. These images are of different sizes, with spatial resolutions ranging from 0.08 to 2 m. They are divided into two sets: a positive set of 650 images, each of which contains at least one target to be detected, and a negative set of 150 images that do not contain any targets. The positive set consists of the following classes: 757 airplanes, 302 ships, 655 storage tanks, 390 baseball diamonds, 524 tennis courts, 159 basketball courts, 163 ground track fields, 224 harbors, 124 bridges, and 477 vehicles.
The experiments were carried out by organizing this dataset into only two classes: ships (all samples containing vessels) and non-ships (all the other images, including those from the negative set).
Table 7 shows a quantitative comparison, in terms of the Average Precision (AP), of our architecture using the best CNN approach (Xception) and five other state-of-the-art methods:
“BOW-SVM” of Xu et al. [
58] based on a Bag-Of-Words (BOW) feature representation and an SVM classifier.
“SSCBOW” of Sun et al. [
59] based on Spatial Sparse Coding (SSC) and BOW.
“Exemplar-SVMs” of Malisiewicz et al. [
60] based on a set of exemplar-based SVMs.
“FDDL” of Han et al. [
61] based on visual saliency modeling and Fisher Discrimination Dictionary Learning.
“COPD” of Cheng et al. [
49] based on a collection of part detectors, in which each detector is a linear SVM classifier specialized in the classification of objects or recurring spatial patterns within a certain range of orientation.
The details regarding the implementation and parameters used in these five methods can be found in the work by Cheng et al. [
49]. As illustrated in
Table 7, our method significantly outperforms all the approaches evaluated in terms of AP for the ship recognition task.
The initialization of our method was performed using the weights learned with the MASATI dataset for the ship/non-ship model. Once the model had been initialized with these weights, it was trained for 20 epochs. The results in
Table 7 are shown both when initializing the network with these weights without further training (obtaining 78.12%) and after the 20 fine-tuning epochs without data augmentation (improving to 86.02%). For training and validation with this dataset, we also performed a five-fold cross-validation experiment: the dataset was split into five mutually exclusive subsets, maintaining the percentage of samples for each class, and the model was trained and validated five times. The reported results are the average performance.
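A minimal sketch of this transfer step, loading a hypothetical file containing the MASATI-trained ship/non-ship model and fine-tuning it for 20 epochs (file name, optimizer and batch size are illustrative assumptions):

```python
from tensorflow.keras.models import load_model

# Hypothetical file name: the model previously trained on MASATI for the
# ship / non-ship task, reused as initialization for NWPU VHR-10.
model = load_model("masati_ship_nonship.h5")
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Fine-tuning for 20 epochs without data augmentation (arrays are placeholders):
# model.fit(X_train_nwpu, y_train_nwpu, epochs=20, batch_size=32,
#           validation_data=(X_val_nwpu, y_val_nwpu))
```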
Our method based on NC + ℓ2 + kNN again obtains the best results for ship recognition when retraining the model with this dataset. Even without retraining, the method generalizes well, and its average precision is higher than that of most previous methods. When fine-tuning our network for only 20 epochs and without data augmentation, the average precision achieves state-of-the-art results.