1. Introduction
Small object detection in remote sensing involves identifying and precisely locating small-sized features in large-scale aerial or satellite images [1]. In applications such as urban monitoring, traffic planning, agriculture, maritime surveillance, disaster management, military operations, and environmental conservation, small objects offer crucial insights. Objects such as small buildings, vehicles, trees, or specific land-use patterns serve as indicators of significant phenomena or changes on the Earth's surface [2]. In urban planning, environmental monitoring, disaster management, and precision agriculture, small objects like buildings aid in assessing population density, monitoring vegetation health, estimating yields, and tracking land-use changes. They play a pivotal role in understanding transformations over time, enabling the observation of deforestation, climate change impacts, and other crucial developments.
Detecting tiny instances in remote imagery is particularly challenging due to occlusion, low contrast, and variations in spectral and shape characteristics [3]. These challenges are further exacerbated by illumination variations, atmospheric effects, and noise, all of which underscore the importance of accurate detection for enabling informed decision making. Annotated training datasets, although labor-intensive to create, serve as critical resources for developing effective methods [4]. Researchers commonly employ one-stage and two-stage methods for tiny object detection. One-stage methods, such as SSD (Single Shot Detector) and YOLO (You Only Look Once), predict object bounding boxes and class labels in a single pass, offering efficiency but potentially struggling to localize small objects [5,6,7]. Two-stage methods, such as Faster R-CNN, follow a two-step approach of region proposal and refinement and have achieved significant success in small object detection tasks [8].
The choice between one-stage and two-stage methods depends on the specific requirements of the task. One-stage methods, with their simpler architecture, offer computational efficiency and are well suited to autonomous driving and video analysis [9,10]. They excel at detecting and localizing large, well-represented objects, demonstrating higher recall rates for sizeable, clearly visible items. With fewer parameters, one-stage methods may also generalize better to unseen data and reduce overfitting [11]. Additionally, one-stage methods simplify the training process by allowing a direct, end-to-end approach, eliminating the separate region-proposal and detection stages and their associated computational expense. The complexity of small object detection in remote sensing arises from several factors, such as occlusion, low contrast, and variations in spectral and shape characteristics [12,13,14,15]. Atmospheric effects, illumination variations, and noise further complicate the task [16]. Despite these challenges, accurate small object detection is vital for extracting meaningful information from remote sensing imagery and enabling informed decision making across various applications. However, existing research on small object detection lacks proper investigation of speed, time, and score. Most existing studies focus either on two-stage methods alone or on both families at once. As a result, it is nearly impossible to identify the state-of-the-art one-stage method in terms of speed, time, and score.
Therefore, this study focuses solely on one-stage methods. To address the challenges of small object detection in remote sensing, this study proposes a novel pre-processing method for training YOLOv8 that achieves better performance metrics. It also conducts a benchmarking analysis of one-stage methods for tiny object detection in terms of time, accuracy, and speed. The pre-processing method focuses on enhancing data quality for better object detection. It tackles noise reduction and data normalization, indirectly enhancing contrast by making object features more distinguishable from the background. Standardizing the dataset and removing noise improves the clarity of object boundaries, thereby helping to handle occlusion to some extent. The method streamlines data formatting by converting object names into numerical representations, standardizing the representation of objects with different shapes and spectral characteristics and thus aiding detection. It explicitly addresses noise by employing regular expressions to remove extraneous strings from the dataset, ensuring that the model is trained on clean and relevant data. Additionally, data normalization helps mitigate the effects of illumination variations by scaling dataset values to a standardized range, making the model less sensitive to such variations during training. Moreover, the proposed method contributes to understanding the state of the art not only through the mean average precision but also in terms of speed and computational efficiency. This comprehensive evaluation aids in selecting the most practical method for real-world implementation, considering all relevant parameters.
In summary, the main objective of this research is to perform a benchmarking analysis of the existing one-stage object detection method under the same environment and to propose a pre-processing method aimed at achieving better mean average precision for all classes. The main contributions of this paper are as follows:
- Pre-processing: a novel pre-processing method is proposed to train the YOLOv8 model on the DOTA-v1.5 dataset in order to achieve a better mean average precision.
- Benchmarking analysis: a benchmarking analysis of one-stage methods for small object detection is performed on the DOTA-v1.5 dataset.
The rest of this article is structured as follows. Related works are analyzed in Section 2. The methodology is delineated in Section 3. Section 4 presents the performance evaluation and discussion. Finally, Section 5 concludes this article.
2. Related Work
This section reviews the studies related to one-stage object detection. Table 1 compares the proposed study with these related studies.
Yang et al. [17] propose the Small, Cluttered, and Rotated Object Detector++ (SCRDet++), focusing on reducing noise in object detection, particularly for small and crowded objects. They perform instance-level denoising on the feature map in order to improve detection accuracy. Wang et al. [18] introduce feature-merged single-shot detection (FMSSD), a comprehensive framework that combines contextual information from various scales by using the atrous spatial feature pyramid (ASFP) module. In addition, they adjust the loss function to prioritize small objects. Qian et al. [19] introduce rotated object detection with RSDet, offering advantages such as an adjusted rotation loss and the prediction of object corners, thus improving performance. Jiang et al. [26] present an Information Balanced Fusion Network (IBFF), a detector for small objects operating at multiple scales, featuring different attention-based context feature fusion (DACFF) modules. Zakaria et al. [20] integrate Instance-Level Denoising (ILD) from SCRDet++ into S2A-Net.
Cheng et al. [27] present the Anchor-Free Oriented Proposal Generator (AOPG), eliminating horizontal-box-related operations by utilizing a Coarse Location Module (CLM) for initial coarse oriented-box generation without anchors; a Fast R-CNN (Region-based Convolutional Neural Network) head then refines these boxes into high-quality oriented proposals. Li et al. [21] propose the Dense Path Aggregation Feature Pyramid Network (DPAFPN) as a single-stage detector for remote sensing data, which aims to use both the high-level semantic and low-level location information of the images. Qian et al. [22] suggest a Unified Transferring Strategy (UTS) for bounding box regression (BBR) in oriented object detection, introducing the Rotated Intersection over Union (RIoU) loss. Chen et al. [28] extend Faster R-CNN with Weighted Fusion and Refinement (WFR), Affine-Transformation-Based Feature Decoupling (ATFD), and Post-Classification Regression (PCR) modules for improved performance.
Gao et al. [29] propose a repulsion constraint for point representation, assessing centeredness quality and introducing oriented repulsion regression for densely packed targets in remote sensing. Hou et al. [23] present G-Rep, a unified representation that uses Gaussian distributions for the OBB, QBB, and PointSet, optimizing parameters through maximum likelihood estimation. Wei et al. [30] offer a lightweight method for generating proposals for arbitrarily oriented objects, using a rotated region proposal network and a rotation-equivariant backbone. Lin et al. [24] augment foreground features in a one-stage object detection system by including a keypoint attention module and a prototype contrastive learning module. Cao et al. [25] integrate semantic edge detection with arbitrarily oriented object detection, introducing a feature-enhancement network and a rotation-invariant spatial pooling pyramid. Zheng et al. [31] propose CrossNet, an end-to-end deep neural network using cross-scale warping, which improves the accuracy and efficiency of reference-based super-resolution by performing spatial alignment at the pixel level. Law et al. [32] propose CornerNet, a single convolutional neural network that detects objects as paired keypoints, outperforming existing one-stage detectors on MS COCO with 42.2% AP. Duan et al. [33] propose CenterNet, which improves detection precision and recall by modeling each object as a triplet of keypoints, outperforming existing one-stage detectors by at least 4.9%.
The studies referenced above primarily emphasize both one-stage and two-stage detection methods, focusing exclusively on the mean average precision (mAP) without considering other crucial factors such as speed, processing time, and additional relevant parameters. However, a comprehensive evaluation of all metrics is essential to gain a more thorough understanding of the performance of these approaches. To address this gap, the current study introduces a novel pre-processing approach designed to improve the training process of YOLOv8. Additionally, a comprehensive benchmarking evaluation was conducted on the DOTA-v1.5 dataset to assess the effectiveness of the proposed approach in enhancing the performance of small object detection against state-of-the-art one-stage methods.
3. Methodology
The computer environment utilized for the studies has an Intel(R) Core(TM) i7-9700 CPU running at 3.00 GHz and 32.0 GB of RAM (31.8 GB usable), and runs 64-bit Windows 11 Pro, version 22H2 (OS build 22621.1702), with Windows Feature Experience Pack 1000.22641.1000.0. The experimental server configuration is far more robust, with improved connectivity and computational capabilities.
It comprises network connectivity with four InfiniBand 100 Gbps EDR and two 10 GbE connections. The server uses 8x NVIDIA Tesla V100 GPUs, each with 16 GB of RAM, for a total of 40,960 NVIDIA CUDA cores and 5120 Tensor cores. These GPUs are linked together via the NVIDIA NVLink Hybrid Cube Mesh, which ensures high-bandwidth communication between them. The system memory is significant, comprising 512 GB DDR4 LRDIMM, and the CPU configuration comprises two 20-core Intel Xeon E5-2698 v4 processors operating at 2.2 GHz.
The server’s storage subsystem has four 1.92 TB SSDs deployed in a RAID 0 array, giving fast data access and a total capacity of 7.68 TB. Power is handled by four 1600 W power supply units (PSUs) with a combined thermal design power (TDP) of 3500 W, providing ample headroom for the high-performance components. The system’s cooling is tuned for optimal front-to-back airflow, ensuring stable operation even under high computational loads. This high-performance configuration, summarized in Table 2, allows for full benchmarking and analysis, supporting the study’s need to efficiently handle massive amounts of data and sophisticated computations.
3.1. Datasets
The Dataset for Object deTection in Aerial images (DOTA) is a well-known benchmark in the field of object detection, designed particularly for high-resolution aerial images. It has contributed significantly to the development and assessment of object detection algorithms. The DOTA dataset has gone through multiple revisions, with each iteration bringing new features that improve its usability for researchers and practitioners. The following is a summary of the versions of the DOTA dataset, highlighting their contributions and advancements.
DOTA-v1.0, released in 2018, was the first version of the DOTA dataset. This first edition includes 2806 aerial images taken across a variety of geographic regions and settings, including urban and rural areas. The collection includes annotations for 15 object categories, covering a wide range of real-world objects typically seen in aerial imagery, including planes (PL), ships (SH), storage tanks (ST), and basketball courts (BC), among others. Each object in the images is tagged with bounding box coordinates and a category label, making it easier to develop and test object detection algorithms. DOTA-v1.0 provided a foundational dataset for assessing object detection algorithms in aerial images, addressing the need for high-resolution, diverse, annotated datasets. The annotations in this version were created to help researchers train and test object detection algorithms, allowing them to compare their predictions against a consistent collection of data.
Building on the success of DOTA-v1.0, DOTA-v1.5 was released as an expansion of the original dataset. DOTA-v1.5 included enhanced annotations designed to improve both the precision and dependability of object labeling. While the dataset size remained comparable to that of DOTA-v1.0, the improved annotations provided higher coverage and more exact classification of objects within the images; notably, DOTA-v1.5 annotates extremely small instances that were previously ignored and adds a sixteenth category, container crane (CC). DOTA-v1.5 aimed to solve problems identified in the previous version, notably in terms of annotation quality and object categorization. This version sought to remove ambiguities and inconsistencies in the annotations, improving the performance of object detectors trained on the dataset. The refined annotations made it easier to evaluate model performance and helped advance object recognition algorithms for aerial imagery.
The DOTA-v2.0 version, published in 2019, significantly expanded the dataset. This iteration retains the original 2806 images while making numerous significant changes. One of the most important innovations of DOTA-v2.0 was the introduction of a new background class, which increased the overall number of categories to 15 + 1. This update was intended to offer a more thorough portrayal of the many objects and backgrounds found in aerial images. The annotations in DOTA-v2.0 were improved, increasing both the accuracy and coverage of object labeling. This version also added a wider range of object categories and enhanced annotation consistency, resulting in a more rigorous benchmark for assessing object detection methods. Better annotations and an enlarged dataset made it possible to compare and analyze model performance in more detail, aiding the creation of increasingly sophisticated object detection techniques.
The current work makes use of DOTA-v1.5, which provides a comprehensive set of object categories for model building and evaluation. This version includes the following object categories: bridge (BR), helicopter (HC), storage tank (ST), soccer ball field (SBF), small vehicle (SV), plane (PL), large vehicle (LV), ground track field (GTF), tennis court (TC), ship (SH), swimming pool (SP), container crane (CC), basketball court (BC), harbor (HA), roundabout (RA), and baseball diamond (BD). This wide set of categories covers a variety of objects and structures typically seen in aerial images, making the dataset extremely useful for training and assessing object detection algorithms. The DOTA-v1.5 dataset is described in full in Figure 1. Figure 1a depicts the frequency of the various object labels, while Figure 1b displays a correlogram of the labels. The frequency plot shows the distribution of object labels in the dataset, indicating how often each category appears across the images. The correlogram, in turn, shows the associations between labels, revealing linkages and co-occurrence patterns across different object types. The DOTA-v1.5 dataset is split into subsets for model training and assessment, with 70% for training, 20% for validation, and 10% for testing [34]. This split enables a thorough evaluation of object detection algorithms, guaranteeing that models are evaluated on previously unseen data and that their performance is correctly measured at various phases of development.
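A file-level 70/20/10 partition of this kind can be reproduced with a few lines of Python. The sketch below is a minimal illustration only; the directory layout, file extension, and random seed are assumptions rather than details from the original setup.

```python
import random
from pathlib import Path

def split_dataset(image_dir, train=0.7, val=0.2, seed=42):
    """Partition image files into train/val/test subsets (70/20/10)."""
    images = sorted(Path(image_dir).glob("*.png"))  # DOTA images are PNG files
    random.Random(seed).shuffle(images)
    n_train = int(len(images) * train)
    n_val = int(len(images) * val)
    return {
        "train": images[:n_train],
        "val": images[n_train:n_train + n_val],
        "test": images[n_train + n_val:],  # remaining ~10%
    }

splits = split_dataset("DOTA-v1.5/images")  # hypothetical path
print({name: len(files) for name, files in splits.items()})
```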
3.2. Method
Existing research on small object detection frequently skips a thorough examination of one-stage approaches, especially with respect to critical performance variables such as speed, computing time, and detection scores. Because of this absence of extensive assessments, it is difficult to identify and adopt the most effective strategies for detecting small objects. To address these deficiencies, our research focuses on a detailed benchmarking analysis and the development of a novel pre-processing approach for the YOLOv8 model. The DOTA-v1.5 dataset, which is notable for its wide range of object categories and high-resolution images, is used to assess the efficacy of one-stage algorithms. This dataset provides a solid foundation for evaluating how well different algorithms perform under difficult settings, such as spotting small, densely packed objects. Using DOTA-v1.5, we aim to give a complete comparison of existing one-stage approaches, highlighting their merits and limitations while taking into account both speed and accuracy.
Further, our work provides novel pre-processing strategies for YOLOv8 that improve its performance, particularly for small object detection. Pre-processing is crucial for enhancing the quality of the input data and, hence, the accuracy of detection models. Traditional pre-processing approaches may be insufficient to address the special issues of small object detection, resulting in inferior performance. Our suggested pipeline, shown in Figure 2, includes enhanced noise reduction and adaptive histogram equalization to improve image contrast, allowing for the better separation of small objects. These pre-processing stages are combined with YOLOv8, which was chosen for its higher efficiency and accuracy relative to earlier one-stage models. YOLOv8’s sophisticated design and training capabilities make it well suited to processing the refined input data. Model development consists of three stages: data processing, model training, and assessment with evaluation metrics. Data processing focuses on removing noise and reformatting images in order to increase input quality. YOLOv8 is then trained on these processed images to determine its performance in detecting small objects. The assessment step employs extensive metrics that capture not just detection accuracy but also speed and computational efficiency. By concentrating on these characteristics, we aim to give a more nuanced understanding of one-stage approaches and their practical consequences, ultimately leading to more effective and efficient small object detection solutions.
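For context, the training and evaluation stages can be driven through the ultralytics Python package. The snippet below is a minimal sketch: the 50 epochs match this study’s setup, while the model variant, dataset YAML path, and image size are illustrative assumptions, not the exact configuration used here.

```python
from ultralytics import YOLO  # pip install ultralytics

# Load a pretrained YOLOv8 checkpoint (variant chosen for illustration).
model = YOLO("yolov8s.pt")

# Train on the pre-processed DOTA-v1.5 data; the YAML path and image
# size are assumptions for the sake of a runnable example.
model.train(data="dota_v15.yaml", epochs=50, imgsz=640)

# Evaluate on the validation split and report mAP.
metrics = model.val()
print(metrics.box.map)  # mAP50-95
```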
3.2.1. The Proposed Pre-Processing Approach
The DOTA dataset annotates every object with an oriented bounding box (OBB). The coordinates of the i-th vertex of the OBB are represented by (xi, yi), and the overall format is (x1, y1, x2, y2, x3, y3, x4, y4, category, difficult). These vertices are ordered clockwise to establish the object’s bounding box. This work describes a new pre-processing strategy for preparing the DOTA-v1.5 dataset to train the YOLOv8 algorithm, which is critical for good object detection. The pre-processing approach consists of three critical steps: noise handling, data formatting, and data normalization. Each of these steps is intended to improve the dataset’s effectiveness with the YOLOv8 model.
Noise handling: The first part of the pre-processing procedure handles the issue of noise in the dataset. The DOTA-v1.5 dataset includes two sorts of files: images and their labels. The label files include not only the locations and class names of objects but also unnecessary text and metadata, which can inject noise into the dataset. To clean up the dataset, regular expressions are utilized to detect and delete any extraneous strings. This step uses two regular expressions, (1) and (2).
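As a hedged illustration of this step, the sketch below removes the metadata header lines that DOTA label files carry (such as the imagesource and gsd lines) and keeps only well-formed object rows. Both patterns are illustrative stand-ins for expressions (1) and (2), not the authors’ exact regular expressions.

```python
import re

# Placeholder patterns standing in for expressions (1) and (2).
META_LINE = re.compile(r"^(imagesource|gsd):")  # metadata header lines
OBJECT_LINE = re.compile(
    r"^(\s*-?\d+(\.\d+)?\s+){8}\S+\s+\d+\s*$"   # 8 coordinates, category, difficulty flag
)

def clean_label_file(path):
    """Return only well-formed object rows, dropping metadata and malformed lines."""
    with open(path) as f:
        return [line.rstrip() for line in f
                if not META_LINE.match(line) and OBJECT_LINE.match(line)]
```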
Data formatting: After dealing with noise, the next stage is data formatting. During this step, the label file’s last column, which provides further labeling information, is removed. Instead, a different strategy is used: each object class is allocated a unique identification number using a dictionary. This dictionary converts object names, which are initially in string format, into numerical values. This transformation produces a new labeling column to replace the previous one. The new column, which contains the numerical IDs, is placed as the first column in the dataset. This update simplifies the dataset and guarantees that it meets the input requirements of the YOLOv8 training procedure. By translating object names to numerical representations, the dataset becomes more efficient and standardized, making the training process easier and the model more accurate.
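A mapping of this kind, using the 16 DOTA-v1.5 category names listed in Section 3.1, might look like the sketch below; the particular identifier assignment order is an assumption, not necessarily the one used in this study.

```python
# Category-name-to-ID dictionary for DOTA-v1.5 (names from Section 3.1);
# the numbering order is illustrative.
CLASS_IDS = {
    "plane": 0, "ship": 1, "storage-tank": 2, "baseball-diamond": 3,
    "tennis-court": 4, "basketball-court": 5, "ground-track-field": 6,
    "harbor": 7, "bridge": 8, "large-vehicle": 9, "small-vehicle": 10,
    "helicopter": 11, "roundabout": 12, "soccer-ball-field": 13,
    "swimming-pool": 14, "container-crane": 15,
}

def format_label_row(row):
    """Move the numeric class ID to the front and drop the difficulty column."""
    *coords, category, _difficulty = row.split()
    return [CLASS_IDS[category]] + [float(v) for v in coords]

print(format_label_row("100 200 300 200 300 400 100 400 plane 0"))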
Data normalization: The last stage of the pre-processing technique is data normalization. This phase involves dividing each value in the label files by the width or height of the relevant image, with the exception of the newly added labeling column. Through this normalization, the values are scaled to fall between 0 and 1. Normalization is used to minimize problems that could arise during training, such as exploding gradients. The model becomes less sensitive to changes in the input data and more resilient as a result of scaling the values. By ensuring that every input feature is on the same scale, this phase stops certain features from dominating the learning process because of their larger values. Because normalization keeps the model from being overly sensitive to specific features, it promotes faster convergence and a smoother training procedure.
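A minimal sketch of this scaling step is shown below; the function name and the example tile size are assumptions (in practice, the width and height would be read from each image, e.g., with PIL).

```python
def normalize_row(row, width, height):
    """Scale OBB coordinates into [0, 1] by the image width and height,
    leaving the leading class-ID column untouched."""
    class_id, *coords = row
    # x-coordinates occupy even positions, y-coordinates odd positions.
    normalized = [v / width if i % 2 == 0 else v / height
                  for i, v in enumerate(coords)]
    return [class_id] + normalized

# Example: a clockwise OBB on a 1024 x 1024 tile.
row = [0, 100.0, 200.0, 300.0, 200.0, 300.0, 400.0, 100.0, 400.0]
print(normalize_row(row, width=1024, height=1024))
```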
The comparison between the original DOTA dataset and the dataset after applying the suggested pre-processing strategy is shown in Table 3. The processed dataset demonstrates that the noise-handling, data formatting, and normalization steps were applied successfully. The addition of a new labeling column containing numerical class representations alongside normalized values indicates that the dataset is now well structured and ready for YOLOv8 training. In summary, by addressing noise, ensuring appropriate normalization, and improving data formatting, this thorough pre-processing method raises the overall quality of the DOTA-v1.5 dataset. This improved dataset enables the YOLOv8 algorithm to perform accurate and reliable object detection.
3.2.2. YOLOv8 Model
The YOLOv8 model is a state-of-the-art object detection model that predicts bounding boxes and class probabilities for every grid cell by dividing the input image into a grid. Localization loss, classification loss, and confidence loss are combined to form the total loss function.
Figure 3 [35] illustrates the structure of YOLOv8.
The model divides the input into $N$ grid cells. For each cell $i$ and corresponding bounding box $j$, it predicts four coordinates $(t_x, t_y, t_w, t_h)$ that define the bounding box's location, along with a confidence score $c$. The class probabilities are encoded in the vector $P$. The predicted coordinates of the bounding box, $(b_x, b_y, b_w, b_h)$, are calculated according to the following equations:

$$b_x = \sigma(t_x) + c_x, \qquad b_y = \sigma(t_y) + c_y, \qquad b_w = p_w e^{t_w}, \qquad b_h = p_h e^{t_h}$$

Given the sigmoid function $\sigma(\cdot)$; the predicted parameters $t_x$, $t_y$, $t_w$, and $t_h$; the offsets $(c_x, c_y)$ of the grid cell; as well as the dimensions of the anchor box $p_w$ and $p_h$, the confidence score $c$ for each bounding box is defined as

$$c = \sigma(t_c).$$

The predicted confidence parameter is denoted by $t_c$. The class probabilities $P$ are obtained by applying the softmax activation function:

$$P = \operatorname{softmax}(t_P).$$

The term $t_P$ represents the vector of predicted class parameters. The total loss function is formulated as a linear combination of three components: the localization loss, confidence loss, and classification loss:

$$\mathcal{L} = \lambda_{\mathrm{loc}} \mathcal{L}_{\mathrm{loc}} + \lambda_{\mathrm{conf}} \mathcal{L}_{\mathrm{conf}} + \lambda_{\mathrm{cls}} \mathcal{L}_{\mathrm{cls}}$$

The hyperparameters $\lambda_{\mathrm{loc}}$, $\lambda_{\mathrm{conf}}$, and $\lambda_{\mathrm{cls}}$ control the weighting of each individual loss component within the overall loss function.
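To make the decoding equations concrete, the sketch below applies them to raw predictions for a single box; the function name and example values are illustrative, and the code mirrors the anchor-based formulation as written above rather than any particular library's internals.

```python
import math

def decode_box(t_x, t_y, t_w, t_h, t_c, cell_x, cell_y, anchor_w, anchor_h):
    """Decode raw predictions into box center, size, and confidence, following
    b_x = sigma(t_x) + c_x, b_w = p_w * exp(t_w), and c = sigma(t_c)."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    b_x = sigmoid(t_x) + cell_x
    b_y = sigmoid(t_y) + cell_y
    b_w = anchor_w * math.exp(t_w)
    b_h = anchor_h * math.exp(t_h)
    confidence = sigmoid(t_c)
    return (b_x, b_y, b_w, b_h), confidence

# Example: a prediction in grid cell (3, 5) with a 2.0 x 1.5 anchor.
box, conf = decode_box(0.2, -0.1, 0.4, 0.1, 1.3,
                       cell_x=3, cell_y=5, anchor_w=2.0, anchor_h=1.5)
print(box, round(conf, 3))
```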
4. Results Analysis
This section provides a performance analysis and discussion of the proposed pre-processing approach. Table 4 illustrates the performance analysis based on the mean average precision (mAP) and the per-class average precision (AP). It is important to note that all experiments were executed under the same conditions: 50 epochs, with 80% of the data used for training and 20% for testing.
The proposed pre-processing approach, combined with YOLOv8, outperforms the other one-stage methods for the majority of the object classes in terms of the mAP. The table presents a thorough performance comparison of the proposed pre-processing approach (denoted “TS”) against various state-of-the-art one-stage object detection algorithms, focusing on the mean average precision (mAP) across multiple object categories. The aim is to highlight the effectiveness of the proposed method in improving detection accuracy for a range of objects, including planes, ships, storage tanks, baseball diamonds, tennis courts, basketball courts, ground track fields, harbors, bridges, large vehicles, small vehicles, helicopters, roundabouts, soccer ball fields, swimming pools, and container cranes.
The table provides mAP scores for each algorithm across these categories, reflecting how well each method performs in detecting and classifying objects. The mAP is a crucial metric in object detection, representing the average precision across all classes and thus giving a comprehensive measure of a model’s performance. The comparison covers several studies, each identified by its reference number, showing how well each detects the distinct object types.
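For reference, a compact way to compute AP from a precision-recall curve, and mAP as the mean over classes, is sketched below using all-point interpolation; this is a generic illustration, not the exact evaluation code behind Table 4.

```python
import numpy as np

def average_precision(recall, precision):
    """All-point interpolated AP: area under the precision-recall curve."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically non-increasing from right to left.
    p = np.maximum.accumulate(p[::-1])[::-1]
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(per_class_curves):
    """mAP = mean of per-class APs; input maps class name -> (recall, precision)."""
    return float(np.mean([average_precision(r, p)
                          for r, p in per_class_curves.values()]))
```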
The study by [17] achieves an mAP score of 65.2, with its highest scores in detecting planes (87.9) and its lowest in container cranes (1.4). Similarly, [18] scores 62.9 overall, with its best performance in detecting planes (88.2) and a lower score for container cranes (1.9). The performance of [19] is noteworthy, with an overall mAP of 64.1, excelling in detecting planes (86.4) but with less effectiveness in detecting container cranes (2.1). The authors of [20] present an mAP of 66.4, showing competitive results across most categories, particularly in detecting baseball diamonds (85) and tennis courts (90.8), although the score for container cranes is relatively low at 1.8. The study [21] demonstrates an overall mAP of 55.9, with strengths in detecting larger objects like storage tanks (67.5) but a weaker performance in detecting smaller objects like container cranes (1.3). The authors of [22] report an overall mAP of 66.6, highlighting its efficacy in detecting several object types, notably achieving high scores for tennis courts (90.8) and large vehicles (78.6), yet with a lower score for container cranes (3.6). The study by [23] shows an overall mAP of 66.3, with a good performance in detecting large vehicles (78) and tennis courts (89.9), but its detection of container cranes is also on the lower side (3.4). The study [24] achieves an mAP of 64.4, with a notable performance in detecting tennis courts (90.8) but a lower score for container cranes (4.8). Lastly, [25] reports an mAP of 64.8, excelling in detecting tennis courts (90.8) and planes (88.4) but with a relatively lower performance for container cranes (3.7).
The proposed method, TS, achieves an overall mAP score of 66.7, making it the top performer among the compared methods. The detailed breakdown reveals that TS excels particularly in detecting tennis courts (96.6) and planes (90), showing substantial improvements over other methods. It maintains a competitive performance across several categories, including storage tanks (72.3), baseball diamonds (84), and harbors (83), with varying effectiveness in detecting smaller objects like container cranes (6.1) and soccer ball fields (54.1). The exceptional performance of TS in several categories provides evidence that the pre-processing methods employed in this approach greatly improve the YOLOv8 model’s capacity to reliably detect and classify objects. The proficiency of the suggested approach in attaining the best scores in specific categories, such as tennis courts and planes, highlights its efficacy in enhancing detection accuracy; this may be ascribed to the improved data representation and feature extraction procedures employed in TS.
The comparison demonstrates that the proposed pre-processing method (TS) not only achieves the highest overall mAP score but also exhibits significant improvements in specific categories where other methods perform worse. For example, whereas TS has outstanding accuracy in identifying tennis courts and airplanes, properly detecting container cranes remains difficult, as shown by the lower score of 6.1. This underscores the domains in which additional improvements to the pre-processing practices might yield even better detection results. Furthermore, the table also demonstrates that while other methods exhibit a robust performance in specific categories, they frequently fall short in others. Methods such as [20,22] demonstrate improved mean average precision (mAP) scores in identifying tennis courts and large vehicles but perform poorly in detecting smaller items such as container cranes. In contrast, the proposed method demonstrates a more balanced performance across the various categories, highlighting its overall efficacy and adaptability.
In summary, the table compellingly illustrates the benefits of the newly proposed pre-processing approach (TS) in enhancing the efficacy of object detection using the YOLOv8 framework. The notable increase in the mean average precision (mAP) score and the method’s exceptional ability to detect particular object classes indicate the effectiveness of the employed pre-processing approach. Such enhancements are primarily due to the improved data management and feature extraction techniques, leading to increased accuracy in detection. The results emphasize the potential of TS to enhance the state of the art in object detection, offering valuable insights for future research and progress in this field. Overall, the proposed approach is a substantial improvement in object detection technology, offering a reliable solution for precisely detecting and categorizing a diverse array of objects. The comprehensive comparison of the performance highlights the efficacy of the approach and establishes a standard for future enhancements in object detection systems.
The speed and time analyses of the suggested technique are presented in Table 5. None of the existing research on one-stage tiny object detection on the DOTA dataset has reported Giga Floating-point Operations (GFLOPs), speed, epochs, gradients, and other necessary evaluation parameters. Based on Table 5, it is clear that the proposed pre-processing method is suitable for real-time applications. This paper provides a comprehensive comparison of various studies focusing on small object detection within the DOTA dataset, specifically emphasizing their GFLOPs, speed, epochs, gradients, and pre-processing, inference, loss, and postprocessing times. The goal of this analysis is to highlight the efficiency and effectiveness of the proposed method relative to existing methods. Each aspect presented in the table is explained in detail below.
Table 5 summarizes key performance metrics for the different studies and the proposed approach in the context of small object detection. The table includes the following columns:
- Studies: references to the various studies evaluated.
- Gradients: the gradient computation time, reflecting the effort spent during the learning phase.
- GFLOPs: the computational complexity of the method, with lower values suggesting more efficient algorithms.
- Epoch: the number of times the learning algorithm iterates over the entire dataset.
- Pre-process: the time taken for data pre-processing.
- Inference: the time required to make predictions on new data.
- Loss: the time to compute the loss function.
- Postprocess: the time required for any additional processing after inference.
The study by [17] demonstrates a relatively balanced approach, with gradients taking 0.3 ms, GFLOPs at 268.7, and a pre-processing time of 1.5 ms. The inference time is 59.7 ms, and postprocessing takes 7.2 ms. The loss calculation is quick at 0.5 ms. In [18], the gradients are slightly lower at 0.1 ms with GFLOPs of 274.1. This study shows a marginal increase in the pre-processing and inference times, but the postprocessing time is notably higher at 9.3 ms compared to the other studies. The study by [19] shows a high gradient computation time of 0.5 ms and GFLOPs of 270.4. The pre-processing time is the lowest among the studies (1.4 ms), but the inference time is the highest at 69.1 ms. The postprocessing time is 8.4 ms. In [20], with 0.5 ms for gradients and GFLOPs of 265.9, the pre-processing time remains a reasonable 1.6 ms. The inference time is quite high at 75.3 ms, and postprocessing takes 7.5 ms. The study by [21] features a gradient computation time of 0.9 ms, GFLOPs of 264.5, and a pre-processing time of 1.4 ms. The inference time is 71.4 ms, and the postprocessing time is relatively low at 7.1 ms. In [22], gradients take 0.2 ms, the GFLOPs are 260.2, and the pre-processing time is 1.6 ms. The inference time is lower at 63.9 ms, with the postprocessing time at 6.9 ms. In [23], with the lowest gradient time of 0.1 ms and GFLOPs of 259.3, the pre-processing time is slightly higher at 1.7 ms. The inference time is 67.7 ms, and postprocessing takes 6.8 ms. In [24], the gradient computation time is 0.5 ms, the GFLOPs are 263.8, and the pre-processing time is 1.5 ms. The inference time is 64.5 ms, and postprocessing takes 7.5 ms. In [25], the gradient time is 0.3 ms, the GFLOPs are 266.1, and the pre-processing time is 1.7 ms. The inference time is 60.1 ms, with a postprocessing time of 8.2 ms.
The proposed method shows the following metrics:
- Gradients: 0 ms, indicating that gradient computation is either negligible or integrated differently, possibly through optimized methods or precomputed gradients.
- GFLOPs: 263.2, a value competitive with the other methods, suggesting efficient computation.
- Epoch: 50, consistent with the other studies, providing a comparable basis for training duration.
- Pre-process: 1.2 ms, the lowest pre-processing time among all methods listed, highlighting efficient data handling and preparation.
- Inference: 57.4 ms, the lowest inference time, indicating faster prediction capabilities than the other methods.
- Loss: 0 ms, suggesting that the loss calculation might be embedded within the training loop or otherwise optimized.
- Postprocess: 6.0 ms, the lowest postprocessing time, further emphasizing efficiency.
Comparative analysis and implications:
- Speed and efficiency: The proposed method demonstrates superior efficiency in its pre-processing, inference, and postprocessing times. Its pre-processing time of 1.2 ms is notably faster than those of the other studies, which range from 1.4 ms to 1.7 ms. Its inference time of 57.4 ms is the lowest, suggesting faster object detection, and its postprocessing time of 6.0 ms is also the shortest, enhancing the overall efficiency of the system.
- Computational complexity (GFLOPs): the proposed method's GFLOPs value of 263.2 is competitive and shows that, while the method is computationally efficient, it does not sacrifice the complexity of the operations required for detection.
- Gradient computation and loss: The zero gradient time and zero loss time are particularly remarkable. These values suggest that the proposed method has redefined or optimized the typical gradient and loss computation processes, potentially integrating them into other stages of the pipeline or using advanced techniques that reduce their traditional computational overhead.
- Epochs: consistent with the other studies, the proposed method uses 50 epochs, which provides a fair basis for comparison in terms of training duration.
In conclusion, the data in Table 5 provide a clear illustration of the proposed method's efficiency and effectiveness in small object detection. The reduced pre-processing, inference, and postprocessing times compared to the other studies underline its suitability for real-time applications. Moreover, the competitive GFLOPs value shows that this efficiency is achieved without compromising the computational complexity. This combination of low computational overhead and effective processing makes the proposed method highly advantageous for real-time and resource-constrained environments.
The proposed method’s confusion matrices are presented in Figure 4. Figure 4a depicts the general confusion matrix, while Figure 4b illustrates the normalized version. Clearly, the normalized confusion matrix offers a more refined representation of the data than the general one. All training batches are illustrated in Figure 5, and the true and predicted validation images are displayed in Figure 6. It is evident that the proposed pre-processing approach with YOLOv8 has delivered exceptional results in terms of correctly identifying the true labels during prediction. Figure 7 shows the confidence curves (the P curve, R curve, F1 curve, and PR curve) of the presented pre-processing approach using YOLOv8. The graph in Figure 7a demonstrates the balance between precision and confidence, two vital metrics in object detection. The precision of a model quantifies the proportion of accurate detections, determined by dividing the number of true positives by the total number of true positives and false positives. Confidence, on the other hand, reflects how certain the model is about the correctness of a detection, usually represented as a probability score between 0 and 1. The graph depicts how precision shifts as the confidence threshold is adjusted; the confidence threshold is the minimum value that the model’s confidence score must meet for a detection to be considered valid. Raising the confidence threshold increases the model’s precision, though it also reduces the total number of detections made. The various lines on the graph correspond to different object categories. For instance, the line marked “small-vehicle” illustrates the precision-confidence curve for small vehicles, indicating the model’s performance in identifying them. The optimal region of a precision-confidence curve is the top-right corner, where the model exhibits both high confidence and high precision.
Figure 7b provides a graphical depiction of the trade-off between two essential object detection metrics: recall and confidence. Recall measures the model’s ability to locate all relevant instances of objects, reflecting the proportion of objects in the images that are correctly identified. In contrast, confidence shows how certain the model is about its predictions. The lines in the graph represent the different object classes. For example, the “small-vehicle” line illustrates the model’s recall at various confidence levels for small vehicles, and the “all classes” curve represents the average recall across all categories. The value “0.69 at 0.000” at the bottom left of the figure represents the model’s recall when the confidence threshold is set to zero, that is, when every detection is accepted: the model then correctly identifies approximately 69% of the objects. Overall, the graph illustrates the model’s detection performance across varying confidence levels.
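Curves of this kind can be generated by sweeping the confidence threshold over the set of matched detections. The sketch below is a generic illustration of the idea, not the plotting code behind Figure 7; the function name and inputs are assumptions.

```python
import numpy as np

def pr_vs_confidence(scores, is_true_positive, total_objects, thresholds=None):
    """Precision and recall as functions of the confidence threshold.
    scores: confidence of each detection; is_true_positive: 1/0 per detection."""
    scores = np.asarray(scores, dtype=float)
    tp = np.asarray(is_true_positive, dtype=float)
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)
    precision, recall = [], []
    for t in thresholds:
        kept = scores >= t
        n_kept = kept.sum()
        n_tp = tp[kept].sum()
        precision.append(n_tp / n_kept if n_kept else 1.0)
        recall.append(n_tp / total_objects)
    return thresholds, np.array(precision), np.array(recall)

# At threshold 0.0 every detection is kept, which corresponds to the
# "0.69 at 0.000" reading in Figure 7b when 69% of objects are found.
```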
Discussion
Research in the domain of detecting tiny objects in aerial imagery has consistently highlighted the challenges posed by the small size and low resolution of these objects. Previous studies, such as those by Yang et al. [17] and Wang et al. [18], primarily focused on enhancing detection accuracy through innovative architectures and loss functions, often neglecting critical aspects like data pre-processing and comprehensive evaluation metrics. Many existing works typically report the mean average precision (mAP) without considering the impacts of noise reduction and data normalization, which are essential for improving object visibility and discriminative features.
In contrast, our study introduces a robust data pre-processing technique for YOLOv8, which includes noise reduction, data restructuring, and normalization. These strategies significantly enhance the clarity of small object boundaries and improve detection accuracy across all classes in the DOTA-v1.5 dataset. By utilizing 50 epochs for training and encompassing all relevant categories for small object detection, our approach not only achieves a higher mAP but also establishes a new standard for processing speed and efficiency. Furthermore, our comprehensive evaluation, encompassing confusion matrices and performance metrics, addresses gaps left by prior studies [19,20,26,27], underscoring the importance of rigorous evaluation and thorough analysis in advancing the field.
Future research directions should focus on enhancing the model’s capabilities for detecting small objects. This could involve integrating multi-scale detection techniques to identify small objects at various resolutions or exploring advanced architectures, such as YOLOv10, that may offer improved feature-extraction capabilities. Additionally, experimenting with hybrid approaches that combine traditional object detection methods with deep learning techniques could yield beneficial results. Although our benchmarking analysis provides valuable insights, there remains significant potential for improving model performance in detecting small objects, particularly by addressing the limitations of current pre-processing techniques.
Moreover, future research should explore the implementation of more robust data-augmentation strategies that simulate diverse real-world scenarios, enhancing model robustness against varying conditions. While our evaluation metrics offer a clearer understanding of model effectiveness, ongoing development and refinement are necessary to advance small object detection in aerial imagery. By acknowledging these limitations and pursuing these research directions, we aim to contribute to the ongoing advancement of accurate and efficient techniques for small object detection, ultimately improving the applicability and reliability of such models in practical applications.