6.2. Vision System Evaluation
To evaluate the vision system proposed in this paper, we follow the complete procedure of autonomous landing and manually fly the DJI M100 quadrotor to simulate the phases of “Take-off”, “Approaching”, “Hovering”, “Descending”, and “Touch Down”. In
Figure 9a, the three-dimensional trajectory of one manual flight test recorded by the onboard GNSS module is presented. As we can see, the quadrotor takes off from the vicinity of the landing marker at the beginning. After ascending to an altitude above 10 m, it approaches the marker and stays in the air to confirm target detection. Then, the vehicle gradually lowers its altitude during the “Hovering” and “Descending” phases and eventually lands on the marker.
Figure 9b shows the vehicle's local velocities along the x, y, and z axes. The onboard camera records images at a resolution of 1280 × 1024, a frame rate of 20 Hz, and an exposure time of 50 ms. As mentioned earlier in Sections 3 and 4, such a frame configuration meets the minimum requirement of real-time processing while allowing as much light as possible to enter the camera during the exposure phase to brighten the image. Each video is stored in the "bag" file format under the ROS framework. The actual frame rate is around 19.8 Hz due to processing latency between adjacent frames, but we neglect this error in this study. In the experiments, the M100 quadrotor has a maximum horizontal speed of 2 m/s and a vertical speed of 1 m/s, which is sufficient to reproduce the motion blur and perspective distortion caused by varying camera angles during horizontal acceleration and deceleration. Finally, at each field experiment scene, we collect a video with a length of approximately 2–3 min, resulting in four videos in total for evaluating the proposed vision system. In
Table 1, the statistics of each video are summarized, including key elements such as "total images", "marker images", and "max altitude". These elements are the fundamental metrics for evaluating the vision system.
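For reference, the effective frame rate mentioned above (about 19.8 Hz rather than the nominal 20 Hz) can be checked directly from the recorded bag files. The following is a minimal sketch assuming the ROS1 Python rosbag API; the bag path and image topic name are hypothetical placeholders, not the actual values used in our recordings.

```python
import rosbag

# Hypothetical bag path and image topic; adjust to the actual recording.
BAG_PATH = "scene_recording.bag"
IMAGE_TOPIC = "/camera/image_raw"

stamps = []
with rosbag.Bag(BAG_PATH) as bag:
    for _, msg, t in bag.read_messages(topics=[IMAGE_TOPIC]):
        # Use the sensor timestamp when populated, otherwise the bag record time.
        stamp = msg.header.stamp if msg.header.stamp.to_sec() > 0 else t
        stamps.append(stamp.to_sec())

intervals = [b - a for a, b in zip(stamps[:-1], stamps[1:])]
mean_dt = sum(intervals) / len(intervals)
print("%d frames, effective frame rate: %.2f Hz" % (len(stamps), 1.0 / mean_dt))
```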
Since our dataset is built on automatically extracted ROI images without precisely labeled ground-truth bounding boxes, it would be inappropriate to use the conventional intersection over union (IoU) metric to evaluate the vision system. Instead, we employ alternative criteria based on the following equations:
$$\left\| \hat{\mathbf{p}} - \mathbf{p} \right\|_{2} \leq r \qquad (11)$$

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN} \qquad (12)$$

$$F_{1} = \frac{2PR}{P + R} \qquad (13)$$

where points $\hat{\mathbf{p}}$ and $\mathbf{p}$ are the predicted landing marker center and its ground truth, respectively. Equation (11) indicates that if the marker center is successfully detected when a marker is present and its coordinate falls into a small circle $C$ centered at $\mathbf{p}$ with radius $r$, we consider the detected landing marker a true positive (TP). According to our previous experience in [14], a marker center derived from the four outermost vertices of the "H" pattern is sufficiently accurate, with limited bias. Hence, we set $r$ to one-eighth of the bounding polygon's diagonal length. Such a value raises the threshold and makes the evaluation more reasonable. On the contrary, if a landing marker is present in the image but no marker center is obtained, or the predicted marker center falls outside the circle $C$, it is considered a false negative (FN). For the extracted background ROI images, if an image is categorized as a landing marker by the CNN node, it is considered a false positive (FP). In Equations (12) and (13), the standard metrics precision $P$, recall $R$, and F-measure $F_{1}$ are adopted to evaluate the accuracy of the vision system. Compared with the conventional IoU metric, the proposed criteria give credit to both the quality of the landing marker detection outcomes and the detection rate. The overall performance of the vision system affects the accuracy of pose estimation and the precision of hardware-in-the-loop control in future work.
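For clarity, the sketch below shows how these criteria can be computed from per-frame detection records. The record layout (predicted center, ground-truth center, bounding polygon diagonal) is an assumption made for illustration; it is not the exact interface of our evaluation scripts.

```python
import math

def evaluate(frames, background_fp):
    """frames: list of dicts for marker-present frames with keys
         'pred'     -> (x, y) predicted marker center, or None if no detection
         'gt'       -> (x, y) ground-truth marker center
         'diagonal' -> bounding polygon diagonal length in pixels
       background_fp: number of background ROIs classified as markers (FP).
    """
    tp = fn = 0
    for f in frames:
        r = f["diagonal"] / 8.0            # radius of circle C, Eq. (11)
        if f["pred"] is None:
            fn += 1
            continue
        dx = f["pred"][0] - f["gt"][0]
        dy = f["pred"][1] - f["gt"][1]
        if math.hypot(dx, dy) <= r:        # inside circle C -> true positive
            tp += 1
        else:
            fn += 1
    fp = background_fp
    precision = tp / (tp + fp) if tp + fp else 0.0                 # Eq. (12)
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)                          # Eq. (13)
    return precision, recall, f1
```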
Based on the criteria mentioned above, we collect the output images of each stage from the videos recorded at the different scenes and manually categorize them into marker (denoted as "MK") and background (denoted as "BG") categories to obtain the exact numbers of TP, FP, and FN samples, the results of which are listed in
Table 2. Specifically, the term "ccomp" stands for the number of detected connected components, and "dt1-dt3" denotes the remaining ROIs filtered by nodes 1 to 3 of the decision tree. Since the CNN node plays a critical role in confirming actual landing markers and rejecting irrelevant background ROIs, its output is counted separately and presented as "dt4_cnn". Finally, we examine the number of cases in which the marker keypoints can be successfully extracted, denoted by the term "ext_kpt". At scene $L_1$, there are 2380 marker-present frames, for which the "ccomp" stage outputs 2339 connected components belonging to the marker and 5186 belonging to background fragments. The decision tree then rejects a small number of the marker ROIs and outputs 2241 marker images with accurately extracted keypoints confirmed as TP, while effectively eliminating more than 98% of the background segments. We carefully examine the remaining FP samples reported by the network and find that they all come from partially cropped landing marker images generated during the final "Touch Down" phase of landing. Although keypoints can be drawn from most of these "H" patterns, we still count them as FP due to their inconsistency and unreliability. Compared with $L_1$, scene $L_2$ has a more complicated and cluttered background than the basketball field. Therefore, nearly three times as many connected components are picked up in $L_2$ as in $L_1$. The vision system shows a performance similar to that at the previous scene, except that the CNN node verifies only 23 FP samples. We study this case and find that the cause lies in the landing maneuver. Since the marker is placed on a narrow stone-paved path surrounded by cluttered vegetation, the quadrotor is piloted toward the west of the marker at a relatively high altitude (3.28 m) to land on flat ground. The marker quickly vanishes beyond the edge of the image, leaving only a limited number of partially occluded marker samples. However, we find it difficult to extract keypoints from these negative samples due to heavy motion blur. Moving on to the fire escape scene $L_3$, an interesting phenomenon is observed: the number of extracted markers is slightly greater than its actual value (3008 vs. 2936, marked in Table 2, column 6). This is because the fire escape has a background of lighter color than the marker. The edges of the marker square are thresholded into the foreground at low altitude, leading to approximately 80 repeated identifications of the marker. Therefore, we subtract this number at each stage to make a fair comparison. Scene $L_4$ has a similarly cluttered environment to scene $L_2$; a large number of background segments are successfully rejected by the vision system while a comparable marker detection performance is achieved. The majority of the FP samples again come from incomplete marker patterns that are not yet eliminated.
In
Table 3, the precision $P$, recall $R$, and F-measure $F_{1}$ of each scene are listed correspondingly. Combining these results with the altitude information listed in
Table 1, we can also see that the altitude level has an impact on the system. The vision system occasionally misses the marker when the UAV flies above 13 m, because the marker then covers too few pixels for the connected component analysis. This situation is exacerbated when motion blur generated by horizontal movements occurs, which explains the slight degradation in system performance observed at the scene with the highest maximum altitude. In contrast, the vision system achieves its best performance at the scene where the altitude remains constantly below 10 m. Note that we subtract the repetitive identifications mentioned above from the final system outputs to obtain the actual number of TP samples at scene $L_3$ (2852, marked in
Table 3). In real landing scenarios, it is therefore recommended to maintain a lower altitude for landing marker detection and tracking.
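The altitude effect can be reasoned about with a simple pinhole-camera approximation: the number of pixels covered by the marker shrinks roughly in inverse proportion to altitude, which is why the connected component analysis begins to miss the marker above roughly 13 m. The marker size and focal length below are illustrative assumptions, not our calibration values.

```python
def marker_width_in_pixels(marker_width_m, altitude_m, focal_length_px):
    """Approximate projected width of a flat marker viewed from straight above."""
    return focal_length_px * marker_width_m / altitude_m

# Illustrative numbers only: a 0.6 m wide "H" marker and a 1000 px focal length.
for h in (5.0, 10.0, 13.0, 15.0):
    print("altitude %4.1f m -> ~%3.0f px" % (h, marker_width_in_pixels(0.6, h, 1000.0)))
```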
In
Figure 10, we present some of the landing marker detection results. The first, third, and fifth rows show some of the original color images captured onboard, in which the markers detected by our vision system are highlighted with colored lines and dots. The second, fourth, and sixth rows show the corresponding color images after enhancement so that details can be observed. The proposed vision system is able to detect the landing marker in the presence of noise, scaling, image distortion, motion blur (see the figures in row 4, column 3 and row 6, column 1), and image acquisition errors (see the last two figures of
Figure 10) in complex low-illumination environments.
To verify our claim that the low-illumination image enhancement stage plays a crucial role in landing marker detection, the vision system is further evaluated using the same metrics but on the original test videos without enhancement. To make a fair comparison, we also retrain the CNN node with the original, unenhanced training samples. The results are shown in
Table 4, from which we may observe a severe degradation in system performance. The reason is that the pre-processing stage has difficulty properly binarizing the images, which reduces the number of generated connected components and the detection rates of the subsequent stages. Notice that only two marker samples are identified at scene $L_1$, leading to a complete failure of the system. Such poor performance arises because the marker is placed at the center of the basketball field, a location far away from the light source and with the lowest luminance among all the test scenes. The landing marker merges into the dark background so that it can no longer be detected. From this analysis, we may conclude that the low-illumination image enhancement scheme not only improves the quality of nighttime images but is also the foundation of landing marker detection.
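To illustrate why binarization fails on unenhanced nighttime frames, the sketch below pairs adaptive thresholding with a generic CLAHE contrast enhancement as a stand-in for our enhancement scheme; it is not the algorithm proposed in this paper, and the threshold parameters are illustrative assumptions.

```python
import cv2

def binarize(gray, enhance=True):
    """Adaptive thresholding with an optional generic low-light enhancement.

    CLAHE here is only a stand-in for the paper's enhancement scheme; the block
    size and offset of the adaptive threshold are illustrative values.
    """
    if enhance:
        clahe = cv2.createCLAHE(clipLimit=3.0, tileGridSize=(8, 8))
        gray = clahe.apply(gray)
    return cv2.adaptiveThreshold(gray, 255,
                                 cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, 31, -5)

# Usage: compare the amount of recovered foreground with and without enhancement.
# frame = cv2.imread("night_frame.png", cv2.IMREAD_GRAYSCALE)
# print((binarize(frame, True) > 0).sum(), (binarize(frame, False) > 0).sum())
```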
Since a number of CNN-based object detection frameworks exist in the related research, we also compare the performance of the proposed vision system against state-of-the-art methods that can run in real time. We choose YOLOv3 and its simplified version YOLOv3-Tiny [50], as well as MobileNetV2-SSD [51], to carry out the evaluation. As our vision system does not rely on image-wise bounding box labeling, we randomly select an equal proportion of the original, unenhanced images from each video to establish a dataset for a fair comparison. From
Table 1, one may see that there are 9642 marker images in total. Of these, 500 images are selected and manually labeled as the training dataset, whilst the remaining 9142 images are kept for testing. All images are resized to a lower resolution for a faster evaluation process.
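The train/test split used for this comparison can be reproduced along the lines of the sketch below. The directory layout and the random seed are hypothetical placeholders, and the resize step is omitted.

```python
import random
from pathlib import Path

# Hypothetical layout: one folder of extracted marker images per scene.
all_images = sorted(Path("marker_images").glob("*/*.png"))
assert len(all_images) == 9642          # total marker images reported in Table 1

random.seed(0)                          # arbitrary seed for a reproducible split
random.shuffle(all_images)
train, test = all_images[:500], all_images[500:]
print(len(train), "training images,", len(test), "test images")
```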
Each network is trained individually using the default parameters recommended in the official guidelines, and the detection results are listed in
Table 5. It is worth mentioning that the M100 quadrotor spends a longer time at higher altitudes during the "Approaching" and "Hovering" stages, so approximately 60% to 70% of the images contain a relatively small landing marker, which makes marker detection difficult. This significantly increases the number of FN samples for YOLOv3 and its derivative. In contrast, MobileNetV2-SSD performs better than YOLOv3 and YOLOv3-Tiny in detecting small markers, but it also introduces a certain number of FP samples, corresponding to other ground objects, into the results. Finally, the performance of our vision system is derived from the data in
Table 3, from which one may see that our approach achieves the best recall and F-measure with a minimal network design.
We also evaluate whether the enhanced low-illumination images benefit the above-mentioned CNN-based object detection frameworks or not. Again, each network is trained and tested by the same procedure, but using the enhanced image dataset. We present the detection results in
Table 6. We may see overall improvements for both YOLOv3 and YOLOv3-Tiny, whereas MobileNetV2-SSD significantly reduces the FN samples and achieves the best recall and F-measure. Note that this result slightly outperforms our vision system, as the enhanced images offer much stronger and more discriminative features for the networks to extract. However, our vision system still has the advantage of processing speed, which is elaborated in the next subsection.
6.3. Processing Time Evaluation
Processing speed is of crucial importance to real-time UAV applications. Therefore, the timing performance of each stage of the proposed system is quantitatively evaluated. In this section, we utilize the collected videos to measure the average time consumption of each processing stage. Each video has an initial resolution of 1280 × 1024, which is then resized to 1024 × 768 and 800 × 600, respectively. We use the annotations "HR", "MR", and "LR" to denote these resolutions. The tests are conducted on both the CNN training desktop PC (denoted by "PC") and the NVidia TX2 unit (denoted by "TX2"). The timing performance of each processing stage, including image enhancement, pre-processing, the decision tree method, and keypoint extraction, is comprehensively evaluated, and the results are listed in
Table 7. It is worth mentioning that the CNN node is executed on the GPU of both platforms, where the time consumption of data transfer between the CPU and GPU is neglected.
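Per-stage timings of this kind can be gathered with a simple wall-clock harness such as the sketch below; the stage names mirror Table 7, while the stage callables themselves are hypothetical placeholders.

```python
import time
from collections import defaultdict

def time_stages(frames, stages):
    """Average wall-clock time per stage over a sequence of frames.

    stages: ordered list of (name, callable) pairs; each callable takes the
            output of the previous stage. Returns average time per stage in ms.
    """
    totals = defaultdict(float)
    for frame in frames:
        data = frame
        for name, fn in stages:
            t0 = time.perf_counter()
            data = fn(data)
            totals[name] += time.perf_counter() - t0
    return {name: 1000.0 * t / len(frames) for name, t in totals.items()}

# Hypothetical usage with placeholder stage functions:
# avg_ms = time_stages(frames, [("enhancement", enhance),
#                               ("pre-processing", preprocess),
#                               ("decision tree", decision_tree),
#                               ("keypoint extraction", extract_keypoints)])
```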
We may see that the most time-consuming parts are the image enhancement and adaptive thresholding stages. By reducing the resolution to the "MR" level, the detection rate on the desktop PC surpasses the maximum frame rate offered by the onboard camera. In contrast, the TX2 unit performs relatively poorly at both the "HR" and "MR" resolutions. Nonetheless, it already achieves a detection rate of more than 10 Hz at the "LR" resolution, satisfying this study's minimum requirement of real-time processing. Compared with the works of [9,25], we perform landing marker detection at a higher resolution, whereas their algorithms only operate at image resolutions of 752 × 480 and 640 × 480, respectively. We also test the timing performance of the aforementioned CNN-based frameworks on the desktop PC's NVidia GeForce 1070 GPU. For an image size of 1280 × 1024, the average processing times of YOLOv3, YOLOv3-Tiny, and MobileNetV2-SSD are 102.65 ms, 14.12 ms, and 64.39 ms, respectively. For an image size of 640 × 512, YOLOv3, YOLOv3-Tiny, and MobileNetV2-SSD take approximately 33.49 ms, 5.21 ms, and 26.65 ms to process one frame. Although these networks may achieve real-time processing on a high-performance desktop GPU at the cost of a reduced image resolution, we still find it challenging to implement them on a resource-limited UAV onboard platform. Moreover, these networks still require the low-illumination image enhancement scheme to achieve better performance, which further reduces the processing speed.
Note that the timing performance of the proposed vision system is achieved without optimization; redundant operations are executed for every frame. By exploiting information from adjacent frames in the image sequence, we could narrow the ROI down to a specific area based on previous detection results and thereby dramatically reduce the computational burden caused by image enhancement and adaptive thresholding.
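As a sketch of the optimization suggested above, the previous detection can seed a cropped search window for the next frame so that the expensive enhancement and thresholding stages only touch a fraction of the image. The expansion margin is an illustrative assumption.

```python
def search_window(prev_bbox, image_shape, margin=1.5):
    """Expand the previous detection's bounding box into a cropped search region.

    prev_bbox: (x, y, w, h) of the marker in the previous frame.
    image_shape: (height, width) of the current frame.
    margin: illustrative expansion factor around the previous detection.
    """
    x, y, w, h = prev_bbox
    cx, cy = x + w / 2.0, y + h / 2.0
    half_w, half_h = margin * w / 2.0, margin * h / 2.0
    x0 = max(int(cx - half_w), 0)
    y0 = max(int(cy - half_h), 0)
    x1 = min(int(cx + half_w), image_shape[1])
    y1 = min(int(cy + half_h), image_shape[0])
    return x0, y0, x1, y1

# The expensive enhancement/thresholding stages then operate on
# frame[y0:y1, x0:x1] instead of the full-resolution image.
```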