Enhancing Human Detection in Occlusion-Heavy Disaster Scenarios: A Visibility-Enhanced DINO (VE-DINO) Model with Reassembled Occlusion Dataset

Zhao, Zi-An; Wang, Shidan; Chen, Min-Xin; Mao, Ye-Jiao; Chan, Andy Chi-Ho; Lai, Derek Ka-Hei; Wong, Duo Wai-Chi; Cheung, James Chung-Wai

doi:10.3390/smartcities8010012

Open AccessArticle

Enhancing Human Detection in Occlusion-Heavy Disaster Scenarios: A Visibility-Enhanced DINO (VE-DINO) Model with Reassembled Occlusion Dataset

by

Zi-An Zhao

¹

,

Shidan Wang

²

,

Min-Xin Chen

¹

,

Ye-Jiao Mao

¹

,

Andy Chi-Ho Chan

¹

,

Derek Ka-Hei Lai

¹,

Duo Wai-Chi Wong

^1,*

and

James Chung-Wai Cheung

^1,*

¹

Department of Biomedical Engineering, Faculty of Engineering, The Hong Kong Polytechnic University, Hong Kong, China

²

School of Microelectronics and Communication Engineering, Chongqing University, Chongqing 400044, China

^*

Authors to whom correspondence should be addressed.

Smart Cities 2025, 8(1), 12; https://doi.org/10.3390/smartcities8010012

Submission received: 14 November 2024 / Revised: 29 December 2024 / Accepted: 14 January 2025 / Published: 16 January 2025

(This article belongs to the Topic Machine Learning and Big Data Analytics for Natural Disaster Reduction and Resilience)

Download

Browse Figures

Versions Notes

Abstract

:

Highlights

What are the main findings?

Visibility-Enhanced DINO (VE-DINO): This modified DINO model was designed to identify partially obscured individuals in disaster scenes. VE-DINO enhances the identification process by incorporating key point information of body parts, allowing for more accurate recognition even when parts of the body are obscured. Additionally, the model introduces a specialized visibility-aware loss function that assigns weights to different body parts based on their visibility. VE-DINO has demonstrated superior performance compared to the original DINO in challenging conditions.
Disaster Occlusion Detection Dataset (DODD): A newly assembled dataset of disaster scenes with occluded individuals, crucial for testing the visibility-enhanced DINO model’s improved performance in detecting partially visible people.

What is the implication of the main finding?

Faster victim location: The VE-DINO model that could be deployed in an unmanned aerial vehicle (UAV) can enable quicker and more accurate identification of obscured individuals in disaster scenes, accelerating rescue efforts.

Abstract

Natural disasters create complex environments where effective human detection is both critical and challenging, especially when individuals are partially occluded. While recent advancements in computer vision have improved detection capabilities, there remains a significant need for efficient solutions that can enhance search-and-rescue (SAR) operations in resource-constrained disaster scenarios. This study modified the original DINO (Detection Transformer with Improved Denoising Anchor Boxes) model and introduced the visibility-enhanced DINO (VE-DINO) model, designed for robust human detection in occlusion-heavy environments, with potential integration into SAR system. VE-DINO enhances detection accuracy by incorporating body part key point information and employing a specialized loss function. The model was trained and validated using the COCO2017 dataset, with additional external testing conducted on the Disaster Occlusion Detection Dataset (DODD), which we developed by meticulously compiling relevant images from existing public datasets to represent occlusion scenarios in disaster contexts. The VE-DINO achieved an average precision of 0.615 at IoU 0.50:0.90 on all bounding boxes, outperforming the original DINO model (0.491) in the testing set. The external testing of VE-DINO achieved an average precision of 0.500. An ablation study was conducted and demonstrated the robustness of the model subject when confronted with varying degrees of body occlusion. Furthermore, to illustrate the practicality, we conducted a case study demonstrating the usability of the model when integrated into an unmanned aerial vehicle (UAV)-based SAR system, showcasing its potential in real-world scenarios.

Keywords:

SAR operations; UAVs; deep learning; DINO; occlusion detection; Disaster Occlusion Detection Dataset (DODD); human detection; natural disasters; resource-constrained environments

1. Introduction

1.1. Background

Natural disasters, exacerbated by climate change, rapid urbanization, and increasing population density, have become a growing concern for cities worldwide [1]. Urban areas are particularly vulnerable due to their dense infrastructure and limited access to natural protective barriers [2]. Natural disasters such as earthquakes, floods, and wildfires cause significant socio-economic impacts, with substantial human casualties, severe environmental damage, and the destruction of critical infrastructure. In these disaster scenarios, individuals are often partially or fully obscured by debris, vegetation, or other obstacles, making it extremely challenging for rescue personnel to locate them [3,4]. The difficulty in detecting occluded individuals can result in delays and missed opportunities for rescue, ultimately increasing the number of casualties. Therefore, it is crucial to develop innovative technologies that can effectively detect partially occluded individuals, thereby improving the efficiency of rescue operations and increasing the chances of saving more lives during critical emergency-response situations [4].

Recent advances in Unmanned Aerial Vehicle (UAV) [5] technology have opened new avenues for enhancing urban resilience against natural disasters [6,7,8]. UAVs are increasingly valued in post-disaster scenarios for their ability to access hard-to-reach areas [5,7,8,9,10,11,12]. Equipped with advanced visual sensors, Unmanned Aerial Vehicles (UAVs) can conduct rapid surveys and collect real-time visual data [6]. When leveraging machine learning methods for human detection and localization [13], these aerial platforms provide critical support for search-and-rescue operations, significantly enhancing the efficiency and effectiveness of disaster response efforts. However, in urban, mountainous, and forested environments, the effectiveness of advanced sensors in UAVs is often compromised by occlusions caused by elements such as trees, debris, water, smoke, and fire [8]. Overcoming these challenges is vital for accurate identification of individuals, which can be lifesaving in situations where traditional rescue efforts are obstructed by poor visibility and difficult terrain.

1.2. Motivation

DINO, which stands for Detection Transformer (DETR) with Improved Denoising Anchor Boxes, is an advanced object-detection model that enhances end-to-end detection performance using improved denoising techniques with an optimized anchor box [14]. With a ResNet-50 backbone, DINO achieves an impressive average precision of 49.4 with just 5 scales in a mere 12 epochs, demonstrating its remarkable efficiency and accuracy in object detection. However, the performance of the model in occlusion-heavy environments, particularly when objects are partially obscured, remains limited.

To address this limitation, we propose the visibility-enhanced DINO (VE-DINO), which incorporates the original DINO with a visibility-aware mechanism to improve detection capabilities, especially for partially occluded individuals. The core innovation of VE-DINO lies in its visibility-aware loss function, which leverages annotations for key body regions, including the head, upper body, lower body, and legs. This integrated approach guides the model to focus on visible regions, infer the presence of occluded parts, and optimize bounding box regression, improving localization accuracy under partial visibility conditions, enhancing object-detection performance in challenging occlusion-heavy scenarios.

Additionally, there is a lack of comprehensive datasets tailored for this purpose (i.e., partially-occluded individuals in disaster scenes). Existing datasets often focus on specific types of disasters. For instance, the Saied Fire dataset [15] primarily concentrates on fire-detection scenarios, while the Telperion DisasterDatasetRaw [16] covers a broader range of disaster types but lacks specific annotations for occluded individuals. To address this gap, we were motivated to create the Disaster Occlusion Detection Dataset (DODD). DODD is a meticulously curated compilation of images from existing public datasets, specifically selected to include occluded individuals in various disaster scenes. By assembling these diverse images into a comprehensive dataset, DODD not only provides researchers and developers with a robust resource for training and evaluating human detection models in challenging disaster environments but also serves as an ideal testbed for our VE-DINO model.

1.3. Objectives and Contributions

The primary objective of this study is to develop and evaluate the VE-DINO model, an advancement of the existing DINO framework designed to improve the detection of partially-occluded individuals in disaster scenes. By enhancing the model’s capabilities, we aim to overcome the challenges of detecting individuals in complex environments, where such detection is both essential and difficult.

Our second objective is to create the Disaster Occlusion Detection Dataset (DODD), a specialized collection of annotated images featuring partially occluded individuals in disaster contexts. This dataset serves as a crucial resource for the external testing and validation of the VE-DINO model, providing a comprehensive benchmark for assessing its performance in real-world disaster scenarios.

Finally, we aim to demonstrate the practical usability of our VE-DINO model by integrating it into a search-and-rescue (SAR) system deployed on an Unmanned Aerial Vehicle (UAV). This integration will illustrate the model’s feasibility and effectiveness in real-world scenarios, showcasing its potential to enhance disaster-response efforts and save lives in challenging environments.

The main contributions of this work are as follows:

VE-DINO Model: We propose a modification to the DINO object-detection framework by incorporating visibility-aware mechanisms, aiming to improve the detection of partially occluded individuals in disaster scenarios. Our comprehensive evaluation on both standard and custom datasets demonstrates the model’s performance, particularly in occlusion-heavy conditions. Through ablation studies, we investigate the impact of visibility information for different body parts, providing insights into the model’s behavior in challenging environments.
DODD: We introduce a new benchmark dataset focusing on occlusion-heavy scenarios in disaster contexts, compiled from various public datasets to facilitate the evaluation of detection models under challenging conditions.
UAV-Based SAR System Integration: We demonstrate the integration of the VE-DINO model with an Unmanned Aerial Vehicle (UAV) system through a usability case study, exploring its potential application in search-and-rescue operations.

2. Related Work

Recent advances in UAV (or drone) technology, machine learning, and computer vision have significantly enhanced the efficiency of search-and-rescue (SAR) operations in urban areas affected by natural disasters. These advancements include the use of UAVs in SAR missions, the integration of machine learning for improving UAV capabilities, and the development of occlusion-aware object-detection techniques.

2.1. UAVs in Search and Rescue

Unmanned Aerial Vehicles (UAVs) or drones have become essential tools in disaster management and SAR operations [8]. UAVs are particularly useful in providing real-time monitoring of disaster-affected areas, collecting valuable data, and supporting emergency-response efforts. In environments affected by natural disasters, UAVs can navigate areas that are otherwise inaccessible, such as those blocked by rubble, flooded streets, or unstable terrain. This capability is crucial in ensuring that SAR teams receive timely information regarding the location and status of individuals in need of assistance.

UAVs have proven to be highly effective in surveying disaster-affected regions quickly and efficiently. For example, in earthquake and flood scenarios, UAVs have been used to map damaged areas and locate survivors more rapidly compared to traditional ground-based methods [17]. The use of UAVs helps to overcome physical barriers that often hinder rescue teams, as demonstrated in various studies [18].

Despite their advantages, UAVs face significant challenges in accurately detecting individuals in complex environments [8]. Natural obstacles, such as trees, debris, and smoke, can occlude individuals, making it difficult for UAV-based systems to locate them accurately [18,19]. Traditional detection methods often fall short in these scenarios, highlighting the need for advanced solutions that can effectively handle occlusions. Addressing these challenges requires the integration of machine-learning techniques that can enhance UAV detection capabilities by improving their ability to identify partially visible individuals.

2.2. Object Detection Under Occlusion

Traditional object detection and segmentation methods face significant challenges in occlusion-heavy scenarios, as they rely heavily on the clear visibility of object features and often fail when key parts of the object are obscured. For instance, models like Faster R-CNN [20] and YOLO [21], while effective in general object-detection tasks, struggle when significant portions of an object are occluded, as they rely on bounding boxes and clear feature visibility for accurate detection. These methods typically lack the ability to incorporate additional information, such as the visibility of key points for different body parts, which is crucial for detecting partially occluded individuals. In contrast, the visibility-enhanced segmentation method proposed approach leverages visibility annotations for key body regions, such as the head, upper body, lower body, and legs, to guide the model in focusing on visible regions. By integrating this visibility information into the detection process, the model can infer the presence of occluded individuals more effectively, even when critical features are missing. This additional layer of information processing enables a more robust performance in occlusion-heavy disaster scenarios compared to traditional methods.

Recent studies have explored various methods for enhancing object-detection performance in occlusion-heavy environments. These methods primarily focus on improving models to handle partial visibility effectively, using strategies like part-based voting mechanisms, context-aware modules, and specialized network modifications. The goal of these methods is to enhance the ability of detection models to locate occluded objects across different scenarios, ranging from crowded urban settings to agricultural fields. The following Table 1 summarizes these studies, including the models used, detection methods, and application scenarios.

While these methods have shown effectiveness in handling occluded objects, there are certain limitations when dealing with smaller occluded targets. Models such as YOLO [21] with SPCS [23] and CompositionalNets [22] have demonstrated success in crowded and occluded environments, yet they often face challenges when it comes to detecting smaller objects. For instance, while YOLO performs well for detecting occluded humans in crowded scenes, it struggles when tasked with detecting human targets from UAV images, where individuals are smaller due to the high-altitude perspective.

The DINO model, in its current form, has proven to be effective in tasks like the Apple Object Detection scenario where small and occluded objects were detected [25]. However, it is not well-suited for human detection from the UAV sensors in SAR operations, especially when the individuals are small and partially obscured, as demonstrated in Figure 1, where the original DINO model struggles to accurately detect small, occluded human figures. Figure 1 compares the examples from the COCO2017 dataset [26], the Saied Fire dataset [15], and our own UAV-collected data.

Our modification of the DINO model (VE-DINO) was inspired by the occlusion-aware R-CNN (region-based convolutional neural network) approach [27], which divides the human body into different parts for training. Nevertheless, our approach differs in that we divide individuals into different parts during training, with each part having its corresponding visibility information. The model processes this visibility information for each part, thereby enhancing the detection performance, especially for small, occluded individuals. By segmenting human detection into distinct parts, the VE-DINO is now capable of more effectively identifying smaller, occluded individuals, making it better suited for SAR operations in challenging conditions. By improving the occlusion handling capability of the DINO model specifically for UAV-based SAR operations, the study aims to contribute to more efficient and accurate rescue missions in urban mountain and forested areas affected by natural disasters.

3. Methodology

3.1. Overview

This study focuses on developing and evaluating the VE-DINO model for improved human detection in occlusion-heavy disaster scenarios. The research flow encompasses dataset preparation, model development, model evaluation and ablation study, and a usability case study.

The VE-DINO model builds upon the original DINO architecture, incorporating visibility-aware mechanisms to enhance detection performance for partially occluded individuals. It consists of several key components, including a backbone network for feature extraction, a transformer encoder for global dependency modeling, and a transformer decoder with dynamic anchor boxes for object localization.

The study utilizes three datasets: COCO2017 for training and validation, and the newly created DODD for external testing. A visibility-aware loss function is introduced to focus on visible regions during training.

Model evaluation was conducted using IoU, average precision, and average recall. An ablation study assesses the impact of visibility information for different body parts on model performance. Finally, a practical demonstration of the VE-DINO model integrated with a UAV system for search-and-rescue operations is presented.

3.2. Dataset

3.2.1. Training Dataset

We utilized the COCO2017 dataset [26] for the model training, which is a comprehensive benchmark for computer vision tasks, containing 124k images with diverse annotations for object detection and segmentation. First, we extracted the images, the bounding box information, and the key points information of the persons from the COCO2017 dataset. The key point information includes 17 anatomical landmarks annotated with three values (x, y, v), where x and y are the coordinates and v indicates visibility. Then, based on the key points, we divided the body parts into the head, upper body, lower body, and legs, as shown in Figure 2. These partitioned body parts were referred to as visibility information with a visibility field value from 0 to 1 (detailed in Section 3.3.5).

Specifically, the data preprocessing steps are as follows: First, Python’s JSON library was used to read the original COCO annotation files. Each target object contains the category, bounding box coordinates, and key point information. A new visibility field was then added for each target object, representing the visibility status of the head, upper body, lower body, and legs. After adding visibility information for all target objects, the updated annotations were saved in COCO format JSON files for use in subsequent training.

3.2.2. Validation Dataset

The COCO dataset used in this study was split into training and validation sets, with 118K images in the training set and 5K images in the validation set, representing an approximate ratio of 95:5. Through these steps, it was ensured that each target in the dataset has detailed visibility information, which will be used during model training to adjust the loss function, making the model focus more on the visible parts of the target and improving its detection performance in cases of partial occlusion.

3.2.3. External Testing Dataset

To evaluate the performance of the VE-DINO model in detecting occluded individuals, we developed a new dataset, named the DODD (as shown in Figure 3), which was assembled from other publicly available images from the Saied Fire dataset [15] and the Telperion DisasterDatasetRaw dataset [16]. The selected images from existing dataset included instances of natural disasters, such as fires, earthquakes, and floods. We annotated bounding boxes for all individuals, whether fully visible or partially occluded, using the LabelMe software (MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, United States).

DODD contains 121 images, including 20 images without any human presence as negative samples, all featuring people partially occluded by environmental factors typical of disaster situations. It served as the external testing dataset for VE-DINO.

3.3. Model Development

The VE-DINO model (as shown in Figure 4) incorporates multiple advanced components to address the challenges of object detection in complex, occlusion-heavy scenarios. This section is structured into five sub-sections: backbone, transformer encoder, transformer decoder with dynamic anchor boxes, contrastive denoising training, and loss functions [14,28]. This section details the functionality and workflow of each component, offering insights into the model’s design and implementation.

3.3.1. Model Backbone

The backbone of the VE-DINO model is a deep convolutional neural network (CNN) [29], such as ResNet-50 (as shown in Figure 5), designed to extract hierarchical feature maps from input images. These features capture both spatial and semantic information, critical for object-detection tasks. The backbone processes the input through a series of convolutional layers, each down sampling the spatial dimensions while increasing the depth of the feature representations. To handle objects of varying scales, a Feature Pyramid Network (FPN) is employed, which combines feature maps from multiple stages of the backbone. This multi-scale representation ensures robustness against changes in object sizes.

The output sizes and operations in each layer of ResNet-50 contribute to multi-scale feature extraction, allowing the model to efficiently capture both low-level and high-level features. The convolutional layers are represented as (filter size, number of filters), and each layer group repeats the corresponding residual blocks a specific number of times to deepen the feature representation. The output height and width can be calculated in Equations (1) and (2).

H_{o u t} = ⌊\frac{H_{i n} + 2 p_{h} - d_{h} (k_{h} - 1) - 1}{s_{h}} + 1⌋

(1)

W_{o u t} = ⌊\frac{W_{i n} + 2 p_{w} - d_{w} (k_{w} - 1) - 1}{s_{w}} + 1⌋

(2)

where

H_{i n / o u t}

and

W_{i n / o u t}

denote the input and output height and width; and p, d, k, and s are the padding size, dilation rate, kernel size, and stride where a subscript h or w denotes height and width direction.

3.3.2. Transformer Encoder

The transformer encoder in VE-DINO plays a pivotal role in modeling global dependencies across multi-scale feature maps. Unlike traditional convolutional operations that primarily focus on local contexts, the self-attention mechanism employed by the encoder aggregates information across all spatial locations, enabling the model to capture long-range dependencies that are critical for robust object detection. The core components of this mechanism are the query (Q), key (K), and value (V) matrices, each of which serves a distinct purpose in the attention process.

In VE-DINO, the query (Q) is derived from learnable object queries generated in the decoder. These queries incorporate information from dynamic anchor boxes, represented as (x, y, w, h). The key (K) is obtained by adding positional embeddings to the multi-scale feature maps extracted from the backbone. This ensures that the spatial relationships within the feature maps are preserved and leveraged during attention computation. The value (V) corresponds to the raw multi-scale feature maps, which encode semantic content and provide the foundational information for object detection.

The self-attention mechanism computes the attention weights using the scaled dot-product formula as expressed in Equation (3)

A t t e n t i o n (Q, K, V) = s o f t m a x (\frac{Q K^{T}}{\sqrt{d_{k}}}) V

(3)

Multi-head attention, as illustrated in Equations (4) and (5), is employed to capture diverse patterns within the input:

M u l t i H e a d (Q, K, V) = C o n c a t (α_{1}, \dots, α_{n}) W_{o}

(4)

α_{i} = A t t e n t i o n (Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V})

(5)

where

W_{o}

is the learnable projection matrix applied to the concatenated outputs of multiple attention heads (i.e., α₁ to α_n). Each encoder layer applies layer normalization and feedforward networks to refine the feature representations, ensuring that the output is robust and informative for downstream processing.

3.3.3. Transformer Decoder with Dynamic Anchor Boxes

The transformer decoder refines object localization by iteratively updating dynamic anchor boxes. These anchor boxes are represented as four-dimensional vectors in Equation (6).

P_{t} = \{x_{c}, y_{c}, w, h\}

(6)

where x_c, y_c denotes the center coordinates; and w and h represent the width and height. The decoder uses query embeddings to predict the offset required to refine the anchor boxes, which is modeled in Equation (7).

P_{t + 1} = P_{t} + Δ P_{t}

(7)

Δ P t = M L P (Q_{t})

(8)

where MLP is a multi-layer perceptron applied to the decoder query embedding Q at the current decoder layer t in Equation (8). This iterative refinement allows the model to precisely adjust the localization of objects, even under challenging conditions such as partial occlusion or cluttered backgrounds. By stacking multiple decoder layers, the model ensures that predictions are progressively improved through detailed spatial reasoning.

3.3.4. Contrastive Denoising Training (CDN)

To stabilize training and improve robustness, VE-DINO incorporates contrastive denoising training. This technique introduces noise into ground truth labels and bounding boxes, creating both positive and negative samples for the decoder. Noise is sampled and illustrated in Equation (9):

\{δ_{x}, δ_{y}, δ_{z}, δ_{h}\} \sim U (- λ, λ)

(9)

where

\{\cdot\}

represents perturbations in the x-direction, y-direction, width and height, sampling from

U (\cdot)

, and the uniform distribution of the noise scale (λ). Positive queries are trained to match ground truth objects with small perturbations, while negative queries are associated with noisy samples that represent “no object”. This contrastive approach reduces the likelihood of duplicate detections and improves the model’s generalization to unseen data.

3.3.5. Loss Function

The VE-DINO model introduced a visibility-aware loss function to enhance the detection accuracy of partially occluded individuals. This loss function addresses a fundamental challenge in disaster scenarios: individuals are often partially occluded by environmental factors such as debris, smoke, or vegetation. Standard object-detection models typically struggle to accurately detect these individuals, especially when parts of the body are not visible. By incorporating visibility-aware loss, the model prioritizes learning from visible body parts, thereby reducing the impact of occluded regions and enhancing detection performance under challenging conditions.

The core Idea of this approach is to assign different weights to the visible and occluded parts of the target during training. Visibility weights are used to penalize poor alignment, particularly in areas where the target is visible, ensuring that the model learns to accurately predict the bounding boxes of partially visible targets. This approach allows the model to effectively focus on visible regions, making it more robust in complex environments where partial occlusions are common.

The loss function includes a visibility-weighted L1 regression loss and generalized intersection over union (GioU) loss, both of which leverage target visibility information to focus on the visible parts of the target, enhancing the model’s performance in detecting partially occluded individuals. Visibility information, denoted as

v_{i j}

, represents the degree of visibility for the jth part of the ith target.

v_{i j}

is defined as a continuous variable ranging from 0 to 1. A value closer to 1 indicates that a larger proportion of the corresponding part is visible, while a value near 0 indicates significant occlusion. For instance, if 80% of a target’s head is visible, the corresponding

v_{i j}

for the head would be 0.8. This visibility score is determined for each body part of the target (e.g., head, upper body, lower body, and legs) based on the proportion of visible key points associated with that part. For each target

i

, the average visibility can be calculated in Equation (10).

v_{i} = \frac{1}{M} \sum_{j = 1}^{M} v_{i j}

(10)

where

M

is the total number of parts of the target. This average visibility

v_{i}

is used as a weight in the subsequent loss functions to ensure that the model focuses more on the visible parts during training. The L1 bounding box regression loss is computed with visibility information, as shown in Equation (11).

L_{b b o x} = \frac{1}{N} \sum_{i = 1}^{N} v_{i} \cdot |b_{i}^{p r e d} - b_{i}^{g t}|

(11)

where

v_{i}

is the visibility weight for target

i

, as calculated by the formula above.

b_{i}^{p r e d}

and

b_{i}^{g t}

are the predicted and ground-truth bounding boxes, respectively. The visibility weight

v_{i}

allows the model to focus on the errors for visible parts of the target.

The GioU loss is computed with visibility information, as given by Equation (12).

L_{g i o u} = \frac{1}{N} \sum_{i = 1}^{N} v_{i} \cdot (1 - G I o U (b_{i}^{p r e d}, b_{i}^{g t}))

(12)

where

G I o U (b_{i}^{p r e d}, b_{i}^{g t})

represents the generalized intersection over union between the predicted and ground truth bounding boxes. The visibility weight

v_{i}

emphasizes alignment for visible parts, improving the model’s ability to accurately predict the bounding boxes of partially occluded targets.

The VE-DINO model computes the L1 distance between the predicted and ground truth bounding boxes and applies weighting based on visibility information, considering only the error for visible parts. Specifically, the average visibility for each target is calculated and used to weigh the L1 loss. This allows the model to prioritize learning from the parts of the target that are visible, reducing the impact of occluded regions and improving detection accuracy under occlusion. In addition to the L1 loss, a visibility-weighted GioU loss is introduced to further refine the alignment between the predicted and ground truth bounding boxes. The visibility weights are used to penalize poor alignment, particularly in areas where the target is visible. This ensures that the model learns to accurately predict the bounding boxes of partially visible targets, focusing more on aligning the visible portions of the box.

The primary advantage of the VE-DINO model is its ability to better handle occluded targets by incorporating visibility information directly into the loss calculation. By focusing on visible regions, the model is more robust in complex environments where partial occlusions are common, such as dense vegetation or urban search-and-rescue scenarios. Additionally, the visibility-weighted loss functions help to reduce false positives that might arise from attempting to predict occluded areas inaccurately, leading to an overall improvement in precision and recall. The modifications to the DINO model also ensure a more efficient training process, as the model learns to allocate its resources to the most informative parts of the target. This not only improves the convergence speed but also enhances the generalizability of the model to real-world occlusion scenarios.

3.4. Model Validation and Ablation Study

Model validation was conducted on the split of COCO2017, while the external testing was conducted on DODD. The primary metric was IoU 0.50:0.95 (intersection over union), which measures the overlap between the predicted and ground truth bounding boxes, with thresholds ranging from 0.50 to 0.95.

Additionally, the average precision is the area under the precision–recall curve over a range of IoU thresholds. The average recall is the ratio of correctly identified individuals to the total number of ground truth individuals, averaged over a range of IoU thresholds.

The ablation study was conducted by selectively removing visibility information for specific body parts (e.g., head, upper body, lower body, and legs) in the validation dataset, COCO2017. IoU 0.50:0.95. Average precision was the primary metric followed by average recall for the evaluation of the ablation study. We conducted the ablation study with the following configurations:

Full Visibility Information (Baseline): No visibility information was removed (serving as the baseline).
Without Head Visibility: The visibility information for the head was removed during training.
Without Upper Body Visibility: The visibility information for the upper body was removed during training.
Without Lower Body Visibility: The visibility information for the lower body was removed during training.
Without Legs Visibility: The visibility information for the legs was removed during training.

4. Results

4.1. Model Validation Using COCO2017 Dataset

The VE-DINO model demonstrated a higher average precision compared to the original model, indicating its superior capability in accurately detecting occluded individuals in complex environments, such as mountainous forests, earthquake sites, and fire-stricken areas. This improvement is crucial for scenarios where partial occlusions and challenging backgrounds are common, which are typical in search-and-rescue operations.

To illustrate the performance improvements, the test results of both the original and VE-DINO are presented in Table 2, respectively. The VE-DINO model achieved an AP of 0.615 at IoU = 0.50:0.95, compared to 0.491 for the original model, highlighting its enhanced detection accuracy.

A visual illustration is provided through the detection outputs on a collection of images. Figure 6 shows the COCO2017 dataset, Saied Fire dataset, Telperion DisasterDatasetRaw dataset, and our own UAV-collected data and highlights the occluded individuals that could not be detected by the original DINO and were detected by VE-DINO. The original DINO model can accurately detect individuals when they are fully visible; however, it struggles with individuals who are partially occluded. In the provided comparison, the original model failed to accurately identify individuals whose lower bodies were obscured, demonstrating its limitations in handling occlusions. This is particularly evident in scenarios where only the upper body is visible, as the original model is unable to generate reliable bounding boxes, resulting in missed detections or false positives.

In contrast, the VE-DINO model effectively detects partially occluded individuals, as shown in the combined results. In the forest scene, the improved model accurately identifies the person partially obscured by the surrounding vegetation, demonstrating its enhanced capability to handle occlusions. In the fire scene, the VE-DINO model accurately detects individuals amidst flames, emphasizing its robustness in environments with visual obstructions caused by smoke and fire. Similarly, in the earthquake scenario, both models were able to detect the individual, showing comparable performance in this specific case. However, in the flood scenario, the original model fails to fully identify the person in the water, whereas the VE-DINO model accurately locates the individual, highlighting its improved performance in challenging flood conditions. These improvements are critical for search-and-rescue operations in natural environments, such as forests, areas affected by natural disasters, and flood zones, where individuals are often only partially visible. The ability of the VE-DINO to reliably detect such individuals makes it highly suitable for deployment in real-world scenarios where occlusions are prevalent.

4.2. External Testing Using DODD

The external test results for the VE-DINO model using DODD are presented as Table 3. The results from the DODD evaluation indicated that the VE-DINO model maintains a consistent performance in detecting occluded individuals across various disaster scenarios. An AP of 0.500 for IoU = 0.50:0.95 across all object sizes was achieved, which is particularly noteworthy given the challenging nature of the dataset. For small objects, the AP was relatively low at 0.222, underscoring the difficulty of detecting smaller, occluded individuals. However, the AP for large objects was 0.611, reflecting the model’s ability to effectively detect larger individuals in challenging disaster conditions.

Overall, the results demonstrated that the VE-DINO model is more effective in handling partial occlusions, leading to higher detection precision and recall. The qualitative analysis further supports the quantitative improvements, showing that the modifications enhance the model’s reliability in real-world conditions, making it a more suitable candidate for deployment in search-and-rescue missions and other applications requiring robust human detection in complex environments.

4.3. Ablation Study

The results of the ablation study, as presented in Table 4, demonstrate the critical role of visibility information for different body parts in enhancing detection performance. The baseline model, which retained visibility information for all body parts, achieved the highest performance across all evaluation metrics, with an overall average precision of 0.615 and an average recall of 0.725. This result underscores the importance of incorporating visibility information holistically to improve detection robustness in occlusion-heavy scenarios.

When the visibility information for the head was removed, the model’s performance exhibited a moderate decline. The overall average precision decreased from 0.615 to 0.612, and the average precision for small objects was particularly affected, dropping from 0.352 to 0.333. This observation highlights the significant contribution of head visibility in detecting small, partially occluded individuals, where other parts of the body may be less discernible.

The removal of upper body visibility information resulted in a more noticeable decline in detection accuracy, with the overall average precision reducing to 0.606. The impact was particularly evident for medium-sized objects, where the average precision dropped from 0.67 to 0.659. This finding emphasizes the importance of the upper body as a critical contextual feature for human detection, particularly in scenarios where individuals are partially obscured by environmental factors such as debris or vegetation.

Similarly, the removal of lower body visibility information caused a decline in performance, with the overall average precision decreasing to 0.61. While this reduction was slightly less pronounced compared to the upper body, it indicates that lower body visibility still contributes meaningfully to detection accuracy. Notably, the performance for small-scale objects was again affected, further reinforcing the idea that visibility information is particularly critical for detecting smaller, occluded targets.

When the visibility information for the legs was removed, the overall average precision dropped marginally from 0.615 to 0.612, with minimal impact across all object sizes. The AP for small objects decreased slightly from 0.352 to 0.349, while the AR for large objects remained virtually unchanged at 0.899. These results suggest that leg visibility plays a relatively minor role in the model’s detection process, contributing less significantly compared to the head, upper body, and lower body.

The experimental results indicated that removing visibility information for the head, upper body, lower body, or legs results in a slight decline in detection performance compared to the baseline model, but the reduction remains within an acceptable range. Specifically, the average precision for all objects drops by less than 1%, depending on the removed body region. The largest decline is observed when upper body visibility is removed, with AP decreasing from 0.615 to 0.606 (0.009 reduction), while removing leg visibility leads to a minimal average precision drop from 0.615 to 0.612 (0.003 reduction). Similarly, the average recall remains relatively stable with a reduction of less than 1% across all objects. These results demonstrate the robustness of the proposed model, as it maintains a strong detection performance even when critical visibility information is selectively removed.

5. Illustration of Case Study Usability Integrating UAV System

The purpose of designing this system is to create an efficient and reliable search-and-rescue (SAR) solution that leverages UAVs and the visibility-enhanced DINO model to detect individuals even in challenging environments with potential occlusions. The main goal is to automate the detection process, ensuring that SAR teams receive timely and actionable information that can improve mission success rates while reducing human effort and the risks involved in manual search operations.

The development of the system is driven by the need to cover large, often inaccessible areas quickly and to provide precise information for SAR teams. By integrating UAVs with advanced detection models, this system can automate the detection of individuals, thereby enhancing the speed and accuracy of search missions. The system has been designed to facilitate seamless communication between UAVs and ground control units, enabling the efficient processing of images and swift detection of targets.

The system workflow, as illustrated in Figure 7, involves several key steps, each of which has been carefully designed to ensure optimal performance and reliability:

UAV Deployment and Image Capture: The UAVs are deployed to designated search areas, equipped with high-resolution cameras that capture aerial images at intervals of every 5 s. This interval was selected to ensure a balance between sufficient coverage and minimizing data redundancy, making sure that important information is not missed while avoiding excessive data accumulation. The UAV continues to scan and capture images throughout the search mission.
Image Transmission to Local Computer: The captured images are transmitted to a local computer on the ground via a secure wireless link. This step is crucial for ensuring that data are promptly received without interference or data loss, especially in remote areas where maintaining stable communication is challenging.
Human Detection Using VE-DINO Model: Once the images are transmitted to the local computer, they are processed using the VE-DINO model to detect any human presence. If an individual is detected and the detection confidence (possibility) is greater than 50%, the image and corresponding detection results are passed to the next step. The decision to use a local computer for processing rather than onboard processing in UAVs is motivated by the significant computational power required by the DINO model, which can be best handled by ground-based hardware.
Data Storage and Result Notification: Images with positive detections (possibility greater than 50%) are saved in a designated folder for further analysis. These images are stored along with the detection results, allowing SAR personnel to review them for informed decision-making. Once a positive detection is confirmed, the system immediately notifies the SAR team, ensuring that they receive timely updates on the search status.
SAR Team Notification and Response: Detected individuals are flagged, and the SAR team is notified of the exact location and visual confirmation of the detection. This real-time alert system aims to provide immediate, actionable insights, allowing SAR personnel to efficiently respond to potential sightings. The use of the VE-DINO model is key in handling partial occlusions, ensuring that even individuals obscured by natural obstacles are detected, which greatly improves the chances of success in SAR missions.

Each of these steps is designed to meet the system’s overall objectives: maximizing detection accuracy, ensuring efficient data flow, and providing timely support to SAR teams. The integration of UAVs with advanced object-detection techniques overcomes traditional challenges faced in SAR missions, such as occlusions, difficult terrain, and limited human resources.

To validate the accuracy of the proposed model, a case study was conducted using a DJI Mini 4 Pro UAV to perform flight tests in a mountainous forest environment. The objective was to evaluate the model’s capability in detecting partially occluded individuals in a complex, natural setting. The UAV was deployed to capture aerial imagery of forested areas where individuals were positioned in varying levels of occlusion, caused by trees, vegetation, and terrain.

The captured images were processed using the VE-DINO model, and the results indicate that the model successfully identified individuals who were partially obscured by environmental elements. Figure 8 illustrates the detection results, showcasing the model’s ability to recognize occluded individuals in challenging environments.

In a collection of three field images with seven individuals, the VE-DINO model correctly identified all seven individuals, whereas the original model identified only three out of seven. This case study highlights the feasibility of integrating UAV technology with advanced machine-learning models for search-and-rescue operations in disaster-stricken areas. By enhancing the model’s ability to detect partially occluded individuals, the proposed approach can significantly improve the efficiency and success of search-and-rescue missions, ultimately contributing to increased urban resilience against natural disasters.

6. Discussion

This study presents significant advancements in UAV-based SAR operations by introducing a VE-DINO model, a machine learning-based object-detection framework, along with the development of the DODD. The DINO model enhancements focus on the efficient detection of partially occluded individuals, a common challenge in post-disaster environments such as forests, floods, and fires. The DODD dataset, specifically developed to support this research, serves as an essential contribution, enabling rigorous validation of the model’s performance in occlusion-heavy scenarios.

The results demonstrated substantial improvements in the performance of the VE-DINO model across complex environments. Specifically, the model effectively detected individuals in disaster scenarios like forests and fires, where vegetation and smoke often obscure visibility. The experimental results on average precision and average recall illustrated enhanced performance, especially in occlusion-heavy conditions. VE-DINO can focus on visible parts of individuals more effectively, which leads to significantly improved detection accuracy compared to the original model. This adaptability is evident in diverse scenarios, where the enhanced model successfully detected individuals even in environments with smoke and partial flooding, outperforming the original DINO model under similar conditions.

The implications of this study extend across both the fields of machine learning and practical SAR operations. From a machine-learning perspective, the proposed modifications to the DINO model (VE-DINO)—including visibility-aware mechanisms and visibility-weighted loss functions—mark a meaningful step forward in addressing the challenge of occluded object detection. These advancements contribute to enhancing the robustness of object-detection models in real-world environments, providing a foundation for further improvements in handling occlusion-heavy scenarios. In terms of practical impact on SAR operations, the enhanced model directly improves the efficiency of search-and-rescue missions. By reducing the number of false negatives and missed detections, the model significantly increases the likelihood of identifying survivors in time-critical scenarios, ultimately making SAR efforts more effective. This is crucial in disaster situations where rapid and accurate identification can determine life-saving outcomes. Moreover, the DODD dataset itself represents a significant contribution, as it fills a critical gap in existing datasets by providing targeted data for occlusion-heavy disaster environments. This dataset serves not only as a benchmark for evaluating detection models but also holds promise for future research aimed at improving human detection under challenging conditions.

The underlying mechanisms behind the enhanced performance of the modified DINO model are rooted in the integration of visibility information and body segmentation. The model divides the human body into distinct parts—such as the head, upper body, lower body, and legs—and uses visibility information for each part during training. This segmentation allows the model to focus on the visible segments and minimizes the influence of occluded areas, which helps to improve detection accuracy in real-world disaster scenarios. Additionally, the introduction of a visibility-weighted loss function ensures that the model places greater emphasis on learning from visible regions during training. This modification improves the model’s ability to accurately predict bounding boxes for partially occluded individuals, making it more effective in SAR operations involving occluded subjects.

Despite the significant advancements, the VE-DINO model still faces certain limitations that warrant further research and development. While effective in detecting occluded individuals, the model continues to encounter challenges when identifying smaller targets, especially in complex and cluttered environments. This limitation is particularly critical in UAV-based SAR operations, where distant scenes often render individuals as small targets. Consequently, there is a risk of missed detections in scenarios where individuals are small or partially obscured, potentially impacting the effectiveness of rescue efforts. To address these challenges, future studies should explore the integration of advanced techniques such as multi-scale feature fusion [30,31], adaptive attention mechanisms [32,33], and super-resolution methods [34,35]. These approaches could potentially enhance the model’s ability to detect smaller targets and improve its performance in complex environments. Furthermore, the current evaluation of the VE-DINO model has primarily focused on natural disaster scenarios like forests, fires, and floods. To ensure broader applicability and robustness, it is crucial to expand testing to include urban disaster settings and other challenging environments. This expanded scope of testing would not only validate the model’s effectiveness across a wider range of conditions but also help to assess its generalizability, ultimately contributing to the development of more versatile and reliable SAR technologies.

Future research will focus on several key areas to further enhance the capabilities of the model and expand its applicability. One potential direction is the incorporation of multimodal sensors, such as thermal imaging, to improve detection capabilities in low-visibility conditions like nighttime or heavy smoke. The combination of visible and thermal sensors could significantly enhance the reliability of detection in such scenarios. Additionally, expanding and improving the DODD dataset is crucial. The goal is to include a wider variety of disaster scenarios, such as urban environments, which will help the model adapt to different types of occlusions and challenging backgrounds, ultimately improving its robustness and applicability. Another avenue for future research involves refining the loss function and adjusting the visibility weighting of different body parts. By dynamically adjusting weights based on the visibility of each body part during training, the model could better utilize visibility information, leading to improved accuracy and reliability in occlusion-heavy environments. Broader applications of the model could be extended to healthcare settings, where it has demonstrated the ability to detect patients partially obscured by blankets in hospitals and care homes to enhance patient monitoring and safety [36].

In conclusion, the VE-DINO model, supported by the custom-built DODD dataset, represents a significant advancement in UAV-based SAR operations. The enhancements introduced in this study not only improve the detection accuracy of partially occluded individuals but also contribute to the broader field of machine learning by providing a robust solution to occlusion detection challenges. These contributions have the potential to make SAR operations more efficient, ultimately saving lives in post-disaster scenarios. Future work will aim to build on these findings, further enhancing model capabilities and expanding dataset diversity to support more complex and varied rescue operations.

7. Conclusions

The VE-DINO model presented in this study significantly enhances UAV-based SAR operations, particularly in environments where occlusions pose challenges to effective human detection. By incorporating visibility-aware attributes, the model improves the detection of small, partially occluded individuals, which is crucial in post-disaster scenarios such as earthquakes, floods, and fires. The integration of this enhanced model into UAV workflows demonstrates improved detection accuracy, achieving an average precision of 0.615 compared to 0.491 for the original model. These improvements can lead to faster response times and ultimately save lives.

A key contribution of this study is the development of DODD, which provides targeted data for evaluating the performance of detection models in occlusion-heavy disaster environments. The DODD dataset plays an essential role in validating the enhanced model’s capabilities and fills a critical gap in existing datasets by offering real-world scenarios where occlusions are prevalent. This dataset not only serves as a benchmark for the current research but also offers a valuable resource for future advancements in the field of human detection during SAR operations.

Future SAR operations are expected to integrate more advanced sensor technologies and autonomous systems to address increasingly complex disaster scenarios. Our future research will focus on incorporating thermal imaging technology in combination with our occlusion detection techniques to further improve detection accuracy in low-visibility conditions. Additionally, we aim to expand and refine the DODD dataset to include a wider range of disaster scenarios, such as urban environments, enhancing the model’s generalizability and robustness. These advancements will help to develop more resilient UAV-based SAR systems capable of effectively responding to real-world disaster challenges.

Author Contributions

Conceptualization, D.W.-C.W. and J.C.-W.C.; methodology, D.W.-C.W. and J.C.-W.C.; software, Z.-A.Z., S.W., M.-X.C., Y.-J.M. and D.K.-H.L.; validation, M.-X.C. and Y.-J.M.; formal analysis, Z.-A.Z., S.W. and D.K.-H.L.; investigation, Z.-A.Z., S.W. and A.C.-H.C.; data curation, Z.-A.Z., S.W. and A.C.-H.C.; writing—original draft preparation, Z.-A.Z.; writing—review and editing, D.W.-C.W. and J.C.-W.C.; visualization, M.-X.C. and Y.-J.M.; supervision, D.W.-C.W. and J.C.-W.C.; project administration, D.W.-C.W. and J.C.-W.C.; funding acquisition, J.C.-W.C. All authors have read and agreed to the published version of the manuscript.

Funding

This study received no external funding.

Data Availability Statement

The original data presented in the study are openly available in GitHub, https://github.com/ZhaoZia/DODD.git (accessed on 14 November 2024).

Acknowledgments

The authors would like to thank Hui Chun-Wai, Yip Kai-Hung, Wong Chung-Yin, and Miu Yin-Wai from the Hong Kong Police Force Innovation and Solution Lab for their advice on SAR and UAV. Additionally, the authors would like to thank Henry Yu-Keung Chan and West Ka-Fai Wo from the Industrial Centre of the Hong Kong Polytechnic University for the facility and consultation.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Benevolenza, M.A.; DeRigne, L. The impact of climate change and natural disasters on vulnerable populations: A systematic review of literature. J. Hum. Behav. Soc. Environ. 2019, 29, 266–281. [Google Scholar] [CrossRef]
Li, H.; Xu, E.; Zhang, H. Examining the coupling relationship between urbanization and natural disasters: A case study of the Pearl River Delta, China. Int. J. Disaster Risk Reduct. 2021, 55, 102057. [Google Scholar] [CrossRef]
Chou, S.Y.; Chen, D. Emergent disaster rescue methods and prevention management. Disaster Prev. Manag. Int. J. 2013, 22, 265–277. [Google Scholar] [CrossRef]
Liu, B.; Sheu, J.-B.; Zhao, X.; Chen, Y.; Zhang, W. Decision making on post-disaster rescue routing problems from the rescue efficiency perspective. Eur. J. Oper. Res. 2020, 286, 321–335. [Google Scholar] [CrossRef]
Gupta, L.; Jain, R.; Vaszkun, G. Survey of important issues in UAV communication networks. IEEE Commun. Surv. Tutor. 2015, 18, 1123–1152. [Google Scholar] [CrossRef]
Titu, M.F.S.; Pavel, M.A.; Michael, G.K.O.; Babar, H.; Aman, U.; Khan, R. Real-Time Fire Detection: Integrating Lightweight Deep Learning Models on Drones with Edge Computing. Drones 2024, 8, 483. [Google Scholar] [CrossRef]
Yeom, S. Thermal Image Tracking for Search and Rescue Missions with a Drone. Drones 2024, 8, 53. [Google Scholar] [CrossRef]
Lyu, M.; Zhao, Y.; Huang, C.; Huang, H. Unmanned aerial vehicles for search and rescue: A survey. Remote Sens. 2023, 15, 3266. [Google Scholar] [CrossRef]
Oh, D.; Han, J. Smart search system of autonomous flight UAVs for disaster rescue. Sensors 2021, 21, 6810. [Google Scholar] [CrossRef] [PubMed]
Liu, X.; Ansari, N. Resource allocation in UAV-assisted M2M communications for disaster rescue. IEEE Wirel. Commun. Lett. 2018, 8, 580–583. [Google Scholar] [CrossRef]
Banuls, A.; Mandow, A.; Vázquez-Martín, R.; Morales, J.; García-Cerezo, A. Object detection from thermal infrared and visible light cameras in search and rescue scenes. In Proceedings of the 2020 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), Abu Dhabi, United Arab Emirates, 4–6 November 2020; pp. 380–386. [Google Scholar]
Wang, Y.; Chen, W.; Luan, T.H.; Su, Z.; Xu, Q.; Li, R.; Chen, N. Task offloading for post-disaster rescue in unmanned aerial vehicles networks. IEEE/ACM Trans. Netw. 2022, 30, 1525–1539. [Google Scholar] [CrossRef]
Jordan, M.I.; Mitchell, T.M. Machine learning: Trends, perspectives, and prospects. Science 2015, 349, 255–260. [Google Scholar] [PubMed]
Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.-Y. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605. [Google Scholar] [CrossRef]
Saied. Saied Fire Dataset. Available online: https://www.kaggle.com/datasets/phylake1337/fire-dataset?select=fire_datase (accessed on 14 November 2024).
Telperion. DiasterDatasetRaw Dataset. Available online: https://www.kaggle.com/datasets/telperion/diasterdatasetraw/data (accessed on 14 November 2024).
Zhou, Q.; Wang, S.; Wang, Y.; Huang, Z.; Wang, X. Human de-occlusion: Invisible perception and recovery for humans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3691–3701. [Google Scholar]
Russell Bernal, A.M.; Scheirer, W.; Cleland-Huang, J. NOMAD: A Natural, Occluded, Multi-scale Aerial Dataset, for Emergency Response Scenarios. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 8584–8595. [Google Scholar]
Niedzielski, T.; Jurecka, M.; Miziński, B.; Pawul, W.; Motyl, T. First successful rescue of a lost person using the human detection system: A case study from Beskid Niski (SE Poland). Remote Sens. 2021, 13, 4903. [Google Scholar] [CrossRef]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar]
Redmon, J. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
Wang, A.; Sun, Y.; Kortylewski, A.; Yuille, A.L. Robust object detection under occlusion with context-aware compositionalnets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12645–12654. [Google Scholar]
Li, X.; He, M.; Liu, Y.; Luo, H.; Ju, M. SPCS: A spatial pyramid convolutional shuffle module for YOLO to detect occluded object. Complex Intell. Syst. 2023, 9, 301–315. [Google Scholar] [CrossRef]
Aslan, M.F.; Durdu, A.; Sabanci, K.; Mutluer, M.A. CNN and HOG based comparison study for complete occlusion handling in human tracking. Measurement 2020, 158, 107704. [Google Scholar] [CrossRef]
Geng, L. Improving Apple Object Detection with Occlusion-Enhanced Distillation. arXiv 2024, arXiv:2409.01573. [Google Scholar] [CrossRef]
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings Part V 13. 2014; pp. 740–755. [Google Scholar]
Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S.Z. Occlusion-aware R-CNN: Detecting pedestrians in a crowd. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germay, 8–14 September 2018; pp. 637–653. [Google Scholar]
Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. Dn-detr: Accelerate detr training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13619–13627. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Zeng, N.; Wu, P.; Wang, Z.; Li, H.; Liu, W.; Liu, X. A small-sized object detection oriented multi-scale feature fusion approach with application to defect detection. IEEE Trans. Instrum. Meas. 2022, 71, 3507014. [Google Scholar] [CrossRef]
Tan, S.; Duan, Z.; Pu, L. Multi-scale object detection in UAV images based on adaptive feature fusion. PLoS ONE 2024, 19, e0300120. [Google Scholar] [CrossRef]
Li, W.; Liu, K.; Zhang, L.; Cheng, F. Object detection based on an adaptive attention mechanism. Sci. Rep. 2020, 10, 11307. [Google Scholar] [CrossRef]
Qu, J.; Tang, Z.; Zhang, L.; Zhang, Y.; Zhang, Z. Remote sensing small object detection network based on attention mechanism and multi-scale feature fusion. Remote Sens. 2023, 15, 2728. [Google Scholar] [CrossRef]
Liu, J.; Zhang, J.; Ni, Y.; Chi, W.; Qi, Z. Small-Object Detection in Remote Sensing Images with Super Resolution Perception. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 15721–15734. [Google Scholar] [CrossRef]
Courtrai, L.; Pham, M.-T.; Lefèvre, S. Small object detection in remote sensing images based on super-resolution with auxiliary generative adversarial networks. Remote Sens. 2020, 12, 3152. [Google Scholar] [CrossRef]
Lai, D.K.-H.; Yu, Z.-H.; Leung, T.Y.-N.; Lim, H.-J.; Tam, A.Y.-C.; So, B.P.-H.; Mao, Y.-J.; Cheung, D.S.K.; Wong, D.W.-C.; Cheung, J.C.-W. Vision Transformers (ViT) for blanket-penetrating sleep posture recognition using a triple ultra-wideband (UWB) radar system. Sensors 2023, 23, 2475. [Google Scholar] [CrossRef]

Figure 1. Examples of undetected occluded humans (represented by yellow dotted line box) by the original DINO model in various environments.

Figure 2. Body part segmentation by the key points from the COCO2017 dataset.

Figure 3. Sample images from the Disaster Occlusion Detection Dataset (DODD).

Figure 4. Model architecture of the VE-DINO model with the visibility-aware loss function.

Figure 5. ResNet-50 Architecture.

Figure 6. Comparison of detection results: (a) Difference between VE-DINO and original DINO. The bold green box represents individuals that could be identified by VE-DINO but failed to be identified by the original DINO; (b) VE-DINO detection results; (c) original model detection results.

Figure 7. System flow diagram.

Figure 8. Scenes captured by UAV-sensing showcasing the detection of occluded individuals in a forested environment using the VE-DINO model.

Table 1. Existing studies on object-detection methods under occlusion.

Study	Year	Scenario	Model	Detection Method
Wang, et al. [22]	2020	Strongly occluded vehicles detection	DCNN (CompositionalNet)	Part-based voting and context segmentation
Zhou, et al. [17]	2021	Real-world occluded human detection	HDM	Mask completion and content recovery
Li, et al. [23]	2023	Crowded scenes occluded human detection	Scaled-YOLOv4 + SPCS	SPCS
Aslan, et al. [24]	2020	Noisy environments occluded human detection	HOG-SVM and CNN	Comparison between HOG-SVM and CNN for occlusion
Geng [25]	2024	Occluded apple object detection	OED	Knowledge distillation for occlusion handling

CNN: convolutional neural network; DCNN: deep CNN; HDM: human de-occlusion model; HOG: histogram of oriented gradients; OED: occlusion-enhanced distillation; SPCS: Spatial Pyramid Convolutional Shuffle; SVM: support vector machine; YOLO: You Only Look Once Model.

Table 2. Performance comparison of original and VE-DINO models on the COCO2017 dataset.

		Average Precision		Average Recall
IoU	Area	Original DINO	VE-DINO	Original DINO	VE-DINO
0.50: 0.95	all	0.491	0.615	0.728	0.725
0.50: 0.95	small	0.321	0.352	0.562	0.488
0.50: 0.95	medium	0.524	0.67	0.767	0.783
0.50: 0.95	large	0.63	0.799	0.885	0.902

Table 3. Performance of VE-DINO on the external testing set DODD.

IoU	Area	Average Precision	Average Recall
0.50: 0.95	all	0.500	0.619
0.50: 0.95	small	0.222	0.360
0.50: 0.95	medium	0.424	0.566
0.50: 0.95	large	0.611	0.704

Table 4. Average precision and recall of ablation study with varying degrees of visibility information on different sizes of objects.

Visibility Information				Average Precision				Average Recall
Head	Upper Body	Lower Body	Legs	All Objects	Small Objects	Medium Objects	Large Objects	All Objects	Small Objects	Medium Objects	Large Objects
√	√	√	√	0.615	0.352	0.67	0.799	0.725	0.488	0.783	0.902
	√	√	√	0.612	0.333	0.667	0.801	0.729	0.491	0.788	0.905
√		√	√	0.606	0.332	0.659	0.797	0.716	0.469	0.777	0.899
√	√		√	0.61	0.341	0.663	0.8	0.723	0.483	0.783	0.9
√	√	√		0.612	0.349	0.665	0.796	0.723	0.487	0.78	0.899

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhao, Z.-A.; Wang, S.; Chen, M.-X.; Mao, Y.-J.; Chan, A.C.-H.; Lai, D.K.-H.; Wong, D.W.-C.; Cheung, J.C.-W. Enhancing Human Detection in Occlusion-Heavy Disaster Scenarios: A Visibility-Enhanced DINO (VE-DINO) Model with Reassembled Occlusion Dataset. Smart Cities 2025, 8, 12. https://doi.org/10.3390/smartcities8010012

AMA Style

Zhao Z-A, Wang S, Chen M-X, Mao Y-J, Chan AC-H, Lai DK-H, Wong DW-C, Cheung JC-W. Enhancing Human Detection in Occlusion-Heavy Disaster Scenarios: A Visibility-Enhanced DINO (VE-DINO) Model with Reassembled Occlusion Dataset. Smart Cities. 2025; 8(1):12. https://doi.org/10.3390/smartcities8010012

Chicago/Turabian Style

Zhao, Zi-An, Shidan Wang, Min-Xin Chen, Ye-Jiao Mao, Andy Chi-Ho Chan, Derek Ka-Hei Lai, Duo Wai-Chi Wong, and James Chung-Wai Cheung. 2025. "Enhancing Human Detection in Occlusion-Heavy Disaster Scenarios: A Visibility-Enhanced DINO (VE-DINO) Model with Reassembled Occlusion Dataset" Smart Cities 8, no. 1: 12. https://doi.org/10.3390/smartcities8010012

APA Style

Zhao, Z.-A., Wang, S., Chen, M.-X., Mao, Y.-J., Chan, A. C.-H., Lai, D. K.-H., Wong, D. W.-C., & Cheung, J. C.-W. (2025). Enhancing Human Detection in Occlusion-Heavy Disaster Scenarios: A Visibility-Enhanced DINO (VE-DINO) Model with Reassembled Occlusion Dataset. Smart Cities, 8(1), 12. https://doi.org/10.3390/smartcities8010012

Article Menu

Enhancing Human Detection in Occlusion-Heavy Disaster Scenarios: A Visibility-Enhanced DINO (VE-DINO) Model with Reassembled Occlusion Dataset

Abstract

Highlights

Abstract

1. Introduction

1.1. Background

1.2. Motivation

1.3. Objectives and Contributions

2. Related Work

2.1. UAVs in Search and Rescue

2.2. Object Detection Under Occlusion

3. Methodology

3.1. Overview

3.2. Dataset

3.2.1. Training Dataset

3.2.2. Validation Dataset

3.2.3. External Testing Dataset

3.3. Model Development

3.3.1. Model Backbone

3.3.2. Transformer Encoder

3.3.3. Transformer Decoder with Dynamic Anchor Boxes

3.3.4. Contrastive Denoising Training (CDN)

3.3.5. Loss Function

3.4. Model Validation and Ablation Study

4. Results

4.1. Model Validation Using COCO2017 Dataset

4.2. External Testing Using DODD

4.3. Ablation Study

5. Illustration of Case Study Usability Integrating UAV System

6. Discussion

7. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI