Article

FireYOLO-Lite: Lightweight Forest Fire Detection Network with Wide-Field Multi-Scale Attention Mechanism

College of Information Engineering, Beijing Institute of Petrochemical Technology, Beijing 102617, China
* Author to whom correspondence should be addressed.
Forests 2024, 15(7), 1244; https://doi.org/10.3390/f15071244
Submission received: 28 May 2024 / Revised: 12 July 2024 / Accepted: 15 July 2024 / Published: 17 July 2024
(This article belongs to the Section Natural Hazards and Risk Management)

Abstract

A lightweight forest fire detection model based on YOLOv8 is proposed in this paper in response to the limitations of traditional sensors for forest fire detection: their performance is easily constrained by hardware computing power, and their adaptability to different environments needs improvement. To balance the accuracy and speed of fire detection, the lightweight GhostNetV2 network is adopted to replace the YOLOv8 backbone for feature extraction. The Ghost module replaces traditional convolution operations, conducting feature extraction independently in different dimensional channels and significantly reducing the complexity of the model while maintaining excellent performance. Additionally, an improved channel-priority attention mechanism, CPDCA, is proposed, which extracts spatial features through dilated convolution, thereby reducing computational overhead and enabling the model to focus more on fire targets, achieving more accurate detection. To address the problem of small targets in fire detection, the Inner-IoU loss function is introduced. By adjusting the size of the auxiliary bounding boxes, this function effectively enhances convergence in small-target detection, further reducing missed detections and improving overall detection accuracy. Experimental results indicate that, compared with traditional methods, the algorithm proposed in this paper significantly improves the average precision and FPS of fire detection while maintaining a smaller model size. Compared with YOLOv3-tiny, the average precision increased by 5.9% and the frame rate reached 285.3 FPS with a model size of only 4.9 M; compared with ShuffleNet, the average precision increased by 2.9% and the inference speed tripled. Additionally, the algorithm effectively suppresses false positives caused by clouds and reflected light, further enhancing the detection of small targets and reducing missed detections.

1. Introduction

From a global perspective, the frequency of forest fires is astonishingly high. Based on relevant statistical data, it is estimated that over 220,000 forest fires occur worldwide every year on average, and the forest area destroyed by these fires accounts for more than 1% of the total global forest area [1]. These data intuitively reveal the destructive impact of forest fires on global forest resources. As a specific example, the forest fires that occurred in Canada in 2023 burned more than 18.4 million hectares, nearly twice the size of South Korea’s territory; these large-scale fires burned down vast areas of trees, severely disrupting the balance of ecosystems and leading to the loss of biodiversity [2]. In summary, considering the frequency of occurrence, the extent of the burned area, the degree of ecological damage, and the impact on human society worldwide, it is evident that forest fires pose a comprehensive and profound threat to the ecological environment and human life.
Traditional smoke sensors [3,4,5,6] and temperature sensors [7,8,9] have limitations in terms of response delay in fire detection. Their operation is based on pre-set threshold triggers, meaning that these sensors only issue alerts when a fire’s smoke concentration or temperature reaches specific alarm criteria [10,11,12]. In recent years, scholars have proposed various new fire detection methods, such as infrared photoelectric smoke detection systems combined with high-resolution spectroscopy and virtual low-pass filters [13]. Others include measuring carbon monoxide levels to detect whether there is a fire [14] and detecting flame radiation using Fourier transform infrared spectrometers and thermal imagers [15]. A distributed feedback semiconductor laser (DFB LD) has also been used as a light source, with modulation of the source enabling harmonic detection of the gas concentrations generated by a fire [16]. In addition, some researchers use chemical molecular materials to detect fires: Chen et al. [17] used graphene paper for fire detection and alarm, and Huang et al. [18] further improved graphene paper’s fire resistance and fire detection capability. Traditional detection methods usually rely on fixed thresholds to determine whether a fire has occurred; in the early stages of a fire or under unclear conditions, these methods often fail to respond promptly, which increases the risk of missed fire alarms.
Forest fire detection based on machine vision relies mainly on machine learning [19] and image processing [20], for example, support vector machines (SVMs) [21] and decision trees [22]. Image processing approaches mainly focus on color thresholds, segmenting fire regions in specific color spaces. Some researchers adopt the YCbCr color space, cleverly separating brightness from color to extract flame pixels and thus achieve fire detection [23]. Yang et al. [24] proposed a new pixel-accuracy method, PreVM, which improves fire detection accuracy by introducing new norm constraints. Additionally, some scholars utilize multiple sensors to detect smoke, dust, and steam, combined with SVM for multi-class classification [25]. However, these methods have limitations when dealing with more complex forest environments, since relying solely on color information cannot yield reliable results.
With the further development of technology, researchers have begun to apply deep learning [26,27] to more complex forest fire detection tasks. Early popular detection models include VGG [28], AlexNet, and DenseNet. However, these models are gradually being replaced due to issues such as overfitting caused by excessive parameters or low accuracy [29]. These shortcomings are mitigated or resolved by the continuous advancement of technology and the ongoing improvement of algorithms such as YOLO [30] and Faster R-CNN [31], which are used to extract crucial features of flames and smoke and to pinpoint the location of the fire. Moreover, some researchers have proposed improved models, such as a target detection model with deconvolution and dilated convolution [32], fire detection combining RCNN and ResNet [33], a small target detection (STD) layer added to YOLOv5 to enhance the detection accuracy of small targets [34], an enhanced LSTM network combined with fire sensors to detect fires within buildings [35], and an improved CNN model tested across multiple datasets, ultimately enabling the model to adapt to various harsh environments [36].
Although current fire detection algorithms have made significant progress, most models still rely on increasing the number of parameters and the computational complexity to improve accuracy, which greatly reduces detection speed. Moreover, the larger computational complexity and parameter counts make it difficult to deploy such models on existing detection devices [37,38]. Therefore, a lightweight network is urgently needed to address this issue. Lightweight detection models still face challenges in forest fire detection. The morphology and characteristics of forest fires are diverse (e.g., smoke and flames), which requires the model to have strong generalization capability. The forest environment is also complex and variable; factors such as lighting and climate may affect the detection results of the model [39]. In response to these challenges, researchers have made some beneficial attempts. Lightweight networks such as ShuffleNet [40,41], MobileNet [42,43,44], and YOLOv3-tiny [45] were developed by simplifying network structures. Additionally, some researchers have proposed improvements to these networks. Li et al. [46] enhanced R-ShuffleNetV2 by reconstructing ShuffleNetV2 units using residual ideas. Yar et al. [47] enhanced the feature extraction capability of the MobileNetV3 backbone network, improving the accuracy of fire detection for targets of different sizes. Jin et al. [48] proposed an adaptive fusion algorithm based on YOLOv5; this fusion with physical detection led to a reduction in false alarm rates. Table 1 summarizes some solution approaches and methodologies for fire detection based on deep learning algorithms.
The design ideas in the above literature provide valuable insights for this paper. Traditional detection methods are often limited by their design principles and can only detect effectively under specific, predefined environmental conditions. However, for forest fire detection, which requires handling complex and variable environments, the aforementioned methods struggle to meet practical needs [49]. In contrast, deep learning algorithms, with their powerful feature extraction capabilities and adaptability to complex patterns, have demonstrated higher accuracy and robustness in complex tasks such as forest fire detection [50]. Nevertheless, in actual forest fire detection, most fire sources are small targets, and variable environmental factors add to the difficulty of detection; this paper therefore improves the overall feature extraction of small flame targets and the contextual characteristics of smoke. Additionally, the dataset of fire samples is also crucial. This paper utilizes the M4SFWD [51] dataset, which includes various lighting conditions, terrains, weather patterns, and different numbers of fire sources; its complex contextual information and diverse target information enhance the robustness of the model.
The main contributions of this study are outlined as follows:
  • An attention model, CPDCA, is proposed in this research, which leverages dilated convolutions for the extraction of multi-scale features. The model is characterized by a lightweight design and possesses an enhanced capability to extract long-distance features.
  • The introduction of Inner-IoU [52] is aimed at enhancing the convergence effect of small fire targets by incorporating auxiliary bounding boxes into the detection process.
  • A lightweight fire detection model, FireYOLO-Lite, has been designed as part of this study.
Forest fire detection is highly susceptible to adverse environmental conditions. The M4SFWD dataset, which encompasses weather factors such as mist, dusk, daylight, and nighttime, along with different terrains and fire source densities, was utilized to enhance the model’s robustness. To improve the detection speed and accuracy of wildfires, FireYOLO-Lite employs the lightweight GhostNetV2 [53] network in place of the YOLOv8 backbone for feature extraction. It also integrates the multi-scale dilated channel attention model CPDCA, which uses dilated convolutions and stripe convolutions to focus on the critical features of the fire and to enlarge the spatial receptive field, enhancing the extraction of critical features without increasing computational overhead. Furthermore, Inner-SIoU is used to improve convergence on small fire targets; the results demonstrate significant improvements in both detection accuracy and speed.
Our paper is organized as follows: Section 1 introduces the research background, covering the progression from traditional physical detection, traditional image processing, and machine learning to deep learning, and presents the content and highlights of this research. Section 2 introduces the network structures of YOLOv8 and GhostNetV2. Section 3 provides a detailed introduction to the model-building process and each component, including FireYOLO-Lite, CPDCA, the experimental data, and the loss function. Section 4 analyzes the experimental results, conducts detailed comparative and ablation experiments on the model, and discusses the model’s shortcomings and future research directions. Section 5 summarizes the experimental process and our innovations.
We have created an abbreviation table (Table 2) to facilitate readers’ comprehension.

2. Related Work

2.1. YOLOv8 Object Detection Model

The overall structure of the YOLOv8 algorithm continues the design of YOLOv5 [54], which consists of CSPDarknet [55] as the backbone network, neck section, and Decoupled-Head as the output head. However, the most significant change in YOLOv8 is the abandonment of the traditional Anchor-Based method in favor of adopting the Anchor-Free concept [56]. This transition means that the algorithm no longer needs to pre-set anchor points but directly infers the position and size of objects from feature maps, thereby surpassing Anchor-Based methods in inference speed. Anchor-Free methods avoid the hyperparameter settings associated with Anchor-Based methods, such as scale size, aspect ratio, IoU threshold, etc., thus resulting in higher computational efficiency.
Additionally, YOLOv8 adopts a Decoupled-Head design in which the output head is decoupled: target position and class information are extracted separately, learned through different branches, and finally merged (Figure 1). This strategy can effectively reduce the number of parameters and the computational complexity, thereby enhancing the model’s generalization capability and robustness. In terms of loss functions, YOLOv8 adopts Distribution Focal Loss (DFL) and CIoU loss as regression loss functions, optimizing the probabilities of the two positions closest to the given label so that the network focuses on the target position more quickly in the form of cross-entropy. Additionally, by replacing C3 (cross stage partial networks with 3 convolutions) with the C2f module, richer gradient flow information is obtained while keeping the model lightweight. In summary, compared to YOLOv5, YOLOv8 has improved inference speed and still has the potential for further optimization in detection accuracy.
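As a concrete illustration of the decoupled-head idea, the following is a minimal PyTorch sketch with separate classification and regression branches. It is a simplified illustration rather than the actual YOLOv8 head: the channel widths are arbitrary and the DFL distribution binning is omitted.

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Minimal decoupled detection head: class scores and box offsets are
    predicted by separate branches and only merged at the output stage."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.cls_branch = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_channels, num_classes, 1),   # per-cell class scores
        )
        self.reg_branch = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_channels, 4, 1),              # per-cell box offsets (anchor-free)
        )

    def forward(self, feat: torch.Tensor):
        return self.cls_branch(feat), self.reg_branch(feat)

# One 80 x 80 feature map from the neck; two classes (fire, smoke)
head = DecoupledHead(in_channels=256, num_classes=2)
cls_map, box_map = head(torch.randn(1, 256, 80, 80))
print(cls_map.shape, box_map.shape)   # torch.Size([1, 2, 80, 80]) torch.Size([1, 4, 80, 80])
```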

2.2. GhostNetV2 Lightweight Neural Network

GhostNetV2 is a lightweight neural network characterized by its small model size and low computational resource requirements. This enables it to perform well in resource-constrained environments, such as smartphones or embedded systems, without significantly impacting device memory usage and heat generation. GhostNetV2 utilizes the Ghost module as the primary feature extractor and, relative to its predecessor GhostNetV1 [57], proposes DFC attention to capture dependencies between distant pixels. It is a neural network that is friendly for deployment on mobile devices, ensuring model inference performance while reducing computational costs. The acceleration ratio and compression ratio of the model can be obtained through calculation. Let the input feature map be h × w × c and the output feature map be h′ × w′ × n. The original convolution kernel size is k × k, the kernel size of the cheap linear (depth-wise) operations is d × d, and the number of groups is s. The acceleration ratio $r_s$ of the theoretical computation and the compression ratio $r_c$ of the number of parameters are expressed by Equations (1) and (2).
$$r_s = \frac{n \cdot h' \cdot w' \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot h' \cdot w' \cdot c \cdot k \cdot k + (s-1) \cdot \frac{n}{s} \cdot h' \cdot w' \cdot d \cdot d} \approx s \tag{1}$$

$$r_c = \frac{n \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot c \cdot k \cdot k + (s-1) \cdot \frac{n}{s} \cdot d \cdot d} \approx \frac{s \cdot c}{s + c - 1} \approx s \tag{2}$$
When d and k are of similar size and s is much smaller than c, the Ghost module reduces the multiplication cost of traditional convolution by roughly a factor of s. In the Ghost module (Figure 2), the input is first processed by a regular convolution with a reduced number of kernels, which decreases the number of layers in the resulting feature map. The resulting feature map is then fed into a depth-wise convolution to obtain the final feature map. The purpose of this design is to effectively reduce the information redundancy caused by similar feature layers, thus improving the efficiency and performance of the model. It can be observed that only a portion of the features in the Ghost module interact with other pixels, which limits the ability of the Ghost module to extract spatial information.
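The Ghost module logic described above can be sketched in a few lines of PyTorch. This is a minimal illustration of the idea behind Equations (1) and (2), assuming the common GhostNet defaults (s = 2, a 1 × 1 primary convolution, 3 × 3 depth-wise cheap operations); it is not the authors’ implementation.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """A reduced 'primary' convolution produces n/s intrinsic feature maps;
    cheap depth-wise convolutions then generate the remaining 'ghost' maps,
    cutting the multiplication count by roughly a factor of s."""
    def __init__(self, c_in: int, c_out: int, s: int = 2, k: int = 1, d: int = 3):
        super().__init__()
        c_primary = c_out // s                 # intrinsic maps from the ordinary k x k conv
        c_ghost = c_out - c_primary            # maps produced by the cheap d x d operations
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_primary, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(c_primary), nn.ReLU(inplace=True),
        )
        self.cheap = nn.Sequential(            # depth-wise: groups = c_primary
            nn.Conv2d(c_primary, c_ghost, d, padding=d // 2, groups=c_primary, bias=False),
            nn.BatchNorm2d(c_ghost), nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)   # h' x w' x n output

x = torch.randn(1, 64, 80, 80)
print(GhostModule(64, 128, s=2)(x).shape)             # torch.Size([1, 128, 80, 80])
```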

3. Materials and Methods

3.1. Lightweight FireYOLO-Lite Detection Framework Is Designed

In this paper, the YOLOv8 object detection network was adopted as the base framework, and innovative optimizations were made. To enhance the model’s real-time performance and computational efficiency, the original backbone network was replaced with the lightweight GhostNetV2 network (Figure 3). The Ghost module uses inexpensive convolution operations in place of traditional convolution operations. This improvement not only significantly reduces the number of parameters and the computational complexity of the model but also preserves the model’s feature extraction capability. Depth-wise convolution is also employed in the backbone to further extract features. Although depth-wise convolution has advantages in reducing computational complexity, it may sacrifice the ability to capture global information. To address this issue, the CPDCA (channel prior dilated convolutional attention) model is proposed. This model dynamically allocates attention weights to the channel and spatial dimensions, thereby extracting spatial relationships more effectively while preserving the channel prior of the feature maps. By introducing CPDCA, the model can maintain excellent object detection performance while remaining lightweight.

3.2. Wide-Field Multi-Scale Attention Mechanism CPDCA

CPCA (channel prior convolutional attention) [58] is a channel-prioritized convolutional attention mechanism that forms deep spatial attention using multi-scale depth-separable convolution modules, dynamically allocating weights in both the channel and spatial dimensions. Common self-attention models often incur significant computational complexity. The CPDCA (channel prior dilated convolutional attention) model proposed in this paper adopts a lightweight design to reduce this computational complexity. In comparison to the SE [59] attention mechanism, CPCA adopts deep stripe convolution to construct spatial attention maps, and large convolutional kernels such as 7 × 7, 11 × 11, and 21 × 21 are utilized for prediction [58]. However, excessively large convolutional kernels may capture an excessive amount of detailed information, reducing the model’s ability to abstract overall features; they can also cause overfitting, resulting in poor performance on unseen data. Furthermore, larger convolutional kernels introduce more parameters, increasing model complexity. Therefore, inspired by Inception [60], to reduce the computational cost, as shown in Figure 4, the number of channels in the resulting feature map is reduced, and the 5 × 5 convolutional layer is replaced with two 3 × 3 convolutional layers.
CPDCA is composed of a channel prior (CP) and spatial attention (SA). Let the feature map of the network input be $D \in \mathbb{R}^{C \times H \times W}$. Firstly, the CP part enhances the feature information through max pooling and average pooling, as shown in Equation (3).
$$D^{*} = \mathrm{MLP}\big(\mathrm{MaxPool}(D) + \mathrm{AvgPool}(D)\big) \tag{3}$$
Then, the aggregated information $D^{*} \in \mathbb{R}^{C \times H \times W}$ is element-wise multiplied by the input feature $D$ to obtain the channel prior detection feature map, as shown in Equation (4).
$$CP = D^{*} \otimes D \tag{4}$$
Subsequently, the CP is inputted into the spatial attention (SA) through DilateConv, thereby enabling the capture of a larger range of contextual information without increasing the depth or complexity of the network.
$$SA = \mathrm{Conv}_{1 \times 1}\left(\sum_{i=0}^{2} \mathrm{DilateConv}_{i}\big(D^{*}\big) + \sum_{i=7,21} \mathrm{Pointwise}_{i}\big(D^{*}\big)\right) \tag{5}$$
Improvements are made to the spatial attention (SA) (Figure 4), inspired by RFB [61] and CapsNet [62]: DilateConv [63] is used instead of large convolutional kernels, and point-wise convolution combinations are employed to extract spatial features. DilateConv increases the receptive field without changing the number of parameters, but it is not friendly to small targets, which do not necessarily require such a large receptive field; hence, excessively large dilated convolutional kernels are not suitable. When using dilated convolution, a gridding effect is prone to occur, as shown in the figure below, where each non-zero element is spaced apart by a certain interval, leading to the loss of a significant amount of visual information; thus, the design of the dilation factor is crucial. Here, we adopt the design principles of Hybrid Dilated Convolution (HDC) [64] for the dilation factor, where the convolutional kernel is of size $K \times K$, $r_i$ is the dilation rate, and the dilation factors are denoted as $[M_1, M_2, M_3]$, with the condition that $M_2 \le K$; Equation (6) is as follows.
$$M_i = \max\big[M_{i+1} - 2r_i,\; M_{i+1} - 2(M_{i+1} - r_i),\; r_i\big] \tag{6}$$
As shown in Figure 5, when three layers of dilated convolution are used, the two dilation factor configurations are [2, 2, 2] and [1, 2, 3], respectively. In the configuration with [1, 2, 3], the receptive field and utilization rate are increased. The design of the dilation factor significantly impacts the receptive field.
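The effect of the dilation-factor choice can be checked numerically with Equation (6). The short sketch below computes $M_i$ for the two configurations above; the additional no-common-factor check follows the HDC paper’s design guideline (rates sharing a factor, as in [2, 2, 2], still produce gridding) and is not stated in this article.

```python
from math import gcd
from functools import reduce

def hdc_max_gap(rates):
    """Maximum distance M_i between non-zero taps for stacked K x K dilated
    convolutions with rates r_1..r_n, following Equation (6) with M_n = r_n,
    computed backwards. The HDC criterion is M_2 <= K."""
    M = [0] * len(rates)
    M[-1] = rates[-1]
    for i in range(len(rates) - 2, -1, -1):
        r = rates[i]
        M[i] = max(M[i + 1] - 2 * r, M[i + 1] - 2 * (M[i + 1] - r), r)
    return M                                   # M[0] is M_1, M[1] is M_2, ...

K = 3
for rates in ([2, 2, 2], [1, 2, 3]):
    M = hdc_max_gap(rates)
    coprime = reduce(gcd, rates) == 1          # HDC guideline: rates should share no common factor
    print(rates, "M =", M, "| M_2 <= K:", M[1] <= K, "| no common factor:", coprime)
# [2, 2, 2] -> M = [2, 2, 2]: passes the distance bound but shares factor 2 (gridding)
# [1, 2, 3] -> M = [1, 2, 3]: satisfies both conditions
```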
Comparative experiments with the original model show that the spatial receptive field of the model is effectively enhanced, making it well suited to small flame targets and objects such as smoke. At the same time, within the broader spatial receptive field, the issue of the same target being erroneously detected multiple times is avoided (Figure 6). Finally, finely detailed features are derived as output through element-wise multiplication of the outcomes of the dilated convolution module and the channel prior detection.
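A minimal PyTorch sketch of the CPDCA structure described in this subsection is given below: the channel prior of Equations (3) and (4), followed by a spatial branch built from stacked 3 × 3 depth-wise dilated convolutions with rates [1, 2, 3] and a point-wise convolution, loosely following Equation (5). The reduction ratio, branch fusion, and exact channel widths used in FireYOLO-Lite are not specified in the text and are assumptions here.

```python
import torch
import torch.nn as nn

class CPDCA(nn.Module):
    """Channel prior (pooled statistics + shared MLP) re-weights the input;
    stacked 3x3 dilated depth-wise convolutions (rates 1, 2, 3) plus a
    point-wise convolution then form the spatial attention map."""
    def __init__(self, c: int, reduction: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(              # shared MLP over the pooled channel descriptor
            nn.Conv2d(c, c // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c // reduction, c, 1),
        )
        self.dilated = nn.Sequential(          # HDC-style dilation rates [1, 2, 3]
            nn.Conv2d(c, c, 3, padding=1, dilation=1, groups=c),
            nn.Conv2d(c, c, 3, padding=2, dilation=2, groups=c),
            nn.Conv2d(c, c, 3, padding=3, dilation=3, groups=c),
        )
        self.pointwise = nn.Conv2d(c, c, 1)    # long-range point-wise mixing
        self.fuse = nn.Conv2d(c, c, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel prior: MLP(MaxPool(D) + AvgPool(D)) applied to D (Eqs. (3)-(4))
        pooled = torch.amax(x, dim=(2, 3), keepdim=True) + x.mean(dim=(2, 3), keepdim=True)
        cp = torch.sigmoid(self.mlp(pooled)) * x
        # Spatial attention from dilated and point-wise branches (Eq. (5))
        sa = self.fuse(self.dilated(cp) + self.pointwise(cp))
        return torch.sigmoid(sa) * cp

print(CPDCA(64)(torch.randn(1, 64, 40, 40)).shape)   # torch.Size([1, 64, 40, 40])
```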

3.3. Experimental Data

In order to evaluate the performance of the model, the dataset used in this experiment is the M4SFWD dataset created by Yunnan University. The distinguishing feature of this dataset lies in its diversity, covering various terrains, weather conditions, and multimodal forest scenes at different times. To address varying complexities in fire detection, the dataset comprises three scenarios: no fire, a single fire target, and multiple fire targets. This facilitates a comprehensive assessment of target detection algorithm performance on simulated multimodal forest fire data. Moreover, the dataset includes multimodal forest fire images captured at different times of day (daytime, dusk, and nighttime), realistically simulating the environmental conditions of actual forest fires, as shown in Figure 7. Furthermore, the dataset extensively covers various terrains and vegetation types, providing rich samples for training and validation. The dataset consists of 3583 manually annotated high-quality fire images for training and validation (2820 for training and 763 for validation), while the test set comprises 391 real fire images; in total there are 17,763 annotated bounding boxes, ensuring the accuracy and practicality of the evaluation.

3.4. Experimental Environment and Parameter Setting

The PyTorch deep learning framework was established on a Windows system in this study, with the torch environment being based on CUDA Toolkit V12.1 and compatible with torch version 2.1.1+cu121. The experiments were performed using a computer equipped with an Intel i5-12400F processor and an NVIDIA GeForce RTX 4060 graphics card. During the training process, we employed data augmentation techniques and a random cropping strategy, randomly selecting 4 images for cropping each time, effectively enhancing the robustness of the model. Furthermore, we set the number of epochs for model training to 200 and controlled the batch size to 16 to ensure the stability and efficiency of the training process.
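The setup above maps naturally onto the Ultralytics YOLOv8 training interface. The sketch below shows how these hyperparameters could be passed; the model and dataset YAML files ("fireyolo-lite.yaml", "m4sfwd.yaml") are placeholders for the authors’ actual configuration, and the 640-pixel input size is an assumed default rather than a stated setting.

```python
from ultralytics import YOLO

# Hypothetical model YAML describing the GhostNetV2 backbone and CPDCA module,
# and a hypothetical dataset YAML pointing at the M4SFWD splits (classes: fire, smoke).
model = YOLO("fireyolo-lite.yaml")

model.train(
    data="m4sfwd.yaml",
    epochs=200,        # training epochs, as stated above
    batch=16,          # batch size, as stated above
    imgsz=640,         # assumed input resolution (not reported in the paper)
    mosaic=1.0,        # 4-image mosaic / random-cropping augmentation
    device=0,          # single RTX 4060 GPU
)
```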

3.5. Evaluation Method

Floating-point Operations (FLOPs) are utilized as the metric for evaluating model complexity, and the scale of the model is assessed based on the number of parameters. In order to comprehensively evaluate the performance of the model, we utilize average precision (AP), mean average precision (mAP), and recall as evaluation metrics for accuracy. Additionally, we assess the inference speed of the model by Frames Per Second (FPS), which represents the efficiency of processing images by the model.
To conduct a comprehensive analysis of the detection performance of the model, we introduce several key indicators from the confusion matrix: TP (True Positive) represents the number of positive samples correctly predicted by the model; FN (False Negative) represents the number of samples incorrectly predicted as negative when they are actually positive; FP (False Positive) represents the number of samples incorrectly predicted as positive when they are actually negative; TN (True Negative) represents the number of negative samples correctly predicted by the model. Based on these indicators, we can further derive the precision (Equation (7)) and recall (Equation (8)) of the model, thus evaluating the detection performance more comprehensively.
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{7}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{8}$$
The average precision (AP) and mean average precision (mAP) are calculated with Equations (9) and (10), where N represents the number of classes.
$$AP = \int_{0}^{1} \mathrm{Precision}(R)\, dR \tag{9}$$

$$mAP = \frac{\sum_{i=1}^{N} AP_i}{N} \tag{10}$$
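For reference, the sketch below evaluates Equations (7)–(10) from confusion-matrix counts and a precision–recall curve. The all-point interpolation used for the AP integral is one common convention and is an assumption rather than the paper’s stated protocol; the input numbers are purely illustrative.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Equations (7) and (8) from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """Equation (9): area under the precision-recall curve,
    using monotone (all-point) interpolation of precision."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([1.0], precisions, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]        # enforce non-increasing precision
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))  # integrate precision over recall

# Equation (10): mAP is the mean AP over the N detected classes (here fire and smoke)
ap_fire = average_precision(np.array([0.2, 0.6, 0.9]), np.array([0.95, 0.85, 0.70]))
ap_smoke = average_precision(np.array([0.3, 0.5, 0.8]), np.array([0.90, 0.75, 0.60]))
print("precision/recall example:", precision_recall(tp=90, fp=10, fn=20))
print("mAP:", (ap_fire + ap_smoke) / 2)
```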

3.6. Loss Function

The loss function of YOLOv8 is a weighted sum, consisting of three parts: classification loss, localization loss, and confidence loss. Initially, the varifocal loss function was employed for YOLOv8’s bounding box loss, providing the advantage of accounting for the correlation between features and the probability distribution of data. However, its high sensitivity leads to reduced applicability in complex and variable environments, such as fire scenes, where factors like sunlight, reflection, and water mist affect it.
To improve the model’s performance in complex environments, a novel loss function, called Inner-IoU loss, is introduced, which is based on the calculation of auxiliary bounding boxes [52]. The characteristic of this loss function is to utilize a scale factor (ratio) to control the size of the auxiliary bounding boxes, which plays a crucial role in computing the loss. Let the center points of the anchor box and the ground truth (GT) box be $(x_c, y_c)$ and $(x_c^{gt}, y_c^{gt})$, respectively, with the anchor box and GT box denoted by $d$ and $d^{gt}$. The specific calculations are given in Equations (11)–(18).
$$d_l^{gt} = x_c^{gt} - \frac{w^{gt} \cdot r}{2}, \quad d_r^{gt} = x_c^{gt} + \frac{w^{gt} \cdot r}{2} \tag{11}$$

$$d_u^{gt} = y_c^{gt} - \frac{h^{gt} \cdot r}{2}, \quad d_b^{gt} = y_c^{gt} + \frac{h^{gt} \cdot r}{2} \tag{12}$$

$$d_l = x_c - \frac{w \cdot r}{2}, \quad d_r = x_c + \frac{w \cdot r}{2} \tag{13}$$

$$d_u = y_c - \frac{h \cdot r}{2}, \quad d_b = y_c + \frac{h \cdot r}{2} \tag{14}$$

$$inter = \left(\min\!\left(d_r^{gt}, d_r\right) - \max\!\left(d_l^{gt}, d_l\right)\right) \cdot \left(\min\!\left(d_b^{gt}, d_b\right) - \max\!\left(d_u^{gt}, d_u\right)\right) \tag{15}$$

$$union = w^{gt} \cdot h^{gt} \cdot r^2 + w \cdot h \cdot r^2 - inter \tag{16}$$

$$IoU^{inner} = \frac{inter}{union} \tag{17}$$

$$L_{Inner\text{-}SIoU} = L_{SIoU} + IoU - IoU^{inner} \tag{18}$$
By introducing auxiliary bounding boxes (Figure 8) that differ from the actual bounding boxes only in scale, we observed that during regression the change in IoU of the auxiliary bounding boxes is highly consistent with the change in IoU of the actual bounding boxes. This characteristic enables the auxiliary bounding boxes to accurately reflect the quality of the regression of the actual bounding boxes, providing an effective means of evaluation. In the figure, $d$ and $d^{gt}$ denote the anchor box and the ground truth box; based on Equations (11)–(14), the sizes of the auxiliary boxes for both are determined, shown in the figure as $h^{inner}$, $w^{inner}$, $h^{inner\text{-}gt}$, and $w^{inner\text{-}gt}$.
For small targets, which inherently tend to have low IoU values, using larger auxiliary bounding boxes effectively widens the IoU regression range, which benefits the regression of these low-IoU small targets.
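The following is a direct transcription of Equations (11)–(17) into a small PyTorch function, using (x_c, y_c, w, h) boxes. It is a sketch for illustration: the SIoU term of Equation (18) is omitted, and the example boxes are invented.

```python
import torch

def inner_iou(pred, gt, ratio: float = 1.1):
    """Inner-IoU of Equations (11)-(17): auxiliary boxes share the centres of
    the anchor and GT boxes but have width/height scaled by `ratio`. The full
    loss of Equation (18) is L_SIoU + IoU - IoU_inner (SIoU term omitted here)."""
    (xc, yc, w, h), (xg, yg, wg, hg) = pred.unbind(-1), gt.unbind(-1)

    # Auxiliary box edges, Equations (11)-(14)
    dl,  dr  = xc - w  * ratio / 2, xc + w  * ratio / 2
    du,  db  = yc - h  * ratio / 2, yc + h  * ratio / 2
    dlg, drg = xg - wg * ratio / 2, xg + wg * ratio / 2
    dug, dbg = yg - hg * ratio / 2, yg + hg * ratio / 2

    # Intersection and union of the auxiliary boxes, Equations (15)-(16)
    inter = (torch.min(drg, dr) - torch.max(dlg, dl)).clamp(min=0) * \
            (torch.min(dbg, db) - torch.max(dug, du)).clamp(min=0)
    union = wg * hg * ratio**2 + w * h * ratio**2 - inter
    return inter / union                              # Equation (17)

pred = torch.tensor([[0.52, 0.50, 0.10, 0.12]])       # predicted small-fire box (normalised)
gt   = torch.tensor([[0.50, 0.50, 0.10, 0.10]])       # ground-truth box
print(inner_iou(pred, gt, ratio=1.1))                 # ratio > 1 widens the IoU regression range
```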
In the field of forest fire prevention, early detection of fire locations is crucial for timely response and prevention of fire spread. As a result, specific attention is directed towards improving the detection performance of small targets in this paper. To achieve this, we conducted a series of fine experimental adjustments of the scale factor (ratio) of the auxiliary bounding boxes. To compare the convergence effects of introducing auxiliary bounding boxes in CIoU and SIoU, we conducted comparative experiments with ratios set to 0.7, 1.1, and 1.2, respectively (Table 3). The results show that setting the ratio to 1.1 improves the detection accuracy of flames and smoke by an average of 2%. This improvement helps accurately and rapidly detect fire locations in the early stages of fires, thus providing strong technical support for forest fire prevention efforts.

4. Experimental Design and Result Analysis

4.1. Comparative Test of Attention Mechanism

To verify the improvement of the CPDCA attention mechanism model on the extraction of flame features, we employed CA [65], SE, CBAM [66], and EMA [67] attention modules to enhance the network model and conducted comparative performance tests on the same fire dataset under the same configuration. Attention modules such as CA (channel attention), SE (Squeeze-and-Excitation), CBAM (Convolutional Block Attention Module), and EMA (exponential moving average) each have their unique characteristics in the field of deep learning. They demonstrate different advantages in enhancing model performance, emphasizing feature importance, and handling complex input data. The CA attention module primarily focuses on channel dimensions, adjusting channel information in feature maps by learning the importance of each channel to extract more useful features. The SE attention mechanism is specifically designed to enhance inter-channel relationships. It operates through compression and excitation steps, initially reducing the dimensionality of each channel using global average pooling, then learning channel-wise weights via fully connected layers applied to each channel of the input feature map. The CBAM attention mechanism goes further, integrating both channel attention and spatial attention mechanisms. It adjusts the channel dimensions of feature maps through its channel attention module and then adjusts the spatial dimensions through its spatial attention module. This combination allows CBAM to simultaneously focus on both channel and spatial information of feature maps, thereby comprehensively enhancing model performance. The EMA attention mechanism differs in purpose and application from the above three mechanisms. It is primarily used in natural language processing and sequence data tasks, smoothing attention weights using exponential moving average methods to enhance model robustness and generalization capability. The experimental results in Table 4 show that CPDCA excels in detection accuracy, outperforming all aforementioned models, fully showcasing its advantages in multi-scale and broad-field vision. Additionally, CPDCA features a smaller model size, making it more suitable for deployment on terminal detection devices and demonstrating potential for practical applications.

4.2. Ablation Experiment

To validate the effectiveness of the lightweight GhostNetV2 network integrated with the improved CPDCA channel-first attention mechanism in extracting features of small fire and smoke targets, we designed an ablation experiment (Table 5). The results from experiments 1 and 2 show that adding the Inner-SIoU model increased the mean accuracy and recall by 1.6% and 2.3%, respectively. The results of experiments 1 and 3 show that integrating the CPDCA attention mechanism improved the model’s precision and recall by 1.7% and 2.6%, respectively. Experiments 4, 5, and 6, which do not include GhostNetV2, require more computation than experiments 1, 2, and 3 while showing little improvement in accuracy. We then added the CPDCA module on top of experiment 2; although this led to a slight decrease in FPS, the model’s accuracy improved by 2.4%. The comparison between experiment 3 and our full model shows that using Inner-SIoU increased the model’s FPS by 41.9, fully demonstrating the significant effectiveness of Inner-IoU in enhancing detection efficiency and compensating for the overhead introduced by incorporating the CPDCA module. Overall, our model improves precision and mean average precision (mAP) by 4% and 2.8%, respectively.

4.3. Validation of the Improved Algorithm’s Effectiveness

The primary goal of lightweight network design is to reduce the network’s size and increase its computational speed while ensuring adequate accuracy. To explore the performance of lightweight networks on forest fire datasets, we conducted an in-depth analysis of various popular neural networks. Specifically, we combined robust lightweight classification models such as MobileNetV3, ShuffleNetV1, ShuffleNetV2, GhostNetV2, and ResNet50 with YOLO, enabling them to perform object detection for forest fires. Among the many common lightweight networks, ShuffleNet, as a significant representative, notably improves the execution efficiency of group convolutions by introducing the Channel Shuffle operation in its V1 version, thus accelerating network processing speed. MobileNet, on the other hand, effectively reduces computational complexity by employing depth-wise separable convolutions; it also introduces a new bottleneck structure, which, while maintaining network accuracy, further enhances the network’s inference speed. YOLOv3-tiny is a lightweight version of YOLOv3, preserving the object detection concept of YOLOv3 but employing a smaller network model. Compared to YOLOv3, the network structure of YOLOv3-tiny is simpler, saving time and space, and it is suitable for resource-constrained environments.
Through experimental analysis, the model’s weight size is 4.9 M, only 0.7 M higher than MobileNetV3, yet it achieves a frame rate of 285.3 FPS and an accuracy 5.9% higher than YOLOv3-tiny. This enables it to rapidly and accurately locate the source of the fire in fire detection scenarios, providing strong support for timely fire response; the experimental data are shown in Figure 9 and Table 6.
By analyzing the data from groups (1) and (2), we can draw the following conclusions: YOLOv3-tiny and MobileNetV3 exhibit more instances of missed detections in fire detection, and their adaptability in detecting small flame targets is relatively poor. Despite ShuffleNetV2’s ability to gain richer spatial information through the Channel Shuffle mechanism, its performance in detecting small targets is still inferior to that of our model, as shown by groups (1), (2), and (4). Furthermore, comparison with group (3) shows that CPDCA enhances the model’s spatial perception and integrates multi-scale feature information, effectively reducing redundant detections and missed detections.
We conducted comparative experiments on the DFS datasets [68,69] to verify that the model has a wide receptive field and strong robustness. We also compared it with YOLOv3-tiny (a), MobileNetV3 (b), ShuffleNetV2 (c), and FireYOLO-Lite (d) (see Figure 10). By analyzing experiments (1), (2), and (3), we found that our model, with its larger receptive field and integration of multi-scale feature information, can accurately detect the full scope of fires and smoke, effectively reducing duplicate detections. At the same time, the larger receptive field also helps in reducing missed detections. Experiment (4) further validated that our model has a certain degree of robustness, capable of adapting to weather factors such as sunlight reflection.
To validate the generalization ability of the model, we compared it with YOLOv9 and YOLOv10, and the results are shown in Table 7. Considering that YOLOv9 has a large number of parameters, it is not suitable for lightweight fire detection tasks. The experimental results show that our model outperforms both in terms of detection accuracy and model size. Although YOLOv10, as the latest version, has made breakthroughs in performance and efficiency, it may still require further validation and optimization in practical applications. In contrast, YOLOv8 has demonstrated good robustness across different application scenarios.

4.4. Discussion

The current fire detection methods for urban buildings mainly rely on traditional sensors, including temperature detection, gas detection, and light radiation detection (infrared and ultraviolet sensors). These traditional sensor-based detection methods have high sensitivity and low environmental requirements, making them suitable for office and home settings. The popular international technology for wildfire monitoring is satellite remote sensing technology. Its basic principle is to use the electromagnetic radiation characteristics released during the combustion of materials to identify and monitor the location of the fire source [39]. Satellite remote sensing technology has the advantages of wide detection range, high resolution, and timely response to dynamic changes. However, this technology also has limitations, as clouds and thick smoke produced by burning can obscure wildfire areas, affecting the quality of satellite images and thus reducing the accuracy of fire monitoring. With the development of deep learning, it has shown significant advantages in video processing, making our research highly practical [70]. Considering factors such as resource limitations and high costs of deploying models on hardware detection devices, we focus on lightweight improvements, utilizing multi-scale feature information to enhance the model’s ability to cope with harsh environments, and introducing auxiliary bounding boxes to improve the detection accuracy of small targets. As shown in Table 1, our research methods align with early studies, all dedicated to reducing the impact of environmental factors on detection accuracy by enhancing model robustness. Moreover, thanks to the model’s wide field of view, we have achieved innovative breakthroughs in reducing redundant detections (Figure 6). Early research models were usually quite complex. To improve detection accuracy, they often increased the number of detection heads or integrated more feature information, which generally increased the computational load, thereby limiting their application in resource-constrained environments. Our feature fusion method adopts a lightweight design approach, using dilated convolutions and long-distance point-wise convolutions for feature fusion. This not only ensures a lightweight model but also improves detection accuracy. Therefore, our research conclusions demonstrate significant advantages in terms of model lightness and efficiency, multi-scale feature extraction capabilities, small object detection accuracy, and adaptability to harsh environments.
The results in Table 6 indicate that FireYOLO-Lite demonstrates excellent performance on the M4SFWD dataset, with mAP and FPS metrics reaching 85.3% and 285 fps. Compared to other algorithms [30,71], our improvements show significant enhancement. This improvement is mainly due to the incorporation of the CPDCA attention mechanism and Inner-IoU, which allows the model to integrate feature maps with different scales and wide fields of view, significantly enhancing detection accuracy. Additionally, replacing the backbone network with GhostNetV2 reduced the model size and increased the detection speed, resulting in an excellent fps performance. The dilated convolution and long-distance convolution design of CPDCA provided the model with a richer spatial receptive field. This demonstrates the effectiveness of feature maps at different scales and a broader spatial receptive field in enhancing the model’s detection capabilities (Figure 6).
Our model also has some shortcomings, particularly in dealing with unclear color features and complex surrounding environments, where the detection accuracy of smoke is relatively low. As shown in Figure 11, the model has achieved high accuracy in detecting small target flames, and the situation of repeated detection has significantly improved, but the detection of smoke is easily affected by clouds, fog, or changes in surrounding light. According to the data analysis in Table 4, the detection accuracy of smoke consistently remains lower than that of flames, which to some extent affects the overall detection performance. In future research, we will focus on improving the detection accuracy of smoke, striving to reduce the impact of the surrounding environment on smoke target detection.
When deploying a deep learning-based fire detection model, one can choose between a cloud computing platform or local terminal devices. Cloud computing platforms offer flexible computing resources and convenient management services, suitable for scenarios requiring rapid deployment and scalability. The model can be deployed on cloud servers, supported by cloud service providers such as AWS and Azure. In contrast, deployment on local terminal devices is more suitable for scenarios with high real-time requirements or large data volumes. The model is deployed on local servers and directly connected to terminal devices (cameras, drones, etc.) to achieve faster response and higher data processing capabilities. In this approach, terminal devices collect image information and upload it to the local server, where the model performs inference and obtains detection results. Both methods have their advantages, and the choice should be based on the specific application scenario and requirements. The workflow of our model is divided into backbone, neck, and head. First, images transmitted from the terminal device are input into the GhostNetV2 backbone network for feature extraction. The extracted features then pass through the attention mechanism and the neck section. The neck section of this model follows the design of YOLOv8, where the neck is responsible for integrating feature maps of different levels to enhance detection performance. Finally, the integrated feature maps are sent to the head section, which predicts the target classes and bounding box locations, ultimately outputting the detection results.
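To illustrate the local-server deployment path described above, the sketch below reads frames from a terminal device and runs inference with the Ultralytics predict API. The stream URL, weight file, and confidence threshold are placeholders, not values from this work.

```python
import cv2
from ultralytics import YOLO

model = YOLO("fireyolo-lite.pt")                         # hypothetical trained weights on the local server
cap = cv2.VideoCapture("rtsp://camera-or-drone/stream")  # placeholder terminal-device stream

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # Backbone -> neck -> head inference on one frame
    results = model.predict(frame, conf=0.25, verbose=False)
    for box in results[0].boxes:
        label = model.names[int(box.cls)]                # 'fire' or 'smoke'
        if label == "fire":
            print("fire detected, bbox:", box.xyxy.tolist())   # hand off to an alerting system

cap.release()
```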
Future research can focus on different aspects. Firstly, more sensors (such as temperature, gas, or infrared images) can be added to establish a multimodal detection model to improve detection accuracy. This multimodal data fusion technology will help build a more comprehensive fire detection model [72]. In terms of the model, the model size can be optimized. Quantization and pruning techniques can be applied to efficiently optimize the network model, enhancing its performance and reliability in practical applications, thus providing important safeguards for the environment and social security.

5. Conclusions

A thorough exploration of the numerous challenges encountered in forest fire detection is conducted in this paper, and a series of practical and effective improvement measures are proposed in response to these challenges.
Firstly, we have made significant contributions to making the fire detection network lightweight. In forest fire detection scenarios, the complex geographical environment often necessitates the use of devices such as cameras or drones with relatively low computational power. These devices have limited computing capabilities, thus requiring the fire detection algorithms to be efficient and lightweight. To meet this requirement, we adopted GhostNetV2 as the backbone neural network, replacing the original network architecture. GhostNetV2, with its unique design, maintains good feature extraction capabilities while using fewer computational resources. This improvement not only significantly reduces the model’s size but also greatly increases the inference speed, making fire detection faster and more accurate.
In addition to making the fire detection network lightweight, we optimized the model’s spatial view through the improved CPDCA channel attention mechanism. Traditional fire detection algorithms have always faced challenges in detecting small targets, as flames are frequently small initially and can be obscured by complex backgrounds, resulting in missed or false detections. By introducing the CPDCA channel attention mechanism, the model can focus more on crucial features such as flames, thereby enhancing the detection speed for small targets. This improvement achieved significant results in experiments, greatly increasing the detection accuracy for small fire targets. To further enhance model performance, we also adopted pruning and quantization operations from YOLOv8. Model pruning effectively removes redundant parameters and connections, reducing model complexity and improving inference speed. Meanwhile, quantization converts model parameters from floating-point to lower-precision representations, further decreasing storage and computational requirements. Through these optimization measures, our model can quickly locate the source of a fire in a short time, providing strong support for timely response to fires.
By incorporating the M4SFWD dataset, our model was trained in more realistic and complex fire scenarios, thereby enhancing its accuracy and robustness in actual fire detection. During the training process, we fully utilized the annotated information in the dataset to guide the model in better learning the characteristics of flames and smoke. This enables our model to quickly identify flames and smoke in actual fire scenarios, promptly issuing alerts. Our improvements not only enhanced the accuracy of fire detection but also made a qualitative leap in speed. This is of significant importance for forest fire prevention, as every second is crucial during the initial stages of a fire. Through our improvements, fire prevention departments can quickly respond as soon as a fire occurs, effectively controlling the spread of the fire and thus minimizing the losses caused by the fire.
Overall, this paper has made a series of targeted innovations to address the challenges of forest fire detection, including making the network lightweight and enhancing the channel attention mechanism. Significant results have been achieved through these improvements in experiments, offering new technical means for preventing forest fires. We believe that with continuous technological progress and innovation, we will be able to more effectively address the global challenge of forest fires.

Author Contributions

Project administration, S.S.; methodology, Z.L. and S.S.; formal analysis, W.X. and J.S.; investigation, Y.W.; writing—original draft preparation, S.S. and Z.L.; conceptualization, Z.L.; writing—review and editing, W.X., S.S. and Z.L.; resources, S.S.; data curation, Z.L.; visualization, Z.L.; supervision, W.X. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Scientific Research Program of Beijing Municipal Commission of Education, Natural Science Foundation of Beijing, grant number KZ202210017024.

Data Availability Statement

Data available in a publicly accessible repository. Details of the image datasets for forest fires can be found in reference [51]. The code is available at https://github.com/ZhengYin-Liang/forests-fire-detection.git (accessed on 2 July 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. CTIF World Fire Statistics Center. World Fire Statistics. Available online: https://ctif.org/world-fire-statistics (accessed on 30 April 2024).
  2. Department of Agriculture, Water and the Environment, Canberra, Australia; National Indicative Aggregated Fire Extent Dataset. Available online: https://www.agriculture.gov.au/abares/forestsaustralia/forest-data-maps-and-tools/data-by-topic/fire#area-of-native-forest-in-fire-area-by-forest-tenure-and-jurisdiction (accessed on 31 March 2024).
  3. Chaturvedi, S.; Khanna, P.; Ojha, A. A survey on vision-based outdoor smoke detection techniques for environmental safety. ISPRS J. Photogramm. Remote Sens. 2022, 185, 158–187. [Google Scholar] [CrossRef]
  4. Anđelić, N.; Baressi Šegota, S.; Lorencin, I.; Car, Z. The Development of Symbolic Expressions for Fire Detection with Symbolic Classifier Using Sensor Fusion Data. Sensors 2023, 23, 169. [Google Scholar] [CrossRef] [PubMed]
  5. Fonollosa, J.; Solorzano, A.; Marco, S. Chemical sensor systems and associated algorithms for fire detection: A review. Sensors 2018, 18, 553. [Google Scholar] [CrossRef] [PubMed]
  6. Pohle, R.; Pohl, T.; Pannek, C.; Tarantik, K.; Bauersfeld, M.-L.; Wöllenstein, J.; Raible, S.; Seiler, F. Evaluation of a Colorimetric Sensor System for Early Fire Detection. Proceedings 2018, 2, 966. [Google Scholar] [CrossRef]
  7. Du, X.; Cao, D.; Mishra, D.; Bernardes, S.; Jordan, T.R.; Madden, M. Self-Adaptive Gradient-Based Thresholding Method for Coal Fire Detection Using ASTER Thermal Infrared Data, Part I: Methodology and Decadal Change Detection. Remote Sens. 2015, 7, 6576–6610. [Google Scholar] [CrossRef]
  8. Bousack, H.; Kahl, T.; Schmitz, A.; Schmitz, H. Towards Improved Airborne Fire Detection Systems Using Beetle Inspired Infrared Detection and Fire Searching Strategies. Micromachines 2015, 6, 718–746. [Google Scholar] [CrossRef]
  9. Yu, H.; Wang, J.; Wang, Z.; Yang, J.; Huang, K.; Lu, G.; Deng, F.; Zhou, Y. A lightweight network based on local–global feature fusion for real-time industrial invisible gas detection with infrared thermography. Appl. Soft Comput. 2024, 152, 111138. [Google Scholar] [CrossRef]
  10. Shaharuddin, S.; Abdul Maulud, K.N.; Syed Abdul Rahman, S.A.F.; Che Ani, A.I.; Pradhan, B. The role of IoT sensor in smart building context for indoor fire hazard scenario: A systematic review of interdisciplinary articles. Internet Things 2023, 22, 100803. [Google Scholar] [CrossRef]
  11. Bustos, N.; Mashhadi, M.; Lai-Yuen, S.K.; Sarkar, S.; Das, T.K. A systematic literature review on object detection using near infrared and thermal images. Neurocomputing 2023, 560, 126804. [Google Scholar] [CrossRef]
  12. Ghali, R.; Jmal, M.; Souidene Mseddi, W.; Attia, R. Recent advances in fire detection and monitoring systems: A review. In Proceedings of the 8th International Conference on Sciences of Electronics, Technologies of Information and Telecommunications (SETIT’18), Genoa, Italy, 18–20 December 2020; Volume 1, pp. 332–340. [Google Scholar] [CrossRef]
  13. Bordbar, H.; Alinejad, F.; Conley, K.; Ala-Nissila, T.; Hostikka, S. Flame detection by heat from the infrared spectrum: Optimization and sensitivity analysis. Fire Saf. J. 2022, 133, 103673. [Google Scholar] [CrossRef]
  14. Courbat, J.; Pascu, M.; Gutmacher, D.; Briand, D.; Wöllenstein, J.; Hoefer, U.; Severin, K.; de Rooij, N.F. A colorimetric CO sensor for fire detection. Procedia Engineering 2011, 25, 1329–1332. [Google Scholar] [CrossRef]
  15. Parent, G.; Acem, Z.; Lechêne, S.; Boulet, P. Measurement of infrared radiation emitted by the flame of a vegetation fire. Int. J. Therm. Sci. 2010, 49, 555–562. [Google Scholar] [CrossRef]
  16. Qiu, X.; Wei, Y.; Li, N.; Guo, A.; Zhang, E.; Li, C.; Peng, Y.; Wei, J.; Zang, Z. Development of an early warning fire detection system based on a laser spectroscopic carbon monoxide sensor using a 32-bit system-on-chip. Infrared Phys. Technol. 2019, 96, 44–51. [Google Scholar] [CrossRef]
  17. Chen, G.; Yuan, B.; Zhan, Y.; Dai, H.; He, S.; Chen, X. Functionalized graphene paper with the function of fuse and its flame-triggered self-cutting performance for fire-alarm sensor application. Mater. Chem. Phys. 2020, 252, 123292. [Google Scholar] [CrossRef]
  18. Huang, N.-J.; Xia, Q.-Q.; Zhang, Z.-H.; Zhao, L.; Zhang, G.-D.; Gao, J.-F.; Tang, L.-C. Simultaneous improvements in fire resistance and alarm response of GO paper via one-step 3-mercaptopropyltrimethoxysilane functionalization for efficient fire safety and prevention. Compos. Part A Appl. Sci. Manuf. 2020, 131, 105797. [Google Scholar] [CrossRef]
  19. Diaconu, B.M. Recent Advances and Emerging Directions in Fire Detection Systems Based on Machine Learning Algorithms. Fire 2023, 6, 441. [Google Scholar] [CrossRef]
  20. Barmpoutis, P.; Papaioannou, P.; Dimitropoulos, K.; Grammalidis, N. A Review on Early Forest Fire Detection Systems Using Optical Remote Sensing. Sensors 2020, 20, 6442. [Google Scholar] [CrossRef] [PubMed]
  21. Ko, B.C.; Cheong, K.-H.; Nam, J.-Y. Fire detection based on vision sensor and support vector machines. Fire Saf. J. 2009, 44, 322–329. [Google Scholar] [CrossRef]
  22. Abid, F. A Survey of Machine Learning Algorithms Based Forest Fires Prediction and Detection Systems. Fire Technol. 2021, 57, 559–590. [Google Scholar] [CrossRef]
  23. Çelik, T.; Demirel, H. Fire detection in video sequences using a generic color model. Fire Saf. J. 2009, 44, 147–158. [Google Scholar] [CrossRef]
  24. Yang, X.; Hua, Z.; Zhang, L.; Fan, X.; Zhang, F.; Ye, Q.; Fu, L. Preferred vector machine for forest fire detection. Pattern Recognit. 2023, 143, 109722. [Google Scholar] [CrossRef]
  25. Chen, S.; Ren, J.; Yan, Y.; Sun, M.; Hu, F.; Zhao, H. Multi-sourced sensing and support vector machine classification for effective detection of fire hazard in early stage. Comput. Electr. Eng. 2022, 101, 108046. [Google Scholar] [CrossRef]
  26. Bai, C.; Bai, X.; Wu, K. A Review: Remote Sensing Image Object Detection Algorithm Based on Deep Learning. Electronics 2023, 12, 4902. [Google Scholar] [CrossRef]
  27. Xue, Z.; Lin, H.; Wang, F. A Small Target Forest Fire Detection Model Based on YOLOv5 Improvement. Forests 2022, 13, 1332. [Google Scholar] [CrossRef]
  28. Krishnaveni, S.; Subramani, K.; Sharmila, L.; Sathiya, V.; Maheswari, M.; Priyaadarshan, B. Enhancing human sight perceptions to optimize machine vision: Untangling object recognition using deep learning techniques. Meas. Sens. 2023, 28, 100853. [Google Scholar] [CrossRef]
  29. Saleh, A.; Zulkifley, M.A.; Harun, H.H.; Gaudreault, F.; Davison, I.; Spraggon, M. Forest fire surveillance systems: A review of deep learning methods. Heliyon 2024, 10, e23127. [Google Scholar] [CrossRef]
  30. Yin, D.; Cheng, P.; Huang, Y. YOLO-EPF: Multi-scale smoke detection with enhanced pool former and multiple receptive fields. Digit. Signal Process. 2024, 149, 104511. [Google Scholar] [CrossRef]
  31. Zhang, L.; Wang, M.; Ding, Y.; Bu, X. MS-FRCNN: A Multi-Scale Faster RCNN Model for Small Target Forest Fire Detection. Forests 2023, 14, 616. [Google Scholar] [CrossRef]
  32. Zhan, J.; Hu, Y.; Zhou, G.; Wang, Y.; Cai, W.; Li, L. A high-precision forest fire smoke detection approach based on ARGNet. Comput. Electron. Agric. 2022, 196, 106874. [Google Scholar] [CrossRef]
  33. Huang, P.; Chen, M.; Chen, K.; Zhang, H.; Yu, L.; Liu, C. A combined real-time intelligent fire detection and forecasting approach through cameras based on computer vision method. Process Saf. Environ. Prot. 2022, 164, 629–638. [Google Scholar] [CrossRef]
  34. Wu, Z.; Xue, R.; Li, H. Real-Time Video Fire Detection via Modified YOLOv5 Network Model. Fire Technol. 2022, 58, 2377–2403. [Google Scholar] [CrossRef]
  35. Cao, X.; Wu, K.; Geng, X.; Guan, Q. Field detection of indoor fire threat situation based on LSTM-Kriging network. J. Build. Eng. 2024, 84, 108686. [Google Scholar] [CrossRef]
  36. Yar, H.; Ullah, W.; Ahmad Khan, Z.; Wook Baik, S. An Effective Attention-based CNN Model for Fire Detection in Adverse Weather Conditions. ISPRS J. Photogramm. Remote Sens. 2023, 206, 335–346. [Google Scholar] [CrossRef]
  37. Jadon, A.; Varshney, A.; Ansari, M.S. Low-Complexity High-Performance Deep Learning Model for Real-Time Low-Cost Embedded Fire Detection Systems. Procedia Comput. Sci. 2020, 171, 418–426. [Google Scholar] [CrossRef]
  38. Al-Lqubaydhi, N.; Alenezi, A.; Alanazi, T.; Senyor, A.; Alanezi, N.; Alotaibi, B.; Alotaibi, M.; Razaque, A.; Hariri, S. Deep learning for unmanned aerial vehicles detection: A review. Comput. Sci. Rev. 2024, 51, 100614. [Google Scholar] [CrossRef]
  39. Jin, C.T.; Wang, T.; Alhusaini, N.; Zhao, S.H.; Liu, H.L.; Xu, K.; Zhang, J.; Chen, T. Video Fire Detection Methods Based on Deep Learning: Datasets, Methods, and Future Directions. Fire 2023, 6, 315. [Google Scholar] [CrossRef]
  40. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar] [CrossRef]
  41. Ma, N.; Zhang, X.; Zheng, H.-T.; Sun, J. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; pp. 122–138. [Google Scholar] [CrossRef]
  42. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  43. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar] [CrossRef]
  44. Howard, A.; Sandler, M.; Chen, B.; Wang, W.; Chen, L.C.; Tan, M.; Chu, G.; Vasudevan, V.; Zhu, Y.; Pang, R.; et al. Searching for MobileNetV3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar] [CrossRef]
  45. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  46. Li, M.; Zhang, Y.; Mu, L.; Xin, J.; Xue, X.; Jiao, S.; Liu, H.; Xie, G.; Yi, Y. A Real-Time Forest Fire Recognition Method Based on R-shufflenetv2. In Proceedings of the 2022 5th International Symposium on Autonomous Systems (ISAS), Hangzhou, China, 8–10 April 2022; pp. 1–5. [Google Scholar]
  47. Yar, H.; Khan, Z.A.; Rida, I.; Ullah, W.; Kim, M.J.; Baik, S.W. An efficient deep learning architecture for effective fire detection in smart surveillance. Image Vis. Comput. 2024, 145, 104989. [Google Scholar] [CrossRef]
  48. Jin, S.; Wang, T.; Huang, H.; Zheng, X.; Li, T.; Guo, Z. A self-adaptive wildfire detection algorithm by fusing physical and deep learning schemes. Int. J. Appl. Earth Obs. Geoinf. 2024, 127, 103671. [Google Scholar] [CrossRef]
  49. Geng, X.; Su, Y.; Cao, X.; Li, H.; Liu, L. YOLOFM: An improved fire and smoke object detection algorithm based on YOLOv5n. Sci. Rep. 2024, 14, 4543. [Google Scholar] [CrossRef] [PubMed]
  50. Moghimi, A.; Welzel, M.; Celik, T.; Schlurmann, T. A Comparative Performance Analysis of Popular Deep Learning Models and Segment Anything Model (SAM) for River Water Segmentation in Close-Range Remote Sensing Imagery. IEEE Access 2024, 12, 52067–52085. [Google Scholar] [CrossRef]
  51. Wang, G.; Li, H.; Li, P.; Lang, X.; Feng, Y.; Ding, Z.; Xie, S. M4SFWD: A Multi-Faceted synthetic dataset for remote sensing forest wildfires detection. Expert Syst. Appl. 2024, 248, 123489. [Google Scholar] [CrossRef]
  52. Zhang, H.; Xu, C.; Zhang, S. Inner-IoU: More Effective Intersection over Union Loss with Auxiliary Bounding Box. arXiv 2023, arXiv:2311.02877. [Google Scholar] [CrossRef]
  53. Tang, Y.; Han, K.; Guo, J.; Xu, C.; Xu, C.; Wang, Y. GhostNetV2: Enhance Cheap Operation with Long-Range Attention. arXiv 2022, arXiv:2211.12905. [Google Scholar] [CrossRef]
  54. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; Kwon, Y.; Michael, K.; TaoXie; Fang, J.; Imyhxy; Lorna; et al. ultralytics/yolov5: v7.0-YOLOv5 SOTA Realtime Instance Segmentation (v7.0); Zenodo: Geneva, Switzerland, 2022. [Google Scholar] [CrossRef]
  55. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  56. Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the Gap Between Anchor-Based and Anchor-Free Detection via Adaptive Training Sample Selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9756–9765. [Google Scholar] [CrossRef]
  57. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More Features from Cheap Operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1577–1586. [Google Scholar] [CrossRef]
  58. Huang, H.-l.; Chen, Z.; Zou, Y.; Lu, M.; Chen, C. Channel prior convolutional attention for medical image segmentation. arXiv 2023, arXiv:2306.05196. [Google Scholar] [CrossRef] [PubMed]
  59. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
  60. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar] [CrossRef]
  61. Liu, S.; Huang, D.; Wang, Y. Receptive Field Block Net for Accurate and Fast Object Detection. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; pp. 404–419. [Google Scholar] [CrossRef]
  62. Wu, Y.; Cen, L.; Kan, S.; Xie, Y. Multi-layer capsule network with joint dynamic routing for fire recognition. Image Vis. Comput. 2023, 139, 104825. [Google Scholar] [CrossRef]
  63. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar] [CrossRef]
  64. Wang, P.; Chen, P.; Yuan, Y.; Liu, D.; Huang, Z.; Hou, X.; Cottrell, G. Understanding Convolution for Semantic Segmentation. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1451–1460. [Google Scholar] [CrossRef]
  65. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13708–13717. [Google Scholar] [CrossRef]
  66. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar] [CrossRef]
  67. Li, X.; Zhong, Z.; Wu, J.; Yang, Y.; Lin, Z.; Liu, H. Expectation-Maximization Attention Networks for Semantic Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9166–9175. [Google Scholar] [CrossRef]
  68. Wu, S.; Zhang, X.; Liu, R.; Li, B. A dataset for fire and smoke object detection. Multimed. Tools Appl. 2023, 82, 6707–6726. [Google Scholar] [CrossRef]
  69. Liu, R.; Wu, S.; Lu, X. Real-time fire detection network for intelligent surveillance systems. In Proceedings of the 2nd International Conference on Computer Vision, Image and Deep Learning, Liuzhou, China, 25–28 June 2021; p. 1191114. [Google Scholar] [CrossRef]
  70. Yang, S.; Huang, Q.; Yu, M. Advancements in remote sensing for active fire detection: A review of datasets and methods. Sci. Total Environ. 2024, 943, 173273. [Google Scholar] [CrossRef] [PubMed]
  71. Li, J.W.; Tang, H.; Li, X.D.; Dou, H.Q.; Li, R. LEF-YOLO: A lightweight method for intelligent detection of four extreme wildfires based on the YOLO framework. Int. J. Wildland Fire 2024, 33, WF23044. [Google Scholar] [CrossRef]
  72. Li, Y.; El Habib Daho, M.; Conze, P.-H.; Zeghlache, R.; Le Boité, H.; Tadayoni, R.; Cochener, B.; Lamard, M.; Quellec, G. A review of deep learning-based information fusion techniques for multimodal medical image classification. Comput. Biol. Med. 2024, 177, 108635. [Google Scholar] [CrossRef] [PubMed]
Figure 1. YOLOv8 network structure.
Figure 2. Ghost module in GhostNetV2.
Figure 3. The improved forest fire detection network. The input image size is 640 × 640, and the backbone is built from Ghost modules and depth-wise convolutions, so features are extracted through inexpensive convolution operations. The CPDCA attention module is further integrated to strengthen spatial feature extraction and to capture image features at different scales.
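As a concrete illustration of the "inexpensive convolution operations" in the Ghost-based backbone, the following is a minimal PyTorch sketch of a Ghost-style module in the spirit of GhostNet/GhostNetV2 [53,57]: an ordinary convolution produces part of the output channels, and a cheap depth-wise convolution generates the remaining "ghost" feature maps. The layer sizes and the ratio of 2 are illustrative assumptions, not the exact configuration used in FireYOLO-Lite.

import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Ghost-style block: a cheap depth-wise conv generates the extra feature maps."""
    def __init__(self, in_ch, out_ch, ratio=2, kernel_size=1, dw_size=3):
        super().__init__()
        primary_ch = out_ch // ratio               # channels from the ordinary convolution
        cheap_ch = out_ch - primary_ch             # channels from the cheap operation
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, primary_ch, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(primary_ch),
            nn.ReLU(inplace=True),
        )
        self.cheap = nn.Sequential(                # depth-wise conv: one filter per input channel
            nn.Conv2d(primary_ch, cheap_ch, dw_size, padding=dw_size // 2,
                      groups=primary_ch, bias=False),
            nn.BatchNorm2d(cheap_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        primary = self.primary(x)
        ghost = self.cheap(primary)                # "ghost" features from the cheap operation
        return torch.cat([primary, ghost], dim=1)  # concatenate to the full output width

x = torch.randn(1, 16, 64, 64)
print(GhostModule(16, 32)(x).shape)                # torch.Size([1, 32, 64, 64])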
Figure 4. The CPDCA module first aggregates channel information through average-pooling and max-pooling operations. A shared MLP (multi-layer perceptron) then generates the intermediate features for channel attention and spatial attention and produces the channel attention map. Element-wise multiplication of the input features with the channel attention map yields the channel-refined feature maps, which are fed into dilated convolution modules and long-range strip convolutions to extract spatial features. Finally, the extracted spatial features are mixed with the channel-refined features to obtain the final feature information.
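The data flow described in the caption of Figure 4 can be summarized in a schematic PyTorch sketch. This is not the authors' implementation; the reduction ratio, the strip kernel length, and the way the spatial branch is blended back into the channel-refined features are assumptions made only to make the structure concrete.

import torch
import torch.nn as nn

class CPDCASketch(nn.Module):
    """Schematic sketch of the CPDCA idea in Figure 4 (channel attention followed by
    a dilated-convolution plus strip-convolution spatial branch)."""
    def __init__(self, ch, reduction=4, strip_k=11):
        super().__init__()
        hidden = max(ch // reduction, 1)
        self.mlp = nn.Sequential(                   # shared MLP for avg- and max-pooled vectors
            nn.Conv2d(ch, hidden, 1), nn.ReLU(inplace=True), nn.Conv2d(hidden, ch, 1))
        self.dilated = nn.Sequential(               # HDC-style stack with dilation rates 1, 2, 3
            nn.Conv2d(ch, ch, 3, padding=1, dilation=1, groups=ch),
            nn.Conv2d(ch, ch, 3, padding=2, dilation=2, groups=ch),
            nn.Conv2d(ch, ch, 3, padding=3, dilation=3, groups=ch))
        self.strip_h = nn.Conv2d(ch, ch, (1, strip_k), padding=(0, strip_k // 2), groups=ch)
        self.strip_v = nn.Conv2d(ch, ch, (strip_k, 1), padding=(strip_k // 2, 0), groups=ch)
        self.mix = nn.Conv2d(ch, ch, 1)

    def forward(self, x):
        avg = torch.mean(x, dim=(2, 3), keepdim=True)      # channel descriptors
        mx = torch.amax(x, dim=(2, 3), keepdim=True)
        ca = torch.sigmoid(self.mlp(avg) + self.mlp(mx))    # channel attention map
        xc = x * ca                                         # channel-refined features
        sa = self.dilated(xc) + self.strip_h(xc) + self.strip_v(xc)
        return self.mix(sa) * xc                            # blend spatial cues back in

print(CPDCASketch(64)(torch.randn(1, 64, 80, 80)).shape)    # torch.Size([1, 64, 80, 80])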
Figure 5. Red is the darkest color and white is the lightest; the number in each pixel indicates its utilization rate, with higher values meaning higher utilization. With dilation factors of [2, 2, 2], the spatial receptive field is limited and more dispersed (a); with the hybrid dilated convolution (HDC) factors of [1, 2, 3], the spatial receptive field in (b) expands significantly.
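The utilization pattern in Figure 5 can be reproduced with a few lines of NumPy/SciPy by stacking the footprints of dilated 3 × 3 kernels and counting how often each position of the receptive field is touched. This is a self-contained illustration, not code from the paper.

import numpy as np
from scipy.signal import convolve2d

def dilated_footprint(d, k=3):
    """2-D footprint of a k x k convolution with dilation d (taps at multiples of d)."""
    size = d * (k - 1) + 1
    f = np.zeros((size, size))
    f[::d, ::d] = 1
    return f

def utilization(dilations):
    """Pixel utilization map of stacked 3x3 dilated convolutions (cf. Figure 5)."""
    rf = np.ones((1, 1))
    for d in dilations:
        rf = convolve2d(rf, dilated_footprint(d), mode="full")
    return rf

for rates in ([2, 2, 2], [1, 2, 3]):
    u = utilization(rates)
    covered = np.count_nonzero(u) / u.size
    print(rates, "receptive field", u.shape, "coverage %.0f%%" % (100 * covered))
# [2, 2, 2] leaves gridding holes (only about 29% of the 13x13 field is used),
# while the HDC schedule [1, 2, 3] covers every pixel of the same 13x13 field.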
Figure 6. Experimental results using the initial model: (a) results with a large spatial receptive field obtained with DilateConv and (b) results with the point-wise convolution ensemble.
Figure 7. Examples from the M4SFWD dataset showing, from top to bottom, different weather conditions (sunny, misty, and dusk), different vegetation, and different targets in the same scene.
Figure 8. The anchor boxes and target boxes of Inner-IoU.
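A minimal sketch of the Inner-IoU overlap term, following the definition in [52], is given below: the auxiliary boxes share the original box centers but have their width and height scaled by a ratio factor, and the IoU is computed on these auxiliary boxes (the ratio values 0.7, 1.1, and 1.2 in Table 3 correspond to this scale factor). The box format and example values are illustrative.

import torch

def inner_iou(pred, target, ratio=1.1, eps=1e-7):
    """Inner-IoU sketch [52]: IoU computed on auxiliary boxes whose width and height
    are scaled by `ratio` around the original centers. Boxes are (cx, cy, w, h);
    ratio < 1 shrinks the auxiliary boxes, ratio > 1 enlarges them."""
    def corners(box):
        cx, cy, w, h = box.unbind(-1)
        w, h = w * ratio, h * ratio                          # scaled auxiliary box
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, w * h
    px1, py1, px2, py2, p_area = corners(pred)
    tx1, ty1, tx2, ty2, t_area = corners(target)
    iw = (torch.min(px2, tx2) - torch.max(px1, tx1)).clamp(min=0)
    ih = (torch.min(py2, ty2) - torch.max(py1, ty1)).clamp(min=0)
    inter = iw * ih
    return inter / (p_area + t_area - inter + eps)

p = torch.tensor([[50.0, 50.0, 20.0, 20.0]])
t = torch.tensor([[55.0, 52.0, 22.0, 18.0]])
print(inner_iou(p, t, ratio=1.1))  # this overlap term replaces plain IoU inside CIoU/SIoU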
Figure 9. Comparison of our model with other improved object detection models: (a) YOLOv3-tiny, (b) MobileNetV3, (c) ShuffleNetV2, and (d) our model. Each row shows one group of images; the four groups are labeled (1), (2), (3), and (4).
Figure 10. Comparative experiments on the DFS dataset with (a) YOLOv3-tiny, (b) MobileNetV3, (c) ShuffleNetV2, and (d) FireYOLO-Lite. Each row shows one group of images; the four groups are labeled (1), (2), (3), and (4).
Figure 11. Cases in which the model failed to detect or missed fire targets. As shown in (a,c,e), detection accuracy decreases significantly when the color of the smoke is affected by the surrounding environment. (b) shows missed detections of flames and smoke with indistinct features. (d,f) indicate that the similarity between smoke and fog features leads to missed smoke detections.
Table 1. Representative studies on fire detection.
Solution Approach | Methodology | Author
Features are extracted from smoke at different scales | Multimodal extraction of smoke characteristics | Yin et al. [30]
Attention mechanisms are used to reduce the impact of complex backgrounds in images | Faster RCNN | Zhang et al. [31]
Flame small-target problem | YOLOv5 | Wu et al. [34]
The ShuffleNetV2 units were reconstructed using a residual structure | ShuffleNetV2 | Li et al. [46]
Enhanced MobileNetV3 backbone with MSAM and 3D convolution | MobileNetV3 | Yar et al. [47]
Fire detection based on a fusion algorithm | Fusion of physics and deep learning solutions | Jin et al. [48]
Table 2. Table of abbreviations.
CSPDarknet (Cross Stage Partial Darknet) | A deep convolutional neural network model mainly used for image recognition and classification tasks.
DFC (Decoupled Fully Connected) attention | DFC attention has been successfully applied in GhostNetV2, significantly improving the performance and computational efficiency of the model.
C3 | Cross stage partial network with 3 convolutions.
C2f | Faster implementation of the CSP bottleneck with 2 convolutions.
CP (Channel Prior) | In this article, CP denotes the channel priority mechanism.
SA (Spatial Attention) | In this article, SA denotes the module that extracts spatial features.
DFL (Distribution Focal Loss) | DFL is an effective loss function for bounding box regression in object detection tasks (see the sketch after this table).
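As a reference for the DFL entry in Table 2, the snippet below sketches Distribution Focal Loss as commonly formulated for box regression: a cross-entropy spread over the two integer bins adjacent to the continuous regression target, weighted by the distance to each bin. The bin count of 16 follows the usual YOLOv8 head setting; the inputs are placeholders.

import torch
import torch.nn.functional as F

def distribution_focal_loss(pred_logits, target):
    """DFL sketch: pred_logits has shape (N, bins); target is a continuous value in [0, bins-1].
    The loss is cross-entropy on the two nearest integer bins, weighted by distance."""
    left = target.floor().long()                              # lower bin index
    right = (left + 1).clamp(max=pred_logits.size(1) - 1)     # upper bin index
    w_left = right.float() - target                           # closer bin gets the larger weight
    w_right = 1.0 - w_left
    loss = (F.cross_entropy(pred_logits, left, reduction="none") * w_left +
            F.cross_entropy(pred_logits, right, reduction="none") * w_right)
    return loss.mean()

logits = torch.randn(4, 16)                                   # 16 regression bins
target = torch.tensor([2.3, 7.9, 0.4, 15.0])
print(distribution_focal_loss(logits, target))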
Table 3. Comparative test of the ratio parameter.
Inner-IoU/Ratio | mAP/% (All) | mAP/% (Fire) | mAP/% (Smoke) | mAP50/% (All) | mAP50/% (Fire) | mAP50/% (Smoke)
InnerCIoU/0.7 | 84.5 | 87.2 | 81.9 | 86.9 | 87.9 | 85.9
InnerCIoU/1.1 | 84.2 | 86.7 | 81.6 | 86.8 | 88.0 | 85.7
InnerCIoU/1.2 | 84.1 | 86.4 | 81.9 | 86.1 | 87.3 | 84.9
InnerSIoU/0.7 | 83.3 | 85.6 | 80.9 | 86.2 | 87.0 | 85.4
InnerSIoU/1.2 | 85.1 | 87.9 | 82.3 | 86.8 | 88.0 | 85.7
InnerSIoU/1.1 | 85.3 | 87.6 | 82.9 | 86.9 | 88.2 | 85.7
Table 4. A comparative test of attention mechanisms.
Model Summary | mAP/% (All) | mAP/% (Fire) | mAP/% (Smoke) | mAP50/% (All) | mAP50/% (Fire) | mAP50/% (Smoke) | Weight/M
GhostNetV2 + CA | 82.9 | 85.8 | 79.9 | 86.2 | 86.1 | 86.3 | 14.3
GhostNetV2 + SE | 82.4 | 85.4 | 79.5 | 85.8 | 86.4 | 85.3 | 14.2
GhostNetV2 + CBAM | 83.8 | 87.0 | 80.8 | 86.7 | 87.8 | 85.6 | 5.6
GhostNetV2 + EMA | 83.9 | 86.6 | 81.1 | 86.9 | 88.1 | 85.7 | 5.5
GhostNetV2 + CPDCA | 85.3 | 87.6 | 82.9 | 86.9 | 88.2 | 85.7 | 4.9
Table 5. Ablation experiment.
Number | GhostNetV2 (Backbone) | Inner-SIoU | CPDCA | Precision/% | GFLOPs | FPS | mAP/% | R (Recall)/%
1 |  |  |  | 81.3 | 6.8 | 274.4 | 84.2 | 77.4
2 |  |  |  | 82.9 | 8.1 | 310.1 | 86.0 | 79.7
3 |  |  |  | 82.5 | 7.1 | 232.5 | 86.1 | 79.5
4 |  |  |  | 82.4 | 8.4 | 218.6 | 86.5 | 78.3
5 |  |  |  | 83.6 | 8.3 | 326.3 | 86.9 | 80.1
6 |  |  |  | 83.3 | 8.4 | 215.3 | 86.2 | 78.4
Ours | ✓ | ✓ | ✓ | 85.3 | 7.0 | 285.3 | 87.0 | 80.5
Table 6. Comparison test.
Model | Weight/M | mAP/% | Frame Rate/FPS
YOLOv3-tiny | 21.9 | 79.4 | 275.3
YOLOv6n | 8.7 | 82.5 | 176.4
Resnet50 | 52.5 | 82.9 | 74.7
YOLOv8n | 6.2 | 82.4 | 279.6
MobileNetV3 | 4.2 | 83.7 | 223.6
ShufflenetV1 | 6.9 | 82.4 | 182.3
ShufflenetV2 | 12.4 | 83.6 | 237.1
GhostNetV1 | 5.2 | 83.2 | 252.1
FireYOLO-Lite | 4.9 | 85.3 | 285.3
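For context on the Frame Rate/FPS column in Table 6, detector FPS figures of this kind are typically obtained by timing repeated forward passes after a warm-up phase. The snippet below is a generic timing sketch only; the model, input size, and iteration counts are placeholders rather than the benchmark protocol used in this paper.

import time
import torch

def measure_fps(model, imgsz=640, warmup=20, iters=200, device="cuda"):
    """Generic FPS timing loop: warm up, synchronize, then average over many forward passes."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, imgsz, imgsz, device=device)
    with torch.no_grad():
        for _ in range(warmup):                 # warm-up so one-time initialization is not timed
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)

# Example with a placeholder network; in practice the detector under test is passed in.
print(measure_fps(torch.nn.Conv2d(3, 16, 3, padding=1), device="cpu"))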
Table 7. Comparative experiments between our model and YOLOv9 and YOLOv10.
Model Summary | mAP/% (All) | mAP/% (Fire) | mAP/% (Smoke) | mAP50/% (All) | mAP50/% (Fire) | mAP50/% (Smoke) | Weight/M
YOLOv9 | 83.4 | 86.6 | 80.1 | 86.2 | 86.1 | 86.3 | 60.4
YOLOv10 | 80.7 | 83.0 | 78.3 | 83.5 | 85.1 | 82.0 | 8.5
Ours | 85.3 | 87.6 | 82.9 | 86.9 | 88.2 | 85.7 | 4.9