Citation 8

Journal of Safety Science and Resilience 4 (2023) 294–304
Contents lists available at ScienceDirect
Journal of Safety Science and Resilience

journal homepage: http://www.keaipublishing.com/en/journals/journal-of-safety-science-and-resilience/
Early smoke and flame detection based on transformer

Xinzhi Wang a, Mengyue Li a, Mingke Gao b,∗, Quanyi Liu c, Zhennan Li a, Luyao Kou d
a
School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
b
The 32nd Research Institute of China Electronics Technology Group Corporation, Shanghai 201808, China
c
College of Civil Aviation Safety Engineering, Civil Aviation Flight University of China, Guanghan 618307, China
d
Department of Engineering Physics, Tsinghua University, Beijing 100084, China
a r t i c l e i n f o a b s t r a c t
Keywords: Fire-detection technology plays a critical role in ensuring public safety and facilitating the development of smart
Early fire cities. Early fire detection is imperative to mitigate potential hazards and minimize associated losses. However,
Smoke and flame detection existing vision-based fire-detection methods exhibit limited generalizability and fail to adequately consider the
Fire detection
effect of fire object size on detection accuracy. To address this issue, in this study a decoder-free fully transformer-
Vision transformer
based (DFFT) detector is used to achieve early smoke and flame detection, improving the detection performance
Public safety
for fires of different sizes. This method effectively captures multi-level and multi-scale fire features with rich
semantic information while using two powerful encoders to maintain the accuracy of the single-feature map
prediction. First, data augmentation is performed to enhance the generalizability of the model. Second, the
detection-oriented transformer (DOT) backbone network is treated as a single-layer fire-feature extractor to ob-
tain fire-related features on four scales, which are then fed into an encoder-only single-layer dense prediction
module. Finally, the prediction module aggregates the multi-scale fire features into a single feature map using a
scale-aggregated encoder (SAE). The prediction module then aligns the classification and regression features using
a task-aligned encoder (TAE) to ensure the semantic interaction of the classification and regression predictions.
Experimental results on one private dataset and one public dataset demonstrate that the adopted DFFT possesses
high detection accuracy and a strong generalizability for fires of different sizes, particularly early small fires. The
DFFT achieved mean average precision (mAP) values of 87.40% and 81.12% for the two datasets, outperforming
other baseline models. It exhibits a better detection performance on flame objects than on smoke objects because
of the prominence of flame features.
1. Introduction period of forest fires that displaced or killed 3 billion animals and 33
people owing to adverse climatic conditions such as high temperatures
Fire-detection technology plays a pivotal role in providing early and drought. On December 22, 2022, a sudden fire occurred in a park-
warning of fire by recognizing smoke or flame combustion [1]. This ing shed located in Shanghai, China, resulting in extensive damage to
technology is a critical component in promoting public safety and de- non-motorized and motorized vehicles at the site. Fire accidents cause
veloping smart cities. Among different types of disasters, fire accidents panic and economic loss, threaten people’s lives, and disrupt social or-
pose a significant risk to both safety and societal progress. In recent der. Therefore, prompt and precise early detection of smoke and flames
years, the number of major fire hazards has increased sharply owing is important for fire rescue and minimizing the damage caused by fire,
to rapid economic growth worldwide. For example, China experienced thus accelerating global fire safety construction.
748,000 fires in 2021, causing 1987 fatalities, 2225 injuries, and direct Traditionally, fire detection uses four types of physical sensors to
property losses valued at approximately $6.75 billion. By comparison, detect heat, gas, smoke, and flames. Heat sensors are used to measure
the United States reported 1557,500 fires in 2021, resulting in direct the environmental heat in a building and are temperature-sensitive [2].
property losses worth $14.639 billion. Several typical fire events are Ding et al. [3] studied fire detectors in cable tunnels using temperature-
shown in Fig. 1, including the Notre Dame fire in France, forest fires in sensitive cables, light scattering, and fiber Bragg gratings. Zhang et al.
Australia, and a district parking shed fire in China. In April 2019, the [4] introduced a wireless sensor network for forest-fire detection. This
Notre Dame Cathedral, with a rich history of more than 850 years, suf- network consists of sensor nodes that collect essential fire informa-
fered a catastrophic fire that led to the collapse of its emblematic spire. tion including temperature, humidity, and atmospheric pressure. Most
From July 2019 to February 2020, Australia encountered a prolonged fire casualties are caused by toxic emissions rather than fire burns;
∗
Corresponding author.
E-mail address: michaelgar@foxmail.com (M. Gao).
https://doi.org/10.1016/j.jnlssr.2023.06.002
Received 30 March 2023; Received in revised form 16 May 2023; Accepted 5 June 2023
2666-4496/© 2023 China Science Publishing & Media Ltd. Publishing Services by Elsevier B.V. on behalf of KeAi Communications Co. Ltd. This is an open access
article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
X. Wang, M. Li, M. Gao et al. Journal of Safety Science and Resilience 4 (2023) 294–304
Fig. 1. Fire accident scenes.
therefore, gas-based fire detection provides safe protection for build- vides the input image into a series of patches and selectively processes
ing residents [5]. Qiu et al. [6] developed a fire-detection warning sys- different parts of the image based on context, achieving a classifica-
tem based on wavelength-modulated spectroscopic CO sensors that de- tion accuracy of 98.54% on a public dataset. Despite these advances,
tected fires within 24 s. Smoke is defined as “solid and liquid parti- current vision-based fire-detection methods still face challenges in ac-
cles and gasses in the air produced when material undergoes pyroly- curately detecting small objects in early fire scenes, achieving the neces-
sis or combustion.” Wang et al. [7] developed a new type of average- sary detection speed, and avoiding high false detection rates in complex
diameter smoke sensor based on dual-wavelength fire-detection tech- environments. These constraints may impede the timely detection and
nology that used the assumption that the average diameter of smoke subsequent alerting to fire-related occurrences, particularly during the
particles in a fire is less than that of interference particles in non-fire preliminary phase of a fire event when smoke and flames encompass
situations. Flames are detected by identifying the radiation produced only a modest fraction of the visual field and lack distinct features.
by the burning area [8]. Wen et al. [9] proposed a radial basis func- Several studies have been conducted to address these issues and
tion neural network fusion algorithm based on the Takagi-Sugeno fuzzy have yielded promising results. Xu et al. [17] introduced a forest-fire-
model using a three-channel infrared flame sensor designed to achieve detection technique based on integrated learning that combined YOLO
flame detection. These four fire sensor types have the potential to en- v5 and EfficientDet [18] for fire detection. EfficientNet [19] was used
hance fire safety. However, these methods have certain limitations. Heat to acquire global information and prevent false alarms. Although this
sensors cannot detect ground fires or slight temperature changes near technique achieved accurate localization and detection of fires across
fire sources. Gas sensors have limitations such as irreversibility, volatil- several scenarios, the incorporation of three distinct learners resulted
ity, and low selectivity. Smoke sensors have high false-alarm rates be- in prolonged inference times, making it difficult to achieve a harmo-
cause of the difficulty in distinguishing between smoke particles and nious tradeoff between speed and accuracy. Pincott et al. [20] evalu-
other disturbances. Similarly, flame sensors have high false alarm rates ated the effectiveness of the faster R-CNN Inception V2 and SSD Mo-
because they are sensitive to infrared, visible light, and ultraviolet radi- bileNet V2 for indoor fire-detection tasks. Although the faster R-CNN
ation from non-fire sources [10]. Traditional fire sensors are therefore model achieved acceptable accuracy, it still had a high rate of missed
vulnerable to false alarms, delayed alerts, and missed alarms, owing to detections. Meanwhile, the SSD MobileNet V2 demonstrated poor accu-
complex environmental factors. In contrast, vision-based methods have racy and also missed detections. These studies improved the accuracy
several inherent advantages. By leveraging the widespread deployment of fire detection; however, they still face challenges in detecting small-
of camera devices in public spaces and buildings, real-time fire images sized smoke or flame objects. Recently CNNs have been widely used
and critical information, such as fire size, extent, and propagation, are to solve several computer vision-related problems, yielding satisfactory
accurately captured. Visual technology is suitable for both indoor and outcomes [21]. However, convolutional operations in CNNs are inher-
outdoor environments, improving false alarm rates and detection per- ently biased towards capturing local information, limiting their ability
formance compared with conventional fire-detection methods [11]. to effectively capture long-range dependencies and encode the global
Vision-based fire-detection technology has evolved with the devel- statistics of the input image. This limitation has prompted the develop-
opment of computer vision methods. Early studies used feature analysis ment of visual transformers, which have emerged as formidable alterna-
methods to extract smoke or flame features from images, and then ap- tives to CNNs. A visual transformer decomposes an image into patch se-
plied predetermined rules for classification. For example, Yuan [12] pro- quences and uses self-attention mechanisms to measure the importance
posed a smoke detection model based on a local binary pattern vari- of each region within the image based on weighted scores [22,23]. For
ance pyramid histogram and a local binary pattern, that used texture the fire-detection task investigated in this study, the shapes and sizes of
features for smoke identification. Qureshi et al. [13] detected flames the smoke and flame targets are arbitrary. The self-attention mechanism
by analyzing the growth rate and the pyramidal Luca-Kanade optical of the transformer proves to be highly effective in perceiving fire targets
flow method in the candidate regions. However, in recent years, deep of different sizes. By leveraging the internal attention between pixels,
learning has surpassed traditional manual methods in several domains. the transformer encodes the global information of fire images and effec-
Compared with traditional manual feature extraction approaches, deep tively captures the correlations between distant pixels. Therefore, we use
learning methods have achieved superior model generalization and bet- a DFFT method [24] for early smoke and flame detection. This method
ter flame detection results. Zheng et al. [14] investigated the feasibility effectively extracts multi-level and multi-scale fire features with rich
of using deep convolutional neural networks (CNNs) such as faster R- semantic information through a detection-oriented transformer (DOT)
CNN, single-shot detector (SSD), you only look once (YOLO) v3, and Ef- backbone. By leveraging two powerful encoders, the accuracy of single-
ficientDet for forest-fire detection. The experimental results show that level feature map prediction is maintained. The DFFT has the potential
deep CNN algorithms can automatically extract complicated fire fea- to capture small fire objects in early fires by continuously enhancing
tures for fire detection in diverse scenarios. Li et al. [15] proposed a low-dimensional fire semantic information across multiple stages. The
lightweight MobileNet V3-based network with an anchor-free module main contributions of this study are as follows:
and a modified FoveaBox for fire detection, which exhibited good per-
formance and high speed. To consider the global semantic interaction in • Multi-dimensional fire features are extracted, and low-dimensional
an image, Khudayberdiev et al. [16] proposed a fire-detection approach fire semantics are enhanced through the DOT backbone, which
based on a fine-tuned Swin transformer. The Swin transformer model di- consists of DOT blocks and a semantic-augmented attention (SAA)
295
module. The DOT block captures local spatial and global semantic et al. [17] used CNN integrated learning for forest-fire detection by com-
relations at each scale, whereas the SAA exchanges semantic infor- bining YOLO v5 and EfficientDet for object detection and used Efficient-
mation between two consecutive feature maps and reinforces their Net for global information to avoid false alarms. The detection outcomes
features. were determined based on the decisions made by the three networks to
• A decoder-free structure is explored using two encoders. The scale- ensure accurate fire localization. Nguyen et al. [30] proposed a real-time
aggregated encoder (SAE) leverages the global channel-wise atten- fire-detection solution using an unmanned aerial vehicle equipped with
tion to effectively fuse information from the four layers of fire fea- an integrated visual detection and alarm system for large-area moni-
tures. The task-aligned encoder (TAE) applies group channel-wise at- toring. MobileNet was used as a feature extraction network to enhance
tention to decoupled classification and regression features, thereby accuracy and detection speed, followed by the use of the SSD model for
reducing prediction conflicts while ensuring interaction between visual real-time fire detection. Avazov et al. [31] developed an enhanced
classification and regression predictions. YOLO v4 network [32] for detecting fire areas. This method modified
• Experimental results on private and public datasets demonstrate that the network structure using automatic color enhancement, parameter
the DFFT approach is suitable for early fire-detection tasks. Specif- reduction, and other strategies. It could quickly and accurately detect
ically, it outperformed previous fire-detection models in detecting and report catastrophic fires under different conditions, such as sunny,
small smoke and flame objects. cloudy, daytime, and nighttime. Fang et al. [33] constructed an efficient
flame detector, called DANet, based on RetinaNet. The model proposed
The remainder of this paper is organized as follows. Section 2 reviews
a dynamic attention mechanism for scale and spatial perception to al-
the related studies on fire detection. Section 3 provides a detailed de-
leviate interference from smoke or other background objects. Li et al.
scription of the DFFT model used in this study. In Section 4, experiments
[15] proposed a lightweight deep neural network based on a MobileNet
are designed to demonstrate the effectiveness of the DFFT. Finally, the
V3 backbone and an anchorless module with an improved FoveaBox for
conclusions are presented in Section 5.
fire detection.
2. Related work
2.3. Transformer-based methods
Smoke and flame detection involves the location and classification
of smoke or flames. The outputs are rectangular boxes in the original Initially, researchers applied transformers to natural language pro-
image that mark smoke or flame objects. In this section, vision-based cessing tasks with favorable results. Because of their remarkable perfor-
fire-detection approaches are introduced, which are divided into three mance, transformer-based models have gained popularity for computer
categories: region extraction, regression, and transformer-based meth- vision tasks such as object detection, image classification, and video pro-
ods. cessing [35]. Shahid and Hua [23] proposed a transformer-based ap-
proach for fire recognition that leveraged vision transformers [22] to
2.1. Region extraction-based methods process a fire image as a collection of patches and captured the depen-
dencies among patches. Experimental results on two publicly available
This approach consists of two stages: generating candidate regions fire datasets showed the importance of the new structure in improving
from the image and predicting the location of fire objects based on the detection accuracy. Lin et al. [38] proposed an improved small-object
feature maps of the candidate regions. Classical algorithms used include forest fire-detection model called STPM_SAHI. The model integrated the
R-CNN, and faster R-CNN. Wang et al. [25] proposed a novel two-stage Swin transformer [39] backbone into the mask R-CNN detection frame-
flame detection method based on flame color, directional features, and work, leveraging its self-attention to capture global information and
motion features. In the first stage, the proposed method used the frame enhancing local information acquisition to obtain larger sensory fields
difference method and color thresholding to find fire-like colored mov- and contextual information. The experimental results indicate that the
ing regions in a video, followed by the use of the probability of the model achieved a mean average precision (𝑚𝐴𝑃 ) of 89.4% for forest fire
motion direction to determine the fire region. In the second stage, the objects at different scales. Yang et al. [40] designed a lightweight and
detected regions were divided into spatiotemporal blocks, and the in- efficient fire-detection network that combined a CNN and transformer to
ternal features were extracted from the blocks. A set of attributes rep- model global and local information, achieving a balance between accu-
resenting the spatial and temporal features of the fire regions was ap- racy and speed. In the backbone MobileLP, a linear highlighted attention
plied to form the fire descriptors. The support vector machine classifier mechanism was deployed to reduce the computational effort, and fea-
was trained and tested using descriptors obtained from video data con- ture fusion was achieved by combining the designed backbone network
taining fire-like objects. Lin et al. [26] used an improved faster R-CNN with the BiFPN. An entire fire-detection model equipped with the YOLO
[27] and non-maximum suppression to label the location of smoke based head was constructed.
on static spatial information. A 3D-CNN combined with dynamic spa- Current smoke and flame detection techniques can effectively locate
tiotemporal information was then used to identify forest smoke. Zhang smoke and flames. However, they exhibit limited accuracy in detecting
et al. [28] proposed an enhanced feature extraction process for a faster small-scale smoke or flame entities during the initial fire phases. To ad-
R-CNN to achieve fire detection, wherein a feature pyramid network dress this problem, this study aims to investigate the correlation among
(FPN) feature fusion network was introduced to fuse shallow and high- multi-scale features to achieve highly accurate detection of fire regions
level features. Ryu and Kwak [29] explored a method that effectively of different sizes.
detects both flames and smoke in fires. In this method, color conversion
and corner detection were used to preprocess the flame regions, and 3. Fundamentals of fire-detection model
then dark channel priors and optical flow were used to detect the smoke
region. This eliminated unnecessary background areas and allowed the In this section, a DFFT model is introduced as an efficient fire de-
selection of fire-related areas. Finally, a CNN was selected to determine tector, as shown in Fig. 2. The effectiveness of the model is evaluated
whether the pretreated region of interest was a flame or smoke. using a three-step process. First, data augmentation is used to augment
the training data of the model and enhance its generalizability. Second,
2.2. Regression-based methods a DOT backbone network is used as a multi-scale fire feature extractor
at four different scales. Finally, the encoder-only single-level dense pre-
These approaches do not require a candidate region extraction phase, diction module uses an SAE to aggregate multi-scale fire features into
and directly generate the class probability and position coordinate val- a single feature map. A TAE is then applied to align the classification
ues of fire-related objects. Classic algorithms include YOLO and SSD. Xu and regression features for separate classification and regression tasks,
296
Fig. 2. The framework of the fire detection model.
thus enhancing the semantic interaction between fire features. In the There are two types of image-flipping: horizontal and vertical. Be-
following subsections, the fire-detection task is defined. Subsequently, cause upside-down smoke or flames would not occur in reality, the hor-
three components of the model-data augmentation, a multi-scale fire izontal flipping method is deemed suitable for this study. Let (𝑥2 , 𝑦2 )
feature extractor, and encoder-only single-level dense prediction-are in- denote the post-flipping coordinates. The relationship between the orig-
troduced. inal and flipped coordinates is expressed as
{
𝑥2 = −𝑥0 + 𝑤
(2)
3.1. Task definition 𝑦2 = 𝑦0
The pixel values of all the pixel points in the flipped image are the
The goal of this study is to develop a transformer-based fire-detection same as those in the corresponding pixels of the source image. After
model that can accurately identify and localize smoke or flame objects horizontal flipping, the image size remains unchanged.
in an image and predict their corresponding class labels. In the image padding method, the edges of an image are padded with
The input to the fire-detection model is an image 𝑀, which may a specified amount of white to achieve a fixed size, which is expressed
contain multiple smoke or flame objects of different sizes. as
The output of the fire-detection model are tuples (𝑌̂ , 𝐵̂ ), where 𝑌̂ ⋯
⎡ 0 0 ⎤
refers to the predicted class labels for all fire objects detected in the ⎢ ⋮ ⋱ ⋮ ⎥
image and 𝐵̂ represents the corresponding bounding boxes. Specifi- ⎢
0 ⋯ 0
⎥
cally, 𝑌̂ = {𝑦̂1 , 𝑦̂2 , … , 𝑦̂𝑘 }, where 𝑘 is the number of fire objects de- 𝑖𝑚𝑔 = ⎢ ⎥ (3)
⎢𝑝11 ⋯ 𝑝1𝑤 ⎥
tected by the model, and 𝑦̂𝑖 ∈ {smoke, f lame}. 𝐵̂ = {𝑏̂ 1 , 𝑏̂ 2 , … , 𝑏̂ 𝑘 }, where ⎢ ⋮ ⋱ ⋮ ⎥
𝑏̂ 𝑖 = (𝑥̂ 𝑚𝑖𝑛 𝑚𝑖𝑛 ̂ 𝑚𝑎𝑥 , 𝑦̂ 𝑚𝑎𝑥 ) is the predicted bounding box coordinates for ⎢ ⎥
𝑖 , 𝑦̂ 𝑖 , 𝑥 𝑖 𝑖 ⎣𝑝ℎ1 ⋯ 𝑝ℎ𝑤 ⎦
the 𝑖-th fire object. Here, (𝑥̂ 𝑚𝑖𝑛 𝑚𝑖𝑛 ̂ 𝑚𝑎𝑥
𝑖 , 𝑦̂ 𝑖 ) and (𝑥 𝑖 , 𝑦̂𝑚𝑎𝑥
𝑖 ) indicate the coor-
dinates of the upper left and lower right corners of the bounding box, where 𝑝𝑖 is the pixel in the original image; after padding, the size of the
respectively. image changes from [ℎ, 𝑤] to [ℎ + 𝑝ℎ , 𝑤 + 𝑝𝑤 ], where 𝑝ℎ and 𝑝𝑤 are the
lengths padded in the vertical and horizontal directions, respectively.
For the image-cropping method, let (𝑥3 , 𝑦3 ) denote the coordinates
3.2. Data augmentation after cropping, which are calculated as
{
𝑥3 = 𝑥0 − 𝑤𝑙 − 𝑤𝑟
To enhance the generalizability of the model and prevent over- (4)
𝑦3 = 𝑦0 − ℎ𝑙 − ℎ𝑟
fitting, data augmentation techniques are used to preprocess the im-
ages. Geometric and pixel transformation methods are two classic ap- where 𝑤𝑙 and 𝑤𝑟 are the distances cropped from the left and right sides
proaches used for data augmentation [41]. Geometric transformation in- of the source image, respectively; similarly, ℎ𝑙 and ℎ𝑟 are the distances
volves operations such as flipping, rotation, cropping, scaling, panning, cropped from the top and bottom of the source image, respectively. The
and dithering, whereas pixel transformation includes adding Gaussian pixel values of all the pixel points of the cropped image are the same
noise, applying Gaussian blur, and adjusting the HSV contrast, bright- as those of the corresponding pixel points of the source image. After
ness, saturation, and histogram equalization. In this study, several data- cropping, the size of the image changes from [ℎ, 𝑤] to [ℎ − ℎ𝑙 − ℎ𝑟 , 𝑤 −
augmentation methods are combined, including scaling, horizontal flip- 𝑤𝑙 − 𝑤𝑟 ].
ping, padding, cropping, and normalization. The image normalization method prevents affine transformations
In the image-scaling method, the relationship between the source and accelerates the gradient descent to obtain the optimal solution. Nor-
image’s pixel point coordinates (𝑥0 , 𝑦0 ) and the scaled image coordinates malization is performed pixel-by-pixel using
(𝑥1 , 𝑦1 ) is expressed as im𝑔0
im𝑔1 = (5)
{ 255.0
𝑥1 = 𝑥0 × 𝑠𝑐𝑎𝑙𝑒𝑥
𝑦1 = 𝑦0 × 𝑠𝑐𝑎𝑙𝑒𝑦
(1) where im𝑔1 and im𝑔0 are the normalized and original pixel values of the
image, respectively. After normalization, the image dimensions remain
where 𝑠𝑐𝑎𝑙𝑒𝑥 and 𝑠𝑐𝑎𝑙𝑒𝑦 are the scaling ratios of the source image in unchanged.
the horizontal and vertical directions, respectively. After image scaling, Certain specific tasks may require additional label data transforma-
the size of the image changes from [ℎ, 𝑤] to [ℎ × 𝑠𝑐𝑎𝑙𝑒𝑥 , 𝑤 × 𝑠𝑐𝑎𝑙𝑒𝑦 ]. It is tions subsequent to the application of data augmentation methods. For
worth noting that when both 𝑠𝑐𝑎𝑙𝑒𝑥 and 𝑠𝑐𝑎𝑙𝑒𝑦 are greater than 1, some instance, when flipping is used for fire object detection, the ground truth
pixel points in the new image may not correspond to any actual pixel box must be adjusted accordingly. These operations introduce changes
points in the source image. To address this, an interpolation method is in the learned data throughout each epoch, thereby enhancing the gen-
applied to determine an approximate pixel value or to calculate a pixel eralizabilities of the model and mitigating the risk of overfitting. Fur-
value from the source image, which is then assigned to the correspond- thermore, the uniform distribution of small smoke or flame objects is
ing pixel in the new image. By randomly scaling the image, smoke or enhanced, leading to improved detection performance and resolution of
flame objects appear at different locations in images, thus mitigating the the issue of inadequate training caused by non-uniform object distribu-
sensitivity of the model to the object location. tions in the dataset.
297
Fig. 3. The three major modules in DFFT.
3.3. Multi-scale fire-feature extractor Multi-head self-attention is equivalent to the integration of multiple
types of self-attention. Many studies use eight heads as examples to ex-
In this section, a DOT backbone 𝐹 is used to extract four multi- plain this principle. The multi-head attention calculation formula is
scale fire features with strong semantics for fire detection. As shown
( )
in Fig. 3(a), 𝐹 stacks an embedding module and four DOT stages, where ℎ𝑒𝑎𝑑𝑖 = 𝐴𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛 𝑄𝑖 , 𝐾𝑖 , 𝑉𝑖 , 𝑖 = 1, 2 … 8 (9)
each stage is composed of a DOT block and an SAA module (except
for the first phase). The SAA aggregates the underlying semantic in- ( )
formation of every two consecutive DOT stages. For each fire image 𝑀𝑆𝐴(𝑄, 𝐾, 𝑉 ) = 𝐶𝑜𝑛𝑐𝑎𝑡 ℎ𝑒𝑎𝑑1 , … , ℎ𝑒𝑎𝑑8 𝑊𝑂 (10)
𝑀aug ∈ (ℝ)𝐻×𝑊 × 3 after data augmentation, the DOT backbone extracts where 𝑊𝑂 is a learnable parameter matrix.
fire features at four different scales: In DOT, local spatial attention is a multi-head self-attention mech-
( ) anism applied to process spatial information and aggregate features in
𝑓1𝑑𝑜𝑡 , 𝑓2𝑑𝑜𝑡 , 𝑓3𝑑𝑜𝑡 , 𝑓4𝑑𝑜𝑡 = 𝐹 𝑀𝑎𝑢𝑔 (6)
a local window. For an input sequence 𝑋𝑙𝑠𝑎 ∈ ℝ𝐵𝑙𝑠𝑎 ×𝐿×𝐶𝑙𝑠𝑎 , where 𝐵𝑙𝑠𝑎
where 𝐻, and 𝑊 are the height and width of the image, respectively; is the batch size, 𝐿 is the sequence length, and 𝐶𝑙𝑠𝑎 is the number of
𝐻
× 𝑊 ×𝐶
𝑓𝑖dot ∈ (ℝ) 8⋅2𝑖−1 8⋅2𝑖−1 𝑖 is the 𝑖-th fire-related feature with 𝐶𝑖 channels, feature channels. The formal definition of self-attention in local spatial
where 𝑖 ∈ {1, 2, 3, 4}. In the following section, the operation process of attention is
( )
the multi-scale fire-feature extractor is described. 𝑄𝐾 𝐓
𝐴𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛𝑙𝑠𝑎 (𝑄, 𝐾, 𝑉 ) = sof tmax √ 𝑉 (11)
𝑑𝑘
3.3.1. Embedding module
This module divides the input image into multiple patches to match where 𝑄 ∈ ℝ𝑛×𝑑𝑘 , 𝐾 ∈ ℝ𝑚×𝑑𝑘 , and 𝑉 ∈ ℝ𝑚×𝑑𝑣 , which are obtained by
the input size of the transformer-based backbone network. Specifically, multiplying 𝑋𝑙𝑠𝑎 by the weight matrices 𝑊𝑄 , 𝑊𝐾 , and 𝑊𝑉 , respec-
given an input image 𝑀aug , it is partitioned into 𝐻8 × 𝑊8 patches. These tively; 𝑑𝑣 is the dimension of 𝑉 ; and 𝑛 and 𝑚 are the number of po-
patches are then fed into a linear projection to obtain patch embedding sitions in the local window and the length of the input sequence, re-
𝑓̂0 : spectively. Global channel-wise attention is a type of attention mech-
anism used to aggregate information across the different feature chan-
( ) 𝐻 𝑊
𝑓̂0 = 𝐹embed 𝑀aug ∈ (ℝ) 8 × 8 ×𝐶1 (7) nels of a tensor. The self-attention mechanism in the global channel-
wise attention is executed in the channel direction. For an input fea-
where 𝐹embed is the embedding function. ture map 𝑋𝑔𝑐𝑎 ∈ ℝ𝐵𝑔𝑐𝑎 ×𝐻𝑔𝑐𝑎 ×𝑊𝑔𝑐𝑎 ×𝐶𝑔𝑐𝑎 , where 𝐵𝑔𝑐𝑎 is the batch size, 𝐻𝑔𝑐𝑎
and 𝑊𝑔𝑐𝑎 are the height and width of the feature map, and 𝐶𝑔𝑐𝑎 is
3.3.2. DOT block the number of channels. Let 𝑋 ′ be the reshaped feature map such that
Each DOT stage consists of a DOT block 𝐹block , which aims to effi- 𝑋 ′ ∈ ℝ𝐵𝑔𝑐𝑎 ×𝑁×𝐶𝑔𝑐𝑎 , where 𝑁 = 𝐻𝑔𝑐𝑎 × 𝑊𝑔𝑐𝑎 and let the feature vector of
′ ∈ ℝ𝐶𝑔𝑐𝑎 . Then, the formula for self-attention in global
each pixel be 𝑋𝑖,𝑗
ciently capture local spatial and global semantic relations at each scale.
The DOT block contains multiple local spatial attention layers [39] and channel-wise attention is
a global channel-wise attention block [42], as shown in the first part of
⎛ ′ ( )T ⎞
Fig. 3(a). Each attention block comprises an attention layer and a feed- ⎜ 𝑋 𝑊𝑃 𝑋 ′ ⎟ ′
forward network layer. For simplicity, this layer is not shown in each Attentio𝑛gca = softmax⎜ √ ⎟𝑋 (12)
attention block in Fig. 3(a). is the output fire feature of the DOT block at ⎜ 𝐶gca ⎟
⎝ ⎠
the 𝑖-th DOT stage. Placing a lightweight global channel-wise attention
layer behind a continuous local spatial attention layer allows the overall where 𝑊𝑃 is a learnable parameter matrix. The multi-head attention
object semantics to be deduced at each scale. operations in local spatial attention and global channel-wise attention
The most prominent feature of the transformer is the introduction are consistent with the aforementioned multi-head attention and are not
of a self-attention mechanism, which enables the model to consider all repeated here.
positions in the input sequence simultaneously, without the need for
sequential processing. The different attention mechanisms in the DFFT 3.3.3. SAA
are based on self-attention. Self-attention receives the input, which is While the DOT block successfully improves semantic information in
represented by the matrix 𝑋. First, the weight matrices 𝑊𝑄 , 𝑊𝐾 , 𝑊𝑉 are low-level fire features through global channel-wise attention, there is
used to linearly transform 𝑋 to obtain the matrices 𝑄(query), 𝐾(key), still room for further improvement to enhance the detection task. There-
and 𝑉 (value). After obtaining these matrices, the self-attention output fore, the SAA module 𝐹se−att is proposed to facilitate the exchange of
is calculated. The calculation formula is as follows, where is the number semantic information between two consecutive scale-level fire features
of columns of matrices 𝑄 and 𝑉 : and to enhance their features. The SAA consists of an up-sampling layer
( ) and global channel-wise attention. It is integrated into every two con-
𝑄𝐾 T secutive DOT blocks, as shown in Fig. 3(a). Formally, the SAA takes the
Attention(𝑄, 𝐾, 𝑉 ) = sof tmax √ 𝑉 (8)
𝑑𝑘 output of the current DOT block and previous DOT stage as inputs and
298
Fig. 4. Multi-level fire-detection methods.
returns semantically augmented fire features. The features are sent to 𝑠sae = 𝑠3 (20)
the next DOT stage and contribute to the final multi-scale fire-feature .
The output features of SAA at the 𝑖-th DOT stage are represented by 𝑓̃𝑖 . where 𝑆att is the global channel-wise attention block, and 𝑠sae is the final
aggregated fire feature. A major challenge in fire detection is effective
3.3.4. DOT stage representation of fire objects at different scales. Many detectors have
The final DOT backbone contains four DOT stages. The first stage overcome this problem by using multi-scale features and multi-level pre-
contains one DOT block and no SAA module because the input of the dictions, as shown in Fig. 4. The FPN [43] is widely adopted in region-
SAA module comes from two consecutive DOT stages. The remaining based detection and regression-based multi-level detectors, such as mask
three phases contain a patch-merging module, a DOT block, and an R-CNN [44], faster R-CNN [27], and FCOS [45]. An FPN builds a feature
SAA module. Finally, the input dimension is recovered using the down- pyramid by sequentially combining two adjacent layers in the feature
sampling layer. Thus, the 𝑖-th stage DOT block is defined as hierarchy in the backbone with top-down and lateral connections. Sim-
{ ( ) ilarly, the detection transformer DETR uses an encoder-decoder trans-
𝐹block 𝑓̂𝑖−1 if 𝑖 = 1, 2
𝑓̂𝑖 = ( ( )) (13) former framework, which is commonly used in transformer-based multi-
𝐹block down 𝑓𝑖−1 ̃ if 𝑖 = 3, 4 level detectors such as DETR and deformable DETR [46,47]. The en-
coder processes the flattened deep features from the CNN backbone.
where is the down-sampling function. The 𝑖-th stage of the SAA module The decoder takes the output of the encoder and a set of learned object
is defined as query vectors as input and predicts the category labels and bounding
{ ( ( ) )
𝐹se−att up 𝑓̂𝑖 + 𝑓̂𝑖−1 if 𝑖 = 2 boxes individually for the features within the group. In contrast, our
𝑓̃𝑖 = ( ( ) ) (14) DFFT model uses global dependency modeling based on a transformer,
̂ ̃
𝐹se−att up 𝑓𝑖 + 𝑓𝑖−1 if 𝑖 = 3, 4
which introduces large receptive fields to cover multi-scale fire objects
where up() is an up-sampling function. Ultimately, the multi-scale fea- while aggregating multi-scale fire features into a single-level semantic
ture acquired by the DOT backbone is expressed as feature using the SAE. Consequently, fire detection is completed by mak-
{ ing only one prediction based on the aggregated single-level feature 𝑠sae .
𝑓̃𝑖+1 𝑖𝑓 𝑖 = 1, 2, 3 This design of single-level dense prediction enables the DFFT to achieve
𝑓𝑖𝑑𝑜𝑡 = (15)
𝑓̂𝑖 𝑖𝑓 𝑖 = 4 superior fire-detection performance.
3.4. Encoder-only single-level dense prediction 3.4.2. TAE

Recent single-stage object detectors [48] perform object classifica-
This module enhances the efficiency of both the inference and train- tion and localization independently through two separate branches (de-
ing of transformer-based object detectors using two novel encoders. coupled heads). However, this two-branch design neglects the interac-
First, it uses an SAE to aggregate the multi-scale fire features 𝑓𝑖dot in tion between the two tasks, resulting in inconsistent predictions [49].
the DOT backbone into a feature map 𝑠sae . Second, it leverages TAE to Meanwhile, the features learnt by two tasks in a coupling head usually
simultaneously generate aligned classification feature 𝑡cls and regression conflict [50]. Consequently, a TAE is proposed to strike a better bal-
feature 𝑡reg for classification and regression tasks to obtain the category ance between learning task-interaction features and specific features by
and labeled box coordinates of fire objects (smoke or flame). stacking channel-wise attention blocks in the coupling head. As shown
in Fig. 3(c), the encoder comprises two channel-wise attention blocks.
3.4.1. SAE First, the stacked group channel-wise attention block 𝑇group divides the
This encoder is designed with three SAE blocks in series. The SAE merged features 𝑠sae into fire-classification and fire-regression features.
block is shown in Fig. 3(b). Each SAE block considers two features as Second, the global channelwise attention block 𝑇global further encodes
input and progressively aggregates these features across all SAE blocks. the regression feature 𝑡reg for the subsequent regression task. This pro-
𝐻
The scale of the final aggregated features is set to 32 ×𝑊
32
to balance de- cess can be expressed as
tection accuracy and computational cost. To achieve this objective, the ( )
𝐻 𝑡1 , 𝑡2 = 𝑇group 𝑠sae (21)
last SAE block upscales the input fire-related feature to 32 ×𝑊 32
before
aggregation. The process is described as
𝑠0 = 𝑓1dot (16) 𝑡cls = 𝑡1 (22)
( ( ) ) ( )
𝑠1 = 𝑆att down 𝑠0 + 𝑓2dot (17) 𝑡reg = 𝑇global 𝑡2 (23)
( ( ) ) 𝐻 𝑊
𝑠2 = 𝑆att down 𝑠1 + 𝑓3dot (18) where 𝑡1 , 𝑡2 ∈ (ℝ) 32 × 32 × 256 are the two sub-features that are divided;
𝐻 𝑊 𝐻 𝑊
×
× 256
𝑡cls ∈ (ℝ) 32 32 and 𝑡reg ∈ (ℝ) 32 × 32 × 512 are the final features that are
( ( ))
𝑠3 = 𝑆att 𝑠2 + up 𝑓4dot (19) ultimately used for the classification and regression tasks, respectively.
299
Fig. 5. Data augmentation on Our_flame_smoke dataset.
Based on the classification and regression features obtained, the clas- encompass a range of challenging scenarios, including the presence
sification and regression processes are represented by two different func- of colored objects associated with fires, floating clouds, occlusions,
tions, 𝐺cls and 𝐺reg , respectively, as and reflections.
( ) • Kaggle_flame_smoke: This dataset1 collected 23,730 images of mul-
𝑌̂ = 𝐺𝑐𝑙𝑠 𝑡𝑐𝑙𝑠 (24)
tiple stages of smoke and flames, which were captured in real-world
scenarios using mobile phones. These images were obtained un-
( )
𝐵̂ = 𝐺reg 𝑡reg (25) der different lighting (indoor and outdoor) and weather conditions.
The dataset also includes typical household settings such as burn-
During training, the difference between the outputs of these two ing garbage, paper and plastic, and crops on farmlands, as well as
functions and the ground truth is minimized to improve the fire- household cooking, covering a comprehensive range of fire scenes.
detection accuracy.
Because the DFFT performs single-level dense prediction on a sin- Before feeding the image data into a multi-scale fire-feature extrac-
gle fire-feature map, the predefined anchor is sparse. However, the use tor, it is necessary to apply data augmentation. The effects of the data
of Max-IoU based on a sparse anchor leads to an imbalance problem in augmentation methods on the two datasets are shown in Figs. 5 and 6,
positive anchors, which causes the detector to focus on the large ground- respectively.
truth boxes and ignore the small ground-truth boxes during training. To
overcome this problem, the uniform matching strategy proposed by the 4.2. Compared baselines
you only look one-level feature (YOLOF) is adopted, which ensures that
all ground-truth boxes are consistently matched with the same number The object detection methods used have been migrated from the gen-
of positive anchors, regardless of their size. The loss function  of the eral field to the fire-detection domain.
DFFT model consists of the focal loss for classification and the general-
ized IOU loss for regression. • Faster R-CNN [27]: Faster R-CNN integrates feature extraction, pro-
posal extraction, bounding box regression, and classification into one
 = 𝜆𝑐𝑙𝑠 𝑐𝑙𝑠 + 𝜆𝑔𝑖𝑜𝑢 𝑔𝑖𝑜𝑢 (26) network, resulting in a significant improvement in comprehensive
where 𝑐𝑙𝑠 is the focal loss of the predicted classification and true cate- performance, especially in terms of detection speed.
gory label [51]; 𝑔𝑖𝑜𝑢 is the generalized IoU loss [52]; and 𝜆𝑐𝑙𝑠 and 𝜆𝑔𝑖𝑜𝑢 • FCOS [45]: FCOS is an anchor-free one-stage object detection algo-
are the weights of each component. rithm that directly regresses the distances from each position on the
feature map to the ground-truth boxes.
4. Experiments • YOLOF [48]: YOLOF states that the success of an FPN lies in its par-
titioned solution to the target optimization problem, rather than in
This section describes the experiments designed to demonstrate multi-scale feature fusion. This method introduces a dilated encoder
the effectiveness of the adopted DFFT model for early smoke and and uniform matching strategy, enabling object detection using a
flame detection. Both a private dataset (Our_flame_smoke) and a public single-level feature map.
dataset (Kaggle_flame_smoke) were used to test the performance of the • DETR [46]: This model is a pioneering transformer applied to image
model. Six comparison methods were used: a region-extraction-based object detection. DETR is the transformation of an image sequence
method (faster R-CNN), regression-based methods (FCOS, YOLOF), and into an ensemble sequence.
transformer-based methods (DETR, deformable DETR, and AdaMixer). • Deformable DETR [47]: Because it is difficult to use high-level fea-
The experimental results reveal that the DFFT model outperforms these tures to predict small objects and the convergence process is slow,
six baselines, achieving the highest level of fire-detection performance deformable DETR introduces the variable attention module into the
on both datasets. overall framework.
• AdaMixer [53]: AdaMixer is a query-based detector where each
4.1. Datasets query adaptively samples features and scales them based on esti-
mated offsets, enabling efficient processing of relevant object re-
Our_flame_smoke and Kaggle_flame_smoke datasets were used to gions. The sampled features are decoded using an adaptive MLP-
conduct a comparative analysis of the detection effectiveness of the Mixer guided by each query.
DFFT and other methods. These datasets include diverse fire scenes and
fire sizes and encompass two distinct categories of objects in fire scenar- 4.3. Experimental setup
ios: smoke and flames.
The experiments were conducted using a computer equipped with
• Our_flame_smoke: This is a self-built dataset. Through a combination
an NVIDIA GeForce RTX 3090 GPU. Before training, the images were
of self-designed fire experiments and online collections, a total of
5900 fire images were obtained. The dataset consists of indoor fire
1
scenes, with a smaller subset featuring outdoor fires. These images https://www.kaggle.com/datasets/hhhhhhdoge/fire-smoke-dataset.
300
Fig. 6. Data augmentation on Kaggle_flame_smoke dataset.
randomly augmented to a size of 1333 × 800 pixels to improve the gen- Table 1
eralizability performance. The fire-detection model was optimized using Results of compared methods and our approach on two datasets (testing set).
the AdamW optimizer with an initial learning rate of 0.02 and a batch Our_flame_smoke Kaggle_flame_smoke
size of 4. The training, validation, and testing sets were randomly di- Methods
𝐴𝑃𝑠𝑚𝑜𝑘𝑒 𝐴𝑃𝑓 𝑙𝑎𝑚𝑒 𝑚𝐴𝑃 𝐴𝑃𝑠𝑚𝑜𝑘𝑒 𝐴𝑃𝑓 𝑙𝑎𝑚𝑒 𝑚𝐴𝑃
vided in an 8:1:1 ratio for each of the two datasets.
Faster R-CNN (2017) 79.29% 94.28% 86.79% 74.63% 84.68% 79.65%
FCOS (2019) 77.30% 94.88% 86.09% 72.26% 87.03% 79.65%
4.4. Evaluation metric YOLOF (2021) 78.38% 94.60% 86.49% 68.31% 85.05% 76.68%
DETR (2020) 75.95% 93.46% 84.70% 61.20% 83.27% 72.23%
To evaluate the performance of the DFFT in fire detection, this Deformable DETR 64.77% 92.70% 78.73% 71.10% 85.13% 78.12%
study adopted three evaluation metrics: 𝐴𝑃𝑠𝑚𝑜𝑘𝑒 , 𝐴𝑃𝑓 𝑙𝑎𝑚𝑒 , and 𝑚𝐴𝑃 . The (2021)
AdaMixer (2022) 77.78% 94.88% 86.33% 61.31% 85.46% 73.39%
higher the value of the three metrics, the better is the detection effect.
DFFT (2022) 80.50% 94.31% 87.40% 74.67% 87.56% 81.12%
Specifically, precision (P) is the ratio of the number of true positives to
the total number of positive predictions. For example, if the model de-
tected 100 fire objects and 90 were correctly identified, P is 90%. Recall
(R) is the ratio of the number of true positives to the total number of
actual (relevant) objects. For example, if the model correctly detects 75 In the Our_flame_smoke dataset, DFFT outperforms all baseline mod-
fire objects in an image and there are 100 fire objects in the image, R is els on both 𝐴𝑃𝑠𝑚𝑜𝑘𝑒 and 𝑚𝐴𝑃 metrics. Specifically, the DFFT exhibits
75%. P and R are calculated as follows: better performance than the region-extraction-based model, faster R-
CNN, by 1.21% for 𝐴𝑃𝑠𝑚𝑜𝑘𝑒 . Compared to the regression-based mod-
𝑇𝑃
𝑃 = (27) els, DFFT outperforms the best of these models, YOLOF, by 1.58% and
𝑇𝑃 + 𝐹𝑁
0.46% on 𝐴𝑃𝑠𝑚𝑜𝑘𝑒 and 𝑚𝐴𝑃 , respectively. Compared with FCOS, DFFT
𝑇𝑃 improves the 𝐴𝑃𝑠𝑚𝑜𝑘𝑒 and 𝑚𝐴𝑃 by 2.66% and 0.96%, respectively. Com-
𝑅= (28)
𝑇𝑃 + 𝐹𝑃 pared to the transformer-based methods, DETR, deformable DETR, and
AdaMixer, DFFT shows improvements of 1.03%, 8.32%, and 0.72% in
where True Positive (TP) is the number of correctly identified positive
𝑚𝐴𝑃 , respectively. Although DFFT’s performance of DFFT on 𝐴𝑃𝑓 𝑙𝑎𝑚𝑒
instances, False Negative (FN) is the number of positive instances mis-
is not the best among all the models, the difference between its perfor-
classified as negative by the model, and False Positive (FP) is the number
mance and that of the optimal model on 𝐴𝑃𝑓 𝑙𝑎𝑚𝑒 is minimal considering
of erroneously classified negative instances.
the small sample size of the testing set.
Average Precision (AP) is the average Precision across all recall val-
For the Kaggle_flame_smoke dataset, DFFT outperformed the base-
ues between 0 and 1. By interpolating across all points, the AP can be
lines for all three metrics. DFFT outperforms the region-extraction-
interpreted as the area under the precision-recall curve. 𝐴𝑃𝑠𝑚𝑜𝑘𝑒 refers
based model, faster R-CNN, by 2.88% for 𝐴𝑃𝑓 𝑙𝑎𝑚𝑒 and 1.47% for 𝑚𝐴𝑃 .
to the AP of the smoke and 𝐴𝑃𝑓 𝑙𝑎𝑚𝑒 refers to the AP of the flame. The
DFFT shows an improvement over the regression-based model, FCOS,
mAP is the average AP over multiple IoU thresholds. The IoU detection
of 2.41% in 𝐴𝑃𝑠𝑚𝑜𝑘𝑒 . DFFT exhibits improvements over the transformer-
threshold for AP in this study is set to 0.5. The equations for 𝐴𝑃𝑠𝑚𝑜𝑘𝑒 ,
based model, deformable DETR, of 3.57%, 2.43%, and 3% for the
𝐴𝑃𝑓 𝑙𝑎𝑚𝑒 , and 𝑚𝐴𝑃 are
APsmoke , APflame , and mAP, respectively.
1 Overall, our experimental results suggest that the DFFT surpasses the
( )
𝐴𝑃𝑠𝑚𝑜𝑘𝑒 = 𝑃𝑠𝑚𝑜𝑘𝑒 𝑅𝑠𝑚𝑜𝑘𝑒 d𝑅𝑠𝑚𝑜𝑘𝑒 (29) other baseline models on the two datasets, indicating its effectiveness
∫
0 and versatility for fire object detection. These models perform better
at detecting flames than smoke, because it is challenging to distinguish
1
( ) smoke-like objects from noise interference. The translucency and mo-
𝐴𝑃𝑓 𝑙𝑎𝑚𝑒 = 𝑃𝑓 𝑙𝑎𝑚𝑒 𝑅𝑓 𝑙𝑎𝑚𝑒 d𝑅𝑓 𝑙𝑎𝑚𝑒 (30) tion of smoke often lead to its blending with the background, posing
∫
0 difficulties for its accurate detection. In contrast, flames exhibit distinct
characteristics, such as color and edges, which facilitate easier detec-
1( )
𝑚𝐴𝑃 = 𝐴𝑃𝑠𝑚𝑜𝑘𝑒 + 𝐴𝑃𝑓 𝑙𝑎𝑚𝑒 (31) tion.
2
To demonstrate the superiority of the DFFT for small smoke and
where 𝑃𝑠𝑚𝑜𝑘𝑒 , and 𝑅𝑠𝑚𝑜𝑘𝑒 are the precision and recall, respectively, for flame detection, this study has classified fire objects into three distinct
the smoke object category, and 𝑃𝑓 𝑙𝑎𝑚𝑒 , and 𝑅𝑓 𝑙𝑎𝑚𝑒 are the precision and scales based on their proportion within the input image: small objects
recall, respectively, for the flame object category. (0%−1%), medium objects (1%−10%), and large objects (10%−100%).
The comparison results of 𝑚𝐴𝑃 for different methods at different
4.5. Experimental results and analysis scales are presented in Fig. 7. Here, only the results for the Kag-
gle_flame_smoke dataset are shown, because it contains more fire sce-
Experiments were conducted on two datasets for the DFFT model, narios than Our_flame_smoke dataset. This finding indicates that DFFT
and baselines were compared, with the performances shown in Table 1. offers the highest detection accuracy for small fires and achieves the
The following observations were made. best results for medium and large fire detection.
301
Fig. 7. mAP for small, medium, and large objects on Kaggle_flame_smoke.
Fig. 8. Detection effects on small-scale objects.
The DFFT has achieved remarkable detection outcomes across a capture of rich low-level fire-semantic features. The low-level fire se-
range of complex fire scenarios, which can be attributed to its pow- mantics from different stages facilitate the fire detector in classifying
erful and efficient DOT backbone, combined with two highly trained smoke- or flame-like disturbances in greater detail. The SAE summa-
encoders (SAE and TAE). The semantic-augmented attention module is rizes multi-scale cues into a single fire feature map by analyzing the
incorporated at multiple stages of the DOT backbone, which enables global spatial and semantic relations of two consecutive fire features.
Fig. 9. Comparison of multi-scale fire-detection results (faster R-CNN, FCOS, and YOLOF). The first, second, and third rows represent small, medium, and large
objects, respectively.
302
Fig. 10. Comparison of multi-scale fire-detection results (DETR, deformable DETR, AdaMixer, and DFFT).
This approach enables smoke or flame instances at different scales to be perior performance compared to conventional object detection methods
detected on a single feature map, thereby avoiding exhaustive searches in identifying fires of different sizes and scales within complex environ-
across network layers. The TAE allows the DFFT to perform both classi- ments.
fication and regression in a single coupling head, using group channel- In conclusion, the DFFT offers superior accuracy, interference resis-
wise attention to resolve learning conflicts from both tasks and provide tance, and suitability for meeting the demands of early fire-detection
consistent predictions. tasks compared with other models. These capabilities are critical for en-
abling early detection and warning of fires, thereby promoting public
safety.
4.6. Visualization analysis
5. Conclusion
In this subsection, the results of the DFFT applied to small-sized fire
objects are presented. The inference results of the DFFT and other meth-
To address the issue of missing the detection of small smoke or flame
ods are then compared on the three scales.
objects in current fire-detection methods, a DFFT is used to achieve early
To visually illustrate the effectiveness of the DFFT for early fire de-
fire detection, which improves the detection capability for fires of dif-
tection, several original images containing small-scale fires from the two
ferent sizes. First, data augmentation is performed to enhance the gen-
datasets are selected for inference. The inference results are shown in
eralization capability of the model. Second, a DOT backbone network
Fig. 8. For fire images containing only small smoke or flames, as de-
is used as a multi-scale fire-feature extractor to mine fire-related fea-
picted in Figs. 8(a)–(c), the DFFT accurately and precisely labels the de-
tures on four scales. Finally, the encoder-only single-level dense predic-
tection frame for each fire object without redundancy. In addition, the
tion module aggregates multi-scale fire features into a single fire-feature
predicted boxes closely fit the ground truth, and the DFFT correctly de-
map using an SAE. The TAE then aligns the classification and regression
termines the category of each object. For images containing both small
features for their respective classification and regression tasks. Exper-
smoke and flames, as shown in Fig. 8(d), the DFFT differentiates be-
imental results on two datasets show that the DFFT model has a high
tween smoke and flame and labels different detection boxes according
fire-detection accuracy and strong anti-interference capability. In par-
to the object category. Even if there are distracting factors, such as gray
ticular, for early fires with small smoke and flame objects, the DFFT
skies and walls that are similar to smoke in color, the DFFT adaptively
outperforms the previous models. It is well suited to early fire-detection
rejects the distracting features of such smoke-like objects and extracts
tasks with complex scenarios, thereby contributing to the protection of
significant fire features for fire detection. In the initial phase of a fire in-
society and public safety.
cident, the fire objects typically manifest as small-scale smoke or flames.
The transformer-based DFFT model offers a high accuracy in the iden-
tification and localization of small fires. Thus, the DFFT is suitable for Declaration of Competing Interest
early fire-detection tasks.
To further visualize the detection results of the DFFT and its com- The authors declare that they have no known competing financial
parative approaches, fire images of varying scale sizes (small, medium, interests or personal relationships that could have appeared to influence
and large) are processed using the trained model weights. The results the work reported in this paper.
are shown in Figs. 9 and 10. Specifically, the DFFT is the only method
that can detect flames under smoke obscuration conditions in medium- Acknowledgements
sized fires. For small- and large-scale fires, the bounding boxes gener-
ated by the DFFT fit the objects better, and no redundant boxes are This work was supported by the Open Fund Project [grant number
present. Consistent with the results shown in Fig. 7, the DFFT shows su- MZ2022KF05] of Civil Aircraft Fire Science and Safety Engineering Key
303
Laboratory of Sichuan Province, the National Science Foundation of [25] Z. Wang, D. Wei, X. Hu, Research on two stage flame detection algorithm based on
China [Grant No. 72204155], and the Natural Science Foundation of fire feature and machine learning, in: Proceedings of the International Conference
on Robotics, Intelligent Control and Artificial Intelligence, 2019, pp. 574–578.
Shanghai [grant number 23ZR1423100]. [26] G. Lin, Y. Zhang, G. Xu, Q. Zhang, Smoke detection on video sequences using 3d
convolutional neural networks, Fire Technol. 55 (2019) 1827–1847.
References [27] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: towards real-time object detection
with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. 39 (2017)
[1] K. Avazov, A.E. Hyun, A.A. Sami S, A. Khaitov, A.B. Abdusalomov, Y.I. Cho, For- 1137–1149.
est fire detection and notification method based on ai and iot approaches, Future [28] J. Zhang, S. Guo, G. Zhang, L. Tan, Fire detection model based on multi-scale feature
Internet 15 (2023) 61. fusion, J. Zhengzhou Univ. Eng. Sci. 42 (2021) 13–18.
[2] N. Luan, C. Ding, J. Yao, A refractive index and temperature sensor based on surface [29] J. Ryu, D. Kwak, A study on a complex flame and smoke detection method using
plasmon resonance in an exposed-core microstructured optical fiber, IEEE Photonics computer vision detection and convolutional neural network, Fire 5 (2022) 108.
J. 8 (2016) 1–8. [30] A. Nguyen, H. Nguyen, V. Tran, H.X. Pham, J. Pestana, A visual real-time fire detec-
[3] H. Ding, D. Fan, H. Yao, Cable tunnel fire experiments based on linear temperature tion using single shot multibox detector for uav-based fire surveillance, in: Proceed-
sensing fire detectors, Opt. Precis. Eng. 21 (2013) 2225–2230. ings of the IEEE Eighth International Conference on Communications and Electronics
[4] J. Zhang, W. Li, Z. Yin, S. Liu, X. Guo, Forest fire detection system based on wireless (ICCE), IEEE, 2021, pp. 338–343.
sensor network, in: Proceedings of the 4th IEEE conference on industrial electronics [31] K. Avazov, M. Mukhiddinov, F. Makhmudov, Y.I. Cho, Fire detection method
and applications, IEEE, 2009, pp. 520–523. in smart city environments using a deep-learning-based approach, Electronics 11
[5] J. Fonollosa, A. Solórzano, S. Marco, Chemical sensor systems and associated algo- (2021) 73 (Basel).
rithms for fire detection: a review, Sensors 18 (2018) 553. [32] A. Bochkovskiy, C. Wang, H.M. Liao, Yolov4: optimal speed and accuracy of object
[6] X. Qiu, Y. Wei, N. Li, A. Guo, E. Zhang, C. Li, Y. Peng, J. Wei, Z. Zang, Development detection, 2020. arXiv:2004,10934.
of an early warning fire detection system based on a laser spectroscopic carbon [33] L. Fang, Y. Zhang, H. Li, B. Chen, Y. Sui, Dynamic attention network for flame de-
monoxide sensor using a 32-bit system-on-chip, Infrared Phys. Technol. 96 (2019) tection, Electron. Lett. 59 (2023) e12690.
44–51. [34] S. Khan, M. Naseer, M. Hayat, S.W. Zamir, F.S. Khan, M. Shah, Transformers in
[7] S. Wang, X. Xiao, T. Deng, A. Chen, M. Zhu, A sauter mean diameter sensor for fire vision: a survey, ACM Comput. Surv. CSUR 54 (2022) 1–41.
smoke detection, Sens. Actuators B 281 (2019) 920–932. [35] J. Lin, H. Lin, F. Wang, Stpm_sahi: a small-target forest fire detection model based
[8] J. Sidey, E. Mastorakos, R. Gordon, Simulations of autoignition and laminar pre- on swin transformer and slicing aided hyper inference, Forests 13 (2022) 1603.
mixed flames in methane/air mixtures diluted with hot products, Combust. Sci. [36] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: hier-
Technol. 186 (2014) 453–465. archical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF
[9] Z. Wen, L. Xie, H. Feng, Y. Tan, Robust fusion algorithm based on rbf neural network International Conference on Computer Vision, 2021, pp. 10012–10022.
with ts fuzzy model and its application to infrared flame detection problem, Appl. [37] C. Yang, Y. Pan, Y. Cao, X. Lu, CNN-transformer hybrid architecture for early fire
Soft Comput. 76 (2019) 251–264. detection, in: Proceedings of the Artificial Neural Networks and Machine Learning-I-
[10] F. Khan, Z. Xu, J. Sun, F.M. Khan, A. Ahmed, Y. Zhao, Recent advances in sensors CANN 2022: 31st International Conference on Artificial Neural Networks, Bristol,
for fire detection, Sensors 22 (2022) 3310. UK, September 6-9, 2022, Springer, 2022, pp. 570–581. Proceedings; Part IV.
[11] X. Chen, B. Hopkins, H. Wang, L. O’Neill, F. Afghah, A. Razi, P. Fulé, J. Coen, E. Row- [38] C. Shorten, T.M. Khoshgoftaar, A survey on image data augmentation for deep learn-
ell, A. Watts, Wildland fire detection and monitoring using a drone-collected rgb/ir ing, J. Big Data 6 (2019) 1–48.
image dataset, IEEE Access 10 (2022) 121301–121317. [39] A. Ali, H. Touvron, M. Caron, P. Bojanowski, M. Douze, A. Joulin, I. Laptev,
[12] F. Yuan, Video-based smoke detection with histogram sequence of lbp and lbpv pyra- N. Neverova, G. Synnaeve, J. Verbeek, et al., XCIT: cross-covariance image trans-
mids, Fire Saf. J. 46 (2011) 132–139. formers, Adv. Neural Inf. Process Syst. 34 (2021) 20014–20027.
[13] W.S. Qureshi, M. Ekpanyapong, M.N. Dailey, S. Rinsurongkawong, A. Malenichev, [40] T.Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid
O. Krasotkina, Quickblaze: early fire detection using a combined video processing networks for object detection, in: Proceedings of the IEEE Conference on Computer
approach, Fire Technol. 52 (2016) 1293–1317. Vision and Pattern Recognition, 2017, pp. 2117–2125.
[14] X. Zheng, F. Chen, L. Lou, P. Cheng, Y. Huang, Real-time detection of full-scale forest [41] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask r-cnn, in: Proceedings of the IEEE
fire smoke based on deep convolution neural network, Remote Sens 14 (2022) 536 International Conference on Computer Vision, 2017, pp. 2961–2969.
(Basel). [42] Z. Tian, C. Shen, H. Chen, T. He, FCOS: fully convolutional one-stage object detec-
[15] Y. Li, W. Zhang, Y. Liu, Y. Jin, A visualized fire detection method based on convo- tion, in: Proceedings of the IEEE/CVF International Conference on Computer Vision,
lutional neural network beyond anchor, Appl. Intell. 52 (2022) 13280–13295. 2019, pp. 9627–9636.
[16] O. Khudayberdiev, J. Zhang, A. Elkhalil, L. Balde, Fire detection approach based on [43] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end
vision transformer, in: Proceedings of the Artificial Intelligence and Security: 8th object detection with transformers, in: Proceedings of the Computer Vision-ECCV
International Conference, ICAIS 2022, Qinghai, China, Springer, 2022, pp. 41–53. 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Springer, 2020,
July 15-20, 2022, Proceedings, Part I. pp. 213–229. Proceedings, Part I 16.
[17] R. Xu, H. Lin, K. Lu, L. Cao, Y. Liu, A forest fire detection system based on ensemble [44] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, J. Dai, Deformable DETR: deformable trans-
learning, Forests 12 (2021) 217. formers for end-to-end object detection, 9th International Conference on Learning
[18] M. Tan, R. Pang, Q.V. Le, Efficientdet: scalable and efficient object detection, in: Pro- Representations, Austria, 2021 ICLR 2021, Virtual Event.
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, [45] Q. Chen, Y. Wang, T. Yang, X. Zhang, J. Cheng, J. Sun, You only look one-level
2020, pp. 10781–10790. feature, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
[19] M. Tan, Q. Le, Efficientnet: rethinking model scaling for convolutional neural net- Recognition, 2021, pp. 13039–13048.
works, in: Proceedings of the International Conference on Machine Learning, PMLR, [46] C. Feng, Y. Zhong, Y. Gao, M.R. Scott, W. Huang, Tood: task-aligned one-stage object
2019, pp. 6105–6114. detection, in: Proceedings of the IEEE/CVF International Conference on Computer
[20] J. Pincott, P.W. Tien, S. Wei, J.K. Calautit, Indoor fire detection utilizing computer Vision (ICCV), IEEE Computer Society, 2021, pp. 3490–3499.
vision-based strategies, J. Build. Eng. 61 (2022) 105154. [47] X. Dai, Y. Chen, B. Xiao, D. Chen, M. Liu, L. Yuan, L. Zhang, Dynamic head: unifying
[21] D. Bhatt, C. Patel, H. Talsania, J. Patel, R. Vaghela, S. Pandya, K. Modi, H. Ghayvat, object detection heads with attentions, in: Proceedings of the IEEE/CVF Conference
Cnn variants for computer vision: history, architecture, application, challenges and on Computer Vision and Pattern Recognition, 2021, pp. 7373–7382.
future scope, Electronics 10 (2021) 2470 (Basel). [48] T.Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection,
[22] A. Dosovitskiy, L. Beyer, Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Un- in: Proceedings of the IEEE International Conference on Computer Vision, 2017,
terthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, pp. 2980–2988.
An image is worth 16x16 words: Transformers for image recognition at scale, in: 9th [49] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, S. Savarese, Generalized
International Conference on Learning Representations, in: ICLR 2021, Austria, 2021 intersection over union: a metric and a loss for bounding box regression, in: Pro-
Virtual Event. ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
[23] M. Shahid, K.-l. Hua, Fire detection using transformer network, in: Proceedings of 2019, pp. 658–666.
the International Conference on Multimedia Retrieval, 2021, pp. 627–630. [50] Z. Gao, L. Wang, B. Han, S. Guo, Adamixer: a fast-converging query-based object de-
[24] P. Chen, M. Zhang, Y. Shen, K. Sheng, Y. Gao, X. Sun, K. Li, C. Shen, Efficient detector, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
coder-free object detection with transformers, in: Proceedings of the Computer Vi- Recognition, 2022, pp. 5364–5373.
sion-ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27, 2022,
Springer, 2022, pp. 70–86. Proceedings, Part X.
304

Citation 8

Uploaded by

Copyright:

Available Formats

Citation 8

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Citation 8

Uploaded by

Copyright:

Available Formats

Journal of Safety Science and Resilience 4 (2023) 294–304

Contents lists available at ScienceDirect

Journal of Safety Science and Resilience

Early smoke and ﬂame detection based on transformer

Fig. 1. Fire accident scenes.

Fig. 2. The framework of the ﬁre detection model.

Fig. 3. The three major modules in DFFT.

Fig. 4. Multi-level ﬁre-detection methods.

3.4. Encoder-only single-level dense prediction 3.4.2. TAE

𝑠0 = 𝑓1dot (16) 𝑡cls = 𝑡1 (22)

Fig. 5. Data augmentation on Our_ﬂame_smoke dataset.

Fig. 6. Data augmentation on Kaggle_ﬂame_smoke dataset.

Fig. 7. mAP for small, medium, and large objects on Kaggle_ﬂame_smoke.

Fig. 8. Detection eﬀects on small-scale objects.

You might also like