1. Introduction
Pedestrian detection is an important research direction in computer vision, and consists of feature extraction, object classification, and object localization. Most existing methods are concerned with object detection under visible light. However, under complex lighting conditions, e.g., cloud, rain, and fog, visible-light methods are insufficient for extracting features. Hence, infrared cameras can be used, which sense the radiation of the object and can effectively avoid the interference of complex scenes during detection [1]. However, thermal infrared images have the disadvantages of low contrast, weak texture, and strong noise. These physical properties still make object detection in thermal infrared scenes challenging [2,3].
Infrared pedestrian-detection methods can be divided into traditional methods and convolutional neural network (ConvNet) methods [4]. In traditional object detection, feature extraction and classification are performed separately. First, pedestrian features (e.g., color, edge, and texture) are handcrafted; then, a classifier is trained on these features; finally, sliding windows are utilized for object localization [5]. Compared with traditional detection methods, ConvNets can implement end-to-end pedestrian detection, which is independent of expert experience [6]. ConvNet methods consist of a feature extraction network, bounding-box regression, and object classification [7].
The traditional methods are based on handcrafted features [8]. Zhang et al. [9] designed rectangular templates of the human body with multi-modal and multi-channel Haar-like features. Watanabe et al. [10] utilized two co-occurrence histogram features (i.e., CoHOG and CoHLBP), which describe different aspects of object shapes, to improve the accuracy of pedestrian detection. Felzenszwalb et al. [11] proposed latent SVM, which combines a margin-sensitive method for mining hard negative examples. Zhang et al. [12] extended two feature classes (edgelets and HOG features) and two classification models (AdaBoost and SVM cascade) to infrared images. Lowe et al. [13] introduced the scale-invariant feature transform for object recognition. Similarly, Dalal et al. [14] utilized the histogram of oriented gradients for feature extraction. Professor Maldague's team utilized visible-light and thermal infrared image pairs to extract regions of interest [15]. However, these methods perform insufficiently in complex scenarios. Moreover, the sliding-window strategy generates a massive number of redundant boxes, and the handcrafted-feature strategy lacks high-level semantic information. Together, these two strategies lead to low object detection efficiency [16].
The ConvNet methods include the Yolo series [17] and the SSD series [18], which have a huge number of parameters (i.e., the weight parameters of the network) and high latency on edge computing devices. These edge computing devices are typically Internet-of-Things terminal devices with low computing power and power consumption [19]. Hence, the MobileNet [20] series and the ShuffleNet [21] series were presented, which are represented by lightweight base models [22]. Zheng et al. [23] utilized MobileNetV3 and depthwise separable convolution to achieve parameter compression and inference acceleration of Yolov4 (i.e., Yolov4-lite). Wang et al. [24] achieved SSD parameter compression using MobileNetV2 for infrared image pedestrian detection. Similarly, Yolov4-tiny [25] performs a lightweight reconstruction of Yolov4. According to the scenario's needs, EfficientDet [26] and Yolov5 [27] utilize hyperparameters to control the number of network parameters. For example, each network structure of EfficientDet (i.e., D0 to D7) is associated with an input size, where D0 corresponds to an input size of 512 × 512. Rep-Yolov4 [28] utilized network pruning to compress network parameters and speed up inference.
The above ConvNet methods (i.e., anchor-based object detectors) are less effective for some specific datasets, which require the reclustering of anchors [29]. In recent years, many studies have shown that anchor-free detectors can perform comparably to anchor-based detectors. Yolox, an anchor-free detector, introduces a decoupled head, which reduces the conflict between the classification and regression tasks [30]. Liu et al. [4] proposed a thermal infrared pedestrian and vehicle detector based on Yolov3 by expanding the parameters of YoloHead, which combines anchor-free and anchor-based branches. To enhance features and improve detection performance, attention mechanisms were introduced, e.g., SENet [31] and CBAM [32]. Xue et al. [33] proposed a spatio-temporal sharpening attention mechanism, which combines a channel attention module and a sharpening spatial attention module. Xue et al. [34] proposed a novel multi-modal attention fusion based on Yolo for real-time pedestrian detection.
However, due to the small coverage area, low resolution, and reduced feature information of small objects in thermal infrared images, the performance of the above methods is unsatisfactory. Gao et al. [35] proposed a high-precision detection method based on the feature mapping of deep neural networks with a spindle network structure. Lu et al. [36] proposed a novel small-object detection method via multidirectional derivative-based weighted contrast measures. Zhu et al. [37] introduced an attention mechanism based on RefineDet for the infrared detection of small pedestrians. Li et al. [38] utilized the weighted fusion of visible and infrared images, which enhanced small-object feature information. In this paper, we propose an efficient thermal infrared detection method based on Yolox-S for complex pedestrian scenes. Our method can effectively handle small objects and occluded objects without adding any parameters. To sum up, our main contributions are:
- (1)
We adjust the input size of the network, which reduces computational redundancy and alleviates the imbalance in the number of small objects based on the self-built dataset.
- (2)
Multi-scale mosaic data augmentation is proposed to extend object diversity in training, which simulates scenarios with multiple objects, small objects, and occluded objects.
- (3)
To achieve effective feature fusion in the neck, we introduce the parameter-free attention mechanism to implement feature enhancement.
- (4)
We accelerate the network inference using quantization technology on the edge device. To fully exploit hardware resources, multi-threading technology is utilized to realize multi-channel video detection tasks.
The rest of this paper is organized as follows.
Section 2 details the proposed method.
Section 3 conducts experiments based on the infrared image datasets (YDTIP, UTMD [39], KAIST [40], and CVC-14 [41]) to verify the effectiveness of the proposed method.
Section 4 summarizes this paper.
2. Proposed Method
In recent years, the ConvNet structure has been iteratively updated, and its feature extraction capability has become stronger. In this paper, the main improvements made to Yolox-S for thermal imaging pedestrian detection (TIPYOLO) are as follows: we modify the size of the input image by analyzing the characteristics of the dataset and increase the diversity of the training set through a multi-scale mosaic; the parameter-free attention mechanism is introduced into Yolox-S; and we utilize quantization and multi-threading technology to meet the timeliness requirements of the application.
2.1. Multi-Scale Mosaic
In an object detection network, the input has a uniform size and generally a square structure. Customary methods obtain the standard input size by cropping, stretching, scaling, etc. However, these operations are prone to problems such as missing objects, loss of resolution, and reduced accuracy. In the testing stage of Yolox-S, the size of the input images is 416 × 416 or 640 × 640. Hence, we utilize gray filling to achieve image scaling in the model inference phase.
Here, we take an input of 416 × 416 as an example; in Figure 1, A is the original input image. After calculating the scaling ratio of image A, image B is obtained using the BICUBIC interpolation algorithm. Image B is pasted into a 416 × 416 gray image to obtain C1. This scaling method reduces the image resolution without distortion, but the filled gray pixel area in C1 is too large (i.e., more than half of the image is gray-filled, causing redundant computation), which increases the inference time of the model. Meanwhile, considering the limited resources of edge computing devices and that the maximum downsampling rate of the ConvNet is 32 times, the optimal gray-filling scale is 416 × 192, as shown by C2 in Figure 1.
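The gray-filling step can be sketched as follows. This is a minimal NumPy illustration with our own function naming (not the paper's code); nearest-neighbour index mapping stands in for the BICUBIC interpolation used in the paper, since any interpolation can be substituted:

```python
import numpy as np

def letterbox(img, target_h=192, target_w=416, fill=114):
    """Scale an image to fit the target canvas without distortion,
    then pad the remainder with gray pixels (the value 114 is a
    common gray-fill choice, assumed here)."""
    h, w = img.shape[:2]
    ratio = min(target_h / h, target_w / w)  # keep aspect ratio
    new_h, new_w = int(round(h * ratio)), int(round(w * ratio))
    # Nearest-neighbour resize via index mapping (BICUBIC in the paper).
    ys = (np.arange(new_h) / ratio).astype(int).clip(0, h - 1)
    xs = (np.arange(new_w) / ratio).astype(int).clip(0, w - 1)
    resized = img[ys][:, xs]
    canvas = np.full((target_h, target_w) + img.shape[2:], fill,
                     dtype=img.dtype)
    canvas[:new_h, :new_w] = resized  # paste at the top-left corner
    return canvas, ratio
```

For a 640 × 360 frame and the 416 × 192 target, the image is scaled by min(192/360, 416/640) and the unused right-hand strip is gray-filled, so far less than half of the canvas is padding.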
For the model training phase, we introduce mosaic data augmentation [17]. Mosaic data augmentation crops four pictures around a central position, and then merges them into one picture. As shown in Figure 2a,b, the mosaic can enrich the scene, which involves the image background, small objects, and pedestrian occlusion. Moreover, a larger batchsize increases the GPU memory burden. The mosaic operation reduces the dimension of the input data, that is, channel × height × width (CHW) × batchsize is reduced to CHW × batchsize × 1/4, thereby reducing memory usage. However, mosaic data augmentation moves the training data away from the distribution of natural images. Therefore, the code automatically turns off the mosaic during the last 30 epochs of training, and the random probability of applying the mosaic during training is 0.5.
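The crop-and-merge idea can be sketched as follows; this is a simplified version under our own naming, which omits the bounding-box remapping that a full mosaic implementation also performs:

```python
import random
import numpy as np

def mosaic4(imgs, out_h=192, out_w=416):
    """Merge four images into one mosaic around a random center point.

    imgs: four HxWx3 arrays, assumed already resized to out_h x out_w.
    Each image is cropped to fill one quadrant of the output canvas.
    """
    assert len(imgs) == 4
    # Random center, kept away from the borders so every quadrant is non-empty.
    cy = random.randint(out_h // 4, 3 * out_h // 4)
    cx = random.randint(out_w // 4, 3 * out_w // 4)
    canvas = np.zeros((out_h, out_w, 3), dtype=np.uint8)
    regions = [(slice(0, cy), slice(0, cx)),          # top-left
               (slice(0, cy), slice(cx, out_w)),      # top-right
               (slice(cy, out_h), slice(0, cx)),      # bottom-left
               (slice(cy, out_h), slice(cx, out_w))]  # bottom-right
    for img, (rs, cs) in zip(imgs, regions):
        h, w = rs.stop - rs.start, cs.stop - cs.start
        canvas[rs, cs] = img[:h, :w]  # crop each image to its quadrant
    return canvas
```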
On the basis of the above strategies, we utilize a random multi-scale training strategy. Since the maximum downsampling rate of the network is 32 times, the sizes in the input-image sequence are also selected as integer multiples of 32, that is, 288 × 64, 320 × 96, 352 × 128, 384 × 160, 416 × 192, 480 × 224, 512 × 256, 544 × 288, 576 × 320, 608 × 352, and 640 × 384. A scale is randomly selected from the sequence as the input size. This strategy not only improves the adaptability of the network to different scales, but also realizes the scale perturbation of objects.
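The random scale selection above amounts to sampling from a fixed list of 32-aligned sizes; a minimal sketch (names are ours):

```python
import random

# Multi-scale sequence from the text: every width and height is an
# integer multiple of 32, the network's maximum downsampling rate.
SCALES = [(288, 64), (320, 96), (352, 128), (384, 160), (416, 192),
          (480, 224), (512, 256), (544, 288), (576, 320), (608, 352),
          (640, 384)]

def pick_input_size(rng=random):
    """Randomly pick a (width, height) pair for the next training batch."""
    w, h = rng.choice(SCALES)
    return w, h
```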
2.2. TIPYOLO Network Details
TIPYOLO is an improvement on Yolox-S for thermal imaging pedestrian detection. The input of the model is a three-channel RGB image (i.e., 416 × 192 × 3). The backbone consists of CSPDarknet, which integrates general residuals and cross-stage partial residuals; hence, CSPDarknet achieves efficient feature extraction with few parameters. As shown in Figure 3, SPPBottleneck is added to CSPDarknet, which mainly plays the role of multi-receptive-field fusion. Between the backbone and the neck, we introduce the parameter-free attention module (i.e., simAM) [42]. We utilize simAM to filter more focused feature information for the neck.
The theory of simAM is based on visual neuroscience: in visual processing, neurons with clear spatial suppression effects should be given higher priority. To implement the attention mechanism, the simAM algorithm evaluates the importance of each neuron using the minimum energy equation $e_t^*$. The lower the energy, the more distinct the target neuron is from the surrounding neurons, and the more important it is for visual processing. The importance of each neuron can thus be obtained as $1/e_t^*$, as shown in Equation (1):

$$e_t^* = \frac{4(\hat{\sigma}^2 + \lambda)}{(t - \hat{\mu})^2 + 2\hat{\sigma}^2 + 2\lambda}, \quad (1)$$

where $\hat{\mu} = \frac{1}{M}\sum_{i=1}^{M} x_i$ and $\hat{\sigma}^2 = \frac{1}{M}\sum_{i=1}^{M} (x_i - \hat{\mu})^2$; $t$ and $x_i$ are the target neuron and the other neurons in a single channel of the input feature $X \in \mathbb{R}^{H \times W}$; $i$ is the index over the spatial dimension; and $M = H \times W$ is the number of neurons on that channel. The hyper-parameter $\lambda$ is set to 0.0001.

According to the definition of an attention mechanism, we enhance the feature as shown in Equation (2):

$$\widetilde{X} = \mathrm{sigmoid}\!\left(\frac{1}{E}\right) \odot X, \quad (2)$$

where $E$ groups all $e_t^*$ across the channel and spatial dimensions.
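Because simAM is parameter-free, the whole module fits in a few lines. The following NumPy sketch operates on a single (C, H, W) feature map, following the published simAM formulation (the function name and variable names are ours, not from the paper's code):

```python
import numpy as np

def simam(x, lam=1e-4):
    """Parameter-free simAM attention on a (C, H, W) feature map.

    Computes the inverse minimal energy per neuron and rescales the
    feature by its sigmoid, so more distinctive neurons are emphasized.
    """
    c, h, w = x.shape
    n = h * w - 1  # number of "other" neurons in each channel
    mu = x.mean(axis=(1, 2), keepdims=True)
    d = (x - mu) ** 2
    var = d.sum(axis=(1, 2), keepdims=True) / n
    # Inverse of the minimal energy e_t*: larger means more important.
    inv_energy = d / (4 * (var + lam)) + 0.5
    return x / (1.0 + np.exp(-inv_energy))  # x * sigmoid(1 / e_t*)
```

Since the sigmoid output lies in (0, 1), the module only rescales each activation and never changes its sign, which is why it can be dropped between the backbone and the neck without adding parameters.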
PANet performs bottom-up and top-down feature fusion; these features are first filtered by the attention module. Here, feature fusion adopts Concat in the channel dimension, which yields rich semantic information, that is, it includes both deep and shallow network features. The fused features are passed to YoloHead for object classification and location regression. The first Conv layer of YoloHead mainly implements channel adjustment, which reduces the channel size and the amount of branch computation. The two parallel branches (i.e., localization and classification) each have two 3 × 3 Conv layers, which further improve the nonlinear fitting ability of the network.
2.3. Video Processing Acceleration
During the deployment phase of the method, we utilize post-training quantization techniques to accelerate network inference on edge computing devices. Model quantization is the process of mapping float values to int values. We take int8 as an example to map the floating-point input to the fixed point. First, we count the numerical range of the input, which provides the maximum absolute value $|x|_{\max}$. Then, $|x|_{\max}$ is mapped to 127 (i.e., the maximum value under int8), obtaining the scaling ratio $s$ using Equation (3):

$$s = \frac{127}{|x|_{\max}}. \quad (3)$$

$s$ is the scaling ratio by which the floating-point input is mapped to the fixed point. Similarly, the weight scaling ratio can be calculated. Finally, the input scaling ratio $s_i^x$ and the weight scaling ratio $s_i^w$ are saved in the weight parameters for subsequent fixed-point calculations, where the index $i$ denotes the current network layer.
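As a small worked example of Equation (3) (the function name is ours):

```python
import numpy as np

def int8_scale(x):
    """Scaling ratio s = 127 / |x|_max for symmetric int8 quantization."""
    return 127.0 / float(np.abs(x).max())
```

For instance, if a calibration pass finds $|x|_{\max} = 6.35$, then $s = 20$, and a floating-point activation of 0.5 maps to the fixed-point value round(0.5 × 20) = 10.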
Here, we take the convolution operator as an example, as shown in Figure 4. First, the floating-point input is quantized to obtain an integer input using the scaling ratio, as shown in Equations (4)–(6):

$$x_q = \mathrm{round}(s_i^x \cdot x), \quad (4)$$
$$x_q = T, \quad \text{if } x_q > T, \quad (5)$$
$$x_q = -T, \quad \text{if } x_q < -T, \quad (6)$$

where $x$ is the input, and $x_q$ is the quantized result; $T$ is the threshold, so the value range of $x_q$ is [−127, 127]. Similarly, the weight of the ConvNet is quantized to obtain an integer weight. Then, an integer Conv operation is performed to obtain the integer result. Finally, the floating-point output is obtained through the dequantization in Equation (7):

$$y = \frac{y_q}{s_i^x \cdot s_i^w}, \quad (7)$$

where $y_q$ is the result of the Conv operation, and $y$ is the dequantized result [43].
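A toy end-to-end sketch of this quantize/compute/dequantize round trip is shown below, with a matrix product standing in for the convolution for brevity (all names are ours, not the deployment toolchain's API):

```python
import numpy as np

def quantize(x, s, t=127):
    """Scale, round, and clip to the int8 threshold T (Equations (4)-(6))."""
    return np.clip(np.round(x * s), -t, t).astype(np.int8)

def quantized_matmul(x, w):
    """Toy int8 'convolution' (a matrix product) with dequantization.

    Computes per-tensor scales, runs the product in integer arithmetic,
    and rescales the result by 1 / (s_x * s_w) (Equation (7))."""
    sx = 127.0 / np.abs(x).max()
    sw = 127.0 / np.abs(w).max()
    xq = quantize(x, sx).astype(np.int32)  # widen to avoid overflow
    wq = quantize(w, sw).astype(np.int32)
    yq = xq @ wq                            # integer accumulation
    return yq / (sx * sw)                   # dequantize back to float
```

For well-scaled inputs, the dequantized result stays close to the full-precision product, which is why the accuracy loss of post-training int8 quantization is usually small.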
The above quantization algorithm mainly accelerates network inference, but it neglects the acceleration of pre-processing and post-processing. The pre-processing, inference, and post-processing of the model are executed serially; inevitably, some computing units are left waiting. To alleviate this wait, we convert the three computing units of the model into a pipeline, as shown in Figure 5b, that is, each computing unit is assigned a thread. As shown in Figure 5c, unlike (b), two threads are added to speed up the pre-processing. However, pipelineV2 only works for single-channel video processing. For multi-channel video processing, we take two-channel video as an example; the specific strategies are shown in Figure 5d,e. As shown in Figure 5d, we utilize pipelineV2 to implement parallel processing of the two-channel video. However, considering that running two models on the edge computing device at the same time inevitably increases the hardware burden, we instead implement two-channel video processing based on an improved pipelineV2. The specific implementation details are as follows: First, we utilize two buffer queues to temporarily store the pre-processed data. Then, the model reads the data alternately from the two queues. Once model inference is completed, the post-processed video data are stored alternately in two buffer queues using a Flag operation. Finally, the visualization thread visualizes the data from the two queues.
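The one-thread-per-computing-unit idea (the pipeline of Figure 5b) can be sketched with Python's standard threading and queue modules; the stage functions and names below are placeholders, not the paper's implementation:

```python
import queue
import threading

def run_pipeline(frames, preprocess, infer, postprocess):
    """Three-stage pipeline: each computing unit runs in its own thread,
    connected by bounded buffer queues so stages overlap instead of
    waiting on each other."""
    q_pre = queue.Queue(maxsize=8)   # preprocess  -> inference
    q_post = queue.Queue(maxsize=8)  # inference   -> postprocess
    results = []

    def pre_worker():
        for f in frames:
            q_pre.put(preprocess(f))
        q_pre.put(None)  # sentinel: no more frames

    def infer_worker():
        while (item := q_pre.get()) is not None:
            q_post.put(infer(item))
        q_post.put(None)

    def post_worker():
        while (item := q_post.get()) is not None:
            results.append(postprocess(item))

    threads = [threading.Thread(target=t)
               for t in (pre_worker, infer_worker, post_worker)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The bounded queues play the role of the buffer queues in the text: a fast pre-processing thread blocks when its queue is full rather than flooding memory, and frame order is preserved because each stage is a single FIFO consumer. The two-channel variant of Figure 5e would add a second pre-processing queue that the inference thread reads from alternately.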