As illustrated in Figure 2, the development of the nighttime object detection system encompasses several steps. These include hardware setup and sensor selection, data collection, image alignment across different sensors, information fusion, training the object detection models, evaluating the developed models, and deploying the system for real-time inference. In this paper, the authors propose a novel method for aligning images from two different sensors, collect and label 32,000 paired data samples from the IR and RGB cameras, and implement three different sensor-fusion methods. The best-performing sensor-fused DNN model was optimized for deployment on the in-vehicle computing unit.
3.2. Alignment of Two Different Sensor Images
For a sensor-fusion system incorporating RGB and IR thermal cameras, image alignment is essential to integrate data from the two different sensors. The need for alignment arises from inherent disparities in sensor placement, orientation, and perspective, which can lead to misalignment between the captured images. As presented in Figure 3b, the two sensors capture images with different resolutions and fields of view (FOVs): the Logitech RGB camera image has a 960 × 540 resolution with a 78° FOV, and the FLIR ADK thermal camera (Teledyne FLIR, Wilsonville, OR, USA) image has a 640 × 512 resolution with a 50° FOV. Because the two cameras have different FOVs, a parallax effect is observed between images of the same scene: the same object appears displaced differently in each camera's image. If the images from the two sensors are not aligned properly, features can be fused erroneously, complicating the fusion algorithm's ability to accurately combine information from the two sources.
To resolve the parallax issue, the authors propose a new image alignment algorithm that determines the parameters needed to align images from the two camera sensors. The proposed algorithm generates resizing and translation parameters by comparing the location of the same object in the images from the two cameras. Since the RGB and IR thermal images capture the same two-dimensional scene, the factors contributing to misalignment are positional and size differences. Given that the IR camera has a narrower FOV (50°) than the RGB camera (78°), the RGB image is aligned with respect to the IR image. The proposed alignment algorithm proceeds in the following steps:
- Step (1)
Capture paired images containing a single object (e.g., a pedestrian) using the two cameras mounted on the test vehicle, as illustrated in Figure 3a. The authors utilized 20 paired images.
- Step (2)
For each pair of RGB and IR images:
- (i)
Detect the object using the existing DNN model [33]: The DNN-based object detection algorithm is applied separately to the RGB and IR images, producing bounding box coordinates (depicted in Figure 4a and Figure 4b, respectively). In Figure 4a, the RGB detection is represented by the top-left coordinate (X1RGB, Y1RGB) and the bottom-right coordinate (X2RGB, Y2RGB). Similarly, the IR detection in Figure 4b uses coordinates (X1IR, Y1IR) and (X2IR, Y2IR).
- (ii)
Calculate the resizing factors: To quantify size differences between the images from the two sensors, resizing factors in the x and y directions are computed using Equations (1) and (2):

RFactor_X = (X2IR − X1IR)/(X2RGB − X1RGB) (1)

RFactor_Y = (Y2IR − Y1IR)/(Y2RGB − Y1RGB) (2)

Here, RFactor_X is the ratio of the IR bounding box width to the RGB bounding box width, and RFactor_Y is the ratio of the IR bounding box height to the RGB bounding box height.
- (iii)
Calculate the translation factors: Positional differences between the RGB and IR images arise from the FOV difference. Translation shifts the RGB image coordinate system so that it aligns with the IR image coordinate system. The translation parameters in the x and y directions are determined using Equations (3) and (4).
- (iv)
Record these four parameters: resizing factors and translation factors in the x and y directions.
- Step (3)
For each parameter, calculate the average value using the data generated in Step 2.
Table 2 presents the parameters derived from 20 pairs of images using the proposed alignment algorithm. Using these calculated resizing and translation parameters, the RGB image is resized and translated accordingly.
Figure 4c displays the output after the translation operation. Following translation, the RGB image is cropped to 640 × 512 pixels, from the top-left coordinate (1, 1) to (640, 512), matching the size of the IR image as shown in Figure 4d. The proposed alignment algorithm requires only a single run during camera calibration. Once the alignment parameters (the resizing and translation factors) are computed, they enable real-time alignment of the RGB and IR images in subsequent operations. The algorithm is efficient and robust, facilitating the development of sensor-fusion algorithms across different camera sensors.
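For illustration, the calibration and alignment procedure can be sketched in a few lines of Python. The snippet below is a minimal sketch, not the authors' implementation: bounding boxes are assumed to be (x1, y1, x2, y2) pixel tuples, OpenCV is used for resizing and translation, and the translation factors of Equations (3) and (4) are modeled here as the offset between the IR box corner and the resized RGB box corner, which is an assumption of this sketch.

```python
# Illustrative sketch of the alignment procedure in Section 3.2 (not the authors' code).
import cv2
import numpy as np

IR_W, IR_H = 640, 512  # FLIR ADK thermal image size

def calibration_parameters(pairs):
    """pairs: list of (rgb_box, ir_box) tuples from the calibration images."""
    rf_x, rf_y, tf_x, tf_y = [], [], [], []
    for (rx1, ry1, rx2, ry2), (ix1, iy1, ix2, iy2) in pairs:
        # Equations (1) and (2): ratio of IR box size to RGB box size
        r_x = (ix2 - ix1) / (rx2 - rx1)
        r_y = (iy2 - iy1) / (ry2 - ry1)
        # Translation factors: offset of the IR box from the resized RGB box (assumed form)
        t_x = ix1 - rx1 * r_x
        t_y = iy1 - ry1 * r_y
        rf_x.append(r_x); rf_y.append(r_y); tf_x.append(t_x); tf_y.append(t_y)
    # Step (3): average each parameter over all calibration pairs
    return np.mean(rf_x), np.mean(rf_y), np.mean(tf_x), np.mean(tf_y)

def align_rgb(rgb_img, params):
    """Resize, translate, and crop the RGB image to the IR frame (640 x 512)."""
    r_x, r_y, t_x, t_y = params
    resized = cv2.resize(rgb_img, None, fx=r_x, fy=r_y)
    # Affine matrix that only translates by (t_x, t_y); warpAffine also crops to IR size
    M = np.float32([[1, 0, t_x], [0, 1, t_y]])
    return cv2.warpAffine(resized, M, (IR_W, IR_H))
```

Once `calibration_parameters` has been run on the calibration pairs, `align_rgb` can be applied to every incoming RGB frame in real time, matching the single-run calibration described above.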
Figure 5 displays the image alignment results produced by both the existing registration method and the proposed image alignment method. Figure 5a,b depict the original RGB camera image and its corresponding IR thermal image, respectively. In Figure 5c, the output from the existing registration method is shown; a comparison with the corresponding IR thermal image reveals noticeable misalignment, particularly around the trees, cars, and pedestrians. In contrast, Figure 5d presents the output from the proposed method, demonstrating accurate alignment with the corresponding IR thermal image.
3.3. Publicly Available Dataset and New Data Collection
To develop the nighttime pedestrian detection system, two publicly available datasets were used: the KAIST dataset [29] and the LLVIP (Low Light Vision Pedestrian) dataset [30]. The KAIST dataset [29], published in 2015, initiated low-light object detection research; it consists of pairs of aligned RGB and thermal images for pedestrian detection, all with a resolution of 640 × 512. The LLVIP dataset [30] comprises pairs of spatially aligned RGB and thermal images taken in low-visibility scenes. Example images from these datasets are shown in Figure 6a,b for the KAIST and LLVIP datasets, respectively. However, both datasets have drawbacks: the KAIST dataset suffers from extremely poor IR image quality, as shown in Figure 6a, while the LLVIP dataset consists of images captured by surveillance cameras, whose viewpoint does not match that of cameras mounted on a vehicle.
Therefore, the authors decided to collect data better suited to the requirements of the nighttime model development. The authors gathered data across various scenarios, including residential and urban driving, pedestrian crossings, shopping malls, and parking lots, during nighttime and low-light conditions. In total, 55,000 frames of nighttime data were collected, of which 32,000 frames contain pedestrians. Sample images collected by the authors are presented in Figure 7.

All collected images were aligned using the proposed alignment algorithm explained in Section 3.2 and then labeled using the MATLAB Image Labeler app [34]. Three datasets, the KAIST, LLVIP, and Kettering datasets, were utilized to develop the nighttime pedestrian detection system. The data samples are divided into three groups for training, validation, and testing of the DNN models. A summary of the entire dataset is provided in Table 3.
3.4. Development of the Sensor-Fusion DNN Models
Using a single sensor for object detection can lead to vulnerabilities, as it may fail to provide adequate information in certain scenarios (e.g., obscured vision due to low lighting conditions or fog). Sensor fusion mitigates these risks by providing redundant or complementary data from multiple sensors, making the system more robust and reliable in various environmental conditions. For sensor-fusion systems, how data are integrated from different sensors is critical to the overall system performance.
Figure 8 shows three different fusion methods that combine the data differently: early fusion, mid fusion, and late fusion [19].
Early Fusion: As depicted in Figure 8a, the early-fusion method integrates input images from multiple sensors at the beginning of the data processing pipeline, before the DNN model. The objective is to create a unified, comprehensive representation of the scene by leveraging the complementary nature of RGB and thermal information. The IR and RGB images are fused using the weighted sum method [35], in which each image's pixel values are multiplied by a specific weight reflecting that image's significance and then added together. The following procedure is applied to fuse the images from the two sensors:
- (i)
Each RGB image is aligned using the proposed image alignment algorithm in Section 3.2. The aligned RGB image has the same image width, imgW, and image height, imgH, as the corresponding IR thermal image.
- (ii)
To generate the fused representation of the scene captured by two different sensors, the proposed weighted sum approach involves adding two weighted pixel values at each location (x, y) for each color channel.
For every channel c, where c = 1:3 in an RGB image, and for every pixel location (x, y):

Fused_img(x, y, c) = WIR × IRimg(x, y, c) + WRGB × RGBimg(x, y, c)

where IRimg is the IR image and RGBimg is the aligned RGB image. The ranges of x and y are defined as x = [1:imgH] and y = [1:imgW]. The weights WIR and WRGB are associated with the IR image and the RGB image, respectively, where WIR + WRGB = 1. The fused image data, Fused_img, are used to develop the DNN models. In this research, a 60/40 ratio of IR to RGB weighting is used based on experimental findings, with 60% of the total weight attributed to the IR image and 40% to the RGB image.
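As a minimal sketch of this weighted sum operation (not the authors' code), the fusion can be written with NumPy as follows, assuming both images are already aligned and 8-bit, and that a single-channel IR frame is replicated across three channels:

```python
# Weighted sum (early) fusion of an aligned IR/RGB pair.
import numpy as np

W_IR, W_RGB = 0.6, 0.4  # 60/40 IR-to-RGB weighting used in this work

def weighted_sum_fusion(ir_img: np.ndarray, rgb_img: np.ndarray) -> np.ndarray:
    if ir_img.ndim == 2:  # replicate a single-channel IR frame across three channels
        ir_img = np.repeat(ir_img[:, :, None], 3, axis=2)
    fused = W_IR * ir_img.astype(np.float32) + W_RGB * rgb_img.astype(np.float32)
    return np.clip(fused, 0, 255).astype(np.uint8)
```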
Figure 9 shows an example of information fusion using the weighted sum method with the weights WIR = 0.6 and WRGB = 0.4. Figure 9a,b are taken from the IR camera and the RGB camera, respectively. As shown in Figure 9b, the RGB camera did not capture the details of the person (marked with a green dotted rectangle) under low-light conditions, whereas the IR image in Figure 9a shows the details of the person in the same scene. The fused image produced by the weighted sum method displays the person, as presented in Figure 9c. The early-fusion model is trained with the 110,000 training samples listed in Table 3. Rather than training from scratch, the model is developed using the transfer learning method [36] with the pre-trained YOLO v5 [33] model, as shown in Figure 10.
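For readers unfamiliar with transfer learning, the pattern is sketched below using a torchvision detector purely as a stand-in; the authors instead fine-tune a pre-trained YOLO v5 model through its own training pipeline. The idea is the same: load pre-trained weights, adapt the detection head to the new classes, and fine-tune on the new dataset.

```python
# Generic transfer-learning sketch (stand-in illustration, not the authors' YOLO v5 setup).
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def build_pedestrian_detector(num_classes: int = 2):  # background + pedestrian
    # Start from a detector pre-trained on a large dataset
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    # Replace the detection head so it predicts the new classes
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model  # fine-tune this model on the fused nighttime dataset
```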
Late Fusion: Late fusion merges detection results after independent detections on the RGB and IR thermal images. This approach allows for the use of diverse types of data or DNN models, potentially leading to improved performance or robustness compared to using any single modality or model in isolation. Late fusion contrasts with early fusion, where data from different sources are combined before being fed into the DNN model. Figure 11 presents an overview of the late-fusion method, illustrating how the RGB and IR images are separately input into their object detection DNN models. Each detection includes details such as bounding box coordinates and confidence scores for each detected object. The outcomes from each sensor are compared and merged, as shown in Figure 11, using the Non-Maximum Suppression (NMS) algorithm [37].
The NMS algorithm [37] is a post-processing technique designed to remove redundant detections of the same object within a single DNN model's output. When an image is input into an object detection model, the model identifies objects based on features such as hands, legs, and other body parts; consequently, its output may include duplicate detections for a single object, as shown by the dotted bounding boxes in Figure 11 and Figure 12. Moreover, the application of NMS can be extended to merge detection results from DNN models operating on different sources (such as RGB and IR), ensuring that each object is associated with the most accurate bounding box. This process enhances the accuracy and reliability of object detection, as illustrated in Figure 12. The NMS process for the late-fusion system involves the following steps:
- Step 1
Merge detection results in the form of a set of bounding boxes along with their associated confidence scores from two different object detection DNN models.
- Step 2
Sort the bounding boxes based on their confidence scores in descending order. This ensures that the box with the highest confidence score is considered first.
- Step 3
Start with the bounding box that has the highest confidence score, high_bb, in the sorted list. This box is considered a potential detection.
- Step 4
Iterate over the remaining boxes in the sorted list. For each box, bb_i:
- i.
Calculate the intersection over union (IoU) between the current bounding box, bb_i, and the highest-confidence bounding box, high_bb.
- ii.
If the IoU is above a certain threshold (0.5 is used here), discard the bounding box bb_i, as it overlaps significantly with the currently selected box, high_bb, and is likely to represent the same object; otherwise, keep the bounding box.
Steps 3 and 4 are applied iteratively to the next-highest-confidence bounding box until no unprocessed bounding boxes remain. Applying NMS eliminates redundant detections, resulting in a cleaner and more accurate set of bounding boxes for object detection in the late-fusion system.
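The NMS-based merging described in Steps 1-4 can be sketched as follows. In this sketch, detections are assumed to be dictionaries with a 'box' tuple (x1, y1, x2, y2) and a 'score', already pooled from the RGB and IR detectors (Step 1):

```python
# Sketch of NMS over the merged RGB + IR detections (late fusion).
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def late_fusion_nms(detections, iou_threshold=0.5):
    # Step 2: sort the merged detections by confidence, highest first
    remaining = sorted(detections, key=lambda d: d["score"], reverse=True)
    kept = []
    while remaining:
        high_bb = remaining.pop(0)  # Step 3: highest-confidence box is kept
        kept.append(high_bb)
        # Step 4: discard boxes that overlap high_bb above the IoU threshold
        remaining = [bb_i for bb_i in remaining
                     if iou(high_bb["box"], bb_i["box"]) <= iou_threshold]
    return kept
```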
Mid Fusion: To implement the mid-fusion method using RGB and IR images, the original YOLO v5 architecture was redesigned with dual-stream backbones, as described in [21] and illustrated in Figure 13. This approach processes the RGB and IR thermal images separately: the first backbone extracts features from the RGB images, while the second backbone extracts features from the thermal images. The key component of this architecture is the CFT modules [21], where the features from the RGB and IR thermal images are integrated. Integrating the RGB features with the thermal features enhances feature richness; the enriched features are then reprocessed through the RGB backbone and, similarly, the thermal features are enriched with RGB features and reprocessed through the thermal backbone. This fusion of features improves detection across multiple scales. The proposed mid-fusion model is trained using transfer learning with the 110,000 data samples shown in Table 3.
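As a highly simplified illustration of the dual-stream idea (not the architecture of [21]), the sketch below uses two small convolutional stages per modality, with a 1 × 1 convolution over the concatenated streams standing in for the CFT module; the shared features are fed back into both streams and collected as multi-scale outputs for a detection head:

```python
# Simplified dual-stream backbone sketch; the fusion block is a stand-in for CFT [21].
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                         nn.BatchNorm2d(out_ch), nn.SiLU())

class DualStreamBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.rgb_stages = nn.ModuleList([conv_block(3, 32), conv_block(32, 64)])
        self.ir_stages = nn.ModuleList([conv_block(3, 32), conv_block(32, 64)])
        # Stand-in for the CFT modules: 1x1 conv over the concatenated streams
        self.fuse = nn.ModuleList([nn.Conv2d(64, 32, 1), nn.Conv2d(128, 64, 1)])

    def forward(self, rgb, ir):
        multiscale = []
        for rgb_stage, ir_stage, fuse in zip(self.rgb_stages, self.ir_stages, self.fuse):
            rgb, ir = rgb_stage(rgb), ir_stage(ir)
            shared = fuse(torch.cat([rgb, ir], dim=1))   # cross-modality features
            rgb, ir = rgb + shared, ir + shared          # enrich both streams
            multiscale.append(shared)                    # multi-scale features for the head
        return multiscale

# Example: feats = DualStreamBackbone()(torch.randn(1, 3, 512, 640), torch.randn(1, 3, 512, 640))
```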
Training for the five DNN models, comprising the three fusion models and the two single-mode models, was conducted on a Dell Alienware Aurora R8 desktop computer with a 9th Gen Intel Core i7-9700 processor and an NVIDIA GeForce RTX 2080 Ti GPU. Each model was trained with the 110,000 paired data samples detailed in Table 3. For all DNN models, the authors used a learning rate of 0.001 and the Stochastic Gradient Descent (SGD) optimizer, with a batch size of 12.
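A minimal PyTorch training loop matching the reported settings (SGD, learning rate 0.001, batch size 12) is sketched below; the model, dataset, loss function, and epoch count are placeholders, as these details are not specified here:

```python
# Minimal training-loop sketch with the reported hyperparameters.
import torch
from torch.utils.data import DataLoader

def train(model, train_dataset, compute_loss, epochs=50):  # epoch count assumed
    loader = DataLoader(train_dataset, batch_size=12, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device).train()
    for _ in range(epochs):
        for images, targets in loader:
            optimizer.zero_grad()
            loss = compute_loss(model(images.to(device)), targets)  # placeholder loss
            loss.backward()
            optimizer.step()
```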