1. Introduction
Forklifts are among the most popular types of handling equipment on the market [1], and demand for them tends to increase year by year [2] because they offer many benefits, such as improved productivity and reduced manual handling [3]. The Research and Markets research agency predicts that the forklift market will record sales of about 2.2 million units in 2023, with a compound annual growth rate of about 9%. According to a report by the PR Newswire agency, the forklift market will grow at a rate of 7.8% annually, increasing from 2 billion USD in 2020 to 2.9 billion USD in 2035 [4]. The deployment of such a large number of forklifts has resulted, and will continue to result, in high accident and fatality rates around the world [5,6,7]. Researchers have reported that most accidents can be attributed to human operator errors, such as a lack of attention, misperception, or misjudgment [8,9]. The US Occupational Safety and Health Administration (OSHA) has also reported that a significant number of forklift accidents are caused by reduced situation awareness of the forklift operator [10,11]. The prevalence of this type of accident clearly highlights the importance of developing automatic-drive forklifts.
The use of automatic-drive forklifts still requires the development of key technologies for tasks such as autonomous navigation, pallet location, and pallet recognition. Automated guided vehicles (AGVs), first developed in America in the early 1950s [12,13], have helped improve the autonomous navigation of forklifts. In recent years, researchers have intensively studied optimal control theories and optimization methods for solving AGV path planning and navigation problems [14,15,16,17]. Zhang et al. [15] proposed a learning-based algorithm for global path planning that trains a deep convolutional neural network with dual branches (DB-CNN). However, recognizing and locating pallets in actual warehouses is more difficult because pallets are placed with high uncertainty. To address this issue, the authors of [18,19] defined a semi-structured environment in which a priori information about the pallets and the environment is available, including the desired pose, the goods loaded on pallets, and whether goods sit on a shelf or on the ground.
The present investigation aimed to address three key issues that must be overcome in semi-structured environments. First, a forklift algorithm has to identify pallets under uncertainty and then maintain a continuously accurate estimate of the pose parameters while approaching and picking up pallets [19]; the pose of a target pallet cannot be predetermined because of errors in human operation. Second, a forklift algorithm should identify various pallets, because several types of pallets and goods are used in warehouses. Third, the algorithm should be able to segment multiple pallets, which is difficult when pallets are situated within a small distance of each other. If the pallet detection algorithm cannot overcome these problems, the fork may damage goods and cause security incidents.
Some researchers have explored pallet recognition algorithms based on a single sensor. An algorithm based on geometry calculated from pallet edges and shape features was investigated, but the features could be incorrectly extracted when the pallet was occluded or the illumination conditions changed. In [19,20], the pose of single pallets was estimated with geometry classifiers based on geometric features extracted from LiDAR distance data. In addition, the authors of [21,22,23,24,25,26] obtained pallet pose information based on pallet size and edge features extracted from camera color images. To reduce the effect of the environment, pallets were identified with marks attached to the pallet feet in [27,28,29]. However, the processes of adding and calibrating the marks led to higher costs and more human work.
Additionally, some researchers have focused on recognizing pallets from images and extracting pose parameters using depth sensors such as laser radar and structured light; these studies achieved more robust performance in semi-structured environments. In [18], a three-foot pallet template was created to match the scattering image of LiDAR data; in a color image, an edge template was used to match the distance-transform image, and the two results were combined based on the maximum percentage of coupled points. In [30], a pallet model was created, and the pose was estimated by applying an ICP method to LiDAR data. In [31], a classifier with a feature grid template and gray-scale features (normalized pair differences) was proposed and shown to identify pallets. Color-threshold-based algorithms were used in [32,33,34]. Moreover, other researchers trained a classifier [35] to detect wooden pallets in images using Haar-like features, and the results were verified with an adaptive structural feature and a direction-weighted overlapping ratio in [36].
The aforementioned approaches used different types of sensing, but they all involved single-sensor devices, such as laser scanners and cameras (monocular or stereo). A laser scanner only provides distance information, but it is stable and robust against lighting changes; in contrast, a camera provides vision data rich in information. An RGB-D sensor combines a camera and a distance range sensor to obtain both color and distance information: the camera acquires image data with a CMOS sensor, with each pixel recording red, green, and blue color data [37,38,39,40] used for recognition, and the depth sensor obtains depth information for every pixel. For pallet image recognition, researchers have mostly focused on the recognition of a single pallet at the expense of the spatial relationship between pallets and the environment [41,42,43].
This paper proposes an algorithm that uses an RGB-D sensor to recognize and locate pallets in a semi-structured environment. The algorithm handles situations with multiple pallets and with differently shaped pallets, and an accuracy experiment was conducted to test it. A labeled template matching algorithm was developed to accelerate the matching speed, and this algorithm serves as a basis for studying object detection in unstructured environments. Its simplicity of operation should enhance the flexibility of autonomous forklifts and expand their applicability in complex environmental conditions.
2. Methods
2.1. Establishing Forklift Detection Model and Situation Analysis
Figure 1 shows the considered detection model of approaching a pallet. The model consists of a forklift, an RGB-D sensor, goods, and pallets. Multiple pallets are used to load differently sized goods, and the pallets are located on the ground, on goods, or on shelves. Two coordinate frames were constructed: the world coordinate frame {W}, whose X-Y plane is the ground, and the sensor coordinate frame {C}, whose X-Y plane is parallel to the ground. The detection algorithm aims to obtain the center position of the pallet surface in {C} and the angle of the pallet relative to the Y-Z plane of {C}, represented as (x_c, y_c, z_c, θ). The pallet has four degrees of freedom: translation along the X, Y, and Z axes of {W} and rotation about the Z axis of {W} (Figure 1).
In previous studies, scholars attempted to recognize single pallets without taking the goods, the ground, and shelves into account. However, not only pallets but also goods and the ground must be detected during the process of engaging a pallet. Moreover, the combination of pallets and goods is always predetermined, so information about the goods and the ground can be utilized as part of the detection target. With more target information, a detection algorithm can be more robust in complicated situations.
2.2. Algorithm Flow
The algorithm is divided into three steps: calculating initial data, obtaining the region of interest (ROI) of the pallet, and estimating the pallet pose parameters. After running step 1 once, the system loops through steps 2 and 3 to update the pose parameters of the pallet, as shown in Figure 2.
In step 1, the algorithm obtains information about the pallets and goods and then selects the classifier model that classifies each pixel of the color image. The initial distance is computed statistically from the distance data of pallet-category pixels and serves as the initial parameter for the next step.
In step 2, the algorithm obtains a new RGB and depth image from the sensors. If the target distance is in the effective range, then the compression unit size is calculated based on the projective principle. After classifying each pixel, a category matrix records the label information. A template is created based on the target information. To accelerate the matching process, both the category matrix and template are compressed. The labeled template is matched to the category matrix, and the match score of pixels is calculated. If the match score is higher than the threshold score, then the pallet ROI is confirmed.
In step 3, the foot coordinates of the pallet are obtained from the ROI using the pallet geometry information. The pallet center coordinate and pallet angle are then calculated. A sliding average filter is constructed to filter the pose parameters. Finally, the distance of the target is fed back to step 2 to calculate the template size.
This algorithm achieves real-time parameter updates. All details are described in the following sections.
2.3. Classifier Training
2.3.1. Analysis and Feature Selection
The choice of features has a considerable impact on the speed and accuracy of an algorithm. Due to real-time and pixel-level accuracy requirements, the features should be easily extracted, pixel-level, and distinguishable. Three different types of pallets are presented in Figure 3. As mentioned above, the objects to be recognized are pallets, goods, and the ground/shelf. The most common features can be divided into three types: shape, texture-based features, and color. These features are discussed next with reference to the real situation.
(1) It is difficult to apply the same shape feature to represent all pallets and to extract information quickly, because a pallet's apparent shape and size vary with the viewing angle.
(2) Texture-based features are generally local features that use a statistical vector to express the gray-level change in the cell around each pixel, such as the histogram of oriented gradients (HOG), local binary patterns (LBP), and the scale-invariant feature transform (SIFT). These features are widely used in complex object recognition tasks [44,45]. However, pallets are often made of plastic with a smooth, flat surface, so texture information may not be enough for the algorithm to segment the target from the background, and classification takes more time due to the expensive computation.
(3) Because a source image is recorded in the RGB channel format, it can be transformed into the HSI color space to segment a target easily with little computational cost. As shown in Figure 2, the color feature fits most situations. Therefore, the effective color components R, G, B, H, and S were selected as the input features to train the classifiers in this study.
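To make this concrete, the sketch below assembles the five selected color components for every pixel of an image. It is a minimal sketch assuming OpenCV's BGR channel order; since OpenCV provides no direct HSI conversion, HSV is used here as a stand-in for HSI, and the function name is illustrative.

```python
# Sketch: build per-pixel [R, G, B, H, S] feature rows from a BGR image.
# HSV is used as a stand-in for HSI (OpenCV has no direct HSI conversion).
import cv2
import numpy as np

def extract_pixel_features(bgr_image: np.ndarray) -> np.ndarray:
    """Return an (H*W, 5) array with one [R, G, B, H, S] row per pixel."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    b, g, r = cv2.split(bgr_image)
    h, s, _ = cv2.split(hsv)
    features = np.stack([r, g, b, h, s], axis=-1).astype(np.float32)
    return features.reshape(-1, 5)
```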
2.3.2. Classifier Construction and Category Matrix Creation
This study uses a dataset for pixel classification. The dataset contains 21 RGB images of three types of pallets taken at different distances. It was collected and photographed with a Bumblebee XB3 camera (Point Grey) in the XiZhou agriculture material warehouse, XinTang, Guangdong, and annotated with Photoshop (Adobe Systems Incorporated). The pixels corresponding to goods, pallets, pallet holes, and the ground are labeled A, B, C, and D, respectively. Each category contains more than 20,000 pixels as classifier samples, and the sample sizes of the categories are balanced. The data were split into 80% for training and 20% for testing.
The input vector x contains the R, G, B, H, and S color information of a pixel, as shown in Equation (1):

x_{i,j} = [R_{i,j}, G_{i,j}, B_{i,j}, H_{i,j}, S_{i,j}]^T    (1)

The output is the category corresponding to each pixel, represented by a number, as shown in Equation (2):

y_{i,j} ∈ {1, 2, 3, 4}    (2)

where i and j represent the pixel's column and row, respectively, and the numbers 1-4 correspond to categories A-D.
This study uses a support vector machine (SVM) as the pixel classifier. The SVM classifier is a supervised machine learning technique and one of the most popular discriminative classifiers [46]. This classifier applies the kernel trick to maximum-margin hyperplanes to solve linear and nonlinear classification problems. A linear discriminant function is defined as shown in Equation (3):

g(x) = w^T x + b    (3)

where w and b are the optimal parameters of the maximum-margin hyperplane, obtained from Equation (4):

min_{w,b} (1/2)||w||^2  subject to  y_i (w^T x_i + b) ≥ 1    (4)

where y_i ∈ {−1, +1} is the class label of training sample x_i.
The classifier was trained with the SVM model from scikit-learn (sklearn), an open-source library.
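A minimal training sketch with scikit-learn is shown below. The placeholder data, the RBF kernel, and the feature scaling step are illustrative assumptions rather than the exact configuration used in this study; in practice, the feature rows come from the annotated pixels described above.

```python
# Sketch: train an SVM pixel classifier on [R, G, B, H, S] feature rows.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((1000, 5))          # placeholder features; real rows come from annotated pixels
y = rng.integers(1, 5, size=1000)  # placeholder labels 1-4 for categories A-D

# 80/20 split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))  # kernel choice is an assumption
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))

# At run time, a frame's pixels are classified and reshaped into the category
# matrix, e.g. clf.predict(features).reshape(height, width).
```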
A separate classification model is trained for each pallet loading scenario. In a standardized warehouse, goods and pallets correspond to each other, and different pallets are generally used for different goods, as shown in Figure 3, so the target goods can be used as a priori information to determine the type of pallet and select the corresponding classification model.
The RGB images acquired by the RGB-D sensor are input to the SVM classification model to obtain the category of each pixel. The categories are stored in the category matrix, which preserves both the category information and the spatial relationships. Figure 4 shows the result of classifying Figure 3a into a category matrix; the colors of the categories are the same as in the templates. Element labels are represented with different colors: green represents goods, red represents pallets, blue represents the ground, and light blue represents pallet holes. l_g and v^C_{i,j} represent the grid compression unit size and the grid unit vector of the category matrix, respectively (explained in Section 2.5 below).
2.4. Labeled Template Creation
As shown in Figure 3, pallets in a warehouse environment share the same spatial characteristics: goods are placed above the pallet, the pallet sits on the ground, and the pallet corresponds to the category of the goods. Therefore, we established a labeled template that preserves the dimensions of the pallet and the spatial characteristics mentioned above, as shown in Figure 5. The labeled template contains four categories: goods, pallet, pallet hole, and the ground or shelf. The heights of the goods and ground labels of the template are set equal to the height of the pallet. Element labels are represented with different colors: green represents goods, red represents pallets, blue represents the ground, and light blue represents pallet holes. The grids are the compression units of the labeled template. l_g and v^T_{i,j} represent the grid compression unit size and the grid unit vector of the labeled template, respectively (explained in Section 2.5 below). The pallet foot detection grids are explained in Section 2.7 below.
Based on the projection principle of the camera and the spatial relationship shown in Figure 6, when the distance between the pallet and the camera is d, the pixel size of the pallet in the image coordinate system {I} is calculated with Equations (5) and (6):

H_I = f H_C / (μ d)    (5)
L_I = f L_C / (μ d)    (6)

where H_C and L_C represent the height and length of the pallet in the camera coordinate system {C}, respectively; H_I and L_I represent the number of pixels of the pallet height and length in the image, respectively; f represents the focal length of the RGB camera in the RGB-D sensor; and μ represents the pixel size of the CMOS in the RGB camera. The whole template size is calculated with Equations (7) and (8) by adding the goods and ground label bands, each set to the pallet height.
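A short numerical sketch of this projection step is given below, using the symbols from the reconstruction of Equations (5) and (6); the focal length and pixel size in the example are illustrative values, not the Kinect 2 intrinsics.

```python
# Sketch: pallet pixel size at distance d via the pinhole model (Eqs. (5)-(6)).
def pallet_pixel_size(H_C_mm: float, L_C_mm: float, d_mm: float,
                      f_mm: float, mu_mm: float) -> tuple[int, int]:
    """Pixel height and length of the pallet face in the image at distance d."""
    H_I = round(f_mm * H_C_mm / (mu_mm * d_mm))
    L_I = round(f_mm * L_C_mm / (mu_mm * d_mm))
    return H_I, L_I

# Example: a 1300 x 150 mm pallet face seen from 2 m (intrinsics are illustrative).
H_I, L_I = pallet_pixel_size(H_C_mm=150, L_C_mm=1300, d_mm=2000,
                             f_mm=3.6, mu_mm=0.0031)
template_height = 3 * H_I  # goods and ground bands are each set to the pallet height
```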
2.5. Grid Compressing Template and Category Matrix
Because the size of the template and category matrix has a great impact on the speed of the algorithm, we propose a grid compression algorithm that compresses the template and category matrix to speed up the algorithm with little information loss [47]. The grid compression algorithm also improves the robustness of the algorithm and reduces the impact of pixel misclassification (shown in Figure 4) on the matching process. Because there are only four categories and most regions represent the same category, the template and category matrices contain much redundant information; the matrix size can therefore be reduced, and most of the information preserved, by grid compression.
The grid compression operation is shown in Figure 4 and Figure 5. The template and category matrix are compressed in the same proportion based on the grid. The size of each compression unit is l_g, which is calculated based on the size of the pallet in the template; the calculation principle is that each compressed grid unit of the template covers a single category, so the pallet structure information in the template is retained. After compression, the template and category matrix are l_g^2 times smaller. Finally, the proportions of the categories in each compressed grid unit are retained by means of a homogenization vector. The grid units of the template and category matrix are represented by v^T_{i,j} and v^C_{i,j}, respectively, as in Equations (9) and (10):

v^T_{i,j} = (1/l_g^2) [n_1, n_2, n_3, n_4]^T    (9)
v^C_{i,j} = (1/l_g^2) [n_1, n_2, n_3, n_4]^T    (10)

where n_k represents the number of pixels of category k in the (i, j) grid unit of the template or the category matrix, respectively.
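The sketch below illustrates this compression under the notation of Equations (9) and (10); the array layout and the cropping of edge pixels are illustrative implementation choices. The same routine compresses both the labeled template and the category matrix, so the two stay aligned during matching.

```python
# Sketch: compress an (H, W) category matrix into a grid of homogenization
# vectors holding per-category pixel proportions (Eqs. (9)-(10)).
import numpy as np

def grid_compress(category_matrix: np.ndarray, lg: int,
                  n_categories: int = 4) -> np.ndarray:
    H, W = category_matrix.shape
    cropped = category_matrix[:H - H % lg, :W - W % lg]  # drop edge pixels
    blocks = cropped.reshape(H // lg, lg, W // lg, lg).swapaxes(1, 2)
    # Count each category inside every lg x lg unit, then normalize to proportions.
    counts = np.stack([(blocks == c).sum(axis=(2, 3))
                       for c in range(1, n_categories + 1)], axis=-1)
    return counts / float(lg * lg)  # shape: (H//lg, W//lg, n_categories)
```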
2.6. Template Matching
The position of the pallet was set to be determined by matching the template and category matrix with the sliding window method [
48,
49]. The matching process was a convolution operation. The template matrix matches the category matrix to calculate the matching degree. The matching degree is evaluated by the matching score of grid units, and each unit’s score is calculated based on Equation (11). A higher matching score corresponds to a higher probability of being a pallet. If the matching score is higher than MiniScore, the pallet position is determined. MiniScore is determined with a threshold, as shown in Equation (12). Multiple pallets can reach the peak of the matching score matrix, and pallets have no overlap. Thus, a non-maximum suppression algorithm is used to obtain multiple pallet coordination.
In Equations (11) and (12), S_{ij} is the matching score of each unit, and s_{thred} is the threshold of the matching score.
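The sketch below illustrates the sliding-window matching and non-maximum suppression on the compressed grids. The dot-product score is an assumed stand-in for Equation (11), and the suppression radius is illustrative.

```python
# Sketch: slide the compressed template over the compressed category matrix,
# score every placement, and keep non-overlapping peaks above the threshold.
import numpy as np

def match_template(cat_vecs: np.ndarray, tpl_vecs: np.ndarray) -> np.ndarray:
    gh, gw, _ = cat_vecs.shape
    th, tw, _ = tpl_vecs.shape
    scores = np.full((gh - th + 1, gw - tw + 1), -np.inf)
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            window = cat_vecs[i:i + th, j:j + tw]
            scores[i, j] = (window * tpl_vecs).sum()  # proportion agreement (assumed score)
    return scores

def non_max_suppression(scores: np.ndarray, min_score: float, radius: int):
    """Return (row, col, score) peaks; suppress neighbors within the radius."""
    peaks, s = [], scores.copy()
    while s.size and s.max() > min_score:
        i, j = np.unravel_index(np.argmax(s), s.shape)
        peaks.append((i, j, s[i, j]))
        s[max(0, i - radius):i + radius + 1, max(0, j - radius):j + radius + 1] = -np.inf
    return peaks
```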
2.7. Estimation of Pallet Pose Parameters
The pallet pose parameters include the center coordinates (x_c, y_c, z_c) of the pallet detection surface and the inclination angle θ of the pallet to be measured, as shown in Figure 1. These parameters are calculated from the depth information of the pallet feet, which is extracted from the RGB-D camera depth data. To reduce the influence of the angle between the sensor and the pallet, the center grids of the pallet feet are selected as the detection sampling area, as shown in Figure 1.
In the camera coordinate system {C}, the pallet pose parameters are calculated as shown in Equations (13) and (14). The center coordinate (x_c, y_c, z_c) is calculated by averaging the corresponding point cloud data in the detection grids of the center pallet feet (Equation (13)). The angle θ is calculated by fitting the slope of all points within the detection grids of the pallet feet in the X-Y plane of {C} (Equation (14)). A sliding average filter is used to smooth the successive detection results, and the estimated values are updated every cycle, as shown in Equations (15) and (16):

p̂_N = (1/n) Σ_{k=N-n+1}^{N} p_k    (15)
θ̂_N = (1/n) Σ_{k=N-n+1}^{N} θ_k    (16)

where p_N and θ_N represent the position and angle obtained for the Nth detection, p̂_N and θ̂_N represent their estimated values after filtering, respectively, and n is the filter window length. Note that when pallets are detected dynamically, the detection results of each loop are transformed into world coordinates before estimation.
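The sketch below illustrates the pose calculation and filtering under the reconstructions of Equations (13)-(16); the line-fitting call and the window length are illustrative choices.

```python
# Sketch: pallet center from the mean of the foot-grid points (Eq. (13)),
# angle from a line fit in the X-Y plane of {C} (Eq. (14)), then smoothing
# with a sliding average (Eqs. (15)-(16)).
from collections import deque
import numpy as np

def pallet_pose(foot_points: np.ndarray) -> tuple[np.ndarray, float]:
    """foot_points: (N, 3) point cloud sampled from the pallet-foot detection grids."""
    center = foot_points.mean(axis=0)
    slope = np.polyfit(foot_points[:, 0], foot_points[:, 1], 1)[0]  # fit y = kx + b
    theta = np.degrees(np.arctan(slope))
    return center, theta

class SlidingAverage:
    """Window-n moving average for smoothing successive detections."""
    def __init__(self, n: int = 5):  # window length is an assumption
        self.buf = deque(maxlen=n)

    def update(self, value):
        self.buf.append(np.asarray(value, dtype=float))
        return np.mean(self.buf, axis=0)
```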
3. Algorithm Performance Test
Four experiments were designed to test the performance of the algorithm: (1) algorithm comparison, (2) multiple pallet recognition, (3) accuracy test in static experiment, and (4) accuracy test in dynamic experiment.
3.1. Template Matching Algorithm Comparison
To demonstrate the advantages of the proposed algorithm, a comparison was performed with the template matching algorithms provided by OpenCV 3.0. Those algorithms were tested with the labeled template without compressing the template and category matrix. Gray-template and color-template tests were not performed because of their low success rates in changing environments. Three typical situations were tested: large angles, background changes, and obstruction by obstacles. The figures below present comparisons of these situations.
3.2. Multiple Pallet Recognition Algorithm Comparison
To illustrate the performance of the proposed algorithm in recognizing multiple and different pallets in a warehouse environment where different goods are stored, it is compared with the deep-learning-based YOLOv5 approach in this paper.
A pallet dataset was photographed with a Bumblebee XB3 camera in the XiZhou agriculture material warehouse (XinTang, Guangdong, China). The test set contained 21 pallet images taken at different distances and in different arrangements, and the training set contained 122 images captured in the warehouse.
Because the dataset is small, YOLOv5 uses mosaic data augmentation, which stitches four images together with random scaling, random cropping, and random arrangement to enrich the dataset. It also uses a pre-trained model to speed up training. YOLOv5 passes the input images through the focus structure to downsample them and then sends them to the backbone, CSPDarknet53. All input images are resized to 640 × 640, and training uses a batch size of 64, 300 epochs, a learning rate of 0.01, a weight decay of 0.0005, and a momentum of 0.937.
To verify the performance of the proposed method, precision and recall are used as evaluation metrics in this study.
Precision indicates the proportion of correct detections among all detection results, and recall indicates the proportion of correct detections among all ground truths. They can be denoted as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

where TP, FP, and FN represent the numbers of correctly detected, falsely detected, and missed objects, respectively.
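For illustration, the sketch below derives TP, FP, and FN by greedy IoU matching between detections and ground-truth boxes before applying the two equations; the 0.5 IoU threshold is a common convention assumed here, not a value stated in this study.

```python
# Sketch: compute precision and recall from boxes given as (x1, y1, x2, y2).
def iou(a, b):
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(detections, ground_truths, iou_thr=0.5):
    matched, tp = set(), 0
    for det in detections:
        best = max(range(len(ground_truths)),
                   key=lambda g: iou(det, ground_truths[g]), default=None)
        if best is not None and best not in matched \
                and iou(det, ground_truths[best]) >= iou_thr:
            matched.add(best)
            tp += 1
    fp, fn = len(detections) - tp, len(ground_truths) - tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```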
3.3. Accuracy Test in Static Experiment
Accuracy and time consumption are important indicators for a pallet recognition algorithm. An experiment was designed to test the relationships between detection accuracy, pallet distance, and angle. This experiment was conducted in a room to easily control the environmental conditions. The sensor and pallet were set up as shown in Figure 1. The pallet was moved to change the distance d and the angle θ after the sensor was fixed. The results were transformed into {W} space using the pose of the sensor.
The distance values were chosen as 1000, 2000, 3000, and 4000 mm because the detection range of the depth sensor is 500−4500 mm. θ was chosen as 0, ±5°, ±10°, ±15°, ±20°, and ±25°. The estimation results and errors obtained by the algorithm were recorded and counted. The overall computation time and the computation time of the main processes of the algorithm—classifying the source image, creating the template, compressing, matching, extracting ROI, and estimating pallet parameters—were also recorded.
The proposed algorithm was compiled based on OpenCV 3.0 and run on a PC with 16 GB of RAM, an Intel Core i7-6820 processor, and the Windows 10 operating system. The sensor was a Kinect 2 (Microsoft, USA), which obtains RGB images with a resolution of 1920 × 1080 pixels and depth images with a resolution of 512 × 424 pixels. The Kinect 2 measures distance based on time of flight; its depth detection range is 0.5–4.5 m, and its detection accuracy is ±4 mm. The pallet size was 1300 × 1300 × 150 mm.
3.4. Accuracy Test in Dynamic Experiment
In this paper, dynamic performance refers to the algorithm's measurement accuracy while the RGB-D sensor approaches the pallet. In this experiment, the RGB-D sensor moved toward the pallet with a changing distance and angle between the pallet and sensor to show the influence of these changes on the engaging process. To present the experimental parameters concisely, all result data were transformed into {W} space, as shown in Figure 1. The detection results were transformed from {C} space using the RGB-D sensor position and angle in {W} provided by a NAV350 localization sensor (SICK, Germany; scanning frequency: 8 Hz; positioning accuracy: ±4 mm), a laser scanner that obtains the position and angle of the RGB-D sensor by detecting reflective marks fixed in the room.
The position of the pallet was fixed at (0, 0, 75), and θ was 90°. The sensor moved in from 4 m to 1.5 m, and its angle changed within three different ranges: 0−10°, 20−30°, and 0−40°. The pose of the sensor and the error of each parameter were recorded.
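As an illustration of the coordinate transformation used to report these results, the sketch below maps a detection from {C} into {W} given a planar (x, y, yaw) sensor pose; the planar pose model is an assumption based on the ground-vehicle setup.

```python
# Sketch: transform a detection from the sensor frame {C} to the world frame {W}.
import numpy as np

def camera_to_world(p_C: np.ndarray, sensor_xy: np.ndarray,
                    sensor_yaw_rad: float) -> np.ndarray:
    """p_C: (x, y) of the pallet center in {C}; returns (x, y) in {W}."""
    c, s = np.cos(sensor_yaw_rad), np.sin(sensor_yaw_rad)
    R = np.array([[c, -s], [s, c]])  # planar rotation of {C} relative to {W}
    return R @ p_C + sensor_xy

def angle_to_world(theta_C_rad: float, sensor_yaw_rad: float) -> float:
    return theta_C_rad + sensor_yaw_rad  # pallet yaw offset by the sensor yaw
```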
5. Discussion
5.1. Comparative Performance Analysis
The template matching methods provided by OpenCV 3.0 have a strict requirement that the template must be similar to the target in the image, but pallet images change as the angle changes. In addition, those matching methods are designed for gray or color images, not a label matrix. These methods show worse performance when the pallet angle is larger than ±15°, and the matching result is easily affected by environmental interference.
In contrast, the proposed algorithm compresses the category matrix and template to reduce the environment effect, and the matching method is a probability calculation that is more suitable for a label matrix. Consequently, the proposed algorithm was shown to perform better in the face of pallet and environmental changes.
Moreover, the grid compression matching algorithm has many advantages. For instance, the labeled pallet template contains richer information than a template of an individual pallet, including the spatial relationships among goods, the ground, pallet feet, and pallet holes. Meanwhile, a labeled template is easier to create than a gray or color template because it is unchanged under different illumination conditions and has only four labels. The labeled template algorithm, with its strong robustness and ease of implementation, could be applied to many object recognition problems.
5.2. Multiple Pallet Recognition Performance in Warehouse
In a warehouse, goods are placed in a very dense arrangement, making it difficult to segment pallets. Moreover, pallets are often made of plastic, so distinct color features are the most important information for this pallet recognition algorithm. However, color information is affected by illumination conditions, so the classifier cannot always categorize pixels precisely. To resolve this issue, the labeled template is used as a second classifier to identify the pallet and reduce the interference of incorrect classification.
One reason for misrecognition could be that most of the missed pallets in the test were not completely photographed, so their matching scores were less than the threshold, as shown in Figure 7a,c. Because the size of the template is determined by the distance, some pallets are not recognized when the distance between multiple pallets is large, as shown in Figure 7a. Another reason is that the color spaces of different labels overlap in dark illumination, leading to incorrect classification.
The main approach to improving the accuracy of YOLOv5 is to increase the number of pallet samples; moreover, multi-angle pallet images should be added to the samples to ensure the robustness of the algorithm. The work of sample collection and labeling requires a considerable workload.
5.3. Detection Accuracy Performance
Accuracy is an important indicator for a pallet recognition algorithm. The best algorithms should be useful at long detection distances and large observable angles. A static experiment was performed to test the relationship between accuracy and pallet pose parameters. A dynamic experiment was performed to explore the influence of changing distance and angle on the algorithm.
In the static and dynamic accuracy experiments, the detection accuracy was associated with the detection distance and angle. The error was positively correlated with the angle of the pallet. As the detection distance approached the edge of the depth sensor's effective range, the depth sensor's measurement errors affected the detection accuracy.
One reason for the increase in detection errors with increasing pallet angle could be the change in the apparent length of the pallet in the image as the angle changes. Only one shape template was created to match rotated pallets, and the length of the template was longer than that of the pallets in the image according to the projection principle. This caused the pallet foot measurement points to shift, and the measurement error grew as the rotation angle increased.
In the dynamic accuracy experiment, the detection error fluctuated with the angle change. The performance in the dynamic test matched the tendency shown in the static test. There may have been three major reasons why the error in the dynamic test was larger than that in the static experiment.
The first reason was the size error between the template and the pallet in the images as the distance and angle changed. The second was the error in the transformation matrix between the RGB-D sensor and the location sensor, which increased with distance. The final reason was the communication delay between the RGB-D sensor and the location sensor, during which noisy location data affected the detection results.
5.4. Real-Time Performance of Proposed Algorithm
The time consumption of the proposed algorithm was tested in the static experiment. Time consumption decreased as the detection distance decreased, and there was no relationship between time consumption and the rotation angle of the pallet. Template matching is the main computational cost, and it is determined by the sizes of the template and category matrix; their sizes are determined by the compression unit size, which is calculated from the distance according to Equation (5). To accelerate the proposed algorithm further, the matching process should be optimized. The time consumption results indicate that the algorithm can detect pallets in real time when the detection distance is less than 4 m.
However, the developed algorithm still has some shortcomings. First, illumination changes throughout the entire day, and the training samples hardly cover all illumination conditions. Second, when the pallet color is similar to that of the ground or goods, the color space of each category overlaps and leads to incorrect classification. Third, the compression unit size has a significant impact on pallet recognition performance, and it should be carefully selected by considering the size of the pallet, the camera parameters, and the pallet distance.