Which Framework Is Suitable For Online 3D Multi-Object Tracking For Autonomous Driving With Automotive 4D Imaging Radar?
Abstract— Online 3D multi-object tracking (MOT) has recently received significant research interest due to the expanding demand for 3D perception in advanced driver assistance [...] autonomous vehicles to achieve robust and accurate 3D perception by eliminating uncertainties in data association and multi-object state estimation. Due to the advances in sensor [...]
arXiv:2309.06036v4 [eess.SP] 25 May 2024
[Figure: three block diagrams.
Framework I: Tracking-by-Detection with Point Object Tracker (TBD with POT): 4D imaging radar point cloud → 3D object detector → center positions of bounding boxes with class info. → GNN-PMB filter (PMB predict & update: PPP and MB multi-object densities with Gaussian single-object density; obtain & update the best global association hypothesis; track maintenance; state extraction) → position-refined bounding boxes with ID.
Framework II: Joint Detection and Tracking with Extended Object Tracker (JDT with EOT): 4D imaging radar point cloud → DBSCAN & k-means clustering with different parameters → multiple measurement partitions → GGIW-PMBM filter (PMBM predict & update: PPP and MBM multi-object densities with GGIW single-object density; obtain & update the K-best global association hypotheses; track maintenance; state extraction) → estimated object position, extent & ID.
Framework III: Tracking-by-Detection with Extended Object Tracker (TBD with EOT): 4D imaging radar point cloud → 3D object detector → select effective points & cluster by bounding boxes with class info. → measurement partition → GGIW-PMBM filter (as in Framework II) → estimated object position, extent & ID.]
Fig. 1: The illustration of three different frameworks for online 3D MOT with 4D imaging radar point cloud.
including TBD-POT, JDT-EOT, and our proposed TBD-EOT; see Fig. 1 for an illustration.

A. Framework I: TBD with POT

The TBD-POT framework has been widely adopted in the literature for MOT with different sensor modalities, e.g., [3, 16, 51, 52]. In this tracking framework, the 4D imaging radar point cloud is first processed by an object detector to generate 3D bounding boxes that provide information such as object position, bounding box size, orientation, class, and detection score. The POT algorithm often takes two-dimensional object position measurements in Cartesian coordinates and performs MOT on the bird's-eye view (BEV) plane to simplify calculations. Other information from the 3D bounding boxes is then combined with the estimated object positions and IDs to generate 3D tracking results. The TBD-POT framework has two main advantages: 1) the POT algorithm can leverage extra information such as object classes and detection scores to further improve tracking performance; 2) POT is typically less compute-intensive than EOT.

The GNN-PMB filter [18], one of the state-of-the-art POT approaches for LiDAR-based online 3D MOT, is selected as the POT algorithm. The filter estimates the multi-object state by propagating a PMB density over time, which combines a Poisson point process (PPP) for modeling undetected objects and a multi-Bernoulli (MB) process for modeling detected objects. Data association is achieved by managing local and global hypotheses. At each time step, a measurement can be matched with a previously tracked object, a newly detected object, or a false alarm to generate different local hypotheses with corresponding costs. Then, a group of compatible local hypotheses is collected in a global hypothesis [53], which defines a possible association between existing objects and measurements. Finally, the optimal data association result, which is the global hypothesis with the lowest total cost, is obtained by solving the 2D-assignment problem on the cost matrix. Different from the PMBM filter, which calculates and propagates multiple global hypotheses, GNN-PMB only propagates the best global hypothesis to reduce computational complexity without significantly deteriorating tracking performance [18]. In summary, the first online 3D MOT framework in this paper combines a deep learning-based 3D object detector with the GNN-PMB filter, as illustrated in the first row of Fig. 1.

B. Framework II: JDT with EOT

In contrast to the first framework, the JDT-EOT framework operates on 4D radar point clouds by detecting and tracking multiple objects simultaneously. The point clouds go through a gating and clustering process to generate the measurement partition (a group of disjoint clusters); then, an EOT filtering algorithm performs 3D MOT using these clusters. To reduce computational complexity, clusters can be matched with objects during the gating process only if they fall within a distance threshold around the predicted position of the object. Clusters that cannot be matched with any existing object are then assigned to newborn objects or considered as clutter. Theoretically, this framework has the potential to provide more accurate estimates of object position and shape while also reducing false negatives, since the point clouds contain more information than pre-processed 3D bounding boxes. However, it is challenging to produce proper measurement partitions, particularly for 4D radar point clouds with many ambiguities and much clutter. As the distribution and density of point clouds can vary between objects, different clustering algorithms, such as DBSCAN [54] and k-means [55], with different parameter settings are usually employed to generate as many different measurement partitions as possible. This further increases EOT's computational complexity and poses a challenge to the real-time performance of this framework.
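The multi-partition clustering step described above can be sketched as follows. This is an illustrative stand-in only: a simple single-linkage clusterer replaces DBSCAN/k-means, and the distance thresholds in the example are hypothetical, not parameters from the paper.

```python
from itertools import combinations

def cluster(points, eps):
    """Single-linkage clustering: points closer than eps end up in one cluster
    (via union-find). A stand-in for the DBSCAN/k-means runs in the paper."""
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(len(points)), 2):
        if sum((a - b) ** 2 for a, b in zip(points[i], points[j])) <= eps ** 2:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())

def measurement_partitions(points, eps_values):
    """One partition (a set of disjoint clusters of point indices) per
    parameter setting; duplicate partitions are removed before being
    passed to the EOT filter."""
    seen, partitions = set(), []
    for eps in eps_values:
        p = cluster(points, eps)
        key = tuple(tuple(c) for c in p)
        if key not in seen:
            seen.add(key)
            partitions.append(p)
    return partitions

# Two close radar returns near the origin, one far away:
pts = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0)]
print(measurement_partitions(pts, [0.8, 20.0]))
# → [[[0, 1], [2]], [[0, 1, 2]]]: the fine threshold keeps the far point
#   separate; the coarse threshold merges everything into one cluster
```

Each distinct partition multiplies the number of data-association hypotheses the EOT filter must evaluate, which is the computational-complexity concern raised above.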
In order to implement the JDT-EOT framework, as illustrated in the second row of Fig. 1, we select the PMBM filter with the gamma Gaussian inverse Wishart model (GGIW-PMBM), which is recognized as one of the state-of-the-art EOT algorithms due to its high estimation accuracy and manageable computational complexity [22, 23]. The PMBM filter [56] models object-originated measurements with the multi-Bernoulli mixture (MBM) density and propagates multiple global hypotheses to contend with the uncertainty of data association. The GGIW model assumes that the number of object-generated measurements is Poisson distributed and that the single-measurement likelihood is Gaussian. Under this assumption, each object has an elliptical shape represented by the inverse Wishart (IW) density, and the major and minor axes of this ellipse can be used to form a rectangular bounding box. This simple but flexible extent modeling is sufficient to model different classes of objects [21–24]. More importantly, the GGIW implementation has the lowest computational complexity among all existing EOT implementations [20], which is desirable for real-time 3D MOT. Without detection bounding boxes and class information, the extent of a newborn object is initialized based on the spatial size of the associated cluster. Additionally, the extent estimates are processed by non-maximum suppression (NMS) [57] to reduce physically impossible overlapping tracking results.

C. The Proposed Framework III: TBD with EOT

To leverage the strengths of a deep learning-based object detector and EOT, we present TBD-EOT as the third MOT framework. Instead of directly performing EOT on the radar point cloud, the points within detected bounding boxes are selected for clustering, since these "effective" points are more likely to originate from objects than from clutter. Compared with JDT-EOT, the advantage of the TBD-EOT framework is twofold. First, the computational complexity of the data association in EOT can be substantially reduced by removing the clutter points, which leads to improved tracking performance with fewer false tracks. Second, the EOT algorithm can utilize the information from the detector to further improve tracking performance, for example, by setting optimized parameters for different object classes and discarding bounding boxes with low detection scores. Compared with TBD-POT, the TBD-EOT framework employs a more realistic measurement model and has the potential to produce accurate object bounding boxes from extent estimates. As shown in the third row of Fig. 1, this MOT framework is implemented using the same 3D object detector as the TBD-POT framework along with the GGIW-PMBM filter.

IV. EXPERIMENTS AND ANALYSIS

A. Dataset and Evaluation Metrics

We evaluate each online 3D MOT framework on two recently released 4D imaging radar-based autonomous driving datasets: View-of-Delft (VoD) [44] and TJ4DRadSet [45]. Both datasets contain synchronized 4D imaging radar, LiDAR, and camera data with high-quality annotations. Each framework is evaluated with three object classes (car, pedestrian, and cyclist) on the validation set of VoD (sequence numbers 0, 8, 12, and 18) and part of the test set of TJ4DRadSet (sequence numbers 0, 10, 23, 31, and 41). These selected sequences cover various driving conditions and contain different classes of objects in balanced quantities. SMURF [43], a state-of-the-art object detector for 4D imaging radar point clouds, is selected to extract bounding box detections for implementing TBD-POT and TBD-EOT. Since object class information is inaccessible for JDT-EOT, a heuristic classification step is employed in the state extraction procedure of this framework. In this step, unclassified tracking results are separated into cars, pedestrians, cyclists, and other objects based on the width and length of the estimated bounding boxes.

In the following evaluations, a set of commonly accepted MOT metrics for ADAS and AD is evaluated on the BEV plane, including multiple object tracking accuracy (MOTA), multiple object tracking precision (MOTP), true positives (TP), false negatives (FN), false positives (FP), and ID switches (IDS). In addition, we employ a recently proposed MOT metric, higher order tracking accuracy (HOTA) [58]. HOTA decomposes into a family of sub-metrics, including detection accuracy (DetA), association accuracy (AssA), and localization accuracy (LocA), thus enabling a clear analysis of the MOT performance.

Most notably, the MOTA, MOTP, and HOTA metrics are calculated based on TP, FN, and FP, which are determined by the similarity score S defined as

    S = max(0, 1 − d(p, q)/d0)    (1)

where d(p, q) is the Euclidean distance between an object's estimated position p and its corresponding ground-truth position q, and d0 is the distance at which S reduces to zero. The pairs of estimation and ground truth satisfying S ≥ α are matched and counted as TPs, where α is the localization threshold. The remaining unmatched estimations become FPs, and unmatched ground truths become FNs. The zero-distance is set to d0 = 4 m in the following evaluations. For MOTA and MOTP, the localization threshold is set to α = 0.5. This setting indicates that an estimation can match with a ground truth if the Euclidean distance between their center positions is no more than 2 m, which is aligned with the nuScenes [59] tracking challenge¹. HOTA is calculated by averaging the results over different α values (0.05 to 0.95 with an interval of 0.05, as suggested in [58]).

B. Comparison between Different Tracking Frameworks

The evaluation results of the three online 3D MOT frameworks on VoD and TJ4DRadSet are provided in Table I and Table II, respectively. The hyper-parameters of the implemented algorithms, specifically SMURF + GNN-PMB, GGIW-PMBM, and SMURF + GGIW-PMBM, are fine-tuned on the training sets by optimizing the HOTA metric.

¹ https://www.nuscenes.org/tracking
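The TP/FP/FN matching driven by Eq. (1) can be sketched as follows. The greedy pairing here is illustrative only; evaluation toolkits typically solve an optimal assignment instead, and the function names are our own.

```python
import math

def similarity(p, q, d0=4.0):
    """Eq. (1): S = max(0, 1 - d(p, q)/d0), with d the Euclidean distance
    between estimated position p and ground-truth position q."""
    return max(0.0, 1.0 - math.dist(p, q) / d0)

def match(estimates, ground_truths, alpha=0.5, d0=4.0):
    """Pairs with S >= alpha count as TPs; leftover estimates are FPs and
    leftover ground truths are FNs. Greedy best-score-first matching is
    a simplification of the optimal assignment used by evaluation code."""
    pairs = sorted(((similarity(e, g, d0), i, j)
                    for i, e in enumerate(estimates)
                    for j, g in enumerate(ground_truths)), reverse=True)
    used_e, used_g, tp = set(), set(), 0
    for s, i, j in pairs:
        if s >= alpha and i not in used_e and j not in used_g:
            used_e.add(i); used_g.add(j); tp += 1
    return tp, len(estimates) - tp, len(ground_truths) - tp  # TP, FP, FN

# alpha = 0.5 with d0 = 4 m means a match requires <= 2 m center distance,
# as stated in the text:
print(match([(0.0, 0.0), (9.0, 0.0)], [(1.5, 0.0), (30.0, 0.0)]))
# → (1, 1, 1): the first estimate matches (1.5 m away); the rest are FP/FN
```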
TABLE I: 4D imaging radar-based 3D MOT tracking results on the VoD validation set

Method | Framework | Class | HOTA†↑ | DetA†↑ | AssA†↑ | LocA†↑ | MOTA†↑ | MOTP†↑ | TP↑ | FN↓ | FP↓ | IDS↓
SMURF + GNN-PMB | TBD-POT | car | 54.36 | 42.64 | 69.34 | 93.90 | 36.26 | 93.60 | 2190 | 2101 | 593 | 41
SMURF + GNN-PMB | TBD-POT | pedestrian | 53.23 | 45.29 | 62.62 | 94.91 | 40.65 | 94.42 | 1925 | 1824 | 352 | 49
SMURF + GNN-PMB | TBD-POT | cyclist | 65.77 | 60.71 | 71.27 | 93.78 | 57.95 | 93.58 | 1045 | 389 | 201 | 13
GGIW-PMBM | JDT-EOT | car | 8.35 | 5.60 | 12.62 | 70.28 | -78.75* | 66.58 | 567 | 3724 | 3839 | 107
GGIW-PMBM | JDT-EOT | pedestrian | 16.21 | 7.40 | 36.00 | 89.28 | -9.79* | 90.92 | 326 | 3423 | 660 | 33
GGIW-PMBM | JDT-EOT | cyclist | 21.21 | 10.09 | 44.67 | 90.73 | -114.30* | 91.34 | 361 | 1073 | 1990 | 10
SMURF + GGIW-PMBM | TBD-EOT | car | 47.15 | 35.70 | 62.45 | 82.70 | 38.22 | 79.68 | 2145 | 2146 | 491 | 12
SMURF + GGIW-PMBM | TBD-EOT | pedestrian | 55.27 | 44.22 | 69.13 | 94.15 | 39.96 | 93.52 | 1906 | 1843 | 378 | 26
SMURF + GGIW-PMBM | TBD-EOT | cyclist | 66.47 | 58.48 | 75.64 | 92.68 | 54.32 | 92.25 | 1089 | 345 | 302 | 8
* The MOTA values of GGIW-PMBM are negative because there are significantly more FNs and FPs than TPs, while MOTA = 1 − (FN+FP+IDS)/(TP+FN).
† These metrics are multiplied by 100. Bold values indicate the best result for each object class.
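The MOTA formula in the table footnote can be checked directly against the table entries (the denominator TP + FN is the number of ground-truth objects):

```python
def mota(tp, fn, fp, ids):
    """MOTA = 1 - (FN + FP + IDS) / (TP + FN), per the footnote of Table I."""
    return 1.0 - (fn + fp + ids) / (tp + fn)

# Car class, SMURF + GNN-PMB on the VoD validation set (Table I):
print(round(100 * mota(tp=2190, fn=2101, fp=593, ids=41), 2))   # → 36.26
# Car class, GGIW-PMBM on VoD: heavy FP/FN counts drive MOTA below zero:
print(round(100 * mota(tp=567, fn=3724, fp=3839, ids=107), 2))  # → -78.75
```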
TABLE II: 4D imaging radar-based 3D MOT tracking results on the TJ4DRadSet test set

Method | Framework | Class | HOTA†↑ | DetA†↑ | AssA†↑ | LocA†↑ | MOTA†↑ | MOTP†↑ | TP↑ | FN↓ | FP↓ | IDS↓
SMURF + GNN-PMB | TBD-POT | car | 43.41 | 32.19 | 59.32 | 89.50 | 24.56 | 88.36 | 961 | 1331 | 378 | 20
SMURF + GNN-PMB | TBD-POT | pedestrian | 31.21 | 27.96 | 34.82 | 96.00 | 20.17 | 95.95 | 294 | 638 | 99 | 8
SMURF + GNN-PMB | TBD-POT | cyclist | 42.22 | 35.43 | 50.32 | 93.28 | 23.74 | 92.82 | 448 | 542 | 200 | 13
GGIW-PMBM | JDT-EOT | car | 16.45 | 6.86 | 39.60 | 78.54 | -89.88* | 72.32 | 424 | 1868 | 2435 | 49
GGIW-PMBM | JDT-EOT | pedestrian | 13.19 | 10.03 | 17.41 | 93.52 | -132.30* | 94.52 | 239 | 693 | 1462 | 10
GGIW-PMBM | JDT-EOT | cyclist | 23.28 | 8.65 | 62.79 | 90.53 | -92.32* | 89.94 | 195 | 795 | 1105 | 4
SMURF + GGIW-PMBM | TBD-EOT | car | 38.16 | 28.35 | 51.88 | 82.19 | 24.39 | 78.55 | 962 | 1330 | 397 | 6
SMURF + GGIW-PMBM | TBD-EOT | pedestrian | 41.60 | 27.10 | 63.87 | 95.36 | 21.89 | 95.22 | 279 | 653 | 68 | 7
SMURF + GGIW-PMBM | TBD-EOT | cyclist | 49.48 | 36.42 | 67.23 | 92.05 | 20.10 | 91.46 | 505 | 485 | 298 | 8
* The MOTA values of GGIW-PMBM are negative because there are significantly more FNs and FPs than TPs, while MOTA = 1 − (FN+FP+IDS)/(TP+FN).
† These metrics are multiplied by 100. Bold values indicate the best result for each object class under each metric.
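The bounding-box widths and lengths analyzed below come from converting the elliptical GGIW extent estimate into a rectangular box via the major and minor axes of the ellipse, as described for the GGIW model above. A minimal sketch of that conversion follows; the function name and the axis scale convention (semi-axes taken as the square roots of the eigenvalues) are our assumptions, and implementations may scale the extent matrix differently.

```python
import math

def extent_to_box(X):
    """Convert a 2x2 positive-definite extent matrix X (ellipse semi-axes =
    square roots of its eigenvalues) into a BEV rectangular bounding box
    (length, width, heading in radians), via the closed-form 2x2
    eigendecomposition of a symmetric matrix."""
    a, b, c = X[0][0], X[0][1], X[1][1]
    tr, det = a + c, a * c - b * b
    disc = math.sqrt(max(tr * tr / 4 - det, 0.0))
    lam1, lam2 = tr / 2 + disc, tr / 2 - disc        # eigenvalues, lam1 >= lam2
    heading = 0.5 * math.atan2(2 * b, a - c)         # major-axis orientation
    return 2 * math.sqrt(lam1), 2 * math.sqrt(lam2), heading

# An axis-aligned ellipse with semi-axes 2 m and 0.5 m becomes a 4 m x 1 m box:
L, W, th = extent_to_box([[4.0, 0.0], [0.0, 0.25]])
print(round(L, 2), round(W, 2))  # → 4.0 1.0
```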
GGIW-PMBM can produce tracking results close to the ground-truth positions. However, as illustrated in Fig. 2, a considerable portion of the TP bounding boxes estimated by GGIW-PMBM have similar width and length. Consequently, the heuristic classification step fails to classify some tracking results based on the estimated bounding box size, resulting in low detection accuracy in the original evaluation.

Fig. 2: Histogram of the TP bounding box sizes estimated by GGIW-PMBM. All tracking results within 2 m of the ground-truth positions are matched as TPs. (a) and (b) illustrate the width and length estimated on the VoD validation set; (c) and (d) illustrate the width and length estimated on the TJ4DRadSet test set.

TABLE III: TP and FN evaluated using unclassified GGIW-PMBM tracking results.

Dataset | Class | TP | FN | TP Increase (%)
VoD | car | 1536 | 2755 | 170.90
VoD | pedestrian | 1703 | 2046 | 422.39
VoD | cyclist | 988 | 446 | 173.68
TJ4DRadSet | car | 1157 | 1135 | 172.88
TJ4DRadSet | pedestrian | 357 | 575 | 49.37
TJ4DRadSet | cyclist | 430 | 560 | 120.51

We proceed to discuss how the performance of GGIW-PMBM differs between the two datasets. The MOTA for the pedestrian class is substantially lower on TJ4DRadSet than on VoD, indicating that GGIW-PMBM generates more false tracks with small extent estimates on TJ4DRadSet. The disparity in performance can be attributed to the fact that the tested sequences of TJ4DRadSet contain dense clutter originating from roadside obstacles, which the clustering procedure is incapable of excluding. This effect is illustrated in Fig. 3, which displays a scene from TJ4DRadSet where the vehicle is travelling on a four-lane road with obstacles such as fences and street lights on both sides of the road. Since the roadside obstacles are stationary, this problem could be mitigated by removing radar points with low radial velocity prior to clustering. Supplementary experiments are not conducted here because TJ4DRadSet has not yet provided ego-vehicle motion data. However, such a removal process can also influence the point clouds of stationary objects, thereby increasing the probability of these objects being mistracked.

TABLE IV: MOTA of the car class evaluated under different localization thresholds α.

α | SMURF + GNN-PMB (VoD) | SMURF + GNN-PMB (TJ4DRadSet) | SMURF + GGIW-PMBM (VoD) | SMURF + GGIW-PMBM (TJ4DRadSet)
0.5 | 36.26 | 24.56 | 38.22 | 24.39
0.6 | 36.03 | 23.95 | 34.68 | 17.67
0.7 | 35.40 | 21.86 | 22.12 | 5.19
0.8 | 34.16 | 12.00 | -8.72 | -15.27

In general, it can be inferred that GGIW-PMBM does not achieve better performance than SMURF + GNN-PMB when applied to real-world 4D imaging radar point clouds. This is primarily due to the absence of the object detector's information under the JDT-EOT framework, which makes it challenging to classify tracking results using heuristic methods and to distinguish object-generated point clouds from background clutter.

[Fig. 3 legend: ground-truth 3D bounding boxes, detected 3D bounding boxes, radar point clouds, and false positives generated from roadside obstacles.]
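The mitigation discussed above, discarding low-radial-velocity points before clustering, could be sketched as follows. The point format, the function name, and the 0.5 m/s threshold are hypothetical, and as noted in the text, the radial velocities would first need to be compensated for ego-vehicle motion.

```python
def remove_static_returns(points, v_min=0.5):
    """Drop radar returns whose absolute (ego-motion-compensated) radial
    velocity is below v_min (m/s). Each point is (x, y, z, v_r); both the
    tuple layout and the threshold are illustrative assumptions."""
    return [p for p in points if abs(p[3]) >= v_min]

# A moving pedestrian return survives; a stationary fence return is dropped:
pts = [(5.0, 1.0, 0.0, 1.2), (6.0, -2.0, 0.0, 0.05)]
print(remove_static_returns(pts))  # → [(5.0, 1.0, 0.0, 1.2)]
```

As the text cautions, the same filter would also remove returns from genuinely stationary objects (e.g., parked cars), so it trades clutter suppression against missed tracks.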