Which Framework is Suitable for Online 3D Multi-Object Tracking for Autonomous Driving with Automotive 4D Imaging Radar?


Jianan Liu1∗, Guanhua Ding2∗, Yuxuan Xia3, Jinping Sun2, Tao Huang4, Lihua Xie5, and Bing Zhu6†

Abstract— Online 3D multi-object tracking (MOT) has recently received significant research interest due to the expanding demand for 3D perception in advanced driver assistance systems (ADAS) and autonomous driving (AD). Among the existing 3D MOT frameworks for ADAS and AD, the conventional point object tracking (POT) framework using the tracking-by-detection (TBD) strategy has been well studied and accepted for LiDAR and 4D imaging radar point clouds. In contrast, extended object tracking (EOT), another important framework which accepts the joint-detection-and-tracking (JDT) strategy, has rarely been explored for online 3D MOT applications. This paper provides the first systematic investigation of the EOT framework for online 3D MOT in real-world ADAS and AD scenarios. Specifically, the widely accepted TBD-POT framework, the recently investigated JDT-EOT framework, and our proposed TBD-EOT framework are compared via extensive evaluations on two open source 4D imaging radar datasets: View-of-Delft and TJ4DRadSet. Experiment results demonstrate that the conventional TBD-POT framework remains preferable for online 3D MOT with high tracking performance and low computational complexity, while the proposed TBD-EOT framework has the potential to outperform it in certain situations. However, the results also show that the JDT-EOT framework encounters multiple problems and performs inadequately in the evaluation scenarios. After analyzing the causes of these phenomena based on various evaluation metrics and visualizations, we provide possible guidelines to improve the performance of these MOT frameworks on real-world data. These provide the first benchmark and important insights for the future development of 4D imaging radar-based online 3D MOT algorithms.

This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. This paper has been accepted by the IEEE 35th Intelligent Vehicles Symposium (IV). Code is available at https://github.com/dinggh0817/4D_Radar_MOT.
1 Vitalent Consulting, Gothenburg, Sweden. Email: jianan.liu@vitalent.se.
2 The School of Electronics and Information Engineering, Beihang University, Beijing, P.R. China. Email: {buaadgh, sunjinping}@buaa.edu.cn.
3 The Department of Electrical Engineering, Linköping University, Linköping, Sweden. Email: yuxuan.xia@liu.se.
4 The College of Science and Engineering, James Cook University, Smithfield QLD 4878, Australia. Email: tao.huang1@jcu.edu.au.
5 The School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798. Email: elhxie@ntu.edu.sg.
6 The School of Automation Science and Electrical Engineering, Beihang University, Beijing, P.R. China. Email: zhubing@buaa.edu.cn.
∗ Both authors contribute equally to the work and are co-first authors.
† Corresponding author.

I. INTRODUCTION

Online 3D multi-object tracking (MOT) is a critical component in advanced driver assistance systems (ADAS) and autonomous driving (AD) applications. It helps the autonomous vehicle to achieve robust and accurate 3D perception by eliminating uncertainties in data association and multi-object state estimation. Due to the advances in sensor and signal processing technology, online 3D MOT using various types of sensors, e.g., camera, LiDAR, and radar, has received substantial interest in recent years [1–3]. Among all commonly available sensor modalities, automotive radars, the only cost-effective sensors that can operate in both extreme lighting conditions and adverse weather [4], have been widely adopted for perception tasks including instance segmentation [5, 6], object detection [7, 8], and MOT [9]. Although conventional automotive radars can effectively separate objects in the range and Doppler velocity dimensions, the low angular resolution of radar measurements limits the performance of radar-based object detection and MOT. Recently, the 4D imaging radar based on multiple-input multiple-output (MIMO) technology has attracted increasing attention [10, 11]. Unlike conventional automotive radars, 4D imaging radars are capable of measuring the range, velocity, azimuth, and elevation of an object, thereby providing new possibilities to develop novel radar-based 3D MOT methods.

The design paradigms of 3D MOT methods can be divided into two categories: model-based and deep learning-based [12, 13]. The model-based paradigm employs meticulously designed multi-object dynamic and measurement models, making it suitable for the development of efficient and robust 3D MOT methods. As a typical framework of the model-based MOT paradigm, the point object tracking (POT) framework using the tracking-by-detection (TBD) strategy has been widely adopted in academia and industry [14–19]. POT assumes that each object generates at most one measurement per sensor scan; however, a 3D object often generates multiple measurement points in LiDAR and 4D imaging radar point clouds. Consequently, object detection is performed before tracking to combine the measurements generated by the same object into a single detection. The effectiveness of the TBD-POT framework has been validated in several real-world LiDAR-based online 3D MOT tasks [3, 14–19].

Another model-based MOT framework receiving increasing attention in the tracking literature is extended object tracking (EOT) [20–26]. In contrast to POT, EOT assumes that an object can generate multiple measurements per sensor scan. Therefore, EOT can achieve joint-detection-and-tracking (JDT) without an additional object detection module and is claimed to achieve promising results for single object tracking using real-world LiDAR point clouds [27–30] and automotive radar detection points [31, 32]. However, EOT has rarely been conducted for online 3D MOT in complex ADAS and AD scenarios with real-world data. Currently, only two available works attempt to evaluate the EOT framework for real-world LiDAR-based MOT [34, 35]. None of the aforementioned works provided detailed performance of tracking multiple objects of different classes in an ADAS/AD dataset, nor did they perform a systematic analysis using widely accepted metrics. Thus, the applicability of EOT in complex ADAS and AD scenarios has not really been demonstrated. Moreover, with the rapid development of deep learning, almost all state-of-the-art approaches for 3D MOT with point clouds in ADAS and AD scenarios follow either the TBD-POT framework using a deep learning-based object detector [3, 14–19] or the deep learning-based tracking paradigm [36–40], which seems to imply that EOT is no longer necessary. Specifically, it remains an open question whether EOT can outperform the traditional TBD-POT framework for 3D MOT with point clouds in terms of performance and complexity. In this study, this open question is answered for the first time with comprehensive evaluations and analyses.

Specifically, the contributions of this paper are:
• This paper provides the first benchmark for subsequent studies on 4D imaging radar-based online 3D MOT in ADAS and AD by comparing POT and EOT frameworks. The evaluations reveal the pros and cons of the POT and EOT frameworks, while our analyses provide guidelines for designing online 3D MOT algorithms.
• To fill the gap between theory and practice for EOT-based online 3D MOT, for the first time the EOT framework is systematically investigated in real-world ADAS and AD scenarios. While the extensively investigated JDT-EOT framework performs inadequately, our proposed TBD-EOT framework, which leverages the strength of a deep learning-based object detector, achieves superior tracking performance and computational efficiency compared with the JDT-EOT framework.
• Experiment results indicate that the conventional TBD-POT framework remains preferable for online 3D MOT with 4D imaging radar due to its high tracking performance and computational efficiency. However, the TBD-EOT framework can outperform TBD-POT in certain situations, demonstrating the potential of applying EOT for online 3D MOT in real-world ADAS and AD applications.

The rest of the paper is organized as follows. Related works are reviewed in Section II. Three different online 3D MOT frameworks based on POT and EOT are explained in Section III. Evaluation results of the investigated 3D MOT frameworks on the View-of-Delft and TJ4DRadSet datasets are provided and compared systematically in Section IV. Finally, the conclusions are drawn in Section V.

II. RELATED WORKS

A. 3D Object Detection with 4D Imaging Radar

Due to the limited angular resolution and multi-path effects, the 4D imaging radar point cloud is sparser and contains more noise and ambiguities compared to LiDAR. To address these issues, several neural network-based 3D object detection methods for 4D imaging radar have recently been proposed. For example, a self-attention mechanism in RPFA-Net [41] is employed to extract global features from 4D radar point clouds, achieving improved performance for estimating object heading angles. A 3D object detection framework is proposed in [42] to accumulate temporal and spatial features in multiple 4D radar frames through velocity compensation and inter-frame matching. Multiple representations for 4D radar points are introduced in SMURF [43] by utilizing pillarization and kernel density estimation techniques, achieving state-of-the-art performance on two of the latest 4D imaging radar datasets, VoD [44] and TJ4DRadSet [45]. Moreover, 4D imaging radars are also fused with cameras [46, 47] and LiDAR [48, 49] for performance improvement.

B. 3D Multi-Object Tracking with LiDAR

The majority of LiDAR-based 3D MOT methods employ a traditional TBD strategy, in which an object detector processes the point cloud to produce detection results in the form of bounding boxes, and then a point object tracker performs MOT with the detections. As many 3D object detectors for LiDAR are sufficiently accurate, adequate tracking performance can be achieved by using a simple Bayesian MOT algorithm such as the global nearest neighbour tracker with heuristic track management [3, 14–16]. However, because the detectors still produce false detections, these MOT methods can suffer from track fragmentation and object ID switches. Several random finite set (RFS)-based methods have been proposed in an effort to further enhance the tracking performance. RFS-M3 [17] employs a Poisson multi-Bernoulli mixture (PMBM) filter in conjunction with a neural network-based 3D object detector. Liu et al. further modify PMBM and propose the Poisson multi-Bernoulli (PMB) filter with global nearest neighbour (GNN-PMB) [18] as a simple and effective online MOT algorithm for LiDAR. A novel MOT framework based on the sum-product algorithm [50] is proposed to achieve efficient probabilistic data association and substantially reduces ID switch errors.

On the other hand, LiDAR-based 3D MOT with the JDT strategy has been implemented using the deep learning-based paradigm as well. For instance, SimTrack [36] integrates data association and track management in an end-to-end trainable model, CenterTube [37] achieves JDT by detecting 4D spatio-temporal tubelets in point cloud sequences, and 3DMODT [38] can directly operate on raw LiDAR point clouds and employs an attention-based refinement module for affinity matrices. However, the EOT framework under the model-based paradigm, which can also accept the JDT strategy, has rarely been investigated for LiDAR-based 3D MOT in real-world ADAS and AD scenarios. Thus, it remains an area requiring further research.
[Fig. 1: The illustration of three different frameworks for online 3D MOT with 4D imaging radar point cloud. Framework I (tracking-by-detection with point object tracker, TBD with POT): a 3D object detector provides center positions of bounding boxes with class information to the GNN-PMB filter (PPP and MB multi-object density, Gaussian single-object density, PMB predict and update, best global association hypothesis, track maintenance, state extraction), yielding position-refined bounding boxes with IDs. Framework II (joint detection and tracking with extended object tracker, JDT with EOT): DBSCAN and k-means clustering with different parameters generate multiple measurement partitions from the 4D imaging radar point cloud for the GGIW-PMBM filter (PPP and MBM multi-object density, GGIW single-object density, K-best global association hypotheses, track maintenance, state extraction), yielding estimated object positions, extents, and IDs. Framework III (tracking-by-detection with extended object tracker, TBD with EOT): a 3D object detector is used to select effective points and cluster them by bounding boxes into a measurement partition with class information for the GGIW-PMBM filter, yielding estimated object positions, extents, and IDs.]

III. METHODOLOGIES

In this section, we introduce three different frameworks for online 3D MOT with 4D imaging radar point clouds, including TBD-POT, JDT-EOT, and our proposed TBD-EOT; see Fig. 1 for an illustration.

A. Framework I: TBD with POT

The TBD-POT framework has been widely adopted in the literature for MOT with different sensor modalities, e.g., [3, 16, 51, 52]. In this tracking framework, the 4D imaging radar point cloud is first processed by an object detector to generate 3D bounding boxes that provide information such as object position, bounding box size, orientation, class, and detection score. The POT algorithm often takes two-dimensional object position measurements in Cartesian coordinates and performs MOT on the bird's-eye view (BEV) plane to simplify calculations. Other information from the 3D bounding boxes is then combined with the estimated object positions and IDs to generate 3D tracking results. The TBD-POT framework has two main advantages: 1) the POT algorithm can leverage extra information such as object classes and detection scores to further improve the tracking performance; 2) POT is typically less compute-intensive than EOT.
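To make the detector-to-tracker interface described above concrete, the following minimal sketch (our own illustration, not code from the paper's repository; the Detection3D container and the score threshold are assumptions) converts 3D bounding-box detections into the 2D BEV position measurements consumed by a point object tracker, while keeping size, orientation, class, and score as side information:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Detection3D:                 # hypothetical container for one detected 3D box
    center: np.ndarray             # (x, y, z) in the ego/world frame
    size: np.ndarray               # (length, width, height)
    yaw: float                     # heading angle in radians
    cls: str                       # object class, e.g. "car"
    score: float                   # detection confidence

def to_bev_measurements(detections, score_threshold=0.3):
    """Keep confident detections and expose their BEV centers as 2D point measurements.

    The POT stage only estimates (x, y) tracks; size, yaw, class, and score are
    carried along so they can be re-attached to the estimated tracks afterwards.
    """
    kept = [d for d in detections if d.score >= score_threshold]
    measurements = np.array([d.center[:2] for d in kept])   # shape (N, 2)
    side_info = [(d.size, d.yaw, d.cls, d.score) for d in kept]
    return measurements, side_info
```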
The GNN-PMB filter [18], one of the state-of-the-art POT approaches for LiDAR-based online 3D MOT, is selected as the POT algorithm. The filter estimates the multi-object state by propagating a PMB density over time, which combines a Poisson point process (PPP) for modeling undetected objects and a multi-Bernoulli (MB) process for modeling detected objects. The data association is achieved by managing local and global hypotheses. For each time step, a measurement can be matched with a previously tracked object, a newly detected object, or a false alarm to generate different local hypotheses with corresponding costs. Then, a group of compatible local hypotheses is collected in a global hypothesis [53], which defines a possible association between existing objects and measurements. Finally, the optimal data association result, which is the global hypothesis with the lowest total cost, is obtained by solving the 2D-assignment problem on the cost matrix. Different from the PMBM filter, which calculates and propagates multiple global hypotheses, GNN-PMB only propagates the best global hypothesis to reduce computational complexity without significantly deteriorating the tracking performance [18]. In summary, the first online 3D MOT framework in this paper combines a deep learning-based 3D object detector with the GNN-PMB filter, as illustrated in the first row of Fig. 1.
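As a rough sketch of the best-global-hypothesis selection described above (not the authors' implementation; the gating distance, cost definition, and missed-detection handling are simplified assumptions), the single 2D-assignment step can be written with SciPy's linear assignment solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_global_hypothesis(track_positions, measurements, gate=4.0, miss_cost=4.0):
    """Pick one measurement (or a miss) per track by solving a 2D assignment problem.

    track_positions: (T, 2) array of predicted BEV positions of existing tracks
    measurements:    (M, 2) array of BEV detections in the current frame
    Returns a list of (track_index, measurement_index or None).
    """
    T, M = len(track_positions), len(measurements)
    # Cost of pairing each track with each measurement (Euclidean distance),
    # padded with per-track "missed detection" columns so every track has an option.
    dist = np.linalg.norm(track_positions[:, None, :] - measurements[None, :, :], axis=-1)
    dist[dist > gate] = 1e6                       # effectively forbid out-of-gate pairs
    cost = np.hstack([dist, np.full((T, T), miss_cost)])
    rows, cols = linear_sum_assignment(cost)      # lowest total-cost global hypothesis
    return [(r, int(c)) if c < M else (r, None) for r, c in zip(rows, cols)]
```

Measurements left unassigned by this step would then seed local hypotheses for newly detected objects or be treated as false alarms, following the hypothesis management outlined above.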
B. Framework II: JDT with EOT

In contrast to the first framework, the JDT-EOT framework operates on 4D radar point clouds by detecting and tracking multiple objects simultaneously. The point clouds go through a gating and clustering process to generate the measurement partition (a group of disjoint clusters); then, an EOT filtering algorithm performs 3D MOT using these clusters. To reduce computational complexity, clusters can be matched with objects during the gating process only if they fall within a distance threshold around the predicted position of the object. Clusters that cannot be matched with any existing object are then assigned to newborn objects or considered as clutter. Theoretically, this framework has the potential to provide more accurate estimates of the object position and shape while also reducing false negatives, since the point clouds contain more information than pre-processed 3D bounding boxes. However, it is challenging to produce proper measurement partitions, particularly for 4D radar point clouds with many ambiguities and much clutter. As the distribution and density of point clouds can vary between objects, different clustering algorithms, such as DBSCAN [54] and k-means [55], with different parameter settings are usually employed to generate as many different measurement partitions as possible. This further increases EOT's computational complexity and poses a challenge to the real-time performance of this framework.

In order to implement the JDT-EOT framework, as illustrated in the second row of Fig. 1, we select the PMBM filter with the gamma Gaussian inverse Wishart model (GGIW-PMBM), which is recognized as one of the state-of-the-art EOT algorithms due to its high estimation accuracy and manageable computational complexity [22, 23]. The PMBM filter [56] models object-originated measurements with the multi-Bernoulli mixture (MBM) density and propagates multiple global hypotheses to contend with the uncertainty of data association. The GGIW model assumes that the number of object-generated measurements is Poisson distributed and that the single-measurement likelihood is Gaussian. Under this assumption, each object has an elliptical shape represented by the inverse Wishart (IW) density, while the major and minor axes of this ellipse can be used to form a rectangular bounding box. This simple but flexible extent modeling is sufficient to model different classes of objects [21–24]. More importantly, the GGIW implementation has the lowest computational complexity among all existing EOT implementations [20], which is desirable for real-time 3D MOT. Without detection bounding boxes and class information, the extent of a newborn object is initialized based on the spatial size of the associated cluster. Additionally, the extent estimates are processed by non-maximum suppression (NMS) [57] to reduce physically impossible overlapping tracking results.
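The measurement-partitioning step described for this framework can be sketched as follows. This is a simplified illustration using scikit-learn; the parameter grids below are assumptions, not the settings used in the experiments:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

def generate_partitions(points_bev,
                        dbscan_eps_list=(0.5, 1.0, 2.0),
                        kmeans_k_list=(2, 4, 8)):
    """Cluster the BEV radar points with several parameter settings.

    Each returned partition is a list of disjoint point clusters; the EOT filter
    then evaluates association hypotheses against every partition, which is why
    the number of partitions directly drives the computational cost.
    """
    partitions = []
    for eps in dbscan_eps_list:
        labels = DBSCAN(eps=eps, min_samples=2).fit_predict(points_bev)
        clusters = [points_bev[labels == k] for k in set(labels) if k != -1]
        if clusters:
            partitions.append(clusters)
    for k in kmeans_k_list:
        if len(points_bev) >= k:
            labels = KMeans(n_clusters=k, n_init=10).fit_predict(points_bev)
            partitions.append([points_bev[labels == i] for i in range(k)])
    return partitions
```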
C. The Proposed Framework III: TBD with EOT

To leverage the strengths of both the deep learning-based object detector and EOT, we present TBD-EOT as the third MOT framework. Instead of directly performing EOT on the radar point cloud, the points within detected bounding boxes are selected for clustering, since these "effective" points are more likely to originate from objects than from clutter. Compared with JDT-EOT, the advantage of the TBD-EOT framework is twofold. First, the computational complexity of the data association in EOT can be substantially reduced by removing the clutter points, which leads to improved tracking performance with fewer false tracks. Second, the EOT algorithm can utilize the information from the detector to further improve the tracking performance, for example, by setting optimized parameters for different object classes and discarding bounding boxes with low detection scores. Compared with TBD-POT, the TBD-EOT framework employs a more realistic measurement model and has the potential to produce accurate object bounding boxes from extent estimates. As shown in the third row of Fig. 1, this MOT framework is implemented using the same 3D object detector as the TBD-POT framework along with the GGIW-PMBM filter.

IV. EXPERIMENTS AND ANALYSIS

A. Dataset and Evaluation Metrics

We evaluate each online 3D MOT framework on two recently released 4D imaging radar-based autonomous driving datasets: View-of-Delft (VoD) [44] and TJ4DRadSet [45]. Both datasets contain synchronized 4D imaging radar, LiDAR, and camera data with high-quality annotations. Each framework is evaluated with three object classes (car, pedestrian, and cyclist) on the validation set of VoD (sequence numbers 0, 8, 12, and 18) and part of the test set of TJ4DRadSet (sequence numbers 0, 10, 23, 31, and 41). These selected sequences cover various driving conditions and contain different classes of objects in balanced quantities. SMURF [43], a state-of-the-art object detector for 4D imaging radar point clouds, is selected to extract bounding box detections for implementing TBD-POT and TBD-EOT. Since object class information is inaccessible for JDT-EOT, a heuristic classification step is employed in the state extraction procedure of this framework. In this step, unclassified tracking results are separated into cars, pedestrians, cyclists, and other objects based on the width and length of the estimated bounding boxes.

In the following evaluations, a set of commonly accepted MOT metrics for ADAS and AD is evaluated on the BEV plane, including multiple object tracking accuracy (MOTA), multiple object tracking precision (MOTP), true positives (TP), false negatives (FN), false positives (FP), and ID switches (IDS). In addition, we employ a recently proposed MOT metric, higher order tracking accuracy (HOTA) [58]. HOTA decomposes into a family of sub-metrics, including detection accuracy (DetA), association accuracy (AssA), and localization accuracy (LocA), thus enabling a clear analysis of the MOT performance.

Most notably, the MOTA, MOTP, and HOTA metrics are calculated based on TP, FN, and FP, which are determined by the similarity score S defined as

S = max(0, 1 − d(p, q) / d0),    (1)

where d(p, q) is the Euclidean distance between an object's estimated position p and its corresponding ground-truth position q, and d0 is the distance at which S reduces to zero. The pairs of estimation and ground-truth satisfying S ≥ α are matched and counted as TPs, where α is the localization threshold. The remaining unmatched estimations become FPs, and unmatched ground-truths become FNs. The zero-distance is set to d0 = 4 m in the following evaluations. For MOTA and MOTP, the localization threshold is set to α = 0.5. This setting indicates that an estimation can match with a ground-truth if the Euclidean distance between their center positions is no more than 2 m, which is aligned with the nuScenes [59] tracking challenge (https://www.nuscenes.org/tracking). HOTA is calculated by averaging the results over different α values (0.05 to 0.95 with an interval of 0.05, as suggested in [58]).
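For reference, the TP/FP/FN matching defined by (1), which underlies the MOTA and HOTA values reported in Tables I and II below, can be sketched as a greedy per-frame matching. This is a simplified stand-in for the official evaluation code (which uses optimal matching), not the evaluation script used in the paper:

```python
import numpy as np

def match_frame(est_xy, gt_xy, alpha=0.5, d0=4.0):
    """Greedily match estimated and ground-truth BEV positions of one frame.

    A pair is a TP when its similarity S = max(0, 1 - d/d0) is at least alpha,
    i.e. when the center distance is at most (1 - alpha) * d0 (2 m for alpha = 0.5).
    """
    dist = np.linalg.norm(est_xy[:, None, :] - gt_xy[None, :, :], axis=-1)
    similarity = np.maximum(0.0, 1.0 - dist / d0)
    matched_est, matched_gt = set(), set()
    # Visit candidate pairs from most to least similar.
    for i, j in sorted(np.ndindex(similarity.shape), key=lambda ij: -similarity[ij]):
        if similarity[i, j] >= alpha and i not in matched_est and j not in matched_gt:
            matched_est.add(i)
            matched_gt.add(j)
    tp = len(matched_est)
    fp = len(est_xy) - tp          # unmatched estimates
    fn = len(gt_xy) - tp           # unmatched ground truths
    return tp, fp, fn
```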
TABLE I: 4D imaging radar-based 3D MOT tracking results on the VoD validation set

Method              Framework  Class       HOTA†↑  DetA†↑  AssA†↑  LocA†↑  MOTA†↑    MOTP†↑  TP↑   FN↓   FP↓   IDS↓
SMURF + GNN-PMB     TBD-POT    car         54.36   42.64   69.34   93.90   36.26     93.60   2190  2101  593   41
SMURF + GNN-PMB     TBD-POT    pedestrian  53.23   45.29   62.62   94.91   40.65     94.42   1925  1824  352   49
SMURF + GNN-PMB     TBD-POT    cyclist     65.77   60.71   71.27   93.78   57.95     93.58   1045  389   201   13
GGIW-PMBM           JDT-EOT    car         8.35    5.60    12.62   70.28   -78.75*   66.58   567   3724  3839  107
GGIW-PMBM           JDT-EOT    pedestrian  16.21   7.40    36.00   89.28   -9.79*    90.92   326   3423  660   33
GGIW-PMBM           JDT-EOT    cyclist     21.21   10.09   44.67   90.73   -114.30*  91.34   361   1073  1990  10
SMURF + GGIW-PMBM   TBD-EOT    car         47.15   35.70   62.45   82.70   38.22     79.68   2145  2146  491   12
SMURF + GGIW-PMBM   TBD-EOT    pedestrian  55.27   44.22   69.13   94.15   39.96     93.52   1906  1843  378   26
SMURF + GGIW-PMBM   TBD-EOT    cyclist     66.47   58.48   75.64   92.68   54.32     92.25   1089  345   302   8
* The MOTA of GGIW-PMBM are negative values because there are significantly more FNs and FPs than TPs, while MOTA = 1 − (FN+FP+IDS)/(TP+FN).
† The metrics are multiplied by 100. The bold values in the original table indicate the best results of each object class.
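As a quick numerical check of the footnote formula, using the car row of GGIW-PMBM on VoD in Table I:

```python
# MOTA = 1 - (FN + FP + IDS) / (TP + FN), car class, GGIW-PMBM on VoD (Table I)
tp, fn, fp, ids = 567, 3724, 3839, 107
mota = 1 - (fn + fp + ids) / (tp + fn)
print(round(100 * mota, 2))   # -78.75, matching the table entry
```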

TABLE II: 4D imaging radar-based 3D MOT tracking results on the TJ4DRadSet test set

Method              Framework  Class       HOTA†↑  DetA†↑  AssA†↑  LocA†↑  MOTA†↑   MOTP†↑  TP↑  FN↓   FP↓   IDS↓
SMURF + GNN-PMB     TBD-POT    car         43.41   32.19   59.32   89.50   24.56    88.36   961  1331  378   20
SMURF + GNN-PMB     TBD-POT    pedestrian  31.21   27.96   34.82   96.00   20.17    95.95   294  638   99    8
SMURF + GNN-PMB     TBD-POT    cyclist     42.22   35.43   50.32   93.28   23.74    92.82   448  542   200   13
GGIW-PMBM           JDT-EOT    car         16.45   6.86    39.60   78.54   -89.88*  72.32   424  1868  2435  49
GGIW-PMBM           JDT-EOT    pedestrian  13.19   10.03   17.41   93.52   -132.30* 94.52   239  693   1462  10
GGIW-PMBM           JDT-EOT    cyclist     23.28   8.65    62.79   90.53   -92.32*  89.94   195  795   1105  4
SMURF + GGIW-PMBM   TBD-EOT    car         38.16   28.35   51.88   82.19   24.39    78.55   962  1330  397   6
SMURF + GGIW-PMBM   TBD-EOT    pedestrian  41.60   27.10   63.87   95.36   21.89    95.22   279  653   68    7
SMURF + GGIW-PMBM   TBD-EOT    cyclist     49.48   36.42   67.23   92.05   20.10    91.46   505  485   298   8
* The MOTA of GGIW-PMBM are negative values because there are significantly more FNs and FPs than TPs, while MOTA = 1 − (FN+FP+IDS)/(TP+FN).
† The metrics are multiplied by 100. The bold values in the original table indicate the best result of each object class under each metric.

B. Comparison between Different Tracking Frameworks

The evaluation results of the three online 3D MOT frameworks on VoD and TJ4DRadSet are provided in Table I and Table II, respectively. The hyper-parameters of the implemented algorithms, specifically SMURF + GNN-PMB, GGIW-PMBM, and SMURF + GGIW-PMBM, are fine-tuned on the training sets by optimizing the HOTA metric.

1) Performance of GGIW-PMBM: Table I and Table II illustrate that the performance of GGIW-PMBM is undesirable in our experiments. It is observed that GGIW-PMBM suffers from low detection accuracy for all three classes, since the tracking results include significantly more FPs and FNs than TPs. To analyze the underlying reason, we re-calculate TP and FN using unclassified GGIW-PMBM tracking results, where any tracking result within 2 m of a ground-truth position is matched as a TP. As shown in Table III, the TPs for all three classes increase by large margins compared to the original evaluation results, demonstrating that GGIW-PMBM can produce tracking results close to the ground-truth positions. However, as illustrated in Fig. 2, a considerable portion of the TP bounding boxes estimated by GGIW-PMBM have similar width and length. Consequently, the heuristic classification step fails to classify some tracking results based on the estimated bounding box size, resulting in low detection accuracy in the original evaluation.

[Fig. 2: Histogram of the TP bounding box size estimated by GGIW-PMBM. All tracking results within 2 m of the ground-truth positions are matched as TPs. (a) and (b) illustrate the width and length estimated on the VoD validation set; (c) and (d) illustrate the width and length estimated on the TJ4DRadSet test set.]
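For illustration, the heuristic classification step can be sketched as follows. The size thresholds are assumptions for the sketch; the paper does not list the exact values used:

```python
def classify_by_extent(length, width):
    """Assign a class label to an unclassified track from its estimated BEV extent.

    Thresholds are illustrative only; when the estimated extents of different
    classes are similar (as in Fig. 2), such rules inevitably misclassify tracks.
    """
    l, w = max(length, width), min(length, width)
    if l < 1.0 and w < 1.0:
        return "pedestrian"
    if l < 2.5 and w < 1.2:
        return "cyclist"
    if l < 6.0 and w < 2.5:
        return "car"
    return "other"
```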
on VoD, indicating that GGIW-PMBM generates more false
TABLE III: TP and FN evaluated using unclassified GGIW-PMBM tracks with small extent estimates on TJ4DRadSet. The
tracking results. disparity in performance can be attributed to the fact that
the tested sequences of TJ4DRadSet contain dense clutters
Dataset Class TP FN TP Increase (%) originating from roadside obstacles, whereas the clustering
car 1536 2755 170.90 procedure is incapable of excluding these clutters. This
VoD pedestrian 1703 2046 422.39 effect is illustrated in Fig. 3, which displays a scene from
cyclist 988 446 173.68 TJ4DRadSet where the vehicle is travelling on a four-lane
car 1157 1135 172.88 road with obstacles such as fences and street lights on both
TJ4DRadSet pedestrian 357 575 49.37
sides of the road. Since the roadside obstacles are stationary,
cyclist 430 560 120.51
this problem could be mitigated by removing radar points
with low radial velocity prior to clustering. Supplementary
We proceed to discuss how the performance of GGIW-PMBM differs between the two datasets. The MOTA for the pedestrian class is substantially lower on TJ4DRadSet than on VoD, indicating that GGIW-PMBM generates more false tracks with small extent estimates on TJ4DRadSet. The disparity in performance can be attributed to the fact that the tested sequences of TJ4DRadSet contain dense clutter originating from roadside obstacles, whereas the clustering procedure is incapable of excluding this clutter. This effect is illustrated in Fig. 3, which displays a scene from TJ4DRadSet where the vehicle is travelling on a four-lane road with obstacles such as fences and street lights on both sides of the road. Since the roadside obstacles are stationary, this problem could be mitigated by removing radar points with low radial velocity prior to clustering. Supplementary experiments are not conducted here, as TJ4DRadSet has not yet provided ego-vehicle motion data. However, such a removal process can also influence the point clouds of stationary objects, thereby increasing the probability of these objects being mistracked.

In general, it can be inferred that GGIW-PMBM does not achieve superior performance to SMURF + GNN-PMB when applied to real-world 4D imaging radar point clouds. This is primarily due to the absence of the object detector's information under the JDT-EOT framework, which makes it challenging to classify tracking results using heuristic methods and to distinguish object-generated point clouds from background clutter.

[Fig. 3: The false positives generated by GGIW-PMBM from roadside obstacles in a scene of the TJ4DRadSet test set. The red dots are radar points. The green boxes are estimated object bounding boxes.]
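As noted above, the stationary roadside clutter visible in Fig. 3 could be suppressed by discarding radar points with low radial velocity before clustering. A minimal sketch, assuming ego-motion-compensated radial velocities were available (which TJ4DRadSet did not yet provide at the time of writing), is given below; the velocity threshold is an assumption:

```python
import numpy as np

def remove_stationary_clutter(points_bev, radial_velocity, v_min=0.5):
    """Keep only points whose (ego-motion-compensated) radial speed exceeds v_min [m/s].

    This suppresses returns from fences, street lights, and other static obstacles,
    at the cost of also removing points from genuinely stationary objects.
    """
    keep = np.abs(radial_velocity) >= v_min
    return points_bev[keep]
```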
TABLE IV: MOTA of the car class evaluated under different localization thresholds α.

α     SMURF + GNN-PMB (VoD / TJ4DRadSet)    SMURF + GGIW-PMBM (VoD / TJ4DRadSet)
0.5   36.26 / 24.56                         38.22 / 24.39
0.6   36.03 / 23.95                         34.68 / 17.67
0.7   35.40 / 21.86                         22.12 / 5.19
0.8   34.16 / 12.00                         -8.72 / -15.27

[Fig. 4: The visualization of unevenly distributed radar point clouds for car objects taken from a scene in the VoD validation set. The figures in the left column illustrate the ground-truth, detected, and estimated 3D bounding boxes of two car objects at the same time step. The 4D radar is on the right side of the objects.]

2) Performance of SMURF + GNN-PMB and SMURF + GGIW-PMBM: Different from GGIW-PMBM, both SMURF + GNN-PMB and SMURF + GGIW-PMBM utilize information from the object detector. As shown in Table I and Table II, SMURF + GNN-PMB outperforms SMURF + GGIW-PMBM in HOTA by a large margin with regard to car objects, primarily because the localization and detection accuracy of cars are notably lower for SMURF + GGIW-PMBM. To better illustrate this phenomenon, we evaluate the MOTA of the car class under different localization thresholds α, as shown in Table IV. As α increases, the MOTA of SMURF + GGIW-PMBM decreases more rapidly than that of SMURF + GNN-PMB, indicating that more tracking results from SMURF + GGIW-PMBM are evaluated as FPs under the same TP matching criterion. The localization error of SMURF + GGIW-PMBM primarily stems from inaccuracies in modeling the distribution of point clouds. As shown in Fig. 4, radar point clouds often congregate on the side of the car object nearest to the radar. This contrasts with the modeling assumption in the GGIW implementation, which assumes that the measurement points are distributed over the entire extent ellipse. Consequently, this discrepancy causes the estimated size and position of car objects to deviate from the ground-truth. Therefore, employing more accurate measurement models, e.g., Gaussian process [29] and data-region association [32, 33], may improve the performance of the TBD-EOT framework for large objects like cars. However, this could also increase the computational complexity.

In terms of pedestrians and cyclists, it is noteworthy that SMURF + GGIW-PMBM outperforms SMURF + GNN-PMB on HOTA, mainly due to its superior association accuracy (AssA). In addition, SMURF + GGIW-PMBM produces fewer IDS than SMURF + GNN-PMB for the pedestrian and cyclist classes, as illustrated in Fig. 5. These phenomena are analyzed as follows. First, GGIW-PMBM employs an adaptive detection model for the object [22, Eq. (35)]. The object detection probability Pd can be factorized as Pd = Pdm · Pm, where Pm denotes the probability of an existing object being measurable, i.e., the object generates a bounding box for GNN-PMB or at least one radar point for GGIW-PMBM, and Pdm represents the detection probability of a measurable object. In contrast to GNN-PMB, which models Pm as a fixed hyper-parameter, GGIW-PMBM calculates Pm based on GGIW densities and more reliably estimates the object detection probability. Second, besides object position, the number and spatial distribution of object-originated radar points are also employed in the GGIW-PMBM filter to calculate the likelihood of an association hypothesis. Since the GGIW density can model the distribution of radar points more accurately for small objects (as the points are less likely to congregate on one side of these objects compared to cars), GGIW-PMBM can utilize more information from the point clouds of pedestrians and cyclists to accurately estimate Pd and the association hypothesis likelihood. Third, GGIW-PMBM propagates multiple global hypotheses over time, which potentially improves the data association accuracy with noisy radar measurements. These factors could help SMURF + GGIW-PMBM to achieve superior performance on AssA and IDS by reducing associations with false alarms and the false termination of tracks.
Finally, the computational complexity of the three MOT algorithms is evaluated by the average number of frames processed per second (FPS). As shown in Table V, the average FPS of SMURF + GGIW-PMBM is about 20% of that of SMURF + GNN-PMB, while GGIW-PMBM is substantially slower than the other two algorithms, primarily due to the excessive number of possible measurement partitions generated from raw 4D imaging radar point clouds.

TABLE V: FPS for MOT frameworks, evaluated from Python implementations with an AMD 7950X CPU and 64 GB RAM.

Method              Avg. FPS in VoD / TJ4DRadSet
SMURF + GNN-PMB     401.97 / 326.57
GGIW-PMBM           3.58 / 1.24
SMURF + GGIW-PMBM   81.59 / 87.95

[Fig. 5: Track ID maintenance for pedestrians in a scene of the VoD validation set over frames 36–38. The tracking results of SMURF + GNN-PMB and SMURF + GGIW-PMBM are shown in (a) and (b), respectively. The dashed lines connect bounding boxes of the same object and the cross marks represent ID switches.]
V. CONCLUSION AND FUTURE WORK

This paper systematically compares the POT and EOT frameworks for online 3D MOT with 4D imaging radar point clouds on the VoD and TJ4DRadSet datasets. Three MOT frameworks, including TBD-POT, JDT-EOT, and TBD-EOT, are implemented with state-of-the-art methods and evaluated for car, cyclist, and pedestrian objects with commonly accepted 3D MOT metrics. Experiment results show that the traditional TBD-POT framework remains effective because its implementation, SMURF + GNN-PMB, achieves the best tracking performance for cars with the lowest computational complexity. However, the GGIW-PMBM implementation of the intensively studied JDT-EOT framework does not yield satisfactory tracking performance, primarily due to the incapability of conventional clustering methods to remove dense clutter and the high computational complexity caused by an excessive number of measurement partition hypotheses. Under the proposed TBD-EOT framework, SMURF + GGIW-PMBM shows great potential to outperform the implementation of TBD-POT, SMURF + GNN-PMB, by achieving superior association accuracy and more reliable ID estimation for both pedestrian and cyclist objects. Yet, the performance of SMURF + GGIW-PMBM deteriorates for cars because GGIW is unsuitable for modeling unevenly distributed radar point clouds, indicating the necessity of developing more realistic, accurate, and computationally efficient object models in the future.

REFERENCES
[1] P. Li and J. Jin, "Time3D: End-to-end joint monocular 3D object detection and tracking for autonomous driving," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 3885-3894.
[2] K. Shi, Z. Shi, C. Yang, S. He, J. Chen, and A. Chen, "Road-map aided GM-PHD filter for multivehicle tracking with automotive radar," IEEE Transactions on Industrial Informatics, vol. 18, no. 1, pp. 97-108, Jan. 2022.
[3] H. Wu, W. Han, C. Wen, X. Li, and C. Wang, "3D multi-object tracking in point clouds based on prediction confidence-guided data association," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 6, pp. 5668-5677, Jun. 2022.
[4] A. Pandharipande et al., "Sensing and machine learning for automotive perception: A review," IEEE Sensors Journal, vol. 11, no. 23, pp. 11097-11115, Jun. 2023.
[5] J. Liu et al., "Deep instance segmentation with automotive radar detection points," IEEE Transactions on Intelligent Vehicles, vol. 8, no. 1, pp. 84-94, Jan. 2023.
[6] W. Xiong, J. Liu, Y. Xia, T. Huang, B. Zhu, and W. Xiang, "Contrastive learning for automotive mmWave radar detection points based instance segmentation," in Proceedings of the IEEE International Conference on Intelligent Transportation Systems (ITSC), 2022, pp. 1255-1261.
[7] P. Li, P. Wang, K. Berntorp, and H. Liu, "Exploiting temporal relations on radar perception for autonomous driving," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 17071-17080.
[8] Y. Yang, J. Liu, T. Huang, Q.-L. Han, G. Ma, and B. Zhu, "RaLiBEV: Radar and LiDAR BEV fusion learning for anchor box free object detection systems," arXiv preprint, 2022. [Online]. Available: arxiv.org/abs/2211.06108.
[9] T. Zhou et al., "3D multiple object tracking with multi-modal fusion of low-cost sensors for autonomous driving," in Proceedings of the IEEE International Conference on Intelligent Transportation Systems (ITSC), 2022, pp. 1750-1757.
[10] D. Schwarz, N. Riese, I. Dorsch, and C. Waldschmidt, "System performance of a 79 GHz high-resolution 4D imaging MIMO radar with 1728 virtual channels," IEEE Journal of Microwaves, vol. 2, no. 4, pp. 637-647, Oct. 2022.
[11] Z. Han et al., "4D millimeter-wave radar in autonomous driving: A survey," arXiv preprint, 2023. [Online]. Available: arxiv.org/abs/2306.04242.
[12] J. Pinto, G. Hess, W. Ljungbergh, Y. Xia, H. Wymeersch, and L. Svensson, "Deep learning for model-based multi-object tracking," IEEE Transactions on Aerospace and Electronic Systems, vol. 59, no. 6, pp. 7363-7379, Dec. 2023.
[13] S. Ding, E. Rehder, L. Schneider, M. Cordts, and J. Gall, "3DMOTFormer: Graph transformer for online 3D multi-object tracking," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 9784-9794.
[14] G. Guo and S. Zhao, "3D multi-object tracking with adaptive cubature Kalman filter for autonomous driving," IEEE Transactions on Intelligent Vehicles, vol. 8, no. 1, pp. 84-94, Jan. 2023.
[15] Z. Pang, Z. Li, and N. Wang, "SimpleTrack: Understanding and rethinking 3D multi-object tracking," in Proceedings of the European Conference on Computer Vision (ECCV) Workshop, 2022, pp. 680-696.
[16] T. Wen, Y. Zhang, and N. M. Freris, "PF-MOT: Probability fusion based 3D multi-object tracking for autonomous vehicles," in Proceedings of the International Conference on Robotics and Automation (ICRA), 2022, pp. 700-706.
[17] S. Pang, D. Morris, and H. Radha, "3D multi-object tracking using random finite set-based multiple measurement models filtering (RFS-M3) for autonomous vehicles," in Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021, pp. 13701-13707.
[18] J. Liu, L. Bai, Y. Xia, T. Huang, B. Zhu, and Q.-L. Han, "GNN-PMB: A simple but effective online 3D multi-object tracker without bells and whistles," IEEE Transactions on Intelligent Vehicles, vol. 8, no. 2, pp. 1176-1189, Feb. 2023.
[19] Z. Zhang, J. Liu, Y. Xia, T. Huang, Q.-L. Han, and H. Liu, "LEGO: Learning and graph-optimized modular tracker for online multi-object tracking with point clouds," arXiv preprint, 2023. [Online]. Available: arxiv.org/abs/2308.09908.
[20] K. Granström, M. Baum, and S. Reuter, "Extended object tracking: Introduction, overview and applications," arXiv preprint, 2016. [Online]. Available: arxiv.org/abs/1604.00970.
[21] C. Lundquist, K. Granström, and U. Orguner, "An extended target CPHD filter and a gamma Gaussian inverse Wishart implementation," IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 3, pp. 472-483, Jun. 2013.
[22] K. Granström, M. Fatemi, and L. Svensson, "Poisson multi-Bernoulli mixture conjugate prior for multiple extended target filtering," IEEE Transactions on Aerospace and Electronic Systems, vol. 56, no. 1, pp. 208-225, Feb. 2020.
[23] Á. F. García-Fernández, J. L. Williams, L. Svensson, and Y. Xia, "A Poisson multi-Bernoulli mixture filter for coexisting point and extended targets," IEEE Transactions on Signal Processing, vol. 69, pp. 2600-2610, Apr. 2021.
[24] Y. Xia, K. Granström, L. Svensson, M. Fatemi, Á. F. García-Fernández, and J. L. Williams, "Poisson multi-Bernoulli approximations for multiple extended object filtering," IEEE Transactions on Aerospace and Electronic Systems, vol. 58, no. 2, pp. 890-906, Sep. 2021.
[25] X. Yang and Q. Jiao, "Variational approximation for adaptive extended target tracking in clutter with random matrix," IEEE Transactions on Vehicular Technology, vol. 72, no. 10, pp. 12639-12652, Oct. 2023.
[26] B. Liu, R. Tharmarasa, R. Jassemi, D. Brown, and T. Kirubarajan, "RFS-based multiple extended target tracking with resolved multipath detections in clutter," IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 10, pp. 10400-10409, Oct. 2023.
[27] K. Granström, S. Reuter, D. Meissner, and A. Scheel, "A multiple model PHD approach to tracking of cars under an assumed rectangular shape," in Proceedings of the IEEE 17th International Conference on Information Fusion (FUSION), 2014, pp. 1-8.
[28] P. Dahal, S. Mentasti, S. Arrigoni, F. Braghin, M. Matteucci, and F. Cheli, "Extended object tracking in curvilinear road coordinates for autonomous driving," IEEE Transactions on Intelligent Vehicles, vol. 8, no. 2, pp. 1266-1278, Feb. 2023.
[29] M. Kumru and E. Özkan, "Three-dimensional extended object tracking and shape learning using Gaussian processes," IEEE Transactions on Aerospace and Electronic Systems, vol. 57, no. 5, pp. 2795-2814, Oct. 2021.
[30] A. Scheel, K. Granström, D. Meissner, S. Reuter, and K. Dietmayer, "Tracking and data segmentation using a GGIW filter with mixture clustering," in Proceedings of the 17th IEEE International Conference on Information Fusion (FUSION), 2014, pp. 1-8.
[31] Y. Xia et al., "Learning-based extended object tracking using hierarchical truncation measurement model with automotive radar," IEEE Journal of Selected Topics in Signal Processing, vol. 15, no. 4, pp. 1013-1029, Jun. 2021.
[32] X. Cao, J. Lan, X. R. Li, and Y. Liu, "Automotive radar-based vehicle tracking using data-region association," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 7, pp. 8997-9010, Jul. 2022.
[33] G. Ding, J. Liu, Y. Xia, T. Huang, B. Zhu, and J. Sun, "LiDAR point cloud-based multiple vehicle tracking with probabilistic measurement-region association," arXiv preprint, 2024. [Online]. Available: arxiv.org/abs/2403.06423.
[34] K. Granström, L. Svensson, S. Reuter, Y. Xia, and M. Fatemi, "Likelihood-based data association for extended object tracking using sampling methods," IEEE Transactions on Intelligent Vehicles, vol. 3, no. 1, pp. 30-45, Mar. 2018.
[35] F. Meyer and J. L. Williams, "Scalable detection and tracking of geometric extended objects," IEEE Transactions on Signal Processing, vol. 69, no. 1, pp. 6283-6298, Oct. 2021.
[36] C. Luo, X. Yang, and A. Yuille, "Exploring simple 3D multi-object tracking for autonomous driving," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 10488-10497.
[37] H. Liu, Y. Ma, Q. Hu, and Y. Guo, "CenterTube: Tracking multiple 3D objects with 4D tubelets in dynamic point clouds," IEEE Transactions on Multimedia, vol. 25, pp. 8793-8804, Feb. 2023.
[38] J. Kini, A. Mian, and M. Shah, "3DMODT: Attention-guided affinities for joint detection and tracking in 3D point clouds," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 841-848.
[39] T. Zhang, X. Chen, Y. Wang, Y. Wang, and H. Zhao, "MUTR3D: A multi-camera tracking framework via 3D-to-2D queries," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshop, 2022, pp. 4537-4546.
[40] T. Meinhardt, A. Kirillov, L. Leal-Taixe, and C. Feichtenhofer, "Trackformer: Multi-object tracking with transformers," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 8844-8854.
[41] B. Xu et al., "RPFA-Net: A 4D RaDAR pillar feature attention network for 3D object detection," in Proceedings of the IEEE International Conference on Intelligent Transportation Systems (ITSC), 2021, pp. 3061-3066.
[42] B. Tan et al., "3D object detection for multi-frame 4D automotive millimeter-wave radar point cloud," IEEE Sensors Journal, vol. 23, no. 11, pp. 11125-11138, Jun. 2023.
[43] J. Liu, Q. Zhao, W. Xiong, T. Huang, Q.-L. Han, and B. Zhu, "SMURF: Spatial multi-representation fusion for 3D object detection with 4D imaging radar," IEEE Transactions on Intelligent Vehicles, vol. 9, no. 1, pp. 799-812, Jan. 2024.
[44] A. Palffy, E. Pool, S. Baratam, J. F. P. Kooij, and D. M. Gavrila, "Multi-class road user detection with 3+1D radar in the View-of-Delft dataset," IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 4961-4968, Apr. 2022.
[45] L. Zheng et al., "TJ4DRadSet: A 4D radar dataset for autonomous driving," in Proceedings of the IEEE International Conference on Intelligent Transportation Systems (ITSC), 2022, pp. 493-498.
[46] L. Zheng et al., "RCFusion: Fusing 4D radar and camera with bird's-eye view features for 3D object detection," IEEE Transactions on Instrumentation and Measurement, vol. 72, pp. 1-14, May 2023.
[47] W. Xiong, J. Liu, T. Huang, Q.-L. Han, Y. Xia, and B. Zhu, "LXL: LiDAR excluded lean 3D object detection with 4D imaging radar and camera fusion," IEEE Transactions on Intelligent Vehicles, vol. 9, no. 1, pp. 79-92, Jan. 2024.
[48] L. Wang et al., "InterFusion: Interaction-based 4D radar and LiDAR fusion for 3D object detection," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022, pp. 12247-12253.
[49] L. Wang et al., "Multi-modal and multi-scale fusion 3D object detection of 4D radar and LiDAR for autonomous driving," IEEE Transactions on Vehicular Technology, vol. 72, no. 5, pp. 5628-5641, May 2023.
[50] F. Meyer et al., "Message passing algorithms for scalable multitarget tracking," Proceedings of the IEEE, vol. 106, no. 2, pp. 221-259, Feb. 2018.
[51] L. Wang et al., "CAMO-MOT: Combined appearance-motion optimization for 3D multi-object tracking with camera-LiDAR fusion," IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 11, pp. 11981-11996, Nov. 2023.
[52] X. Hao, Y. Xia, H. Yang, and Z. Zuo, "Asynchronous information fusion in intelligent driving systems for target tracking using cameras and radars," IEEE Transactions on Industrial Electronics, vol. 70, no. 3, pp. 2708-2717, Mar. 2023.
[53] J. L. Williams, "Marginal multi-Bernoulli filters: RFS derivation of MHT, JIPDA, and association-based MeMBer," IEEE Transactions on Aerospace and Electronic Systems, vol. 51, no. 3, pp. 1664-1687, Jul. 2015.
[54] E. Schubert, J. Sander, M. Ester, H. P. Kriegel, and X. Xu, "DBSCAN revisited, revisited: Why and how you should (still) use DBSCAN," ACM Transactions on Database Systems (TODS), vol. 42, no. 3, pp. 1-21, Sep. 2017.
[55] M. Ahmed, R. Seraj, and S. M. S. Islam, "The k-means algorithm: A comprehensive survey and performance evaluation," Electronics, vol. 9, no. 8, p. 1295, Aug. 2020.
[56] Á. F. García-Fernández, J. L. Williams, K. Granström, and L. Svensson, "Poisson multi-Bernoulli mixture filter: Direct derivation and implementation," IEEE Transactions on Aerospace and Electronic Systems, vol. 54, no. 4, pp. 1883-1901, Aug. 2018.
[57] A. Neubeck and L. Van Gool, "Efficient non-maximum suppression," in Proceedings of the International Conference on Pattern Recognition (ICPR), 2006, vol. 3, pp. 850-855.
[58] J. Luiten et al., "HOTA: A higher order metric for evaluating multi-object tracking," International Journal of Computer Vision, vol. 129, pp. 548-578, Oct. 2020.
[59] H. Caesar et al., "nuScenes: A multimodal dataset for autonomous driving," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 11621-11631.
