
1 Introduction

Object detection is a fundamental and challenging problem in computer vision  [25]. In recent years, with the emergence of deep learning  [11, 18] and the availability of large-scale annotated datasets  [6, 24], the state of the art in 2D object detection has improved significantly  [4, 10, 23, 27, 34, 40]. Object detection in the 2D image plane, however, is not sufficient for autonomous driving, which requires accurate 3D localization of targets in the scene. Currently, the foremost methods  [17, 36, 45, 49] for 3D object detection rely heavily on expensive LiDAR sensors to provide accurate depth information as input. Monocular 3D object detection  [2, 3, 16, 19, 26, 30, 31, 44] is a promising low-cost alternative, but it is much harder due to its ill-posed nature, i.e., the lack of depth cues. The performance gap between LiDAR-based approaches and monocular methods is still substantial.

One key challenge for monocular 3D object detection is handling large distance variations so that the detector can estimate 3D locations accurately. Learning distance-specific features requires sophisticated designs  [1, 29, 33, 42], while naively learning features covering all possible locations is difficult and consumes much of the model's capacity, so good accuracy comes at the cost of a heavy and slow model. In this work, we address this learning-efficiency problem by introducing a single-stage and multi-scale framework, termed UR3D, that learns a unified representation of objects across different scales and distance ranges. The deep model is relieved from learning separate representations for objects within different scale and distance ranges, which significantly reduces the required network capacity. Moreover, the unified object representation reduces the number of learnable parameters and thus prevents overfitting. Consequently, we achieve accurate monocular 3D object detection with a lightweight network.

An important step for monocular 3D object detection is Non-Maximum Suppression (NMS), which is usually based on the confidence from the classification branch  [1, 33]. This may discard candidate boxes with high-quality 3D predictions, because higher classification confidence does not always translate to better 3D predictions. To resolve this mismatch, we propose a distance-guided NMS that automatically selects candidate boxes with better distance estimates. With the distance-guided NMS, UR3D achieves better distance estimation and 3D detection accuracy.

Another challenge for monocular 3D object detection is recovering object physical sizes. Such physical parameters are abstract 3D quantities not directly linked to how objects appear in images  [14]. It is thus hard for CNNs to directly predict the physical sizes of 3D bounding boxes. Besides, direct regression of 3D box orientations has been shown to be imprecise  [1, 14, 31]. To tackle this problem, we propose a fully convolutional cascaded point regression to estimate the projected 2D center points and corner points of 3D boxes accurately and efficiently. The predicted keypoints are then used to post-optimize the physical sizes and orientations by minimizing a projection-consistency loss  [14], which improves the estimates. The contributions of the proposed UR3D are summarised below:

  1. UR3D is a single-stage and multi-scale framework that can learn a unified representation of objects within different distance ranges for monocular 3D object detection, which leads to a compact and robust network.

  2. A distance-guided NMS is proposed, which selects the candidate boxes with better distance estimations.

  3. A fully convolutional cascaded point regression is proposed to estimate the projected 2D center points and corner points precisely and efficiently. The predicted keypoints are used to post-optimize the estimated physical sizes and orientations by minimizing a projection-consistency loss.

  4. Experimental results on the KITTI  [9] autonomous driving dataset show that our method achieves accurate monocular 3D object detection with a compact architecture.

2 Related Work

2.1 2D Object Detection

Scale-Aware Designs. Large scale variation is one of the key challenges for 2D object detection. Image pyramids  [20, 37, 39, 41, 48] are a classical solution, but not efficient enough. Faster RCNN  [34] utilizes multi-scale anchor boxes to achieve multi-scale object detection. SSD  [27] further uses multi-scale features to approximate the image pyramid. Recent works  [21, 22, 23, 40] not only adopt multi-scale features, but also share the convolutional weights of detection heads on different layers to obtain better object representations. However, learning a unified object representation across different scales and distance ranges for monocular 3D object detection is non-trivial, because the quantities of 3D boxes are much more complicated; in particular, distance is highly nonlinear with respect to apparent scale. Our UR3D learns a robust and compact distance-normalized unified object representation via the proposed designs.

Score Mismatch in NMS. [12, 13] find that probabilities for class labels naturally reflect classification confidence rather than localization confidence; they therefore predict the score or uncertainty of bounding box regression, which can be used to guide the NMS procedure to preserve accurately localized bounding boxes. We reveal the severe score mismatch problem in the NMS of monocular 3D object detection and propose distance-guided NMS to tackle it.

2.2 Monocular 3D Object Detection

Distance-Aware Designs. Handling large distance variations in monocular 3D object detection is challenging, as it requires distance-specific representations. MonoDIS  [38] uses a two-stage architecture for monocular 3D object detection, in which a 2D module first detects objects and all detected objects are then fed into a 3D detection head to predict 3D parameters. MonoDIS further disentangles dependencies among different parameters by introducing a loss that handles groups of parameters separately. MonoGRNet  [33] is a multi-stage method consisting of four specialized modules for different tasks: 2D detection, instance depth estimation, 3D location estimation and local corner regression. MonoGRNet first predicts objects' 3D locations progressively and then estimates the corner coordinates locally.

MonoPSR  [16] uses a network to jointly compute 3D bounding boxes from 2D ones and estimate instance point clouds to help recover shape and scale information. Pseudo-Lidar  [42] and AM3D  [29] convert the estimated depth image into 3D point clouds to utilize the geometry information, then LiDAR-based 3D object detection methods are employed.

To help spatial feature learning, OFTNet  [35] proposes an orthographic feature transform to map image-level features into a 3D voxel map, which is then reduced to a 2D bird's-eye-view representation. M3D-RPN  [1] is a single-stage framework that exploits 3D anchor boxes to utilize 3D location priors and proposes depth-aware convolution to generate distance-specific features, which eases the difficulty of learning distance information over the full possible range.

To learn spatial location information, previous works utilize careful multi-stage designs  [33, 38], point cloud features  [16, 29, 42], or feature transformations  [1, 35]. These methods directly learn object representations covering all possible distances, without considering feature reuse between different distance ranges. UR3D solves this learning-efficiency problem by learning a unified representation for objects within different distance ranges.

3D Box Fitting via Projection-Consistency. Deep3DBox  [31] and M3D-RPN  [1] fit better 3D boxes by enforcing consistency between the 2D boxes projected from camera coordinates to image coordinates and the network-predicted 2D boxes. SS3D  [14] improves the accuracy of 3D box estimation in a similar way, and further optimizes the 3D location, physical size and orientation jointly. In comparison, our UR3D minimizes the projection-consistency loss over corner and center points as a post-optimization, but only optimizes the physical size and orientation predictions.

2.3 Cascaded Point Regression

Cascaded point regression is a classical mechanism for keypoint regression  [5, 28, 47]. [28, 47] predict facial keypoints with a multi-stage cascaded structure, i.e., a global stage predicting coarse shapes and local stages using shape-indexed features as input to predict fine shapes. Previous works mainly focus on cascaded point regression with a single object as input, which is inefficient when predicting keypoints for thousands of candidates simultaneously. In contrast, our fully convolutional cascaded point regression makes such dense prediction efficient.

3 Proposed UR3D

We first detail the overall framework, then present the three key components: the distance-normalized unified representation, the distance-guided NMS, and the fully convolutional cascaded point regression with projection-consistency based post-optimization. We term our method UR3D; the main architecture is illustrated in Fig. 1.

Fig. 1. Framework of our UR3D. UR3D learns a compact and robust unified representation for objects within different distance ranges, which relieves the model from learning complicated distance-specific representations covering all possible locations.

3.1 Basic Framework

We address the problem of monocular 3D object detection, which predicts the 3D bounding boxes of targets in camera coordinates from an RGB image. As commonly assumed  [9], we only consider yaw angles, and set roll and pitch angles to zero. We also assume that per-image calibration parameters are available at both the training and testing phases  [9]. For a given RGB image \(\mathbf{x} \in \mathbb {R}^{H\times W \times 3}\), UR3D reports all objects of the concerned categories, and the output for each object consists of:

  1. class label \( cls \) and confidence \( score \),

  2. 2D bounding box represented by its top-left and bottom-right corners \(\mathbf{b} = (a_1, b_1, a_2, b_2)\),

  3. projected 2D center point and eight corner points (in image coordinates) of the 3D box in camera coordinates, encoded as \(\mathbf{p} = (x_0, x_1, \ldots , x_8, y_0, y_1, \ldots , y_8)\),

  4. distance of the center point of the 3D bounding box, in image coordinates, encoded as \(z_0\),

  5. 3D bounding box parameters encoded as \(\mathbf{m} = (w, h, l, \sin (\theta ), \cos (\theta ))\), where \(w, h, l\) are the physical dimensions and \(\theta \) is the allocentric pose of the 3D box. UR3D predicts \(\sin (\theta )\) and \(\cos (\theta )\), then converts them to \(\theta \).

UR3D predicts the center point \((x_0, y_0, z_0)\) in image coordinate and converts it to camera coordinate using the calibration parameters during the testing phase.
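As a concrete illustration of this conversion, below is a minimal sketch of the standard pinhole back-projection, assuming \(z_0\) is the depth along the optical axis and that the calibration provides focal lengths and a principal point; the function name is illustrative and the translation terms of the full KITTI P2 matrix are omitted for brevity.

```python
import numpy as np

def image_to_camera(x0, y0, z0, fx, fy, cx, cy):
    """Back-project the predicted 2D center (x0, y0) with distance z0
    (treated as depth along the optical axis) into camera coordinates
    using pinhole intrinsics. Simplified sketch: the translation terms
    of the full KITTI P2 matrix are omitted."""
    X = (x0 - cx) * z0 / fx
    Y = (y0 - cy) * z0 / fy
    return np.array([X, Y, z0])
```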

UR3D is a single-stage and multi-scale architecture (Fig. 1). During the training stage, we assign targets to five different layers based on their scales. The assignment rules ensure that the scale range of objects assigned to a layer is larger than that of objects assigned to the previous layer. Since distance is correlated with scale, objects within different distance ranges are also assigned to different layers. Detailed assignment rules can be found in Sect. 3.5.

3.2 Distance-Normalized Unified Representation

In this part, we detail the distance-normalized unified representation. As shown in Fig. 1, there are five detection heads on each detection layer, corresponding to five tasks, i.e., classification, bounding box regression, distance estimation, keypoint regression, and physical size and yaw angle prediction. To learn a unified representation for objects assigned to different detection layers, we first share the learnable weights of the detection heads across layers, then we normalize each task's training targets on different layers to the same range according to their relationship with scale, as detailed below:

Scale-Invariant Task. Object category, physical size and orientation are attributes unrelated to apparent scale, so classification and physical size and yaw angle prediction are scale-invariant tasks. Thus the learnable weights of the classification head and the size-and-yaw head on different layers can naturally be shared to form a unified representation across layers.

Scale-Linear Task. The numerical ranges of the 2D bounding box and the keypoints depend linearly on the apparent scale, so bounding box regression and keypoint regression are tasks linear in scale. We normalize the targets of these two tasks by introducing learnable parameters \(\alpha _i\) and \(\beta _i\), and the loss functions for an object are defined as:

$$\begin{aligned} \begin{aligned} L_{bbox} = loss(\hat{\mathbf {b}}_i, \mathbf{b}_i) = loss(\hat{\mathbf {b}}_i, \alpha _i {{\mathbf{b}}_i}'), \end{aligned} \end{aligned}$$
(1)
$$\begin{aligned} \begin{aligned} L_{point} = loss(\hat{\mathbf {p}}_i, \mathbf{p}_i) = loss(\hat{\mathbf {p}}_i, \beta _i {{\mathbf{p}}_i}'), \end{aligned} \end{aligned}$$
(2)

where \(i = 0, 1, 2, 3, 4\) denotes the index of the object-assigned detection layer, \(\hat{\mathbf {b}}_i\) and \(\hat{\mathbf {p}}_i\) are groundtruths of bounding box regression and keypoint regression respectively, \({{\mathbf{b}}_i}'\) and \({{\mathbf{p}}_i}'\) are network-predicted bounding box regression result and keypoint regression result respectively, \(0< \alpha _0< \alpha _1< \alpha _2< \alpha _3 < \alpha _4\) and \(0< \beta _0< \beta _1< \beta _2< \beta _3 < \beta _4\). During the training phase, the network learns the best normalization parameters \(\alpha _i\) and \(\beta _i\) automatically. During the testing phase, we use \(\mathbf{b}_i = \alpha _i {{\mathbf{b}}_i}'\) and \(\mathbf{p}_i = \beta _i {{\mathbf{p}}_i}'\) as outputs for the bounding box regression and keypoint regression respectively.
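As an illustration of this weight sharing and learnable normalization, below is a minimal PyTorch sketch (not the authors' code): a single head is applied to all pyramid levels and its raw output is rescaled by a per-level learnable scalar, initialized with the values reported in Sect. 4.1; the layer structure and names are assumptions.

```python
import torch
import torch.nn as nn

class SharedScaledHead(nn.Module):
    """Regression head whose convolutional weights are shared across all
    detection layers, with a learnable per-level scalar (alpha_i / beta_i)
    rescaling the raw output. Structure and names are illustrative."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.head = nn.Sequential(                      # shared across the 5 levels
            nn.Conv2d(in_channels, in_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
        )
        # per-level normalization scalars, initialized as in Sect. 4.1
        self.scales = nn.Parameter(torch.tensor([32., 64., 128., 256., 512.]))

    def forward(self, feats):
        # feats: list of 5 feature maps, one per detection layer;
        # output is alpha_i * b_i' (Eq. 1), likewise beta_i * p_i' (Eq. 2)
        return [self.scales[i] * self.head(f) for i, f in enumerate(feats)]
```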

Scale-Nonlinear Task. To investigate the relationship between distance values and apparent scales, we show statistics of the car category in the KITTI training set  [9] in Fig. 2(a). The left figure shows the relationship of distance vs. height, the middle figure shows the curve of the depth value of the center point vs. height, and the right figure shows their difference vs. height. The depth images are generated by a monocular depth estimation model  [8] as in  [29, 42]. The relationships of distance vs. height and depth vs. height are highly nonlinear but follow similar trends (left and middle figures), so subtracting the depth reduces the degree of nonlinearity of the distance (right figure).

To obtain accurate distance estimation, we first introduce learnable parameters \(\gamma _i\), multiplied by the output of the \(i\)-th distance head, so that a piece-wise linear curve fits the nonlinear distance curve. However, the capacity of this piece-wise linear distance model, consisting of only five parts, is limited, and it still cannot fit the highly nonlinear distance precisely. We therefore further subtract the depth value from a low-resolution depth image of the same size as the distance head (Fig. 2(b)) to reduce the degree of nonlinearity of the distance, which significantly eases distance learning. The distance loss of an object is defined as:

Fig. 2. Illustration of the distance estimation method.

$$\begin{aligned} \begin{aligned} L_{dist} = loss(\hat{{z_0}}_i, {z_0}_i) = loss(\hat{{z_0}}_i, \gamma _i {{z_0}_i}' + depth), \end{aligned} \end{aligned}$$
(3)

where \(i = 0, 1, 2, 3, 4\) denotes the index of the object-assigned detection layer, \(\hat{{z_0}}_i\) is the groundtruth distance, \({{z_0}_i}'\) is the network-predicted distance, \(\gamma _0> \gamma _1> \gamma _2> \gamma _3> \gamma _4 > 0\), and depth is the value at the corresponding position of the low-resolution depth image. During the training phase, the network learns the best slope parameters \(\gamma _i\) automatically. During the testing phase, we use \({z_0}_i = \gamma _i {{z_0}_i}' + depth\) as the distance estimate. For both training and testing, we run the depth estimation model  [8] once and downsample the depth map to the five resolutions of the distance heads, and the maximum size of the depth maps we need is only one eighth of that required by  [29, 42].
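A minimal sketch of one such distance head is given below (an assumption about the implementation, not the authors' code): the raw output is rescaled by the learnable slope \(\gamma_i\) and the resized low-resolution depth prior is added, as in Eq. (3).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistanceHead(nn.Module):
    """Distance branch for one detection layer: z0 = gamma_i * z0' + depth
    (Eq. 3). The conv structure is illustrative; gamma is initialized with
    the per-level values reported in Sect. 4.1 (16, 8, 4, 2, 1)."""

    def __init__(self, in_channels, gamma_init):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, 3, padding=1)
        self.gamma = nn.Parameter(torch.tensor(float(gamma_init)))

    def forward(self, feat, depth_map):
        raw = self.conv(feat)                                   # z0'
        prior = F.interpolate(depth_map, size=feat.shape[-2:],  # low-res depth prior
                              mode='bilinear', align_corners=False)
        return self.gamma * raw + prior
```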

Fig. 3. Distance estimation error vs. different scores of candidate boxes. Classification score \(\times \) distance score best pushes boxes with inaccurate estimates to the left side. Statistics are based on a UR3D model trained on the KITTI [9] car class.

3.3 Distance-Guided NMS

In this part, we detail the distance-guided NMS. First, to obtain a score for the distance estimation, we extend an uncertainty-aware regression loss  [15] to distance estimation, as follows:

$$\begin{aligned} \begin{aligned} L_{dist}(\hat{z}_0, z_0) = \lambda _{dist} \frac{loss(\hat{z}_0, z_0)}{\sigma ^2} + \lambda _{uncertain} \log (\sigma ^2), \end{aligned} \end{aligned}$$
(4)

where \(\hat{z}_0\) and \(z_0\) are the groundtruth and estimated distance respectively, \(loss(\hat{z}_0, z_0)\) is a normal regression loss, \(\lambda _{dist}\) and \(\lambda _{uncertain}\) are positive parameters to balance the two parts. \(\sigma ^2\) is a positive learnable parameter and \(\frac{1}{\sigma ^2}\) can be regarded as the score of distance estimation.
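A minimal sketch of this loss is shown below, assuming an L1 base loss and a network branch that predicts \(\log(\sigma^2)\) for numerical stability; mapping the weights 0.1 and 0.05 from Sect. 3.5 onto \(\lambda_{dist}\) and \(\lambda_{uncertain}\) is our assumption.

```python
import torch

def uncertainty_distance_loss(z_pred, z_gt, log_sigma2,
                              lambda_dist=0.1, lambda_uncertain=0.05):
    """Eq. (4) with an assumed L1 base loss; the branch predicts log(sigma^2)
    so that 1/sigma^2 = exp(-log_sigma2) stays positive. The learned
    1/sigma^2 later serves as the distance-confidence score in NMS."""
    base = torch.abs(z_pred - z_gt)
    return lambda_dist * base * torch.exp(-log_sigma2) + lambda_uncertain * log_sigma2
```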

Algorithm 1. Distance-Guided NMS.

In Fig. 3, we show the correlations between the distance estimation error of predicted 3D bounding boxes and the classification score, \(\frac{1}{\sigma ^2}\), and \(\frac{score}{\sigma ^2}\). As can be seen, \(\frac{score}{\sigma ^2}\) best pushes candidates with inaccurate distance estimates to the left side. Since traditional NMS does not select the candidate boxes with better distance estimates, we propose Distance-Guided NMS (Algorithm 1) to solve this problem.
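Since Algorithm 1 appears only as a figure, the sketch below is our reconstruction from the text and the K=1/K=2 ablation in Sect. 4.2, not a verbatim transcription: candidates are ranked by classification score \(\times\) distance score, and the kept box's distance is replaced by the mean distance of the top-K distance-scored candidates in its overlap group.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU of a single (x1, y1, x2, y2) box against an (N, 4) array."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def distance_guided_nms(boxes, cls_scores, dist_scores, distances,
                        iou_thr=0.5, k=2):
    """Hedged reconstruction of Distance-Guided NMS: rank candidates by
    cls_score * dist_score (score / sigma^2); the kept box's distance is
    the mean distance of the k candidates with the best distance scores
    in its overlap group. Details may differ from Algorithm 1."""
    order = np.argsort(-(cls_scores * dist_scores))
    suppressed = np.zeros(len(boxes), dtype=bool)
    keep, kept_dist = [], []
    for idx in order:
        if suppressed[idx]:
            continue
        cand = order[~suppressed[order]]                      # remaining candidates
        group = cand[iou_one_to_many(boxes[idx], boxes[cand]) > iou_thr]
        topk = group[np.argsort(-dist_scores[group])[:k]]     # best distance estimates
        keep.append(int(idx))
        kept_dist.append(float(distances[topk].mean()))
        suppressed[group] = True                              # suppress the whole group
    return keep, kept_dist
```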

3.4 Fully Convolutional Cascaded Point Regression

The proposed fully convolutional cascaded point regression (Fig. 4) is adapted from  [4] and consists of two stages. In the first stage, we directly regress the positions of the center point and the eight corner points at each output position q, encoded as:

$$ \mathbf {p}_0=\{p_0, p_1, \ldots , p_8\} =\{(x_0, y_0), (x_1, y_1), \ldots , (x_8, y_8)\}. $$

In the second stage, we extract the shape-indexed feature guided by \(\mathbf {p}_0\) and predict the residual values of the keypoints. The extraction of the shape-indexed feature can be formulated as an efficient convolutional layer as in  [4], instead of the traditional time-consuming multi-patch extraction  [28, 47]. Let the nine positions of a \(3\times 3\) convolutional kernel correspond to the nine keypoints. The convolutional layer for the extraction consists of two steps: 1) sampling using \(\mathbf {p}_0\) as the kernel point positions over the input feature map \(\mathbf {f}_{in}\); 2) summation of the sampled values weighted by the kernel weights \(\mathbf {w}\) to obtain the output feature map \(\mathbf {f}_{out}\), i.e.,

$$\begin{aligned} \mathbf {f}_{out}(q)=\sum _{i = 0}^{8}\mathbf {w}(i)\cdot \mathbf {f}_{in}(p_i). \end{aligned}$$
(5)

The sampling is performed at irregular locations. As the location \(p_i\) is typically fractional, \(\mathbf {f}_{in}(p_i)\) in Eq. (5) is obtained by bilinear interpolation. The detailed implementation is similar to  [4]. Note that during training, gradients are not backpropagated to \(p_i\) through Eq. (5), because \(p_i\) has its own supervised loss. The keypoint losses for the two stages are:

$$\begin{aligned} \begin{aligned} L_{point_0} = loss(\hat{\mathbf {p}}, {\mathbf{p}}_0), \end{aligned} \end{aligned}$$
(6)
$$\begin{aligned} \begin{aligned} L_{point_1} = loss(\hat{\mathbf {p}}, {\mathbf{p}}) = loss(\hat{\mathbf {p}}, {\mathbf{p}}_0 + {\mathbf{p}}_1), \end{aligned} \end{aligned}$$
(7)

where \(\hat{\mathbf {p}}\) is the groundtruth of keypoint regression, \({\mathbf{p}}_0\) and \({\mathbf{p}}_1\) are the outputs of the first and second stage respectively, and \({\mathbf{p}} = {\mathbf{p}}_0 + {\mathbf{p}}_1\) is the final output of keypoint regression.
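The sampling-and-summation of Eq. (5) can be sketched with a grid-sample based layer as below; the tensor layout (keypoints stored per output position as (x, y) pixel coordinates) and the use of torch.nn.functional.grid_sample are our assumptions, while the authors' implementation follows the deformable-convolution style of [4].

```python
import torch
import torch.nn.functional as F

def shape_indexed_feature(f_in, keypoints, weight):
    """Eq. (5): sample f_in at the nine first-stage keypoints via bilinear
    interpolation (no gradient to the points) and take a weighted sum, i.e.
    a 3x3 convolution whose sampling grid is replaced by the keypoints.

    f_in:      (B, C, H, W) input feature map
    keypoints: (B, H, W, 9, 2) keypoint (x, y) pixel coordinates per position
    weight:    (C_out, C, 9) kernel weights w(i)
    """
    B, C, H, W = f_in.shape
    pts = keypoints.detach().clone()          # gradients do not flow to p_i
    # normalize pixel coordinates to [-1, 1] for grid_sample
    pts[..., 0] = pts[..., 0] / (W - 1) * 2 - 1
    pts[..., 1] = pts[..., 1] / (H - 1) * 2 - 1
    grid = pts.view(B, H, W * 9, 2)
    sampled = F.grid_sample(f_in, grid, align_corners=True)   # (B, C, H, W*9)
    sampled = sampled.view(B, C, H, W, 9)
    # f_out(q) = sum_i w(i) * f_in(p_i)
    return torch.einsum('bchwi,oci->bohw', sampled, weight)
```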

Fig. 4. Illustration of fully convolutional cascaded point regression, which formulates dense cascaded point regression as an efficient convolutional layer.

Fully convolutional cascaded point regression achieves accurate prediction for thousands of candidates simultaneously. We then use the estimated keypoints to post-optimize the physical size and yaw angle predictions. Given the center point \((x_0, y_0, z_0)\), physical size \((w, h, l)\), and yaw angle \(\theta \), we calculate the center and corner points of the corresponding 3D bounding box in camera coordinates using the calibration parameters. Denote this calculation as \(\mathbf {F}(x_0, y_0, z_0, w, h, l, \theta )\). We then seek \(w', h', l', \theta '\) that minimize the objective function:

$$\begin{aligned} \begin{aligned} \arg \min _{w', h', l', \theta '}&\ \lambda _{post}\left\| \mathbf {F}(x_0, y_0, z_0, w', h', l', \theta ') - \mathbf {p}\right\| _2^2 \\&+ (w' - w)^2 + (h' - h)^2 + (l' - l)^2, \end{aligned} \end{aligned}$$
(8)

where \(x_0, y_0, z_0, w, h, l, \theta \) are the network-predicted results, \(w', h', l', \theta '\) are the post-optimized results. This is a standard nonlinear optimization problem, which can be solved by an optimization toolbox.
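A minimal sketch of this post-optimization with SciPy is given below, assuming a simplified box parameterization (rotation about the camera y-axis, corner ordering chosen to match the keypoint encoding) and taking \(\lambda_{post} = 0.001\) from the loss weights in Sect. 3.5; both assumptions are ours, not the paper's.

```python
import numpy as np
from scipy.optimize import minimize

def box_points_2d(center, w, h, l, theta, P):
    """Project the center and 8 corners of a 3D box (rotation about the
    camera y-axis, centered at the box centroid) into the image with the
    3x4 projection matrix P. The exact KITTI convention is simplified."""
    dx, dy, dz = l / 2, h / 2, w / 2
    corners = np.array([[sx * dx, sy * dy, sz * dz]
                        for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    rot = np.array([[np.cos(theta), 0, np.sin(theta)],
                    [0, 1, 0],
                    [-np.sin(theta), 0, np.cos(theta)]])
    pts3d = np.vstack([np.zeros(3), corners @ rot.T]) + center   # center + 8 corners
    hom = P @ np.hstack([pts3d, np.ones((9, 1))]).T              # (3, 9)
    return (hom[:2] / hom[2:]).T                                 # (9, 2) pixel coords

def post_optimize(center_cam, size, theta, keypoints_2d, P, lambda_post=0.001):
    """Sketch of the projection-consistency post-optimization (Eq. 8)."""
    w0, h0, l0 = size

    def objective(params):
        w, h, l, th = params
        reproj = np.sum((box_points_2d(center_cam, w, h, l, th, P) - keypoints_2d) ** 2)
        prior = (w - w0) ** 2 + (h - h0) ** 2 + (l - l0) ** 2
        return lambda_post * reproj + prior

    return minimize(objective, x0=np.array([w0, h0, l0, theta]),
                    method='Nelder-Mead').x
```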

3.5 Implementation Details

Object Assignment Rule. During the training stage, we assign a position q on a detection layer \(\mathbf {f}_i\) (\(i = 0, 1, 2, 3, 4\)) to an object if 1) q falls inside the object, 2) the maximum distance from q to the boundaries of the object is within a given range \(\mathbf {r}_i\), and 3) the distance from q to the center of the object is less than a given value \(\mathbf {d}_i\). \(\mathbf {r}_i\) denotes the scale range of objects assigned to each detection layer  [40], and \(\mathbf {d}_i\) defines the radius of positive samples on each detection layer. \(\mathbf {r}_i\) is [0, 64], [64, 128], [128, 256], [256, 512], [512, 1024] for the five layers, and \(\mathbf {d}_i\) is 12, 24, 48, 96, 192 respectively, all in pixels. Positions not assigned to any object are regarded as negative samples, except that positions adjacent to positive samples are treated as ignored samples.
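The rule can be written as a small helper like the following (an illustrative sketch; the check that q falls inside the object is assumed to be done by the caller):

```python
def assign_layer(max_border_dist, center_dist):
    """Return the index of the detection layer a position is assigned to,
    or None: the maximum distance to the object boundaries must lie in r_i
    and the distance to the object center must be below d_i (Sect. 3.5)."""
    ranges = [(0, 64), (64, 128), (128, 256), (256, 512), (512, 1024)]  # r_i (pixels)
    radii = [12, 24, 48, 96, 192]                                       # d_i (pixels)
    for i, ((lo, hi), d) in enumerate(zip(ranges, radii)):
        if lo <= max_border_dist <= hi and center_dist < d:
            return i
    return None
```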

Network Architecture. The backbone of UR3D is ResNet-34  [11]. The depth of each detection head is two. Images are scaled to a fixed height of 384 pixels for both training and testing.

Loss. We use the focal loss  [23] for classification task, IoU loss  [46] for bounding box regression, smooth \(L_1\) loss  [10] for keypoint regression, and Wing loss  [7] for distance, size and orientation estimation. The loss weights are 1, 1, 0.003, 0.1, 0.05,  0.1, 0.001 for the classification, bounding box regression, keypoint regression, distance estimation, distance variance estimation, size and orientation estimation, and post-optimization, respectively.

Optimization. We adopt a step strategy to adjust the learning rate: it starts at 0.01 and is reduced by a factor of 50 every \(3\times 10^4\) iterations. The total number of iterations is \(9\times 10^4\) with a batch size of 5. The only augmentation we perform is random mirroring. We implement our framework in Python and PyTorch  [32]. All experiments run on a server with a 2.6 GHz CPU and a GTX Titan X GPU.

4 Experiments

We evaluate our method on KITTI  [9] dataset with the car class under the two 3D localization tasks: Bird’s Eye View (BEV) and 3D Object Detection. The method is comprehensively tested on two validation splits  [3, 43] and the official test dataset. We further present analyses on the impacts of individual components of the proposed UR3D. Finally we visualize qualitative examples of UR3D on KITTI (Fig. 5).

4.1 KITTI

The KITTI  [9] dataset provides multiple widely used benchmarks for computer vision problems in autonomous driving. The Bird's Eye View (BEV) and 3D Object Detection tasks are used to evaluate 3D localization performance. These two tasks provide 7481 training and 7518 test images with 2D and 3D annotations for cars, pedestrians, cyclists, etc. Each object is assigned a difficulty level, i.e., easy, moderate or hard, based on its occlusion level and truncation degree.

We conduct experiments on three common data splits, including val1  [3], val2  [43], and the official test split  [9]. Each split contains images from non-overlapping sequences such that no data from an evaluated frame, or its neighbors, is used for training. We report \(\text {AP}|_{R_{11}}\) and \(\text {AP}|_{R_{40}}\) on val1 and val2, and \(\text {AP}|_{R_{40}}\) on the test subset. We use the car class, the most representative category, and the official IoU criterion for cars, i.e., 0.7.

Val Set Results. We evaluate UR3D on val1 and val2 as detailed in Table 1 and Table 2. Using the same monocular depth estimator  [8] as AM3D  [29] and Pseudo-LiDAR  [42], UR3D is competitive with them on the two splits. The time cost of depth map generation for UR3D is much smaller than that of  [29, 42], since the depth maps we need are only one eighth the size of those they require. We use depth priors to normalize the learning targets of distance instead of converting to point clouds as in  [29, 42], leading to a more compact and efficient architecture.

Table 1. Bird’s Eye View. Comparisons on the Bird’s Eye View task (AP\(_\text {BEV}\)) on val1  [3] and val2  [43] of KITTI  [9].
Table 2. 3D Detection. Comparisons on the 3D Detection task (AP\(_\text {3D}\)) on val1  [3] and val2  [43] of KITTI  [9].
Table 3. Test Set Results. Comparisons of our UR3D to SOTA methods of monocular 3D object detection on the test set of KITTI  [9].

Test Set Results. We evaluate the results on the test set in Table 3. Compared with FQNet  [26], ROI-10D  [30], GS3D  [19], and MonoGRNet  [33], UR3D outperforms them significantly on all indicators. Compared with MonoDIS  [38], UR3D outperforms it by a large margin on three indicators, i.e., AP\(_\text {3D}\) of the easy subset, AP\(_\text {3D}\) of the moderate subset and AP\(_\text {BEV}\) of the easy subset. Note that MonoDIS  [38] is a two-stage method while ours is a more compact single-stage method. Compared with another single-stage method, M3D-RPN  [1], UR3D outperforms it on two indicators, i.e., AP\(_\text {3D}\) and AP\(_\text {BEV}\) of the easy subset, with a more lightweight backbone. Compared with AM3D  [29], UR3D runs much faster.

Learned Parameters. We initialize \(\alpha _i\) and \(\beta _i\) with 32, 64, 128, 256, 512, and 16, 8, 4, 2, 1 for \(\gamma _i\). The learned results on val1 split are 5.7, 10.6, 20.7, 41.0, 82.3 for \(\alpha _i\), 5.3, 10.4, 20.6, 41.4, 82.2 for \(\beta _i\), and 2.3, 1.4, 0.8, 0.3, 0.2 for \(\gamma _i\).

Table 4. Ablations. We ablate the effects of key components of UR3D with respect to accuracy and inference time.

4.2 Ablation Study

We conduct ablation experiments to examine how each proposed component affects the final performance of UR3D. We first set up a simple baseline that does not adopt the proposed components, then add the proposed designs one by one, as shown in Table 4. For all ablations we use the KITTI val1 split and evaluate on the car class. From the results listed in Table 4, we draw the following conclusions:

Distance-Normalized Unified Representation Is Crucial. The results of “\(+\) LR Depth Image” show that adding the low-resolution depth image to help normalize the distance considerably improves the AP\(_\text {3D}\) and AP\(_\text {BEV}\) of the baseline, which indicates that reducing the degree of nonlinearity of distance estimation dramatically eases unified object representation learning.

Distance-Guided NMS Is Promising. The AP\(_\text {3D}\) and AP\(_\text {BEV}\) of “\(+\) Distance-Guided NMS (\(K=1\))” are much better than the results of “\(+\) LR Depth Image”. This supports that our distance-guided NMS can select the candidate boxes with better distance estimates automatically and effectively. Increasing the number of candidates participating in the average (from \(K=1\) to \(K=2\)) also helps, suggesting that the candidate with the best distance estimate may not be the top one but among the top K due to noise in the distance score.

Fully Convolutional Cascaded Point Regression Is Effective. The results of “\(+\) Post-Optimization” illustrate that introducing the projection-consistency based post-optimization improves AP\(_\text {3D}\) and AP\(_\text {BEV}\). The results of “\(+\) Cascaded Regression” show that adding the fully convolutional cascaded point regression further improves AP\(_\text {3D}\) and AP\(_\text {BEV}\). The fully convolutional cascaded point regression only costs \(10\,ms\) with a non-optimized Python implementation.

Fig. 5. Qualitative Examples. We visualize qualitative examples of UR3D. All illustrated images are from the val1  [3] split and were not used for training. Bird's eye view results (right) are also provided, and the red lines indicate the yaw angles of the cars. (Color figure online)

5 Conclusions

In this work, we present a monocular 3D object detector, UR3D, which learns a distance-normalized unified object representation, in contrast to prior works that learn to represent objects over the full possible range. UR3D is designed to learn a shared representation across different distance ranges, which is robust and compact. We further propose a distance-guided NMS to select candidate boxes with better distance estimates and a fully convolutional cascaded point regression that predicts accurate keypoints to post-optimize the 3D box parameters, both of which improve accuracy. Collectively, our method achieves accurate monocular 3D object detection with a compact architecture.