Bottom-Up Object Detection by Grouping Extreme and Center Points
Figure 2: Illustration of our object detection method. Our network predicts four extreme point heatmaps (top row; we show the heatmaps overlaid on the input image) and one center heatmap (bottom row, left) for each category. We enumerate the combinations of the peaks (middle left) of the four extreme point heatmaps and compute the geometric center of the composed bounding box (middle right). A bounding box is produced if and only if its geometric center has a high response in the center heatmap (bottom right): a low center score rejects the combination, a high center score accepts it.

…point definition and grouping. A corner is another form of bounding box, and suffers many of the issues top-down detection suffers from. A corner often lies outside an object, without strong appearance features. Extreme points, on the other hand, lie on objects, are visually distinguishable, and have consistent local appearance features. For example, the top-most point of a human is often the head, and the bottom-most point of a car or airplane will be a wheel. This makes extreme point detection easier. The second difference to CornerNet is the geometric grouping. Our detection framework is fully appearance-based, without any implicit feature learning. In our experiments, the appearance-based grouping works significantly better.

Our idea is motivated by Papadopoulos et al. [33], who proposed to annotate bounding boxes by clicking the four extreme points. This annotation is roughly four times faster to collect and provides richer information than bounding boxes. Extreme points also have a close connection to object masks. Directly connecting the inflated extreme points offers a more fine-grained object mask than the bounding box. In our experiments, we show that fitting a simple octagon to the extreme points yields a good object mask estimate. Our method can be further combined with Deep Extreme Cut (DEXTR) [29], which turns extreme point annotations into a segmentation mask for the indicated object. Directly feeding our extreme point predictions as guidance to DEXTR [29] leads to close to state-of-the-art instance segmentation results.

Our proposed method achieves a bounding box AP of 43.7% on COCO test-dev, outperforming all reported one-stage object detectors [22, 25, 40, 52] and on-par with sophisticated two-stage detectors. A Pascal VOC [8, 14] pre-trained DEXTR [29] model yields a Mask AP of 34.6%, without using any COCO mask annotations. Code is available at https://github.com/xingyizhou/ExtremeNet.

2. Related Work

Two-stage object detectors. The Region-CNN family [11, 12, 15, 16, 41] considers object detection as two sequential problems: first propose a (large) set of category-agnostic bounding box candidates, crop them, and use an image classification module to classify the cropped region or region feature. R-CNN [12] uses selective search [47] to generate region proposals and feeds them to an ImageNet classification network. SPP [16] and Fast RCNN [11] first feed an image through a convolutional network and crop an intermediate feature map to reduce computation. Faster RCNN [41] further replaces region proposals [47] with a Region Proposal Network. The detection-by-classification idea is intuitive and keeps the best performance so far [6, 7, 19, 20, 24, 27, 37, 45, 46, 54].

Our method does not require region proposals or region classification. We argue that a region is not a necessary component in object detection. Representing an object by four extreme points is also effective and provides as much information as bounding boxes.

One-stage object detectors. One-stage object detectors [22, 25, 28, 38, 39, 42, 48] do not have a region cropping module. They can be considered as category-specific region or anchor proposal networks and directly assign a class label to each positive anchor. SSD [10, 28] uses different scale anchors in different network layers. YOLOv2 [39] learns anchor shape priors. RetinaNet [25] proposes a focal loss to balance the training contribution between positive and negative anchors. RefineDet [52] learns to early reject negative anchors. Well-designed single-stage object detectors achieve very close performance to two-stage ones at higher efficiency.

Our method falls in the one-stage detector category. However, instead of setting anchors in an O(h²w²) space, we detect five individual parts (four extreme points and one center) of a bounding box in O(hw) space. Instead of setting default scales or aspect ratios as anchors at each pixel location, we only predict the probability for that location being a keypoint. Our center map can also be seen as a scale- and aspect-ratio-agnostic region proposal network without bounding box regression.
Figure 3: Illustration of our framework. Our network (an Hourglass Network) takes an image as input and produces four C-channel extreme point heatmaps (top, left, bottom, and right, i.e., 4 × C × H × W), one C-channel center heatmap (C × H × W), and four 2-channel category-agnostic offset maps (4 × 2 × H × W). The heatmaps are trained by weighted pixel-wise logistic regression, where the weight is used to reduce the false-positive penalty near the ground truth location. The offset maps are trained with a Smooth L1 loss applied at ground truth peak locations.
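The tensor shapes in the caption can be read off a small sketch of the prediction heads. The following is only an illustrative layout: the backbone, head depth, and channel width `feat_dim` are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class ExtremePointHeads(nn.Module):
    """Illustrative prediction heads producing the tensor shapes from Figure 3.

    Assumes a backbone feature map with `feat_dim` channels at resolution H x W;
    C (`num_classes`) is the number of object categories.
    """
    def __init__(self, feat_dim=256, num_classes=80):
        super().__init__()
        def head(out_channels):
            return nn.Sequential(
                nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(feat_dim, out_channels, 1))
        # four C-channel extreme point heatmaps + one C-channel center heatmap
        self.heatmaps = nn.ModuleList([head(num_classes) for _ in range(5)])
        # four 2-channel category-agnostic offset maps
        self.offsets = nn.ModuleList([head(2) for _ in range(4)])

    def forward(self, feat):                                     # feat: (N, feat_dim, H, W)
        heat = [torch.sigmoid(h(feat)) for h in self.heatmaps]   # 5 maps of (N, C, H, W)
        off = [o(feat) for o in self.offsets]                    # 4 maps of (N, 2, H, W)
        return heat, off
```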
Deformable Part Model. As a bottom-up object detection method, our idea of grouping center and extreme points is related to the Deformable Part Model [9]. Our center point detector functions similarly to the root filter in DPM [9], and our four extreme points can be considered as a universal part decomposition for all categories. Instead of learning the part configuration, our predicted center and four extreme points have a fixed geometric structure. And we use a state-of-the-art keypoint detection network instead of low-level image filters for part detection.

Grouping in bottom-up human pose estimation. Determining which keypoints are from the same person is an important component in bottom-up multi-person pose estimation. There are multiple solutions: Newell et al. [30] propose to learn an associative feature for each keypoint, which is trained using an embedding loss. Cao et al. [3] learn an affinity field which resembles the edge between connected keypoints. Papandreou et al. [34] learn the displacement to the parent joint on the human skeleton tree, as a 2-d feature for each keypoint. Nie et al. [32] also learn a feature as the offset with respect to the object center.

In contrast to all the above methods, our center grouping is purely appearance-based and is easy to learn, by exploiting the geometric structure of extreme points and their center.

Implicit keypoint detection. Prevalent keypoint detection methods work on well-defined semantic keypoints, e.g., human joints. StarMap [53] mixes all types of keypoints using a single heatmap for general keypoint detection. Our extreme and center points are a kind of such general implicit keypoints, but with a more explicit geometric property.

3. Preliminaries

Extreme and center points. Let $(x^{(tl)}, y^{(tl)}, x^{(br)}, y^{(br)})$ denote the four sides of a bounding box. To annotate a bounding box, a user commonly clicks on the top-left $(x^{(tl)}, y^{(tl)})$ and bottom-right $(x^{(br)}, y^{(br)})$ corners. As both points regularly lie outside an object, these clicks are often inaccurate and need to be adjusted a few times. The whole process takes 34.5 seconds on average [44]. Papadopoulos et al. [33] propose to annotate the bounding box by clicking the four extreme points $(x^{(t)}, y^{(t)})$, $(x^{(l)}, y^{(l)})$, $(x^{(b)}, y^{(b)})$, $(x^{(r)}, y^{(r)})$, where the box is $(x^{(l)}, y^{(t)}, x^{(r)}, y^{(b)})$. An extreme point is a point $(x^{(a)}, y^{(a)})$ such that no other point $(x, y)$ on the object lies further along one of the four cardinal directions $a$: top, bottom, left, right. Extreme click annotation time is 7.2 seconds on average [33]. The resulting annotation is on-par with the more time-consuming box annotation. Here, we use the extreme click annotations directly and bypass the bounding box. We additionally use the center point of each object, $\left(\frac{x^{(l)}+x^{(r)}}{2}, \frac{y^{(t)}+y^{(b)}}{2}\right)$.
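To make the definition concrete, the following sketch (illustrative only, not the authors' data-processing code) extracts the four extreme points and the center point from a binary object mask:

```python
import numpy as np

def extreme_and_center_points(mask):
    """Return top, left, bottom, right extreme points and the center point
    of a binary mask (H x W array of 0/1), as (x, y) coordinates.

    If several pixels tie along a cardinal direction, the first one is used.
    """
    ys, xs = np.nonzero(mask)
    top    = (xs[np.argmin(ys)], ys.min())   # smallest y on the object
    bottom = (xs[np.argmax(ys)], ys.max())   # largest y on the object
    left   = (xs.min(), ys[np.argmin(xs)])   # smallest x on the object
    right  = (xs.max(), ys[np.argmax(xs)])   # largest x on the object
    # center point as defined above: midpoint of the box sides
    center = ((left[0] + right[0]) / 2, (top[1] + bottom[1]) / 2)
    return top, left, bottom, right, center
```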
Keypoint detection. Keypoint estimation, e.g., human joint estimation [3, 5, 15, 30, 49] or chair corner point estimation [36, 53], commonly uses a fully convolutional encoder-decoder network to predict a multi-channel heatmap for each type of keypoint (e.g., one heatmap for the human head, another heatmap for the human wrist). The network is trained in a fully supervised way, with either an L2 loss to a rendered Gaussian map [3, 5, 30, 49] or with a per-pixel logistic regression loss [22, 34, 35]. State-of-the-art keypoint estimation networks, e.g., the 104-layer HourglassNet [22, 31], are trained in a fully convolutional manner. They regress to a heatmap $\hat{Y} \in (0, 1)^{H \times W}$ of width $W$ and height $H$ for each output channel. The training is guided by a multi-peak Gaussian heatmap $Y \in (0, 1)^{H \times W}$, where each keypoint defines the mean of a Gaussian kernel. The standard deviation is either fixed, or set proportional to the object size [22]. The Gaussian heatmap serves as the regression target in the L2 loss case or as the weight map to reduce penalty near a positive location in the logistic regression case [22].
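As an illustration of this training target, a multi-peak Gaussian heatmap can be rendered as follows (a sketch; the radius and standard-deviation handling in the actual training code may differ):

```python
import numpy as np

def render_gaussian_heatmap(keypoints, height, width, sigma=2.0):
    """Render a multi-peak Gaussian heatmap Y in (0, 1)^(H x W).

    Each keypoint (x, y) defines the mean of a Gaussian kernel; overlapping
    kernels are merged with an element-wise maximum.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float32)
    for (kx, ky) in keypoints:
        g = np.exp(-((xs - kx) ** 2 + (ys - ky) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)
    return heatmap
```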
CornerNet. CornerNet [22] uses keypoint estimation with an HourglassNet [31] as an object detector. They predict two sets of heatmaps for the opposing corners of the box. In order to balance the positive and negative locations, they use a modified focal loss [25] for training:

$$L_{det} = -\frac{1}{N}\sum_{i=1}^{H}\sum_{j=1}^{W}
\begin{cases}
(1-\hat{Y}_{ij})^{\alpha}\log(\hat{Y}_{ij}) & \text{if } Y_{ij}=1 \\
(1-Y_{ij})^{\beta}(\hat{Y}_{ij})^{\alpha}\log(1-\hat{Y}_{ij}) & \text{otherwise,}
\end{cases}
\quad (1)$$

where $\alpha$ and $\beta$ are hyper-parameters fixed to $\alpha = 2$ and $\beta = 4$ during training, and $N$ is the number of objects in the image.
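A minimal PyTorch-style sketch of this loss is given below, assuming `pred` is the sigmoid heatmap $\hat{Y}$ and `gt` the Gaussian-weighted target $Y$; the clamping and the normalisation by the number of positive locations are implementation choices, not taken from the paper:

```python
import torch

def modified_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
    """Penalty-reduced pixel-wise logistic regression (Eq. 1).

    pred, gt: tensors of shape (N, C, H, W); gt is 1 at keypoints and decays
    with a Gaussian around them, which down-weights nearby negatives.
    """
    pred = pred.clamp(eps, 1 - eps)
    pos = gt.eq(1).float()
    neg = 1 - pos
    pos_loss = ((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = ((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * neg
    # number of positive (keypoint) locations, used in place of N in Eq. 1
    num_pos = pos.sum().clamp(min=1)
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```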
For sub-pixel accuracy of extreme points, CornerNet additionally regresses to a category-agnostic keypoint offset $\Delta^{(a)}$ for each corner. This regression recovers part of the information lost in the down-sampling of the hourglass network. The offset map is trained with a Smooth L1 loss [11], $\mathrm{SL}_1$, on ground truth extreme point locations:

$$L_{off} = \frac{1}{N}\sum_{k=1}^{N} \mathrm{SL}_1\!\left(\Delta^{(a)},\; \vec{x}/s - \lfloor \vec{x}/s \rfloor\right), \quad (2)$$

where $s$ is the down-sampling factor of the network and $\vec{x}$ is the ground truth keypoint coordinate.

Algorithm 1: Center Grouping
  Input:  center and extreme point heatmaps of an image for one category:
          Ŷ^(c), Ŷ^(t), Ŷ^(l), Ŷ^(b), Ŷ^(r) ∈ (0, 1)^{H×W};
          center and peak selection thresholds τ_c and τ_p
  Output: bounding boxes with scores
  // Convert heatmaps into coordinates of keypoints.
  // T, L, B, R are sets of points.
  T ← ExtractPeak(Ŷ^(t), τ_p)
  L ← ExtractPeak(Ŷ^(l), τ_p)
  B ← ExtractPeak(Ŷ^(b), τ_p)
  R ← ExtractPeak(Ŷ^(r), τ_p)
  for t ∈ T, l ∈ L, b ∈ B, r ∈ R do
      // if the bounding box is valid
      if t_y ≤ l_y, r_y ≤ b_y and l_x ≤ t_x, b_x ≤ r_x then
          // compute the geometric center
          c_x ← (l_x + r_x)/2
          c_y ← (t_y + b_y)/2
          // if the center is detected
          if Ŷ^(c)_{c_x, c_y} ≥ τ_c then
              add bounding box (l_x, t_y, r_x, b_y) with score
              (Ŷ^(t)_{t_x,t_y} + Ŷ^(l)_{l_x,l_y} + Ŷ^(b)_{b_x,b_y} + Ŷ^(r)_{r_x,r_y} + Ŷ^(c)_{c_x,c_y}) / 5
          end
      end
  end
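A minimal NumPy sketch of Algorithm 1 for one category is given below; the peak extraction step (ExtractPeak) is assumed to have already produced lists of (x, y, score) tuples, and the brute-force enumeration is kept deliberately simple:

```python
import itertools
import numpy as np

def center_grouping(peaks_t, peaks_l, peaks_b, peaks_r, center_heatmap, tau_c=0.1):
    """Group extreme point peaks into boxes via their geometric center.

    Each peaks_* entry is an (x, y, score) tuple; center_heatmap is an
    H x W array in (0, 1). Returns a list of (x1, y1, x2, y2, score) boxes.
    """
    boxes = []
    for (tx, ty, ts), (lx, ly, ls), (bx, by, bs), (rx, ry, rs) in itertools.product(
            peaks_t, peaks_l, peaks_b, peaks_r):
        # geometric validity check from Algorithm 1
        if not (ty <= ly and ry <= by and lx <= tx and bx <= rx):
            continue
        # geometric center of the composed bounding box
        cx = int(round((lx + rx) / 2))
        cy = int(round((ty + by) / 2))
        cs = center_heatmap[cy, cx]
        # keep the box only if its center has a high response in the center heatmap
        if cs >= tau_c:
            score = (ts + ls + bs + rs + cs) / 5.0
            boxes.append((lx, ty, rx, by, score))
    return boxes
```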
Table 2: State-of-the-art comparison on COCO test-dev. SS/MS are short for single-scale/multi-scale testing, respectively. It shows that our ExtremeNet is on-par with state-of-the-art region-based object detectors.
…trained, we provide error analysis by replacing each output component with its ground truth. Table 1 shows the result. A ground truth center heatmap alone does not increase AP much. This indicates that our center heatmap is trained quite well, and shows that the implicit object center is learnable. Replacing the extreme point heatmap with ground truth gives a 16.3% AP improvement. When replacing both the extreme point heatmap and the center heatmap, the result comes to 79.8%, much higher than replacing only one of them. This is because our center grouping is very strict in the keypoint location, and high performance requires improving both the extreme point heatmap and the center heatmap. Adding the ground truth offsets further increases the AP to 86.0%. The remaining error comes from the ghost boxes (Section 4.2).

5.5. State-of-the-art comparisons

Table 2 compares ExtremeNet to other state-of-the-art methods on COCO test-dev. Our model with multi-scale testing achieves an AP of 43.7, outperforming all reported one-stage object detectors and on-par with popular two-stage detectors. Notably, it performs 1.6% higher than CornerNet, which shows the advantage of detecting extreme and center points over detecting corners with associative features. In the single-scale setting, our performance is 0.3% AP below CornerNet [22]. However, our method has higher AP for small and medium objects than CornerNet, which is known to be more challenging. For larger objects, our center response map might not be accurate enough to perform well, as a few-pixel shift might make the difference between a detection and a false negative. Further, note that we used half the number of GPUs to train our model.

                        AP    AP50  AP75  APS   APM   APL
  BBox                  12.1  34.9   6.2   8.2  12.7  16.9
  Ours octagon          18.9  44.5  13.7  10.4  20.4  28.3
  Ours+DEXTR [29]       34.6  54.9  36.6  16.6  36.5  52.0
  Mask RCNN-50 [15]     34.0  55.5  36.1  14.4  36.7  51.9
  Mask RCNN-101 [15]    37.5  60.6  39.9  17.7  41.0  55.4

Table 3: Instance segmentation evaluation on COCO val2017. The results are shown in Mask AP.

5.6. Instance Segmentation

Finally, we compare our instance segmentation results with/without DEXTR [29] to other baselines in Table 3. As a dummy baseline, we directly assign all pixels inside the rectangular bounding box as the segmentation mask.
Table 4: Qualitative results on COCO val2017. First and second columns: our predicted (combined four) extreme point heatmap and center heatmap, respectively, shown overlaid on the input image. Heatmaps of different categories are shown in different colors. Third column: our predicted bounding box and the octagon mask formed by the extreme points. Fourth column: the masks resulting from feeding our extreme point predictions to DEXTR [29].
The result of our best model (with 43.3% bounding box AP) is 12.1% Mask AP. The simple octagon mask (Section 4.4) based on our predicted extreme points gets a mask AP of 18.9%, much better than the bounding box baseline. This shows that the simple octagon mask can give a reasonable object mask without additional cost. Note that directly using the quadrangle of the four extreme points yields a mask that is too small, with a lower IoU.
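For illustration, one plausible way to form such an octagon from the four extreme points is sketched below; the fraction of the edge length used for each segment is an assumption, since the exact construction of Section 4.4 is not reproduced in this excerpt:

```python
import numpy as np

def octagon_from_extreme_points(top, left, bottom, right, frac=0.25):
    """Form an octagon polygon from four extreme points given as (x, y).

    Each extreme point is extended in both directions along its box edge into
    a segment of `frac` of that edge's length (clipped to the box), and the
    eight segment endpoints are connected. The fraction is an assumption.
    """
    x1, x2 = left[0], right[0]          # left and right box sides
    y1, y2 = top[1], bottom[1]          # top and bottom box sides
    w, h = x2 - x1, y2 - y1
    pts = [
        (max(top[0] - frac * w, x1), y1),    (min(top[0] + frac * w, x2), y1),
        (x2, max(right[1] - frac * h, y1)),  (x2, min(right[1] + frac * h, y2)),
        (min(bottom[0] + frac * w, x2), y2), (max(bottom[0] - frac * w, x1), y2),
        (x1, min(left[1] + frac * h, y2)),   (x1, max(left[1] - frac * h, y1)),
    ]
    return np.array(pts)                # polygon vertices in clockwise order
```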
When combined with DEXTR [29], our method achieves a mask AP of 34.6% on COCO val2017. To put this result in context, the state-of-the-art Mask RCNN [15] gets a mask AP of 37.5% with a ResNeXt-101-FPN [24, 50] backbone and 34.0% AP with Res50-FPN. Considering that our model has not been trained on the COCO segmentation annotations, or on any class-specific segmentations at all, our result, which is on-par with Res50 [17] and 2.9% AP below ResNeXt-101, is very competitive.

6. Conclusion

In conclusion, we present a novel object detection framework based on bottom-up extreme point estimation. Our framework extracts four extreme points and groups them in a purely geometric manner. The presented framework yields state-of-the-art detection results and produces competitive instance segmentation results on MSCOCO, without seeing any COCO training instance segmentations.

Acknowledgement We thank Chao-Yuan Wu, Dian Chen, and Chia-Wen Cheng for helpful feedback.
References

[1] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis. Soft-NMS – improving object detection with one line of code. In ICCV, 2017.
[2] Z. Cai and N. Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In CVPR, 2018.
[3] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.
[4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. PAMI, 2018.
[5] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun. Cascaded pyramid network for multi-person pose estimation. In CVPR, 2018.
[6] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In NIPS, 2016.
[7] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In ICCV, 2017.
[8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[9] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 2010.
[10] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. DSSD: Deconvolutional single shot detector. arXiv preprint, 2017.
[11] R. Girshick. Fast R-CNN. In ICCV, 2015.
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[13] R. B. Girshick, P. F. Felzenszwalb, and D. A. McAllester. Object detection with grammar models. In NIPS, 2011.
[14] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.
[15] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[18] J. H. Hosang, R. Benenson, and B. Schiele. Learning non-maximum suppression. In CVPR, 2017.
[19] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR, 2017.
[20] B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang. Acquisition of localization confidence for accurate object detection. In ECCV, 2018.
[21] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2014.
[22] H. Law and J. Deng. CornerNet: Detecting objects as paired keypoints. In ECCV, 2018.
[23] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun. Light-head R-CNN: In defense of two-stage object detector. arXiv preprint, 2017.
[24] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[25] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In ICCV, 2017.
[26] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[27] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia. Path aggregation network for instance segmentation. In CVPR, 2018.
[28] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
[29] K. Maninis, S. Caelles, J. Pont-Tuset, and L. Van Gool. Deep extreme cut: From extreme points to object segmentation. In CVPR, 2018.
[30] A. Newell, Z. Huang, and J. Deng. Associative embedding: End-to-end learning for joint detection and grouping. In NIPS, 2017.
[31] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
[32] X. Nie, J. Feng, J. Xing, and S. Yan. Pose partition networks for multi-person pose estimation. In ECCV, 2018.
[33] D. P. Papadopoulos, J. R. Uijlings, F. Keller, and V. Ferrari. Extreme clicking for efficient object annotation. In ICCV, 2017.
[34] G. Papandreou, T. Zhu, L.-C. Chen, S. Gidaris, J. Tompson, and K. Murphy. PersonLab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In ECCV, 2018.
[35] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy. Towards accurate multi-person pose estimation in the wild. In CVPR, 2017.
[36] G. Pavlakos, X. Zhou, A. Chan, K. G. Derpanis, and K. Daniilidis. 6-DoF object pose from semantic keypoints. In ICRA, 2017.
[37] C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, and J. Sun. MegDet: A large mini-batch object detector. In CVPR, 2018.
[38] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
[39] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. In CVPR, 2017.
[40] J. Redmon and A. Farhadi. YOLOv3: An incremental improvement. arXiv preprint, 2018.
[41] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[42] Z. Shen, Z. Liu, J. Li, Y.-G. Jiang, Y. Chen, and X. Xue. DSOD: Learning deeply supervised object detectors from scratch. In ICCV, 2017.
[43] B. Singh and L. S. Davis. An analysis of scale invariance in object detection – SNIP. In CVPR, 2018.
[44] H. Su, J. Deng, and L. Fei-Fei. Crowdsourcing annotations for visual object detection. In AAAIW, 2012.
[45] L. Tychsen-Smith and L. Petersson. DeNet: Scalable real-time object detection with directed sparse sampling. arXiv preprint arXiv:1703.10295, 2017.
[46] L. Tychsen-Smith and L. Petersson. Improving object localization with fitness NMS and bounded IoU loss. In CVPR, 2017.
[47] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 2013.
[48] X. Wang, K. Chen, Z. Huang, C. Yao, and W. Liu. Point linking network for object detection. arXiv preprint arXiv:1706.03646, 2017.
[49] B. Xiao, H. Wu, and Y. Wei. Simple baselines for human pose estimation and tracking. In ECCV, 2018.
[50] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
[51] H. Xu, X. Lv, X. Wang, Z. Ren, N. Bodla, and R. Chellappa. Deep regionlets for object detection. In ECCV, 2018.
[52] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li. Single-shot refinement neural network for object detection. In CVPR, 2018.
[53] X. Zhou, A. Karpur, L. Luo, and Q. Huang. StarMap for category-agnostic keypoint and viewpoint estimation. In ECCV, 2018.
[54] Y. Zhu, C. Zhao, J. Wang, X. Zhao, Y. Wu, H. Lu, et al. CoupleNet: Coupling global structure with local parts for object detection. In ICCV, 2017.