

Bottom-up Object Detection by Grouping Extreme and Center Points

Xingyi Zhou, Jiacheng Zhuo, Philipp Krähenbühl
UT Austin
zhouxy@cs.utexas.edu, jzhuo@cs.utexas.edu, philkr@cs.utexas.edu
arXiv:1901.08043v3 [cs.CV] 25 Apr 2019

Abstract

With the advent of deep learning, object detection drifted from a bottom-up to a top-down recognition problem. State-of-the-art algorithms enumerate a near-exhaustive list of object locations and classify each into: object or not. In this paper, we show that bottom-up approaches still perform competitively. We detect four extreme points (top-most, left-most, bottom-most, right-most) and one center point of objects using a standard keypoint estimation network. We group the five keypoints into a bounding box if they are geometrically aligned. Object detection is then a purely appearance-based keypoint estimation problem, without region classification or implicit feature learning. The proposed method performs on par with state-of-the-art region-based detection methods, with a bounding box AP of 43.7% on COCO test-dev. In addition, our estimated extreme points directly span a coarse octagonal mask, with a COCO Mask AP of 18.9%, much better than the Mask AP of vanilla bounding boxes. Extreme point guided segmentation further improves this to 34.6% Mask AP.

Figure 1: We propose to detect objects by finding their extreme points. They directly form a bounding box, but also give a much tighter octagonal approximation of the object.

1. Introduction

Top-down approaches have dominated object detection for years. Prevalent detectors convert object detection into rectangular region classification, by either explicitly cropping the region [12] or region feature [11, 41] (two-stage object detection) or implicitly setting fixed-size anchors as region proxies [25, 28, 38] (one-stage object detection). However, top-down detection is not without limits. A rectangular bounding box is not a natural object representation. Most objects are not axis-aligned boxes, and fitting them inside a box includes many distracting background pixels (Figure 1). In addition, top-down object detectors enumerate a large number of possible box locations without truly understanding the compositional visual grammars [9, 13] of objects themselves. This is computationally expensive. Finally, boxes are a bad proxy for the objects themselves. They convey little detailed object information, e.g., object shape and pose.

In this paper, we propose ExtremeNet, a bottom-up object detection framework that detects the four extreme points (top-most, left-most, bottom-most, right-most) of an object. We use a state-of-the-art keypoint estimation framework [3, 5, 30, 31, 49] to find extreme points, by predicting four multi-peak heatmaps for each object category. In addition, we use one heatmap per category predicting the object center, as the average of the two bounding box edges in both the x and y dimension. We group extreme points into objects with a purely geometry-based approach: we group four extreme points, one from each map, if and only if their geometric center is predicted in the center heatmap with a score higher than a pre-defined threshold. We enumerate all O(n^4) combinations of extreme point predictions, and select the valid ones. The number of extreme point predictions n is usually quite small; for COCO [26], n ≤ 40, and a brute-force algorithm implemented on the GPU is sufficient. Figure 2 shows an overview of the proposed method.

We are not the first to use deep keypoint prediction for object detection. CornerNet [22] predicts two opposing corners of a bounding box. They group corner points into bounding boxes using an associative embedding feature [30]. Our approach differs in two key aspects:

keypoint definition and grouping. A corner is another form of bounding box, and suffers many of the issues top-down detection suffers from. A corner often lies outside an object, without strong appearance features. Extreme points, on the other hand, lie on objects, are visually distinguishable, and have consistent local appearance features. For example, the top-most point of a human is often the head, and the bottom-most point of a car or airplane will be a wheel. This makes extreme point detection easier. The second difference to CornerNet is the geometric grouping. Our detection framework is fully appearance-based, without any implicit feature learning. In our experiments, the appearance-based grouping works significantly better.

Our idea is motivated by Papadopoulos et al. [33], who proposed to annotate bounding boxes by clicking the four extreme points. This annotation is roughly four times faster to collect and provides richer information than bounding boxes. Extreme points also have a close connection to object masks. Directly connecting the inflated extreme points offers a more fine-grained object mask than the bounding box. In our experiments, we show that fitting a simple octagon to the extreme points yields a good object mask estimation. Our method can be further combined with Deep Extreme Cut (DEXTR) [29], which turns extreme point annotations into a segmentation mask for the indicated object. Directly feeding our extreme point predictions as guidance to DEXTR [29] leads to close to state-of-the-art instance segmentation results.

Our proposed method achieves a bounding box AP of 43.7% on COCO test-dev, out-performing all reported one-stage object detectors [22, 25, 40, 52] and on par with sophisticated two-stage detectors. A Pascal VOC [8, 14] pre-trained DEXTR [29] model yields a Mask AP of 34.6%, without using any COCO mask annotations. Code is available at https://github.com/xingyizhou/ExtremeNet.

Figure 2: Illustration of our object detection method. Our network predicts four extreme point heatmaps (top row; we show the heatmaps overlaid on the input image) and one center heatmap (bottom row, left) for each category. We enumerate the combinations of the peaks (middle left) of the four extreme point heatmaps and compute the geometric center of the composed bounding box (middle right). A bounding box is produced if and only if its geometric center has a high response in the center heatmap (bottom right): a low center score is rejected, a high center score is accepted.

2. Related Work

Two-stage object detectors  The Region-CNN family [11, 12, 15, 16, 41] considers object detection as two sequential problems: first propose a (large) set of category-agnostic bounding box candidates, crop them, and use an image classification module to classify the cropped region or region feature. R-CNN [12] uses selective search [47] to generate region proposals and feeds them to an ImageNet classification network. SPP [16] and Fast RCNN [11] first feed an image through a convolutional network and crop an intermediate feature map to reduce computation. Faster RCNN [41] further replaces region proposals [47] with a Region Proposal Network. The detection-by-classification idea is intuitive and keeps the best performance so far [6, 7, 19, 20, 24, 27, 37, 45, 46, 54].

Our method does not require region proposals or region classification. We argue that a region is not a necessary component in object detection. Representing an object by four extreme points is also effective and provides as much information as a bounding box.

One-stage object detector  One-stage object detectors [22, 25, 28, 38, 39, 42, 48] do not have a region cropping module. They can be considered as category-specific region or anchor proposal networks that directly assign a class label to each positive anchor. SSD [10, 28] uses different scale anchors in different network layers. YOLOv2 [39] learns anchor shape priors. RetinaNet [25] proposes a focal loss to balance the training contribution between positive and negative anchors. RefineDet [52] learns to early reject negative anchors. Well-designed single-stage object detectors achieve performance very close to two-stage ones at higher efficiency.

Our method falls in the one-stage detector category. However, instead of setting anchors in an O(h^2 w^2) space, we detect five individual parts (four extreme points and one center) of a bounding box in O(hw) space. Instead of setting default scales or aspect ratios as anchors at each pixel location, we only predict the probability of that location being a keypoint. Our center map can also be seen as a scale- and aspect-ratio-agnostic region proposal network without bounding box regression.
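To give a rough sense of scale for this O(h^2 w^2) versus O(hw) comparison, here is a back-of-the-envelope calculation of ours (not from the paper), assuming the 128 × 128 output resolution reported later in Section 5.2: an exhaustive box space over a 128 × 128 grid contains (128 · 129 / 2)^2 ≈ 6.8 × 10^7 candidate boxes, while the five keypoint maps contain only 5 · 128 · 128 = 81,920 per-pixel keypoint hypotheses per class, roughly three orders of magnitude fewer hypotheses to score.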
Figure 3: Illustration of our framework. Our network (an Hourglass Network) takes an image as input and produces four C-channel extreme point heatmaps (4 × C × H × W; top, left, bottom, right), one C-channel center heatmap (C × H × W), and four 2-channel category-agnostic offset maps (4 × 2 × H × W; top, left, bottom, right). The heatmaps are trained by weighted pixel-wise logistic regression, where the weight is used to reduce the false-positive penalty near the ground truth location. The offset maps are trained with a Smooth L1 loss applied at ground truth peak locations.
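As a concrete reading of Figure 3, the following minimal sketch spells out the output tensor shapes, assuming PyTorch-style NCHW tensors, C = 80 classes, and the 128 × 128 output resolution given in Section 5.2; the dummy tensors and names are ours, not the authors' code.

    import torch

    C, H, W = 80, 128, 128   # MS COCO classes and the output resolution reported in Sec. 5.2
    batch = 2

    # Per Figure 3: four C-channel extreme point heatmaps, one C-channel center heatmap,
    # and four 2-channel category-agnostic offset maps, all at the output resolution.
    heatmaps = {name: torch.rand(batch, C, H, W)
                for name in ('top', 'left', 'bottom', 'right', 'center')}
    offsets = {name: torch.rand(batch, 2, H, W)
               for name in ('top', 'left', 'bottom', 'right')}

    total_channels = 5 * C + 4 * 2   # 5C heatmap channels + 8 offset channels = 408 for C = 80
    print(total_channels)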

Deformable Part Model  As a bottom-up object detection method, our idea of grouping center and extreme points is related to the Deformable Part Model [9]. Our center point detector functions similarly to the root filter in DPM [9], and our four extreme points can be considered a universal part decomposition for all categories. Instead of learning the part configuration, our predicted center and four extreme points have a fixed geometric structure. And we use a state-of-the-art keypoint detection network instead of low-level image filters for part detection.

Grouping in bottom-up human pose estimation  Determining which keypoints are from the same person is an important component in bottom-up multi-person pose estimation. There are multiple solutions: Newell et al. [30] propose to learn an associative feature for each keypoint, which is trained using an embedding loss. Cao et al. [3] learn an affinity field which resembles the edge between connected keypoints. Papandreou et al. [34] learn the displacement to the parent joint on the human skeleton tree, as a 2-d feature for each keypoint. Nie et al. [32] also learn a feature as the offset with respect to the object center.

In contrast to all the above methods, our center grouping is purely appearance-based and easy to learn, by exploiting the geometric structure of extreme points and their center.

Implicit keypoint detection  Prevalent keypoint detection methods work on well-defined semantic keypoints, e.g., human joints. StarMap [53] mixes all types of keypoints using a single heatmap for general keypoint detection. Our extreme and center points are a kind of such general implicit keypoints, but with a more explicit geometric property.

3. Preliminaries

Extreme and center points  Let (x^{(tl)}, y^{(tl)}, x^{(br)}, y^{(br)}) denote the four sides of a bounding box. To annotate a bounding box, a user commonly clicks on the top-left (x^{(tl)}, y^{(tl)}) and bottom-right (x^{(br)}, y^{(br)}) corners. As both points regularly lie outside an object, these clicks are often inaccurate and need to be adjusted a few times. The whole process takes 34.5 seconds on average [44]. Papadopoulos et al. [33] propose to annotate the bounding box by clicking the four extreme points (x^{(t)}, y^{(t)}), (x^{(l)}, y^{(l)}), (x^{(b)}, y^{(b)}), (x^{(r)}, y^{(r)}), where the box is (x^{(l)}, y^{(t)}, x^{(r)}, y^{(b)}). An extreme point is a point (x^{(a)}, y^{(a)}) such that no other point (x, y) on the object lies further along one of the four cardinal directions a: top, bottom, left, right. Extreme click annotation time is 7.2 seconds on average [33]. The resulting annotation is on par with the more time-consuming box annotation. Here, we use the extreme click annotations directly and bypass the bounding box. We additionally use the center point of each object, ((x^{(l)} + x^{(r)})/2, (y^{(t)} + y^{(b)})/2).

Keypoint detection  Keypoint estimation, e.g., human joint estimation [3, 5, 15, 30, 49] or chair corner point estimation [36, 53], commonly uses a fully convolutional encoder-decoder network to predict a multi-channel heatmap for each type of keypoint (e.g., one heatmap for the human head, another heatmap for the human wrist). The network is trained in a fully supervised way, with either an L2 loss to a rendered Gaussian map [3, 5, 30, 49] or with a per-pixel logistic regression loss [22, 34, 35]. State-of-the-art keypoint estimation networks, e.g., the 104-layer HourglassNet [22, 31], are trained in a fully convolutional manner. They regress to a heatmap Ŷ ∈ (0, 1)^{H×W} of width W and height H for each output channel. The training is guided by a multi-peak Gaussian heatmap Y ∈ (0, 1)^{H×W}, where each keypoint defines the mean of a Gaussian kernel. The standard deviation is either fixed, or set proportional to the object size [22]. The Gaussian heatmap serves as the regression target in the L2 loss case or as the weight map to reduce the penalty near a positive location in the logistic regression case [22].
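The definitions above translate directly into a few lines of code. The following NumPy sketch (a hypothetical helper of ours, not part of the paper) derives the four extreme points, the box (x^{(l)}, y^{(t)}, x^{(r)}, y^{(b)}), and the center point from a polygonal object outline, using image coordinates with y growing downward.

    import numpy as np

    def extreme_points_and_center(polygon):
        """polygon: (N, 2) array of (x, y) vertices outlining one object."""
        pts = np.asarray(polygon, dtype=np.float32)
        top    = pts[pts[:, 1].argmin()]   # smallest y (image coordinates)
        bottom = pts[pts[:, 1].argmax()]   # largest y
        left   = pts[pts[:, 0].argmin()]   # smallest x
        right  = pts[pts[:, 0].argmax()]   # largest x
        # Box spanned by the extreme points: (x_l, y_t, x_r, y_b).
        box = (float(left[0]), float(top[1]), float(right[0]), float(bottom[1]))
        # Center point as defined above: ((x_l + x_r)/2, (y_t + y_b)/2).
        center = ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)
        return (top, left, bottom, right), box, center

    # Toy example: a diamond-shaped object.
    extremes, box, center = extreme_points_and_center([(5, 0), (10, 5), (5, 10), (0, 5)])
    print(box, center)   # (0.0, 0.0, 10.0, 10.0) (5.0, 5.0)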
CornerNet  CornerNet [22] uses keypoint estimation with an HourglassNet [31] as an object detector. They predict two sets of heatmaps for the opposing corners of the box. In order to balance the positive and negative locations, they use a modified focal loss [25] for training:

    L_{det} = -\frac{1}{N} \sum_{i=1}^{H} \sum_{j=1}^{W}
      \begin{cases}
        (1 - \hat{Y}_{ij})^{\alpha} \log(\hat{Y}_{ij}) & \text{if } Y_{ij} = 1 \\
        (1 - Y_{ij})^{\beta} (\hat{Y}_{ij})^{\alpha} \log(1 - \hat{Y}_{ij}) & \text{otherwise,}
      \end{cases}
    (1)

where α and β are hyper-parameters, fixed to α = 2 and β = 4 during training, and N is the number of objects in the image.

For sub-pixel accuracy of extreme points, CornerNet additionally regresses to a category-agnostic keypoint offset ∆^{(a)} for each corner. This regression recovers part of the information lost in the down-sampling of the hourglass network. The offset map is trained with a Smooth L1 loss [11], SL1, on ground truth extreme point locations:

    L_{off} = \frac{1}{N} \sum_{k=1}^{N} \mathrm{SL}_1\big(\Delta^{(a)}, \vec{x}/s - \lfloor \vec{x}/s \rfloor\big),
    (2)

where s is the down-sampling factor (s = 4 for HourglassNet) and \vec{x} is the coordinate of the keypoint.

CornerNet then groups opposing corners into detections using an associative embedding [30]. Our extreme point estimation uses the CornerNet architecture and loss, but not the associative embedding.

Deep Extreme Cut  Deep Extreme Cut (DEXTR) [29] is an extreme point guided image segmentation method. It takes four extreme points and the cropped image region surrounding the bounding box spanned by the extreme points as input. From this it produces a category-agnostic foreground segmentation of the indicated object using the semantic segmentation network of Chen et al. [4]. The network learns to generate the segmentation mask that matches the input extreme points.

4. ExtremeNet for Object detection

ExtremeNet uses an HourglassNet [31] to detect five keypoints per class (four extreme points and one center). We follow the training setup, loss and offset prediction of CornerNet [22]. The offset prediction is category-agnostic, but extreme-point specific. There is no offset prediction for the center map. The output of our network is thus 5 × C heatmaps and 4 × 2 offset maps, where C is the number of classes (C = 80 for MS COCO [26]). Figure 3 shows an overview. Once the extreme points are extracted, we group them into detections in a purely geometric manner.

4.1. Center Grouping

Extreme points lie on different sides of an object. This complicates grouping. For example, an associative embedding [30] might not have a global enough view to group these keypoints. Here, we take a different approach that exploits the spread-out nature of extreme points.

The input to our grouping algorithm is five heatmaps per class: one center heatmap Ŷ^{(c)} ∈ (0, 1)^{H×W} and four extreme heatmaps Ŷ^{(t)}, Ŷ^{(l)}, Ŷ^{(b)}, Ŷ^{(r)} ∈ (0, 1)^{H×W} for the top, left, bottom, and right, respectively. Given a heatmap, we extract the corresponding keypoints by detecting all peaks. A peak is any pixel location with a value greater than τ_p that is locally maximal in a 3 × 3 window surrounding the pixel. We call this procedure ExtractPeak.

Given four extreme points t, b, r, l extracted from heatmaps Ŷ^{(t)}, Ŷ^{(l)}, Ŷ^{(b)}, Ŷ^{(r)}, we compute their geometric center c = ((l_x + r_x)/2, (t_y + b_y)/2). If this center is predicted with a high response in the center map Ŷ^{(c)}, we commit the extreme points as a valid detection: Ŷ^{(c)}_{c_x, c_y} ≥ τ_c for a threshold τ_c. We then enumerate over all quadruples of keypoints t, b, r, l in a brute-force manner. We extract detections for each class independently. Algorithm 1 summarizes this procedure. We set τ_p = 0.1 and τ_c = 0.1 in all experiments.

Algorithm 1: Center Grouping
  Input: center and extreme point heatmaps of an image for one category:
         Ŷ^{(c)}, Ŷ^{(t)}, Ŷ^{(l)}, Ŷ^{(b)}, Ŷ^{(r)} ∈ (0, 1)^{H×W};
         center and peak selection thresholds τ_c and τ_p
  Output: bounding boxes with scores
  // Convert heatmaps into coordinates of keypoints. T, L, B, R are sets of points.
  T ← ExtractPeak(Ŷ^{(t)}, τ_p)
  L ← ExtractPeak(Ŷ^{(l)}, τ_p)
  B ← ExtractPeak(Ŷ^{(b)}, τ_p)
  R ← ExtractPeak(Ŷ^{(r)}, τ_p)
  for t ∈ T, l ∈ L, b ∈ B, r ∈ R do
      if t_y ≤ l_y, r_y ≤ b_y and l_x ≤ t_x, b_x ≤ r_x then   // the bounding box is valid
          c_x ← (l_x + r_x)/2                                  // compute the geometric center
          c_y ← (t_y + b_y)/2
          if Ŷ^{(c)}_{c_x, c_y} ≥ τ_c then                     // the center is detected
              add bounding box (l_x, t_y, r_x, b_y) with score
              (Ŷ^{(t)}_{t_x, t_y} + Ŷ^{(l)}_{l_x, l_y} + Ŷ^{(b)}_{b_x, b_y} + Ŷ^{(r)}_{r_x, r_y} + Ŷ^{(c)}_{c_x, c_y}) / 5
          end
      end
  end

This brute-force grouping algorithm has a runtime of O(n^4), where n is the number of extracted extreme points for each cardinal direction. The supplementary material presents an O(n^2) algorithm that is faster on paper. However, it is harder to accelerate on a GPU and slower in practice for the MS COCO dataset, where n ≤ 40.
4.2. Ghost box suppression

Center grouping may give a high-confidence false-positive detection for three equally spaced collinear objects of the same size. The center object has two choices here: commit to the correct small box, or predict a much larger box containing the extreme points of its neighbors. We call these false-positive detections "ghost" boxes. As we show in our experiments, these ghost boxes are infrequent, but nonetheless a consistent error mode of our grouping.

We present a simple post-processing step to remove ghost boxes. By definition, a ghost box contains many other smaller detections. To discourage ghost boxes, we use a form of soft non-maxima suppression [1]. If the sum of scores of all boxes contained in a certain bounding box exceeds 3 times its own score, we divide its score by 2. This non-maxima suppression is similar to standard overlap-based non-maxima suppression, but penalizes potential ghost boxes instead of multiple overlapping boxes.

4.3. Edge aggregation

Extreme points are not always uniquely defined. If vertical or horizontal edges of an object form the extreme points (e.g., the top of a car), any point along that edge might be considered an extreme point. As a result, our network produces a weak response along any aligned edges of the object, instead of a single strong peak response. This weak response has two issues: First, the weaker response might be below our peak selection threshold τ_p, and we will miss the extreme point entirely. Second, even if we detect the keypoint, its score will be lower than that of a slightly rotated object with a strong peak response.

We use edge aggregation to address this issue. For each extreme point, extracted as a local maximum, we aggregate its score in either the vertical direction, for left and right extreme points, or the horizontal direction, for top and bottom keypoints. We aggregate all monotonically decreasing scores, and stop the aggregation at a local minimum along the aggregation direction. Specifically, let m be an extreme point and N_i^{(m)} = Ŷ_{m_x + i, m_y} be the vertical or horizontal line segment at that point. Let i_0 < 0 and 0 < i_1 be the two closest local minima, with N^{(m)}_{i_0 - 1} > N^{(m)}_{i_0} and N^{(m)}_{i_1} < N^{(m)}_{i_1 + 1}. Edge aggregation updates the keypoint score as Ỹ_m = Ŷ_m + λ_{aggr} Σ_{i=i_0}^{i_1} N_i^{(m)}, where λ_{aggr} is the aggregation weight. In our experiments, we set λ_{aggr} = 0.1. See Figure 4 for an example.

Figure 4: Illustration of the purpose of edge aggregation ((a) original heatmap, (b) after edge aggregation). In the case of multiple points being the extreme point on one edge, our model predicts a segment of low-confidence responses (a). Edge aggregation enhances the confidence of the middle pixel (b).

4.4. Extreme Instance Segmentation

Extreme points carry considerably more information about an object than a simple bounding box, with at least twice as many annotated values (8 vs. 4). We propose a simple method to approximate the object mask using extreme points by creating an octagon whose edges are centered on the extreme points. Specifically, for an extreme point, we extend it in both directions on its corresponding edge to a segment of 1/4 of the entire edge length. The segment is truncated when it meets a corner. We then connect the end points of the four segments to form the octagon. See Figure 1 for an example.

To further refine the bounding box segmentation, we use Deep Extreme Cut (DEXTR) [29], a deep network trained to convert manually provided extreme points into an instance segmentation mask. In this work, we simply replace the manual input of DEXTR [29] with our extreme point predictions, to perform a two-stage instance segmentation. Specifically, for each of our predicted bounding boxes, we crop the bounding box region, render a Gaussian map with our predicted extreme points, and then feed the concatenated image to the pre-trained DEXTR model. DEXTR [29] is class-agnostic, thus we directly use the detected class and score of ExtremeNet. No further post-processing is used.
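To make the octagon construction concrete, here is a small sketch of ours (not the reference implementation); it reads "1/4 of the entire edge length" as the total length of each segment, i.e., each extreme point is extended by one eighth of its edge in both directions and clamped at the box corners.

    def octagon_from_extremes(top, left, bottom, right):
        """Build an octagon mask outline from four (x, y) extreme points (Section 4.4)."""
        (tx, ty), (lx, ly), (bx, by), (rx, ry) = top, left, bottom, right
        x0, y0, x1, y1 = lx, ty, rx, by                 # box spanned by the extreme points
        dx, dy = (x1 - x0) / 8.0, (y1 - y0) / 8.0       # half of a quarter-edge segment
        return [
            (max(tx - dx, x0), y0), (min(tx + dx, x1), y0),   # segment centered on the top extreme
            (x1, max(ry - dy, y0)), (x1, min(ry + dy, y1)),   # ... on the right extreme
            (min(bx + dx, x1), y1), (max(bx - dx, x0), y1),   # ... on the bottom extreme
            (x0, min(ly + dy, y1)), (x0, max(ly - dy, y0)),   # ... on the left extreme
        ]                                                     # eight vertices, clockwise

    # Toy example: extremes of a 100 x 60 object.
    print(octagon_from_extremes(top=(50, 0), left=(0, 30), bottom=(50, 60), right=(100, 30)))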
5. Experiments

We evaluate our method on the popular MS COCO dataset [26]. COCO contains rich bounding box and instance segmentation annotations for 80 categories. We train on the train2017 split, which contains 118k images and 860k annotated objects. We perform all ablation studies on the val2017 split, with 5k images and 36k objects, and compare to prior work on the test-dev split, which contains 20k images. The main evaluation metric is average precision over a dense set of fixed recall thresholds. We show average precision at IoU thresholds 0.5 (AP50) and 0.75 (AP75), and averaged over all thresholds between 0.5 and 1 (AP). We also report AP for small, medium, and large objects (APS, APM, APL). The test evaluation is done on the official evaluation server. Qualitative results are shown in Table 4; more can be found in the supplementary material.

5.1. Extreme point annotations

There are no direct extreme point annotations in COCO [26]. However, there are complete annotations for object segmentation masks. We thus find extreme points as extrema in the polygonal mask annotations. In cases where an edge is parallel to an axis or within a 3° angle, we place the extreme point at the center of the edge. Although our training data is derived from the more expensive segmentation annotation, the extreme point data itself is 4× cheaper to collect than the standard bounding box [33].

                          AP    AP50  AP75  APS   APM   APL
                          40.3  55.1  43.7  21.6  44.0  56.1
w/ multi-scale testing    43.3  59.6  46.8  25.7  46.6  59.4
w/o Center grouping       38.2  53.8  40.4  20.6  41.5  52.9
w/o Edge aggregation      39.6  54.7  43.0  22.0  43.0  54.1
w/o Ghost removal         40.0  54.7  43.3  21.6  44.2  54.1
w/ gt center              48.6  62.1  53.9  26.3  53.7  66.7
w/ gt extreme             56.3  67.2  60.0  40.9  62.0  64.0
w/ gt extreme + center    79.8  94.5  86.2  65.5  88.7  95.5
w/ gt ex. + ct. + offset  86.0  94.0  91.3  73.4  95.7  98.4

Table 1: Ablation study and error analysis on COCO val2017. We show AP (%) after removing each component or replacing it with its ground truth.

5.2. Training details

Our implementation is based on the public implementation of CornerNet [22]. We strictly follow CornerNet's hyper-parameters: we set the input resolution to 511 × 511 and the output resolution to 128 × 128. Data augmentation consists of flipping, random scaling between 0.6 and 1.3, random cropping, and random color jittering. The network is optimized with Adam [21] with learning rate 2.5e-4. CornerNet [22] was originally trained on 10 GPUs for 500k iterations, an equivalent of over 140 GPU days. Due to limited GPU resources, the self-comparison experiments (Table 1) are fine-tuned from a pre-trained CornerNet model with randomly initialized head layers on 5 GPUs for 250k iterations with a batch size of 24. The learning rate is dropped 10× at the 200k-th iteration. The state-of-the-art comparison experiment is trained from scratch on 5 GPUs for 500k iterations, with the learning rate dropped at the 450k-th iteration.

5.3. Testing details

For each input image, our network produces four C-channel heatmaps for extreme points, one C-channel heatmap for center points, and four 2-channel offset maps. We apply edge aggregation (Section 4.3) to each extreme point heatmap, and multiply the center heatmap by 2 to correct for the overall scale change. We then apply the center grouping algorithm (Section 4.1) to the heatmaps. At most the top 40 peaks are extracted in ExtractPeak to keep enumeration efficient. The predicted bounding box coordinates are refined by adding an offset at the corresponding location of the offset maps.

Following CornerNet [22], we keep the original image resolution instead of resizing it to a fixed size. We use flip augmentation for testing. In our main comparison, we use additional 5× multi-scale (0.5, 0.75, 1, 1.25, 1.5) augmentation. Finally, Soft-NMS [1] filters all augmented detection results. Testing on one image takes 322ms (3.1 FPS), with 168ms for network forwarding, 130ms for decoding, and the rest for image pre- and post-processing (NMS).

5.4. Ablation studies

Center Grouping vs. Associative Embedding  Our ExtremeNet can also be trained with an associative embedding [30], similar to CornerNet [22], instead of our geometric center point grouping. We tried this idea and replaced the center map with a four-channel associative embedding feature map trained with a hinge loss [22]. Table 1 shows the result. We observe a 2.1% AP drop when using the associative embedding. While associative embeddings work well for human pose estimation and CornerNet, our extreme points lie on the very sides of objects. Learning the identity and appearance of entire objects from the vantage point of their extreme points might simply be too hard. While it might work well for small objects, where the entire object easily fits into the effective receptive field of a keypoint, it fails for medium and large objects, as shown in Table 1. Furthermore, extreme points often lie at the intersection between overlapping objects, which further confuses the identity feature. Our geometric grouping method gracefully deals with these issues, as it only needs to reason about appearance.

Edge aggregation  Edge aggregation (Section 4.3) gives a decent AP improvement of 0.7%. It proves more effective for larger objects, which are more likely to have long axis-aligned edges without a single well-defined extreme point. Removing edge aggregation improves the decoding time to 76ms and the overall speed to 4.1 FPS.

Ghost box suppression  Our simple ghost bounding box suppression (Section 4.2) yields a 0.3% AP improvement. This suggests that ghost boxes are not a significant practical issue in MS COCO. A more sophisticated false-positive removal algorithm, e.g., a learned NMS [18], might yield a slightly better result.
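As a reference point for this ablation, the ghost box penalty of Section 4.2 amounts to only a few lines; the sketch below is our reading of it (illustrative, not the released code): a detection's score is halved when the scores of all boxes fully contained in it sum to more than three times its own score.

    def suppress_ghost_boxes(dets):
        """dets: list of ((x0, y0, x1, y1), score) for one image and class."""
        def contains(outer, inner):
            return (outer[0] <= inner[0] and outer[1] <= inner[1] and
                    inner[2] <= outer[2] and inner[3] <= outer[3])

        out = []
        for box, score in dets:
            inside = sum(s for b, s in dets if b != box and contains(box, b))
            # Soft penalty instead of hard removal (Section 4.2).
            out.append((box, score / 2.0 if inside > 3.0 * score else score))
        return out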
Backbone Input resolution AP AP50 AP75 APS APM APL
Two-stage detectors
Faster R-CNN w/ FPN [24] ResNet-101 1000 × 600 36.2 59.1 39.0 18.2 39.0 48.2
Deformable-CNN [7] Inception-ResNet 1000 × 600 37.5 58.0 - 19.4 40.1 52.5
Deep Regionlets [51] ResNet-101 1000 × 600 39.3 59.8 - 21.7 43.7 50.9
Mask R-CNN [15] ResNeXt-101 1333 × 800 39.8 62.3 43.4 22.1 43.2 51.2
LH R-CNN [23] ResNet-101 1000 × 600 41.5 - - 25.2 45.3 53.1
Cascade R-CNN [2] ResNet-101 1333 × 800 42.8 62.1 46.3 23.7 45.5 55.2
D-RFCN + SNIP [43] DPN-98 1333 × 800 45.7 67.3 51.1 29.3 48.8 57.1
PANet [27] ResNeXt-101 1000 × 600 47.4 67.2 51.8 30.1 51.7 60.0
One-stage detectors
YOLOv2 [39] DarkNet-19 544 × 544 21.6 44.0 19.2 5.0 22.4 35.5
YOLOv3 [40] DarkNet-53 608 × 608 33.0 57.9 34.4 18.3 35.4 41.9
SSD [28] ResNet-101 513 × 513 31.2 50.4 33.3 10.2 34.5 49.8
DSSD [10] ResNet-101 513 × 513 33.2 53.3 35.2 13.0 35.4 51.1
RetinaNet [25] ResNet-101 1333 × 800 39.1 59.1 42.3 21.8 42.7 50.2
RefineDet (SS) [52] ResNet-101 512 × 512 36.4 57.5 39.5 16.6 39.9 51.4
RefineDet (MS) [52] ResNet-101 512 × 512 41.8 62.9 45.7 25.6 45.1 54.1
CornerNet (SS) [22] Hourglass-104 511 × 511 40.5 56.5 43.1 19.4 42.7 53.9
CornerNet (MS) [22] Hourglass-104 511 × 511 42.1 57.8 45.3 20.8 44.8 56.7
ExtremeNet (SS) Hourglass-104 511 × 511 40.2 55.5 43.2 20.4 43.2 53.1
ExtremeNet (MS) Hourglass-104 511 × 511 43.7 60.5 47.0 24.1 46.9 57.6

Table 2: State-of-the-art comparison on COCO test-dev. SS/MS are short for single-scale/multi-scale testing, respectively. Our ExtremeNet is on par with state-of-the-art region-based object detectors.

Error Analysis  To better understand where the error comes from and how well each of our components is trained, we provide an error analysis by replacing each output component with its ground truth. Table 1 shows the result. A ground truth center heatmap alone does not increase AP much. This indicates that our center heatmap is trained quite well, and shows that the implicit object center is learnable. Replacing the extreme point heatmaps with ground truth gives a 16.3% AP improvement. When replacing both the extreme point heatmaps and the center heatmap, the result comes to 79.8%, much higher than replacing only one of them. This is because our center grouping is very strict about keypoint locations, and high performance requires improving both the extreme point heatmaps and the center heatmap. Adding the ground truth offsets further increases the AP to 86.0%. The remaining error comes from ghost boxes (Section 4.2).

5.5. State-of-the-art comparisons

Table 2 compares ExtremeNet to other state-of-the-art methods on COCO test-dev. Our model with multi-scale testing achieves an AP of 43.7, outperforming all reported one-stage object detectors and on par with popular two-stage detectors. Notably, it performs 1.6% higher than CornerNet, which shows the advantage of detecting extreme and center points over detecting corners with associative features. In the single-scale setting, our performance is 0.3% AP below CornerNet [22]. However, our method has higher AP for small and medium objects than CornerNet, which is known to be more challenging. For larger objects, our center response map might not be accurate enough to perform well, as a shift of a few pixels might make the difference between a detection and a false negative. Further, note that we used half the number of GPUs to train our model.

                      AP    AP50  AP75  APS   APM   APL
BBox                  12.1  34.9  6.2   8.2   12.7  16.9
Ours octagon          18.9  44.5  13.7  10.4  20.4  28.3
Ours+DEXTR [29]       34.6  54.9  36.6  16.6  36.5  52.0

Mask RCNN-50 [15]     34.0  55.5  36.1  14.4  36.7  51.9
Mask RCNN-101 [15]    37.5  60.6  39.9  17.7  41.0  55.4

Table 3: Instance segmentation evaluation on COCO val2017. The results are shown in Mask AP.
Table 4: Qualitative results on COCO val2017. First and second columns: our predicted (combined four) extreme point heatmap and center heatmap, respectively, shown overlaid on the input image; heatmaps of different categories are shown in different colors. Third column: our predicted bounding box and the octagon mask formed by the extreme points. Fourth column: resulting masks from feeding our extreme point predictions to DEXTR [29].

5.6. Instance Segmentation

Finally, we compare our instance segmentation results with and without DEXTR [29] to other baselines in Table 3. As a dummy baseline, we directly assign all pixels inside the rectangular bounding box as the segmentation mask. The result for our best model (with 43.3% bounding box AP) is 12.1% Mask AP. The simple octagon mask (Section 4.4) based on our predicted extreme points gets a Mask AP of 18.9%, much better than the bounding box baseline. This shows that the simple octagon mask can give a relatively reasonable object mask without additional cost. Note that directly using the quadrangle of the four extreme points yields a too-small mask, with a lower IoU.

When combined with DEXTR [29], our method achieves a Mask AP of 34.6% on COCO val2017. To put this result in context, the state-of-the-art Mask RCNN [15] gets a Mask AP of 37.5% with a ResNeXt-101-FPN [24, 50] backbone and 34.0% AP with Res50-FPN. Considering that our model has not been trained on the COCO segmentation annotations, or any class-specific segmentations at all, our result, which is on par with Res50 [17] and 2.9% AP below ResNeXt-101, is very competitive.

6. Conclusion

In conclusion, we present a novel object detection framework based on bottom-up extreme point estimation. Our framework extracts four extreme points and groups them in a purely geometric manner. The presented framework yields state-of-the-art detection results and produces competitive instance segmentation results on MS COCO, without seeing any COCO training instance segmentations.

Acknowledgement  We thank Chao-Yuan Wu, Dian Chen, and Chia-Wen Cheng for helpful feedback.
References

[1] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis. Soft-nms: Improving object detection with one line of code. In ICCV, 2017. 5, 6
[2] Z. Cai and N. Vasconcelos. Cascade r-cnn: Delving into high quality object detection. CVPR, 2018. 7
[3] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017. 1, 3
[4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. PAMI, 2018. 4
[5] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun. Cascaded pyramid network for multi-person pose estimation. In CVPR, 2018. 1, 3
[6] J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In NIPS, 2016. 2
[7] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In ICCV, 2017. 2, 7
[8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html. 2
[9] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 2010. 1, 3
[10] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. Dssd: Deconvolutional single shot detector. arXiv preprint, 2017. 2, 7
[11] R. Girshick. Fast r-cnn. In ICCV, 2015. 1, 2, 4
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. 1, 2
[13] R. B. Girshick, P. F. Felzenszwalb, and D. A. Mcallester. Object detection with grammar models. In NIPS, 2011. 1
[14] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011. 2
[15] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In ICCV, 2017. 2, 3, 7, 8
[16] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014. 2
[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. 8
[18] J. H. Hosang, R. Benenson, and B. Schiele. Learning non-maximum suppression. In CVPR, 2017. 6
[19] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR, 2017. 2
[20] B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang. Acquisition of localization confidence for accurate object detection. In ECCV, September 2018. 2
[21] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2014. 6
[22] H. Law and J. Deng. Cornernet: Detecting objects as paired keypoints. In ECCV, 2018. 1, 2, 3, 4, 6, 7
[23] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun. Light-head r-cnn: In defense of two-stage object detector. arXiv preprint, 2017. 7
[24] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017. 2, 7, 8
[25] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. ICCV, 2017. 1, 2, 4, 7
[26] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 1, 4, 5, 6
[27] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia. Path aggregation network for instance segmentation. In CVPR, 2018. 2, 7
[28] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In ECCV, 2016. 1, 2, 7
[29] K. Maninis, S. Caelles, J. Pont-Tuset, and L. Van Gool. Deep extreme cut: From extreme points to object segmentation. In CVPR, 2018. 2, 4, 5, 7, 8
[30] A. Newell, Z. Huang, and J. Deng. Associative embedding: End-to-end learning for joint detection and grouping. In NIPS, 2017. 1, 3, 4, 6
[31] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016. 1, 3, 4
[32] X. Nie, J. Feng, J. Xing, and S. Yan. Pose partition networks for multi-person pose estimation. In ECCV, 2018. 3
[33] D. P. Papadopoulos, J. R. Uijlings, F. Keller, and V. Ferrari. Extreme clicking for efficient object annotation. In ICCV, 2017. 2, 3, 6
[34] G. Papandreou, T. Zhu, L.-C. Chen, S. Gidaris, J. Tompson, and K. Murphy. Personlab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. ECCV, 2018. 3
[35] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy. Towards accurate multi-person pose estimation in the wild. In CVPR, 2017. 3
[36] G. Pavlakos, X. Zhou, A. Chan, K. G. Derpanis, and K. Daniilidis. 6-dof object pose from semantic keypoints. In ICRA, 2017. 3
[37] C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, and J. Sun. Megdet: A large mini-batch object detector. CVPR, 2018. 2
[38] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016. 1, 2
[39] J. Redmon and A. Farhadi. Yolo9000: better, faster, stronger. CVPR, 2017. 2, 7
[40] J. Redmon and A. Farhadi. Yolov3: An incremental improvement. arXiv preprint, 2018. 2, 7
[41] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015. 1, 2
[42] Z. Shen, Z. Liu, J. Li, Y.-G. Jiang, Y. Chen, and X. Xue.
Dsod: Learning deeply supervised object detectors from
scratch. In ICCV, 2017. 2
[43] B. Singh and L. S. Davis. An analysis of scale invariance in
object detection–snip. In CVPR, 2018. 7
[44] H. Su, J. Deng, and L. Fei-Fei. Crowdsourcing annotations
for visual object detection. In AAAIW, 2012. 3
[45] L. Tychsen-Smith and L. Petersson. Denet: Scalable real-
time object detection with directed sparse sampling. arXiv
preprint arXiv:1703.10295, 2017. 2
[46] L. Tychsen-Smith and L. Petersson. Improving object local-
ization with fitness nms and bounded iou loss. CVPR, 2017.
2
[47] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W.
Smeulders. Selective search for object recognition. IJCV,
2013. 2
[48] X. Wang, K. Chen, Z. Huang, C. Yao, and W. Liu.
Point linking network for object detection. arXiv preprint
arXiv:1706.03646, 2017. 2
[49] B. Xiao, H. Wu, and Y. Wei. Simple baselines for human
pose estimation and tracking. In ECCV, 2018. 1, 3
[50] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated
residual transformations for deep neural networks. In CVPR,
2017. 8
[51] H. Xu, X. Lv, X. Wang, Z. Ren, N. Bodla, and R. Chellappa.
Deep regionlets for object detection. In ECCV, 2018. 7
[52] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li. Single-shot
refinement neural network for object detection. In CVPR,
2018. 2, 7
[53] X. Zhou, A. Karpur, L. Luo, and Q. Huang. Starmap for
category-agnostic keypoint and viewpoint estimation. In
ECCV, 2018. 3
[54] Y. Zhu, C. Zhao, J. Wang, X. Zhao, Y. Wu, H. Lu, et al. Cou-
plenet: Coupling global structure with local parts for object
detection. In ICCV, 2017. 2
