Demisting the Hough Transform
for 3D Shape Recognition and Registration
Oliver J. Woodford · Minh-Tri Pham · Atsuto Maki · Frank Perbet ·
Björn Stenger
Abstract In applying the Hough transform to the problem of 3D shape recognition and registration, we develop two new and powerful improvements to this popular inference method. The first, intrinsic Hough, solves
the problem of exponential memory requirements of
the standard Hough transform by exploiting the sparsity of the Hough space. The second, minimum-entropy
Hough, explains away incorrect votes, substantially reducing the number of modes in the posterior distribution of class and pose, and improving precision. Our
experiments demonstrate that these contributions make
the Hough transform not only tractable but also highly
accurate for our example application. Both contributions can be applied to other tasks that already use the
standard Hough transform.
1 Introduction
The Hough transform [13], named after Hough’s 1962
patent [18] describing a method for detecting lines in
images, has since been generalized to detecting, as well
as recognizing, many other objects or instances: parameterized curves [13], arbitrary 2D shapes [3], object motions [8], cars [16, 24], pedestrians [4, 16], hands
[30] and 3D shapes [21, 31, 37], to name but a few. This
popularity stems from the simplicity and generality of
the first step of the Hough transform—the conversion
of features, found in the data space, into sets of votes
in a Hough space, parameterized by the pose of the
object(s) to be found. Various different approaches to learning this feature-to-vote conversion function have been proposed, including the implicit shape model [24] and Hough forests [16, 30].

O. J. Woodford
Toshiba Research Europe Ltd.
208 Cambridge Science Park, Milton Road
Cambridge CB4 0GZ, UK
E-mail: oliver.woodford@crl.toshiba.co.uk
The second stage of the Hough transform simply
sums the likelihoods of the votes at each location in
Hough space, then selects the modes. One problem with
this step is that the summation can create modes where
there are only a few outlier votes. A second problem is
that, given a required accuracy, the size of the Hough
space is exponential in its dimensionality. The application we are concerned with, object recognition and registration (R&R) from 3D geometry (here, point clouds),
suffers significantly from both these problems. The Hough
space, at 8D (one dimension for class, three for rotation,
three for translation and one for scale), is to our knowledge the largest to which the Hough transform has been
applied, and the feature-to-vote conversion generates a
high proportion of incorrect votes, creating a “mist” of
object likelihood throughout that space, as shown in
figure 2(a).
In the face of this adversity, we have developed two important contributions which make inference using the Hough transform on this task, and potentially many others, both feasible and accurate:
– We introduce the intrinsic Hough transform, which
substantially reduces memory and computational requirements in applications with a high dimensional
Hough space.
– We introduce the minimum-entropy Hough transform, which greatly improves the precision and robustness of the Hough transform.
These extensions of the Hough transform are not task
specific; they can be applied, either together or independently, to any application that does or is able to
use the standard Hough transform.
The rest of this paper is organized as follows: The
next section describes inference using the Hough transform, and briefly reviews the literature relevant to our
contributions. In §3 we describe our new inference methods. The section following that describes and discusses
our experiments. Finally, we conclude in §5.
2 Background
2.1 3D shape recognition and registration
The implicit shape model of Leibe et al. [23, 24] pioneered the use of the Hough transform for object recognition in 2D images. This approach has since been applied to object recognition in 3D geometric data [21],
and extended to object registration [31, 37]. For the dual
problem of R&R in 3D, the Hough space is either 7D
(if scale is known) [37] or 8D [31].
The feature extraction stages of these methods follow the same pipeline: features are detected at a given
scale and position; a canonical orientation of the feature is estimated; a descriptor for the feature is computed. The votes are then computed by matching features in the test data with features from training data
with ground truth class and pose, either directly (i.e. a
nearest neighbour search) [31], or via a codebook created by clustering the feature descriptors [21, 24, 37].
In this work we will be using the feature-to-vote conversion process of Pham et al. [31] as an off-the-shelf method, since our contributions lie in the second
stage of the Hough transform. It is this process that
generates a high proportion of incorrect votes, amongst
which the correct votes need to be found.
2.2 The Hough transform

The earliest descriptions of the Hough transform [3, 13, 18] present it as an algorithm, but more recently there has been a desire to cast the framework in a probabilistic light. Generative model interpretations [2, 4, 35], in which the votes represent likelihoods of features, given an object pose, require that the likelihoods of these independent variables be multiplied, in contrast to the summation of the Hough transform. The summation has been explained in two ways: firstly that it is in fact over the log likelihood of features [4, 35], though this requires a differently shaped distribution for each vote than is typically given [4], or secondly that it is a first order approximation to a robustified product of likelihoods [2, 29]. We prefer to interpret the second stage of the Hough transform as a discriminative model of the posterior distribution of an object's location, phrased simply as a kernel density estimate over all the votes [8, 44].

Let y be an object's location in a Hough space, H, which is the space of all object poses (usually real) and, in the case of object recognition tasks, object classes (discrete). Furthermore, let the list of votes, cast in H by N features, which are computed in some first stage feature-to-vote conversion process (not addressed here), be denoted by X = {{x_{ij}}_{j=1}^{J_i}}_{i=1}^{N}. The posterior probability of an object's location is then given by

p(y | X, ω, θ) = \sum_{i=1}^{N} ω_i \sum_{j=1}^{J_i} θ_{ij} K(x_{ij}, y),    (1)

where J_i is the number of votes generated by the ith feature, K(·, ·) is a density kernel in Hough space, and ω = {ω_i}_{i=1}^{N} and θ = {θ_{ij}}_{∀i,j} are feature and vote weights respectively, s.t. ω_i ≥ 0, ∀i, \sum_{i=1}^{N} ω_i = 1, and

θ_{ij} ≥ 0, ∀i, j,    \sum_{j=1}^{J_i} θ_{ij} = 1, ∀i ∈ {1, .., N}.    (2)

For example, in the original Hough transform used for line detection [13], the features are edgels, votes are generated for a discrete set of lines (parameterized by angle) passing through each edgel, the kernel, K(·, ·), returns 1 for the nearest point in the discretized Hough space to the input vote, 0 otherwise, and the weights, ω and θ, are set to uniform distributions. Recently, methods have been proposed for learning a priori more discriminative weights [27, 44] for object detection, as well as evaluating over different kernel shapes [44].

The final stage of the Hough transform involves finding, using non-maxima suppression, the modes of this distribution whose probabilities are above a certain threshold value, τ.

2.2.1 Computational feasibility
Finding the modes in H involves sampling that space,
the volume of which increases exponentially with its dimensionality, d. Several approaches have been proposed
to reduce this burden, which we categorize as one of approximate, hierarchical, irregular or mode-seeking.
Approximate methods use reduced-dimensionality approximations of the full Hough space to find modes.
For example, given a 6D pose (translation & rotation),
Fisher et al. [15] quantize translations and rotations in
two separate 3D arrays (peak entries in both arrays
indicate an object, but multiple objects create ambiguities), while Tombari & Di Stefano [37] find modes over
translation only, then compute an average rotation for
each mode. Geometric hashing techniques, e.g. [12, 22,
28], also fall into this category.
Hierarchical approaches, such as the fast [25] and adaptive [19] Hough transforms, sample the space in a coarse-to-fine manner, exploiting the sparsity of some areas, though their complexity is still exponential in d.
Irregular methods do not sample the Hough space regularly, but rather sample only where objects are likely
to be detected, again exploiting potential sparsity in
H. For example, the combinatorial [5] and randomized
[42] Hough transforms generate lists of sampling locations, the former for all lines (in line detection) joining
pairs of edgels in confined regions, the latter for curves
(in curve detection) defined uniquely by random sets of
edgels. Both these approaches are task specific, whereas
the intrinsic Hough transform introduced here, which
also falls into this category, is not.
Mode-seeking methods find modes in H through iterative optimization [8, 9]. Mean shift [9] is the most commonly used approach, the complexity of which is O(nd²), where n = \sum_{i=1}^{N} J_i (the total number of votes). It has successfully been applied to an 8D Hough space [31]. However, it needs to be initialized in many, perhaps O(n), locations, making the total complexity O(n²d²), and is not guaranteed to find every mode. Two extensions of this approach, though generally applied to clustering rather than mode seeking, are medoid shift [34] and quick shift [38].
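As an illustration of the mode-seeking category, a minimal Gaussian mean shift iteration over a set of weighted votes might look as follows (a sketch; the vote array, weights and bandwidth are illustrative, not from the paper):

```python
import numpy as np

def mean_shift_mode(votes, weights, start, sigma=0.5, iters=50):
    """Gaussian mean shift over a set of votes (illustrative sketch).
    Repeatedly moves the estimate to the kernel-weighted mean of the
    votes, converging to a nearby mode of the kernel density."""
    y = np.asarray(start, dtype=float)
    for _ in range(iters):
        # Kernel weight of every vote w.r.t. the current estimate.
        w = weights * np.exp(-((votes - y) ** 2).sum(-1) / (2 * sigma ** 2))
        y_new = (w @ votes) / w.sum()          # weighted mean update
        if np.linalg.norm(y_new - y) < 1e-8:   # converged to a mode
            break
        y = y_new
    return y
```

Run from many starting points, perhaps one per vote (giving the O(n²d²) total cost mentioned above), the distinct fixed points found are the detected modes.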
These approaches can also be combined. For example, modes found in a coarse sampling of H can be refined using mean shift [24], an approach we employ here.
2.2.2 Explaining away votes
The summing of votes in the Hough transform enables
incorrect votes to generate modes in H, and since most
applications tend to produce a large number of incorrect votes, this can lead to false detections, especially
in multi-object detection scenarios. The problem arises
from the fact that each test feature generates a number
(often quite large) of votes, which represent the locations of all objects that could have generated that feature, but usually only one of those votes will actually
be correct, because most features are generated by only
one object. Figure 2(a) visualizes the ambiguity caused
by these incorrect votes in our R&R application.
If we assume, usually correctly, that a feature is generated by only one object, we can then enforce the resulting implicit constraint that only one vote cast by
each feature is correct. By choosing which vote this is
for each feature, the other votes can then be dismissed
as being incorrect, removing them from the transform—
the correct vote essentially explains away all the other
votes. This assumption was first applied to the Hough
transform in the 1980s by Gerig [17], using a two stage
approach, first computing the standard Hough transform, then, simultaneously for each feature, collating
the values of the Hough transform at the locations of
all votes of a given feature, and keeping only the vote
at the highest value.
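Gerig's two-stage scheme can be sketched as follows; `hough_at` is a hypothetical function returning the precomputed standard Hough transform values at an array of locations:

```python
import numpy as np

def gerig_explain_away(feature_votes, hough_at):
    """Sketch of Gerig's two-stage vote selection (hypothetical interface).

    feature_votes : list of (J_i, d) arrays, the votes cast by each feature
    hough_at      : function returning the precomputed standard Hough
                    transform value at each of an array of locations
    Returns the index of the single vote kept for each feature; all of a
    feature's other votes are explained away and discarded.
    """
    kept = []
    for votes in feature_votes:
        # Collate the Hough transform values at this feature's votes and
        # keep only the vote landing on the highest value.
        kept.append(int(np.argmax(hough_at(votes))))
    return kept
```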
The idea was resurrected more recently by Barinova
et al . [4], using an approach akin to the Hough transform, in that it exhaustively samples the Hough space
while searching for objects. However, they directly enforce the constraint that a feature is generated by only
one object, using feature-to-object assignments, with
a cost per object detection. Phrasing the problem as
an energy minimization, they greedily detect objects in
Hough space, assigning to them features which decrease
the overall energy. Furthermore, rather than using kernels that tail off to zero, their kernels continue decreasing away from the vote, with an explicit background
assignment for outlier features.
Several other multi-object detection frameworks also
make explicit feature-to-object assignments: energy-minimization-based methods [7, 10, 11, 20, 41], which iteratively update the assignments; RANSAC, similar to
energy-based methods but focusing more on the algorithm than the objective function, with features assigned either greedily [39] or with iterative refinement
[43, 45]; non-parametric methods, which cluster features
into groups representing objects [36].
A benefit of methods using feature-to-object assignments, as opposed to the feature-to-vote assignments of
Hough-based methods, is that they avoid the last step
of the Hough transform: non-maxima suppression of accumulated votes in Hough space.
3 Our framework
This section describes our improvements to the Hough
transform. In §3.1 we introduce the intrinsic Hough
transform, which overcomes the high memory requirements of the standard Hough transform with high-dimensional Hough spaces. In §3.2 we introduce a method
which exploits the assumption that only one vote per
feature is correct.
3.1 The intrinsic Hough transform
As discussed in §2.2, high-dimensional Hough spaces require infeasible amounts of memory to sample regularly.
However, we note that while the volume of the Hough
space increases exponentially with its dimensionality,
the number of votes generated in applications using
the Hough transform generally does not, implying that
higher dimensional Hough spaces are often sparser. We
exploit this sparsity by sampling the Hough space only
at locations where the probability (given by equation
(1)) is likely to be non-zero. Assuming that the density
kernel, K(·, ·), in equation (1) is zero-mean and unimodal (which is generally true for kernel density estimation), the modes of the distribution will be at or near
the locations of the votes. We therefore simply sample
the Hough space at the locations of the votes themselves. Since the votes define the distribution, therefore
are intrinsic to it, we call this approach the intrinsic
Hough transform.
While similar in some respects to intrinsic mode-seeking algorithms [34, 38], the intrinsic Hough transform does not seek modes through iterative updates.
Rather, the modes of the distribution are detected using
non-maxima suppression, as per the standard Hough
transform; here, a sample location, y, is classified as a
mode if no other sample location, z, within a certain
distance, s.t. K(y, z) > γ, has a higher probability. Implicit in this approach is the assumption that the local modes of the distribution given by equation (1) lie
very close to a vote—this is the case for most shapes
of kernel used in practice. As a final step to improve
accuracy, the location of each mode found is updated
with one step of mean shift. The memory and computational requirements of this approach are O(n) and O(n²d²) respectively.
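A minimal sketch of the intrinsic Hough transform, assuming a low-dimensional Hough space, an isotropic Gaussian kernel and precombined per-vote weights (all simplifications of this sketch, not the paper's setup):

```python
import numpy as np

def intrinsic_hough(votes, weights, sigma=0.5, gamma=np.exp(-8)):
    """Sketch of the intrinsic Hough transform: evaluate the density of
    equation (1) only at the vote locations, then run non-maxima
    suppression there.

    votes   : (n, d) array of all vote locations in Hough space
    weights : (n,) array of combined weights (omega_i * theta_ij per vote)
    """
    # Pairwise kernel values K(x_j, x_k) between all votes.
    sq = ((votes[:, None, :] - votes[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * sigma ** 2))
    # Equation (1) sampled at each vote location.
    p = K @ weights
    # A vote is a mode if no nearby vote (kernel value above gamma) has a
    # higher probability.
    modes = [k for k in range(len(votes))
             if not np.any((K[k] > gamma) & (p > p[k]))]
    # One mean-shift step to refine each detected mode.
    refined = [((K[k] * weights) @ votes) / (K[k] * weights).sum()
               for k in modes]
    return np.array(refined), p[modes]
```

Note the sketch materializes all pairwise kernels at once for clarity, which costs O(n²) memory; evaluating them a row at a time keeps storage at the O(n) of the votes themselves.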
3.2 The minimum-entropy Hough transform
Making the assumption that only one vote per feature
is correct, a vote that is believed to be correct should
explain away the other votes from that feature. This
suggests that, rather than being given θ a priori, it
would be beneficial to optimize over its possible values,
giving those votes which agree with votes from other
features more weight than those which do not.
One way of achieving this is by minimizing the information entropy¹ of p(y|X, ω, θ) w.r.t. θ. A similar approach, but minimizing entropy w.r.t. some parameters of the vote generation process, has already been used for lens distortion calibration [32]. A lower entropy distribution contains less information, making it more peaky and hence having more votes in agreement. Since information in Hough space is the location of objects, minimizing entropy constrains features to be generated by as few objects as possible. This can be viewed as enforcing Occam's razor. The objective function to be minimized is therefore

f(θ) = − \int_H p(y | X, ω, θ) ln p(y | X, ω, θ) dy.    (3)

However, computing this entropy involves an integration over Hough space; for our application this is very large. To make this integration tractable we sample the space at discrete locations using importance sampling [26, §29.2]; as with the intrinsic Hough transform, we sample the Hough space at the locations of all the votes. The value of θ is therefore approximated by

θ = argmin_{θ′} − \sum_{i=1}^{N} \sum_{j=1}^{J_i} ( p(x_{ij} | X, ω, θ′) / q(x_{ij}) ) ln p(x_{ij} | X, ω, θ′),    (4)

where q(·) is the (unknown) sampling distribution from which the votes are drawn. Once this optimization (described below) is done, the estimated θ is applied to equation (1), and inference continues as per the standard (or intrinsic) Hough transform. We call this approach the minimum-entropy Hough transform.²

¹ Specifically we use the Shannon entropy [33], H = E[− ln p(x)] = − \int p(x) ln p(x) dx.
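To make the objective concrete, the importance-sampled entropy of equation (4) can be sketched as follows, under the additional simplifying assumptions of a uniform sampling distribution q(·) and an isotropic Gaussian kernel (both assumptions of this sketch, not the paper):

```python
import numpy as np

def sampled_entropy(votes, omega, theta, sigma=0.5):
    """Sketch of the objective of equation (4), sampled at the vote
    locations, with q(.) taken as uniform.

    votes : list of (J_i, d) arrays, one per feature
    omega : (N,) feature weights; theta : list of (J_i,) vote weights
    """
    locs = np.concatenate(votes)                  # sample at all votes
    w = np.concatenate([o * t for o, t in zip(omega, theta)])
    sq = ((locs[:, None, :] - locs[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * sigma ** 2))
    p = K @ w                                     # equation (1) at votes
    return float(-(p * np.log(p)).sum())          # lower = peakier
```

Concentrating each feature's weight on mutually agreeing votes lowers this value, which is exactly what the optimization of the next section exploits.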
3.2.1 Optimization framework
It turns out, as we show in Appendix A, that a global minimum of equation (3) must lie at an extremum of the parameter space, which is constrained by equation (2), such that at least one optimal value of θ_i = {θ_{ij}}_{j=1}^{J_i} (i.e. the vector of feature i's vote weights) will be an all-0 vector, except for one 1, i.e. minimizing entropy naturally enforces the one-correct-vote-per-feature constraint. As a result, a global minimum can always be found if we limit the search space for each θ_i to integer values, making a discrete set of J_i possible vectors, s.t. the total number of possible solutions is \prod_{i=1}^{N} J_i. It should be noted that this search space is not unimodal—for example, if there are only two features and they each identically generate two votes, one for location y and one for location z, then both y and z will be modes. Furthermore, as the search space is exponential in the number of features, an exhaustive search is infeasible for all but the smallest problems.
² Strictly speaking, the minimum-entropy Hough transform is not a transform, because the probability of each location in Hough space cannot be computed independently.

We therefore use a local approach, iterated conditional modes (ICM) [6], to quickly find a local minimum of this optimization problem. This involves updating the vote weights of each feature in turn, by minimizing equation (4) conditioned on the current weights of all other votes, and repeating this process until convergence. The correct update equation for the vote weights of a feature f is as follows:

p_{fk}(y | X, ω, θ) = ω_f K(x_{fk}, y) + \sum_{∀i≠f} ω_i \sum_{j=1}^{J_i} θ_{ij} K(x_{ij}, y),    (5)
k = argmax_{k′=1}^{J_f} \sum_{i=1}^{N} \sum_{j=1}^{J_i} ( p_{fk′}(x_{ij} | X, ω, θ) / q(x_{ij}) ) ln p_{fk′}(x_{ij} | X, ω, θ),    (6)

θ_{fk} = 1,    θ_{fj} = 0, ∀j ≠ k.    (7)
However, since this update not only involves q(·), which is unknown, but is also relatively costly to compute, we replace it with a simpler proxy which in practice performs a similar job of encouraging the resulting posterior distribution to be as peaky as possible:

k = argmax_{k′=1}^{J_f} p_{fk′}(x_{fk′} | X, ω, θ).    (8)

This is effectively the strategy of Gerig, but applied sequentially rather than simultaneously. Since the optimization is local, a good initialization of θ is key to reaching a good minimum. In our experiments we start at the value of θ used in the standard Hough transform, then apply the following update to each vote weight simultaneously:

θ_{ik} = p_{ik}(x_{ik} | X, ω, θ) / \sum_{j=1}^{J_i} p_{ij}(x_{ij} | X, ω, θ),    (9)

iterating this five times before starting ICM. Initially updating weights softly, i.e. not fixing them to 0 or 1, and synchronously, avoiding ordering bias, in this way helped to avoid falling into a poor local minimum early on, thus improving the quality of solution found.

For the density kernel, K(·, ·), of equation (1) we use a Gaussian kernel on a symmetric version of the SRT distance between direct similarity transforms [31]. For two object poses, y and z, of the same class, it is defined as

K(y, z) = (1/ζ) exp( − d_s²(y, z)/σ_s² − d_r²(y, z)/σ_r² − d_t²(y, z)/σ_t² ),    (10)
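The complete weight optimization of this section (soft synchronous updates of equation (9) followed by ICM with the proxy of equation (8)) might be sketched as follows; the isotropic Gaussian kernel and all data shapes are assumptions of this sketch, not the paper's implementation:

```python
import numpy as np

def minimum_entropy_weights(votes, omega, sigma=0.5, soft_iters=5,
                            icm_iters=20):
    """Sketch: five synchronous soft updates of equation (9), then ICM
    with the hard proxy of equation (8).

    votes : list of (J_i, d) arrays, one per feature
    omega : (N,) feature weights
    """
    theta = [np.full(len(v), 1.0 / len(v)) for v in votes]  # standard HT

    def density_at(x, skip=None):
        # Equation (1) at x, optionally omitting feature `skip`, whose own
        # kernel term omega_f K(x_fk, x_fk) = omega_f is added by the
        # caller, as in equation (5).
        total = 0.0
        for i, v in enumerate(votes):
            if i == skip:
                continue
            K = np.exp(-((v - x) ** 2).sum(-1) / (2 * sigma ** 2))
            total += omega[i] * (theta[i] @ K)
        return total

    for _ in range(soft_iters):          # equation (9), synchronous
        theta = [np.array([omega[i] + density_at(x, skip=i) for x in v])
                 for i, v in enumerate(votes)]
        theta = [t / t.sum() for t in theta]

    for _ in range(icm_iters):           # ICM with the proxy of eq. (8)
        for f, v in enumerate(votes):
            scores = [omega[f] + density_at(x, skip=f) for x in v]
            hard = np.zeros(len(v))
            hard[int(np.argmax(scores))] = 1.0
            theta[f] = hard
    return theta
```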
4 Experiments
4.1 Setup
For our test application, 3D shape R&R, we use the
framework introduced by Pham et al. [31], outlined in
figure 3, the evaluation data from which can be found
online [1]. It consists of 100 test instances, each containing one object, for each of 10 object classes, shown
in figure 4, i.e. 1000 test instances in total. Each test
instance provides ground truth 7D object pose (scale
and 3D rotation and translation) and class, and a set
of input votes, with weights, for object pose and from
all 10 classes.
The distances in equation (10) are defined as

d_s(y, z) = log( s(y) / s(z) ),    (11)

d_r(y, z) = \sqrt{1 − |q(y)^T q(z)|},    (12)

d_t(y, z) = ||t(y) − t(z)|| / \sqrt{s(y) s(z)},    (13)

where s(y), q(y) and t(y) are the scale, rotation (as a quaternion) and translation components of y respectively. If y and z specify different classes, then K(y, z) = 0. The values of the bandwidth parameters, σ_s, σ_r and σ_t, given in table 1, are those learned in [31]. The normalization factor, ζ, cannot easily be computed, but is independent of z [31], therefore, since our equations (8) & (9) are scale independent,³ it can be ignored.
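A sketch of the kernel of equations (10)–(13), with a hypothetical dict-based pose representation and ζ ignored (as the text permits for the scale-independent updates):

```python
import numpy as np

def srt_kernel(y, z, sigma_s=0.0694, sigma_r=0.12, sigma_t=0.12):
    """Sketch of the SRT kernel of equations (10)-(13), up to the
    normalization factor zeta. Each pose is a dict with scalar 's'
    (scale), unit quaternion 'q' (rotation), 3-vector 't' (translation)
    and 'c' (class label) -- a hypothetical representation."""
    if y['c'] != z['c']:
        return 0.0                                  # different classes
    d_s = np.log(y['s'] / z['s'])                   # equation (11)
    d_r = np.sqrt(1.0 - abs(y['q'] @ z['q']))       # equation (12)
    d_t = (np.linalg.norm(y['t'] - z['t'])
           / np.sqrt(y['s'] * z['s']))              # equation (13)
    return np.exp(-d_s**2 / sigma_s**2
                  - d_r**2 / sigma_r**2
                  - d_t**2 / sigma_t**2)            # equation (10)
```

The absolute value in d_r handles the quaternion double cover, so q and −q describe the same rotation and give the same kernel value.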
4.2 Methods
As well as evaluating the relative performance of the
two Hough transforms introduced in §3.1 & 3.2, we
compare them with the SRT mean shift method of [31]
(henceforth referred to as “mean shift”), and the inference methods of Gerig [17] and Barinova et al. [4] (here
referred to simply as Gerig and BLK, after the authors,
for short), and finally a Greedy approach which computes the standard Hough transform, finds the maximum and adds the corresponding object to the list of
found objects, then removes all of the votes of all features that voted for that object, and repeats the process until no votes are left. Apart from mean shift, the
methods all use the intrinsic Hough transform to make
sampling H feasible. For the mean shift refinement step
of the intrinsic Hough transform, we use the closed-form
mean given in [31], despite our slightly different density
kernel. However, we do not refine the detections of BLK
because their probability distribution is not amenable
to this, since the likelihoods are multiplied. The likelihood function used in our implementation of BLK
is the same kernel density function used in the other
methods, defined in equation (10). We note that the
parameters of this kernel were learned in [31] specifically for Hough-based inference, therefore might not be
optimal for BLK. Parameter values used for the various methods are summarized in table 1.

Table 1 Parameter values for the inference methods tested: (a) mean shift, (b) intrinsic Hough, (c) minimum-entropy Hough, (d) Gerig, (e) Greedy, (f) Barinova et al. [4].

Parameter   Methods   Value
σ_s         a–f       0.0694
σ_r         a–f       0.12
σ_t         a–f       0.12
γ           b–e       exp(−8)
λ           f         10

³ The requirement for a scale-independent optimization strategy is a further reason to use the proxy of equation (8).
4.3 Results
4.3.1 Quantitative results
Quantitative results, computed using the ground truth
classes and poses provided in the evaluation set, and
using the registration criterion in [31], are given in tables 2 & 3 and figure 1. There is a small improvement in
performance in both registration and recognition moving from mean shift to intrinsic Hough, which is most
likely due to modes being missed by mean shift. Recognition rates then increase rapidly moving to Gerig, then
Greedy, then finally minimum-entropy Hough, whose recognition rate, the largest seen, leaves only 1.5% of objects unrecognized (the majority of those in the car class) and is a huge improvement on mean shift, providing a 96% reduction in misclassifications. This improvement is due to the improved assignment of the correct
vote per feature, from a one-shot simultaneous assignment, to a greedy assignment, to an iteratively refined
assignment. Minimum-entropy Hough also shows a significantly improved registration rate, with top scores on
7/10 classes. BLK, though greedy, performs almost as
well as minimum-entropy Hough in terms of recognition, though less well in terms of registration, in part
due to a lack of mean shift pose refinement at the end.
However, because these results only reflect the best
detection per test, they do not tell the whole story;
we do not know how many other (incorrect) detections
had competitive weights. To see this, we generated the
precision-recall curves shown in figure 5, by varying the
detection threshold, τ (or λ for BLK [4]). A correct
detection in this test required the class and pose to
be correct simultaneously, and allowed only one correct
detection per test. The curves show that precision remains high as recall increases for the minimum-entropy
Hough transform, and marginally less so for BLK and
Greedy, all of which are able to explain away incorrect votes, while it drops off rapidly with recall for the
other methods, indicating that the latter methods suffer
from greater ambiguity as to which modes correspond
to real objects, or perhaps in the case of Gerig, the
wrong correct-vote assignments being made.

Table 2 Quantitative results for the inference methods tested.

Method            Recognition   Registration   Time
Mean shift        64.9%         72.8%          0.427s
Intrinsic Hough   67.6%         73.0%          0.192s
Min.-entropy      98.5%         79.6%          0.214s
Gerig             71.8%         73.3%          0.218s
Greedy            85.7%         70.3%          0.226s
BLK               98.1%         75.1%          0.224s

Interestingly, Greedy and minimum-entropy Hough have lower
maximum recall rates (of 0.759 and 0.813 respectively),
which we propose is due to some correct modes being “explained away”. Since, in the case of minimum-entropy Hough, our optimization strategy finds only a
local minimum, we cannot be sure whether this effect is
due to the objective function or the optimization strategy.
In terms of computation time (table 2), all methods
tested had the same order of magnitude speed, with
the mean shift approach being about twice as slow as
the others, though there is a trade-off of time versus
accuracy with this approach, by changing the number
of starting points of the optimization. However, we noticed that the speed of BLK was dependent on the value
of λ, its detection threshold, and therefore equally the
number of objects in the scene, unlike the other methods.
4.3.2 Qualitative results
The benefit of explaining away incorrect votes is demonstrated in figure 2. While the standard Hough transform
shows a great deal of ambiguity as to where and how
many objects there are, the minimum-entropy Hough
transform is able to clear away the “mist” of incorrect
votes, leaving six distinct modes corresponding to the
objects present; there are some other modes, but these
are much less significant, corroborating the results seen
in figure 5.
The benefit of having correct and clearly defined
modes is demonstrated in figure 6, using the same point
cloud as in figure 2, a challenging dataset containing
three pairs of touching objects. Both minimum-entropy
Hough and BLK find all six objects in the top six detections (though both mis-register the piston lying at a
shallow angle), whereas the other methods find not only
incorrect objects, but also multiple instances of correct
objects (particularly the piston on the cog).
Table 3 Registration rate per class (%) for the six inference methods tested.

Method                      bearing  block  bracket  car   cog  flange  knob  pipe  piston1  piston2
Mean shift                     77      13      95     75   100    86     88    86      44       64
Intrinsic Hough                77      15      96     76   100    83     86    86      44       67
Minimum-entropy Hough          83      20      98     91   100    86     91    89      54       84
Gerig [17]                     76      13      96     84   100    84     85    83      46       66
Greedy                         83      15      83     54   100    89     81    82      49       67
Barinova et al. (BLK) [4]      79      20      97     93   100    74     73    81      48       86

Fig. 1 Confusion matrices (ground truth class vs. output class) for the six inference methods tested: (a) mean shift, (b) intrinsic Hough, (c) minimum-entropy Hough, (d) Gerig, (e) Greedy, (f) BLK.
5 Conclusion
We have introduced two key extensions of the Hough
transform, which can be applied to any approach using the Hough transform. The first, the intrinsic Hough
transform, changes the memory requirements of the
Hough transform from O(k d ), (k > 1) to O(n), making
it feasible for high-dimensional Hough spaces such as
that of our 3D shape R&R application. The second, the
minimum-entropy Hough transform, was shown to significantly increase detection precision over mean shift
on our task. We also showed that it marginally outperformed the probabilistic method of Barinova et al. [4],
as well as benefiting from a computation time that is
independent of the number of objects in the scene, and
allowing the straightforward refinement of modes using
mean shift.
However, given that the kernel density parameters
used were optimized for Hough-based approaches and
not for BLK, the real “take home” message of this paper is that the assumption that only one vote generated
by each feature is correct is a powerful constraint in
Hough-based frameworks, which can dramatically improve inference by “clearing the mist” of incorrect votes,
as long as the correct vote is chosen well. We also note
that several inference approaches outside the Hough domain enforce a similar constraint, that only one object
generates each feature, e.g. [10, 11, 20, 41]; these methods may well perform similarly, and potentially even
better, on the same problem.
Acknowledgements The authors are extremely grateful to
Bob Fisher, Andrew Fitzgibbon, Chris Williams, John Illingworth and the anonymous reviewers for providing valuable
feedback on this work.
A Proof of the integer nature of vote weights
Theorem 1 Given equation (3), an integer set of optimal values of θ exists, i.e. for which θ_{ij} ∈ {0, 1} ∀i, j.
Proof Let us assume that θ is at its globally optimal value,
and consider only the weights of the ith feature (i.e. assume
the other weights are fixed), so that
p(y \mid \theta_i) = C(y) + \omega_i \sum_{j=1}^{J_i} \theta_{ij} K(x_{ij}, y),   (14)
where C(y) is a function which is independent of θ_i. The objective function can then be written as
f(\theta_i) = -\int_H p(y \mid \theta_i) \ln p(y \mid \theta_i')\, dy.   (15)
We distinguish between the two instances of θ_i in the equation above purely for the purposes of the proof. The objective function can be rewritten as follows:
f(\theta_i) = D - \sum_{j=1}^{J_i} \theta_{ij} a_{ij},   (16)
a_{ij} = \int_H \omega_i K(x_{ij}, y) \ln p(y \mid \theta_i')\, dy,   (17)
where D is a constant. Given the constraints of equation (2), minimizing equation (16) with respect to θ_i, whilst keeping θ′_i fixed, can always be achieved by setting θ_ij = 1 for one j for which a_ij is largest, and setting all other weights to 0. In addition, Gibbs' inequality [14] implies that equation (15) is minimized when θ_i = θ′_i (as we require them to be). Therefore the ith feature must have an integer set of optimal weights. This argument can be applied to each feature independently.
⊓⊔
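The selection step in the proof amounts to minimizing a linear function over a probability simplex, which can be checked numerically. In this sketch, random values stand in for the integrals a_ij of equation (17); the check confirms that putting all of a feature's weight on the vote with the largest a_ij is never beaten by any fractional weighting:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=7)  # stand-ins for the a_ij of equation (17), one feature

# Integer assignment: all weight on the vote with the largest a_ij.
theta_int = np.zeros_like(a)
theta_int[np.argmax(a)] = 1.0
f_int = -theta_int @ a  # objective (16) with the constant D dropped

# Random fractional weightings satisfying the simplex constraints of eqn (2).
theta_frac = rng.dirichlet(np.ones_like(a), size=10_000)
f_frac = -(theta_frac @ a)

# No fractional weighting achieves a lower objective than the integer one.
assert f_int <= f_frac.min() + 1e-12
```

This is exactly the vertex-optimality property of linear programs that the proof relies on: a convex combination of the a_ij can never exceed their maximum.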
References
1. Toshiba CAD model point clouds dataset (2011).
http://www.toshiba-europe.com/research/crl/cvg/
projects/stereo_points.html
2. Allan, M., Williams, C.K.I.: Object localisation using the
generative template of features. Computer Vision and
Image Understanding 113, 824–838 (2009)
3. Ballard, D.H.: Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition 13(2), 111–
122 (1981)
4. Barinova, O., Lempitsky, V., Kohli, P.: On detection of
multiple object instances using Hough transforms. In:
Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (2010)
5. Ben-Tzvi, D., Sandler, M.B.: A combinatorial Hough
transform. Pattern Recognition Letters 11(3), 167–174
(1990)
6. Besag, J.: On the statistical analysis of dirty pictures.
Journal of the Royal Statistical Society, Series B 48(3),
259–302 (1986)
7. Birchfield, S., Tomasi, C.: Multiway cut for stereo and
motion with slanted surfaces. In: Proceedings of the IEEE
International Conference on Computer Vision (1999)
8. Bober, M., Kittler, J.: Estimation of complex multimodal motion: An approach based on robust statistics
and Hough transform. In: Proceedings of the British Machine Vision Conference (1993)
9. Cheng, Y.: Mean shift, mode seeking, and clustering.
Transactions on Pattern Analysis and Machine Intelligence 17(8), 790–799 (1995)
10. Delong, A., Osokin, A., Isack, H., Boykov, Y.: Fast approximate energy minimization with label costs. International Journal of Computer Vision 96(1), 1–27 (2012)
11. Delong, A., Veksler, O., Boykov, Y.: Fast fusion moves for
multi-model estimation. In: Proceedings of the European
Conference on Computer Vision (2012)
12. Drost, B., Ulrich, M., Navab, N., Ilic, S.: Model globally,
match locally: Efficient and robust 3D object recognition.
In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 998–1005 (2010)
13. Duda, R.O., Hart, P.E.: Use of the Hough transformation
to detect lines and curves in pictures. Communications of the ACM 15,
11–15 (1972)
14. Falk, H.: Inequalities of J. W. Gibbs. American Journal
of Physics 38(7), 858–869 (1970)
15. Fisher, A., Fisher, R.B., Robertson, C., Werghi, N.: Finding surface correspondence for object recognition and registration using pairwise geometric histograms. In: Proceedings of the European Conference on Computer Vision, pp. 674–686 (1998)
16. Gall, J., Lempitsky, V.: Class-specific Hough forests for
object detection. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 1022–
1029 (2009)
17. Gerig, G.: Linking image-space and accumulator-space: A
new approach for object-recognition. In: Proceedings of
the IEEE International Conference on Computer Vision,
pp. 112–117 (1987)
18. Hough, P.V.C.: Method and means for recognizing complex patterns. U.S. Patent 3,069,654 (1962)
19. Illingworth, J., Kittler, J.: The adaptive Hough transform. Transactions on Pattern Analysis and Machine Intelligence 9(5), 690–698 (1987)
20. Isack, H., Boykov, Y.: Energy-based geometric multi-model fitting. International Journal of Computer Vision
97(2), 123–147 (2012)
21. Knopp, J., Prasad, M., Willems, G., Timofte, R.,
Van Gool, L.: Hough transform and 3D SURF for robust
three dimensional classification. In: Proceedings of the
European Conference on Computer Vision, pp. 589–602
(2010)
22. Lamdan, Y., Wolfson, H.: Geometric hashing: A general
and efficient model-based recognition scheme. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 238–249 (1988)
23. Leibe, B., Leonardis, A., Schiele, B.: Combined object
categorization and segmentation with an implicit shape
model. In: ECCV Workshop on Statistical Learning in
Computer Vision (2004)
24. Leibe, B., Leonardis, A., Schiele, B.: Robust object detection with interleaved categorization and segmentation.
International Journal of Computer Vision 77(1-3), 259–
289 (2008)
25. Li, H., Lavin, M.A., Le Master, R.J.: Fast Hough transform: A hierarchical approach. Computer Vision, Graphics, and Image Processing 36(2-3), 139–161 (1986)
26. MacKay, D.J.C.: Information Theory, Inference and
Learning Algorithms. Cambridge University Press (2003)
27. Maji, S., Malik, J.: Object detection using a max-margin
Hough transform. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2009)
28. Mian, A., Bennamoun, M., Owens, R.: Three-dimensional
model-based object recognition and segmentation in cluttered scenes. Transactions on Pattern Analysis and Machine Intelligence 28(10), 1584–1601 (2006)
29. Minka, T.P.: The ‘summation hack’ as an outlier model.
Technical note (2003)
30. Okada, R.: Discriminative generalized Hough transform
for object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2000–2005
(2009)
31. Pham, M.T., Woodford, O.J., Perbet, F., Maki, A.,
Stenger, B., Cipolla, R.: A new distance for scale-invariant 3D shape recognition and registration. In: Proceedings of the IEEE International Conference on Computer Vision (2011)
32. Rosten, E., Loveland, R.: Camera distortion self-calibration using the plumb-line constraint and minimal
Hough entropy. Machine Vision and Applications (2009)
33. Shannon, C.E.: A mathematical theory of communication. The Bell System Technical Journal 27, 379–423,
623–656 (1948)
34. Sheikh, Y.A., Khan, E.A., Kanade, T.: Mode-seeking by
medoidshifts. In: Proceedings of the IEEE International
Conference on Computer Vision (2007)
35. Stephens, R.S.: A probabilistic approach to the Hough
transform. Image and Vision Computing 9(1), 66–71
(1991)
36. Toldo, R., Fusiello, A.: Robust multiple structures estimation with J-linkage. In: Proceedings of the European
Conference on Computer Vision (2008)
37. Tombari, F., Di Stefano, L.: Object recognition in 3D scenes with occlusions and clutter by Hough voting. In: Proceedings of the Pacific-Rim Symposium on Image and Video Technology, pp. 349–355 (2010)
38. Vedaldi, A., Soatto, S.: Quick shift and kernel methods
for mode seeking. In: Proceedings of the European Conference on Computer Vision, pp. 705–718 (2008)
39. Vincent, E., Laganiere, R.: Detecting planar homographies in an image pair. In: Proceedings of the International Symposium on Image and Signal Processing and
Analysis, pp. 182–187 (2001)
40. Vogiatzis, G., Hernández, C.: Video-based, real-time
multi-view stereo. Image and Vision Computing 29(7),
434–441 (2011)
41. Woodford, O.J., Pham, M.T., Maki, A., Gherardi, R.,
Perbet, F., Stenger, B.: Contraction moves for geometric
model fitting. In: Proceedings of the European Conference on Computer Vision (2012)
42. Xu, L., Oja, E., Kultanen, P.: A new curve detection
method: Randomized Hough transform (RHT). Pattern
Recognition Letters 11(5), 331–338 (1990)
43. Zhang, W., Košecká, J.: Nonparametric estimation of
multiple structures with outliers. In: R. Vidal, A. Heyden, Y. Ma (eds.) Dynamical Vision, Lecture Notes in
Computer Science, vol. 4358, pp. 60–74. Springer Berlin
/ Heidelberg (2007)
44. Zhang, Y., Chen, T.: Implicit shape kernel for discriminative learning of the Hough transform detector. In:
Proceedings of the British Machine Vision Conference
(2010)
45. Zuliani, M., Kenney, C.S., Manjunath, B.S.: The multiRANSAC algorithm and its application to detect planar
homographies. In: Proceedings of the IEEE International
Conference on Image Processing (2005)