Noname manuscript No. (will be inserted by the editor)

Demisting the Hough Transform for 3D Shape Recognition and Registration

Oliver J. Woodford · Minh-Tri Pham · Atsuto Maki · Frank Perbet · Björn Stenger

Received: date / Accepted: date

Abstract In applying the Hough transform to the problem of 3D shape recognition and registration, we develop two new and powerful improvements to this popular inference method. The first, intrinsic Hough, solves the problem of exponential memory requirements of the standard Hough transform by exploiting the sparsity of the Hough space. The second, minimum-entropy Hough, explains away incorrect votes, substantially reducing the number of modes in the posterior distribution of class and pose, and improving precision. Our experiments demonstrate that these contributions make the Hough transform not only tractable but also highly accurate for our example application. Both contributions can be applied to other tasks that already use the standard Hough transform.

O. J. Woodford
Toshiba Research Europe Ltd.
208 Cambridge Science Park, Milton Road
Cambridge CB4 0GZ, UK
E-mail: oliver.woodford@crl.toshiba.co.uk

1 Introduction

The Hough transform [13], named after Hough’s 1962 patent [18] describing a method for detecting lines in images, has since been generalized to detecting, as well as recognizing, many other objects or instances: parameterized curves [13], arbitrary 2D shapes [3], object motions [8], cars [16, 24], pedestrians [4, 16], hands [30] and 3D shapes [21, 31, 37], to name but a few. This popularity stems from the simplicity and generality of the first step of the Hough transform—the conversion of features, found in the data space, into sets of votes in a Hough space, parameterized by the pose of the object(s) to be found. Various different approaches to learning this feature-to-vote conversion function have been proposed, including the implicit shape model [24] and Hough forests [16, 30].
The second stage of the Hough transform simply sums the likelihoods of the votes at each location in Hough space, then selects the modes. One problem with this step is that the summation can create modes where there are only a few outlier votes. A second problem is that, given a required accuracy, the size of the Hough space is exponential in its dimensionality. The application we are concerned with, object recognition and registration (R&R) from 3D geometry (here, point clouds), suffers significantly from both these problems. The Hough space, at 8D (one dimension for class, three for rotation, three for translation and one for scale), is to our knowledge the largest to which the Hough transform has been applied, and the feature-to-vote conversion generates a high proportion of incorrect votes, creating a “mist” of object likelihood throughout that space, as shown in figure 2(a). In the face of this adversity, we have developed two important contributions which enable inference on this task, and potentially many others, using the Hough transform to be both feasible and accurate:

– We introduce the intrinsic Hough transform, which substantially reduces memory and computational requirements in applications with a high dimensional Hough space.
– We introduce the minimum-entropy Hough transform, which greatly improves the precision and robustness of the Hough transform.

These extensions of the Hough transform are not task specific; they can be applied, either together or independently, to any application that does or is able to use the standard Hough transform.

The rest of this paper is organized as follows: The next section describes inference using the Hough transform, and briefly reviews the literature relevant to our contributions. In §3 we describe our new inference methods. The section following that describes and discusses our experiments. Finally, we conclude in §5.
2 Background

2.1 3D shape recognition and registration

The implicit shape model of Leibe et al. [23, 24] pioneered the use of the Hough transform for object recognition in 2D images. This approach has since been applied to object recognition in 3D geometric data [21], and extended to object registration [31, 37]. For the dual problem of R&R in 3D, the Hough space is either 7D (if scale is known) [37] or 8D [31]. The feature extraction stages of these methods follow the same pipeline: features are detected at a given scale and position; a canonical orientation of the feature is estimated; a descriptor for the feature is computed. The votes are then computed by matching features in the test data with features from training data with ground truth class and pose, either directly (i.e. a nearest neighbour search) [31], or via a codebook created by clustering the feature descriptors [21, 24, 37]. In this work we will be using the feature-to-vote conversion process of Pham et al. [31] as an off-the-shelf method, since our contributions lie in the second stage of the Hough transform. It is this process that generates a high proportion of incorrect votes, amongst which the correct votes need to be found.

2.2 The Hough transform

The earliest descriptions of the Hough transform [3, 13, 18] present it as an algorithm, but more recently there has been a desire to cast the framework in a probabilistic light. Generative model interpretations [2, 4, 35], in which the votes represent likelihoods of features given an object pose, require that the likelihoods of these independent variables be multiplied, in contrast to the summation of the Hough transform. The summation has been explained in two ways: firstly that it is in fact over the log likelihood of features [4, 35], though this requires a differently shaped distribution for each vote than is typically given [4], or secondly that it is a first order approximation to a robustified product of likelihoods [2, 29]. We prefer to interpret the second stage of the Hough transform as a discriminative model of the posterior distribution of an object’s location, phrased simply as a kernel density estimate over all the votes [8, 44].

Let y be an object’s location in a Hough space, H, which is the space of all object poses (usually real) and, in the case of object recognition tasks, object classes (discrete). Furthermore, let the list of votes cast in H by N features, computed in some first stage feature-to-vote conversion process (not addressed here), be denoted by X = {{x_{ij}}_{j=1}^{J_i}}_{i=1}^{N}. The posterior probability of an object’s location is then given by

p(y|X, ω, θ) = Σ_{i=1}^{N} ω_i Σ_{j=1}^{J_i} θ_{ij} K(x_{ij}, y),    (1)

where J_i is the number of votes generated by the ith feature, K(·, ·) is a density kernel in Hough space, and ω = {ω_i}_{i=1}^{N} and θ = {θ_{ij}}_{∀i,j} are feature and vote weights respectively, s.t.

ω_i ≥ 0 ∀i,  Σ_{i=1}^{N} ω_i = 1,  and  θ_{ij} ≥ 0 ∀i, j,  Σ_{j=1}^{J_i} θ_{ij} = 1, ∀i ∈ {1, ..., N}.    (2)

For example, in the original Hough transform used for line detection [13], the features are edgels, votes are generated for a discrete set of lines (parameterized by angle) passing through each edgel, the kernel, K(·, ·), returns 1 for the nearest point in the discretized Hough space to the input vote, 0 otherwise, and the weights, ω and θ, are set to uniform distributions. Recently methods have been proposed for learning a priori more discriminative weights [27, 44] for object detection, as well as evaluating over different kernel shapes [44]. The final stage of the Hough transform involves finding, using non-maxima suppression, the modes of this distribution whose probabilities are above a certain threshold value, τ.

2.2.1 Computational feasibility

Finding the modes in H involves sampling that space, the volume of which increases exponentially with its dimensionality, d.
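Sampling the space means evaluating the posterior of equation (1) at candidate locations. A minimal sketch of that evaluation, with an isotropic Gaussian kernel standing in for the task-specific kernel K(·, ·) and an illustrative bandwidth (both assumptions, not the paper's choices):

```python
import numpy as np

def hough_posterior(y, votes, feature_weights, vote_weights, bandwidth=0.1):
    """Evaluate the posterior p(y | X, w, theta) of equation (1) at a pose y.

    votes: list (one entry per feature) of (J_i, d) arrays of votes x_ij.
    feature_weights: length-N array of w_i, summing to 1.
    vote_weights: list of length-J_i arrays theta_i, each summing to 1.
    """
    p = 0.0
    for x_i, w_i, theta_i in zip(votes, feature_weights, vote_weights):
        # Inner sum over this feature's votes: sum_j theta_ij K(x_ij, y)
        sq_dist = np.sum((x_i - y) ** 2, axis=1)
        p += w_i * np.dot(theta_i, np.exp(-0.5 * sq_dist / bandwidth ** 2))
    return p

# Two features, each casting two votes in a toy 2D Hough space; the
# votes near the origin agree, so the posterior peaks there.
votes = [np.array([[0.0, 0.0], [1.0, 1.0]]),
         np.array([[0.0, 0.1], [2.0, 2.0]])]
w = np.array([0.5, 0.5])
theta = [np.array([0.5, 0.5]), np.array([0.5, 0.5])]
print(hough_posterior(np.array([0.0, 0.0]), votes, w, theta))
```

With uniform weights this reduces to the classical vote accumulation, evaluated in continuous space rather than in a discretized array.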
Several approaches have been proposed to reduce this burden, which we categorize as one of approximate, hierarchical, irregular or mode-seeking.

Approximate methods use reduced-dimensionality approximations of the full Hough space to find modes. For example, given a 6D pose (translation & rotation), Fisher et al. [15] quantize translations and rotations in two separate 3D arrays (peak entries in both arrays indicate an object, but multiple objects create ambiguities), while Tombari & Di Stefano [37] find modes over translation only, then compute an average rotation for each mode. Geometric hashing techniques, e.g. [12, 22, 28], also fall into this category.

Hierarchical approaches, such as the fast [25] and adaptive [19] Hough transforms, sample the space in a coarse-to-fine manner, exploiting the sparsity of some areas, though their complexity is still exponential in d.

Irregular methods do not sample the Hough space regularly, but rather sample only where objects are likely to be detected, again exploiting potential sparsity in H. For example, the combinatorial [5] and randomized [42] Hough transforms generate lists of sampling locations, the former for all lines (in line detection) joining pairs of edgels in confined regions, the latter for curves (in curve detection) defined uniquely by random sets of edgels. Both these approaches are task specific, whereas the intrinsic Hough transform introduced here, which also falls into this category, is not.

Mode-seeking methods find modes in H through iterative optimization [8, 9]. Mean shift [9] is the most commonly used approach, the complexity of which is O(nd²), where n = Σ_{i=1}^{N} J_i (the total number of votes). It has successfully been applied to an 8D Hough space [31]. However, it needs to be initialized in many, perhaps O(n), locations, making the total complexity O(n²d²), and is not guaranteed to find every mode.
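The mean shift iteration referred to above can be sketched as follows. A Euclidean Hough space and Gaussian kernel are simplifying assumptions here; the SRT-space mean shift of [31] instead uses a closed-form mean over scale, rotation and translation:

```python
import numpy as np

def mean_shift_mode(y0, votes, weights, bandwidth=0.2, iters=50, tol=1e-6):
    """Seek a mode of a Gaussian kernel density estimate by mean shift.

    votes: (n, d) array of all votes; weights: length-n array of vote
    weights. Each iteration moves y to the kernel-weighted mean of the
    votes, which converges to a local mode of the density.
    """
    y = np.asarray(y0, dtype=float)
    for _ in range(iters):
        k = weights * np.exp(-0.5 * np.sum((votes - y) ** 2, axis=1)
                             / bandwidth ** 2)
        y_new = k @ votes / k.sum()   # weighted mean of the votes
        if np.linalg.norm(y_new - y) < tol:
            break
        y = y_new
    return y

# Three agreeing votes near the origin and one distant outlier: starting
# nearby, the iteration settles on the cluster, ignoring the outlier.
votes = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [3.0, 3.0]])
mode = mean_shift_mode([0.3, 0.3], votes, np.ones(4))
```

The quadratic total cost arises because this iteration must be restarted from many initial locations to have a chance of finding every mode.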
Two extensions of this approach, though generally applied to clustering rather than mode seeking, are medoid shift [34] and quick shift [38]. These approaches can also be combined. For example, modes found in a coarse sampling of H can be refined using mean shift [24], an approach we employ here.

2.2.2 Explaining away votes

The summing of votes in the Hough transform enables incorrect votes to generate modes in H, and since most applications tend to produce a large number of incorrect votes, this can lead to false detections, especially in multi-object detection scenarios. The problem arises from the fact that each test feature generates a number (often quite large) of votes, which represent the locations of all objects that could have generated that feature, but usually only one of those votes will actually be correct, because most features are generated by only one object. Figure 2(a) visualizes the ambiguity caused by these incorrect votes in our R&R application. If we assume, usually correctly, that a feature is generated by only one object, we can then enforce the resulting implicit constraint that only one vote cast by each feature is correct. By choosing which vote this is for each feature, the other votes can then be dismissed as being incorrect, removing them from the transform—the correct vote essentially explains away all the other votes. This assumption was first applied to the Hough transform in the 1980s by Gerig [17], using a two stage approach: first computing the standard Hough transform, then, simultaneously for each feature, collating the values of the Hough transform at the locations of all votes of a given feature, and keeping only the vote at the highest value. The idea was resurrected more recently by Barinova et al. [4], using an approach akin to the Hough transform, in that it exhaustively samples the Hough space while searching for objects.
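Gerig's two-stage vote selection can be sketched as below. The Gaussian kernel, uniform per-vote weights and function names are illustrative assumptions; only the scheme itself (accumulate first, then keep each feature's highest-valued vote, simultaneously for all features) is from the source:

```python
import numpy as np

def gerig_vote_selection(votes, feature_weights, bandwidth=0.1):
    """Two-stage explaining away: accumulate the standard Hough sum, then,
    for every feature simultaneously, keep only the vote landing at the
    highest value of the accumulated transform.

    votes: list of (J_i, d) arrays, one per feature.
    Returns the index of the kept vote for each feature.
    """
    all_votes = np.vstack(votes)
    # Uniform theta within each feature: each vote carries w_i / J_i.
    all_w = np.concatenate([np.full(len(v), w / len(v))
                            for v, w in zip(votes, feature_weights)])

    def transform(y):   # the standard Hough sum, evaluated at y
        d2 = np.sum((all_votes - y) ** 2, axis=1)
        return all_w @ np.exp(-0.5 * d2 / bandwidth ** 2)

    return [int(np.argmax([transform(x) for x in x_i])) for x_i in votes]
```

Because every feature is updated against the same accumulated transform, the selection is one-shot: no vote's removal ever influences another feature's choice, unlike the sequential scheme introduced in §3.2.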
However, they directly enforce the constraint that a feature is generated by only one object, using feature-to-object assignments, with a cost per object detection. Phrasing the problem as an energy minimization, they greedily detect objects in Hough space, assigning to them features which decrease the overall energy. Furthermore, rather than using kernels that tail off to zero, their kernels continue decreasing away from the vote, with an explicit background assignment for outlier features. Several other multi-object detection frameworks also make explicit feature-to-object assignments: energy-minimization-based methods [7, 10, 11, 20, 41], which iteratively update the assignments; RANSAC, similar to energy-based methods but focusing more on the algorithm than the objective function, with features assigned either greedily [39] or with iterative refinement [43, 45]; and non-parametric methods, which cluster features into groups representing objects [36]. A benefit of methods using feature-to-object assignments, as opposed to the feature-to-vote assignments of Hough-based methods, is that they avoid the last step of the Hough transform: non-maxima suppression of accumulated votes in Hough space.

3 Our framework

This section describes our improvements to the Hough transform. In §3.1 we introduce the intrinsic Hough transform, which overcomes the high memory requirements of the standard Hough transform with high-dimensional Hough spaces. In §3.2 we introduce a method which exploits the assumption that only one vote per feature is correct.

3.1 The intrinsic Hough transform

As discussed in §2.2, high-dimensional Hough spaces require infeasible amounts of memory to sample regularly. However, we note that while the volume of the Hough space increases exponentially with its dimensionality, the number of votes generated in applications using the Hough transform generally does not, implying that higher dimensional Hough spaces are often sparser.
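A minimal sketch of how this sparsity can be exploited, as developed below: sample the posterior only at the vote locations themselves, then apply non-maxima suppression there. A Gaussian kernel again stands in for the paper's SRT kernel, and the pooled-votes representation is an assumption:

```python
import numpy as np

def intrinsic_hough_modes(votes, weights, bandwidth=0.1, gamma=np.exp(-8.0)):
    """Intrinsic Hough transform sketch: evaluate the posterior at the votes
    only; a vote y is then a mode if no other vote z with K(y, z) > gamma
    has a higher probability.

    votes: (n, d) array pooling all features' votes; weights: length-n
    array of the combined w_i * theta_ij weights.
    """
    def kernel(a, b):
        return np.exp(-0.5 * np.sum((a - b) ** 2, axis=-1) / bandwidth ** 2)

    # Probability at every vote location: O(n) memory, O(n^2) kernel calls.
    K = kernel(votes[None, :, :], votes[:, None, :])
    p = K @ weights
    modes = []
    for i in range(len(votes)):
        neighbours = (K[i] > gamma) & (np.arange(len(votes)) != i)
        if not np.any(p[neighbours] > p[i]):
            modes.append(i)
    return modes
```

A final mean shift step (omitted here) would then nudge each retained vote onto the exact local mode.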
We exploit this sparsity by sampling the Hough space only at locations where the probability (given by equation (1)) is likely to be non-zero. Assuming that the density kernel, K(·, ·), in equation (1) is zero-mean and unimodal (which is generally true for kernel density estimation), the modes of the distribution will be at or near the locations of the votes. We therefore simply sample the Hough space at the locations of the votes themselves. Since the votes define the distribution, and are therefore intrinsic to it, we call this approach the intrinsic Hough transform. While similar in some respects to intrinsic mode-seeking algorithms [34, 38], the intrinsic Hough transform does not seek modes through iterative updates. Rather, the modes of the distribution are detected using non-maxima suppression, as per the standard Hough transform; here, a sample location, y, is classified as a mode if no other sample location, z, within a certain distance, s.t. K(y, z) > γ, has a higher probability. Implicit in this approach is the assumption that the local modes of the distribution given by equation (1) lie very close to a vote—this is the case for most shapes of kernel used in practice. As a final step to improve accuracy, the location of each mode found is updated with one step of mean shift. The memory and computational requirements of this approach are O(n) and O(n²d²) respectively.

3.2 The minimum-entropy Hough transform

Making the assumption that only one vote per feature is correct, a vote that is believed to be correct should explain away the other votes from that feature. This suggests that, rather than being given θ a priori, it would be beneficial to optimize over its possible values, giving those votes which agree with votes from other features more weight than those which do not. One way of achieving this is by minimizing the information entropy¹ of p(y|X, ω, θ) w.r.t. θ. A similar approach, but minimizing entropy w.r.t.
some parameters of the vote generation process, has already been used for lens distortion calibration [32]. A lower entropy distribution contains less information, making it more peaky and hence having more votes in agreement. Since information in Hough space is the location of objects, minimizing entropy constrains features to be generated by as few objects as possible. This can be viewed as enforcing Occam’s razor. The objective function to be minimized is therefore

f(θ) = −∫_H p(y|X, ω, θ) ln p(y|X, ω, θ) dy.    (3)

However, computing this entropy involves an integration over Hough space; for our application this is very large. To make this integration tractable we sample the space at discrete locations using importance sampling [26, §29.2]; as with the intrinsic Hough transform, we sample the Hough space at the locations of all the votes. The value of θ is therefore approximated by

θ = argmin_{θ′} − Σ_{i=1}^{N} Σ_{j=1}^{J_i} [p(x_{ij}|X, ω, θ′) / q(x_{ij})] ln p(x_{ij}|X, ω, θ′),    (4)

where q(·) is the (unknown) sampling distribution from which the votes are drawn. Once this optimization (described below) is done, the estimated θ is applied to equation (1), and inference continues as per the standard (or intrinsic) Hough transform. We call this approach the minimum-entropy Hough transform.²

¹ Specifically we use the Shannon entropy [33], H = E[−ln p(x)] = −∫ p(x) ln p(x) dx.
² Strictly speaking, the minimum-entropy Hough transform is not a transform, because the probability of each location in Hough space cannot be computed independently.

3.2.1 Optimization framework

It turns out, as we show in Appendix A, that a global minimum of equation (3) must lie at an extremum of the parameter space, which is constrained by equation (2), such that at least one optimal value of θ_i = {θ_{ij}}_{j=1}^{J_i} (i.e. the vector of feature i’s vote weights) will be an all 0 vector, except for one 1, i.e. minimizing entropy naturally enforces the one-correct-vote-per-feature constraint. As a result, a global minimum can always be found if we limit the search space for each θ_i to integer values, making a discrete set of J_i possible vectors, s.t. the total number of possible solutions is Π_{i=1}^{N} J_i. It should be noted that this search space is not unimodal—for example, if there are only two features and they each identically generate two votes, one for location y and one for location z, then both y and z will be modes. Furthermore, as the search space is exponential in the number of features, an exhaustive search is infeasible for all but the smallest problems. We therefore use a local approach, iterated conditional modes (ICM) [6], to quickly find a local minimum of this optimization problem. This involves updating the vote weights of each feature in turn, by minimizing equation (4) conditioned on the current weights of all other votes, and repeating this process until convergence. The correct update equation for the vote weights of a feature f is as follows:

p_{fk}(y|X, ω, θ) = ω_f K(x_{fk}, y) + Σ_{i≠f} ω_i Σ_{j=1}^{J_i} θ_{ij} K(x_{ij}, y),    (5)

k = argmax_{k′∈{1,...,J_f}} Σ_{i=1}^{N} Σ_{j=1}^{J_i} [p_{fk′}(x_{ij}|X, ω, θ) / q(x_{ij})] ln p_{fk′}(x_{ij}|X, ω, θ),    (6)

θ_{fk} = 1,  θ_{fj} = 0, ∀j ≠ k.    (7)

However, since this update not only involves q(·), which is unknown, but is also relatively costly to compute, we replace it with a simpler proxy which in practice performs a similar job of encouraging the resulting posterior distribution to be as peaky as possible:

k = argmax_{k′∈{1,...,J_f}} p_{fk′}(x_{fk′}|X, ω, θ).    (8)

This is effectively the strategy of Gerig, but applied sequentially rather than simultaneously. Since the optimization is local, a good initialization of θ is key to reaching a good minimum.
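The ICM loop with the proxy update of equation (8) can be sketched as follows. A Gaussian kernel and a hard (0/1) initialization are simplifying assumptions; the paper instead initializes with five soft, simultaneous updates of equation (9) before starting ICM:

```python
import numpy as np

def min_entropy_icm(votes, feature_weights, bandwidth=0.1, max_iters=20):
    """ICM sketch for the minimum-entropy Hough transform: each feature in
    turn keeps the single vote with the highest posterior, equation (8),
    under the current hard weights of all the other features.

    votes: list of (J_i, d) arrays. Returns the selected vote index per
    feature.
    """
    def kernel(a, B):
        return np.exp(-0.5 * np.sum((B - a) ** 2, axis=1) / bandwidth ** 2)

    sel = [0] * len(votes)   # hard initial assignment (an assumption)
    for _ in range(max_iters):
        changed = False
        for f, x_f in enumerate(votes):
            # p_fk(x_fk): feature f's own kernel at its vote (K = 1) plus
            # every other feature's currently selected vote, equation (5).
            others = np.array([votes[i][sel[i]]
                               for i in range(len(votes)) if i != f])
            w_others = [feature_weights[i]
                        for i in range(len(votes)) if i != f]
            scores = [feature_weights[f] * 1.0
                      + np.dot(w_others, kernel(x_k, others))
                      for x_k in x_f]
            k = int(np.argmax(scores))
            if k != sel[f]:
                sel[f], changed = k, True
        if not changed:
            break
    return sel
```

Unlike the simultaneous selection of Gerig, each feature's choice here immediately changes the posterior seen by the next feature, so votes explained away early on cannot prop up spurious modes later.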
In our experiments we start at the value of θ used in the standard Hough transform, then apply the following update to each vote weight simultaneously:

θ_{ik} = p_{ik}(x_{ik}|X, ω, θ) / Σ_{j=1}^{J_i} p_{ij}(x_{ij}|X, ω, θ),    (9)

iterating this five times before starting ICM. Initially updating weights softly, i.e. not fixing them to 0 or 1, and synchronously, avoiding ordering bias, in this way helped to avoid falling into a poor local minimum early on, thus improving the quality of solution found.

4 Experiments

4.1 Setup

For our test application, 3D shape R&R, we use the framework introduced by Pham et al. [31], outlined in figure 3, the evaluation data from which can be found online [1]. It consists of 100 test instances, each containing one object, for each of 10 object classes, shown in figure 4, i.e. 1000 test instances in total. Each test instance provides ground truth 7D object pose (scale and 3D rotation and translation) and class, and a set of input votes, with weights, for object pose and from all 10 classes.

For the density kernel, K(·, ·), of equation (1) we use a Gaussian kernel on a symmetric version of the SRT distance between direct similarity transforms [31]. For two object poses, y and z, of the same class, it is defined as

K(y, z) = (1/ζ) exp(−d_s²(y, z)/σ_s² − d_r²(y, z)/σ_r² − d_t²(y, z)/σ_t²),    (10)

d_s(y, z) = log(s(y)/s(z)),    (11)

d_r(y, z) = √(1 − |q(y)ᵀ q(z)|),    (12)

d_t(y, z) = ‖t(y) − t(z)‖ / √(s(y) s(z)),    (13)

where s(y), q(y) and t(y) are the scale, rotation (as a quaternion) and translation components of y respectively. If y and z specify different classes, then K(y, z) = 0. The values of the bandwidth parameters, σ_s, σ_r and σ_t, given in table 1, are those learned in [31]. The normalization factor, ζ, cannot easily be computed, but is independent of z [31]; therefore, since our equations (8) & (9) are scale independent,³ it can be ignored.
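The unnormalized kernel of equations (10)–(13) can be written directly from the definitions above. The bandwidths are the values from table 1; the pose representation as a (scale, unit quaternion, translation) tuple and the input validation left to the caller are assumptions of this sketch:

```python
import numpy as np

def srt_kernel(y, z, sigma_s=0.0694, sigma_r=0.12, sigma_t=0.12):
    """Unnormalized SRT kernel of equation (10) between two same-class
    poses, each a tuple (scale s, unit quaternion q, translation t).
    The constant 1/zeta is dropped, as the scale-independent updates
    of equations (8) & (9) allow.
    """
    s_y, q_y, t_y = y
    s_z, q_z, t_z = z
    d_s = np.log(s_y / s_z)                         # scale distance (11)
    d_r2 = 1.0 - abs(np.dot(q_y, q_z))              # rotation dist.^2 (12)
    d_t2 = np.sum((t_y - t_z) ** 2) / (s_y * s_z)   # translation^2   (13)
    return np.exp(-d_s ** 2 / sigma_s ** 2
                  - d_r2 / sigma_r ** 2
                  - d_t2 / sigma_t ** 2)

pose = (1.0, np.array([1.0, 0.0, 0.0, 0.0]), np.array([0.0, 0.0, 0.0]))
print(srt_kernel(pose, pose))   # identical poses give the maximum, 1.0
```

Note the absolute value in d_r, which makes the quaternions q and −q (the same rotation) equivalent, and the division by √(s(y)s(z)) in d_t, which makes the translation term scale invariant.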
4.2 Methods

As well as evaluating the relative performance of the two Hough transforms introduced in §3.1 & 3.2, we compare them with the SRT mean shift method of [31] (henceforth referred to as “mean shift”), and the inference methods of Gerig [17] and Barinova et al. [4] (here referred to simply as Gerig and BLK, after the authors, for short), and finally a Greedy approach which computes the standard Hough transform, finds the maximum and adds the corresponding object to the list of found objects, then removes all of the votes of all features that voted for that object, and repeats the process until no votes are left. Apart from mean shift, the methods all use the intrinsic Hough transform to make sampling H feasible. For the mean shift refinement step of the intrinsic Hough transform, we use the closed-form mean given in [31], despite our slightly different density kernel. However, we do not refine the detections of BLK because their probability distribution is not amenable to this, since the likelihoods are multiplied. The likelihood function used in our implementation of BLK is the same kernel density function used in the other methods, defined in equation (10). We note that the parameters of this kernel were learned in [31] specifically for Hough-based inference, therefore might not be optimal for BLK. Parameter values used for the various methods are summarized in table 1.

Parameter (methods): value
σ_s (a–f): 0.0694
σ_r (a–f): 0.12
σ_t (a–f): 0.12
γ (b–e): exp(−8)
λ (f): 10

Table 1 Parameter values for the inference methods tested: (a) mean shift, (b) intrinsic Hough, (c) minimum-entropy Hough, (d) Gerig, (e) Greedy, (f) Barinova et al. [4].

³ The requirement for a scale independent optimization strategy is a further reason to use the proxy of equation (8).
4.3 Results

4.3.1 Quantitative results

Quantitative results, computed using the ground truth classes and poses provided in the evaluation set, and using the registration criterion in [31], are given in tables 2 & 3 and figure 1. There is a small improvement in performance in both registration and recognition moving from mean shift to intrinsic Hough, which is most likely due to modes being missed by mean shift. Recognition rates then increase rapidly moving to Gerig, then Greedy, then finally minimum-entropy Hough. The latter’s recognition rate, the highest seen, leaves only 1.5% of objects unrecognized (the majority of those in the car class), a huge improvement on mean shift, providing a 96% reduction in misclassifications. This improvement is due to the improved assignment of the correct vote per feature, from a one-shot simultaneous assignment, to a greedy assignment, to an iteratively refined assignment. Minimum-entropy Hough also shows a significantly improved registration rate, with top scores on 7/10 classes. BLK, though greedy, performs almost as well as minimum-entropy Hough in terms of recognition, though less well in terms of registration, in part due to a lack of mean shift pose refinement at the end. However, because these results only reflect the best detection per test, they do not tell the whole story; we do not know how many other (incorrect) detections had competitive weights. To see this, we generated the precision-recall curves shown in figure 5, by varying the detection threshold, τ (or λ for BLK [4]). A correct detection in this test required the class and pose to be correct simultaneously, and allowed only one correct detection per test.
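The threshold sweep used to trace such precision-recall curves can be sketched as follows. This is a hypothetical evaluation harness, not the paper's exact protocol; the pairing of each detection with a correctness flag is an assumption:

```python
def precision_recall(detections, num_true, thresholds):
    """Sweep the detection threshold tau to trace a precision-recall curve.

    detections: list of (weight, is_correct) pairs pooled over all tests;
    num_true: total number of ground-truth objects.
    Returns a list of (precision, recall) points, one per threshold that
    retains at least one detection.
    """
    curve = []
    for tau in thresholds:
        kept = [ok for w, ok in detections if w >= tau]
        if not kept:
            continue
        tp = sum(kept)                                 # true positives
        curve.append((tp / len(kept), tp / num_true))  # (precision, recall)
    return curve
```

Lowering τ admits more detections, so recall can only grow along the sweep, while precision falls whenever the newly admitted detections are incorrect.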
The curves show that precision remains high as recall increases for the minimum-entropy Hough transform, and marginally less so for BLK and Greedy, all of which are able to explain away incorrect votes, while it drops off rapidly with recall for the other methods, indicating that the latter methods suffer from greater ambiguity as to which modes correspond to real objects, or perhaps, in the case of Gerig, the wrong correct-vote assignments being made. Interestingly, Greedy and minimum-entropy Hough have lower maximum recall rates (of 0.759 and 0.813 respectively), which we propose is due to some correct modes being “explained away”. Since, in the case of minimum-entropy Hough, our optimization strategy finds only a local minimum, we cannot be sure whether this effect is due to the objective function or the optimization strategy.

Method | Recognition | Registration | Time
Mean shift | 64.9% | 72.8% | 0.427s
Intrinsic Hough | 67.6% | 73.0% | 0.192s
Min.-entropy | 98.5% | 79.6% | 0.214s
Gerig | 71.8% | 73.3% | 0.218s
Greedy | 85.7% | 70.3% | 0.226s
BLK | 98.1% | 75.1% | 0.224s

Table 2 Quantitative results for the inference methods tested.

In terms of computation time (table 2), all methods tested had the same order of magnitude speed, with the mean shift approach being about twice as slow as the others, though there is a trade-off of time versus accuracy with this approach, by changing the number of starting points of the optimization. However, we noticed that the speed of BLK was dependent on the value of λ, its detection threshold, and therefore equally the number of objects in the scene, unlike the other methods.

4.3.2 Qualitative results

The benefit of explaining away incorrect votes is demonstrated in figure 2.
While the standard Hough transform shows a great deal of ambiguity as to where and how many objects there are, the minimum-entropy Hough transform is able to clear away the “mist” of incorrect votes, leaving six distinct modes corresponding to the objects present; there are some other modes, but these are much less significant, corroborating the results seen in figure 5. The benefit of having correct and clearly defined modes is demonstrated in figure 6, using the same point cloud as in figure 2, a challenging dataset containing three pairs of touching objects. Both minimum-entropy Hough and BLK find all six objects in the top six detections (though both mis-register the piston lying at a shallow angle), whereas the other methods find not only incorrect objects, but also multiple instances of correct objects (particularly the piston on the cog).

Method | bearing | block | bracket | car | cog | flange | knob | pipe | piston1 | piston2
Mean shift | 77 | 13 | 95 | 75 | 100 | 86 | 88 | 86 | 44 | 64
Intrinsic Hough | 77 | 15 | 96 | 76 | 100 | 83 | 86 | 86 | 44 | 67
Minimum-entropy Hough | 83 | 20 | 98 | 91 | 100 | 86 | 91 | 89 | 54 | 84
Gerig [17] | 76 | 13 | 96 | 84 | 100 | 84 | 85 | 83 | 46 | 66
Greedy | 83 | 15 | 83 | 54 | 100 | 89 | 81 | 82 | 49 | 67
Barinova et al. (BLK) [4] | 79 | 20 | 97 | 93 | 100 | 74 | 73 | 81 | 48 | 86

Table 3 Registration rate per class (%) for the six inference methods tested.

Fig. 1 Confusion matrices (ground truth class vs. output class) for the six inference methods tested: (a) mean shift, (b) intrinsic Hough, (c) min.-entropy, (d) Gerig, (e) Greedy, (f) BLK.
5 Conclusion

We have introduced two key extensions of the Hough transform, which can be applied to any approach using the Hough transform. The first, the intrinsic Hough transform, changes the memory requirements of the Hough transform from O(k^d) (k > 1) to O(n), making it feasible for high-dimensional Hough spaces such as that of our 3D shape R&R application. The second, the minimum-entropy Hough transform, was shown to significantly increase detection precision over mean shift on our task. We also showed that it marginally outperformed the probabilistic method of Barinova et al. [4], as well as benefiting from a computation time that is independent of the number of objects in the scene, and allowing the straightforward refinement of modes using mean shift. However, given that the kernel density parameters used were optimized for Hough-based approaches and not for BLK, the real “take home” message of this paper is that the assumption that only one vote generated by each feature is correct is a powerful constraint in Hough-based frameworks, which can dramatically improve inference by “clearing the mist” of incorrect votes, as long as the correct vote is chosen well. We also note that several inference approaches outside the Hough domain enforce a similar constraint, that only one object generates each feature, e.g. [10, 11, 20, 41]; these methods may well perform similarly, and potentially even better, on the same problem.

Acknowledgements The authors are extremely grateful to Bob Fisher, Andrew Fitzgibbon, Chris Williams, John Illingworth and the anonymous reviewers for providing valuable feedback on this work.

A Proof of the integer nature of vote weights

Theorem 1 Given equation (3), an integer set of optimal values of θ exists, i.e. for which θ_{ij} ∈ {0, 1} ∀i, j.

Proof Let us assume that θ is at its globally optimal value, and consider only the weights of the ith feature (i.e.
assume the other weights are fixed), so that

$$p(y|\theta_i) = C(y) + \omega_i \sum_{j=1}^{J_i} \theta_{ij} K(x_{ij}, y), \qquad (14)$$

where C(y) is a function which is independent of θ_i. The objective function can then be written as

$$f(\theta_i) = -\int_H p(y|\theta_i) \ln p(y|\theta'_i) \, dy. \qquad (15)$$

We distinguish between the two instances of θ_i in the equation above purely for the purposes of the proof. The objective function can be rewritten as follows:

$$f(\theta_i) = D - \sum_{j=1}^{J_i} \theta_{ij} a_{ij}, \qquad (16)$$

$$a_{ij} = \int_H \omega_i K(x_{ij}, y) \ln p(y|\theta'_i) \, dy, \qquad (17)$$

where D is a constant. Given the constraints of equation (2), minimizing equation (16) with respect to θ_i, whilst keeping θ'_i fixed, can always be achieved by setting θ_ij = 1 for one j for which a_ij is largest, and setting all other weights to 0. In addition, Gibbs' inequality [14] implies that equation (15) is minimized when θ_i = θ'_i (as we require them to be). Therefore the ith feature must have an integer set of optimal weights. This argument can be applied to each feature independently. ⊓⊔

References

1. Toshiba CAD model point clouds dataset (2011). http://www.toshiba-europe.com/research/crl/cvg/projects/stereo_points.html
2. Allan, M., Williams, C.K.I.: Object localisation using the generative template of features. Computer Vision and Image Understanding 113, 824–838 (2009)
3. Ballard, D.H.: Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition 13(2), 111–122 (1981)
4. Barinova, O., Lempitsky, V., Kohli, P.: On detection of multiple object instances using Hough transforms. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2010)
5. Ben-Tzvi, D., Sandler, M.B.: A combinatorial Hough transform. Pattern Recognition Letters 11(3), 167–174 (1990)
6. Besag, J.: On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, Series B 48(3), 259–302 (1986)
7. Birchfield, S., Tomasi, C.: Multiway cut for stereo and motion with slanted surfaces.
In: Proceedings of the IEEE International Conference on Computer Vision (1999)
8. Bober, M., Kittler, J.: Estimation of complex multimodal motion: An approach based on robust statistics and Hough transform. In: Proceedings of the British Machine Vision Conference (1993)
9. Cheng, Y.: Mean shift, mode seeking, and clustering. Transactions on Pattern Analysis and Machine Intelligence 17(8), 790–799 (1995)
10. Delong, A., Osokin, A., Isack, H., Boykov, Y.: Fast approximate energy minimization with label costs. International Journal of Computer Vision 96(1), 1–27 (2012)
11. Delong, A., Veksler, O., Boykov, Y.: Fast fusion moves for multi-model estimation. In: Proceedings of the European Conference on Computer Vision (2012)
12. Drost, B., Ulrich, M., Navab, N., Ilic, S.: Model globally, match locally: Efficient and robust 3D object recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 998–1005 (2010)
13. Duda, R.O., Hart, P.E.: Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM 15, 11–15 (1972)
14. Falk, H.: Inequalities of J. W. Gibbs. American Journal of Physics 38(7), 858–869 (1970)
15. Fisher, A., Fisher, R.B., Robertson, C., Werghi, N.: Finding surface correspondence for object recognition and registration using pairwise geometric histograms. In: Proceedings of the European Conference on Computer Vision, pp. 674–686 (1998)
16. Gall, J., Lempitsky, V.: Class-specific Hough forests for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1022–1029 (2009)
17. Gerig, G.: Linking image-space and accumulator-space: A new approach for object-recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 112–117 (1987)
18. Hough, P.V.C.: Method and means for recognizing complex patterns. U.S. Patent 3,069,654 (1962)
19. Illingworth, J., Kittler, J.: The adaptive Hough transform.
Transactions on Pattern Analysis and Machine Intelligence 9(5), 690–698 (1987)
20. Isack, H., Boykov, Y.: Energy-based geometric multi-model fitting. International Journal of Computer Vision 97(2), 123–147 (2012)
21. Knopp, J., Prasad, M., Willems, G., Timofte, R., Van Gool, L.: Hough transform and 3D SURF for robust three dimensional classification. In: Proceedings of the European Conference on Computer Vision, pp. 589–602 (2010)
22. Lamdan, Y., Wolfson, H.: Geometric hashing: A general and efficient model-based recognition scheme. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 238–249 (1988)
23. Leibe, B., Leonardis, A., Schiele, B.: Combined object categorization and segmentation with an implicit shape model. In: ECCV Workshop on Statistical Learning in Computer Vision (2004)
24. Leibe, B., Leonardis, A., Schiele, B.: Robust object detection with interleaved categorization and segmentation. International Journal of Computer Vision 77(1-3), 259–289 (2008)
25. Li, H., Lavin, M.A., Le Master, R.J.: Fast Hough transform: A hierarchical approach. Computer Vision, Graphics, and Image Processing 36(2-3), 139–161 (1986)
26. MacKay, D.J.C.: Information Theory, Inference and Learning Algorithms. Cambridge University Press (2003)
27. Maji, S., Malik, J.: Object detection using a max-margin Hough transform. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2009)
28. Mian, A., Bennamoun, M., Owens, R.: Three-dimensional model-based object recognition and segmentation in cluttered scenes. Transactions on Pattern Analysis and Machine Intelligence 28(10), 1584–1601 (2006)
29. Minka, T.P.: The 'summation hack' as an outlier model. Technical note (2003)
30. Okada, R.: Discriminative generalized Hough transform for object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2000–2005 (2009)
31.
Pham, M.T., Woodford, O.J., Perbet, F., Maki, A., Stenger, B., Cipolla, R.: A new distance for scale-invariant 3D shape recognition and registration. In: Proceedings of the IEEE International Conference on Computer Vision (2011)
32. Rosten, E., Loveland, R.: Camera distortion self-calibration using the plumb-line constraint and minimal Hough entropy. Machine Vision and Applications (2009)
33. Shannon, C.E.: A mathematical theory of communication. The Bell System Technical Journal 27, 379–423, 623–656 (1948)
34. Sheikh, Y.A., Khan, E.A., Kanade, T.: Mode-seeking by medoidshifts. In: Proceedings of the IEEE International Conference on Computer Vision (2007)
35. Stephens, R.S.: A probabilistic approach to the Hough transform. Image and Vision Computing 9(1), 66–71 (1991)
36. Toldo, R., Fusiello, A.: Robust multiple structures estimation with J-linkage. In: Proceedings of the European Conference on Computer Vision (2008)
37. Tombari, F., Di Stefano, L.: Object recognition in 3D scenes with occlusions and clutter by Hough voting. In: Proceedings of PSIVT, pp. 349–355 (2010)
38. Vedaldi, A., Soatto, S.: Quick shift and kernel methods for mode seeking. In: Proceedings of the European Conference on Computer Vision, pp. 705–718 (2008)
39. Vincent, E., Laganiere, R.: Detecting planar homographies in an image pair. In: Proceedings of the International Symposium on Image and Signal Processing and Analysis, pp. 182–187 (2001)
40. Vogiatzis, G., Hernández, C.: Video-based, real-time multi-view stereo. Image and Vision Computing 29(7), 434–441 (2011)
41. Woodford, O.J., Pham, M.T., Maki, A., Gherardi, R., Perbet, F., Stenger, B.: Contraction moves for geometric model fitting. In: Proceedings of the European Conference on Computer Vision (2012)
42. Xu, L., Oja, E., Kultanen, P.: A new curve detection method: Randomized Hough transform (RHT). Pattern Recognition Letters 11(5), 331–338 (1990)
43. Zhang, W., Košecká, J.: Nonparametric estimation of multiple structures with outliers.
In: R. Vidal, A. Heyden, Y. Ma (eds.) Dynamical Vision, Lecture Notes in Computer Science, vol. 4358, pp. 60–74. Springer Berlin / Heidelberg (2007)
44. Zhang, Y., Chen, T.: Implicit shape kernel for discriminative learning of the Hough transform detector. In: Proceedings of the British Machine Vision Conference (2010)
45. Zuliani, M., Kenney, C.S., Manjunath, B.S.: The multiRANSAC algorithm and its application to detect planar homographies. In: Proceedings of the IEEE International Conference on Image Processing (2005)