2.1. The Basics
First introduced by Drost et al. [21], the Point Pair Features voting approach is a feature-based solution that combines a global modeling stage and a local matching stage within a local pipeline using sparse features. The method details have been explained in several publications (e.g., [21,26]); however, for completeness and a better understanding of the paper, we consider it important to offer a general overview of the approach, with special emphasis on some specific points.
Using point cloud representations of oriented points (i.e., points with normals), the method relies on four-dimensional features extracted from pairs of points (from now on “point pairs” or simply “pairs”) to globally describe the whole object from each surface point, in a way that the object can later be locally matched with the scene. This four-dimensional feature, called Point Pair Feature or PPF, defines an asymmetric description between two oriented points by encoding their relative distance and normal information, as shown in Figure 1. In detail, having a set of points in the 3D space $M \subset \mathbb{R}^3$ representing the model object, for a given 3D point $m_r \in M$, called reference, and a given 3D point $m_i \in M$, named second, such that $m_r \neq m_i$, with their respective unit normal vectors $n_r$ and $n_i$, a model four-dimensional feature $F(m_r, m_i)$ is defined by Equation (1),

$$F(m_r, m_i) = \big( \|d\|_2,\ \angle(n_r, d),\ \angle(n_i, d),\ \angle(n_r, n_i) \big), \tag{1}$$

where $d = m_i - m_r$ and $\angle(a, b) \in [0, \pi]$ is the angle between the vectors $a$ and $b$. In the same way, having a set of points $S \subset \mathbb{R}^3$ representing the scene data, the function $F$ can be applied to compute a scene PPF using a pair of scene points $(s_r, s_i)$ such that $s_r \neq s_i$, with their respective unit normal vectors $n_r$ and $n_i$. Notice that, if the object model has $n$ points, the total number of features is defined by $n(n-1)$. In order to reduce the effect of this quadratic relation on the method performance, the input data of both model and scene are downsampled with respect to the model size, effectively decreasing the complexity of the system.
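For concreteness, the PPF of Equation (1) can be sketched in a few lines of NumPy; the names (`ppf`, `angle`) are ours, and the snippet is an illustrative sketch rather than the original implementation:

```python
import numpy as np

def angle(a, b):
    """Angle in [0, pi] between two vectors."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def ppf(p_r, n_r, p_i, n_i):
    """Four-dimensional Point Pair Feature of Equation (1):
    (|d|, angle(n_r, d), angle(n_i, d), angle(n_r, n_i))."""
    d = p_i - p_r
    return np.array([np.linalg.norm(d),
                     angle(n_r, d),
                     angle(n_i, d),
                     angle(n_r, n_i)])
```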
The method can be divided into two main stages: modeling and matching. During modeling, the global model descriptor is created by computing and saving all the possible model pairs with their related PPF. During the matching stage, the model pose in the scene is estimated by matching the scene pairs with the stored model pairs using the PPF. This matching process consists of two distinctive parts: (1) finding the correspondences between pairs using the four-dimensional features and (2) grouping the correspondences to generate pose hypotheses.
The correspondence problem between similar point pairs is efficiently solved by grouping the pairs with the same quantized PPF on a hash table or, alternatively, a four-dimensional lookup table. Quantizing the feature space defines a mapping from each four-dimensional space element to the set of all point pairs that generate this specific feature. In particular, for the object model, this mapping from quantized features to sets of model pairs defines the object model description, expressed by the function $L: \mathbb{Z}^4 \to \mathcal{P}(M \times M)$, where $M \times M$ is the set of model pairs and $\mathcal{P}(M \times M)$ represents the power set of $M \times M$. In other words, point pairs that generate the same quantized PPF are grouped together in the same table position, pointed to by their common quantized index, effectively grouping pairs with similar features. This process of model construction is done during the modeling stage, as shown in Figure 2a for three sample point pairs. Using this model description, given one scene pair, similar model pairs can be retrieved by accessing the table position pointed to by the quantized PPF index. The quantization index is obtained by a quantization function $Q$ using the step size $\delta_{dist}$ for the first dimension and $\delta_{angle}$ for the remaining three dimensions. The quantization step size will bound the similarity level, i.e., the correspondence distance, between matching features and, hence, point pairs. Defining a function $N(\cdot)$ that computes the normal of a point, the correspondence matching subset of model pairs $C \subseteq M \times M$ for a given scene pair $(s_r, s_i)$ and its related quantized feature $Q(F(s_r, s_i))$ is defined by Equation (2):

$$C(s_r, s_i) = \big\{ (m_r, m_i) \in M \times M \ \big|\ Q(F(m_r, m_i)) = Q(F(s_r, s_i)) \big\} = L\big(Q(F(s_r, s_i))\big). \tag{2}$$
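A possible sketch of the model description $L$ and of the constant-time retrieval of Equation (2) follows, reusing `ppf` from above; `quantize` stands for the quantization function $Q$ (a concrete version is sketched in Section 2.2.2), and all names are ours:

```python
from collections import defaultdict

def build_model_table(points, normals, quantize):
    """Model description L: quantized PPF index -> list of model pairs."""
    table = defaultdict(list)
    n = len(points)
    for i in range(n):
        for j in range(n):
            if i != j:
                f = ppf(points[i], normals[i], points[j], normals[j])
                table[quantize(f)].append((i, j))  # group similar pairs
    return table

def correspondences(table, s_r, n_r, s_i, n_i, quantize):
    """Equation (2): model pairs whose quantized PPF equals the scene one."""
    return table.get(quantize(ppf(s_r, n_r, s_i, n_i)), [])
```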
From each scene–model point pair correspondence, a 6D pose transformation, or hypothesis, can be generated. Specifically, for a corresponding point pair $((s_r, s_i), (m_r, m_i))$, the matched reference points $s_r$ and $m_r$ and their normals constrain five degrees of freedom, aligning both oriented points, and the second points $s_i$ and $m_i$, as long as they are non-collinear, constrain the remaining degree of freedom, which is a rotation around the aligned normals. However, the discriminative capability of a single four-dimensional feature from two sparse oriented points is clearly not enough to uniquely encode any surface characteristic, producing wrong correspondences. Therefore, the method requires a group of consistent correspondences to support the same hypothesis; in fact, the more correspondences support a single pose, the more likely this pose will be. In this regard, grouping consistent point pair correspondences or, alternatively, the 6D poses obtained from corresponding pairs, has a high dimensional complexity. In order to effectively solve this problem, a local coordinate, which we will refer to as LC, is used to efficiently group the poses within a two-dimensional space. As with two corresponding pairs, for a given scene point $s_r$ that belongs to the object model, a 6D pose can be defined by only using one corresponding model point $m_r$ and a rotation angle $\alpha$ around their two aligned normals, i.e., $N(s_r)$ and $N(m_r)$. In this way, for the scene point $s_r$, a 6D pose transformation candidate $P$ can be defined by the LC represented by the parameters $(m_r, \alpha)$, as shown in Figure 3. To solve this transformation, both points and normals are aligned respectively with the origin and the x-axis of a common world coordinate system $g$. Taking the scene point, this alignment can be expressed by the transformation $T_{s \to g} = (R_{s \to g}, t_{s \to g})$. The rotation that aligns the normal vector $N(s_r)$ to the x-axis $e_x$ is defined by the axis–angle representation $(u, \theta)$, where $u = \frac{N(s_r) \times e_x}{\|N(s_r) \times e_x\|}$ and $\theta = \angle(N(s_r), e_x)$. Therefore, the rotation matrix $R_{s \to g}$ can be efficiently found using Rodrigues' rotation formula [30]. In turn, the translation is defined by $t_{s \to g} = -R_{s \to g}\, s_r$. Exactly in the same way, the transformation $T_{m \to g}$ is found for the model point $m_r$ and its normal $N(m_r)$. Using these two transformations and the rotation angle, the 6D pose for a given object instance is defined by Equation (3):

$$P = T_{s \to g}^{-1}\, R_x(\alpha)\, T_{m \to g}, \tag{3}$$

where $R_x(\alpha)$ represents a rotation of angle $\alpha$ around the x-axis. Using the LC, the correspondence grouping problem can be individually tackled for any scene pair created from $s_r$ by grouping the corresponding model pairs in a two-dimensional space using the parameters $(m_r, \alpha)$.
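The alignment transform and the pose composition of Equation (3) can be sketched as follows, assuming unit normals; `align_to_x` builds $T_{s \to g}$ (or $T_{m \to g}$) via Rodrigues' formula and the function names are ours:

```python
import numpy as np

def align_to_x(point, normal):
    """T that moves `point` to the origin and rotates its unit `normal`
    onto the positive x-axis, using Rodrigues' rotation formula [30]."""
    e_x = np.array([1.0, 0.0, 0.0])
    axis = np.cross(normal, e_x)
    s, c = np.linalg.norm(axis), float(np.dot(normal, e_x))
    if s < 1e-9:                               # already (anti)parallel to x
        R = np.eye(3) if c > 0 else np.diag([-1.0, 1.0, -1.0])
    else:
        k = axis / s
        K = np.array([[0, -k[2], k[1]],
                      [k[2], 0, -k[0]],
                      [-k[1], k[0], 0]])
        R = np.eye(3) + s * K + (1 - c) * (K @ K)  # Rodrigues' formula
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = -R @ point                      # t = -R * point
    return T

def pose_from_lc(T_s2g, T_m2g, alpha):
    """Equation (3): P = T_s2g^{-1} @ R_x(alpha) @ T_m2g."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    Rx = np.eye(4)
    Rx[1:3, 1:3] = [[ca, -sa], [sa, ca]]       # rotation about the x-axis
    return np.linalg.inv(T_s2g) @ Rx @ T_m2g
```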
During grouping and hypothesis generation, for every reference scene point $s_r$, the method intends to find the LC, i.e., $(m_r, \alpha)$, which defines the best fitting model pose on the scene data or, in other words, the one that maximizes the number of pair correspondences that support it. This correspondence grouping problem is solved by defining a two-dimensional voting table, or accumulator, in a Generalized Hough Transform manner, representing the parameter space of the LC, where one dimension represents the corresponding model point $m_r$ and the other the quantized rotation angle $\alpha$. In particular, for each possible scene pair generated from $s_r$, i.e., $(s_r, s_i)$, an LC will be defined by a corresponding pair reference point, i.e., $m_r$, and the rotation angle $\alpha$ defined by the two second points $s_i$ and $m_i$. The corresponding model pairs are retrieved from the lookup table using the quantized PPF and, for each obtained LC, a vote is cast on the table, as represented in Figure 2b for a single pair. After all pairs are checked, the peak of the table represents the most supported LC, and hence the most likely pose, for this specific $s_r$ point. This process is applied to all or, alternatively, a fraction of the scene points, obtaining a set of plausible hypotheses.
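A sketch of the per-reference-point voting loop, reusing the helpers above; the accumulator layout (model point × quantized angle) follows Figure 2b, while the number of angle bins (32 here) is our assumption. Note that computing $T_{m \to g}$ for every correspondence is costly, which motivates the $\alpha$ splitting described next:

```python
import numpy as np

def rotation_alpha(T_m2g, m_i, T_s2g, s_i):
    """Rotation around x aligning the transformed model second point
    with the transformed scene second point in the yz-plane."""
    ym, zm = (T_m2g @ np.append(m_i, 1.0))[1:3]
    ys, zs = (T_s2g @ np.append(s_i, 1.0))[1:3]
    return np.arctan2(zs, ys) - np.arctan2(zm, ym)

def vote_for_reference(r, S, NS, M, NM, table, quantize, n_alpha=32):
    """Fill the 2D accumulator for one scene reference point S[r]."""
    acc = np.zeros((len(M), n_alpha), dtype=np.int32)
    T_s2g = align_to_x(S[r], NS[r])
    for j in range(len(S)):
        if j == r:
            continue
        f = quantize(ppf(S[r], NS[r], S[j], NS[j]))
        for (a, b) in table.get(f, ()):          # corresponding model pairs
            T_m2g = align_to_x(M[a], NM[a])      # avoided by the alpha split
            alpha = rotation_alpha(T_m2g, M[b], T_s2g, S[j])
            k = int((alpha % (2 * np.pi)) / (2 * np.pi) * n_alpha) % n_alpha
            acc[a, k] += 1                       # one vote for the LC (a, k)
    return acc
```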
To increase the efficiency of the voting part, which requires computing the $\alpha$ angle for each pair correspondence, it is possible to split the rotation angle $\alpha$ into two parts: one part related to the model point, $\alpha_m$, and one part related to the scene point, $\alpha_s$. In detail, taking into account that in the intermediate world coordinate system the $\alpha$ angle is defined around the x-axis, the rotation on the two-dimensional yz-plane can be divided with respect to the positive y-axis. In this case, $\alpha_m$ and $\alpha_s$ will be defined as the rotation angles between the positive y-axis vector $e_y$ and the yz-plane projections of the vectors obtained from the world-transformed second points of the model pair ($T_{m \to g}\, m_i$) and the scene pair ($T_{s \to g}\, s_i$). As shown in Figure 4, these angles can be defined as $\alpha_m = \operatorname{atan2}(-z_m, y_m)$ and $\alpha_s = \operatorname{atan2}(z_s, y_s)$, where $(x_m, y_m, z_m)^T = T_{m \to g}\, m_i$, $(x_s, y_s, z_s)^T = T_{s \to g}\, s_i$, and $\operatorname{atan2}$ represents the multi-valued inverse tangent. With this solution, the model angle can be precomputed during the modeling stage and saved alongside the reference point in the lookup table $L$. Later, during the matching stage, for each scene pair, the $\alpha$ angle is computed by adding the two angles. Considering that $\alpha$ is defined from the model to the scene, the total angle can be computed as $\alpha = \alpha_m + \alpha_s$.
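Under the sign convention reconstructed above ($\alpha = \alpha_m + \alpha_s$), the split can be sketched as follows; the function names are ours:

```python
import numpy as np

def model_alpha(T_m2g, m_i):
    """Offline part: stored in the lookup table next to each model pair."""
    ym, zm = (T_m2g @ np.append(m_i, 1.0))[1:3]
    return np.arctan2(-zm, ym)   # rotates the projection onto the +y axis

def scene_alpha(T_s2g, s_i):
    """Online part: computed once per scene pair during matching."""
    ys, zs = (T_s2g @ np.append(s_i, 1.0))[1:3]
    return np.arctan2(zs, ys)    # rotation from the +y axis to the point

# Total LC rotation, defined from the model to the scene:
# alpha = model_alpha(T_m2g, m_i) + scene_alpha(T_s2g, s_i)
```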
Finally, in order to join similar candidate poses generated from different scene reference points, the method is completed with a clustering approach that groups similar poses whose rotation and translation do not vary by more than a threshold.
2.2. Our Proposal
In this section, we define a new method based on the well-known Point Pair Features voting approach [21] for robust 6D pose estimation of free-form objects under clutter and occlusions on range data. In detail, the original ideas presented in [21] are improved and a complete method within a local feature-based pipeline is defined. The proposed method pipeline, shown in Figure 5, can be divided into an Offline modeling stage and an Online matching stage with six basic steps: Preprocessing, Feature Extraction, Matching, Hypothesis Generation, Clustering and Postprocessing. Due to the method's particular correspondence grouping step, a straightforward implementation would require creating a voting table for each of the scene points during the hypothesis generation step, with large memory requirements. From a practical point of view, a more efficient solution is to iteratively generate a hypothesis for each scene point using a single voting table. In this regard, the green fine-dotted box in Figure 5 represents the iterative implementation of the steps Feature Extraction, Matching and Hypothesis Generation for each scene point. The method is designed to work with mesh data for modeling and organized point cloud data for matching, as standardized data types.
2.2.1. Preprocessing
The Point Pair Feature voting method strongly relies on the discriminative effect of the PPF and their sparse nature to allow an efficient, structure-aware local matching. The performance of the original four-dimensional PPF and its variants has been deeply studied by Kiforenko et al. [27]. Their work concludes that a set of PPF globally defining a model point has a stronger discriminative capability than most local features. On the other hand, they also showed that, despite their robustness, the PPF are significantly affected by noise. In fact, each individual feature relies on the quality and relevance of the normal and distance information extracted from the sparse surface characteristics provided by the pairs of the sampled data. In this sense, low-quality or non-discriminative features will reduce the speed and decrease the recognition performance of the method. Therefore, the overall global description performance, in terms of time and recognition, depends on the number of features and on the relevance and quality of each individual feature. This relation makes the performance of the approach rely significantly on the preprocessing steps. In turn, the sampling and normal estimation in preprocessing are mainly affected by the sensor noise and the relative size of the underlying surface characteristics. Taking these considerations into account, we propose a combination of two normal estimation approaches and a novel downsampling methodology that mitigates sensor noise, accounts for surface variability and maximizes the discriminative effect of the features.
Normal Estimation
For the normal estimation problem, we propose using two different variants regarding the input data representation of each stage. For the Offline stage, using reconstructed or CAD mesh data, the normals are estimated by averaging the plane normals of each vertex's surrounding triangles. In this case, noise and resolution limitations regarding surface reconstruction techniques are considered out of the scope of this manuscript and are thus not considered. For the Online stage, using the organized point cloud data, we use the method proposed in [13], based on the first-order Taylor expansion, including a bilateral-filter-inspired solution for cases where the surface depth difference is above a given threshold. These two approaches provide a normal estimation relative to the data source resolution and, additionally, the online method provides an efficient and robust estimation against sensor noise [13]. Notice that noisy and spiky surface data will affect the quality of the normal estimation step and, in turn, the downsampling step, decreasing the method's efficiency and performance. In this regard, a normal estimation robust to noise is a basic part of the method, with a high impact on the matching results [27].
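A compact sketch of the Offline-stage normal estimation follows; we accumulate (area-weighted) triangle plane normals at each vertex, which is one common reading of "averaging the surrounding triangles", though the original implementation may weight the faces differently:

```python
import numpy as np

def vertex_normals(V, F):
    """Average of the incident triangle plane normals at each vertex.
    V: (n, 3) float vertex array; F: (m, 3) integer triangle indices."""
    face_n = np.cross(V[F[:, 1]] - V[F[:, 0]], V[F[:, 2]] - V[F[:, 0]])
    N = np.zeros_like(V)
    for k in range(3):                 # accumulate on each triangle corner
        np.add.at(N, F[:, k], face_n)
    return N / (np.linalg.norm(N, axis=1, keepdims=True) + 1e-12)
```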
Downsampling
Traditional downsampling methods, also called subsampling or decimation, based on voxel-grid or Poisson-disk sampling, have a fixed-size structure that does not consider local information and tend to either average or ignore parts of the data, removing and distorting important characteristics of the underlying surface. If these characteristics are to be preserved, these methods require increasing the sampling rate, i.e., decreasing the voxel size, which in turn dramatically decreases the algorithm performance by adding superfluous data. To overcome these problems, we propose a novel approach that accounts for the variability of the surface data without increasing the number of non-discriminative pairs.
The proposed method is based on a novel voxel-grid downsampling approach that uses surface information, plus an additional averaging step for non-discriminative pairs. The method starts by computing a voxel-grid structure for the point cloud data. For each voxel cell, a greedy clustering approach is used to group those points with similar normal information, i.e., those whose angle between normals is smaller than a threshold. Then, for each clustered group, we average the oriented points, effectively merging the similar points while keeping discriminative data. Figure 6 shows a simplified comparison between the common voxel-grid average method and the proposed normal clustering approach. Notice that, especially due to the PPF quantization space, for close points the distance is not relevant and the normals encode the most discriminative information about the underlying surface characteristics. As in the original method, the voxel size is set to $0.05\, d(M)$, i.e., a value relative to the model size. However, in our method, the effect of this parameter on the algorithm performance is significantly reduced, moving towards a more robust, parameter-independent method.
Despite its local efficiency, this downsampling method does not account for cases where the non-relevant surface characteristics are bigger than the voxel size. To mitigate these cases, when neighboring downsampled voxels contain similar data, we propose an additional step that averages those points that do not provide additional surface information. This process is done by defining a new voxel-grid structure with a much bigger voxel size (e.g., two or three times bigger) and averaging all points that do not have relevant normal data compared with the points of all their neighboring voxels. This step will reduce the number of points on planar surfaces, decreasing the total number of votes supporting the hypotheses. However, as the process is applied equally to the scene and the object, this will mainly decrease the votes of the non-discriminative parts, effectively increasing the value of the rest of the surface data.
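The normal-clustered voxel pass could look as follows; the 30° clustering threshold is our assumption, not a value from the paper, and the second large-voxel averaging pass is omitted for brevity:

```python
import numpy as np

def normal_clustered_downsample(P, N, voxel, angle_thr=np.deg2rad(30)):
    """Voxel-grid downsampling keeping one averaged oriented point per
    normal cluster instead of one per voxel (Figure 6)."""
    cells = {}
    for p, n in zip(P, N):
        cells.setdefault(tuple((p // voxel).astype(int)), []).append((p, n))
    out_p, out_n = [], []
    for pts in cells.values():
        clusters = []                          # greedy normal clustering
        for p, n in pts:
            for c in clusters:
                ref_n = c[0][1]                # cluster representative normal
                if np.arccos(np.clip(np.dot(n, ref_n), -1, 1)) < angle_thr:
                    c.append((p, n))
                    break
            else:
                clusters.append([(p, n)])
        for c in clusters:                     # merge each cluster
            out_p.append(np.mean([p for p, _ in c], axis=0))
            m = np.mean([n for _, n in c], axis=0)
            out_n.append(m / (np.linalg.norm(m) + 1e-12))
    return np.array(out_p), np.array(out_n)
```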
2.2.2. Feature Extraction
As mentioned before, Kiforenko et al. [
27] published an exhaustive study and comparison of different types of PPF. Their results show that, despite the multimodal variants, the original four-dimensional feature [
21] provides the best performance for range data. In light of this result, we propose to keep using the original PPF introduced in
Section 2.1, represented in
Figure 1 and Equation (
1).
During the Offline stage, the model bounding box is obtained and the model diameter $d(M)$ is estimated as the diagonal length of the box. For a given PPF, a four-dimensional index is obtained using the quantization function defined in Equation (4):

$$Q(F) = \left( \left\lfloor \frac{F_1}{\delta_{dist}} \right\rfloor,\ \left\lfloor \frac{F_2}{\delta_{angle}} \right\rfloor,\ \left\lfloor \frac{F_3}{\delta_{angle}} \right\rfloor,\ \left\lfloor \frac{F_4}{\delta_{angle}} \right\rfloor \right), \tag{4}$$

where the quantization step $\delta_{dist}$ is set to $0.05\, d(M)$ and $\delta_{angle}$ is fixed to $12^\circ$. These values have been set as a trade-off between recognition rates and speed. In this way, the lookup table is defined with a size of $20 \times 15 \times 15 \times 15$ cells. After preprocessing, for each model pair, the quantized PPF index is obtained and the reference point and the computed $\alpha_m$ angle are saved into the pointed table cell. In this case, all points of the model are used.
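A sketch of the quantization function of Equation (4) with the values above; `make_quantizer` is our naming, and its output can be passed directly to the `build_model_table` sketch of Section 2.1:

```python
import numpy as np

def make_quantizer(model_diameter, delta_dist=None, delta_angle=np.deg2rad(12)):
    """Equation (4): component-wise quantization of a PPF into a 4D index."""
    if delta_dist is None:
        delta_dist = 0.05 * model_diameter       # default from the text
    def quantize(f):
        return (int(f[0] // delta_dist),
                int(f[1] // delta_angle),
                int(f[2] // delta_angle),
                int(f[3] // delta_angle))
    return quantize
```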
During the Online stage, for each reference point, all possible point pairs will be computed and, using the four-dimensional lookup table, matched with the object model. Following the solution proposed in [21], only one out of every five points (in input order) will be used as a reference point, while all points will be used as second points. To improve the efficiency of the matching part, and in order to avoid considering pairs further away than the model diameter $d(M)$, for each scene reference point we propose using an efficient Kd-tree structure to obtain only the second points within the model diameter.
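With SciPy, this diameter-limited pairing can be sketched in a few lines (in practice the tree would be built once per scene; the function name is ours):

```python
import numpy as np
from scipy.spatial import cKDTree

def second_points_within_diameter(scene_points, ref_idx, model_diameter):
    """Indices of scene points within d(M) of the reference point."""
    tree = cKDTree(scene_points)      # in practice, built once per scene
    return tree.query_ball_point(scene_points[ref_idx], r=model_diameter)
```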
2.2.3. Matching
As explained in
Section 2.1, the Point Pair Feature voting approach solves the matching problem by quantizing the feature space, grouping all similar pairs under the same four-dimensional index. As a result, any point pair is matched in constant time with all the other pairs that generate the same quantized features. Despite its efficiency, this approach has two main drawbacks.
The first drawback regards the noise effect on the quantized nature of the point pair matching, as the quantization function $Q$ can output different indices for very similar real values. In these cases, similar pairs generate different quantized indices, which point to different cells of the lookup table, missing correct correspondences during the online stage. Figure 7a shows a one-dimensional representation of the problem. A straightforward solution was proposed in [26]. Their approach spreads the PPF quantized index to all its neighbors, effectively retrieving from the lookup table all the corresponding pairs pointed to by the index alongside the pairs stored in its 80 neighboring cells, i.e., $3^4 = 81$ cells in total for a four-dimensional table. The main drawback of this method is the increased number of accesses to the lookup table, performed for each matching PPF, which significantly decreases the time performance of the method. In addition, another problem can arise regarding the correspondence distance between features. If the quantization size $\delta$ is kept (see Figure 7b), the correspondence distance increases, dramatically augmenting the number of corresponding pairs and introducing matching pairs with a lower similarity level into the voting scheme. An alternative approach is to decrease the quantization size $\delta$ (see Figure 7c), accounting for the neighboring cells, using a bigger data structure.
We propose a more efficient solution that checks a maximum of 16 neighbors while keeping the size of the quantization step, as shown in Figure 7d. Considering that the differences between similar pairs are mainly generated by sensor noise, it is reasonable to assume that this noise follows a normal distribution characterized by a relatively small standard deviation $\sigma$, i.e., smaller than half of the quantization step $\delta$. Based on this assumption, we propose to check the quantization error $e = F_i/\delta - \lfloor F_i/\delta \rfloor$ to determine which neighbors are more likely to be affected by the noise. This process is defined for each dimension by the piecewise function represented in Equation (5):

$$V(e) = \begin{cases} -1 & \text{if } e < \sigma/\delta, \\ 0 & \text{if } \sigma/\delta \le e \le 1 - \sigma/\delta, \\ 1 & \text{if } e > 1 - \sigma/\delta, \end{cases} \tag{5}$$

where the result is interpreted as follows: $-1$ indicates that the left neighbor could be affected, $1$ indicates that the right neighbor could be affected and $0$ indicates that no neighbor is likely to be affected.
During matching, for each dimension, those pairs from the neighbors that are likely to be affected by noise are retrieved. In practice, for generalization, we set the standard deviation value to a third of the quantization step, i.e., $\sigma = \delta/3$; however, other values could be used regarding any specific noise model. This method has a best-case scenario of accessing a single table cell and a worst case of accessing 16 cells, i.e., $2^4$. As we keep the same quantization step as the original method, correspondences with a relatively lower similarity level may be retrieved during matching, yet in smaller numbers and with a negligible impact on performance.
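A sketch of the proposed neighbor checking with the per-dimension flags of Equation (5); `sigma_ratio` encodes $\sigma/\delta$ (the $1/3$ above is our reconstruction), and at most $2^4 = 16$ cells are visited:

```python
import numpy as np
from itertools import product

def neighbor_offsets(f, steps, sigma_ratio=1.0 / 3.0):
    """Equation (5): per-dimension neighbor flags from the quantization
    error; the Cartesian product yields at most 2^4 = 16 offsets."""
    flags = []
    for value, step in zip(f, steps):
        e = value / step - np.floor(value / step)  # normalized error in [0, 1)
        flags.append((0, -1) if e < sigma_ratio
                     else (0, 1) if e > 1.0 - sigma_ratio
                     else (0,))
    return list(product(*flags))

def retrieve_with_neighbors(table, f, steps, quantize):
    """Union of model pairs from the index cell and its likely neighbors."""
    base = np.array(quantize(f))
    pairs = []
    for off in neighbor_offsets(f, steps):
        pairs.extend(table.get(tuple(base + np.array(off)), ()))
    return pairs
```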
The second drawback is related to multiple voting and the over-representation of similar scene features. This problem appears when, during the matching of a scene reference point, several different pairs obtain the same combination of model correspondence and quantized $\alpha$ rotation angle. In detail, this happens when similar scene pairs obtain the same model correspondence and have a similar scene angle value $\alpha_s$, generating the same quantized $\alpha$ index. Moreover, this situation is worsened by the neighbor checking method. This problem, especially found on planar surfaces, generates multiple superfluous votes for the same LC on the voting table, which may produce a deviation in the results. Following the solution of [26], we avoid matching two model pairs with the same combination of quantized PPF index and scene angle $\alpha_s$. This process is efficiently done by creating an additional 32-bit variable for every PPF quantization index, where each bit represents a quantized value of the scene angle. In this way, when matching a point pair using a PPF, the bit value related to the scene angle is checked. Only if the bit is 0 is the matching allowed, and the bit is then set to 1, avoiding any new matching with the same exact combination. Notice that the first drawback could be solved more efficiently during training by duplicating the pairs in the neighboring cells. However, in this case, the second drawback would be more difficult to avoid, as keeping track of the same pairs in different cells would require a more complex checking strategy.
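The 32-bit flag check can be sketched as follows, with `flags` being a per-reference-point dictionary from quantized PPF index to bit mask (our naming):

```python
import numpy as np

def try_match(flags, ppf_index, alpha_s, n_bits=32):
    """Allow one match per (quantized PPF, quantized scene angle) pair."""
    bit = int(((alpha_s % (2 * np.pi)) / (2 * np.pi)) * n_bits) % n_bits
    mask = flags.get(ppf_index, 0)
    if mask & (1 << bit):            # combination already used: skip vote
        return False
    flags[ppf_index] = mask | (1 << bit)
    return True
```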
2.2.4. Hypothesis Generation
As explained before, for each scene reference point, all the possible pairs are matched with the model. Then, during hypothesis generation, all consistent correspondences are grouped together, generating a candidate pose. In detail, for each obtained scene–model pair correspondence, an LC combination is voted for in the two-dimensional voting table. In this way, each position of the table represents an LC, which defines a model pose candidate in the scene, and its value represents the number of supports, which indicates how likely the pose is. The LC angle $\alpha$ is quantized into 32 bins, matching the bit flags used during matching, thereby defining a voting table with a total size of $|M| \times 32$, where $|M|$ is the number of model points. After all votes have been cast, the highest value of the table indicates the most likely LC, defining a candidate pose for this scene reference point. At this step, an important problem arises from the assumption that a local coordinate always exists and, therefore, that each piece of scene data has a corresponding model point. In reality, most scenes will have a majority of points that do not belong to the object. In order to avoid generating false positive poses, which can induce a bias in the following clustering step, we propose defining a threshold to only consider LC with a minimum number of supports, e.g., three or five votes. Therefore, if the peak of the table is below this number, the pose is discarded; otherwise, a candidate pose with an associated score is generated.
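Extracting the candidate pose for one reference point then reduces to a thresholded arg-max over the accumulator (a sketch; the minimum support value corresponds to the three-to-five-vote threshold discussed above):

```python
import numpy as np

def best_lc(acc, min_votes=3):
    """Peak of the accumulator, discarded if weakly supported."""
    m_r, a_bin = np.unravel_index(np.argmax(acc), acc.shape)
    votes = int(acc[m_r, a_bin])
    return (m_r, a_bin, votes) if votes >= min_votes else None
```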
2.2.5. Clustering
The matching of different scene reference points yields multiple candidate poses, which may define the same model hypothesis pose. In order to join similar poses together, we propose using a hierarchical complete-linkage clustering method. This clustering approach enforces that all combinations of elements of each cluster satisfy the same conditions, based on two main thresholds: distance and rotation. In practice, we sort the candidate poses by their vote support and create a cluster for each individual pose. Then, all clusters are checked in order, and two clusters are joined together when the conditions hold for all combinations of their elements. In this way, the most likely clusters are merged first, reducing the effect of mutually exclusive combinations. In detail, for two defined thresholds $t_d$ and $t_r$, two clusters $C_1$ and $C_2$ will be joined if they satisfy the condition:

$$d(P_i, P_j) < t_d \ \wedge\ r(P_i, P_j) < t_r, \quad \forall P_i \in C_1,\ \forall P_j \in C_2, \tag{6}$$

where the binary function $d(\cdot, \cdot)$ represents the Euclidean distance between the pose translations and the binary function $r(\cdot, \cdot)$ represents the rotation difference between two poses, defined by the double arccosine of the inner product of unit quaternions [31]. Finally, for each cluster, all elements are merged and the individual scores are summed up to define a new candidate pose.
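A naive sketch of the complete-linkage grouping; poses are (translation, unit quaternion, score) triples assumed pre-sorted by descending score, and the quadratic loop is kept simple for clarity:

```python
import numpy as np

def rot_diff(q1, q2):
    """Rotation difference between two unit quaternions [31]."""
    return 2.0 * np.arccos(np.clip(abs(np.dot(q1, q2)), 0.0, 1.0))

def complete_linkage(poses, t_d, t_r):
    """Greedy complete-linkage clustering of (t, q, score) poses."""
    clusters = [[p] for p in poses]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ok = all(np.linalg.norm(a[0] - b[0]) < t_d and
                         rot_diff(a[1], b[1]) < t_r
                         for a in clusters[i] for b in clusters[j])
                if ok:                        # every cross pair satisfies
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters
```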
2.2.6. Postprocessing
At this point, the method provides a list of candidate poses sorted by score. The score of each pose is just an approximation obtained from the sum of the number of matching pairs of each clustered pose. Due to the nature of the hypothesis generation and clustering steps, which join poses obtained from each table peak, the clustered pose score may not properly represent how well the pose fits the object model to the scene. In this regard, we propose computing a more reliable value through an additional re-scoring process. This new score is computed as the total number of model points that fit the scene, where a fitting point is a model point closer to a scene point than a threshold. In particular, for a given pose $P$, the fitting score is computed as shown in Equation (7):

$$s(P) = \sum_{m \in M} \big[ \min_{s \in S} \| P m - s \|_2 < th \big], \tag{7}$$

where $[\cdot]$ represents the Iverson bracket and $th$ represents the maximum distance threshold. Taking into account the preprocessing of the data, this threshold is set to half of the voxel size. Notice that this re-scoring procedure can be efficiently solved using a Kd-tree structure.
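Equation (7) maps directly to a nearest-neighbor query; a sketch using SciPy's cKDTree, built once over the downsampled scene (the function name is ours):

```python
import numpy as np
from scipy.spatial import cKDTree

def fitting_score(pose, model_points, scene_tree, th):
    """Equation (7): posed model points within `th` of a scene point.
    `scene_tree` is a cKDTree built over the (downsampled) scene."""
    posed = (pose[:3, :3] @ model_points.T).T + pose[:3, 3]
    dist, _ = scene_tree.query(posed, k=1)      # nearest scene point
    return int(np.sum(dist < th))               # Iverson-bracket sum
```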
Even though this process provides a better fitting value approximation, there are two important issues that can still reduce the accuracy of the score: first, the deviation produced by model points that are self-occluded in the scene from the camera view and, second, the possible alignment error of the object model with respect to the scene. In order to mitigate these problems, we propose using an efficient variant of the ICP algorithm alongside a perspective rendering of the object model for each hypothesis pose. For every clustered pose, the model object is rendered using a virtual camera representing the scene acquisition system. At this point, the rendered data are downsampled in the same way as the scene data. After that, an efficient point-to-plane ICP algorithm, based on linear least-squares optimization [32] and using projective correspondences [33], is applied. Despite the efficiency of this process, the large number of hypotheses obtained from the previous steps could significantly affect the whole method's performance. A compromise solution is to apply the re-scoring and ICP steps only to the subset of clustered poses with the highest scores, which represent the most likely fitting poses.
Based on the ideas proposed in [23,25,26], after the re-scoring process, two verification steps are applied to filter false positive cases. These steps are introduced to discard well-fitting model poses that do not consistently represent the underlying scene data. The first step checks the model–scene data consistency and discards cases which do not properly match the visibility context of the scene data. From the virtual camera point of view, each point of the rendered view of the model can be classified into three types regarding its position with respect to the scene data: inlier, occluded and non-consistent. An inlier, shown in Figure 8a, is a model point that is near a scene point, within a threshold distance, and is considered to match and explain the underlying scene surface. An occluded point, shown in Figure 8b, is a point that is further away from the camera than the scene surface; therefore, it is below the scene surface and cannot be considered right or wrong. A non-consistent point, shown in Figure 8c, is a point that is closer to the camera than the scene surface, which means that it is not explained by the scene data and is considered wrong. Hypotheses with a large percentage of occluded points or even a relatively small percentage of non-consistent points are likely to be false positive cases and are hence discarded. In order to deal with challenging cases and a certain degree of sensor noise, a maximum percentage of 15% of non-consistent points and 90% of occlusion is used.
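Assuming the rendered model view and the scene are available as aligned depth images (our simplification of the rendered-view classification, with invalid pixels as NaN), the three ratios can be sketched as:

```python
import numpy as np

def visibility_ratios(model_depth, scene_depth, tau):
    """Classify rendered model pixels against the scene depth image."""
    valid = np.isfinite(model_depth) & np.isfinite(scene_depth)
    diff = model_depth - scene_depth                    # >0: behind the scene
    n = max(int(valid.sum()), 1)
    inlier = (valid & (np.abs(diff) <= tau)).sum() / n  # explains the surface
    occluded = (valid & (diff > tau)).sum() / n         # below the surface
    nonconsistent = (valid & (diff < -tau)).sum() / n   # floats in front of it
    return inlier, occluded, nonconsistent

# A hypothesis would be discarded when occluded > 0.90
# or nonconsistent > 0.15, following the thresholds above.
```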
The second verification step accounts for well-fitting poses with non-matching surface boundaries. This checking procedure is especially useful to discard cases relying on planar or homogeneous surfaces without relevant surface characteristics, which can easily be incorrectly fitted to other similar scene surfaces if no boundary considerations are applied. For each hypothesis pose, this step extracts the silhouette of the object model from the camera view, as shown in Figure 9a, and compares it with the extracted scene edges, shown in Figure 9b. The scene edges are extracted by identifying depth and normal variations. The comparison is performed by averaging the distance from each silhouette point to the scene edges. Therefore, having a set of pixels defining the scene edges $E$, for each model pose view silhouette, defined by a set of pixels $B$, the average edge score can be computed as:

$$s_e = \frac{1}{|B|} \sum_{b \in B} \min_{e \in E} \| b - e \|_2.$$

Poses where the final score is higher than a threshold are discarded. In practice, a threshold of 5 pixels is used as the average distance error.
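With boolean edge and silhouette masks, the average edge score can be sketched with a single distance transform (the function name is ours):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def average_edge_score(scene_edges, silhouette):
    """Mean pixel distance from each silhouette pixel to the nearest
    scene edge; scene_edges and silhouette are boolean masks."""
    dist_to_edge = distance_transform_edt(~scene_edges)  # 0 on edge pixels
    return float(dist_to_edge[silhouette].mean())

# Poses with average_edge_score(...) > 5 pixels would be discarded.
```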
Notice that both steps may wrongly discard true positive cases under high occlusion. In this sense, both verification steps represent a trade-off between false positive pruning and the occlusion acceptance rate. Hence, requiring high scene consistency, in terms of visibility context and contour matching, will reduce the capability of the system to handle occluded cases, in exchange for higher reliability in normal cases.