The convolutional neural network (CNN) is highly invariant to image translation, scaling, and tilting through multi-layer feature extraction and regional weight sharing [8]. However, due to the disorder of point clouds, a CNN cannot directly perform feature extraction on them. In this section, we first introduce a local feature descriptor for fine-grained feature representation. Second, we introduce the proposed operator for the convolution operation on point clouds. Third, based on this operator, we construct a new convolutional neural network for facial feature extraction. Fourth, a new feature enhancement mechanism is proposed to enhance the discrimination of facial features. Finally, based on the feature enhancement mechanism, we adopt a triplet loss function for training and construct an efficient face recognition network.
3.1. Local Feature Descriptor
In this part, inspired by [7], in order to obtain a fine-grained representation of features, we use a hand-crafted descriptor to describe the local geometric features of the point clouds.
For a pair of points, the geometric relationship between the two points is represented by a four-dimensional descriptor:
where the first component represents the Euclidean distance between the two points and the remaining components are angles formed by the normal vectors of the two points and the vector connecting them. The angle between two vectors is computed from their cross-product and dot-product. As described above, the four-dimensional descriptor describes in detail the geometric relationship between two points through normal vectors and angles.
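For concreteness, one standard form of such a point-pair descriptor (the exact symbols of Formulas (1) and (2) are not reproduced here; $p_1$, $p_2$, $n_1$, $n_2$, and $d$ are illustrative) is:
$$F(p_1, p_2) = \left( \lVert d \rVert_2,\ \angle(n_1, d),\ \angle(n_2, d),\ \angle(n_1, n_2) \right), \qquad d = p_2 - p_1,$$
$$\angle(v_1, v_2) = \arctan\!\left( \frac{\lVert v_1 \times v_2 \rVert}{v_1 \cdot v_2} \right).$$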
For a local region, we choose a center point; pairing every point in the region with this center (including the center point itself) yields a total of n point pairs. The geometric feature of this local region is then expressed as follows: each point in the region contributes the four-dimensional descriptor between that point and the center point, together with its own normal vector. As shown in Figure 1, the local feature descriptor uses all point pairs with the center point to describe the spatial geometric characteristics of the local region.
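A possible implementation of this local region descriptor is sketched below, following the reconstruction above; concatenating each point's normal vector alongside its pair descriptor is an assumption based on the description of Formula (3), and surface normals are taken as given inputs:

import numpy as np

def pair_descriptor(p1, n1, p2, n2):
    """Four-dimensional descriptor of a point pair: distance plus three angles."""
    d = p2 - p1
    def angle(v1, v2):
        return np.arctan2(np.linalg.norm(np.cross(v1, v2)), np.dot(v1, v2))
    return np.array([np.linalg.norm(d), angle(n1, d), angle(n2, d), angle(n1, n2)])

def region_descriptor(points, normals, center_idx=0):
    """Local geometric feature: descriptor between every point and the center point,
    concatenated with that point's normal vector (one row per point pair)."""
    pc, nc = points[center_idx], normals[center_idx]
    return np.stack([np.concatenate([pair_descriptor(pc, nc, p, n), n])
                     for p, n in zip(points, normals)])

pts = np.random.rand(8, 3)                     # 8 points of a local region (example data)
nrm = np.random.rand(8, 3)
nrm /= np.linalg.norm(nrm, axis=1, keepdims=True)
print(region_descriptor(pts, nrm).shape)       # (8, 7)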
3.2. Convolution Operator
As mentioned above, because of the disorder of point clouds, the convolution operation cannot be applied to them directly. To deal with this problem, Li et al. [8] trained a permutation matrix through a multi-layer perceptron (MLP) to realize the permutation invariance of the point clouds. As shown in Figure 2, the points in Figure 2a,b have the same distribution but different orders.
In Figure 2, the labeled features represent the features of the corresponding points and the numbers represent the order of each point. We use the same convolution kernel to operate on the above two point clouds:
As shown above, the two point clouds have the same distribution, but the convolution results are different. As shown in Figure 3, in order to make the convolution result depend only on the distribution and not on the order, we use a permutation matrix to adjust the order of the points.
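As a minimal numerical illustration (the values and the permutation below are hypothetical, not those in Figure 2), the following sketch shows that a weighted sum over point features changes when the points are reordered, and that multiplying by the corresponding permutation matrix restores the original result:

import numpy as np

# Features of four points (one feature per point) and a convolution kernel of size 4.
features = np.array([1.0, 2.0, 3.0, 4.0])      # order: point 1, 2, 3, 4
kernel   = np.array([0.1, 0.2, 0.3, 0.4])

# The same points listed in a different order (same distribution, different order).
shuffled = np.array([3.0, 1.0, 4.0, 2.0])

print(kernel @ features)   # 3.0
print(kernel @ shuffled)   # 2.5 -> the convolution result depends on the order

# A permutation matrix that maps the shuffled order back to the original order.
perm = np.array([[0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [1, 0, 0, 0],
                 [0, 0, 1, 0]], dtype=float)

print(kernel @ (perm @ shuffled))  # 3.0 again -> the result depends only on the distribution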
Based on the local feature descriptor and the permutation matrix, we construct a new operator, which achieves permutation invariance and fine-grained feature extraction for a local region of the point clouds. The algorithm of the operator is given in Algorithm 1 below:
Algorithm 1 The proposed convolution operator
Input: P, p, K
Output: the feature map of the local region
1. Local coordinate transformation.
2. Encode point pairs with the four-dimensional descriptor.
3. Build the local feature descriptor.
4. Use PointNet to extract the local geometric feature.
5. Perform point-by-point feature extraction (dimension lifting).
6. Concatenate the point-wise features with the local geometric feature.
7. Obtain the weight matrix through an MLP.
8. Weight the concatenated features to achieve feature permutation invariance.
9. Feature extraction using the convolution kernel K.
The input P of the operator is the set of feature points in the local region and p is the center of P (we take p as the center and use the k-nearest neighbors (KNN) algorithm to sample the nearest k points). K represents the convolution kernel and the size of K is k (the size of the convolution kernel is the same as the number of points in the local region).
In the first step, the spatial coordinates of the points in P are transformed into relative coordinates based on the center point p (relative coordinates make the local points translation invariant).
In the second step, the point pairs in the local region are encoded according to Formula (1).
In the third step, according to Formula (3), the local feature descriptor is used to encode the local geometric feature.
In the fourth step, PointNet is used to extract the local geometric feature. The structure of the PointNet is shown in Figure 4.
The PointNet consists of an MLP and a max pooling layer. The MLP has three layers with the same number of nodes in each layer. After feature extraction, we obtain the local geometric feature of the region.
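A minimal sketch of such a PointNet block is given below; the layer width and the 7-dimensional input (4-D pair descriptor plus 3-D normal) are assumptions, the actual node count being given in Figure 4:

import torch
import torch.nn as nn

class LocalPointNet(nn.Module):
    """Shared 3-layer MLP applied to every point-pair descriptor, followed by max pooling."""
    def __init__(self, in_dim: int, width: int):
        super().__init__()
        # Three fully connected layers with the same number of nodes, applied point-wise.
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )

    def forward(self, x):            # x: (batch, k, in_dim) descriptors of a local region
        x = self.mlp(x)              # (batch, k, width) point-wise features
        return x.max(dim=1).values   # (batch, width) max pooling over the k points

# Example: 8 descriptors of dimension 7 per region, width 8 (placeholder sizes).
feat = LocalPointNet(in_dim=7, width=8)(torch.randn(2, 8, 7))
print(feat.shape)  # torch.Size([2, 8])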
In the fifth step, we use a point-wise MLP to improve the feature dimension of each point. The structure of this MLP is shown in Figure 5, where k is the number of points in the local region and the other dimension is the feature dimension of the points. The MLP has two convolutional layers. Due to the disorder of the points, only a 1 × 1 convolution kernel can be used to increase the dimension of the points (point by point). The numbers of channels of the two convolutional layers are given in Figure 5, and the channel number of the second layer is the output feature dimension of each point.
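A sketch of this point-wise dimension-lifting MLP using 1 × 1 convolutions, so that each point is processed independently (the channel sizes are placeholders; the actual values are given in Figure 5):

import torch
import torch.nn as nn

class PointwiseMLP(nn.Module):
    """Two 1x1 convolution layers that lift the per-point feature dimension."""
    def __init__(self, in_dim: int, mid_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, mid_dim, kernel_size=1), nn.ReLU(),
            nn.Conv1d(mid_dim, out_dim, kernel_size=1), nn.ReLU(),
        )

    def forward(self, x):            # x: (batch, in_dim, k) features of the k points
        return self.net(x)           # (batch, out_dim, k): each point lifted independently

# Example: lift 8 points from 3-D coordinates to an 8-D feature (placeholder sizes).
out = PointwiseMLP(3, 8, 8)(torch.randn(2, 3, 8))
print(out.shape)  # torch.Size([2, 8, 8])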
In the sixth step, the high-dimensional feature of each point obtained in the fifth step is concatenated with the local geometric feature obtained in the fourth step (each point is concatenated with the same local geometric feature).
In the seventh step, according to [8], we use an MLP to train a permutation matrix (as shown in Figure 3, this matrix is related only to the distribution of the points; k is the number of points in the local region) that redistributes the weight of each point to eliminate the influence of different orders. The structure of this MLP is shown in Figure 6.
In Figure 6, a fully connected layer (FC) maps the k points to k × k values, which are reshaped into a k × k matrix. Then, we adopt two layers of depth-wise convolution (DC; different from a normal convolutional layer, each kernel of a depth-wise convolution is responsible for a single channel, so the output feature map has the same number of channels as the input) and reshape the feature maps; a k × k permutation matrix can be obtained.
Ideally, the permutation matrix is a binary matrix, as shown in Figure 3, but the matrix obtained by the MLP is a weight matrix, as shown in Figure 7. The weight matrix can approximately achieve the permutation invariance of the local region.
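A sketch of this weight-matrix MLP is given below; the kernel size of the depth-wise convolutions and the per-point input dimension are assumptions, the actual sizes being given in Figure 6:

import torch
import torch.nn as nn

class PermutationMatrixMLP(nn.Module):
    """Learns a k x k weight matrix from the k points of a local region (a sketch)."""
    def __init__(self, k: int, in_dim: int = 3):
        super().__init__()
        self.k = k
        # FC layer: maps the k points (k * in_dim values) to k * k values.
        self.fc = nn.Linear(k * in_dim, k * k)
        # Two depth-wise convolutions: each kernel acts on a single channel,
        # so the number of channels (k) is unchanged.
        self.dc = nn.Sequential(
            nn.Conv1d(k, k, kernel_size=3, padding=1, groups=k), nn.ReLU(),
            nn.Conv1d(k, k, kernel_size=3, padding=1, groups=k),
        )

    def forward(self, pts):                      # pts: (batch, k, in_dim) local coordinates
        m = self.fc(pts.flatten(1))              # (batch, k * k)
        m = m.view(-1, self.k, self.k)           # reshape into a k x k matrix
        return self.dc(m)                        # (batch, k, k) weight matrix

w = PermutationMatrixMLP(k=8)(torch.randn(2, 8, 3))
print(w.shape)  # torch.Size([2, 8, 8])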
In the eighth step, the concatenated features from the sixth step are multiplied by the weight matrix obtained in the seventh step (matrix multiplication). During this step, as shown in Figure 7, the point cloud achieves permutation invariance through the weight matrix and we obtain the weighted features of the local region.
In the ninth step, we can directly perform the convolution operation on the weighted features to obtain the feature map of this local region.
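The eighth and ninth steps reduce to a matrix multiplication followed by a convolution with the kernel K; a minimal sketch with placeholder shapes:

import torch

k, c_in, c_out = 8, 16, 32                 # points per region and feature dims (placeholders)
feats  = torch.randn(k, c_in)              # concatenated per-point features (step six)
weight = torch.randn(k, k)                 # learned weight matrix (step seven)
kernel = torch.randn(c_out, k, c_in)       # convolution kernel K covering the whole region

weighted = weight @ feats                  # step eight: (k, c_in) permutation-invariant features
feature_map = torch.einsum('okc,kc->o', kernel, weighted)  # step nine: (c_out,) region feature
print(feature_map.shape)  # torch.Size([32])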
The above steps can be represented as follows:
where K, p, and P represent the input of the operator, the convolution is performed with the kernel K, and the PointNet and the two MLPs are those shown in Figure 4, Figure 5 and Figure 6, respectively.
3.3. CNN for Feature Extraction
In Section 3.2, we used the local feature descriptor to describe the fine-grained features of a local region of the point clouds and adopted the learned weight matrix to weight the disordered points to achieve permutation invariance. In this section, based on the proposed operator, we construct a convolutional neural network (CNN) for facial feature extraction. The structure of our network is shown in Figure 8.
The network consists of 5 convolutional layers; the parameters of each layer are shown in Figure 8, where K is the number of points in a local region in this layer, C is the output feature dimension, N is the number of feature points in the next layer, and D is the dilation rate, which, together with K and the number of feature points in the previous layer, determines the receptive field of the convolutional layer. For each layer, we also list the dimensions that set the size of the PointNet in Figure 4 and of the point-wise MLP in Figure 5.
Take the first layer as an example. The input point cloud has 1024 points (in our method, according to [32], we use the farthest point sampling (FPS) algorithm to sample 1024 points for each face). We use the k-nearest neighbors (KNN) algorithm to sample the 8 nearest points for each point (each local region has 8 points). We then adopt the proposed operator to extract the feature map (convolution result) of each local region, where the node number of the PointNet in this layer is 8 (as shown in Figure 4) and the output dimension of the point-wise MLP in this layer is 8 (as shown in Figure 5). After the operator is applied, each local region becomes a feature map and is regarded as a new point for the next convolution layer.
After the 5 convolution layers, the number of feature points decreases layer by layer to 32 and the feature dimension increases accordingly, as listed in Figure 8. In the last convolutional layer, the receptive field covers the whole previous layer, which means the last 32 feature points “see” the whole region of the previous layer. Then, we use global average pooling to extract the global feature from the 32 feature points. According to [9], in order to avoid large differences between facial features, we normalize the features by the 2-norm:
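The final pooling and normalization can be sketched as follows (the 512-dimensional feature size is a placeholder, not a value stated above):

import torch
import torch.nn.functional as F

point_feats = torch.randn(1, 32, 512)               # 32 feature points from the last conv layer
global_feat = point_feats.mean(dim=1)                # global average pooling over the 32 points
global_feat = F.normalize(global_feat, p=2, dim=1)   # 2-norm normalization of the face feature
print(global_feat.norm(dim=1))                       # unit-length global feature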
3.4. Feature Enhancement Mechanism
In Section 3.3, we obtain normalized facial features in which the value of each dimension lies between −1 and 1. However, not every dimension plays the same role in the recognition task. For example, the larger the feature value of a certain dimension, the higher the contribution this dimension provides to recognition; conversely, the smaller the feature value of a certain dimension, the lower the contribution this dimension provides. Based on this observation, we propose a new feature enhancement mechanism to enhance the discrimination of features.
First, we take the absolute value of the feature value in each dimension according to Formula (10). Then, we use softmax to map these absolute values to a probability distribution between 0 and 1. In this step, according to Formula (11), the numerator for a feature value with a large absolute value grows fast, whereas the numerator for a feature value with a small absolute value grows slowly (because the exponential function in the softmax grows faster for larger inputs). The stretched values improve the discrimination of the features. Finally, as shown in Formula (12), we restore the feature values to their original signs.
We use softmax to enhance the feature values but, in order to avoid ignoring some of the original information, we utilize an enhancement parameter to linearly combine the original feature and the enhanced feature:
In Formula (13), the values of the original feature and the enhanced feature are both between −1 and 1, but there is still a large gap between them. The enhancement parameter determines the degree of coupling of the two features and also determines the contribution of the proposed feature enhancement mechanism to the final feature. The structure of the feature enhancement mechanism is shown in Figure 9.
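A minimal sketch of the enhancement mechanism follows; the convex combination with parameter lam in the last line is one plausible reading of Formula (13), and the feature size is a placeholder:

import torch

def enhance_feature(f: torch.Tensor, lam: float) -> torch.Tensor:
    """Feature enhancement: softmax over absolute values, signs restored,
    then a linear combination with the original feature (Formulas (10)-(13), a sketch)."""
    a = f.abs()                                 # Formula (10): absolute values
    p = torch.softmax(a, dim=-1)                # Formula (11): stretch via softmax
    f_e = p * torch.sign(f)                     # Formula (12): restore the original signs
    return lam * f + (1.0 - lam) * f_e          # Formula (13): couple original and enhanced

f = torch.nn.functional.normalize(torch.randn(1, 512), dim=-1)
print(enhance_feature(f, lam=0.5).shape)        # torch.Size([1, 512])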
3.5. Triplet Loss Function
In the feature space, the metric distance between objects reflects their similarity; the training purpose of the face recognition network is to make samples of the same object have a small metric distance and samples of different objects have a large metric distance.
In the field of 2D face recognition, FaceNet [9] constructed a triplet loss function and surpassed human accuracy. In this section, we construct a triplet loss based on the enhancement parameter.
The triplet loss function involves three types of samples: anchor samples (Anchor), positive samples (Positive), and negative samples (Negative). The anchor samples and positive samples come from the same object, while the negative samples come from different objects. As shown in Figure 10, the purpose of the network is to make the metric distance between the anchor sample and its farthest positive sample smaller than the distance between the anchor sample and its closest negative sample.
According to Formula (13), the face feature is composed of two parts: the original normalized feature and the enhanced feature. As shown in Formula (12), the transformation from the original feature to the enhanced feature is a non-linear process. If only the enhanced feature is used for measurement, some original details of the features will be ignored. Therefore, in this section, we construct a new triplet loss according to the enhancement parameter, and the training purpose in Figure 10 can be expressed as follows:
where the first three terms denote the normalized features (as in Formula (9)) of the Anchor, Positive, and Negative samples, respectively, and the remaining three terms denote the enhanced features (as in Formula (12)) of the Anchor, Positive, and Negative samples, respectively. The enhancement parameter is the one introduced in Formula (13) and the margin specifies the minimum separation required between the anchor-positive distance and the anchor-negative distance.
In the training process, only samples that do not satisfy Formula (14) are used to optimize the model (the loss of a sample that satisfies Formula (14) is 0):
The loss function of our model is defined as follows:
where N represents the total number of triplet samples used in Formula (15). During the training process, according to the loss function, Anchor and Positive samples that are far apart are pulled closer, while Anchor and Negative samples that are close together are pushed farther apart. The whole structure of our network is shown in Figure 11.
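A minimal sketch of such a triplet loss that couples the normalized and the enhanced features is given below; the convex combination with the enhancement parameter lam and the squared Euclidean distance are assumptions, since Formulas (14) and (15) are not reproduced above:

import torch

def triplet_loss(fa, fp, fn, fa_e, fp_e, fn_e, lam=0.5, margin=0.2):
    """Triplet loss on a combination of normalized and enhanced features (a sketch)."""
    def dist(x, y):                       # squared Euclidean distance between features
        return ((x - y) ** 2).sum(dim=-1)
    d_pos = lam * dist(fa, fp) + (1 - lam) * dist(fa_e, fp_e)   # anchor-positive distance
    d_neg = lam * dist(fa, fn) + (1 - lam) * dist(fa_e, fn_e)   # anchor-negative distance
    # Only triplets violating the margin contribute (hinge); satisfied triplets have zero loss.
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

fa, fp, fn = (torch.nn.functional.normalize(torch.randn(4, 512), dim=-1) for _ in range(3))
print(triplet_loss(fa, fp, fn, fa, fp, fn))   # enhanced features reused here for brevity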
Ideally, we want the farthest pair of samples of the same object (hard positive pair) to have a smaller metric distance than the closest pair of samples of different objects (hard negative pair). However, for a large number of training samples, it is difficult to find the hard positive pair and the hard negative pair, and sample selection is very important for the performance of the model. As described in Section 3.3, each face is sampled to 1024 points as input. Following [8,9], we divide each batch into mini-batches. For a mini-batch, 40 samples are selected from the same subject and we find the hard positive pair among these 40 samples. The hard negative pair is randomly selected from other subjects. The margin in Formula (14) is computed in each mini-batch. The size of each batch in our network is fixed at 1800. The Adam optimizer with an initial learning rate of 0.01 is used for the training of our model.
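The mini-batch mining strategy described above can be sketched as follows (tensor shapes and the random choice of the negative sample are simplified assumptions):

import torch

def mine_triplet(feats_same, feats_other):
    """Pick the hard positive pair within one subject's mini-batch and a random negative.
    feats_same: (40, d) features of one subject; feats_other: (m, d) features of other subjects."""
    d_pos = torch.cdist(feats_same, feats_same)     # pairwise distances within the subject
    idx = torch.argmax(d_pos).item()                # farthest pair = hard positive pair
    a, p = divmod(idx, d_pos.size(1))
    n = torch.randint(0, feats_other.size(0), (1,)).item()   # random negative sample
    return feats_same[a], feats_same[p], feats_other[n]

anchor, positive, negative = mine_triplet(torch.randn(40, 512), torch.randn(200, 512))
print(anchor.shape, positive.shape, negative.shape)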