1 Introduction

Modern dentistry, with its digital equipment for both intra-oral and extra-oral imaging, demands computer-aided design (CAD) systems that facilitate data analysis, e.g. for accurate treatment planning and diagnostic support. In this study, we explore a segmentation methodology for intra-oral scans (IOS) based on recent deep learning advances, to support automated clinical workflows in the implantology and orthodontics fields. Such CAD-based workflows require accurate segmentation of each individual tooth and the gingiva (gums) in the imaging data. IOS imaging captures the 3D geometrical profile of tooth crowns and gingiva at a high spatial resolution. Processing such detailed information on the anatomic structures of tooth crowns is highly desirable for many clinical applications. An IOS consists of a large set of points (e.g. hundreds of thousands) in a 3D Cartesian coordinate system. These data can be represented either by a point cloud or by a mesh (i.e. after applying a triangulation algorithm to the points). Each point is represented by its 3D coordinates and, depending on the type of scanner, other attributes such as color. In this paper, semantic instance segmentation of an IOS refers to the assignment of a unique label to all points belonging to each instance (i.e. an individual tooth) by a computational model. After segmentation of the tooth instances, a post-processing stage follows for standardization of the labels, in which the model assigns to each detected tooth one label prescribed by the Fédération Dentaire Internationale (FDI) dental notation for adult dentition.

In the following, we briefly introduce the related work on IOS segmentation and the recent advances of deep learning for instance segmentation on point cloud data, and position our contributions with respect to both. Afterwards, we explain the proposed method and the obtained results in detail. Lastly, we provide a discussion and conclusions.

2 Related Work

IOS Segmentation efforts are mostly based on conventional computer vision techniques, which are limited by the search for the best handcrafted features, the manual tuning of several parameters, and a lack of generalization and robustness [1]. Recently, Zanjani et al. [1] proposed an end-to-end learning model for IOS segmentation using a deep PointCNN [6] model. In their work, the label of each tooth is treated as a semantic label that the network aims to predict directly at its output. However, formulating IOS segmentation with a point-based classification loss is ill-posed. This is mainly due to the low inter-class variability between neighbouring teeth, especially among the molar and premolar teeth. Hence, an accurate prediction of the labels requires not only local geometrical information (i.e. crown shapes), but also global context such as the relative positions and arrangement of the teeth and the possible absence of other teeth.

Thus, to address this ill-posed formulation, we define the segmentation problem as an instance segmentation task, which recognizes each tooth as an instance in the 3D point cloud. The learning model then localizes all instances with their 3D bounding boxes and simultaneously assigns a unique label to all points belonging to each instance. This approach has at least two advantages: (1) inference on the labeling of each tooth instance does not depend on its relative position with respect to other teeth; (2) patch-based training and subsequent processing of the point cloud at its original spatial resolution (without down-sampling) is facilitated. This is possible because the network does not require global context to assign a specific FDI label to all points of a tooth instance.

Deep Learning Instance Segmentation in 3D Point Clouds: Among the proposed deep learning models for point cloud analysis, only a few researchers have addressed the challenging issue of 3D instance segmentation. To better compare and position our proposed framework, we briefly survey the most closely related recent works. FrustumNet [8] proposes a hybrid framework involving two stages. The first stage detects the object bounding boxes in a 2D image. The second stage processes the 3D point cloud in a 3D search space, partially bound by the initially detected 2D bounding boxes. The 3D-SIS model [4] also first processes 2D images rendered from the point cloud through a 2D convolutional network (ConvNet). Afterwards, the learned features are back-projected onto the voxelized point cloud, where the extracted 2D features and the geometric information are combined to obtain the object proposals and per-voxel mask predictions. The dependency of both preceding models on 2D image(s) limits their applicability to 3D point cloud analysis. In another approach, GSPN [12] follows an analysis-by-synthesis strategy: instead of directly finding the object bounding boxes in a point cloud, it utilizes a conditional variational auto-encoder (CVAE). However, GSPN requires a separate two-stage training of the CVAE part and the region-based networks (which perform the classification, regression and mask generation on the proposals). In an alternative approach to detecting object proposals, the SGPN [11] and MASC [7] methods perform a clustering on the processed points for segmenting the instances. SGPN [11] uses a similarity matrix between the features of each pair of points in the embedded feature space to indicate whether a given pair of points belongs to the same object instance or not. However, computing such a pair-wise distance is impractical for large point clouds, and especially for IOS data, where down-sampling would significantly affect the detection/segmentation performance. MASC [7] voxelizes the point cloud in order to process the volumetric data with a 3D U-Net model. Similar to SGPN, MASC uses a clustering mechanism to find similarities between each pair of points by comparing their extracted features in several hidden layers of the trained U-Net. Unfortunately, as mentioned before, voxelization of a large, finely detailed point cloud greatly limits the performance of such approaches.

In this paper, we propose an end-to-end deep learning model for instance segmentation in 3D point cloud data. Our contribution is threefold.

  1. We present a new instance segmentation model, called Mask-MCNet. The proposed model is applied directly to an irregular 3D point cloud at its original spatial resolution and predicts the 3D bounding boxes of the instances along with their masks, indicating the segmented points of each instance.

  2. To the best of our knowledge, this is the first study that both detects and segments tooth instances in IOS data with a deep learning model.

  3. We conduct an extensive experimental evaluation and show that the proposed model significantly outperforms the state of the art in IOS segmentation.

3 Method

At a high level, the Mask-MCNet is similar to Mask R-CNN [2], as it includes three main parts: the backbone network, a Region Proposal Network (RPN), and three branches of predictor networks for classification, regression, and mask generation (see Fig. 1). Each part is explained in detail below.

Fig. 1. Block diagram of the Mask-MCNet (see supplementary material for details).

The backbone network acts as a feature extractor and consists of a deep MLP-based network, which is applied to the entire input 3D point cloud or to cropped 3D patches of it (depending on hardware limitations). Every input patch includes n points (varying across patches), where each point is represented by its (x, y, z) 3D coordinates along with its normal vector (which can be computed by averaging the normal vectors of all faces connected to that point). Hence, the input to the backbone model is an \(n \times 6\) matrix. In this study, we choose to employ a PointCNN [6] for its capacity to process fine detail and its small model size [1, 6]. The backbone outputs an \(n \times 256\) matrix of features (where n denotes the number of input points), which contains rich geometrical information around each point.
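
As an illustration, a minimal sketch of how such an \(n \times 6\) input matrix could be assembled from a triangle mesh is given below; the function and variable names are ours and not part of the proposed implementation.

```python
# Illustrative sketch (not the authors' code): building the n x 6 backbone input from a
# triangle mesh, assuming `vertices` (n x 3) and `faces` (f x 3 vertex indices).
import numpy as np

def backbone_input(vertices: np.ndarray, faces: np.ndarray) -> np.ndarray:
    # Per-face normals from the cross product of two edge vectors.
    v0, v1, v2 = vertices[faces[:, 0]], vertices[faces[:, 1]], vertices[faces[:, 2]]
    face_normals = np.cross(v1 - v0, v2 - v0)
    face_normals /= np.linalg.norm(face_normals, axis=1, keepdims=True) + 1e-12

    # Per-point normals: average the normals of all faces connected to each point.
    point_normals = np.zeros_like(vertices)
    for corner in range(3):
        np.add.at(point_normals, faces[:, corner], face_normals)
    point_normals /= np.linalg.norm(point_normals, axis=1, keepdims=True) + 1e-12

    # Concatenate (x, y, z) coordinates and normal vectors into an n x 6 matrix.
    return np.concatenate([vertices, point_normals], axis=1)
```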

Monte Carlo ConvNet as Region Proposal Network (RPN). Since the points in the point cloud cover only the surface of objects, the features computed by the backbone network contain local geometrical representations on a manifold in 3D space. However, for a regression problem such as the accurate localization of a 3D bounding box encompassing an object, the model requires awareness of several parts (or sides) of each object. Hence, voxelizing the data and employing a 3D ConvNet on the resulting volumetric data is a common approach; however, its shortcomings have already been mentioned. Therefore, as an alternative approach for aggregating the computed feature vectors of the 3D points, we employ a Monte Carlo ConvNet (MCCNet) [3] for distributing and transferring the information from the surface of the objects into the entire 3D space (e.g. into the void space inside the objects). The MCCNet consists of several modular MLP sub-networks of two hidden layers whose function resembles a set of convolution kernels. For more information regarding the mechanism of MCCNet, we refer to the original paper [3].

We employ the MCCNet as our region proposal network (RPN) because of two important properties: (1) its capability to compute the convolution on an arbitrary output point set within the kernel's field of view (FOV), regardless of whether those points are present in the set of input points; (2) its capability to handle the non-uniform distribution of points when computing the convolution. The first property makes it possible to transfer the features computed by the backbone network onto an arbitrary new domain, such as the nodes of a 3D grid, while the second property facilitates the processing of a non-uniform grid domain.

To generate object proposals (i.e. 3D cubes encompassing teeth), we follow the idea of using anchors, adopted from Faster R-CNN [10] but modified for 3D space. Here, each 3D anchor is a cube, represented by its central position \([x_a,y_a,z_a]\) and its size [w, d, h]. Making no assumptions about the possible positions of objects (which leads to a more generic approach), the centers of the anchors should be located on a regular 3D grid that spans almost the entire input 3D space. The spatial resolution of such a grid affects the performance of the model: choosing a low-resolution grid leads to too few anchors being positioned inside small objects (e.g. incisor teeth), whereas a high-resolution grid makes the computation inefficient. Instead of imposing a naive uniform grid, we design a non-uniform grid that has dense nodes close to the object surface(s) and sparse nodes far from the surface. Such a non-uniform grid can easily be obtained by filtering out the nodes of an initial dense grid in 3D space, according to the distance of each node to the closest point in the point cloud and a predefined lower bound on the grid resolution.
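
The exact filtering rule is a design choice; the following minimal sketch shows one possible way to thin out a dense regular grid based on the distance of each node to the nearest surface point (the distance threshold and sub-sampling stride are our assumptions, not values from the paper).

```python
# Illustrative sketch: non-uniform anchor grid obtained by filtering a dense regular grid
# according to the distance of each node to the nearest point of the scanned surface.
import numpy as np
from scipy.spatial import cKDTree

def non_uniform_grid(points, fine_res=0.04, coarse_res=0.12):
    lo, hi = points.min(axis=0), points.max(axis=0)
    axes = [np.arange(l, h + fine_res, fine_res) for l, h in zip(lo, hi)]
    dense = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)

    # Distance of every dense node to its closest point on the scanned surface.
    dist, _ = cKDTree(points).query(dense)

    # Keep all nodes near the surface; far from the surface, keep only the nodes of a
    # coarser sub-grid, so node density decays with distance from the surface.
    stride = int(round(coarse_res / fine_res))
    idx = np.stack(np.meshgrid(*[np.arange(len(a)) for a in axes], indexing="ij"),
                   axis=-1).reshape(-1, 3)
    on_coarse = np.all(idx % stride == 0, axis=1)
    keep = (dist <= 2 * fine_res) | on_coarse
    return dense[keep]
```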

With the two above-mentioned properties, the MCCNet is able to transfer the received features from the backbone network into a new non-uniform grid domain with m nodes, through its first convolutional layer. By further processing of data through the hidden layers of MCCNet and based on the FOV of each convolutional kernel, the geometrical information of surface points becomes distributed on the entire grid domain. Reminding that each node on the non-uniform grid indicates the center of one (\(k=1\)) or multiple-size (\(k>1\)) anchor(s), the total amount of anchors is \(k \times m\). As a classification task, the model predicts from the feature set inside each anchor whether the anchor contains an object or not. If so, a further regression task is performed on the prediction of the object’s center-point and its size. As a fully-connected MLP with fixed-length input would be employed for performing such a classification and regression task, the feature set inside each anchor would require to have a fixed length. To do so, a fixed set of \(s\times s\times s\) nodes (e.g. \(s=5\)) is interpolated inside each anchor in 3D space by applying a triangular interpolation using the three nearest neighbour nodes of the grid and weighting their feature vectors based on their distance to the new node in 3D space. At this stage, for \(k \times m\) positioned anchors in 3D input space, an output matrix of \(k \times m \times s^3\) is obtained.
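
A minimal sketch of this interpolation step for a single anchor is shown below, reading the distance-based weighting as inverse-distance weights over the three nearest grid nodes; the weighting scheme and variable names are assumptions for illustration only.

```python
# Illustrative sketch: interpolating RPN grid features onto a fixed s x s x s set of nodes
# inside one anchor, using the three nearest grid nodes weighted by their distance.
import numpy as np
from scipy.spatial import cKDTree

def anchor_features(grid_nodes, grid_feats, center, size, s=5):
    # grid_nodes: m x 3 node positions; grid_feats: m x C feature vectors from the MCCNet.
    offsets = np.linspace(-0.5, 0.5, s)
    local = np.stack(np.meshgrid(offsets, offsets, offsets, indexing="ij"),
                     axis=-1).reshape(-1, 3)
    sample_pts = center + local * size            # s^3 sample nodes inside the anchor

    dist, idx = cKDTree(grid_nodes).query(sample_pts, k=3)
    w = 1.0 / (dist + 1e-8)                       # inverse-distance weights
    w /= w.sum(axis=1, keepdims=True)
    # Weighted sum of the three nearest nodes' feature vectors -> s^3 x C matrix.
    return np.einsum("nk,nkc->nc", w, grid_feats[idx])
```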

Predictor networks consist of three parallel branches for classification, regression, and mask generation. The classification and regression branches both consist of a fully-connected MLP network and receive the \(k \times m \times s^3\) feature matrix from the RPN. The classification branch performs a binary classification, indicating whether each of the \(k \times m\) anchors contains an object instance or not. For a positively detected anchor, the offsets of its central position and size (i.e. residual vectors) are predicted at the output of the regression branch. In the training phase, an anchor is labeled positive if its overlap with any tooth instance is above a threshold (e.g. 0.4 IoU), and negative if the overlap is below a lower threshold (e.g. 0.2 IoU). Since the numbers of positive and negative anchors are highly imbalanced, about 50% of each training batch is selected from the positive anchors and 25% from the negative anchors. The remaining 25% of the sampled anchors in the training batch are selected from the marginal anchors (\(0.2<IoU<0.4\)), which are also treated as negative samples.
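
The anchor labeling and batch composition can be sketched as follows (thresholds taken from the text; the sampling code itself is an illustrative assumption, not the authors' implementation).

```python
# Illustrative sketch: labeling anchors by their IoU with ground-truth tooth boxes and
# composing a 50/25/25 training batch of positive, negative, and marginal anchors
# (marginal anchors are also treated as negatives).
import numpy as np

def sample_anchor_batch(anchor_iou, batch_size=32, rng=np.random):
    # anchor_iou: per-anchor maximum IoU with any ground-truth tooth box.
    pos = np.where(anchor_iou >= 0.4)[0]
    neg = np.where(anchor_iou < 0.2)[0]
    marginal = np.where((anchor_iou >= 0.2) & (anchor_iou < 0.4))[0]

    n_pos, n_neg = batch_size // 2, batch_size // 4
    n_marg = batch_size - n_pos - n_neg
    chosen = np.concatenate([
        rng.choice(pos, min(n_pos, len(pos)), replace=False),
        rng.choice(neg, min(n_neg, len(neg)), replace=False),
        rng.choice(marginal, min(n_marg, len(marginal)), replace=False),
    ])
    labels = (anchor_iou[chosen] >= 0.4).astype(np.int64)  # marginal/negative -> 0
    return chosen, labels
```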

The mask-generation branch directly receives features from the backbone network. The architecture of the mask branch is similar to the PointCNN used as the backbone network, though it consists of only three layers [6]. The 3D bounding boxes estimated by the regression branch are used to crop the point cloud accordingly. The cropped point set, along with its feature vectors at the output of the backbone network, is passed on to the mask branch, which performs a binary classification of the points inside each anchor into two classes: (1) foreground points, which belong to the tooth instance, and (2) background points, which belong to other teeth or the gingiva.
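
The cropping step itself amounts to a simple box test, sketched below for one predicted (axis-aligned) box; this is only an illustration of the selection, not the authors' code.

```python
# Illustrative sketch: selecting the points (and their backbone features) that fall inside
# one predicted 3D bounding box before passing them to the mask branch.
import numpy as np

def crop_box(points, features, center, size):
    half = np.asarray(size) / 2.0
    inside = np.all(np.abs(points - np.asarray(center)) <= half, axis=1)
    return points[inside], features[inside], inside
```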

The loss function of Mask-MCNet is similar to that of Mask R-CNN, with an equal contribution of three terms. The first term is a cross-entropy loss for the classification branch on its softmax output layer. The second term is a mean squared error at the linear output layer of the regression branch. Finally, the third term is a binary cross-entropy loss for the classification of all points in each positive anchor at the softmax output layer of the mask branch. The regression and mask losses are only involved if the examined anchor is labeled positive.
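
A minimal PyTorch-style sketch of this three-term loss is given below; tensor shapes and names are our assumptions, and the logits-based binary cross-entropy is simply one convenient way to express the mask term.

```python
# Illustrative sketch: classification + regression + mask loss with equal contributions,
# where the regression and mask terms only apply to positive anchors.
import torch
import torch.nn.functional as F

def mask_mcnet_loss(cls_logits, cls_targets,   # (A, 2) anchor logits, (A,) object/no-object
                    reg_pred, reg_targets,      # (A, 6) center and size offsets
                    mask_logits, mask_targets,  # per-point logits/labels of positive anchors
                    positive):                  # (A,) boolean mask of positive anchors
    loss_cls = F.cross_entropy(cls_logits, cls_targets)
    if positive.any():
        loss_reg = F.mse_loss(reg_pred[positive], reg_targets[positive])
        loss_mask = F.binary_cross_entropy_with_logits(mask_logits, mask_targets)
    else:
        loss_reg = cls_logits.new_zeros(())
        loss_mask = cls_logits.new_zeros(())
    # Equal contribution of the three terms, as in Mask R-CNN.
    return loss_cls + loss_reg + loss_mask
```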

Tooth Label Assignment as a Constraint Satisfaction Problem: As mentioned earlier, for clinical purposes and consistency of the tooth label assignments, we use a post-processing stage to translate (via a look-up table) the instance labels predicted by the Mask-MCNet into the standard FDI labels. Using the average central positions and sizes of the FDI labels measured on the training data, a combinatorial search algorithm finds the most likely label assignment that satisfies the predefined constraints (prior measurements on the training data), in the context of a constraint satisfaction problem (CSP).
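
To convey the idea only, the sketch below replaces the CSP solver with a simple cost-based assignment between detected instances and FDI labels, using their deviation from the training-set average center/size statistics; the actual combinatorial search and its constraints are more elaborate and are not reproduced here.

```python
# Simplified illustrative sketch of the label-assignment idea (not the CSP solver itself):
# match detected tooth instances to FDI labels by closeness to the average center and size
# statistics measured on the training data, with each label used at most once.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_fdi_labels(inst_centers, inst_sizes, fdi_centers, fdi_sizes):
    # inst_*: (num_instances, 3); fdi_*: (num_fdi_labels, 3) training-set averages.
    cost = (np.linalg.norm(inst_centers[:, None] - fdi_centers[None], axis=-1)
            + np.linalg.norm(inst_sizes[:, None] - fdi_sizes[None], axis=-1))
    rows, cols = linear_sum_assignment(cost)   # each instance receives a unique FDI label
    return dict(zip(rows, cols))
```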

3.1 Implementation Details

Training of the entire Mask-MCNet model is performed end-to-end, using gradient descent with the Adam optimizer, for 1000 epochs with a batch size of 32 (balanced between positive and negative anchors). The pre-processing of the input IOS only consists of normalizing the whole point cloud to zero mean and unit variance. The input to the Mask-MCNet is a randomly cropped patch of the point cloud, which usually contains 2–4 tooth instances. As explained, the non-uniform grid domain is constructed by filtering out the nodes of a dense regular grid with a spatial resolution (lower bound) of 0.04 in each dimension. The upper bound on the grid resolution is set to 0.12. To create sufficient overlap between anchors and both small and large objects (e.g. incisor and molar teeth, respectively), two types of anchors (\(k=2\)) are employed, with sizes [0.3, 0.3, 0.2] and [0.15, 0.2, 0.2].

Inference on a new IOS is performed by applying the Mask-MCNet to several overlapping cropped patches. Given the 3D patches and applying a regular grid with the highest defined resolution (e.g. 0.04), the anchors positioned on the grid are classified as object/no-object by the classification branch. The sizes and central positions of the positively detected anchors are updated according to the values estimated by the regression branch. Since multiple anchors might be detected for each object, a non-maximum suppression algorithm is employed, similar to Faster R-CNN, based on the highest objectness scores (from the classification probabilities). It is worth mentioning that the non-maximum suppression also handles points that are duplicated due to the overlap between input patches. After bounding box prediction, retrieving a mask for all points inside each bounding box from the mask prediction branch is straightforward. The network architecture of the Mask-MCNet is given in the supplementary material.
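
For completeness, a minimal sketch of 3D non-maximum suppression over axis-aligned boxes is shown below; the IoU threshold is an assumption and the box parameterization follows the (center, size) convention used above.

```python
# Illustrative sketch: 3D non-maximum suppression. Boxes are visited in descending order
# of objectness score; a box is discarded if it overlaps an already accepted box too much.
import numpy as np

def box_iou_3d(a, b):
    # Boxes as (cx, cy, cz, w, d, h); axis-aligned intersection over union.
    a_min, a_max = a[:3] - a[3:] / 2, a[:3] + a[3:] / 2
    b_min, b_max = b[:3] - b[3:] / 2, b[:3] + b[3:] / 2
    inter = np.prod(np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None))
    union = np.prod(a[3:]) + np.prod(b[3:]) - inter
    return inter / union

def nms_3d(boxes, scores, iou_thr=0.25):
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        if all(box_iou_3d(boxes[i], boxes[j]) < iou_thr for j in keep):
            keep.append(i)
    return keep
```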

4 Experiments and Results

Data: Our dataset consists of 120 optical scans of dentitions from 60 adult subjects, each contributing one upper and one lower jaw scan. The optical scan data was recorded with a 3Shape d500 optical scanner (3Shape AS, Copenhagen, Denmark), which captures 180k points on average (varying in the range [100k, 310k]). The dataset includes scans of healthy dentitions, with a variety of abnormalities among subjects.

All optical scans were manually segmented: their points were categorized into one of the 32 classes according to the FDI standard by a dental professional, and the results were reviewed and adjusted by a dental expert. Segmentation of each optical scan took 45 minutes on average, which shows that manual segmentation is an intensive and laborious task for a human.

Experimental Setup: The performance of the Mask-MCNet is evaluated in comparison with the state of the art using fivefold cross-validation. The average Jaccard index (also known as mIoU) is used as the segmentation metric. In addition to the mIoU, by treating each class individually as a binary (one-versus-all) segmentation problem and averaging all measured precision and recall scores, we report the mean average precision (mAP) and mean average recall (mAR) for the multi-class tooth segmentation problem. In contrast to the semantic tooth segmentation approach, which requires preserving global context information (e.g. relative tooth positions on the dental arch) for accurate label assignment, Mask-MCNet can be applied to cropped scans (e.g. partitioned into five patches). This is because an instance segmentation approach first localizes the objects (teeth) and then performs segmentation by assigning a unique label to each instance. As explained earlier, in the post-processing stage, the assigned unique labels are converted into semantic tooth labels by applying the constraint satisfaction solver. Employing such a patch-processing technique allows Mask-MCNet to process an IOS at a higher resolution at the cost of a longer execution time. Using five patches per scan, Mask-MCNet therefore segments an IOS with a longer execution time, as reported in Table 1.
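
The reported metrics can be computed per class in a one-versus-all fashion over the point labels and then averaged, as the following minimal sketch illustrates (the function name and interface are ours).

```python
# Illustrative sketch: per-class (one-versus-all) IoU, precision, and recall over point
# labels, averaged across classes to obtain mIoU, mAP, and mAR.
import numpy as np

def segmentation_metrics(pred, gt, classes):
    ious, precisions, recalls = [], [], []
    for c in classes:
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        if tp + fp + fn == 0:
            continue  # class absent from both prediction and ground truth
        ious.append(tp / (tp + fp + fn))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return np.mean(ious), np.mean(precisions), np.mean(recalls)  # mIoU, mAP, mAR
```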

The obtained results are shown in Table 1 and a number of segmented IOS are visualized in the supplementary material. As can be observed, the proposed Mask-MCNet significantly outperforms the state-of-the-art networks in IOS segmentation.

Table 1. Instance segmentation performance of the Mask-MCNet, compared with state-of-the-art semantic segmentation models on multi-class tooth label assignment. The mean IoU (mIoU), mean average precision (mAP), mean average recall (mAR), and the execution time are reported.

5 Discussion and Conclusion

In this study, we have presented a new instance segmentation framework, called Mask-MCNet, for tooth instance segmentation in 3D point clouds of IOS data. In contrast to alternative deep learning models, our proposed end-to-end learning model does not require a voxelization step for processing the point cloud. Consequently, the data can be processed while preserving its fine geometrical detail, which is important for successful IOS segmentation. Furthermore, by employing the Monte Carlo ConvNet, the Mask-MCNet can handle non-uniformly distributed information in 3D space. This property leads to an efficient search for object proposals, which is important for the scalability of the method to IOS data with large point clouds (more than 100k points). The experiments have shown that the proposed framework achieves a 98% IoU score on the test data, thereby outperforming the state-of-the-art networks in the IOS segmentation task. This performance is close to human level and is obtained in only a few seconds of processing time, whereas the same task is lengthy and labor-intensive for a human.