1 Introduction
Interactive 3D segmentation, the ability to select fine-grained segments from a 3D shape based on user inputs like clicks, is a fundamental problem in computer graphics with broad implications. In fields such as computer-aided design and 3D modeling, precise segment selection facilitates detailed model refinement. Moreover, in engineering, architecture, and medicine, fine-grained selection is indispensable for simulation and analysis, allowing for accurate assessment of structural integrity and behavior. While important, this problem poses significant challenges. How can we decipher user intentions from input as minimal as clicks? How do we handle diverse shapes with varying geometries and select specific and unique shape parts? In this work, we propose a method tailored to the shape at hand that selects regions adhering to the user clicks. Traditional segmentation techniques do not utilize user inputs and instead rely on geometric features to delineate semantic parts [Cornea et al. 2007; Dey and Zhao 2004; Hoffman and Richards 1984; Lien and Amato 2007; Shamir 2008; Zheng et al. 2015]. Recent data-driven techniques have further leveraged fully annotated 3D datasets and achieved impressive 3D segmentation results [Chen et al. 2019; Deng et al. 2020; Hanocka et al. 2019; Hu et al. 2022; Milano et al. 2020; Milletari et al. 2016; Qi et al. 2017; Sharp et al. 2022; Sun et al. 2021; Yi et al. 2017; Zhu et al. 2020]. However, the reliance on annotated data and the scarcity of large-scale 3D datasets limit such networks to a specific shape domain with a predefined set of parts.
Current 3D segmentation methods have circumvented the dependency on 3D data and predefined part definitions by utilizing pretrained 2D foundation models to learn semantic co-segmentation [Ye et al. 2023] or text-driven segmentation [Abdelreheem et al. 2023a; 2023b; Decatur et al. 2024; 2023; Ha and Song 2022; Kim and Sung 2024; Liu et al. 2023]. Nonetheless, text may not be able to accurately describe all fine-grained segmentations, such as the fourth leg of an octopus or a region corresponding to a particular point on the shape.
In this paper, we present iSeg, a new data-driven interactive technique for 3D shape segmentation that generates customized partitions of the shape according to user clicks. Given a shape represented as a triangular mesh, the user selects points on the mesh interactively to indicate a desired segmentation, and iSeg predicts a region over the mesh surface that adheres to the clicked points. Our interactive interface can utilize positive and negative clicks, enabling additions and exclusions of areas from the segmented region, respectively (see Fig. 1).
We harness the power of a pretrained 2D foundation segmentation model [Kirillov et al. 2023] and distill its knowledge to 3D. However, segmenting a meaningful 3D region using a 2D model is very challenging, since occluded shape regions cannot be seen together from a single 2D view. Accordingly, we design an interactive segmentation system that operates entirely in 3D, where the user clicks and the inferred corresponding region are applied directly over the shape surface, ensuring 3D consistency by construction. During training only, we project the 3D user clicks and the predicted segmentation to multiple 2D views to enable supervision from the powerful pretrained foundation model [Kirillov et al. 2023].
For interactivity, our system must accommodate different user inputs, namely point clicks that can vary in number and type. Instead of training a separate segmentation model for each user click configuration, we propose a novel interactive attention mechanism, which learns the representation of positive and negative clicks and computes their interaction with the other points of the mesh. This attention layer consolidates variable-size guidance into a fixed-size representation, resulting in a unified, flexible segmentation model capable of predicting shape regions for various click settings.
iSeg is optimized per mesh to capture its unique segments, without any ground-truth annotations. We train the model with only a small fraction of the mesh vertices, while the model successfully infers segmentations for other vertices not used during training. iSeg further generalizes beyond its training data and computes complete segments in 3D for clicks and regions occluded from each other.
In summary, this paper presents iSeg, an interactive method for selecting customized fine-grained regions on a 3D shape. We distill inconsistent feature embeddings of a 2D foundation model into a coherent feature field over the mesh surface and decode it along with user inputs to segment the mesh on the fly. Our interactive attention mechanism handles a variable number of user clicks that can signify both the inclusion and exclusion of regions. We showcase the effectiveness of iSeg on a variety of meshes from different domains, including humanoids, animals, and man-made objects, and show its flexibility for various segmentation specifications.
3 Method
Given a 3D shape depicted as a mesh \(\mathcal {M}\) with vertices \(V = \lbrace v_i\rbrace _{i=1}^{n}\), \(v_i = (x_i, y_i, z_i) \in \mathbb {R}^3\), and a set of vertices selected by the user representing positive or negative clicks, our goal is to predict the per-vertex probability \(P = \lbrace p_i\rbrace _{i=1}^{n}\), \(p_i \in [0, 1]\), of belonging to a region adhering to the user inputs. Our system offers an interactive user interface, such that the number of clicks and their type can be varied, and the segmented region of the shape is updated accordingly.
We tackle the problem by proposing an interactive segmentation technique consisting of two parts: an encoder that maps vertex coordinates to a deep semantic vector, and a decoder that takes the vertex features and the user clicks and predicts the corresponding mesh segment. The decoder contains an interactive attention layer supporting a variable number of clicks, which can be positive or negative, to increase or decrease the segmented region. Fig. 3 presents an overview of the method.
3.1 Mesh Feature Field
Our encoder learns a function \(\phi :\mathbb {R}^{3} \rightarrow \mathbb {R}^{d}\) that embeds each mesh vertex into a deep feature vector \(\phi(v_i) = f_i\), where \(d\) is the feature dimension. The collection of mesh vertex features is denoted as \(F \in \mathbb {R}^{n \times d}\) and regarded as the Mesh Feature Field (MFF).
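As described below, the encoder is implemented as a multi-layer perceptron mapping vertex coordinates to \(d\)-dimensional features. A minimal PyTorch sketch of such an encoder follows; the hidden width and depth are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class MFFEncoder(nn.Module):
    """Maps vertex coordinates (n, 3) to the Mesh Feature Field (n, d).

    Hidden width and depth are illustrative choices, not the exact configuration.
    """
    def __init__(self, feat_dim: int = 256, hidden: int = 256, depth: int = 4):
        super().__init__()
        layers, in_dim = [], 3
        for _ in range(depth - 1):
            layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
            in_dim = hidden
        layers.append(nn.Linear(in_dim, feat_dim))
        self.mlp = nn.Sequential(*layers)

    def forward(self, verts: torch.Tensor) -> torch.Tensor:
        # verts: (n, 3) vertex coordinates -> (n, d) per-vertex features
        return self.mlp(verts)
```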
The encoder distills the semantic information from a pretrained 2D foundation model for image segmentation [Kirillov et al. 2023] and facilitates a 3D consistent feature representation for interactive segmentation of the mesh. To train the encoder, we render the high-dimensional vertex attributes differentiably and obtain the 2D projected features:
\[
I_f^{\theta} = \mathcal{R}(\mathcal{M}, f, \theta) \in \mathbb{R}^{w \times h \times d},
\]
where \(\mathcal{R}\) is a differentiable renderer, \(\theta\) is the viewing direction, \(f\) represents the features of the visible vertices in the view, and \(w \times h\) are the spatial dimensions of the rasterized image.
The encoder is implemented as a multi-layer perceptron network. To supervise its training, we render the mesh into a color image \(I_c^\theta\) and pass it through the encoder \(E_{2D}\) of the 2D foundation model [Kirillov et al. 2023] to obtain a reference feature map:
\[
\hat{I}_f^{\theta} = E_{2D}(I_c^{\theta}) \in \mathbb{R}^{w \times h \times d}.
\]
This process is repeated for multiple random viewing angles \(\Theta\), and our encoder is trained to minimize the discrepancy between the rendered MFF and the reference 2D features:
\[
\mathcal{L}_{MFF} = \sum_{\theta \in \Theta} \big\Vert I_f^{\theta} - \hat{I}_f^{\theta} \big\Vert .
\]
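The sketch below illustrates one MFF optimization step under the objective above. The renderer interface (render_attributes, render_color) and the frozen SAM encoder wrapper are placeholders, and the L2 discrepancy is an assumed instantiation of the loss.

```python
import torch
import torch.nn.functional as F_nn

def mff_distillation_step(encoder, mesh, renderer, sam_encoder, views, optimizer):
    """One optimization step of the Mesh Feature Field.

    `renderer` stands in for a differentiable rasterizer that can render either
    per-vertex attributes or vertex colors; `sam_encoder` is the frozen 2D
    foundation encoder. The L2 discrepancy is an assumed choice.
    """
    optimizer.zero_grad()
    vert_feats = encoder(mesh.verts)                                     # (n, d) MFF
    loss = 0.0
    for theta in views:                                                  # random viewing directions
        feat_img = renderer.render_attributes(mesh, vert_feats, theta)   # (w, h, d) rendered MFF
        with torch.no_grad():
            color_img = renderer.render_color(mesh, theta)               # (w, h, 3) color rendering
            ref_feats = sam_encoder(color_img)                           # (w, h, d) reference features
        loss = loss + F_nn.mse_loss(feat_img, ref_feats)
    loss.backward()
    optimizer.step()
    return loss.item()
```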
The 2D model operates on each image separately and might produce inconsistent features for different views of the shape. In contrast, our MFF is defined in 3D and is view-consistent by construction. It consolidates the information from the multiple views and lifts the 2D embeddings to a coherent field over the mesh surface. Additionally, we emphasize that the MFF is optimized independently of the user inputs. This is a key consideration in our method, resulting in a condition-agnostic representation describing inherent semantic properties of the mesh. We validate this design choice with an ablation experiment demonstrated in Fig. 16 and explained in the supplementary. The MFF is optimized until convergence and then utilized together with the user click prompts to compute the interactive mesh partition.
3.2 Interactive Attention Layer
The interactive attention layer is part of the decoding component of our system (Fig. 3). Its structure is illustrated in Fig. 4. The layer computes the interaction between the user input clicks and the mesh vertices, accommodating variable numbers and types of clicks, positive and negative. This key element in our method enables a unified decoder architecture supporting various user click settings.
Our interactive attention extends the scaled dot-product attention mechanism [Vaswani et al. 2017]. The features of the positively and negatively clicked vertices are denoted as \(F_{pos} \in \mathbb {R}^{n_{pos} \times d}\) and \(F_{neg} \in \mathbb {R}^{n_{neg} \times d}\), respectively, where \(n_{pos}\) and \(n_{neg}\) are the number of clicks of each type. The interactive attention layer projects the mesh features \(F\) to Queries and the features of the clicked points to Keys and Values:
\[
Q = F W^{Q}, \qquad K_{\lbrace pos, neg\rbrace} = F_{\lbrace pos, neg\rbrace} W^{K}_{\lbrace pos, neg\rbrace}, \qquad V_{\lbrace pos, neg\rbrace} = F_{\lbrace pos, neg\rbrace} W^{V}_{\lbrace pos, neg\rbrace},
\]
where \(W^Q, W^K_{\lbrace pos, neg\rbrace }, W^V_{\lbrace pos, neg\rbrace } \in \mathbb {R}^{d \times d}\) are learnable weight matrices. Then, the mesh vertices attend to the user clicks to obtain the conditioned mesh features:
\[
G = \mathrm{softmax}\!\left( \frac{Q K^{T}}{\sqrt{d}} \right) V,
\]
where \(K, V \in \mathbb {R}^{(n_{pos} + n_{neg}) \times d}\) are the concatenation of \(K_{pos}\), \(K_{neg}\) and \(V_{pos}\), \(V_{neg}\), respectively.
Our attention mechanism condenses variable interactive user inputs into a fixed-length output. It learns the representation of positive and negative clicks, correlates the mesh vertices with them, and yields updated vertex features to enable the on-the-fly segmentation of the shape. Another benefit of our attention layer is that it is permutation invariant w.r.t. the user clicks. In other words, it is independent of the sequential order of the point clicks and consistent in their joint influence on the shape partition. Moreover, the attention’s output G is permutation equivariant w.r.t. the vertex order in F, a desired property for the mesh data structure.
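A minimal PyTorch sketch of the interactive attention layer is given below. It follows the equations above; treating the layer as single-headed and omitting normalization or residual connections is an assumption made for brevity.

```python
import math
import torch
import torch.nn as nn

class InteractiveAttention(nn.Module):
    """Scaled dot-product attention between mesh vertices (queries) and clicked
    vertices (keys/values), with separate projections for positive and negative
    clicks. A single-headed sketch; architectural details are assumptions."""
    def __init__(self, d: int):
        super().__init__()
        self.d = d
        self.w_q = nn.Linear(d, d, bias=False)
        self.w_k_pos = nn.Linear(d, d, bias=False)
        self.w_k_neg = nn.Linear(d, d, bias=False)
        self.w_v_pos = nn.Linear(d, d, bias=False)
        self.w_v_neg = nn.Linear(d, d, bias=False)

    def forward(self, F, F_pos, F_neg):
        # F: (n, d) mesh features, F_pos: (n_pos, d), F_neg: (n_neg, d)
        Q = self.w_q(F)                                                   # (n, d)
        K = torch.cat([self.w_k_pos(F_pos), self.w_k_neg(F_neg)], dim=0)  # (n_pos + n_neg, d)
        V = torch.cat([self.w_v_pos(F_pos), self.w_v_neg(F_neg)], dim=0)  # (n_pos + n_neg, d)
        attn = torch.softmax(Q @ K.T / math.sqrt(self.d), dim=-1)         # (n, n_pos + n_neg)
        return attn @ V                                                   # (n, d) conditioned features G
```

Because the softmax is taken over the set of clicks, the output has a fixed size regardless of how many clicks are provided, and permuting the clicks leaves the result unchanged.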
3.3 Segmentation Prediction
The output of our model is a segmentation of the mesh that adheres to the user clicks, represented as the per-vertex probability of belonging to the desired region. To do so, we learn to decode the a posteriori condition-dependent vertex features \(g_i = [G]_i\) and the a priori inherent embedding \(f_i = [F]_i\) into the partition probability:
\[
p_i = \psi(f_i, g_i).
\]
\(\psi\) is implemented as a multi-layer perceptron network, where \(f_i\) and \(g_i\) are concatenated along the feature dimension at the network's input. The remaining question is: how do we supervise the training of such an ill-defined problem?
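A possible implementation of the decoder \(\psi\) is sketched below; the hidden width and depth are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SegmentationDecoder(nn.Module):
    """Decodes the concatenated a priori feature f_i and conditioned feature g_i
    into a per-vertex segment probability. Width and depth are illustrative."""
    def __init__(self, d: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * d, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, F, G):
        # F, G: (n, d) -> (n,) probabilities in [0, 1]
        return torch.sigmoid(self.mlp(torch.cat([F, G], dim=-1))).squeeze(-1)
```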
Similar to our encoder's training, we translate the problem to the 2D domain and harness the power of the 2D foundation model [Kirillov et al. 2023] for our 3D decoder learning (Fig. 3). We project the mesh probability map with a differentiable rasterizer to a probability image \(I_p^{\theta ^{\prime }} = \mathcal {R}(\mathcal {M}, p, \theta ^{\prime }) \in [0, 1]^{w^{\prime } \times h^{\prime } \times 2}\), where \(\theta ^{\prime }\) is the viewing angle, and the image channels represent the segment and background probabilities. Then, we project the 3D clicks to their corresponding 2D pixels and use them as prompts to segment the rendered color mesh image with the 2D segmentation model, resulting in the supervising probability mask \(I_m^{\theta ^{\prime }} \in \lbrace 0, 1\rbrace ^{w^{\prime } \times h^{\prime } \times 2}\). We randomize the viewing direction \(\theta ^{\prime }\) and train our decoder subject to the optimization objective:
\[
\mathcal{L}_{dec} = \sum_{\theta ^{\prime }} \mathrm{CE}\big(I_p^{\theta ^{\prime }}, I_m^{\theta ^{\prime }}\big),
\]
where CE denotes the binary cross-entropy loss.
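The sketch below illustrates one decoder training step under this objective. The differentiable rasterizer, click projection, and SAM prompting are placeholders, and the probability image is simplified to a single segment channel rather than the two-channel segment/background representation used above.

```python
import torch
import torch.nn.functional as F_nn

def decoder_training_step(decoder, attention, mesh, vert_feats, clicks_pos, clicks_neg,
                          renderer, sam_predictor, theta, optimizer):
    """One decoder update for a random view `theta`.

    `vert_feats` is the frozen, previously optimized MFF; `renderer` and
    `sam_predictor` are placeholders for a differentiable rasterizer and the
    frozen 2D segmentation model prompted with the projected clicks.
    """
    optimizer.zero_grad()
    G = attention(vert_feats, vert_feats[clicks_pos], vert_feats[clicks_neg])
    p = decoder(vert_feats, G)                                           # (n,) per-vertex probabilities
    prob_img = renderer.render_attributes(mesh, p.unsqueeze(-1), theta)  # (w, h, 1) rendered probabilities
    with torch.no_grad():
        pix_pos = renderer.project(mesh.verts[clicks_pos], theta)        # 2D pixel prompts
        pix_neg = renderer.project(mesh.verts[clicks_neg], theta)
        color_img = renderer.render_color(mesh, theta)
        mask = sam_predictor(color_img, pix_pos, pix_neg)                # (w, h, 1) binary mask
    loss = F_nn.binary_cross_entropy(prob_img.clamp(1e-6, 1 - 1e-6), mask.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```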
To prepare data for our 3D decoder training, we simulate user clicks and generate masks from the 2D model [Kirillov et al. 2023]. The data generation process includes two phases. First, we pick a small training subset of 3% of the mesh vertices well distributed over the shape using Farthest Point Sampling [Eldar et al. 1997], where each vertex is regarded as a single positive click. For each vertex, we generate random views, feed each one through the 2D foundation model, and get the reference segmentation mask. Then, for each view, we sample another training vertex visible within the viewing angle, which is set to be a positive or a negative click, and compute the updated segmentation by the 2D model. According to its type, we require the second click to increase or decrease the previous segmentation mask to obtain rich and diverse training data.
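The first phase relies on farthest point sampling to select a well-distributed 3% subset of the vertices. A simple greedy implementation in Euclidean distance (an assumption; other metrics are possible) is sketched below.

```python
import torch

def farthest_point_sampling(verts: torch.Tensor, ratio: float = 0.03) -> torch.Tensor:
    """Greedy farthest point sampling of vertex indices in Euclidean distance.

    verts: (n, 3) vertex coordinates. Returns the indices of the selected subset.
    """
    n = verts.shape[0]
    k = max(1, int(ratio * n))
    selected = torch.zeros(k, dtype=torch.long)
    dists = torch.full((n,), float("inf"))
    selected[0] = torch.randint(n, (1,))
    for i in range(1, k):
        # Distance of every vertex to its nearest already-selected vertex.
        dists = torch.minimum(dists, (verts - verts[selected[i - 1]]).norm(dim=-1))
        selected[i] = torch.argmax(dists)  # pick the farthest remaining vertex
    return selected
```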
As seen in Fig. 3, the supervision signal is highly inconsistent. The same 3D shape and clicks are interpreted differently by the 2D model across views, yielding strong variations in the 2D segmentation masks. Nevertheless, our method reveals a coherent underlying 3D segmentation function out of the noisy 2D measurements. Our decoder utilizes the robust distilled 3D vertex features, applies their interaction with the clicked points, and computes the region probability map directly in 3D. iSeg segmentations are view-consistent by construction, improving substantially over their training data. Furthermore, although trained with only 2D supervision, iSeg delineates meaningful regions in 3D that are not entirely visible in a single 2D projection (Fig. 5).
4 Experiments
We evaluate iSeg in a variety of aspects. First, we demonstrate the generality and fidelity of our method in Secs. 4.1 and 4.2, respectively. Then, in Sec. 4.3, we showcase the generic feature information captured by iSeg. Finally, Sec. 4.4 presents the strong generalization power of iSeg in terms of the selected point, views of the click, and the number of clicks.
We apply our method to diverse meshes from different sources: COSEG [van Kaick et al. 2011], Turbo Squid [TurboSquid 2021], Thingi10K [Zhou and Jacobson 2016], Toys4k [Rehg 2022], SCAPE [Anguelov et al. 2005], SHREC '19 [Melzi et al. 2019], ModelNet [Wu et al. 2015], ShapeNet [Chang et al. 2015], and PartNet [Mo et al. 2019]. iSeg is highly robust to the shape properties. It operates on meshes with different numbers of vertices and various geometries, including thin, flat, and high-curvature surfaces.
iSeg is implemented in PyTorch [Paszke et al. 2017], and its training time varies according to the number of mesh vertices. For a mesh with 3000 vertices, the optimization takes about 3 hours on a single Nvidia A40 GPU. Training the model is a one-time offline phase. Once trained, querying the model with input clicks takes only about 0.7 seconds, which allows fast interaction with the shape. In our experiments, we used SAM ViT-H with an image size of 224 × 224. To obtain fine-grained segmentations, we utilized the smallest-scale mask from SAM for the projected clicked points. Additional details are provided in the supplementary.
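At inference time, a click query amounts to a single forward pass through the attention layer and decoder, with the MFF computed once per mesh. The snippet below is an illustrative usage example; encoder, attention, decoder, mesh, and click_idx are assumed names for the trained modules and inputs.

```python
import torch

# Interactive querying of a trained iSeg model (names are illustrative).
with torch.no_grad():
    F = encoder(mesh.verts)                        # MFF, computed once per mesh
    pos = torch.tensor([click_idx])                # one positive click
    neg = torch.tensor([], dtype=torch.long)       # no negative clicks
    G = attention(F, F[pos], F[neg])               # click-conditioned features
    probs = decoder(F, G)                          # (n,) per-vertex probabilities
    segment = probs > 0.5                          # binary selection of the region
```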
4.1 Generality of iSeg
iSeg is highly versatile and works on a variety of shapes and geometries. It is not limited to any specific shape category nor a pre-defined set of parts and can be applied to meshes from various domains, including humanoids, animals, musical instruments, household objects, and more. Our method is also applicable to shapes with complex geometric structures and is optimized to capture the elements of the given object.
Fig. 2 presents different single-click results. iSeg successfully segments regions with sharp edges, such as the neck of the lamp and the thin spokes of the bike. It also accurately captures the flat surface of the alien's head and the curved lower part of its leg. Moreover, iSeg can segment small parts of the shape, such as the bike's seat and the water bottle, or larger portions, such as the body parts of the camel.
4.2 Fidelity of iSeg
Our method's training is supervised by segmentation masks generated from SAM [Kirillov et al. 2023] for 2D renderings of the shape. As we show in Fig. 3, SAM's masks differ substantially between views. In contrast, iSeg manages to fuse the noisy training examples into a coherent 3D segmentation model that corresponds to the clicked vertices. Examples are presented in Figs. 2 and 6. In the supplementary, we further demonstrate the method's 3D consistency.
iSeg is adapted to the granularity of the given mesh, which enables it to adhere to the user's clicks and segment the region of the shape related to the user's inputs. For example, in Fig. 2, for a click on the alien's middle antenna, the entire antenna and only that particular antenna is selected. We see similar behavior for other clicks, such as the one on the lamp's base and the camel's neck.
In Fig. 6, we further show the results of iSeg for a couple of clicks. We incorporate either a second positive click that extends the segmented part or a second negative click that retracts the region. For example, with the first click, the fine-grained bulb area of the lamp is segmented. Then, the second positive click allows the user to include the flat surface surrounding the bulb. The negative click, on the other hand, offers the control to reduce and refine the segmentation region. As shown in Fig. 6, the first click on the hammer's head segments the entire head. The front part can be easily and intuitively removed by a second negative click on it.
Quantitative evaluation. As far as we can ascertain, there are no annotated datasets for click-based interactive segmentation of 3D shapes. Thus, we adapted the part segmentation dataset PartNet [Mo et al. 2019] for our setting. The evaluation included 170 meshes sourced from all the categories in the dataset. For each mesh, we selected five test vertices from a part at random, where each vertex was regarded as a single click, and measured how well the part was segmented. We used two evaluation metrics: Accuracy (ACC) and Intersection over Union (IoU). Further details about the evaluation are provided in the supplemental material. We considered two alternatives for comparison: InterObject3D [Kontogianni et al. 2023], a recent work on interactive segmentation of 3D objects, and a baseline we constructed based on SAM's 2D segmentations. For InterObject3D, we employed the publicly available pretrained model released by the authors. For the SAM baseline, we rendered the shape and projected the clicked point to 2D from 100 random views, computed SAM's mask, and re-projected the result back to 3D for each visible vertex. Then, we averaged the predictions according to the number of times each vertex was seen.
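The following sketch outlines the SAM baseline's multi-view averaging. The renderer's visibility query and the click-prompted SAM wrapper are placeholders for the actual implementation.

```python
import torch

def sam_multiview_baseline(mesh, click_idx, renderer, sam_predictor, num_views=100):
    """Averages re-projected 2D SAM masks over random views to approximate a
    3D segmentation. `renderer` and `sam_predictor` are placeholders for a
    rasterizer with a visibility query and the click-prompted 2D model."""
    n = mesh.verts.shape[0]
    votes = torch.zeros(n)
    counts = torch.zeros(n)
    for _ in range(num_views):
        theta = renderer.random_view()
        visible, pixels = renderer.visible_vertices(mesh, theta)  # vertex indices, integer pixel coords
        hit = (visible == click_idx).nonzero(as_tuple=True)[0]
        if hit.numel() == 0:
            continue                                              # the clicked vertex must be visible
        image = renderer.render_color(mesh, theta)
        sam_mask = sam_predictor(image, prompt=pixels[hit[0]])    # (h, w) binary mask
        votes[visible] += sam_mask[pixels[:, 1], pixels[:, 0]].float()
        counts[visible] += 1.0
    return votes / counts.clamp(min=1.0)                          # per-vertex averaged prediction
```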
In the supplementary material, we discuss additional baselines we devised using SAM. Other interactive segmentation techniques use an implicit 3D representation [Chen et al. 2023b] or perform a different task (object detection) [Zhang et al. 2023], and thus are not directly comparable to our method.
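For reference, the sketch below shows how the per-vertex ACC and IoU can be computed from a predicted probability map and a binary part mask; thresholding at 0.5 is an assumed choice.

```python
import torch

def accuracy_and_iou(pred_probs: torch.Tensor, gt_mask: torch.Tensor, thresh: float = 0.5):
    """Per-vertex accuracy and IoU between predicted probabilities and a binary
    ground-truth part mask (the 0.5 threshold is an assumed choice)."""
    pred = pred_probs > thresh
    gt = gt_mask.bool()
    acc = (pred == gt).float().mean().item()
    intersection = (pred & gt).sum().item()
    union = (pred | gt).sum().item()
    iou = intersection / union if union > 0 else 1.0
    return acc, iou
```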
Tab. 1 presents the ACC and IoU averaged over the test clicks and shapes, and Fig. 7 shows visual examples. InterObject3D does not select the part properly and yields a partial segmentation. The SAM baseline segments the region of the part where the click is visible; however, occluded regions that belong to the part are not marked. In contrast, iSeg adheres to the clicked point. It segments a coherent region in 3D, which is similar to the ground-truth part label, and achieves much higher ACC and IoU than the baselines.

Perceptual user study. iSeg is not limited to a particular shape type from a dataset or specific parts defined in the dataset. In such cases, ground-truth labels are unavailable. Thus, to evaluate the effectiveness of the flexible and diverse segmentations offered by our method, we opt to perform a perceptual user study. We used 20 meshes from different categories, such as humanoids, animals, and man-made objects, and included 40 participants in our study.
For each mesh, we showed the 3D segmentation for a clicked point from multiple viewing angles and asked the participants to rate the effectiveness of the result on a scale of 1 to 5. The score 5 refers to a completely effective segmentation, where the entire 3D region corresponding to the clicked point is selected. When part of the 3D region is marked, the segmentation is considered partially effective. The score 1 is defined as a completely ineffective segmentation and refers to no region selection. Examples are presented in Fig. 14.
Tab. 2 summarizes the effectiveness score averaged over all the meshes and participants. Fig. 15 further compares our method with the SAM baseline. As seen in the figure and reflected by the table, the participants rated the effectiveness of iSeg much higher than the other methods, indicating its fidelity to the clicked point.
Applications. iSeg computes contiguous and localized shape partitions. These segmentations can be used to extract shape parts easily and enable applications such as full-shape segmentation and local geometric editing of the mesh. These results are demonstrated in Figs. 9, 10, and 13. Further details and discussion appear in the supplementary material.

4.3 Generic Feature Information
Our mesh feature field is distilled directly from SAM's encoder and is independent of the user's clicks for segmentation. Thus, although iSeg is optimized per mesh, the semantic feature representation is shared across shapes. We demonstrate this property by cross-domain segmentation. In this experiment, we use the encoded vertex features from one shape and predict the region probability with a different shape's decoder. Fig. 8 shows examples. The transferable features enable the creation of cross-domain shape analogies, such as how the belly of a human corresponds to the “belly” of an airplane.
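The sketch below illustrates the cross-domain experiment: the MFF of shape A is fed through the attention layer and decoder optimized for shape B. Module and variable names are illustrative.

```python
import torch

def cross_domain_segmentation(encoder_a, attention_b, decoder_b, verts_a, pos_idx, neg_idx):
    """Feeds shape A's mesh feature field through shape B's attention and decoder,
    mirroring the cross-domain experiment (module names are illustrative)."""
    with torch.no_grad():
        feats_a = encoder_a(verts_a)                                     # MFF of shape A
        cond = attention_b(feats_a, feats_a[pos_idx], feats_a[neg_idx])  # B's click conditioning
        return decoder_b(feats_a, cond)                                  # per-vertex probabilities on A
```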
4.4 Generalization Capabilities
Unseen mesh vertices. Our method exhibits strong generalization power. First, we emphasize that we train on just a small fraction of 3% of the mesh vertices. Still, iSeg is successfully applied to other mesh vertices unseen during training and properly respects the clicked points, as shown in Figs. 2 and 6 and discussed in Sec. 4.2. We note that all the results shown in the paper are for test vertices.
Unseen views. Although iSeg was trained with 2D supervision only, its predictions are 3D in nature. Fig. 5 exemplifies this phenomenon. A click on the back side of the backrest segments the front side as well. Such supervision does not exist for our model's training, since the clicked vertex cannot be seen from the front side. Similarly, two clicks at opposite sides of the backrest segment the entire part, although they cannot be seen together from any single 2D view. This result suggests that iSeg learned 3D-consistent semantic vertex information, enabling it to generalize beyond its 2D supervision.
Unseen number of clicks. For resource-efficient training, we trained iSeg on up to two clicks: a single click, a second positive click, and a second negative click. Nonetheless, our model offers customized segmentation with more than two clicks, as demonstrated in Fig. 11. We attribute this capability to the interactive attention mechanism, which appears to have learned the representation of positive and negative clicks and how to attend to each click for a meaningful multi-click segmentation.
Limitations. Our method may not follow the symmetry of the mesh exactly, as exemplified in Fig. 12. For a click on the goat's head, the segmented regions on the two sides of the head differ somewhat from each other, since iSeg is not trained to segment those regions identically. In the supplementary, we discuss the potential limitation of processing 3D shapes using MLPs operating in Euclidean space.

5 Conclusion
In this work, we presented iSeg, a technique for interactively generating fine-grained tailored segmentations of 3D meshes. We opt to lift features from a powerful pretrained 2D segmentation model onto a 3D mesh, which can be used to create customized user-specified segmentations. Our mesh feature field is general and may be used for additional tasks such as cross-segmentation across shapes of different categories (e.g., Fig. 8). Key to our method is an interactive attention mechanism that learns a unified representation for a varied number of positive or negative point clicks. Our 3D-consistent segmentation enables selecting points across occluded surfaces and segmenting meaningful regions directly in 3D (e.g., Fig. 5).
In the future, we are interested in exploring additional applications of iSeg beyond segmentation. We have demonstrated that it can potentially be used for cross-domain segmentation, and there may be other exciting applications, such as key-point correspondence, texture transfer, and more.