1 Introduction
Interactive 3D segmentation, the ability to select fine-grained segments from a 3D shape based on user inputs like clicks, is a fundamental problem in computer graphics with broad implications. In fields such as computer-aided design and 3D modeling, precise segment selection facilitates detailed model refinement. Moreover, in engineering, architecture, and medicine, fine-grained selection is indispensable for simulation and analysis, allowing for accurate assessment of structural integrity and behavior. While important, this problem poses significant challenges. How can we decipher user intentions from input as minimal as clicks? How do we handle diverse shapes with varying geometries and select specific and unique shape parts? In this work, we propose a method tailored to the shape at hand that selects regions adhering to the user clicks. Traditional segmentation techniques do not utilize user inputs and instead rely on geometric features to delineate semantic parts [Cornea et al. 2007; Dey and Zhao 2004; Hoffman and Richards 1984; Lien and Amato 2007; Shamir 2008; Zheng et al. 2015]. Recent data-driven techniques have further leveraged fully annotated 3D datasets and achieved impressive 3D segmentation results [Chen et al. 2019; Deng et al. 2020; Hanocka et al. 2019; Hu et al. 2022; Milano et al. 2020; Milletari et al. 2016; Qi et al. 2017; Sharp et al. 2022; Sun et al. 2021; Yi et al. 2017; Zhu et al. 2020]. However, the reliance on annotated data and the scarcity of large-scale 3D datasets limit such networks to a specific shape domain with a predefined set of parts.
Current 3D segmentation methods have circumvented the dependency on 3D data and predefined part definitions by utilizing pretrained 2D foundation models to learn semantic co-segmentation [Ye et al. 2023] or text-driven segmentation [Abdelreheem et al. 2023a; 2023b; Decatur et al. 2024; 2023; Ha and Song 2022; Kim and Sung 2024; Liu et al. 2023]. Nonetheless, text may not be able to accurately describe all fine-grained segmentations, such as the fourth leg of an octopus or a region corresponding to a particular point on the shape.
In this paper, we present iSeg, a new data-driven interactive technique for 3D shape segmentation that generates customized partitions of the shape according to user clicks. Given a shape represented as a triangular mesh, the user selects points on the mesh interactively to indicate a desired segmentation, and iSeg predicts a region over the mesh surface that adheres to the clicked points. Our interactive interface can utilize positive and negative clicks, enabling additions and exclusions of areas from the segmented region, respectively (see Fig. 1).
We harness the power of a pretrained 2D foundation segmentation model [Kirillov et al. 2023] and distill its knowledge to 3D. However, segmenting a meaningful 3D region using a 2D model is very challenging, since occluded shape regions cannot be seen together from a single 2D view. Accordingly, we design an interactive segmentation system that operates entirely in 3D, where the user clicks and the inferred corresponding region are applied directly over the shape surface, ensuring 3D consistency by construction. During training only, we project the 3D user clicks and the predicted segmentation to multiple 2D views to enable supervision from the powerful pretrained foundation model [Kirillov et al. 2023].
For interactivity, our system must accommodate different user inputs, namely point clicks that can vary in number and type. Instead of training a separate segmentation model for each user click configuration, we propose a novel interactive attention mechanism, which learns the representation of positive and negative clicks and computes their interaction with the other points of the mesh. This attention layer consolidates variable-size guidance into a fixed-size representation, resulting in a unified, flexible segmentation model capable of predicting shape regions for various click settings.
iSeg is optimized per mesh to capture its unique segments, without any ground-truth annotations. We train the model with only a small fraction of the mesh vertices, while the model successfully infers segmentations for other vertices not used during training. iSeg further generalizes beyond its training data and computes complete segments in 3D for clicks and regions occluded from each other.
In summary, this paper presents iSeg, an interactive method for selecting customized fine-grained regions on a 3D shape. We distill inconsistent feature embeddings of a 2D foundation model into a coherent feature field over the mesh surface and decode it along with user inputs to segment the mesh on the fly. Our interactive attention mechanism handles a variable number of user clicks that can signify both the inclusion and exclusion of regions. We showcase the effectiveness of iSeg on a variety of meshes from different domains, including humanoids, animals, and man-made objects, and show its flexibility for various segmentation specifications.
3 Method
Given a 3D shape depicted as a mesh \(\mathcal {M}\) with vertices \(V = \lbrace v_i\rbrace _{i=1}^{n}\), \(v_i = (x_i, y_i, z_i) \in \mathbb {R}^3\), and a set of vertices selected by the user representing positive or negative clicks, our goal is to predict the per-vertex probability \(P = \lbrace p_i\rbrace _{i=1}^{n}\), \(p_i \in [0, 1]\), of belonging to a region adhering to the user inputs. Our system offers an interactive user interface, such that the number of clicks and their type can be varied, and the segmented region of the shape is updated accordingly.
We tackle the problem by proposing an interactive segmentation technique consisting of two parts: an encoder that maps vertex coordinates to a deep semantic vector, and a decoder that takes the vertex features and the user clicks and predicts the corresponding mesh segment. The decoder contains an interactive attention layer supporting a variable number of clicks, which can be positive or negative, to increase or decrease the segmented region. Fig. 3 presents an overview of the method.
3.1 Mesh Feature Field
Our encoder learns a function \(\phi :\mathbb {R}^{3} \rightarrow \mathbb {R}^{d}\) that embeds each mesh vertex into a deep feature vector \(\phi(v_i) = f_i\), where \(d\) is the feature dimension. The collection of mesh vertex features is denoted as \(F \in \mathbb {R}^{n \times d}\) and regarded as the Mesh Feature Field (MFF).
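As described below, the encoder is implemented as a multi-layer perceptron mapping vertex coordinates to \(d\)-dimensional features. A minimal PyTorch sketch of such an encoder follows; the hidden width and depth are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class MFFEncoder(nn.Module):
    """Maps vertex coordinates (n, 3) to the Mesh Feature Field (n, d).

    Hidden width and depth are illustrative choices, not the exact configuration.
    """
    def __init__(self, feat_dim: int = 256, hidden: int = 256, depth: int = 4):
        super().__init__()
        layers, in_dim = [], 3
        for _ in range(depth - 1):
            layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
            in_dim = hidden
        layers.append(nn.Linear(in_dim, feat_dim))
        self.mlp = nn.Sequential(*layers)

    def forward(self, verts: torch.Tensor) -> torch.Tensor:
        # verts: (n, 3) vertex coordinates -> (n, d) per-vertex features
        return self.mlp(verts)
```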
The encoder distills the semantic information from a pretrained 2D foundation model for image segmentation [Kirillov et al. 2023] and facilitates a 3D consistent feature representation for interactive segmentation of the mesh. To train the encoder, we render the high-dimensional vertex attributes differentiably and obtain the 2D projected features:
\[
I_f^{\theta} = \mathcal{R}(\mathcal{M}, f, \theta) \in \mathbb{R}^{w \times h \times d},
\]
where \(\mathcal{R}\) is a differentiable renderer, \(\theta\) is the viewing direction, \(f\) represents the features of the visible vertices in the view, and \(w \times h\) are the spatial dimensions of the rasterized image.
The encoder is implemented as a multi-layer perceptron network. To supervise its training, we render the mesh into a color image \(I_c^\theta\) and pass it through the encoder \(E_{2D}\) of the 2D foundation model [Kirillov et al. 2023] to obtain a reference feature map:
\[
\hat{I}_f^{\theta} = E_{2D}(I_c^{\theta}) \in \mathbb{R}^{w \times h \times d}.
\]
This process is repeated for multiple random viewing angles \(\Theta\), and our encoder is trained to minimize the discrepancy between the rendered MFF and the reference 2D features:
\[
\mathcal{L}_{MFF} = \sum_{\theta \in \Theta} \big\Vert I_f^{\theta} - \hat{I}_f^{\theta} \big\Vert .
\]
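The sketch below illustrates one MFF optimization step under the objective above. The renderer interface (render_attributes, render_color) and the frozen SAM encoder wrapper are placeholders, and the L2 discrepancy is an assumed instantiation of the loss.

```python
import torch
import torch.nn.functional as F_nn

def mff_distillation_step(encoder, mesh, renderer, sam_encoder, views, optimizer):
    """One optimization step of the Mesh Feature Field.

    `renderer` stands in for a differentiable rasterizer that can render either
    per-vertex attributes or vertex colors; `sam_encoder` is the frozen 2D
    foundation encoder. The L2 discrepancy is an assumed choice.
    """
    optimizer.zero_grad()
    vert_feats = encoder(mesh.verts)                                     # (n, d) MFF
    loss = 0.0
    for theta in views:                                                  # random viewing directions
        feat_img = renderer.render_attributes(mesh, vert_feats, theta)   # (w, h, d) rendered MFF
        with torch.no_grad():
            color_img = renderer.render_color(mesh, theta)               # (w, h, 3) color rendering
            ref_feats = sam_encoder(color_img)                           # (w, h, d) reference features
        loss = loss + F_nn.mse_loss(feat_img, ref_feats)
    loss.backward()
    optimizer.step()
    return loss.item()
```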
The 2D model operates on each image separately and might produce inconsistent features for different views of the shape. In contrast, our MFF is defined in 3D and is view-consistent by construction. It consolidates the information from the multiple views and lifts the 2D embeddings to a coherent field over the mesh surface. Additionally, we emphasize that the MFF is optimized independently of the user inputs. This is a key consideration in our method, resulting in a condition-agnostic representation describing inherent semantic properties of the mesh. We validate this design choice with an ablation experiment demonstrated in Fig. 16 and explained in the supplementary. The MFF is optimized until convergence and then utilized together with the user click prompts to compute the interactive mesh partition.
3.2 Interactive Attention Layer
The interactive attention layer is part of the decoding component of our system (Fig. 3). Its structure is illustrated in Fig. 4. The layer computes the interaction between the user input clicks and the mesh vertices, accommodating variable numbers and types of clicks, positive and negative. This key element in our method enables a unified decoder architecture supporting various user click settings.
Our interactive attention extends the scaled dot-product attention mechanism [Vaswani et al. 2017]. The features of the positively and negatively clicked vertices are denoted as \(F_{pos} \in \mathbb {R}^{n_{pos} \times d}\) and \(F_{neg} \in \mathbb {R}^{n_{neg} \times d}\), respectively, where \(n_{pos}\) and \(n_{neg}\) are the number of clicks of each type. The interactive attention layer projects the mesh features \(F\) to Queries and the features of the clicked points to Keys and Values:
\[
Q = F W^{Q}, \qquad K_{\lbrace pos, neg\rbrace} = F_{\lbrace pos, neg\rbrace} W^{K}_{\lbrace pos, neg\rbrace}, \qquad V_{\lbrace pos, neg\rbrace} = F_{\lbrace pos, neg\rbrace} W^{V}_{\lbrace pos, neg\rbrace},
\]
where \(W^Q, W^K_{\lbrace pos, neg\rbrace }, W^V_{\lbrace pos, neg\rbrace } \in \mathbb {R}^{d \times d}\) are learnable weight matrices. Then, the mesh vertices attend to the user clicks to obtain the conditioned mesh features:
\[
G = \mathrm{softmax}\!\left( \frac{Q K^{T}}{\sqrt{d}} \right) V,
\]
where \(K, V \in \mathbb {R}^{(n_{pos} + n_{neg}) \times d}\) are the concatenation of \(K_{pos}\), \(K_{neg}\) and \(V_{pos}\), \(V_{neg}\), respectively.
Our attention mechanism condenses variable interactive user inputs into a fixed-length output. It learns the representation of positive and negative clicks, correlates the mesh vertices with them, and yields updated vertex features to enable the on-the-fly segmentation of the shape. Another benefit of our attention layer is that it is permutation invariant w.r.t. the user clicks. In other words, it is independent of the sequential order of the point clicks and consistent in their joint influence on the shape partition. Moreover, the attention’s output G is permutation equivariant w.r.t. the vertex order in F, a desired property for the mesh data structure.
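A minimal PyTorch sketch of the interactive attention layer is given below. It follows the equations above; treating the layer as single-headed and omitting normalization or residual connections is an assumption made for brevity.

```python
import math
import torch
import torch.nn as nn

class InteractiveAttention(nn.Module):
    """Scaled dot-product attention between mesh vertices (queries) and clicked
    vertices (keys/values), with separate projections for positive and negative
    clicks. A single-headed sketch; architectural details are assumptions."""
    def __init__(self, d: int):
        super().__init__()
        self.d = d
        self.w_q = nn.Linear(d, d, bias=False)
        self.w_k_pos = nn.Linear(d, d, bias=False)
        self.w_k_neg = nn.Linear(d, d, bias=False)
        self.w_v_pos = nn.Linear(d, d, bias=False)
        self.w_v_neg = nn.Linear(d, d, bias=False)

    def forward(self, F, F_pos, F_neg):
        # F: (n, d) mesh features, F_pos: (n_pos, d), F_neg: (n_neg, d)
        Q = self.w_q(F)                                                   # (n, d)
        K = torch.cat([self.w_k_pos(F_pos), self.w_k_neg(F_neg)], dim=0)  # (n_pos + n_neg, d)
        V = torch.cat([self.w_v_pos(F_pos), self.w_v_neg(F_neg)], dim=0)  # (n_pos + n_neg, d)
        attn = torch.softmax(Q @ K.T / math.sqrt(self.d), dim=-1)         # (n, n_pos + n_neg)
        return attn @ V                                                   # (n, d) conditioned features G
```

Because the softmax is taken over the set of clicks, the output has a fixed size regardless of how many clicks are provided, and permuting the clicks leaves the result unchanged.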
3.3 Segmentation Prediction
The output of our model is a segmentation of the mesh that adheres to the user clicks, represented as the per-vertex probability of belonging to the desired region. To do so, we learn to decode the a posteriori condition-dependent vertex features \(g_i = [G]_i\) and the a priori inherent embedding \(f_i = [F]_i\) into the partition probability:
\[
p_i = \psi(f_i, g_i).
\]
\(\psi\) is implemented as a multi-layer perceptron network, where \(f_i\) and \(g_i\) are concatenated along the feature dimension at the network's input. The remaining question is: how do we supervise the training of such an ill-defined problem?
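A possible implementation of the decoder \(\psi\) is sketched below; the hidden width and depth are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SegmentationDecoder(nn.Module):
    """Decodes the concatenated a priori feature f_i and conditioned feature g_i
    into a per-vertex segment probability. Width and depth are illustrative."""
    def __init__(self, d: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * d, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, F, G):
        # F, G: (n, d) -> (n,) probabilities in [0, 1]
        return torch.sigmoid(self.mlp(torch.cat([F, G], dim=-1))).squeeze(-1)
```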
Similar to our encoder's training, we translate the problem to the 2D domain and harness the power of the 2D foundation model [Kirillov et al. 2023] for our 3D decoder learning (Fig. 3). We project the mesh probability map with a differentiable rasterizer to a probability image \(I_p^{\theta ^{\prime }} = \mathcal {R}(\mathcal {M}, p, \theta ^{\prime }) \in [0, 1]^{w^{\prime } \times h^{\prime } \times 2}\), where \(\theta ^{\prime }\) is the viewing angle, and the image channels represent the segment and background probabilities. Then, we project the 3D clicks to their corresponding 2D pixels and use them as prompts to segment the rendered color mesh image with the 2D segmentation model, resulting in the supervising probability mask \(I_m^{\theta ^{\prime }} \in \lbrace 0, 1\rbrace ^{w^{\prime } \times h^{\prime } \times 2}\). We randomize the viewing direction \(\theta ^{\prime }\) and train our decoder subject to the optimization objective:
\[
\mathcal{L}_{dec} = \sum_{\theta ^{\prime }} \mathrm{CE}\big(I_p^{\theta ^{\prime }}, I_m^{\theta ^{\prime }}\big),
\]
where CE denotes the binary cross-entropy loss.
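The sketch below illustrates one decoder training step under this objective. The differentiable rasterizer, click projection, and SAM prompting are placeholders, and the probability image is simplified to a single segment channel rather than the two-channel segment/background representation used above.

```python
import torch
import torch.nn.functional as F_nn

def decoder_training_step(decoder, attention, mesh, vert_feats, clicks_pos, clicks_neg,
                          renderer, sam_predictor, theta, optimizer):
    """One decoder update for a random view `theta`.

    `vert_feats` is the frozen, previously optimized MFF; `renderer` and
    `sam_predictor` are placeholders for a differentiable rasterizer and the
    frozen 2D segmentation model prompted with the projected clicks.
    """
    optimizer.zero_grad()
    G = attention(vert_feats, vert_feats[clicks_pos], vert_feats[clicks_neg])
    p = decoder(vert_feats, G)                                           # (n,) per-vertex probabilities
    prob_img = renderer.render_attributes(mesh, p.unsqueeze(-1), theta)  # (w, h, 1) rendered probabilities
    with torch.no_grad():
        pix_pos = renderer.project(mesh.verts[clicks_pos], theta)        # 2D pixel prompts
        pix_neg = renderer.project(mesh.verts[clicks_neg], theta)
        color_img = renderer.render_color(mesh, theta)
        mask = sam_predictor(color_img, pix_pos, pix_neg)                # (w, h, 1) binary mask
    loss = F_nn.binary_cross_entropy(prob_img.clamp(1e-6, 1 - 1e-6), mask.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```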
To prepare data for our 3D decoder training, we simulate user clicks and generate masks from the 2D model [Kirillov et al. 2023]. The data generation process includes two phases. First, we pick a small training subset of 3% of the mesh vertices well distributed over the shape using Farthest Point Sampling [Eldar et al. 1997], where each vertex is regarded as a single positive click. For each vertex, we generate random views, feed each one through the 2D foundation model, and get the reference segmentation mask. Then, for each view, we sample another training vertex visible within the viewing angle, which is set to be a positive or a negative click, and compute the updated segmentation by the 2D model. According to its type, we require the second click to increase or decrease the previous segmentation mask to obtain rich and diverse training data.
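The first phase relies on farthest point sampling to select a well-distributed 3% subset of the vertices. A simple greedy implementation in Euclidean distance (an assumption; other metrics are possible) is sketched below.

```python
import torch

def farthest_point_sampling(verts: torch.Tensor, ratio: float = 0.03) -> torch.Tensor:
    """Greedy farthest point sampling of vertex indices in Euclidean distance.

    verts: (n, 3) vertex coordinates. Returns the indices of the selected subset.
    """
    n = verts.shape[0]
    k = max(1, int(ratio * n))
    selected = torch.zeros(k, dtype=torch.long)
    dists = torch.full((n,), float("inf"))
    selected[0] = torch.randint(n, (1,))
    for i in range(1, k):
        # Distance of every vertex to its nearest already-selected vertex.
        dists = torch.minimum(dists, (verts - verts[selected[i - 1]]).norm(dim=-1))
        selected[i] = torch.argmax(dists)  # pick the farthest remaining vertex
    return selected
```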
As seen in Fig. 3, the supervision signal is highly inconsistent. The same 3D shape and clicks are interpreted differently by the 2D model across views, yielding strong variations in the 2D segmentation masks. Nevertheless, our method reveals a coherent underlying 3D segmentation function out of the noisy 2D measurements. Our decoder utilizes the robust distilled 3D vertex features, applies their interaction with the clicked points, and computes the region probability map directly in 3D. iSeg segmentations are view-consistent by construction, improving substantially over their training data. Furthermore, although trained with only 2D supervision, iSeg delineates meaningful regions in 3D that are not entirely visible in a single 2D projection (Fig. 5).
4 Experiments
We evaluate iSeg in a variety of aspects. First, we demonstrate the generality and fidelity of our method in Secs. 4.1 and 4.2, respectively. Then, in Sec. 4.3, we showcase the generic feature information captured by iSeg. Finally, Sec. 4.4 presents the strong generalization power of iSeg in terms of the selected point, views of the click, and the number of clicks.
We apply our method to diverse meshes from different sources: COSEG [van Kaick et al. 2011], Turbo Squid [TurboSquid 2021], Thingi10K [Zhou and Jacobson 2016], Toys4k [Rehg 2022], SCAPE [Anguelov et al. 2005], SHREC '19 [Melzi et al. 2019], ModelNet [Wu et al. 2015], ShapeNet [Chang et al. 2015], and PartNet [Mo et al. 2019]. iSeg is highly robust to the shape properties. It operates on meshes with different numbers of vertices and various geometries, including thin, flat, and high-curvature surfaces.
iSeg is implemented in PyTorch [Paszke et al. 2017], and its training time varies according to the number of mesh vertices. For a mesh with 3000 vertices, the optimization takes about 3 hours on a single Nvidia A40 GPU. Training the model is a one-time offline phase. Once trained, querying the model with input clicks takes only about 0.7 seconds, which allows fast interaction with the shape. In our experiments, we used SAM ViT-H with an image size of 224 × 224. To obtain fine-grained segmentations, we utilized the smallest-scale mask from SAM for the projected clicked points. Additional details are provided in the supplementary.
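At inference time, a click query amounts to a single forward pass through the attention layer and decoder, with the MFF computed once per mesh. The snippet below is an illustrative usage example; encoder, attention, decoder, mesh, and click_idx are assumed names for the trained modules and inputs.

```python
import torch

# Interactive querying of a trained iSeg model (names are illustrative).
with torch.no_grad():
    F = encoder(mesh.verts)                        # MFF, computed once per mesh
    pos = torch.tensor([click_idx])                # one positive click
    neg = torch.tensor([], dtype=torch.long)       # no negative clicks
    G = attention(F, F[pos], F[neg])               # click-conditioned features
    probs = decoder(F, G)                          # (n,) per-vertex probabilities
    segment = probs > 0.5                          # binary selection of the region
```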
4.1 Generality of iSeg
iSeg is highly versatile and works on a variety of shapes and geometries. It is not limited to any specific shape category nor a pre-defined set of parts and can be applied to meshes from various domains, including humanoids, animals, musical instruments, household objects, and more. Our method is also applicable to shapes with complex geometric structures and is optimized to capture the elements of the given object.
Fig. 2 presents different single-click results. iSeg successfully segments regions with sharp edges, such as the neck of the lamp and the thin spokes of the bike. It also accurately captures the flat surface of the alien's head and the curved lower part of its leg. Moreover, iSeg can segment small parts of the shape, such as the bike's seat and the water bottle, or larger portions, such as the body parts of the camel.
4.2 Fidelity of iSeg
Our method's training is supervised by segmentation masks generated from SAM [Kirillov et al. 2023] for 2D renderings of the shape. As we show in Fig. 3, SAM's masks differ substantially between views. In contrast, iSeg manages to fuse the noisy training examples into a coherent 3D segmentation model that corresponds to the clicked vertices. Examples are presented in Figs. 2 and 6. In the supplementary, we further demonstrate the method's 3D consistency.
iSeg is adapted to the granularity of the given mesh, which enables it to adhere to the user's clicks and segment the region of the shape related to the user's inputs. For example, in Fig. 2, for a click on the alien's middle antenna, the entire antenna and only that particular antenna is selected. We see similar behavior for other clicks, such as the one on the lamp's base and the camel's neck.
In Fig. 6, we further show the results of iSeg for a couple of clicks. We incorporate either a second positive click that extends the segmented part or a second negative click that retracts the region. For example, with the first click, the fine-grained bulb area of the lamp is segmented. Then, the second positive click allows the user to include the flat surface surrounding the bulb. The negative click, on the other hand, offers the control to reduce and refine the segmentation region. As shown in Fig. 6, the first click on the hammer's head segments the entire head. The front part can be easily and intuitively removed by a second negative click on it.
Quantitative evaluation. As far as we can ascertain, there are no annotated datasets for click-based interactive segmentation of 3D shapes. Thus, we adapted the part segmentation dataset PartNet [Mo et al. 2019] for our setting. The evaluation included 170 meshes sourced from all the categories in the dataset. For each mesh, we selected five test vertices from a part at random, where each vertex was regarded as a single click, and measured how well the part was segmented. We used two evaluation metrics: Accuracy (ACC) and Intersection over Union (IoU). Further details about the evaluation are provided in the supplemental material. We considered two alternatives for comparison: InterObject3D [Kontogianni et al. 2023], a recent work on interactive segmentation of 3D objects, and a baseline we constructed based on SAM's 2D segmentations. For InterObject3D, we employed the publicly available pretrained model released by the authors. For the SAM baseline, we rendered the shape and projected the clicked point to 2D from 100 random views, computed SAM's mask, and re-projected the result back to 3D for each visible vertex. Then, we averaged the predictions according to the number of times each vertex was seen.
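The following sketch outlines the SAM baseline's multi-view averaging. The renderer's visibility query and the click-prompted SAM wrapper are placeholders for the actual implementation.

```python
import torch

def sam_multiview_baseline(mesh, click_idx, renderer, sam_predictor, num_views=100):
    """Averages re-projected 2D SAM masks over random views to approximate a
    3D segmentation. `renderer` and `sam_predictor` are placeholders for a
    rasterizer with a visibility query and the click-prompted 2D model."""
    n = mesh.verts.shape[0]
    votes = torch.zeros(n)
    counts = torch.zeros(n)
    for _ in range(num_views):
        theta = renderer.random_view()
        visible, pixels = renderer.visible_vertices(mesh, theta)  # vertex indices, integer pixel coords
        hit = (visible == click_idx).nonzero(as_tuple=True)[0]
        if hit.numel() == 0:
            continue                                              # the clicked vertex must be visible
        image = renderer.render_color(mesh, theta)
        sam_mask = sam_predictor(image, prompt=pixels[hit[0]])    # (h, w) binary mask
        votes[visible] += sam_mask[pixels[:, 1], pixels[:, 0]].float()
        counts[visible] += 1.0
    return votes / counts.clamp(min=1.0)                          # per-vertex averaged prediction
```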
In the supplementary material, we discuss additional baselines we devised using SAM. Other interactive segmentation techniques use an implicit 3D representation [Chen et al. 2023b] or perform a different task (object detection) [Zhang et al. 2023], and thus are not directly comparable to our method.
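For reference, the sketch below shows how the per-vertex ACC and IoU can be computed from a predicted probability map and a binary part mask; thresholding at 0.5 is an assumed choice.

```python
import torch

def accuracy_and_iou(pred_probs: torch.Tensor, gt_mask: torch.Tensor, thresh: float = 0.5):
    """Per-vertex accuracy and IoU between predicted probabilities and a binary
    ground-truth part mask (the 0.5 threshold is an assumed choice)."""
    pred = pred_probs > thresh
    gt = gt_mask.bool()
    acc = (pred == gt).float().mean().item()
    intersection = (pred & gt).sum().item()
    union = (pred | gt).sum().item()
    iou = intersection / union if union > 0 else 1.0
    return acc, iou
```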
Tab. 1 presents the ACC and IoU averaged over the test clicks and shapes, and Fig. 7 shows visual examples. InterObject3D does not select the part properly and yields a partial segmentation. The SAM baseline segments the region of the part where the click is visible; however, occluded regions that belong to the part are not marked. In contrast, iSeg adheres to the clicked point. It segments a coherent region in 3D, which is similar to the ground-truth part label, and achieves much higher ACC and IoU than the baselines.

Perceptual user study. iSeg is not limited to a particular shape type from a dataset or specific parts defined in the dataset. In such cases, ground-truth labels are unavailable. Thus, to evaluate the effectiveness of the flexible and diverse segmentations offered by our method, we opt to perform a perceptual user study. We used 20 meshes from different categories, such as humanoids, animals, and man-made objects, and included 40 participants in our study.
For each mesh, we showed the 3D segmentation for a clicked point from multiple viewing angles and asked the participants to rate the effectiveness of the result on a scale of 1 to 5. The score 5 refers to a completely effective segmentation, where the entire 3D region corresponding to the clicked point is selected. When part of the 3D region is marked, the segmentation is considered partially effective. The score 1 is defined as a completely ineffective segmentation and refers to no region selection. Examples are presented in Fig. 14.
Tab. 2 summarizes the effectiveness score averaged over all the meshes and participants. Fig. 15 further compares our method with the SAM baseline. As seen in the figure and reflected by the table, the participants rated the effectiveness of iSeg much higher than the other methods, indicating its fidelity to the clicked point.
Applications. iSeg computes contiguous and localized shape partitions. These segmentations can be used to extract shape parts easily and enable applications such as full-shape segmentation and local geometric editing of the mesh. These results are demonstrated in Figs. 9, 10, and 13. Further details and discussion appear in the supplementary material.

4.3 Generic Feature Information
Our mesh feature field is distilled directly from SAM's encoder and is independent of the user's clicks for segmentation. Thus, although iSeg is optimized per mesh, the semantic feature representation is shared across shapes. We demonstrate this property by cross-domain segmentation. In this experiment, we use the encoded vertex features from one shape and predict the region probability with a different shape's decoder. Fig. 8 shows examples. The transferable features enable the creation of cross-domain shape analogies, such as how the belly of a human corresponds to the “belly” of an airplane.
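The sketch below illustrates the cross-domain experiment: the MFF of shape A is fed through the attention layer and decoder optimized for shape B. Module and variable names are illustrative.

```python
import torch

def cross_domain_segmentation(encoder_a, attention_b, decoder_b, verts_a, pos_idx, neg_idx):
    """Feeds shape A's mesh feature field through shape B's attention and decoder,
    mirroring the cross-domain experiment (module names are illustrative)."""
    with torch.no_grad():
        feats_a = encoder_a(verts_a)                                     # MFF of shape A
        cond = attention_b(feats_a, feats_a[pos_idx], feats_a[neg_idx])  # B's click conditioning
        return decoder_b(feats_a, cond)                                  # per-vertex probabilities on A
```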
4.4 Generalization Capabilities
Unseen mesh vertices. Our method exhibits strong generalization power. First, we emphasize that we train on just a small fraction of 3% of the mesh vertices. Still, iSeg is successfully applied to other mesh vertices unseen during training and properly respects the clicked points, as shown in Figs. 2 and 6 and discussed in Sec. 4.2. We note that all the results shown in the paper are for test vertices.
Unseen views. Although iSeg was trained with 2D supervision only, its predictions are 3D in nature. Fig. 5 exemplifies this phenomenon. A click on the back side of the backrest segments the front side as well. Such supervision does not exist for our model's training, since the clicked vertex cannot be seen from the front side. Similarly, two clicks at opposite sides of the backrest segment the entire part, although they cannot be seen together from any single 2D view. This result suggests that iSeg learned 3D-consistent semantic vertex information, enabling it to generalize beyond its 2D supervision.
Unseen number of clicks. For resource-efficient training, we trained iSeg on up to two clicks: a single click, a second positive click, and a second negative click. Nonetheless, our model offers customized segmentation with more than two clicks, as demonstrated in Fig. 11. We attribute this capability to the interactive attention mechanism, which appears to have learned the representation of positive and negative clicks and how to attend to each click for a meaningful multi-click segmentation.
Limitations. Our method may not follow the symmetry of the mesh exactly, as exemplified in Fig. 12. For a click on the goat's head, the segmented regions on the two sides of the head differ somewhat from each other, since iSeg is not trained to segment those regions identically. In the supplementary, we discuss the potential limitation of processing 3D shapes using MLPs operating in Euclidean space.

5 Conclusion
In this work, we presented iSeg, a technique for interactively generating fine-grained tailored segmentations of 3D meshes. We opt to lift features from a powerful pretrained 2D segmentation model onto a 3D mesh, which can be used to create customized user-specified segmentations. Our mesh feature field is general and may be used for additional tasks such as cross-segmentation across shapes of different categories (e.g., Fig. 8). Key to our method is an interactive attention mechanism that learns a unified representation for a varied number of positive or negative point clicks. Our 3D-consistent segmentation enables selecting points across occluded surfaces and segmenting meaningful regions directly in 3D (e.g., Fig. 5).
In the future, we are interested in exploring additional applications of iSeg beyond segmentation. We have demonstrated that it can potentially be used for cross-domain segmentation, and there may be other exciting applications, such as key-point correspondence, texture transfer, and more.