1 Introduction

Various kinds of real-world data are usually represented with different modalities, such as the perception data of intelligent unmanned systems and medical diagnosis data (Yang et al., 2019b; Chen et al., 2019; Cao et al., 2022). Among research on modeling such multi-modal data, multi-modal clustering (MmC), which divides samples into clusters in an unsupervised manner, has attracted much attention in recent years (Zhang et al., 2020; Chen et al., 2022). MmC aims to integrate multiple features and discover complementary information among different modalities (Zhang et al., 2018; Xie et al., 2019; Fang et al., 2023). Compared with single-modality clustering, MmC can more fully exploit the complementarity between multiple modalities to improve performance (Han et al., 2023; Zhan et al., 2018). In real-world applications, however, some modalities of instances may be missing due to the difficulty of data collection or the failure of data collectors (Kumar et al., 2013; Xiang et al., 2013). Missing modalities lead to a significant loss of information and severely hinder the exploration of complementary and consistent information, so incomplete multi-modal clustering presents its own challenges (Wen et al., 2023; Lin et al., 2023). Such incompleteness further aggravates the difficulty of mining the complementary information that could otherwise be mined from complete paired data. Therefore, how to effectively model the complementarity within incomplete data is an essential problem for incomplete multi-modal clustering (IMmC), and the traditional MmC pipeline fails to address it.

The core focus of research on multi-modal learning with missing modalities is to understand the impact of the missing modalities on modeling and representation. Unsupervised tasks, including clustering, prioritize the discovery of underlying data structures and relationships without relying on label information, which makes them more challenging. When some modalities are missing, unsupervised tasks are generally more sensitive to these changes and capable of capturing them because they are not constrained by label information, whereas supervised tasks primarily focus on establishing a mapping between data and labels.

Many researchers have dedicated themselves to addressing this problem, and existing works can be roughly classified into three categories. (1) Grouping strategies divide data into multiple groups and design different models for each group. These models are then fused to alleviate the influence of missing modalities and obtain the clustering results (Yuan et al., 2012; Wang et al., 2020b). However, the amount of data available for training each model is drastically reduced, which may lead to over-fitting. To alleviate the scarcity of complete modalities, researchers have proposed data imputation-based strategies. (2) Data imputation-based strategies complete the missing modalities of samples before the subsequent clustering, which transforms IMmC into a classical multi-modal clustering problem with complete data (Zhang et al., 2018; Lin et al., 2021). However, it is difficult to ensure the quality of the completed modalities, and the imputation may introduce additional noisy information, especially as the rate of missing data increases. To get rid of the reliance on large-scale complete data, recent studies have attempted to explore consistency in IMmC. (3) Consistency strategies (Zhang et al., 2022; Wang et al., 2021) generate missing modalities of samples by maintaining consistent relationships between different modalities over the whole data. Although they reduce the requirement for paired data, the training process is quite unstable and difficult to converge when the data distribution is complicated, e.g., for data with high missing rates. Consequently, the quality of the generated data is still difficult to control, which significantly deteriorates the performance of the models.

By revisiting existing methods, we find that two problems remain open: (1) Modeling without relying heavily on paired data. Grouping and data imputation-based strategies require a large amount of paired data to learn the relationships between different modalities. When only a few complete samples are available, these methods struggle to complete the missing modalities with high quality, which deteriorates their performance. (2) Mining complementarity on data with high missing rates. Consistency strategies tend to learn relationships independently for each modality and can work well in simple cases, e.g., data with low missing rates, with stable learning and convergence. However, the learned modality representations and structures become inaccurate when handling data with high missing rates, which makes these strategies quite restricted in complicated real-world applications.

To this end, we propose a simple yet effective method, Integrated Heterogeneous Graph ATtention (IHGAT) network, to effectively and stably explore the structural information of samples and modalities without paired data. First, a set of integrated heterogeneous graphs is constructed by fusing two types of graphs: the similarity graph learned from unified latent representations and the modality-specific availability graphs obtained by the existing relations of different samples. Then, we adopt graph learning to exploit complementary structural information between samples based on the constructed integrated heterogeneous graphs. Concretely, we apply an attention mechanism to aggregate the embedded content of heterogeneous neighbors for each graph node. In this way, the incomplete data are embedded into a complete latent space while exploiting the structural information and maintaining the modality-missing information. Finally, the consistency of probability distribution is embedded into the network through KL divergence for clustering.

The proposed method has two advantages over existing methods, thus facilitating solving the aforementioned problems. (1) Low dependency on complete data. The proposed method exploits complementary information by constructing a set of integrated heterogeneous graphs in a learnable unified feature space, where the relationships between different samples and modalities can be directly measured by their similarity. Such a simple design avoids the requirement for complete modalities of paired data. (2) Effective exploitation of intrinsic structural information. Based on the unified latent representation and the constructed heterogeneous graphs, the proposed method aggregates the embeddings of heterogeneous neighbors for each node using an attention mechanism. In this way, the structural information and the intra-sample and inter-sample multi-modal relationships can be fully exploited to enhance representation learning for samples with incomplete modalities.

Six common datasets with different missing rates are used in our experiments, and the results show that our method achieves state-of-the-art performance. In particular, our method is more robust than the baselines on data with high missing rates, which indicates that it can learn complementary information between samples and modalities from intrinsic structural information without requiring many paired data. The proposed method does not include any completion components and is easy to implement. Source code is available at https://github.com/yxjdarren/IHGAT.

The contributions of this work are summarized as follows:

  • We propose a structured information mining strategy, which involves constructing a heterogeneous graph structure within the data. This approach allows for the comprehensive exploration and exploitation of inter-modality and inter-sample relationships, facilitating the effective representation of incomplete data.

  • The inter-modality relationships are realized by mapping multiple modalities into a unified latent space, while the inter-sample relationships are established based on the similarity within the latent space and the incomplete modality information. Such relationships are further used to facilitate complementary fusion among similar samples with graph attention mechanisms.

  • Extensive experiments demonstrate the effectiveness of the proposed method on IMmC. Our method maintains outstanding performance compared to the state-of-the-art baselines as the missing rate increases. In particular, IHGAT is especially effective in scenarios with high missing rates, improving over the baselines by up to 14.78 and 15.36% in Accuracy (ACC) and Normalized Mutual Information (NMI), respectively.

The remainder of this paper is organized as follows. In Sect. 2, we first review the related works about incomplete multi-modal learning and graph representation learning. Then, we elaborate on details of our work, including basic notations, framework, and analysis of each module in Sect. 3. Next, the experimental setting and evaluation results are reported in Sect. 4. Finally, we conclude our work in Sect. 5.

2 Related Work

In this section, we briefly review related work on incomplete multi-modal learning and graph representation learning.

2.1 Incomplete Multi-Modal Learning

In contrast to multi-modal learning, incomplete multi-modal learning contains some missing modalities in data. Existing methods can be mainly categorized into three groups: grouping strategies, data imputation-based strategies, and consistency strategies.

Grouping strategies learn multiple models on various groups for late fusion and focus on the use of completeness theory (Baltrušaitis et al., 2018), which emphasizes complementarity to learn better latent representations. Specifically, within each group, samples with missing modalities are removed, resulting in multiple sets of complete multi-modal samples. However, the process of removing samples with missing modalities substantially reduces the available training data. Yuan et al. (2012) proposed to divide samples according to the availability of data sources and learn a base classifier for each data source independently. Xu et al. (2015) assumed that different modalities are generated from a shared subspace and investigated a successive over-relaxation method to solve the objective function. Wang et al. (2020b) proposed a framework based on knowledge distillation, utilizing the supplementary information from all modalities. However, the above methods have a relatively small amount of data in each group due to grouping, which may lead to overfitting. To address the lack of complete modalities, the data imputation-based strategies have attracted significant attention from researchers.

Data imputation-based strategies first complete the missing modalities and then apply a common MmC algorithm. Enders (2010) simply imputed missing parts with the average value of all samples in each modality. Tran et al. (2017) imputed missing modalities by stacking residual autoencoders, which grow iteratively to model the residual between the current prediction and the original data. Zhang et al. (2018) exploited the identical-distribution constraint between the missing modality and the other available one in a feature-isomorphic subspace to accomplish missing-modality completion. Lin et al. (2021) proposed a novel objective that incorporates representation learning and data recovery into a unified framework from the perspective of information theory. However, because completing modalities can introduce noise, there is a trend toward exploring consistency between modalities.

Consistency strategies include matrix factorization and consensus-learning-based IMmC methods that learn a consistent representation for different modalities. Hotelling (1992) proposed a matrix completion method based on iterative soft thresholding of the singular value decomposition. Shao et al. (2015) proposed Multi-Incomplete-modality Clustering (MIC), an algorithm based on weighted nonnegative matrix factorization with \(L_{2,1}\) regularization. Zhang et al. (2022) proposed a novel framework to achieve the optimal trade-off between consistency and complementarity across different modalities. Wang et al. (2021) proposed a generative partial multi-modal clustering model with adaptive fusion and cycle consistency, where a weighted adaptive fusion scheme is implemented to exploit the complementary information. Wang et al. (2020) maximized the intrinsic correlations among different modalities by deep canonical correlation analysis to learn a consistent subspace representation among incomplete cross-modal data. While the above methods can explore inter-modality information with less complete data, they cannot learn stably or converge when dealing with data with high missing rates.

In real-world applications, massive amounts of paired data are hard to collect, and large portions of data may be missing due to environmental interference. In contrast to existing incomplete multi-modal learning methods, our proposed method requires less paired data and can handle data with high missing rates, and is thus capable of adapting to and working in such an open environment easily.

2.2 Graph Representation Learning

Graph learning is able to provide valuable insights into the structure of the data (Brasó et al., 2022; Brissman et al., 2023; Michieli & Zanuttigh, 2022). Li et al. (2021) jointly constructed local incomplete graph matrices, generated incomplete base partition matrices, stretched them to produce a unified partition matrix, and employed them to learn a consensus graph matrix. Wen et al. (2021) proposed a novel method introducing the tensor low-rank representation constraint and semantic consistency-based graph constraint. Cheng et al. (2020) designed Multi-View Attribute Graph Convolution Networks (MAGCN) with two-pathway encoders that map graph embedding features and learn modality-consistency information. Since MAGCN was designed assuming all modalities were fully and adequately observed, the design of its reconstruction loss functions and geometric consistency loss functions heavily relied on data completeness. Wen et al. (2020) developed a joint framework for graph completion and consensus representation learning, which introduces some adaptive weights to balance the importance of different modalities during consensus representation learning.

Unlike homogeneous graphs, heterogeneous graphs integrate attribute information into the clustering analysis. Heterogeneous graph learning aims to learn effective representations from data of different attributes organized in multiple relation graphs (Wang et al., 2019; Zhang et al., 2019). Constructing heterogeneous graphs usually requires considering the differences in neighbor information under different relationships; therefore, heterogeneous graph neural networks usually adopt hierarchical aggregation (Chang et al., 2015; Zhang et al., 2018c).

Different from traditional homogeneous graph structure learning, heterogeneous graph structure learning (Zhao et al., 2021) generates each relation subgraph separately to account for the heterogeneity of different relations in the heterogeneous graph. At present, there are relatively few studies applying heterogeneous graph learning to IMmC (Bothorel et al., 2015; Shi et al., 2016). Qi et al. (2012) proposed heterogeneous random fields to model the structure and content of social media networks. Li et al. (2017) studied the problem of clustering objects in an attributed heterogeneous information network, taking into account the similarities of objects with respect to both object attribute values and their structural connectedness in the network. Chen et al. (2020) represented attributed graphs as star-schema heterogeneous graphs to capture both structural and attribute similarities, where attributes are modeled as different types of graph nodes. Yang et al. (2019a) learned a common subspace with adaptive graph fusion, which allows the integration of complementary and consistent information from different modalities.

In our work, heterogeneous graphs are constructed to deeply mine the complementary information between samples and modalities. Unlike common graph representation learning methods, we account for the heterogeneity of the relations among different modalities and samples by fusing the similarity graph and the modality-specific availability graphs. By learning representations based on the heterogeneous graphs, the structural information inside the incomplete multi-modal data is exploited to capture the complementarity between different samples and modalities, which yields compact representations for incomplete multi-modal clustering.

3 Methodology

Fig. 1: The architecture of our method. IHGAT requires only a small amount of paired data to model missing modalities, focusing more on internal structural information rather than using modality-completion methods that may introduce noise. Firstly, we construct a set of integrated heterogeneous graphs based on the similarity graph learned from unified latent representations and the modality-specific availability graphs obtained by the existing relations of different samples. Next, we apply an attention mechanism to aggregate the embedded content of heterogeneous neighbors for each node. Finally, the consistency of probability distribution is embedded into the network for clustering.

In this section, the details of the proposed method are illustrated. We design an integrated heterogeneous graph attention network that includes a latent representation learning layer, an integrated heterogeneous graph construction layer, and a clustering layer (See Fig. 1). First, a set of integrated heterogeneous graphs is constructed by fusing: (1) the similarity graph that reflects the neighborhood relations of samples, and (2) the modality-specific availability graphs that encode the modality-existence information. Then, the attention mechanism is applied to the obtained graphs to learn complete representations of data. Finally, considering the consistency of the probability distribution, we use KL divergence to measure the non-symmetric difference between two probability distributions and obtain the clustering results. Details of different modules are presented in the following subsections.

Problem Definition. Consider data \(\{ {\textbf{S}}_{n}\}_{n=1}^{N}\), where \({\textbf{S}}_{n}\) is a subset of the complete observation \({\textbf{X}}_{n} = \{x_{n}^{(v)}\}_{v=1}^{V}\) (i.e., \({\textbf{S}}_{n}\subset {\textbf{X}}_{n}\)), with N and V being the number of samples and modalities, respectively; \({\textbf{X}}_{n}\) denotes the n-th sample with all V modalities. IMmC aims to cluster such data, in which some samples have missing modalities, so that samples \({\textbf{S}}_{n}\) with arbitrary missing-modality patterns can be clustered.

3.1 Integrated Heterogeneous Graph Construction

The integrated heterogeneous graphs are composed of the similarity graph and the modality-specific availability graphs. We use the consistency loss \({\mathcal {L}}_{c}\) to measure the non-symmetric difference between the original distribution and the target distribution. By embedding \({\mathcal {L}}_{c}\) into the network, we can simultaneously optimize both the reconstruction loss \({\mathcal {L}}_{r}\) and the consistency loss \({\mathcal {L}}_{c}\) within a unified framework. This allows the network to better capture the intrinsic structure of the data while capturing the complementarity between samples and modalities through the integrated heterogeneous graph attention network, thereby improving the performance of the model. Next, we elaborate on the construction of each graph.

3.1.1 Similarity Graph

Similar samples can help each other for representation learning, and they should be close in the learned latent space. To this end, we construct the similarity graph to maintain the local structure of the data by first learning unified latent representations of all modalities and then obtaining the graph based on the similarity of samples in the latent space.

To flexibly process samples with arbitrary missing-modality patterns, we project the samples into a unified latent space. Ideally, the hidden representation extracts a unified expression from each modality. Denoting the latent representation of the n-th sample as \({\textbf{h}}_n\), the optimization objective for these flexible latent representations is as follows:

$$\begin{aligned} {\mathcal {L}}_{r} (s_{nv}, {\textbf{S}}_{n}, {\textbf{h}}_{n}; \mathbf {\Theta }_{r}) = \sum _{n = 1}^{N} \sum _{v = 1}^{V} s_{nv} \left\| f_{v}\left( {\textbf{h}}_{n}; \mathbf {\Theta }_{r}^{(v)}\right) - {\textbf{s}}_{n}^{(v)} \right\| ^{2}, \end{aligned}$$
(1)

where \({\mathcal {L}}_{r}\) is the reconstruction loss, which aims to learn the bidirectional mapping between the original data space and the unified embedding space. \(\Vert \cdot \Vert \) represents the \(l_{2}\)-norm. \(f_{v}\left( {\textbf{h}}_{n} ; \mathbf {\Theta }_{r}^{(v)}\right) \) is the reconstruction network for the v-th modality parameterized by \(\mathbf {\Theta }_{r}^{(v)}\), and \({\textbf{s}}_{n}^{(v)}\) represents the input of the v-th modality of the n-th sample. N and V represent the number of samples and modalities, respectively. \(s_{nv}\) indicates the availability of the n-th sample in the v-th modality, which is defined as follows:

$$\begin{aligned} s_{nv}=\left\{ \begin{array}{ll} 1, &{} \text { if the } n \text {-th instance has the } v \text {-th modality}, \\ 0, &{} \text { otherwise}. \end{array}\right. \end{aligned}$$
(2)
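As a concrete illustration, the following is a minimal PyTorch sketch of the masked reconstruction loss in Eq. (1). The per-modality decoders \(f_v\) are assumed to be given as ordinary PyTorch modules, and the function name and tensor layout are illustrative assumptions rather than the actual IHGAT implementation.

```python
import torch

def reconstruction_loss(h, data, mask, decoders):
    """Sketch of the masked reconstruction loss L_r in Eq. (1).

    h: (N, D) unified latent representations h_n.
    data: list of V tensors, each (N, d_v); rows of missing modalities may be zero-filled.
    mask: (N, V) availability matrix with entries s_{nv} from Eq. (2).
    decoders: list of V modules f_v mapping R^D -> R^{d_v}.
    """
    loss = torch.zeros((), device=h.device)
    for v, (x_v, f_v) in enumerate(zip(data, decoders)):
        err = ((f_v(h) - x_v) ** 2).sum(dim=1)          # squared l2 error per sample
        loss = loss + (mask[:, v].float() * err).sum()  # only observed modalities contribute
    return loss
```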

3.1.2 Learnable Unified Latent Representation

By using multiple individual multi-layer perceptrons as encoders, the different available modalities are encoded into a unified learnable space \({\textbf{h}}_n\) (regardless of their missing patterns), where the number of encoders is the same as the number of modalities. Relatively complete and universal representations are learned by minimizing Eq. (1), so that any sample with missing patterns can be reconstructed. This means that the latent space captures flexible representations from the observed modalities.
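As one simplified realization of this step, the sketch below encodes each available modality with its own MLP and averages the encodings over the observed modalities to obtain \({\textbf{h}}_n\); the averaging scheme, layer sizes, and class name are illustrative assumptions rather than the exact IHGAT design.

```python
import torch
import torch.nn as nn

class UnifiedEncoder(nn.Module):
    """Illustrative sketch: per-modality MLP encoders whose outputs are averaged
    over the observed modalities to form the unified latent representation h_n."""

    def __init__(self, modality_dims, latent_dim):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(d_v, 256), nn.ReLU(), nn.Linear(256, latent_dim))
            for d_v in modality_dims
        )

    def forward(self, data, mask):
        """data: list of V tensors (N, d_v); mask: (N, V) availability matrix."""
        codes = torch.stack([enc(x_v) for enc, x_v in zip(self.encoders, data)], dim=1)  # (N, V, D)
        w = mask.float().unsqueeze(-1)                      # keep observed modalities only
        return (codes * w).sum(dim=1) / w.sum(dim=1).clamp(min=1.0)
```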

Generally, the neighborhood structure can be obtained from a Gaussian-based kernel matrix. We denote the matrix as \({\textbf{G}}_n\in {\textbf{R}}^{N \times N}\), and the detailed formulation is as follows:

$$\begin{aligned} ({\textbf{G}}_{n})_{ij}=\left\{ \begin{array}{ll} \exp \left( -\dfrac{\Vert {\textbf{h}}_{i}-{\textbf{h}}_{j}\Vert ^{2}}{2\sigma ^{2}}\right) , &{} \text { if } {\textbf{h}}_{i}\in N_{k}\left( {\textbf{h}}_{j}\right) \text { or } {\textbf{h}}_{j}\in N_{k}\left( {\textbf{h}}_{i}\right) , \\ 0, &{} \text { otherwise}, \end{array}\right. \end{aligned}$$
(3)

where \(\sigma \) is the standard deviation of the Gaussian kernel, and \(N_{k}\left( {\textbf{h}}_{i}\right) \) and \(N_{k}\left( {\textbf{h}}_{j}\right) \) denote the sets of K nearest neighbors of \({\textbf{h}}_{i}\) and \({\textbf{h}}_{j}\), respectively.
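Under this kNN-restricted Gaussian-kernel reading of Eq. (3), the similarity graph can be sketched as follows; the symmetrization over the two neighbor sets and the fixed \(\sigma\) are assumptions made for illustration.

```python
import torch

def similarity_graph(h, k=5, sigma=1.0):
    """Sketch of the similarity graph G_n in Eq. (3): a Gaussian kernel
    restricted to pairs that are among each other's K nearest neighbors."""
    dist2 = torch.cdist(h, h) ** 2                          # pairwise squared distances
    n = h.size(0)
    knn = dist2.topk(k + 1, largest=False).indices[:, 1:]   # k nearest neighbors, excluding self
    neighbor = torch.zeros(n, n, device=h.device)
    neighbor.scatter_(1, knn, 1.0)
    neighbor = (neighbor + neighbor.t()) > 0                # h_i in N_k(h_j) or h_j in N_k(h_i)
    g = torch.exp(-dist2 / (2 * sigma ** 2))
    return torch.where(neighbor, g, torch.zeros_like(g))
```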

3.1.3 Modality-Specific Availability Graphs

The set of missing samples may vary across modalities. Two samples in the same modality can interact with each other only if both are observed. To make full use of the similarities between samples, we propose modality-specific availability graphs based on the samples observed in the same modality. We denote the matrices as \({\textbf{G}}_e\in {\textbf{R}}^{V \times N \times N}\) and \({\textbf{G}}_e^{(v)}\in {\textbf{R}}^{N \times N}\). The detailed formulation is defined as follows:

$$\begin{aligned} {{\textbf{G}}_e}_{i j}^{(v)}=\left\{ \begin{array}{ll}1, \text { if both } {x}_{i}^{(v)} \text { and } {x}_{j}^{(v)} \text { exist} \\ 0,\text { otherwise } \end{array}\right. , \end{aligned}$$
(4)

where \({x}_{i}^{(v)}\) and \({x}_{j}^{(v)}\) are the v-th modality of different samples,

$$\begin{aligned} {\textbf{G}}_e=\left[ \begin{array}{ccccc} {\textbf{G}}_e^{(1)},&{\textbf{G}}_e^{(2)},&{\textbf{G}}_e^{(3)},&\text {...},&{\textbf{G}}_e^{(V)} \end{array}\right] . \end{aligned}$$
(5)
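Given the availability indicators of Eq. (2), each modality-specific availability graph in Eqs. (4)-(5) is simply the outer product of the corresponding mask column, as in the following sketch (the function name and tensor layout are illustrative):

```python
import torch

def availability_graphs(mask):
    """Sketch of the modality-specific availability graphs G_e in Eqs. (4)-(5).

    mask: (N, V) binary matrix with s_{nv} = 1 if sample n has modality v.
    Returns a (V, N, N) tensor where entry (v, i, j) is 1 iff both x_i^{(v)}
    and x_j^{(v)} exist.
    """
    m = mask.float()
    return torch.einsum('iv,jv->vij', m, m)
```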

3.1.4 Integrated Heterogeneous Graph

To further utilize the complementarity of data information, we consider fusing the similarity graph and the modality-specific availability graphs to obtain a set of integrated heterogeneous graphs. We define the graph adjacency matrix as follows:

$$\begin{aligned} {\textbf{G}}_{adj}={\textbf{G}}_{n} \cdot {\textbf{G}}_{e}^{(v)}. \end{aligned}$$
(6)

Then, we obtain:

$$\begin{aligned} {\textbf{G}}=\left[ \begin{array}{ccccc} {\textbf{G}}_{n}\cdot {\textbf{G}}_e^{(1)},&{\textbf{G}}_{n} \cdot {\textbf{G}}_e^{(2)},&\text {...},&{\textbf{G}}_{n} \cdot {\textbf{G}}_e^{(V)} \end{array}\right] . \end{aligned}$$
(7)
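Reading the product in Eq. (6) as an element-wise masking of the similarity graph by each availability graph (an assumption about the operator, since both are \(N \times N\) matrices), the set of integrated heterogeneous graphs can be assembled as:

```python
def integrated_graphs(g_n, g_e):
    """Sketch of Eqs. (6)-(7): fuse the similarity graph with every
    modality-specific availability graph by element-wise multiplication.

    g_n: (N, N) similarity graph; g_e: (V, N, N) availability graphs.
    Returns a (V, N, N) tensor with one integrated graph per modality.
    """
    return g_n.unsqueeze(0) * g_e
```

Combined with the earlier sketches, `integrated_graphs(similarity_graph(h), availability_graphs(mask))` would produce the V adjacency matrices fed to the attention layers described next.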

3.2 Graph Representation Learning

Given the integrated heterogeneous graphs, we can exploit structural information to learn complete representation. Our research is focused on addressing the challenges of incomplete multi-modal learning, with a particular emphasis on harnessing the interrelationships between modalities and samples in the absence of partial modalities. In other words, our intention is not to introduce a novel attention mechanism but rather to make the most of existing methodologies to explore the interrelationships in data with incomplete modalities in a comprehensive manner. Leveraging the dynamic adaptability of the attention mechanism introduced by Graph Attention Network (GAT) (Veličković et al., 2018), we have seamlessly incorporated GAT into our framework for the purpose of exploring the interrelationships between modalities and samples. By leveraging masked self-attention layers and stacking layers, nodes can attend to the features of their neighborhoods.

Formally, given the latent representations \(\left\{ {\textbf{h}}_{1}, {\textbf{h}}_{2}, \ldots , {\textbf{h}}_{N}\right\} \), where \({\textbf{h}}_{n} \in {\textbf{R}}^{D}\), N is the total number of samples, and D is the dimension of the latent space, the network outputs the adjacency matrices \({\textbf{G}}_{v}\) generated by multiple graph learners based on different samples. V groups of new features \({\textbf{z}}^{(v)}=\left\{ {z}_{1}^{(v)}, {z}_{2}^{(v)}, \ldots , {z}_{N}^{(v)}\right\} \left( {z}_{n}^{(v)} \in {\textbf{R}}^{F}, n=1, 2, \ldots , N \right) \) can then be generated, where N is the number of nodes, F is the number of features of each node, and \({z}_{n}^{(v)}\) refers to the feature vector associated with the n-th node and the v-th modality.

To facilitate graph representation learning, we use GAT to transform the latent representations into features that are suitable for graph semantics, which requires a mapping layer composed of learnable parameters \(\mathbf {\Theta }_{g}^{(v)}\):

$$\begin{aligned} {z}_{n}^{(v)}=GAT\left( {\textbf{h}}_{n}, {\textbf{G}}_{v}; \mathbf {\Theta }_{g}^{(v)}\right) , \end{aligned}$$
(8)

where \({\textbf{h}}_{n}\) is the latent representations, \({\textbf{G}}_{v}\) is the graph adjacency matrix, and \(\mathbf {\Theta }_{g}^{(v)}\) is the parameter set of the GAT.

Based on the representations \({\textbf{z}}^{(v)}\), we use the attention mechanism to calculate the importance between nodes. As an initial step, a shared linear transformation, parametrized by a weight matrix \({\textbf{W}} \in {\textbf{R}}^{F^{\prime } \times F}\) (with potentially different dimensionality \({F}^{\prime }\)), is applied to every node. The importance between each pair of nodes can then be calculated by a shared attention mechanism a. Thus, the attention coefficients are defined as:

$$\begin{aligned} e_{ij}=a\left( {\textbf{W}} z_{i}^{(v)}, {\textbf{W}} z_{j}^{(v)}\right) . \end{aligned}$$
(9)

The attention mechanism a is a feedforward network, parametrized by a weight vector \(\overrightarrow{{\textbf{a}}} \in {\textbf{R}}^{2{F}^{\prime }}\), and \(e_{ij}\) indicates the importance of node j to node i. We use an attention mask to inject the integrated heterogeneous graph structure into the calculation, and only consider \(e_{ij}\) for nodes \(j \in {\mathcal {N}}_{i}\) that have relations in \({\textbf{G}}_{v}\). Projected by a softmax function, the formula is:

$$\begin{aligned} \alpha _{i j}=\text {softmax}_{j}\left( e_{i j}\right) =\frac{\exp \left( e_{i j}\right) }{\sum _{k \in {\mathcal {N}}_{i}} \exp \left( e_{i k}\right) }, \end{aligned}$$
(10)

where \({\mathcal {N}}_{i}\) is some neighborhood of node i in the graph. The attention mechanism uses a single-layer feedforward network, and applies the LeakyReLU nonlinearity. Fully expanded out, the attention calculation can be expressed as:

$$\begin{aligned} \alpha _{ij} = \frac{\exp \left( \text {LeakyReLU}\left( \overrightarrow{{\textbf{a}}}^{T}\left[ {\textbf{W}} z_{i}^{(v)} \, \Vert \, {\textbf{W}} z_{j}^{(v)} \right] \right) \right) }{\sum _{k \in {\mathcal {N}}_{i}} \exp \left( \text {LeakyReLU}\left( \overrightarrow{{\textbf{a}}}^{T}\left[ {\textbf{W}} z_{i}^{(v)} \, \Vert \, {\textbf{W}} z_{k}^{(v)} \right] \right) \right) }, \end{aligned}$$
(11)

where \(\cdot ^{T}\) represents transposition and \(\left[ \cdot \Vert \cdot \right] \) is the concatenation operation. Moreover, multi-head attention can be used to enrich the capacity of the method and stabilize the training process. Each attention head has its own parameters. We use concatenation to integrate the outputs of multiple attention heads, which can be described as follows:

$$\begin{aligned} z_{i}^{\prime {(v)}}=\Vert _{b=1}^{B} \mu \left( \sum _{j \in N_{i}} \alpha _{i j}^{b} {\textbf{W}}^{b} z_{j}^{{(v)}}\right) , \end{aligned}$$
(12)

where \(\Vert \) represents concatenation, \(\mu \) is the activation function, \(\alpha _{i j}^{b}\) are normalized attention coefficients computed by the b-th attention mechanism (\(a^{b}\)), B is the number of attention heads and \({\textbf{W}}^{b}\) is the corresponding weight matrix of input linear transformation. When dealing with the last layer, we use the average value instead of the concatenation, as follows

$$\begin{aligned} z_{i}^{\prime {(v)}}= \mu \left( \frac{1}{B} \sum \limits _{b=1}^{B} \sum _{j \in N_{i}} \alpha _{i j}^{b} {\textbf{W}}^{b} z_{j}^{{(v)}}\right) . \end{aligned}$$
(13)
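The following is a minimal single-head sketch of Eqs. (9)-(11); masking non-neighbor scores with \(-\infty\) before the softmax and the LeakyReLU slope of 0.2 follow the original GAT formulation and are assumptions with respect to IHGAT's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionHead(nn.Module):
    """Minimal sketch of one attention head computing Eqs. (9)-(11)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)         # shared linear transform W
        self.a = nn.Parameter(torch.randn(2 * out_dim) * 0.1)   # attention vector a

    def forward(self, z, adj):
        """z: (N, F) node features; adj: (N, N) integrated graph, 0 = no edge.
        Assumes every node has at least one neighbor (e.g., via self-loops)."""
        wz = self.W(z)                                           # (N, F')
        f_prime = wz.size(1)
        # e_ij = LeakyReLU(a^T [W z_i || W z_j]), decomposed into the two halves of a
        e = F.leaky_relu(
            wz @ self.a[:f_prime].unsqueeze(1)                   # contribution of node i, (N, 1)
            + (wz @ self.a[f_prime:].unsqueeze(1)).t(),          # contribution of node j, (1, N)
            negative_slope=0.2,
        )
        e = e.masked_fill(adj == 0, float('-inf'))               # attend only to graph neighbors
        alpha = torch.softmax(e, dim=1)                          # Eq. (10)
        return alpha @ wz                                        # weighted aggregation of neighbors
```

Multi-head attention as in Eq. (12) would concatenate the outputs of B such heads (or average them in the last layer, as in Eq. (13)).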

3.2.1 Consistent Embedding Network

Based on the learned features, we optimize the clustering task in an end-to-end manner. To measure the non-symmetric difference between the original distribution and the target distribution, we embed the consistency of the probability distributions into the network. Following (Wang et al., 2018; Tao et al., 2019), we measure the similarity between the integrated node representations \(z_{i}^{\prime (v)}\) and the cluster center \(\mu _{j}\) by adopting the Student's t-distribution. \(q_{i j}^{(v)}\) and \(p_{i j}^{(v)}\) are the elements of the original distribution \({\textbf{Q}}\) and the target distribution \({\textbf{P}}\), respectively, which are defined as:

$$\begin{aligned} q_{i j}^{(v)} = \frac{(1+\Vert z_{i}^{\prime {(v)}}-\mu _{j}\Vert ^{2}/\beta )^{ -\frac{\beta +1}{2}}}{\sum \limits _{j^{\prime }=1}^{J} (1+\Vert z_{i}^{\prime {(v)}}-\mu _{j^{\prime }}\Vert ^{2}/\beta )^{-\frac{\beta +1}{2}}}, \end{aligned}$$
(14)

where \(\Vert \cdot \Vert \) represents the \(l_{2}\)-norm; \(\mu _{j}\) is the cluster center; J is the number of cluster centers; \(\beta \) is the degree of freedom of the Student's t-distribution, and \(q_{i j}^{(v)}\) is the probability of assigning node i to cluster j. In our experiments, the cluster centers \(\{\mu _{j}\}_{j=1}^J\) are initialized by employing k-means, and the target probability distribution \(p_{i j}^{(v)}\) (0 \(\le \) \(p_{i j}^{(v)}\) \(\le \) 1) is then computed. We obtain \(p_{i j}^{(v)}\) by raising \(q_{i j}^{(v)}\) to the second power and normalizing by the frequency per cluster:

$$\begin{aligned} p_{i j}^{(v)} = \frac{{{q_{i j}^{(v)}}^2}/{f_j}}{\sum \limits _{j^{\prime }=1}^{J} {{q_{i j^{\prime }}^{(v)}}^2}/{f_{j^{\prime }}}}, \end{aligned}$$
(15)

where \(f_{j} = \sum \limits _{i=1}^{N} {q_{i j}^{(v)}}\) are the soft cluster frequencies. To compare the similarity of the two probability distributions, we define our objective as a probability distribution consistency loss \({\mathcal {L}}_{c}\). The clustering loss is defined as minimizing the KL divergence between the original distribution and the target distribution. That is, \({\mathcal {L}}_{c}\) is defined as:

$$\begin{aligned} {\mathcal {L}}_{c} ({\textbf{P}}, {\textbf{Q}})= KL({\textbf{P}}\Vert {\textbf{Q}}) =\sum _{v} \sum _{i} \sum _{j} p_{i j}^{(v)} log \frac{p_{i j}^{(v)}}{q_{i j}^{(v)}}, \end{aligned}$$
(16)

where KL denotes the Kullback–Leibler divergence, which measures the non-symmetric difference between two probability distributions, and \(\Vert \) separates the two distributions. \({\textbf{P}}\) and \({\textbf{Q}}\) are defined by Eq. (15) and Eq. (14), respectively. Finally, we take the mean of \(\{p_{i j}^{(v)}\}_{v=1}^{V}\) as the ideal distribution.
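A minimal sketch of this clustering head, covering Eqs. (14)-(16) for a single modality branch, is given below; the k-means initialization of the centers and the averaging of the target distributions over modalities described above are omitted, and the helper names are illustrative.

```python
import torch

def soft_assignments(z, centers, beta=1.0):
    """Eq. (14): Student's t soft assignment of node representations to cluster centers.
    z: (N, F') integrated node representations; centers: (J, F')."""
    dist2 = torch.cdist(z, centers) ** 2
    q = (1.0 + dist2 / beta) ** (-(beta + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Eq. (15): sharpen q and normalize by the soft cluster frequencies f_j."""
    weight = q ** 2 / q.sum(dim=0)          # q_ij^2 / f_j
    return weight / weight.sum(dim=1, keepdim=True)

def consistency_loss(q, eps=1e-12):
    """Eq. (16): KL(P || Q), with the target distribution P detached from the graph."""
    p = target_distribution(q).detach()
    return (p * ((p + eps).log() - (q + eps).log())).sum()
```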

Accordingly, the overall loss function of the proposed IHGAT can be formulated as follows:

$$\begin{aligned} {\mathcal {L}} = \lambda _{1}{\mathcal {L}}_{c} + \lambda _{2}{\mathcal {L}}_{r}, \end{aligned}$$
(17)

where \({\mathcal {L}}_{c}\) and \({\mathcal {L}}_{r}\) are the clustering loss and the reconstruction loss, respectively. The \(\lambda _{1}\) and \(\lambda _{2}\) are the trade-off hyper-parameters of the \({\mathcal {L}}_{c}\) and \({\mathcal {L}}_{r}\), respectively.
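To make the overall objective concrete, the sketch below combines the pieces from the earlier sketches into one optimization step for Eq. (17); the module interface (`reconstruction_loss`, `soft_assignments_per_modality`) is an illustrative assumption, not the actual IHGAT code.

```python
def training_step(model, data, mask, optimizer, lam1=100.0, lam2=1.0):
    """One optimization step of L = lam1 * L_c + lam2 * L_r (Eq. (17)).
    The default lam1/lam2 follow the values reported in the experiments."""
    optimizer.zero_grad()
    l_r = model.reconstruction_loss(data, mask)                                      # Eq. (1)
    l_c = sum(consistency_loss(q) for q in model.soft_assignments_per_modality())   # Eq. (16)
    loss = lam1 * l_c + lam2 * l_r
    loss.backward()
    optimizer.step()
    return loss.item()
```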

Algorithm 1: Algorithm for IHGAT

We construct a set of integrated heterogeneous graphs based on the similarity graph learned from unified latent representations and the modality-specific availability graphs obtained by the existing relations of different samples. Based on the constructed integrated heterogeneous graphs, we use the incomplete multi-modal data as input to optimize the model parameters for a better representation. Algorithm 1 briefly summarizes the optimization procedures of the proposed method.

4 Experiments

In this section, we conduct comprehensive experiments on incomplete multi-modal data to evaluate the performance of the proposed method, followed by an analysis of the method.

4.1 Metrics and Datasets

For a comprehensive analysis, we conducted extensive experiments on six datasets and adopted two widely used metrics: Accuracy (ACC) and Normalized Mutual Information (NMI). For both metrics, higher values denote better clustering performance.

CUB (Wah et al., 2011): Caltech-UCSD Birds (CUB) contains 11,788 bird images associated with text descriptions from 200 different categories (we followed the experimental settings in Zhang et al. (2022), so the first 10 categories are used). We extracted 1024-dimensional features based on images using GoogLeNet, and 300-dimensional features based on text (Le & Mikolov, 2014).

Football: A collection of 248 English Premier League football players and clubs active on Twitter. The disjoint ground-truth communities correspond to the 20 individual clubs in the league.

ORL: ORL is a popular face database in the field of face recognition. It contains 400 face images provided by 40 volunteers, with 10 face images from each person. Three types of features, i.e., LBP, Gabor, and intensity, are extracted as the three modalities for representing every face image.

PIE: PIE is a subset containing 680 facial images of 68 subjects, for which the intensity, LBP, and Gabor features have been extracted.

Politics: A collection of Irish politicians and political organizations, assigned to seven disjoint ground-truth groups according to their affiliation.

3Sources: 3Sources is collected from three online news sources: BBC, Reuters, and Guardian. In total, 169 samples of stories are used, which are reported by all three sources.

ADNI: The dataset consists of 774 subjects from ADNI-1, including 226 normal controls (NC), 362 MCI, and 186 AD subjects. Only 379 subjects have complete MRI and PET data, including 101 NC, 185 MCI, and 93 AD, so the missing rate is up to 0.26. We use 93-dimensional ROI-based features from both the MRI and PET data.

3Sources-partial: 948 news articles were collected covering 416 distinct news stories. Specifically, 169 were reported in all three sources, 194 in two sources, and 53 appeared in a single news source. Each source represents a unique modality, and the combination of different sources constitutes the multi-modal information. The missing rate of 3Sources-partial is 0.24.

Table 1 The clustering performance comparison on six datasets with different missing rates (\(\varepsilon \))
Table 2 The clustering performance comparison on six datasets with high missing rates (\(\varepsilon \))
Table 3 The clustering performance comparison on real-world missing data

4.2 Experimental Setups

To generate incomplete multi-modal datasets from complete multi-modal datasets, we randomly removed different modalities within each sample according to the missing rate, defined as \(\varepsilon =\frac{\sum _{v} M_{v}}{V \times N}\), where \(M_{v}\) indicates the number of instances without the v-th modality. To evaluate the influence of \(\lambda _{1}\) and \(\lambda _{2}\), we varied both values over the range {0.01, 0.1, 1, 10, 100, 1000}. As shown in Fig. 6, IHGAT was robust to \(\lambda _{1}\) and \(\lambda _{2}\), and the proposed method reached a high level of performance when \(\lambda _{1}\) was in the range \(\{10, 100, 1000\}\) and \(\lambda _{2}\) was in the range \(\{0.01, 0.1, 1\}\). For all datasets, the trade-off hyper-parameters \(\lambda _{1}\) and \(\lambda _{2}\) were fixed to 100 and 1, respectively. We evaluated the performance and reported the averaged results over five runs of experiments. Our algorithm was implemented in PyTorch 1.9.0, and all evaluations were carried out on a standard Ubuntu 16.04 system with NVIDIA 3090 Graphics Processing Units (GPUs). We set the initial learning rate to 0.02 for all datasets except Football and PIE, for which it was 0.01.
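For illustration, the sketch below produces a random availability mask for a target missing rate \(\varepsilon\); keeping at least one observed modality per sample is our assumption about how degenerate samples are avoided, since the exact removal protocol is not specified beyond the definition of \(\varepsilon\).

```python
import numpy as np

def generate_availability_mask(n_samples, n_modalities, missing_rate, seed=0):
    """Randomly remove modality entries so that eps = sum_v M_v / (V * N).

    Assumes missing_rate <= (V - 1) / V so that every sample can keep at
    least one observed modality and the loop terminates."""
    rng = np.random.default_rng(seed)
    mask = np.ones((n_samples, n_modalities), dtype=int)
    n_to_remove = int(round(missing_rate * n_samples * n_modalities))
    while n_to_remove > 0:
        i = rng.integers(n_samples)
        v = rng.integers(n_modalities)
        if mask[i, v] == 1 and mask[i].sum() > 1:   # never drop a sample's last modality
            mask[i, v] = 0
            n_to_remove -= 1
    return mask
```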

4.3 Baseline Methods

Ten baseline methods were used in the experiments, including SVD (Hotelling, 1992), Average (Enders, 2010), CRA (Tran et al., 2017), iCmSC (Wang et al., 2020), IMVTSC-MVI (Wen et al., 2021), GP-MVC (Wang et al., 2021), CPM (Zhang et al., 2022), CPM-GAN (Zhang et al., 2022), PIMVC (Deng et al., 2023), and GreatF (Wen et al., 2023a).

SVD (Hotelling, 1992): SVD is a matrix completion method by iterative soft thresholding of singular value decomposition.

Average (Enders, 2010): Average imputes missing parts with the average value of all samples in each modality.

CRA (Tran et al., 2017): CRA is composed of a set of stacked residual autoencoders, which can learn complex relationships among data from different modalities.

iCmSC (Wang et al., 2020): iCmSC is a novel incomplete cross-modal clustering method that integrates canonical correlation analysis and exclusive representation.

IMVTSC-MVI (Wen et al., 2021): IMVTSC-MVI incorporates the feature space based missing-modality inferring and manifold space based similarity graph learning into a unified framework.

GP-MVC (Wang et al., 2021): GP-MVC is a generative partial multi-modal clustering model with adaptive fusion and cycle consistency to solve the incomplete multi-modal problem by explicitly generating the data of missing modalities.

CPM (Zhang et al., 2022): CPM provides the comparative version of the CPM-Nets without the adversarial strategy.

CPM-GAN (Zhang et al., 2022): CPM-GAN is the adversarial version of CPM-Nets, in which the networks that generate the missing modalities can be regarded as generators. The discriminators use the same structure as the generators; for the purpose of discrimination, a sigmoid layer is imposed on the output layer of each discriminator network.

PIMVC (Deng et al., 2023): PIMVC applies projection learning to IMmC, which solves the problem of information imbalance between different modalities.

Fig. 2: Ablation study on (a) CUB, (b) Football, (c) ORL, (d) PIE, (e) Politics, and (f) 3Sources datasets

Fig. 3: Visualization of (a) IHGAT, (b) S-IHGAT, and (c) LATENT on the CUB dataset with varying epochs

GreatF (Wen et al., 2023a): GreatF provides an adaptive weighted matrix factorization model to obtain the representation of every modality, which can enhance the weight of the discriminative features of all modalities for representation learning.

4.4 Incomplete Multi-Modal Clustering Performance

Experimental results are shown in Table 1. By analyzing the results, we make the following observations: (1) In terms of both ACC and NMI, our method achieves promising performance compared with all baselines. It performs the best under most settings of all datasets in terms of both metrics, which validates the effectiveness of IHGAT. (2) Although the baselines can also achieve good performance at several low missing rates, clear performance degradation can be observed as the rate of missing data increases. Most existing baselines attempt to complete the missing modalities, which may introduce extra noise when the missing patterns are complex or the missing rate is high. (3) Across all missing rates, IHGAT generally achieves outstanding performance on all multi-modal datasets. On the six datasets, we average the ACC and NMI of each method over different missing rates (\(\varepsilon \) from 0.1 to 0.5), and our method is 6.74% higher in ACC and 8.75% higher in NMI than the second-best method.

4.5 Multi-Modal Clustering Performance with High Missing Rates

Based on the above discussion, we further analyze the most competitive state-of-the-art methods, including iCmSC, IMVTSC-MVI, GP-MVC, and CPM-GAN, in Table 2. In terms of ACC, taking the missing rate \(\varepsilon \)=0.9 as an example, our method improves by 18.25, 9.71, 25.4, 17.41, 14.35, and 16.33% over the second-best performers on CUB, Football, ORL, PIE, Politics, and 3Sources, respectively. IMVTSC-MVI hardly maintains good performance as the missing rate increases, whereas CPM-GAN is rather robust to modality-missing data. The missing modalities seriously affect the mining of information from multi-modal data. For a more comprehensive comparison, we compare the performance of IHGAT and the suboptimal methods at high missing rates by taking mean values. The suboptimal methods differ across datasets and consist of IMVTSC-MVI and CPM-GAN. The average ACC and NMI of IHGAT are 46.00 and 53.43%, respectively, while those of the suboptimal methods are 31.22 and 38.07%, respectively. IHGAT thus improves by 14.78 and 15.36% over the suboptimal methods in ACC and NMI, respectively.

The combined analysis of Tables 1 and 2 shows that most baselines rely heavily on a large amount of paired multi-modal data, using the shared information of latent representations to complete the missing modalities. When a large portion of modalities is missing and the missing patterns are complexly distributed, such methods may introduce additional noise and find it difficult to complete the missing modalities effectively. The above observations further validate the advantages of IHGAT. This suggests that it is beneficial to learn a more compact common representation for incomplete multi-modal clustering by considering both the structural information of missing data and the available information of non-missing data. It is clear that IHGAT is superior and more competitive, especially as the missing rate increases. This implies that our method can effectively explore the complex relationships between modalities and samples, even with a relatively large incomplete-sample ratio.

4.6 Multi-Modal Clustering Performance on Real-World Missing Data

Existing IMmC research is mostly conducted on publicly available multi-modal datasets from which portions of the data are randomly removed to form incomplete multi-modal datasets; real-world incomplete multi-modal datasets are rarely encountered in existing research. To further validate the effectiveness of IHGAT in real-world scenarios, we conducted experiments on two real-world incomplete multi-modal datasets, namely ADNI and 3Sources-partial.

As presented in Table 3, IHGAT still performs well on real-world missing datasets. In real-world scenarios, there are usually incomplete cases for multi-modal data. For instance, within medical applications, diverse subjects typically undergo various types of examinations. In the realm of web analysis, some websites encompass a variety of content, including text, images, and videos, while others may contain only a subset of these, resulting in data with missing modalities. As the number of modalities increases, the patterns of modality-missing, denoting the combinations of available modalities, become progressively intricate. Therefore, research on incomplete multi-modal data holds significant practical value and has a wide range of application scenarios.

Table 4 The clustering performance comparison on three datasets with different modules

4.7 Ablation Study

To verify the effectiveness of the similarity graph and the modality-specific availability graphs, we visualized the representations on the CUB dataset to investigate the performance of IHGAT. Figure 3a, b, and c show the representations of IHGAT, S-IHGAT, and LATENT obtained in different epochs. LATENT denotes the proposed method without the similarity graph and modality-specific availability graphs; S-IHGAT uses the similarity graph only, and IHGAT uses both the similarity graph and the modality-specific availability graphs. As the number of epochs increases, the clusters of IHGAT become more compact, and the margins between different classes become clearer. This shows that the similarity graph and modality-specific availability graphs contribute substantially to the representation learning ability of IHGAT.

To further analyze the contribution of the similarity graph and modality-specific availability graphs in IHGAT, we conducted an ablation study of the proposed method. As presented in Fig. 2, S-IHGAT substantially outperforms LATENT, which indicates that overlooking the structural relationships between samples, which enhance multi-modal complementary information, is harmful. Besides, IHGAT performs better than S-IHGAT, validating the effectiveness of the modality-specific availability graphs. Under the combined influence of the similarity graph and modality-specific availability graphs, IHGAT can indeed achieve better clustering results.

The graph adjacency matrix provides the possibility to correlate features and semantic representations, and the intrinsic structural information can be maintained to obtain more sufficient complementary information of different samples and modalities. The new features of a specific node are obtained by adding a nonlinear transformation to a weighted average of the neighboring features of the specific node in terms of their contribution. The new features are tighter and can further exploit complementarity. It can be intuitively seen through Figs. 2 and 3 that the similarity graph and modality-specific availability graphs have a significant impact on the performance of our proposed method, mainly because different constraints obtain different features.

As shown in Table 4, we conduct additional ablation experiments to explore the impact of different graph construction methods, attention mechanisms, and probability distributions, which provides a deeper understanding of the contributions of these components. Additionally, we employ t-SNE visualization, as shown in Fig. 4, to illustrate that using multiple modalities to construct a unified representation results in more compact intra-class clusters and clearer inter-class boundaries. While directly aggregating encoder outputs into a common representation provides some improvement over using a single modality, utilizing multi-modal information to build a learnable unified latent representation yields superior overall performance. Our method reduces the heavy reliance on paired data by encoding multiple modalities into unified hidden representations, and, unlike grouping-based works, we do not have to partition the data when the amount of data is large. Combining the similarity graph and the modality-specific availability graphs into a set of integrated heterogeneous graphs better explores the structural information of the data and the relationships among samples and modalities.

Fig. 4: Visualization of (a) Modality-1, (b) Modality-2, (c) Modality-aggregate, and (d) Modality-learnable on CUB

Fig. 5: Convergence analysis on (a) CUB, (b) Football, (c) ORL, and (d) 3Sources datasets

Fig. 6: Effect of the parameters \(\lambda _{1}\) and \(\lambda _{2}\) on (a) CUB, (b) ORL, and (c) 3Sources datasets

4.8 Convergence Analysis

To investigate the stability and convergence of the training process of IHGAT, we show the convergence curves of IHGAT and CPM-GAN, a typical consistency-strategy method, on multiple datasets (\(\varepsilon \) = 0.9). As shown in Fig. 5, the training process of CPM-GAN is quite unstable on data with high missing rates. Consequently, the quality of the generator is difficult to control, which degrades the performance of the model significantly. This also reveals the potential risk of introducing additional noise when completing missing modalities on highly incomplete data. By contrast, IHGAT converges stably and quickly within around 75 epochs, further demonstrating the performance advantage of IHGAT under complicated data distributions.

4.9 Parameter Analysis

Since \(\lambda _{1}\) and \(\lambda _{2}\) are the trade-off hyper-parameters that balance the clustering term and the reconstruction term in the final loss function, we analyzed the impact of different values of \(\lambda _{1}\) and \(\lambda _{2}\) on the performance (\(\varepsilon \) = 0.5), and the results are shown in Fig. 6. It can be seen that IHGAT is not sensitive to \(\lambda _{1}\) and \(\lambda _{2}\), and the proposed method reaches a high level of performance when \(\lambda _{1}\) is in the range \(\{10, 100, 1000\}\) and \(\lambda _{2}\) is in the range \(\{0.01, 0.1, 1\}\).

Fig. 7: Effect of (a) the number of nearest neighbor samples K and (b) the dimensionality of the latent representations, in terms of ACC and NMI, respectively

The number of nearest neighbors K and the dimensionality of the latent representations \(\gamma \) are the two main parameters of our method. Since K nearest neighbors are used to obtain the similarity graph, we analyzed the influence of different K values on the proposed method. As shown in Fig. 7a, K values that are too small or too large are adverse to the performance of the model. If K is too small, the model easily becomes overly complicated and over-fits; if K is too large, the result is affected by distant points. Therefore, a medium K value, specifically \(K=5\), is appropriate for our method in the experiments.

In terms of \(\gamma \), we visualized the influence of different values of \(\gamma \) on three datasets (\(\varepsilon \) = 0.5) in Fig. 7b, where \(\gamma \) ranges over {16, 32, 64, 128, 256}. Ten trials of experiments were conducted, and the average values of ACC and NMI are reported as the final results. According to Fig. 7b, different parameter settings noticeably affect the performance of the method, and most datasets achieve better performance with \(\gamma = 64\), which is a good default choice.

5 Conclusion

In this paper, we proposed an effective method that deeply mines structural information to exploit the complementary information of different samples for IMmC. First, the similarity graph and the modality-specific availability graphs are fused to form a set of integrated heterogeneous graphs. Thereafter, the attention mechanism is applied to the obtained integrated heterogeneous graphs to capture the complementary information among different samples and modalities. In this way, complete representations can be learned for data with incomplete modalities. Finally, clustering is performed on the learned representations by embedding the consistency of probability distributions into the network. The proposed method does not require a large amount of paired data to model the missing modalities and shows significant improvements over the compared methods on six challenging benchmark datasets. Clearer advantages of the proposed method over the baselines can be observed at high missing rates.