1 Introduction

Various kinds of real-world data are usually represented with different modalities, such as the perception data of intelligent unmanned systems and medical diagnosis data (Yang et al., 2019b; Chen et al., 2019; Cao et al., 2022). Among research on modeling such multi-modal data, multi-modal clustering (MmC), which divides samples into clusters in an unsupervised manner, has attracted much attention in recent years (Zhang et al., 2020; Chen et al., 2022). MmC aims to integrate multiple features and discover complementary information among different modalities (Zhang et al., 2018; Xie et al., 2019; Fang et al., 2023). Compared with single-modality clustering, MmC can more fully exploit the complementarity between multiple modalities to improve performance (Han et al., 2023; Zhan et al., 2018). In real-world applications, however, some modalities of instances may be missing due to the difficulty of data collection or the failure of data collectors (Kumar et al., 2013; Xiang et al., 2013). Missing modalities lead to a significant loss of information and severely hinder the exploration of complementary and consistent information, so incomplete multi-modal clustering presents its own challenges (Wen et al., 2023; Lin et al., 2023). Such incompleteness further aggravates the difficulty of mining the complementary information that could otherwise be mined from complete paired data. Therefore, how to effectively model the complementarity within incomplete data is an essential problem for incomplete multi-modal clustering (IMmC), and the traditional MmC pipeline fails to address it.

The core focus of research on multi-modal learning with missing modalities is to understand the impact of the missing modalities on modeling and representation. Unsupervised tasks, including clustering, prioritize the discovery of underlying data structures and relationships without relying on label information, which makes them more challenging. When some modalities are missing, unsupervised tasks are generally more sensitive to these changes and capable of capturing them because they are not constrained by label information, whereas supervised tasks primarily focus on establishing a mapping between data and labels.

Many researchers have dedicated themselves to addressing this problem, and existing works can be roughly classified into three categories. (1) Grouping strategies divide data into multiple groups and design different models for each group. These models are then fused to alleviate the influence of missing modalities and obtain the clustering results (Yuan et al., 2012; Wang et al., 2020b). However, the amount of data available for training each model is drastically reduced, which may lead to over-fitting. To alleviate the scarcity of complete modalities, researchers have proposed data imputation-based strategies. (2) Data imputation-based strategies complete the missing modalities of samples before the subsequent clustering, which transforms IMmC into a classical multi-modal clustering problem with complete data (Zhang et al., 2018; Lin et al., 2021). However, it is difficult to ensure the quality of the completed modalities, and the imputation may introduce additional noisy information, especially as the rate of missing data increases. To get rid of the reliance on large-scale complete data, recent studies have attempted to explore consistency in IMmC. (3) Consistency strategies (Zhang et al., 2022; Wang et al., 2021) generate missing modalities of samples by maintaining consistent relationships between different modalities over the whole data. Although they reduce the requirement for paired data, the training process is quite unstable and difficult to converge when the data distribution is complicated, e.g., for data with high missing rates. Consequently, the quality of the generated data is still difficult to control, which significantly deteriorates the performance of the models.

By revisiting existing methods, we find that two problems remain open: (1) Modeling without relying heavily on paired data. Grouping and data imputation-based strategies require a large amount of paired data to learn the relationships between different modalities. When only a few complete samples are available, these methods struggle to complete the missing modalities with high quality, which deteriorates their performance. (2) Mining complementarity on data with high missing rates. Consistency strategies tend to learn relationships independently for each modality and can work well in simple cases, e.g., data with low missing rates, with stable learning and convergence. However, the learned modality representations and structures become inaccurate when handling data with high missing rates, which makes these strategies quite restricted in complicated real-world applications.

To this end, we propose a simple yet effective method, Integrated Heterogeneous Graph ATtention (IHGAT) network, to effectively and stably explore the structural information of samples and modalities without paired data. First, a set of integrated heterogeneous graphs is constructed by fusing two types of graphs: the similarity graph learned from unified latent representations and the modality-specific availability graphs obtained by the existing relations of different samples. Then, we adopt graph learning to exploit complementary structural information between samples based on the constructed integrated heterogeneous graphs. Concretely, we apply an attention mechanism to aggregate the embedded content of heterogeneous neighbors for each graph node. In this way, the incomplete data are embedded into a complete latent space while exploiting the structural information and maintaining the modality-missing information. Finally, the consistency of probability distribution is embedded into the network through KL divergence for clustering.

The proposed method has two advantages over existing methods, thus facilitating solving the aforementioned problems. (1) Low dependency on complete data. The proposed method exploits complementary information by constructing a set of integrated heterogeneous graphs in a learnable unified feature space, where the relationships between different samples and modalities can be directly measured by their similarity. Such a simple design avoids the requirement for complete modalities of paired data. (2) Effective exploitation of intrinsic structural information. Based on the unified latent representation and the constructed heterogeneous graphs, the proposed method aggregates the embeddings of heterogeneous neighbors for each node using an attention mechanism. In this way, the structural information and the intra-sample and inter-sample multi-modal relationships can be fully exploited to enhance representation learning for samples with incomplete modalities.

Six common datasets with different missing rates are used in our experiments, and the results show that our method achieves state-of-the-art performance. In particular, our method is more robust than the baselines on data with high missing rates, which indicates that it can learn complementary information between samples and modalities from intrinsic structural information without requiring many paired data. The proposed method does not include any completion components and is easy to implement. Source code is available at https://github.com/yxjdarren/IHGAT.

The contributions of this work are summarized as follows:

  • We propose a structured information mining strategy, which involves constructing a heterogeneous graph structure within the data. This approach allows for the comprehensive exploration and exploitation of inter-modality and inter-sample relationships, facilitating the effective representation of incomplete data.

  • The inter-modality relationships are realized by mapping multiple modalities into a unified latent space, while the inter-sample relationships are established based on the similarity within the latent space and the incomplete modality information. Such relationships are further used to facilitate complementary fusion among similar samples with graph attention mechanisms.

  • Extensive experiments demonstrate the effectiveness of the proposed method on IMmC. Our method maintains outstanding performance compared to the state-of-the-art baselines as the missing rate increases. In particular, IHGAT is especially effective in scenarios with high missing rates, improving over the baselines by up to 14.78 and 15.36% in Accuracy (ACC) and Normalized Mutual Information (NMI), respectively.

The remainder of this paper is organized as follows. In Sect. 2, we first review the related works about incomplete multi-modal learning and graph representation learning. Then, we elaborate on details of our work, including basic notations, framework, and analysis of each module in Sect. 3. Next, the experimental setting and evaluation results are reported in Sect. 4. Finally, we conclude our work in Sect. 5.

2 Related Work

In this section, we briefly review related work on incomplete multi-modal learning and graph representation learning.

2.1 Incomplete Multi-Modal Learning

In contrast to multi-modal learning, incomplete multi-modal learning contains some missing modalities in data. Existing methods can be mainly categorized into three groups: grouping strategies, data imputation-based strategies, and consistency strategies.

Grouping strategies learn multiple models on various groups for late fusion and focus on the use of completeness theory (Baltrušaitis et al., 2018), which emphasizes complementarity to learn better latent representations. Specifically, within each group, samples with missing modalities are removed, resulting in multiple sets of complete multi-modal samples. However, the process of removing samples with missing modalities substantially reduces the available training data. Yuan et al. (2012) proposed to divide samples according to the availability of data sources and learn a base classifier for each data source independently. Xu et al. (2015) assumed that different modalities are generated from a shared subspace and investigated a successive over-relaxation method to solve the objective function. Wang et al. (2020b) proposed a framework based on knowledge distillation, utilizing the supplementary information from all modalities. However, the above methods have a relatively small amount of data in each group due to grouping, which may lead to overfitting. To address the lack of complete modalities, the data imputation-based strategies have attracted significant attention from researchers.

Data imputation-based strategies first complete the missing modalities and then apply a common MmC algorithm. Enders (2010) simply imputed missing parts with the average value of all samples in each modality. Tran et al. (2017) imputed missing modalities by stacking residual autoencoders, which grow iteratively to model the residual between the current prediction and the original data. Zhang et al. (2018) exploited the identical-distribution constraint between the missing modality and the other available one in a feature-isomorphic subspace to accomplish missing-modality completion. Lin et al. (2021) proposed a novel objective that incorporates representation learning and data recovery into a unified framework from the perspective of information theory. However, because completing modalities can introduce noise, there is a trend toward exploring consistency between modalities.

Consistency strategies include matrix factorization and consensus-learning-based IMmC methods that learn a consistent representation for different modalities. Hotelling (1992) proposed a matrix completion method based on iterative soft thresholding of the singular value decomposition. Shao et al. (2015) proposed Multi-Incomplete-modality Clustering (MIC), an algorithm based on weighted nonnegative matrix factorization with \(L_{2,1}\) regularization. Zhang et al. (2022) proposed a novel framework to achieve the optimal trade-off between consistency and complementarity across different modalities. Wang et al. (2021) proposed a generative partial multi-modal clustering model with adaptive fusion and cycle consistency, where a weighted adaptive fusion scheme is implemented to exploit the complementary information. Wang et al. (2020) maximized the intrinsic correlations among different modalities by deep canonical correlation analysis to learn a consistent subspace representation among incomplete cross-modal data. While the above methods can explore inter-modality information with less complete data, they cannot learn stably or converge when dealing with data with high missing rates.

In real-world applications, massive amounts of paired data are hard to collect, and large portions of data may be missing due to environmental interference. In contrast to existing incomplete multi-modal learning methods, our proposed method requires less paired data and can handle data with high missing rates, and is thus capable of adapting to and working in such an open environment easily.

2.2 Graph Representation Learning

Graph learning is able to provide valuable insights into the structure of the data (Brasó et al., 2022; Brissman et al., 2023; Michieli & Zanuttigh, 2022). Li et al. (2021) jointly constructed local incomplete graph matrices, generated incomplete base partition matrices, stretched them to produce a unified partition matrix, and employed them to learn a consensus graph matrix. Wen et al. (2021) proposed a novel method introducing the tensor low-rank representation constraint and semantic consistency-based graph constraint. Cheng et al. (2020) designed Multi-View Attribute Graph Convolution Networks (MAGCN) with two-pathway encoders that map graph embedding features and learn modality-consistency information. Since MAGCN was designed assuming all modalities were fully and adequately observed, the design of its reconstruction loss functions and geometric consistency loss functions heavily relied on data completeness. Wen et al. (2020) developed a joint framework for graph completion and consensus representation learning, which introduces some adaptive weights to balance the importance of different modalities during consensus representation learning.

Unlike homogeneous graphs, heterogeneous graphs integrate attribute information into the clustering analysis. Heterogeneous graph learning aims to learn effective representations from data of different attributes organized in multiple relation graphs (Wang et al., 2019; Zhang et al., 2019). Constructing heterogeneous graphs usually requires considering the differences in neighbor information under different relationships; therefore, heterogeneous graph neural networks usually adopt hierarchical aggregation (Chang et al., 2015; Zhang et al., 2018c).

Different from traditional homogeneous graph structure learning, heterogeneous graph structure learning (Zhao et al., 2021) generates each relation subgraph separately to account for the heterogeneity of different relations in the heterogeneous graph. At present, there are relatively few studies applying heterogeneous graph learning to IMmC (Bothorel et al., 2015; Shi et al., 2016). Qi et al. (2012) proposed heterogeneous random fields to model the structure and content of social media networks. Li et al. (2017) studied the problem of clustering objects in an attributed heterogeneous information network, taking into account the similarities of objects with respect to both object attribute values and their structural connectedness in the network. Chen et al. (2020) represented attributed graphs as star-schema heterogeneous graphs to capture both structural and attribute similarities, where attributes are modeled as different types of graph nodes. Yang et al. (2019a) learned a common subspace with adaptive graph fusion, which allows the integration of complementary and consistent information from different modalities.

In our work, heterogeneous graphs are constructed to deeply mine the complementary information between samples and modalities. Unlike common graph representation learning methods, we account for the heterogeneity of the relations among different modalities and samples by fusing the similarity graph and the modality-specific availability graphs. By learning representations based on the heterogeneous graphs, the structural information inside the incomplete multi-modal data is exploited to capture the complementarity between different samples and modalities, which yields compact representations for incomplete multi-modal clustering.

3 Methodology

Fig. 1: The architecture of our method. IHGAT requires only a small amount of paired data to model missing modalities, focusing more on internal structural information rather than using modality-completion methods that may introduce noise. Firstly, we construct a set of integrated heterogeneous graphs based on the similarity graph learned from unified latent representations and the modality-specific availability graphs obtained by the existing relations of different samples. Next, we apply an attention mechanism to aggregate the embedded content of heterogeneous neighbors for each node. Finally, the consistency of probability distribution is embedded into the network for clustering.

In this section, the details of the proposed method are illustrated. We design an integrated heterogeneous graph attention network that includes a latent representation learning layer, an integrated heterogeneous graph construction layer, and a clustering layer (See Fig. 1). First, a set of integrated heterogeneous graphs is constructed by fusing: (1) the similarity graph that reflects the neighborhood relations of samples, and (2) the modality-specific availability graphs that encode the modality-existence information. Then, the attention mechanism is applied to the obtained graphs to learn complete representations of data. Finally, considering the consistency of the probability distribution, we use KL divergence to measure the non-symmetric difference between two probability distributions and obtain the clustering results. Details of different modules are presented in the following subsections.

Problem Definition. Consider data \(\{ {\textbf{S}}_{n}\}_{n=1}^{N}\), where \({\textbf{S}}_{n}\) is a subset of the complete observation \({\textbf{X}}_{n} = \{x_{n}^{(v)}\}_{v=1}^{V}\) (i.e., \({\textbf{S}}_{n}\subset {\textbf{X}}_{n}\)), with N and V being the number of samples and modalities, respectively; \({\textbf{X}}_{n}\) denotes the n-th sample with all V modalities. IMmC aims to cluster such data, in which some samples have missing modalities, so that samples \({\textbf{S}}_{n}\) with arbitrary missing-modality patterns can be clustered.

3.1 Integrated Heterogeneous Graph Construction

The integrated heterogeneous graphs are composed of the similarity graph and the modality-specific availability graphs. We use the consistency loss \({\mathcal {L}}_{c}\) to measure the non-symmetric difference between the original distribution and the target distribution. By embedding \({\mathcal {L}}_{c}\) into the network, we can simultaneously optimize both the reconstruction loss \({\mathcal {L}}_{r}\) and the consistency loss \({\mathcal {L}}_{c}\) within a unified framework. This allows the network to better capture the intrinsic structure of the data while capturing the complementarity between samples and modalities through the integrated heterogeneous graph attention network, thereby improving the performance of the model. Next, we elaborate on the construction of each graph.

3.1.1 Similarity Graph

Similar samples can help each other for representation learning, and they should be close in the learned latent space. To this end, we construct the similarity graph to maintain the local structure of the data by first learning unified latent representations of all modalities and then obtaining the graph based on the similarity of samples in the latent space.

To flexibly process samples with arbitrary missing-modality patterns, we project the samples into a unified latent space. Ideally, the hidden representation extracts a unified expression from each modality. Denoting the latent representation of the n-th sample as \({\textbf{h}}_n\), the optimization objective for these flexible latent representations is as follows:

$$\begin{aligned} {\mathcal {L}}_{r} (s_{nv}, {\textbf{S}}_{n}, {\textbf{h}}_{n}; \mathbf {\Theta }_{r}) = \sum _{n = 1}^{N} \sum _{v = 1}^{V} s_{nv} \left\| f_{v}\left( {\textbf{h}}_{n}; \mathbf {\Theta }_{r}^{(v)}\right) - {\textbf{s}}_{n}^{(v)} \right\| ^{2}, \end{aligned}$$
(1)

where \({\mathcal {L}}_{r}\) is the reconstruction loss, which aims to learn the bidirectional mapping between the original data space and the unified embedding space. \(\Vert \cdot \Vert \) represents the \(l_{2}\)-norm. \(f_{v}\left( {\textbf{h}}_{n} ; \mathbf {\Theta }_{r}^{(v)}\right) \) is the reconstruction network for the v-th modality parameterized by \(\mathbf {\Theta }_{r}^{(v)}\), and \({\textbf{s}}_{n}^{(v)}\) represents the input of the v-th modality of the n-th sample. N and V represent the number of samples and modalities, respectively. \(s_{nv}\) indicates the availability of the n-th sample in the v-th modality, which is defined as follows:

$$\begin{aligned} s_{nv}=\left\{ \begin{array}{ll} 1, &{} \text { if the } n \text {-th instance has the } v \text {-th modality}, \\ 0, &{} \text { otherwise}. \end{array}\right. \end{aligned}$$
(2)
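As a concrete illustration, the following is a minimal PyTorch sketch of the masked reconstruction loss in Eq. (1). The per-modality decoders \(f_v\) are assumed to be given as ordinary PyTorch modules, and the function name and tensor layout are illustrative assumptions rather than the actual IHGAT implementation.

```python
import torch

def reconstruction_loss(h, data, mask, decoders):
    """Sketch of the masked reconstruction loss L_r in Eq. (1).

    h: (N, D) unified latent representations h_n.
    data: list of V tensors, each (N, d_v); rows of missing modalities may be zero-filled.
    mask: (N, V) availability matrix with entries s_{nv} from Eq. (2).
    decoders: list of V modules f_v mapping R^D -> R^{d_v}.
    """
    loss = torch.zeros((), device=h.device)
    for v, (x_v, f_v) in enumerate(zip(data, decoders)):
        err = ((f_v(h) - x_v) ** 2).sum(dim=1)          # squared l2 error per sample
        loss = loss + (mask[:, v].float() * err).sum()  # only observed modalities contribute
    return loss
```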

3.1.2 Learnable Unified Latent Representation

By using multiple individual multi-layer perceptrons as encoders, the different available modalities are encoded into a unified learnable space \({\textbf{h}}_n\) (regardless of their missing patterns), where the number of encoders is the same as the number of modalities. Relatively complete and universal representations are learned by minimizing Eq. (1), so that any sample with missing patterns can be reconstructed. This means that the latent space captures flexible representations from the observed modalities.
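As one simplified realization of this step, the sketch below encodes each available modality with its own MLP and averages the encodings over the observed modalities to obtain \({\textbf{h}}_n\); the averaging scheme, layer sizes, and class name are illustrative assumptions rather than the exact IHGAT design.

```python
import torch
import torch.nn as nn

class UnifiedEncoder(nn.Module):
    """Illustrative sketch: per-modality MLP encoders whose outputs are averaged
    over the observed modalities to form the unified latent representation h_n."""

    def __init__(self, modality_dims, latent_dim):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(d_v, 256), nn.ReLU(), nn.Linear(256, latent_dim))
            for d_v in modality_dims
        )

    def forward(self, data, mask):
        """data: list of V tensors (N, d_v); mask: (N, V) availability matrix."""
        codes = torch.stack([enc(x_v) for enc, x_v in zip(self.encoders, data)], dim=1)  # (N, V, D)
        w = mask.float().unsqueeze(-1)                      # keep observed modalities only
        return (codes * w).sum(dim=1) / w.sum(dim=1).clamp(min=1.0)
```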

Generally, the neighborhood structure can be obtained from a Gaussian-based kernel matrix. We denote the matrix as \({\textbf{G}}_n\in {\textbf{R}}^{N \times N}\), and the detailed formulation is as follows:

$$\begin{aligned} ({\textbf{G}}_{n})_{ij}=\left\{ \begin{array}{ll} \exp \left( -\dfrac{\Vert {\textbf{h}}_{i}-{\textbf{h}}_{j}\Vert ^{2}}{2\sigma ^{2}}\right) , &{} \text { if } {\textbf{h}}_{i}\in N_{k}\left( {\textbf{h}}_{j}\right) \text { or } {\textbf{h}}_{j}\in N_{k}\left( {\textbf{h}}_{i}\right) , \\ 0, &{} \text { otherwise}, \end{array}\right. \end{aligned}$$
(3)

where \(\sigma \) is the standard deviation of the Gaussian kernel, and \(N_{k}\left( {\textbf{h}}_{i}\right) \) and \(N_{k}\left( {\textbf{h}}_{j}\right) \) denote the sets of K nearest neighbors of \({\textbf{h}}_{i}\) and \({\textbf{h}}_{j}\), respectively.
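Under this kNN-restricted Gaussian-kernel reading of Eq. (3), the similarity graph can be sketched as follows; the symmetrization over the two neighbor sets and the fixed \(\sigma\) are assumptions made for illustration.

```python
import torch

def similarity_graph(h, k=5, sigma=1.0):
    """Sketch of the similarity graph G_n in Eq. (3): a Gaussian kernel
    restricted to pairs that are among each other's K nearest neighbors."""
    dist2 = torch.cdist(h, h) ** 2                          # pairwise squared distances
    n = h.size(0)
    knn = dist2.topk(k + 1, largest=False).indices[:, 1:]   # k nearest neighbors, excluding self
    neighbor = torch.zeros(n, n, device=h.device)
    neighbor.scatter_(1, knn, 1.0)
    neighbor = (neighbor + neighbor.t()) > 0                # h_i in N_k(h_j) or h_j in N_k(h_i)
    g = torch.exp(-dist2 / (2 * sigma ** 2))
    return torch.where(neighbor, g, torch.zeros_like(g))
```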

3.1.3 Modality-Specific Availability Graphs

The set of missing samples may vary across modalities. Two samples in the same modality can interact with each other only if both are observed. To make full use of the similarities between samples, we propose modality-specific availability graphs based on the samples observed in the same modality. We denote the matrices as \({\textbf{G}}_e\in {\textbf{R}}^{V \times N \times N}\) and \({\textbf{G}}_e^{(v)}\in {\textbf{R}}^{N \times N}\). The detailed formulation is defined as follows:

$$\begin{aligned} {{\textbf{G}}_e}_{i j}^{(v)}=\left\{ \begin{array}{ll}1, \text { if both } {x}_{i}^{(v)} \text { and } {x}_{j}^{(v)} \text { exist} \\ 0,\text { otherwise } \end{array}\right. , \end{aligned}$$
(4)

where \({x}_{i}^{(v)}\) and \({x}_{j}^{(v)}\) are the v-th modality of different samples,

$$\begin{aligned} {\textbf{G}}_e=\left[ \begin{array}{ccccc} {\textbf{G}}_e^{(1)},&{\textbf{G}}_e^{(2)},&{\textbf{G}}_e^{(3)},&\text {...},&{\textbf{G}}_e^{(V)} \end{array}\right] . \end{aligned}$$
(5)
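Given the availability indicators of Eq. (2), each modality-specific availability graph in Eqs. (4)-(5) is simply the outer product of the corresponding mask column, as in the following sketch (the function name and tensor layout are illustrative):

```python
import torch

def availability_graphs(mask):
    """Sketch of the modality-specific availability graphs G_e in Eqs. (4)-(5).

    mask: (N, V) binary matrix with s_{nv} = 1 if sample n has modality v.
    Returns a (V, N, N) tensor where entry (v, i, j) is 1 iff both x_i^{(v)}
    and x_j^{(v)} exist.
    """
    m = mask.float()
    return torch.einsum('iv,jv->vij', m, m)
```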

3.1.4 Integrated Heterogeneous Graph

To further utilize the complementarity of data information, we consider fusing the similarity graph and the modality-specific availability graphs to obtain a set of integrated heterogeneous graphs. We define the graph adjacency matrix as follows:

$$\begin{aligned} {\textbf{G}}_{adj}={\textbf{G}}_{n} \cdot {\textbf{G}}_{e}^{(v)}. \end{aligned}$$
(6)

Then, we obtain:

$$\begin{aligned} {\textbf{G}}=\left[ \begin{array}{ccccc} {\textbf{G}}_{n}\cdot {\textbf{G}}_e^{(1)},&{\textbf{G}}_{n} \cdot {\textbf{G}}_e^{(2)},&\text {...},&{\textbf{G}}_{n} \cdot {\textbf{G}}_e^{(V)} \end{array}\right] . \end{aligned}$$
(7)
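Reading the product in Eq. (6) as an element-wise masking of the similarity graph by each availability graph (an assumption about the operator, since both are \(N \times N\) matrices), the set of integrated heterogeneous graphs can be assembled as:

```python
def integrated_graphs(g_n, g_e):
    """Sketch of Eqs. (6)-(7): fuse the similarity graph with every
    modality-specific availability graph by element-wise multiplication.

    g_n: (N, N) similarity graph; g_e: (V, N, N) availability graphs.
    Returns a (V, N, N) tensor with one integrated graph per modality.
    """
    return g_n.unsqueeze(0) * g_e
```

Combined with the earlier sketches, `integrated_graphs(similarity_graph(h), availability_graphs(mask))` would produce the V adjacency matrices fed to the attention layers described next.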

3.2 Graph Representation Learning

Given the integrated heterogeneous graphs, we can exploit structural information to learn complete representation. Our research is focused on addressing the challenges of incomplete multi-modal learning, with a particular emphasis on harnessing the interrelationships between modalities and samples in the absence of partial modalities. In other words, our intention is not to introduce a novel attention mechanism but rather to make the most of existing methodologies to explore the interrelationships in data with incomplete modalities in a comprehensive manner. Leveraging the dynamic adaptability of the attention mechanism introduced by Graph Attention Network (GAT) (Veličković et al., 2018), we have seamlessly incorporated GAT into our framework for the purpose of exploring the interrelationships between modalities and samples. By leveraging masked self-attention layers and stacking layers, nodes can attend to the features of their neighborhoods.

Formally, given the latent representations \(\left\{ {\textbf{h}}_{1}, {\textbf{h}}_{2}, \ldots , {\textbf{h}}_{N}\right\} \), where \({\textbf{h}}_{n} \in {\textbf{R}}^{D}\), N is the total number of samples, and D is the dimension of the latent space, the network outputs the adjacency matrices \({\textbf{G}}_{v}\) generated by multiple graph learners based on different samples. V groups of new features \({\textbf{z}}^{(v)}=\left\{ {z}_{1}^{(v)}, {z}_{2}^{(v)}, \ldots , {z}_{N}^{(v)}\right\} \left( {z}_{n}^{(v)} \in {\textbf{R}}^{F}, n=1, 2, \ldots , N \right) \) can then be generated, where N is the number of nodes, F is the number of features of each node, and \({z}_{n}^{(v)}\) refers to the feature vector associated with the n-th node and the v-th modality.

To facilitate graph representation learning, we use GAT to transform the latent representations into features that are suitable for graph semantics, which requires a mapping layer composed of learnable parameters \(\mathbf {\Theta }_{g}^{(v)}\):

$$\begin{aligned} {z}_{n}^{(v)}=GAT\left( {\textbf{h}}_{n}, {\textbf{G}}_{v}; \mathbf {\Theta }_{g}^{(v)}\right) , \end{aligned}$$
(8)

where \({\textbf{h}}_{n}\) is the latent representations, \({\textbf{G}}_{v}\) is the graph adjacency matrix, and \(\mathbf {\Theta }_{g}^{(v)}\) is the parameter set of the GAT.

Based on the representations \({\textbf{z}}^{(v)}\), we use the attention mechanism to calculate the importance between nodes. As an initial step, a shared linear transformation, parametrized by a weight matrix \({\textbf{W}} \in {\textbf{R}}^{F^{\prime } \times F}\) (with potentially different dimensionality \({F}^{\prime }\)), is applied to every node. The importance between each pair of nodes can then be calculated by a shared attention mechanism a. Thus, the attention coefficients are defined as:

$$\begin{aligned} e_{ij}=a\left( {\textbf{W}} z_{i}^{(v)}, {\textbf{W}} z_{j}^{(v)}\right) . \end{aligned}$$
(9)

The attention mechanism a is a feedforward network, parametrized by a weight vector \(\overrightarrow{{\textbf{a}}} \in {\textbf{R}}^{2{F}^{\prime }}\), and \(e_{ij}\) indicates the importance of node j to node i. We use an attention mask to inject the integrated heterogeneous graph structure into the calculation, and only consider \(e_{ij}\) for nodes \(j \in {\mathcal {N}}_{i}\) that have relations in \({\textbf{G}}_{v}\). Projected by a softmax function, the formula is:

$$\begin{aligned} \alpha _{i j}=\text {softmax}_{j}\left( e_{i j}\right) =\frac{\exp \left( e_{i j}\right) }{\sum _{k \in {\mathcal {N}}_{i}} \exp \left( e_{i k}\right) }, \end{aligned}$$
(10)

where \({\mathcal {N}}_{i}\) is some neighborhood of node i in the graph. The attention mechanism uses a single-layer feedforward network, and applies the LeakyReLU nonlinearity. Fully expanded out, the attention calculation can be expressed as:

$$\begin{aligned} \alpha _{ij} = \frac{\exp \left( \text {LeakyReLU}\left( \overrightarrow{{\textbf{a}}}^{T}\left[ {\textbf{W}} z_{i}^{(v)} \, \Vert \, {\textbf{W}} z_{j}^{(v)} \right] \right) \right) }{\sum _{k \in {\mathcal {N}}_{i}} \exp \left( \text {LeakyReLU}\left( \overrightarrow{{\textbf{a}}}^{T}\left[ {\textbf{W}} z_{i}^{(v)} \, \Vert \, {\textbf{W}} z_{k}^{(v)} \right] \right) \right) }, \end{aligned}$$
(11)

where \(\cdot ^{T}\) represents transposition and \(\left[ \cdot \Vert \cdot \right] \) is the concatenation operation. Moreover, multi-head attention can be used to enrich the capacity of the method and stabilize the training process. Each attention head has its own parameters. We use concatenation to integrate the outputs of multiple attention heads, which can be described as follows:

$$\begin{aligned} z_{i}^{\prime {(v)}}=\Vert _{b=1}^{B} \mu \left( \sum _{j \in N_{i}} \alpha _{i j}^{b} {\textbf{W}}^{b} z_{j}^{{(v)}}\right) , \end{aligned}$$
(12)

where \(\Vert \) represents concatenation, \(\mu \) is the activation function, \(\alpha _{i j}^{b}\) are normalized attention coefficients computed by the b-th attention mechanism (\(a^{b}\)), B is the number of attention heads and \({\textbf{W}}^{b}\) is the corresponding weight matrix of input linear transformation. When dealing with the last layer, we use the average value instead of the concatenation, as follows

$$\begin{aligned} z_{i}^{\prime {(v)}}= \mu \left( \frac{1}{B} \sum \limits _{b=1}^{B} \sum _{j \in N_{i}} \alpha _{i j}^{b} {\textbf{W}}^{b} z_{j}^{{(v)}}\right) . \end{aligned}$$
(13)
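The following is a minimal single-head sketch of Eqs. (9)-(11); masking non-neighbor scores with \(-\infty\) before the softmax and the LeakyReLU slope of 0.2 follow the original GAT formulation and are assumptions with respect to IHGAT's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionHead(nn.Module):
    """Minimal sketch of one attention head computing Eqs. (9)-(11)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)         # shared linear transform W
        self.a = nn.Parameter(torch.randn(2 * out_dim) * 0.1)   # attention vector a

    def forward(self, z, adj):
        """z: (N, F) node features; adj: (N, N) integrated graph, 0 = no edge.
        Assumes every node has at least one neighbor (e.g., via self-loops)."""
        wz = self.W(z)                                           # (N, F')
        f_prime = wz.size(1)
        # e_ij = LeakyReLU(a^T [W z_i || W z_j]), decomposed into the two halves of a
        e = F.leaky_relu(
            wz @ self.a[:f_prime].unsqueeze(1)                   # contribution of node i, (N, 1)
            + (wz @ self.a[f_prime:].unsqueeze(1)).t(),          # contribution of node j, (1, N)
            negative_slope=0.2,
        )
        e = e.masked_fill(adj == 0, float('-inf'))               # attend only to graph neighbors
        alpha = torch.softmax(e, dim=1)                          # Eq. (10)
        return alpha @ wz                                        # weighted aggregation of neighbors
```

Multi-head attention as in Eq. (12) would concatenate the outputs of B such heads (or average them in the last layer, as in Eq. (13)).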

3.2.1 Consistent Embedding Network

Based on the learned features, we optimize the clustering task in an end-to-end manner. To measure the non-symmetric difference between the original distribution and the target distribution, we embed the consistency of the probability distributions into the network. Following (Wang et al., 2018; Tao et al., 2019), we measure the similarity between the integrated node representations \(z_{i}^{\prime (v)}\) and the cluster center \(\mu _{j}\) by adopting the Student's t-distribution. \(q_{i j}^{(v)}\) and \(p_{i j}^{(v)}\) are the elements of the original distribution \({\textbf{Q}}\) and the target distribution \({\textbf{P}}\), respectively, which are defined as:

$$\begin{aligned} q_{i j}^{(v)} = \frac{(1+\Vert z_{i}^{\prime {(v)}}-\mu _{j}\Vert ^{2}/\beta )^{ -\frac{\beta +1}{2}}}{\sum \limits _{j^{\prime }=1}^{J} (1+\Vert z_{i}^{\prime {(v)}}-\mu _{j^{\prime }}\Vert ^{2}/\beta )^{-\frac{\beta +1}{2}}}, \end{aligned}$$
(14)

where \(\Vert \cdot \Vert \) represents the \(l_{2}\)-norm; \(\mu _{j}\) is the cluster center; J is the number of cluster centers; \(\beta \) is the degree of freedom of the Student's t-distribution, and \(q_{i j}^{(v)}\) is the probability of assigning node i to cluster j. In our experiments, the cluster centers \(\{\mu _{j}\}_{j=1}^J\) are initialized by employing k-means, and the target probability distribution \(p_{i j}^{(v)}\) (0 \(\le \) \(p_{i j}^{(v)}\) \(\le \) 1) is then computed. We obtain \(p_{i j}^{(v)}\) by raising \(q_{i j}^{(v)}\) to the second power and normalizing by the frequency per cluster:

$$\begin{aligned} p_{i j}^{(v)} = \frac{{{q_{i j}^{(v)}}^2}/{f_j}}{\sum \limits _{j^{\prime }=1}^{J} {{q_{i j^{\prime }}^{(v)}}^2}/{f_{j^{\prime }}}}, \end{aligned}$$
(15)

where \(f_{j} = \sum \limits _{i=1}^{N} {q_{i j}^{(v)}}\) are the soft cluster frequencies. To compare the similarity of the two probability distributions, we define our objective as a probability distribution consistency loss \({\mathcal {L}}_{c}\). The clustering loss is defined as minimizing the KL divergence between the original distribution and the target distribution. That is, \({\mathcal {L}}_{c}\) is defined as:

$$\begin{aligned} {\mathcal {L}}_{c} ({\textbf{P}}, {\textbf{Q}})= KL({\textbf{P}}\Vert {\textbf{Q}}) =\sum _{v} \sum _{i} \sum _{j} p_{i j}^{(v)} log \frac{p_{i j}^{(v)}}{q_{i j}^{(v)}}, \end{aligned}$$
(16)

where KL denotes the Kullback–Leibler divergence, which measures the non-symmetric difference between two probability distributions, and \(\Vert \) separates the two distributions. \({\textbf{P}}\) and \({\textbf{Q}}\) are defined by Eq. (15) and Eq. (14), respectively. Finally, we take the mean of \(\{p_{i j}^{(v)}\}_{v=1}^{V}\) as the ideal distribution.
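A minimal sketch of this clustering head, covering Eqs. (14)-(16) for a single modality branch, is given below; the k-means initialization of the centers and the averaging of the target distributions over modalities described above are omitted, and the helper names are illustrative.

```python
import torch

def soft_assignments(z, centers, beta=1.0):
    """Eq. (14): Student's t soft assignment of node representations to cluster centers.
    z: (N, F') integrated node representations; centers: (J, F')."""
    dist2 = torch.cdist(z, centers) ** 2
    q = (1.0 + dist2 / beta) ** (-(beta + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Eq. (15): sharpen q and normalize by the soft cluster frequencies f_j."""
    weight = q ** 2 / q.sum(dim=0)          # q_ij^2 / f_j
    return weight / weight.sum(dim=1, keepdim=True)

def consistency_loss(q, eps=1e-12):
    """Eq. (16): KL(P || Q), with the target distribution P detached from the graph."""
    p = target_distribution(q).detach()
    return (p * ((p + eps).log() - (q + eps).log())).sum()
```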

Accordingly, the overall loss function of the proposed IHGAT can be formulated as follows:

$$\begin{aligned} {\mathcal {L}} = \lambda _{1}{\mathcal {L}}_{c} + \lambda _{2}{\mathcal {L}}_{r}, \end{aligned}$$
(17)

where \({\mathcal {L}}_{c}\) and \({\mathcal {L}}_{r}\) are the clustering loss and the reconstruction loss, respectively. The \(\lambda _{1}\) and \(\lambda _{2}\) are the trade-off hyper-parameters of the \({\mathcal {L}}_{c}\) and \({\mathcal {L}}_{r}\), respectively.
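To make the overall objective concrete, the sketch below combines the pieces from the earlier sketches into one optimization step for Eq. (17); the module interface (`reconstruction_loss`, `soft_assignments_per_modality`) is an illustrative assumption, not the actual IHGAT code.

```python
def training_step(model, data, mask, optimizer, lam1=100.0, lam2=1.0):
    """One optimization step of L = lam1 * L_c + lam2 * L_r (Eq. (17)).
    The default lam1/lam2 follow the values reported in the experiments."""
    optimizer.zero_grad()
    l_r = model.reconstruction_loss(data, mask)                                      # Eq. (1)
    l_c = sum(consistency_loss(q) for q in model.soft_assignments_per_modality())   # Eq. (16)
    loss = lam1 * l_c + lam2 * l_r
    loss.backward()
    optimizer.step()
    return loss.item()
```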

Algorithm 1: Algorithm for IHGAT

We construct a set of integrated heterogeneous graphs based on the similarity graph learned from unified latent representations and the modality-specific availability graphs obtained by the existing relations of different samples. Based on the constructed integrated heterogeneous graphs, we use the incomplete multi-modal data as input to optimize the model parameters for a better representation. Algorithm 1 briefly summarizes the optimization procedures of the proposed method.

4 Experiments

In this section, we conduct comprehensive experiments on incomplete multi-modal data to evaluate the performance of the proposed method, followed by an analysis of the method.

4.1 Metrics and Datasets

For a comprehensive analysis, we conducted extensive experiments on six datasets and adopted two widely used metrics: Accuracy (ACC) and Normalized Mutual Information (NMI). For both metrics, higher values denote better clustering performance.

CUB (Wah et al., 2011): Caltech-UCSD Birds (CUB) contains 11,788 bird images associated with text descriptions from 200 different categories (we followed the experimental settings in Zhang et al. (2022), so the first 10 categories are used). We extracted 1024-dimensional features based on images using GoogLeNet, and 300-dimensional features based on text (Le & Mikolov, 2014).

Football: A collection of 248 English Premier League football players and clubs active on Twitter. The disjoint ground-truth communities correspond to the 20 individual clubs in the league.

ORL: ORL is a popular face database in the field of face recognition. It contains 400 face images provided by 40 volunteers, with 10 face images from each person. Three types of features, i.e., LBP, Gabor, and intensity, are extracted as the three modalities for representing every face image.

PIE: PIE is a subset containing 680 facial images of 68 subjects, for which the intensity, LBP, and Gabor features have been extracted.

Politics: A collection of Irish politicians and political organizations, assigned to seven disjoint ground-truth groups according to their affiliation.

3Sources: 3Sources is collected from three online news sources: BBC, Reuters, and Guardian. In total, 169 samples of stories are used, which are reported by all three sources.

ADNI: The dataset consists of 774 subjects from ADNI-1, including 226 normal controls (NC), 362 MCI, and 186 AD subjects. Only 379 subjects have complete MRI and PET data, including 101 NC, 185 MCI, and 93 AD, so the missing rate is up to 0.26. We use 93-dimensional ROI-based features from both the MRI and PET data.

3Sources-partial: 948 news articles were collected covering 416 distinct news stories. Specifically, 169 were reported in all three sources, 194 in two sources, and 53 appeared in a single news source. Each source represents a unique modality, and the combination of different sources constitutes the multi-modal information. The missing rate of 3Sources-partial is 0.24.

Table 1 The clustering performance comparison on six datasets with different missing rates (\(\varepsilon \))
Table 2 The clustering performance comparison on six datasets with high missing rates (\(\varepsilon \))
Table 3 The clustering performance comparison on real-world missing data

4.2 Experimental Setups

To generate incomplete multi-modal datasets from complete multi-modal datasets, we randomly removed different modalities within each sample according to the missing rate, defined as \(\varepsilon =\frac{\sum _{v} M_{v}}{V \times N}\), where \(M_{v}\) indicates the number of instances without the v-th modality. To evaluate the influence of \(\lambda _{1}\) and \(\lambda _{2}\), we varied both values over the range {0.01, 0.1, 1, 10, 100, 1000}. As shown in Fig. 6, IHGAT was robust to \(\lambda _{1}\) and \(\lambda _{2}\), and the proposed method reached a high level of performance when \(\lambda _{1}\) was in the range \(\{10, 100, 1000\}\) and \(\lambda _{2}\) was in the range \(\{0.01, 0.1, 1\}\). For all datasets, the trade-off hyper-parameters \(\lambda _{1}\) and \(\lambda _{2}\) were fixed to 100 and 1, respectively. We evaluated the performance and reported the averaged results over five runs of experiments. Our algorithm was implemented in PyTorch 1.9.0, and all evaluations were carried out on a standard Ubuntu 16.04 system with NVIDIA 3090 Graphics Processing Units (GPUs). We set the initial learning rate to 0.02 for all datasets except Football and PIE, for which it was 0.01.
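For illustration, the sketch below produces a random availability mask for a target missing rate \(\varepsilon\); keeping at least one observed modality per sample is our assumption about how degenerate samples are avoided, since the exact removal protocol is not specified beyond the definition of \(\varepsilon\).

```python
import numpy as np

def generate_availability_mask(n_samples, n_modalities, missing_rate, seed=0):
    """Randomly remove modality entries so that eps = sum_v M_v / (V * N).

    Assumes missing_rate <= (V - 1) / V so that every sample can keep at
    least one observed modality and the loop terminates."""
    rng = np.random.default_rng(seed)
    mask = np.ones((n_samples, n_modalities), dtype=int)
    n_to_remove = int(round(missing_rate * n_samples * n_modalities))
    while n_to_remove > 0:
        i = rng.integers(n_samples)
        v = rng.integers(n_modalities)
        if mask[i, v] == 1 and mask[i].sum() > 1:   # never drop a sample's last modality
            mask[i, v] = 0
            n_to_remove -= 1
    return mask
```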

4.3 Baseline Methods

Ten baseline methods were used in the experiments, including SVD (Hotelling, 1992), Average (Enders, 2010), CRA (Tran et al., 2017), iCmSC (Wang et al., 2020), IMVTSC-MVI (Wen et al., 2021), GP-MVC (Wang et al., 2021), CPM (Zhang et al., 2022), CPM-GAN (Zhang et al., 2022), PIMVC (Deng et al., 2023), and GreatF (Wen et al., 2023a).

SVD (Hotelling, 1992): SVD is a matrix completion method by iterative soft thresholding of singular value decomposition.

Average (Enders, 2010): Average imputes missing parts with the average value of all samples in each modality.

CRA (Tran et al., 2017): CRA is composed of a set of stacked residual autoencoders, which can learn complex relationships among data from different modalities.

iCmSC (Wang et al., 2020): iCmSC is a novel incomplete cross-modal clustering method that integrates canonical correlation analysis and exclusive representation.

IMVTSC-MVI (Wen et al., 2021): IMVTSC-MVI incorporates the feature space based missing-modality inferring and manifold space based similarity graph learning into a unified framework.

GP-MVC (Wang et al., 2021): GP-MVC is a generative partial multi-modal clustering model with adaptive fusion and cycle consistency to solve the incomplete multi-modal problem by explicitly generating the data of missing modalities.

CPM (Zhang et al., 2022): CPM provides the comparative version of the CPM-Nets without the adversarial strategy.

CPM-GAN (Zhang et al., 2022): CPM-GAN is the adversarial version of CPM-Nets, in which the networks that generate the missing modalities can be regarded as generators. The discriminators use the same structure as the generators; for the purpose of discrimination, a sigmoid layer is imposed on the output layer of each discriminator network.

PIMVC (Deng et al., 2023): PIMVC applies projection learning to IMmC, which solves the problem of information imbalance between different modalities.

Fig. 2: Ablation study on (a) CUB, (b) Football, (c) ORL, (d) PIE, (e) Politics, and (f) 3Sources datasets

Fig. 3: Visualization of (a) IHGAT, (b) S-IHGAT, and (c) LATENT on the CUB dataset with varying epochs

GreatF (Wen et al., 2023a): GreatF provides an adaptive weighted matrix factorization model to obtain the representation of every modality, which can enhance the weight of the discriminative features of all modalities for representation learning.

4.4 Incomplete Multi-Modal Clustering Performance

Experimental results are shown in Table 1. By analyzing the results, we make the following observations: (1) In terms of both ACC and NMI, our method achieves promising performance compared with all baselines. It performs the best under most settings of all datasets in terms of both metrics, which validates the effectiveness of IHGAT. (2) Although the baselines can also achieve good performance at several low missing rates, clear performance degradation can be observed as the rate of missing data increases. Most existing baselines attempt to complete the missing modalities, which may introduce extra noise when the missing patterns are complex or the missing rate is high. (3) Across all missing rates, IHGAT generally achieves outstanding performance on all multi-modal datasets. On the six datasets, we average the ACC and NMI of each method over different missing rates (\(\varepsilon \) from 0.1 to 0.5), and our method is 6.74% higher in ACC and 8.75% higher in NMI than the second-best method.

4.5 Multi-Modal Clustering Performance with High Missing Rates

Based on the above discussion, we further analyze the most competitive state-of-the-art methods, including iCmSC, IMVTSC-MVI, GP-MVC, and CPM-GAN, in Table 2. In terms of ACC, taking the missing rate \(\varepsilon \)=0.9 as an example, our method improves by 18.25, 9.71, 25.4, 17.41, 14.35, and 16.33% over the second-best performers on CUB, Football, ORL, PIE, Politics, and 3Sources, respectively. IMVTSC-MVI hardly maintains good performance as the missing rate increases, whereas CPM-GAN is rather robust to modality-missing data. The missing modalities seriously affect the mining of information from multi-modal data. For a more comprehensive comparison, we compare the performance of IHGAT and the suboptimal methods at high missing rates by taking mean values. The suboptimal methods differ across datasets and consist of IMVTSC-MVI and CPM-GAN. The average ACC and NMI of IHGAT are 46.00 and 53.43%, respectively, while those of the suboptimal methods are 31.22 and 38.07%, respectively. IHGAT thus improves by 14.78 and 15.36% over the suboptimal methods in ACC and NMI, respectively.

The combined analysis of Tables 1 and 2 shows that most baselines rely heavily on a large amount of paired multi-modal data, using the shared information of latent representations to complete the missing modalities. When a large portion of modalities is missing and the missing patterns are complexly distributed, such methods may introduce additional noise and find it difficult to complete the missing modalities effectively. The above observations further validate the advantages of IHGAT. This suggests that it is beneficial to learn a more compact common representation for incomplete multi-modal clustering by considering both the structural information of missing data and the available information of non-missing data. It is clear that IHGAT is superior and more competitive, especially as the missing rate increases. This implies that our method can effectively explore the complex relationships between modalities and samples, even with a relatively large incomplete-sample ratio.

4.6 Multi-Modal Clustering Performance on Real-World Missing Data

Existing IMmC research is mostly conducted on publicly available multi-modal datasets from which portions of the data are randomly removed to form incomplete multi-modal datasets; real-world incomplete multi-modal datasets are rarely encountered in existing research. To further validate the effectiveness of IHGAT in real-world scenarios, we conducted experiments on two real-world incomplete multi-modal datasets, namely ADNI and 3Sources-partial.

As presented in Table 3, IHGAT still performs well on real-world missing datasets. In real-world scenarios, there are usually incomplete cases for multi-modal data. For instance, within medical applications, diverse subjects typically undergo various types of examinations. In the realm of web analysis, some websites encompass a variety of content, including text, images, and videos, while others may contain only a subset of these, resulting in data with missing modalities. As the number of modalities increases, the patterns of modality-missing, denoting the combinations of available modalities, become progressively intricate. Therefore, research on incomplete multi-modal data holds significant practical value and has a wide range of application scenarios.

Table 4 The clustering performance comparison on three datasets with different modules

4.7 Ablation Study

To verify the effectiveness of the similarity graph and the modality-specific availability graphs, we visualized the representations on the CUB dataset to investigate the performance of IHGAT. Figure 3a, b, and c show the representations of IHGAT, S-IHGAT, and LATENT obtained in different epochs. LATENT denotes the proposed method without the similarity graph and modality-specific availability graphs; S-IHGAT uses the similarity graph only, and IHGAT uses both the similarity graph and the modality-specific availability graphs. As the number of epochs increases, the clusters of IHGAT become more compact, and the margins between different classes become clearer. This shows that the similarity graph and modality-specific availability graphs contribute substantially to the representation learning ability of IHGAT.

To further analyze the contribution of the similarity graph and modality-specific availability graphs in IHGAT, we conducted an ablation study of the proposed method. As presented in Fig. 2, S-IHGAT substantially outperforms LATENT, which indicates that overlooking the structural relationships between samples, which enhance multi-modal complementary information, is harmful. Besides, IHGAT performs better than S-IHGAT, validating the effectiveness of the modality-specific availability graphs. Under the combined influence of the similarity graph and modality-specific availability graphs, IHGAT can indeed achieve better clustering results.

The graph adjacency matrix provides the possibility to correlate features and semantic representations, and the intrinsic structural information can be maintained to obtain more sufficient complementary information of different samples and modalities. The new features of a specific node are obtained by adding a nonlinear transformation to a weighted average of the neighboring features of the specific node in terms of their contribution. The new features are tighter and can further exploit complementarity. It can be intuitively seen through Figs. 2 and 3 that the similarity graph and modality-specific availability graphs have a significant impact on the performance of our proposed method, mainly because different constraints obtain different features.

As shown in Table 4, we conduct additional ablation experiments to explore the impact of different graph construction methods, attention mechanisms, and probability distributions, which provides a deeper understanding of the contributions of these components. Additionally, we employ t-SNE visualization, as shown in Fig. 4, to illustrate that using multiple modalities to construct a unified representation results in more compact intra-class clusters and clearer inter-class boundaries. While directly aggregating encoder outputs into a common representation provides some improvement over using a single modality, utilizing multi-modal information to build a learnable unified latent representation yields superior overall performance. Our method reduces the heavy reliance on paired data by encoding multiple modalities into unified hidden representations, and, unlike grouping-based works, we do not have to partition the data when the amount of data is large. Combining the similarity graph and the modality-specific availability graphs into a set of integrated heterogeneous graphs better explores the structural information of the data and the relationships among samples and modalities.

Fig. 4: Visualization of (a) Modality-1, (b) Modality-2, (c) Modality-aggregate, and (d) Modality-learnable on CUB

Fig. 5: Convergence analysis on (a) CUB, (b) Football, (c) ORL, and (d) 3Sources datasets

Fig. 6: Effect of the parameters \(\lambda _{1}\) and \(\lambda _{2}\) on (a) CUB, (b) ORL, and (c) 3Sources datasets

4.8 Convergence Analysis

To investigate the stability and convergence of the training process of IHGAT, we show the convergence curves of IHGAT and CPM-GAN, a typical consistency-strategy method, on multiple datasets (\(\varepsilon \) = 0.9). As shown in Fig. 5, the training process of CPM-GAN is quite unstable on data with high missing rates. Consequently, the quality of the generator is difficult to control, which degrades the performance of the model significantly. This also reveals the potential risk of introducing additional noise when completing missing modalities on highly incomplete data. By contrast, IHGAT converges stably and quickly within around 75 epochs, further demonstrating the performance advantage of IHGAT under complicated data distributions.

4.9 Parameter Analysis

Since \(\lambda _{1}\) and \(\lambda _{2}\) are the trade-off hyper-parameters that balance the clustering term and the reconstruction term in the final loss function, we analyzed the impact of different values of \(\lambda _{1}\) and \(\lambda _{2}\) on the performance (\(\varepsilon \) = 0.5), and the results are shown in Fig. 6. It can be seen that IHGAT is not sensitive to \(\lambda _{1}\) and \(\lambda _{2}\), and the proposed method reaches a high level of performance when \(\lambda _{1}\) is in the range \(\{10, 100, 1000\}\) and \(\lambda _{2}\) is in the range \(\{0.01, 0.1, 1\}\).

Fig. 7: Effect of (a) the number of nearest neighbor samples K and (b) the dimensionality of the latent representations, in terms of ACC and NMI, respectively

The number of nearest neighbors K and the dimensionality of the latent representations \(\gamma \) are the two main parameters of our method. Since K nearest neighbors are used to obtain the similarity graph, we analyzed the influence of different K values on the proposed method. As shown in Fig. 7a, K values that are too small or too large are adverse to the performance of the model. If K is too small, the model easily becomes overly complicated and over-fits; if K is too large, the result is affected by distant points. Therefore, a medium K value, specifically \(K=5\), is appropriate for our method in the experiments.

In terms of \(\gamma \), we visualized the influence of different values of \(\gamma \) on three datasets (\(\varepsilon \) = 0.5) in Fig. 7b, where \(\gamma \) ranges over {16, 32, 64, 128, 256}. Ten trials of experiments were conducted, and the average values of ACC and NMI are reported as the final results. According to Fig. 7b, different parameter settings noticeably affect the performance of the method, and most datasets achieve better performance with \(\gamma = 64\), which is a good default choice.

5 Conclusion

In this paper, we proposed an effective method that deeply mines structural information to exploit the complementary information of different samples for IMmC. First, the similarity graph and the modality-specific availability graphs are fused to form a set of integrated heterogeneous graphs. Thereafter, the attention mechanism is applied to the obtained integrated heterogeneous graphs to capture the complementary information among different samples and modalities. In this way, complete representations can be learned for data with incomplete modalities. Finally, clustering is performed on the learned representations by embedding the consistency of probability distributions into the network. The proposed method does not require a large amount of paired data to model the missing modalities and shows significant improvements over the compared methods on six challenging benchmark datasets. Clearer advantages of the proposed method over the baselines can be observed at high missing rates.