
Applied Intelligence (2024) 54:9363–9380

https://doi.org/10.1007/s10489-024-05666-w

Dual-view graph convolutional network for multi-label text classification
Xiaohong Li1 · Ben You1 · Qixuan Peng1 · Shaojie Feng1

Accepted: 30 June 2024 / Published online: 15 July 2024


© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024

Abstract
Multi-label text classification refers to assigning multiple relevant category labels to each text, which has been widely applied
in the real world. To enhance the performance of multi-label text classification, most existing methods focus only on optimizing document and label representations, assuming that accurate label-document similarity is the decisive factor. However, the potential relevance between labels and the long-tail distribution of labels are also key factors affecting the performance of multi-label classification. To this end, we propose a multi-label text classification model called DV-MLTC,
which is based on a dual-view graph convolutional network to predict multiple labels for text. Specifically, we utilize graph
convolutional neural networks to explore the potential correlation between labels in both the global and local views. First, we
capture the global consistency of labels on the global label graph based on existing statistical information and generate label
paths through a random walk algorithm to reconstruct the label graph. Then, to capture relationships between low-frequency
co-occurring labels on the reconstructed graph, we guide the generation of reasonable co-occurring label pairs within the
local neighborhood by utilizing the local consistency of labels, which also helps alleviate the long-tail distribution of labels.
Finally, we integrate the global and local consistency of labels to address the problem of highly skewed distribution caused
by incomplete label co-occurrence patterns in the label co-occurrence graph. The evaluation shows that our proposed model
achieves competitive results compared to existing state-of-the-art methods. Moreover, our model achieves a better balance
between efficiency and performance.

Keywords Multi-label classification · Graph convolutional networks · Random walk model · Label co-occurrence · Label
graph

Ben You contributed equally to this work.
Corresponding author: Xiaohong Li, xiaohongli@nwnu.edu.cn
Ben You: 2021222268@nwnu.edu.cn
Qixuan Peng: 2021222181@nwnu.edu.cn
Shaojie Feng: 2022222319@nwnu.edu.cn
1 College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China

1 Introduction

Multi-label text classification (MLTC) is a crucial task in natural language processing that finds applications in various domains, including sentiment analysis [1], patent classification [2], and question answering [3]. The primary objective of MLTC is to assign one or more appropriate categories to a document using a set of predefined categories or labels.

In recent years, MLTC has garnered significant attention and has become an active area of research. However, the increasing number of labels and documents, coupled with the complex interrelationships between labels and documents, pose significant challenges to MLTC. These challenges have prompted researchers to delve deeper into the field of multi-label learning.

Previous research on MLTC focused on developing enhanced document representations. Various methods have been proposed for learning label-specific document representations [4, 5]. Moreover, some studies have used attention mechanisms to capture label-semantic-based representations [6-8] and document-label interaction representations [9-12]. Although these approaches have shown promising results, they have not fully explored the interactions between label-


specific semantic components, thereby ignoring the rich label co-occurrence information within documents.

In recent years, label co-occurrence graph-based methods have gained attention for their ability to exploit statistical correlations between labels to construct label co-occurrence graphs [13-17]. In this study, we refer to the view of label co-occurrence graphs built using statistical correlations as label global consistency. We identified two issues regarding label global consistency. First, statistical label correlations may exhibit a long-tailed distribution, with some categories being common and most having only a few relevant documents. Figure 1 shows the long-tailed distribution on RCV1 [18], where only a few labels have a large number of articles, and these head labels also have a high co-occurrence with other labels. Second, the co-occurrence patterns between label pairs obtained from the training data are frequently incomplete. For instance, in the AAPD, the labels "computers and society (cs.CY)" and "Physics and Society (physics.soc-ph)" co-occurred 300 times in the training set, while only 6 times in the test set (0.009%). This imbalance in the co-occurrence frequency of labels within the data, as well as between the training and testing sets, led to a highly skewed distribution [15]. Existing methods that model label relationships solely based on label global consistency, which rely on prior statistical information from the training data, fail to address the above two challenges effectively.

Fig. 1 Long-tailed distribution and label co-occurrence for the RCV1. The co-occurrence matrix undergoes color-coding, wherein the representation is influenced by the conditional probability p(i|j). This probability signifies the likelihood of the presence of a class in the i-th column given the occurrence of a class in the j-th row

To address these challenges, we propose a dual-view graph convolutional network for multi-label text classification (DV-MLTC). The proposed method aims to model label co-occurrences from both global and local perspectives, thereby offering a comprehensive solution. First, to address the long-tailed distribution problem, we introduce a strategy that generates label paths for the local label graph using a random walk. By reconstructing the local label graph based on this strategy, we effectively capture the relationships between low-frequency co-occurring labels. This approach helps alleviate the long-tail distribution issue and enhances the overall performance. Second, to address the highly skewed distribution problem caused by the incompleteness of label co-occurrence patterns in the label co-occurrence graph, we leverage the power of graph convolutional networks (GCN) [16]. By employing a GCN, we can model rich co-occurrence patterns between labels from both global and local consistency perspectives. Additionally, label local consistency is proposed to measure the rationality of label co-occurrence in local neighborhoods, further improving the accuracy of the model. Furthermore, we incorporate attention flow to extract label-specific semantic components from the document content. This allows us to merge the semantic information of the labels and obtain the initial embedding of the dual-view graph convolution. Finally, we fuse the fine-grained document information with learned label correlations for classification, resulting in a comprehensive and robust classification model.

This paper makes the following contributions:

• We introduced a novel neural network that leverages dual-view convolutions on label co-occurrence graphs for MLTC tasks. Our model combines learned label information from a dual-view graph convolution with label-specific document representations using a dual attention flow. This integration enhanced the overall performance of the model.
• To effectively capture the co-occurrence patterns between labels, we leveraged both global and local label consistencies. Additionally, we employed a dynamic construction approach for the local label graph using a random-walk strategy. This strategy enriches the co-occurrence patterns between labels and significantly improves the performance of multi-label text classification.
• To evaluate the effectiveness of the proposed model, we conducted experiments on three commonly used benchmark datasets. The experimental results demonstrate the competitiveness of our model on these datasets, demonstrating its ability to achieve impressive performance in multi-label text classification tasks.


2 Related work

2.1 Enhancing document-label interaction in MLTC

With the widespread application of neural network methods in document representation, innovative deep-learning approaches have been developed. XML-CNN [4] uses a convolutional neural network (CNN) and dynamic pooling to learn text representations for multi-label text classification. A sequence-to-sequence (Seq2Seq) model based on recurrent neural networks (RNNs) was used to capture the correlations between labels [19-21]. Nevertheless, these models treated all words uniformly and failed to discern the informative content within the documents. Considering the negative impact of label sequence order on Seq2Seq models, S2S-LSAM [22] introduces a novel Seq2Seq model with distinct semantic attention mechanisms for labels. This model incorporates label semantics and textual features through the interaction of the label semantic attention mechanism, resulting in fused information comprising both label and textual information. ML-Reasoner [23] utilizes a sequence model as a text feature extractor and incorporates the prediction probabilities from the previous round as an additional input to reflect label correlation. This approach mitigates the reliance on label order. The aforementioned methods do not model the rich co-occurrence relationships among labels. Moreover, these methods struggle to effectively address the long-tail issue associated with labels.

Recently, attention mechanisms have been used in several studies to enhance the interaction between labels and words [24], labels and documents [6, 11, 25-27], and labels and labels [7], in order to learn label-specific document representations for classification tasks. Some methods have taken a different approach by incorporating additional sources of knowledge to enhance label-specific document representations [28-30]. These approaches exhibited promising results in MLTC, underscoring the importance of investigating semantic connections. However, they did not thoroughly explore the interactions among label-specific semantic components, which could potentially enhance the prediction of low-frequency labels. In our research, we introduce a label-word attention module and a label-semantic self-attention module. The former extracts important semantics specific to labels from the word-level document information. The latter further helps capture label-level semantic features. Our approach enriches the semantic information of labels by combining these two modules, and this enhanced representation has the potential to improve prediction accuracy, particularly for low-frequency labels.

2.2 Label co-occurrence graph in MLTC

To apprehend profound correlations among labels in a graph structure and delve into the semantic interactions between label-specific components in documents, a common approach involves utilizing label graphs based on statistical co-occurrence. MAGNET [14] constructs a label graph based on frequency. DXML [31] establishes an explicit label co-occurrence graph to explore label embeddings in a low-dimensional latent space. LiGCN [32] utilized a pretrained language model as the initial embedding of a label-word heterogeneous graph and achieved outstanding classification performance while paying attention to different word choices. The methods used by LR-GCN [33] and GCN-MTC [15] are similar: they constructed label graphs based on data-driven statistical information, and the former performed better than the latter. LDGN [34] adaptively modeled the interactions among labels using dual-graph convolutional neural networks. CFTC [35] first constructed a global label co-occurrence graph and then prevented confounding shortcuts using counterfactual techniques with the help of a human causal graph. S-GCN [36] leverages text, words, and labels to construct a global heterogeneous graph for mining correlations between similar documents. Subsequently, an encoder is trained to extract semantic features from document nodes, followed by utilizing graph convolutional networks to classify the text nodes. TLC-XML [37] initially constructs a label correlation graph using the semantic information of labels and symmetric conditional probabilities. Subsequently, strongly correlated labels are grouped into the same cluster. Finally, graph convolutional networks are employed to extract the inter-cluster correlations among the label clusters. Nevertheless, each label is assigned to only one cluster, which severely ignores the semantic correlation of labels.

However, the majority of the above methods primarily focus on the label global consistency of label co-occurrence while neglecting the potential label local consistency, which could further enhance classification performance. By contrast, our proposed dual-view convolution module is guided by prior knowledge from co-occurrence statistics and posterior information obtained from a dynamic random walk, which can effectively capture comprehensive interactions from different views, understand the potential relationships between labels through global and local patterns in the data, and improve classification performance.

3 Proposed model

As shown in Fig. 2, our model comprises two primary modules: 1) a label-specific document representation based on dual attention flow, which extracts label-specific semantic components from the word-level information of each document; and 2) dual-view graph convolutional networks for semantic interactive learning.


Fig. 2 The model architecture of DV-MLTC consists of two main components: GlobalConv and LocalConv. In the GlobalConv component, we construct a prior label co-occurrence graph and derive the label co-occurrence matrix A^G. This matrix represents the connections between labels based on their co-occurrence probabilities. Using GlobalConv, we obtain the label embedding matrix H^G under the guidance of the global information (for example, node 1 connects to nodes 1, 2, 3, 4, and 5 by prior probability). In the LocalConv component, we leverage the label co-occurrence graph to compute the local co-occurrence frequency between label pairs using a random walk module. This process involves extending multiple label paths from a starting node. By incorporating this local guidance, we can adaptively add label relationships between pairs that initially had few co-occurrences. This helps mitigate the long-tail distribution problem by reconstructing the label co-occurrence graph. (For instance, node 1 was not originally connected to node 9 or node 12, but the co-occurrence relationships are added through label co-occurrence graph reconstruction.) Finally, the label co-occurrence matrix A^L is passed to the local convolution layer to obtain the matrix H^L

We present a detailed description of how this module effectively explores and captures comprehensive interactions from distinct perspectives, guided by prior knowledge of statistical co-occurrences and posterior information obtained from a dynamic random walk.

3.1 Problem definition

In the MLTC problem, we have a document set denoted as D = {d_1, d_2, ..., d_{|D|}} and a corresponding label set denoted as C = {c_1, c_2, ..., c_{|C|}}. Here, |D| represents the number of documents in the document set, and |C| represents the total number of labels. Each document d_i contains n words and is associated with a label vector c_i ∈ {0, 1}^{|C|}, indicating which labels are relevant.

To achieve the goal of MLTC, which is assigning the most relevant labels to a new document, we define a global label co-occurrence graph G = (V, E), where V represents the node set and E represents the edge set, as in previous work [13, 34, 38]. In this graph, the nodes represent the categories, and node v_i corresponds to label c_i in the label set C. The edges in the graph represent the statistical co-occurrences between categories. Specifically, we compute the conditional probability for all label pairs in the training set, yielding the global label co-occurrence matrix A^G ∈ R^{|C|×|C|}. Here, A^G_{(i,j)} = p(v_j | v_i) signifies the conditional probability of a document being categorized as c_j when it belongs to category c_i. Notably, G is a directed graph; therefore, A^G_{(i,j)} may not equal A^G_{(j,i)} owing to the conditional probability calculations.
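As a concrete illustration of this definition, A^G can be estimated directly from the training label sets. The sketch below is our own minimal NumPy version, not the authors' released code; the function and variable names are hypothetical.

```python
import numpy as np

def build_global_cooccurrence(label_sets, num_labels):
    """Estimate A^G where A^G[i, j] = p(v_j | v_i) from training label sets.

    label_sets: iterable of label-index lists, one per training document.
    num_labels: total number of labels |C|.
    """
    counts = np.zeros((num_labels, num_labels), dtype=np.float64)
    occurrences = np.zeros(num_labels, dtype=np.float64)
    for labels in label_sets:
        labels = set(labels)
        for i in labels:
            occurrences[i] += 1.0
            for j in labels:
                if i != j:
                    counts[i, j] += 1.0
    # Conditional probability p(v_j | v_i); rows for unseen labels stay zero.
    return counts / np.maximum(occurrences[:, None], 1.0)

# Example: three documents over a 4-label set.
A_G = build_global_cooccurrence([[0, 1], [0, 2], [1, 2, 3]], num_labels=4)
```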
3.2 Label-specific attention networks

Given a document D containing n words, we utilized bidirectional long short-term memory (BiLSTM) to encode word-level semantic information in the document representation. BiLSTM leverages its bidirectional nature to effectively capture contextual information by processing the word sequence in both the forward and backward directions. This enables a thorough understanding of the document's semantic context.


Upon applying BiLSTM, we obtained two sets of hidden states: forward and backward. These hidden states encapsulate the contextual information of the words within a document. To create a comprehensive word sequence representation, we concatenated the forward and backward hidden states, resulting in the matrix H ∈ R^{n×d_a}, where d_a denotes the dimension of the word vectors. By concatenating the forward and backward hidden states, semantic information can be captured in both directions, thereby creating a robust and holistic representation of the word sequence within the document.

3.2.1 Label-word attention

Labels possess distinctive semantics in the context of text classification, concealed within their textual representations or descriptions. To capitalize on this semantic information, labels undergo preprocessing and are represented as a trainable matrix L ∈ R^{|C|×d_a} in the same latent d_a-dimensional space as words. To determine the semantic relationship between each pair of words and labels, scaled dot-product attention is employed:

U^w = softmax(L H^T / √d_a) H    (1)

where L is the query vector, and H serves as both the key vector and the value vector. u_i is the i-th row vector of U^w ∈ R^{|C|×d_a}, denoting the semantic component in the document associated with label c_i. This representation is based on the label text, and we call it the Label-Word (LW) attention mechanism.

3.2.2 Label-semantic self-attention

Multiple labels may be assigned to labeled documents, and each document should encompass the contexts most relevant to its corresponding labels. Consequently, each document may comprise multiple components, and the words within a document may contribute differently to each label. To capture these distinct components of each label, a self-attention mechanism is employed. The label-semantic (LS) self-attention score Q ∈ R^{|C|×n} is calculated as follows:

Q = softmax(W_2 tanh(W_1 H^T))
U^s = Q × H    (2)

where W_1 ∈ R^{d_b×d_a} and W_2 ∈ R^{|C|×d_b} are self-attention parameters that must be trained, and d_b is a hyperparameter.

Label-specific semantic components are extracted from the text content using a novel approach that incorporates both the label-word attention U^w and the label-semantic self-attention U^s. By combining these attention flows, we obtain the label-specific document representation U = U^w + U^s. Our approach draws inspiration from previous works, such as [25] and [39], which also utilized attention mechanisms. However, the dual-attention flow module distinguishes itself based on two key aspects. First, we focused on the interaction between documents and labels, enabling a more targeted exploration of their relationships. Second, our calculation method is designed to be more straightforward and efficient while still delivering superior performance.

The resulting label-specific document representation U serves as the input for the subsequent module: the dual-view convolutional networks. These networks further process and capture the interactions between the extracted semantic components.
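A minimal sketch of this dual attention flow, covering the BiLSTM encoder and (1)-(2), might look as follows in PyTorch. The module name, the default dimensions, and the way the BiLSTM hidden size is split are our own assumptions rather than details taken from the paper.

```python
import math
import torch
import torch.nn as nn

class DualAttentionFlow(nn.Module):
    """Label-word attention (Eq. 1) plus label-semantic self-attention (Eq. 2), summed into U."""

    def __init__(self, vocab_size, num_labels, d_a=300, d_b=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_a)
        self.bilstm = nn.LSTM(d_a, d_a // 2, batch_first=True, bidirectional=True)
        self.label_embed = nn.Parameter(torch.randn(num_labels, d_a))  # trainable L
        self.w1 = nn.Linear(d_a, d_b, bias=False)                      # W_1
        self.w2 = nn.Linear(d_b, num_labels, bias=False)               # W_2
        self.d_a = d_a

    def forward(self, token_ids):
        # H: (batch, n, d_a) word-level context from the BiLSTM
        h, _ = self.bilstm(self.embed(token_ids))
        # Label-word attention: U^w = softmax(L H^T / sqrt(d_a)) H
        scores_lw = self.label_embed @ h.transpose(1, 2) / math.sqrt(self.d_a)
        u_w = torch.softmax(scores_lw, dim=-1) @ h
        # Label-semantic self-attention: Q = softmax(W_2 tanh(W_1 H^T)), U^s = Q H
        q = torch.softmax(self.w2(torch.tanh(self.w1(h))).transpose(1, 2), dim=-1)
        u_s = q @ h
        return u_w + u_s  # U: (batch, |C|, d_a)
```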


3.3 Dual-view graph convolutional networks

To capture the interactions between label-specific semantic components from multiple perspectives, we employed a dual-view interaction approach. Specifically, we utilize global and local consistency convolutions. In the global consistency convolution, we construct a global label co-occurrence graph and apply a GCN to achieve global consistency. This convolution leverages the co-occurrence patterns between labels captured by the global label co-occurrence graph. In the local consistency convolution, we generate a local label co-occurrence graph using a random walk strategy. Subsequently, we employ a GCN to perform the local consistency convolution. This convolution focuses on enhancing the co-occurrence patterns between labels based on the local context captured by the local label co-occurrence graph. These convolutions consider distinct interaction views, thereby enhancing the co-occurrence patterns between labels.

3.3.1 GlobalConv

To establish deep relationships between label-specific semantic components guided by statistical label correlations, we employ a global consistency convolution (GlobalConv). We leverage a GCN layer to propagate messages between neighboring label nodes, thereby enhancing the representation of these label nodes. The layer-by-layer propagation rule is defined as follows:

H^G = σ(D_1^{-1/2} Â^G D_1^{-1/2} U W^G)    (3)

where A^G in (3) is the global label co-occurrence graph, σ(·) represents the LeakyReLU activation function, Â^G represents the normalized adjacency matrix of A^G, D_1 is the degree matrix of A^G, and W^G ∈ R^{d_a×d_c} denotes the transformation matrix that must be learned. GlobalConv uses the initialized components U ∈ R^{|C|×d_a} and A^G as inputs and ultimately generates H^G ∈ R^{|C|×d_c}, where d_c denotes the dimensionality of the final node representation.
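Equation (3) is a standard single-layer GCN propagation over the normalized co-occurrence graph. The following sketch assumes PyTorch; adding self-loops before normalization is a common implementation choice that the paper does not spell out, so it is flagged as an assumption.

```python
import torch
import torch.nn as nn

def normalize_adjacency(a):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} for the GCN propagation rule."""
    a_hat = a + torch.eye(a.size(0), device=a.device)      # self-loops (implementation choice)
    d_inv_sqrt = torch.diag(a_hat.sum(dim=1).clamp(min=1e-12).pow(-0.5))
    return d_inv_sqrt @ a_hat @ d_inv_sqrt

class GraphConv(nn.Module):
    """One propagation step H = LeakyReLU(Â X W), as in Eq. (3) for GlobalConv."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)
        self.act = nn.LeakyReLU()

    def forward(self, x, a_norm):
        return self.act(a_norm @ self.weight(x))

# Usage sketch: H_G = GraphConv(d_a, d_c)(U, normalize_adjacency(A_G_tensor))
```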
GlobalConv primarily performs a 1-hop diffusion process in each layer by leveraging the prior statistical relationships present in the dataset. As described in a previous study [40], this process only considers the addition of feature vectors from neighboring nodes to account for the feature relationships between them. However, the statistical label correlations obtained from training data can be incomplete and noisy, and the co-occurrence patterns between label pairs may suffer from long-tailed distributions [15]. Recognizing this limitation motivated us to assign a certain probability to low-frequency co-occurring labels, indicating that they might belong to the same text rather than being directly filtered out as noise. We enable the model to learn more effective propagation and richer co-occurrence patterns by introducing the local consistency convolution.

3.3.2 LocalConv

In addition to the graph structure information defined by the adjacency matrix A^G, we utilize positive pointwise mutual information (PPMI) to encode the potential relationship between label pairs. First, we calculate the frequency matrix F using a random walk. Subsequently, we derive the local label co-occurrence graph A^L ∈ R^{|C|×|C|} based on F. Finally, we perform a local consistency convolution.

A random walk can be characterized as a Markov chain that delineates the sequence of nodes visited by a random walker [40]. We define a state as s(m) = v_i if a random walker is on node v_i at time m. The transition probability of moving from the current node v_i to one of its neighbors v_j is denoted as p(s(m+1) = v_j | s(m) = v_i). In our problem setting, given the prior label co-occurrence matrix A^G, we assign:

p(s(m+1) = v_j | s(m) = v_i) = A^G_{i,j} / Σ_j A^G_{i,j}    (4)

This assignment ensures that the transition probability is proportional to the label co-occurrence in A^G, thereby incorporating semantic information into the random walk process.

Algorithm 1 outlines the calculation of the frequency matrix F using the random walk. This algorithm can be parallelized by simultaneously performing multiple random walks on different parts of the graph.

Algorithm 1 Calculation method of frequency matrix F.
Require: global label co-occurrence matrix A^G, path length q, number of iterations t
Ensure: frequency matrix F
1: Initialize matrix F as a zero matrix
2: for each label node v_i do
3:   Set v_i as the starting point of the path for the random walk
4:   for 1 to t do
5:     Generate path S = RandomWalk(A^G, v_i, q)
6:     Uniformly sample non-repeating label pairs (v_i, v_j) from S
7:     for (v_i, v_j) do
8:       F_{i,j} += 1; F_{j,i} += 1
9:     end for
10:   end for
11: end for

Following the computation of the frequency matrix F, the i-th row in F corresponds to the row vector F_{i,:}, while the j-th column in F corresponds to the column vector F_{:,j}. Specifically, F_{i,:} represents the path node context for node v_i, and F_{:,j} represents the path neighbor context_j. Moreover, F_{i,j} denotes the number of co-occurrences of v_i and v_j in all generated paths. A higher value of F_{i,j} indicates a greater frequency of co-occurrence between the two nodes.

Using the frequency matrix F, we transform it into a PPMI matrix as follows:

p_{i,j} = F_{i,j} / Σ_{i,j} F_{i,j},   p_{i,*} = Σ_j F_{i,j} / Σ_{i,j} F_{i,j},   p_{*,j} = Σ_i F_{i,j} / Σ_{i,j} F_{i,j}    (5)

We apply (6) to encode the potential relationship between label pairs in F. Here, p_{i,j} represents the estimated probability of node v_i appearing in context_j, p_{i,*} denotes the estimated probability of node v_i, and p_{*,j} indicates the estimated probability of context_j. The adjacency matrix based on label local consistency is computed as follows:

A^L_{i,j} = max{PMI_{i,j} = log(p_{i,j} / (p_{i,*} p_{*,j})), 0}    (6)

where PMI_{i,j} is the pointwise mutual information between node v_i and context_j. The PPMI matrix A^L represents the adjacency matrix based on label local consistency, where any negative PMI value is set to zero.
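Algorithm 1 together with (5)-(6) can be sketched as follows. This is an illustrative NumPy version under our own reading of the sampling step (every distinct label pair on a path is counted rather than subsampled), so it should be treated as a sketch rather than the authors' implementation.

```python
import numpy as np

def random_walk(a_global, start, q, rng):
    """Sample a label path of length up to q from start, with transition probabilities
    proportional to the corresponding row of A^G (Eq. 4)."""
    path, node = [start], start
    for _ in range(q):
        row = a_global[node]
        total = row.sum()
        if total <= 0:                       # isolated node: stop the walk early
            break
        node = rng.choice(len(row), p=row / total)
        path.append(node)
    return path

def frequency_matrix(a_global, q, t, seed=0):
    """Algorithm 1: accumulate co-occurrence counts of label pairs found on random-walk paths."""
    rng = np.random.default_rng(seed)
    c = a_global.shape[0]
    f = np.zeros((c, c))
    for v in range(c):
        for _ in range(t):
            path = random_walk(a_global, v, q, rng)
            for i in set(path):
                for j in set(path):
                    if i != j:
                        f[i, j] += 1
    return f

def ppmi_adjacency(f):
    """Eqs. (5)-(6): turn the frequency matrix F into the PPMI-based local adjacency A^L."""
    total = f.sum()
    if total == 0:
        return np.zeros_like(f)
    p_ij = f / total
    p_i = f.sum(axis=1, keepdims=True) / total
    p_j = f.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_ij / (p_i * p_j))
    pmi[~np.isfinite(pmi)] = 0.0             # negative or undefined PMI values are set to zero
    return np.maximum(pmi, 0.0)
```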
Similar to GlobalConv, we define an independent single-layer GCN for LocalConv based on A^L. The graph convolutional network is given by:

H^L = σ(D_2^{-1/2} Â^L D_2^{-1/2} H^G W^L)    (7)

where Â^L denotes the normalized label local consistency matrix, D_2 is the degree matrix of A^L, and W^L ∈ R^{d_c×d_c} is a training parameter. Notably, the dynamically reconstructed A^L based on the random walk ensures label local consistency, where labels that appear on the same path are reasonably considered to belong to the same text. In addition, as the path length increases within a reasonable range, the importance of the labels becomes more prominent.


Moreover, the non-positive values in the PPMI matrix are automatically filtered out, preventing low-frequency co-occurring labels, such as noise, from disturbing the model.

Both H^G and H^L represent graph convolution-based label representations, with the former focusing on the similarity of global labels and the latter emphasizing the co-occurrence plausibility from a local perspective. These representations have different training parameters. In this task, concatenation is employed to integrate them:

Z = H^L || H^G    (8)

The label-specific document representation generated under the guidance of global and local consistency can be described as the matrix Z ∈ R^{|C|×2d_c}. We then make label predictions using a trainable linear layer followed by a sigmoid activation function:

Ŷ = sigmoid(W_3 Z + b_2)    (9)

where W_3 represents the weights of the linear layer and b_2 is the bias. Let y ∈ R^{|C|} denote the true label vector of a document, where y_i ∈ {0, 1} indicates whether label i is present in the document. The proposed model is trained using the multi-label cross-entropy loss:

L = − Σ_{i=1}^{N} Σ_{j=1}^{C} [ y_{ij} log(ŷ_{ij}) + (1 − y_{ij}) log(1 − ŷ_{ij}) ]    (10)

In (10), N represents the number of documents, C represents the number of labels, and y_{ij} and ŷ_{ij} denote the true and predicted values, respectively, for the j-th label of the i-th document.
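Equations (8)-(10) amount to concatenating the two label representations, scoring each label with a linear layer and a sigmoid, and training with binary cross-entropy. A hedged PyTorch sketch follows; applying W_3 per label row and averaging the loss are our reading of the equations, not the released implementation.

```python
import torch
import torch.nn as nn

class LabelPredictor(nn.Module):
    """Concatenate H^L and H^G (Eq. 8) and score each label with a linear layer + sigmoid (Eq. 9)."""

    def __init__(self, d_c):
        super().__init__()
        self.out = nn.Linear(2 * d_c, 1)   # W_3, b_2 applied to each label row of Z

    def forward(self, h_local, h_global):
        z = torch.cat([h_local, h_global], dim=-1)      # Z ∈ R^{|C| x 2dc}
        return torch.sigmoid(self.out(z)).squeeze(-1)   # Ŷ ∈ [0, 1]^{|C|}

def multilabel_loss(y_pred, y_true):
    """Eq. (10): multi-label binary cross-entropy (averaged here; the paper's Eq. (10) sums)."""
    return nn.functional.binary_cross_entropy(y_pred, y_true)
```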
4 Experiment

4.1 Datasets and evaluation metrics

We evaluate the proposed model on three benchmark multi-label text classification datasets:

RCV1¹: RCV1 [18] was collected and manually categorized by Reuters, which gathered more than 80k news texts and their corresponding multiple labels from 1996 to 1997. Moreover, the testing set consists of a significantly larger number of examples than the training set. This aspect allows for a comprehensive evaluation of the generalization capability of the proposed model.

AAPD²: AAPD [19] was constructed by gathering the abstracts and their corresponding subjects from a computer science academic website, encompassing 55,840 papers.

EUR-Lex³: EUR-Lex [41] is an extreme multi-label text classification dataset comprising documents related to European Union law across 3,956 subjects. The public version includes 11,585 instances for training and 3,865 instances for testing.

¹ http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm
² https://git.uwaterloo.ca/jimmylin/Castor-data/tree/master/datasets/AAPD/
³ http://nlp.cs.aueb.gr/software.html

These datasets were chosen due to their widespread usage and large scale, allowing us to validate the efficiency of the proposed model. Additionally, to maintain consistency with prior research, we employed the same dataset partitioning as in earlier studies [25, 34]. These partitions are the original ones provided by the publishers of the datasets. Detailed statistics for the datasets are presented in Table 1.

Table 1 Statistics of the datasets
Datasets | N | M | D | L | L̄ | L̃
RCV1 | 804,414 | 23,149 | 781,265 | 101 | 3.18 | 729.67
AAPD | 55,840 | 54,840 | 1,000 | 54 | 2.41 | 2444.04
EUR-Lex | 171,120 | 11,585 | 3,865 | 3,956 | 5.32 | 15.59
where N represents the total number of documents, M is the number of training documents, D is the number of testing documents, L is the number of class labels, L̄ is the average number of labels per document, and L̃ is the average number of documents per label.

Following the established conventions of previous studies [24, 25, 33, 34], we employed the precision at top k (P@k) and the normalized discounted cumulative gain at top k (nDCG@k) as performance evaluation metrics for all three datasets.

The word embeddings in our model were initialized with 300-dimensional GloVe [42] word vectors trained on the dataset using the Skip-gram [43] algorithm. The hidden sizes of the BiLSTM and GCN layers were set to 300 and 512, respectively. For AAPD, we set q = 2 and t = 400; for RCV1, q = 3 and t = 450; and for EUR-Lex, q = 3 and t = 600. We employed the Adam optimization method to minimize the cross-entropy loss. The learning rate was initialized to 1e-3, and a cosine-annealing schedule was applied to gradually reduce the learning rate during training. To ensure a fair comparison with related baselines that use a large language model (LLM), we also implemented an LLM-based version of our model. In this version, we used the word-sequence token representations from RoBERTa [44] as the output of the label-specific attention network module in our model. The model was trained for 15 epochs with a batch size of 64. The best parameter configuration was selected based on the performance of the validation set and evaluated using the testing set.

4.2 Baselines

To demonstrate the efficiency of the proposed model, it was compared with models that achieved state-of-the-art results on the selected datasets. For a fair comparison, we reused the reported experimental results when selecting baselines instead of reimplementing them, to maintain the recommended optimal settings and results. In addition, for models that were not implemented on specific datasets, we reimplemented these models with their source code and then evaluated them on the selected datasets.

Enhancing document-label-based methods

• XML-CNN [19]: A sequence generative model that treats label correlations as an ordered sequence.
• AttentionXML [24]: A model that constructs the label-aware document representation solely based on the document content.
• LSAN [25]: A label-aware attention framework based on self-attention and label attention mechanisms.
• HTTN [7]: This proposes a head-to-tail network that transfers meta-knowledge from head labels to tail labels.
• MLGN [26]: A multi-label guided network capable of guiding document representation with multi-label semantic information.

Label graph-based methods

• DXML [31]: A deep embedding method that simultaneously models the feature and label spaces.
• MAGNET [14]: A model based on graph attention networks that captures the attention-dependent structure between labels using features and correlation matrices. In addition, the model uses a BiLSTM to extract text features.
• LAHA [6]: LAHA focuses on using hybrid attention to represent documents with labels. The model comprises three components: a multi-label self-attention mechanism that identifies each word's association with labels, a depiction of label arrangement and document context, and an adaptive fusion method for classification.
• LDGN [34]: A dual-graph convolution network that incorporates category information and models adaptive interactions among labels in a reconstructed graph.
• LiGCN [32]: A label-interpretable graph model that solves the MLTC problem by modeling tokens and labels as nodes in a heterogeneous graph and uses the pretrained language model BERT as a text encoder.
• LA-MLTC [39]: A label-aware network (which we refer to as LA-MLTC) that builds a heterogeneous graph including words and labels to learn the label representation and text representation by metapath2vec.
• LR-GCN [33]: A multi-label text classification model combining a pre-trained language model and a GCN.

Table 2 Comparing our model with baselines in terms of P@k and nDCG@k on RCV1
Methods | P@1(%) | P@3(%) | P@5(%) | nDCG@3(%) | nDCG@5(%)
XML-CNN^a (2018) | 95.75 | 78.63 | 54.94 | 89.89 | 90.77
AttentionXML^a (2019) | 96.41 | 80.91 | 56.38 | 91.88 | 92.70
LSAN^a (2019) | 96.81 | 81.89 | 56.92 | 92.83 | 93.43
HTTN^b (2021) | 95.86 | 78.92 | 55.27 | 89.61 | 90.86
MLGN (2023) | 96.67 | 82.11 | 57.03 | 92.23 | 93.55
DXML^a (2018) | 94.04 | 78.65 | 54.38 | 89.83 | 90.21
MAGNET (2020) | 95.16 | 79.34 | 54.26 | 87.34 | 88.61
LAHA^c (2018) | 96.95 | 81.43 | 56.44 | 92.69 | 93.01
LiGCN^d (2022) | 95.61 | 82.40 | 56.31 | 93.40 | 93.26
LA-MLTC^e (2021) | 97.31 | 83.11 | 57.85 | 93.97 | 94.59
LDGN^a (2021) | 97.12 | 82.26 | 57.29 | 93.80 | 95.03
LR-GCN^f (2023) | 97.13 | 84.29 | 58.45 | 94.98 | 95.38
DV-MLTC | 97.11 | 83.68 | 57.31 | 94.19 | 94.77
DV-MLTC_RoBERTa | 97.94 | 84.83 | 59.01 | 94.32 | 95.87
The best performance is highlighted in bold, and the second-best performance is highlighted in underlined text. The following experimental results were extracted: a from [34], b from [7], c from [6], d from [32], e from [39], and f from [33].

Table 3 Comparing our model with baselines in terms of P@k and nDCG@k on AAPD
Methods | P@1(%) | P@3(%) | P@5(%) | nDCG@3(%) | nDCG@5(%)
XML-CNN^a (2018) | 74.38 | 53.84 | 37.79 | 71.12 | 75.93
AttentionXML^a (2019) | 83.02 | 58.72 | 40.56 | 78.01 | 82.31
LSAN^a (2019) | 85.28 | 61.12 | 41.84 | 80.84 | 84.78
HTTN^b (2021) | 83.84 | 59.92 | 40.79 | 79.27 | 82.67
MLGN (2023) | 84.78 | 60.01 | 42.37 | 80.11 | 83.45
DXML^a (2018) | 80.54 | 56.30 | 39.16 | 77.23 | 80.99
MAGNET (2020) | 82.53 | 60.71 | 40.19 | 80.37 | 81.03
LAHA^c (2018) | 84.48 | 60.72 | 41.19 | 80.11 | 83.70
LiGCN^d (2022) | 82.50 | 61.26 | 41.38 | 80.39 | 83.83
LA-MLTC^e (2021) | 85.03 | 61.46 | 41.80 | 80.94 | 84.90
LDGN^a (2021) | 86.24 | 61.95 | 42.29 | 83.32 | 86.85
LR-GCN^f (2023) | 86.50 | 62.43 | 41.66 | 82.52 | 85.48
DV-MLTC | 85.19 | 61.52 | 40.06 | 83.21 | 85.15
DV-MLTC_RoBERTa | 86.83 | 62.87 | 42.41 | 83.45 | 87.03
The best performance is highlighted in bold, and the second-best performance is highlighted in underlined text. The following experimental results were extracted: a from [34], b from [7], c from [6], d from [32], e from [39], and f from [33].

Table 4 Comparing our model with baselines in terms of P@k and nDCG@k on EUR-Lex
Methods | P@1(%) | P@3(%) | P@5(%) | nDCG@3(%) | nDCG@5(%)
XML-CNN^a (2018) | 70.40 | 54.98 | 44.86 | 58.62 | 53.10
AttentionXML^a (2019) | 67.34 | 52.52 | 47.72 | 56.21 | 50.78
LSAN^a (2019) | 79.17 | 64.99 | 53.67 | 68.32 | 62.47
HTTN^b (2021) | 81.14 | 67.62 | 56.38 | 70.89 | 64.42
MLGN (2023) | 68.65 | 53.17 | 48.92 | 57.34 | 51.28
DXML^a (2018) | 75.63 | 60.13 | 48.65 | 63.96 | 53.60
LAHA^c (2018) | 78.34 | 64.62 | 53.08 | 68.15 | 62.27
LDGN^d (2021) | 81.03 | 67.79 | 56.36 | 71.81 | 66.09
LR-GCN (2023) | 82.59 | 68.25 | 58.34 | 72.15 | 66.87
DV-MLTC | 81.01 | 66.98 | 56.73 | 71.02 | 66.14
DV-MLTC_RoBERTa | 83.61 | 70.14 | 59.40 | 74.62 | 68.11
The best performance is highlighted in bold, and the second-best performance is highlighted in underlined text. The following experimental results were extracted: a from [34], b from [7], c from [6], and d from [32].

4.3 Performance comparison of different methods

The performances of the different models on the three datasets are listed in Tables 2, 3, and 4 in terms of P@k and nDCG@k, respectively. For each row, the best result is highlighted in bold, and the second-best result is underlined.

As shown in Tables 2, 3, and 4, the proposed DV-MLTC model outperforms previous works on all three datasets. Specifically, the RoBERTa-enhanced version of DV-MLTC achieves better or more competitive performance on all metrics and significantly improves on the previous best baseline scores among methods with shared source code. For example, on EUR-Lex, DV-MLTC improves P@1 and nDCG@3 from 82.59% to 83.61% and from 72.15% to 74.62%, respectively. Compared with the best baseline LR-GCN on RCV1 and AAPD, our proposed model still performs better or is competitive on all metrics.

Furthermore, by observing the results in Tables 2, 3, and 4, we can see that methods that do not incorporate label correlation to improve the learning of textual representations demonstrate inferior performance. Specifically, on AAPD, AttentionXML elevated the P@1 value of DXML from 80.54% to 83.02%, an increase of approximately 3.08%. It is plausible that while DXML seeks to represent information in the label space using deep embedding techniques, AttentionXML can concentrate on the more semantically relevant document sections for each label.
Nevertheless, AttentionXML solely focuses on encoding text content in the presentation layer without considering label information, thus restricting its capacity to adjust contextual representations through interactions.

The better performance of LSAN compared with other previous approaches for exploring document-label relationships, such as HTTN and MLGN, may be attributed to its multi-view learning space mechanism and the fact that LSAN considers semantic correlations between text and labels simultaneously. The multi-view learning mechanism helps stabilize adaptive fusion through the attention mechanism, which learns the text representation specific to the labels.

We observed that LR-GCN performed best on RCV1 in terms of nDCG@3. This can be explained by its initialization of the text embedding using the pretrained language model RoBERTa, which can efficiently extract fine-grained document information. In contrast, our model uses a simple BiLSTM architecture to represent the input text and achieves optimal or near-optimal results. In addition, we used RoBERTa word embeddings to obtain the same word embeddings as LR-GCN. The results on AAPD and EUR-Lex demonstrate the effectiveness of our dual-view graph convolutional network module, with DV-MLTC_RoBERTa achieving the best results compared with the competing models.

LDGN [34] demonstrated competitive performance on all datasets, which may be attributed to its adaptive interaction component, benefiting from a large number of adaptive parameters. Inspired by LDGN, we propose an adaptive reconstruction of the graph based on random walks. However, the LDGN adaptive module operates as a black box, and its parameter guidance lacks explicit transparency. By contrast, our dual-graph module allows parameter sharing and provides natural interpretability. This allowed us to conduct further research on our model, particularly on parameter tuning and its implications.

We also observed that the methods that utilize label graphs outperform the document-label-based methods overall, which highlights the advantage of MLTC methods with graphs, as most of them incorporate rich interaction information to improve multi-label text prediction. The exception is LAHA, based on simple label co-occurrence, which we hypothesize captures only the representation of labels from the label co-occurrence graph without further exploring the deep relationships between labels.

4.4 Comparison on sparse dataset

To evaluate the performance of DV-MLTC on long-tailed labels, we categorized the labels in EUR-Lex into three groups based on their frequency of occurrence, following the approach in [6, 25]. Figure 3 illustrates the distribution of label frequencies on EUR-Lex, where f represents the label frequency. Approximately 55% of the labels appeared one to five times, constituting the first label collection (Collection1). The labels that appeared 5-37 times were assigned to Collection2, accounting for 35.35% of the entire label set. The remaining 10% of frequent labels formed the final collection (Collection3). Clearly, Collection1 is much more difficult than the other two collections owing to the lack of training data. The grouping is sketched in code after this paragraph.

Fig. 3 The label distribution of EUR-Lex. The x-axis represents the label index sorted by frequency in the training set, f represents the label frequency, and p represents the proportion of labels in a collection relative to the entire label set
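The three collections are simply buckets of labels by training-set frequency; a small sketch of how such a grouping might be formed is given below (the thresholds follow the description above, and the helper name is ours).

```python
from collections import Counter

def bucket_labels_by_frequency(train_label_sets):
    """Split labels into three collections by training-set frequency:
    Collection1: up to 5 documents, Collection2: 6-37, Collection3: more than 37."""
    freq = Counter(label for labels in train_label_sets for label in labels)
    collections = {1: [], 2: [], 3: []}
    for label, count in freq.items():
        if count <= 5:
            collections[1].append(label)
        elif count <= 37:
            collections[2].append(label)
        else:
            collections[3].append(label)
    return collections
```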

Fig. 4 AttentionXML, DXML, and DV-MLTC for the three collections on EUR-Lex in terms of P@k. (a) Collection1 (f ≤ 5), (b) Collection2 (5 < f ≤ 37), (c) Collection3 (37 < f ≤ 764)

Figure 4 shows the prediction results in terms of P@1, P@3, and P@5 obtained using AttentionXML, DXML, and DV-MLTC, respectively. All three methods improve from Collection1 to Collection3, which is reasonable because each label contains an increasing number of documents from Collection1 to Collection3. DV-MLTC significantly improves the predictive performance on Collection1. In particular, DV-MLTC achieved an average gain of over 55.83%, 96.22%, and 47.36% over AttentionXML on the three metrics of Collection1, and 63.41%, 121.73%, and 44.37% over DXML, respectively. This result demonstrates the superiority of the proposed model for multi-label text data with tail labels.

4.5 Ablation experiments

A series of ablation experiments was performed to assess the importance and necessity of each module. We performed ablation experiments on all three datasets and divided the experiments into two groups: Group1 and Group2.

Group1 experiments focus on the modules related to the dual attention flow. The ablation variants tested in this group were as follows:

1. w/o LW: our model without LW attention
2. w/o LS: our model without LS attention
3. w/o dual attn: our model without dual attention

Group2 experiments focus on the modules related to dual-graph convolution. The ablation variants tested in this group were as follows:

1. w/o GlobalConv: our model without GlobalConv
2. w/o LocalConv: our model without LocalConv
3. w/o DualConv: our model without dual-view convolution
4. sharing para: GlobalConv and LocalConv share the parameters of the GCN layer

From the results of the ablation experiments conducted on AAPD and RCV1 shown in Fig. 5, several observations about Group1 can be made. Dual Attention Flow Module: w/o LW and w/o LS outperformed w/o dual attn, with large margins of 3.03% and 2.21% on AAPD, indicating that both attention flows enhance the model and are indispensable. In other words, both the label-word attention and label-semantic self-attention modules contribute to the performance of the proposed model. Label attention considers the interaction between the label and word information and captures the contribution of words to labels. Self-attention, on the other hand, focuses on the semantic information of the labels themselves.

Fig. 5 Ablation experiments. (a) and (b) are the experimental results of Group1 and Group2, respectively, on AAPD; (c) and (d) are the experimental results of Group1 and Group2, respectively, on RCV1; (e) and (f) are the experimental results of Group1 and Group2, respectively, on EUR-Lex

Conclusions about Group2: (1) Dual-View Convolutional Modules: The experiments w/o LocalConv and w/o GlobalConv outperform w/o DualConv, for example on RCV1, with better results of 1.98% and 1.82% on P@3. This indicates that exploring either global or local label consistency can effectively capture the semantic interactions between label-specific components. The superiority of w/o LocalConv over w/o GlobalConv suggests that models with the global consistency convolution have a significant impact on classification improvement, indicating their ability to capture semantic dependencies effectively. (2) w/o GlobalConv improves over w/o DualConv: The experiment w/o GlobalConv improves the performance of the model based on the dual attention flow, indicating that incorporating the new label co-occurrence relationships generated through the random walk and mutual information can benefit the model's performance. (3) Sharing Parameters: The experiment involving parameter sharing between the global convolution and local convolution shows slightly lower performance compared with the complete model. This suggests that the two sets of GCNs, which model label correlation from different perspectives and interactions, benefit from separate parameters rather than sharing. (4) Overall Model: The complete model, which combines the dual attention flow and dual-view convolutions while separating the parameters, achieves the best performance. These results demonstrate the efficacy of the proposed modules and their contributions to the overall performance of the model in capturing label dependencies and semantic interactions.

We visualized the label co-occurrence graph matrices A^G and A^L on AAPD, as shown in Fig. 6. From the visualization, we can observe that the global label co-occurrence graph matrix A^G exhibits a long-tail distribution, with many edges having very few co-occurrences. This distribution is based on prior statistics from the corpus. However, these low-frequency edges may be considered noise, and they can lead to overfitting and negatively affect classification performance. The variant without the A^L matrix (w/o LocalConv) did not perform optimally. This is because A^G alone, which builds a co-occurrence graph based on statistical co-occurrence, cannot provide sufficient semantic confidence between label pairs. The dynamic edge adjustment performed by A^L through the random walk leads to a smoother appearance in the visualization. It assigns a certain edge weight to low-frequency co-occurring label pairs, thereby allowing them to overcome the influence of low-frequency noise. This adjustment is beneficial for the GCN because it strengthens the interactions between node pairs in the network. As for A^L, the A^L with an iteration number of 1000 tends to exhibit more smoothness compared with the A^L with an iteration number of 450. Over-smoothing makes it difficult to distinguish the differences in label co-occurrence, potentially degrading the classification performance. Our proposed model integrates A^G and A^L using GlobalConv and LocalConv, respectively, and leverages both statistical co-occurrence information and dynamic edge adjustment based on random walks, leading to improved classification results.

Overall, the visualization of the weight matrices confirms the effectiveness of incorporating both A^G and A^L in capturing label dependencies and enhancing the performance of the classification model.

4.6 Parametric analysis

We performed the related experiments on our model using AAPD and used the base version of the model in the parametric analysis.

4.6.1 Effect of iteration number t on classification

We investigated the effect of the number of iterations, denoted as t, on the classification performance. The number of iterations determines the number of label paths generated by node resampling. By controlling the other parameters and varying the value of t, the impact on the classification performance was analyzed, as shown in Fig. 7.

Fig. 6 Label graph visualization on AAPD. From left to right: (a) the global consistency label graph A^G; (b) the local label co-occurrence graph A^L (t = 450); (c) the local label co-occurrence graph A^L (t = 1000)

Fig. 7 Test performance (%) under varying t on AAPD, in terms of P@1, P@3, P@5, nDCG@1, nDCG@3, and nDCG@5

The experimental results show that when the number of iterations is small, the performance improvement of the model is insignificant. This is because the local label graph that captures local label dependencies fails to effectively capture the key and tail labels. Consequently, the role of the local label graph becomes similar to that of the global graph, leading to limited performance gains. As the number of iterations increases and reaches a certain scale (e.g., 450), the local dynamics strongly enhance the interaction between the key and tail graphs. By leveraging the powerful information diffusion ability of the GCN, the model achieves improved classification performance. This indicates that a sufficient number of iterations allows the local dynamics to capture crucial graph dependencies, resulting in enhanced classification accuracy.

Increasing the number of iterations beyond the optimal value did not significantly affect the model's performance. This suggests that once the key and tail graph nodes are effectively captured and the interaction between graphs is strengthened, further increasing the number of iterations has little effect on the model. In summary, the experimental results demonstrate that the number of iterations t plays a crucial role in capturing graph dependencies through local dynamics. Finding the optimal value of t allows the model to effectively enhance graph interactions and improve classification performance.

4.6.2 Effect of path length q on classification

The path length parameter q plays a crucial role in the classification performance of our model, particularly in the LocalConv module. It determines the farthest distance that the random walk can traverse based on probability, with labels on the same path considered to belong to the same document. In our experiments on AAPD, we investigated the impact of q on the classification accuracy while maintaining t at its optimal value of 450. The results shown in Fig. 8 indicate that the choice of q significantly affects the performance of the model, which is consistent with our expectations. Within a reasonable range (e.g., 2 or 3), the model achieves the best classification results, suggesting that the label paths adaptively generated by the model have significant benefits. However, when q exceeds a certain threshold (e.g., 3), the performance of the model begins to decline slightly. We speculate that excessively long label paths result in excessively consistent co-occurrence relationships between nodes during the iterative process. This exacerbates the problem of over-smoothing, ultimately interfering with the discriminative power of the labels in the model. Nevertheless, by integrating the LocalConv and GlobalConv modules, our model maintains its robustness and achieves optimal performance. This highlights the effectiveness and resilience of our approach in capturing label dependencies and enhancing the classification outcomes.

4.6.3 Effect of label ratio

To assess the sensitivity and performance of the proposed model under different training data proportions, we conducted experiments using various ratios of training data. We also compared our model with other competitive approaches, namely XML-CNN [4], AttentionXML [24], LSAN [25], and LR-GCN [33], while maintaining their respective settings, as described in their papers. In the case of LSAN, we utilized Word2vec for word embeddings because of the absence of pretrained embeddings in its source code. Figure 9 shows the evaluation results for different data scales with proportions of 0.05, 0.10, 0.25, 0.50, and 0.75.

Fig. 8 Test performance (%) under varying q on AAPD, in terms of P@1, P@3, P@5, nDCG@1, nDCG@3, and nDCG@5

It is evident from the results that our model consistently achieves competitive performance compared with the baselines. Notably, our model outperforms the baseline models particularly at low data percentages (< 0.25). We conjecture that this may be attributed to our dual-view convolution module, in which the local convolution yields richer graph co-occurrence patterns, particularly in the case of few labels. This finding demonstrates that our model is robust and insensitive to the training data ratio. Therefore, it can effectively handle scenarios where only a limited number of training samples are available, making it applicable to real-world situations.

4.7 Complexity analysis

Notice that the time complexity of the model primarily arises from computing F in Algorithm 1. The time complexity is O(ctq²). Moreover, considering that the parameters t and q are set as small integers in the experiments, F can be computed rapidly. Additionally, the algorithm can be parallelized by conducting multiple random walks simultaneously on different parts of the graph. Therefore, the time complexity of the model is deemed acceptable.

Compared with other graph-based models such as MAGNET, LDGN, and LR-GCN, which have shown excellent results in the comparative experiments, our model achieves a favorable balance between complexity and efficiency. One of the main contributors to the time complexity of MAGNET is its graph attention network. Assume that the number of nodes is c, the number of edges is e, and the dimensions before and after the feature transformation are d and d′, respectively. The time complexity of MAGNET can be expressed as O(cdd′) + O(ed′). Owing to the potentially large number of edges e and the relatively large dimensions d and d′, the time complexity of MAGNET is relatively high. Similarly, for LDGN, the computational complexity primarily arises from the dynamic graph reconstruction, with a time complexity of O(cdd′). In a laboratory setting, where the dimensions d and d′ are relatively large, the time complexity of LDGN is also high. In comparison, the suboptimal model LR-GCN does not involve redundant multiplication calculations, resulting in a slightly better time complexity than our model.

P@1 P@3 P@5

90 22
35
80 20
30
70 DV-MLTC DV-MLTC DV-MLTC
18
LR-GCN LR-GCN LR-GCN
60
LSAN 25 LSAN LSAN
AttentionXML AttentionXML 16 AttentionXML
50
XML-CNN XML-CNN XML-CNN
40 20 14
0 01 02 03 04 05 06 07 08 0 01 02 03 04 05 06 07 08 0 01 02 03 04 05 06 07 08

Fig. 9 Test performance (%) with different label ratios on AAPD

123
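To make this cost concrete, the following minimal sketch illustrates how q-bounded label paths could be generated over t rounds of co-occurrence-weighted random walks; the data structure, function names, and sampling scheme here are illustrative assumptions rather than the exact implementation of Algorithm 1.

```python
import random

def random_walk_paths(adj, t, q, seed=0):
    """Illustrative sketch: generate label paths of length at most q by
    running t rounds of co-occurrence-weighted random walks.
    `adj` maps each label to a dict {neighbor: co-occurrence weight}.
    Walking costs roughly O(c * t * q) sampling steps for c labels; scoring
    every node pair on each path adds the q^2 factor behind O(c * t * q^2)."""
    rng = random.Random(seed)
    paths = []
    for _ in range(t):                      # t iterations
        for start in adj:                   # one walk per label node
            path = [start]
            for _ in range(q - 1):          # visit at most q nodes per path
                nbrs = adj.get(path[-1], {})
                if not nbrs:
                    break
                labels, weights = zip(*nbrs.items())
                path.append(rng.choices(labels, weights=weights, k=1)[0])
            paths.append(path)
    return paths

def cooccurrence_from_paths(paths):
    """Count label pairs appearing on the same path (the q^2 term), which is
    how a LocalConv-style reconstruction could enrich rare co-occurrences."""
    counts = {}
    for path in paths:
        for i, u in enumerate(path):
            for v in path[i + 1:]:
                counts[(u, v)] = counts.get((u, v), 0) + 1
    return counts
```

Because each walk depends only on its own start node and the shared adjacency structure, the c·t walks are independent and can be distributed across workers, which is the parallelization mentioned above.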
Compared with other graph-based models such as MAGNET, LDGN, and LR-GCN, which have shown excellent results in the comparative experiments, our model achieves a favorable balance between complexity and efficiency. One of the main contributors to the time complexity of MAGNET is its graph attention networks. Assume that the number of nodes is c, the number of edges is e, and the dimensions before and after feature transformation are d and d′, respectively. The time complexity of MAGNET can then be expressed as O(cdd′) + O(ed′). Owing to the potentially large number of edges e and the relatively large dimensions d and d′, the time complexity of MAGNET is relatively high. Similarly, for LDGN, the computational complexity primarily arises from its dynamically reconstructed graph, with a time complexity of O(cdd′). In the experimental setting, where the dimensions d and d′ are relatively large, the time complexity of LDGN is also high. In comparison, the suboptimal model LR-GCN does not involve redundant multiplication calculations, resulting in a slightly better time complexity than our model. However, as mentioned previously, the time complexity of our model remains acceptable. Hence, our model achieves a satisfactory balance between complexity and efficiency. In summary, although other graph-based models may outperform our model in certain comparative experiments, the advantageous balance between complexity and efficiency of our model makes it a valuable choice for practical applications.
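For intuition only, plugging illustrative values into these expressions shows why the attention-based term dominates; the numbers below are assumptions chosen for the sketch (not measurements from the compared papers), and the sketch compares only the quoted terms while ignoring the feed-forward costs shared by all models.

```python
# Illustrative back-of-the-envelope comparison of the dominant terms.
# All values are assumptions chosen only to show relative magnitudes.
c, e = 54, 54 * 53 // 2          # e.g. an AAPD-sized label set with a dense label graph
d = d_prime = 300                # assumed feature dimensions before/after transformation
t, q = 3, 3                      # small integers, as in the experiments

gat_cost = c * d * d_prime + e * d_prime   # O(cdd') + O(ed')  (MAGNET-style attention)
walk_cost = c * t * q ** 2                 # O(ctq^2)          (random-walk term above)

print(f"GAT-style term  : {gat_cost:,}")   # ~5.3 million multiply-accumulates
print(f"Random-walk term: {walk_cost:,}")  # on the order of a thousand operations
```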
4.8 Case studies and visualizations

To further verify the effectiveness of our label attention module and dual graph neural networks in DV-MLTC, we present a typical case and visualize the similarity scores between the attention weights of document words and label-specific components using t-SNE [45]. We show a testing instance from the original AAPD dataset which belongs to three categories: "physics and society" (physics.soc-ph), "computers and society" (cs.CY), and "social and information networks" (cs.SI).

4.8.1 Label attention visualization

Figure 10 shows the label attention, revealing how different labels focus on specific parts of the document text. Each label assigns importance to its set of words for classification. For instance, in the "physics.soc-ph" category, words like "user behaviors" and "evolution over time" were highlighted, capturing key concepts in physics within a social context. In the "cs.CY" category, words such as "user conversations", "dynamic model", "growth dynamics and structural properties", and "underlying rules" were emphasized, indicating a focus on computers and society. In the "cs.SI" category, attention was given to words such as "artificial factors", "line conversations", and "social media websites". By examining the specific words that receive attention in each category, we gain insights into the semantics and distinguishing aspects of these categories. These visualizations intuitively demonstrate the effectiveness of the model in capturing relevant information in document text for accurate labeling.

Example document (Fig. 10): "we present an analysis of user conversations in on line social media and their evolution over time we propose a dynamic model that accurately predicts the growth dynamics and structural properties of conversation threads the model successfully reconciles the differing observations that have been reported in existing studies by separating artificial factors from user behaviors , we show that there are actually underlying rules in common for on line conversations in different social media websites results of our model are supported by empirical measurements throughout a number of different social media websites"

Fig. 10 Visualization of label attention weights. The attention weights of "physics.soc" for words are shaded in green, and the attention scores of "cs.CY" and "cs.SI" are shaded in blue and red, respectively
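As a rough illustration of how such a rendering can be produced (the tokens, scores, and helper below are made-up placeholders, not values taken from the trained model), per-label attention weights over words can be mapped to shading:

```python
def shade_terminal(tokens, weights, threshold=0.5):
    """Illustrative sketch: print tokens, highlighting those whose
    label-specific attention weight (relative to the maximum) exceeds a
    threshold. ANSI green shading stands in for the colors used in Fig. 10."""
    hi = max(weights) or 1.0
    pieces = []
    for tok, w in zip(tokens, weights):
        if w / hi >= threshold:
            pieces.append(f"\033[42m{tok}\033[0m")   # shaded (high-attention) token
        else:
            pieces.append(tok)
    print(" ".join(pieces))

# Hypothetical usage with made-up attention scores for one label:
tokens = "we propose a dynamic model of user behaviors".split()
weights = [0.02, 0.10, 0.01, 0.55, 0.40, 0.05, 0.80, 0.85]
shade_terminal(tokens, weights)
```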
4.8.2 Label co-occurrence graph visualization

Figure 11 visualizes the label graph, showing the roles of GlobalConv and LocalConv in capturing the label co-occurrence patterns. The heat maps in Fig. 11 represent the label co-occurrence matrices A_G and A_L. In Fig. 11(a), the heat map shows A_G based on GlobalConv. However, GlobalConv failed to accurately discern the relationships between the labels in this specific test case. Notably, the co-occurrence of "computers and society (cs.CY)" and "adaptation and self-organizing systems (nlin.AO)" was not considered significant. This limitation arises from relying solely on global statistical information, which may overlook label correlations in individual instances. Conversely, Fig. 11(b) displays A_L based on LocalConv. This highlights the crucial role of LocalConv in establishing local connections between the labels. Even for label pairs with low co-occurrence, such as "computers and society (cs.CY)" and "physics and society (physics.soc-ph)", LocalConv assigns a label correlation. Multiple label paths generated by LocalConv generalize label relationships based on model sampling, independent of human influence. Consequently, LocalConv captures finer label associations, providing a comprehensive understanding of label co-occurrence patterns. In summary, the visualization of the label graph demonstrates how LocalConv effectively supplements the label correlations that GlobalConv alone cannot capture.

Fig. 11 Case of label co-occurrence graph visualization: (a) global label co-occurrence (A_G); (b) local label co-occurrence (A_L)
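A minimal sketch of how heat maps like those in Fig. 11 could be rendered from two label co-occurrence matrices is shown below; the matrices and label names are synthetic placeholders rather than the values underlying the figure.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_cooccurrence(a_global, a_local, labels):
    """Illustrative sketch: side-by-side heat maps of a global (A_G) and a
    locally reconstructed (A_L) label co-occurrence matrix."""
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, mat, title in zip(axes, (a_global, a_local),
                              ("(a) Global (A_G)", "(b) Local (A_L)")):
        im = ax.imshow(mat, cmap="viridis")
        ax.set_xticks(range(len(labels)))
        ax.set_xticklabels(labels, rotation=90)
        ax.set_yticks(range(len(labels)))
        ax.set_yticklabels(labels)
        ax.set_title(title)
        fig.colorbar(im, ax=ax)
    fig.tight_layout()
    plt.show()

# Synthetic placeholder values for three labels from the case study:
labels = ["physics.soc-ph", "cs.CY", "cs.SI"]
a_g = np.array([[1.0, 0.1, 0.6], [0.1, 1.0, 0.3], [0.6, 0.3, 1.0]])
a_l = np.array([[1.0, 0.4, 0.7], [0.4, 1.0, 0.5], [0.7, 0.5, 1.0]])
plot_cooccurrence(a_g, a_l, labels)
```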
5 Conclusion and future tasks

In this study, we propose a novel dual-view convolutional neural network for multi-label text classification. Our approach systematically addresses graph relationships within co-occurrences by employing global and local consistency perspectives. The global consistency convolution utilizes GCNs to model the statistical relationships among graphs based on correlation. For the local consistency convolution, we strategically generate graph paths through random walks, reconstruct local graphs, and enrich the co-occurrence patterns. The initial word embeddings were generated via a dual attention flow. Extensive experiments revealed superior performance on RCV1 and EUR-Lex and competitive results on AAPD, highlighting a favorable complexity-efficiency balance. Our approach is effective in enhancing classification performance and mitigating long-tailed issues. Future enhancements include constructing dynamics for sample subsets to reduce computational overhead and further exploring how to leverage additional graph information for multi-label text classification.

Acknowledgements This work was supported in part by the National Natural Science Foundation of China (No. 61862058), the Natural Science Foundation of Gansu Province (No. 20JR5RA518, 21JR7RA114), and the Industrial Support Project of Gansu Colleges (No. 2022CYZC11).

Author Contributions X.L and B.Y: Conceptualization, Methodology, Formal analysis, Software, Investigation, Validation, Resources, Writing—original draft, review and editing, Visualization. Q.P and S.F: Resources, Writing—review and editing, Supervision.

Availability of Data and Materials The datasets analyzed during the current study were all derived from the following public domain resources. [AAPD: https://git.uwaterloo.ca/jimmylin/Castor-data/tree/master/datasets/AAPD/; RCV1: http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm; EUR-Lex: http://nlp.cs.aueb.gr/software.html].

Declarations

Conflict of Interest The authors declare that they have no conflict of interest.

Consent to Participate The authors declare that they agree to participate.

Consent for Publication The authors declare that they agree to publish.

References

1. Huang B, Guo R, Zhu Y, Fang Z, Zeng G, Liu J, Wang Y, Fujita H, Shi Z (2022) Aspect-level sentiment analysis with aspect-specific context position information. Knowl-Based Syst 243:108473. https://doi.org/10.1016/j.knosys.2022.108473
2. Tang P, Jiang M, Xia BN, Pitera JW, Welser J, Chawla NV (2020) Multi-label patent categorization with non-local attention-based graph convolutional network. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 34, pp 9024–9031. https://ojs.aaai.org/index.php/AAAI/article/view/6435
3. Liu W, Wang H, Shen X, Tsang IW (2022) The emerging trends of multi-label learning. IEEE Trans Pattern Anal Mach Intell 44(11):7955–7974. https://doi.org/10.1109/TPAMI.2021.3119334
4. Liu J, Chang W-C, Wu Y, Yang Y (2017) Deep learning for extreme multi-label text classification. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '17), pp 115–124. Association for Computing Machinery. https://doi.org/10.1145/3077136.3080834
5. Wu H, Qin S, Nie R, Cao J, Gorbachev S (2021) Effective collaborative representation learning for multilabel text categorization. IEEE Trans Neural Netw Learn Syst 33(10):5200–5214
6. Huang X, Chen B, Xiao L, Yu J, Jing L (2022) Label-aware document representation via hybrid attention for extreme multi-label text classification. Neural Process Lett 54(5):3601–3617
7. Xiao L, Zhang X, Jing L, Huang C, Song M (2021) Does head label help for long-tailed multi-label text classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 35, pp 14103–14111
8. Zong D, Sun S (2023) BGNN-XML: bilateral graph neural networks for extreme multi-label text classification. IEEE Trans Knowl Data Eng 35(7):6698–6709
9. Zhang Q-W, Zhang X, Yan Z, Liu R, Cao Y, Zhang M-L (2021) Correlation-guided representation for multi-label text classification. In: IJCAI, pp 3363–3369
10. Ionescu RT, Butnaru A (2019) Vector of locally-aggregated word embeddings (VLAWE): a novel document-level representation. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol 1 (Long and Short Papers), pp 363–369. https://doi.org/10.18653/v1/N19-1033. https://aclanthology.org/N19-1033
11. Liu M, Liu L, Cao J, Du Q (2022) Co-attention network with label embedding for text classification. Neurocomputing 471:61–69
12. Wang J, Chen Z, Qin Y, He D, Lin F (2023) Multi-aspect co-attentional collaborative filtering for extreme multi-label text classification. Knowl-Based Syst 260:110110. https://doi.org/10.1016/j.knosys.2022.110110
13. Chen Z-M, Wei X-S, Wang P, Guo Y (2019) Multi-label image recognition with graph convolutional networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 5177–5186
14. Pal A, Selvakumar M, Sankarasubbu M (2020) MAGNET: multi-label text classification using attention-based graph neural network. In: Proceedings of the 12th International Conference on Agents and Artificial Intelligence, vol 2, pp 494–505. https://doi.org/10.5220/0008940304940505
15. Vu H, Nguyen M, Nguyen V, Tien M, Nguyen V (2022) Label correlation based graph convolutional network for multi-label text classification. In: 2022 International Joint Conference on Neural Networks (IJCNN), pp 01–08. https://ieeexplore.ieee.org/abstract/document/9892542
16. Kipf TN, Welling M (2017) Semi-supervised classification with graph convolutional networks. In: International Conference on Learning Representations (ICLR)
17. Liang Z, Guo J, Qiu W, Huang Z, Li S (2024) When graph convolution meets double attention: online privacy disclosure detection with multi-label text classification. Data Min Knowl Discov 1–22
18. Lewis DD, Yang Y, Russell-Rose T, Li F (2004) RCV1: a new benchmark collection for text categorization research. J Mach Learn Res 5:361–397
19. Yang P, Sun X, Li W, Ma S, Wu W, Wang H (2018) SGM: sequence generation model for multi-label classification. In: Proceedings of the 27th International Conference on Computational Linguistics, pp 3915–3926. https://aclanthology.org/C18-1330
20. Yang P, Luo F, Ma S, Lin J, Sun X (2019) A deep reinforced sequence-to-set model for multi-label classification. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp 5252–5258. https://aclanthology.org/P19-1518
21. Liao W, Wang Y, Yin Y, Zhang X, Ma P (2020) Improved sequence generation model for multi-label classification via CNN and initialized fully connection. Neurocomputing 382:188–195
22. Zhang X, Tan X, Luo Z, Zhao J (2023) Multi-label sequence generating model via label semantic attention mechanism. Int J Mach Learn Cybern 14(5):1711–1723
23. Wang R, Ridley R, Qu W, Dai X et al (2021) A novel reasoning mechanism for multi-label text classification. Inf Process Manage 58(2):102441
24. You R, Zhang Z, Wang Z, Dai S, Mamitsuka H, Zhu S (2019) AttentionXML: label tree-based attention-aware deep model for high-performance extreme multi-label text classification. In: Advances in Neural Information Processing Systems, vol 32, pp 5820–5830
25. Xiao L, Huang X, Chen B, Jing L (2019) Label-specific document representation for multi-label text classification. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp 466–475. Association for Computational Linguistics. https://aclanthology.org/D19-1044
26. Liu Q, Chen J, Chen F, Fang K, An P, Zhang Y, Du S (2023) MLGN: a multi-label guided network for improving text classification. IEEE Access 11:80392–80402. https://doi.org/10.1109/ACCESS.2023.3299566
27. Qin S, Wu H, Zhou L, Li J, Du G (2023) Learning metric space with distillation for large-scale multi-label text classification. Neural Comput Appl 35(15):11445–11458
28. Wang Q, Zhu J, Shu H, Asamoah KO, Shi J, Zhou C (2023) GUDN: a novel guide network with label reinforcement strategy for extreme multi-label text classification. J King Saud Univ Comput Inf Sci 35(4):161–171
29. Xu P, Xiao L, Liu B, Lu S, Jing L, Yu J (2023) Label-specific feature augmentation for long-tailed multi-label text classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 37, pp 10602–10610
30. Xiao L, Xu P, Song M, Liu H, Jing L, Zhang X (2023) Triple alliance prototype orthotist network for long-tailed multi-label text classification. IEEE/ACM Trans Audio Speech Lang Process 31:2616–2628. https://doi.org/10.1109/TASLP.2023.3265860
31. Zhang W, Yan J, Wang X, Zha H (2018) Deep extreme multi-label learning. In: Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, pp 100–107. https://doi.org/10.1145/3206025.3206030
32. Li I, Feng A, Wu H, Li T, Suzumura T, Dong R (2022) LiGCN: label-interpretable graph convolutional networks for multi-label text classification. In: Proceedings of the 2nd Workshop on Deep Learning on Graphs for Natural Language Processing (DLG4NLP 2022), pp 60–70. Association for Computational Linguistics. https://aclanthology.org/2022.dlg4nlp-1.7
33. Vu H, Nguyen M, Nguyen V, Pham M, Nguyen V, Nguyen V (2023) Label-representative graph convolutional network for multi-label text classification. Appl Intell 53(12):14759–14774. https://doi.org/10.1007/s10489-022-04106-x
34. Ma Q, Yuan C, Zhou W, Hu S (2021) Label-specific dual graph neural network for multi-label text classification. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (vol 1: Long Papers), pp 3855–3864. Association for Computational Linguistics
35. Fan C, Chen W, Tian J, Li Y, He H, Jin Y (2023) Accurate use of label dependency in multi-label text classification through the lens of causality. Appl Intell 1–17
36. Zeng D, Zha E, Kuang J, Shen Y (2024) Multi-label text classification based on semantic-sensitive graph convolutional network. Knowl-Based Syst 284:111303
37. Zhao F, Ai Q, Li X, Wang W, Gao Q, Liu Y (2024) TLC-XML: transformer with label correlation for extreme multi-label text classification. Neural Process Lett 56(1):25
38. Huang Y, Giledereli B, Köksal A, Özgür A, Ozkirimli E (2021) Balancing methods for multi-label text classification with long-tailed class distribution. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp 8153–8161. Association for Computational Linguistics
39. Guo H, Li X, Zhang L, Liu J, Chen W (2021) Label-aware text representation for multi-label text classification. In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 7728–7732. https://doi.org/10.1109/ICASSP39728.2021.9413921
40. Zhuang C, Ma Q (2018) Dual graph convolutional networks for graph-based semi-supervised classification. In: Proceedings of the 2018 World Wide Web Conference, International World Wide Web Conferences Steering Committee, pp 499–508. https://doi.org/10.1145/3178876.3186116
41. Loza Mencía E, Fürnkranz J (2008) Efficient pairwise multilabel classification for large-scale problems in the legal domain. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp 50–65
42. Pennington J, Socher R, Manning CD (2014) GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp 1532–1543
43. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. Adv Neural Inf Process Syst 26
44. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv:1907.11692
45. Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9(11)
Publisher’s Note Springer Nature remains neutral with regard to juris- Xiaohong Li is an associate professor at Northwest Normal University
dictional claims in published maps and institutional affiliations. in China. Her current research interests include machine learning and
intelligent information processing, with a focus on text mining, recom-
Springer Nature or its licensor (e.g. a society or other partner) holds mendation systems, and web data analysis.
exclusive rights to this article under a publishing agreement with the
author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such
publishing agreement and applicable law.