
Information Processing and Management 62 (2025) 103923


Unsupervised feature selection using sparse manifold learning: Auto-encoder approach

Amir Moslemi a,b, Mina Jamshidi c,∗

a Physical Sciences Research Platform, Sunnybrook Research Institute (SRI), Canada
b School of Software Design and Data Science, Seneca Polytechnic, Canada
c Department of Applied Mathematics, Graduate University of Advanced Technology, Iran

∗ Corresponding author. E-mail address: m.jamshidi@kgut.ac.ir (M. Jamshidi).

ARTICLE INFO

Keywords: Feature selection, Auto-encoder, Manifold structure

ABSTRACT

Feature selection techniques are widely used as a preprocessing step for training machine learning algorithms, in order to circumvent the curse of dimensionality, overfitting, and long computation times. Projection-based methods are frequently employed in feature selection, leveraging the extraction of linear relationships among features; the extraction of nonlinear information among features is notably absent in this context. While auto-encoder based techniques have recently gained traction for feature selection, their focus remains primarily on the encoding phase, since it is through this phase that the selected features are derived. The subtle point is that the ability of the auto-encoder to obtain the most discriminative features is significantly affected by the decoding phase. To address these challenges, in this paper we propose a novel auto-encoder based feature selection method that not only extracts nonlinear information among features but also regularizes the decoding phase to enhance the performance of the algorithm. We define a new auto-encoder model that keeps the topological information of the reconstructed data close to that of the input data: the geometric structure of the input data is preserved in the projected space using a Laplacian graph, and the geometry of the projected space is preserved in the reconstructed space using a suitable term (an abstract Laplacian graph of the reconstructed data) in the optimization problem. Keeping the abstract Laplacian graph of the reconstructed data close to the Laplacian graph of the input data affects the performance of feature selection, as we show experimentally, and we present an effective approach to solve the corresponding objective. Since this approach is mainly intended for clustering, we conducted experiments on ten benchmark datasets and assessed the proposed method using clustering accuracy and the normalized mutual information (NMI) metric. Our method shows considerable superiority over recent state-of-the-art techniques in terms of NMI and accuracy.

1. Introduction

The performance of a learning algorithm is directly affected by high-dimensional data due to noise and redundant features. High-dimensional data leads to the curse of dimensionality, a shortage of storage space, increased computation time, decreased generalizability of the learning algorithm, and an increased probability of overfitting (Liu & Motoda, 2012). Feature selection and feature extraction are the two main approaches to reducing the dimension of the data; however, feature selection offers better interpretability than feature extraction (Gui, Sun, Ji, Tao, & Tan, 2016). Feature selection is an important data preprocessing step in the machine learning pipeline to deal
with high-dimensional data. In the medical field, feature selection is crucial for identifying significant attributes that can impact treatment effectiveness, disease diagnosis, differentiation between various conditions, etc. (Moslemi, Makimoto, et al., 2023).
Feature selection techniques are categorized into four groups: filter, wrapper, embedded and hybrid strategies (Moslemi, 2023). In the filter strategy, there is no connection between the features and the machine learning model. In the wrapper strategy, features are selected based on the performance of a machine learning model. In the embedded strategy, feature selection is part of the training process, as in decision tree algorithms. In the hybrid strategy, two different strategies are combined to select the most discriminative features (Moslemi, 2023).
Based on label availability, feature selection methods are grouped into supervised, unsupervised and semi-supervised techniques (Moslemi, 2023). In supervised techniques, label information is utilized to rank the features; the minimal-redundancy-maximal-relevance criterion (mRMR) is one of the best-known supervised feature selection methods (Peng, Long, & Ding, 2005). In unsupervised methods, features are ranked based on similarities among features; the Laplacian score is one of the most important unsupervised feature selection methods (He, Cai, & Niyogi, 2005). In semi-supervised techniques, some of the data are labeled and the rest are unlabeled; sparse models and graph Laplacian based techniques are the most common approaches in semi-supervised feature selection (Shi, Ruan, & An, 2014).
Lately, there has been a growing focus on unsupervised feature selection techniques due to the high cost and time constraints associated with collecting labels. Studies have introduced different criteria to rank features based on their importance, such as reconstruction error minimization (Li, Tang, & Liu, 2017; Zhu, Zuo, Zhang, Hu, & Shiu, 2015), structure preservation (Cai, Zhang, & He, 2010; Li, Yang, Liu, Zhou, & Lu, 2012) and locality preservation (He et al., 2005; Nie, Zhu, & Li, 2016). In the context of reconstruction error minimization, matrix decomposition-based techniques are prevalent. For example, non-negative matrix factorization (NMF) has been utilized for feature selection (NMFS). NMF decomposes the data into two non-negative matrices, a feature weight matrix and a representation matrix, and can be converted into a feature selection method by imposing an orthogonality constraint on the feature weight matrix (Wang, Pedrycz, Zhu, & Zhu, 2015). NMFS cannot preserve local and global information. To this end, Wang, Chen, Guo, and Liu (2020) added local and global regularization functions: the 𝑙2,1 matrix norm was applied to the feature weight matrix to sparsify the solution, and a Laplacian matrix term was added to preserve the local structure information of the data. In Li, Hu, and Gao (2024), the authors proposed a method in which the 𝑙2,1 norm is applied for higher sparsity and lower redundancy. Saberi-Movahed, Rostami, et al. (2022) proposed dual regularized NMF for unsupervised feature selection and showed that performance improves when a sparse regularization function is applied to both the feature weight matrix and the representation matrix. Moslemi and Ahmadian (2023) proposed dual regularized NMF feature selection with a rank constraint, in which the inner product norm is used as the sparse regularization function for both the feature weight and representation matrices, Laplacian graph regularization is added to preserve the geometrical information of the data, and the Schatten 𝑝-norm is added to extract the low-rank structure of the data. In a recent study, Samareh-Jahani, Saberi-Movahed, Eftekhari, Aghamollaei, and Tiwari (2024) introduced a feature selection method based on NMF in which the decomposition is conducted on an orthonormalized dataset: the dataset first undergoes an orthogonality transformation using the Gram–Schmidt process, followed by the application of global and local regularization functions for NMF feature selection.
In some feature selection methods, the graph structure of the data is considered to enhance the results (Huang, Kong, Wang, Han, & Yang, 2024; Liao, Chen, Yin, Horng, & Li, 2024; Yi et al., 2024). Huang et al. (2024) applied feature selection to unsupervised data using a new structured graph and data discrepancy learning, which consist of a pairwise data similarity matrix and an indicator matrix. In Liao et al. (2024), for semi-supervised data, the authors used a graph learning technique with a Frobenius-norm or maximum information entropy term on the similarity matrix to capture more structural information of the data.
The application of feature selection can be found in different domains including industry (Salcedo-Sanz, Cornejo-Bueno, Prieto, Paredes, & García-Herrera, 2018), medicine (Moslemi et al., 2022; Saberi-Movahed, Mohammadifard, et al., 2022) and finance (Lee, 2009). Here, we highlight two applications of feature selection in the medical domain. Moslemi et al. (2022) showed the importance of feature selection for differentiating two pulmonary diseases: a hybrid feature selection based on factor analysis and particle swarm optimization (PSO) was proposed to obtain discriminative features for distinguishing chronic obstructive pulmonary disease (COPD) from asthma. These two lung diseases share the same symptoms, and misdiagnosis leads to mistreatment and subsequent exacerbation; the study obtained discriminative features, supported by clinical studies, to distinguish between COPD and asthma. Gene selection is a line of research that is categorized as a feature selection problem and has recently received attention (Moslemi & Ahmadian, 2023). Another crucial application of feature selection is within the energy industry: Ahmad and Zhang (2020) proposed a feature selection based technique to predict short- and medium-term load for large-scale utilities and buildings. The significance of interpretability in feature selection extends beyond reducing dimensionality; it also plays a crucial role in understanding the relevance of each variable for particular tasks. Recently, neural networks and deep learning have been used extensively in different fields including computer vision, natural language processing (NLP) and speech recognition. Deep neural networks come in different types, including feed-forward networks (FFNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs), auto-encoders, transformers and generative models. In the meantime, the application of auto-encoders to unsupervised feature selection has received attention (Hinton & Zemel, 1993). An auto-encoder is a type of deep neural network that learns the pattern of the input data in order to reconstruct it at the output: its minimization problem is designed such that the output is enforced to be close to the input. The auto-encoder projects data to a lower-dimensional space in order to extract the intrinsic information that is essential to reconstruct the input data, whereas the variational auto-encoder learns the distribution of the lower-dimensional space in order to generate new data. Therefore, the fundamental principle behind auto-encoder based methods is to identify features that hold significant relevance to highly informative latent representations. Empirical findings consistently demonstrate that these chosen features adeptly retain vital data information essential for subsequent clustering aims.


Fig. 1. This figure illustrates the architecture of the proposed feature selection. The auto-encoder based feature selection imposes the 𝑙2,1 norm such that green lines in the input units show the irrelevant features and red lines show the selected (discriminative) features. 𝐻𝐻ᵀ = 𝐿_𝑋 is the proposed constraint that keeps the geometrical information of the reconstructed data close to that of the input data; 𝐻 is an auxiliary variable which is updated in each iteration of the optimization problem and 𝐿_𝑋 is the Laplacian graph matrix of the input data. The weighted reconstruction loss is given in (5).

In the context of feature selection using auto-encoders, Wang, Ding, and Fu (2017) proposed a supervised feature selection guided auto-encoder (FSGAE), which is a combination of feature selection and an auto-encoder. In this technique, feature selection is embedded into the auto-encoder objective function as a regularization term; the Fisher score is used as the regularization term to extract within-class and between-class affinity relationships of features. Han, Wang, Zhang, Li, and Xu (2018) proposed an auto-encoder based unsupervised feature selection (AEUFS), which considers the synaptic weights of the encoding phase to rank the features and applies the 𝑙2,1 matrix norm to sparsify the encoding synaptic weights.
The auto-encoder technique has several advantages over the others, including non-linearity, flexibility, reconstruction error, customizability, hierarchical feature learning, and handling of large datasets. In terms of non-linearity, an auto-encoder can extract non-linear patterns of features using non-linear activation functions, whereas matrix-based factorization cannot. In terms of flexibility, most matrix-based factorization techniques, such as singular value decomposition and non-negative matrix factorization, are linear transformations, whereas an auto-encoder can handle more complex and flexible representations, including non-linear transformations; this allows auto-encoders to better compress data that has intricate, multi-dimensional structures. In terms of reconstruction error, an auto-encoder with an optimal architecture can reconstruct the original data at the output with minimum error. Customizability is the most important advantage of the auto-encoder over other methods: the number of layers, the number of neurons per layer and the training parameters can all be customized to enhance performance. Auto-encoders can also learn hierarchical features using deeper networks, which captures higher-level features, and such complex features can provide better discrepancy between labels in the subspace. The auto-encoder technique can also be applied more efficiently than non-auto-encoder ones for handling large datasets: using GPUs, mini-batches, residual networks and drop-out can decrease the training time and the probability of overfitting. Last but not least, generative AI can be combined with auto-encoders (for example, the variational auto-encoder), which increases the power of auto-encoder based techniques; for example, a variational auto-encoder can be applied for feature selection and anomaly detection simultaneously.
Although the advances in feature selection using auto-encoders are considerable, there are some significant limitations in using auto-encoders for the feature selection task that must be addressed. First, the objective function of the auto-encoder is a squared Frobenius norm or 𝑙2,1-norm, which increases the negative impact of noisy samples and outliers on the performance of the algorithm; additionally, the risk of memorizing noise is increased for over-parameterized deep learning. Second, geometrical information of the data is not preserved in auto-encoder feature selection techniques, which leads to losing geometrical information in the latent space. The critical concern at hand is the direct impact on the reconstructed data of the loss of geometric information in the latent space. Consequently, the encoding feature weights are influenced through back-propagation, leading to a degradation in feature selection performance: the synaptic weights of the decoding phase affect the synaptic weights of the encoding phase due to the loss of geometrical information, and the selected features are directly affected since features are ranked based on the synaptic weights of the encoding phase.
To address these concerns, we propose a novel unsupervised feature selection method based on an auto-encoder which preserves the geometrical information in both the encoding and decoding phases. We employ the 𝑙2,1 matrix norm as a sparse regularization function to obtain the most discriminative features. We hypothesize that the distances between samples in the original dimensions and in the reconstructed data should be similar; as a result, the Laplacian graphs of the input data and the reconstructed data should ideally be identical, or at least closely related. To maintain the geometric information of the data, we introduce a new Laplacian graph matrix for the reconstructed data and, using an optimization technique, enforce it to be close to the input Laplacian graph matrix.


Table 1
Summary of the key notations used in the paper.

Notation                          Explanation
𝑋 ∈ R^{𝑑×𝑛}                       Feature matrix (input matrix)
𝑍 ∈ R^{𝑘×𝑛}                       Matrix in the hidden layer
𝑋̄ ∈ R^{𝑑×𝑛}                       Reconstructed matrix
𝐿_𝑋 ∈ R^{𝑛×𝑛}                     Laplacian matrix related to the feature matrix 𝑋
𝐴ᵀ, Tr(𝐴)                         The transpose and the trace of 𝐴
‖𝐴‖₂                              The 𝑙2-norm of matrix 𝐴
‖𝐴‖_𝐹 = √(∑_{𝑖,𝑗} |𝑎_{𝑖𝑗}|²)       The Frobenius norm of 𝐴
𝑊^{(1)} ∈ R^{𝑑×𝑘}                 The encoder weight matrix
𝑊^{(2)} ∈ R^{𝑘×𝑑}                 The decoder weight matrix

Then, we present an efficient algorithm to solve the objective of the corresponding problem. Experiments are conducted on various benchmarks to show that the proposed auto-encoder based feature selection technique outperforms many recently developed techniques. The architecture of the proposed method is shown in Fig. 1. The main contributions of the proposed method are as follows:

- We apply both sparse regularization and structure learning to sparsify the solution and preserve local information of the data.
- Unlike previous auto-encoder feature selection methods, we define a new variable to preserve distances among samples in both the latent space and the reconstructed data. This new variable is integrated into the optimization problem to keep the structure of the reconstructed data close to the structure of the input data.
- We propose an efficient optimization approach to solve the feature selection problem with regularization functions.
- Experimental results show that the proposed auto-encoder feature selection has better performance than state-of-the-art methods.

The rest of this paper is organized as follows. In Section 2 we briefly review related works. Section 3 presents our new method and the solution of its optimization problem. In Section 4, we present experimental results and the evaluation of our method. Finally, Section 5 concludes the article.
The notations used in this paper are listed in Table 1.

2. Related works

In subspace learning, the algorithm seeks a subspace of the data that spans all features; in unsupervised feature selection, those features which can span the whole feature set are selected. The application of auto-encoders to feature selection has recently been investigated. Ling, Nie, Yu, and Li (2023) improved AEUFS by introducing discriminative and robust auto-encoder feature selection (DRAEFS) to capture the latent cluster structure of the data. AEUFS may fail to capture the cluster structure of the data in the representation space, in which case informative features cannot be obtained. DRAEFS integrates 𝑘-means clustering into the auto-encoder feature selection objective function to preserve the discriminative features in the latent space: the input data is mapped to the latent space using the encoding phase of the auto-encoder, and 𝑘-means clustering is applied to the projected data to enhance the power of discrimination. 𝑘-means detects the cluster structure in the latent space, and this improves the discrimination needed to obtain the most informative features. Additionally, in DRAEFS the 𝑙2,1 matrix norm is utilized to improve the sparsity of the solution. Gong et al. (2022) proposed an adaptive auto-encoder for feature selection (AAEFS). AAEFS uses a dual regularization technique by applying the 𝑙2,1 matrix norm to both the encoding and decoding weights; additionally, AAEFS applies Pearson correlation regularization to control redundancy among samples. Feng and Duarte (2018) proposed a graph auto-encoder based unsupervised feature selection (GAEFS). GAEFS has an objective function similar to AEUFS with a structure learning regularization: it considers Laplacian graph regularization to preserve the geometrical information of the data. Li, Wang, Yang, and Liu (2023) proposed a two-step feature selection based on an auto-encoder (TSFSAE). TSFSAE applies an auto-encoder in the first step to extract the latent space of the data; in the next step, the features in the latent space are evaluated using multivariate rank distance correlation learning. Zhang, Lu, and Wang (2021) proposed transformed auto-encoder feature selection (TAEFS). In contrast to other auto-encoder based feature selection techniques, TAEFS ranks the original features with non-linear structure preservation. TAEFS has three steps: (1) obtaining an indicator matrix using an auto-encoder, (2) applying non-negative least squares to approximate a non-negative indicator matrix, and (3) selecting the most discriminative features and evaluating them by 𝑘-means clustering. Wu and Cheng (2021) proposed the fractal auto-encoder (FAE) for feature selection to address the diversity issue of the identification auto-encoder. FAE considers a sub neural network such that a reconstruction error objective function is embedded into the auto-encoder objective function; in the training phase, the objective function of FAE has two terms corresponding to a one-to-one layer (global neural network) and a feature selection layer (sub neural network), and the sub and global neural networks are considered for the encoding phase only. Doquet and Sebag (2020) proposed agnostic feature selection (AgnosFS). AgnosFS applies an auto-encoder to extract the latent representation of the data combined with a structure regularization, applying LASSO and group LASSO on the encoding phase. In a recent study, Mozafari, Seyedi, Mohammadiani, and Tab (2024) proposed an unsupervised NMF feature selection similar to an auto-encoder (NMF-AE). In this technique, the objective function has two parts, reconstruction and decomposition; although NMF-AE is not exactly an auto-encoder, it follows the auto-encoder idea. For the first time, in NMF-AE an orthogonality constraint was applied to the representation matrix rather than the feature weight matrix, which raises a significant concern regarding whether the feature weight matrix is an indicator matrix or not. In NMF-AE, the 𝑙2,1 matrix norm was applied to the

feature weight matrix to sparsify the solution, and the Laplacian graph matrix was applied to the representation matrix to preserve structural information. Sharifipour, Fayyazi, Sabokrou, and Adeli (2019) proposed a simple yet efficient auto-encoder based feature selection technique in which features are sorted according to their impact on the reconstruction error: the reconstruction error is measured after removing each feature, and the variation in reconstruction error is used as a criterion for ranking features, with significant fluctuations indicating important features. For example, let 𝐸1 and 𝐸2 be the reconstruction errors when features 𝐹1 and 𝐹2 are removed, respectively; 𝐸2 > 𝐸1 shows that the effect of 𝐹2 on the reconstruction error is greater than that of 𝐹1, and as a result 𝐹2 is more important than 𝐹1. In all the surveyed studies, the preservation of geometrical information and of the encoding–decoding data topology is absent. To this end, we concentrate on preserving the geometrical information of the samples and the encoding–decoding topology of the data.

3. Proposed method

Dimensionality reduction is a method that maps input data points onto a low-dimensional space. More precisely, consider 𝑋 = [𝑥1, 𝑥2, …, 𝑥𝑛] ∈ R^{𝑑×𝑛} as a feature matrix, where 𝑥𝑖 ∈ R^𝑑 denotes the 𝑖th data point. In the process of dimension reduction we obtain 𝑍 = [𝑧1, 𝑧2, …, 𝑧𝑛] ∈ R^{𝑘×𝑛}, where 𝑘 < 𝑑. However, during dimension reduction some topological and inherent nonlinear characteristics of the data may be lost. In order to preserve these kinds of properties, the graph Laplacian is added to the feature selection process as a structure learning regularization. In the following we propose a novel auto-encoder approach to feature selection based on preserving the geometrical and topological structure.

3.1. Theoretical discussion

The graph Laplacian of a data set is a meaningful tool to reveal the intrinsic relationship among its points and provides the hidden topology of the data set. Consider 𝑋 = [𝑥1, 𝑥2, …, 𝑥𝑛] ∈ R^{𝑑×𝑛} as a feature matrix, where 𝑥𝑖 ∈ R^𝑑 denotes the 𝑖th data point. We consider a 𝐾𝑛𝑛 graph representation of this data set in which each column of the data matrix 𝑋, i.e. 𝑥𝑖, is represented as a node of a graph, i.e. 𝑣𝑖, and each node is connected to its 𝑘-nearest neighbor nodes. The Gaussian kernel function is often used as the similarity function to weight the edges, 𝐴_{𝑖𝑗} = exp(−‖𝑥𝑖 − 𝑥𝑗‖²∕(2𝜎²)). The Laplacian matrix is defined as 𝐿_𝑋 = 𝐷 − 𝐴, where 𝐷 ∈ R^{𝑛×𝑛} is a diagonal matrix with diagonal elements $d_{ii} = \sum_{j=1}^{n} a_{ij}$.
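As an illustration of this construction, the following minimal sketch (in Python/NumPy, not the authors' MATLAB implementation) builds the 𝐾𝑛𝑛 graph with Gaussian edge weights and the Laplacian 𝐿_𝑋 = 𝐷 − 𝐴; the bandwidth sigma and the neighborhood size k are illustrative choices.

import numpy as np

def laplacian_graph(X, k=5, sigma=1.0):
    # X is d x n (each column x_i is one sample), following the paper's notation.
    d, n = X.shape
    sq = np.sum(X ** 2, axis=0)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)   # pairwise squared distances
    np.fill_diagonal(dist2, np.inf)                        # exclude self-loops
    A = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(dist2[i])[:k]                    # k nearest neighbours of node i
        A[i, nbrs] = np.exp(-dist2[i, nbrs] / (2.0 * sigma ** 2))
    A = np.maximum(A, A.T)                                 # symmetrize the affinity matrix
    D = np.diag(A.sum(axis=1))                             # degree matrix, d_ii = sum_j a_ij
    return D - A                                           # Laplacian L_X

# toy usage: 20 features, 50 samples
L_X = laplacian_graph(np.random.rand(20, 50), k=5)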
Structure learning is constructed based on the assumption that if two points 𝑥𝑖 and 𝑥𝑗 are close in the original data space, then the corresponding points 𝑧𝑖 and 𝑧𝑗 should also be close. Now consider the following equation:

$\mathrm{Tr}(Z L_X Z^\top) = \sum_{i,j=1}^{n} a_{ij}\,\|z_i - z_j\|_2^2$   (1)

If 𝑎_{𝑖𝑗} ≠ 0, then 𝑣𝑖 and 𝑣𝑗 are connected in the graph representation of 𝑋; moreover, ‖𝑧𝑖 − 𝑧𝑗‖²₂ measures the closeness of the points in the low-dimensional space 𝑍. Consequently, Eq. (1) captures the latent topological and geometrical structure of the data set.
Considering the above descriptions and properties, in what follows we utilize a simple auto-encoder based on the graph Laplacian to propose a dimension reduction method. First, each input 𝑥𝑖 is encoded by 𝑓 into a hidden representation 𝑧𝑖 ∈ R^𝑘, where 𝑘 < 𝑑; then it is decoded by 𝑔 into a reconstructed vector 𝑥̂𝑖. More precisely, we have 𝑧𝑖 = 𝑓(𝑥𝑖) = 𝜎1((𝑊^{(1)})ᵀ𝑥𝑖) and 𝑥̂𝑖 = 𝑔(𝑧𝑖) = 𝜎2((𝑊^{(2)})ᵀ𝑧𝑖). Here 𝜎1 and 𝜎2 are the activation functions of the network, and 𝑊^{(1)} ∈ R^{𝑑×𝑘} and 𝑊^{(2)} ∈ R^{𝑘×𝑑} are the encoder and decoder weight matrices, respectively.
Basically, this auto-encoder network is trained by minimizing the squared error between 𝑥𝑖 and 𝑥̂𝑖 for 𝑖 = 1, …, 𝑛, i.e.:

$\min_{W^{(1)},W^{(2)}} \sum_{i=1}^{n} \|x_i - \hat{x}_i\|_2^2 = \sum_{i=1}^{n} \|x_i - g(f(x_i))\|_2^2$   (2)
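For concreteness, a small sketch of this basic auto-encoder and of the reconstruction loss in Eq. (2) follows (Python/NumPy; the sigmoid is assumed for both 𝜎1 and 𝜎2, whereas the paper only fixes 𝜎1 to be the sigmoid).

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward(X, W1, W2):
    # X: d x n, W1: d x k (encoder), W2: k x d (decoder)
    Z = sigmoid(W1.T @ X)        # hidden representation z_i = f(x_i), shape k x n
    X_hat = sigmoid(W2.T @ Z)    # reconstruction x_hat_i = g(z_i), shape d x n
    return Z, X_hat

def reconstruction_loss(X, X_hat):
    return np.sum((X - X_hat) ** 2)   # sum_i ||x_i - x_hat_i||_2^2, as in Eq. (2)

rng = np.random.default_rng(0)
d, n, k = 20, 50, 8
X = rng.random((d, n))
W1 = 0.1 * rng.standard_normal((d, k))
W2 = 0.1 * rng.standard_normal((k, d))
Z, X_hat = forward(X, W1, W2)
print(reconstruction_loss(X, X_hat))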

In the hidden layer, the latent features should retain the main information of 𝑋 with a lower dimension; consequently, the more important features should be captured in this layer. 𝑊^{(1)} has an essential role in selecting these features: the larger the norm of its 𝑖th row, the more significant the role of the 𝑖th feature in the latent layer. Hence, ‖𝑊^{(1)}‖_{2,1} is considered to find the more informative features and also to enforce the sparsity of 𝑊^{(1)}. Also, to overcome overfitting, a regularization term is considered (Ling et al., 2023):

$\min_{W^{(1)},W^{(2)}} \sum_{i=1}^{n} \|x_i - g(f(x_i))\|_2^2 + \beta\|W^{(1)}\|_{2,1} + \lambda\sum_{i=1}^{2}\|W^{(i)}\|_F^2$   (3)
The encoder aims to encode and compress the input data into a low-dimensional space while preserving as much key information as possible. Consequently, one of the main goals is to keep nodes that are connected in the predefined graph 𝐺 as close as possible in the latent space; in other words, we aim to preserve the local manifold structure of the data set in the latent layer. To impose and convey the manifold structure of 𝑋 to 𝑍, we apply the graph Laplacian properties of Eq. (1) and consider $\mathrm{Tr}(Z L_X Z^\top) = \mathrm{Tr}\big((W^{(1)})^\top X L_X X^\top W^{(1)}\big)$. Hence we have:

$\min_{W^{(1)},W^{(2)}} \frac{1}{2n}\|X - g(f(X))\|_F^2 + \alpha\|W^{(1)}\|_{2,1} + \frac{\beta}{2}\sum_{i=1}^{2}\|W^{(i)}\|_F^2 + \mathrm{Tr}\big((W^{(1)})^\top X L_X X^\top W^{(1)}\big)$   (4)
The decoder part tries to decode or decompress the encoded output so as to reconstruct the original input data as closely as possible.


In order to retrieve the geometrical information of 𝑋 in 𝑋̄, we focus on preserving the relationships and connectivities between data points in 𝑋̄, and utilize the graph Laplacian technique on the reconstruction layer this time. Moreover, instead of using 𝐿_𝑋 directly, we consider an auxiliary function that should be close to 𝐿_𝑋. Since 𝐿_𝑋 is positive semi-definite, the auxiliary function should be, too; hence, we consider it in the form 𝐻𝐻ᵀ and add $\mathrm{Tr}\big((W^{(2)})^\top Z H H^\top Z^\top W^{(2)}\big)$ to (4). Finally, we consider the following minimization problem for the auto-encoder process:
$\min_{W^{(1)},W^{(2)}} \frac{1}{2n}\|X - g(f(X))\|_F^2 + \alpha\|W^{(1)}\|_{2,1} + \frac{\beta}{2}\sum_{i=1}^{2}\|W^{(i)}\|_F^2 + \frac{\gamma}{2}\mathrm{Tr}\big((W^{(1)})^\top X L_X X^\top W^{(1)}\big) + \frac{\omega}{2}\mathrm{Tr}\big((W^{(2)})^\top Z H H^\top Z^\top W^{(2)}\big),$   (5)

s.t.  $H H^\top - L_X = 0,\; H \ge 0.$

Here 𝑍 = 𝜎1 ((𝑊 (1) )𝑇 𝑋) and 𝜎1 is the sigmoid function. Also, 𝛼, 𝛽, 𝛾 and 𝜔 are hyperparameters that balance the importance of
terms.
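For illustration, the following hedged sketch evaluates objective (5) for given 𝑊^{(1)}, 𝑊^{(2)} and 𝐻 (Python/NumPy, assuming the sigmoid encoder/decoder of the earlier sketch; the hyperparameter values passed in are arbitrary placeholders, not the tuned values of the paper).

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def l21_norm(W):
    return np.sum(np.sqrt(np.sum(W ** 2, axis=1)))          # sum of row l2-norms

def objective(X, W1, W2, H, L_X, alpha, beta, gamma, omega):
    n = X.shape[1]
    Z = sigmoid(W1.T @ X)                                    # hidden codes, k x n
    X_hat = sigmoid(W2.T @ Z)                                # reconstruction, d x n
    rec     = np.linalg.norm(X - X_hat, 'fro') ** 2 / (2 * n)
    sparse  = alpha * l21_norm(W1)
    ridge   = 0.5 * beta * (np.linalg.norm(W1, 'fro') ** 2 + np.linalg.norm(W2, 'fro') ** 2)
    enc_geo = 0.5 * gamma * np.trace(W1.T @ X @ L_X @ X.T @ W1)
    dec_geo = 0.5 * omega * np.trace(W2.T @ Z @ H @ H.T @ Z.T @ W2)
    return rec + sparse + ridge + enc_geo + dec_geo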

3.2. Solving the problem

To tackle the proposed auto-encoder problem, we use a back-propagation strategy to update the parameters. The error terms of the layers are as follows:

$\delta^{(o)} = -(X - \bar{X}) \odot \sigma_2'(f(X))$   (output layer),
$\delta^{(h)} = \big((W^{(2)})^\top \delta^{(o)}\big) \odot \sigma_1'(X)$   (hidden layer).   (6)

The Lagrangian function of this problem is

$\mathcal{L}(W^{(1)}, W^{(2)}, Y) = \frac{1}{2n}\|X - \bar{X}\|_F^2 + \alpha\|W^{(1)}\|_{2,1} + \frac{\beta}{2}\sum_{i=1}^{2}\|W^{(i)}\|_F^2 + \frac{\gamma}{2}\mathrm{Tr}\big((W^{(1)})^\top X L_X X^\top W^{(1)}\big) + \frac{\omega}{2}\mathrm{Tr}\big((W^{(2)})^\top Z H H^\top Z^\top W^{(2)}\big) + \|H H^\top - L_X\|_F^2 + \mathrm{Tr}(Y^\top H),$   (7)

where 𝑌 is the Lagrange multiplier.
Considering 𝑤𝑖 as the 𝑖th row of 𝑊^{(1)}, we define the diagonal matrix $Q = \mathrm{diag}\big(\tfrac{1}{\|w_i\|_2}\big)$; hence $\|W^{(1)}\|_{2,1} = \mathrm{Tr}\big((W^{(1)})^\top Q W^{(1)}\big)$ and we have:

$\frac{\partial \mathcal{L}}{\partial W^{(1)}} = X(\delta^{(h)})^\top + \beta W^{(1)} + \alpha Q W^{(1)} + \gamma X L_X X^\top W^{(1)} + \phi\big(\sigma_1((W^{(1)})^\top X)\big),$   (8)

where $\phi(A) = H H^\top A^\top W^{(2)} (W^{(2)})^\top (A \odot (1 - A))$.
Also,

$\frac{\partial \mathcal{L}}{\partial W^{(2)}} = f(X)(\delta^{(o)})^\top + \beta W^{(2)} + \omega Z H H^\top Z^\top W^{(2)}.$   (9)

Having the gradients of $\mathcal{L}$ with respect to 𝑊^{(1)} and 𝑊^{(2)}, we use the gradient descent method and update 𝑊^{(1)} in the new iteration by

$W^{(1)}_{t+1} \leftarrow W^{(1)}_{t} - \rho_1\Big[X(\delta^{(h)})^\top + \beta W^{(1)}_{t} + \alpha Q W^{(1)}_{t} + \gamma X L_X X^\top W^{(1)}_{t} + \phi\big(\sigma_1((W^{(1)}_{t})^\top X)\big)\Big],$   (10)

and also 𝑊^{(2)} by

$W^{(2)}_{t+1} \leftarrow W^{(2)}_{t} - \rho_2\Big[f(X)(\delta^{(o)})^\top + \beta W^{(2)}_{t} + \omega Z H H^\top Z^\top W^{(2)}_{t}\Big],$   (11)
where 𝜌1 and 𝜌2 are suitable step sizes.
Now, to obtain 𝐻, consider the 𝐻-dependent part of the Lagrangian:

$\mathcal{L}_H = \|H H^\top - L_X\|_F^2 + \mathrm{Tr}(Y^\top H).$   (12)

Hence

$\frac{\partial \mathcal{L}_H}{\partial H} = -2 L_X H + 4 H H^\top H + Y.$   (13)

Through the KKT conditions, i.e. $Y_{ij} H_{ij} = 0$, we have

$\big(-2 L_X H + 4 H H^\top H\big)_{ij} H_{ij} = 0.$   (14)

Hence the optimal solution of (12) is

$H_{ij} \leftarrow H_{ij}\, \frac{(2 L_X H)_{ij}}{(4 H H^\top H)_{ij} + \varepsilon}.$   (15)
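A direct sketch of the multiplicative update (15) follows (Python/NumPy); the small constant eps prevents division by zero, and the final clipping, which enforces the constraint 𝐻 ≥ 0, is our own safeguard rather than a step specified in the paper.

import numpy as np

def update_H(H, L_X, eps=1e-12):
    numer = 2.0 * (L_X @ H)
    denom = 4.0 * (H @ H.T @ H) + eps
    H_new = H * (numer / denom)          # element-wise Hadamard update of rule (15)
    return np.maximum(H_new, 0.0)        # clip to keep H >= 0 (our own assumption)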

Having obtained 𝑊^{(1)}, we sort its rows based on their Euclidean norms, i.e. we rank ‖𝑤_{(1)}‖₂, ‖𝑤_{(2)}‖₂, …, ‖𝑤_{(𝑚)}‖₂ in descending order, where the index 𝑖 indicates the 𝑖th feature (row) of 𝑋, and we arrange the features of 𝑋 according to the order of the ‖𝑤_{(𝑖)}‖₂ values. The top 𝑟 features, i.e. those whose corresponding rows in 𝑊^{(1)} have the highest 𝑙2-norms, are then selected.
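A minimal sketch of this ranking step (Python/NumPy; features are rows of 𝑋 and of 𝑊^{(1)} in the paper's notation):

import numpy as np

def select_features(X, W1, r):
    scores = np.linalg.norm(W1, axis=1)      # ||w_i||_2 for the i-th feature (row of W1)
    order = np.argsort(scores)[::-1]         # feature indices in descending importance
    return X[order[:r], :], order[:r]        # data restricted to the top-r features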

Algorithm 1 Algorithm of the proposed method

Input: Training data 𝑋 ∈ R^{𝑑×𝑛}, hidden layer size 𝑘, regularization coefficients 𝛼, 𝛽, 𝛾, 𝜆 and 𝜔, number of epochs 𝑁, learning rates (step sizes) 𝜌1 and 𝜌2.
1: Normalize all features using the 𝑍-score technique.
2: Calculate the Laplacian matrix 𝐿_𝑋.
3: Randomly initialize 𝑊^{(1)}, 𝑊^{(2)}, 𝐻 and 𝑌; set 𝑄 = 𝐼, where 𝐼 is the identity matrix.
4: for 𝑖 = 1 to 𝑁 do
5:   Update 𝑊^{(1)} by (10)
6:   Update 𝑊^{(2)} by (11)
7:   Update 𝐻 by (15)
8: end for
Output: Projection matrix 𝑊^{(1)}.
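For illustration only, the following end-to-end sketch of Algorithm 1 is written in Python/PyTorch rather than the authors' MATLAB: the gradient steps (10)–(11) are replaced by automatic differentiation of objective (5), while 𝐻 is updated with the multiplicative rule (15). The Laplacian L_np can be built with the laplacian_graph sketch of Section 3.1, and all hyperparameter values below are placeholders.

import torch

def train(X_np, L_np, k_hidden=10, alpha=1e-2, beta=1e-2, gamma=1e-2,
          omega=1e-2, lr=1e-2, epochs=200):
    d, n = X_np.shape
    X = torch.tensor(X_np, dtype=torch.float64)
    L = torch.tensor(L_np, dtype=torch.float64)
    W1 = (0.01 * torch.randn(d, k_hidden, dtype=torch.float64)).requires_grad_(True)
    W2 = (0.01 * torch.randn(k_hidden, d, dtype=torch.float64)).requires_grad_(True)
    H = torch.rand(n, n, dtype=torch.float64)            # auxiliary factor, H H^T ~ L_X

    for _ in range(epochs):
        Z = torch.sigmoid(W1.t() @ X)                    # hidden codes, k x n
        X_hat = torch.sigmoid(W2.t() @ Z)                # reconstruction, d x n
        rec = ((X - X_hat) ** 2).sum() / (2 * n)
        l21 = torch.sqrt((W1 ** 2).sum(dim=1) + 1e-12).sum()
        ridge = 0.5 * beta * ((W1 ** 2).sum() + (W2 ** 2).sum())
        enc_geo = 0.5 * gamma * torch.trace(W1.t() @ X @ L @ X.t() @ W1)
        dec_geo = 0.5 * omega * torch.trace(W2.t() @ Z @ H @ H.t() @ Z.t() @ W2)
        loss = rec + alpha * l21 + ridge + enc_geo + dec_geo

        loss.backward()
        with torch.no_grad():
            W1 -= lr * W1.grad; W1.grad.zero_()          # gradient step in place of (10)
            W2 -= lr * W2.grad; W2.grad.zero_()          # gradient step in place of (11)
            # multiplicative update (15); clipping to keep H >= 0 is our own safeguard
            H *= (2.0 * (L @ H)) / (4.0 * (H @ H.t() @ H) + 1e-12)
            H.clamp_(min=0.0)

    scores = torch.linalg.norm(W1.detach(), dim=1)       # row l2-norms rank the features
    return W1.detach(), scores

Feature ranking then proceeds exactly as described before Algorithm 1, by sorting the returned scores in descending order.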

3.3. Convergence analysis

To show the convergence of the proposed technique, we prove the convergence for 𝑊^{(1)}.


$\min_{W^{(1)},W^{(2)}} \frac{1}{2n}\|X - g(f(X))\|_F^2 + \alpha\|W^{(1)}\|_{2,1} + \frac{\beta}{2}\sum_{i=1}^{2}\|W^{(i)}\|_F^2 + \frac{\gamma}{2}\mathrm{Tr}\big((W^{(1)})^\top X L_X X^\top W^{(1)}\big) + \frac{\omega}{2}\mathrm{Tr}\big((W^{(2)})^\top Z H H^\top Z^\top W^{(2)}\big),$   (16)

s.t.  $H H^\top - L_X = 0,\; H \ge 0.$
If we can show that (16) decreases monotonically, we are done. In what follows, 𝐽_𝑡 denotes the value of a function 𝐽 at iteration 𝑡. Let us collect some of the terms of (16) into a new function as follows:

$\psi(\phi) = \frac{1}{2n}\|X - g(f(X))\|_F^2 + \frac{\beta}{2}\sum_{i=1}^{2}\|W^{(i)}\|_F^2 + \frac{\omega}{2}\mathrm{Tr}\big((W^{(2)})^\top Z H H^\top Z^\top W^{(2)}\big)$   (17)
According to Algorithm 1 and (16), we have the following optimization problem:

$\phi_{t+1} = \arg\min_{\phi}\; \psi(\phi) + \alpha\|W^{(1)}\|_{2,1} + \frac{\gamma}{2}\mathrm{Tr}\big((W^{(1)})^\top X L_X X^\top W^{(1)}\big),$   (18)
which demonstrates that

$\psi(\phi_{t+1}) + \alpha\,\mathrm{Tr}\big((W^{(1)}_{t+1})^\top Q_t W^{(1)}_{t+1}\big) + \frac{\gamma}{2}\mathrm{Tr}\big((W^{(1)}_{t+1})^\top X L_X X^\top W^{(1)}_{t+1}\big) \le \psi(\phi_t) + \alpha\,\mathrm{Tr}\big((W^{(1)}_{t})^\top Q_t W^{(1)}_{t}\big) + \frac{\gamma}{2}\mathrm{Tr}\big((W^{(1)}_{t})^\top X L_X X^\top W^{(1)}_{t}\big).$   (19)
It can be converted to the following form:

$\psi(\phi_{t+1}) + \alpha\sum_{i=1}^{d}\frac{\|(w_{t+1})_i\|_2^2}{\|(w_t)_i\|_2} + \frac{\gamma}{2}\sum_{i=1}^{d}\frac{\|(w_{t+1})_i\|_2^2}{\|(w_t)_i\|_2} \le \psi(\phi_t) + \alpha\sum_{i=1}^{d}\frac{\|(w_t)_i\|_2^2}{\|(w_t)_i\|_2} + \frac{\gamma}{2}\sum_{i=1}^{d}\frac{\|(w_t)_i\|_2^2}{\|(w_t)_i\|_2}.$   (20)

Lemma 1. For any nonzero vectors 𝑤_{𝑡+1} and 𝑤_𝑡, the following inequality holds:

$\|w_{t+1}\|_2 - \frac{\|w_{t+1}\|_2^2}{\|w_t\|_2} \le \|w_t\|_2 - \frac{\|w_t\|_2^2}{\|w_t\|_2}.$   (21)

The proof of Lemma 1 can be found in Nie, Huang, Cai, and Ding (2010). According to Lemma 1, we can write

$\|(w_{t+1})_i\|_2 - \frac{\|(w_{t+1})_i\|_2^2}{\|(w_t)_i\|_2} \le \|(w_t)_i\|_2 - \frac{\|(w_t)_i\|_2^2}{\|(w_t)_i\|_2}.$   (22)
Likewise,

$\Big(\alpha + \frac{\gamma}{2}\Big)\Big[\sum_{i=1}^{d}\Big(\|(w_{t+1})_i\|_2 - \frac{\|(w_{t+1})_i\|_2^2}{\|(w_t)_i\|_2}\Big)\Big] \le \Big(\alpha + \frac{\gamma}{2}\Big)\Big[\sum_{i=1}^{d}\Big(\|(w_t)_i\|_2 - \frac{\|(w_t)_i\|_2^2}{\|(w_t)_i\|_2}\Big)\Big].$
By adding the above inequality to (20), we have

$\psi(\phi_{t+1}) + \Big(\alpha + \frac{\gamma}{2}\Big)\sum_{i=1}^{d}\|(w_{t+1})_i\|_2 \le \psi(\phi_t) + \Big(\alpha + \frac{\gamma}{2}\Big)\sum_{i=1}^{d}\|(w_t)_i\|_2.$   (23)


Table 2
Datasets Information.
Dataset Samples Features Classes Type
ORL 400 2024 40 Image
Yale 165 1024 15 Image
warpPIE10P 210 2420 10 Image
Jaffe 213 676 10 Image
TOX-171 171 5748 4 Microarray(Gene)
Lymphoma 96 4026 9 Microarray(Cancer)
Colon 62 2000 2 Microarray(Cancer)
Prostate 102 5966 2 Microarray(Cancer)
GLI-85 85 22283 4 Microarray(Cancer)
Leukemia 72 7070 2 Microarray(Cancer)

Consequently, we have proved that

$\psi(\phi_{t+1}) + \alpha\|W^{(1)}_{t+1}\|_{2,1} + \frac{\gamma}{2}\mathrm{Tr}\big((W^{(1)}_{t+1})^\top X L_X X^\top W^{(1)}_{t+1}\big) \le \psi(\phi_t) + \alpha\|W^{(1)}_{t}\|_{2,1} + \frac{\gamma}{2}\mathrm{Tr}\big((W^{(1)}_{t})^\top X L_X X^\top W^{(1)}_{t}\big).$   (24)
The convergence for 𝑊^{(2)}, 𝐻 and 𝑌 can be proved in the same way as for 𝑊^{(1)}.

3.4. Computational complexity analysis

In this section we analyze the computational complexity of Algorithm 1. For input data 𝑋 ∈ R^{𝑑×𝑛}, we have 𝑛 samples, 𝑑 features, 𝑐 sample categories, hidden layer size 𝑘 and 𝑁 iterations. The major computation time of the algorithm comes from updating 𝑊^{(1)} and 𝑊^{(2)}: the computational complexity for 𝑊^{(1)} is 𝑂(𝑘𝑑𝑛 + 𝑑𝑛²) and for 𝑊^{(2)} it is 𝑂(𝑘𝑑𝑛 + 𝑘𝑐𝑛 + 𝑑²𝑘). For high-dimensional data, 𝑑 ≥ 𝑘, 𝑑 ≥ 𝑐 and 𝑛 > 𝑐. Consequently, the overall computational cost of Algorithm 1 is 𝑂(𝑁(𝑘𝑑𝑛 + 𝑑𝑛² + 𝑑²𝑘)).

4. Experimental results

In this section, different datasets are used to conduct a series of experiments analyzing the performance and efficacy of the proposed feature selection method in comparison with state-of-the-art unsupervised feature selection methods. All experiments are implemented in the MATLAB 2020b programming environment on a machine with a 1.5 GHz Intel(R) Core(TM) i7-1065G7 CPU and 16 GB of RAM.

4.1. Datasets

In our experiments, ten benchmark datasets are utilized. More precisely, four face image datasets: ORL,1 Yale,2 warpPIE10P1
and Jaffe3 and six biological microarray datasets: TOX-171,4 Lymphoma,5 Colon4 , Prostate and GLI-85 (Li, Cheng, et al., 2017) and
Leukemia (Golub et al., 1999), are chosen.
Detailed information of the datasets is summarized in Table 2.

4.2. Parameter settings

The regularization coefficients 𝛼, 𝛾, 𝜆 and 𝜔 are optimized using a grid search strategy over the range {10⁻⁶, 10⁻⁴, 10⁻², …, 10⁶}. After sorting the features based on their importance, feature subsets of sizes {50, 100, 150, …, 300} are used to evaluate the feature selection technique. The number of clusters 𝑐 is set equal to √𝑛∕2, where 𝑛 is the number of samples, and the projected dimension 𝑚 is set equal to 𝑐. The number of neighbors connected to sample 𝑖 is set equal to 5 (i.e. 𝑘 = 5). To evaluate the dimensionality reduction ability of the proposed method, 𝑘-means clustering is run 30 times due to its sensitivity to initialization, and the mean performance is reported. Parameter 𝛽 controls overfitting: in neural networks, drop-out and regularization are two methods to avoid overfitting, and 𝛽 can be changed in each epoch, which is known as weight decay. Loosely speaking, 𝛽 is initialized and then decreased by a coefficient in each epoch.
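A hedged sketch of this evaluation protocol follows (Python with scikit-learn/SciPy rather than the authors' MATLAB): 𝑘-means is run 30 times on the selected features, clustering accuracy is computed via Hungarian matching of cluster labels to true labels, and NMI is averaged over the runs.

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def clustering_acc(y_true, y_pred):
    # best one-to-one matching between cluster labels and class labels
    labels = np.unique(np.concatenate([y_true, y_pred]))
    cost = np.zeros((labels.size, labels.size))
    for i, c in enumerate(labels):
        for j, t in enumerate(labels):
            cost[i, j] = -np.sum((y_pred == c) & (y_true == t))
    row, col = linear_sum_assignment(cost)
    return -cost[row, col].sum() / y_true.size

def evaluate(X_selected, y_true, n_clusters, repeats=30):
    # X_selected: n_samples x n_selected_features
    accs, nmis = [], []
    for seed in range(repeats):
        pred = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=seed).fit_predict(X_selected)
        accs.append(clustering_acc(np.asarray(y_true), pred))
        nmis.append(normalized_mutual_info_score(y_true, pred))
    return float(np.mean(accs)), float(np.mean(nmis))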

1 http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html.
2 https://www.kaggle.com
3 http://www.kasrl.org/jaffe.html
4 http://mlearn.ics.uci.edu/MLRepository.html.
5 http://llmpp.nih.gov


Table 3
Clustering accuracy results on selected features (ACC% ± std%).
Method          ORL      Yale     warpPIE10P  Jaffe    TOX-171  Lymphoma  Colon    Prostate  GLI-85   Leukemia
Baseline 48 ± 3 38 ± 3 26 ± 2 69 ± 3 40 ± 2 45 ± 3 44 ± 3 37 ± 3 68 ± 2 65 ± 4
DRNMF 51 ± 3 45 ± 3 44 ± 5 74 ± 7 51 ± 2 58 ± 7 56 ± 3 44 ± 2 78 ± 2 75 ± 7
NDFS 50 ± 3 46 ± 4 49 ± 4 73 ± 5 49 ± 3 55 ± 6 53 ± 4 42 ± 3 74 ± 4 74 ± 5
AEFS 51 ± 3 44 ± 4 50 ± 5 74 ± 7 48 ± 2 54 ± 4 52 ± 3 43 ± 4 76 ± 2 76 ± 7
UFGOR 50 ± 3 43 ± 3 42 ± 5 75 ± 7 50 ± 1 57 ± 5 53 ± 3 43 ± 2 77 ± 2 76 ± 4
NMF 50 ± 2 43 ± 5 42 ± 6 71 ± 8 45 ± 2 52 ± 4 52 ± 2 42 ± 3 77 ± 4 75 ± 5
VCSDFS 52 ± 3 44 ± 3 43 ± 5 73 ± 7 49 ± 2 54 ± 5 51 ± 3 42 ± 4 77 ± 2 76 ± 6
RNMF 51 ± 4 45 ± 4 47 ± 6 74 ± 6 48 ± 1 55 ± 7 54 ± 4 41 ± 4 76 ± 5 75 ± 7
Proposed 53 ± 3 49 ± 5 52 ± 7 77 ± 7 50 ± 2 57 ± 4 57 ± 5 44 ± 4 77 ± 3 77 ± 6

Table 4
Clustering NMI results on selected features (NMI% ± std%).
Method          ORL      Yale     warpPIE10P  Jaffe    TOX-171  Lymphoma  Colon    Prostate  GLI-85   Leukemia
Baseline 58 ± 5 33 ± 4 28 ± 6 69 ± 4 16 ± 4 51 ± 5 7 ± 2 9 ± 3 5 ± 3 11 ± 5
DRNMF 74 ± 4 54 ± 3 46 ± 5 83 ± 4 28 ± 7 69 ± 6 27 ± 4 14 ± 2 47 ± 3 16 ± 3
NDFS 74 ± 3 52 ± 5 41 ± 3 80 ± 4 24 ± 4 63 ± 4 28 ± 2 13 ± 3 48 ± 5 15 ± 4
AEFS 75 ± 3 52 ± 4 46 ± 3 86 ± 4 24 ± 4 66 ± 5 26 ± 3 12 ± 3 46 ± 5 14 ± 3
UFGOR 73 ± 4 53 ± 4 45 ± 6 83 ± 4 21 ± 3 67 ± 3 21 ± 3 11 ± 3 42 ± 2 12 ± 2
NMF 73 ± 5 52 ± 3 43 ± 3 81 ± 4 20 ± 3 65 ± 2 14 ± 2 13 ± 2 39 ± 2 12 ± 3
VCSDFS 74 ± 4 51 ± 5 46 ± 4 82 ± 5 22 ± 3 65 ± 2 20 ± 2 13 ± 2 44 ± 2 14 ± 3
RNMF 73 ± 3 52 ± 3 43 ± 3 81 ± 4 22 ± 4 66 ± 4 16 ± 2 13 ± 3 41 ± 4 13 ± 4
Proposed 76 ± 4 53 ± 5 48 ± 5 87 ± 6 26 ± 5 71 ± 6 29 ± 3 13 ± 3 49 ± 5 19 ± 3

4.3. Compared methods

We compare our model with the following seven state-of-the-art and recent methods:

- DRNMF (Saberi-Movahed, Rostami, et al., 2022): Dual regularized nonnegative matrix factorization (DRNMF) employs inner-product sparsity regularization on both the feature weight matrix and the representation coefficient matrix.
- NDFS (Moslemi, Bidar, & Ahmadian, 2023): Nonnegative discriminative feature selection (NDFS) is a technique based on NMF
feature selection using 𝑙2,1 -norm constraint. NDFS has structure learning regularization as well to preserve the geometrical
information of data.
- AEFS (Han et al., 2018): Auto-encoder feature selector (AEFS) obtains important features using auto-encoder regression with
group Lasso regularization. This technique embraces the linear and nonlinear information of features. Nonlinear information
is represented by auto-encoder and error minimization of reconstruction under group Lasso regularization is used to rank the
features.
- UFGOR (Jahani, Aghamollaei, Eftekhari, & Saberi-Movahed, 2023): Unsupervised Feature selection Guided by Orthogonal Representation (UFGOR) is a feature selection technique which integrates QR decomposition into the NMF objective function to obtain the top features. Dual correlation regularizations for both samples and features are considered to minimize redundancy among samples and features, respectively.
- NMF (Wang et al., 2015): Nonnegative matrix factorization (NMF) with orthogonality constraint can be applied for unsuper-
vised feature selection.
- VCSDFS6 (Karami, Saberi-Movahed, Tiwari, Marttinen, & Vahdati, 2023): In this technique Variance–Covariance distance with
inner-product row sparsity regularization was employed to obtain the most discriminative features.
- RNMF (Wang et al., 2020): Robust nonnegative matrix factorization (RNMF) for unsupervised feature selection, which employs structure learning to preserve geometrical information.
In this research, the baseline is the case in which all features are included.

4.4. Result analysis

In this section, we test the proposed algorithm on 10 different datasets and compare it with the other feature selection techniques. All techniques were evaluated using two metrics, accuracy (ACC) and normalized mutual information (NMI), both of which are computed by comparing the true labels with the predicted cluster labels.
In terms of ACC, the proposed algorithm obtained the best performance for ORL, Yale, warpPIE10P, Jaffe, Colon, Prostate and
Leukemia. For Tox-171, Lymphoma and GLI-85, DRNMF obtained the best performance. For Tox-171, the proposed method obtained

6 https://github.com/SaeedKarami/VCSDFS.


Fig. 2. Clustering accuracy (ACC) results of proposed method versus other comparative feature selection techniques. The number of features was varied in range
{50, 100, 150, 200, 250, 300}.

the second-best performance. For Lymphoma, the proposed method and UFGOR jointly obtained the second-best performance with ACC = 57%, which is a 1% difference from DRNMF. For GLI-85, the proposed method, together with UFGOR, NMF and VCSDFS, obtained the second-best performance. The complete ACC results for all techniques are shown in Table 3. Furthermore, we calculated the average ACC over all datasets: the proposed method obtained the best performance with ACC_average = 59%, whereas DRNMF obtained the second-best performance with ACC_average = 57.6% (ACC_average is the mean of the ACC values over the ten datasets).
In terms of the NMI metric, the proposed method obtained the best performance for the ORL, warpPIE10P, Jaffe, Lymphoma, Colon, GLI-85 and Leukemia datasets, and the second-best for the Yale, TOX-171 and Prostate datasets, for which DRNMF obtained the best performance. The complete NMI results for all techniques are shown in Table 4. In terms of NMI_average, the proposed method obtained NMI_average = 47.1, which was the best performance; the second best was obtained by DRNMF (NMI_average = 45.8).
Figs. 2 and 3 show the variation of ACC and NMI with the number of selected features for the proposed technique and the other comparative techniques.


Fig. 3. Normalized mutual information (NMI) results of proposed method versus other comparative feature selection techniques. The number of features was
varied in range {50, 100, 150, 200, 250, 300}.

4.5. Parameter sensitivity analysis

In the proposed algorithm, there are four regularization hyperparameters: 𝛼, 𝛽, 𝛾 and 𝜔. Parameter 𝛼 controls the sparsity of 𝑊^{(1)}. Parameter 𝛽 is the coefficient of the weight regularization, which controls overfitting and balances the reconstruction error. Parameters 𝛾 and 𝜔 control the effect of the structure learning regularizations for the projected samples and the reconstructed samples, respectively. We applied a grid search to tune each of these regularization coefficients. To investigate the effect of changes in these coefficients, we varied one parameter while keeping the others fixed, and conducted this experiment on all 10 datasets to observe the correlation between the variation of these hyperparameters and clustering accuracy. Fig. 4 shows the variation of ACC with respect to 𝛼 and the number of selected features, and Fig. 5 shows the variation of ACC with respect to 𝛽 and the number of selected features.
Parameters 𝛾 and 𝜔 control the impact of the structure learning regularization for the input data and the reconstructed data, respectively; Fig. 6 shows the variation of clustering accuracy (ACC) with respect to these two parameters. Based on the obtained results, 𝛾 ∈ [0.001, 0.01] and 𝜔 ∈ [0.001, 0.01] were the best ranges to achieve the maximum ACC.


Fig. 4. ACC results of proposed method based on variation in parameter 𝛼 (Sparsity controlling).

Fig. 5. ACC results of proposed method based on variation in parameter 𝛽 (𝐿2,1 -norm regularization coefficient to suppress the overfitting).


Fig. 6. ACC results of the proposed method based on variation in parameters 𝛾 and 𝜔.

Table 5
ACC results of proposed method with structure learning regularization of reconstructed data and without.
ORL Yale warpPIE10P Jaffe TOX-171 Lymphoma Colon prostate GLI-85 Leukemia
With Regularized 𝐻 76 ± 4 53 ± 5 48 ± 5 87 ± 6 26 ± 5 71 ± 6 29 ± 3 13 ± 3 49 ± 5 19 ± 3
Without Regularized 𝐻 74 ± 5 52 ± 4 44 ± 4 83 ± 5 25 ± 5 69 ± 6 28 ± 3 11 ± 3 46 ± 6 16 ± 3

4.6. Robustness against the noise

To investigate the robustness of the feature selection techniques against noise contamination, we conducted an experiment on the ORL image dataset: the images were contaminated with Gaussian noise with variances 0.001, 0.002 and 0.005. Fig. 7 shows three samples of ORL with no noise and with noise variances 0.001, 0.002 and 0.005, and Fig. 8 shows the performance of the proposed method and the comparative techniques on the noisy ORL dataset.
Based on Fig. 8, UFGOR, NMF and VCSDFS were significantly affected by noise, whereas the proposed method obtained the best performance in the presence of noise, which shows its robustness against noise.
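A small sketch of this contamination step follows (Python/NumPy; our own assumption is that pixel values are scaled to [0, 1] before zero-mean Gaussian noise with the stated variances is added and the result is clipped back to the valid range).

import numpy as np

def add_gaussian_noise(images, variance, seed=0):
    rng = np.random.default_rng(seed)
    noisy = images + rng.normal(0.0, np.sqrt(variance), images.shape)
    return np.clip(noisy, 0.0, 1.0)   # keep pixel values in [0, 1]

# the three contamination levels used for ORL, e.g.
# noisy = {v: add_gaussian_noise(orl_images, v) for v in (0.001, 0.002, 0.005)}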

4.7. Ablation study

In this section, we compare our method with and without the structure learning regularization term on 𝐻, i.e. $\mathrm{Tr}\big((W^{(2)})^\top Z H H^\top Z^\top W^{(2)}\big)$, in order to observe the impact of this regularization function. The results are reported in Table 5. Since adding this regularization function for the reconstructed data is the innovation of our study, these results demonstrate the effectiveness of the approach: Table 5 shows that the proposed method with regularization on the reconstructed data outperforms the proposed method without it. The difference between with and without regularization is around 1% for the Yale, TOX-171 and Colon datasets, and the highest difference, 4%, is found for the warpPIE10P dataset.

4.8. Convergence analysis

We proved the convergence of Algorithm 1 in Section 3.3. In this section, we show numerical convergence results for the proposed algorithm: the objective function was computed at each iteration for all 10 datasets, and the results are shown in Fig. 9, from which the convergence of Algorithm 1 on all 10 datasets can be clearly observed.


Fig. 7. This figure shows the three samples of ORL dataset. From left to right: No noise, Gaussian noise with 0.001 variance, Gaussian noise with 0.002 variance
and Gaussian noise with 0.005 variance.

5. Discussion and conclusion

5.1. Discussion

Auto-encoders can be employed for machine vision, natural language processing, complex networks, recommender systems, speech processing, anomaly detection, information fusion and dimensionality reduction (Berahmand, Daneshfar, Salehi, Li, & Xu, 2024; Moslemi, Safakish, et al., 2023). The auto-encoder is a powerful tool for feature selection as well. In this investigation, we developed a robust auto-encoder to obtain the most discriminative features, proposing a new idea to preserve both the geometry of the original samples and the geometry of the reconstructed samples. Additionally, we applied the 𝑙2,1-norm to sparsify the solution and a Frobenius-norm regularization term to control overfitting. Among the comparative techniques, NMF obtained the worst performance. First of all, NMF only considers the orthogonality constraint to make the feature weight matrix an indicator matrix; however, Saberi-Movahed, Eftekhari, and Mohtashami (2020) showed that the orthogonality constraint alone is not sufficient to generate an indicator matrix (for instance, the feature weight matrix $W = \begin{bmatrix} \frac{1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{2}} \\ 0 & 1 & 0 \end{bmatrix}^{\top}$ is orthogonal but not an indicator matrix). Additionally, NMF feature selection does not have a structure learning regularization, which means the geometrical information of the data cannot be preserved. AEFS is a feature selection technique based on an auto-encoder without any structure learning regularization; therefore, AEFS is directly affected by the loss of the topological information of the samples. Furthermore, AEFS fails to capture the cluster structure of the data in the representation space, which is why Ling et al. (2023) proposed DRAEFS, applying 𝑘-means clustering on the representation space to preserve the cluster structure. In our technique, we preserve the representation space by applying structure learning to both the original data and the reconstructed data; additionally, DRAEFS has higher complexity than our method since the 𝑘-means parameters must be updated in each iteration. Both NDFS and RNMF improve the NMF technique by adding sparse regularization and structure learning regularization: NDFS adds an 𝑙2,1-type norm as sparse regularization and a Laplacian graph as structure learning regularization. The main challenge in NDFS is this sparse norm, which is not convex but Lipschitz continuous [ref]. Although it induces more sparsity than the 𝑙2,1-norm and, in contrast to the 𝑙2,𝑝-norm (0 < 𝑝 < 1), is Lipschitz continuous, it is not convex. To tackle this challenge, the ConCave–Convex Procedure (CCCP) was applied to convert it into a convex surrogate.

Fig. 8. The variation of ACC by changing the number of selected features in the presence of noise for the ORL dataset. (a) ACC of ORL without noise, (b) ACC of noisy ORL with variance 0.002, (c) ACC of noisy ORL with variance 0.001, and (d) ACC of noisy ORL with variance 0.005.

However, CCCP increases computation time significantly. UFGOR and VCSDFS do not have structure learning regularization, which leads to a loss of topological information of the data and impacts the performance of feature selection considerably. Although DNMF showed that regularization must be considered for both the feature weight matrix and the representation matrix in NMF feature selection, it did not consider structure learning regularization. Additionally, DNMF employed the inner-product norm as sparse regularization, and this norm is formed by the subtraction of two trace functions ($\sum_{i,j=1,\, i\neq j}^{d} \langle w_i, w_j\rangle = \mathrm{Tr}(\mathbf{1}_{d\times d} W W^\top) - \mathrm{Tr}(W W^\top)$). The trace function is convex, but the difference of two convex functions is not convex in general; as a result, the inner-product norm is not convex, and this can be a serious concern. Compared with other auto-encoder based techniques, although NMF-AE is inspired by the auto-encoder, it is not actually an auto-encoder technique since there is no reconstructed data. Also, NMF-AE applies the orthogonality constraint to the representation matrix, whereas NMF proposed the orthogonality constraint on the feature weight matrix in order to obtain an indicator matrix. The feature selection technique based on the auto-encoder reconstruction error is very similar to sequential feature selection (SFS), and SFS-based techniques suffer from high computation time. The Laplacian graph is an efficient approach to preserve the geometrical information of samples; although Hessian regularization has a more extensive null space than the graph Laplacian (Kim, Steinke, & Hein, 2009), empirical results support the effectiveness of the Laplacian graph matrix, and the construction of the Hessian matrix is considerably more complicated and time-consuming.
Although existing auto-encoder techniques extract nonlinear information among features, there is an important gap in these methods, namely the absence of consideration of the geometrical structure of the dataset; in these methods, the most discriminative features are significantly affected by the decoding part. The novelty of our work is the joint consideration of the nonlinear information of features and the geometrical structure of the data. More precisely, the geometrical structure of the input data is preserved in the projected space, and the geometry of the projected space is preserved in the reconstructed space by using suitable terms, including an abstract Laplacian matrix of the graph representation of the data.


Fig. 9. Convergence behavior of Algorithm 1 for all 10 datasets.

5.2. Conclusion

In this study, we proposed a new and robust feature selection method based on an auto-encoder. We proposed a new formulation to preserve the geometrical information of both the original data and the reconstructed data: a Laplacian graph matrix is used to preserve the geometrical information of the original data, and a new variable is introduced which is forced to be close to the Laplacian graph matrix of the original data. We tested the proposed technique on 10 datasets and compared it with seven state-of-the-art and recent feature selection techniques. The results prove the effectiveness of the proposed method in selecting the most informative features. For future work, adaptive graph learning can be considered to suppress the effects of noise and outliers.

CRediT authorship contribution statement

Amir Moslemi: Writing – review & editing, Writing – original draft, Visualization, Validation, Supervision, Software, Resources,
Project administration, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. Mina Jamshidi: Writing –
review & editing, Writing – original draft, Visualization, Validation, Supervision, Resources, Project administration, Methodology,
Investigation, Formal analysis, Data curation, Conceptualization.

Acknowledgment

The research has been supported by the Graduate University of Advanced Technology (Kerman, Iran) under grant number 01/3588.

Appendix A. Supplementary data

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.ipm.2024.103923. Evaluation
metrics for comparison and datasets are described in the supplementary file.

Data availability

Data will be made available on request.


References

Ahmad, T., & Zhang, H. (2020). Novel deep supervised ML models with feature selection approach for large-scale utilities and buildings short and medium-term
load requirement forecasts. Energy, 209, Article 118477.
Berahmand, K., Daneshfar, F., Salehi, E. S., Li, Y., & Xu, Y. (2024). Auto-encoders and their applications in machine learning: a survey. Artificial Intelligence
Review, 57(2), 28.
Cai, D., Zhang, C., & He, X. (2010). Unsupervised feature selection for multi-cluster data. In Proceedings of the 16th ACM SIGKDD international conference on
knowledge discovery and data mining (pp. 333–342).
Doquet, G., & Sebag, M. (2020). Agnostic feature selection. In Machine learning and knowledge discovery in databases: European conference, proceedings, part I (pp.
343–358). Springer.
Feng, S., & Duarte, M. F. (2018). Graph auto-encoder based unsupervised feature selection with broad and local data structure preservation. Neurocomputing,
312, 310–323.
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., et al. (1999). Molecular classification of cancer: class discovery and class
prediction by gene expression monitoring. Science, 286(5439), 531–537.
Gong, X., Yu, L., Wang, J., Zhang, K., Bai, X., & Pal, N. R. (2022). Unsupervised feature selection via adaptive auto-encoder with redundancy control. Neural
Networks, 150, 87–101.
Gui, J., Sun, Z., Ji, S., Tao, D., & Tan, T. (2016). Feature selection based on structured sparsity: A comprehensive study. IEEE Transactions on Neural Networks
and Learning Systems, 28(7), 1490–1507.
Han, K., Wang, Y., Zhang, C., Li, C., & Xu, C. (2018). Auto-encoder inspired unsupervised feature selection. In IEEE international conference on acoustics, speech
and signal processing (pp. 2941–2945). IEEE.
He, X., Cai, D., & Niyogi, P. (2005). Laplacian score for feature selection. Advances in Neural Information Processing Systems, 18, 507–514.
Hinton, G. E., & Zemel, R. (1993). Auto-encoders, minimum description length and Helmholtz free energy. Advances in Neural Information Processing Systems, 6,
3–10.
Huang, P., Kong, Z., Wang, L., Han, X., & Yang, X. (2024). Efficient and stable unsupervised feature selection based on novel structured graph and data
discrepancy learning. IEEE Transactions on Neural Networks and Learning Systems, Article 38619959.
Jahani, M. S., Aghamollaei, G., Eftekhari, M., & Saberi-Movahed, F. (2023). Unsupervised feature selection guided by orthogonal representation of feature space.
Neurocomputing, 516, 61–76.
Karami, S., Saberi-Movahed, F., Tiwari, P., Marttinen, P., & Vahdati, S. (2023). Unsupervised feature selection based on variance–covariance subspace distance.
Neural Networks, 166, 188–203.
Kim, K., Steinke, F., & Hein, M. (2009). Semi-supervised regression using hessian energy with an application to semi-supervised dimensionality reduction. Advances
in Neural Information Processing Systems, 22, 979–987.
Lee, M.-C. (2009). Using support vector machine with a hybrid feature selection method to the stock trend prediction. Expert Systems with Applications, 36(8),
10896–10904.
Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R. P., Tang, J., et al. (2017). Feature selection: A data perspective. ACM Computing Surveys (CSUR), 50(6),
1–45.
Li, Y., Hu, L., & Gao, W. (2024). Multi-label feature selection with high-sparse personalized and low-redundancy shared common features. Information Processing
& Management, 61(3), Article 103633.
Li, J., Tang, J., & Liu, H. (2017). Reconstruction-based unsupervised feature selection: An embedded approach. In IJCAI (pp. 2159–2165).
Li, K., Wang, F., Yang, L., & Liu, R. (2023). Deep feature screening: Feature selection for ultra high-dimensional data via deep neural networks. Neurocomputing,
538, Article 126186.
Li, Z., Yang, Y., Liu, J., Zhou, X., & Lu, H. (2012). Unsupervised feature selection using nonnegative spectral analysis. In Proceedings of the AAAI conference on
artificial intelligence, vol. 26, no. 1 (pp. 1026–1032).
Liao, H., Chen, H., Yin, T., Horng, S.-J., & Li, T. (2024). Adaptive orthogonal semi-supervised feature selection with reliable label matrix learning. Information
Processing & Management, 61(4), Article 103727.
Ling, Y., Nie, F., Yu, W., & Li, X. (2023). Discriminative and robust autoencoders for unsupervised feature selection. IEEE Transactions on Neural Networks and
Learning Systems, Article 38090873.
Liu, H., & Motoda, H. (2012). Feature selection for knowledge discovery and data mining: vol. 454, Springer science & business media.
Moslemi, A. (2023). A tutorial-based survey on feature selection: Recent advancements on feature selection. Engineering Applications of Artificial Intelligence, 126,
Article 107136.
Moslemi, A., & Ahmadian, A. (2023). Dual regularized subspace learning using adaptive graph learning and rank constraint: Unsupervised feature selection on
gene expression microarray datasets. Computers in Biology and Medicine, 167, Article 107659.
Moslemi, A., Bidar, M., & Ahmadian, A. (2023). Subspace learning using structure learning and non-convex regularization: Hybrid technique with mushroom
reproduction optimization in gene selection. Computers in Biology and Medicine, 164, Article 107309.
Moslemi, A., Kontogianni, K., Brock, J., Wood, S., Herth, F., & Kirby, M. (2022). Differentiating COPD and asthma using quantitative CT imaging and machine
learning. European Respiratory Journal, 60(3), Article 2103078.
Moslemi, A., Makimoto, K., Tan, W. C., Bourbeau, J., Hogg, J. C., Coxson, H. O., et al. (2023). Quantitative CT lung imaging and machine learning improves
prediction of emergency room visits and hospitalizations in COPD. Academic Radiology, 30(4), 707–716.
Moslemi, A., Safakish, A., Sannchi, L., Alberico, D., Halstead, S., & Czarnota, G. (2023). Predicting head and neck cancer treatment outcomes using textural
feature level fusion of quantitative ultrasound spectroscopic and computed tomography: A machine learning approach. In 2023 IEEE international ultrasonics
symposium (pp. 1–4). IEEE.
Mozafari, M., Seyedi, S. A., Mohammadiani, R. P., & Tab, F. A. (2024). Unsupervised feature selection using orthogonal encoder-decoder factorization. Information
Sciences, Article 120277.
Nie, F., Huang, H., Cai, X., & Ding, C. (2010). Efficient and robust feature selection via joint 𝑙2,1 -norms minimization. Advances in Neural Information Processing
Systems, 23, 1813–1821.
Nie, F., Zhu, W., & Li, X. (2016). Unsupervised feature selection with structured graph optimization. In Proceedings of the AAAI conference on artificial intelligence,
vol. 30, no. 1 (pp. 1302–1308).
Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 27(8), 1226–1238.
Saberi-Movahed, F., Eftekhari, M., & Mohtashami, M. (2020). Supervised feature selection by constituting a basis for the original space of features and matrix
factorization. International Journal of Machine Learning and Cybernetics, 11, 1405–1421.
Saberi-Movahed, F., Mohammadifard, M., Mehrpooya, A., Rezaei-Ravari, M., Berahmand, K., Rostami, M., et al. (2022). Decoding clinical biomarker space of
COVID-19: Exploring matrix factorization-based feature selection methods. Computers in Biology and Medicine, 146, Article 105426.
Saberi-Movahed, F., Rostami, M., Berahmand, K., Karami, S., Tiwari, P., Oussalah, M., et al. (2022). Dual regularized unsupervised feature selection based on
matrix factorization and minimum redundancy with application in gene selection. Knowledge-Based Systems, 256, Article 109884.

17
A. Moslemi and M. Jamshidi Information Processing and Management 62 (2025) 103923

Salcedo-Sanz, S., Cornejo-Bueno, L., Prieto, L., Paredes, D., & García-Herrera, R. (2018). Feature selection in machine learning prediction systems for renewable
energy applications. Renewable and Sustainable Energy Reviews, 90, 728–741.
Samareh-Jahani, M., Saberi-Movahed, F., Eftekhari, M., Aghamollaei, G., & Tiwari, P. (2024). Low-redundant unsupervised feature selection based on data
structure learning and feature orthogonalization. Expert Systems with Applications, 240, Article 122556.
Sharifipour, S., Fayyazi, H., Sabokrou, M., & Adeli, E. (2019). Unsupervised feature ranking and selection based on autoencoders. In IEEE international conference
on acoustics, speech and signal processing (pp. 3172–3176). IEEE.
Shi, C., Ruan, Q., & An, G. (2014). Sparse feature selection based on graph Laplacian for web image annotation. Image and Vision Computing, 32(3), 189–201.
Wang, S., Chen, J., Guo, W., & Liu, G. (2020). Structured learning for unsupervised feature selection with high-order matrix factorization. Expert Systems with
Applications, 140, Article 112878.
Wang, S., Ding, Z., & Fu, Y. (2017). Feature selection guided auto-encoder. In Proceedings of the AAAI conference on artificial intelligence, vol. 31, no. 1 (pp.
2725–2731).
Wang, S., Pedrycz, W., Zhu, Q., & Zhu, W. (2015). Subspace learning for unsupervised feature selection via matrix factorization. Pattern Recognition, 48(1), 10–19.
Wu, X., & Cheng, Q. (2021). Fractal autoencoders for feature selection. In Proceedings of the AAAI conference on artificial intelligence, vol. 35, no. 12 (pp.
10370–10378).
Yi, Y., Zhang, H., Zhang, N., Zhou, W., Huang, X., Xie, G., et al. (2024). SFS-AGGL: Semi-supervised feature selection integrating adaptive graph with global
and local information. Information, 15(1), 57.
Zhang, Y., Lu, Z., & Wang, S. (2021). Unsupervised feature selection via transformed auto-encoder. Knowledge-Based Systems, 215, Article 106748.
Zhu, P., Zuo, W., Zhang, L., Hu, Q., & Shiu, S. C. (2015). Unsupervised feature selection by regularized self-representation. Pattern Recognition, 48(2), 438–446.

18

You might also like