FMLFS: A federated multi-label feature selection based on information theory in IoT environment

Afsaneh Mahanipour
Department of Computer Science
University of Kentucky
Lexington, KY, USA
ama654@uky.edu

Hana Khamfroush
Department of Computer Science
University of Kentucky
Lexington, KY, USA
khamfroush@cs.uky.edu

Abstract

In emerging applications such as wearable health monitoring and traffic monitoring systems, Internet-of-Things (IoT) devices generate or collect large volumes of multi-label data, in which each instance is linked to a set of labels. The presence of noisy, redundant, or irrelevant features in these datasets, along with the curse of dimensionality, poses challenges for multi-label classifiers. Feature selection (FS) is an effective strategy for enhancing classifier performance and addressing these challenges. Yet, no distributed multi-label FS method documented in the literature is suitable for distributed multi-label datasets within IoT environments. This paper introduces FMLFS, the first federated multi-label feature selection method. Here, mutual information between features and labels serves as the relevancy metric, while the correlation distance between features, derived from mutual information and joint entropy, is utilized as the redundancy measure. After these metrics are aggregated on the edge server and Pareto-based bi-objective and crowding distance strategies are applied, the sorted features are sent back to the IoT devices. The proposed method is evaluated through two scenarios: 1) transmitting reduced-size datasets to the edge server for centralized classifier usage, and 2) employing federated learning with reduced-size datasets. Evaluation across three metrics - performance, time complexity, and communication cost - demonstrates that FMLFS outperforms five other comparable methods in the literature and provides a good trade-off on three real-world datasets.

Index Terms:

Bi-objective optimization, Crowding distance, Federated feature selection, Multi-label data, Pareto dominance

I Introduction

With the development of emerging science and technologies such as Internet-of-Things (IoT), smart healthcare, and intelligent transportation, we are entering the era of big data, where a huge amount of data is generated daily. In many cases, the collected data contain irrelevant, noisy, or redundant features. Such features not only increase the complexity and execution time of learning models but also significantly degrade their performance [1].

Data pre-processing methods can be used to tackle these issues effectively. Among these methods, feature selection (FS) techniques select relevant and informative features from the original ones without changing them, unlike other dimensionality reduction methods such as principal component analysis. The FS procedure reduces data dimensions, lowers computational costs and storage requirements, and improves learning model performance [2]. In some real-world applications across different domains such as text categorization, image recognition, and gene prediction, each instance is associated with a set of labels. Such datasets are known as multi-label data. Just like single-label data, multi-label data also face the challenge of high dimensionality. Consequently, effective multi-label feature selection approaches become crucial in addressing this dimensionality problem [3].

On the other hand, the processing of these vast amounts of collected data by intelligent devices, particularly within IoT networks, is essential for extracting meaningful insights about the environment. Traditionally, these datasets were transmitted to cloud servers. However, due to the need for real-time responses and concerns about privacy, many IoT applications now restrict the transfer of data to the cloud. Instead, the data needs to be processed either locally or at the edge to address these requirements [4]. If all non-pre-processed local datasets are transferred to an edge server, it will result in increased communication costs between end-user devices/clients and the edge server, leading to delays in processing. Moreover, conducting local processing on non-pre-processed local datasets using distributed machine learning models like federated learning (FL) algorithm would increase the complexity and execution time of these models.

Therefore, to select informative features from distributed multi-label datasets on clients, a collaborative multi-label feature selection method is needed. Several centralized multi-label FS approaches exist in the literature, but they encounter challenges in these environments. For instance, if each client’s data is independently fed to a feature selection process, feature selection bias may lead to inaccurate and non-robust results. Therefore, a distributed FS procedure such as federated feature selection (FFS) should be applied to multi-label FS. To the best of our knowledge, this is the first work that investigates a federated multi-label FS method. In this study, we compute the mutual information between features and class labels, as well as the correlation distance between features, obtained by subtracting mutual information from joint entropy between features in each client. The proposed method evaluates both the relevance between features and class labels and the redundancy among features based on these two metrics. Next, these values are transmitted to the edge server for aggregation, where they serve as two objectives in a Pareto-based bi-objective strategy. The features are then ranked based on the combination of their Pareto front number and their crowding distance, and the ranked features are returned to each client. Consequently, by using a smaller number of features and reducing the data size, the performance of the learning algorithm can be enhanced. This leads to accelerated machine learning (ML) models and reduced complexity. Moreover, it aids in minimizing communication costs when transmitting datasets to edge servers. The main novelties of the proposed method can be summarized as follows:

•

Proposing the first federated multi-label feature selection method
•

Employing information theory-based concepts as two objectives within a Pareto-based bi-objective strategy
•

Utilizing a federated learning algorithm as a multi-label classifier for the first time
•

Comparing the proposed method with five other centralized multi-label FS methods on three datasets from three application domains

II RELATED WORKS

Previous studies mostly focused on centralized multi-label feature selection. Limited research has explored federated feature selection for single-label datasets, and none has investigated federated multi-label feature selection. In this section, we discuss previous works in detail.

II-A Centralized Multi-label Feature Selection

Multi-label feature selection methods can be divided into two groups based on how multi-label data is handled: problem transformation and algorithm adaptation. In problem transformation methods, the first step involves converting the multi-label dataset into a single-label one. Then, any state-of-the-art single-label FS method can be employed to select features effectively [5, 6, 7]. Binary relevance (BR) [8], label powerset (LP) [9], pruned problem transformation (PPT) [10], and entropy-based label assignment (ELA) [11] are a number of problem transformation methods that convert multi-label data into single-label format. For instance, in [5], the data is transformed using BR and LP techniques, followed by the utilization of Information Gain (IG) and ReliefF for feature selection. Similarly, Doquire et al. [6] use PPT for problem transformation, and then employ a greedy forward feature selection method based on mutual information. However, these methods still have some drawbacks; for instance, BR does not take label correlations into account.

Algorithm adaptation approaches extend FS methods to directly handle multi-label datasets and effectively address those drawbacks [12]. Until now, several algorithm adaptation methods have been proposed. For example, Lee et al. [13] propose a FS method based on mutual information (D2F), which incorporates interaction information and utilizes conditional mutual information for assessing feature relevance. Additionally, Pairwise Multi-label Utility (PMU) method [14] is another approach that leverages the mutual information between a candidate feature and the label set, serving as a term for feature relevance. In [15], a novel multi-label FS method named SCLS is proposed, which employs scalable feature relevance assessment to evaluate the relevance of candidate features.

II-B Federated Feature Selection for Single-label Datasets

A small number of federated feature selection (FFS) methods exist for single-label datasets in the literature [16, 17]. These methods are inspired by the federated learning (FL) procedure and can be classified into two groups: vertical FFS [18] and horizontal FFS [19]. In vertical FFS, clients’ datasets contain instances with identical IDs but different feature sets. Conversely, in horizontal FFS, clients hold distinct instances that share an identical feature set. This paper presents the first horizontal FFS method designed for multi-label datasets.

III Preliminaries

III-A Pareto-based solutions

In multi-objective optimization problems (MOPs), unlike single-objective ones, conflicts between objectives result in a set of optimal solutions known as the Pareto optimal set, rather than a single optimal solution [20]. We assume maximizing optimization while retaining generality for the concepts of Pareto optimality. The general MOP formula is as follows:

\begin{cases}\max\;O(g)=[o_{1}(g),o_{2}(g),\cdots,o_{w}(g)],\\ \text{s.t.}\;g\in\Omega\end{cases}

(1)

where $O(g)=[o_{1}(g),o_{2}(g),\cdots,o_{w}(g)]$ represents the objective vector, with $w\,(w\geq 2)$ denoting the number of objective functions, and $g=(g_{1},\cdots,g_{k})$ is the decision vector, where $k$ is the number of decision variables. In MOP, there exist fundamental concepts, such as:

•

Pareto dominance: For any two objective vectors $u=(u_{1},\cdots,u_{w})$ and $v=(v_{1},\cdots,v_{w})$ , $v$ is dominated by $u$ , denoted as $u\succ v$ , if and only if none of the elements in $v$ exceed the corresponding elements in $u$ , and at least one element in $u$ is strictly larger.

\forall\,i\in\{1,\cdots,w\}:u_{i}\geq v_{i}\ \wedge\ \exists\,i\in\{1,\cdots,w\}:u_{i}>v_{i}

(2)

•

Pareto optimal set: A solution $g\in\Omega$ that is not dominated by any other solution in $\Omega$ belongs to the Pareto optimal set.

\nexists\>g^{\prime}\in\Omega:\>g^{\prime}\succ g

(3)

•

Pareto optimal front: The Pareto optimal front is defined as the projection of the Pareto optimal set onto the objective space.
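The dominance relation in Eq. 2 and the layering of solutions into successive Pareto fronts can be illustrated with a short sketch. The function names `dominates` and `pareto_front_numbers` and the sample points are hypothetical and chosen only for illustration; the quadratic-time peeling loop is one simple way to obtain front numbers, not necessarily the procedure used by the authors.

```python
from typing import List, Sequence


def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """Return True if u dominates v under maximization (Eq. 2)."""
    no_worse = all(a >= b for a, b in zip(u, v))
    strictly_better = any(a > b for a, b in zip(u, v))
    return no_worse and strictly_better


def pareto_front_numbers(points: List[Sequence[float]]) -> List[int]:
    """Peel off non-dominated layers: front 1 first, then front 2, and so on."""
    remaining = set(range(len(points)))
    fronts = [0] * len(points)
    current = 1
    while remaining:
        # A point joins the current front if nothing still remaining dominates it.
        front = {
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        }
        for i in front:
            fronts[i] = current
        remaining -= front
        current += 1
    return fronts


# Hypothetical objective vectors (both objectives are maximized).
points = [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9), (0.4, 0.4), (0.1, 0.1)]
print(pareto_front_numbers(points))
```

In this example the first three points are mutually non-dominated and form the first front, while the remaining two fall into later fronts.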

III-B Crowding distance

The crowding distance of a solution is determined by the density of neighboring solutions around it. This metric is derived from the largest cube centered around the solution, excluding other solutions [21].
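The text does not spell out the crowding distance formula, so the following sketch assumes the commonly used NSGA-II style definition: within one front, boundary solutions along each objective receive an infinite distance, and each interior solution accumulates the normalized gap between its two neighbors. The function name `crowding_distance` and the sample front are hypothetical.

```python
import math
from typing import List, Sequence


def crowding_distance(points: List[Sequence[float]]) -> List[float]:
    """NSGA-II style crowding distance for the points of a single front."""
    n = len(points)
    if n == 0:
        return []
    dist = [0.0] * n
    for k in range(len(points[0])):
        # Sort point indices by the k-th objective.
        order = sorted(range(n), key=lambda i: points[i][k])
        lo, hi = points[order[0]][k], points[order[-1]][k]
        # Boundary points are always kept, hence an infinite distance.
        dist[order[0]] = dist[order[-1]] = math.inf
        if hi == lo:
            continue  # all values equal: this objective adds nothing
        for pos in range(1, n - 1):
            gap = points[order[pos + 1]][k] - points[order[pos - 1]][k]
            dist[order[pos]] += gap / (hi - lo)
    return dist


# Hypothetical front with three mutually non-dominated points.
front = [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9)]
print(crowding_distance(front))
```

A larger value means the solution lies in a sparser region of the objective space.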

IV Proposed Method

IV-A System Overview

We consider a two-tier setup in which horizontal FFS is executed. The first tier comprises various clients, including smart devices in contexts such as autonomous driving systems and healthcare systems, which gather multi-label data $\mathcal{U}$ . The second tier consists of an edge server $\mathcal{S}$ . Importantly, this approach can be scaled to handle a larger number of edge servers. We consider a collection of $M$ clients, denoted as $C_{m}$ with $m=1,\cdots,M$ , connected to the edge server. Here, $M$ must be at least 2, as having a single client would result in a centralized FS scenario. Each client’s dataset $\mathcal{U}\in\mathbb{R}^{N\times(D+L)}$ is represented as $\mathcal{U}=\{(x_{i},y_{i})\}_{i=1}^{N}$ , where $N$ is the number of instances. Each instance is associated with a $D$ -dimensional feature vector $x_{i}=(x_{i1},x_{i2},...,x_{iD})$ and an $L$ -dimensional label vector $y_{i}=(y_{i1},y_{i2},...,y_{iL})$ . The label vector is binary: $y_{il}$ equals 1 if the instance carries label $Y_{l}$ and 0 otherwise. The structure of the multi-label dataset at the clients is depicted in Fig. 1.

X				Y
$X_{1}$	$X_{2}$	$\cdots$	$X_{D}$	$Y_{1}$	$Y_{2}$	$\cdots$	$Y_{L}$
$x_{11}$	$x_{12}$	$\cdots$	$x_{1D}$	$y_{11}$	$y_{12}$	$\cdots$	$y_{1L}$
$x_{21}$	$x_{22}$	$\cdots$	$x_{2D}$	$y_{21}$	$y_{22}$	$\cdots$	$y_{2L}$
$\vdots$	$\vdots$	$\ddots$	$\vdots$	$\vdots$	$\vdots$	$\ddots$	$\vdots$
$x_{N1}$	$x_{N2}$	$\cdots$	$x_{ND}$	$y_{N1}$	$y_{N2}$	$\cdots$	$y_{NL}$

Figure 1: Multi-Label Data Structure

IV-B Proposed Algorithm

The proposed approach combines the federated learning procedure with principles from information theory to select informative features from multi-label datasets across various clients. We refer to this technique as FMLFS (Federated Multi-Label Feature Selection). Our goal is to identify the most relevant features while minimizing redundancy. Therefore, we treat the relevancy of features (their predictive power) and the redundancy among features as two objectives, converting the multi-label feature selection problem into a bi-objective optimization problem. Then, Pareto dominance and crowding distance concepts are employed to sort features. The FMLFS procedure comprises two phases: a local phase in the clients and a global phase in the edge server. The overview of a distributed environment with multi-label datasets is depicted in Fig. 2.

Figure 2: Overview of a distributed environment.

Local Phase: In the local phase, each client uses its local dataset to calculate the mutual information between features and labels, determining their degree of relevance. Additionally, each client computes the correlation distance of features as the degree of redundancy by subtracting mutual information from joint entropy. Then, both the calculated mutual information and correlation distance measures are sent to the edge server for further processing.

Information entropy measures the uncertainty of random variables [22]. The joint entropy between two features, such as $X_{a}=(x_{1a},x_{2a},...,x_{Na})$ and $X_{b}=(x_{1b},x_{2b},...,x_{Nb})$ , is calculated as follows:

H(X_{a},X_{b})=H(X_{b})+H(X_{a}|X_{b})

(4)

where $H(X_{b})$ represents the information entropy of the feature $X_{b}$ , and $H(X_{a}|X_{b})$ denotes the conditional entropy of $X_{a}$ given $X_{b}$ . For a feature $X_{a}$ , these quantities are defined as follows:

H(X_{a})=-\sum_{i=1}^{N}p(x_{a}^{i})log_{2}p(x_{a}^{i})

(5)

H(X_{a}|X_{b})=-\sum_{i=1}^{N}\sum_{j=1}^{N}p(x_{a}^{i},x_{b}^{j})\log_{2}p(x_{a}^{i}|x_{b}^{j})

(6)

$p(x_{a}^{i})$ represents the probability of the $i$ -th value of feature $X_{a}$ , while $p(x_{a}^{i},x_{b}^{j})$ denotes the joint probability of the $i$ -th value of feature $X_{a}$ and the $j$ -th value of feature $X_{b}$ . Therefore, the joint entropy can be represented as follows:

H(X_{a},X_{b})=-\sum_{i=1}^{N}\sum_{j=1}^{N}p(x_{a}^{i},x_{b}^{j})\log_{2}p(x_{a}^{i},x_{b}^{j})

(7)

Mutual information quantifies the reduction in uncertainty of one random variable when another random variable is known, representing the amount of shared information between the variables. Consider $X_{a}=(x_{1a},x_{2a},...,x_{Na})$ as a feature and $Y_{b}=(y_{1b},y_{2b},...,y_{Nb})$ as a label in the dataset. The mutual information between the feature and label is defined as follows:

I(X_{a};Y_{b})=H(X_{a})-H(X_{a}|Y_{b})=H(Y_{b})-H(Y_{b}|X_{a})

(8)

I(X_{a};Y_{b})=H(X_{a})+H(Y_{b})-H(Y_{b},X_{a})

(9)

Now, the correlation distance between two features can be defined as follows:

MH(X_{a},X_{b})=I(X_{a};X_{b})-H(X_{a},X_{b})

(10)
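The quantities in Eqs. 5 to 10 can be computed from empirical frequencies, as in the minimal sketch below. It assumes that features and labels take discrete values (continuous features would first need to be discretized, a step the text does not specify) and that probabilities are estimated by counting. The function names are hypothetical, and `correlation_distance` follows Eq. 10 exactly as written.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(x: Sequence[Hashable]) -> float:
    """Empirical entropy H(X) in bits (Eq. 5)."""
    n = len(x)
    return -sum((c / n) * math.log2(c / n) for c in Counter(x).values())


def joint_entropy(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """Empirical joint entropy H(X, Y): entropy of the paired values (Eq. 7)."""
    return entropy(list(zip(x, y)))


def mutual_information(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """I(X; Y) = H(X) + H(Y) - H(X, Y) (Eq. 9)."""
    return entropy(x) + entropy(y) - joint_entropy(x, y)


def correlation_distance(xa: Sequence[Hashable], xb: Sequence[Hashable]) -> float:
    """MH(Xa, Xb) = I(Xa; Xb) - H(Xa, Xb), as written in Eq. 10."""
    return mutual_information(xa, xb) - joint_entropy(xa, xb)


# Hypothetical discrete columns with N = 4 instances.
x_a = [0, 0, 1, 1]
x_b = [0, 1, 0, 1]   # independent of x_a
y_1 = [0, 0, 1, 1]   # identical to x_a

print(mutual_information(x_a, y_1))    # relevance of x_a to label y_1
print(correlation_distance(x_a, x_b))  # redundancy term between x_a and x_b
```

Each call scans all $N$ instances once, which matches the per-pair cost assumed in the time complexity analysis.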

Global Phase: On the edge server side, the computed mutual information (Eq. 11) and correlation distance (Eq. 12) matrices sent by the clients are aggregated to produce a global mutual information matrix and correlation distance matrix. These two properties are considered as two objectives in transforming the multi-label FS problem into a bi-objective optimization problem. Therefore, the optimal feature subset is defined as one that maximizes relevance while minimizing redundancy.

MI=\begin{bmatrix}I(X_{1},Y_{1})&I(X_{1},Y_{2})&\cdots&I(X_{1},Y_{L})\\ I(X_{2},Y_{1})&I(X_{2},Y_{2})&\cdots&I(X_{2},Y_{L})\\ \vdots&\vdots&\ddots&\vdots\\ I(X_{D},Y_{1})&I(X_{D},Y_{2})&\cdots&I(X_{D},Y_{L})\end{bmatrix}

(11)

MH=\begin{bmatrix}MH(X_{1},X_{1})&MH(X_{1},X_{2})&\cdots&MH(X_{1},X_{D})\\ MH(X_{2},X_{1})&MH(X_{2},X_{2})&\cdots&MH(X_{2},X_{D})\\ \vdots&\vdots&\ddots&\vdots\\ MH(X_{D},X_{1})&MH(X_{D},X_{2})&\cdots&MH(X_{D},X_{D})\end{bmatrix}

(12)

Therefore, in defining the objective functions, the first objective is identified as maximizing the mutual information between each feature and the set of labels ( $O_{1}$ ) [12]:

\begin{split}&MAX(i)=max(MI(i,:)),\quad i=1,2,...,D\\ &O_{1}=[MAX(1),MAX(2),...,MAX(D)]\end{split}

(13)

Next, the maximization of the correlation distance, defined as the difference between mutual information and joint entropy among features, is regarded as the second objective ( $O_{2}$ ) to measure their redundancy:

\begin{split}&A(i)=max(MH(i,:)),\quad i=1,2,...,D\\ &O_{2}=[A(1),A(2),...,A(D)]\end{split}

(14)
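The aggregation on the edge server and the row-wise maximum of Eq. 13 can be sketched as follows. The sketch assumes a plain element-wise average of the client matrices, which is how the aggregation step is written in the pseudocode of the edge server; the helper names and the numeric values are hypothetical. The same `row_max` helper would be applied to the aggregated correlation distance matrix to obtain $O_{2}$ as in Eq. 14.

```python
from typing import List

Matrix = List[List[float]]


def average_matrices(mats: List[Matrix]) -> Matrix:
    """Element-wise mean of equally shaped client matrices (aggregation step)."""
    m = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [
        [sum(mat[i][j] for mat in mats) / m for j in range(cols)]
        for i in range(rows)
    ]


def row_max(mat: Matrix) -> List[float]:
    """Row-wise maximum, i.e. MAX(i) = max(MI(i, :)) for every feature i."""
    return [max(row) for row in mat]


# Hypothetical MI matrices from M = 3 clients, D = 3 features, L = 2 labels.
client_mi = [
    [[0.30, 0.10], [0.05, 0.20], [0.00, 0.02]],
    [[0.20, 0.20], [0.10, 0.40], [0.01, 0.03]],
    [[0.40, 0.00], [0.15, 0.30], [0.02, 0.01]],
]

global_mi = average_matrices(client_mi)
o1 = row_max(global_mi)  # one relevance value per feature
print(o1)
```

Only the $D\times L$ and $D\times D$ matrices leave each client in this step, not the raw instances.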

Then, a non-dominated sorting strategy is conducted using the Pareto dominance concept with these two objectives. Initially, each feature is assigned a Pareto number. Subsequently, the crowding distance or density of other features around each feature is calculated to arrange the features within the same front and with identical Pareto numbers. The crowding distance of each feature is computed within the bi-objective space. Next, the combination of the Pareto front number and the crowding distance is used to assign a final score to each feature. This score is calculated as follows [12]:

S=P+\frac{1}{(1+d)}

(15)

where $P$ denotes the Pareto front number, and $d$ represents the crowding distance. A lower value of $S$ indicates a better feature, as a lower Pareto front number and a larger crowding distance are considered preferable. The features can now be arranged based on their $S$ values. The pseudocode of the FMLFS method is given in Algorithms 1 and 2. Also, an overview of the proposed method is shown in Fig. 3.
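As a small worked example of Eq. 15, the sketch below scores five features from hypothetical front numbers and crowding distances and sorts them in ascending order of $S$. The input values and function names are illustrative assumptions; an infinite crowding distance (a boundary solution) contributes $1/(1+d)=0$.

```python
import math
from typing import List


def final_scores(front_numbers: List[int], crowding: List[float]) -> List[float]:
    """S = P + 1 / (1 + d) for every feature (Eq. 15)."""
    return [p + 1.0 / (1.0 + d) for p, d in zip(front_numbers, crowding)]


def rank_features(scores: List[float]) -> List[int]:
    """Feature indices sorted by ascending S (best feature first)."""
    return sorted(range(len(scores)), key=lambda i: scores[i])


# Hypothetical Pareto front numbers and crowding distances for 5 features.
P = [1, 1, 1, 2, 3]
d = [math.inf, 2.0, math.inf, math.inf, math.inf]

S = final_scores(P, d)
order = rank_features(S)
print(S)
print(order)
```

Because $1/(1+d)$ lies in $[0,1]$, the front number determines the coarse ordering and the crowding distance only orders features within a front, with ties possible between a feature with $d=0$ and a boundary feature of the next front.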

Algorithm 1 Pseudocode of the edge server side

Input: $M$ (number of clients), clients’ $MI$ matrices, clients’ correlation distance matrices ( $MH$ )
Output: Ranking of features based on $S$
Executing Algorithm 2 in clients
# Aggregation step
$MI^{\prime}=(MI_{1}+MI_{2}+...+MI_{M})/M$
$MH^{\prime}=(MH_{1}+MH_{2}+...+MH_{M})/M$
# Calculation of objective functions
$O_{1}=\max(MI^{\prime},1)$
$O_{2}=\max(MH^{\prime},1)$
# Feature sorting
Performing the non-dominated sorting algorithm in the bi-objective domain
Assigning Pareto front number $P$ to the features
Calculating the crowding distance $d$ of features
$S=P+\frac{1}{(1+d)}$
Sorting features based on $S$ in ascending order

Algorithm 2 Pseudocode of the client side

Input: Local dataset of each client
Output: Mutual information matrix ( $MI$ ), and correlation distance matrix ( $MH$ )
# Calculating mutual information between features and labels
for a=1:D do
    for b=1:L do
        Compute $I(X_{a};Y_{b})$
    end for
end for
# Calculating correlation distance between features
for a=1:D do
    for b=1:D do
        $MH(X_{a},X_{b})=I(X_{a};X_{b})-H(X_{a},X_{b})$
    end for
end for
return the MI and MH matrices

V Experimental Results

In this section, we evaluate the proposed method using two scenarios. In the first scenario, clients rank features by FMLFS and select the desired number of features. Subsequently, the reduced-size datasets are transmitted to the edge server to be utilized in a centralized multi-label learning algorithm. In the second scenario, after employing FMLFS to rank features, a vanilla federated learning algorithm is utilized as a multi-label classifier.

V-A Datasets

The proposed method’s performance is evaluated against five similar methods in the literature using three real-world datasets from the Mulan repository (https://mulan.sourceforge.net/datasets.html). The datasets are selected from diverse domains (Biology, Image, and Audio), each varying in the number of instances, features, and labels. The characteristics of these datasets are presented in Table I.

TABLE I: Details of the multi-label datasets

Dataset	Instances	Features	Labels	Domain
Yeast	2417	103	14	Biology
Scene	2407	294	6	Image
Birds	645	260	19	Audio

V-B Evaluation Measures

Accuracy, F-measure, hamming loss, ranking loss, average precision and coverage are the metrics employed to evaluate the performance of FMLFS and other comparative methods. Let $\mathcal{T}=\{(x_{i},y_{i})\}_{i=1}^{n}$ denote a test set, where $y_{i}$ and $z_{i}$ represent the actual label set and the predicted label set for $x_{i}$ , respectively. The metrics are defined as follows [23]:

•

Accuracy: It represents the proportion of correctly predicted labels relative to all predicted and actual labels.

Accuracy=\frac{1}{n}\sum_{i=1}^{n}\frac{|y_{i}\cap z_{i}|}{|y_{i}\cup z_{i}|}

(16)

•

F-measure: It is a harmonic mean of precision and recall. It is a weighted measure indicating the number of relevant labels predicted and the proportion of predicted labels that are relevant.

\begin{split}&F\text{-}measure=2\times\frac{Precision\times Recall}{Precision+Recall}\\ &Precision=\frac{1}{n}\sum_{i=1}^{n}\frac{|y_{i}\cap z_{i}|}{|z_{i}|}\\ &Recall=\frac{1}{n}\sum_{i=1}^{n}\frac{|y_{i}\cap z_{i}|}{|y_{i}|}\end{split}

(17)

•

Hamming Loss (HL): It is calculated by determining the symmetric difference between the actual and predicted labels and then dividing it by the total number of labels. A smaller HL value indicates better performance.

HL=\frac{1}{n}\sum_{i=1}^{n}\frac{|y_{i}\triangle z_{i}|}{|L|}

(18)

•

Ranking Loss (RL): It calculates the frequency of relevant labels being ranked lower than non-relevant labels. Better performance is indicated by a smaller RL value.

RL=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{|y_{i}||\bar{y_{i}}|}\left|\{(\lambda_{a},\lambda_{b}):rank(\lambda_{a})>rank(\lambda_{b}),\ (\lambda_{a},\lambda_{b})\in y_{i}\times\bar{y_{i}}\}\right|

(19)

where $\bar{y_{i}}$ is the complement set of $y_{i}$ .

•

Avg-Precision: It measures the average fraction of relevant labels ranked above a specific label.

Avg\text{-}precision=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{|y_{i}|}\sum_{\lambda\in y_{i}}\frac{|\{\lambda^{\prime}\in y_{i}:rank(\lambda^{\prime})\leq rank(\lambda)\}|}{rank(\lambda)}

(20)

•

Coverage: It denotes the number of steps a learning algorithm requires to cover all the true labels of an instance. Better performance is indicated by a smaller coverage value.

Coverage=\frac{1}{n}\sum_{i=1}^{n}\max_{\lambda\in y_{i}}(rank(\lambda))-1

(21)
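The six measures in Eqs. 16 to 21 can be written directly from their definitions, as in the sketch below. It assumes that label sets are given as Python sets of label indices and that `ranks[i][l]` holds the rank of label `l` for instance `i`, with rank 1 being the label the classifier considers most relevant. Treating a term with an empty denominator as 0 and requiring at least one true label per instance for coverage are assumptions of this sketch, as are all names and sample values.

```python
from typing import List, Set


def _ratio(num: float, den: float) -> float:
    # Assumption: a term with an empty denominator contributes 0.
    return num / den if den else 0.0


def accuracy(y_true: List[Set[int]], y_pred: List[Set[int]]) -> float:
    return sum(_ratio(len(y & z), len(y | z)) for y, z in zip(y_true, y_pred)) / len(y_true)


def f_measure(y_true: List[Set[int]], y_pred: List[Set[int]]) -> float:
    n = len(y_true)
    precision = sum(_ratio(len(y & z), len(z)) for y, z in zip(y_true, y_pred)) / n
    recall = sum(_ratio(len(y & z), len(y)) for y, z in zip(y_true, y_pred)) / n
    return _ratio(2 * precision * recall, precision + recall)


def hamming_loss(y_true: List[Set[int]], y_pred: List[Set[int]], num_labels: int) -> float:
    # y ^ z is the symmetric difference between actual and predicted labels.
    return sum(len(y ^ z) / num_labels for y, z in zip(y_true, y_pred)) / len(y_true)


def ranking_loss(y_true: List[Set[int]], ranks: List[List[int]]) -> float:
    total = 0.0
    for y, r in zip(y_true, ranks):
        y_bar = set(range(len(r))) - y
        # Pairs where a relevant label is ranked below an irrelevant one.
        bad = sum(1 for a in y for b in y_bar if r[a] > r[b])
        total += _ratio(bad, len(y) * len(y_bar))
    return total / len(y_true)


def average_precision(y_true: List[Set[int]], ranks: List[List[int]]) -> float:
    total = 0.0
    for y, r in zip(y_true, ranks):
        inner = sum(sum(1 for l2 in y if r[l2] <= r[l]) / r[l] for l in y)
        total += _ratio(inner, len(y))
    return total / len(y_true)


def coverage(y_true: List[Set[int]], ranks: List[List[int]]) -> float:
    # Assumes every instance has at least one true label.
    return sum(max(r[l] for l in y) for y, r in zip(y_true, ranks)) / len(y_true) - 1


# Hypothetical test set with n = 2 instances and 4 labels (indices 0..3).
y_true = [{0, 1}, {2}]
y_pred = [{0}, {2, 3}]
ranks = [[1, 3, 2, 4], [2, 3, 1, 4]]

print(accuracy(y_true, y_pred), f_measure(y_true, y_pred), hamming_loss(y_true, y_pred, 4))
print(ranking_loss(y_true, ranks), average_precision(y_true, ranks), coverage(y_true, ranks))
```

Accuracy, F-measure, and average precision are better when larger, whereas hamming loss, ranking loss, and coverage are better when smaller.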

V-C Parameter setting

In this work, two scenarios are considered to evaluate the performance of the proposed method. As mentioned before, in the first scenario, after ranking and selecting the desired number of features, local datasets are transmitted to the edge server. Here, ML-kNN [24] with $k=10$ is used as the classifier, as it is one of the most commonly utilized learning algorithms in centralized multi-label classification. In the second scenario, the vanilla federated learning algorithm with a multilayer perceptron (MLP) is employed after ranking and selecting the desired number of features.

Throughout our experiments, we utilize 10 clients, consistent with other single-label federated feature selection methods in the literature. Additionally, the data are non-independent and identically distributed (non-IID) across the clients.

V-D Results and Analysis

In this study, we consider three metrics—performance, computational complexity, and communication cost—to evaluate the effectiveness of the proposed method compared to five other methods. Performance is evaluated using six metrics: accuracy, F-measure, hamming loss, ranking loss, average precision, and coverage. Computational complexity is assessed by examining the time complexity of each algorithm, while communication cost is determined based on the size of the dataset.

In the first scenario, we compare the proposed method, the first federated multi-label feature selection method in the literature, with five other centralized multi-label feature selection methods including: PMFS [12], PPT-MI [6], PMU [14], D2F [13], and SCLS [15]. Our proposed method (FMLFS) ranks features across all clients in a federated manner. In contrast, the other methods operate independently within each client, with no communication between clients or between clients and the server. After ranking features within each client, the desired number of features is selected. Then, the reduced datasets are transmitted from clients to the edge server to feed into the centralized ML-kNN classifier. The results of this scenario are presented in Fig. 4 to 6.

In the second scenario, the proposed method is also compared with the five existing methods. The main difference compared to the first scenario is that, after feature ranking, the federated learning model is modified to function as a multi-label classifier. The findings of this scenario are depicted in Fig. 7 to 9.

Discussion: In the first scenario, for both the Yeast and Scene datasets, the proposed method demonstrates superior performance across all six evaluation metrics compared to the five other methods and the original dataset without FS. For instance, in the Yeast dataset, FMLFS achieves an accuracy of 0.48 with just 20 features, which is comparable to the performance of the classifier using 90 features without FS on the cloud server. It’s worth noting that the cloud server is at least 10 times farther away than the edge server. Therefore, FMLFS can effectively reduce communication costs while simultaneously improving the learning algorithm’s performance, offering a good trade-off between performance and communication cost. Additionally, in the Yeast dataset, FMLFS demonstrates better results with just 10 features across all evaluation metrics compared to the five other FS methods with 100 features. Moreover, in the Birds dataset, FMLFS outperforms all other methods, although the original dataset yields better results in terms of ranking loss and average precision.

In the second scenario, it is evident that the performance of FMLFS with 10 features in the Yeast and Scene datasets surpasses that of other methods using 100 features. This underscores the ability of the proposed method to provide a good trade-off between performance and computational complexity of the learning algorithm. Furthermore, in the Birds dataset, it achieves comparable or even better performance compared to the original dataset without FS, particularly in terms of average precision, coverage, and ranking loss.

Time complexity analysis: Here, we present the time complexity of FMLFS and the five other compared methods (PMFS, PPT-MI, PMU, D2F, and SCLS) on the client side, which is more important due to the constrained computational capabilities of end-user devices. Let $N$ , $D$ , and $L$ represent the number of instances, the number of features and the number of labels, respectively. The time complexity of mutual information and joint entropy is $O(N)$ because accessing all instances is required for probability calculation [3]. The time complexity of the PMFS method is $O(D^{3}+D^{2}N+DNL)$ . PPT-MI computes mutual information between features and each label, thus resulting in a time complexity of $O(ND)$ . If we denote the number of selected features in PMU as $k$ , its time complexity can be expressed as $O(NDL+kNDL+NDL^{2})$ . The time complexity of D2F is $O(NDL+kNDL)$ , where the feature relevance and feature redundancy terms have time complexities of $O(NDL)$ and $O(kNDL)$ respectively. Also, the time complexity of SCLS is $O(NDL+kND)$ . The time complexities of the feature relevance and feature redundancy terms in our proposed method are $O(NDL)$ and $O(ND^{2})$ respectively. Therefore, the overall time complexity is $O(NDL+ND^{2})$ , which is lower than that of PMFS and comparable to that of most of the other compared methods.
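To give a feel for the magnitudes involved, the following sketch evaluates the dominant terms of the FMLFS and PMFS expressions for the three datasets in Table I. These are raw term counts with constants ignored, not measured running times, and the function names are hypothetical. Methods whose expressions depend on the number of selected features $k$ are left out because no value of $k$ is fixed in the analysis.

```python
from typing import Dict, Tuple


def fmlfs_terms(n: int, d: int, l: int) -> int:
    """Dominant terms of O(NDL + ND^2): relevance plus redundancy."""
    return n * d * l + n * d * d


def pmfs_terms(n: int, d: int, l: int) -> int:
    """Dominant terms of O(D^3 + D^2 N + DNL)."""
    return d ** 3 + d * d * n + d * n * l


# (instances N, features D, labels L) taken from Table I.
datasets: Dict[str, Tuple[int, int, int]] = {
    "Yeast": (2417, 103, 14),
    "Scene": (2407, 294, 6),
    "Birds": (645, 260, 19),
}

for name, (n, d, l) in datasets.items():
    print(name, fmlfs_terms(n, d, l), pmfs_terms(n, d, l))
```

For every dataset the two counts differ exactly by the $D^{3}$ term, and in both expressions the $ND^{2}$ redundancy term dominates whenever $D>L$ .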

VI Conclusion and Future Works

In this paper, we introduce FMLFS, the first federated multi-label feature selection method. Inspired by federated learning, FMLFS comprises two phases. Firstly, within each client, redundancy of features and relevancy between features and labels are computed based on information theory concepts. Subsequently, upon aggregating the received information from clients at the edge server, the multi-label feature selection task is transformed into a bi-objective optimization problem. Utilizing Pareto-based dominance and crowding distance strategies, features are ranked, and the rankings are sent back to the clients. Finally, users can select the desired number of features based on their application requirements. Then, three real-world datasets are utilized to assess both federated learning and centralized learning algorithms, evaluating the performance of the proposed method. The results demonstrate the ability of the proposed method to achieve a good trade-off between performance, time complexity and communication cost. For instance, in the Yeast dataset, the proposed method achieves superior accuracy by selecting just 10 features compared to other methods using 100 features. As we propose a filter-based method in this study, our future work entails integrating federated learning procedures and embedded feature selection methods for distributed multi-label datasets.

Acknowledgement

This work is funded by a research grant provided by the National Science Foundation (NSF) under grant number 2340075.

References

[1] A. Mahanipour and H. Khamfroush, “Multimodal multiple federated feature construction method for IoT environments,” in GLOBECOM 2023-2023 IEEE Global Communications Conference. IEEE, 2023, pp. 1890–1895.
[2] R. Zebari, A. Abdulazeez, D. Zeebaree, D. Zebari, and J. Saeed, “A comprehensive review of dimensionality reduction techniques for feature selection and feature extraction,” Journal of Applied Science and Technology Trends, vol. 1, no. 2, pp. 56–70, 2020.
[3] L. Hu, L. Gao, Y. Li, and P. Zhang, “Feature-specific mutual information variation for multi-label feature selection,” Information Sciences, vol. 593, pp. 449–471, 2022.
[4] T. Nishio and R. Yonetani, “Client selection for federated learning with heterogeneous resources in mobile edge,” in ICC 2019-2019 IEEE International Conference on Communications (ICC). IEEE, 2019, pp. 1–7.
[5] N. Spolaôr, E. A. Cherman, and M. C. Monard, “A comparison of multi-label feature selection methods using the problem transformation approach,” Electronic Notes in Theoretical Computer Science, vol. 292, pp. 135–151, 2013.
[6] G. Doquire and M. Verleysen, “Feature selection for multi-label classification problems,” in Advances in Computational Intelligence: 11th International Work-Conference on Artificial Neural Networks, IWANN 2011. Springer, 2011, pp. 9–16.
[7] O. Reyes, C. Morell, and S. Ventura, “Scalable extensions of the ReliefF algorithm for weighting and selecting features on the multi-label learning context,” Neurocomputing, vol. 161, pp. 168–182, 2015.
[8] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown, “Learning multi-label scene classification,” Pattern Recognition, vol. 37, no. 9, pp. 1757–1771, 2004.
[9] G. Tsoumakas, I. Katakis, and I. Vlahavas, “Random k-labelsets for multilabel classification,” IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 7, pp. 1079–1089, 2010.
[10] J. Read, “A pruned problem transformation method for multi-label classification,” in Proc. 2008 New Zealand Computer Science Research Student Conference (NZCSRS 2008), vol. 143150, 2008, p. 41.
[11] W. Chen, J. Yan, B. Zhang, Z. Chen, and Q. Yang, “Document transformation for multi-label feature selection in text categorization,” in Seventh IEEE International Conference on Data Mining (ICDM 2007). IEEE, 2007, pp. 451–456.
[12] A. Hashemi, M. B. Dowlatshahi, and H. Nezamabadi-pour, “An efficient Pareto-based feature selection algorithm for multi-label classification,” Information Sciences, vol. 581, pp. 428–447, 2021.
[13] J. Lee and D.-W. Kim, “Mutual information-based multi-label feature selection using interaction information,” Expert Systems with Applications, vol. 42, no. 4, 2015.
[14] J. Lee and D.-W. Kim, “Feature selection for multi-label classification using multivariate mutual information,” Pattern Recognition Letters, vol. 34, no. 3, pp. 349–357, 2013.
[15] J. Lee and D.-W. Kim, “SCLS: Multi-label feature selection based on scalable criterion for large label set,” Pattern Recognition, vol. 66, pp. 342–352, 2017.
[16] A. Mahanipour and H. Khamfroush, “Wrapper-based federated feature selection for IoT environments,” in 2023 International Conference on Computing, Networking and Communications (ICNC). IEEE, 2023, pp. 214–219.
[17] X. Zhang, A. Mavromatics, and A. Vafeas, “Federated feature selection for horizontal federated learning in IoT networks,” IEEE Internet of Things Journal, 2023.
[18] A. Li, H. Peng, L. Zhang, J. Huang, Q. Guo, H. Yu, and Y. Liu, “FedSDG-FS: Efficient and secure feature selection for vertical federated learning,” arXiv preprint arXiv:2302.10417, 2023.
[19] Y. Hu, Y. Zhang, D. Gong, and X. Sun, “Multi-participant federated feature selection algorithm with particle swarm optimization for imbalanced data under privacy protection,” IEEE Transactions on Artificial Intelligence, 2022.
[20] C. Von Lücken, B. Barán, and C. Brizuela, “A survey on multi-objective evolutionary algorithms for many-objective problems,” Computational Optimization and Applications, vol. 58, pp. 707–756, 2014.
[21] C. R. Raquel and P. C. Naval Jr, “An effective use of crowding distance in multiobjective particle swarm optimization,” in Proceedings of the 7th Annual Conference on Genetic and Evolutionary Computation, 2005, pp. 257–264.
[22] C. E. Shannon, “A mathematical theory of communication,” ACM SIGMOBILE Mobile Computing and Communications Review, vol. 5, no. 1, pp. 3–55, 2001.
[23] A. N. Tarekegn, M. Giacobini, and K. Michalak, “A review of methods for imbalanced multi-label classification,” Pattern Recognition, vol. 118, p. 107965, 2021.
[24] M.-L. Zhang and Z.-H. Zhou, “ML-kNN: A lazy learning approach to multi-label learning,” Pattern Recognition, vol. 40, no. 7, pp. 2038–2048, 2007.