
FMLFS: A federated multi-label feature selection based on information theory in IoT environment

Afsaneh Mahanipour Department of Computer Science
University of Kentucky
Lexington, KY, USA
ama654@uky.edu
   Hana Khamfroush Department of Computer Science
University of Kentucky
Lexington, KY, USA
khamfroush@cs.uky.edu
Abstract

In certain emerging applications such as wearable health monitoring and traffic monitoring systems, Internet-of-Things (IoT) devices generate or collect large volumes of multi-label data, in which each instance is linked to a set of labels. The presence of noisy, redundant, or irrelevant features in these datasets, along with the curse of dimensionality, poses challenges for multi-label classifiers. Feature selection (FS) is an effective strategy for enhancing classifier performance and addressing these challenges. Yet no distributed multi-label FS method in the literature is suitable for distributed multi-label datasets within IoT environments. This paper introduces FMLFS, the first federated multi-label feature selection method. Here, mutual information between features and labels serves as the relevancy metric, while the correlation distance between features, derived from mutual information and joint entropy, is used as the redundancy measure. After these metrics are aggregated on the edge server, the features are sorted using Pareto-based bi-objective and crowding distance strategies, and the sorted features are sent back to the IoT devices. The proposed method is evaluated in two scenarios: 1) transmitting reduced-size datasets to the edge server for centralized classifier usage, and 2) employing federated learning with reduced-size datasets. Evaluation across three metrics (performance, time complexity, and communication cost) demonstrates that FMLFS outperforms five other comparable methods in the literature and provides a good trade-off on three real-world datasets.

Index Terms:
Bi-objective optimization, Crowding distance, Federated feature selection, Multi-label data, Pareto dominance

I Introduction

With the development of emerging science and technologies such as Internet-of-Things (IoT), smart healthcare, and intelligent transportation, we are entering the era of big data, where a huge amount of data is generated daily. In many cases, the collected data may contain irrelevant, noisy, or redundant features. The presence of such features not only increases the complexity and execution time of learning models but also significantly degrades their performance [1].

Data pre-processing methods can tackle these issues effectively. Among them, feature selection (FS) techniques select relevant and informative features from the original set without transforming them, unlike other dimensionality reduction methods such as principal component analysis. FS reduces data dimensionality and lowers computational costs and storage requirements, while also improving learning model performance [2]. In some real-world applications across different domains such as text categorization, image recognition, and gene prediction, each instance is associated with a set of labels. Such datasets are known as multi-label data. Just like single-label data, multi-label data also face the challenge of high dimensionality. Consequently, effective multi-label feature selection approaches become crucial in addressing this dimensionality problem [3].

On the other hand, processing these vast amounts of collected data by intelligent devices, particularly within IoT networks, is essential for extracting meaningful insights about the environment. Traditionally, these datasets were transmitted to cloud servers. However, due to the need for real-time responses and concerns about privacy, many IoT applications now restrict the transfer of data to the cloud. Instead, the data needs to be processed either locally or at the edge [4]. If all non-pre-processed local datasets are transferred to an edge server, communication costs between end-user devices/clients and the edge server increase, leading to processing delays. Moreover, processing non-pre-processed local datasets with distributed machine learning approaches such as federated learning (FL) would increase the complexity and execution time of these models.

Therefore, to select informative features from distributed multi-label datasets on clients, a collaborative multi-label feature selection method is needed. Several centralized multi-label FS approaches exist in the literature, but they encounter challenges in these environments. For instance, if each client’s data is independently fed to a feature selection process, feature selection bias may lead to inaccurate and non-robust results. Therefore, a distributed FS procedure such as federated feature selection (FFS) should be applied to multi-label FS. To the best of our knowledge, this is the first work that investigates a federated multi-label FS method. In this study, each client computes the mutual information between features and class labels, as well as the correlation distance between features, obtained by subtracting mutual information from joint entropy between features. Based on these two metrics, the proposed method evaluates both the relevance between features and class labels and the redundancy among features. Next, these values are transmitted to the edge server for aggregation, where they serve as two objectives in a Pareto-based bi-objective strategy. The features are then ranked based on the combination of their Pareto front number and their crowding distance, and the ranked features are returned to each client. Consequently, by using a smaller number of features and reducing the data size, the performance of the learning algorithm can be enhanced. This leads to accelerated machine learning (ML) models and reduced complexity. Moreover, it helps minimize communication costs when transmitting datasets to edge servers. The main novelties of the proposed method can be summarized as follows:

  • Proposing the first federated multi-label feature selection method

  • Employing information theory-based concepts as two objectives within a Pareto-based bi-objective strategy

  • Utilizing a federated learning algorithm as a multi-label classifier for the first time

  • Comparing the proposed method with five other centralized multi-label FS methods on three datasets from three application domains

II Related Works

Previous studies mostly focused on centralized multi-label feature selection. Limited research has explored federated feature selection for single-label datasets, and none has investigated federated multi-label feature selection. The following subsections discuss previous works in detail.

II-A Centralized Multi-label Feature Selection

Multi-label feature selection methods can be divided into two groups based on how multi-label data is handled: problem transformation and algorithm adaptation. In problem transformation methods, the first step involves converting the multi-label dataset into a single-label one. Then, any state-of-the-art single-label FS method can be employed to select features effectively [5, 6, 7]. Binary relevance (BR) [8], label powerset (LP) [9], pruned problem transformation (PPT) [10], and entropy-based label assignment (ELA) [11] are a number of problem transformation methods that convert multi-label data into single-label format. For instance, in [5], the data is transformed using BR and LP techniques, followed by the utilization of Information Gain (IG) and ReliefF for feature selection. Similarly, Doquire et al. [6] use PPT for problem transformation, and then employ a greedy forward feature selection method based on mutual information. However, these methods still have some drawbacks; for instance, BR does not take label correlations into account.

Algorithm adaptation approaches extend FS methods to directly handle multi-label datasets and effectively address those drawbacks [12]. Several algorithm adaptation methods have been proposed so far. For example, Lee et al. [13] propose an FS method based on mutual information (D2F), which incorporates interaction information and utilizes conditional mutual information for assessing feature relevance. The Pairwise Multi-label Utility (PMU) method [14] is another approach, which uses the mutual information between a candidate feature and the label set as a term for feature relevance. In [15], a multi-label FS method named SCLS is proposed, which employs scalable feature relevance assessment to evaluate the relevance of candidate features.

II-B Federated Feature Selection for Single-label Datasets

There are a few federated feature selection (FFS) methods for single-label datasets in the literature [16, 17]. These methods are inspired by the federated learning (FL) procedure and can be classified into two groups: vertical FFS [18] and horizontal FFS [19]. In vertical FFS, clients’ datasets contain instances with identical IDs but different feature sets. Conversely, in horizontal FFS, clients hold distinct instances that share an identical feature set. This paper presents the first horizontal FFS method designed for multi-label datasets.

III Preliminaries

III-A Pareto-based solutions

In multi-objective optimization problems (MOPs), unlike single-objective ones, conflicts between objectives result in a set of optimal solutions known as the Pareto optimal set, rather than a single optimal solution [20]. Without loss of generality, we assume a maximization problem when presenting the concepts of Pareto optimality. The general MOP formula is as follows:

$$\begin{cases}\max\; O(g)=[o_{1}(g),o_{2}(g),\cdots,o_{w}(g)],\\ \text{s.c.}:\ g\in\Omega\end{cases} \tag{1}$$

where $O(g)=[o_{1}(g),o_{2}(g),\cdots,o_{w}(g)]$ represents the objective vector, with $w\,(w\geq 2)$ denoting the number of objective functions, and $g=(g_{1},\cdots,g_{k})$ is the decision vector, where $k$ is the number of decision variables. In MOP, there exist fundamental concepts, such as:

  • Pareto dominance: For any two objective vectors $u=(u_{1},\cdots,u_{w})$ and $v=(v_{1},\cdots,v_{w})$, $v$ is dominated by $u$, denoted as $u\succ v$, if and only if none of the elements in $v$ exceed the corresponding elements in $u$, and at least one element in $u$ is strictly larger.

    $$\forall\,i\in(1,\cdots,w):u_{i}\geq v_{i}\ \wedge\ \exists\,i\in(1,\cdots,w):u_{i}>v_{i} \tag{2}$$
  • Pareto optimal set: A solution $g\in\Omega$ that is not dominated by any other solution in $\Omega$ belongs to the Pareto optimal set.

    $$\nexists\,g^{\prime}\in\Omega:\ g^{\prime}\succ g \tag{3}$$
  • Pareto optimal front: The Pareto optimal front is defined as the projection of the Pareto optimal set onto the objective space.
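The dominance relation and the resulting fronts can be illustrated with a short Python sketch. It is only an illustration under the maximization convention used here: the helper names `dominates` and `non_dominated_fronts` and the toy objective vectors are assumptions introduced for this example, and the front-peeling loop is a simple quadratic-time procedure rather than an optimized sorting algorithm.

```python
from typing import List, Sequence


def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """Return True if u dominates v under maximization (Eq. 2)."""
    no_worse = all(ui >= vi for ui, vi in zip(u, v))
    strictly_better = any(ui > vi for ui, vi in zip(u, v))
    return no_worse and strictly_better


def non_dominated_fronts(points: Sequence[Sequence[float]]) -> List[List[int]]:
    """Peel off successive Pareto fronts and return them as lists of indices."""
    remaining = set(range(len(points)))
    fronts: List[List[int]] = []
    while remaining:
        # A point stays in the current front if no remaining point dominates it.
        front = sorted(
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


# Toy bi-objective vectors (hypothetical values, both objectives maximized).
objective_vectors = [(3.0, 1.0), (1.0, 3.0), (2.0, 2.0), (1.0, 1.0), (0.5, 0.5)]
fronts = non_dominated_fronts(objective_vectors)
print(fronts)  # [[0, 1, 2], [3], [4]]
```

The first list holds the indices of the Pareto optimal set; each later list is the set that becomes non-dominated once the earlier fronts are removed.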

III-B Crowding distance

The crowding distance of a solution reflects the density of neighboring solutions around it. This metric is derived from the largest cuboid around the solution that contains no other solution [21].
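As a rough illustration, the sketch below computes crowding distances in the style commonly used with non-dominated sorting: for each objective, the solutions are sorted, the two boundary solutions receive an infinite distance, and each interior solution accumulates the normalized gap between its two neighbors. That this matches the exact formulation in [21] is an assumption, and the function name `crowding_distance` and the toy front are hypothetical.

```python
import math
from typing import List, Sequence


def crowding_distance(points: Sequence[Sequence[float]]) -> List[float]:
    """Crowding distance of each point within a single front."""
    n = len(points)
    if n == 0:
        return []
    distance = [0.0] * n
    n_obj = len(points[0])
    for k in range(n_obj):
        # Sort point indices by their value on objective k.
        order = sorted(range(n), key=lambda i: points[i][k])
        lo, hi = points[order[0]][k], points[order[-1]][k]
        # Boundary points are always kept, so they get infinite distance.
        distance[order[0]] = distance[order[-1]] = math.inf
        if hi == lo:
            continue  # all values equal: this objective adds no spread
        for pos in range(1, n - 1):
            prev_val = points[order[pos - 1]][k]
            next_val = points[order[pos + 1]][k]
            distance[order[pos]] += (next_val - prev_val) / (hi - lo)
    return distance


# Toy front with two objectives (hypothetical values).
front = [(0.0, 4.0), (1.0, 3.0), (3.0, 1.0), (4.0, 0.0)]
distances = crowding_distance(front)
print(distances)  # [inf, 1.5, 1.5, inf]
```

A larger value means the solution sits in a sparser region of the front, which is why crowding distance is useful for ordering solutions that share the same front.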

IV Proposed Method

IV-A System Overview

We consider a two-tier setup in which horizontal FFS is executed. The first tier comprises various clients, including smart devices in contexts such as autonomous driving systems and healthcare systems, which gather multi-label data $\mathcal{U}$. The second tier consists of an edge server $\mathcal{S}$. Importantly, this approach can be scaled to handle a larger number of edge servers. We consider a collection of $M$ clients, denoted as $C_{m}$, where $m$ takes on values from 1 to $M$, and an edge server denoted as $e$. Here, $M$ must be at least 2, as having a single client would result in a centralized FS scenario. Each client’s dataset $\mathcal{U}\in\mathbb{R}^{N\times(D+L)}$ is represented as $\mathcal{U}=\{(X_{i},Y_{i})\}_{i=1}^{N}$, where $N$ is the number of instances.
Each instance is associated with a $D$-dimensional feature vector $x_{i}=(x_{i1},x_{i2},...,x_{iD})$ and an $L$-dimensional label vector $y_{i}=(y_{i1},y_{i2},...,y_{iL})$. The label vector is a binary vector where $y_{il}$ equals 1 if the given instance has label $Y_{l}$; otherwise, $y_{il}$ equals 0. The structure of the multi-label dataset at the clients is depicted in Fig. 1.

$$\begin{array}{cccc|cccc}
X & & & & Y & & & \\
\hline
X_{1} & X_{2} & \cdots & X_{D} & Y_{1} & Y_{2} & \cdots & Y_{L} \\
\hline
x_{11} & x_{12} & \cdots & x_{1D} & y_{11} & y_{12} & \cdots & y_{1L} \\
x_{21} & x_{22} & \cdots & x_{2D} & y_{21} & y_{22} & \cdots & y_{2L} \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
x_{N1} & x_{N2} & \cdots & x_{ND} & y_{N1} & y_{N2} & \cdots & y_{NL}
\end{array}$$
Figure 1: Multi-Label Data Structure
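To make the horizontal setting concrete, the following sketch builds a toy multi-label dataset with the layout of Fig. 1 and splits its rows across clients, so that every client holds different instances but the same $D$ feature columns and $L$ label columns. All sizes, the random data, and the even random split are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, L, M = 12, 5, 3, 3  # instances, features, labels, clients (toy values)

X = rng.integers(0, 4, size=(N, D))  # discrete feature values
Y = rng.integers(0, 2, size=(N, L))  # binary label indicators

# Horizontal partition: disjoint row subsets, identical columns on every client.
row_splits = np.array_split(rng.permutation(N), M)
clients = [(X[rows], Y[rows]) for rows in row_splits]

for m, (X_m, Y_m) in enumerate(clients, start=1):
    print(f"client {m}: X {X_m.shape}, Y {Y_m.shape}")
```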

IV-B Proposed Algorithm

The proposed approach combines the federated learning procedure with principles from information theory to select informative features from multi-label datasets across various clients. We refer to this technique as FMLFS, which stands for Federated Multi-Label Feature Selection. In this method, our goal is to identify the most relevant features while minimizing redundancy. Therefore, we treat relevancy (the predictive power of features) and redundancy among features as two objectives, converting the multi-label feature selection problem into a bi-objective optimization problem. Then, Pareto dominance and crowding distance concepts are employed to sort features. The FMLFS procedure comprises two phases: a local phase on the clients and a global phase on the edge server. The overview of a distributed environment with multi-label datasets is depicted in Fig. 2.

Figure 2: Overview of a distributed environment.

Local Phase: In the local phase, each client uses its local dataset to calculate the mutual information between features and labels, determining their degree of relevance. Additionally, each client computes the correlation distance of features as the degree of redundancy by subtracting mutual information from joint entropy. Then, both the calculated mutual information and correlation distance measures are sent to the edge server for further processing.

Information entropy measures the uncertainty of random variables [22]. The joint entropy between two features, such as $X_{a}=(x_{1a},x_{2a},...,x_{Na})$ and $X_{b}=(x_{1b},x_{2b},...,x_{Nb})$, is calculated as follows:

$$H(X_{a},X_{b})=H(X_{b})+H(X_{a}|X_{b}) \tag{4}$$

where $H(X_{b})$ represents the information entropy of the feature $X_{b}$, and $H(X_{a}|X_{b})$ denotes the conditional entropy of $X_{a}$ given $X_{b}$. The entropy of a single feature (shown for $X_{a}$; $H(X_{b})$ is analogous) and the conditional entropy are defined as follows:

$$H(X_{a})=-\sum_{i=1}^{N}p(x_{a}^{i})\log_{2}p(x_{a}^{i}) \tag{5}$$
$$H(X_{a}|X_{b})=-\sum_{i=1}^{N}\sum_{j=1}^{N}p(x_{a}^{i},x_{b}^{j})\log_{2}p(x_{a}^{i}|x_{b}^{j}) \tag{6}$$

$p(x_{a}^{i})$ represents the probability of the $i$-th value of feature $X_{a}$, while $p(x_{a}^{i},x_{b}^{j})$ denotes the joint probability of the $i$-th value of feature $X_{a}$ and the $j$-th value of feature $X_{b}$. Therefore, the joint entropy can be represented as follows:

$$H(X_{a},X_{b})=-\sum_{i=1}^{N}\sum_{j=1}^{N}p(x_{a}^{i},x_{b}^{j})\log_{2}p(x_{a}^{i},x_{b}^{j}) \tag{7}$$
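The entropy quantities can be estimated from value counts when the features are discrete. The sketch below assumes already-discretized features and uses plug-in (frequency-based) probability estimates; the function names and the two toy feature columns are hypothetical. It also checks numerically that the double-sum form of the joint entropy agrees with the chain rule $H(X_{a},X_{b})=H(X_{b})+H(X_{a}|X_{b})$.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Empirical Shannon entropy in bits (Eq. 5)."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def joint_entropy(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Empirical joint entropy (Eq. 7): entropy of the paired values."""
    return entropy(list(zip(a, b)))


def conditional_entropy(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Empirical conditional entropy H(a | b) (Eq. 6)."""
    n = len(a)
    joint_counts = Counter(zip(a, b))
    b_counts = Counter(b)
    # p(a, b) = c / n and p(a | b) = c / count(b).
    return -sum(
        (c / n) * math.log2(c / b_counts[bv]) for (_, bv), c in joint_counts.items()
    )


# Two toy discrete feature columns (hypothetical values).
x_a = [0, 0, 0, 1]
x_b = [0, 0, 1, 1]

h_joint = joint_entropy(x_a, x_b)
h_chain = entropy(x_b) + conditional_entropy(x_a, x_b)
print(h_joint, h_chain)
```

For continuous features, some discretization step would be needed before these estimates apply; the source does not specify one, so none is assumed here.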

Mutual information quantifies the reduction in uncertainty of one random variable when another random variable is known, representing the amount of shared information between the variables. Consider $X_{a}=(x_{1a},x_{2a},...,x_{Na})$ as a feature and $Y_{b}=(y_{1b},y_{2b},...,y_{Nb})$ as a label in the dataset. The mutual information between the feature and label is defined as follows:

$$I(X_{a};Y_{b})=H(X_{a})-H(X_{a}|Y_{b})=H(Y_{b})-H(Y_{b}|X_{a}) \tag{8}$$
$$I(X_{a};Y_{b})=H(X_{a})+H(Y_{b})-H(Y_{b},X_{a}) \tag{9}$$
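A client-side computation of the feature-label mutual information might look like the following sketch, which uses the entropy form of Eq. (9) with frequency-based estimates on discrete data. The helper names (`entropy`, `mutual_information`, `mi_matrix`) and the toy columns are assumptions for illustration; the resulting nested list has one row per feature and one column per label.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Empirical Shannon entropy in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """I(X; Y) = H(X) + H(Y) - H(X, Y), as in Eq. 9."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def mi_matrix(features: Sequence[Sequence[Hashable]],
              labels: Sequence[Sequence[Hashable]]) -> List[List[float]]:
    """D x L matrix of feature-label mutual information on one client."""
    return [[mutual_information(f, l) for l in labels] for f in features]


# Toy client data given column-wise (hypothetical values).
feature_columns = [
    [0, 0, 1, 1],  # identical to the first label: fully informative
    [0, 1, 0, 1],  # independent of the first label
]
label_columns = [
    [0, 0, 1, 1],
    [0, 1, 0, 1],
]

local_mi = mi_matrix(feature_columns, label_columns)
print(local_mi)
```

In this toy case each feature shares one full bit with the label it copies and zero bits with the other label.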

Now, the correlation distance between two features can be defined as follows:

$$MH(X_{a},X_{b})=H(X_{a},X_{b})-I(X_{a};X_{b}) \tag{10}$$
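The sketch below computes a pairwise correlation distance matrix on one client, taking the distance as joint entropy minus mutual information, which is how the local phase describes it. This sign convention, the helper names, and the toy feature columns are assumptions of the example. Under this convention, identical features have distance 0 and independent features have the largest distance.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Empirical Shannon entropy in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def correlation_distance(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Joint entropy minus mutual information (assumed sign convention)."""
    h_joint = entropy(list(zip(a, b)))
    mi = entropy(a) + entropy(b) - h_joint
    return h_joint - mi


def distance_matrix(features: Sequence[Sequence[Hashable]]) -> List[List[float]]:
    """D x D matrix of pairwise correlation distances on one client."""
    return [[correlation_distance(f, g) for g in features] for f in features]


# Toy feature columns (hypothetical): the first two are duplicates,
# the third is independent of both.
feature_columns = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
]

local_mh = distance_matrix(feature_columns)
print(local_mh)
```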

Global Phase: On the edge server side, the computed mutual information (Eq. 11) and correlation distance (Eq. 12) matrices sent by the clients are aggregated to produce a global mutual information matrix and correlation distance matrix. These two properties are considered as two objectives in transforming the multi-label FS problem into a bi-objective optimization problem. Therefore, the optimal feature subset is defined as one that maximizes relevance while minimizing redundancy.

$$MI=\begin{bmatrix}I(X_{1};Y_{1})&I(X_{1};Y_{2})&\cdots&I(X_{1};Y_{L})\\ I(X_{2};Y_{1})&I(X_{2};Y_{2})&\cdots&I(X_{2};Y_{L})\\ \vdots&\vdots&\ddots&\vdots\\ I(X_{D};Y_{1})&I(X_{D};Y_{2})&\cdots&I(X_{D};Y_{L})\end{bmatrix} \tag{11}$$
$$MH=\begin{bmatrix}MH(X_{1},X_{1})&MH(X_{1},X_{2})&\cdots&MH(X_{1},X_{D})\\ MH(X_{2},X_{1})&MH(X_{2},X_{2})&\cdots&MH(X_{2},X_{D})\\ \vdots&\vdots&\ddots&\vdots\\ MH(X_{D},X_{1})&MH(X_{D},X_{2})&\cdots&MH(X_{D},X_{D})\end{bmatrix} \tag{12}$$

Therefore, in defining the objective functions, the first objective ($O_1$) is identified as maximizing the mutual information between each feature and the set of labels [12]:

\[
\begin{split}
&MAX(i) = \max(MI(i,:)), \quad i = 1, 2, \ldots, D \\
&O_1 = [MAX(1), MAX(2), \ldots, MAX(D)]
\end{split} \tag{13}
\]

Next, the maximization of the correlation distance, defined as the difference between mutual information and joint entropy among features, is regarded as the second objective ($O_2$) to measure their redundancy:

\[
\begin{split}
&A(i) = \max(MH(i,:)), \quad i = 1, 2, \ldots, D \\
&O_2 = [A(1), A(2), \ldots, A(D)]
\end{split} \tag{14}
\]

Then, a non-dominated sorting strategy is conducted using the Pareto dominance concept with these two objectives. Initially, each feature is assigned a Pareto number. Subsequently, the crowding distance or density of other features around each feature is calculated to arrange the features within the same front and with identical Pareto numbers. The crowding distance of each feature is computed within the bi-objective space. Next, the combination of the Pareto front number and the crowding distance is used to assign a final score to each feature. This score is calculated as follows [12]:

\[
S = P + \frac{1}{1 + d} \tag{15}
\]

where $P$ denotes the Pareto front number and $d$ represents the crowding distance. A lower value of $S$ indicates a better feature, as a lower Pareto front number and a larger crowding distance are considered preferable. The features can now be arranged based on their $S$ values. The pseudocode of the FMLFS method is given in Algorithms 1 and 2, and an overview of the proposed method is shown in Fig. 3.
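The ranking step described above can be sketched in Python. This is a minimal illustration rather than the authors' implementation: it assumes both objectives are maximized, uses a simple quadratic-time non-dominated sort, and computes the crowding distance per front in the NSGA-II style (boundary features receive an infinite distance), a detail the text does not specify. All function names and the toy values are hypothetical.

```python
import numpy as np

def pareto_fronts(o1, o2):
    """Pareto front number per feature (1 = best); both objectives maximized."""
    pts = np.column_stack([o1, o2]).astype(float)
    front = np.zeros(len(pts), dtype=int)
    remaining = set(range(len(pts)))
    current = 1
    while remaining:
        idx = sorted(remaining)
        # A feature is non-dominated if no other remaining feature is at
        # least as good on both objectives and strictly better on one.
        nondominated = [
            i for i in idx
            if not any(
                np.all(pts[j] >= pts[i]) and np.any(pts[j] > pts[i])
                for j in idx if j != i
            )
        ]
        for i in nondominated:
            front[i] = current
        remaining -= set(nondominated)
        current += 1
    return front

def crowding_distance(o1, o2, front):
    """NSGA-II style crowding distance, computed inside each front (assumption)."""
    pts = np.column_stack([o1, o2]).astype(float)
    d = np.zeros(len(pts))
    for f in np.unique(front):
        members = np.flatnonzero(front == f)
        for k in range(pts.shape[1]):
            order = members[np.argsort(pts[members, k])]
            d[order[0]] = d[order[-1]] = np.inf  # boundary features
            span = pts[order[-1], k] - pts[order[0], k]
            if span > 0 and len(order) > 2:
                gaps = pts[order[2:], k] - pts[order[:-2], k]
                d[order[1:-1]] += gaps / span
    return d

def feature_scores(o1, o2):
    """Final score S = P + 1 / (1 + d) from Eq. (15); lower is better."""
    P = pareto_fronts(o1, o2)
    d = crowding_distance(o1, o2, P)
    return P + 1.0 / (1.0 + d)

# Toy objective values for four features (hypothetical numbers).
o1 = np.array([0.9, 0.5, 0.1, 0.4])
o2 = np.array([-0.2, -0.1, -0.05, -0.3])
S = feature_scores(o1, o2)
ranking = np.argsort(S, kind="stable")  # feature indices, best first
```

With an infinite crowding distance the second term of the score vanishes, so the boundary features of each front receive a score equal to their front number.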

Algorithm 1 Pseudocode of the edge server side
Input: $M$ (number of clients), clients’ $MI$ matrices, clients’ correlation distance matrices ($MH$)
Output: Ranking of features based on $S$
Executing Algorithm 2 in clients
# Aggregation step
$MI' = (MI_1 + MI_2 + \ldots + MI_M)/M$
$MH' = (MH_1 + MH_2 + \ldots + MH_M)/M$
# Calculation of objective functions
$O_1 = \max(MI', 1)$
$O_2 = \max(MH', 1)$
# Feature sorting
Performing the non-dominated sorting algorithm in the bi-objective domain
Assigning Pareto front number $P$ to the features
Calculating the crowding distance $d$ of the features
$S = P + \frac{1}{1 + d}$
Sorting features based on $S$ in ascending order
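The aggregation and objective steps of the server side can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, and the maxima are taken row-wise as in Eqs. (13) and (14). Note that $MH(X_i, X_i) = I(X_i; X_i) - H(X_i, X_i) = 0$ while off-diagonal entries are never positive, so a literal row maximum would be 0 for every feature. The sketch therefore masks the diagonal by default, which is an assumption the text does not state.

```python
import numpy as np

def aggregate(client_matrices):
    """Element-wise mean of the matrices reported by the M clients."""
    return np.mean(np.stack(client_matrices), axis=0)

def objectives(mi_avg, mh_avg, skip_diagonal=True):
    """Row-wise maxima of the aggregated MI and MH matrices."""
    o1 = np.asarray(mi_avg, dtype=float).max(axis=1)
    mh = np.array(mh_avg, dtype=float)  # copy, so the input is untouched
    if skip_diagonal:
        # Assumption: ignore MH(X_i, X_i), which is always 0.
        np.fill_diagonal(mh, -np.inf)
    o2 = mh.max(axis=1)
    return o1, o2

# Two hypothetical clients, two features, two labels.
mi_clients = [np.array([[0.2, 0.4], [0.1, 0.3]]),
              np.array([[0.4, 0.2], [0.3, 0.5]])]
mh_clients = [np.array([[0.0, -1.0], [-1.0, 0.0]]),
              np.array([[0.0, -3.0], [-3.0, 0.0]])]

o1, o2 = objectives(aggregate(mi_clients), aggregate(mh_clients))
```

Passing `skip_diagonal=False` reproduces the literal row maximum, which makes the degenerate all-zero second objective easy to observe.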
Algorithm 2 Pseudocode of the client side
Input: Local dataset of each client
Output: Mutual information matrix ($MI$) and correlation distance matrix ($MH$)
# Calculating mutual information between features and labels
for a = 1:D do
    for b = 1:L do
        $MI(a, b) = I(X_a; Y_b)$
    end for
end for
# Calculating correlation distance between features
for a = 1:D do
    for b = 1:D do
        $MH(X_a, X_b) = I(X_a; X_b) - H(X_a, X_b)$
    end for
end for
return the $MI$ and $MH$ matrices
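The client-side computation can be sketched in Python as follows. This is a minimal sketch under stated assumptions: features and labels are discrete (continuous features would first need a discretization step, which the text does not describe), probabilities are plug-in frequency estimates, and entropies are measured in bits. The function names are hypothetical.

```python
import numpy as np

def entropy(*columns):
    """Joint Shannon entropy (bits) of one or more discrete columns."""
    joint = np.column_stack(columns)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)

def client_matrices(X, Y):
    """Return the D x L matrix MI and the D x D matrix MH for one client."""
    D, L = X.shape[1], Y.shape[1]
    MI = np.zeros((D, L))
    MH = np.zeros((D, D))
    # Relevance: mutual information between each feature and each label.
    for a in range(D):
        for b in range(L):
            MI[a, b] = mutual_information(X[:, a], Y[:, b])
    # Redundancy: correlation distance between each pair of features.
    for a in range(D):
        for b in range(D):
            MH[a, b] = (mutual_information(X[:, a], X[:, b])
                        - entropy(X[:, a], X[:, b]))
    return MI, MH

# Toy local dataset: feature 0 equals the label, feature 1 is independent.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y = np.array([[0], [0], [1], [1]])
MI, MH = client_matrices(X, Y)
```

Each entropy call scans all local instances once, which matches the $O(N)$ cost per entry assumed in the time complexity analysis.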
Figure 3: Overview of the FMLFS algorithm.

V Experimental Results

In this section, we evaluate the proposed method using two scenarios. In the first scenario, clients rank features by FMLFS and select the desired number of features. Subsequently, the reduced-size datasets are transmitted to the edge server to be utilized in a centralized multi-label learning algorithm. In the second scenario, after employing FMLFS to rank features, a vanilla federated learning algorithm is utilized as a multi-label classifier.

V-A Datasets

The proposed method’s performance is evaluated against five similar methods in the literature using three real-world datasets from the Mulan repository (https://mulan.sourceforge.net/datasets.html). The datasets are selected from diverse domains (Biology, Image, and Audio), each varying in the number of instances, features, and labels. The characteristics of these datasets are presented in Table I.

TABLE I: Details of the multi-label datasets
Dataset   Instances   Features   Labels   Domain
Yeast     2417        103        14       Biology
Scene     2407        294        6        Image
Birds     645         260        19       Audio

V-B Evaluation Measures

Accuracy, F-measure, Hamming loss, ranking loss, average precision, and coverage are the metrics employed to evaluate the performance of FMLFS and the compared methods. Let $\mathcal{T} = \{(x_i, y_i)\}_{i=1}^{n}$ denote a test set, where $y_i$ and $z_i$ represent the actual label set and the predicted label set for $x_i$, respectively. The metrics are defined as follows [23]:

  • Accuracy: It represents the proportion of correctly predicted labels relative to all predicted and actual labels.

    \[ Accuracy = \frac{1}{n} \sum_{i=1}^{n} \frac{|y_i \cap z_i|}{|y_i \cup z_i|} \tag{16} \]
  • F-measure: It is the harmonic mean of precision and recall. It is a weighted measure indicating the number of relevant labels predicted and the proportion of predicted labels that are relevant.

    \[
    \begin{split}
    &F\text{-}measure = 2 \times \frac{Precision \times Recall}{Precision + Recall} \\
    &Precision = \frac{1}{n} \sum_{i=1}^{n} \frac{|y_i \cap z_i|}{|z_i|} \\
    &Recall = \frac{1}{n} \sum_{i=1}^{n} \frac{|y_i \cap z_i|}{|y_i|}
    \end{split} \tag{17}
    \]
  • Hamming Loss (HL): It is calculated by determining the symmetric difference between the actual and predicted labels and then dividing it by the total number of labels. A smaller HL value indicates better performance.

    \[ HL = \frac{1}{n} \sum_{i=1}^{n} \frac{|y_i \triangle z_i|}{|L|} \tag{18} \]
  • Ranking Loss (RL): It calculates the frequency of relevant labels being ranked lower than non-relevant labels. Better performance is indicated by a smaller RL value.

    \[
    RL = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|y_i||\bar{y_i}|} \left| \{ (\lambda_a, \lambda_b) : rank(\lambda_a) > rank(\lambda_b),\ (\lambda_a, \lambda_b) \in y_i \times \bar{y_i} \} \right| \tag{19}
    \]

    where $\bar{y_i}$ is the complement set of $y_i$.

  • Avg-Precision: It measures the average fraction of relevant labels ranked above a specific label.

    \[
    Avg\text{-}precision = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|y_i|} \sum_{\lambda \in y_i} \frac{|\{ \lambda' \in y_i : rank(\lambda') \leq rank(\lambda) \}|}{rank(\lambda)} \tag{20}
    \]
  • Coverage: It denotes the number of steps a learning algorithm requires to cover all the true labels of an instance. Better performance is indicated by a smaller coverage value.

    \[ Coverage = \frac{1}{n} \sum_{i=1}^{n} \max_{\lambda \in y_i} (rank(\lambda)) - 1 \tag{21} \]
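The six measures above can be sketched in Python as follows. This is an illustrative sketch, not the evaluation code used in the paper. It assumes labels are given as binary indicator matrices of shape $(n, |L|)$, that the classifier outputs a real-valued score per label (rank 1 is the highest score, with ties broken by label index), and that every test instance has at least one relevant and one irrelevant label. The function names are hypothetical.

```python
import numpy as np

def _ratio(num, den):
    # Assumption: an instance with an empty denominator contributes 0.
    return np.divide(num, den, out=np.zeros(len(num)), where=den > 0)

def set_metrics(y, z):
    """Accuracy, F-measure and Hamming loss, Eqs. (16)-(18)."""
    y, z = np.asarray(y, dtype=bool), np.asarray(z, dtype=bool)
    inter = (y & z).sum(axis=1)
    accuracy = _ratio(inter, (y | z).sum(axis=1)).mean()
    precision = _ratio(inter, z.sum(axis=1)).mean()
    recall = _ratio(inter, y.sum(axis=1)).mean()
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom > 0 else 0.0
    hamming = (y ^ z).mean()  # symmetric difference over n * |L| entries
    return float(accuracy), float(f_measure), float(hamming)

def ranks_from_scores(scores):
    """Rank 1 = highest score; ties are broken by label index."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, axis=1, kind="stable")
    ranks = np.empty_like(order)
    rows = np.arange(scores.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, scores.shape[1] + 1)
    return ranks

def ranking_metrics(y, scores):
    """Ranking loss, average precision and coverage, Eqs. (19)-(21)."""
    y = np.asarray(y, dtype=bool)
    ranks = ranks_from_scores(scores)
    rl, ap, cov = [], [], []
    for yi, ri in zip(y, ranks):
        rel, irr = ri[yi], ri[~yi]
        # Fraction of (relevant, irrelevant) pairs that are mis-ordered.
        rl.append(np.mean(rel[:, None] > irr[None, :]))
        ap.append(np.mean([(rel <= r).sum() / r for r in rel]))
        cov.append(rel.max())
    return float(np.mean(rl)), float(np.mean(ap)), float(np.mean(cov) - 1)

# Two toy test instances with three labels (hypothetical values).
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
scores = [[0.9, 0.2, 0.4], [0.7, 0.5, 0.1]]
accuracy, f_measure, hamming = set_metrics(y_true, y_pred)
rank_loss, avg_precision, coverage = ranking_metrics(y_true, scores)
```

Separating set-based from ranking-based measures mirrors the two kinds of classifier output they need: a predicted label set $z_i$ for the former and a label ranking for the latter.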

V-C Parameter setting

In this work, two scenarios are considered to evaluate the performance of the proposed method. As mentioned before, in the first scenario, after ranking and selecting the desired number of features, local datasets are transmitted to the edge server. Here, ML-kNN [24] with $k = 10$ is used as the classifier, which is one of the most commonly used learning algorithms in centralized multi-label classification. In the second scenario, the vanilla federated learning algorithm with a multilayer perceptron (MLP) is employed after ranking and selecting the desired number of features.

Throughout our experiments, we use 10 clients, consistent with other single-label federated feature selection methods in the literature. Additionally, the data are non-independent and identically distributed (non-IID) across the clients.

Figure 4: Results for Yeast non-iid dataset with ML-kNN. Panels: (a) Accuracy, (b) F-measure, (c) Hamming Loss, (d) Ranking Loss, (e) Avg-Precision, (f) Coverage.
Figure 5: Results for Birds non-iid dataset with ML-kNN. Panels: (a) Accuracy, (b) F-measure, (c) Hamming Loss, (d) Ranking Loss, (e) Avg-Precision, (f) Coverage.
Figure 6: Results for Scene non-iid dataset with ML-kNN. Panels: (a) Accuracy, (b) F-measure, (c) Hamming Loss, (d) Ranking Loss, (e) Avg-Precision, (f) Coverage.

V-D Results and Analysis

In this study, we consider three criteria (performance, computational complexity, and communication cost) to evaluate the effectiveness of the proposed method compared to five other methods. Performance is evaluated using six metrics: accuracy, F-measure, Hamming loss, ranking loss, average precision, and coverage. Computational complexity is assessed by examining the time complexity of each algorithm, while communication cost is determined based on the size of the dataset.

In the first scenario, we compare the proposed method, the first federated multi-label feature selection method in the literature, with five centralized multi-label feature selection methods: PMFS [12], PPT-MI [6], PMU [14], D2F [13], and SCLS [15]. Our proposed method (FMLFS) ranks features across all clients in a federated manner. In contrast, the other methods operate independently within each client, with no communication between clients or between clients and the server. After ranking features within each client, the desired number of features is selected. Then, the reduced datasets are transmitted from clients to the edge server to feed into the centralized ML-kNN classifier. The results of this scenario are presented in Figs. 4 to 6.

In the second scenario, the proposed method is also compared with the five existing methods. The main difference from the first scenario is that, after feature ranking, the federated learning model is modified to function as a multi-label classifier. The findings of this scenario are depicted in Figs. 7 to 9.

Discussion: In the first scenario, for both the Yeast and Scene datasets, the proposed method demonstrates superior performance across all six evaluation metrics compared to the five other methods and the original dataset without FS. For instance, in the Yeast dataset, FMLFS achieves an accuracy of 0.48 with just 20 features, which is comparable to the performance of the classifier using 90 features without FS on the cloud server. It’s worth noting that the cloud server is at least 10 times farther away than the edge server. Therefore, FMLFS can effectively reduce communication costs while simultaneously improving the learning algorithm’s performance, offering a good trade-off between performance and communication cost. Additionally, in the Yeast dataset, FMLFS demonstrates better results with just 10 features across all evaluation metrics compared to the five other FS methods with 100 features. Moreover, in the Birds dataset, FMLFS outperforms all other methods, although the original dataset yields better results in terms of ranking loss and average precision.

In the second scenario, it is evident that the performance of FMLFS with 10 features in the Yeast and Scene datasets surpasses that of other methods using 100 features. This underscores the ability of the proposed method to provide a good trade-off between performance and computational complexity of the learning algorithm. Furthermore, in the Birds dataset, it achieves comparable or even better performance compared to the original dataset without FS, particularly in terms of average precision, coverage, and ranking loss.

Time complexity analysis: Here, we present the time complexity of FMLFS and the five other compared methods (PMFS, PPT-MI, PMU, D2F, and SCLS) on the client side, which is more important due to the constrained computational capabilities of end-user devices. Let $N$, $D$, and $L$ represent the number of instances, the number of features, and the number of labels, respectively. The time complexity of mutual information and joint entropy is $O(N)$ because accessing all instances is required for probability calculation [3]. The time complexity of the PMFS method is $O(D^3 + D^2 N + DNL)$. PPT-MI computes mutual information between features and each label, thus resulting in a time complexity of $O(ND)$. If we denote the number of selected features in PMU as $k$, its time complexity can be expressed as $O(NDL + kNDL + NDL^2)$. The time complexity of D2F is $O(NDL + kNDL)$, where the feature relevance and feature redundancy terms have time complexities of $O(NDL)$ and $O(kNDL)$, respectively. Also, the time complexity of SCLS is $O(NDL + kND)$. The time complexities of the feature relevance and feature redundancy terms in our proposed method are $O(NDL)$ and $O(ND^2)$, respectively. Therefore, the overall time complexity is $O(NDL + ND^2)$, which is the same as or even less than that of other compared methods.
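The expressions above can be compared numerically by plugging in concrete sizes. The following sketch evaluates only the leading terms with all constants ignored, so the values are rough operation counts and not measured runtimes. It uses the Yeast dimensions from Table I for $N$, $D$, and $L$ (the whole dataset, whereas each client holds only a share of the instances) and a hypothetical $k = 20$ selected features. The function name is an assumption made for illustration.

```python
def leading_term_counts(N, D, L, k):
    """Leading-term operation counts taken from the stated complexities."""
    return {
        "PMFS": D**3 + D**2 * N + D * N * L,
        "PPT-MI": N * D,
        "PMU": N * D * L + k * N * D * L + N * D * L**2,
        "D2F": N * D * L + k * N * D * L,
        "SCLS": N * D * L + k * N * D,
        "FMLFS": N * D * L + N * D**2,
    }

# Yeast dimensions from Table I; k = 20 is a hypothetical choice.
counts = leading_term_counts(N=2417, D=103, L=14, k=20)
for method, value in sorted(counts.items(), key=lambda item: item[1]):
    print(f"{method:7s} {value:>12,d}")
```

Because the PMU, D2F, and SCLS terms depend on $k$, their position relative to FMLFS changes with the number of selected features, so it is worth rerunning the comparison for the value of $k$ of interest.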

Figure 7: Results for Yeast non-iid dataset with FL. Panels: (a) Accuracy, (b) F-measure, (c) Hamming Loss, (d) Ranking Loss, (e) Avg-Precision, (f) Coverage.
Figure 8: Results for Birds non-iid dataset with FL. Panels: (a) Accuracy, (b) F-measure, (c) Hamming Loss, (d) Ranking Loss, (e) Avg-Precision, (f) Coverage.
Figure 9: Results for Scene non-iid dataset with FL. Panels: (a) Accuracy, (b) F-measure, (c) Hamming Loss, (d) Ranking Loss, (e) Avg-Precision, (f) Coverage.

VI Conclusion and Future Works

In this paper, we introduce FMLFS, the first federated multi-label feature selection method. Inspired by federated learning, FMLFS comprises two phases. Firstly, within each client, redundancy of features and relevancy between features and labels are computed based on information theory concepts. Subsequently, upon aggregating the received information from clients at the edge server, the multi-label feature selection task is transformed into a bi-objective optimization problem. Utilizing Pareto-based dominance and crowding distance strategies, features are ranked, and the rankings are sent back to the clients. Finally, users can select the desired number of features based on their application requirements. Then, three real-world datasets are utilized to assess both federated learning and centralized learning algorithms, evaluating the performance of the proposed method. The results demonstrate the ability of the proposed method to achieve a good trade-off between performance, time complexity and communication cost. For instance, in the Yeast dataset, the proposed method achieves superior accuracy by selecting just 10 features compared to other methods using 100 features. As we propose a filter-based method in this study, our future work entails integrating federated learning procedures and embedded feature selection methods for distributed multi-label datasets.

Acknowledgement

This work is funded by a research grant provided by the National Science Foundation (NSF) under grant number 2340075.

References

  [1] A. Mahanipour and H. Khamfroush, “Multimodal multiple federated feature construction method for IoT environments,” in GLOBECOM 2023 - 2023 IEEE Global Communications Conference. IEEE, 2023, pp. 1890–1895.
  [2] R. Zebari, A. Abdulazeez, D. Zeebaree, D. Zebari, and J. Saeed, “A comprehensive review of dimensionality reduction techniques for feature selection and feature extraction,” Journal of Applied Science and Technology Trends, vol. 1, no. 2, pp. 56–70, 2020.
  [3] L. Hu, L. Gao, Y. Li, and P. Zhang, “Feature-specific mutual information variation for multi-label feature selection,” Information Sciences, vol. 593, pp. 449–471, 2022.
  [4] T. Nishio and R. Yonetani, “Client selection for federated learning with heterogeneous resources in mobile edge,” in ICC 2019 - 2019 IEEE International Conference on Communications (ICC). IEEE, 2019, pp. 1–7.
  [5] N. Spolaôr, E. A. Cherman, and M. C. Monard, “A comparison of multi-label feature selection methods using the problem transformation approach,” Electronic Notes in Theoretical Computer Science, vol. 292, pp. 135–151, 2013.
  [6] G. Doquire and M. Verleysen, “Feature selection for multi-label classification problems,” in Advances in Computational Intelligence: 11th International Work-Conference on Artificial Neural Networks, IWANN 2011. Springer, 2011, pp. 9–16.
  [7] O. Reyes, C. Morell, and S. Ventura, “Scalable extensions of the ReliefF algorithm for weighting and selecting features on the multi-label learning context,” Neurocomputing, vol. 161, pp. 168–182, 2015.
  [8] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown, “Learning multi-label scene classification,” Pattern Recognition, vol. 37, no. 9, pp. 1757–1771, 2004.
  [9] G. Tsoumakas, I. Katakis, and I. Vlahavas, “Random k-labelsets for multilabel classification,” IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 7, pp. 1079–1089, 2010.
  [10] J. Read, “A pruned problem transformation method for multi-label classification,” in Proc. 2008 New Zealand Computer Science Research Student Conference (NZCSRS 2008), vol. 143150, 2008, p. 41.
  [11] W. Chen, J. Yan, B. Zhang, Z. Chen, and Q. Yang, “Document transformation for multi-label feature selection in text categorization,” in Seventh IEEE International Conference on Data Mining (ICDM 2007). IEEE, 2007, pp. 451–456.
  [12] A. Hashemi, M. B. Dowlatshahi, and H. Nezamabadi-pour, “An efficient Pareto-based feature selection algorithm for multi-label classification,” Information Sciences, vol. 581, pp. 428–447, 2021.
  [13] J. Lee and D.-W. Kim, “Mutual information-based multi-label feature selection using interaction information,” Expert Systems with Applications, vol. 42, no. 4, 2015.
  [14] J. Lee and D.-W. Kim, “Feature selection for multi-label classification using multivariate mutual information,” Pattern Recognition Letters, vol. 34, no. 3, pp. 349–357, 2013.
  [15] J. Lee and D.-W. Kim, “SCLS: Multi-label feature selection based on scalable criterion for large label set,” Pattern Recognition, vol. 66, pp. 342–352, 2017.
  [16] A. Mahanipour and H. Khamfroush, “Wrapper-based federated feature selection for IoT environments,” in 2023 International Conference on Computing, Networking and Communications (ICNC). IEEE, 2023, pp. 214–219.
  [17] X. Zhang, A. Mavromatics, and A. Vafeas, “Federated feature selection for horizontal federated learning in IoT networks,” IEEE Internet of Things Journal, 2023.
  [18] A. Li, H. Peng, L. Zhang, J. Huang, Q. Guo, H. Yu, and Y. Liu, “FedSDG-FS: Efficient and secure feature selection for vertical federated learning,” arXiv preprint arXiv:2302.10417, 2023.
  [19] Y. Hu, Y. Zhang, D. Gong, and X. Sun, “Multi-participant federated feature selection algorithm with particle swarm optimization for imbalanced data under privacy protection,” IEEE Transactions on Artificial Intelligence, 2022.
  [20] C. Von Lücken, B. Barán, and C. Brizuela, “A survey on multi-objective evolutionary algorithms for many-objective problems,” Computational Optimization and Applications, vol. 58, pp. 707–756, 2014.
  [21] C. R. Raquel and P. C. Naval Jr, “An effective use of crowding distance in multiobjective particle swarm optimization,” in Proceedings of the 7th Annual Conference on Genetic and Evolutionary Computation, 2005, pp. 257–264.
  [22] C. E. Shannon, “A mathematical theory of communication,” ACM SIGMOBILE Mobile Computing and Communications Review, vol. 5, no. 1, pp. 3–55, 2001.
  [23] A. N. Tarekegn, M. Giacobini, and K. Michalak, “A review of methods for imbalanced multi-label classification,” Pattern Recognition, vol. 118, p. 107965, 2021.
  [24] M.-L. Zhang and Z.-H. Zhou, “ML-kNN: A lazy learning approach to multi-label learning,” Pattern Recognition, vol. 40, no. 7, pp. 2038–2048, 2007.