Enhanced Federated Anomaly Detection Through Autoencoders Using Summary Statistics-Based Thresholding

Sofiane Laridi L3S Research Center, Faculty of Electrical Engineering and Computer Science, Leibniz University Hannover, 30167, Germany laridi@l3s.de Gregory Palmer L3S Research Center, Faculty of Electrical Engineering and Computer Science, Leibniz University Hannover, 30167, Germany Kam-Ming Mark Tam Department of Architecture, University of Hong Kong, Knowles Building, Pokfulam Road, Hong Kong SAR

Abstract

In Federated Learning (FL), anomaly detection (AD) is a challenging task due to the decentralized nature of data and the presence of non-IID data distributions. This study introduces a novel federated threshold calculation method that leverages summary statistics from both normal and anomalous data to improve the accuracy and robustness of anomaly detection using autoencoders (AE) in a federated setting. Our approach aggregates local summary statistics across clients to compute a global threshold that optimally separates anomalies from normal data while ensuring privacy preservation. We conducted extensive experiments using publicly available datasets, including Credit Card Fraud Detection, Shuttle, and Covertype, under various data distribution scenarios. The results demonstrate that our method consistently outperforms existing federated and local threshold calculation techniques, particularly in handling non-IID data distributions. This study also explores the impact of different data distribution scenarios and the number of clients on the performance of federated anomaly detection. Our findings highlight the potential of using summary statistics for threshold calculation in improving the scalability and accuracy of federated anomaly detection systems.

Keywords: Federated Learning, Anomaly Detection, Auto-Encoder

Introduction

In Federated Learning (FL), the effectiveness of Anomaly Detection (AD) is crucial due to the decentralized nature of data across multiple clients. Federated Autoencoders (FAE) have emerged as a popular approach for detecting anomalies in such settings, as they allow each client to train local models on their data and collaboratively learn a global model without sharing raw data. However, determining an optimal threshold for AD within this federated framework remains a significant challenge.

Traditional AD methods often derive the threshold from local validation data: the trained AE reconstructs the validation samples, and a threshold separating anomalies from normal data is computed from the resulting reconstruction errors. These methods fail to capture the global data distribution in federated environments, leading to sub-optimal performance (McMahan et al.[1]). They rely on local thresholds that may not adequately reflect the variations in data distributions across FL clients, especially when the data is non-IID (Kairouz et al.[2]). Additionally, many existing methods use only normal data to establish the threshold and neglect the characteristics of anomalies, which compromises their effectiveness in certain scenarios.

To overcome these limitations, this study introduces an FL threshold calculation method that integrates summary statistics from both normal and anomalous data across local validation datasets of all clients, producing a federated threshold that learns the decision boundary more accurately and robustly. Research in this direction has been limited. This study demonstrates that incorporating both summary statistics and anomalies in the threshold determination process enhances the accuracy and robustness of AD in FL (Yang et al.[3]). The research compares the performance of the proposed FL threshold calculation technique against conventional local and federated methods across various FL and data distribution scenarios, examining different numbers of clients, degrees of data distribution skewness, and anomaly rates. The findings also provide insights into the method’s effectiveness across diverse FL environments and data distribution conditions.

In summary, this paper makes the following contributions:

• Proposes a novel approach for calculating the FL threshold for AD using FL clients’ local summary statistics and aggregated global statistics, while preserving privacy.

• Collects and compares several state-of-the-art threshold calculation techniques, thoroughly analyzing their performance under different FL and data distribution scenarios.

• Offers insights into whether a client should adopt the proposed FL threshold or rely on a locally calculated threshold based on the comparison of local and global summary statistics.

Literature Review

FL is widely used in AD due to its effectiveness in handling distributed data while maintaining privacy. Various techniques have been integrated into FL frameworks to detect anomalies. AEs, for example, have been effective in detecting malicious network activities (Li et al.[4]). In IoT networks, FL helps collaboratively train models for AD, with additional security provided by blockchain integration (Ali et al.[5]). Multi-task learning within FL, discussed by Smith et al.[6], addresses task heterogeneity across devices by leveraging diverse data sources. The FedGroup model, for example, computes learning updates from a group of devices, demonstrating its effectiveness in anomaly detection in IoT environments (Li et al.[4]).

However, federated AD faces several challenges. A major challenge is handling non-IID data across clients, which can negatively impact the model’s performance (Bonawitz et al.[7]). Traditional methods often assume IID data, but newer approaches, such as graph-based methods by Liu et al.[8], consider data relationships for better detection. Zhao et al.[9] have focused on optimizing FL systems to deal with non-IID data and improve communication efficiency. Generative Adversarial Networks are used to identify anomalies through reconstruction errors or discriminator scores (Schlegl et al.[10]). Other methods, such as softmax scores from neural network classifiers by Hendrycks & Gimpel[11] or outlier exposure by Liang et al.[12], help improve AD. Combining generative and discriminative models by Nalisnick et al.[13] has also been shown to enhance accuracy. Other advanced neural architectures, like attention-based models by Vaswani et al.[14] and LSTM networks for temporal data by Zhu et al.[15], are used.

In addition, AEs are effective tools for AD, leveraging their ability to reconstruct input data and highlight deviations as anomalies (Xu et al.[16]). By learning compact representations, AEs excel in identifying unusual patterns across various domains, including network security and IoT systems (Li et al.[4]). More complex versions of AEs, such as Variational Autoencoders, enhance this process by modeling probabilistic distributions, improving the detection of subtle anomalies (An & Cho[17]). Different adaptations of AEs address specific data types: convolutional autoencoders for spatial anomalies (Baur et al.[18]) and hybrid models combining AEs with LSTM networks for temporal data (Malhotra et al.[19]). Attention mechanisms integrated into AEs by Zong et al.[20] further refine AD by focusing on the most relevant data features. Additionally, combining AEs with Generative Adversarial Networks (GANs) has shown robust performance in AD through enhanced reconstruction quality (Zenati et al.[21]). Despite all the mentioned AE variations, fully connected AEs are preferred for handling large-scale data due to their simplicity, interpretability, and efficiency (Hinton & Salakhutdinov[22]).

Because AEs rely on reconstruction error, effective AD requires an appropriate threshold calculation technique. In FL, this task is particularly challenging due to the distributed nature of data across multiple clients. Therefore, different threshold calculation methods for AEs have been explored. Most of the methods we found calculate the threshold locally by focusing on client-specific data without federated aggregation. For instance, the Local Kernel Quantile Estimator (KQE) (Huong et al.[23]) sets thresholds based on reconstruction error quantiles using a kernel estimator. The Local Interquartile Range (IQR) method determines thresholds using the interquartile range of reconstruction errors. Similarly, the Local Percentile method (Percentile) uses a specific percentile of the error distribution to establish thresholds (Novoa-Paradela et al.[24]). Other local methods include the Local Largest MSE (Largest-MSE) (Sáez-de-Cámara et al.[25]), which bases thresholds on the highest observed MSE. The Local Peak Over Threshold (POT) method identifies anomalies by focusing on errors that exceed a high quantile (Kea et al.[26]). Schlegl et al.[27] calculate a threshold by generating multiple thresholds between the minimum and maximum reconstruction errors of the validation data (Local-MinMax), then selecting the threshold with the highest F1 score.
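To make the simpler local rules concrete, the following minimal sketch computes four of them (Percentile, IQR, Largest-MSE, and mean plus standard deviation) from one client's reconstruction errors. The function names, the 95th percentile, and the 1.5 IQR factor are illustrative assumptions and are not taken from the cited works.

```python
import statistics


def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def local_thresholds(errors, pct=95.0, iqr_factor=1.5):
    """Local thresholds from one client's reconstruction errors.

    pct and iqr_factor are assumed defaults, not values from the paper.
    """
    q1, q3 = percentile(errors, 25), percentile(errors, 75)
    return {
        "percentile": percentile(errors, pct),
        "iqr": q3 + iqr_factor * (q3 - q1),
        "largest_mse": max(errors),
        "mse_std": statistics.fmean(errors) + statistics.pstdev(errors),
    }


# Hypothetical reconstruction errors with one clear outlier.
errors = [0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.05, 0.06, 0.08, 0.40]
thresholds = local_thresholds(errors)
```

All four rules use only the client's own errors and no anomaly labels, which is the limitation discussed above.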

Federated threshold calculation techniques have been explored as well. In one common approach, Wang et al.[28] set a federated threshold by averaging the mean-squared error plus the standard deviation across clients (Fed-MSE-StD). While straightforward, this method can be less effective since it does not consider anomalies directly in the threshold calculation process. Sánchez et al.[29] calculate the mean and standard deviation of the local thresholds and filter out thresholds with a z-score greater than 1.5 (Fed-Filtered). The global threshold is then set as the maximum of the remaining filtered thresholds. A shared limitation of these two techniques is that they neglect actual anomalies in the federated threshold calculation. An alternative federated approach by Pourahmadi et al.[30] involves generating candidate thresholds between the global minimum and maximum reconstruction errors (Fed-MinMax), with clients selecting the optimal threshold based on F1 scores. However, the interval between the global minimum and maximum can be very wide, especially in non-IID data distributions, and a manually chosen number of candidates might miss the optimal threshold. Additionally, this method does not weight the candidates’ F1 scores during aggregation, so clients with large validation sets and clients with minimal validation sets have equal influence. This can lead to a sub-optimal global threshold.
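As an illustration of the two label-free federated baselines, the sketch below follows the descriptions given above. It is not code from the cited papers: the function names are hypothetical, population standard deviations are used, and the z-score filter is assumed to be two-sided.

```python
import statistics


def fed_mse_std(client_errors):
    """Fed-MSE-StD: average over clients of (mean error + std of errors)."""
    local = [statistics.fmean(e) + statistics.pstdev(e) for e in client_errors]
    return statistics.fmean(local)


def fed_filtered(local_thresholds, z_max=1.5):
    """Fed-Filtered: drop local thresholds with |z| > z_max, keep the maximum.

    Using the absolute z-score is an assumption; the text only states that
    thresholds with a z-score greater than 1.5 are filtered out.
    """
    mean = statistics.fmean(local_thresholds)
    std = statistics.pstdev(local_thresholds)
    if std == 0:
        return max(local_thresholds)
    kept = [t for t in local_thresholds if abs(t - mean) / std <= z_max]
    return max(kept)


# Hypothetical inputs: two clients' errors, and five local thresholds
# of which the last one is an outlier.
global_mse_std = fed_mse_std([[1.0, 3.0], [2.0, 2.0]])
global_filtered = fed_filtered([0.10, 0.11, 0.12, 0.13, 0.90])
```

In the second call the outlying threshold 0.90 has a z-score of roughly 2 and is removed, so the maximum of the remaining values is returned.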

The limitations of these methods underscore the need for a more automated and equitable threshold calculation method that can account for the diverse data distributions across clients and provide a balanced approach in FL. We assume that summary statistics, such as mean, variance, skewness, and kurtosis, can offer valuable insights into the distribution of the clients’ validation data, helping to develop thresholds that are more representative of the overall data landscape across FL clients. One of the significant advantages of using summary statistics in FL is their privacy-preserving nature. Unlike raw data, which contains detailed information about individual data points, summary statistics aggregate this information into a form that reveals general trends without exposing specific data values (McMahan et al.[1]).

We have listed in Table 1 the different state-of-the-art methods mentioned, along with their characteristics and limitations, for better clarification.

Thresholding Method	Federated	Anomalies	Statistics Used	Local Data Distribution Consideration
Our Method	✓	✓	Mean, Variance, etc.	✓
Fed-MinMax	✓	✓	Min/Max	✗
Fed-MSE-StD	✓	✗	Mean, StD	✗
Fed-Filtered	✓	✗	Mean, StD	✗
Local-MinMax	✗	✓	MSE, Percentile	✗
KQE	✗	✗	—	✗
IQR	✗	✗	IQR	✗
Percentile	✗	✗	Percentile	✗
Largest-MSE	✗	✗	MSE	✗
POT	✗	✗	High Quantile	✗
Local-MSE-Std	✗	✗	Mean, StD	✗

Table 1: SOTA Threshold Calculation Approaches

Problem Definition

Federated AD using FAE aims to collaboratively identify anomalies in distributed data without sharing raw data between clients. The key challenge is to determine an optimal global threshold $\theta_{\text{global}}$ for AD, leveraging each client’s local validation data while preserving privacy.

Notations

• $D_{\text{train},i}$ : Training dataset for client $i$

• $M_{t}$ : Global FAE model at round $t$

• $M_{t,i}$ : Local model for client $i$ at round $t$

• $D_{\text{val},i}$ : Validation dataset for client $i$ containing both normal and anomalous samples

• $\hat{D}_{\text{val},i}$ : Reconstructed validation dataset for client $i$

• $E_{i}$ : Array of reconstruction errors for client $i$, where $E_{i}=\{e_{i,1},e_{i,2},\ldots,e_{i,n}\}$

• $\mu_{i}$ : Mean of the reconstruction errors for client $i$

• $\sigma_{i}^{2}$ : Variance of the reconstruction errors for client $i$

• $S_{i}$ : Skewness of the reconstruction errors for client $i$

• $K_{i}$ : Kurtosis of the reconstruction errors for client $i$

• $N_{i}$ : Number of samples in the validation dataset for client $i$

• $\mu_{\text{global}}$ : Global mean of the reconstruction errors

• $\sigma^{2}_{\text{global}}$ : Global variance of the reconstruction errors

• $S_{\text{global}}$ : Global skewness of the reconstruction errors

• $K_{\text{global}}$ : Global kurtosis of the reconstruction errors

• $T$ : Array of $n$ thresholds, $T=\{\mu_{1},\mu_{2},\ldots,\mu_{n}\}$, determined based on the overlap region

• $F_{i}$ : Array of F1 scores for client $i$ corresponding to thresholds $T$

• $F_{\text{avg}}$ : Array of average F1 scores for each threshold across all clients

• $\theta_{\text{global}}$ : Global threshold with the highest average F1 score

Given

1. A FAE model $M$ trained on distributed training data $D_{\text{train}}$.

2. Each client’s local validation data $D_{\text{val},i}$ with corresponding reconstruction errors $E_{i}$.

Objective

To compute a global threshold $\theta_{\text{global}}$ that optimally separates anomalies from normal data across all clients while addressing the following challenges:

• Data Privacy: Each client’s validation data $D_{\text{val},i}$ is private and cannot be shared with other clients or the server.

• Data Distribution: Each client may have different data distributions, leading to variations in reconstruction errors $E_{i}$.

Method

Our methodology for federated anomaly detection using autoencoders is divided into two main steps: Federated Autoencoder Training and Federated Threshold Calculation. The detailed steps of our methodology are outlined below:

Federated Auto-Encoder Training

The first step involves training a FAE model using the Federated Averaging (FedAvg) algorithm to aggregate the clients’ model weights, as illustrated in Algorithm 1. Model synchronization between clients and the server occurs at the beginning of each round, where the server broadcasts the global model to all clients, and at the end of each round, clients send their updated weights back to the server for aggregation, following the FedAvg paradigm. We selected a fully connected AE due to its proven effectiveness in AD, particularly for high-dimensional datasets such as Shuttle, Covertype, and Credit Card Fraud Detection (Sakurada & Yairi[31]).

Algorithm 1 Federated Auto-Encoder Training with FedAvg Aggregation

1: Input: $D_{\text{train},i}$ : Training dataset for client $i$; $N$ : Number of clients; $E$ : Local epochs; $\eta$ : Learning rate; $R$ : Rounds
2: Output: $M$ : Trained global autoencoder model
3: Initialize global model $M_{0}$
4: for each round $t=1,\ldots,R$ do
5:   Server sends global model $M_{t-1}$ to all clients
6:   for each client $i=1,\ldots,N$ in parallel do
7:     Initialize local model $M_{t,i}$ with $M_{t-1}$ weights
8:     Train $M_{t,i}$ on $D_{\text{train},i}$ for $E$ epochs using learning rate $\eta$
9:     Send local updates $\Delta M_{t,i}$ to server
10:   end for
11:   Server aggregates the local updates $\Delta M_{t,i}$ into a global update $\Delta M_{t}$
12:   Update global model: $M_{t}\leftarrow M_{t-1}+\Delta M_{t}$
13: end for
14: Return: Trained global model $M_{R}$
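The aggregation step of Algorithm 1 can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: model weights are represented as flat lists of floats, the function names are hypothetical, and clients are weighted by their number of training samples, as in the standard formulation of FedAvg.

```python
def fedavg(client_weights, client_sizes):
    """Sample-size-weighted average of the clients' weight vectors."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(n_params)
    ]


def apply_updates(prev_global, client_deltas, client_sizes):
    """Equivalent form: add the averaged client updates to the previous model."""
    avg_delta = fedavg(client_deltas, client_sizes)
    return [p + d for p, d in zip(prev_global, avg_delta)]


# Hypothetical round with two clients holding 1 and 3 samples.
prev_global = [0.5, 0.5]
client_weights = [[1.0, 2.0], [3.0, 4.0]]
client_sizes = [1, 3]
client_deltas = [[w - p for w, p in zip(cw, prev_global)] for cw in client_weights]

new_global = fedavg(client_weights, client_sizes)
same_global = apply_updates(prev_global, client_deltas, client_sizes)
```

Averaging the weights directly and adding the averaged updates to the previous global model give the same result, which is why both formulations appear in descriptions of FedAvg.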

Figure 1: Federated Threshold Calculation using Summary Statistics

Summary Statistics-Based Threshold Selection

Figure 1 illustrates the workflow of our federated AD approach using FAE, emphasizing how thresholds are calculated using aggregated summary statistics. The process is divided into several key steps:

Prediction and Extraction of Summary Statistics

Each client uses the trained AE model $M$ to compute reconstruction errors $E_{i}$ on its local validation dataset $D_{\text{val},i}$. Based on these errors, the client calculates summary statistics $SS_{i}$, including the mean ($\mu_{i}$), variance ($\sigma_{i}^{2}$), skewness ($S_{i}$), and kurtosis ($K_{i}$). The statistics are kept separately for normal and anomalous validation samples, since the server later works with the global moments of each group.
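A minimal sketch of this client-side step is given below. It assumes that the reconstruction error of a sample is its mean squared error, and that the summary statistics are population moments with non-excess kurtosis; these conventions and the function names are assumptions made for illustration.

```python
def reconstruction_errors(inputs, reconstructions):
    """Per-sample mean squared error between inputs and AE outputs."""
    return [
        sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
        for a, b in zip(inputs, reconstructions)
    ]


def summary_statistics(errors):
    """Mean, variance, skewness, kurtosis and count of one error array."""
    n = len(errors)
    mean = sum(errors) / n
    m2 = sum((e - mean) ** 2 for e in errors) / n
    m3 = sum((e - mean) ** 3 for e in errors) / n
    m4 = sum((e - mean) ** 4 for e in errors) / n
    return {
        "mean": mean,
        "variance": m2,
        "skewness": m3 / m2 ** 1.5 if m2 > 0 else 0.0,
        "kurtosis": m4 / m2 ** 2 if m2 > 0 else 0.0,
        "count": n,
    }


def client_summary(errors, labels):
    """Separate statistics for normal (label 0) and anomalous (label 1) samples."""
    normal = [e for e, y in zip(errors, labels) if y == 0]
    anomaly = [e for e, y in zip(errors, labels) if y == 1]
    return {
        "normal": summary_statistics(normal),
        "anomaly": summary_statistics(anomaly),
    }


stats = summary_statistics([1.0, 2.0, 3.0, 4.0, 5.0])
errs = reconstruction_errors([[1.0, 2.0]], [[1.0, 4.0]])
```

Only the dictionaries returned by `client_summary` would leave the client; the individual errors stay local.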

Weighted Aggregation of Summary Statistics and Threshold Selection

After aggregating the summary statistics from all clients, the server identifies the overlap region where the distributions of normal and anomalous reconstruction errors intersect. This overlap region aids in isolating reconstruction error outliers by focusing on thresholds generated within this specific range.

The initial overlap region is determined from the global means ($\mu_{\text{normal}}$, $\mu_{\text{anomaly}}$) and standard deviations ($\sigma_{\text{normal}}$, $\sigma_{\text{anomaly}}$) of the normal and anomalous distributions: its lower and upper bounds are obtained by intersecting the three-standard-deviation intervals around the two means (Algorithm 2). These bounds are subsequently fine-tuned by adjusting for skewness and kurtosis to ensure that the region accurately reflects the data’s distributional shape. Skewness compensates for asymmetry in the data, while kurtosis adjusts the bounds to account for tail behavior.
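The initial bounds can be sketched as below. Only the three-standard-deviation intersection is shown; the skewness and kurtosis adjustments are omitted because their exact form is not specified here. The function name and the choice to return `None` when the two intervals do not intersect are assumptions.

```python
def overlap_region(mu_normal, sigma_normal, mu_anomaly, sigma_anomaly, width=3.0):
    """Intersection of the +/- width*sigma intervals of the two error distributions."""
    lower = max(mu_normal - width * sigma_normal,
                mu_anomaly - width * sigma_anomaly)
    upper = min(mu_normal + width * sigma_normal,
                mu_anomaly + width * sigma_anomaly)
    if lower > upper:
        return None  # assumed convention: the intervals do not intersect
    return lower, upper


# Hypothetical global moments: normal errors are small and tight,
# anomalous errors are larger and more spread out.
region = overlap_region(0.1, 0.05, 0.5, 0.2)
```

With these hypothetical values the region spans roughly from -0.05 to 0.25, so candidate thresholds would be concentrated where the two error distributions actually compete.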

Threshold Candidates Generation

Within the refined overlap region, the server generates a set of candidate thresholds. These candidates are typically spaced evenly across the overlap range, with the number of candidates determined by a predefined parameter (e.g., 1000 candidates). The generated thresholds are then distributed to all clients for evaluation.
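Evenly spaced candidates over an interval can be generated as in the following sketch; the function name is hypothetical, and the default of 1000 candidates mirrors the example value mentioned above.

```python
def candidate_thresholds(lower, upper, n=1000):
    """Return n evenly spaced thresholds from lower to upper, inclusive."""
    if n < 2:
        return [lower]
    step = (upper - lower) / (n - 1)
    return [lower + k * step for k in range(n)]


coarse = candidate_thresholds(0.0, 1.0, n=5)
fine = candidate_thresholds(-0.05, 0.25)
```

The spacing between neighbouring candidates is the region width divided by n - 1, so a wider overlap region needs more candidates to keep the same resolution.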

F1 Score Calculation and Aggregation

Each client evaluates these candidate thresholds using its local validation data, calculating the F1 score for each threshold. The server aggregates these F1 scores across all clients to select the global threshold $\theta_{\text{global}}$ , which maximizes the overall F1 score.
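The evaluation and selection steps can be sketched as follows. The sketch assumes that a sample is flagged as anomalous when its reconstruction error is strictly greater than the threshold and that labels are 1 for anomalies and 0 for normal samples; the server-side aggregation is the plain per-threshold average of Algorithm 2, and the function names are hypothetical.

```python
def f1_score(errors, labels, threshold):
    """F1 of the rule 'error > threshold means anomaly' on one client's data."""
    tp = sum(1 for e, y in zip(errors, labels) if e > threshold and y == 1)
    fp = sum(1 for e, y in zip(errors, labels) if e > threshold and y == 0)
    fn = sum(1 for e, y in zip(errors, labels) if e <= threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def client_f1_array(errors, labels, thresholds):
    """Client side: one F1 score per candidate threshold."""
    return [f1_score(errors, labels, t) for t in thresholds]


def select_global_threshold(thresholds, client_f1):
    """Server side: average the F1 arrays and pick the best candidate."""
    n_clients = len(client_f1)
    avg = [sum(f[j] for f in client_f1) / n_clients
           for j in range(len(thresholds))]
    best = max(range(len(thresholds)), key=avg.__getitem__)
    return thresholds[best], avg[best]


# Two hypothetical clients, each with two normal and two anomalous samples.
thresholds = [0.1, 0.3, 0.5]
f1_a = client_f1_array([0.05, 0.2, 0.4, 0.6], [0, 0, 1, 1], thresholds)
f1_b = client_f1_array([0.1, 0.25, 0.35, 0.7], [0, 0, 1, 1], thresholds)
theta_global, best_f1 = select_global_threshold(thresholds, [f1_a, f1_b])
```

Weighting each client's F1 array by its validation-set size before averaging would be a small change to `select_global_threshold`.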

The detailed steps of this method are illustrated in Algorithm 2.

Algorithm 2 Federated Threshold with Weighted Aggregation

1: Input: $D_{\text{val},i}$ : Validation dataset for client $i$
2: Output: $\theta_{\text{global}}$ : Global threshold
3: for each client $i$ do
4:   Use the trained model $M$ to reconstruct $D_{\text{val},i}$
5:   Compute reconstruction errors $E_{i}$ for each sample in $D_{\text{val},i}$
6:   Calculate summary statistics for $E_{i}$ : mean $\mu_{i}$, variance $\sigma_{i}^{2}$, skewness $S_{i}$, kurtosis $K_{i}$, and count $N_{i}$
7: end for
8: for each client $i$ do
9:   Send summary statistics $\mu_{i}$, $\sigma_{i}^{2}$, $S_{i}$, $K_{i}$, and $N_{i}$ to the server.
10: end for
11: Server computes the global summary statistics for both the normal and the anomalous reconstruction errors using weighted aggregation:
12:   $\mu_{\text{global}}=\frac{\sum_{i=1}^{k}N_{i}\mu_{i}}{\sum_{i=1}^{k}N_{i}}$
13:   $\sigma^{2}_{\text{global}}=\frac{\sum_{i=1}^{k}N_{i}\left(\sigma_{i}^{2}+(\mu_{i}-\mu_{\text{global}})^{2}\right)}{\sum_{i=1}^{k}N_{i}}$
14:   $S_{\text{global}}=\frac{\sum_{i=1}^{k}N_{i}S_{i}\cdot\sqrt{N_{i}}\cdot\left(\frac{\sigma_{\text{global}}}{\sigma_{i}}\right)^{3}}{\sum_{i=1}^{k}N_{i}}$
15:   $K_{\text{global}}=\frac{\sum_{i=1}^{k}N_{i}K_{i}\cdot N_{i}\cdot\left(\frac{\sigma_{\text{global}}}{\sigma_{i}}\right)^{4}}{\sum_{i=1}^{k}N_{i}}$
16: Server determines the overlap region:
17:   $\text{Lower Bound}=\max(\mu_{\text{normal}}-3\sigma_{\text{normal}},\mu_{\text{anomaly}}-3\sigma_{\text{anomaly}})$
18:   $\text{Upper Bound}=\min(\mu_{\text{normal}}+3\sigma_{\text{normal}},\mu_{\text{anomaly}}+3\sigma_{\text{anomaly}})$
19: Server generates an array of $n$ thresholds $T=\{\mu_{1},\mu_{2},\ldots,\mu_{n}\}$ within the overlap region.
20: Server sends the threshold array $T$ to each client.
21: for each client $i$ do
22:   for each threshold $\mu_{j}$ in $T$ do
23:     Calculate F1 scores $F_{i}$ for each threshold $\mu_{j}$
24:   end for
25: end for
26: for each client $i$ do
27:   Send F1 score array $F_{i}$ to the server.
28: end for
29: Server calculates average F1 scores $F_{\text{avg}}$ for each threshold.
30: Server identifies the threshold $\theta_{\text{global}}$ with the highest average F1 score in $F_{\text{avg}}$
31: Server sends the global threshold $\theta_{\text{global}}$ to all clients.
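The weighted aggregation of the mean and variance (steps 12 and 13 of Algorithm 2) can be checked numerically. The sketch below pools per-client counts, means, and population variances and compares the result with the moments of the concatenated data, which the server never sees. The function name and the sample values are hypothetical; the skewness and kurtosis aggregation is not covered.

```python
import statistics


def pool_mean_variance(client_stats):
    """Pool (count, mean, population variance) triples from all clients."""
    total = sum(n for n, _, _ in client_stats)
    mean = sum(n * m for n, m, _ in client_stats) / total
    var = sum(n * (v + (m - mean) ** 2) for n, m, v in client_stats) / total
    return mean, var


# Two hypothetical clients with very different error levels.
errors_a = [1.0, 2.0, 3.0]
errors_b = [10.0, 20.0]

client_stats = [
    (len(e), statistics.fmean(e), statistics.pvariance(e))
    for e in (errors_a, errors_b)
]
pooled_mean, pooled_var = pool_mean_variance(client_stats)

# Reference values computed on the concatenated data.
all_errors = errors_a + errors_b
ref_mean = statistics.fmean(all_errors)
ref_var = statistics.pvariance(all_errors)
```

When clients report population variances, the pooled values match the centralized ones up to floating-point rounding, so no raw reconstruction errors need to be shared for these two moments.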

Results

In this study, we evaluate our federated thresholding method using three publicly available datasets: Credit Card Fraud Detection (284,807 samples, 492 anomalies, 29 dimensions), Shuttle (49,097 samples, 3,511 anomalies, 9 dimensions), and Covertype (581,012 samples, 2,747 anomalies, 10 dimensions). The datasets were sourced from the UCI Machine Learning Repository and Kaggle. The following steps outline our data preparation and distribution for simulating a federated learning environment.

• Normalization and Scaling: All datasets were normalized and scaled to ensure consistency across features. This preprocessing step is crucial for the performance of the AE, enabling it to effectively reconstruct normal data for AD across different clients.

• Train-Validation-Test Splitting: Each dataset was split into training, validation, and test sets. The training data consisted solely of normal samples, as the AE is trained to model normal data behavior. Both validation and test sets included a mixture of normal and anomalous data, which were used to assess the thresholding methods. For scalability experiments, we varied the number of clients from 2 to 50, distributing the data evenly across clients to study the effect of client count on model performance (Bonawitz et al.[7]).

• Federated Learning Splitting on Clients: To explore various federated learning scenarios, the data was split across clients in two ways:

  – Evenly Distributed Data: The training, validation, and test data were uniformly distributed across all clients. This uniform distribution served as a baseline to assess our method under ideal and balanced conditions, which is a standard assumption in many federated learning experiments (Li et al.[4]).

  – Non-IID Data: For the Shuttle and Covertype datasets, which include multiple classes, we designated one class as anomalous while treating the remaining classes as normal data. The normal data was distributed across clients, with each client receiving data from different classes. The anomalous data was divided among clients using the k-means clustering algorithm to ensure a diverse distribution of anomalies. Since these datasets include seven classes, we assigned six clients for normal data and distributed the anomalous data among them. To maintain consistency, the Credit Card Fraud Detection dataset, which is binary, was also split among six clients by applying the k-means clustering algorithm (Ahmed et al.[32]) to both normal and anomalous data, thus simulating a highly non-IID scenario similar to that of the other datasets (Zhao et al.[9]).
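The two client-splitting schemes can be sketched as follows. This is a simplified illustration, not the experimental code: the function names are hypothetical, and the k-means assignment of anomalies is replaced by a plain round-robin assignment to keep the example dependency-free.

```python
from collections import defaultdict


def split_evenly(samples, n_clients):
    """Evenly distributed data: deal samples to clients in round-robin order."""
    return [samples[i::n_clients] for i in range(n_clients)]


def split_by_class(samples, labels, anomaly_class):
    """Non-IID data: one client per normal class, anomalies shared out.

    The round-robin assignment of anomalies is a stand-in for the k-means
    based split described in the text.
    """
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    anomalies = by_class.pop(anomaly_class, [])
    clients = [{"normal": by_class[c], "anomalies": []} for c in sorted(by_class)]
    for k, x in enumerate(anomalies):
        clients[k % len(clients)]["anomalies"].append(x)
    return clients


# Hypothetical toy data: four classes, class 3 plays the role of the anomaly.
even = split_evenly(list(range(6)), 3)
non_iid = split_by_class(list(range(8)), [0, 0, 1, 1, 2, 2, 3, 3], anomaly_class=3)
```

With seven classes and one designated as anomalous, `split_by_class` yields six clients, matching the setup used for Shuttle and Covertype.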

Method	Shuttle	Credit Card	Cover
Our Method	0.9873	0.8985	0.8440
Fed Threshold	0.9861	0.8972	0.8395
Fed Mean MSE + StD	0.9844	0.8959	0.4003
Fed Filtered Threshold	0.9865	0.8881	0.3637
Local Iterative	0.9836	0.8878	0.8228
Local Inter Quantile Range	0.9773	0.8862	0.7318
Local Percentile	0.9657	0.8865	0.5273
Local Kernel Quantile Estimator	0.9775	0.8768	0.4512
Local Max MSE	0.9849	0.8594	0.3584
Local Mean MSE + Std	0.9619	0.8817	0.4828
Local Peak Over Threshold	0.9802	0.8125	0.4551

Table 2: Average F1 Scores Across Different Methods Using Evenly Distributed Data

Method	Shuttle	Credit Card	Cover
Our Method	0.9251	0.8725	0.8351
Fed Threshold	0.9033	0.8705	0.8383
Fed Mean MSE + StD	0.3841	0.8556	0.7121
Fed Filtered Threshold	0.4231	0.8460	0.7542
Local Iterative	0.9120	0.8712	0.8558
Local Inter Quantile Range	0.4628	0.8416	0.7605
Local Percentile	0.4742	0.8419	0.7890
Local Kernel Quantile Estimator	0.4781	0.8347	0.8027
Local Max MSE	0.3546	0.8143	0.7803
Local Mean MSE + Std	0.4614	0.8363	0.8088
Local Peak Over Threshold	0.3615	0.7938	0.7822

Table 3: Average F1 Scores Across Different Methods Using Non-IID Data

Data Distribution

Even Distribution

The results demonstrate that our method for generating thresholds is effective across various data distribution scenarios. In the evenly distributed data setting (Table 2), our method achieves the highest F1 score on every dataset: 0.9873 on Shuttle, 0.8985 on Credit Card, and 0.8440 on Covertype. Additionally, our method remains highly reliable, maintaining strong performance even as the number of clients increases. The Fed Mean MSE + StD method performs well on Shuttle (0.9844) and Credit Card (0.8959) but drops to 0.4003 on Covertype. Local methods such as Local Inter Quantile Range and Local Percentile exhibit greater variability in their F1 scores, rendering them less consistent and reliable in distributed data scenarios. Overall, these results confirm that federated approaches, particularly our proposed method, are highly effective in managing evenly distributed data across multiple clients.

Non-IID Distribution

In the Non-IID data scenario (Table 3), where clients have heterogeneous data distributions, our method achieves the highest F1 scores on the Shuttle (0.9251) and Credit Card (0.8725) datasets and remains competitive on Covertype (0.8351). Notably, the Local-MinMax method (listed as Local Iterative in Table 3) remains highly competitive, particularly in the Credit Card dataset where it attains an F1 score of 0.8712, nearly matching the performance of our method and surpassing other federated approaches. In the Covertype dataset, the Local-MinMax method achieves an F1 score of 0.8558, outperforming all other methods, including our own and the other federated ones.

These findings suggest that while the federated approach effectively aggregates information across clients to establish a robust global threshold, certain local methods, particularly Local-MinMax Thresholding, can more effectively address client-specific data variations. This is especially evident in datasets like Credit Card and Covertype, where data distributions vary significantly across clients.

Given these insights, it is evident that there are scenarios where a client may benefit more from utilizing a local threshold rather than a federated one. This observation underscores the importance of investigating and understanding the conditions under which a client should prefer the federated threshold over its own local threshold. Future analyses will focus on the summary and aggregated statistics utilized in our federated threshold calculation to determine the optimal thresholding strategy for different client scenarios.

Random Distribution

To evaluate the robustness of our method under varying data distribution conditions, we employed diverse random distributions with varying numbers of samples per client. This approach allowed us to simulate a wide range of scenarios that may occur in federated learning environments. Figure 2 presents the boxplots generated from these experiments, illustrating the F1 scores obtained across different random setups.

The width and spread of the boxplots provide insights into the consistency and robustness of each threshold calculation method. Our proposed method is represented by consistently narrower boxplots, indicating lower variance in F1 scores across different random distribution scenarios. This narrow spread suggests that our method is both robust and reliable, maintaining high performance regardless of variability in data distribution among clients.

In contrast, the wider boxplots associated with some of the other methods indicate greater variability in performance. This variability suggests that these methods are more sensitive to changes in data distribution, resulting in less consistent outcomes. Additionally, the presence of outliers in these boxplots further highlights the instability of these methods under certain random distribution setups.

Overall, the relatively narrow boxplots of our method demonstrate its superior robustness, as it maintains high and stable F1 scores across a diverse set of random distribution scenarios. This underscores the adaptability and effectiveness of our federated thresholding approach, even under challenging and unpredictable data conditions.

Global Anomaly and Normal Reconstruction Data Overlap Analysis

In addition to performance evaluation, the global reconstruction errors, as depicted in Figure 3, provide valuable insights into the functioning of our method under different data distribution scenarios. The visualizations reveal that in evenly distributed data scenarios (top row), the overlap region between normal and anomalous reconstruction errors is smaller compared to Non-IID scenarios (bottom row). This smaller overlap suggests that in evenly distributed data, fewer threshold candidates are needed to accurately distinguish between normal and anomalous data points.

Conversely, the larger overlap observed in Non-IID scenarios indicates that a higher number of threshold candidates may be necessary to achieve similar accuracy. This difference highlights the necessity of adapting the threshold generation process based on the data distribution. Specifically, more candidates are likely required in Non-IID settings to manage the increased overlap between normal and anomalous errors effectively.

These findings emphasize the importance of considering data distribution characteristics when designing thresholding methods in federated learning. By accounting for the extent of overlap between normal and anomalous reconstruction errors, our method can dynamically adjust the threshold selection process to maintain high detection accuracy across diverse and complex data environments.

Scalability

Number of Clients

The results from the Shuttle, Covertype, and Credit Card Fraud Detection datasets, as shown in Figure 4, demonstrate the effectiveness of various thresholding techniques. Our Fed Threshold method consistently achieves the highest F1 scores across all datasets and varying numbers of clients, underscoring its robustness and reliability. Specifically, for the Shuttle dataset, the Fed Threshold method maintains a high F1 score close to 0.99 regardless of the number of clients. The Fed Mean MSE + StD method also performs comparably well, suggesting that these federated approaches effectively aggregate client data to sustain performance.

In contrast, other methods such as Local Inter Quantile Range, Local Percentile, Local Kernel Quantile Estimator, and Local Peak Over Threshold generally exhibit lower F1 scores. These methods show some improvement in the Credit Card Fraud Detection and Shuttle datasets as the number of clients increases but fail to match the consistency and high performance of the federated approaches. Conversely, in the Shuttle dataset, local methods—especially Local-MSE + StD—show a slight decrease in F1 scores as the number of clients increases. This decline occurs because, as the data is partitioned among more clients and given the extreme imbalance of the Shuttle dataset, each client receives fewer and less representative data points to calculate reliable thresholds.

For the Covertype dataset, our Fed Threshold and Fed Mean MSE + StD methods maintain high F1 scores near 0.9, irrespective of the number of clients. Other methods, including Local Kernel Quantile Estimator and Local Peak Over Threshold, exhibit significant fluctuations and lower F1 scores. The performance of these alternative methods does not consistently improve with an increasing number of clients, highlighting the stability and effectiveness of the federated approaches.

The Credit Card Fraud Detection dataset results further reinforce these observations. The Fed Threshold and Fed Mean MSE + StD methods consistently demonstrate high F1 scores around 0.9, with minimal performance degradation as the number of clients increases. In contrast, methods such as Local Max MSE and Local Peak Over Threshold display considerable variability and generally lower F1 scores, emphasizing their instability and less effective performance in a federated setting.

Overall, the Fed Threshold method consistently outperforms other thresholding techniques across all datasets and client numbers. Its ability to maintain high performance and scalability makes it highly suitable for real-world scenarios where anomaly detection must be aggregated from multiple clients. The Fed Mean MSE + StD method also exhibits strong performance, albeit slightly less consistent than our Fed Threshold method.

When evaluating the performance of our Fed Threshold approach with varying numbers of clients, it remains highly effective, exhibiting minimal degradation in F1 scores as the number of clients increases. This demonstrates the robustness of the Fed Threshold method in maintaining accuracy, even as the client count grows, making it well-suited for real-world federated learning scenarios involving numerous clients.

Execution Time

The execution time results across the Credit Card Fraud Detection, Covertype, and Shuttle datasets, as shown in Figure 5, reveal important trade-offs between accuracy and computational cost. The Fed Threshold method (solid blue line) exhibits the highest execution time, increasing significantly as the number of clients grows, reaching approximately 3 seconds for 50 clients. This increase reflects the computational complexity associated with aggregating and processing data across clients to calculate a global threshold. Although more resource-intensive, this method provides superior accuracy.

The Fed Mean MSE + StD method (solid red line) also shows increased execution time as the number of clients rises, though it remains slightly lower than the Fed Threshold method. This approach balances relatively high accuracy with more moderate computational demands. Similarly, the Fed Filtered Threshold method (solid green line) follows a comparable trend, indicating similar efficiency but with slightly lower accuracy.

In contrast, local methods such as Local Iterative, Local IQR, and Local Percentile (various dashed lines) maintain significantly lower and more stable execution times as the number of clients increases. These methods are computationally efficient because they operate on local data without the need for extensive aggregation, but they do so at the expense of lower accuracy.

Overall, federated methods such as Fed Threshold and Fed Mean MSE + StD are more computationally demanding but offer higher accuracy, making them suitable for scenarios where precision is critical. Conversely, local methods provide faster execution times, which may be preferable in resource-constrained environments or situations requiring quick decisions.

While the federated methods demonstrate robustness in maintaining performance as the number of clients increases, we also observed a significant increase in execution time, particularly in the Credit Card Fraud Detection dataset, where execution time grew by 460% from 0.5 seconds to 2.8 seconds as the number of clients increased from 10 to 50. This highlights a trade-off between computational cost and accuracy, indicating that while these methods are scalable in terms of performance, their scalability in terms of computational efficiency may present challenges in resource-constrained environments.
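For clarity, the 460% figure is a relative increase, not a ratio. The following minimal sketch reproduces the arithmetic from the two execution times quoted above; the function name is hypothetical.

```python
def relative_increase_percent(before: float, after: float) -> float:
    """Percentage growth of `after` relative to `before`."""
    return (after - before) / before * 100.0


# Execution times quoted in the text for 10 and 50 clients (seconds).
t_10_clients, t_50_clients = 0.5, 2.8

growth = relative_increase_percent(t_10_clients, t_50_clients)
ratio = t_50_clients / t_10_clients
print(f"growth: {growth:.0f}%  ratio: {ratio:.1f}x")
```

A 460% increase therefore corresponds to an execution time 5.6 times that measured with 10 clients.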

Robustness against Noise

The robustness of the thresholding methods against noise is illustrated in Figure 6, where the x-axis represents the increasing number of clients with corrupted data, up to 30 clients.

For the Covertype dataset, Our Method consistently maintains a high F1 score, experiencing only a gradual decline as the number of corrupted clients increases. This demonstrates strong resilience to noise. In contrast, the Fed-MinMax and Fed-Filtered methods exhibit significantly lower F1 scores from the outset, indicating their reduced robustness to noisy data.

In the Shuttle dataset, Our Method again demonstrates superior robustness, maintaining a relatively high F1 score that decreases more slowly compared to other methods. Notably, the Fed-MSE-Std method shows a sharp decline in performance as the number of corrupted clients increases, highlighting its vulnerability to noise.

For the Credit Card Fraud Detection dataset, a similar trend is observed. Our Method maintains the highest F1 scores across all levels of noise, while the performance of the Fed-Filtered and Fed-MSE-Std methods decreases more rapidly. This further underscores the superior robustness of Our Method.

Overall, these results emphasize the robustness of Our Method against noise, making it a reliable choice in scenarios where data corruption is a significant concern. Other methods, while effective in certain contexts, tend to exhibit more substantial performance degradation as the number of corrupted clients increases.

Follow-Up Analysis: Predicting the Benefit of Federated vs. Local Threshold

Based on observations from our previous experiments, we identified that certain clients may benefit more from using their locally calculated thresholds than from a federated threshold. This insight prompted us to investigate whether valuable information could be extracted from the summary statistics, both local and federated, collected during those experiments. The objective of this follow-up analysis is to explore whether these summary statistics can be leveraged to predict when a client would benefit more from a federated threshold than from a local one.

We conducted experiments across various data distributions, varying numbers of clients, and differing levels of non-IIDness to simulate a wide range of real-world scenarios. This comprehensive approach provided a robust dataset for analysis. In each use case, we employed the Federated Autoencoder model and collected validation data to extract summary statistics. These statistics, along with the corresponding F1 scores for both local and federated thresholds, formed the foundation of our analysis.

The dataset’s features include locally calculated summary statistics from both normal and anomalous validation data, as well as aggregated summary statistics. Specifically, the local statistics comprise measures such as mean, variance, skewness, kurtosis, and count for both normal and anomalous data. The aggregated statistics are derived by combining these local statistics across all clients, producing features such as aggregated mean, variance, skewness, kurtosis, and proportional counts for both normal and anomalous data. In total, the dataset comprises 20 features.
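The feature construction described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: it assumes population (biased) moments, non-excess kurtosis, and exact pooling of client statistics through count-weighted raw moments. The names `Summary`, `local_summary`, `raw_moments`, and `aggregate` are hypothetical, and the input values are made up.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Summary:
    count: int
    mean: float
    variance: float  # population variance
    skewness: float  # standardized third central moment
    kurtosis: float  # standardized fourth central moment (not excess)


def local_summary(errors: Sequence[float]) -> Summary:
    """Summary statistics a client would compute on its own errors."""
    n = len(errors)
    mean = sum(errors) / n
    m2 = sum((e - mean) ** 2 for e in errors) / n
    m3 = sum((e - mean) ** 3 for e in errors) / n
    m4 = sum((e - mean) ** 4 for e in errors) / n
    skew = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    kurt = m4 / m2 ** 2 if m2 > 0 else 0.0
    return Summary(n, mean, m2, skew, kurt)


def raw_moments(s: Summary):
    """Convert central statistics to raw moments E[x], E[x^2], E[x^3], E[x^4]."""
    m3 = s.skewness * s.variance ** 1.5
    m4 = s.kurtosis * s.variance ** 2
    r1 = s.mean
    r2 = s.variance + s.mean ** 2
    r3 = m3 + 3 * s.mean * s.variance + s.mean ** 3
    r4 = m4 + 4 * s.mean * m3 + 6 * s.mean ** 2 * s.variance + s.mean ** 4
    return r1, r2, r3, r4


def aggregate(summaries: List[Summary]) -> Summary:
    """Pool client summaries without access to the raw samples."""
    total = sum(s.count for s in summaries)
    pooled = [0.0, 0.0, 0.0, 0.0]
    for s in summaries:
        weight = s.count / total  # the client's proportional count
        for k, r in enumerate(raw_moments(s)):
            pooled[k] += weight * r
    r1, r2, r3, r4 = pooled
    var = r2 - r1 ** 2
    m3 = r3 - 3 * r1 * r2 + 2 * r1 ** 3
    m4 = r4 - 4 * r1 * r3 + 6 * r1 ** 2 * r2 - 3 * r1 ** 4
    skew = m3 / var ** 1.5 if var > 0 else 0.0
    kurt = m4 / var ** 2 if var > 0 else 0.0
    return Summary(total, r1, var, skew, kurt)


# Made-up reconstruction errors for two clients.
client_a = [0.10, 0.12, 0.15, 0.11]
client_b = [0.20, 0.18, 0.25, 0.22, 0.30, 0.19]
global_summary = aggregate([local_summary(client_a), local_summary(client_b)])
reference = local_summary(client_a + client_b)
print(global_summary)
```

With this pooling rule, the aggregated statistics match those computed on the concatenated samples up to floating-point error, so only five numbers per client and class need to leave the client. Applying it separately to normal and anomalous validation errors yields the 10 local and 10 aggregated features.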

For labeling, we focused on the difference in F1 scores between the best-performing local threshold and the federated threshold. Based on our earlier results, the Local Iterative Threshold was identified as the most effective local method, while our Federated Threshold emerged as the best federated method. Consequently, the label for each client’s data point represents the difference in F1 scores achieved by the federated threshold and the Local Iterative Threshold on the client’s local test data.
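The labeling step can be sketched as below. The sign convention (federated minus local) and the rule that a sample is flagged when its reconstruction error exceeds the threshold are assumptions of this sketch, and the function names, thresholds, and data are hypothetical.

```python
from typing import Sequence


def f1_at_threshold(errors: Sequence[float], labels: Sequence[int],
                    threshold: float) -> float:
    """F1 score when samples with error above `threshold` are flagged.

    labels: 1 = anomaly, 0 = normal.
    """
    tp = fp = fn = 0
    for err, label in zip(errors, labels):
        flagged = err > threshold
        if flagged and label == 1:
            tp += 1
        elif flagged and label == 0:
            fp += 1
        elif not flagged and label == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_difference(errors: Sequence[float], labels: Sequence[int],
                  fed_threshold: float, local_threshold: float) -> float:
    """Regression label: F1 with the federated threshold minus F1 with the local one."""
    return (f1_at_threshold(errors, labels, fed_threshold)
            - f1_at_threshold(errors, labels, local_threshold))


# Made-up local test data for one client.
test_errors = [0.1, 0.2, 0.8, 0.9]
test_labels = [0, 0, 1, 1]

diff = f1_difference(test_errors, test_labels,
                     fed_threshold=0.5, local_threshold=0.15)
benefits_from_federated = diff > 0  # binary classification label
print(round(diff, 3), benefits_from_federated)
```

In this toy case the federated threshold separates the classes perfectly (F1 = 1.0) while the local threshold produces one false positive (F1 = 0.8), so the regression label is about 0.2 and the binary label is positive.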

To evaluate the benefits of federated versus local thresholds, we employed two modeling approaches: binary classification and regression. The binary classification model aimed to predict whether a client would benefit from using the federated threshold, while the regression model predicted the actual difference in F1 scores between the federated and local thresholds.

For the binary classification task, we employed a Support Vector Machine (SVM) model [33] and achieved an accuracy of 89% when the data was randomly split into training, validation, and test sets across all three datasets. However, when we tested a more realistic scenario—training and validating on data from the Shuttle and Covertype datasets while testing on the Credit Card Fraud Detection dataset—the SVM’s performance dropped to an accuracy of 68%. This suggests that while the model performs well under random splits, its ability to generalize across different datasets is limited.

For the regression task, we employed a Random Forest model [34]. In the random split scenario, the Random Forest regressor achieved a moderate R² score of 0.67, indicating some predictive capability but also challenges in accurately predicting the differences in F1 scores. When tested under the more realistic split, the model’s performance further decreased, attaining an R² score of only 0.59. This highlights the difficulty in generalizing the regression model to new, unseen datasets.
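The two evaluation protocols differ only in how the rows are split. The sketch below shows the cross-dataset split together with the two reported metrics, written in plain Python so that it stays self-contained. The row layout, the dataset keys, and the function names are hypothetical, and the values are made up; no model is trained here.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def leave_one_dataset_out(rows: List[Row], held_out: str) -> Tuple[List[Row], List[Row]]:
    """Train on every dataset except `held_out`; test only on `held_out`."""
    train = [r for r in rows if r["dataset"] != held_out]
    test = [r for r in rows if r["dataset"] == held_out]
    return train, test


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of matching labels (metric for the binary classification task)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination (metric for the regression task)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up rows: one per client and experiment, tagged with the source dataset.
rows = [
    {"dataset": "shuttle", "f1_difference": 0.05},
    {"dataset": "shuttle", "f1_difference": -0.02},
    {"dataset": "covertype", "f1_difference": 0.10},
    {"dataset": "creditcard", "f1_difference": 0.03},
    {"dataset": "creditcard", "f1_difference": -0.01},
]

train_rows, test_rows = leave_one_dataset_out(rows, held_out="creditcard")
print(len(train_rows), len(test_rows))
```

Because no row from the held-out dataset reaches the training set, this split measures transfer to an unseen data source, whereas a random split lets the model see clients from all three datasets during training.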

The correlation matrix presented in Figure 7 offers further insights into the relationships between the summary statistics and the F1 score difference (f1_difference). The matrix reveals that certain features, such as normal_aggr_mean and anomaly_aggr_mean, exhibit a relatively strong correlation with f1_difference, which suggests that they may be useful predictors of whether the federated threshold will outperform the local threshold. Conversely, other features, including normal_proportional_count and anomaly_proportional_count, display little to no correlation with f1_difference, indicating that they may be less useful, at least as linear predictors, for anticipating the advantage of a federated threshold.
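A correlation screen of this kind can be sketched as follows, assuming the Pearson coefficient is the measure behind the matrix. The feature names are taken from the text, but the values are synthetic and chosen only for illustration, and returning 0.0 for a constant column is a convention of this sketch.

```python
import math
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # sketch convention: a constant column carries no signal
    return cov / math.sqrt(var_x * var_y)


def rank_features(table: Dict[str, List[float]], target: str) -> List[Tuple[str, float]]:
    """Sort features by absolute correlation with the target column."""
    scores = [
        (name, pearson(values, table[target]))
        for name, values in table.items()
        if name != target
    ]
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)


# Synthetic values for illustration only; not results from the paper.
table = {
    "normal_aggr_mean": [0.10, 0.20, 0.30, 0.40],
    "normal_proportional_count": [0.30, 0.10, 0.40, 0.20],
    "f1_difference": [0.02, 0.05, 0.07, 0.11],
}

ranking = rank_features(table, target="f1_difference")
for name, r in ranking:
    print(f"{name}: r = {r:+.2f}")
```

Ranking by absolute correlation captures only linear association, so a feature with a weak coefficient may still carry information that a nonlinear model can exploit.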

These observations highlight opportunities for enhancing our feature selection process. By refining our feature engineering strategies and potentially incorporating more relevant features, we can improve the predictive power of our models. Additionally, the weak correlation observed for some features underscores the necessity to explore alternative statistical measures or advanced modeling techniques that can more effectively capture the complexities inherent in the data.

Our experiments emphasize the critical role of appropriate feature selection in enhancing the generalization capabilities of models within federated learning environments. The insights derived from the correlation analysis will inform future endeavors aimed at developing more adaptive and robust thresholding mechanisms, thereby advancing the effectiveness of federated anomaly detection.

Conclusion & Future Work

In this paper, we introduced an innovative federated thresholding method for anomaly detection (AD) using autoencoders (AEs), leveraging summary statistics to improve robustness and accuracy across multiple clients. Our approach consistently outperformed traditional local and federated thresholding techniques in both IID and non-IID data scenarios. Its scalability was evident, with minimal performance degradation as the number of clients increased, making it highly suitable for real-world federated learning applications. While federated methods generally provided superior results, certain local methods remained effective in cases with highly varied data distributions.

Future research could focus on further enhancing the thresholding process by incorporating advanced statistical measures, such as entropy or mutual information, to capture more complex patterns and improve detection accuracy. Adaptive thresholding mechanisms that respond in real time to changes in client data distributions could also increase the flexibility and responsiveness of anomaly detection systems, and adaptive federated learning models that dynamically adjust to varying data distributions could further improve their robustness and scalability. Additionally, integrating differential privacy techniques may further enhance data protection and support compliance with privacy regulations.

Data Availability

The datasets analyzed during the current study are publicly available as follows:

• Credit Card Fraud Detection: Provided by Worldline and the Machine Learning Group of ULB, 2014. This dataset can be accessed at Kaggle.
• Shuttle Dataset: Available from the UCI Machine Learning Repository. This dataset can be accessed at UCI Shuttle Dataset.
• Covertype Dataset: Available from the UCI Machine Learning Repository. This dataset can be accessed at UCI Covertype Dataset.

References

[1] McMahan, H. B., Moore, E., Ramage, D., Hampson, S. et al. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, 1273–1282 (2017).
[2] Kairouz, P. et al. Advances and open problems in federated learning. Foundations and trends in machine learning 14, 1–210 (2021).
[3] Yang, Q., Liu, Y., Chen, T. & Tong, Y. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 10, 1–19 (2019).
[4] Li, T., Sahu, A. K., Talwalkar, A. & Smith, V. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine 37, 50–60 (2020).
[5] Ali, M., Karimipour, H. & Tariq, M. Integration of blockchain and federated learning for internet of things: Recent advances and future challenges. Computers & Security 108, 102355 (2021).
[6] Smith, V., Chiang, C.-K., Sanjabi, M. & Talwalkar, A. S. Federated multi-task learning. Advances in Neural Information Processing Systems (2017).
[7] Bonawitz, K. et al. Towards federated learning at scale: System design. Proceedings of machine learning and systems 1, 374–388 (2019).
[8] Liu, F. T., Ting, K. M. & Zhou, Z.-H. Isolation forest. In 2008 eighth ieee international conference on data mining, 413–422 (IEEE, 2008).
[9] Zhao, Y. et al. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582 (2018).
[10] Schlegl, T., Seeböck, P., Waldstein, S. M., Langs, G. & Schmidt-Erfurth, U. f-anogan: Fast unsupervised anomaly detection with generative adversarial networks. Medical image analysis 54, 30–44 (2019).
[11] Hendrycks, D. & Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136 (2016).
[12] Liang, S., Li, Y. & Srikant, R. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690 (2017).
[13] Nalisnick, E., Matsukawa, A., Teh, Y. W. & Lakshminarayanan, B. Detecting out-of-distribution inputs to deep generative models using typicality. arXiv preprint arXiv:1906.02994 (2019).
[14] Vaswani, A. et al. Attention is all you need. CoRR abs/1706.03762 (2017).
[15] Zhu, H. et al. Long short term memory networks based anomaly detection for kpis. Computers, Materials & Continua 61 (2019).
[16] Xu, H. et al. Unsupervised anomaly detection via variational auto-encoder for seasonal kpis in web applications. In Proceedings of the 2018 world wide web conference, 187–196 (2018).
[17] An, J. & Cho, S. Variational autoencoder based anomaly detection using reconstruction probability. Special lecture on IE 2, 1–18 (2015).
[18] Baur, C., Denner, S., Wiestler, B., Navab, N. & Albarqouni, S. Autoencoders for unsupervised anomaly segmentation in brain mr images: a comparative study. Medical Image Analysis 69, 101952 (2021).
[19] Malhotra, P., Vig, L., Shroff, G., Agarwal, P. et al. Long short term memory networks for anomaly detection in time series. In Esann, vol. 2015, 89 (2015).
[20] Zong, B. et al. Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In International conference on learning representations (2018).
[21] Zenati, H., Foo, C. S., Lecouat, B., Manek, G. & Chandrasekhar, V. R. Efficient gan-based anomaly detection. arXiv preprint arXiv:1802.06222 (2018).
[22] Hinton, G. E. & Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science 313, 504–507 (2006).
[23] Huong, T. T. et al. Detecting cyberattacks using anomaly detection in industrial control systems: A federated learning approach. Computers in Industry 132, 103509 (2021).
[24] Novoa-Paradela, D., Fontenla-Romero, O. & Guijarro-Berdiñas, B. Fast deep autoencoder for federated learning. Pattern Recognition 143, 109805 (2023).
[25] Sáez-de Cámara, X., Flores, J. L., Arellano, C., Urbieta, A. & Zurutuza, U. Clustered federated learning architecture for network anomaly detection in large scale heterogeneous iot networks. Computers & Security 131, 103299 (2023).
[26] Kea, K., Han, Y. & Kim, T.-K. Enhancing anomaly detection in distributed power systems using autoencoder-based federated learning. Plos one 18, e0290337 (2023).
[27] Schlegl, T., Seeböck, P., Waldstein, S. M., Schmidt-Erfurth, U. & Langs, G. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International conference on information processing in medical imaging, 146–157 (Springer, 2017).
[28] Wang, X. et al. Federated deep learning for anomaly detection in the internet of things. Computers and Electrical Engineering 108, 108651 (2023).
[29] Sánchez, P. M. S. et al. Studying the robustness of anti-adversarial federated learning models detecting cyberattacks in iot spectrum sensors. IEEE Transactions on Dependable and Secure Computing 21, 573–584 (2022).
[30] Pourahmadi, V., Alameddine, H. A., Salahuddin, M. A. & Boutaba, R. Spotting anomalies at the edge: Outlier exposure-based cross-silo federated learning for ddos detection. IEEE Transactions on Dependable and Secure Computing 20, 4002–4015 (2022).
[31] Sakurada, M. & Yairi, T. Anomaly detection using autoencoders with nonlinear dimensionality reduction. MLSDA 2 (2014).
[32] Ahmed, M., Seraj, R. & Islam, S. M. S. The k-means algorithm: A comprehensive survey and performance evaluation. Electronics 9, 1295 (2020).
[33] Vishwanathan, S. & Murty, M. N. Ssvm: a simple svm algorithm. In Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN’02 (Cat. No. 02CH37290), vol. 3, 2393–2398 (IEEE, 2002).
[34] Rigatti, S. J. Random forest. Journal of Insurance Medicine 47, 31–39 (2017).