Anomaly Detection

The document presents a comparative study on anomaly detection in Industrial Internet of Things (IIoT) networks, focusing on centralized machine learning and federated learning approaches. It outlines the unique challenges posed by IIoT systems and proposes two models for anomaly detection, emphasizing the importance of security in increasingly connected industrial environments. The research aims to evaluate the effectiveness and efficiency of both models to enhance cybersecurity measures in IIoT networks.

Synopsis On
ANOMALY DETECTION IN IIoT NETWORKS: A
COMPARATIVE STUDY OF CENTRALIZED MACHINE
LEARNING AND FEDERATED LEARNING APPROACHES
Submitted by
241HO830002- Anjali Chaudhary
241H0830004-Komal Gour
241H0830006-Mohiba Ansari

For the award of the degree

Of
M.Sc. (Data Science) II Sem

Under Supervision of
DR. BHARTI NATHANI (Assistant Professor)

Department of Computer Science

Banasthali Vidyapith
Session 2024-25

TABLE OF CONTENTS

S.NO CONTENTS

1. Introduction

2. Literature Review

3. Justification and Relevance

4. Objectives

5. Methodology

6. Tools & Techniques

7. Expected Outcome of the Research

8. References

INTRODUCTION
In today's world, industries are becoming smarter and more connected than ever before.
Factories, power plants, and transportation systems are now part of what we call the Industrial
Internet of Things, or IIoT. This exciting development has made these industries more efficient
and productive, but it has also opened up new challenges in keeping these systems safe from
cyberattacks.

IIoT networks are quite different from the computer networks we use in offices or at home. They
typically run only a few specific programs, have a fixed setup that doesn't change much, and
show regular patterns in how data moves around. These systems also tend to follow set schedules
for their operations and have predictable stages in their processes.

These unique features of IIoT networks can actually help us spot when something unusual is
happening. It's similar to how you might notice if something in your room was moved because
you know exactly how your room usually looks.

While traditional security measures like passwords, firewalls, and encryption are important,
they're not always enough to stop all types of attacks on IIoT systems. Some clever attacks, such
as overwhelming the system with too much traffic or attacks from inside the organization, can
still slip through these defenses.

To address these challenges, experts are developing new ways to detect unusual activities in IIoT
networks. These methods take advantage of the predictable nature of IIoT systems to make them
more secure. One such approach is anomaly detection, which uses specification-based machine
learning models to identify deviations from expected behavior and thereby flag potential
cyber-attacks and operational issues.

By using these advanced techniques, we can better protect our increasingly connected industrial
world. This is crucial because as our factories, power grids, and transportation systems become
more reliant on internet connectivity, ensuring their security becomes more important than ever.
The goal is to harness the benefits of IIoT while keeping these critical systems safe from
potential cyber threats.

LITERATURE REVIEW
1. This paper reviews data anomaly detection in the Internet of Things (IoT), discussing
trends, methodologies, and challenges. It explores techniques like statistical methods,
machine learning algorithms, and deep learning. The paper addresses data heterogeneity,
scalability, real-time processing, and privacy concerns. Case studies demonstrate its
application in various industries. Future research emphasizes efficient, scalable, and
privacy-preserving anomaly detection techniques.

2. This paper reviews current trends and challenges in data anomaly detection for Internet of
Things (IoT) systems[5]. It discusses various machine learning and deep learning
techniques used for detecting anomalies in IoT data, including statistical methods,
support vector machines, random forests, and neural networks. The paper also explores
challenges like high dimensionality, scalability, and real-time processing in IoT
environments.

3. The paper proposes an optimized intrusion detection system called OIFIDS that can
effectively handle heterogeneous and streaming data in Industrial Internet of Things
(IIoT) networks. It uses an enhanced version of the Isolation Forest algorithm[3] to
improve detection accuracy, speed, and efficiency compared to existing approaches.

4. This paper describes how anomaly detection in IIoT networks has evolved from
traditional machine learning methods like SVM and XGBoost[4] to deep learning
approaches such as CNNs, LSTMs, and autoencoders. Hybrid models like CNN+GRU
effectively capture spatial and temporal patterns, improving accuracy. GANs generate
synthetic attack data for better training. XGBoost remains efficient for large datasets, but
deep learning techniques outperform traditional models in detecting complex anomalies.
Ongoing research focuses on refining hybrid architectures and optimizing computational
efficiency for secure and reliable IIoT systems.

5. The paper proposes a novel anomaly detection method that combines an autoencoder and
an isolation forest algorithm. The autoencoder learns a compact representation of the
data, while the isolation forest is used to identify anomalies in the reconstructed data.
This combination enhances the anomaly detection process, especially in high-
dimensional data, compared to using the individual algorithms. The authors demonstrate
the effectiveness of the proposed method on various real-world datasets with varying
characteristics and anomaly rates.

JUSTIFICATION AND RELEVANCE

The Industrial Internet of Things (IIoT) is making industries smarter and more efficient, but it
also brings new cybersecurity risks. Traditional security measures like firewalls and passwords
help, but they can’t stop all threats, such as insider attacks or complex hacking techniques.
Since IIoT networks follow predictable patterns, unusual activity can signal a cyber threat.
Anomaly detection helps identify these threats by learning what “normal” looks like and spotting
anything unusual.
In this project, we are developing two models for anomaly detection: one using traditional
(centralized) machine learning and another using federated learning. Federated learning allows
devices to work together to improve security without sharing raw, sensitive data.
By comparing both methods, we aim to find out which one is more effective, efficient, and
practical for IIoT security. Our research will help industries protect their critical systems,
ensuring they remain safe and reliable in an increasingly connected world.

OBJECTIVES
1. To design a centralized machine learning model for anomaly detection in IIoT networks,
focusing on optimizing detection accuracy and analyzing data in a central repository.

2. To develop a machine learning model for anomaly detection in IIoT networks using
federated learning techniques, ensuring enhanced data privacy and improved security for
decentralized data across diverse industrial devices.

3. To conduct a comparative analysis of the anomaly detection models in IIoT networks,
with and without federated learning, evaluating their performance, data privacy, and
security.

METHODOLOGY
1. DATA COLLECTION AND UNDERSTANDING: We use the WUSTL-IIOT-2021 dataset
for IIoT cybersecurity research[1]. The features of the dataset are explained below:

1. Mean Flow (mean): Average duration of active network flows, indicating traffic
behavior over time.
2. Source Port (Sport): Integer representing the originating port number of a packet
within a communication session.
3. Destination Port (Dport): Integer that specifies the endpoint port number for the
destination in a network packet.
4. Source Packets (Spkts): Integer count of packets sent from the source to the
destination during a session.
5. Destination Packets (Dpkts): Integer reflecting the number of packets received at the
destination from a source.
6. Total Packets (Tpkts): Total count of all packets (both sent and received) in a
communication session.
7. Source Bytes (Sbytes): Integer indicating the total bytes sent from the source to the
destination.
8. Destination Bytes (Dbytes): Integer representing the total bytes received at the
destination from the source.
9. Total Bytes (Tbytes): Cumulative size of all bytes sent and received during the
network communication.
10. Source Load (Sload): Average load computed from the source during the data
transmission period.
11. Destination Load (Dload): Average load computed based on the incoming data at the
destination.
12. Total Load (Tload): Overall load, taking into account both source and destination
data.
13. Source Rate (Srate): Rate at which data is sourced from the originating device to the
recipient.
14. Destination Rate (Drate): Rate of data flowing into the destination device from the
source.
15. Total Rate (Trate): Overall rate comprising both source and destination data
transfer.
16. Source Loss (Sloss): Measures the number of lost packets originating from the source
during transmission.
17. Destination Loss (Dloss): Indicates packet loss statistics at the destination receiving
end.
18. Total Loss (Tloss): Aggregate measure indicating total packets lost from the ongoing
communication.
19. Total Percent Loss (Ploss): Percentage representation of packets lost relative to the
total packets sent.
20. Source Jitter (Sjitter): Represents variability in packet delay originating from the
source.
21. Destination Jitter (Djitter): Measures variability in packets upon arrival at the
destination.

22. Source Interpacket (SntPkt): Time gap between successive packets sent from the
source.
23. Destination Interpacket (DntPkt): Time interval observed between packets as they
arrive at the destination.
24. Protocol (Proto): Character indicating the networking protocol used in the data
transmission.
25. Duration (Dur): Integer defining the total time duration of a particular record.
26. TCP RTT (TcpRtt): Round-trip time for TCP connections, showing latency in
communication.
27. Idle Time (Idle): Duration for which the network connection remains inactive or idle.
28. Sum (sum): Total accumulated duration across all aggregated records in the dataset.
29. Min (min): Minimum recorded duration of data packets within the dataset.
30. Max (max): Maximum recorded duration of entries within the aggregate dataset.
31. Source DiffServ Byte (DSb): Integer giving the Differentiated Services (DiffServ) byte
value set by the source.
32. Source TTL (sTtl): Time-to-live value specifying packet lifespan for the source in
communication.
33. Destination TTL (dTtl): Time-to-live value dictating how long packets are valid upon
reaching the destination.
34. Source App Byte (SAppBytes): Byte count representing the application data size sent
from the source.
35. Destination App Byte (DAppBytes): Total byte count of application data received at
the destination.
36. Total App Byte (TAppByte): Overall byte count of application data exchanged during
the interaction.
37. SYN Ack (SynAck): Indicates TCP connection establishment phase, counting SYN
and SYN-ACK packets exchanged.
38. Run Time (RunTime): Total active duration of the session from start to finish within
the dataset.
39. Source ToS (STos): Type of Service byte pertaining to the source in network traffic.
40. Source Tier (SrcTier): Tier classification signifying source's relevance or priority in
network communications.
41. Destination Tier (DstTier): Classification determining the role or priority of the
destination in communication.

2. EXPLORATORY DATA ANALYSIS: In this study, we focus on the preprocessing and
feature selection techniques necessary for effective anomaly detection in Industrial
Internet of Things (IIoT) networks using machine learning approaches. The first stage
involves comprehensive data cleaning to ensure the dataset is structured for analysis.
Missing values are handled using statistical techniques such as mean, median, or mode
replacement for numerical data, while categorical data is transformed using an
autoencoder[4][6]. The autoencoder, an unsupervised neural network, reduces
categorical data into meaningful numerical representations, preserving essential patterns
while eliminating redundancy. Additionally, numerical features are rescaled using
MinMaxScaler or StandardScaler so that features with large value ranges do not dominate
the others.
Following preprocessing, feature selection is applied to enhance computational efficiency
and improve model accuracy. Correlation analysis[5] is performed to detect and remove
highly correlated features, reducing multicollinearity. Recursive Feature Elimination
(RFE) is then used to systematically discard less important features based on model
performance, so that only the most relevant attributes are retained.
By implementing these preprocessing and feature selection techniques, we prepare the
dataset for the machine learning models, reducing noise and the risk of overfitting. This
structured approach improves the efficiency of anomaly detection while maintaining the
interpretability of the model.
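
The preprocessing and feature selection steps described above can be sketched as follows. This is only a minimal illustration on synthetic data, not the actual pipeline: the four column names are borrowed from the feature list, the helper drop_correlated, the 0.95 correlation threshold, the random-forest estimator inside RFE, and the toy label are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler


def drop_correlated(X, threshold=0.95):
    """Return the column indices kept after dropping one column from every
    pair whose absolute Pearson correlation exceeds `threshold`."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        # Keep column j only if it is not highly correlated with a kept column.
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return keep


# Synthetic stand-in for four flow features (not the real dataset).
rng = np.random.default_rng(0)
n = 300
names = ["Spkts", "Dpkts", "Sbytes", "Sjitter"]
spkts = rng.integers(1, 50, n).astype(float)
dpkts = rng.integers(1, 50, n).astype(float)
sbytes = 60.0 * spkts + rng.normal(0.0, 1.0, n)   # nearly redundant with Spkts
sjitter = rng.normal(0.0, 1.0, n)
y = (spkts > 40).astype(int)                      # toy label driven by Spkts only
dpkts[rng.choice(n, 10, replace=False)] = np.nan  # simulate missing values
X = np.column_stack([spkts, dpkts, sbytes, sjitter])

# In a real experiment, fit the imputer and scaler on the training split only.
X = SimpleImputer(strategy="median").fit_transform(X)  # median replacement
X = MinMaxScaler().fit_transform(X)                    # rescale each column to [0, 1]

kept = drop_correlated(X)                              # correlation filter
X_kept = X[:, kept]

# RFE repeatedly refits the estimator and discards the weakest feature.
rfe = RFE(RandomForestClassifier(n_estimators=50, random_state=0),
          n_features_to_select=1).fit(X_kept, y)
selected = [names[kept[i]] for i in np.flatnonzero(rfe.support_)]
print("kept after correlation filter:", [names[i] for i in kept])
print("selected by RFE:", selected)
```

In this toy setup Sbytes is removed by the correlation filter because it is almost a multiple of Spkts. On the real dataset the threshold and the number of features to keep would have to be tuned.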

3. MODEL SELECTION AND TRAINING:

1. Base ML Model: Hybrid Approach Using XGBoost[4] and Isolation Forest[3]
For our base anomaly detection model in IIoT networks, we are using a hybrid approach
combining Isolation Forest and XGBoost.
Step 1: Detecting Anomalies with Isolation Forest[3][6]
Isolation Forest is an unsupervised learning algorithm that works by identifying outliers
in IIoT data. It assigns an anomaly score to each data point and helps detect unusual
patterns without requiring labeled data.
Step 2: Refining Detection with XGBoost[4]
Once we get anomaly scores, we label the data as normal (0) or anomalous (1). XGBoost,
a powerful supervised machine learning algorithm, is then trained on this labeled dataset.
It learns patterns from normal and anomalous data, helping to reduce false positives and
improve accuracy.
This hybrid approach ensures high accuracy, fast processing, and better adaptability to
real-time IIoT networks. The trained model will then be tested on real-world IIoT
datasets to evaluate its effectiveness.
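
A minimal sketch of the two steps above, under several assumptions: the data is synthetic, the contamination value of 0.05 is an assumed anomaly share, and the ground-truth array exists only to check the result. If the xgboost package is not installed, the sketch falls back to scikit-learn's GradientBoostingClassifier, which is a stand-in and not XGBoost itself.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

try:  # use XGBoost when it is installed
    from xgboost import XGBClassifier as Booster
except ImportError:  # fallback stand-in, NOT XGBoost
    from sklearn.ensemble import GradientBoostingClassifier as Booster

# Synthetic traffic: 950 normal rows near the origin, 50 scattered outliers.
rng = np.random.default_rng(42)
normal = rng.normal(0.0, 1.0, size=(950, 4))
attack = rng.normal(0.0, 1.0, size=(50, 4)) + rng.choice([-8.0, 8.0], size=(50, 4))
X = np.vstack([normal, attack])
truth = np.r_[np.zeros(950, dtype=int), np.ones(50, dtype=int)]  # only for checking

# Step 1: unsupervised detection. fit_predict returns -1 for outliers, 1 for inliers.
iso = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
pseudo = (iso.fit_predict(X) == -1).astype(int)  # pseudo-label: 1 = anomalous
pseudo_recall = float(pseudo[truth == 1].mean())

# Step 2: supervised model trained on the pseudo-labels from step 1.
X_tr, X_te, y_tr, y_te, _, truth_te = train_test_split(
    X, pseudo, truth, test_size=0.3, stratify=pseudo, random_state=42)
clf = Booster(n_estimators=50, max_depth=3, learning_rate=0.1, random_state=42)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

agreement = float((pred == truth_te).mean())
print(f"share of true anomalies flagged by Isolation Forest: {pseudo_recall:.3f}")
print(f"agreement with ground truth on held-out rows: {agreement:.3f}")
```

Because the supervised model learns from pseudo-labels, it can only be as good as the Isolation Forest labels it is given. This is why the evaluation has to be made against the dataset's true labels.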

2. ML Model with Federated Learning[2]: Secure and Distributed Training

In addition to the base model, we also train our hybrid XGBoost + Isolation Forest model
using Federated Learning to make it more scalable and privacy-friendly.
Step 1: Decentralized Training on Edge Devices
Instead of sending all IIoT data to a central server, federated learning allows multiple
devices (e.g., factory sensors, power grid controllers) to train models locally while
keeping sensitive data private. Each device runs Isolation Forest to detect anomalies and
labels the data accordingly.
Step 2: Aggregating the Model Updates
Once local training is done, each device sends only the model updates, not raw data, to a
central server. The server then combines these updates to improve the global XGBoost
model while keeping IIoT data secure.
This approach ensures that no raw industrial data is exposed, making it ideal for
cybersecurity-sensitive environments. It also allows real-time anomaly detection across
multiple IIoT networks without requiring massive data transfers.
Finally, we compare both models to see which performs better in terms of accuracy,
efficiency, and security for IIoT anomaly detection.
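
To illustrate the aggregation step, the sketch below shows the weighted averaging used in the common federated averaging (FedAvg) scheme. It assumes that each client update can be written as a numeric parameter vector, which holds for neural networks but not directly for tree ensembles such as XGBoost, so the actual aggregation rule for the global XGBoost model would need to be chosen separately. The function name federated_average and the example numbers are hypothetical.

```python
import numpy as np


def federated_average(client_updates):
    """Weighted average of client parameter vectors (FedAvg-style).

    client_updates: list of (params, n_samples) pairs, where n_samples is the
    number of local rows the client trained on.
    """
    total = sum(n for _, n in client_updates)
    if total == 0:
        raise ValueError("no samples reported by any client")
    # Clients with more local data contribute proportionally more.
    return sum(np.asarray(p, dtype=float) * (n / total) for p, n in client_updates)


# Two hypothetical edge devices; only parameters and row counts leave the device.
updates = [
    (np.array([1.0, 2.0]), 100),
    (np.array([3.0, 4.0]), 300),
]
global_params = federated_average(updates)
print(global_params)
```

Here the second device holds three times as much data, so the result lies three quarters of the way toward its parameters.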

4. MODEL EVALUATION AND PERFORMANCE METRICS: To evaluate the
performance of our hybrid anomaly detection model (Isolation Forest + XGBoost) in
IIoT networks, we use a confusion matrix. This helps us understand how well our model
identifies normal and anomalous activities.

Understanding the Confusion Matrix: A confusion matrix is a simple table that shows
how our model’s predictions compare to the actual results. It has four key components:
1. True Positives (TP): The model correctly detects an anomaly.
2. True Negatives (TN): The model correctly identifies normal behavior.
3. False Positives (FP): The model mistakenly flags normal data as an anomaly
(false alarm).
4. False Negatives (FN): The model fails to detect an actual anomaly.

For performance metrics, we use:

1. Accuracy = (TP + TN) / (TP + TN + FP + FN)
Shows how often the model makes correct predictions.
2. Precision = TP / (TP + FP)
Measures how many of the detected anomalies were actually correct.
3. Recall (Detection Rate) = TP / (TP + FN)
Tells us how well the model catches actual anomalies.
4. F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
A balance between precision and recall, useful when false positives and false
negatives are equally important.
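
The four formulas above can be computed directly from the confusion matrix counts. The sketch below is a small worked example. The counts are invented for illustration, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Invented counts: 1000 flows, 90 of them true anomalies.
m = classification_metrics(tp=40, tn=900, fp=10, fn=50)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 0.94 even though recall is only about 0.44, which shows why accuracy alone can be misleading when anomalies are rare.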

ROC-AUC (Receiver Operating Characteristic - Area Under Curve): Helps compare
model performance by showing how well it distinguishes between normal and anomalous
behavior.
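
One way to read ROC-AUC is as the probability that a randomly chosen anomaly receives a higher anomaly score than a randomly chosen normal sample. The sketch below computes it from that definition with a simple pairwise comparison. The function name roc_auc and the scores are illustrative, and in practice roc_auc_score from scikit-learn would be used instead of this quadratic-time loop.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (anomaly, normal) pairs ranked correctly.

    labels: 1 for anomaly, 0 for normal. Tied scores count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]   # higher score = more anomalous
auc = roc_auc(labels, scores)
print(auc)
```

In this example three of the four anomaly/normal pairs are ranked correctly, giving an AUC of 0.75. A value of 0.5 would correspond to random ranking.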

Comparison of Base Model vs. Federated Learning Model: After testing, we compare
results from both our base model (centralized training) and federated learning model to
see which one performs better. We look for:
i. Higher Accuracy & F1-Score → Better overall detection.
ii. Higher Recall → Detects more real anomalies.
iii. Lower False Positives → Reduces unnecessary alerts in IIoT systems.
By analyzing the confusion matrix and these metrics, we can judge whether each model is
accurate, efficient, and suitable for real-time IIoT cybersecurity.

5. DEPLOYMENT: Once our hybrid anomaly detection model (Isolation Forest + XGBoost) is
trained and tested, the next step is deploying it in real IIoT networks. The goal is to keep
industrial systems safe by spotting cyber threats in real time without slowing down operations.
The model can be installed on IIoT devices like sensors, controllers, and gateways, where it
constantly monitors network activity. If something unusual happens, it immediately sends an
alert so the issue can be investigated before it causes damage.
For the federated learning version, each device trains the model locally using its own data,
without sharing raw information. Instead of sending sensitive data to a central server, only the
model updates are shared, helping improve security and privacy. Over time, the system learns
and adapts to new threats without exposing private industrial data.
To make monitoring easy, we set up a dashboard where security teams can see real-time alerts,
review logs, and analyze threats. The system is also designed to keep improving—if its accuracy
drops, it can be retrained with new data and updated without disrupting operations. By deploying
this model, industries can detect cyber threats early, protect critical systems, and ensure smooth
and secure IIoT operations.
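
The retraining trigger mentioned above could be sketched as a rolling accuracy check. This is only one possible design: it assumes that confirmed labels for past alerts eventually become available, and the class name AccuracyMonitor, the window size, and the threshold are hypothetical choices.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent predictions with confirmed labels."""

    def __init__(self, window=100, threshold=0.9):
        self.results = deque(maxlen=window)  # oldest results drop out automatically
        self.threshold = threshold

    def record(self, predicted, actual):
        self.results.append(predicted == actual)

    def accuracy(self):
        return sum(self.results) / len(self.results) if self.results else None

    def needs_retraining(self):
        # Judge only once the window is full, to avoid noisy early alarms.
        return (len(self.results) == self.results.maxlen
                and self.accuracy() < self.threshold)


monitor = AccuracyMonitor(window=5, threshold=0.8)
for predicted, actual in [(1, 1), (0, 0), (1, 0), (0, 0), (1, 0)]:
    monitor.record(predicted, actual)
print(monitor.accuracy(), monitor.needs_retraining())
```

In the usage example three of the last five predictions are correct, so the rolling accuracy of 0.6 falls below the 0.8 threshold and retraining is flagged.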

TOOLS AND TECHNIQUES

To build an effective anomaly detection model for IIoT networks, we use a combination of
powerful tools and techniques that help in data processing, model training, evaluation, and
deployment. These tools ensure that our model can handle large real-time datasets while
maintaining high accuracy and efficiency.
1. Data Collection and Understanding:
We use the WUSTL-IIOT-2021 Dataset[1], which contains detailed network traffic data from
industrial systems. The dataset includes various features like packet flow, port numbers, data
rates, and losses. To process this data efficiently, we use:
• Pandas: For handling large datasets, cleaning missing values, and organizing data.
• NumPy: For numerical operations and data transformations.
2. Exploratory Data Analysis (EDA):
To prepare the dataset for model training, we perform cleaning, transformation, and feature
selection. The tools used are:
• Matplotlib & Seaborn: For visualizing data distributions, correlations, and detecting
anomalies in patterns.
• Scikit-learn: Used for feature scaling (MinMaxScaler, StandardScaler), correlation
analysis[5], and Recursive Feature Elimination (RFE) to keep only the most important
features.
• Autoencoders[4][6] (TensorFlow): To encode categorical data into numerical form for
better model learning.
3. Model Selection and Training:
We use a hybrid machine learning approach combining Isolation Forest (for anomaly detection)
and XGBoost (for classification), along with a federated learning setup. The key tools include:
• Scikit-learn: Used for implementing Isolation Forest to identify anomalies in network
traffic.
• XGBoost: A high-performance gradient boosting library that helps classify normal and
anomalous data with high accuracy.
• Federated Learning[2] Framework (TensorFlow Federated): Used for decentralized
model training, ensuring security and privacy by keeping raw IIoT data on edge devices,
with the help of Flower (flower.ai).

4. Model Evaluation and Performance Metrics:

To measure how well our model detects anomalies, we use a confusion matrix and key
performance metrics such as accuracy, precision, recall, and F1-score. The tools used include:
• Scikit-learn (metrics module): For calculating confusion matrix, precision, recall, F1-score,
and ROC-AUC.
• Matplotlib & Seaborn: For plotting performance graphs, ROC curves, and anomaly
detection visualizations.

5. Deployment:
For deploying the anomaly detection model, we use tools that allow efficient testing, evaluation,
and comparison of our centralized and federated learning approaches on the collected dataset.
1. Centralized Model Deployment (Offline Testing)
Since the model is trained on a fixed dataset, deployment involves running the trained model on
test data to evaluate its effectiveness. The tools used include:
• Scikit-learn & XGBoost: Used for model training and inference on batch data.
• Jupyter Notebook: Provides an interactive environment for testing the model’s
predictions on new data.
• Matplotlib & Seaborn: Used for visualizing model performance, including confusion
matrices, ROC curves, and anomaly distributions.
2. Federated Learning Model Deployment
Since federated learning requires multiple devices to train the model locally, we simulate this
setup using:
• TensorFlow Federated (TFF): This framework allows us to simulate federated learning
using partitioned data from the dataset, mimicking real-world distributed training.
• Local Machine / Virtual Machines: Instead of deploying on actual IIoT edge devices,
federated learning experiments are conducted on multiple local instances to compare
performance.
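
The partitioning of the dataset into simulated clients could look like the sketch below. It assumes a simple random (IID) split into equally sized shards. Real IIoT devices often see very different traffic, so a non-IID split may be a more realistic test. The function name partition_clients and the sizes are illustrative.

```python
import numpy as np


def partition_clients(n_rows, n_clients, seed=0):
    """Shuffle row indices and split them into one shard per simulated client."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    # array_split allows shards whose sizes differ by at most one row.
    return np.array_split(shuffled, n_clients)


shards = partition_clients(n_rows=10, n_clients=3)
for client_id, idx in enumerate(shards):
    print(f"client {client_id}: {len(idx)} rows")
```

Each simulated client then trains only on the rows in its own shard, and every row belongs to exactly one client.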
3. Performance Monitoring & Comparison
Since the dataset is fixed, monitoring focuses on evaluating model accuracy, precision, recall,
and overall efficiency. We use:
• Pandas & NumPy: To analyze results and compare centralized vs. federated learning
performance.
• Power BI / Tableau: For creating tables and charts to present findings effectively.

EXPECTED OUTCOME OF THIS RESEARCH

This research aims to develop and compare two machine learning models for anomaly detection
in Industrial IoT networks: one centralized and the other decentralized (federated learning).
With this objective, we expect to deliver the following results:

Anomaly Detection with Centralized Models: A hybrid XGBoost and Isolation Forest model
trained centrally will identify anomalies in IIoT networks, and we expect it to achieve high
detection accuracy.
Improved Security and Data Privacy with Federated Learning: The federated learning
implementation is expected to improve data security and privacy. Because training is
decentralized, it can achieve effective anomaly detection without the risk of exposing
sensitive data.
Performance Evaluation of Both Approaches: A detailed analysis of the two models will discuss
their advantages and limitations. Although we expect the centralized model to achieve relatively
higher accuracy due to access to the full data, we expect federated learning to offer better data
security and the ability to scale for real IIoT applications. We will also assess efficiency, detection
speed, and false-positive rates to determine which of the two approaches, centralized or
federated, is better suited to IIoT applications.

REFERENCES
1. Zolanvari, M., Teixeira, M. A., Gupta, L., Khan, K. M., & Jain, R. (2019). Machine
learning-based network vulnerability analysis of industrial Internet of Things. IEEE
Internet of Things Journal, 6(4), 6822-6834.
2. Yang, M., & Zhang, J. (2023). Data anomaly detection in the internet of things: A review
of current trends and research challenges. International Journal of Advanced Computer
Science and Applications, 14(9).
3. Elsaid, S. A., & Binbusayyis, A. (2024). An optimized isolation forest based intrusion
detection system for heterogeneous and streaming data in the industrial Internet of Things
(IIoT) networks. Discover Applied Sciences, 6(9), 483.
4. Konatham, B. R. (2023). A secure and efficient IIoT anomaly detection approach using a
hybrid deep learning technique.
5. Jadidi, Z., Pal, S., Hussain, M., & Nguyen Thanh, K. (2023). Correlation-based anomaly
detection in industrial control systems. Sensors, 23(3), 1561.
6. Almansoori, M. K., & Telek, M. Anomaly Detection using combination of Autoencoder
and Isolation Forest.