Anomaly Detection
Synopsis On
ANOMALY DETECTION IN IIoT NETWORKS: A
COMPARATIVE STUDY OF CENTRALIZED MACHINE
LEARNING AND FEDERATED LEARNING APPROACHES
Submitted by
241HO830002 - Anjali Chaudhary
241H0830004 - Komal Gour
241H0830006 - Mohiba Ansari
Under Supervision of
DR. BHARTI NATHANI (Assistant Professor)
TABLE OF CONTENTS
1. Introduction
2. Literature Review
4. Objectives
5. Methodology
8. References
INTRODUCTION
In today's world, industries are becoming smarter and more connected than ever before.
Factories, power plants, and transportation systems are now part of what we call the Industrial
Internet of Things, or IIoT. This exciting development has made these industries more efficient
and productive, but it has also opened up new challenges in keeping these systems safe from
cyberattacks.
IIoT networks are quite different from the computer networks we use in offices or at home. They
typically run only a few specific programs, have a fixed setup that doesn't change much, and
show regular patterns in how data moves around. These systems also tend to follow set schedules
for their operations and have predictable stages in their processes.
These unique features of IIoT networks can actually help us spot when something unusual is
happening. It's similar to how you might notice if something in your room was moved because
you know exactly how your room usually looks.
While traditional security measures like passwords, firewalls, and encryption are important,
they're not always enough to stop all types of attacks on IIoT systems. Some clever attacks, such
as overwhelming the system with too much traffic or attacks from inside the organization, can
still slip through these defenses.
To address these challenges, experts are developing new ways to detect unusual activities in IIoT
networks. These methods take advantage of the predictable nature of IIoT systems to make them
more secure. One such approach is anomaly detection, which uses specification-based machine
learning models to identify deviations from expected behavior and thereby detect potential
cyber-attacks and operational issues.
By using these advanced techniques, we can better protect our increasingly connected industrial
world. This is crucial because as our factories, power grids, and transportation systems become
more reliant on internet connectivity, ensuring their security becomes more important than ever.
The goal is to harness the benefits of IIoT while keeping these critical systems safe from
potential cyber threats.
LITERATURE REVIEW
1. This paper reviews data anomaly detection in the Internet of Things (IoT), discussing
trends, methodologies, and challenges. It explores techniques like statistical methods,
machine learning algorithms, and deep learning. The paper addresses data heterogeneity,
scalability, real-time processing, and privacy concerns. Case studies demonstrate its
application in various industries. Future research emphasizes efficient, scalable, and
privacy-preserving anomaly detection techniques.
2. This paper reviews current trends and challenges in data anomaly detection for Internet of
Things (IoT) systems[5]. It discusses various machine learning and deep learning
techniques used for detecting anomalies in IoT data, including statistical methods,
support vector machines, random forests, and neural networks. The paper also explores
challenges like high dimensionality, scalability, and real-time processing in IoT
environments.
3. The paper proposes an optimized intrusion detection system called OIFIDS that can
effectively handle heterogeneous and streaming data in Industrial Internet of Things
(IIoT) networks. It uses an enhanced version of the Isolation Forest algorithm[3] to
improve detection accuracy, speed, and efficiency compared to existing approaches.
4. This paper describes how anomaly detection in IIoT networks has evolved from
traditional machine learning methods like SVM and XGBoost[4] to deep learning
approaches such as CNNs, LSTMs, and autoencoders. Hybrid models like CNN+GRU
capture both spatial and temporal patterns, which improves accuracy, and GANs generate
synthetic attack data for better training. XGBoost remains efficient for large datasets, but
deep learning techniques outperform traditional models in detecting complex anomalies.
Ongoing research focuses on refining hybrid architectures and optimizing computational
efficiency for secure and reliable IIoT systems.
5. The paper proposes a novel anomaly detection method that combines an autoencoder and
an isolation forest algorithm. The autoencoder learns a compact representation of the
data, while the isolation forest is used to identify anomalies in the reconstructed data.
This combination enhances the anomaly detection process, especially in high-
dimensional data, compared to using the individual algorithms. The authors demonstrate
the effectiveness of the proposed method on various real-world datasets with varying
characteristics and anomaly rates.
OBJECTIVE
1. To design a centralized machine learning model for anomaly detection in IIoT networks,
focusing on optimizing detection accuracy and analyzing data in a central repository.
2. To develop a machine learning model for anomaly detection in IIoT networks using
federated learning techniques, ensuring enhanced data privacy and improved security for
decentralized data across diverse industrial devices.
METHODOLOGY
1. DATA COLLECTION AND UNDERSTANDING: We use the WUSTL-IIOT-2021
dataset for IIoT cybersecurity research[1]. The features of the dataset are explained
below:
1. Mean Flow (mean): Average duration of active network flows, indicating traffic
behavior over time.
2. Source Port (Sport): Integer representing the originating port number of a packet
within a communication session.
3. Destination Port (Dport): Integer that specifies the endpoint port number for the
destination in a network packet.
4. Source Packets (Spkts): Integer count of packets sent from the source to the
destination during a session.
5. Destination Packets (Dpkts): Integer reflecting the number of packets received at the
destination from a source.
6. Total Packets (Tpkts): Total count of all packets (both sent and received) in a
communication session.
7. Source Bytes (Sbytes): Integer indicating the total bytes sent from the source to the
destination.
8. Destination Bytes (Dbytes): Integer representing the total bytes received at the
destination from the source.
9. Total Bytes (Tbytes): Cumulative size of all bytes sent and received during the
network communication.
10. Source Load (Sload): Average load computed from the source during the data
transmission period.
11. Destination Load (Dload): Average load computed based on the incoming data at the
destination.
12. Total Load (Tload): Overall load, combining both source and destination
traffic.
13. Source Rate (Srate): Rate at which data is sourced from the originating device to the
recipient.
14. Destination Rate (Drate): Rate of data flowing into the destination device from the
source.
15. Total Rate (Trate): Combined rate of source and destination data
transfer.
16. Source Loss (Sloss): Measures the number of lost packets originating from the source
during transmission.
17. Destination Loss (Dloss): Indicates packet loss statistics at the destination receiving
end.
18. Total Loss (Tloss): Aggregate measure indicating total packets lost from the ongoing
communication.
19. Total Percent Loss (Ploss): Percentage representation of packets lost relative to the
total packets sent.
20. Source Jitter (Sjitter): Represents variability in packet delay originating from the
source.
21. Destination Jitter (Djitter): Measures variability in packets upon arrival at the
destination.
22. Source Interpacket (SntPkt): Time gap between successive packets sent from the
source.
23. Destination Interpacket (DntPkt): Time interval observed between packets as they
arrive at the destination.
24. Protocol (Proto): Character indicating the networking protocol used in the data
transmission.
25. Duration (Dur): Integer defining the total time duration of a particular record.
26. TCP RTT (TcpRtt): Round-trip time for TCP connections, showing latency in
communication.
27. Idle Time (Idle): Duration for which the network connection remains inactive or idle.
28. Sum (sum): Total accumulated duration across all aggregated records in the dataset.
29. Min (min): Minimum recorded duration of data packets within the dataset.
30. Max (max): Maximum recorded duration of entries within the aggregate dataset.
31. Source Diff Serve Byte (DSb): Integer giving the Differentiated Services byte value
set by the source.
32. Source TTL (sTtl): Time-to-live value specifying packet lifespan for the source in
communication.
33. Destination TTL (dTtl): Time-to-live value dictating how long packets are valid upon
reaching the destination.
34. Source App Byte (SAppBytes): Byte count representing the application data size sent
from the source.
35. Destination App Byte (DAppBytes): Total byte count of application data received at
the destination.
36. Total App Byte (TAppByte): Overall byte count of application data exchanged during
the interaction.
37. SYN Ack (SynAck): Indicates TCP connection establishment phase, counting SYN
and SYN-ACK packets exchanged.
38. Run Time (RunTime): Total active duration of the session from start to finish within
the dataset.
39. Source TOS (STos): Type of Service byte pertaining to the source in network traffic.
40. Source Tier (SrcTier): Tier classification signifying source's relevance or priority in
network communications.
41. Destination Tier (DstTier): Classification determining the role or priority of the
destination in communication.
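As a rough illustration of how the aggregate fields in the list relate to the per-direction ones, the sketch below derives Tpkts, Tbytes, Tloss, and Ploss from a single flow record. The record values are made up, the helper name add_totals is hypothetical, and the Ploss formula (lost packets as a percentage of all packets in the flow) is an assumption based on the descriptions above rather than on the dataset documentation.

```python
# Hypothetical flow record using the feature names from the list above.
# The values are invented for illustration and do not come from the dataset.
flow = {"Spkts": 12, "Dpkts": 8, "Sbytes": 960, "Dbytes": 512, "Sloss": 1, "Dloss": 0}


def add_totals(record):
    """Return a copy of the record with aggregate fields derived from
    the per-direction fields, following the feature descriptions."""
    out = dict(record)
    out["Tpkts"] = record["Spkts"] + record["Dpkts"]
    out["Tbytes"] = record["Sbytes"] + record["Dbytes"]
    out["Tloss"] = record["Sloss"] + record["Dloss"]
    # Assumption: Ploss is lost packets as a percentage of all packets in the flow.
    out["Ploss"] = 100.0 * out["Tloss"] / out["Tpkts"] if out["Tpkts"] else 0.0
    return out


enriched = add_totals(flow)
print(enriched["Tpkts"], enriched["Tbytes"], enriched["Ploss"])
```

A check of this kind can also be turned around: if a record's stored totals disagree with the sums of its per-direction fields, the record may be worth inspecting before training.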
Understanding the Confusion Matrix: A confusion matrix is a simple table that shows
how our model’s predictions compare to the actual results. It has four key components:
1. True Positives (TP): The model correctly detects an anomaly.
2. True Negatives (TN): The model correctly identifies normal behavior.
3. False Positives (FP): The model mistakenly flags normal data as an anomaly
(false alarm).
4. False Negatives (FN): The model fails to detect an actual anomaly.
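The four counts can be computed directly from the true and predicted labels. The following is a minimal sketch, assuming binary labels where 1 means anomaly and 0 means normal; the function names and the small label lists are illustrative only. It also derives accuracy, precision, recall, and F1 from the counts using their standard definitions.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = anomaly, 0 = normal)."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # anomaly correctly detected
        elif actual == 0 and predicted == 0:
            tn += 1  # normal traffic correctly passed
        elif actual == 0 and predicted == 1:
            fp += 1  # false alarm
        else:
            fn += 1  # missed anomaly
    return tp, tn, fp, fn


def summary_metrics(tp, tn, fp, fn):
    """Derive the usual summary metrics from the four counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative labels only, not results from our models.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
metrics = summary_metrics(tp, tn, fp, fn)
print(tp, tn, fp, fn)  # 2 4 1 1
print(metrics)
```

In anomaly detection the classes are usually imbalanced, so precision and recall tend to be more informative than accuracy alone.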
Comparison of Base Model vs. Federated Learning Model: After testing, we compare
results from both our base model (centralized training) and federated learning model to
see which one performs better. We look for:
5. DEPLOYMENT: Once our hybrid anomaly detection model (Isolation Forest + XGBoost) is
trained and tested, the next step is deploying it in real IIoT networks. The goal is to keep
industrial systems safe by spotting cyber threats in real time without slowing down operations.
The model can be installed on IIoT devices like sensors, controllers, and gateways, where it
constantly monitors network activity. If something unusual happens, it immediately sends an
alert so the issue can be investigated before it causes damage.
For the federated learning version, each device trains the model locally using its own data,
without sharing raw information. Instead of sending sensitive data to a central server, only the
model updates are shared, helping improve security and privacy. Over time, the system learns
and adapts to new threats without exposing private industrial data.
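To make the idea of sharing only model updates concrete, the sketch below shows one common aggregation rule, federated averaging, in which the server combines client weights in proportion to each client's number of local samples. This is an assumption for illustration: the text above does not fix a particular aggregation rule, and the function name, weight vectors, and sample counts are hypothetical.

```python
def federated_average(client_updates):
    """Combine client model weights into global weights.

    client_updates is a list of (weights, num_samples) pairs, where weights
    is a list of floats of the same length for every client. Each client
    contributes in proportion to its number of local training samples.
    Raw training data never appears here, only the weights.
    """
    total_samples = sum(n for _, n in client_updates)
    if total_samples == 0:
        raise ValueError("no training samples reported by any client")
    dim = len(client_updates[0][0])
    averaged = [0.0] * dim
    for weights, n in client_updates:
        if len(weights) != dim:
            raise ValueError("all clients must send weights of the same length")
        for i, w in enumerate(weights):
            averaged[i] += w * n / total_samples
    return averaged


# Two hypothetical edge devices with different amounts of local data.
updates = [
    ([0.2, 1.0], 100),  # device A: weights after local training, 100 samples
    ([0.4, 0.0], 300),  # device B: weights after local training, 300 samples
]
global_weights = federated_average(updates)
print(global_weights)  # approximately [0.35, 0.25]
```

In a full system this step would run once per training round, with the averaged weights sent back to the devices as the starting point for the next round.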
To make monitoring easy, we set up a dashboard where security teams can see real-time alerts,
review logs, and analyze threats. The system is also designed to keep improving—if its accuracy
drops, it can be retrained with new data and updated without disrupting operations. By deploying
this model, industries can detect cyber threats early, protect critical systems, and ensure smooth
and secure IIoT operations.
Matplotlib & Seaborn: For visualizing data distributions, correlations, and detecting
anomalies in patterns.
Scikit-learn: Used for feature scaling (MinMaxScaler, StandardScaler), correlation
analysis[5], and Recursive Feature Elimination (RFE) to keep only the most important
features.
Autoencoders[4][6] (TensorFlow): To encode categorical data into numerical form for
better model learning.
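The two scaling methods named above can be written out by hand to show what they compute. This is a plain-Python sketch of the underlying formulas, not the Scikit-learn implementation: min-max scaling maps a feature to the range [0, 1], and standardization subtracts the mean and divides by the population standard deviation (the convention StandardScaler uses). The function names and sample values are illustrative.

```python
import math


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Illustrative values for a single numeric feature.
values = [0.0, 5.0, 10.0]
scaled = min_max_scale(values)
z_scores = standardize(values)
print(scaled)    # [0.0, 0.5, 1.0]
print(z_scores)  # roughly [-1.22, 0.0, 1.22]
```

Whichever method is used, the scaling parameters should be computed on the training data only and then reused on the test data, so that no information from the test set leaks into training.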
3. Model Selection and Training:
We use a hybrid machine learning approach combining Isolation Forest (for anomaly detection)
and XGBoost (for classification), along with a federated learning setup. The key tools include:
Scikit-learn: Used for implementing Isolation Forest to identify anomalies in network
traffic.
XGBoost: A high-performance gradient boosting library that helps classify normal and
anomalous data with high accuracy.
Federated Learning[2] Framework (TensorFlow Federated): Used for decentralized
model training, ensuring security and privacy by keeping raw IIoT data on edge devices,
with the help of the Flower framework.
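One possible way to wire the two models together is sketched below. Several things here are assumptions made for illustration: the data is synthetic rather than WUSTL-IIOT-2021, the Isolation Forest score is appended as an extra input feature (only one of several ways to combine the models), and Scikit-learn's GradientBoostingClassifier stands in for XGBoost so that the sketch depends on NumPy and Scikit-learn alone.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, IsolationForest
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for flow features: 950 normal rows, 50 shifted "attack" rows.
normal = rng.normal(loc=0.0, scale=1.0, size=(950, 4))
attack = rng.normal(loc=4.0, scale=1.0, size=(50, 4))
X = np.vstack([normal, attack])
y = np.concatenate([np.zeros(950, dtype=int), np.ones(50, dtype=int)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Stage 1: unsupervised Isolation Forest fitted on the training features only.
iso = IsolationForest(n_estimators=100, random_state=0).fit(X_train)


def with_anomaly_score(features):
    """Append the Isolation Forest score (lower = more anomalous) as a column."""
    scores = iso.score_samples(features).reshape(-1, 1)
    return np.hstack([features, scores])


# Stage 2: supervised boosted-tree classifier on features plus anomaly score.
# GradientBoostingClassifier is a stand-in for XGBoost in this sketch.
clf = GradientBoostingClassifier(random_state=0)
clf.fit(with_anomaly_score(X_train), y_train)

y_pred = clf.predict(with_anomaly_score(X_test))
accuracy = float((y_pred == y_test).mean())
print(f"held-out accuracy on synthetic data: {accuracy:.3f}")
```

Because xgboost.XGBClassifier follows the same fit and predict interface, it could be substituted for the stand-in classifier without changing the rest of the sketch.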
5. Deployment:
For deploying the anomaly detection model, we use tools that allow efficient testing, evaluation,
and comparison of our centralized and federated learning approaches on the collected dataset.
1. Centralized Model Deployment (Offline Testing)
Since the model is trained on a fixed dataset, deployment involves running the trained model on
test data to evaluate its effectiveness. The tools used include:
Scikit-learn & XGBoost: Used for model training and inference on batch data.
Anomaly Detection with Centralized Models: A hybrid XGBoost and Isolation Forest model
trained centrally is expected to identify anomalies in IIoT networks with high detection
accuracy.
Improved Security and Data Privacy with Federated Learning: The federated learning
implementation is expected to improve data security and privacy. Because training is
decentralized and raw data stays on the devices, it can detect anomalies effectively without the
risk of exposing sensitive data.
Performance Evaluation of Both Approaches: A detailed analysis of the two models will discuss
their advantages and limitations. Although we expect the centralized model to achieve relatively
higher accuracy due to access to the full data, we expect federated learning to offer better data
security and the ability to scale for real IIoT applications. We will also assess efficiency, detection
speed, and false-positive rates to determine which of the two approaches, centralized or
federated, is better suited to a given application.
REFERENCES
1. Zolanvari, M., Teixeira, M. A., Gupta, L., Khan, K. M., & Jain, R. (2019). Machine
learning-based network vulnerability analysis of industrial Internet of Things. IEEE
Internet of Things Journal, 6(4), 6822-6834.
2. Yang, M., & Zhang, J. (2023). Data anomaly detection in the internet of things: A review
of current trends and research challenges. International Journal of Advanced Computer
Science and Applications, 14(9).
3. Elsaid, S. A., & Binbusayyis, A. (2024). An optimized isolation forest based intrusion
detection system for heterogeneous and streaming data in the industrial Internet of Things
(IIoT) networks. Discover Applied Sciences, 6(9), 483.
4. Konatham, B. R. (2023). A secure and efficient IIoT anomaly detection approach using a
hybrid deep learning technique.
5. Jadidi, Z., Pal, S., Hussain, M., & Nguyen Thanh, K. (2023). Correlation-based anomaly
detection in industrial control systems. Sensors, 23(3), 1561.
6. Almansoori, M. K., & Telek, M. Anomaly Detection using combination of Autoencoder
and Isolation Forest.