The Internet of Things (IoT) comprises a network of interconnected objects that gather and exchange data with other devices and systems over the Internet. These objects range from everyday items, such as smartwatches and household appliances, to vehicles and industrial machinery [1]. IoT aims to improve productivity and connectivity by automating and enhancing numerous processes and tasks. It is also seen as a technology poised to revolutionize domains such as smart homes, advanced healthcare, efficient transportation, and precision agriculture. Within IoT, diverse physical objects can collaborate and communicate autonomously, transmitting data across various networks without human intervention. IoT devices have outpaced non-IoT devices on a global scale: in 2022, there were 21.7 billion active connections, of which IoT devices represented 54% [2]. In addition, a study by Cisco estimates that 500 billion devices will have access to the Internet by 2030 [3].

This exponential expansion of interconnected devices brings both enormous opportunities and significant challenges. IoT devices exhibit various vulnerabilities, such as weak passwords and inadequate encryption, and attackers often exploit these weaknesses to launch cyber-attacks [4]. As more devices become interconnected, their widespread use in many areas of our lives has made them attractive targets, and attackers are constantly exploring new ways to launch advanced attacks. In 2025, more than 25% of all attacks are projected to target IoT devices [5].

As attackers increasingly focus on IoT devices, efficient intrusion detection mechanisms are essential. Deploying intrusion detection systems (IDSs) in IoT has become imperative due to the growing number of interconnected devices. The main purpose of an IDS is to examine network traffic and detect cyber-attacks. By analyzing data collected from different IoT devices, an IDS can identify anomalous activity or unauthorized access, thereby securing the network and its connected devices against cyber-attacks [6]. A significant obstacle to employing IDSs in IoT is the heterogeneity of IoT devices and the substantial volume of data they produce. Conventional IDSs may struggle to manage this volume of data and the diverse range of communication protocols used in IoT. Advances in machine learning (ML) have facilitated the development of more capable IDSs that can adapt to the ever-changing IoT landscape. ML techniques have been introduced as promising methods for real-time detection of various attack types in IoT environments, offering the advantage of adaptability to evolving attack strategies and patterns.

In the context of IDSs, real-time operation means continuously monitoring and analyzing IoT network traffic and device behavior to detect unusual or malicious activity as it occurs. This is essential in IoT for timely response and adaptability to dynamic environments. Real-time intrusion detection allows a fast response to security attacks, reducing the potential damage they cause [7]. Classical IDSs, by contrast, may experience latency due to batch processing of data, have limited adaptability, rely on predefined rules, and employ static detection mechanisms, resulting in delayed responses and reduced effectiveness in detecting emerging threats. Real-time IDSs are being developed to address these limitations and provide faster, more adaptive detection capabilities for IoT.

Several approaches can operate in real time for effective attack detection. For instance, anomaly detection establishes baselines of normal behavior by analyzing historical data or observing typical patterns in network traffic and device behavior [8]. This method continuously monitors activity and flags deviations as potential intrusions or threats, though it requires accurate modeling to minimize false positives. Another approach is signature-based detection, which swiftly identifies known attack patterns by comparing incoming traffic against signatures in a database [9]; it offers quick and accurate detection but may struggle with novel or zero-day attacks [10]. Behavioral analysis techniques can also be used to scrutinize behavior in real time and detect anomalies that may indicate security breaches. They adapt well to evolving attack strategies but require sophisticated algorithms and accurate profiling of normal behavior.

This paper presents a novel approach to real-time intrusion detection in the context of IoT using the Hadoop Spark framework. We use the IoT-23 dataset, which contains IoT network traffic from different types of IoT devices. Our IDS employs PySpark, the Python interface for Apache Spark, for efficient data processing and analysis. Central to our methodology is the One-vs-Rest (OVR) multiclass classification technique, which enables the accurate categorization of network traffic and the detection of various attack types in real time. The OVR approach decomposes the multiclass problem into several binary classification tasks, one per class, allowing tailored optimization for each attack type and enhancing overall detection accuracy. This decomposition also simplifies the classification process, making it easier to manage and interpret. Additionally, we integrate Spark Streaming to facilitate real-time data processing, ensuring that the IDS can handle continuous streams of network data and detect threats in a timely manner. Furthermore, our approach incorporates the synthetic minority oversampling technique (SMOTE) to address class imbalance and SelectKBest for feature selection, enhancing the accuracy and reliability of our models. We implement and evaluate multiple machine learning algorithms supported by PySpark, namely decision trees (DT), random forest (RF), logistic regression (LR), and extreme gradient boosting (XGB), and compare their performance in terms of accuracy and response time. To assess the effectiveness of our IDS, we employ standard evaluation metrics including precision, recall, and F1 score.
Our results demonstrate a detection accuracy of up to 98.89% and minimal prediction latency, validating the efficacy of our real-time approach in IoT environments.
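To make the One-vs-Rest decomposition and the evaluation metrics mentioned above concrete, the following is a minimal pure-Python sketch. It is not the PySpark pipeline used in this paper. The two-feature toy samples, the class names (`benign`, `ddos`, `portscan`), the helper function names, and the learning-rate and epoch settings are all illustrative assumptions, and a small hand-written logistic regression stands in for the base classifier.

```python
import math

def train_binary_lr(X, y, lr=1.0, epochs=1000):
    """Fit a binary logistic regression with batch gradient descent.
    X is a list of feature lists; y is a list of 0/1 labels."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted prob - label
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def score(model, x):
    """Raw (pre-sigmoid) score of one binary model for sample x."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_ovr(X, labels):
    """One-vs-Rest: train one binary model per class (class vs. all others)."""
    models = {}
    for cls in sorted(set(labels)):
        binary = [1 if lab == cls else 0 for lab in labels]
        models[cls] = train_binary_lr(X, binary)
    return models

def predict_ovr(models, x):
    """Predict the class whose binary model gives the highest score."""
    return max(models, key=lambda cls: score(models[cls], x))

def per_class_metrics(y_true, y_pred, cls):
    """Precision, recall, and F1 for a single class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical, well-separated toy samples (two made-up traffic features).
X = [[0.1, 0.1], [0.2, 0.1], [0.1, 0.2],   # benign
     [0.9, 0.1], [0.8, 0.2], [0.9, 0.2],   # ddos
     [0.1, 0.9], [0.2, 0.8], [0.2, 0.9]]   # portscan
labels = ["benign"] * 3 + ["ddos"] * 3 + ["portscan"] * 3

models = train_ovr(X, labels)
predictions = [predict_ovr(models, x) for x in X]
for cls in sorted(models):
    print(cls, per_class_metrics(labels, predictions, cls))
```

Each class gets its own binary model, so a model can in principle be tuned per attack type, and the final label is the class whose model is most confident. In a Spark setting, the same idea is available through the `OneVsRest` estimator in `pyspark.ml.classification`, which wraps a base binary classifier.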
The remainder of this paper is organized as follows. Section 2 reviews the related literature. Section 3 provides an overview of Apache Spark, the machine learning methods, OVR, and the PySpark architecture. Section 4 describes our proposed scheme. Section 5 presents the results and analysis. Section 6 concludes the paper and outlines future work.